<a href="https://colab.research.google.com/github/theheking/intro-to-python/blob/gh-pages/4_Missing_Filled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dealing with missing data

Most real-world data has missing values, which can cause problems during analysis. In this notebook we'll go through how to deal with missing data.



In [None]:
!pip install pandas matplotlib # install the packages directly into the notebook


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Make sure the libraries are loaded and we have the data we need.


In [None]:
# Download a file using Python's standard library
import urllib.request

# import pandas to work with tabular data
import pandas as pd

url = 'https://raw.githubusercontent.com/theheking/intro-to-python/gh-pages/docs/patient_data.csv'

# retrieve the file and save it locally as patient_data.csv
urllib.request.urlretrieve(url, 'patient_data.csv')

# read the patient data into a DataFrame
patient_df = pd.read_csv('patient_data.csv', sep=',')

### Finding Missing Values
Let's identify all locations in the patient data that have null (missing or NaN) values.

We can use the `isnull` function to do this. `isnull` checks each cell for a null value. If a cell is null, it is assigned `True` in the output object; otherwise it is assigned `False`.

In [None]:
pd.isnull(patient_df).head()


Unnamed: 0,patient_id,site_id,sex,time,year,month,day,illness,weight
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False
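
A mask like this is hard to read for a large table, so it is often summarised by counting the `True` values per column. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the real patient data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for patient_df
toy_df = pd.DataFrame({
    "sex": ["M", None, "F", "F"],
    "illness": ["fall", "mva", None, None],
    "weight": [36.0, 21.0, np.nan, 26.0],
})

# isnull() gives a True/False mask; True counts as 1 when summed,
# so sum() returns the number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```

The same call on `patient_df` would give one count per column of the patient data.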


## How to select rows with missing data

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:



In [None]:
# To select only the rows with NaN values, we can use the 'any()' method
patient_df[pd.isnull(patient_df).any(axis=1)]

Unnamed: 0,patient_id,site_id,sex,time,year,month,day,illness,weight
4,5,3,M,7:26 pm,2022.0,1.0,12.0,,36.0
13,14,8,,11:35 pm,2022.0,3.0,12.0,fall,21.0
18,19,4,,3:18 pm,2022.0,5.0,12.0,mva,20.0
19,20,11,F,5:34 pm,2022.0,5.0,12.0,,26.0
22,23,13,M,11:13 am,2022.0,6.0,12.0,faint,
...,...,...,...,...,...,...,...,...,...
1176,1177,14,M,7:49 pm,2022.0,12.0,2.0,MVA,
1177,1178,16,M,10:30 am,2022.0,11.0,16.0,diabetic ketoacidosois,
1179,1180,4,F,2:30 pm,2022.0,10.0,29.0,,39.0
1184,1185,1,F,,,,,,37.0


### Explanation
Notice that we have 444 observations/rows that contain one or more missing values. That means roughly **37%** of the rows contain missing values.

We have used the `[]` convention to select a subset of the data.

More information about slicing and indexing can be found here.

`axis=1` follows the NumPy axis convention: axis 1 runs along the columns, so `any(axis=1)` checks across the columns of each row and returns one `True`/`False` value per row.
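
To make the effect of the `axis` argument concrete, here is a small sketch comparing `any(axis=1)` with `any(axis=0)`. The DataFrame `toy_df` is made up for illustration and is not the patient data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for patient_df
toy_df = pd.DataFrame({
    "site_id": [3, 8, 4, 11],
    "illness": ["fall", None, "mva", None],
    "weight": [36.0, 21.0, np.nan, 26.0],
})

mask = toy_df.isnull()

# axis=1: look across the columns of each row -> one value per row
rows_with_nan = mask.any(axis=1)
# axis=0: look down the rows of each column -> one value per column
cols_with_nan = mask.any(axis=0)

print(rows_with_nan.tolist())  # [False, True, True, True]
print(cols_with_nan.tolist())  # [False, True, True]

# Using the row mask as an index keeps only the incomplete rows
incomplete_rows = toy_df[rows_with_nan]
print(incomplete_rows)
```

The row mask has four entries (one per row) and the column mask has three (one per column), which is why only the row mask can be used to subset rows.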

Note that the weight column of our DataFrame contains many null or NaN values. Next, we will explore ways of dealing with this.

If we look at the weight column in the patient data, we notice that there are NaN (Not a Number) values. NaN is a special floating-point value that marks a missing or undefined entry. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, pandas would skip over those cells by default.
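
The skipping behaviour can be checked on a small example. This is a sketch using a made-up Series (`weights` is an assumption for illustration); `skipna` is the pandas parameter that controls whether NaNs are ignored:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([36.0, np.nan, 20.0, 26.0])

# By default the NaN is skipped: the mean is taken over the 3 real values
print(weights.mean())

# With skipna=False the NaN propagates and the result is NaN
print(weights.mean(skipna=False))

# count() reports only the non-missing values
print(weights.count())  # 3

# NaN never compares equal to anything, including itself,
# which is why we test for it with isnull() rather than ==
print(np.nan == np.nan)  # False
```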






## Dealing with missing values

Thoughts: 

1. What are some reasons why there might be missing data?
2. How would you deal with missing values?
3. Is it OK to ignore missing values when calculating the mean?
4. What effect do missing values have when you multiply 2 columns? (Either test this out or think about what would happen.)
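
One way to test the last question is on a tiny made-up table. This is only a sketch: the `dose_per_kg` column is hypothetical and does not exist in the patient data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; dose_per_kg is an invented column for illustration
toy_df = pd.DataFrame({
    "weight": [36.0, np.nan, 20.0],
    "dose_per_kg": [2.0, 1.5, np.nan],
})

# Element-wise multiplication: any row with a NaN in either column gives NaN
product = toy_df["weight"] * toy_df["dose_per_kg"]
print(product)

# Two of the three results are missing
print(product.isnull().sum())  # 2
```

Unlike `mean()`, arithmetic between columns does not skip NaNs, so a missing value in either input produces a missing result for that row.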

In [None]:
# Where are the NaNs?

# how many missing values are there in the weight column?
len(patient_df[pd.isnull(patient_df.weight)])

217

In [None]:
# number of rows that have a weight value
# (notnull() counts every non-missing value, including any zero weights)
len(patient_df[patient_df['weight'].notnull()])

974

We can replace all NaN values with zeroes using the `.fillna()` method (after making a copy of the data so we don't lose our work). Printing the mean before and after shows how much this choice changes the result:

In [None]:
# create a new DataFrame using copy
df1 = patient_df.copy()

# fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

print(patient_df['weight'].mean())
print(df1['weight'].mean())

33.063655030800824
27.039462636439968


In [None]:
# Fill all NaN values with mean
df1 = patient_df.copy()
df1['weight'] = df1['weight'].fillna(patient_df['weight'].mean())

#check the mean 
print(patient_df['weight'].mean())
print(df1['weight'].mean())

33.063655030800824
33.063655030800824
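
Filling with the mean leaves the mean unchanged, but it is not a neutral choice. The sketch below, on a made-up Series (`weights` is an assumption for illustration), compares both fill strategies and also looks at the standard deviation:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing values
weights = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

filled_zero = weights.fillna(0)               # replace NaN with 0
filled_mean = weights.fillna(weights.mean())  # replace NaN with the mean (30.0)

# Zero-filling pulls the mean down; mean-filling preserves it
print(weights.mean(), filled_zero.mean(), filled_mean.mean())  # 30.0 18.0 30.0

# Mean-filling adds values with no spread, so the standard deviation shrinks
print(weights.std(), filled_mean.std())
```

Which strategy is appropriate depends on why the values are missing and on what the data will be used for.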


## Writing Out Data to CSV

Great, so you've filled in the missing values, and now you want to share the data with your collaborators. To do that, you can write the DataFrame to a CSV file as usual.

In [None]:
# Write DataFrame to CSV
df1.to_csv('surveys_complete.csv', index=False)
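
If any NaN values are left in a DataFrame, `to_csv` writes them as empty fields by default, and `read_csv` turns those empty fields back into NaN. The following sketch shows the round trip on a made-up DataFrame, writing to an in-memory buffer instead of a file (`toy_df` and the buffer are assumptions for illustration):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data with one missing weight
toy_df = pd.DataFrame({"patient_id": [1, 2, 3], "weight": [36.0, np.nan, 20.0]})

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)  # the row for patient 2 has an empty weight field

# Reading the text back restores the NaN
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded["weight"].isnull().sum())  # 1
```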

## Recap
What we've learned:

- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use `to_csv` to write manipulated data to a file
