## Dealing with missing data

Most real-world data contains missing values, which can cause problems during analysis. In this notebook we'll go through how to deal with missing data.



In [None]:
!pip install pandas matplotlib # install the packages directly into the notebook




Make sure the libraries are loaded and we have the data we need.


In [None]:
# Now import pandas into your notebook as pd
import pandas as pd

# Use what we learnt previously to download the file again
import urllib.request # this is the library we need 


url = 'https://monashdatafluency.github.io/python-workshop-base/modules/data/surveys.csv'
# This is getting the file
urllib.request.urlretrieve(url, 'surveys.csv')
surveys_df = pd.read_csv("surveys.csv")
surveys_df

Unnamed: 0,record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
...,...,...,...,...,...,...,...,...,...
35544,35545,12,31,2002,15,AH,,,
35545,35546,12,31,2002,15,AH,,,
35546,35547,12,31,2002,10,RM,F,15.0,14.0
35547,35548,12,31,2002,7,DO,M,36.0,51.0


## Using masks to identify a specific condition


A mask is useful for locating where a particular subset of values exists or doesn't exist, for example NaN ("Not a Number") values. To understand masks, we first need to understand Boolean objects in Python.

A Boolean value is either `True` or `False`. For example,



In [None]:
# set value of x to be 5
x = 5
x > 5 # Check whether x > 5 


False

In [None]:
x == 5 # Check whether x is equal to 5


True
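
The same comparison works element by element on a pandas Series, which is where masks come from. The sketch below uses a small made-up Series (the values are hypothetical and not taken from `surveys.csv`) to show how a comparison produces a Boolean mask and how that mask selects values.

```python
import pandas as pd

# Hypothetical toy data, not taken from surveys.csv
weights = pd.Series([40.0, 48.0, 29.0, 52.0])

# Comparing a Series with a value gives one Boolean per element: the mask
mask = weights > 45
print(mask.tolist())   # [False, True, False, True]

# Indexing with the mask keeps only the elements where the mask is True
heavy = weights[mask]
print(heavy.tolist())  # [48.0, 52.0]
```

The mask has the same length as the original Series, so it can be reused to select matching rows from a DataFrame that shares the same index.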


## Finding Missing Values
Let's identify all locations in the survey data that have null (missing or NaN) data values. 


We can use `isnull` to do this. It checks every cell for a null value: where a cell is null, the corresponding element of the output is `True`; otherwise it is `False`.

In [None]:
pd.isnull(surveys_df).head()


Unnamed: 0,record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
0,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,True
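
A mask like this can also be summarised. As a minimal sketch, assuming a small made-up DataFrame (the column values here are hypothetical), summing the mask counts the missing values in each column, because `True` counts as 1 and `False` as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from surveys.csv
toy = pd.DataFrame({
    "species_id": ["NL", "DM", None, "DM"],
    "weight": [np.nan, 40.0, np.nan, 48.0],
})

# Boolean mask with the same shape as the DataFrame
null_mask = toy.isnull()

# Summing a Boolean column counts its True values,
# so this gives the number of missing values per column
missing_per_column = null_mask.sum()
print(missing_per_column)
```

The same call on the survey data would be `surveys_df.isnull().sum()`.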


## How to select rows with missing data

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:



In [None]:
# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

Unnamed: 0,record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
...,...,...,...,...,...,...,...,...,...
35530,35531,12,31,2002,13,PB,F,27.0,
35543,35544,12,31,2002,15,US,,,
35544,35545,12,31,2002,15,AH,,,
35545,35546,12,31,2002,15,AH,,,


### Explanation
Notice that we have 4873 observations/rows that contain one or more missing values. That's roughly **14%** of the data.

We have used the `[]` convention to select a subset of the data.

More information about slicing and indexing can be found here.

`axis=1` follows the numpy axis convention: it tells `any()` to work across the columns, so we get one value per row, which is `True` if any column in that row is null.
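
To make the `axis` argument concrete, here is a small sketch on a made-up DataFrame (hypothetical values) comparing `any(axis=1)` with `any(axis=0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from surveys.csv
toy = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["M", None, "F"],
    "weight": [np.nan, 40.0, 48.0],
})
mask = toy.isnull()

# axis=1: look across the columns, giving one Boolean per row
rows_with_missing = mask.any(axis=1)
print(rows_with_missing.tolist())  # [True, True, False]

# axis=0: look down the rows, giving one Boolean per column
cols_with_missing = mask.any(axis=0)
print(cols_with_missing.tolist())  # [False, True, True]

# Using the row mask as an index keeps only the rows with a missing value
print(toy[rows_with_missing])
```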

Note that the weight column of our DataFrame contains many null or NaN values. Next, we will explore ways of dealing with this.

If we look at the weight column in the surveys data we notice that there are NaN (Not a Number) values. NaN values are undefined values that cannot be represented mathematically. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, pandas would skip over those cells by default.
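
The skipping behaviour can be checked on a tiny made-up Series (hypothetical values). This sketch relies on the `skipna` parameter of `mean()`, which defaults to `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
weights = pd.Series([40.0, np.nan, 50.0])

# By default NaN is skipped, so the mean uses only 40.0 and 50.0
print(weights.mean())              # 45.0

# With skipna=False the NaN is included and the result is NaN
print(weights.mean(skipna=False))  # nan

# count() ignores NaN, len() does not
print(weights.count())             # 2
print(len(weights))                # 3
```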






## Dealing with missing values

### Breakout rooms!

Thoughts: 

1. What are some reasons why there might be missing data?
2. How would you deal with missing values?
3. Is it OK to ignore missing values when calculating the mean?
4. What effect do missing values have when you multiply two columns? (Either test this out or think about what would happen.)

## Where Are the NaNs?

Let's explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., meaningful weight values). Because any comparison with NaN evaluates to `False`, this subset also leaves out the rows with missing weights:



In [None]:
# How many missing values are there in the weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])

In [None]:
# How many rows have weight values?
len(surveys_df[surveys_df.weight > 0])

We can replace all NaN values with zeroes using the `.fillna()` method (after making a copy of the data so we don't lose our work):

In [None]:
# Create a new DataFrame using copy
df1 = surveys_df.copy()

# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

print(surveys_df['weight'].mean())
print(df1['weight'].mean())

42.672428212991356
38.751976145601844
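
The mean drops because the filled zeroes add nothing to the sum but do increase the number of values being averaged. A minimal sketch on made-up numbers (hypothetical, chosen so the arithmetic is easy to follow):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
weights = pd.Series([40.0, np.nan, 50.0])

# NaN is skipped: (40 + 50) / 2
print(weights.mean())  # 45.0

# After filling with 0 the zero is counted: (40 + 0 + 50) / 3
filled = weights.fillna(0)
print(filled.mean())   # 30.0

# fillna returns a new Series, so the original still has its NaN
print(weights.isnull().sum())  # 1
```

Filling with zero also makes a missing weight indistinguishable from a recorded weight of 0, which is worth keeping in mind when choosing a fill value.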


## Filling NaNs with better values

### Breakout rooms!

Come up with a better value to fill the missing values with than 0. Why did you choose this value?

When you fill your missing values with your new value, what is the mean of weight now?

In [None]:
# Fill all NaN values with mean
df1 = surveys_df.copy()
df1['weight'] = df1['weight'].fillna(surveys_df['weight'].mean())

print(surveys_df['weight'].mean())
print(df1['weight'].mean())

42.672428212991356
42.672428212991356
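
Filling with the mean leaves the mean unchanged, but it is not free of side effects. The sketch below, on made-up numbers (hypothetical values), shows that the spread of the data shrinks after mean filling, and includes the median as one possible alternative fill value. It is an illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one large value
weights = pd.Series([10.0, np.nan, 20.0, 90.0])

# Mean of the non-missing values: (10 + 20 + 90) / 3
mean_value = weights.mean()
mean_filled = weights.fillna(mean_value)
print(mean_value)          # 40.0
print(mean_filled.mean())  # 40.0, unchanged by the fill

# The filled value sits exactly on the mean, so the standard deviation shrinks
print(weights.std(), mean_filled.std())

# The median is less affected by the single large value
median_filled = weights.fillna(weights.median())
print(weights.median())    # 20.0
```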


## Writing Out Data to CSV

Great, so you've filled in all your missing values, and now you want to share the data with your collaborators. To do that, you can write it to a CSV file as usual.

In [None]:
# Write DataFrame to CSV
df1.to_csv('surveys_complete.csv', index=False)
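
To check what ends up in the file, one option is a quick round trip. The sketch below uses a made-up DataFrame (hypothetical values) and keeps the CSV text in memory rather than writing a file: when `to_csv` is called without a path it returns the CSV as a string.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from surveys.csv
toy = pd.DataFrame({"record_id": [1, 2, 3], "weight": [40.0, np.nan, 50.0]})
toy["weight"] = toy["weight"].fillna(toy["weight"].mean())

# With no path argument, to_csv returns the CSV text instead of writing a file;
# index=False leaves the row index out of the output
csv_text = toy.to_csv(index=False)
print(csv_text)

# Read the text back and confirm that no missing values remain
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.isnull().sum().sum())  # 0
```

If the NaN values had not been filled, they would be written as empty fields and read back as NaN.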


## Recap
What we've learned:

- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use `to_csv` to write manipulated data to a file
