# Chapter 2 - Data Preparation Basics
## Segment 2 - Treating missing values

### Figuring out what data is missing
Now I'm going to show you how to figure out what data is missing from your data set. NumPy offers a special value called nan. It's meant to be used to represent missing values in a data set.
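
Here's a minimal sketch of how you could build a Series with two missing entries and then test for them. The variable names `series_obj` and `missing_mask` are just illustrative choices, and I'm assuming pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# np.nan stands in for the two missing entries
series_obj = pd.Series(['row 1', 'row 2', np.nan, 'row 4',
                        'row 5', 'row 6', np.nan, 'row 8'])

# isnull() returns a boolean Series: True wherever a value is missing
missing_mask = series_obj.isnull()

print(series_obj)
print(missing_mask)
```

The first print shows the Series itself, and the second shows `True` at positions 2 and 6, where the values are missing.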

0    row 1
1    row 2
2      NaN
3    row 4
4    row 5
5    row 6
6      NaN
7    row 8
dtype: object

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool

### Filling in for missing values

Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,-0.232312
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,-1.049942
3,1.05661,-0.419678,2.294842,-2.594487,2.822756,0.680889
4,-1.577693,-1.976254,0.53334,-0.29087,-0.51352,1.982626
5,0.226001,-1.839905,1.607671,0.388292,0.399732,0.405477


Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,
3,,-0.419678,2.294842,-2.594487,2.822756,
4,,-1.976254,0.53334,-0.29087,-0.51352,
5,,-1.839905,1.607671,0.388292,0.399732,0.405477
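
One way you could produce a DataFrame with this pattern of gaps is sketched here. It's an assumption about the setup, not a record of the exact code: the name `df_obj` and the random seed are made up for illustration, so the numbers you get will differ from the tables in this section.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily; the values will not match the tables shown here
rng = np.random.default_rng(25)
df_obj = pd.DataFrame(rng.standard_normal((6, 6)))

# .loc slices by label and includes both endpoints:
# rows 3-5 of column 0 and rows 1-4 of column 5 become missing
df_obj.loc[3:5, 0] = np.nan
df_obj.loc[1:4, 5] = np.nan

print(df_obj)
```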


Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,0.0
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,0.0
3,0.0,-0.419678,2.294842,-2.594487,2.822756,0.0
4,0.0,-1.976254,0.53334,-0.29087,-0.51352,0.0
5,0.0,-1.839905,1.607671,0.388292,0.399732,0.405477


Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,1.25
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,1.25
3,0.1,-0.419678,2.294842,-2.594487,2.822756,1.25
4,0.1,-1.976254,0.53334,-0.29087,-0.51352,1.25
5,0.1,-1.839905,1.607671,0.388292,0.399732,0.405477


You may be wondering how this could be useful. Well, imagine you have a predictive application that requires you to input data from four variables. Three of your variables are great: you've got all the information you need. But one of them has lots of missing values.

You still need to input data from that variable. So what you could do is set those missing values equal to an approximation in order to make your predictive application work.
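
The filled-in tables in this section are consistent with three common approaches in pandas: one constant for every gap, a different constant per column, and a forward fill. Here's a hedged sketch of all three. To keep it short, I'm using only columns 0 and 5 from the table with missing values, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Columns 0 and 5 of the frame with missing values, gaps included
df_obj = pd.DataFrame({
    0: [0.228273, -0.619915, 2.152957, np.nan, np.nan, np.nan],
    5: [-0.222326, np.nan, np.nan, np.nan, np.nan, 0.405477],
})

# 1) Replace every missing value with the same constant
filled_zero = df_obj.fillna(0)

# 2) Replace missing values with a different constant per column,
#    keyed by column label
filled_per_column = df_obj.fillna({0: 0.1, 5: 1.25})

# 3) Forward fill: carry the last observed value in each column downward
filled_forward = df_obj.ffill()

print(filled_zero)
print(filled_per_column)
print(filled_forward)
```

Each call returns a new DataFrame and leaves `df_obj` untouched, so the missing values are still there if you want to try another approach.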


Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,-0.222326
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,-0.222326
3,2.152957,-0.419678,2.294842,-2.594487,2.822756,-0.222326
4,2.152957,-1.976254,0.53334,-0.29087,-0.51352,-0.222326
5,2.152957,-1.839905,1.607671,0.388292,0.399732,0.405477


### Counting missing values
Now I want to show you how to count missing values. But before that, I want to explain how this could be useful. Sometimes you just want to create summary statistics for your data set in order to understand what you've got in there.

You can count missing values in order to figure out which variables are most problematic. In other words, which variables in your data set have the most missing values?
 
What I want to do for this demonstration is reuse the DataFrame we created earlier, the one with the missing values still present. We need to recreate it because we've since filled those values in. To do that, we'll...

Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,
3,,-0.419678,2.294842,-2.594487,2.822756,
4,,-1.976254,0.53334,-0.29087,-0.51352,
5,,-1.839905,1.607671,0.388292,0.399732,0.405477


0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
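
Per-column counts like these can be obtained by chaining `isnull()` with `sum()`. This is a minimal sketch under the assumption that the frame has the same gap pattern as the table above. The name `df_obj` and the seed are illustrative, and the random values themselves don't matter for the counts.

```python
import numpy as np
import pandas as pd

# Rebuild a 6x6 frame with the same pattern of gaps (values are arbitrary)
rng = np.random.default_rng(0)
df_obj = pd.DataFrame(rng.standard_normal((6, 6)))
df_obj.loc[3:5, 0] = np.nan
df_obj.loc[1:4, 5] = np.nan

# isnull() gives True/False per cell; sum() counts the True values per column
missing_per_column = df_obj.isnull().sum()

# axis=1 counts across each row instead
missing_per_row = df_obj.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```

The per-row version is handy when you want to spot records that are too incomplete to be worth keeping.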

### Filtering out missing values

Unnamed: 0,1,2,3,4
0,1.02689,-0.839585,-0.591182,-0.956888
1,1.837905,-2.053231,0.868583,-0.920734
2,-1.334661,0.07638,-1.246089,1.202272
3,-0.419678,2.294842,-2.594487,2.822756
4,-1.976254,0.53334,-0.29087,-0.51352
5,-1.839905,1.607671,0.388292,0.399732


But that's sort of excessive, actually, right? Because you lost most of the data from the data set when you did that.
 
Next, I'm going to show you how to apply this method in a different way to avoid that problem. But first I want to mention one thing. If you want to drop columns that contain any missing values instead of rows, you just pass in the axis=1 argument so that the method works on the DataFrame by column instead of by row.
We can actually do that right here...
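
Here's a hedged sketch of the dropping options, using pandas' `dropna`. The frame is rebuilt with the same gap pattern as before (the name `df_obj` and the seed are illustrative). The `how='all'` call is my assumption about what a less aggressive variant could look like, not necessarily the one used in this demonstration.

```python
import numpy as np
import pandas as pd

# 6x6 frame with gaps in column 0 (rows 3-5) and column 5 (rows 1-4)
rng = np.random.default_rng(0)
df_obj = pd.DataFrame(rng.standard_normal((6, 6)))
df_obj.loc[3:5, 0] = np.nan
df_obj.loc[1:4, 5] = np.nan

# Default: drop every row that contains at least one missing value
rows_dropped = df_obj.dropna()

# axis=1: drop every column that contains at least one missing value
cols_dropped = df_obj.dropna(axis=1)

# how='all': drop a row only if every value in it is missing
all_missing_dropped = df_obj.dropna(how='all')

print(rows_dropped.shape, cols_dropped.shape, all_missing_dropped.shape)
```

With this gap pattern, dropping by row keeps only row 0, dropping by column keeps columns 1 through 4, and `how='all'` keeps everything because no row is entirely missing.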


Unnamed: 0,1,2,3,4
0,1.02689,-0.839585,-0.591182,-0.956888
1,1.837905,-2.053231,0.868583,-0.920734
2,-1.334661,0.07638,-1.246089,1.202272
3,-0.419678,2.294842,-2.594487,2.822756
4,-1.976254,0.53334,-0.29087,-0.51352
5,-1.839905,1.607671,0.388292,0.399732


Unnamed: 0,0,1,2,3,4,5
0,0.228273,1.02689,-0.839585,-0.591182,-0.956888,-0.222326
1,-0.619915,1.837905,-2.053231,0.868583,-0.920734,
2,2.152957,-1.334661,0.07638,-1.246089,1.202272,
3,,-0.419678,2.294842,-2.594487,2.822756,
4,,-1.976254,0.53334,-0.29087,-0.51352,
5,,-1.839905,1.607671,0.388292,0.399732,0.405477


And I just want to leave you with a cautionary note about dropping rows and columns from your data sets. Every time you drop rows or columns from your data set, you're losing information, and sometimes a lot of it. So the better method is usually to use approximations to fill in those missing values rather than drop them completely. Let's move on now to removing duplicates.