## Identify Missing Data
Values can be missing from a dataset originally or become missing as a product of data manipulation. In pandas, missing values are typically represented as `NaN` or `None`.

**Missing data** can: 
* Hint at data collection errors.
* Indicate improper conversion or manipulation.
* Not actually be recorded as missing. Some datasets list missing data as a placeholder such as "zero", "false", "not applicable", or an empty string.

**Missing data** may be listed as:
- Zero
- False
- Not applicable
- An empty string
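
As a minimal sketch of handling such placeholders, the example below builds a small made-up DataFrame (the `survey` name, its columns, and its values are hypothetical, not taken from the car financing file) and converts placeholder strings to `NaN` with `replace` so that `isna` can detect them. Whether a zero or `False` really means "missing" depends on the dataset, so those are left alone here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: placeholder strings standing in for missing values
survey = pd.DataFrame({
    'age': ['34', 'not applicable', '27', ''],
    'city': ['Austin', '', 'Denver', 'Boston'],
})

# Treat these strings as missing (an assumption about this made-up data)
placeholders = ['', 'not applicable']
cleaned = survey.replace(placeholders, np.nan)

# Count the missing values per column after the conversion
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Before the conversion, `isna` would report no missing values in either column, because an empty string is an ordinary value as far as pandas is concerned.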

This is an important subject because ``before you graph data, you should make sure you aren't trying to graph missing values, as that can cause an error or a misinterpretation of the data``.

In [1]:
# Import libraries
import pandas as pd
import numpy as np


In [2]:
# Load Excel File
filename = 'car_financing_filter.xlsx'
df = pd.read_excel(filename)

### Finding Missing Values

In the code below, `df.info()` shows that `interest_paid` has 59 non-null values out of 60 entries, so one value is missing. To locate it, I use the `isna` method on the pandas Series `interest_paid`. This produces a pandas Series of True and False values: True where there is a NaN value, and False where there isn't.


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   month             60 non-null     int64  
 1   starting_balance  60 non-null     float64
 2   interest_paid     59 non-null     float64
 3   principal_paid    60 non-null     float64
 4   new_balance       60 non-null     float64
 5   interest_rate     60 non-null     float64
 6   car_type          60 non-null     object 
dtypes: float64(5), int64(1), object(1)
memory usage: 3.4+ KB
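
The non-null counts in the `info` output can be turned into missing-value counts by subtracting them from the number of rows. Here is a minimal sketch on a made-up three-row DataFrame (the `loans` name and its values are hypothetical stand-ins for the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row loan table with one missing interest value
loans = pd.DataFrame({
    'month': [1, 2, 3],
    'interest_paid': [202.93, np.nan, 197.25],
})

# count() returns the number of non-null values per column,
# so the number of rows minus count() is the number of missing values
missing_counts = len(loans) - loans.count()
print(missing_counts)
```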


Two common methods to indicate where values in a DataFrame are missing are `isna` and `isnull`. They are exactly the same methods, but with different names.

The reason both exist is that in the R language, NA and null are two different things. Having both names makes things easier for R programmers working with Python. I tend to prefer `isna` because its name is similar to other pandas methods such as `dropna` and `fillna`.
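
A quick way to confirm that the two names behave identically is to compare their results. This sketch uses a small made-up Series (the `payments` name and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
payments = pd.Series([202.93, np.nan, 197.25], name='interest_paid')

# Both methods return the same Boolean Series
same_result = payments.isna().equals(payments.isnull())
print(same_result)
```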

In [4]:
# Notice we have a Pandas Series of True and False values
df['interest_paid'].isna().head()

0    False
1    False
2    False
3    False
4    False
Name: interest_paid, dtype: bool

Next, I assign this True and False filter to the variable `interest_missing`. I do this because I want to use the filter to isolate my missing data: passing it to `df.loc` returns only the row where the value is missing.

The result shows a missing value in the `interest_paid` column. This will be a problem later.


In [5]:
interest_missing = df['interest_paid'].isna()

In [6]:
# Looks at the row that contains the NaN for interest_paid
df.loc[interest_missing,:]

| | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
|---|---|---|---|---|---|---|---|
| 35 | 36 | 15940.06 | NaN | 593.99 | 15346.07 | 0.0702 | Toyota Sienna |


Keep in mind that you can also use the not operator (`~`) to negate the filter, so that every row returned doesn't have a NaN. As you can see in the pandas DataFrame, the row with index 35 is no longer there.


In [7]:
# Keep in mind that we can use the not operator (~) to negate the filter
# every row that doesn't have a nan is returned.
df.loc[~interest_missing,:]

| | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 34689.96 | 202.93 | 484.3 | 34205.66 | 0.0702 | Toyota Sienna |
| 1 | 2 | 34205.66 | 200.1 | 487.13 | 33718.53 | 0.0702 | Toyota Sienna |
| 2 | 3 | 33718.53 | 197.25 | 489.98 | 33228.55 | 0.0702 | Toyota Sienna |
| 3 | 4 | 33228.55 | 194.38 | 492.85 | 32735.7 | 0.0702 | Toyota Sienna |
| 4 | 5 | 32735.7 | 191.5 | 495.73 | 32239.97 | 0.0702 | Toyota Sienna |
| 5 | 6 | 32239.97 | 188.6 | 498.63 | 31741.34 | 0.0702 | Toyota Sienna |
| 6 | 7 | 31741.34 | 185.68 | 501.55 | 31239.79 | 0.0702 | Toyota Sienna |
| 7 | 8 | 31239.79 | 182.75 | 504.48 | 30735.31 | 0.0702 | Toyota Sienna |
| 8 | 9 | 30735.31 | 179.8 | 507.43 | 30227.88 | 0.0702 | Toyota Sienna |
| 9 | 10 | 30227.88 | 176.83 | 510.4 | 29717.48 | 0.0702 | Toyota Sienna |
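
The filter and its negation split the rows into two groups that do not overlap. The sketch below shows this on a made-up three-row DataFrame (the `loans` name and values are hypothetical). It also compares the negated filter with `dropna(subset=...)`, an alternative not used in this notebook, which keeps the same rows here.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row loan table with one missing interest value
loans = pd.DataFrame({
    'month': [1, 2, 3],
    'interest_paid': [202.93, np.nan, 197.25],
})

# Boolean filter: True where interest_paid is NaN
interest_missing = loans['interest_paid'].isna()

rows_missing = loans.loc[interest_missing, :]   # rows with a NaN
rows_present = loans.loc[~interest_missing, :]  # rows without a NaN

# Every row lands in exactly one of the two groups
print(len(rows_missing), len(rows_present))

# dropna on the same column keeps the same rows as the negated filter
same_rows = rows_present.equals(loans.dropna(subset=['interest_paid']))
print(same_rows)
```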


You'll often see code similar to the cell below. It takes the same pandas filter of True and False values and then applies the aggregate function `sum`, which adds up all the True and False values to produce a result. This works because in Python, Booleans are a subtype of integers, where True is 1 and False is 0.


In [8]:
# The code counts the number of missing values
# sum() works because Booleans are a subtype of integers. 
df['interest_paid'].isna().sum()

np.int64(1)

In [9]:
True + False + False 

1
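
The same idea extends from counting to proportions: since True counts as 1 and False as 0, the mean of the filter is the fraction of values that are missing. A minimal sketch on a made-up Series (the `payments` name and values are hypothetical):

```python
import numpy as np
import pandas as pd

# In plain Python, bool is a subclass of int
print(issubclass(bool, int))

# Hypothetical Series with two missing values out of four
payments = pd.Series([202.93, np.nan, 197.25, np.nan])
mask = payments.isna()

missing_count = int(mask.sum())     # number of True values
missing_share = float(mask.mean())  # fraction of True values
print(missing_count, missing_share)
```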

``When working with a dataset, it's important to identify your missing values, as they can lead you to misinterpret the data or even raise an error when you try to graph it.``
