## Removing or Filling in Missing Data
Before you graph or analyze data, check it for missing values. Plotting or computing with missing values can raise an error or lead you to misinterpret the data.

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Load Excel File
filename = 'car_financing_filter.xlsx'
df = pd.read_excel(filename)

We're working with the car loan dataset, and the first step is to call the `info` method. It reports the number of non-null values in each column, which tells us how many values are missing.
Every column has 60 non-null values except `interest_paid`, which has 59. This means that column contains one null value.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   month             60 non-null     int64  
 1   starting_balance  60 non-null     float64
 2   interest_paid     59 non-null     float64
 3   principal_paid    60 non-null     float64
 4   new_balance       60 non-null     float64
 5   interest_rate     60 non-null     float64
 6   car_type          60 non-null     object 
dtypes: float64(5), int64(1), object(1)
memory usage: 3.4+ KB
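
If you want the missing count directly instead of reading it off the non-null counts, one option is `isna().sum()`. The sketch below uses a small hypothetical frame (not the loan file) whose column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the column names mirror the loan dataset
toy = pd.DataFrame({
    'month': [35, 36, 37],
    'interest_paid': [96.70, np.nan, 89.77],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real DataFrame the equivalent call would be `df.isna().sum()`.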


### Remove Missing Values

There are a couple of ways to deal with missing data. The simplest is to remove the missing values, which in pandas you can do with the `dropna` method. The code below takes the rows of the DataFrame from index 30 up to, but not including, index 40 and drops every row that contains a NaN. The result has no row at index 35, because that row had a NaN in `interest_paid`.

In [5]:
# how='any' drops rows that contain at least one NaN;
# how='all' drops only rows where every value is NaN.
# Dropping rows may not be the best strategy for our dataset.
df[30:40].dropna(how='any')

| | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
|---|---|---|---|---|---|---|---|
| 30 | 31 | 18858.57 | 110.32 | 576.91 | 18281.66 | 0.0702 | Toyota Sienna |
| 31 | 32 | 18281.66 | 106.94 | 580.29 | 17701.37 | 0.0702 | Toyota Sienna |
| 32 | 33 | 17701.37 | 103.55 | 583.68 | 17117.69 | 0.0702 | Toyota Sienna |
| 33 | 34 | 17117.69 | 100.13 | 587.1 | 16530.59 | 0.0702 | Toyota Sienna |
| 34 | 35 | 16530.59 | 96.7 | 590.53 | 15940.06 | 0.0702 | Toyota Sienna |
| 36 | 37 | 15346.07 | 89.77 | 597.46 | 14748.61 | 0.0702 | Toyota Sienna |
| 37 | 38 | 14748.61 | 86.27 | 600.96 | 14147.65 | 0.0702 | Toyota Sienna |
| 38 | 39 | 14147.65 | 82.76 | 604.47 | 13543.18 | 0.0702 | Toyota Sienna |
| 39 | 40 | 13543.18 | 79.22 | 608.01 | 12935.17 | 0.0702 | Toyota Sienna |
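
To make the difference between `how='any'` and `how='all'` concrete, here is a minimal sketch on a hypothetical two-column frame. The `subset` argument, which limits the check to the named columns, is not used in the cell above and is included only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has one NaN, row 2 is all NaN
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
})

dropped_any = toy.dropna(how='any')        # drops rows 1 and 2
dropped_all = toy.dropna(how='all')        # drops only row 2
dropped_subset = toy.dropna(subset=['b'])  # checks column 'b' only, drops row 2

print(dropped_any.index.tolist())
print(dropped_all.index.tolist())
print(dropped_subset.index.tolist())
```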


### Filling in Missing Values
There are a [variety of ways to fill in missing values](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html). 

First, let's find where the missing data is located by selecting the `interest_paid` column as a pandas Series and slicing it from index 30 up to, but not including, index 40. There is a NaN at index 35.


In [6]:
# Looking at where missing data is located
df['interest_paid'][30:40]

30    110.32
31    106.94
32    103.55
33    100.13
34     96.70
35       NaN
36     89.77
37     86.27
38     82.76
39     79.22
Name: interest_paid, dtype: float64

The first option is to fill the NaN with a zero by using the `fillna` method. Filling in a NaN with a zero is often not a good idea, because the original value was probably something other than zero, and a zero can lead you to misinterpret the data. It's just one option.


In [7]:
# Filling in the nan with a zero is probably a bad idea. 
df['interest_paid'][30:40].fillna(0)

30    110.32
31    106.94
32    103.55
33    100.13
34     96.70
35      0.00
36     89.77
37     86.27
38     82.76
39     79.22
Name: interest_paid, dtype: float64
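
As a rough illustration of how a zero can mislead, the sketch below compares the mean of three values before and after `fillna(0)`. The two known values are copied from the output above. The choice of the mean as the statistic is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Indexes 34 to 36 of interest_paid, copied from the output above
s = pd.Series([96.70, np.nan, 89.77], index=[34, 35, 36])

mean_ignoring_nan = s.mean()          # mean() skips the NaN by default
mean_with_zero = s.fillna(0).mean()   # the zero is averaged in as a real payment

print(mean_ignoring_nan, mean_with_zero)
```

The zero pulls the average well below either neighboring payment, even though the real payment was almost certainly close to them.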

***backfill***

Another option is a backfill, which is easiest to understand from the output. At index 35, where there was a NaN (or a zero after `fillna(0)`), there is now 89.77, because that is the value at the next index, 36. This is very commonly done with ***time series data*** when you have a missing value.


In [8]:
# back fill in value
df['interest_paid'][30:40].bfill()

30    110.32
31    106.94
32    103.55
33    100.13
34     96.70
35     89.77
36     89.77
37     86.27
38     82.76
39     79.22
Name: interest_paid, dtype: float64

***forward fill***

Another option is to ***forward fill*** the value, which is also common with ***time series data***. At index 35 there is now 96.70, the value at the previous index, 34.


In [9]:
# forward fill in value
df['interest_paid'][30:40].ffill()

30    110.32
31    106.94
32    103.55
33    100.13
34     96.70
35     96.70
36     89.77
37     86.27
38     82.76
39     79.22
Name: interest_paid, dtype: float64

The difference between backfill and forward fill is the direction. Backfill takes the value after the missing value and inserts it where the value is missing.

Forward fill takes the value before the missing value and inserts it where the value is missing.

***Which one you use depends on your domain knowledge and your application. Filling in missing values is called data imputation, and it is a current area of research.***
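
The two directions can be compared side by side on a three-value Series. The values are copied from indexes 34 to 36 above. The second Series, `leading_gap`, is a hypothetical example added to show an edge case: a forward fill cannot fill a NaN that has no earlier value.

```python
import numpy as np
import pandas as pd

# Indexes 34 to 36 of interest_paid, copied from the output above
s = pd.Series([96.70, np.nan, 89.77], index=[34, 35, 36])

back_filled = s.bfill()     # takes the value after the gap
forward_filled = s.ffill()  # takes the value before the gap

print(back_filled.loc[35], forward_filled.loc[35])

# Hypothetical edge case: nothing comes before the first value,
# so a forward fill leaves a leading NaN in place
leading_gap = pd.Series([np.nan, 1.0, 2.0]).ffill()
print(leading_gap)
```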


***linear interpolation***

Another way to fill in missing values is ***linear interpolation***, which draws a straight line between the known values on either side of the gap and reads the missing value off that line. In the output below, the filled value 93.235 lies midway between 96.70 and 89.77.


In [10]:
# linear interpolation (filling in of values)
df['interest_paid'][30:40].interpolate(method = 'linear')

30    110.320
31    106.940
32    103.550
33    100.130
34     96.700
35     93.235
36     89.770
37     86.270
38     82.760
39     79.220
Name: interest_paid, dtype: float64
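
For a single missing value with one known neighbor on each side, linear interpolation reduces to the average of the two neighbors. The sketch below checks that on the values from the output above, and adds a hypothetical Series with a two-value gap to show that longer gaps are filled in equal steps.

```python
import numpy as np
import pandas as pd

# Indexes 34 to 36 of interest_paid, copied from the output above
s = pd.Series([96.70, np.nan, 89.77], index=[34, 35, 36])
filled = s.interpolate(method='linear')
midpoint = (96.70 + 89.77) / 2
print(filled.loc[35], midpoint)

# Hypothetical Series with two consecutive NaNs: the gap is filled in equal steps
two_gap = pd.Series([1.0, np.nan, np.nan, 4.0]).interpolate(method='linear')
print(two_gap.tolist())
```

Note that `method='linear'` treats the values as equally spaced and ignores the index values themselves.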

The code below finds the total interest paid over the course of the loan by using the `sum` method. Keep in mind that `sum` ignores NaNs by default, so the total may change after we fill in the NaN with a real value.


In [11]:
# Interest paid before filling in the nan with a value
df['interest_paid'].sum()

np.float64(6450.2699999999995)
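
The default behavior comes from the `skipna` parameter of `sum`, which is `True` unless you change it. A minimal sketch on three values copied from the column shows both settings.

```python
import numpy as np
import pandas as pd

# Indexes 34 to 36 of interest_paid, copied from the earlier output
s = pd.Series([96.70, np.nan, 89.77], index=[34, 35, 36])

total_skipping_nan = s.sum()               # skipna=True by default: NaN is ignored
total_with_nan = s.sum(skipna=False)       # NaN propagates, so the result is NaN

print(total_skipping_nan, total_with_nan)
```

Passing `skipna=False` is one way to make a missing value visible in a total instead of silently leaving it out.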

The code below produces a Boolean Series that is True where `interest_paid` is NaN and False where it isn't, and assigns it to the variable `interest_missing`. It then uses `loc` with that mask to fill in the missing value with 93.24.


In [13]:
# Fill in with the actual value
interest_missing = df['interest_paid'].isna()
df.loc[interest_missing,'interest_paid'] = 93.24

Now, summing over the entire column gives a different number, which is more accurate here. This is the value of removing or filling in missing data: it often gives you more accurate calculations.


In [14]:
# Interest paid after filling in the nan with a value
df['interest_paid'].sum()

np.float64(6543.509999999999)

In this case, I filled in the value with 93.24 because I knew what the actual value should have been from my domain knowledge of loans. Whatever application you're working with, it's usually best to fill in missing values with the most accurate value you can get.
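
The text does not say how 93.24 was obtained. One way such a value could be recovered from the neighboring rows is sketched below, under two assumptions that are not stated in the original: each month's starting balance equals the previous month's new balance, and the monthly payment is constant. All numbers are copied from the rows shown earlier, and the variable names are made up for the example.

```python
# Values copied from the rows shown earlier (indexes 30, 34 and 36)
new_balance_before_gap = 15940.06       # new_balance at index 34
starting_balance_after_gap = 15346.07   # starting_balance at index 36
monthly_payment = 576.91 + 110.32       # principal_paid + interest_paid at index 30

# Assumption: the balance at index 35 starts where index 34 ended and
# ends where index 36 starts, so the drop in balance is the principal paid
principal_paid = new_balance_before_gap - starting_balance_after_gap

# Assumption: the payment is constant, so the rest of it is interest
interest_paid = round(monthly_payment - principal_paid, 2)
print(interest_paid)
```

Under those assumptions the result matches the 93.24 used above, and it is close to, but not the same as, the 93.235 that linear interpolation produced.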

As the `info` output below shows, there are no NaN values left in the DataFrame.


In [15]:
# Notice we don't have NaN values in the DataFrame anymore
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   month             60 non-null     int64  
 1   starting_balance  60 non-null     float64
 2   interest_paid     60 non-null     float64
 3   principal_paid    60 non-null     float64
 4   new_balance       60 non-null     float64
 5   interest_rate     60 non-null     float64
 6   car_type          60 non-null     object 
dtypes: float64(5), int64(1), object(1)
memory usage: 3.4+ KB


Once you've identified your missing values, removing them or filling them in often gives you more accurate calculations and makes the results more interpretable.


In [16]:
df.to_excel('car_financing_misssing.xlsx', index=False)