# Filter and clean data

When working with a dataset, you're often only interested in a smaller subset of your data. For example, say I have a car loan dataset and I want to filter it down to only the rows with a car type of Toyota Sienna and an interest rate of 7.02%. The first thing I'm going to do is look at the first five rows of my dataset. While the first five rows are all Toyota Siennas, that doesn't mean the rest of my dataset is all of car type Toyota Sienna.


### car_type filter
Comparison Operator | Meaning
--- | --- 
<  | less than
<= | less than or equal to
\> | greater than
\>= | greater than or equal to
== | equal
!= | not equal
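
To make the table concrete, here is a minimal sketch of two of these operators applied to a pandas Series. The values are toy data I made up for illustration (they are not read from car_financing.xlsx), and the variable names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from car_financing.xlsx)
rates = pd.Series([0.0359, 0.0702, 0.0702], name='interest_rate')

# Each comparison is applied element-wise and returns a boolean Series
below_five_percent = rates < 0.05
not_lowest_rate = rates != 0.0359

print(below_five_percent.tolist())  # [True, False, False]
print(not_lowest_rate.tolist())     # [False, True, True]
```

Each comparison gives back one True or False value per row, which is what makes these operators useful for filtering.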

In [2]:
# Import libraries
import pandas as pd
import numpy as np

# Load Excel File
filename = 'car_financing.xlsx'
df = pd.read_excel(filename)

In [4]:
df.head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna


The first thing I'm going to do is use the value_counts method on the car_type column to see what other kinds of cars I have in my dataset. I have my data frame, I have square brackets, and I have the column I'm interested in. I close those brackets, and then I have the value_counts method. When I press Shift+Enter, you'll see that I have 120 Toyota Siennas.

In [5]:
# Let's first start by looking at the car_type column. 
df['car_type'].value_counts()

car_type
VW Golf R         144
Toyota Sienna     120
Toyota Carolla    111
Toyota Corolla     33
Name: count, dtype: int64

Say, for example, I was interested in Toyota Corollas instead. I would have to fix the data entry errors first, because 111 rows are labeled Toyota Carolla instead of Toyota Corolla. This is really important to take note of: you'll often find misspellings, errors, and things you don't quite understand in your dataset, and that's all part of the data exploration process.
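
One possible way to fix a misspelling like this is the Series replace method. This is only a sketch on a small toy data frame that mimics the misspelling in the value counts above. It is not a step this walkthrough actually runs, and the toy variable name is hypothetical.

```python
import pandas as pd

# Toy data mimicking the misspelling seen in the value counts above
toy = pd.DataFrame({
    'car_type': ['Toyota Carolla', 'Toyota Corolla', 'Toyota Carolla', 'VW Golf R']
})

# Map the misspelled label to the correct one; other values are left untouched
toy['car_type'] = toy['car_type'].replace({'Toyota Carolla': 'Toyota Corolla'})

corolla_count = int((toy['car_type'] == 'Toyota Corolla').sum())
print(corolla_count)  # 3
```

After the replacement, all three Corolla rows share one label, so an equality filter on Toyota Corolla would catch all of them.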

Now I'll create a car filter. The way this works is I have a data frame, I have square brackets, and I have the column that I'm interested in. I close those brackets, I have two equal signs because this is an equality comparison, and then I have the type of car I'm interested in, in this case Toyota Sienna. What this produces is a pandas Series of True and False values.

In [6]:
# Notice that the filter produces a pandas series of True and False values
car_filter = df['car_type']=='Toyota Sienna'

In [8]:
car_filter.head()

0    True
1    True
2    True
3    True
4    True
Name: car_type, dtype: bool

There are a couple of different ways to use this pandas Series of True and False values to get a data frame of just Toyota Siennas. One way is to have your data frame, square brackets, and your car filter, which is your pandas Series. You close those brackets, and here I'm just going to look at the first five rows.

In [12]:
# Approach 1 using square brackets
# Filter dataframe to get a DataFrame of only 'Toyota Sienna'
df[car_filter].head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna


The second way is to use the .loc attribute. The way this works is I have a data frame, I have .loc, I have square brackets, and I have my pandas Series of True and False values. The colon just means that I want all the columns. I'll press Shift+Enter. These two approaches are equivalent in the result they produce, but the second approach is often more legible.

In [13]:
# Approach 2 using loc
# Filter dataframe to get a DataFrame of only 'Toyota Sienna'
df.loc[car_filter, :]

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.30,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.10,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.70,60,0.0702,Toyota Sienna
4,5,32735.70,687.23,191.50,495.73,32239.97,60,0.0702,Toyota Sienna
...,...,...,...,...,...,...,...,...,...
115,56,3133.83,632.47,9.37,623.10,2510.73,60,0.0359,Toyota Sienna
116,57,2510.73,632.47,7.51,624.96,1885.77,60,0.0359,Toyota Sienna
117,58,1885.77,632.47,5.64,626.83,1258.94,60,0.0359,Toyota Sienna
118,59,1258.94,632.47,3.76,628.71,630.23,60,0.0359,Toyota Sienna


As you can see, the two approaches select the same rows.
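
If you'd rather not compare the outputs by eye, one option is the DataFrame equals method. This is a sketch on toy data with hypothetical names, not a cell from the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the car loan dataset
toy = pd.DataFrame({
    'car_type': ['Toyota Sienna', 'VW Golf R', 'Toyota Sienna'],
    'interest_rate': [0.0702, 0.0359, 0.0359],
})
toy_filter = toy['car_type'] == 'Toyota Sienna'

by_brackets = toy[toy_filter]     # Approach 1: square brackets
by_loc = toy.loc[toy_filter, :]   # Approach 2: .loc with all columns

# equals compares the index, the columns, and every value
same_result = by_brackets.equals(by_loc)
print(same_result)  # True
```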

One thing to keep in mind is that if I use the value_counts method again on the car_type column, it'll seem like nothing changed.

In [14]:
# Notice that it looks like nothing changed
# This is because we did not update the dataframe after applying the filter
df['car_type'].value_counts()

car_type
VW Golf R         144
Toyota Sienna     120
Toyota Carolla    111
Toyota Corolla     33
Name: count, dtype: int64

The reason it looks like nothing changed is that we didn't assign the filtered data frame back to the original data frame. Filtering returns a new data frame and leaves df untouched, so the way to fix this is to assign the result back to df.

In [15]:
# Filter dataframe to get a DataFrame of only 'Toyota Sienna'
df = df.loc[car_filter, :]

Now if you look at the value counts, you can see that the data frame has been filtered.


In [16]:
df['car_type'].value_counts()

car_type
Toyota Sienna    120
Name: count, dtype: int64
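
Assigning back to df overwrites the unfiltered data. If you think you'll need the other car types later, an alternative is to store the result under a new name. The sketch below uses toy data, and the name siennas is my own choice rather than something from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'car_type': ['Toyota Sienna', 'VW Golf R', 'Toyota Sienna']})
toy_filter = toy['car_type'] == 'Toyota Sienna'

# Keep the filtered rows under a new name; copy() makes it an independent data frame
siennas = toy.loc[toy_filter, :].copy()

print(len(toy), len(siennas))  # 3 2
```

Note that the filtered data frame keeps the original row labels (0 and 2 here), just like the filtered output above keeps its original index.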

### interest_rate filter

Now that we've taken care of the car type filter, we also have to make an interest rate filter. If I look at the value counts for the interest_rate column, you'll see that we have 60 rows with a 7.02% interest rate and 60 rows with a 3.59% interest rate.


In [17]:
df['interest_rate'].value_counts()

interest_rate
0.0702    60
0.0359    60
Name: count, dtype: int64

What I want to do in this section is filter the data frame to only have the 7.02% interest rate. The code here is a filter that produces a pandas Series of True and False values, where the True rows are the ones with the 7.02% interest rate and the False rows are the ones with the 3.59% interest rate. I have the data frame, I have square brackets, and I have a string with the name of the column that I'm interested in. I close those brackets, I have two equal signs, and then the 7.02% interest rate written as 0.0702.


In [19]:
# Notice that the filter produces a pandas series of True and False values
df['interest_rate'] == 0.0702

0       True
1       True
2       True
3       True
4       True
       ...  
115    False
116    False
117    False
118    False
119    False
Name: interest_rate, Length: 120, dtype: bool
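
The equality check works here because the stored rates match the value 0.0702 exactly. If a column holds numbers that came out of a calculation, exact equality on floats can miss rows. A tolerance-based comparison such as NumPy's isclose is one way around that. The sketch below uses the classic 0.1 + 0.2 case as toy data, which is not part of the car loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data: the first value is computed, so it carries a tiny rounding error
values = pd.Series([0.1 + 0.2, 0.3])

# Exact equality misses the computed value
exact_match = values == 0.3

# isclose compares within a small tolerance and returns a NumPy array,
# so wrap it back into a Series that shares the original index
tolerant_match = pd.Series(np.isclose(values.to_numpy(), 0.3), index=values.index)

print(exact_match.tolist())     # [False, True]
print(tolerant_match.tolist())  # [True, True]
```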

Next, I'm going to assign that pandas Series of True and False values to the variable interest_filter. To use my interest filter, I'm going to use the .loc attribute followed by square brackets. Inside I'll have my pandas Series of True and False values, and then a colon to select all the columns. I'm going to take this filtered data frame and assign it back to the original data frame, and I'll press Shift+Enter to create it. To check that my interest filter worked as intended, I'm going to use the value_counts method on the interest_rate column. When I press Shift+Enter, you'll see that I have 60 rows, all with the 7.02% interest rate.


In [20]:
interest_filter = df['interest_rate'] == 0.0702

In [21]:
df = df.loc[interest_filter, :]

In [22]:
df['interest_rate'].value_counts(dropna=False)

interest_rate
0.0702    60
Name: count, dtype: int64


In the previous sections, we created a car filter and an interest filter and used .loc to filter the data, first applying the car filter and then the interest filter. A more concise way to do this, combining both filters in a single step, is shown below.

Bitwise Logic Operator | Meaning
--- | --- 
& | and
\| | or
^ | exclusive or
~ | not
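
Here is a quick sketch of what each of these operators does to two boolean Series. The Series are toy data with hypothetical names, standing in for filters like the ones built earlier.

```python
import pandas as pd

# Toy boolean Series (hypothetical), one value per row
is_sienna = pd.Series([True, True, False, False])
is_high_rate = pd.Series([True, False, True, False])

both = is_sienna & is_high_rate         # True only where both are True
either = is_sienna | is_high_rate       # True where at least one is True
exactly_one = is_sienna ^ is_high_rate  # True where exactly one is True
not_sienna = ~is_sienna                 # flips every value

print(both.tolist())         # [True, False, False, False]
print(either.tolist())       # [True, True, True, False]
print(exactly_one.tolist())  # [False, True, True, False]
print(not_sienna.tolist())   # [False, False, True, True]
```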

I have my data frame, I have the .loc attribute, I have square brackets, and I have my car filter. Then I use the & (and) bitwise logic operator along with my interest filter, and the colon says that I want all the columns. This would have worked just as well as applying each of the filters individually. As you can see, filtering lets you get just the data you're interested in looking at.

In [27]:
df.loc[car_filter & interest_filter].head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna


In [23]:
df.loc[car_filter & interest_filter, :]

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna
5,6,32239.97,687.23,188.6,498.63,31741.34,60,0.0702,Toyota Sienna
6,7,31741.34,687.23,185.68,501.55,31239.79,60,0.0702,Toyota Sienna
7,8,31239.79,687.23,182.75,504.48,30735.31,60,0.0702,Toyota Sienna
8,9,30735.31,687.23,179.8,507.43,30227.88,60,0.0702,Toyota Sienna
9,10,30227.88,687.23,176.83,510.4,29717.48,60,0.0702,Toyota Sienna
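
You can also write both conditions inline instead of saving each filter to a variable first. In that case each comparison needs its own parentheses, because & binds more tightly than == in Python. This is a sketch on toy data with hypothetical names, not a cell from the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'car_type': ['Toyota Sienna', 'Toyota Sienna', 'VW Golf R'],
    'interest_rate': [0.0702, 0.0359, 0.0702],
})

# Wrap each comparison in parentheses before combining them with &
selected = toy.loc[
    (toy['car_type'] == 'Toyota Sienna') & (toy['interest_rate'] == 0.0702), :
]

print(selected.index.tolist())  # [0]
```

Only the first row satisfies both conditions, so it is the only one kept.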
