# Data Cleaning in pandas

The official pandas cheat sheet is a great resource for pandas questions: it summarizes the basic functionality well. Bookmark it!
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


In [1]:
#Import pandas
import pandas as pd 

##1: Constructing DataFrames

We'll be using very small, synthetic datasets in this notebook so you can see EXACTLY what these cleaning methods do. 

How do you construct a df from several equal-length, same-ordered lists? 

Dictionaries are one easy way!

key = column name, value = data. 

In [None]:
numbers = [1,2,3,4,5]
letters = ['a', 'b', 'c', 'd', 'e']
booleans = [True, False, True, True, False]
df = pd.DataFrame({'numbers':numbers, 'letters':letters, 'bools':booleans})
df

Unnamed: 0,numbers,letters,bools
0,1,a,True
1,2,b,False
2,3,c,True
3,4,d,True
4,5,e,False
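
If your data arrives as rows rather than columns, a list of tuples works too. The snippet below is a small sketch of that alternative; the names `rows`, `df_rows`, and `df_zip` are made up for illustration.

```python
import pandas as pd

# Each tuple is one row: (number, letter, boolean)
rows = [(1, 'a', True), (2, 'b', False), (3, 'c', True)]

# Rows carry no labels, so column names are passed separately
df_rows = pd.DataFrame(rows, columns=['numbers', 'letters', 'bools'])

# zip() turns parallel lists into rows, giving the same table
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
booleans = [True, False, True]
df_zip = pd.DataFrame(list(zip(numbers, letters, booleans)),
                      columns=['numbers', 'letters', 'bools'])

print(df_rows)
```

The dictionary approach is usually easier to read when you already have one list per column.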

##2: Changing data types (Casting/Converting)

In [2]:
classes  = ['CS1111', 'PSYC1010', 'CS2150', 'ECON2010', 'SOC2010']
cf_ratings = ['4', 3.8, '0.2', 2, '4']


In [13]:
#Exercise: Create a DataFrame called lou, with columns 'courses' and 'ratings',
#from the lists created above. 
lou = pd.DataFrame({'courses':classes, 'ratings':cf_ratings})
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,0.2
3,ECON2010,2.0
4,SOC2010,4.0


Type inconsistencies can cause issues when trying to perform column arithmetic. 

Pandas can hold multiple types in the same column. This is nice, but can get us into trouble. 
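
One way to spot a mixed-type column before it bites you is to count the Python types it holds. This is just a sketch that rebuilds the ratings list on its own; `type_counts` is a name chosen for illustration.

```python
import pandas as pd

cf_ratings = ['4', 3.8, '0.2', 2, '4']
ratings = pd.Series(cf_ratings, name='ratings')

# A column mixing strings and numbers is stored with the catch-all 'object' dtype
print(ratings.dtype)

# Count how many values of each Python type the column holds
type_counts = ratings.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

More than one type in that tally is a sign the column needs converting before you do arithmetic on it.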

In [5]:
lou['ratings_10'] = lou['ratings']*10
lou

Unnamed: 0,courses,ratings,ratings_10
0,CS1111,4.0,4444444444
1,PSYC1010,3.8,38
2,CS2150,0.2,0.20.20.20.20.20.20.20.20.20.2
3,ECON2010,2.0,20
4,SOC2010,4.0,4444444444


We should drop that bad column. 

Remember, axes: rows = 0, columns = 1. 

In [7]:
lou = lou.drop('ratings_10', axis = 1)
#OR: lou = lou.drop(columns = 'ratings_10')
#Note: if 'ratings_10' has already been dropped, this raises a KeyError.

In [8]:
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,0.2
3,ECON2010,2.0
4,SOC2010,4.0


Convert all ratings to float so we can properly rescale them using column arithmetic. 

In [15]:
lou['ratings'] = lou['ratings'].astype('float')
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,0.2
3,ECON2010,2.0
4,SOC2010,4.0


In [17]:
lou['ratings_10'] = lou['ratings']*10 
lou

Unnamed: 0,courses,ratings,ratings_10
0,CS1111,4.0,40.0
1,PSYC1010,3.8,38.0
2,CS2150,0.2,2.0
3,ECON2010,2.0,20.0
4,SOC2010,4.0,40.0
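
astype('float') raises a ValueError as soon as it meets a value it can't parse. For messier columns, pd.to_numeric with errors='coerce' is a more forgiving option. The sketch below uses a hypothetical ratings list with an 'n/a' entry that is not in the data above.

```python
import pandas as pd

# Hypothetical ratings column with one entry that is not a number
raw = pd.Series(['4', 3.8, 'n/a', 2])

# errors='coerce' replaces anything unparseable with NaN instead of raising
clean = pd.to_numeric(raw, errors='coerce')
print(clean)
```

The trade-off is that bad values turn into NaN silently, so it's worth checking how many nulls you end up with.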


##3: Handling missing and null/nan values

Let's simulate data as you'll actually find it in real-world problems - not perfectly clean and complete, but full of missing values. 

In [18]:
import numpy as np

In [19]:
students = ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G']
years = [1, np.nan, 3, None, 4, np.nan, 1]
df = pd.DataFrame({'student':students,'year' :years})
df

Unnamed: 0,student,year
0,Student_A,1.0
1,Student_B,
2,Student_C,3.0
3,Student_D,
4,,4.0
5,Student_F,
6,Student_G,1.0


Calling df.isna() will return a dataframe full of booleans, which are (intuitively) True for data points that are null, and False for those which are non-null.

In [20]:
df.isna()

Unnamed: 0,student,year
0,False,False
1,False,True
2,False,False
3,False,True
4,True,False
5,False,True
6,False,False


What can we do to get a tally of the number of null values per column? 

Hint: how are booleans mathematically evaluated in Python?

In [21]:
#EXERCISE: find null values per column.
df.isna().sum()


student    1
year       3
dtype: int64

In [None]:
#SOLUTION extension
df.isna().apply(sum)

student    1
year       3
dtype: int64
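
The same boolean trick gives the *share* of missing values per column if you take the mean instead of the sum. A small sketch, rebuilding the student data so it runs on its own; `missing_share` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D',
                np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
})

# True counts as 1 and False as 0, so the column mean is the fraction missing
missing_share = df.isna().mean()
print(missing_share)
```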

We can handle missing data in several ways. Choose the method that suits the problem at hand: as in machine learning, no single solution works in every case. 

###A: Dropping rows with missing values.

This is by far the easiest way of dealing with missing data: 

In [None]:
df_dropped = df.dropna()
df_dropped

Unnamed: 0,student,year
0,Student_A,1.0
2,Student_C,3.0
6,Student_G,1.0


If you only want to drop rows with NaN values *in a certain set of columns*, pass a list of column names to subset, e.g. subset=['year'].

In [None]:
df.dropna(subset=['year'])

Unnamed: 0,student,year
0,Student_A,1.0
2,Student_C,3.0
4,,4.0
6,Student_G,1.0


In [22]:
df.dropna(subset=['student', 'year'])

Unnamed: 0,student,year
0,Student_A,1.0
2,Student_C,3.0
6,Student_G,1.0


However, dropping every row that contains a missing value is usually pretty costly. In our example, we sacrificed over 50% of our data.

Occasionally, there will be a few **columns** which are sparse or poorly collected, and have missing data for a great majority of the dataset. It is common practice to *drop all columns with greater than n% missing data*, where n is a threshold chosen by the practitioner. 

As an exercise, let's code a function that we can use every time we want to preprocess in this way!
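
Here is one possible version of that function, as a sketch rather than the definitive answer. The name `drop_sparse_columns`, the `max_missing` parameter, and the example data are all made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.5):
    """Return a copy of frame without columns whose share of nulls exceeds max_missing."""
    missing_share = frame.isna().mean()  # fraction of nulls per column
    keep = missing_share[missing_share <= max_missing].index
    return frame[keep].copy()

# Hypothetical data: 'notes' is almost entirely empty
example = pd.DataFrame({
    'student': ['A', 'B', 'C', 'D'],
    'year': [1, np.nan, 3, 4],
    'notes': [np.nan, np.nan, np.nan, 'late'],
})

trimmed = drop_sparse_columns(example, max_missing=0.5)
print(trimmed.columns.tolist())
```

The threshold is written as a fraction (0.5 means 50%), and the original DataFrame is left untouched so you can compare before and after.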


###B: Filling nulls with a reasonable common value

In [23]:
df

Unnamed: 0,student,year
0,Student_A,1.0
1,Student_B,
2,Student_C,3.0
3,Student_D,
4,,4.0
5,Student_F,
6,Student_G,1.0


In [None]:
df.fillna(0)

Unnamed: 0,student,year
0,Student_A,1.0
1,Student_B,0.0
2,Student_C,3.0
3,Student_D,0.0
4,0,4.0
5,Student_F,0.0
6,Student_G,1.0


To preserve observations, you can use the fillna() method to fill missing values with a safe guess for that variable, like the mean or median. 

In [24]:
df_copy = df.copy()
#Fill missing years with the mean of the known years
df_copy['year'] = df_copy['year'].fillna(df_copy['year'].mean())
df_copy

Unnamed: 0,student,year
0,Student_A,1.0
1,Student_B,2.25
2,Student_C,3.0
3,Student_D,2.25
4,,4.0
5,Student_F,2.25
6,Student_G,1.0
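
fillna() also accepts a dictionary, so each column can get its own fill value in one call. The sketch below uses the median for year and a placeholder string for student; the 'Unknown' label is an assumption for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D',
                np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
})

# A dict maps each column name to its own fill value
filled = df.fillna({'student': 'Unknown', 'year': df['year'].median()})
print(filled)
```

The median is less sensitive to outliers than the mean, which can matter when a column has a few extreme values.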


###Neighbor-based imputation: FOR LOGICALLY ORDERED DATA ONLY

In [26]:
dates = ['1-1-2020', '1-2-2020', '1-3-2020', '1-4-2020', '1-5-2020', 
         '1-6-2020', '1-7-2020', '1-8-2020', '1-9-2020', '1-10-2020']

temps = [32, 35, 37, np.nan, 44, 55, np.nan, 59, 55, 54]

In [27]:
weather = pd.DataFrame({'date': dates, 'temp':temps})
weather

Unnamed: 0,date,temp
0,1-1-2020,32.0
1,1-2-2020,35.0
2,1-3-2020,37.0
3,1-4-2020,
4,1-5-2020,44.0
5,1-6-2020,55.0
6,1-7-2020,
7,1-8-2020,59.0
8,1-9-2020,55.0
9,1-10-2020,54.0


Data that is logically ordered, like the **time-series** data simulated here, tends to have high correlations between sequential observations. 

This is called **serial correlation.** Because of it, the best guess at any given missing value is often the observation before or after it. 

In [None]:
#fillna(method='ffill') is deprecated in recent pandas; ffill() does the same thing
weather.ffill()

Unnamed: 0,date,temp
0,1-1-2020,32.0
1,1-2-2020,35.0
2,1-3-2020,37.0
3,1-4-2020,37.0
4,1-5-2020,44.0
5,1-6-2020,55.0
6,1-7-2020,55.0
7,1-8-2020,59.0
8,1-9-2020,55.0
9,1-10-2020,54.0


In [None]:
#fillna(method='bfill') is deprecated in recent pandas; bfill() does the same thing
weather.bfill()

Unnamed: 0,date,temp
0,1-1-2020,32.0
1,1-2-2020,35.0
2,1-3-2020,37.0
3,1-4-2020,44.0
4,1-5-2020,44.0
5,1-6-2020,55.0
6,1-7-2020,59.0
7,1-8-2020,59.0
8,1-9-2020,55.0
9,1-10-2020,54.0


Reminder (truly can't stress this enough): **don't use ffill or bfill unless you're using time series data or data that has some other natural ordering.**

If the data isn't ordered, filling from the row before or after a missing value is completely arbitrary: the result depends on whatever order the data happened to arrive in.
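
For ordered data, linear interpolation is another neighbor-based option: it uses the values on *both* sides of a gap instead of just one. A minimal sketch with the same temperatures, assuming the observations are evenly spaced:

```python
import numpy as np
import pandas as pd

temps = pd.Series([32, 35, 37, np.nan, 44, 55, np.nan, 59, 55, 54])

# The default (linear) method places each missing value on the straight
# line between its two known neighbors
filled = temps.interpolate()
print(filled)
```

The same warning applies here: interpolation only makes sense when the row order means something.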

##4: String processing

In [None]:
schools = ["UVA", "Duke", "Unc", "VT", "pitt", "uva", "Duke", "UNC", "vt", "Pitt"]
sports = ["Basketball"]*5 + ["Football"]*5
wins = [50, 20, 25, 0, 30]*2

sports = pd.DataFrame({"School": schools, "Sport": sports, "Wins": wins})
sports

Unnamed: 0,School,Sport,Wins
0,UVA,Basketball,50
1,Duke,Basketball,20
2,Unc,Basketball,25
3,VT,Basketball,0
4,pitt,Basketball,30
5,uva,Football,50
6,Duke,Football,20
7,UNC,Football,25
8,vt,Football,0
9,Pitt,Football,30


With a dataframe like the one above, we would run into issues if we wanted to figure out the total wins per school, like so:

In [None]:
#numeric_only=True keeps the text column "Sport" out of the sum
sports.groupby("School").sum(numeric_only=True)

Unnamed: 0_level_0,Wins
School,Unnamed: 1_level_1
Duke,40
Pitt,30
UNC,25
UVA,50
Unc,25
VT,0
pitt,30
uva,50
vt,0


We can see here that **capitalization presents a pretty big problem when working with text data.**

An easy fix is to convert all the text to a uniform case, using the .str accessor in front of the string method. 

In [None]:
"Silas".upper()

'SILAS'

There are a bunch of string methods out there that modify strings in a similar way. You can find a ton of them [here.](https://www.w3schools.com/python/python_ref_string.asp)

It would be great if we could just apply that method to the sports["School"] column, but that raises an error: a Series is not a string, so it has no upper() method of its own.

In [None]:
sports["School"].upper()

AttributeError: 'Series' object has no attribute 'upper'

Luckily, there is a super easy solution: the .str accessor.

In [None]:
sports["School"] = sports.School.str.upper()
sports.School

0     UVA
1    DUKE
2     UNC
3      VT
4    PITT
5     UVA
6    DUKE
7     UNC
8      VT
9    PITT
Name: School, dtype: object

Now that we've modified the "School" column, our aggregation function will work like we wanted it to:

In [None]:
sports.groupby("School").sum(numeric_only=True)

Unnamed: 0_level_0,Wins
School,Unnamed: 1_level_1
DUKE,40
PITT,60
UNC,50
UVA,100
VT,0


##5: Date and time processing

pd.to_datetime is super powerful and flexible. Always see how it does before manually specifying anything yourself—it can save you a ton of work!

In [None]:
presidents = ['Washington' ,'Lincoln', 'Kennedy', 'Obama', 'Trump']
birthdays = ['Feb 27 1732', '2-12-1809', 'May 29th, 1917', '8 4 1961','06//14// //1946' ]

bdays = pd.DataFrame({'president': presidents, 'birthday': birthdays})
bdays

Unnamed: 0,president,birthday
0,Washington,Feb 27 1732
1,Lincoln,2-12-1809
2,Kennedy,"May 29th, 1917"
3,Obama,8 4 1961
4,Trump,06//14// //1946


In [None]:
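#pandas 2.0+ assumes every value shares the first value's format, so mixed formats
#need format='mixed' (leave that argument out on older versions, which don't support it)
bdays['datetime_bday'] = pd.to_datetime(bdays['birthday'], format='mixed')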
bdays

Unnamed: 0,president,birthday,datetime_bday
0,Washington,Feb 27 1732,1732-02-27
1,Lincoln,2-12-1809,1809-02-12
2,Kennedy,"May 29th, 1917",1917-05-29
3,Obama,8 4 1961,1961-08-04
4,Trump,06//14// //1946,1946-06-14
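
When you *do* know the format, spelling it out is faster and removes any guessing about day/month order. The sketch below uses a hypothetical column that follows a month-day-year pattern, plus one bad entry to show what errors='coerce' does.

```python
import pandas as pd

# Hypothetical column in one known format, with one bad entry
raw_dates = pd.Series(['02-12-1809', '08-04-1961', 'not a date'])

# format pins down month-day-year; errors='coerce' turns failures into NaT
parsed = pd.to_datetime(raw_dates, format='%m-%d-%Y', errors='coerce')
print(parsed)
```

NaT ("not a time") is the datetime version of NaN, so isna() and dropna() treat it as missing.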


### Using pandas datetime objects

In [None]:
washington = bdays.datetime_bday[0]
print(washington)
washington.month

1732-02-27 00:00:00


2

In [None]:
washington.month_name()

'February'

In [None]:
washington.year

1732

In [None]:
washington.is_leap_year

True

In [None]:
washington.daysinmonth

29
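
The attributes above work on a single timestamp. To get them for a whole column at once, go through the .dt accessor, which plays the same role for datetimes that .str plays for strings. A minimal sketch with three of the birthdays, rebuilt here so it runs on its own:

```python
import pandas as pd

birthdays = pd.Series(pd.to_datetime(['1809-02-12', '1917-05-29', '1961-08-04']))

# .dt exposes the datetime attributes for every row at once
summary = pd.DataFrame({
    'year': birthdays.dt.year,
    'month': birthdays.dt.month_name(),
    'leap_year': birthdays.dt.is_leap_year,
})
print(summary)
```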