# Data Cleaning

In this lecture we'll introduce two Python packages, numpy and pandas.

In [6]:
import numpy as np
import pandas as pd 

Numpy (pronounced 'num-pie') offers more sophisticated data structures, most importantly the numpy array. Think of this as a Python list with more functionality.
Pandas is built on top of numpy, so we won't work with numpy directly very much, but we still import it so its tools are available when we need them.
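
To make "a list with more functionality" concrete, here is a minimal sketch. The variable names and values are made up for illustration. Multiplying a Python list by 2 repeats the list, while multiplying a numpy array by 2 doubles every element:

```python
import numpy as np

prices = [10.0, 20.0, 30.0]        # plain Python list (illustrative values)
price_array = np.array(prices)     # the same values stored in a numpy array

doubled_list = prices * 2          # list "multiplication" repeats the list
doubled_array = price_array * 2    # array multiplication doubles each element

print(doubled_list)
print(doubled_array)
```

This element-by-element behavior is what pandas builds on when we do math on whole columns later in the lecture.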

Pandas (like the animal) is the package that offers a variety of tools to import, manipulate, analyze, and export large amounts of data.


## Importing Data with Pandas

Before we look at some new data structures, let's get some data to work with by opening an Excel file.

Look at an Excel file such as 'open_example_1.xlsx' in a spreadsheet program, then run the cell below (which reads 'Historicalinvesttemp.xlsx') to see what pandas generates for us:

In [15]:
#Import data from an excel file
pd.read_excel('Historicalinvesttemp.xlsx')

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,
2,,,,
3,,,,
4,,Annual Returns on Investments in,,
...,...,...,...,...
85,2007,0.0549,0.0988,0.0466
86,2008,-0.37,0.2587,0.016
87,2009,0.2646,-0.149,0.001
88,,stocks,tbills,bonds


And that's all we have to do to import the data.
This is why packages are so beneficial: rather than writing code from scratch to read other file formats, we can simply learn and use what already exists.

There are two other sources we'll import from. The first is a CSV, or 'Comma-Separated Values', file. Look at 'open_example_2.csv' by opening it in a *text editor* like Notepad, then run the cell below, which loads a larger CSV of stock prices. Notice that your computer can open a CSV in a program like Excel as well, since many CSVs are made by exporting a spreadsheet in this format.

In [14]:
#Import data from a csv file
pd.read_csv('stock prices.csv')

Unnamed: 0,symbol,date,open,high,low,close,volume
0,AAL,2014-01-02,25.0700,25.8200,25.0600,25.3600,8998943
1,AAPL,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957
2,AAP,2014-01-02,110.3600,111.8800,109.2900,109.7400,542711
3,ABBV,2014-01-02,52.1200,52.3300,51.5200,51.9800,4569061
4,ABC,2014-01-02,70.1100,70.2300,69.4800,69.8900,1148391
...,...,...,...,...,...,...,...
497467,XYL,2017-12-29,68.5300,68.8000,67.9200,68.2000,1046677
497468,YUM,2017-12-29,82.6400,82.7100,81.5900,81.6100,1347613
497469,ZBH,2017-12-29,121.7500,121.9500,120.6200,120.6700,1023624
497470,ZION,2017-12-29,51.2800,51.5500,50.8100,50.8300,1261916
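
If you don't have the CSV file at hand, the sketch below shows the same idea on a few lines of text held in memory. The text in `csv_text` is a hand-typed excerpt in the same layout as the stock data (an assumption for illustration, not the real file), and `io.StringIO` just makes a string behave like an open file:

```python
import io
import pandas as pd

# Hand-typed CSV text standing in for a file on disk (illustrative excerpt)
csv_text = "symbol,date,close\nAAL,2014-01-02,25.36\nAAPL,2014-01-02,79.0185\n"

# read_csv accepts a file path or any file-like object
prices = pd.read_csv(io.StringIO(csv_text))
print(prices)
```

The first line of the text becomes the column names, and each following line becomes one row.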


The last way we'll import data is directly from the internet, more specifically from HTML, the 'mark-up' language of web browsers.
Run the cell below and see what it gives you:

In [8]:
#import from HTML
pd.read_html('https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/')

[                         Bank NameBank           CityCity StateSt  CertCert  \
 0                    Almena State Bank             Almena      KS     15426   
 1           First City Bank of Florida  Fort Walton Beach      FL     16748   
 2                 The First State Bank      Barboursville      WV     14361   
 3                   Ericson State Bank            Ericson      NE     18265   
 4     City National Bank of New Jersey             Newark      NJ     21111   
 ..                                 ...                ...     ...       ...   
 558                 Superior Bank, FSB           Hinsdale      IL     32646   
 559                Malta National Bank              Malta      OH      6629   
 560    First Alliance Bank & Trust Co.         Manchester      NH     34264   
 561  National State Bank of Metropolis         Metropolis      IL      3815   
 562                   Bank of Honolulu           Honolulu      HI     21029   
 
                  Acquiring Institutio

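What we imported was a table from fdic.gov with information on bank closures since October 2000. read_html returns a Python list of data frames, one for each table it finds on the page, which is why the output starts with '['. Note that the 500+ entries are abbreviated when displayed so they don't crowd your screen.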

## Pandas Data Structures

To understand how to manipulate what we've made above, we first need to understand what structures we're storing the data in. The two we will look at are series and data frames.

### Series

A series is a sort of list in which we can define another value to be the index of each element. Let's see an example:

In [43]:
labels = ['a','b','c'] #python list
data = [10,20,30] #another python list, same number of elements as 'labels'

In [44]:
pd.Series(data,labels)

a    10
b    20
c    30
dtype: int64

This creates a Series object, with the index of each data value defined by the list labels. If you want to see the arguments of a function, press 'shift+tab' while typing the function.

We can store this series in a variable, just as we can a list:

In [45]:
my_series_1 = pd.Series(data,labels)

In [46]:
my_series_1

a    10
b    20
c    30
dtype: int64

We can access individual pieces of data from the series using the index labels that we gave it:

In [47]:
my_series_1['a']

10

Why not just use a Python list instead? The series object can do more. For example, let's add two series together:

In [50]:
my_series_2 = pd.Series([20,40,60],['a','b','c'])

In [51]:
my_series_1 + my_series_2

a    30
b    60
c    90
dtype: int64

Practice: Access the Data for index 'b' of my_series_2:

In [52]:
#your code here
my_series_2['b']


40

Let's change the indices of my_series_2 and see what happens:

In [53]:
my_series_2 = pd.Series([20,40,60],['a','c','d'])

In [54]:
my_series_2 #take a look at the new series, what do you think will happen when we add it to series 1?

a    20
c    40
d    60
dtype: int64

In [55]:
my_series_1 + my_series_2


a    30.0
b     NaN
c    70.0
d     NaN
dtype: float64

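Pandas matches values by index label. Index 'b' exists only in my_series_1 and 'd' only in my_series_2, so there was nothing to add them to, and the result is null, represented by 'NaN'. The result is also converted to floats, because NaN is a floating-point value. We'll learn what to do with missing values in a little bit.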
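
As a preview, here is one possible way to avoid the NaN values, sketched with the same numbers as above. The names `s1` and `s2` are just short stand-ins for the two series. The `add` method does the same job as `+`, but its `fill_value` argument lets us treat a missing label as 0:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([20, 40, 60], index=['a', 'c', 'd'])

plain_sum = s1 + s2                     # NaN wherever a label is missing on one side
filled_sum = s1.add(s2, fill_value=0)   # a missing label counts as 0 instead

print(plain_sum)
print(filled_sum)
```

Whether 0 is a sensible stand-in depends on the data; for some data a missing value really should stay missing.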

### Data Frames

A data frame is a set of series laid out next to each other and sharing the same index. The data we imported from the .xlsx, .csv, and HTML sources were stored in data frames by pandas:

In [16]:
df = pd.read_csv('open_example_2.csv')

In [17]:
df

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
1,-2.534499,-2.016289,1.369435,-2.728555,4.401169
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
4,-3.828709,-3.282007,-1.973941,4.858205,-3.896155
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961
7,-0.834065,2.78516,1.662038,-0.12764,1.881524


The leftmost column is the index, and the top row holds the name of each series, so here we have series 'a', series 'b', series 'c', and so on, all next to each other.
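
To see the "series next to each other" idea directly, the sketch below builds a small data frame by hand from two series. The names and values are invented for illustration:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])

# Each dictionary key becomes a column name, each series becomes that column
small_df = pd.DataFrame({'a': a, 'b': b})
print(small_df)

# Pulling a column back out gives a series again
print(type(small_df['a']))
```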

## Cleaning Data by Modifying Data Frames

Pandas has lots of functionality to manipulate these data frames. We can remove rows or columns, rename the series, select only parts of the data frame, add data, and more.

### Shape

Let's first describe the 'shape' of the data frame. This is given by a tuple of 2 values: (number of rows, number of columns).

Look at the shape of the frame made from the csv file that we defined above:

In [13]:
df.shape

(8, 5)

### Access Data in the Frame

We may only want to grab certain parts of the frame at a time.

The syntax to do this may be strange at first, but with a little practice it becomes easy

To grab a column we can simply say:

In [14]:
df['a']



0    0.513900
1   -2.534499
2    3.653782
3    4.509373
4   -3.828709
5    0.942760
6    1.514906
7   -0.834065
Name: a, dtype: float64

This returns the column with series name 'a'. We can also get a new data frame (a set of series) by passing in a list of the column names we want:

In [15]:
df[['a','b','e']] 
#notice nested bracket -> inner bracket is a list of values we want

Unnamed: 0,a,b,e
0,0.5139,1.762665,-2.297997
1,-2.534499,-2.016289,4.401169
2,3.653782,-3.826753,1.774407
3,4.509373,1.352446,-1.102341
4,-3.828709,-3.282007,-3.896155
5,0.94276,3.586993,-0.139516
6,1.514906,1.760634,-1.361961
7,-0.834065,2.78516,1.881524


What about rows? The syntax is a little different. Let's grab row 3:

In [16]:
df.loc[3] #.loc[n] tells it to 'locate' a row with index n

a    4.509373
b    1.352446
c    4.019529
d   -2.125082
e   -1.102341
Name: 3, dtype: float64

Notice that this returns a series, but now the column names are the index. In effect, a data frame is a set of series in both the horizontal and vertical directions!

In [58]:
df.loc[[2,5,6]] #we can pass in a list of indexes too


Unnamed: 0,a,b,c,d,e
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961


If we want one piece of data, we can combine these commands together:

In [60]:
df.loc[2]['a'] #the same value is available in one step with df.loc[2,'a']

3.653781594387464


Going left to right in the command above, df.loc[2] returns a series, and ['a'] returns the value at index 'a' of that series.
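
The two-step lookup can also be written as a single `.loc` call that takes the row label first and the column name second. Here is a minimal sketch on a made-up frame called `demo`, so it runs on its own:

```python
import pandas as pd

demo = pd.DataFrame({'a': [0.5, -2.5, 3.7], 'b': [1.8, -2.0, -3.8]})

chained = demo.loc[2]['a']         # two steps: the row as a series, then one value
direct = demo.loc[2, 'a']          # one step: row label and column name together
block = demo.loc[[0, 2], ['b']]    # lists select several rows and columns at once

print(chained, direct)
print(block)
```

Both forms read the same value. The single-call form is generally the safer habit once you start assigning values, since pandas then knows exactly which cell of the original frame you mean.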

### Data Fetching Practice

In [19]:
df #see what the full data frame is

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
1,-2.534499,-2.016289,1.369435,-2.728555,4.401169
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
4,-3.828709,-3.282007,-1.973941,4.858205,-3.896155
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961
7,-0.834065,2.78516,1.662038,-0.12764,1.881524


1: Fetch the series 'e':

In [20]:
#your code here
df['e']

0   -2.297997
1    4.401169
2    1.774407
3   -1.102341
4   -3.896155
5   -0.139516
6   -1.361961
7    1.881524
Name: e, dtype: float64

2: Fetch Row 3,4, and 5

In [21]:
#your code here
df.loc[[3,4,5]]

Unnamed: 0,a,b,c,d,e
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
4,-3.828709,-3.282007,-1.973941,4.858205,-3.896155
5,0.94276,3.586993,3.890946,-2.21678,-0.139516


3: Fetch columns b, c, and e, but only rows 2 and 4:

In [22]:
#your code here
df.loc[[2,4],['b','c','e']] #rows first, then columns, in a single .loc call

Unnamed: 0,b,c,e
2,-3.826753,-2.015719,1.774407
4,-3.282007,-1.973941,-3.896155


## Retrieving Data Based on Conditions

Conditional statements can also be used to grab data:

In [23]:
df>0 #returns a boolean overlay of the data frame

Unnamed: 0,a,b,c,d,e
0,True,True,False,True,False
1,False,False,True,False,True
2,True,False,False,True,True
3,True,True,True,False,False
4,False,False,False,True,False
5,True,True,True,False,False
6,True,True,False,False,False
7,False,True,True,False,True


If we combine what's above with the original data frame, we can output only positive values (or values meeting any other condition we wish to look for):

In [24]:
df[df>0] #this is where the tricky syntax comes in, this reads as 'Data frame df where Data Frame df is greater than 0'

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,,1.555555,
1,,,1.369435,,4.401169
2,3.653782,,,3.517423,1.774407
3,4.509373,1.352446,4.019529,,
4,,,,4.858205,
5,0.94276,3.586993,3.890946,,
6,1.514906,1.760634,,,
7,,2.78516,1.662038,,1.881524


Anything that is not greater than zero is replaced with null (NaN).


Say a row is only valid if column 'a' is positive. Let's throw out every row where the value for 'a' is negative:

In [25]:
df[df['a']>0]

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961


Note how the indices in the leftmost column keep their original values (we will learn how to renumber them soon).

Try getting the rows of df where the value of 'b' is less than the value of 'd' for the same index:

In [26]:
#your code here
df[df['b']<df['d']]

Unnamed: 0,a,b,c,d,e
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
4,-3.828709,-3.282007,-1.973941,4.858205,-3.896155


If we want to use multiple conditions, we wrap each condition in parentheses, and rather than 'and' or 'or' we have to use '&' and '|' respectively. (Python's 'and' and 'or' keywords only work on single True/False values, not on entire data structures.)

Let's get the rows where the values of a are positive or the values of c are negative:

In [27]:
df[(df['a']>0) | (df['c']<0)]

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
4,-3.828709,-3.282007,-1.973941,4.858205,-3.896155
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961


It's important to note that when we do these operations, the data frame stored in memory is not altered. This is because pandas operations return a new object by default rather than working 'in-place'.

For example, say we define x=1 and y=2. Simply having a cell execute x+y doesn't change any variables, so it's not 'in-place'. But if we wanted to alter x we could say x = x+y. This has the same effect as an in-place change because it overwrites the data used for the operation.
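
Here is the same idea with a data frame instead of x and y, as a small sketch on a made-up frame named `demo`. The row counts show that filtering alone leaves the frame untouched, while assigning the result back to the same name keeps it:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1.0, -1.0, 2.0]})

demo[demo['a'] > 0]           # the result is shown or thrown away; demo is unchanged
rows_before = len(demo)

demo = demo[demo['a'] > 0]    # assigning back to the same name keeps the result
rows_after = len(demo)

print(rows_before, rows_after)
```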

So if we wanted to permanently remove any row with a negative 'a' value, we can do something similar:

In [61]:
df = df[df['a']>0]

In [62]:
df

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
2,3.653782,-3.826753,-2.015719,3.517423,1.774407
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961


We can also explicitly tell pandas to drop a row or column based on its index:

To remove a row, we can use the 'drop' function:

In [30]:
df.drop(2)

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
3,4.509373,1.352446,4.019529,-2.125082,-1.102341
5,0.94276,3.586993,3.890946,-2.21678,-0.139516
6,1.514906,1.760634,-1.780245,-2.069475,-1.361961


Note that the drop function is not in-place by default, but we can set it to be (here we also drop a column rather than a row, which is explained below):

In [18]:
df.drop('b',inplace = True, axis=1)


In [32]:
df

Unnamed: 0,a,c,d,e
0,0.5139,-2.654971,1.555555,-2.297997
2,3.653782,-2.015719,3.517423,1.774407
3,4.509373,4.019529,-2.125082,-1.102341
5,0.94276,3.890946,-2.21678,-0.139516
6,1.514906,-1.780245,-2.069475,-1.361961


To drop a column, we have to specify that we are looking in the column labels. We do this by setting the 'axis' argument of the drop function to 1. By default it is 0, which means drop a row.
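
The sketch below shows both axis settings on a small made-up frame (`demo` is an invented name). It also shows the `columns=` keyword, which is another way to ask for the same thing as axis=1. None of these calls use inplace=True, so `demo` itself is left alone:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

no_row_1 = demo.drop(1)                     # axis=0 is the default: drop the row labelled 1
no_col_b = demo.drop('b', axis=1)           # axis=1: drop the column named 'b'
no_col_bc = demo.drop(columns=['b', 'c'])   # keyword form, equivalent to axis=1

print(no_row_1)
print(no_col_b)
print(no_col_bc)
```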

To drop column 'a':

In [33]:
df.drop('a',inplace=True,axis=1)

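SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

This warning appears because df was created by filtering another data frame. The column is still dropped, as the next cell shows. Creating the filtered frame with df = df[df['a']>0].copy() avoids the warning.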


In [34]:
df

Unnamed: 0,c,d,e
0,-2.654971,1.555555,-2.297997
2,-2.015719,3.517423,1.774407
3,4.019529,-2.125082,-1.102341
5,3.890946,-2.21678,-0.139516
6,-1.780245,-2.069475,-1.361961


We can pass in single columns/rows or a list of them, just like when we search for certain columns/rows

Try dropping columns c and e (not in-place, as we want the data to work on later):

In [35]:
#your code here
df.drop(['c','e'],axis=1) #no inplace=True, so df itself is unchanged

Unnamed: 0,d
0,1.555555
2,3.517423
3,-2.125082
5,-2.21678
6,-2.069475


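Notice that the index on the left still shows the original row labels. To renumber the rows from 0, we use reset_index. (The outputs from here on use the filtered data frame with all five columns, a through e; re-run the read_csv cell and df = df[df['a']>0] to restore it.)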

In [67]:
df.reset_index(drop=True,inplace = True)
#'drop = True' prevents pandas from simply tacking on a second index column

In [68]:
df

Unnamed: 0,a,b,c,d,e
0,0.5139,1.762665,-2.654971,1.555555,-2.297997
1,3.653782,-3.826753,-2.015719,3.517423,1.774407
2,4.509373,1.352446,4.019529,-2.125082,-1.102341
3,0.94276,3.586993,3.890946,-2.21678,-0.139516
4,1.514906,1.760634,-1.780245,-2.069475,-1.361961
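
To see what 'drop=True' changes, the sketch below runs reset_index both ways on a small made-up frame (`demo` is an invented name) and returns new frames rather than working in-place:

```python
import pandas as pd

# A frame whose row labels have gaps, as if rows 1 had been filtered out
demo = pd.DataFrame({'a': [0.5, 3.7, 4.5]}, index=[0, 2, 3])

renumbered = demo.reset_index(drop=True)   # old labels are discarded
kept = demo.reset_index()                  # old labels become a column named 'index'

print(renumbered)
print(kept)
```

Keeping the old labels can be useful when they still mean something, for example when they point back to rows in the original file.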


We can add new columns to a data frame as well:

In [72]:
names = [4,6,5,6,7] #create a list with one value for each row
df['names']=names
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,a,b,c,d,e,names
0,0.5139,1.762665,-2.654971,1.555555,-2.297997,4
1,3.653782,-3.826753,-2.015719,3.517423,1.774407,6
2,4.509373,1.352446,4.019529,-2.125082,-1.102341,5
3,0.94276,3.586993,3.890946,-2.21678,-0.139516,6
4,1.514906,1.760634,-1.780245,-2.069475,-1.361961,7



We can set this new column to be the index:

In [73]:
df.set_index('names',inplace = True)

Let's make a new column from data we already have. Say column 'f' is just b + c:

In [74]:
df['f'] = df['b'] + df['c']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
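
The two steps above, setting an index and adding a calculated column, can be tried on a small made-up frame. In this sketch `demo` and its values are invented so the result is easy to check by hand:

```python
import pandas as pd

demo = pd.DataFrame({'b': [1.0, 2.0, 3.0],
                     'c': [0.5, 0.5, 0.5],
                     'names': [4, 6, 5]})

demo = demo.set_index('names')       # 'names' stops being a column and becomes the index
demo['f'] = demo['b'] + demo['c']    # element-wise sum, matched row by row

print(demo)
```

Because `demo` is built directly rather than filtered from another frame, no copy warning appears here.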


## Missing Data

When data is missing from a table, it can interfere with mathematical analysis. We can remove the missing entries entirely, or set them to values that have no, or at least minimal, impact on the accuracy of the results.

First let's get a new data frame:

In [3]:
missingdf = pd.read_excel('missing_data_example.xlsx')

In [4]:
missingdf

Unnamed: 0,A,B,C,D
0,5.0,3.0,,3.0
1,2.0,,,
2,,4.0,5.0,
3,,,5.0,6.0
4,8.0,7.0,4.0,3.0


Drop any series with missing data using 'dropna'. The axis argument chooses whether to drop rows (axis = 0) or columns (axis = 1):

In [5]:
missingdf.dropna(axis = 0)

Unnamed: 0,A,B,C,D
4,8.0,7.0,4.0,3.0


In [64]:
missingdf.dropna(axis = 1)

0
1
2
3
4


Now all our data is gone because every column had missing data! We can loosen the rule with the 'thresh' argument. The number assigned to it says how many valid data points a series needs in order to be kept:

In [43]:
missingdf.dropna(axis = 0, thresh=3) #drop rows that have fewer than 3 non-null values

Unnamed: 0,A,B,C,D
0,5.0,3.0,,3.0
4,8.0,7.0,4.0,3.0
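
To check which rows pass a given threshold before dropping anything, we can count the non-null values in each row. The sketch below types in the same values as the missing-data table above by hand (the name `missing` is just a stand-in for missingdf, so no Excel file is needed):

```python
import numpy as np
import pandas as pd

# Same values as the missing-data table, entered by hand
missing = pd.DataFrame({
    'A': [5, 2, np.nan, np.nan, 8],
    'B': [3, np.nan, 4, np.nan, 7],
    'C': [np.nan, np.nan, 5, 5, 4],
    'D': [3, np.nan, np.nan, 6, 3],
})

valid_per_row = missing.notna().sum(axis=1)   # number of non-null values in each row
kept = missing.dropna(axis=0, thresh=3)       # rows with at least 3 non-null values

print(valid_per_row)
print(kept)
```

Only rows 0 and 4 have three or more valid values, which matches the output of the cell above.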


Instead of removing it entirely, let's change null values to something else we can work with. The 'fillna()' function can be used to assign any value to the nulls.

In [199]:
missingdf.fillna(axis=1,value='Fill Value')

Unnamed: 0,A,B,C,D
0,5,3,Fill Value,3
1,2,Fill Value,Fill Value,Fill Value
2,Fill Value,4,5,Fill Value
3,Fill Value,Fill Value,5,6
4,8,7,4,3


We still can't do math on strings, so instead let's set the missing values in the data frame to a number, like 0:

In [201]:
missingdf.fillna(value=0)

Unnamed: 0,A,B,C,D
0,5.0,3.0,0.0,3.0
1,2.0,0.0,0.0,0.0
2,0.0,4.0,5.0,0.0
3,0.0,0.0,5.0,6.0
4,8.0,7.0,4.0,3.0


Or for just one column:

In [202]:
missingdf['A'] = missingdf['A'].fillna(value=0) #assigning the result back to missingdf['A'] makes the change permanent

In [203]:
missingdf

Unnamed: 0,A,B,C,D
0,5.0,3.0,,3.0
1,2.0,,,
2,0.0,4.0,5.0,
3,0.0,,5.0,6.0
4,8.0,7.0,4.0,3.0


Try replacing the missing values in just row 1 with any number of your choice.

In [206]:
#your code here, hint: assign the result back so the change is kept
missingdf.loc[1] = missingdf.loc[1].fillna(value=1)

In [209]:
missingdf

Unnamed: 0,A,B,C,D
0,5.0,3.0,,3.0
1,2.0,1.0,1.0,1.0
2,0.0,4.0,5.0,
3,0.0,,5.0,6.0
4,8.0,7.0,4.0,3.0


You could fill with the result of many different functions in place of 0, such as mean(), sum(), or count() (which gives the number of non-null elements).

It just depends on what would make sense for your data.
We will go over many of these functions in the 'Hands on Data Analysis' Lab.
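
As one possible example, the sketch below fills each missing value with the mean of its own column. The frame `demo` and its values are made up for illustration, and whether a mean is an appropriate fill is a judgment call for your own data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [5.0, 2.0, np.nan, 8.0],
                     'B': [3.0, np.nan, 4.0, 7.0]})

col_means = demo.mean()            # one mean per column, nulls are skipped
filled = demo.fillna(col_means)    # each null gets the mean of its own column

print(col_means)
print(filled)
```

Passing a series to fillna matches its index labels ('A', 'B') to the column names, so each column can receive a different fill value.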

## Combining Data Frames

If you are pulling data from many sources, it's natural to first import each into its own data frame. Once we have them in structures we can work with in Python, we can stitch them together based on the information they contain.

First we'll set up two data frames. What if their rows don't come in the same order?

In [12]:
df4 = pd.read_excel('Book 1 (1).xlsx')
df5 = pd.read_excel('Book (1).xlsx')

In [13]:
df4

Unnamed: 0,A,B,C,D
0,Kia,m,n,o
1,John,p,q,r


In [14]:
df5 #shares 'Kia' with df4, but in a different row; 'John' and 'Sal' each appear in only one frame

Unnamed: 0,A,B,C,D
0,Sal,d,e,f
1,Kia,g,h,i


Merge df4 and df5 on 'A' to see what happens:

In [15]:
mergeddf = pd.merge(df4,df5,on='A')

In [16]:
mergeddf

Unnamed: 0,A,B_x,C_x,D_x,B_y,C_y,D_y
0,Kia,m,n,o,g,h,i


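Don't worry, the developers had this in mind when making pandas. merge matches rows by their value in column 'A' rather than by position, so Kia's data from both frames ends up on one row, with '_x' and '_y' added to the column names that clash. By default only values of 'A' found in both frames are kept, which is why 'John' and 'Sal' are missing from the result.

But always consider cases like this, and research the packages you use so you know whether it's an issue you have to account for.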
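
If you would rather keep every person, including those found in only one frame, the `how` argument of merge controls that. The sketch below rebuilds cut-down versions of the two frames by hand (`left` and `right` are invented names, with only columns A and B) and compares the default with how='outer':

```python
import pandas as pd

left = pd.DataFrame({'A': ['Kia', 'John'], 'B': ['m', 'p']})
right = pd.DataFrame({'A': ['Sal', 'Kia'], 'B': ['d', 'g']})

inner = pd.merge(left, right, on='A')                # default: only keys found in both
outer = pd.merge(left, right, on='A', how='outer')   # every key, gaps filled with NaN

print(inner)
print(outer)
```

The outer result has one row per person, with NaN in the columns that came from the frame where that person was missing.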

## Saving Back to Excel or CSV

*Saving data to files like .csv or .xlsx is a great way to store data on your computer for long term use, because once you close the kernel, all your variables are erased and would have to be re-calculated!*

Saving back to an excel file or CSV is just as easy as loading them:

**Important Notes:**

The files are saved in the same directory as the Jupyter notebook file.

If a file with the specified name already exists, it will be overwritten.

If no file with the specified name exists, it will be created and added to the folder.

In [82]:
#save to excel
mergeddf.to_excel('Merged_frames.xlsx',sheet_name='sheet1')
# Excel 'workbooks' have sheets; the optional sheet_name argument says which sheet to write to

In [83]:
#save to CSV
mergeddf.to_csv('Merged_frames.csv')

Now you can go look at the files you've made!
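
One thing to watch for when you load a saved CSV again: to_csv writes the index as an unnamed first column, and read_csv then brings it back as a column called 'Unnamed: 0'. The sketch below shows this round trip using an in-memory buffer in place of a real file (`demo` and `buffer` are invented names for illustration):

```python
import io
import pandas as pd

demo = pd.DataFrame({'A': ['Kia'], 'B_x': ['m'], 'B_y': ['g']})

buffer = io.StringIO()     # in-memory stand-in for a file on disk
demo.to_csv(buffer)        # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)               # index comes back as 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)    # treat the first column as the index

print(with_extra.columns.tolist())
print(restored.columns.tolist())
```

Alternatively, passing index=False to to_csv leaves the index out of the file altogether.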