<h1> Tidy </h1>

<h2> Read Data In </h2>

This section reads the Parquet files stored on disk into DataFrames so that they can be tidied.

In [1]:
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import numpy as np
import dask.dataframe as dd

These are the strings representing the filepaths of the Parquet files.

In [2]:
filenamekillings = "/data/skariyadan/daskkillings.parquet"
filenameincome = "/data/skariyadan/income.parquet"
filenamebelowpoverty = "/data/skariyadan/belowpoverty.parquet"
filenameeducation = "/data/skariyadan/education.parquet"
filenameracestats = "/data/skariyadan/racestats.parquet"

The killings DataFrame:

In [3]:
killings = df = dd.read_parquet(filenamekillings)

In [4]:
killings.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False


The income DataFrame:

In [5]:
income = pq.read_table(filenameincome).to_pandas()

In [6]:
income.head()

Unnamed: 0,Geographic Area,City,Median Income
0,AL,Abanda CDP,11207
1,AL,Abbeville city,25615
2,AL,Adamsville city,42575
3,AL,Addison town,37083
4,AL,Akron town,21667


The belowpoverty DataFrame:

In [7]:
belowpoverty = pq.read_table(filenamebelowpoverty).to_pandas()

In [8]:
belowpoverty.head()

Unnamed: 0,Geographic Area,City,poverty_rate
0,AL,Abanda CDP,78.8
1,AL,Abbeville city,29.1
2,AL,Adamsville city,25.5
3,AL,Addison town,30.7
4,AL,Akron town,42.0


The education DataFrame:

In [9]:
education = pq.read_table(filenameeducation).to_pandas()

In [10]:
education.head()

Unnamed: 0,Geographic Area,City,percent_completed_hs
0,AL,Abanda CDP,21.2
1,AL,Abbeville city,69.1
2,AL,Adamsville city,78.9
3,AL,Addison town,81.4
4,AL,Akron town,68.6


The racestats DataFrame:

In [11]:
racestats = pq.read_table(filenameracestats).to_pandas()

In [12]:
racestats.head()

Unnamed: 0,Geographic area,City,share_white,share_black,share_native_american,share_asian,share_hispanic
0,AL,Abanda CDP,67.2,30.2,0.0,0.0,1.6
1,AL,Abbeville city,54.4,41.4,0.1,1.0,3.1
2,AL,Adamsville city,52.3,44.9,0.5,0.3,2.3
3,AL,Addison town,99.1,0.1,0.0,0.1,0.4
4,AL,Akron town,13.2,86.5,0.0,0.0,0.3


<h2> Tidying </h2>

Now that the DataFrames have been read into the notebook, Tidying can begin. 
Tidying will consist of several steps:  
-  Removing Duplicates
-  Dealing with Missing Values
-  Replacing Values and Transforming Data
-  Renaming Columns
-  Modifying Column Data Across Tables to Ensure Linkage
-  Modifying Data Types

<h3> Removing Duplicates </h3>

It is possible for DataFrames to contain duplicated rows, which must be removed for the data to be tidy. In this section, each DataFrame is checked for duplicates, and any that are found are dealt with accordingly.

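killings is a Dask DataFrame, so drop_duplicates() can be called on it directly. Because Dask evaluates lazily, the call below returns a new DataFrame and displays only its structure; killings itself is left unchanged unless the result is assigned back to it.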

In [13]:
killings.drop_duplicates()

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...



Next, check the income DataFrame for duplicated data.

In [14]:
income[income.duplicated()]

Unnamed: 0,Geographic Area,City,Median Income


income does not have any duplicated data. 

Next, check the belowpoverty DataFrame for duplicated data.

In [15]:
belowpoverty[belowpoverty.duplicated()]

Unnamed: 0,Geographic Area,City,poverty_rate


belowpoverty does not have any duplicated data.

Next, check the education DataFrame for duplicated data. 

In [16]:
education[education.duplicated()]

Unnamed: 0,Geographic Area,City,percent_completed_hs


education does not have any duplicated data.

Next, check racestats DataFrame for duplicated data.

In [17]:
racestats[racestats.duplicated()]

Unnamed: 0,Geographic area,City,share_white,share_black,share_native_american,share_asian,share_hispanic


racestats does not have any duplicated data either. 

None of the 5 DataFrames contained duplicated rows. If any had, the duplicates could have been removed with a simple call to the .drop_duplicates() function. 
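
As a minimal sketch of what such a call would look like, the following uses a small hypothetical pandas frame. The rows and the names `toy`, `n_duplicates`, and `deduped` are illustrative assumptions and are not part of this project's data.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second.
toy = pd.DataFrame({
    "Geographic Area": ["AL", "AL", "AL"],
    "City": ["City A", "City B", "City B"],
    "Median Income": ["11207", "21667", "21667"],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame, so the result has to be assigned.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Note that `toy` itself is not modified; only `deduped` has the repeated row removed.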

<h3> Dealing With Missing Values </h3>

Missing values and NaNs need to be handled appropriately. Different datasets represent missing values in different ways, so each DataFrame needs to be checked for missing values, which are then handled accordingly. 

The following will check the <b>killings</b> DataFrame's columns for null values, and will handle them accordingly. (The list of variables and their types can be found in 02-Import).

Checking the killings name column.

In [18]:
killings[killings.name.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The name column does not appear to have any null values.

Checking the killings date column.

In [19]:
killings[killings.date.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The date column does not appear to have any null values.

Checking the killings manner_of_death column.

In [20]:
killings[killings.manner_of_death.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The manner_of_death column does not appear to have any null values.

Checking the killings armed column.

In [21]:
killings[killings.armed.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The armed column appears to have several values which are None. Since armed is a categorical variable, we will fill the missing armed entries with the string "unknown". We will deal with having to replace any values later if there are multiple representations of "unknown" in the column. 

In [22]:
killings.armed = killings.armed.fillna("unknown")

Now that the missing values in the armed column have been filled with the string "unknown", we are going back to check the column to ensure that all the missing values indeed have been taken care of.

In [23]:
killings[killings.armed.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The armed column has been successfully dealt with.

Checking the killings age column.

In [24]:
killings[killings.age.isnull()].head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
124,584,Alejandro Salazar,20/02/15,shot,gun,,M,H,Houston,TX,False,attack,Car,False
658,789,Roger Albrecht,29/08/15,shot,knife,,M,W,San Antonio,TX,False,other,Not fleeing,False
707,839,Lawrence Price,17/09/15,shot,gun,,M,W,Brodhead,KY,False,attack,Not fleeing,False
769,908,Jason Day,12/10/15,shot,gun,,M,B,Lawton,OK,False,attack,Not fleeing,False
802,1283,John Tozzi,24/10/15,shot,gun,,M,,New Paltz,NY,False,attack,Not fleeing,False


The age column appears to have several values which are NaN. Age is a numerical column, so there are several ways to deal with this issue: the rows with null ages could be dropped, or the NaN values could be filled. Although age is an important factor, this project primarily examines the relationship between police shootings and race, income, and education, so the missing values in the age column are simply filled with the mean age.

In [25]:
killings.age = killings.age.fillna(killings.age.mean())

Now that the missing values in the age column have been filled with the mean of the age column, we check for null values again to ensure that all the missing values have indeed been filled.

In [26]:
killings[killings.age.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The age column has been successfully dealt with.
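
A minimal sketch of mean imputation on a hypothetical toy Series follows. The values and the names `ages`, `mean_age`, and `filled` are assumptions for illustration only; plain pandas is used here so the result is evaluated immediately, whereas the Dask version above stays lazy until it is computed.

```python
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([53.0, 47.0, None, 23.0, None], name="age")

# mean() skips NaN by default, so only the three known ages are averaged.
mean_age = ages.mean()

# Every NaN is replaced by that single value.
filled = ages.fillna(mean_age)
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is worth keeping in mind if age is analyzed later.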

Checking the killings gender column.

In [27]:
killings[killings.gender.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The gender column does not appear to have any missing values.

Checking the killings race column.

In [28]:
killings[killings.race.isnull()].head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
59,110,William Campbell,25/01/15,shot,gun,59.0,M,,Winslow,NJ,False,attack,Not fleeing,False
241,244,John Marcell Allen,30/03/15,shot,gun,54.0,M,,Boulder City,NV,False,attack,Not fleeing,False
266,534,Mark Smith,09/04/15,shot and Tasered,vehicle,54.0,M,,Kellyville,OK,False,attack,Other,False
340,433,Joseph Roy,07/05/15,shot,knife,72.0,M,,Lawrenceville,GA,True,other,Not fleeing,False
398,503,James Anthony Morris,31/05/15,shot,gun,40.0,M,,Medford,OR,True,attack,Not fleeing,False


The race column appears to have several values which are None. Race is a categorical column; however, given how important race statistics are to this dataset, it is best to simply drop all the entries which don't have race specified. 

In [29]:
killings = killings[killings.race.isnull() == False]

Now that the rows with missing values in the race column have been dropped, we go back and check for null values again to ensure that all the rows with missing entries in the race column have indeed been dropped.

In [30]:
killings[killings.race.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The race column has been successfully dealt with. Note that after dropping the rows with missing entries in the race column, the length of the killings dataset is now 2340.

Checking the killings city column.

In [31]:
killings[killings.city.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The city column does not appear to have any missing values.

Checking the killings state column.

In [32]:
killings[killings.state.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The state column does not appear to have any missing values.

Checking the killings signs_of_mental_illness column.

In [33]:
killings[killings.signs_of_mental_illness.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The signs_of_mental_illness column does not appear to have any missing values.

Checking the killings threat_level column.

In [34]:
killings[killings.threat_level.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The threat_level column does not appear to have any missing values.

Checking the killings flee column.

In [35]:
killings[killings.flee.isnull()].head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
857,1007,Ernesto Gamino,13/11/15,shot,undetermined,25.0,M,H,Jurupa Valley,CA,False,undetermined,,False
874,1020,Randy Allen Smith,19/11/15,shot,gun,34.0,M,B,Manatee,FL,False,attack,,False
898,1042,Zachary Grigsby,29/11/15,shot,gun,29.0,M,W,Lincoln,NE,False,attack,,False
935,1083,Roy Carreon,12/12/15,shot,knife,49.0,M,H,San Bernardino,CA,False,attack,,False
946,1093,Hector Alvarez,14/12/15,shot,undetermined,19.0,M,H,Gilroy,CA,False,undetermined,,True


The flee column appears to have several values which are None. Since flee is a categorical variable, we will fill the missing flee entries with the string "unknown". We will deal with having to replace any values later if there are multiple representations of "unknown" in the column. 

In [36]:
killings.flee = killings.flee.fillna("unknown")

Now that the missing values in the flee column have been filled with the string "unknown", we are going back to check the column to ensure that all the missing values indeed have been taken care of.

In [37]:
killings[killings.flee.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The flee column has been successfully dealt with.

Checking the killings body_camera column.

In [38]:
killings[killings.body_camera.isnull()]

Unnamed: 0_level_0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,int64,object,object,object,object,float64,object,object,object,object,bool,object,object,bool
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


The body_camera column does not appear to have any missing values.
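
The column-by-column checks above can also be summarized in a single step by counting nulls per column. The sketch below uses a hypothetical toy pandas frame; its rows and the names `toy`, `null_counts`, and `columns_with_nulls` are illustrative assumptions. On the Dask killings frame the equivalent would be `killings.isnull().sum().compute()`, where `.compute()` triggers the actual evaluation.

```python
import pandas as pd

# Hypothetical toy rows with a few missing entries.
toy = pd.DataFrame({
    "name": ["person a", "person b", "person c"],
    "age": [53.0, None, None],
    "race": ["A", None, "B"],
    "flee": ["Not fleeing", "Not fleeing", None],
})

# isnull() marks each missing cell; sum() counts them column by column.
null_counts = toy.isnull().sum()

# Keep only the names of the columns that need attention.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
```

A count of zero for a column is direct evidence that it has no missing values, which a lazily displayed Dask structure does not show on its own.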

The following will check the <b>income</b> DataFrame's columns for null values, and will handle them accordingly. (The list of variables and their types can be found in 02-Import).

Checking the income Geographic Area column.

In [39]:
income[income.iloc[:,0].isnull()]

Unnamed: 0,Geographic Area,City,Median Income


The Geographic Area column does not appear to have any missing values. 

Checking the income City column.

In [40]:
income[income.iloc[:,1].isnull()]

Unnamed: 0,Geographic Area,City,Median Income


The City column does not appear to have any missing values.

Checking the income Median Income column.

In [41]:
income[income.iloc[:,2].isnull()].head()

Unnamed: 0,Geographic Area,City,Median Income
29119,WY,Albany CDP,
29121,WY,Alcova CDP,
29123,WY,Alpine Northeast CDP,
29126,WY,Antelope Hills CDP,
29129,WY,Arlington CDP,


The Median Income column appears to have several values which are None. Since the entire table is based on income, it would not make any sense to keep any rows with missing incomes, therefore the rows with missing incomes will be dropped.

In [42]:
income = income.dropna()

In addition, missing values are indicated with a "(X)" or a "-", so we will drop those as well. Values of "2,500-" and "250,000+" indicate incomes less than 2,500 and greater than 250,000 respectively; they are replaced with the boundary values 2500 and 250000.

In [43]:
income = income[income.iloc[:,2] != "(X)"]
income = income[income.iloc[:,2] != "-"]
income.iloc[:,2] = income.iloc[:,2].replace("2,500-",2500)
income.iloc[:,2] = income.iloc[:,2].replace("250,000+",250000)

Now that the missing values in the Median Income column have been dropped, we are going to go back and check for null values again to ensure that all the null values have been successfully taken care of.

In [44]:
income[income.iloc[:,2].isnull()]

Unnamed: 0,Geographic Area,City,Median Income


The Median Income column has been successfully taken care of.
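
An alternative to filtering each placeholder string separately is to convert the column to numbers and let anything that cannot be parsed become NaN. The sketch below assumes a hypothetical toy frame (the rows and the names `toy`, `capped`, and `cleaned` are illustrative) and is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy rows covering each kind of value described above.
toy = pd.DataFrame({
    "City": ["A town", "B city", "C CDP", "D CDP", "E town", "F city"],
    "Median Income": ["11207", "(X)", "-", "2,500-", "250,000+", None],
})

# Map the two open-ended codes to their boundary values first.
capped = toy["Median Income"].replace({"2,500-": "2500", "250,000+": "250000"})

# errors="coerce" turns anything non-numeric ("(X)", "-", None) into NaN.
toy["Median Income"] = pd.to_numeric(capped, errors="coerce")

# A single dropna() then removes every kind of missing value at once.
cleaned = toy.dropna(subset=["Median Income"]).reset_index(drop=True)
```

This variant also leaves the column with a numeric dtype, at the cost of silently discarding any unexpected string rather than surfacing it.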

The following will check the <b>belowpoverty</b> DataFrame's columns for null values, and will handle them accordingly. (The list of variables and their types can be found in 02-Import).

Checking the belowpoverty Geographic Area column.

In [45]:
belowpoverty[belowpoverty.iloc[:,0].isnull()]

Unnamed: 0,Geographic Area,City,poverty_rate


The Geographic Area column does not appear to have any missing values.

Checking the belowpoverty City column.

In [46]:
belowpoverty[belowpoverty.iloc[:,1].isnull()]

Unnamed: 0,Geographic Area,City,poverty_rate


The City column does not appear to have any missing values.

Checking the belowpoverty poverty_rate column.

In [47]:
belowpoverty[belowpoverty.iloc[:,2].isnull()]

Unnamed: 0,Geographic Area,City,poverty_rate


The poverty_rate column does not appear to have any None/NaN values. However, missing values are also encoded as the string "-", so rows with this value must be dropped.

In [48]:
belowpoverty = belowpoverty[belowpoverty.iloc[:,2] != "-"]

The following will check the <b>education</b> DataFrame's columns for null values, and will handle them accordingly. (The list of variables and their types can be found in 02-Import).

Checking the education Geographic Area column.

In [49]:
education[education.iloc[:,0].isnull()]

Unnamed: 0,Geographic Area,City,percent_completed_hs


The Geographic Area column does not appear to have any missing values.

Checking the education City column.

In [50]:
education[education.iloc[:,1].isnull()]

Unnamed: 0,Geographic Area,City,percent_completed_hs


The City column does not appear to have any missing values.

Checking education percent_completed_hs column.

In [51]:
education[education.iloc[:,2].isnull()]

Unnamed: 0,Geographic Area,City,percent_completed_hs


The percent_completed_hs column does not appear to have any None/NaN values. However, missing values are also encoded as the string "-", so rows with this value must be dropped.

In [52]:
education = education[education.iloc[:,2] != "-"]

The following will check the <b>racestats</b> DataFrame's columns for null values, and will handle them accordingly. (The list of variables and their types can be found in 02-Import).

Checking the racestats Geographic area column.

In [53]:
racestats[racestats.iloc[:,0].isnull()]

Unnamed: 0,Geographic area,City,share_white,share_black,share_native_american,share_asian,share_hispanic


The Geographic area column does not appear to have any missing values.

Checking the racestats City column.

In [54]:
racestats[racestats.iloc[:,1].isnull()]

Unnamed: 0,Geographic area,City,share_white,share_black,share_native_american,share_asian,share_hispanic


The City column does not appear to have any missing values. 

Checking the share_white, share_black, share_native_american, share_asian, and share_hispanic columns. Since this table depends entirely on this data, it makes sense to drop all rows with null values in any of these columns.

In [55]:
racestats = racestats.dropna()

The None/NaN values in the share_white, share_black, share_native_american, share_asian, and share_hispanic columns have been dealt with. However, missing values are also encoded as the string "(X)", so rows with this value in any of these columns must be dropped.

In [56]:
racestats = racestats[racestats.iloc[:,2] != "(X)"]
racestats = racestats[racestats.iloc[:,3] != "(X)"]
racestats = racestats[racestats.iloc[:,4] != "(X)"]
racestats = racestats[racestats.iloc[:,5] != "(X)"]
racestats = racestats[racestats.iloc[:,6] != "(X)"]
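
The five filters above can also be expressed as one row-wise test across all share columns. The following is a sketch on a hypothetical toy frame; the rows and the names `toy`, `share_cols`, `has_placeholder`, and `kept` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows; "(X)" marks a missing value.
toy = pd.DataFrame({
    "City": ["A town", "B city", "C CDP"],
    "share_white": ["67.2", "(X)", "52.3"],
    "share_black": ["30.2", "41.4", "(X)"],
})

# Select the share columns by name instead of by position.
share_cols = [c for c in toy.columns if c.startswith("share_")]

# True for every row that has "(X)" in at least one share column.
has_placeholder = (toy[share_cols] == "(X)").any(axis=1)

kept = toy[~has_placeholder]
```

Selecting columns by name keeps the filter correct even if the column order changes.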

<h3> Replacing Values and Transforming Data</h3>

A categorical variable will sometimes represent a single category in several different ways. Replacement is then needed to convert these representations to one consistent form. In addition, where necessary, the data will be transformed to more meaningful values using a dictionary for mapping.

The following will check the killings DataFrame's columns with categorical variables for inconsistent value representation. (The list of variables and their types can be found in 02-Import).

Checking killings manner_of_death column.

In [57]:
killings.manner_of_death.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: manner_of_death, dtype: object
Dask Name: unique-agg, 22 tasks

manner_of_death seems to have consistent value representation. 

Now, checking the killings armed column.

In [58]:
killings.armed.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: armed, dtype: object
Dask Name: unique-agg, 22 tasks

We can see that there are several different representations of the same or similar values. 

"undetermined", "unknown", "unknown weapon" are different categorical strings representing the same value. We can put them all under "unknown".

We can also consolidate items of similar type under one overarching categorical string to have a more easy-to-read set of unique values.

"gun", "guns and explosives", "gun and knife", "machete and gun", "hatchet and gun" will be under "gun". Note, "bean-bag gun" will not be put into this category since bean-bag guns are nonlethal. 

"knife", "machete", "sword", "box cutter", "meat cleaver", "straight edge razor", "bayonet", "pole and knife", "lawn mower blade" will be under "knife".

"vehicle", "motorcycle" will be under "vehicle". 

"hatchet", "ax" will be under "hatchet".

"metal object", "flagpole", "metal pole", "metal pipe", "metal hand tool", "metal stick" will be under "blunt metal object".

"blunt object", "baseball bat and fireplace poker", "brick", "baseball bat", "baseball bat and bottle", "pole", "rock", "piece of wood", "pipe", "baton", "oar", "hammer" will be under "blunt object".

Transform all the strings to lowercase, just for consistency.

In [59]:
killings.armed = killings.armed.str.lower()

Replace the words for "unknown".

In [60]:
killings.armed = killings.armed.mask(df.armed=="undetermined","unknown")
killings.armed = killings.armed.mask(df.armed=="unknown weapon", "unknown")

Replace the words for "gun".

In [61]:
killings.armed = killings.armed.mask(df.armed=="guns and explosives","gun")
killings.armed = killings.armed.mask(df.armed=="gun and knife","gun")
killings.armed = killings.armed.mask(df.armed=="machete and gun","gun")
killings.armed = killings.armed.mask(df.armed=="hatchet and gun","gun")

Replace the words for "knife".

In [62]:
killings.armed = killings.armed.mask(df.armed=="machete","knife")
killings.armed = killings.armed.mask(df.armed=="sword","knife")
killings.armed = killings.armed.mask(df.armed=="box cutter","knife")
killings.armed = killings.armed.mask(df.armed=="meat cleaver","knife")
killings.armed = killings.armed.mask(df.armed=="straight edge razor","knife")
killings.armed = killings.armed.mask(df.armed=="bayonet","knife")
killings.armed = killings.armed.mask(df.armed=="pole and knife","knife")
killings.armed = killings.armed.mask(df.armed=="lawn mower blade","knife")

Replace the words for "vehicle".

In [63]:
killings.armed = killings.armed.mask(df.armed=="motorcycle","vehicle")

Replace the words for "hatchet".

In [64]:
killings.armed = killings.armed.mask(df.armed=="ax","hatchet")

Replace the words for "blunt metal object".

In [65]:
killings.armed = killings.armed.mask(df.armed=="metal object","blunt metal object")
killings.armed = killings.armed.mask(df.armed=="flagpole","blunt metal object")   
killings.armed = killings.armed.mask(df.armed=="metal pole","blunt metal object")
killings.armed = killings.armed.mask(df.armed=="metal pipe","blunt metal object")
killings.armed = killings.armed.mask(df.armed=="metal hand tool","blunt metal object")
killings.armed = killings.armed.mask(df.armed=="metal stick","blunt metal object")

Replace the words for "blunt object".

In [66]:
killings.armed = killings.armed.mask(df.armed=="baseball bat and fireplace poker","blunt object")
killings.armed = killings.armed.mask(df.armed=="brick","blunt object")
killings.armed = killings.armed.mask(df.armed=="baseball bat","blunt object")
killings.armed = killings.armed.mask(df.armed=="pole","blunt object")
killings.armed = killings.armed.mask(df.armed=="rock","blunt object")
killings.armed = killings.armed.mask(df.armed=="piece of wood","blunt object")
killings.armed = killings.armed.mask(df.armed=="pipe","blunt object")
killings.armed = killings.armed.mask(df.armed=="baton","blunt object")
killings.armed = killings.armed.mask(df.armed=="oar","blunt object")
killings.armed = killings.armed.mask(df.armed=="baseball bat and bottle","blunt object")
killings.armed = killings.armed.mask(df.armed=="hammer","blunt object")

Now that we have replaced the categorical strings where multiple strings represent a single value, we go back and check to ensure the values have consistent representation.

In [67]:
killings.armed.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: armed, dtype: object
Dask Name: unique-agg, 158 tasks

The armed column has been successfully dealt with.
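
The groupings listed above could also be written as one dictionary and applied in a single replacement. The sketch below shows the idea on a hypothetical toy pandas Series with only a subset of the groups; the names `groups`, `lookup`, `armed`, and `consolidated` are assumptions, and this is not how the notebook performs the replacement.

```python
import pandas as pd

# Group label -> raw values folded into it (a subset of the lists above).
groups = {
    "unknown": ["undetermined", "unknown weapon"],
    "gun": ["guns and explosives", "gun and knife", "machete and gun"],
    "knife": ["machete", "sword", "box cutter"],
    "vehicle": ["motorcycle"],
    "hatchet": ["ax"],
}

# Invert the groups into a flat raw-value -> label lookup.
lookup = {raw: label for label, raws in groups.items() for raw in raws}

# Hypothetical raw values, including one with inconsistent capitalization.
armed = pd.Series(["Gun", "machete", "ax", "unknown weapon", "toy weapon", "gun and knife"])

# Lowercase first, then replace whole values; unlisted values pass through.
consolidated = armed.str.lower().replace(lookup)
```

Keeping the groups in one dictionary makes it easier to compare the code against the prose lists and to spot a value that appears in one but not the other.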

Checking the killings gender column.



In [68]:
killings.gender.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: gender, dtype: object
Dask Name: unique-agg, 158 tasks

The gender column seems to have consistent representation.

Checking the killings race column.

In [69]:
killings.race.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: race, dtype: object
Dask Name: unique-agg, 158 tasks

The race column seems to have consistent representation. However, it is not immediately clear at a glance what each single-letter code stands for. The data can be transformed into more descriptive values using a dictionary.

In [70]:
racetoword = {"A":"asian", "W":"white","H":"hispanic", "B":"black","O":"other", "N":"nativeamerican"}
killings.race = killings.race.map(racetoword)
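
One property of mapping with a dictionary is that any value without a matching key becomes NaN. The sketch below shows this on a hypothetical toy pandas Series; the code "X" and the names `race`, `mapped`, and `unmapped_codes` are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

racetoword = {"A": "asian", "W": "white", "H": "hispanic",
              "B": "black", "O": "other", "N": "nativeamerican"}

# "X" is a hypothetical code that the dictionary does not cover.
race = pd.Series(["A", "W", "H", "X"])

# Codes found in the dictionary are translated; "X" turns into NaN.
mapped = race.map(racetoword)

# List the original codes that were lost in the mapping.
unmapped_codes = race[mapped.isnull()].unique().tolist()
```

Checking for unmapped codes after the transformation confirms that the dictionary covers every value in the column.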

In [71]:
killings.race.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: race, dtype: object
Dask Name: unique-agg, 161 tasks

The race column has consistent representation and has been transformed to values that are much more meaningful.

Now, checking the killings state column.

In [72]:
killings.state.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: state, dtype: object
Dask Name: unique-agg, 161 tasks

The state column seems to have consistent representation.

Now checking the killings threat_level column.

In [73]:
killings.threat_level.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: threat_level, dtype: object
Dask Name: unique-agg, 161 tasks

The threat_level column seems to have consistent representation.

Now checking the killings flee column.

In [74]:
killings.flee.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: flee, dtype: object
Dask Name: unique-agg, 161 tasks

The flee column seems to have consistent representation; however, to match the other lowercase columns, all of its values are transformed to lower case.

In [75]:
killings.flee = killings.flee.str.lower()

In [76]:
killings.flee.unique()

Dask Series Structure:
npartitions=1
    object
       ...
Name: flee, dtype: object
Dask Name: unique-agg, 164 tasks

The flee column seems to have consistent representation.


The killings name and city columns are not being checked since they serve mostly as identifier labels.

The killings date column and the rest of the numerical variables are not being checked, since this check only applies to categorical strings.

The killings signs_of_mental_illness and body_camera columns are not being checked since they are booleans. 

The income, belowpoverty, education, and racestats DataFrames are not being checked since the significant data in each of those columns are numerical, and any "categorical" data serves as more of an identification label.

<h3> Renaming Columns </h3>

In order to have tables that properly encode one-to-one and one-to-many relationships, the DataFrames need to share column names that link them together. In addition, meaningful column names are helpful during later data analysis.

The following will rename the <b>killings</b> DataFrame's columns to more meaningful and easier-to-use column names.

The "id" column in killings does not serve a quantifiable purpose; therefore, removing it improves the readability of the DataFrame.

In [77]:
killings = killings.drop(["id"],axis=1)

In [78]:
killings.columns = ["name","date","cause_of_death","armed","age","gender","race","city","state","mental_illness","threat_level","flee","body_camera"]

In [79]:
killings.head()

Unnamed: 0,name,date,cause_of_death,armed,age,gender,race,city,state,mental_illness,threat_level,flee,body_camera
0,Tim Elliot,02/01/15,shot,gun,53.0,M,asian,Shelton,WA,True,attack,not fleeing,False
1,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,white,Aloha,OR,False,attack,not fleeing,False
2,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,hispanic,Wichita,KS,False,other,not fleeing,False
3,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,white,San Francisco,CA,True,attack,not fleeing,False
4,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,hispanic,Evans,CO,False,attack,not fleeing,False
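
Assigning a full list to `columns` relies on the list being in exactly the same order as the existing columns. As an alternative, the sketch below renames by old name with `rename`; the two-row frame is a hypothetical stand-in that only borrows the original column names.

```python
import pandas as pd

# Hypothetical two-row frame that uses the original killings column names.
sample = pd.DataFrame({
    "name": ["Tim Elliot", "Lewis Lee Lembke"],
    "manner_of_death": ["shot", "shot"],
    "signs_of_mental_illness": [True, False],
})

# rename matches on the old name, so column order does not matter
# and columns that are not listed keep their names.
renamed = sample.rename(columns={
    "manner_of_death": "cause_of_death",
    "signs_of_mental_illness": "mental_illness",
})
print(list(renamed.columns))
```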


The following will rename the <b>income</b> DataFrame's columns to more meaningful and easier-to-use column names. The "Geographic Area" and "City" columns will be renamed to "state" and "city" to allow for linkage.

In [80]:
income.columns = ["state","city","median_income"]

In [81]:
income.head()

Unnamed: 0,state,city,median_income
0,AL,Abanda CDP,11207
1,AL,Abbeville city,25615
2,AL,Adamsville city,42575
3,AL,Addison town,37083
4,AL,Akron town,21667


The following will rename the <b>belowpoverty</b> DataFrame's columns to more meaningful and easier-to-use column names. The "Geographic Area" and "City" columns will be renamed to "state" and "city" to allow for linkage.

In [82]:
belowpoverty.columns = ["state","city","poverty_rate"]

In [83]:
belowpoverty.head()

Unnamed: 0,state,city,poverty_rate
0,AL,Abanda CDP,78.8
1,AL,Abbeville city,29.1
2,AL,Adamsville city,25.5
3,AL,Addison town,30.7
4,AL,Akron town,42.0


The following will rename the <b>education</b> DataFrame's columns to more meaningful and easier-to-use column names. The "Geographic Area" and "City" columns will be renamed to "state" and "city" to allow for linkage.

In [84]:
education.columns = ["state","city","completed_hs"]

In [85]:
education.head()

Unnamed: 0,state,city,completed_hs
0,AL,Abanda CDP,21.2
1,AL,Abbeville city,69.1
2,AL,Adamsville city,78.9
3,AL,Addison town,81.4
4,AL,Akron town,68.6


The following will rename the <b>racestats</b> DataFrame's columns to more meaningful and easier-to-use column names. The "Geographic Area" and "City" columns will be renamed to "state" and "city" to allow for linkage.

In [86]:
racestats.columns = ["state","city","white","black","native_american","asian","hispanic"]

In [87]:
racestats.head()

Unnamed: 0,state,city,white,black,native_american,asian,hispanic
0,AL,Abanda CDP,67.2,30.2,0.0,0.0,1.6
1,AL,Abbeville city,54.4,41.4,0.1,1.0,3.1
2,AL,Adamsville city,52.3,44.9,0.5,0.3,2.3
3,AL,Addison town,99.1,0.1,0.0,0.1,0.4
4,AL,Akron town,13.2,86.5,0.0,0.0,0.3


<h3> Modifying Column Data Across Tables to Ensure Linkage </h3>

The strings in each of the tables need to be formatted correctly and consistently. In this case specifically, the city name (together with the state) is what links the tables. However, the letter case differs across the tables, and the city names carry different suffixes: the census tables append words such as "city", "town", or "CDP", while killings sometimes appends "township" or "county". This section will wrangle the strings to a consistent format.

Make all the city names lowercase for continuity.

In [88]:
killings.city = killings.city.str.lower()
income.city = income.city.str.lower()
belowpoverty.city = belowpoverty.city.str.lower()
education.city = education.city.str.lower()
racestats.city = racestats.city.str.lower()

The below code will format the city name in killings.

In [89]:
killings = killings.compute()

In [90]:
for index,row in killings.iterrows():
    tempcity = row["city"]
    citysplit = tempcity.split(" ")
    # drop a trailing "township" or "county" so the name matches the other tables
    if citysplit[-1] in ("township","county"):
        del citysplit[-1]
    tempstring = " ".join(citysplit)
    # DataFrame.set_value is deprecated; .at is the supported scalar setter
    killings.at[index,"city"] = tempstring


In [91]:
killings = dd.from_pandas(killings,npartitions = 1)

In [92]:
killings.head()

Unnamed: 0,name,date,cause_of_death,armed,age,gender,race,city,state,mental_illness,threat_level,flee,body_camera
0,Tim Elliot,02/01/15,shot,gun,53.0,M,asian,shelton,WA,True,attack,not fleeing,False
1,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,white,aloha,OR,False,attack,not fleeing,False
2,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,hispanic,wichita,KS,False,other,not fleeing,False
3,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,white,san francisco,CA,True,attack,not fleeing,False
4,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,hispanic,evans,CO,False,attack,not fleeing,False


The below code will format city names in income.

In [93]:
for index,row in income.iterrows():
    tempcity = row["city"]
    citysplit = tempcity.split(" ")
    # drop the trailing census place type (cdp, city, town, village)
    if citysplit[-1] in ("cdp","city","town","village"):
        del citysplit[-1]
    tempstring = " ".join(citysplit)
    # DataFrame.set_value is deprecated; .at is the supported scalar setter
    income.at[index,"city"] = tempstring


In [94]:
income.head()

Unnamed: 0,state,city,median_income
0,AL,abanda,11207
1,AL,abbeville,25615
2,AL,adamsville,42575
3,AL,addison,37083
4,AL,akron,21667


The below code will format city names in belowpoverty.

In [95]:
for index,row in belowpoverty.iterrows():
    tempcity = row["city"]
    citysplit = tempcity.split(" ")
    # drop the trailing census place type (cdp, city, town, village)
    if citysplit[-1] in ("cdp","city","town","village"):
        del citysplit[-1]
    tempstring = " ".join(citysplit)
    # DataFrame.set_value is deprecated; .at is the supported scalar setter
    belowpoverty.at[index,"city"] = tempstring


In [96]:
belowpoverty.head()

Unnamed: 0,state,city,poverty_rate
0,AL,abanda,78.8
1,AL,abbeville,29.1
2,AL,adamsville,25.5
3,AL,addison,30.7
4,AL,akron,42.0


The below code will format city names in education.

In [97]:
for index,row in education.iterrows():
    tempcity = row["city"]
    citysplit = tempcity.split(" ")
    # drop the trailing census place type (cdp, city, town, village)
    if citysplit[-1] in ("cdp","city","town","village"):
        del citysplit[-1]
    tempstring = " ".join(citysplit)
    # DataFrame.set_value is deprecated; .at is the supported scalar setter
    education.at[index,"city"] = tempstring


In [98]:
education.head()

Unnamed: 0,state,city,completed_hs
0,AL,abanda,21.2
1,AL,abbeville,69.1
2,AL,adamsville,78.9
3,AL,addison,81.4
4,AL,akron,68.6


The below code will format city names in racestats.

In [99]:
for index,row in racestats.iterrows():
    tempcity = row["city"]
    citysplit = tempcity.split(" ")
    # drop the trailing census place type (cdp, city, town, village)
    if citysplit[-1] in ("cdp","city","town","village"):
        del citysplit[-1]
    tempstring = " ".join(citysplit)
    # DataFrame.set_value is deprecated; .at is the supported scalar setter
    racestats.at[index,"city"] = tempstring


In [100]:
racestats.head()

Unnamed: 0,state,city,white,black,native_american,asian,hispanic
0,AL,abanda,67.2,30.2,0.0,0.0,1.6
1,AL,abbeville,54.4,41.4,0.1,1.0,3.1
2,AL,adamsville,52.3,44.9,0.5,0.3,2.3
3,AL,addison,99.1,0.1,0.0,0.1,0.4
4,AL,akron,13.2,86.5,0.0,0.0,0.3
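
The four census tables are cleaned with the same row-by-row loop. As a sketch of how that step could be shared and vectorized, the hypothetical helper below (`strip_place_suffix` is not part of this notebook) lowercases the names and removes one trailing suffix word with a regular expression. It assumes every city name has at least one word before the suffix; a name that consists only of a suffix word is left unchanged here, which differs from the loops above.

```python
import pandas as pd

def strip_place_suffix(cities, suffixes):
    """Lowercase city names and drop one trailing suffix word, if present."""
    # Hypothetical helper, not part of the original notebook.
    # The pattern only matches a suffix that is the last word of the name.
    pattern = r"\s+(?:" + "|".join(suffixes) + r")$"
    return cities.str.lower().str.replace(pattern, "", regex=True)

# The first three names come from the census tables; the last one is a
# made-up case that shows only the final word is removed.
census = pd.Series(["Abanda CDP", "Abbeville city", "Addison town", "Salt Lake City city"])
cleaned = strip_place_suffix(census, ["cdp", "city", "town", "village"])
print(cleaned.tolist())
```

The same helper could be called with `["township", "county"]` for the killings table, so that every table goes through one code path.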


<h3> Modify DataTypes </h3>

To have Tidy data, it is also important that each column holds a single datatype and that the datatype is appropriate for the variable. This section casts the important variables of each table to appropriate datatypes.

The following code will ensure that the killings date column is of a date-time datatype.

In [101]:
killings = killings.compute()
killings.date = pd.to_datetime(killings.date,dayfirst = True)
killings = dd.from_pandas(killings,npartitions = 1)

In [102]:
killings.head()

Unnamed: 0,name,date,cause_of_death,armed,age,gender,race,city,state,mental_illness,threat_level,flee,body_camera
0,Tim Elliot,2015-01-02,shot,gun,53.0,M,asian,shelton,WA,True,attack,not fleeing,False
1,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,white,aloha,OR,False,attack,not fleeing,False
2,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,hispanic,wichita,KS,False,other,not fleeing,False
3,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,white,san francisco,CA,True,attack,not fleeing,False
4,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,hispanic,evans,CO,False,attack,not fleeing,False
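
With `dayfirst = True`, a string such as 02/01/15 is read as 2 January 2015, which matches the output above. A stricter option, sketched below, is to spell out the format so that a string in any other layout raises an error instead of being guessed. The format string is an assumption inferred from the first rows of the data.

```python
import pandas as pd

# The first dates shown in killings.head() before conversion.
raw = pd.Series(["02/01/15", "03/01/15", "04/01/15"])

# Assumed layout: day/month/two-digit year, consistent with dayfirst=True.
parsed = pd.to_datetime(raw, format="%d/%m/%y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```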


The following code will ensure that the income median_income column is of an integer datatype.

In [103]:
income.median_income = pd.to_numeric(income.median_income,downcast = "integer")

The following code will ensure that the belowpoverty poverty_rate column is of a float datatype.

In [104]:
belowpoverty.poverty_rate = pd.to_numeric(belowpoverty.poverty_rate,downcast = "float")

The following code will ensure that the education completed_hs column is of a float datatype.

In [105]:
education.completed_hs = pd.to_numeric(education.completed_hs,downcast = "float")

The following code will ensure that the racestats race columns are of a float datatype.

In [106]:
racestats.white = pd.to_numeric(racestats.white,downcast = "float")
racestats.black = pd.to_numeric(racestats.black,downcast = "float")
racestats.native_american = pd.to_numeric(racestats.native_american,downcast = "float")
racestats.asian = pd.to_numeric(racestats.asian,downcast = "float")
racestats.hispanic = pd.to_numeric(racestats.hispanic,downcast = "float")
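
The `downcast` argument asks `pd.to_numeric` for the smallest dtype in the requested family that can hold the values, which is why the tests later expect int32 and float32 rather than the 64-bit defaults. A small sketch with hypothetical sample series (the numbers are borrowed from the first rows of income and belowpoverty):

```python
import pandas as pd

# Strings stand in for values that were read as text; this is an assumption
# about how the raw columns might arrive, not something shown in the notebook.
incomes = pd.Series(["11207", "25615", "42575"])
rates = pd.Series([78.8, 29.1, 25.5])

# 42575 does not fit in int16, so the smallest usable integer type is int32.
income_num = pd.to_numeric(incomes, downcast="integer")
# float32 is the smallest float type that downcast="float" will return.
rate_num = pd.to_numeric(rates, downcast="float")
print(income_num.dtype, rate_num.dtype)
```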

<h2> Tests </h2>

Now that the DataFrames have been Tidied, we run tests on each DataFrame to ensure that each table follows Tidy conventions and that everything is formatted as expected.

The below code will test the killings DataFrame.

In [107]:
assert len(killings) == 2340
assert len(killings.isnull()) == 2340
assert list(killings.columns) == ["name","date","cause_of_death","armed","age","gender","race","city","state","mental_illness","threat_level","flee","body_camera"]
assert np.dtype(killings.name) == "O"
assert np.dtype(killings.date) == '<M8[ns]'
assert np.dtype(killings.cause_of_death) == "O"
assert np.dtype(killings.armed) == "O"
assert np.dtype(killings.age) == "float"
assert np.dtype(killings.gender) == "O"
assert np.dtype(killings.race) == "O"
assert np.dtype(killings.city) == "O"
assert np.dtype(killings.state) == "O"
assert np.dtype(killings.mental_illness) == "bool"
assert np.dtype(killings.threat_level) == "O"
assert np.dtype(killings.flee) == "O"
assert np.dtype(killings.body_camera) == "bool"
killings = killings.compute()
assert list(killings.cause_of_death.unique()) == ['shot', 'shot and Tasered']
assert list(killings.armed.unique()) == ['gun', 'unarmed', 'toy weapon', 'nail gun', 'knife', 'vehicle',
       'shovel', 'blunt object', 'hatchet', 'unknown',
       'blunt metal object', 'screwdriver', 'cordless drill', 'taser',
       'sharp object', 'carjack', 'chain', "contractor's level", 'stapler',
       'crossbow', 'bean-bag gun', 'hand torch', 'chain saw',
       'garden tool', 'scissors', 'pick-axe', 'flashlight', 'spear',
       'pitchfork', 'glass shard', 'metal rake', 'crowbar',
       'air conditioner', 'beer bottle', 'fireworks', 'pen']
assert list(killings.gender.unique()) == ['M','F']
assert list(killings.race.unique()) == ['asian', 'white', 'hispanic', 'black', 'other', 'nativeamerican']
assert list(killings.threat_level.unique()) == ['attack', 'other', 'undetermined']
assert list(killings.flee.unique()) == ['not fleeing', 'car', 'foot', 'other', 'unknown']
killings = dd.from_pandas(killings,npartitions = 1)
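
The assertions on `unique()` above compare lists, so they also depend on the order in which each value first appears in the data. If only the set of categories matters, a set comparison is a possible alternative. The sketch below uses a hypothetical sample of flee values in a different order.

```python
import pandas as pd

# Hypothetical sample in which "car" appears before "not fleeing".
flee = pd.Series(["car", "not fleeing", "foot", "car", "unknown", "other"])
expected = {"not fleeing", "car", "foot", "other", "unknown"}

# unique() returns values in order of first appearance, so a list comparison
# also pins down row order; a set comparison checks only the categories.
same_categories = set(flee.unique()) == expected
same_order = list(flee.unique()) == ["not fleeing", "car", "foot", "other", "unknown"]
print(same_categories, same_order)
```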

The below code will test the income DataFrame.

In [108]:
assert len(income) == 27418
assert len(income[income.isnull() == False]) == 27418
assert list(income.columns) == ['state', 'city', 'median_income']
assert np.dtype(income.state) == "O"
assert np.dtype(income.city) == "O"
assert np.dtype(income.median_income) == "int32"

The below code will test the belowpoverty DataFrame.

In [109]:
assert len(belowpoverty) == 29128
assert len(belowpoverty[belowpoverty.isnull() == False]) == 29128
assert list(belowpoverty.columns) == ['state', 'city', 'poverty_rate']
assert np.dtype(belowpoverty.state) == "O"
assert np.dtype(belowpoverty.city) == "O"
assert np.dtype(belowpoverty.poverty_rate) == "float32"

The below code will test the education DataFrame.

In [110]:
assert len(education) == 29132
assert len(education[education.isnull() == False]) == 29132
assert list(education.columns) == ['state', 'city', 'completed_hs']
assert np.dtype(education.state) == "O"
assert np.dtype(education.city) == "O"
assert np.dtype(education.completed_hs) == "float32"

The below code will test the racestats DataFrame.

In [111]:
assert len(racestats) == 29248
assert len(racestats[racestats.isnull() == False]) == 29248
assert list(racestats.columns) == ['state', 'city', 'white', 'black', 'native_american', 'asian',
       'hispanic']
assert np.dtype(racestats.state) == "O"
assert np.dtype(racestats.city) == "O"
assert np.dtype(racestats.white) == "float32"
assert np.dtype(racestats.black) == "float32"
assert np.dtype(racestats.native_american) == "float32"
assert np.dtype(racestats.asian) == "float32"
assert np.dtype(racestats.hispanic) == "float32"
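
A note on the null checks in the tests above: indexing a DataFrame with a boolean DataFrame keeps every row and only blanks out individual cells, and `isnull()` returns a frame of the same length, so those length comparisons hold whether or not values are missing. The sketch below shows this on a hypothetical two-row frame and counts missing cells directly instead.

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing value.
sample = pd.DataFrame({"city": ["abanda", "akron"],
                       "median_income": [11207.0, np.nan]})

# The length is unchanged even though a value is missing.
same_length = len(sample[sample.isnull() == False]) == len(sample)

# Counting missing cells directly does detect the gap.
missing_cells = int(sample.isnull().sum().sum())
print(same_length, missing_cells)
```

An assertion such as `missing_cells == 0` would then fail on this sample, which is the behaviour a missing-value test is meant to have.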

<h2> Store on Disk </h2>

Now that the Pandas DataFrames have been Tidied, they will be stored on the disk through the use of Arrow/Parquet. 

In [112]:
import pyarrow.parquet as pq
import pyarrow as pa

Below, strings representing the datapath for each of the Parquet files to be written are created.

In [113]:
filenamekillings = "/data/skariyadan/DaskKillings.parquet"
filenameincome = "/data/skariyadan/Income.parquet"
filenamebelowpoverty = "/data/skariyadan/BelowPoverty.parquet"
filenameeducation = "/data/skariyadan/Education.parquet"
filenameracestats = "/data/skariyadan/RaceStats.parquet"

Below, the killings DataFrame will be stored as a Parquet file.

In [114]:
killings.to_parquet(filenamekillings)

Below, the income DataFrame will be stored as a Parquet file.

In [115]:
tableincome = pa.Table.from_pandas(income)
pq.write_table(tableincome,filenameincome)

Below, the belowpoverty DataFrame will be stored as a Parquet file.

In [116]:
tablebelowpoverty = pa.Table.from_pandas(belowpoverty)
pq.write_table(tablebelowpoverty,filenamebelowpoverty)

Below, the education DataFrame will be stored as a Parquet file.

In [117]:
tableeducation = pa.Table.from_pandas(education)
pq.write_table(tableeducation,filenameeducation)

Below, the racestats DataFrame will be stored as a Parquet file.

In [118]:
tableracestats = pa.Table.from_pandas(racestats)
pq.write_table(tableracestats,filenameracestats)

<b> What's Next? </b>

In the next notebook, 04-EDA, we will begin data analysis and data visualization.