<h1> Import </h1>

<h2> Importing the Datasets </h2>

In this section, the datasets (in CSV format) are read into DataFrames: the killings data into a Dask DataFrame and the remaining datasets into Pandas DataFrames.

In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd

The following DataFrame, 2535 rows long, comes from data that the Washington Post has compiled on fatal police shootings since January 1, 2015.

In [2]:
killings = dd.read_csv("/data/skariyadan/PoliceKillingsUS.csv", encoding="ISO-8859-1")

In [3]:
killings.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False
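
The `encoding="ISO-8859-1"` argument passed to the reader matters when a file stores accented characters as single Latin-1 bytes, which are generally not valid UTF-8. A minimal sketch of the difference, using a made-up in-memory CSV (the name below is hypothetical and not taken from the dataset):

```python
import io

import pandas as pd

# Hypothetical one-row CSV; "é" is stored as the single Latin-1 byte 0xE9.
raw_bytes = "id,name\n1,José Pérez\n".encode("ISO-8859-1")

# The same bytes are not valid UTF-8, so a UTF-8 decode fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Declaring the encoding lets pandas decode the bytes correctly.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(utf8_ok, sample.loc[0, "name"])
```

Whether the real CSV files contain such characters is not stated here; the sketch only illustrates why the argument may be needed.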


<b> Summary of killings DataFrame Variables </b>

Categorical: 
- name (name of victim)
- manner_of_death (how the victim died)
- armed (what the victim was armed with, if anything)
- gender (gender of victim)
- race (race of victim)
- city (city of incident)
- state (state of incident)
- signs_of_mental_illness (whether the victim showed signs of mental illness)
- threat_level (level of threat posed by the victim, e.g. attack or other)
- flee (whether the victim was attempting to flee)
- body_camera (whether the officer had a body camera)

Numerical:
- date (date of incident)
- age (age of victim)
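
The date column is listed as numerical, but a CSV reader loads it as text unless it is explicitly parsed. The following is a minimal sketch of checking which columns are numeric and parsing the dates. It uses a few rows copied from the `head()` output above and assumes the dates are written day-first (DD/MM/YY), which the sample rows suggest but the source does not state.

```python
import pandas as pd

# A few rows copied from the head() output above (subset of columns).
sample = pd.DataFrame({
    "id": [3, 4, 5],
    "date": ["02/01/15", "02/01/15", "03/01/15"],
    "armed": ["gun", "gun", "unarmed"],
    "age": [53.0, 47.0, 23.0],
    "body_camera": [False, False, False],
})

# Columns that pandas treats as numeric (booleans are not included).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Assumption: dates are day/month/two-digit year.
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%y")

print(numeric_cols)
print(sample["date"].dt.month.tolist())
```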

The following DataFrame comes from U.S. Census data and holds the median household income of 29322 U.S. cities.

In [4]:
income = pd.read_csv("/data/skariyadan/MedianHouseholdIncome2015.csv", encoding="ISO-8859-1")

In [5]:
income.head()

Unnamed: 0,Geographic Area,City,Median Income
0,AL,Abanda CDP,11207
1,AL,Abbeville city,25615
2,AL,Adamsville city,42575
3,AL,Addison town,37083
4,AL,Akron town,21667


<b> Summary of income DataFrame Variables </b>

Categorical:
- Geographic Area (state)
- City (city)

Numerical:
- Median Income (median income of city)

The following DataFrame comes from U.S. Census data and holds the percentage of the population below the poverty line for 29329 cities.

In [6]:
belowpoverty = pd.read_csv("/data/skariyadan/PercentagePeopleBelowPovertyLevel.csv", encoding="ISO-8859-1")

In [7]:
belowpoverty.head()

Unnamed: 0,Geographic Area,City,poverty_rate
0,AL,Abanda CDP,78.8
1,AL,Abbeville city,29.1
2,AL,Adamsville city,25.5
3,AL,Addison town,30.7
4,AL,Akron town,42.0


<b> Summary of belowpoverty DataFrame Variables </b>

Categorical:
- Geographic Area (state)
- City (city)

Numerical:
- poverty_rate (percentage of people below the poverty line)

The following DataFrame comes from U.S. Census data and holds the percentage of the population over the age of 25 that has completed high school for 29329 cities.

In [8]:
education = pd.read_csv("/data/skariyadan/PercentOver25CompletedHighSchool.csv", encoding="ISO-8859-1")

In [9]:
education.head()

Unnamed: 0,Geographic Area,City,percent_completed_hs
0,AL,Abanda CDP,21.2
1,AL,Abbeville city,69.1
2,AL,Adamsville city,78.9
3,AL,Addison town,81.4
4,AL,Akron town,68.6


<b> Summary of education DataFrame Variables </b>

Categorical:
- Geographic Area (state)
- City (city)

Numerical:
- percent_completed_hs (percentage of adults over the age of 25 that completed high school)

The following DataFrame comes from U.S. Census data and holds the racial breakdown, by percentage, of 29268 cities.

In [10]:
racestats = pd.read_csv("/data/skariyadan/ShareRaceByCity.csv", encoding="ISO-8859-1")

In [11]:
racestats.head()

Unnamed: 0,Geographic area,City,share_white,share_black,share_native_american,share_asian,share_hispanic
0,AL,Abanda CDP,67.2,30.2,0.0,0.0,1.6
1,AL,Abbeville city,54.4,41.4,0.1,1.0,3.1
2,AL,Adamsville city,52.3,44.9,0.5,0.3,2.3
3,AL,Addison town,99.1,0.1,0.0,0.1,0.4
4,AL,Akron town,13.2,86.5,0.0,0.0,0.3


<b> Summary of racestats DataFrame Variables </b>

Categorical:
- Geographic area (state)
- City (city)

Numerical:
- share_white (percentage of people who are White)
- share_black (percentage of people who are Black)
- share_native_american (percentage of people who are Native American)
- share_asian (percentage of people who are Asian)
- share_hispanic (percentage of people who are Hispanic)

<h2> Store On Disk </h2>

Now that the CSVs have been loaded into DataFrames, they are stored on disk as Parquet files, written with Dask and PyArrow. Parquet is an efficient columnar storage format, and saving to disk allows the data to be accessed throughout the rest of the project.

In [12]:
import pyarrow.parquet as pq
import pyarrow as pa

Below, strings holding the path of each Parquet file to be written are created.

In [13]:
filenamekillings = "/data/skariyadan/daskkillings.parquet"
filenameincome = "/data/skariyadan/income.parquet"
filenamebelowpoverty = "/data/skariyadan/belowpoverty.parquet"
filenameeducation = "/data/skariyadan/education.parquet"
filenameracestats = "/data/skariyadan/racestats.parquet"

Below, the killings DataFrame will be stored as a Parquet file.

In [14]:
killings.to_parquet(filenamekillings)

Below, the income DataFrame will be stored as a Parquet file.

In [15]:
tableincome = pa.Table.from_pandas(income)
pq.write_table(tableincome, filenameincome)

Below, the belowpoverty DataFrame will be stored as a Parquet file.

In [16]:
tablebelowpoverty = pa.Table.from_pandas(belowpoverty)
pq.write_table(tablebelowpoverty, filenamebelowpoverty)

Below, the education DataFrame will be stored as a Parquet file.

In [17]:
tableeducation = pa.Table.from_pandas(education)
pq.write_table(tableeducation, filenameeducation)

Below, the racestats DataFrame will be stored as a Parquet file.

In [18]:
tableracestats = pa.Table.from_pandas(racestats)
pq.write_table(tableracestats, filenameracestats)

<br>
<b> What's Next? </b> 

In the next notebook, 03-Tidy, the DataFrames will be cleaned and transformed to follow the tidy-data standard.