# Guidelines for ETL Project

This document contains guidelines, requirements, and suggestions for Project 1.

## Team Effort

Due to the short timeline, teamwork will be crucial to the success of this project! Work closely with your team through all phases of the project to ensure that there are no surprises at the end of the week.

Working in a group enables you to tackle more difficult problems than you'd be able to working alone. In other words, working in a group allows you to **work smart** and **dream big**. Take advantage of it!

## Project Proposal

Before you start writing any code, remember that you only have one week to complete this project. View this project as a typical assignment from work. Imagine that a batch of data has come in and your team is tasked with migrating it to a production database.

Take advantage of your Instructor and TA support during office hours and class project work time. They are a valuable resource and can help you stay on track.

## Finding Data

Your project must use 2 or more sources of data. We recommend the following sites to use as sources of data:

* [data.world](https://data.world/)

* [Kaggle](https://www.kaggle.com/)

You can also use APIs or data scraped from the web. However, get approval from your instructor first. Again, there is only a week to complete this!

## Data Cleanup & Analysis

Once you have identified your datasets, perform ETL on the data. Make sure to plan and document the following:

* The sources of data that you will extract from.

* The type of transformation needed for this data (cleaning, joining, filtering, aggregating, etc.).

* The type of final production database to load the data into (relational or non-relational).

* The final tables or collections that will be used in the production database.

You will be required to submit a final technical report with the above information and steps required to reproduce your ETL process.
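
The plan above can be sketched as three small functions, one per ETL stage. This is only an illustrative outline, assuming pandas and an in-memory SQLite database. The function names, the `fips` key, and the `counties` table name are hypothetical and not part of the project requirements; the two sample rows stand in for real source files.

```python
import sqlite3

import pandas as pd


def extract():
    """Return two small in-line sources standing in for real files or APIs."""
    population = pd.DataFrame(
        {"fips": ["01003", "01015"], "population": [208563, 114611]}
    )
    votes = pd.DataFrame(
        {"fips": ["01003", "01015"], "total_votes": [94090, 47376]}
    )
    return population, votes


def transform(population, votes):
    """Join the sources on their shared key and derive one new column."""
    merged = population.merge(votes, on="fips", how="inner")
    merged["votes_per_capita"] = merged["total_votes"] / merged["population"]
    return merged


def load(frame, connection, table_name="counties"):
    """Write the transformed frame to a database table."""
    frame.to_sql(table_name, con=connection, index=False, if_exists="replace")


connection = sqlite3.connect(":memory:")
population, votes = extract()
combined = transform(population, votes)
load(combined, connection)

# Count the loaded rows as a quick sanity check.
row_count = connection.execute("SELECT COUNT(*) FROM counties").fetchone()[0]
```

Keeping each stage in its own function makes the steps easy to list in the final report and easy to rerun one at a time.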

## Project Report

At the end of the week, your team will submit a Final Report that describes the following:

* **E**xtract: your original data sources and how the data was formatted (CSV, JSON, MySQL, etc.).

* **T**ransform: what data cleaning or transformation was required.

* **L**oad: the final database, tables/collections, and why this was chosen.

Please upload the report to GitHub and submit a link to Bootcampspot.

- - -

### Copyright

Coding Boot Camp © 2018. All Rights Reserved.

In [2]:
from census import Census
import pandas as pd
from sqlalchemy import create_engine
cen_key = ""

In [3]:
cen_config = Census(cen_key, year=2016)

In [8]:
censusData = cen_config.acs1.get(
                                    ( 
                                     "NAME", 
                                    "B19013_001E", #household income (median)
                                    "B01003_001E", #population (total)
                                    "B09001_001E", #population under 18
                                    "B01002_001E", #median age
                                    "B19301_001E", #per capita income (average income per person)
                                    "B17001_002E", #poverty count
                                    "B23025_005E", #unemployment count
                                    "B15012_001E", #total recorded bachelor degrees
                                    "B15003_002E", #no education
                                    "B15003_017E", #high school
                                    "B15003_018E", #GED
                                    "B15003_021E", #associates
                                    "B15003_022E", #bachelors
                                    "B15003_023E", #masters
                                    "B15003_024E", #professional degree
                                    "B15003_025E"  #doctorate
                                    ),
                                    {'for': 'county:*'} #get all counties in all states - includes Puerto Rico and Washington, D.C.
                                    ) 

In [9]:
censusDf = pd.DataFrame(censusData)
censusDf.head(10)

Unnamed: 0,B01002_001E,B01003_001E,B09001_001E,B15003_002E,B15003_017E,B15003_018E,B15003_021E,B15003_022E,B15003_023E,B15003_024E,B15003_025E,B15012_001E,B17001_002E,B19013_001E,B19301_001E,B23025_005E,NAME,county,state
0,42.4,208563.0,45955.0,2026.0,35760.0,6062.0,15580.0,25524.0,12717.0,3720.0,1586.0,46324.0,23375.0,56732.0,29977.0,3360.0,"Baldwin County, Alabama",3,1
1,39.1,114611.0,24953.0,1319.0,19896.0,4865.0,5453.0,7295.0,4306.0,900.0,408.0,14224.0,18193.0,41687.0,23818.0,4778.0,"Calhoun County, Alabama",15,1
2,40.4,82471.0,18000.0,1396.0,14343.0,4138.0,5733.0,6575.0,2595.0,453.0,109.0,10079.0,11524.0,39411.0,21237.0,1602.0,"Cullman County, Alabama",43,1
3,39.8,70900.0,17109.0,1556.0,11198.0,4127.0,4671.0,3259.0,1591.0,244.0,149.0,5542.0,15029.0,35963.0,19215.0,1551.0,"DeKalb County, Alabama",49,1
4,38.3,81799.0,18627.0,258.0,12734.0,4282.0,5113.0,8896.0,4336.0,820.0,593.0,15601.0,11283.0,52579.0,26230.0,1743.0,"Elmore County, Alabama",51,1
5,41.2,102564.0,22662.0,1413.0,19348.0,5383.0,5453.0,6883.0,3283.0,620.0,289.0,11939.0,16955.0,41152.0,22117.0,2799.0,"Etowah County, Alabama",55,1
6,39.5,104056.0,24437.0,939.0,17324.0,4177.0,6759.0,9392.0,3734.0,940.0,451.0,15376.0,20571.0,42321.0,24013.0,3418.0,"Houston County, Alabama",69,1
7,38.0,659521.0,151782.0,3654.0,101343.0,16580.0,38288.0,89977.0,35518.0,14228.0,8108.0,158432.0,97105.0,50180.0,30674.0,22588.0,"Jefferson County, Alabama",73,1
8,41.6,92318.0,18925.0,1136.0,18735.0,4177.0,4515.0,9044.0,3698.0,580.0,493.0,15047.0,12965.0,43427.0,24592.0,3064.0,"Lauderdale County, Alabama",77,1
9,31.0,158991.0,34079.0,2012.0,15745.0,5269.0,7544.0,18716.0,10336.0,1487.0,3366.0,36320.0,29192.0,48056.0,25081.0,4625.0,"Lee County, Alabama",81,1


In [10]:
re_censusDf = censusDf.rename(columns={
                                    "NAME": "Name", 
                                    "state": "State ID",
                                    "B19013_001E": "Household Income",
                                    "B01003_001E": "Population",
                                    "B09001_001E": "Population (Under 18)",
                                    "B01002_001E": "Median Age",
                                    "B19301_001E": "Per Capita Income",
                                    "B17001_002E": "Poverty Count",
                                    "B23025_005E": "Unemployment Count",
                                    "B15012_001E": "Total Recorded Bachelor Degrees",
                                    "B15003_002E": "No Education",
                                    "B15003_017E": "High School",
                                    "B15003_018E": "GED",
                                    "B15003_021E": "Associates",
                                    "B15003_022E": "Bachelors",
                                    "B15003_023E": "Masters",
                                    "B15003_024E": "Professional Degree",
                                    "B15003_025E": "Doctorate"
                                    })

In [13]:
re_censusDf['Fips Code'] = re_censusDf['State ID'] + re_censusDf['county']
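
The concatenation above only yields a valid five-digit FIPS code when `State ID` and `county` are zero-padded strings. A minimal sketch of a more defensive version, assuming a hypothetical `sample` frame in which the IDs have been read as integers and lost their leading zeros:

```python
import pandas as pd

# Hypothetical sample: state and county IDs stored as integers.
sample = pd.DataFrame({"state": [1, 1], "county": [3, 15]})

# Pad the state to 2 digits and the county to 3 digits before concatenating.
state = sample["state"].astype(str).str.zfill(2)
county = sample["county"].astype(str).str.zfill(3)
sample["Fips Code"] = state + county
```

If the columns are already padded strings, the padding step leaves them unchanged, so it is safe to apply either way.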

In [14]:
re_censusDf

Unnamed: 0,Median Age,Population,Population (Under 18),No Education,High School,GED,Associates,Bachelors,Masters,Professional Degree,Doctorate,Total Recorded Bachelor Degrees,Poverty Count,Household Income,Per Capita Income,Unemployment Count,Name,county,State ID,Fips Code
0,42.4,208563.0,45955.0,2026.0,35760.0,6062.0,15580.0,25524.0,12717.0,3720.0,1586.0,46324.0,23375.0,56732.0,29977.0,3360.0,"Baldwin County, Alabama",003,01,01003
1,39.1,114611.0,24953.0,1319.0,19896.0,4865.0,5453.0,7295.0,4306.0,900.0,408.0,14224.0,18193.0,41687.0,23818.0,4778.0,"Calhoun County, Alabama",015,01,01015
2,40.4,82471.0,18000.0,1396.0,14343.0,4138.0,5733.0,6575.0,2595.0,453.0,109.0,10079.0,11524.0,39411.0,21237.0,1602.0,"Cullman County, Alabama",043,01,01043
3,39.8,70900.0,17109.0,1556.0,11198.0,4127.0,4671.0,3259.0,1591.0,244.0,149.0,5542.0,15029.0,35963.0,19215.0,1551.0,"DeKalb County, Alabama",049,01,01049
4,38.3,81799.0,18627.0,258.0,12734.0,4282.0,5113.0,8896.0,4336.0,820.0,593.0,15601.0,11283.0,52579.0,26230.0,1743.0,"Elmore County, Alabama",051,01,01051
5,41.2,102564.0,22662.0,1413.0,19348.0,5383.0,5453.0,6883.0,3283.0,620.0,289.0,11939.0,16955.0,41152.0,22117.0,2799.0,"Etowah County, Alabama",055,01,01055
6,39.5,104056.0,24437.0,939.0,17324.0,4177.0,6759.0,9392.0,3734.0,940.0,451.0,15376.0,20571.0,42321.0,24013.0,3418.0,"Houston County, Alabama",069,01,01069
7,38.0,659521.0,151782.0,3654.0,101343.0,16580.0,38288.0,89977.0,35518.0,14228.0,8108.0,158432.0,97105.0,50180.0,30674.0,22588.0,"Jefferson County, Alabama",073,01,01073
8,41.6,92318.0,18925.0,1136.0,18735.0,4177.0,4515.0,9044.0,3698.0,580.0,493.0,15047.0,12965.0,43427.0,24592.0,3064.0,"Lauderdale County, Alabama",077,01,01077
9,31.0,158991.0,34079.0,2012.0,15745.0,5269.0,7544.0,18716.0,10336.0,1487.0,3366.0,36320.0,29192.0,48056.0,25081.0,4625.0,"Lee County, Alabama",081,01,01081


In [15]:
cols = re_censusDf.columns.tolist()

In [17]:
cols = cols[-4:] + cols[:-4]

In [21]:
re_censusDf = re_censusDf[cols]
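
The slice in the cells above moves the last four column names to the front of the list. A small illustration on a shortened, hypothetical column list (the real frame has more columns):

```python
# Shortened column list; the last four names are the ones to move forward.
cols = ["Median Age", "Population", "Name", "county", "State ID", "Fips Code"]

# cols[-4:] is the last four names, cols[:-4] is everything before them.
reordered = cols[-4:] + cols[:-4]
```

Indexing a DataFrame with the reordered list, as in `re_censusDf[cols]`, returns the columns in that order.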

In [23]:
re_censusDf = re_censusDf.drop(['State ID','county'],axis=1)
re_censusDf

Unnamed: 0,Name,Fips Code,Median Age,Population,Population (Under 18),No Education,High School,GED,Associates,Bachelors,Masters,Professional Degree,Doctorate,Total Recorded Bachelor Degrees,Poverty Count,Household Income,Per Capita Income,Unemployment Count
0,"Baldwin County, Alabama",01003,42.4,208563.0,45955.0,2026.0,35760.0,6062.0,15580.0,25524.0,12717.0,3720.0,1586.0,46324.0,23375.0,56732.0,29977.0,3360.0
1,"Calhoun County, Alabama",01015,39.1,114611.0,24953.0,1319.0,19896.0,4865.0,5453.0,7295.0,4306.0,900.0,408.0,14224.0,18193.0,41687.0,23818.0,4778.0
2,"Cullman County, Alabama",01043,40.4,82471.0,18000.0,1396.0,14343.0,4138.0,5733.0,6575.0,2595.0,453.0,109.0,10079.0,11524.0,39411.0,21237.0,1602.0
3,"DeKalb County, Alabama",01049,39.8,70900.0,17109.0,1556.0,11198.0,4127.0,4671.0,3259.0,1591.0,244.0,149.0,5542.0,15029.0,35963.0,19215.0,1551.0
4,"Elmore County, Alabama",01051,38.3,81799.0,18627.0,258.0,12734.0,4282.0,5113.0,8896.0,4336.0,820.0,593.0,15601.0,11283.0,52579.0,26230.0,1743.0
5,"Etowah County, Alabama",01055,41.2,102564.0,22662.0,1413.0,19348.0,5383.0,5453.0,6883.0,3283.0,620.0,289.0,11939.0,16955.0,41152.0,22117.0,2799.0
6,"Houston County, Alabama",01069,39.5,104056.0,24437.0,939.0,17324.0,4177.0,6759.0,9392.0,3734.0,940.0,451.0,15376.0,20571.0,42321.0,24013.0,3418.0
7,"Jefferson County, Alabama",01073,38.0,659521.0,151782.0,3654.0,101343.0,16580.0,38288.0,89977.0,35518.0,14228.0,8108.0,158432.0,97105.0,50180.0,30674.0,22588.0
8,"Lauderdale County, Alabama",01077,41.6,92318.0,18925.0,1136.0,18735.0,4177.0,4515.0,9044.0,3698.0,580.0,493.0,15047.0,12965.0,43427.0,24592.0,3064.0
9,"Lee County, Alabama",01081,31.0,158991.0,34079.0,2012.0,15745.0,5269.0,7544.0,18716.0,10336.0,1487.0,3366.0,36320.0,29192.0,48056.0,25081.0,4625.0


In [42]:
file_path = './2016ElectionData.csv'
elecData = pd.read_csv(file_path, index_col=0)

In [53]:
# Remove all Alaska rows - the data was repeating (note: this drops every AK row)
elecData = elecData[elecData.state_abbr != "AK"]
elecData.head()

Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips
29,5908.0,18110.0,24661.0,0.239569,0.734358,12202,49.48%,AL,Autauga County,1001
30,18409.0,72780.0,94090.0,0.195653,0.773515,54371,57.79%,AL,Baldwin County,1003
31,4848.0,5431.0,10390.0,0.466603,0.522714,583,5.61%,AL,Barbour County,1005
32,1874.0,6733.0,8748.0,0.21422,0.769662,4859,55.54%,AL,Bibb County,1007
33,2150.0,22808.0,25384.0,0.084699,0.898519,20658,81.38%,AL,Blount County,1009
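
The filter above removes every row whose `state_abbr` is `"AK"`. If the goal is instead to keep one copy of each repeated row, `drop_duplicates` is one option. This is a sketch on a hypothetical sample; the Alaska values are made up for illustration and are not taken from the election file.

```python
import pandas as pd

# Hypothetical sample: the same "AK" record repeats three times.
sample = pd.DataFrame(
    {
        "state_abbr": ["AK", "AK", "AK", "AL"],
        "county_name": ["Alaska", "Alaska", "Alaska", "Autauga County"],
        "combined_fips": [2013, 2013, 2013, 1001],
    }
)

# Keep only the first row for each (state, FIPS) pair.
deduplicated = sample.drop_duplicates(
    subset=["state_abbr", "combined_fips"], keep="first"
)
```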


In [57]:
# Function to left-pad the fips code with zeros to 5 digits (fips codes under 10,000 lose their leading 0)
def pad_fips(x):
    return str(x).zfill(5)

In [59]:
# Apply pad_fips and create a new column
elecData['Fips Code'] = elecData['combined_fips'].apply(pad_fips)
elecData.head()

Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips,Fips Code
29,5908.0,18110.0,24661.0,0.239569,0.734358,12202,49.48%,AL,Autauga County,1001,1001
30,18409.0,72780.0,94090.0,0.195653,0.773515,54371,57.79%,AL,Baldwin County,1003,1003
31,4848.0,5431.0,10390.0,0.466603,0.522714,583,5.61%,AL,Barbour County,1005,1005
32,1874.0,6733.0,8748.0,0.21422,0.769662,4859,55.54%,AL,Bibb County,1007,1007
33,2150.0,22808.0,25384.0,0.084699,0.898519,20658,81.38%,AL,Blount County,1009,1009


In [78]:
combData = pd.merge(elecData,re_censusDf,on="Fips Code")
combData

Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips,...,Associates,Bachelors,Masters,Professional Degree,Doctorate,Total Recorded Bachelor Degrees,Poverty Count,Household Income,Per Capita Income,Unemployment Count
0,18409.0,72780.0,94090.0,0.195653,0.773515,54371,57.79%,AL,Baldwin County,1003,...,15580.0,25524.0,12717.0,3720.0,1586.0,46324.0,23375.0,56732.0,29977.0,3360.0
1,13197.0,32803.0,47376.0,0.278559,0.692397,19606,41.38%,AL,Calhoun County,1015,...,5453.0,7295.0,4306.0,900.0,408.0,14224.0,18193.0,41687.0,23818.0,4778.0
2,3730.0,32734.0,37278.0,0.100059,0.878105,29004,77.80%,AL,Cullman County,1043,...,5733.0,6575.0,2595.0,453.0,109.0,10079.0,11524.0,39411.0,21237.0,1602.0
3,3682.0,21779.0,26086.0,0.141149,0.834892,18097,69.37%,AL,DeKalb County,1049,...,4671.0,3259.0,1591.0,244.0,149.0,5542.0,15029.0,35963.0,19215.0,1551.0
4,8436.0,27619.0,36905.0,0.228587,0.748381,19183,51.98%,AL,Elmore County,1051,...,5113.0,8896.0,4336.0,820.0,593.0,15601.0,11283.0,52579.0,26230.0,1743.0
5,10350.0,32132.0,43474.0,0.238073,0.739108,21782,50.10%,AL,Etowah County,1055,...,5453.0,6883.0,3283.0,620.0,289.0,11939.0,16955.0,41152.0,22117.0,2799.0
6,10547.0,30567.0,42030.0,0.250940,0.727266,20020,47.63%,AL,Houston County,1069,...,6759.0,9392.0,3734.0,940.0,451.0,15376.0,20571.0,42321.0,24013.0,3418.0
7,151581.0,130614.0,290111.0,0.522493,0.450221,20967,7.23%,AL,Jefferson County,1073,...,38288.0,89977.0,35518.0,14228.0,8108.0,158432.0,97105.0,50180.0,30674.0,22588.0
8,9877.0,27735.0,38813.0,0.254477,0.714580,17858,46.01%,AL,Lauderdale County,1077,...,4515.0,9044.0,3698.0,580.0,493.0,15047.0,12965.0,43427.0,24592.0,3064.0
9,20987.0,34321.0,57668.0,0.363928,0.595148,13334,23.12%,AL,Lee County,1081,...,7544.0,18716.0,10336.0,1487.0,3366.0,36320.0,29192.0,48056.0,25081.0,4625.0
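
`pd.merge` defaults to an inner join, so counties that appear in only one source are dropped silently. For example, Autauga County (FIPS 01001) is in the election data but not in the merged result. A minimal sketch of how to audit the join with `how="outer"` and `indicator=True`, using a few rows from the outputs above; the `election`, `census`, `audit`, and `unmatched` names are hypothetical:

```python
import pandas as pd

election = pd.DataFrame(
    {"Fips Code": ["01001", "01003"], "total_votes": [24661.0, 94090.0]}
)
census = pd.DataFrame(
    {"Fips Code": ["01003", "01015"], "Population": [208563.0, 114611.0]}
)

# An outer join keeps every key; the _merge column records where each came from.
audit = pd.merge(election, census, on="Fips Code", how="outer", indicator=True)

# Rows present in only one of the two sources.
unmatched = audit[audit["_merge"] != "both"]
```

Listing the unmatched rows in the report documents which counties the inner join leaves out.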


In [63]:
combData.dtypes

votes_dem                          float64
votes_gop                          float64
total_votes                        float64
per_dem                            float64
per_gop                            float64
diff                                object
per_point_diff                      object
state_abbr                          object
county_name                         object
combined_fips                        int64
Fips Code                           object
Name                                object
Median Age                         float64
Population                         float64
Population (Under 18)              float64
No Education                       float64
High School                        float64
GED                                float64
Associates                         float64
Bachelors                          float64
Masters                            float64
Professional Degree                float64
Doctorate                          float64
Total Recor
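
The `dtypes` output lists `diff` and `per_point_diff` as `object`, meaning they are stored as strings rather than numbers. A sketch of one way to convert them before loading, assuming `diff` contains thousands separators (an assumption; the source file is not shown) and `per_point_diff` holds strings such as `"49.48%"`:

```python
import pandas as pd

# Hypothetical sample; the comma in "12,202" is an assumed format.
sample = pd.DataFrame(
    {"diff": ["12,202", "583"], "per_point_diff": ["49.48%", "5.61%"]}
)

# Strip thousands separators, then parse as numbers.
sample["diff"] = pd.to_numeric(sample["diff"].str.replace(",", "", regex=False))

# Strip the percent sign and rescale to a fraction, matching per_dem and per_gop.
sample["per_point_diff"] = sample["per_point_diff"].str.rstrip("%").astype(float) / 100
```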

In [None]:
# Creating engine
engine = create_engine('sqlite://', echo=False)
combData.to_sql('election_db', con=engine)
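
After loading, it is worth reading the table back to confirm that the rows arrived and that the FIPS codes kept their leading zeros. A minimal sketch on a two-row sample; it uses the standard library's `sqlite3` connection rather than a SQLAlchemy engine so that it has no extra dependencies, and the `frame` and `loaded` names are hypothetical:

```python
import sqlite3

import pandas as pd

frame = pd.DataFrame(
    {"Fips Code": ["01003", "01015"], "total_votes": [94090.0, 47376.0]}
)

# In-memory database; the table disappears when the connection closes.
connection = sqlite3.connect(":memory:")
frame.to_sql("election_db", con=connection, index=False, if_exists="replace")

# Read the table back into a DataFrame to verify the load.
loaded = pd.read_sql("SELECT * FROM election_db", con=connection)
```

Passing `if_exists="replace"` lets the cell be rerun without failing because the table already exists.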

In [None]:
from sqlalchemy import create_engine, text

# Connection settings - placeholders, fill in your own values
user = ""
password = ""  # named `password` because `pass` is a reserved word in Python
host = ""
port = 3306  # default MySQL port
database = ""

# This engine is only used to query for the list of databases
mysql_engine = create_engine('mysql://{0}:{1}@{2}:{3}'.format(user, password, host, port))

with mysql_engine.connect() as conn:
    # Query for existing databases
    # Results are single-item rows, so unpack each one
    existing_databases = [d[0] for d in conn.execute(text("SHOW DATABASES;"))]

    # Create the database if it does not exist yet
    if database not in existing_databases:
        conn.execute(text("CREATE DATABASE {0}".format(database)))
        print("Created database {0}".format(database))

# Go ahead and use this engine
db_engine = create_engine('mysql://{0}:{1}@{2}:{3}/{4}'.format(user, password, host, port, database))
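
A hand-formatted connection URL breaks when the user name or password contains characters such as `@` or `/`. One common remedy is to percent-encode those parts with `urllib.parse.quote_plus`. The helper name `build_mysql_url` and the credentials below are hypothetical, for illustration only:

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database=None):
    """Build a MySQL URL with the user name and password percent-encoded."""
    url = "mysql://{0}:{1}@{2}:{3}".format(
        quote_plus(user), quote_plus(password), host, port
    )
    if database:
        url += "/{0}".format(database)
    return url


# Made-up credentials; the "@" and "/" in the password get encoded.
example_url = build_mysql_url("etl_user", "p@ss/word", "localhost", 3306, "election_db")
```

The resulting string can be passed to `create_engine` in place of the `.format(...)` calls above.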