# Data Science with Python, Part 1

Data science is a broad term. Wikipedia defines it as follows:

"Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD)."

This course does not go deeply into statistics or machine learning. It focuses on the first part of this definition: extracting insights from structured and unstructured data.

Chapter 9 of *Python for Data Analysis* demonstrates a variety of methods for analyzing data with aggregation and grouping operations. Those methods are the focus of this session.

We start by experimenting with the datasets we used in the data visualization session.

In [31]:
import pandas as pd

# Load the census SF1 extract from the HDF5 store
sf1store = pd.HDFStore('bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']

# Derived percentage columns: each divides a count by the total in its table
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100

# arealand is in square meters; 2,589,988 square meters is one square mile
sf1['pop_sqmi'] = sf1['P0010001'] / (sf1['arealand'] / 2589988)

# Keep only blocks with a nonzero population
sf1 = sf1[sf1['P0030001'] > 0]
sf1[:5]

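| | logrecno | blockfips | state | county | tract | blkgrp | block | arealand | P0010001 | P0020001 | ... | H0050006 | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 26 | 60014271001001 | 6 | 1 | 427100 | 1 | 1001 | 79696 | 113 | 113 | ... | 0 | 0 | 0 | 80.0 | 13.274336 | 5.309735 | 78.761062 | 1.769912 | 0.0 | 3672.312839 |
| 3 | 28 | 60014271001003 | 6 | 1 | 427100 | 1 | 1003 | 19546 | 29 | 29 | ... | 0 | 0 | 2 | 70.0 | 13.793103 | 27.586207 | 37.931034 | 24.137931 | 23.076923 | 3842.712166 |
| 4 | 29 | 60014271001004 | 6 | 1 | 427100 | 1 | 1004 | 14364 | 26 | 26 | ... | 0 | 0 | 0 | 75.0 | 0.0 | 38.461538 | 34.615385 | 0.0 | 0.0 | 4688.087441 |
| 6 | 31 | 60014271001006 | 6 | 1 | 427100 | 1 | 1006 | 1281 | 2 | 2 | ... | 0 | 0 | 0 | 100.0 | 0.0 | 50.0 | 0.0 | 100.0 | 0.0 | 4043.697112 |
| 7 | 32 | 60014271001007 | 6 | 1 | 427100 | 1 | 1007 | 19020 | 30 | 30 | ... | 0 | 0 | 0 | 33.333333 | 0.0 | 43.333333 | 50.0 | 0.0 | 10.0 | 4085.154574 |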
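
The percentage columns are plain element-wise divisions, so a zero denominator does not raise an error in pandas. The sketch below uses a hypothetical toy frame (the column names `renter_hh` and `total_hh` are made up for illustration) to show what a zero denominator produces and why filtering on a positive total, as the cell above does for population, can be useful.

```python
import pandas as pd

# Hypothetical toy data, not taken from the census extract
toy = pd.DataFrame({
    'renter_hh': [8, 0, 3],   # made-up renter-occupied counts
    'total_hh': [10, 0, 0],   # made-up occupied-unit totals
})

# Same pattern as the pct_* columns: count / total * 100
toy['pct_rent'] = toy['renter_hh'] / toy['total_hh'] * 100
print(toy)  # 0/0 shows up as NaN, and a nonzero count over 0 as inf

# Keeping only rows with a positive denominator removes both cases
clean = toy[toy['total_hh'] > 0]
print(clean)
```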


## Groupby operations

Groupby is a powerful pandas method that follows the split-apply-combine approach to data. Figure 9-1 illustrates this with a sum operation: the data is first split into groups that share the same key values, then an operation (here, a sum) is applied to each group, and finally the results are combined into a single output.

![Groupby Operations](groupby.png "Groupby")
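
To make the three stages concrete, here is a minimal sketch that performs split, apply, and combine by hand and then repeats the computation with a single groupby call. The toy values and the names `toy`, `pieces`, `manual`, and `auto` are assumptions chosen for illustration, not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data with a repeating key column
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': [0, 5, 10, 5, 10, 15],
})

# Split: collect the values that share each key
pieces = {key: group['data'] for key, group in toy.groupby('key')}

# Apply: run the operation (a sum) on each piece independently
sums = {key: values.sum() for key, values in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key')

# The groupby call performs the same three steps in one expression
auto = toy.groupby('key')['data'].sum()

print(manual)
print(auto)
```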

Let's apply this approach to computing total population in each county in our dataset.  We can do this in two steps to help explain what is happening.  First we create a groupby object, using county codes to group all the census blocks in sf1 into groups that share the same county code.

In [32]:
grouped = sf1['P0010001'].groupby(sf1['county'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x115f60860>

Now that we have this grouping object that represents the **split** part of the workflow in the figure above, we can **apply** operations and **combine** the results using methods like sum:

In [33]:
grouped.sum()

county
001    1510271
013    1049025
041     252409
055     136484
075     805235
081     718451
085    1781642
095     413344
097     483878
Name: P0010001, dtype: int64
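
Sum is only one of the operations a grouping object supports. As a rough sketch on a small stand-in frame (the `blocks` data below is hypothetical, with made-up populations), the same object can report counts and means, and `agg` can apply several operations at once:

```python
import pandas as pd

# Hypothetical stand-in for sf1: five blocks in two counties
blocks = pd.DataFrame({
    'county': ['001', '001', '001', '075', '075'],
    'P0010001': [113, 29, 26, 40, 60],
})

grouped = blocks['P0010001'].groupby(blocks['county'])

print(grouped.count())  # number of blocks in each county
print(grouped.mean())   # average block population in each county

# agg() takes a list of operations and returns one column per operation
summary = grouped.agg(['sum', 'mean', 'max'])
print(summary)
```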

Let's add county names to the dataframe so we get more readable output, and rerun this aggregation.

In [34]:
county_names = {'001': 'Alameda', '013': 'Contra Costa', '041': 'Marin', '055': 'Napa', '075': 'San Francisco',
                '081': 'San Mateo', '085': 'Santa Clara', '095': 'Solano', '097': 'Sonoma'}

In [35]:
sf1.insert(4, 'county_name', sf1['county'].replace(county_names))
sf1[:5]

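| | logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050006 | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 26 | 60014271001001 | 6 | 1 | Alameda | 427100 | 1 | 1001 | 79696 | 113 | ... | 0 | 0 | 0 | 80.0 | 13.274336 | 5.309735 | 78.761062 | 1.769912 | 0.0 | 3672.312839 |
| 3 | 28 | 60014271001003 | 6 | 1 | Alameda | 427100 | 1 | 1003 | 19546 | 29 | ... | 0 | 0 | 2 | 70.0 | 13.793103 | 27.586207 | 37.931034 | 24.137931 | 23.076923 | 3842.712166 |
| 4 | 29 | 60014271001004 | 6 | 1 | Alameda | 427100 | 1 | 1004 | 14364 | 26 | ... | 0 | 0 | 0 | 75.0 | 0.0 | 38.461538 | 34.615385 | 0.0 | 0.0 | 4688.087441 |
| 6 | 31 | 60014271001006 | 6 | 1 | Alameda | 427100 | 1 | 1006 | 1281 | 2 | ... | 0 | 0 | 0 | 100.0 | 0.0 | 50.0 | 0.0 | 100.0 | 0.0 | 4043.697112 |
| 7 | 32 | 60014271001007 | 6 | 1 | Alameda | 427100 | 1 | 1007 | 19020 | 30 | ... | 0 | 0 | 0 | 33.333333 | 0.0 | 43.333333 | 50.0 | 0.0 | 10.0 | 4085.154574 |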


We can print the results of summing population by county:

In [36]:
print('Total Population by County:')
print(sf1['P0010001'].groupby(sf1['county_name']).sum())

Total Population by County:
county_name
Alameda          1510271
Contra Costa     1049025
Marin             252409
Napa              136484
San Francisco     805235
San Mateo         718451
Santa Clara      1781642
Solano            413344
Sonoma            483878
Name: P0010001, dtype: int64


Alternatively, we can capture the result in a DataFrame for use in later processing, such as merging it back onto the original DataFrame.

In [37]:
county_pop = sf1['P0010001'].groupby(sf1['county_name']).sum().to_frame(name='county_pop')
county_pop

| county_name | county_pop |
| --- | --- |
| Alameda | 1510271 |
| Contra Costa | 1049025 |
| Marin | 252409 |
| Napa | 136484 |
| San Francisco | 805235 |
| San Mateo | 718451 |
| Santa Clara | 1781642 |
| Solano | 413344 |
| Sonoma | 483878 |
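
The same one-column DataFrame can also be built by grouping the DataFrame itself by column name. The sketch below compares the two spellings on a hypothetical `blocks` frame with made-up populations; which one reads better is largely a matter of taste.

```python
import pandas as pd

# Hypothetical stand-in for sf1
blocks = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda', 'Marin', 'Marin', 'Napa'],
    'P0010001': [113, 29, 26, 2, 30],
})

# Series groupby followed by to_frame(), as in the cell above
a = blocks['P0010001'].groupby(blocks['county_name']).sum().to_frame(name='county_pop')

# DataFrame groupby by column name; selecting with a list (double brackets)
# keeps the result a DataFrame, which is then renamed
b = (blocks.groupby('county_name')[['P0010001']]
           .sum()
           .rename(columns={'P0010001': 'county_pop'}))

print(a)
print(b)
```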


Here we merge the county total population back onto the original DataFrame, matching the county_name column of sf1 against the index of county_pop.

In [38]:
sf1 = pd.merge(sf1,county_pop, left_on='county_name', right_index=True)
sf1[:5]

| | logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | county_pop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 26 | 60014271001001 | 6 | 1 | Alameda | 427100 | 1 | 1001 | 79696 | 113 | ... | 0 | 0 | 80.0 | 13.274336 | 5.309735 | 78.761062 | 1.769912 | 0.0 | 3672.312839 | 1510271 |
| 3 | 28 | 60014271001003 | 6 | 1 | Alameda | 427100 | 1 | 1003 | 19546 | 29 | ... | 0 | 2 | 70.0 | 13.793103 | 27.586207 | 37.931034 | 24.137931 | 23.076923 | 3842.712166 | 1510271 |
| 4 | 29 | 60014271001004 | 6 | 1 | Alameda | 427100 | 1 | 1004 | 14364 | 26 | ... | 0 | 0 | 75.0 | 0.0 | 38.461538 | 34.615385 | 0.0 | 0.0 | 4688.087441 | 1510271 |
| 6 | 31 | 60014271001006 | 6 | 1 | Alameda | 427100 | 1 | 1006 | 1281 | 2 | ... | 0 | 0 | 100.0 | 0.0 | 50.0 | 0.0 | 100.0 | 0.0 | 4043.697112 | 1510271 |
| 7 | 32 | 60014271001007 | 6 | 1 | Alameda | 427100 | 1 | 1007 | 19020 | 30 | ... | 0 | 0 | 33.333333 | 0.0 | 43.333333 | 50.0 | 0.0 | 10.0 | 4085.154574 | 1510271 |
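
Aggregating and then merging is one way to attach a group total to every row. A possible alternative, sketched here on a hypothetical `blocks` frame rather than on sf1, is the groupby `transform` method, which returns one value per original row so the result can be assigned directly as a new column.

```python
import pandas as pd

# Hypothetical stand-in for sf1
blocks = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda', 'Marin', 'Marin', 'Napa'],
    'P0010001': [113, 29, 26, 2, 30],
})

# Aggregate, then merge the totals back on (the two-step approach)
county_pop = blocks['P0010001'].groupby(blocks['county_name']).sum().to_frame(name='county_pop')
merged = pd.merge(blocks, county_pop, left_on='county_name', right_index=True)

# transform('sum') broadcasts each group's total back to its rows
blocks['county_pop'] = blocks.groupby('county_name')['P0010001'].transform('sum')

print(merged)
print(blocks)
```

The merge approach keeps the county totals available as a separate table, while `transform` skips the intermediate DataFrame when only the new column is needed.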


## Your turn:

Compute total households and total land area per county, and merge them onto sf1 as county_hhlds and county_landarea.