## Module 5: Data cleaning and manipulation

In the previous module, we learned the basics of working with data and files in pandas. In this module, we will work with a real-world dataset and learn how to clean, select, filter, and slice its data, and how to merge and concatenate datasets.

Relevant reading: McKinney's Python for Data Analysis Chapters 7-8

### What are the data?  
We will be working with the [2017 5-year tract-level ACS](https://www.census.gov/data/developers/data-sets/acs-5year.2017.html) from the Census Bureau. The table below describes the dataset's variables, a set of socioeconomic measures for all Massachusetts census tracts:



##### Table 1: Variable Names and ACS Dataset


| variable | ACS variable code | description |
| --- | --- | --- |
| tot_pop | DP05_0001E | Total population |
| age2034 | DP05_0009PE | Percent of population 20–34 years old |
| age65up | DP05_0024PE | Percent of population 65 years and older |
| black | DP05_0078PE | Percent of population that is non-Hispanic Black/African American |
| hispanic | DP05_0070PE | Percent of population that is Hispanic/Latino |
| asian | DP05_0080PE | Percent of population that is non-Hispanic Asian |
| white | DP05_0077PE | Percent of population that is non-Hispanic White |
| pct_college_degree_higher | DP02_0067PE | Percent of population (25 years and older) with bachelor’s degree or higher |
| pct_college_grad_student | DP02_0057PE | Percent of population currently enrolled in college or graduate school |
| hhincome | DP03_0062E | Median household income (US dollars) |
| pct_male | DP05_0002PE | Percent of population that is male |
| pct_female | DP05_0003PE | Percent of population that is female |
| poverty | DP03_0128PE | Percent of families and people whose income in the past 12 months is below the poverty level |
| mean_commute_time | DP03_0025E | Workers 16 years and over: Mean travel time to work (minutes) |
| pct_english_only | DP02_0111PE | Percent of population with English as the only language spoken at home |
| pct_foreign_born | DP02_0092PE | Percent of population that is foreign born |
| median_rent | DP04_0134E | Occupied units paying rent: Median gross rent (US dollars) |

In [1]:
import numpy as np
import pandas as pd

## 1. Loading data

In [2]:
# load a data file
# note the relative filepath! where is this file located?
df = pd.read_csv('../data/acs_data_tracts_MA.csv', dtype={'GEOID':str})
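
The `dtype={'GEOID':str}` argument matters because identifier codes can begin with a zero, which is lost if pandas parses them as integers. The following is a minimal sketch using a made-up two-row CSV held in memory (the `csv_text` content and variable names are assumptions for illustration, not part of the course data):

```python
import io
import pandas as pd

# hypothetical two-row CSV; these rows are illustrative only
csv_text = "GEOID,total_pop\n01001020100,1900\n25001010100,2952\n"

# default parsing: GEOID is inferred as an integer column
as_default = pd.read_csv(io.StringIO(csv_text))

# explicit dtype: GEOID is kept as text
as_string = pd.read_csv(io.StringIO(csv_text), dtype={'GEOID': str})

print(as_default['GEOID'].iloc[0])  # leading zero has been dropped
print(as_string['GEOID'].iloc[0])   # leading zero is preserved
```

Massachusetts codes start with 25, so the problem would not show up in this file, but reading identifiers as strings is a safe habit when combining data across states.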

In [3]:
# dataframe shape as rows, columns
df.shape

(1478, 21)

In [4]:
# or use len to just see the number of rows
len(df)

1478

In [5]:
# view the dataframe's "head"
df.head()

,Unnamed: 0,GEOID,stateID,countyID,population_size,age2034,age65up,black,hispanic,asian,...,pct_college_degree_higher,pct_college_grad_student,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent
0,0,25001010100,25,1,2952,3.0,28.9,1.7,2952,1.6,...,48.8,47.4,47500.0,53.3,46.7,10.7,13.9,88.5,9.2,1120
1,1,25001010206,25,1,3171,8.2,38.1,2.2,3171,3.4,...,52.6,27.5,59042.0,44.3,55.7,11.3,22.6,95.3,7.8,1027
2,2,25001010208,25,1,1580,2.1,37.3,1.3,1580,0.0,...,45.9,9.5,62844.0,50.4,49.6,11.2,16.8,93.6,9.6,1019
3,3,25001010304,25,1,2332,7.7,43.4,0.9,2332,4.8,...,51.2,30.2,71250.0,44.2,55.8,4.8,23.5,93.7,7.0,980
4,4,25001010306,25,1,2576,4.4,33.2,3.5,2576,0.5,...,45.1,10.2,55694.0,47.7,52.3,8.2,17.8,96.9,5.0,1176


In [6]:
df.columns

Index(['Unnamed: 0', 'GEOID', 'stateID', 'countyID', 'population_size',
       'age2034', 'age65up', 'black', 'hispanic', 'asian', 'white',
       'pct_college_degree_higher', 'pct_college_grad_student', 'hhincome',
       'pct_male', 'pct_female', 'poverty', 'mean_commute_time',
       'pct_english_only', 'pct_foreign_born', 'median_rent'],
      dtype='object')

## 2. Clean and process data

In [7]:
# data types of the columns
df.dtypes

Unnamed: 0                     int64
GEOID                         object
stateID                        int64
countyID                       int64
population_size                int64
age2034                      float64
age65up                      float64
black                        float64
hispanic                       int64
asian                        float64
white                        float64
pct_college_degree_higher    float64
pct_college_grad_student     float64
hhincome                     float64
pct_male                     float64
pct_female                   float64
poverty                      float64
mean_commute_time            float64
pct_english_only             float64
pct_foreign_born             float64
median_rent                   object
dtype: object

In [8]:
# access a single column like df['col_name']
df['median_rent'].head(10)

0    1120
1    1027
2    1019
3     980
4    1176
5    1198
6     426
7    1049
8    1264
9     NaN
Name: median_rent, dtype: object

In [9]:
# pandas uses numpy's nan to represent null (missing) values
print(np.nan)
print(type(np.nan))

nan
<class 'float'>


In [10]:
# convert rent from string -> float
df['median_rent'].astype(float)

ValueError: could not convert string to float: '1283 (USD)'

Didn't work! We need to strip the stray non-numeric characters, such as ` (USD)`, to get a numerical value. You can use string operations on a pandas Series to clean up its values.

In [11]:
# do a string replace and assign back to that column, then change type to float
df['median_rent'] = df['median_rent'].str.replace(' (USD)', '', regex=False)
df['median_rent'] = df['median_rent'].astype(float)
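
When a column may contain more than one kind of stray text, `pd.to_numeric` with `errors='coerce'` is a possible alternative to `astype(float)`: anything that still cannot be parsed becomes `NaN` instead of raising an error. This is a sketch on a small made-up Series (the values are assumptions for illustration), not the approach the notebook itself uses:

```python
import pandas as pd

# hypothetical messy values, similar in spirit to the median_rent column
rent = pd.Series(['1120', '1283 (USD)', None, 'n/a'])

# strip the known suffix first, as in the cell above
cleaned = rent.str.replace(' (USD)', '', regex=False)

# anything that still is not a number becomes NaN
as_float = pd.to_numeric(cleaned, errors='coerce')

print(as_float)
print('nulls after conversion:', as_float.isna().sum())
```

Because coercion hides bad values silently, it is worth comparing the null count before and after the conversion.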

In [12]:
# convert rent from float -> int
df['median_rent'].astype(int)

ValueError: Cannot convert non-finite values (NA or inf) to integer

A standard (NumPy-backed) `int` column cannot store null values; only a `float` column can. You have three basic options:

  1. Keep the column as float to retain the nulls - they are often important!
  2. Drop all the rows that contain nulls if we need non-null data for our analysis
  3. Fill in all the nulls with another value if we know a reliable default value
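
The three options can be sketched on a small made-up Series (the values and variable names here are assumptions for illustration). The last line shows a fourth possibility that goes beyond the list above: pandas' nullable integer extension dtype, written `'Int64'` with a capital I, which can hold integers and missing values together.

```python
import numpy as np
import pandas as pd

rent = pd.Series([1120.0, 1027.0, np.nan], name='median_rent')

# option 1: keep the column as float so the null survives
kept = rent

# option 2: drop the nulls, after which an int conversion works
dropped = rent.dropna().astype(int)

# option 3: fill the nulls with a default value, then convert
filled = rent.fillna(0).astype(int)

# beyond the three options: the nullable integer extension dtype
nullable = rent.astype('Int64')

print(dropped.tolist())
print(filled.tolist())
print(nullable)
```

Note that filling with 0 changes the meaning of the data (a rent of zero is not the same as an unknown rent), which is why keeping the nulls is often the better choice.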

In [13]:
df.shape

(1478, 21)

In [14]:
# drop rows that contain nulls
# this doesn't save the result, because we didn't reassign (in practice, we want to keep the nulls here)
df.dropna(subset=['median_rent']).shape

(1415, 21)

In [15]:
# fill in rows that contain nulls
# this doesn't save the result, because we didn't reassign (in practice, we want to keep the nulls here)
df['median_rent'].fillna(value=0).head(10)

0    1120.0
1    1027.0
2    1019.0
3     980.0
4    1176.0
5    1198.0
6     426.0
7    1049.0
8    1264.0
9       0.0
Name: median_rent, dtype: float64

In [16]:
df['stateID']

0       25
1       25
2       25
3       25
4       25
        ..
1473    25
1474    25
1475    25
1476    25
1477    25
Name: stateID, Length: 1478, dtype: int64

In [17]:
# dict that maps state fips code -> state name
fips = {25 : 'MA'}

# replace fips code with state name with the replace() method
df['stateID'] = df['stateID'].replace(fips)
df['stateID']

0       MA
1       MA
2       MA
3       MA
4       MA
        ..
1473    MA
1474    MA
1475    MA
1476    MA
1477    MA
Name: stateID, Length: 1478, dtype: object
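
A related method is `map()`, which also accepts a dict but treats unmatched values differently. The sketch below uses a made-up Series with one code that is missing from the dict (the values are assumptions for illustration): `replace()` leaves the unmatched code untouched, while `map()` turns it into `NaN`.

```python
import pandas as pd

# hypothetical codes; 36 is deliberately absent from the lookup dict
codes = pd.Series([25, 25, 36])
fips = {25: 'MA'}

replaced = codes.replace(fips)  # unmatched values are kept as-is
mapped = codes.map(fips)        # unmatched values become NaN

print(replaced.tolist())
print(mapped.tolist())
```

Which behavior is preferable depends on whether an unmatched code should pass through quietly or stand out as missing.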

In [18]:
# you can rename columns with the rename() method
# remember to reassign to save the result
df = df.rename(columns={'stateID' : 'state_name'})

df = df.rename(columns={'population_size' : 'total_pop'})

In [19]:
# you can drop columns you don't need with the drop() method
# remember to reassign to save the result
df = df.drop(columns=['Unnamed: 0'])

In [20]:
# inspect the cleaned-up dataframe
df.head()

,GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,pct_college_degree_higher,pct_college_grad_student,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent
0,25001010100,MA,1,2952,3.0,28.9,1.7,2952,1.6,87.6,48.8,47.4,47500.0,53.3,46.7,10.7,13.9,88.5,9.2,1120.0
1,25001010206,MA,1,3171,8.2,38.1,2.2,3171,3.4,90.8,52.6,27.5,59042.0,44.3,55.7,11.3,22.6,95.3,7.8,1027.0
2,25001010208,MA,1,1580,2.1,37.3,1.3,1580,0.0,97.2,45.9,9.5,62844.0,50.4,49.6,11.2,16.8,93.6,9.6,1019.0
3,25001010304,MA,1,2332,7.7,43.4,0.9,2332,4.8,93.4,51.2,30.2,71250.0,44.2,55.8,4.8,23.5,93.7,7.0,980.0
4,25001010306,MA,1,2576,4.4,33.2,3.5,2576,0.5,90.7,45.1,10.2,55694.0,47.7,52.3,8.2,17.8,96.9,5.0,1176.0


In [21]:
# save it to disk as a "clean" copy
# note the relative filepath
df.to_csv('../data/acs_data_tracts_MA-clean.csv', index=False, encoding='utf-8')

## 3. Selecting and slicing data from a DataFrame

In [22]:
# CHEAT SHEET OF COMMON TASKS
# Operation                       Syntax           Result
#------------------------------------------------------------
# Select column by name           df[col]          Series
# Select columns by name          df[col_list]     DataFrame
# Select row by label             df.loc[label]    Series
# Select row by integer location  df.iloc[loc]     Series
# Slice rows by label             df.loc[a:c]      DataFrame
# Select rows by boolean vector   df[mask]         DataFrame
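
To make the "Result" column of the cheat sheet concrete, here is a small sketch that runs each operation on a made-up three-row DataFrame and prints the type that comes back (the `toy` frame and its labels are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {'total_pop': [2952, 3171, 1580], 'hhincome': [47500.0, 59042.0, 62844.0]},
    index=['a', 'b', 'c'],
)

results = {
    "toy['total_pop']": toy['total_pop'],
    "toy[['total_pop', 'hhincome']]": toy[['total_pop', 'hhincome']],
    "toy.loc['a']": toy.loc['a'],
    "toy.iloc[0]": toy.iloc[0],
    "toy.loc['a':'b']": toy.loc['a':'b'],
    "toy[toy['total_pop'] > 2000]": toy[toy['total_pop'] > 2000],
}

# print each expression next to the type of object it returns
for expr, value in results.items():
    print(f'{expr:<32} {type(value).__name__}')
```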

### 3a. Select DataFrame's column(s) by name

We saw some of this a minute ago. Let's look in a bit more detail and break down what's happening.

In [23]:
# select a single column by column name
# this is a pandas series
df['total_pop']

0       2952
1       3171
2       1580
3       2332
4       2576
        ... 
1473    3406
1474    5199
1475    5601
1476    3467
1477    6542
Name: total_pop, Length: 1478, dtype: int64

In [24]:
# select multiple columns by a list of column names
# this is a pandas dataframe that is a subset of the original
df[['total_pop', 'hhincome']]

,total_pop,hhincome
0,2952,47500.0
1,3171,59042.0
2,1580,62844.0
3,2332,71250.0
4,2576,55694.0
...,...,...
1473,3406,64219.0
1474,5199,68490.0
1475,5601,87067.0
1476,3467,93750.0


In [25]:
# create a new column by assigning df['new_col'] to some set of values
# you can do math operations on any numeric columns
df['monthly_income'] = df['hhincome'] / 12
df['rent_burden'] = df['median_rent'] / df['monthly_income']

# inspect the results
df[['hhincome', 'monthly_income', 'median_rent', 'rent_burden']].head()

,hhincome,monthly_income,median_rent,rent_burden
0,47500.0,3958.333333,1120.0,0.282947
1,59042.0,4920.166667,1027.0,0.208733
2,62844.0,5237.0,1019.0,0.194577
3,71250.0,5937.5,980.0,0.165053
4,55694.0,4641.166667,1176.0,0.253385


### 3b. Select row(s) by label

In [26]:
# use .loc to select by row label
# returns the row as a series whose index is the dataframe column names
df.loc[0]

GEOID                        25001010100
state_name                            MA
countyID                               1
total_pop                           2952
age2034                              3.0
age65up                             28.9
black                                1.7
hispanic                            2952
asian                                1.6
white                               87.6
pct_college_degree_higher           48.8
pct_college_grad_student            47.4
hhincome                         47500.0
pct_male                            53.3
pct_female                          46.7
poverty                             10.7
mean_commute_time                   13.9
pct_english_only                    88.5
pct_foreign_born                     9.2
median_rent                       1120.0
monthly_income               3958.333333
rent_burden                     0.282947
Name: 0, dtype: object

In [27]:
# use .loc to select single value by row label, column name
df.loc[0, 'poverty']

10.7

In [28]:
# slice of rows from label 5 to label 7, inclusive
# this returns a pandas dataframe
df.loc[5:7]

,GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,...,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent,monthly_income,rent_burden
5,25001010400,MA,1,3037,2.6,35.2,6.4,3037,0.9,90.0,...,48442.0,44.1,55.9,6.3,19.3,90.9,9.4,1198.0,4036.833333,0.296767
6,25001010500,MA,1,2790,2.3,42.9,0.2,2790,0.0,98.7,...,76087.0,47.1,52.9,8.2,15.9,96.7,4.1,426.0,6340.583333,0.067186
7,25001010600,MA,1,3091,5.5,41.8,2.8,3091,0.6,88.9,...,73625.0,46.8,53.2,7.8,18.7,92.6,5.9,1049.0,6135.416667,0.170975


In [29]:
# slice of rows from label 1 to label 3, inclusive
# slice of columns from hispanic to white, inclusive
df.loc[1:3, 'hispanic':'white']

,hispanic,asian,white
1,3171,3.4,90.8
2,1580,0.0,97.2
3,2332,4.8,93.4


In [30]:
# subset of rows with labels in a list
# subset of columns with names in list
df.loc[[1, 3], ['hispanic', 'white']]

,hispanic,white
1,3171,90.8
3,2332,93.4


In [31]:
# you can use a column of unique identifiers as the index
# fips codes uniquely identify each row (but verify!)
df = df.set_index('GEOID')
df.index.is_unique

True

In [32]:
df.head()

GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,pct_college_degree_higher,...,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent,monthly_income,rent_burden
25001010100,MA,1,2952,3.0,28.9,1.7,2952,1.6,87.6,48.8,...,47500.0,53.3,46.7,10.7,13.9,88.5,9.2,1120.0,3958.333333,0.282947
25001010206,MA,1,3171,8.2,38.1,2.2,3171,3.4,90.8,52.6,...,59042.0,44.3,55.7,11.3,22.6,95.3,7.8,1027.0,4920.166667,0.208733
25001010208,MA,1,1580,2.1,37.3,1.3,1580,0.0,97.2,45.9,...,62844.0,50.4,49.6,11.2,16.8,93.6,9.6,1019.0,5237.0,0.194577
25001010304,MA,1,2332,7.7,43.4,0.9,2332,4.8,93.4,51.2,...,71250.0,44.2,55.8,4.8,23.5,93.7,7.0,980.0,5937.5,0.165053
25001010306,MA,1,2576,4.4,33.2,3.5,2576,0.5,90.7,45.1,...,55694.0,47.7,52.3,8.2,17.8,96.9,5.0,1176.0,4641.166667,0.253385


In [33]:
# .loc works by label, not by position in the dataframe
df.loc[0]

KeyError: 0

In [34]:
# the index now contains fips codes, so you have to use .loc accordingly to select by row label
df.loc['25001010100']

state_name                            MA
countyID                               1
total_pop                           2952
age2034                              3.0
age65up                             28.9
black                                1.7
hispanic                            2952
asian                                1.6
white                               87.6
pct_college_degree_higher           48.8
pct_college_grad_student            47.4
hhincome                         47500.0
pct_male                            53.3
pct_female                          46.7
poverty                             10.7
mean_commute_time                   13.9
pct_english_only                    88.5
pct_foreign_born                     9.2
median_rent                       1120.0
monthly_income               3958.333333
rent_burden                     0.282947
Name: 25001010100, dtype: object

### 3c. Select by (integer) position

In [35]:
# get the row in the zero-th position in the dataframe
df.iloc[0]

state_name                            MA
countyID                               1
total_pop                           2952
age2034                              3.0
age65up                             28.9
black                                1.7
hispanic                            2952
asian                                1.6
white                               87.6
pct_college_degree_higher           48.8
pct_college_grad_student            47.4
hhincome                         47500.0
pct_male                            53.3
pct_female                          46.7
poverty                             10.7
mean_commute_time                   13.9
pct_english_only                    88.5
pct_foreign_born                     9.2
median_rent                       1120.0
monthly_income               3958.333333
rent_burden                     0.282947
Name: 25001010100, dtype: object

In [36]:
# you can slice as well
# note, while .loc[] is inclusive, .iloc[] is not
# get the rows from position 0 up to but not including position 3 (ie, rows 0, 1, and 2)
df.iloc[0:3]

GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,pct_college_degree_higher,...,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent,monthly_income,rent_burden
25001010100,MA,1,2952,3.0,28.9,1.7,2952,1.6,87.6,48.8,...,47500.0,53.3,46.7,10.7,13.9,88.5,9.2,1120.0,3958.333333,0.282947
25001010206,MA,1,3171,8.2,38.1,2.2,3171,3.4,90.8,52.6,...,59042.0,44.3,55.7,11.3,22.6,95.3,7.8,1027.0,4920.166667,0.208733
25001010208,MA,1,1580,2.1,37.3,1.3,1580,0.0,97.2,45.9,...,62844.0,50.4,49.6,11.2,16.8,93.6,9.6,1019.0,5237.0,0.194577


In [37]:
# get the value from the row in position 3 and the column in position 2 (zero-indexed)
df.iloc[3, 2]

2332
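
The difference between label-based and position-based slicing is easiest to see when the labels happen to be integers. A minimal sketch on a made-up DataFrame with the default integer index (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# default index: labels 0, 1, 2, 3 coincide with positions 0, 1, 2, 3
toy = pd.DataFrame({'total_pop': [2952, 3171, 1580, 2332]})

by_label = toy.loc[0:2]      # labels 0 through 2, end label included
by_position = toy.iloc[0:2]  # positions 0 and 1, end position excluded

print(len(by_label))     # number of rows from the .loc slice
print(len(by_position))  # number of rows from the .iloc slice
```

The same slice bounds return three rows with `.loc` and two rows with `.iloc`, so it helps to be explicit about which one you mean.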

### 3d. Select/filter by value

You can subset or filter a dataframe based on the values in its rows/columns.

In [38]:
# filter the dataframe to rows with a rent burden above 30%
df[df['rent_burden'] > 0.3]

GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,pct_college_degree_higher,...,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent,monthly_income,rent_burden
25001012602,MA,1,4920,5.0,12.1,17.6,4920,0.6,64.2,10.5,...,45206.0,45.5,54.5,21.6,16.8,72.3,23.3,1248.0,3767.166667,0.331283
25001013100,MA,1,5174,5.0,17.8,2.3,5174,0.0,95.8,39.3,...,80881.0,49.8,50.2,6.8,22.3,97.4,3.2,2202.0,6740.083333,0.326702
25001013200,MA,1,4821,5.9,24.5,0.2,4821,0.0,93.8,53.3,...,71641.0,46.0,54.0,4.5,28.0,86.3,8.1,1878.0,5970.083333,0.314568
25001014100,MA,1,881,8.6,0.0,3.5,881,2.2,80.4,33.0,...,48750.0,54.6,45.4,9.0,13.3,89.6,3.2,1240.0,4062.500000,0.305231
25001015300,MA,1,3109,10.8,11.5,9.6,3109,3.0,77.2,13.0,...,33979.0,53.0,47.0,23.7,16.3,66.5,28.5,1030.0,2831.583333,0.363754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25027732500,MA,27,1953,11.0,5.6,6.8,1953,10.1,36.4,15.5,...,27895.0,53.6,46.4,39.6,20.8,57.5,24.3,983.0,2324.583333,0.422871
25027732700,MA,27,3766,5.6,12.8,19.5,3766,5.0,39.8,13.5,...,29875.0,50.1,49.9,30.0,28.2,69.8,18.0,1073.0,2489.583333,0.430996
25027733000,MA,27,3737,6.0,8.7,7.6,3737,9.9,41.0,10.4,...,40319.0,50.6,49.4,28.0,25.8,54.9,26.0,1142.0,3359.916667,0.339889
25027757200,MA,27,2558,11.9,9.0,2.3,2558,0.9,40.3,8.2,...,30724.0,47.3,52.7,32.9,22.8,57.6,4.7,849.0,2560.333333,0.331597


In [39]:
# what exactly did that do? let's break it out.
df['rent_burden'] > 0.3

GEOID
25001010100    False
25001010206    False
25001010208    False
25001010304    False
25001010306    False
               ...  
25027760100    False
25027761100    False
25027761200    False
25027761300    False
25027761400    False
Name: rent_burden, Length: 1478, dtype: bool

In [40]:
# essentially a true/false mask that filters by value
mask = df['rent_burden'] > 0.3
df[mask]

GEOID,state_name,countyID,total_pop,age2034,age65up,black,hispanic,asian,white,pct_college_degree_higher,...,hhincome,pct_male,pct_female,poverty,mean_commute_time,pct_english_only,pct_foreign_born,median_rent,monthly_income,rent_burden
25001012602,MA,1,4920,5.0,12.1,17.6,4920,0.6,64.2,10.5,...,45206.0,45.5,54.5,21.6,16.8,72.3,23.3,1248.0,3767.166667,0.331283
25001013100,MA,1,5174,5.0,17.8,2.3,5174,0.0,95.8,39.3,...,80881.0,49.8,50.2,6.8,22.3,97.4,3.2,2202.0,6740.083333,0.326702
25001013200,MA,1,4821,5.9,24.5,0.2,4821,0.0,93.8,53.3,...,71641.0,46.0,54.0,4.5,28.0,86.3,8.1,1878.0,5970.083333,0.314568
25001014100,MA,1,881,8.6,0.0,3.5,881,2.2,80.4,33.0,...,48750.0,54.6,45.4,9.0,13.3,89.6,3.2,1240.0,4062.500000,0.305231
25001015300,MA,1,3109,10.8,11.5,9.6,3109,3.0,77.2,13.0,...,33979.0,53.0,47.0,23.7,16.3,66.5,28.5,1030.0,2831.583333,0.363754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25027732500,MA,27,1953,11.0,5.6,6.8,1953,10.1,36.4,15.5,...,27895.0,53.6,46.4,39.6,20.8,57.5,24.3,983.0,2324.583333,0.422871
25027732700,MA,27,3766,5.6,12.8,19.5,3766,5.0,39.8,13.5,...,29875.0,50.1,49.9,30.0,28.2,69.8,18.0,1073.0,2489.583333,0.430996
25027733000,MA,27,3737,6.0,8.7,7.6,3737,9.9,41.0,10.4,...,40319.0,50.6,49.4,28.0,25.8,54.9,26.0,1142.0,3359.916667,0.339889
25027757200,MA,27,2558,11.9,9.0,2.3,2558,0.9,40.3,8.2,...,30724.0,47.3,52.7,32.9,22.8,57.6,4.7,849.0,2560.333333,0.331597


In [41]:
# you can chain multiple conditions together
# pandas logical operators are: | for or, & for and, ~ for not
# these must be grouped by using parentheses due to order of operations
# question: which tracts are both rent-burdened and majority-Black?
mask = (df['rent_burden'] > 0.3) & (df['black'] > 50)
df[mask].shape

(29, 21)
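
The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`. A small sketch on a made-up DataFrame (the `toy` values are assumptions for illustration) shows the grouped version working and the ungrouped version raising an error:

```python
import pandas as pd

toy = pd.DataFrame({
    'rent_burden': [0.35, 0.25, 0.40],
    'black': [60.0, 70.0, 10.0],
})

# grouped correctly: each comparison is evaluated before the &
mask = (toy['rent_burden'] > 0.3) & (toy['black'] > 50)
print(mask.tolist())

# without parentheses, Python tries to evaluate 0.3 & toy['black'] first
failed = False
try:
    toy['rent_burden'] > 0.3 & toy['black'] > 50
except (TypeError, ValueError) as err:
    failed = True
    print('without parentheses:', type(err).__name__)
```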

In [42]:
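# which tracts are both rent-burdened and either majority-Black or majority-Hispanic?
# caution: in this file the hispanic column is int64 and matches total_pop in the rows shown above,
# so verify that it holds percentages before reading "> 50" as majority-Hispanic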
mask1 = df['rent_burden'] > 0.3
mask2 = df['black'] > 50
mask3 = df['hispanic'] > 50
mask = mask1 & (mask2 | mask3)
df[mask].shape

(222, 21)

In [43]:
# see the mask
mask

GEOID
25001010100    False
25001010206    False
25001010208    False
25001010304    False
25001010306    False
               ...  
25027760100    False
25027761100    False
25027761200    False
25027761300    False
25027761400    False
Length: 1478, dtype: bool

In [44]:
# ~ means not... it essentially flips trues to falses and vice-versa
~mask

GEOID
25001010100    True
25001010206    True
25001010208    True
25001010304    True
25001010306    True
               ... 
25027760100    True
25027761100    True
25027761200    True
25027761300    True
25027761400    True
Length: 1478, dtype: bool

In [46]:
# now it's your turn
# create a new subset dataframe containing all the rows with median household income above $60,000 and percent-White above 60%
# how many rows did you get?
len(df[(df['hhincome'] > 60000) & (df['white'] > 60)])

907

## 4. Merge and concatenate

### 4a. Merging DataFrames

In [None]:
# create a subset dataframe with only race/ethnicity variables
race_cols = ['asian', 'black', 'hispanic', 'white']
df_race = df[race_cols]
df_race.head()

In [None]:
# create a subset dataframe with only economic variables
econ_cols = ['median_rent', 'hhincome']
df_econ = df[econ_cols].sort_values('hhincome')
df_econ.head()

In [None]:
# merge them together, aligning rows based on their labels in the index
df_merged = pd.merge(left=df_econ, right=df_race, how='inner', left_index=True, right_index=True)
df_merged.head()

In [None]:
# now it's your turn
# change the "how" argument: what happens if you try an "outer" join? or a "left" join? or a "right" join?
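
As a point of comparison for this exercise, here is a sketch of the four join types on two made-up DataFrames whose indexes only partly overlap (the frames and their labels are assumptions for illustration). In the notebook, `df_econ` and `df_race` are both built from the same `df`, so they share every index label and the four joins should return the same set of rows; the differences only appear when some labels are missing from one side.

```python
import pandas as pd

# hypothetical frames: labels B and C appear in both, A and D in only one
econ = pd.DataFrame({'hhincome': [47500.0, 59042.0, 62844.0]}, index=['A', 'B', 'C'])
race = pd.DataFrame({'white': [90.8, 97.2, 93.4]}, index=['B', 'C', 'D'])

indexes = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left=econ, right=race, how=how, left_index=True, right_index=True)
    indexes[how] = sorted(merged.index)
    print(f'{how:<6} {indexes[how]}')
```

Rows that exist on only one side are kept by the matching join type and filled with `NaN` in the columns that come from the other side.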


In [None]:
# reset df_econ's index
df_econ = df_econ.reset_index()
df_econ.head()

In [None]:
# merge them together, aligning rows based on their labels in the index
# doesn't work! their indexes do not share any labels to match/align the rows
df_merged = pd.merge(left=df_econ, right=df_race, how='inner', left_index=True, right_index=True)
df_merged

In [None]:
# instead merge where df_race index matches df_econ GEOID column
df_merged = pd.merge(left=df_econ, right=df_race, how='inner', left_on='GEOID', right_index=True)
df_merged.head()
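
The same pattern can be sketched on two tiny made-up DataFrames, one holding the key in a column and the other holding it in the index (the frames below are assumptions for illustration, reusing two GEOID values from the output above). Note that the rows are in a different order on each side; the merge matches them by key, not by position.

```python
import pandas as pd

# key stored as a regular column, rows deliberately out of order
econ = pd.DataFrame({
    'GEOID': ['25001010206', '25001010100'],
    'hhincome': [59042.0, 47500.0],
})

# key stored in the index
race = pd.DataFrame({'white': [87.6, 90.8]}, index=['25001010100', '25001010206'])

merged = pd.merge(left=econ, right=race, how='inner', left_on='GEOID', right_index=True)
print(merged)
```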

### 4b. Concatenating DataFrames

In [None]:
# select data within Suffolk County (county ID 25)
df_suffolk = df[df['countyID']==25]

# select data within Middlesex County (county ID 17)
df_middlesex = df[df['countyID']==17]

In [None]:
# merging joins data together by aligning rows on matching labels, while concatenating simply stacks the dataframes along an axis
df_all = pd.concat([df_middlesex, df_suffolk], sort=False)
df_all
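
A minimal sketch of row-wise concatenation on two made-up DataFrames (the frames and index labels are assumptions for illustration). By default `pd.concat` keeps each frame's original index labels; passing `ignore_index=True` replaces them with a fresh 0..n-1 index, which is only appropriate when the labels carry no meaning.

```python
import pandas as pd

# hypothetical county subsets with the same columns
middlesex = pd.DataFrame({'countyID': [17, 17], 'total_pop': [3406, 5199]}, index=['m1', 'm2'])
suffolk = pd.DataFrame({'countyID': [25], 'total_pop': [5601]}, index=['s1'])

# stack rows, keeping the original index labels
stacked = pd.concat([middlesex, suffolk])

# stack rows and renumber the index from zero
renumbered = pd.concat([middlesex, suffolk], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```

In the notebook the index holds GEOID codes, so keeping the labels (the default) is the sensible choice there.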

## Summary
1. The basics of numpy arrays and pandas dataframes
2. Loading and working with files in pandas
    - Selecting, filtering, and slicing data
    - Saving a dataframe as a file
    - Other dataframe functionality
3. Data cleaning and processing
4. Merging and concatenating dataframes
    

## Assignment 3

See instructions on Canvas.