# Final Project Exploratory Data Analysis

Do your EDA in this notebook!

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

I imported the 2015 American Housing Survey (AHS) Public Use Flat Files from Census.gov: the National Public Use File (ahs2015n.csv) and the Metropolitan Public Use File (ahs2015m.csv), both downloaded from https://www.census.gov/programs-surveys/ahs/data/2015/ahs-2015-public-use-file--puf-.html. To label axes with descriptive names, I also imported the value labels table (AHS 2015 Value Labels.csv) to use as a data dictionary.
To convert the numeric OMB13CBSA codes to descriptive metropolitan area names, I imported a code crosswalk from https://public.opendatasoft.com/explore/dataset/core-based-statistical-areas-cbsas-and-combined-statistical-areas-csas/download/?format=xls&timezone=America/Los_Angeles&lang=en&use_labels_for_header=true, because the code set maintained by the OMB is only available as a PDF file.

To improve performance, I import only the columns needed for this analysis rather than all 2995 available.
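A small sketch of how the column selection could be checked before the full load: `pd.read_csv` raises an error if `usecols` names a column the file does not have, so reading only the header first (`nrows=0`) is a cheap way to catch typos. The CSV text and the `wanted` list below are made-up stand-ins, not the real AHS file.

```python
import io
import pandas as pd

# Toy CSV standing in for a wide AHS file (hypothetical contents).
csv_text = (
    "CONTROL,OMB13CBSA,BEDROOMS,EXTRA1,EXTRA2\n"
    "'1','37980',3,0,0\n"
    "'2','99998',2,1,1\n"
)

# Columns we would like to load; NUMPEOPLE is deliberately absent from the toy file.
wanted = ['CONTROL', 'OMB13CBSA', 'BEDROOMS', 'NUMPEOPLE']

# nrows=0 parses only the header row, so this stays cheap for a very wide file.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

missing = [c for c in wanted if c not in header]
present = [c for c in wanted if c in header]

# Load only the columns that actually exist.
df = pd.read_csv(io.StringIO(csv_text), usecols=present)
print(missing)
print(df.shape)
```

With the real files, `io.StringIO(csv_text)` would be replaced by the file path.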

In [23]:
columns = ['CONTROL', 'OMB13CBSA', 'VACANCY', 'BLD', 'UNITSIZE', 'TOTROOMS', 'BEDROOMS',
           'BATHROOMS', 'HSHLDTYPE', 'NUMPEOPLE', 'NUMADULTS', 'NUMELDERS', 'NUMYNGKIDS', 'NUMOLDKIDS',
           'NUMVETS', 'NUMNONREL', 'MULTIGEN', 'GRANDHH', 'NUMSUBFAM', 'NUMSECFAM', 'DISHH', 'HHSEX',
           'HHAGE', 'HHMAR', 'HHRACE', 'HHRACEAS', 'HHRACEPI', 'HHSPAN', 'HHCITSHP', 'HHNATVTY']

In [24]:
ahs_n=pd.read_csv('data/ahs2015n.csv', usecols=columns)
ahs_m=pd.read_csv('data/ahs2015m.csv', usecols=columns)
cbsa=pd.read_csv('data/cbsa.csv')
labels=pd.read_csv('data/AHS 2015 Value Labels.csv')

Verify the file imports:

In [26]:
ahs_n.shape

(69493, 30)

In [27]:
ahs_m.shape

(24886, 30)

In [28]:
cbsa.shape

(1918, 12)

In [29]:
labels.shape

(11121, 8)

For ease of analysis of the AHS data, I will append the national dataset and the metropolitan dataset.

In [30]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
ahs=pd.concat([ahs_n, ahs_m])
ahs.shape

(94379, 30)
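Once the two files are stacked, it is no longer obvious which file a row came from, and any `CONTROL` value present in both files would appear twice. The sketch below shows one way to keep track of both, using toy frames; the `SOURCE` column is a hypothetical name I introduce here, and whether the real files share any `CONTROL` values is something I have not checked.

```python
import pandas as pd

# Toy stand-ins for the national and metropolitan files.
national = pd.DataFrame({'CONTROL': ["'1'", "'2'"], 'BEDROOMS': [3, 2]})
metro = pd.DataFrame({'CONTROL': ["'3'", "'2'"], 'BEDROOMS': [4, 2]})

# Tag each row with its file of origin (SOURCE is a made-up column name),
# then stack the frames with a fresh 0..n-1 index.
combined = pd.concat(
    [national.assign(SOURCE='national'), metro.assign(SOURCE='metro')],
    ignore_index=True,
)

# Count CONTROL values that repeat an earlier row.
dup_count = int(combined['CONTROL'].duplicated().sum())
print(combined.shape, dup_count)
```

A nonzero `dup_count` on the real data would mean `CONTROL` is not a unique index after the merge.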

In [31]:
ahs.columns

Index(['CONTROL', 'TOTROOMS', 'OMB13CBSA', 'BLD', 'HHSEX', 'HHMAR', 'HHSPAN',
       'HHCITSHP', 'HHAGE', 'HHRACE', 'HHRACEAS', 'HHRACEPI', 'HHNATVTY',
       'HSHLDTYPE', 'NUMELDERS', 'NUMADULTS', 'NUMNONREL', 'NUMVETS',
       'NUMYNGKIDS', 'NUMOLDKIDS', 'NUMSUBFAM', 'NUMSECFAM', 'NUMPEOPLE',
       'GRANDHH', 'MULTIGEN', 'UNITSIZE', 'BEDROOMS', 'DISHH', 'BATHROOMS',
       'VACANCY'],
      dtype='object')

In [32]:
ahs.set_index('CONTROL', inplace=True)

In [33]:
ahs.index.name

'CONTROL'

In [34]:
ahs.head(5)

| CONTROL | TOTROOMS | OMB13CBSA | BLD | HHSEX | HHMAR | HHSPAN | HHCITSHP | HHAGE | HHRACE | HHRACEAS | ... | NUMSUBFAM | NUMSECFAM | NUMPEOPLE | GRANDHH | MULTIGEN | UNITSIZE | BEDROOMS | DISHH | BATHROOMS | VACANCY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| '11000001' | 7 | '37980' | '02' | '1' | '1' | '2' | '1' | 48 | '01' | '-6' | ... | 0 | 0 | 3 | '2' | '2' | '6' | 3 | '2' | '04' | '-6' |
| '11000002' | 7 | '99998' | '02' | '2' | '4' | '2' | '1' | 77 | '01' | '-6' | ... | 0 | 0 | 2 | '2' | '1' | '8' | 3 | '1' | '04' | '-6' |
| '11000003' | 4 | '99998' | '03' | '2' | '6' | '2' | '1' | 24 | '02' | '-6' | ... | 0 | 0 | 3 | '2' | '2' | '3' | 2 | '-9' | '01' | '-6' |
| '11000005' | 8 | '99998' | '02' | '1' | '1' | '2' | '1' | 68 | '01' | '-6' | ... | 0 | 0 | 3 | '2' | '5' | '6' | 4 | '1' | '05' | '-6' |
| '11000006' | 5 | '99998' | '02' | '1' | '6' | '2' | '1' | 20 | '01' | '-6' | ... | 0 | 0 | 3 | '2' | '1' | '-9' | 3 | '2' | '03' | '-6' |


In [35]:
ahs['OMB13CBSA'].unique()

array(["'37980'", "'99998'", "'99999'", "'47900'", "'35620'", "'14460'",
       "'41860'", "'26420'", "'33100'", "'12060'", "'38060'", "'16980'",
       "'19100'", "'19820'", "'42660'", "'31080'", "'40140'", "'28140'",
       "'38900'", "'38300'", "'35380'", "'39580'", "'19740'", "'32820'",
       "'17460'", "'17140'", "'33340'"], dtype=object)
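The codes above carry literal single quotes inside the string (`"'37980'"`), so they will not match a crosswalk keyed on bare codes until the quotes are stripped. A minimal sketch of the clean-and-join step follows. The crosswalk column names `cbsa_code` and `cbsa_title` and the titles `Metro A` and `Metro B` are placeholders I made up; the real cbsa.csv may use different headers and may store the code as an integer.

```python
import pandas as pd

# A few codes as they appear in the AHS extract, embedded quotes included.
ahs_demo = pd.DataFrame({'OMB13CBSA': ["'37980'", "'99998'", "'41860'"]})

# Hypothetical crosswalk: column names and titles are placeholders.
crosswalk = pd.DataFrame({
    'cbsa_code': ['37980', '41860'],
    'cbsa_title': ['Metro A', 'Metro B'],
})

# Remove the surrounding single quotes so the keys are comparable.
ahs_demo['cbsa_code'] = ahs_demo['OMB13CBSA'].str.strip("'")

# A left join keeps every AHS row; codes with no crosswalk entry get NaN.
labeled = ahs_demo.merge(crosswalk, on='cbsa_code', how='left')
print(labeled)
```

Codes such as 99998 and 99999 have no match in this toy crosswalk, so their title comes back as NaN and would need separate handling.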

In [36]:
print(ahs.shape)
ahs.isnull().sum()

(94379, 29)


TOTROOMS      0
OMB13CBSA     0
BLD           0
HHSEX         0
HHMAR         0
HHSPAN        0
HHCITSHP      0
HHAGE         0
HHRACE        0
HHRACEAS      0
HHRACEPI      0
HHNATVTY      0
HSHLDTYPE     0
NUMELDERS     0
NUMADULTS     0
NUMNONREL     0
NUMVETS       0
NUMYNGKIDS    0
NUMOLDKIDS    0
NUMSUBFAM     0
NUMSECFAM     0
NUMPEOPLE     0
GRANDHH       0
MULTIGEN      0
UNITSIZE      0
BEDROOMS      0
DISHH         0
BATHROOMS     0
VACANCY       0
dtype: int64
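Zero nulls does not necessarily mean every cell holds a usable value: the `head` output shows entries such as `'-6'` and `'-9'`, which look like placeholder codes. My assumption, to be confirmed against the value labels table, is that they mark "not applicable" and "not reported". A sketch of counting them per column, on toy data with the same quoting:

```python
import pandas as pd

# Toy rows mimicking the quoted codes seen in the head() output.
demo = pd.DataFrame({
    'UNITSIZE': ["'6'", "'8'", "'-9'"],
    'DISHH': ["'2'", "'-9'", "'1'"],
    'VACANCY': ["'-6'", "'-6'", "'-6'"],
    'BEDROOMS': [3, 3, 2],
})

# Assumed placeholder codes; verify their meaning in the value labels table.
SENTINELS = ['-6', '-9']
coded_cols = ['UNITSIZE', 'DISHH', 'VACANCY']

# Strip the embedded quotes, then count placeholder codes in each column.
stripped = demo[coded_cols].apply(lambda s: s.str.strip("'"))
sentinel_counts = stripped.isin(SENTINELS).sum()
print(sentinel_counts)
```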

# Questions
* Grouped by CBSA metro area, what percentage of housing units have:
  * (A) at least one more bathroom than the number of people in the unit
  * (B) at least one more bedroom than the number of people in the unit
  * (C) both (A) and (B)
* Compare West Coast metropolitan areas against the national baseline in a table
* Compare the same areas with visualizations
* Create a visualization of the Covid surge in West Coast metro areas
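The bathroom and bedroom questions could be answered with boolean flags averaged per metro area. This is only a sketch on toy data. `BATHROOMS` appears in the extract as a quoted code (`'04'`), so I assume it has first been decoded into a numeric count, held here in a hypothetical `BATH_COUNT` column. The area labels `A` and `B` are placeholders, and vacant units would probably need to be filtered out first.

```python
import pandas as pd

# Toy occupied units; BATH_COUNT is a hypothetical decoded bathroom count.
units = pd.DataFrame({
    'OMB13CBSA': ['A', 'A', 'B', 'B'],
    'NUMPEOPLE': [1, 3, 2, 1],
    'BEDROOMS': [2, 3, 3, 2],
    'BATH_COUNT': [2.0, 1.0, 3.0, 1.0],
})

# One boolean flag per condition.
flags = pd.DataFrame({
    'OMB13CBSA': units['OMB13CBSA'],
    'extra_bath': units['BATH_COUNT'] >= units['NUMPEOPLE'] + 1,
    'extra_bed': units['BEDROOMS'] >= units['NUMPEOPLE'] + 1,
})
flags['both'] = flags['extra_bath'] & flags['extra_bed']

# The mean of a boolean column is the share of True values; scale to percent.
pct = flags.groupby('OMB13CBSA')[['extra_bath', 'extra_bed', 'both']].mean() * 100
print(pct)
```

The same `pct` table, filtered to the West Coast codes and set beside a row computed over all units, would give the comparison against the national baseline.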
