# 2006-2010 Census Data: Cleaning and Integration with LC Data


The census data identifies geography by county and census tract codes, whereas the Lending Club data contains only three-digit zip codes. We downloaded a crosswalk from the Missouri Census Data Center to map three-digit zip codes to county tract codes.

http://mcdc.missouri.edu/applications/geocorr2014.html

Each county tract code is associated with a three-digit zip code, but a single three-digit zip code may cover more than one county tract code. We therefore build a dictionary that assigns a three-digit zip code to each county tract code, and use it to aggregate the census data to the three-digit zip code level.
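As a rough illustration of this lookup, here is a minimal sketch on a made-up crosswalk. The column names (`zip`, `county_tract`, `zip3`) and the tract codes are hypothetical and are not taken from the real crosswalk file. Leading zeros are stripped from the three-digit zip code on the assumption that the Lending Club side is normalized the same way.

```python
import pandas as pd

# Hypothetical crosswalk rows: two tracts in zip codes starting with 089, one in 100.
toy_crosswalk = pd.DataFrame({
    "zip": ["08901", "08904", "10001"],
    "county_tract": ["34023001100", "34023001200", "36061000100"],
})

# First three digits of the zip code, with leading zeros dropped.
toy_crosswalk["zip3"] = toy_crosswalk["zip"].str[:3].str.lstrip("0")

# Many tracts can share one three-digit zip code, so the tract is the dictionary key.
toy_tract_to_zip3 = dict(zip(toy_crosswalk["county_tract"], toy_crosswalk["zip3"]))

print(toy_tract_to_zip3)
```

If a tract appeared in more than one row, the last row would win, because later dictionary keys overwrite earlier ones.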

In [2]:

import numpy as np
import pandas as pd

#Read the Missouri Census Data Center zip code to census tract crosswalk file
zip_county_map = pd.read_csv("../data/census_data/zip_census_tract.csv",encoding="latin-1",dtype=str)

#remove the decimal point from the tract code; regex=False so "." is treated literally
zip_county_map["Tract"]=zip_county_map["Tract"].str.replace(".","",regex=False)

#combine the county code and the tract into a single key
zip_county_map["County Tract Code"]=zip_county_map["County code"]+zip_county_map["Tract"]

#necessary columns
zip_county_map_columns =['ZIP census tabulation area', 'County Tract Code']

#drop unnecessary columns
zip_county_map_drop_cols=[x for x in zip_county_map.columns if x not in zip_county_map_columns]
zip_county_map.drop(columns=zip_county_map_drop_cols,inplace=True)

#calculate 3 digit zip code: restore any lost leading zeros, take the first three digits,
#then strip leading zeros so the keys match the Lending Club format used later
zip_county_map["3 Digit Zip"]=zip_county_map['ZIP census tabulation area'].str.zfill(5).str[:3].str.lstrip("0")

#lookup from county tract code to 3 digit zip code
county_zip_map=dict(zip(zip_county_map["County Tract Code"],zip_county_map["3 Digit Zip"]))

Per capita income, median house value, median gross rent, and unemployed population (16 years and over) data are obtained from the American Community Survey.

We download the data dictionary for the variable names, strip the whitespace, and store the codes and descriptions in lookup dictionaries.
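A small sketch of the idea, using an invented two-row dictionary. The codes `T001_001` and `T083_001` and the padded header `" Code "` are placeholders for illustration only, not values confirmed from the real file.

```python
import pandas as pd

# Hypothetical data dictionary with stray whitespace in a header and in the codes.
toy_dictionary = pd.DataFrame({
    " Code ": [" T001_001 ", " T083_001 "],
    "Description": ["Total Population", "Per Capita Income"],
})

# Trim the header names first, then the code values.
toy_dictionary.columns = toy_dictionary.columns.str.strip()
toy_dictionary["Code"] = toy_dictionary["Code"].str.strip()

# Lookups in both directions.
toy_code_to_description = dict(zip(toy_dictionary["Code"], toy_dictionary["Description"]))
toy_description_to_code = dict(zip(toy_dictionary["Description"], toy_dictionary["Code"]))
```

Trimming the header names before filtering columns matters: a filter on `"Code"` would not match a column that is still named `" Code "`.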

In [3]:
#Read the American Community Survey Dictionary file
census_dict = pd.read_csv("../data/census_data/census_dictionary.csv")

#trim whitespace from the column names first, so that the column filter below matches
census_dict.columns=census_dict.columns.str.strip()

#remove extra columns that will not be used
census_dict.drop(columns=[x for x in census_dict.columns if x not in ["Code","Description"]],inplace=True)

#remove whitespace before and after the variable codes
#(descriptions keep their padding because later cells look them up with the padding included)
census_dict["Code"]=census_dict["Code"].str.strip()

#lookups in both directions: code -> description and description -> code
census_code_dict = dict(zip(census_dict["Code"],census_dict["Description"]))
census_description_dict = dict(zip(census_dict["Description"],census_dict["Code"]))

For the Total Population, Total Population over 16 Years Old, and Unemployed variables, we simply sum the tract-level values within each three-digit zip code.

For median house value and median gross rent, the underlying data needed to recompute a true median is not available. Therefore, although it is not strictly correct, we take the population-weighted average of these variables to get an average median house value and an average median gross rent for each three-digit zip code. Per capita income is aggregated with the same population-weighted average.

Below is the weighted average function to be applied.

In [4]:
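def weighted_average_funct(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instance, we may not have weights, so just return the mean. Customize this if your business case
    should return otherwise.
    """
    d = group[avg_name]
    #ignore the weight of any row whose value is missing, so it does not dilute the average
    w = group[weight_name].where(d.notna(), 0)
    #pandas/numpy division by zero yields nan or inf instead of raising ZeroDivisionError,
    #so check the total weight explicitly
    if w.sum() == 0:
        return d.mean()
    return (d * w).sum() / w.sum()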
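To make the weighting concrete, here is a minimal worked example on invented numbers. The helper `toy_weighted_average` and the column names are hypothetical stand-ins, written only to show the arithmetic $\sum_i w_i x_i / \sum_i w_i$ per group.

```python
import pandas as pd

def toy_weighted_average(values, weights):
    """Weighted mean; falls back to the plain mean when the weights sum to zero."""
    total_weight = weights.sum()
    if total_weight == 0:
        return values.mean()
    return (values * weights).sum() / total_weight

# Three made-up tracts in two three-digit zip codes.
toy_tracts = pd.DataFrame({
    "zip3": ["100", "100", "200"],
    "median_rent": [1000.0, 2000.0, 800.0],
    "population": [3000, 1000, 500],
})

# One weighted average per zip code.
toy_average_rent = {
    zip3: toy_weighted_average(group["median_rent"], group["population"])
    for zip3, group in toy_tracts.groupby("zip3")
}
```

For zip code `100` the result is (1000 × 3000 + 2000 × 1000) / 4000 = 1250, which sits closer to the rent of the more populous tract than the unweighted mean of 1500 does.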

Read the American Community Survey data CSV file, specifying the data type for the variables that will be used.

In [None]:
#read the American Community Survey (2006-2010) data.
#Data is about 67mb.
#source is Social Explorer, granted a student licence through Rutgers University.
#encoding of the source file is Western Latin-1 (ISO-8859-1)

census_data = pd.read_csv("../data/census_data/ACS_2006_2010_census_tract.csv",encoding ="latin-1",
                         dtype={"Geo_FIPS": str,
                                census_description_dict['     Total Population    ']: np.int64,
                                census_description_dict['     Civilian Population in Labor Force 16 Years and Over    ']: np.int64,
                                census_description_dict['     Civilian Population in Labor Force 16 Years and Over  Unemployed   ']: np.int64,
                                census_description_dict['     Per Capita Income (In 2010 Inflation Adjusted Dollars)    ']: np.float64,
                                census_description_dict['     Median Value    ']: np.float64,
                                census_description_dict['     Median Gross Rent    ']: np.float64,})


Below we clean up the column names, replace the variable codes with their descriptions, and drop unnecessary columns.

In [None]:

#strip the "Geo_" and "SE_" category prefixes from the column names
#(an anchored regex removes only the literal prefix; str.lstrip would strip any of the listed characters)
census_data.columns=census_data.columns.str.replace(r"^(?:Geo|SE)_","",regex=True)

#replace variable codes with variable names
census_data.columns=[census_code_dict.get(x,x) for x in census_data.columns]

#columns to keep from the data source, we will drop every other column
census_data_columns_keep = ['FIPS',
                        '     Total Population    ',
                       '     Civilian Population in Labor Force 16 Years and Over    ',
                       '     Civilian Population in Labor Force 16 Years and Over  Unemployed   ',
                       '     Per Capita Income (In 2010 Inflation Adjusted Dollars)    ',
                       '     Median Value    ',
                        '     Median Gross Rent    ']

#columns to remove
census_data_columns_remove=[x for x in census_data.columns if x not in census_data_columns_keep]

#drop columns
census_data.drop(columns=census_data_columns_remove,inplace=True)

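#strip whitespace from end and beginning
census_data.columns=census_data.columns.str.strip()

#rename the columns by name, so the result does not depend on the column order in the source file
census_data_columns_newnames = {'FIPS': 'County Tract Code',
                                'Civilian Population in Labor Force 16 Years and Over': 'Total Population over 16 Years Old',
                                'Civilian Population in Labor Force 16 Years and Over  Unemployed': 'Unemployed',
                                'Per Capita Income (In 2010 Inflation Adjusted Dollars)': 'Per Capita Income',
                                'Median Value': 'Median House Value'}

census_data.rename(columns=census_data_columns_newnames,inplace=True)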
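One detail of the prefix stripping in the cell above is worth a small demonstration: `str.lstrip` takes a set of characters, not a literal prefix. The sketch below uses invented column names (`State` in particular is a hypothetical unprefixed column) to compare chained `lstrip` calls with an anchored regular expression.

```python
import pandas as pd

# Made-up column names: two prefixed ones and one without any prefix.
toy_columns = pd.Index(["Geo_FIPS", "SE_T001_001", "State"])

# lstrip removes any leading characters from the given set, so the
# unprefixed "State" loses its leading "S" in the second call.
chained = toy_columns.str.lstrip("Geo").str.lstrip("SE").str.lstrip("_")

# An anchored regex removes only a literal "Geo_" or "SE_" prefix.
anchored = toy_columns.str.replace(r"^(?:Geo|SE)_", "", regex=True)

print(list(chained))
print(list(anchored))
```

The two approaches agree whenever every column carries one of the prefixes followed by an underscore, and they diverge only for names that merely begin with one of the stripped characters.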


We now map each county tract code in the census data to its three-digit zip code.

In [None]:

census_data["County Tract Code"]=census_data["County Tract Code"].astype(str)

census_data["3 Digit Zip Code"]=census_data['County Tract Code'].map(county_zip_map)

#number of unique values for 3 digit zip code in the census data
census_data["3 Digit Zip Code"].nunique()
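Besides counting the distinct zip codes, it can be useful to count how many tracts found no match, since `Series.map` with a dictionary leaves `NaN` for keys that are missing from it. A minimal sketch with an invented lookup and invented tract codes:

```python
import pandas as pd

# Hypothetical lookup and census rows; the last tract is absent from the lookup.
toy_lookup = {"34023001100": "89", "36061000100": "100"}
toy_census = pd.DataFrame({"tract": ["34023001100", "36061000100", "99999999999"]})

# Tracts that are not in the lookup become NaN.
toy_census["zip3"] = toy_census["tract"].map(toy_lookup)

matched_zip_codes = toy_census["zip3"].nunique()   # NaN is not counted
unmatched_tracts = int(toy_census["zip3"].isna().sum())
```

Unmatched rows are silently dropped by a later `groupby` on the zip code column, so a large `unmatched_tracts` count would be a sign that the key formats differ.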


We now aggregate the tract-level data to the three-digit zip code level: median house value, median gross rent, and per capita income are population-weighted averages, while the other variables are simply summed.

We also calculate the unemployment rate from the unemployed population data and combine everything in a single dataframe.



In [None]:


#population-weighted averages of median house value, median gross rent and per capita income for each 3 digit zip code
census_group_house_value= census_data.groupby("3 Digit Zip Code").apply(weighted_average_funct,"Median House Value","Total Population").reset_index(name='Average Median House Value')
census_group_rent= census_data.groupby("3 Digit Zip Code").apply(weighted_average_funct,"Median Gross Rent","Total Population").reset_index(name='Average Median Gross Rent')
census_group_per_capita= census_data.groupby("3 Digit Zip Code").apply(weighted_average_funct,"Per Capita Income","Total Population").reset_index(name='Per Capita Income')

#sum the population counts within each 3 digit zip code
f = {'Total Population': "sum", 'Total Population over 16 Years Old': "sum","Unemployed": "sum"}
census_group_main= census_data.groupby("3 Digit Zip Code").agg(f)

#calculate simple unemployment statistics
census_group_main["Unemployment Rate"]=census_group_main["Unemployed"]/census_group_main["Total Population over 16 Years Old"]

census_group_main.reset_index(inplace=True)

census_group_main.drop(columns=["Total Population over 16 Years Old"],inplace=True)

#merge on the zip code rather than relying on the row order of the separate results
for averages in (census_group_house_value,census_group_rent,census_group_per_capita):
    census_group_main=census_group_main.merge(averages,on="3 Digit Zip Code",how="left")

census_data=census_group_main

census_data.head()
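The sum-then-divide step for the unemployment rate, and the join of a separately computed average, can be sketched on toy numbers as follows. All column names and values here are invented for illustration.

```python
import pandas as pd

# Made-up tract-level counts for two three-digit zip codes.
toy = pd.DataFrame({
    "zip3": ["100", "100", "200"],
    "population": [3000, 1000, 500],
    "labor_force": [1500, 500, 200],
    "unemployed": [150, 50, 10],
})

# Sum the counts per zip code, then derive the rate from the summed counts.
toy_totals = toy.groupby("zip3", as_index=False)[["population", "labor_force", "unemployed"]].sum()
toy_totals["unemployment_rate"] = toy_totals["unemployed"] / toy_totals["labor_force"]

# A separately computed table, deliberately in a different row order.
toy_rent = pd.DataFrame({"zip3": ["200", "100"], "avg_rent": [800.0, 1250.0]})

# Joining on the key keeps each value with its zip code regardless of row order.
toy_result = toy_totals.merge(toy_rent, on="zip3", how="left")
```

Computing the rate from summed counts gives the zip-level rate directly, which is not the same as averaging the tract-level rates unless every tract has the same labor force.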



Now that we have the census data for almost all of the three-digit zip codes, we add these values to our original training dataframe.

In [None]:
#lookups from 3 digit zip code to each census variable
per_capita_income_dict=dict(zip(census_data['3 Digit Zip Code'],census_data["Per Capita Income"]))
total_population_dict=dict(zip(census_data['3 Digit Zip Code'],census_data["Total Population"]))
average_median_house_value_dict=dict(zip(census_data['3 Digit Zip Code'],census_data["Average Median House Value"]))
average_median_rent_dict=dict(zip(census_data['3 Digit Zip Code'],census_data["Average Median Gross Rent"]))
unemployment_dict=dict(zip(census_data['3 Digit Zip Code'],census_data["Unemployment Rate"]))


def add_county_data(lendingclub_df):
    
    #first extract the three digit zip code from the masked zip code by removing the trailing "xx",
    #then drop leading zeros to match the census lookup keys
    lendingclub_df["3 Digit Zip Code"]=lendingclub_df["Zip Code"].str.rstrip("xx").str.lstrip("0")
    
    #now add census data
    lendingclub_df["Unemployment Rate"]=lendingclub_df["3 Digit Zip Code"].map(unemployment_dict)
    lendingclub_df["Average Median Gross Rent"]=lendingclub_df["3 Digit Zip Code"].map(average_median_rent_dict)
    lendingclub_df["Average Median House Value"]=lendingclub_df["3 Digit Zip Code"].map(average_median_house_value_dict)
    lendingclub_df["Total Population"]=lendingclub_df["3 Digit Zip Code"].map(total_population_dict)
    lendingclub_df["Per Capita Income"]=lendingclub_df["3 Digit Zip Code"].map(per_capita_income_dict)

    
    return lendingclub_df

training_nonempty.pipe(add_county_data)
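As a quick sanity check of this last step, the sketch below applies the same zip code normalization and lookup to a tiny made-up loan table. It assumes the Lending Club zip codes are masked in the form `123xx`, as the `rstrip("xx")` call above suggests; the rates are invented.

```python
import pandas as pd

# Hypothetical loans with masked zip codes, and a made-up rate lookup.
toy_loans = pd.DataFrame({"Zip Code": ["089xx", "100xx", "999xx"]})
toy_rates = {"89": 0.07, "100": 0.05}

# Keep the first three characters and drop leading zeros, matching the lookup keys.
toy_loans["3 Digit Zip Code"] = toy_loans["Zip Code"].str[:3].str.lstrip("0")
toy_loans["Unemployment Rate"] = toy_loans["3 Digit Zip Code"].map(toy_rates)

# Loans whose zip code has no census match end up with NaN.
loans_without_census_data = int(toy_loans["Unemployment Rate"].isna().sum())
```

Counting the `NaN` rows after the mapping shows how many loans fall in zip codes without census coverage.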