# Merging Census Data

The census data comes from the 2010 census QuickFacts pages for the cities we analyzed. The data can be found at census.gov; for example, the page for the city of Oakland is https://www.census.gov/quickfacts/fact/table/oaklandcitycalifornia/POP010210. To find the 2010 census data, select the 2010 census fact from the fact drop-down menu on the page.

I was able to build merged tables of the cities on the website itself, but the number of cities we are looking at exceeded the number that fit in one table, so I saved the data in two parts. In this notebook I merge those two parts.

In [2]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
figsize = (16,8)

In [4]:
# loading the data
# the two tables of data from the census website containing all of the cities
df1 = pd.read_csv('Data/census_pt1.csv')
df2 = pd.read_csv('Data/census_pt2.csv')

In [6]:
# visualizing the tables
# the city names are the columns, and the different population estimates are the rows
df1.head()

Unnamed: 0,Fact,Fact Note,"Los Angeles city, California","Value Note for Los Angeles city, California","Oakland city, California","Value Note for Oakland city, California","San Diego city, California","Value Note for San Diego city, California","San Francisco city, California","Value Note for San Francisco city, California","San Jose city, California","Value Note for San Jose city, California","Stockton city, California","Value Note for Stockton city, California"
0,"Population estimates, July 1, 2019, (V2019)",,3979576,,433031,,1423851,,881549,,1021795,,312697,
1,"Population estimates base, April 1, 2010, (V2...",,3793139,,390765,,1301929,,805184,,952528,,292182,
2,"Population, percent change - April 1, 2010 (es...",,4.9%,,10.8%,,9.4%,,9.5%,,7.3%,,7.0%,
3,"Population, Census, April 1, 2010",,3792621,,390724,,1307402,,805235,,945942,,291707,
4,"Persons under 5 years, percent",,6.0%,,6.3%,,6.1%,,4.5%,,6.3%,,7.4%,


In [7]:
# dropping the fact label columns from df1 so we can concatenate the two datasets smoothly
# df1 contains the cities that come second in alphabetical order so it will be added to the end of df2
df1 = df1.drop(['Fact','Fact Note'], axis=1)

In [8]:
# Place the DataFrames side by side
df = pd.concat([df2, df1], axis=1)
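
Because `pd.concat(..., axis=1)` lines rows up by index rather than by the `Fact` label, it may be worth confirming, before the label columns are dropped, that both parts list the same facts in the same order. A minimal sketch with made-up toy frames (`part_a`, `part_b`, and their values are illustrative stand-ins, not the census files):

```python
import pandas as pd

# Toy stand-ins for the two census exports (hypothetical values)
part_a = pd.DataFrame({'Fact': ['Population', 'Persons under 5'],
                       'City A': ['100', '6.0%']})
part_b = pd.DataFrame({'Fact': ['Population', 'Persons under 5'],
                       'City B': ['200', '7.0%']})

# The side-by-side concat is only meaningful if the fact labels match row by row
facts_match = part_a['Fact'].equals(part_b['Fact'])

# Drop the duplicate label column from the second part, then place them side by side
merged = pd.concat([part_a, part_b.drop(columns='Fact')], axis=1)
```

If `facts_match` were `False`, merging on the `Fact` column would likely be a safer choice than a positional concat.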

In [9]:
# dropping the columns that do not contain information
df = df.drop(['Fact Note','Value Note for Bakersfield city, California','Value Note for Long Beach city, California','Value Note for Los Angeles city, California','Value Note for Oakland city, California','Value Note for San Diego city, California','Value Note for San Francisco city, California','Value Note for San Jose city, California','Value Note for Stockton city, California'], axis = 1)
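
The note columns all follow the same naming pattern, so they could also be collected by name instead of being listed one by one. A small sketch on a toy frame (`toy`, `note_cols`, and `cleaned` are hypothetical names used only for illustration):

```python
import pandas as pd

# Toy frame mimicking the layout of the census export
toy = pd.DataFrame({
    'Fact': ['Population'],
    'Fact Note': [None],
    'Oakland city, California': ['390724'],
    'Value Note for Oakland city, California': [None],
})

# Collect 'Fact Note' plus every column whose name starts with 'Value Note'
note_cols = [c for c in toy.columns
             if c == 'Fact Note' or c.startswith('Value Note')]
cleaned = toy.drop(columns=note_cols)
```

This keeps working if more cities are added, at the cost of being less explicit about which columns are removed.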


In [10]:
# setting the fact (population information name) to the index
df = df.set_index('Fact')

In [11]:
# looking at the dataset again
df.head()

Unnamed: 0_level_0,"Bakersfield city, California","Long Beach city, California","Los Angeles city, California","Oakland city, California","San Diego city, California","San Francisco city, California","San Jose city, California","Stockton city, California"
Fact,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"Population estimates, July 1, 2019, (V2019)",384145,462628,3979576,433031,1423851,881549,1021795,312697
"Population estimates base, April 1, 2010, (V2019)",347817,462221,3793139,390765,1301929,805184,952528,292182
"Population, percent change - April 1, 2010 (estimates base) to July 1, 2019, (V2019)",10.4%,0.1%,4.9%,10.8%,9.4%,9.5%,7.3%,7.0%
"Population, Census, April 1, 2010",347483,462257,3792621,390724,1307402,805235,945942,291707
"Persons under 5 years, percent",8.4%,6.7%,6.0%,6.3%,6.1%,4.5%,6.3%,7.4%


In [12]:
# transposing the dataset so cities are the rows
df = df.T

In [13]:
# selecting the columns (facts) relevant to our research
df = df.iloc[:, [3,7,9,14,15]]
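
Positional indices depend on the row order of the export, so selecting by fact name may be more robust if the QuickFacts layout changes. A sketch on a toy frame whose values are copied from this notebook's outputs for two cities (`toy`, `wanted`, and `subset` are illustrative names):

```python
import pandas as pd

# Cities as rows, facts as columns, as after the transpose
toy = pd.DataFrame(
    {
        'Population, Census, April 1, 2010': ['347483', '390724'],
        'Persons under 5 years, percent': ['8.4%', '6.3%'],
        'Female persons, percent': ['50.8%', '51.6%'],
    },
    index=['Bakersfield city, California', 'Oakland city, California'],
)

# Select the facts of interest by label rather than by position
wanted = ['Population, Census, April 1, 2010', 'Female persons, percent']
subset = toy.loc[:, wanted]
```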

In [14]:
# checking datatypes
display(df.dtypes)

Fact
Population, Census, April 1, 2010               object
Female persons, percent                         object
Black or African American alone, percent        object
Hispanic or Latino, percent                     object
White alone, not Hispanic or Latino, percent    object
dtype: object

In [15]:
# looking at the dataset again
df.head()

Fact,"Population, Census, April 1, 2010","Female persons, percent","Black or African American alone, percent","Hispanic or Latino, percent","White alone, not Hispanic or Latino, percent"
"Bakersfield city, California",347483,50.8%,7.4%,49.5%,33.5%
"Long Beach city, California",462257,50.7%,12.9%,42.5%,28.1%
"Los Angeles city, California",3792621,50.4%,8.9%,48.6%,28.5%
"Oakland city, California",390724,51.6%,23.6%,26.9%,28.2%
"San Diego city, California",1307402,49.7%,6.5%,30.1%,42.9%


In [16]:
# removing the thousands separators from the population values and converting them to floats
pop_col = 'Population, Census, April 1, 2010'
df[pop_col] = df[pop_col].str.replace(',', '', regex=False).astype(float)

In [17]:
# putting the percent column names into a list so we can loop through them (skipping the population column)
df_col = df.columns.to_list()[1:]

In [18]:
# looping through the percent columns to estimate the number of people in each group in each city
for col in df_col:
    # strip the trailing '%' and convert the percent to a fraction
    fraction = df[col].str.rstrip('%').astype(float) / 100
    # multiply by the total population to get the estimated count
    df[col] = df['Population, Census, April 1, 2010'] * fraction
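
As a worked example of the arithmetic for a single cell, the sketch below takes Oakland's population and its Black or African American percentage from the tables above (the variable names are illustrative only):

```python
population = 390724.0   # Oakland, Population, Census, April 1, 2010
percent = '23.6%'       # Oakland, Black or African American alone, percent

# '23.6%' -> 23.6 -> 0.236
fraction = float(percent.rstrip('%')) / 100

# estimated number of people in the group
estimate = population * fraction
```

Here `estimate` comes out to about 92,210.9. Since the published percents are rounded to one decimal place, these counts are estimates rather than exact census figures.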

In [19]:
# resetting the index
df = df.reset_index()

In [20]:
# renaming the columns
df = df.rename(columns={"index":"City", "Population, Census, April 1, 2010": "Total", "Female persons, percent": "female","Black or African American alone, percent": "black","Hispanic or Latino, percent": "hispanic","White alone, not Hispanic or Latino, percent": "white"})


In [21]:
# shortening each city name by removing the ' city, California' suffix
df['City'] = df['City'].str.replace(' city, California', '', regex=False)

In [22]:
# looking at the final data set
df

Fact,City,Total,female,black,hispanic,white
0,Bakersfield,347483.0,176521.0,25713.7,172004.0,116407.0
1,Long Beach,462257.0,234364.0,59631.2,196459.0,129894.0
2,Los Angeles,3792620.0,1911480.0,337543.0,1843210.0,1080900.0
3,Oakland,390724.0,201614.0,92210.9,105105.0,110184.0
4,San Diego,1307400.0,649779.0,84981.1,393528.0,560875.0
5,San Francisco,805235.0,394565.0,41872.2,122396.0,326925.0
6,San Jose,945942.0,469187.0,28378.3,302701.0,245945.0
7,Stockton,291707.0,148771.0,34421.4,122809.0,60675.1


In [303]:
# saving the new merged dataset
df.to_csv('Data/df_census.csv')
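
Note that `to_csv` writes the row index as an unnamed first column by default, so anything that reads the file back may want `index_col=0`. A minimal round-trip sketch using an in-memory buffer in place of the real file (`out`, `buffer`, and `restored` are illustrative names):

```python
import io
import pandas as pd

# One-row stand-in for the final table
out = pd.DataFrame({'City': ['Oakland'], 'Total': [390724.0]})

buffer = io.StringIO()
out.to_csv(buffer)      # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column
restored = pd.read_csv(buffer, index_col=0)
```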