### 1. Importing pandas, the only library needed here

In [1]:
import pandas as pd

### 2. Importing datasets

In [2]:
# Employment Dataset will serve as main dataset where others will be merged onto
df_employ = pd.read_csv("EMPLOYMENT_FILTERED.csv", dtype={"State-County FIPS Code": str})

# Importing the other datasets: Housing Prices, AQI, Risk Index
df_housing = pd.read_csv("HOUSING_FILTERED.csv", dtype={"StCntyFIPS": str})
df_aqi = pd.read_csv("AIR_DATA_WITH_FIPS.csv", dtype={"StCnty FIPS Code": str})
df_risk = pd.read_csv("RISKINDEX_FILTERED.csv", dtype={"STCOFIPS": str})
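
Reading the FIPS columns as strings matters because county FIPS codes are five-character identifiers that can begin with a zero. The sketch below illustrates the difference on a small in-memory CSV. The two rows and their unemployment values are made up for illustration, and only the column name mirrors the employment file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV (values are made up); the column names mirror the employment file.
csv_text = "State-County FIPS Code,Unemployment Rate (%)\n01001,2.7\n06037,4.4\n"

# Without a dtype, pandas infers integers and the leading zero is lost.
as_int = pd.read_csv(io.StringIO(csv_text))

# With dtype=str, the five-character code is kept exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"State-County FIPS Code": str})

int_codes = as_int["State-County FIPS Code"].tolist()
str_codes = as_str["State-County FIPS Code"].tolist()
print(int_codes)
print(str_codes)
```

Keeping every FIPS column as a string also means the merge keys have the same type in all four dataframes.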

### 3. Creating subsets of the dataframes with the needed columns

For the employment dataset, we keep the county name column along with the FIPS code and the numerical data, because the employment subset is the base onto which the other datasets are merged.

In [3]:
# Creating subsets of datasets in the form of these columns: [FIPS Code, Column of Interest]

ss_aqi = df_aqi[["StCnty FIPS Code", "Median AQI"]]
ss_housing = df_housing[["StCntyFIPS", "2019_average"]]
ss_employ = df_employ[["County Name/State Abbreviation", "State-County FIPS Code", "Unemployment Rate (%)"]]
ss_risk = df_risk[["STCOFIPS", "RISK_SCORE"]]

### 4. Renaming FIPS Columns
We rename each column containing the FIPS code to the same name, "FIPS", so that we can merge the dataframes on this column.

In [4]:
# Renaming the FIPS column on every dataframe so we can merge them on the FIPS column

# rename() returns a new dataframe; assigning the result (instead of using inplace=True
# on a column subset) avoids pandas' SettingWithCopyWarning
ss_aqi = ss_aqi.rename(columns={"StCnty FIPS Code": "FIPS"})
ss_housing = ss_housing.rename(columns={"StCntyFIPS": "FIPS"})
ss_employ = ss_employ.rename(columns={"State-County FIPS Code": "FIPS"})
ss_risk = ss_risk.rename(columns={"STCOFIPS": "FIPS"})
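
Renaming columns in place on a column subset can trigger pandas' SettingWithCopyWarning in versions that emit it. A minimal sketch of one way to sidestep it is shown here on made-up data: take an explicit copy of the subset and assign the result of `rename()`. The dataframe `df_demo`, its values, and the `Unused` column are hypothetical. Only the AQI column names mirror the real file.

```python
import pandas as pd

# Made-up rows; "Unused" is a hypothetical extra column that the subset leaves out.
df_demo = pd.DataFrame({
    "StCnty FIPS Code": ["01001", "06037"],
    "Median AQI": [40, 60],
    "Unused": ["a", "b"],
})

# .copy() makes the subset an independent dataframe, and rename() without
# inplace=True returns a new dataframe rather than modifying the subset.
ss_demo = df_demo[["StCnty FIPS Code", "Median AQI"]].copy()
ss_demo = ss_demo.rename(columns={"StCnty FIPS Code": "FIPS"})

print(list(ss_demo.columns))
print(list(df_demo.columns))
```

The source dataframe keeps its original column names, so it can still be inspected later under the names used in the raw file.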

### 5. Merging dataframes
Finally, we merge the dataframes one by one on the FIPS column and write the result to a CSV file. This is the CSV file that we will use for the majority of the EDA and hypothesis testing.

Note that we drop rows with NA values (only a few relative to the size of the dataframe) before adding the AQI data. The employment, risk index, and housing price datasets cover almost all counties in the United States (approximately 3,000), but the AQI dataset covers only about 1,000. Dropping NA values after the AQI merge would discard the other 2,000 counties, so we keep them and leave their AQI values as NaN.

In [6]:
# Merging housing dataframe onto employment dataframe
merged_dataframe = ss_employ.merge(ss_housing, on="FIPS", how="left")

# Merging risk index dataframe onto merged dataframe
merged_dataframe = merged_dataframe.merge(ss_risk, on="FIPS", how="left")

# Dropping the few rows with NA values
merged_dataframe.dropna(inplace=True)

# Merging the AQI dataframe onto the merged dataframe
merged_dataframe = merged_dataframe.merge(ss_aqi, on="FIPS", how="left")

# Writing to csv file
merged_dataframe.to_csv("MERGED_DATASET_WITH_NAN_AQI.csv", index=False)
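
The effect of the merge order can be checked on a toy example. The sketch below uses three made-up counties with placeholder FIPS codes and values. It compares dropping NA rows before the AQI merge (the order used in this notebook) with dropping them afterwards.

```python
import pandas as pd

# Made-up data: three counties, one without housing data, two without AQI data.
employ = pd.DataFrame({"FIPS": ["00001", "00002", "00003"],
                       "Unemployment Rate (%)": [3.0, 4.0, 5.0]})
housing = pd.DataFrame({"FIPS": ["00001", "00002"],
                        "2019_average": [100.0, 200.0]})
aqi = pd.DataFrame({"FIPS": ["00001"], "Median AQI": [40.0]})

merged = employ.merge(housing, on="FIPS", how="left")

# Order used in this notebook: drop NA rows first, then left-merge the AQI data.
drop_first = merged.dropna().merge(aqi, on="FIPS", how="left")

# Alternative order: merge the AQI data first, then drop NA rows.
drop_last = merged.merge(aqi, on="FIPS", how="left").dropna()

print(len(drop_first), len(drop_last))
print(drop_first["Median AQI"].isna().sum())
```

With the first order, the county that has employment and housing data but no AQI reading stays in the result with a NaN AQI value. With the second order, it is dropped.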