**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Dean Nafarrete
- Emily Le
- Cedric Jeng
- Kevin Morales
- Richard Lao

# Research Question

Is there a statistically significant relationship between a region's crime rates and its economic health (GDP and household income), environmental degradation (greenhouse gas emissions and air quality), and unemployment rate? How can we use these variables to create a new standard measurement for economic health?

## Background and Prior Work


<span style = "font-family: 'Segoe UI'; font-size: 17px;">
    <p>Since the advent of the modern industrial city, crime rates in urban areas have been a consistent concern. By nature, cities are densely populated, with robust infrastructure and diverse economies, and tend to attract more people. With such a large population in a small area, it’s inevitable that crime will rise proportionally; however, underlying factors beyond industrialization are at play in these statistics. Many hold the sentiment that cities are less ideal places to settle due to factors like crime rate, population density, and homelessness, while praising the large economies that seemingly contribute to these issues. Ultimately, cities have become an integral part of the globalized economy, serving as centers for trade and culture despite the negative connotations. Consequently, efforts to remedy these issues by identifying and addressing their root causes tend to be overlooked or rarely implemented. Economic health and overall quality of life are not mutually exclusive, as many seem to believe. An in-depth analysis of the effects of the physical and social environment may provide insight into issues that require greater resource allocation to reduce crime rates in the future. In studying these factors, establishing a new measure of economic health that includes the issues affecting those living within cities can shift focus away from simple measures like GDP, which only measures the gross economic output of a given region, and allow for more nuanced discussions on how to maximize economic returns while reducing the toll on average citizens.</p>
    <p>The disproportionate amount of crime in urban areas is a topic that has been heavily explored by sociologists and economists alike. An article from the Journal of Political Economy, “Why is There More Crime in Cities?”, compiled a number of different theories. The authors note that crime rates in cities outpace their rural counterparts even when accounting for the larger population. Some of the theories they posit suggest that dense city environments may cultivate less connected communities compared to smaller towns, decrease risk for criminals, or attract those looking to profit from crime to economic centers. Ultimately, they found that crime is underreported in cities, and the likelihood of arrest for a given crime is lower.<a name="cite_ref-1" href="#cite_note-1">[1]</a> Regardless, it’s difficult to pinpoint any one cause, and the theories are merely speculation on the social factors behind crime. As for the environment, much less research is available. The Council on Strategic Risks highlighted this issue; extreme weather events, higher temperatures, and social factors related to climate-based stresses have been linked to various forms of crime, especially violent crime. Some examples include gender-based violence against women increasing following adverse weather events and the likelihood of mass shooting events increasing in the summer months. The working theory is that changes in the environment may indirectly cause stresses that further incentivize crime by reducing the perceived risk for potential criminals.<a name="cite_ref-2" href="#cite_note-2">[2]</a> Directly connecting a factor related to climate change, like environmental degradation, to crime may lend more credence to treating this issue as a factor in addressing crime in the future.</p>
    <p>Along with changes in the climate, income inequality is a known factor in crime rates. Increasing fears over crime often go hand in hand with homelessness; as such, this phenomenon has been explored in the past. The Institute of Labor Economics studied this effect in California, finding that homelessness rates and crime rates are linked, but how they are linked is interesting: high rates of homelessness increase the number of violent crimes, but not property crimes.<a name="cite_ref-3" href="#cite_note-3">[3]</a> Given that this study was conducted at the state level, focusing on a single area may reaffirm or contradict this finding, depending on the effect of a smaller, less diversified economy. On the whole, while larger factors in crime rates appear to go overlooked, there have been some initiatives to address this disparity between economy, climate, inequality, and crime. Several states have implemented an alternate measure of economic health known as the genuine progress indicator, or GPI. This standard allows states to take into consideration the effects of non-economic factors, like the environment and human health standards, on the economy. According to the government of Maryland, this measure informs policymakers of economic progress without looking purely at economic output, which may increase at the expense of its citizens.<a name="cite_ref-4" href="#cite_note-4">[4]</a> We believe that this approach is more progressive on these issues and may benefit the economy and the well-being of individuals in unison. However, implementing this model in only a few states as a policy-informing measurement is insufficient. Finding a general link between these factors may reveal their true cost, and potentially allow us to devise another measurement that can be computed using public data.</p>
  </span>



1. <a name="cite_note-1"></a> [^](#cite_ref-1) Glaeser, E. L., & Sacerdote, B. (1999). Why is There More Crime in Cities? Journal of Political Economy, 107(S6), S225–S258. https://doi.org/10.1086/250109
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Facini, A. (2024, October 17). Climate Change & Crime: A big, bad, largely overlooked Nexus. The Council on Strategic Risks. http://councilonstrategicrisks.org/2024/10/17/climate-change-crime-a-big-bad-largely-overlooked-nexus/ 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Artz, B., & Welsch, D. M. (2024, June). Homelessness and crime: An examination of California. Institute of Labor Economics. https://docs.iza.org/dp17086.pdf
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Campbell, E. (n.d.). Maryland Genuine Progress Indicator. Maryland Department of Natural Resources. https://dnr.maryland.gov/mdgpi/Pages/default.aspx


# Hypothesis


Our hypothesis is that environmental degradation, high economic output, and unemployment each have a significant relationship with crime rates, and that these factors can be combined into a new measure of a city’s well-being as an alternative measure of economic health. To measure these variables, we are looking at the air quality index, greenhouse gas emissions, unemployment rates, and crime rates for each county. We expect higher economic output to attract more crime because of the high foot traffic in retail centers, where businesses sell products and people carry valuables or money. Furthermore, the stress of unemployment and homelessness may push individuals to commit crimes as a means of survival. We also expect continued environmental decline to contribute to higher crime rates, violent crime in particular, because lower air quality and greenhouse gases create a toxic environment that inflicts environmental stress on the population. All of these variables are major stressors that shape people’s well-being, safety, and future, which is why we expect people to turn toward crime to ensure their survival when these parts of their lives are threatened. Additionally, we believe these variables will have a noticeably greater effect on crime rates in urbanized areas than in their rural counterparts.

# Data

## Data overview

* Our ideal dataset would include measures of a region's GDP, median household income, unemployment, carbon emissions, air quality, solid waste, and crime rates. Ideally, our observations would cover these variables for all counties in California across multiple years, with an ideal span of about 10 years. This would amount to 58 counties with 10 years of data each, or roughly 580 county-year observations. In the data we have found so far, the time span varies by dataset, but there is a period in which all of our datasets overlap.

* <u>**Environmental degradation**</u> data: Greenhouse gas emissions: https://www.epa.gov/system/files/other-files/2024-10/2023_data_summary_spreadsheets.zip ; Air quality: https://www.epa.gov/outdoor-air-quality-data/air-quality-statistics-report<br>

* **San Diego Median Income/Capita, Crime Rates, Unemployment info, based on specific regions in San Diego** <br>
https://data.sandiegocounty.gov/Live-Well-San-Diego/Live-Well-/San-Diego-Database/wsyp-5xpf/about_data <br> 

* <u>**Unemployment.. Less than Ideal**</u> <br>
Unemployment rate for California (as a whole):  https://labormarketinfo.edd.ca.gov/geography/california-statewide.html <br>
Unemployment rate for California (as counties): https://labormarketinfo.edd.ca.gov/geography/lmi-by-county.html <br>
* <u>Crime</u> <br>
Crime rates broken down into total crime, violent crime, and property crime rates for cities/regions in San Diego county: https://data.sandiegocounty.gov/Safety/SANDAG-Crime-Data/486f-q228/data_preview <br>


* All these sites have data that is not only obtainable but also easy to process, because it is kept in Excel or CSV files. Both formats can be read directly into pandas DataFrames with `pd.read_excel()` and `pd.read_csv()`. Of course, these sites contain more data than we need in our project, so tidying will be necessary. Moreover, we may choose to focus on specific types of crimes, such as violent crimes versus misdemeanors, or we may choose to look at all crime as a whole.


For each dataset include the following information
- Dataset #1
  - Dataset Name: Air Quality Dataset
  - Link to the dataset: https://www.epa.gov/outdoor-air-quality-data/air-quality-statistics-report
  - Number of observations: 823
  - Number of variables: 18
  - Description: This dataset tracks county-level air quality metrics across California from 2010-2025, with each row representing a county's annual aggregate data. The key variables are the AQI metrics (AQI Maximum, AQI 90th Percentile, and AQI Median), which serve as proxies for a county's air quality in a given year, with AQI Median as the baseline. The data requires cleaning to drop columns that are not needed in this project and to standardize labels so they match our other datasets. We will use `.strip()` to remove extra whitespace, `.split()` to remove redundant words, and `.rename()` to clean up labels in preparation for `.merge()`.

- Dataset #2 
  - Dataset Name: Crime Rate Dataset
  - Link to the datasets: 
    - https://dof.ca.gov/forecasting/demographics/estimates/ (for population)
    - https://openjustice.doj.ca.gov/data (for crime count)
  - Number of observations: 28591 (for crime), 1770 (for population)
  - Number of variables: 70 (for crime), 3 (for population)
  - Description: We plan to look at **Violent_sum** and **Property_sum** within the crime count dataset and **Population** within the population dataset. For both, we will look at **Year** and **County**. Both datasets contain a datetime datatype for **Year**, an integer datatype for **Violent_sum**, **Property_sum**, and **Population**, and a string datatype for **County**. They can also contain undefined values, which we replace with **0**.
  - We plan to use `.strip()` to remove extra whitespace, `pd.to_datetime()` (followed by `.dt.year`) to convert **Year** to an integer year, `.split()` to remove redundant words like '*County*' in 'Alameda *County*', `.astype()` to explicitly convert our data to integers, `.melt()` to reshape the population dataframe to a long format, `.drop()` to remove unnecessary columns, `.rename()` to make column names consistent before merging, and `.merge()` to produce the complete Crime Rate Dataset. Both datasets will be merged from 2000-2025, and the crime rate will then be calculated directly from the population count per county and the crimes committed per county. This will be the rate of crime per 100,000 residents, which will be used as a proxy for social and economic well-being.
- Dataset #3
  - Dataset name : Greenhouse Gas Emissions 
  - Link to the dataset: https://www.epa.gov/system/files/other-files/2024-10/2023_data_summary_spreadsheets.zip
  - Number of Observations: 767
  - Number of Variables: 3
  - Description: We plan to use the Greenhouse Gas Emissions dataset to look at greenhouse gas emissions per county, measured in metric tons of carbon dioxide. We plan to use **.dropna()** to remove missing data entries, **.melt()** to reshape the data into long format, and **.apply()** to strip the redundant word 'County' from the county names in the original, unwrangled dataset. The key observations for this dataset are emissions by county in California over time, which we will compare against the other datasets in the project.
- Dataset #4
  - Dataset Name: Real Total GDP by County, California
  - Link to the datasets: https://fredaccount.stlouisfed.org/public/datalist/8338
  - Number of observations: 1357
  - Number of variables: 2
  - Description: We plan to use Real Total GDP for all the counties in California to compare against crime data, measured in thousands of chained 2017 USD. This dataset measures the GDP of all industries for their respective county and adjusts it for inflation, offering a true measure of economic growth that can be compared against crime data. Since the data is based on observation dates measured at the beginning of each year, we plan to change each observation date to account for the previous year's data and drop any missing data entries.
- Dataset #5: 
  - Dataset Name: Median Household Income by County, California
  - Link to the datasets: https://fredaccount.stlouisfed.org/public/datalist/8339
  - Number of observations: 1357
  - Number of variables: 2
  - Description: We plan to use Median Household Income to compare against crime data, measured in USD, to determine whether the economic status of individuals has an effect on crime rate. This dataset measures the median household income for all counties in California. Since the data is based on observation dates measured at the beginning of each year, we plan to change each observation date to account for the previous year's data and drop any missing data entries.
- Dataset #6:
  - Dataset Name: Unemployment Rates by County, California
  - Link to the datasets: https://fredaccount.stlouisfed.org/public/datalist/8337
  - Number of observations: 1478
  - Number of variables: 2
  - Description: The dataset contains the unemployment percentage rates of each county within California each year from the year 2000 to 2024 to compare against crime rates in each given year. Since the data is based on observation dates measured at the beginning of each year, we plan to change each observation date to account for the previous year's data and drop any missing data entries.


Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
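
One way the datasets could be combined is to reshape each table to one row per county and year, then merge on those two keys. The sketch below is only an illustration under stated assumptions: `wide_to_long`, `gdp_wide`, `crime`, and `Real_GDP` are hypothetical names, the numbers are made up, and the planned shift of FRED observation dates to the previous year is not applied here.

```python
import pandas as pd

def wide_to_long(wide, value_name):
    """Reshape an observation_date x county table into County/Year rows."""
    long = wide.melt(id_vars="observation_date",
                     var_name="County", value_name=value_name)
    # Keep only the calendar year of each observation date
    long["Year"] = pd.to_datetime(long["observation_date"]).dt.year
    # Drop the trailing word 'County' so names match the other tables
    long["County"] = (long["County"]
                      .str.replace(r"\s+County$", "", regex=True)
                      .str.strip())
    return long[["County", "Year", value_name]]

# Toy inputs (made-up values, not taken from the real files)
gdp_wide = pd.DataFrame({
    "observation_date": ["2001-01-01", "2002-01-01"],
    "Alameda County": [100, 110],
    "Yuba County": [10, 11],
})
crime = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Yuba"],
    "Year": [2001, 2002, 2001],
    "Violent_rate": [650, 640, 500],
})

gdp_long = wide_to_long(gdp_wide, "Real_GDP")
# An inner merge keeps only county-years present in both tables
combined = crime.merge(gdp_long, on=["County", "Year"], how="inner")
print(combined)
```

An inner merge silently drops county-years that are missing from either table, so comparing row counts before and after the merge is a cheap way to notice lost coverage.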

In [2]:
# import necessary libraries

import pandas as pd
import numpy as np

## Air Quality Dataset

In [3]:
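import os
import re

aq_dir = './Datasets/AirQuality/'
aq_frames = []

# Load Air Quality Datasets + Add Year Column
for filename in os.listdir(aq_dir):
    if not filename.endswith('.csv'):
        continue
    # The year is the run of digits in the file name; str.strip() removes a
    # set of characters rather than a prefix/suffix, so a regex is safer here
    year_val = int(re.search(r'\d+', filename).group())
    df_temp = pd.read_csv(os.path.join(aq_dir, filename))
    aq_frames.append(df_temp.assign(Year=year_val))

# Concatenate once instead of growing the DataFrame inside the loop
aq_df = pd.concat(aq_frames)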

# Removed Redundant Naming
county_name = aq_df.get('County').str.split().str[:-2]
county_name = county_name.apply(' '.join)
aq_df['County'] = county_name
aq_df.sort_values(by=['Year', 'County'], inplace=True)

# Dropped Unnecessary Columns
aq_df = aq_df[['County', 'AQI Maximum', 'AQI 90th Percentile', 'AQI Median', 'Year']]
aq_df

Unnamed: 0,County,AQI Maximum,AQI 90th Percentile,AQI Median,Year
0,Alameda,179,72.0,52.0,2010
1,Amador,151,64.0,35.0,2010
2,Butte,126,84.0,53.0,2010
3,Calaveras,154,84.0,44.0,2010
4,Colusa,119,54.0,39.0,2010
...,...,...,...,...,...
20,Solano,16,15.0,5.5,2025
21,Stanislaus,87,85.0,37.0,2025
22,Tulare,104,57.0,34.0,2025
23,Ventura,125,54.0,43.5,2025
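
Because some counties are absent from some of the source tables, it can help to list which counties are missing in each year before merging. The following is a minimal sketch under stated assumptions: `missing_counties` is a hypothetical helper, the reference list is a made-up subset (the real one would hold all 58 California counties), and the toy rows are not real data.

```python
import pandas as pd

def missing_counties(df, reference, county_col="County", year_col="Year"):
    """Return {year: sorted reference counties absent from df in that year}."""
    reference = set(reference)
    gaps = {}
    for year, group in df.groupby(year_col):
        absent = reference - set(group[county_col])
        if absent:
            gaps[int(year)] = sorted(absent)
    return gaps

# Toy data (made up for illustration)
reference = ["Alameda", "Lassen", "Modoc", "Yuba"]
toy = pd.DataFrame({
    "County": ["Alameda", "Yuba", "Alameda", "Lassen", "Modoc", "Yuba"],
    "Year": [2010, 2010, 2011, 2011, 2011, 2011],
})

gaps = missing_counties(toy, reference)
print(gaps)
```

Years with full coverage are left out of the result, so an empty dictionary means every reference county appears in every year.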


## Crime Dataset

In [4]:
# Loading number of crime data
# Cleaning up data
crime_num_df = pd.read_csv(
    './Datasets/Crime/Crimes_and_Clearances_with_arson-1985-2023.csv', 
    dtype=object)
crime_num_df = crime_num_df.drop(columns='NCICCode')
crime_num_df.replace(' ', 0, inplace=True)
crime_num_df.replace(np.nan, 0, inplace=True)
for col in crime_num_df.columns: 
    if(col == 'County' or col == 'NCICCode'): 
        continue 
    crime_num_df[col] = crime_num_df[col].astype(int)
crime_num_df = crime_num_df[['Year', 'County', 'Violent_sum', 'Property_sum']].groupby(['Year', 'County']).sum()
crime_num_df = crime_num_df.reset_index()
# Filter to start from 2000
crime_num_df = crime_num_df[crime_num_df['Year'] >= 2000]

# Removing 'County'
crime_num_df['Year'] = crime_num_df['Year'].astype(object)
name = crime_num_df.get('County').str.split().str[:-1]
crime_num_df['County'] = name.apply(' '.join)

crime_num_df

Unnamed: 0,Year,County,Violent_sum,Property_sum
870,2000,Alameda,9485,58334
871,2000,Alpine,10,77
872,2000,Amador,179,674
873,2000,Butte,699,6514
874,2000,Calaveras,118,914
...,...,...,...,...
2257,2023,Tulare,2481,9535
2258,2023,Tuolumne,369,522
2259,2023,Ventura,2411,11071
2260,2023,Yolo,561,4968


In [5]:
# Load Population Data for 2000-2010
p1_df = pd.read_excel(
    './Datasets/Census/E4_2000-2010_Report_Final_EOC_000.xlsx',
    sheet_name=1,skiprows=3)
p1_df.dropna(how='all', inplace=True)
p1_df = pd.melt(
    p1_df, id_vars='COUNTY',
    var_name='Year', 
    value_name='Population')
p1_df['COUNTY'] = p1_df.get('COUNTY').str.strip()

# Load Population Data for 2010-2020
p2_df = pd.read_excel(
    './Datasets/Census/E-4_2010-2020-Internet-Version.xlsx',
    sheet_name=1, skiprows=1)
p2_df.drop(columns=['Column1', 'Column2'], inplace=True)
p2_df = pd.melt(p2_df, id_vars='COUNTY', var_name='Year',
    value_name='Population')
p2_df.dropna(how='all', inplace=True)
p2_df['COUNTY'] = p2_df.get('COUNTY').str.strip()
# Year is converted to an integer year after the three tables are concatenated below

# Load Population Data 2020-2025
p3_df = pd.read_excel(
    './Datasets/Census/E-4_2025_InternetVersion.xlsx',
    sheet_name=1, skiprows=2)
p3_df = p3_df.iloc[0:59]
p3_df = pd.melt(p3_df, id_vars='County', var_name='Year',
    value_name='Population')
p3_df['County'] = p3_df.get('County').str.strip()
p3_df = p3_df.rename(columns={'County': "COUNTY"})
p3_df

# Merge Population Data From 2000-2025 + Clean Up Data
pop_df = pd.concat([p1_df, p2_df, p3_df])
pop_df['Year'] = pop_df.get('Year').apply(pd.to_datetime).dt.year
pop_df.rename(columns={'COUNTY':'County'}, inplace=True)
pop_df['Year'] = pop_df['Year'].astype(object)
pop_df

Unnamed: 0,County,Year,Population
0,Alameda,2000,1443939.0
1,Alpine,2000,1208.0
2,Amador,2000,35100.0
3,Butte,2000,203171.0
4,Calaveras,2000,40554.0
...,...,...,...
349,Tuolumne,2025,54357.0
350,Ventura,2025,829005.0
351,Yolo,2025,225433.0
352,Yuba,2025,85023.0


In [6]:
# Merge Population Dataset With Crime Number Dataset
crime_rate_df = pop_df.merge(crime_num_df, on=['Year', 'County'])

# Calculate crime rate as crimes per resident
crime_rate_df['Violent_rate'] = (crime_rate_df['Violent_sum'])/(crime_rate_df['Population'])
crime_rate_df['Property_rate'] = (crime_rate_df['Property_sum'])/(crime_rate_df['Population'])

# Multiply to get rate of crime per 100,000 residents 
crime_rate_df['Violent_rate'] = (crime_rate_df['Violent_rate'] * 100000).apply(round)
crime_rate_df['Property_rate'] = (crime_rate_df['Property_rate'] * 100000).apply(round)


crime_rate_df.head()

Unnamed: 0,County,Year,Population,Violent_sum,Property_sum,Violent_rate,Property_rate
0,Alameda,2000,1443939.0,9485,58334,657,4040
1,Alpine,2000,1208.0,10,77,828,6374
2,Amador,2000,35100.0,179,674,510,1920
3,Butte,2000,203171.0,699,6514,344,3206
4,Calaveras,2000,40554.0,118,914,291,2254


In [7]:
crime_rate_df.shape

(1624, 7)
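
The three population files cover 2000-2010, 2010-2020, and 2020-2025, so boundary years such as 2010 and 2020 may appear in more than one file, in which case the merged table could carry more than one row per county and year. The sketch below shows one way to check for this; the values are made up, and keeping the last occurrence is an arbitrary choice for illustration, not necessarily the right rule for the real files.

```python
import pandas as pd

# Toy population rows: 2010 appears twice, as it might when two source
# files both include the boundary year (values are made up)
pop_toy = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Alameda", "Alameda"],
    "Year": [2009, 2010, 2010, 2011],
    "Population": [1500000.0, 1509000.0, 1510000.0, 1520000.0],
})

keys = ["County", "Year"]

# keep=False flags every row that shares its key with another row
dupes = pop_toy[pop_toy.duplicated(subset=keys, keep=False)]
print(dupes)

# Keep a single row per county-year (here: the last one encountered)
deduped = pop_toy.drop_duplicates(subset=keys, keep="last").reset_index(drop=True)
print(deduped)
```

If the check finds repeated keys, deduplicating the population table before the merge keeps each county-year from being counted more than once in later analysis.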

## Greenhouse Gas Emissions Dataset (measured in units of metric tons of carbon dioxide)

In [8]:
# loading data into a df 
Ge_df = pd.read_excel('./Datasets/ghgp_data_by_year_2023.xlsx', skiprows=3)

# filtering the df to contain only rows relevant to California
Ge_df = Ge_df[Ge_df['State']=='CA']
Ge_df = Ge_df.dropna()

Ge_df['County'] = Ge_df['County'].str.upper()
# remove the redundant word 'COUNTY' and any leftover whitespace so the same
# county is not split into two groups (e.g. 'ALAMEDA' vs 'ALAMEDA ')
Ge_df['County'] = Ge_df['County'].apply(lambda x: x.replace('COUNTY', '').strip())
year = 2023
for col in Ge_df.columns:
    if col.startswith('20'):
        Ge_df.rename(columns = {col :f'{year}'}, inplace=True)
        year = year-1
Ge_df = Ge_df[['County','2023','2022','2021','2020','2019','2018','2017','2016','2015','2014','2013','2012','2011']]
Ge_df.head()

Ge_df = pd.melt(Ge_df, id_vars= ['County',], var_name='Year', value_name = 'emmisions')
Ge_df = Ge_df.groupby(['County','Year',], as_index =False).sum()
# wrangled and tidy dataset keyed on county and year, with summed emissions
Ge_df.head(20)

Unnamed: 0,County,Year,emmisions
0,ALAMEDA,2011,388736.749724
1,ALAMEDA,2012,376022.924941
2,ALAMEDA,2013,283117.164
3,ALAMEDA,2014,276578.428
4,ALAMEDA,2015,287181.554
5,ALAMEDA,2016,260019.676
6,ALAMEDA,2017,253033.534
7,ALAMEDA,2018,229303.738
8,ALAMEDA,2019,260290.764
9,ALAMEDA,2020,246664.376


## GDP Dataset

In [9]:
#load gdp dataset, convert column names to county FIPS codes
realgdp_ca = pd.read_csv('./Datasets/GDP & Income Data/gdp-by-county-ca2.csv')
realgdp_ca.columns = [int(col.replace('REALGDPALL', '')) if col.startswith('REALGDPALL') else col for col in realgdp_ca.columns]

#load list of FIPS codes, replace FIPS codes in gdp dataset with county names
fips_df = pd.read_csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv')
fips_ca = fips_df[fips_df['state'] == 'CA']
fips_county = dict(zip(fips_ca['fips'], fips_ca['name']))
realgdp_ca.rename(columns=fips_county, inplace=True)

#units (Thousands of Chained 2017 USD, Not Seasonally Adjusted)
realgdp_ca.attrs['units'] = 'Thousands of Chained 2017 USD'

realgdp_ca

Unnamed: 0,observation_date,Alameda County,Alpine County,Amador County,Butte County,Calaveras County,Colusa County,Contra Costa County,Del Norte County,El Dorado County,...,Sonoma County,Stanislaus County,Sutter County,Tehama County,Trinity County,Tulare County,Tuolumne County,Ventura County,Yolo County,Yuba County
0,2001-01-01,86924126,84720,1360828,6790163,1098393,908729,64845726,659710,5655039,...,22424276,16559172,2532153,1419839,327124,11635782,1914694,39016318,10021979,2204099
1,2002-01-01,87945920,87647,1572280,7281711,1193532,845631,64827092,667447,6145440,...,23360909,17211668,3067456,1518222,363755,10822959,2265038,40234620,10202456,2374643
2,2003-01-01,91564585,89449,1616440,7600395,1220905,989680,65646731,692526,6297920,...,23193391,17828749,3074716,1553812,386418,11393198,2296930,42820863,10568911,2502333
3,2004-01-01,92470454,86175,1685018,7713220,1201726,814567,68905840,728039,6564920,...,23112212,18906794,3111914,1628286,373886,12631250,2317515,46880813,11055339,2426950
4,2005-01-01,94611477,87711,1677506,7912397,1290494,853126,75014351,744646,6731682,...,23574318,19533906,3005880,1739505,352092,13871037,2433650,49219057,11230028,2461033
5,2006-01-01,97221046,88802,1669942,8195172,1246157,921396,75242782,773156,6937119,...,23737230,19987363,3235685,1657115,367853,13084688,2489599,54587334,11399368,2588931
6,2007-01-01,98832535,97501,1639393,8084613,1231004,1019449,73275314,791314,6825184,...,24262255,19706375,3384828,1688061,365651,14624649,2417989,59025486,12001509,2561887
7,2008-01-01,99236414,94188,1620129,7917794,1215672,1331160,90234175,858035,6820209,...,24112663,19117205,3719483,1580861,358199,13815747,2357542,54095039,12056502,2660591
8,2009-01-01,94214542,89805,1539261,8066507,1147589,1619244,78642291,807494,6575227,...,23049293,18863423,3905209,1563087,336077,12622564,2255875,53813657,11535085,2707581
9,2010-01-01,97503433,94445,1516280,8103725,1247463,1381681,70521397,806195,6588968,...,23744974,19014237,3613524,1666055,382431,13947483,2384417,53517947,11318089,2709494


## Median Household Income Dataset

In [10]:
#load median income dataset, convert column names to county FIPS codes
medianinc_ca = pd.read_csv('./Datasets/GDP & Income Data/median-income-county-ca.csv')
medianinc_ca.columns = [col.replace('A052NCEN', '') if col.startswith('MHICA') else col for col in medianinc_ca.columns]
medianinc_ca.columns = [int(col.replace('MHICA', '')) if col.startswith('MHICA') else col for col in medianinc_ca.columns]

#replace FIPS codes in median income dataset with county names
medianinc_ca.rename(columns=fips_county, inplace=True)

#units (USD, Not Seasonally Adjusted)
medianinc_ca.attrs['units'] = 'USD, Unadjusted for inflation'

medianinc_ca

Unnamed: 0,observation_date,Alameda County,Alpine County,Amador County,Butte County,Calaveras County,Colusa County,Contra Costa County,Del Norte County,El Dorado County,...,Sonoma County,Stanislaus County,Sutter County,Tehama County,Trinity County,Tulare County,Tuolumne County,Ventura County,Yolo County,Yuba County
0,2001-01-01,54925,38401,41805,31342,40890,34722,64433,28841,51861,...,52873,39300,38013,30609,27464,31587,37745,56525,41851,29927
1,2002-01-01,55595,37691,43628,32124,42563,34556,65186,29028,53182,...,53230,40000,38585,31307,28170,32033,38770,57052,42412,30860
2,2003-01-01,56225,38825,44494,33528,43462,36579,64365,29990,54131,...,52088,41619,39718,32905,29063,33190,39620,57885,43612,32802
3,2004-01-01,57659,42827,47459,34891,46052,38350,65459,31502,56629,...,53645,43072,41289,34520,30307,34809,41067,59379,44810,34493
4,2005-01-01,60937,45283,52078,36303,47639,39186,69463,32724,62199,...,58110,46769,44914,33903,31434,38179,42381,66531,49378,35786
5,2006-01-01,64285,47515,50528,40023,52745,40240,74058,33765,67605,...,60656,48252,47174,35639,33070,41117,44991,71807,50027,37558
6,2007-01-01,68263,46136,54903,39466,51447,43882,76317,35910,64256,...,62279,50367,49104,36884,35439,40444,45478,72762,55988,40602
7,2008-01-01,70217,49320,53951,40308,52850,44622,78469,36729,67019,...,62314,50094,49146,38160,34726,44383,49151,76190,57877,46715
8,2009-01-01,68258,45391,54461,41196,51564,47472,75084,38252,68778,...,61985,48550,48073,38179,33546,39876,48027,71246,56120,40947
9,2010-01-01,66937,44241,49516,41168,50745,44981,73678,35438,65201,...,58703,47442,46188,38188,35207,42377,44751,71418,54433,41045


## Unemployment Rates Dataset

In [None]:
unemploymentrate = pd.read_csv('./Datasets/Unemployment/Unemployment_by_counties.csv')
unemploymentrate.head()

Unnamed: 0,variable,value
0,observation_date,2000-01-01
1,observation_date,2001-01-01
2,observation_date,2002-01-01
3,observation_date,2003-01-01
4,observation_date,2004-01-01
...,...,...
1470,Yuba County,10.2
1471,Yuba County,8.0
1472,Yuba County,5.3
1473,Yuba County,6.2
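
In the output above, the dates and each county's rates are stacked in a single `variable`/`value` pair of columns, so the table still needs reshaping before it can be merged by county and year. The following is a minimal sketch of one way to do that, assuming every block lists its rows in the same year order as the `observation_date` block. `stacked_to_tidy` and `Unemployment_rate` are hypothetical names, and the toy rates are made up.

```python
import pandas as pd

# Toy stand-in for the stacked layout shown above (rates are made up)
stacked = pd.DataFrame({
    "variable": ["observation_date"] * 3
                + ["Alameda County"] * 3
                + ["Yuba County"] * 3,
    "value": ["2000-01-01", "2001-01-01", "2002-01-01",
              "3.6", "4.5", "6.7",
              "6.2", "8.0", "10.2"],
})

def stacked_to_tidy(df):
    # Position of each row inside its own 'variable' block: 0, 1, 2, ...
    pos = df.groupby("variable").cumcount()
    # One column per block, aligned by position within the block
    wide = df.assign(pos=pos).pivot(index="pos", columns="variable", values="value")
    # Pull the date block out and reduce it to calendar years
    years = pd.to_datetime(wide.pop("observation_date")).dt.year
    tidy = wide.assign(Year=years).melt(
        id_vars="Year", var_name="County", value_name="Unemployment_rate")
    # Drop the trailing word 'County' and convert rates to numbers
    tidy["County"] = tidy["County"].str.replace(r"\s+County$", "", regex=True)
    tidy["Unemployment_rate"] = pd.to_numeric(tidy["Unemployment_rate"])
    return tidy.dropna(subset=["Unemployment_rate"]).reset_index(drop=True)

unemp_tidy = stacked_to_tidy(stacked)
print(unemp_tidy)
```

If a county's block is shorter than the date block, positional alignment would pair its values with the wrong years, so block lengths are worth checking with `stacked['variable'].value_counts()` before relying on this approach.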


# Ethics & Privacy

When it comes to ethics for our project, there may be county bias in the available data, since some counties are missing or underrepresented in the government datasets listed above, namely Lassen, Modoc, Sierra, and Yuba.
Additionally, underreporting may skew our data, as crime and unemployment that go unreported to the government are not accounted for. Regarding privacy, websites such as openjustice.doj.ca.gov permit public use of their data and note that their public data excludes personal information of minors and copyrighted material. Furthermore, there may be bias in our statistical analyses when looking at crime rates for specific high-income cities, which can bias our interpretations of the data. <br>

To address these issues, we will explore which counties are missing from the datasets and why they are underrepresented. That way, we can transparently report these reasons as factors that can affect our interpretation when we analyze our data. For instance, findings that indicate a strong relationship between our variables and crime rate may not be applicable to rural areas. Furthermore, data on crime and unemployment are tracked by the government, so how they are collected is outside the scope of our responsibilities. Instead, when interpreting the relationship between our variables and crime, we can acknowledge that a negative result may be a 'false negative' caused by this underrepresentation. With this understanding, we may not be able to say for certain that findings of 'no relationship' are true. We will make sure to aggregate a diverse set of counties in our datasets so that we can reduce regional bias as much as possible.

Our aim for this project is to find a more accurate measure of a county's well-being, one that considers environmental and economic factors, and to determine how it can be used in a predictive model that assists in assessing counties for their levels of crime. This project can be scaled to help the government improve its allocation of resources toward a county's well-being. However, because of bias in our project, our findings may only be applicable to counties similar to San Diego. That is, underrepresented counties should not be viewed through our lens. Beyond this bias, misuse or misinterpretation of our findings can be misleading and result in an improper measure of a county's well-being. This can lead to a reduction in government aid for select counties, which can adversely affect them.

In creating a new model for economic health, it's important to note that these measures are, in and of themselves, arbitrary and may not be truly holistic in accounting for economic health and individual needs. If a standard measure like ours were to be implemented in policy, as is the case with measures like the Genuine Progress Indicator, it may result in greater spending on certain programs without substantive change for the individual or for the economy. Consequently, we must accept that the revised formula in our model should be taken with a grain of salt, and should only be used as a prototype of a new model that can be created with publicly available data.

# Team Expectations 

* Use Discord to communicate. Within 12 hours expected, but 2 hours if close to deadline.
* Meet at least **once** a week - Every Friday 12:30 PM. 
* Have an open space where all voices are heard. Everyone should be open to ideas, criticisms, and suggestions. Make decisions as a group. Majority vote makes final decisions on major portions of the project.
* Specializations: Dean (Project Leader), Kevin (Analysis), Emily (Editor), Cedric and Richard (Coder). Project leader will be in charge of handling merge requests, delegating group responsibilities, and organizing roles/meets. Analysis is in charge of overseeing the analysis portion and guiding the rest of the group members on statistics-related items. Editor will be in charge of major writing responsibilities, including drafting reports and proofreading. Coders will advise on coding for the rest of the group and oversee data wrangling portion. These roles are not restrictive - everyone will work on each part a little, but these are the main *specializations*
* In the event someone is struggling to deliver something they promised, they are expected to let the group know as soon as possible. That way, we can delegate the tasks among the others to compensate. Project Leader will make the final decision to contact the TA as needed.

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/25  |  4 PM | Brainstormed topics and questions | Finalized research question and delegated tasks for proposal. Rough research done on topic. | 
| 4/28  |  8 PM |  Finalize Proposal. Hypothesis, Looking for Data, Ethics, How to Data Wrangle | Looking forward. Delegate tasks for how to research, retrieve data, and analysis plan | 
| 5/2  | 12:30 PM  | Compile list of datasets & clarify roles/organization  | Focus on group organization. Will we use branches? How will we check in. Potential analysis techniques we will do   |
| 5/9  | 12:30 PM  | Wrangle *some* data and have an idea on analysis | Review wrangling for correctness. Review analysis and plan   |
| 5/16  | 12:30 PM  | Finalize wrangling, continue EDA, begin devising model | Dive in fully into EDA and move onto ANALYSIS and have a complete review of project as a whole |
| 5/23 | 12:30 PM  | Continue analysis and model fully and begin drafting conclusions | Discuss difficulties and edit final draft together |
| 5/30 | 12:30 PM  | Review and fix any small details | Discuss final turn in of project before 6/13 |
| 6/13 | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys | 