**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

In [1]:
import numpy as np
import pandas as pd

# COGS 108 - EDA Checkpoint

# Names

- Dean Nafarrete
- Emily Le
- Cedric Jeng
- Kevin Morales
- Richard Lao

# Research Question

Is there a statistically significant relationship between a region's economic health(GDP and Household Income), environmental degradation(Greenhouse Gas Emissions and Air Quality), and unemployment rate on crime rates? How can we use these variables to create a new standard measurement for economic health?


## Background and Prior Work

The disproportionate amount of crime in urban areas is a topic that has been heavily explored by sociologists and economists alike. An article from the Journal of Political Economy, “Why is There More Crime in Cities?”, compiled a number of different theories. They note that crime rates in cities outpace their rural counterparts even when accounting for the larger population. Some of the theories they posit suggest that dense city environments may cultivate less connected communities compared to smaller towns, decrease risk for criminals, or that those looking to profit from crime are likely to find it in economic centers. Ultimately, they found that crime reporting is underrepresented in cities, and the likelihood of arrest for a given crime is lower.[1] Regardless, it’s difficult to pinpoint any one cause, and the theories are merely speculation on the social implications on crime. As for the environment, much less research is evident. The council on Strategic Risks highlighted this issue; extreme weather events, higher temperatures, and social factors related to stresses based on climate have been linked to various forms of crime, especially violent crime. Some examples include gender-based violence against women increasing following adverse weather events and the likelihood of mass shooting events increasing in the summer months. The working theory is that changes in the environment may indirectly cause stresses that incentivize crime more by reducing the perceived risk for potential criminals.[2] Directly connecting a factor of climate change, like environmental degradation, may lend more credence to this issue being a factor in addressing crime in the future.

Along with changes in the climate, income inequality is a known factor in crime rates. Increasing fears over crime often go hand in hand with homelessness; as such, this phenomenon has been explored in the past. The Institute of Labor Economics studied this effect in California, finding that homeless rates and crime rates are linked, but how they are linked is interesting; high rates of homelessness increase the number of violent crimes, but not property crimes.[3] Given this study was conducted at the state level, perhaps focusing on a single area may reaffirm or contradict this finding, depending on how a smaller, less diversified economy may have an effect. On the whole, while larger factors on crime rates appear to go overlooked, there have been some initiatives to address this disparity between economy, climate, inequality, and crime. Several states have implemented an alternate measure of economic health known as the genuine progress indicator, or GPI. This standard allows states to take into consideration non-economic factors like the environment and human health standards on the economy. According to the government of Maryland, this measure informs policymakers of economic progress without purely looking at the economic output, which may increase at the expense of its citizens.[4] We believe that this approach is more progressive on these issues and may prove to benefit the economy and well being of individuals in unison, however, the fact that this model is only implemented in a few states as a policy informing measurement is insufficient. Finding a general link between these factors may reveal the true cost of these factors, and potentially allow us to devise another measurement that can be achieved using public data.


1. ^ Glaeser, E. L., & Sacerdote, B. (1999). Why is There More Crime in Cities? Journal of Political Economy, 107(S6), S225–S258. https://doi.org/10.1086/250109
2. ^ Facini, A. (2024, October 17). Climate Change & Crime: A big, bad, largely overlooked Nexus. The Council on Strategic Risks. http://councilonstrategicrisks.org/2024/10/17/climate-change-crime-a-big-bad-largely-overlooked-nexus/
3. ^ Artz, B., & Welsch, D. M. (2024, June). Homelessness and crime: An examination of California. Institute of Labor Economics. https://docs.iza.org/dp17086.pdf
4. ^ Campbell, E. (n.d.). Maryland Genuine Progress Indicator. Maryland Department of Natural Resources. https://dnr.maryland.gov/mdgpi/Pages/default.aspx 


# Hypothesis


Our hypothesis is that there is a significant relationship with environmental degradation, high economic output, and unemployment that affects crime rates, and using these factors we can create a new measure of a city’s well being as an alternate measure of economic health. In order to measure these variables, we are looking at the air quality index, greenhouse gas index, unemployment percentage rates, and crime rates for each county. Higher outputs of the economy attract higher rates of crime because of the high foot traffic in retail centers where businesses sell products and people are carrying valuables or money. Furthermore, the stress of falling into unemployment and being homeless have individuals resorting to committing crimes as one of their survival options. As the environment continues to decline, this will also be another factor that leads to higher crime rates and in particular violent crimes due to lower air quality and greenhouse gases creating a toxic environment, thus inflicting environmental stress on the population. All of these variables are high stressing factors in people’s lives, which ultimately dictate their wellbeing, safety, and future hence why when these variables are threatened in their lives, people will turn towards committing crimes in order to ensure their survival. Additionally, we believe that there will be a noticeably greater significant effect on crime rates of our given variables in urbanized areas compared to its rural counterparts.

# Data

## Data overview

Our ideal dataset would be a dataset that includes measures for a region's GDP, median household income, unemployment, carbon emissions, air quality, solid waste, and crime rates. Our ideal number of observations include our variables from all counties in California across multiple years with the ideal span being about 10 years. This would look like about 58 counties with a 10 year span per county. Looking at the data we have already found our time span can vary, but there is a time period where all our datasets line up.

* Environmental degradation data: Green house Gas Emissions :https://www.epa.gov/system/files/other-files/2024-10/2023_data_summary_spreadsheets.zip ; Air Quality: https://www.epa.gov/outdoor-air-quality-data/air-quality-statistics-report 

* San Diego Median Income/Capita, Crime Rates, Unemployment info, based on specific regions in San Diego
https://data.sandiegocounty.gov/Live-Well-San-Diego/Live-Well-/San-Diego-Database/wsyp-5xpf/about_data 

* <u>**Unemployment.. Less than Ideal**</u> <br>
Unemployment rate for California (as a whole):  https://labormarketinfo.edd.ca.gov/geography/california-statewide.html <br>
Unemployment rate for California (as counties): https://labormarketinfo.edd.ca.gov/geography/lmi-by-county.html <br>
* <u>Crime</u> <br>
Crime rates broken down into total crime, violent crime, and property crime rates for cities/regions in San Diego county: https://data.sandiegocounty.gov/Safety/SANDAG-Crime-Data/486f-q228/data_preview <br>

* All these sites have data that's not only obtainable but also easily processes because they are kept in Excel files. Excel files are csv files which can be easily turned into pandas dataframes. Of course, these sites contain more data than we need in our project, so tidying will be necessary. Moreover, we may choose a focus on specific types of crimes, such as violent crimes versus misdemeanors, or we may choose to look at all crime as a whole. 


- Dataset #1
  - Dataset Name: Air Quality Dataset
  - Link to the dataset: https://www.epa.gov/outdoor-air-quality-data/air-quality-statistics-report
  - Number of observations: 823
  - Number of variables: 18
  - Description: This dataset tracks county-level air quality metrics across California from 2010-2025, with each row representing a county's annual aggregate data. The key variables include AQI metrics (AQI Maximum, AQI 90th Percentile, and AQI Median), which serve as proxies for understanding the air quality in a respective county in a specific year and as baseline air quality respectively. The data requires cleaning to handle columns of information not needed in this project as well as standardizing labels to fit with our other datasets. 

- Dataset #2 
  - Dataset Name: Crime Rate Dataset
  - Link to the datasets: 
    - https://dof.ca.gov/forecasting/demographics/estimates/ (for population)
    - https://openjustice.doj.ca.gov/data (for crime count)
  - Number of observations: 28591 (for crime), 1770 (for population)
  - Number of variables: 70 (for crime), 3 (for population)
  - Description: We plan to look at **Violent_sum**, **Property_sum** within the crime count dataset & **Population** within the population dataset. For both, we will look at **Year** and **County**. Both datasets contain datetime datatype for **Year**, integer datatype for **Violent_sum**, **Property_sum**, & **Population**, and string datatype for **County**, but can contain undefined values which we will use **0** as a proxy. 
  - We plan to use `.strip()` for removing empty spaces, `pd.to_datetime()` for converting **Year** to an integer datatype, `.split()` for removing redundant words like '*County*' in 'Alameda *County*', `.astype()` to explicity convert our data to integer, `.melt()` to reshape the population dataframe to a long format, `.drop()` for unecessary columns, `.rename()` for making consistent columns before merge, and `.merge()` to get the complete Crime Rate Dataset. Both datasets will be merged from 2000-2025, where crime rate will then be calculated directly using the population count per county and crime commited per county to get the rate of crime per 100,000 residents. 
- Dataset #3
  - Dataset name : Greenhouse Gas Emissions 
  - Link to the dataset: https://www.epa.gov/system/files/other-files/2024-10/2023_data_summary_spreadsheets.zip
  - Number of Observations:767
  - Number of Variables:3
  - Description: We plan to use the the Green House Gas Emissions data set to look at emissions of green house gases per county, measured in units of metric tons of carbon dioxide. We plan on using **.dropna()** in order to get rid of missing data entries as well as using **.melt()** to reshape the data into long format and **.apply()** to further wrangle the dataset in order to remove the redundancy of county in the original unwrangled and tidy dataset . Further more the key observations for this datasets will ne looking at the emmisions by county in CA overtime 
- Dataset #4
  - Dataset Name: Real Total GDP by County, California
  - Link to the datasets: https://fred.stlouisfed.org/release/tables?eid=1071149&rid=397
  - Number of observations: 1357
  - Number of variables: 2
  - Description: We plan to use Real Total GDP for all the counties in California to compare against crime data, measured in thousands of chained 2017 USD. This dataset measures the GDP of all industries for their respective county and adjusts it for inflation, offering a true measure of economic growth that can be compared against crime data. Since the data is based on observation dates measured at the beginning of each year, we plan to change each observation date to account for the previous year's data and drop any missing data entries.
- Dataset #5: 
  - Dataset Name: Median Household Income by County, California
  - Link to the datasets: https://fred.stlouisfed.org/searchresults/?st=median%20household%20income%20county%20california
  - Number of observations: 1357
  - Number of variables: 2
  - Description: We plan to use Median Household Income to compare against crime data, measured in USD, to determine whether the economic status of individuals has an effect on crime rate. This dataset measures the median household income for all counties in California. Since the data is based on observation dates measured at the beginning of each year, we plan to change each observation date to account for the previous year's data and drop any missing data entries.
- Dataset #6:
  - Dataset name : Unemployment
  - Link to the dataset: https://fredaccount.stlouisfed.org/public/datalist/8337
  - Description: The dataset contains the unemployment percentage rates of each county within California each year. The percentages are measured each year from the beginning of Januarary 2000 to 2024. We are going to be looking at the percent of unemployment each year and the specific year variables. This dataset is already cleaned up and wrangled through the data list website.
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Dataset #1 (use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

## Temperature Dataset

In [12]:
#list of counties in california for merging
ca_counties = [
    "Alameda County",
    "Alpine County",
    "Amador County",
    "Butte County",
    "Calaveras County",
    "Colusa County",
    "Contra Costa County",
    "Del Norte County",
    "El Dorado County",
    "Fresno County",
    "Glenn County",
    "Humboldt County",
    "Imperial County",
    "Inyo County",
    "Kern County",
    "Kings County",
    "Lake County",
    "Lassen County",
    "Los Angeles County",
    "Madera County",
    "Marin County",
    "Mariposa County",
    "Mendocino County",
    "Merced County",
    "Modoc County",
    "Mono County",
    "Monterey County",
    "Napa County",
    "Nevada County",
    "Orange County",
    "Placer County",
    "Plumas County",
    "Riverside County",
    "Sacramento County",
    "San Benito County",
    "San Bernardino County",
    "San Diego County",
    "San Francisco County",
    "San Joaquin County",
    "San Luis Obispo County",
    "San Mateo County",
    "Santa Barbara County",
    "Santa Clara County",
    "Santa Cruz County",
    "Shasta County",
    "Sierra County",
    "Siskiyou County",
    "Solano County",
    "Sonoma County",
    "Stanislaus County",
    "Sutter County",
    "Tehama County",
    "Trinity County",
    "Tulare County",
    "Tuolumne County",
    "Ventura County",
    "Yolo County",
    "Yuba County"
]
tempdf_list = []

In [25]:
#load all county datasets, merge into one dataset
for county in ca_counties:
    dataset = f'./Datasets/Temperature/{county}.csv'
    temp_env0 = pd.read_csv(dataset, skiprows=4)
    temp_env0['County'] = county
    tempdf_list.append(temp_env0)
temp_env = pd.concat(tempdf_list, ignore_index=True)
#cleanup

temp_env.rename(columns={'Value':'Average Temperature'})
temp_env = temp_env.iloc[:, [0, 1, 2]]
temp_env.columns = ['Date', 'Average Temperature', 'County']
temp_env = temp_env.dropna(subset=['Date', 'Average Temperature'])
temp_env['Year'] = temp_env['Date'].astype(str).str[:4].astype(int)


temp_env

Unnamed: 0,Date,Average Temperature,County,Year
0,200002,58.5,Alameda County,2000
1,200003,58.9,Alameda County,2000
2,200004,59.2,Alameda County,2000
3,200005,59.5,Alameda County,2000
4,200006,59.8,Alameda County,2000
...,...,...,...,...
135133,202412,61.1,San Luis Obispo County,2024
135134,202501,61.0,San Luis Obispo County,2025
135135,202502,61.2,San Luis Obispo County,2025
135136,202503,61.1,San Luis Obispo County,2025


# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

When it comes to dealing with ethics for our project, there may be potential county bias in the data available since it may be the case that there are missing counties that are underrepresented in the available government datasets listed above. That is, Lassen, Modoc, Sierra, and Yuba.
Additionally, there may be a confounding variable as not all crime and unemployment may be accounted for if not reported to the government. Though a confounding variable, the data collected from websites such as openjustice.doj.ca.gov permits the public usage of the data from their webiste, noting that their website public data is made sure to not include personal information of minors and or use copyrighted material. Furthermore, there may be bias in our statistical analyses when it comes to looking at the rate of crime rate for a specific high income cities which can bias our interpretations of the data. Additionaly there may be bias in looking at the rate of Carbon Dioxide Gas emmisions in metric tons when looking at big counties in California. This may cause underepresented counties to not get the economic support from California to better their air quality when they most need it.<br>

To address these issues, we will explore what missing counties there are in the datasets and why they are underrepresented. That way, we can transparently report these reasons as factors that can impact our intrepretation when we analyze our data. For instance, findings that indicate a strong relationship between our variables and crime rate may not be applicable to rural areas. Furthermore, regarding data collections, data regarding crime and umemployment are tracked by the government, but this is something out of our scope of responsibilities. Instead, we can acknowledge that a negative may be consistent with a 'false negative' because of the underrepresentation when interpreting the relationship between our variables and crime. In this understanding, we may not be able to say for certain that findings of 'no relationship' is true. 

Our aim for this project is to find a more accurate measure for county's well-being, i.e. environment and economic factors may be considered, and how we can use this for a predictor model that can assist in assessing counties for their levels of crime. This project can be scaled for use in determining how the government can improve their allocation of resources to improve a county's well-being. However, because of bias in our project, our findings may only be applicable to counties similar to San Diego. That is, underrepresented counties should not be observed with our lens. Despite this bias, misuse or misinterpretation of our finding can be misleading and improperly measure a county's well-being. This can lead to reduction in select counties' aid from the government that can adversely affect them. Specifically when looking at funding for county police departments to combat crime or even funding for environmental initaitives to improve air quality in underepresented counties.

# Team Expectations 

* Use Discord to communicate. Within 12 hours expected, but 2 hours if close to deadline.
* Meet at least **once** a week - Every Friday 12:30 PM. 
* Have an open space where all voices are heard. Everyone should be open to ideas, criticisms, and suggestions. Make decisions as a group. Majority vote makes final decisions on major portions of the project.
* Specializations: Dean (Project Leader), Kevin (Analysis), Emily (Editor), Cedric and Richard (Coder). Project leader will be in charge of handling merge requests, delegating group responsibilities, and organizing roles/meets. Analysis is in charge of overseeing the analysis portion and guiding the rest of the group members on statistics-related items. Editor will be in charge of major writing responsibilities, including drafting reports and proofreading. Coders will advise on coding for the rest of the group and oversee data wrangling portion. These roles are not restrictive - everyone will work on each part a little, but these are the main *specializations*
* In the event someone is struggling to deliver something they promised on, it is expected to let the group know as soon as possible. That way, we can look to delegate the tasks among the others to compensate. Project Leader will make final decision to contact TA as needed.

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Changes the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/25  |  4 PM | Brainstormed topics and questions | Finalized research question and delegated tasks for proposal. Rough research done on topic. | 
| 4/28  |  8 PM |  Finalize Proposal. Hypothesis, Looking for Data, Ethics, How to Data Wrangle | Looking forward. Delegate tasks for how to research, retrieve data, and analysis plan | 
| 5/2  | 12:30 PM  | Compile list of datasets & clarify roles/organization  | Focus on group organization. Will we use branches? How will we check in. Potential analysis techniques we will do   |
| 5/9  | 12:30 PM  | Wrangle *some* data and have an idea on analysis | Review wrangling for correctness. Review analysis and plan   |
| 5/16  | 12:30 PM  | Finalize wrangling, continue analysis, begin devising model | Dive in fully into analysis and have a complete review of project as a whole |
| 5/23 | 12:30 PM  | Complete analysis and model fully and begin drafting conclusions | Discuss difficulties and edit final draft together |
| 5/30 | 12:30 PM  | Review and fix any small details | Discuss final turn in of project before 6/13 |
| 6/13 | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys | 