# **Safe n' Sound**
## by Suraj Gangaram

## **Introduction**
Through my final project for HCDE 410, I aim to shed light on the **lack of safety** in lower-income neighborhoods. By analyzing different neighborhoods' **crime and safety data** with respect to the **socioeconomic standing** of their residents, I hope to draw attention to the discrepancies between the amount of police and crime protection that low-income neighborhoods receive in comparison to higher-income neighborhoods. Income level is approximated here by the Federal Housing Finance Agency's House Price Index (**HPI**). In my research, I have only seen separate datasets covering each topic individually, and none that try to relate the amount or type of crime in a given neighborhood to the average income level of those who live there. With the results gathered from this project, I hope to provide useful insight for those representing lower income neighborhoods and an incentive to bring about change, through implementing **new legislation and bills** to ensure that those with low incomes feel safe and are protected from crimes within their neighborhoods. Additionally, I hope that this project will spark interest in others to research the differences in crime and protection and to provide more data in support of changes that address these inequities, which makes the project beneficial from both a practical and a human-centered perspective.

## **Background and Related Work**
There are datasets related to **house price indices** and to the **types of crime occurring** in various areas throughout the United States. However, I have not seen any dataset that compares these two variables of interest.

The [Housing-Index dataset](https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index-Datasets.aspx) is verified by the **United States Federal Housing Finance Agency** or FHFA. The dataset is collected by reviewing mortgage transactions on single family properties since January 1975. Each transaction is either purchased or secured by mortgage loan companies Freddie Mac or Fannie Mae. This dataset was last updated on November 29, 2020. Considering this dataset was reviewed by a government agency and two mortgage loan companies further secured it, I believe this dataset is very trustworthy and credible.

The [Crime In Context dataset](https://www.kaggle.com/datasets/marshallproject/crime-rates) was created by Gabriel Dance, Tom Meagher, and Emily Hopkin of the Marshall Project. They compiled it from data on **the four major crimes** the FBI classifies as violent: homicide, rape, robbery and assault. The dataset covers 68 police jurisdictions with populations of 250,000 or greater. The 1975 - 2014 data came directly from the FBI Uniform Crime Reporting program's "Offenses Known and Clearances by Arrest" database. The 2015 data was obtained by contacting the police agencies directly, and was collected for only 61 jurisdictions. Crime rates are calculated per 100,000 residents of a police jurisdiction, and the 2015 rates are based on 2014 population estimates. Considering that most of this dataset came from the FBI and the rest came directly from police agencies, I believe that the source of the data is very secure and trustworthy. As for the analysis, the depth in which the Marshall Project describes its methodology, and the fact that it names the people who worked on the analysis, make it quite credible and trustworthy.

Through having learned about such background information within these datasets, I am motivated to combine the two variables in each respective dataset to go about comparing how socioeconomic standing, measured by house price indices, is related to crime rates in a given area. Each of my research questions answers something about each facet within the two datasets, **crime** and **income level**, as I aim to draw a conclusion about the association between the two through this project.

## **Overview and Research Questions**
The research questions I address through this project are motivated by procuring the basic safety needs for those living within the neighborhoods of interest. I am cognizant of the variances in police protection within higher income communities in contrast to lower income communities. Shedding light on and addressing these questions are vital to bringing about change. These research questions are important because they will challenge the current policing and crime protection of lower income communities and provide the necessary basis to explore the data regarding crime rates. With the data collected, I plan to help find answers to the questions below:

1. How much of an issue is neighborhood safety in low income areas?
2. What type of crimes are most likely to occur in low-income areas?
3. What is the difference between crime rates in low-income and high-income areas?


## **Data Selected for Analysis**
The [dataset](https://www.kaggle.com/datasets/sandeep04201988/housing-price-index-using-crime-rate-data?select=merged_dataset.csv) I have chosen compares housing price index with crime rates in various cities around the United States from 1975 to 2015. It has 3477 rows/observations and 9 columns/features. 

The data represents communities from around the United States over the span of 40 years. Some of the crimes, or variables, recorded in this dataset are homicides, rapes, assaults and robberies. The city, population, year and housing price index are also included. The data was drawn from two sources: another Kaggle dataset, created by the **Marshall Project**, that measures crime rates across the United States from 1975 to 2015, and the FHFA house price index. The Marshall Project, created in 2014, is a nonprofit, online journalism organization with a focus on issues related to criminal justice in the United States. It has been funded by donations and grants from various individuals and foundations. The FHFA is a federal agency in the United States that regulates mortgages. The house price index is a measure of the movement of single-family house prices. The dataset I use is the result of merging the Housing-Index dataset and the Crime In Context dataset. The validity of the merged dataset itself is questionable, but the two source datasets it is composed of are very credible and secure. I have merged the two source datasets myself to ensure the validity, security, and credibility of the data I am working with.

I obtained the housing price index and crime rate dataset from Kaggle. It was created by merging the two datasets from The Marshall Project and FHFA, listed above. They were merged together by Kaggle contributor, SandeepRamesh, with the goal of uncovering a correlation between **crime rate** and **housing price index**.
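A merge of this kind can be sketched with dplyr's `inner_join()`. This is only a minimal sketch under stated assumptions, not SandeepRamesh's actual script: the two small data frames `hpi` and `crime` are illustrative stand-ins, their column names are borrowed from the merged file, and the real source files may be laid out differently. The values are taken from the first rows of the merged dataset.

```r
library(dplyr)

# Illustrative stand-ins for the two sources (assumed layout, not the real files).
hpi <- data.frame(
  Year        = c(1975, 1975, 1975),
  City..State = c("Atlanta, GA", "Chicago, IL", "Seattle, WA"),
  index_nsa   = c(41.08, 30.75, 20.385)
)
crime <- data.frame(
  Year           = c(1975, 1975, 1975),
  City..State    = c("Atlanta, GA", "Chicago, IL", "Oakland, CA"),
  Violent.Crimes = c(8033, 37160, 5900)
)

# Keep only the city-years that appear in both sources.
merged <- inner_join(hpi, crime, by = c("Year", "City..State"))

# List the rows each side would lose, so the merge can be audited.
hpi_only   <- anti_join(hpi, crime, by = c("Year", "City..State"))
crime_only <- anti_join(crime, hpi, by = c("Year", "City..State"))

print(merged)
print(hpi_only)
print(crime_only)
```

Checking the two `anti_join()` results is a cheap way to see how many rows a merge silently drops before trusting the merged table.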

As you can see in the code chunk below, I began working with the merged HPI and crime rate dataset by cleaning it with R's **dplyr** library (loaded here as part of the tidyverse), which streamlines data manipulation by providing clearly named functions:

```r
# Clean Data Script
#----------------------------------------------------------------------------#
# Load HPI_CrimeRate.csv into a dataframe. Use dplyr to filter out empty rows
# to make sure the data is clean.
#
# Note: run this from the directory where the local repository is stored
# (or call setwd() with that path first).
#----------------------------------------------------------------------------#
library(tidyverse)

hpi_crime_raw <- read.csv("HPI_CrimeRate.csv")

# Keep only rows that have a Year. Year is read as an integer column here,
# so empty cells arrive as NA; the != "" check also covers the case where
# Year is read as text.
clean_data <- function(data) {
  clean <- data %>%
    filter(!is.na(Year), Year != "")
  return(clean)
}

hpi_crime_clean <- clean_data(hpi_crime_raw)
head(hpi_crime_clean)
```

| | Year | index_nsa | City..State | Population | Violent.Crimes | Homicides | Rapes | Assaults | Robberies |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1975 | 41.08 | Atlanta, GA | 490584 | 8033 | 185 | 443 | 3518 | 3887 |
| 2 | 1975 | 30.75 | Chicago, IL | 3150000 | 37160 | 818 | 1657 | 12514 | 22171 |
| 3 | 1975 | 36.35 | Cleveland, OH | 659931 | 10403 | 288 | 491 | 2524 | 7100 |
| 4 | 1975 | 20.91 | Oakland, CA | 337748 | 5900 | 111 | 316 | 2288 | 3185 |
| 5 | 1975 | 20.385 | Seattle, WA | 503500 | 3971 | 52 | 324 | 1492 | 2103 |
| 6 | 1975 | 31.215 | Washington, DC | 716000 | 12704 | 235 | 520 | 2812 | 9137 |


I filter the dataset to keep only the rows in which Year is not empty (`""`), which in this case means it is not N/A. In other words, I exclude the rows for which **no year is provided** for the crime data. I then assign the cleaned dataset to a clearly named variable for convenience.
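The effect of this filter can be seen on a three-row toy data frame. This is a minimal sketch: `toy` is a hypothetical stand-in with one blank record, not a slice of the real file, and it assumes that an empty Year cell is read in as `NA`.

```r
library(dplyr)

# Hypothetical stand-in: the middle row mimics a record with no Year.
toy <- data.frame(
  Year        = c(1975L, NA, 1975L),
  index_nsa   = c(41.08, NA, 30.75),
  City..State = c("Atlanta, GA", "", "Chicago, IL")
)

# Same idea as clean_data(): keep only rows that have a Year.
cleaned <- toy %>%
  filter(!is.na(Year))

# Comparing against "" also drops the NA row, because NA != "" evaluates
# to NA and filter() only keeps rows where the condition is TRUE.
cleaned_alt <- toy %>%
  filter(Year != "")

cat("rows before:", nrow(toy), "\n")
cat("rows after: ", nrow(cleaned), "\n")
```

Printing the row counts before and after cleaning makes it easy to spot when a filter removes more data than expected.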

## **Methodology**
To properly frame my analysis and gather valuable and actionable insights from the crime dataset of interest, I used the programming language **R**. It offers a multitude of libraries for reading, parsing, filtering and cleaning datasets. Using the **dplyr** library and its many functions such as **group_by()** and **filter()**, I was able to group various data tables and calculate aggregate values essential to my findings. I answered my respective research questions through the methods below:

1. I addressed the first research question by selecting the lower income communities and observing the amount of violent crimes that occur within them. 
2. Using the different types of crimes listed in the dataset, I answered the second research question by determining which crimes are most likely to occur in low income areas through comparing the proportion of the total violent crimes each one of them makes up in certain communities. 
3. I answered the third research question by grouping the housing price index into different income levels and comparing the differences in crime rates between them. 

From there, I used the **ggplot()** function from ggplot2 to build each plot and the **Plotly** library's **ggplotly()** function to make it interactive. I have presented my filtered data using a bar chart, a scatterplot and a box & whisker plot, depending on what I feel is best suited to each research question.

Using functions to filter through the dataset has enabled me to highlight important data that I can go about graphing and plotting, serving as visual representations for an audience interested in learning more about the topic and potentially wanting to take action. Using these methods, I feel that I have been able to present my findings in a manner that is impactful and actionable, bringing light to the underlying issues faced by neighborhoods that often go unspoken or are even purposefully hidden to the public. 

## **Findings**

In this section, I discuss my findings by providing my code for the following visualizations, each answering one of the three research questions.

### **Disclaimer**
I was unable to make my data visualizations run because they require a separate library in R that I was not able to integrate into Jupyter Notebook. However, I have included all the code cells for each of the three data visualizations as well as a screenshot of them.
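One way to keep a notebook usable when the interactive library is missing is to fall back to the static ggplot object. The sketch below is an assumption about how that could be done and is not part of the project code: `to_interactive()` is a hypothetical helper name, and the two demo rows are taken from the first rows of the cleaned dataset.

```r
# Hypothetical helper (not part of the project code): return an interactive
# plotly version of a ggplot when the plotly package is installed, and the
# unchanged static plot otherwise.
to_interactive <- function(plot) {
  if (requireNamespace("plotly", quietly = TRUE)) {
    plotly::ggplotly(plot)
  } else {
    plot
  }
}

# Usage, guarded so the snippet also runs where ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  demo <- data.frame(
    index_nsa      = c(41.08, 30.75),
    Population     = c(490584, 3150000),
    Violent.Crimes = c(8033, 37160)
  )
  p <- ggplot2::ggplot(
    demo,
    ggplot2::aes(x = index_nsa, y = Violent.Crimes / Population)
  ) +
    ggplot2::geom_point()
  p_out <- to_interactive(p)
  print(class(p_out))
}
```

With a helper like this, each plotting function could end in `return(to_interactive(plot))`, so the notebook would still show a static chart where Plotly is unavailable.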

### Plot 1

This code uses the cleaned dataset from the previous code cell to create a scatterplot with the Housing Price Index values on the **x-axis** and the violent crime rate, as a decimal fraction of the population, on the **y-axis**. I use a function called **mutate()** to divide the number of crimes by the population for each row, and use the resulting rates as individual points for the scatterplot. From there, I use the ggplot functions from R to set the transparency and shape of each point and to add a red smoothed trend line.

```r
#----------------------------------------------------------------------------#
# Plot that tackles question: How much of an issue is neighborhood safety in
# low income areas?
#
# Note: run this from the directory where the local repository is stored
# (or call setwd() with that path first).
#----------------------------------------------------------------------------#
source("cleanData.R")
library("plotly")
library("stringr")

# Scatterplot of HPI against the rate of one crime column.
#   data:  cleaned HPI/crime data frame
#   crime: name of a crime count column, e.g. "Homicides"
createPlot <- function(data, crime) {
  plotOnedata <- data %>%
    mutate(VCR = .data[[crime]] / Population) %>%  # crimes per resident
    na.omit()                                      # drop rows with any NA
  # Drop the last character (the plural "s") of the column name for the label
  y_axis <- paste(str_sub(crime, 1, nchar(crime) - 1), "Violent Crimes Rate")
  plotOne <- ggplot(data = plotOnedata) +
    geom_point(aes(x = index_nsa, y = VCR), alpha = .5,
               shape = 0) +
    geom_smooth(aes(x = index_nsa, y = VCR), color = "red") +
    labs(
      title = paste("Housing Price Index vs", y_axis),
      x = "Housing Price Index Value",
      y = y_axis
    ) + theme_minimal()
  graph <- ggplotly(plotOne)
  return(graph)
}
```

ERROR: Error in library("plotly"): there is no package called ‘plotly’


The resulting data visualization as a **scatterplot** comparing the two variables is shown below:

![Screenshot 2023-06-04 at 9.22.07 PM.png](attachment:6295a1d6-a848-4d17-b627-a04a0f1b8faa.png)

This scatterplot shows the relationship between Housing Price Index (HPI) values and the violent crime rates associated with them. This plot was made to answer the research question of how much of an issue safety is in low-income areas, and it suggests that neighborhoods with HPI values **higher than 200** experience **significantly less crime** compared to neighborhoods under that value.
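The rate on the y-axis is simple arithmetic, and it can be checked by hand on the six rows printed by `head()` earlier. The sketch below is illustrative only: `head_rows` and `per_100k` are names introduced here, and the per-100,000 column is added just to show how the decimal rate relates to the convention used in the Crime In Context source data.

```r
library(dplyr)

# First six rows of the cleaned dataset, copied from the head() output.
head_rows <- data.frame(
  City..State    = c("Atlanta, GA", "Chicago, IL", "Cleveland, OH",
                     "Oakland, CA", "Seattle, WA", "Washington, DC"),
  index_nsa      = c(41.08, 30.75, 36.35, 20.91, 20.385, 31.215),
  Population     = c(490584, 3150000, 659931, 337748, 503500, 716000),
  Violent.Crimes = c(8033, 37160, 10403, 5900, 3971, 12704)
)

rates <- head_rows %>%
  mutate(
    VCR      = Violent.Crimes / Population,  # decimal rate, as on the y-axis
    per_100k = VCR * 100000                  # same rate per 100,000 residents
  ) %>%
  arrange(desc(VCR))                         # highest rate first

print(rates)
```

Because the rate divides by population, a large city such as Chicago can have far more violent crimes than Oakland in absolute terms and still have a lower rate.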

### Plot 2

This code uses the cleaned dataset to create a stacked bar chart with the Violent Crimes For HPI under 110 on the **x-axis** and the Percentage of Major Crime on the **y-axis**. The types of major crime are split into four categories: assaults, homicides, rapes, and robberies, and there is a color-coded legend for the audience to discern the different types. I took the sum of each distinct type of crime and wrote **na.rm = TRUE** within the code to omit all the missing (NA) values from the calculation of the overall percentage. That percentage then made up the variable on the y-axis.

```r
#----------------------------------------------------------------------------#
# Plot that tackles question: What type of crimes are most likely to occur in
# low-income areas?
#
# Note: run this from the directory where the local repository is stored
# (or call setwd() with that path first).
#----------------------------------------------------------------------------#
source("cleanData.R")
library("plotly")

# Stacked bar of each crime type's share of all violent crimes, for rows
# whose HPI is at or below indexValue.
createStackedBar <- function(data, indexValue) {
  lowHpiData <- data %>%
    filter(index_nsa <= indexValue)
  crimes <- c("Homicides", "Rapes", "Assaults", "Robberies")
  totalViolent <- sum(lowHpiData$Violent.Crimes, na.rm = TRUE)
  # One row per crime type; na.rm = TRUE skips missing counts
  plotTwoData <- data.frame(
    group = crimes,
    percentage = round(
      sapply(crimes, function(crime) sum(lowHpiData[[crime]], na.rm = TRUE)) /
        totalViolent * 100, 2)
  )
  plotTwo <- ggplot(plotTwoData, aes(x = "", y = percentage, fill = group)) +
    geom_bar(stat = "identity", width = 1) +
    xlab(paste("Violent Crimes For HPI under", indexValue)) +
    ylab("Percentage of Major Crime") + theme_linedraw()

  return(ggplotly(plotTwo))
}
```

ERROR: Error in library("plotly"): there is no package called ‘plotly’


The resulting data visualization as a **stacked bar chart** comparing the two variables is shown below:

![Plot2_SS.png](attachment:c48f7c26-1e57-433a-8b45-86fd0c4afb66.png)

This plot examines assaults, homicides, rapes, and robberies as a percentage of the total number of violent crimes for areas with the lowest HPI values (the lowest overall income threshold). The stacked bar chart allows for a clear display of the breakdown among the four listed crimes. It is evident that **assaults** and **robberies** make up the majority of all violent crimes faced in low income neighborhoods.
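The share calculation behind the bar chart can be reproduced on a small scale. This is a minimal sketch that uses only the six rows printed by `head()` earlier (all from 1975, all with HPI under 110), so its percentages describe those six rows and not the full dataset; the names `head_rows`, `totals` and `shares` are introduced here for illustration.

```r
library(dplyr)

# First six rows of the cleaned dataset, copied from the head() output.
head_rows <- data.frame(
  index_nsa      = c(41.08, 30.75, 36.35, 20.91, 20.385, 31.215),
  Violent.Crimes = c(8033, 37160, 10403, 5900, 3971, 12704),
  Homicides      = c(185, 818, 288, 111, 52, 235),
  Rapes          = c(443, 1657, 491, 316, 324, 520),
  Assaults       = c(3518, 12514, 2524, 2288, 1492, 2812),
  Robberies      = c(3887, 22171, 7100, 3185, 2103, 9137)
)

crimes <- c("Homicides", "Rapes", "Assaults", "Robberies")

# Rows at or below the HPI threshold used in the plot.
low_hpi <- head_rows %>%
  filter(index_nsa <= 110)

# Total count per crime type, ignoring missing values.
totals <- low_hpi %>%
  summarise(across(all_of(crimes), ~ sum(.x, na.rm = TRUE)))

# Each crime type as a percentage of all violent crimes.
shares <- data.frame(
  group      = crimes,
  percentage = round(
    unname(unlist(totals[crimes])) /
      sum(low_hpi$Violent.Crimes, na.rm = TRUE) * 100, 2)
)

print(shares)
```

Dividing by the summed `Violent.Crimes` column rather than by the sum of the four categories keeps the result consistent with how the plot code defines the percentage.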

### Plot 3

This code uses the cleaned dataset to create a box and whisker plot with the HPI Group on the **x-axis** and the Crime Percentage on the **y-axis**. The measured HPI groups are split into three categories: low (score from **20-141**), medium (**141-262**) and high (**262-387**). I felt that a box and whisker plot could highlight the extremes of the percentage of crime experienced, especially in the low HPI group's case.

```r
#----------------------------------------------------------------------------#
# Plot that tackles question: What is the difference between crime rates in
# low-income and high-income areas?
#
# Note: run this from the directory where the local repository is stored
# (or call setwd() with that path first).
#----------------------------------------------------------------------------#
source("cleanData.R")
library("plotly")

# Box plot of violent crime percentage for each HPI group.
# selectedHPIGroup is currently unused; it is kept so existing calls still work.
createBoxPlot <- function(data, selectedHPIGroup) {
  plotThreeData <- data %>%
    # case_when() takes the first matching rule, so a boundary value such as
    # 141 is assigned to the lower group
    mutate(HPI_group = case_when(
      between(index_nsa, 20, 141) ~ "Low",
      between(index_nsa, 141, 262) ~ "Medium",
      between(index_nsa, 262, 387) ~ "High",
      TRUE ~ NA_character_  # anything outside 20-387
    )) %>%
    mutate(crime_percentage = (Violent.Crimes / Population) * 100) %>%
    select(HPI_group, crime_percentage)
  plotThree <- ggplot(data = plotThreeData) +
    geom_boxplot(aes(x = HPI_group, y = crime_percentage, fill = HPI_group)) +
    labs(
      title = "Comparing Crime Percentages of Various HPI Groups",
      x = "HPI Group",
      y = "Crime Percentage",
      fill = "HPI"
    ) + theme_minimal()

  graph <- ggplotly(plotThree)
  return(graph)
}
```

ERROR: Error in library("plotly"): there is no package called ‘plotly’


The resulting data visualization as a **box and whisker plot** comparing the two variables is shown below:

![Plot3_SS.png](attachment:1ddcaa61-6469-464b-af6c-2ec44be37132.png)

The median crime percentage was roughly 1.18% for low HPI neighborhoods, 0.96% for medium, and 0.74% for high. This decrease in crime percentage as HPI rises makes it very apparent that there are **more crimes** that occur in **lower HPI neighborhoods**.
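The grouping and the per-group median can be sketched with `case_when()`, `group_by()` and `summarise()`. The rows in `toy` are hypothetical values chosen only to exercise the bin boundaries; they are not taken from the dataset and do not reproduce the medians reported above.

```r
library(dplyr)

# Hypothetical rows for illustration only (not values from the dataset).
toy <- data.frame(
  index_nsa        = c(20, 100, 141, 200, 262, 300, 387, 400),
  crime_percentage = c(1.5, 1.1, 1.3, 0.9, 1.0, 0.7, 0.6, 0.5)
)

grouped <- toy %>%
  mutate(HPI_group = case_when(
    between(index_nsa, 20, 141)  ~ "Low",        # 141 lands here: first match wins
    between(index_nsa, 141, 262) ~ "Medium",     # 262 lands here
    between(index_nsa, 262, 387) ~ "High",
    TRUE                         ~ NA_character_ # outside 20-387
  ))

# Median crime percentage per HPI group, leaving out unassigned rows.
medians <- grouped %>%
  filter(!is.na(HPI_group)) %>%
  group_by(HPI_group) %>%
  summarise(
    median_crime_percentage = median(crime_percentage),
    rows = n()
  )

print(grouped)
print(medians)
```

Because `between()` includes both endpoints, the three ranges overlap at 141 and 262, and the order of the rules decides which group those boundary values fall into.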

## **Discussion**
It is important to ensure that low-income neighborhoods are safe, especially since assaults and robberies make up the majority of violent crimes in such low HPI areas. Furthermore, the spread of crime percentages is higher overall for areas on the lower end of the HPI, meaning that lower income neighborhoods see higher rates of crime than those in the middle or upper end of the HPI. Since violent crime is more common in neighborhoods with a lower HPI, change has to happen soon. Based on the data, safety is a paramount issue that needs to be addressed in lower income areas. City council members should tackle these issues in order to remove the stigma around low-income neighborhoods being unsafe. Neighborhoods such as the notorious “O Block” in Chicago get a bad reputation because of many factors, including the fact that they are labeled as low income. Change has to happen at the grassroots level, involving city council members, protection agencies and citizens, in order to bring about safer lower income neighborhoods.

## **Limitations**

After cleaning the dataset, I found that only 1714 rows/observations were usable. I believe this is due to programming errors in earlier merges of the dataset, before I obtained the data. However, this was likely a formatting error rather than significant data loss, so I do not feel that it significantly distorts the results.

## **Implications and Conclusion**

It is good that we, as a community, highlight the issues within neighborhoods with the hope of bettering them. However, an adverse result may occur in the form of [redlining](https://www.nytimes.com/2021/08/17/realestate/what-is-redlining.html), where individuals typecast a certain community and inevitably associate it with the stigma surrounding a particular race or group of people. Although it draws attention to a particular community, redlining does so in a negative light, highlighting particular neighborhoods for all the wrong reasons, especially in deeming them **"risky investments"**. Drawing attention to these lower-income neighborhoods has to be done at a grassroots level so that they avoid being redlined or shunned by the rest of the community. It is important that the findings from this and other similar studies are used in the right way.

I believe that my findings surrounding lower-income neighborhoods can raise awareness of the discrepancies between the amount of police and crime protection that low-income neighborhoods receive in comparison to higher-income neighborhoods. I hope that all neighborhoods become equal no matter their status and are safe and protected from crimes within their neighborhoods. Wanting to be safe in your neighborhood shouldn't be something that has to be debated. These are humans who want a better lifestyle without being afraid to live in their own homes. With the results gathered from this project, I hope to provide useful insight for those representing lower income neighborhoods and an incentive to bring about change, through implementing new legislation and bills to ensure that those with low incomes feel safe and are protected from crimes within their neighborhoods. Change must happen, and it must happen **immediately**.