# **Analysing Crime Data of the City of Los Angeles**

##### **Group Members:** Linchuan Yang(linchuan@ucsb.edu), Ruchika Saswade(ruchika_saswade@ucsb.edu)

##### **Member Contributions:**  
**Linchuan Yang:** Group Leader, Report Writing, Acquiring Data, Data Tidying, Initial Visualization, PCA Analysis, Loess Plot Analysis  

**Ruchika Saswade:** Group Member, Report Checking, Data Visualizing, Plot Analyzing

<hr>

## **Abstract**

Crime occurs everywhere, and a metropolis such as the City of Los Angeles sees a large volume of it. This report analyzes Los Angeles crime records from 2010 to 2021 to give readers a better understanding of crime in the city. The final part of the report also uses analytical techniques to show the possible impact of the COVID-19 pandemic on crime victims.

<hr>

## **Introduction**

##### **Background:**

Our project analyzes and explores the crime reports of the City of Los Angeles from 2010 to 2021. The topic is closely related to criminology and sociology. The data were gathered from the official database of the City of Los Angeles, which contains every crime reported to law enforcement within the city. The raw dataset includes the exact time and location of each crime, the area and district where it occurred, the crime code and description, and victim information.  

The motivation for this project is simple: Los Angeles is only a two-hour drive from Santa Barbara, so it is familiar to us. The number of observations for Los Angeles is also large, which is good for data exploration. Moreover, we chose the data from 2010 to 2021 because we were interested in how COVID-19, which began spreading in 2020, may have affected the crimes reported in Los Angeles, and in getting some sense of how crime in the city may develop in the future. 

<div align="center"><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Los_Angeles_with_Mount_Baldy.jpg/2880px-Los_Angeles_with_Mount_Baldy.jpg' title='Picture of Downtown Los Angeles' height=200 width=500></div>
<div align="center">Picture of Downtown Los Angeles</div> 

##### **Data Sources:**  
Crime Data from 2010 to 2019: https://data.lacity.org/Public-Safety/Crime-Data-from-2010-to-2019/63jg-8b9z  
Crime Data from 2020 to Present: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8

##### **Approach:**

To analyze the data, we focus on time, area, crime information, and victim information to answer the specific questions listed below, which we chose out of our own interest. *Each group member answers their own questions.*  

Several data analysis techniques are used in this project to answer the questions, including:  

**Acquiring Data**, **Data Tidying**, **Visualizing**, **Plot Analyzing**, **PCA Analysis**, **Loess Plot Analysis**

##### **Questions:**

**Visualizing** and **Plot Analyzing**: 
1. Which month has the highest concentration of crimes across the years? &ensp;&ensp; *(**Ruchika Saswade**)*
2. Which age group and sex seem to be the least victimized from 2010 to 2021? &ensp;&ensp; *(**Ruchika Saswade**)*
3. Which areas of Los Angeles can be considered *dangerous*? &ensp;&ensp; *(**Linchuan Yang**)*

**PCA Analysis**:  

&ensp;&ensp;
4. Group the crimes into broader major categories and apply them to the data. Is there any area that differs from the others? &ensp;&ensp; *(**Linchuan Yang**)*  
  
**Loess Plot Analysis**:  

&ensp;&ensp;
5. How has the victimization of specific descent groups trended over time? &ensp;&ensp; *(**Linchuan Yang**)*  

<hr>

## **Raw Data**

In [2]:
# import packages
import pandas as pd
import numpy as np

The table below shows the raw data, which include all the crime records reported within the City of Los Angeles from 2010 to 2022. The datasets were obtained from the database of the City of Los Angeles. They are recorded under the Summary Reporting System (SRS) of the Uniform Crime Reporting (UCR) Program of the Criminal Justice Information Services (CJIS) Division, under the U.S. Department of Justice, and the data are collected from local law enforcement jurisdictions. 

In [3]:
# show first five rows
pd.read_csv('data/raw/Crime_Data_from_2010_to_2019.csv').head()

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,02/20/2010 12:00:00 AM,02/20/2010 12:00:00 AM,1350,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,...,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,09/13/2010 12:00:00 AM,09/12/2010 12:00:00 AM,45,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962
2,70309629,08/09/2010 12:00:00 AM,08/09/2010 12:00:00 AM,1515,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,...,IC,Invest Cont,946.0,,,,1300 E 21ST ST,,34.0224,-118.2524
3,90631215,01/05/2010 12:00:00 AM,01/05/2010 12:00:00 AM,150,6,Hollywood,646,2,900,VIOLATION OF COURT ORDER,...,IC,Invest Cont,900.0,998.0,,,CAHUENGA BL,HOLLYWOOD BL,34.1016,-118.3295
4,100100501,01/03/2010 12:00:00 AM,01/02/2010 12:00:00 AM,2100,1,Central,176,1,122,"RAPE, ATTEMPTED",...,IC,Invest Cont,122.0,,,,8TH ST,SAN PEDRO ST,34.0387,-118.2488


<hr>

## **Data Tidying**

The table below shows the tidied data. The tidied data include only the variables needed for the analysis, such as Year, Month, Crime description, and Victim information.

In [4]:
# show first five rows
pd.read_csv('data/Crime_Data_from_2010_to_Present_Clean.csv').head()

Unnamed: 0 | Month | Year | Area | Crime | Victim Age | Victim Sex | Victim Descent
---|---|---|---|---|---|---|---
0 | 2 | 2010 | Newton | VIOLATION OF COURT ORDER | 48 | Male | Hispanic
1 | 9 | 2010 | Pacific | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | Unknown | Male | White
2 | 8 | 2010 | Newton | OTHER MISCELLANEOUS CRIME | Unknown | Male | Hispanic
3 | 1 | 2010 | Hollywood | VIOLATION OF COURT ORDER | 47 | Female | White
4 | 1 | 2010 | Central | RAPE, ATTEMPTED | 47 | Female | Hispanic


The table below shows the variable descriptions of the tidied data.  

Name | Variable description | Type | Units of measurement
---|---|---|---
Month | month of crime record | Numeric | Calendar month 
Year | year of crime record | Numeric | Calendar year
Area | specific area where crime occurred | Text | --- 
Crime | crime description | Text | --- 
Victim Age | the age of victim | Numeric | years old 
Victim Sex | the sex of victim | Text | --- 
Victim Descent | the descent of victim | Text | --- 
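
As a rough illustration of how the raw columns could be reduced to the tidied ones, the sketch below rebuilds two tidied rows from toy input. The raw victim column names (`Vict Age`, `Vict Sex`, `Vict Descent`), the single-letter codes, the lookup tables, and the rule that a non-positive age means "Unknown" are all assumptions made for this example; they are not taken from the report's actual tidying code.

```python
import pandas as pd

# Toy rows shaped like the raw export. The victim column names and the
# single-letter codes are assumptions, not confirmed by the report.
raw = pd.DataFrame({
    "DATE OCC": ["02/20/2010 12:00:00 AM", "08/09/2010 12:00:00 AM"],
    "AREA NAME": ["Newton", "Newton"],
    "Crm Cd Desc": ["VIOLATION OF COURT ORDER", "OTHER MISCELLANEOUS CRIME"],
    "Vict Age": [48, 0],
    "Vict Sex": ["M", "M"],
    "Vict Descent": ["H", "H"],
})

# Hypothetical lookup tables; extend them to cover every code in the data.
SEX_LABELS = {"M": "Male", "F": "Female"}
DESCENT_LABELS = {"H": "Hispanic", "W": "White", "B": "Black", "A": "Asian"}

# Parse the occurrence timestamp so month and year can be extracted.
occurred = pd.to_datetime(raw["DATE OCC"], format="%m/%d/%Y %I:%M:%S %p")

tidy = pd.DataFrame({
    "Month": occurred.dt.month,
    "Year": occurred.dt.year,
    "Area": raw["AREA NAME"],
    "Crime": raw["Crm Cd Desc"],
    # Assumption: a non-positive age means the age was not recorded.
    "Victim Age": [str(a) if a > 0 else "Unknown" for a in raw["Vict Age"]],
    "Victim Sex": raw["Vict Sex"].map(SEX_LABELS).fillna("Unknown"),
    "Victim Descent": raw["Vict Descent"].map(DESCENT_LABELS).fillna("Unknown"),
})
print(tidy)
```

Codes missing from the lookup tables fall back to "Unknown", so the mapping never produces empty cells.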

<hr>

## **Data Visualization**

### **Initial Visualization**

This section shows basic informative plots of the crime data of the City of Los Angeles. 

#### Bar Plot of Crime Total

The bar plot below shows the number of crimes in each month of each year and how they are distributed. A minor seasonal trend is observed.  

<img src='data/plot/crime_total_bar.png'>

#### Pie Chart of Victim Descent by Year

The pie chart below shows the number of victims in each year by descent. An increasing trend in the proportion of Asian victims is observed. 

<img src='data/plot/victim_descent_pie.png'>

### **(Question 1) Which month has the highest concentration of crimes across the years? &ensp;&ensp; *(**Ruchika Saswade**)***

##### **Plot:**

<div align="center"><img src='data/plot/crime_count_year.png' height=300></div>
<div align="center">Scatter and Bar Plot of Crime Count by Month</div> 

##### **Result:**

The charts show that July and August are the two months with the highest crime counts. The bar plot likewise shows that the highest counts fall in July, August, and October. If we had to name a single month with the highest crime count, it would be July.
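
A minimal sketch of the counting behind this kind of plot, assuming a table with the tidied `Year` and `Month` columns. The rows here are made up for illustration and do not reproduce the real counts.

```python
import pandas as pd

# Toy stand-in for the tidied table (made-up rows, not the real data).
crimes = pd.DataFrame({
    "Year":  [2010, 2010, 2010, 2011, 2011, 2011, 2011],
    "Month": [7, 7, 8, 7, 8, 10, 1],
})

# Number of records per calendar month, pooled over all years.
monthly_counts = crimes.groupby("Month").size()

# Month with the largest pooled count.
busiest_month = monthly_counts.idxmax()
print(monthly_counts)
print("Busiest month:", busiest_month)
```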

### **(Question 2) Which age group and sex seem to be the least victimized from 2010 to 2021? &ensp;&ensp; *(**Ruchika Saswade**)***

##### **Plot:**

<div align="center"><img src='data/plot/age_sex_crime_count.png' height=300></div>
<div align="center">Histogram of Victim Age and Sex Count</div> 

##### **Result:**

Victimization peaks between ages 20 and 30, at around 24 to 25. The least victimized ages are the senior ages (60-100) and the ages 19 and under (0-19). By sex, males are the most victimized, with a count of almost 1,150,000, and other/unknown is the least, as the bar plot shows.

**Notice** that this does not mean these groups are less victimized overall, because the group populations differ in size. Proper weights need to be applied in order to reflect reality.
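
One common way to apply such a weight is a per-capita rate. The sketch below uses made-up victim counts and population sizes (they are placeholders, not Los Angeles figures) to show how a group with fewer victims can still have the higher rate.

```python
# Made-up counts and population sizes, for illustration only.
victim_counts = {"20-29": 300, "60+": 100}
population = {"20-29": 150_000, "60+": 25_000}

# Victims per 100,000 residents in each group.
rate_per_100k = {
    group: victim_counts[group] * 100_000 / population[group]
    for group in victim_counts
}
print(rate_per_100k)
```

In this toy case the older group has a third of the victims but twice the rate, which is why raw counts alone cannot rank groups by risk.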

### **(Question 3) Which areas of Los Angeles can be considered *dangerous*? &ensp;&ensp; *(**Linchuan Yang**)***

##### **Description:**

The acreage data were collected by going through the information for each district on the official LAPD website. A data frame of area and acreage was then created and merged with the crime data grouped by `Year` and `Area`. After that, the index was calculated as `Crime` divided by `Acreage` and stored in the merged data frame under the variable name `Dangerous Index`. Finally, a bar plot of `Dangerous Index` by `Area` was drawn.   
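
The merge-and-divide step described above can be sketched as follows. The two crime rows and the acreage value are taken from the 77th Street rows of this section's table; the variable names are illustrative, and the real workflow would start from the full grouped crime data.

```python
import pandas as pd

# Yearly crime counts per area (two 77th Street rows from this section's table).
crime = pd.DataFrame({
    "Year": [2010, 2011],
    "Area": ["77th Street", "77th Street"],
    "Crime": [14441, 14257],
})

# Acreage per area, as collected by hand in the report.
acreage = pd.DataFrame({"Area": ["77th Street"], "Acreage": [11.9]})

# Attach the acreage to every (Year, Area) row, then divide.
merged = crime.merge(acreage, on="Area", how="left")
merged["Dangerous Index"] = merged["Crime"] / merged["Acreage"]
print(merged)
```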

##### **Create Plot Dataframe:**

In [111]:
# display first five rows
pd.read_csv('data/table/area_acreage_crime_data.csv').head()

Unnamed: 0 | Year | Area | Crime | Acreage | Dangerous Index
---|---|---|---|---|---
0 | 2010 | 77th Street | 14441 | 11.9 | 1213.529412
1 | 2011 | 77th Street | 14257 | 11.9 | 1198.067227
2 | 2012 | 77th Street | 14300 | 11.9 | 1201.680672
3 | 2013 | 77th Street | 13746 | 11.9 | 1155.12605
4 | 2014 | 77th Street | 14065 | 11.9 | 1181.932773


##### **Result:**

<div align="center"><img src='data/plot/dangerous_index_bar.png' title='Bar Plot of Dangerous Index' height=300 width=400></div>
<div align="center">Bar Plot of Dangerous Index</div> 

From the plot above, we observe that the **77th Street** area has the highest dangerous index, which means **77th Street** tends to be the most *'dangerous'* area of the City of Los Angeles.  

<div align="center"><img src='data/plot/77th.png' title='Map of 77th Street Area' height=300 width=400></div>
<div align="center">Map of 77th Street Area</div> 

Also, the **Hollenbeck** and **Foothill** areas have the lowest dangerous index, which means **Hollenbeck** and **Foothill** tend to be the *'safest'* areas of the City of Los Angeles.  

<div align="center"><img src='data/plot/HBK.png' title='Map of Hollenbeck Area' height=400 width=300>&ensp;&ensp;&ensp;&ensp;<img src='data/plot/FTHL.png' title='Map of Foothill Area' height=300 width=400></div>
<div align="center">Map of Hollenbeck Area(left) and Map of Foothill Area(right)</div> 

It is also interesting that, according to the maps, **77th Street** is in an *urban* area while **Hollenbeck** and **Foothill** are in *rural* areas. It makes some sense that *urban* areas tend to have a *higher crime rate* and vice versa.  

Moreover, the plot shows that, within each area, the `Dangerous Index` does not differ significantly from `Year` to `Year`, so comparing areas across all years does not run into an aggregation paradox.  

**Notice** that the `Dangerous Index` is based only on the `Acreage` and crime count of each area, which means it might not fully reflect reality.  

**Data Source:** https://www.lapdonline.org/lapd-organization-chart/

<hr>

## **PCA Analysis**

### **(Question 4) Group the crimes into broader major categories and apply them to the data. Is there any area that differs from the others? &ensp;&ensp; *(**Linchuan Yang**)***

##### **Description:**

To find broader categories of crime, we use principal component analysis (PCA). First, we transform the data so that each crime is a column whose value is the count of that crime. Next, we draw a heat map to see whether there are strong correlations among the crime variables. Then we normalize the data matrix and compute the PCs, plot the variance explained, and decide on the number of PCs to keep. After that, we analyze the PCs with a loading plot and try to make sense of them. Finally, we apply the PCs back to the original data and examine the question.  

##### **Data Transform:**

In [5]:
# display first five rows
pd.read_csv('data/table/pca_data.csv').head()

Unnamed: 0,Month,Year,Area,Crime_ABORTION/ILLEGAL,Crime_ARSON,Crime_ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER,"Crime_ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",Crime_ATTEMPTED ROBBERY,Crime_BATTERY - SIMPLE ASSAULT,Crime_BATTERY ON A FIREFIGHTER,...,Crime_UNAUTHORIZED COMPUTER ACCESS,"Crime_VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)",Crime_VANDALISM - MISDEAMEANOR ($399 OR UNDER),Crime_VEHICLE - ATTEMPT STOLEN,"Crime_VEHICLE - MOTORIZED SCOOTERS, BICYCLES, AND WHEELCHAIRS",Crime_VEHICLE - STOLEN,Crime_VIOLATION OF COURT ORDER,Crime_VIOLATION OF RESTRAINING ORDER,Crime_VIOLATION OF TEMPORARY RESTRAINING ORDER,Crime_WEAPONS POSSESSION/BOMBING
0,1,2010,77th Street,0,4,0,88,16,160,0,...,0,46,58,4,0,92,15,10,0,0
1,1,2010,Central,0,1,0,29,5,109,0,...,0,14,15,0,0,17,3,4,0,0
2,1,2010,Devonshire,0,1,0,15,2,65,0,...,0,42,54,2,0,71,12,5,0,0
3,1,2010,Foothill,0,1,0,32,1,56,0,...,0,43,41,1,0,75,1,20,0,0
4,1,2010,Harbor,0,3,0,14,8,48,0,...,0,40,42,1,0,84,31,2,0,0
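
One way to arrive at a table of this shape is sketched below, assuming the tidied columns `Month`, `Year`, `Area`, and `Crime`. The `Crime_` prefix matches the column names above, but whether the report used `pd.get_dummies` is an assumption, and the rows are toy data.

```python
import pandas as pd

# Toy tidied records (made-up rows).
tidy = pd.DataFrame({
    "Month": [1, 1, 1, 1],
    "Year": [2010, 2010, 2010, 2010],
    "Area": ["Central", "Central", "Central", "Harbor"],
    "Crime": ["ARSON", "VEHICLE - STOLEN", "VEHICLE - STOLEN", "ARSON"],
})

# One indicator column per crime description, named "Crime_<description>".
indicators = pd.get_dummies(tidy, columns=["Crime"], prefix="Crime", dtype=int)

# Sum the indicators within each (Month, Year, Area) cell to get counts.
wide = indicators.groupby(["Month", "Year", "Area"], as_index=False).sum()
print(wide)
```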


##### **Correlation Heat Map:**

<div align="center"><img src='data/plot/corr_heat_map.png' title='Correlation Heat Map' height=300></div>
<div align="center">Correlation Heat Map</div> 

From the heat map above, some strong correlations between crimes are observed.

##### **Select PCs:**

The table and the plot below show the proportion of variance and the cumulative variance explained by the principal components (PCs).  

In [8]:
# read table
pd.read_csv('data/table/pca_var_explained.csv').head(6)

Unnamed: 0 | Proportion of variance explained | Component | Cumulative variance explained
---|---|---|---
0 | 0.071215 | 1 | 0.071215
1 | 0.046773 | 2 | 0.117988
2 | 0.037056 | 3 | 0.155044
3 | 0.027385 | 4 | 0.182429
4 | 0.01889 | 5 | 0.201319
5 | 0.013008 | 6 | 0.214327


<div align="center"><img src='data/plot/pca_var_explained_plot.png' title='PCA Variance Explained Plot' height=300></div>
<div align="center">PCA Variance Explained Plot</div> 

From the plot, the first 6 PCs are reasonable candidates for analysis, with a cumulative 0.214327 of total variance explained.  
However, 6 PCs are too complicated to analyze, so we use the first 4 PCs, with a cumulative 0.182429 of total variance explained.  
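
For readers who want to see where the proportion and cumulative columns come from, here is a NumPy-only sketch on random toy counts. The report itself uses PCA from the sklearn package; the SVD route below is a stand-in chosen to keep the example self-contained, and the matrix size and the value of `k` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix: 60 area-months (rows) by 5 crime types (columns).
counts = rng.poisson(lam=[20, 15, 10, 5, 3], size=(60, 5)).astype(float)

# Standardize each column to mean 0 and unit variance.
z = (counts - counts.mean(axis=0)) / counts.std(axis=0)

# SVD of the standardized matrix; the rows of vt are the PC loadings.
u, s, vt = np.linalg.svd(z, full_matrices=False)

# Squared singular values are proportional to the variance of each PC.
var_ratio = s**2 / np.sum(s**2)
cum_ratio = np.cumsum(var_ratio)

k = 2  # number of components kept in this toy example (the report keeps 4)
scores = z @ vt[:k].T  # projection of every row onto the first k PCs

print(np.round(var_ratio, 4))
print(np.round(cum_ratio, 4))
```

The cumulative column is simply the running sum of the proportion column, which is how 0.071215 + 0.046773 + 0.037056 + 0.027385 gives the 0.182429 quoted for four PCs.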

<div align="center"><img src='data/plot/pca_loading_plot.png' title='PCA Loading Plot' height=300></div>
<div align="center">PCA Loading Plot</div> 

From the loading plot, PC1 most likely represents firearm-related crime, PC2 theft and trespassing, PC3 assault and grand theft auto (GTA), and PC4 crimes other than theft and burglary. 

##### **PCA Transform:**

The table below shows the projected data after the PCA transform.  

In [9]:
# read table
pd.read_csv('data/table/pca_projected_data.csv').head()

Unnamed: 0 | PC1 | PC2 | PC3 | PC4 | Month | Year | Area
---|---|---|---|---|---|---|---
0 | 8.63738 | -6.726617 | 6.533671 | -3.985459 | 1 | 2010 | 77th Street
1 | -1.84298 | -1.912674 | 1.033121 | 5.089348 | 1 | 2010 | Central
2 | -2.9539 | -2.030991 | 2.143411 | -4.39137 | 1 | 2010 | Devonshire
3 | 0.324112 | -4.114268 | 1.741791 | -3.037902 | 1 | 2010 | Foothill
4 | 0.851694 | -5.157373 | 1.653398 | -1.814922 | 1 | 2010 | Harbor


##### **Outliers:**

The plot and table below show the outliers of PC2 (theft and trespassing) and PC3 (assault and GTA).  

<div align="center"><img src='data/plot/pca_outlier_plot.png' title='Outlier Scatter Plot of PC2 and PC3' height=300></div>
<div align="center">Outlier Scatter Plot of PC2 and PC3</div> 

In [10]:
pd.read_csv('data/table/pc2_pc3_outliers.csv')

Unnamed: 0 | PC1 | PC2 | PC3 | PC4 | Month | Year | Area
---|---|---|---|---|---|---|---
0 | 1.769598 | 7.507764 | 7.523993 | 5.449909 | 1 | 2019 | Central
1 | 0.617683 | 7.582819 | 6.898608 | 6.013018 | 4 | 2018 | Central
2 | 2.322512 | 6.917347 | 7.569194 | 5.929423 | 5 | 2019 | Central
3 | 2.629848 | 7.76364 | 7.114979 | 4.59556 | 6 | 2019 | Central
4 | 2.997893 | 7.276762 | 7.067218 | 4.470929 | 7 | 2019 | Central
5 | 3.572977 | 8.699017 | 6.27839 | 4.07751 | 8 | 2019 | Central
6 | 4.56835 | 9.548046 | 6.421643 | 5.387357 | 9 | 2019 | Central
7 | 2.886755 | 10.791482 | 4.462664 | 4.401455 | 11 | 2021 | Central


##### **Result:**

From the PCA Analysis, the broader major categories of crime are a firearm-related category, a theft and trespassing category, an assault and GTA category, and a category for crimes other than theft and burglary.  

Since no pair of PCs forms obvious outliers, we look at the places and times with high scores on the assault and GTA category and on the theft and trespassing category, and treat them as outliers. From the outliers listed above, the `Central` area, mostly in 2019, differs from the other areas: it tends to have more assault, GTA, theft, and trespassing than other areas at other times.
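
Selecting rows that score high on both PC2 and PC3 can be sketched as a simple filter. The four rows below are copied from the projected-data and outlier tables; the cut-off values `PC2_MIN` and `PC3_MIN` are hypothetical, since the report does not state the thresholds it used.

```python
import pandas as pd

# A few rows copied from the projected-data and outlier tables above.
projected = pd.DataFrame({
    "PC2": [-6.726617, -1.912674, 7.507764, 10.791482],
    "PC3": [6.533671, 1.033121, 7.523993, 4.462664],
    "Month": [1, 1, 1, 11],
    "Year": [2010, 2010, 2019, 2021],
    "Area": ["77th Street", "Central", "Central", "Central"],
})

# Hypothetical cut-offs; the report does not state the ones it used.
PC2_MIN, PC3_MIN = 6.0, 4.0

# Keep only the area-months that are high on both components.
outliers = projected[(projected["PC2"] > PC2_MIN) & (projected["PC3"] > PC3_MIN)]
print(outliers[["Month", "Year", "Area"]])
```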

<hr>

## **Loess Plot Analysis**

### **(Question 5) How has the victimization of specific descent groups trended over time? &ensp;&ensp; *(**Linchuan Yang**)***

##### **Description:**

From *Pie Chart of Victim Descent by Year* in *Initial Visualization*, we can see a trend in which some descent groups became more or less victimized. To analyze the trend, we go deeper, to the monthly level, and perform a Loess Plot Analysis.   

<div align="center"><img src='data/plot/victim_descent_pie.png' title='Pie Chart of Victim Descent by Year' height=150></div>
<div align="center">Pie Chart of Victim Descent by Year</div> 

##### **Table of Victim Descent by Month:**

In [11]:
# read table 
pd.read_csv('data/table/victim_descent_data.csv').head()

Unnamed: 0 | Year | Month | Victim Descent | Proportion
---|---|---|---|---
0 | 2010 | 1 | American Indian | 0.0006
1 | 2010 | 1 | Asian | 0.0339
2 | 2010 | 1 | Black | 0.1951
3 | 2010 | 1 | Hispanic | 0.4569
4 | 2010 | 1 | Pacific Islander | 0.0004
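
A table like this can be derived from the tidied records by counting victims per month and descent and dividing by the monthly total. The sketch below does so on made-up rows; rounding to four decimals mirrors the table above but is otherwise an assumption.

```python
import pandas as pd

# Toy tidied records (made-up rows).
tidy = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010, 2010, 2010],
    "Month": [1, 1, 1, 1, 2, 2],
    "Victim Descent": ["Hispanic", "Hispanic", "Black", "Asian", "Hispanic", "White"],
})

# Victim count for every (Year, Month, Victim Descent) combination.
counts = tidy.groupby(["Year", "Month", "Victim Descent"]).size()

# Total victims in the same month, aligned to each row of counts.
totals = counts.groupby(level=["Year", "Month"]).transform("sum")

# Share of each descent within its month.
proportion = (counts / totals).round(4).rename("Proportion").reset_index()
print(proportion)
```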


##### **Scatter Plot:**

<div align="center"><img src='data/plot/victim_scatter.png' title='Victim Descent Proportion Scatter Plot' height=300 width=400></div>
<div align="center">Victim Descent Proportion Scatter Plot</div> 

From the plot above, we observe that the proportion of Asian victims was increasing from 2020 to 2021 while the proportions of the other descents were dropping. So let's take a closer look at 2018 to 2021.   

<div align="center"><img src='data/plot/victim_scatter_2018_to_2021.png' title='Victim Descent Proportion Scatter Plot from 2018 to 2021' height=300></div>
<div align="center">Victim Descent Proportion Scatter Plot from 2018 to 2021</div> 

From the plot above, we observe that the proportions of victim descents fluctuate more after 2019 than before. To get a better understanding of the trend, a loess plot or multiple linear regression is required. However, since the pattern is seasonal rather than linear, we cannot use multiple linear regression. Therefore, a loess plot is applied to this question.    
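
To make the idea of loess concrete, here is a small NumPy sketch of local linear regression with tricube weights. It is not the routine used to draw the plot in this report (that is not stated); the function name `loess`, the `frac` parameter, and the toy seasonal series are all illustrative assumptions.

```python
import numpy as np

def loess(x, y, frac=0.5):
    """Local linear smoother with tricube weights, evaluated at every x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))  # points used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # bandwidth: distance to the k-th nearest point
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3  # tricube weights
        # Weighted least squares for a line centred on x[i].
        design = np.column_stack([np.ones(n), x - x[i]])
        lhs = design.T @ (w[:, None] * design)
        rhs = design.T @ (w * y)
        intercept, _slope = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
        fitted[i] = intercept  # value of the local line at x[i]
    return fitted

# Toy monthly series: a slow upward trend plus a 12-month seasonal wave.
months = np.arange(48)
proportion = 0.03 + 0.0002 * months + 0.003 * np.sin(2 * np.pi * months / 12)
smooth = loess(months, proportion, frac=0.5)
print(np.round(smooth[:6], 4))
```

A larger `frac` averages over more months and gives a smoother curve; a smaller one follows the seasonal swings more closely.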

<div align="center"><img src='data/plot/victim_loess_plot.png' title='Victim Descent Proportion Loess Plot from 2018 to 2021' height=300></div>
<div align="center">Victim Descent Proportion Loess Plot from 2018 to 2021</div> 

From the loess plot above, we observe that the proportion of Asian victims increases continuously from June 2020, while the proportions of the other descents bounce seasonally.  

One potential explanation for why people of Asian descent became more likely to be victimized is the "Asian Hate" that followed COVID-19.  
As the BBC News article "Covid 'hate crimes' against Asian Americans on rise" states, from March to May 2020 alone, over 800 Covid-related hate incidents were reported from 34 counties in the state, according to a report released by the Asian Pacific Policy and Planning Council. 

<div align="center"><img src='https://ichef.bbci.co.uk/news/976/cpsprodpb/10537/production/_117317866_gettyimages-1231313555.jpg' title='Picture of #Stop Asian Hate Protest' height=300 width=500></div>
<div align="center">Picture of #Stop Asian Hate Protest</div> 

**Source:** https://www.bbc.com/news/world-us-canada-56218684

Also, **notice** that the analysis is based on the proportion of each descent among victims. The proportions depend on one another, which means an increase in one proportion may come not from an actual increase in that group's count but from a decrease in the other groups' counts, and vice versa. As a result, the findings may not fully reflect reality.  
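
The caveat above can be seen in a two-group toy calculation. The counts are made up: the Asian count stays the same while the count of all other victims falls, and the Asian share doubles anyway.

```python
# Made-up counts: the Asian count is constant while the others fall.
before = {"Asian": 10, "Other": 90}
after = {"Asian": 10, "Other": 40}

def share(counts, group):
    """Proportion of one group among all victims."""
    return counts[group] / sum(counts.values())

print(share(before, "Asian"))  # 0.1
print(share(after, "Asian"))   # 0.2
```

Checking the raw monthly counts alongside the proportions is one way to tell these two situations apart.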

<hr>

## **Discussion**

From all the analyses above, we can draw conclusions from each part.  

From the visualization part, we observe that January, July, August, and October are the months with the highest concentration of crimes. People aged 20 to 30 are the most victimized age group, and there are more male victims than female victims. Finally, from a visual analysis of the dangerous index, we found that the 77th Street area is considered the most dangerous area, while the Hollenbeck and Foothill areas are considered the safest areas in the City of Los Angeles.  

From the PCA analysis part, using PCA from the sklearn package, we found four broader major categories of crime: a firearm-related category, a theft and trespassing category, an assault and GTA category, and a category for crimes other than theft and burglary. The analysis also pointed out that the Central area in 2019 differs from the other areas and tends to have more assault, GTA, theft, and trespassing than other areas of the City of Los Angeles at other times.  

From the Loess plot analysis part, by drawing the loess plot of victim proportion by descent, we found a potential increase in the proportion of victims of Asian descent. Set against the news about COVID-19-related Asian Hate, this finding is consistent with reality. 

Looking ahead, a pronounced fluctuating pattern appeared in the victim descent scatter plot after COVID-19 reached the City of Los Angeles. Will the points keep this pattern in the post-COVID period, or will the pattern return to normal? This question will be thought-provoking for future analysis of crime data.