Predicting which areas in L.A are more prone to severe crime using decision trees and logistic regression models in R.

rachelsohzc/L.A-Crime-Analysis

Investigating the Nature of Severe Crimes in L.A

Overview and motivation

Predictive policing is becoming an important part of fair policing today. We wanted to find out whether we could predict which areas would experience severe crime from factors such as the victim's age, sex and descent and the premise type, using decision trees and logistic regression models. The final product is a list of predictors that heavily influence severe crime in Los Angeles. This list could be used to create a heat map showing which areas are more prone to crime. This is an academic project for ST309.
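As a rough illustration of the two model types named above, the sketch below fits a logistic regression with base R's `glm()` and, if the `rpart` package is installed, a classification tree. It is not the project's script: the data are synthetic, and the column names (`age`, `female`, `weapon`, `street`, `severe`) are hypothetical stand-ins for the real dataset's variables.

```r
# Toy illustration only: synthetic data with hypothetical column names.
set.seed(1)
n <- 500
crimes <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  weapon = rbinom(n, 1, 0.3),
  street = rbinom(n, 1, 0.4)
)

# Synthetic outcome: severity depends on weapon and street by construction,
# so the models below have something to find.
lin <- -1 + 1.5 * crimes$weapon + 0.8 * crimes$street
crimes$severe <- rbinom(n, 1, plogis(lin))

# Logistic regression: models the log-odds of a severe crime.
logit_fit <- glm(severe ~ age + female + weapon + street,
                 data = crimes, family = binomial)
print(summary(logit_fit)$coefficients)

# Fitted probabilities of a severe crime for each row.
crimes$p_severe <- predict(logit_fit, type = "response")

# Classification tree (assumes rpart is available; skipped otherwise).
if (requireNamespace("rpart", quietly = TRUE)) {
  tree_fit <- rpart::rpart(factor(severe) ~ age + female + weapon + street,
                           data = crimes, method = "class")
  print(tree_fit$variable.importance)
}
```

Large coefficients in the regression and high variable importance in the tree are the kind of evidence a list of influential predictors would typically be drawn from.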

Description

For this project, we used the 2020 - Present dataset, which can be found here: L.A crime dataset. (The data was last accessed on 9 February 2022; later updates to the data are not incorporated in the analysis.)

How to run

  • Before running, make sure the packages listed on the first line of the code are installed
  • Download the dataset from here, and name the files "Crime_Data_from_2010_to_2019" and "Crime_Data_from_2020_to_Present" so that they match the names used when the script reads the CSV files
  • Run the code as an R script
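The steps above can be sketched roughly as follows. This is a sketch under stated assumptions: the package vector is a placeholder (substitute the packages named on the first line of the project's script), the files are assumed to sit in the working directory with a `.csv` extension, and `read_crime_csv()` is a hypothetical helper, not part of the project.

```r
# Placeholder list: replace with the packages from the first line of the script.
needed <- c("rpart")
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1),
                               quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Install before running: ", paste(missing_pkgs, collapse = ", "))
}

# Hypothetical helper: read one of the downloaded files by its expected name.
# Assumes a ".csv" extension and a directory that defaults to the working dir.
read_crime_csv <- function(name, dir = ".") {
  path <- file.path(dir, paste0(name, ".csv"))
  if (!file.exists(path)) stop("File not found: ", path)
  read.csv(path, stringsAsFactors = FALSE)
}

# File names as given in the instructions above.
file_names <- c("Crime_Data_from_2010_to_2019",
                "Crime_Data_from_2020_to_Present")

# Load whichever of the two files is present, keyed by file name.
present <- file.exists(file.path(".", paste0(file_names, ".csv")))
crime_data <- lapply(file_names[present], read_crime_csv)
names(crime_data) <- file_names[present]
```

If a file is missing or named differently, `read_crime_csv()` stops with the path it looked for, which makes a naming mismatch easy to spot.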

Improvements

  • Categorical data: Categorical data is harder to interpret at times. For instance, when we transformed the premise description column, we only took the top 10 premises and classified the remaining under the ‘OtherPremise’ category. It is possible that doing so affected our analysis.
  • More modelling: In hindsight, using bagging and bootstrapping might have given us more confidence in our results.
  • Analysis limitations: We only used the 2020 - Present dataset, which could have affected our results. A merged dataset might have given us higher accuracy rates.
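The premise transformation described in the first point above (keep the most frequent categories and group the rest under 'OtherPremise') could look roughly like this in base R. The helper name `lump_top_k()` and the example premise values are made up for illustration; the project's actual code may differ.

```r
# Hypothetical helper: keep the k most frequent categories, lump the rest.
# Missing values are not counted by table(), so they also end up in `other`.
lump_top_k <- function(x, k = 10, other = "OtherPremise") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(k, length(counts)))]
  factor(ifelse(x %in% keep, x, other))
}

# Made-up premise descriptions, with k = 2 to keep the example small.
premise <- c("STREET", "STREET", "SIDEWALK", "PARKING LOT",
             "STREET", "SIDEWALK", "GARAGE")
lumped <- lump_top_k(premise, k = 2)
print(table(lumped))

# Share of rows that were lumped: a quick check on how much detail is lost.
print(mean(lumped == "OtherPremise"))
```

Checking the lumped share for a few values of `k` is one way to gauge whether the cut-off at 10 is likely to have affected the analysis.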

Conclusion

The results of our analysis showed that the factors Weapon, Sidewalk, Street, Female and Age have a strong link to severe crimes. However, it is worth noting that these models were built from training data. Previous predictive policing programs such as PredPol have failed because the historical crime records they relied on contained racial biases. Models like ours may further magnify these biases and lead to inaccurate predictions. A more accurate analysis would require a dataset that is free of such bias.

Team members

References

  1. Decision and classification trees
  2. Predictive policing - part 1
  3. Predictive policing - part 2
  4. Crime forecasting
  5. Crime trends in California
  6. ROC Curves
  7. LAPD Patrol Area Maps
  8. Introduction to Random Forests
  9. Random Forests Classifiers
  10. Logistic regression
