# Abstract

The year 2020 was an unusual year for much of society. It marked the spread of the first global pandemic in decades and was accompanied by higher rates of crime in many large cities. The US government's open data site has since published a dataset that includes over 800k reported incidents of crime in LA. We seek to better understand some of the patterns in crime from this dataset. We pursue two main goals while also discussing the ethical implications of our work. First, because resources in law enforcement are nearly always limited, we compare the locations of current police stations with potential locations optimized for where crime occurs. Second, we examine how well machine learning models can predict the type of crime from attributes of the crime (such as the weapon used, where the crime took place, and whether the crime was indoors or outdoors). Both of these goals have practical and ethical implications that we address later in the paper.

# Introduction

First and foremost, exploring crime statistics with machine learning can be very cumbersome and difficult, with vast potential for ethical implications. We seek to comment on some of the main ethics-related topics at the end of our paper, but want to clarify that we do not seek to profile or enact change based on our findings. Rather, we seek to learn more about crime in LA and recommend any findings be studied at a much greater depth with caution. 

Our first goal is to understand potential locations for police stations. This finding could help LA county identify parts of the city that could use a police station but do not have one. Note that our dataset only contains one year of data, so many more years and much more consideration would need to go into this for actual application. Our process could aid the county in choosing potential locations for future stations as it grows and expands, without being the sole reason behind those decisions. Having police stations well distributed throughout the city could improve first responder time and overall safety.

Our second goal is to see how well machine learning models can predict what type of crime occurred based on attributes of the crime. When first responders are dispatched to a crime scene, different resources are sent depending on the situation. Being able to predict what kind of crime occurred from only limited information could make resource allocation more efficient. We again acknowledge the many ethical implications that arise in this project and will address them in detail at the end of the paper. To analyze these questions, we first clean the data. We then use k-means clustering for the station locations, and random forests, gradient boosted trees and logistic regression for crime classification.

Others have previously attempted to walk the delicate line of applying predictive modeling to crime analysis. One example uses a model trained with computer vision to recognize what is considered a crime, and then uses different classifiers, such as KNN, random forests and deep neural networks, to classify each crime<sup>[7]</sup>. The goal was to include much more complex data than just location and classifications about evidence found at a crime, to help predict and prevent these types of scenarios. Other research on classification is similar to ours: different types of boosted regression were used to classify crime based on features obtained through intricate data mining<sup>[1]</sup>. Much of the research that currently exists on predictive modeling, such as the previous two examples, focuses on a few specific crimes, such as murder and robbery. We attempt to use a larger dataset where we no longer have a binary classification or only 3 or 4 crimes, but rather up to 16 different categorized crimes, and we also use k-means clustering to identify possible police station locations. As we do this, we recognize the ethical implications of this work and address them in this paper.

# Data Explanation and Cleaning

To clean the data, we needed to import several packages. These were all imported in the usual way (numpy as np, pandas as pd, etc.).

To begin, we had a dataset from data.gov that contained criminal reports from LA county from 2020<sup>[2]</sup>. This dataset was rather large, with over 820k data points. The data has 28 columns, including information about the time and location of the crime, a crime code (an identifier of what the culprit was charged with), information about the victim, the weapon used and more. Each of the 28 columns represents a different attribute of a reported crime. To avoid ethical issues and reduce bias in our models, we first removed the columns about the people involved that could be unethical to use in classification predictions (race, gender and other similar columns).

Following this, because of the size of our data, we were able to drop rows with missing entries in critical columns like the crime code (an identifier for the type of crime that was committed) that would help with classification. We took this approach because much of the data is nominal (like the description of the crime or weapon), making it difficult to replace missing values with the mean or other comparable approaches. In addition to entries with missing data, there were several crime types with only a few instances. With little data about such rare offenses, we dropped any report involving a crime that happened fewer than 1,000 times. After cleaning the data in this manner, we still had over 50k rows.
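The two row filters described above can be sketched with pandas. This is a minimal illustration rather than our exact cleaning code: the helper name `drop_incomplete_and_rare`, the `AREA` column, and the toy values are assumptions made for the example, and the threshold is lowered so the effect is visible on six rows.

```python
import pandas as pd

def drop_incomplete_and_rare(df, code_col="Crm Cd", min_count=1000):
    """Drop rows missing the crime code, then drop codes seen fewer than min_count times."""
    # remove reports with no crime code, since the code is the classification label
    cleaned = df.dropna(subset=[code_col])
    # count how often each remaining crime code occurs
    counts = cleaned[code_col].value_counts()
    frequent_codes = counts[counts >= min_count].index
    # keep only reports whose crime code is frequent enough
    return cleaned[cleaned[code_col].isin(frequent_codes)].reset_index(drop=True)

# toy frame with made-up values (hypothetical, not taken from the dataset)
toy = pd.DataFrame({"Crm Cd": [510, 510, 510, 624, None, 330],
                    "AREA": [1, 2, 1, 3, 2, 1]})
toy_clean = drop_incomplete_and_rare(toy, min_count=2)
print(toy_clean)
```

With `min_count=1000` the same function would apply the threshold used in this paper.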

Additional data cleaning methods were used and are further described depending on the needs of individual algorithms.

The only other data we use is from lacity.org for the actual locations of its police stations<sup>[3] </sup>. 

# Potential Police Station Locations

One question that we wanted to answer is where police stations in LA could potentially be located based on the frequency of crimes. As previously mentioned, this could help reduce the response time for first responders, improve safety and medical attention to victims, and increase the odds of finding the culprit. Note that we understand there is much more nuance to identifying these locations, and we do not believe this model to be a standalone solution.

We wrote a kmeans class to help us identify $k$ potential locations for a police station, optimized by the distances between each crime and a potential police station. The class is initialized with the number of clusters $k$, a maximum number of iterations, a norm, a tolerance and a normalization boolean.

The class also consists of four main functions that are used to run the kmeans algorithm: fit, predict, fit_predict and plot. The fit function computes the cluster centers from random initial conditions.

In [None]:
def fit(self, X, y=None):
    """Compute the cluster centers from random initial conditions.

    Parameters:
        X ((n_samples, n_features) ndarray): the data to be clustered.
    """
    # pick k distinct data points as the initial centers, then normalize each center if specified
    self.centers = X[np.random.choice(X.shape[0], self.n_clusters, replace=False)]
    if self.normalize:
        self.centers = self.centers / np.linalg.norm(self.centers, axis=1, keepdims=True)
    for i in range(self.max_iter):
        # label each point with the index of its nearest center under the p-norm
        label = np.argmin(np.linalg.norm(X[:, np.newaxis] - self.centers, ord=self.p, axis=2), axis=1)
        # move each center to the mean of the points assigned to it
        new_c = np.array([X[label == z].mean(axis=0) for z in range(self.n_clusters)])
        # stop once the centers change by less than the tolerance
        if np.linalg.norm(new_c - self.centers, ord=self.p) < self.tol:
            break
        self.centers = new_c
        if self.normalize:
            self.centers = self.centers / np.linalg.norm(self.centers, axis=1, keepdims=True)
    return self  # return our object

This block is a little dense, so we expound upon the code comments. Essentially, we choose random center locations for each of our $k$ clusters. Then, we compute the distance (based on the specified norm) from each data point to each of our randomized centers. Each data point is labeled with the cluster whose center is nearest. With each data point labeled, we can now shift each center to the mean of its assigned points to better fit the data. We then iterate through this process until we reach our maximum number of iterations or the centers change by less than a specified tolerance.

For brevity's sake, we do not include the code for predict, fit_predict and plot. Respectively, these functions label each entry of the data with the cluster center it is closest to, fit the model and return the labels, and plot the data and centers.
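As an illustration of the nearest-center labeling that predict performs, here is a minimal standalone sketch. It is not the omitted method itself: the function name `predict_labels` and the toy points are assumptions, and only the distance computation mirrors the fit code above.

```python
import numpy as np

def predict_labels(X, centers, p=2):
    """Return, for each row of X, the index of the nearest center under the p-norm."""
    # broadcasting gives an array of shape (n_samples, n_centers, n_features);
    # taking the norm over the last axis leaves one distance per (sample, center) pair
    distances = np.linalg.norm(X[:, np.newaxis] - centers, ord=p, axis=2)
    return np.argmin(distances, axis=1)

# hypothetical 2D points and centers
points = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 4.0]])
labels = predict_labels(points, centers)
print(labels)  # [0 0 1]
```

Passing `p=1` or `p=np.inf` switches the norm in the same way as the `p` attribute of the class.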

Now that we have our kmeans class, we can start to examine the potential locations for police stations. Our data from LA city shows 17 police stations, so we initialize our class with $k$ equal to 17<sup>[3] </sup>. 

The plot function plots not only the ideal cluster centers for police stations but also the actual locations of the current 17 police stations in LA county, which we read in from the station data. We also use three different norms (1-norm, 2-norm and infinity norm) to get three different possible configurations. The following graphs represent our findings.

![ex 2.13](norms.jpeg)

These graphs correspond to the 1, 2 and $\infty$ norms, respectively.

Overall, each of these graphs has slightly different potential locations for police stations. The black crosses correspond to the potential locations of police stations based on our crime data, and the red crosses correspond to the current actual locations of police stations. We will not analyze the specific differences between the results from each norm, but wanted to provide potential locations based on varying norms. However, there is much that we can learn from these three graphs overall. Each graph has at least three police stations in the north/west part of LA county (the range is from 3 to 5). Also, each graph shows that there would ideally be 1-2 police stations in the southernmost portion of LA county. Interestingly, the red crosses show that nearly all of the current police stations are located centrally in LA county. This suggests that LA could look into placing more police stations in the north/west and south parts of its county. Overall, these kmeans maps help us understand the location of crime and how police stations could be placed to minimize the distance from crimes to police stations. This has many practical applications, as it could decrease response time and increase overall safety.
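One way to put a number on "distance from crime to police stations" is the average distance from each crime to its nearest station. The sketch below is an illustration under assumptions, not part of our analysis: the function name `mean_nearest_distance` and all coordinates are made up, and it treats coordinates as points in a plane, which only approximates distances for latitude and longitude.

```python
import numpy as np

def mean_nearest_distance(crimes, stations, p=2):
    """Average p-norm distance from each crime location to its closest station."""
    # distances[i, j] is the distance from crime i to station j
    distances = np.linalg.norm(crimes[:, np.newaxis] - stations, ord=p, axis=2)
    return distances.min(axis=1).mean()

# hypothetical crime locations in two groups, far apart
crimes = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
# two layouts with the same number of stations
central = np.array([[5.0, 0.0], [5.0, 2.0]])  # both stations in the middle
spread = np.array([[0.0, 1.0], [10.0, 1.0]])  # one station per group

print(mean_nearest_distance(crimes, central))  # 5.0
print(mean_nearest_distance(crimes, spread))   # 1.0
```

Applied to real data, the same function could compare the current 17 stations with the 17 cluster centers under each norm.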

Now that we have examined ideal locations, we will look into classifying crime based on attributes of the crime. 

# Feature Engineering


When deciding which features to include in our models, we first dropped those that had potential ethical issues, as mentioned above. After that, we removed the columns corresponding to secondary crimes, as we decided that those may relate too directly to the primary crime. We also considered it unlikely that one would be aware of secondary crimes but not the primary one. We dropped the status of the case because it is irrelevant to the type of crime that actually occurred. Finally, we removed any columns with duplicate information, such as the crime description and weapon description. Once we dropped these columns, we applied one-hot encoding to a variety of features originally stored as qualitative data, such as area, weapon, and cross street. This allowed us to keep the remaining features in each of our models. With each model, additional measures were taken to adjust the features, such as built-in feature selection and PCA, which we discuss further below.
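For readers unfamiliar with one-hot encoding, the following is a minimal sketch using `pandas.get_dummies`. The column names `area`, `weapon` and `hour` and their values are hypothetical stand-ins for the dataset's columns, not the actual schema.

```python
import pandas as pd

# toy frame with two nominal columns and one numeric column (made-up values)
toy = pd.DataFrame({"area": ["Central", "Hollywood", "Central"],
                    "weapon": ["HAND GUN", "KNIFE", "HAND GUN"],
                    "hour": [13, 22, 7]})

# each category becomes its own 0/1 column; numeric columns are left unchanged
encoded = pd.get_dummies(toy, columns=["area", "weapon"], dtype=int)
print(encoded.columns.tolist())
# ['hour', 'area_Central', 'area_Hollywood', 'weapon_HAND GUN', 'weapon_KNIFE']
```

A column with many distinct values, such as cross street, produces one new column per value, which is why the encoded feature matrix grows so quickly.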

# Visualizations and Model Analysis

## Random Forest Model

### Grid Search of Hyperparameters

With the new dataset, where crimes below a minimum number of instances have been dropped, we now seek to classify these crimes based on a variety of features. Our first model is a random forest, a classifier that performs well on binary or multi-class classification because it combines many individual decision trees. We attempt to identify the specific crime being committed from location, time of day and weapon description. Each crime is identified by a unique crime code in the database.

After data cleaning, we have reduced the number of classes to 16 types of crimes. In order to classify these crimes, we must first convert the time of day to a date-time object, which allows us to insert the individual hour and day into our classifier. The other features chosen required one-hot encoding because each of them is nominal. This increases the number of features inserted into the forest, so after a preliminary training and grid search, we used sklearn's SelectFromModel to pick what the model deemed the "most important features" to improve training time and possibly the accuracy score. For the labels, we used the sklearn LabelEncoder to create multiclass labels corresponding to each of the 16 unique crime codes.
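The time conversion could look roughly like the sketch below. It assumes the date is stored as an `MM/DD/YYYY` string and the time as an integer in 24-hour `HHMM` form; the column names `date_occ` and `time_occ` and the values are hypothetical, so the format string would need to match the real columns.

```python
import pandas as pd

# toy rows with made-up values
toy = pd.DataFrame({"date_occ": ["01/08/2020", "11/23/2020"],
                    "time_occ": [2230, 45]})

# zero-pad the time so that 45 becomes "0045" before parsing
hhmm = toy["time_occ"].astype(str).str.zfill(4)
timestamp = pd.to_datetime(toy["date_occ"] + " " + hhmm, format="%m/%d/%Y %H%M")

# numeric features that can be passed to a classifier
toy["hour"] = timestamp.dt.hour
toy["day_of_week"] = timestamp.dt.dayofweek  # Monday = 0
print(toy[["hour", "day_of_week"]])
```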

A grid search was first required to find the optimal hyperparameters for the random forest. We capped the forest at 100 trees to limit training time, and scored each candidate with 3-fold cross-validation. The optimal parameters and score are displayed below.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

#Define your training and test data after data cleaning
X = new_df

#create labels for each of the crime codes still remaining 
le = LabelEncoder()
y = le.fit_transform(df['Crm Cd'])

#create training and testing data
X_train,X_test,y_train, y_test = train_test_split(X,y, test_size = 0.3)

#define a Parameter Grid
rf = RandomForestClassifier()
param_grid = {'n_estimators': [25,50,100], "criterion": ['gini','entropy'], "max_features": [None,'sqrt','log2'], 'max_depth': [5,10]}

#Perform a Grid search with 3-fold cross validation
rf_gs = GridSearchCV(rf,param_grid,cv = 3, n_jobs = -1)
rf_gs.fit(X_train, y_train)

#display the best parameters and your score
print(f'Best Parameters: {rf_gs.best_params_}')
print(f'Best Score: {rf_gs.best_score_}')

Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'n_estimators': 100}
Best Score: 0.5290793275184281

After getting the best parameters and cross-validated score, we performed post-classification feature selection to keep the features the model deemed "most important" (in sklearn's random forest, those that contribute most to impurity reduction across the trees), in order to speed up training and reduce overfitting. This also decreases the complexity of our model, as the one-hot encoding of our features created a very large number of columns to fit on.

In [None]:
#use the cleaned, encoded features
X = new_df

#provide labels to each crime code from the cleaned data
le = LabelEncoder()
y = le.fit_transform(df['Crm Cd'])

#train a random forest with the optimal parameters
rf = RandomForestClassifier(n_estimators=100,criterion='gini', max_depth=10, max_features=None)
rf.fit(X_train, y_train)

#get your predictions, accuracy and f1 score
y_pred = rf.predict(X_test)
accuracy_s = accuracy_score(y_test, y_pred)
f1_s = f1_score(y_test, y_pred, average = 'micro')
print(f'F1: {f1_s}')
print(f'Accuracy: {accuracy_s}')

Here are the accuracy and $F_1$ scores that the model received after training on the cleaned data.

F1: 0.513062603937557

Accuracy: 0.513062603937557
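The two numbers above are identical because, for single-label multiclass problems, the micro-averaged $F_1$ score is equal to accuracy. The small check below uses made-up labels (not our data) to show this, and also computes the macro average, which weights every class equally and can differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# hypothetical true and predicted labels for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 2]
y_hat = [0, 2, 2, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)
micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro")

print(acc, micro)  # both are 4/7: four of the seven predictions are correct
print(macro)       # differs, since each class contributes equally
```

Reporting a macro or weighted $F_1$ alongside accuracy would therefore add information that the micro average does not.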



In [None]:
#get the Optimal Features based off of feature importance
sel = SelectFromModel(RandomForestClassifier(criterion = 'gini', max_depth = 10, n_estimators = 100))
sel.fit(X_train, y_train)

#select the features to retrain the model on (reuse the fitted selector rather than refitting it)
X_selected = sel.transform(X_train)
X_selected = X_selected.astype(int)

Here are the accuracy and $F_1$ scores after retraining on the features picked by sklearn's SelectFromModel based on feature importance.

F1: 0.5093074405879513

Accuracy: 0.5093074405879513

The feature selection from sklearn, which determined what the most important features are, actually caused a decrease in accuracy, but only slightly. 
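The retraining step is not shown above, so here is a minimal sketch of how it could be done. It runs on synthetic data from `make_classification`, and the variable names (`selector`, `reduced_rf`, and so on) and the smaller forest size are assumptions for illustration. The key detail is that the selector is fit on the training data only and the same column subset is then applied to the test data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# synthetic stand-in for the encoded crime features (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# fit the selector on the training split; by default it keeps features whose
# importance is at least the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0))
Xtr_sel = selector.fit_transform(Xtr, ytr)
Xte_sel = selector.transform(Xte)  # same columns, no refitting on test data

# retrain on the reduced feature set and score on the reduced test set
reduced_rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
reduced_rf.fit(Xtr_sel, ytr)
reduced_accuracy = reduced_rf.score(Xte_sel, yte)
print(Xtr.shape[1], "->", Xtr_sel.shape[1], "features, accuracy", reduced_accuracy)
```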

A classification report shows how each label was classified and how much data existed for each label. Precision is the fraction of predictions for a class that were correct (true positives out of everything classified as that class), recall is the true positive rate, the f1-score is the harmonic mean of the two, and support is the number of instances of that class in the test set (true positives plus false negatives). The model performed significantly better on some crimes than on others. This could be caused by certain crimes having many more instances than others, even after dropping crimes with few occurrences.

              precision    recall  f1-score   support

         330       0.86      0.03      0.05       224
         210       0.51      0.09      0.15      3293
         230       0.00      0.00      0.00       583
         930       0.55      0.84      0.67      5260
         350       0.33      0.00      0.00       531
         624       0.67      0.02      0.04       190
         626       0.00      0.00      0.00       206
         740       0.61      0.64      0.62       227
         860       0.48      0.75      0.58      3957
         220       0.38      0.01      0.02       242
         625       0.38      0.58      0.46      1593
         121       0.47      0.36      0.41       328
         236       0.67      0.63      0.65       256
         761       0.17      0.00      0.00       965
         623       1.00      0.00      0.01       251
         753       0.69      0.85      0.76       535

        accuracy                       0.51     18641
       macro avg   0.48      0.30      0.28     18641
    weighted avg   0.48      0.51      0.42     18641
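
To make the columns of the report concrete, the sketch below computes precision, recall, f1 and support from raw counts. The counts are hypothetical: we chose one combination (a single correct prediction out of 251 instances and no false positives) that would round to the row reported for crime code 623, but the true counts are not recoverable from the rounded report.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, f1 and support from confusion counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = tp + fn  # number of true instances of the class
    return precision, recall, f1, support

# hypothetical counts consistent with a row of 1.00 / 0.00 / 0.01 / 251
p, r, f1, support = precision_recall_f1(tp=1, fp=0, fn=250)
print(round(p, 2), round(r, 2), round(f1, 2), support)  # 1.0 0.0 0.01 251
```

This shows why a precision of 1.00 can coexist with a recall near zero: the class is almost never predicted, but the rare predictions happen to be right.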
    
The model reached an accuracy of only 0.51, but given the amount of data used in the classification and the number of labels that we had, we do not consider this unsuccessful. It is about 8 times better than a uniformly random guess among our 16 labels (1/16 ≈ 0.06). So in some regards it is quite successful, but it is not accurate enough to be used in practice.
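The comparison with random guessing can be made explicit, and the support column of the report also allows a second baseline: always predicting the most frequent class. The arithmetic below uses only numbers from the report above; the variable names are ours.

```python
n_classes = 16
largest_class_support = 5260  # class 930 in the classification report
total_support = 18641         # size of the test set in the report
reported_accuracy = 0.51

# accuracy expected from guessing uniformly at random among the classes
uniform_baseline = 1 / n_classes
# accuracy from always predicting the most frequent class
majority_baseline = largest_class_support / total_support

print(f"uniform baseline:  {uniform_baseline:.4f}")   # 0.0625
print(f"majority baseline: {majority_baseline:.4f}")  # 0.2822
print(f"ratio to uniform:  {reported_accuracy / uniform_baseline:.1f}")   # 8.2
print(f"ratio to majority: {reported_accuracy / majority_baseline:.1f}")  # 1.8
```

Because the classes are imbalanced, the majority-class baseline is the stricter comparison, and the model still clearly exceeds it.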

## Gradient Boosted Model

We decided the next step was to create a boosted model, which could possibly increase the accuracy score over the previous model. To do this, we ran a new parameter search using a gradient boosted classifier, with the same data cleaning techniques. The following code shows how we implemented the classifier and what grid search we ran on it.

In [None]:
le = LabelEncoder()
#Get crime data
y = le.fit_transform(df['Crm Cd'])
X_train,X_test,y_train, y_test = train_test_split(X,y, test_size = 0.3)

#initialize classifier and set parameters
#note: to our understanding, sklearn's 'exponential' loss only supports binary targets,
#so with multiclass labels those candidates fail to fit and are scored as NaN
gb = GradientBoostingClassifier()
param_grid = {'n_estimators': [25,50,100], "loss": ['log_loss','exponential'], "max_features": [None,'sqrt','log2'], 'max_depth': [5,10], 'min_samples_leaf': [1,4,8]}

#run the grid search
gb_gs = GridSearchCV(gb,param_grid,cv = 3, n_jobs = -1)
gb_gs.fit(X_train, y_train)
#display the best parameters and your score
print(f'Best Parameters: {gb_gs.best_params_}')
print(f'Best Score: {gb_gs.best_score_}')

Best Parameters: {'loss': 'log_loss', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 4, 'n_estimators': 25}

Best Score: 0.3777531474692358

As before, we now run a post-classification feature selection using sklearn to try to improve the score of our model.

In [None]:
#get the Optimal Features based off of feature importance
sel = SelectFromModel(GradientBoostingClassifier(loss = 'log_loss', max_depth = 5, n_estimators = 25,min_samples_leaf=4,max_features=None))
sel.fit(X_train, y_train)

#select the features to retrain the model on (reuse the fitted selector rather than refitting it)
X_selected = sel.transform(X_train)
X_selected = X_selected.astype(int)

From this we get an accuracy of 0.31574318381706246 and an $F_1$ score of 0.31574318381706246.

Here is the classification report for this model, which gives the per-class statistics.

              precision    recall  f1-score   support

         210       0.23      0.00      0.01      3233
         230       0.00      0.00      0.00       549
         930       0.32      0.93      0.47      5359
         624       0.12      0.00      0.00       482
         626       0.32      0.10      0.15      3926
         740       0.04      0.00      0.00      1650
         220       0.00      0.00      0.00       373
         236       0.00      0.00      0.00       953
         761       0.00      0.00      0.00       530

        accuracy                       0.32     17055
       macro avg   0.12      0.11      0.07     17055
    weighted avg   0.22      0.32      0.18     17055


We expected the boosted model to be an improvement over the random forest model; however, our results suggest that this is not the case. Our accuracy has consistently been worse for this model. We suspect this may be due to a less thorough grid search. There is likely a better combination of hyperparameters for this model that would greatly improve the scores that we are seeing. One improvement that we could make is to extend our parameter grid to include more options and run a more in-depth search. Another improvement would be to adjust the features that we use to train our model. There is likely a more optimal combination of features that would improve the predictive power of our model, thus increasing scores.
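One possible way to widen the search without the cost of a full grid is a randomized search, which samples a fixed number of candidates from a larger space. The sketch below is a suggestion we did not run on the crime data: it uses synthetic data, and the extra hyperparameters (`learning_rate`, `subsample`) and all candidate values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the encoded crime features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

# a wider space than the original grid; only n_iter combinations are tried
param_distributions = {
    "n_estimators": [10, 20, 40],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [2, 3, 5],
    "subsample": [0.6, 0.8, 1.0],
    "min_samples_leaf": [1, 4, 8],
}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_distributions, n_iter=5, cv=3, random_state=0)
search.fit(X_demo, y_demo)

print(f"Best Parameters: {search.best_params_}")
print(f"Best Score: {search.best_score_}")
```

Raising `n_iter` trades more training time for better coverage of the space.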

## Multiclass Logistic Regression Classifiers

We will now look into our last model. We decided to also examine how a multiclass logistic regression classifier would work.

In [3]:
le = LabelEncoder()
#get crime data X and labels y
y = le.fit_transform(crime_data['Crm Cd'])
#split test data and initialize model
X_train,X_test,y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=3)
logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1, max_iter=500)
#train and run test data
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
#get accuracy score
accuracy = logreg.score(X_test, y_test)
print(f'Accuracy of the Logistic Regression Model: {accuracy}')

Accuracy of the Logistic Regression Model:  0.4540603928466725


Multiclass logistic regression is a valuable tool for predicting crime types, particularly when the dataset involves nonbinary labels. Using this model allows for the classification of multiple crime categories based on associated features. In running logistic regression, we use the same cleaned datasets as before. After a grid search to find the optimal parameters, we settled on a multinomial model with regularization strength $C$ of 1 and lbfgs as the solver. When applied to crime data, our multiclass logistic regression model yielded an accuracy score of 0.4541, signifying a moderate ability to discern between different crime types. Considering there are 16 types of crime within the dataset, this is a noteworthy score. Despite the modest accuracy, the visualization below shows that the majority of crimes are classified under the two most frequent crime categories, shown in green and purple (corresponding to assault and battery, respectively). Each dot represents a reported crime, and the color is the type of crime predicted by the model. The axes are arbitrary components that assist in a two-dimensional visualization of the classifier.

![ex 2.13](labelledlogclassifier.png)

# Ethics

Our project relates to understanding crime better in LA. Both of our goals raise many ethical issues. The first part of our project is choosing possible locations for police stations based on the frequency of crime. Earlier in the paper we discussed the benefits of this, but now we consider the ethical implications. Placing stations where more crime is reported could increase police presence in those areas and lead to more interactions between police and the population there. This could lead to a self-fulfilling prophecy, with more crime being reported in an area due to a larger police force. Another potential downside would be the profiling of these locations, which could then be misused. There are also a plethora of other important factors to consider when looking into police station locations. So, there's a strong likelihood that the locations provided by our k-means clustering aren't realistic, given that we did not include other important features that should go into the decision of where these stations should be located.

The other goal of our paper is to classify crime based on known attributes of the crime. We removed several columns that could have negative ethical implications, like age, race and gender. The goal of our model was to better understand crime. Similar to our other goal, this could have many negative ethical implications. It could be misused to justify charging people with crimes they did not commit. It could also be used for profiling, which is not at all our intent. This could lead to classifying areas and heavily associating them with a certain crime. Such an association could change law enforcement's engagement in an area, creating a self-fulfilling prophecy. There are also other ethical implications in relation to our project that we have not specifically mentioned.

It is critical in such a relevant and sensitive topic to address the ethical implications of our work. These results could be misused and create several negative impacts. We recommend that any potential findings or insights in this paper be examined under a very thorough ethical lens before considering or implementing any changes.

# Results and Conclusion

When examining potential police station locations based on crime in LA using kmeans, we found that the current locations may not be best suited for the crime data we were working with. We understand that there is much more nuance to choosing where to place these stations than we included, and recognize that more work is required to validate our results. After this, we explored classifying crime based on known attributes. Though none of our models were particularly accurate in predicting types of crime, they weren't entirely unsuccessful. Our random forest classifier was correct about 50% of the time, which is not a negligible score, given that we were classifying into 16 different crimes. We believe it likely that classification of crime is much more complicated than we expected, and this data simply may not include the information necessary to do so with much higher accuracy. There is room for more complicated modeling to address these issues, as well as for tracking other features related to each crime.

# References

[1] McClendon, Lawrence & Meghanathan, Natarajan. (2015). Using Machine Learning Algorithms to Analyze Crime Data. Machine Learning and Applications: An International Journal. 2. 1-12. 10.5121/mlaij.2015.2101. 

[2] Los Angeles Police Department. (2023, December 13). Crime data from 2020 to present. Los Angeles - Open Data Portal. https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8 

[3] City of Los Angeles. (2016, September 15). Sheriff and police stations. City of Los Angeles Hub. https://geohub.lacity.org/maps/19d2bcfd18054942bda2c95b47bf1927_146/about The database was updated on April 19, 2022

[4] Los Angeles County Sheriff’s Department. (2023, November 23). Statistics and reports. Los Angeles County Sheriff’s Department. https://lasd.org/transparency/statistics/ 

[5] Mansoor, S. (2020, September 9). Why many U.S. crime victims don’t get money meant to help. Time. https://time.com/5886815/crime-survivors-funding/ 

[6] De La Rue L, Ortega L, Rodriguez GC. System-based victim advocates identify resources and barriers to supporting crime victims. Int Rev Vict. 2023 Jan;29(1):16-26. doi: 10.1177/02697580221088340. Epub 2022 Apr 27. PMID: 36644331; PMCID: PMC9837801.

[7] Shah, N., Bhagat, N. & Shah, M. Crime forecasting: a machine learning and computer vision approach to crime prediction and prevention. Vis. Comput. Ind. Biomed. Art 4, 9 (2021). https://doi.org/10.1186/s42492-021-00075-z
