# Getting Away with Murder
<i>What factors impact the probability of a homicide being solved in the US, based on 1980-2014 data?</i>

## 1. Overview
This is an analysis of the US Homicide dataset. The goal is to develop a model that predicts whether or not a homicide was solved, based on certain features of that homicide. Steps include data cleansing, exploratory analysis and the creation of classification models (KNN, decision tree, random forest, logistic regression).

It is designed as a high-level analysis outlining the process and how the models work, so there are some obvious areas for improvement, notably feature selection. It should still provide a good starting point for a more detailed analysis.

## 2. Data import/ initial cleansing

In [None]:
# load modules
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

# sklearn (note: grid_search and cross_validation were removed in scikit-learn 0.20)
from sklearn import metrics, dummy, grid_search, cross_validation, neighbors
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# seaborn
import seaborn as sns

# for decision tree visualisation
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
# import pydotplus

np.random.seed(1)

# get data, change to lower case and remove spaces in column heads
murders = pd.read_csv('../input/database.csv', low_memory=False, index_col=0)
murders.columns = map(str.lower, murders.columns)
murders.columns = [x.strip().replace(' ', '_') for x in murders.columns]

In [None]:
# checking what the data looks like
print('shape:', murders.shape)
print('data types:', murders.dtypes)

In [None]:
# change the dependent variable to an integer for future analysis
murders['crime_solved'] = (murders['crime_solved'] == 'Yes').astype(int)

In [None]:
# check for null values
murders.isnull().sum()

The columns relating to the perpetrator (including the victim's relationship to the perpetrator) can be removed, as these will not be known for unsolved crimes and would otherwise give away the outcome.

In [None]:
perpetrator_columns = [col for col in murders.columns if 'perpetrator' in col]
perpetrator_columns.append('relationship')
murders = murders.drop(perpetrator_columns, axis=1)

It is unlikely that where the data comes from will have an impact on whether the crime is solved, so 'record_source' can be removed, and it is unclear what 'incident' means so this can also be removed.

In [None]:
murders = murders.drop(['incident','record_source'], axis=1)

To simplify this analysis, there will be no focus on the city or agency name as there are a large number of unique values. These are all related to the geographic location. In a future iteration of this study, it may be possible to group agencies (e.g. by number of murders) and assess their impact.

In [None]:
for col in ['agency_name', 'agency_code', 'city']:
    print('{} unique values: {}'.format(col, len(murders[col].unique())))

murders = murders.drop(['agency_code', 'agency_name', 'city'], axis=1)

It is interesting to see that there is both a race and an ethnicity variable - it seems 'ethnicity' focuses purely on whether a victim is Hispanic or not!

In [None]:
for col in ['victim_race', 'victim_ethnicity']:
    print('{} unique values: {}'.format(col, murders[col].unique()))

'crime_type' only refers to whether a crime is 'Manslaughter by Negligence' or not, which adds little value. Since negligence is not really comparable to other homicides and these account for a very small proportion of the total crimes, these instances will be removed along with the column.

In [None]:
print(murders['crime_type'].value_counts())
murders = murders.drop(murders.index[murders['crime_type'] == 'Manslaughter by Negligence'])
murders = murders.drop('crime_type', axis = 1)

In [None]:
# dataset to use
murders.head()

## 3. Exploratory Analysis

### 3.1 The Variables

To start with, a look at victim age. The frequency of murders is high for babies but falls over childhood, then rises to a peak at the age of 20 before tailing off towards old age. The 'solved rate' is highest in childhood and lower for adults, especially young adults in their early 20s. The top end of the range looks strange: there is a spike at 99 and a number of victims with age = 998. This suggests an irregular recording system (e.g. missing ages recorded as 99 or 998), so all crimes where age > 98 will be dropped from the dataset.

In [None]:
pd.set_option('display.mpl_style', 'default')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.size'] = 10

age_dist_solved = murders.groupby('victim_age')['crime_solved'].mean()
age_dist = murders.groupby('victim_age')['crime_solved'].size()
ages = np.sort(murders['victim_age'].unique())

fig, ax = plt.subplots(nrows=1,ncols=2, figsize=(16, 10))

figure = plt.subplot(2,2,1)
age_dist.plot(kind='bar', color = '#a11f0c')
plt.title('Number of Homicides by Victim Age', fontsize = 16)
plt.ylabel('Number of homicides')
plt.xticks(range(0, len(ages), 5), ages[range(0, len(ages), 5)])

figure = plt.subplot(2,2,2)
age_dist_solved.plot(kind='bar', color = '#e73f0b')
plt.title('Proportion of Homicides Solved by Victim Age', fontsize = 16)
plt.ylabel('Proportion solved')
_ = plt.xticks(range(0, len(ages), 5), ages[range(0, len(ages), 5)])

# _ assigned to stop the final array of xticks being the output and printing out

In [None]:
murders = murders.drop(murders.index[murders['victim_age'] >98])
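As an alternative to dropping on a threshold, the suspected placeholder codes could be marked as missing explicitly. The sketch below uses a small made-up DataFrame (`toy`) and assumes that 99 and 998 are the only placeholder codes, which the data itself does not confirm.

```python
import pandas as pd

# hypothetical toy data, not the real dataset
toy = pd.DataFrame({'victim_age': [0, 20, 35, 99, 998, 64],
                    'crime_solved': [1, 0, 1, 0, 1, 1]})

# assumption: these two codes stand for 'age unknown'
SENTINEL_AGES = [99, 998]

# keep genuine ages, turn the sentinel codes into NaN
toy['victim_age'] = toy['victim_age'].where(~toy['victim_age'].isin(SENTINEL_AGES))

# rows with a missing age can then be dropped (or imputed) in one visible step
clean = toy.dropna(subset=['victim_age'])
print(clean)
```

Marking the codes as missing first keeps the cleaning decision separate from the decision about what to do with missing ages.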

Now to observe the practical details of the crimes. Agency type, state and weapon seem to be the variables with the most variation in solved rates. There are also a number of values that have exceedingly low frequency, particularly for agency type, state and weapon. 

Since this is designed to be a very high-level analysis, any value with a frequency lower than 1,200 (c. 0.2% of the dataset) for agency type, state and weapon will be removed from the dataset. Conclusions drawn from such rare values would have very limited application and may distract from more important trends.

Although one could argue it is not ideal to simply remove some states, this study is focused on broad trends. A glance at the bar charts would suggest that states with a low number of homicides have a better solved rate. A further iteration of this study could be to group states by their number of homicides, geographic location, or even the homicide rates (importing population information). For now, it is easier just to remove this data.

In [None]:
for column in ['agency_type', 'state', 'year', 'month', 'weapon']:
    fig, ax = plt.subplots(nrows=1,ncols=2, figsize=(16, 10))
    
    dist_solved = murders.groupby(column)['crime_solved'].mean()
    dist = murders.groupby(column)['crime_solved'].size()
    
    figure = plt.subplot(2,2,1)
    dist.plot(kind='bar', color= '#003b46')
    plt.title('Number of Homicides by {}'.format(column), fontsize = 16)
    plt.ylabel('Number of homicides')

    figure = plt.subplot(2,2,2)
    dist_solved.plot(kind='bar', color = '#07575b')
    plt.title('Proportion of Homicides Solved by {}'.format(column), fontsize = 16)
    _ = plt.ylabel('Proportion solved')

In [None]:
print('dropped:')
for col in ['agency_type', 'weapon', 'state']:
    for val in murders[col].unique():
        if(len(murders[murders[col] == val]) < 1200):
            print('{}: {}, frequency: {}'.format(col, val,\
                    len(murders[murders[col] == val])))
            murders = murders.drop(murders.index[murders[col] == val])
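One way to keep the rare values rather than dropping them would be to fold them into a single 'Other' category. This is a minimal sketch on made-up data; the helper `group_rare` and the threshold used here are hypothetical and not part of the analysis above.

```python
import pandas as pd

def group_rare(series, min_count, other_label='Other'):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

# hypothetical toy column: 'Poison' appears only once
toy_weapon = pd.Series(['Handgun'] * 5 + ['Knife'] * 3 + ['Poison'])

grouped = group_rare(toy_weapon, min_count=2)
print(grouped.value_counts())
```

The same `value_counts` lookup could also replace the nested loop above if the rows are to be dropped rather than regrouped.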

Finally, to consider the details of the victim. There look to be some significant differences in the solved rate for sex, race, ethnicity and victim count. There are also a number of 'Unknown' values in sex, race and ethnicity. For race and sex, their frequency is negligible so these homicides will be removed from the dataset. For ethnicity, the number is significant, so they will remain. For the purposes of analysis, a dummy variable will be created only for those who are positively identified as Hispanic (i.e. 'Unknown' and 'Non-Hispanic' will be grouped together). With regards to the number of victims, homicides will be categorised as either 'mass' or not, where mass indicates more than 1 victim ('victim_count' > 0).

In [None]:
for column in ['victim_sex','victim_race', 'victim_ethnicity', 'victim_count']:
    fig, ax = plt.subplots(nrows=1,ncols=2, figsize=(16, 10))
    
    dist_solved = murders.groupby(column)['crime_solved'].mean()
    dist = murders.groupby(column)['crime_solved'].size()
    
    figure = plt.subplot(2,2,1)
    dist.plot(kind='bar', color= '#2e4600')
    plt.title('Number of Homicides by {}'.format(column), fontsize = 16)
    plt.ylabel('Number of homicides')

    figure = plt.subplot(2,2,2)
    dist_solved.plot(kind='bar', color = '#486b00')
    plt.title('Proportion of Homicides Solved by {}'.format(column), fontsize = 16)
    _ = plt.ylabel('Proportion solved')

In [None]:
murders = murders.drop(murders.index[murders['victim_sex'] == 'Unknown'])
murders = murders.drop(murders.index[murders['victim_race'] == 'Unknown'])

### 3.2 Creating Dummy Variables

Dummies will be created for all categorical data. For the 'year' variable, it will be grouped by decade so as not to have too many dummies. For 'weapon', all types of gun related crimes will be grouped together. 

In [None]:
# get dummies for main variables
murders = murders.join(pd.get_dummies(murders['agency_type'], prefix = 'agency'))
murders = murders.join(pd.get_dummies(murders['state'], prefix = 'state'))
murders = murders.join(pd.get_dummies(murders['month'],prefix='mon'))
murders = murders.join(pd.get_dummies(murders['victim_sex']))
murders = murders.join(pd.get_dummies(murders['victim_race'], prefix='vic_rac'))
murders = murders.join(pd.get_dummies(murders['weapon'], prefix='weapon'))

# change to lowercase/ remove spaces
murders.columns = map(str.lower, murders.columns)
murders.columns = [x.strip().replace(' ', '_') for x in murders.columns]

# assign dummies for more than 1 victim (Victim Count is 0 if just 1 victim)
murders['mass'] = (murders['victim_count'] > 0)

# group years into decades, with one boolean column per decade
murders['1980s'] = (murders['year'] < 1990) & (murders['year'] >= 1980)
murders['1990s'] = (murders['year'] < 2000) & (murders['year'] >= 1990)
murders['2000s'] = (murders['year'] < 2010) & (murders['year'] >= 2000)
murders['2010s'] = (murders['year'] < 2020) & (murders['year'] >= 2010)

# create a dummy for Hispanic
murders['hispanic'] = murders['victim_ethnicity'] == 'Hispanic' 

# group all gun related crime into 1 dummy
murders['weapon_any_gun'] = (murders['weapon_rifle'] == True)\
| (murders['weapon_shotgun'] == True)\
| (murders['weapon_handgun'] == True)\
| (murders['weapon_gun'] == True)\
| (murders['weapon_firearm'] == True)

# drop the original gun dummies from the main dataset
murders = murders.drop(['weapon_rifle','weapon_shotgun', 'weapon_handgun','weapon_gun','weapon_firearm'], axis=1)
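The decade grouping can also be derived arithmetically instead of writing one comparison per decade. A small sketch on made-up years (the `years` Series is hypothetical):

```python
import pandas as pd

# hypothetical sample of years
years = pd.Series([1980, 1989, 1990, 2005, 2014])

# integer-divide by 10 to drop the last digit, then label the decade
decade = (years // 10 * 10).astype(str) + 's'

# one dummy column per decade
decade_dummies = pd.get_dummies(decade)
print(decade_dummies.columns.tolist())
```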

In [None]:
# drop the original columns that have been 'dummified'
murders = murders.drop(['agency_type', 'victim_sex',\
                        'victim_race', 'victim_ethnicity', 'state',\
                        'weapon', 'victim_count', 'month', 'year'], axis=1)

The following variables will be removed as 'baselines'. These are the ones with the largest number of associated instances:

* Agency: agency_municipal_police 
* State: state_california
* Month: mon_july
* Gender: male
* Race: vic_rac_white
* Weapon: weapon_any_gun
* Decade: 1990s

In [None]:
murders = murders.drop(['agency_municipal_police', 'state_california', 'mon_july','male',\
                'vic_rac_white', 'weapon_any_gun', '1990s'], axis = 1)
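The reason for removing one dummy per group is that a full set of dummies always sums to 1 on every row, so any one column is fully determined by the others. The sketch below illustrates this on a made-up column and picks the most frequent value as the baseline, as was done above; the data is hypothetical.

```python
import pandas as pd

# hypothetical toy column
toy = pd.Series(['Male', 'Female', 'Male', 'Male'], name='victim_sex')

dummies = pd.get_dummies(toy).astype(int)

# with the full set of dummies, every row sums to exactly 1
row_sums = dummies.sum(axis=1)

# use the most frequent value as the baseline and drop its column
baseline = toy.value_counts().idxmax()
reduced = dummies.drop(columns=baseline)
print(baseline, reduced.columns.tolist())
```

The remaining coefficients are then read relative to the dropped baseline.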

### 3.3 Correlation Matrix

The correlation matrix below corroborates what was determined above regarding the correlation between a crime being solved and weapon, gender, race and agency type. There are some weak correlations between the independent variables, such as race and state, state and agency type, and weapon and victim age. These all seem reasonable and none of them seem strong enough for multicollinearity to be an issue.

In [None]:
cmap = sns.diverging_palette(220, 10, as_cmap=True)

murders = murders[murders.columns].astype(float)

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(murders.corr(), cmap=cmap)
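With this many columns the heatmap is hard to read precisely, so it may help to list the strongest pairwise correlations as numbers. The helper `top_correlations` below is a hypothetical sketch, shown on synthetic data rather than the homicide dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n largest absolute pairwise correlations (each pair once)."""
    corr = df.corr().abs()
    # keep only the upper triangle so pairs are not repeated and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# synthetic data: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.RandomState(1)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})

top = top_correlations(toy)
print(top)
```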

## 4. Modelling

### 4.1 Train and Test Datasets
The model creation will be fairly basic, focusing more on optimising the parameters than feature selection. This project is more to demonstrate how the models work as opposed to creating the best predictive model.

4 types of model will be created:

1. K-nearest neighbours (KNN)
2. Decision tree
3. Random forest
4. Logistic regression

All of the selected features will be used in each model. Each type of analysis will consist of:
1. Using grid search on a sample to determine the best parameters, which is quicker than using the full dataset
2. Using these parameters to create a model on a train dataset and then testing it on a test dataset

For this reason, the dataset will need to be split into a train and test set and a sample will be taken. 30,000 is a large sample that represents around 5% of the dataset.


In [None]:
# split into a train and test set
train, test = train_test_split(murders, test_size = 0.3)

# create a sample and check solved rate similar to population
sample_murders = murders.sample(n=30000)
sample_murders['crime_solved'].value_counts(normalize = True)
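Rather than checking the solved rate after the fact, `train_test_split` can be asked to preserve it through its `stratify` argument. A minimal sketch on synthetic labels with a 70/30 class balance (the arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data: 70 'solved' and 30 'unsolved' cases
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the proportion of each class the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=1)

print(y_tr.mean(), y_te.mean())
```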

### 4.2 Benchmarks and scoring

If it was the case that the predictors had no impact on the dependent variable, it would be rational to assume that any given homicide would be solved, given that 70.2% of the total were solved. Hence this 'model' will be the benchmark against which other models can be judged.  This would have the following metrics (based on the value counts for the dependent variable below):

--- | Predicted positive | Predicted negative
---| ---| ---
<b>Actually positive</b>| 429,704 (TP) | 0 (FN)
<b>Actually negative</b>| 182,175 (FP) | 0 (TN)


<br>
FPR = FP/(FP + TN) = 182,175/182,175 = 100%

Recall (TPR) = TP/P = 429,704/429,704 = 100%

Precision = TP/(TP+FP) = 429,704/(429,704 + 182,175) = 70.23%

Accuracy = (TP+TN)/Size of dataset = 429,704/611,879 = 70.23%
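The benchmark figures can be reproduced from the four cells of the confusion matrix. The function name `benchmark_metrics` is just for illustration:

```python
def benchmark_metrics(tp, fn, fp, tn):
    """Compute the standard rates from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        'fpr': fp / (fp + tn),          # false positive rate
        'recall': tp / (tp + fn),       # true positive rate
        'precision': tp / (tp + fp),
        'accuracy': (tp + tn) / total,
    }

# counts for the 'predict every homicide as solved' benchmark
m = benchmark_metrics(tp=429704, fn=0, fp=182175, tn=0)
for name, value in m.items():
    print('{}: {:.2%}'.format(name, value))
```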

Since this study has the goal of improving information, failing to identify a true positive is no worse than incorrectly identifying a negative result as positive (i.e. a false positive). This means that <b>accuracy</b> is a more appropriate measure of model performance than recall or precision. The target to beat is 70.23%.

The issue with accuracy is that, due to the high solved rate, simply guessing that every murder was solved already gives a fairly high accuracy, which is hard to improve upon. Furthermore, if the goal of the model were to change from accuracy maximisation to a target that combines both precision and recall, the 'always solved' benchmark starts to look shakier.
A better model would be able to make predictions with a lower FPR whilst minimizing the sacrifice in TPR. A good measure for a model's ability to do this is the <b>area under the curve</b> metric (AUC). Therefore both accuracy and AUC will be considered when determining how good the models are. A model that relies on guessing should have an AUC score of 50%.
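One way to read AUC is as the probability that a randomly chosen solved case is given a higher score than a randomly chosen unsolved one, with ties counting as half. The sketch below computes it directly from that definition on made-up scores; it is meant to illustrate the metric, not to replace `roc_auc_score`.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_toy = [1, 0, 1, 0]

# a model that gives every case the same score cannot rank anything
constant_auc = auc_by_pairs(y_toy, [0.7, 0.7, 0.7, 0.7])

# a model that ranks three of the four pairs correctly
partial_auc = auc_by_pairs(y_toy, [0.9, 0.8, 0.4, 0.1])

print(constant_auc, partial_auc)
```

This is why the 'always solved' benchmark scores 50% on AUC even though its accuracy is 70.23%.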

### 4.3 KNN
A grid search using 2, 5, 10, 20, 50, 100 and 150 neighbours shows that 100 neighbours produces the best accuracy and AUC scores, although the accuracy score is in line with the 70.2% benchmark. The accuracy starts to flatten out at just below 70% with around 50 neighbours, which could suggest that at this point the number of neighbours makes proximity irrelevant and it is more or less equivalent to comparing to the entire sample.

In [None]:
X = sample_murders.drop('crime_solved', axis=1)
y = sample_murders['crime_solved']

n = [2,5,10,20,50,100,150]

gs = grid_search.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(weights = 'uniform'),
    param_grid={'n_neighbors': n},
    cv=cross_validation.KFold(len(sample_murders), n_folds = 3),
)
gs.fit(X, y)
print('Best accuracy: {}, {}'.format(gs.best_score_, gs.best_params_))
knn_acc_scores = gs.grid_scores_

gs = grid_search.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(weights = 'uniform'),
    param_grid={'n_neighbors': n},
    cv=cross_validation.KFold(len(sample_murders), n_folds = 3),
    scoring = 'roc_auc'
)
gs.fit(X, y)
print('Best AUC: {}, {}'.format(gs.best_score_, gs.best_params_))
knn_auc_scores = gs.grid_scores_
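On scikit-learn 0.20 or later the `grid_search` and `cross_validation` modules are not available, so the cell above will not run as written. A roughly equivalent search using `sklearn.model_selection` might look like the sketch below. It runs on synthetic data from `make_classification` rather than the homicide sample, and reads the mean scores from `cv_results_` in place of `grid_scores_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the sample (assumption: any binary dataset will do here)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=1)

n_toy = [2, 5, 10]

gs_new = GridSearchCV(
    estimator=KNeighborsClassifier(weights='uniform'),
    param_grid={'n_neighbors': n_toy},
    cv=KFold(n_splits=3),   # replaces KFold(len(data), n_folds=3)
    scoring='roc_auc',
)
gs_new.fit(X_toy, y_toy)
print('Best AUC: {}, {}'.format(gs_new.best_score_, gs_new.best_params_))

# one mean cross-validated score per parameter value, in grid order
mean_scores = gs_new.cv_results_['mean_test_score']
```

With this API the plotting cell would use `mean_scores` directly instead of `[s[1] for s in ...]`.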

In [None]:
plt.plot(n,[s[1] for s in  knn_acc_scores], label = 'Accuracy')
plt.plot(n,[s[1] for s in  knn_auc_scores], label = 'AUC')
plt.title('KNN: Score vs. Number of Neighbours')
plt.ylim(0.5,0.8)
plt.ylabel('Score')
plt.xlabel('Number of neighbours')
plt.legend(loc = 4)

For KNN, the sample will be split into a train and test dataset as opposed to using the train and test dataset created above. This is because KNN takes a particularly long time to run. The accuracy is around the same as the benchmark but the AUC marks an improvement. 

In [None]:
sample_train, sample_test = train_test_split(sample_murders, test_size = 0.3)

X = sample_train.drop('crime_solved', axis=1)
y = sample_train['crime_solved']

X_test = sample_test.drop('crime_solved', axis=1)
y_test = sample_test['crime_solved']

model = neighbors.KNeighborsClassifier(weights = 'uniform', n_neighbors = 100)

model.fit(X,y)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test).T[1]

print('Accuracy: {}, AUC: {}'.format(metrics.accuracy_score(y_test, predictions),\
                                     metrics.roc_auc_score(y_test, probabilities)))
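To illustrate the earlier point that a large number of neighbours makes proximity irrelevant, here is a bare-bones KNN classifier written from scratch. It is a sketch of the idea with uniform weights and squared Euclidean distance on made-up points, not the scikit-learn implementation.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points closest to query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# hypothetical points: three 'solved' cases near the origin, two 'unsolved' far away
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = [1, 1, 1, 0, 0]
query = (5, 5.4)

near = knn_predict(train_X, train_y, query, k=1)   # decided by the closest point
all_k = knn_predict(train_X, train_y, query, k=5)  # decided by the overall majority
print(near, all_k)
```

When k equals the size of the training set, every query receives the majority class, which is exactly the 'always solved' benchmark.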

### 4.4 Decision Tree

With this model, a visualisation can be created using a very basic decision tree (depth 3). This is helpful in understanding what the decision tree is actually doing. 

In [None]:
#to be run when I can figure out how to import pydotplus
model = DecisionTreeClassifier(max_depth=3)

X = sample_murders.drop('crime_solved', axis = 1)
y = sample_murders['crime_solved']

model.fit(X, y)
print(model.score(X,y))

## pydotplus not importing to Kaggle
# create an output file object
# dot_data = StringIO() 

# export_graphviz(model, 
#             out_file = dot_data,  
#             filled = True, 
#             rounded = True,
#             special_characters = True,
#             feature_names = X.columns)  

# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# Image(graph.create_png())
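While the visualisation is unavailable, the logic of a single split can be shown by hand. By default `DecisionTreeClassifier` chooses splits that reduce Gini impurity; the sketch below computes that quantity for one candidate threshold on made-up ages and outcomes. The helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# hypothetical victims: age and whether the case was solved
ages = [2, 5, 20, 25, 40, 60]
solved = [1, 1, 0, 0, 1, 0]

before = gini(solved)
after = split_gini(ages, solved, threshold=10)
print(before, after)
```

The tree tries many such thresholds across all features and keeps the one with the lowest weighted impurity, then repeats within each branch up to `max_depth`.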

To optimise the model, odd depths from 1 to 19 will be tested. The model performs best under both accuracy and AUC when the max depth is 9.

In [None]:
X = sample_murders.drop('crime_solved', axis=1)
y = sample_murders['crime_solved']

depths = list(range(1,20, 2))

gs = grid_search.GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid={'max_depth': depths},
    cv=cross_validation.KFold(len(X), n_folds = 3),
    scoring = 'accuracy'
)
gs.fit(X, y)
print('Best accuracy: {}, {}'.format(gs.best_score_, gs.best_params_))
dt_acc_scores = gs.grid_scores_

gs = grid_search.GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid={'max_depth': depths},
    cv=cross_validation.KFold(len(X), n_folds = 3),
    scoring = 'roc_auc'
)
gs.fit(X, y)
print('Best AUC: {}, {}'.format(gs.best_score_, gs.best_params_))
dt_auc_scores = gs.grid_scores_

In [None]:
plt.plot(depths,[s[1] for s in  dt_acc_scores], label = 'Accuracy')
plt.plot(depths,[s[1] for s in  dt_auc_scores], label = 'AUC')
plt.title('Decision Tree: Score vs. Max Depth')
plt.ylabel('Score')
plt.xlabel('Max depth')
plt.legend(loc = 4)

The accuracy and AUC  are not far off what was observed with the KNN model.

In [None]:
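# fit on the train set (not the grid-search sample) so the test set stays unseen
X = train.drop('crime_solved', axis=1)
y = train['crime_solved']

X_test = test.drop('crime_solved', axis=1)
y_test = test['crime_solved']

model = DecisionTreeClassifier(max_depth=9)
model.fit(X,y)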

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test).T[1]

print('Accuracy: {}, AUC: {}'.format(metrics.accuracy_score(y_test, predictions),\
                                     metrics.roc_auc_score(y_test, probabilities)))

### 4.5 Random Forest
For the random forest model there are 2 parameters that will be tested: the number of trees to use (n_estimators) and the max depth of each tree. The optimal parameters are 100 estimators and a max depth of 20.

In [None]:
X = sample_murders.drop('crime_solved', axis = 1)
y = sample_murders['crime_solved']

for scoring in ['accuracy', 'roc_auc']:
    gs = grid_search.GridSearchCV(
        estimator=RandomForestClassifier(),
        param_grid={'n_estimators': [5, 10, 20, 50, 100], 'max_depth': [10, 20, 30, 50, 100, 200]},
        cv=cross_validation.KFold(len(X), n_folds = 3, shuffle = True),
        scoring = scoring
    )
    gs.fit(X, y)
    print('best {}: {}, {}'.format(\
         scoring, gs.best_score_, gs.best_params_))
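Searching two parameters at once multiplies the work, which is part of why the search runs on the sample. The sketch below simply enumerates the grid used above to count the model fits involved; it does not train anything.

```python
from itertools import product

param_grid = {'n_estimators': [5, 10, 20, 50, 100],
              'max_depth': [10, 20, 30, 50, 100, 200]}

# every combination of the two parameters
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combos) * n_folds   # fits per scoring metric

print(len(combos), n_fits)
```

Since the loop runs once for accuracy and once for AUC, the total is twice this number, plus the final refit of the best model that `GridSearchCV` performs by default.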

Using these parameters on the train data creates a model with an accuracy of 71.8% and an AUC of 70.7%. This is a notable improvement on previous models. The AUC can also be illustrated visually on the ROC curve.

In [None]:
X = train.drop('crime_solved', axis = 1)
y = train['crime_solved']

X_test = test.drop('crime_solved', axis = 1)
y_test = test['crime_solved']

model = RandomForestClassifier(n_estimators = 100, max_depth = 20)    
model.fit(X, y)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test).T[1]

print('Accuracy: {}, AUC: {}'.format(metrics.accuracy_score(y_test, predictions),\
                                     metrics.roc_auc_score(y_test, probabilities)))

# fit a dummy (baseline) classifier and get its predicted probabilities
model_dum = dummy.DummyClassifier()
model_dum.fit(X, y)
probability_0 = model_dum.predict_proba(X_test).T[1]

# plot ROC curve
ax = plt.subplot(111)
vals = metrics.roc_curve(y_test, probability_0)
ax.plot(vals[0], vals[1])
vals = metrics.roc_curve(y_test, probabilities)
ax.plot(vals[0], vals[1])
_ = ax.set(title='ROC curve', ylabel='TPR', xlabel='FPR', xlim=(0, 1), ylim=(0, 1))

According to this model, the most important factors in determining whether a crime is solved are the age of the victim, whether it took place in NY state, whether a knife was used as a weapon, whether the victim was female and whether the victim was black. Whilst the 'direction' of the impact isn't clear from this, it can be inferred from the exploratory analysis.

In [None]:
imp_features = pd.DataFrame({'importance': model.feature_importances_, 'feature': X.columns})\
.sort_values(by = 'importance', ascending = True)
imp_features.tail(10).plot(kind = 'barh')
plt.title('Most important features', fontsize = 16)
plt.ylabel('Feature')
_ = plt.yticks(range(0, 10), imp_features['feature'].tail(10))

### 4.6 Logistic Regression

The best parameters for a logistic regression were l2 regularisation with C = 1 (C being the inverse of the regularisation strength).

In [None]:
X = sample_murders.drop('crime_solved', axis = 1)
y = sample_murders['crime_solved']

for scoring in ['accuracy', 'roc_auc']:
    gs = grid_search.GridSearchCV(
        estimator=LogisticRegression(),
        param_grid={'C': [10**i for i in range(-8, 9, 4)], 'penalty': ['l1', 'l2']},
        cv=cross_validation.KFold(n=len(X),n_folds=3),
        scoring = scoring
    )
    gs.fit(X, y)
    print('best {}: {}, {}'.format(scoring, gs.best_score_, gs.best_params_))

In actual fact, the choice between these parameter sets makes little difference to either score, so the one that maximises AUC will be used. This model has an accuracy of 70.7% and an AUC of 67.6%. This is not as good as the random forest but better than the other models.

In [None]:
X = train.drop('crime_solved', axis = 1)
y = train['crime_solved']

X_test = test.drop('crime_solved', axis = 1)
y_test = test['crime_solved']

model = LogisticRegression(penalty = 'l2', C = 1)  
model.fit(X, y)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test).T[1]

print('Accuracy: {}, AUC: {}'.format(metrics.accuracy_score(y_test, predictions),\
                                     metrics.roc_auc_score(y_test, probabilities)))

This model can be used to make predictions. For example, a 20 year old black, Hispanic (according to the dataset they do exist) male strangulation victim who was killed in DC in October in the 1990s and whose killing was investigated by county police would only have an 11.6% chance of their homicide being solved. Note that 1990s is the default and County Police as an agency was dropped after the first logistic regression, as its impact was not significantly different from that of the default (Municipal Police). These do not need to be included in the predictive model.

In [None]:
new_victim_dict = {'victim_age': 20, 'hispanic':1, 'state_district_of_columbia': 1,\
                   'mon_october': 1, 'vic_rac_black': 1, 'weapon_strangulation': 1}

new_victim_arr = [0 for i in range(len(X.columns))]

for i in new_victim_dict:
     new_victim_arr[X.columns.get_loc(i)] = new_victim_dict.get(i)
        
model.predict_proba([new_victim_arr])[0][1]
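Under the hood, the logistic regression turns a victim profile into a probability by adding up the coefficients of the features that apply and passing the total through the logistic function. The sketch below shows that calculation with invented coefficients; in the fitted model the real values would come from `model.intercept_` and `model.coef_`.

```python
import math

def predict_solved_probability(intercept, coefficients, features):
    """Logistic function applied to intercept + sum(coefficient * feature value)."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical coefficients, for illustration only (not the fitted model)
toy_intercept = 1.0
toy_coefficients = {'victim_age': -0.01,
                    'weapon_strangulation': 0.5,
                    'hispanic': -0.3}

toy_victim = {'victim_age': 20, 'weapon_strangulation': 1, 'hispanic': 1}

p = predict_solved_probability(toy_intercept, toy_coefficients, toy_victim)
print(round(p, 3))
```

Features left at 0, including every dropped baseline category, contribute nothing to the sum, which is why the baselines do not need to appear in the dictionary.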

A 2 year old white, non-hispanic female who was killed using drugs in North Carolina in February in the 1980s as part of a mass murder and whose killing was investigated by state police would have a 95.9% chance of having their homicide solved. Note that white is the default.

In [None]:
new_victim_dict = {'victim_age': 2, 'state_north_carolina': 1, '1980s':1,\
                   'mon_february': 1, 'weapon_drugs': 1, 'agency_state_police': 1, 'mass' : 1}

new_victim_arr = [0 for i in range(len(X.columns))]

for i in new_victim_dict:
     new_victim_arr[X.columns.get_loc(i)] = new_victim_dict.get(i)
        
model.predict_proba([new_victim_arr])[0][1]

## 5. Conclusions
Below is a summary of all the models created. Random forest with 100 estimators and a max depth of 20 is the best of these.

Model | Parameters | Accuracy score | ROC AUC score
---| ---| ---| ---
<b>KNN</b>| Neighbours: 100 | 70.8% | 61.4%
<b>Decision tree</b>| Max depth: 9 | 70.6% | 65.2%
<b>Random forest</b>| No. estimators: 100<br> Max depth: 20| 71.8% | 70.7%
<b>Logistic regression</b>| Regularization: l2<br> Penalty: 1 | 70.7% | 67.7%

A more detailed analysis could tune the parameters further and develop these models. It could also examine the contribution of individual variables more closely.