# Assault detection

Using police crime reports, we're going to see whether we can detect the severity of an assault from its written description. It's part of a larger analysis [done by the LA Times](https://www.latimes.com/la-me-g-lapd-reclass-htmlstory.html) that we'll talk about in class.

## Imports

First we'll set some options up to make everything display correctly. It's mostly because these assault descriptions can be quite long, and the default is to truncate text after a few words.

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)



## Reading in our data

Our dataset is going to be a database of crimes committed between 2008 and 2012. The data has been cleaned and filtered a bit, though, so we're only left with two columns:

* `CCDESC`, what criminal code was violated
* `DO_NARRATIVE`, a short text description of what happened

We're going to use this description to see if we can separate serious cases of assault from non-serious ones.

In [2]:
df = pd.read_csv("2008-2012.csv")
df.head(10)

| |CCDESC|DO_NARRATIVE|
|---|---|---|
|0|SHOPLIFTING - PETTY THEFT ($950 & UNDER)|DO-SUSP WAS SEEN THROUGH SURVAILANCE CONCEALING SEVERAL ITEMS INTO HER SHOPPING AND PERSONAL BAG LEAVING WITHOUT PAYING DEPT STORE|
|1|VIOLATION OF COURT ORDER|DO-SUSP ARRIVED AT VICTS RESID AND ENTERED VICTS RESID IN VIOLATION OF RESTRAINING ORDER|
|2|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES|
|3|THEFT PLAIN - PETTY ($950 & UNDER)|DO-UNK SUSP TOOK VICT PREPAID GIFT CARD SUSP PURCHASED PRODUCTS WITH ITEM|
|4|BATTERY - SIMPLE ASSAULT|DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS|
|5|THEFT OF IDENTITY|DO-UNK SUSP USED VICTS PERSONAL INFO FOR GAIN WITHOUT THE VICTS CONSENT ORKNOWLEDGE|
|6|SHOPLIFTING - PETTY THEFT ($950 & UNDER)|DO-SUSP ENTERED MKT AND SEL ITEMS SUSP CONCEALED ITEMS AND EXITED STORE WOPAYING|
|7|BURGLARY|DO-UNK SUSP ENTERED VICTS RESIDENCE BY UNLOCKED FRONT DOOR SUSP REMOVED VCTICTS PROPERTY SUSP FLED LOC|
|8|OTHER MISCELLANEOUS CRIME|DO-SUSP ADMITTED TO PLACING 2010 REG TAG HE ILLEGALLY OBTAINED ON HIS LIC PLATE HIS VEH REG WAS STILL EXP|
|9|BATTERY - SIMPLE ASSAULT|DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V|


## Filtering for assaults

First, filter the dataset so it **only includes assaults.** No burglary, no identity theft, no shoplifting: just every kind of assault.

In [10]:
# Keep only the rows whose criminal code mentions ASSAULT
# (na=False treats a missing description as "not an assault")
df_new = df[df.CCDESC.str.contains('ASSAULT', na=False)]
df_new.head()

| |CCDESC|DO_NARRATIVE|
|---|---|---|
|2|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES|
|4|BATTERY - SIMPLE ASSAULT|DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS|
|9|BATTERY - SIMPLE ASSAULT|DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V|
|11|BATTERY - SIMPLE ASSAULT|DO-V STATED THAT SUSP CONFRT HER WHEN SHE TRIED TO APPR HER HUSBAND SUSP AND V HUSBAND ARE FRNDS SUSP YELLED STAY AWAY FROM HIM AND PUSHED V|
|16|BATTERY - SIMPLE ASSAULT|DO-SUSPS WERE VERBALLY ABUSING VICT DURING WHICH TIME S1 STRUCK VICT THREETIMES ON THE BACK OF HIS LEFT SHOULDER|
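
To see how a `str.contains` filter like this behaves, here is a minimal sketch on a made-up frame. The rows and the names `toy`, `is_assault` and `toy_assaults` are invented for illustration, and passing `na=False` assumes that a missing `CCDESC` should count as "not an assault".

```python
import pandas as pd

# Toy stand-in for the crime data; these rows are invented for illustration
toy = pd.DataFrame({
    "CCDESC": ["BATTERY - SIMPLE ASSAULT", "BURGLARY", None, "OTHER ASSAULT"],
})

# na=False makes a missing description count as "no match" instead of NaN
is_assault = toy.CCDESC.str.contains("ASSAULT", na=False)

# .copy() gives an independent frame, so adding columns to it later is safe
toy_assaults = toy[is_assault].copy()

# A quick sanity check: which descriptions survived the filter?
print(toy_assaults.CCDESC.value_counts())
```

Running `value_counts()` on the real `df_new.CCDESC` is a similar way to confirm that only assault codes remain.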


## Converting to a yes/no question

There are a handful of kinds of assault listed in the dataset, but they boil down to either being "aggravated" or "simple". Aggravated assault is treated much more severely than simple assault.

|Description in dataset|Is it serious/aggravated?|
|---|---|
|BATTERY - SIMPLE ASSAULT|no|
|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|yes|
|INTIMATE PARTNER - SIMPLE ASSAULT|no|
|CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT|no|
|INTIMATE PARTNER - AGGRAVATED ASSAULT|yes|
|CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT|yes|
|ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER|yes|
|OTHER ASSAULT|no|

**Make a new `0`/`1` integer column that lists whether the assault was aggravated/serious or not.**

In [29]:
# Take an explicit copy so the new column isn't assigned to a slice of df
# (this avoids pandas' SettingWithCopyWarning)
df_new = df_new.copy()
# Flag descriptions containing AGGRAVATED. Note: this does not match
# "ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER", which the table above marks as serious.
df_new['is_aggravated'] = df_new.CCDESC.str.contains('AGGRAVATED').astype(int)
df_new[df_new.is_aggravated == 1].head(20)


| |CCDESC|DO_NARRATIVE|is_aggravated|
|---|---|---|---|
|2|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES|1|
|64|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-SUSP DROVE TO THE DRIVER SIDE OF THE VICTS VEH S1 FRNT PSGR OF THE SUPSVEH PRODUCED HANDGUN FIRED APPROX FIVE ROUNDS AT VICT ASSIGNED TO GANG DETS|1|
|71|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-S STATED SHOOTING AT V AND FLED ASSIGNED TO GANG DETS|1|
|87|CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT|DO-V WAS STRUCK WITH BELT|1|
|96|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-SUSP APPROACHED VICTS WHILE THEY WERE SLEEPING SUSP STABED BOTH VICTS MULTIPLE TIMES USING KNIVES SUSP THEN FLED ON FOOT|1|
|128|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-SUSP FIRED MULTIPLE SHOTS FROM HANDGUN STRIKING ALL THREE VICTIMS. ALL VICTS ARE HOOVERS. HOOVER/DL FUED|1|
|158|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-SUSP AND VICT ARE FRIENDS SUSP AND VICT ENGAGED IN VERBAL DISPUTE AND THEN IN A PHYSICAL FIGHT SUSP THEN STABBED AND UCT VICT W POSSIBLE KNIFE SUSP FL|1|
|187|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-BUSINESS DISPUTE SUPS PUNCHED VICT THREE TIMES IN THE FACE VICT FELL TOTHE GROUND SUPS THEN KICKED VICT IN THE FACE|1|
|241|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-S APROACHED V CHALLENGED TO A FIGHT S PRODUCED A METAL PIPE AND HIT V TWICE ON HEAD S FLED ON FOOT TO 6913 MENLO|1|
|243|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|DO-SUSP DROVE UP ALONG SIDE VICTS DRIVER SIDE WINDOW AND STATED BITCH YOU TRIED TO HIT MY SON SUSP REMOVED BOTTLE WITH RIGHT HAND AND THREW IT AT VICT|1|
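
An explicit lookup built from the yes/no table above is one way to make sure every description gets the label the table assigns, including descriptions that don't contain the word `AGGRAVATED`. This is only a sketch: it assumes the eight descriptions in the table are the only assault codes in the data, and `SERIOUS` and `toy` are hypothetical names with invented rows.

```python
import pandas as pd

# Labels copied from the table above: 1 = aggravated/serious, 0 = simple
SERIOUS = {
    "BATTERY - SIMPLE ASSAULT": 0,
    "ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT": 1,
    "INTIMATE PARTNER - SIMPLE ASSAULT": 0,
    "CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT": 0,
    "INTIMATE PARTNER - AGGRAVATED ASSAULT": 1,
    "CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT": 1,
    "ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER": 1,
    "OTHER ASSAULT": 0,
}

# Toy frame standing in for df_new (rows invented for illustration)
toy = pd.DataFrame({"CCDESC": [
    "BATTERY - SIMPLE ASSAULT",
    "ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER",
    "INTIMATE PARTNER - AGGRAVATED ASSAULT",
]})

# .map leaves NaN for any description missing from the dict,
# so unexpected codes are easy to spot before casting to int
labels = toy.CCDESC.map(SERIOUS)
assert labels.notna().all(), "found a CCDESC value that is not in SERIOUS"
toy["is_aggravated"] = labels.astype(int)
print(toy)
```

Switching the real notebook to a lookup like this would change the labels, so the scores further down would need to be recomputed.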


## Building a classifier

**Use this dataset to build a classifier that can detect whether an assault is aggravated assault or simple assault.** You get to pick the words you'll be using as features, much like we did with the airbags classifier.

Use a classifier that is **not** a `LogisticRegression` classifier. You can use what we did in class with BuzzFeed, or try one of the ones you used for Part One of the homework. I've pasted the StackOverflow list here, but note that they won't all work:

```python
from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import OneClassSVM
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OutputCodeClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestCentroid
from sklearn.svm import NuSVC
from sklearn.linear_model import Perceptron
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture
# DPGMM, GMM and VBGMM from the original list are no longer in sklearn.mixture
# in newer scikit-learn releases, so they are left out here.
```

**Be sure to...**

* Make it easy for me to see your features (words)
* Use test/train split
* Provide an accuracy score
* Provide a confusion matrix
* Provide the output of `eli5.show_weights`

In [30]:
train_df = pd.DataFrame({
    'is_aggravated': df_new.is_aggravated,
    'weapon': df_new.DO_NARRATIVE.str.contains("WEAPON", na=False).astype(int),
    'deadly': df_new.DO_NARRATIVE.str.contains("DEADLY", na=False).astype(int),
    'knife': df_new.DO_NARRATIVE.str.contains("KNIFE", na=False).astype(int),
    'fired': df_new.DO_NARRATIVE.str.contains("FIRED", na=False).astype(int),
    'shot': df_new.DO_NARRATIVE.str.contains("SHOTS", na=False).astype(int),
    'handgun': df_new.DO_NARRATIVE.str.contains("HANDGUN", na=False).astype(int),
    'shooting': df_new.DO_NARRATIVE.str.contains("SHOOTING", na=False).astype(int),
    # "STAB" also matches STABED, STABBED and STABBING, so one check is enough
    'stab': df_new.DO_NARRATIVE.str.contains("STAB", na=False).astype(int),
    'metal': df_new.DO_NARRATIVE.str.contains("METAL", na=False).astype(int),
    'wood': df_new.DO_NARRATIVE.str.contains("WOOD", na=False).astype(int),
    'bottle': df_new.DO_NARRATIVE.str.contains("BOTTLE", na=False).astype(int)
    })
train_df


| |is_aggravated|weapon|deadly|knife|fired|shot|handgun|shooting|stab|metal|wood|bottle|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|2|1|0|0|1|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|0|0|0|0|
|11|0|0|0|0|0|0|0|0|0|0|0|0|
|16|0|0|0|0|0|0|0|0|0|0|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|
|830201|0|0|0|0|0|0|0|0|0|0|0|0|
|830206|0|0|0|0|0|0|1|0|0|0|0|0|
|830207|1|0|0|0|0|0|0|0|0|0|0|0|
|830208|1|0|0|0|0|0|0|0|1|0|0|0|
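
One way to keep the word list easy to read is to hold the feature names and their search terms in a single dict and build every column in a loop. This is a sketch on invented narratives, not the cell above rewritten: `KEYWORDS`, `toy` and `features` are hypothetical names, and only a few of the words are included.

```python
import pandas as pd

# Column name -> substring searched for in the narrative (a subset of the words above)
KEYWORDS = {"knife": "KNIFE", "handgun": "HANDGUN", "stab": "STAB", "bottle": "BOTTLE"}

# Invented narratives standing in for DO_NARRATIVE
toy = pd.DataFrame({"DO_NARRATIVE": [
    "DO-SUSP PULLED OUT A KNIFE",
    "DO-SUSP STABBED VICT AND THREW BOTTLE",
    None,
]})

# regex=False treats each keyword as a literal string;
# na=False turns a missing narrative into 0 for every feature
features = pd.DataFrame({
    name: toy.DO_NARRATIVE.str.contains(word, regex=False, na=False).astype(int)
    for name, word in KEYWORDS.items()
})
print(features)
```

With this layout, adding or removing a word is a one-line change to `KEYWORDS`, and the dict itself doubles as the list of features.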


In [31]:
# features
X = train_df.drop(columns='is_aggravated')
# labels
y = train_df.is_aggravated

In [32]:
# our features
X.head()

| |weapon|deadly|knife|fired|shot|handgun|shooting|stab|metal|wood|bottle|
|---|---|---|---|---|---|---|---|---|---|---|---|
|2|0|0|1|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|0|0|0|
|11|0|0|0|0|0|0|0|0|0|0|0|
|16|0|0|0|0|0|0|0|0|0|0|0|


In [33]:
# our labels
y.head()

2     1
4     0
9     0
11    0
16    0
Name: is_aggravated, dtype: int64

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
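
`train_test_split` shuffles the rows randomly, so without a seed the scores will shift a little from run to run. Here is a hedged sketch of a repeatable split on toy data. `random_state=42` is an arbitrary seed, `stratify` is an optional extra that the cell above does not use, and `X_toy`/`y_toy` are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (invented): 20 rows, 25% positive
X_toy = pd.DataFrame({"knife": [1, 0, 0, 0] * 5, "bottle": [0, 1, 0, 0] * 5})
y_toy = pd.Series([1, 0, 0, 0] * 5)

# By default 25% of the rows go to the test set.
# random_state fixes the shuffle; stratify keeps the share of
# positive labels roughly equal in the train and test halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
```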

In [35]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [36]:
clf.score(X_test, y_test)

0.813940036633568

In [37]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not aggravated', 'aggravated'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

| |Predicted not aggravated|Predicted aggravated|
|---|---|---|
|Is not aggravated|29336|551|
|Is aggravated|7169|4436|
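
The confusion matrix contains everything needed to reproduce the accuracy score and to see where the errors fall. The arithmetic below uses the four counts from the matrix. The majority-class baseline, precision and recall are added points of comparison, not something the notebook computes.

```python
# Counts copied from the confusion matrix above
tn, fp = 29336, 551    # actually not aggravated: predicted not / predicted aggravated
fn, tp = 7169, 4436    # actually aggravated: predicted not / predicted aggravated

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # same quantity as clf.score(X_test, y_test)
baseline = (tn + fp) / total    # accuracy of always guessing "not aggravated"
precision = tp / (tp + fp)      # share of "aggravated" predictions that are right
recall = tp / (tp + fn)         # share of truly aggravated cases that are caught

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

On these counts the classifier beats the roughly 72% baseline, and when it predicts "aggravated" it is right about 89% of the time, but it catches only about 38% of the aggravated cases.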


In [38]:
import eli5

feature_names = list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])

|Weight|Feature|
|---|---|
|0.3101|fired|
|0.2778|knife|
|0.1722|stab|
|0.1563|handgun|
|0.0833|bottle|
|0.0002|metal|
|0.0001|wood|
|0.0001|weapon|
|0.0|shot|
|0.0|shooting|
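
For a decision tree, the weights in this kind of table should correspond to the tree's feature importances, which scikit-learn also exposes directly as `feature_importances_`. Here is a minimal sketch on invented data for cases where `eli5` isn't available. `toy_X`, `toy_y` and `toy_clf` are hypothetical names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented data: "knife" predicts the label perfectly, "wood" is noise
toy_X = pd.DataFrame({
    "knife": [1, 1, 1, 0, 0, 0, 0, 0],
    "wood":  [0, 1, 0, 1, 0, 1, 0, 0],
})
toy_y = pd.Series([1, 1, 1, 0, 0, 0, 0, 0])

toy_clf = DecisionTreeClassifier(max_depth=5, random_state=0)
toy_clf.fit(toy_X, toy_y)

# Pair each importance with its column name, largest first
importances = pd.Series(toy_clf.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

The importances of a fitted tree sum to 1, so each value can be read as that feature's share of the splitting work.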
