In this notebook we are going to explore different aspects of this dataset, do some feature engineering and selection, and finally build a multi-class classifier with some well-known algorithms. After that, I'm going to explain a little about my understanding of XAI, and we'll see more explanatory analysis using the LIME framework.

__Loading Data and libraries__

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('seaborn')
sns.set(style='white', context='notebook', palette='deep')



crimes1 = pd.read_csv('../input/Chicago_Crimes_2005_to_2007.csv',error_bad_lines=False)
crimes2 = pd.read_csv('../input/Chicago_Crimes_2008_to_2011.csv',error_bad_lines=False)
crimes3 = pd.read_csv('../input/Chicago_Crimes_2012_to_2017.csv',error_bad_lines=False)
crimes = pd.concat([crimes1, crimes2, crimes3], ignore_index=False, axis=0)

crimes.head()

We have different features here; let's take a closer look at them.

We are going to answer some questions about this dataset:

1) How many crimes do we have for each Primary Type?

In [None]:
crime_count = pd.DataFrame(crimes.groupby('Primary Type').size().sort_values(ascending=False).rename('Count').reset_index())
crime_count

We are seeing a full breakdown of crimes by type above. Let's visualize the top 20 now.

In [None]:
crime_count[:20].plot(x='Primary Type',y='Count',kind='bar')

2) Are they happening mostly on streets, in apartments or somewhere else?

In [None]:
crime_count = pd.DataFrame(crimes.groupby('Location Description').size().rename('Count').sort_values(ascending=False).reset_index())
crime_count[:20]


3) Which districts are most likely to have crimes?

In [None]:
crime_count = pd.DataFrame(crimes.groupby('District').size().rename('Count').sort_values(ascending=False).reset_index())
crime_count


The table alone doesn't help us that much, so let's visualize this on a map to get a better sense of what's going on!

In [None]:
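coord_cols = ['X Coordinate', 'Y Coordinate']
# Treat zero coordinates as missing values.
crimes[coord_cols] = crimes[coord_cols].replace(0, np.nan)
# dropna() returns a new frame, so keep the result and plot from it.
located = crimes.dropna(subset=coord_cols + ['District'])
located.plot(kind='scatter', x='X Coordinate', y='Y Coordinate', c='District', cmap=plt.get_cmap('jet'))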

In [None]:
plt.figure(figsize=(15,15))
sns.jointplot(x=crimes['X Coordinate'].values, y=crimes['Y Coordinate'].values, size=10, kind='hex')
plt.ylabel('Y Coordinate', fontsize=12)
plt.xlabel('X Coordinate', fontsize=12)
plt.show()

By looking at those heatmaps and this link (https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Districts-current-/fthy-xz3r) we can see that districts 7, 8 and 11 have, as we expected, the highest number of crimes. Districts 1 and 18 also look very hot on the map, but they don't have that many occurrences in our table, so why is that? Because they are close to downtown! We can say the density of crimes (number of crimes per square foot) is high there.

Now let's see the breakdown of crimes on the map.
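To make the count versus density distinction concrete, here is a minimal sketch on invented data. It assumes that a district's area can be roughly approximated by the bounding box of its crime coordinates, which is only a stand-in for the real district boundaries; the `toy` frame and the `district_density` helper are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the crimes table; all values are invented for illustration.
toy = pd.DataFrame({
    'District': [1, 1, 1, 1, 8, 8, 8, 8, 8, 8],
    'X Coordinate': [0, 10, 0, 10, 0, 100, 0, 100, 50, 50],
    'Y Coordinate': [0, 0, 10, 10, 0, 0, 100, 100, 50, 20],
})

def district_density(df):
    """Crimes per unit area, with area approximated by each district's bounding box."""
    g = df.groupby('District')
    width = g['X Coordinate'].max() - g['X Coordinate'].min()
    height = g['Y Coordinate'].max() - g['Y Coordinate'].min()
    out = pd.DataFrame({'Count': g.size(), 'Area': width * height})
    out['Density'] = out['Count'] / out['Area']
    return out.sort_values('Density', ascending=False)

density = district_density(toy)
print(density)
```

In this toy example district 8 has more crimes, but district 1 comes out on top once the count is divided by the area, which is the same effect we see for the downtown districts.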

In [None]:
plt.figure(figsize=(12,12))
sns.lmplot(x='X Coordinate', y='Y Coordinate', size=10, hue='Primary Type', data=crimes, fit_reg=False)
plt.ylabel('Y Coordinate', fontsize=15)
plt.xlabel('X Coordinate', fontsize=15)
plt.show()

Well, it seems we cannot get much out of it since the map is very dense, so let's look at some tables instead.
Now I'm going to find the top 3 crimes in each district.

In [None]:
topk = crimes.groupby(['District', 'Primary Type']).size().reset_index(name='counts').groupby('District').apply(lambda x: x.sort_values('counts',ascending=False).head(3))
topk[:51]

In [None]:
g =sns.factorplot("Primary Type", y='counts', col="District", col_wrap=4,
                   data=topk, kind='bar')
for ax in g.axes:
    plt.setp(ax.get_xticklabels(), visible=True, rotation=30, ha='right')

plt.subplots_adjust(hspace=0.4)

What we get out of this:

1) We don't have any data for district 13 in this dataset, so that's why nothing shows up for it. Also, for district 31 we have only 181 crimes, so its bars are barely visible.

2) Theft and Battery are happening in all districts.

3) Deceptive Practice is most prevalent in Districts 1 and 18. What is Deceptive Practice? Based on this website:
https://www.consumerjusticecenter.com/blog/2017/05/examples-of-deceptive-trade-practices.shtml
it's mostly:
- Saying that services are needed when they are not
- Saying that products have sponsorship that they don't actually have

Where is it happening mostly? Downtown! What else did we expect? It's downtown and it's all about sales!

4) Narcotics are mostly in Districts 7, 10, 11 and 15.

----------------------------------------------------------------------------------------

Now let's see what the Arrest column can give us!

In [None]:
g =sns.factorplot("Arrest", col="District", col_wrap=4, legend_out=True,
                   data=crimes, orient='h',
                    kind="count")
for ax in g.axes:
    plt.setp(ax.get_xticklabels(), visible=False)

District 11 has a high number of both arrest True and False, and the two are closer to each other than in the other districts. Let's go back and see what is mostly happening in District 11. Interesting, it's mostly about narcotics. Seems that there are a lot of rats over there reporting them to the cops! HAHA. However, we don't know if Narcotics means organized crime or regular people smoking weed. So let's take a look at the descriptions of NARCOTICS crimes and see if we can find out something!

In [None]:
# Keep only narcotics crimes and plot their 20 most frequent descriptions.
df_narcotics = crimes[crimes['Primary Type'] == 'NARCOTICS']
plt.figure(figsize = (15, 7))
sns.countplot(y = df_narcotics['Description'],order=df_narcotics['Description'].value_counts().index[:20])

As we can see, the most frequent crime is possession of 30 grams or less of cannabis. We can also see a lower rate for possession of more than 30 grams and for the other narcotics offences.

So what is a beat? Based on the dataset description:
"A beat is the smallest police geographic area"
So let's take a look at it to see which beats are the most dangerous areas to work in.

In [None]:
Beat = pd.DataFrame(crimes.groupby('Beat').size().rename('Count').sort_values(ascending=False).reset_index())
Beat[:20]

Now it would be interesting to take a look at the crimes from a time perspective.

In [None]:
crimes.Date = pd.to_datetime(crimes.Date, format='%m/%d/%Y %I:%M:%S %p')
crimes.index = pd.DatetimeIndex(crimes.Date)

crimes.resample('M').size().plot(legend=True)
plt.title('Number of crimes per month (2005 - 2017)')
plt.xlabel('Months')
plt.ylabel('Number of crimes')
plt.show()

Interesting! This is the breakdown by month, and as we can see there is seasonality in our data. Pretty much every year, crime trends upward from the beginning of the year to the middle and then downward from the middle to the end! Now let's see what's happening based on days, months and weekdays. First, we need to add columns for those features to our dataset.

In [None]:
# dt.weekday_name was removed in pandas 1.0; dt.day_name() returns the same names
crimes['day_of_week']=crimes['Date'].dt.day_name()
crimes['month']=crimes['Date'].dt.month
crimes['day']=crimes['Date'].dt.day
crimes['hour']=crimes['Date'].dt.hour
crimes['minute']=crimes['Date'].dt.minute


crimes.head()


In [None]:
crime_ = pd.DataFrame(crimes.groupby('day').size().rename('Count').reset_index())
crime_.plot(x='day',y='Count',kind='bar')
crime_ = pd.DataFrame(crimes.groupby('month').size().rename('Count').reset_index())
crime_.plot(x='month',y='Count',kind='bar')
crime_ = pd.DataFrame(crimes.groupby('day_of_week').size().rename('Count').reset_index())
crime_.plot(x='day_of_week',y='Count',kind='bar')
crime_ = pd.DataFrame(crimes.groupby('hour').size().rename('Count').reset_index())
crime_.plot(x='hour',y='Count',kind='bar')
crime_ = pd.DataFrame(crimes.groupby('minute').size().rename('Count').reset_index())
crime_.plot(x='minute',y='Count',kind='bar')

As we expected, on average there is an upward and then a downward trend during the year. In terms of day of the week and day of the month, crimes are pretty much evenly distributed. There are only two anomalies, on the 31st and on the first day of each month. Only 7 of the 12 months have a 31st, so if we scale that count by 12/7 it comes out close to the count for the first day of the month. So it seems that we see a peak in our crime rate at the end and at the first day of each month, but that might be due to how the data was sampled and recorded. We also see a pattern in the hourly rate of crimes, which mostly happen around midnight! 5 AM seems to be the safest time of the day, and that sort of makes sense.

In terms of minutes, we can see a huge peak at the 30th and the first minute of the hour. It might be because people tend to round to the half hour or the closest hour when they cannot remember the exact time of the incident.

Now let's see the breakdown of crimes over the period in which the data was gathered.
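The argument about the 31st can be checked by counting how often each day-of-month number actually occurs. The sketch below assumes the data covers 2005-01-01 to 2017-12-31 (taken from the file names, not verified against the rows); `occurrences` and `per_occurrence` are hypothetical helper names.

```python
import pandas as pd

# Every calendar day in the period covered by the three CSV files (assumed 2005-2017).
days = pd.date_range('2005-01-01', '2017-12-31', freq='D')

# How many times each day-of-month number occurs in that period.
occurrences = pd.Series(days.day).value_counts().sort_index()

print(occurrences[1], occurrences[30], occurrences[31])  # 156 143 91

def per_occurrence(raw_counts):
    """Turn raw counts per day-of-month into an average per calendar occurrence."""
    return raw_counts / occurrences
```

With the real data, something like `per_occurrence(crimes.groupby('day').size())` would put the 31st on the same footing as the other days before comparing the bars.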

In [None]:
crimes_count_date = crimes.pivot_table('ID', aggfunc=np.size, columns='Primary Type', index=crimes.index.date, fill_value=0)
crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
plo = crimes_count_date.rolling(365).sum().plot(figsize=(12, 30), subplots=True, layout=(-1, 2), sharex=False, sharey=False)

Interesting! Most of these crimes had a jump in 2016; it might be because of the political situation and the presidential election, or something else. Some of them bounce back and some of them don't. One that is pretty interesting is deceptive practice, which is now at an all-time high for the period covered by the data! It would be interesting to dig into it if I had more time.

Ok, now let's jump into making a predictive model. 

We have already done some feature engineering by adding the day, minute, hour and so on to our dataset. Let's do the same for the other features that we have.

In [None]:
crimes['day_of_week']=crimes['Date'].dt.weekday
crimes.head()



We have some fields that are related to location, such as Latitude, Longitude, X Coordinate, Block, District, and some others. So we are going to remove some of them, such as Longitude and Latitude, since we already have the X and Y coordinates; we will also zero-mean center those.

In [None]:
crimes=crimes.dropna()
crimes.isnull().sum(axis = 0)

Now let's drop some columns that either have nothing to do with predicting the primary type or duplicate another feature. We will also drop IUCR since, based on its documentation, it is a sort of encoding of the primary type itself, as in the image below. Keeping it would cause a data leakage problem and give us unrealistically good predictions.
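One way to check for this kind of leakage is to count how many distinct target labels appear for each value of a feature. This is a minimal sketch on invented rows; the codes and labels in `toy` are illustrative only, and `types_per_value` is a hypothetical helper.

```python
import pandas as pd

# Toy rows; the codes and labels are illustrative, not taken from the real file.
toy = pd.DataFrame({
    'IUCR': ['0486', '0486', '0820', '0820', '1811', '1811'],
    'Primary Type': ['BATTERY', 'BATTERY', 'THEFT', 'THEFT', 'NARCOTICS', 'NARCOTICS'],
    'District': [7, 11, 1, 18, 11, 11],
})

def types_per_value(df, feature, target='Primary Type'):
    """Number of distinct target labels seen for each value of a feature."""
    return df.groupby(feature)[target].nunique()

iucr_leak = types_per_value(toy, 'IUCR')
district_leak = types_per_value(toy, 'District')
print(iucr_leak.max(), district_leak.max())
```

If the maximum is 1, every value of the feature pins down the target exactly, which is the signature of leakage. The same check could be run on any other candidate column of the real `crimes` frame.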

In [None]:
# Drop identifiers, duplicated location fields and the leaking IUCR code in one call.
crimes = crimes.drop(['Case Number', 'ID', 'FBI Code', 'Date', 'Block', 'Updated On',
                      'Location', 'Longitude', 'Latitude', 'IUCR'], axis=1)


In [None]:

x=crimes['X Coordinate'].mean()
y=crimes['Y Coordinate'].mean()
print(x,y)


Now let's encode some of our fields as numbers so that they are easier for our algorithms to work with.

In [None]:
categories_type = {c:i for i,c in enumerate(crimes['Primary Type'].unique())}
categories_description = {c:i for i,c in enumerate(crimes['Description'].unique())}
categories_location_des = {c:i for i,c in enumerate(crimes['Location Description'].unique())}
categories_Arrest = {c:i for i,c in enumerate(crimes['Arrest'].unique())}
categories_Domestic = {c:i for i,c in enumerate(crimes['Domestic'].unique())}
#categories_IUCR = {c:i for i,c in enumerate(crimes['IUCR'].unique())}

crimes['Primary_Type_Num'] = [float(categories_type[t]) for t in crimes['Primary Type']]
crimes['Description_Num'] = [float(categories_description[t]) for t in crimes['Description']]
crimes['Location_Des_Num'] = [float(categories_location_des[t]) for t in crimes['Location Description']]
crimes['Arrest_Num'] = [float(categories_Arrest[t]) for t in crimes['Arrest']]
crimes['Domestic_Num'] = [float(categories_Domestic[t]) for t in crimes['Domestic']]
#crimes['IUCR_Num'] = [float(categories_IUCR[t]) for t in crimes['IUCR']]

crimes=crimes.drop('Primary Type', axis=1)
crimes=crimes.drop('Description', axis=1)
crimes=crimes.drop('Location Description', axis=1)
crimes=crimes.drop('Arrest', axis=1)
crimes=crimes.drop('Domestic', axis=1)
#crimes=crimes.drop('IUCR', axis=1)

crimes.head()
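The dictionary encoding in the cell above gives each category an integer in order of first appearance. As a small sketch on an invented series, the same codes can be obtained with `pd.factorize`; the series `s` is made up for illustration.

```python
import pandas as pd

s = pd.Series(['THEFT', 'BATTERY', 'THEFT', 'NARCOTICS', 'BATTERY'])

# Dictionary approach, as in the cell above.
mapping = {c: i for i, c in enumerate(s.unique())}
manual = s.map(mapping).astype(float)

# Same codes from pandas' built-in helper.
codes, uniques = pd.factorize(s)

print(manual.tolist())   # [0.0, 1.0, 0.0, 2.0, 1.0]
print(list(uniques))     # ['THEFT', 'BATTERY', 'NARCOTICS']
```

Keeping `uniques` (or the dictionary) around makes it possible to translate a predicted number back to the original category name later.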

Now zero-mean centering our X and Y coordinates

In [None]:
from sklearn import preprocessing

# scale() already subtracts the mean and divides by the standard deviation,
# so there is no need to subtract a hard-coded mean first.
crimes['X Coordinate'] = preprocessing.scale(crimes['X Coordinate'])
# The original line scaled 'X Coordinate' a second time here instead of 'Y Coordinate'.
crimes['Y Coordinate'] = preprocessing.scale(crimes['Y Coordinate'])
crimes.head()
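For reference, this is what standardizing a column amounts to, written out with NumPy on a few invented coordinate values. It uses the population standard deviation (NumPy's default), which is also what scikit-learn's scaling uses.

```python
import numpy as np

# Invented coordinate values, only to show the arithmetic.
x = np.array([1164000.0, 1165000.0, 1170000.0, 1159000.0])

# Zero-mean centering followed by division by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())
```

After the transformation the column has mean 0 and standard deviation 1 (up to floating-point error), regardless of the original units.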

Let's train a Random Forest model. Some of the pros and cons of this model are:

Pros:
+ Works well on large datasets
+ One of the most accurate decision models
+ Can be used to extract variable importance
+ Low chance of overfitting 

Cons:
- Unlike decision trees, results are difficult to interpret
- Hyperparameters need good tuning for high accuracy

Compared to other ensemble algorithms like XGBoost, it's easier to tune, and its trees can be trained in parallel because they are independent of each other. XGBoost, on the other hand, needs the result of the previous iteration to calculate gradients, so its boosting rounds have to run one after another. According to some benchmarks, though, XGBoost can reach higher accuracy.

For our metric, we can use log loss for multiclass classification or micro/macro-averaged scores (precision, recall and F1). Here I'm going to use the second option.
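To make the micro/macro distinction concrete, here is a small NumPy sketch on invented labels: a classifier that always predicts the frequent class. The helper `recall_per_class` is a hypothetical name written for this illustration.

```python
import numpy as np

# 8 samples of a frequent class 0 and 2 of a rare class 1 (invented labels).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that always predicts the frequent class.
y_pred = np.zeros(10, dtype=int)

def recall_per_class(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    classes = np.unique(y_true)
    return np.array([np.mean(y_pred[y_true == c] == c) for c in classes])

per_class = recall_per_class(y_true, y_pred)
macro_recall = per_class.mean()            # every class counts equally
micro_recall = np.mean(y_pred == y_true)   # every sample counts equally

print(per_class, macro_recall, micro_recall)
```

Here the micro score is 0.8 while the macro score is only 0.5, because the rare class is never predicted. A large gap between the two is a hint that the small classes are being missed.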

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier

print(crimes.shape)
X_train, X_test, y_train, y_test = train_test_split(crimes.loc[:, crimes.columns != 'Primary_Type_Num'], crimes['Primary_Type_Num'], test_size = 0.2, random_state = 0)
clf = RandomForestClassifier(max_features="log2", max_depth=16, n_estimators=25,
                             min_samples_split=600, oob_score=False,n_jobs=4).fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(X_train.shape,X_test.shape)

# scikit-learn metrics expect the true labels first, then the predictions
print(recall_score(y_test, y_pred, average='micro'))
print(recall_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='weighted'))

print(precision_score(y_test, y_pred, average='micro'))
print(precision_score(y_test, y_pred, average='macro'))
print(precision_score(y_test, y_pred, average='weighted'))

print(f1_score(y_test, y_pred, average='micro'))
print(f1_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='weighted'))


Okay, interesting! We have a high score for micro in every case but not for macro. When the difference is that large, it means that our classifier struggles to predict the classes that have few instances. Let's see again how many samples each class has.

In [None]:
crime_count = pd.DataFrame(crimes.groupby('Primary_Type_Num').size().sort_values(ascending=False).rename('Count').reset_index())
crime_count

Okay, as we expected, we have a lot of data for some classes but not for all of them (imbalanced dataset). We can do over- or under-sampling, so basically try to artificially create instances for the classes that don't have much data, or drop instances from the classes that have a lot. In my experience, and from what I have been reading, over-sampling usually works better, since with under-sampling we lose a portion of real data, which is valuable. For over-sampling there are several methods, such as ROSE or SMOTE. I've found SMOTE more accurate and robust. So let's see what happens after over-sampling our data.
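The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The function below is a simplified illustration under that assumption, not the imbalanced-learn implementation; `smote_like` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))          # random minority sample
        b = neighbours[a, rng.integers(k)]    # one of its nearest neighbours
        lam = rng.random()                    # position along the segment
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Four invented minority points on the corners of a unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(X_min, n_new=5)
print(new_points.shape)   # (5, 2)
```

Because new points are interpolated, they always fall between existing minority samples. For integer-coded categorical columns such as the ones built earlier, an interpolated value does not correspond to a real category, which is worth keeping in mind when reading the results.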

NOTE: Running code below might take several hours, in case you need to run it.

In [None]:
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(crimes.loc[:, crimes.columns != 'Primary_Type_Num'], crimes['Primary_Type_Num'], test_size = 0.2, random_state = 0)

print(X_train.shape,y_train.shape,'before')
sm = SMOTE(random_state=2)
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print(X_train_res.shape,y_train_res.shape,'after')

clf = RandomForestClassifier(max_features="log2", max_depth=20, n_estimators=25,
                             min_samples_split=600, oob_score=False,n_jobs=4).fit(X_train_res,y_train_res.ravel())
y_pred = clf.predict(X_test)

# scikit-learn metrics expect the true labels first, then the predictions
print(recall_score(y_test, y_pred, average='micro'))
print(recall_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='weighted'))

print(precision_score(y_test, y_pred, average='micro'))
print(precision_score(y_test, y_pred, average='macro'))
print(precision_score(y_test, y_pred, average='weighted'))

print(f1_score(y_test, y_pred, average='micro'))
print(f1_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='weighted'))


Great! The first point is that before oversampling we had around 5 million rows of data, but after it around 35 million! After training our model we can see that our micro score has decreased by about 7%, but on the other hand our macro score has improved significantly. Whether we look at precision, recall or F1, our macro score has improved from 15% to 30%, which is pretty good. It means that even for classes with a very low number of samples we can predict with a reasonable success rate. So depending on what we need from our data, fewer false positives or fewer false negatives, we can use either of the two models. Because training this model needs a lot of computational power, I wasn't able to do much hyper-parameter tuning. I expect that after some tuning our micro score would be higher too.

Now let's train a simpler model like logistic regression. Some of the pros and cons of this model are:

Pros:
+ It's easy and convenient to interpret due to probability scores for observations.
+ Can be regularized to avoid overfitting

Cons:
- Doesn’t perform well when feature space is too large
- Doesn’t handle a large number of categorical features/variables well
- Doesn’t do well on non-linear patterns
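As a small illustration of the regularization point above, here is a sketch on invented data using scikit-learn's `C` parameter, where a smaller `C` means a stronger penalty. The dataset is randomly generated and has nothing to do with the crimes data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first feature carries signal in this invented dataset.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

strong = LogisticRegression(C=0.01).fit(X, y)   # small C = strong penalty
weak = LogisticRegression(C=100.0).fit(X, y)    # large C = weak penalty

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```

The strongly regularized model ends up with much smaller coefficients, which is how the penalty keeps the model from fitting noise too closely.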

In [None]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression()
lgr.fit(X_train,y_train)
y_pred = lgr.predict(X_test)

# scikit-learn metrics expect the true labels first, then the predictions
print(recall_score(y_test, y_pred, average='micro'))
print(recall_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='weighted'))

print(precision_score(y_test, y_pred, average='micro'))
print(precision_score(y_test, y_pred, average='macro'))
print(precision_score(y_test, y_pred, average='weighted'))

print(f1_score(y_test, y_pred, average='micro'))
print(f1_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='weighted'))

Okay, that's pretty much what we expected from our logistic regression. When the feature domain becomes large, the model cannot perform very well, but it's good to have it so we can see how it behaves in LIME too.

__Explainable AI (XAI)__

So let's talk about the interesting topic of explainable AI (XAI). In the following, I'm going to refer to these references by the numbers attached to them.

+ “Why Should I Trust You?” Explaining the Predictions of Any Classifier --> 1
+ A Unified Approach to Interpreting Model Predictions --> 2
+ Anchors: High-Precision Model-Agnostic Explanations --> 3
+ https://github.com/datascienceinc/Skater --> 4

Based on (1), (3) and some other web sources, this is my understanding of XAI.

Every day we are facing new algorithms; some of them are easy to interpret, while others, like neural networks or random forests, are nearly impossible to interpret. We train them all over the place and make our decisions based on those models. When it comes to sensitive data and decisions, people need to know why the model says what it says. They need to know which factors led the model to tell us, for example, whether a person is eligible for a loan or not.

That was when this concern was raised in different communities, and they decided to pass new rules for using machine learning algorithms. One example is GDPR, which is mostly about the right to be informed about a decision; in Europe the structure needed for it was supposed to be in place in 2018.

So these algorithms are becoming more and more like black boxes, and we have little or no idea what's going on inside them. I found this image showing where we are today and what we want to have.


In [None]:
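# Image is not a builtin; it comes from IPython's display helpers.
from IPython.display import Image
Image(filename='xai.jpg')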

Let's jump into reference (1). They proposed an algorithm called LIME. Basically, it tries to approximate the model's decision boundary locally by sampling instances. In the image below, it tries to find out why the model classified the bold red + as the red class. We can see that this model is pretty complex and its boundaries cannot be explained well by a single linear model. What LIME does is sample points around the specific instance we are interested in, weight them by their distance to that instance, and fit a simple model to them to explain why the instance was classified as red. So, for example, points that are closer are drawn bolder, and so on. Another factor that this method considers is how interpretable the explanation is for a human.

After running experiments on different datasets they found interesting results. For example, they looked at how an image classifier tells wolves from huskies. Without an XAI system we only see a high accuracy, but when they looked at what the decisions were based on, it was pretty disappointing. Sometimes the model classified an image as a wolf just because there was snow in the background. It certainly has some relation to wolves, but we know it is not a reliable factor. They also showed that for an image that is pretty hard for computers, such as a Labrador with a guitar, LIME can explain pretty well which parts of the image drive each prediction. We might get lower accuracy for our classification or regression, but we would know why the model is doing what it does, and it would be more reliable in terms of human interaction.

Another interesting paper is (3). The method they describe is called anchors, and they show it can be a very high-precision way to explain machine learning models. The idea behind an anchor is basically what causality is about: we look for conditions that, when we know they hold, are enough to fix which class the model assigns to a record. We define a threshold for that as a hyper-parameter. They then ran interesting experiments on POS tagging and a household dataset. For example, in the image below, Anchor explains why the model classified those instances into different classes. It shows us the important things that affected the model at that specific point.
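The sampling idea described above can be sketched with NumPy alone. This is a simplified illustration of a local linear surrogate, not the `lime` library's implementation: `black_box` is an invented stand-in model, and the kernel, `scale` and `width` values are assumptions chosen for the example.

```python
import numpy as np

def black_box(X):
    """Stand-in for a complex model: probability of the 'red' class."""
    return 1.0 / (1.0 + np.exp(-(X[:, 0] ** 2 - X[:, 1])))

def local_linear_explanation(x, predict, n_samples=2000, scale=0.3, width=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Sample perturbed points around the instance of interest.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2) Weight each sample by its proximity to x (closer = heavier).
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / width ** 2)
    # 3) Fit a weighted linear model (with intercept) to the black box's outputs.
    A = np.hstack([Z - x, np.ones((n_samples, 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], predict(Z) * sw, rcond=None)
    return coef[:-1]   # one local weight per feature

x0 = np.array([1.0, 0.0])
weights = local_linear_explanation(x0, black_box)
print(weights)
```

Around this point the first feature pushes the prediction towards the "red" class and the second pushes it away, so the two local weights come out with opposite signs. Those signed weights play the role of the bars in a LIME plot.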

In [None]:
Image(filename='house.png')

More evidence that we need XAI platforms, as discussed in (1), is Christianity/Atheism text classification. With algorithms such as SVM they were getting high classification accuracy, but after taking a closer look at the model they realized that words like “Posting”, “Host” and “Re” were what the classifier was looking for, deciding the class of a text based on their presence or absence. These words are not a good basis for classification (they have no relation to either Christianity or Atheism).

So after testing on different datasets and considering both how interpretable people find the explanations and the accuracy of the model, they summarized the results of LIME and Anchor in the table below.

In [None]:
Image(filename='final.png')

Precision and time are pretty obvious, but there is another factor, which is coverage. It describes how much of the data an explanation applies to. Unclear coverage can lead to low human precision, because people might think that the insight from an explanation applies to unseen samples even when it does not.

Based on parameters such as time, precision and coverage we can see that the anchor method performs better than LIME. Does that mean LIME simply doesn't work? Well, the short answer is: it depends. For example, the household income dataset has a lot of features, and the Anchor method can give us a lot of conditions that it thinks are responsible for the model's behavior at that point. The result is low coverage and poor interpretability on unseen data. On the other hand, LIME tries to give us fewer and simpler factors, which is more appealing to users. In the image below we can see that Anchor is giving a lot of factors, but LIME is trying to give fewer!
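Precision and coverage of a rule can be estimated directly from data. The sketch below is a simplified illustration on invented binary features and a stand-in `model`; the real Anchors method estimates precision on perturbed samples and searches for rules automatically, which is not shown here. `precision_and_coverage` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented binary features: column 0 = 'domestic', column 1 = 'arrest'.
X = rng.integers(0, 2, size=(1000, 2))

def model(X):
    """Stand-in classifier: predicts class 1 exactly when 'domestic' is 1."""
    return X[:, 0]

def precision_and_coverage(X, predict, instance, anchor_cols):
    # Rows that satisfy the rule, i.e. match the instance on the anchored columns.
    applies = np.all(X[:, anchor_cols] == instance[anchor_cols], axis=1)
    coverage = applies.mean()
    # Among those rows, how often does the model give the same prediction?
    target = predict(instance[None, :])[0]
    precision = np.mean(predict(X[applies]) == target)
    return precision, coverage

instance = np.array([1, 0])
p_good, c_good = precision_and_coverage(X, model, instance, anchor_cols=[0])
p_bad, c_bad = precision_and_coverage(X, model, instance, anchor_cols=[1])
print(p_good, c_good)
print(p_bad, c_bad)
```

Anchoring on the first column gives a precision of 1.0 with about half of the rows covered, while anchoring on the second column covers a similar share of rows but predicts the model's output only about half of the time. A good anchor is one that clears the precision threshold while covering as much data as possible.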

In [None]:
Image(filename='anchor.png')

This was pretty much my general understanding of these methods. Now let's try to build a simple XAI platform for the dataset we have been working with.

In [None]:
import sklearn
import sklearn.datasets
import sklearn.ensemble
import numpy as np
import lime
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
import lime.lime_tabular


predict_fn_logreg = lambda x: lgr.predict_proba(x).astype(float)
predict_fn_rf = lambda x: clf.predict_proba(x).astype(float)




explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values ,feature_names = list(crimes.loc[:, crimes.columns != 'Primary_Type_Num'].columns)
,class_names=list(crimes.Primary_Type_Num.unique()))



observation_1 = 4
# Get the explanation for Logistic Regression
exp = explainer.explain_instance(X_test.values[observation_1], predict_fn_logreg, num_features=6)
exp.show_in_notebook(show_all=False)

# Get the explanation for RandomForest
exp = explainer.explain_instance(X_test.values[observation_1], predict_fn_rf, num_features=6)
exp.show_in_notebook(show_all=False)


# y_test keeps the DataFrame's index, so look the label up by position
print('Actual class of our observation --->  ',y_test.values[observation_1])




observation_2 = 124
# Get the explanation for Logistic Regression
exp = explainer.explain_instance(X_test.values[observation_2], predict_fn_logreg, num_features=6)
exp.show_in_notebook(show_all=False)

# Get the explanation for RandomForest
exp = explainer.explain_instance(X_test.values[observation_2], predict_fn_rf, num_features=6)
exp.show_in_notebook(show_all=False)


# y_test keeps the DataFrame's index, so look the label up by position
print('Actual class of our observation --->  ',y_test.values[observation_2])

As we can see, for the first example the actual class was 4. Our models made different predictions for the first observation: logistic regression classified it as group 6, while random forest classified it correctly with a high confidence of 89%. On the right-hand side we see the features that contributed to the prediction. For example, it says that being domestic or not is important for classifying this specific instance as class 1 or not class 1, and so forth. The length of each bar shows how much that predictor contributes to the prediction, and the side it is placed on shows in which direction. It explains pretty well how each feature and its value range affect our model's prediction. For observation 2 both models predicted correctly, but as we can see, and as expected, random forest assigns a higher probability to its prediction.

Thank you for taking the time to read my notebook.