# San Francisco Crime Analysis and Classifier

Hi folks! Welcome to a new project in which we will build a classifier, based on traditional ML models, that predicts the crime category from 8 characteristics. No less important is the exhaustive EDA we had to perform in order to prepare the data properly. With no more to say, let's get started!

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. From Sunset to SOMA, and Marina to Excelsior, this project analyzes 12 years of crime reports from across all of San Francisco's neighborhoods to create a model that predicts the category of crime that occurred given time and location.


**Problem Statement**

To examine the specific problem, we will apply a full Data Science life cycle composed of the following steps:

- Data Exploration, in which we will clean the data and understand the variables and how they relate to each other, obtaining key insights that help us improve the pre-modeling process.
- Feature Engineering to create additional features derived from the existing ones.
- Training / Testing data creation to evaluate the performance of our models and fine-tune their hyperparameters.

**Data fields**

- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

## Exploratory data analysis

We will start by listing the files contained in the dataset and unzipping them so that they can be used in our notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
base_dir = "../input/sf-crime"
train_dir = os.path.join(base_dir, "train.csv.zip")
test_dir = os.path.join(base_dir, "test.csv.zip")
submission_dir = os.path.join(base_dir, "sampleSubmission.csv.zip")

import zipfile
with zipfile.ZipFile(train_dir,"r") as train:
    train.extractall()

with zipfile.ZipFile(test_dir,"r") as test:
    test.extractall()

with zipfile.ZipFile(submission_dir,"r") as sub:
    sub.extractall()

In [None]:
df=pd.read_csv('./train.csv')

In [None]:
df.shape

In [None]:
df.head()

Both files contain over 870 thousand instances, and the testing set has only 7 columns, as can be seen below:

In [None]:
df_test=pd.read_csv('./test.csv')
df_test.shape

In [None]:
df_test.head()

Below we can see the number of null values in our columns and their types. This step is important to perform early so that we can set the columns to their proper types and do some basic cleaning.

In [None]:
df.info()

There are no null values and the columns are set properly, except for the Dates column, which has to be converted to a datetime type; we will do this a few cells below. The null counts per column confirm this:

In [None]:
df.isna().sum()

As we have only 2 numeric columns the describe function will be performed only on these two:

In [None]:
df.describe()

From another notebook I noticed that these features contain outliers: coordinates (latitude and longitude) that do not correspond to San Francisco but rather to the North Pole. This is why we will get rid of them, but first let's display those instances:

In [None]:
df[df['X']>=-120.5]

**Official extreme coordinates:** These are the extreme coordinates of the San Francisco map. They will help us narrow down the coordinate intervals so that our plots are more accurate.

**min_longitude, min_latitude, max_longitude, max_latitude = -122.52469 37.69862 -122.33663 37.82986**

**Box = (-122.52469, -122.33663, 37.69862, 37.82986)**

In [None]:
df = df[df['X']<-120.5]
df.describe()

In [None]:
df.shape
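
The bounding box quoted above can also serve as an explicit filter, instead of the single `X < -120.5` cut. The following is only a minimal sketch on made-up toy rows; `within_bbox` and `BBOX` are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Bounding box quoted above, ordered as (min_lon, max_lon, min_lat, max_lat)
BBOX = (-122.52469, -122.33663, 37.69862, 37.82986)

def within_bbox(frame, bbox=BBOX, lon_col="X", lat_col="Y"):
    """Return a boolean mask that is True for rows inside the bounding box."""
    min_lon, max_lon, min_lat, max_lat = bbox
    lon_ok = frame[lon_col].between(min_lon, max_lon)
    lat_ok = frame[lat_col].between(min_lat, max_lat)
    return lon_ok & lat_ok

# Toy rows: the second one mimics the outliers displayed above
toy = pd.DataFrame({"X": [-122.42, -120.5, -122.40],
                    "Y": [37.77, 90.0, 37.70]})
mask = within_bbox(toy)
inside = toy[mask]
print(inside)
```

Applied to the real data this would be `df[within_bbox(df)]`. Note that it is stricter than the cut used in this notebook, so it could drop a few additional border points.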

Once we got rid of those instances, the describe table shows that the X and Y minimum and maximum have narrowed, which means they now cover only the region around SF. Now we will convert the Dates feature to a datetime type, which allows us to perform time series analysis.

In [None]:
df['Dates'] = pd.to_datetime(df['Dates'])  # infer_datetime_format is deprecated in recent pandas versions
df.info()

Another very important detail about the data is that it contains duplicated instances. We have to eliminate them, since they add no useful information:

In [None]:
len(df[df.duplicated()])

In [None]:
df.drop_duplicates(inplace=True)
len(df)
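
As a small illustration of what the two cells above do, here is a sketch on a made-up three-row frame (the values are invented for the example). `duplicated()` flags every repeated row except its first occurrence, and `drop_duplicates()` removes exactly those flagged rows.

```python
import pandas as pd

# Toy frame: the second row is an exact copy of the first
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "ROBBERY"],
    "Address": ["A ST", "A ST", "B ST"],
})

# duplicated() marks the second and later occurrences of a row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```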

Now that the basic processing is done, we can start exploring our label and features:

In [None]:
df.Category.unique()

The label contains 39 crime categories. Below we will draw a bar plot showing each category's share of the total number of incidents:

In [None]:
plt.figure(figsize=(10, 10))
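# Share of each category in percent, so the values match the axis label below
category_share = df.Category.value_counts(normalize=True) * 100
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=category_share.values, y=category_share.index,
            orient='h', palette="Blues_r")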
plt.title('Incidents per Crime Category', fontdict={'fontsize': 16})
plt.xlabel('Incidents (%)')

plt.show()
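
To make the `normalize=True` option concrete, here is a tiny sketch on invented labels: it turns the raw counts into fractions of the total, so the values always sum to 1.

```python
import pandas as pd

# Four made-up labels: three thefts and one assault
labels = pd.Series(["LARCENY/THEFT"] * 3 + ["ASSAULT"])

# normalize=True returns fractions of the total instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares)
```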

We can plot the distribution of crimes by category on the San Francisco map, but first I will read the map image obtained from Google Maps:

In [None]:
img_map = plt.imread('../input/san-francisco-map/SF_map.png')
plt.subplots(figsize = (11,11))
plt.imshow(img_map)

As I said earlier, the extreme coordinates help us set the limits of our map in both latitude and longitude. In the cell below we will plot a heat map of all crimes in the region. Since the data contains more than 800 thousand instances, plotting all of them would take far too long, so we will randomly sample and plot 5000 instances:

In [None]:
#    ll.lon     ll.lat   ur.lon     ur.lat
#    -122.52469 37.69862 -122.33663 37.82986

BBox = (-122.52469, -122.33663, 37.69862, 37.82986)
fig, ax = plt.subplots(figsize = (8,7))

sns.kdeplot(data=df.sample(5000), x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5)
ax.set_title('Plotting all incidents map')
ax.set_xlim(-122.52469, -122.33663)
ax.set_ylim(37.69862, 37.82986)
ax.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
ax.axis('off')

I have selected 9 crime categories based on their number of incidents and in the next cell we will plot the heat map for each of these crimes (notice again that I have taken a sample of them in order to speed up the process):

In [None]:
import pylab
import numpy as np

pylab.rcParams['figure.figsize'] = (18.0, 13.0)

larceny = df[df['Category'] == "LARCENY/THEFT"].sample(4000)
assault = df[df['Category'] == "ASSAULT"].sample(4000)
drug = df[df['Category'] == "DRUG/NARCOTIC"].sample(4000)
vehicle = df[df['Category'] == "VEHICLE THEFT"].sample(4000)
vandalism = df[df['Category'] == "VANDALISM"].sample(4000)
burglary = df[df['Category'] == "BURGLARY"].sample(4000)
robbery = df[df['Category'] == "ROBBERY"].sample(4000)
prostitution = df[df['Category'] == "PROSTITUTION"].sample(4000)
driving_drunk = df[df['Category'] == "DRIVING UNDER THE INFLUENCE"]

with plt.style.context('seaborn-darkgrid'):
    ax2 = plt.subplot2grid((3,3), (0, 0))
    ax2.set_title('Larceny/theft incidents map')
    ax2.set_xlim(-122.52469, -122.33663)
    ax2.set_ylim(37.69862, 37.82986)
    ax2.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=larceny, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax2)
    ax2.axis('off')
    
    ax3 = plt.subplot2grid((3,3), (0, 1))
    ax3.set_title('Assault incidents map')
    ax3.set_xlim(-122.52469, -122.33663)
    ax3.set_ylim(37.69862, 37.82986)
    ax3.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=assault, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax3)
    ax3.axis('off')
    
    ax4 = plt.subplot2grid((3,3), (0, 2))
    ax4.set_title('Drug/Narcotic incidents map')
    ax4.set_xlim(-122.52469, -122.33663)
    ax4.set_ylim(37.69862, 37.82986)
    ax4.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=drug, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax4)
    ax4.axis('off')
    
    ax5 = plt.subplot2grid((3,3), (1, 0))
    ax5.set_title('Vehicle theft incidents map')
    ax5.set_xlim(-122.52469, -122.33663)
    ax5.set_ylim(37.69862, 37.82986)
    ax5.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=vehicle[vehicle['Y']<=38], x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax5)
    ax5.axis('off')
    
    ax6 = plt.subplot2grid((3,3), (1, 1))
    ax6.set_title('Vandalism incidents map')
    ax6.set_xlim(-122.52469, -122.33663)
    ax6.set_ylim(37.69862, 37.82986)
    ax6.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=vandalism, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax6)
    ax6.axis('off')
    
    ax7 = plt.subplot2grid((3,3), (1, 2))
    ax7.set_title('Burglary incidents map')
    ax7.set_xlim(-122.52469, -122.33663)
    ax7.set_ylim(37.69862, 37.82986)
    ax7.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=burglary, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax7)
    ax7.axis('off')

    ax8 = plt.subplot2grid((3,3), (2, 0))
    ax8.set_title('Robbery incidents map')
    ax8.set_xlim(-122.52469, -122.33663)
    ax8.set_ylim(37.69862, 37.82986)
    ax8.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=robbery, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax8)
    ax8.axis('off')
    
    ax9 = plt.subplot2grid((3,3), (2, 1))
    ax9.set_title('Prostitution incidents map')
    ax9.set_xlim(-122.52469, -122.33663)
    ax9.set_ylim(37.69862, 37.82986)
    ax9.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=prostitution, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax9)
    ax9.axis('off')
    
    ax10 = plt.subplot2grid((3,3), (2, 2))
    ax10.set_title('Drunken driver incidents map')
    ax10.set_xlim(-122.52469, -122.33663)
    ax10.set_ylim(37.69862, 37.82986)
    ax10.imshow(img_map, zorder=0, extent = BBox, aspect= 'equal')
    sns.kdeplot(data=driving_drunk, x='X', y='Y', fill=True, thresh=0.02, cmap='Blues', alpha=0.5, ax=ax10)
    ax10.axis('off')
  
    pylab.gcf().text(0.5, 1.03, 
                    'San Francisco Crime Incidents',
                     horizontalalignment='center',
                     verticalalignment='top', 
                     fontsize = 28)
    
plt.tight_layout()
plt.show()

Let's print the number of unique descriptions in our dataset:

In [None]:
len(df.Descript.unique())

In [None]:
df.Descript.unique()[:10]

This feature is so specific that we could use an attention-based model (a transformer) to predict the category from it, since the column explains what happened in the event and almost always contains key words that identify the crime category. However, as it is not included in the testing file, we can go ahead and ignore it.

In [None]:
df.DayOfWeek.unique()

The DayOfWeek column makes sense, as it contains events from Monday to Sunday. Later we will see the distribution of incidents by day.

In [None]:
df.PdDistrict.unique()

This column represents the police department district, in other words the area of San Francisco covered by a specific police station. The bar plot below shows that Southern is the district with the most records, as more crimes are reported there.

In [None]:
plt.figure(figsize=(12, 4))

df_pd=df.groupby(by='PdDistrict').count()
df_pd.iloc[:,0].plot(kind='bar')
plt.title('Bar plot for records by PdDistrict')
plt.xlabel("PdDistrict")
plt.ylabel("Number of incidents")
plt.show()

In [None]:
df.Resolution.unique()

There are 17 possible event resolutions. They cannot be tied to specific crimes at prediction time, since Resolution is only available in train.csv, which is why we will ignore this column later when defining the features for training:

In [None]:
len(df.Resolution.unique())

The address is significantly important, as it contains street key words that can be related to specific crimes such as drug dealing, prostitution, car theft, etc. Later we will see the most frequent words in this column by performing N-gram analysis.

In [None]:
len(df.Address.unique())

Now we will perform time series analysis, in which we can find patterns by hour, day, month, or even year related to a crime. For this we will extract those components from the date feature and then group by them, adding up the number of incidents.

In [None]:
# Dates was already converted to datetime above, so the .dt accessor can be used directly
df['year'] = df['Dates'].dt.year
df['month'] = df['Dates'].dt.month
df['day'] = df['Dates'].dt.day
df['hour'] = df['Dates'].dt.hour
df.sample(10)
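
The extraction above can be checked on a couple of hand-written timestamps. This is only a sketch on toy values; the column names mirror the ones used in the notebook.

```python
import pandas as pd

# Two made-up timestamps in the same format as the Dates column
toy = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2003-01-06 00:01:00"]})
toy["Dates"] = pd.to_datetime(toy["Dates"])

# The .dt accessor exposes each component of a datetime column
toy["year"] = toy["Dates"].dt.year
toy["month"] = toy["Dates"].dt.month
toy["day"] = toy["Dates"].dt.day
toy["hour"] = toy["Dates"].dt.hour
print(toy)
```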

**Crime occurrence by hour**

In [None]:
plt.figure(figsize=(12, 4))

hours_event=df.groupby(by='hour').count()
hours_event.iloc[:,0].plot(kind='bar')
plt.title('Bar plot for records by hour')
plt.xlabel("Hour of the day")
plt.ylabel("Number of incidents")
plt.show()

In [None]:
pylab.rcParams['figure.figsize'] = (18.0, 13.0)

larceny = df[df['Category'] == "LARCENY/THEFT"]
assault = df[df['Category'] == "ASSAULT"]
drug = df[df['Category'] == "DRUG/NARCOTIC"]
vehicle = df[df['Category'] == "VEHICLE THEFT"]
vandalism = df[df['Category'] == "VANDALISM"]
burglary = df[df['Category'] == "BURGLARY"]
robbery = df[df['Category'] == "ROBBERY"]
prostitution = df[df['Category'] == "PROSTITUTION"]
driving_drunk = df[df['Category'] == "DRIVING UNDER THE INFLUENCE"]

with plt.style.context('seaborn-darkgrid'):
    ax1 = plt.subplot2grid((4,3), (0, 0), colspan=3)
    ax1.plot(df.groupby('hour').size(), 'ro-')
    ax1.set_title ('All crimes')
    ax1.xaxis.set_ticks(np.arange(0, 24, 1))
    
    ax2 = plt.subplot2grid((4,3), (1, 0))
    ax2.plot(larceny.groupby('hour').size(), 'o-')
    ax2.set_title ('Larceny/Theft')
    
    ax3 = plt.subplot2grid((4,3), (1, 1))
    ax3.plot(assault.groupby('hour').size(), 'o-')
    ax3.set_title ('Assault')
    
    ax4 = plt.subplot2grid((4,3), (1, 2))
    ax4.plot(drug.groupby('hour').size(), 'o-')
    ax4.set_title ('Drug/Narcotic')
    
    ax5 = plt.subplot2grid((4,3), (2, 0))
    ax5.plot(vehicle.groupby('hour').size(), 'o-')
    ax5.set_title ('Vehicle theft')
    
    ax6 = plt.subplot2grid((4,3), (2, 1))
    ax6.plot(vandalism.groupby('hour').size(), 'o-')
    ax6.set_title ('Vandalism')
    
    ax7 = plt.subplot2grid((4,3), (2, 2))
    ax7.plot(burglary.groupby('hour').size(), 'o-')
    ax7.set_title ('Burglary')

    ax8 = plt.subplot2grid((4,3), (3, 0))
    ax8.plot(robbery.groupby('hour').size(), 'o-')
    ax8.set_title ('Robbery')
    
    ax9 = plt.subplot2grid((4,3), (3, 1))
    ax9.plot(prostitution.groupby('hour').size(), 'o-')
    ax9.set_title ('Prostitution')
    
    ax10 = plt.subplot2grid((4,3), (3, 2))
    ax10.plot(driving_drunk.groupby('hour').size(), 'o-')
    ax10.set_title ('Driving under the influence')
  
    pylab.gcf().text(0.5, 1.03, 
                    'San Francisco Crime Occurrence by Hour',
                     horizontalalignment='center',
                     verticalalignment='top', 
                     fontsize = 28)
    
plt.tight_layout()
plt.show()
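
The per-category hourly counts plotted above can also be computed in one step with a cross-tabulation, rather than one filtered frame per category. The sketch below uses invented rows and is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy incidents: category and hour of the day
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "ROBBERY", "ROBBERY", "ROBBERY"],
    "hour": [0, 23, 23, 23, 12],
})

# Rows are hours, columns are categories, cells are incident counts
hourly = pd.crosstab(toy["hour"], toy["Category"])

# Reindex to all 24 hours so hours without incidents appear as 0
hourly = hourly.reindex(range(24), fill_value=0)
print(hourly.loc[[0, 12, 23]])
```

Each column of `hourly` could then be drawn in its own panel, for example with `hourly["ROBBERY"].plot(marker="o")`. The same idea applies to the day-of-week, month and year views.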

**Crime occurrence by day of the week**

In [None]:
data = df.groupby('DayOfWeek').count().iloc[:, 0]
data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

plt.figure(figsize=(18, 4))
data.plot(kind='bar')
plt.title('Bar plot for records by day of week')
plt.xlabel("Day of week")
plt.ylabel("Number of incidents")
plt.show()

In [None]:
pylab.rcParams['figure.figsize'] = (18.0, 13.0)

larceny = df[df['Category'] == "LARCENY/THEFT"]
assault = df[df['Category'] == "ASSAULT"]
drug = df[df['Category'] == "DRUG/NARCOTIC"]
vehicle = df[df['Category'] == "VEHICLE THEFT"]
vandalism = df[df['Category'] == "VANDALISM"]
burglary = df[df['Category'] == "BURGLARY"]
robbery = df[df['Category'] == "ROBBERY"]
prostitution = df[df['Category'] == "PROSTITUTION"]
driving_drunk = df[df['Category'] == "DRIVING UNDER THE INFLUENCE"]

with plt.style.context('seaborn-darkgrid'):
    ax1 = plt.subplot2grid((4,3), (0, 0), colspan=3)
    data = df.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax1.plot(data, 'ro-')
    ax1.set_title ('All crimes')
    ax1.xaxis.set_ticks(np.arange(0, 7, 1))
    
    ax2 = plt.subplot2grid((4,3), (1, 0))
    data = larceny.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax2.plot(data, 'o-')
    ax2.set_title ('Larceny/Theft')
    
    ax3 = plt.subplot2grid((4,3), (1, 1))
    data = assault.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax3.plot(data, 'o-')
    ax3.set_title ('Assault')
    
    ax4 = plt.subplot2grid((4,3), (1, 2))
    data = drug.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax4.plot(data, 'o-')
    ax4.set_title ('Drug/Narcotic')
    
    ax5 = plt.subplot2grid((4,3), (2, 0))
    data = vehicle.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax5.plot(data, 'o-')
    ax5.set_title ('Vehicle theft')
    
    ax6 = plt.subplot2grid((4,3), (2, 1))
    data = vandalism.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax6.plot(data, 'o-')
    ax6.set_title ('Vandalism')
    
    ax7 = plt.subplot2grid((4,3), (2, 2))
    data = burglary.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax7.plot(data, 'o-')
    ax7.set_title ('Burglary')

    ax8 = plt.subplot2grid((4,3), (3, 0))
    data = robbery.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax8.plot(data, 'o-')
    ax8.set_title ('Robbery')
    
    ax9 = plt.subplot2grid((4,3), (3, 1))
    data = prostitution.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax9.plot(data, 'o-')
    ax9.set_title ('Prostitution')
    
    ax10 = plt.subplot2grid((4,3), (3, 2))
    data = driving_drunk.groupby('DayOfWeek').size()
    data = data.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    ax10.plot(data, 'o-')
    ax10.set_title ('Driving under the influence')
  
    pylab.gcf().text(0.5, 1.03, 
                    'San Francisco Crime Occurrence by day of week',
                     horizontalalignment='center',
                     verticalalignment='top', 
                     fontsize = 28)
    
plt.tight_layout()
plt.show()

**Crime occurrence by day of the month**

In [None]:
plt.figure(figsize=(18, 4))

days_event=df.groupby(by='day').count()
days_event.iloc[:,0].plot(kind='bar')
plt.title('Bar plot for records by day of the month')
plt.xlabel("Month day")
plt.ylabel("Number of incidents")
plt.show()

**Crime occurrence by month**

In [None]:
plt.figure(figsize=(12, 4))

months_event=df.groupby(by='month').count()
months_event.iloc[:,0].plot(kind='bar')
plt.title('Bar plot for records by month')
plt.xlabel("Month number")
plt.ylabel("Number of incidents")
plt.show()

In [None]:
pylab.rcParams['figure.figsize'] = (18.0, 13.0)

larceny = df[df['Category'] == "LARCENY/THEFT"]
assault = df[df['Category'] == "ASSAULT"]
drug = df[df['Category'] == "DRUG/NARCOTIC"]
vehicle = df[df['Category'] == "VEHICLE THEFT"]
vandalism = df[df['Category'] == "VANDALISM"]
burglary = df[df['Category'] == "BURGLARY"]
robbery = df[df['Category'] == "ROBBERY"]
prostitution = df[df['Category'] == "PROSTITUTION"]
driving_drunk = df[df['Category'] == "DRIVING UNDER THE INFLUENCE"]

with plt.style.context('seaborn-darkgrid'):
    ax1 = plt.subplot2grid((4,3), (0, 0), colspan=3)
    ax1.plot(df.groupby('month').size(), 'ro-')
    ax1.set_title ('All crimes')
    ax1.xaxis.set_ticks(np.arange(1, 13, 1))
    
    ax2 = plt.subplot2grid((4,3), (1, 0))
    ax2.plot(larceny.groupby('month').size(), 'o-')
    ax2.set_title ('Larceny/Theft')
    
    ax3 = plt.subplot2grid((4,3), (1, 1))
    ax3.plot(assault.groupby('month').size(), 'o-')
    ax3.set_title ('Assault')
    
    ax4 = plt.subplot2grid((4,3), (1, 2))
    ax4.plot(drug.groupby('month').size(), 'o-')
    ax4.set_title ('Drug/Narcotic')
    
    ax5 = plt.subplot2grid((4,3), (2, 0))
    ax5.plot(vehicle.groupby('month').size(), 'o-')
    ax5.set_title ('Vehicle theft')
    
    ax6 = plt.subplot2grid((4,3), (2, 1))
    ax6.plot(vandalism.groupby('month').size(), 'o-')
    ax6.set_title ('Vandalism')
    
    ax7 = plt.subplot2grid((4,3), (2, 2))
    ax7.plot(burglary.groupby('month').size(), 'o-')
    ax7.set_title ('Burglary')

    ax8 = plt.subplot2grid((4,3), (3, 0))
    ax8.plot(robbery.groupby('month').size(), 'o-')
    ax8.set_title ('Robbery')
    
    ax9 = plt.subplot2grid((4,3), (3, 1))
    ax9.plot(prostitution.groupby('month').size(), 'o-')
    ax9.set_title ('Prostitution')
    
    ax10 = plt.subplot2grid((4,3), (3, 2))
    ax10.plot(driving_drunk.groupby('month').size(), 'o-')
    ax10.set_title ('Driving under the influence')
  
    pylab.gcf().text(0.5, 1.03, 
                    'San Francisco Crime Occurrence by Month',
                     horizontalalignment='center',
                     verticalalignment='top', 
                     fontsize = 28)
    
plt.tight_layout()
plt.show()

**Crime occurrence by year**

In [None]:
plt.figure(figsize=(12, 4))

years_event=df.groupby(by='year').count()
years_event.iloc[:,0].plot(kind='bar')
plt.title('Bar plot for records by year')
plt.xlabel("Year")
plt.ylabel("Number of incidents")
plt.show()

In [None]:
df2 = df.copy()
Others = df2.Category.value_counts()[10:].index.tolist()
# Assign the result back instead of relying on an in-place replace through attribute access
df2['Category'] = df2['Category'].replace(Others, 'OTHER OFFENSES')
df2.Category.unique()
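
The grouping step above keeps the ten most frequent categories and folds the rest into a single label. Here is a minimal sketch of the same idea on invented labels, with a hypothetical `top_k` of 2 and a hypothetical catch-all label `"OTHER"`.

```python
import pandas as pd

# Toy labels with frequencies A=5, B=3, C=2, D=1
labels = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"])

top_k = 2  # number of frequent categories to keep (assumption for the example)

# value_counts() sorts by frequency, so everything after top_k is "rare"
rare = labels.value_counts().index[top_k:]

# Keep frequent labels, replace rare ones with a catch-all label
collapsed = labels.where(~labels.isin(rare), "OTHER")
print(collapsed.value_counts())
```

In the notebook the catch-all label is `'OTHER OFFENSES'`, which is itself one of the original categories, so the rare categories are merged into it.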

In [None]:
df_ct=pd.crosstab(df2['year'], df2['Category'], rownames=['year'], colnames=['Category'])
df_ct

In [None]:
ax = df_ct.plot(kind='bar', stacked=True, figsize=(15, 9), colormap='Set2')
plt.legend(title='labels', bbox_to_anchor=(1.0, 1), loc='upper left')
plt.title('Bar chart crime category proportion by year')
ax.set_ylabel('Number of incidents')
ax.set_xlabel('Year')
plt.show()

In [None]:
pylab.rcParams['figure.figsize'] = (18.0, 13.0)

larceny = df[df['Category'] == "LARCENY/THEFT"]
assault = df[df['Category'] == "ASSAULT"]
drug = df[df['Category'] == "DRUG/NARCOTIC"]
vehicle = df[df['Category'] == "VEHICLE THEFT"]
vandalism = df[df['Category'] == "VANDALISM"]
burglary = df[df['Category'] == "BURGLARY"]
robbery = df[df['Category'] == "ROBBERY"]
prostitution = df[df['Category'] == "PROSTITUTION"]
driving_drunk = df[df['Category'] == "DRIVING UNDER THE INFLUENCE"]

with plt.style.context('seaborn-darkgrid'):
    ax1 = plt.subplot2grid((4,3), (0, 0), colspan=3)
    ax1.plot(df.groupby('year').size(), 'ro-')
    ax1.set_title ('All crimes')
    ax1.xaxis.set_ticks(np.arange(2003, 2016, 1))
    
    ax2 = plt.subplot2grid((4,3), (1, 0))
    ax2.plot(larceny.groupby('year').size(), 'o-')
    ax2.set_title ('Larceny/Theft')
    
    ax3 = plt.subplot2grid((4,3), (1, 1))
    ax3.plot(assault.groupby('year').size(), 'o-')
    ax3.set_title ('Assault')
    
    ax4 = plt.subplot2grid((4,3), (1, 2))
    ax4.plot(drug.groupby('year').size(), 'o-')
    ax4.set_title ('Drug/Narcotic')
    
    ax5 = plt.subplot2grid((4,3), (2, 0))
    ax5.plot(vehicle.groupby('year').size(), 'o-')
    ax5.set_title ('Vehicle theft')
    
    ax6 = plt.subplot2grid((4,3), (2, 1))
    ax6.plot(vandalism.groupby('year').size(), 'o-')
    ax6.set_title ('Vandalism')
    
    ax7 = plt.subplot2grid((4,3), (2, 2))
    ax7.plot(burglary.groupby('year').size(), 'o-')
    ax7.set_title ('Burglary')

    ax8 = plt.subplot2grid((4,3), (3, 0))
    ax8.plot(robbery.groupby('year').size(), 'o-')
    ax8.set_title ('Robbery')
    
    ax9 = plt.subplot2grid((4,3), (3, 1))
    ax9.plot(prostitution.groupby('year').size(), 'o-')
    ax9.set_title ('Prostitution')
    
    ax10 = plt.subplot2grid((4,3), (3, 2))
    ax10.plot(driving_drunk.groupby('year').size(), 'o-')
    ax10.set_title ('Driving under the influence')
  
    pylab.gcf().text(0.5, 1.03, 
                    'San Francisco Crime Occurrence by Year',
                     horizontalalignment='center',
                     verticalalignment='top', 
                     fontsize = 28)
    
plt.tight_layout()
plt.show()

### N-gram analysis of Address column

In this step we will analyze the most frequent words, bigrams, and trigrams in the Address column in order to find patterns or common streets that could help us enhance the meaning of our data. If you want more details about the process and the functions used here, I encourage you to see the following notebooks, in which I explain them in more depth:

https://www.kaggle.com/georgesaavedra/text-news-topic-classification

https://www.kaggle.com/georgesaavedra/best-nlp-disaster-tweets-classifier


In [None]:
import nltk
nltk.download('stopwords')  # only the stop-word list is used below; 'all' would download every NLTK resource

We will create a copy in case the data ends up altered or unusable.

In [None]:
df3 = df.copy()

**Removing punctuation:**


In [None]:
import string

def remove_punct(text):
    table=str.maketrans('','', string.punctuation)
    return text.translate(table)

example="I am a #king"
print(remove_punct(example))

In [None]:
df3['Address']=df3['Address'].apply(lambda x : remove_punct(x))

**Removing numbers:**


In [None]:
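# regex=True is required in pandas 2.x, where str.replace treats the pattern literally by default
df3['Address'] = df3['Address'].str.replace(r'\d+', '', regex=True)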
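
The digit removal relies on a regular expression, so it is worth seeing its effect on a couple of sample strings. This is a sketch on made-up addresses in the style of the Address column; the extra `.str.strip()` is an addition for the example, to drop the leading space left behind by the removed number.

```python
import pandas as pd

# Two made-up addresses in the style of the Address column
addresses = pd.Series(["800 Block of BRYANT ST", "OAK ST / LAGUNA ST"])

# \d+ matches one or more consecutive digits; regex=True makes pandas
# interpret the pattern as a regular expression rather than a literal string
no_digits = addresses.str.replace(r"\d+", "", regex=True).str.strip()
print(no_digits.tolist())
```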

In [None]:
nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

In [None]:
from collections import defaultdict,Counter

word_count = Counter(" ".join(df3['Address']).split()).most_common(100)
x=[]
y=[]
for word,count in word_count:
    if (word.casefold() not in stop_words) :
        x.append(word)
        y.append(count)

plt.figure(figsize=(6, 10))
sns.barplot(x=y[:15],y=x[:15])
plt.title('15 most common words in Address column')

Now let's compute the N-grams for each set already mentioned. The following function, generate_ngrams, will help us with the process:

In [None]:
# Define ngram generator function
def generate_ngrams(text, n_gram):
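    # Lowercase, split on spaces, and drop empty tokens and stop words
    tokens = [token for token in text.lower().split(' ') if token != '' and token not in stop_words]
    # Zip n shifted copies of the token list to form consecutive n-grams
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])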
    return [' '.join(ngram) for ngram in ngrams]
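
To see what the function returns, here is a self-contained sketch that repeats its logic with a tiny hand-made stop-word set (an assumption for the example; the notebook uses the NLTK English list) and runs it on one made-up address.

```python
# Tiny stand-in for the NLTK stop-word set (assumption for this example)
stop_words = {"of", "the"}

def generate_ngrams(text, n_gram):
    # Lowercase, split on spaces, drop empty tokens and stop words
    tokens = [t for t in text.lower().split(" ") if t != "" and t not in stop_words]
    # Zip n shifted copies of the token list to form consecutive n-grams
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

bigrams = generate_ngrams("800 Block of BRYANT ST", 2)
trigrams = generate_ngrams("800 Block of BRYANT ST", 3)
print(bigrams)   # ['800 block', 'block bryant', 'bryant st']
print(trigrams)  # ['800 block bryant', 'block bryant st']
```

Because stop words are removed before the n-grams are built, "block" and "bryant" end up adjacent even though "of" sat between them in the original address.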

The number of N-grams to compute:

In [None]:
N=30

We will display the top 30 bigrams for the cleaned and uncleaned Address column; in other words, we will see whether numbers and punctuation matter when finding a common street:

In [None]:
# Bigrams
training_bigrams = defaultdict(int)
training_bigrams2 = defaultdict(int)

for instance in df3['Address']:
    for word in generate_ngrams(instance, n_gram=2):
        training_bigrams[word] += 1

for instance in df['Address']:
    for word in generate_ngrams(instance, n_gram=2):
        training_bigrams2[word] += 1
   
df_training_bigrams = pd.DataFrame(sorted(training_bigrams.items(), key=lambda x: x[1])[::-1])
df_training_bigrams2 = pd.DataFrame(sorted(training_bigrams2.items(), key=lambda x: x[1])[::-1])

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25,30), dpi=80)
#plt.tight_layout()

sns.barplot(y=df_training_bigrams[0].values[:N], x=df_training_bigrams[1].values[:N], ax=ax1, color='r')
ax1.spines['right'].set_visible(False)
ax1.tick_params(axis='x', labelsize=13)
ax1.tick_params(axis='y', labelsize=13)

sns.barplot(y=df_training_bigrams2[0].values[:N], x=df_training_bigrams2[1].values[:N], ax=ax2, color='b')
ax2.spines['right'].set_visible(False)
ax2.tick_params(axis='x', labelsize=13)
ax2.tick_params(axis='y', labelsize=13)

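# Both panels have stop words removed; they differ in whether digits and punctuation were stripped
ax1.set_title(f'Top {N} most common bigrams in cleaned Address column', fontsize=15)
ax2.set_title(f'Top {N} most common bigrams in raw Address column', fontsize=15)

plt.tight_layout()  # adjust the layout before the figure is rendered
plt.show()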

We will do exactly the same process but for trigrams in both cleaned and uncleaned addresses:

In [None]:
# Trigrams
training_trigrams = defaultdict(int)
training_trigrams2 = defaultdict(int)

for instance in df3['Address']:
    for word in generate_ngrams(instance, n_gram=3):
        training_trigrams[word] += 1

for instance in df['Address']:
    for word in generate_ngrams(instance, n_gram=3):
        training_trigrams2[word] += 1
   
df_training_trigrams = pd.DataFrame(sorted(training_trigrams.items(), key=lambda x: x[1])[::-1])
df_training_trigrams2 = pd.DataFrame(sorted(training_trigrams2.items(), key=lambda x: x[1])[::-1])

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25,20), dpi=80)
#plt.tight_layout()

sns.barplot(y=df_training_trigrams[0].values[:N], x=df_training_trigrams[1].values[:N], ax=ax1, color='r')
ax1.spines['right'].set_visible(False)
ax1.tick_params(axis='x', labelsize=13)
ax1.tick_params(axis='y', labelsize=13)

sns.barplot(y=df_training_trigrams2[0].values[:N], x=df_training_trigrams2[1].values[:N], ax=ax2, color='b')
ax2.spines['right'].set_visible(False)
ax2.tick_params(axis='x', labelsize=13)
ax2.tick_params(axis='y', labelsize=13)

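# Both panels have stop words removed; they differ in whether digits and punctuation were stripped
ax1.set_title(f'Top {N} most common trigrams in cleaned Address column', fontsize=15)
ax2.set_title(f'Top {N} most common trigrams in raw Address column', fontsize=15)

plt.tight_layout()  # adjust the layout before the figure is rendered
plt.show()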

Let's bring back the 15 most frequent words obtained at the beginning of the N-gram analysis. Some of them, such as St or Block, are extremely common; they may still help, so I prefer to keep all 15. The method is similar to one-hot encoding: if the address contains the key word, we place a 1 in that key word's column.

In [None]:
key_addresses = x[:15]
key_addresses

In [None]:
df.sample(3)

Now it is time to narrow down the features that we will use in the feature engineering process. As I said earlier, Resolution and Descript will be ignored, and Dates can be dropped now that its components have been extracted.

## Feature Engineering

In [None]:
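# .copy() gives df4 its own data, so the assignments below do not trigger SettingWithCopyWarning
df4 = df[['Category','DayOfWeek','PdDistrict','Address','X','Y','year','month','day','hour']].copy()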
df4.sample(3)

First, even though we will use a tree-based algorithm to classify the instances, which does not require feature scaling, we will scale our features to keep them on a common, easy-to-follow range.

For this we min-max scale the longitude and latitude coordinates so that they become numbers between 0 and 1:

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()

In [None]:
df4.loc[:,['X','Y']] = mm.fit_transform(df4.loc[:,['X','Y']])
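Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The short sketch below uses made-up coordinates (not rows from the dataset) to show the difference between `fit_transform` on training data and `transform` on new data, which reuses the training range instead of learning a new one.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy coordinates, invented for illustration only
train_xy = pd.DataFrame({'X': [-122.50, -122.40, -122.45],
                         'Y': [37.70, 37.80, 37.75]})
new_xy = pd.DataFrame({'X': [-122.45], 'Y': [37.72]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_xy)  # learns min and max per column
new_scaled = scaler.transform(new_xy)          # reuses the training min and max
```

Keeping the fitted scaler around matters later, because any new file has to be scaled with the same range the model saw during training.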

We will one-hot-encode the date components, since they make more sense as nominal categorical variables than as ordinal ones: crime counts do not show a linear relation with the hour or the month. Each crime type has its own non-linear behaviour over time, so one-hot columns let the model pick up patterns such as the most frequent hour for a certain crime (prostitution), the month with the most thefts (robbery), or the day with the most drug-dealing or driving-under-the-influence incidents.

In [None]:
nominal_variables = ['year','month','day','hour','DayOfWeek','PdDistrict']
df4 = pd.get_dummies(df4, columns = nominal_variables, drop_first=True) 

And finally, here is the reason why we performed the N-gram analysis: in the following cells we create one column for each key word contained in the address:

In [None]:
key_addresses

In [None]:
df4['Block'] = df4['Address'].str.contains('block|bl', case=False)
df4['AV'] = df4['Address'].str.contains('av', case=False)
df4['Mission'] = df4['Address'].str.contains('mission', case=False)
df4['Market'] = df4['Address'].str.contains('market', case=False)
df4['Bryant'] = df4['Address'].str.contains('bryant', case=False)
df4['RD'] = df4['Address'].str.contains('rd', case=False)
df4['Geary'] = df4['Address'].str.contains('geary', case=False)
df4['Turk'] = df4['Address'].str.contains('turk', case=False)
df4['Eddy'] = df4['Address'].str.contains('eddy', case=False)
df4['DR'] = df4['Address'].str.contains('dr', case=False)
df4['Ellis'] = df4['Address'].str.contains('ellis', case=False)
df4['Ofarrell'] = df4['Address'].str.contains('ofarrell', case=False)

In [None]:
# Convert the boolean key-word flags to 0/1 integers
address_flags = ['Block','AV','Mission','Market','Bryant','RD','Geary','Turk','Eddy','DR','Ellis','Ofarrell']
df4[address_flags] = df4[address_flags].astype(int)
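The twelve assignments above can also be driven from a single mapping. The sketch below is one possible way to do it, not the notebook's original code: the helper `add_keyword_flags` and the `KEYWORD_PATTERNS` dict are hypothetical names, and the `\b` word boundaries are my own assumption, added so that short patterns such as `av` or `rd` match only whole words (for example, `rd` would no longer match inside `3RD`). This changes the resulting values compared to the plain substring matching used above, so treat it as an option to evaluate.

```python
import pandas as pd

# Hypothetical mapping: output column name -> key word(s) searched in Address
KEYWORD_PATTERNS = {
    'Block': 'block|bl', 'AV': 'av', 'Mission': 'mission', 'Market': 'market',
    'Bryant': 'bryant', 'RD': 'rd', 'Geary': 'geary', 'Turk': 'turk',
    'Eddy': 'eddy', 'DR': 'dr', 'Ellis': 'ellis', 'Ofarrell': 'ofarrell',
}

def add_keyword_flags(frame, column='Address', patterns=KEYWORD_PATTERNS):
    """Return a copy of `frame` with one 0/1 column per key-word pattern."""
    out = frame.copy()
    for name, keywords in patterns.items():
        # \b ... \b limits the match to whole words (assumption, see lead-in)
        regex = rf'\b(?:{keywords})\b'
        out[name] = out[column].str.contains(regex, case=False, regex=True).astype(int)
    return out

# Toy addresses in the style of the dataset, invented for illustration
demo = pd.DataFrame({'Address': ['800 Block of BRYANT ST',
                                 'TURK ST / JONES ST',
                                 '3RD ST / MARKET ST']})
flags = add_keyword_flags(demo)
```

Because the same function can be called on the training and the testing frame, the key-word list only has to be maintained in one place.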

In [None]:
df4.drop(columns=['Address'], inplace=True)

At this point the data should contain 106 columns, counting both features and label:

In [None]:
df4.sample(3)

As a result, we should have a dataset in which every feature takes values between 0 and 1:

In [None]:
df4.describe().T

We can confirm that all our variables have a maximum of 1.0 and a minimum of 0.0 by counting the distinct values returned by max() and min() as follows:

In [None]:
df4.drop(columns=['Category']).max().value_counts()

In [None]:
df4.drop(columns=['Category']).min().value_counts()

## Modeling

As a first step, we have to create the feature set (105 columns) and the label set (the Category column):

In [None]:
features = df4.drop(columns=['Category'])
label = df4['Category']

In [None]:
features.shape, label.shape

Another important aspect of our data is the extreme imbalance in the label: the most frequent crime, LARCENY/THEFT, occurred 174305 times, while the least frequent, TREA, has only 6 records. This is crucial, because if we don't apply a balancing technique, no matter how powerful the algorithm is, the predictions will be biased towards the most frequent class, making our model of little use.

There are well-known balancing techniques in ML, with SMOTE being one of the most widely used. The basic options are undersampling, which reduces every class to the size of the least frequent one (hardly ever useful), and oversampling, which grows every class to the size of the most frequent one. Oversampling has a big disadvantage: it is appropriate only when the difference in instances is under some threshold, and in our current dataset the difference is extremely large.

The most suitable solution would be to reduce the number of categories to 15 or 20 at most by grouping the least frequent ones into a single category, which narrows the gap in the number of incidents. Once we have that, we could apply SMOTE and expect better performance from our model. This is a pending step, because the submission requires the probability that each instance belongs to one of the 39 categories.
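The grouping idea described above is not implemented in this notebook, but a minimal sketch could look like the following. The function name `group_rare_categories`, the `top_n` default, and the placeholder label `GROUPED_RARE` are all hypothetical choices of mine, and the toy labels are invented. Keep in mind that a model trained on grouped labels would no longer output one probability per original category.

```python
import pandas as pd

def group_rare_categories(labels, top_n=15, other_name='GROUPED_RARE'):
    """Keep the `top_n` most frequent labels and merge the rest into one."""
    counts = labels.value_counts()      # sorted from most to least frequent
    keep = counts.index[:top_n]
    # where() keeps labels that pass the test and replaces the others
    return labels.where(labels.isin(keep), other_name)

# Toy label column, invented for illustration
toy = pd.Series(['LARCENY/THEFT'] * 5 + ['ASSAULT'] * 3 + ['TREA'] + ['GAMBLING'])
grouped = group_rare_categories(toy, top_n=2)
```

In the notebook this would presumably be applied to `label` before resampling, with `top_n` somewhere in the 15 to 20 range mentioned above.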

In [None]:
label.value_counts()

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(features, label)

In [None]:
X_res.shape

In [None]:
y_res.value_counts()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, label_train, label_val = train_test_split(X_res, y_res, test_size=0.10, random_state=42)

In [None]:
X_train.shape, X_val.shape, label_train.shape, label_val.shape
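One thing worth considering: because the cells above resample before splitting, synthetic rows in the validation set may be interpolated from rows that ended up in the training set, which can make validation scores look better than they are. A possible alternative, sketched below, is to split first and resample only the training part. The helper `split_then_resample` is a hypothetical name, and `NaiveOversampler` is a stand-in I wrote only so the sketch runs without imbalanced-learn; it duplicates rows and is not SMOTE.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

class NaiveOversampler:
    """Stand-in exposing the same fit_resample interface as imblearn's SMOTE."""
    def fit_resample(self, X, y):
        target = y.value_counts().max()
        # Draw `target` rows per class, with replacement
        picked = y.groupby(y).sample(n=target, replace=True, random_state=0).index
        return X.loc[picked].reset_index(drop=True), y.loc[picked].reset_index(drop=True)

def split_then_resample(features, label, resampler, test_size=0.10, random_state=42):
    """Hold out a validation set first, then rebalance only the training part."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        features, label, test_size=test_size,
        random_state=random_state, stratify=label)
    # Validation keeps real rows only; resampling touches training rows only
    X_tr_res, y_tr_res = resampler.fit_resample(X_tr, y_tr)
    return X_tr_res, X_va, y_tr_res, y_va

# Toy imbalanced data, invented for illustration
X_demo = pd.DataFrame({'a': range(40)})
y_demo = pd.Series([0] * 30 + [1] * 10)
X_tr, X_va, y_tr, y_va = split_then_resample(X_demo, y_demo, NaiveOversampler())
```

With the notebook's objects the call would presumably be `split_then_resample(features, label, SMOTE(random_state=42))`. Since the rarest class has only 6 records, fewer of them remain after the split, so SMOTE's `k_neighbors` would likely need to be lowered for this to run.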

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import label_binarize

In [None]:
!pip install catboost

**Model training:**

In [None]:
from catboost import CatBoostClassifier

cat_model_class = CatBoostClassifier(iterations=300,
                                     learning_rate=0.7,
                                     random_seed=42,
                                     depth=3)

cat_model_class.fit(X_train, label_train, 
                    cat_features=None, 
                    eval_set=(X_val, label_val), 
                    verbose=False)

In [None]:
label_pred_cat=cat_model_class.predict(X_val)
print(classification_report(label_val,label_pred_cat))
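The classification report scores hard class predictions. Since the submission is expressed as per-category probabilities, a probability-based score such as multi-class log loss may be a useful complement. Here is a minimal sketch with `sklearn.metrics.log_loss`; the class names and probabilities are toy values I made up.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 classes, 3 instances (values invented for illustration)
classes = ['ASSAULT', 'LARCENY/THEFT', 'VANDALISM']
y_true = ['ASSAULT', 'LARCENY/THEFT', 'LARCENY/THEFT']
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])

# Columns of `proba` must follow the sorted order of `classes`
score = log_loss(y_true, proba, labels=classes)
print(f'log loss: {score:.4f}')
```

In this notebook the inputs would presumably be `label_val`, `cat_model_class.predict_proba(X_val)` and `labels=cat_model_class.classes_`. Lower values are better.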

To predict the categories of the testing set properly, we have to apply the same processing steps we performed on the training data, so that the model receives the same features and distributions:

In [None]:
def test_file_processing(df_t):
  # Work on a copy so the original test frame is not modified on repeated runs
  df_t = df_t.copy()
  # Parse the timestamp once, then derive the same date parts used for training
  df_t['Dates'] = pd.to_datetime(df_t['Dates'])
  df_t['year'] = df_t['Dates'].dt.year
  df_t['month'] = df_t['Dates'].dt.month
  df_t['day'] = df_t['Dates'].dt.day
  df_t['hour'] = df_t['Dates'].dt.hour
  # Scale the coordinates with the scaler already fitted on the training data (no refit)
  df_t[['X','Y']] = mm.transform(df_t[['X','Y']])
  nominal_variables = ['year', 'month', 'day', 'hour', 'DayOfWeek', 'PdDistrict']
  df_t = pd.get_dummies(df_t, columns = nominal_variables, drop_first=True)
  # Key-word indicator columns, same patterns as in training
  df_t['Block'] = df_t['Address'].str.contains('block|bl', case=False)
  df_t['AV'] = df_t['Address'].str.contains('av', case=False)
  df_t['Mission'] = df_t['Address'].str.contains('mission', case=False)
  df_t['Market'] = df_t['Address'].str.contains('market', case=False)
  df_t['Bryant'] = df_t['Address'].str.contains('bryant', case=False)
  df_t['RD'] = df_t['Address'].str.contains('rd', case=False)
  df_t['Geary'] = df_t['Address'].str.contains('geary', case=False)
  df_t['Turk'] = df_t['Address'].str.contains('turk', case=False)
  df_t['Eddy'] = df_t['Address'].str.contains('eddy', case=False)
  df_t['DR'] = df_t['Address'].str.contains('dr', case=False)
  df_t['Ellis'] = df_t['Address'].str.contains('ellis', case=False)
  df_t['Ofarrell'] = df_t['Address'].str.contains('ofarrell', case=False)
  df_t.drop(columns=['Dates','Address','Id'], inplace=True)
  df_t[['Block','AV','Mission','Market','Bryant','RD','Geary','Turk','Eddy','DR','Ellis','Ofarrell']] = df_t[['Block','AV','Mission','Market','Bryant','RD','Geary','Turk','Eddy','DR','Ellis','Ofarrell']].astype(int)

  return df_t

In [None]:
test_file_processed = test_file_processing(df_test)

In [None]:
test_file_processed.shape
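`pd.get_dummies` builds columns from the values it finds, so a testing file can end up with a different set or order of columns than the training features if some value is missing or new. A defensive option, sketched below with a hypothetical helper `align_to_training` and invented toy columns, is to reindex the test frame against the training columns before predicting.

```python
import pandas as pd

def align_to_training(test_frame, training_columns):
    """Order test columns like training; add missing ones as 0, drop extras."""
    return test_frame.reindex(columns=training_columns, fill_value=0)

# Toy column names, invented for illustration
train_cols = ['X', 'Y', 'year_2004', 'year_2005', 'Block']
toy_test = pd.DataFrame({'Block': [1], 'X': [0.3], 'Y': [0.6],
                         'year_2005': [1], 'year_2016': [1]})
aligned = align_to_training(toy_test, train_cols)
```

In the notebook the call would presumably be `align_to_training(test_file_processed, features.columns)`. If the shapes already match, this leaves the data unchanged.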

In [None]:
label_test_pred_prob = cat_model_class.predict_proba(test_file_processed)

In [None]:
label_test_pred_class = cat_model_class.predict(test_file_processed)

The following line prints the predicted class for each instance in the testing file:

In [None]:
label_test_pred_class
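Since the submission is based on a probability per category, the probabilities computed earlier can be arranged into a table. The sketch below is only an assumption about the layout (an `Id` column followed by one column per category); the helper `build_submission` is a hypothetical name and the toy values are invented.

```python
import numpy as np
import pandas as pd

def build_submission(ids, proba, class_names):
    """Assemble one row per instance: Id followed by one probability per class."""
    sub = pd.DataFrame(proba, columns=list(class_names))
    sub.insert(0, 'Id', list(ids))
    return sub

# Toy probabilities for two instances and two classes
toy_sub = build_submission([0, 1],
                           np.array([[0.9, 0.1], [0.2, 0.8]]),
                           ['ASSAULT', 'LARCENY/THEFT'])
```

With the notebook's objects this would presumably be `build_submission(df_test['Id'], label_test_pred_prob, cat_model_class.classes_)`, followed by `to_csv(..., index=False)` on the result.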

I would love to hear any feedback that could improve the analysis and, of course, the building of the model. Also tell me if you found a different method that gave you an outstanding performance!

If you liked this notebook and want to see more projects/tutorials like this one, I would really appreciate your upvote. I also encourage you to check out my projects portfolio, I'm sure you will love it.

Thank you!