# San Francisco Crime Classification

From 2010 to 2015, the San Francisco Police Department recorded crime reports from across all of San Francisco's neighborhoods.

Your task is **to predict if the crime was theft or not!**

![](https://digital.ihg.com/is/image/ihg/holiday-inn-san-francisco-6019097353-2x1)

The following information has been recorded in our data: 
* `Dates` - timestamp of the crime incident
* `Category` - category of the crime incident. 
* `Descript` - detailed description of the crime incident
* `PdDistrict` - name of the Police Department District
* `Resolution` - how the crime incident was resolved
* `Address` - the approximate street address of the crime incident
* `X` - Longitude
* `Y` - Latitude
* `Theft` - a flag that states whether the crime was a theft or larceny. **This is the target variable you are going to predict.**

Our target is **Theft**. This column was created from the `Category` column, which contains more detailed categories of crime.

This dataset and the corresponding challenge were part of a competition hosted by Kaggle, and many people have attempted to solve this problem before you. Take a look at what people have tried at the [Kaggle site](https://www.kaggle.com/c/sf-crime/notebooks).

## Import the libraries

Use the next cell to import the libraries you need; pandas is already imported for you. If you need more libraries further down the notebook, come back and add them to this cell.

In [1]:
import pandas as pd

## Steps to follow

We will work through the following project steps:

- Import the data
- Exploratory Data Analysis
- In-depth Data Analysis
- Feature Engineering
- Preparing the Data for scikit-learn
- Use a transformer on categorical features
- Build a pipeline
- Using grid search on model parameters
- Analysing model predictive power
- Further improvements
- Conclusion/suggestions for future

## Import the data

First import the data from `sf_crime_hackathon.csv` in the `data` folder.

In [3]:
crime = pd.read_csv('data/sf_crime_hackathon.csv')
crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Theft
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,0
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,0
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,0
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,1
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,1


## Data exploration

### Part 1: Exploratory Analysis
Have a quick look at the data using these methods:

- `df.head()`, `df.tail()`, `df.sample(5)`
- `df.shape`
- `df.describe()`
- `df.info()`
- `df.dtypes`

In [11]:
crime.head()
crime.sample(5)
crime.dtypes

Dates          object
Category       object
Descript       object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
Theft           int64
dtype: object

You may want to clean the data a bit before moving on.

Re-read the data in and add parameters to `.read_csv()` and chain methods to:
- parse the `Dates` column as datetimes instead of strings;
- rename the columns to be all lower case (or upper case if you prefer);
- rename the column `dates` to `date`.

In [30]:
crime = pd.read_csv('data/sf_crime_hackathon.csv', parse_dates=['Dates'])
crime.columns = crime.columns.str.lower()
crime.rename(columns={'dates':'date'}, inplace=True)
crime.head()

Unnamed: 0,date,category,descript,pddistrict,resolution,address,x,y,theft
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,0
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,0
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,0
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,1
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,1


### Part 2: In-depth Analysis (with optional visualisations)

Answer the following questions. You can also visualise your results using `.plot()` if you want. 

1. Have a look at the different columns in the data. Which ones do you have access to now, but not once the model is actually being used? What should you do when training the model?

2. How many missing values are there in the data?

In [32]:
crime.isnull().sum()

date          0
category      0
descript      0
pddistrict    0
resolution    0
address       0
x             0
y             0
theft         0
dtype: int64

3. How many unique values for address are there?

In [37]:
crime['address'].nunique()

19538

4. What are the top 10 largest categories?

In [60]:
crime['category'].value_counts().head(10)

category
LARCENY/THEFT     85501
OTHER OFFENSES    53370
NON-CRIMINAL      46766
ASSAULT           33303
VANDALISM         19858
DRUG/NARCOTIC     18108
WARRANTS          17731
VEHICLE THEFT     15793
SUSPICIOUS OCC    15382
BURGLARY          15307
Name: count, dtype: int64

5. Which month has the most records?

In [66]:
crime['date'].dt.month_name().value_counts().head(1)

date
April    35404
Name: count, dtype: int64

6. Which district has the most crime? Which has the least?

In [70]:
display(crime['pddistrict'].value_counts().head(1))
display(crime['pddistrict'].value_counts().tail(1))

pddistrict
SOUTHERN    72580
Name: count, dtype: int64

pddistrict
RICHMOND    20692
Name: count, dtype: int64

7. What are the 10 most and least frequently occurring crimes?

In [73]:
display(crime['category'].value_counts().head(1))
display(crime['category'].value_counts().tail(1))

category
LARCENY/THEFT    85501
Name: count, dtype: int64

category
TREA    6
Name: count, dtype: int64

# Feature Engineering

Now you're going to create some new features for our model. Let's start by creating new features from our `date` feature. If you converted this column to a Timestamp, you can easily extract several [date properties](https://pandas.pydata.org/docs/reference/arrays.html#properties)!


Create the following columns and drop the original `date` column:
- `n_days`: Number of days since the first date in the dataset
- `day`: The day of the year
- `weekday`: The day of the week
- `month`: The month of the year
- `year`: The year
- `hour`: The hour of the day
- `minute`: The minute of the hour


In [118]:
crime['n_days'] = (crime['date'] - crime['date'].min()).dt.days  # whole days since the earliest record
crime['day'] = crime['date'].dt.day_of_year
crime['weekday'] = crime['date'].dt.weekday  # Monday=0 ... Sunday=6
crime['month'] = crime['date'].dt.month
crime['year'] = crime['date'].dt.year  # calendar year
crime['hour'] = crime['date'].dt.hour
crime['minute'] = crime['date'].dt.minute

# Drop the original timestamp now that its parts have been extracted
crime = crime.drop(columns='date')

Make sure the features have been created correctly. What is the max and min of the new features? Is that what you expect?

In [116]:
display(crime[['n_days','day','weekday','month','year','hour','minute']].max(axis=0))
display(crime[['n_days','day','weekday','month','year','hour','minute']].min(axis=0))

n_days     1948
day         365
weekday       6
month        12
year       2015
hour         23
minute       59
dtype: int64

n_days        0
day           1
weekday       0
month         1
year       2010
hour          0
minute        0
dtype: int64

Now let's make some extra features. The `address` column will be an issue since it has over 19,000 unique values. You probably also want to drop such sensitive data. Instead let's create:
- `is_block`: `True` if the `address` feature contains the word `Block`, otherwise `False`
- `x_minus_y`: The difference between `x` and `y`
- `x_plus_y`: The sum of `x` and `y`

In [126]:
crime['is_block'] = crime['address'].str.contains('Block', case=False, na=False)
crime[['address','is_block']]


Unnamed: 0,address,is_block
0,OAK ST / LAGUNA ST,False
1,OAK ST / LAGUNA ST,False
2,VANNESS AV / GREENWICH ST,False
3,1500 Block of LOMBARD ST,True
4,100 Block of BRODERICK ST,True
...,...,...
382843,700 Block of HARRISON ST,True
382844,600 Block of CAPP ST,True
382845,300 Block of WILDE AV,True
382846,800 Block of BRYANT ST,True


In [125]:
crime['x_minus_y'] = crime['x'] - crime['y']
crime[['x','y','x_minus_y']]

Unnamed: 0,x,y,x_minus_y
0,-122.425892,37.774599,-160.200490
1,-122.425892,37.774599,-160.200490
2,-122.424363,37.800414,-160.224777
3,-122.426995,37.800873,-160.227868
4,-122.438738,37.771541,-160.210279
...,...,...,...
382843,-122.397815,37.782137,-160.179952
382844,-122.417665,37.756291,-160.173956
382845,-122.403755,37.716710,-160.120466
382846,-122.403405,37.775421,-160.178825


In [127]:
crime['x_plus_y'] = crime['x'] + crime['y']
crime[['x','y','x_plus_y']]

Unnamed: 0,x,y,x_plus_y
0,-122.425892,37.774599,-84.651293
1,-122.425892,37.774599,-84.651293
2,-122.424363,37.800414,-84.623949
3,-122.426995,37.800873,-84.626123
4,-122.438738,37.771541,-84.667196
...,...,...,...
382843,-122.397815,37.782137,-84.615677
382844,-122.417665,37.756291,-84.661374
382845,-122.403755,37.716710,-84.687045
382846,-122.403405,37.775421,-84.627984


## Bonus

Add at least 2 extra features.

Potential features:

- Any special date fields
- Features based on past observations (e.g. crime in the same area during the past year)
- Converting categorical features (`pddistrict`) into numeric using target encoding
- Any other combination of features
- External features (from other datasets)
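As a rough illustration of two of these ideas, the sketch below adds a weekend flag and a simple target encoding of `pddistrict`. The toy rows and the column names `is_weekend` and `pddistrict_theft_rate` are assumptions made for this example, not part of the dataset. Date-based features have to be created while the `date` column is still available.

```python
import pandas as pd

# Toy stand-in for the `crime` DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-05-09 23:30", "2015-05-10 08:00", "2015-05-11 12:15",
        "2015-05-12 18:45", "2015-05-13 23:53", "2015-05-16 02:10",
    ]),
    "pddistrict": ["NORTHERN", "PARK", "NORTHERN", "PARK", "NORTHERN", "PARK"],
    "theft": [1, 0, 1, 0, 0, 1],
})

# Special date field: Saturday (5) and Sunday (6) count as the weekend.
toy["is_weekend"] = toy["date"].dt.weekday >= 5

# Target encoding: replace each district with its mean theft rate.
district_rate = toy.groupby("pddistrict")["theft"].mean()
toy["pddistrict_theft_rate"] = toy["pddistrict"].map(district_rate)

print(toy[["pddistrict", "is_weekend", "pddistrict_theft_rate"]])
```

In a real project the district rates should be computed on the training split only and then mapped onto the test split, otherwise the target leaks into the features.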

# Prepare the data for scikit-learn

First you will need to separate the features and target.

Note that two features will need to be dropped:
- `descript` - detailed description of the crime incident, therefore a more detailed version of the **target feature**
- `resolution` - how the crime incident was resolved. This is created after the target was defined, and cannot be used to predict a new crime.

Create X and y. 

- X should have the following features (plus whatever bespoke ones you've created): 
```python
'pddistrict', 'x', 'y', 'n_days', 'day', 'weekday', 'month', 'year', 'hour', 'minute', 'x_minus_y', 'x_plus_y', 'is_block'
```
- y should only have one feature: `theft`
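One possible way to do the separation is sketched below on a toy frame with made-up rows and only a few of the engineered columns. Dropping `category` and `address` as well is an assumption of this sketch: `category` is the column the target was derived from, and `address` is not in the feature list above.

```python
import pandas as pd

# Toy stand-in for `crime` after feature engineering (made-up values).
toy = pd.DataFrame({
    "category": ["LARCENY/THEFT", "WARRANTS", "ASSAULT"],
    "descript": ["GRAND THEFT FROM LOCKED AUTO", "WARRANT ARREST", "BATTERY"],
    "resolution": ["NONE", "ARREST, BOOKED", "NONE"],
    "address": ["1500 Block of LOMBARD ST", "OAK ST / LAGUNA ST", "800 Block of BRYANT ST"],
    "pddistrict": ["NORTHERN", "NORTHERN", "SOUTHERN"],
    "x": [-122.426995, -122.425892, -122.403405],
    "y": [37.800873, 37.774599, 37.775421],
    "hour": [23, 23, 0],
    "theft": [1, 0, 0],
})

# Columns that leak the target or are not wanted as features.
columns_to_drop = ["category", "descript", "resolution", "address"]

X = toy.drop(columns=columns_to_drop + ["theft"])  # features only
y = toy["theft"]                                   # target only

print(X.columns.tolist())
print(y.tolist())
```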

Now let's look at which features are categorical (`object` data types):
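A small sketch of one way to list them, using a toy `X` with made-up rows. It selects every column that is neither numeric nor boolean, which should catch text columns whether your pandas version stores them as `object` or as a dedicated string dtype.

```python
import pandas as pd

# Toy feature matrix (made-up values) standing in for X.
X = pd.DataFrame({
    "pddistrict": ["NORTHERN", "PARK", "SOUTHERN"],
    "x": [-122.425892, -122.438738, -122.397815],
    "y": [37.774599, 37.771541, 37.782137],
    "hour": [23, 23, 0],
    "is_block": [False, True, True],
})

print(X.dtypes)

# Everything that is neither numeric nor boolean will need encoding later.
categorical_cols = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(categorical_cols)
```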

Looks like we have some non-numeric data that you will need to change later! (`pddistrict`, and `address` if you have not dropped it yet)

First let's split into train and test:
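A minimal sketch of the split with scikit-learn's `train_test_split`, run on randomly generated toy data. The 80/20 ratio, the `random_state` value and the use of `stratify` are choices made for this example rather than requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random toy data standing in for X and y.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "x": rng.uniform(-122.51, -122.36, size=n),
    "y": rng.uniform(37.70, 37.82, size=n),
    "hour": rng.integers(0, 24, size=n),
})
y = pd.Series(rng.integers(0, 2, size=n), name="theft")

# Hold out 20% for testing; stratify keeps the theft ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```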

# Build a baseline model

Choose a model and fit it on the raw data to get a baseline. Drop all rows with missing values.

How does your model perform? How do other algorithms perform? Why are some better than others?
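The sketch below shows one way to set up such a comparison on toy data: a `DummyClassifier` that always predicts the majority class, next to a shallow decision tree on numeric columns only. The data is random, and the rule used to generate `theft` is artificial so that there is some signal to find; neither reflects the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random toy data with an artificial rule for the target.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "x": rng.uniform(-122.51, -122.36, size=n),
    "y": rng.uniform(37.70, 37.82, size=n),
    "hour": rng.integers(0, 24, size=n),
})
toy["theft"] = (toy["hour"] >= 12).astype(int)
toy = toy.dropna()  # drop rows with missing values (none in this toy data)

X = toy[["x", "y", "hour"]]
y = toy["theft"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reference point: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# A simple first model on the numeric features.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

dummy_acc = dummy.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print(f"dummy accuracy: {dummy_acc:.3f}")
print(f"tree accuracy:  {tree_acc:.3f}")
```

Any model you build later should beat the dummy score, otherwise it has not learned anything useful.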

# Encoding

You can use one-hot encoding to convert the categorical features, or just drop them for now while building a baseline model.


### Add ColumnTransformer

Let's include `ColumnTransformer()` so that we can apply the `OneHotEncoder` to only the categorical columns.

# Create a pipeline

Now that we know we have to transform our data using a sklearn transformer, it would be good to package this up into a pipeline.

Let's build a RandomForest to begin with.
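A minimal sketch of such a pipeline on randomly generated toy data. The step names `"preprocess"` and `"model"` and the forest settings are choices made for this example; with random labels the score itself is meaningless, so only the wiring matters.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Random toy data standing in for X and y.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "pddistrict": rng.choice(["NORTHERN", "PARK", "SOUTHERN"], size=n),
    "x": rng.uniform(-122.51, -122.36, size=n),
    "y": rng.uniform(37.70, 37.82, size=n),
    "hour": rng.integers(0, 24, size=n),
})
y = pd.Series(rng.integers(0, 2, size=n), name="theft")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["pddistrict"])],
    remainder="passthrough",
)

# Encoding and model are fitted together, so the encoder only ever sees training data.
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
```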

# Gridsearch

Now let's see if you can tune the model's hyperparameters!
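The sketch below runs `GridSearchCV` over a pipeline on random toy data. The grid values, `cv=3` and `scoring="f1"` are arbitrary picks for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, so the step name `"model"` used here is an assumption that has to match your own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Random toy data standing in for the training set.
rng = np.random.default_rng(0)
n = 150
X = pd.DataFrame({
    "pddistrict": rng.choice(["NORTHERN", "PARK", "SOUTHERN"], size=n),
    "hour": rng.integers(0, 24, size=n),
    "minute": rng.integers(0, 60, size=n),
})
y = pd.Series(rng.integers(0, 2, size=n), name="theft")

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["pddistrict"])],
        remainder="passthrough",
    )),
    ("model", RandomForestClassifier(random_state=0)),
])

# Keys follow the "<step>__<parameter>" convention.
param_grid = {
    "model__n_estimators": [10, 30],
    "model__max_depth": [3, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the data passed to `fit` with the best parameter combination.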

# Metrics

So far you've looked at Accuracy Score, F1-Score, Recall and Precision. Can you create an output that reports these metrics?
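One way to report them is sketched below with scikit-learn's metric functions. The labels and predictions are hand-made so that the numbers can be checked by counting: 3 true positives, 1 false positive, 1 false negative and 3 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (3 + 3) / 8
    "precision": precision_score(y_true, y_pred),  # 3 / (3 + 1)
    "recall": recall_score(y_true, y_pred),        # 3 / (3 + 1)
    "f1": f1_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")

# Per-class breakdown of the same metrics in a single table.
print(classification_report(y_true, y_pred, target_names=["not theft", "theft"]))
```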

# Feature Selection

Now that you have your model, can you look at the feature importance to reduce the features in your model and still maintain the score?

To access the feature importance you must:

1. access the model from the pipeline: `pipeline['model']`
2. access the feature importances attribute using `.feature_importances_`

You can then plot this against the feature names (`.get_feature_names_out()`).

## Rebuild the model using fewer features
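A rough sketch of one approach: keep only the most important features, retrain, and compare the scores. The toy data is random, the target follows an artificial rule based on `hour` so that one feature clearly matters, and keeping the top 2 features is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random toy data; only `hour` carries signal by construction.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "x": rng.uniform(-122.51, -122.36, size=n),
    "y": rng.uniform(37.70, 37.82, size=n),
    "hour": rng.integers(0, 24, size=n),
    "minute": rng.integers(0, 60, size=n),
})
y = (X["hour"] >= 12).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on all features and rank them by importance.
full_model = RandomForestClassifier(n_estimators=50, random_state=0)
full_model.fit(X_train, y_train)
importances = pd.Series(full_model.feature_importances_, index=X.columns)
top_features = importances.nlargest(2).index.tolist()

# Refit using only the top-ranked features.
small_model = RandomForestClassifier(n_estimators=50, random_state=0)
small_model.fit(X_train[top_features], y_train)

full_acc = full_model.score(X_test, y_test)
small_acc = small_model.score(X_test[top_features], y_test)
print(top_features)
print(f"all features: {full_acc:.3f}, top 2 features: {small_acc:.3f}")
```

If the reduced model scores about the same as the full one, the dropped features were adding little and the simpler model is usually preferable.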