## San Francisco Crime Classification
### Predict the category of crimes that occurred in the city by the bay

In [None]:
from IPython.display import Image 

In [None]:
Image(filename="../input/sf-picture/sf1.jpg")

Image: https://unsplash.com/@mvdheuvel
  

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

In [None]:
#
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import os
from matplotlib import rcParams

%config InlineBackend.figure_format = 'retina'
sns.set_style("white")
rcParams['figure.figsize'] = (8,4)
import matplotlib.ticker as ticker
from IPython.display import Image 

from sklearn.preprocessing import RobustScaler # there are outliers
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss

In [None]:

# read X and Y as np.float32 to reduce memory usage
df = pd.read_csv("../input/sf-crime/train.csv.zip",dtype={"X":np.float32,"Y":np.float32})

In [None]:
df.head(3)

In [None]:
# remove duplicates
print(df.duplicated(keep=False).value_counts())
df = df.drop_duplicates()

Columns X and Y appear to contain outliers (Y = 90.0000): these coordinates cannot belong to San Francisco, so we will remove those rows.

In [None]:
df[["X","Y"]].describe()
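
As a minimal sketch of the filter this check leads to, the toy frame below uses made-up coordinates (not rows from the dataset) where one point has a latitude of 90; the names `toy` and `toy_clean` are only for illustration.

```python
import pandas as pd

# Toy coordinates (hypothetical values): the last row mimics the
# out-of-place point with Y = 90 described above.
toy = pd.DataFrame({
    "X": [-122.42, -122.40, -120.50],
    "Y": [37.77, 37.80, 90.00],
})

# Keep only the rows whose latitude is below 90.
toy_clean = toy.loc[toy["Y"] < 90.0].copy()
n_removed = len(toy) - len(toy_clean)
print(f"removed {n_removed} row(s); max Y is now {toy_clean['Y'].max()}")
```

Running `describe()` again after such a filter is a quick way to confirm the maximum of Y is back in a plausible range.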

Now let's see some information about the non-numeric columns.
The most frequent category is LARCENY/THEFT; Fridays (the busiest day) seem to be quite entertaining, and in SOUTHERN (the busiest district) I don't think they get bored.

In [None]:
df.describe(include="object")

We are going to extract information from the Dates and Address columns. From Dates we can derive columns such as Year, Month, Day, Hour, Minute, day of week and Weekend; in Address, each case is located either on a street (ST) or on a block (Block), so we can create a column called "Block".

We will also use LabelEncoder to transform the "Category" and "PdDistrict" columns.

In [None]:
def convert_dataframe(df):
    """
    Create the time columns and the Block column, then drop Date and Address.
    Integer columns are stored as np.int32 to reduce memory usage.
    Outliers are removed later, outside this function.
    """
    
    # time columns
    # infer_datetime_format is deprecated in recent pandas; the default parser handles this format
    df["Dates"] = pd.to_datetime(df["Dates"])
    df['Date'] = df['Dates'].dt.date
    df["Year"] = df["Dates"].dt.year.astype(np.int32)
    df["Month"] = df["Dates"].dt.month.astype(np.int32)
    df["Day"] = df["Dates"].dt.day.astype(np.int32)
    df["Hour"] = df["Dates"].dt.hour.astype(np.int32)
    df["Minute"] = df["Dates"].dt.minute.astype(np.int32)
    df["Day_week_numeric"] = df["Dates"].dt.dayofweek.astype(np.int32)
    df["Weekend"]= np.where((df["Day_week_numeric"] >= 4) & (df["Day_week_numeric"] <=6),1,0)
    df["count_days"] = (df['Date'] - df['Date'].min()).apply(lambda x: x.days)
    # create Block column from the Address column (1 if the address mentions "Block")
    df["Block"] = df.Address.str.contains("Block").astype(np.int32)
    # drop
    df = df.drop(["Date","Address"],axis=1)
    return df

In [None]:
df_date = convert_dataframe(df)
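
To make the derived columns concrete, here is a small sketch on three hand-written rows. The timestamps and addresses are illustrative values in the same shape as the dataset, not a sample taken from it, and `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Three illustrative rows: a Wednesday, a Friday and a Sunday.
toy = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2015-05-13 23:53:00",
        "2015-05-15 08:30:00",
        "2015-05-17 12:00:00",
    ]),
    "Address": [
        "OAK ST / LAGUNA ST",
        "1500 Block of LOMBARD ST",
        "100 Block of BRODERICK ST",
    ],
})

toy["Hour"] = toy["Dates"].dt.hour
# dayofweek counts from Monday = 0, so Friday = 4 and Sunday = 6.
toy["Day_week_numeric"] = toy["Dates"].dt.dayofweek
# Same convention as convert_dataframe: the weekend runs Friday to Sunday.
toy["Weekend"] = np.where(toy["Day_week_numeric"] >= 4, 1, 0)
# 1 when the address is a block, 0 when it is a street intersection.
toy["Block"] = toy["Address"].str.contains("Block").astype(np.int32)

print(toy[["Hour", "Day_week_numeric", "Weekend", "Block"]])
```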

In [None]:
# label encoder Category and PdDistrict

label_cat = LabelEncoder()
df_date["Category_encode"] = label_cat.fit_transform(df_date.Category)
label_dist = LabelEncoder()
df_date["PdDistric_encode"] = label_dist.fit_transform(df_date.PdDistrict)
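
For readers new to `LabelEncoder`, the following sketch shows the round trip on a handful of made-up labels: classes are sorted alphabetically, each label becomes its position in that sorted list, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real column has many more categories.
categories = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "VANDALISM"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; a code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around matters later: its `classes_` attribute is what links the model's probability columns back to category names.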


In [None]:
# remove outliers
df_outliers = df_date.loc[df_date.Y < 90.].copy()

In [None]:
df_outliers.head(3)

### EDA
We have already prepared the dataframe, so now we can do an exploratory analysis to see what information we can obtain.

In [None]:
# count values (we can use value_counts() as well)
month_count = df_outliers.groupby(["Month"])["Dates"].count().reset_index()
# lineplot
ax = sns.lineplot(x="Month",y="Dates",data=month_count,color="#6549DA")
# add horizontal line
ax.axhline(month_count['Dates'].mean(),color="#9CDEF6")
sns.despine()
# adding text
ax.text(0.5,83000,"Case count per Month",
        fontsize=13,        
         fontweight='bold') 
ax.text(0.5,81500,"What happens during the Vacations?",
        fontsize=11)
ax.text(10,72000,"Monthly mean of cases",
        fontsize=8,        
         fontweight='bold')
plt.xlabel("Month")
plt.ylabel("Frequency")
# so only the graphic appears without any text referring to the object type.
plt.show(block=False)

At first glance, it seems that during the summer and Christmas months the number of cases is below the monthly average.

Let's do the same but with the days of the week to see if we can see any difference.
We can see that cases go up a bit during Wednesdays and Fridays; Sundays are quieter. 

The graphs may be misleading and suggest a big difference between days or months, but the y-axis does not start at zero: the range is only about 117,500 to 132,500 cases (Day of Week) and 65,000 to 80,000 (Month).

Many will think it is common sense that during the vacation months and Sundays there are fewer cases, but it never hurts to show it in a graph.
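
One way to put a number on how small these differences are is to compare the spread of the counts with their mean. The sketch below does this with hypothetical daily counts in thousands (not the real figures); `relative_spread` is a helper name introduced here for illustration.

```python
import pandas as pd

def relative_spread(counts: pd.Series) -> float:
    """Return (max - min) / mean, i.e. the range as a fraction of the mean."""
    return (counts.max() - counts.min()) / counts.mean()

# Hypothetical counts per day of week, in thousands of cases.
toy_counts = pd.Series([120, 130, 125, 118, 132, 122, 117])

spread = relative_spread(toy_counts)
print(f"range is {spread:.1%} of the mean")
```

A spread of roughly a tenth of the mean looks dramatic on a truncated axis but is modest in absolute terms; calling `ax.set_ylim(bottom=0)` on the plots is another way to show this.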

In [None]:
day_count = df_outliers.groupby(["Day_week_numeric"])["Dates"].count().reset_index()
#lineplot
ax = sns.lineplot(x="Day_week_numeric",y="Dates",data=day_count,color="#6549DA")
# add horizontal line
ax.axhline(day_count['Dates'].mean(),color="#9CDEF6")
sns.despine()
# add text
ax.text(-0.1,136000,"Case count per Day of Week",
        fontsize=13,        
         fontweight='bold') 
ax.text(-0.1,134500,"Are Sundays quieter?",
        fontsize=11)
ax.text(5,124000,"Daily mean of cases",
        fontsize=11,        
         fontweight='bold')
# axis title
plt.xlabel("Day of Week")
plt.ylabel("Frequency")
# so only the graphic appears without any text referring to the object type.
plt.show(block=False)

It is not surprising that, if we graph the cases by time of day, we see that during the night there are fewer cases.

In [None]:
hour_count = df_outliers.groupby(["Hour"])["Dates"].count().reset_index()
#barplot
ax = sns.barplot(y="Dates",x="Hour",data=hour_count,color="#B8EBE9")
# with axhline we draw a horizontal line at the hourly mean
ax.axhline(hour_count["Dates"].mean(),color="#6549DA")

plt.ylabel("Frequency")
plt.xlabel("Hour")
plt.grid(False)
sns.despine()

# add text annotation
ax.text(18,34000, "Hourly mean of cases", horizontalalignment='left', size='medium', color='black', weight='semibold')
ax.text(-0.1,59000,"Cases Count per Hour",
        fontsize=13,        
         fontweight='bold') 
ax.text(-0.1,55000,"Nights are for sleeping?",
        fontsize=11) 
plt.show(block=False)


Let us now see which categories are the most common, expressed as a share of the total, and the districts with the most incidents.

In [None]:
# dataframe: top-10 categories as a share of all cases
# (rename_axis + reset_index(name=...) gives the same column names in old and new pandas)
category_counts = (df_outliers.Category.value_counts(normalize=True)
                   .head(10).rename_axis("Category").reset_index(name="Share"))
#barplot
ax=sns.barplot(y="Category",x="Share",data=category_counts,color="#04A4B5")

plt.ylabel("")
plt.xlabel("Share of Cases")
plt.grid(False)
sns.despine()
#add text
ax.text(0,-2,"Share of Cases per Category",
        fontsize=13,        
         fontweight='bold') 
ax.text(0,-1.3,"Protect your pockets well...",
        fontsize=11)
# with a loop I add the values to the graphic
for num,text in zip(range(len(category_counts)),round(category_counts["Share"],2)):
    ax.text(text,num,text)
plt.show(block=False)

Let's make a graph of cases by district; using cumsum we can see the cumulative share. We can see that 54 percent of the cases occur in only four districts.

In [None]:
# Let's use cumsum to see the cumulative share per district
# (rename_axis + reset_index(name=...) gives the same column names in old and new pandas)
distric_counts_cumsum = (df_outliers.PdDistrict.value_counts(normalize=True).cumsum()
                         .rename_axis("PdDistrict").reset_index(name="Cumulative_share"))
#barplot
ax=sns.barplot(y="Cumulative_share",x="PdDistrict",data=distric_counts_cumsum,color="#30BFBF")

plt.ylabel("Cumulative Share of Cases")
plt.xlabel("")
plt.xticks(rotation=45)
plt.grid(False)
sns.despine()
#add text
ax.text(-0.2,1.10,"Cumulative Share of Cases per PdDistrict",
        fontsize=13,        
         fontweight='bold') 
ax.text(-0.2,1.0,"Where should I buy my house?",
        fontsize=11)
# with a loop I add the values to the graphic
for num,text in zip(range(len(distric_counts_cumsum)),round(distric_counts_cumsum["Cumulative_share"],2)):
    ax.text(num,text,text)
plt.show(block=False)
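
The cumulative share used in this plot can be reproduced on a tiny example. The district labels below are placeholders, chosen so that the shares are easy to check by hand.

```python
import pandas as pd

# Ten toy cases spread over four placeholder districts: 4, 3, 2 and 1 cases.
districts = pd.Series(["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"])

# value_counts sorts from most to least frequent; normalize=True gives shares.
shares = districts.value_counts(normalize=True)

# cumsum then answers "what fraction of cases do the top k districts cover?"
cumulative = shares.cumsum()
print(cumulative.round(2))
```

Here the top two districts already cover 70 percent of the toy cases, which is the same kind of reading as the "four districts" remark above.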

Using pd.crosstab we can see, for each category, the district with the most incidents (normalize="index" gives the share of each district within a row).

We can see, for example, that 48 percent of the prostitution cases take place in Mission and 32 percent of the drug cases take place in Tenderloin.



In [None]:
# use df_outliers for both axes so rows and columns come from the same filtered data
distric_category = pd.crosstab(columns=df_outliers["PdDistrict"],index=df_outliers["Category"],normalize="index")
category_district_max = pd.concat([distric_category.idxmax(axis=1),distric_category.max(axis=1)],axis=1).sort_values(by=1,ascending=False).reset_index()
category_district_max.columns = ["Category","District","Percentage"]
category_district_max.head(10)
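
The effect of `normalize="index"` is easiest to see on a toy table. The categories and districts below are invented for illustration; each row of the result sums to 1, so `idxmax` and `max` along the columns give the dominant district and its share for every category.

```python
import pandas as pd

# Six made-up cases: THEFT happens mostly in NORTH, DRUGS only in SOUTH.
toy = pd.DataFrame({
    "Category": ["THEFT", "THEFT", "THEFT", "THEFT", "DRUGS", "DRUGS"],
    "District": ["NORTH", "NORTH", "NORTH", "SOUTH", "SOUTH", "SOUTH"],
})

# Row-wise shares: each category's cases split across districts.
table = pd.crosstab(index=toy["Category"], columns=toy["District"],
                    normalize="index")

top_district = table.idxmax(axis=1)  # district with the largest share
top_share = table.max(axis=1)        # the size of that share
print(pd.concat([top_district, top_share], axis=1))
```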

We can also break the cases down by district and weekend (Friday-Sunday); it strikes me that in Tenderloin, where 32 percent of the drug cases occur, the weekends are quieter.

In [None]:
pd.crosstab(index=df_outliers.PdDistrict,columns=df_outliers.Weekend,normalize="index")

### Model
I could go on and on analyzing the cases, for example by year, by resolution, or focusing on a single category (see the scatter plot below), but let's go directly to the model.



In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x="X",y="Y",data=df_outliers.loc[df_outliers.Year==2012],alpha=0.5,color="#B3BDB2")
sns.scatterplot(x="X",y="Y",data=df_outliers.loc[(df_outliers.Year==2012)&(df_outliers.Category=="PROSTITUTION")],alpha=0.8,color="r")


#add text
plt.text(-122.514741,37.829977,"San Francisco Prostitution Cases by District in 2012",
        fontsize=15,        
         fontweight='bold') 
plt.text(-122.416145,37.761631,"MISSION",
        fontsize=12,        
         fontweight='bold') 
plt.text(-122.420296,37.788879,"NORTHERN",
        fontsize=12,        
         fontweight='bold') 


plt.show(block=False)



The first thing I am going to do is use KMeans to create a new feature (the cluster each case falls into) that may help improve the predictions.

In [None]:
X = df_outliers.drop(["Dates","Category","Descript","DayOfWeek","PdDistrict","Resolution","Category_encode"],axis=1).copy()
y = df_outliers["Category_encode"]

# train and validation split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 21)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

#kmeans
kmeans = KMeans(n_clusters=6,random_state=0).fit(X_train)
  

In [None]:
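# scaler.transform returns a NumPy array, so rebuild DataFrames (keeping the
# original column names, all strings) before adding the new column
X_train_df = pd.DataFrame(X_train, columns=X.columns)
X_val_df = pd.DataFrame(X_val, columns=X.columns)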
## add new columns
X_train_df["Kmean"] = kmeans.labels_
X_val_df["Kmean"] = kmeans.predict(X_val)
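
The idea of using cluster labels as an extra feature can be sketched on a few 2-D points. This is a toy setup with hand-picked, well-separated points and two clusters, not the notebook's data or its six clusters; the important detail is that KMeans is fitted on the training rows only and the validation rows are merely assigned to the nearest learned centre.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of training points (made-up values).
train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
val = np.array([[0.05, 0.05], [5.05, 5.0]])

# Fit on the training rows only, so no validation information leaks in.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# labels_ holds the cluster of each training row; predict assigns new rows
# to the nearest learned centre.
train_with_cluster = np.column_stack([train, kmeans.labels_])
val_with_cluster = np.column_stack([val, kmeans.predict(val)])
print(val_with_cluster)
```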

In [None]:
# Fitting Random Forest Classification to the Training set
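# max_features="sqrt" is what "auto" meant for classifiers; "auto" was removed in recent scikit-learn
classifier = RandomForestClassifier( n_jobs = -1,random_state =50,max_depth=10,max_features="sqrt",min_samples_split=4)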
classifier.fit(X_train_df, y_train)

In [None]:
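# predict class probabilities and score them with multiclass log loss;
# labels= keeps the columns aligned even if a rare class is missing from y_val
predict_proba = classifier.predict_proba(X_val_df)
log_loss(y_val,predict_proba,labels=classifier.classes_)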
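
As a reference point for reading a log loss value (this baseline is an addition here, not something the notebook computes), a model that assigns the same probability to each of K classes scores exactly ln(K). The sketch below checks this on a toy problem with three classes.

```python
import numpy as np
from sklearn.metrics import log_loss

n_classes = 3  # toy value; the competition data has many more categories
y_true = [0, 1, 2, 1]

# Every row predicts the same probability, 1 / n_classes, for every class.
uniform = np.full((len(y_true), n_classes), 1.0 / n_classes)

baseline = log_loss(y_true, uniform, labels=list(range(n_classes)))
print(baseline, np.log(n_classes))
```

A validation score well below ln(K) for the real number of categories suggests the model has learned something beyond guessing uniformly.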

### Submission

In [None]:
test_data = pd.read_csv("../input/sf-crime/test.csv.zip")

In [None]:
test_data_transformed = convert_dataframe(test_data)
# reuse the encoder fitted on the training data (transform, not fit_transform)
test_data_transformed["PdDistric_encode"] = label_dist.transform(test_data_transformed.PdDistrict)
test_data_final = test_data_transformed.drop(["DayOfWeek","PdDistrict","Dates",
                "Id"],axis=1).copy()



In [None]:
test_data_scaler = scaler.transform(test_data_final)

In [None]:
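# rebuild a DataFrame from the scaled array, keeping the column names
test_data_final = pd.DataFrame(test_data_scaler, columns=test_data_final.columns)
# add kmean column (predict on the scaled array, the same form kmeans was fitted on)
test_data_final["Kmean"] = kmeans.predict(test_data_scaler)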


In [None]:
test_data_pred_proba = classifier.predict_proba(test_data_final)
#from label encoder we use classes to have the original values of Category, 
#and we will use them as columns in the submission dataframe.
keys = label_cat.classes_


In [None]:
result = pd.DataFrame(data=test_data_pred_proba,columns=keys)
result.head(3)

In [None]:
result.to_csv(path_or_buf="classifier_sf.csv",index=True, index_label = 'Id')
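
The layout of the submission file can be previewed without the full model. In this sketch `classes` and `proba` are hand-written stand-ins for `label_cat.classes_` and the output of `predict_proba`; calling `to_csv` without a path returns the CSV as a string.

```python
import numpy as np
import pandas as pd

# Stand-ins (assumed values) for label_cat.classes_ and predict_proba output.
classes = np.array(["ASSAULT", "LARCENY/THEFT", "VANDALISM"])
proba = np.array([[0.2, 0.7, 0.1],
                  [0.5, 0.3, 0.2]])

# One column per category, one row per test case; the row index becomes Id.
submission = pd.DataFrame(proba, columns=classes)
csv_text = submission.to_csv(index=True, index_label="Id")
print(csv_text)
```

The column order has to match the order of the probability columns, which is why the encoder's sorted `classes_` are used as headers.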