## <u> Crime Statistics </u>

### <u> About Our Dataset </u>

This dataset was compiled for our capstone project for the M.S. in Data Science at Drexel University. 
We collected the data through thousands of calls to the NYC OpenData platform. 
After collecting all the data, we dropped columns that we didn't think added value. 
Despite our best efforts, time constraints kept us from scrubbing all null values from the less frequently used attributes, such as the "computed_region" columns. However, the dataset is in a usable form and ~95% cleaned.

### <u> Content </u> 

While most of the data like "arrest_date" and "age_group" are straightforward, here is a key for some items that may be less obvious.

| Column | Description |
| --- | --- |
| pd_desc | Description of internal classification corresponding with PD code (more granular than Offense Description) |
| ofns_desc | Description of offense corresponding with key code |
| law_code | NY penal law code of offense |
| law_cat_cd | Level of offense: felony, misdemeanor, violation |
| arrest_boro | The borough of NYC where the arrest took place |
| arrest_precinct | Police precinct where the arrest took place |
| jurisdiction_code | Jurisdiction responsible for the incident: either internal, like Police, Transit, and Housing; or external, like Correction, Port Authority, etc. |
| :@computed_region_f5dn_yrer | Community Districts |
| :@computed_region_yeji_bk3q | Borough Boundaries |
| :@computed_region_92fq_4b7q | City Council Districts |
| :@computed_region_sbqj_enih | Police Precincts |

### <u> TASKS : </u>

arrest_precinct - validate this column by predicting it from the other features in the dataset and comparing the predictions with the recorded values

### <u> Acknowledgements </u>

Thanks to NYC Open Data for the data.

This project has been a collaboration between Ambrose Karella, Janam Patel, and Naimish Bizzu.

### <u> Inspiration </u>

We thought this data was interesting because it allows for exploring crime in a geospatial way. 
While broad demographics are interesting, we can get more granular and answer questions like 
"where is a tourist least likely to be a victim of a crime?"


In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [None]:
# Importing all datasets
crime_stats = pd.read_csv("/kaggle/input/nyc-crime-stats/NYC_crime.csv")
crime_stats.head()

In [None]:
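# Give the computed-region columns shorter, easier-to-type names
crime_stats = crime_stats.rename(columns={
    ":@computed_region_f5dn_yrer": "computed_region1",   # Community Districts
    ":@computed_region_yeji_bk3q": "computed_region2",   # Borough Boundaries
    ":@computed_region_92fq_4b7q": "computed_region3",   # City Council Districts
    ":@computed_region_sbqj_enih": "computed_region4",   # Police Precincts
})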

In [None]:
crime_stats.head(2)

In [None]:
crime_stats.dtypes

In [None]:
# Drop 'Unnamed: 0' as this is not in use
crime_stats.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [None]:
crime_stats.head(3)

In [None]:
crime_stats.law_cat_cd.value_counts()

In [None]:
crime_stats.perp_sex.value_counts()

### Converting the binary variable perp_sex (M/F) to 1/0

In [None]:
# List of variables to map

varlist =  ['perp_sex']

# Defining the map function
def binary_map(x):
    return x.map({"M": 1, "F": 0})

# Applying the function to the columns in varlist
crime_stats[varlist] = crime_stats[varlist].apply(binary_map)
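
One thing worth checking after a dictionary-based `map` is whether every value was covered. The sketch below uses a made-up Series (not the real `perp_sex` column) with a hypothetical `"U"` category to show that `Series.map` turns any value missing from the dictionary into `NaN`, which would then show up in the null counts.

```python
import pandas as pd

# Made-up values; "U" is a hypothetical category that the mapping does not cover
sex = pd.Series(["M", "F", "U", "M"])

# Same mapping as binary_map above
encoded = sex.map({"M": 1, "F": 0})

# Values absent from the dictionary become NaN
print(encoded.tolist())
print("unmapped values:", encoded.isna().sum())
```

If the unmapped count is non-zero on the real column, those rows need an explicit decision (an extra code, or imputation) before modelling.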

### Inspecting the Null Values 

In [None]:
crime_stats.isnull().sum()

### Imputing the missing values in the columns with the most common values

In [None]:
crime_stats['law_cat_cd'] = crime_stats['law_cat_cd'].fillna(crime_stats['law_cat_cd'].mode()[0])

In [None]:
crime_stats['computed_region1'] = crime_stats['computed_region1'].fillna(crime_stats['computed_region1'].mode()[0])

In [None]:
crime_stats['computed_region2'] = crime_stats['computed_region2'].fillna(crime_stats['computed_region2'].mode()[0])

In [None]:
crime_stats['computed_region3'] = crime_stats['computed_region3'].fillna(crime_stats['computed_region3'].mode()[0])

In [None]:
crime_stats['computed_region4'] = crime_stats['computed_region4'].fillna(crime_stats['computed_region4'].mode()[0])
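
The five imputation cells above all follow the same pattern, so they could be written as one loop. This is a minimal sketch on a toy frame with made-up values; `impute_with_mode` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def impute_with_mode(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's mode."""
    out = df.copy()
    for col in columns:
        # mode() can return several values on a tie; [0] takes the first one
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame standing in for crime_stats (values are made up)
toy = pd.DataFrame({
    "law_cat_cd": ["M", "F", None, "M"],
    "computed_region1": [3.0, None, 3.0, 7.0],
})

filled = impute_with_mode(toy, ["law_cat_cd", "computed_region1"])
print(filled)
```

Working on a copy keeps the original frame available for comparing null counts before and after.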

### Label Encoding

In [None]:
# import preprocessing from sklearn
from sklearn import preprocessing

# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
crime_stats_2 = crime_stats.apply(le.fit_transform)
crime_stats_2.head(5)
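
Because a single `LabelEncoder` is refitted on every column here, only the last column's classes remain on `le` afterwards, and numeric columns are replaced by their rank codes too. If the codes need to be translated back later, one option is to keep one encoder per column. The following is a sketch on a toy frame with made-up values; the `encoders` dictionary is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up categorical values
toy = pd.DataFrame({
    "arrest_boro": ["K", "M", "B", "K"],
    "law_cat_cd": ["F", "M", "M", "V"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(encoded)
# classes_ lists the original labels in the order of their integer codes
print(encoders["arrest_boro"].classes_)

# The stored encoder can map the codes back to the original labels
decoded = encoders["arrest_boro"].inverse_transform(encoded["arrest_boro"])
print(list(decoded))
```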

### Rescaling the Features 

We will use MinMax scaling.
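
Min-max scaling maps each column to the range 0 to 1 using `(x - min) / (max - min)`. As a quick sanity check, the sketch below compares `MinMaxScaler` with that formula on a few made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up, latitude-like values in a single column
x = np.array([[40.5], [40.7], [40.9]])

scaled = MinMaxScaler().fit_transform(x)

# The same transformation written out by hand
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(scaled.ravel())
print("matches manual formula:", np.allclose(scaled, manual))
```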

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to every column except the binary perp_sex column
num_vars = ["arrest_key", "arrest_date", "pd_desc", "ofns_desc", "law_code", "age_group", "law_cat_cd", "perp_race", "latitude", "longitude", "arrest_boro", "arrest_precinct", "jurisdiction_code", "computed_region1", "computed_region2", "computed_region3", "computed_region4"]

crime_stats_2[num_vars] = scaler.fit_transform(crime_stats_2[num_vars])

crime_stats_2.head()

In [None]:
crime_stats_2.isnull().sum()

## Checking for Outliers 

In [None]:
# Checking for outliers in the continuous variables
num_crime_stats_2 = crime_stats_2[["arrest_key","arrest_date","pd_desc","ofns_desc","law_code","law_cat_cd","age_group","perp_sex","perp_race","latitude","longitude","arrest_boro","arrest_precinct","jurisdiction_code","computed_region1","computed_region2","computed_region3","computed_region4"]]

In [None]:
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%
num_crime_stats_2.describe(percentiles=[.25, .5, .75, .90, .95, .99])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("dark_background")

## Distribution of Crime_statistics

In [None]:
# Plot the distribution of the scaled arrest_precinct column
# using 40 orange bins (seaborn and matplotlib were imported above)

plt.figure(figsize = [9,5])
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(num_crime_stats_2.arrest_precinct, bins = 40, color = "orange", kde = True, stat = "density")
plt.title("Distribution of Crime_statistics", fontsize = 20, fontweight = 10, verticalalignment = 'baseline')

plt.show()

## Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
crime_stats_2.head(2)

In [None]:
# Putting feature variable to X
X = crime_stats_2.drop(['arrest_precinct'], axis=1)

X.head()

In [None]:
# Putting response variable to y
y = crime_stats_2['arrest_precinct']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
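
In this notebook the scaler was fitted on the full table before the split, so the test rows contributed to the min and max used for scaling. An alternative ordering, sketched here on synthetic data with hypothetical `*_demo` names, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the real features and response
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100
)

demo_scaler = MinMaxScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse them; test values may fall outside [0, 1]

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set stays unseen until evaluation, at the cost of test values occasionally landing slightly outside the 0 to 1 range.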

In [None]:
# Let's see the correlation matrix 
plt.style.use("ggplot")
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(crime_stats_2.corr(),annot = True,cmap="Greens")
plt.show()

### Model Building
The data was split into training and test sets above, so we can move straight to feature selection and model fitting.

#### Running Your First Training Model

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
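# Running RFE with the number of selected features equal to 11
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in current scikit-learn releases
rfe = RFE(lm, n_features_to_select=11)             # running RFE
rfe = rfe.fit(X_train, y_train)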

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]
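
To see how `support_` and `ranking_` behave, here is a small sketch on synthetic data in which only the first two of five columns drive the target. The data and the `selector` name are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 1 influence the target; the rest are noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# support_ is a boolean mask of the kept features;
# ranking_ is 1 for kept features and grows with the order of elimination
print(selector.support_)
print(selector.ranking_)
```

Note that RFE with a linear model ranks features by coefficient size, so it is sensitive to how the features are scaled.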

In [None]:
# Drop the six features left out after RFE in a single call
X_train = X_train.drop(['pd_desc', 'ofns_desc', 'law_cat_cd', 'age_group', 'perp_sex', 'computed_region3'], axis=1)


In [None]:
# Build the first fitted model
 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

### Checking VIF

The Variance Inflation Factor (VIF) gives a basic quantitative idea of how much the feature variables are correlated with each other. It is an extremely important parameter for testing our linear model. The formula for calculating `VIF` is:

### $ VIF_i = \frac{1}{1 - {R_i}^2} $

Here ${R_i}^2$ is the $R^2$ obtained by regressing the $i$-th feature on all the other features.
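
The formula can be applied by hand with NumPy alone. The sketch below uses synthetic data and a hypothetical helper `vif_by_hand`; it adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` uses the design matrix exactly as given, so the numbers can differ unless that matrix already contains a constant column.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: regress it on the other columns (plus an intercept), then 1 / (1 - R^2)."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    # With an intercept the residuals have zero mean, so this ratio is SS_res / SS_tot
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a, so a and c are collinear
X_demo = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_demo, i) for i in range(X_demo.shape[1])]
print(np.round(vifs, 2))
```

The two collinear columns get large VIFs while the independent one stays close to 1, which is the pattern used below to decide which features to drop.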

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['arrest_key'], axis=1)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['computed_region2'], axis=1)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['longitude'], axis=1)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Build a second fitted model on the reduced feature set
 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

## Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lr_2.predict(X_train_lm)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
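# Plot the histogram of the error terms
fig = plt.figure()
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True, stat = "density")
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)     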

### Making Predictions

In [None]:
X_test.columns

In [None]:
X_test.head(2)

In [None]:
# X_test was split from crime_stats_2 after scaling, so it is already on the 0-1 scale.
# Calling scaler.transform here would rescale it a second time, and the scaler was fitted
# on a different column list, so no further transformation is applied.
X_test.describe().loc[['min', 'max']]

In [None]:
X_test.columns

In [None]:
# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
X_test_new.columns

In [None]:
# Making predictions
y_pred = lr_2.predict(X_test_new)


In [None]:
y_pred

In [None]:
lr_2.params

## Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label
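
A scatter plot shows the spread, but a couple of summary numbers make the fit easier to compare across models. The sketch below computes RMSE and R² with NumPy on made-up values standing in for `y_test` and `y_pred`; `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_hat):
    """Return (RMSE, R^2) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R^2 = 1 - SS_res / SS_tot
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, r2

# Made-up values standing in for y_test and y_pred
rmse, r2 = regression_metrics([0.1, 0.4, 0.6, 0.9], [0.2, 0.4, 0.5, 0.9])
print("RMSE:", rmse)
print("R^2:", r2)
```

On the real data the call would be `regression_metrics(y_test, y_pred)`, assuming both are aligned and of equal length.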

In [None]:
# Converting X_test to a dataframe

X_test_df = pd.DataFrame(X_test)

In [None]:
# Converting y_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Keeping the original row index as an ID column
X_test_df['ID'] = X_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
X_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending X_test_df and y_pred_1
y_pred_final = pd.concat([X_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'arrest_precinct_Prob'})

In [None]:
y_pred_final.head(3)

* GRADIENT DESCENT

In [None]:
crime_stats.describe()

In [None]:
X['intercept'] = 1
# arrest_precinct is the response (y) and is not a column of X, so it is left out of the feature list
X = X.reindex(["intercept","arrest_key","perp_sex","latitude","longitude","jurisdiction_code","computed_region1","computed_region2","computed_region3","computed_region4"], axis=1)

In [None]:
X.head()

In [None]:
import numpy as np
X = np.array(X)
y = np.array(y)

In [None]:
# theta needs one entry per column of X (the intercept plus each feature)
theta = np.zeros(X.shape[1])
alpha = 0.01
iterations = 1000

In [None]:
import numpy as np

def compute_cost(X, y, theta):
    return np.sum(np.square(np.matmul(X, theta) - y)) / (2 * len(y))

In [None]:
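def gradient_descent_multi(X, y, theta, alpha, iterations):
    # Start from zeros with one coefficient per column of X (the theta argument is not used)
    theta = np.zeros(X.shape[1])
    m = len(X)
    history = []

    for i in range(iterations):
        # Gradient of the squared-error cost with respect to theta
        gradient = (1/m) * np.matmul(X.T, np.matmul(X, theta) - y)
        theta = theta - alpha * gradient
        cost = compute_cost(X, y, theta)
        history.append({'Betas': theta, 'cost': cost})

    # One row per iteration: the coefficient vector and its cost
    return pd.DataFrame(history, columns = ['Betas','cost'])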
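
A useful check for any gradient-descent implementation is that it approaches the ordinary least-squares solution. The sketch below repeats the same update rule on synthetic data and compares the result with `np.linalg.lstsq`. The data, the step size `alpha_demo`, and the iteration count are assumptions chosen so that this toy problem converges; they are not tuned for the crime dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
# Intercept column plus two features already on a 0-1 scale
X_demo = np.column_stack([np.ones(m), rng.uniform(0, 1, size=(m, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y_demo = X_demo @ true_theta + rng.normal(scale=0.01, size=m)

theta_gd = np.zeros(X_demo.shape[1])
alpha_demo = 0.5
costs = []
for _ in range(5000):
    error = X_demo @ theta_gd - y_demo
    costs.append(np.sum(error ** 2) / (2 * m))        # cost before this update
    theta_gd -= alpha_demo * (X_demo.T @ error) / m   # same update rule as above

# Closed-form least-squares solution for comparison
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("gradient descent:", theta_gd)
print("least squares:   ", theta_ls)
```

If the two vectors disagree noticeably on real data, the usual suspects are too few iterations or a learning rate that is too small for the feature scale.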

In [None]:
gradient_descent_multi(X, y, theta, alpha, iterations)

In [None]:
# Coefficients and cost after the final iteration (works for any value of iterations)
print(gradient_descent_multi(X, y, theta, alpha, iterations).values[-1])

In [None]:
gradient_descent_multi(X, y, theta, alpha, iterations).reset_index().plot.line(x='index', y=['cost'])

In [None]:
# import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Representing LinearRegression as lr(Creating LinearRegression Object)
lr = LinearRegression()

# fit() stores the fitted coefficients on 'lr' itself, so the result does not need to be assigned to a new variable.
lr.fit(X, y)

In [None]:
print(lr.intercept_)
print(lr.coef_)