# Modeling

We're going to use SciKit-Learn to generate some models that will (hopefully) help us make predictions about crime in Denver.

## SciKit-Learn and linear regression
SciKit-Learn is one of the most popular machine learning libraries about there. It allows us to create models very easily! Let's use it to create a simple [linear regression model](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares). If you're unfamiliar with linear regression, the idea is to essentially draw a line of best fit through the data.

The way we do this is by 'fitting' a line based on a set of input data, `X`, and a response variable, `y`. If we're trying to predict crime occurrences, `y` should be the count of occurrences of a specific type of crime in a specific neighborhood. `X` should be the set of predictor features that we are going to use to predict `y`. Feature engineering is the process of creating `X` and `y` from the raw data. We implemented the feature engineering functionality in `denvercrime/src/features/make_features.py`. 

Let's try to predict how many occurences of drug/alcohol related crimes will occur in Stapleton tomorrow. The first step is to create the training data using the feature engineering functionality. Here are the docstrings for the feature engineering functions, so you can get an idea of how they work. **Note that running the feature engineering functions might take a while.**

```python
def make_predictors(df, date, hood):
    """ Create predictor variables.

    Parameters:
        df (DataFrame): crime data
        date (datetime.date): datetime.date obj; date to predict
        hood (str): value for df['NEIGHBORHOOD_ID']

    Returns:
        (DataFrame): Rows are indexed by predictions, columns are features (counts for different intervals)
    """
    
def make_responses(df, dates, hood, category):
    """ Create response variables.

    Parameters:
        df (DataFrame): crime data
        dates (list): list of datetime.date objs
        hood (str): value for df['NEIGHBORHOOD_ID']
        category (str): value for df['OFFENSE_CATEGORY_ID']

    Returns:
        (Series): Counts of each offense category for this date and neighborhood
    """
```

In [2]:
# import modules
import datetime as dt
import numpy as np
import pandas as pd
import sys, os
sys.path.append('../src/features/')
from make_features import make_predictors
from make_features import make_responses

In [3]:
if os.path.exists('../data/processed/crime.pkl'):
    # if we already made the crime data, then just load it
    crime = pd.read_pickle('../data/processed/crime.pkl')
else:
    # read in the crime data
    crime = pd.read_csv('../data/raw/crime.csv', parse_dates=True)
    # change the date to datetime type and round to nearest day
    crime['REPORTED_DATE'] = pd.to_datetime(crime['REPORTED_DATE']).dt.normalize()
    crime.to_pickle('../data/processed/crime.pkl')

In [None]:
nhood = 'stapleton'
crimetype = 'drug-alcohol'

In [31]:
# make a list of dates for make_responses
dates = pd.date_range(
            dt.datetime.strptime('2015-01-02', '%Y-%m-%d'),
            dt.datetime.strptime('2019-02-07', '%Y-%m-%d'), 
            #periods=500
            freq="1D"
        ).normalize().tolist()

# initialize training data
X_train = make_predictors(
    crime,
    dates,
    nhood,
    crimetype
)
y_train = make_responses(
    crime,
    dates,
    nhood,
    crimetype
)
print('done')

done


In [35]:
print(len(dates))
print(X_train.shape)
print(y_train.shape)
print(dates[1:5])

1498
(1498, 31)
(1498, 1)
[Timestamp('2015-01-03 00:00:00', freq='D'), Timestamp('2015-01-04 00:00:00', freq='D'), Timestamp('2015-01-05 00:00:00', freq='D'), Timestamp('2015-01-06 00:00:00', freq='D')]


Now that we have data fit for training a model, all we have to do is import the model from SKLearn and fit it.

In [33]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train, y_train)

And that's all it takes! Now that our model is stored in this variable, `reg`, we can use it to predict crime for any date.

In [23]:
# let's make predictors for the last day in our dataset
X_lastday = make_predictors(
    crime,
    [dt.datetime.strptime('2019-02-06', '%Y-%m-%d')],
    nhood,
    crimetype
)
y_lastday = make_responses(crime,
    [dt.datetime.strptime('2019-02-06', '%Y-%m-%d')],
    nhood,
    crimetype
).iloc[0][0]

In [28]:
y_lastday_hat = reg.predict(X_predict)[0][0]
print("Observed:" + str(y_lastday) + ", Predicted:" + str(y_lastday_hat))

Observed:3, Predicted:0.479073663268916


In [9]:
# let's make predictors for the day after the last day in our dataset
X_predict = make_predictors(
    crime,
    [dt.datetime.strptime('2019-02-07', '%Y-%m-%d')],
    nhood,
    crimetype
)

In [22]:
y_lastday.iloc[0][0]

3

In [29]:
reg.predict(X_predict)[0][0]

0.479073663268916

In [30]:
# Which features are important?
reg.coef_

array([[-2.06795817e-16, -1.70498467e-02,  3.14217352e-02,
        -5.10450624e-03,  5.41167438e-04, -7.51031623e-03,
        -2.23050745e-14,  1.04508793e-03,  3.53641472e-03,
        -6.70361617e-04,  1.30530526e-04,  1.36540602e-03,
         1.87015074e-02,  1.52137169e-01,  6.93420682e-02,
        -2.25705712e-02,  1.14487528e-01, -4.08725321e-03,
        -9.74303310e-02, -1.96740462e-01, -6.38149520e-02,
        -1.51672122e-01,  1.16653714e-01,  6.49937054e-02,
         6.00053006e-02, -1.72643399e-02, -7.41913530e-02,
         2.21037457e-01,  9.40518666e-02, -1.18818398e-01,
        -1.64820534e-01]])

At this point it would be wise to ask, "How accurate is this prediction?" 

One of the ways we can evaluate a regression is by checking the $R^2$, which is a measure of how close the data are to the line we created. A higher $R^2$ is generally better. Let's get SKLearn to tell us the $R^2$ of our model.

In [11]:
reg.score(X_train, y_train)

0.050856452070279556

## Other models
Let's try out some other models and see how they compare to the linear regression model. SKLearn offers a plethora of statistical algorithms for us to try.

Try creating a [ridge](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression), 
[lasso](https://scikit-learn.org/stable/modules/linear_model.html#lasso), 
[k nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor), 
or [regression tree](https://scikit-learn.org/stable/modules/tree.html#regression) 
model. Click the link to read about the algorithm and see an example. How does your model compare to the linear regression? Can you explain any differences in results based on the model's algorithm?