# Modeling

We're going to use SciKit-Learn to generate some models that will (hopefully) help us make predictions about crime in Denver.

## SciKit-Learn and linear regression
SciKit-Learn is one of the most popular machine learning libraries out there. It allows us to create models very easily! Let's use it to create a simple [linear regression model](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares). If you're unfamiliar with linear regression, the idea is essentially to draw a line of best fit through the data.
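
To make the "line of best fit" idea concrete, here is a minimal sketch on made-up numbers (not the crime data). The points are chosen to lie exactly on $y = 2x + 1$, so the fitted slope and intercept should recover those values. The names `X_toy`, `y_toy`, and `toy_reg` are invented for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on the line y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2D: one row per sample, one column per feature
y_toy = np.array([1.0, 3.0, 5.0, 7.0])          # 1D: one response per sample

toy_reg = LinearRegression().fit(X_toy, y_toy)

# The fitted line is y = slope * x + intercept
slope = toy_reg.coef_[0]
intercept = toy_reg.intercept_
print(slope, intercept)
```

With real data the points will not sit exactly on a line, and the fit picks the slope and intercept that minimize the squared distance between the line and the observed responses.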

The way we do this is by 'fitting' a line based on a set of input data, `X`, and a response variable, `y`. If we're trying to predict crime occurrences, `y` should be the count of occurrences of a specific type of crime in a specific neighborhood. `X` should be the set of predictor features that we are going to use to predict `y`. Feature engineering is the process of creating `X` and `y` from the raw data. We implemented the feature engineering functionality in `denvercrime/src/features/make_features.py`. 
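
As a rough illustration of what building `X` and `y` can look like, here is a sketch on a tiny invented DataFrame. It is not the code in `make_features.py`: the rows are made up, the single "count on the previous day" feature is an assumption chosen for simplicity, and only the column names are taken from the crime data.

```python
import pandas as pd

# Toy stand-in for the crime data; rows are invented for illustration
toy = pd.DataFrame({
    'REPORTED_DATE': pd.to_datetime([
        '2017-04-20', '2017-04-20', '2017-04-21', '2017-04-23', '2017-04-23',
    ]),
    'NEIGHBORHOOD_ID': ['stapleton'] * 5,
    'OFFENSE_CATEGORY_ID': ['drug-alcohol', 'larceny', 'drug-alcohol',
                            'drug-alcohol', 'drug-alcohol'],
})

# Keep one neighborhood and one offense category
mask = (
    (toy['NEIGHBORHOOD_ID'] == 'stapleton')
    & (toy['OFFENSE_CATEGORY_ID'] == 'drug-alcohol')
)

# Daily counts, with days that have no reports filled in as 0
days = pd.date_range('2017-04-20', '2017-04-23', freq='D')
daily = toy.loc[mask].groupby('REPORTED_DATE').size().reindex(days, fill_value=0)

# Response: the count on each day. Predictor: the count on the previous day.
# The first day is dropped because it has no previous day to look back on.
y_toy = daily.iloc[1:]
X_toy = pd.DataFrame({'count_prev_day': daily.shift(1).iloc[1:]})
print(X_toy.assign(y=y_toy))
```

Each row of `X_toy` lines up with one entry of `y_toy`, which is the shape a SciKit-Learn model expects when fitting.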

Let's try to predict how many occurrences of drug/alcohol-related crimes will occur in Stapleton tomorrow. The first step is to create the training data using the feature engineering functionality. Here are the docstrings for the feature engineering functions, so you can get an idea of how they work. **Note that running the feature engineering functions might take a while.**

```python
def make_predictors(df, date, hood):
    """ Create predictor variables.

    Parameters:
        df (DataFrame): crime data
        date (datetime.date): datetime.date obj; date to predict
        hood (str): value for df['NEIGHBORHOOD_ID']

    Returns:
        (DataFrame): Rows are indexed by predictions, columns are features (counts for different intervals)
    """
    
def make_responses(df, dates, hood, category):
    """ Create response variables.

    Parameters:
        df (DataFrame): crime data
        dates (list): list of datetime.date objs
        hood (str): value for df['NEIGHBORHOOD_ID']
        category (str): value for df['OFFENSE_CATEGORY_ID']

    Returns:
        (Series): Counts of each offense category for this date and neighborhood
    """
```

In [56]:
# import modules
import datetime as dt
import numpy as np
import pandas as pd
import sys
sys.path.append('../src/features/')
from make_features import make_predictors
from make_features import make_responses

# read in the crime data
crime = pd.read_csv('../data/raw/crime.csv', parse_dates=True)

# change the date to datetime type
crime['REPORTED_DATE'] = pd.to_datetime(crime['REPORTED_DATE']).dt.normalize()

# make a list of dates for make_responses
dates = pd.date_range(dt.date.today(), periods=100).tolist()

# initialize training data
X = make_predictors(
    crime,
    dt.datetime.strptime('2017-04-23', '%Y-%m-%d'),
    'stapleton'
)
y = make_responses(
    crime,
    dates,
    'stapleton',
    'drug-alcohol'
)
print('done')

ValueError: Shape of passed values is (1, 4), indices imply (1, 15)

Now that we have data fit for training a model, all we have to do is import the model from SKLearn and fit it.

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X, y)

And that's all it takes! Now that our model is stored in this variable, `reg`, we can use it to predict crime for any date.

In [None]:
# predict needs predictor features with the same columns used for fitting
reg.predict(X)

At this point it would be wise to ask, "How accurate is this prediction?" 

One way to evaluate a regression is to check its $R^2$, the proportion of the variance in `y` that the model explains. It is at most 1, and a higher $R^2$ generally means the line fits the data more closely. Let's get SKLearn to tell us the $R^2$ of our model.

In [None]:
# score needs the predictors and the true responses to compare against
reg.score(X, y)
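
To see what that score measures, the sketch below computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand and compares it with the value `score` returns. The data are invented for illustration, and the names `X_toy`, `y_toy`, and `toy_reg` are not part of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: roughly linear with a little scatter
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

toy_reg = LinearRegression().fit(X_toy, y_toy)
y_pred = toy_reg.predict(X_toy)

# Residual sum of squares: how far the data are from the fitted line
ss_res = np.sum((y_toy - y_pred) ** 2)
# Total sum of squares: how far the data are from their own mean
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = toy_reg.score(X_toy, y_toy)
print(r2_manual, r2_sklearn)
```

Note that scoring a model on the same data it was fit on tends to give an optimistic number; scoring on data held out from training is usually a fairer check.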

## Other models
Let's try out some other models and see how they compare to the linear regression model. SKLearn offers a plethora of statistical algorithms for us to try.

Try creating a [ridge](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression), 
[lasso](https://scikit-learn.org/stable/modules/linear_model.html#lasso), 
[k nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor), 
or [regression tree](https://scikit-learn.org/stable/modules/tree.html#regression) 
model. Click a link to read about the algorithm and see an example. How does your model compare to the linear regression? Can you explain any differences in results based on the model's algorithm?
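
As a starting point, here is one possible way to set up such a comparison. It is a sketch on synthetic data rather than the crime features, and the hyperparameters (`alpha`, `n_neighbors`, `max_depth`) are arbitrary choices for illustration, not tuned or recommended values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal in three features plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out a quarter of the rows so every model is scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.1),
    'knn': KNeighborsRegressor(n_neighbors=5),
    'tree': DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Fit each model on the training rows and record its R^2 on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f'{name}: R^2 = {scores[name]:.3f}')
```

Because every regressor shares the same `fit` and `score` interface, swapping in the real `X` and `y` only requires changing the data, not the loop.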