# 4.4 Running Regression

INSTRUCTIONS: <br>
Using this data, build a regression model to predict property crimes. You can use the features you prepared in the previous assignment, new features of your own choosing, or a combination. The goal here is prediction rather than understanding mechanisms, so the focus is on creating a model that explains a lot of variance.

Submit a notebook with your model and a brief writeup of your feature engineering and selection process to review with your mentor.

# Load and clean:
We've already looked at these variable distributions in 4.2, so here I'll go ahead and convert the column types, deal with missing data, and eliminate the major outliers.

In [2]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

In [14]:
crime_df = pd.read_csv("NYC_2013_crime.csv")
print(len(crime_df))
crime_df.isnull().sum()

351


City                                        0
Population                                  3
Violent\ncrime                              3
Murder and\nnonnegligent\nmanslaughter      3
Rape\n(revised\ndefinition)1              351
Rape\n(legacy\ndefinition)2                 3
Robbery                                     3
Aggravated\nassault                         3
Property\ncrime                             3
Burglary                                    3
Larceny-\ntheft                             3
Motor\nvehicle\ntheft                       3
Arson3                                    164
dtype: int64

In [16]:
crime_df = pd.read_csv("NYC_2013_crime.csv")

# give columns some reasonable names
crime_df.rename(columns={
    "Murder and\nnonnegligent\nmanslaughter":"Murder",
    "Aggravated\nassault": "agg_assault",
    "Property\ncrime":"prop_crime",
    "Larceny-\ntheft":"Larceny",
    "Motor\nvehicle\ntheft":"mv_theft",
    "Violent\ncrime": "violent_crime",
    "Arson3": "Arson"}, inplace=True)

# Rape is complicated: it faces definition discrepancies and minuscule, variable reporting rates.  Let's drop it.
crime_df.drop(["Rape\n(revised\ndefinition)1", "Rape\n(legacy\ndefinition)2"], axis=1, inplace=True)

# convert arson NaNs to 0s (we'll use this for one particular purpose, then drop the column)
crime_df["Arson"] = crime_df["Arson"].fillna(0)
# remove our other NaNs (there are only three of them – last three rows – so this is fine)
crime_df.dropna(inplace=True)

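# City is already a string, but we need to retype all other columns to integers
for column in crime_df:
    if column != "City":
        try:
            # strip thousands separators and a trailing ".0" before casting
            crime_df[column] = (crime_df[column].astype(str)
                                .str.replace(r",|\.0$", "", regex=True)
                                .astype(int))
        except ValueError:
            # report any column that still fails to parse
            print(column)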
            
# Remove our four large outlier cities
crime_df = crime_df[crime_df["Population"]<190000]

crime_df.head()

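| | City | Population | violent_crime | Murder | Robbery | agg_assault | prop_crime | Burglary | Larceny | mv_theft | Arson |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams Village | 1861 | 0 | 0 | 0 | 0 | 12 | 2 | 10 | 0 | 0 |
| 1 | Addison Town and Village | 2577 | 3 | 0 | 0 | 3 | 24 | 3 | 20 | 1 | 0 |
| 2 | Akron Village | 2846 | 3 | 0 | 0 | 3 | 16 | 1 | 15 | 0 | 0 |
| 3 | Albany | 97956 | 791 | 8 | 227 | 526 | 4090 | 705 | 3243 | 142 | 0 |
| 4 | Albion Village | 6388 | 23 | 0 | 4 | 16 | 223 | 53 | 165 | 5 | 0 |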


# Feature engineering

It's important to note here that larceny, burglary, arson, and motor vehicle theft are all simply instances of property crime.  They go straight into the count.  From the FBI's website: "In the FBI’s Uniform Crime Reporting (UCR) Program, property crime includes the offenses of burglary, larceny-theft, motor vehicle theft, and arson. The object of the theft-type offenses is the taking of money or property, but there is no force or threat of force against the victims."

This means that in effect, given the assigned task of predicting property crime, we're actually trying to predict property crime that's not already included in those other columns, since we already have that data.  So we ought to be able to add up the four known property crime columns (burglary, larceny, mv_theft, and arson), and subtract them from the total property crime number.

However, trying to do this reveals a problem:

In [25]:
crime_df["recorded_propcrimes"] = (
    crime_df["mv_theft"]
    + crime_df["Arson"]
    + crime_df["Burglary"]
    + crime_df["Larceny"])
print(len(crime_df[crime_df["recorded_propcrimes"] > crime_df["prop_crime"]]))

52


It turns out that in 52 cases, the subcategory counts that meet the definition of property crime add up to more than the recorded property crime total.  And there would probably be many more such cases if we weren't missing nearly half the arson counts (164 of the 351 raw rows).  This points to a systematic error in the data: either the FBI's property crime column does not follow the stated definition of property crime, or the total is tallied independently and either it or the subcategory counts are wrong.

Since about half the arson data was missing, let's see what happens if we leave arson out of our recorded property crimes count (even though it's included in the FBI's explicit definition).

In [30]:
crime_df["recorded_propcrimes"] = (
    crime_df["mv_theft"]
    + crime_df["Burglary"]
    + crime_df["Larceny"])
print(len(crime_df[crime_df["recorded_propcrimes"] == crime_df["prop_crime"]])/len(crime_df))

1.0


Voila!  If you drop the arson column, the property crime count that we've been assigned to predict is exactly the sum of motor vehicle theft, burglary, and larceny <i>in every single row of the dataset</i>.

So if we do a reasonable job cleaning the data and setting up the assigned problem, we see that the problem does not actually exist.  There is no regression to be run, because once we discover the error in the FBI's definition, we simply have to add the subcategory columns to "predict" the desired target data.
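
To see why no fitting is needed, here is a minimal sketch on made-up counts. The `toy` frame and its numbers are hypothetical, invented for illustration and not taken from the FBI file. When the target is an exact sum of the features, ordinary least squares should recover a coefficient of about 1 for each feature and an intercept of about 0, up to floating-point error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical counts, invented for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Burglary": rng.integers(0, 500, size=50),
    "Larceny": rng.integers(0, 3000, size=50),
    "mv_theft": rng.integers(0, 100, size=50),
})
# Build the target the way the dataset appears to: a plain sum of the three
toy["prop_crime"] = toy["Burglary"] + toy["Larceny"] + toy["mv_theft"]

features = ["Burglary", "Larceny", "mv_theft"]
toy_model = LinearRegression().fit(toy[features], toy["prop_crime"])

print(np.round(toy_model.coef_, 6))                        # each close to 1
print(round(float(toy_model.intercept_), 6))               # close to 0
print(toy_model.score(toy[features], toy["prop_crime"]))   # R^2 close to 1
```

A "model" like this has learned nothing beyond addition, which is why the rest of this notebook switches to a different target.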

I'm hoping (albeit rather doubtfully) that this was an intentional trick question on Thinkful's part.  But because the purported point of the exercise was to run the linear regression that they spent this unit explaining, I'll give an attempt at using regression to predict something that actually makes sense to predict, such as predicting violent crime from nonviolent crime and population, or population from all recorded crimes.

# Predicting violent crime 

In [79]:
# randomly hold out 30% of rows as a cross-validation (CV) set
from sklearn.model_selection import train_test_split
from sklearn import linear_model

features = ["Population", "Burglary", "Larceny", "mv_theft"]
y = crime_df["violent_crime"]
X_train, X_cv, y_train, y_cv = train_test_split(crime_df[features], y, test_size=0.3)

# fit ordinary least squares, then report coefficients and R^2 on the held-out rows
rModel = linear_model.LinearRegression()
rModel.fit(X_train, y_train)
print(rModel.coef_)
print(rModel.score(X_cv, y_cv))

[-6.64693461e-04  2.04092680e-01  2.78646649e-02  2.50681517e+00]
0.845535435766062


But with a dataset this small, the score varies noticeably from one random split to the next.  Let's run this repeatedly and average:

In [111]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

def regtest(crime_df):
    """Fit on a fresh random 70/30 split; return coefficients and held-out R^2."""
    features = ["Population", "Burglary", "Larceny", "mv_theft"]
    y = crime_df["violent_crime"]
    X_train, X_cv, y_train, y_cv = train_test_split(crime_df[features], y, test_size=0.3)

    rModel = linear_model.LinearRegression()
    rModel.fit(X_train, y_train)
    return rModel.coef_, rModel.score(X_cv, y_cv)

# repeat over 200 random splits and average the results
rcoefs = []
rscores = []
for i in range(200):
    coef, score = regtest(crime_df)
    rscores.append(score)
    rcoefs.append(coef)
print(sum(rscores)/len(rscores))
print(sum(rcoefs)/len(rcoefs))

0.8148160576246721
[-3.39504785e-04  3.64411182e-01 -1.80048656e-03  2.15529344e+00]


So the model averages an R² of just over 0.8 on held-out data (`score` returns R², the coefficient of determination), and its largest coefficients are on motor vehicle theft and, secondarily, burglary.  Read loosely, motor vehicle theft and burglary are the strongest predictors of violent crime in a small or midsized city in NY state, with the caveat that raw coefficients depend on each feature's scale, so this ranking is only a rough guide.
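
Raw coefficients depend on each feature's units, so comparing them directly can mislead. One common way to put them on the same footing is to standardize the features before fitting. The sketch below uses synthetic data (the `big` and `small` features and their effect sizes are invented for illustration) and is not a rerun of the model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(1)
big = rng.normal(50_000, 20_000, size=300)   # population-like scale
small = rng.normal(20, 8, size=300)          # count-like scale
y_demo = 0.001 * big + 1.0 * small + rng.normal(0, 1, size=300)
X_demo = np.column_stack([big, small])

# Same model, fit once on raw features and once on standardized features
raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print(raw.coef_)     # the large-scale feature gets the tiny coefficient
print(scaled.coef_)  # effect per standard deviation, comparable across features
```

In this synthetic setup the feature with the smaller raw coefficient has the larger effect per standard deviation, so the ranking flips once the scales match. Whether that happens for the crime features would need to be checked on the real data.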

Finally, here's some pseudocode for a proper optimization function for feature selection here:
- Args: dataframe, target column name
- Returns: dataframe reduced to the optimized features


Execution:
- add a sqrt transform and a squared transform of every feature (all cols except target)
- run regression to produce coefficients
- set a usefulness threshold (e.g., 0.2)
- for each feature (including transforms) calculate usefulness.  A simple formula: feature coefficient times feature mean, divided by target mean.
- if it's below the usefulness threshold, discard the feature from the dataframe
- return df
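
The steps above can be sketched roughly as follows. This is one possible reading of the pseudocode, not a tested selection method: the function name `select_features`, the column suffixes, and the use of an absolute value in the usefulness score (so that negative coefficients are not discarded automatically) are all assumptions of mine. The sqrt transform also assumes non-negative features, as with counts.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def select_features(df, target, threshold=0.2):
    """Hypothetical helper: keep features whose usefulness meets the threshold."""
    features = df.drop(columns=[target])
    # Add sqrt and squared transforms of every feature (sqrt needs values >= 0)
    expanded = pd.concat(
        [features,
         np.sqrt(features).add_suffix("_sqrt"),
         (features ** 2).add_suffix("_sq")],
        axis=1,
    )
    model = LinearRegression().fit(expanded, df[target])
    # Usefulness: |coefficient * feature mean / target mean| (abs is an assumption)
    usefulness = np.abs(
        model.coef_ * expanded.mean().to_numpy() / df[target].mean()
    )
    keep = expanded.columns[usefulness >= threshold]
    # Return the surviving features together with the target column
    return pd.concat([expanded[keep], df[target]], axis=1)

# Synthetic demo data, invented for illustration
rng = np.random.default_rng(2)
demo = pd.DataFrame({"a": rng.uniform(1, 100, 200), "b": rng.uniform(1, 100, 200)})
demo["y"] = 3 * demo["a"] + rng.normal(0, 1, 200)
print(select_features(demo, "y").columns.tolist())
```

Because a feature and its sqrt and squared transforms are strongly correlated, their individual coefficients can be unstable, so the surviving set may change from sample to sample.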

Note: the feature elimination part of this could be accomplished with regularization that drives weak coefficients to zero (for example, an L1 penalty with a reasonable regularization parameter); but it may sometimes be useful to eliminate features like this for computation-time purposes before running regularized LR on the full dataset.
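
As a rough illustration of the regularization route, here is a minimal sketch using scikit-learn's `Lasso` (an L1 penalty) on synthetic data. The data, the choice of `alpha=0.5`, and the variable names are assumptions for the example; a real run would tune `alpha`, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five columns affect the target
rng = np.random.default_rng(3)
X_l = rng.normal(size=(200, 5))
y_l = 4.0 * X_l[:, 0] - 2.0 * X_l[:, 1] + rng.normal(0, 0.5, size=200)

# Standardize so the penalty treats every feature on the same scale
lasso = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X_l), y_l)

print(lasso.coef_)
kept = np.flatnonzero(lasso.coef_)   # indices of features with nonzero weight
print(kept)
```

With a penalty this strong, the three noise columns should end up with coefficients of exactly zero, while the two informative columns survive with shrunken weights.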

If you wanted to do a really good job - at more computational expense - you could do this with all 2-factor combinations as well.
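
One way to read "2-factor combinations" is as pairwise products of the features. Under that assumption, a minimal sketch looks like this. The helper name `add_pairwise_products` and the `_x_` naming are hypothetical, and the two sample rows reuse the first two rows of the table above.

```python
from itertools import combinations
import pandas as pd

def add_pairwise_products(features):
    """Hypothetical helper: append the product of every pair of columns."""
    out = features.copy()
    for left, right in combinations(features.columns, 2):
        out[f"{left}_x_{right}"] = features[left] * features[right]
    return out

# First two rows of the cleaned data (Adams Village, Addison Town and Village)
sample = pd.DataFrame({"Burglary": [2, 3], "Larceny": [10, 20], "mv_theft": [0, 1]})
with_pairs = add_pairwise_products(sample)
print(with_pairs.columns.tolist())
```

The number of added columns grows quadratically with the number of features, which is where the extra computational expense comes from.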