You now know two kinds of regression and two kinds of classifier. So let's use that to compare models!

Comparing models is something data scientists do all the time. A given problem rarely has only one workable model, so learning to choose the best one is very important.

Here let's work on regression. Find a data set and build a KNN regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. Submit a link to your notebook below.
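
One way to put the two models on equal footing is to score both with the same cross-validation folds. The sketch below is only an illustration: it uses synthetic data (an assumption, not the data set loaded in this notebook), and the names `models` and `scores` are made up for the example.

```python
import numpy as np
from sklearn import linear_model, neighbors
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the notebook's data set):
# a noisy straight-line relationship between one feature and the target.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

models = {
    'OLS': linear_model.LinearRegression(),
    'KNN (k=5)': neighbors.KNeighborsRegressor(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    # With no scoring argument, regressors are scored by R^2; cv=5 uses five folds.
    scores[name] = cross_val_score(model, X, y, cv=5)
    print('{}: mean R^2 = {:.3f} (std {:.3f})'.format(
        name, scores[name].mean(), scores[name].std()))
```

Looking at the spread across folds, and not only the mean, helps show whether one model is consistently better or just better on some slices of the data.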

In [1]:
# Import data science environment.
import math
import warnings

from IPython.display import display
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import neighbors
from sklearn import linear_model 
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
import statsmodels.formula.api as smf

# Display preferences
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress a harmless scipy warning.
warnings.filterwarnings(
    action='ignore',
    module='scipy',
    message='^internal gelsd'
)
warnings.filterwarnings('ignore')



In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv', skiprows=3, header=1)
df.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0.0,,0,0,0,12,2,10,0,0.0
1,Addison Town and Village,2577,3,0.0,,0,0,3,24,3,20,1,0.0
2,Akron Village,2846,3,0.0,,0,0,3,16,1,15,0,0.0
3,Albany,97956,791,8.0,,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0.0,,3,4,16,223,53,165,5,


In [3]:
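# Preview a shorter column name. rename() returns a new DataFrame and is not
# assigned here, so df keeps its original column names in later cells.
df.rename(columns={"Murder and\nnonnegligent\nmanslaughter":"Murder"})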

Unnamed: 0,City,Population,Violent crime,Murder,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0.000,,0,0,0,12,2,10,0,0.000
1,Addison Town and Village,2577,3,0.000,,0,0,3,24,3,20,1,0.000
2,Akron Village,2846,3,0.000,,0,0,3,16,1,15,0,0.000
3,Albany,97956,791,8.000,,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0.000,,3,4,16,223,53,165,5,
5,Alfred Village,4089,5,0.000,,0,3,2,46,10,36,0,
6,Allegany Village,1781,3,0.000,,0,0,3,10,0,10,0,0.000
7,Amherst Town,118296,107,1.000,,7,31,68,2118,204,1882,32,3.000
8,Amityville Village,9519,9,0.000,,2,4,3,210,16,188,6,1.000
9,Amsterdam,18182,30,0.000,,0,12,18,405,99,291,15,0.000


In [4]:
# Eliminate commas from number > 999.
def convert_number(number):
    try:
        converted = float(number.replace(',', ''))
    except (AttributeError, ValueError):
        converted = number
        
    return converted
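
A vectorized alternative to the helper above is to let pandas strip the separators and coerce the result in one pass. This is a minimal sketch on a made-up Series (the values in `raw` are hypothetical), not a drop-in replacement for the cells that follow. `pd.read_csv` also accepts a `thousands=','` argument, which may remove the need for any cleanup when loading.

```python
import pandas as pd

# Hypothetical sample values that mimic comma-formatted counts.
raw = pd.Series(['1,861', '97,956', '12', None])

# Remove thousands separators, then convert to numbers.
# errors='coerce' turns anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(cleaned.tolist())
```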

In [5]:
# Convert comma-formatted object columns to floats and add a squared population term.
df['Population'] = df['Population'].apply(convert_number)
df['Population^2'] = df['Population']**2
df['Murder'] = df['Murder and\nnonnegligent\nmanslaughter'].apply(convert_number)
df['Robbery'] = df['Robbery'].apply(convert_number)
df['Property_Crime'] = df['Property\ncrime'].apply(convert_number)

In [6]:
# Eliminate final three rows of text after data.
df = df[:348]

In [7]:
# Create new data frame with only relevant columns.
df_fbi = df[['City', 'Population', 'Population^2', 'Murder', 'Robbery', 'Property_Crime']]

In [8]:
# Preview new data frame.
df_fbi.head()

Unnamed: 0,City,Population,Population^2,Murder,Robbery,Property_Crime
0,Adams Village,1861.0,3463321.0,0.0,0.0,12.0
1,Addison Town and Village,2577.0,6640929.0,0.0,0.0,24.0
2,Akron Village,2846.0,8099716.0,0.0,0.0,16.0
3,Albany,97956.0,9595377936.0,8.0,227.0,4090.0
4,Albion Village,6388.0,40806544.0,0.0,4.0,223.0


In [9]:
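# Scale only the numeric columns; 'City' holds strings and cannot be standardized.
names = ['Population', 'Population^2', 'Murder', 'Robbery', 'Property_Crime']
df_fbi_scaled = pd.DataFrame(preprocessing.scale(df_fbi[names]), columns=names)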

In [None]:
# Build our KNN model.
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
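# scikit-learn expects a 2-D feature matrix, so select the column as a frame.
X = df_fbi_scaled[['Population']].values
Y = df_fbi_scaled.Property_Crime.values.reshape(-1, 1)
knn.fit(X, Y)

# Set up our prediction line across the range of the scaled feature.
T = np.linspace(X.min(), X.max(), 200)[:, np.newaxis]

# Trailing underscores are a common convention for a prediction.
Y_ = knn.predict(T)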

plt.scatter(X, Y, c='k', label='data')
plt.plot(T, Y_, c='g', label='prediction')
plt.legend()
plt.title('K=5, Unweighted')
plt.show()
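
To see whether the two models "miss in different ways", it can help to fit both on the same data and compare their residuals. The sketch below uses synthetic curved data (an assumption standing in for the scaled crime data), and all variable names are illustrative. It measures error on the training data only, which flatters KNN; a fair comparison would also use held-out data.

```python
import numpy as np
from sklearn import linear_model, neighbors

# Synthetic curved data (an assumption, not the notebook's data set).
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-2, 2, size=(300, 1)), axis=0)
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

ols = linear_model.LinearRegression().fit(X, y)
knn = neighbors.KNeighborsRegressor(n_neighbors=5).fit(X, y)

# Residuals are actual minus predicted values, here on the training data.
ols_resid = y - ols.predict(X)
knn_resid = y - knn.predict(X)

# Root mean squared error summarizes the typical size of each model's misses.
ols_rmse = np.sqrt(np.mean(ols_resid ** 2))
knn_rmse = np.sqrt(np.mean(knn_resid ** 2))
print('OLS RMSE: {:.3f}'.format(ols_rmse))
print('KNN RMSE: {:.3f}'.format(knn_rmse))
```

Plotting each set of residuals against `X` would show the pattern behind the numbers: a straight line cannot follow a curve, so its misses are systematic, while KNN's misses on this kind of data look closer to noise.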