In [1]:
import numpy as np
import random as py_random
import numpy.random as np_random
import time
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

sns.set(style="whitegrid")

# Module 12 Lab - Distance Based Machine Learning

## Directions

1. Show all work/steps/calculations. If it is easier to write it out by hand, do so and submit a scanned PDF in addition to this notebook. Otherwise, generate a Markdown cell for each answer.
2. You must submit to **two** places by the deadline:
    1. In the Lab section of the Course Module where you downloaded this file from, and
    2. In your Lab Discussion Group, in the forum for the appropriate Module.
3. You may use any core Python libraries or Numpy/Scipy. **Additionally, code from the Module notebooks and lectures is fair to use and modify.** You may also consult Stack Overflow (SO). If you use something from SO, please add a comment with the URL to document the code.

We're getting to the point in the semester where you should know the drill.

This module covered 3 basic problems: supervised learning (classification, regression), unsupervised learning (clustering) and recommenders (collaborative filtering based systems related to missing value imputation) using distance/similarity. We're only going to cover the first 2 in this lab.

You should definitely use [Scikit Learn](http://scikit-learn.org/stable/) and refer to the documentation for this assignment.

Remember to create a new random seed for each experiment (if needed) and save it.

**Problem 1. kNN Regression**

Use k-Nearest Neighbors *regression* for the insurance data set. Make sure you do the following:

1. Pick an appropriate evaluation metric.
2. Use validation curves to find the best value of k.
3. Use learning curves to determine whether the model is high bias or high variance, and suggest ways to improve it.
4. Use 10-fold cross validation to estimate the mean metric and its credible interval.
5. Was this better than the best linear regression model you estimated in Lab 11? Use Bayesian statistical inference to generate and evaluate the posterior distribution of the difference of means.
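
Steps 2 and 3 can be sketched with scikit-learn's `validation_curve` and `learning_curve`. The block below is a minimal sketch, not the lab solution: it assumes a synthetic data set (`X_demo`, `y_demo`) in place of the insurance data, RMSE as the metric, and a `StandardScaler` in front of the regressor because kNN distances depend on feature units. All variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance data (assumption: illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Scale features first so that no single feature dominates the distance.
model = make_pipeline(StandardScaler(), KNeighborsRegressor())
k_values = np.arange(1, 31)

# Validation curve: 10-fold scores for every candidate k.
train_scores, test_scores = validation_curve(
    model, X_demo, y_demo,
    param_name="kneighborsregressor__n_neighbors",
    param_range=k_values,
    cv=10,
    scoring="neg_root_mean_squared_error",
)

# The scorer returns negated RMSE, so flip the sign to get errors.
train_rmse = -train_scores.mean(axis=1)
test_rmse = -test_scores.mean(axis=1)
best_k = int(k_values[np.argmin(test_rmse)])
print("best k by cross-validated RMSE:", best_k)

# Learning curve at the chosen k: error as a function of training set size.
model.set_params(kneighborsregressor__n_neighbors=best_k)
sizes, lc_train, lc_test = learning_curve(
    model, X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=10,
    scoring="neg_root_mean_squared_error",
)
for n, tr, te in zip(sizes, -lc_train.mean(axis=1), -lc_test.mean(axis=1)):
    print(f"n={n:3d}  train RMSE={tr:.3f}  test RMSE={te:.3f}")
```

A large, persistent gap between the training and test curves suggests high variance (try a larger k or more data). Two curves that converge at a high error suggest high bias (try a smaller k or better features).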

In [3]:
data_raw = pd.read_csv("insurance.csv")
data = pd.get_dummies(data_raw)
data.head()

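|   | age | bmi | children | charges | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|-----|-----|----------|---------|------------|----------|-----------|------------|------------------|------------------|------------------|------------------|
| 0 | 19 | 27.9 | 0 | 16884.924 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | 33.77 | 1 | 1725.5523 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 33.0 | 3 | 4449.462 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | 22.705 | 0 | 21984.47061 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | 28.88 | 0 | 3866.8552 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |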


In [5]:
# Every column except the target is a feature.
independent_variables = [col for col in data.columns if col != "charges"]
dependent_variable = "charges"
X = data[independent_variables]
y = data[dependent_variable]

In [23]:
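from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Baseline model with k=2; the features are left unscaled here.
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)
# score() returns R^2; here it is evaluated on the training data, so it is optimistic.
neigh.score(X, y)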

0.7638422434725783
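
The score above is computed on the same rows the model was fit to, so it overstates how well the model generalizes. As a rough illustration, the sketch below compares training and held-out $R^2$ for the same `n_neighbors=2` setting. It assumes a synthetic data set (`X_demo`, `y_demo`) rather than the insurance data, and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumption: stands in for the real features and target).
rng = np.random.default_rng(7)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)

# Hold out 25% of the rows so the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7
)

knn = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
train_r2 = knn.score(X_train, y_train)  # each point is its own nearest neighbor
test_r2 = knn.score(X_test, y_test)     # honest estimate on unseen rows
print(f"training R^2 = {train_r2:.3f}, held-out R^2 = {test_r2:.3f}")
```

With a small k, every training point counts itself among its own neighbors, which is why the training score tends to sit well above the held-out score.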

In [27]:
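from sklearn.model_selection import KFold

# 10-fold cross validation; folds are contiguous blocks because shuffle defaults to False.
kf = KFold(n_splits=10)

for train_index, test_index in kf.split(X):
    # kf.split yields positional indices, so select rows with iloc.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    neigh.fit(X_train, y_train)
    y_predict = neigh.predict(X_test)
    # First value: R^2 on the full data set (includes the training rows).
    # Second value: R^2 on the held-out fold only.
    print(neigh.score(X, y), r2_score(y_test, y_predict))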

0.7082014580277243 0.32928941878155393
0.7074439010761796 0.1394458310483273
0.7097466982739462 0.13099860360330917
0.7242681634187055 0.3547717529457989
0.6901787171060927 0.1766292382149549
0.7004705862208795 0.11260716799725978
0.685501150253028 0.04761746018108304
0.7205131705723737 0.4007020054640634
0.7082400515417288 0.16942805243764536
0.7248504482747424 0.4844096643376389
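
One way to turn the ten held-out scores into a mean and an interval is a non-parametric bootstrap, used here as an approximation to the posterior of the mean. This is a sketch under stated assumptions: the fold values are the second column printed above, rounded to four decimals, and the seed and the number of resamples are arbitrary choices recorded for reproducibility.

```python
import numpy as np

# Held-out R^2 per fold, rounded from the second column of the output above.
fold_r2 = np.array([0.3293, 0.1394, 0.1310, 0.3548, 0.1766,
                    0.1126, 0.0476, 0.4007, 0.1694, 0.4844])

seed = 20240101  # arbitrary seed, saved so the experiment can be repeated
rng = np.random.default_rng(seed)

# Resample the fold scores with replacement and record the mean of each resample.
n_resamples = 10_000
resampled = rng.choice(fold_r2, size=(n_resamples, fold_r2.size), replace=True)
boot_means = resampled.mean(axis=1)

# The 2.5th and 97.5th percentiles give an approximate 95% interval for the mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
mean_r2 = fold_r2.mean()
print(f"mean R^2 = {mean_r2:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

The same resampling applied to two sets of fold scores (for example kNN and a linear model evaluated on identical folds) yields a distribution of the difference of means.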


**Problem 2. Clustering**

Use k-Means Clustering on clustering problems of your own creation in two dimensions ($x_1$ and $x_2$). You should explore the following points: 

1. What if the data has no clusters (there are no hidden categorical variables)?
2. Now assume that you have some "hidden" categorical variable and that the clusters are compact and distinct, with the same variance. What does the Elbow Method show for the k you should use?
3. Now assume that you have some "hidden" categorical variable and that the clusters are dispersed, or have different variances. What does the Elbow Method show for the k you should use?
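
The first two scenarios can be sketched by comparing inertia curves. The block below is one possible setup, not a prescribed one: `inertia_curve` is a hypothetical helper, the uniform square stands in for "no clusters", and the three blob centers, the standard deviation, and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(points, k_values, seed=0):
    """Return the k-means within-cluster sum of squares for each k (hypothetical helper)."""
    return np.array([
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points).inertia_
        for k in k_values
    ])


k_values = list(range(1, 9))

# Scenario 1: no hidden categorical variable, points spread uniformly over a square.
rng = np.random.default_rng(12)
uniform_points = rng.uniform(0, 6, size=(300, 2))

# Scenario 2: three compact, well-separated clusters with equal variance.
blob_points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=12,
)

uniform_inertia = inertia_curve(uniform_points, k_values)
blob_inertia = inertia_curve(blob_points, k_values)

# Express each curve as a fraction of its k=1 inertia so the two are comparable.
for k, u, b in zip(k_values,
                   uniform_inertia / uniform_inertia[0],
                   blob_inertia / blob_inertia[0]):
    print(f"k={k}  uniform={u:.3f}  blobs={b:.3f}")
```

For the blobs, the relative inertia should collapse at k=3 and flatten afterwards, which is the elbow. For the uniform data, the curve should decline gradually with no sharp bend. Increasing `cluster_std`, or passing a list of different values, is one way to explore the third scenario.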

In [29]:
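from sklearn.cluster import KMeans
import numpy as np

# Toy data: two vertical groups of three points, at x=1 and x=4.
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# The predict() result is not displayed because only the cell's last expression is echoed.
kmeans.predict([[0, 0], [4, 4]])
kmeans.cluster_centers_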


array([[1., 2.],
       [4., 2.]])