# End of semester review lab

**Instructions**

This lab is designed to be a review of some of the key concepts and coding from the second half of the semester.

Your task is to predict the wage in the following data set using all of the available features.  You should do this using two separate models, and compare which model has the lowest RMSE (Root Mean Squared Error).  The two models you should use are linear regression and K-nearest neighbor, with K = 1.  Note that we are using K nearest neighbor in a new way because this is not a classification problem.  The K-nearest neighbor should simply use the nearest neighbor's wage as the prediction wage.

You will need to follow these steps

## 1. Setup
Run the two cells below to load the packages and data.

The dataset CPS85 contains data on 534 individuals surveyed in the year 1985.

    wage = The hourly wage
    educ = years of education
    sex = sex (male or female)
    exper = years of experience
    union = whether or not the person was in a union

In [2]:
from datascience import *
import numpy as np
import matplotlib
%matplotlib inline

In [3]:
CPS85 = Table.read_table("CPS85_small.csv")
CPS85

wage,educ,sex,exper,union
9.0,10,M,27,Not
5.5,12,M,20,Not
3.8,12,F,4,Not
10.5,12,F,29,Not
15.0,12,M,40,Union
9.0,16,F,27,Not
9.57,12,F,5,Union
15.0,14,M,22,Not
11.0,8,M,42,Not
5.0,12,F,14,Not


**Step 1: Data Preparation**
1. You can't fit a model with categorical data the such as sex and union as text.  You will need to convert those columns into numbers.  Fortunately since both are binary variables, you just need to convert M to 1 and F to 0, and Union to 1 and Not to 0.

2. Convert each column to standard units

3. Save the cleaned data as CPS85_clean

4. Create a train/test split with 70% of the data for training and 30% for testing.  Use a random seed of 1234 so that we all get the same results when we do the train test split.  There should be 374 training observations and 160 test observations.  Name the resulting data sets train and test.  




In [4]:
def sex_to_b(text):
    if text == 'M':
        return 1
    else: return 0

In [5]:
def union_to_b(text):
    if text == 'union':
        return 1
    else: return 0

In [6]:
CPS85.apply(sex_to_b, 'sex')

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,

In [7]:
CPS85_clean = (CPS85
    .with_column('sex', CPS85.apply(sex_to_b, 'sex'))
    .with_column('union', CPS85.apply(union_to_b, 'union'))
)
CPS85_clean

wage,educ,sex,exper,union
9.0,10,1,27,0
5.5,12,1,20,0
3.8,12,0,4,0
10.5,12,0,29,0
15.0,12,1,40,0
9.0,16,0,27,0
9.57,12,0,5,0
15.0,14,1,22,0
11.0,8,1,42,0
5.0,12,0,14,0


In [8]:
np.random.seed(1234)
shuffled = CPS85_clean.sample(with_replacement=False) # Randomly permute the rows
train = shuffled.take(np.arange(374))
test  = shuffled.take(np.arange(374, CPS85_clean.num_rows))

In [9]:
train

wage,educ,sex,exper,union
8.75,12,1,9,0
12.0,14,0,10,0
20.4,17,1,3,0
13.0,12,1,25,0
8.5,12,1,13,0
7.0,12,0,10,0
4.35,11,0,20,0
6.0,7,0,15,0
8.0,12,0,8,0
10.0,10,0,25,0


# Regression Model 

**Step 2: Regression Model** 

1. Define a function that calculates the root mean squared error of a regression model that predicts the wage using all other features.  Note: You have only done this before with one 'x' variable.  Now there are more than one.  All you have to do is to add a 'slope' for each variable. For example the mathematical equation for the predicted value is:

$fitted = slope_1*educ + slope_2*sex + slope_3*exper + slope_4*union + intercept$

Once you have the fitted value for each data point, you can use it to calculate the RMSE for the model

*Hint* This [example from the textbook](https://umass-data-science.github.io/190fwebsite/textbook/15/3/method-of-least-squares/) should help you.  This was also in the last lab we did.

2. Use the minimize function to find the slopes and intercept that minimize the the RMSE.  These slopes and intercepts are your model, and the RMSE that these slopes give you is the *training RMSE of your model*

In [10]:
def rmse(slope1, slope2, slope3, slope4, intercept):
    e = train.column('educ')
    s = train.column('sex')
    ex = train.column('exper')
    u = train.column('union')
    w = train.column('wage')
    fitted = slope1 * e + slope2 * s + (slope3 * ex + slope4 * u) + intercept
    rmse = np.sqrt(np.mean((w - fitted) ** 2))
    return rmse

In [12]:
model_coefs = minimize(rmse)
model_coefs#these "slopes" minimize the RMSE

array([ 9.31258141e-01,  2.14693454e+00,  1.04060094e-01,  3.15191442e+02,
       -6.12851452e+00])

**Step 3: Calculate test set RMSE**

1.  Use the slopes and intercept you calculated from your training data to predict the wage of each observation in your test set.

2.  Calculate the RMSE of those predictions.  This is the *test set RMSE of your model*.

In [16]:
#define a function to calculate test set RMSE notice we use test set here
def rmse_test(slope1, slope2, slope3, slope4, intercept):
    e = test.column('educ')
    s = test.column('sex')
    ex = test.column('exper')
    u = test.column('union')
    w = test.column('wage')
    fitted = slope1 * e + slope2 * s + (slope3 * ex + slope4 * u) + intercept
    rmse = np.sqrt(np.mean((w - fitted) ** 2))
    return rmse

4.48293960279929

In [17]:
#RMSE of test set
rmse_test(model_coefs[0], model_coefs[1], model_coefs[2], model_coefs[3], model_coefs[4])

4.335290672696314

In [18]:
#RMSE of training set.
#notice the test RMSE is very close, and a bit smaller this is a little unusual but not 
#so unusual as to worry.
rmse(model_coefs[0], model_coefs[1], model_coefs[2], model_coefs[3], model_coefs[4])

4.48293960279929

# K-nearest Neighbor Model

Note you should use the same training and test data for this model as you did for the regression.

1. Create a function or series of functions that finds the nearest neighbor in the training data of a single row of data.  You will find [this notebook from lecture helpful](http://datahub.cs.umass.edu/hub/user-redirect/git-sync?repo=https://github.com/umass-data-science/materials-fa20&subPath=lec/lec22.ipynb)

*hint:* you want the closest() function to work.  You don't need the majority_class or classify functions because this is a regression problem.








In [25]:
 def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(sum((pt1 - pt2)**2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

def distances(train, example):
    """Compute distance between example and every row in train.
    Return train augmented with Distance column"""
    distances = make_array()
    attributes = train.drop('Class')
    for row in attributes.rows:
        distances = np.append(distances, row_distance(row, example))
    return train.with_column('Distance', distances)

def closest(train, example, k):
    """Return a table of the k closest neighbors to example"""
    return distances(train, example).sort('Distance').take(np.arange(k))

2.  Use you function to find the nearest neighbor (k=1) in the training set of each observation in the test set.  The value of the nearest neighbor's wage in the training set is the prediction of the value of wage for the test observation.

In [39]:
predicted_wage_array = make_array()
for i in np.arange(test.num_rows):
    predicted_wage = closest(train, test.row(i), 1)[0][0]#this chooses the wage of the nearest neighbor
    predicted_wage_array = np.append(predicted_wage_array, predicted_wage)

13.757630609956061

3.  Calculate the RMSE of the predicted wage using your nearest neighbor prediction.  This is the RMSE of your nearest neighbor model

In [41]:
np.sqrt(np.sum((fitted_wage_array - test.column('wage'))**2))

13.757630609956061

**Results:** What are the RMSEs of your two models?  Which one has a lower RMSE on the test set? 

RMSE of the linear regression model is 4.48, RMSE of the nearest neighbor model is 13.7, so the linear regression model has a lower test set error

# Optional, what about larger values of K?

In [45]:
closest(train, test.row(0), 4)[0].mean()

9.48

In [55]:
k = 4
predicted_wage_array = make_array()
for i in np.arange(test.num_rows):
    predicted_wage = closest(train, test.row(i), k)[0].mean()#this chooses the mean wage of the nearest neighbor
    predicted_wage_array = np.append(predicted_wage_array, predicted_wage)

In [56]:
np.sqrt(np.sum((predicted_wage_array - test.column('wage'))**2))

10.696611028732416