In [None]:
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd

# Predicting the temperament of ROUSes using k-NN classification
Using other data we have in the table, we want to predict the temperament of ROUSes.

In [None]:
rouses = pd.read_csv('ROUSes.csv')
print(rouses.shape)
rouses.head()

### Exploratory analysis
First, let's look at a scatterplot with the temperament represented as color and symbols to get a general idea of the data.

In [None]:
sns.scatterplot(data=rouses, x='Age',y='Length', hue='Temperament', style='Temperament')

As you can see, there are some clusters of the same temperament, which means those samples share a temperament with their nearest neighbors, so k-NN should work well for them.  But there are also definitely some samples that are more "alone", surrounded by neighbors with a different temperament, so k-NN won't predict those as well.
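To make the "same temperament as their neighbors" idea concrete, here is a minimal sketch of the majority vote that k-NN classification performs. The helper `knn_predict` and the `X_demo`/`y_demo` values are hypothetical and only for illustration; they are not taken from `ROUSes.csv`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (Age, Length) values and labels, for illustration only
X_demo = np.array([[1.0, 10.0], [1.2, 11.0], [0.9, 10.5],
                   [5.0, 30.0], [5.5, 31.0], [4.8, 29.0]])
y_demo = np.array(['Sleepy', 'Sleepy', 'Sleepy',
                   'No-nonsense', 'No-nonsense', 'No-nonsense'])

# A query that sits inside the first cluster takes that cluster's label
clustered_prediction = knn_predict(X_demo, y_demo, np.array([1.1, 10.2]), k=3)
print(clustered_prediction)
```

A query point inside a tight cluster gets a unanimous vote, while a point between clusters gets a split vote that can flip when k changes.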

In [None]:
rouses.describe()

### Normalize columns
The next cell normalizes the columns so the neighbor distance calculations will be scaled equivalently.  You can try skipping this cell to see the performance without scaling.
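As a rough illustration of why scaling matters, the sketch below uses a small made-up table (the `demo` values are assumptions, not ROUS measurements) and compares distances before and after min-max scaling. The helper `distance` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements on very different scales (illustration only)
demo = pd.DataFrame({'Age': [1.0, 2.0, 9.0, 10.0],
                     'Weight': [100.0, 130.0, 105.0, 400.0]})

def distance(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    return float(np.linalg.norm(frame.iloc[i].to_numpy() - frame.iloc[j].to_numpy()))

# Unscaled: Weight has the larger numbers, so it dominates the distance
raw_01, raw_02 = distance(demo, 0, 1), distance(demo, 0, 2)

# Min-max scale every column to the range [0, 1]
scaled = (demo - demo.min()) / (demo.max() - demo.min())
scaled_01, scaled_02 = distance(scaled, 0, 1), distance(scaled, 0, 2)

print('raw:   ', raw_01, raw_02)
print('scaled:', scaled_01, scaled_02)
```

Before scaling, row 0 looks closer to row 2 because their weights are similar; after scaling, the large age gap counts too and row 1 becomes the nearer neighbor.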

In [None]:
# First, try skipping this cell and see results without scaling
rouses['Age'] = (rouses['Age'] - rouses['Age'].min()) / (rouses['Age'].max() - rouses['Age'].min())  # min-max normalize the 'Age' column
rouses['Length'] = (rouses['Length'] - rouses['Length'].min()) / (rouses['Length'].max() - rouses['Length'].min())  # min-max normalize the 'Length' column
rouses['Weight'] = (rouses['Weight'] - rouses['Weight'].min()) / (rouses['Weight'].max() - rouses['Weight'].min())  # min-max normalize the 'Weight' column

rouses.head()

Okay, as usual let's follow the train and test process:

In [None]:
train = rouses.sample(frac= 0.8, random_state=1234) # 80% rows for training
test = rouses.drop(index=train.index) # rest of rows for testing
print(train.shape, test.shape)

In [None]:
y_train = train['Temperament']
X_train = train.drop(columns=['Temperament'])
print(X_train.shape, y_train.shape)

y_test = test['Temperament']
X_test = test.drop(columns=['Temperament']) 
print(X_test.shape, y_test.shape)

Notice that this is a very small number of train and test samples, so our results are going to be highly dependent on how the data is split.  Let's try k-NN classification:
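One way to see how much a small dataset depends on the split is to repeat the same train/test procedure with several `random_state` values. The sketch below does this on a synthetic 30-row table (the `demo` frame, its columns `x1`, `x2`, `label`, and the seeds are all assumptions for illustration, not the ROUSes data).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a small table (30 rows); not the ROUSes data
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x1': rng.random(30), 'x2': rng.random(30)})
demo['label'] = np.where(demo['x1'] + 0.3 * rng.standard_normal(30) > 0.5, 'A', 'B')

split_scores = []
for seed in range(10):
    # Same 80/20 split pattern as above, with a different seed each time
    train_demo = demo.sample(frac=0.8, random_state=seed)
    test_demo = demo.drop(index=train_demo.index)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(train_demo[['x1', 'x2']], train_demo['label'])
    split_scores.append(model.score(test_demo[['x1', 'x2']], test_demo['label']))

print('Test accuracy per split:', np.round(split_scores, 2))
print('Mean:', np.mean(split_scores), 'Std:', np.std(split_scores))
```

With only 6 test rows, each misclassified sample moves the accuracy by about 17 percentage points, so the spread across seeds can be large.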

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print('Train score:',knn.score(X_train, y_train))
print('Test score:',knn.score(X_test, y_test))


In [None]:
print(X_test)
print('Prediction:',knn.predict(X_test))
print('Actual:',list(y_test))

Well, predicting 4 out of 6 isn't great, but it is actually better than expected considering the Train score.  Predicting temperament is a pretty tough challenge!  Try different values for k (`n_neighbors`) to see what changes.
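Trying several values of k is easy to do in a loop. The following sketch uses synthetic data (the arrays `X_demo` and `y_demo` and the list of k values are assumptions, not the ROUSes table) and records the train and test score for each k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data with noisy labels (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.random((40, 2))
y_demo = np.where(X_demo[:, 0] + 0.2 * rng.standard_normal(40) > 0.5, 'A', 'B')
X_tr, y_tr = X_demo[:32], y_demo[:32]
X_te, y_te = X_demo[32:], y_demo[32:]

scores_by_k = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this k
    scores_by_k[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f'k={k}: train={scores_by_k[k][0]:.2f}, test={scores_by_k[k][1]:.2f}')
```

With k=1 every training point is its own nearest neighbor, so the train score is perfect and says little about how well the model generalizes.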

# Predicting the weight of ROUSes using k-NN
Using other data we have in the table, we want to predict the weight of ROUSes.

In [None]:
rouses = pd.read_csv('ROUSes.csv')
print(rouses.shape)
rouses.head()

In [None]:
sns.scatterplot(data=rouses, x='Length',y='Weight')

In our previous linear regression work, we were able to use `Age` to predict `Weight` quite well because the correlation was close to linear.  Let's try using `Length` instead, and see how well we can predict despite the relationship being less linear.

In [None]:
rouses = rouses.drop(columns=['Temperament','Age']) # drop the columns 'Temperament' and 'Age'
rouses.head()

Train and test!

In [None]:
train = rouses.sample(frac= 0.8, random_state=4321) # 80% rows for training
test = rouses.drop(index=train.index) # rest of rows for testing
print(train.shape, test.shape)

The next thing to do is to separate out the target data `Weight` from the predictor data (everything else; in this case just `Length` is left).

In [None]:
y_train = train['Weight']
X_train = train.drop(columns=['Weight'])
print(X_train.shape, y_train.shape)

y_test = test['Weight']
X_test = test.drop(columns=['Weight']) 
print(X_test.shape, y_test.shape)

Okay, let's try using distance-weighted k-NN for regression, where closer neighbors count more toward the prediction:
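As a small sanity check of what `weights='distance'` does, the sketch below compares a scikit-learn prediction with a hand-computed average in which each neighbor's target is weighted by the inverse of its distance. The numbers in `X_demo`, `y_demo`, and `query` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature training data (illustration only)
X_demo = np.array([[1.0], [2.0], [4.0], [8.0]])
y_demo = np.array([10.0, 20.0, 40.0, 80.0])
query = np.array([[2.5]])

model = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X_demo, y_demo)
sklearn_prediction = model.predict(query)[0]

# Manual version: find the 3 nearest points and weight each target by 1/distance
distances = np.abs(X_demo[:, 0] - query[0, 0])
nearest = np.argsort(distances)[:3]
weights = 1.0 / distances[nearest]
manual_prediction = np.sum(weights * y_demo[nearest]) / np.sum(weights)

print(sklearn_prediction, manual_prediction)
```

The nearest point (at distance 0.5) gets three times the weight of the two points at distance 1.5, so the prediction is pulled toward its target value.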

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5,weights='distance')
knn.fit(X_train, y_train)
print('Train score:',knn.score(X_train, y_train))
print('Test score:',knn.score(X_test, y_test))


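For regression, the "score" is the R² value. A value near 1 is what we are hoping for; 0 means the model does no better than always predicting the mean `Weight`, and the score can even go negative for a model that does worse than that.  So our model is doing a very good job at predicting the data!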
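For reference, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical target values and a hypothetical helper `r_squared`, and checks it against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical targets

perfect = r_squared(y_true, y_true)                       # exact predictions
mean_only = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
reversed_pred = r_squared(y_true, y_true[::-1])           # predictions worse than the mean

print(perfect, mean_only, reversed_pred)
print(r2_score(y_true, y_true[::-1]))
```

Exact predictions score 1, always predicting the mean scores 0, and predictions that are worse than the mean score below 0.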

To visualize, we can plug in the test values and plot the predicted outputs on top of the training data:

In [None]:
predictions = knn.predict(X_test)

fig, ax = plt.subplots(1)
# Pass data= and ax= explicitly so both layers are drawn on the same axes
sns.scatterplot(data=train, x='Length', y='Weight', label="training data", ax=ax)
ax.scatter(test['Length'], predictions, label="predictions")
ax.set(xlabel='Length', ylabel='Weight')
ax.legend()

Indeed, it looks like these predictions follow the trend of the data! You can try different values for k to see how the results change. 

# One-hot encoding

## Predicting the weight using length and temperament.

Let's try adding in `Temperament` along with `Length` to help predict `Weight`.  

In [None]:
rouses = pd.read_csv('ROUSes.csv')
rouses=rouses.drop(columns=['Age'])
rouses.head()

Let's see how weight varies for different `Temperament`:

In [None]:
sns.boxplot(data=rouses, x='Weight',y='Temperament')

Uh-oh, `Temperament` might not help much to predict `Weight`!  The wide range of `Weight` for the `Sleepy` category means that `Temperament` won't be a very good predictor of `Weight` for ROUSes in that category, and it might actually end up hurting our predictions.  But the tighter range for `No-nonsense` is more promising.  

As we learned in lecture, to do distance calculations for a categorical feature we'll use one-hot encoding to convert `Temperament` to 4 new columns, one per category, each holding 0 or 1.  Also, we need to normalize the `Length` feature so that it is on a similar 0-to-1 scale as the one-hot columns.  
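Here is a minimal sketch of what one-hot encoding does to distances, using a tiny made-up table. The `demo` frame is hypothetical, and `Grumpy` is an invented category name used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with identical Length; 'Grumpy' is a made-up category
demo = pd.DataFrame({'Length': [0.2, 0.2, 0.2],
                     'Temperament': ['Sleepy', 'No-nonsense', 'Grumpy']})

# Each category becomes its own 0/1 column; numeric columns pass through
encoded = pd.get_dummies(demo, dtype=int)
print(encoded)

# Two rows with different categories differ in exactly two one-hot columns,
# so every pair of distinct categories is the same distance apart: sqrt(2)
row0, row1, row2 = encoded.to_numpy(dtype=float)
d01 = np.linalg.norm(row0 - row1)
d02 = np.linalg.norm(row0 - row2)
print(d01, d02)
```

Because a category change always adds sqrt(2) to the distance while a normalized `Length` difference is at most 1, the one-hot columns can easily dominate the neighbor search.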

In [None]:
rouses=pd.get_dummies(rouses,dtype=int)
rouses['Length']= (rouses['Length']-rouses['Length'].min())/(rouses['Length'].max()-rouses['Length'].min()) # normalize Length
rouses.head(10)

So easy!  Okay, let's do the prediction for `Weight` again, now using `Length` and `Temperament` one-hot features.

In [None]:
train = rouses.sample(frac= 0.8, random_state=4321) # 80% rows for training
test = rouses.drop(index=train.index) # rest of rows for testing
print(train.shape, test.shape)

In [None]:
y_train = train['Weight']
X_train = train.drop(columns=['Weight'])
print(X_train.shape, y_train.shape)

y_test = test['Weight']
X_test = test.drop(columns=['Weight']) 
print(X_test.shape, y_test.shape)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5,weights='distance')
knn.fit(X_train, y_train)
print('Train score:',knn.score(X_train, y_train))
print('Test score:',knn.score(X_test, y_test))


In [None]:
predictions = knn.predict(X_test)

fig, ax = plt.subplots(1)
# Pass data= and ax= explicitly so both layers are drawn on the same axes
sns.scatterplot(data=train, x='Length', y='Weight', label="training data", ax=ax)
ax.scatter(test['Length'], predictions, label="predictions")
ax.set(xlabel='Length', ylabel='Weight')
ax.legend()

Indeed, it looks like adding the `Temperament` feature actually hurt prediction performance.  Unlike linear regression, k-NN has no coefficients that adjust the importance of each feature: every feature contributes to the distance calculation, so it is very important to choose useful features for k-NN.