# Lab 11: Anonymization Using PCA

In [None]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error, r2_score

import matplotlib.pyplot as plt

In this lab we will use the same Scottish traffic dataset as last week and preprocess it in the same way. This week, however, we'll try a different approach to anonymization: principal component analysis (PCA).

In [None]:
df = pd.read_csv('scotland-traffic.csv', parse_dates=['count_date'])

We will extract the month from the date and use that as a separate feature. We will also recode the road type. 

In [None]:
df['month'] = pd.DatetimeIndex(df['count_date']).month

In [None]:
df['road_type'] = df['road_type'] == 'Major'

We'll drop the records that are missing the outcome variable (`all_motor_vehicles`). For now we'll replace all other missing values with 0.

In [None]:
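# Keep only the rows where the outcome is present; .copy() gives us an independent frame to modify
df = df[df['all_motor_vehicles'].notna()].copy()
# Replace the remaining missing values with 0
df = df.fillna(0)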

In [None]:
df.head()

## Predicting Number of Motor Vehicles, Given Other Features (No Anonymization)

This section of the notebook is the same as last week. In our first experiment, we'll try to predict the amount of motor vehicle traffic (`all_motor_vehicles`) from the other features, with no anonymization of the data.

In [None]:
features = ['year', 'hour', 'month', 'latitude', 'longitude', 'link_length_km', 'pedal_cycles', 'two_wheeled_motor_vehicles', 'buses_and_coaches', 'road_type']

X = df[features]

In [None]:
y = df['all_motor_vehicles']

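We'll standardize the features so that they are all on a comparable scale. This matters for distance-based models such as k-nearest neighbors, where features with large numeric ranges would otherwise dominate the distance.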

In [None]:
X = preprocessing.StandardScaler().fit_transform(X)
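
To make the rescaling step concrete, here is a minimal sketch of what standardization does to each column: subtract the column mean and divide by the column standard deviation. The toy values and the names `X_toy` and `X_toy_scaled` are made up for illustration and are not part of the traffic dataset.

```python
import numpy as np

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[2000.0, 7.0],
                  [2010.0, 12.0],
                  [2020.0, 17.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)
X_toy_scaled = (X_toy - col_mean) / col_std

print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # approximately 1 for each column
```

After this transformation both columns have mean 0 and standard deviation 1, even though the raw values differed by orders of magnitude.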

We'll use about two thirds of the data for training and one third for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

For now we'll use k-nearest neighbors regression (yet another $k$ with yet another meaning).

This is a very simple regression model that is often effective. To predict the value of the outcome variable for a test example, we find its $k$ nearest neighbors in the training set and predict the average of their values.
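
The idea can be sketched from scratch in a few lines. This is only an illustration of the averaging rule, assuming Euclidean distance and equal weighting of neighbors; the helper `knn_predict` and the toy arrays are hypothetical names, and in the lab itself we use scikit-learn's `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(X_known, y_known, x_query, k):
    """Predict the outcome for one query point as the mean of its k nearest neighbors."""
    # Euclidean distance from the query point to every known example
    dists = np.linalg.norm(X_known - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Unweighted average of the neighbors' outcome values
    return y_known[nearest].mean()

# Toy data (made-up values): one feature, four known examples
X_known_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_known_toy = np.array([10.0, 20.0, 30.0, 100.0])

# The two nearest neighbors of 1.2 are the points at 1.0 and 2.0
toy_pred = knn_predict(X_known_toy, y_known_toy, np.array([1.2]), k=2)
print(toy_pred)
```

Note that the distant point at 10.0 has no influence on the prediction here, which is why the choice of $k$ and the scaling of the features both matter.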

You could also use linear regression, or a multi-layer perceptron, or a random forest model, for example. 

In [None]:
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
knn.fit(X_train, y_train)

In [None]:
preds = knn.predict(X_test)

We'll report an evaluation metric called the $R^2$ score (also known as the coefficient of determination). A value close to 1 is good: it means the predictive model explains much of the variance in the outcome variable.
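
For intuition, the score can be written as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand on made-up numbers; the array names are for illustration only.

```python
import numpy as np

# Made-up actual and predicted values
y_actual_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 8.0])

# Sum of squared prediction errors
ss_res = np.sum((y_actual_toy - y_pred_toy) ** 2)
# Sum of squared deviations of the actual values from their mean
ss_tot = np.sum((y_actual_toy - y_actual_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
```

A model that always predicts the mean of the actual values scores 0, and a model that does worse than that gets a negative score.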

In [None]:
r2_orig = r2_score(y_test, preds)
print('R2 Score (No Anonymization):', r2_orig)

Not bad! We can compare the actual values with our predicted values. 

In [None]:
plt.figure()
plt.scatter(y_test, preds)
plt.xlabel('actual number of motor vehicles')
plt.ylabel('predicted number of motor vehicles')
plt.show()

## Predicting Number of Motor Vehicles (With PCA-Anonymized Data)

Now we'll try anonymizing the data using principal component analysis (PCA), and then try the same prediction task on the anonymized data.

In [None]:
from sklearn.decomposition import PCA

We'll use the same features again, but this time we'll transform them with PCA. We rescale them first because PCA is sensitive to the scale of each feature.

In [None]:
features = ['year', 'hour', 'month', 'latitude', 'longitude', 'link_length_km', 'pedal_cycles', 'two_wheeled_motor_vehicles', 'buses_and_coaches', 'road_type']

X = df[features]
X = preprocessing.StandardScaler().fit_transform(X)

y = df['all_motor_vehicles']

In [None]:
print('shape before PCA')
print(X.shape)
    
# number of PCA components to retain
# e.g. first 5 principal components
num_comps = 5
    
pca = PCA(n_components=num_comps)
newX = pca.fit_transform(X)

print('shape after PCA')
print(newX.shape)   

print('\nhere we can see that each observation now has a score for each of the first n principal components:\n')
print(newX)
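
When deciding how many components to retain, it can help to look at how much of the total variance each component captures. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on synthetic data (not the traffic dataset); the names `X_synth` and `pca_demo` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 4 columns, where the last column
# is a nearly exact copy of the first (so it adds almost no new information)
base = rng.normal(size=(200, 3))
X_synth = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])

# Fit PCA with all 4 components so that the ratios sum to 1
pca_demo = PCA(n_components=4).fit(X_synth)
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f'component {i}: {r:.3f} of the variance, {c:.3f} cumulative')
```

In this synthetic example the last component captures almost none of the variance, because one column is nearly redundant. On the traffic data the picture will differ, so treat this only as a way to inspect the trade-off.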

In [None]:
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.33, random_state=42)

knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train, y_train)
preds_anon = knn.predict(X_test)

r2_anon = r2_score(y_test, preds_anon)
print('R2 Score (w/ PCA Anonymization):', r2_anon)

## Lab Assignment

Try the following:

- Vary the number of principal components retained and see how the prediction performance on the anonymized data changes.
- Try at least one regression model other than k-nearest neighbors and see how it performs on this prediction task, both on the original data and on the anonymized data. You could use linear regression, a multi-layer perceptron, or another regression model of your choice.
- When preprocessing the data, we replaced all missing values with 0. Change this so that each column has its missing values replaced with the median value of that column, and see whether this changes the performance of the prediction models.

### Deliverables

Submit your completed notebook via Blackboard.