# Lab 10: Anonymization Using K-Means Clustering

In [None]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error, r2_score

import matplotlib.pyplot as plt

In this lab we will use some traffic data from Scotland. Each record consists of a traffic measurement at a particular location and time. We will try to predict the total amount of motor vehicle traffic (`all_motor_vehicles`), given other features such as hour, month, year, location coordinates, type of road, etc.

In [None]:
df = pd.read_csv('scotland-traffic.csv', parse_dates=['count_date'])

We will extract the month from the date and use that as a separate feature. We will also recode the road type as a boolean that is `True` for major roads and `False` otherwise.

In [None]:
df['month'] = pd.DatetimeIndex(df['count_date']).month

In [None]:
df['road_type'] = df['road_type'] == 'Major'

We'll drop the records that are missing the outcome variable, and for now we'll replace all other missing values with 0.

In [None]:
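# Keep only rows with a recorded outcome; .copy() avoids pandas' SettingWithCopyWarning later on
df = df[df['all_motor_vehicles'].notna()].copy()
# Placeholder imputation: replace every remaining missing value with 0
df = df.fillna(0)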

In [None]:
df.head()

## Predicting Number of Motor Vehicles, Given Other Features (No Anonymization)

In our first experiment, we'll try to predict the amount of motor vehicle traffic using all of the other features, with no anonymization of the data. 

In [None]:
features = ['year', 'hour', 'month', 'latitude', 'longitude', 'link_length_km', 'pedal_cycles', 'two_wheeled_motor_vehicles', 'buses_and_coaches', 'road_type']

X = df[features]

In [None]:
y = df['all_motor_vehicles']

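We'll standardize all of the features (zero mean, unit variance) so that they are on a comparable scale. This matters for distance-based methods such as k-nearest neighbors and k-means, where a feature with a large numeric range would otherwise dominate the distance.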

In [None]:
X = preprocessing.StandardScaler().fit_transform(X)
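
As a quick illustration of what the scaler does, the sketch below standardizes a small made-up array by hand and checks the result against `StandardScaler`. The toy values and the names `toy`, `manual` and `scaled` are assumptions for illustration only; they are not taken from the traffic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales.
toy = np.array([[2000.0, 1.0],
                [2010.0, 3.0],
                [2020.0, 8.0]])

# Standardize by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))   # do the two approaches agree?
print(scaled.mean(axis=0).round(6))  # column means after scaling
print(scaled.std(axis=0).round(6))   # column standard deviations after scaling
```

After scaling, both columns have mean 0 and standard deviation 1, so neither one dominates a Euclidean distance.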

We'll use roughly 2/3 of the data for training and 1/3 for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

For now we'll use k-nearest neighbors regression (yet another $k$ with yet another meaning).

This is a very simple regression model that is often effective. When we want to predict the value of the outcome variable for a test example, we find its $k$ nearest neighbors in the training set and predict the average of their values.
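
To make the idea concrete, here is a minimal sketch on made-up one-dimensional data that computes a k-nearest-neighbors prediction by hand and compares it with `KNeighborsRegressor`. The data and the names `X_toy`, `y_toy`, `query` and `k_neighbors` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: one feature, one target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0, 100.0])
query = np.array([[1.4]])

k_neighbors = 3

# By hand: Euclidean distance from the query to every training point,
# keep the k closest, and average their target values.
dists = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(dists)[:k_neighbors]
manual_pred = y_toy[nearest].mean()

# The same prediction using scikit-learn (default uniform weights).
model = KNeighborsRegressor(n_neighbors=k_neighbors).fit(X_toy, y_toy)
sk_pred = model.predict(query)[0]

print(manual_pred, sk_pred)
```

The three closest training points to 1.4 are 1.0, 2.0 and 0.0, so both predictions are the average of 10, 20 and 0.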

You could also use linear regression, or a multi-layer perceptron, or a random forest model, for example. 

In [None]:
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
knn.fit(X_train, y_train)

In [None]:
preds = knn.predict(X_test)

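We'll report an evaluation metric called the $R^2$ score (the coefficient of determination). A value close to 1 is good: it means the predictive model explains most of the variance in the outcome variable. A score of 0 corresponds to always predicting the mean of the outcome, and the score can be negative for a model that does worse than that.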
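
The sketch below shows how the score is computed, using made-up actual and predicted values (the names `y_true`, `y_pred`, `ss_res` and `ss_tot` are illustrative assumptions). It computes one minus the ratio of the residual sum of squares to the total sum of squares and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```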

In [None]:
r2_orig = r2_score(y_test, preds)
print('R2 Score (No Anonymization):', r2_orig)

Not bad! We can compare the actual values with our predicted values. 

In [None]:
plt.figure()
plt.scatter(y_test, preds)
plt.xlabel('actual number of motor vehicles')
plt.ylabel('predicted number of motor vehicles')
plt.show()

## Predicting Number of Motor Vehicles (With Anonymized Data)

Now we'll anonymize the data using k-means clustering, replacing each record's feature values with the centroid of its cluster, and then try the same prediction task on the anonymized data.

In [None]:
from sklearn.cluster import KMeans

We'll first attempt this with 20 clusters. That's a really low value for this dataset. Each cluster will end up having a large number of observations and the dataset will end up being heavily modified as a result. 

In [None]:
k = 20
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
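
One way to see how coarse a clustering is would be to count the records in each cluster, since every record in a cluster will end up sharing the same anonymized feature vector. The sketch below does this on synthetic data; `X_demo`, `km_demo` and `sizes` are hypothetical names, and the random data is only a stand-in for the scaled traffic features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))  # synthetic stand-in for scaled features

km_demo = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_demo)

# Number of records assigned to each of the 20 clusters.
sizes = np.bincount(km_demo.labels_, minlength=20)

print('smallest cluster:', sizes.min())
print('largest cluster:', sizes.max())
print('total records:', sizes.sum())
```

On the lab data, the same count could be obtained from `kmeans.labels_` once the model above has been fit.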

We'll add the cluster assignments to our original dataframe. Each record/observation has a cluster number now. 

In [None]:
df['cluster'] = kmeans.labels_
print(df.head())

We'll create a new dataframe that stores the centroid vector for each cluster. 

In [None]:
feas_anon = [f+'_anon' for f in features]

cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=feas_anon)
cluster_centers['cluster_num'] = cluster_centers.index
print(cluster_centers.head())

Now we'll join those two dataframes, so that we have a new dataframe which contains both the original data and the anonymized data for each row. 

In [None]:
df_merge = df.merge(cluster_centers, how='left', left_on='cluster', right_on='cluster_num')
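
To check what the join produces, here is a small sketch on made-up two-dimensional points (the names `toy_df`, `km_toy`, `centers`, `merged` and `direct` are assumptions for illustration). It builds the anonymized columns with a merge, as above, and confirms that indexing the centroid array with the label array gives the same values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up points forming two obvious groups.
toy_df = pd.DataFrame({'a': [0.0, 0.2, 0.1, 5.0, 5.2, 5.1],
                       'b': [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]})

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_df[['a', 'b']])
toy_df['cluster'] = km_toy.labels_

# Route 1: merge the centroid table onto the rows, as in the lab.
centers = pd.DataFrame(km_toy.cluster_centers_, columns=['a_anon', 'b_anon'])
centers['cluster_num'] = centers.index
merged = toy_df.merge(centers, how='left', left_on='cluster', right_on='cluster_num')

# Route 2: index the centroid array directly with the label array.
direct = km_toy.cluster_centers_[km_toy.labels_]

print(merged[['a', 'a_anon', 'b', 'b_anon']])
print(np.allclose(merged[['a_anon', 'b_anon']].to_numpy(), direct))
```

Every row in a group carries the same `_anon` values, namely the mean of that group's original values.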

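In our second experiment, we'll use just the anonymized features. Because k-means was fit on the standardized `X`, the centroids are already in standardized units, so we don't rescale them again.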

In [None]:
new_X = df_merge[feas_anon]
new_y = df_merge['all_motor_vehicles']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, new_y, test_size=0.33, random_state=42)

knn = KNeighborsRegressor(n_neighbors=10)

knn.fit(X_train, y_train)

preds_anon = knn.predict(X_test)

r2_anon = r2_score(y_test, preds_anon)

print('R2 Score (With Anonymization):', r2_anon)

Our score has taken quite a hit. That's not too surprising, given that we used a very small number of clusters, resulting in an anonymized dataset in which a lot of information was lost. In the lab assignment, you'll see if you can improve this. 

## Lab Assignment

Try the following:
   - Try different numbers of clusters (both smaller and larger than the 20 we used) and anonymize the data using those clusterings. See how the prediction performance on the anonymized data changes as you change the number of clusters. Keep in mind that if you try a very large number of clusters, it may take several minutes to find a solution.
   - Try at least one regression model other than k-nearest neighbors and see how it performs on this prediction task, both on the original data and on the anonymized data. You could use linear regression, a multi-layer perceptron, or another regression model of your choice.
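
As a starting point, the sketch below shows one possible shape for such an experiment on synthetic data, not on the traffic dataset. The helper `anonymize_with_kmeans`, the synthetic arrays `X_syn` and `y_syn`, and the cluster counts are all hypothetical choices made for illustration; you will need to adapt the idea to the lab's dataframe and pick your own models and values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(600, 4))  # synthetic features
y_syn = X_syn @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=600)

def anonymize_with_kmeans(X, n_clusters):
    """Hypothetical helper: replace each row with its cluster centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_[km.labels_]

scores = {}
for n_clusters in (5, 50, 300):
    X_anon = anonymize_with_kmeans(X_syn, n_clusters)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_anon, y_syn, test_size=0.33, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    scores[n_clusters] = r2_score(y_te, model.predict(X_te))

print(scores)
```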

### Deliverables

Submit your completed notebook via Blackboard.