# Multi-level regression using K-means clustering

You are working for a non-profit organization that wants to develop a machine learning model to predict which of the people they engage with are likely to volunteer their time for community service. To train your model, you have a dataset that includes survey data about volunteering from respondents across the United States. However, you realize that volunteering behavior is very region-specific. You therefore decide to first cluster the data by geographic features, then fit a separate linear model on the volunteering-related features within each cluster.

This will be a type of multi-level regression. In a multi-level regression, a sample is modeled as
$$
y=w_{0, j}+w_{1, j} x_1+\ldots+w_{p, j} x_p
$$
if the sample belongs to group *j*.

In other words, there is a different set of regression coefficients for each group. In this case, we will use K-means clustering to form the groups.
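
To make the idea concrete, here is a minimal sketch on synthetic data, assuming the group of each sample is already known. The data, the coefficient values, and the names `true_w`, `group`, and `w_hat` are all made up for illustration, and it uses plain NumPy least squares rather than `LinearRegression` to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: one (intercept, slope) pair per group
true_w = np.array([[1.0, 2.0],    # group 0: y = 1 + 2x
                   [-3.0, 0.5]])  # group 1: y = -3 + 0.5x

n = 200
group = rng.integers(0, 2, size=n)   # group index j of each sample
x = rng.normal(size=n)               # a single feature x_1
y = true_w[group, 0] + true_w[group, 1] * x   # noiseless targets

# Fit a separate least-squares model on the samples of each group
w_hat = np.zeros_like(true_w)
for j in range(2):
    mask = group == j
    A = np.column_stack([np.ones(mask.sum()), x[mask]])  # columns: [1, x_1]
    w_hat[j] = np.linalg.lstsq(A, y[mask], rcond=None)[0]

print(w_hat)  # row j holds the fitted (w_0j, w_1j)
```

Because the targets here are noiseless, each row of `w_hat` recovers the corresponding row of `true_w` up to floating-point error. A single model fitted to all samples at once could not represent both lines.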

---

In this workspace, you are given training data `Xtr`, `ytr` and test data `Xts`, `yts` for a regression problem. Write code to do the following:

- Perform K-means clustering on the training data `Xtr` with a given number of clusters, `n_cluster`.
- Fit `n_cluster` linear regression models, each on the training data belonging to a particular cluster. For example, the 0th model should be trained on data from the 0th cluster.
- Compute the predicted outputs `yhat_ts` for the test data, and compute the MSE of the model on the test data.

> In the workspace, use `n_cluster = 5`.
For full credit, your solution should use no more than *one* `for` loop.

|Name|Type|Description|
| --- | --- | --- |
|`Xtr`|pandas DataFrame|Training data - features.|
|`Xts`|pandas DataFrame|Test data - features.|
|`ytr`|1d numpy array|Training data - target variable.|
|`yts`|1d numpy array|Test data - target variable.|
|`Xtr_cid`|1d numpy array|Cluster indices for training samples.|
|`Xts_cid`|1d numpy array|Cluster indices for test samples.|
|`yhat_ts`|1d numpy array|Predictions of model on test set.|
|`mse_ts`|float|Mean squared error on test set.|


In [1]:
import random
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

Read in the dataset:

In [2]:
df = pd.read_csv('data-volunteer.csv')

The data has the following features, some of which you will use for clustering and some of which you will use for the regression model:

* `GEDIV` (geographical region of the US where the respondent lives, ordinal-encoded. You will use this for clustering only.)
* `GTMETSTA` (whether or not the respondent lives in a metropolitan area. You will use this for clustering only.)
* `GTCBSASZ` (size of the metro area where the respondent lives. You will use this for clustering only.)
* `PESEX` (sex of the respondent. You will use this for the regression only.)
* `PRTAGE` (age of the respondent, ordinal encoded. You will use this for the regression only.)
* `PEEDUCA` (education level of the respondent, ordinal encoded. You will use this for the regression only.)
* `PUWK` (whether the respondent worked in the last week (1), did not work in the last week (2), or is retired (3). You will use this for the regression only.)
* `PTS16E` (number of hours spent volunteering in the last 12 months. You will use this as the target variable for the regression.)

Split the data into training and test sets. Use 2,500 samples for the test set and the remaining samples for the training set. Use `random_state = 42`.

 * `ytr` and `yts` should each be a 1d `numpy` array with only the target variable.
 * `Xtr` and `Xts` should be `pandas` data frames with all of the remaining variables (excluding the target variable.)



In [3]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
regression_features = ['PESEX', 'PRTAGE', 'PEEDUCA', 'PUWK']
clustering_features = ['GEDIV', 'GTMETSTA', 'GTCBSASZ']
target = 'PTS16E'

X = df[regression_features + clustering_features]
y = df[target]
Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=2500, random_state=42)
ytr = ytr.to_numpy()
yts = yts.to_numpy()

In the next cells, you will use `sklearn` to perform K-means clustering on `Xtr`. First, set `n_cluster` as specified on the question page.

In [4]:
n_cluster = 5

Then, assign a cluster label to each data point, using only the geographical features marked "You will use this for clustering only" in the list above.

(Use `random_state = 42` so that your clustering will match the auto-grader's.) Fit the clustering on the training data only, and save the assigned cluster labels in `Xtr_cid` and `Xts_cid` for the training and test data, respectively.

In [5]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
kmeans = KMeans(n_clusters=n_cluster, random_state=42)
Xtr_cid = kmeans.fit_predict(Xtr[clustering_features])
Xts_cid = kmeans.predict(Xts[clustering_features])
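
Conceptually, labeling the test data with an already-fitted K-means model means assigning each sample to its nearest centroid. The sketch below illustrates this with plain NumPy. The `centroids` and `X_new` arrays are hypothetical stand-ins (for `kmeans.cluster_centers_` and the test features), not values from this dataset.

```python
import numpy as np

# Hypothetical centroids, standing in for kmeans.cluster_centers_ after fitting
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0]])

# Hypothetical new samples to label
X_new = np.array([[1.0, -1.0],
                  [9.0, 11.0],
                  [4.0, 4.0]])

# Squared Euclidean distance from every sample to every centroid: shape (3, 2)
d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each sample gets the index of its closest centroid
labels = d2.argmin(axis=1)
print(labels)  # [0 1 0]
```

Since the centroids come from the training data alone, the test set has no influence on where the cluster boundaries fall.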


Finally, fit regression coefficients using the training data in each cluster, and then use the fitted regression models to create `yhat_ts`, the predicted values on the test set.

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

# this just generates an array that's the correct shape - yhat_ts shouldn't really be all zeros
yhat_ts = np.zeros(yts.shape)

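for i in range(n_cluster):
    # positions of the training and test samples assigned to cluster i
    train_indices = np.where(Xtr_cid == i)[0]
    test_indices = np.where(Xts_cid == i)[0]
    # skip clusters with no training or no test samples
    # (fit and predict both raise an error on an empty input)
    if len(train_indices) > 0 and len(test_indices) > 0:
        # fit a separate linear model on this cluster's training data
        reg = LinearRegression()
        reg.fit(Xtr.iloc[train_indices][regression_features], ytr[train_indices])
        # predict only for the test samples in the same cluster
        yhat_ts[test_indices] = reg.predict(Xts.iloc[test_indices][regression_features])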

Then, compute the mean squared error of your model on the test data.
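
For reference, with *n* test samples, true values *y_i* (the entries of `yts`) and predictions *ŷ_i* (the entries of `yhat_ts`), the mean squared error is
$$
\mathrm{MSE}=\frac{1}{n} \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2
$$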

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
mse_ts = mean_squared_error(yts, yhat_ts)