## Assignment 3: $k$ Nearest Neighbor

**Do two questions.**

`! git clone https://github.com/DS3001/knn`

**Q1.** This question is a case study for $k$ nearest neighbor. The target variable `y` is `price` and the features are `year` and `mileage`.

1. Load the `./data/USA_cars_datasets.csv`. Keep the following variables and drop the rest: `price`, `year`, `mileage`. Are there any `NA`'s to handle? Look at the head and dimensions of the data.

```
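import pandas as pd

# Path follows the assignment text; adjust it if the repo was cloned elsewhere
df = pd.read_csv('./data/USA_cars_datasets.csv')
df = df.loc[:, ['price', 'year', 'mileage']]  # keep only the target and the two features
print(df.isna().sum())  # count of missing values in each column
print(df.shape)
print(df.describe())
df.head()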
```
* There are no NA's to handle.
* dimensions: (2499, 3)


2. Maxmin normalize `year` and `mileage`.

```
def maxmin(data):
    # Rescale a numeric column linearly so its minimum maps to 0 and its maximum to 1
    data_min = data.min()
    data_max = data.max()
    normalized_value = (data - data_min) / (data_max - data_min)
    return normalized_value

df['year'] = maxmin(df['year'])
df['mileage'] = maxmin(df['mileage'])
```
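
As a quick sanity check on the normalization, here is a minimal sketch on made-up values (the `toy` series is hypothetical and not taken from the cars data). It shows that the smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.

```python
import pandas as pd

def maxmin(data):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (data - data.min()) / (data.max() - data.min())

# Hypothetical toy values, not taken from the cars data
toy = pd.Series([2010, 2015, 2020])
scaled = maxmin(toy)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

Note that a constant column would make the denominator zero, so this helper assumes each column has at least two distinct values.
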
3. Split the sample into ~80% for training and ~20% for evaluation.

```
from sklearn.model_selection import train_test_split
y = df['price']
X = df.drop('price',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=100)
```
4. Use the $k$NN algorithm and the training data to predict `price` using `year` and `mileage` for the test set for $k=3,10,25,50,100,300$. For each value of $k$, compute the mean squared error and print a scatterplot showing the test value plotted against the predicted value. What patterns do you notice as you increase $k$?
* From the plots, we notice that as $k$ increases, the model averages over more neighbors, so it ignores random fluctuations and gives smoother predictions. If $k$ gets too large, however, the model oversmooths: predictions are pulled toward the overall average price and miss real detail in the data.
```
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
import matplotlib.pyplot as plt
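for k in [3,10,25,50,100,300]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train,y_train)
    y_hat = model.predict(X_test)
    MSE = np.mean( (y_test-y_hat)**2 )  # mean squared error on the test set
    # Scatterplot of actual (x-axis) versus predicted (y-axis) test prices
    fig, axes = plt.subplots()
    axes.scatter(y_test,y_hat)
    axes.set_title(f'k: {k}, MSE: {MSE:.0f}')
    axes.set_xlabel('Actual price')
    axes.set_ylabel('Predicted price')
    axes.set_ylim(-1000, 62000)
    axes.set_xlim(-1000, 62000)
    plt.show()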
```
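
To make concrete what the regressor computes for a single prediction, here is a small sketch on synthetic data (the arrays `X_toy`, `y_toy`, and `x_new` are assumptions for illustration, not the cars data). With the default uniform weights and Euclidean distance, the prediction is the plain average of the targets of the $k$ closest training points, which the sketch checks against scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(1)
X_toy = rng.uniform(size=(50, 2))   # 50 training points with 2 features
y_toy = rng.uniform(size=50)        # their target values
x_new = np.array([[0.5, 0.5]])      # one point to predict
k = 3

# By hand: Euclidean distance to every training point, then average the k closest targets
dists = np.sqrt(((X_toy - x_new) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:k]
manual_pred = y_toy[nearest].mean()

# scikit-learn with default settings (uniform weights, Euclidean distance)
sk_pred = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy).predict(x_new)[0]
print(manual_pred, sk_pred)
```

Because each prediction is an average of $k$ targets, a larger $k$ averages out more noise but also blurs real differences between cars.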

5. Determine the optimal $k$ for these data.

```
k_bar = 200
k_grid = np.arange(1, k_bar + 1)  # candidate values k = 1, ..., k_bar
SSE = np.zeros(k_bar)

for i, k in enumerate(k_grid):
    fitted_model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_hat = fitted_model.predict(X_test)
    SSE[i] = np.sum((y_test - y_hat) ** 2)  # test-set sum of squared errors for this k

k_star = k_grid[np.argmin(SSE)]  # k with the smallest test SSE (the first one if tied)
print(k_star)

plt.plot(k_grid, SSE)  # x-axis is k itself, not the array index
plt.xlabel("k")
plt.title("optimal k: " + str(k_star))
plt.ylabel('SSE')
plt.show()
```

6. Describe what happened in the plots of predicted versus actual prices as $k$ varied, taking your answer to part 5 into account. (Hint: Use the words "underfitting" and "overfitting".)
* The plots changed systematically as $k$ varied. When $k$ is small, the model overfits: each prediction averages only a few neighbors, so it picks up noise in the training data. Near the optimal $k$ from part 5, there is a good balance between fitting the data and smoothing out noise. When $k$ gets too large, the model underfits: it averages over so many neighbors that it becomes too general and misses important patterns.
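
The two extremes can be seen directly in a small sketch on synthetic data (a noisy sine curve, assumed purely for illustration; the names `X_demo`, `y_demo`, and `sse` are hypothetical). It compares training and test SSE for a very small, a moderate, and the largest possible $k$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed purely for illustration: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

# First 150 rows for training, last 50 for testing
X_tr, y_tr = X_demo[:150], y_demo[:150]
X_te, y_te = X_demo[150:], y_demo[150:]

sse = {}
for k in [1, 10, 150]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    sse_train = np.sum((y_tr - model.predict(X_tr)) ** 2)
    sse_test = np.sum((y_te - model.predict(X_te)) ** 2)
    sse[k] = (sse_train, sse_test)
    print(f"k={k}: train SSE={sse_train:.2f}, test SSE={sse_test:.2f}")
```

With $k=1$, each training point is its own nearest neighbor, so the training SSE is zero even though the model has memorized noise (overfitting). With $k$ equal to the training sample size, every prediction is just the training mean, which ignores the features entirely (underfitting).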

**Q2.** This question is a case study for $k$ nearest neighbor. The data for the question include:

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- death event: if the patient died during the follow-up period (boolean)

1. Load the `./data/heart_failure_clinical_records_dataset.csv`. Are there any `NA`'s to handle? Use `.drop()` to remove `time` from the dataframe.

```
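import pandas as pd
# Path follows the assignment text; adjust it if the repo was cloned elsewhere
df = pd.read_csv('./data/heart_failure_clinical_records_dataset.csv')
print(df.shape)
print(df.isna().sum())  # count of missing values in each column
print(df.describe())
df = df.drop('time',axis=1)  # remove the follow-up period, as instructed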
```
* There are no NA's to handle.
2. Make a correlation matrix. What variables are strongly associated with a death event?

```
correlation_matrix = df.corr()
correlation_with_death = correlation_matrix['DEATH_EVENT'].sort_values(ascending=False)
print(correlation_with_death)
```
* Age (0.254), ejection_fraction (-0.269), and serum_creatinine (0.294) have the largest correlations, in absolute value, with `DEATH_EVENT`.

3. For the dummy variables `anaemia`, `diabetes`, `high_blood_pressure`, `sex`, and `smoking`, compute a summary table of `DEATH_EVENT` grouped by the variable. For which variables does a higher proportion of the population die when the variable takes the value 1 rather than 0?

```
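# Named dummy_vars rather than vars to avoid shadowing Python's built-in vars()
dummy_vars = ['anaemia','diabetes','high_blood_pressure','sex','smoking']
for var in dummy_vars:
    # The 'mean' column is the proportion of patients who died in each group
    print(df.loc[:,[var,'DEATH_EVENT']].groupby(var).describe())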
```
* A higher proportion of patients die when the variable takes the value 1 rather than 0 for anaemia and high blood pressure.
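
Because `DEATH_EVENT` is coded 0/1, its mean within a group is exactly the proportion of that group who died. The sketch below shows this on a made-up table (the `toy` records are hypothetical, not the heart failure data).

```python
import pandas as pd

# Hypothetical toy records, not the heart failure data
toy = pd.DataFrame({
    'anaemia':     [0, 0, 0, 0, 1, 1, 1, 1],
    'DEATH_EVENT': [0, 0, 0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 variable within each group = share of that group with a death event
death_rate = toy.groupby('anaemia')['DEATH_EVENT'].mean()
print(death_rate)
```

In this toy table, one of four patients without anaemia dies (0.25) and two of four with anaemia die (0.5), so the group with the value 1 has the higher proportion.
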
4. On the basis of your answers from 2 and 3, build a matrix $X$ of the variables you think are most predictive of a death, and a variable $y$ equal to `DEATH_EVENT`.

```
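y = df['DEATH_EVENT']
# Named selected_vars rather than vars to avoid shadowing Python's built-in vars()
selected_vars = ['age','ejection_fraction','serum_creatinine','high_blood_pressure','anaemia']
X = df.loc[:,selected_vars]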
```
5. Maxmin normalize all of the variables in `X`.

```
def maxmin(data):
    data_min = data.min()
    data_max = data.max()
    normalized_data = (data - data_min) / (data_max - data_min)
    return normalized_data
X_normalized = X.apply(maxmin)
```
6. Split the sample into ~80% for training and ~20% for evaluation. (Try to use the same train/test split for the whole question, so that you're comparing apples to apples in the questions below.)

```
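import numpy as np
np.random.seed(100)
N = X_normalized.shape[0]
all_rows = np.arange(N)  # every row position, 0 through N-1
train = np.random.choice(N, int(.8*N), replace=False)  # sample without replacement so no row repeats
test = np.setdiff1d(all_rows, train)  # rows that were not drawn for training
X_train = X_normalized.iloc[train,:]  # use the maxmin-normalized features from step 5
y_train = y.iloc[train]
X_test = X_normalized.iloc[test,:]
y_test = y.iloc[test]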
```
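
For an index-based split like this one, the training and test rows should be disjoint and together cover every row. A minimal sketch of one way to guarantee that, using a seeded permutation on a hypothetical sample of 10 rows (the names `N_demo`, `train_idx`, and `test_idx` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(100)     # seeded generator so the split is reproducible
N_demo = 10                          # hypothetical sample size
shuffled = rng.permutation(N_demo)   # every index 0..N_demo-1 exactly once, in random order

n_train = int(0.8 * N_demo)
train_idx = shuffled[:n_train]       # first 80% of the shuffled indices
test_idx = shuffled[n_train:]        # remaining 20%

print(sorted(train_idx), sorted(test_idx))
```

Reusing the same index arrays for every model in the question keeps the comparisons on identical training and test rows.
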
7. Determine the optimal number of neighbors for a $k$NN regression for the variables you selected.

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
k_bar = 25
k_grid = np.arange(1, k_bar + 1)
SSE = np.zeros(k_bar)
for k in k_grid:
    knn = KNeighborsRegressor(n_neighbors=k)
    predictor = knn.fit(X_train, y_train)
    y_hat = predictor.predict(X_test)
    SSE[k - 1] = np.sum((y_test.values.ravel() - y_hat) ** 2)
SSE_min = np.min(SSE)
k_star = k_grid[np.argmin(SSE)]  # smallest k that attains the minimum test SSE
print(f"Optimal k: {k_star}")
plt.figure(figsize=(10, 6))
plt.plot(k_grid, SSE, marker='o')
plt.xlabel("Number of Neighbors k")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title(f"KNN Regression SSE by k (Optimal k: {k_star}, SSE: {SSE_min})")
plt.xticks(k_grid)
plt.grid(True)
plt.show()
```
8. OK, do steps 5 through 7 again, but use all of the variables (except `time`). Which model has a lower Sum of Squared Error? Which would you prefer to use in practice, if you had to predict `DEATH_EVENT`s? If you play with the selection of variables, how much does the SSE change for your fitted model on the test data? Are more variables always better? Explain your findings.

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
X = df.drop('DEATH_EVENT', axis=1).apply(maxmin)  # all remaining variables, maxmin-normalized as in step 5
X_train = X.iloc[train, :]
y_train = y.iloc[train]
X_test = X.iloc[test, :]
y_test = y.iloc[test]
k_bar = 100
k_grid = np.arange(1, k_bar + 1)
SSE = np.zeros(len(k_grid))
for k in k_grid:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_hat = knn.predict(X_test)
    SSE[k - 1] = np.sum((y_test.values.ravel() - y_hat) ** 2)
SSE_min = np.min(SSE)
min_index = np.where(SSE == SSE_min)[0]
k_star = k_grid[min_index]
print(f"Optimal k: {k_star}")
plt.figure(figsize=(10, 6))
plt.plot(k_grid, SSE, marker='o')
plt.xlabel("Number of Neighbors k")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title(f"KNN Regression SSE by k (Optimal k: {k_star}, SSE: {SSE_min})")
plt.xticks(k_grid)
plt.grid(True)
plt.show()
```
* With all of the variables, the optimal number of neighbors $k^*$ rises to 83, and the test SSE gets worse, increasing from 27 to 29. The model with fewer, hand-picked variables therefore predicts better on the test data, and it is the one we would prefer in practice. More variables are not always better: $k$NN measures distance across every feature, so weakly predictive variables add noise to the distances and can pull in less relevant neighbors.
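
The effect of irrelevant features on $k$NN can be illustrated with a small sketch on synthetic data (everything here is an assumption for illustration: one informative feature, 20 pure-noise features, and the hypothetical helper `holdout_sse`). It is not a rerun of the heart failure analysis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
n = 400
signal = rng.uniform(0, 1, size=(n, 1))       # one informative feature
noise = rng.uniform(0, 1, size=(n, 20))       # 20 features unrelated to the target
y_demo = (signal[:, 0] > 0.5).astype(float)   # 0/1 target that depends on the signal only

def holdout_sse(features, k=5, n_train=300):
    """Fit kNN on the first n_train rows and return the SSE on the remaining rows."""
    model = KNeighborsRegressor(n_neighbors=k).fit(features[:n_train], y_demo[:n_train])
    y_hat = model.predict(features[n_train:])
    return np.sum((y_demo[n_train:] - y_hat) ** 2)

sse_signal_only = holdout_sse(signal)
sse_with_noise = holdout_sse(np.hstack([signal, noise]))
print(f"signal only: {sse_signal_only:.2f}, signal + noise: {sse_with_noise:.2f}")
```

In this setup the noise columns dominate the distance calculation, so the nearest neighbors are chosen mostly at random with respect to the target and the held-out SSE is higher. How large the gap is on real data depends on how informative the extra variables are.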

**Q3.** Let's do some very basic computer vision. We're going to import the MNIST handwritten digits data and use $k$NN to predict values (i.e. "see/read").

1. To load the data, run the following code in a chunk:
```
from keras.datasets import mnist
df = mnist.load_data('minst.db')
train,test = df
X_train, y_train = train
X_test, y_test = test
```
The `y_test` and `y_train` vectors, for each index `i`, tell you what number is written in the corresponding index in `X_train[i]` and `X_test[i]`. The value of `X_train[i]` and `X_test[i]`, however, is a 28$\times$28 array whose entries contain values between 0 and 255. Each element of the matrix is essentially a "pixel" and the matrix encodes a representation of a number. To visualize this, run the following code to see the first five numbers:
```
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000)
for i in range(5):
    print(y_test[i],'\n') # Print the label
    print(X_test[i],'\n') # Print the matrix of values
    plt.contourf(np.rot90(X_test[i].transpose())) # Make a contour plot of the matrix values
    plt.show()
```
OK, those are the data: Labels attached to handwritten digits encoded as a matrix.

2. What is the shape of `X_train` and `X_test`? What is the shape of `X_train[i]` and `X_test[i]` for each index `i`? What is the shape of `y_train` and `y_test`?
3. Use Numpy's `.reshape()` method to convert the training and testing data from a matrix into a vector of features. So, `X_test[index].reshape((1,784))` will convert the $index$-th element of `X_test` into a $28\times 28=784$-length row vector of values, rather than a matrix. Turn `X_train` into an $N \times 784$ matrix $X$ that is suitable for scikit-learn's kNN classifier where $N$ is the number of observations and $784=28*28$ (you could use, for example, a `for` loop).
4. Use the reshaped `X_train` and `y_train` data to create a $k$-nearest neighbor classifier of digits. What is the optimal number of neighbors $k$? If you can't determine this, play around with different values of $k$ for your classifier.
5. For the optimal number of neighbors, how well does your predictor perform on the test set?
6. So, this is how computers "see." They convert an image into a matrix of values, that matrix becomes a vector in a dataset, and then we deploy ML tools on it as if it were any other kind of tabular data. To make sure you follow this, invent a way to represent a color photo in matrix form, and then describe how you could convert it into tabular data. (Hint: RGB color codes provide a method of encoding a numeric value that represents a color.)
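
The image-to-row conversion described in parts 3 and 6 can be sketched on synthetic arrays (the arrays below are stand-ins assumed for illustration, not MNIST or a real photo). A stack of grayscale images of shape $N \times 28 \times 28$ becomes an $N \times 784$ table, and one possible encoding of a color photo is a height $\times$ width $\times$ 3 array of RGB values that flattens to a single row.

```python
import numpy as np

# Synthetic stand-in for 5 grayscale 28x28 images with pixel values 0..255
gray_stack = np.arange(5 * 28 * 28).reshape(5, 28, 28) % 256

# One row per image: -1 lets NumPy infer 28*28 = 784 columns
X_flat = gray_stack.reshape(gray_stack.shape[0], -1)
print(X_flat.shape)  # (5, 784)

# Synthetic stand-in for a 4x6 color photo: one (R, G, B) triple per pixel
rgb_photo = np.zeros((4, 6, 3), dtype=np.uint8)
rgb_photo[0, 0] = [255, 0, 0]  # top-left pixel is pure red

# Flatten to a single row of 4*6*3 = 72 features
photo_row = rgb_photo.reshape(1, -1)
print(photo_row.shape)  # (1, 72)
```

Reshaping the whole stack at once avoids the `for` loop, and each row keeps the pixels in the same order, so every column refers to the same pixel position (and color channel) across images.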