### This assignment is related to the simulation study described in Section 2.3.1 (the so-called Scenario 2 or Example 2) of “Elements of Statistical Learning” (ESL).

**Scenario 2**: the two-dimensional data $X \in R^2$ in each class are generated from a mixture of 10 different bivariate Gaussian distributions with uncorrelated components and different means, i.e.,
$$
    X | Y = k, Z = j \sim N(m_{kj}, s^2I_2)
$$,
where $k=0$ or $1$, and $j = 1, 2, ..., 10$.
Set
$$
    P(Y = k) = 1/2, P(Z = j) = 1/10, s^2 = 1/5
$$.
In other words, given $Y = k, X$ follows a mixture distribution with probability density function (PDF),
$$
    \frac{1}{10} \sum_{j=1}^{10} (\frac{1}{\sqrt{2 \pi s^2}})^2 e^{-\frac{\parallel x - m_{kj}\parallel^2}{2 s^2}}
$$


## Part 1: Generate Data
1. First generate the 20 centers from two-dimensional normal. You can use any mean and covariance structure. **You should not regenerate the centers. Use these 20 centers throughout this simulation study.**

2. Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).

3. Produce a **scatter plot** of the training data:

assign different colors to the two classes of data points;
overlay the 20 centers on this scatter plot, using a distinguishing marker (e.g., a star or a different shape) and color them according to their respective class.

In [56]:
import numpy as np
import plotly.graph_objects as go
import scipy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

In [57]:
J = 10
K = 2
m0 = np.array([0,1])
m1 = np.array([1,0])
# Step 1: Generate 20 centers
np.random.seed(100)
mu_k0 = np.random.normal(size=(J,K)) + np.tile(m0, (J, 1))
mu_k1 = np.random.normal(size=(J,K)) + np.tile(m1, (J, 1))

# Step 2:  Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).
s = np.sqrt(1/5)
n_training = 100
training_set_id0 = np.random.randint(0, J, size=n_training)
training_set_id1 = np.random.randint(0, J, size=n_training)
X_training = np.random.normal(size=(2 * n_training, K)) * s + np.vstack((mu_k0[training_set_id0], mu_k1[training_set_id1]))
Y_training = np.array([0]*n_training  + [1]*n_training)


n_testing = 5000
testing_set_id0 = np.random.randint(0, J, size=n_testing)
testing_set_id1 = np.random.randint(0, J, size=n_testing)
X_testing = np.random.normal(size=(2 * n_testing, K)) * s + np.vstack((mu_k0[testing_set_id0], mu_k1[testing_set_id1]))
Y_testing = np.array([0]*n_testing  + [1]*n_testing)

# Step 3: Produce a scatter plot of training data
fig = go.Figure()
fig.update_layout(width=1000,height=500)

fig.add_trace(go.Scatter(x=X_training[:n_training, 0], y= X_training[:n_training, 1],
                    mode='markers',
                    marker_symbol='circle-open',
                    name='class 0',
                    marker_color="blue"))

fig.add_trace(go.Scatter(x=X_training[n_training:, 0], y= X_training[n_training:, 1],
                    mode='markers',
                    marker_symbol='circle-open',
                    name='class 1',
                    marker_color="red"))

fig.add_trace(go.Scatter(x=mu_k0[:,0], y=mu_k0[:,1],
                    mode='markers',
                    name= 'class 0 center',
                    marker_symbol='cross',
                    marker_size =15,
                    marker_color="blue"))

fig.add_trace(go.Scatter(x=mu_k1[:,0], y=mu_k1[:,1],
                    mode='markers',
                    name= 'class 1 center',
                    marker_symbol='star',
                    marker_size =15,
                    marker_color="red"))


## Part 2: kNN
1. Implement kNN **from scratch**; use Euclidean Distance. Your implementation should meet the following requirements:

- **Input**: Your kNN function should accept three input parameters: training data, test data, and k. No need to write your kNN function to handle any general input; it suffices to write a function that is able to handle the data for this specific simulation study: binary classification; features are two-dimensional numerical vectors.

- **Output**: Your function should return a vector of predictions for the test data.

- **Vectorization**: Efficiently compute distances between all test points and training points simultaneously. Make predictions for all test points in a single operation.

- **No Loops**: Do not use explicit loops like for or while inside your kNN function to compute distances or make predictions. Instead, harness the power of vectorized operations for efficient computations. For example, you can use broadcasting in Numpy or command outer in R.

In [58]:
def kNN_pedict(X_training, Y_training, X_testing, k):
    # Vectorized code
    # X_testing: n_testing X 2
    # X_training: n_training X 2
    # Euclidean_dist: n_testing X n_training
    euclidean_dist_mat = np.linalg.norm(X_testing[:, None] - X_training, axis=2)
    
    # n_testing X n_training
    k_nearest_neighor_ids_mat = euclidean_dist_mat.argsort(kind='mergesort', axis=1)[:,:k]  # sort distance in ascending order
    
    # prediction
    # Because the problem is an classification problem, using mode, i.e., the value that occurs most often
    predict_class = scipy.stats.mode(Y_training[k_nearest_neighor_ids_mat], axis=1, keepdims=True).mode.squeeze()
    
    return predict_class

2. *Question: Explain how you handle distance ties and voting ties*
- **distance ties** may occur when you have multiple (training) observations that are equidistant from a test observation.
- **voting ties** may occur when K is an even number and you have 50% of the k-nearest-neighbors from each of the two classes.

*Answer*
- **distance ties**: I am using <u>mergesort</u> in the python argsort() function call, which is a stable sort and preserves the relative order of equal values. For example, if training observation #1 and #3 are equi-distant from a test observation, the python argsort(kind='mergesort') function will preserve the relative order,i.e., #1 is chosen before #3 if k = 1.

- **voting ties**: when there are multiple values that occur with the same highest frequency(i.e., multiple modes) scipy.stats.mode() will return the <u>smallest</u> of these values. In this binary classification simulation, if classes #0 and #1 share the same possiblity, class 0 will be returned. 

3. Test your code with the training/test data you just generated when K = 1, 3, 5; and compare your results with <u>knn</u> in *R* or <u>sklearn.neighbors</u> in *Python*.

- Report your results (on the test data) as a 2-by-2 table (confusion matrix) for each K value.
- Report the results from knn or sklearn.neighbors as a 2-by-2 table (confusion matrix) for each K value.

In [59]:
def calc_accuracy(y_predict,y_measurement):
    assert(len(y_measurement) == len(y_predict))
    accuracy = np.sum(y_predict==y_measurement)/len(y_predict)
    return accuracy

def calc_confusion_matrix(y_predict, y_measurement):
    TP = sum((y_predict==1) & (y_measurement==1))
    TN = sum((y_predict==0) & (y_measurement==0))
    FP = sum((y_predict==1) & (y_measurement==0))
    FN = sum((y_predict==0) & (y_measurement==1))
    
    return np.array([[TP, FP],
                    [FN, TN]])

In [60]:
k = 1

sklearn_knn_1 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_1.fit(X_training, Y_training)
sklearn_knn_result_1 =  sklearn_knn_1.predict(X_testing)
local_knn_result_1 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_1, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_1, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_1.score(X_testing, Y_testing)))

My Confusion Matrix at k = 1
[[4040 1169]
 [ 960 3831]]
skearn.neighbors Confusion Matrix at k = 1
[[4040 1169]
 [ 960 3831]]
sklearn.neighbors accuracy is 0.7871


In [61]:
k = 3

sklearn_knn_3 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_3.fit(X_training, Y_training)
sklearn_knn_result_3 =  sklearn_knn_3.predict(X_testing)
local_knn_result_3 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_3, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_3, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_3.score(X_testing, Y_testing)))


My Confusion Matrix at k = 3
[[4071 1054]
 [ 929 3946]]
skearn.neighbors Confusion Matrix at k = 3
[[4071 1054]
 [ 929 3946]]
sklearn.neighbors accuracy is 0.8017


In [62]:
k = 5

sklearn_knn_5 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_5.fit(X_training, Y_training)
sklearn_knn_result_5 =  sklearn_knn_5.predict(X_testing)
local_knn_result_5 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_5, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_5, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_5.score(X_testing, Y_testing)))



My Confusion Matrix at k = 5
[[4070 1007]
 [ 930 3993]]
skearn.neighbors Confusion Matrix at k = 5
[[4070 1007]
 [ 930 3993]]
sklearn.neighbors accuracy is 0.8063


## Part 3:cvKNN
1. Implement KNN classification with K chosen by 10-fold cross-validation **from scratch**.

- Set the candidate K values from 1 to 180. (The maximum candidate K value is 180. Why?)
- From now on, you are allowed to use the built-in kNN function from R or Python instead of your own implementation from Part 2.
- It is possible that multiple K values give the (same) smallest CV error; when this happens, pick the largest K value among them, since the larger the K value, the simpler the model.
2. Test your code with the training/test data you just generated. Report your results (on the test data) as a 2-by-2 table and also report the value of the selected K.

In [63]:
from sklearn.model_selection import train_test_split

*Question: Explain The maximum candidate K value is 180. Why?*

*Answer* 
Because in 10-fold cross validation, 10% of the data is used for cross-validate testing, i.e. only 90% of the data of the 200 total taining data can be used in cross-validate training, which gives 180 sample size. As the K value cannot surpasss the sample size, so that maximum candidate K value is 180

In [64]:
def cvKNN(x_training, y_training):
    N_fold = 10
    n_fold_array = np.linspace(1, N_fold, N_fold).astype(int)
    K_array = np.linspace(1, 180, 180).astype(int)
    cv_error_array = np.zeros(180)

    for k in K_array:
        sklearn_knn_k = KNeighborsClassifier(n_neighbors=k)
        for n in n_fold_array:
            x_train_k, x_test_k, y_train_k, y_test_k = train_test_split(x_training, y_training, test_size=1/N_fold, random_state=n)
            sklearn_knn_k.fit(x_train_k, y_train_k)
            cv_error_array[k-1] += (1.0 - sklearn_knn_k.score(x_test_k, y_test_k))/N_fold
    
    #print(cv_error_array)
    k_min_index = 10
    k_min, cv_error_min = K_array[k_min_index], cv_error_array[k_min_index]
    print("The minimum cv error is: {}, with k = {}".format(cv_error_min, k_min))

    fig = go.Figure()
    fig.update_layout(width=1000,height=500,
                       xaxis_title="k values", yaxis_title="Average cv error",
                      )

    fig.add_trace(go.Scatter(x=K_array, y=cv_error_array,
                    mode='lines+markers',
                    name='average cv error',
                    marker_color="blue"))
    
    fig.add_trace(go.Scatter(x=[k_min], y=[cv_error_min],
                    mode='markers',
                    name= 'k value with the minimum average cv error',
                    marker_symbol='star',
                    marker_size =15,
                    marker_color="red"))
    fig.show()


In [65]:
cvKNN(X_training, Y_training)

The minimum cv error is: 0.135, with k = 11


In [72]:
sklearn_knn_11 = KNeighborsClassifier(n_neighbors=11)
sklearn_knn_11.fit(X_testing, Y_testing)
sklearn_knn_result_11 =  sklearn_knn_11.predict(X_testing)
print("skearn.neighbors Confusion Matrix at k = 11")
print(calc_confusion_matrix(sklearn_knn_result_11, Y_testing))

skearn.neighbors Confusion Matrix at k = 11
[[4078  651]
 [ 922 4349]]


## Part 4: Bayes rule
1. Implement the Bayes rule. Your implementation should meet the following requirements:
   - Do not use explicit loops over the test sample size (10,000 or 5,000). 
   - You are allowed to use loops over the number of centers (10 or 20), although you can avoid all loops.
2. Test your code with the test data you just generated. (Note that you do not need training data for the Bayes rule.) Report your results (on the test data) as a 2-by-2 table. 

The Bayes rule for binary classification (under the zero-one loss), as derived in class, is: predict $Y$ to be 1, if 

$$
P(Y = 1 \mid X = x) \ge P(Y = 0 \mid X=x), 
$$

or equivalently

$$ \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X=x)} \ge 1.$$

Following the data generation process, we have 
$$ \displaystyle  \frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)}=\frac{P(Y=1) \cdot P(X=x\mid Y=1)}{P(Y=0) \cdot P(X=x\mid Y=0)} $$
$$\displaystyle =\frac{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)} $$
$$\displaystyle =\frac{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)}. 
$$

In [66]:
def calculate_prob(data, mu):
    result = np.linalg.norm(data[:, None] - mu, axis=2) ** 2
    result = np.sum(np.exp(-result / (2 * s ** 2)), axis=1)
    return result

def bayes_rule(data, mu0, mu1):
    return np.where(calculate_prob(data, mu1) >= calculate_prob(data, mu0), 1, 0)

In [69]:
y_pred_bayes = np.where(calculate_prob(X_testing, mu_k1) >= calculate_prob(X_testing, mu_k0), 1, 0)

In [70]:
print(calc_confusion_matrix(y_pred_bayes, Y_testing))

[[3855  615]
 [1145 4385]]
