# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using the `help()` function in Python to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked in your pull request by tagging their GitHub username. Separate submissions are required.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [13]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Add your code here
print(Caravan)
Caravan.describe()

      MOSTYPE  MAANTHUI  MGEMOMV  MGEMLEEF  MOSHOOFD  MGODRK  MGODPR  MGODOV  \
0          33         1        3         2         8       0       5       1   
1          37         1        2         2         8       1       4       1   
2          37         1        2         2         8       0       4       2   
3           9         1        3         3         3       2       3       2   
4          40         1        4         2        10       1       4       1   
...       ...       ...      ...       ...       ...     ...     ...     ...   
5817       36         1        1         2         8       0       6       1   
5818       35         1        4         4         8       1       4       1   
5819       33         1        3         4         8       0       6       0   
5820       34         1        3         2         8       0       7       0   
5821       33         1        3         3         8       0       6       1   

      MGODGE  MRELGE  ...  APERSONG  AG

| | MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | ALEVEN | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | ... | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 | 5822.0 |
| mean | 24.253349 | 1.110615 | 2.678805 | 2.99124 | 5.773617 | 0.696496 | 4.626932 | 1.069907 | 3.258502 | 6.183442 | ... | 0.076606 | 0.005325 | 0.006527 | 0.004638 | 0.570079 | 0.000515 | 0.006012 | 0.031776 | 0.007901 | 0.014256 |
| std | 12.846706 | 0.405842 | 0.789835 | 0.814589 | 2.85676 | 1.003234 | 1.715843 | 1.017503 | 1.597647 | 1.909482 | ... | 0.377569 | 0.072782 | 0.080532 | 0.077403 | 0.562058 | 0.022696 | 0.081632 | 0.210986 | 0.090463 | 0.119996 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 10.0 | 1.0 | 2.0 | 2.0 | 3.0 | 0.0 | 4.0 | 0.0 | 2.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 30.0 | 1.0 | 3.0 | 3.0 | 7.0 | 0.0 | 5.0 | 1.0 | 3.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 35.0 | 1.0 | 3.0 | 3.0 | 8.0 | 1.0 | 6.0 | 2.0 | 4.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 41.0 | 10.0 | 5.0 | 6.0 | 10.0 | 9.0 | 9.0 | 5.0 | 9.0 | 9.0 | ... | 8.0 | 1.0 | 1.0 | 2.0 | 7.0 | 1.0 | 2.0 | 3.0 | 2.0 | 2.0 |


Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?
- Number of observations (rows) : 5822

_(ii)_ How many variables (columns) does the dataset contain?  
- Number of variables (columns) : 86

_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
- Variable type of Purchase : object
- Levels of Purchase : ['No' 'Yes']

_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  
- Number of predictor variables : 85

In [11]:
# Add your code here
data = Caravan
target_variable = 'Purchase'
print(f"Number of observations (rows) : {data.shape[0]}")
print(f"Number of variables (columns) : {data.shape[1]}")
print(f"Variable type of {target_variable} : {data[target_variable].dtype}")
print(f"Levels of {target_variable} : {data[target_variable].unique()}")
print(f"Number of predictor variables : {data.shape[1]-1}")

Number of observations (rows) : 5822
Number of variables (columns) : 86
Variable type of Purchase : object
Levels of Purchase : ['No' 'Yes']
Number of predictor variables : 85


Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. The first is to 'standardize' the predictor variables. We can achieve this using `StandardScaler`, as follows:

In [12]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  
_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  


In [14]:
# Your answer here
# (v) Standardizing the predictor variables is important for several reasons:
#   1. Equal Weighting: It ensures that all predictor variables contribute equally to the model, preventing features with larger ranges from dominating.
#   2. Algorithm Performance: Algorithms like k-nearest neighbors (KNN), support vector machines (SVM), and those that use gradient descent 
#      (like logistic regression) perform better and converge faster when features are on a similar scale.
#   3. Distance Calculations: For algorithms that rely on distance metrics, like KNN, standardized features ensure meaningful and accurate distance calculations.

# (vi) We do not standardize the response variable Purchase because:
#   1. Categorical Nature: Purchase is a categorical variable with values 'Yes' and 'No'. Standardization is meant for numerical variables.
#   2. Interpretability: Keeping Purchase in its original form ensures that the model's predictions (e.g., the probability of a purchase) are interpretable and meaningful.
#   3. Algorithm Requirements: Classification algorithms are designed to handle categorical response variables and do not require them to be standardized.
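
The point about distance calculations can be made concrete with a small sketch. The data below are hypothetical (an income column in dollars and an age column in years, neither taken from `Caravan`), and the helper `nearest_to_first` is introduced only for illustration. It shows how the nearest neighbour of a point can change once each column is rescaled to mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical toy data: column 0 is income (dollars), column 1 is age (years).
X = np.array([
    [50_000.0, 25.0],   # row 0: the query point
    [50_500.0, 65.0],   # row 1: similar income, very different age
    [50_800.0, 26.0],   # row 2: slightly larger income gap, almost the same age
])

def nearest_to_first(data):
    """Return the index (1 or 2) of the row closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# On the raw scale, income differences (hundreds of dollars) swamp age differences.
raw_neighbor = nearest_to_first(X)

# Standardize each column to mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_neighbor = nearest_to_first(X_std)

print(f"Nearest neighbour on the raw scale: row {raw_neighbor}")
print(f"Nearest neighbour after standardizing: row {std_neighbor}")
```

On the raw scale the 40-year age gap of row 1 barely registers next to the income column, so row 1 is chosen; after standardizing, both columns contribute comparably and row 2 becomes the nearest neighbour.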



_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `random.seed` function). Why is setting a seed important? Is the particular seed value important? Why or why not?

In [37]:
# Add your code here
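import random
random.seed(42)

# The split below uses np.random.choice, which draws from NumPy's own generator.
# random.seed does not affect NumPy, so seed NumPy as well.
np.random.seed(42)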

# Importance of Setting a Random Seed
#   1. Reproducibility: Ensures experiments are reproducible by generating the same random numbers each run.
#   2. Consistency: Maintains consistency across runs, crucial for splitting data, initializing weights, and random sampling.
#   3. Comparison: Allows fair comparison between models by using the same subsets of data.

# Importance of the Particular Seed Value
#   1. Deterministic Output: Any seed value provides reproducible results. Different seeds give different but deterministic sequences.
#   2. Arbitrary Choice: The specific value doesn't matter; using any integer as a seed achieves the goal.
#   3. random.seed(42) is often used as a convention in examples and tutorials.
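
As a minimal sketch of the reproducibility point, the snippet below re-seeds NumPy's global generator and checks that the same draws come back. It also illustrates, under the assumption that the random split is drawn with `np.random.choice`, that seeding only Python's built-in `random` module does not make NumPy draws repeatable. The variable names are illustrative.

```python
import random
import numpy as np

# Seeding NumPy's global generator twice with the same value replays the same draws.
np.random.seed(42)
first_draw = np.random.choice([True, False], size=10, p=[0.75, 0.25])
np.random.seed(42)
second_draw = np.random.choice([True, False], size=10, p=[0.75, 0.25])
same_with_numpy_seed = np.array_equal(first_draw, second_draw)

# Seeding only Python's built-in `random` module leaves NumPy's generator untouched,
# so consecutive NumPy draws simply continue along the current stream.
random.seed(42)
draw_a = np.random.rand(10)
random.seed(42)
draw_b = np.random.rand(10)
same_with_python_seed = np.array_equal(draw_a, draw_b)

print(f"Same draws after np.random.seed: {same_with_numpy_seed}")
print(f"Same draws after random.seed only: {same_with_python_seed}")
```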


_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

In [49]:
# Create a random vector of True and False values
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']

# Extract the indices for training and testing
training_indices = training_X.index
testing_indices = testing_X.index

# Check for overlapping indices
overlapping_indices = set(training_indices).intersection(testing_indices)

if overlapping_indices:
    print("Training and testing sets are overlapping.")
else:
    print("Training and testing sets are non-overlapping.")

Training and testing sets are non-overlapping.
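
An alternative to the boolean-mask split is scikit-learn's `train_test_split`, sketched below on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the `Caravan` data). Unlike the mask approach, it yields an exact 75/25 split, and `stratify` keeps the proportion of 'Yes' labels similar in both sets, which may matter when one class is rare. This is one possible approach, not the one the assignment requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 200 rows, 3 predictors, roughly 10% 'Yes' labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(rng.random(200) < 0.1, "Yes", "No")
row_ids = np.arange(200)

# Split predictors, response, and row ids together so they stay aligned.
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X_demo, y_demo, row_ids,
    test_size=0.25,      # 25% of rows go to the test set
    random_state=42,     # fixes the shuffle so the split is reproducible
    stratify=y_demo,     # keep class proportions similar in both sets
)

# The two sets share no rows.
no_overlap = set(id_train).isdisjoint(id_test)
print(len(X_train), len(X_test), no_overlap)
```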


_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier()` class from `sklearn.neighbors`. Fit the KNN with k=1. (You may review its arguments by typing `help(KNeighborsClassifier)`.) 

In [99]:
# Add your code here
from sklearn.neighbors import KNeighborsClassifier
# Create a KNN classifier with k=1
knn1 = KNeighborsClassifier(n_neighbors=1)

# Fit the model to the training data
knn1.fit(training_X, training_Y)
print("Successfully fit the KNN model, k=1.")

Successfully fit the KNN model, k=1.
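
To illustrate what the classifier does when it predicts, here is a minimal from-scratch sketch of KNN with majority voting. The function `knn_predict` and the toy points are hypothetical; scikit-learn's implementation is more general and more efficient, so this only shows the idea.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_X - query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest rows
    votes = Counter(str(train_y[i]) for i in nearest)     # count labels among neighbours
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: three 'No' points and one isolated 'Yes' point.
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.2, 1.2]])
train_y = np.array(["No", "No", "No", "Yes"])
query = np.array([1.0, 1.0])

pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k3 = knn_predict(train_X, train_y, query, k=3)
print(f"k=1 prediction: {pred_k1}")
print(f"k=3 prediction: {pred_k3}")
```

With k=1 the single closest point decides the label, so one isolated 'Yes' point flips the prediction; with k=3 the two nearby 'No' points outvote it.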


Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
- Prediction Accuracy: 0.8902677988242979

_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)
- Predictor Error : 0.10973220117570215

In [57]:
# prediction accuracy rate
accuracy = knn1.score(testing_X, testing_Y)
print(f"Prediction Accuracy: {accuracy}")

Prediction Accuracy: 0.8902677988242979


In [58]:
# prediction error rate
error = 1 - accuracy
print(f"Predictor Error : {error}")

Predictor Error : 0.10973220117570215


_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:
- The random-guess accuracy consistently falls in the range of 87-89%, and the k=1 model accuracy is about 89%, so the model performs about the same as guessing 'Yes' or 'No' in proportion to the class frequencies.
- Because only about 6% of customers purchase insurance, always predicting 'No' would already be correct roughly 94% of the time, which is higher than the model's accuracy. The relatively high accuracy therefore does not show that the model is capturing the underlying patterns in the data.

In [108]:
# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan['Purchase'])) * 100

print(percentage_purchase)

# 1 To compare on the same data, calculate the percentage of 'Yes' in testing_Y
#   (divide by the total number of test rows, not by the number of 'No' rows)
percentage_purchase_testY = (testing_Y.eq('Yes').sum() / len(testing_Y)) * 100
print(f"Percentage_purchase from testing_Y :  {percentage_purchase_testY:.2f}")

# 2 Random guesses based on the result from step 1
# Determine the majority class label - 'No' in this case
# (percentage_purchase is on a 0-100 scale, so compare against 50)
majority_class_label = 'Yes' if percentage_purchase > 50 else 'No'

# 3 Repeat random guessing and accuracy calculation for 1000 iterations
num_iterations = 1000
mean_random_guess_accuracy_list = []

for _ in range(num_iterations):
    # 3.1 Generate random guesses for the length of testing_Y based on the data distribution
    random_guesses = np.random.choice(['Yes', 'No'], size=len(testing_Y), p=[percentage_purchase/100, 1 - percentage_purchase/100])
    
    # 3.2 Count the number of correct guesses
    num_correct_guesses = np.sum(random_guesses == testing_Y)
    
    # 3.3 Calculate the accuracy of random guessing for this iteration
    random_guess_accuracy = (num_correct_guesses / len(testing_Y)) * 100
    
    # 3.4 Append the accuracy to the list
    mean_random_guess_accuracy_list.append(random_guess_accuracy)

# Compute the mean accuracy from the list of accuracies
mean_random_guess_accuracy = np.mean(mean_random_guess_accuracy_list)

print("Mean Random Guess Accuracy (1000 iterations):", mean_random_guess_accuracy )

6.357325538911216
Percentage_purchase from testing_Y :  6.21
Mean Random Guess Accuracy (1000 iterations): 88.21456564337035
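
The simulated random-guess accuracy can be cross-checked analytically, and it helps to set it beside the simpler baseline of always predicting the majority class. The sketch below uses hypothetical labels with about the same class balance as reported above (6% 'Yes'); it is not computed from the actual test set.

```python
import numpy as np

# Hypothetical labels: 6 'Yes' and 94 'No', close to the class balance reported above.
y_true = np.array(["Yes"] * 6 + ["No"] * 94)

# Baseline 1: always predict the majority class.
labels, counts = np.unique(y_true, return_counts=True)
majority_label = str(labels[np.argmax(counts)])
majority_accuracy = float(np.mean(y_true == majority_label))

# Baseline 2: guess 'Yes' with probability p, independently of the truth.
# A guess is correct when both are 'Yes' (p * p) or both are 'No' ((1 - p) * (1 - p)).
p_yes = float(np.mean(y_true == "Yes"))
proportional_accuracy = p_yes**2 + (1 - p_yes)**2

print(f"Always predict '{majority_label}': {majority_accuracy:.4f}")
print(f"Guess in proportion to class frequency: {proportional_accuracy:.4f}")
```

Under these assumptions, proportional guessing gives an expected accuracy of about 0.887, in line with the simulated value of roughly 88%, while always predicting 'No' gives 0.94. The majority-class baseline is therefore the harder one for a classifier to beat.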


_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?
- The K=3 model's accuracy of 92.95% is higher than both the K=1 model's accuracy (89.03%) and the random-guess accuracy (about 88%), so it does perform better than random guessing. However, it is still slightly below the roughly 94% that would be obtained by always predicting 'No', so accuracy alone does not show that the model identifies purchasers well.

In [102]:
# Your code here
# Create a KNN classifier with k=3
knn3 = KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training data
knn3.fit(training_X, training_Y)
print("Successfully fit the KNN model, k=3.")

accuracy = knn3.score(testing_X, testing_Y)
print(f"Prediction Accuracy: {accuracy}")


Successfully fit the KNN model, k=3.
Prediction Accuracy: 0.9294578706727629
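
Because 'Yes' is rare, overall accuracy can hide how well a model finds actual purchasers. A minimal sketch, using hypothetical labels and predictions rather than the notebook's results, breaks the predictions into the four confusion-matrix cells and computes precision and recall for the 'Yes' class:

```python
import numpy as np

# Hypothetical test labels and predictions (not the notebook's actual results).
y_true = np.array(["No"] * 90 + ["Yes"] * 10)
y_pred = np.array(["No"] * 88 + ["Yes"] * 2 + ["Yes"] * 3 + ["No"] * 7)

# Confusion-matrix cells, treating 'Yes' as the positive class.
true_pos = int(np.sum((y_true == "Yes") & (y_pred == "Yes")))
false_pos = int(np.sum((y_true == "No") & (y_pred == "Yes")))
false_neg = int(np.sum((y_true == "Yes") & (y_pred == "No")))
true_neg = int(np.sum((y_true == "No") & (y_pred == "No")))

accuracy = (true_pos + true_neg) / len(y_true)
precision_yes = true_pos / (true_pos + false_pos)  # share of predicted 'Yes' that are correct
recall_yes = true_pos / (true_pos + false_neg)     # share of actual 'Yes' that are found

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision for 'Yes': {precision_yes:.2f}")
print(f"Recall for 'Yes': {recall_yes:.2f}")
```

In this made-up example the accuracy is 0.91, yet only 3 of the 10 actual purchasers are identified, which is the kind of gap that accuracy alone does not reveal.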


# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applying_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
