# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate in your pull request whom you have worked with by tagging their GitHub usernames. Separate submissions are required.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [3]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Add your code here
print(Caravan.describe())
print(Caravan.info())
print(Caravan.head())

# Check the distribution of the response variable 'Purchase'
print(Caravan['Purchase'].value_counts())

           MOSTYPE     MAANTHUI      MGEMOMV     MGEMLEEF     MOSHOOFD  \
count  5822.000000  5822.000000  5822.000000  5822.000000  5822.000000   
mean     24.253349     1.110615     2.678805     2.991240     5.773617   
std      12.846706     0.405842     0.789835     0.814589     2.856760   
min       1.000000     1.000000     1.000000     1.000000     1.000000   
25%      10.000000     1.000000     2.000000     2.000000     3.000000   
50%      30.000000     1.000000     3.000000     3.000000     7.000000   
75%      35.000000     1.000000     3.000000     3.000000     8.000000   
max      41.000000    10.000000     5.000000     6.000000    10.000000   

            MGODRK       MGODPR       MGODOV       MGODGE       MRELGE  ...  \
count  5822.000000  5822.000000  5822.000000  5822.000000  5822.000000  ...   
mean      0.696496     4.626932     1.069907     3.258502     6.183442  ...   
std       1.003234     1.715843     1.017503     1.597647     1.909482  ...   
min       0.00000

Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?    
_(ii)_ How many variables (columns) does the dataset contain?    
_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  

In [5]:
#(i) How many observations (rows) does the dataset contain?
# Load the Caravan dataset
Caravan = load_data('Caravan')

# Get the number of rows and columns
rows, columns = Caravan.shape

# Print the number of observations (rows)
print(f"The Caravan dataset contains {rows} observations.")

The Caravan dataset contains 5822 observations.


In [6]:
# (ii) How many variables (columns) does the dataset contain?
Caravan = load_data('Caravan')

# Get the number of rows and columns
rows, columns = Caravan.shape

# Print the number of variables (columns)
print(f"The Caravan dataset contains {columns} variables.")


The Caravan dataset contains 86 variables.


In [7]:
# (iii) What 'variable' type is the response variable Purchase (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?
Caravan = load_data('Caravan')

# Check the type of the 'Purchase' variable
purchase_type = Caravan['Purchase'].dtype

# Check the unique levels of the 'Purchase' variable
purchase_levels = Caravan['Purchase'].unique()

print(f"The 'Purchase' variable type is: {purchase_type}")
print(f"The levels of the 'Purchase' variable are: {purchase_levels}")


The 'Purchase' variable type is: object
The levels of the 'Purchase' variable are: ['No' 'Yes']


In [8]:
# (iv) How many predictor variables do we have (Hint: all variables other than Purchase)?
Caravan = load_data('Caravan')

# Total number of columns in the dataset
total_columns = Caravan.shape[1]

# Number of predictor variables
predictor_variables = total_columns - 1

print(f"The number of predictor variables is: {predictor_variables}")


The number of predictor variables is: 85


Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. A first essential step is to 'standardize' the predictor variables. We can achieve this using `StandardScaler`, as follows:

In [9]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  
_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  


(v) Why is it important to standardize the predictor variables?
KNN classifies an observation by its distance to nearby training points, so any predictor measured on a larger scale contributes more to that distance and can dominate the choice of neighbours. Standardizing puts every predictor on a common scale so that each one contributes comparably to the distance calculation.

In practice, `StandardScaler` from `sklearn.preprocessing` automates this by rescaling each feature to mean 0 and standard deviation 1.
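
To make the scale argument concrete, here is a minimal sketch on made-up data (the income and age values are hypothetical and unrelated to `Caravan`). It standardizes by hand with NumPy, using the population standard deviation, which is the same convention `StandardScaler` applies.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in decades
X = np.array([
    [30_000.0, 2.0],
    [31_000.0, 7.0],
    [60_000.0, 2.5],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from row 0: the income column swamps the age column
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# z-score each column (mean 0, population standard deviation 1)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Standardized distances from row 0: the large age gap to row 1 now counts
std_d01 = euclidean(Z[0], Z[1])
std_d02 = euclidean(Z[0], Z[2])

print(f"raw:          d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"standardized: d(0,1)={std_d01:.2f}  d(0,2)={std_d02:.2f}")
```

On the raw data, row 1 looks far closer to row 0 than row 2 does, purely because income is measured in large units. After standardizing, the two distances are of similar size, because the age difference is no longer drowned out.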

(vi) Why did we elect not to standardize our response variable Purchase?

Standardizing the response variable `Purchase` is neither necessary nor meaningful here, for several reasons:

Categorical response: `Purchase` is a label with two levels ('No' and 'Yes'), not a numeric measurement, so it has no mean or standard deviation to standardize against.

Distances use only the predictors: KNN measures the distance between observations in predictor space and then takes a vote among the neighbours' labels. The response never enters the distance calculation, so its scale cannot distort which neighbours are chosen.

Preserving interpretability: keeping the original labels means predictions come back as 'No' or 'Yes' and can be compared directly with `testing_Y` when computing accuracy.


_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `random.seed` function). Why is setting a seed important? Is the particular seed value important? Why or why not?

In [11]:
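# Add your code here
# Set the seed on NumPy's random generator for reproducibility.
# The split below uses np.random.choice, which Python's built-in
# random.seed() would not affect.
np.random.seed(42)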


Setting a random seed is crucial in ensuring reproducibility and consistency in machine learning experiments. By setting a random seed, you fix the initial state of the random number generator, which ensures that the same sequence of random numbers is generated each time you run the code. This consistency allows for reproducible results and facilitates debugging, sharing, and comparing experiments.

The particular seed value chosen is not inherently important in most cases. The primary purpose is to ensure that the same sequence of random numbers is generated across different runs of the code. As long as the chosen seed is consistent across experiments, the specific value itself does not significantly affect the results. However, it's essential to document the seed value used to ensure transparency and reproducibility in research.
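
As a minimal sketch of the reproducibility point, the snippet below re-seeds NumPy's global generator and checks that the same True/False mask comes back. The helper name `draw_split` is hypothetical. The second half illustrates one assumption worth checking in any notebook: Python's built-in `random` module and `np.random` keep separate state, so seeding one does not seed the other.

```python
import random
import numpy as np

def draw_split(seed, n=10):
    """Seed NumPy's global generator, then draw a True/False mask."""
    np.random.seed(seed)
    return np.random.choice([True, False], size=n, p=[0.75, 0.25])

# Same seed, same sequence of draws
same_a = draw_split(42)
same_b = draw_split(42)
print(np.array_equal(same_a, same_b))

# Seeding Python's built-in generator leaves NumPy's generator untouched
np.random.seed(0)
expected = np.random.rand(3)
np.random.seed(0)
random.seed(123)  # has no effect on np.random
after_stdlib_seed = np.random.rand(3)
print(np.array_equal(expected, after_stdlib_seed))
```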

_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

In [20]:
# Create a random vector of True and False values
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']
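
One way to sanity-check a mask-based split like this is to confirm that the training and testing rows do not overlap and together cover every row. The sketch below uses a hypothetical row count of 1000 rather than the `Caravan` data. Note that drawing each row independently gives a training share close to 75%, not exactly 75%.

```python
import numpy as np

np.random.seed(42)
n = 1000  # hypothetical number of rows

# Same masking approach as above: each row is assigned to training with probability 0.75
split = np.random.choice([True, False], size=n, replace=True, p=[0.75, 0.25])

train_idx = np.flatnonzero(split)   # row positions used for training
test_idx = np.flatnonzero(~split)   # row positions used for testing

# The two index sets should share no rows and together cover every row
overlap = np.intersect1d(train_idx, test_idx)
covered = np.union1d(train_idx, test_idx)

print("overlapping rows:", len(overlap))
print("all rows covered:", len(covered) == n)
print(f"training share: {split.mean():.3f}")
```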


_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier` class from scikit-learn. Fit the KNN with k=1. (Once you have created a classifier object named `knn`, you may review the arguments to its fit method by typing `help(knn.fit)`.) 

In [23]:
# Add your code here
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier object with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model to training data
knn.fit(training_X, training_Y)

# Calculate the prediction accuracy on the test set
accuracy = knn.score(testing_X, testing_Y)

# Print the prediction accuracy
print("Prediction accuracy:", accuracy)




Prediction accuracy: 0.8848275862068965
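
To illustrate what a KNN classifier does when it predicts, here is a from-scratch sketch in NumPy on made-up points. The function `knn_predict` and the toy arrays are hypothetical and are not how scikit-learn implements `KNeighborsClassifier`. In particular, tie-breaking here simply takes the alphabetically first label, which may differ from the library's behaviour.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=1):
    """Predict a label for each test row by majority vote among its k nearest training rows."""
    preds = []
    for row in test_X:
        # Euclidean distance from this test row to every training row
        dists = np.sqrt(((train_X - row) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Count the labels among the k nearest neighbours and take the most common
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Hypothetical, already-standardized toy data
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.2, 0.1]])
train_y = np.array(["No", "No", "Yes", "Yes", "No"])
test_X = np.array([[0.05, 0.05], [0.95, 1.0]])

pred_k1 = knn_predict(train_X, train_y, test_X, k=1)
pred_k3 = knn_predict(train_X, train_y, test_X, k=3)
print(pred_k1)
print(pred_k3)
```

With k=1 each test point simply inherits the label of its single closest training point, which is why k=1 models tend to be sensitive to noise in the training data.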


Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)

In [24]:
# Assuming accuracy is calculated previously
accuracy_rate = accuracy * 100
print("Prediction accuracy rate: {:.2f}%".format(accuracy_rate))


Prediction accuracy rate: 88.48%


In [None]:
# prediction accuracy rate  88.48%

In [25]:

error_rate = 1 - accuracy
print("Prediction error rate: {:.2f}%".format(error_rate * 100))


Prediction error rate: 11.52%


In [None]:
# prediction error rate 11.52%
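
For reference, accuracy and error rate can also be computed by hand from a vector of predictions, which makes the relationship `error = 1 - accuracy` explicit. The labels below are invented for illustration and are not the `Caravan` results. The four counts at the end form a confusion matrix, which is often more informative than accuracy alone when one class is rare.

```python
import numpy as np

# Hypothetical true labels and predictions (10 customers)
y_true = np.array(["No", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"])
y_pred = np.array(["No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No"])

# Accuracy is the share of predictions that match the truth
accuracy_by_hand = float(np.mean(y_pred == y_true))
error_by_hand = 1 - accuracy_by_hand

# Confusion-matrix counts, treating 'Yes' as the positive class
true_pos = int(np.sum((y_pred == "Yes") & (y_true == "Yes")))
false_pos = int(np.sum((y_pred == "Yes") & (y_true == "No")))
false_neg = int(np.sum((y_pred == "No") & (y_true == "Yes")))
true_neg = int(np.sum((y_pred == "No") & (y_true == "No")))

print(f"accuracy: {accuracy_by_hand:.2f}, error: {error_by_hand:.2f}")
print(f"TP={true_pos} FP={false_pos} FN={false_neg} TN={true_neg}")
```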

_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:

In [26]:
# Calculate the percentage of customers who purchase insurance
# (number of 'Yes' responses divided by the total number of customers)
percentage_purchase = Caravan['Purchase'].eq('Yes').mean() * 100

print(f"{percentage_purchase:.2f}%")

5.98%


_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?

In [29]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier object with k=3
knn_2 = KNeighborsClassifier(n_neighbors=3)

# Fit the second KNN model to training data
knn_2.fit(training_X, training_Y)

# Calculate the prediction accuracy on the test set for the second model
accuracy_2 = knn_2.score(testing_X, testing_Y)

# Print the prediction accuracy of the second model
print("Prediction accuracy of the second KNN model (k=3): {:.2f}%".format(accuracy_2 * 100))

# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan)) * 100

# A guesser that always predicts 'Yes' is correct exactly as often as customers
# actually purchase, so its accuracy equals the purchase rate
random_guess_accuracy = percentage_purchase / 100

# Print the random guess accuracy rate
print("Random guess accuracy rate: {:.2f}%".format(random_guess_accuracy * 100))

# Compare the accuracy of the second model with random guess accuracy
if accuracy_2 > random_guess_accuracy:
    print("The second KNN model (k=3) performs better than random guessing.")
elif accuracy_2 == random_guess_accuracy:
    print("The second KNN model (k=3) has the same accuracy as random guessing.")
else:
    print("The second KNN model (k=3) performs worse than random guessing.")


Prediction accuracy of the second KNN model (k=3): 92.48%
Random guess accuracy rate: 5.98%
The second KNN model (k=3) performs better than random guessing.
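
The baseline used above corresponds to always guessing 'Yes'. Other naive baselines give very different numbers, so it may be worth comparing against them as well. The sketch below takes the 5.98% purchase rate printed above as a given input and assumes the test set has roughly the same rate as the full dataset. The variable names are illustrative only.

```python
# Purchase rate reported above (5.98%), treated here as a given input
p_yes = 0.0598

# Always predicting 'Yes' is right only when the customer purchases
always_yes = p_yes

# Always predicting 'No' is right whenever the customer does not purchase
always_no = 1 - p_yes

# Guessing 'Yes' with probability p_yes and 'No' otherwise:
# correct when guess and truth are both 'Yes' or both 'No'
proportional_guess = p_yes**2 + (1 - p_yes)**2

print(f"always 'Yes':       {always_yes:.2%}")
print(f"always 'No':        {always_no:.2%}")
print(f"proportional guess: {proportional_guess:.2%}")
```

Under these assumptions, a classifier would need accuracy above roughly 94% to beat the trivial rule of always predicting 'No', which is a much stricter benchmark than the 5.98% figure.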


# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
