# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [19]:
import numpy as np
import pandas as pd
import yellowbrick
from yellowbrick.datasets import load_spam


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [20]:
# TO DO: Import spam dataset from yellowbrick library
X, y = load_spam()
# TO DO: Print size and type of X and y
print (X.shape)
print(type(X))
print (y.shape)
print(type(y))

(4600, 57)
<class 'pandas.core.frame.DataFrame'>
(4600,)
<class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [21]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing = np.isnan(X).sum().sum()
print("The total number of missing values is ", missing, ".")


The total number of missing values is  0 .
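
The spam data has no gaps, so nothing needs to be filled in here. As a minimal sketch of what the fill-in step could look like when gaps do exist, the example below counts the missing values and replaces them with each column's median. The frame `X_demo` and its values are hypothetical, and median imputation is only one common choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X (the real spam data has no gaps).
X_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Count missing values per column, then in total.
missing_per_column = X_demo.isnull().sum()
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)

# Fill each gap with the median of its own column.
X_filled = X_demo.fillna(X_demo.median())
print(X_filled)
```

The median is less sensitive to outliers than the mean, which is why it is often a reasonable default for skewed numeric features.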


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [22]:
# TO DO: Create X_small and y_small 
import sklearn
from sklearn.model_selection import train_test_split
X_small = X.sample(frac=0.05, random_state=0)
y_small = y.sample(frac=0.05, random_state=0)

print(X_small.shape)
print(y_small.shape)
X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(X_small, y_small, test_size=0.2, random_state=0)


(230, 57)
(230,)
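
The instructions ask for `train_test_split` to build the 5% subset. A minimal sketch of that approach is shown below on synthetic data with the same shape as the spam dataset. The names `X_demo`, `y_demo`, `X_small_demo` and `y_small_demo` are hypothetical, and the `stratify` argument is an optional addition that keeps the class balance of the subset close to that of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the spam data (4600 x 57).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((4600, 57)))
y_demo = pd.Series(rng.integers(0, 2, size=4600))

# Keep 5% of the rows and discard the remaining 95%.
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print(X_small_demo.shape)
print(y_small_demo.shape)
```

Because the features and the target are split in a single call, the rows of `X_small_demo` and `y_small_demo` are guaranteed to stay aligned.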


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [23]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)
model_2_column = LogisticRegression(max_iter=2000)
model_small = LogisticRegression(max_iter=2000)

#1 - Implement the machine learning model with X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

#2 - Implement the machine learning model with first 2 columns of X and y
X_2columns = X.iloc[:, :2]
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2columns, y, test_size=0.2, random_state=0)
model_2_column.fit(X_2_train, y_2_train)

#3 - Implement the machine learning model with X_small and y_small
model_small.fit(X_small_train, y_small_train)

# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

# logreg = LogisticRegression(max_iter=10000).fit(X_train, y_train)

print("Training set score for X and y: {:.3f}".format(model.score(X_train, y_train)))
print("Validation set score for X and y: {:.3f}\n".format(model.score(X_test, y_test)))

print("Training set score for first 2 columns of X and y: {:.3f}".format(model_2_column.score(X_2_train, y_2_train)))
print("Validation set score for first 2 columns of X and y: {:.3f}\n".format(model_2_column.score(X_2_test, y_2_test)))

print("Training set score for small X and small y: {:.3f}".format(model_small.score(X_small_train, y_small_train)))
print("Validation set score for small X and small y: {:.3f}\n".format(model_small.score(X_small_test, y_small_test)))


Training set score for X and y: 0.927
Validation set score for X and y: 0.938

Training set score for first 2 columns of X and y: 0.615
Validation set score for first 2 columns of X and y: 0.593

Training set score for small X and small y: 0.957
Validation set score for small X and small y: 0.804
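
Step 5 asks for a `results` DataFrame, and the hint suggests filling it in a loop. The sketch below shows one way this could be done. It runs on synthetic data from `make_classification` rather than the spam dataset, so the names (`X_demo`, `results_demo`, and so on) are hypothetical and the scores it prints are not the assignment's results. "Data size" is taken here to mean rows times columns, which is an assumption about the intended meaning.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the real dataset).
X_arr, y_arr = make_classification(n_samples=1000, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_arr)
y_demo = pd.Series(y_arr)

# 5% subset, analogous to X_small and y_small.
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

# The three feature/target pairs to compare.
datasets = {
    "full": (X_demo, y_demo),
    "first two columns": (X_demo.iloc[:, :2], y_demo),
    "5% subset": (X_small_demo, y_small_demo),
}

rows = []
for name, (X_part, y_part) in datasets.items():
    # Hold out 20% of each dataset for validation.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_part, y_part, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    rows.append({
        "Data size": X_part.size,  # rows x columns
        "training accuracy": clf.score(X_tr, y_tr),
        "validation accuracy": clf.score(X_val, y_val),
    })

results_demo = pd.DataFrame(rows, index=list(datasets))
print(results_demo)
```

Collecting the rows in a list and building the DataFrame once avoids repeated cell-by-cell assignment.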



### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. The training score was higher with the smaller dataset (5% of the rows) than with the full dataset (0.957 compared to 0.927). The validation score, however, was much higher for the full dataset (0.938 compared to 0.804). The small model therefore overfits: its training score exceeds its validation score by about 0.15, while the full model shows no such gap. With only the first 2 columns, the training and validation scores were both markedly lower than for the other two datasets (0.615 for training and 0.593 for validation). This suggests that the first 2 columns alone carry too little information to separate spam from regular e-mail, and that the remaining features contribute most of the predictive power.

2. A false positive would represent marking a regular e-mail as spam, and a false negative would represent marking a spam e-mail as a regular one. In my opinion, both situations are unfortunate (false negative or false positive) but a false negative is worse, because spam e-mails can potentially be very harmful.
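
The two error types can be counted directly with a confusion matrix. The sketch below uses a small set of hypothetical labels, assuming that 1 means spam and 0 means a regular e-mail. It is not computed from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption: 1 = spam, 0 = regular e-mail).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (regular e-mail marked as spam):", fp)
print("False negatives (spam marked as regular e-mail):", fn)
```

Accuracy alone hides which of the two errors a model makes more often, so the counts above are a useful complement when one error is considered worse than the other.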


## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [24]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

# Load the concrete feature matrix and target vector
X_concrete, y_concrete = load_concrete()
# TO DO: Print size and type of X and y
print (X_concrete.shape)
print(type(X_concrete))
print (y_concrete.shape)
print(type(y_concrete))

(1030, 8)
<class 'pandas.core.frame.DataFrame'>
(1030,)
<class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [25]:
# TO DO: Check if there are any missing values and fill them in if necessary
missing2 = np.isnan(X_concrete).sum().sum()
print("The total number of missing values is ", missing2, ".")

The total number of missing values is  0 .


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [26]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
model_concrete = LinearRegression()

#1 - Implement the machine learning model with X and y
X_concrete_train, X_concrete_test, y_concrete_train, y_concrete_test = train_test_split(X_concrete, y_concrete, test_size=0.2, random_state=0)
model_concrete.fit(X_concrete_train, y_concrete_train)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [27]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

# Predict using the trained model
y_train_pred = model_concrete.predict(X_concrete_train)
y_test_pred = model_concrete.predict(X_concrete_test)

# Compute the Mean Squared Error for training and test sets
mse_train = mean_squared_error(y_concrete_train, y_train_pred)
mse_test = mean_squared_error(y_concrete_test, y_test_pred)

# Compute the R2 score for training and test sets
r2_train = r2_score(y_concrete_train, y_train_pred)
r2_test = r2_score(y_concrete_test, y_test_pred)

# Print out the results
print("Training Mean squared Error:", mse_train)
print("Test Mean Squared Error:", mse_test)
print("\nTraining R2 Score:", r2_train)
print("Test R2 Score:", r2_test)


Training Mean squared Error: 110.34550122934108
Test Mean Squared Error: 95.63533482690423

Training R2 Score: 0.6090710418548884
Test R2 Score: 0.6368981103411244


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [28]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(index=['MSE', 'R2 score'], 
                       columns=['Training accuracy', 'Validation accuracy'])

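# Use .loc for a single-step assignment; chained indexing is not reliable in newer pandas versions
results.loc['MSE', 'Training accuracy'] = mse_train
results.loc['R2 score', 'Training accuracy'] = r2_train
results.loc['MSE', 'Validation accuracy'] = mse_test
results.loc['R2 score', 'Validation accuracy'] = r2_test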

results

| | Training accuracy | Validation accuracy |
|---|---|---|
| MSE | 110.345501 | 95.635335 |
| R2 score | 0.609071 | 0.636898 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

No. The model reached an R2 of 0.61 on the training set and 0.64 on the validation set, so it explains only about 60% of the variance in compressive strength, which is relatively low for predicting the dependent variable. This low R2 value could be due to various factors. One such factor could be that the 'X' data does not share a linear relationship with the 'y' data. Additionally, in this example there were 8 features, compared to the previous problem in which there were over 50 features, so perhaps additional features could have been included to help increase accuracy.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. Where did you source your code?
    - I referenced some of the previous lecture and lab examples and https://www.geeksforgeeks.org/ for some syntax help. I completed the assignment in VSCode.
1. In what order did you complete the steps?
    - I completed them in order of the steps laid out, sequentially.
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
    - I didn't use any generative AI.
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
    - Yes, I had some trouble importing the right modules, but referencing the past lecture and lab examples helped clear some of those challenges up.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
- In the first part, there was a clear relationship between the size of the dataset and the validation score: 0.938 with the full dataset versus 0.804 with the 5% subset. This is consistent with what we have been learning in lectures. Additionally, the fewer the features, the lower the accuracy seems to be. This applied in the first part when only 2 columns of X were used to predict y, where the training and validation accuracy both dropped significantly (to 0.615 and 0.593).

In the 2nd part with the concrete data, there were only 8 features as opposed to 57, and the scores were lower (R2 of 0.61 on the training set and 0.64 on the validation set). R2 and classification accuracy are different metrics, so the two parts are not directly comparable. This also doesn't necessarily indicate that fewer features will always result in lower accuracy, since some features can be much stronger predictors than others, but it is a pattern that was apparent in this assignment.

The second model also appeared to be underfitting, based on the low R2 score on both the validation and training sets.



## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked working through this Jupyter notebook. I also enjoyed analyzing the two datasets in different ways (binary logistic regression and then linear regression). I've always enjoyed data analysis, so this assignment as a whole was relatively enjoyable. I did get slightly frustrated at times when imports failed, such as those from sklearn, but I figured it out eventually after realizing the errors were entirely due to my own small mistakes.

Though it was a simple example, this was a motivating assignment because I'm very curious to learn more about machine learning, so it's exciting to get a foot into the door with this very interesting field. 



## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [29]:
# TO DO: ADD YOUR CODE HERE

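#Part 2: Ridge Regression
# Note: X_train, X_test, y_train, y_test are the spam split created in Step 3 of Part 1;
# the concrete split from Part 2 is X_concrete_train, X_concrete_test, y_concrete_train, y_concrete_test.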
from sklearn.linear_model import Ridge

ridge0001 = Ridge(alpha=0.001).fit(X_train, y_train)
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
ridge100 = Ridge(alpha=100).fit(X_train, y_train)

print("Ridge Regression:")
print("\nalpha = 0.001")
print("Training score: {:.2f}".format(ridge0001.score(X_train, y_train)))
print("Validation score: {:.2f}".format(ridge0001.score(X_test, y_test)))

print("\nalpha = 0.1")
print("Training score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Validation score: {:.2f}".format(ridge01.score(X_test, y_test)))

print("\nalpha = 1")
print("Training score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Validation score: {:.2f}".format(ridge.score(X_test, y_test)))

print("\nalpha = 10")
print("Training score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Validation score: {:.2f}".format(ridge10.score(X_test, y_test)))

print("\nalpha = 100")
print("Training score: {:.2f}".format(ridge100.score(X_train, y_train)))
print("Validation score: {:.2f}".format(ridge100.score(X_test, y_test)))

print("Conclusion: Ridge regression has even lower accuracy regardless of which alpha value is used.")
print("\n")

#Lasso Regression:
from sklearn.linear_model import Lasso

lasso0001 = Lasso(alpha=0.001).fit(X_train, y_train)
lasso01 = Lasso(alpha=0.1).fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
lasso10 = Lasso(alpha=10).fit(X_train, y_train)
lasso100 = Lasso(alpha=100).fit(X_train, y_train)

print("Lasso Regression:")
print("\nalpha = 0.001")
print("Training set score: {:.2f}".format(lasso0001.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso0001.score(X_test, y_test)))

print("\nalpha = 0.1")
print("Training set score: {:.2f}".format(lasso01.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso01.score(X_test, y_test)))

print("\nalpha = 1")
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso.score(X_test, y_test)))

print("\nalpha = 10")
print("Training set score: {:.2f}".format(lasso10.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso10.score(X_test, y_test)))

print("\nalpha = 100")
print("Training set score: {:.2f}".format(lasso100.score(X_train, y_train)))
print("Validation set score: {:.2f}".format(lasso100.score(X_test, y_test)))

print("Conclusion: Lasso regression has even LOWER accuracy regardless of which alpha value is used. Alpha = 0.001 seems to provide the most accurate results.")
print("\n")


Ridge Regression:

alpha = 0.001
Training score: 0.56
Validation score: 0.54

alpha = 0.1
Training score: 0.56
Validation score: 0.54

alpha = 1
Training score: 0.56
Validation score: 0.54

alpha = 10
Training score: 0.56
Validation score: 0.54

alpha = 100
Training score: 0.55
Validation score: 0.53
Conclusion: Ridge regression has even lower accuracy regardless of which alpha value is used.


Lasso Regression:

alpha = 0.001
Training set score: 0.56
Validation set score: 0.53

alpha = 0.1
Training set score: 0.25
Validation set score: -0.20

alpha = 1
Training set score: 0.11
Validation set score: -0.25

alpha = 10
Training set score: 0.08
Validation set score: 0.02

alpha = 100
Training set score: 0.00
Validation set score: -0.00
Conclusion: Lasso regression has even LOWER accuracy regardless of which alpha value is used. Alpha = 0.001 seems to provide the most accurate results.
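
The alpha sweep could also be written as a loop over a logarithmic grid, which avoids one variable per model and makes it easy to pick the best validation score. The sketch below assumes the intent is to repeat Part 2 on a regression split. It runs on synthetic data from `make_regression` with the concrete dataset's shape, so the names (`X_demo`, `sweep`, `best`) are hypothetical and its scores say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the concrete data's shape (1030 x 8).
X_demo, y_demo = make_regression(
    n_samples=1030, n_features=8, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Six values from 0.001 to 100, evenly spaced on a log scale.
alphas = np.logspace(-3, 2, 6)

rows = []
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        reg = Model(alpha=alpha).fit(X_tr, y_tr)
        rows.append({
            "model": name,
            "alpha": alpha,
            "train R2": reg.score(X_tr, y_tr),
            "validation R2": reg.score(X_val, y_val),
        })

sweep = pd.DataFrame(rows)
print(sweep)

# Row with the highest validation R2 across both models.
best = sweep.loc[sweep["validation R2"].idxmax()]
print(best)
```

To apply this to the assignment, the synthetic split would be swapped for the concrete training and validation sets from Part 2.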




*ANSWER HERE*