# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [6]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [300]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam()

# TO DO: Print size and type of X and y
print(X.shape)
print(y.shape)

(4600, 57)
(4600,)
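
The cell above prints the shapes only, while the task also asks for the types. A minimal sketch of printing both follows. It uses a tiny stand-in DataFrame and Series, on the assumption (suggested by the `.isnull()` calls in Step 2) that `load_spam()` returns pandas objects; `X_demo`, `y_demo`, and `describe` are hypothetical names.

```python
import pandas as pd

# Stand-in data (hypothetical); in the notebook, X and y come from load_spam()
X_demo = pd.DataFrame({"word_freq_make": [0.0, 0.21, 0.06],
                       "word_freq_address": [0.64, 0.28, 0.0]})
y_demo = pd.Series([1, 1, 0], name="is_spam")

def describe(name, obj):
    """Return a one-line summary with the object's type and shape."""
    return f"{name}: type={type(obj).__name__}, shape={obj.shape}"

print(describe("X", X_demo))
print(describe("y", y_demo))
```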


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [301]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print('\n')
print(y.isnull().sum())

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [355]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split


X_small, X_test, y_small, y_test = train_test_split(X, y, train_size = 0.05, random_state = 498)
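
As a rough illustration of what `train_size = 0.05` does, the sketch below runs the same kind of split on synthetic data and checks the resulting sizes. The `stratify` argument is an optional addition that is not used in the cell above; it keeps the class ratio of the small sample close to that of the full data. All `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (hypothetical): 1000 rows, 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 600 + [1] * 400)

# Keep 5% of the rows; stratify=y_demo preserves the 60/40 class ratio
X_small_demo, X_rest_demo, y_small_demo, y_rest_demo = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print("Small sample:", X_small_demo.shape)
print("Remaining rows:", X_rest_demo.shape)
print("Fraction of class 1 in small sample:", y_small_demo.mean())
```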


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [357]:
# Import LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Instantiate one LogisticRegression(max_iter=2000) model per dataset
model1 = LogisticRegression(max_iter=2000)
model2 = LogisticRegression(max_iter=2000)
model3 = LogisticRegression(max_iter=2000)

# Implement the machine learning model with three different datasets:
#     X and y
model1.fit(X, y)

#     only first two columns of X and y
model2.fit(X.iloc[:,:2], y)

#     X_small and y_small
model3.fit(X_small, y_small)





### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

In [359]:
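# Calculate training and validation scores for the three models
# Note: X_test holds the 95% of rows left over from the Step 2 split, so
# models 1 and 2 (fit on all of X) have already seen these rows in training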
from sklearn.metrics import accuracy_score
print('Training score for model 1 trained with all data: {:.3f}'.format(model1.score(X,y)))
print('Validation score for model 1 trained with all data: {:.3f}'.format(model1.score(X_test,y_test)))

print('\n')

print('Training score for model 2 trained with two columns of data: {:.3f}'.format(model2.score(X.iloc[:,:2],y)))
print('Validation score for model 2 trained with two columns of data: {:.3f}'.format(model2.score(X_test.iloc[:,:2],y_test)))
      
print('\n')

print('Training score for model 3 trained with 5% of data {:.3f}'.format(model3.score(X_small,y_small)))
print('Validation score for model 3 trained with 5% of data: {:.3f}'.format(model3.score(X_test,y_test)))



Training score for model 1 trained with all data: 0.932
Validation score for model 1 trained with all data: 0.932


Training score for model 2 trained with two columns of data: 0.616
Validation score for model 2 trained with two columns of data: 0.616


Training score for model 3 trained with 5% of data 0.935
Validation score for model 3 trained with 5% of data: 0.894
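
Models 1 and 2 are fit on all of `X`, and `X_test` is a subset of `X`, so their validation scores are measured on rows the models were trained on. That probably explains why those scores nearly equal the training scores. The sketch below shows one possible way to obtain a held-out score: split first, then fit only on the training portion. It uses synthetic data from `make_classification`, and every `_demo` name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (hypothetical)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Split first, so the validation rows are never seen during fitting
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model_demo = LogisticRegression(max_iter=2000)
model_demo.fit(X_train_demo, y_train_demo)

train_acc = model_demo.score(X_train_demo, y_train_demo)
val_acc = model_demo.score(X_val_demo, y_val_demo)
print(f"Training accuracy:   {train_acc:.3f}")
print(f"Validation accuracy: {val_acc:.3f}")
```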


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [360]:
results = pd.DataFrame(columns = ['Data Size', 'Training Accuracy', 'Validation Accuracy'])
results['Data Size'] = ['100% of data', 'First two columns of data', '5% of data']
results['Training Accuracy'] = [model1.score(X,y), model2.score(X.iloc[:,:2],y), model3.score(X_small,y_small) ]
results['Validation Accuracy'] = [model1.score(X_test, y_test), model2.score(X_test.iloc[:, :2], y_test), model3.score(X_test,y_test)]
results
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

| | Data Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 0 | 100% of data | 0.931739 | 0.932494 |
| 1 | First two columns of data | 0.616304 | 0.615561 |
| 2 | 5% of data | 0.934783 | 0.894279 |
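
The hint in the cell above suggests filling `results` with a loop. A minimal sketch of that idea follows, using synthetic data rather than the spam dataset. It also holds out the validation rows before any fitting, which differs from the cells above. The names `cases`, `results_demo`, and the other short variables are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (hypothetical names and sizes)
X_arr, y_arr = make_classification(n_samples=1000, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y_demo = pd.Series(y_arr)

# Hold out validation rows first, then take the 5% sample from the training rows
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_sm, _, y_sm, _ = train_test_split(X_tr, y_tr, train_size=0.05, random_state=0)

# label -> (training features, training target, validation features)
cases = {
    "All training rows": (X_tr, y_tr, X_val),
    "First two columns": (X_tr.iloc[:, :2], y_tr, X_val.iloc[:, :2]),
    "5% of training rows": (X_sm, y_sm, X_val),
}

rows = []
for label, (X_fit, y_fit, X_check) in cases.items():
    clf = LogisticRegression(max_iter=2000).fit(X_fit, y_fit)
    rows.append({
        "Data Size": label,
        "Training Accuracy": clf.score(X_fit, y_fit),
        "Validation Accuracy": clf.score(X_check, y_val),
    })

results_demo = pd.DataFrame(rows)
print(results_demo)
```

Collecting one dictionary per case and building the DataFrame once at the end avoids repeating the scoring calls for each column.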


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1. As more data is used to train the model, the training and validation scores converge. Once enough data is used, adding more is unlikely to raise the validation score much; at that point a different model would be needed to improve it. The models above show this. Model 1 is trained on 100% of the data, and its training and validation scores both come out at 0.932. Model 3 is trained on 5% of the data; its training score is 0.935, but its validation score is only 0.894. This gap of about 0.04 indicates that 5% of the data is not enough for the two scores to converge.

2. The dataset used in this assignment labels each email as spam or not spam. Positive means the email is spam; negative means it is not. A false positive is a legitimate email incorrectly classified as spam. A false negative is a spam email incorrectly classified as not spam. False positives are worse because they can cause an important email to be filed as spam, so the user misses important information.
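
To make the two error types concrete, here is a small sketch that counts them with `confusion_matrix` from scikit-learn. The labels are hand-made for illustration (1 = spam, 0 = not spam) and are not taken from the spam dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (hypothetical): 1 = spam, 0 = not spam
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# For binary labels [0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("False positives (good email flagged as spam):", fp)
print("False negatives (spam delivered to the inbox):", fn)
```

Accuracy alone hides this distinction, since it treats both kinds of error the same.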


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. M. Hopkins, E. Reeber, G. Forman, and J. Suermondt, "Spambase," UCI Machine Learning Repository, June 30, 1999. [Online]. Available: https://archive.ics.uci.edu/dataset/94/spambase

2. I started with data input and imported the dataset into a feature DataFrame and a target vector. I processed the data afterwards, looking for any null values. Then I implemented the machine learning model and trained three identical models on different amounts of data. Afterwards, I validated the models by analysing their training and validation scores. Finally, I visualized the results by creating a DataFrame that shows the training and validation scores for the three models.

3. I did not use any generative AI in this part of the assignment.

4. Yes, I struggled with training a model on only two columns of data. I learned that the feature matrix passed to `score` must have the same number of columns as the one used to train the model.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [7]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X, y = load_concrete()

# TO DO: Print size and type of X and y
print(X.shape)
print(y.shape)

(1030, 8)
(1030,)


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [8]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum())
print('\n')
print(y.isnull().sum())






cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64


0
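
Neither dataset in this assignment has missing values, so no filling was needed. For reference, the sketch below shows what the "fill in if necessary" step could look like on a small made-up frame with gaps. Median imputation is only one possible choice, and `X_demo` and `X_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with gaps (hypothetical values)
X_demo = pd.DataFrame({
    "cement": [540.0, np.nan, 332.5, 198.6],
    "water": [162.0, 162.0, np.nan, 192.0],
})

# Count missing values per column
print(X_demo.isnull().sum())

# Fill each gap with the median of its own column
X_filled = X_demo.fillna(X_demo.median())
print(X_filled)
```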


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [9]:
# TO DO: ADD YOUR CODE HERE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


# Note: for any random state parameters, you can use random_state = 0

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [10]:
# TO DO: ADD YOUR CODE HERE
print('Training score is {:.3f}'.format(model.score(X_train, y_train)))
print('Validation score is {:.3f}'.format(model.score(X_test, y_test)))

Training score is 0.611
Validation score is 0.623
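
The step asks for both mean squared error and R2, while the `score` method of a scikit-learn regressor returns R2 only. As a small worked example, the sketch below computes both metrics from their definitions on made-up numbers and compares them with `mean_squared_error` and `r2_score`. The arrays are hypothetical, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R2: ", r2_manual, "sklearn:", r2_score(y_true, y_pred))
```

MSE is in squared units of the target, so lower is better, while R2 is unitless and higher is better.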


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [11]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

results = pd.DataFrame(columns = ['Error', 'Training Accuracy', 'Validation Accuracy'])
results['Error'] = ['MSE', 'R2']
results.set_index('Error', inplace = True)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

results['Training Accuracy'] = [mean_squared_error(y_train, y_train_pred), r2_score(y_train, y_train_pred)]
results['Validation Accuracy'] = [mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred)]


results

| Error | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE | 111.358439 | 95.904136 |
| R2 | 0.610823 | 0.623414 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

I think the linear model produced reasonable but not strong results. The training R2 score was 0.61 and the validation R2 score was 0.62, so the model explains roughly 60% of the variance in compressive strength. Because the two scores are close (the validation score is even slightly higher), there is no sign of overfitting. A modest score on both sets points instead to underfitting, so a more flexible model might do better. Overall, I think the model produced decent results for such a simple approach.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I-Cheng Yeh, "Concrete Compressive Strength," UCI Machine Learning Repository, August 2, 2007. [Online]. Available: https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength

2. I started with data input and imported the dataset into a feature DataFrame and a target vector. I processed the data afterwards, looking for any null values. Then I implemented the machine learning model, `LinearRegression()`, and trained it on the given dataset. Afterwards, I validated the model by analysing its training and validation scores. Finally, I visualized the results by creating a DataFrame that shows the training and validation values for mean squared error and R2 score.

3. I did not use generative AI in this case.

4. I didn't have any problems with this part of the assignment. What helped me succeed was that the previous part of the assignment had cleared up any doubts I had about the steps for creating a machine learning model.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

In both parts of the assignment, the training and validation scores were nearly equal. Model 1 of Part 1 had a training score of 0.9317 and a validation score of 0.9325. Our model for Part 2 had a training score of 0.61 and a validation score of 0.62. Since overfitting shows up as a training score well above the validation score, neither model shows signs of overfitting. In Part 2 the score is low on both sets, which points to underfitting, so a more flexible model, rather than a less complex one, would be the way to improve it.

The most interesting pattern in this assignment was how the training and validation scores change with the amount of data used. Training on the complete data gives training and validation scores of 0.9317 and 0.9325, respectively. Training on 5% of the data gives 0.9348 and 0.8943, respectively. This shows how using more training points can improve the model as a whole.
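
The effect of training-set size can be examined more systematically with a learning curve. The sketch below is one possible way to do that with scikit-learn's `learning_curve`, run on synthetic data rather than the spam dataset. The chosen fractions and all `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the spam data (hypothetical)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# Fit on growing fractions of the training folds, scoring with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X_demo,
    y_demo,
    train_sizes=[0.05, 0.25, 0.5, 1.0],
    cv=5,
)

# Average over the folds and print one line per training-set size
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} rows: training accuracy {tr:.3f}, validation accuracy {va:.3f}")
```

A shrinking gap between the two columns as the row count grows would match the pattern described above.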

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked that in Part 1 of the assignment we learned how changing the data size can affect the performance of the model. I also liked that we implemented models multiple times, which really hammered home the basics of implementing machine learning. I didn't really dislike anything about this assignment. I think the most interesting part was how the validation score changed with the amount of data. It was confusing and challenging to implement a model with only two columns of data, but that's about it.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [12]:
# TO DO: ADD YOUR CODE HERE

# Implement Machine Learning Model
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

a1 = 1 # This is alpha for ridge
a2 = 10 # This is alpha for lasso
ridge = Ridge(alpha = a1, max_iter = 10000).fit(X_train, y_train)
lasso = Lasso(alpha = a2, max_iter = 10000).fit(X_train, y_train)


print("With a alpha of " + str(a1) + " for ridge model ")
print('Training score for the ridge model is {:.3f}'.format(ridge.score(X_train, y_train)))
print('Validation score for the ridge model is {:.3f}'.format(ridge.score(X_test, y_test)))

print('\n')

print("With a alpha of " + str(a2) + " for lasso model ")
print('Training score for the lasso model is {:.3f}'.format(lasso.score(X_train, y_train)))
print('Validation score for the lasso model is {:.3f}'.format(lasso.score(X_test, y_test)))

print('\n')

results = pd.DataFrame(columns = ['Model Error', 'Training Accuracy', 'Validation Accuracy'])
results['Model Error'] = ['Ridge MSE', 'Ridge R2', 'Lasso MSE', 'Lasso R2']
results.set_index(['Model Error'], inplace = True)

y_ridge_train_pred = ridge.predict(X_train)
y_ridge_test_pred = ridge.predict(X_test)
y_lasso_train_pred = lasso.predict(X_train)
y_lasso_test_pred = lasso.predict(X_test)

results['Training Accuracy'] = [mean_squared_error(y_train, y_ridge_train_pred), r2_score(y_train, y_ridge_train_pred),
                                mean_squared_error(y_train, y_lasso_train_pred), r2_score(y_train, y_lasso_train_pred)]

results['Validation Accuracy'] = [mean_squared_error(y_test, y_ridge_test_pred), r2_score(y_test, y_ridge_test_pred),
                                  mean_squared_error(y_test, y_lasso_test_pred), r2_score(y_test, y_lasso_test_pred)]



results






With a alpha of 1 for ridge model 
Training score for the ridge model is 0.611
Validation score for the ridge model is 0.623


With a alpha of 10 for lasso model 
Training score for the lasso model is 0.604
Validation score for the lasso model is 0.627




| Model Error | Training Accuracy | Validation Accuracy |
|---|---|---|
| Ridge MSE | 111.358439 | 95.904035 |
| Ridge R2 | 0.610823 | 0.623415 |
| Lasso MSE | 113.22078 | 95.048607 |
| Lasso R2 | 0.604314 | 0.626774 |


*ANSWER HERE*

For the ridge model, I tried changing alpha to values between 0.001 and 100, but neither the training score nor the validation score changed. This might be because this model and dataset are not sensitive to regularization, or it could be an error on my part.

For the lasso model, an alpha of 10 gave me the best validation score. I think this is the best score the lasso and ridge models can give by changing only the alpha value, but a different model might give a better result.
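
The alpha values above were tried one at a time. A minimal sketch of sweeping the whole range from 0.001 to 100 on a logarithmic scale is given below, using synthetic data from `make_regression` instead of the concrete dataset. The names `sweep`, `best`, and the `_demo` variables are hypothetical, and choosing alpha by validation R2 is just one simple selection rule.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (hypothetical)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Six values on a log scale: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, 6)

rows = []
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        reg = Model(alpha=alpha, max_iter=10000).fit(X_train_demo, y_train_demo)
        rows.append({
            "Model": name,
            "alpha": alpha,
            "Training R2": reg.score(X_train_demo, y_train_demo),
            "Validation R2": reg.score(X_val_demo, y_val_demo),
        })

sweep = pd.DataFrame(rows)
best = sweep.loc[sweep["Validation R2"].idxmax()]
print(sweep)
print("Best by validation R2:", best["Model"], "with alpha =", best["alpha"])
```

If the ridge scores barely move across the sweep, one possible reason is that the penalty is small relative to the scale of the features; standardizing the features before fitting may make the effect of alpha easier to see.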