# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [2]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X,y = load_spam()



# TO DO: Print size and type of X and y
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))
print("")
print(X.dtypes)
print(y.dtypes)


(4600, 57)
(4600,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               fl

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [3]:
# TO DO: Check if there are any missing values and fill them in if necessary

print(X.isnull().sum().sum())
print(y.isnull().sum())

0
0
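Both counts are 0, so nothing needs to be filled in here. If values were missing, one common choice is median imputation. The following is only a minimal sketch of that idea in plain Python: `fill_missing_with_median` is a hypothetical helper, and it assumes missing entries are marked with `None` (pandas marks them with NaN instead).

```python
from statistics import median


def fill_missing_with_median(column):
    """Return a copy of `column` with None entries replaced by the median
    of the observed (non-missing) values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill_value = median(observed)
    return [fill_value if value is None else value for value in column]


# Toy column with one missing entry (illustrative values, not from the dataset)
filled = fill_missing_with_median([1.0, None, 3.0, 10.0])
print(filled)
```

With a pandas DataFrame the equivalent one-liner would be `X.fillna(X.median())`.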


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [4]:
# TO DO: Create X_small and y_small

from sklearn.model_selection import train_test_split

# Train Test split of the full dataset
X_train, X_val, y_train, y_val = train_test_split(X,y, stratify= y ,random_state=0)

# Train-Test split for the first two columns
X_train_f2, X_val_f2, y_train_f2, y_val_f2 = train_test_split(X.iloc[:,0:2],y,random_state=0)

# Train-Test split for the small set
X_small,_,y_small,_ = train_test_split(X,y, train_size=0.05, random_state=0)
X_small_train,X_small_val,y_small_train,y_small_val = train_test_split(X_small,y_small, random_state=0)
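The full-dataset split above passes `stratify=y`, while the 5% split does not, so the share of spam in `X_small` may drift from the share in the full data. A minimal sketch of what stratified sampling does, in plain Python, is shown below. `stratified_subsample` is a hypothetical helper written for illustration and the labels are toy data; with scikit-learn the same effect would come from passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Pick about `fraction` of the indices from each class so that the
    class proportions of the sample match those of the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))  # keep at least one per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: 40% spam (1), 60% not spam (0)
labels = [1] * 400 + [0] * 600
subset = stratified_subsample(labels, fraction=0.05)
spam_share = sum(labels[i] for i in subset) / len(subset)
print(len(subset), spam_share)
```

The sample keeps the 40/60 class balance of the toy labels, which is the property that matters when only a small fraction of the data is used.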




### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [5]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(max_iter=2000)
model2 = LogisticRegression(max_iter=2000)
model3 = LogisticRegression(max_iter=2000)

model1.fit(X_train,y_train)

model2.fit(X_train_f2,y_train_f2)

model3.fit(X_small_train,y_small_train)

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

In [6]:
print("Training set score for X & y dataset: {:.4f}".format(model1.score(X_train, y_train)))
print("Validation set score for X & y dataset: {:.4f}".format(model1.score(X_val, y_val)))
print("")
print("Training set score for the first two column X & y dataset: {:.4f}".format(model2.score(X_train_f2,y_train_f2)))
print("Validation set score for the first two columns X & y dataset: {:.4f}".format(model2.score(X_val_f2,y_val_f2)))
print("")
print("Training set score for the X small and y small dataset: {:.4f}".format(model3.score(X_small,y_small)))
print("Validation set score for the X small and y small dataset: {:.4f}".format(model3.score(X_small_val,y_small_val)))


Training set score for X & y dataset: 0.9342
Validation set score for X & y dataset: 0.9313

Training set score for the first two column X & y dataset: 0.6084
Validation set score for the first two columns X & y dataset: 0.6130

Training set score for the X small and y small dataset: 0.9348
Validation set score for the X small and y small dataset: 0.9310


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [7]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

results = pd.DataFrame(columns=['Data size', 'Training Accuracy', 'Validation Accuracy'])
results['Data size'] = [X.size, (X.iloc[:,0:2]).size, X_small.size]

results['Training Accuracy'] = [model1.score(X_train, y_train),model2.score(X_train_f2,y_train_f2),model3.score(X_small,y_small) ]
results['Validation Accuracy'] = [model1.score(X_val, y_val),model2.score(X_val_f2,y_val_f2), model3.score(X_small_val,y_small_val)]
print(results)


   Data size  Training Accuracy  Validation Accuracy
0     262200           0.934203             0.931304
1       9200           0.608406             0.613043
2      13110           0.934783             0.931034
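The hint in the cell above suggests filling the results with a loop. One possible shape for that loop is sketched below. Everything here is illustrative: `MajorityClassModel` is a hypothetical stand-in for a fitted classifier (anything with `fit` and `score` methods would work), the data are toy values, and "Data size" counts rows rather than using `X.size`.

```python
class MajorityClassModel:
    """Stand-in classifier: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        # Accuracy of always predicting the majority label
        return sum(1 for label in y if label == self.label_) / len(y)


def collect_results(experiments, make_model):
    """Fit a fresh model per experiment and collect one result row each."""
    rows = []
    for X_train, y_train, X_val, y_val in experiments:
        model = make_model().fit(X_train, y_train)
        rows.append({
            "Data size": len(X_train) + len(X_val),
            "Training Accuracy": model.score(X_train, y_train),
            "Validation Accuracy": model.score(X_val, y_val),
        })
    return rows


# Two toy experiments: (X_train, y_train, X_val, y_val)
experiments = [
    ([[0], [1], [2], [3]], [0, 0, 0, 1], [[4], [5]], [0, 1]),
    ([[0], [1]], [1, 1], [[2], [3]], [1, 0]),
]
rows = collect_results(experiments, MajorityClassModel)
for row in rows:
    print(row)
```

In this notebook, `make_model` could be `lambda: LogisticRegression(max_iter=2000)`, and `pd.DataFrame(rows)` would turn the list of rows into the `results` table.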


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
Answer 1. 

With the full dataset and all features, the training accuracy is 0.934 and the validation accuracy is 0.931. The model has the most samples to train on, and the validation accuracy is very close to the training accuracy, so the model generalizes well. Having all the features also helps the model find the underlying patterns in the data. This is a good model.
When only the first two columns are kept and the rest of the features are dropped, the training accuracy falls to 0.608 and the validation accuracy to 0.613. By dropping most of the features, the model loses several that are important for accuracy. The model becomes too simple and underfits the data, as shown by the low training and validation scores. This is not a good model.
When only 5% of the samples are used while keeping all features, the training accuracy is 0.935 and the validation accuracy is 0.931. Even with a limited number of randomly selected samples, retaining all the features gives good training and validation scores. The results from this model are good.

Answer 2: A false positive is when an email is flagged as spam although it is not spam. A false negative is when an email is spam but is not flagged, so it shows up in the inbox. A false positive is worse because an important email can be flagged as spam, and you will not know about it.
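The two error types in Answer 2 can be counted directly from the true and predicted labels. The sketch below is a plain-Python illustration with toy labels. `confusion_counts` is a hypothetical helper, and it assumes spam is encoded as `1`; scikit-learn offers `sklearn.metrics.confusion_matrix` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as spam."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Flagged as spam: correct if it really is spam, otherwise a false positive
            counts["tp" if actual == positive else "fp"] += 1
        else:
            # Left in the inbox: a false negative if it was actually spam
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Toy labels (1 = spam, 0 = not spam), not taken from the dataset
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Here `fp` counts legitimate emails sent to the spam folder and `fn` counts spam that reached the inbox.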

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
To complete this assignment, I first read through the lecture slides to understand the theory. I then reviewed the example Jupyter notebooks provided on D2L, which is where I sourced the code. Additionally, I used ChatGPT to learn what the commands mean and how to use them.

I used prompts such as "explain mean squared error", "how is it implemented in Python", "what is R2 score and how is it implemented in Python", and "what is the meaning of model fit". The answers helped me understand the concepts and implement them in Python to complete the assignment.

I had challenges understanding what the Python commands were doing and what their inputs and outputs were, for example `train_test_split`. With the help of ChatGPT I was able to get a better understanding of how this command is used and what it means.


Citations:
OpenAI. (2023). ChatGPT API. Retrieved from https://www.openai.com/chatgpt-api
Dawson, Leanne. (2023). ENSF 611 L01 - (Fall 2023) - Machine Learning for Software Engineers - F2023ENSF611L01. In Desire2Learn (Brightspace). https://d2l.ucalgary.ca/d2l/home/543310




## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [8]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y

from yellowbrick.datasets import load_concrete
X,y = load_concrete()

print(X.shape)
print(y.shape)
print(type(X))
print(type(y))
print("")
print(X.dtypes)
print(y.dtypes)

(1030, 8)
(1030,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object
float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [9]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum().sum())
print(y.isnull().sum())

0
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [10]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [11]:
# TO DO: ADD YOUR CODE HERE
print("Mean Squared Training score: {:.2f}".format(mean_squared_error(y_train, lr.predict(X_train))))
print("Mean Squared Validation score: {:.2f}".format(mean_squared_error(y_val, lr.predict(X_val))))
print('')
print("R2 Training score: {:.3f}".format(r2_score(y_train, lr.predict(X_train))))
print("R2 Validation score: {:.3f}".format(r2_score(y_val, lr.predict(X_val))))


Mean Squared Training score: 111.36
Mean Squared Validation score: 95.90

R2 Training score: 0.611
R2 Validation score: 0.623
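To make the two metrics above concrete, here is a minimal sketch of how they are computed, written in plain Python with toy numbers. The helper names `mse` and `r2` are illustrative; in the notebook itself the values come from scikit-learn's `mean_squared_error` and `r2_score`.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R2 score: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets and predictions (not from the concrete dataset)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
toy_mse = mse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(toy_mse, toy_r2)
```

MSE is in squared units of the target, so lower is better, while R2 compares the model against always predicting the mean: 1 is a perfect fit and 0 is no better than the mean.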


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [12]:
# TO DO: ADD YOUR CODE HERE

results = pd.DataFrame(columns = ['Training Accuracy', 'Validation accuracy'], index = ['MSE', 'R2 Score'])

results['Training Accuracy'] = [round(mean_squared_error(y_train, lr.predict(X_train)),2),round(r2_score(y_train, lr.predict(X_train)),2)]
results['Validation accuracy'] = [round(mean_squared_error(y_val, lr.predict(X_val)),2),round(r2_score(y_val, lr.predict(X_val)),2)]

results



          Training Accuracy  Validation accuracy
MSE                  111.36                 95.9
R2 Score               0.61                 0.62


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

Answer: The training MSE is 111.36 and the validation MSE is 95.90. The R2 score is 0.61 for training and 0.62 for validation, both far below the value close to 1 that we are seeking. Based on the R2 scores, the model underfits the data and can be considered too simple. It shows signs of high bias. I believe this model is not the best fit, and its predictions for unseen data can be unreliable. We should try other models, which may be a better fit.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
My process for completing this section of the assignment was very similar. I read through the lecture slides to understand the theory of linear regression, then reviewed the Jupyter notebooks provided on D2L to go through the example. Additionally, I used ChatGPT to help me learn.

I used prompts on ChatGPT to get explanations of mean squared error and R2 score, how they are implemented in Python, and what model fit means. The explanations from ChatGPT helped me understand the concepts and implement them in Python to complete the assignment.

I had challenges understanding what the Python commands were doing and what their inputs and outputs were, for example `r2_score`. With the help of ChatGPT I was able to get a better understanding of how this command is used and what it means.


Citations:
OpenAI. (2023). ChatGPT API. Retrieved from https://www.openai.com/chatgpt-api
Dawson, Leanne. (2023). ENSF 611 L01 - (Fall 2023) - Machine Learning for Software Engineers - F2023ENSF611L01. In Desire2Learn (Brightspace). https://d2l.ucalgary.ca/d2l/home/543310

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
In Part 1, with the entire spam dataset, the training score was 0.9342 and the validation score was 0.9313, so the two scores were very close. When we reduced the feature matrix to only 2 columns, the training score dropped significantly to 0.6084 and the validation score to 0.6130. This is a big drop from the full dataset. With a very small dataset, just 5% of the samples, we got 0.9348 as the training score and 0.9310 as the validation score. Reviewing the results, the validation score was highest with the full spam dataset (0.9313), with the 5% sample almost identical, while dropping features lowered both scores significantly. A large data sample that includes all features helps us discover the underlying trends and patterns in the data.

In Part 2, for the concrete data with linear regression, the model had an R2 training score of 0.61 and an R2 validation score of 0.62. This is a sign that the model is too simple and underfits the data. Considering these training and validation results, the model is not a good fit.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


I liked using the python commands to develop my first machine learning model. It was exciting to work through the process and review the theory about what the results mean. I also now have a better understanding of underlying principles like training and validation data sets.

I found it interesting and challenging to work out how well or how badly our model performs on the given data, and to interpret the application of the machine learning model and assess its suitability for our data. This is an interesting and unique aspect of machine learning that I did not know about in the past.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [13]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
ridge = Ridge(alpha=1).fit(X_train, y_train)
print("Training score: {:.3f}".format(ridge.score(X_train, y_train)))
print("Validation score: {:.3f}".format(ridge.score(X_val, y_val)))
print("")
lasso = Lasso(alpha=1).fit(X_train, y_train)
print("Training score: {:.3f}".format(lasso.score(X_train, y_train)))
print("Validation score: {:.3f}".format(lasso.score(X_val, y_val)))


Training score: 0.611
Validation score: 0.623

Training score: 0.611
Validation score: 0.623


*ANSWER HERE*
The training and validation scores from Lasso and Ridge regression were very similar across the many alpha values I tried between 0.001 and 100 on the logarithmic scale. Therefore, I used alpha = 1.0 for both the Ridge and Lasso models. The resulting scores from both models are not good: the training and validation scores are both low, indicating that the model is underfitting the data. The results are almost the same as with the Linear Regression model in the earlier part. Lasso and Ridge are applied when we want to regularize the model to avoid overfitting, but the Linear Regression model was not overfitting to begin with. Lasso and Ridge are not the best models to apply in this case, and other models should be tried to achieve better training and validation accuracy.
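The alpha sweep described above can be sketched as follows. This is only an outline of the search in plain Python: `log_alpha_grid` and `best_alpha` are hypothetical helpers, and `toy_score` is a made-up scoring function standing in for a model's validation R2.

```python
import math


def log_alpha_grid(low_exp=-3, high_exp=2):
    """Powers of ten from 10**low_exp to 10**high_exp (0.001 to 100 by default)."""
    return [10.0 ** exponent for exponent in range(low_exp, high_exp + 1)]


def best_alpha(alphas, validation_score):
    """Return the alpha with the highest validation score, and that score."""
    best = max(alphas, key=validation_score)
    return best, validation_score(best)


def toy_score(alpha):
    """Made-up score that peaks at alpha = 0.1 (stand-in for a real R2)."""
    return -abs(math.log10(alpha) + 1)


alphas = log_alpha_grid()
chosen_alpha, chosen_score = best_alpha(alphas, toy_score)
print(alphas)
print(chosen_alpha)
```

With scikit-learn, `validation_score` could be `lambda a: Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val)`, and the same for `Lasso`; `np.logspace(-3, 2, 6)` produces the same grid of alpha values.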