<font size="+3"><b>Linear Models and Validation Metrics</b></font>

<font color='Blue'>In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.</font>

# **Part 1: Classification (14.5 marks total)**

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:


## **Step 0:** Import Libraries

In [None]:
import numpy as np
import pandas as pd

## **Step 1:** Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [None]:
# Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

# Load spam dataset into the feature matrix X and target vector y
X, y = load_spam()

# Print size and type of X and y
print("Size of X:", X.size)
print("Type of X:", type(X))
print("Size of y:", y.size)
print("Type of y:", type(y))

Size of X: 262200
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 4600
Type of y: <class 'pandas.core.series.Series'>
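
Note that `.size` on a DataFrame counts elements (rows times columns), not rows. A minimal sketch on a made-up frame shows how `.size` relates to `.shape`; the name `toy` and its values are illustrative only and are not part of the spam data:

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 feature columns
toy = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 2.0, 3.0]})

n_rows, n_cols = toy.shape
print("shape:", toy.shape)          # (rows, columns)
print("size:", toy.size)            # total number of elements
print("rows * columns:", n_rows * n_cols)
```

By the same arithmetic, a size of 262200 for `X` together with 4600 entries in `y` implies 262200 / 4600 = 57 feature columns.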


## **Step 2:** Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [None]:
# Check if there are any missing values and fill them in if necessary
missing_values = X.isnull().any()
print(missing_values)

# All values are False (no values are missing from any column) so no method to fill-in is necessary
# If a number was missing, it could be filled with 0 using the line below
# X.fillna(0, inplace = True)

word_freq_make                False
word_freq_address             False
word_freq_all                 False
word_freq_3d                  False
word_freq_our                 False
word_freq_over                False
word_freq_remove              False
word_freq_internet            False
word_freq_order               False
word_freq_mail                False
word_freq_receive             False
word_freq_will                False
word_freq_people              False
word_freq_report              False
word_freq_addresses           False
word_freq_free                False
word_freq_business            False
word_freq_email               False
word_freq_you                 False
word_freq_credit              False
word_freq_your                False
word_freq_font                False
word_freq_000                 False
word_freq_money               False
word_freq_hp                  False
word_freq_hpl                 False
word_freq_george              False
word_freq_650               
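
With many columns, the per-column printout gets long. A more compact check, sketched here on a small made-up frame (`demo` is hypothetical, not the spam data), is to count the missing entries per column and in total:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column "a"
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Number of missing entries in each column
missing_per_column = demo.isnull().sum()

# Single number summarizing the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero would confirm in one line that no filling is needed.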

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [None]:
from sklearn.model_selection import train_test_split
# Create X_small and y_small
# Since the order is usually X_train, X_test, y_train, y_test, the line below would treat X_small and y_small as "test" sets with test_size = 0.05
_ , X_small, _ , y_small = train_test_split(X, y, test_size = 0.05, random_state = 42)
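
As a sanity check on the split above, the following sketch repeats it on a synthetic array with the same dimensions as the spam data (4600 rows and 57 columns, inferred from the sizes printed in Step 1). The names `X_demo` and `y_demo` and their contents are placeholders, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as the spam feature matrix
X_demo = np.zeros((4600, 57))
y_demo = np.arange(4600) % 2

# Keep only the "test" portion, which holds 5% of the rows
_, X_small_demo, _, y_small_demo = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=42
)

print("Rows kept:", X_small_demo.shape[0])
print("Elements kept:", X_small_demo.size)
```

The 230 rows kept here correspond to a size of 230 x 57 = 13110 elements.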

## **Step 3:** Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets:
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

## **Step 4:** Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

## **Step 5:** Visualize Results (4 marks for steps 3-5)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [None]:
# ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_model = LogisticRegression(max_iter = 2000, random_state = 0)

# Created dictionary with dataset information
datasets = {"Full": (X, y), "First Two Columns": (X.iloc[:, :2], y), "Small" : (X_small, y_small)}

# Steps used for all 3 datasets: Split into train and test sets, fit the model, then calculate train and validation accuracies
# Used hint to create loop to do tasks

results = []
for name, (X_set, y_set) in datasets.items():
  # Split into train and test sets
  X_train, X_test, y_train, y_test = train_test_split(X_set, y_set, test_size = 0.2, random_state = 0)

  # Fit the model
  logistic_model.fit(X_train, y_train)

  # Calculate training accuracy
  t_prediction = logistic_model.predict(X_train)
  t_accuracy = accuracy_score(y_train, t_prediction)

  # Calculate validation accuracy
  v_prediction = logistic_model.predict(X_test)
  v_accuracy = accuracy_score(y_test, v_prediction)

  # Calculate data size
  data_size = X_set.size

  # Store results
  results.append({
        'Data Size': data_size,
        'Training Accuracy': t_accuracy,
        'Validation Accuracy': v_accuracy
    })

# Create dataframe of results
results_df = pd.DataFrame(results)

print(results_df)


   Data Size  Training Accuracy  Validation Accuracy
0     262200           0.926902             0.936957
1       9200           0.614946             0.593478
2      13110           0.940217             0.847826
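
Since the data size above counts elements rather than rows, it can help to convert it to a row count and to compute the gap between training and validation accuracy directly. The sketch below re-enters the printed numbers by hand; the feature counts (57, 2, 57) are inferred from the sizes, and the names `summary`, `Rows`, and `Gap` are illustrative:

```python
import pandas as pd

# Values copied from the printed results table
summary = pd.DataFrame({
    "Dataset": ["Full", "First Two Columns", "Small"],
    "Data Size": [262200, 9200, 13110],
    "Features": [57, 2, 57],  # inferred, not printed by the notebook
    "Training Accuracy": [0.926902, 0.614946, 0.940217],
    "Validation Accuracy": [0.936957, 0.593478, 0.847826],
})

# Number of samples = elements / features
summary["Rows"] = summary["Data Size"] // summary["Features"]

# A large positive gap suggests overfitting
summary["Gap"] = summary["Training Accuracy"] - summary["Validation Accuracy"]

print(summary[["Dataset", "Rows", "Features", "Gap"]])
```

Read this way, the first two datasets use all 4600 samples, while the small dataset uses 230 samples and shows the largest gap.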


## **Questions (4 marks)**
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

<font color='Green'>
  <b>
    1. The full dataset shows a high training accuracy (~93%) along with the highest validation accuracy (~94%) out of the three datasets, which suggests that a larger data size leads to a higher validation accuracy.  <br>
    In the second situation where only the first two columns of data were used and the data size is the smallest, there is clearly not enough information about the different features of each email to make accurate predictions, resulting in the lowest training (~61%) and validation accuracy (~59%) metrics in the table. Also, the second model does almost equally well on its training and validation sets, which implies a high bias.  <br>
    Although the last dataset contains only 5% of the data, all columns are used, which gives more information about the features of each email and a high training accuracy (~94%); however, the dataset is still quite small because only a limited number of email examples are used, leading to a much lower validation accuracy (~85%). In addition, the fact that the third model did considerably worse on the validation set than on the training set indicates higher variance and overfitting. <br> <br>
  </b>
  <b>
    2. In this dataset which classifies emails as spam or non-spam, a false positive would represent a non-spam email that has been misclassified as spam, while a false negative would represent a spam email that has been misclassified as non-spam. I believe that false positives would be worse in spam detection because important emails that individuals are waiting for could be incorrectly classified as spam, which could be more impactful than simply letting spam emails into one's mailbox (false negatives).
  </b>
</font>
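
The false positive and false negative counts discussed above can be read off a confusion matrix. This is a minimal sketch on made-up labels, assuming that 1 means spam and 0 means non-spam (an assumption about the encoding, not something verified against the dataset):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = spam, 0 = non-spam (assumed encoding)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (non-spam flagged as spam):", fp)
print("False negatives (spam let through):", fn)
```

If false positives are considered the more costly error, the `fp` count is the one to watch when comparing models.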

## **Process Description (4 marks)**
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

<font color='Green'><b>DESCRIBE YOUR PROCESS HERE</b></font>

> I referenced a few sources to come up with my code including the link provided to be able to use the load_spam() function, as well as notes given in labs 1, 2, and 3 to be able to import the appropriate libraries and use the train_test_split() and accuracy_score() functions. I also made sure to properly read the instructions given to create the LogisticRegression model. The isnull().any() function was referenced from the lab, and I remembered type() and size from past Python classes.

> I completed the steps in the order provided because I believed it made the most sense.

> I used generative AI, specifically ChatGPT, to help me fix an error when I was running my code after step 5, since I had written X[:,:2] instead of X.iloc[:,:2] to get the first two columns of the dataset. I simply prompted it to explain the error message to me, and I made the appropriate minor modifications to the code.

> One challenge I faced was initially trying to find out how to calculate my training and validation accuracies; however, I looked through the lab 3 file and found the accuracy_score function which made it more clear. Overall, referencing the labs and reading the instructions helped me be successful.
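
The indexing error mentioned above can be reproduced on a small made-up frame. This sketch only illustrates the difference between positional indexing with `.iloc` and plain bracket indexing; the name `frame` is hypothetical, and the exact exception type may vary between pandas versions:

```python
import pandas as pd

# Hypothetical frame with three columns
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Positional selection of the first two columns works with .iloc
first_two = frame.iloc[:, :2]
print(first_two.columns.tolist())

# Plain brackets expect column labels, so NumPy-style slicing fails
raised = False
try:
    frame[:, :2]
except Exception as err:
    raised = True
    print("frame[:, :2] failed with", type(err).__name__)
```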



# **Part 2: Regression (10.5 marks total)**

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

## **Step 1:** Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [None]:
# Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

# Load concrete dataset into the feature matrix X and target vector y
X, y = load_concrete()

# Print size and type of X and y
print("Size of X:", X.size)
print("Type of X:", type(X))
print("Size of y:", y.size)
print("Type of y:", type(y))

Size of X: 8240
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: 1030
Type of y: <class 'pandas.core.series.Series'>


## **Step 2:** Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [None]:
# Check if there are any missing values and fill them in if necessary
missing_values = X.isnull().any()
print(missing_values)

# All values are False (no values are missing from any column) so no method to fill-in is necessary
# If a number was missing, it could be filled with the mean value using the line below
# X.fillna(X.mean(), inplace = True)

cement    False
slag      False
ash       False
water     False
splast    False
coarse    False
fine      False
age       False
dtype: bool
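
Although no values are missing here, the mean-fill option mentioned in the code comment can be sketched on a made-up frame. The column names are borrowed from the concrete data, but the values and the name `demo` are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing cement value
demo = pd.DataFrame({
    "cement": [100.0, np.nan, 300.0],
    "water": [150.0, 160.0, 170.0],
})

# Replace each missing entry with the mean of its own column
filled = demo.fillna(demo.mean())

print(filled)
```

The column mean ignores missing entries by default, so the gap in `cement` is filled with the mean of the two observed values.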


## **Step 3:** Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [None]:
# ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression

# Instantiate model
linear_model = LinearRegression()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Fit the model
linear_model.fit(X_train, y_train)

## **Step 4:** Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [None]:
# ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

# Calculate training mean squared error and R2 score
t_prediction = linear_model.predict(X_train)
t_mse = mean_squared_error(y_train, t_prediction)
t_r2 = r2_score(y_train, t_prediction)

# Calculate validation mean squared error and R2 score
v_prediction = linear_model.predict(X_test)
v_mse = mean_squared_error(y_test, v_prediction)
v_r2 = r2_score(y_test, v_prediction)
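
To show what the two metrics measure, the sketch below computes them by hand on a few made-up values and compares the results with the scikit-learn functions. The arrays are invented for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# MSE: average squared residual
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2:", r2_manual, r2_score(y_true, y_pred))
```

An R2 of 1 means the predictions match exactly, while an R2 of 0 means the model does no better than always predicting the mean.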

## **Step 5:** Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [None]:
# ADD YOUR CODE HERE
# Create results dataframe
results = pd.DataFrame({
    'Training accuracy': [t_mse, t_r2],
    'Validation accuracy': [v_mse, v_r2]
}, index = ['MSE', 'R2 score'])

results

          Training accuracy  Validation accuracy
MSE              110.345501            95.635335
R2 score           0.609071             0.636898


## **Questions (2 marks)**
1. Did using a linear model produce good results for this dataset? Why or why not?

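No, the linear model did not produce good results for this dataset. The MSE values are quite high (about 110 for training and about 96 for validation, which corresponds to a root-mean-squared error of roughly 10 in the units of compressive strength), indicating a large deviation between the actual and predicted values. Also, R2 scores are better the closer they are to 1, and the values calculated here (about 0.61 for training and about 0.64 for validation) show that the model leaves more than a third of the variance in the data unexplained.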
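
Because MSE is expressed in squared units, taking its square root (RMSE) gives an error in the same units as the target, which is easier to judge. A short sketch using the MSE values reported in the results table:

```python
import math

# MSE values copied from the results table
train_mse = 110.345501
val_mse = 95.635335

# RMSE is in the same units as the compressive strength
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"Training RMSE: {train_rmse:.1f}")
print(f"Validation RMSE: {val_rmse:.1f}")
```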

## **Process Description (4 marks)**
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

<font color='Green'><b>Explain YOUR PROCESS here:</b></font>

> I referenced a few sources when coming up with my code including the link provided to be able to use the load_concrete() function, as well as notes given in labs 1, 2, and 3 to be able to import the appropriate libraries and use the train_test_split() and accuracy_score() functions. I also made sure to properly read the instructions given to create the LinearRegression model. The isnull().any() function was referenced from the lab, and I remembered type() and size from past Python classes. The mean_squared_error function was provided in lab 3; however, I utilized scikit learn documentation to verify my use of the r2_score function (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).

> I completed the steps in the order provided because I believed it made the most sense.

> I used generative AI, specifically ChatGPT, to help me understand how to format my results DataFrame with indices since I had not done that in the previous part of this assignment, and I did not remember how to do it. I prompted it by asking how to specify indices when creating a DataFrame object in Python, and I modified the code to fit my specific requirements.

> One challenge I faced was initially trying to find out how to calculate the mean_squared_error and r2_score values; however, I looked through the lab 3 file and found the mean_squared_error function and then I looked up the r2_score function to find out how to use them properly. Overall, referencing the labs and reading the instructions helped me be successful.

# **Part 3: Observations/Interpretation (3 marks)**

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


<font color='Green'><b>
  The main observation I had from Part 1 was the impact of data size on the accuracy scores. During lectures, we discussed that a logistic model is a type of linear model used when the output value is binary. The findings in this assignment support the idea that linear models scale to very large datasets, as shown by the high accuracies when the full dataset was used (about 93% for the training set and about 94% for the validation set). In addition, the results suggest that linear models perform well when the number of features is large compared to the number of samples: the accuracies were much higher in the third case, where only 5% of the samples were used and all the features were accounted for (about 94% for the training set and about 85% for the validation set), than in the second case, where only two features were used with the total number of samples (about 61% for the training set and about 59% for the validation set). <br> <br>

  My main observation from the linear model in Part 2 was that linear models can be much less effective with smaller amounts of data. The full data size of the concrete set (8240) is much smaller than the data size from Part 1 of the spam data (262200), and the MSE and R2 scores calculated earlier (MSE values of ~110 for training and ~96 for validation with R2 scores of ~0.61 and ~0.64 respectively) clearly indicate that the linear model proved to be less effective for this set of data. These results support this weakness of linear models that was presented in class.
</b></font>

# **Part 4: Reflection (2 marks)**
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


<font color='Green'><b>
  I really liked being able to calculate and compare different regression metrics for real-life datasets in order to better understand which characteristics of a dataset are most important when trying to create a model that is best suited for it. I found it interesting that the number of features plays a bigger role in creating a more accurate model than the number of samples, but it was challenging at times to figure out which Python functions would show me the metrics I needed.
</b></font>

# **Part 5: Bonus Question (4 marks)**

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
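
One way to generate such a grid, instead of typing the values by hand, is `numpy.logspace`. This is a small sketch; the variable name `alpha_grid` is illustrative:

```python
import numpy as np

# Six values evenly spaced on a log scale from 10**-3 to 10**2
alpha_grid = np.logspace(-3, 2, num=6)

print(alpha_grid)
```

Increasing `num` would give a finer grid over the same range.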

In [None]:
# ADD YOUR CODE HERE

# Repeated from Part 2 to ensure X and y are properly defined
from yellowbrick.datasets import load_concrete
X, y = load_concrete()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# First, importing Ridge and Lasso Regression tools
from sklearn.linear_model import Ridge, Lasso

# Created a list of alpha values from 0.001 to 100 along the logarithmic scale
alpha_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Used Ridge Regression and calculated R2 score for different alpha values

ridge_results = []

for alpha in alpha_values:
  ridge_model = Ridge(alpha = alpha)
  ridge_model.fit(X_train, y_train)

  t_prediction = ridge_model.predict(X_train)
  t_r2 = r2_score(y_train, t_prediction)
  t_mse = mean_squared_error(y_train, t_prediction)

  v_prediction = ridge_model.predict(X_test)
  v_r2 = r2_score(y_test, v_prediction)
  v_mse = mean_squared_error(y_test, v_prediction)

  ridge_results.append((alpha, t_r2, v_r2, t_mse, v_mse))

ridge_data = pd.DataFrame(ridge_results, columns = ['Alpha', 'Training R2', 'Validation R2', 'Training MSE', 'Validation MSE'])
print(ridge_data)
print("\n")
# Found alpha value with the highest validation R2 score
best_ridge = ridge_data.loc[ridge_data['Validation R2'].idxmax()]

print("Best Ridge R^2 Score:")
print(best_ridge)
print("\n")
print("\n")

# Used Lasso Regression and calculated R2 score for different alpha values

lasso_results = []

for alpha in alpha_values:
  lasso_model = Lasso(alpha = alpha)
  lasso_model.fit(X_train, y_train)

  t_prediction = lasso_model.predict(X_train)
  t_r2 = r2_score(y_train, t_prediction)
  t_mse = mean_squared_error(y_train, t_prediction)

  v_prediction = lasso_model.predict(X_test)
  v_r2 = r2_score(y_test, v_prediction)
  v_mse = mean_squared_error(y_test, v_prediction)

  lasso_results.append((alpha, t_r2, v_r2, t_mse, v_mse))

lasso_data = pd.DataFrame(lasso_results, columns = ['Alpha', 'Training R2', 'Validation R2','Training MSE', 'Validation MSE'])
print(lasso_data)
print("\n")
# Found alpha value with the highest validation R2 score
best_lasso = lasso_data.loc[lasso_data['Validation R2'].idxmax()]

print("Best Lasso R^2 Score:")
print(best_lasso)


     Alpha  Training R2  Validation R2  Training MSE  Validation MSE
0    0.001     0.609071       0.636898    110.345501       95.635335
1    0.010     0.609071       0.636898    110.345501       95.635334
2    0.100     0.609071       0.636898    110.345501       95.635324
3    1.000     0.609071       0.636899    110.345501       95.635231
4   10.000     0.609071       0.636902    110.345502       95.634301
5  100.000     0.609071       0.636937    110.345597       95.625173


Best Ridge R^2 Score:
Alpha             100.000000
Training R2         0.609071
Validation R2       0.636937
Training MSE      110.345597
Validation MSE     95.625173
Name: 5, dtype: float64




     Alpha  Training R2  Validation R2  Training MSE  Validation MSE
0    0.001     0.609071       0.636899    110.345501       95.634971
1    0.010     0.609071       0.636912    110.345507       95.631698
2    0.100     0.609069       0.637034    110.346120       95.599545
3    1.000     0.608852       0.638035    11

<font color='Green'><b>
  The validation R2 score was improved in both cases compared to the linear model score calculated in part 2 (0.636898), but the Lasso regression model gave the best R2 validation score of 0.638874, with an alpha value of 10. This indicates that the Lasso model was able to more accurately predict unseen data in this set; however, the Ridge R2 score was higher for the training set, giving the same result as the linear model (0.609071), so Ridge might be fitting the data a bit more closely.
  <br>
  Ultimately, the Lasso validation R2 score of 0.638874 is slightly higher than that of the other models, but all of the scores are in the range of 0.63-0.64, which indicates that about 63%-64% of the variation in the data is explained by these models. Whether or not this is good enough depends on the baseline performance required for the purpose the model is being used for. The yellowbrick documentation describes concrete as the most important material in civil engineering, and the dataset being used predicts its compressive strength, which could be vital for construction purposes; therefore, I would guess that the R2 score is not good enough, especially because it is not very close to 1, even though it ultimately depends on what the experts consider "good enough" to be.
</b></font>
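
The Ridge and Lasso sweeps in this part share the same loop, so they could be folded into one helper. The sketch below is one possible way to do that, run on synthetic data from `make_regression` rather than the concrete dataset; the function name `sweep_alphas` and the demo variables are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def sweep_alphas(model_class, alphas, X_train, X_test, y_train, y_test):
    """Fit model_class once per alpha and collect training/validation R2."""
    rows = []
    for alpha in alphas:
        model = model_class(alpha=alpha).fit(X_train, y_train)
        rows.append({
            "Model": model_class.__name__,
            "Alpha": alpha,
            "Training R2": r2_score(y_train, model.predict(X_train)),
            "Validation R2": r2_score(y_test, model.predict(X_test)),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in data (not the concrete dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
splits = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
sweep = pd.concat(
    [sweep_alphas(Ridge, alphas, *splits), sweep_alphas(Lasso, alphas, *splits)],
    ignore_index=True,
)

# Row with the highest validation R2 across both models
best = sweep.loc[sweep["Validation R2"].idxmax()]
print(sweep)
print(best)
```

Collecting both models in one table makes it possible to pick the best method and alpha with a single `idxmax` call.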