# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [3]:
import numpy as np
import pandas as pd
import yellowbrick
from yellowbrick import datasets
from sklearn.model_selection import train_test_split

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [4]:
X, Y = yellowbrick.datasets.loaders.load_spam(data_home=None, return_dataset=False)

print(f"Size of X: {X.shape}, with type {type(X)}")
print(f"Size of y: {Y.shape}, with type {type(Y)}")


# TO DO: Print size and type of X and y

Size of X: (4600, 57), with type <class 'pandas.core.frame.DataFrame'>
Size of y: (4600,), with type <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [5]:
# TO DO: Check if there are any missing values and fill them in if necessary
missingValuesX = X.isna().sum().sum()
missingValuesY = Y.isnull().sum()

print(f"Missing values in x: {missingValuesX}")
print(f"missing values in y: {missingValuesY}")

# No missing values, so nothing needs to be filled in



Missing values in x: 0
missing values in y: 0
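The spam data has no missing values, so the "fill in" branch of this step is never exercised. As a minimal sketch of what it could look like, the example below applies median imputation to a small made-up frame. The column names and values are hypothetical and are not taken from the spam dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; not taken from the spam dataset
toy = pd.DataFrame({
    "feature_a": [0.0, np.nan, 0.4, 0.2],
    "feature_b": [1.0, 3.0, np.nan, 5.0],
})

# Count missing entries per column before deciding how to fill them
missing_per_column = toy.isna().sum()

# Median imputation: less sensitive to outliers than the mean
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(filled)
```

In a full workflow the medians would normally be computed on the training split only and then reused on the validation split, so that no information leaks from validation into training.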


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [6]:
# TO DO: Create X_small and y_small 
XSmall, _, YSmall, _ = train_test_split(X, Y, test_size=0.95, random_state=0)
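As a hedged alternative to the split above, `train_size=0.05` states the 5% intent directly, and `stratify` keeps the class ratio of the sample close to that of the full data. The sketch below uses synthetic stand-in data (made-up values with the same row count as the spam dataset). The `stratify` argument is an optional addition, not something the assignment asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 4600-row spam data (values are made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4600, 3))
y_demo = (rng.random(4600) < 0.4).astype(int)  # roughly 40% positives

# train_size=0.05 keeps 5% of the rows; stratify preserves the class ratio
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0
)

print(X_small.shape)
print(f"positive rate, full: {y_demo.mean():.3f}, sample: {y_small.mean():.3f}")
```

With only 230 rows, an unstratified sample can drift a few percentage points away from the true spam rate, which is one source of the run-to-run variation seen with small datasets.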


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score




### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [8]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5


# Partial dataset: keep only the first two feature columns
XPartial = X.iloc[:, :2]

# Store the datasets in a nested dictionary
# (named datasetDict so it does not shadow the imported yellowbrick `datasets` module)
datasetDict = {
    'fullDataSet': {'X': X, 'Y': Y},
    'smallDataSet': {'X': XSmall, 'Y': YSmall},
    'paritalDataSet': {'X': XPartial, 'Y': Y}
}

results = pd.DataFrame(columns=['Dataset', 'DataSize', 'TrainingAccuracy', 'ValidationAccuracy'])

for datasetName, data in datasetDict.items():
    x = data['X']
    y = data['Y']

    # Split into training and validation sets
    xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=0)

    # Train the model
    model = LogisticRegression(max_iter=2000)
    model.fit(xTrain, yTrain)

    # Predict on both splits and calculate accuracy
    yTrainPred = model.predict(xTrain)
    yTestPred = model.predict(xTest)

    trainAccuracy = accuracy_score(yTrain, yTrainPred)
    validationAccuracy = accuracy_score(yTest, yTestPred)

    newRow = {
        'Dataset': datasetName,
        'DataSize': len(x),
        'TrainingAccuracy': trainAccuracy,
        'ValidationAccuracy': validationAccuracy
    }

    results.loc[len(results)] = newRow

print(results)    





# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

          Dataset  DataSize  TrainingAccuracy  ValidationAccuracy
0     fullDataSet      4600          0.927446            0.936957
1    smallDataSet       230          0.934783            0.913043
2  paritalDataSet      4600          0.614946            0.593478


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

### Answers
1. Taking the full dataset as the baseline (training accuracy 0.927, validation accuracy 0.937), the 5% sample raised training accuracy slightly to 0.935 and lowered validation accuracy to 0.913. The changes are small, but the direction matters: with only 230 emails, the model fits its training rows more closely and generalizes less well, because edge cases and randomness carry more weight. The gap between training and validation accuracy moves from about -0.01 to about +0.02, which suggests we are beginning to overfit. That said, the validation split of the small dataset holds only 46 emails, so a single misclassified email shifts its accuracy by roughly 2 percentage points. The partial dataset shows a far larger drop in both training accuracy (0.615) and validation accuracy (0.593). It has all 4600 rows but only the first two feature columns, so the model lacks the information it needs to predict the result. This makes sense: with less information we make less informed decisions, and a machine learning model is no different.

2. For a spam email dataset, a false positive is a real, legitimate email categorized as spam. A false negative is a spam email categorized as legitimate. In this context, the false positive is worse. Getting spam is something we constantly live with, and it rarely affects our day-to-day. A false positive, however, sends a legitimate and potentially important email to the spam folder, where important information may be disregarded.
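Both answers can be checked with a little arithmetic. The sketch below is illustrative only. The first part uses the accuracies from the printed results table and assumes the 20% hold-out from the training loop. The second part counts false positives and false negatives on made-up labels, under the assumption that 1 means spam and 0 means a legitimate email.

```python
# Part 1: train/validation gap, using the accuracies from the results table
reported = {
    "fullDataSet": {"size": 4600, "train": 0.927446, "val": 0.936957},
    "smallDataSet": {"size": 230, "train": 0.934783, "val": 0.913043},
    "paritalDataSet": {"size": 4600, "train": 0.614946, "val": 0.593478},
}

summary = {}
for name, r in reported.items():
    n_val = r["size"] // 5          # assumes the 20% hold-out used in the loop
    gap = r["train"] - r["val"]     # a positive gap hints at overfitting
    per_email = 1 / n_val           # accuracy change from one misclassified email
    summary[name] = (n_val, gap, per_email)
    print(f"{name}: n_val={n_val}, gap={gap:+.4f}, one email={per_email:.4f}")

# Part 2: false positives vs. false negatives on made-up labels
# Assumption: 1 = spam, 0 = legitimate email
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 1, 0, 1]

# False positive: legitimate email flagged as spam
false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
# False negative: spam email let through as legitimate
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(f"false positives: {false_positives}, false negatives: {false_negatives}")
```

For the small dataset the gap is about the size of one misclassified validation email, so the overfitting signal there is suggestive rather than conclusive.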

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

Most of my code was homebrew, with influence from the past labs, some Google/Stack Overflow, and of course a sprinkle of GPT. I completed the steps from the top down, since they were numbered and each one built on the previous. One thing I did differently was to first complete steps 3-5 the old-fashioned way, with some very WET (repetitive) code, before refactoring it into a loop over the different datasets stored in a dictionary. The first version worked, but repetitive code isn't anyone's idea of a good time.

The GPT prompts were generally "What is the syntax of...", "What is the difference between this and that?", or "Why would I do something this way instead of this other way?" One example was splitting out the data for the 5% sample. I had initially used the pandas `sample` method, but the instructions said to use `train_test_split`. After some quick questions to GPT, I understood that `train_test_split` is supposedly a more refined approach that gives a less index-biased view of the data. I did have some challenges, but I was eventually able to track them down to operator error: largely my inability to spell, combined with Python's lack of static typing. This was one of the things GPT helped me with, by letting me know that Accuracy isn't spelled "Acuuracy".

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [9]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

X, Y = yellowbrick.datasets.loaders.load_concrete(data_home=None, return_dataset=False)

print(f"Size of X: {X.shape}, with type {type(X)}")
print(f"Size of y: {Y.shape}, with type {type(Y)}")

Y.var()


Size of X: (1030, 8), with type <class 'pandas.core.frame.DataFrame'>
Size of y: (1030,), with type <class 'pandas.core.series.Series'>


279.0797166936134

### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [10]:
# TO DO: Check if there are any missing values and fill them in if necessary

missingValuesX = X.isna().sum().sum()
missingValuesY = Y.isnull().sum()

print(f"Missing values in x: {missingValuesX}")
print(f"missing values in y: {missingValuesY}")

Missing values in x: 0
missing values in y: 0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [11]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

# Split into training and validation sets
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.2, random_state=0)

# Train the model
model = LinearRegression()
model.fit(xTrain, yTrain)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [13]:
# TO DO: ADD YOUR CODE HERE

yTestPred = model.predict(xTest)
yTrainPred = model.predict(xTrain)



mseTest = mean_squared_error(yTest, yTestPred)
mseTrain = mean_squared_error(yTrain, yTrainPred)

r2Test = r2_score(yTest, yTestPred)
r2Train = r2_score(yTrain, yTrainPred)
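To make the two metrics concrete, here is a minimal hand computation of MSE and R2 on made-up numbers (the values are illustrative and are not from the concrete dataset). It follows the standard definitions: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares and total sum of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n            # average squared error, in squared target units
r2 = 1 - ss_res / ss_tot    # share of the target's variance the model explains

print(f"MSE: {mse:.3f}, R2: {r2:.3f}")
```

Because MSE is an error, lower is better, while for R2 higher is better. This is worth keeping in mind when both end up in a table whose columns are labelled "accuracy".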

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [16]:
# TO DO: ADD YOUR CODE HERE



results = pd.DataFrame({
    "TrainingAccuracy": [mseTrain, r2Train],
    "ValidationAccuracy": [mseTest, r2Test]
}, index=['MSE', 'R2'])

print(results)

     TrainingAccuracy  ValidationAccuracy
MSE        110.345501           95.635335
R2           0.609071            0.636898


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

I don't think so. Unfortunately for this model, I come from a civil engineering background and know that concrete is fairly predictable: humanity has been using it in some form for hundreds of years, so it is a science we are quite familiar with. An R2 of ~0.6 therefore isn't a particularly strong number for something like this. Additionally, the validation MSE of ~95.6 corresponds to a typical error of about 10 MPa. Misestimating the compressive strength of concrete by that much can be catastrophic in construction, so I don't think this is a model we should put any faith in.
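The "about 10 MPa" figure follows from taking the square root of the MSE, which converts it back to the units of the target. The short sketch below does that arithmetic with the values printed earlier in this part, assuming (as the answer above does) that strength is measured in MPa.

```python
import math

# Values copied from the printed results and from Y.var() above
mse_train = 110.345501
mse_val = 95.635335
target_variance = 279.0797166936134

# Square roots bring the numbers back to MPa rather than MPa squared
rmse_train = math.sqrt(mse_train)
rmse_val = math.sqrt(mse_val)
target_std = math.sqrt(target_variance)

print(f"RMSE train: {rmse_train:.2f} MPa")
print(f"RMSE validation: {rmse_val:.2f} MPa")
print(f"Target standard deviation: {target_std:.2f} MPa")
```

Comparing the RMSE with the standard deviation of the target gives a sense of scale: the linear model reduces the typical error relative to always predicting the mean strength, but a large share of the spread remains unexplained.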

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

Much of the code came from the first part of the assignment, adapted for the different context and model type. Some of it came from the ol' Google/SO/GPT, and some of it was homebrew. Again, I completed the steps in the order they were described, since each one depends on the one before. The one step I looked into in more detail was the data itself, to better understand what the model was trying to estimate and what information it had to do it with. Given my background, I knew that this was not a linear relationship, so my expectations for a linear model were very limited.

Any code from generative sources was modified to fit the context and the solution required.

One of the only challenges I had was understanding how the validation metrics differ between the two model types: why accuracy suits classification, why MSE and R2 suit regression, and why each would not work for the other. I understood this better after some googling, YouTube, and light reading.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

This is a really good assignment for seeing how validation results differ when a model fits the data versus when it is simply not applicable. Spam detection is a great fit for a classification model, whereas concrete compressive strength is not a linear relationship. In my opinion, when accuracy is greater than 90% and the training and validation accuracies are close (0.927 and 0.937 on the full spam dataset), so that we are not overfitting, it is fair to conclude that the model is appropriate for the data.

In Part 1, the two-column dataset reached only ~60% accuracy. That may look like an improper model, but it was really a case of too little information for the model to make representative and accurate predictions. In Part 2, even though we had what seemed to be more than enough data, we only reached ~0.6 on the R2 (0.609 training, 0.637 validation), with errors of around 10 MPa. As mentioned before, that is far too much in the concrete world: to give perspective, there are concretes whose strength does not even reach 10 MPa. This is simply a case of the wrong model for the application.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked that this was the first assignment where we had to do our own programming. The labs are good, but often too rushed: they are really just an exercise in "write this code", without enough time to soak in the information and learn from it. I have learned more from this assignment than from all the labs combined, and was happy to do it, even though it took me a while to complete because I had to reference many different sources to get the knowledge. However, I understand that is just me, and everyone learns differently. I find that self-learning and practical application are the best ways to learn.

I think it's awesome how approachable and easy it is to apply machine learning systems. However, I also understand the dark side of machine learning: most of the work is cleaning the data, and these beautiful yellowbrick datasets aren't something we would likely ever see in the real world. I would like to see more examples that teach helpful approaches to data cleaning. Even though it is everyone's least favorite task, I'd like to understand how to do it more effectively and efficiently.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
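One possible shape for this sweep is sketched below. It is only an illustration and runs on synthetic data from `make_regression` (made-up values with the same number of rows and features as the concrete dataset), so the scores it prints say nothing about the real data. The variable names and the choice of six alpha values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data: 1030 rows, 8 features (made up)
X_demo, y_demo = make_regression(
    n_samples=1030, n_features=8, noise=10.0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Six log-spaced values: 0.001, 0.01, 0.1, 1, 10, 100
alphas = np.logspace(-3, 2, 6)

scores = []
for model_class in (Ridge, Lasso):
    for alpha in alphas:
        model = model_class(alpha=alpha, max_iter=10000)
        model.fit(X_train, y_train)
        val_r2 = r2_score(y_val, model.predict(X_val))
        scores.append((model_class.__name__, float(alpha), val_r2))

# Pick the (method, alpha) pair with the highest validation R2
best = max(scores, key=lambda row: row[2])
print(best)
```

Regularization penalizes coefficient size, so features are often standardized before fitting Ridge or Lasso. That step is omitted here to keep the sketch short.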

In [None]:
# TO DO: ADD YOUR CODE HERE

*ANSWER HERE*