# Assignment 3: Non-Linear Models and Validation Metrics (38 total marks)
### Due: March 5 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses non-linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

### Import Libraries

In [150]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part 1: Regression (14.5 marks)

For this section, we will continue with the concrete example from yellowbrick. You will need to compare these results to the results from the previous assignment. Please use the results from the solution if you were unable to complete Assignment 2.

### Step 1: Data Input (0.5 marks)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

In [151]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick import datasets

X, y = datasets.load_concrete()

### Step 2: Data Processing (0 marks)

Data processing was completed in the previous assignment. No need to repeat here.

### Step 3: Implement Machine Learning Model

1. Import the Decision Tree, Random Forest and Gradient Boosting Machines regression models from sklearn
2. Instantiate the three models with `max_depth = 5`. Are there any other parameters that you will need to set?
3. Implement each machine learning model with `X` and `y`

In [152]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

dt = DecisionTreeRegressor(max_depth=5, random_state=0)
rf = RandomForestRegressor(max_depth=5, random_state=0)
gb = GradientBoostingRegressor(max_depth=5, random_state=0)

dt.fit(X,y)

rf.fit(X,y)

gb.fit(X,y)


### Step 4: Validate Model

Calculate the average training and validation accuracy using mean squared error with cross-validation. To do this, you will need to set `scoring='neg_mean_squared_error'` in your `cross_validate` function and negate the results (multiply by -1)

In [153]:
from sklearn.model_selection import cross_validate

def get_accuracy(model, X, y, scoring):
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring, return_train_score=True)
    train_mse = -1 * cv_results['train_score'].mean()
    validation_mse = -1 * cv_results['test_score'].mean()
    return train_mse, validation_mse

dt_train_mse, dt_validation_mse = get_accuracy(dt, X, y, 'neg_mean_squared_error')
print(f"DT Regressor - Train MSE: {dt_train_mse}, Validation MSE: {dt_validation_mse}")

rf_train_mse, rf_validation_mse = get_accuracy(rf, X, y, 'neg_mean_squared_error')
print(f"RF Regressor - Train MSE: {rf_train_mse}, Validation MSE: {rf_validation_mse}")

gb_train_mse, gb_validation_mse = get_accuracy(gb, X, y, 'neg_mean_squared_error')
print(f"GB Regressor - Train MSE: {gb_train_mse}, Validation MSE: {gb_validation_mse}")


DT Regressor - Train MSE: 47.91856102734339, Validation MSE: 163.08777547307804
RF Regressor - Train MSE: 32.056464386029816, Validation MSE: 156.251425313443
GB Regressor - Train MSE: 3.7392700109420987, Validation MSE: 99.2245764199326


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: DT, RF and GB
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [154]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
results = pd.DataFrame(columns=["Model", "Training Accuracy", "Validation Accuracy"])

results.loc[len(results)]= {"Model": 'DT', "Training Accuracy": dt_train_mse, "Validation Accuracy": dt_validation_mse}
results.loc[len(results)]= {"Model": 'RF', "Training Accuracy": rf_train_mse, "Validation Accuracy": rf_validation_mse}
results.loc[len(results)]= {"Model": 'GB', "Training Accuracy": gb_train_mse, "Validation Accuracy": gb_validation_mse}
print(results)

  Model  Training Accuracy  Validation Accuracy
0    DT          47.918561           163.087775
1    RF          32.056464           156.251425
2    GB           3.739270            99.224576
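The hint about using a loop can be sketched as follows. This is a minimal illustration rather than the required solution: it assumes synthetic data from `make_regression` (so that it runs without yellowbrick), and the names `X_demo`, `y_demo`, `models`, `rows` and `results_demo` are hypothetical. It also labels the columns as MSE and uses the model names as the index, as Step 5 asks.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used only so the sketch runs on its own).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

models = {
    "DT": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RF": RandomForestRegressor(max_depth=5, random_state=0),
    "GB": GradientBoostingRegressor(max_depth=5, random_state=0),
}

rows = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5,
                        scoring="neg_mean_squared_error", return_train_score=True)
    # The scorer returns negated MSE, so flip the sign to get the MSE itself.
    rows[name] = {
        "Training MSE": -cv["train_score"].mean(),
        "Validation MSE": -cv["test_score"].mean(),
    }

# One row per model, with the model name as the index.
results_demo = pd.DataFrame.from_dict(rows, orient="index")
print(results_demo)
```

Collecting the rows in a dictionary and building the DataFrame once avoids repeating a `results.loc[...]` line for every model.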


Repeat the step above to print the R2 score instead of the mean-squared error. For this case, you can use `scoring='r2'`

In [155]:
# TO DO: ADD YOUR CODE HERE
# get_accuracy() multiplies the scores by -1, which is only correct for
# 'neg_mean_squared_error'. R2 must be reported as-is, so use a separate helper.
def get_r2(model, X, y):
    cv_results = cross_validate(model, X, y, cv=5, scoring='r2', return_train_score=True)
    return cv_results['train_score'].mean(), cv_results['test_score'].mean()

dt_train_r2, dt_validation_r2 = get_r2(dt, X, y)
print(f"DT Regressor - Train r2: {dt_train_r2}, Validation r2: {dt_validation_r2}")

rf_train_r2, rf_validation_r2 = get_r2(rf, X, y)
print(f"RF Regressor - Train r2: {rf_train_r2}, Validation r2: {rf_validation_r2}")

gb_train_r2, gb_validation_r2 = get_r2(gb, X, y)
print(f"GB Regressor - Train r2: {gb_train_r2}, Validation r2: {gb_validation_r2}")

results = pd.DataFrame(columns=["Model", "Training R2", "Validation R2"])

results.loc[len(results)]= {"Model": 'DT', "Training R2": dt_train_r2, "Validation R2": dt_validation_r2}
results.loc[len(results)]= {"Model": 'RF', "Training R2": rf_train_r2, "Validation R2": rf_validation_r2}
results.loc[len(results)]= {"Model": 'GB', "Training R2": gb_train_r2, "Validation R2": gb_validation_r2}
print(results)

DT Regressor - Train r2: 0.8228872809524459, Validation r2: 0.1762104452178903
RF Regressor - Train r2: 0.8812176248882352, Validation r2: 0.17478141576183334
GB Regressor - Train r2: 0.9864362663137645, Validation r2: 0.47442497151979585
  Model  Training R2  Validation R2
0    DT     0.822887       0.176210
1    RF     0.881218       0.174781
2    GB     0.986436       0.474425
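Both metrics can also be collected in a single `cross_validate` call by passing a dictionary to `scoring`. The sketch below assumes synthetic data and hypothetical names (`X_demo`, `y_demo`, `scoring`); the point it illustrates is that only the `neg_mean_squared_error` scores need their sign flipped, while R2 is used as returned.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the concrete dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# The dictionary keys become suffixes of the result keys: train_mse, test_mse, train_r2, test_r2.
scoring = {"mse": "neg_mean_squared_error", "r2": "r2"}
cv = cross_validate(DecisionTreeRegressor(max_depth=5, random_state=0),
                    X_demo, y_demo, cv=5, scoring=scoring, return_train_score=True)

train_mse = -cv["train_mse"].mean()  # negated scorer, so flip the sign
val_mse = -cv["test_mse"].mean()
train_r2 = cv["train_r2"].mean()     # R2 is already "higher is better": no sign flip
val_r2 = cv["test_r2"].mean()

print(f"MSE - train: {train_mse:.2f}, validation: {val_mse:.2f}")
print(f"R2  - train: {train_r2:.3f}, validation: {val_r2:.3f}")
```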


### Questions (6 marks)
1. How do these results compare to the results using a linear model in the previous assignment? Use values.
1. Out of the models you tested, which model would you select for this dataset and why?
1. If you wanted to increase the accuracy of the tree-based models, what would you do? Provide two suggestions.

*ANSWER HERE*
1. The tree-based models fit the training data closely, but their validation performance is much worse than in Assignment 2. Training MSE ranges from about 3.7 (GB) to 47.9 (DT), while validation MSE ranges from about 99.2 (GB) to 163.1 (DT). A gap this large between training and validation scores means the models are overfitting. In Assignment 2, the linear model had both a high training score and a high validation score.
2. I would select the linear regression model from Assignment 2, trained on X and y, because it has both a reasonably high training score and a reasonably high validation score among the models compared. The tree models all have a max depth of only 5 and their validation scores are already low, which suggests they overfit even at this modest level of complexity. I would not choose a tree model in this case.
3. Hyperparameter tuning: try both smaller and larger values of the max_depth parameter to find a depth that reduces overfitting.
Feature engineering: drop unnecessary columns and retain only the features that contribute most to the target variable. This reduces model complexity and noise, improves interpretability, and can increase validation accuracy.
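The hyperparameter tuning suggestion could be tried with a grid search. This is only a sketch under stated assumptions: the data is synthetic, the grid values and the extra `min_samples_leaf` parameter are illustrative choices, and `n_estimators=50` is used purely to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the concrete dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Illustrative grid: the candidate values are assumptions, not tuned recommendations.
param_grid = {"max_depth": [2, 3, 5, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best validation MSE:", -search.best_score_)  # flip the sign of the negated score
```

Selecting parameters by cross-validated score, rather than training score, is what guards against picking a depth that simply memorizes the training data.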


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

1. Where did you source your code?
- Based on lecture materials and labs and assignment 2.
2. In what order did you complete the steps?
- I first tried to come up with solutions on my own. When I ran into a challenge, for example when I could not remember a function's parameters or did not know which method to use, I searched Google or went back to the lecture code examples.
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
- I asked specific questions rather than copying and pasting the assignment content into the AI. Yes, I modified the code, which helped me remember what it does so that next time I will not need to ask the AI. Specifically, I used AI to get ideas for making the tree models perform better. I found the idea I was looking for, as well as some new ideas that were not covered in the lecture. It was a good tool for extending my learning and supporting my reasoning, although I came up with the answers myself, in my own way of thinking, without using AI.
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
- At first I did not understand what causes a low validation score together with a high training score, so I did some research and found that the cause is overfitting. I also had trouble coming up with ideas to improve the performance of the tree models. Reviewing the lecture material helped me.

## Part 2: Classification (18.5 marks)

You have been asked to develop code that can help the user classify different wine samples. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 1: Data Input (2 marks)

The data used for this task can be downloaded from UCI: https://archive.ics.uci.edu/dataset/109/wine

Use the pandas library to load the dataset. You must define the column headers if they are not included in the dataset 

You will need to split the dataset into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the size and type of `X` and `y`
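Loading a headerless file with pandas and defining the column headers could look like the sketch below. To stay self-contained it reads from an in-memory string instead of the downloaded file, and the three rows are made-up placeholder values, not real measurements. The feature names follow the UCI variable names, and `column_names`, `wine_df`, `X_demo` and `y_demo` are hypothetical names.

```python
import io

import pandas as pd

# The raw file has no header row; the first column is the class label (the target).
column_names = [
    "class", "Alcohol", "Malicacid", "Ash", "Alcalinity_of_ash", "Magnesium",
    "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins",
    "Color_intensity", "Hue", "0D280_0D315_of_diluted_wines", "Proline",
]

# Placeholder rows (assumption: invented values standing in for the real file).
raw = io.StringIO(
    "1,14.0,1.7,2.4,15.0,120,2.8,3.0,0.3,2.3,5.6,1.0,3.9,1000\n"
    "2,12.5,1.5,2.0,18.0,100,2.0,2.0,0.4,1.5,3.0,1.1,2.8,500\n"
    "3,13.0,3.0,2.5,21.0,95,1.5,0.7,0.5,1.0,7.0,0.7,1.7,600\n"
)

# header=None tells pandas the file has no header line; names= supplies the headers.
wine_df = pd.read_csv(raw, header=None, names=column_names)

y_demo = wine_df["class"]               # target vector
X_demo = wine_df.drop(columns="class")  # feature matrix

print(X_demo.shape, type(X_demo))
print(y_demo.shape, type(y_demo))
```

With the real download, the `io.StringIO` object would be replaced by the path to the data file.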

In [156]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine = fetch_ucirepo(id=109) 
  
# data (as pandas dataframes) 
X = wine.data.features 
y = wine.data.targets 

# metadata 
print(wine.metadata) 
  
# variable information 
print(wine.variables) 

{'uci_id': 109, 'name': 'Wine', 'repository_url': 'https://archive.ics.uci.edu/dataset/109/wine', 'data_url': 'https://archive.ics.uci.edu/static/public/109/data.csv', 'abstract': 'Using chemical analysis to determine the origin of wines', 'area': 'Physics and Chemistry', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 178, 'num_features': 13, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1992, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C5PC7J', 'creators': ['Stefan Aeberhard', 'M. Forina'], 'intro_paper': {'title': 'Comparative analysis of statistical pattern recognition methods in high dimensional settings', 'authors': 'S. Aeberhard, D. Coomans, O. Vel', 'published_in': 'Pattern Recognition', 'year': 1994, 'url': 'https://www.semanticscholar.org/paper/83dc3e4030d7b9fbdbb4bde03ce12ab70ca10528', 'do

### Step 2: Data Processing (1.5 marks)

Print the first five rows of the dataset to inspect:

In [157]:
# TO DO: ADD YOUR CODE HERE
print(X.head())


print(y.head())


   Alcohol  Malicacid   Ash  Alcalinity_of_ash  Magnesium  Total_phenols  \
0    14.23       1.71  2.43               15.6        127           2.80   
1    13.20       1.78  2.14               11.2        100           2.65   
2    13.16       2.36  2.67               18.6        101           2.80   
3    14.37       1.95  2.50               16.8        113           3.85   
4    13.24       2.59  2.87               21.0        118           2.80   

   Flavanoids  Nonflavanoid_phenols  Proanthocyanins  Color_intensity   Hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   0D280_0D315_of_diluted_wines  Proline  
0                        

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values

In [158]:
# TO DO: ADD YOUR CODE HERE
print(X.isnull().sum())

print(y.isnull().sum())

Alcohol                         0
Malicacid                       0
Ash                             0
Alcalinity_of_ash               0
Magnesium                       0
Total_phenols                   0
Flavanoids                      0
Nonflavanoid_phenols            0
Proanthocyanins                 0
Color_intensity                 0
Hue                             0
0D280_0D315_of_diluted_wines    0
Proline                         0
dtype: int64
class    0
dtype: int64


How many samples do we have of each type of wine?

In [159]:
# TO DO: ADD YOUR CODE HERE

wine_counts = y.value_counts()
wine_counts

class
2        71
1        59
3        48
Name: count, dtype: int64

### Step 3: Implement Machine Learning Model

1. Import `SVC` and `RandomForestClassifier` from sklearn
2. Instantiate models as `SVC()` and `RandomForestClassifier(max_depth = 2)`
3. Implement the machine learning model with `X` and `y`

In [160]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


svc = SVC(random_state = 0)
rf = RandomForestClassifier(max_depth = 2, random_state = 0)

y = y.values.ravel()

svc.fit(X, y)

rf.fit(X, y)

### Step 4: Validate Model 

Calculate the average training and validation accuracy using `cross_validate` for the two different models listed in Step 3. For this case, use `scoring='accuracy'`

In [161]:
from sklearn.model_selection import cross_validate

def get_accuracy(model, X, y, scoring):
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring, return_train_score=True)
    train_accuracy =  cv_results['train_score'].mean()
    validation_accuracy = cv_results['test_score'].mean()
    return train_accuracy, validation_accuracy

svc_train_accuracy, svc_validation_accuracy = get_accuracy(svc, X, y, 'accuracy')
print(f"SVC Classifier - Train accuracy: {svc_train_accuracy}, Validation accuracy: {svc_validation_accuracy}")

rf_train_accuracy, rf_validation_accuracy = get_accuracy(rf, X, y, 'accuracy')
print(f"RF Classifier - Train accuracy: {rf_train_accuracy}, Validation accuracy: {rf_validation_accuracy}")

SVC Classifier - Train accuracy: 0.7037427361371023, Validation accuracy: 0.6634920634920635
RF Classifier - Train accuracy: 0.9859647394858662, Validation accuracy: 0.972063492063492


### Step 5: Visualize Results (4 marks)

#### Step 5.1: Compare Models
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [162]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5.1
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
results = pd.DataFrame(columns=["Model", "Training Accuracy", "Validation Accuracy"])

results.loc[len(results)]= {"Model": 'SVC', "Training Accuracy": svc_train_accuracy, "Validation Accuracy": svc_validation_accuracy}
results.loc[len(results)]= {"Model": 'RF', "Training Accuracy": rf_train_accuracy, "Validation Accuracy": rf_validation_accuracy}
print(results)

  Model  Training Accuracy  Validation Accuracy
0   SVC           0.703743             0.663492
1    RF           0.985965             0.972063


#### Step 5.2: Improve Model Performance

As stated in class, support vector machines require additional pre-processing compared to tree-based models. Write the code below to test three different scaling methods, `MinMaxScaler`, `StandardScaler` and `RobustScaler`. For this case, use the same cross-validation method mentioned in the previous step. Print the training and validation accuracy results in a table. Use the default parameters for the `SVC`. 

In [163]:
# TO DO: Test different scaling methods for the SVM model    
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()
X_scaled_minmax = minmax_scaler.fit_transform(X)
X_scaled_standard = standard_scaler.fit_transform(X)
X_scaled_robust = robust_scaler.fit_transform(X)



In [164]:
from sklearn.model_selection import cross_validate

def get_accuracy(svc, X, y, scoring):
    # cross_validate clones and fits the model on every fold, so no separate fit() call is needed
    cv_results = cross_validate(svc, X, y, cv=5, scoring=scoring, return_train_score=True)
    train_accuracy = cv_results['train_score'].mean()
    validation_accuracy = cv_results['test_score'].mean()
    return train_accuracy, validation_accuracy

svc = SVC(random_state = 0)

svc_train_accuracy_minmax, svc_validation_accuracy_minmax = get_accuracy(svc, X_scaled_minmax, y, 'accuracy')
print(f"SVC - Train Accuracy with MinMax Scaler: {svc_train_accuracy_minmax}, Validation Accuracy: {svc_validation_accuracy_minmax}")

svc_train_accuracy_standard, svc_validation_accuracy_standard = get_accuracy(svc, X_scaled_standard, y, 'accuracy')
print(f"SVC - Train Accuracy with Standard Scaler: {svc_train_accuracy_standard}, Validation Accuracy: {svc_validation_accuracy_standard}")

svc_train_accuracy_robust, svc_validation_accuracy_robust = get_accuracy(svc, X_scaled_robust, y, 'accuracy')
print(f"SVC - Train Accuracy with Robust Scaler: {svc_train_accuracy_robust}, Validation Accuracy: {svc_validation_accuracy_robust}")



SVC - Train Accuracy with MinMax Scaler: 0.9957943464985718, Validation Accuracy: 0.9776190476190477
SVC - Train Accuracy with Standard Scaler: 0.9986013986013986, Validation Accuracy: 0.9833333333333334
SVC - Train Accuracy with Robust Scaler: 0.9986013986013986, Validation Accuracy: 0.9722222222222221
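One caveat with scaling the full dataset before cross-validation is that each scaler sees the validation folds when it is fitted. A common alternative is to place the scaler inside a pipeline so that it is refitted on the training portion of every fold. The sketch below assumes scikit-learn's bundled copy of the wine data (`load_wine`) in place of the UCI download, and `scalers`, `rows` and `scaler_results` are hypothetical names. It also prints the results as a table.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled wine data stands in for the UCI download.
X_wine, y_wine = load_wine(return_X_y=True)

scalers = {
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
}

rows = {}
for name, scaler in scalers.items():
    # The scaler is fitted only on the training part of each fold.
    pipe = make_pipeline(scaler, SVC())
    cv = cross_validate(pipe, X_wine, y_wine, cv=5,
                        scoring="accuracy", return_train_score=True)
    rows[name] = {
        "Training accuracy": cv["train_score"].mean(),
        "Validation accuracy": cv["test_score"].mean(),
    }

scaler_results = pd.DataFrame.from_dict(rows, orient="index")
print(scaler_results)
```

On a dataset this small the difference from pre-scaling is likely to be minor, but the pipeline form keeps the validation estimate free of information from the held-out fold.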


### Questions (7 marks)
1. How do the training and validation accuracy change depending on the method used (without scaling)? Were either of these models a good fit for the data?
1. What are two reasons why a support vector machines model might not work as well as a tree-based model?
1. How did each scaler perform compared to the unscaled results? Was there a significant difference in the performance of the scalers comparatively? Explain with values.
1. How did the results for the scaled SVM model compare to the random forest model? 

*YOUR ANSWERS HERE*

1. 

svc: - Train accuracy: 0.7037427361371023, Validation accuracy: 0.6634920634920635

Random Forest - Train accuracy: 0.9859647394858662, Validation accuracy: 0.972063492063492

SVC is not a good fit for the data because both training and validation accuracy are low, so the model is underfitting. It does not capture the structure of the data as effectively as the Random Forest model.

Random Forest is a good fit for the data since both training and validation accuracy are very close to 1.

2. 
Reason 1: SVMs are sensitive to the scale of the features. If the features are not on the same scale, the SVM is dominated by the features with larger scales and the model does not work well. Tree-based models such as Random Forests are far less sensitive to feature scaling because they make decisions based on split thresholds and do not rely on the magnitude of the feature values. This is why the unscaled SVC loses to the Random Forest model.
Reason 2: An SVM depends on its kernel to handle complex, non-linear relationships. A kernel such as RBF can help an SVM capture non-linear patterns, but tree-based models handle non-linear data naturally by dividing the feature space into simpler, piecewise segments.

3. 
Every scaler improved substantially on the unscaled SVC, which had a training accuracy of 0.704 and a validation accuracy of 0.663. Scaling can significantly affect model performance, especially for SVMs, which are sensitive to the scale of the input features. Between the scalers themselves there is no significant difference: validation accuracy ranges only from 0.972 (RobustScaler) to 0.983 (StandardScaler). MinMaxScaler, StandardScaler and RobustScaler all bring the features to a similar scale, which is why the accuracies on the scaled data are similarly high.

SVC - Train Accuracy with MinMax Scaler: 0.9957943464985718, Validation Accuracy: 0.9776190476190477
SVC - Train Accuracy with Standard Scaler: 0.9986013986013986, Validation Accuracy: 0.9833333333333334
SVC - Train Accuracy with Robust Scaler: 0.9986013986013986, Validation Accuracy: 0.9722222222222221

4. 
The scaled SVM performs about as well as the Random Forest model, and slightly better with StandardScaler (validation accuracy 0.983 vs. 0.972):

SVC - Train Accuracy with MinMax Scaler: 0.9957943464985718, Validation Accuracy: 0.9776190476190477
SVC - Train Accuracy with Standard Scaler: 0.9986013986013986, Validation Accuracy: 0.9833333333333334
SVC - Train Accuracy with Robust Scaler: 0.9986013986013986, Validation Accuracy: 0.9722222222222221
Random Forest - Train Accuracy: 0.985965, Validation Accuracy: 0.972063
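The scale sensitivity described in the answers above can be made concrete by comparing the spread of each feature. The sketch assumes scikit-learn's bundled copy of the wine data, whose column names are lowercase and differ slightly from the UCI names; `wine_frame` and `feature_ranges` are hypothetical names.

```python
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled wine data stands in for the UCI download.
wine_frame = load_wine(as_frame=True).data

# Range (max - min) of every feature, largest first.
feature_ranges = (wine_frame.max() - wine_frame.min()).sort_values(ascending=False)
print(feature_ranges)
```

The `proline` range is in the thousands while `hue` varies by little more than one unit, so distances computed on the raw features are dominated by `proline`. A distance-based model such as an RBF-kernel SVC is affected by this, whereas a tree only compares each feature against its own thresholds.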

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*
1. Where did you source your code?
- Based on lecture materials from Week6 and Question 1 from A3.
2. In what order did you complete the steps?
- I first tried to come up with solutions on my own. When I ran into a challenge, for example when I could not explain why the random forest performs better than the SVM without scaling, I searched Google or went back to the lecture code examples.
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
- I asked specific questions rather than copying and pasting the assignment content into the AI. Yes, I modified the code, which helped me remember what it does so that next time I will not need to ask the AI. Specifically, I used AI to get ideas about why scaling can make the SVM perform better. I found the idea I was looking for, as well as some new ideas that were not covered in the lecture. It was a good tool for extending my learning and supporting my reasoning, although I came up with the answers myself, in my own way of thinking, without using AI.
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
- At first I did not understand what causes low validation accuracy together with low training accuracy, so I did some research and found that the cause is underfitting. I also had trouble coming up with ideas to improve the performance of the SVM. Reviewing the lecture material helped me.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.



I found that the choice of scaler does not make a large difference on this dataset: validation accuracy is 0.978 with MinMaxScaler, 0.983 with StandardScaler and 0.972 with RobustScaler. The reason is that all three rescale each feature to a comparable range, unlike norm-based scalers.

I also found that when you are not sure whether to use a scaler, a random forest is a safe choice, since it splits the data at thresholds instead of relying on feature magnitudes.

I noticed that low training and validation accuracy does not necessarily mean the model type is a poor fit. Sometimes the cause lies in the features, and preprocessing or feature engineering is needed to make the model perform better. For example, the SVC went from a validation accuracy of 0.663 on unscaled data to 0.983 with StandardScaler.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

While working on this assignment, I found it interesting to see how scaling affects SVM performance, and why RF does not need scaling: it does not use feature magnitudes directly but splits the data into decision branches. I learned a lot and memorized many concepts by doing this assignment.