# Assignment 3: Non-Linear Models and Validation Metrics (38 total marks)
### Due: March 5 at 11:59pm

### Name: Saurav Uprety

### In this assignment, you will need to write code that uses non-linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

### Import Libraries

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part 1: Regression (14.5 marks)

For this section, we will be continuing with the concrete example from yellowbrick. You will need to compare these results to the results from the previous assignment. Please use the results from the solution if you were unable to complete Assignment 2

### Step 1: Data Input (0.5 marks)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

In [25]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

X, y = load_concrete()

### Step 2: Data Processing (0 marks)

Data processing was completed in the previous assignment. No need to repeat here.

### Step 3: Implement Machine Learning Model

1. Import the Decision Tree, Random Forest and Gradient Boosting Machines regression models from sklearn
2. Instantiate the three models with `max_depth = 5`. Are there any other parameters that you will need to set?
3. Implement each machine learning model with `X` and `y`

### Step 4: Validate Model

Calculate the average training and validation accuracy using mean squared error with cross-validation. To do this, you will need to set `scoring='neg_mean_squared_error'` in your `cross_validate` function and negate the results (multiply by -1)

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: DT, RF and GB
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [35]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

tree_reg = DecisionTreeRegressor(max_depth=1, random_state=0)

# random forest can specify number of trees (n_estimator) hyper param.
# can also specify max features - since its regression task, use num of features by default
# don't need max_features, or min_samples_leaf as pre-pruning done with max_depth
forest_reg = RandomForestRegressor(max_depth=5, random_state=0)

# Can also specify learning rate hyper param.
gb_reg = GradientBoostingRegressor(max_depth=5, random_state=0)

In [37]:
res = pd.DataFrame(columns=['Decision Tree', 'Random Forest', 'Gradient Boosted Machine'],
                   index=['CV Training R2', 'CV Validation R2'])


for i, regressor in enumerate([tree_reg, forest_reg, gb_reg]):
    scores_ = cross_validate(regressor, X, y, cv=2, return_train_score=True, scoring='neg_mean_squared_error', ) 
    res.iloc[0,i] = (-1*scores_['train_score']).mean()
    res.iloc[1,i] = (-1*scores_['test_score']).mean()
res

Unnamed: 0,Decision Tree,Random Forest,Gradient Boosted Machine
CV Training R2,155.310868,22.009196,2.100764
CV Validation R2,368.547629,233.010203,157.24033


Repeat the step above to print the R2 score instead of the mean-squared error. For this case, you can use `scoring='r2'`

In [28]:
res = pd.DataFrame(columns=['Decision Tree', 'Random Forest', 'Gradient Boosted Machine'],
                   index=['CV Training R2', 'CV Validation R2'])


for i, regressor in enumerate([tree_reg, forest_reg, gb_reg]):
    scores_ = cross_validate(regressor, X, y, cv=5, return_train_score=True, scoring='r2',) 
    res.iloc[0,i] = np.mean(scores_['train_score'])
    res.iloc[1,i] = np.mean(scores_['test_score'])

res

Unnamed: 0,Decision Tree,Random Forest,Gradient Boosted Machine
CV Training R2,0.822887,0.881221,0.986436
CV Validation R2,0.17621,0.173748,0.473701


### Questions (6 marks)
1. How do these results compare to the results using a linear model in the previous assignment? Use values.
1. Out of the models you tested, which model would you select for this dataset and why?
1. If you wanted to increase the accuracy of the tree-based models, what would you do? Provide two suggestions.

*ANSWER HERE*

1. Compared to linear models, the results are worse. Although the training R2 scores are much better for the non linear methods, 0.823, 0.881, 0.986 for decision trees, random forest, and gradient boost machines, respectively, compared to 0.611 for linear regression. The validatioon scores for the non-linear methods are significantly lower - 0.176, 0.173, 0.474 (decision trees, random forest, gradient boost machines) compared to 0.623 for linear regression. Similar trends are observed for the mean squared error. This discrepancy is most likely due to the differences in training size, and the inability to non-linear methods to extrapolate. Since cross-validation is used, with the default value of 5 folds. The training size for each cross-validation step is much smaller compared to the training size of the linear regression dataset (20 % comapred to 75 %). Hence, the models are in the high-variance region (left) of the learning curve. Finally, the tree-based regression methods have limited ability to extrapolate beyond their training dataset, this coupled with the limited training dataset could explain the discrepancy between the validation and training accuracies.
2. Out of the models, I would use GBM for this dataset. GBM has the best perforamce, 0.990 and 0.474 training and test accuracies, respectively. GBM can also be highly tuned with the learning rate and n estimators parameters. 
3. The limiting factor here is clearly, the size of the training data size, 

1. Linear models can extrapolate - however, these models cannot extrapolate beyond their training datasets
1. Use GBM, although it requires the most tuning, it provides the best results.
1. If underfitting, increase model complexity with higher depth. If overfitting, decrease model complexity. For GBM consider learning rate. 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 2: Classification (18.5 marks)

You have been asked to develop code that can help the user classify different wine samples. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 1: Data Input (2 marks)

The data used for this task can be downloaded from UCI: https://archive.ics.uci.edu/dataset/109/wine

Use the pandas library to load the dataset. You must define the column headers if they are not included in the dataset 

You will need to split the dataset into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the size and type of `X` and `y`

In [None]:
# TO DO: Import wine dataset
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine = fetch_ucirepo(id=109) 
  
# data (as pandas dataframes) 
X = wine.data.features 
y = wine.data.targets 

print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

In [None]:
# metadata 
print(wine.metadata) 
  
# variable information 
print(wine.variables) 

### Step 2: Data Processing (1.5 marks)

Print the first five rows of the dataset to inspect:

In [None]:
# TO DO: ADD YOUR CODE HERE
X.head()

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values

In [None]:
# TO DO: ADD YOUR CODE HERE
X.isnull().sum()

How many samples do we have of each type of wine?

In [None]:
# TO DO: ADD YOUR CODE HERE
y['class'].value_counts()

### Step 3: Implement Machine Learning Model

1. Import `SVC` and `RandomForestClassifier` from sklearn
2. Instantiate models as `SVC()` and `RandomForestClassifier(max_depth = 2)`
3. Implement the machine learning model with `X` and `y`

### Step 4: Validate Model 

Calculate the average training and validation accuracy using `cross_validate` for the two different models listed in Step 3. For this case, use `scoring='accuracy'`

### Step 5: Visualize Results (4 marks)

#### Step 5.1: Compare Models
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy
2. Add the training and validation accuracy for each model to the `results` DataFrame
3. Print `results`

In [None]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5.1
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rf_mdl = RandomForestClassifier(max_depth=2, random_state=0)
svc_mdl = SVC(random_state=0)

In [None]:
res = pd.DataFrame(columns=['Random Forest', 'SVC'],
                   index=['Training Accuracy', 'Validation Accuracy'])


for i, regressor in enumerate([rf_mdl, svc_mdl]):
    scores_ = cross_validate(regressor, X, y['class'], cv=5, return_train_score=True, scoring='accuracy') 
    res.iloc[0,i] = np.mean(scores_['train_score'])
    res.iloc[1,i] = np.mean(scores_['test_score'])

res

#### Step 5.2: Improve Model Performance

As stated in class, support vector machines require additional pre-processing compared to tree-based models. Write the code below to test three different scaling methods, `MinMaxScaler`, `StandardScaler` and `RobustScaler`. For this case, use the same cross-validation method mentioned in the previous step. Print the training and validation accuracy results in a table. Use the default parameters for the `SVC`. 

In [None]:
# TO DO: Test different scaling methods for the SVM model
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

res = pd.DataFrame(columns=['MinMaxScaler','StandardScale', 'RobustScaler'], index=['Training Accuracy', 'Validation Accuracy'])


for i, Scaler in enumerate([MinMaxScaler, StandardScaler, RobustScaler]):
    scaler = Scaler()
    scaler.fit(X)
    X_scaled = scaler.transform(X)
    scores_ = cross_validate(svc_mdl, X_scaled, y['class'], cv=5, return_train_score=True, scoring='accuracy') 
    res.iloc[0,i] = np.mean(scores_['train_score'])
    res.iloc[1,i] = np.mean(scores_['test_score'])


In [None]:
res

### Questions (7 marks)
1. How do the training and validation accuracy change depending on the method used (without scaling)? Were either of these models a good fit for the data?
1. What are two reasons why a support vector machines model might not work as well as a tree-based model?
1. How did each scaler perform compared to the unscaled results? Was there a significant difference in the performance of the scalers comparatively? Explain with values.
1. How did the results for the scaled SVM model compare to the random forest model? 

1. X
2. SVM does not scale well with number of samples, whereas tree-based models scale very well. This is because, predicting with tree-based methods requires just going through an if-elseif ladder, where as prediciting with SVM requires computation of distances support vectors which is computationally more expensive.  Tree-based methods, with reasonable pre-pruned depth, like GBM, can be visualize and easily communicated to provide an intuitive understanding, however, SVM is not the same. 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*
 
Aeberhard,Stefan and Forina,M.. (1991). Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*