# Assignment 3: Non-Linear Models and Validation Metrics (38 total marks)
### Due: March 5 at 11:59pm

### Name: Saurav Uprety

### In this assignment, you will need to write code that uses non-linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Part 1: Regression (14.5 marks)

For this section, we will continue with the concrete example from yellowbrick. You will need to compare these results to the results from the previous assignment. Please use the results from the solution if you were unable to complete Assignment 2.

### Step 1: Data Input (0.5 marks)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

In [2]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

X, y = load_concrete()

### Step 2: Data Processing (0 marks)

Data processing was completed in the previous assignment. No need to repeat here.

### Step 3: Implement Machine Learning Model

1. Import the Decision Tree, Random Forest and Gradient Boosting Machines regression models from sklearn
2. Instantiate the three models with `max_depth = 5`. Are there any other parameters that you will need to set?
3. Implement each machine learning model with `X` and `y`

### Step 4: Validate Model

Calculate the average training and validation accuracy using mean squared error with cross-validation. To do this, you will need to set `scoring='neg_mean_squared_error'` in your `cross_validate` function and negate the results (multiply by -1)

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: DT, RF and GB
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [3]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

tree_reg = DecisionTreeRegressor(max_depth=5, random_state=0)

# Random forest also exposes the number of trees (n_estimators) as a hyperparameter.
# max_features could be set as well; for regression the default uses all features.
# max_features and min_samples_leaf are left at their defaults, since max_depth already pre-prunes the trees.
forest_reg = RandomForestRegressor(max_depth=5, random_state=0)

# Gradient boosting also exposes learning_rate and n_estimators as hyperparameters.
gb_reg = GradientBoostingRegressor(max_depth=5, random_state=0)

In [4]:
res = pd.DataFrame(columns=['Decision Tree', 'Random Forest', 'Gradient Boosted Machine'],
                   index=['CV Training MSE', 'CV Validation MSE'])


for i, regressor in enumerate([tree_reg, forest_reg, gb_reg]):
    # sklearn reports negated MSE, so flip the sign to recover the positive MSE
    scores_ = cross_validate(regressor, X, y, cv=5, return_train_score=True, scoring='neg_mean_squared_error')
    res.iloc[0, i] = (-1 * scores_['train_score']).mean()
    res.iloc[1, i] = (-1 * scores_['test_score']).mean()
res

Unnamed: 0,Decision Tree,Random Forest,Gradient Boosted Machine
CV Training MSE,47.918561,32.055432,3.73927
CV Validation MSE,163.087775,156.404972,99.360259


Repeat the step above to print the R2 score instead of the mean-squared error. For this case, you can use `scoring='r2'`

In [5]:
res = pd.DataFrame(columns=['Decision Tree', 'Random Forest', 'Gradient Boosted Machine'],
                   index=['CV Training R2', 'CV Validation R2'])


for i, regressor in enumerate([tree_reg, forest_reg, gb_reg]):
    scores_ = cross_validate(regressor, X, y, cv=5, return_train_score=True, scoring='r2',) 
    res.iloc[0,i] = np.mean(scores_['train_score'])
    res.iloc[1,i] = np.mean(scores_['test_score'])

res

Unnamed: 0,Decision Tree,Random Forest,Gradient Boosted Machine
CV Training R2,0.822887,0.881221,0.986436
CV Validation R2,0.17621,0.173748,0.473701
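
The two tables above come from two separate cross-validation runs. As a minimal sketch, both metrics can also be collected in a single `cross_validate` call by passing a dictionary to `scoring`. The synthetic data from `make_regression` and the names `X_demo`, `y_demo` and `summary` are assumptions for illustration only; they stand in for the concrete dataset so the example runs without a download.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the concrete data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = DecisionTreeRegressor(max_depth=5, random_state=0)

# A dict of scorers yields result keys named train_<key> and test_<key>
scores = cross_validate(
    model, X_demo, y_demo, cv=5, return_train_score=True,
    scoring={"mse": "neg_mean_squared_error", "r2": "r2"},
)

summary = pd.DataFrame(
    {
        # negate the MSE scores to recover positive errors
        "MSE": [-scores["train_mse"].mean(), -scores["test_mse"].mean()],
        "R2": [scores["train_r2"].mean(), scores["test_r2"].mean()],
    },
    index=["CV Training", "CV Validation"],
)
print(summary)
```

Keeping both metrics in one table makes it harder to mislabel a row, since each column is named after the scorer that produced it.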


### Questions (6 marks)
1. How do these results compare to the results using a linear model in the previous assignment? Use values.
1. Out of the models you tested, which model would you select for this dataset and why?
1. If you wanted to increase the accuracy of the tree-based models, what would you do? Provide two suggestions.

*ANSWER HERE*

1. Compared to the linear model, the validation results are worse. The training R2 scores are much better for the non-linear methods (0.823, 0.881 and 0.986 for decision tree, random forest and gradient boosting machine, respectively, compared to 0.611 for linear regression), but the validation scores are significantly lower: 0.176, 0.174 and 0.474, compared to 0.623 for linear regression. Similar trends are observed for the mean squared error. The large gap between training and validation scores places the tree-based models in the high-variance (left) region of the learning curve. Two factors likely contribute. First, the evaluation procedures differ: the linear model was evaluated with a 75 % training split, whereas here 5-fold cross-validation is used, so each model is trained on 80 % of the data and validated on the remaining 20 %. Because `cross_validate` with an integer `cv` does not shuffle the rows for regressors, a validation fold may contain samples unlike those in the corresponding training folds. Second, tree-based regressors have limited ability to extrapolate beyond the range of their training data, which would amplify that effect.
2. Out of the models, I would use GBM for this dataset. GBM has the best performance, with training and validation R2 scores of 0.986 and 0.474, respectively. GBM can also be tuned further through the learning rate and number of estimators hyperparameters.
3. For GBM and RF, tuning the hyperparameters (the number of estimators and, for GBM, the learning rate) should help increase the accuracy. However, the obvious limiting factor here is the training set size, as the gap between the training and validation scores is large, so increasing the number of samples should also increase accuracy.
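
The hyperparameter tuning suggested in answer 3 could be sketched with `GridSearchCV`. This is an illustration under stated assumptions, not the procedure used in this assignment: the grid values are arbitrary, and `X_demo`/`y_demo` are synthetic data standing in for the concrete dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with the real X, y to tune on concrete
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Hypothetical grid: the values are illustrative, not recommendations
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(max_depth=5, random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated R2 of the best parameter combination
print(search.best_params_, round(search.best_score_, 3))
```

Because the best score is itself chosen on the validation folds, a separate held-out test set would be needed for an unbiased final estimate.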

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I wrote this code myself. I used the sklearn documentation for some details, such as the initialization of `cross_validate` and the different regressors. I also used the *Introduction to Machine Learning with Python* textbook along with the course notes to understand decision trees and ensemble methods. I completed the steps in the order prescribed. The results of this question were challenging to understand; the large difference between training and validation scores caused significant doubt. I also saw that better results could be achieved if the dataset is split into train and test sets. However, since we are cross-validating, and not doing any prediction with the model, I could not justify doing the train/test split.
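
One possible explanation for the observation that a train/test split gives better results is that `train_test_split` shuffles rows by default, while `cross_validate` with `cv=5` uses unshuffled folds for regressors. The sketch below illustrates the effect on synthetic data that is deliberately sorted; whether the concrete dataset is ordered in a similar way is an assumption that was not checked in this notebook, and `X_demo`/`y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data sorted by the single feature, so each unshuffled fold
# covers a range of values that is absent from its training folds
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + rng.normal(0, 1, size=200)

model = DecisionTreeRegressor(max_depth=5, random_state=0)

# Unshuffled folds: contiguous blocks of rows are held out
r2_ordered = cross_val_score(
    model, X_demo, y_demo, cv=KFold(n_splits=5), scoring="r2"
).mean()

# Shuffled folds: held-out rows are spread across the whole range
r2_shuffled = cross_val_score(
    model, X_demo, y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2",
).mean()

print(f"unshuffled R2: {r2_ordered:.3f}, shuffled R2: {r2_shuffled:.3f}")
```

If the same pattern appears on the real data, passing a shuffled `KFold` object as `cv` would be a way to keep cross-validation while removing the ordering effect.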

## Part 2: Classification (18.5 marks)

You have been asked to develop code that can help the user classify different wine samples. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 1: Data Input (2 marks)

The data used for this task can be downloaded from UCI: https://archive.ics.uci.edu/dataset/109/wine

Use the pandas library to load the dataset. You must define the column headers if they are not included in the dataset 

You will need to split the dataset into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the size and type of `X` and `y`

In [6]:
# TO DO: Import wine dataset
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine = fetch_ucirepo(id=109) 
  
# data (as pandas dataframes) 
X = wine.data.features 
y = wine.data.targets 

print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

X is of type <class 'pandas.core.frame.DataFrame'> and size (178, 13)
y is of type <class 'pandas.core.frame.DataFrame'> and size (178, 1)
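
The cell above uses `ucimlrepo` rather than pandas directly. A minimal pandas-based sketch of the same split is shown below. It assumes the raw UCI file has no header row and stores the class label in the first column; the two sample rows are the first two rows of the feature table with an assumed class label of 1, and the column names follow those reported by `ucimlrepo`. The names `raw`, `wine_df`, `X_demo` and `y_demo` are hypothetical.

```python
import io
import pandas as pd

# Headers must be supplied by hand because the raw file is assumed to have none
columns = [
    "class", "Alcohol", "Malicacid", "Ash", "Alcalinity_of_ash", "Magnesium",
    "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins",
    "Color_intensity", "Hue", "0D280_0D315_of_diluted_wines", "Proline",
]

# Two rows in the assumed raw layout; with the real file, pass its path instead
raw = io.StringIO(
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
)
wine_df = pd.read_csv(raw, header=None, names=columns)

# The class column is the target vector; everything else is the feature matrix
X_demo = wine_df.drop(columns="class")
y_demo = wine_df["class"]

print(f"X is of type {type(X_demo)} and size {X_demo.shape}")
print(f"y is of type {type(y_demo)} and size {y_demo.shape}")
```

With this approach `y` is a pandas Series rather than a one-column DataFrame, so it can be passed to sklearn without selecting the `class` column first.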


In [7]:
# metadata 
print(wine.metadata) 
  
# variable information 
print(wine.variables) 

{'uci_id': 109, 'name': 'Wine', 'repository_url': 'https://archive.ics.uci.edu/dataset/109/wine', 'data_url': 'https://archive.ics.uci.edu/static/public/109/data.csv', 'abstract': 'Using chemical analysis to determine the origin of wines', 'area': 'Physics and Chemistry', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 178, 'num_features': 13, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1992, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C5PC7J', 'creators': ['Stefan Aeberhard', 'M. Forina'], 'intro_paper': {'title': 'Comparative analysis of statistical pattern recognition methods in high dimensional settings', 'authors': 'S. Aeberhard, D. Coomans, O. Vel', 'published_in': 'Pattern Recognition', 'year': 1994, 'url': 'https://www.semanticscholar.org/paper/83dc3e4030d7b9fbdbb4bde03ce12ab70ca10528', 'do

### Step 2: Data Processing (1.5 marks)

Print the first five rows of the dataset to inspect:

In [8]:
# TO DO: ADD YOUR CODE HERE
X.head()

Unnamed: 0,Alcohol,Malicacid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,0D280_0D315_of_diluted_wines,Proline
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values

In [9]:
# TO DO: ADD YOUR CODE HERE
X.isnull().sum()

Alcohol                         0
Malicacid                       0
Ash                             0
Alcalinity_of_ash               0
Magnesium                       0
Total_phenols                   0
Flavanoids                      0
Nonflavanoid_phenols            0
Proanthocyanins                 0
Color_intensity                 0
Hue                             0
0D280_0D315_of_diluted_wines    0
Proline                         0
dtype: int64

How many samples do we have of each type of wine?

In [10]:
# TO DO: ADD YOUR CODE HERE
y['class'].value_counts()

class
2    71
1    59
3    48
Name: count, dtype: int64

### Step 3: Implement Machine Learning Model

1. Import `SVC` and `RandomForestClassifier` from sklearn
2. Instantiate models as `SVC()` and `RandomForestClassifier(max_depth = 2)`
3. Implement the machine learning model with `X` and `y`

### Step 4: Validate Model 

Calculate the average training and validation accuracy using `cross_validate` for the two different models listed in Step 3. For this case, use `scoring='accuracy'`

### Step 5: Visualize Results (4 marks)

#### Step 5.1: Compare Models
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy
2. Add the training and validation accuracy for each model to the `results` DataFrame
3. Print `results`

In [11]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5.1
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rf_mdl = RandomForestClassifier(max_depth=2, random_state=0)
svc_mdl = SVC(random_state=0)

In [12]:
res = pd.DataFrame(columns=['Random Forest', 'SVC'],
                   index=['Training Accuracy', 'Validation Accuracy'])


for i, model in enumerate([rf_mdl, svc_mdl]):
    scores_ = cross_validate(model, X, y['class'], cv=5, return_train_score=True, scoring='accuracy')
    res.iloc[0,i] = np.mean(scores_['train_score'])
    res.iloc[1,i] = np.mean(scores_['test_score'])

res

Unnamed: 0,Random Forest,SVC
Training Accuracy,0.985965,0.703743
Validation Accuracy,0.972063,0.663492


#### Step 5.2: Improve Model Performance

As stated in class, support vector machines require additional pre-processing compared to tree-based models. Write the code below to test three different scaling methods, `MinMaxScaler`, `StandardScaler` and `RobustScaler`. For this case, use the same cross-validation method mentioned in the previous step. Print the training and validation accuracy results in a table. Use the default parameters for the `SVC`. 

In [13]:
# TO DO: Test different scaling methods for the SVM model
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

res = pd.DataFrame(columns=['MinMaxScaler', 'StandardScaler', 'RobustScaler'], index=['Training Accuracy', 'Validation Accuracy'])


for i, Scaler in enumerate([MinMaxScaler, StandardScaler, RobustScaler]):
    # Note: the scaler is fitted on all rows of X before cross-validation
    scaler = Scaler()
    scaler.fit(X)
    X_scaled = scaler.transform(X)
    scores_ = cross_validate(svc_mdl, X_scaled, y['class'], cv=5, return_train_score=True, scoring='accuracy')
    res.iloc[0, i] = np.mean(scores_['train_score'])
    res.iloc[1, i] = np.mean(scores_['test_score'])


In [14]:
res

Unnamed: 0,MinMaxScaler,StandardScaler,RobustScaler
Training Accuracy,0.995794,0.998601,0.998601
Validation Accuracy,0.977619,0.983333,0.972222
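
In the cell above, each scaler is fitted on the full feature matrix before cross-validation, so the scaling statistics include the validation rows. A sketch of an alternative that refits the scaler inside each fold uses `make_pipeline`. It loads sklearn's bundled copy of the wine data through `load_wine` (an assumption made so the example runs offline; its class labels are 0 to 2 rather than 1 to 3), and the scores it prints may differ slightly from the table above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC

# sklearn's bundled wine data (assumed equivalent to the UCI download)
X_demo, y_demo = load_wine(return_X_y=True)

rows = {}
for Scaler in (MinMaxScaler, StandardScaler, RobustScaler):
    # The pipeline fits the scaler on the training folds only, then scales the held-out fold
    pipe = make_pipeline(Scaler(), SVC(random_state=0))
    scores = cross_validate(
        pipe, X_demo, y_demo, cv=5, return_train_score=True, scoring="accuracy"
    )
    rows[Scaler.__name__] = {
        "Training Accuracy": scores["train_score"].mean(),
        "Validation Accuracy": scores["test_score"].mean(),
    }

# Outer dict keys become columns, inner keys become the index
pipeline_res = pd.DataFrame(rows)
print(pipeline_res)
```

On a dataset this small the difference is likely to be minor, but the pipeline form keeps the validation folds fully unseen during preprocessing.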


### Questions (7 marks)
1. How do the training and validation accuracy change depending on the method used (without scaling)? Were either of these models a good fit for the data?
1. What are two reasons why a support vector machines model might not work as well as a tree-based model?
1. How did each scaler perform compared to the unscaled results? Was there a significant difference in the performance of the scalers comparatively? Explain with values.
1. How did the results for the scaled SVM model compare to the random forest model? 

1. For both RF and SVC (without scaling), the training and validation accuracies are close to each other. This suggests that neither model is in the high-variance region of the learning curve, so an increase in training data size is unlikely to improve the results. The random forest performed exceptionally well, with training and validation scores of 0.986 and 0.972, respectively. However, the SVC appears to have underfit, with training and validation scores of 0.704 and 0.663.
1. SVM is very sensitive to differences in the scales of the features, whereas tree-based models are not affected by this. In this case, where the features have vastly different scales, the tree-based models would outperform SVM. SVM also requires careful tuning of the gamma and C parameters; incorrect tuning could also lead to SVM performing worse than tree-based methods.
1. Scaling brings large improvements in the results for SVC. The cross-validated (training, validation) accuracies improved from (0.704, 0.663) unscaled to (0.996, 0.978), (0.999, 0.983) and (0.999, 0.972) with min-max, standard and robust scaling, respectively. All of the scaled models appear to be a good fit for the data, as the training and validation accuracies are near 1. The differences between the scalers are small (validation accuracies within about 0.011 of each other). Standard scaling is marginally best, with the highest validation accuracy and the smallest gap between training and validation accuracy.
1. Results for the scaled SVC models are comparable to or slightly better than the random forest model: validation accuracies of 0.978, 0.983 and 0.972 compared to 0.972 for the random forest. However, if the random forest is computationally less expensive at prediction time, it might be the better alternative for a larger dataset, despite the slightly lower scores.
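
The scale sensitivity described in answer 2 can be made concrete with a small calculation. The default `SVC` kernel (RBF) depends on the squared Euclidean distance between samples, so features with large numeric ranges dominate. The sketch below uses the first two rows of the feature table printed in Step 2 and computes each feature's share of the squared distance between them; the variable names are hypothetical.

```python
import numpy as np

features = [
    "Alcohol", "Malicacid", "Ash", "Alcalinity_of_ash", "Magnesium",
    "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins",
    "Color_intensity", "Hue", "0D280_0D315_of_diluted_wines", "Proline",
]

# Rows 0 and 1 of X.head() from Step 2
row0 = np.array([14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065])
row1 = np.array([13.2, 1.78, 2.14, 11.2, 100, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050])

# Per-feature contribution to the squared Euclidean distance
sq_diff = (row0 - row1) ** 2
share = sq_diff / sq_diff.sum()

# The two features that contribute most to the distance
order = np.argsort(share)[::-1]
top_two = [features[i] for i in order[:2]]
top_two_share = share[order[:2]].sum()

print(top_two, round(float(top_two_share), 3))
```

For this pair of samples, Magnesium and Proline account for nearly all of the unscaled distance, which is consistent with the unscaled SVC underfitting and with the improvement seen once every feature is brought to a comparable range.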

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*
 
To get access to the UCI wine data, I used the example code provided in the dataset's description [1]. For the rest of the code, I referred to the example notebooks provided in class. I also learned about SVM and Random Forest from the *Introduction to Machine Learning with Python* textbook. I completed the questions in order and did not have any challenges. I first learned about the methods using the textbook and then went through the relevant class examples before attempting the assignment. Going through the content in this order made the assignment less challenging.

[1] Aeberhard, Stefan and Forina, M. (1991). Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

In class we talked about how tree-based methods and ensembles are some of the most powerful and widely used methods for both regression and classification. However, here we saw that tree-based methods can also be quite limited in their regression capabilities, as they have a tendency to overfit. For example, in Part 1, the tree-based regressors had vast differences between training and validation R2 scores, which were (0.823, 0.176), (0.881, 0.174), and (0.986, 0.474) for decision tree, random forest, and gradient boosted machine, respectively. <br><br>
We also talked about the effect of differences in feature scale on SVC performance. The improvement observed when the various scalers were applied (validation accuracy rising from 0.663 to between 0.972 and 0.983) ties back to this point.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

Overall, I liked the assignment, as it was once again hands-on. Some of the instructions could be clearer. For example, Part 1, Step 3, item 2 asks about any other parameters we should consider. It's not clear how we are supposed to approach this question: should I just mention the other parameters, should I experiment with them to get better results, or is this just something to think about?