# Name: Harsh Siddhapura
# ASU ID: 1230169813

# Lab 14: Building the ML Pipeline in SciKit Learn 

In this lab you will use Scikit Learn to build a machine learning pipeline for a classification application.

## Dataset
We will be using the US Census, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a dataset for this lab. 


## Part I: Building the Pipeline
The pipeline should include two stages:

- A standard transformer that transforms the columns of the dataset by giving them a zero mean & standard deviation of 1. 
- A classifier object such as a Decision Tree 

An estimator is created using this pipeline then it is used to fit the data. Refer to the make_pipeline()Links to an external site. function documentation on how to create the pipeline.  

Run the estimator after building the pipeline. Print the obtained classification accuracy.   

The code steps should look like this:

- Load the LibSVM file (Don't use read_csv() function).
- Create a standard Scaler Object: SS
- Create a Decision Tree Object: DT
- Create a Pipeline that contains two steps for SS -> DT
- Use the Cross Validation Score function & Print the average score 

Note:
The libSVM dataset file we are working with is sparse. The Standard Scaler implementation in SciKit learn is not designed to handle sparse data! Therefore, it will not be able to scale the data in this format.

One solution (which is given as a hint in the error messages you might get) is to set with_mean=False and with_std=False. The problem with this solution is that the data will not be standardized! Simply, the standard scaler won't be applied to the data. 

Another solution is to convert the sparse data into dense data. You can do so by calling the todense() function as follows:

- X,y = load_svm_...('a9a')
- X_d = X.todense()  

Please use the second solution and convert the dataset into a dense matrix before applying the standard scaler to it.

In [2]:
from sklearn.datasets import load_svmlight_file 
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import warnings 
import numpy as np

warnings.filterwarnings('ignore')

# Step 1: Loading the LibSVM file.
X, y = load_svmlight_file("a9a.txt")
X_d = np.array(X.todense()) # Convert the matrix to a NumPy array

# Step 2: Creating standard scaler.
scaler = StandardScaler()
print(scaler.fit(X_d))
print(scaler.transform(X_d))

# Step 3: Creating decision tree object.
dec_tree = DecisionTreeClassifier(random_state = 0) 

# Step 4: Creating a Pipeline that contains two steps for SS -> DT
pipeline = make_pipeline(scaler, dec_tree)
print(pipeline)

# Step 5: Using cross val score to find out average
scores = cross_val_score(pipeline, X_d, y, cv = 5) 
print("\n\nCross Validation score:", scores) 
print("\nAverage score:", scores.mean())


StandardScaler()
[[-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 ...
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [ 2.01963532 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]]
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(random_state=0))])


Cross Validation score: [0.78826961 0.79407248 0.79914005 0.79453317 0.80451474]

Average score: 0.7961060113754724


## Part II: Parameter Fine Tuning with Cross Validation  

In this part you will add a parameter grid to your code to experiment with different parameter values and select the one that gives the highest accuracy. The parameter set is:  
- Impurity Measure: “gini”, “entropy”
- Maximum Tree Depth: 5, 10, 15, 20  

For cross validation, you can use 5 folds.  

Check the obtained results. Which parameter set gives the best accuracy?  

Notice that you still need to create the pipeline and use it for fitting the model while also performing cross validation & parameter fine tuning.  

The code steps should look something similar to this:

- Load LibSVM Data
- Create a Standard Scalar Object: SS
- Create a Decision Tree Object: DT
- Create a Pipeline with two steps: SS -> DT
- Create a Parameter Grid [['gini', 'entropy'],[5,10,15,20]]
- Create the Grid Search Cross Validation object
- Call the fit function to fit the pipeline to the data and try different parameter combinations
- Print the best obtained results

In [3]:
from sklearn.datasets import load_svmlight_file 
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
import warnings 
import numpy as np

warnings.filterwarnings('ignore')

# Step 1: Loading the LibSVM file.
X, y = load_svmlight_file("a9a.txt")
X_d = np.array(X.todense()) # Convert the matrix to a NumPy array

# Step 2: Creating standard scaler.
scaler = StandardScaler()
print(scaler.fit(X_d))
print(scaler.transform(X_d))

# Step 3: Creating decision tree object.
dec_tree = DecisionTreeClassifier(random_state = 0) 

# Step 4: Creating a Pipeline that contains two steps for SS -> DT
pipeline = make_pipeline(scaler, dec_tree)
print(pipeline)

# Step 5: Creating a Parameter Grid
param_grid = {'decisiontreeclassifier__criterion': ['gini', 'entropy'],
              'decisiontreeclassifier__max_depth': [5, 10, 15, 20]}

# Step 6: Creating the Grid Search Cross Validation object
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Step 7: Call the fit function to fit the pipeline to the data and try different parameter combinations
grid.fit(X_d, y)

# Step 8: Print the best obtained results
print("\n\nBest parameters: ", grid.best_params_)
print("\nBest estimator: ", grid.best_estimator_)
print("\nBest cross-validation score: ", grid.best_score_)

StandardScaler()
[[-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 ...
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [ 2.01963532 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]]
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(random_state=0))])


Best parameters:  {'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__max_depth': 10}

Best estimator:  Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=10,
                                   

## Part III: Extra Credit (5 points): Parameter Fine Tuning, Cross Validation with Train/Test Split  

Notice that in parts I & II we did not use a train/test split so the scalar is applied independently to each. That's because the way the cross_val_score() function is implemented does not give access to the learned models. The cross_val_score() function should only be used in cases of comparing between different types of classifiers (DT vs SVM vs ...). The GridSearchCV() object is more suitable for parameter fine tuning. It also gives access to the model that gave the best parameters.

In this part, we will modify Part II so that the data is split into train/test parts, the pipeline is fit to the training data with parameter fine tuning and cross validation involved, and finally, the testing data is passed through the scalar, and the best resulting model is applied to the scaled test data to give the final model accuracy.

The code steps should look something similar to this:

- Load LibSVM Data
- Split data into Train and Test sets
- Create a Standard Scalar Object: SS
- Create a Decision Tree Object: DT
- Create a Pipeline with two steps: SS -> DT
- Create a Parameter Grid [['gini', 'entropy'],[5,10,15,20]]
- Create the Grid Search Cross Validation object
- Call the fit function to fit the pipeline to the Train data and try the different parameter combinations
- Extract the best model from the Grid Search Cross Validation Object. Check the best_estimator_ attribute.
- Apply the Scalar to the Test data
- Apply the best model to the scaled Test data
- Print the obtained test results


In [4]:
from sklearn.datasets import load_svmlight_file 
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
import warnings 
import numpy as np

warnings.filterwarnings('ignore')

# Step 1: Loading the LibSVM file.
X, y = load_svmlight_file("a9a.txt")
X_d = np.array(X.todense()) # Convert the matrix to a NumPy array

# Step 2: Split data into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X_d, y, test_size=0.2, random_state=42)

# Step 3: Creating standard scaler.
scaler = StandardScaler()
print(scaler.fit(X_d))
print(scaler.transform(X_d))

# Step 4: Creating decision tree object.
dec_tree = DecisionTreeClassifier(random_state = 0) 

# Step 5: Creating a Pipeline that contains two steps for SS -> DT
pipeline = make_pipeline(scaler, dec_tree)

# Step 6: Creating a Parameter Grid
param_grid = {'decisiontreeclassifier__criterion': ['gini', 'entropy'],
              'decisiontreeclassifier__max_depth': [5, 10, 15, 20]}

# Step 7: Creating the Grid Search Cross Validation object
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Step 8: Call the fit function to fit the pipeline to the Train data and try the different parameter combinations
grid.fit(X_train, y_train)

# Step 9: Extract the best model from the Grid Search Cross Validation Object. Check the best_estimator_ attribute.
print("\n\nBest parameters: ", grid.best_params_)
print("\nBest estimator: ", grid.best_estimator_)
print("\nBest cross-validation score: ", grid.best_score_)

# Step 10: Apply the Scalar to the Test data
X_test_scaled = scaler.transform(X_test)

# Step 11: Apply the best model to the scaled Test data
test_score = grid.score(X_test_scaled, y_test)

# Step 12: Print the obtained test results
print("\nTest score: ", test_score)

StandardScaler()
[[-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197  1.94096624 ... -0.03087016 -0.02479131
  -0.00554189]
 ...
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [ 2.01963532 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]
 [-0.49513889 -0.46930197 -0.51520731 ... -0.03087016 -0.02479131
  -0.00554189]]


Best parameters:  {'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': 10}

Best estimator:  Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(max_depth=10, random_state=0))])

Best cross-validation score:  0.8320025616375615

Test score:  0.7658529095654845
