Problem 2) We will perform an experiment in this problem. Experiments help you make data-driven decisions to maximize the performance of your ML pipeline. The goal of this experiment is two-fold. We will study which ML algorithms require preprocessing and which algorithm performs best on the breast cancer dataset.

2a) The breast cancer dataset is loaded directly from sklearn. Please read the description of the dataset [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html). 

Split the data into train and test sets with 25% of points in test such that the split is reproducible (the same points end up in train and test every time you run your code). All features are continuous. Leave the continuous features unprocessed for now! (2 points)

Train a Lasso, a random forest, an SVM rbf, and a nearest neighbor model on the data. Tune the appropriate parameters and evaluate the performance on the test set, using accuracy as the evaluation metric. Print the parameter values that give the best prediction. The parameters should span a wide range of values; choose each range so that the best value is not near its smallest or largest value, if possible. Print the best test score for each model! (12 points)
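
One way to check the "not near the lowest or largest values" requirement is to look at where the best score falls on the grid. The sketch below is only an illustration: `best_on_grid_edge` is a hypothetical helper, and the scores are made up rather than taken from any model.

```python
def best_on_grid_edge(grid, scores):
    """Return (best_value, on_edge) for a 1-D parameter grid.

    on_edge is True when the best score sits at the first or last grid
    point, which suggests widening the range in that direction.
    Ties are resolved in favor of the first maximum.
    """
    if not grid or len(grid) != len(scores):
        raise ValueError("grid and scores must be non-empty and equally long")
    best_idx = scores.index(max(scores))
    on_edge = best_idx in (0, len(grid) - 1)
    return grid[best_idx], on_edge


# Toy example: these scores are invented for illustration only.
grid = [10.0 ** e for e in range(-3, 4)]  # 1e-3, 1e-2, ..., 1e3
interior = [0.90, 0.93, 0.96, 0.97, 0.95, 0.91, 0.88]
boundary = [0.97, 0.96, 0.95, 0.93, 0.91, 0.90, 0.88]

print(best_on_grid_edge(grid, interior))  # best value is in the middle of the grid
print(best_on_grid_edge(grid, boundary))  # best value is the smallest grid point
```

When the second case occurs, extending the grid past that edge and re-running the search is the usual next step.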

Which ML algorithm performs best on the data? (1 point)

Hint: to avoid code duplication and potential bugs (see 2b), please wrap the four methods and the parameter tuning into a function that takes X_train, y_train, X_test, y_test as input and returns the best Lasso, random forest, SVM, and nearest neighbor test scores.

Hint 2: Random forest is non-deterministic. Make sure that your results are reproducible, so that running your code again returns the same scores and best parameters.
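
The randomness in a random forest comes from bootstrap sampling of the rows and random feature subsets, and in scikit-learn the `random_state` argument fixes both. The sketch below illustrates only the bootstrap part, using Python's standard `random` module as a stand-in; `bootstrap_indices` is a hypothetical helper, not part of scikit-learn.

```python
import random


def bootstrap_indices(n_samples, seed=None):
    """Draw n_samples row indices with replacement, as bagging does."""
    rng = random.Random(seed)  # a private generator; the seed fixes the draws
    return [rng.randrange(n_samples) for _ in range(n_samples)]


# The same seed always gives the same bootstrap sample.
same_a = bootstrap_indices(10, seed=1)
same_b = bootstrap_indices(10, seed=1)
print(same_a == same_b)
```

Without a seed, two calls will generally draw different rows, which is why an unseeded forest can report different scores from run to run.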

In [6]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(y))
print(np.shape(X))
ftr_names = data.feature_names
target_names = data.target_names
print(len(ftr_names))
print(target_names)
print(ftr_names)

569
(569, 30)
30
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [7]:
def train(X_train, X_test, y_train, y_test):
    # lasso
    alpha = np.logspace(-3,5,20)
    a = []
    for i in range(len(alpha)):
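        # max_iter must be an integer; random_state makes the stochastic saga solver reproducible
        log_reg_l1 = LogisticRegression(penalty='l1', C=1/alpha[i], solver='saga', max_iter=10000, random_state=1)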
        log_reg_l1.fit(X_train, y_train)
        y_pred = log_reg_l1.predict(X_test)
        a.append(accuracy_score(y_test, y_pred))

    print("Lasso: The best test accuracy score is", max(a), 
          "and the alpha values are", alpha[a.index(max(a))], 
          "the C value is", 1/alpha[a.index(max(a))])
    
    # random forest
    depths = [i for i in range(1,11)]
    sss = [i for i in range(2,13)]
    a = []
    pair = []
    for depth in depths:
        for ss in sss:
            # n_estimators=1 fits a single bootstrapped tree; random_state fixes its random draws
            rfr = RandomForestClassifier(n_estimators=1, max_depth=depth, min_samples_split=ss, random_state=1)
            rfr.fit(X_train, y_train)
            y_pred = rfr.predict(X_test)
            pair.append((depth,ss))
            a.append(accuracy_score(y_test, y_pred))
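    pair_arr = np.array(pair)
    # convert the score list to an array so the comparison yields an element-wise boolean mask
    print("Random forest: The best test score is", max(a),
      "and the corresponding depth and min_samples_split values are", pair_arr[np.array(a) == max(a)])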
    
    # SVC rbf
    gammas= np.logspace(-9, 3, 13)
    Cs = np.logspace(-2, 10, 13)
    a = []
    pair = []
    for gamma_value in gammas:
        for c in Cs:
            # probability estimates are not needed for predict(), so they are left off
            svc = SVC(kernel='rbf', gamma=gamma_value, C=c)
            svc.fit(X_train, y_train)
            pair.append((gamma_value,c))
            y_pred = svc.predict(X_test)
            a.append(accuracy_score(y_test, y_pred))
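    pair_arr = np.array(pair)
    # convert the score list to an array so the comparison yields an element-wise boolean mask
    print("SVC rbf: The best test score of is",max(a), 
      "and the corresponding gamma and C values are", pair_arr[np.array(a) == max(a)])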
    
    # KNN
    a = []
    n_n = list(range(1, 15))
    for n in n_n:
        knc = KNeighborsClassifier(n_neighbors=n)
        knc.fit(X_train, y_train)
        y_pred = knc.predict(X_test)
        a.append(accuracy_score(y_test, y_pred))
    print("KNN: The best accuracy score is", max(a), 
          ", and the num of neighbors is", n_n[a.index(max(a))])
    
train(X_train, X_test, y_train, y_test)

Lasso: The best test accuracy score is 0.965034965034965 and the alpha values are 0.001 the C value is 1000.0
Random forest: The best test score is 0.958041958041958 and the corresponding depth and min_samples_split values are [[5 5]
 [5 6]
 [5 7]]
SVC rbf: The best test score of is 0.986013986013986 and the corresponding gamma and C values are [[1.e-09 1.e+05]
 [1.e-08 1.e+04]
 [1.e-07 1.e+03]
 [1.e-06 1.e+02]]
KNN: The best accuracy score is 0.9790209790209791 , and the num of neighbors is 11


Which ML algorithm performs best on the data? (1 point)    

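Ans: The SVC rbf performs best on the unprocessed data, with a test accuracy of about 0.986, followed by KNN (0.979), Lasso (0.965), and random forest (0.958).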

2b) Now that our modeling approach has been developed, we change one aspect of it and we will study how this one change impacts the accuracy scores. This is the key to a successful experiment. 

The one aspect we change: standardize the continuous features! Take care to avoid data leakage! (3 points) 
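
Avoiding leakage here means learning the mean and standard deviation from the training set only and applying those same values to the test set. In scikit-learn this corresponds to calling `fit` (or `fit_transform`) on the training data and only `transform` on the test data. The sketch below shows the idea with the standard library on a tiny made-up dataset; `fit_scaler` and `transform` are hypothetical helpers for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn the per-column mean and population std from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up data: two features, three training rows, one test row.
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]

means, stds = fit_scaler(train_rows)               # statistics from train only
train_std = transform(train_rows, means, stds)
test_std = transform(test_rows, means, stds)       # same statistics reused
print(test_std)
```

Re-estimating the statistics on the test rows would place the two sets on different scales, so a value that means "average" in training could mean something else at test time.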

Repeat the procedure in 2a on the standardized dataset. (0 points)

Which model gives the best accuracy score this time? How much improvement in accuracy did we gain by preprocessing? (2 points)

The accuracy score of which algorithm improved the most after preprocessing? (1 point)

Which algorithm is insensitive to preprocessing? (1 point)

We will discuss these findings in class after the submission deadline. 

In [8]:
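scaler = StandardScaler()
# fit the scaler on the training set only, then reuse its mean and std for the test set;
# re-fitting on the test set would leak test statistics and scale the two sets differently
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
train(X_train_std, X_test_std, y_train, y_test)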

Lasso: The best test accuracy score is 0.972027972027972 and the alpha values are 0.001 the C value is 1000.0
Random forest: The best test score is 0.951048951048951 and the corresponding depth and min_samples_split values are [[5 5]
 [5 6]
 [5 7]]
SVC rbf: The best test score of is 0.993006993006993 and the corresponding gamma and C values are [[1.e-08 1.e+10]]
KNN: The best accuracy score is 0.972027972027972 , and the num of neighbors is 6


Which model gives the best accuracy score this time? How much improvement in accuracy did we gain by preprocessing? (2 points)

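Ans: The SVC rbf performs best after preprocessing, with a test accuracy of 0.993, up from 0.986 in 2a (about +0.7 percentage points). Lasso also improves, from 0.965 to 0.972 (about +0.7 percentage points), while random forest drops from 0.958 to 0.951 and KNN from 0.979 to 0.972 (about -0.7 percentage points each). Each of these changes corresponds to a single test point.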
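
To put these differences in perspective, the recorded accuracies can be converted into counts of correctly classified test points. The sketch below assumes that the 25% test size is rounded up to 143 of the 569 points, which is consistent with every recorded score being a multiple of 1/143.

```python
import math

n_total = 569
n_test = math.ceil(0.25 * n_total)  # 143, assuming the test size is rounded up

# Best test accuracies copied from the recorded outputs of 2a and 2b.
before = {
    "Lasso": 0.965034965034965,
    "Random forest": 0.958041958041958,
    "SVC rbf": 0.986013986013986,
    "KNN": 0.9790209790209791,
}
after = {
    "Lasso": 0.972027972027972,
    "Random forest": 0.951048951048951,
    "SVC rbf": 0.993006993006993,
    "KNN": 0.972027972027972,
}

deltas = {}
for name in before:
    correct_before = round(before[name] * n_test)
    correct_after = round(after[name] * n_test)
    deltas[name] = correct_after - correct_before
    print(f"{name}: {correct_before} -> {correct_after} correct ({deltas[name]:+d})")
```

Every model moves by exactly one test point, so differences of this size should be read with caution on a test set this small.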

The accuracy score of which algorithm improved the most after preprocessing? (1 point)    

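Ans: Lasso and SVC rbf improve by the same amount, one additional correct test point each (about 0.7 percentage points), so no single algorithm improves the most. SVC rbf reaches the highest accuracy.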

Which algorithm is insensitive to preprocessing? (1 point)    

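Ans: Random forest is insensitive to standardization in principle, because tree splits depend only on the ordering of the feature values, which standardization preserves. Its best depth and min_samples_split values are unchanged. The recorded test score still differs by one test point (0.958 vs 0.951); a difference like this usually indicates that the train and test features were not scaled with the same statistics.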
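
The reason tree-based models ignore standardization can be seen on a single split. The sketch below fits a one-feature decision stump on made-up data, once on the raw values and once on values standardized with the training mean and standard deviation, and compares the test predictions. It is a simplified illustration, not scikit-learn's tree implementation; `fit_stump`, `predict`, and `standardize` are hypothetical helpers.

```python
from statistics import mean, pstdev


def fit_stump(xs, ys):
    """Pick the threshold with the fewest training errors.

    The stump predicts 1 when x > threshold and 0 otherwise. Candidate
    thresholds are midpoints between consecutive distinct sorted values.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(int(x > threshold) != y for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def predict(xs, threshold):
    return [int(x > threshold) for x in xs]


def standardize(xs, mu, sigma):
    return [(x - mu) / sigma for x in xs]


# Made-up one-feature data for illustration.
x_train = [12.0, 14.5, 13.1, 20.3, 18.7, 22.0]
y_train = [0, 0, 0, 1, 1, 1]
x_test = [13.0, 16.0, 17.5, 21.0]

# Statistics come from the training set and are reused for the test set.
mu, sigma = mean(x_train), pstdev(x_train)

raw_pred = predict(x_test, fit_stump(x_train, y_train))
std_pred = predict(
    standardize(x_test, mu, sigma),
    fit_stump(standardize(x_train, mu, sigma), y_train),
)
print(raw_pred == std_pred)
```

Standardization shifts and rescales the threshold together with the data, so every point stays on the same side of the split.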