# Logistic Regression 

In this exercise, you will use logistic regression to classify breast cancer as either malignant or benign. First run the code below to print and read the description of the data set. 

In [2]:
from sklearn.datasets import load_breast_cancer
import numpy as np

DataCancer=load_breast_cancer()
print(DataCancer.keys())
print(DataCancer.DESCR)

X_features=DataCancer.data
Y_targetClass=DataCancer.target




dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Ra
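
Before modelling, it can help to confirm that the loaded arrays match the description above (569 instances, 30 features, two classes). The following is a minimal sketch; the names `data`, `X`, and `y` are illustrative and not part of the exercise code.

```python
from sklearn.datasets import load_breast_cancer
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

# Shape of the feature matrix: (n_samples, n_features)
print("Feature matrix shape:", X.shape)

# Number of samples for each class label (labels index into target_names)
for label, name in enumerate(data.target_names):
    print(name, int(np.sum(y == label)))
```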

### A) Scale the features to have zero mean and unit variance. Fit a logistic regression with ridge (L2) regularization and the tuning parameter set to 1, then find the accuracy of the model on the test set.
- Use random_state = 0 in the train_test_split.

In [4]:
# write your code here

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing


X_train, X_test, Y_train, Y_test= train_test_split(X_features, Y_targetClass, random_state= 0)

scaler=preprocessing.StandardScaler().fit(X_train)
X_train_transformed=scaler.transform(X_train)
X_test_transformed=scaler.transform(X_test)

# LogisticRegression uses an L2 (ridge) penalty by default; C=1.0 is also the default, set explicitly here
FittedLogRegModelRidge = LogisticRegression(C=1.0).fit(X_train_transformed, Y_train)
score_logRegRidge1 = FittedLogRegModelRidge.score(X_test_transformed, Y_test)
print("The score of logistic regression, with ridge regularization & C=1, is:", score_logRegRidge1)

The score of logistic regression, with ridge regularization & C=1, is: 0.958041958042
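
The same scale-then-fit sequence can also be written as a single estimator. The sketch below assumes scikit-learn's `make_pipeline` helper; the names `ridge_pipeline` and `pipeline_accuracy` are illustrative. The exact accuracy can differ from the value printed above, depending on the scikit-learn version and its default solver.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting the pipeline fits the scaler on the training data only,
# then fits the L2-penalized (default) logistic regression on the scaled data.
ridge_pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
ridge_pipeline.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test data before predicting
pipeline_accuracy = ridge_pipeline.score(X_test, y_test)
print("Test accuracy:", pipeline_accuracy)
```

Bundling the scaler with the model means the test data is always transformed with the training-set statistics, without a separate `transform` call.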


### B) For the same problem, select the best tuning parameter for ridge-regularized logistic regression using 5-fold cross-validation. Try the following values for the tuning parameter: [0.01, 0.1, 1, 10, 100]. Report the best value in this set and the test accuracy of the model fitted with it.

In [5]:
#write your code here
from sklearn.model_selection import cross_val_score

best_score = 0 # initialize the best_score to zero which will then be updated

kfolds=5 # set the number of folds

X_trainval= X_train_transformed
Y_trainval=Y_train

for c in [0.01, 0.1, 1, 10, 100]: #iterate over the values we need to try for the parameter
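    # for each value of c, create an (unfitted) model;
    # cross_val_score fits a fresh copy of it on each training fold
    logRegModel = LogisticRegression(C=c)
    # perform k-fold cross-validation on the training data
    scores = cross_val_score(logRegModel, X_trainval, Y_trainval, cv=kfolds)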
    
    # compute mean cross-validation accuracy
    score = np.mean(scores)
    
    # if we got a better score, store the score and parameters
    if score > best_score:
        best_score = score
        best_parameters = c
        
# rebuild a model on the combined training and validation set
SelectedLogRegModel = LogisticRegression(C=best_parameters).fit(X_trainval, Y_trainval)

test_score = SelectedLogRegModel.score(X_test_transformed, Y_test)
# best_score is the mean 5-fold cross-validation accuracy for the selected C
print("Best score on validation set is:", best_score)
print("Best parameter for regularization (C) is: ", best_parameters)
print("Test set score with best C parameter is", test_score)



Best score on validation set is: 0.983556120122
Best parameter for regularization (C) is:  0.1
Test set score with best C parameter is 0.965034965035
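
The manual loop above can also be expressed with `GridSearchCV`. The sketch below is one possible variant, not the exercise's required solution: it wraps the scaler and the model in a pipeline so that the scaler is refit inside each cross-validation fold, and it sets `max_iter=1000` as an assumption to reduce the chance of convergence warnings at large `C`. The names `search` and `param_grid` are illustrative, and the selected `C` and scores may differ from those printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# make_pipeline names each step after its lowercased class name,
# so the C parameter is addressed as "logisticregression__C".
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation over the grid; by default the best
# configuration is refit on the whole training set afterwards.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["logisticregression__C"])
print("Best mean cross-validation accuracy:", search.best_score_)
print("Test accuracy with best C:", search.score(X_test, y_test))
```

Because the scaler sits inside the pipeline, each validation fold is scaled with statistics from its own training folds only, which avoids the small leakage that comes from scaling the full training set before cross-validation.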
