## Logistic Regression

The Breast Cancer data from [the UCI repository](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) contains records of observed tumors.
Each record has a number of measurements and a categorisation in the `class` column: 2 for benign (good), 4 for malignant (bad).  Your task is to build a logistic regression model to classify these cases. 

The data is provided as a CSV file.  In a small number of cases no value is available; these are indicated in the data with `?`. I have used the `na_values` keyword for `read_csv` so that these are read as `NaN` (Not a Number).  Your first task is to decide what to do with these rows. You could just drop them or you could [impute the missing values from the other data](http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values).
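
The two options can be compared side by side. The sketch below is a minimal illustration on a small made-up frame (the values are invented for the example and are not taken from the dataset). It assumes scikit-learn's `SimpleImputer` with a median strategy, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "bare_nuclei": [1.0, 10.0, np.nan, 4.0, np.nan],
    "mitoses": [1, 1, 2, 1, 3],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(dropped.shape)  # fewer rows than the original, no NaN left
print(imputed.shape)  # same number of rows as the original
```

Dropping is simpler but discards data. If you impute, it is usually safer to fit the imputer on the training rows only so that no information from the test set leaks into the model.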

You then need to follow the procedure outlined in the lecture for generating a train/test set, building and evaluating a model. Your goal is to build the best model possible over this data.   Your first step should be to build a logistic regression model using all of the features that are available.
  

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE

In [2]:
import os
os.chdir("C:\\Users\\suhas\\Documents\\GitHub\\ITEC657-Workshop-Repository\\data")

#bcancer = pd.read_csv("files/breast-cancer-wisconsin.csv", na_values="?")
#bcancer.head()

In [24]:
df = pd.read_csv('6.csv', na_values="?")
df.head()

Unnamed: 0,sample_code_number,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2


In [25]:
df.columns

Index(['sample_code_number', 'clump_thickness', 'uniformity_cell_size',
       'uniformity_cell_shape', 'marginal_adhesion',
       'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin',
       'normal_nucleoli', 'mitoses', 'class'],
      dtype='object')

In [26]:
df.shape

(699, 11)

In [27]:
df.count()

sample_code_number             699
clump_thickness                699
uniformity_cell_size           699
uniformity_cell_shape          699
marginal_adhesion              699
single_epithelial_cell_size    699
bare_nuclei                    683
bland_chromatin                699
normal_nucleoli                699
mitoses                        699
class                          699
dtype: int64

In [28]:
# Examine the data, look at the statistical summary etc.
print(df.describe())

       sample_code_number  clump_thickness  uniformity_cell_size  \
count        6.990000e+02       699.000000            699.000000   
mean         1.071704e+06         4.417740              3.134478   
std          6.170957e+05         2.815741              3.051459   
min          6.163400e+04         1.000000              1.000000   
25%          8.706885e+05         2.000000              1.000000   
50%          1.171710e+06         4.000000              1.000000   
75%          1.238298e+06         6.000000              5.000000   
max          1.345435e+07        10.000000             10.000000   

       uniformity_cell_shape  marginal_adhesion  single_epithelial_cell_size  \
count             699.000000         699.000000                   699.000000   
mean                3.207439           2.806867                     3.216023   
std                 2.971913           2.855379                     2.214300   
min                 1.000000           1.000000                    

In [29]:
# deal with the NaN values in the data
df.isna().sum()

sample_code_number              0
clump_thickness                 0
uniformity_cell_size            0
uniformity_cell_shape           0
marginal_adhesion               0
single_epithelial_cell_size     0
bare_nuclei                    16
bland_chromatin                 0
normal_nucleoli                 0
mitoses                         0
class                           0
dtype: int64

In [30]:
df = df.dropna()

In [31]:
df.shape  # shape after dropping the rows with NaN values from the original data frame

(683, 11)

In [32]:
# Build your first model - defining training and test data sets then use Logistic Regression to build a model
df = df.drop(['sample_code_number'], axis=1)

In [33]:
df.head()

Unnamed: 0,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,5,1,1,1,2,1.0,3,1,1,2
1,5,4,4,5,7,10.0,3,2,1,2
2,3,1,1,1,2,2.0,3,1,1,2
3,6,8,8,1,3,4.0,3,7,1,2
4,4,1,1,3,2,1.0,3,1,1,2


In [35]:
df1 = ['class']  # name of the target column

In [36]:
new_df = df[df1]  # target as a single-column DataFrame

In [37]:
df2 = df.drop(['class'], axis = 1)

In [38]:
df2.head()

Unnamed: 0,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses
0,5,1,1,1,2,1.0,3,1,1
1,5,4,4,5,7,10.0,3,2,1
2,3,1,1,1,2,2.0,3,1,1
3,6,8,8,1,3,4.0,3,7,1
4,4,1,1,3,2,1.0,3,1,1


In [39]:
new_df.head()

Unnamed: 0,class
0,2
1,2
2,2
3,2
4,2


In [40]:
da = pd.read_csv('6.csv', na_values="?")
da.head()

Unnamed: 0,sample_code_number,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2


In [41]:
da = da.dropna()

In [65]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(da, random_state = 142, test_size=0.2)
print(test.shape)
print(train.shape)

(137, 11)
(546, 11)


In [66]:
X = train.drop(['sample_code_number','class'], axis =1)
y = train['class']
X1 = test.drop(['sample_code_number', 'class'], axis = 1)
y1 = test['class']
print(X.shape)
print(y.shape)
print(X1.shape)
print(y1.shape)

(546, 9)
(546,)
(137, 9)
(137,)


In [50]:
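# Linear regression fitted to the numeric class codes (2 and 4);
# its predictions are continuous values, not class labels
reg = linear_model.LinearRegression()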
X = train.drop(['sample_code_number', 'class'], axis = 1)
y = train['class']
reg.fit(X, y)
print("y = x *", reg.coef_, "+", reg.intercept_)

y = x * [0.06329935 0.03445854 0.04200182 0.01886785 0.01659464 0.08779989
 0.04214361 0.02728742 0.01069022] + 1.5009873552962198


In [51]:
predicted = reg.predict(X1)
mse = ((np.array(y1)-predicted)**2).sum()/len(y1)
r2 = r2_score(y1, predicted)
print("MSE:", mse)
print("R Squared:", r2)

MSE: 0.13836562456343035
R Squared: 0.858920881821435


In [67]:

model = LogisticRegression()
model.fit(X, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### Evaluation

To evaluate a classification model we want to look at how many cases were correctly classified and how many
were in error.  In this case we have two outcomes: benign and malignant.  scikit-learn has some useful tools: the 
[accuracy_score](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) function gives a score from 0 to 1 for the proportion of cases classified correctly, and the 
[confusion_matrix](http://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) function 
shows how many were classified correctly and what errors were made.  Use these to summarise the performance of 
your model (these functions have already been imported above).
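
As a quick illustration of how the two functions behave, here is a small sketch on made-up labels that use the same coding as the dataset (2 = benign, 4 = malignant). The label lists are invented for the example; passing `labels=[2, 4]` just makes the row and column order explicit.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant
y_true = [2, 2, 2, 4, 4, 4, 2, 4]
y_pred = [2, 2, 4, 4, 4, 2, 2, 4]

# Proportion of predictions that match the true labels
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])

# With malignant (4) treated as the positive class, the four cells are:
tn, fp, fn, tp = cm.ravel()

print(acc)
print(cm)
```

Reading the matrix row by row: the first row counts benign cases (predicted benign, predicted malignant) and the second row counts malignant cases in the same column order.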

In [68]:
# Evaluate the model
y_predict = model.predict(X1)
print(len(y_predict))
print(y_predict)

137
[2 4 2 2 4 4 2 4 2 4 4 2 2 4 2 2 2 2 2 2 2 2 4 2 2 2 4 2 2 4 4 2 4 4 4 2 2
 4 2 2 2 4 2 4 2 4 2 4 2 2 2 4 2 2 4 2 4 4 4 2 2 4 2 2 4 2 4 2 2 4 2 4 2 2
 2 2 2 4 2 2 4 4 2 2 4 2 2 4 2 4 2 2 2 2 2 4 2 2 2 2 4 4 4 2 2 2 2 2 4 2 2
 4 2 4 2 2 4 4 2 2 2 2 4 4 2 2 4 2 4 2 2 2 2 4 2 4 4]


In [69]:
y1

280    2
232    2
369    2
563    2
491    4
320    4
327    2
270    4
63     4
187    4
50     4
632    2
45     2
304    4
573    2
561    2
679    2
135    2
13     2
384    2
169    2
527    2
167    4
198    2
671    2
502    2
255    4
538    2
470    2
105    4
      ..
589    2
523    4
143    2
602    2
425    4
609    2
218    4
72     2
281    2
279    4
124    4
672    2
510    2
347    2
4      2
247    4
122    4
354    2
667    2
177    4
365    2
233    4
615    2
419    2
432    2
645    2
353    4
307    2
126    4
67     4
Name: class, Length: 137, dtype: int64

In [70]:
# Calculate the accuracy score on the test set
print(accuracy_score(y1, y_predict))

0.9635036496350365


In [71]:
confusion_matrix(y1, y_predict) #confusion_matrix

array([[83,  2],
       [ 3, 49]], dtype=int64)
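
The counts in this matrix can be turned into rates by hand. The sketch below copies the matrix from the output above and treats malignant (class 4) as the positive class, which is an assumption of this example; scikit-learn orders the rows and columns by sorted label, so the first row is class 2.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class 2, 4; columns: predicted class 2, 4)
cm = np.array([[83, 2],
               [3, 49]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
specificity = tn / (tn + fp)  # share of benign cases correctly cleared

print(f"false positives: {fp}, false negatives: {fn}")
print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
```

Under this reading there are 2 false positives (benign predicted as malignant) and 3 false negatives (malignant predicted as benign), and the accuracy agrees with the `accuracy_score` value printed earlier.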

### Feature Selection

Since you have many features available, one part of building the best model will be to select which features to use as input to the classifier. Your initial model used all of the features but it is possible that a better model can 
be built by leaving some of them out.   Test this by building a few models with subsets of the features - how do your models perform? 

This process can be automated.  The [sklearn RFE class](http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination) implements __Recursive Feature Elimination__, which repeatedly fits the model and removes 
the weakest feature (the one with the smallest coefficient) until the target number of features remains.  Use RFE to select features for a model with 3, 4 and 5 features - can you build a model that is as good or better than your initial model?
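
One way to run this comparison is to loop over the target feature counts rather than repeating the cell. The following is a minimal sketch on synthetic data from `make_classification` (a randomly generated stand-in with 9 features, not the tumor data); the names `X_demo`, `y_demo` and `results` are hypothetical, and `max_iter=1000` is just a generous setting chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 9 features, randomly generated
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

results = {}
for k in (3, 4, 5):
    # Keep the k highest-ranked features, removing one feature per step
    selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=k, step=1)
    selector.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, selector.predict(X_te))
    results[k] = (selector.support_.copy(), acc)
    print(k, selector.support_, round(acc, 3))
```

With the real data you would pass the training and test frames from the earlier cells instead, and `X.columns[selector.support_]` would give the names of the selected features.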

In [1]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
estimator = LogisticRegression()
# Keep the 5 highest-ranked features, eliminating one per step
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
y_predict = selector.predict(X1)
#print(accuracy_score(y1, y_predict))
print(selector.support_)
print(selector.ranking_)
col = X1.columns
col[selector.support_]

In [89]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
estimator = LogisticRegression()
# Keep the 4 highest-ranked features, eliminating one per step
selector = RFE(estimator, n_features_to_select=4, step=1)
selector = selector.fit(X, y)
y_predict = selector.predict(X1)
print(accuracy_score(y1, y_predict))
print(selector.support_)
print(selector.ranking_)
col = X1.columns
col[selector.support_]

0.9635036496350365
[ True  True False False False  True  True False False]
[1 1 4 2 6 1 1 3 5]


Index(['clump_thickness', 'uniformity_cell_size', 'bare_nuclei',
       'bland_chromatin'],
      dtype='object')

In [88]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
estimator = LogisticRegression()
# Keep the 3 highest-ranked features, eliminating one per step
selector = RFE(estimator, n_features_to_select=3, step=1)
selector = selector.fit(X, y)
y_predict = selector.predict(X1)
print(accuracy_score(y1, y_predict))
print(selector.support_)
print(selector.ranking_)
col = X1.columns
col[selector.support_]

0.9562043795620438
[False  True False False False  True  True False False]
[2 1 5 3 7 1 1 4 6]


Index(['uniformity_cell_size', 'bare_nuclei', 'bland_chromatin'], dtype='object')

In [87]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
estimator = LogisticRegression()
# Keep the 2 highest-ranked features, eliminating one per step
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)
y_predict = selector.predict(X1)
print(accuracy_score(y1, y_predict))
print(selector.support_)
print(selector.ranking_)
col = X1.columns
col[selector.support_]

0.948905109489051
[False  True False False False  True False False False]
[3 1 6 4 8 1 2 5 7]


Index(['uniformity_cell_size', 'bare_nuclei'], dtype='object')

In [86]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
estimator = LogisticRegression()
# Keep only the single highest-ranked feature, eliminating one per step
selector = RFE(estimator, n_features_to_select=1, step=1)
selector = selector.fit(X, y)
y_predict = selector.predict(X1)
print(accuracy_score(y1, y_predict))
print(selector.support_)
print(selector.ranking_)
col = X1.columns
col[selector.support_]

0.9124087591240876
[False  True False False False False False False False]
[4 1 7 5 9 2 3 6 8]


Index(['uniformity_cell_size'], dtype='object')

## Conclusion

Write a brief conclusion to your experiment.  You might comment on the proportion of __false positive__ and __false negative__ classifications your model makes.  How useful would this model be in a clinical diagnostic setting? 

#### The model misclassifies about 4% of the test cases (5 of 137: 2 false positives and 3 false negatives). Not all of the features are needed: the four-feature model chosen by RFE reaches the same test accuracy (0.9635) as the model using all nine features, and accuracy falls only gradually as more features are removed (0.9562 with three, 0.9489 with two, 0.9124 with one).
#### In a clinical diagnostic setting this is not accurate enough: each false negative is a malignant tumor reported as benign, so a usable model would need to be correct close to 99.999% of the time.

In [None]:
#import warnings
#warnings.filterwarnings("ignore", category=FutureWarning)