## Breast Cancer Wisconsin (Diagnostic) Data Set
**************

1) This dataset describes patients whose breast masses were diagnosed as one of two kinds: a) Malignant or b) Benign <br>
2) The features are characteristics of the cell nuclei, computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. <br>
3) Ten real-valued features are computed for each cell nucleus as follows:
    
   - radius (mean of distances from center to points on the perimeter)
   - texture (standard deviation of gray-scale values) 
   - perimeter 
   - area 
   - smoothness (local variation in radius lengths) 
   - compactness (perimeter^2 / area - 1.0)
   - concavity (severity of concave portions of the contour) 
   - concave points (number of concave portions of the contour)
   - symmetry 
   - fractal dimension ("coastline approximation" - 1)
    

4) Mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. 
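As a quick sanity check on the arithmetic (10 measurements x 3 statistics = 30 features), here is a small sketch that builds the expected column names. The `_mean` suffix appears in this notebook; the `_se` and `_worst` suffixes are an assumption about how the other two statistics are named in the file.

```python
# The ten base measurements listed above, spelled the way this notebook's
# column names spell them (for example 'concave points_mean').
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# '_mean' is used later in this notebook; '_se' and '_worst' are assumed names.
suffixes = ["mean", "se", "worst"]

feature_columns = [f"{name}_{stat}" for stat in suffixes for name in base_features]

print(len(feature_columns))   # 10 measurements x 3 statistics = 30
print(feature_columns[:3])
```

Comparing a list like this against `data.columns` is one way to confirm that nothing beyond the features, the identifier and the target has slipped into the file.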

In [None]:
# Importing Libraries:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [None]:
# Reading the file
data = pd.read_csv('../input/data.csv')

### I: Data Wrangling

#### I-a. Spot-checking the data:

In [None]:
# Overall view of the data:
data.info()

In [None]:
# Checking the first few rows:
data.head()

In [None]:
# Target Variable:
data.diagnosis.unique()

#### I-b. Summary of Numeric Columns:

In [None]:
data.describe()

* There are no null values in the feature columns.
* ***id*** and ***Unnamed: 32*** are not needed for modeling, so we will drop them.
* There are two outcomes: **Benign Tumor** (which stays localized and does not spread to other parts of the body) and **Malignant Tumor** (which can invade nearby tissue and spread throughout the body via the blood).

#### I-c. Some operations:

In [None]:
# Dropping some of the unwanted variables:
data.drop('id',axis=1,inplace=True)
data.drop('Unnamed: 32',axis=1,inplace=True)

In [None]:
# Binarizing the target variable:
data['diagnosis'] = data['diagnosis'].map({'M':1,'B':0})

**Important**: The features are on very different scales, so features with a small numeric range risk being dominated by those with a large range. We will scale the data so that every feature can contribute to the prediction. <br>
**Here we are standardizing the dataset, meaning each feature is shifted and rescaled to have a mean of zero and a standard deviation of one (unit variance).**
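To make the standardization step concrete, here is a minimal sketch of the same idea done by hand with NumPy. The toy matrix is made up for illustration; the formula (subtract the column mean, divide by the column's population standard deviation) is, as far as we know, what `preprocessing.scale` applies column by column.

```python
import numpy as np

# Hypothetical toy matrix: two features on very different scales
# (think of an area-like value next to a smoothness-like value).
X = np.array([[1001.0, 0.118],
              [1326.0, 0.085],
              [1203.0, 0.110],
              [386.1, 0.143]])

# Standardize each column: subtract its mean, divide by its standard deviation.
# NumPy's default ddof=0 gives the population standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After this transformation both columns live on the same scale, so a distance-based model such as kNN or an SVM no longer sees the large-valued column as the only one that matters.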

In [None]:
datas = pd.DataFrame(preprocessing.scale(data.iloc[:,1:32]))
datas.columns = list(data.iloc[:,1:32].columns)
datas['diagnosis'] = data['diagnosis']

### Doing some EDA:

In [None]:
#Looking at the number of patients with Malignant and Benign Tumors:
datas.diagnosis.value_counts().plot(kind='bar', alpha = 0.5, facecolor = 'b', figsize=(12,6))
plt.title("Diagnosis (M=1 , B=0)", fontsize = '18')
plt.ylabel("Total Number of Patients")
plt.grid(True)  # passed positionally; the `b` keyword is not accepted by newer matplotlib versions

* About 63% of the patients (357 of 569) had a benign tumor, while the rest had a malignant one.

#### Considering only mean features of nucleus

In [None]:
data.columns

In [None]:
data_mean = data[['diagnosis','radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean', 'compactness_mean', 'concavity_mean','concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]

#### We will see how these features correlate with the diagnosis using a heatmap:

In [None]:
plt.figure(figsize=(14,14))
foo = sns.heatmap(data_mean.corr(), vmax=1, square=True, annot=True)

* **radius_mean, perimeter_mean, area_mean, compactness_mean, concavity_mean, concave points_mean** show high correlation with the **diagnosis**.
* The other variables do not show a strong relationship with the diagnosis.
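For readers who want to see what a single cell of the heatmap measures, here is a small sketch of the Pearson correlation between a feature and a 0/1 diagnosis. The arrays are hypothetical toy values, not rows from the dataset, and `pearson` is a helper written only for this illustration.

```python
import numpy as np

# Hypothetical toy data: a 0/1 diagnosis and two made-up features.
diagnosis = np.array([0, 0, 0, 1, 1, 1])
feature_a = np.array([1.0, 2.0, 1.5, 4.0, 5.0, 4.5])   # larger for diagnosis = 1
feature_b = np.array([3.0, 1.0, 2.0, 2.0, 3.0, 1.0])   # no relation to diagnosis

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_a = pearson(feature_a, diagnosis)
r_b = pearson(feature_b, diagnosis)
print(round(r_a, 2), round(r_b, 2))
```

A value near +1 or -1 means the feature moves with the diagnosis, while a value near 0 means there is no linear relationship. Note that correlation only captures linear association, so a low value does not prove a feature is useless to a non-linear model.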

### Bivariate Exploration:

In [None]:
_ = sns.swarmplot(y='perimeter_mean',x='diagnosis', data=data_mean)
plt.show()

In [None]:
# from pandas.tools.plotting import scatter_matrix

# p = sns.PairGrid(datas.ix[:,20:32], hue = 'diagnosis', palette = 'Reds')
# p.map_upper(plt.scatter, s = 20, edgecolor = 'w')
# p.map_diag(plt.hist)
# p.map_lower(sns.kdeplot)
# p.add_legend()

# p.figsize = (30,30)

### Setting up the train and test data:

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics

# Column 0 is the target; columns 1-10 are the ten mean features (including radius_mean):
predictors = data_mean.columns[1:11]
target = "diagnosis"

X = data_mean.loc[:,predictors]
y = np.ravel(data.loc[:,[target]])

# Split the dataset in train and test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print ('Shape of training set : %i || Shape of test set : %i' % (X_train.shape[0],X_test.shape[0]) )
print ('The dataset is very small so simple cross-validation approach should work here')
print ('There are very few data points so 10-fold cross validation should give us a better estimate')
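Since the models that follow all rely on 10-fold cross-validation, here is a rough sketch of what the splitting step does. `k_fold_indices` is a hypothetical helper written for illustration, and the sample count is a placeholder. Note that for classifiers scikit-learn also stratifies the folds by class, which this simplified version does not.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# 455 is a placeholder training-set size, not a value read from the output above.
splits = list(k_fold_indices(n_samples=455, n_folds=10))
for train_idx, val_idx in splits[:2]:
    print(len(train_idx), len(val_idx))
```

Each model is fitted ten times, each time on nine folds and scored on the remaining one. The mean of those ten scores is what the cells below print.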

#### Logistic Regression Model:

* This technique is widely used in the medical field for binary classification problems.
* First, we will fit the model on all the mean features against our target.
* Later, we can restrict the model to the features that showed correlation in our heatmap.

In [None]:
# Importing the model:
from sklearn.linear_model import LogisticRegression

# Initiating the model:
lr = LogisticRegression()

scores = cross_val_score(lr, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

#### SVM:

In [None]:
# Importing the model:
from sklearn.svm import SVC

# Initiating the model (named svc so that the sklearn svm module is not shadowed):
svc = SVC()

scores = cross_val_score(svc, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

* We will try to tune the hyperparameters and fit the model with other kernels.
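One possible way to do that tuning is a grid search over kernels and the `C` parameter. This is only a sketch: it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in for `data.csv`, and the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data.csv. Its target is coded
# 0 = malignant, 1 = benign, the reverse of the mapping used in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling sits inside the pipeline so each CV fold is scaled using only
# the statistics of its own training part.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step 'svc'.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_ * 100, 2))
```

SVMs are sensitive to feature scale, which is why the scaler is part of the pipeline here instead of being applied once to the whole dataset.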

#### kNN:

In [None]:
# Importing the model:
from sklearn.neighbors import KNeighborsClassifier

# Initiating the model:
knn = KNeighborsClassifier()

scores = cross_val_score(knn, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

#### Perceptron:

- The perceptron is a binary linear classification algorithm that decides, based purely on the input (a vector of numbers), whether it belongs to a specific class or not.
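To show what happens inside, here is a minimal sketch of the classic perceptron update rule written with NumPy. It is an illustration on hypothetical toy data, not scikit-learn's implementation, and `train_perceptron` is a name made up for this example.

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Classic perceptron rule; y must be coded as -1 / +1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # A point is misclassified when its signed score is not positive.
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += lr * y_i * x_i   # nudge the boundary toward the point
                b += lr * y_i
                mistakes += 1
        if mistakes == 0:             # a full pass with no errors: converged
            break
    return w, b

# Hypothetical, linearly separable toy data.
X_toy = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w, b = train_perceptron(X_toy, y_toy)
pred = np.where(X_toy @ w + b > 0, 1, -1)
print(pred)
```

The weights change only when a point is misclassified, so on linearly separable data the loop stops once a full pass produces no mistakes. On data that is not separable it never settles, which is one reason a perceptron can score lower than the other models here.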

In [None]:
# Importing the model:
from sklearn.linear_model import Perceptron

# Initiating the model:
pct = Perceptron()

scores = cross_val_score(pct, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

#### Random Forest Model:

In [None]:
# Importing the model:
from sklearn.ensemble import RandomForestClassifier

# Initiating the model:
rf = RandomForestClassifier()

scores = cross_val_score(rf, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

#### Naive Bayes:

In [None]:
# Importing the model:
from sklearn.naive_bayes import GaussianNB

# Initiating the model:
nb = GaussianNB()

scores = cross_val_score(nb, X_train, y_train, scoring='accuracy' ,cv=10).mean()

print("The mean accuracy with 10 fold cross validation is %s" % round(scores*100,2))

#### Logistic Regression, Random Forest, Naive Bayes and kNN look to perform better. Let's try to fine-tune the parameters and see if we can get any improvement.

#### Starting with k-Nearest Neighbors:

#### The default number of neighbors is 5. However, let's try running kNN for different numbers of neighbors:

In [None]:
for i in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors = i)
    score = cross_val_score(knn, X_train, y_train, scoring='accuracy' ,cv=10).mean()
    print("N = " + str(i) + " :: Score = " + str(round(score,2)))

#### The default number of trees is 100 in scikit-learn 0.22 and later (it was 10 in earlier versions). However, let's try running Random Forest for different numbers of trees:

In [None]:
for i in range(1, 21):
    rf = RandomForestClassifier(n_estimators = i)
    score = cross_val_score(rf, X_train, y_train, scoring='accuracy' ,cv=10).mean()
    print("N = " + str(i) + " :: Score = " + str(round(score,2)))

#### It looks like 18 trees should give a reasonable estimate on the test data. Let us try Random Forest and Naive Bayes on our test dataset and finalize our model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initiating the model:
rf = RandomForestClassifier(n_estimators=18)

rf = rf.fit(X_train, y_train)

predicted = rf.predict(X_test)

acc_test = metrics.accuracy_score(y_test, predicted)

print ('The accuracy on test data is %s' % (round(acc_test,2)))

In [None]:
from sklearn.naive_bayes import GaussianNB

# Initiating the model:
nb = GaussianNB()

nb = nb.fit(X_train, y_train)

predicted = nb.predict(X_test)

acc_test = metrics.accuracy_score(y_test, predicted)

print ('The accuracy on test data is %s' % (acc_test))
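Accuracy alone does not say which kind of mistake a model makes, and for a diagnosis task a missed malignant tumor is usually worse than a false alarm. The `confusion_matrix` function imported at the top can break the errors down. The labels below are hypothetical toy values, not output from the models above; in the notebook one would pass `y_test` and `predicted` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = malignant, 0 = benign).
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes and columns are predictions, with labels sorted as [0, 1],
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of malignant cases that were caught
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(tn, fp, fn, tp)
print(accuracy, recall, precision)
```

Two models with the same accuracy can differ a lot in recall, so comparing these numbers for Random Forest and Naive Bayes may change which one looks preferable.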

#### More to come. Please comment if you have any suggestions or advice. Anything would be appreciated!