# Introduction
## The dataset
The Breast Cancer (Wisconsin) Diagnosis dataset contains the diagnosis and a set of 30 features describing the characteristics of the cell nuclei present in the digitized image of a fine needle aspirate (FNA) of a breast mass.

Ten real-valued features are computed for each cell nucleus:

radius (mean of distances from center to points on the perimeter);

texture (standard deviation of gray-scale values);

perimeter;

area;

smoothness (local variation in radius lengths);

compactness (perimeter^2 / area - 1.0);

concavity (severity of concave portions of the contour);

concave points (number of concave portions of the contour);

symmetry;

fractal dimension (“coastline approximation” - 1).
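
To make the compactness formula concrete, here is a small sketch that applies it to two simple shapes. The helper name `compactness` and the example shapes are my own illustration and are not part of the dataset.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle of radius 2 (hypothetical example shape)
r = 2.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square of side 2: perimeter 8, area 4 (hypothetical example shape)
square = compactness(8.0, 4.0)

print(round(circle, 3), square)
```

A circle gives the smallest possible value of perimeter^2 / area, so less regular contours produce larger compactness values.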

The mean, standard error (SE), and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. We will analyze the features to understand their predictive value for diagnosis. We will then create models using three different algorithms (logistic regression, decision tree, and random forest) and use them to predict the diagnosis.
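
As a rough sketch of how the three summaries relate, the snippet below computes them for one feature from a made-up list of per-nucleus values. The numbers and the helper name `summarize_feature` are hypothetical, and I am assuming the usual standard error of the mean (sample standard deviation divided by the square root of n). The dataset ships these summaries already computed.

```python
import math
import statistics

def summarize_feature(values):
    """Return (mean, standard error, worst) for one feature of one image.

    'worst' follows the dataset description: the mean of the three largest values.
    """
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(len(values))
    worst = statistics.mean(sorted(values, reverse=True)[:3])
    return mean, se, worst

# Hypothetical radius measurements for the nuclei in a single image
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize_feature(radii)
print(mean, se, worst)
```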

## Fine needle aspiration
Fine-needle aspiration (FNA) is a diagnostic procedure used to investigate lumps or masses. In this technique, a thin (23–25 gauge), hollow needle is inserted into the mass for sampling of cells that, after being stained, will be examined under a microscope (biopsy). Fine-needle aspiration biopsies are very safe minor surgical procedures.


In this notebook, I will train a ***logistic regression*** model, alongside a decision tree and a random forest for comparison, using the Breast Cancer dataset to predict whether a tumor is benign or malignant. This is a *binary classification problem*.

## Read the data

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
from itertools import chain
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score
import warnings
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

plt.rcParams['figure.figsize'] = 8, 5
plt.style.use('ggplot')

warnings.filterwarnings('ignore') #ignore warning messages 

In [None]:
import pandas as pd 

raw_df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
raw_df

The dataset contains 569 rows and 33 columns, with both numeric and categorical columns. Our objective is to create a model that predicts the value in the column 'diagnosis'.

## Missing values

Let's check the data types and missing values in the various columns.

In [None]:
raw_df.info()

In [None]:
raw_df.describe()

In [None]:
#Drop the column with all missing values
raw_df = raw_df.dropna(axis=1)
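
Note that `dropna(axis=1)` with its default `how='any'` removes every column that has at least one missing value. A minimal sketch on a toy frame (the column names and values are made up) shows how to check which columns are entirely empty and drop only those:

```python
import numpy as np
import pandas as pd

# Toy frame with one completely empty column (hypothetical data)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [14.2, 20.1, 11.8],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
all_missing = toy.columns[toy.isna().all()].tolist()
print(all_missing)

# how="all" drops a column only when all of its values are missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Using `how="all"` is a more conservative choice when a partially filled column should be kept for imputation instead of being dropped.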

In [None]:
#get the new number of rows and columns
raw_df.shape

In [None]:
#get a count of the number of Malignant (M) or Benign(B) cells
raw_df['diagnosis'].value_counts()


In [None]:
#Visualize the count through seaborn library
sns.countplot(x='diagnosis', data=raw_df)

# Distribution of the features

In [None]:
features = ['radius','texture','perimeter','area','smoothness','compactness','concavity','concave points','symmetry','fractal_dimension']

for feature in features:
    print("{} distribution".format(feature))
    sns.boxplot(data=raw_df[['{}_mean'.format(feature), '{}_se'.format(feature), '{}_worst'.format(feature)]])
    plt.title('Distribution of {}'.format(feature))
    plt.show()

In [None]:
#print the first 5 rows of the new data
raw_df.head(5)

In [None]:
#get the correlation of the ten mean-feature columns (diagnosis is still text here, so it is left out)
raw_df.iloc[:,2:12].corr()

# Correlation of the variables

In [None]:
print('Pairplot')
sns.pairplot(data=raw_df[['diagnosis','area_mean','texture_mean','smoothness_mean','concavity_mean','symmetry_mean']], hue="diagnosis", height=3, diag_kind="hist")
plt.show()

In [None]:
#Visualize the correlation
plt.figure(figsize=(12,10))
sns.heatmap(raw_df.iloc[:,2:12].corr(), annot=True, fmt='.0%')

In [None]:
#Encode the categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
#assign by column name so the column takes the integer dtype of the encoded labels
raw_df['diagnosis'] = labelencoder_Y.fit_transform(raw_df['diagnosis'])

raw_df.iloc[:,1]
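
To see which integer each label receives, here is a small sketch on a made-up list of diagnoses. `LabelEncoder` sorts the classes, so 'B' is encoded before 'M'. The variable names are my own.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Hypothetical diagnoses, not rows from the dataset
encoded = encoder.fit_transform(["M", "B", "B", "M"])

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)

# inverse_transform recovers the original labels from the codes
print(encoder.inverse_transform([0, 1]))
```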

## Create Model for prediction 

In [None]:
#split the data set into independent(X) and dependent (Y) data sets 
X= raw_df.iloc[:,2:32].values  #all 30 feature columns (2:31 would leave out the last one)
Y= raw_df.iloc[:,1].values



In [None]:
#split the dataset into 75% training and 25% testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.25, random_state =0)

In [None]:
#scale the data(feature scaling)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  #reuse the mean and variance learned from the training set
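
Scaling parameters are normally learned from the training split only and then reused on the test split. A toy sketch with made-up numbers shows how the result differs when the scaler is refit on the test data instead:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature splits with very different ranges
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0], [20.0]])

# Fit on the training split, then apply the same mean and scale to the test split
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Refitting on the test split hides the shift between the two splits
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())
```

With the refit scaler, the test values are centered around zero no matter how far they sit from the training data, so the model sees inputs on a different scale than it was trained on.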

In [None]:
#create a function for the models
def models(X_train, Y_train):
    #Logistic Regression 
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    
    #Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state=0)
    tree.fit(X_train, Y_train)
    
    #Random Forest Classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators= 10, criterion ='entropy', random_state = 0)
    forest.fit(X_train, Y_train)
    
    #Print the models accuracy on the training data
    print('[0]Logistic Regression Training Accuracy: ', log.score(X_train, Y_train))
    print('[1]Decision Tree Classifier Training Accuracy: ', tree.score(X_train, Y_train))
    print('[2]Random Forest Classifier Training Accuracy: ', forest.score(X_train, Y_train))
    
    return log, tree, forest

In [None]:
#Getting all of the models
model = models(X_train, Y_train)

In [None]:
#Test model accuracy on the test data using the confusion matrix
from sklearn.metrics import confusion_matrix 

for i in range( len(model)):
    print('MODEL', i)
    
    cm = confusion_matrix(Y_test, model[i].predict(X_test))
    #sklearn puts true labels on rows and predicted labels on columns (0 = benign, 1 = malignant)
    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]

    print(cm)
    print('Testing Accuracy = ', (TP+TN)/(TP+TN+FN+FP))
    print()
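
For a binary problem, the four cells of the confusion matrix can also be unpacked in one step. The sketch below uses made-up labels, with 1 standing for malignant, to show the layout that `confusion_matrix` returns:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = benign, 1 = malignant)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are true labels and columns are predictions, so ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()

print(cm)
print(tn, fp, fn, tp, accuracy)
```

In a screening setting the false negatives (malignant cases predicted as benign) are usually the count to watch, which is why it helps to name the cells explicitly.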

In [None]:
#show another way to get metrics of the models
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
for i in range( len(model)):
    print('MODEL', i)
    print(classification_report(Y_test, model[i].predict(X_test)))
    print(accuracy_score(Y_test, model[i].predict(X_test)))


In [None]:
#Print the prediction of Random Forest Classifier Model
pred = model[2].predict(X_test)
print(pred)
print()
print(Y_test)