<h3> INTRODUCTION </h3>
In this kernel, I perform an exploratory data analysis of the Breast Cancer dataset.
I'll use a popular machine learning algorithm for classification called K-Nearest Neighbors (KNN).
I'm going to use this algorithm to build a model from patient records and the corresponding tumor measurements.
After training the model, I'll use it to predict the class of unseen data, that is, whether a tumor is malignant (M) or benign (B).

This kernel is mainly built for those who are new to **Data Analysis** and **Machine Learning**, so I'm going to explain each step in detail to make it easier for newcomers to follow.

First, let's start by importing the required libraries.

In [None]:
import numpy as np 
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

In [None]:
df = pd.read_csv("../input/data.csv")
df.head()

<h2>Data Wrangling</h2>

<h3>Identify and handle missing values</h3>

In our dataset, missing data might be marked with a question mark, "?". We will replace "?" with NaN (Not a Number), which is the default missing-value marker in pandas. The missing-data tools in pandas recognize NaN, so it is faster and more convenient to work with. Here we use the function:
.replace(A, B, inplace = True) 
to replace A with B.

In [None]:
# replace "?" to NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
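To make the replacement step concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and do not come from the Breast Cancer data. It also shows the extra conversion to numbers, which I assume is needed whenever a column contained "?" strings.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) where "?" stands in for missing entries
toy = pd.DataFrame({"radius": ["14.2", "?", "11.8"],
                    "texture": ["?", "20.1", "17.3"]})

# Non-inplace form: returns a new frame and leaves `toy` untouched
cleaned = toy.replace("?", np.nan)

# Columns that held "?" were stored as text, so convert them to numbers
cleaned = cleaned.astype(float)

print(cleaned)
```

Using the non-inplace form here keeps the original toy frame available for comparison.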

Next, we check whether there are any null or NaN values in our dataframe. There are two methods to detect missing values:
<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
Each method returns a dataframe of boolean values indicating whether each entry is missing (<b>.isnull()</b>) or present (<b>.notnull()</b>).

In [None]:
missing_data = df.isnull()
missing_data.head(5)

In [None]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    
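The loop above prints True/False counts column by column. As a more compact alternative, the sketch below sums the boolean mask to get one missing-value count per column. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few gaps
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [np.nan, np.nan, 6.0],
                    "c": [7.0, 8.0, 9.0]})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = toy.isnull().sum()

# Keep only the names of columns that actually have missing values
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_gaps)
```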

<h3> Dealing with missing data </h3>
In our dataframe, we can see that there are two columns with no data or with data of little value.
<br />These columns are **Unnamed: 32** and **id**. The Unnamed: 32 column is empty, while the id column is populated but its values are not useful for prediction.<br/>We are going to drop these two columns.

In [None]:
# Avoid naming the variable `list`, which would shadow Python's built-in type
cols_to_drop = ['Unnamed: 32', 'id']
df.drop(cols_to_drop, axis = 1, inplace = True)

In [None]:
df.head()

<h3> Data Normalization</h3>
Data normalization is the process of transforming the values of variables into a similar range.<br/>
Normalization can mean scaling a variable so that its mean is 0, scaling it so that its variance is 1, or scaling it to lie between 0 and 1.<br/>
<br/>
The features in this dataset have different ranges, so normalizing would be reasonable, but we won't change the ranges at this stage because the exploratory analysis does not need it. The features are standardized later, before training KNN.
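The three kinds of scaling mentioned above can be sketched with plain NumPy. The feature values below are hypothetical, and this is only an illustration of the formulas, not the preprocessing used later in this kernel.

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 6.0, 8.0])

# Mean-centering: the average becomes 0
centered = x - x.mean()

# Standardization (z-score): mean 0 and variance 1
z_scores = (x - x.mean()) / x.std()

# Min-max scaling: values lie between 0 and 1
min_max = (x - x.min()) / (x.max() - x.min())

print(centered, z_scores, min_max)
```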


<h3>Exploratory Data Analysis</h3>
Exploratory Data Analysis (EDA) refers to a group of investigative processes performed on the data to discover patterns or anomalies and to test initial hypotheses and assumptions.<br/>
This work is done with the help of summary statistics and graphical representations. EDA is a good way to understand the data and gain insight into it.<br/>
<br/>
To obtain descriptive statistics, we can use the **describe** function, which computes basic statistics for all continuous variables.<br/>
Some of the information that describe provides for each variable: <br/>
* count of non-missing values
* mean
* standard deviation
* minimum
* maximum<br/>
<br/>
Let's take a look at our features and their descriptive statistics.

In [None]:
df.describe().transpose()

By default, "describe" skips columns of the object data type. To apply describe to "object" columns, we have to pass 'include=['object']':
<br/>
As you can see in the result, describe gives the count of diagnosis values, the number of unique values, which here is two (B and M), and the most frequent value (top), which here is B, the majority class.

In [None]:
df.describe(include=['object'])
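To show what each row of that summary means, here is a small sketch on a made-up label column. The values are hypothetical, and I force the object dtype explicitly so the example behaves the same regardless of how pandas infers text columns.

```python
import pandas as pd

# Hypothetical labels; dtype=object is set explicitly for this illustration
toy = pd.DataFrame({"diagnosis": pd.Series(["B", "M", "B", "B", "M"], dtype="object")})

summary = toy.describe(include=["object"])

# Rows of the summary: count, unique, top (most frequent value), freq (how often top occurs)
print(summary)
```

Here `freq` is the number of times the `top` value appears, so it tells you the size of the majority class directly.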

'value_counts' returns a series with the number of samples in each class, which we can convert to a DataFrame with 'to_frame'.
<br/> Here are the numbers for Benign and Malignant:

In [None]:
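# Look the counts up by label instead of relying on the order of value_counts()
counts = df['diagnosis'].value_counts()
B, M = counts['B'], counts['M']
print('Number of Malignant: ', M)
print('Number of Benign: ', B)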

In [None]:
df['diagnosis'].value_counts().to_frame()

Next, we will visualize the data using the seaborn library.

In [None]:
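import seaborn as sns
sns.set(style="darkgrid")
# Pass the column by keyword: newer seaborn versions treat the first positional argument as `data`
ax = sns.countplot(x=df["diagnosis"], label="Count")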

In [None]:
import matplotlib.pyplot as plt

# Data to plot
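sizes = df['diagnosis'].value_counts()
# Derive the labels from the index so each label always matches its count
labels = sizes.index.map({'B': 'Benign', 'M': 'Malignant'}).tolist()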
colors = ['lightskyblue', 'orange']
explode= [0.4,0]
# Plot
plt.pie(sizes, explode=explode, labels=labels,radius= 1400 ,colors=colors,
autopct='%1.1f%%', shadow=True, startangle=90)

plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(7,7)
plt.show()

In [None]:
width = 12
height = 10



I'll perform KNN from here.


In [None]:
import itertools
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

The dataset we loaded earlier categorizes patients' tumors into two groups: benign and malignant. We can use the characteristic data of each tumor to predict its type. This is a classification problem: given a dataset with predefined labels, we need to build a model that predicts the class of a new or unknown case.

The example focuses on using characteristic data, such as texture, radius, area and so on, to predict the tumor type.

The target field, called __diagnosis__, has two possible values that correspond to the two tumor groups, as follows:
  1- Benign (B)
  2- Malignant (M)

Our objective is to build a classifier to predict the class of unknown cases. We will use a specific type of classification called K nearest neighbors.


In [None]:
df.columns

In [None]:
X= df[['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst']]
X[0:5]

In [None]:
y=df["diagnosis"].values
y[0:5]

## Normalize Data 
Data standardization gives the data zero mean and unit variance. It is good practice, especially for algorithms such as KNN that are based on the distance between cases:

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

### Train Test Split  
Out-of-sample accuracy is the percentage of correct predictions that the model makes on data that it has NOT been trained on. Training and testing on the same dataset will most likely give low out-of-sample accuracy, because the model is likely to be over-fit.

It is important that our models have high out-of-sample accuracy, because the purpose of any model is to make correct predictions on unknown data. So how can we improve out-of-sample accuracy? One way is to use an evaluation approach called Train/Test Split.
Train/Test Split involves splitting the dataset into training and testing sets, which are mutually exclusive. You then train with the training set and test with the testing set.

This provides a more accurate evaluation of out-of-sample accuracy because the testing set is not part of the data used to train the model. It is more realistic for real-world problems.


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
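To see what "mutually exclusive" means in practice, here is a minimal sketch on a tiny synthetic dataset (the arrays are made up for illustration). It uses the same call as above and checks that no sample ends up in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each, every row is unique
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["B", "M"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

# The first feature is unique per row, so it can act as a sample identifier
shared = set(X_tr[:, 0]) & set(X_te[:, 0])

print("Train set:", X_tr.shape, "Test set:", X_te.shape)
print("Samples in both sets:", len(shared))
```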

# Classification 

## K nearest neighbor (K-NN)

### Training

Let's start the algorithm with k=4 for now:

In [None]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

### Predicting
We can use the model to predict the test set:

In [None]:
yhat = neigh.predict(X_test)
yhat[0:5]
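For newcomers, it may help to see the idea behind KNN without the library. The sketch below is my own simplified version, assuming Euclidean distance and a plain majority vote. The function name `knn_predict` and the toy points are hypothetical, and scikit-learn's implementation differs in details such as tie handling and speed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical values)
X_small = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_small = np.array(["B", "B", "B", "M", "M", "M"])

print(knn_predict(X_small, y_small, np.array([1.1, 1.0])))  # prints B
print(knn_predict(X_small, y_small, np.array([5.1, 5.0])))  # prints M
```

Because the vote depends on raw distances, features with large ranges would dominate, which is why the data was standardized before training.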

### Accuracy evaluation
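In multilabel classification, the __accuracy classification score__ function computes subset accuracy. For a single-label problem like this one, it is simply the fraction of test samples whose predicted label matches the actual label. Older versions of scikit-learn also offered a jaccard_similarity_score function that returned the same value in the single-label case; it has since been removed from the library.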

In [None]:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
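The accuracy score is easy to reproduce by hand, which can be a useful sanity check. The sketch below compares a manual calculation with `metrics.accuracy_score` on made-up labels.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels: 4 of 5 predictions are correct
y_true = np.array(["B", "M", "B", "B", "M"])
y_pred = np.array(["B", "M", "M", "B", "M"])

# Fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = metrics.accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```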

#### What about other K?
K in KNN is the number of nearest neighbors to examine. It is supposed to be specified by the user. So, how do we choose the right K?
The general solution is to reserve a part of your data for testing the accuracy of the model. Then choose k=1, use the training part for modeling, and calculate the accuracy of prediction using all samples in your test set. Repeat this process, increasing k, and see which k is best for your model.

We can calculate the accuracy of KNN for different values of K.

In [None]:
Ks = 30
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1,Ks):

    # Train the model with n neighbors and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    # Standard error of the accuracy estimate over the test samples
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc
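The `std_acc` value in the loop is the standard error of the accuracy. As a sketch of why, the example below uses a made-up vector of hits and misses and shows that the loop's expression equals the binomial formula sqrt(p(1-p)/n), where p is the accuracy and n is the number of test samples.

```python
import numpy as np

# Hypothetical outcomes on 10 test samples: 1 = correct prediction, 0 = wrong
correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1], dtype=float)

n = correct.shape[0]
p = correct.mean()  # accuracy

# Same expression as in the loop above
std_err_loop = np.std(correct) / np.sqrt(n)

# Closed form for a proportion
std_err_formula = np.sqrt(p * (1 - p) / n)

print(p, std_err_loop, std_err_formula)
```

The band shrinks as the test set grows, so a small test set gives a noticeably uncertain accuracy estimate.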

#### Plot  model accuracy  for Different number of Neighbors 

In [None]:
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
# The shaded band is one standard error wide on each side, so label it accordingly
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

In [None]:
print("The best accuracy was", mean_acc.max(), "with k=", mean_acc.argmax()+1)