In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## **BREAST CANCER CLASSIFICATION**
### **Via SuperDataScience Team**

*   Breast cancer is the most common cancer among women worldwide, accounting for 25% of all cancer cases; it affected 2.1 million people in 2015.
*   Early diagnosis significantly increases the chances of survival.
*   The key challenge in cancer detection is classifying tumors as *malignant* or *benign*; machine learning techniques can dramatically improve the accuracy of diagnosis.
*   Research indicates that most experienced physicians can diagnose cancer with at most 79% accuracy.


**First stage**: A procedure that extracts some of the cells from the tumor.

When we say a tumor is benign, we mean it is not spreading across the body, so the patient is safe. If it is malignant, it is cancerous.
That means we need to intervene and stop the cancer from growing.

---



What do we do on the machine learning side?
* We process all these images, and
* we want to determine whether the tumor shown in each image is malignant or benign.

To do that, we extract some features from these images. By features we mean characteristics of the image, such as
* radius
* cells
* texture
* perimeter
* area
* smoothness

We feed all these features into our machine learning model.

**MAIN PART:**  We want to teach the machine to classify images (or data) and tell us whether a tumor is malignant or benign.


## **PROBLEM IN MACHINE LEARNING VOCABULARY**

*Input:* 30 features 
* Radius
* Texture
* Perimeter
* Area
* Smoothness
* ...

*Target Class:* 2 
* Malignant
* Benign

*How many samples do we have?*
* Number of Instances: 569
* Class Distribution: 212 Malignant, 357 Benign

*Data source:*
* [Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
* [Breast Cancer Detection with Reduced Feature Set](https://www.researchgate.net/publication/271907638_Breast_Cancer_Detection_with_Reduced_Feature_Set)


Given all 30 features of a sample, the model outputs zero when the tumor is malignant.

If, based on the 30 features, the sample is classified as one, the tumor is benign.

So this is binary classification: zero for malignant, one for benign.
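
The label encoding can be checked directly against the dataset object. The sketch below assumes the copy of the dataset bundled with scikit-learn; the variable names `label_to_name` and `class_counts` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is ordered by label: index 0 -> 'malignant', index 1 -> 'benign'
label_to_name = {label: str(name) for label, name in enumerate(cancer.target_names)}
print(label_to_name)

# Count how many samples carry each label
counts = np.bincount(cancer.target)
class_counts = {label_to_name[label]: int(n) for label, n in enumerate(counts)}
print(class_counts)
```

The counts should match the class distribution listed above (212 malignant, 357 benign).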

---
**SUPPORT VECTOR MACHINE CLASSIFIER**

Near the maximum-margin hyperplane, it is hard to tell whether a tumor is malignant or benign.

That is what makes the support vector machine classifier distinctive. It uses only the points on the boundary, the support vectors, to draw the boundary that separates the classes.

Support vector machines are powerful techniques.
Why? Because it is a kind of extreme algorithm:
it focuses only on the support vectors, the points closest to the boundary, and separates them.
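
To make the role of the support vectors concrete, here is a minimal sketch on made-up 2-D points (not the cancer data). It assumes a linear kernel with a large `C` so that the fit behaves roughly like a hard-margin classifier; both choices are for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up points for illustration)
X = np.array([[0, 0], [1, 0], [0, 1], [-1, -1],
              [3, 3], [4, 3], [3, 4], [5, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A linear kernel with a large C approximates a hard-margin classifier
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# Only the points closest to the boundary are kept as support vectors
print("support vectors:\n", clf.support_vectors_)
print("support vectors per class:", clf.n_support_)

# Every training point is still classified, including those far from the boundary
predictions = clf.predict(X)
print(predictions)
```

Only a subset of the training points ends up in `support_vectors_`; the remaining points could be moved further from the boundary without changing the fitted hyperplane.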


**IMPORTING DATA**

In [None]:
# import libraries 
import pandas as pd # Import Pandas for data manipulation using dataframes
import numpy as np # Import Numpy for data statistical analysis 
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import seaborn as sns # Statistical data visualization
# %matplotlib inline

In [None]:
# Import the cancer data from the sklearn library

from sklearn.datasets import load_breast_cancer
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
# https://scikit-learn.org/0.16/datasets/index.html
# https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset
# https://scikit-learn.org/stable/datasets.html#:~:text=The%20sklearn.,from%20the%20'real%20world'.
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#examples-using-sklearn-datasets-load-breast-cancer

cancer = load_breast_cancer()

In [None]:
cancer
# cancer.items()

In [None]:
# Which keys the dataset object has
cancer.keys()

In [None]:
cancer.values()

In [None]:
# print them one by one
print(cancer['DESCR']) #descriptions of the dataset


In [None]:
print(cancer['target'])

In [None]:
print(cancer['target_names'])

In [None]:
print(cancer['feature_names'])

In [None]:
len(cancer['feature_names'])

In [None]:
print(cancer['data'])

In [None]:
cancer['data'].shape

numpy.c_ = <numpy.lib.index_tricks.CClass object>
Translates slice objects to concatenation along the second axis.

https://numpy.org/doc/stable/reference/generated/numpy.c_.html#:~:text=c_-,numpy.,because%20of%20its%20common%20occurrence.
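
As a small illustration of what `np.c_` does before it is applied to the cancer data, the sketch below stacks a made-up feature matrix and a made-up label vector side by side. The array names are placeholders.

```python
import numpy as np

features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])   # shape (3, 2): three samples, two features
target = np.array([0, 1, 0])        # shape (3,): one label per sample

# np.c_ treats the 1-D target as a column and appends it on the right
combined = np.c_[features, target]
print(combined.shape)   # (3, 3)
print(combined)
```

Because a NumPy array holds a single dtype, the integer labels are converted to floats, which is why a `target` column built this way shows 0.0 and 1.0.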

In [None]:
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
df_cancer.head()

In [None]:
df_cancer.tail(5)

**VISUALIZING THE DATA**

**`pairplot` is a function of the seaborn library, which provides a high-level interface for drawing attractive and informative statistical graphics. It plots pairwise relationships between the selected variables.**

https://medium.com/analytics-vidhya/pairplot-visualization-16325cd725e6#:~:text=Pairplot%20is%20a%20module%20of,attractive%20and%20informative%20statistical%20graphics.

https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166

In [None]:
sns.pairplot(df_cancer,vars= ['mean radius','mean texture', 'mean area', 'mean perimeter', 'mean smoothness'])

**The only problem is that this plot doesn't show the target class: it doesn't show which of these samples are malignant and which are benign.**

In [None]:
sns.pairplot(df_cancer,hue = 'target', vars= ['mean radius','mean texture', 'mean area', 'mean perimeter', 'mean smoothness'])

The blue points are the malignant cases (target 0). The orange points are the benign cases (target 1).

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally as x
sns.countplot(x='target', data=df_cancer)

**We take one of these scatter plots and look at it more closely.**

In [None]:
sns.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df_cancer)

**Let's check the correlation between the variables**

In [None]:
df_cancer.corr()

https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e

How to Create a Seaborn Correlation Heatmap in Python?

In [None]:
sns.heatmap(df_cancer.corr())

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df_cancer.corr(), annot=True)
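
With 31 columns the annotated heatmap is dense, so it can help to read off only the correlations with `target`. The following is a sketch that rebuilds the frame from scikit-learn's bundled dataset (the name `df` is a stand-in for `df_cancer`) and ranks the features by absolute correlation.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

# Correlation of each feature with the target, ranked by absolute value
target_corr = df.corr()["target"].drop("target")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked.head(10))
```

Since benign is encoded as 1, a feature that tends to be larger for malignant tumors shows a negative raw correlation; ranking by absolute value puts the most informative features first regardless of sign.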

# **MODEL TRAINING (FINDING A PROBLEM SOLUTION)**

In [None]:
# Let's drop the target label column
x = df_cancer.drop(['target'],axis=1)
x

In [None]:
x.shape

In [None]:
y = df_cancer['target']
y

In [None]:
y.shape

**Now we are going to divide our data into Train and Test set**

Therefore, we train on the training set and validate the chosen model on the test set. (Here we omit a separate validation set, which would normally be used before the test set, because the dataset is very small.)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state= 5)

x_train

In [None]:
print(len(x_train))
print(len(x_test))

In [None]:
print(len(x_train)/len(x)) # so 80% Training set and 20% test set

In [None]:
x_train.shape

In [None]:
x_test

In [None]:
x_test.shape

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
y_test

In [None]:
y_test.shape

**The objective of a Linear SVC (Support Vector Classifier) is to fit to the data you provide, returning a "best fit" hyperplane that divides, or categorizes, your data. From there, after getting the hyperplane, you can then feed some features to your classifier to see what the "predicted" class is.**

Note that `SVC()` with default arguments, as used below, applies an RBF kernel rather than a linear one.

https://pythonprogramming.net/linear-svc-example-scikit-learn-svm-python/#:~:text=The%20objective%20of%20a%20Linear,the%20%22predicted%22%20class%20is.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
svc_model = SVC()

In [None]:
svc_model.fit(x_train, y_train)

**EVALUATING THE MODEL**

We evaluate on the test data, which the model has never seen before.

In [None]:
y_predict = svc_model.predict(x_test)
y_predict

**We're going to plot a confusion matrix, which compares the true values against the predicted values.**

In [None]:
cm = confusion_matrix(y_test, y_predict)

In [None]:
sns.heatmap(cm, annot=True)

* **A Classification report is used to measure the quality of predictions from a classification algorithm. ... The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives.**

https://muthu.co/understanding-the-classification-report-in-sklearn/#:~:text=A%20Classification%20report%20is%20used,predictions%20from%20a%20classification%20algorithm.&text=The%20report%20shows%20the%20main,positives%2C%20true%20and%20false%20negatives.

https://medium.com/@kohlishivam5522/understanding-a-classification-report-for-your-machine-learning-model-88815e2ce397



In [None]:
print(classification_report(y_test, y_predict))

**There are four ways to check if the predictions are right or wrong:**

**TN / True Negative**: the case was negative and predicted negative

**TP / True Positive**: the case was positive and predicted positive

**FN / False Negative**: the case was positive but predicted negative

**FP / False Positive**: the case was negative but predicted positive


**Precision: What percent of your positive predictions were correct?**
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class, it is defined as the ratio of true positives to the sum of true positives and false positives.

**Precision:- Accuracy of positive predictions.
Precision = TP/(TP + FP)**

**Recall: What percent of the positive cases did you catch?**
Recall is the ability of a classifier to find all positive instances. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives.

**Recall:- Fraction of positives that were correctly identified.
Recall = TP/(TP+FN)**

**F1 score: How well are precision and recall balanced?**
The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. F1 scores are often lower than accuracy, as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

**F1 Score = 2*(Recall * Precision) / (Recall + Precision)**

**Support**
Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.
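
The formulas above can be checked by hand on a small example. The sketch below uses made-up labels (not the model's predictions) and computes the three metrics from the confusion-matrix counts, so they can be compared with scikit-learn's own `precision_score`, `recall_score` and `f1_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here there are 4 true positives, 1 false positive and 1 false negative, so precision and recall are both 4/5 and the F1 score comes out the same.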

# **IMPROVING THE MODEL**

In [None]:
x_train

In [None]:
x_train.min()  # call the method to get the per-column minimum

In [None]:
min_train = x_train.min()
min_train

In [None]:
range_train = (x_train - min_train).max()
range_train

In [None]:
x_train_scaled = (x_train - min_train)/range_train
x_train_scaled

In [None]:
sns.scatterplot(x = x_train['mean area'], y= x_train['mean smoothness'], hue= y_train)

In [None]:
sns.scatterplot(x= x_train_scaled['mean area'], y= x_train_scaled['mean smoothness'], hue= y_train)

In [None]:
# Scale the test set with the training-set minimum and range,
# so that both sets go through exactly the same transformation
x_test_scaled = (x_test - min_train) / range_train

In [None]:
svc_model.fit(x_train_scaled, y_train)

In [None]:
y_predict = svc_model.predict(x_test_scaled)

In [None]:
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True, fmt = 'd')

In [None]:
print(classification_report(y_test, y_predict))
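
The normalization used in this section is min-max scaling, (x - min) / (max - min). As a sketch, the same transformation can be expressed with scikit-learn's `MinMaxScaler`, under the common convention that the statistics are learned from the training data only and then reused for the test data. The arrays below are made-up stand-ins for `x_train` and `x_test`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up numbers standing in for x_train and x_test
train = np.array([[10.0, 0.1], [20.0, 0.3], [40.0, 0.2]])
test = np.array([[25.0, 0.25], [50.0, 0.05]])

# Manual min-max scaling with statistics taken from the training data
min_train = train.min(axis=0)
range_train = (train - min_train).max(axis=0)
train_manual = (train - min_train) / range_train
test_manual = (test - min_train) / range_train

# MinMaxScaler learns the same minimum and range in fit() and reuses them in transform()
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

print(np.allclose(train_manual, train_scaled), np.allclose(test_manual, test_scaled))
print(test_scaled)
```

Test values can fall slightly outside the range 0 to 1 when they lie beyond the training minimum or maximum; that is expected and usually harmless for an SVM.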

# **IMPROVING THE MODEL - PART 2**

In [None]:
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} 
param_grid
# print(type(param_grid))

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.


https://machinelearningmastery.com/k-fold-cross-validation/

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f
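
A minimal sketch of k-fold cross-validation with k = 5, assuming scikit-learn's `cross_val_score` and a pipeline that couples `MinMaxScaler` with `SVC`. The pipeline is a choice made for this example and is not part of the notebook's own code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler inside each fold, so the held-out fold
# never influences the scaling statistics
model = make_pipeline(MinMaxScaler(), SVC())

# cv=5 splits the data into k = 5 folds; each fold is used once for validation
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(scores.mean())
```

Each entry of `scores` is the accuracy on one held-out fold; the mean gives a more stable estimate than a single train/test split.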

**What is GridSearchCV?**

GridSearchCV is a class in sklearn's model_selection module. It loops through predefined hyperparameters and fits your estimator (model) on your training set, scoring each combination with cross-validation. In the end, you can select the best parameters from the listed hyperparameters.

https://towardsdatascience.com/grid-search-for-hyperparameter-tuning-9f63945e8fec#:~:text=What%20is%20GridSearchCV%3F,parameters%20from%20the%20listed%20hyperparameters.

https://medium.datadriveninvestor.com/an-introduction-to-grid-search-ff57adcc0998

In [None]:
# Example of GridSearchCV

# x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=50)
# xgb=XGBClassifier()
# ----------------------------------------------------------------------
# from sklearn.model_selection import GridSearchCV
# parameters=[{'learning_rate':[0.1,0.2,0.3,0.4],'max_depth':[3,4,5,6,7,8],'colsample_bytree':[0.5,0.6,0.7,0.8,0.9]}]
            
# gscv=GridSearchCV(xgb,parameters,scoring='accuracy',n_jobs=-1,cv=10)
# grid_search=gscv.fit(x,y)
# grid_search.best_params_
# -----------------------------------------------------------------------
# x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=50)
# xgb=XGBClassifier(colsample_bytree=0.8, learning_rate=0.4, max_depth=4)
# xgb.fit(x,y)
# pred=xgb.predict(x_test)
# print('Accuracy=  ',accuracy_score(y_test,pred))
# -----------------------------------------------------------------------
# #Cross validating (for classification) the model and checking the cross_val_score,model giving highest score will be chosen as final model.
# from sklearn.model_selection import cross_val_predict, cross_val_score
# xgb=XGBClassifier(colsample_bytree=0.8, learning_rate=0.4, max_depth=4)
# cvs=cross_val_score(xgb,x,y,scoring='accuracy',cv=10)
# print('cross_val_scores=  ',cvs.mean())
# y_pred=cross_val_predict(xgb,x,y,cv=10)
# conf_mat=confusion_matrix(y_pred,y)
# conf_mat
# ---------------------------------------------------------------------------
# #Cross validating(for regression) the model and checking the cross_val_score,model giving highest score will be chosen as final model.
# gbm=GradientBoostingRegressor(max_depth=7,min_samples_leaf=1,n_estimators=100)
# cvs=cross_val_score(gbm,x,y,scoring='r2',cv=5)
# print('cross_val_scores=  ',cvs.mean())
# -------------------------------------------------------------------------------
# #parameters
# #xgboost:-
# parameters=[{'learning_rate':[0.1,0.2,0.3,0.4],'max_depth':[3,4,5,6,7,8],'colsample_bytree':[0.5,0.6,0.7,0.8,0.9]}]
# #random forest
# parameters=[{'max_depth':[5,7,9,10],'min_samples_leaf':[1,2],'n_estimators':[100,250,500]}]
# #gradientboost
# parameters=[{'max_depth':[5,7,9,10],'min_samples_leaf':[1,2],'n_estimators':[100,250,500]}]
# #kneighbors
# parameters={'n_neighbors':[5,6,8,10,12,14,15]}
# #logistic regression
# parameters={'penalty':['l1','l2'],'C':[1,2,3,4,5]}
# #gaussiannb
# parameters={'var_smoothing': np.logspace(0,-9, num=100)}
# #SVC
# parameters=[{'C':[0.1,0.5,1,2,3],'kernel':['rbf','poly']}]
# #adaboost
# parameters=[{'base_estimator':[lr],'learning_rate':[1,0.1,0.001],'n_estimators':[100,150,250]}]
# #decision tree
# parameters=[{'criterion':['gini','entropy'],'max_depth':[5,7,9,10],'min_samples_leaf':[1,2]}]


In [None]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit= True, verbose= 4)

In [None]:
param_grid

In [None]:
grid.fit(x_train_scaled, y_train)

What is grid search used for?

Grid-search is used to find the **optimal hyperparameters of a model** which results in the most 'accurate' predictions

https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
grid.best_params_

https://datascience.stackexchange.com/questions/21877/how-to-use-the-output-of-gridsearch

The .best_estimator_ attribute is an instance of the specified model type, which has the 'best' combination of given parameters from the param_grid. Whether or not this instance is useful depends on whether the refit parameter is set to True (it is by default).
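
To show how these attributes relate, here is a self-contained sketch. It assumes a deliberately small grid (a subset of `param_grid`) to keep the run short, and it uses `MinMaxScaler` in place of the manual scaling; the variable names are placeholders for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

# Learn the scaling from the training split and apply it to both splits
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A deliberately small grid to keep the sketch fast
small_grid = {"C": [1, 10], "gamma": [1, 0.1], "kernel": ["rbf"]}
grid = GridSearchCV(SVC(), small_grid, refit=True, cv=5)
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)   # best combination found by cross-validation
print(grid.best_score_)    # mean cross-validated accuracy of that combination

# With refit=True, grid.predict delegates to best_estimator_
same = (grid.predict(X_test_scaled) == grid.best_estimator_.predict(X_test_scaled)).all()
print(same)
```

With `refit=True`, the best estimator is retrained on the whole training set after the search, so calling `grid.predict` and `grid.best_estimator_.predict` gives the same predictions.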

In [None]:
grid.best_estimator_

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,

    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

In [None]:
grid_prediction = grid.predict(x_test_scaled)
grid_prediction

In [None]:
cm = confusion_matrix(y_test, grid_prediction)
cm

In [None]:
sns.heatmap(cm, annot=True)

In [None]:
print(classification_report(y_test,grid_prediction ))

# **CONCLUSION**
* Machine learning techniques (SVM) were able to classify tumors as malignant or benign with 97% accuracy.
* The technique can rapidly evaluate breast masses and classify them in an automated fashion. 
* Early breast cancer detection can dramatically save lives, especially in the developing world.
* The technique can be further improved by combining Computer Vision/ ML techniques to directly classify cancer using tissue images.