# Advanced Certification Program in Computational Data Science
## A Program by IISc and TalentSprint
### Additional Notebook (ungraded) on Cross Validation Techniques


## Learning Objectives

At the end of the experiment, you will be able to

* Train a Decision Tree model on the Breast Cancer Wisconsin (Diagnostic) dataset.
* Explore the Hold-Out Validation Approach: Train and Test Split
* Explore K Fold Cross Validation
* Explore Stratified K-fold Cross Validation
* Explore Leave One Out Cross Validation (LOOCV)
* Explore Repeated Random Test-Train Splits

## Introduction

Cross-validation is a resampling method used to estimate how well a machine learning model will perform on data it was not trained on.

It is commonly used in applied machine learning to compare and select models for a given predictive modeling problem. It is easy to understand and easy to implement, and its skill estimates are generally less biased than those of simpler approaches such as a single train/test split.



## Dataset Description

Attribute Information for Breast Cancer Wisconsin (Diagnostic) Data Set:

1. ID number
2. Diagnosis (M = malignant, B = benign)


3-32. Ten real-valued features are computed for each cell nucleus:

    a. radius (mean of distances from center to points on the perimeter)
    b. texture (standard deviation of gray-scale values)
    c. perimeter
    d. area
    e. smoothness (local variation in radius lengths)
    f. compactness (perimeter^2 / area - 1.0)
    g. concavity (severity of concave portions of the contour)
    h. concave points (number of concave portions of the contour)
    i. symmetry
    j. fractal dimension ("coastline approximation" - 1)


The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
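
The field numbering described above can be sketched as simple arithmetic. This is only an illustration: the names `FEATURES`, `STATISTICS`, and `field_number` are hypothetical, and the sketch assumes the ten features appear in the order a to j within each of the three groups (mean, SE, worst), which is consistent with the three example fields given in the text.

```python
# Ten base features in the order listed above (a to j)
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics computed for every feature, in file order (assumed)
STATISTICS = ["mean", "SE", "worst"]


def field_number(feature, statistic):
    """Return the 1-based field number; fields 1 and 2 are ID and diagnosis."""
    group_offset = STATISTICS.index(statistic) * len(FEATURES)
    return 3 + group_offset + FEATURES.index(feature)


print(field_number("radius", "mean"))   # 3
print(field_number("radius", "SE"))     # 13
print(field_number("radius", "worst"))  # 23
```

With this layout the last field, the worst fractal dimension, is number 32, which matches the 3-32 range given for the feature columns.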

* All feature values are recoded with four significant digits.

* Missing attribute values: none

* Class distribution: 357 benign, 212 malignant

For more information click [here](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic))

In [None]:
# @title Download the Dataset
! wget https://cdn.iisc.talentsprint.com/CDS/Datasets/cancer_dataset.csv
print("The dataset was downloaded")


### Load data

In [None]:
import pandas as pd

# Read the downloaded CSV into a DataFrame and preview the first rows
df = pd.read_csv('cancer_dataset.csv')
df.head()

In [None]:
# Independent (feature) and dependent (target) variables:
# column 0 is the ID, column 1 is the diagnosis, the rest are features
X = df.iloc[:, 2:]
y = df.iloc[:, 1]

In [None]:
X.head()

In [None]:
# Drop every column that contains missing values
X = X.dropna(axis=1)

In [None]:
# Independent feature set after dropping columns with missing values
X.head()

In [None]:
# Class distribution of the dependent variable (diagnosis)
y.value_counts()

### Hold-Out Validation Approach: Train and Test Split

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) is the official documentation of DecisionTreeClassifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Split the data 70/30, fixing the random state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
model = DecisionTreeClassifier()
# Train the model on the training set
model.fit(X_train, y_train)
# Accuracy on the held-out test set
result = model.score(X_test, y_test)
print(result)
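
To make the hold-out split concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper `holdout_split` is hypothetical, it works on row indices only, and rounding the test size up with `ceil` is an assumption about how a fractional `test_size` is turned into a row count.

```python
import math
import random


def holdout_split(n_samples, test_size, seed):
    """Shuffle the row indices once and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# 569 rows, as in the class distribution above (357 + 212)
train_idx, test_idx = holdout_split(569, test_size=0.30, seed=4)
print(len(train_idx), len(test_idx))  # 398 171
```

Every row lands in exactly one of the two sets, so the reported score depends on which rows happened to fall into the single test set. That dependence is what the other techniques in this notebook try to average out.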

### K Fold Cross Validation

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) is the official documentation of KFold

In [None]:
from sklearn.model_selection import KFold
model = DecisionTreeClassifier()
# 10 consecutive folds; KFold does not shuffle unless shuffle=True is passed
kfold_validation = KFold(n_splits=10)

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
# Accuracy on each of the 10 folds, followed by the mean accuracy
results = cross_val_score(model, X, y, cv=kfold_validation)
print(results)
print(np.mean(results))
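
The way unshuffled k-fold partitions the rows can be sketched in a few lines of plain Python. The generator `kfold_indices` is a hypothetical helper, not the scikit-learn class. It follows the fold-size rule stated in the KFold documentation: the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + 1 if fold < extra else base
        test_idx = list(range(start, start + size))
        # Everything outside the current test block is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(569, 10))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)  # [57, 57, 57, 57, 57, 57, 57, 57, 57, 56]
```

Each row is used for testing exactly once and for training in the other nine folds, which is why the mean of the ten scores uses every row of the dataset.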

### Stratified K-fold Cross Validation

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) is the official documentation of StratifiedKFold

In [None]:
from sklearn.model_selection import StratifiedKFold
# 5 folds that each approximately preserve the overall class proportions
skfold = StratifiedKFold(n_splits=5)
model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=skfold)
print(np.mean(scores))

In [None]:
scores
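
The idea behind stratification can be illustrated with a simplified sketch: the samples of each class are dealt out to the folds in turn, so every fold ends up with nearly the same class ratio as the full dataset. This is a deliberately simplified stand-in, not the algorithm scikit-learn uses. The function `stratified_fold_assignment` is hypothetical, and the label list is synthetic: only its class counts (357 benign, 212 malignant) come from the dataset description, while the ordering is an assumption.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, n_splits):
    """Return, for every sample, the number of the fold in which it is tested."""
    fold_of = []
    seen_per_class = defaultdict(int)
    for label in labels:
        # Deal the samples of each class to the folds round-robin
        fold_of.append(seen_per_class[label] % n_splits)
        seen_per_class[label] += 1
    return fold_of


# Synthetic labels with the documented class counts (ordering is assumed)
labels = ["B"] * 357 + ["M"] * 212
fold_of = stratified_fold_assignment(labels, 5)

for fold in range(5):
    counts = Counter(lab for lab, f in zip(labels, fold_of) if f == fold)
    share_malignant = counts["M"] / sum(counts.values())
    print(fold, dict(counts), round(share_malignant, 3))
```

The malignant share in every fold stays close to the overall 212/569 (about 0.373), whereas plain unshuffled k-fold on class-sorted data like this would produce folds containing a single class.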

### Leave One Out Cross Validation (LOOCV)

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut) is the official documentation of LeaveOneOut

In [None]:
from sklearn.model_selection import LeaveOneOut
model = DecisionTreeClassifier()
# One split per sample: train on all rows but one, test on the row left out
leave_validation = LeaveOneOut()
results = cross_val_score(model, X, y, cv=leave_validation)

In [None]:
results

In [None]:
print(np.mean(results))
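
Because every LOOCV test set holds a single row, each individual accuracy score is either 0 or 1, and the mean is simply the fraction of rows predicted correctly. The sketch below illustrates this on made-up one-dimensional toy data with a hand-written 1-nearest-neighbour rule. The helpers `leave_one_out` and `predict_1nn`, the data values, and the choice of classifier are all assumptions for illustration and are unrelated to the cancer dataset or the decision tree used above.

```python
def leave_one_out(n_samples):
    """Yield (train_idx, test_idx) with exactly one test index per split."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[nearest]


# Toy, made-up data: the last point sits between the two groups
xs = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 2.0]
ys = ["B", "B", "B", "M", "M", "M", "M"]

loo_scores = []
for train_idx, (test_i,) in leave_one_out(len(xs)):
    train_x = [xs[j] for j in train_idx]
    train_y = [ys[j] for j in train_idx]
    correct = predict_1nn(train_x, train_y, xs[test_i]) == ys[test_i]
    loo_scores.append(int(correct))

print(loo_scores)                         # [1, 1, 1, 1, 1, 1, 0]
print(sum(loo_scores) / len(loo_scores))  # 6 of 7 rows correct
```

The number of models trained equals the number of rows, so on the 569-row dataset LOOCV fits 569 models, which is why it is the slowest technique in this notebook.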

### Repeated Random Test-Train Splits
This technique is a hybrid of the traditional train-test split and k-fold cross-validation. The data is randomly split into a training set and a test set, the model is trained and evaluated, and the whole process is repeated several times with a fresh random split each time, just as k-fold repeats the evaluation over several folds. Unlike k-fold, the test sets of different repetitions may overlap.

[Here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit) is the official documentation of ShuffleSplit
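
The splitting procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the scikit-learn implementation: `shuffle_split_indices` is a hypothetical helper, the `seed` parameter is added only to make the sketch repeatable, and the test size is rounded up with `ceil`.

```python
import math
import random


def shuffle_split_indices(n_samples, n_splits, test_size, seed=0):
    """Yield (train_idx, test_idx) from n_splits independent random shuffles."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)                   # a fresh shuffle for every repetition
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split_indices(569, n_splits=10, test_size=0.30, seed=0))

# Count how often each row ends up in a test set across the 10 repetitions
times_tested = [0] * 569
for _, test_idx in splits:
    for i in test_idx:
        times_tested[i] += 1

print(min(times_tested), max(times_tested))
```

Since each repetition is drawn independently, some rows are tested several times and others possibly never, in contrast to k-fold where every row is tested exactly once.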

In [None]:
from sklearn.model_selection import ShuffleSplit
model = DecisionTreeClassifier()
# 10 independent random splits, each holding out 30% of the rows for testing
ssplit = ShuffleSplit(n_splits=10, test_size=0.30)
results = cross_val_score(model, X, y, cv=ssplit)

In [None]:
# view the results
results

In [None]:
# Taking the mean of results
np.mean(results)