# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

## Learning Objective

The objective of this experiment is to understand K-Fold cross-validation and Leave One Out cross-validation.


### Dataset Description:
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome.

    Preg: Number of times pregnant
    Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure: Diastolic blood pressure (mm Hg)
    SkinThickness: Triceps skin fold thickness (mm)
    Insulin: 2-Hour serum insulin (mu U/ml)
    BMI: Body mass index (weight in kg/(height in m)^2)
    DiabetesPedigreeFunction: Diabetes pedigree function
    Age: Age (years)
    Outcome: Class variable (0 or 1)


### K-Fold Cross Validation


The problem with machine learning models is that you won’t know how well a model performs until you test its performance on an independent data set (a data set that was not used for training the model).

Cross validation comes into the picture here and helps us estimate the performance of our model. One type of cross validation is K-Fold Cross Validation.

In our experiment, we use the K-Fold Cross Validation technique to limit the problem of overfitting. K-Fold Cross Validation is a way to evaluate our machine learning model and to guide the choices that improve it. Because every sample is used for validation exactly once, the performance estimate does not depend on a single train/test split.


When we are given a machine learning problem, we are given two types of data sets: known data (the training data set) and unknown data (the test data set). By using cross validation, you “test” your machine learning model during the “training” phase to check for overfitting and to get an idea of how the model will generalize to independent data, which is the test data set given in the problem.


In the first round of cross validation, we divide our original training data set into two parts:

1. Cross validation training set
2. Cross validation testing set or Validation set

<img src="https://cdn.talentsprint.com/aiml/Experiment_related_data/IMAGES/K-Fold.png" alt="drawing" width="500"/>


The above image represents how K-Fold Cross Validation works. We divide the dataset into "K" parts, use K-1 parts for training and the remaining part for testing. We then rotate the test part and repeat the process K times.

We train our machine learning model on the cross validation training set and test the model’s predictions against the validation set. Comparing the model’s predictions on the validation set with the actual labels of its data points tells us how accurate the model is.
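The rotation of the test part can be seen on a tiny example. The sketch below uses ten toy sample indices (an assumption for illustration, not the diabetes data) and the names `toy_samples`, `kf_demo` and `fold_test_indices` are hypothetical. Without shuffling, scikit-learn's `KFold` assigns consecutive blocks of samples to each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, purely illustrative (not the diabetes data)
toy_samples = np.arange(10)

# Five folds, so each round holds out two samples for testing
kf_demo = KFold(n_splits=5)

fold_test_indices = []
for fold_number, (train_idx, test_idx) in enumerate(kf_demo.split(toy_samples), start=1):
    fold_test_indices.append(test_idx.tolist())
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample index appears in exactly one test fold, and in the training part of every other round.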

To reduce the variability, multiple rounds of cross validation are performed by using different cross validation training sets and cross validation testing sets. The results from all the rounds are averaged to estimate the accuracy of the machine learning model.

**K-fold cross validation is performed as per the following steps:**

1. Randomly split the entire training dataset into K subsets (blocks).
2. Reserve one block as the test data.
3. Train the model on the remaining K-1 blocks combined.
4. Measure the performance against the reserved test block.
5. Repeat steps 2 to 4 until each block has served as the test data once.
6. The average of the K recorded errors is called the cross-validation error, and it is used as the performance metric for the model.
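The procedure listed above can be sketched by hand with NumPy. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: `manual_kfold_accuracy` is a hypothetical helper, the data is synthetic, and the score recorded per round is accuracy (the error of a round is one minus its accuracy).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def manual_kfold_accuracy(features, labels, k=5, seed=0):
    """Hypothetical helper: mean validation accuracy over k hand-made folds."""
    rng = np.random.default_rng(seed)
    # Randomly split the sample indices into k blocks
    shuffled_indices = rng.permutation(len(features))
    blocks = np.array_split(shuffled_indices, k)

    fold_scores = []
    for i in range(k):
        # Reserve block i for testing and train on all other blocks combined
        test_idx = blocks[i]
        train_idx = np.concatenate([blocks[j] for j in range(k) if j != i])

        model = DecisionTreeClassifier(max_depth=2, random_state=seed)
        model.fit(features[train_idx], labels[train_idx])

        # Record the accuracy on the reserved block
        fold_scores.append(model.score(features[test_idx], labels[test_idx]))

    # Average the k recorded scores
    return float(np.mean(fold_scores)), fold_scores

# Synthetic data for illustration only (assumption: not the diabetes data)
data_rng = np.random.default_rng(42)
toy_features = data_rng.normal(size=(100, 3))
toy_labels = (toy_features[:, 0] > 0).astype(int)

mean_score, per_fold = manual_kfold_accuracy(toy_features, toy_labels, k=5)
print("Per-fold accuracy:", np.round(per_fold, 2))
print("Cross-validation accuracy: %.2f" % mean_score)
print("Cross-validation error: %.2f" % (1 - mean_score))
```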



### Leave One Out

Leave One Out is a special form of cross validation. In this method, each sample is used once as the test set while the remaining samples form the training set. A generalization error estimate is obtained by repeating this procedure for each of the available training points and averaging the results.
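A minimal sketch of Leave One Out with scikit-learn's `LeaveOneOut` splitter follows. The 30 synthetic samples and the names `loo_features`, `loo_labels` and `loo_scores` are assumptions made for illustration; the same loop could be applied to the diabetes data once it is loaded.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (assumption: not the diabetes data)
rng = np.random.default_rng(0)
loo_features = rng.normal(size=(30, 2))
loo_labels = (loo_features[:, 0] + loo_features[:, 1] > 0).astype(int)

loo = LeaveOneOut()
loo_scores = []
for train_idx, test_idx in loo.split(loo_features):
    # Every round trains on all samples except one and tests on that one
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(loo_features[train_idx], loo_labels[train_idx])
    loo_scores.append(model.score(loo_features[test_idx], loo_labels[test_idx]))

print("Number of rounds:", len(loo_scores))
print("Leave One Out accuracy: %.2f" % np.mean(loo_scores))
```

Because there is one round per sample, each per-round score is either 0 or 1, and the cost grows with the size of the dataset.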

## Setup Steps

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}


In [None]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}


In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "U1W1_02_KFold_and_Leaveoneout" #name of the notebook
Answer = "Ungraded"
def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/diabetes.csv")
    display(HTML('<script src="https://dashboard.talentsprint.com/submissions/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():

    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword(), "batch" : ""}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts() and getComments():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "id" : Id, "file_hash" : file_hash,
              "feedback_experiments_input" : Comments, "notebook" : notebook, "batch" : ""}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iiith.talentsprint.com/notebook_submissions")
        # print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None
def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()

else:
  print ("Please complete Id and Password cells before running setup")


## Importing Required Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

### Load the data

In [None]:
diabetes_data = pd.read_csv("/content/diabetes.csv")
diabetes_data.head()

Extracting features and labels from the diabetes data

In [None]:
x_data = diabetes_data.iloc[:,:-1].values
y_data = diabetes_data.iloc[:,-1].values
x_data.shape, y_data.shape

### Apply KFold

In [None]:
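def crossvalidation(splitter):
    """Print the mean test accuracy of a depth-2 decision tree over all splits.

    splitter: a scikit-learn cross-validation splitter such as KFold.
    Uses the x_data and y_data arrays defined above.
    """
    scores_test = []
    for train_index, test_index in splitter.split(x_data):
        # Split the data into train and test parts for this round
        x_train, x_test = x_data[train_index], x_data[test_index]
        y_train, y_test = y_data[train_index], y_data[test_index]

        # Create a DecisionTree classifier with max_depth as its hyperparameter
        decision_tree2 = DecisionTreeClassifier(max_depth=2)

        # Fit the model on the train part and record its accuracy on the test part
        decision_tree2.fit(x_train, y_train)
        scores_test.append(decision_tree2.score(x_test, y_test))
    print("Average score of the Testing set %.2f" % np.mean(scores_test))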

In [None]:
# Set the KFold module for 5 splits:
kf = KFold(n_splits=5)

# crossvalidation function prints the average score of the test data
crossvalidation(kf)
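Note that `KFold(n_splits=5)` does not shuffle by default, so the folds are consecutive blocks of rows. A possible variation, sketched below on synthetic data from `make_classification` (an assumption, used so the block runs without `diabetes.csv`), passes `shuffle=True` with a fixed `random_state` and lets `cross_val_score` run the loop. The names `demo_x`, `demo_y`, `shuffled_kf` and `demo_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes data)
demo_x, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)

# shuffle=True randomizes which rows land in each fold; random_state makes it repeatable
shuffled_kf = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a fresh copy of the classifier per fold and returns one accuracy per fold
demo_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, random_state=0),
    demo_x,
    demo_y,
    cv=shuffled_kf,
)

print("Per-fold accuracy:", demo_scores.round(2))
print("Average score of the Testing set %.2f" % demo_scores.mean())
```

Shuffling matters mostly when the rows of a file are ordered in some way, for example grouped by class, since consecutive folds would then not be representative.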

## Please answer the questions below to complete the experiment:

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}

In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]

In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook  { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")