# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

## Learning Objective

The objective of this experiment is to understand how tuning hyperparameters helps a model avoid overfitting.


### Dataset Description:
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome.

    Pregnancies: Number of times pregnant
    Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure: Diastolic blood pressure (mm Hg)
    SkinThickness: Triceps skin fold thickness (mm)
    Insulin: 2-Hour serum insulin (mu U/ml)
    BMI: Body mass index (weight in kg/(height in m)^2)
    DiabetesPedigreeFunction: Diabetes pedigree function
    Age: Age (years)
    Outcome: Class variable (0 or 1)


## Setup Steps

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "aiml_pg_25" #@param {type:"string"}


In [2]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "4521452411" #@param {type:"string"}


In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "U2W9_Demo_Overfitting_Diabetes" #name of the notebook
Answer = "Ungraded"
def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx pip install playsound")
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/diabetes.csv")
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():

    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts() and getComments():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "id" : Id, "file_hash" : file_hash,
              "feedback_experiments_input" : Comments, "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iiith.talentsprint.com/notebook_submissions")
        # print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None
def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()

else:
  print ("Please complete Id and Password cells before running setup")


Setup completed successfully


## Importing Required Packages

In [4]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

## Prepare the model

In [5]:
diabetes_data = pd.read_csv("/content/diabetes.csv")
diabetes_data.head()

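|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |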


Extracting features and labels from the diabetes data

In [6]:
train_features = diabetes_data.iloc[:,:-1].values
labels = diabetes_data.iloc[:,-1].values
train_features.shape, labels.shape

((768, 8), (768,))
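
As a small illustration of the `iloc` slicing above, the sketch below rebuilds the five rows shown by `head()` into a DataFrame and separates the eight predictor columns from `Outcome`. The names `sample`, `sample_features` and `sample_labels` are only illustrative and are not part of the notebook.

```python
import pandas as pd

# The five rows printed by diabetes_data.head() above.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Every column except the last one is a feature; the last column is the label.
sample_features = sample.iloc[:, :-1].values
sample_labels = sample.iloc[:, -1].values

print(sample_features.shape, sample_labels.shape)
```

The same slicing applied to the full data gives the `(768, 8)` and `(768,)` shapes shown above.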

Split the data into training and test sets, then calculate the training accuracy

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test  = train_test_split(train_features, labels, test_size = 0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(614, 8) (154, 8) (614,) (154,)
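
The shapes printed above follow from simple arithmetic. The sketch below assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the training set. Treat it as a check on the numbers, not as a description of the library internals.

```python
import math

n_samples = 768   # rows in the diabetes data
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the remaining rows go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the counts agree with the 614 training rows and 154 test rows reported by the notebook.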


In [8]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,y_train)
train_pred = decision_tree.predict(X_train)
train_acc = accuracy_score(y_train, train_pred)  # argument order: (y_true, y_pred)
print("Training accuracy: ", train_acc)

Training accuracy:  1.0


Calculate the testing accuracy

In [9]:
# Evaluate the accuracy on the held-out test data
test_pred = decision_tree.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)  # argument order: (y_true, y_pred)
print("Testing accuracy", test_acc)

Testing accuracy 0.7727272727272727



Although the training accuracy is perfect (1.0), the accuracy on the unseen test set is only about 0.77. This gap means that the model is not generalizing well to unseen data. This is known as **Overfitting**.
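
The same pattern can be reproduced without the course data. The sketch below is a minimal illustration on a synthetic dataset from `make_classification`, which stands in for the diabetes data (an assumption made so that the example is self-contained). Its parameter values and the names `X_tr`, `X_te`, `train_accuracy` and `test_accuracy` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption for illustration):
# 768 rows, 8 features, about 10% of the labels reassigned at random as noise.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# No max_depth: the tree keeps splitting until every training leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_accuracy = accuracy_score(y_tr, tree.predict(X_tr))
test_accuracy = accuracy_score(y_te, tree.predict(X_te))

# The difference between the two scores is the generalization gap.
print("train:", train_accuracy)
print("test:", test_accuracy)
print("gap:", train_accuracy - test_accuracy)
```

A large positive gap is the practical symptom of overfitting: the tree has memorized the training rows, including the noisy labels.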




## Overfitting


A model that has learned the noise instead of (or along with) the signal is considered "overfit" because it fits the training dataset very well but fits unseen test datasets poorly.

When the tree grows deeper, it starts capturing random fluctuations (noise) in the data. This is one of the root causes of overfitting in decision trees. So, if the model is overfitting, reducing the max_depth value is one way to combat overfitting.

To avoid overfitting, we should tune hyperparameters such as max_depth in a decision tree appropriately.
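
To see what `max_depth` actually constrains, the sketch below fits one unconstrained tree and one tree capped at depth 3, then compares their sizes. It uses a synthetic dataset as a stand-in for the diabetes data, and the cap of 3 is an arbitrary value chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption for illustration).
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# get_depth() and get_n_leaves() report the size of the fitted tree.
for name, model in [("unconstrained", unconstrained), ("max_depth=3", constrained)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

A tree of depth 3 can have at most 8 leaves, so it can only describe coarse patterns in the data, which is what limits its ability to memorize noise.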


Find the train and test accuracy at different depths and identify the depth beyond which the test accuracy stops improving and starts decreasing, even though the train accuracy keeps rising. In the final decision tree model, we can set max_depth to that value. This will allow the model to generalize better to further unseen data.


In [10]:
# Calculating accuracy of train and test data
depths = np.arange(1, 11)  # depths 1 through 10, matching the output table below
train_acc, test_acc = [], []

for i in depths:
  decision_tree = DecisionTreeClassifier(max_depth=i, random_state=1)
  decision_tree.fit(X_train,y_train)

  # Train accuracy calculation
  y_train_pred = decision_tree.predict(X_train)
  train_acc.append(accuracy_score(y_train, y_train_pred))

  # Test accuracy calculation
  y_test_pred = decision_tree.predict(X_test)
  test_acc.append(accuracy_score(y_test, y_test_pred))

acc_data = pd.DataFrame({'depth': depths, 'Train accuracy': train_acc, 'Test Accuracy': test_acc})
acc_data.set_index('depth', inplace=True)
acc_data

| depth | Train accuracy | Test Accuracy |
|---|---|---|
| 1 | 0.734528 | 0.74026 |
| 2 | 0.771987 | 0.772727 |
| 3 | 0.776873 | 0.75974 |
| 4 | 0.798046 | 0.694805 |
| 5 | 0.84202 | 0.798701 |
| 6 | 0.86645 | 0.74026 |
| 7 | 0.910423 | 0.753247 |
| 8 | 0.942997 | 0.766234 |
| 9 | 0.970684 | 0.753247 |
| 10 | 0.983713 | 0.733766 |
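
One way to read the table is sketched below. The accuracies are copied from the output above, and the selection rule (take the depth with the highest test accuracy) is an illustrative assumption, not necessarily the rule the course intends. Note that the test accuracy on this single split does not fall smoothly, so the result should be treated as a rough guide.

```python
# Accuracies copied from the table above: depth -> (train accuracy, test accuracy).
results = {
    1: (0.734528, 0.740260),
    2: (0.771987, 0.772727),
    3: (0.776873, 0.759740),
    4: (0.798046, 0.694805),
    5: (0.842020, 0.798701),
    6: (0.866450, 0.740260),
    7: (0.910423, 0.753247),
    8: (0.942997, 0.766234),
    9: (0.970684, 0.753247),
    10: (0.983713, 0.733766),
}

# Illustrative rule: choose the depth whose test accuracy is highest.
best_depth = max(results, key=lambda depth: results[depth][1])

# The train-test gap at each depth shows how quickly overfitting grows.
gaps = {depth: train - test for depth, (train, test) in results.items()}

print("best depth by test accuracy:", best_depth)
for depth, gap in gaps.items():
    print(depth, round(gap, 3))
```

The gap is close to zero for the shallowest trees and grows to roughly 0.25 at depth 10, while the test accuracy never rises above its value at depth 5.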


## Apply KFold

In [11]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
scores_Train = []
scores_Test = []
x_data = train_features
for train_index, test_index in kf.split(x_data):

    x_train, x_test, y_train, y_test = x_data[train_index], x_data[test_index], labels[train_index], labels[test_index]
    decision_tree2 = DecisionTreeClassifier(max_depth=2)
    decision_tree2.fit(x_train, y_train)
    scores_Train.append(decision_tree2.score(x_train, y_train))
    scores_Test.append(decision_tree2.score(x_test, y_test))

#  Finally, we can average all the scores from all 5 cross validations and get an average score for Train and Test data
print("Average score of the Training set %.2f"%np.mean(scores_Train))
print("Average score of the Testing set %.2f"%np.mean(scores_Test))

Average score of the Training set 0.77
Average score of the Testing set 0.76
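
The cell above evaluates a single setting, `max_depth=2`. A natural extension is to repeat the 5-fold evaluation for several depths and compare the averaged scores, which is less sensitive to one particular split. The sketch below does this with `cross_val_score` on a synthetic stand-in dataset (an assumption so that the example is self-contained). The depth range and the names `cv_depths`, `mean_scores` and `best_cv_depth` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption for illustration).
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)

kf = KFold(n_splits=5)
cv_depths = list(range(1, 11))
mean_scores = []

for depth in cv_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    # cross_val_score fits the model once per fold and returns one
    # held-out accuracy per fold.
    fold_scores = cross_val_score(model, X, y, cv=kf)
    mean_scores.append(fold_scores.mean())

# Depth with the highest mean held-out accuracy across the 5 folds.
best_cv_depth = cv_depths[int(np.argmax(mean_scores))]

for depth, score in zip(cv_depths, mean_scores):
    print(depth, round(score, 3))
print("best depth by mean CV accuracy:", best_cv_depth)
```

With the real data, `X` and `y` would be replaced by `train_features` and `labels` from the earlier cells.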


## Please answer the questions below to complete the experiment:

In [12]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [13]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "sdgfdjhfgjfhjh" #@param {type:"string"}

In [14]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]

In [15]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Somewhat Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [16]:
#@title Run this cell to submit your notebook  { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 6424
Date of submission:  03 Jul 2025
Time of submission:  15:13:44
View your submissions: https://learn-iiith.talentsprint.com/notebook_submissions
