# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Learning Objective

At the end of the experiment, you will be able to:

* perform data pre-processing
* build a Bagging classifier

## Dataset

### Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Build a predictive model that answers the question: “what sort of people were more likely to survive?” using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Ticket_Class:** Socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**Siblings_Spouse:** No. of siblings / spouses of the passenger aboard the Titanic

**Parents_Children:** No. of parents / children of the passenger aboard the Titanic

**Ticket_Number:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation

**Survived:** Survived or Not information

### Setup Steps:

In [46]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2302815" #@param {type:"string"}

In [47]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "+6592721549" #@param {type:"string"}

In [48]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "U1W4_14_Bagging_Classifier_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Titanic.csv")
    display(HTML('<script src="https://staging.dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getWalkthrough() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook, "feedback_walkthrough":Walkthrough ,
              "feedback_experiments_input" : Comments,
              "feedback_inclass_mentor": Mentor_support}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aiml-iiith.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


def getWalkthrough():
  try:
    if not Walkthrough:
      raise NameError
    else:
      return Walkthrough
  except NameError:
    print ("Please answer Walkthrough Question")
    return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


## Import Libraries

In [49]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

## Data Pre-Processing

### Load the data and print the first five records

**Hint:** https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [50]:
df = pd.read_csv('Titanic.csv')
df.head()

| | PassengerId | Ticket_Class | Name | Sex | Age | Siblings_Spouse | Parents_Children | Ticket_Number | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3rd class | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | Southampton | No |
| 1 | 2 | 1st class | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | Cherbourg | Yes |
| 2 | 3 | 3rd class | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | Southampton | Yes |
| 3 | 4 | 1st class | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | Southampton | Yes |
| 4 | 5 | 3rd class | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | Southampton | No |


### Data Cleaning

* Generate [descriptive statistics](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) of the dataframe

* Count the [NaN values in each column](https://stackoverflow.com/a/26266451) of the dataframe

* Fill the blanks in the Age column as follows:
  * Fill the missing ages of the passengers who survived with the average age of the survivors
  * Similarly, fill the remaining blanks with the average age of the passengers who did not survive

  **Hint:** [DataFrame.where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html) keeps the values where the condition is **`True`** and replaces them where it is **`False`**

* Drop the columns that do not contribute to predicting a person's survival

* Make sure the final dataframe does not have any null or NaN values. Delete any rows that still have missing values.

* **Example:**
  * The PassengerId column can never decide whether a person survived, hence it can be dropped
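
The age-filling steps above can be tried on a small frame first. The sketch below uses made-up ages (not rows from Titanic.csv) and hypothetical names `toy`, `mean_yes` and `mean_no`; it only illustrates how the negated condition in `where` targets the missing values of one group.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, not taken from Titanic.csv
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0, np.nan, 50.0],
    "Survived": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Mean age per group (NaN values are skipped by mean())
mean_yes = toy.loc[toy["Survived"] == "Yes", "Age"].mean()  # (20 + 40) / 2
mean_no = toy.loc[toy["Survived"] == "No", "Age"].mean()    # (30 + 50) / 2

# Series.where keeps a value where the condition is True and substitutes
# the second argument where it is False, so the condition is negated.
missing_yes = toy["Age"].isna() & (toy["Survived"] == "Yes")
toy["Age"] = toy["Age"].where(~missing_yes, mean_yes)

# The blanks that remain all belong to the "No" group
toy["Age"] = toy["Age"].fillna(mean_no)

print(toy)
```

After both steps the missing "Yes" age becomes 30.0 and the missing "No" age becomes 40.0.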






In [51]:
df.describe()

| | PassengerId | Age | Siblings_Spouse | Parents_Children | Fare |
|---|---|---|---|---|---|
| count | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [52]:
df.isna().sum()

PassengerId           0
Ticket_Class          0
Name                  0
Sex                   0
Age                 177
Siblings_Spouse       0
Parents_Children      0
Ticket_Number         0
Fare                  0
Cabin               687
Embarked              0
Survived              0
dtype: int64
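
To put these counts in proportion, the short sketch below converts them into shares of the 891 rows reported by `describe()`. The numbers are copied from the outputs above; the names `n_rows`, `missing` and `shares` are only for illustration. On the real dataframe, `df.isna().mean()` should give the same fractions directly.

```python
# Counts copied from the isna().sum() output above
n_rows = 891
missing = {"Age": 177, "Cabin": 687}

# Share of rows with a missing value in each column
shares = {column: count / n_rows for column, count in missing.items()}

for column, share in shares.items():
    print(f"{column}: {share:.1%} missing")
```

Roughly a fifth of the ages are missing, which is why they are filled in, while about three quarters of the Cabin values are missing, which makes that column a natural candidate for dropping.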

In [53]:
# Finding the mean age of "Survived" people
meanS= df[df.Survived=='Yes'].Age.mean()
df.Age = df.Age.where(~((df.Age.isna()) & (df['Survived']=='Yes')), meanS)
df.isna().sum()

PassengerId           0
Ticket_Class          0
Name                  0
Sex                   0
Age                 125
Siblings_Spouse       0
Parents_Children      0
Ticket_Number         0
Fare                  0
Cabin               687
Embarked              0
Survived              0
dtype: int64

In [54]:
# Finding the mean age of "Not Survived" people
meanNS = df[df.Survived == 'No'].Age.mean()
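# Assign the result back instead of using inplace=True on a column,
# which may not update the dataframe in newer pandas versions
df['Age'] = df['Age'].fillna(meanNS)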
df.isna().sum()

PassengerId           0
Ticket_Class          0
Name                  0
Sex                   0
Age                   0
Siblings_Spouse       0
Parents_Children      0
Ticket_Number         0
Fare                  0
Cabin               687
Embarked              0
Survived              0
dtype: int64

In [55]:
# Other way of doing it without using df.where()

# means = df.groupby(['Survived'])['Age'].mean()
# df = df.set_index('Survived')
# df['Age'] = df['Age'].fillna(means)
# df = df.reset_index()
# df.isna().sum()
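
Another option, not used in this notebook, is `groupby(...).transform('mean')`, which returns the group mean aligned row by row with the original frame, so no index juggling is needed. A minimal sketch on made-up ages (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, not taken from Titanic.csv
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0, np.nan, 50.0],
    "Survived": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# For every row, the mean age of that row's Survived group
group_mean = toy.groupby("Survived")["Age"].transform("mean")

# fillna with a Series fills each blank with the value at the same index
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
```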

In [56]:
df.isna().sum()

PassengerId           0
Ticket_Class          0
Name                  0
Sex                   0
Age                   0
Siblings_Spouse       0
Parents_Children      0
Ticket_Number         0
Fare                  0
Cabin               687
Embarked              0
Survived              0
dtype: int64

In [57]:
# Drop the columns that are not used for prediction

df.drop(columns=['PassengerId', 'Name','Ticket_Number', 'Fare', 'Cabin'], inplace=True)
df.head()

| | Ticket_Class | Sex | Age | Siblings_Spouse | Parents_Children | Embarked | Survived |
|---|---|---|---|---|---|---|---|
| 0 | 3rd class | male | 22.0 | 1 | 0 | Southampton | No |
| 1 | 1st class | female | 38.0 | 1 | 0 | Cherbourg | Yes |
| 2 | 3rd class | female | 26.0 | 0 | 0 | Southampton | Yes |
| 3 | 1st class | female | 35.0 | 1 | 0 | Southampton | Yes |
| 4 | 3rd class | male | 35.0 | 0 | 0 | Southampton | No |


In [58]:
survivedQ = df.loc[(df.Embarked == 'Queenstown')&(df.Survived == 'Yes')].shape[0]
not_survivedQ = df.loc[(df.Embarked == 'Queenstown')&(df.Survived == 'No')].shape[0]

survivedC = df.loc[(df.Embarked == 'Cherbourg')&(df.Survived == 'Yes')].shape[0]
not_survivedC = df.loc[(df.Embarked == 'Cherbourg')&(df.Survived == 'No')].shape[0]

survivedS = df.loc[(df.Embarked == 'Southampton')&(df.Survived == 'Yes')].shape[0]
not_survivedS = df.loc[(df.Embarked == 'Southampton')&(df.Survived == 'No')].shape[0]

print(survivedQ, not_survivedQ)
print(survivedC, not_survivedC)
print(survivedS, not_survivedS)

# The survival rate differs noticeably depending on the port where the passengers boarded the ship,
# so the Embarked column is useful and should not be dropped.

30 47
93 75
219 427
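
The raw counts are easier to compare as rates. The sketch below rebuilds the three pairs printed above in a small table and divides survivors by the port total; the frame name `counts` and its column names are only illustrative. On the real dataframe, `pd.crosstab(df['Embarked'], df['Survived'])` should produce the same counts in one call.

```python
import pandas as pd

# Counts copied from the printed output above
counts = pd.DataFrame(
    {"survived": [30, 93, 219], "not_survived": [47, 75, 427]},
    index=["Queenstown", "Cherbourg", "Southampton"],
)

counts["total"] = counts["survived"] + counts["not_survived"]
counts["survival_rate"] = counts["survived"] / counts["total"]

print(counts.round(3))
```

Cherbourg comes out at about 0.55, Queenstown at about 0.39 and Southampton at about 0.34, which supports keeping the Embarked column.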


### Convert categorical values to numerical
**Hint:** Use [Sklearn LabelEncoder's](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) fit_transform method

In [59]:
df.head()

| | Ticket_Class | Sex | Age | Siblings_Spouse | Parents_Children | Embarked | Survived |
|---|---|---|---|---|---|---|---|
| 0 | 3rd class | male | 22.0 | 1 | 0 | Southampton | No |
| 1 | 1st class | female | 38.0 | 1 | 0 | Cherbourg | Yes |
| 2 | 3rd class | female | 26.0 | 0 | 0 | Southampton | Yes |
| 3 | 1st class | female | 35.0 | 1 | 0 | Southampton | Yes |
| 4 | 3rd class | male | 35.0 | 0 | 0 | Southampton | No |


In [60]:
df.dtypes

Ticket_Class         object
Sex                  object
Age                 float64
Siblings_Spouse       int64
Parents_Children      int64
Embarked             object
Survived             object
dtype: object

In [61]:
le_t = preprocessing.LabelEncoder()
df['Ticket_Class'] = le_t.fit_transform(df['Ticket_Class'])

le_s = preprocessing.LabelEncoder()
df['Sex'] = le_s.fit_transform(df['Sex'])

le_e = preprocessing.LabelEncoder()
df['Embarked'] = le_e.fit_transform(df['Embarked'])

le_sur = preprocessing.LabelEncoder()
df['Survived'] = le_sur.fit_transform(df['Survived'])
df.head()

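| | Ticket_Class | Sex | Age | Siblings_Spouse | Parents_Children | Embarked | Survived |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 22.0 | 1 | 0 | 2 | 0 |
| 1 | 0 | 0 | 38.0 | 1 | 0 | 0 | 1 |
| 2 | 2 | 0 | 26.0 | 0 | 0 | 2 | 1 |
| 3 | 0 | 0 | 35.0 | 1 | 0 | 2 | 1 |
| 4 | 2 | 1 | 35.0 | 0 | 0 | 2 | 0 |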
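
The integer codes follow from how `LabelEncoder` works: it sorts the distinct values and numbers them from 0, which is why Cherbourg became 0 and Southampton became 2 above. The sketch below shows this on a short hypothetical list, along with `inverse_transform`, which is the reason for keeping one fitted encoder per column.

```python
from sklearn.preprocessing import LabelEncoder

# A short hypothetical sample of port names
ports = ["Southampton", "Cherbourg", "Southampton", "Queenstown"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

print(list(encoder.classes_))                  # distinct values in sorted order
print(list(codes))                             # position of each value in classes_
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

Note that these codes impose an arbitrary order on a nominal column such as Embarked; tree-based models like the one used below tend to cope with that, but other model families may not.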


### Use the **Survived** column as the target labels and the remaining columns as the features

* Print the shape of the features and labels


In [62]:
features = df.iloc[:, :-1]
labels = df.iloc[:, -1]
print(features.shape)
print(labels.shape)

(891, 6)
(891,)
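
Slicing with `iloc` relies on Survived being the last column. A variant that selects by column name does not depend on column order; the sketch below shows it on a tiny hypothetical frame `toy`.

```python
import pandas as pd

# Tiny hypothetical frame with already-encoded values
toy = pd.DataFrame({
    "Sex": [1, 0, 0],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [0, 1, 1],
})

features = toy.drop(columns="Survived")  # every column except the target
labels = toy["Survived"]                 # the target column only

print(features.shape)
print(labels.shape)
```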


###  Split the data into train and test sets




In [63]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(623, 6) (268, 6) (623,) (268,)
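
The sizes above follow from how a fractional `test_size` is handled: the test set gets the fraction rounded up, and the training set gets the rest. A small check on placeholder data with the same number of rows (the arrays `X` and `y` here are synthetic stand-ins, not the Titanic features):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 891
X = np.arange(n_samples).reshape(-1, 1)  # placeholder feature column
y = np.arange(n_samples) % 2             # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))
print(math.ceil(0.3 * n_samples))  # expected test-set size

# No row ends up in both sets
print(set(X_tr.ravel()) & set(X_te.ravel()))
```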


### Build the classification model using bagging technique
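
Bagging trains each base model on a bootstrap sample of the training set, that is, rows drawn at random with replacement. The NumPy sketch below illustrates one such draw, assuming scikit-learn's default settings for `BaggingClassifier` (`bootstrap=True`, `max_samples=1.0`), under which each sample has as many rows as the training set. The seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_train = 623                   # training-set size from the split above

# Draw n_train row indices with replacement (upper bound is exclusive)
indices = rng.integers(0, n_train, size=n_train)
n_unique = np.unique(indices).size

print(len(indices))        # same size as the training set
print(n_unique / n_train)  # typically close to 1 - 1/e, about 0.632
```

Because rows repeat, each base model sees only part of the distinct rows, which is what makes the individual models differ from one another.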

In [64]:
from sklearn.ensemble import BaggingClassifier

Bag = BaggingClassifier()
Bag.fit(X_train, y_train)
bag_y_pred = Bag.predict(X_test)

# Accuracy score of the Bagging classifier model
accuracy_score(y_test, bag_y_pred)

0.7798507462686567
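
Since `BaggingClassifier()` is created without a `random_state`, the accuracy above can change slightly from run to run. The sketch below, on synthetic data from `make_classification` rather than the Titanic features, shows one way to make the result repeatable and to get an out-of-bag (OOB) estimate from the rows each base model did not see. The choices `n_estimators=50` and the seeds are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with six features, like the notebook's feature set
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The default base estimator is a decision tree; random_state fixes the
# bootstrap draws, oob_score=True scores each row with the models that
# did not train on it
model = BaggingClassifier(n_estimators=50, oob_score=True, random_state=0)
model.fit(X_tr, y_tr)

print("OOB accuracy: ", model.oob_score_)
print("Test accuracy:", model.score(X_te, y_te))
```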

### Please answer the questions below to complete the experiment:




In [65]:
#@title Bootstrap randomly draws datasets with replacement from the training data & each sample does not have the same size as the original training set? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "FALSE" #@param ["","TRUE", "FALSE"]


In [66]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [67]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "na" #@param {type:"string"}


In [68]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [69]:
#@title  Experiment walkthrough video? { run: "auto", vertical-output: true, display-mode: "form" }
Walkthrough = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [70]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [71]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [72]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 4400
Date of submission:  24 May 2024
Time of submission:  17:17:12
View your submissions: https://aiml-iiith.talentsprint.com/notebook_submissions
