# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

# Titanic Dataset Preprocessing Assignment

In this assignment, we will preprocess the Titanic dataset to practice data cleaning, encoding, scaling, binarization, polynomial features, discretization, power transformation, and feature selection. We will use Scikit-learn's preprocessing functions to make the data ready for machine learning.

### Objectives:
- Handle missing values
- Encode categorical variables
- Scale features
- Apply binarization and discretization
- Generate polynomial features
- Perform power transformation
- Select the top features


### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython
import re
ipython = get_ipython()

notebook= "U1W2_03_Titanic_Preprocessing_sklearn" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Purchase_Dataset.csv")
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts() and getWalkthrough() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
               "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook, "feedback_walkthrough":Walkthrough ,
              "feedback_experiments_input" : Comments,
              "feedback_inclass_mentor": Mentor_support}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions:  https://learn-iiith.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


def getWalkthrough():
  try:
    if not Walkthrough:
      raise NameError
    else:
      return Walkthrough
  except NameError:
    print ("Please answer Walkthrough Question")
    return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



## Step 1: Loading the Dataset

In [None]:
# Loading the dataset
import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
print("First 5 rows of the dataset:\n", data.head())
print("\nDataset information:")
data.info()

## Step 2: Handling Missing Values

In [None]:
from sklearn.impute import SimpleImputer
# Impute numerical columns
imputer_num = SimpleImputer(strategy="median")
data["Age"] = imputer_num.fit_transform(data[["Age"]])
data["Fare"] = imputer_num.fit_transform(data[["Fare"]])
print("\nAfter imputing numerical columns:\n", data[["Age", "Fare"]].head())

# Impute categorical columns
imputer_cat = SimpleImputer(strategy="most_frequent")

In [None]:
data["Embarked"] = imputer_cat.fit_transform(data[["Embarked"]]).ravel()
print("\nAfter imputing categorical column 'Embarked':\n", data["Embarked"].head())

## Step 3: Encoding Categorical Variables

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label Encoding for binary categorical column 'Sex'
le = LabelEncoder()
data["Sex"] = le.fit_transform(data["Sex"])
print("\nAfter label encoding 'Sex':\n", data["Sex"].head())

# One-hot encoding for multi-category columns
data = pd.get_dummies(data, columns=["Embarked", "Pclass"], drop_first=True)
print("\nAfter one-hot encoding 'Embarked' and 'Pclass':\n", data.head())

## Step 4: Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standard Scaling for 'Fare'
scaler_standard = StandardScaler()
print(data['Fare'].head())
data["Fare"] = scaler_standard.fit_transform(data[["Fare"]])
print("\nAfter standard scaling 'Fare':\n", data["Fare"].head())

# Min-Max Scaling for 'Age'
scaler_minmax = MinMaxScaler()
data["Age"] = scaler_minmax.fit_transform(data[["Age"]])
print("\nAfter min-max scaling 'Age':\n", data["Age"].head())

In [None]:
# Largest 'Fare' value after standard scaling
data[["Fare"]].max()

## Step 5: Binarization

In [None]:
# Age in years that corresponds to 0.5 on the min-max scale
scaler_minmax.inverse_transform([[0.5]])

In [None]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.5)
data["Age_binarized"] = binarizer.fit_transform(data[["Age"]])
print("\nAfter binarizing 'Age':\n", data[["Age", "Age_binarized"]].head())

## Step 6: Discretization

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
data["Fare_binned"] = discretizer.fit_transform(data[["Fare"]])
print("\nAfter discretizing 'Fare':\n", data[["Fare", "Fare_binned"]].head())

In [None]:
# Inspect the rows that fall into each fare bin
for bin_id, rows in data[["Fare", "Fare_binned"]].groupby("Fare_binned"):
  print(bin_id, rows)

## Step 7: Polynomial Features

In [None]:
from sklearn.preprocessing import PolynomialFeatures

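# include_bias=False skips the constant "1" column, which carries no
# information and would get an undefined F-score during feature selection
poly = PolynomialFeatures(degree=2, include_bias=False)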
poly_features = poly.fit_transform(data[["Age", "Fare"]])
poly_feature_names = poly.get_feature_names_out(["Age", "Fare"])
data[poly_feature_names] = poly_features
print("\nAfter generating polynomial features:\n", data[poly_feature_names].head())

## Step 8: Power Transformation

In [None]:
from sklearn.preprocessing import PowerTransformer

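# The default method is Yeo-Johnson, which accepts the negative values
# that 'Fare' contains after standard scaling
power_transformer = PowerTransformer()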
data["Fare_power"] = power_transformer.fit_transform(data[["Fare"]])
print("\nAfter power transforming 'Fare':\n", data[["Fare", "Fare_power"]].head())

## Step 9: Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Dropping non-numeric columns
data_numeric = data.drop(["Name", "Ticket", "Cabin"], axis=1)

# Define features and target
X = data_numeric.drop("Survived", axis=1)
y = data_numeric["Survived"]

# Apply SelectKBest
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support(indices=True)]
print("\nSelected features based on ANOVA F-value:", selected_features)

## Step 10: Splitting the Data

In [None]:
from sklearn.model_selection import train_test_split

# Using the selected features
X = data[selected_features]  # Use only selected features for training
y = data["Survived"]

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nFirst 5 rows of training set:\n", X_train.head())

1. Handling Missing Values

Purpose: Missing values can cause errors in machine learning algorithms or lead to biased models if not handled properly.

Effect: Imputing with the median for numeric columns ensures that extreme values don’t unduly influence the imputed value. For categorical columns, using the most frequent category helps avoid creating artificial patterns by filling missing entries with a reasonable, frequently occurring value.

Modeling Impact: Filling missing values prevents data loss and ensures that each feature can contribute effectively to the model without gaps.
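
To see why the median is the safer choice for skewed columns, here is a minimal sketch on hypothetical toy data (the values below are made up for illustration and are not taken from the Titanic dataset). It compares median and mean imputation when one extreme value is present, and fills a missing category with the most frequent one.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column: one extreme fare and one missing entry
fares = np.array([[7.0], [8.0], [9.0], [500.0], [np.nan]])

median_filled = SimpleImputer(strategy="median").fit_transform(fares)
mean_filled = SimpleImputer(strategy="mean").fit_transform(fares)

median_value = median_filled[-1, 0]  # median of 7, 8, 9, 500
mean_value = mean_filled[-1, 0]      # mean of 7, 8, 9, 500
print("Median fill:", median_value)
print("Mean fill:", mean_value)

# Hypothetical categorical column with one missing entry
ports = np.array([["S"], ["S"], ["C"], [np.nan]], dtype=object)
ports_filled = SimpleImputer(strategy="most_frequent").fit_transform(ports).ravel()
print("Most-frequent fill:", ports_filled[-1])
```

The single outlier pulls the mean far above the typical fare, while the median stays close to the bulk of the values.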

2. Encoding Categorical Variables

Purpose: Machine learning models generally cannot handle categorical variables directly, as they require numerical inputs.

Effect: Label encoding binary categorical features, like Sex, converts them into numerical form, while one-hot encoding for multi-class features, like Embarked and Pclass, ensures that the model doesn’t assume any ordinal relationship between categories.

Modeling Impact: Encoding enables the model to learn from categorical features without assuming ordinal relationships, improving interpretability and performance for these variables.
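
The two encodings can be compared side by side in a small sketch. The three-row frame below is hypothetical toy data; only the column names mirror the assignment.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# LabelEncoder sorts the classes alphabetically before numbering them
le = LabelEncoder()
sex_codes = le.fit_transform(toy["Sex"])
print(dict(zip(le.classes_, range(len(le.classes_)))))

# drop_first=True removes the first category ("C"), which becomes the
# reference level implied when all remaining dummy columns are 0
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", drop_first=True)
dummy_columns = list(dummies.columns)
print(dummy_columns)
```

Dropping one dummy column avoids perfectly redundant columns, which matters mainly for linear models.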

3. Feature Scaling

Purpose: Different features in the dataset may have vastly different scales, which can lead to biased model behavior (favoring features with larger scales).

Effect: Standard scaling (Fare) rescales values to zero mean and unit variance, and min-max scaling (Age) maps them to the range [0, 1]. Putting features on comparable scales keeps large-valued features from dominating, which matters most for distance-based algorithms such as KNN and for SVMs.

Modeling Impact: Scaling helps gradient-based models converge faster and gives all features a balanced treatment, leading to more stable and reliable performance.
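
As a quick check of what each scaler guarantees, the sketch below applies both to the same hypothetical four-value column and inspects the result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy values
values = np.array([[10.0], [20.0], [30.0], [40.0]])

standardized = StandardScaler().fit_transform(values)
rescaled = MinMaxScaler().fit_transform(values)

# Standard scaling: mean 0 and (population) standard deviation 1
print(standardized.mean(), standardized.std())
# Min-max scaling: smallest value maps to 0, largest to 1
print(rescaled.min(), rescaled.max())
```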

4. Binarization

Purpose: For some continuous features, such as Age, converting values into two categories can simplify a complex distribution or highlight a specific distinction (e.g., minor vs. adult).

Effect: Binarizing the min-max-scaled Age at a threshold of 0.5 splits passengers at the midpoint of the observed age range, not at the median age: values above 0.5 become 1 and all others become 0. Calling scaler_minmax.inverse_transform([[0.5]]) converts the threshold back to years.

Modeling Impact: This approach can make certain features more interpretable and lead to better performance in cases where binary decisions or groups are more meaningful.
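
Because the threshold is applied to scaled values, it helps to translate it back into the original units. A minimal sketch with hypothetical ages (chosen so the arithmetic is exact):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler

# Hypothetical ages in years
ages = np.array([[0.0], [16.0], [32.0], [48.0], [64.0]])

scaler = MinMaxScaler().fit(ages)
scaled = scaler.transform(ages)

# Binarizer maps values strictly greater than the threshold to 1
flags = Binarizer(threshold=0.5).fit_transform(scaled).ravel()
print(flags)

# Convert the scaled threshold back to years
cutoff_age = scaler.inverse_transform([[0.5]])[0, 0]
print("Cut-off age:", cutoff_age)
```

Note that a value exactly at the threshold is assigned 0, since only values strictly above it become 1.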

5. Discretization

Purpose: Discretizing continuous features into discrete bins can sometimes improve model performance by reducing noise and capturing distinct groups.

Effect: Dividing Fare into three bins (low, medium, high) lets the model treat fare categories distinctly instead of interpreting a continuous scale. With strategy="uniform" the bins have equal width, so a right-skewed feature like Fare can leave most passengers in the lowest bin.

Modeling Impact: Discretization can simplify complex distributions and helps models that assume a linear effect (such as linear or logistic regression) capture non-linear patterns. Tree-based models already split continuous features on their own and usually gain less.
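
The sketch below shows how equal-width bins behave on a skewed column. The fares are hypothetical toy values with one very large entry.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical fares: five small values and one large outlier
fares = np.array([[5.0], [7.0], [8.0], [10.0], [12.0], [300.0]])

# strategy="uniform" splits the range [5, 300] into three equal-width bins
uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
uniform_bins = uniform.fit_transform(fares).ravel()
print("Bin edges:", uniform.bin_edges_[0])

# Count how many values land in each bin
counts = np.bincount(uniform_bins.astype(int), minlength=3)
print("Values per bin:", counts)
```

Here the middle bin stays empty and almost everything lands in the lowest bin. Setting strategy="quantile" is one alternative when bins with similar counts are preferred.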

6. Polynomial Features

Purpose: Polynomial features capture interactions between features and can reveal non-linear relationships.

Effect: Generating interaction terms and higher-degree features for Age and Fare introduces quadratic relationships, allowing the model to capture more complex dependencies between these features.

Modeling Impact: Polynomial features can improve the model’s ability to learn non-linear patterns, especially in cases where interactions between features are crucial to predictions.

7. Power Transformation

Purpose: Power transformations reduce skewness, normalize distributions, and stabilize variance, which can be beneficial for heavily skewed features.

Effect: Applying a power transformation to Fare can make the distribution more Gaussian, which often improves performance, particularly for algorithms sensitive to skewed data.

Modeling Impact: Normalizing distributions with power transformations can make learning more effective by reducing bias introduced by skewed data.
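
The effect on skewness can be measured directly. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed feature (an assumption for illustration; it is not the actual Fare column) and compares skewness before and after the transformation.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data (hypothetical stand-in for a fare-like feature)
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=3.0, sigma=1.0, size=500).reshape(-1, 1)

# Yeo-Johnson is the default method; the output is also standardized
transformed = PowerTransformer(method="yeo-johnson").fit_transform(fares)

skew_before = skew(fares.ravel())
skew_after = skew(transformed.ravel())
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
```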

8. Feature Selection

Purpose: Feature selection helps identify the most relevant features, reducing dimensionality and minimizing noise.

Effect: Using SelectKBest to choose the top features based on the ANOVA F-value selects features most related to the target variable (Survived), which reduces the dataset size and potentially improves model efficiency.

Modeling Impact: Feature selection can lead to simpler, faster, and sometimes more accurate models by focusing on only the most predictive features, thereby improving model generalization and reducing overfitting.
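
A small synthetic example illustrates how the ANOVA F-value ranks features. The column names and data below are hypothetical: one feature tracks the target closely and two are pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)  # hypothetical binary target

# One feature that follows the target and two unrelated noise features
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.1, size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = list(X.columns[selector.get_support()])
print("F-scores:", dict(zip(X.columns, selector.scores_)))
print("Selected:", selected)
```

The feature whose mean differs most between the two classes, relative to its spread within each class, receives the highest score.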


## Summary

In this assignment, you preprocessed the Titanic dataset by handling missing values, encoding categorical variables, scaling features, binarizing, discretizing, generating polynomial and interaction features, applying a power transformation, selecting the top features, and splitting the data into training and test sets.

This preparation is essential for ensuring that the data is well-suited for machine learning models.

### Please answer the questions below to complete the experiment:




In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Experiment walkthrough video? { run: "auto", vertical-output: true, display-mode: "form" }
Walkthrough = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for Ungrading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")