# IF3170 Artificial Intelligence | Praktikum

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: xx

Group Members:
- Name (NIM)
- Name (NIM)

## Import Libraries

In [26]:
import pandas as pd
import numpy as np
# Import other libraries if needed

## Import Dataset

In [27]:
# Write your code here
train = pd.read_csv('train.csv')

# 1. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.

In [28]:
# Write your code here
train.head()

Unnamed: 0,id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,0,1170.0,D-penicillamine,23741.0,F,Y,Y,N,Y,5.2,,2.8,108.0,1790.0,151.9,,110.0,12.4,4.0,D
1,1,1786.0,Placebo,25329.0,F,N,Y,N,N,1.9,302.0,3.67,52.0,1866.0,97.65,164.0,329.0,9.9,2.0,C
2,2,1067.0,,15706.0,F,,,,N,0.6,,3.73,,,,,269.0,9.8,3.0,C
3,3,4062.0,,23011.0,F,,,,N,0.6,,3.65,,,,,388.0,11.5,4.0,C
4,4,1067.0,Placebo,11773.0,F,N,Y,N,N,0.6,346.0,3.8,81.0,1257.0,122.45,90.0,318.0,10.9,2.0,C
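
Beyond `head()`, a few summary checks usually guide the later cleaning steps. The sketch below is only an illustration: it uses a tiny hypothetical frame called `demo` (values made up, not taken from `train.csv`) to show missing-value counts, class balance, and numeric summaries. In the notebook, the same calls would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics a few columns of the dataset
demo = pd.DataFrame({
    "Bilirubin": [5.2, 1.9, 0.6, 0.6, np.nan],
    "Drug": ["D-penicillamine", "Placebo", None, None, "Placebo"],
    "Status": ["D", "C", "C", "C", "CL"],
})

# Number of missing values per column
missing_counts = demo.isna().sum()

# Relative frequency of each target class (reveals class imbalance)
class_balance = demo["Status"].value_counts(normalize=True)

# Summary statistics; describe() keeps only numeric columns by default
numeric_summary = demo.describe()

print(missing_counts)
print(class_balance)
print(numeric_summary)
```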


# 2. Split Training Set and Validation Set

Splitting the data into a training set and a validation set gives an early diagnostic of the performance of the model we train. This is done before the preprocessing steps to **avoid data leakage between the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train.csv` given by the TA. The `test.csv` data is only used for the Kaggle submission.

In [None]:
# Split training set and validation set here, store into variables train_set and val_set.
# Remember to also keep the original training set before splitting. This will become important later.

from sklearn.model_selection import train_test_split

train_set, val_set = train_test_split(train, test_size=0.2, random_state=42)

print(len(train_set), "train +", len(val_set), "val")
print(len(train))

12000 train + 3000 val
15000
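
If the `Status` classes turn out to be imbalanced, a stratified split keeps the class proportions similar in both sets. The following is a minimal sketch on a hypothetical frame `demo` with made-up class counts; in the notebook one would pass `train` and `train["Status"]` instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 "C", 15 "D", 5 "CL"
demo = pd.DataFrame({
    "feature": range(100),
    "Status": ["C"] * 80 + ["D"] * 15 + ["CL"] * 5,
})

# stratify=... preserves the class proportions in both subsets
demo_train, demo_val = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Status"]
)

print(demo_train["Status"].value_counts())
print(demo_val["Status"].value_counts())
```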


# 3. Data Cleaning and Preprocessing

This step is the first thing to be done once a Data Scientist has gained a general understanding of the data. Raw data is **seldom ready for training**, so steps must be taken to clean and format the data so that the Machine Learning model can interpret it.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

For each step you perform, **please explain why you did it. Write the explanation in a markdown cell under the code cell you wrote.**

In [30]:
# Write your code here
temp_train = train_set.copy()  # work on a copy so train_set itself is not modified
numerical_columns = ["N_Days", "Age", "Bilirubin", "Cholesterol", "Albumin", "Copper", "Alk_Phos", "SGOT", "Tryglicerides", "Platelets", "Prothrombin"]
categorical_columns = [
    ("Status", ["C", "CL", "D"]),  # "CL" is one of the target classes, so it must count as valid
    ("Drug", ["D-penicillamine", "Placebo"]),
    ("Sex", ["M", "F"]),
    ("Ascites", ["N", "Y"]),
    ("Hepatomegaly", ["N", "Y"]),
    ("Spiders", ["N", "Y"]),
    ("Edema", ["N", "S", "Y"]),
    ("Stage", [1, 2, 3, 4])
]


# Handle data cleaning and Preprocessing here
# Handle Missing values
def handleMissingValues(data):
    data = data.copy()
    for column in data.columns:
        if column in numerical_columns:
            # Fill numerical columns with the column mean
            data[column] = data[column].fillna(data[column].mean())
        else:
            # Fill categorical columns with the most frequent value;
            # mode() returns a Series, so take its first entry
            data[column] = data[column].fillna(data[column].mode()[0])
    print(f"Number of rows: {len(data)}")
    return data
# Handle Outliers
def handleOutliers(data):
    data = data.copy()
    for column in numerical_columns:
        # Clip values lying more than 1.5 * IQR outside the quartiles
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data[column] = data[column].clip(lower_bound, upper_bound)

    for col_name, valid_values in categorical_columns:
        if col_name in data.columns:
            # Replace invalid values with NaN
            data[col_name] = data[col_name].apply(
                lambda x: x if x in valid_values else np.nan
            )
    return data


# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


class HandleMissingValues(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the fill values from the data passed to fit (the training set),
        # so transform() never uses statistics of the validation or test set
        self.fill_values_ = {}
        for column in X.columns:
            if column in numerical_columns:
                self.fill_values_[column] = X[column].mean()
            else:
                self.fill_values_[column] = X[column].mode()[0]
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its learned value
        # and returns a new DataFrame (no inplace modification)
        return X.fillna(self.fill_values_)

class HandleOutliers(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the IQR-based clipping bounds from the training data
        self.bounds_ = {}
        for column in numerical_columns:
            Q1 = X[column].quantile(0.25)
            Q3 = X[column].quantile(0.75)
            IQR = Q3 - Q1
            self.bounds_[column] = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
        return self

    def transform(self, X):
        data = X.copy()
        for column, (lower_bound, upper_bound) in self.bounds_.items():
            data[column] = data[column].clip(lower_bound, upper_bound)

        for col_name, valid_values in categorical_columns:
            if col_name in data.columns:
                # Replace values outside the valid set with NaN
                data[col_name] = data[col_name].apply(
                    lambda x: x if x in valid_values else np.nan
                )
        return data

pipe = Pipeline([("imputer", HandleMissingValues()),
                 ("outlier_remover", HandleOutliers())])






Unnamed: 0,id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
9839,9839,1967.0,D-penicillamine,14019.0,F,N,Y,N,N,2.10,315.000000,3.69,75.000000,1637.000000,136.411582,96.273327,136.0,9.6,3.0,C
9680,9680,4190.0,Placebo,14060.0,F,N,N,N,N,0.90,263.008434,3.50,24.000000,423.000000,63.753051,119.544455,213.0,10.1,2.0,C
7093,7093,2812.0,D-penicillamine,18302.0,F,N,N,N,N,0.70,321.327710,3.48,76.178271,1670.811899,109.164633,110.817782,273.0,10.6,3.0,C
11293,11293,3149.0,D-penicillamine,20459.0,F,N,N,N,N,0.90,298.000000,3.65,25.000000,685.000000,71.300000,96.273327,311.0,9.7,3.0,C
820,820,460.0,D-penicillamine,23241.0,M,N,N,N,N,0.90,356.319276,3.90,39.000000,645.000000,70.000000,96.273327,228.0,12.3,3.0,D
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,5191,2716.0,D-penicillamine,19358.0,F,N,N,N,N,0.60,321.327710,3.58,76.178271,1670.811899,109.164633,110.817782,330.0,9.9,3.0,C
13418,13418,625.0,D-penicillamine,23741.0,F,N,N,N,N,3.45,321.327710,3.40,76.178271,1670.811899,109.164633,110.817782,190.0,11.0,4.0,D
5390,5390,2534.0,D-penicillamine,21185.0,F,N,N,N,N,0.60,321.327710,3.45,76.178271,1670.811899,109.164633,110.817782,200.0,10.3,4.0,C
860,860,2456.0,D-penicillamine,17774.0,F,N,N,N,N,1.40,263.008434,3.60,74.000000,1009.000000,136.411582,108.000000,271.0,10.1,3.0,C


In [None]:
# Your code should work up until this point
train_set = pipe.fit_transform(train_set)
val_set = pipe.transform(val_set)



or create your own here

In [None]:
# Write your code here

# 4. Modeling and Validation

Modeling is the process of building your own machine learning models to solve specific problems. In the context of this assignment, that means predicting the probability of each class of the `Status` feature (`Status_C`, `Status_CL`, `Status_D`). Validation is the process of evaluating your trained model using the validation set or a cross-validation method, and reporting metrics that help you decide what to do in the next iteration of development.
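
The fit, predict-probabilities, and validate loop described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, standing in for the preprocessed features and `Status`), the model is scikit-learn's `GaussianNB` rather than an implementation required by the assignment, and multi-class log loss is assumed as the metric because the task asks for class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the preprocessed features and Status
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GaussianNB().fit(X_train, y_train)

# One probability column per class, ordered as in model.classes_
val_proba = model.predict_proba(X_val)

# Multi-class log loss on the validation set: lower is better
val_log_loss = log_loss(y_val, val_proba, labels=model.classes_)

print(val_proba[:3])
print(f"validation log loss: {val_log_loss:.3f}")
```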

## KNN

## Naive Bayes

In [None]:
# Type your code here

## ID3

In [None]:
# Type your code here

## SVM

In [None]:
# Type your code here

## Logistic Regression

In [None]:
# Type your code here

## Notes for improvements

- **Visualize the model evaluation result**

This will help you understand your model's performance in more detail. From the visualization, you can see clearly whether your model leans towards one class more than the others. (Hint: confusion matrix, ROC-AUC curve, etc.)
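
As a small illustration of the confusion-matrix hint, the sketch below computes the matrix and per-class recall numerically. The labels in `y_true` and `y_pred` are hypothetical, chosen so that the minority class `CL` is never predicted correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for the three Status classes
labels = ["C", "CL", "D"]
y_true = ["C", "C", "C", "C", "CL", "CL", "D", "D", "D", "D"]
y_pred = ["C", "C", "C", "D", "C", "C", "D", "D", "C", "D"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: the diagonal divided by each row total
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(dict(zip(labels, recall_per_class)))
```

A row whose diagonal entry is small relative to the row total points to a class the model tends to miss.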

- **Explore the hyperparameters of your models**

Each model has its own hyperparameters, and each hyperparameter affects the model's behaviour differently. You can optimize model performance by finding a good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: grid search, random search, Bayesian optimization)
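
A grid search could look roughly like the sketch below. It assumes scikit-learn's `GridSearchCV` with a `KNeighborsClassifier` on synthetic data; the grid values in `param_grid` and the `neg_log_loss` scoring choice are illustrative assumptions, not values prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the preprocessed training set
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Hypothetical grid: 3 values of k times 2 weighting schemes = 6 candidates
param_grid = {
    "n_neighbors": [3, 5, 11],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="neg_log_loss",  # higher (closer to 0) is better
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```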

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation than hold-out (single train-test split) validation. However, it requires more time and computing power because the model is trained once per fold. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)
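
The sketch below shows stratified k-fold cross-validation with preprocessing placed inside a pipeline, so that it is refit on each training fold and no statistics leak from the held-out fold. The synthetic data, the `StandardScaler` step, and the `LogisticRegression` model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the training set
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# The scaler is fit only on the training part of each fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores)
print(f"mean accuracy: {scores.mean():.3f}")
```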

- **Ensemble methods**

Ensemble methods are powerful machine learning techniques that combine the predictions of multiple models (often referred to as base learners or weak learners) to create a stronger, more accurate predictive model. The idea behind ensemble methods is that by aggregating the opinions of multiple models, you can reduce the impact of individual model errors and improve overall prediction performance. (Hint: bagging, boosting, stacking, voting)
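
As one example of the voting hint, the sketch below averages the predicted probabilities of three different models with scikit-learn's `VotingClassifier` (soft voting). The choice of base models and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the preprocessed training set
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Soft voting averages the class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("nb", GaussianNB()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

proba = ensemble.predict_proba(X[:5])
print(proba)
```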

- **Model interpretation**

Model interpretation is the process of understanding and explaining the inner workings of a machine learning model, particularly its decision-making process. Interpretation helps data scientists, stakeholders, and end-users gain insights into why a model makes certain predictions or classifications. Model interpretation is crucial for building trust in machine learning systems, identifying biases, and extracting actionable information from models. (Hint: feature importance, PDP, SHAP values, etc.)
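
One model-agnostic way to estimate feature importance is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the model and the data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data: 3 informative features plus 3 noise features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature 10 times on the validation set and record the score drop
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0
)

# Print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```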

- **Explore other models**

There are many ML models that you can use for this use case. Try to explore them and use them to solve this problem.

## Submission
To predict the target feature of the test set and submit the results to the Kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the one built in the preprocessing section.
2. With the pipeline, apply `fit_transform` to the original training set (before splitting), then apply only `transform` to the test set.
3. Retrain the model on the preprocessed training set.
4. Predict the test set.
5. Make sure the submission contains the `id`, `Status_C`, `Status_CL`, and `Status_D` columns.
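
The last two steps above can be sketched as follows. Everything here is a stand-in: `X_full`, `y_full`, `X_test`, and `test_ids` are hypothetical arrays generated for illustration (in the notebook they would come from the preprocessed `train.csv` and `test.csv`), and `GaussianNB` is just a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical preprocessed data standing in for the real train and test sets
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 4))
y_full = np.array(["C", "CL", "D"] * 20)
X_test = rng.normal(size=(5, 4))
test_ids = [15000, 15001, 15002, 15003, 15004]  # made-up ids

# Retrain on the full training set, then predict class probabilities
model = GaussianNB().fit(X_full, y_full)
proba = model.predict_proba(X_test)

# predict_proba columns follow model.classes_, so name them accordingly
submission = pd.DataFrame(
    proba, columns=[f"Status_{c}" for c in model.classes_]
)
submission.insert(0, "id", test_ids)

print(submission.head())
```

Calling `submission.to_csv("submission.csv", index=False)` would then write the file without the extra index column; the file name is an assumption.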

In [None]:
# Type your code here

# 6. Error Analysis

Based on everything you have done up to the modeling and evaluation step, write an analysis that supports each step you have taken to solve this problem. Write the analysis in a markdown block. Some questions that may help you write the analysis:

- Does my model perform better in predicting one class than the others? If so, why is that?
- Of the models I have tried, which performs best, and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

`Provide your analysis here`