# IF3070 Foundations of Artificial Intelligence | Tugas Besar 2

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: 33

Group Members:
- Audra Zelvania P. H. (18222106)
- Rizqi Andhika Pratama (18222118)
- Sekar Anindita Nurjadini (18222125)
- Khayla Belva Annandira (18222138)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import GaussianNB

## Import Dataset

In [2]:
df = pd.read_csv('https://drive.google.com/uc?id=1a96WTg0CHxx2Ja7BFuWGlVneUXQsgD7Y')
df.head()

Unnamed: 0,id,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
0,1,,https://www.northcm.ac.th,24.0,www.northcm.ac.th,17.0,0.0,,0.8,,...,0.0,0.0,1.0,,3.0,,69.0,,,1
1,4,8135291.txt,http://uqr.to/1il1z,,,,,to,1.0,0.000896,...,,0.0,0.0,,,,,,1.0,0
2,5,586561.txt,https://www.woolworthsrewards.com.au,35.0,www.woolworthsrewards.com.au,28.0,0.0,au,0.857143,,...,1.0,0.0,1.0,33.0,7.0,8.0,15.0,,2.0,1
3,6,,,31.0,,,,com,0.5625,0.522907,...,1.0,0.0,1.0,24.0,5.0,14.0,,,,1
4,11,412632.txt,,,www.nyprowrestling.com,22.0,0.0,,1.0,,...,0.0,0.0,1.0,,,14.0,,0.0,,1


# 1. Split Training Set and Validation Set

Splitting the training and validation set works as an early diagnostic towards the performance of the model we train. This is done before the preprocessing steps to **avoid data leakage inbetween the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for kaggle submission.

In [4]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y) directly from df
X = df.drop(['label'], axis=1)  # Drop 'label' as it's the target variable
y = df['label']  # 'label' is the target variable

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.1,  # Gunakan test_size lebih kecil untuk mempercepat
    random_state=42,
    stratify=None  # Nonaktifkan stratifikasi jika tidak diperlukan
)

# Mengambil 10% data untuk pengujian
X_train_sample = X_train.sample(frac=0.1, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]

# Print shapes of the resulting datasets
print(f"Training features shape: {X_train.shape}")
print(f"Validation features shape: {X_val.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Validation labels shape: {y_val.shape}")

Training features shape: (126363, 55)
Validation features shape: (14041, 55)
Training labels shape: (126363,)
Validation labels shape: (14041,)


# 2. Data Cleaning and Preprocessing

This step is the first thing to be done once a Data Scientist have grasped a general knowledge of the data. Raw data is **seldom ready for training**, therefore steps need to be taken to clean and format the data for the Machine Learning model to interpret.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We will give some common methods for you to try, but you only have to **at least implement one method for each process**. For each step that you will do, **please explain the reason why did you do that process. Write it in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

Missing data can adversely affect the performance and accuracy of machine learning models. There are several strategies to handle missing data in machine learning:

1. **Data Imputation:**

    a. **Mean, Median, or Mode Imputation:** For numerical features, you can replace missing values with the mean, median, or mode of the non-missing values in the same feature. This method is simple and often effective when data is missing at random.

    b. **Constant Value Imputation:** You can replace missing values with a predefined constant value (e.g., 0) if it makes sense for your dataset and problem.

    c. **Imputation Using Predictive Models:** More advanced techniques involve using predictive models to estimate missing values. For example, you can train a regression model to predict missing numerical values or a classification model to predict missing categorical values.

2. **Deletion of Missing Data:**

    a. **Listwise Deletion:** In cases where the amount of missing data is relatively small, you can simply remove rows with missing values from your dataset. However, this approach can lead to a loss of valuable information.

    b. **Column (Feature) Deletion:** If a feature has a large number of missing values and is not critical for your analysis, you can consider removing that feature altogether.

3. **Domain-Specific Strategies:**

    a. **Domain Knowledge:** In some cases, domain knowledge can guide the imputation process. For example, if you know that missing values are related to a specific condition, you can impute them accordingly.

4. **Imputation Libraries:**

    a. **Scikit-Learn:** Scikit-Learn provides a `SimpleImputer` class that can handle basic imputation strategies like mean, median, and mode imputation.

    b. **Fancyimpute:** Fancyimpute is a Python library that offers more advanced imputation techniques, including matrix factorization, k-nearest neighbors, and deep learning-based methods.

The choice of imputation method should be guided by the nature of your data, the amount of missing data, the problem you are trying to solve, and the assumptions you are willing to make.

In [5]:
class MissingDataHandler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.numeric_imputer = {}
        self.categorical_imputer = {}

    def fit(self, X, y=None):
        # Spcify categorical and numerical features
        categorical_features = ['FILENAME', 'URL', 'Domain', 'TLD', 'Title']
        numerical_features = [
            col for col in X.columns if col not in categorical_features + ['id', 'label']
        ]

        # Compute imputers
        for col in numerical_features:
            if col in X.columns:
                self.numeric_imputer[col] = X[col].median()

        for col in categorical_features:
            if col in X.columns:
                self.categorical_imputer[col] = X[col].mode()[0] if not X[col].mode().empty else None

        return self

    def transform(self, X, y=None):
        X = X.copy()
        # Impute missing numerical values
        for col, value in self.numeric_imputer.items():
            if col in X.columns:
                X[col].fillna(value, inplace=True)

        # Impute missing categorical values
        for col, value in self.categorical_imputer.items():
            if col in X.columns and value is not None:
                X[col].fillna(value, inplace=True)
        return X

    def save(self, X, current_dir):
        # Define the cleaned data path
        cleaned_data_path = os.path.join(current_dir, 'cleaned_data.csv')
        X.to_csv(cleaned_data_path, index=False)
        print(f"Dataset saved to {cleaned_data_path}")

### II. Dealing with Outliers

Outliers are data points that significantly differ from the majority of the data. They can be unusually high or low values that do not fit the pattern of the rest of the dataset. Outliers can significantly impact model performance, so it is important to handle them properly.

Some methods to handle outliers:
1. **Imputation**: Replace with mean, median, or a boundary value.
2. **Clipping**: Cap values to upper and lower limits.
3. **Transformation**: Use log, square root, or power transformations to reduce their influence.
4. **Model-Based**: Use algorithms robust to outliers (e.g., tree-based models, Huber regression).

In [6]:
class OutlierHandler(BaseEstimator, TransformerMixin):
    def __init__(self, method='iqr', multiplier=1.5, strategy='clip'):
        """
        method: Outlier detection method ('iqr' supported currently).
        multiplier: Multiplier for IQR to calculate bounds (used only with 'iqr' method).
        strategy: Strategy to handle outliers ('clip', 'mean', 'median').
        """
        self.method = method
        self.multiplier = multiplier
        self.strategy = strategy
        self.bounds = {}

    def fit(self, X, y=None):
        # Automatically detect numeric columns
        numeric_cols = X.select_dtypes(include=[np.number]).columns

        # Calculate bounds for each numeric column using IQR method
        for col in numeric_cols:
            Q1 = X[col].quantile(0.25)
            Q3 = X[col].quantile(0.75)
            IQR = Q3 - Q1
            self.bounds[col] = {
                'lower': Q1 - self.multiplier * IQR,
                'upper': Q3 + self.multiplier * IQR
            }
        print(f"Bounds calculated for columns: {self.bounds}")
        return self

    def transform(self, X, y=None):
        # Ensure that bounds are calculated before transforming
        if not self.bounds:
            raise ValueError("The model is not fitted yet. Please call 'fit' first.")

        X = X.copy()

        # Handle outliers based on the calculated bounds
        for col, bounds in self.bounds.items():
            if col in X.columns:
                if self.strategy == 'clip':
                    # Clipping the values between lower and upper bounds
                    X[col] = np.clip(X[col], bounds['lower'], bounds['upper'])
                elif self.strategy == 'mean':
                    # Replace outliers with the mean of the column
                    mean_value = X[col].mean()
                    X[col] = X[col].where(
                        (X[col] >= bounds['lower']) & (X[col] <= bounds['upper']),
                        mean_value
                    )
                elif self.strategy == 'median':
                    # Replace outliers with the median of the column
                    median_value = X[col].median()
                    X[col] = X[col].where(
                        (X[col] >= bounds['lower']) & (X[col] <= bounds['upper']),
                        median_value
                    )
        return X

### III. Remove Duplicates
Handling duplicate values is crucial because they can compromise data integrity, leading to inaccurate analysis and insights. Duplicate entries can bias machine learning models, causing overfitting and reducing their ability to generalize to new data. They also inflate the dataset size unnecessarily, increasing computational costs and processing times. Additionally, duplicates can distort statistical measures and lead to inconsistencies, ultimately affecting the reliability of data-driven decisions and reporting. Ensuring data quality by removing duplicates is essential for accurate, efficient, and consistent analysis.

In [7]:
class DuplicateHandler(BaseEstimator, TransformerMixin):
  def __init__(self, subset=None, keep='first'):
    self.subset = subset
    self.keep = keep

  def fit(self, X, y=None):
    return self

  def fit_transform(self, X, y=None):
    X = X.copy()
    if y is not None:
        unique_idx = ~X.duplicated(subset=self.subset, keep=self.keep)
        X_unique = X[unique_idx].reset_index(drop=True)
        y_unique = y[unique_idx].reset_index(drop=True)
        return X_unique, y_unique
    return X.drop_duplicates(subset=self.subset, keep=self.keep).reset_index(drop=True)

  def transform(self, X, y=None):
    X = X.copy()
    if y is not None:
        return X, y
    return X.drop_duplicates(subset=self.subset, keep=self.keep).reset_index(drop=True)

### IV. Feature Engineering

**Feature engineering** involves creating new features (input variables) or transforming existing ones to improve the performance of machine learning models. Feature engineering aims to enhance the model's ability to learn patterns and make accurate predictions from the data. It's often said that "good features make good models."

1. **Feature Selection:** Feature engineering can involve selecting the most relevant and informative features from the dataset. Removing irrelevant or redundant features not only simplifies the model but also reduces the risk of overfitting.

2. **Creating New Features:** Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, engineers create new features that provide additional information. For example:
   
   - **Polynomial Features:** Engineers may create new features by taking the square, cube, or other higher-order terms of existing numerical features. This can help capture nonlinear relationships.
   
   - **Interaction Features:** Interaction features are created by combining two or more existing features. For example, if you have features "length" and "width," you can create an "area" feature by multiplying them.

3. **Binning or Discretization:** Continuous numerical features can be divided into bins or categories. For instance, age values can be grouped into bins like "child," "adult," and "senior."

4. **Domain-Specific Feature Engineering:** Depending on the domain and problem, engineers may create domain-specific features. For example, in fraud detection, features related to transaction history and user behavior may be engineered to identify anomalies.

Feature engineering is both a creative and iterative process. It requires a deep understanding of the data, domain knowledge, and experimentation to determine which features will enhance the model's predictive power.

In [8]:
class PhishingFeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def calculate_capital_ratio(self, url):
        url = str(url)
        capital_count = sum(1 for char in url if char.isupper())
        return capital_count / (len(url) + 1)

    def check_phishing_keywords(self, url):
        phishing_keywords = ['login', 'secure', 'account', 'bank', 'verify']
        url = str(url).lower()
        for keyword in phishing_keywords:
            if keyword in url:
                return 1
        return 0

    def count_url_segments(self, url):
        url = str(url)
        return url.count('/')

    def calculate_special_char_ratio(self, url):
        special_chars = set('!@#$%^&*()[]{}|\\:;"\'<>,.?/~`-=_+')
        url = str(url)
        special_char_count = sum(1 for char in url if char in special_chars)
        return special_char_count / (len(url) + 1)

    def transform(self, X, y=None):
        X = X.copy()

        if 'URL' in X.columns:
            X['capital_ratio'] = X['URL'].apply(self.calculate_capital_ratio)

        if 'URL' in X.columns:
            X['contains_phishing_keywords'] = X['URL'].apply(self.check_phishing_keywords)

        if 'URL' in X.columns:
            X['url_segment_count'] = X['URL'].apply(self.count_url_segments)

        if 'URL' in X.columns:
            X['special_char_ratio'] = X['URL'].apply(self.calculate_special_char_ratio)

        if 'TLDLegitimateProb' in X.columns:
            X['TLDLegitimateProb'] = X['TLDLegitimateProb'].fillna(0)

        if 'NoOfSubDomain' in X.columns:
            X['NoOfSubDomain'] = X['NoOfSubDomain'].fillna(0)

        return X

## B. Data Preprocessing

**Data preprocessing** is a broader step that encompasses both data cleaning and additional transformations to make the data suitable for machine learning algorithms. Its primary goals are:

1. **Feature Scaling:** Ensure that numerical features have similar scales. Common techniques include Min-Max scaling (scaling to a specific range) or standardization (mean-centered, unit variance).

2. **Encoding Categorical Variables:** Machine learning models typically work with numerical data, so categorical variables need to be encoded. This can be done using one-hot encoding, label encoding, or more advanced methods like target encoding.

3. **Handling Imbalanced Classes:** If dealing with imbalanced classes in a binary classification task, apply techniques such as oversampling, undersampling, or using different evaluation metrics to address class imbalance.

4. **Dimensionality Reduction:** Reduce the number of features using techniques like Principal Component Analysis (PCA) or feature selection to simplify the model and potentially improve its performance.

5. **Normalization:** Normalize data to achieve a standard distribution. This is particularly important for algorithms that assume normally distributed data.

### Notes on Preprocessing processes

It is advised to create functions or classes that have the same/similar type of inputs and outputs, so you can add, remove, or swap the order of the processes easily. You can implement the functions or classes by yourself

or

use `sklearn` library. To create a new preprocessing component in `sklearn`, implement a corresponding class that includes:
1. Inheritance to `BaseEstimator` and `TransformerMixin`
2. The method `fit`
3. The method `transform`

### I. Feature Scaling

**Feature scaling** is a preprocessing technique used in machine learning to standardize the range of independent variables or features of data. The primary goal of feature scaling is to ensure that all features contribute equally to the training process and that machine learning algorithms can work effectively with the data.

Here are the main reasons why feature scaling is important:

1. **Algorithm Sensitivity:** Many machine learning algorithms are sensitive to the scale of input features. If the scales of features are significantly different, some algorithms may perform poorly or take much longer to converge.

2. **Distance-Based Algorithms:** Algorithms that rely on distances or similarities between data points, such as k-nearest neighbors (KNN) and support vector machines (SVM), can be influenced by feature scales. Features with larger scales may dominate the distance calculations.

3. **Regularization:** Regularization techniques, like L1 (Lasso) and L2 (Ridge) regularization, add penalty terms based on feature coefficients. Scaling ensures that all features are treated equally in the regularization process.

Common methods for feature scaling include:

1. **Min-Max Scaling (Normalization):** This method scales features to a specific range, typically [0, 1]. It's done using the following formula:

   $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

   - Here, $X$ is the original feature value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.  
<br />
<br />
2. **Standardization (Z-score Scaling):** This method scales features to have a mean (average) of 0 and a standard deviation of 1. It's done using the following formula:

   $$X' = \frac{X - \mu}{\sigma}$$

   - $X$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.  
<br />
<br />
3. **Robust Scaling:** Robust scaling is a method that scales features to the interquartile range (IQR) and is less affected by outliers. It's calculated as:

   $$X' = \frac{X - Q1}{Q3 - Q1}$$

   - $X$ is the original feature value, $Q1$ is the first quartile (25th percentile), and $Q3$ is the third quartile (75th percentile) of the feature.  
<br />
<br />
4. **Log Transformation:** In cases where data is highly skewed or has a heavy-tailed distribution, taking the logarithm of the feature values can help stabilize the variance and improve scaling.

The choice of scaling method depends on the characteristics of your data and the requirements of your machine learning algorithm. **Min-max scaling and standardization are the most commonly used techniques and work well for many datasets.**

Scaling should be applied separately to each training and test set to prevent data leakage from the test set into the training set. Additionally, **some algorithms may not require feature scaling, particularly tree-based models.**

In [9]:
class FeatureScaler(BaseEstimator, TransformerMixin):
    def __init__(self, method):
        # Choose scaler based on the method
        if method == 'standard':
            self.scaler = StandardScaler()
        elif method == 'minmax':
            self.scaler = MinMaxScaler()
        elif method == 'robust':
            self.scaler = RobustScaler()
        else:
            raise ValueError("Invalid method. Choose 'standard', 'minmax', or 'robust'.")
        self.numeric_columns = None

    def fit(self, X, y=None):
        # Identify numerical columns to scale without id and label
        self.numeric_columns = [
            col for col in X.select_dtypes(include=['float64', 'int64']).columns
            if col not in ['id', 'label']
        ]
        self.scaler.fit(X[self.numeric_columns])
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X[self.numeric_columns] = self.scaler.transform(X[self.numeric_columns])
        return X

    def save(self, X, current_dir):
        # Define the scaled data path
        scaled_data_path = os.path.join(current_dir, 'scaled_data.csv')
        X.to_csv(scaled_data_path, index=False)
        print(f"Scaled dataset saved to {scaled_data_path}")

### II. Feature Encoding

**Feature encoding**, also known as **categorical encoding**, is the process of converting categorical data (non-numeric data) into a numerical format so that it can be used as input for machine learning algorithms. Most machine learning models require numerical data for training and prediction, so feature encoding is a critical step in data preprocessing.

Categorical data can take various forms, including:

1. **Nominal Data:** Categories with no intrinsic order, like colors or country names.  

2. **Ordinal Data:** Categories with a meaningful order but not necessarily equidistant, like education levels (e.g., "high school," "bachelor's," "master's").

There are several common methods for encoding categorical data:

1. **Label Encoding:**

   - Label encoding assigns a unique integer to each category in a feature.
   - It's suitable for ordinal data where there's a clear order among categories.
   - For example, if you have an "education" feature with values "high school," "bachelor's," and "master's," you can encode them as 0, 1, and 2, respectively.
<br />
<br />
2. **One-Hot Encoding:**

   - One-hot encoding creates a binary (0 or 1) column for each category in a nominal feature.
   - It's suitable for nominal data where there's no inherent order among categories.
   - Each category becomes a new feature, and the presence (1) or absence (0) of a category is indicated for each row.
<br />
<br />
3. **Target Encoding (Mean Encoding):**

   - Target encoding replaces each category with the mean of the target variable for that category.
   - It's often used for classification problems.

In [10]:
class FeatureEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        """
        Categorical columns are predefined within the class.
        """
        self.categorical_columns = ['FILENAME', 'URL', 'Domain', 'TLD', 'Title']
        self.label_encoders = {}

    def fit(self, X, y=None):
        # Initialize LabelEncoders for each categorical column
        for col in self.categorical_columns:
            if col in X.columns:
                self.label_encoders[col] = LabelEncoder()
                self.label_encoders[col].fit(X[col].astype(str))
        return self

    def transform(self, X):
        X = X.copy()

        # Apply Label Encoding to each categorical column
        for col, encoder in self.label_encoders.items():
            if col in X.columns:
                X[col] = encoder.transform(X[col].astype(str))

        return X

    def save(self, X, current_dir):
        # Define the scaled data path
        scaled_data_path = os.path.join(current_dir, 'scaled_data.csv')
        X.to_csv(scaled_data_path, index=False)
        print(f"Scaled dataset saved to {scaled_data_path}")

### III. Handling Imbalanced Dataset

**Handling imbalanced datasets** is important because imbalanced data can lead to several issues that negatively impact the performance and reliability of machine learning models. Here are some key reasons:

1. **Biased Model Performance**:

 - Models trained on imbalanced data tend to be biased towards the majority class, leading to poor performance on the minority class. This can result in misleading accuracy metrics.

2. **Misleading Accuracy**:

 - High overall accuracy can be misleading in imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts the majority class will have 95% accuracy but will fail to identify the minority class.

3. **Poor Generalization**:

 - Models trained on imbalanced data may not generalize well to new, unseen data, especially if the minority class is underrepresented.


Some methods to handle imbalanced datasets:
1. **Resampling Methods**:

 - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic samples (e.g., SMOTE).
 - Undersampling: Reduce the number of instances in the majority class to balance the dataset.

2. **Evaluation Metrics**:

 - Use appropriate evaluation metrics such as precision, recall, F1-score, ROC-AUC, and confusion matrix instead of accuracy to better assess model performance on imbalanced data.

3. **Algorithmic Approaches**:

 - Use algorithms that are designed to handle imbalanced data, such as decision trees, random forests, or ensemble methods.
 - Adjust class weights in algorithms to give more importance to the minority class.

In [11]:
class ImbalanceHandler(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42):
        self.smote = SMOTE(random_state=random_state)
        self._is_fitted = False

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if y is not None:
            if self._is_fitted:
                return X, y
            else:
                self._is_fitted = True
                X_resampled, y_resampled = self.smote.fit_resample(X, y)
                return X_resampled, y_resampled
        return X

# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [12]:
feature_pipeline = Pipeline([
    ('missing_handler', MissingDataHandler()),
    ('outlier_handler', OutlierHandler()),
    ('duplicate_handler', DuplicateHandler()),
    ('feature_engineer', PhishingFeatureEngineer()),
    ('feature_scaler', FeatureScaler(method="standard")),
])

target_pipeline = Pipeline([
    ('feature_encoder', FeatureEncoder()),
    ('imbalance_handler', ImbalanceHandler()),
])

X_train_processed = feature_pipeline.fit_transform(X_train_sample)
X_val_processed = feature_pipeline.transform(X_val)

# Initialize the transformers
encoder = FeatureEncoder()
imbalancer = ImbalanceHandler()
# For training data
X_train_encoded = encoder.fit_transform(X_train_processed, y_train_sample)
X_train_imbalanced, y_train_imbalanced = imbalancer.transform(X_train_encoded, y_train_sample)
# X_train_reduced, y_train_reduced = reducer.fit_transform(X_train_processed, y_train)
# For validation data
X_val_encoded = encoder.fit_transform(X_val_processed, y_val)
X_val_imbalanced, y_val_imbalanced = imbalancer.transform(X_val_encoded, y_val)
print('x-val',X_val_imbalanced)
print('y-val',X_val_imbalanced)
# X_val_reduced, y_val_reduced = reducer.transform(X_val_imbalanced, y_val_imbalanced)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(value, inplace=True)


Bounds calculated for columns: {'id': {'lower': np.float64(-117983.5), 'upper': np.float64(353342.5)}, 'URLLength': {'lower': np.float64(22.0), 'upper': np.float64(30.0)}, 'DomainLength': {'lower': np.float64(13.5), 'upper': np.float64(25.5)}, 'IsDomainIP': {'lower': np.float64(0.0), 'upper': np.float64(0.0)}, 'CharContinuationRate': {'lower': np.float64(1.0), 'upper': np.float64(1.0)}, 'TLDLegitimateProb': {'lower': np.float64(-0.5844536499999999), 'upper': np.float64(1.18732355)}, 'URLCharProb': {'lower': np.float64(0.05371043162499999), 'upper': np.float64(0.06663930262500001)}, 'TLDLength': {'lower': np.float64(3.0), 'upper': np.float64(3.0)}, 'NoOfSubDomain': {'lower': np.float64(1.0), 'upper': np.float64(1.0)}, 'HasObfuscation': {'lower': np.float64(0.0), 'upper': np.float64(0.0)}, 'NoOfObfuscatedChar': {'lower': np.float64(0.0), 'upper': np.float64(0.0)}, 'ObfuscationRatio': {'lower': np.float64(0.0), 'upper': np.float64(0.0)}, 'NoOfLettersInURL': {'lower': np.float64(10.5), 'up

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(value, inplace=True)


x-val            id  FILENAME   URL  URLLength  Domain  DomainLength  IsDomainIP  \
0       14051      7106  2336  -0.021054      81      1.785630         0.0   
1       34872         1  3603  -0.021054    2452     -0.062539         0.0   
2      148915      1608     0  -0.440013      81     -0.370567         0.0   
3       40612         1  3223   0.816864      81     -0.062539         0.0   
4      159405         1  3002  -1.696890      81     -0.062539         0.0   
...       ...       ...   ...        ...     ...           ...         ...   
14036   18938         1  7079  -1.277931      81     -0.062539         0.0   
14037   22151      5032     0  -0.440013      81     -0.370567         0.0   
14038  183832      7917  4430  -0.021054      81     -0.062539         0.0   
14039  168975      5935  3643  -0.021054    2482     -0.062539         0.0   
14040   75730       259     0  -0.858972    3045     -0.678596         0.0   

       TLD  CharContinuationRate  TLDLegitimateProb  ... 

# 4. Modeling and Validation

Modelling is the process of building your own machine learning models to solve specific problems, or in this assignment context, predicting the target feature `label`. Validation is the process of evaluating your trained model using the validation set or cross-validation method and providing some metrics that can help you decide what to do in the next iteration of development.

## A. KNN

## Import Libraries

In [13]:
import numpy as np
import pandas as pd
import pickle
import os

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report

from scipy.spatial.distance import cdist
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from imblearn.over_sampling import SMOTE

## Import Dataset

In [14]:
train_df = pd.read_csv('https://drive.google.com/uc?id=1a96WTg0CHxx2Ja7BFuWGlVneUXQsgD7Y')
train_df.head()

Unnamed: 0,id,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
0,1,,https://www.northcm.ac.th,24.0,www.northcm.ac.th,17.0,0.0,,0.8,,...,0.0,0.0,1.0,,3.0,,69.0,,,1
1,4,8135291.txt,http://uqr.to/1il1z,,,,,to,1.0,0.000896,...,,0.0,0.0,,,,,,1.0,0
2,5,586561.txt,https://www.woolworthsrewards.com.au,35.0,www.woolworthsrewards.com.au,28.0,0.0,au,0.857143,,...,1.0,0.0,1.0,33.0,7.0,8.0,15.0,,2.0,1
3,6,,,31.0,,,,com,0.5625,0.522907,...,1.0,0.0,1.0,24.0,5.0,14.0,,,,1
4,11,412632.txt,,,www.nyprowrestling.com,22.0,0.0,,1.0,,...,0.0,0.0,1.0,,,14.0,,0.0,,1


In [15]:
test_df = pd.read_csv('https://drive.google.com/uc?id=19aftoyJGEVXPgW5BZzDkdv_hX6-oNtel')
test_df.head()

Unnamed: 0,"<!DOCTYPE html><html><head><title>Google Drive - Infected file</title><meta http-equiv=""content-type"" content=""text/html; charset=utf-8""/><style nonce=""-hbQBsW-_hpop3LD0So2CQ"">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial",sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,".uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel=""icon"" href=""//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png""/></head><body><div class=""uc-main""><div id=""uc-text""><p class=""uc-error-caption"">Sorry","this file is infected with a virus.</p><p class=""uc-error-subcaption"">Only the owner is allowed to download infected files.</p></div></div><div class=""uc-footer""><hr class=""uc-footer-divider""></div></body></html>"


## Data Preprocessing

In [16]:
def preprocess_data(train_df, test_df=None, test_size=0.3, random_state=42):
    X = train_df.drop(['label', 'id', 'FILENAME', 'URL', 'Domain'], axis=1)
    y = train_df['label']

    numeric_columns = X.select_dtypes(include=['number']).columns
    categorical_columns = X.select_dtypes(exclude=['number']).columns

    for col in numeric_columns:
        X[col] = np.log1p(X[col])

    for col in numeric_columns:
        X[col] = X[col].fillna(X[col].median())

    for col in categorical_columns:
        X[col] = X[col].fillna(X[col].mode()[0])

    scaler = StandardScaler()
    X[numeric_columns] = scaler.fit_transform(X[numeric_columns])

    label_encoders = {}
    for col in categorical_columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        label_encoders[col] = le

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    smote = SMOTE(random_state=random_state, k_neighbors=1)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    if test_df is not None:
        X_test_final = test_df.drop(['id', 'FILENAME', 'URL', 'Domain'], axis=1)
        for col in numeric_columns:
            X_test_final[col] = np.log1p(X_test_final[col])
            X_test_final[col] = X_test_final[col].fillna(X_test_final[col].median())
        for col in categorical_columns:
            X_test_final[col] = X_test_final[col].fillna(X_test_final[col].mode()[0])
            X_test_final[col] = X_test_final[col].map(lambda val: label_encoders[col].transform([val])[0]
                                                      if val in label_encoders[col].classes_
                                                      else -1)
        X_test_final[numeric_columns] = scaler.transform(X_test_final[numeric_columns])
    else:
        X_test_final = None

    return X_train_resampled, X_test, y_train_resampled, y_test, X_test_final

## KNN Algorithm from Scratch

In [17]:
class KNNClassifier:
    def __init__(self, k=3, metric="euclidean", batch_size=1000, n_threads=4):
        """
        Initialize the KNN Classifier with thread pool support.

        Parameters:
        -----------
        k : int, default=3
            Number of neighbors to use
        metric : str, default="euclidean"
            Distance metric to use
        batch_size : int, default=1000
            Number of test points to process in each batch
        n_threads : int, default=4
            Number of threads to use for parallel processing
        """
        self.k = k
        self.metric = metric
        self.batch_size = batch_size
        self.n_threads = n_threads
        self.train_data = None
        self.train_labels = None

    def fit(self, train_data, train_labels):
        """
        Store the training data and labels.
        """
        self.train_data = np.asarray(train_data, dtype=np.float32)
        self.train_labels = np.asarray(train_labels)
        return self

    def _compute_batch_distances(self, test_batch):
        """
        Compute distances for a batch of test points.
        """
        if self.metric == 'euclidean':
            distances = cdist(test_batch, self.train_data, metric='euclidean')
        elif self.metric == 'manhattan':
            distances = cdist(test_batch, self.train_data, metric='cityblock')
        elif self.metric == 'minkowski':
            distances = cdist(test_batch, self.train_data, metric='minkowski')
        else:
            raise ValueError(f"Unsupported metric: {self.metric}")
        return distances

    def _predict_batch(self, test_batch):
        """
        Predict labels for a batch of test points.
        """
        distances = self._compute_batch_distances(test_batch)
        batch_predictions = []

        for point_distances in distances:
            k_indices = np.argpartition(point_distances, self.k)[:self.k]
            k_labels = self.train_labels[k_indices]

            prediction = Counter(k_labels).most_common(1)[0][0]
            batch_predictions.append(prediction)

        return batch_predictions

    def predict(self, test_points):
        """
        Predict labels for test points using a thread pool.
        """
        if self.train_data is None:
            raise ValueError("Model has not been trained. Call 'fit' first.")

        test_points = np.asarray(test_points, dtype=np.float32)
        predictions = []

        # Split test points into batches
        batches = [
            test_points[i:i + self.batch_size]
            for i in range(0, len(test_points), self.batch_size)
        ]

        # Use ThreadPoolExecutor for parallel processing
        with ThreadPoolExecutor(max_workers=self.n_threads) as executor:
            results = executor.map(self._predict_batch, batches)

        for result in results:
            predictions.extend(result)

        return np.array(predictions)

    def get_params(self, deep=True):
        return {
            'k': self.k,
            'metric': self.metric,
            'batch_size': self.batch_size,
            'n_threads': self.n_threads
        }

    def set_params(self, **params):
        if 'k' in params:
            self.k = params['k']
        if 'metric' in params:
            self.metric = params['metric']
        if 'batch_size' in params:
            self.batch_size = params['batch_size']
        if 'n_threads' in params:
            self.n_threads = params['n_threads']
        return self

    def score(self, X_test, y_test):
        predictions = self.predict(X_test)
        return np.mean(predictions == y_test)

    def save_model(self, filename):
        """
        Save the model to a file.
        """
        with open(filename, 'wb') as f:
            pickle.dump(self, f)
        print(f"Model saved to {filename}")

    @staticmethod
    def load_model(filename):
        """
        Load the model from a file.
        """
        with open(filename, 'rb') as f:
            model = pickle.load(f)
        print(f"Model loaded from {filename}")
        return model

## Accuration Test from Scratch

In [18]:
# Preprocess data
X_train, X_test, y_train, y_test,X_final = preprocess_data(train_df)

# Train the KNN model
knn = KNNClassifier(k=4, metric='manhattan')
knn.fit(X_train, y_train)

# Generate predictions
predictions = knn.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions, multi_class='ovr', average='macro')
recall = recall_score(y_test, predictions, average='macro')
precision = precision_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"ROC AUC Score: {roc_auc}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
class_report = classification_report(y_test, predictions)
print("Classification Report:")
print(class_report)

Accuracy: 0.9731019419780638
ROC AUC Score: 0.86067797986059
Recall: 0.8606779798605901
Precision: 0.9361982705532068
F1 Score: 0.8941758627892029
Confusion Matrix:
[[ 2306   860]
 [  273 38683]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.73      0.80      3166
           1       0.98      0.99      0.99     38956

    accuracy                           0.97     42122
   macro avg       0.94      0.86      0.89     42122
weighted avg       0.97      0.97      0.97     42122



## KNN Algorithm and Accuration Test from Library

In [19]:
# Preprocess data
X_train, X_test, y_train, y_test, _ = preprocess_data(train_df)

# Train the KNN model using sklearn
knn = KNeighborsClassifier(n_neighbors=4, metric='manhattan')
knn.fit(X_train, y_train)

# Generate predictions
predictions = knn.predict(X_test)

# Check if ROC AUC score is valid (requires probabilities for each class)
if len(knn.classes_) > 2:  # Multiclass case
    probabilities = knn.predict_proba(X_test)
    roc_auc = roc_auc_score(y_test, probabilities, multi_class='ovr', average='macro')
else:  # Binary case
    probabilities = knn.predict_proba(X_test)[:, 1]  # Select positive class probabilities
    roc_auc = roc_auc_score(y_test, probabilities)

# Calculate other evaluation metrics
accuracy = accuracy_score(y_test, predictions)
recall = recall_score(y_test, predictions, average='macro')
precision = precision_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"ROC AUC Score: {roc_auc}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
class_report = classification_report(y_test, predictions)
print("Classification Report:")
print(class_report)

Accuracy: 0.9705854422866911
ROC AUC Score: 0.8888659684214084
Recall: 0.8728111187787742
Precision: 0.9080282247111293
F1 Score: 0.8894615202110798
Confusion Matrix:
[[ 2399   767]
 [  472 38484]]
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.76      0.79      3166
           1       0.98      0.99      0.98     38956

    accuracy                           0.97     42122
   macro avg       0.91      0.87      0.89     42122
weighted avg       0.97      0.97      0.97     42122



## B. Naive Bayes

## Import Dataset

In [25]:
current_dir = os.path.dirname(os.path.abspath("__file__"))
train_path = os.path.join(current_dir, "data/train.csv")
test_path = os.path.join(current_dir, "data/test.csv")

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
train_df.head()
test_df.head()

Unnamed: 0,id,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,CharContinuationRate,TLDLegitimateProb,...,Bank,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef
0,48,80851.txt,https://www.iaee.org,19.0,,12.0,0.0,org,,0.079963,...,1.0,0.0,0.0,1.0,,,13.0,194.0,,65.0
1,68,mw130480.txt,http://www.iran-edi.com,22.0,,16.0,,,0.625,,...,,,0.0,,0.0,,2.0,0.0,0.0,1.0
2,76,400382.txt,https://www.bistum-chur.ch,25.0,www.bistum-chur.ch,18.0,0.0,ch,0.636364,0.004983,...,,,0.0,1.0,5.0,12.0,18.0,193.0,,196.0
3,155,625297.txt,https://www.numberthreebath.com,30.0,,23.0,,com,1.0,0.522907,...,0.0,0.0,,,,1.0,10.0,12.0,0.0,11.0
4,167,8123642.txt,https://ipfs.litnet.work/ipfs/bafybeib5jvxytzb...,100.0,ipfs.litnet.work,,0.0,work,,,...,0.0,,0.0,,,1.0,,,,


## Data Preprocessing

In [26]:
def preprocess_data(train_df, test_df=None, test_size=0.3, random_state=42):
    X = train_df.drop(['label', 'id', 'FILENAME', 'URL', 'Domain'], axis=1)
    y = train_df['label']

    numeric_columns = X.select_dtypes(include=['number']).columns
    categorical_columns = X.select_dtypes(exclude=['number']).columns

    for col in numeric_columns:
        X[col] = np.log1p(X[col])

    for col in numeric_columns:
        X[col] = X[col].fillna(X[col].median())

    for col in categorical_columns:
        X[col] = X[col].fillna(X[col].mode()[0])

    scaler = StandardScaler()
    X[numeric_columns] = scaler.fit_transform(X[numeric_columns])

    label_encoders = {}
    for col in categorical_columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        label_encoders[col] = le

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    smote = SMOTE(random_state=random_state, k_neighbors=1)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    if test_df is not None:
        X_test_final = test_df.drop(['id', 'FILENAME', 'URL', 'Domain'], axis=1)
        for col in numeric_columns:
            X_test_final[col] = np.log1p(X_test_final[col])
            X_test_final[col] = X_test_final[col].fillna(X_test_final[col].median())
        for col in categorical_columns:
            X_test_final[col] = X_test_final[col].fillna(X_test_final[col].mode()[0])
            X_test_final[col] = X_test_final[col].map(lambda val: label_encoders[col].transform([val])[0]
                                                      if val in label_encoders[col].classes_
                                                      else -1)
        X_test_final[numeric_columns] = scaler.transform(X_test_final[numeric_columns])
    else:
        X_test_final = None

    return X_train_resampled, X_test, y_train_resampled, y_test, X_test_final

## Naive Bayes Implementation from Scratch

In [27]:
class NaiveBayes:
    # NaiveBayes class inisialization
    def __init__(self, smoothing=1e-3, prior_adjustment=None):
        self.smoothing = smoothing
        self.classes_ = None
        self.class_probabilities = {}
        self.feature_probabilities = {}
        self.class_counts = {}
        self.prior_adjustment = prior_adjustment

    # Trains the Naive Bayes model by calculating prior probabilities for each class 
    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)

        self.classes_ = np.unique(y)
        n_samples, n_features = X.shape

        for cls in self.classes_:
            self.class_probabilities[cls] = np.sum(y == cls) / n_samples
            if self.prior_adjustment and cls in self.prior_adjustment:
                self.class_probabilities[cls] *= self.prior_adjustment[cls]

            self.class_counts[cls] = np.sum(y == cls)

        # Calculate the prior probability for each class
        self.feature_probabilities = {cls: [] for cls in self.classes_}
        for cls in self.classes_:
            X_cls = X[y == cls]
            for feature_idx in range(n_features):
                feature_vals = X_cls[:, feature_idx]
                unique_vals, counts = np.unique(feature_vals, return_counts=True)
                feature_prob = {
                    val: (count + self.smoothing) / (self.class_counts[cls] + self.smoothing * len(unique_vals))
                    for val, count in zip(unique_vals, counts)
                }
                self.feature_probabilities[cls].append(feature_prob)

        return self
    
    # This method calculates the probability of each class for each sample in X
    def predict_proba(self, X):
        X = np.array(X)
        probabilities = []
        for sample in X:
            posteriors = []
            for cls in self.classes_:
                score = np.log(self.class_probabilities[cls] + self.smoothing)
                for feature_idx, feature_val in enumerate(sample):
                    feature_prob = self.feature_probabilities[cls][feature_idx].get(feature_val, self.smoothing)
                    score += np.log(feature_prob + self.smoothing)
                posteriors.append(np.exp(score))
            probabilities.append(posteriors / np.sum(posteriors))
        return np.array(probabilities)
    
    # This method determines the predicted class for each sample in the input data
    def predict(self, X, threshold=0.005):
        probabilities = self.predict_proba(X)
        predictions = (probabilities[:, 1] >= threshold).astype(int)

        for i, prob in enumerate(probabilities):
            if prob[0] > prob[1] * 0.8: 
                predictions[i] = 0

        return predictions

    # Saves a trained model to a file.
    def save_model(self, filename):
        with open(filename, 'wb') as file:
            pickle.dump(self, file)
        print(f"Model saved in {filename}.")

    # Loads a previously saved model from a file for reuse.
    @staticmethod
    def load_model(filename):
        with open(filename, 'rb') as file:
            model = pickle.load(file)
        print(f"Model loaded from {filename}.")
        return model

## Evaluate Model

In [28]:
X_train, X_test, y_train, y_test, X_test_final = preprocess_data(train_df, test_df)

# Train and evaluate model
nb = NaiveBayes(prior_adjustment={0: 50.0, 1: 1.0})
nb.fit(X_train, y_train)
nb.save_model('naive_bayes_model.pkl')

# Perform cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = []

X_train_np = np.array(X_train)
y_train_np = np.array(y_train)

for train_idx, val_idx in kf.split(X_train_np):
    X_cv_train, X_cv_val = X_train_np[train_idx], X_train_np[val_idx]
    y_cv_train, y_cv_val = y_train_np[train_idx], y_train_np[val_idx]

    nb_cv = NaiveBayes(prior_adjustment={0: 50.0, 1: 1.0})
    nb_cv.fit(X_cv_train, y_cv_train)
    y_cv_pred = nb_cv.predict(X_cv_val)
    cross_val_scores.append(accuracy_score(y_cv_val, y_cv_pred))

print(f"IMPLEMENTATION FROM SCRATCH")
print(f"Cross-Validation Accuracy (Mean): {np.mean(cross_val_scores) * 100:.2f}%")
print(f"Cross-Validation Accuracy (Standard Deviation): {np.std(cross_val_scores) * 100:.2f}%")

# Evaluate the final model on the test set
y_pred = nb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nNaive Bayes kustom classification accuracy: {accuracy * 100:.2f}%\n")
print("Detailed Classification Report (Custom Naive Bayes):")
print(classification_report(y_test, y_pred))

# Save predictions to a CSV file
predictions = nb.predict(X_test_final)
submission_df = pd.DataFrame({
    "id": test_df["id"],
    "label": predictions
})
submission_file_path = 'submission-nb-scratch.csv'
submission_df.to_csv(submission_file_path, index=False)

print(f"Predictions saved to '{submission_file_path}'.")

Model saved in naive_bayes_model.pkl.
IMPLEMENTATION FROM SCRATCH
Cross-Validation Accuracy (Mean): 94.18%
Cross-Validation Accuracy (Standard Deviation): 0.08%

Naive Bayes kustom classification accuracy: 98.31%

Detailed Classification Report (Custom Naive Bayes):
              precision    recall  f1-score   support

           0       1.00      0.78      0.87      3166
           1       0.98      1.00      0.99     38956

    accuracy                           0.98     42122
   macro avg       0.99      0.89      0.93     42122
weighted avg       0.98      0.98      0.98     42122

Predictions saved to 'submission-nb-scratch.csv'.


## Naive Bayes Implementation with Scikit-Learn

## Evaluate Model

In [29]:
X_train, X_test, y_train, y_test, X_test_final = preprocess_data(train_df, test_df)

# Train and evaluate model
model = GaussianNB()
model.fit(X_train.values, y_train.values)

# Perform cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = []

X_train_np = X_train.values
y_train_np = y_train.values

for train_idx, val_idx in kf.split(X_train_np):
    X_cv_train, X_cv_val = X_train_np[train_idx], X_train_np[val_idx]
    y_cv_train, y_cv_val = y_train_np[train_idx], y_train_np[val_idx]

    model.fit(X_cv_train, y_cv_train)
    y_cv_pred = model.predict(X_cv_val)
    cross_val_scores.append(accuracy_score(y_cv_val, y_cv_pred))

print(f"IMPLEMENTATION WITH SCIKIT-LEARN")
print(f"Cross-Validation Accuracy (Mean): {np.mean(cross_val_scores) * 100:.2f}%")
print(f"Cross-Validation Accuracy (Standard Deviation): {np.std(cross_val_scores) * 100:.2f}%")

# Evaluate the final model on the test set
y_pred = model.predict(X_test.values) 
accuracy = accuracy_score(y_test, y_pred)

print(f"\nNaive Bayes classification accuracy: {accuracy * 100:.2f}%\n")
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred))

# Save predictions to a CSV file
predictions = model.predict(X_test_final.values)
submission_df = pd.DataFrame({
    "id": test_df["id"],
    "label": predictions
})
submission_file_path = 'submission-nb-scikit-learn.csv'
submission_df.to_csv(submission_file_path, index=False)

print(f"Predictions saved to '{submission_file_path}'.")


IMPLEMENTATION WITH SCIKIT-LEARN
Cross-Validation Accuracy (Mean): 92.62%
Cross-Validation Accuracy (Standard Deviation): 0.11%

Naive Bayes classification accuracy: 98.77%

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.87      0.91      3166
           1       0.99      1.00      0.99     38956

    accuracy                           0.99     42122
   macro avg       0.97      0.94      0.95     42122
weighted avg       0.99      0.99      0.99     42122

Predictions saved to 'submission-nb-scikit-learn.csv'.


## C. Improvements (Optional)

- **Visualize the model evaluation result**

This will help you to understand the details more clearly about your model's performance. From the visualization, you can see clearly if your model is leaning towards a class than the others. (Hint: confusion matrix, ROC-AUC curve, etc.)

- **Explore the hyperparameters of your models**

Each models have their own hyperparameters. And each of the hyperparameter have different effects on the model behaviour. You can optimize the model performance by finding the good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: Grid search, random search, bayesian optimization)

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation method compared to a hold-out (single train-test set) validation. Though, it requires more time and computing power because of how cross-validation works. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)

In [30]:
X_train, X_test, y_train, y_test, X_test_final = preprocess_data(train_df, test_df)

# Train and evaluate model
nb = NaiveBayes(prior_adjustment={0: 50.0, 1: 1.0})
nb.fit(X_train, y_train)
nb.save_model('naive_bayes_model.pkl')

# Perform cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = []

X_train_np = np.array(X_train)
y_train_np = np.array(y_train)

for train_idx, val_idx in kf.split(X_train_np):
    X_cv_train, X_cv_val = X_train_np[train_idx], X_train_np[val_idx]
    y_cv_train, y_cv_val = y_train_np[train_idx], y_train_np[val_idx]

    nb_cv = NaiveBayes(prior_adjustment={0: 50.0, 1: 1.0})
    nb_cv.fit(X_cv_train, y_cv_train)
    y_cv_pred = nb_cv.predict(X_cv_val)
    cross_val_scores.append(accuracy_score(y_cv_val, y_cv_pred))

print(f"IMPLEMENTATION FROM SCRATCH")
print(f"Cross-Validation Accuracy (Mean): {np.mean(cross_val_scores) * 100:.2f}%")
print(f"Cross-Validation Accuracy (Standard Deviation): {np.std(cross_val_scores) * 100:.2f}%")

Model saved in naive_bayes_model.pkl.
IMPLEMENTATION FROM SCRATCH
Cross-Validation Accuracy (Mean): 94.18%
Cross-Validation Accuracy (Standard Deviation): 0.08%


## D. Submission
To predict the test set target feature and submit the results to the kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the first in Data Preprocessing
2. With the pipeline, apply `fit_transform` to the original training set before splitting, then only apply `transform` to the test set.
3. Retrain the model on the preprocessed training set
4. Predict the test set
5. Make sure the submission contains the `id` and `label` column.

Note: Adjust step 1 and 2 to your implementation of the preprocessing step if you don't use pipeline API from `sklearn`.

In [31]:
# Preprocess the train and test data
X_train, X_test_split, y_train, y_test_split, X_test = preprocess_data(train_df, test_df)

# Train the KNN model
knn = KNNClassifier(k=4, metric='manhattan')
knn.fit(X_train, y_train)

# Predict using the KNN model
predictions = knn.predict(X_test)

# Save predictions in the required format
submission = pd.DataFrame({
    "id": test_df["id"],  # Use the 'id' column from the test data
    "label": predictions   # Add the predicted labels
})

# Specify the file path for saving
submission_file_path = 'submission.csv'
submission.to_csv(submission_file_path, index=False)

print(f"Predictions saved to '{submission_file_path}'.")

Predictions saved to 'submission.csv'.


# 6. Error Analysis

Based on all the process you have done until the modeling and evaluation step, write an analysis to support each steps you have taken to solve this problem. Write the analysis using the markdown block. Some questions that may help you in writing the analysis:

- Does my model perform better in predicting one class than the other? If so, why is that?
- To each models I have tried, which performs the best and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

## Analisis Model 

### Performa Model pada Tiap Kelas
Model KNN yang digunakan menunjukkan bahwa model memiliki performa yang jauh lebih baik dalam memprediksi kelas mayoritas (kelas 1) dibandingkan dengan kelas minoritas (kelas 0). Hal ini terlihat dari Precision dan Recall kelas 1 yang mencapai masing-masing 0.98 dan 0.99, menunjukkan bahwa hampir seluruh prediksi kelas 1 adalah benar, dengan sedikit kesalahan. Sebaliknya, untuk kelas 0, Precision hanya mencapai 0.84, sementara Recall lebih rendah, yaitu 0.76. Rendahnya recall untuk kelas 0 menunjukkan bahwa model kesulitan menangkap semua sampel kelas minoritas, yang kemungkinan besar disebabkan oleh ketidakseimbangan data (kelas 1 jauh lebih banyak dibandingkan kelas 0).

### Performa Model dengan Parameter  𝑘
k dan Metrik Jarak
Dari beberapa eksperimen yang telah dilakukan, nilai 𝑘 = 4
k=4 dengan metrik Manhattan menghasilkan performa terbaik untuk model KNN. Hal ini kemungkinan disebabkan oleh kemampuan metrik Manhattan yang lebih baik dalam menangkap pola pada data tabular, terutama ketika fitur memiliki perbedaan skala yang sudah dinormalisasi.

### Penanganan Missing Data
Pada tahap preprocessing, missing data diimputasi daripada dihapus. Keputusan ini diambil untuk menghindari hilangnya informasi dari dataset, terutama karena setiap fitur dapat memberikan kontribusi penting pada algoritma berbasis jarak seperti KNN. Dengan imputasi median untuk fitur numerik dan modus untuk fitur kategorikal, data tetap lengkap tanpa mengorbankan informasi penting.

### Pengaruh Scaling pada Performansi Model
Feature scaling sangat penting untuk algoritma KNN karena model ini sangat bergantung pada jarak antar titik. Dalam proses ini, semua fitur numerik telah dinormalisasi menggunakan z-score normalization. Hal ini membantu meningkatkan performa model dengan memastikan bahwa fitur yang memiliki skala besar tidak mendominasi perhitungan jarak.

### Kesimpulan dan Saran Perbaikan
Model KNN memberikan Accuracy sebesar 97.06%, yang merupakan performa yang sangat baik secara keseluruhan. Namun, analisis lebih lanjut menunjukkan adanya bias terhadap kelas mayoritas, yang mengurangi performa dalam mengenali kelas minoritas.

Untuk meningkatkan performa, terutama pada kelas 0, beberapa langkah dapat diambil:

Oversampling atau SMOTE: Menyeimbangkan distribusi kelas untuk membantu model mengenali kelas minoritas lebih baik.

Weighted Voting: Memberikan bobot lebih besar pada tetangga yang lebih dekat untuk meningkatkan akurasi dalam wilayah yang padat data.

Tuning Parameter 𝑘 
Eksperimen lebih lanjut dengan nilai k yang lebih besar dapat membantu meningkatkan generalisasi model.

Feature Selection: Menghapus fitur yang kurang relevan atau berlebihan untuk mengurangi noise pada data.
Langkah-langkah ini diharapkan dapat meningkatkan recall untuk kelas minoritas tanpa mengorbankan performa kelas mayoritas.