## Import Libraries

In [154]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Import other libraries if needed

## Import Dataset

In [155]:
df_train = pd.read_csv('https://drive.google.com/uc?id=1RqoINbiatAf5ssVrqESn2F3vGCe_XY17')
df_train.head()

Unnamed: 0,id,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
0,1,,https://www.northcm.ac.th,24.0,www.northcm.ac.th,17.0,0.0,,0.8,,...,0.0,0.0,1.0,,3.0,,69.0,,,1
1,4,8135291.txt,http://uqr.to/1il1z,,,,,to,1.0,0.000896,...,,0.0,0.0,,,,,,1.0,0
2,5,586561.txt,https://www.woolworthsrewards.com.au,35.0,www.woolworthsrewards.com.au,28.0,0.0,au,0.857143,,...,1.0,0.0,1.0,33.0,7.0,8.0,15.0,,2.0,1
3,6,,,31.0,,,,com,0.5625,0.522907,...,1.0,0.0,1.0,24.0,5.0,14.0,,,,1
4,11,412632.txt,,,www.nyprowrestling.com,22.0,0.0,,1.0,,...,0.0,0.0,1.0,,,14.0,,0.0,,1


In [156]:
df_test = pd.read_csv('https://drive.google.com/uc?id=1O9Fd-FRnJmQwGsqnUVQmAj8vYJM0NWH_')

In [157]:
#ubah datatype


# Convert boolean-like columns (0.0 and 1.0) to 'boolean' type (nullable boolean)
for col in df_train.select_dtypes(include=['float64']).columns:
    # Drop NaN values to check if the rest are either 0.0 or 1.0
    if df_train[col].dropna().isin([0.0, 1.0]).all():
        # If column only contains 0.0 and 1.0, convert it to 'boolean' (nullable boolean)
        df_train[col] = df_train[col].astype('boolean')

# Convert boolean-like columns (0.0 and 1.0) to 'boolean' type (nullable boolean)
for col in df_train.select_dtypes(include=['int64']).columns:
    # Drop NaN values to check if the rest are either 0.0 or 1.0
    if df_train[col].dropna().isin([0, 1]).all():
        # If column only contains 0.0 and 1.0, convert it to 'boolean' (nullable boolean)
        df_train[col] = df_train[col].astype('boolean')

# Iterate over columns and check if the float columns can be converted to int
for col in df_train.select_dtypes(include=['float64']).columns:
    # Check if the non-NaN values in the column are whole numbers (i.e., no decimal part)
    if (df_train[col].dropna() == df_train[col].dropna().astype('int64')).all():
        # Convert to int type only the non-NaN values if condition is met
        df_train[col] = df_train[col].astype('Int64')  # Use 'Int64' to support NaN as well


#print the data types of the columns
df_train.dtypes

id                              int64
FILENAME                       object
URL                            object
URLLength                       Int64
Domain                         object
DomainLength                    Int64
IsDomainIP                    boolean
TLD                            object
CharContinuationRate          float64
TLDLegitimateProb             float64
URLCharProb                   float64
TLDLength                       Int64
NoOfSubDomain                   Int64
HasObfuscation                boolean
NoOfObfuscatedChar              Int64
ObfuscationRatio              float64
NoOfLettersInURL                Int64
LetterRatioInURL              float64
NoOfDegitsInURL                 Int64
DegitRatioInURL               float64
NoOfEqualsInURL                 Int64
NoOfQMarkInURL                  Int64
NoOfAmpersandInURL              Int64
NoOfOtherSpecialCharsInURL      Int64
SpacialCharRatioInURL         float64
IsHTTPS                       boolean
LineOfCode  

# 1. Split Training Set and Validation Set

Splitting the training and validation set works as an early diagnostic towards the performance of the model we train. This is done before the preprocessing steps to **avoid data leakage inbetween the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for kaggle submission.

In [158]:
# Split training set and validation set here, store into variables train_set and val_set.
# Remember to also keep the original training set before splitting. This will come important later.
train_set, val_set = train_test_split(df_train,test_size=0.25, random_state=42)

# check
print(train_set.shape)
print(val_set.shape)

(105303, 56)
(35101, 56)


In [159]:
train_set.nunique()

id                            105303
FILENAME                       62287
URL                            72823
URLLength                        196
Domain                         52431
DomainLength                      76
IsDomainIP                         2
TLD                              453
CharContinuationRate             421
TLDLegitimateProb                393
URLCharProb                    65739
TLDLength                         10
NoOfSubDomain                      6
HasObfuscation                     2
NoOfObfuscatedChar                 6
ObfuscationRatio                  24
NoOfLettersInURL                 176
LetterRatioInURL                 473
NoOfDegitsInURL                   76
DegitRatioInURL                  349
NoOfEqualsInURL                   13
NoOfQMarkInURL                     4
NoOfAmpersandInURL                14
NoOfOtherSpecialCharsInURL        42
SpacialCharRatioInURL            171
IsHTTPS                            2
LineOfCode                      7477
L

# 2. Data Cleaning and Preprocessing

This step is the first thing to be done once a Data Scientist have grasped a general knowledge of the data. Raw data is **seldom ready for training**, therefore steps need to be taken to clean and format the data for the Machine Learning model to interpret.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We will give some common methods for you to try, but you only have to **at least implement one method for each process**. For each step that you will do, **please explain the reason why did you do that process. Write it in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

Missing data can adversely affect the performance and accuracy of machine learning models. There are several strategies to handle missing data in machine learning:

1. **Data Imputation:**

    a. **Mean, Median, or Mode Imputation:** For numerical features, you can replace missing values with the mean, median, or mode of the non-missing values in the same feature. This method is simple and often effective when data is missing at random.

    b. **Constant Value Imputation:** You can replace missing values with a predefined constant value (e.g., 0) if it makes sense for your dataset and problem.

    c. **Imputation Using Predictive Models:** More advanced techniques involve using predictive models to estimate missing values. For example, you can train a regression model to predict missing numerical values or a classification model to predict missing categorical values.

2. **Deletion of Missing Data:**

    a. **Listwise Deletion:** In cases where the amount of missing data is relatively small, you can simply remove rows with missing values from your dataset. However, this approach can lead to a loss of valuable information.

    b. **Column (Feature) Deletion:** If a feature has a large number of missing values and is not critical for your analysis, you can consider removing that feature altogether.

3. **Domain-Specific Strategies:**

    a. **Domain Knowledge:** In some cases, domain knowledge can guide the imputation process. For example, if you know that missing values are related to a specific condition, you can impute them accordingly.

4. **Imputation Libraries:**

    a. **Scikit-Learn:** Scikit-Learn provides a `SimpleImputer` class that can handle basic imputation strategies like mean, median, and mode imputation.

    b. **Fancyimpute:** Fancyimpute is a Python library that offers more advanced imputation techniques, including matrix factorization, k-nearest neighbors, and deep learning-based methods.

The choice of imputation method should be guided by the nature of your data, the amount of missing data, the problem you are trying to solve, and the assumptions you are willing to make.

### Teknik Handling Missing data
- Menggunakan dua metode imputasi, yaitu most_frequent dan mean, untuk menangani nilai yang hilang (missing values) pada dataset. Imputasi dengan strategi most_frequent digunakan untuk kolom bertipe object dan boolean, karena tipe data ini biasanya berisi nilai kategori atau True/False, sehingga pengisian dengan nilai yang paling sering muncul (modus) adalah pendekatan yang masuk akal. Sementara itu, strategi mean digunakan untuk kolom bertipe numerik, yaitu float64 dan Int64, karena pengisian nilai yang hilang dengan rata-rata dapat menjaga distribusi data numerik tetap konsisten.

In [160]:
def missingDataHandler(dataset) :
    imputer_modus = SimpleImputer(strategy='most_frequent')
    imputer_mean  = SimpleImputer(strategy='mean')

    dataset_object = dataset.select_dtypes(include='object').columns
    boolean_dataset = dataset.select_dtypes(include='boolean').columns
    float_dataset = dataset.select_dtypes(include='float64').columns
    int_dataset = dataset.select_dtypes(include='Int64').columns

    dataset[dataset_object]  = imputer_modus.fit_transform(dataset[dataset_object])
    dataset[int_dataset]     = np.ceil(imputer_mean.fit_transform(dataset[int_dataset]))
    dataset[float_dataset]   = imputer_mean.fit_transform(dataset[float_dataset])
    dataset[boolean_dataset] = imputer_modus.fit_transform(dataset[boolean_dataset])

    #mengubah kembali datatype menjadi semua (dari object)
    for col in df_train.columns:
        dataset[col] = dataset[col].astype(df_train[col].dtype)

missingDataHandler(train_set)
missingDataHandler(val_set)
train_set

Unnamed: 0,id,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
3716,6208,100000.txt,https://www.farmranchstore.com,28,ipfs.io,20,False,com,0.918051,0.522907,...,False,False,False,42,11,17,105,0,78,True
95917,160974,92017.txt,https://www.comlaude.com,28,www.comlaude.com,16,False,com,1.000000,0.522907,...,False,False,True,30,11,17,143,4,78,True
103893,174395,489814.txt,http://test-mantenimiento-bancaweb.azurewebsit...,23,ipfs.io,16,False,net,0.918051,0.038420,...,False,False,True,4,11,8,105,0,78,True
23596,39373,100000.txt,https://www.citygrab.co.uk,28,www.citygrab.co.uk,18,False,uk,0.918051,0.277795,...,False,False,True,15,11,1,20,4,22,True
24006,40047,100000.txt,https://www.chinacdc.cn,22,www.chinacdc.cn,20,False,com,1.000000,0.003322,...,False,False,True,42,7,17,105,3,178,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110268,185151,100000.txt,https://www.gemathis.com,23,www.gemathis.com,16,False,com,1.000000,0.522907,...,False,False,True,42,2,17,106,7,78,True
119879,201338,627933.txt,http://test-mantenimiento-bancaweb.azurewebsit...,24,www.mural24.co.uk,20,False,uk,0.800000,0.277795,...,True,False,True,166,3,7,105,1,78,True
103694,174060,561645.txt,http://test-mantenimiento-bancaweb.azurewebsit...,34,ipfs.io,27,False,com,1.000000,0.277795,...,True,False,True,42,11,17,42,4,25,True
131932,221529,100000.txt,https://www.leinsterrugby.ie,27,ipfs.io,20,False,ie,0.918051,0.001588,...,False,False,True,42,11,57,105,0,331,True


In [161]:
train_set.isna().sum()

id                            0
FILENAME                      0
URL                           0
URLLength                     0
Domain                        0
DomainLength                  0
IsDomainIP                    0
TLD                           0
CharContinuationRate          0
TLDLegitimateProb             0
URLCharProb                   0
TLDLength                     0
NoOfSubDomain                 0
HasObfuscation                0
NoOfObfuscatedChar            0
ObfuscationRatio              0
NoOfLettersInURL              0
LetterRatioInURL              0
NoOfDegitsInURL               0
DegitRatioInURL               0
NoOfEqualsInURL               0
NoOfQMarkInURL                0
NoOfAmpersandInURL            0
NoOfOtherSpecialCharsInURL    0
SpacialCharRatioInURL         0
IsHTTPS                       0
LineOfCode                    0
LargestLineLength             0
HasTitle                      0
Title                         0
DomainTitleMatchScore         0
URLTitle

### II. Dealing with Outliers

Outliers are data points that significantly differ from the majority of the data. They can be unusually high or low values that do not fit the pattern of the rest of the dataset. Outliers can significantly impact model performance, so it is important to handle them properly.

Some methods to handle outliers:
1. **Imputation**: Replace with mean, median, or a boundary value.
2. **Clipping**: Cap values to upper and lower limits.
3. **Transformation**: Use log, square root, or power transformations to reduce their influence.
4. **Model-Based**: Use algorithms robust to outliers (e.g., tree-based models, Huber regression).

In [162]:
train_set.nunique()

id                            105303
FILENAME                       62287
URL                            72823
URLLength                        196
Domain                         52431
DomainLength                      76
IsDomainIP                         2
TLD                              453
CharContinuationRate             422
TLDLegitimateProb                394
URLCharProb                    65740
TLDLength                         10
NoOfSubDomain                      6
HasObfuscation                     2
NoOfObfuscatedChar                 7
ObfuscationRatio                  25
NoOfLettersInURL                 176
LetterRatioInURL                 474
NoOfDegitsInURL                   76
DegitRatioInURL                  350
NoOfEqualsInURL                   13
NoOfQMarkInURL                     4
NoOfAmpersandInURL                14
NoOfOtherSpecialCharsInURL        42
SpacialCharRatioInURL            172
IsHTTPS                            2
LineOfCode                      7477
L

In [163]:
numerical_train_set = train_set.select_dtypes(include='number').columns
for col in numerical_train_set :
    print(f'min {col} : {train_set[col].min()}')
print('\n')
for col in numerical_train_set :
    print(f'max {col} : {train_set[col].max()}')

min id : 4
min URLLength : 14
min DomainLength : 4
min CharContinuationRate : 0.0
min TLDLegitimateProb : 0.0
min URLCharProb : 0.001229244
min TLDLength : 2
min NoOfSubDomain : 0
min NoOfObfuscatedChar : 0
min ObfuscationRatio : 0.0
min NoOfLettersInURL : 0
min LetterRatioInURL : 0.0
min NoOfDegitsInURL : 0
min DegitRatioInURL : 0.0
min NoOfEqualsInURL : 0
min NoOfQMarkInURL : 0
min NoOfAmpersandInURL : 0
min NoOfOtherSpecialCharsInURL : 0
min SpacialCharRatioInURL : 0.0
min LineOfCode : 2
min LargestLineLength : 23
min DomainTitleMatchScore : 0.0
min URLTitleMatchScore : 0.0
min NoOfPopup : 0
min NoOfiFrame : 0
min NoOfImage : 0
min NoOfCSS : 0
min NoOfJS : 0
min NoOfSelfRef : 0
min NoOfEmptyRef : 0
min NoOfExternalRef : 0


max id : 235795
max URLLength : 4054
max DomainLength : 93
max CharContinuationRate : 1.0
max TLDLegitimateProb : 0.5229071
max URLCharProb : 0.088765828
max TLDLength : 11
max NoOfSubDomain : 5
max NoOfObfuscatedChar : 291
max ObfuscationRatio : 0.212
max NoOfLe

### Teknik Dealing with Outliers
- Menggunakan metode clipping. Pada kode ini, kami menangani outlier pada data numerik di dataset menggunakan metode IQR (Interquartile Range). Pertama, kami menghitung kuartil pertama (Q1) dan kuartil ketiga (Q3) untuk setiap kolom numerik, lalu menghitung IQR sebagai selisih antara Q3 dan Q1. Kemudian, kami menentukan batas bawah (lowerBound) dan batas atas (upperBound). Selanjutnya, nilai-nilai yang berada di luar batas tersebut dianggap outlier dan di-clip agar berada dalam rentang batas bawah dan atas. Untuk tipe data Int64, batas bawah dan atas dibulatkan menggunakan np.floor dan np.ceil agar konsisten dengan tipe datanya, sementara untuk tipe numerik lainnya tidak dilakukan pembulatan. Langkah ini bertujuan untuk menangani outlier secara efektif tanpa menghilangkan data, sehingga distribusi data tetap wajar untuk analisis atau pemodelan.

In [164]:
def outlierHandler(dataset) :
  numerical_dataset = dataset.select_dtypes(include='number').columns

  for i in numerical_dataset:
    Q1 = dataset[i].quantile(0.25)
    Q3 = dataset[i].quantile(0.75)

    IQR = Q3-Q1

    threshold = 1.5

    lowerBound = Q1 - (threshold*IQR)
    upperBound = Q3 + (threshold*IQR)

    # mean_value = train_set[i].mean()
    # train_set[i] = train_set[i].apply(lambda x: mean_value if (x < lowerBound or x > upperBound) else x)
    if (dataset[i].dtype == 'Int64') :
      dataset[i] = dataset[i].clip(lower=np.floor(lowerBound), upper=np.ceil(upperBound))
    else :
      dataset[i] = dataset[i].clip(lower=lowerBound, upper=upperBound)

outlierHandler(train_set)
outlierHandler(val_set)

In [165]:
train_set.nunique()

id                            105303
FILENAME                       62287
URL                            72823
URLLength                         14
Domain                         52431
DomainLength                      17
IsDomainIP                         2
TLD                              453
CharContinuationRate              61
TLDLegitimateProb                394
URLCharProb                    44601
TLDLength                          1
NoOfSubDomain                      5
HasObfuscation                     2
NoOfObfuscatedChar                 3
ObfuscationRatio                   3
NoOfLettersInURL                  14
LetterRatioInURL                  60
NoOfDegitsInURL                    4
DegitRatioInURL                   12
NoOfEqualsInURL                    4
NoOfQMarkInURL                     4
NoOfAmpersandInURL                 4
NoOfOtherSpecialCharsInURL         5
SpacialCharRatioInURL             37
IsHTTPS                            2
LineOfCode                      3004
L

In [166]:
numerical_train_set = train_set.select_dtypes(include='number').columns
for col in numerical_train_set :
    print(f'min {col} : {train_set[col].min()}')
print('\n')
for col in numerical_train_set :
    print(f'max {col} : {train_set[col].max()}')

min id : 4
min URLLength : 20
min DomainLength : 11
min CharContinuationRate : 0.7951269561285073
min TLDLegitimateProb : 0.0
min URLCharProb : 0.05365319925000001
min TLDLength : 3
min NoOfSubDomain : 0
min NoOfObfuscatedChar : 0
min ObfuscationRatio : 0.0
min NoOfLettersInURL : 7
min LetterRatioInURL : 0.44999999999999996
min NoOfDegitsInURL : 0
min DegitRatioInURL : 0.0
min NoOfEqualsInURL : 0
min NoOfQMarkInURL : 0
min NoOfAmpersandInURL : 0
min NoOfOtherSpecialCharsInURL : 0
min SpacialCharRatioInURL : 0.02865970600357193
min LineOfCode : 2
min LargestLineLength : 23
min DomainTitleMatchScore : 26.991534574244795
min URLTitleMatchScore : 27.852956209784807
min NoOfPopup : 0
min NoOfiFrame : 0
min NoOfImage : 0
min NoOfCSS : 0
min NoOfJS : 2
min NoOfSelfRef : 0
min NoOfEmptyRef : 0
min NoOfExternalRef : 0


max id : 235795
max URLLength : 33
max DomainLength : 27
max CharContinuationRate : 1.0
max TLDLegitimateProb : 0.5229071
max URLCharProb : 0.06682242524999998
max TLDLength : 3

### III. Remove Duplicates
Handling duplicate values is crucial because they can compromise data integrity, leading to inaccurate analysis and insights. Duplicate entries can bias machine learning models, causing overfitting and reducing their ability to generalize to new data. They also inflate the dataset size unnecessarily, increasing computational costs and processing times. Additionally, duplicates can distort statistical measures and lead to inconsistencies, ultimately affecting the reliability of data-driven decisions and reporting. Ensuring data quality by removing duplicates is essential for accurate, efficient, and consistent analysis.

### Teknik Remove Duplicates
- Menghapus duplikasi data berdasarkan kolom URL menggunakan metode drop_duplicates. Hal ini dilakukan karena setiap URL dalam dataset seharusnya unik. Jika terdapat URL yang sama dengan label yang sama, data tersebut tidak menambah informasi baru dan hanya meningkatkan redundansi. Sebaliknya, jika URL yang sama memiliki label yang berbeda (contoh: phishing dan non-phishing), hal ini dapat menyebabkan inkonsistensi data.

In [167]:
# Write your code here
def duplicateHandler(dataset) :
    print(dataset.shape)
    datasett = dataset.drop_duplicates(subset=["URL"])
    print(dataset.shape)

duplicateHandler(train_set)
duplicateHandler(val_set)

(105303, 56)
(105303, 56)
(35101, 56)
(35101, 56)


In [168]:
(train_set['label'] == 1).sum()

np.int64(97414)

### IV. Feature Engineering

**Feature engineering** involves creating new features (input variables) or transforming existing ones to improve the performance of machine learning models. Feature engineering aims to enhance the model's ability to learn patterns and make accurate predictions from the data. It's often said that "good features make good models."

1. **Feature Selection:** Feature engineering can involve selecting the most relevant and informative features from the dataset. Removing irrelevant or redundant features not only simplifies the model but also reduces the risk of overfitting.

2. **Creating New Features:** Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, engineers create new features that provide additional information. For example:
   
   - **Polynomial Features:** Engineers may create new features by taking the square, cube, or other higher-order terms of existing numerical features. This can help capture nonlinear relationships.
   
   - **Interaction Features:** Interaction features are created by combining two or more existing features. For example, if you have features "length" and "width," you can create an "area" feature by multiplying them.

3. **Binning or Discretization:** Continuous numerical features can be divided into bins or categories. For instance, age values can be grouped into bins like "child," "adult," and "senior."

4. **Domain-Specific Feature Engineering:** Depending on the domain and problem, engineers may create domain-specific features. For example, in fraud detection, features related to transaction history and user behavior may be engineered to identify anomalies.

Feature engineering is both a creative and iterative process. It requires a deep understanding of the data, domain knowledge, and experimentation to determine which features will enhance the model's predictive power.

In [169]:
train_set.nunique()

id                            105303
FILENAME                       62287
URL                            72823
URLLength                         14
Domain                         52431
DomainLength                      17
IsDomainIP                         2
TLD                              453
CharContinuationRate              61
TLDLegitimateProb                394
URLCharProb                    44601
TLDLength                          1
NoOfSubDomain                      5
HasObfuscation                     2
NoOfObfuscatedChar                 3
ObfuscationRatio                   3
NoOfLettersInURL                  14
LetterRatioInURL                  60
NoOfDegitsInURL                    4
DegitRatioInURL                   12
NoOfEqualsInURL                    4
NoOfQMarkInURL                     4
NoOfAmpersandInURL                 4
NoOfOtherSpecialCharsInURL         5
SpacialCharRatioInURL             37
IsHTTPS                            2
LineOfCode                      3004
L

### Teknik Feature Engineering 
- Degan metode binning. Dilakukan proses feature engineering untuk kolom numerik yang memiliki jumlah nilai unik lebih dari 30 (disaring melalui nunique) dan bukan kolom id. Kolom tersebut dianggap memiliki rentang nilai yang besar (big range), sehingga dilakukan pembagian menjadi beberapa kategori untuk menyederhanakan representasi data. Setiap nilai pada kolom diubah menjadi tiga kategori: low, medium, dan high, berdasarkan nilai batas yang dihitung dari rentang minimum hingga maksimum menggunakan pendekatan persentil sederhana. Proses ini dilakukan dengan membuat kolom baru dengan prefix bin_ untuk setiap kolom asli. Hasilnya, kategori ini diubah menjadi tipe data category agar lebih efisien dalam penyimpanan dan pemrosesan.

In [170]:
# Write your code here
def featureBinning(dataset):
    numerical_dataset = dataset[[col for col in dataset.select_dtypes(include='number').columns if col != 'id']].copy()
    old_columns = numerical_dataset.columns

    for col in numerical_dataset:
        min_val = numerical_dataset[col].min()
        max_val = numerical_dataset[col].max()
        val1 = min_val + ((max_val - min_val) // 3)
        val2 = min_val + ((max_val - min_val) * 2 // 3)

        numerical_dataset.loc[numerical_dataset[col] <= val1, 'bin_' + col] = 'low'
        numerical_dataset.loc[(numerical_dataset[col] > val1) & (numerical_dataset[col] <= val2), 'bin_' + col] = 'medium'
        numerical_dataset.loc[numerical_dataset[col] > val2, 'bin_' + col] = 'high'

    numerical_dataset = numerical_dataset.drop(columns=old_columns)
    numerical_dataset = numerical_dataset.astype('category')
    return numerical_dataset

train_set = pd.concat([train_set, featureBinning(train_set)], axis=1)
train_set
val_set = pd.concat([val_set, featureBinning(val_set)], axis=1)


## B. Data Preprocessing

**Data preprocessing** is a broader step that encompasses both data cleaning and additional transformations to make the data suitable for machine learning algorithms. Its primary goals are:

1. **Feature Scaling:** Ensure that numerical features have similar scales. Common techniques include Min-Max scaling (scaling to a specific range) or standardization (mean-centered, unit variance).

2. **Encoding Categorical Variables:** Machine learning models typically work with numerical data, so categorical variables need to be encoded. This can be done using one-hot encoding, label encoding, or more advanced methods like target encoding.

3. **Handling Imbalanced Classes:** If dealing with imbalanced classes in a binary classification task, apply techniques such as oversampling, undersampling, or using different evaluation metrics to address class imbalance.

4. **Dimensionality Reduction:** Reduce the number of features using techniques like Principal Component Analysis (PCA) or feature selection to simplify the model and potentially improve its performance.

5. **Normalization:** Normalize data to achieve a standard distribution. This is particularly important for algorithms that assume normally distributed data.

### Notes on Preprocessing processes

It is advised to create functions or classes that have the same/similar type of inputs and outputs, so you can add, remove, or swap the order of the processes easily. You can implement the functions or classes by yourself

or

use `sklearn` library. To create a new preprocessing component in `sklearn`, implement a corresponding class that includes:
1. Inheritance to `BaseEstimator` and `TransformerMixin`
2. The method `fit`
3. The method `transform`

In [171]:
# Example

# from sklearn.base import BaseEstimator, TransformerMixin

# class FeatureEncoder(BaseEstimator, TransformerMixin):

#     def fit(self, X, y=None):

#         # Fit the encoder here

#         return self

#     def transform(self, X):
#         X_encoded = X.copy()

#         # Encode the categorical variables here

#         return X_encoded

### I. Feature Scaling

**Feature scaling** is a preprocessing technique used in machine learning to standardize the range of independent variables or features of data. The primary goal of feature scaling is to ensure that all features contribute equally to the training process and that machine learning algorithms can work effectively with the data.

Here are the main reasons why feature scaling is important:

1. **Algorithm Sensitivity:** Many machine learning algorithms are sensitive to the scale of input features. If the scales of features are significantly different, some algorithms may perform poorly or take much longer to converge.

2. **Distance-Based Algorithms:** Algorithms that rely on distances or similarities between data points, such as k-nearest neighbors (KNN) and support vector machines (SVM), can be influenced by feature scales. Features with larger scales may dominate the distance calculations.

3. **Regularization:** Regularization techniques, like L1 (Lasso) and L2 (Ridge) regularization, add penalty terms based on feature coefficients. Scaling ensures that all features are treated equally in the regularization process.

Common methods for feature scaling include:

1. **Min-Max Scaling (Normalization):** This method scales features to a specific range, typically [0, 1]. It's done using the following formula:

   $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

   - Here, $X$ is the original feature value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.  
<br />
<br />
2. **Standardization (Z-score Scaling):** This method scales features to have a mean (average) of 0 and a standard deviation of 1. It's done using the following formula:

   $$X' = \frac{X - \mu}{\sigma}$$

   - $X$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.  
<br />
<br />
3. **Robust Scaling:** Robust scaling is a method that scales features to the interquartile range (IQR) and is less affected by outliers. It's calculated as:

   $$X' = \frac{X - Q1}{Q3 - Q1}$$

   - $X$ is the original feature value, $Q1$ is the first quartile (25th percentile), and $Q3$ is the third quartile (75th percentile) of the feature.  
<br />
<br />
4. **Log Transformation:** In cases where data is highly skewed or has a heavy-tailed distribution, taking the logarithm of the feature values can help stabilize the variance and improve scaling.

The choice of scaling method depends on the characteristics of your data and the requirements of your machine learning algorithm. **Min-max scaling and standardization are the most commonly used techniques and work well for many datasets.**

Scaling should be applied separately to each training and test set to prevent data leakage from the test set into the training set. Additionally, **some algorithms may not require feature scaling, particularly tree-based models.**

### Teknik Feature Scaling
- Degan metode Min-Max Scaling. Proses feature scaling ini bertujuan untuk mengubah rentang nilai fitur menjadi skala yang seragam (biasanya antara 0 dan 1), sehingga model pembelajaran mesin tidak memberikan bobot berlebihan pada fitur dengan nilai yang lebih besar. Pada metode fit, kolom numerik dalam dataset diidentifikasi menggunakan tipe data Int64 dan float64, kemudian skala ditentukan berdasarkan nilai minimum dan maksimum dari kolom-kolom tersebut. Pada metode transform, dataset diubah dengan menskalakan nilai-nilai kolom numerik menggunakan skala yang telah ditentukan.

In [172]:
class FeatureScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = MinMaxScaler()
        self.numerical_columns = []

    def fit(self, X, y=None):
        self.numerical_columns = X.select_dtypes(include=['Int64', 'float64']).columns
        self.scaler.fit(X[self.numerical_columns])
        return self

    def transform(self, X):
        X_transformed = X.copy()
        X_transformed[self.numerical_columns] = self.scaler.transform(X_transformed[self.numerical_columns])
        return X_transformed

In [173]:
# Initialize the custom scaler
scaler = FeatureScaler()

# Fit the scaler to the data (learn the scaling parameters from the numerical features)
scaler.fit(train_set)

# Transform the data to apply scaling to the numerical columns
scaled_data = scaler.transform(train_set)

for col in scaled_data.select_dtypes(include=['number']).columns:
    print(f"Column: {col}")
    print(f"  Min: {scaled_data[col].min()}")
    print(f"  Max: {scaled_data[col].max()}")
    print(f"  Dtype: {scaled_data[col].dtype}")
    print()

Column: id
  Min: 0.0
  Max: 0.9999999999999998
  Dtype: float64

Column: URLLength
  Min: 0.0
  Max: 1.0000000000000002
  Dtype: float64

Column: DomainLength
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: CharContinuationRate
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: TLDLegitimateProb
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: URLCharProb
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: TLDLength
  Min: 0.0
  Max: 0.0
  Dtype: float64

Column: NoOfSubDomain
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: NoOfObfuscatedChar
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: ObfuscationRatio
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: NoOfLettersInURL
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: LetterRatioInURL
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: NoOfDegitsInURL
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: DegitRatioInURL
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: NoOfEqualsInURL
  Min: 0.0
  Max: 1.0
  Dtype: float64

Column: NoOfQMarkInURL
  Min: 0

### II. Feature Encoding

**Feature encoding**, also known as **categorical encoding**, is the process of converting categorical data (non-numeric data) into a numerical format so that it can be used as input for machine learning algorithms. Most machine learning models require numerical data for training and prediction, so feature encoding is a critical step in data preprocessing.

Categorical data can take various forms, including:

1. **Nominal Data:** Categories with no intrinsic order, like colors or country names.  

2. **Ordinal Data:** Categories with a meaningful order but not necessarily equidistant, like education levels (e.g., "high school," "bachelor's," "master's").

There are several common methods for encoding categorical data:

1. **Label Encoding:**

   - Label encoding assigns a unique integer to each category in a feature.
   - It's suitable for ordinal data where there's a clear order among categories.
   - For example, if you have an "education" feature with values "high school," "bachelor's," and "master's," you can encode them as 0, 1, and 2, respectively.
<br />
<br />
2. **One-Hot Encoding:**

   - One-hot encoding creates a binary (0 or 1) column for each category in a nominal feature.
   - It's suitable for nominal data where there's no inherent order among categories.
   - Each category becomes a new feature, and the presence (1) or absence (0) of a category is indicated for each row.
<br />
<br />
3. **Target Encoding (Mean Encoding):**

   - Target encoding replaces each category with the mean of the target variable for that category.
   - It's often used for classification problems.

### Teknik Feature Encoding
- Degan metode One-Hot Encoding. Proses ini bertujuan mengubah data kategori menjadi representasi numerik agar dapat digunakan oleh algoritma pembelajaran mesin, yang umumnya bekerja lebih baik dengan data numerik.

In [174]:
class FeatureEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = OneHotEncoder(sparse_output=False) #mengeluarkan dalam array dan bukan sparse matrix
        self.categorical_columns = []

    def fit(self, X, y=None):
        self.categorical_columns = X.select_dtypes(include=['category']).columns 
        self.encoder.fit(X[self.categorical_columns])
        return self

    def transform(self, X):
        encoded_data = pd.DataFrame( #mengeluarkan pandas dataframe dan bukan numpy array
            self.encoder.transform(X[self.categorical_columns]),
            columns=self.encoder.get_feature_names_out(self.categorical_columns), #menggabungkan nama column lama dengan isi valuenya
            index=X.index
        )
        return pd.concat([X.drop(columns=self.categorical_columns), encoded_data], axis=1) #drop column original lalu membuat column baru dengan nama gabungan

In [175]:
# Initialize the CategoricalFeatureEncoder
encoder = FeatureEncoder()

# Fit the encoder to the data
encoder.fit(train_set)

# Transform the data
encoded_data = encoder.transform(train_set)

# Show the transformed data
# Get the new encoded column names
encoded_columns = encoder.encoder.get_feature_names_out()

# Print the distribution for each value for each encoded column in encoded_data
for col in encoded_columns:
    print(f"Distribution for {col}:")
    print(encoded_data[col].value_counts())
    print()

Distribution for bin_URLLength_high:
bin_URLLength_high
0.0    86133
1.0    19170
Name: count, dtype: int64

Distribution for bin_URLLength_low:
bin_URLLength_low
0.0    82736
1.0    22567
Name: count, dtype: int64

Distribution for bin_URLLength_medium:
bin_URLLength_medium
1.0    63566
0.0    41737
Name: count, dtype: int64

Distribution for bin_DomainLength_high:
bin_DomainLength_high
0.0    83179
1.0    22124
Name: count, dtype: int64

Distribution for bin_DomainLength_low:
bin_DomainLength_low
0.0    83740
1.0    21563
Name: count, dtype: int64

Distribution for bin_DomainLength_medium:
bin_DomainLength_medium
1.0    61616
0.0    43687
Name: count, dtype: int64

Distribution for bin_CharContinuationRate_high:
bin_CharContinuationRate_high
1.0    92359
0.0    12944
Name: count, dtype: int64

Distribution for bin_CharContinuationRate_low:
bin_CharContinuationRate_low
0.0    92359
1.0    12944
Name: count, dtype: int64

Distribution for bin_TLDLegitimateProb_high:
bin_TLDLegitimatePr

### III. Handling Imbalanced Dataset

**Handling imbalanced datasets** is important because imbalanced data can lead to several issues that negatively impact the performance and reliability of machine learning models. Here are some key reasons:

1. **Biased Model Performance**:

 - Models trained on imbalanced data tend to be biased towards the majority class, leading to poor performance on the minority class. This can result in misleading accuracy metrics.

2. **Misleading Accuracy**:

 - High overall accuracy can be misleading in imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts the majority class will have 95% accuracy but will fail to identify the minority class.

3. **Poor Generalization**:

 - Models trained on imbalanced data may not generalize well to new, unseen data, especially if the minority class is underrepresented.


Some methods to handle imbalanced datasets:
1. **Resampling Methods**:

 - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic samples (e.g., SMOTE).
 - Undersampling: Reduce the number of instances in the majority class to balance the dataset.

2. **Evaluation Metrics**:

 - Use appropriate evaluation metrics such as precision, recall, F1-score, ROC-AUC, and confusion matrix instead of accuracy to better assess model performance on imbalanced data.

3. **Algorithmic Approaches**:

 - Use algorithms that are designed to handle imbalanced data, such as decision trees, random forests, or ensemble methods.
 - Adjust class weights in algorithms to give more importance to the minority class.

### Teknik Handliing Imbalanced Set
- Degan metode resampling (Oversampling), dengan SMOTE. //penjelasan//

In [176]:
class ImbalancedDatasetHandler(BaseEstimator, TransformerMixin):

    def __init__(self, method="smote"): #default ke smote
        self.method = method
        self.sampler = None

    def fit(self, X, y):
        if self.method == "smote":
            self.sampler = SMOTE(random_state=42) #agar setiap run code, pembagiannya sama
        elif self.method == "undersample":
            self.sampler = RandomUnderSampler(random_state=42) #agar setiap run code, pembagiannya sama 
        return self

    def transform(self, X, y=None):
        #X_resampled, y_resampled = self.sampler.fit_resample(X, y) 
        #resampled_data = self.sampler.fit_resample(X, y) 
        #resampled_data = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='label')], axis=1)
        #return resampled_data   
        if y is not None:
            y = y.astype(int)
            resampled_data = self.sampler.fit_resample(X, y)
            return resampled_data
        else:
            return X

# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [177]:
# from sklearn.pipeline import Pipeline

# # Note: You can add or delete preprocessing components from this pipeline

# pipe = Pipeline([("imputer", FeatureImputer()),
#                  ("featurecreator", FeatureCreator()),
#                  ("scaler", FeatureScaler()),
#                  ("encoder", FeatureEncoder())])

# train_set = pipe.fit_transform(train_set)
# val_set = pipe.transform(val_set)

In [178]:
# # Your code should work up until this point
# train_set = pipe.fit_transform(train_set)
# val_set = pipe.transform(val_set)


or create your own here

In [179]:
numerical_features = train_set.select_dtypes(include=['float64', 'Int64']).columns
categorical_features = train_set.select_dtypes(include=['category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', FeatureScaler(), numerical_features),
        ('encoder', FeatureEncoder(), categorical_features)
    ]
)

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('smote', ImbalancedDatasetHandler(method="smote")) #bisa ganti ke undersampling pake ImbalancedDatasetHandler(method="undersample")
])

X_train = train_set.drop(columns=['label'])
y_train = train_set['label']
X_val = val_set.drop(columns=['label'])
y_val = val_set['label']

train_set = pipe.fit_transform(X_train, y_train)
val_set = pipe.transform(X_val)



# 4. Modeling and Validation

Modelling is the process of building your own machine learning models to solve specific problems, or in this assignment context, predicting the target feature `label`. Validation is the process of evaluating your trained model using the validation set or cross-validation method and providing some metrics that can help you decide what to do in the next iteration of development.

## A. KNN

In [180]:
# Type your code here
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_set, y_train)
y_pred = knn.predict(val_set)
print(y_pred)
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy:.4f}")

precision = precision_score(y_val, y_pred, average='weighted', zero_division=0)
print(f"Precision Score: {precision:.4f}")

conf_matrix = confusion_matrix(y_val, y_pred)
print(f"Confusion Matrix:\n{conf_matrix}")

class_report = classification_report(y_val, y_pred, zero_division=0)
print(f"Classification Report:\n{class_report}")

[1. 1. 1. ... 1. 1. 1.]
Accuracy: 0.9554
Precision Score: 0.9522
Confusion Matrix:
[[ 1314  1351]
 [  213 32223]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.86      0.49      0.63      2665
         1.0       0.96      0.99      0.98     32436

    accuracy                           0.96     35101
   macro avg       0.91      0.74      0.80     35101
weighted avg       0.95      0.96      0.95     35101



In [181]:
class KNN:
    def __init__(self, k=3, metric='euclidean'):
        self.k = k
        self.metric = metric

    def fit(self, X_train, y_train):
        self.X_train = np.array(X_train)
        self.y_train = np.array(y_train)  # Convert to NumPy array for positional indexing

    def _distance(self, point1, point2):
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((point1 - point2) ** 2))
        elif self.metric == 'manhattan':
            return np.sum(np.abs(point1 - point2))
        elif self.metric == 'minkowski':
            return np.sum(np.abs(point1 - point2) ** 3) ** (1 / 3)  # p=3 example

    def predict(self, X_test):
        predictions = []
        for test_point in X_test:
            distances = []
            for i, train_point in enumerate(self.X_train):
                distance = self._distance(test_point, train_point)
                distances.append((distance, i))
            distances.sort(key=lambda x: x[0])

            # K-nearest neighbors
            k_nearest_neighbors = [self.y_train[i] for _, i in distances[:self.k]]
            
            # Voting
            vote = Counter(k_nearest_neighbors).most_common(1)
            predictions.append(vote[0][0])
        
        return np.array(predictions)

# Use only 20% of the data
data_fraction = 0.2
train_subset_size = int(len(train_set) * data_fraction)
val_subset_size = int(len(val_set) * data_fraction)

train_subset = train_set[:train_subset_size]
y_train_subset = y_train[:train_subset_size]
val_subset = val_set[:val_subset_size]
y_val_subset = y_val[:val_subset_size]

# KNN Implementation
knn = KNN(k=3, metric='euclidean')
knn.fit(train_subset, y_train_subset)
predictions = knn.predict(val_subset)

# Calculate metrics
accuracy = accuracy_score(y_val_subset, predictions)
print(f"Accuracy: {accuracy:.4f}")

precision = precision_score(y_val_subset, predictions, average='weighted', zero_division=0)
print(f"Precision Score: {precision:.4f}")

conf_matrix = confusion_matrix(y_val_subset, predictions)
print(f"Confusion Matrix:\n{conf_matrix}")

class_report = classification_report(y_val_subset, predictions, zero_division=0)
print(f"Classification Report:\n{class_report}")

# knn = KNN(k=3, metric='euclidean')
# knn.fit(train_set, y_train)
# predictions = knn.predict(val_set)

# # Hitung akurasi
# accuracy = accuracy_score(y_val, y_pred)
# print(f"Accuracy: {accuracy:.4f}")

# precision = precision_score(y_val, y_pred, average='weighted', zero_division=0)
# print(f"Precision Score: {precision:.4f}")

# # Matriks kebingungan (confusion matrix)
# conf_matrix = confusion_matrix(y_val, y_pred)
# print(f"Confusion Matrix:\n{conf_matrix}")

# # Laporan klasifikasi
# class_report = classification_report(y_val, y_pred, zero_division=0)
# print(f"Classification Report:\n{class_report}")

Accuracy: 0.9491
Precision Score: 0.9440
Confusion Matrix:
[[ 261  288]
 [  69 6402]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.79      0.48      0.59       549
         1.0       0.96      0.99      0.97      6471

    accuracy                           0.95      7020
   macro avg       0.87      0.73      0.78      7020
weighted avg       0.94      0.95      0.94      7020



## B. Naive Bayes

In [182]:
# Type your code here

## C. Improvements (Optional)

- **Visualize the model evaluation result**

This will help you to understand the details more clearly about your model's performance. From the visualization, you can see clearly if your model is leaning towards a class than the others. (Hint: confusion matrix, ROC-AUC curve, etc.)

- **Explore the hyperparameters of your models**

Each models have their own hyperparameters. And each of the hyperparameter have different effects on the model behaviour. You can optimize the model performance by finding the good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: Grid search, random search, bayesian optimization)

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation method compared to a hold-out (single train-test set) validation. Though, it requires more time and computing power because of how cross-validation works. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)

In [183]:
# Type your code here

## D. Submission
To predict the test set target feature and submit the results to the kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the first in Data Preprocessing
2. With the pipeline, apply `fit_transform` to the original training set before splitting, then only apply `transform` to the test set.
3. Retrain the model on the preprocessed training set
4. Predict the test set
5. Make sure the submission contains the `id` and `label` column.

Note: Adjust step 1 and 2 to your implementation of the preprocessing step if you don't use pipeline API from `sklearn`.

In [184]:
# Type your code here

# 6. Error Analysis

Based on all the process you have done until the modeling and evaluation step, write an analysis to support each steps you have taken to solve this problem. Write the analysis using the markdown block. Some questions that may help you in writing the analysis:

- Does my model perform better in predicting one class than the other? If so, why is that?
- To each models I have tried, which performs the best and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

`Provide your analysis here`