# Data Mining Project
## Team name: Only God Knows
Team members:
- Alameen Sabbah
- Yazeed Migdadi


In [None]:
# Libraries
import numpy as np
import pandas as pd
import optuna
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


## Let's Take a Look at Our Data

In [None]:

df = pd.read_csv('train.csv')

df.head()


## Data Types and Categories

In this step, we'll examine the data types of each column in the dataset and categorize them accordingly. Understanding the nature of each feature will help us decide how to handle them during the data analysis process.

### Feature Description and Data Types:

1. **Gender**: 
   - Values: `0: Male`, `1: Female`
   - Data Type: **Categorical (Binary)**
   - Description: Indicates the gender of the customer.

2. **Age**:
   - Data Type: **Numerical (Continuous)**
   - Description: Represents the age of the customer in years.

3. **Driving_License**:
   - Values: `0: No License`, `1: Has License`
   - Data Type: **Categorical (Binary)**
   - Description: Indicates whether the customer has a driving license.

4. **Region_Code**:
   - Data Type: **Categorical (Nominal)**
   - Description: Encoded regions of the customer.

5. **Previously_Insured**:
   - Values: `0: No`, `1: Yes`
   - Data Type: **Categorical (Binary)**
   - Description: Indicates whether the customer has had insurance before.

6. **Vehicle_Age**:
   - Values: `1: 1-2 years`, `2: < 1 year`, `3: > 2 years`
   - Data Type: **Categorical (Ordinal)**
   - Description: Represents the age of the vehicle in different categories.

7. **Vehicle_Damage**:
   - Values: `0: No`, `1: Yes`
   - Data Type: **Categorical (Binary)**
   - Description: Indicates whether the customer's vehicle has previously been damaged.

8. **Annual_Premium**:
   - Data Type: **Numerical (Continuous)**
   - Description: The amount of money paid annually for the insurance policy.

9. **Policy_Sales_Channel**:
   - Data Type: **Categorical (Nominal)**
   - Description: Indicates the sales agency that dealt with the customer, identifying which agency offered the insurance service.

10. **Vintage**:
    - Data Type: **Numerical (Continuous)**
    - Description: Represents the number of days the customer has been insured with the company.

11. **Response**:
    - Values: `0: No`, `1: Yes`
    - Data Type: **Categorical (Binary)**
    - Description: The target variable, indicating whether the customer responded positively to the insurance offer.
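
The categorization above can be checked against what pandas actually loaded. The following is a minimal sketch on a hypothetical three-row sample (the values are made up, and the choice of which columns to cast is an assumption for illustration); it casts code-like columns to `category` and reports the dtype and number of distinct values per column.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the list above.
sample = pd.DataFrame({
    "Gender": [0, 1, 1],
    "Age": [23, 41, 67],
    "Region_Code": [28, 8, 28],
    "Annual_Premium": [2630.0, 31500.0, 40250.0],
    "Response": [0, 1, 0],
})

# Columns that hold codes rather than quantities are cast to 'category'
# so that summaries treat them as labels, not numbers.
code_columns = ["Gender", "Region_Code", "Response"]
typed = sample.astype({col: "category" for col in code_columns})

# One row per column: its dtype and how many distinct values it holds.
summary = pd.DataFrame({
    "dtype": typed.dtypes.astype(str),
    "n_unique": typed.nunique(),
})
print(summary)
```

A column with few distinct integer values is usually a code, while a column with many distinct values is more likely a true quantity.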


## Checking Data Quality

In this step, we will assess the quality of the dataset. This includes identifying:

1. **Outliers**: Extreme values that may distort the analysis.
2. **Wrong Data**: Inconsistent or invalid data entries.
3. **Duplicate Data**: Duplicate rows that could bias the results.
4. **Missing Data**: Columns or rows with missing values.
5. **Class Balance**: Assessing the balance between different classes in categorical data, especially in the target variable `Response`.

We will use various techniques and visualizations to identify these issues.
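
As a numeric complement to a class-balance plot, the share of each class and the negative-to-positive ratio can be computed directly. This is a small sketch on a hypothetical target column, not on the project data.

```python
import pandas as pd

# Hypothetical target column: 8 negatives and 2 positives.
response = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Response")

# Fraction of rows in each class, ordered by class label.
class_share = response.value_counts(normalize=True).sort_index()

# How many negatives there are for every positive.
neg_to_pos = (response == 0).sum() / (response == 1).sum()

print(class_share)
print(f"negative-to-positive ratio: {neg_to_pos:.1f}")
```

A ratio well above 1 indicates an imbalanced target, which is the situation class-weighting options in gradient boosting libraries are meant for.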


In [None]:
# Load the dataset
df = pd.read_csv('train.csv')
df.drop(['id'], axis=1, inplace=True)

# 1. Checking for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# 2. Checking for missing data
missing_data = df.isnull().sum()
print("\nMissing data per column:")
print(missing_data)

# 3. Checking for outliers in specific columns
columns_to_check = ['Age', 'Region_Code', 'Annual_Premium', 'Vintage', 'Policy_Sales_Channel']

# Ensure the columns exist in the DataFrame
columns_to_check = [col for col in columns_to_check if col in df.columns]

plt.figure(figsize=(12, 8))
num_cols = 3  # Set number of columns for subplots
num_rows = (len(columns_to_check) // num_cols) + (len(columns_to_check) % num_cols > 0)

for i, col in enumerate(columns_to_check, 1):
    plt.subplot(num_rows, num_cols, i)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

# 4. Checking for out-of-bound values
# Example for Age: Assuming a reasonable range (18 to 100)
age_outliers = df[(df['Age'] < 18) | (df['Age'] > 100)]
print("\nWrong Data (Out-of-bound Age):")
print(age_outliers)

# Example for Annual_Premium: Defining an arbitrary range for demonstration
premium_outliers = df[df['Annual_Premium'] > 100000]
print("\nOutliers in Annual_Premium (Above 100,000):")
print(premium_outliers)

# 5. Checking class balance for the target variable 'Response'
sns.countplot(x='Response', data=df)
plt.title('Class Distribution of Response')
plt.show()

# 6. Check balance in categorical columns (if needed)
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=col, data=df)
    plt.title(f'Class Distribution of {col}')
    plt.xticks(rotation=45)
    plt.show()


# Our Best Model

# Data Preprocessing Steps

The following sections outline the key preprocessing steps applied to the dataset:

## 1. Attribute Transformation
- **Log Transformation**: 
  - The `Annual_Premium` column is transformed with `log1p`, the natural logarithm of `1 + x`, which stays defined when a value is 0. This transformation reduces right skew and stabilizes the variance in the data.
  
- **Square Root Transformation**: 
  - The `Vintage` column is transformed using the square root (`sqrt`). This helps to reduce the impact of large values and bring the data closer to a normal distribution.
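
To illustrate the effect of these transformations, here is a minimal sketch on synthetic data. The lognormal sample is only a stand-in for a right-skewed premium-like column, and the day counts are made-up values; neither comes from the project dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a premium-like column.
premium = pd.Series(rng.lognormal(mean=10.0, sigma=0.8, size=5000))

# log1p computes log(1 + x), so it is also defined at x = 0.
premium_log = np.log1p(premium)

skew_before = premium.skew()
skew_after = premium_log.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# The square root compresses large values more than small ones.
vintage = pd.Series([10.0, 100.0, 299.0])  # hypothetical day counts
vintage_sqrt = np.sqrt(vintage)
print(vintage_sqrt.round(2).tolist())
```

Comparing the skewness before and after is a quick way to check whether a transformation is doing what it was chosen for.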

## 2. Feature Engineering
- **Creating Interaction Terms**:
  - Interaction features are created by combining the `Age_Group` and `Vehicle_Age` categories to capture any potential combined effect these features might have on the target variable.

- **Age Group Categorization**:
  - The `Age` feature is categorized into four age groups: `Young`, `Adult`, `Middle_Aged`, and `Senior` using the `pd.cut()` function. This helps capture non-linear relationships and makes the data more interpretable.

- **Additional Features**:
  - **Age_Squared**: The square of the `Age` feature to capture quadratic effects of age.
  - **Premium_to_Age_Ratio**: The ratio of `Annual_Premium_Log` to `Age` to capture how the premium relates to age.
  - **Retention_Likelihood**: A binary feature indicating whether the `Vintage` (duration of the policy) is above the median, assuming this may correlate with customer retention likelihood.
  - **Age_VehicleDamage_Interaction**: Interaction between `Age` and `Vehicle_Damage`, as the relationship between age and vehicle damage might be significant.

- **Region-Based Features**:
  - A new `Region` feature is derived from `Policy_Sales_Channel`: channel codes up to 100 are labeled 'North' and all others 'South'. The resulting two-way split is intended to capture geographical differences between sales channels.
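
The age grouping described above can be sketched in isolation as follows. The bin edges and labels are those named in this section; the sample ages are hypothetical. Because the edges are expressed in years, the sketch also shows what happens if standardized values are passed to the same bins.

```python
import pandas as pd

bins = [0, 30, 45, 60, 100]
labels = ["Young", "Adult", "Middle_Aged", "Senior"]

# Hypothetical ages in years. Intervals are right-closed: (0, 30], (30, 45], ...
ages = pd.Series([22, 30, 31, 45, 58, 60, 75], name="Age")
age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_Group": age_group}))

# One-hot encoding yields one column per label, even for unused labels.
dummies = pd.get_dummies(age_group, prefix="Age_Group")

# Standardized ages do not fit bins given in years: negative values fall
# outside every bin (NaN) and small positive values all land in "Young".
scaled_groups = pd.cut(pd.Series([-0.7, 0.4]), bins=bins, labels=labels)
print(scaled_groups.tolist())
```

In practice this means the binning should be applied to ages in their original units, before any scaling step.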

## 3. Feature Selection
- **Mutual Information for Feature Selection**:
  - Feature selection is performed using the **mutual information** between the features (`X`) and the target variable (`y`). Mutual information measures the amount of information gained about one variable through another.
  - The `mutual_info_classif` function from `sklearn.feature_selection` is used to calculate the mutual information scores between each feature and the target variable.
  - The top `n_features` features are selected based on their mutual information scores. By default, the top 30 features are selected, but this number can be adjusted as needed.
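
The ranking step can be illustrated on synthetic data where the answer is known in advance. In this sketch the feature names and the data are hypothetical: one column is built from the target and two are pure noise. `random_state` is set because the estimator adds a small amount of random noise to continuous features, so scores can vary slightly between runs otherwise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 2000

# Synthetic binary target and three features (names are hypothetical).
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(0.0, 0.3, size=n),  # depends on y
    "noise_a": rng.normal(size=n),                    # independent of y
    "noise_b": rng.normal(size=n),                    # independent of y
})

# Score every feature against the target, then keep the top-scoring one.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
top = mi.nlargest(1).index.tolist()

print(mi.sort_values(ascending=False))
print("selected:", top)
```

The informative column receives a clearly higher score than the noise columns, which score close to zero.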
  
## 4. Remove Outliers
- **Outlier Removal Based on Standard Deviation**:
  - Outliers are detected and removed based on the **standard deviation** from the mean of each column.
  - The `remove_outliers` function filters out rows where the values of a given column lie outside the range of `mean ± n_std * std`, where `n_std` is a multiplier that defines how many standard deviations away from the mean the values must be in order to be considered outliers.
  - By default, `n_std` is set to 3, meaning that any values outside of 3 standard deviations from the mean will be removed.
  - The `remove_outliers` function itself is defined in a code cell further down, after `preprocess_data`.
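
Separately from the project's own `remove_outliers` function, the following worked example shows the arithmetic of the `mean ± n_std * std` rule on a hypothetical column of twenty values with one extreme entry. It is only a sketch to make the thresholds concrete.

```python
import pandas as pd

# Hypothetical column: nineteen ordinary values and one extreme value.
values = pd.Series([10.0] * 19 + [200.0])

n_std = 3
mean = values.mean()
std = values.std()  # sample standard deviation (ddof=1), the pandas default

lower = mean - n_std * std
upper = mean + n_std * std

# Keep only the rows that lie inside the band.
kept = values[(values >= lower) & (values <= upper)]

print(f"mean={mean:.2f}, std={std:.2f}, band=[{lower:.2f}, {upper:.2f}]")
print(f"rows kept: {len(kept)} of {len(values)}")
```

Note that the extreme value also inflates the mean and the standard deviation that define the band, so on very small samples a single outlier may not be flagged at all.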





In [None]:
def preprocess_data(data, is_train=True):
    """Enhanced feature engineering and preprocessing."""
    
    # Handle missing values
    imputer = SimpleImputer(strategy='median')
    data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    
    # Keep unscaled copies: the age bins and the channel threshold used below
    # are in original units, so they must not be applied to standardized values
    age_raw = data["Age"].copy()
    channel_raw = data["Policy_Sales_Channel"].copy()

    # Apply transformations and scaling
    data["Annual_Premium_Log"] = np.log1p(data["Annual_Premium"])
    data["Vintage_Sqrt"] = np.sqrt(data["Vintage"])

    scaler = StandardScaler()
    columns_to_scale = ["Annual_Premium_Log", "Vintage_Sqrt", "Age", "Policy_Sales_Channel"]
    data[columns_to_scale] = scaler.fit_transform(data[columns_to_scale])

    # Convert 'Vehicle_Age' into boolean features
    vehicle_age_dummies = pd.get_dummies(data['Vehicle_Age'], prefix='Vehicle_Age')
    data = pd.concat([data, vehicle_age_dummies], axis=1)

    # Create age group categories (from ages in years) and interaction terms
    data["Age_Group"] = pd.cut(age_raw, bins=[0, 30, 45, 60, 100], labels=["Young", "Adult", "Middle_Aged", "Senior"])
    age_group_dummies = pd.get_dummies(data["Age_Group"], prefix="Age_Group")
    data = pd.concat([data, age_group_dummies], axis=1)

    # Interaction features
    for age_group in age_group_dummies.columns:
        for vehicle_age in vehicle_age_dummies.columns:
            data[f"{age_group}_{vehicle_age}"] = data[age_group] * data[vehicle_age]

    # Additional features
    data["Age_Squared"] = data["Age"] ** 2
    data["Premium_to_Age_Ratio"] = data["Annual_Premium_Log"] / (data["Age"] + 1e-5)
    data["Retention_Likelihood"] = (data["Vintage"] > data["Vintage"].median()).astype(int)
    data["Age_VehicleDamage_Interaction"] = data["Age"] * data["Vehicle_Damage"]

    # Region-based features (threshold applies to the original channel codes)
    data['Region'] = np.where(channel_raw <= 100, 'North', 'South')
    region_dummies = pd.get_dummies(data['Region'], prefix='Region')
    data = pd.concat([data, region_dummies], axis=1)

    # Handle categorical features
    categorical_features = ['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Damage']
    for feature in categorical_features:
        le = LabelEncoder()
        data[feature] = le.fit_transform(data[feature])

    # Drop original categorical columns after encoding
    data.drop(columns=["Age_Group", "Vehicle_Age", "Region"], inplace=True)

    # Separate ID column and response if training
    id_column = data.pop("id")
    if is_train:
        y = data.pop("Response")
        return data, y, id_column
    return data, id_column


In [None]:
def remove_outliers(df, column, n_std=3):
    """Remove outliers based on standard deviation."""
    mean = df[column].mean()
    std = df[column].std()
    return df[(df[column] >= mean - n_std * std) & (df[column] <= mean + n_std * std)]


In [None]:
def select_important_features(X, y, n_features=30):
    """Select top n important features based on mutual information."""
    mi_scores = mutual_info_classif(X, y)
    mi_scores = pd.Series(mi_scores, index=X.columns)
    return mi_scores.nlargest(n_features).index.tolist()


### Hyperparameter Optimization with Optuna

In this notebook, we demonstrate the use of **Optuna** for hyperparameter optimization of machine learning models, specifically **XGBoost** and **LightGBM** classifiers. Hyperparameter tuning is a crucial step in improving model performance, as it helps to find the best combination of hyperparameters for a given dataset.

#### Key Functions:

1. **`optimize_xgb`**:
    - This function uses Optuna to optimize hyperparameters for the **XGBoost** classifier.
    - The hyperparameters being tuned are `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `min_child_weight`, and `gamma`. `scale_pos_weight` is not searched; it is fixed to the negative-to-positive class ratio of the training split.
    - A cross-validation approach is used to evaluate the performance of the model with the suggested hyperparameters.

2. **`optimize_lgbm`**:
    - Similar to the `optimize_xgb` function, this function optimizes hyperparameters for the **LightGBM** classifier using Optuna.
    - Hyperparameters tuned include `n_estimators`, `learning_rate`, `num_leaves`, `subsample`, `colsample_bytree`, `min_child_samples`, `reg_alpha`, and `reg_lambda`. `is_unbalance` is fixed to `True` rather than tuned.
    - The function also incorporates cross-validation to evaluate the performance.

3. **`cross_validate`**:
    - A utility function that performs **Stratified K-Fold cross-validation** to evaluate the model's performance.
    - It splits the dataset into training and validation sets and computes the **ROC AUC score** for each fold.
    - The average ROC AUC score is returned as the final evaluation metric.

Together, these functions automate the hyperparameter search: Optuna proposes a set of values for each trial, the model is scored by cross-validated ROC AUC, and the study keeps the best-scoring configuration for both XGBoost and LightGBM.
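
The cross-validation scheme described for `cross_validate` can be sketched on synthetic data. In this sketch `LogisticRegression` is only a lightweight stand-in for the boosted models, and the data is generated so that roughly 12% of rows are positive; both are assumptions for illustration. It shows that stratified folds keep the positive rate nearly identical across folds, which matters for an imbalanced target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 1000

# Synthetic imbalanced target and three features shifted for positives.
y = (rng.random(n) < 0.12).astype(int)
X = rng.normal(size=(n, 3)) + y[:, None] * 1.0

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_auc, fold_pos_rate = [], []
for train_idx, val_idx in cv.split(X, y):
    model = LogisticRegression()  # stand-in for XGBoost / LightGBM
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    fold_auc.append(roc_auc_score(y[val_idx], proba))
    fold_pos_rate.append(y[val_idx].mean())

print("positive rate per fold:", np.round(fold_pos_rate, 3))
print(f"mean ROC AUC: {np.mean(fold_auc):.3f}")
```

A fresh model is created inside the loop here so that no fitted state carries over from one fold to the next.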


In [None]:
def optimize_xgb(trial):
    """Optimize XGBoost hyperparameters using Optuna."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        # suggest_float(..., log=True) replaces suggest_loguniform (deprecated since Optuna 3.0)
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True),
        # Fixed class-imbalance weight (not tuned), computed in the main workflow
        "scale_pos_weight": scale_pos_weight,
    }
    model = XGBClassifier(**params)
    return cross_validate(model, X_train, y_train)

def optimize_lgbm(trial):
    """Optimize LightGBM hyperparameters using Optuna."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        # suggest_float(..., log=True) replaces suggest_loguniform (deprecated since Optuna 3.0)
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 20, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "min_child_samples": trial.suggest_int("min_child_samples", 1, 100),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        # Fixed class-imbalance handling (not tuned)
        "is_unbalance": True,
    }
    model = LGBMClassifier(**params)
    return cross_validate(model, X_train, y_train)


def cross_validate(model, X, y, n_splits=5):
    """Perform cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
        y_train_cv, y_val_cv = y.iloc[train_idx], y.iloc[val_idx]
        model.fit(X_train_cv, y_train_cv)
        y_pred = model.predict_proba(X_val_cv)[:, 1]
        scores.append(roc_auc_score(y_val_cv, y_pred))
    return np.mean(scores)

### Main Workflow

In [None]:
if __name__ == "__main__":
    # Load data
    train_data = pd.read_csv("train.csv")
    test_data = pd.read_csv("test.csv")

    # Remove outliers in 'Annual_Premium' column
    train_data = remove_outliers(train_data, "Annual_Premium")

    # Preprocess data
    X, y, _ = preprocess_data(train_data)
    X_test, test_ids = preprocess_data(test_data, is_train=False)
    
    # Select top important features
    top_features = select_important_features(X, y, n_features=30)
    X = X[top_features]
    X_test = X_test[top_features]

    # Train-validation split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

    # Run Optuna studies
    n_trials = 10
    timeout = 600 

    print("Optimizing XGBoost...")
    study_xgb = optuna.create_study(direction="maximize")
    study_xgb.optimize(optimize_xgb, n_trials=n_trials, timeout=timeout)

    print("Optimizing LightGBM...")
    study_lgbm = optuna.create_study(direction="maximize")
    study_lgbm.optimize(optimize_lgbm, n_trials=n_trials, timeout=timeout)



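    # Train final models. best_params holds only the tuned values, so the fixed
    # class-imbalance settings used during the search are passed again here
    best_xgb = XGBClassifier(**study_xgb.best_params, scale_pos_weight=scale_pos_weight).fit(X, y)
    best_lgbm = LGBMClassifier(**study_lgbm.best_params, is_unbalance=True).fit(X, y)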

    # Make predictions
    predictions = {
        "XGBoost": best_xgb.predict_proba(X_test)[:, 1],
        "LightGBM": best_lgbm.predict_proba(X_test)[:, 1],
    }
    
    # Ensemble predictions
    ensemble_predictions = np.mean([predictions["XGBoost"], predictions["LightGBM"]], axis=0)
    
    # Save outputs
    for model_name, preds in predictions.items():
        output = pd.DataFrame({"id": test_ids, "Response": preds})
        output.to_csv(f"output_{model_name.lower()}.csv", index=False)
        print(f"{model_name} predictions saved to output_{model_name.lower()}.csv")

    # Save ensemble predictions
    ensemble_output = pd.DataFrame({"id": test_ids, "Response": ensemble_predictions})
    ensemble_output.to_csv("output_ensemble.csv", index=False)
    print("Ensemble predictions saved to output_ensemble.csv")


# Model Evaluation Metrics

The following table lists the Kaggle score of each model we tried in this project, from highest to lowest.

| **Model**                          | **Kaggle Score** |
|------------------------------------|------------------|
| **XGBoost**                        | 0.89325          |
| **LightGBM**                       | 0.89314          |
| **Deep Learning (Neural Network)** | 0.89194          |
| **TabNet**                         | 0.88446          |
| **Decision Tree using Entropy V2** | 0.86291          |
| **Logistic Regression**            | 0.85110          |
| **Random Forest**                  | 0.82597          |
| **Decision Tree using Entropy V1** | 0.82384          |
| **Decision Tree using Gini V1**    | 0.82260          |
| **KNN**                            | 0.78019          |







## Achieving a Kaggle Score of 0.89452

Our best submission reached a Kaggle score of **0.89452**, higher than any single model we tested; the best single model scored **0.89325**. The improvement came from a **model ensembling** strategy.

### Ensembling Approach

The key idea behind our approach was to combine the predictions from multiple models to leverage their strengths and create a more robust prediction. Here’s how we went about it:

1. **Identifying the Best Models**: 
   - We identified two of our top-performing models based on their individual Kaggle scores: **Model 1** (with a score of 0.89325) and **Model 2** (with a slightly lower score).
   
2. **Ensembling High-Differentiation Outputs**:
   - Rather than averaging any two models, we chose models whose predictions diverged the most from each other, so that each one contributes information the other lacks. The selected predictions were then averaged, which helps capture more unique information and may reduce the risk of overfitting.
   
3. **Further Ensembling**:
   - Once we had the ensemble predictions from the initial two models, we fed these into a second ensembling step with predictions from additional models to further refine the results. This iterative process helped us to maximize the combined predictive power.
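
One possible reading of "highest difference" in step 2 is sketched below. The divergence measure (mean absolute difference between two prediction vectors), the model names, and the probabilities are all assumptions made for illustration; this is not necessarily how the submitted ensemble was built.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical probabilities from three models for the same five ids.
preds = pd.DataFrame({
    "model_a": [0.10, 0.80, 0.35, 0.60, 0.05],
    "model_b": [0.12, 0.78, 0.36, 0.58, 0.06],
    "model_c": [0.30, 0.60, 0.10, 0.90, 0.20],
})

# Divergence of a pair = mean absolute difference of their predictions.
divergence = {
    (a, b): float(np.mean(np.abs(preds[a] - preds[b])))
    for a, b in combinations(preds.columns, 2)
}

# Pick the pair that disagrees the most and average its predictions.
most_divergent = max(divergence, key=divergence.get)
blend = preds[list(most_divergent)].mean(axis=1)

print(divergence)
print("blending:", most_divergent)
print(blend.round(3).tolist())
```

Divergence alone does not say whether a model is good, so a measure like this is most useful among models whose individual scores are already close.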


### Code Used for Ensembling

Here is the code we used to combine the predictions from two models and create the final ensemble prediction:

In [None]:
# Load the two prediction files
file1 = "output1.csv"  
file2 = "output2.csv" 

# Read the prediction files
pred1 = pd.read_csv(file1)
pred2 = pd.read_csv(file2)

if not pred1['id'].equals(pred2['id']):
    raise ValueError("The 'id' columns in the two files do not match!")

# Average the predictions (ensemble)
ensemble_preds = (pred1['Response'] + pred2['Response']) / 2

ensemble_output = pd.DataFrame({
    "id": pred1['id'],
    "Response": ensemble_preds
})

# Use one variable for the file name so the message matches the file written
output_file = "output_ensemble11.csv"
ensemble_output.to_csv(output_file, index=False)
print(f"Ensemble predictions saved to {output_file}")