In [None]:
import numpy as np
import pandas as pd

# Load the data
train_data = pd.read_csv("/kaggle/input/predict-the-success-of-bank-telemarketing/train.csv")
test_data = pd.read_csv("/kaggle/input/predict-the-success-of-bank-telemarketing/test.csv")

In [None]:
train_data.info()

## Dataset Overview

The dataset structure and key insights based on the `info()` output are as follows:

1. **Total Entries (Rows)**: The dataset contains **39211 rows**.
2. **Total Columns**: There are **16 columns** in the dataset.
3. **Data Types**:
   - **Integer (`int64`)**: 6 columns
   - **Float (`float64`)**: 0 columns
   - **Object (`object`)**: 10 columns
   - **Boolean (`bool`)**: 0 columns
4. **Missing Values**:
   - `Column3` (`job`) has **229 missing values**.
   - `Column5` (`education`) has **1467 missing values**.
   - `Column10` (`contact`) has **10336 missing values**.
   - `Column15` (`poutcome`) has **29451 missing values**.
5. **Memory Usage**: The dataset uses approximately **4.8 MB** of memory.

### Additional Observations
- The remaining columns have no missing values.

### Suggestions for Data Preparation
- Handle missing values in `Column3`, `Column5`, `Column10` and `Column15` before proceeding with training.
- Verify the data types for correctness, especially for the `object` columns.
- Explore the categorical distributions in the `object` columns to understand their significance.
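
As a rough illustration of the first suggestion, the missing values can be summarized per column before deciding how to handle them. This is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from `train_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "job": ["admin.", np.nan, "technician", "services"],
    "education": ["tertiary", "secondary", np.nan, np.nan],
})

# Count and percentage of missing values per column, worst first.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(summary)
```

The same two lines that build `summary` should work unchanged on `train_data` or `test_data`.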

In [None]:
train_data.head()

## Initial Data Preview (`head()`)

Here is a quick summary of the first few rows of the dataset:

1. The dataset contains columns such as `last contact date`, `age`, `job`, and `marital`, which describe each client's personal details, banking data, and the bank's deposit campaigns.
2. Data:
   - `Column1` appears to be the date on which the bank last contacted the client.
   - `Column2` to `Column5` contain personal information, with some missing values (`NaN`).
   - `Column6` to `Column10` appear to hold banking information such as default status, loan status, and contact type.
   - `Column11` to `Column15` appear to hold data on the bank's campaigns for the client.
   - `Column16` represents the campaign outcome as `yes` or `no`.
3. Observed patterns:
   - `Column16` has only `yes` and `no` values, indicating that this is a binary classification problem.
   
**Action Points**:
- Investigate the data types further for accuracy.


In [None]:
train_data.describe()

## Statistical Summary (`describe()`)

### Key Observations:
1. **Summary of Numerical Columns**:
   - `Column2`:
     - Minimum: 18
     - Maximum: 95
     - Mean: 40 
   - `Column7`:
     - Mean: 5441.781719, Standard Deviation: 16365.292065.
     - Minimum and Maximum values: -8019 and 102127.
     - Some outliers may exist, as the range is wide.
   - `Column11`:
     - Mean: 439.062789, Standard Deviation: 769.096291.
     - Minimum and Maximum values: 0 and 4918.
     - Some outliers may exist, as the range is wide.
   - `Column12`:
     - Mean: 5.108770, Standard Deviation: 9.890153.
     - Minimum and Maximum values: 1 and 63.
     - Some outliers may exist, as the range is wide.
   - `Column13`:
     - Mean: 72.256051, Standard Deviation: 160.942593.
     - Minimum and Maximum values: -1 and 871.
     - Some outliers may exist, as the range is wide.
   - `Column14`:
     - Mean: 11.826171, Standard Deviation: 44.140259.
     - Minimum and Maximum values: 0 and 275.
     - Some outliers may exist, as the range is wide.

2. **Notable Trends**:
   - `Column7` and `Column11` to `Column14` have large standard deviations relative to their means, indicating high variability.
   - Skewness or outliers should be checked in `Column2` based on its range.

3. **Missing Data**:
   - `describe()` excludes null values in calculations. Cross-check against `info()` for accurate missing value handling.

**Action Points**:
- Perform further analysis on `Column7, Column11-Column14` for skewness or outliers.
- Normalize/Scale the data in columns with high variability for modeling purposes.
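
One way to follow up on the first action point is to compute the sample skewness of each numerical column. The sketch below uses a made-up frame, and the cutoff of 1 is only a common rule of thumb assumed here, not a value taken from this notebook:

```python
import pandas as pd

# Hypothetical toy columns: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 50],
})

# Sample skewness per column; values near 0 suggest a symmetric distribution.
skewness = toy.skew()

# Assumed rule of thumb: |skewness| > 1 marks a strongly skewed column.
flagged = skewness[skewness.abs() > 1].index.tolist()

print(skewness.round(2))
print("Columns flagged as skewed:", flagged)
```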


In [None]:
test_data.info()

## Dataset Overview

The dataset structure and key insights based on the `info()` output are as follows:

1. **Total Entries (Rows)**: The dataset contains **10000 rows**.
2. **Total Columns**: There are **15 columns** in the dataset.
3. **Data Types**:
   - **Integer (`int64`)**: 6 columns
   - **Float (`float64`)**: 0 columns
   - **Object (`object`)**: 9 columns
   - **Boolean (`bool`)**: 0 columns
4. **Missing Values**:
   - `Column3` (`job`) has **59 missing values**.
   - `Column5` (`education`) has **390 missing values**.
   - `Column10` (`contact`) has **2684 missing values**.
   - `Column15` (`poutcome`) has **7508 missing values**.
5. **Memory Usage**: The dataset uses approximately **1.1 MB** of memory.

### Additional Observations
- The remaining columns have no missing values.

### Suggestions for Data Preparation
- Handle missing values in `Column3`, `Column5`, `Column10` and `Column15` before predicting.
- Verify the data types for correctness, especially for the `object` columns.
- Explore the categorical distributions in the `object` columns to understand their significance.

In [None]:
test_data.head()

## Initial Data Preview (`head()`)

Here is a quick summary of the first few rows of the dataset:

1. The dataset contains columns such as `last contact date`, `age`, `job`, and `marital`, which describe each client's personal details, banking data, and the bank's deposit campaigns.
2. Data:
   - `Column1` appears to be the date on which the bank last contacted the client.
   - `Column2` to `Column5` contain personal information, with some missing values (`NaN`).
   - `Column6` to `Column10` appear to hold banking information such as default status, loan status, and contact type.
   - `Column11` to `Column15` appear to hold data on the bank's campaigns for the client.
   
**Action Points**:
- Investigate the data types further for accuracy.


In [None]:
test_data.describe()

## Statistical Summary (`describe()`)

### Key Observations:

1. **Notable Trends**:
   - `Column7` and `Column11` to `Column14` have large standard deviations relative to their means, indicating high variability.
   - Skewness or outliers should be checked in `Column2` based on its range.

2. **Missing Data**:
   - `describe()` excludes null values in calculations. Cross-check against `info()` for accurate missing value handling.

**Action Points**:
- Perform further analysis on `Column7, Column11-Column14` for skewness or outliers.
- Normalize/Scale the data in columns with high variability for modeling purposes.


In [None]:
# Checking duplicates
train_data.loc[train_data.duplicated()]

## Duplicate Detection

We did not find any duplicate rows in the training dataset, so no rows need to be removed for redundancy before training the model.
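
For reference, the duplicate check can also be reduced to a single count, which is easier to read than an empty table. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "age": [30, 30, 41],
    "job": ["admin.", "admin.", "services"],
})

# duplicated() marks every repeat of an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates(ignore_index=True)

print("Duplicate rows:", n_duplicates)
print("Rows after de-duplication:", len(deduplicated))
```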

In [None]:
# Checking Age Distribution

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(11, 4))

# KDE plot
sns.kdeplot(data=train_data, x='age', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='age', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Age Distribution')

plt.tight_layout()
plt.show()

In [None]:
# Checking job Distribution

# Get the order of categories
order = train_data['job'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='job', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Job Distribution')

plt.show()

In [None]:
# Checking Marital Status Distribution

# Get the order of categories
order = train_data['marital'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='marital', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Marital Status Distribution')

plt.show()

In [None]:
# Checking Educational Qualification Distribution

# Get the order of categories
order = train_data['education'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='education', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Education Distribution')

plt.show()

In [None]:
# Checking Credit default Distribution

# Get the order of categories
order = train_data['default'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='default', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Credits in default Distribution')

plt.show()

In [None]:
# Checking Balance Distribution

fig, ax = plt.subplots(1, 2, figsize=(11, 4))

# KDE plot
sns.kdeplot(data=train_data, x='balance', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='balance', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Balance Distribution')

plt.tight_layout()
plt.show()

In [None]:
# Checking Housing Loan Status Distribution

# Get the order of categories
order = train_data['housing'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='housing', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Housing loan Distribution')

plt.show()

In [None]:
# Checking Loan Status Distribution

# Get the order of categories
order = train_data['loan'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='loan', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Loan Distribution')

plt.show()

In [None]:
# Checking Contact Type Distribution

# Get the order of categories
order = train_data['contact'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='contact', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Contact type Distribution')

plt.show()

In [None]:
# Checking Contact Duration Distribution

fig, ax = plt.subplots(1, 2, figsize=(11, 4))

# KDE plot
sns.kdeplot(data=train_data, x='duration', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='duration', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Last contact duration Distribution(in sec)')

plt.tight_layout()
plt.show()

In [None]:
# Checking Number of contacts during campaign Distribution

fig, ax = plt.subplots(1, 2, figsize=(11, 4))

# KDE plot
sns.kdeplot(data=train_data, x='campaign', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='campaign', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Number of contacts during campaign')

plt.tight_layout()
plt.show()

In [None]:
# Checking no. of days from previous contact day Distribution

fig, ax = plt.subplots(1, 2, figsize=(20, 4))

# KDE plot
sns.kdeplot(data=train_data, x='pdays', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='pdays', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Number of Days from Previous Contact Distribution')

plt.show()

In [None]:
# Checking number of contacts before this campaign ('previous') Distribution

fig, ax = plt.subplots(1, 2, figsize=(11, 4))

# KDE plot
sns.kdeplot(data=train_data, x='previous', ax=ax[0], fill = True)
ax[0].set_ylabel('')

# Boxplot
sns.boxplot(data=train_data, x='previous', ax=ax[1])

# Set main title for all subplots
plt.suptitle('Number of contacts before campaign')

plt.show()

In [None]:
# Checking outcome Distribution

# Get the order of categories
order = train_data['poutcome'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='poutcome', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Outcome of the previous marketing campaign')

plt.show()

In [None]:
# Checking Target Distribution

# Get the order of categories
order = train_data['target'].value_counts().index

# Set figure size
plt.figure(figsize=(11, 4))

# Create countplot
sns.countplot(data=train_data, y='target', order=order).set_ylabel('')

# Set main title for all subplots
plt.suptitle('Target Distribution')

plt.show()

## Count Plot of the Target Variable

The count plot for the target variable **`target`** is shown above:

### Observations:
1. The dataset is imbalanced:
   - **"Yes"**: approx. 15% of the total records.
   - **"No"**: approx. 85% of the total records.
2. Imbalanced classes could lead to biased predictions, favoring the majority class.

### Actions Taken:
- Instead of oversampling or undersampling the data, we addressed the imbalance with the `scale_pos_weight` parameter during model training.
- `scale_pos_weight` = (`Number of Negative Samples`) / (`Number of Positive Samples`), which is `5.7291916938390255` for the training data.
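
The ratio above can be computed directly from an encoded label array. This is a minimal sketch with made-up labels (the array `y` is hypothetical, with `1` as the positive class):

```python
import numpy as np

# Hypothetical encoded labels: 17 negatives (0) and 3 positives (1).
y = np.array([0] * 17 + [1] * 3)

n_neg = int((y == 0).sum())
n_pos = int((y == 1).sum())

# Ratio of negative to positive samples, as in the formula above.
scale_pos_weight = n_neg / n_pos

print(f"negatives={n_neg}, positives={n_pos}, scale_pos_weight={scale_pos_weight:.4f}")
```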

In [None]:
# Checking correlation

# Select numerical columns
numerical_cols = train_data.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlation matrix
corr = train_data[numerical_cols].corr()

# Mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.show()

## Correlation Analysis

### Key Insights:
1. **Overall Relationships**:
   - The correlation matrix reveals only weak or negligible linear relationships between the numerical variables.
   - Weakly correlated variables carry little linear information about one another.

2. **Notable Observations**:
   - Strong correlations between features may indicate redundancy.
   - Negative correlations might suggest inverse dependencies, useful for predictive modeling.

### Visualization:
- The correlation heatmap highlights relationships:
  - Darker colors indicate stronger positive/negative correlations.
  - Lighter colors indicate weaker relationships.

### Recommendations:
1. **Feature Selection**:
   - Focus on weakly correlated variables for exploratory analysis.

2. **Further Exploration**:
   - Remove any outliers and rerun the correlation analysis.
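
To read the heatmap as a ranked list, the upper triangle of the correlation matrix can be flattened and sorted by absolute value. A minimal sketch on made-up columns (`a`, `b`, `c` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: b is exactly 2 * a, c tends to move against a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once.
upper = np.triu(np.ones_like(corr, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs)
```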


In [None]:
from scipy.stats import zscore

# Detecting outliers

# Selecting only numeric columns from df
numerical_df = train_data.select_dtypes(include=[np.number])

# Calculating z-scores
z_scores = numerical_df.apply(zscore)

# Get rows where any column has an absolute z-score > 3
outlier_rows = train_data[(np.abs(z_scores) > 3).any(axis=1)]

# Print outlier rows
outlier_rows

## Outlier Detection and Removal

Outliers are extreme data points that deviate significantly from the majority of the dataset. They can:
- Skew statistical summaries such as mean and standard deviation.
- Affect model performance by introducing noise.

### Methods Used for Outlier Detection
1. **Visualization Techniques**:
   - **Box Plot**: Highlighted potential outliers beyond the whiskers.
   - **Scatter Plot**: Revealed extreme values in relationships between variables.
2. **Statistical Methods**:
   - **Z-Score**: Data points with a Z-score > 3 or < -3 were considered outliers.
   - **Interquartile Range (IQR)**:
     - Formula: IQR = Q3 - Q1
     - Lower Bound: Q1 - 1.5 * IQR
     - Upper Bound: Q3 + 1.5 * IQR
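
The code cell above applies only the z-score rule. The IQR bounds listed here could be sketched as follows, on a made-up series (`values` is hypothetical):

```python
import pandas as pd

# Hypothetical toy values with one obvious extreme point.
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds from the formulas above.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier.
outliers = values[(values < lower) | (values > upper)]

print(f"IQR={iqr}, lower={lower}, upper={upper}")
print("Outliers:", outliers.tolist())
```

Unlike the z-score, these bounds do not depend on the mean and standard deviation, so a single extreme value does not shift them much.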


### Outliers Detected

We found that 3222 rows in the training dataset contain outliers, meaning an absolute z-score above 3 in at least one numerical column. These rows make up approximately 8% of the training data (3222 of 39211) and could introduce noise into model training, which may encourage overfitting.

In [None]:
# removing outliers

train_data = train_data.drop(outlier_rows.index)

In [None]:
# Define features and target
X_train = train_data.drop(['target'], axis=1)
y_train = train_data['target']

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encoding target column
le_target = LabelEncoder()
y_train = le_target.fit_transform(y_train)

In [None]:
all_data = pd.concat([X_train, test_data], axis=0, ignore_index=True)

from sklearn.impute import SimpleImputer

# Impute 'job' with most frequent value
imputer = SimpleImputer(strategy='most_frequent')
all_data['job'] = imputer.fit_transform(all_data[['job']]).ravel()

# Impute 'education', 'contact', and 'poutcome' with 'unknown'
imputer = SimpleImputer(strategy='constant', fill_value='unknown')
for column in ['education', 'contact', 'poutcome']:
    all_data[column] = imputer.fit_transform(all_data[[column]]).ravel()

In [None]:
# Feature engineering
def feature_engineering(df):
    # Convert date to datetime
    df['last contact date'] = pd.to_datetime(df['last contact date'])
    
    # Extract date features
    df['contact_month'] = df['last contact date'].dt.month
    df['contact_day'] = df['last contact date'].dt.day
    df['contact_dayofweek'] = df['last contact date'].dt.dayofweek
    
    # Create age groups
    df['age_group'] = pd.cut(df['age'], bins=[0, 20, 30, 40, 50, 60, 100], labels=['0-20', '21-30', '31-40', '41-50', '51-60', '60+'])
    
    # Create balance groups
    df['balance_group'] = pd.qcut(df['balance'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
    
    # Drop original date column
    df = df.drop('last contact date', axis=1)
    
    return df

all_data = feature_engineering(all_data)

In [None]:
# Preprocess the data

# Encoding categorical columns
le = LabelEncoder()
categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome', 'age_group', 'balance_group']

for col in categorical_columns:
    all_data[col] = le.fit_transform(all_data[col].astype(str))

In [None]:
# Removing 'pdays' column

all_data.drop(['pdays'], axis=1, inplace=True)

from sklearn.preprocessing import StandardScaler

# Scale numerical features
scaler = StandardScaler()
numerical_columns = ['age', 'balance', 'duration', 'campaign', 'previous']
all_data[numerical_columns] = scaler.fit_transform(all_data[numerical_columns])

In [None]:
# split the train and test data after feature engineering

X_train = all_data[:len(X_train)]
X_test = all_data[len(X_train):]

In [None]:
# Import necessary libraries
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, make_scorer
from skopt import BayesSearchCV

# Define hyperparameter search space
param_grid = {
    'n_estimators': (100, 500),  # Range for number of trees
    'max_depth': (3, 10),  # Tree depth
    'learning_rate': (0.01, 0.1, 'log-uniform'),  # Log scale for finer tuning
    'min_child_weight': (1, 10),  # Minimum sum of weights for split
    'subsample': (0.6, 1.0),  # Fraction of samples
    'colsample_bytree': (0.6, 1.0),  # Fraction of features
    'gamma': (0, 0.5),  # Minimum loss reduction
    'reg_alpha': (0.0, 1.0),  # L1 regularization (floats, so skopt samples a continuous range rather than the integers 0 and 1)
    'reg_lambda': (0.5, 2),  # L2 regularization
    'scale_pos_weight': (1, 10)  # Class imbalance adjustment
}

# Initialize XGBoost model
xgb = XGBClassifier(
    random_state=42,
    use_label_encoder=False,
    eval_metric='logloss'
)

# Define scoring metric
scorer = make_scorer(f1_score, average='macro')

# Set up Bayesian Search
bayes_search = BayesSearchCV(
    estimator=xgb,
    search_spaces=param_grid,
    scoring=scorer,
    n_iter=50,  # Number of iterations
    cv=3,  # 3-fold cross-validation
    n_jobs=-1,
    random_state=42,
    verbose=2
)

# Fit Bayesian optimizer
bayes_search.fit(X_train, y_train)

# Best parameters and model
best_params = bayes_search.best_params_
print("Best Parameters:", best_params)

best_model = bayes_search.best_estimator_

In [None]:
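# Hold out a stratified 20% split to inspect the model's predictions with a confusion matrix.
# Note: best_model was already refit on all of X_train, so X_eval overlaps its training data
# and the scores computed on it are optimistic.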

from sklearn.model_selection import train_test_split

X_train_subset, X_eval, y_train_subset, y_eval = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [None]:
from sklearn.metrics import precision_recall_curve

# Predict probabilities
y_prob = best_model.predict_proba(X_eval)[:, 1]

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_eval, y_prob)

# Identify the threshold that maximizes the F1-score.
# precision and recall have one more entry than thresholds, so drop their last point.
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = f1_scores.argmax()
optimal_threshold = thresholds[optimal_idx]
print("Optimal Threshold for Precision-Recall Balance:", optimal_threshold)

# Apply the threshold
y_pred_adjusted = (y_prob >= optimal_threshold).astype(int)

# Evaluate
from sklearn.metrics import classification_report
print("Adjusted Classification Report:\n", classification_report(y_eval, y_pred_adjusted))


In [None]:
# Checking Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_eval, y_pred_adjusted)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

## Confusion Matrix

### Insights

1. The confusion matrix shows that approximately 18% of the actual positive samples (class 1) are predicted as negative (class 0).
2. Approximately 2.5% of the actual negative samples (class 0) are predicted as positive (class 1).

**Note:** From these results, the model's accuracy is approximately 90%.
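
The percentages quoted above can be derived from the four cells of a binary confusion matrix. A minimal sketch with made-up counts (the matrix `cm` is hypothetical and is not this model's output):

```python
import numpy as np

# Hypothetical counts; rows are true labels, columns are predicted labels.
cm = np.array([
    [850, 50],   # true 0: 850 predicted 0, 50 predicted 1
    [20, 80],    # true 1: 20 predicted 0, 80 predicted 1
])

tn, fp, fn, tp = cm.ravel()

# Share of actual positives predicted as negative.
false_negative_rate = fn / (fn + tp)

# Share of actual negatives predicted as positive.
false_positive_rate = fp / (fp + tn)

# Share of all samples predicted correctly.
accuracy = (tp + tn) / cm.sum()

print(f"FNR={false_negative_rate:.3f}, FPR={false_positive_rate:.3f}, accuracy={accuracy:.3f}")
```

The row and column layout matches what `sklearn.metrics.confusion_matrix` returns for labels `0` and `1`.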

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate evaluation metrics
accuracy = accuracy_score(y_eval, y_pred_adjusted)
precision = precision_score(y_eval, y_pred_adjusted)
recall = recall_score(y_eval, y_pred_adjusted)
f1 = f1_score(y_eval, y_pred_adjusted)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

In [None]:
# Predict on the test set using the tuned threshold
y_pred_prob = best_model.predict_proba(X_test)[:, 1]

y_pred = (y_pred_prob >= optimal_threshold).astype(int)

In [None]:
# Create the submission file
submission = pd.DataFrame({'id': range(len(y_pred)), 'target': le_target.inverse_transform(y_pred)})
submission.to_csv('submission.csv', index=False)
print("Submission file created: submission.csv")

## Creating Submission File (`submission.csv`)

1. **Inverse-transforming the `target` column** from `1` and `0` back to `yes` and `no`, respectively.
2. **Pairing the target with the `id` column**, which runs from 0 to 9999 (`10000 rows`).