
Mount Google Drive

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import Libraries

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from imblearn.over_sampling import SMOTE

Load the Datasets

In [3]:
# Load the datasets from Google Drive
file_path_full = '/content/drive/MyDrive/ML Coursework/bank+marketing/bank-additional/bank-additional/bank-additional-full.csv'
file_path_subset = '/content/drive/MyDrive/ML Coursework/bank+marketing/bank-additional/bank-additional/bank-additional.csv'

# Load both datasets
df_full = pd.read_csv(file_path_full, sep=';')
df_subset = pd.read_csv(file_path_subset, sep=';')

# Check if both datasets have the same structure
if df_full.columns.equals(df_subset.columns):
    print("Both datasets have the same structure. Merging...")
    df = pd.concat([df_full, df_subset], ignore_index=True)
else:
    print("Datasets have different structures. Using `bank-additional-full.csv` for preprocessing.")
    df = df_full

Both datasets have the same structure. Merging...
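
One thing to check after this merge: the UCI description of the Bank Marketing data presents `bank-additional.csv` as a 10% random sample of `bank-additional-full.csv`. If that holds for these files, the concatenation adds a second copy of each sampled row. The sketch below uses small hypothetical frames (not the real data) to show how such repeats can be counted and dropped.

```python
import pandas as pd

# Toy stand-ins for df_full and df_subset (hypothetical values, not the real data).
full = pd.DataFrame({
    "age": [30, 41, 52, 30],
    "job": ["admin.", "services", "retired", "technician"],
})
subset = full.iloc[[1, 2]].copy()  # a sample drawn from `full`

merged = pd.concat([full, subset], ignore_index=True)

# Count rows that exactly repeat an earlier row.
n_duplicates = int(merged.duplicated().sum())
print(f"Duplicate rows after merging: {n_duplicates}")

# Keep only the first occurrence of each row.
deduplicated = merged.drop_duplicates(ignore_index=True)
print(f"Rows before: {len(merged)}, after: {len(deduplicated)}")
```

Note that `drop_duplicates` also removes any exact repeats that were already present inside a single file, so the row count can fall slightly below that of the full file.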


Handle Missing or Unknown Data

In [4]:
# Replace "unknown" with NaN and handle missing values
df.replace('unknown', np.nan, inplace=True)

# Drop columns with more than 30% missing values (adjust threshold if needed)
missing_threshold = 0.3
missing_percentage = df.isnull().mean()
columns_to_drop = missing_percentage[missing_percentage > missing_threshold].index
df.drop(columns=columns_to_drop, inplace=True)

# Fill remaining missing values
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])  # Fill categorical columns with mode
for col in df.select_dtypes(include=['float64', 'int64']).columns:
    df[col] = df[col].fillna(df[col].mean())  # Fill numeric columns with mean
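
To make the threshold and mode-fill steps concrete, here is a minimal sketch on a hypothetical toy frame. The column names are borrowed from the dataset, but the values are invented for illustration. It uses the `df.fillna({col: value})` form, which assigns on the frame itself rather than on an intermediate column object.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "default" is 60% unknown, "housing" is 20% unknown.
toy = pd.DataFrame({
    "default": ["no", "unknown", "unknown", "no", "unknown"],
    "housing": ["yes", "unknown", "no", "yes", "yes"],
    "age": [30, 41, 52, 28, 36],
})

# Treat the "unknown" placeholder as a missing value.
toy = toy.replace("unknown", np.nan)

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Drop columns above the 30% threshold, then fill what remains with the mode.
toy = toy.drop(columns=missing_fraction[missing_fraction > 0.3].index)
toy = toy.fillna({"housing": toy["housing"].mode()[0]})
print(toy)
```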


Encode Categorical Variables

In [5]:
# Encode categorical variables using LabelEncoder
categorical_cols = df.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # Save encoders for future use
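
The saved encoders make it possible to map the integer codes back to the original labels later. The following sketch, on a hypothetical two-column frame, shows the mapping each `LabelEncoder` learns and how `inverse_transform` recovers the labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data using two of the dataset's column names.
toy = pd.DataFrame({
    "marital": ["single", "married", "divorced", "married"],
    "y": ["no", "yes", "no", "no"],
})

label_encoders = {}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    label_encoders[col] = le

# classes_ holds the original labels in code order (sorted alphabetically).
mapping = {col: dict(enumerate(le.classes_)) for col, le in label_encoders.items()}
print(mapping)

# Recover the original labels from the integer codes.
decoded_marital = label_encoders["marital"].inverse_transform(toy["marital"])
print(list(decoded_marital))
```

Because the codes follow alphabetical order, they imply an ordering that nominal features such as `job` or `marital` do not have. Distance-based steps, including SMOTE's nearest-neighbour interpolation, treat these codes as ordinary numbers, so one-hot encoding (for example with `pd.get_dummies`) is a common alternative for such columns.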

Scale Numerical Features

In [6]:
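# Scale all feature columns with StandardScaler. The label-encoded categorical
# columns are integers at this point, so select_dtypes picks them up as well.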
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.drop('y')
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
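
An alternative worth considering is to standardize only the genuinely continuous columns and to fit the scaler on the training split alone, so that test-set statistics do not influence the transformation. The sketch below assumes a hypothetical list `continuous_cols` recorded before encoding, and uses invented toy values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; "job" stands for an already label-encoded category.
toy = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0, 44.0, 60.0],
    "duration": [100.0, 250.0, 80.0, 400.0, 150.0, 90.0, 300.0, 210.0],
    "job": [0, 1, 2, 1, 0, 2, 1, 0],
    "y": [0, 0, 1, 0, 1, 0, 0, 1],
})
continuous_cols = ["age", "duration"]  # assumed list, chosen before encoding

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="y"), toy["y"], test_size=0.25, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler()
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])

# Apply the same training statistics to the test rows.
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])

print(X_train.head())
```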

Balance the Dataset

In [7]:
# Separate features and target
X = df.drop('y', axis=1)
y = df['y']

# Check target distribution
print("Target distribution before balancing:", y.value_counts())

# Use SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Check new target distribution
print("Target distribution after balancing:", np.bincount(y_resampled))

Target distribution before balancing: y
0    40216
1     5091
Name: count, dtype: int64




Target distribution after balancing: [40216 40216]


Split the Dataset

In [8]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
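
Since the resampling above runs before the split, resampled rows can end up in the test set, which may make evaluation scores look better than they would on untouched data. A common alternative is to split first, with `stratify` to preserve the class ratio, and rebalance only the training portion. The sketch below illustrates that ordering on hypothetical toy data. To stay dependency-light it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE. In the notebook, `SMOTE(random_state=42).fit_resample(X_train, y_train)` would take the place of the oversampling step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 4 positives, 16 negatives.
toy = pd.DataFrame({"x": range(20), "y": [1] * 4 + [0] * 16})

# Split first; stratify keeps the 20/80 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["x"]], toy["y"], test_size=0.25, stratify=toy["y"], random_state=42
)

# Rebalance the training rows only (random oversampling as a stand-in for SMOTE).
train = pd.concat([X_train, y_train], axis=1)
minority = train[train["y"] == 1]
majority = train[train["y"] == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

print("Balanced training distribution:", train_balanced["y"].value_counts().to_dict())
print("Untouched test distribution:", y_test.value_counts().to_dict())
```

The test set keeps the original class imbalance, which is what a model would face on new data.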

Save Preprocessed Data

In [9]:
# Save preprocessed data to Google Drive
preprocessed_data_path_train = '/content/drive/My Drive/ML Coursework/Preprocessed Dataset/train_data.csv'
preprocessed_data_path_test = '/content/drive/My Drive/ML Coursework/Preprocessed Dataset/test_data.csv'

# Save training and testing data
pd.concat([X_train, y_train], axis=1).to_csv(preprocessed_data_path_train, index=False)
pd.concat([X_test, y_test], axis=1).to_csv(preprocessed_data_path_test, index=False)

print(f"Preprocessed training data saved to: {preprocessed_data_path_train}")
print(f"Preprocessed testing data saved to: {preprocessed_data_path_test}")

Preprocessed training data saved to: /content/drive/My Drive/ML Coursework/Preprocessed Dataset/train_data.csv
Preprocessed testing data saved to: /content/drive/My Drive/ML Coursework/Preprocessed Dataset/test_data.csv
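
`to_csv` does not create missing folders, so the output directory has to exist before saving. The sketch below shows one way to guard against that and to confirm the file reads back with the same shape and columns. It writes to a temporary directory and uses an invented toy frame rather than the real Drive path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the training split.
train = pd.DataFrame({
    "age": [-0.58, -0.87, 1.2],
    "job": [0.95, -0.74, 0.1],
    "y": [0, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "Preprocessed Dataset")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    path = os.path.join(out_dir, "train_data.csv")
    train.to_csv(path, index=False)

    # Read the file back and compare its structure with what was written.
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(train.columns)
same_shape = reloaded.shape == train.shape
print(f"Columns match: {same_columns}, shape matches: {same_shape}")
```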


Verify Preprocessed Data

In [10]:
# Verify the preprocessed training data
print("Sample of preprocessed training data:")
print(pd.concat([X_train, y_train], axis=1).head())

Sample of preprocessed training data:
            age       job   marital  education   default   housing      loan  \
30576 -0.579390  0.945251 -0.280414   0.620221 -0.009397  0.907476  2.356668   
8588  -0.867539 -0.744768 -1.938341  -0.820256 -0.009397  0.907476  2.356668   
43947 -1.155688  0.945251 -1.938341  -0.340097 -0.009397 -1.101957  2.356668   
79182 -0.865009  1.508591  1.373147   0.620221 -0.009397  0.907476 -0.424328   
73911 -1.900053  1.226921  1.377514   1.010280 -0.009397 -0.976271 -0.424328   

        contact     month  day_of_week  ...  campaign     pdays  previous  \
30576 -0.757217  0.760470    -0.719562  ...  0.158113  0.195930 -0.349533   
8588   1.320625 -0.102081     1.428221  ... -0.205228  0.195930 -0.349533   
43947 -0.757217 -0.533357    -1.435490  ... -0.205228  0.195930 -0.349533   
79182 -0.757217  1.621886    -1.433604  ... -0.205228  0.195930 -0.349533   
73911 -0.757217  0.760470     0.757074  ... -0.227955 -5.119789  9.663307   

       poutcome  e