# Step 1: Dataset Selection
## Bank Marketing Dataset
This dataset contains customer information and marketing campaign data from a bank. The goal is to predict whether a customer subscribes to a term deposit, recorded in the target column `y` (`yes` or `no`).

### Tasks Covered in this Step:
1. Load the dataset
2. Explore dataset structure
3. Check for missing values
4. Check for duplicate entries
5. Summarize numerical features

### 1. Import Required Libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

### 2. Load the Dataset

In [2]:
# Load the dataset (CSV file)
file_path = "bank-additional-full.csv"
df = pd.read_csv(file_path, delimiter=';')

# Display first 5 rows to understand the data
df.head()

index,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### 3. Get Basic Info about Dataset

In [3]:
# Display dataset info (column names, non-null counts, data types)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

### 4. Check for Missing Values

In [4]:
# Check for missing values in each column
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
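Zero nulls does not necessarily mean the data is complete: the categorical columns in this dataset carry the literal string `unknown` (it appears in the `default` column of the preview rows). A minimal sketch of counting such placeholders per column follows; the helper name `count_placeholder` and the tiny `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def count_placeholder(frame: pd.DataFrame, placeholder: str = "unknown") -> pd.Series:
    """Count cells equal to `placeholder` per column, keeping only non-zero counts."""
    counts = frame.eq(placeholder).sum()
    return counts[counts > 0].sort_values(ascending=False)


# Tiny stand-in for the real DataFrame (values are illustrative only)
sample = pd.DataFrame({
    "job": ["services", "unknown", "admin."],
    "default": ["unknown", "unknown", "no"],
    "age": [57, 37, 40],
})

unknown_counts = count_placeholder(sample)
print(unknown_counts)
```

On the real data the call would be `count_placeholder(df)`. Whether to treat `unknown` as a missing value or as its own category is a modelling decision left open here.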

### 5. Check for Duplicate Rows

In [5]:
# Count duplicate rows
duplicates = df.duplicated().sum()
print(f"Total Duplicate Rows: {duplicates}")

Total Duplicate Rows: 12
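The count alone does not show which rows repeat. The sketch below, run on a small made-up frame, lists every member of each duplicate group and then drops the repeats. Because the dataset has no customer identifier, identical rows could be distinct customers, so dropping them is a judgment call rather than a required step.

```python
import pandas as pd

# Illustrative rows only; on the real data, use `df` in place of `sample`
sample = pd.DataFrame({
    "age": [56, 57, 56, 40],
    "job": ["housemaid", "services", "housemaid", "admin."],
    "y": ["no", "no", "no", "no"],
})

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_groups = sample[sample.duplicated(keep=False)]
print(duplicate_groups)

# drop_duplicates keeps the first occurrence of each row by default
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(sample)}, after: {len(deduplicated)}")
```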


### 6. Summary Statistics for Numerical Features

In [6]:
# Display statistical summary of numerical columns
df.describe()

statistic,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,40.02406,258.28501,2.567593,962.475454,0.172963,0.081886,93.575664,-40.5026,3.621291,5167.035911
std,10.42125,259.279249,2.770014,186.910907,0.494901,1.57096,0.57884,4.628198,1.734447,72.251528
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1
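The `pdays` summary deserves a second look: its quartiles and maximum are all 999, which pulls the mean up to about 962. The sketch below assumes that 999 is a sentinel meaning "not previously contacted", which is consistent with the preview rows (`pdays` 999, `previous` 0, `poutcome` nonexistent) but is not confirmed by the notebook itself. It masks the sentinel before summarizing and keeps the contact information as a separate flag.

```python
import pandas as pd

# Illustrative values only; on the real data, use df["pdays"]
pdays = pd.Series([999, 999, 6, 3, 999], name="pdays")

SENTINEL = 999  # assumption: marks clients with no previous contact

# Replace the sentinel with NaN so it no longer distorts the statistics
pdays_clean = pdays.where(pdays != SENTINEL)

# Preserve the information the sentinel carried as a boolean feature
was_contacted = pdays != SENTINEL

print(f"Mean with sentinel:    {pdays.mean():.1f}")
print(f"Mean without sentinel: {pdays_clean.mean():.1f}")
print(f"Previously contacted:  {was_contacted.sum()} of {len(pdays)}")
```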


# Step 2: Dataset Schema & Parquet Conversion

### Objectives:
1. Define the schema for our dataset.
2. Convert the dataset to Parquet format.

### 1. Define Dataset Schema

In [7]:
# Define dataset schema
dataset_schema = {
    "age": {"type": "numerical", "nullable": False},
    "job": {"type": "categorical", "allowed_values": list(df["job"].unique()), "nullable": False},
    "marital": {"type": "categorical", "allowed_values": list(df["marital"].unique()), "nullable": False},
    "education": {"type": "categorical", "allowed_values": list(df["education"].unique()), "nullable": False},
    "default": {"type": "categorical", "allowed_values": list(df["default"].unique()), "nullable": False},
    "housing": {"type": "categorical", "allowed_values": list(df["housing"].unique()), "nullable": False},
    "loan": {"type": "categorical", "allowed_values": list(df["loan"].unique()), "nullable": False},
    "contact": {"type": "categorical", "allowed_values": list(df["contact"].unique()), "nullable": False},
    "month": {"type": "categorical", "allowed_values": list(df["month"].unique()), "nullable": False},
    "day_of_week": {"type": "categorical", "allowed_values": list(df["day_of_week"].unique()), "nullable": False},
    "duration": {"type": "numerical", "nullable": False},
    "campaign": {"type": "numerical", "nullable": False},
    "pdays": {"type": "numerical", "nullable": False},
    "previous": {"type": "numerical", "nullable": False},
    "poutcome": {"type": "categorical", "allowed_values": list(df["poutcome"].unique()), "nullable": False},
    "emp.var.rate": {"type": "numerical", "nullable": False},
    "cons.price.idx": {"type": "numerical", "nullable": False},
    "cons.conf.idx": {"type": "numerical", "nullable": False},
    "euribor3m": {"type": "numerical", "nullable": False},
    "nr.employed": {"type": "numerical", "nullable": False},
    "y": {"type": "categorical", "allowed_values": list(df["y"].unique()), "nullable": False}  # Target variable
}

# Display schema
dataset_schema


{'age': {'type': 'numerical', 'nullable': False},
 'job': {'type': 'categorical',
  'allowed_values': ['housemaid',
   'services',
   'admin.',
   'blue-collar',
   'technician',
   'retired',
   'management',
   'unemployed',
   'self-employed',
   'unknown',
   'entrepreneur',
   'student'],
  'nullable': False},
 'marital': {'type': 'categorical',
  'allowed_values': ['married', 'single', 'divorced', 'unknown'],
  'nullable': False},
 'education': {'type': 'categorical',
  'allowed_values': ['basic.4y',
   'high.school',
   'basic.6y',
   'basic.9y',
   'professional.course',
   'unknown',
   'university.degree',
   'illiterate'],
  'nullable': False},
 'default': {'type': 'categorical',
  'allowed_values': ['no', 'unknown', 'yes'],
  'nullable': False},
 'housing': {'type': 'categorical',
  'allowed_values': ['no', 'yes', 'unknown'],
  'nullable': False},
 'loan': {'type': 'categorical',
  'allowed_values': ['no', 'yes', 'unknown'],
  'nullable': False},
 'contact': {'type': 'categ
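A schema in this dictionary form can be used to check a DataFrame. The following is a minimal sketch of such a check; `validate_schema`, `toy_schema`, and the two small frames are hypothetical names introduced for illustration. Note that because `allowed_values` is derived from the full dataset, the full dataset passes trivially; a check like this is more useful on later splits or on newly arriving data.

```python
import pandas as pd


def validate_schema(frame: pd.DataFrame, schema: dict) -> list:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for column, rules in schema.items():
        if column not in frame.columns:
            problems.append(f"{column}: missing column")
            continue
        values = frame[column]
        # Nullability check
        if not rules.get("nullable", True) and values.isnull().any():
            problems.append(f"{column}: contains nulls")
        # Type-specific checks
        if rules["type"] == "numerical" and not pd.api.types.is_numeric_dtype(values):
            problems.append(f"{column}: expected a numeric dtype")
        if rules["type"] == "categorical":
            unexpected = set(values.dropna().unique()) - set(rules["allowed_values"])
            if unexpected:
                problems.append(f"{column}: unexpected values {sorted(unexpected)}")
    return problems


# A two-column schema in the same shape as dataset_schema
toy_schema = {
    "age": {"type": "numerical", "nullable": False},
    "marital": {
        "type": "categorical",
        "allowed_values": ["married", "single", "divorced", "unknown"],
        "nullable": False,
    },
}

good = pd.DataFrame({"age": [56, 37], "marital": ["married", "single"]})
bad = pd.DataFrame({"age": [56, None], "marital": ["married", "widowed"]})

good_problems = validate_schema(good, toy_schema)
bad_problems = validate_schema(bad, toy_schema)
print(good_problems)
print(bad_problems)
```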

### 2. Convert Dataset to Parquet Format

In [8]:
# Save dataset as a Parquet file in the datasets directory
parquet_file_path = "../datasets/bank-additional-full.parquet"
df.to_parquet(parquet_file_path, engine="pyarrow", index=False)

print(f"Dataset successfully saved as Parquet at: {parquet_file_path}")

Dataset successfully saved as Parquet at: ../datasets/bank-additional-full.parquet


### 3. Verify Parquet File by Reading It

In [9]:
# Load the Parquet file to verify
df_parquet = pd.read_parquet(parquet_file_path)

# Display first few rows
df_parquet.head()

index,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# Step 3: Dataset Profiling

### Objectives:
1. Generate an automatic profiling report.
2. Identify potential data issues (missing values, outliers, distributions).
3. Save the profiling report for future reference.

### 1. Install ydata-profiling

In [10]:
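# Install the profiling library.
# Caution: upgrading NumPy first installs numpy 2.2.3, which conflicts with
# ydata-profiling 4.14.0 (requires numpy<2.2); the second command then
# reinstalls numpy 2.1.3, so the first command can be omitted.
!pip install --upgrade numpy
!pip install --upgrade ydata-profiling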

Collecting numpy
  Using cached numpy-2.2.3-cp312-cp312-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.2.3-cp312-cp312-win_amd64.whl (12.6 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.3
    Uninstalling numpy-2.1.3:
      Successfully uninstalled numpy-2.1.3
Successfully installed numpy-2.2.3


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.61.0 requires numpy<2.2,>=1.24, but you have numpy 2.2.3 which is incompatible.
ydata-profiling 4.14.0 requires numpy<2.2,>=1.16.0, but you have numpy 2.2.3 which is incompatible.

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting numpy<2.2,>=1.16.0 (from ydata-profiling)
  Using cached numpy-2.1.3-cp312-cp312-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.1.3-cp312-cp312-win_amd64.whl (12.6 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.3
    Uninstalling numpy-2.2.3:
      Successfully uninstalled numpy-2.2.3
Successfully installed numpy-2.1.3





### 2. Generate the Profiling Report

In [11]:
# Import profiling tool
from ydata_profiling import ProfileReport

# Generate profiling report
profile = ProfileReport(df, explorative=True)

# Save the report as an HTML file
profiling_report_path = "../datasets/profiling_report.html"
profile.to_file(profiling_report_path)

print(f"Profiling report saved at: {profiling_report_path}")

Profiling report saved at: ../datasets/profiling_report.html


### 3. Display the Report

In [12]:
# Display profiling report inside Jupyter Notebook
profile.to_notebook_iframe()

# Step 4: Train-Test-Production Split

### Objectives:
1. Split the dataset into Training (60%), Test (20%), and Production (20%) sets.
2. Ensure the split is reproducible using a random seed.
3. Save all three splits in Parquet format.

### 1. Import Required Libraries

In [13]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

### 2. Perform Train-Test-Production Split

In [14]:
# Set a random seed for reproducibility
random_seed = 42

# First, split into Train (60%) and Temp (40%) (Test + Production)
train_data, temp_data = train_test_split(df, test_size=0.4, random_state=random_seed, stratify=df['y'])

# Now, split Temp into Test (20%) and Production (20%)
test_data, prod_data = train_test_split(temp_data, test_size=0.5, random_state=random_seed, stratify=temp_data['y'])

# Display split sizes
print(f"Train Data: {train_data.shape}")
print(f"Test Data: {test_data.shape}")
print(f"Production Data: {prod_data.shape}")

Train Data: (24712, 21)
Test Data: (8238, 21)
Production Data: (8238, 21)
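The shapes confirm the 60/20/20 sizes, but not that `stratify` kept the class balance or that the splits are disjoint. The sketch below checks both on a synthetic frame with a 10% positive rate; the frame and the helper `positive_share` are illustrative assumptions. On the real data, the same checks apply to `train_data`, `test_data`, and `prod_data`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10 of them positive
toy = pd.DataFrame({"age": range(100), "y": ["yes"] * 10 + ["no"] * 90})

# Same two-stage split as above: 60% train, then 20% test and 20% production
train_part, temp_part = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["y"]
)
test_part, prod_part = train_test_split(
    temp_part, test_size=0.5, random_state=42, stratify=temp_part["y"]
)


def positive_share(frame: pd.DataFrame, target: str = "y", positive: str = "yes") -> float:
    """Fraction of rows whose target equals the positive label."""
    return float((frame[target] == positive).mean())


splits = {"train": train_part, "test": test_part, "prod": prod_part}
shares = {name: positive_share(part) for name, part in splits.items()}
print(shares)

# train_test_split keeps the original index, so overlap can be checked directly
all_indices = pd.concat(list(splits.values())).index
print(f"Splits are disjoint: {all_indices.is_unique}")
```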


### 3. Save Train-Test-Production Sets as Parquet

In [15]:
# Define file paths
train_path = "../datasets/train.parquet"
test_path = "../datasets/test.parquet"
prod_path = "../datasets/prod.parquet"

# Save to Parquet format
train_data.to_parquet(train_path, engine="pyarrow", index=False)
test_data.to_parquet(test_path, engine="pyarrow", index=False)
prod_data.to_parquet(prod_path, engine="pyarrow", index=False)

print(f"Datasets successfully saved:\nTrain: {train_path}\nTest: {test_path}\nProduction: {prod_path}")


Datasets successfully saved:
Train: ../datasets/train.parquet
Test: ../datasets/test.parquet
Production: ../datasets/prod.parquet


### 4. Verify by Reloading Parquet Files

In [16]:
# Load back the Parquet files to verify
train_loaded = pd.read_parquet(train_path)
test_loaded = pd.read_parquet(test_path)
prod_loaded = pd.read_parquet(prod_path)

# Display the first few rows of Train set
train_loaded.head()

index,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,35,admin.,married,university.degree,unknown,yes,no,telephone,may,wed,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.858,5191.0,no
1,42,unknown,divorced,high.school,no,no,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,45,blue-collar,married,basic.9y,unknown,yes,yes,telephone,may,wed,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.856,5191.0,no
3,27,admin.,single,high.school,no,no,no,cellular,jul,tue,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228.1,no
4,51,admin.,married,university.degree,no,yes,no,cellular,jul,fri,...,5,999,0,nonexistent,-2.9,92.469,-33.6,0.921,5076.2,yes


# Step 5: Data Version Control

### Objectives:
1. Upload all dataset files to GitHub.
2. Ensure proper organization inside the repository.
3. Track changes using Git version control.

In [46]:
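# Create empty .gitkeep files so Git can track otherwise-empty folders.
# The notebook runs from notebooks/, so the project folders are one level up;
# each folder is created first to avoid a FileNotFoundError.
from pathlib import Path

for folder in ("../datasets", "../models", "../scripts"):
    Path(folder).mkdir(parents=True, exist_ok=True)
    (Path(folder) / ".gitkeep").touch()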

### 1. Initialize Git

In [38]:
# Initialize git (if not done already)
!git init

Reinitialized existing Git repository in C:/Users/PC/Documents/ML_Engineering/MLOps_BankMarketing_Project/notebooks/.git/


In [39]:
# Link to your GitHub repo
!git remote remove origin
!git remote add origin https://github.com/riasingh-13/MLOps_BankMarketing_Project.git

### 2. Add All Files

In [41]:
# Add all files
!git add .



### 3. Commit the Changes with a Meaningful Message

In [42]:
!git commit -m "Added all files for MLOps pipeline"

[main 4374281] Added all files for MLOps pipeline
 1 file changed, 69 insertions(+), 40 deletions(-)
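The commit reports a single changed file, and `git init` above reported a repository rooted in the `notebooks/` folder. Files saved under `../datasets/` may therefore lie outside that repository, in which case `git add .` would not stage them. The sketch below shows one way to check whether a path falls under a repository root; `is_inside_repo` and the example paths are hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def is_inside_repo(path, repo_root) -> bool:
    """True when `path` resolves to a location under `repo_root`."""
    return Path(path).resolve().is_relative_to(Path(repo_root).resolve())


# Hypothetical layout mirroring the notebook: repository rooted at project/notebooks
repo_root = Path("project") / "notebooks"
notebook_file = repo_root / "analysis.ipynb"
dataset_file = repo_root / ".." / "datasets" / "train.parquet"

notebook_tracked = is_inside_repo(notebook_file, repo_root)
dataset_tracked = is_inside_repo(dataset_file, repo_root)
print(f"Notebook inside repo: {notebook_tracked}")
print(f"Dataset inside repo:  {dataset_tracked}")
```

If the datasets do turn out to be outside the repository, one option is to initialize Git in the project root instead of in `notebooks/`.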


### 4. Pull Changes from Remote Repository

In [43]:
!git fetch origin
!git pull origin main --rebase

From https://github.com/riasingh-13/MLOps_BankMarketing_Project
 * branch            main       -> FETCH_HEAD
Rebasing (1/3)
Rebasing (2/3)
Rebasing (3/3)

Successfully rebased and updated refs/heads/main.


### 5. Push the Changes to GitHub

In [44]:
!git push origin main

To https://github.com/riasingh-13/MLOps_BankMarketing_Project.git
   14e29ec..59bc37f  main -> main
