# Problem Statement & Objective

## **Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.


## **Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower policymakers to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

## **Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:

**Customer Details**
- **CustomerID:** Unique identifier for each customer.
- **ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).
- **Age:** Age of the customer.
- **TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).
- **CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).
- **Occupation:** Customer's occupation (e.g., Salaried, Freelancer).
- **Gender:** Gender of the customer (Male, Female).
- **NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.
- **PreferredPropertyStar:** Preferred hotel rating by the customer.
- **MaritalStatus:** Marital status of the customer (Single, Married, Divorced).
- **NumberOfTrips:** Average number of trips the customer takes annually.
- **Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).
- **OwnCar:** Whether the customer owns a car (0: No, 1: Yes).
- **NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.
- **Designation:** Customer's designation in their current organization.
- **MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**
- **PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.
- **ProductPitched:** The type of product pitched to the customer.
- **NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.-
- **DurationOfPitch:** Duration of the sales pitch delivered to the customer.


# Setup Instructions

### 1. Created Conda Environment

I created and activated a conda environment:

```bash
conda create -n tourism-mlops python=3.10 -y
conda activate tourism-mlops
```

### 2. Installed Requirements

I installed the project dependencies:

```bash
pip install -r requirements.txt
```

### 3. Set Up Hugging Face

#### Installed Hugging Face CLI

I installed the Hugging Face CLI:

```bash
curl -LsSf https://hf.co/cli/install.sh | bash
```

#### Created Hugging Face Account & Token

I completed the following steps:

1. Went to [huggingface.co](https://huggingface.co) and signed in / signed up
2. Clicked my profile → Settings → Access Tokens
3. Created a New token (type: Write) and copied it

#### Logged In from Terminal

I logged in from the terminal (inside the conda environment):

```bash
huggingface-cli login
```

I pasted my token when prompted.

#### Created Dataset Repository on Hugging Face

I created the dataset repository in my browser:

1. Went to [huggingface.co/datasets](https://huggingface.co/datasets)
2. Clicked **New dataset**
3. Named it: `mukherjee78/tourism-wellness-package`
4. Set visibility to **Public**
5. Clicked **Create**

### 4. Created Project Structure

I created the project folder structure:

```bash
mkdir data notebooks src
```

Then I:
1. Created a notebook inside the `notebooks` folder
2. Copied the `tourism.csv` file into the `data` folder

# Data Registration (Hugging Face Datasets)

In [1]:
HF_USERNAME = "mukherjee78"
DATASET_REPO_ID = f"{HF_USERNAME}/tourism-wellness-package"
MODEL_REPO_ID = f"{HF_USERNAME}/tourism-wellness-model"

In [2]:
from huggingface_hub import HfApi
import os

api = HfApi()

local_data_path = "../data/tourism.csv"

# Upload file to HF dataset repo
api.upload_file(
    path_or_fileobj=local_data_path,
    path_in_repo="data/tourism.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset"
)

print("Uploaded tourism.csv to Hugging Face Datasets repo:", DATASET_REPO_ID)

No files have been modified since last commit. Skipping to prevent empty commit.


Uploaded tourism.csv to Hugging Face Datasets repo: mukherjee78/tourism-wellness-package


> We created a Hugging Face Dataset repository mukherjee78/tourism-wellness-package and uploaded the raw tourism.csv file to it. This satisfies the data registration requirement and allows the rest of the pipeline to load data directly from the data hub.

In [3]:
from datasets import load_dataset

dataset = load_dataset(DATASET_REPO_ID, data_files={"full": "data/tourism.csv"})
dataset

Generating full split: 0 examples [00:00, ? examples/s]

DatasetDict({
    full: Dataset({
        features: ['Unnamed: 0', 'CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
        num_rows: 4128
    })
})

In [4]:
import pandas as pd

df = dataset["full"].to_pandas()
df.head()
df.info()
df.describe(include="all")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4128 entries, 0 to 4127
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                4128 non-null   int64  
 1   CustomerID                4128 non-null   int64  
 2   ProdTaken                 4128 non-null   int64  
 3   Age                       4128 non-null   float64
 4   TypeofContact             4128 non-null   object 
 5   CityTier                  4128 non-null   int64  
 6   DurationOfPitch           4128 non-null   float64
 7   Occupation                4128 non-null   object 
 8   Gender                    4128 non-null   object 
 9   NumberOfPersonVisiting    4128 non-null   int64  
 10  NumberOfFollowups         4128 non-null   float64
 11  ProductPitched            4128 non-null   object 
 12  PreferredPropertyStar     4128 non-null   float64
 13  MaritalStatus             4128 non-null   object 
 14  NumberOf

Unnamed: 0.1,Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,...,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
count,4128.0,4128.0,4128.0,4128.0,4128,4128.0,4128.0,4128,4128,4128.0,...,4128,4128.0,4128,4128.0,4128.0,4128.0,4128.0,4128.0,4128,4128.0
unique,,,,,2,,,4,3,,...,5,,4,,,,,,5,
top,,,,,Self Enquiry,,,Salaried,Male,,...,Basic,,Married,,,,,,Executive,
freq,,,,,2918,,,1999,2463,,...,1615,,1990,,,,,,1615,
mean,2527.763808,202527.763808,0.193072,37.231831,,1.663275,15.584787,,,2.94937,...,,3.578488,,3.2953,0.2953,3.060804,0.612161,1.223595,,23178.464147
std,1409.439133,1409.439133,0.394757,9.174521,,0.92064,8.398142,,,0.718818,...,,0.795031,,1.8563,0.456233,1.363064,0.487317,0.852685,,4506.614622
min,0.0,200000.0,0.0,18.0,,1.0,5.0,,,1.0,...,,3.0,,1.0,0.0,1.0,0.0,0.0,,1000.0
25%,1320.75,201320.75,0.0,31.0,,1.0,9.0,,,2.0,...,,3.0,,2.0,0.0,2.0,0.0,1.0,,20751.0
50%,2603.5,202603.5,0.0,36.0,,1.0,14.0,,,3.0,...,,3.0,,3.0,0.0,3.0,1.0,1.0,,22418.0
75%,3748.25,203748.25,0.0,43.0,,3.0,20.0,,,3.0,...,,4.0,,4.0,1.0,4.0,1.0,2.0,,25301.0


## Data Preparation

### Load the dataset from Hugging Face

In [5]:
from datasets import load_dataset

dataset = load_dataset(DATASET_REPO_ID, data_files={"full": "data/tourism.csv"})
df = dataset["full"].to_pandas()

df.head()

Unnamed: 0.1,Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,...,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,...,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,...,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,...,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,...,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,5,200005,0,32.0,Company Invited,1,8.0,Salaried,Male,3,...,Basic,3.0,Single,1.0,0,5,1,1.0,Executive,18068.0


### Basic data inspection (for explanation + cleaning decisions)

In [6]:
print("Shape:", df.shape)
print("\nColumns:\n", df.columns.tolist())

print("\nInfo:")
print(df.info())

print("\nMissing values per column:")
print(df.isna().sum())

print("\nNumber of duplicate rows:", df.duplicated().sum())

Shape: (4128, 21)

Columns:
 ['Unnamed: 0', 'CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome']

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4128 entries, 0 to 4127
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                4128 non-null   int64  
 1   CustomerID                4128 non-null   int64  
 2   ProdTaken                 4128 non-null   int64  
 3   Age                       4128 non-null   float64
 4   TypeofContact             4128 non-null   object 
 5   CityTier                  4128 non-null   int64  
 6   DurationOfPitch           4128 non-null   float64
 7   Occu

### Data cleaning

In [7]:
TARGET_COL = "ProdTaken"

In [8]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_clean = df.copy()

cols_to_drop = ["CustomerID", "Unnamed: 0"]

df_clean = df_clean.drop(columns=cols_to_drop)
print(f"Dropped columns: {cols_to_drop}")

# Drop duplicates
before = df_clean.shape[0]
df_clean = df_clean.drop_duplicates()
after = df_clean.shape[0]
print(f"Dropped {before - after} duplicate rows")

# Impute missing values
feature_cols = [c for c in df_clean.columns if c != TARGET_COL]
numeric_cols = df_clean[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df_clean[feature_cols].select_dtypes(exclude=[np.number]).columns.tolist()

df_imputed = df_clean.copy()

if numeric_cols:
    num_imputer = SimpleImputer(strategy="median")
    df_imputed[numeric_cols] = num_imputer.fit_transform(df_imputed[numeric_cols])

if categorical_cols:
    cat_imputer = SimpleImputer(strategy="most_frequent")
    df_imputed[categorical_cols] = cat_imputer.fit_transform(df_imputed[categorical_cols])

print("Remaining missing values after imputation:", df_imputed.isna().sum().sum())

Dropped columns: ['CustomerID', 'Unnamed: 0']
Dropped 117 duplicate rows
Remaining missing values after imputation: 0


### Train–test split and save locally

In [9]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

train_df = X_train.copy()
train_df[TARGET_COL] = y_train

test_df = X_test.copy()
test_df[TARGET_COL] = y_test

train_path = "../data/train.csv" 
test_path  = "../data/test.csv"

train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print(f"Saved train to {train_path}, shape={train_df.shape}")
print(f"Saved test to {test_path}, shape={test_df.shape}")

Train shape: (3302, 20) (3302,)
Test shape: (826, 20) (826,)
Saved train to ../data/train.csv, shape=(3302, 21)
Saved test to ../data/test.csv, shape=(826, 21)


In [10]:
pd.read_csv(train_path).head()
pd.read_csv(test_path).head()

Unnamed: 0.1,Unnamed: 0,CustomerID,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,...,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome,ProdTaken
0,2273,202273,34.0,Company Invited,1,9.0,Salaried,Male,2,4.0,...,3.0,Married,4.0,0,1,0,0.0,Executive,17979.0,0
1,73,200073,32.0,Self Enquiry,1,6.0,Salaried,Male,3,3.0,...,4.0,Divorced,2.0,0,3,0,0.0,Manager,21220.0,0
2,167,200167,30.0,Self Enquiry,3,11.0,Salaried,Female,2,3.0,...,3.0,Divorced,3.0,0,4,1,1.0,Senior Manager,24419.0,0
3,4725,204725,39.0,Self Enquiry,3,9.0,Small Business,Male,3,4.0,...,4.0,Unmarried,2.0,0,4,1,2.0,Senior Manager,26029.0,0
4,4219,204219,37.0,Company Invited,1,31.0,Salaried,Female,3,4.0,...,4.0,Married,2.0,0,3,1,2.0,Manager,24352.0,0


### Upload train.csv and test.csv back to Hugging Face Dataset Space

In [11]:
from huggingface_hub import HfApi

api = HfApi()

api.upload_file(
    path_or_fileobj=train_path,
    path_in_repo="data/train.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",
)

api.upload_file(
    path_or_fileobj=test_path,
    path_in_repo="data/test.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",
)

print("Uploaded train.csv and test.csv to HF dataset repo:", DATASET_REPO_ID)

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


Uploaded train.csv and test.csv to HF dataset repo: mukherjee78/tourism-wellness-package


# Modeling & Experiment Tracking

### Load Train/Test From Hugging Face Dataset Space

In [12]:
from datasets import load_dataset
import pandas as pd

dataset_splits = load_dataset(
    DATASET_REPO_ID,
    data_files={
        "train": "data/train.csv",
        "test": "data/test.csv"
    }
)

train_df = dataset_splits["train"].to_pandas()
test_df = dataset_splits["test"].to_pandas()

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

train_df.head()

Train shape: (3302, 21)
Test shape: (826, 21)


Unnamed: 0.1,Unnamed: 0,CustomerID,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,...,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome,ProdTaken
0,3850,203850,55.0,Self Enquiry,1,17.0,Small Business,Female,4,4.0,...,5.0,Unmarried,8.0,1,1,0,1.0,Manager,23118.0,0
1,2463,202463,39.0,Self Enquiry,1,9.0,Salaried,Male,3,4.0,...,3.0,Unmarried,7.0,1,4,0,2.0,Executive,22622.0,0
2,878,200878,42.0,Company Invited,2,8.0,Small Business,Male,3,1.0,...,5.0,Divorced,1.0,0,2,0,2.0,Manager,21272.0,0
3,2482,202482,37.0,Self Enquiry,1,12.0,Salaried,Female,3,5.0,...,5.0,Divorced,2.0,1,2,1,1.0,Executive,98678.0,0
4,3074,203074,23.0,Self Enquiry,1,7.0,Salaried,Male,3,5.0,...,3.0,Divorced,8.0,0,2,1,1.0,Manager,23453.0,0


### Prepare Features and Target

In [13]:
X_train = train_df.drop(columns=[TARGET_COL])
y_train = train_df[TARGET_COL]
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)


X_test = test_df.drop(columns=[TARGET_COL])
y_test = test_df[TARGET_COL]
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (3302, 20)
y_train shape: (3302,)
X_test shape: (826, 20)
y_test shape: (826,)


### Preprocessing Pipeline (Categorical Encoding + Numeric Passthrough)

In [14]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_cols),
        ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ]
)

Numeric columns: ['Unnamed: 0', 'CustomerID', 'Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'MonthlyIncome']
Categorical columns: ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']


### RandomForest Model + Tuning

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", rf_model),
])

rf_param_dist = {
    "model__n_estimators": randint(100, 400),
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": randint(2, 10),
    "model__min_samples_leaf": randint(1, 5),
    "model__max_features": ["sqrt", "log2"],
}

rf_search = RandomizedSearchCV(
    rf_pipe,
    rf_param_dist,
    n_iter=20,
    scoring="f1",
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)

rf_search.fit(X_train, y_train)
rf_best = rf_search.best_estimator_
rf_best_params = rf_search.best_params_
rf_search.best_score_, rf_best_params

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=3, model__min_samples_split=4, model__n_estimators=187; total time=   0.6s
[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=8, model__n_estimators=206; total time=   0.6s
[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=8, model__n_estimators=206; total time=   0.6s
[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=3, model__min_samples_split=4, model__n_estimators=187; total time=   0.6s
[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=8, model__n_estimators=221; total time=   0.7s
[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=8, model__n_estimators=221; total time=   0.7s
[CV] END 

(np.float64(0.5376645817527071),
 {'model__max_depth': 20,
  'model__max_features': 'sqrt',
  'model__min_samples_leaf': 1,
  'model__min_samples_split': 8,
  'model__n_estimators': 221})

### XGBoost Model + Tuning

In [16]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    n_estimators=300
)

xgb_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", xgb_model),
])

xgb_param_dist = {
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__max_depth": [3, 5, 7],
    "model__subsample": [0.6, 0.8, 1.0],
    "model__colsample_bytree": [0.6, 0.8, 1.0],
}

xgb_search = RandomizedSearchCV(
    xgb_pipe,
    xgb_param_dist,
    n_iter=10,
    scoring="f1",
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)

xgb_search.fit(X_train, y_train)
xgb_best = xgb_search.best_estimator_
xgb_best_params = xgb_search.best_params_
xgb_search.best_score_, xgb_best_params

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END model__colsample_bytree=0.6, model__learning_rate=0.01, model__max_depth=3, model__subsample=0.6; total time=   0.1s
[CV] END model__colsample_bytree=0.6, model__learning_rate=0.01, model__max_depth=3, model__subsample=0.6; total time=   0.1s
[CV] END model__colsample_bytree=0.6, model__learning_rate=0.01, model__max_depth=3, model__subsample=0.6; total time=   0.1s
[CV] END model__colsample_bytree=0.8, model__learning_rate=0.01, model__max_depth=5, model__subsample=0.6; total time=   0.2s
[CV] END model__colsample_bytree=0.8, model__learning_rate=0.01, model__max_depth=5, model__subsample=0.6; total time=   0.2s
[CV] END model__colsample_bytree=0.6, model__learning_rate=0.1, model__max_depth=5, model__subsample=0.8; total time=   0.1s
[CV] END model__colsample_bytree=0.8, model__learning_rate=0.01, model__max_depth=5, model__subsample=0.6; total time=   0.2s
[CV] END model__colsample_bytree=0.6, model__learning_rate

(np.float64(0.6698570953805251),
 {'model__subsample': 0.8,
  'model__max_depth': 7,
  'model__learning_rate': 0.05,
  'model__colsample_bytree': 1.0})

### Evaluate Both Models on the Test Set

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob)
    }

rf_metrics = evaluate(rf_best, X_test, y_test)
xgb_metrics = evaluate(xgb_best, X_test, y_test)

rf_metrics, xgb_metrics

({'accuracy': 0.87409200968523,
  'precision': 0.8666666666666667,
  'recall': 0.4088050314465409,
  'f1': 0.5555555555555556,
  'roc_auc': 0.928903472791906},
 {'accuracy': 0.9249394673123487,
  'precision': 0.9145299145299145,
  'recall': 0.6729559748427673,
  'f1': 0.7753623188405797,
  'roc_auc': 0.9552487906989902})

### Compare Both Models

In [18]:
import pandas as pd

comparison_df = pd.DataFrame([rf_metrics, xgb_metrics], index=["RandomForest", "XGBoost"])
comparison_df

Unnamed: 0,accuracy,precision,recall,f1,roc_auc
RandomForest,0.874092,0.866667,0.408805,0.555556,0.928903
XGBoost,0.924939,0.91453,0.672956,0.775362,0.955249


> XGBoost achieved higher recall and F1-score, making it more suitable for identifying potential buyers. Therefore, XGBoost is selected as the final model for deployment.

### Experiment Tracking Using MLflow

In [25]:
import mlflow
import mlflow.sklearn

mlflow.set_experiment("tourism_wellness_modeling")

<Experiment: artifact_location='/Users/siddhartha/Projects/gl/tourism_mlops_project/notebooks/mlruns/1', creation_time=1765109391712, experiment_id='1', last_update_time=1765109391712, lifecycle_stage='active', name='tourism_wellness_modeling', tags={}>

### Log RandomForest

In [27]:
with mlflow.start_run(run_name="RandomForest_Best"):
    mlflow.log_params(rf_best_params)
    for k, v in rf_metrics.items():
        mlflow.log_metric(k, v)
    mlflow.sklearn.log_model(rf_best, name="rf_model")
    print("✅ Logged to MLflow with run name: RandomForest_Best")

✅ Logged to MLflow with run name: RandomForest_Best


### Log XGBoost

In [28]:
with mlflow.start_run(run_name="XGBoost_Best"):
    mlflow.log_params(xgb_best_params)
    for k, v in xgb_metrics.items():
        mlflow.log_metric(k, v)
    mlflow.sklearn.log_model(xgb_best, name="xgb_model")
    print("✅ Logged to MLflow with run name: XGBoost_Best")

✅ Logged to MLflow with run name: XGBoost_Best


### Select Best Model & Save Locally

In [29]:
best_model_name = "XGBoost" if xgb_metrics["f1"] > rf_metrics["f1"] else "RandomForest"
best_model = xgb_best if best_model_name == "XGBoost" else rf_best

print("Best model selected:", best_model_name)

Best model selected: XGBoost


In [30]:
import joblib, os

os.makedirs("../models", exist_ok=True)
model_path = f"../models/best_model.pkl"
joblib.dump(best_model, model_path)

model_path

'../models/best_model.pkl'

### Register Best Model on Hugging Face Model Hub

In [31]:
from huggingface_hub import HfApi, create_repo

HF_MODEL_REPO_ID = f"{HF_USERNAME}/tourism-wellness-best-model"

create_repo(repo_id=HF_MODEL_REPO_ID, repo_type="model", exist_ok=True)

api = HfApi()

api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="best_model.pkl",
    repo_id=HF_MODEL_REPO_ID,
    repo_type="model"
)

print("Model uploaded to:", HF_MODEL_REPO_ID)

Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            

Model uploaded to: mukherjee78/tourism-wellness-best-model


# Model Deployment (HF Spaces)

In [32]:
import os
DEPLOY_DIR = "../src"

### Added requirements.txt inside deployment folder

In [34]:
file_path = f"{DEPLOY_DIR}/requirements.txt"

if os.path.exists(file_path):
    print(f"File {file_path} exists")
    with open(file_path, "r") as f:
        print(f.read())
else:
    print(f"File {file_path} does not exist")


File ../src/requirements.txt exists
streamlit
pandas
numpy
scikit-learn
xgboost
joblib
huggingface_hub
dill


### Added app.py inside deployment folder

In [35]:
file_path = f"{DEPLOY_DIR}/app.py"

if os.path.exists(file_path):
    print(f"File {file_path} exists")
    with open(file_path, "r") as f:
        print(f.read())
else:
    print(f"File {file_path} does not exist")


File ../src/app.py exists
import streamlit as st
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download

st.set_page_config(page_title="Wellness Tourism Package Predictor")

HF_USERNAME = "mukherjee78"
HF_MODEL_REPO = f"{HF_USERNAME}/tourism-wellness-best-model"
MODEL_FILENAME = "best_model.pkl"

@st.cache_resource
def load_model():
    model_path = hf_hub_download(
        repo_id=HF_MODEL_REPO,
        filename=MODEL_FILENAME,
        repo_type="model"
    )
    model = joblib.load(model_path)
    return model

model = load_model()

st.title("🧘 Wellness Tourism Package - Purchase Prediction")
st.write("Enter customer details to predict whether they will purchase the package.")

# -------------------------
# USER INPUT FORM
# -------------------------
with st.form("input_form"):
    Age = st.number_input("Age", min_value=18, max_value=100, value=35)
    TypeofContact = st.selectbox("Type of Contact", ["Company Invited", "Self Inquiry"])
    CityTier = st.selectb

### Created Dockerfile

In [36]:
file_path = f"{DEPLOY_DIR}/Dockerfile"

if os.path.exists(file_path):
    print(f"File {file_path} exists")
    with open(file_path, "r") as f:
        print(f.read())
else:
    print(f"File {file_path} does not exist")

File ../src/Dockerfile exists
FROM python:3.10-slim

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 7860

CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]


### Created HF Space Deployment Script (deploy_to_space.py)

In [37]:
file_path = f"{DEPLOY_DIR}/deploy_to_space.py"

if os.path.exists(file_path):
    print(f"File {file_path} exists")
    with open(file_path, "r") as f:
        print(f.read())
else:
    print(f"File {file_path} does not exist")

File ../src/deploy_to_space.py exists
import os
from huggingface_hub import HfApi

def main():
    space_repo_id = os.getenv("HF_SPACE_REPO_ID", "mukherjee78/tourism-wellness-space")
    token = os.getenv("HF_TOKEN")

    ABSOLUTE_PATH = os.path.dirname(os.path.abspath(__file__))
    print(ABSOLUTE_PATH)

    api = HfApi(token=token)

    api.create_repo(
        repo_id=space_repo_id,
        repo_type="space",
        exist_ok=True,
        space_sdk="docker"
    )

    files_to_upload = [
        (f"{ABSOLUTE_PATH}/Dockerfile", "Dockerfile"),
        (f"{ABSOLUTE_PATH}/app.py", "app.py"),
        (f"{ABSOLUTE_PATH}/requirements.txt", "requirements.txt"),
    ]

    for local_path, remote_path in files_to_upload:
        if not os.path.exists(local_path):
            continue

        print(f"Uploading {local_path} to {space_repo_id}:{remote_path}")
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=remote_path,
            repo_id=space_repo_id

In [38]:
!python ../src/deploy_to_space.py

/Users/siddhartha/Projects/gl/tourism_mlops_project/src
Uploading /Users/siddhartha/Projects/gl/tourism_mlops_project/src/Dockerfile to mukherjee78/tourism-wellness-space:Dockerfile
No files have been modified since last commit. Skipping to prevent empty commit.
Uploading /Users/siddhartha/Projects/gl/tourism_mlops_project/src/app.py to mukherjee78/tourism-wellness-space:app.py
No files have been modified since last commit. Skipping to prevent empty commit.
Uploading /Users/siddhartha/Projects/gl/tourism_mlops_project/src/requirements.txt to mukherjee78/tourism-wellness-space:requirements.txt
No files have been modified since last commit. Skipping to prevent empty commit.
✅ Deployment files pushed to Hugging Face Space: mukherjee78/tourism-wellness-space


## CI/CD with GitHub Actions

In [None]:
import os

pipeline_path = "../.github/workflows/pipeline.yml"

if os.path.exists(pipeline_path):
    print(f"File {pipeline_path} exists\n")
    with open(pipeline_path, "r") as f:
        print(f.read())
else:
    print(f"File {pipeline_path} does not exist")

# Final Results & Links