# Problem Statement & Objective

## **Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.


## **Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to design and deploy an MLOps pipeline on GitHub that automates the end-to-end workflow for predicting customer purchases. The primary objective is to build a model that predicts, before a customer is contacted, whether that customer will purchase the newly introduced Wellness Tourism Package. The pipeline covers data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, with the aim of consistent performance and scalability. GitHub Actions provides the CI/CD integration, enabling automated updates and streamlined model deployment. The resulting predictive solution is intended to help the marketing team make data-driven decisions, refine its strategies, and target potential customers effectively, thereby driving customer acquisition and business growth.

## **Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:

**Customer Details**
- **CustomerID:** Unique identifier for each customer.
- **ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).
- **Age:** Age of the customer.
- **TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).
- **CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).
- **Occupation:** Customer's occupation (e.g., Salaried, Freelancer).
- **Gender:** Gender of the customer (Male, Female).
- **NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.
- **PreferredPropertyStar:** Preferred hotel rating by the customer.
- **MaritalStatus:** Marital status of the customer (Single, Married, Divorced).
- **NumberOfTrips:** Average number of trips the customer takes annually.
- **Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).
- **OwnCar:** Whether the customer owns a car (0: No, 1: Yes).
- **NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.
- **Designation:** Customer's designation in their current organization.
- **MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**
- **PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.
- **ProductPitched:** The type of product pitched to the customer.
- **NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.
- **DurationOfPitch:** Duration of the sales pitch delivered to the customer.
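
A quick schema check can catch a renamed or missing column before it reaches the pipeline. The sketch below is a minimal, standard-library example; the `EXPECTED_COLUMNS` constant and the `missing_columns` helper are illustrative names, not part of the project code.

```python
from typing import Iterable, List

# Column names taken from the data description above.
EXPECTED_COLUMNS = [
    "CustomerID", "ProdTaken", "Age", "TypeofContact", "CityTier",
    "Occupation", "Gender", "NumberOfPersonVisiting", "PreferredPropertyStar",
    "MaritalStatus", "NumberOfTrips", "Passport", "OwnCar",
    "NumberOfChildrenVisiting", "Designation", "MonthlyIncome",
    "PitchSatisfactionScore", "ProductPitched", "NumberOfFollowups",
    "DurationOfPitch",
]


def missing_columns(actual: Iterable[str]) -> List[str]:
    """Return expected columns absent from `actual`, in documented order."""
    present = set(actual)
    return [col for col in EXPECTED_COLUMNS if col not in present]
```

With a pandas DataFrame, `missing_columns(df.columns)` returns an empty list when every documented column is present.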


# Setup Instructions

### 1. Created Conda Environment

I created and activated a conda environment:

```bash
conda create -n tourism-mlops python=3.10 -y
conda activate tourism-mlops
```

### 2. Installed Requirements

I installed the project dependencies:

```bash
pip install -r requirements.txt
```

### 3. Set Up Hugging Face

#### Installed Hugging Face CLI

I installed the Hugging Face CLI:

```bash
curl -LsSf https://hf.co/cli/install.sh | bash
```

#### Created Hugging Face Account & Token

I completed the following steps:

1. Went to [huggingface.co](https://huggingface.co) and signed in / signed up
2. Clicked my profile → Settings → Access Tokens
3. Created a New token (type: Write) and copied it

#### Logged In from Terminal

I logged in from the terminal (inside the conda environment):

```bash
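# The installer above provides the `hf` command; older huggingface_hub
# releases expose the same step as `huggingface-cli login`.
hf auth login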
```

I pasted my token when prompted.
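
Interactive login does not work in an unattended environment such as a GitHub Actions runner. One common alternative is to pass the token through an environment variable. The sketch below assumes the variable is called `HF_TOKEN` (stored, for example, as a repository secret); `read_hf_token` is a hypothetical helper, not part of this project.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token from the environment, or fail loudly."""
    # Allow a custom mapping to be injected so the helper is easy to test.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; cannot authenticate to the Hub.")
    return token
```

The returned value can then be passed explicitly, for example as `HfApi(token=read_hf_token())`.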

#### Created Dataset Repository on Hugging Face

I created the dataset repository in my browser:

1. Went to [huggingface.co/datasets](https://huggingface.co/datasets)
2. Clicked **New dataset**
3. Named it: `mukherjee78/tourism-wellness-package`
4. Set visibility to **Public**
5. Clicked **Create**

### 4. Created Project Structure

I created the project folder structure:

```bash
mkdir data notebooks src
```

Then I:
1. Created a notebook inside the `notebooks` folder
2. Copied the `tourism.csv` file into the `data` folder

# Data Registration (Hugging Face Datasets)

In [None]:
HF_USERNAME = "mukherjee78"
DATASET_REPO_ID = f"{HF_USERNAME}/tourism-wellness-package"
MODEL_REPO_ID = f"{HF_USERNAME}/tourism-wellness-best-model"
HF_SPACE_REPO_ID = f"{HF_USERNAME}/tourism-wellness-space"

In [None]:
from huggingface_hub import HfApi

api = HfApi()

local_data_path = "../data/tourism.csv"

api.upload_file(
    path_or_fileobj=local_data_path,
    path_in_repo="data/tourism.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset"
)

print("Uploaded tourism.csv to Hugging Face Datasets repo:", DATASET_REPO_ID)

> I created a Hugging Face dataset repository, `mukherjee78/tourism-wellness-package`, and uploaded the raw `tourism.csv` file to it. This satisfies the data registration requirement and lets the rest of the pipeline load data directly from the Hub.

In [None]:
from datasets import load_dataset

dataset = load_dataset(DATASET_REPO_ID, data_files={"full": "data/tourism.csv"})
dataset

In [None]:
import pandas as pd

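df = dataset["full"].to_pandas()

# Only the last expression in a cell is rendered automatically, so print the
# intermediate results explicitly.
print(df.head())
df.info()
df.describe(include="all")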

## Data Preparation

### Load the dataset from Hugging Face

In [None]:
from datasets import load_dataset

dataset = load_dataset(DATASET_REPO_ID, data_files={"full": "data/tourism.csv"})
df = dataset["full"].to_pandas()

df.head()

### Basic data inspection (for explanation + cleaning decisions)

In [None]:
print("Shape:", df.shape)
print("\nColumns:\n", df.columns.tolist())

print("\nInfo:")
df.info()  # info() prints its own summary and returns None

print("\nMissing values per column:")
print(df.isna().sum())

print("\nNumber of duplicate rows:", df.duplicated().sum())

### Data cleaning

In [None]:
TARGET_COL = "ProdTaken"

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_clean = df.copy()

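# Identifier columns carry no predictive signal. Keep only the ones actually
# present so the drop does not raise a KeyError.
cols_to_drop = [c for c in ["CustomerID", "Unnamed: 0"] if c in df_clean.columns]

df_clean = df_clean.drop(columns=cols_to_drop)
print(f"Dropped columns: {cols_to_drop}")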

# Drop duplicates
before = df_clean.shape[0]
df_clean = df_clean.drop_duplicates()
after = df_clean.shape[0]
print(f"Dropped {before - after} duplicate rows")

# Impute missing values
feature_cols = [c for c in df_clean.columns if c != TARGET_COL]
numeric_cols = df_clean[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df_clean[feature_cols].select_dtypes(exclude=[np.number]).columns.tolist()

df_imputed = df_clean.copy()

if numeric_cols:
    num_imputer = SimpleImputer(strategy="median")
    df_imputed[numeric_cols] = num_imputer.fit_transform(df_imputed[numeric_cols])

if categorical_cols:
    cat_imputer = SimpleImputer(strategy="most_frequent")
    df_imputed[categorical_cols] = cat_imputer.fit_transform(df_imputed[categorical_cols])

print("Remaining missing values after imputation:", df_imputed.isna().sum().sum())
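
The cell above fits the imputers on the full dataset, so the medians and modes also reflect rows that later land in the test split. An alternative, shown here only as a sketch and not as the method used in this notebook, is to place the imputers inside the scikit-learn pipeline so they are fitted on training rows only. The toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative values only).
toy_train = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0, 50.0],
    "Occupation": pd.Series(
        ["Salaried", "Salaried", np.nan, "Freelancer"], dtype=object
    ),
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 35.0],
    "Occupation": pd.Series([np.nan, "Freelancer"], dtype=object),
})

leak_free_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        ("cat", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Occupation"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Imputation statistics are learned from the training rows only ...
leak_free_preprocessor.fit(toy_train)
# ... and then reused, unchanged, on unseen rows.
toy_test_transformed = leak_free_preprocessor.transform(toy_test)
print(toy_test_transformed)
```

With this layout the raw (unimputed) data would be split first, and cross-validation would refit the imputers inside each fold.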

### Train–test split and save locally

In [None]:
from sklearn.model_selection import train_test_split

X = df_imputed.drop(columns=[TARGET_COL])
y = df_imputed[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

# Defensive check: identifier columns were dropped during cleaning and must not
# leak into the saved splits. Drop them before assembling the output frames.
leftover_cols = [col for col in cols_to_drop if col in X_train.columns]
if leftover_cols:
    print(f"Warning: Dropping identifier columns that shouldn't be present: {leftover_cols}")
    X_train = X_train.drop(columns=leftover_cols)
    X_test = X_test.drop(columns=leftover_cols)

train_df = X_train.copy()
train_df[TARGET_COL] = y_train

test_df = X_test.copy()
test_df[TARGET_COL] = y_test

train_path = "../data/train.csv"
test_path = "../data/test.csv"

train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print(f"Saved train to {train_path}, shape={train_df.shape}")
print(f"Saved test to {test_path}, shape={test_df.shape}")
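
Passing `stratify=y` keeps the share of buyers roughly equal in both splits, which matters when the positive class is a minority. The sketch below illustrates this on synthetic labels (20% positives, an arbitrary figure chosen for the example, not the real class balance of `tourism.csv`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 20 positives out of 100 (illustrative only).
toy_y = np.array([1] * 20 + [0] * 80)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both rates should match the overall positive rate of the synthetic labels.
print("Positive rate (train):", toy_y_train.mean())
print("Positive rate (test): ", toy_y_test.mean())
```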

In [None]:
pd.read_csv(train_path).head()

In [None]:
pd.read_csv(test_path).head()

### Upload train.csv and test.csv back to the Hugging Face Dataset Repository

In [None]:
from huggingface_hub import HfApi

api = HfApi()

api.upload_file(
    path_or_fileobj=train_path,
    path_in_repo="data/train.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",
)

api.upload_file(
    path_or_fileobj=test_path,
    path_in_repo="data/test.csv",
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",
)

print("Uploaded train.csv and test.csv to HF dataset repo:", DATASET_REPO_ID)

# Modeling & Experiment Tracking

### Load Train/Test From the Hugging Face Dataset Repository

In [None]:
from datasets import load_dataset
import pandas as pd

dataset_splits = load_dataset(
    DATASET_REPO_ID,
    data_files={
        "train": "data/train.csv",
        "test": "data/test.csv"
    }
)

train_df = dataset_splits["train"].to_pandas()
test_df = dataset_splits["test"].to_pandas()

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

train_df.head()

### Prepare Features and Target

In [None]:
X_train = train_df.drop(columns=[TARGET_COL])
y_train = train_df[TARGET_COL]
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)


X_test = test_df.drop(columns=[TARGET_COL])
y_test = test_df[TARGET_COL]
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

### Preprocessing Pipeline (Categorical Encoding + Numeric Passthrough)

In [None]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_cols),
        ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ]
)
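
`handle_unknown='ignore'` matters at prediction time: a category that never appeared in the training data is encoded as all zeros instead of raising an error. A small sketch with made-up occupation values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(np.array([["Salaried"], ["Freelancer"]], dtype=object))

# "Salaried" was seen during fit; "Astronaut" was not.
seen = demo_encoder.transform(np.array([["Salaried"]], dtype=object)).toarray()
unseen = demo_encoder.transform(np.array([["Astronaut"]], dtype=object)).toarray()

print(demo_encoder.categories_)
print("seen:  ", seen)
print("unseen:", unseen)
```

This is why the deployed app can accept a category value it has not seen before without failing, although the prediction for such a row then ignores that feature.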

### RandomForest Model + Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", rf_model),
])

rf_param_dist = {
    "model__n_estimators": randint(100, 400),
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": randint(2, 10),
    "model__min_samples_leaf": randint(1, 5),
    "model__max_features": ["sqrt", "log2"],
}

rf_search = RandomizedSearchCV(
    rf_pipe,
    rf_param_dist,
    n_iter=20,
    scoring="f1",
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)

rf_search.fit(X_train, y_train)
rf_best = rf_search.best_estimator_
rf_best_params = rf_search.best_params_
rf_search.best_score_, rf_best_params
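
Note that `scipy.stats.randint(low, high)` excludes the upper bound, so `randint(100, 400)` samples `n_estimators` from 100 to 399 and `randint(1, 5)` samples `min_samples_leaf` from 1 to 4. A quick check:

```python
from scipy.stats import randint

# Draw many samples and inspect the observed range.
samples = randint(1, 5).rvs(size=2000, random_state=0)
print("min:", samples.min(), "max:", samples.max())
```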

### XGBoost Model + Tuning

In [None]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    n_estimators=300
)

xgb_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", xgb_model),
])

xgb_param_dist = {
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__max_depth": [3, 5, 7],
    "model__subsample": [0.6, 0.8, 1.0],
    "model__colsample_bytree": [0.6, 0.8, 1.0],
}

xgb_search = RandomizedSearchCV(
    xgb_pipe,
    xgb_param_dist,
    n_iter=10,
    scoring="f1",
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)

xgb_search.fit(X_train, y_train)
xgb_best = xgb_search.best_estimator_
xgb_best_params = xgb_search.best_params_
xgb_search.best_score_, xgb_best_params

### Evaluate Both Models on the Test Set

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob)
    }

rf_metrics = evaluate(rf_best, X_test, y_test)
xgb_metrics = evaluate(xgb_best, X_test, y_test)

rf_metrics, xgb_metrics
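
To make the reported metrics concrete, the sketch below computes precision, recall, and F1 from raw confusion-matrix counts. The counts are invented for illustration and are not results from this project.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall, and F1; return 0.0 where a ratio is
    undefined (mirroring zero_division=0 in scikit-learn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 30 buyers caught, 10 false alarms, 20 buyers missed.
p, r, f = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

For a marketing campaign, recall reflects how many likely buyers are reached, while precision reflects how many contacts are wasted; F1 balances the two.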

### Compare Both Models

In [None]:
import pandas as pd

comparison_df = pd.DataFrame([rf_metrics, xgb_metrics], index=["RandomForest", "XGBoost"])
comparison_df

> XGBoost achieved higher recall and F1-score, making it more suitable for identifying potential buyers. Therefore, XGBoost is selected as the final model for deployment.

### Experiment Tracking Using MLflow

In [None]:
import mlflow
import mlflow.sklearn

mlflow.set_experiment("tourism_wellness_modeling")

### Log RandomForest

In [None]:
with mlflow.start_run(run_name="RandomForest_Best"):
    mlflow.log_params(rf_best_params)
    for k, v in rf_metrics.items():
        mlflow.log_metric(k, v)
    mlflow.sklearn.log_model(rf_best, name="rf_model")
    print("✅ Logged to MLflow with run name: RandomForest_Best")

### Log XGBoost

In [None]:
with mlflow.start_run(run_name="XGBoost_Best"):
    mlflow.log_params(xgb_best_params)
    for k, v in xgb_metrics.items():
        mlflow.log_metric(k, v)
    mlflow.sklearn.log_model(xgb_best, name="xgb_model")
    print("✅ Logged to MLflow with run name: XGBoost_Best")

### Select Best Model & Save Locally

In [None]:
best_model_name = "XGBoost" if xgb_metrics["f1"] > rf_metrics["f1"] else "RandomForest"
best_model = xgb_best if best_model_name == "XGBoost" else rf_best

print("Best model selected:", best_model_name)

In [None]:
import os

import joblib

os.makedirs("../models", exist_ok=True)
model_path = "../models/best_model.pkl"
joblib.dump(best_model, model_path)

model_path
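
Before uploading the artifact, it can be worth confirming that the pickled pipeline reloads and predicts identically. The sketch below does this with a tiny stand-in model and a temporary directory; the data and the `LogisticRegression` estimator are placeholders, not the tuned model from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on toy data (illustrative only).
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(toy_model, tmp_path)
    reloaded = joblib.load(tmp_path)

# The reloaded estimator should reproduce the original predictions exactly.
roundtrip_ok = np.array_equal(toy_model.predict(toy_X), reloaded.predict(toy_X))
print("Round trip OK:", roundtrip_ok)
```

Keep in mind that a pickle is tied to the library versions that produced it, so the deployment environment should pin the same scikit-learn and XGBoost versions used for training.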

### Register Best Model on Hugging Face Model Hub

In [None]:
from huggingface_hub import HfApi, create_repo

# Reuse the repo ID defined in the configuration cell to avoid drift.
HF_MODEL_REPO_ID = MODEL_REPO_ID

create_repo(repo_id=HF_MODEL_REPO_ID, repo_type="model", exist_ok=True)

api = HfApi()

api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="best_model.pkl",
    repo_id=HF_MODEL_REPO_ID,
    repo_type="model"
)

print("Model uploaded to:", HF_MODEL_REPO_ID)

# Model Deployment (HF Spaces)

### Dockerfile

In [None]:
# Already created in src/Dockerfile
with open("../src/Dockerfile", "r") as f:
    dockerfile_content = f.read()
    print(dockerfile_content)


### Streamlit App

In [None]:
# Already created in src/app.py
with open("../src/app.py", "r") as f:
    app_content = f.read()
    print(app_content)


### Dependency Handling

In [None]:
# Already created in src/requirements.txt
with open("../src/requirements.txt", "r") as f:
    requirements_content = f.read()
    print(requirements_content)


### Hosting in HF

In [None]:
api = HfApi()

api.create_repo(
    repo_id=HF_SPACE_REPO_ID,
    repo_type="space",
    exist_ok=True,
    space_sdk="docker"
)

# (local path, path inside the Space repository)
files_to_upload = [
    ("../src/Dockerfile", "Dockerfile"),
    ("../src/app.py", "app.py"),
    ("../src/requirements.txt", "requirements.txt"),
]

for local_path, remote_path in files_to_upload:
    print(f"Uploading {local_path} to {HF_SPACE_REPO_ID}:{remote_path}")
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=remote_path,
        repo_id=HF_SPACE_REPO_ID,
        repo_type="space"
    )

print(f"✅ Deployment files pushed to Hugging Face Space: {HF_SPACE_REPO_ID}")

## CI/CD with GitHub Actions

In [None]:
# Already created in .github/workflows/pipeline.yml
with open("../.github/workflows/pipeline.yml", "r") as f:
    pipeline_content = f.read()
    print(pipeline_content)


# Final Results & Links

### GitHub (link to repository, screenshot of folder structure and executed workflow)

[https://github.com/siddmkrj/tourism_mlops_project](https://github.com/siddmkrj/tourism_mlops_project)



### Screenshot: Folder Structure

![Folder](../git_folders.png)


### Screenshot: Workflow

![Workflow](../workflow.png)


### Streamlit on Hugging Face (link to HF space, screenshot of Streamlit app)

[https://huggingface.co/spaces/mukherjee78/tourism-wellness-space](https://huggingface.co/spaces/mukherjee78/tourism-wellness-space)



### Screenshot: Streamlit app

![Streamlit app](../streamlit.png)