# Problem Statement

## **Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.


## **Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower policymakers to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

## **Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:

**Customer Details**
- **CustomerID:** Unique identifier for each customer.
- **ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).
- **Age:** Age of the customer.
- **TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).
- **CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).
- **Occupation:** Customer's occupation (e.g., Salaried, Freelancer).
- **Gender:** Gender of the customer (Male, Female).
- **NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.
- **PreferredPropertyStar:** Preferred hotel rating by the customer.
- **MaritalStatus:** Marital status of the customer (Single, Married, Divorced).
- **NumberOfTrips:** Average number of trips the customer takes annually.
- **Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).
- **OwnCar:** Whether the customer owns a car (0: No, 1: Yes).
- **NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.
- **Designation:** Customer's designation in their current organization.
- **MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**
- **PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.
- **ProductPitched:** The type of product pitched to the customer.
- **NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.-
- **DurationOfPitch:** Duration of the sales pitch delivered to the customer.


# Model Building

## Data Registration

In [67]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [73]:
import os

# Base path
%cd "/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred"
base_path = os.getcwd()
print(base_path)

# Create data + model_building folders
os.makedirs(os.path.join(base_path, "data"), exist_ok=True)
os.makedirs(os.path.join(base_path, "model_building"), exist_ok=True)
os.makedirs(os.path.join(base_path, "deployment"), exist_ok=True)

print("Folders created:")
print(os.path.abspath(os.path.join(base_path, "data")))
print(os.path.abspath(os.path.join(base_path, "model_building")))
print(os.path.abspath(os.path.join(base_path, "deployment")))

/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred
Folders created:
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/data
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/model_building
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment


In [70]:
%%writefile .gitignore
.env

Writing .gitignore


In [74]:
#%%writefile /model_building/data_register.py

from huggingface_hub.utils import RepositoryNotFoundError
from huggingface_hub import HfApi, create_repo
from dotenv import load_dotenv

# Repo Information
repo_id = "Vaddiritz/Tourism-Package-Prediction-rithika"
repo_type = "dataset"

# Hugging Face Token
#os.environ["HF_TOKEN"] = userdata.get("HF_token")
load_dotenv()
hf_token = os.getenv('HF_TOKEN')
api = HfApi()

# Check if dataset repo exists, otherwise create it
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Dataset repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Dataset repo '{repo_id}' not found. Creating new repo...")
    create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Dataset repo '{repo_id}' created.")

# Path to your CSV
base_path = os.getcwd()
print(base_path)
csv_path = os.path.join(base_path,"data/tourism.csv")
print(csv_path)

# Upload the CSV file
api.upload_file(
    path_or_fileobj=csv_path,
    path_in_repo="tourism.csv",
    repo_id=repo_id,
    repo_type=repo_type,
    commit_message="Upload tourism dataset"
)

print("tourism.csv uploaded successfully to Hugging Face Hub!")

Dataset repo 'Vaddiritz/Tourism-Package-Prediction-rithika' already exists. Using it.
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred
/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/data/tourism.csv


No files have been modified since last commit. Skipping to prevent empty commit.


tourism.csv uploaded successfully to Hugging Face Hub!


## Data Preparation

In [13]:
%%writefile /model_building/prep.py

# for data manipulation
import pandas as pd
import os
# for data preprocessing and pipeline creation
from sklearn.model_selection import train_test_split
# for converting text data into numerical representation
from sklearn.preprocessing import LabelEncoder
# for hugging face hub API
from huggingface_hub import HfApi, create_repo
from huggingface_hub.utils import RepositoryNotFoundError
from google.colab import userdata
from huggingface_hub import hf_hub_download


# Load Hugging Face Token
os.environ["HF_TOKEN"] = userdata.get("HF_token")
api = HfApi()

# Download dataset file from HF repo
local_path = hf_hub_download(
    repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
    repo_type="dataset",
    filename="tourism.csv",
    token=os.environ["HF_TOKEN"]
)

# Load into pandas
df = pd.read_csv(local_path)
print("Dataset loaded. Shape:", df.shape)


# Basic Cleaning

# Drop unique identifier column
if "CustomerID" in df.columns:
    df.drop(columns=["CustomerID"], inplace=True)
    print("Removed CustomerID column.")

# Drop index-like column if present
if "Unnamed: 0" in df.columns:
  df = df.drop(columns=["Unnamed: 0"])
  print("Dropped 'Unnamed: 0'")

# Fix typos/inconsistent categories
if "Gender" in df.columns:
  df["Gender"] = df["Gender"].replace({"Fe Male": "Female"})

# Handle missing values
for col in df.columns:
    if df[col].dtype in ["int64", "float64"]:
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical columns
categorical_cols = df.select_dtypes(include=["object"]).columns
encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = encoder.fit_transform(df[col])
    print(f"Encoded {col}")


# Split into features and target
target_col = "ProdTaken"
X = df.drop(columns=[target_col])
y = df[target_col]

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train/test split done:", Xtrain.shape, Xtest.shape)

# Save locally
Xtrain.to_csv("Xtrain.csv", index=False)
Xtest.to_csv("Xtest.csv", index=False)
ytrain.to_csv("ytrain.csv", index=False)
ytest.to_csv("ytest.csv", index=False)

# Upload back to Hugging Face
files = ["Xtrain.csv", "Xtest.csv", "ytrain.csv", "ytest.csv"]

for file_path in files:
    api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=os.path.basename(file_path),
        repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
        repo_type="dataset",
        token=os.environ["HF_TOKEN"]
    )

print("Data prep finished and uploaded to HF.")

tourism.csv: 0.00B [00:00, ?B/s]

Dataset loaded. Shape: (4128, 21)
Removed CustomerID column.
Dropped 'Unnamed: 0'
Encoded TypeofContact
Encoded Occupation
Encoded Gender
Encoded ProductPitched
Encoded MaritalStatus
Encoded Designation
Train/test split done: (3302, 18) (826, 18)
Data prep finished and uploaded to HF.


# Model Training and Registration with Experimentation Tracking

### Development Environment (with MLflow + Ngrok)

In [14]:
!pip install mlflow==3.0.1 pyngrok==7.2.12 -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m77.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m53.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m247.4/247.4 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.8/147.8 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.9/114.9 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.0/85.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m703.4/703.4 kB[0m [31m45.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m203.4/203.4 kB[0m [31m15.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [17]:
from pyngrok import ngrok
import subprocess
import mlflow
import time

# Set your ngrok auth token
ngrok.set_auth_token("31zVM0HGj8PMzdV3KphcOoa7dYV_4RCNYG2mhVxj74CBrbCNT")

# Start MLflow UI on port 5000
process = subprocess.Popen(["mlflow", "ui", "--port", "5000"])

# Add a small delay to allow MLflow UI to start
time.sleep(5) # Adjust the sleep time if needed

# Create public tunnel
public_url = ngrok.connect(5000).public_url
print("MLflow UI is available at:", public_url)

# Point MLflow to tracking server
mlflow.set_tracking_uri(public_url)
mlflow.set_experiment("Tourism_Package_Experiment")

MLflow UI is available at: https://3a4dc6765b05.ngrok-free.app


<Experiment: artifact_location='mlflow-artifacts:/381147829795522362', creation_time=1756527127372, experiment_id='381147829795522362', last_update_time=1756527127372, lifecycle_stage='active', name='Tourism_Package_Experiment', tags={}>

### Model Training & Experiment Tracking

In [18]:
import pandas as pd
import os
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb
import joblib
import mlflow
from huggingface_hub import hf_hub_download

# Download prepared datasets from Hugging Face
Xtrain_path = hf_hub_download(
    repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
    repo_type="dataset",
    filename="Xtrain.csv"
)
Xtest_path = hf_hub_download(
    repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
    repo_type="dataset",
    filename="Xtest.csv"
)
ytrain_path = hf_hub_download(
    repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
    repo_type="dataset",
    filename="ytrain.csv"
)
ytest_path = hf_hub_download(
    repo_id="Vaddiritz/Tourism-Package-Prediction-rithika",
    repo_type="dataset",
    filename="ytest.csv"
)

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path).values.ravel()
ytest = pd.read_csv(ytest_path).values.ravel()

print("Tourism dataset loaded for training.")

# Feature groups
numeric_features = Xtrain.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = Xtrain.select_dtypes(include=["object"]).columns.tolist()

# Handle class imbalance
class_weight = ytrain.tolist().count(0) / ytrain.tolist().count(1)

# Preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features)
)

# Base model
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weight, random_state=42, use_label_encoder=False, eval_metric="logloss")

# Hyperparameter space
param_dist = {
    'xgbclassifier__n_estimators': [50, 100, 200, 300],
    'xgbclassifier__max_depth': [3, 4, 5, 6, 7],
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'xgbclassifier__colsample_bytree': [0.3, 0.5, 0.7, 1.0],
    'xgbclassifier__subsample': [0.6, 0.8, 1.0]
}

# Pipeline
model_pipeline = make_pipeline(preprocessor, xgb_model)

# MLflow experiment
with mlflow.start_run():
    random_search = RandomizedSearchCV(
        model_pipeline,
        param_distributions=param_dist,
        n_iter=10,   # number of random combinations to try
        cv=5,
        n_jobs=-1,
        random_state=42
    )
    random_search.fit(Xtrain, ytrain)

    # Log best parameters
    mlflow.log_params(random_search.best_params_)

    # Evaluate model
    best_model = random_search.best_estimator_
    y_pred_train = best_model.predict(Xtrain)
    y_pred_test = best_model.predict(Xtest)

    train_report = classification_report(ytrain, y_pred_train, output_dict=True)
    test_report = classification_report(ytest, y_pred_test, output_dict=True)

    mlflow.log_metrics({
        "train_accuracy": train_report['accuracy'],
        "train_precision": train_report['1']['precision'],
        "train_recall": train_report['1']['recall'],
        "train_f1-score": train_report['1']['f1-score'],
        "test_accuracy": test_report['accuracy'],
        "test_precision": test_report['1']['precision'],
        "test_recall": test_report['1']['recall'],
        "test_f1-score": test_report['1']['f1-score']
    })

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xtrain.csv: 0.00B [00:00, ?B/s]

Xtest.csv: 0.00B [00:00, ?B/s]

ytrain.csv: 0.00B [00:00, ?B/s]

ytest.csv: 0.00B [00:00, ?B/s]

Tourism dataset loaded for training.


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


🏃 View run amusing-horse-3 at: https://3a4dc6765b05.ngrok-free.app/#/experiments/381147829795522362/runs/a1302ce9b8b844e49b8c8288e22d7f64
🧪 View experiment at: https://3a4dc6765b05.ngrok-free.app/#/experiments/381147829795522362


### Experimentation and Tracking (Production Environment)

In [22]:
%%writefile /model_building/train.py

# for data manipulation
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# for model training, tuning, and evaluation
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from sklearn.metrics import classification_report
# for model serialization
import joblib
# for hugging face space authentication to upload files
from huggingface_hub import HfApi, create_repo
from huggingface_hub.utils import RepositoryNotFoundError
import mlflow

# MLflow tracking
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("Tourism_Package_Experiment")

api = HfApi()

# Dataset paths from Hugging Face
Xtrain_path = "hf://datasets/Vaddiritz/Tourism-Package-Prediction-rithika/Xtrain.csv"
Xtest_path = "hf://datasets/Vaddiritz/Tourism-Package-Prediction-rithika/Xtest.csv"
ytrain_path = "hf://datasets/Vaddiritz/Tourism-Package-Prediction-rithika/ytrain.csv"
ytest_path = "hf://datasets/Vaddiritz/Tourism-Package-Prediction-rithika/ytest.csv"

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path).values.ravel()
ytest = pd.read_csv(ytest_path).values.ravel()

print("Tourism dataset loaded successfully.")

# Feature groups
numeric_features = Xtrain.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = Xtrain.select_dtypes(include=["object"]).columns.tolist()

# Handle class imbalance
class_weight = ytrain.tolist().count(0) / ytrain.tolist().count(1)

# Preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features)
)

# Base model
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weight, random_state=42)

# Hyperparameter distributions (broader for Random Search)
param_distributions = {
    'xgbclassifier__n_estimators': [50, 100, 150, 200, 300],
    'xgbclassifier__max_depth': [3, 4, 5, 6, 8, 10],
    'xgbclassifier__colsample_bytree': np.linspace(0.3, 1.0, 8),
    'xgbclassifier__learning_rate': np.linspace(0.01, 0.3, 10),
    'xgbclassifier__reg_lambda': np.linspace(0.1, 2.0, 10),
}

# Model pipeline
model_pipeline = make_pipeline(preprocessor, xgb_model)

# Start MLflow run
with mlflow.start_run():
    random_search = RandomizedSearchCV(
        model_pipeline,
        param_distributions=param_distributions,
        n_iter=20,# number of random combinations to try
        cv=5,
        n_jobs=-1,
        random_state=42,
    )
    random_search.fit(Xtrain, ytrain)

    # Log all param sets & scores
    results = random_search.cv_results_
    for i in range(len(results['params'])):
        param_set = results['params'][i]
        mean_score = results['mean_test_score'][i]
        std_score = results['std_test_score'][i]

        with mlflow.start_run(nested=True):
            mlflow.log_params(param_set)
            mlflow.log_metric("mean_test_score", mean_score)
            mlflow.log_metric("std_test_score", std_score)

    # Log best params
    mlflow.log_params(random_search.best_params_)

    # Best model
    best_model = random_search.best_estimator_

    y_pred_train = best_model.predict(Xtrain)
    y_pred_test = best_model.predict(Xtest)

    train_report = classification_report(ytrain, y_pred_train, output_dict=True)
    test_report = classification_report(ytest, y_pred_test, output_dict=True)

    # Log metrics
    mlflow.log_metrics({
        "train_accuracy": train_report['accuracy'],
        "train_precision": train_report['1']['precision'],
        "train_recall": train_report['1']['recall'],
        "train_f1-score": train_report['1']['f1-score'],
        "test_accuracy": test_report['accuracy'],
        "test_precision": test_report['1']['precision'],
        "test_recall": test_report['1']['recall'],
        "test_f1-score": test_report['1']['f1-score']
    })

    # Save model
    model_path = "best_tourism_model_v1.joblib"
    joblib.dump(best_model, model_path)
    mlflow.log_artifact(model_path, artifact_path="model")
    print(f"Model saved at: {model_path}")

    # Upload to Hugging Face Hub
    repo_id = "Vaddiritz/Tourism-Package-Prediction-rithika"
    repo_type = "model"

    try:
        api.repo_info(repo_id=repo_id, repo_type=repo_type)
        print(f"Repo '{repo_id}' already exists.")
    except RepositoryNotFoundError:
        print(f"Repo '{repo_id}' not found. Creating new repo...")
        create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
        print(f"Repo '{repo_id}' created.")

    api.upload_file(
        path_or_fileobj=model_path,
        path_in_repo=model_path,
        repo_id=repo_id,
        repo_type=repo_type,
    )
    print(f"Model uploaded to Hugging Face Hub: {repo_id}")

Tourism dataset loaded successfully.
🏃 View run adorable-roo-959 at: http://localhost:5000/#/experiments/381147829795522362/runs/346aa6c3eb654d709079512da2229c27
🧪 View experiment at: http://localhost:5000/#/experiments/381147829795522362
🏃 View run languid-yak-143 at: http://localhost:5000/#/experiments/381147829795522362/runs/08b2e1581db443f7b39d6c1f7d9790cd
🧪 View experiment at: http://localhost:5000/#/experiments/381147829795522362
🏃 View run puzzled-rat-804 at: http://localhost:5000/#/experiments/381147829795522362/runs/69fa2dea7932426ba198d167a0b7bf9d
🧪 View experiment at: http://localhost:5000/#/experiments/381147829795522362
🏃 View run suave-carp-26 at: http://localhost:5000/#/experiments/381147829795522362/runs/888c4144984b4c36b35e150748d8e575
🧪 View experiment at: http://localhost:5000/#/experiments/381147829795522362
🏃 View run orderly-grouse-447 at: http://localhost:5000/#/experiments/381147829795522362/runs/81263115c9d74cfc91c38ebfe61bda42
🧪 View experiment at: http://loca

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  best_tourism_model_v1.joblib          : 100%|##########|  380kB /  380kB            

Model uploaded to Hugging Face Hub: Vaddiritz/Tourism-Package-Prediction-rithika
🏃 View run crawling-shoat-614 at: http://localhost:5000/#/experiments/381147829795522362/runs/782fd5d223524e78875c9621c08a6328
🧪 View experiment at: http://localhost:5000/#/experiments/381147829795522362


# Deployment

## Dockerfile

In [29]:
%%writefile /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/Dockerfile

# Use a minimal base image with Python 3.9 installed
FROM python:3.9

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Create non-root user
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
	PATH=/home/user/.local/bin:$PATH

WORKDIR $HOME/app
COPY --chown=user . $HOME/app

# Run the Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]


Overwriting /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/Dockerfile


## Streamlit App

In [39]:
%%writefile /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/app.py

import streamlit as st
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
from sklearn.preprocessing import LabelEncoder

# Download and load model from Hugging Face
model_path = hf_hub_download(repo_id="Vaddiritz/Tourism-Package-Prediction-rithika", filename="best_tourism_model_v1.joblib")
model = joblib.load(model_path)

# Streamlit UI
st.title("Tourism Package Recommendation App")
st.write("""
This application predicts whether a customer is likely to purchase a **tourism package**
based on their profile and preferences.
Fill in the details below to get a prediction.
""")

# Customer Details
age = st.number_input("Age", min_value=18, max_value=100, value=30)
typeofcontact = st.selectbox("Type of Contact", ["Company Invited", "Self Inquiry"])
citytier = st.selectbox("City Tier", [1, 2, 3])
occupation = st.selectbox("Occupation", ["Salaried", "Freelancer"])
gender = st.selectbox("Gender", ["Male", "Female"])
numberofpersonvisiting = st.number_input("Number Of Person Visiting", min_value=1, max_value=10, value=1)
preferredpropertystar = st.selectbox("Preferred Property Star", [1, 2, 3, 4, 5])
maritalstatus = st.selectbox("Marital Status", ["Single", "Married", "Divorced"])
numberoftrips = st.number_input("Number Of Trips", min_value=0, max_value=20, value=1)
passport = st.selectbox("Passport", [0, 1])
owncar = st.selectbox("Own Car", [0, 1])
numberofchildrenvisiting = st.number_input("Number Of Children Visiting", min_value=0, max_value=10, value=0)
designation = st.selectbox("Designation", ["Manager", "Executive", "Senior Manager", "AVP"])
monthlyincome = st.number_input("Monthly Income", min_value=1000, value=50000)
pitchsatisfactionscore = st.slider("Pitch Satisfaction Score", 1, 5, 3)
productpitched = st.selectbox("Product Pitched", ["Basic", "Deluxe", "Super Deluxe", "King", "Standard"])
numberoffollowups = st.number_input("Number Of Followups", min_value=0, max_value=20, value=2)
durationofpitch = st.number_input("Duration Of Pitch (minutes)", min_value=0, max_value=60, value=10)

# --- Create input dataframe ---
input_data = pd.DataFrame([[age, typeofcontact, citytier, occupation, gender,numberofpersonvisiting, preferredpropertystar,
                            maritalstatus,numberoftrips, passport, owncar, numberofchildrenvisiting, designation,
                            monthlyincome, pitchsatisfactionscore, productpitched,numberoffollowups, durationofpitch]],
                          columns=["Age", "TypeofContact", "CityTier", "Occupation", "Gender",
                                   "NumberOfPersonVisiting", "PreferredPropertyStar", "MaritalStatus",
                                   "NumberOfTrips", "Passport", "OwnCar", "NumberOfChildrenVisiting",
                                   "Designation", "MonthlyIncome", "PitchSatisfactionScore", "ProductPitched",
                                   "NumberOfFollowups", "DurationOfPitch"])


# Display input summary
st.subheader("Entered Details:")
st.write(input_data)

if st.button("Predict Package Purchase"):
    prediction = model.predict(input_data)[0]
    result = "Likely to Purchase Package" if prediction == 1 else "Unlikely to Purchase"
    st.subheader("Prediction Result:")
    st.success(result)

Overwriting /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/app.py


## Dependencies (requirements.txt)

In [31]:
%%writefile /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/requirements.txt
pandas==2.2.2
huggingface_hub==0.32.6
streamlit==1.43.2
joblib==1.5.1
scikit-learn==1.6.0
xgboost==2.1.4
mlflow==3.0.1

Writing /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/requirements.txt


## Hosting Script (push_to_hf.py)

In [40]:
#%%writefile /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/push_to_hf.py
import os
from huggingface_hub import HfApi, create_repo, upload_file
from huggingface_hub.utils import RepositoryNotFoundError

# Hugging Face repo details
repo_id = "Vaddiritz/Tourism-Package-Prediction-rithika"
repo_type = "space"

api = HfApi()

# Check if repo exists, else create it
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f" Repo '{repo_id}' already exists.")
except RepositoryNotFoundError:
    print(f" Creating new Space '{repo_id}'...")
    create_repo(repo_id=repo_id, repo_type=repo_type, space_sdk="streamlit")
    print(f" Repo '{repo_id}' created.")

# Upload deployment files
files_to_upload = ["/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/Dockerfile",
                   "/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/app.py",
                   "/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/requirements.txt"]

for file in files_to_upload:
    upload_file(
        path_or_fileobj=file,
        path_in_repo=os.path.basename(file),
        repo_id=repo_id,
        repo_type=repo_type
    )
    print(f" Uploaded {file} to {repo_id}")


No files have been modified since last commit. Skipping to prevent empty commit.


 Repo 'Vaddiritz/Tourism-Package-Prediction-rithika' already exists.
 Uploaded /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/Dockerfile to Vaddiritz/Tourism-Package-Prediction-rithika


No files have been modified since last commit. Skipping to prevent empty commit.


 Uploaded /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/app.py to Vaddiritz/Tourism-Package-Prediction-rithika
 Uploaded /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred/deployment/requirements.txt to Vaddiritz/Tourism-Package-Prediction-rithika


# MLOps Pipeline with Github Actions Workflow

```
name: MLOps Pipeline - Tourism Package Prediction

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  workflow_dispatch:   # allows manual trigger

jobs:
  build-and-train:
    runs-on: ubuntu-latest

    steps:
    # Step 1: Checkout repo
    - name: Checkout repository
      uses: actions/checkout@v3

    # Step 2: Setup Python
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'

    # Step 3: Install dependencies
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r deployment/requirements.txt

    # Step 4: Run training
    - name: Run training pipeline
      run: |
        python training/train.py

    # Step 5: Run evaluation
    - name: Run evaluation
      run: |
        python training/evaluate.py

    # Step 6: Save trained model as artifact
    - name: Upload model artifact
      uses: actions/upload-artifact@v3
      with:
        name: model-artifact
        path: models/

  deploy-to-hf:
    needs: build-and-train
    runs-on: ubuntu-latest

    steps:
    # Step 1: Checkout repo
    - name: Checkout repository
      uses: actions/checkout@v3

    # Step 2: Setup Python
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'

    # Step 3: Install Hugging Face Hub
    - name: Install Hugging Face Hub
      run: pip install huggingface_hub

    # Step 4: Push deployment files to HF Space
    - name: Deploy to Hugging Face
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        python deployment/push_to_hf.py

```



In [65]:
# 1. Install Git
!apt-get -qq install git

# 2. Configure Git identity
!git config --global user.email "vaddi.rithika@gmail.com"
!git config --global user.name "vaddiparthirithika"

# 3. Clone your repo
!git clone https://github.com/vaddiparthirithika/Tourism_Package_Pred_MLOps_Rithika.git
%cd TourismPackagePrediction/

# 4. Copy your project files into the repo
!cp -r /content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred


# 5. Add your GitHub token securely
import getpass, os
os.environ["GITHUB_TOKEN"] = getpass.getpass("Enter GitHub token: ")

# 6. Push changes to GitHub
!git add .
!git commit -m "Updated project files from Colab"

!git push https://vaddiparthirithika:${GITHUB_TOKEN}@github.com/vaddiparthirithika/Tourism_Package_Pred_MLOps_Rithika.git main

fatal: destination path 'Tourism_Package_Pred_MLOps_Rithika' already exists and is not an empty directory.
[Errno 2] No such file or directory: 'TourismPackagePrediction/'
/content/Tourism-Package-Prediction-rithika
cp: missing destination file operand after '/content/drive/MyDrive/Colab_Notebooks/MLOps_TourismPackagePred'
Try 'cp --help' for more information.
Enter GitHub token: ··········
On branch main
nothing to commit, working tree clean
To https://github.com/vaddiparthirithika/Tourism_Package_Pred_MLOps_Rithika.git
 [31m! [rejected]       [m main -> main (fetch first)
[31merror: failed to push some refs to 'https://github.com/vaddiparthirithika/Tourism_Package_Pred_MLOps_Rithika.git'
[m[33mhint: Updates were rejected because the remote contains work that you do[m
[33mhint: not have locally. This is usually caused by another repository pushing[m
[33mhint: to the same ref. You may want to first integrate the remote changes[m
[33mhint: (e.g., 'git pull ...') before pushin