# End-to-End Travel Behavior Analysis Pipeline

This notebook implements a pipeline to predict tourist spending categories using SVM and PyTorch Neural Networks.

### Key Steps:
1. **Data Cleaning**: Handling missing values and specific string formatting issues.
2. **Ordinal Encoding**: Preserving the rank order of features like Age and Trip Duration.
3. **Model 1**: Support Vector Machine (Scikit-Learn).
4. **Model 2**: Deep Neural Network (PyTorch with GPU).
5. **Submission**: Generating the final CSV.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# PyTorch Imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## 1. Load and Explore Data

In [10]:
# Load datasets (Assuming files are in the current directory)
try:
    train_df = pd.read_csv('travel/train.csv')
    test_df = pd.read_csv('travel/test.csv')
    print("Datasets loaded successfully.")
except FileNotFoundError:
    print("Error: train.csv or test.csv not found. Please upload the data.")

# Save trip_ids for submission later
test_ids = test_df['trip_id']

# Display basic info
print(f"Train Shape: {train_df.shape}")
print(f"Test Shape: {test_df.shape}")
train_df.head()

Datasets loaded successfully.
Train Shape: (12654, 25)
Test Shape: (5852, 24)


Unnamed: 0,trip_id,country,age_group,travel_companions,num_females,num_males,main_activity,visit_purpose,is_first_visit,mainland_stay_nights,...,food_included,domestic_transport_included,sightseeing_included,guide_included,insurance_included,days_booked_before_trip,arrival_weather,total_trip_days,has_special_requirements,spend_category
0,tour_idftaa27vp,FRANCE,45-64,With Spouse and Children,1.0,2.0,Beach Tourism,Leisure and Holidays,Yes,0,...,No,No,No,No,No,,"cloudy,",30+,,1.0
1,tour_iduck75m57,KENYA,45-64,Alone,1.0,0.0,Conference Tourism,Meetings and Conference,Yes,6,...,No,No,No,No,No,15-30,"sunny,",30+,,2.0
2,tour_id8y3w40h8,SOUTH AFRICA,25-44,With Other Friends/Relatives,2.0,0.0,Cultural Tourism,Meetings and Conference,No,4,...,No,No,No,No,No,90+,"sunny,",30+,none,2.0
3,tour_idkoh8mkgr,ITALY,25-44,With Spouse,1.0,1.0,Widlife Tourism,Leisure and Holidays,Yes,0,...,Yes,Yes,Yes,Yes,No,8-14,,,none,0.0
4,tour_idkmsfa00a,ITALY,25-44,With Spouse,1.0,1.0,Beach Tourism,Leisure and Holidays,Yes,0,...,Yes,No,No,No,No,90+,"sunny,",7-14,,0.0


## 2. Preprocessing & Feature Engineering

This is the most critical step. We will define explicit mappings for Ordinal features to ensure the model understands that '45-64' is "greater" than '25-44'.

In [11]:
def preprocess_data(df, is_train=True):
    df = df.copy()
    
    # --- 1. Clean String Columns ---
    # Sometimes data has trailing commas or spaces based on the image provided
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col] = df[col].str.strip().str.replace(',', '', regex=False)

    # --- 2. Ordinal Encoding Mappings ---
    
    # Age Group
    age_map = {
        '<18': 0, '15-24': 1, '25-44': 2, '45-64': 3, '65+': 4
    }
    # Map unlisted values to a default or mode if necessary, here we stick to knowns
    df['age_group'] = df['age_group'].map(age_map).fillna(2) # Fill NaN with mode (approx)

    # Total Trip Days
    trip_days_map = {
        '< 24 hours': 0, '1-3': 1, '4-6': 2, '7-14': 3, 
        '15-30': 4, '31-60': 5, '61-90': 6, '90+': 7
    }
    df['total_trip_days'] = df['total_trip_days'].map(trip_days_map).fillna(3)

    # Days Booked Before Trip
    booking_map = {
        '0-7': 0, '8-14': 1, '15-30': 2, '31-60': 3, '61-90': 4, '90+': 5
    }
    df['days_booked_before_trip'] = df['days_booked_before_trip'].map(booking_map).fillna(2)

    # Binary Yes/No Columns
    binary_cols = [
        'is_first_visit', 'intl_transport_included', 'accomodation_included', 
        'food_included', 'domestic_transport_included', 'sightseeing_included',
        'guide_included', 'insurance_included'
    ]
    
    for col in binary_cols:
        if col in df.columns:
            df[col] = df[col].map({'Yes': 1, 'No': 0}).fillna(0)

    # --- 3. Numerical Feature Selection ---
    num_cols = ['num_females', 'num_males', 'mainland_stay_nights', 'island_stay_nights']
    
    # Fill NaNs in numerical columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
        
    # --- 4. Categorical (One-Hot) ---
    # These are nominal, no intrinsic order
    nominal_cols = [
        'country', 'travel_companions', 'main_activity', 'visit_purpose', 
        'tour_type', 'info_source', 'arrival_weather', 'has_special_requirements'
    ]
    
    # Return processed parts to be assembled by a pipeline/column transformer
    return df, num_cols, nominal_cols

# Apply Initial Cleaning / Ordinal Mapping
train_df_clean, num_features, nom_features = preprocess_data(train_df)
test_df_clean, _, _ = preprocess_data(test_df, is_train=False)

# Drop ID column for training
X = train_df_clean.drop(columns=['trip_id', 'spend_category'])
y = train_df_clean['spend_category']
X_test_final = test_df_clean.drop(columns=['trip_id'])

print("Ordinal encoding complete. Preparing for OneHot and Scaling...")

Ordinal encoding complete. Preparing for OneHot and Scaling...


### Encoding Pipeline
We use `ColumnTransformer` to One-Hot encode nominal features and Scale numerical features. Ordinal features are already integers, but we should scale them too so they don't dominate the gradients.

In [12]:
# Identify columns that are already ordinal encoded (integers now)
ordinal_cols = ['age_group', 'total_trip_days', 'days_booked_before_trip']
binary_cols = ['is_first_visit', 'intl_transport_included', 'accomodation_included', 
               'food_included', 'domestic_transport_included', 'sightseeing_included',
               'guide_included', 'insurance_included']

# Preprocessing Pipeline
# 1. OneHot for Nominal
# 2. Standard Scaler for Numerical + Ordinal
# 3. Passthrough for Binary (already 0/1)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features + ordinal_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), nom_features),
        ('bin', 'passthrough', binary_cols)
    ])

# --- REPLACEMENT CODE ---

# 1. Filter out rows with missing 'spend_category' in the training set
print(f"Original Train Shape: {train_df_clean.shape}")
train_df_clean = train_df_clean.dropna(subset=['spend_category'])
print(f"Shape after dropping missing targets: {train_df_clean.shape}")

# 2. Ensure target is an integer (standard for PyTorch/Sklearn classification)
train_df_clean['spend_category'] = train_df_clean['spend_category'].astype(int)

# 3. Define X and y again with the clean data
X = train_df_clean.drop(columns=['trip_id', 'spend_category'])
y = train_df_clean['spend_category']

# 4. Process the features
# Note: Re-run your ColumnTransformer (preprocessor) on this new X
X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(X_test_final)

print(f"Processed Feature Matrix Shape: {X_processed.shape}")

# 5. Split Data (Now this will work because y has no NaNs)
X_train, X_val, y_train, y_val = train_test_split(
    X_processed, 
    y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print("Split successful!")

Original Train Shape: (12654, 25)
Shape after dropping missing targets: (12620, 25)
Processed Feature Matrix Shape: (12620, 187)
Split successful!


## 3. Model 1: Support Vector Machine (SVM)
SVM is effective for high-dimensional spaces created by One-Hot Encoding.

In [13]:
# Initialize SVM
# 'rbf' is generally the best default kernel for non-linear relationships
# class_weight='balanced' helps if spending categories are unequal
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', class_weight='balanced', random_state=42)

print("Training SVM...")
svm_model.fit(X_train, y_train)

print("Evaluating SVM...")
y_pred_svm = svm_model.predict(X_val)
print(classification_report(y_val, y_pred_svm))
print(f"SVM Accuracy: {accuracy_score(y_val, y_pred_svm):.4f}")

Training SVM...
Evaluating SVM...
              precision    recall  f1-score   support

           0       0.84      0.83      0.83      1249
           1       0.68      0.60      0.64       982
           2       0.48      0.69      0.57       293

    accuracy                           0.72      2524
   macro avg       0.67      0.71      0.68      2524
weighted avg       0.74      0.72      0.73      2524

SVM Accuracy: 0.7246


## 4. Model 2: PyTorch Neural Network
Using CUDA for acceleration. We will define a custom Dataset and a flexible MLP architecture.

In [14]:
# --- Dataset Class ---
class TravelDataset(Dataset):
    def __init__(self, X, y=None):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y.values, dtype=torch.long) if y is not None else None

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.y is not None:
            return self.X[idx], self.y[idx]
        return self.X[idx]

# --- Data Loaders ---
train_dataset = TravelDataset(X_train, y_train)
val_dataset = TravelDataset(X_val, y_val)

BATCH_SIZE = 64
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# --- Neural Network Architecture ---
class TravelNet(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(TravelNet, self).__init__()
        self.layer1 = nn.Linear(input_dim, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.layer2 = nn.Linear(256, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.layer3 = nn.Linear(128, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.output = nn.Linear(64, output_dim)
        
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.relu(self.bn1(self.layer1(x)))
        x = self.dropout(x)
        x = self.relu(self.bn2(self.layer2(x)))
        x = self.dropout(x)
        x = self.relu(self.bn3(self.layer3(x)))
        x = self.output(x)
        return x

# Initialize Model
input_dim = X_train.shape[1]
output_dim = 3 # Low, Medium, High

model = TravelNet(input_dim, output_dim).to(device)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(model)

TravelNet(
  (layer1): Linear(in_features=187, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layer2): Linear(in_features=256, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layer3): Linear(in_features=128, out_features=64, bias=True)
  (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (output): Linear(in_features=64, out_features=3, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.3, inplace=False)
)


In [15]:
# --- Training Loop ---
EPOCHS = 50
train_losses = []
val_accuracies = []

for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    # Validation phase
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            outputs = model(X_batch)
            _, predicted = torch.max(outputs.data, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    
    epoch_acc = 100 * correct / total
    train_losses.append(running_loss/len(train_loader))
    val_accuracies.append(epoch_acc)
    
    if (epoch+1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{EPOCHS}], Loss: {running_loss/len(train_loader):.4f}, Val Acc: {epoch_acc:.2f}%")

print("Training Complete.")

# # Plot training progress
# plt.figure(figsize=(10, 4))
# plt.subplot(1, 2, 1)
# plt.plot(train_losses, label='Loss')
# plt.title('Training Loss')
# plt.subplot(1, 2, 2)
# plt.plot(val_accuracies, label='Accuracy', color='orange')
# plt.title('Validation Accuracy')
# plt.show()

Epoch [5/50], Loss: 0.5725, Val Acc: 76.66%
Epoch [10/50], Loss: 0.5249, Val Acc: 76.31%
Epoch [15/50], Loss: 0.4884, Val Acc: 75.55%
Epoch [20/50], Loss: 0.4644, Val Acc: 75.55%
Epoch [25/50], Loss: 0.4344, Val Acc: 74.21%
Epoch [30/50], Loss: 0.4109, Val Acc: 73.77%
Epoch [35/50], Loss: 0.3879, Val Acc: 74.25%
Epoch [40/50], Loss: 0.3705, Val Acc: 74.17%
Epoch [45/50], Loss: 0.3556, Val Acc: 73.61%
Epoch [50/50], Loss: 0.3473, Val Acc: 73.42%
Training Complete.


## 5. Generate Submission File
We will use the PyTorch model for final predictions as it usually generalizes better on complex categorical data if tuned.

In [16]:
# Prepare Test Loader
test_dataset = TravelDataset(X_test_processed)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

model.eval()
all_preds = []

with torch.no_grad():
    for X_batch in test_loader:
        X_batch = X_batch.to(device)
        outputs = model(X_batch)
        _, predicted = torch.max(outputs.data, 1)
        all_preds.extend(predicted.cpu().numpy())

# Create Submission DataFrame
submission = pd.DataFrame({
    'trip_id': test_ids,
    'spend_category': all_preds
})

# Verify format
print(submission.head())

# Save
submission.to_csv('travel_sub/submission_1.csv', index=False)
print("submission.csv created successfully!")

           trip_id  spend_category
0  tour_id8gzpck76               1
1  tour_idow1zxkou               0
2  tour_idue7esfqz               0
3  tour_idnj3mjzpb               0
4  tour_ida3us5yk2               0
submission.csv created successfully!
