<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/2026%20869%20Project%20Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 869 Project: Example Notebook

*Updated May 1, 2025*

This notebook serves as a template for the Team Project. Teams can use this notebook as a starting point, and update it successively with new ideas and techniques to improve their model results.

Note that it is not required to use this template. Teams may also alter this template in any way they see fit.

# Preliminaries: Inspect and Set up environment

No action is required on your part in this section. These cells print out helpful information about the environment, just in case.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, classification_report

In [None]:
import datetime
print(datetime.datetime.now())

In [None]:
!python --version

In [None]:
# TODO: if you need to install any package, do so here. For example:
#pip install unidecode

# 0: Data Loading and Inspection

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/stepthom/869_course/refs/heads/main/data/spaceship_titanic_train.csv")

In [None]:
df.info()

In [None]:
# Let's print some descriptive statistics for all the numeric features.

df.describe().T

In [None]:
# What is the number of unique values in all the categorical features? And what is
# the value with the highest frequency?

df.describe(include=object).T

In [None]:
# How much missing data is in each feature?

df.isna().sum()

In [None]:
# For convenience, let's save the names of all numeric features to a list,
# and the names of all categorical features to another list.

numeric_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

categorical_features = ['HomePlanet', 'VIP', 'CryoSleep', 'Destination', 'Cabin', 'Name']
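
As an alternative to typing the lists by hand, the column names can be derived from the dtypes. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and hypothetical variable names; it assumes, as `df.info()` would show, that columns mixing booleans and missing values are stored with a non-numeric dtype.

```python
import pandas as pd

# Toy stand-in for the real data; all values here are made up.
toy = pd.DataFrame({
    "Age": [39.0, None],
    "Spa": [0.0, 549.0],
    "HomePlanet": ["Earth", None],
    "VIP": [False, None],          # bool + missing value -> non-numeric dtype
    "Transported": [True, False],  # the label, a pure bool column
})

# Numeric columns: ints and floats (pure bool columns are not included).
numeric_auto = toy.select_dtypes(include="number").columns.tolist()

# Everything that is neither numeric nor pure bool is treated as categorical.
categorical_auto = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_auto)
print(categorical_auto)
```

Hand-written lists remain useful when a column's dtype does not match how it should be treated (for example, an ID stored as a number).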

In [None]:
# TODO: Can add more EDA here, as desired

# 1: Pipeline 1: Simple Feature Engineering and then Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split

In [None]:
# Scikit-learn needs us to put the features in one dataframe, and the label in another.
# It's tradition to name these variables X and y, but it doesn't really matter.

X = df.drop(['PassengerId', 'Transported'], axis=1)
y = df['Transported']

## 1.1: Cleaning and FE

In [None]:
# We know this dataset has categorical features, and we also know that
# scikit-learn's DTs don't accept categorical (string) features. For now, we'll
# just remove (i.e., drop) these features.
#
# TODO: do something better, like encode them (as discussed in the course)

X = X.drop(categorical_features, axis=1, errors='ignore')

In [None]:
# We know this dataset has some missing data, and many scikit-learn models
# (including DTs in older scikit-learn versions) don't allow missing data. For
# now, we'll just do simple imputation (SimpleImputer's default: the column mean).
#
# TODO: consider doing something different/better, like another imputation
# strategy (as discussed in class)

imp = SimpleImputer()
imp.fit(X)
X = imp.transform(X)
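
To illustrate how the imputation strategy changes the filled-in values, here is a small sketch comparing mean and median imputation on a made-up array with an outlier. The array and variable names are assumptions for illustration; `add_indicator=True` additionally appends one 0/1 column per feature that had missing values during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with missing values and one outlier (100.0) in the first column.
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 30.0],
])

mean_imp = SimpleImputer(strategy="mean")
median_imp = SimpleImputer(strategy="median", add_indicator=True)

toy_mean = mean_imp.fit_transform(toy)      # NaN in column 0 -> mean of 1, 3, 100
toy_median = median_imp.fit_transform(toy)  # NaN in column 0 -> median of 1, 3, 100

print(toy_mean)
print(toy_median)
```

The median is less sensitive to the outlier, which may matter for skewed spending columns such as `RoomService` or `Spa`.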

In [None]:
# TODO: Add more data cleaning and FE, as desired.

## 1.2: Model creation, hyperparameter tuning, and validation

In [None]:
# Let's create a very simple DecisionTree.

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# TODO: Can try different algorithms
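
A rough way to try different algorithms is to score several candidates with the same cross-validation setup. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the candidate list, settings, and names are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate with the same 5-fold CV and the same metric.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1_macro")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Using the same folds and metric for every candidate keeps the comparison fair.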

In [None]:
# We use cross_validate to perform K-fold cross validation for us.
cv_results = cross_validate(clf, X, y, cv=5, scoring="f1_macro")

# TODO: can also add hyperparameter tuning to explore different values of the algorithm's
# hyperparameters, and see how much those affect results.
# See GridSearchCV or RandomizedSearchCV.
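
As a minimal sketch of the hyperparameter tuning mentioned above, `GridSearchCV` can cross-validate every combination in a small grid. The data here is synthetic and the grid values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical grid: 4 depths x 3 leaf sizes = 12 combinations.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_toy, y_toy)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # its mean cross-validated f1_macro
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` using the best combination.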

In [None]:
# Now that cross validation has completed, we can see what it estimates the performance
# of our model to be.

display(cv_results)
print("The mean CV score is:")
print(np.mean(cv_results['test_score']))


Once we are happy with the estimated performance of our model, we can move on to the final step.

First, we train our model one last time, using all available training data (unlike CV, which always uses a subset). This final training will give our model the best chance at the highest performance.

Then, we must load in the (unlabeled) competition data from the cloud and use our model to generate predictions for each instance in that data. We will then output those predictions to a CSV file and upload it to the competition.

In [None]:
# Our model's "final form"

clf = clf.fit(X, y)

In [None]:
X_comp = pd.read_csv("https://raw.githubusercontent.com/stepthom/869_course/refs/heads/main/data/spaceship_titanic_test.csv")

# Will need to save these IDs for later
passengerIDs = X_comp["PassengerId"]

# Importantly, we need to perform the same cleaning/transformation steps
# on this competition data as we did on the training data. Otherwise, we will
# get an error and/or unexpected results.

X_comp = X_comp.drop(['PassengerId'], axis=1, errors='ignore')
X_comp = X_comp.drop(categorical_features, axis=1, errors='ignore')

X_comp = imp.transform(X_comp)

# Use your model to make predictions
pred_comp = clf.predict(X_comp)

# Create a simple dataframe with two columns: the passenger ID (just the same as the test data) and our predictions
my_submission = pd.DataFrame({
    'PassengerId': passengerIDs,
    'Transported': pred_comp})

# Let's take a peek at the results (as a sanity check)
display(my_submission.head(10))

# You could use any filename.
my_submission.to_csv('submission.csv', index=False)

# You can now download the 'submission.csv' from Colab/Kaggle (see menu on the left or right)
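
One way to make sure the competition data receives exactly the same transformations as the training data is to bundle the steps in a scikit-learn `Pipeline`. This is a sketch on made-up frames, not the notebook's actual workflow; the column values, step names, and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training features/labels and toy "competition" features (values made up).
train = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0, 16.0, 44.0],
    "Spa": [0.0, 549.0, np.nan, 3329.0, 565.0, 0.0],
})
labels = pd.Series([False, True, False, False, True, True], name="Transported")
comp = pd.DataFrame({"Age": [27.0, np.nan], "Spa": [np.nan, 2823.0]})

# The imputer is fit on the training data only, then reused at predict time.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
pipe.fit(train, labels)

# predict() applies the fitted imputer to comp before calling the tree.
pred = pipe.predict(comp)
print(pred)
```

A pipeline can also be passed directly to `cross_validate`, in which case the imputer is refit inside each fold rather than on the full training set.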