# Supervised Learning Flow – Titanic Survival Prediction  

**Students:** Yarin S. (ID: 8635), Shahaf L. (ID: 8284)

### Introduction
In this assignment we worked on a complete supervised learning flow using the Titanic dataset.
Our goal was to build a model that predicts whether a passenger survived (1) or not (0).

### Tools and Assistance Used
While preparing this assignment we reviewed some online resources.
We also used ChatGPT as a study aid: it helped us understand how cross-validation works,
how to evaluate models with the F1 score, and why it is worth comparing algorithms such as Logistic Regression and Decision Tree.
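
As a small illustration of the two ideas mentioned above, the sketch below computes the F1 score by hand and compares it with scikit-learn's `f1_score`, then shows how 5-fold splitting assigns every sample to a validation fold exactly once. The label lists and the ten-sample index array are made up for illustration and are not Titanic data.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Hypothetical labels, for illustration only (not Titanic data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count true positives, false positives and false negatives
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print("Manual F1: ", f1_manual)
print("sklearn F1:", f1_score(y_true, y_pred))

# 5-fold cross-validation: each sample lands in a validation fold once
indices = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_seen = []
for train_idx, val_idx in kf.split(indices):
    print("train:", train_idx, "validation:", val_idx)
    val_seen.extend(val_idx.tolist())
```

F1 is the harmonic mean of precision and recall, so it only gets high when both are high, which is why it is a reasonable metric when the classes are not balanced.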


## 1. Data Loading and Initial Exploration

In [None]:

import pandas as pd

# Load train and test data (relative paths for portability)
train_df = pd.read_csv("titanic_train.csv")
test_df = pd.read_csv("titanic_test.csv")

# Display first 5 rows
train_df.head()


In [None]:

# Shape and basic information
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\nMissing values in train:")
print(train_df.isnull().sum())
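
Raw missing-value counts are easier to judge next to percentages. The sketch below builds such a summary on a tiny made-up frame (`toy_df` and its values are hypothetical); the same expression could be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real training data
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column, worst first
missing_summary = pd.DataFrame({
    "missing": toy_df.isnull().sum(),
    "percent": (toy_df.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```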


## 2. Exploratory Data Analysis (EDA)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# survival distribution
sns.countplot(x="Survived", data=train_df)
plt.title("Survival Distribution")
plt.show()

*This plot shows the overall survival distribution – most passengers did not survive.*

In [None]:
# survival by sex
sns.countplot(x="Sex", hue="Survived", data=train_df)
plt.title("Survival by Sex")
plt.show()

*This plot shows survival by gender – women had a much higher survival rate than men.*

In [None]:
# survival by Pclass
sns.countplot(x="Pclass", hue="Survived", data=train_df)
plt.title("Survival by Passenger Class")
plt.show()

*This plot shows survival by passenger class – first class passengers survived more often than those in lower classes.*
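
The count plots above show how many passengers fall in each group; the survival rates themselves can be computed with a `groupby`. This is a minimal sketch on a made-up frame (`toy_df` is hypothetical), and the same calls should work on `train_df` because it has the same column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic file
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the share of survivors in each group
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```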

## 3. Feature Engineering and Preprocessing

In [None]:

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
target = "Survived"

# Handle missing values
print("Missing values before:")
print(train_df[features].isnull().sum())

# Impute with statistics computed on the training set only,
# so no information from the test set leaks into preprocessing
age_median = train_df["Age"].median()
fare_median = train_df["Fare"].median()
embarked_mode = train_df["Embarked"].mode()[0]

train_df["Age"] = train_df["Age"].fillna(age_median)
train_df["Embarked"] = train_df["Embarked"].fillna(embarked_mode)
test_df["Age"] = test_df["Age"].fillna(age_median)
test_df["Fare"] = test_df["Fare"].fillna(fare_median)
test_df["Embarked"] = test_df["Embarked"].fillna(embarked_mode)

print("\nMissing values after:")
print(train_df[features].isnull().sum())

# One-hot encoding
X = pd.get_dummies(train_df[features])
y = train_df[target]
X_test_final = pd.get_dummies(test_df[features])

X, X_test_final = X.align(X_test_final, join="left", axis=1, fill_value=0)

print("\nFeatures after one-hot encoding:")
print(list(X.columns))
print("\nFirst 5 rows of processed data:")
print(X.head())
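
An alternative to the manual steps above is to wrap imputation and encoding in a scikit-learn `ColumnTransformer`, which learns its statistics from the training data and reuses them on the test data. This is only a sketch under assumptions: `toy_train` and `toy_test` are made-up frames, and the column split into `numeric_cols` and `categorical_cols` is chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames, not the real Titanic files
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Sex": ["female", "male"],
    "Embarked": ["Q", "S"],  # "Q" never appears in toy_train
})

numeric_cols = ["Age"]
categorical_cols = ["Sex", "Embarked"]

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            # Unseen categories are encoded as all zeros instead of raising
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preprocess.fit_transform(toy_train)  # learns median and categories
test_matrix = preprocess.transform(toy_test)        # reuses them unchanged

print(train_matrix)
print(test_matrix)
```

Because the transformer is fitted only on the training frame, the missing test `Age` is filled with the training median, and the unseen port gets an all-zero encoding.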


*We decided to keep all the selected features, since feature selection did not improve the results; keeping the full set is also the simpler and more stable choice.*
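
One way such a comparison could be run is to score a few candidate feature subsets with the same cross-validation setup and compare the mean F1. The sketch below uses synthetic data from `make_classification`; the subsets and the names `X_demo`, `y_demo` and `candidate_subsets` are hypothetical and do not reproduce the experiment described above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic features)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Hypothetical feature subsets to compare
candidate_subsets = {
    "all features": list(X_demo.columns),
    "first three": ["f0", "f1", "f2"],
    "last three": ["f3", "f4", "f5"],
}

subset_scores = {}
for label, cols in candidate_subsets.items():
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X_demo[cols], y_demo, cv=5, scoring="f1"
    )
    subset_scores[label] = scores.mean()

for label, score in subset_scores.items():
    print(f"{label}: mean F1 = {score:.3f}")
```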

## 4. Model Training and Experiments

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, make_scorer

models = {
    "LogisticRegression": {
        "model": LogisticRegression(max_iter=1000),
        "params": {
            "C": [0.1, 1, 10],
            "solver": ["liblinear"]
        }
    },
    "DecisionTree": {
        "model": DecisionTreeClassifier(random_state=42),  # fixed seed for reproducible results
        "params": {
            "max_depth": [3, 5, None],
            "min_samples_split": [2, 5, 10]
        }
    }
}

results = []
scorer = make_scorer(f1_score)

for name, mp in models.items():
    clf = GridSearchCV(mp["model"], mp["params"], cv=5, scoring=scorer, n_jobs=-1, return_train_score=True)
    clf.fit(X, y)
    for mean, params in zip(clf.cv_results_['mean_test_score'], clf.cv_results_['params']):
        results.append({"Model": name, "Params": params, "Mean F1": mean})

results_df = pd.DataFrame(results)
print("All experiment results:")
print(results_df)

sns.barplot(x="Model", y="Mean F1", data=results_df)
plt.title("Model Comparison (F1 Score)")
plt.show()

best_row = results_df.sort_values(by="Mean F1", ascending=False).iloc[0]
print("\nBest configuration:")
print(best_row)

*We compared Logistic Regression and Decision Tree using 5-fold cross-validation. The Decision Tree achieved the higher mean F1 score, so we selected it as the final model.*
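
Since the grid search is run with `return_train_score=True`, the train and validation scores can be put side by side to check for overfitting, and `best_params_` and `best_score_` give the winning configuration directly. This is a sketch on synthetic data; `X_demo`, `y_demo` and the reduced parameter grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, None]},
    cv=5,
    scoring="f1",
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# A large gap between train and validation F1 suggests overfitting
cv_table = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score"]
]
cv_table["gap"] = cv_table["mean_train_score"] - cv_table["mean_test_score"]

print(cv_table)
print("Best params:", search.best_params_)
print("Best mean F1:", search.best_score_)
```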

## 5. Train Best Model

In [None]:
if best_row["Model"] == "LogisticRegression":
    final_model = LogisticRegression(max_iter=1000, **best_row["Params"])
else:
    final_model = DecisionTreeClassifier(random_state=42, **best_row["Params"])

final_model.fit(X, y)
print("Final model trained with params:", best_row["Params"])
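
Before predicting on the test file, it can help to estimate performance on data the model has not seen by holding out part of the training set. The sketch below does this on synthetic data; `X_demo`, `y_demo`, the 25% split and the `max_depth=3` tree are assumptions for illustration, not the configuration chosen above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Stratify so both parts keep the same class proportions
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

holdout_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
val_pred = holdout_model.predict(X_val)

holdout_f1 = f1_score(y_val, val_pred)
holdout_cm = confusion_matrix(y_val, val_pred)  # rows: true class, columns: predicted

print("Hold-out F1:", holdout_f1)
print(holdout_cm)
```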

## 6. Prediction on Test Set

In [None]:

# Predict on test set
test_predictions = final_model.predict(X_test_final)

# Show first predictions
print("First 10 predictions:", test_predictions[:10])

# Prediction distribution
sns.countplot(x=test_predictions)
plt.title("Prediction Distribution on Test Set")
plt.show()

# Handle PassengerId: use if exists, else create index
if "PassengerId" in test_df.columns:
    submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": test_predictions
    })
else:
    submission = pd.DataFrame({
        "PassengerId": range(1, len(test_predictions) + 1),
        "Survived": test_predictions
    })

submission.to_csv("titanic_predictions.csv", index=False)
print("Submission file saved: titanic_predictions.csv")
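
A quick sanity check on the saved file can catch formatting problems early. The sketch below round-trips a small made-up submission through CSV in memory and verifies its structure; `demo_submission` and its IDs are hypothetical, and in practice the checks would be applied to the file read back from `titanic_predictions.csv`.

```python
import io
import pandas as pd

# Hypothetical submission, for illustration only
demo_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write to an in-memory buffer and read back, as with a file on disk
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

checks = {
    "columns_ok": list(reloaded.columns) == ["PassengerId", "Survived"],
    "ids_unique": bool(reloaded["PassengerId"].is_unique),
    "binary_labels": bool(reloaded["Survived"].isin([0, 1]).all()),
    "no_missing": not reloaded.isnull().values.any(),
}

for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```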
