# **Titanic Survival Prediction Challenge**

## **Kaggle Competition Project**

## **1. Introduction to the Titanic Dataset**

### **The Challenge**

Welcome to one of the most famous machine learning competitions on Kaggle! The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

**Your Mission:** Build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).

<div align="center">
  <img src="https://storage.googleapis.com/kaggle-competitions/kaggle/3136/logos/header.png" width="600"/>
</div>

---

### **📥 Dataset Information**

**Kaggle Competition Link:** [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

**Dataset Files:**
1. **train.csv** - Training dataset with survival labels (891 passengers)
2. **test.csv** - Test dataset without survival labels (418 passengers)
3. **gender_submission.csv** - Sample submission file in correct format

**How to Download:**
1. Create a free account on [Kaggle](https://www.kaggle.com)
2. Go to the [Titanic Competition Page](https://www.kaggle.com/c/titanic)
3. Click on the "Data" tab
4. Download all three files
5. Place them in your Day_4 folder

---

### **Dataset Features**

| Feature | Description | Type |
|---------|-------------|------|
| **PassengerId** | Unique identifier for each passenger | Integer |
| **Survived** | Survival status (0 = No, 1 = Yes) | Integer (Target) |
| **Pclass** | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) | Integer |
| **Name** | Passenger name | String |
| **Sex** | Gender (male/female) | String |
| **Age** | Age in years | Float |
| **SibSp** | Number of siblings/spouses aboard | Integer |
| **Parch** | Number of parents/children aboard | Integer |
| **Ticket** | Ticket number | String |
| **Fare** | Passenger fare | Float |
| **Cabin** | Cabin number | String |
| **Embarked** | Port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton) | String |

**Key Points:**
- **Training Set:** 891 passengers with survival labels
- **Test Set:** 418 passengers without survival labels (you need to predict these)
- **Target Variable:** Survived (0 or 1) - This is **binary classification**
- **Missing Data:** Some features have missing values that need handling

## **2. Submission Requirements**

### **What You Need to Submit**

To participate in this Kaggle competition, you need to submit a CSV file with your predictions in a specific format.

**Submission File Format:**

Your submission file should have **exactly 2 columns** and **418 rows** (one for each passenger in test.csv):

```csv
PassengerId,Survived
892,0
893,1
894,0
...
```

**Requirements:**
1. **Column 1 (PassengerId):** Must match the PassengerId from test.csv
2. **Column 2 (Survived):** Your prediction (0 or 1)
   - 0 = Did not survive
   - 1 = Survived
3. **Header:** Must include column names `PassengerId,Survived`
4. **Order:** Passengers should be in the same order as test.csv
5. **File Format:** CSV (Comma Separated Values)

**Example Submission Code:**

```python
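import pandas as pd

# Assumes `test_data` is the DataFrame loaded from test.csv and
# `predictions` holds one integer prediction (0 or 1) per test passenger
submission = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Survived': predictions  # Your model's predictions
})

# Save to CSV without the DataFrame index column
submission.to_csv('titanic_submission.csv', index=False)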
```

---

### **How to Submit on Kaggle**

1. **Generate Predictions:** Use your trained model to predict survival for test.csv
2. **Create Submission File:** Format predictions as shown above
3. **Upload to Kaggle:**
   - Go to the [competition page](https://www.kaggle.com/c/titanic)
   - Click "Submit Predictions" button
   - Upload your CSV file
   - Add a description (optional)
   - Click "Make Submission"
4. **View Your Score:** Kaggle will evaluate your predictions and show your accuracy

**Evaluation Metric:**

Your submission is scored based on **accuracy** - the percentage of passengers you correctly predict:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
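
As a quick illustration of the formula, here is a minimal sketch that computes accuracy with NumPy. The labels and predictions are made-up toy values, not Titanic data.

```python
import numpy as np

# Toy ground-truth labels and predictions (made up for illustration)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.2f}")  # 4 of 5 correct -> 0.80
```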

**Submission Limits:**
- You can make **10 submissions per day**
- Use this to experiment with different models and improvements!

## **3. Building Your Neural Network Solution**

### **Binary Classification vs Multi-Class Classification**

In the Fashion MNIST notebook, we performed **multi-class classification** (10 classes). The Titanic problem is **binary classification** (2 classes: survived or not).

**Key Differences:**

| Aspect | Multi-Class (Fashion MNIST) | Binary (Titanic) |
|--------|---------------------------|------------------|
| **Output Classes** | 10 (T-shirt, Trouser, etc.) | 2 (Survived: 0 or 1) |
| **Output Layer Neurons** | 10 neurons | 1 neuron |
| **Activation Function** | Softmax | Sigmoid |
| **Loss Function** | CrossEntropyLoss | BCEWithLogitsLoss |
| **Prediction** | argmax(probabilities) | round(sigmoid(output)) |

---

### **Sigmoid Activation Function**

For binary classification, we use the **sigmoid function** which squashes any input to a value between 0 and 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**Interpretation:**
- Output close to 0 → Likely did not survive
- Output close to 1 → Likely survived
- Threshold at 0.5 → If output ≥ 0.5, predict 1 (survived), else predict 0
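
The thresholding rule can be sketched in a few lines of PyTorch. The logit values below are hypothetical raw model outputs chosen only to show the behaviour.

```python
import torch

# Hypothetical raw model outputs (logits) for four passengers
logits = torch.tensor([-2.0, 0.0, 0.5, 3.0])

probs = torch.sigmoid(logits)   # squash each logit into (0, 1)
preds = (probs >= 0.5).long()   # apply the 0.5 threshold

print(probs)
print(preds)  # tensor([0, 1, 1, 1])
```

Because sigmoid(0) = 0.5, thresholding the probability at 0.5 is the same as checking whether the logit is non-negative.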

---

### **Your Implementation Roadmap**

Here's a step-by-step guide to build your solution:

#### **Step 1: Import Libraries**
Import the necessary libraries: `torch`, `torch.nn`, `torch.optim`, `Dataset`, `DataLoader`, `numpy`, `pandas`, and `matplotlib`.

#### **Step 2: Load and Explore Data**
- Load `train.csv` and `test.csv` using pandas
- Explore the data using `.head()`, `.info()`, and `.describe()`
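
A minimal sketch of this step is shown below. So that it runs without the Kaggle files, it reads a tiny made-up CSV from a string; in the notebook you would call `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` instead.

```python
import io
import pandas as pd

# Stand-in for train.csv: a few made-up rows using some of the real column names
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked
1,0,3,male,22,7.25,S
2,1,1,female,38,71.28,C
3,1,3,female,,7.92,S
4,0,3,male,35,8.05,
"""
train_data = pd.read_csv(io.StringIO(csv_text))

print(train_data.head())          # first rows
train_data.info()                 # column dtypes and non-null counts
print(train_data.describe())      # summary statistics for numeric columns
print(train_data.isnull().sum())  # missing values per column
```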

#### **Step 3: Data Preprocessing**

This is the **most important step** for tabular data! Unlike images, you need to:

**3.1. Handle Missing Values**
- Check for missing values using `.isnull().sum()`
- Fill missing `Age` with median
- Fill missing `Embarked` with mode (most common value)
- Fill missing `Fare` with median
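
One way to implement these fills is sketched below on a toy DataFrame (the values are made up). The fill values are stored in variables, an assumption of this sketch, so the same numbers can be reused on the test set later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data, with one gap per column
train_data = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

print(train_data.isnull().sum())  # count missing values per column

# Compute fill values from the training data only
age_median = train_data["Age"].median()
embarked_mode = train_data["Embarked"].mode()[0]
fare_median = train_data["Fare"].median()

train_data["Age"] = train_data["Age"].fillna(age_median)
train_data["Embarked"] = train_data["Embarked"].fillna(embarked_mode)
train_data["Fare"] = train_data["Fare"].fillna(fare_median)
```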

**3.2. Convert Categorical Variables to Numbers**
- Convert `Sex` to binary (male=0, female=1) using `.map()`
- Convert `Embarked` to numbers (S=0, C=1, Q=2)
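
The mappings above can be applied with `.map()` and a dictionary, as in this small sketch on toy rows:

```python
import pandas as pd

# Toy rows covering every category
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Replace each category with its integer code
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
print(df)
```

Integer codes imply an ordering between ports that has no real meaning; one-hot encoding with `pd.get_dummies` is a common alternative if you want to avoid that.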

**3.3. Feature Selection**
- Select useful features (remove irrelevant ones like Name, Ticket, Cabin)
- Suggested features: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`

**3.4. Feature Scaling (Normalization)**
- Normalize features to range [0, 1] using `MinMaxScaler` from sklearn
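
A minimal sketch of scaling with `MinMaxScaler`, using made-up feature values. The scaler is fitted on the training features and then reused, unchanged, on the test features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices with two columns (think Age and Fare)
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [30.0, 8.05]])
X_test = np.array([[26.0, 10.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max

print(X_train_scaled)
print(X_test_scaled)
```

Test values outside the training range will fall slightly outside [0, 1]; that is expected when the scaler is reused rather than refitted.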

#### **Step 4: Create Custom Dataset**
- Create a `TitanicDataset` class that inherits from `Dataset`
- Implement `__init__`, `__len__`, and `__getitem__` methods
- Convert features and labels to PyTorch tensors
- Create a `DataLoader` with appropriate batch size
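
One possible shape for this class is sketched below; treat it as a starting point rather than the reference solution. The random arrays stand in for your scaled features and labels, and the batch size of 4 is arbitrary.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TitanicDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        # Float labels with shape (N, 1) match a single output neuron
        self.labels = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Random stand-in data: 10 passengers, 7 features
rng = np.random.default_rng(0)
X = rng.random((10, 7))
y = rng.integers(0, 2, size=10)

dataset = TitanicDataset(X, y)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([4, 7]) torch.Size([4, 1])
```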

#### **Step 5: Define Neural Network Architecture**
- Create a `TitanicNet` class that inherits from `nn.Module`
- Add hidden layers with ReLU activation
- **Important:** Output layer should have only **1 neuron** for binary classification!
- No activation on output layer (loss function handles it)
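
A small network following these points might look like the sketch below. The hidden sizes (32 and 16) and the input size of 7 are illustrative assumptions, not required values.

```python
import torch
import torch.nn as nn

class TitanicNet(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),  # one output neuron, no activation (raw logit)
        )

    def forward(self, x):
        return self.layers(x)

model = TitanicNet(input_size=7)
out = model(torch.randn(4, 7))  # a batch of 4 random "passengers"
print(out.shape)  # torch.Size([4, 1])
```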

#### **Step 6: Define Loss Function and Optimizer**
- Use `nn.BCEWithLogitsLoss()` for binary classification (includes sigmoid)
- Use `optim.Adam` optimizer with learning rate ~0.001
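
The sketch below sets up the loss and optimizer and also checks the "includes sigmoid" claim: `BCEWithLogitsLoss` on raw logits gives the same value as `BCELoss` on sigmoid outputs. The `nn.Linear` model and the logit values are placeholders.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(7, 1)  # placeholder for your TitanicNet

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Compare the two loss formulations on hypothetical logits and targets
logits = torch.tensor([[0.3], [-1.2]])
targets = torch.tensor([[1.0], [0.0]])

loss_with_logits = criterion(logits, targets)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_with_logits.item(), loss_manual.item())
```

The combined version is generally preferred because it is more numerically stable for large positive or negative logits.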

#### **Step 7: Training Loop**
- Loop through epochs
- For each batch: zero the gradients → forward pass → compute loss → backward pass → optimizer step
- Apply sigmoid and threshold (0.5) to get predictions
- Track and print loss and accuracy
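
A minimal training loop along these lines is sketched below. It uses random stand-in data, a small `nn.Sequential` model, and 5 epochs purely so that it runs on its own; substitute your own `DataLoader`, model, and epoch count.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data: 64 samples, 7 features, label depends on feature 0
X = torch.rand(64, 7)
y = (X[:, 0] > 0.5).float().unsqueeze(1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    total_loss, correct = 0.0, 0
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        logits = model(xb)             # forward pass
        loss = criterion(logits, yb)   # compute loss on raw logits
        loss.backward()                # backward pass
        optimizer.step()               # update weights

        total_loss += loss.item() * xb.size(0)
        preds = (torch.sigmoid(logits) >= 0.5).float()
        correct += (preds == yb).sum().item()

    n = len(loader.dataset)
    print(f"Epoch {epoch + 1}: loss={total_loss / n:.4f}, acc={correct / n:.3f}")
```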

#### **Step 8: Make Predictions on Test Set**
- **Important:** Apply the SAME preprocessing to test data as training data
- Use the SAME scaler (`.transform()`, not `.fit_transform()`)
- Convert to tensor and run through model in eval mode
- Apply sigmoid and threshold to get final predictions (0 or 1)
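
The prediction step can be sketched as follows. The untrained model and the random tensor are stand-ins for your trained network and your preprocessed, scaled test features.

```python
import torch
import torch.nn as nn

# Stand-ins: replace with your trained model and scaled test features
model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
X_test_scaled = torch.rand(5, 7)

model.eval()               # switch off dropout / use running batch-norm stats
with torch.no_grad():      # no gradients needed for inference
    logits = model(X_test_scaled)
    probs = torch.sigmoid(logits)
    predictions = (probs >= 0.5).long().squeeze(1).numpy()

print(predictions)  # 1-D array of integer 0/1 values
```

Converting to integers matters here because the submission file expects `0` or `1`, not `0.0` or `1.0`.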

#### **Step 9: Create Submission File**
- Create a DataFrame with `PassengerId` and `Survived` columns
- Save to CSV using `.to_csv('titanic_submission.csv', index=False)`

---

### **Important Notes:**

⚠️ **Apply the SAME preprocessing to test data as training data**

⚠️ **Use the SAME scaler (don't fit a new one on test data)**

⚠️ **Ensure test features are in the SAME order as training features**

⚠️ **Binary classification uses 1 output neuron, not 2**

## **4. Tips for Improving Accuracy**

### **🎯 Achieving Better Results**

Here are proven strategies to improve your model's performance:

---

### **1. Feature Engineering**

Create new meaningful features from existing ones:

**Family Size:**
- Combine `SibSp` and `Parch` to create a `FamilySize` feature (add 1 for the passenger themselves)

**Is Alone:**
- Create a binary feature indicating if passenger is traveling alone (FamilySize == 1)

**Title Extraction from Name:**
- Extract title (Mr, Mrs, Miss, Master, etc.) from the Name column using string operations
- Group rare titles together
- Convert titles to numerical values

**Age Bands:**
- Group ages into categories (Child, Teen, Adult, Senior, etc.)

**Fare Bands:**
- Group fares into quartile categories
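
Several of these features can be sketched with pandas as shown below. The rows follow the format of the `Name` column, and the age-band boundaries (12, 18, 60) are an illustrative assumption rather than a fixed rule.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 1],
    "Age": [22.0, 38.0, 26.0, 2.0],
})

# Family size: siblings/spouses + parents/children + the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: the text between the comma and the first period in Name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Age bands with assumed boundaries
df["AgeBand"] = pd.cut(
    df["Age"], bins=[0, 12, 18, 60, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
)
print(df[["FamilySize", "IsAlone", "Title", "AgeBand"]])
```

Fare bands can be built the same way with `pd.qcut(df["Fare"], 4)` once a `Fare` column is present; rare titles can be grouped by mapping anything outside a chosen set to a single label such as `"Rare"`.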

---

### **2. Better Missing Value Handling**

Instead of simple median/mode filling:

**Age Prediction:**
- Predict missing ages based on other features
- Group by `Pclass` and `Sex` to fill missing ages with group median

**Cabin Feature:**
- Instead of dropping Cabin, create a binary `HasCabin` feature
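
A minimal sketch of both ideas on toy rows (values made up): missing ages are filled with the median of the passenger's `Pclass`/`Sex` group, and `Cabin` is reduced to a 0/1 flag.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [38.0, np.nan, 22.0, 30.0, np.nan],
    "Cabin": ["C85", None, None, "E46", None],
})

# Median age within each (Pclass, Sex) group, aligned to the original rows
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# 1 if a cabin is recorded, 0 otherwise
df["HasCabin"] = df["Cabin"].notna().astype(int)
print(df)
```

When applying this to the test set, it is safer to reuse the group medians computed on the training data than to recompute them from the test rows.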

---

### **3. Model Architecture Optimization**

**Add Dropout to Prevent Overfitting:**
- Add `nn.Dropout(0.3)` layers between hidden layers
- Dropout randomly "drops" neurons during training to prevent memorization

**Batch Normalization:**
- Add `nn.BatchNorm1d()` after linear layers
- Helps stabilize and speed up training

**Experiment with Layer Sizes:**
- Try different architectures: [128, 64, 32], [256, 128, 64, 32], etc.
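
The sketch below combines these ideas in a configurable network. The class name `TitanicNetRegularized` and its constructor parameters are hypothetical, and the Linear → BatchNorm → ReLU → Dropout ordering is one common choice among several.

```python
import torch
import torch.nn as nn

class TitanicNetRegularized(nn.Module):  # hypothetical name for illustration
    def __init__(self, input_size, hidden_sizes=(128, 64, 32), dropout=0.3):
        super().__init__()
        layers = []
        in_features = input_size
        for hidden in hidden_sizes:
            layers += [
                nn.Linear(in_features, hidden),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_features = hidden
        layers.append(nn.Linear(in_features, 1))  # single logit output
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = TitanicNetRegularized(input_size=7)
model.eval()  # eval mode: dropout off, batch norm uses running statistics
out = model(torch.randn(4, 7))
print(out.shape)  # torch.Size([4, 1])
```

Remember to call `model.train()` during training and `model.eval()` for validation and test predictions, since both dropout and batch normalization behave differently in the two modes.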

---

### **4. Hyperparameter Tuning**

**Experiment with different values:**

- **Learning Rate:** Try 0.0001, 0.001, 0.01
- **Batch Size:** Try 16, 32, 64
- **Number of Epochs:** Try 50, 100, 200
- **Hidden Layer Sizes:** Experiment with different architectures

**Learning Rate Scheduler:**
- Use `ReduceLROnPlateau` to automatically reduce learning rate when progress stalls
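
A minimal sketch of wiring up `ReduceLROnPlateau` is shown below. The validation losses are made-up numbers that plateau, and `factor=0.5` and `patience=2` are illustrative settings.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(7, 1)  # placeholder model
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)

# Made-up validation losses that stop improving after the second epoch
fake_val_losses = [0.60, 0.55, 0.55, 0.55, 0.55, 0.55]

for epoch, val_loss in enumerate(fake_val_losses, start=1):
    # ... train for one epoch here, then compute val_loss ...
    scheduler.step(val_loss)  # scheduler watches the validation loss
    lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: val_loss={val_loss:.2f}, lr={lr}")
```

Unlike most schedulers, this one needs the monitored metric passed to `step()`, so it should be called after validation each epoch.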

---

### **5. Cross-Validation**

Hold out part of your training data to validate your model before submitting:

**Hold-Out Validation:**
- Use `train_test_split` from sklearn to create a validation set (80% train, 20% validation)
- Train on the training split, evaluate on the validation split
- This gives an estimate of performance before you submit

**K-Fold Cross-Validation:**
- Split data into K folds (typically 5)
- Train K models, each using a different fold for validation
- Average results for more reliable performance estimate
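
Both splitting strategies can be sketched with scikit-learn as follows. The random arrays stand in for your features and labels, and the model training inside each fold is left as a comment.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Random stand-in data: 20 samples, 7 features
rng = np.random.default_rng(0)
X = rng.random((20, 7))
y = rng.integers(0, 2, size=20)

# Single 80/20 hold-out split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_val.shape)

# 5-fold cross-validation: each sample is used for validation exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Train a fresh model on X[train_idx], then evaluate it on X[val_idx]
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

`StratifiedKFold` is a drop-in alternative that keeps the survived/not-survived ratio similar across folds.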

---

### **6. Ensemble Methods (Advanced)**

Combine multiple models for better predictions:
- Train multiple models with different random seeds
- Average their predictions
- This often improves accuracy
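
A minimal sketch of seed-based ensembling is shown below. The models are left untrained and the test tensor is random so the example runs on its own; in practice each model would be trained before predicting.

```python
import torch
import torch.nn as nn

X_test = torch.rand(6, 7)  # stand-in for scaled test features

probs_per_model = []
for seed in [0, 1, 2]:
    torch.manual_seed(seed)  # a different seed gives different initial weights
    model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
    # ... train the model here ...
    model.eval()
    with torch.no_grad():
        probs_per_model.append(torch.sigmoid(model(X_test)))

# Average the predicted probabilities, then threshold once
avg_probs = torch.stack(probs_per_model).mean(dim=0)
predictions = (avg_probs >= 0.5).long().squeeze(1)
print(predictions)
```

Averaging probabilities before thresholding keeps more information than taking a majority vote over already-thresholded predictions.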

---

### **7. Data Analysis and Insights**

**Understand the data before modeling:**
- Calculate survival rate by gender, class, age group
- Visualize relationships using seaborn/matplotlib
- Use insights to guide feature engineering
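
Survival rates by group can be computed with a `groupby`, as in this sketch. The eight rows are toy values, so the printed rates are not the real Titanic figures.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 4,
    "Pclass": [1, 1, 2, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate within each group
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```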

**Key Insights from the Training Data:**
- **Women and children first:** Female survival rate was ~74%, male was ~19%
- **Class matters:** 1st class had ~63% survival, 3rd class had ~24%
- **Age:** Children had higher survival rates
- **Family size:** Passengers traveling with 1-3 family members survived more often than those traveling alone or in larger families

---

### **8. Common Mistakes to Avoid**

❌ **Don't fit scaler on test data** - Use `.transform()`, not `.fit_transform()`

❌ **Don't forget to handle missing values in test set** - Same preprocessing as training

❌ **Don't use different features** - Test must have same features as training

❌ **Don't overfit** - If training accuracy >> validation accuracy, your model is memorizing

❌ **Don't ignore data leakage** - Don't use information from test set during training

✅ **Do save your preprocessor** - Keep track of how you transformed data

✅ **Do experiment systematically** - Change one thing at a time

✅ **Do validate locally first** - Check performance on validation set before submitting

---

### **📊 Expected Performance**

**Baseline (Simple Features):** ~75-78% accuracy

**With Feature Engineering:** ~78-82% accuracy

**With Optimized Model:** ~80-84% accuracy

**Top Performers:** ~85%+ accuracy (requires advanced techniques)

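Remember: the public leaderboard shows scores of up to 100%, but such scores are widely understood to come from using the publicly available passenger records rather than from modeling. Aim for a consistent 80%+ to demonstrate strong understanding!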

## **5. Getting Started - Implementation Cells**

Now it's your turn! Use the code cells below to implement your solution. Follow the roadmap from Section 3 and apply the tips from Section 4.

**Your Journey:**
1. ✅ Load and explore the data
2. ✅ Preprocess and engineer features
3. ✅ Build and train your neural network
4. ✅ Make predictions on test set
5. ✅ Create submission file
6. ✅ Submit to Kaggle!
7. ✅ Iterate and improve

Good luck! Remember: machine learning is iterative - don't expect perfection on the first try. Learn from each submission and keep improving! 🚀

```python
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
```

```python
# Load the datasets
# Make sure train.csv and test.csv are in the same folder as this notebook

# YOUR CODE HERE
```

```python
# Explore the training data
# Use .head(), .info(), .describe(), .isnull().sum()

# YOUR CODE HERE
```

```python
# Data Preprocessing
# 1. Handle missing values
# 2. Convert categorical variables to numerical
# 3. Feature engineering (optional but recommended)
# 4. Feature selection

# YOUR CODE HERE
```

```python
# Feature Scaling
# Normalize your features using MinMaxScaler or StandardScaler

# YOUR CODE HERE
```

```python
# Create Custom Dataset Class

class TitanicDataset(Dataset):
    def __init__(self, features, labels):
        # YOUR CODE HERE
        pass

    def __len__(self):
        # YOUR CODE HERE
        pass

    def __getitem__(self, idx):
        # YOUR CODE HERE
        pass
```

```python
# Create DataLoader

# YOUR CODE HERE
```

```python
# Define Neural Network Architecture

class TitanicNet(nn.Module):
    def __init__(self, input_size):
        super(TitanicNet, self).__init__()

        # YOUR CODE HERE
        # Remember: Binary classification needs 1 output neuron!

    def forward(self, x):
        # YOUR CODE HERE
        pass

# Create model instance
# model = TitanicNet(input_size)
```

```python
# Define Loss Function and Optimizer
# Use BCEWithLogitsLoss for binary classification
# Use Adam or SGD optimizer

# YOUR CODE HERE
```

```python
# Training Loop

# YOUR CODE HERE
```

```python
# Preprocess Test Data
# Apply the SAME preprocessing steps as training data
# Use the SAME scaler (transform, not fit_transform)

# YOUR CODE HERE
```

```python
# Make Predictions on Test Set

# YOUR CODE HERE
```

```python
# Create Submission File

# YOUR CODE HERE
# submission = pd.DataFrame({
#     'PassengerId': test_data['PassengerId'],
#     'Survived': predictions
# })
# submission.to_csv('titanic_submission.csv', index=False)
```

## **6. Conclusion**

🎉 **Congratulations!**

You've now learned how to:
- Apply neural networks to real-world tabular data
- Perform data preprocessing and feature engineering
- Build binary classification models with PyTorch
- Submit predictions to a Kaggle competition

**Next Steps:**
1. Submit your predictions to Kaggle
2. Check your score on the leaderboard
3. Analyze what worked and what didn't
4. Implement improvements from Section 4
5. Submit again and track your progress!

**Remember:**
- Machine learning is iterative - keep experimenting!
- Learn from the Kaggle community - read kernels/notebooks from top performers
- Document your experiments - track what works and what doesn't
- Have fun and enjoy the learning process! 🚀

---

**Share Your Results:**
- What accuracy did you achieve?
- What features/techniques helped the most?
- What challenges did you face?

Good luck, and may your model have smooth sailing! ⚓