# 00 - Data Loading and Validation

## Purpose
Load the Kaggle Titanic dataset from its CSV files and run initial data quality checks.

## Objectives
- Load train and test datasets
- Examine structure, shape, and dtypes
- Document missing values and data quality issues
- Create a feature data dictionary
- Establish a baseline understanding of the passenger data

In [None]:
import pandas as pd
import numpy as np

# Load datasets
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

print('Train shape:', df_train.shape)
print('Test shape:', df_test.shape)
print('\nFirst few rows:')
df_train.head()
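
A quick structural check after loading can catch a wrong or truncated file before any analysis. The following is a minimal sketch, assuming the standard Kaggle column layout; `EXPECTED_TRAIN_COLUMNS` and `check_schema` are hypothetical names introduced for illustration, and the check runs on an empty stand-in frame rather than the real CSV.

```python
import pandas as pd

# Column names assumed from the Kaggle Titanic train.csv layout.
EXPECTED_TRAIN_COLUMNS = [
    'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
    'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
]


def check_schema(df, expected_columns):
    """Return (missing, unexpected) column names relative to expected_columns."""
    actual = set(df.columns)
    expected = set(expected_columns)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    return missing, unexpected


# Stand-in for a test-style frame: same layout but no 'Survived' column.
expected_test_columns = [c for c in EXPECTED_TRAIN_COLUMNS if c != 'Survived']
demo_test = pd.DataFrame(columns=expected_test_columns)

# Checked against the train layout, 'Survived' is reported as missing.
print(check_schema(demo_test, EXPECTED_TRAIN_COLUMNS))
# Checked against the test layout, nothing is reported.
print(check_schema(demo_test, expected_test_columns))
```

In the notebook itself, the same helper could be called as `check_schema(df_train, EXPECTED_TRAIN_COLUMNS)` and `check_schema(df_test, expected_test_columns)`.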

In [None]:
# Data types and missing values
print('Data Types:\n')
print(df_train.dtypes)
print('\nMissing Values:\n')
print(df_train.isnull().sum())
print('\nMissing Percentage:\n')
print((df_train.isnull().sum() / len(df_train) * 100).round(2))
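
The counts and percentages above can also be combined into one table that lists only the affected columns, which is easier to paste into documentation. This is a sketch, not the notebook's own method: `missing_summary` is a hypothetical helper, and the `demo` frame is made-up data used only to show the output shape.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most missing first."""
    summary = pd.DataFrame({
        'missing': df.isnull().sum(),
        'percent': (df.isnull().mean() * 100).round(2),
    })
    # Keep only columns with at least one missing value.
    summary = summary[summary['missing'] > 0]
    return summary.sort_values('missing', ascending=False)


# Made-up rows for illustration; not taken from the Titanic data.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})
print(missing_summary(demo))
```

Calling `missing_summary(df_train)` and `missing_summary(df_test)` separately would show whether the two files have gaps in different columns.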

In [None]:
# Summary statistics
df_train.describe()
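
By default `describe()` summarizes only numeric columns, so text columns such as `Sex` or `Embarked` are left out. The sketch below shows one way to cover them and to check for repeated identifiers; it uses a small made-up `demo` frame, and treating `PassengerId` as the uniqueness key is an assumption based on the feature dictionary.

```python
import pandas as pd

# Made-up rows for illustration; the last row repeats PassengerId 3.
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3, 3],
    'Sex': ['male', 'female', 'female', 'female'],
    'Embarked': ['S', 'C', 'S', 'S'],
    'Fare': [7.25, 71.28, 7.92, 7.92],
})

# describe() on non-numeric columns reports count, unique, top, and freq.
categorical_summary = demo.select_dtypes(exclude='number').describe()
print(categorical_summary)

# Count rows whose PassengerId repeats an earlier row.
duplicate_ids = int(demo.duplicated(subset='PassengerId').sum())
print('Duplicate PassengerId rows:', duplicate_ids)
```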

## Feature Dictionary

| Feature | Type | Description |
|---------|------|-------------|
| PassengerId | int | Unique passenger identifier |
| Survived | int | Target variable (0=No, 1=Yes); present in the train set only |
| Pclass | int | Ticket class (1=First, 2=Second, 3=Third) |
| Name | str | Passenger name |
| Sex | str | Passenger sex (male, female) |
| Age | float | Age in years |
| SibSp | int | Number of siblings/spouses aboard |
| Parch | int | Number of parents/children aboard |
| Ticket | str | Ticket number |
| Fare | float | Ticket price in GBP |
| Cabin | str | Cabin number |
| Embarked | str | Port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton) |
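
The coded columns in the dictionary above can be checked against their documented value sets. This is a minimal sketch under that assumption: `ALLOWED_VALUES` and `find_invalid_values` are hypothetical names, and the `demo` frame contains deliberately invalid entries to show what gets reported.

```python
import pandas as pd

# Allowed values taken from the feature dictionary above.
ALLOWED_VALUES = {
    'Survived': {0, 1},
    'Pclass': {1, 2, 3},
    'Sex': {'male', 'female'},
    'Embarked': {'C', 'Q', 'S'},
}


def find_invalid_values(df, allowed):
    """Map each checked column to its non-null values outside the allowed set."""
    problems = {}
    for column, valid in allowed.items():
        if column not in df.columns:
            continue  # e.g. 'Survived' is absent from test-style frames
        observed = set(df[column].dropna().unique())
        invalid = observed - valid
        if invalid:
            problems[column] = invalid
    return problems


# Made-up rows for illustration, with two deliberately invalid entries.
demo = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 4],             # 4 is not a valid class
    'Sex': ['male', 'female', 'F'],  # 'F' is not a documented value
    'Embarked': ['S', None, 'Q'],    # missing values are ignored here
})
print(find_invalid_values(demo, ALLOWED_VALUES))
```

Missing values are skipped on purpose, since they are already covered by the missing-value counts earlier in the notebook.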