# Titanic Data Analysis (Beginner)

This notebook is a simple, beginner‑friendly analysis of a small Titanic dataset. It covers loading data, quick exploration, cleaning basics, and a few visualizations using **matplotlib**.

**How to use:**
1. Run each cell from top to bottom.
2. Modify the code and re-run to learn by doing.

**Files included:**
- `titanic_sample.csv` — a small CSV file in the same folder as this notebook.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)


## 1) Load the data

In [None]:
data_path = Path('titanic_sample.csv')
df = pd.read_csv(data_path)
df.head()

## 2) Quick exploration

In [None]:
print('Shape:', df.shape)
print('\nColumn dtypes:')
print(df.dtypes)

df.describe(include='all')

## 3) Basic cleaning
- Convert categorical columns to the pandas `category` dtype
- Handle any missing values (if present)

In [None]:
categorical_cols = ['Sex', 'Embarked']
for c in categorical_cols:
    df[c] = df[c].astype('category')

# Example: fill missing ages with the median (dataset here has no missing Age, but this shows the pattern)
if df['Age'].isna().any():
    df['Age'] = df['Age'].fillna(df['Age'].median())

df.head()
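
Before filling anything, it can help to see which columns actually contain missing values. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the notebook's dataset) so it runs on its own; in the notebook the same calls would be made on `df`. Filling a text column with its most frequent value is only one possible choice, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not from titanic_sample.csv)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing values per column before changing anything
missing_counts = toy.isna().sum()
print(missing_counts)

# Numeric column: fill with the median
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Text column: fill with the most frequent value (one option among several)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])
print(toy)
```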

## 4) Visualizations

In [None]:
# 4.1 Survival rate by Sex (the mean of the 0/1 Survived column is the share who survived)
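# observed=True groups only by categories that appear in the data (Sex is a category column)
ax = df.groupby('Sex', observed=True)['Survived'].mean().plot(kind='bar')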
ax.set_title('Average Survival Rate by Sex')
ax.set_ylabel('Survival Rate')
ax.set_xlabel('Sex')
plt.show()

In [None]:
# 4.2 Age distribution
ax = df['Age'].plot(kind='hist', bins=10)
ax.set_title('Age Distribution')
ax.set_xlabel('Age')
plt.show()

## 5) Feature preparation example (optional)
This section shows how to prepare features for a simple ML model (no training here).

In [None]:
df_model = df.copy()
# .cat.codes replaces each category with an integer; missing values become -1
df_model['Sex'] = df_model['Sex'].cat.codes
df_model['Embarked'] = df_model['Embarked'].cat.codes

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'

X = df_model[features]
y = df_model[target]

X.head()
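
The `.cat.codes` step replaces each label with an integer, so it is worth knowing which integer stands for which label. The sketch below shows one way to recover that mapping. It uses a made-up toy column (`sex` is hypothetical, not the notebook's `df`); in the notebook the same lookup would be done on `df['Sex']` before it is overwritten in `df_model`.

```python
import pandas as pd

# Hypothetical toy column for illustration only
sex = pd.Series(['male', 'female', 'female', None], dtype='category')

# Categories inferred from strings are sorted, so 'female' -> 0 and 'male' -> 1
code_to_label = dict(enumerate(sex.cat.categories))
print(code_to_label)

# Missing values are encoded as -1, which a model would treat as a real number
codes = sex.cat.codes.tolist()
print(codes)
```

Because `-1` is not a real category, any missing values are best handled before the codes are fed to a model.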

## Next steps
- Try splitting the data into train/test sets and fitting a Logistic Regression model.
- Add more visualizations (e.g., survival by class, fare distribution).
- Write your findings in the README of your GitHub repository.
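
As a starting point for the first item, here is a minimal sketch of a train/test split followed by a logistic regression fit. It assumes scikit-learn is installed, which this notebook does not otherwise use. The data is synthetic (`X_demo` and `y_demo` are made up, not the Titanic data) so that the block runs on its own; in the notebook you would pass `X` and `y` from section 5 instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (made-up values, not the Titanic data)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Sex': rng.integers(0, 2, size=n),
    'Age': rng.uniform(1, 80, size=n),
})
# Made-up rule: the label mostly follows Sex, with about 20% of labels flipped
y_demo = ((X_demo['Sex'] == 0) ^ (rng.random(n) < 0.2)).astype(int)

# Hold out 25% of the rows for testing; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# max_iter is raised because the features are not scaled
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns accuracy on the held-out rows
accuracy = model.score(X_test, y_test)
print(f'Test accuracy: {accuracy:.2f}')
```

With a dataset as small as the sample file, the test accuracy will vary a lot between splits, so treat a single number as a rough indication only.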