# Day 01 — Data cleaning + EDA fundamentals

This notebook is a **one-stop, beginner-friendly walkthrough** for exploratory data analysis (EDA).
We will cover:

- What EDA is and why it matters
- How to inspect data types, shapes, and missing values
- Summary statistics for numerical & categorical features
- Quick visualizations to understand distributions and outliers
- Simple feature engineering ideas you can try immediately

**Goal:** By the end, you should feel comfortable exploring a new dataset and
identifying what needs cleaning or deeper analysis.


## 1) Create a small sample dataset
In real projects you would load a CSV/SQL table.
Here we use a tiny in-memory dataset to keep the concepts clear.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    "age": [22, 35, 28, None, 40, 19, 50],
    "salary": [48000, 54000, 50000, 62000, None, 41000, 80000],
    "department": ["sales", "marketing", "sales", "engineering", "engineering", "sales", "marketing"],
    "tenure": [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
}
df = pd.DataFrame(data)
df.head()
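
For comparison, loading the same kind of table from a CSV might look like the sketch below. The CSV text and the names `csv_text` and `df_csv` are made up for illustration; with a real file you would pass its path to `pd.read_csv` instead of a `StringIO` buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """age,salary,department,tenure
22,48000,sales,1.2
35,54000,marketing,3.4
,50000,sales,2.1
"""

# Empty fields are parsed as NaN, so missing values survive the load
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)
print(df_csv.isna().sum())
```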


## 2) Basic structure checks
Start by answering these questions:
- How many rows and columns do we have?
- What are the column names?
- Which columns are numeric vs categorical?


In [None]:
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names
df.info()                   # dtype and non-null count per column
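
To answer the numeric vs categorical question directly, one option is `select_dtypes`. This is a minimal sketch on a stand-in frame (`df_demo` is a hypothetical name, with the same columns as above), treating every non-numeric column as categorical, which is an assumption that will not hold for dates or free text.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "age": [22, 35, 28, None, 40, 19, 50],
    "salary": [48000, 54000, 50000, 62000, None, 41000, 80000],
    "department": ["sales", "marketing", "sales", "engineering",
                   "engineering", "sales", "marketing"],
    "tenure": [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
})

# Columns with a numeric dtype (ints and floats)
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```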


## 3) Summary statistics
Summary stats give a quick sense of ranges, averages, and potential outliers.
We also look at category counts for non-numeric features.


In [None]:
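print(df.describe())                    # count, mean, std, min, quartiles, max for numeric columns
print(df["department"].value_counts())  # number of rows per department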
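
Summary statistics often become more informative when broken down by category. The sketch below is one way to do that; `df_demo`, `dept_share`, and `salary_by_dept` are illustrative names, not part of the notebook.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "salary": [48000, 54000, 50000, 62000, None, 41000, 80000],
    "department": ["sales", "marketing", "sales", "engineering",
                   "engineering", "sales", "marketing"],
})

# Share of rows per department instead of raw counts
dept_share = df_demo["department"].value_counts(normalize=True)

# Per-department salary summary; missing salaries are skipped by the aggregations
salary_by_dept = df_demo.groupby("department")["salary"].agg(["count", "mean", "median"])

print(dept_share)
print(salary_by_dept)
```

Note that `count` here counts non-missing salaries, so it can be smaller than the number of rows in a department.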


## 4) Missing values
Missing values are common. First, quantify them, then decide on a strategy:
- **Drop rows/columns** if most of their values are missing
- **Impute** (fill) with mean/median/mode
- **Add an indicator column** if missingness might carry meaning


In [None]:
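print(df.isna().sum())  # missing values per column, before imputation

# Record which ages were missing before filling them in
df["age_missing"] = df["age"].isna().astype(int)
# The median is less sensitive to outliers than the mean
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].median())

df.isna().sum()  # should now be zero for every column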
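
Counts are easier to act on when expressed as rates. The following sketch computes the missing fraction per column and flags columns above a cutoff; the 0.5 threshold and the names `df_raw`, `max_missing_rate`, and `cols_to_drop` are assumptions chosen for illustration, not a recommended rule.

```python
import pandas as pd

df_raw = pd.DataFrame({
    "age": [22, 35, 28, None, 40, 19, 50],
    "salary": [48000, 54000, 50000, 62000, None, 41000, 80000],
    "tenure": [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_rate = df_raw.isna().mean()
print(missing_rate.round(3))

# Hypothetical cutoff: columns above it are candidates for dropping
max_missing_rate = 0.5
cols_to_drop = missing_rate[missing_rate > max_missing_rate].index.tolist()
print(cols_to_drop)
```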


## 5) Data types and categories
Explicitly cast columns to the right types to avoid subtle bugs later.
Categorical columns often benefit from the `category` dtype.


In [None]:
df["department"] = df["department"].astype("category")
df.dtypes
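
A common source of such bugs is a numeric column that arrives as text. Here is a small sketch of one way to handle it with `pd.to_numeric`; the `raw_salary` values are invented for the example.

```python
import pandas as pd

# Hypothetical raw column where numbers arrived as text, with one bad entry
raw_salary = pd.Series(["48000", "54000", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising
salary = pd.to_numeric(raw_salary, errors="coerce")

print(salary.dtype)
print(salary.isna().sum())
```

Coercing silently converts bad entries to missing values, so it is worth checking the missing count before and after the cast.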


## 6) Distributions and outliers
Histograms show how values are distributed.
Boxplots help identify possible outliers.


In [None]:
sns.histplot(df["age"], kde=True)
plt.title("Age distribution")
plt.show()

sns.boxplot(x="department", y="salary", data=df)
plt.title("Salary by department")
plt.show()
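
To turn the visual impression into a list of rows, one common convention is the 1.5 × IQR rule, which matches the default whisker length in matplotlib and seaborn boxplots. The sketch below applies it to the non-missing salaries as a single column rather than per department, which is a simplification.

```python
import pandas as pd

# Non-missing salaries from the sample dataset
salary = pd.Series([48000, 54000, 50000, 62000, 41000, 80000])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as possible outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

print(lower, upper)
print(outliers.tolist())
```

A flagged value is only a candidate for review; whether it is an error or a legitimate extreme depends on the data.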


## 7) Relationships between numerical features
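Pearson correlation (the pandas default) measures how strongly two numeric columns are linearly related, and a heatmap makes the matrix easy to scan. With only seven rows, treat the values as illustrative.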


In [None]:
numeric_cols = df.select_dtypes(include="number")
corr = numeric_cols.corr()
sns.heatmap(corr, annot=True, cmap="Blues")
plt.title("Correlation matrix")
plt.show()
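
To make the numbers in the matrix less of a black box, the sketch below recomputes one Pearson coefficient by hand and compares it with pandas. It uses the six rows where age is known; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Age and tenure for the rows where age is not missing
age = pd.Series([22, 35, 28, 40, 19, 50])
tenure = pd.Series([1.2, 3.4, 2.1, 4.2, 0.8, 6.5])

# Pearson r: sum of products of deviations, scaled by the deviation magnitudes
age_dev = age - age.mean()
tenure_dev = tenure - tenure.mean()
r_manual = (age_dev * tenure_dev).sum() / np.sqrt(
    (age_dev ** 2).sum() * (tenure_dev ** 2).sum()
)

# Series.corr uses Pearson by default
r_pandas = age.corr(tenure)

print(round(r_manual, 4), round(r_pandas, 4))
```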


## 8) Simple feature engineering ideas
Feature engineering is about creating more useful signals from raw columns.
Here are two common quick wins: ratios and bucketed categories.


In [None]:
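# Salary divided by tenure in years; the +0.1 avoids dividing by zero when tenure is 0
df["salary_per_year"] = df["salary"] / (df["tenure"] + 0.1)
# right=False makes each bin include its left edge, so age 25 falls in "25-35" as the labels suggest
df["age_bucket"] = pd.cut(df["age"], bins=[0, 25, 35, 45, 100], labels=["<25", "25-35", "35-45", "45+"], right=False)
df[["age", "salary", "tenure", "salary_per_year", "age_bucket"]].head()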
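
Bucket edges are easy to get subtly wrong, so it helps to test values that sit exactly on a boundary. The sketch below compares the two `right` settings of `pd.cut` on such values; `edge_ages` is a made-up series for the demonstration.

```python
import pandas as pd

# Ages that sit exactly on the bin edges
edge_ages = pd.Series([25, 35, 45])
bins = [0, 25, 35, 45, 100]
labels = ["<25", "25-35", "35-45", "45+"]

# Default (right=True): bins are (0, 25], (25, 35], ... so 25 lands in "<25"
right_closed = pd.cut(edge_ages, bins=bins, labels=labels)

# right=False: bins are [0, 25), [25, 35), ... so 25 lands in "25-35"
left_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```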


## 9) What to do next
At this point you should have a good feel for the dataset. Typical next steps:
- Check for target leakage if you have a label
- Encode categorical variables for modeling
- Split into train/test and establish a baseline model
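
As a rough preview of the encoding and splitting steps, here is a pandas-only sketch. The frame `df_demo`, the 70/30 ratio, and the seed are assumptions for illustration; libraries such as scikit-learn offer dedicated helpers for splitting.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "tenure": [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
    "department": ["sales", "marketing", "sales", "engineering",
                   "engineering", "sales", "marketing"],
})

# One-hot encode the categorical column into indicator columns
encoded = pd.get_dummies(df_demo, columns=["department"])

# Random split: sample roughly 70% of rows for training, keep the rest for testing
train = encoded.sample(frac=0.7, random_state=0)
test = encoded.drop(train.index)

print(encoded.columns.tolist())
print(len(train), len(test))
```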

In Day 02, we’ll build a simple baseline model and talk about evaluation.
