# Exploratory Data Analysis (EDA) — Beginner Lessons

This notebook turns the **"Exploratory Data Analysis (EDA) — Beginner Lessons"** outline into a beginner-friendly, runnable walkthrough. Use it to guide a newcomer through the core concepts, quick examples, and a mini-project.


## 1. What is EDA? (Basics)

- **Why EDA matters** — discover patterns, spot errors, form hypotheses before modeling.
- **Types of data** — numerical, categorical, date/time.
- **Loading datasets** — CSV, Excel, and built-in datasets (`pd.read_csv`, `pd.read_excel`, `sns.load_dataset`).
- **First look at data** — `.head()`, `.info()`, `.describe()` to quickly understand a dataset.
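
The loading and first-look steps above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the CSV text and the `people` variable are made up for the example, and `io.StringIO` stands in for a real file path so the cell runs without any external file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,age,city,signup_date
Ana,34,Lisbon,2023-01-15
Ben,,Berlin,2023-02-01
Cara,29,Lisbon,2023-02-20
Dev,41,Delhi,2023-03-05
"""

# read_csv accepts a file path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)       # (rows, columns)
print(people.head())      # first few rows
people.info()             # dtypes and non-null counts; prints directly
print(people.describe())  # summary statistics for numeric columns
```

Note that `age` is read as a float column because one value is missing, and `signup_date` is still plain text until it is converted explicitly.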


## 2. Cleaning Data

- **Handling missing values** — drop or impute (mean/median/mode or constant).
- **Removing duplicates** — `df.drop_duplicates()`.
- **Fixing data types** — convert strings to `datetime`, numbers, or `category`.
- **Handling outliers** — inspect with boxplots, decide to keep/clip/remove.
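
A rough sketch of the four cleaning steps above, on a tiny made-up table (the `raw` data and column names are assumptions for illustration). Median and mode imputation and the 1.5 × IQR clipping rule are common defaults, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up data with a duplicate row, missing values, text-typed numbers,
# and one extreme fare.
raw = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 35.0, 58.0, 41.0],
    "fare": ["7.25", "71.28", "8.05", "8.05", "512.33", "13.00"],
    "embarked": ["S", "C", "S", "S", None, "Q"],
    "boarded": ["1912-04-10", "1912-04-10", "1912-04-10",
                "1912-04-10", "1912-04-11", "1912-04-11"],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Fix data types: text to numbers, dates, and categories.
clean["fare"] = pd.to_numeric(clean["fare"])
clean["boarded"] = pd.to_datetime(clean["boarded"])

# 3. Impute missing values: median for numeric, mode for categorical.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["embarked"] = (
    clean["embarked"].fillna(clean["embarked"].mode()[0]).astype("category")
)

# 4. Clip outliers to the 1.5 * IQR fences (one option among keep/clip/remove).
q1, q3 = clean["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["fare_clipped"] = clean["fare"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(clean)
```

Keeping the clipped values in a separate column leaves the original `fare` available for comparison before committing to a choice.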


## 3. Understanding Variables

- **Distributions** — histograms and boxplots for numeric features.
- **Categorical analysis** — `value_counts()` and bar charts.
- **Correlation and covariance** — `df.corr(numeric_only=True)` and `df.cov(numeric_only=True)`, plus a heatmap of the correlation matrix.
- **Feature relationships** — scatter plots, pairplots.
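
The numeric side of these checks can be sketched without any plotting. The `passengers` table below is invented for illustration; `numeric_only=True` tells pandas to skip text columns such as `sex` when computing correlation and covariance.

```python
import pandas as pd

# Small made-up table mixing numeric and categorical columns.
passengers = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 1, 3],
    "age": [38, 22, 26, 35, 54, 20],
    "fare": [71.3, 7.3, 7.9, 13.0, 51.9, 8.1],
    "sex": ["female", "male", "female", "male", "male", "male"],
})

# Distribution summary for one numeric feature.
print(passengers["age"].describe())

# Categorical analysis: raw counts and proportions.
counts = passengers["sex"].value_counts()
share = passengers["sex"].value_counts(normalize=True)
print(counts)
print(share)

# Correlation and covariance over numeric columns only.
corr = passengers.corr(numeric_only=True)
cov = passengers.cov(numeric_only=True)
print(corr)

# A simple feature relationship: average fare per passenger class.
fare_by_class = passengers.groupby("pclass")["fare"].mean()
print(fare_by_class)
```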


## 4. Visual EDA (Beginner-Friendly)

- **Matplotlib basics** — `plt.plot`, `plt.hist`, `plt.boxplot`.
- **Seaborn basics** — `sns.countplot`, `sns.boxplot`, `sns.heatmap`, `sns.pairplot`.
- **Heatmaps for correlations** — visualize numeric correlations.
- **Pairplots** — visualize pairwise relationships for small datasets.
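
One way to try these plots side by side is a small grid of Matplotlib axes. The sketch below uses synthetic data (the `data` frame and its columns are assumptions) and sticks to Matplotlib only; the comments name the Seaborn call that would produce a similar chart.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: fare loosely increases with age.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(35, 12, n).clip(1, 80)
fare = (age * 0.8 + rng.normal(0, 10, n)).clip(0, None)
data = pd.DataFrame({"age": age, "fare": fare, "pclass": rng.integers(1, 4, n)})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram of a numeric feature (Seaborn: sns.histplot).
axes[0, 0].hist(data["age"].to_numpy(), bins=20)
axes[0, 0].set_title("Histogram of age")

# Boxplot to inspect spread and outliers (Seaborn: sns.boxplot).
axes[0, 1].boxplot(data["fare"].to_numpy())
axes[0, 1].set_title("Boxplot of fare")

# Bar chart of category counts (Seaborn: sns.countplot).
class_counts = data["pclass"].value_counts().sort_index()
axes[1, 0].bar(class_counts.index.astype(str), class_counts.to_numpy())
axes[1, 0].set_title("Passengers per class")

# Correlation heatmap (Seaborn: sns.heatmap).
corr = data.corr()
image = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[1, 1])

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, call `plt.show()` or `fig.savefig("eda_plots.png")` afterwards.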


## 5. EDA for ML Preparation

- **Train/test split & data leakage awareness** — split with `train_test_split` first, then fit scalers, encoders, and imputers on the training set only so that no test information leaks into training.
- **Scaling features** — `StandardScaler`, `MinMaxScaler`.
- **Encoding categories** — `LabelEncoder` (intended for the target column) vs `OneHotEncoder` / `pd.get_dummies()` (for unordered categorical features).
- **Feature importance** — quick check using tree-based models.
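
The preparation steps above could be strung together roughly like this. Everything in the example is synthetic: the `frame` data, the `survived` target, and the choice of a random forest are assumptions made for illustration, and the importances it prints say nothing about any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data with a target that loosely depends on fare.
rng = np.random.default_rng(42)
n = 200
frame = pd.DataFrame({
    "age": rng.normal(35, 12, n),
    "fare": rng.exponential(30, n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
frame["survived"] = (frame["fare"] + rng.normal(0, 10, n) > 30).astype(int)

# One-hot encode the categorical feature; dtype=int keeps the columns numeric.
X = pd.get_dummies(frame.drop(columns="survived"), columns=["embarked"], dtype=int)
y = frame["survived"]

# Split before computing any statistics from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then reuse it on the test set.
num_cols = ["age", "fare"]
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])

# Quick feature-importance check with a tree-based model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)
importance = pd.Series(
    model.feature_importances_, index=X_train_scaled.columns
).sort_values(ascending=False)
print(importance)
```

Calling `scaler.fit_transform` on the full dataset before splitting would let test-set statistics influence training, which is the kind of leakage the first bullet warns about. Tree-based models do not need scaled inputs; the scaling is included here only to show the fit-on-train pattern.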


---
### Tiny example: First glance at a dataset

Run the cell below to load a sample dataset and print basic summaries.

In [None]:
import pandas as pd
import seaborn as sns

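# Load a small built-in dataset (downloaded on first use, so this needs internet access)
df = sns.load_dataset('titanic')
print('Shape:', df.shape)
display(df.head())  # display() is available in Jupyter/IPython
print('\nInfo:')
df.info()  # info() prints its report itself and returns None
print('\nDescribe (numeric):')
print(df.describe())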


### Suggested Learning Path (Beginner → Intermediate)

| Section | Time |
|--------|------|
| Basics of EDA | 30–45 min |
| Cleaning data | 45–60 min |
| Visual analysis | 60–90 min |
| EDA for ML prep | 45 min |
| Mini-project | 1–2 hours |


### Mini-Project Suggestion

Use any dataset (Titanic, Iris, or your own CSV) and work through these steps:
1. Load data
2. Check missing values
3. Visualize distributions
4. Explore correlations
5. Write 5 insights you discover
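
For step 2, a small helper can make the missing-value check easier to read. The function name `missing_report` and the `sample` table are hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),
        "percent": (df.isna().mean() * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up sample with gaps in two of its three columns.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, "C", None],
    "fare": [7.25, 71.28, 8.05, 13.0],
})

report = missing_report(sample)
print(report)
```

Columns with a high percentage of missing values are often candidates for dropping, while columns with only a few gaps are usually easier to impute.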


### Next steps / Options

- Expand any section into a hands-on lesson with step-by-step code.
- Convert this notebook into a guided exercise notebook with tests/quizzes.
- Export as `.py` or `.ipynb`.
