# Exploratory Data Analysis (EDA) — Beginner Lessons

This notebook turns the EDA beginner-lessons outline into a runnable, beginner-friendly walkthrough. Use it to guide a newcomer through the core concepts, quick examples, and a mini-project.


## 1. What is EDA? (Basics)

- **Why EDA matters** — discover patterns, spot errors, form hypotheses before modeling.
- **Types of data** — numerical, categorical, date/time.
- **Loading datasets** — CSV, Excel, and built-in datasets (`pandas.read_csv`, `pd.read_excel`, `seaborn.load_dataset`).
- **First look at data** — `.head()`, `.info()`, `.describe()` to quickly understand a dataset.
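
The loading and first-look calls listed above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the CSV text and its column names are made up for the example, and `io.StringIO` stands in for a file on disk so the cell runs without any external data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,city,signup_date
Ana,34,Lisbon,2023-01-15
Ben,,Oslo,2023-02-01
Cara,29,Lisbon,2023-02-20
"""

# pd.read_csv accepts a file path or any file-like object.
# parse_dates converts the listed columns to datetime while loading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["signup_date"])

print(df.head())      # first rows of the table
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns
```

With a real file, replace `io.StringIO(csv_text)` with the path to the CSV. The missing `age` value for the second row shows up in `.info()` as a lower non-null count, which is usually the first hint that cleaning is needed.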


## 2. Cleaning Data

- **Handling missing values** — drop or impute (mean/median/mode or constant).
- **Removing duplicates** — `df.drop_duplicates()`.
- **Fixing data types** — convert strings to `datetime`, numbers, or `category`.
- **Handling outliers** — inspect with boxplots, decide to keep/clip/remove.
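
The four cleaning steps above might look like this on a small table. The data is invented for illustration, and the choices made here (median and mode imputation, clipping at 1.5 times the interquartile range) are just one reasonable option among several; the right choice depends on the dataset.

```python
import numpy as np
import pandas as pd

# Invented data: one missing age, one missing port, one duplicated row,
# numbers stored as strings, and one suspicious age of 120.
raw = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 35.0, 120.0],
    "fare": ["7.25", "71.28", "7.92", "53.10", "53.10", "8.05"],
    "embarked": ["S", "C", None, "S", "S", "S"],
    "boarded": ["1912-04-10"] * 6,
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Fix data types: strings to numbers, dates, and categories.
clean["fare"] = pd.to_numeric(clean["fare"])
clean["boarded"] = pd.to_datetime(clean["boarded"])

# Impute missing values: mode for the categorical column, median for the numeric one.
clean["embarked"] = clean["embarked"].fillna(clean["embarked"].mode()[0]).astype("category")
clean["age"] = clean["age"].fillna(clean["age"].median())

# Handle outliers: clip ages to the 1.5 * IQR fences instead of dropping rows.
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["age_clipped"] = clean["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(clean)
```

Keeping the clipped values in a separate `age_clipped` column leaves the original values available for comparison, which is helpful while you are still deciding whether an extreme value is an error or a genuine observation.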


## 3. Understanding Variables

- **Distributions** — histograms and boxplots for numeric features.
- **Categorical analysis** — `value_counts()` and bar charts.
- **Correlation and covariance** — `df.corr(numeric_only=True)` and `df.cov(numeric_only=True)`, plus a heatmap of the correlations.
- **Feature relationships** — scatter plots, pairplots.
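
As a rough sketch of the numeric side of these checks, the snippet below computes category counts, a distribution summary, and the correlation and covariance matrices. The column names and values are invented for the example. Selecting numeric columns first with `select_dtypes` is one way to keep text columns out of the correlation calculation.

```python
import pandas as pd

# Invented data: two numeric features and one categorical feature.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 92],
    "group": ["a", "b", "a", "a", "b"],
})

# Categorical analysis: how often each category occurs.
counts = df["group"].value_counts()

# Distribution summary for one numeric feature (the numbers behind a histogram or boxplot).
weight_summary = df["weight_kg"].describe()

# Correlation and covariance are only defined for numeric columns.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation, values between -1 and 1
cov = numeric.cov()    # covariance, in the units of the original columns

print(counts)
print(weight_summary)
print(corr)
print(cov)
```

Correlation is unit-free, so it is easier to compare across feature pairs; covariance depends on the scale of each column, so it is harder to read at a glance.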


## 4. Visual EDA (Beginner-Friendly)

- **Matplotlib basics** — `plt.plot`, `plt.hist`, `plt.boxplot`.
- **Seaborn basics** — `sns.countplot`, `sns.boxplot`, `sns.heatmap`, `sns.pairplot`.
- **Heatmaps for correlations** — visualize numeric correlations.
- **Pairplots** — visualize pairwise relationships for small datasets.


## 5. EDA for ML Preparation

- **Train/test split & data leakage awareness** — use `train_test_split` and avoid leaking test information.
- **Scaling features** — `StandardScaler`, `MinMaxScaler`.
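- **Encoding categories** — `OneHotEncoder` / `pd.get_dummies()` for unordered categories, `OrdinalEncoder` for ordered ones; `LabelEncoder` is intended for the target column, not for input features.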
- **Feature importance** — quick check using tree-based models.


---
### Tiny example: First glance at a dataset

Run the cell below to load a sample dataset and print basic summaries.

    from IPython.display import display  # explicit import; already available inside Jupyter
    import pandas as pd
    import seaborn as sns

    # Load a small built-in dataset (downloaded on first use, so this needs internet access)
    df = sns.load_dataset('titanic')
    print('Shape:', df.shape)
    display(df.head())
    print('\nInfo:')
    df.info()  # prints its report directly and returns None, so no print() wrapper
    print('\nDescribe (numeric):')
    print(df.describe())

Output:

    Shape: (891, 15)

|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |

    Info:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 15 columns):
     #   Column       Non-Null Count  Dtype   
    ---  ------       --------------  -----   
     0   survived     891 non-null    int64   
     1   pclass       891 non-null    int64   
     2   sex          891 non-null    object  
     3   age          714 non-null    float64 
     4   sibsp        891 non-null    int64   
     5   parch        891 non-null    int64   
     6   fare         891 non-null    float64 
     7   embarked     889 non-null    object  
     8   class        891 non-null    category
     9   who          891 non-null    object  
     10  adult_male   891 non-null    bool    
     11  deck         203 non-null    category
     12  embark_town  889 non-null    object  
     13  alive        891 non-null    object  
     14  alone        891 non-null    bool    
    dtypes: bool(2), category(2), float64(2), int64(4), object(5)
    memory usage: 80.7+ KB

    Describe (numeric):
             survived      pclass         age     

### Suggested Learning Path (Beginner → Intermediate)

| Section | Time |
|--------|------|
| Basics of EDA | 30–45 min |
| Cleaning data | 45–60 min |
| Visual analysis | 60–90 min |
| EDA for ML prep | 45 min |
| Mini-project | 1–2 hours |


### Mini-Project Suggestion

Pick any dataset (Titanic, Iris, or your own CSV) and work through these steps:
1. Load the data
2. Check for missing values
3. Visualize the distributions
4. Explore the correlations
5. Write down five insights you discovered
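
A possible starting skeleton for the mini-project is sketched below. The `quick_eda` helper is a hypothetical name introduced for this example, not part of pandas, and the dataset is synthetic so the cell runs without downloads. Swap in your own data at step 1.

```python
import numpy as np
import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the summaries the mini-project steps ask for."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "missing": df.isna().sum(),      # step 2: missing values per column
        "summary": numeric.describe(),   # step 3: the numbers behind the distribution plots
        "skew": numeric.skew(),          # step 3: asymmetry of each numeric column
        "correlations": numeric.corr(),  # step 4: pairwise correlations
    }


# Step 1: load data. Synthetic stand-in here; with a real file use pd.read_csv(path).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "length": rng.normal(5.0, 1.0, size=100),
    "species": rng.choice(["a", "b", "c"], size=100),
})
df["width"] = 0.5 * df["length"] + rng.normal(0, 0.2, size=100)
df.loc[::10, "width"] = np.nan  # blank out every tenth value to simulate missing data

report = quick_eda(df)
for name, value in report.items():
    print(f"\n{name}:\n{value}")

# Step 5 is left to you: read the report and write down five insights.
```

In a notebook, calling `df.hist()` after this cell gives the plots for step 3. The numeric report is a complement to the charts, not a replacement for them.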


### Next steps / Options

- Expand any section into a hands-on lesson with step-by-step code.
- Convert this notebook into a guided exercise notebook with tests/quizzes.
- Export as `.py` or `.ipynb`.
