# Lesson 03: Data Exploration (EDA)

**What you'll learn:**
- Check data distribution with histograms
- Find correlations with heatmaps
- Detect outliers with box plots
- Check class balance

**EDA = Exploratory Data Analysis** (looking at your data before modeling)

---

## Section 1: Why Explore Data?

### READ

Before building a model, you need to understand your data:
- What do the values look like?
- Are there outliers (extreme values)?
- Which features might be useful for prediction?
- Is the data balanced (roughly equal numbers of samples in each class)?

**Good data exploration = Better models!**

### TRY IT

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for nicer plots (this style name needs Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8-whitegrid')

# Load our practice dataset
df = pd.read_csv('../datasets/tomatjus.csv')

# Quick overview
print("Dataset Overview:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"\nColumn names: {df.columns.tolist()}")

In [None]:
# Basic statistics
df.describe()

### EXPLAIN

What we see:
- `count`: Number of non-missing values
- `mean`: Average value
- `std`: Standard deviation (how spread out the values are)
- `min/max`: Smallest and largest values
- `25%/50%/75%`: Quartiles (50% = median)
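
To make the fields above concrete, here is a minimal sketch on a made-up column (the values are hypothetical, not from the lesson dataset). It shows that `count` skips missing entries and that a single extreme value pulls the `mean` far more than the median (`50%`).

```python
import pandas as pd

# Hypothetical toy column: one missing value and one extreme value
values = pd.Series([2.0, 4.0, 4.0, 5.0, None, 100.0], name="toy")

summary = values.describe()
print(summary)

# count ignores the missing entry, so it is 5 rather than 6
print("count :", summary["count"])
# The extreme value 100 drags the mean upward...
print("mean  :", summary["mean"])
# ...while the median stays with the bulk of the data
print("median:", summary["50%"])
```

A large gap between `mean` and `50%` in `df.describe()` is often an early hint of skew or outliers.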

---

## Section 2: Visualizing Distributions

### READ

A **histogram** shows how values are distributed:
- Tall bars = many values in that range
- Short bars = few values

Common shapes:
- **Normal (bell curve)**: Most values in the middle
- **Skewed**: Values bunched on one side
- **Uniform**: Values spread evenly
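
As a rough numeric companion to these shapes, the sketch below computes skewness for three synthetic samples (generated here for illustration only, not taken from the lesson dataset). Skewness near 0 suggests a symmetric shape, while a clearly positive value suggests a long tail to the right.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the numbers are repeatable

# Synthetic samples, one per shape described above
shapes = pd.DataFrame({
    "normal": rng.normal(loc=0.0, scale=1.0, size=10_000),
    "right_skewed": rng.exponential(scale=1.0, size=10_000),
    "uniform": rng.uniform(low=0.0, high=1.0, size=10_000),
})

# Series/DataFrame.skew() returns the sample skewness of each column
skewness = shapes.skew()
print(skewness.round(2))
```

Note that both the normal and the uniform sample have skewness close to 0, so this number cannot tell them apart. You still need the histogram to see the actual shape.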

### TRY IT

In [None]:
# Histogram of one feature
plt.figure(figsize=(8, 5))
plt.hist(df['pH'], bins=20, edgecolor='black', color='skyblue')
plt.xlabel('pH Value')
plt.ylabel('Count')
plt.title('Distribution of pH Values')
plt.show()

In [None]:
# Histograms for ALL numeric columns at once
df.hist(figsize=(12, 10), bins=20, edgecolor='black')
plt.tight_layout()
plt.show()

### EXPLAIN

What we learned:
- `bins=20` divides the range into 20 equal-width bins (one bar per bin)
- Some features (like `pH`) look normally distributed
- Some features (like `chlorides`) are right-skewed (values bunched on the left, with a long tail to the right)
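
To see what `bins=20` does without drawing anything, the following sketch bins a synthetic sample with `numpy.histogram`. The sample is hypothetical (pH-like numbers generated for illustration); the point is only that 20 bins means 20 counts and 21 bin edges of equal width.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pH-like values, generated only for this illustration
sample = rng.normal(loc=4.0, scale=0.3, size=500)

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(sample, bins=20)

print("number of bars :", len(counts))
print("number of edges:", len(edges))
print("total counted  :", counts.sum())

# All bins have the same width
widths = np.diff(edges)
print("bin width      :", widths[0].round(3))
```

Fewer bins give a smoother but coarser picture, and more bins show more detail but also more noise, so it is worth trying a couple of values.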

---

## Section 3: Checking Class Balance

### READ

For classification, check if classes are **balanced**:
- **Balanced**: Roughly equal samples per class
- **Imbalanced**: Some classes have many more samples

**Why it matters:**
Imbalanced data can cause problems: the model might simply predict the majority class every time and ignore rare classes, while still reporting high accuracy.
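
The sketch below illustrates this problem with made-up labels (95 "normal" and 5 "attack", chosen only for the example). A "model" that always predicts the majority class scores high accuracy yet never finds a single rare-class sample.

```python
import pandas as pd

# Hypothetical labels: 95 "normal" samples and 5 "attack" samples
labels = pd.Series(["normal"] * 95 + ["attack"] * 5)

# A naive baseline that always predicts the most common class
majority_class = labels.value_counts().idxmax()
predictions = pd.Series([majority_class] * len(labels))

# Overall accuracy looks impressive
accuracy = (predictions == labels).mean()
print(f"Accuracy: {accuracy:.2f}")

# But the fraction of real attacks that were caught is zero
attack_mask = labels == "attack"
attack_recall = (predictions[attack_mask] == "attack").mean()
print(f"Attacks detected: {attack_recall:.2f}")
```

This is why accuracy alone can be misleading on imbalanced data, and why checking `value_counts()` before modeling is worth the few seconds it takes.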

### TRY IT

In [None]:
# Count samples per class
print("Class distribution:")
class_counts = df['quality'].value_counts()
print(class_counts)

# Calculate percentages (multiply first, then round, to avoid floating-point noise)
print("\nPercentages:")
print((df['quality'].value_counts(normalize=True) * 100).round(1))

In [None]:
# Visualize class distribution
plt.figure(figsize=(8, 5))
class_counts.plot(kind='bar', color=['skyblue', 'lightgreen', 'salmon'])
plt.xlabel('Quality')
plt.ylabel('Count')
plt.title('Class Distribution')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Check YOUR assignment dataset
nsl = pd.read_csv('../datasets/NSL_KDD/NSL_ppTrain.csv')

print("NSL-KDD Class Distribution:")
nsl_counts = nsl['atakcat'].value_counts()
print(nsl_counts)

# Visualize
plt.figure(figsize=(10, 5))
nsl_counts.plot(kind='bar', color='steelblue')
plt.xlabel('Attack Category')
plt.ylabel('Count')
plt.title('NSL-KDD: HIGHLY IMBALANCED!')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"\nImbalance ratio: benign has {nsl_counts['benign'] // nsl_counts['u2r']}x more samples than u2r!")

### EXPLAIN

**Tomato juice dataset:** Mildly imbalanced (Average has about 3x as many samples as Special)

**NSL-KDD (your assignment):** HIGHLY imbalanced!
- benign: ~67,000 samples
- u2r: only 52 samples

This is why we need Lesson 09 (Handling Imbalance)!

---

## Section 4: Finding Correlations

### READ

**Correlation** measures how strongly two features move together in a straight-line (linear) way:
- **+1**: Perfect positive (both go up together)
- **-1**: Perfect negative (one goes up, the other goes down)
- **0**: No linear relationship

**Why it matters:**
- Features correlated with the target are useful for prediction
- Features that are highly correlated with each other might be redundant
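
Here is a small sketch of the three reference values using made-up series. It also includes one caveat worth knowing: the default correlation in pandas (Pearson) only captures straight-line relationships, so a perfectly predictable but curved relationship can still score 0.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Goes up exactly in step with x
rises_with_x = 2 * x + 1
# Goes down exactly in step with x
falls_with_x = -3 * x + 10

corr_up = x.corr(rises_with_x)
corr_down = x.corr(falls_with_x)
print(f"rises with x: {corr_up:.2f}")
print(f"falls with x: {corr_down:.2f}")

# A curved relationship: squared is fully determined by centered,
# but there is no straight-line trend between them
centered = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
squared = centered ** 2
corr_curve = centered.corr(squared)
print(f"curved      : {corr_curve:.2f}")
```

A correlation near 0 therefore means "no linear relationship", not "no relationship at all". A scatter plot is a good follow-up when a feature you expect to matter shows a weak correlation.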

### TRY IT

In [None]:
# Get only numeric columns (any integer or float dtype)
numeric_df = df.select_dtypes(include='number')

# Calculate correlation matrix
correlation = numeric_df.corr()

# Show correlation matrix
print("Correlation Matrix (first 5 columns):")
print(correlation.iloc[:5, :5].round(2))

In [None]:
# Heatmap visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

### EXPLAIN

Reading the heatmap:
- **Red**: Positive correlation (both increase together)
- **Blue**: Negative correlation (one increases, other decreases)
- **White/pale**: Little to no correlation

What to look for:
- The diagonal is always 1.0 (each feature correlates perfectly with itself)
- Strong colors off the diagonal mark pairs of features that are strongly related
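
With many features, spotting strong off-diagonal cells by eye gets tedious. The sketch below is one possible way to list feature pairs by correlation strength, shown on a made-up three-column table (columns `a`, `b`, `c` are hypothetical). It keeps only the upper triangle of the matrix so that each pair appears once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is nearly 2 * a, c tends to fall as a rises
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# True only above the diagonal (k=1), so each pair is kept exactly once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked cells become NaN; stack() turns the matrix into (row, column) pairs
pairs = corr.where(upper).stack().dropna()

# Sort by absolute value so strong negative correlations rank high too
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest.round(2))
```

On a real dataset you would replace `toy` with your numeric columns, for example the `numeric_df` built earlier in this section.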

---

## Section 5: Detecting Outliers

### READ

**Outliers** are extreme values that don't fit the pattern of the rest of the data.
They can distort summary statistics such as the mean and can mislead some ML models.

**Box plots** help visualize outliers:
- The box shows the middle 50% of data
- The line in the box is the median
- Dots outside the "whiskers" are points flagged as possible outliers (by default, more than 1.5 × IQR away from the box)

### TRY IT

In [None]:
# Box plot for one feature (dropna() because missing values would leave the box empty)
plt.figure(figsize=(8, 5))
plt.boxplot(df['pulp'].dropna())
plt.ylabel('Pulp Value')
plt.title('Box Plot of Pulp - Look for dots (outliers)!')
plt.show()

In [None]:
# Box plots for multiple features
plt.figure(figsize=(14, 6))
df.drop('quality', axis=1, errors='ignore').boxplot()
plt.xticks(rotation=45)
plt.title('Box Plots for All Features - Dots are Outliers')
plt.tight_layout()
plt.show()

In [None]:
# Find outliers using IQR method
def count_outliers(column):
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return ((column < lower) | (column > upper)).sum()

print("Outlier counts per column:")
for col in df.select_dtypes(include='number').columns:
    outliers = count_outliers(df[col])
    if outliers > 0:
        print(f"  {col}: {outliers} outliers")

### EXPLAIN

**IQR (Interquartile Range) method:**
- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers are below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
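
To follow the arithmetic step by step, here is the same rule applied to ten made-up numbers (hypothetical values, chosen so that one of them is clearly extreme).

```python
import pandas as pd

# Hypothetical values: nine ordinary points and one extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # width of the middle 50%

lower = q1 - 1.5 * iqr       # anything below this is flagged
upper = q3 + 1.5 * iqr       # anything above this is flagged

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower bound={lower}, upper bound={upper}")

# Keep only the values that fall outside the bounds
flagged = values[(values < lower) | (values > upper)]
print("Flagged values:", flagged.tolist())
```

Here Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the bounds are -3.5 and 14.5. Only the value 100 falls outside them.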

**What to do with outliers:**
- Sometimes they're real (keep them)
- Sometimes they're errors (remove them)
- Sometimes we cap them (replace values beyond a chosen limit, such as the IQR bounds, with that limit)
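
As one possible way to do the capping option, the sketch below clips a column to its IQR bounds with `Series.clip`. The helper name `cap_outliers` and the sample values are made up for this example, and using the IQR bounds as the limits is an assumption; other limits (such as fixed percentiles) are equally valid choices.

```python
import pandas as pd

def cap_outliers(column: pd.Series) -> pd.Series:
    """Hypothetical helper: clip values to the IQR bounds instead of dropping rows."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    # Values below the lower bound become the lower bound,
    # values above the upper bound become the upper bound
    return column.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical values with one extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
capped = cap_outliers(values)

print("Before:", values.tolist())
print("After: ", capped.tolist())
```

Capping keeps every row, which matters when the dataset is small, but it also changes real values, so it is best used only after you have decided the extremes are not meaningful.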

---

## Section 6: Quick EDA Checklist

Always do these steps when exploring new data:

| Step | Code | Why |
|------|------|-----|
| Shape | `df.shape` | Know size of data |
| First rows | `df.head()` | See what data looks like |
| Data types | `df.info()` | Check for wrong types |
| Statistics | `df.describe()` | Get summary stats |
| Missing values | `df.isnull().sum()` | Find gaps in data |
| Class balance | `df['target'].value_counts()` | Check for imbalance |
| Histograms | `df.hist()` | See distributions |
| Correlations | `df.corr(numeric_only=True)` | Find relationships |
| Box plots | `df.boxplot()` | Find outliers |
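
If you find yourself typing the same checks for every new dataset, they can be wrapped in a small function. The sketch below covers the non-plotting steps of the checklist. The function name `quick_eda` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd

def quick_eda(df: pd.DataFrame, target=None) -> pd.Series:
    """Hypothetical helper: print the text-based checklist steps for df."""
    print("Shape:", df.shape)
    print("\nFirst rows:")
    print(df.head())
    print("\nData types:")
    df.info()
    print("\nStatistics:")
    print(df.describe())
    missing = df.isnull().sum()
    print("\nMissing values per column:")
    print(missing)
    if target is not None:
        print(f"\nClass balance for '{target}':")
        print(df[target].value_counts())
    print("\nCorrelations (numeric columns only):")
    print(df.corr(numeric_only=True).round(2))
    # Return the missing-value counts so the caller can act on them
    return missing

# Made-up rows that reuse column names from this lesson
toy = pd.DataFrame({
    "pH": [4.1, 4.3, None, 4.0],
    "pulp": [10, 12, 11, 50],
    "quality": ["Average", "Average", "Special", "Average"],
})
missing = quick_eda(toy, target="quality")
```

The plotting steps (`df.hist()` and `df.boxplot()`) are left out on purpose, since plots usually need per-dataset tweaks such as figure size and label rotation.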

---

## Practice Exercises

In [None]:
# Exercise 1: Load the churn dataset and check for missing values
churn = pd.read_csv('../datasets/churn_modelling.csv')

# YOUR CODE HERE:


In [None]:
# Exercise 2: Check the class distribution of 'Exited' column - is it balanced?

# YOUR CODE HERE:


In [None]:
# Exercise 3: Create a histogram of the 'Age' column

# YOUR CODE HERE:


In [None]:
# Exercise 4: Which feature is most correlated with 'Balance'?

# YOUR CODE HERE:


---

## Next Lesson

In **Lesson 04: Data Preprocessing**, you'll learn:
- How to encode categorical variables (text to numbers)
- How to scale features
- How to split data into train/test
- The correct order of preprocessing steps