# Data Encoding & Transformation

This notebook demonstrates three critical data transformations for machine learning:
1. **Date Conversion** - String to Datetime to Integer
2. **Label Encoding** - Ordinal data (has order)
3. **One-Hot Encoding** - Nominal data (no order)

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [None]:
df = pd.read_csv('udemy_courses.csv')
df.head()

---
## Part 1: Date Conversion

**Problem:** Dates stored as strings → ML models need numbers

**Solution:** Convert to datetime, extract year and month

In [None]:
# Step 1: Check original data type
print("Original data type:", df['published_timestamp'].dtype)
print("Sample values:")
print(df['published_timestamp'].head())

In [None]:
# Step 2: Convert to datetime
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])
print("New data type:", df['published_timestamp'].dtype)

In [None]:
# Step 3: Extract year and month as integers
df['year'] = df['published_timestamp'].dt.year
df['month'] = df['published_timestamp'].dt.month

print("Year column:")
print(df['year'].head())
print("\nMonth column:")
print(df['month'].head())
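
The same three steps can be tried on a small standalone Series. This is a minimal sketch: the sample strings are hypothetical (not taken from `udemy_courses.csv`), and `errors="coerce"` and `utc=True` are optional additions that the cells above do not use. It shows what happens when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical sample values, not taken from udemy_courses.csv
raw = pd.Series(["2017-01-15T08:30:00Z", "2016-11-02T17:05:00Z", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

dates = pd.DataFrame({"published_timestamp": parsed})
dates["year"] = dates["published_timestamp"].dt.year
dates["month"] = dates["published_timestamp"].dt.month

# Rows that failed to parse show up as NaT / NaN,
# which makes the year and month columns float instead of int
print(dates)
print("Unparseable rows:", int(dates["published_timestamp"].isna().sum()))
```

If the missing-row count is greater than zero, those rows probably need to be dropped or filled before the year and month columns can be cast back to integers.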

---
## Part 2: Label Encoding (Ordinal Data)

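**Problem:** Levels such as "Beginner", "Intermediate", "Expert" have a natural order

**Solution:** `LabelEncoder` assigns integer codes 0, 1, 2, 3... Note that it numbers the classes in alphabetical order, which does not necessarily match the natural hierarchy, so always check the mapping and use an explicit ordered mapping when the order matters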

In [None]:
# Step 1: Check unique levels
print("Unique levels:")
print(df['level'].unique())

In [None]:
# Step 2: Apply LabelEncoder
# Note: codes follow the alphabetical order of the level names, not their difficulty
le = LabelEncoder()
df['level_encoded'] = le.fit_transform(df['level'])

print("Encoding mapping:")
for code, level in enumerate(le.classes_):
    print(f"{level:20} → {code}")

In [None]:
# Step 3: Compare original vs encoded
print("\nSide-by-side comparison:")
print(df[['level', 'level_encoded']].head(10))
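
Because `LabelEncoder` sorts class names alphabetically, its codes may not follow the intended ranking. The sketch below compares it with an explicit ordering built with `pd.Categorical`. The level names and the `level_order` list are assumptions for illustration; the actual values in `df['level']` should be read from the `unique()` output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical level names, for illustration only
levels = pd.DataFrame({"level": ["Beginner", "Expert", "Intermediate", "Beginner"]})

# LabelEncoder sorts class names alphabetically before numbering them,
# so "Expert" gets a lower code than "Intermediate"
le = LabelEncoder()
levels["label_encoded"] = le.fit_transform(levels["level"])

# Explicit order (an assumed ranking): Beginner < Intermediate < Expert
level_order = ["Beginner", "Intermediate", "Expert"]
ordered = pd.Categorical(levels["level"], categories=level_order, ordered=True)
levels["ordinal_encoded"] = ordered.codes

print(list(le.classes_))
print(levels)
```

With `pd.Categorical`, any value missing from `level_order` receives the code -1, which makes unexpected categories easy to spot.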

---
## Part 3: One-Hot Encoding (Nominal Data)

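**Problem:** "Web Dev", "Business", "Music" → no natural order; Music is not "greater than" Business

**Solution:** Create one binary column (0 or 1) per category. With `drop_first=True`, the first category gets no column of its own and is represented by zeros in all the others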

In [None]:
# Step 1: Check unique subjects
print("Unique subjects:")
print(df['subject'].unique())

In [None]:
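# Step 2: Apply one-hot encoding
# drop_first=True omits the first category (it becomes the all-zeros baseline),
# so k subjects produce k-1 columns; dtype=int gives 0/1 instead of True/False
df_encoded = pd.get_dummies(df, columns=['subject'], drop_first=True, dtype=int)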

print("New columns created:")
new_cols = [col for col in df_encoded.columns if col.startswith('subject_')]
print(new_cols)

In [None]:
# Step 3: View encoded data
print("\nOne-hot encoded columns:")
print(df_encoded[new_cols].head(10))
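
To see what `drop_first=True` changes, it may help to run `pd.get_dummies` both ways on a tiny frame. This is a sketch with hypothetical subject names, not the dataset's actual values; `dtype=int` is used so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical subject names, for illustration only
toy = pd.DataFrame({"subject": ["Web Dev", "Business", "Music", "Business"]})

# One column per category
full = pd.get_dummies(toy, columns=["subject"], dtype=int)

# First category (alphabetically, "Business") is dropped and becomes the baseline
reduced = pd.get_dummies(toy, columns=["subject"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the "Business" rows contain zeros in every column. Dropping one column this way avoids redundant, perfectly correlated features, which matters mostly for linear models; tree-based models usually work with either form.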

---
## Summary: When to Use Each Method

| Data Type | Example | Method | Why? |
|-----------|---------|--------|------|
| **Dates** | "2017-01-15" | `pd.to_datetime()` + `.dt.year` | Convert text to numbers ML can use |
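| **Ordinal** | Beginner → Expert | `LabelEncoder` (verify the mapping) or an explicit ordered mapping | Has natural order; `LabelEncoder` numbers classes alphabetically, so the codes may not follow it |
| **Nominal** | Subjects, Categories | `pd.get_dummies()` | No order; avoids implying a false ranking between categories |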