# Raw Data vs ML-Ready Data

This notebook walks through the transformation of a human-readable dataset into a numerical, machine-readable one.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

## Step 1: View Raw Data (Human Readable)

In [None]:
df_raw = pd.read_csv('udemy_courses.csv')
print("Raw Data Sample:")
print(df_raw[['course_title', 'subject', 'level', 'price', 'published_timestamp']].head())

## Step 2: Apply Transformations to Make It ML-Ready

We'll transform the raw data using:
- TF-IDF vectorization for text (course_title)
- Label Encoding for the level category (integer codes are assigned in alphabetical order of the labels, not by difficulty)
- One-Hot Encoding for nominal categories (subject)
- DateTime conversion for published_timestamp
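
A caveat on label encoding: because `LabelEncoder` numbers labels in sorted order, the resulting codes only reflect a ranking if the alphabetical order happens to match it. The sketch below contrasts that with an explicit mapping. The level names and their ordering here are hypothetical and chosen for illustration; they are not read from `udemy_courses.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical level labels, for illustration only
levels = pd.Series(["Beginner", "Expert", "Intermediate", "Beginner"])

# LabelEncoder sorts the labels alphabetically before numbering them,
# so "Expert" (1) ends up below "Intermediate" (2)
alphabetical_codes = LabelEncoder().fit_transform(levels)

# An explicit mapping keeps the intended ranking (assumed order)
level_order = {"Beginner": 0, "Intermediate": 1, "Expert": 2}
ordinal_codes = levels.map(level_order)

print(list(alphabetical_codes))
print(list(ordinal_codes))
```

For tree-based models the difference often matters little, but for linear or distance-based models an explicit mapping is usually the safer choice.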

In [None]:
# 1. TF-IDF Vectorization for course_title
vectorizer = TfidfVectorizer(max_features=5, stop_words='english')
title_tfidf = vectorizer.fit_transform(df_raw['course_title'])
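# Size the column names from the matrix itself: the vocabulary can hold fewer than max_features terms
title_df = pd.DataFrame(
    title_tfidf.toarray(),
    columns=[f'title_tfidf_{i}' for i in range(title_tfidf.shape[1])],
    index=df_raw.index,  # keep rows aligned with df_raw for the concat below
)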

# 2. Label Encoding for 'level' (codes follow the alphabetical order of the labels)
label_encoder = LabelEncoder()
df_raw['level_encoded'] = label_encoder.fit_transform(df_raw['level'])

# 3. One-Hot Encoding for 'subject'
# dtype=int yields 0/1 columns; recent pandas versions default to True/False
subject_dummies = pd.get_dummies(df_raw['subject'], prefix='subject', dtype=int)

# 4. DateTime conversion for published_timestamp
# Parse into a separate Series so the raw column stays untouched for the before/after comparison
published = pd.to_datetime(df_raw['published_timestamp'])
df_raw['year'] = published.dt.year
df_raw['month'] = published.dt.month

# Combine all features
df_ml_ready = pd.concat([
    title_df,
    df_raw[['level_encoded', 'year', 'month', 'price']],
    subject_dummies
], axis=1)

print("\nML-Ready Data Sample (Numerical Only):")
print(df_ml_ready.head())

## Step 3: Side-by-Side Comparison

Let's compare the same course (row 0) before and after transformation:

In [None]:
print("BEFORE (Human Readable):")
print(f"Course: {df_raw.iloc[0]['course_title']}")
print(f"Subject: {df_raw.iloc[0]['subject']}")
print(f"Level: {df_raw.iloc[0]['level']}")
print(f"Price: ${df_raw.iloc[0]['price']}")
print(f"Published: {df_raw.iloc[0]['published_timestamp']}")

print("\n" + "="*60)

print("\nAFTER (Machine Readable):")
print(df_ml_ready.iloc[0].to_dict())

## Summary: Why This Transformation Matters

**Most machine learning algorithms require numerical input.** They cannot operate directly on text like "Web Development" or date strings like "2017-07-05".

By transforming:
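- **Text → Numbers**: TF-IDF scores each word by how often it appears in a title, discounted by how common it is across all titles
- **Categories → Numbers**: Label encoding maps each category to an integer code; one-hot encoding maps it to 0/1 indicator columns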
- **Dates → Numbers**: Year/month become features the model can use

The result is a fully numerical dataset ready for ML models like RandomForest to learn patterns and make predictions!
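
Before handing a frame to a model, it can help to confirm that nothing non-numerical slipped through. The helper below is a hypothetical convenience function (`non_numeric_columns` is not part of pandas), shown on a small made-up frame; the same call could be applied to `df_ml_ready`.

```python
import pandas as pd

def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that are neither numeric nor boolean."""
    return list(df.select_dtypes(exclude=["number", "bool"]).columns)

# Made-up frame for illustration: one text column left unencoded
demo = pd.DataFrame({
    "price": [20, 50],
    "year": [2016, 2017],
    "subject": ["Web", "Music"],
})

print(non_numeric_columns(demo))
```

An empty list means every column is numeric or boolean.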