# NLP: Extracting Value from Text

This notebook demonstrates how to convert course titles into numerical features using Natural Language Processing (NLP).

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
df = pd.read_csv('udemy_courses.csv')
print("Sample course titles:")
print(df['course_title'].head())

---
## Step 1: Initialize TF-IDF Vectorizer

**TF-IDF** = Term Frequency - Inverse Document Frequency
- Converts text into numerical vectors
- Weights each word by how often it appears in a title, discounted by how many titles contain it (so rare words score higher)
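
To make the weighting concrete, here is a minimal pure-Python sketch of the calculation. It assumes the smoothed IDF formula and L2 normalization that scikit-learn uses by default, runs on three hypothetical toy titles (not rows from `udemy_courses.csv`), and skips stop-word removal for brevity. The helper names `idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy titles, not taken from udemy_courses.csv
titles = [
    "python for beginners",
    "python data science",
    "guitar for beginners",
]

docs = [t.split() for t in titles]
n_docs = len(docs)

# Document frequency: the number of titles each word appears in
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    # Term count times IDF, then scaled so the vector has unit L2 length
    counts = Counter(doc)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(docs[1])  # "python data science"
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In this toy corpus, `data` appears in one title while `python` appears in two, so `data` receives the larger weight within the same title.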

In [None]:
vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
print("Vectorizer created!")
print(f"Max features: {vectorizer.max_features}")
print("Stop words: removes common English words like 'the', 'a', 'is'")

## Step 2: Transform Titles to Vectors

In [None]:
# fillna('') guards against missing titles, which the vectorizer cannot process
title_vectors = vectorizer.fit_transform(df['course_title'].fillna(''))
print(f"Shape: {title_vectors.shape}")
print(f"Type: {type(title_vectors)}")

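## Step 3: View Extracted Keywords

In [None]:
# get_feature_names_out() returns the vocabulary in alphabetical order, not ranked by weight
feature_names = vectorizer.get_feature_names_out()
print("First 20 keywords (alphabetical order):")
print(list(feature_names[:20]))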

## Step 4: Convert to DataFrame

In [None]:
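# Size the column list from the matrix itself: the vocabulary can hold fewer than max_features terms
title_df = pd.DataFrame(title_vectors.toarray(), columns=[f"txt_{i}" for i in range(title_vectors.shape[1])])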
print("Text features DataFrame:")
print(title_df.head())

---
## Step 5: Create Numerical Features

Now prepare the numerical features from the original dataset.

In [None]:
numerical_df = df[['price', 'num_reviews', 'num_lectures']].reset_index(drop=True)
print("Numerical features:")
print(numerical_df.head())

## Step 6: Merge Features Using concat()

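**The Magic Move**: Combine numerical + text features side by side with `pd.concat(..., axis=1)`. Rows are matched by index label, which is why `numerical_df` was given a fresh index in Step 5.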

In [None]:
X_combined = pd.concat([numerical_df, title_df], axis=1)
print(f"Combined dataset shape: {X_combined.shape}")
print("\nFirst few rows:")
print(X_combined.head())

In [None]:
print("Column breakdown:")
print(f"Numerical features: {len(numerical_df.columns)}")
print(f"Text features: {len(title_df.columns)}")
print(f"Total features: {len(X_combined.columns)}")
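
The `reset_index(drop=True)` call in Step 5 matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below uses two hypothetical toy frames (`left` and `right` are illustrative names) to show what can go wrong when the indexes differ, for example after filtering rows.

```python
import pandas as pd

# Hypothetical toy frames: 'left' keeps a filtered, non-contiguous index
left = pd.DataFrame({"price": [20, 50, 95]}, index=[0, 2, 5])
right = pd.DataFrame({"txt_0": [0.1, 0.2, 0.3]})  # default index 0, 1, 2

# concat(axis=1) aligns on index labels, so mismatched labels produce NaN cells
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)
print(misaligned.isna().sum().sum(), "missing values")

# Resetting the index first lines the rows up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape)
```

A quick check such as `X_combined.isna().sum().sum()` after the merge is one way to confirm that no rows were misaligned.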

---
## Summary: The Power of Text Features

✅ **Physical Stats**: Price, Reviews, Lectures (what the course IS)  
✅ **Psychological Stats**: Title keywords (what the course PROMISES)  
✅ **Combined Dataset**: Ready for ML model training!

This "Super-Dataset" captures both objective data AND marketing language.