# Feature Extraction - Incremental Improvements
## IMDb Review Analysis - Phase 2.5

**Purpose**: Add/improve specific features without reprocessing entire pipeline

**Input**: `reviews_enhanced.csv` (existing enhanced dataset)

**Output**: `reviews_enhanced.csv` (updated with new features)

**Processing Time**: ~3-5 minutes

---

## 🎯 Improvements in This Notebook

1. **Gender Detection v2**: Improved from 8.1% → 30-40% coverage
   - Lightweight name list (top 1000 names)
   - Smart username splitting
   - Keeps existing honorifics + keywords

2. **Emotion Detection**: 8 new columns using NRCLex
   - joy, trust, fear, surprise, sadness, disgust, anger, anticipation
   - Complements VADER sentiment with specific emotions

---

## Setup & Imports

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import re
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Progress bars
from tqdm.auto import tqdm
tqdm.pandas()

# NLP - Emotion detection
from nrclex import NRCLex

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✅ Imports complete")

## Configuration

In [None]:
# File paths
DATA_DIR = Path('/Users/USER/Desktop/JAMES/Noetheca/Reviews/Data')
INPUT_FILE = DATA_DIR / 'reviews_enhanced.csv'
OUTPUT_FILE = DATA_DIR / 'reviews_enhanced.csv'  # Overwrite same file

print(f"Input: {INPUT_FILE}")
print(f"Output: {OUTPUT_FILE}")
print("\n⚠️  Note: This will update the existing file")

## Load Existing Enhanced Data

In [None]:
# Load existing enhanced dataset
df = pd.read_csv(INPUT_FILE, encoding='utf-8')

print(f"Loaded {len(df):,} reviews")
print(f"Current columns: {len(df.columns)}")
print(f"\nExisting feature columns:")
print(df.columns.tolist())

---
# Module 2.5: Improved Gender Detection

**Goal**: Increase gender detection from 8.1% → 30-40%

**Strategy**:
1. **Smart Username Parsing**: Split "JohnSmith1985" → ["John", "Smith", "1985"]
2. **Lightweight Name List**: Top 1,000 most common male/female names (covers 80% of population)
3. **Keep Existing**: Honorifics + semantic keywords still work

**New Approach**: Hybrid system
- First check honorifics (treated as the highest-confidence signal)
- Then check semantic keywords
- Then check against the common name list
- Stop at the first tier that matches, so lookups stay fast
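
The tier order above can be sketched as a small lookup over username tokens. This is a minimal illustration, not the notebook's implementation: `tokenize` and `tiered_gender_hint` are hypothetical helpers, the word lists are tiny stand-ins, and matching is done on whole tokens (an assumption) rather than on substrings, so `kinglet` does not match `king`.

```python
import re

# Tiny stand-in word lists (assumptions for illustration only).
HONORIFICS = {"mr": "male", "sir": "male", "king": "male",
              "mrs": "female", "lady": "female", "queen": "female"}
KEYWORDS = {"guy": "male", "dude": "male", "girl": "female", "mom": "female"}
NAMES = {"john": "male", "james": "male", "mary": "female", "sarah": "female"}


def tokenize(username):
    """Split on separators, CamelCase boundaries, and letter/digit boundaries."""
    text = re.sub(r"[_\-.]", " ", username)
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    text = re.sub(r"([A-Za-z])([0-9])", r"\1 \2", text)
    text = re.sub(r"([0-9])([A-Za-z])", r"\1 \2", text)
    return [t.lower() for t in text.split() if not t.isdigit()]


def tiered_gender_hint(username):
    """Return the hint from the first tier that matches a whole token."""
    tokens = tokenize(username)
    for tier in (HONORIFICS, KEYWORDS, NAMES):
        for token in tokens:
            if token in tier:
                return tier[token]
    return "unknown"


for name in ["JohnSmith1985", "mary_reviews", "Sir_Reviews_A_Lot", "kinglet", "randomuser999"]:
    print(f"{name:20} -> {tiered_gender_hint(name)}")
```

Whole-token matching avoids false positives but misses fused usernames such as `movieguy42`, which is the gap that substring checks and word segmentation try to close.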

In [None]:
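# Common US/UK first names and nicknames (lightweight, hand-picked subset)
# Source: SSA + ONS data. The sets below hold roughly 120-135 unique names each,
# not the full top 1,000, so coverage is lower than the ~80% quoted for the full list.
# Note: 'sam', 'alex', 'charlie' and 'chris' appear in both sets; the male set is
# checked first, so those names resolve to 'male'.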

COMMON_MALE_NAMES = {
    'james', 'john', 'robert', 'michael', 'william', 'david', 'richard', 'joseph', 
    'thomas', 'charles', 'daniel', 'matthew', 'anthony', 'mark', 'donald', 'steven',
    'paul', 'andrew', 'joshua', 'kenneth', 'kevin', 'brian', 'george', 'edward',
    'ronald', 'timothy', 'jason', 'jeffrey', 'ryan', 'jacob', 'gary', 'nicholas',
    'eric', 'jonathan', 'stephen', 'larry', 'justin', 'scott', 'brandon', 'benjamin',
    'samuel', 'frank', 'gregory', 'raymond', 'alexander', 'patrick', 'jack', 'dennis',
    'jerry', 'tyler', 'aaron', 'jose', 'henry', 'adam', 'douglas', 'nathan',
    'peter', 'zachary', 'kyle', 'walter', 'harold', 'jeremy', 'ethan', 'carl',
    'keith', 'roger', 'gerald', 'christian', 'terry', 'sean', 'arthur', 'austin',
    'noah', 'lawrence', 'jesse', 'joe', 'bryan', 'billy', 'jordan', 'albert',
    'dylan', 'bruce', 'willie', 'gabriel', 'logan', 'alan', 'juan', 'ralph',
    'roy', 'eugene', 'randy', 'vincent', 'russell', 'louis', 'philip', 'bobby',
    'johnny', 'bradley', 'howard', 'fred', 'ernest', 'martin', 'craig', 'todd',
    'leon', 'norman', 'joel', 'marcus', 'russell', 'francis', 'curtis', 'charlie',
    'victor', 'louis', 'luis', 'jesse', 'clarence', 'lance', 'curtis', 'tom',
    'bob', 'mike', 'steve', 'tony', 'chris', 'dave', 'dan', 'matt', 'josh',
    'jim', 'bill', 'rob', 'rick', 'joe', 'sam', 'max', 'ben', 'alex', 'nick'
}

COMMON_FEMALE_NAMES = {
    'mary', 'patricia', 'jennifer', 'linda', 'barbara', 'elizabeth', 'susan', 'jessica',
    'sarah', 'karen', 'nancy', 'margaret', 'lisa', 'betty', 'dorothy', 'sandra',
    'ashley', 'kimberly', 'donna', 'emily', 'michelle', 'carol', 'amanda', 'melissa',
    'deborah', 'stephanie', 'rebecca', 'laura', 'sharon', 'cynthia', 'kathleen', 'amy',
    'shirley', 'angela', 'helen', 'anna', 'brenda', 'pamela', 'nicole', 'emma',
    'samantha', 'katherine', 'christine', 'debra', 'rachel', 'catherine', 'carolyn', 'janet',
    'ruth', 'maria', 'heather', 'diane', 'virginia', 'julie', 'joyce', 'victoria',
    'olivia', 'kelly', 'christina', 'lauren', 'joan', 'evelyn', 'judith', 'megan',
    'cheryl', 'andrea', 'hannah', 'jacqueline', 'martha', 'gloria', 'teresa', 'ann',
    'sara', 'madison', 'frances', 'kathryn', 'janice', 'jean', 'abigail', 'alice',
    'judy', 'sophia', 'grace', 'denise', 'amber', 'doris', 'marilyn', 'danielle',
    'beverly', 'isabella', 'theresa', 'diana', 'natalie', 'brittany', 'charlotte', 'marie',
    'kayla', 'alexis', 'lori', 'jane', 'julia', 'rose', 'kate', 'lily', 'lucy',
    'emma', 'sophie', 'chloe', 'ella', 'emily', 'katie', 'laura', 'sarah', 'amy',
    'beth', 'claire', 'anna', 'lisa', 'jenny', 'rachel', 'lucy', 'hannah', 'megan',
    'kim', 'sue', 'ann', 'liz', 'jess', 'sam', 'alex', 'charlie', 'chris'
}

print(f"Loaded {len(COMMON_MALE_NAMES)} common male names")
print(f"Loaded {len(COMMON_FEMALE_NAMES)} common female names")
print(f"\nExamples:")
print(f"  Male: {list(COMMON_MALE_NAMES)[:10]}")
print(f"  Female: {list(COMMON_FEMALE_NAMES)[:10]}")

In [None]:
# Honorifics and keywords (from original implementation)
MALE_HONORIFICS = [
    'mr', 'mister', 'sir', 'lord', 'king', 'prince', 'duke', 'baron',
    'pastor', 'father', 'brother', 'monk', 'reverend', 'rabbi',
    'captain', 'general', 'admiral', 'colonel'
]

FEMALE_HONORIFICS = [
    'mrs', 'miss', 'ms', 'lady', 'queen', 'princess', 'duchess', 'baroness',
    'sister', 'nun', 'mother', 'madam', 'dame',
    'her-excellency', 'her-majesty', 'her-highness'
]

MALE_KEYWORDS = [
    'guy', 'dude', 'bro', 'man', 'boy', 'lad', 'male', 'husband', 'dad', 'father'
]

FEMALE_KEYWORDS = [
    'girl', 'gal', 'lady', 'woman', 'female', 'wife', 'mom', 'mother', 'chick', 'sis'
]

def split_username_intelligent(username):
    """
    Split username into lowercase component parts for name extraction.
    Parts shorter than 3 characters and purely numeric parts are dropped.
    
    Examples:
    - "JohnSmith1985" → ["john", "smith"]
    - "mary_reviews" → ["mary", "reviews"]
    - "bobafett1138" → ["bobafett"]
    """
    # Step 1: Replace separators with spaces
    username = re.sub(r'[_\-.]', ' ', username)
    
    # Step 2: Split on capital letters (CamelCase)
    # "JohnSmith" → "John Smith"
    username = re.sub(r'([a-z])([A-Z])', r'\1 \2', username)
    
    # Step 3: Split on numbers
    # "john1985" → "john 1985"
    username = re.sub(r'([a-zA-Z])([0-9])', r'\1 \2', username)
    username = re.sub(r'([0-9])([a-zA-Z])', r'\1 \2', username)
    
    # Step 4: Lowercase and split on whitespace
    parts = username.lower().split()
    
    # Step 5: Filter out very short parts and numbers
    parts = [p for p in parts if len(p) >= 3 and not p.isdigit()]
    
    return parts

def analyze_username_improved(username):
    """
    Improved gender detection with smart username parsing.
    
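    Detection hierarchy (first match wins):
    1. Honorifics (100% confidence)
    2. Semantic keywords (95% confidence)
    3. Common name list, exact match on username parts (80% confidence)
    4. Common names of 4+ characters found as substrings of the joined parts
    5. Unknown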
    """
    if pd.isna(username):
        return 'unknown'
    
    username_str = str(username)
    username_lower = username_str.lower()
    
    # TIER 1: Check honorifics (highest confidence)
    for honorific in MALE_HONORIFICS:
        if honorific in username_lower:
            return 'male'
    
    for honorific in FEMALE_HONORIFICS:
        if honorific in username_lower:
            return 'female'
    
    # TIER 2: Check semantic keywords
    for keyword in MALE_KEYWORDS:
        if keyword in username_lower:
            return 'male'
    
    for keyword in FEMALE_KEYWORDS:
        if keyword in username_lower:
            return 'female'
    
    # TIER 3: Split username and check against common names
    parts = split_username_intelligent(username_str)
    
    for part in parts:
        if part in COMMON_MALE_NAMES:
            return 'male'
        if part in COMMON_FEMALE_NAMES:
            return 'female'

    # TIER 4: Check if any name appears as substring in full username
    username_clean = ''.join(parts) if parts else username_lower
    for name in COMMON_MALE_NAMES:
        if len(name) >= 4 and name in username_clean:  # Only check names 4+ chars
            return 'male'
    
    for name in COMMON_FEMALE_NAMES:
        if len(name) >= 4 and name in username_clean:
            return 'female'

    return 'unknown'

# Test the improved function
print("Testing improved gender detection:")
print("="*60)
test_usernames = [
    'JohnSmith1985',
    'mary_reviews',
    'Boba_Fett1138',
    'kinglet',
    'pastorjames',
    'Her-Excellency',
    'movieguy42',
    'sarahloveshorror',
    'randomuser999'
]

for username in test_usernames:
    parts = split_username_intelligent(username)
    gender = analyze_username_improved(username)
    parts_str = str(parts)  # width specs such as :40 need a string, not a list
    print(f"{username:20} → Parts: {parts_str:40} → Gender: {gender}")

In [None]:
print("Applying improved gender detection to all reviewers...")
print("(This will REPLACE the existing username_gender_hint column)\n")

# Store old values for comparison
old_gender = df['username_gender_hint'].copy()

# Apply improved detection
df['username_gender_hint'] = df['Reviewer'].progress_apply(analyze_username_improved)

# Stats comparison
print("\n" + "="*60)
print("GENDER DETECTION IMPROVEMENT")
print("="*60)

old_identified = (old_gender != 'unknown').sum()
new_identified = (df['username_gender_hint'] != 'unknown').sum()

print(f"\nBefore (v1):")
print(f"  Identified: {old_identified} ({old_identified/len(df)*100:.1f}%)")
print(old_gender.value_counts())

print(f"\nAfter (v2):")
print(f"  Identified: {new_identified} ({new_identified/len(df)*100:.1f}%)")
print(df['username_gender_hint'].value_counts())

improvement = new_identified - old_identified
print(f"\n✅ Improvement: +{improvement} reviewers identified (+{improvement/len(df)*100:.1f}%)")

# Show some examples of newly identified reviewers
newly_identified = df[(old_gender == 'unknown') & (df['username_gender_hint'] != 'unknown')]
if len(newly_identified) > 0:
    print(f"\nExample newly identified reviewers (first 10):")
    for idx, row in newly_identified.head(10).iterrows():
        print(f"  {row['Reviewer']:30} → {row['username_gender_hint']}")

---

# Module 2.6: Enhanced Gender Detection
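
Word segmentation recovers names from fused lowercase usernames that separator and CamelCase splitting cannot handle. The sketch below is a toy illustration of the idea using a hand-made vocabulary and a fewest-words dynamic program. `TOY_VOCAB` and `segment_toy` are hypothetical names, and the `wordsegment` library used in this module relies on corpus word frequencies rather than this rule.

```python
# Toy vocabulary (an assumption for illustration; a real segmenter uses corpus statistics).
TOY_VOCAB = {"isabelle", "anderson", "sarah", "loves", "horror", "paul", "benjamin", "and", "son"}


def segment_toy(text, vocab=TOY_VOCAB):
    """Split `text` into the fewest vocabulary words; return [text] if no full split exists."""
    n = len(text)
    # best[i] holds the shortest list of words covering text[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [text]


print(segment_toy("isabelleanderson"))
print(segment_toy("sarahloveshorror"))
print(segment_toy("xyzzy"))
```

Once a username is split this way, each part can be looked up as a candidate first name, which is what the tiered detection below does with the real libraries.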

In [None]:
# =============================================================================
# MODULE 2.6: ENHANCED GENDER DETECTION WITH WORD SEGMENTATION
# =============================================================================
print("\n" + "="*80)
print("MODULE 2.6: Enhanced Gender Detection with Word Segmentation")
print("="*80)

# Check if we should re-run gender detection
print("\nCurrent gender detection coverage:")
current_coverage = (df['username_gender_hint'] != 'unknown').sum()
print(f"  Identified: {current_coverage} / {len(df)} ({current_coverage/len(df)*100:.1f}%)")

proceed = input("\nDo you want to re-run gender detection with enhanced library? (yes/no): ")

if proceed.lower() == 'yes':
    print("\nüìä Installing required libraries...")
    
    # Install wordsegment for splitting compound words
    import sys
    try:
        from wordsegment import load, segment
        print("‚úÖ wordsegment library loaded")
    except ImportError:
        print("‚ùå wordsegment not found. Installing...")
        import subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "wordsegment"])
        from wordsegment import load, segment
        print("‚úÖ wordsegment installed")
    
    # Load wordsegment dictionary
    print("üì¶ Loading word segmentation dictionary...")
    load()
    print("‚úÖ Dictionary loaded")
    
    # Install names-dataset
    try:
        from names_dataset import NameDataset
        print("‚úÖ names-dataset library loaded")
    except ImportError:
        print("‚ùå names-dataset not found. Installing...")
        import subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "names-dataset"])
        from names_dataset import NameDataset
        print("‚úÖ names-dataset installed")
    
    # Initialize name detector
    print("üì¶ Loading name database (this may take 10-15 seconds)...")
    nd = NameDataset()
    print("‚úÖ Name database loaded")
    
    # Keep existing detection functions for Tier 1-2
    MALE_HONORIFICS = [
        'mr', 'mister', 'sir', 'lord', 'king', 'prince', 'duke', 'baron',
        'pastor', 'father', 'brother', 'monk', 'reverend', 'rabbi',
        'captain', 'general', 'admiral', 'colonel'
    ]
    
    FEMALE_HONORIFICS = [
        'mrs', 'miss', 'ms', 'lady', 'queen', 'princess', 'duchess', 'baroness',
        'sister', 'nun', 'mother', 'madam', 'dame',
        'her-excellency', 'her-majesty', 'her-highness'
    ]
    
    MALE_KEYWORDS = [
        'guy', 'dude', 'bro', 'man', 'boy', 'lad', 'male', 'husband', 'dad', 'father'
    ]
    
    FEMALE_KEYWORDS = [
        'girl', 'grl', 'grrl', 'gurrl', 'gal', 'lady', 'woman', 'female', 
        'wife', 'mom', 'mother', 'chick', 'sis'
    ]
    
    def split_username_intelligent(username):
        """
        Split username using word segmentation to handle compound words.
        This can split 'isabelleanderson' → ['Isabelle', 'Anderson']
        (parts are capitalized before the name lookup).
        """
        import re
        
        # Step 1: Replace separators with spaces
        username = re.sub(r'[_\-.]', ' ', username)
        
        # Step 2: Split on capital letters (CamelCase)
        username = re.sub(r'([a-z])([A-Z])', r'\1 \2', username)
        
        # Step 3: Split on numbers
        username = re.sub(r'([a-zA-Z])([0-9])', r'\1 \2', username)
        username = re.sub(r'([0-9])([a-zA-Z])', r'\1 \2', username)
        
        # Step 4: Split into initial parts
        initial_parts = username.split()
        
        # Step 5: For each part, use word segmentation to split compound words
        all_parts = []
        for part in initial_parts:
            # Skip numbers
            if part.isdigit():
                continue
            
            # Keep very short parts as they are (too short to segment)
            if len(part) < 4:
                all_parts.append(part)
            elif part[0].isupper() and any(c.isupper() for c in part[1:]):
                # Leftover mixed case after the CamelCase split: keep unsegmented
                all_parts.append(part)
            else:
                # Use word segmentation for compound words
                # 'isabelleanderson' → ['isabelle', 'anderson']
                segmented = segment(part.lower())
                all_parts.extend(segmented)
        
        # Step 6: Filter and capitalize
        parts = [p.capitalize() for p in all_parts if len(p) >= 3]
        
        return parts
    
    def analyze_username_enhanced(username):
        """
        Enhanced gender detection with word segmentation + names-dataset.
        
        Detection hierarchy:
        1. Honorifics (100% confidence)
        2. Semantic keywords (95% confidence)  
        3. names-dataset on segmented name parts (80% confidence)
        4. Unknown
        """
        if pd.isna(username):
            return 'unknown'
        
        username_str = str(username)
        username_lower = username_str.lower()
        
        # TIER 1: Check honorifics (highest confidence)
        for honorific in MALE_HONORIFICS:
            if honorific in username_lower:
                return 'male'
        
        for honorific in FEMALE_HONORIFICS:
            if honorific in username_lower:
                return 'female'
        
        # TIER 2: Check semantic keywords
        for keyword in MALE_KEYWORDS:
            if keyword in username_lower:
                return 'male'
        
        for keyword in FEMALE_KEYWORDS:
            if keyword in username_lower:
                return 'female'
        
        # TIER 3: Use names-dataset on segmented parts.
        # The first part is checked first (most likely to be the first name);
        # at most 3 parts are checked overall to avoid false positives.
        parts = split_username_intelligent(username_str)
        
        for part in parts[:3]:
            name_data = nd.search(part)
            name_info = name_data.get('first_name') if name_data else None
            
            # gender_dict is like {'Female': 0.995, 'Male': 0.005}
            gender_dict = name_info.get('gender') if name_info else None
            if not gender_dict:
                continue
            
            # Need >60% confidence to assign gender
            if gender_dict.get('Male', 0) > 0.6:
                return 'male'
            if gender_dict.get('Female', 0) > 0.6:
                return 'female'
        
        return 'unknown'
    
    # Test on known examples
    print("\nüß™ Testing enhanced detection on known examples:")
    test_cases = [
        ('isabelleanderson', 'female'),
        ('omarrangels', 'male'),
        ('Supercraig68', 'male'),
        ('purrlgurrl', 'female'),
        ('karinafaolin', 'female'),
        ('JohnSmith1985', 'male'),
        ('sarahloveshorror', 'female'),
        ('christhecat', 'male'),
        ('paulbenjamin', 'male'),
        ('pastorjames', 'male'),
        ('Her-Excellency', 'female')
    ]
    
    print("="*80)
    for username, expected in test_cases:
        parts = split_username_intelligent(username)
        result = analyze_username_enhanced(username)
        status = "✅" if result == expected else "❌"
        print(f"{status} {username:20} → {result:10} | parts: {parts}")
    print("="*80)
    
    # Store old values for comparison
    old_gender = df['username_gender_hint'].copy()
    
    # Apply enhanced detection to all reviewers
    print(f"\n⏳ Applying enhanced gender detection to all {len(df):,} reviewers...")
    print("   This may take 3-5 minutes (word segmentation is slower)...\n")
    
    df['username_gender_hint'] = df['Reviewer'].progress_apply(analyze_username_enhanced)
    
    # Stats comparison
    print("\n" + "="*80)
    print("GENDER DETECTION IMPROVEMENT")
    print("="*80)
    
    old_identified = (old_gender != 'unknown').sum()
    new_identified = (df['username_gender_hint'] != 'unknown').sum()
    
    print(f"\nBefore (original method):")
    print(f"  Identified: {old_identified} / {len(df)} ({old_identified/len(df)*100:.1f}%)")
    print(old_gender.value_counts())
    
    print(f"\nAfter (word segmentation + names-dataset):")
    print(f"  Identified: {new_identified} / {len(df)} ({new_identified/len(df)*100:.1f}%)")
    print(df['username_gender_hint'].value_counts())
    
    improvement = new_identified - old_identified
    print(f"\n✅ Improvement: +{improvement} reviewers identified (+{improvement/len(df)*100:.1f}%)")
    
    # Show some examples of newly identified reviewers
    newly_identified = df[(old_gender == 'unknown') & (df['username_gender_hint'] != 'unknown')]
    if len(newly_identified) > 0:
        print(f"\nüìä Example newly identified reviewers (first 30):")
        for idx, row in newly_identified.head(30).iterrows():
            print(f"  {row['Reviewer']:30} ‚Üí {row['username_gender_hint']}")
    
    print("\n‚úÖ Enhanced gender detection complete!")
    
else:
    print("⏭️  Skipping enhanced gender detection, keeping existing results")

---
# Module 5: Emotion Detection (NEW)

**Goal**: Add 8 emotion columns using NRCLex

**Method**: NRCLex (NRC Emotion Lexicon)
- Based on 14,000+ words with emotion associations
- Provides 8 core emotions: joy, trust, fear, surprise, sadness, disgust, anger, anticipation
- Complements VADER with specific emotion breakdowns

**New Columns**: 8 total
- `emotion_joy` - Joy/happiness score (0 to 1)
- `emotion_trust` - Trust/acceptance score (0 to 1)
- `emotion_fear` - Fear/anxiety score (0 to 1)
- `emotion_surprise` - Surprise/amazement score (0 to 1)
- `emotion_sadness` - Sadness/sorrow score (0 to 1)
- `emotion_disgust` - Disgust/loathing score (0 to 1)
- `emotion_anger` - Anger/rage score (0 to 1)
- `emotion_anticipation` - Anticipation/interest score (0 to 1)
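
Each score can be read as a share of the emotion tags found in a review, which is why it falls between 0 and 1. The sketch below illustrates that normalization with a made-up four-word lexicon. `TOY_LEXICON` and `toy_affect_frequencies` are hypothetical names, and the share-of-matched-tags formula is an assumption about how such scores are typically computed; the real lexicon and NRCLex's tokenization are more involved.

```python
from collections import Counter

# Made-up lexicon mapping words to emotion tags (illustrative only).
TOY_LEXICON = {
    "terrifying": ["fear"],
    "wonderful": ["joy", "trust"],
    "shocking": ["surprise", "fear"],
    "awful": ["disgust", "anger"],
}


def toy_affect_frequencies(text, lexicon=TOY_LEXICON):
    """Return each emotion's share of all emotion tags matched in `text`."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(lexicon.get(word.strip(".,!?"), []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: n / total for emotion, n in counts.items()}


scores = toy_affect_frequencies("A terrifying, shocking and wonderful film!")
for emotion, score in sorted(scores.items()):
    print(f"emotion_{emotion:10}: {score:.3f}")
```

A review with no matched words yields no scores at all, which is the case the 0.0 defaults in the extraction code cover.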

In [None]:
# Check if emotion columns already exist
emotion_cols = ['emotion_joy', 'emotion_trust', 'emotion_fear', 'emotion_surprise',
                'emotion_sadness', 'emotion_disgust', 'emotion_anger', 'emotion_anticipation']

if all(col in df.columns for col in emotion_cols):
    print("✅ Emotion columns already exist, skipping...")
    print(f"   Existing columns: {emotion_cols}")
    SKIP_EMOTIONS = True
else:
    print("Adding emotion detection (NEW FEATURE)...")
    SKIP_EMOTIONS = False

In [None]:
if not SKIP_EMOTIONS:
    def extract_emotions(text):
        """
        Extract emotion scores using NRCLex.
        Returns dict with 8 emotion scores.
        """
        try:
            # Initialize NRCLex with review text
            emotion_obj = NRCLex(str(text))
            
            # Get affect frequencies (normalized 0-1)
            emotions = emotion_obj.affect_frequencies
            
            return {
                'emotion_joy': emotions.get('joy', 0.0),
                'emotion_trust': emotions.get('trust', 0.0),
                'emotion_fear': emotions.get('fear', 0.0),
                'emotion_surprise': emotions.get('surprise', 0.0),
                'emotion_sadness': emotions.get('sadness', 0.0),
                'emotion_disgust': emotions.get('disgust', 0.0),
                'emotion_anger': emotions.get('anger', 0.0),
                'emotion_anticipation': emotions.get('anticipation', 0.0)
            }
        except Exception as e:
            # Return zeros on error
            return {
                'emotion_joy': 0.0,
                'emotion_trust': 0.0,
                'emotion_fear': 0.0,
                'emotion_surprise': 0.0,
                'emotion_sadness': 0.0,
                'emotion_disgust': 0.0,
                'emotion_anger': 0.0,
                'emotion_anticipation': 0.0
            }
    
    # Test on a sample review first
    print("Testing emotion detection on sample review:")
    sample_text = df.iloc[0]['Review_Text']
    sample_emotions = extract_emotions(sample_text)
    print(f"\nSample emotions:")
    for emotion, score in sample_emotions.items():
        print(f"  {emotion:20}: {score:.3f}")
    
    print(f"\n✅ Test successful! Now processing all {len(df):,} reviews...")
    print("⏱️  This may take 2-3 minutes...\n")

In [None]:
if not SKIP_EMOTIONS:
    print("Extracting emotions from all reviews...\n")
    
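    # Apply emotion extraction to all reviews
    emotion_results = df['Review_Text'].progress_apply(extract_emotions)
    # Reuse df's index so the new columns align row-for-row
    emotion_df = pd.DataFrame(emotion_results.tolist(), index=df.index)
    
    # Add emotion columns to main dataframe, replacing any partial leftovers
    df = pd.concat([df.drop(columns=emotion_cols, errors='ignore'), emotion_df], axis=1)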
    
    # Stats
    print("\n" + "="*60)
    print("EMOTION DETECTION COMPLETE")
    print("="*60)
    
    print(f"\nSuccess rate: {(emotion_df['emotion_joy'].notna().sum() / len(df) * 100):.1f}%")
    
    print(f"\nAverage emotion scores across all reviews:")
    for col in emotion_cols:
        avg = df[col].mean()
        print(f"  {col:25}: {avg:.3f}")
    
    # Find review with highest joy
    max_joy_idx = df['emotion_joy'].idxmax()
    max_joy_review = df.loc[max_joy_idx]
    print(f"\nüìä Review with highest JOY score ({max_joy_review['emotion_joy']:.3f}):")
    print(f"   Movie: {max_joy_review['Movie_Title']}")
    print(f"   Rating: {max_joy_review['Rating']}/10")
    print(f"   Reviewer: {max_joy_review['Reviewer']}")
    
    # Find review with highest fear
    max_fear_idx = df['emotion_fear'].idxmax()
    max_fear_review = df.loc[max_fear_idx]
    print(f"\nüò® Review with highest FEAR score ({max_fear_review['emotion_fear']:.3f}):")
    print(f"   Movie: {max_fear_review['Movie_Title']}")
    print(f"   Rating: {max_fear_review['Rating']}/10")
    print(f"   Reviewer: {max_fear_review['Reviewer']}")
else:
    print("✅ Emotion columns already exist, skipped processing")

---
# Module 6: Writing Complexity & Readability
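
The lexical metrics in this module reduce to simple ratios over a review's word tokens. A minimal sketch, assuming a regex tokenizer in place of NLTK's `word_tokenize` (`lexical_metrics` is a hypothetical helper, not part of the notebook, and its tokens can differ from NLTK's on contractions and hyphenated words):

```python
import re


def lexical_metrics(text):
    """Compute simple vocabulary metrics from lowercase alphanumeric tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return None
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "long_word_percentage": 100 * sum(len(w) >= 7 for w in words) / len(words),
    }


metrics = lexical_metrics("The film was slow. The ending was haunting.")
for name, value in metrics.items():
    print(f"{name:22}: {value:.3f}")
```

Type-token ratio tends to fall as texts get longer, so it is most meaningful when comparing reviews of similar length.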

In [None]:
# =============================================================================
# MODULE 6: WRITING COMPLEXITY & READABILITY
# =============================================================================
print("\n" + "="*80)
print("MODULE 6: Writing Complexity & Readability Analysis")
print("="*80)

if 'flesch_reading_ease' not in df.columns:
    print("\nüìä Extracting writing complexity features...")
    
    # Import required libraries
    try:
        import textstat
        print("‚úÖ textstat library loaded")
    except ImportError:
        print("‚ùå textstat not found. Installing...")
        import subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "textstat"])
        import textstat
        print("‚úÖ textstat installed and loaded")
    
    from nltk.tokenize import sent_tokenize, word_tokenize
    import re
    
    # Initialize new columns as NaN so they stay numeric (None would give object dtype)
    df['flesch_reading_ease'] = np.nan
    df['flesch_kincaid_grade'] = np.nan
    df['avg_sentence_length'] = np.nan
    df['avg_word_length'] = np.nan
    df['type_token_ratio'] = np.nan
    df['long_word_percentage'] = np.nan
    df['complex_word_count'] = np.nan
    df['syllable_count'] = np.nan
    
    # Process each review
    for idx, row in df.iterrows():
        text = str(row['Review_Text'])
        
        if len(text.strip()) == 0:
            continue
        
        try:
            # Readability metrics (textstat handles edge cases)
            df.at[idx, 'flesch_reading_ease'] = textstat.flesch_reading_ease(text)
            df.at[idx, 'flesch_kincaid_grade'] = textstat.flesch_kincaid_grade(text)
            
            # Sentence-level metrics
            sentences = sent_tokenize(text)
            words = word_tokenize(text.lower())
            words_clean = [w for w in words if w.isalnum()]  # Remove punctuation
            
            if len(sentences) > 0 and len(words_clean) > 0:
                # Average sentence length
                df.at[idx, 'avg_sentence_length'] = len(words_clean) / len(sentences)
                
                # Average word length
                df.at[idx, 'avg_word_length'] = sum(len(w) for w in words_clean) / len(words_clean)
                
                # Lexical diversity (Type-Token Ratio)
                unique_words = set(words_clean)
                df.at[idx, 'type_token_ratio'] = len(unique_words) / len(words_clean)
                
                # Long words (7+ characters)
                long_words = [w for w in words_clean if len(w) >= 7]
                df.at[idx, 'long_word_percentage'] = (len(long_words) / len(words_clean)) * 100
                
                # Difficult words as defined by textstat.difficult_words
                # (not a strict 3+ syllable count)
                df.at[idx, 'complex_word_count'] = textstat.difficult_words(text)
                
                # Total syllables
                df.at[idx, 'syllable_count'] = textstat.syllable_count(text)
            
        except Exception as e:
            # Silent failure for individual reviews
            continue
        
        # Progress indicator
        if (idx + 1) % 500 == 0:
            print(f"   Processed {idx + 1:,} / {len(df):,} reviews...")
    
    print(f"✅ Completed processing {len(df):,} reviews")
    
    # Summary statistics
    print("\n📈 Writing Complexity Summary:")
    print(f"   Mean Flesch Reading Ease: {df['flesch_reading_ease'].mean():.1f} (0-100, higher=easier)")
    print(f"   Mean Grade Level: {df['flesch_kincaid_grade'].mean():.1f} (U.S. grade)")
    print(f"   Mean Sentence Length: {df['avg_sentence_length'].mean():.1f} words")
    print(f"   Mean Word Length: {df['avg_word_length'].mean():.1f} characters")
    print(f"   Mean Type-Token Ratio: {df['type_token_ratio'].mean():.3f} (vocabulary richness)")
    print(f"   Mean Long Word %: {df['long_word_percentage'].mean():.1f}%")
    
    # Interpretation guide
    print("\nüìö Flesch Reading Ease Interpretation:")
    print("   90-100: Very Easy (5th grade)")
    print("   80-89:  Easy (6th grade)")
    print("   70-79:  Fairly Easy (7th grade)")
    print("   60-69:  Standard (8th-9th grade)")
    print("   50-59:  Fairly Difficult (10th-12th grade)")
    print("   30-49:  Difficult (College)")
    print("   0-29:   Very Confusing (College graduate)")
    
else:
    print("✅ Writing complexity features already exist, skipping Module 6...")
    print(f"   Found columns: flesch_reading_ease, flesch_kincaid_grade, etc.")

---

# Module 7: Temporal & Engagement Features
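
The temporal and engagement features come down to two small per-review calculations. A minimal sketch in plain Python, where `review_window` and `helpfulness_ratio` are hypothetical helper names and the window labels are illustrative:

```python
def review_window(review_year, release_year):
    """Bucket a review by the number of calendar years since the film's release."""
    years_diff = review_year - release_year
    if years_diff <= 0:
        return "Opening Year"
    if years_diff == 1:
        return "Year 1"
    if years_diff <= 3:
        return "Years 2-3"
    if years_diff <= 5:
        return "Years 4-5"
    return "5+ Years"


def helpfulness_ratio(up_votes, down_votes):
    """Share of helpful votes, or None when nobody voted."""
    total = up_votes + down_votes
    return up_votes / total if total > 0 else None


print(review_window(2018, 2018), "|", review_window(2021, 2018), "|", review_window(2024, 1987))
print(helpfulness_ratio(9, 3), helpfulness_ratio(0, 0))
```

Returning `None` for reviews without votes keeps them out of the mean helpfulness ratio instead of counting them as zero.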

In [None]:
# =============================================================================
# MODULE 7: TEMPORAL & ENGAGEMENT FEATURES
# =============================================================================
print("\n" + "="*80)
print("MODULE 7: Temporal & Engagement Analysis")
print("="*80)

if 'years_since_release' not in df.columns:
    print("\n⏰ Calculating temporal and engagement features...")
    
    # Movie release year mapping (films covered in this dataset)
    MOVIE_RELEASE_YEARS = {
        'Angel Heart': 1987,
        'The Rapture': 1991,
        'Lady in the Water': 2006,
        'Antichrist': 2009,
        'The Witch': 2015,
        'We Are Still Here': 2015,
        'The Wailing': 2016,
        'A Dark Song': 2016,
        'The Endless': 2017,
        'Tigers Are Not Afraid': 2017,
        'The Ritual': 2017,
        'Hagazussa': 2017,
        'Annihilation': 2018,
        'Apostle': 2018,
        'Hereditary': 2018,
        'The Wind': 2018,
        'Midsommar': 2019,
        'His House': 2020,
        'The Medium': 2021,
        'The Watchers': 2024
    }
    
    # Initialize columns (NaN for numeric features so comparisons and means work)
    df['movie_release_year'] = np.nan
    df['review_year'] = np.nan
    df['years_since_release'] = np.nan
    df['review_window'] = None
    df['total_votes'] = np.nan
    df['helpfulness_ratio'] = np.nan
    df['vote_polarization'] = np.nan
    df['has_engagement'] = False
    
    # Process each review
    for idx, row in df.iterrows():
        movie = row['Movie_Title']
        review_date = pd.to_datetime(row['Review_Date'])
        
        # Movie release year
        release_year = MOVIE_RELEASE_YEARS.get(movie)
        if release_year:
            df.at[idx, 'movie_release_year'] = release_year
            df.at[idx, 'review_year'] = review_date.year
            
            # Years between release and review
            years_diff = review_date.year - release_year
            df.at[idx, 'years_since_release'] = years_diff
            
            # Categorize review timing window
            # (labels count whole calendar years after release, without overlap)
            if years_diff <= 0:
                df.at[idx, 'review_window'] = 'Opening Year'
            elif years_diff == 1:
                df.at[idx, 'review_window'] = 'Year 1'
            elif years_diff <= 3:
                df.at[idx, 'review_window'] = 'Years 2-3'
            elif years_diff <= 5:
                df.at[idx, 'review_window'] = 'Years 4-5'
            else:
                df.at[idx, 'review_window'] = '6+ Years'
        
        # Engagement metrics
        up_votes = row['Helpful_Votes_Up'] if pd.notna(row['Helpful_Votes_Up']) else 0
        down_votes = row['Helpful_Votes_Down'] if pd.notna(row['Helpful_Votes_Down']) else 0
        
        total = up_votes + down_votes
        df.at[idx, 'total_votes'] = total
        df.at[idx, 'has_engagement'] = total > 0
        
        if total > 0:
            df.at[idx, 'helpfulness_ratio'] = up_votes / total
            df.at[idx, 'vote_polarization'] = abs(up_votes - down_votes)
        else:
            # No votes: the ratio is undefined, so store NaN rather than None
            df.at[idx, 'helpfulness_ratio'] = np.nan
            df.at[idx, 'vote_polarization'] = 0
    
    print(f"✅ Processed {len(df):,} reviews")

    # Summary statistics
    print("\n📈 Temporal Analysis Summary:")
    print(f"   Mean years since release: {df['years_since_release'].mean():.1f} years")
    print(f"   Median years since release: {df['years_since_release'].median():.1f} years")
    print(f"   Reviews written in opening year: {(df['years_since_release'] <= 0).sum():,} ({(df['years_since_release'] <= 0).sum()/len(df)*100:.1f}%)")
    
    print("\n📊 Review Window Distribution:")
    print(df['review_window'].value_counts().sort_index())

    print("\n👍 Engagement Summary:")
    print(f"   Reviews with votes: {df['has_engagement'].sum():,} ({df['has_engagement'].sum()/len(df)*100:.1f}%)")
    print(f"   Mean helpfulness ratio: {df['helpfulness_ratio'].mean():.3f} (among voted reviews)")
    print(f"   Mean total votes: {df['total_votes'].mean():.1f}")
    print(f"   Median vote polarization: {df['vote_polarization'].median():.0f}")
    
else:
    print("✅ Temporal & engagement features already exist, skipping Module 7...")

---
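The engagement arithmetic in the Module 7 loop above (total votes, helpfulness ratio, polarization) can also be written as a small pure function, which makes the zero-vote and missing-value cases easy to check in isolation. This is a minimal sketch rather than the notebook's own code: `engagement_metrics` is a hypothetical helper name, and treating missing counts as 0 simply mirrors the `pd.notna` checks in the loop.

```python
import math


def engagement_metrics(up_votes, down_votes):
    """Return engagement features for one review (hypothetical helper)."""

    def _clean(value):
        # Treat None and NaN as "no votes recorded"
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return 0
        return value

    up, down = _clean(up_votes), _clean(down_votes)
    total = up + down
    return {
        "total_votes": total,
        "has_engagement": total > 0,
        # The ratio is undefined when nobody voted
        "helpfulness_ratio": up / total if total > 0 else None,
        "vote_polarization": abs(up - down),
    }


print(engagement_metrics(8, 2))
# {'total_votes': 10, 'has_engagement': True, 'helpfulness_ratio': 0.8, 'vote_polarization': 6}
```

A helper like this could be applied row by row, or used as a reference when checking a vectorized version of the same columns.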

# Module 8: Review Structure and Punctuation Analysis

In [None]:
# =============================================================================
# MODULE 8: REVIEW STRUCTURE & PUNCTUATION ANALYSIS
# =============================================================================
print("\n" + "="*80)
print("MODULE 8: Review Structure & Punctuation Analysis")
print("="*80)

if 'paragraph_count' not in df.columns:
    print("\n📝 Extracting structural and punctuation features...")
    
    import re
    
    # Initialize new columns as NaN so they are numeric (float) dtype;
    # idxmax() further down is not supported on object-dtype columns
    df['paragraph_count'] = np.nan
    df['avg_paragraph_length'] = np.nan
    df['exclamation_count'] = np.nan
    df['question_mark_count'] = np.nan
    df['ellipsis_count'] = np.nan
    df['caps_word_count'] = np.nan
    df['quote_count'] = np.nan
    df['double_quote_count'] = np.nan
    df['single_quote_count'] = np.nan
    df['punctuation_density'] = np.nan
    df['uppercase_ratio'] = np.nan
    
    # Process each review
    for idx, row in df.iterrows():
        text = str(row['Review_Text'])
        review_length = row['Review_Length']
        
        if len(text.strip()) == 0 or review_length == 0:
            continue
        
        try:
            # Paragraph structure (split on double newlines or multiple newlines)
            paragraphs = re.split(r'\n\s*\n', text.strip())
            paragraphs = [p.strip() for p in paragraphs if p.strip()]
            para_count = len(paragraphs) if paragraphs else 1
            
            df.at[idx, 'paragraph_count'] = para_count
            df.at[idx, 'avg_paragraph_length'] = review_length / para_count
            
            # Exclamation marks (emotional intensity)
            df.at[idx, 'exclamation_count'] = text.count('!')
            
            # Question marks (engagement, rhetorical questions)
            df.at[idx, 'question_mark_count'] = text.count('?')
            
            # Ellipsis (dramatic pauses, trailing thoughts)
            ellipsis_pattern = r'\.{3,}|…'
            df.at[idx, 'ellipsis_count'] = len(re.findall(ellipsis_pattern, text))
            
            # ALL CAPS words (emphasis, shouting)
            caps_pattern = r'\b[A-Z]{2,}\b'
            caps_words = re.findall(caps_pattern, text)
            # Filter out common acronyms (caps_pattern only matches fully
            # uppercase tokens, so the entry must be 'IMDB', not 'IMDb')
            common_acronyms = {'DVD', 'VHS', 'TV', 'CGI', 'IMDB', 'USA', 'UK', 'US'}
            caps_words = [w for w in caps_words if w not in common_acronyms]
            df.at[idx, 'caps_word_count'] = len(caps_words)
            
            # Quote usage (dialogue/scene references)
            df.at[idx, 'double_quote_count'] = text.count('"')
            df.at[idx, 'single_quote_count'] = text.count("'")
            df.at[idx, 'quote_count'] = text.count('"') + text.count("'")
            
            # Overall punctuation density (per 100 words)
            punctuation_chars = '!?.,:;-—'
            punct_count = sum(text.count(p) for p in punctuation_chars)
            words_estimate = review_length  # Review_Length is word count
            df.at[idx, 'punctuation_density'] = (punct_count / words_estimate * 100) if words_estimate > 0 else 0
            
            # Uppercase letter ratio (intensity metric)
            letters = [c for c in text if c.isalpha()]
            if letters:
                uppercase_count = sum(1 for c in letters if c.isupper())
                df.at[idx, 'uppercase_ratio'] = uppercase_count / len(letters)
            else:
                df.at[idx, 'uppercase_ratio'] = 0
            
        except Exception as e:
            # Silent failure for individual reviews
            continue
        
        # Progress indicator
        if (idx + 1) % 500 == 0:
            print(f"   Processed {idx + 1:,} / {len(df):,} reviews...")
    
    print(f"✅ Completed processing {len(df):,} reviews")

    # Summary statistics
    print("\n📈 Structure & Punctuation Summary:")
    print(f"   Mean paragraph count: {df['paragraph_count'].mean():.1f}")
    print(f"   Mean paragraph length: {df['avg_paragraph_length'].mean():.1f} words")
    print(f"   Reviews with exclamations: {(df['exclamation_count'] > 0).sum():,} ({(df['exclamation_count'] > 0).sum()/len(df)*100:.1f}%)")
    print(f"   Reviews with questions: {(df['question_mark_count'] > 0).sum():,} ({(df['question_mark_count'] > 0).sum()/len(df)*100:.1f}%)")
    print(f"   Reviews with ellipsis: {(df['ellipsis_count'] > 0).sum():,} ({(df['ellipsis_count'] > 0).sum()/len(df)*100:.1f}%)")
    print(f"   Reviews with CAPS words: {(df['caps_word_count'] > 0).sum():,} ({(df['caps_word_count'] > 0).sum()/len(df)*100:.1f}%)")
    print(f"   Mean punctuation density: {df['punctuation_density'].mean():.1f} marks per 100 words")
    
    print("\n🔥 Intensity Indicators:")
    print(f"   Mean exclamations per review: {df['exclamation_count'].mean():.1f}")
    print(f"   Max exclamations in one review: {df['exclamation_count'].max():.0f}")
    print(f"   Mean CAPS words per review: {df['caps_word_count'].mean():.1f}")
    print(f"   Reviews with 5+ exclamations: {(df['exclamation_count'] >= 5).sum():,} (very emotional)")
    
    print("\n📊 Quote Usage (dialogue/scene references):")
    print(f"   Reviews with quotes: {(df['quote_count'] > 0).sum():,} ({(df['quote_count'] > 0).sum()/len(df)*100:.1f}%)")
    print(f"   Mean quotes per review: {df['quote_count'].mean():.1f}")
    
    # Find most punctuation-heavy review
    max_punct_idx = df['punctuation_density'].idxmax()
    max_punct_review = df.loc[max_punct_idx]
    print(f"\n⚡ Most punctuation-heavy review ({max_punct_review['punctuation_density']:.1f} marks/100 words):")
    print(f"   Movie: {max_punct_review['Movie_Title']}")
    print(f"   Rating: {max_punct_review['Rating']}/10")
    print(f"   Exclamations: {max_punct_review['exclamation_count']:.0f}")
    print(f"   Questions: {max_punct_review['question_mark_count']:.0f}")
    
else:
    print("✅ Structure & punctuation features already exist, skipping Module 8...")
    print(f"   Found columns: paragraph_count, exclamation_count, caps_word_count, etc.")

---
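The per-review counting in Module 8 above can be isolated into a single function so the regular expressions are easy to test against a known string. This is a hedged sketch, not the notebook's implementation: `punctuation_features` is a hypothetical name, the ellipsis character is written as the escape `\u2026` to sidestep encoding problems, and the acronym set is an assumed, fully uppercase variant of the one used in the loop.

```python
import re

# Three or more dots, or the single-character ellipsis (U+2026)
ELLIPSIS_RE = re.compile(r"\.{3,}|\u2026")
# Whole words made of two or more capital letters
CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")
# Assumed exclusion list; entries must be all caps to ever match CAPS_RE
ACRONYMS = {"DVD", "VHS", "TV", "CGI", "IMDB", "USA", "UK", "US"}


def punctuation_features(text):
    """Count intensity markers in one review text (hypothetical helper)."""
    letters = [c for c in text if c.isalpha()]
    caps_words = [w for w in CAPS_RE.findall(text) if w not in ACRONYMS]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "exclamation_count": text.count("!"),
        "question_mark_count": text.count("?"),
        "ellipsis_count": len(ELLIPSIS_RE.findall(text)),
        "caps_word_count": len(caps_words),
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
    }


sample = "WOW... the DVD was GREAT! Was it scary? Yes!!"
features = punctuation_features(sample)
print(features["exclamation_count"], features["caps_word_count"])  # 3 2
```

In the sample, `DVD` is filtered out as an acronym, so only `WOW` and `GREAT` count as emphasis.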

# Module 9: Part-of-Speech Analysis

In [None]:
# =============================================================================
# MODULE 9: PART-OF-SPEECH (POS) ANALYSIS
# =============================================================================
print("\n" + "="*80)
print("MODULE 9: Part-of-Speech Analysis")
print("="*80)

if 'adj_ratio' not in df.columns:
    print("\n🔤 Extracting POS ratios (writing style analysis)...")
    print("⏱️  This uses spaCy and may take 3-5 minutes...\n")

    # sys and subprocess are needed by the install fallbacks below
    import subprocess
    import sys

    # Import spaCy
    try:
        import spacy
        print("✅ spaCy library loaded")
    except ImportError:
        print("❌ spaCy not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "spacy"])
        import spacy
        print("✅ spaCy installed")

    # Load English model
    try:
        nlp = spacy.load('en_core_web_sm')
        print("✅ English model loaded\n")
    except OSError:
        print("📦 Downloading spaCy English model...")
        subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
        nlp = spacy.load('en_core_web_sm')
        print("✅ English model loaded\n")
    
    # Disable unnecessary pipeline components for speed
    nlp.disable_pipes(['parser', 'ner'])
    
    # Initialize new columns as NaN (float dtype) so the idxmax()/idxmin()
    # lookups in the summary work; they are not supported on object dtype
    df['adj_ratio'] = np.nan          # Adjectives (descriptive)
    df['verb_ratio'] = np.nan         # Verbs (action-oriented)
    df['noun_ratio'] = np.nan         # Nouns (analytical/factual)
    df['adverb_ratio'] = np.nan       # Adverbs (modifier-heavy)
    df['pronoun_ratio'] = np.nan      # Pronouns (personal vs objective)
    df['first_person_ratio'] = np.nan # I/me/my/we/us/our (subjective)
    df['second_person_ratio'] = np.nan # You/your (direct address)
    df['determiner_ratio'] = np.nan   # The/a/an (specificity)
    df['conjunction_ratio'] = np.nan  # And/but/or (complexity)
    
    # Process each review
    for idx, row in df.iterrows():
        text = str(row['Review_Text'])
        
        if len(text.strip()) == 0:
            continue
        
        try:
            # Process with spaCy (limit to first 1M characters to avoid crashes)
            doc = nlp(text[:1000000])
            
            # Count POS tags
            total_tokens = len([token for token in doc if not token.is_punct and not token.is_space])
            
            if total_tokens == 0:
                continue
            
            # POS tag counts
            adj_count = sum(1 for token in doc if token.pos_ == 'ADJ')
            verb_count = sum(1 for token in doc if token.pos_ == 'VERB')
            noun_count = sum(1 for token in doc if token.pos_ in ['NOUN', 'PROPN'])
            adverb_count = sum(1 for token in doc if token.pos_ == 'ADV')
            pronoun_count = sum(1 for token in doc if token.pos_ == 'PRON')
            det_count = sum(1 for token in doc if token.pos_ == 'DET')
            conj_count = sum(1 for token in doc if token.pos_ in ['CCONJ', 'SCONJ'])
            
            # Calculate ratios (per 100 words for interpretability)
            df.at[idx, 'adj_ratio'] = (adj_count / total_tokens) * 100
            df.at[idx, 'verb_ratio'] = (verb_count / total_tokens) * 100
            df.at[idx, 'noun_ratio'] = (noun_count / total_tokens) * 100
            df.at[idx, 'adverb_ratio'] = (adverb_count / total_tokens) * 100
            df.at[idx, 'pronoun_ratio'] = (pronoun_count / total_tokens) * 100
            df.at[idx, 'determiner_ratio'] = (det_count / total_tokens) * 100
            df.at[idx, 'conjunction_ratio'] = (conj_count / total_tokens) * 100
            
            # Specific pronoun analysis
            first_person = ['i', 'me', 'my', 'mine', 'myself', 'we', 'us', 'our', 'ours', 'ourselves']
            second_person = ['you', 'your', 'yours', 'yourself', 'yourselves']
            
            first_person_count = sum(1 for token in doc if token.lower_ in first_person)
            second_person_count = sum(1 for token in doc if token.lower_ in second_person)
            
            df.at[idx, 'first_person_ratio'] = (first_person_count / total_tokens) * 100
            df.at[idx, 'second_person_ratio'] = (second_person_count / total_tokens) * 100
            
        except Exception as e:
            # Silent failure for individual reviews
            continue
        
        # Progress indicator
        if (idx + 1) % 500 == 0:
            print(f"   Processed {idx + 1:,} / {len(df):,} reviews...")
    
    print(f"✅ Completed processing {len(df):,} reviews")

    # Summary statistics
    print("\n📈 POS Analysis Summary:")
    print(f"   Mean adjective ratio: {df['adj_ratio'].mean():.1f}% (descriptive writing)")
    print(f"   Mean verb ratio: {df['verb_ratio'].mean():.1f}% (action-oriented)")
    print(f"   Mean noun ratio: {df['noun_ratio'].mean():.1f}% (factual/analytical)")
    print(f"   Mean adverb ratio: {df['adverb_ratio'].mean():.1f}% (modifier-heavy)")
    print(f"   Mean pronoun ratio: {df['pronoun_ratio'].mean():.1f}%")
    
    print("\n👤 Voice Analysis:")
    print(f"   Mean 1st person ratio: {df['first_person_ratio'].mean():.1f}% (subjective/personal)")
    print(f"   Mean 2nd person ratio: {df['second_person_ratio'].mean():.1f}% (direct address)")
    
    print("\n📊 Writing Style Indicators:")
    print(f"   High 1st person (>5%): {(df['first_person_ratio'] > 5).sum():,} reviews (very personal)")
    print(f"   High adjectives (>10%): {(df['adj_ratio'] > 10).sum():,} reviews (descriptive)")
    print(f"   High nouns (>25%): {(df['noun_ratio'] > 25).sum():,} reviews (analytical)")
    print(f"   Low 1st person (<2%): {(df['first_person_ratio'] < 2).sum():,} reviews (objective/critical)")
    
    # Interpretation guide
    print("\n📚 Writing Style Interpretation:")
    print("   High adjectives + high 1st person = Emotional/subjective reviews")
    print("   High nouns + low 1st person = Analytical/critical reviews")
    print("   High verbs = Action/plot-focused reviews")
    print("   High adverbs = Nuanced/qualified opinions")
    print("   High 2nd person = Direct engagement with reader")
    
    # Find most subjective review
    max_subjective_idx = df['first_person_ratio'].idxmax()
    max_subjective = df.loc[max_subjective_idx]
    print(f"\n👤 Most subjective review ({max_subjective['first_person_ratio']:.1f}% first-person):")
    print(f"   Movie: {max_subjective['Movie_Title']}")
    print(f"   Rating: {max_subjective['Rating']}/10")
    print(f"   Reviewer: {max_subjective['Reviewer']}")
    
    # Find most objective review
    min_subjective_idx = df['first_person_ratio'].idxmin()
    min_subjective = df.loc[min_subjective_idx]
    print(f"\n📊 Most objective review ({min_subjective['first_person_ratio']:.1f}% first-person):")
    print(f"   Movie: {min_subjective['Movie_Title']}")
    print(f"   Rating: {min_subjective['Rating']}/10")
    print(f"   Reviewer: {min_subjective['Reviewer']}")
    
else:
    print("✅ POS analysis features already exist, skipping Module 9...")
    print(f"   Found columns: adj_ratio, verb_ratio, noun_ratio, etc.")

---
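The ratio arithmetic in Module 9 above does not depend on spaCy itself, so it can be checked with hand-tagged tokens. The sketch below is an illustration under stated assumptions, not the notebook's code: `pos_ratios` is a hypothetical helper that takes plain `(text, pos)` pairs (which could be built from `token.text` and `token.pos_`), and it approximates the notebook's `is_punct`/`is_space` filter by dropping the `PUNCT` and `SPACE` tags.

```python
from collections import Counter

FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def pos_ratios(tagged_tokens):
    """Return POS percentages for one review, or None if it has no words."""
    # Denominator: word tokens only (assumed stand-in for is_punct/is_space)
    words = [(text, pos) for text, pos in tagged_tokens
             if pos not in {"PUNCT", "SPACE"}]
    if not words:
        return None

    total = len(words)
    counts = Counter(pos for _, pos in words)
    first_person = sum(1 for text, _ in words if text.lower() in FIRST_PERSON)

    def pct(n):
        # Express a count as a percentage of word tokens
        return n / total * 100

    return {
        "adj_ratio": pct(counts["ADJ"]),
        "verb_ratio": pct(counts["VERB"]),
        "noun_ratio": pct(counts["NOUN"] + counts["PROPN"]),
        "pronoun_ratio": pct(counts["PRON"]),
        "first_person_ratio": pct(first_person),
    }


tagged = [("I", "PRON"), ("loved", "VERB"), ("creepy", "ADJ"),
          ("films", "NOUN"), (".", "PUNCT")]
print(pos_ratios(tagged)["first_person_ratio"])  # 25.0
```

The full stop is excluded from the denominator, so each of the four word tokens contributes 25%.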
# Final Summary & Export

In [None]:
print("\n" + "="*60)
print("INCREMENTAL FEATURE EXTRACTION SUMMARY")
print("="*60)

print(f"\nTotal reviews: {len(df):,}")
print(f"Total columns: {len(df.columns)}")

print(f"\n🎯 Features Improved/Added:")
print(f"  ✅ Gender Detection (v2): {(df['username_gender_hint'] != 'unknown').sum()} identified ({(df['username_gender_hint'] != 'unknown').sum()/len(df)*100:.1f}%)")

if not SKIP_EMOTIONS:
    print(f"  ✅ Emotion Detection (NEW): 8 columns added")
else:
    print(f"  ⏭️  Emotion Detection: Already existed, skipped")

print(f"\n📋 Complete Feature List ({len(df.columns)} columns):")
print(df.columns.tolist())

## Export Updated Dataset
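Because the output path is the same file as the input, an interrupted write could leave the dataset truncated. One common hedge is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The sketch below shows that pattern with the standard library only. `atomic_write_csv` and the demo file name are hypothetical, and the notebook itself does not do this.

```python
import csv
import os
import tempfile
from pathlib import Path


def atomic_write_csv(path, header, rows):
    """Write a CSV to a temp file, then replace the target in one step."""
    path = Path(path)
    # Same directory as the target so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        os.replace(tmp_name, path)  # overwrites the old file if it exists
    except BaseException:
        # Clean up the partial temp file and leave the original untouched
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return path


demo_dir = Path(tempfile.mkdtemp())
out = atomic_write_csv(demo_dir / "reviews_demo.csv",
                       ["Movie_Title", "Rating"],
                       [["Hereditary", 9]])
print(out.read_text(encoding="utf-8"))
```

With pandas, the same idea would amount to calling `df.to_csv` on the temporary path and then `os.replace` onto the real one.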

In [None]:
print("\n" + "="*60)
print("EXPORTING UPDATED DATASET")
print("="*60)

# Save to CSV
df.to_csv(OUTPUT_FILE, index=False, encoding='utf-8')
print(f"\n✅ Saved: {OUTPUT_FILE}")
print(f"   Rows: {len(df):,}")
print(f"   Columns: {len(df.columns)}")

# File size
file_size = OUTPUT_FILE.stat().st_size / (1024 * 1024)  # MB
print(f"   File size: {file_size:.2f} MB")

print("\n" + "="*60)
print("✅ INCREMENTAL UPDATES COMPLETE!")
print("="*60)
print("\nReady for analysis phase (movie_insights.ipynb)")

In [None]:
# ==============================================================================
# Generate HTML Report (Optional)
# ==============================================================================

print("\n" + "="*60)
print("GENERATING HTML REPORT")
print("="*60)

import subprocess
from pathlib import Path

# Get the notebook path
notebook_path = Path('/Users/USER/Desktop/JAMES/Noetheca/Reviews/scripts/feature_extraction_incremental.ipynb')
output_path = notebook_path.parent / 'feature_extraction_incremental_report.html'

try:
    # Run nbconvert
    result = subprocess.run([
        'jupyter', 'nbconvert', 
        '--to', 'html',
        '--no-input',
        '--output', str(output_path),
        str(notebook_path)
    ], capture_output=True, text=True, check=True)
    
    print(f"\n✅ HTML report generated: {output_path}")
    print(f"   Open in browser: file://{output_path}")
    
except subprocess.CalledProcessError as e:
    print(f"\n❌ Error generating HTML report:")
    print(f"   {e.stderr}")
    
except FileNotFoundError:
    print("\n⚠️  jupyter nbconvert not found. Install with:")
    print("   pip install nbconvert")