# Hybrid Recommendation System (Advanced)

## Goal
Combine **ML predictions** with **content-based scoring** to create a hybrid recommendation system with **advanced filtering and user preference support**.

## What's New vs. Basic Content-Based?
This notebook includes ALL advanced features from model_development.ipynb:

**🔒 Hard Filters** (auto-exclude):
- NSFW content filter
- Early Access filter  
- Meta genre/tag filtering (Indie, Casual, Utilities, etc.)
- Minimum review thresholds
- User-specified hard exclusions

**🎯 Smart Scoring**:
- Tag matching weighted by YOUR playtime (45%)
- Genre matching (20%)
- Median playtime signal (20%)
- Review quality tiers (15%)
- **Disliked tag/genre penalties** (learns from low-playtime games)
- **User preference boosts** (optional manual adjustments)

## Approach
1. Load the ML predictions generated by feature_engineering.ipynb
2. Build content-based scoring system with all advanced features
3. Generate recommendations with 3 approaches:
   - **Pure ML**: Use only ML predictions
   - **Pure Content-Based**: Use only similarity scores (with all filters/penalties)
   - **Hybrid**: Weighted combination of both
4. Compare all 3 approaches side-by-side
5. Analyze which approach gives best recommendations

## Scoring Formula
```
Content Score = 
  + 45% Tag Matching (weighted by playtime, excludes NSFW/meta)
  + 20% Genre Matching (excludes meta genres)
  + 20% Median Playtime (engagement signal)
  + 15% Review Quality (tiered by score, volume bonus)
  - 10 points per disliked tag (soft penalty)
  - 15 points per disliked genre (soft penalty)
  
Hybrid Score = 0.40*ML + 0.40*Content + 0.20*Review
```

*(Weights can be tuned based on results)*

**Dislike Learning**: Games you own but never play (<5 hours) reveal disliked tags/genres, which get penalized in recommendations.
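
As a worked example of the formulas above, the sketch below plugs made-up component values into the two scores. The helper names `content_score` and `hybrid_score` and all input numbers are assumptions for illustration; they are not taken from the notebook's data.

```python
# Hypothetical helpers and made-up inputs, for illustration only.

def content_score(tag, genre, playtime, review,
                  n_disliked_tags=0, n_disliked_genres=0):
    """Sum the four components, subtract soft penalties, floor at 0."""
    raw = tag + genre + playtime + review
    raw -= 10 * n_disliked_tags + 15 * n_disliked_genres
    return max(0, raw)

def hybrid_score(ml, content, review, w_ml=0.40, w_content=0.40, w_review=0.20):
    """Weighted combination of the three 0-100 scores."""
    return w_ml * ml + w_content * content + w_review * review

# 30 + 15 + 10 + 12 = 67 points, minus 10 for one disliked tag
content = content_score(tag=30, genre=15, playtime=10, review=12, n_disliked_tags=1)
hybrid = hybrid_score(ml=70, content=content, review=85)

print(content)          # 57
print(f"{hybrid:.1f}")  # 67.8
```

Note that a single disliked genre (-15) removes as many points as the whole review component can contribute, so the "soft" penalties are fairly strong.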

In [None]:
import pandas as pd
import numpy as np
import ast
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')

print("Libraries loaded successfully!")

## 2. Load Data

Load the datasets used below:
- User's owned games (with engagement scores)
- Steam catalog (candidates for recommendations)
- ML recommendations (precomputed in feature_engineering.ipynb)
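
Before loading, it can help to confirm that a file has the columns the later cells rely on. This is a minimal sketch on a toy DataFrame; `missing_columns` is a hypothetical helper, and the required-column list is an assumption based on the column names used elsewhere in this notebook.

```python
import pandas as pd

# Columns the scoring cells read from the catalog (assumed from later cells).
REQUIRED_CATALOG_COLUMNS = [
    "appid", "name", "tags", "genre", "positive", "negative", "median_forever",
]

def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for a freshly loaded catalog.
toy_catalog = pd.DataFrame(
    {"appid": [10], "name": ["Example"], "tags": ["{}"], "genre": ["Action"]}
)

missing = missing_columns(toy_catalog, REQUIRED_CATALOG_COLUMNS)
print(missing)  # ['positive', 'negative', 'median_forever']
```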

In [None]:
# Load datasets
owned_games = pd.read_csv('../data/owned_games_enriched.csv')
steam_catalog = pd.read_csv('../data/steam_catalog_detailed.csv')
ml_recommendations_v2 = pd.read_csv('../data/ml_recommendations_v2_diverse.csv')

print(f"Owned games: {len(owned_games)}")
print(f"Steam catalog: {len(steam_catalog)}")
print(f"ML recommendations loaded: {len(ml_recommendations_v2)}")

# Identify owned game IDs
owned_appids = set(owned_games['appid'].values)
catalog_unowned = steam_catalog[~steam_catalog['appid'].isin(owned_appids)].copy()

print(f"\nCatalog games you don't own: {len(catalog_unowned)}")

## 3. Load/Prepare ML Predictions

We already have ML predictions from feature_engineering.ipynb.  
The cell below checks that they are on the 0-100 scale used by the content-based scores and renames the column to `ml_score`.
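
If the predictions turn out not to be on a 0-100 scale, one option is min-max rescaling. The sketch below is an assumption about how that could be done, not something this notebook applies; `to_0_100` is a hypothetical helper, and the fallback of 50 for a constant column is an arbitrary choice.

```python
import pandas as pd

def to_0_100(scores: pd.Series) -> pd.Series:
    """Min-max rescale a score column to the 0-100 range."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All values identical: no spread to rescale, so use a neutral midpoint.
        return pd.Series(50.0, index=scores.index)
    return (scores - lo) / (hi - lo) * 100

raw = pd.Series([2.0, 5.0, 8.0])
scaled = to_0_100(raw)
print(scaled.tolist())  # [0.0, 50.0, 100.0]
```

Min-max scaling is sensitive to outliers, so a single extreme prediction compresses every other score toward 0.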

In [None]:
# ML recommendations already have 'predicted_engagement_v2' column (0-100 scale)
# Let's verify the range
print("ML Prediction Statistics:")
print(ml_recommendations_v2['predicted_engagement_v2'].describe())

# Normalize to 0-100 if needed (should already be in this range)
ml_scores = ml_recommendations_v2[['appid', 'name', 'predicted_engagement_v2']].copy()
ml_scores.rename(columns={'predicted_engagement_v2': 'ml_score'}, inplace=True)

print(f"\nML scores prepared for {len(ml_scores)} games")

## 4. Build Content-Based Scoring System

Implement content-based filtering using:
1. **Tag similarity** to user's loved games (45%)
2. **Genre overlap** (20%)
3. **Median playtime similarity** (20%)
4. **Review quality** (15%)
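
To illustrate the playtime weighting behind the tag-similarity component, here is a minimal sketch on a toy library. The games, playtimes, and vote counts are made up, and `build_tag_profile` is a hypothetical helper that mirrors the idea rather than the exact cell below.

```python
from collections import defaultdict

# Toy library: (playtime in minutes, {tag: votes}); all values are made up.
loved = [
    (6000, {"RPG": 100, "Open World": 50}),
    (2000, {"RPG": 40, "Strategy": 80}),
]

def build_tag_profile(games):
    """Weight each game's tag votes by its share of total playtime."""
    total = sum(playtime for playtime, _ in games)
    profile = defaultdict(float)
    if total == 0:
        return {}
    for playtime, tags in games:
        weight = playtime / total
        for tag, votes in tags.items():
            profile[tag] += votes * weight
    return dict(profile)

profile = build_tag_profile(loved)
print(profile)  # {'RPG': 85.0, 'Open World': 37.5, 'Strategy': 20.0}
```

The first game accounts for 75% of playtime, so its tags dominate the profile even though the second game has more votes for "Strategy".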

In [None]:
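def parse_tags(tag_string):
    """Parse tag string into dictionary"""
    if pd.isna(tag_string):
        return {}
    try:
        parsed = ast.literal_eval(str(tag_string))
    except (ValueError, SyntaxError):
        # Malformed literal: treat as "no tags"
        return {}
    # Some rows may store an empty list instead of a dict
    return parsed if isinstance(parsed, dict) else {}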

def parse_genre(genre_string):
    """Parse genre string into list"""
    if pd.isna(genre_string):
        return []
    return [g.strip() for g in str(genre_string).split(',')]

# Parse tags and genres for both datasets
owned_games['tags_dict'] = owned_games['tags'].apply(parse_tags)
owned_games['genre_list'] = owned_games['genre'].apply(parse_genre)

catalog_unowned['tags_dict'] = catalog_unowned['tags'].apply(parse_tags)
catalog_unowned['genre_list'] = catalog_unowned['genre'].apply(parse_genre)

print("Tags and genres parsed for content-based scoring")

In [None]:
# Build user profile from loved games (50+ hours) AND identify disliked games
loved_games = owned_games[owned_games['playtime_forever'] > 3000].copy()  # 50+ hours

# Also identify disliked games (games owned but barely/never played)
disliked_games = owned_games[owned_games['playtime_forever'] < 300].copy()  # <5 hours

print(f"Your loved games (50+ hours): {len(loved_games)}")
print(f"Your disliked games (<5 hours): {len(disliked_games)}")

print(f"\n\nTop 10 loved games:")
for _, game in loved_games.nlargest(10, 'playtime_forever')[['name', 'playtime_forever']].iterrows():
    print(f"  - {game['name']}: {game['playtime_forever']/60:.1f} hours")

In [None]:
# Create user tag profile (weighted by playtime)
# Also build disliked profile from games with low playtime
user_tag_profile = {}
total_playtime = loved_games['playtime_forever'].sum()

# NSFW and meta tag filters (same as model_development)
nsfw_tags = {
    'Sexual Content', 'Nudity', 'NSFW', 'Adult',
    'Hentai', 'Erotic', 'Sexual', 'Porn', '18+', 'Adult Only'
}

meta_tags = {
    'Indie', 'Casual', 'Free to Play', 'Early Access',
    'Great Soundtrack', 'Singleplayer', 'Multiplayer',
    'Co-op', 'Online Co-Op', 'PvP', 'PvE',
    'Moddable', 'Controller', 'Partial Controller Support',
    'Steam Achievements', 'Steam Cloud', 'Steam Trading Cards',
    'VR', 'VR Only',
    'Anime', 'Cute', 'Funny', 'Comedy',
    'Classic', 'Remake', 'Remaster', 'Retro'
}

# Build loved tag profile
for _, game in loved_games.iterrows():
    playtime_weight = game['playtime_forever'] / total_playtime
    for tag, votes in game['tags_dict'].items():
        # Skip NSFW and meta tags
        if tag in nsfw_tags or tag in meta_tags:
            continue
        if tag not in user_tag_profile:
            user_tag_profile[tag] = 0
        user_tag_profile[tag] += votes * playtime_weight

# Build disliked tag profile (tags that appear in low-playtime games)
disliked_tag_profile = {}
for _, game in disliked_games.iterrows():
    for tag in game['tags_dict'].keys():
        # Skip NSFW and meta tags
        if tag in nsfw_tags or tag in meta_tags:
            continue
        disliked_tag_profile[tag] = disliked_tag_profile.get(tag, 0) + 1

# Remove overlaps with loved tags (don't penalize tags you also love)
loved_tag_set = set(user_tag_profile.keys())
disliked_tag_profile = {tag: count for tag, count in disliked_tag_profile.items() 
                       if tag not in loved_tag_set and count >= 3}  # At least 3 games to be confident

# Sort by importance
top_user_tags = sorted(user_tag_profile.items(), key=lambda x: x[1], reverse=True)[:30]

print("\n\nYour top 30 tags (weighted by playtime, excluding NSFW/meta):")
for tag, score in top_user_tags:
    print(f"  {tag}: {score:.1f}")

if disliked_tag_profile:
    print(f"\n\nDisliked tags (appear in unplayed games, NOT in loved games):")
    for tag, count in sorted(disliked_tag_profile.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  - {tag}: {count} games (will be penalized)")

In [None]:
# Create user genre profile
# Also build disliked genre profile
user_genre_profile = {}
disliked_genre_profile = {}

# Meta genres to exclude from matching
meta_genres = {
    'Indie', 'Casual', 'Early Access', 'Free to Play',
    'Massively Multiplayer',
    'Utilities', 'Software', 'Animation & Modeling', 'Design & Illustration',
    'Audio Production', 'Video Production', 'Web Publishing', 'Education',
    'Photo Editing', 'Game Development'
}

for _, game in loved_games.iterrows():
    playtime_weight = game['playtime_forever'] / total_playtime
    for genre in game['genre_list']:
        # Skip meta genres
        if genre in meta_genres:
            continue
        if genre not in user_genre_profile:
            user_genre_profile[genre] = 0
        user_genre_profile[genre] += playtime_weight

# Build disliked genre profile
for _, game in disliked_games.iterrows():
    for genre in game['genre_list']:
        # Skip meta genres
        if genre in meta_genres:
            continue
        disliked_genre_profile[genre] = disliked_genre_profile.get(genre, 0) + 1

# Remove overlaps
loved_genre_set = set(user_genre_profile.keys())
disliked_genre_profile = {genre: count for genre, count in disliked_genre_profile.items() 
                         if genre not in loved_genre_set and count >= 3}

print("\n\nYour genre preferences (excluding meta genres):")
for genre, score in sorted(user_genre_profile.items(), key=lambda x: x[1], reverse=True):
    print(f"  {genre}: {score:.2%}")

if disliked_genre_profile:
    print(f"\n\nDisliked genres (will be penalized):")
    for genre, count in sorted(disliked_genre_profile.items(), key=lambda x: x[1], reverse=True):
        print(f"  - {genre}: {count} games")

In [None]:
# Calculate content-based scores for catalog games (ADVANCED VERSION)
def calculate_content_score(game_row, user_tag_profile, user_genre_profile, loved_games, 
                            disliked_tag_profile, disliked_genre_profile):
    """
    Calculate content-based score (0-100) for a single game
    
    Components:
    - Tag similarity: 45 points (MOST IMPORTANT - specific gameplay features)
    - Genre overlap: 20 points (broad categories)
    - Median playtime match: 20 points (engagement signal)
    - Review quality: 15 points (community sentiment)
    
    Penalties:
    - Disliked tags: -10 points per matching disliked tag (soft penalty)
    - Disliked genres: -15 points per matching disliked genre (soft penalty)
    - NSFW tags: Auto-exclude (hard filter, not scored)
    - Early Access: Can be filtered out (hard filter, not scored)
    """
    score = 0
    
    # NSFW filter - these tags auto-disqualify
    nsfw_tags = {
        'Sexual Content', 'Nudity', 'NSFW', 'Adult',
        'Hentai', 'Erotic', 'Sexual', 'Porn', '18+', 'Adult Only'
    }
    
    meta_tags = {
        'Indie', 'Casual', 'Free to Play', 'Early Access',
        'Great Soundtrack', 'Singleplayer', 'Multiplayer',
        'Co-op', 'Online Co-Op', 'PvP', 'PvE',
        'Moddable', 'Controller', 'Partial Controller Support',
        'Steam Achievements', 'Steam Cloud', 'Steam Trading Cards',
        'VR', 'VR Only',
        'Anime', 'Cute', 'Funny', 'Comedy',
        'Classic', 'Remake', 'Remaster', 'Retro'
    }
    
    meta_genres = {
        'Indie', 'Casual', 'Early Access', 'Free to Play',
        'Massively Multiplayer',
        'Utilities', 'Software', 'Animation & Modeling', 'Design & Illustration',
        'Audio Production', 'Video Production', 'Web Publishing', 'Education',
        'Photo Editing', 'Game Development'
    }
    
    # Check for NSFW content (hard filter - return 0 if found)
    game_tags = game_row['tags_dict']
    if any(tag in nsfw_tags for tag in game_tags.keys()):
        return 0  # NSFW content auto-disqualified
    
    # 1. Tag similarity (45 points) - MOST IMPORTANT
    tag_score = 0
    tag_penalty = 0
    if game_tags:
        for tag, votes in game_tags.items():
            # Skip NSFW and meta tags from scoring
            if tag in nsfw_tags or tag in meta_tags:
                continue
            
            # Positive matches
            if tag in user_tag_profile:
                tag_score += user_tag_profile[tag] * (votes / 1000)  # Normalize votes
            
            # Negative matches (soft penalty)
            if tag in disliked_tag_profile:
                tag_penalty += 10  # -10 points per disliked tag
        
        # Normalize tag score to 0-45 range
        max_possible_tag_score = sum(user_tag_profile.values()) * 10  # Approximate max
        tag_score = min(45, (tag_score / max_possible_tag_score) * 45) if max_possible_tag_score > 0 else 0
    
    score += tag_score
    score -= tag_penalty  # Apply soft penalty
    
    # 2. Genre overlap (20 points)
    genre_score = 0
    genre_penalty = 0
    game_genres = game_row['genre_list']
    if game_genres:
        for genre in game_genres:
            # Skip meta genres
            if genre in meta_genres:
                continue
            
            # Positive matches
            if genre in user_genre_profile:
                genre_score += user_genre_profile[genre]
            
            # Negative matches (soft penalty)
            if genre in disliked_genre_profile:
                genre_penalty += 15  # -15 points per disliked genre
        
        genre_score = min(20, genre_score * 20)  # Normalize to 0-20
    
    score += genre_score
    score -= genre_penalty  # Apply soft penalty
    
    # 3. Median playtime similarity (20 points)
    # Higher median = deeper, more engaging game
    median_playtime = game_row.get('median_forever', 0)
    if pd.notna(median_playtime) and median_playtime > 0:
        median_hours = median_playtime / 60
        # Games with high median playtime get higher scores
        if median_hours >= 50:
            playtime_score = 20  # Deep, engaging game
        elif median_hours >= 20:
            playtime_score = 15  # Good engagement
        elif median_hours >= 10:
            playtime_score = 10  # Decent engagement
        elif median_hours >= 5:
            playtime_score = 5   # Some engagement
        else:
            playtime_score = 0   # Low engagement
    else:
        playtime_score = 0
    score += playtime_score
    
    # 4. Review quality (15 points)
    positive = game_row.get('positive', 0)
    negative = game_row.get('negative', 0)
    total_reviews = positive + negative
    
    if total_reviews > 0:
        positive_ratio = positive / total_reviews
        review_percentage = positive_ratio * 100
        
        # Quality tiers (matching model_development logic)
        if review_percentage >= 95:
            quality_multiplier = 2.5  # Overwhelmingly Positive
        elif review_percentage >= 90:
            quality_multiplier = 2.0  # Very Positive
        elif review_percentage >= 80:
            quality_multiplier = 1.5  # Mostly Positive
        elif review_percentage >= 70:
            quality_multiplier = 1.0  # Positive
        elif review_percentage >= 60:
            quality_multiplier = 0.5  # Mixed (penalize)
        else:
            quality_multiplier = 0.1  # Negative/Mostly Negative (heavy penalty)
        
        # Volume bonus (logarithmic - more reviews = more reliable)
        volume_score = np.log10(total_reviews + 1) * quality_multiplier
        review_score = min(15, volume_score * 1.5)  # Scale to 0-15 range
    else:
        review_score = 0
    score += review_score
    
    # Ensure score doesn't go below 0
    return max(0, score)

print("Calculating content-based scores for catalog games...")
print("This may take a minute...")
print("\nFeatures:")
print("  ✅ Tag matching (45%) - specific gameplay features")
print("  ✅ Genre matching (20%) - broad categories")
print("  ✅ Median playtime (20%) - engagement signal")
print("  ✅ Review quality (15%) - community sentiment")
print("  ✅ NSFW filter (hard exclusion)")
print("  ✅ Meta tag/genre filtering (excluded from matching)")
print("  ✅ Disliked tag penalties (soft -10 points per tag)")
print("  ✅ Disliked genre penalties (soft -15 points per genre)")

catalog_unowned['content_score'] = catalog_unowned.apply(
    lambda row: calculate_content_score(
        row, user_tag_profile, user_genre_profile, loved_games,
        disliked_tag_profile, disliked_genre_profile
    ),
    axis=1
)

print(f"\n✓ Content-based scores calculated!")
print(f"\nContent Score Statistics:")
print(catalog_unowned['content_score'].describe())

## 5. Merge ML and Content-Based Scores

In [None]:
# Merge ML scores with catalog
catalog_with_scores = catalog_unowned.merge(
    ml_scores,
    on='appid',
    how='left',
    suffixes=('', '_ml')
)

# Fill missing ML scores with 0 (games not in ML recommendations)
catalog_with_scores['ml_score'] = catalog_with_scores['ml_score'].fillna(0)

print(f"Combined dataset: {len(catalog_with_scores)} games")
print(f"Games with ML scores: {(catalog_with_scores['ml_score'] > 0).sum()}")
print(f"Games with content scores: {(catalog_with_scores['content_score'] > 0).sum()}")

## 6. Calculate Hybrid Scores

Combine ML and content-based scores with configurable weights
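
The review component of the hybrid score can also be computed without a row-wise `apply`. The sketch below assumes `positive` and `negative` columns as in the catalog; `review_score_vectorized` is a hypothetical name, and it follows the same 80-point ratio plus 20-point volume bonus used in this notebook.

```python
import numpy as np
import pandas as pd

def review_score_vectorized(df: pd.DataFrame) -> pd.Series:
    """0-100 review score: up to 80 points for ratio, up to 20 for volume."""
    total = df["positive"] + df["negative"]
    # Mask zero totals to NaN before dividing, then treat them as ratio 0.
    ratio = (df["positive"] / total.where(total > 0)).fillna(0)
    volume_bonus = np.minimum(20, np.log1p(total) / np.log1p(100_000) * 20)
    # Games with no reviews get a score of 0.
    return (ratio * 80 + volume_bonus).where(total > 0, 0.0)

toy = pd.DataFrame({"positive": [90, 0], "negative": [10, 0]})
toy_scores = review_score_vectorized(toy)
print(toy_scores.round(2).tolist())
```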

In [None]:
# Hybrid scoring weights (can be tuned)
WEIGHT_ML = 0.40
WEIGHT_CONTENT = 0.40
WEIGHT_REVIEW = 0.20

print(f"Hybrid Weights:")
print(f"  ML: {WEIGHT_ML:.0%}")
print(f"  Content-Based: {WEIGHT_CONTENT:.0%}")
print(f"  Review Quality: {WEIGHT_REVIEW:.0%}")
print(f"  Total: {WEIGHT_ML + WEIGHT_CONTENT + WEIGHT_REVIEW:.0%}")

# Calculate review quality score (0-100 scale)
def calculate_review_score(row):
    positive = row.get('positive', 0)
    negative = row.get('negative', 0)
    total = positive + negative
    
    if total == 0:
        return 0
    
    positive_ratio = positive / total
    # Base score: positive ratio (0-80 points)
    score = positive_ratio * 80
    # Volume bonus (0-20 points)
    volume_bonus = min(20, (np.log1p(total) / np.log1p(100000)) * 20)
    
    return score + volume_bonus

catalog_with_scores['review_score'] = catalog_with_scores.apply(calculate_review_score, axis=1)

# Calculate hybrid score
catalog_with_scores['hybrid_score'] = (
    WEIGHT_ML * catalog_with_scores['ml_score'] +
    WEIGHT_CONTENT * catalog_with_scores['content_score'] +
    WEIGHT_REVIEW * catalog_with_scores['review_score']
)

print(f"\n✓ Hybrid scores calculated!")
print(f"\nScore Statistics:")
print(catalog_with_scores[['ml_score', 'content_score', 'review_score', 'hybrid_score']].describe())

## 6.5 Optional: Apply User Preference Boosts/Hard Exclusions

You can further customize recommendations by:
- **Boosting** specific genres/tags you want to see more of (+5 to +20 points)
- **Hard excluding** genres/tags you absolutely don't want (removes from candidates)
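
The effect of a boost and a hard exclusion can be seen on a toy table. This is a simplified sketch: it adds the boost directly to `hybrid_score`, which is an assumption for illustration and not necessarily how the cell below distributes it. The games and preference values are made up.

```python
import pandas as pd

games = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre_list": [["RPG"], ["Sports"], ["Action", "RPG"]],
    "hybrid_score": [60.0, 80.0, 55.0],
})
boost_genres = {"RPG": 10}     # hypothetical preference: +10 for RPGs
exclude_genres = {"Sports"}    # hypothetical hard exclusion

# Boost: sum the bonus of every matching genre for each game.
boost = games["genre_list"].apply(
    lambda genres: sum(boost_genres.get(g, 0) for g in genres)
)
games["hybrid_score"] = games["hybrid_score"] + boost

# Hard exclusion: drop any game that has at least one excluded genre.
keep = ~games["genre_list"].apply(
    lambda genres: any(g in exclude_genres for g in genres)
)
games = games[keep]

print(games[["name", "hybrid_score"]].to_dict("records"))
# [{'name': 'A', 'hybrid_score': 70.0}, {'name': 'C', 'hybrid_score': 65.0}]
```

Game B had the highest raw score but is removed outright, which is the difference between a hard exclusion and a soft penalty.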

In [None]:
# Optional: Apply user preference adjustments
# Uncomment and customize these to your preferences

# Example: Boost specific genres/tags you want to see more of
BOOST_GENRES = {}  # e.g., {'RPG': 10, 'Action': 5}
BOOST_TAGS = {}    # e.g., {'Open World': 15, 'Multiplayer': 10}

# Example: Hard exclude genres/tags you never want to see
HARD_EXCLUDE_GENRES = []  # e.g., ['Sports', 'Racing']
HARD_EXCLUDE_TAGS = []    # e.g., ['Horror', 'Survival Horror', '2D']

def apply_preference_adjustments(df, boost_genres, boost_tags, exclude_genres, exclude_tags):
    """
    Apply preference boosts and hard exclusions
    
    Returns:
        A single DataFrame copy with boosted scores and with
        hard-excluded games removed
    """
    df = df.copy()
    
    # Apply boosts
    if boost_genres or boost_tags:
        print(f"\nApplying preference boosts...")
        
        for idx, row in df.iterrows():
            boost = 0
            
            # Genre boosts
            for genre in row['genre_list']:
                if genre in boost_genres:
                    boost += boost_genres[genre]
            
            # Tag boosts
            for tag in row['tags_dict'].keys():
                if tag in boost_tags:
                    boost += boost_tags[tag]
            
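            # Apply boost to the component scores. The direct hybrid_score
            # adjustment is overwritten when the hybrid score is recalculated
            # from ml_score and content_score after this function returns.
            if boost > 0:
                df.at[idx, 'ml_score'] += boost * 0.4
                df.at[idx, 'content_score'] += boost * 0.4
                df.at[idx, 'hybrid_score'] += boost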
    
    # Apply hard exclusions
    if exclude_genres or exclude_tags:
        print(f"\nApplying hard exclusions...")
        before_exclude = len(df)
        
        if exclude_genres:
            df = df[~df['genre_list'].apply(lambda x: any(g in exclude_genres for g in x))]
            print(f"  - Excluded {before_exclude - len(df)} games with genres: {exclude_genres}")
        
        if exclude_tags:
            before_tag_exclude = len(df)
            df = df[~df['tags_dict'].apply(lambda x: any(t in exclude_tags for t in x.keys()))]
            print(f"  - Excluded {before_tag_exclude - len(df)} games with tags: {exclude_tags}")
    
    return df

# Apply adjustments to catalog
catalog_with_scores = apply_preference_adjustments(
    catalog_with_scores,
    BOOST_GENRES,
    BOOST_TAGS,
    HARD_EXCLUDE_GENRES,
    HARD_EXCLUDE_TAGS
)

# Recalculate hybrid scores if boosts were applied
if BOOST_GENRES or BOOST_TAGS:
    catalog_with_scores['hybrid_score'] = (
        WEIGHT_ML * catalog_with_scores['ml_score'] +
        WEIGHT_CONTENT * catalog_with_scores['content_score'] +
        WEIGHT_REVIEW * catalog_with_scores['review_score']
    )
    print(f"\n✓ Hybrid scores recalculated with preference boosts!")

print(f"\n✓ Preference adjustments applied!")
print(f"Remaining candidates: {len(catalog_with_scores)}")

## 7. Generate Top 20 Recommendations (All 3 Approaches)

In [None]:
# Apply universal filters (quality gates) - matching model_development logic
MIN_REVIEWS = 1000
MIN_REVIEW_SCORE = 70  # % positive
EXCLUDE_EARLY_ACCESS = True
SFW_ONLY = True

print(f"Applying quality filters:")
print(f"  - Minimum reviews: {MIN_REVIEWS}")
print(f"  - Minimum review score: {MIN_REVIEW_SCORE}%")
print(f"  - Exclude Early Access: {EXCLUDE_EARLY_ACCESS}")
print(f"  - SFW only: {SFW_ONLY}")

filtered_catalog = catalog_with_scores[
    ((catalog_with_scores['positive'] + catalog_with_scores['negative']) >= MIN_REVIEWS) &
    ((catalog_with_scores['positive'] / (catalog_with_scores['positive'] + catalog_with_scores['negative']) * 100) >= MIN_REVIEW_SCORE) &
    (catalog_with_scores['content_score'] > 0)  # Exclude games with 0 content score (NSFW filtered)
].copy()

# Apply Early Access filter
if EXCLUDE_EARLY_ACCESS:
    before_ea = len(filtered_catalog)
    filtered_catalog = filtered_catalog[
        ~filtered_catalog['genre_list'].apply(lambda x: 'Early Access' in x)
    ]
    print(f"  - Filtered out {before_ea - len(filtered_catalog)} Early Access games")

print(f"\nAfter quality filters: {len(filtered_catalog)} games")
print(f"  (min {MIN_REVIEWS} reviews, min {MIN_REVIEW_SCORE}% positive)")

# Get top 20 for each approach
top_20_ml = filtered_catalog.nlargest(20, 'ml_score')
top_20_content = filtered_catalog.nlargest(20, 'content_score')
top_20_hybrid = filtered_catalog.nlargest(20, 'hybrid_score')

print(f"\n✓ Top 20 recommendations generated for all 3 approaches!")

## 8. Compare Approaches Side-by-Side

In [None]:
def display_recommendations(df, approach_name, score_col):
    """
    Display top 20 recommendations in a readable format
    """
    print("\n" + "="*100)
    print(f"TOP 20 RECOMMENDATIONS - {approach_name.upper()}")
    print("="*100)
    
    for idx, (_, row) in enumerate(df.iterrows(), 1):
        # Calculate review score
        total_reviews = row['positive'] + row['negative']
        review_pct = (row['positive'] / total_reviews * 100) if total_reviews > 0 else 0
        
        # Get top 3 tags
        game_tags = row['tags_dict']
        if game_tags:
            top_tags = sorted(game_tags.items(), key=lambda x: x[1], reverse=True)[:3]
            tag_names = [t[0] for t in top_tags]
        else:
            tag_names = []
        
        print(f"\n{idx}. {row['name']}")
        print(f"   Score: {row[score_col]:.1f}/100")
        # The hybrid list is labelled 'Hybrid (40% ML + ...)', so match on the prefix
        if approach_name.startswith('Hybrid'):
            print(f"   (ML: {row['ml_score']:.1f} | Content: {row['content_score']:.1f} | Review: {row['review_score']:.1f})")
        print(f"   Reviews: {review_pct:.1f}% positive ({int(total_reviews):,} total)")
        print(f"   Genre: {row['genre']}")
        print(f"   Tags: {', '.join(tag_names)}")
        print(f"   Median Playtime: {row.get('median_forever', 0):.0f} min")
        print(f"   Price: ${row.get('price', 0)/100:.2f}")

# Display all three approaches
display_recommendations(top_20_ml, 'Pure ML', 'ml_score')
display_recommendations(top_20_content, 'Pure Content-Based', 'content_score')
display_recommendations(top_20_hybrid, 'Hybrid (40% ML + 40% Content + 20% Review)', 'hybrid_score')

## 9. Analyze Overlap Between Approaches

In [None]:
# Get app IDs for each approach
ml_appids = set(top_20_ml['appid'].values)
content_appids = set(top_20_content['appid'].values)
hybrid_appids = set(top_20_hybrid['appid'].values)

# Calculate overlaps
ml_content_overlap = ml_appids.intersection(content_appids)
ml_hybrid_overlap = ml_appids.intersection(hybrid_appids)
content_hybrid_overlap = content_appids.intersection(hybrid_appids)
all_three_overlap = ml_appids.intersection(content_appids).intersection(hybrid_appids)

print("OVERLAP ANALYSIS")
print("="*60)
print(f"\nML vs Content-Based: {len(ml_content_overlap)}/20 games in common ({len(ml_content_overlap)/20*100:.0f}%)")
print(f"ML vs Hybrid: {len(ml_hybrid_overlap)}/20 games in common ({len(ml_hybrid_overlap)/20*100:.0f}%)")
print(f"Content vs Hybrid: {len(content_hybrid_overlap)}/20 games in common ({len(content_hybrid_overlap)/20*100:.0f}%)")
print(f"\nAll 3 approaches agree on: {len(all_three_overlap)} games ({len(all_three_overlap)/20*100:.0f}%)")

if all_three_overlap:
    print(f"\nGames that appear in ALL 3 top 20 lists:")
    consensus_games = catalog_with_scores[catalog_with_scores['appid'].isin(all_three_overlap)]
    for _, game in consensus_games.iterrows():
        print(f"  - {game['name']}")

# Unique recommendations per approach
ml_unique = ml_appids - content_appids - hybrid_appids
content_unique = content_appids - ml_appids - hybrid_appids
hybrid_unique = hybrid_appids - ml_appids - content_appids

print(f"\nUnique to ML only: {len(ml_unique)} games")
print(f"Unique to Content only: {len(content_unique)} games")
print(f"Unique to Hybrid only: {len(hybrid_unique)} games")

## 10. Visualize Score Distributions

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# ML scores
axes[0, 0].hist(filtered_catalog['ml_score'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(top_20_ml['ml_score'].min(), color='red', linestyle='--', label='Top 20 cutoff')
axes[0, 0].set_xlabel('ML Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('ML Score Distribution')
axes[0, 0].legend()

# Content scores
axes[0, 1].hist(filtered_catalog['content_score'], bins=30, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].axvline(top_20_content['content_score'].min(), color='red', linestyle='--', label='Top 20 cutoff')
axes[0, 1].set_xlabel('Content Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Content-Based Score Distribution')
axes[0, 1].legend()

# Hybrid scores
axes[1, 0].hist(filtered_catalog['hybrid_score'], bins=30, edgecolor='black', alpha=0.7, color='purple')
axes[1, 0].axvline(top_20_hybrid['hybrid_score'].min(), color='red', linestyle='--', label='Top 20 cutoff')
axes[1, 0].set_xlabel('Hybrid Score')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Hybrid Score Distribution')
axes[1, 0].legend()

# Score correlation scatter
axes[1, 1].scatter(filtered_catalog['ml_score'], filtered_catalog['content_score'], alpha=0.3)
axes[1, 1].set_xlabel('ML Score')
axes[1, 1].set_ylabel('Content Score')
axes[1, 1].set_title('ML vs Content Score Correlation')
correlation = filtered_catalog[['ml_score', 'content_score']].corr().iloc[0, 1]
axes[1, 1].text(0.05, 0.95, f'Correlation: {correlation:.3f}', 
               transform=axes[1, 1].transAxes, verticalalignment='top')

plt.tight_layout()
plt.show()

print(f"\nML vs Content-Based correlation: {correlation:.3f}")
if correlation < 0.3:
    print("→ Low correlation: ML and Content capture DIFFERENT signals (good for hybrid!)")
elif correlation > 0.7:
    print("→ High correlation: ML and Content agree strongly (hybrid may not add much)")
else:
    print("→ Moderate correlation: ML and Content complement each other (hybrid promising!)")

## 11. Save Hybrid Recommendations

In [None]:
# Save top 20 for each approach
top_20_ml[['appid', 'name', 'ml_score', 'genre', 'positive', 'negative', 'median_forever', 'price']].to_csv(
    '../data/recommendations_ml_only.csv', index=False
)

top_20_content[['appid', 'name', 'content_score', 'genre', 'positive', 'negative', 'median_forever', 'price']].to_csv(
    '../data/recommendations_content_only.csv', index=False
)

top_20_hybrid[['appid', 'name', 'hybrid_score', 'ml_score', 'content_score', 'review_score', 
               'genre', 'positive', 'negative', 'median_forever', 'price']].to_csv(
    '../data/recommendations_hybrid.csv', index=False
)

print("✓ All recommendations saved!")
print("\nFiles created:")
print("  - recommendations_ml_only.csv")
print("  - recommendations_content_only.csv")
print("  - recommendations_hybrid.csv")

## Summary

### ✅ Hybrid System Complete with Advanced Filtering!

**What We Built**:
1. Pure ML recommendations (learned from your playtime patterns)
2. Pure content-based recommendations (similarity to your loved games)
3. Hybrid recommendations (weighted combination)

**Advanced Features Included** (matching model_development.ipynb):

**🔒 Hard Filters** (auto-exclude before scoring):
- ✅ NSFW content filter (Sexual Content, Nudity, etc.)
- ✅ Early Access filter (exclude unfinished games)
- ✅ Meta genre filter (Utilities, Software, etc. excluded from matching)
- ✅ Meta tag filter (Indie, Casual, etc. excluded from matching)
- ✅ Minimum review count (1000+ reviews)
- ✅ Minimum review score (70%+ positive)
- ✅ Hard exclusions (user-specified genres/tags never shown)

**🎯 Soft Scoring** (influence recommendations):
- ✅ Tag matching (45%) - weighted by YOUR playtime on each tag
- ✅ Genre matching (20%) - broad category preferences
- ✅ Median playtime (20%) - community engagement signal
- ✅ Review quality (15%) - community sentiment with quality tiers
- ✅ Disliked tag penalties (-10 points per tag from unplayed games)
- ✅ Disliked genre penalties (-15 points per genre from unplayed games)
- ✅ User preference boosts (+5 to +20 points for preferred genres/tags)

**Key Findings**:
- ML vs Content correlation: [see output above]
- Overlap between approaches: [see output above]
- Best weight configuration: **To be determined based on your evaluation**

**Filtering Logic**:
```
1. Calculate Content score (similarity + penalties for disliked features; NSFW games score 0)
2. Merge ML score (learned from playtime; missing scores filled with 0)
3. Calculate Review score (quality + volume)
4. Combine into hybrid score
5. Apply user preference boosts and hard exclusions (optional), then recalculate hybrid score
6. Apply universal filters (min reviews, min review score, content score > 0, Early Access)
7. Return top 20
```

**Next Steps**:
1. ✅ Evaluate recommendations manually (which approach is best?)
2. Adjust hybrid weights based on your preferences
3. Set hard exclusions for genres/tags you never want to see
4. Set preference boosts for genres/tags you want to see more of
5. Add diversity filter to reduce similar games in top 20 (from feature_engineering.ipynb)
6. Build web interface for easier interaction

**Files Created**:
- `recommendations_ml_only.csv` - Pure ML top 20
- `recommendations_content_only.csv` - Pure content-based top 20
- `recommendations_hybrid.csv` - Hybrid top 20 with all advanced features

**How This Differs from Basic Content-Based**:
- ✅ Learns dislike patterns from low-playtime games (soft penalties)
- ✅ Filters out NSFW, meta genres, Early Access automatically
- ✅ Combines ML predictions with content similarity
- ✅ Supports hard exclusions for genres/tags you hate
- ✅ Supports preference boosts for genres/tags you want more of
- ✅ More accurate tag/genre matching (excludes meta tags like "Indie")

## 13. Experiment: Tune Hybrid Weights

Try different weight combinations to see which works best
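
One way to compare weight configurations beyond reading the lists is to measure how much their top-N sets overlap. The sketch below uses a made-up four-game table and Jaccard similarity; `top_n` is a hypothetical helper and the scores are not from the notebook's data.

```python
import pandas as pd

# Made-up scores for four games.
scores = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "ml_score": [90, 20, 60, 40],
    "content_score": [10, 80, 60, 50],
    "review_score": [50, 50, 90, 70],
})

def top_n(df, w_ml, w_content, w_review, n=2):
    """Return the names of the n games with the highest weighted score."""
    hybrid = (w_ml * df["ml_score"]
              + w_content * df["content_score"]
              + w_review * df["review_score"])
    return set(df.loc[hybrid.nlargest(n).index, "name"])

ml_heavy = top_n(scores, 0.50, 0.30, 0.20)
content_heavy = top_n(scores, 0.30, 0.50, 0.20)

# Jaccard similarity: shared recommendations / all recommendations.
jaccard = len(ml_heavy & content_heavy) / len(ml_heavy | content_heavy)

print(sorted(ml_heavy))       # ['A', 'C']
print(sorted(content_heavy))  # ['B', 'C']
print(f"{jaccard:.2f}")       # 0.33
```

A Jaccard value near 1 suggests the weights barely change the list, while a low value means the choice of weights matters for this user.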

In [None]:
# Define weight combinations to try
weight_experiments = [
    {'name': 'ML Heavy', 'ml': 0.50, 'content': 0.30, 'review': 0.20},
    {'name': 'Content Heavy', 'ml': 0.30, 'content': 0.50, 'review': 0.20},
    {'name': 'Balanced', 'ml': 0.40, 'content': 0.40, 'review': 0.20},
    {'name': 'Review Heavy', 'ml': 0.35, 'content': 0.35, 'review': 0.30},
]

print("WEIGHT TUNING EXPERIMENTS")
print("="*80)

for exp in weight_experiments:
    col_name = f'hybrid_{exp["name"].lower().replace(" ", "_")}'
    
    # Calculate hybrid score with these weights on the filtered candidates,
    # so the column exists on the frame that nlargest() reads from
    experiment_scores = filtered_catalog.assign(**{
        col_name: (
            exp['ml'] * filtered_catalog['ml_score'] +
            exp['content'] * filtered_catalog['content_score'] +
            exp['review'] * filtered_catalog['review_score']
        )
    })
    
    # Get top 10 for this configuration
    top_10 = experiment_scores.nlargest(10, col_name)
    
    print(f"\n{exp['name']} (ML: {exp['ml']:.0%}, Content: {exp['content']:.0%}, Review: {exp['review']:.0%})")
    print("-" * 80)
    for idx, (_, row) in enumerate(top_10.iterrows(), 1):
        print(f"{idx}. {row['name']} (Score: {row[col_name]:.1f})")

print("\n" + "="*80)
print("Compare the lists above and see which weight combination appeals most to you!")
