# 🎬 Day 1: Global Streaming Trends - Exploratory Data Analysis

## 📚 Learning Objectives
Today we'll practice the fundamentals of exploratory data analysis using **sample entertainment data** modeled on real trending lists. By the end of this notebook, you'll be able to:

- **Load and inspect** streaming data using pandas
- **Explore data structure** with `.head()`, `.info()`, `.describe()`
- **Handle missing data** and understand data quality
- **Create visualizations** with matplotlib and seaborn
- **Discover insights** about global entertainment trends
- **Ask data-driven questions** and find answers through analysis

## 🎯 Real-World Application
We're analyzing a sample of **what the world is watching**, modeled on trending lists from streaming platforms. Data like this helps:
- **Content creators** understand trending genres and themes
- **Streaming platforms** optimize their recommendation algorithms
- **Investors** identify successful entertainment companies
- **Marketers** time their campaigns around popular content

---

## 🔧 1. Setting Up Our Environment

First, let's import the essential libraries for data analysis and configure our plotting defaults.

In [1]:
# Essential libraries for data analysis
import pandas as pd                 # Data manipulation and analysis
import numpy as np                  # Numerical computing
import matplotlib.pyplot as plt     # Basic plotting
import seaborn as sns              # Statistical visualization
import requests                    # API calls
import json                        # JSON data handling
from datetime import datetime      # Date/time operations
import warnings

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"📅 Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

✅ Libraries imported successfully!
📅 Analysis date: 2025-08-09 19:24


## 🌐 2. Data Collection Functions

Let's create functions to fetch real-time streaming data. We'll use The Movie Database (TMDB) API, which provides current trending content across platforms.

**Note**: For learning purposes, we'll start with sample data that mirrors real API responses. In production, you'd use actual API keys.
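
For reference, here is a minimal sketch of what the production path might look like. It assumes TMDB's v3 trending endpoint (`/3/trending/all/week`) with an `api_key` query parameter, and it assumes each result carries `media_type`, `title` or `name`, `vote_average`, `popularity`, `genre_ids`, and `release_date` or `first_air_date`. The helper names `tmdb_results_to_frame` and `fetch_trending` are hypothetical, and the payload at the bottom is made-up placeholder data so the parsing step can run offline.

```python
import pandas as pd
import requests

# Assumed endpoint: check the TMDB documentation before relying on it.
TMDB_TRENDING_URL = "https://api.themoviedb.org/3/trending/all/week"


def tmdb_results_to_frame(payload):
    """Flatten a TMDB-style trending payload into a DataFrame (hypothetical helper)."""
    rows = []
    for item in payload.get("results", []):
        media_type = item.get("media_type")
        if media_type not in ("movie", "tv"):
            continue  # skip people and any unexpected entries
        rows.append({
            # Movies are assumed to use "title", TV shows "name"
            "title": item.get("title") or item.get("name"),
            "content_type": "Movie" if media_type == "movie" else "TV Show",
            "vote_average": item.get("vote_average"),
            "popularity": item.get("popularity"),
            "genre_ids": item.get("genre_ids", []),
            "release_date": item.get("release_date") or item.get("first_air_date"),
        })
    return pd.DataFrame(rows)


def fetch_trending(api_key, timeout=10):
    """Request the trending list and parse it (needs a real API key and network access)."""
    response = requests.get(TMDB_TRENDING_URL, params={"api_key": api_key}, timeout=timeout)
    response.raise_for_status()
    return tmdb_results_to_frame(response.json())


# Offline demonstration with placeholder values (not real TMDB data)
sample_payload = {
    "results": [
        {"media_type": "movie", "title": "Example Movie", "vote_average": 7.5,
         "popularity": 100.0, "genre_ids": [28], "release_date": "2023-01-01"},
        {"media_type": "tv", "name": "Example Show", "vote_average": 8.5,
         "popularity": 200.0, "genre_ids": [18], "first_air_date": "2022-01-01"},
        {"media_type": "person", "name": "Example Person"},
    ]
}
demo_df = tmdb_results_to_frame(sample_payload)
print(demo_df)
```

Keeping the parsing separate from the network call means the column mapping can be checked without an API key. Note that an API like this returns numeric genre IDs, whereas our sample data stores genre names as comma-separated strings.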

In [2]:
def create_sample_streaming_data():
    """
    Creates realistic sample streaming data that mirrors TMDB API responses.
    In production, this would fetch real-time data from streaming APIs.
    """
    
    # Sample trending movies (based on real current trends)
    movies_data = {
        'title': [
            'Spider-Man: No Way Home', 'The Batman', 'Top Gun: Maverick', 
            'Avatar: The Way of Water', 'Black Panther: Wakanda Forever',
            'Dune', 'No Time to Die', 'Fast X', 'Indiana Jones 5', 'John Wick 4',
            'Mission: Impossible 7', 'Guardians of the Galaxy 3', 'The Flash',
            'Oppenheimer', 'Barbie', 'Scream VI', 'Cocaine Bear', 'Ant-Man 3'
        ],
        'content_type': ['Movie'] * 18,
        'vote_average': [8.2, 7.8, 8.4, 7.9, 7.3, 8.1, 7.4, 6.8, 7.1, 8.0, 7.6, 8.3, 6.9, 8.7, 7.9, 6.5, 6.2, 6.4],
        'popularity': [2847.1, 1234.5, 2156.8, 1876.2, 1543.9, 1987.4, 1234.7, 1654.3, 1432.1, 1789.5, 1345.6, 1678.9, 1123.4, 2234.5, 2087.3, 987.6, 876.5, 754.3],
        'genre_ids': [
            'Action,Adventure,Sci-Fi', 'Action,Crime,Drama', 'Action,Drama',
            'Action,Adventure,Fantasy', 'Action,Adventure,Drama', 'Action,Adventure,Drama',
            'Action,Adventure,Thriller', 'Action,Crime,Thriller', 'Action,Adventure',
            'Action,Crime,Thriller', 'Action,Adventure,Thriller', 'Action,Adventure,Comedy',
            'Action,Adventure,Fantasy', 'Biography,Drama,History', 'Adventure,Comedy,Fantasy',
            'Horror,Mystery,Thriller', 'Comedy,Horror,Thriller', 'Action,Adventure,Comedy'
        ],
        'release_date': [
            '2021-12-17', '2022-03-04', '2022-05-27', '2022-12-16', '2022-11-11',
            '2021-10-22', '2021-10-08', '2023-05-19', '2023-06-30', '2023-03-24',
            '2023-07-14', '2023-05-05', '2023-06-16', '2023-07-21', '2023-07-21',
            '2023-03-10', '2023-02-24', '2023-02-17'
        ],
        'runtime': [148, 176, 131, 192, 161, 155, 163, 141, 154, 169, 163, 150, 144, 180, 114, 123, 95, 125],
        'budget': [200, 185, 170, 350, 250, 165, 250, 340, 300, 90, 290, 200, 220, 100, 145, 24, 17, 200],
        'revenue': [1921.8, 771.0, 1488.7, 2320.2, 859.2, 401.8, 774.2, 714.6, 384.0, 440.1, 567.5, 845.6, 271.8, 952.6, 1446.9, 173.0, 90.0, 476.1]
    }
    
    # Sample trending TV shows
    tv_data = {
        'title': [
            'Stranger Things', 'Wednesday', 'The Bear', 'House of the Dragon',
            'The Crown', 'Euphoria', 'Ozark', 'Squid Game', 'Bridgerton',
            'The Witcher', 'Money Heist', 'Peaky Blinders', 'Breaking Bad',
            'Game of Thrones', 'Friends', 'The Office', 'Better Call Saul', 'Succession'
        ],
        'content_type': ['TV Show'] * 18,
        'vote_average': [8.7, 8.1, 8.6, 8.4, 8.2, 8.4, 8.4, 8.0, 7.3, 8.2, 8.2, 8.8, 9.5, 9.2, 8.9, 9.0, 8.8, 8.1],
        'popularity': [3456.7, 2876.5, 2234.8, 2987.4, 1876.3, 2543.7, 2187.6, 2765.4, 1987.5, 2345.6, 2134.5, 1876.4, 2987.6, 3234.5, 2876.3, 3123.4, 2543.7, 1876.5],
        'genre_ids': [
            'Drama,Fantasy,Horror', 'Comedy,Crime,Family', 'Comedy,Drama', 'Action,Adventure,Drama',
            'Biography,Drama,History', 'Drama', 'Crime,Drama,Thriller', 'Action,Drama,Mystery',
            'Drama,Romance', 'Action,Adventure,Drama', 'Action,Crime,Mystery', 'Crime,Drama',
            'Crime,Drama,Thriller', 'Action,Adventure,Drama', 'Comedy,Romance', 'Comedy',
            'Crime,Drama', 'Comedy,Drama'
        ],
        'release_date': [
            '2016-07-15', '2022-11-23', '2022-06-23', '2022-08-21', '2016-11-04',
            '2019-06-16', '2017-07-21', '2021-09-17', '2020-12-25', '2019-12-20',
            '2017-05-02', '2013-09-12', '2008-01-20', '2011-04-17', '1994-09-22',
            '2005-03-24', '2015-02-08', '2018-06-03'
        ],
        'runtime': [50, 45, 30, 60, 58, 50, 60, 56, 60, 60, 70, 60, 47, 57, 22, 22, 46, 60],
        'budget': [30, 50, 5, 200, 100, 30, 50, 21, 100, 80, 40, 20, 3, 100, 10, 3, 40, 90],
        'revenue': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # TV shows don't have box office revenue
    }
    
    # Combine movies and TV shows
    all_data = {}
    for key in movies_data.keys():
        all_data[key] = movies_data[key] + tv_data[key]
    
    return pd.DataFrame(all_data)

# Load our streaming data
print("📡 Fetching current global streaming trends...")
df = create_sample_streaming_data()
print(f"✅ Loaded {len(df)} trending titles across all platforms!")
print(f"📊 Data includes: {df['content_type'].value_counts().to_dict()}")

📡 Fetching current global streaming trends...
✅ Loaded 36 trending titles across all platforms!
📊 Data includes: {'Movie': 18, 'TV Show': 18}


## 👀 3. First Look at Our Data

Now let's explore our streaming dataset using pandas fundamentals. A careful first look is one of the most valuable steps in any data science project!

In [None]:
# 📋 Basic dataset information
print("🎬 GLOBAL STREAMING TRENDS DATASET")
print("=" * 50)
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"📅 Analysis timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n🔍 First 5 trending titles:")
df.head()

In [None]:
# 🔍 Dataset structure and data types
print("📋 DATASET STRUCTURE ANALYSIS")
print("=" * 40)
print("\n💾 Memory usage and data types:")
df.info()

print("\n📏 Column names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
# 📊 Statistical summary of numerical columns
print("📈 STATISTICAL SUMMARY")
print("=" * 30)
print("\n🔢 Numerical columns overview:")
df.describe()

### 🔍 Key Observations from Initial Exploration

Let's analyze what we learned from our first look:

**Dataset Overview:**
- **Size**: 36 trending titles (18 movies + 18 TV shows)
- **Ratings**: Average vote ranges from 6.2 to 9.5 out of 10
- **Popularity**: Scores range from ~750 to 3,400+ 
- **Content Mix**: Equal split between movies and TV shows

**Interesting Patterns to Investigate:**
1. Do TV shows get higher ratings than movies?
2. Which genres dominate the trending lists?
3. Is there a correlation between popularity and rating?
4. How do budgets relate to success?

## 🔍 4. Data Quality Assessment

Before diving into analysis, let's check for missing data, duplicates, and data quality issues.

In [None]:
# 🔍 Missing data analysis
print("🕳️ MISSING DATA ANALYSIS")
print("=" * 35)

missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing %': missing_percentage.round(2)
})

print("\n📊 Missing data by column:")
print(missing_summary[missing_summary['Missing Count'] > 0])

if missing_summary['Missing Count'].sum() == 0:
    print("✅ Excellent! No missing data found.")
else:
    print(f"⚠️ Found {missing_summary['Missing Count'].sum()} missing values total")
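
The sample data is complete, so the check above has nothing to repair. To see what handling missing values can look like, here is a small sketch on a made-up toy frame (the titles and numbers are placeholders, not part of our dataset). It shows two common options: filling a gap with the column median, and dropping rows where a required field is absent.

```python
import numpy as np
import pandas as pd

# Toy data with deliberate gaps (placeholder values for illustration only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [8.0, np.nan, 7.0, 9.0],
    "runtime": [120.0, 95.0, np.nan, 150.0],
})

# Count missing values per column, as in the cell above
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: fill a missing rating with the column median
filled = toy.copy()
filled["vote_average"] = filled["vote_average"].fillna(filled["vote_average"].median())

# Option 2: drop rows that lack a runtime
dropped = toy.dropna(subset=["runtime"])

print(filled)
print(dropped)
```

Which option fits depends on the analysis: imputing keeps every row but invents a value, while dropping keeps only observed values but shrinks the sample.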

In [None]:
# 🔍 Duplicate data check
print("🔄 DUPLICATE DATA CHECK")
print("=" * 30)

duplicate_titles = df['title'].duplicated().sum()
duplicate_rows = df.duplicated().sum()

print(f"📺 Duplicate titles: {duplicate_titles}")
print(f"📋 Duplicate rows: {duplicate_rows}")

if duplicate_titles == 0 and duplicate_rows == 0:
    print("✅ Great! No duplicates found.")
else:
    print("⚠️ Duplicates detected - may need cleaning")

# Check for unique titles
print(f"\n🎬 Unique titles: {df['title'].nunique()} out of {len(df)} total")
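
If duplicates had turned up, one possible cleanup is sketched below on a made-up toy frame. The rule used here (keep the row with the highest popularity for each title) is an assumption for illustration; the right rule depends on why the duplicates exist.

```python
import pandas as pd

# Toy data where title "A" appears twice (placeholder values)
toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "popularity": [10.0, 20.0, 30.0],
})

# Sort so the preferred row comes first, then keep the first row per title
deduped = (
    toy.sort_values("popularity", ascending=False)
       .drop_duplicates(subset="title", keep="first")
       .reset_index(drop=True)
)
print(deduped)
```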

In [None]:
# 🔍 Data type validation and conversion
print("🔧 DATA TYPE OPTIMIZATION")
print("=" * 35)

# Convert release_date to datetime
df['release_date'] = pd.to_datetime(df['release_date'])
print("✅ Converted release_date to datetime format")

# Create additional useful columns
df['release_year'] = df['release_date'].dt.year
df['content_age_days'] = (datetime.now() - df['release_date']).dt.days
print("✅ Added release_year and content_age_days columns")

# Check current data types
print("\n📋 Current data types:")
for col, dtype in df.dtypes.items():
    print(f"  {col:<20}: {dtype}")
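
Our sample dates are all well-formed, so the plain `pd.to_datetime()` call above succeeds. With messier input, a single bad value would raise an error. The sketch below, using made-up strings, shows how `errors="coerce"` turns unparseable entries into `NaT` so they can be counted and handled like any other missing value.

```python
import pandas as pd

# One valid date, one malformed string, one missing value (illustrative input)
raw_dates = pd.Series(["2023-07-21", "not a date", None])

# errors="coerce" converts anything that does not match the format into NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(f"Unparseable or missing dates: {parsed.isna().sum()}")
```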

## 📊 5. Basic Statistical Analysis

Now let's explore the core characteristics of trending entertainment content using pandas operations.

In [None]:
# 🎭 Content type distribution
print("🎭 CONTENT TYPE ANALYSIS")
print("=" * 30)

content_counts = df['content_type'].value_counts()
content_percentages = df['content_type'].value_counts(normalize=True) * 100

content_summary = pd.DataFrame({
    'Count': content_counts,
    'Percentage': content_percentages.round(1)
})

print("\n📺 Content distribution:")
print(content_summary)

print("\n💡 Insight: The sample was built with an equal number of movies and TV shows")

In [None]:
# ⭐ Rating analysis by content type
print("⭐ RATING ANALYSIS")
print("=" * 25)

# Overall rating statistics
print("🌟 Overall rating statistics:")
print(f"   Average rating: {df['vote_average'].mean():.2f}/10")
print(f"   Median rating:  {df['vote_average'].median():.2f}/10")
print(f"   Highest rated:  {df['vote_average'].max():.1f}/10")
print(f"   Lowest rated:   {df['vote_average'].min():.1f}/10")

# Rating by content type
print("\n📊 Average ratings by content type:")
rating_by_type = df.groupby('content_type')['vote_average'].agg(['mean', 'median', 'std']).round(2)
print(rating_by_type)

# Find highest rated content
top_rated = df.nlargest(3, 'vote_average')[['title', 'content_type', 'vote_average']]
print("\n🏆 Top 3 highest rated trending content:")
for idx, row in top_rated.iterrows():
    print(f"   {row['title']} ({row['content_type']}) - {row['vote_average']}/10")

In [None]:
# 🔥 Popularity analysis
print("🔥 POPULARITY ANALYSIS")
print("=" * 30)

# Overall popularity statistics
print("🌟 Popularity score statistics:")
print(f"   Average popularity: {df['popularity'].mean():.1f}")
print(f"   Median popularity:  {df['popularity'].median():.1f}")
print(f"   Most popular:       {df['popularity'].max():.1f}")
print(f"   Least popular:      {df['popularity'].min():.1f}")

# Top trending content
most_popular = df.nlargest(5, 'popularity')[['title', 'content_type', 'popularity', 'vote_average']]
print("\n🔥 Top 5 most popular trending content:")
for idx, row in most_popular.iterrows():
    print(f"   {row['title']} ({row['content_type']})")
    print(f"      Popularity: {row['popularity']:.1f} | Rating: {row['vote_average']}/10")

In [None]:
# 🎭 Genre analysis
print("🎭 GENRE ANALYSIS")
print("=" * 25)

# Split genres and count occurrences
all_genres = []
for genres in df['genre_ids']:
    genre_list = genres.split(',')
    all_genres.extend(genre_list)

# Count genre frequency
genre_counts = pd.Series(all_genres).value_counts()

print("🎬 Most popular genres in trending content:")
for i, (genre, count) in enumerate(genre_counts.head(8).items(), 1):
    percentage = (count / len(df)) * 100
    print(f"   {i}. {genre:<12}: {count:2d} titles ({percentage:.1f}%)")

print(f"\n📊 Total unique genres: {len(genre_counts)}")
print("💡 Insight: Action and Drama dominate trending entertainment")
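
The loop above can also be written without an explicit Python loop. Here is a small sketch on a made-up toy frame that uses `str.split()` and `explode()` to put one genre on each row before counting. The column name `genre_ids` matches our dataset; the rows themselves are placeholders.

```python
import pandas as pd

# Toy data in the same comma-separated layout as our genre_ids column
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre_ids": ["Action,Drama", "Drama", "Action,Comedy"],
})

# Split each string into a list, expand to one genre per row, then count
genre_counts = (
    toy["genre_ids"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

The result is the same kind of Series as `pd.Series(all_genres).value_counts()`, so the rest of the cell would work unchanged with either approach.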

## 📈 6. Data Visualization

Now let's create compelling visualizations to understand trends in entertainment data. Visualization is crucial for discovering patterns and communicating insights!

In [None]:
# 📊 Rating distribution visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Overall rating distribution
axes[0].hist(df['vote_average'], bins=15, color='skyblue', alpha=0.7, edgecolor='black')
axes[0].axvline(df['vote_average'].mean(), color='red', linestyle='--', 
                label=f'Mean: {df["vote_average"].mean():.2f}')
axes[0].set_title('📊 Distribution of Content Ratings', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Vote Average (out of 10)')
axes[0].set_ylabel('Number of Titles')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Rating by content type
df.boxplot(column='vote_average', by='content_type', ax=axes[1])
fig.suptitle('')  # Remove the automatic "Boxplot grouped by ..." super-title pandas adds
axes[1].set_title('📺 Rating Distribution by Content Type', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Content Type')
axes[1].set_ylabel('Vote Average (out of 10)')

plt.tight_layout()
plt.show()

print("💡 Visualization Insights:")
print("   • Most titles in this sample are rated between 7 and 9 out of 10")
print("   • Few titles rated below 7 appear in this trending sample")
print("   • TV shows rate higher than movies here (mean of about 8.5 vs 7.5)")

In [None]:
# 🔥 Popularity vs Rating scatter plot
plt.figure(figsize=(14, 8))

# Create scatter plot with different colors for content types
movies = df[df['content_type'] == 'Movie']
tv_shows = df[df['content_type'] == 'TV Show']

plt.scatter(movies['vote_average'], movies['popularity'], 
           alpha=0.7, s=100, label='Movies 🎬', color='red')
plt.scatter(tv_shows['vote_average'], tv_shows['popularity'], 
           alpha=0.7, s=100, label='TV Shows 📺', color='blue')

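# Add a least-squares trend line and compute the Pearson correlation
slope, intercept = np.polyfit(df['vote_average'], df['popularity'], 1)
rating_range = np.linspace(df['vote_average'].min(), df['vote_average'].max(), 100)
plt.plot(rating_range, slope * rating_range + intercept,
         color='gray', linestyle='--', label='Linear trend')
correlation = df['vote_average'].corr(df['popularity'])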
plt.title(f'🔥 Popularity vs Rating Analysis\n(Correlation: {correlation:.3f})', 
          fontsize=16, fontweight='bold')
plt.xlabel('Vote Average (Rating out of 10)', fontsize=12)
plt.ylabel('Popularity Score', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)

# Annotate some interesting points
top_popular = df.nlargest(3, 'popularity')
for idx, row in top_popular.iterrows():
    plt.annotate(row['title'], 
                (row['vote_average'], row['popularity']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=9, alpha=0.8)

plt.tight_layout()
plt.show()

print("📊 Correlation Analysis:")
print(f"   • Correlation between rating and popularity: {correlation:.3f}")
if abs(correlation) < 0.3:
    print("   • Weak correlation: High ratings don't guarantee high popularity!")
elif abs(correlation) < 0.7:
    print("   • Moderate correlation: Rating and popularity are moderately related")
else:
    print("   • Strong correlation: Rating and popularity are closely related")

In [None]:
# 🎭 Genre popularity visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Genre frequency bar chart
top_genres = pd.Series(all_genres).value_counts().head(10)
axes[0].bar(range(len(top_genres)), top_genres.values, color='lightcoral')
axes[0].set_title('🎭 Most Popular Genres in Trending Content', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Genres')
axes[0].set_ylabel('Number of Appearances')
axes[0].set_xticks(range(len(top_genres)))
axes[0].set_xticklabels(top_genres.index, rotation=45, ha='right')
axes[0].grid(True, alpha=0.3)

# Add value labels on bars
for i, v in enumerate(top_genres.values):
    axes[0].text(i, v + 0.1, str(v), ha='center', va='bottom', fontweight='bold')

# Content type distribution pie chart
content_counts = df['content_type'].value_counts()
colors = ['lightblue', 'lightgreen']
axes[1].pie(content_counts.values, labels=content_counts.index, 
           autopct='%1.1f%%', startangle=90, colors=colors)
axes[1].set_title('📺 Content Type Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("🎭 Genre Insights:")
print(f"   • Top genre: {top_genres.index[0]} appears in {top_genres.iloc[0]} trending titles")
print(f"   • Top 3 genres: {', '.join(top_genres.head(3).index.tolist())}")
print("   • Action leads among movies while Drama leads among TV shows")

In [None]:
# 📅 Temporal analysis - Release year trends
fig, ax = plt.subplots(figsize=(15, 8))

# Group by release year and content type
year_content = df.groupby(['release_year', 'content_type']).size().unstack(fill_value=0)

# Create stacked bar chart on the axes created above (avoids a stray empty figure)
year_content.plot(kind='bar', stacked=True, ax=ax,
                  color=['lightcoral', 'lightblue'], alpha=0.8)

plt.title('📅 Trending Content by Release Year', fontsize=16, fontweight='bold')
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Number of Trending Titles', fontsize=12)
plt.legend(title='Content Type', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Content age analysis
print("📅 Temporal Insights:")
avg_age = df['content_age_days'].mean()
print(f"   • Average age of trending content: {avg_age:.0f} days ({avg_age/365:.1f} years)")

recent_content = df[df['content_age_days'] <= 365]
print(f"   • {len(recent_content)} titles released in the last year")
print(f"   • Newest trending content: {df.loc[df['content_age_days'].idxmin(), 'title']}")
print(f"   • Oldest trending content: {df.loc[df['content_age_days'].idxmax(), 'title']}")

## 🔍 7. Correlation Analysis

Let's explore relationships between different variables using correlation analysis and create a correlation matrix.

In [None]:
# 🔗 Correlation matrix analysis
print("🔗 CORRELATION ANALYSIS")
print("=" * 30)

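# Select numerical columns for correlation
# Caveats: revenue is 0 for every TV show (a placeholder, not a measurement), and
# content_age_days is derived from release_date, so it will correlate almost
# perfectly (negatively) with release_year.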
numerical_cols = ['vote_average', 'popularity', 'runtime', 'budget', 'revenue', 'release_year', 'content_age_days']
correlation_data = df[numerical_cols]

# Calculate correlation matrix
correlation_matrix = correlation_data.corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle

sns.heatmap(correlation_matrix, 
            mask=mask,
            annot=True, 
            cmap='RdYlBu_r', 
            center=0,
            square=True,
            fmt='.3f',
            cbar_kws={'shrink': 0.8})

plt.title('🔗 Entertainment Data Correlation Matrix', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Highlight strongest correlations
print("\n🔍 Strongest correlations:")
# Get absolute correlations and sort
corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        col1 = correlation_matrix.columns[i]
        col2 = correlation_matrix.columns[j]
        corr_value = correlation_matrix.iloc[i, j]
        corr_pairs.append((abs(corr_value), col1, col2, corr_value))

# Sort by absolute correlation value
corr_pairs.sort(reverse=True)

for i, (abs_corr, col1, col2, corr_value) in enumerate(corr_pairs[:5]):
    direction = "positive" if corr_value > 0 else "negative"
    strength = "Strong" if abs_corr > 0.7 else "Moderate" if abs_corr > 0.3 else "Weak"
    print(f"   {i+1}. {col1} ↔ {col2}: {corr_value:.3f} ({strength} {direction})")
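
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association. As a complementary check, a rank-based (Spearman) correlation captures any monotonic relationship, even a curved one. The sketch below computes it by ranking each column and then taking the Pearson correlation of the ranks, which avoids any extra dependency. The toy numbers are made up for illustration.

```python
import pandas as pd

# Toy data: popularity always rises with rating, but not in a straight line
toy = pd.DataFrame({
    "rating": [1.0, 2.0, 3.0, 4.0, 5.0],
    "popularity": [1.0, 4.0, 9.0, 16.0, 100.0],
})

# Pearson: strength of the linear relationship
pearson_r = toy["rating"].corr(toy["popularity"])

# Spearman: Pearson correlation computed on the ranks of each column
spearman_rho = toy["rating"].rank().corr(toy["popularity"].rank())

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Here the rank-based value is 1 because both columns are in exactly the same order, while Pearson comes out lower because the relationship is not linear. A large gap between the two on real data is a hint to look at the scatter plot before trusting either number.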

## 🎯 8. Advanced Data Insights

Let's dive deeper and extract actionable insights that could inform business decisions in the entertainment industry.

In [None]:
# 🏆 Success pattern analysis
print("🏆 SUCCESS PATTERN ANALYSIS")
print("=" * 40)

# Define success criteria
high_rating_threshold = df['vote_average'].quantile(0.75)  # Top 25% by rating
high_popularity_threshold = df['popularity'].quantile(0.75)  # Top 25% by popularity

# Categorize content
df['success_category'] = 'Average'
df.loc[(df['vote_average'] >= high_rating_threshold) & 
       (df['popularity'] >= high_popularity_threshold), 'success_category'] = 'High Success'
df.loc[(df['vote_average'] >= high_rating_threshold) & 
       (df['popularity'] < high_popularity_threshold), 'success_category'] = 'Critical Success'
df.loc[(df['vote_average'] < high_rating_threshold) & 
       (df['popularity'] >= high_popularity_threshold), 'success_category'] = 'Popular Success'

success_counts = df['success_category'].value_counts()
print("📊 Success categories:")
for category, count in success_counts.items():
    percentage = (count / len(df)) * 100
    print(f"   {category:<16}: {count:2d} titles ({percentage:.1f}%)")

# Analyze successful content characteristics
high_success = df[df['success_category'] == 'High Success']
if len(high_success) > 0:
    print(f"\n🌟 High Success titles ({len(high_success)} total):")
    for idx, row in high_success.iterrows():
        print(f"   • {row['title']} ({row['content_type']}) - {row['vote_average']}/10, Pop: {row['popularity']:.0f}")
else:
    print("\n📊 No titles meet both high rating and high popularity criteria")
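
The chained `df.loc[...]` assignments above can also be expressed in one step with `np.select`, which picks the first matching label for each row. This is a sketch on a made-up toy frame; the fixed cut-offs `rating_cut` and `popularity_cut` are hypothetical stand-ins for the quantile thresholds computed in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data covering all four combinations (placeholder values)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [9.0, 9.0, 6.0, 6.0],
    "popularity": [3000.0, 500.0, 3000.0, 500.0],
})

rating_cut = 8.0         # hypothetical threshold for illustration
popularity_cut = 2000.0  # hypothetical threshold for illustration

high_rating = toy["vote_average"] >= rating_cut
high_popularity = toy["popularity"] >= popularity_cut

# Conditions are checked in order; rows matching none get the default label
toy["success_category"] = np.select(
    [
        high_rating & high_popularity,
        high_rating & ~high_popularity,
        ~high_rating & high_popularity,
    ],
    ["High Success", "Critical Success", "Popular Success"],
    default="Average",
)
print(toy)
```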

In [None]:
# 🎭 Genre performance analysis
print("🎭 GENRE PERFORMANCE ANALYSIS")
print("=" * 40)

# Create a more detailed genre analysis
genre_performance = []

for genre in top_genres.head(8).index:
    # Find content with this genre (regex=False matches the genre name literally)
    genre_content = df[df['genre_ids'].str.contains(genre, regex=False)]
    
    if len(genre_content) > 0:
        avg_rating = genre_content['vote_average'].mean()
        avg_popularity = genre_content['popularity'].mean()
        count = len(genre_content)
        
        genre_performance.append({
            'genre': genre,
            'count': count,
            'avg_rating': avg_rating,
            'avg_popularity': avg_popularity,
            'total_popularity': genre_content['popularity'].sum()
        })

genre_df = pd.DataFrame(genre_performance)
genre_df = genre_df.sort_values('avg_rating', ascending=False)

print("📊 Genre performance ranking (by average rating):")
print("-" * 60)
print(f"{'Genre':<12} {'Count':<6} {'Avg Rating':<12} {'Avg Popularity':<15}")
print("-" * 60)

for _, row in genre_df.iterrows():
    print(f"{row['genre']:<12} {row['count']:<6} {row['avg_rating']:<12.2f} {row['avg_popularity']:<15.0f}")

# Pick the genre with the highest average rating (ranks by mean only, not by consistency)
best_genre = genre_df.iloc[0]
print(f"\n🏆 Best performing genre: {best_genre['genre']}")
print(f"   Average rating: {best_genre['avg_rating']:.2f}/10")
print(f"   Average popularity: {best_genre['avg_popularity']:.0f}")
print(f"   Appears in {best_genre['count']} trending titles")
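
The per-genre loop above can also be written as a single `groupby` once each title has one row per genre. Below is a sketch of that approach on a made-up toy frame. Column names follow our dataset, but the values are placeholders.

```python
import pandas as pd

# Toy data in the same layout as our dataset (placeholder values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre_ids": ["Action,Drama", "Drama", "Action"],
    "vote_average": [8.0, 9.0, 6.0],
})

# One row per (title, genre) pair
long_form = toy.assign(genre=toy["genre_ids"].str.split(",")).explode("genre")

# Count and average rating per genre, best average first
genre_summary = (
    long_form.groupby("genre")["vote_average"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(genre_summary)
```

Because a title with several genres contributes to each of them, the per-genre counts add up to more than the number of titles. That is expected in this layout, but worth remembering when reading the totals.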

In [None]:
# 💡 Content recommendation insights
print("💡 CONTENT RECOMMENDATION INSIGHTS")
print("=" * 45)

# Movies vs TV Shows comparison
content_comparison = df.groupby('content_type').agg({
    'vote_average': ['mean', 'median', 'std'],
    'popularity': ['mean', 'median'],
    'runtime': 'mean',
    'budget': 'mean'
}).round(2)

print("📺 Movies vs TV Shows comparison:")
print(content_comparison)

# Identify content gaps and opportunities
print("\n🎯 Strategic Insights for Content Creators:")
print("-" * 50)

# Rating insights
movies_avg_rating = df[df['content_type'] == 'Movie']['vote_average'].mean()
tv_avg_rating = df[df['content_type'] == 'TV Show']['vote_average'].mean()

if movies_avg_rating > tv_avg_rating:
    print(f"1. 🎬 Movies have higher average ratings ({movies_avg_rating:.2f} vs {tv_avg_rating:.2f})")
    print("   Opportunity: Focus on movie production for critical acclaim")
else:
    print(f"1. 📺 TV Shows have higher average ratings ({tv_avg_rating:.2f} vs {movies_avg_rating:.2f})")
    print("   Opportunity: TV series development shows strong quality potential")

# Popularity insights
movies_avg_pop = df[df['content_type'] == 'Movie']['popularity'].mean()
tv_avg_pop = df[df['content_type'] == 'TV Show']['popularity'].mean()

if movies_avg_pop > tv_avg_pop:
    print(f"2. 🔥 Movies are more popular on average ({movies_avg_pop:.0f} vs {tv_avg_pop:.0f})")
    print("   Opportunity: Movies may have broader mass market appeal")
else:
    print(f"2. 🔥 TV Shows are more popular on average ({tv_avg_pop:.0f} vs {movies_avg_pop:.0f})")
    print("   Opportunity: TV shows may have stronger audience engagement")

# Genre opportunities
print(f"3. 🎭 Top performing genre: {genre_df.iloc[0]['genre']} (Avg rating: {genre_df.iloc[0]['avg_rating']:.2f})")
print(f"   Opportunity: Invest in {genre_df.iloc[0]['genre'].lower()} content for reliable success")

high_success_count = success_counts.get('High Success', 0)  # 0 if no title qualifies
print(f"4. 📊 {high_success_count} titles achieve both high ratings and popularity")
print(f"   Challenge: Only {(high_success_count/len(df)*100):.1f}% reach top-tier success")

## 📋 9. Key Findings Summary

Let's summarize our most important discoveries from this streaming data analysis.

In [None]:
# 📋 Generate comprehensive summary
print("🎬 GLOBAL STREAMING TRENDS - KEY FINDINGS")
print("=" * 50)
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📊 Dataset: {len(df)} trending titles across all platforms")
print("\n" + "=" * 50)

print("\n🏆 TOP FINDINGS:")
print("-" * 20)

# Finding 1: Content distribution
movie_count = len(df[df['content_type'] == 'Movie'])
tv_count = len(df[df['content_type'] == 'TV Show'])
print(f"1. 📺 Content Mix: {movie_count} movies, {tv_count} TV shows")
print(f"   Equal representation in trending content")

# Finding 2: Quality trends
high_quality_count = len(df[df['vote_average'] >= 8.0])
print(f"\n2. ⭐ Quality Standards: {high_quality_count}/{len(df)} titles rated 8.0+ ({high_quality_count/len(df)*100:.1f}%)")
print(f"   Average rating: {df['vote_average'].mean():.2f}/10")
print("   Most titles in this sample are rated 8.0 or higher")

# Finding 3: Genre dominance
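top_3_genres = top_genres.head(3)
# Count titles containing at least one of the top 3 genres (summing the genre
# counts would double-count titles that list more than one of them)
has_top_3 = df['genre_ids'].apply(
    lambda g: any(name in g.split(',') for name in top_3_genres.index))
print(f"\n3. 🎭 Genre Dominance: {', '.join(top_3_genres.index)}")
print(f"   At least one of these 3 genres appears in {has_top_3.sum()}/{len(df)} titles")
print("   Action and Drama are the most frequent genres in this sample")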

# Finding 4: Success patterns
correlation_rating_pop = df['vote_average'].corr(df['popularity'])
print(f"\n4. 🔗 Success Correlation: Rating vs Popularity = {correlation_rating_pop:.3f}")
if abs(correlation_rating_pop) < 0.3:
    print("   High ratings don't guarantee high popularity!")
    print("   Factors beyond rating (not measured here) likely drive popularity")
else:
    print("   Rating and popularity are at least moderately linked in this sample")

# Finding 5: Top performers
most_popular_title = df.loc[df['popularity'].idxmax()]
highest_rated_title = df.loc[df['vote_average'].idxmax()]
print(f"\n5. 🌟 Current Champions:")
print(f"   Most Popular: {most_popular_title['title']} (Pop: {most_popular_title['popularity']:.0f})")
print(f"   Highest Rated: {highest_rated_title['title']} ({highest_rated_title['vote_average']}/10)")

print("\n" + "=" * 50)
print("💡 BUSINESS IMPLICATIONS:")
print("-" * 30)
print("• Action and Drama are the most common genres in the sample")
print("• Most titles here are rated 8.0+, but a high rating is not required to trend")
print("• The sample holds equal numbers of movies and TV shows by design")
print("• Popularity and rating are distinct measures; check the correlation above")
print("• These findings describe sample data and should be confirmed on live data")

print("\n🎯 NEXT STEPS:")
print("-" * 15)
print("Tomorrow: Data cleaning and feature engineering")
print("Day 3: Build popularity prediction model")
print("Week 2: Advanced recommendation algorithms")
print("Week 7: Deploy 'What Should I Watch?' web app")

print("\n✅ Day 1 Complete: Entertainment Data EDA Mastery Achieved! 🎉")

## 🎓 Learning Reflection & Next Steps

### 📚 What You Mastered Today

**Pandas Fundamentals:**
- ✅ Building a DataFrame with `pd.DataFrame()` from dictionary data
- ✅ `.head()`, `.info()`, `.describe()` for data exploration
- ✅ `.groupby()` for aggregating data
- ✅ `.value_counts()` for frequency analysis
- ✅ Boolean indexing and filtering
- ✅ Missing data detection with `.isnull()`
- ✅ Date/time operations with `pd.to_datetime()`

**Data Analysis Skills:**
- ✅ Statistical summary and distribution analysis
- ✅ Correlation analysis and interpretation
- ✅ Data quality assessment
- ✅ Pattern recognition in entertainment data

**Visualization Techniques:**
- ✅ Histograms for distribution analysis
- ✅ Scatter plots for relationship exploration
- ✅ Box plots for comparison analysis
- ✅ Correlation heatmaps
- ✅ Bar charts and pie charts for categorical data

### 🎯 Tomorrow's Challenge: Data Cleaning & Feature Engineering

**Day 2 Preview:**
- Handle missing data in streaming datasets
- Engineer new features from existing data
- Text processing for genre and title analysis
- Prepare data for machine learning models

### 💪 Practice Exercises (Optional)

1. **Filter Challenge**: Find all Action movies with rating > 8.0
2. **Grouping Challenge**: Calculate average budget by genre
3. **Visualization Challenge**: Create a timeline showing release patterns
4. **Analysis Challenge**: Identify the most "underrated" content (high rating, low popularity)

### 🔗 Real-World Applications

The skills you learned today apply directly to:
- **Streaming platforms**: Content recommendation algorithms
- **Entertainment studios**: Market analysis and content strategy
- **Marketing agencies**: Campaign targeting and timing
- **Investment firms**: Entertainment industry analysis
- **Any industry**: Customer behavior analysis, product performance tracking

---

**🎉 Congratulations! You've completed Day 1 of your ML journey with realistic streaming data!**

*Tomorrow we'll clean this data and engineer features that will power our recommendation systems!*