# Online Gaming Behavior Dataset - EDA & ML Demo

This notebook demonstrates:
1. **Exploratory Data Analysis (EDA)** - Understanding the dataset
2. **Machine Learning** - Predicting PlayerExpertise and SpendingPropensity using Random Forest

Dataset: 10,000 player-game combinations with realistic patterns for educational purposes.

## Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 1. Load Dataset

In [None]:
# Load the dataset
df = pd.read_csv('generated_gaming_dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"Unique players: {df['PlayerID'].nunique()}")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Display first few rows
df.head(10)

In [None]:
# Data types and missing values
print("\n=== Data Info ===")
df.info()

print("\n=== Missing Values ===")
print(df.isnull().sum())

display(df.head(20))

## 2. Exploratory Data Analysis (EDA)

### 2.1 Descriptive Statistics

In [None]:
# Numerical features summary
df.describe().T

In [None]:
# Categorical features summary
categorical_cols = ['Gender', 'Location', 'GameGenre', 'GameDifficulty', 
                    'EngagementLevel', 'PlayerExpertise', 'SpendingPropensity']

for col in categorical_cols:
    print(f"\n{col} Distribution:")
    print(df[col].value_counts())
    print(f"Proportions:\n{df[col].value_counts(normalize=True)}")

### 2.2 Distribution Visualizations

In [None]:
# Age distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['Age'], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Age Distribution')

df['Age'].plot(kind='box', ax=axes[1])
axes[1].set_ylabel('Age')
axes[1].set_title('Age Boxplot')

plt.tight_layout()
plt.show()

print(f"Age: Mean={df['Age'].mean():.1f}, Median={df['Age'].median():.1f}, Std={df['Age'].std():.1f}")

In [None]:
# PlayTimeHours distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['PlayTimeHours'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('PlayTime (Hours)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('PlayTime Distribution')

# Log scale for better visibility
axes[1].hist(np.log10(df['PlayTimeHours'] + 1), bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_xlabel('log10(PlayTime + 1)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('PlayTime Distribution (Log Scale)')

plt.tight_layout()
plt.show()

In [None]:
# Spending distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['TotalSpend'], bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0].set_xlabel('Total Spend (£)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Total Spend Distribution')

# Focus on non-zero spenders
spenders = df[df['TotalSpend'] > 0]
axes[1].hist(spenders['TotalSpend'], bins=50, edgecolor='black', alpha=0.7, color='darkgreen')
axes[1].set_xlabel('Total Spend (£)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Total Spend Distribution (Spenders Only)')

plt.tight_layout()
plt.show()

print(f"Non-spenders: {(df['TotalSpend'] == 0).sum()} ({(df['TotalSpend'] == 0).mean()*100:.1f}%)")
print(f"Spenders: Mean=£{spenders['TotalSpend'].mean():.2f}, Median=£{spenders['TotalSpend'].median():.2f}")

### 2.3 Categorical Variable Distributions

In [None]:
# Demographics visualizations
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Gender
df['Gender'].value_counts().plot(kind='bar', ax=axes[0, 0], color='skyblue', edgecolor='black')
axes[0, 0].set_title('Gender Distribution')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=45)

# Location
df['Location'].value_counts().plot(kind='bar', ax=axes[0, 1], color='lightcoral', edgecolor='black')
axes[0, 1].set_title('Location Distribution')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=45)

# GameGenre
df['GameGenre'].value_counts().plot(kind='bar', ax=axes[0, 2], color='lightgreen', edgecolor='black')
axes[0, 2].set_title('Game Genre Distribution')
axes[0, 2].set_ylabel('Count')
axes[0, 2].tick_params(axis='x', rotation=45)

# GameDifficulty
df['GameDifficulty'].value_counts().plot(kind='bar', ax=axes[1, 0], color='gold', edgecolor='black')
axes[1, 0].set_title('Game Difficulty Distribution')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=45)

# PlayerExpertise (TARGET)
df['PlayerExpertise'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 1], color='purple', edgecolor='black')
axes[1, 1].set_title('PlayerExpertise Distribution (TARGET 1)')
axes[1, 1].set_ylabel('Count')
axes[1, 1].tick_params(axis='x', rotation=45)

# SpendingPropensity (TARGET)
df['SpendingPropensity'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 2], color='orange', edgecolor='black')
axes[1, 2].set_title('SpendingPropensity Distribution (TARGET 2)')
axes[1, 2].set_ylabel('Count')
axes[1, 2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### 2.4 Correlation Analysis

In [None]:
# Correlation matrix for numerical features
numerical_cols = ['Age', 'PlayTimeHours', 'SessionsPerWeek', 'AvgSessionDurationMinutes',
                  'PlayerLevel', 'AchievementsUnlocked', 'DaysPlayed', 
                  'PurchaseCount', 'TotalSpend', 'AvgPurchasesPerMonth', 'AvgPurchaseValue']

corr_matrix = df[numerical_cols].corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numerical Features', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

print("\nKey Correlations:")
print(f"PlayTimeHours ↔ PlayerLevel: {df['PlayTimeHours'].corr(df['PlayerLevel']):.3f}")
print(f"PlayTimeHours ↔ AchievementsUnlocked: {df['PlayTimeHours'].corr(df['AchievementsUnlocked']):.3f}")
print(f"PurchaseCount ↔ TotalSpend: {df['PurchaseCount'].corr(df['TotalSpend']):.3f}")
print(f"DaysPlayed ↔ PlayTimeHours: {df['DaysPlayed'].corr(df['PlayTimeHours']):.3f}")
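
The correlations above are Pearson coefficients, which measure linear association and can be distorted by heavily skewed variables such as play time or spend. As a rough sketch, the snippet below compares Pearson with Spearman rank correlation on synthetic stand-in data; the generated columns and their relationship are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a skewed, monotonic relationship (hypothetical data)
play_time = rng.lognormal(mean=3.0, sigma=1.0, size=2000)
player_level = np.sqrt(play_time) + rng.normal(0, 0.5, size=2000)
demo = pd.DataFrame({"PlayTimeHours": play_time, "PlayerLevel": player_level})

# DataFrame.corr supports both methods; pick out the off-diagonal entry
pearson = demo.corr(method="pearson").loc["PlayTimeHours", "PlayerLevel"]
spearman = demo.corr(method="spearman").loc["PlayTimeHours", "PlayerLevel"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two coefficients differ noticeably on the real data, the relationship is probably monotonic but not linear, and a log transform may be worth trying.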

### 2.5 Relationship Analysis

In [None]:
# PlayTimeHours vs PlayerLevel scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['PlayTimeHours'], df['PlayerLevel'], alpha=0.3, s=10)
plt.xlabel('PlayTime (Hours)')
plt.ylabel('Player Level')
plt.title('PlayTime vs Player Level')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Genre preferences by Location
genre_location = pd.crosstab(df['Location'], df['GameGenre'], normalize='index') * 100

genre_location.plot(kind='bar', figsize=(10, 6), edgecolor='black')
plt.title('Game Genre Preferences by Location (%)')
plt.xlabel('Location')
plt.ylabel('Percentage')
plt.legend(title='Genre')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nGenre by Location (%):\n", genre_location.round(1))

In [None]:
# Gender distribution by Genre
gender_genre = pd.crosstab(df['GameGenre'], df['Gender'], normalize='index') * 100

gender_genre.plot(kind='bar', figsize=(10, 6), edgecolor='black')
plt.title('Gender Distribution by Game Genre (%)')
plt.xlabel('Game Genre')
plt.ylabel('Percentage')
plt.legend(title='Gender')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nGender by Genre (%):\n", gender_genre.round(1))

In [None]:
# Spending by EngagementLevel
engagement_spending = df.groupby('EngagementLevel')['TotalSpend'].agg(['mean', 'median', 'count'])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

engagement_spending['mean'].plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Average Spending by Engagement Level')
axes[0].set_ylabel('Mean Total Spend (£)')
axes[0].set_xlabel('Engagement Level')
axes[0].tick_params(axis='x', rotation=0)
axes[0].grid(axis='y', alpha=0.3)

df.boxplot(column='TotalSpend', by='EngagementLevel', ax=axes[1])
axes[1].set_title('Spending Distribution by Engagement Level')
axes[1].set_ylabel('Total Spend (£)')
axes[1].set_xlabel('Engagement Level')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\nSpending by EngagementLevel:\n", engagement_spending)

In [None]:
# PlayerExpertise vs GameDifficulty
expertise_difficulty = pd.crosstab(df['PlayerExpertise'], df['GameDifficulty'], normalize='index') * 100

expertise_difficulty.plot(kind='bar', figsize=(10, 6), edgecolor='black')
plt.title('Game Difficulty Choice by Player Expertise (%)')
plt.xlabel('Player Expertise')
plt.ylabel('Percentage')
plt.legend(title='Difficulty')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nDifficulty by Expertise (%):\n", expertise_difficulty.round(1))

### 2.6 Multi-Game Players Analysis

In [None]:
# Games per player
games_per_player = df.groupby('PlayerID').size()

print("Games per Player Distribution:")
print(games_per_player.value_counts().sort_index())
print(f"\nMean games per player: {games_per_player.mean():.2f}")

games_per_player.value_counts().sort_index().plot(kind='bar', figsize=(8, 5), color='teal', edgecolor='black')
plt.title('Distribution of Games per Player')
plt.xlabel('Number of Games')
plt.ylabel('Number of Players')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Relationship between PlayTime and Spending
# Do players who play more also spend more?

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot: PlayTime vs TotalSpend
axes[0].scatter(df['PlayTimeHours'], df['TotalSpend'], alpha=0.3, s=10)
axes[0].set_xlabel('PlayTime (Hours)')
axes[0].set_ylabel('Total Spend (£)')
axes[0].set_title('PlayTime vs Total Spend')
axes[0].grid(True, alpha=0.3)

# Boxplot: Spending by PlayTime quartiles
df['PlayTime_Quartile'] = pd.qcut(df['PlayTimeHours'], q=4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])
df.boxplot(column='TotalSpend', by='PlayTime_Quartile', ax=axes[1])
axes[1].set_xlabel('PlayTime Quartile')
axes[1].set_ylabel('Total Spend (£)')
axes[1].set_title('Spending Distribution by PlayTime Quartile')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\n=== Spending by PlayTime Quartile ===")
quartile_spending = df.groupby('PlayTime_Quartile')['TotalSpend'].agg(['mean', 'median', 'count'])
print(quartile_spending)

print(f"\nCorrelation (PlayTime vs TotalSpend): {df['PlayTimeHours'].corr(df['TotalSpend']):.3f}")

In [None]:
# Purchase behavior analysis
spenders_only = df[df['TotalSpend'] > 0].copy()

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Distribution of PurchaseCount (spenders only)
axes[0, 0].hist(spenders_only['PurchaseCount'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('Number of Purchases')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Purchase Count Distribution (Spenders Only)')
axes[0, 0].grid(axis='y', alpha=0.3)

# Distribution of AvgPurchaseValue (spenders only)
axes[0, 1].hist(spenders_only['AvgPurchaseValue'], bins=30, edgecolor='black', alpha=0.7, color='darkgreen')
axes[0, 1].set_xlabel('Average Purchase Value (£)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Average Purchase Value Distribution (Spenders Only)')
axes[0, 1].grid(axis='y', alpha=0.3)

# Scatter: PurchaseCount vs TotalSpend
axes[1, 0].scatter(spenders_only['PurchaseCount'], spenders_only['TotalSpend'], alpha=0.3, s=20)
axes[1, 0].set_xlabel('Number of Purchases')
axes[1, 0].set_ylabel('Total Spend (£)')
axes[1, 0].set_title('Purchase Count vs Total Spend')
axes[1, 0].grid(True, alpha=0.3)

# Scatter: AvgPurchaseValue vs PurchaseCount (colored by SpendingPropensity)
for category, color in [('Occasional', 'orange'), ('Whale', 'red')]:
    subset = spenders_only[spenders_only['SpendingPropensity'] == category]
    axes[1, 1].scatter(subset['PurchaseCount'], subset['AvgPurchaseValue'], 
                      alpha=0.5, s=30, label=category, color=color)
axes[1, 1].set_xlabel('Number of Purchases')
axes[1, 1].set_ylabel('Average Purchase Value (£)')
axes[1, 1].set_title('Purchase Patterns by Spending Category')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== Purchase Behavior Summary (Spenders Only) ===")
print(f"Average purchases per spender: {spenders_only['PurchaseCount'].mean():.1f}")
print(f"Median purchases per spender: {spenders_only['PurchaseCount'].median():.0f}")
print(f"Average purchase value: £{spenders_only['AvgPurchaseValue'].mean():.2f}")
print(f"Median purchase value: £{spenders_only['AvgPurchaseValue'].median():.2f}")

In [None]:
# Spending by Location (checking for Asia whale bias)
location_spending = df.groupby('Location').agg({
    'TotalSpend': ['sum', 'mean', 'median'],
    'PurchaseCount': 'mean',
    'PlayerID': 'nunique'  # unique players, since each player can appear in several rows
}).round(2)

location_spending.columns = ['Total_Revenue', 'Avg_Spend', 'Median_Spend', 'Avg_Purchases', 'Player_Count']
print("\n=== Spending by Location ===")
print(location_spending)

# Whale percentage by location
whale_by_location = pd.crosstab(df['Location'], df['SpendingPropensity'], normalize='index') * 100
print("\n=== Spending Propensity by Location (%) ===")
print(whale_by_location.round(1))

whale_by_location.plot(kind='bar', figsize=(10, 6), edgecolor='black', stacked=False)
plt.title('Spending Propensity Distribution by Location')
plt.xlabel('Location')
plt.ylabel('Percentage (%)')
plt.legend(title='Spending Category')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Spending by Game Genre
genre_spending = df.groupby('GameGenre').agg({
    'TotalSpend': ['sum', 'mean', 'median'],
    'PurchaseCount': 'mean',
    'PlayerID': 'nunique'  # unique players, since each player can appear in several rows
}).round(2)

genre_spending.columns = ['Total_Revenue', 'Avg_Spend', 'Median_Spend', 'Avg_Purchases', 'Player_Count']
print("\n=== Spending by Game Genre ===")
print(genre_spending)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Total revenue by genre
genre_spending['Total_Revenue'].plot(kind='bar', ax=axes[0], color=['#1f77b4', '#ff7f0e', '#2ca02c'], 
                                      edgecolor='black')
axes[0].set_title('Total Revenue by Genre')
axes[0].set_ylabel('Total Revenue (£)')
axes[0].set_xlabel('Genre')
axes[0].tick_params(axis='x', rotation=0)
axes[0].grid(axis='y', alpha=0.3)

# Average spend by genre
genre_spending['Avg_Spend'].plot(kind='bar', ax=axes[1], color=['#1f77b4', '#ff7f0e', '#2ca02c'],
                                  edgecolor='black')
axes[1].set_title('Average Spend per Player by Genre')
axes[1].set_ylabel('Average Spend (£)')
axes[1].set_xlabel('Genre')
axes[1].tick_params(axis='x', rotation=0)
axes[1].grid(axis='y', alpha=0.3)

# Spender rate by genre
spender_rate = df['TotalSpend'].gt(0).groupby(df['GameGenre']).mean() * 100
spender_rate.plot(kind='bar', ax=axes[2], color=['#1f77b4', '#ff7f0e', '#2ca02c'],
                  edgecolor='black')
axes[2].set_title('Spender Rate by Genre (%)')
axes[2].set_ylabel('% of Players Who Spend')
axes[2].set_xlabel('Genre')
axes[2].tick_params(axis='x', rotation=0)
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Revenue concentration - Pareto principle (80/20 rule)
# What % of players generate what % of revenue?

spenders = df[df['TotalSpend'] > 0].copy()
spenders_sorted = spenders.sort_values('TotalSpend', ascending=False).reset_index(drop=True)
spenders_sorted['cumulative_revenue'] = spenders_sorted['TotalSpend'].cumsum()
spenders_sorted['cumulative_revenue_pct'] = (spenders_sorted['cumulative_revenue'] / 
                                               spenders_sorted['TotalSpend'].sum() * 100)
spenders_sorted['player_pct'] = (spenders_sorted.index + 1) / len(spenders_sorted) * 100

# Find key thresholds: number of rows in each top share (at least one row)
top_1_pct_n = max(int(len(spenders_sorted) * 0.01), 1)
top_5_pct_n = max(int(len(spenders_sorted) * 0.05), 1)
top_10_pct_n = max(int(len(spenders_sorted) * 0.10), 1)
top_20_pct_n = max(int(len(spenders_sorted) * 0.20), 1)

print("=== Revenue Concentration (Pareto Analysis) ===")
# Row n - 1 (0-based) holds the cumulative revenue share of the top n rows
print(f"Top 1% of spenders generate: {spenders_sorted.iloc[top_1_pct_n - 1]['cumulative_revenue_pct']:.1f}% of revenue")
print(f"Top 5% of spenders generate: {spenders_sorted.iloc[top_5_pct_n - 1]['cumulative_revenue_pct']:.1f}% of revenue")
print(f"Top 10% of spenders generate: {spenders_sorted.iloc[top_10_pct_n - 1]['cumulative_revenue_pct']:.1f}% of revenue")
print(f"Top 20% of spenders generate: {spenders_sorted.iloc[top_20_pct_n - 1]['cumulative_revenue_pct']:.1f}% of revenue")

# Pareto chart
plt.figure(figsize=(12, 6))
plt.plot(spenders_sorted['player_pct'], spenders_sorted['cumulative_revenue_pct'], linewidth=2)
plt.axhline(y=80, color='r', linestyle='--', label='80% of revenue', alpha=0.7)
plt.axvline(x=20, color='r', linestyle='--', label='20% of spenders', alpha=0.7)
plt.xlabel('% of Spenders (Ranked by Spending)')
plt.ylabel('Cumulative % of Revenue')
plt.title('Revenue Concentration Curve (Pareto Analysis)')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
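
The concentration figures can also be computed with a small reusable helper. The sketch below is one possible formulation: the function name `top_share` and the toy spend values are hypothetical, and it rounds the group size up so that the top share always contains at least one row.

```python
import numpy as np
import pandas as pd

def top_share(spend: pd.Series, top_fraction: float) -> float:
    """Percentage of total spend contributed by the top `top_fraction` of spenders."""
    spenders = spend[spend > 0].sort_values(ascending=False)
    # Round up and keep at least one row in the top group
    n_top = max(int(np.ceil(len(spenders) * top_fraction)), 1)
    return spenders.iloc[:n_top].sum() / spenders.sum() * 100

# Toy values (made up): 2 non-spenders and 10 spenders totalling 2000
toy_spend = pd.Series([0, 0, 5, 5, 10, 10, 20, 50, 100, 300, 500, 1000])

print(f"Top 10% share: {top_share(toy_spend, 0.10):.1f}%")  # 1000 / 2000 -> 50.0%
print(f"Top 20% share: {top_share(toy_spend, 0.20):.1f}%")  # 1500 / 2000 -> 75.0%
```

On the real data the call would be `top_share(df['TotalSpend'], 0.20)`, assuming `df` is loaded as in the cells above.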

In [None]:
# Overall spending statistics
print("=== Spending Overview ===")
print(f"Total records (player-game combinations): {len(df)}")
print(f"Non-spenders: {(df['TotalSpend'] == 0).sum()} ({(df['TotalSpend'] == 0).mean()*100:.1f}%)")
print(f"Spenders: {(df['TotalSpend'] > 0).sum()} ({(df['TotalSpend'] > 0).mean()*100:.1f}%)")
print(f"\nTotal revenue (all players): £{df['TotalSpend'].sum():,.2f}")
print(f"Average spend per player: £{df['TotalSpend'].mean():.2f}")
print(f"Average spend (spenders only): £{df[df['TotalSpend'] > 0]['TotalSpend'].mean():.2f}")
print(f"Median spend (spenders only): £{df[df['TotalSpend'] > 0]['TotalSpend'].median():.2f}")

print("\n=== Spending Propensity Breakdown ===")
for category in ['NonSpender', 'Occasional', 'Whale']:
    subset = df[df['SpendingPropensity'] == category]
    print(f"\n{category}:")
    print(f"  Count: {len(subset)} ({len(subset)/len(df)*100:.1f}%)")
    print(f"  Total revenue: £{subset['TotalSpend'].sum():,.2f}")
    print(f"  Avg spend: £{subset['TotalSpend'].mean():.2f}")
    print(f"  Avg purchases: {subset['PurchaseCount'].mean():.1f}")
    print(f"  Revenue share: {subset['TotalSpend'].sum()/df['TotalSpend'].sum()*100:.1f}%")

### 2.7 Deep Dive: Spending Behavior Analysis

Let's explore spending patterns in detail to understand monetization dynamics.

## 3. Machine Learning - Random Forest Classification

We'll build two separate models:
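1. **PlayerExpertise Prediction** (multi-factorial: the model must combine several signals)
2. **SpendingPropensity Prediction** (spending features are excluded to avoid leakage, so the model relies on indirect behavioral and demographic signals)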

### 3.1 Data Preparation

In [None]:
# Create feature matrix
# We'll encode categorical variables and drop target variables

# Make a copy for ML
df_ml = df.copy()

# Encode categorical features
le_dict = {}
categorical_features = ['Gender', 'Location', 'GameGenre', 'GameDifficulty', 'EngagementLevel']

for col in categorical_features:
    le = LabelEncoder()
    df_ml[col + '_encoded'] = le.fit_transform(df_ml[col])
    le_dict[col] = le

# Features for modeling (exclude IDs, original categorical, and both targets)
feature_cols = ['Age', 'PlayTimeHours', 'SessionsPerWeek', 'AvgSessionDurationMinutes',
                'PlayerLevel', 'AchievementsUnlocked', 'DaysPlayed', 
                'PurchaseCount', 'TotalSpend', 'AvgPurchasesPerMonth', 'AvgPurchaseValue',
                'Gender_encoded', 'Location_encoded', 'GameGenre_encoded', 
                'GameDifficulty_encoded', 'EngagementLevel_encoded']

print(f"Features for modeling: {len(feature_cols)}")
print(feature_cols)
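
`LabelEncoder` assigns integer codes in sorted (alphabetical) label order, so an ordered category such as game difficulty does not keep its natural ranking. A minimal sketch of an alternative, using `OrdinalEncoder` with an explicit category order, is shown below. The level names "Easy" and "Medium" are assumptions for illustration; check `df['GameDifficulty'].unique()` for the actual values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical difficulty levels; the real column may use different names
toy = pd.DataFrame({"GameDifficulty": ["Easy", "Hard", "Medium", "Easy", "Hard"]})

# Spell out the order so that the codes increase with difficulty
encoder = OrdinalEncoder(categories=[["Easy", "Medium", "Hard"]])
toy["GameDifficulty_encoded"] = encoder.fit_transform(toy[["GameDifficulty"]]).ravel()

print(toy)
```

Tree-based models such as Random Forest can split on either encoding, so this matters mostly for readability and for models that treat the codes as magnitudes.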

### 3.2 Model 1: PlayerExpertise Prediction

In [None]:
# Prepare data for PlayerExpertise prediction
# feature_cols already excludes both targets (PlayerExpertise and SpendingPropensity)
X_expertise = df_ml[feature_cols]
y_expertise = df_ml['PlayerExpertise']

# 80/20 train-test split
X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(
    X_expertise, y_expertise, test_size=0.2, random_state=42, stratify=y_expertise
)

print(f"Training set: {X_train_exp.shape}")
print(f"Test set: {X_test_exp.shape}")
print(f"\nClass distribution in training set:")
print(y_train_exp.value_counts(normalize=True))
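
Because the dataset contains several rows per player, a row-level split can place the same `PlayerID` in both the training and the test set, which may inflate test scores. The sketch below shows one way to keep each player on a single side of the split with `GroupShuffleSplit`. The toy frame is synthetic, and note that this splitter does not stratify by class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Hypothetical stand-in: 300 rows spread over at most 100 players
toy = pd.DataFrame({
    "PlayerID": rng.integers(0, 100, size=300),
    "PlayTimeHours": rng.gamma(2.0, 20.0, size=300),
    "PlayerExpertise": rng.choice(["Beginner", "Intermediate", "Expert"], size=300),
})

# test_size applies to groups (players), not to rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy, toy["PlayerExpertise"], groups=toy["PlayerID"])
)

train, test = toy.iloc[train_idx], toy.iloc[test_idx]
shared_players = set(train["PlayerID"]) & set(test["PlayerID"])
print(f"Train rows: {len(train)}, test rows: {len(test)}, shared players: {len(shared_players)}")
```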

In [None]:
# Train Random Forest for PlayerExpertise
print("Training Random Forest for PlayerExpertise prediction...")
rf_expertise = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)

rf_expertise.fit(X_train_exp, y_train_exp)
print("Training complete!")

# Predictions
y_pred_exp = rf_expertise.predict(X_test_exp)

# Evaluation
accuracy_exp = accuracy_score(y_test_exp, y_pred_exp)
print(f"\nAccuracy: {accuracy_exp:.4f}")

print("\n=== Classification Report ===")
print(classification_report(y_test_exp, y_pred_exp))

In [None]:
# Confusion Matrix for PlayerExpertise
# Use the model's own class order so tick labels always match the matrix rows/columns
cm_exp = confusion_matrix(y_test_exp, y_pred_exp, labels=rf_expertise.classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_exp, annot=True, fmt='d', cmap='Blues', 
            xticklabels=rf_expertise.classes_,
            yticklabels=rf_expertise.classes_)
plt.title('Confusion Matrix - PlayerExpertise Prediction')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance for PlayerExpertise
feature_importance_exp = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_expertise.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(range(len(feature_importance_exp)), feature_importance_exp['importance'], color='steelblue')
plt.yticks(range(len(feature_importance_exp)), feature_importance_exp['feature'])
plt.xlabel('Importance')
plt.title('Feature Importance - PlayerExpertise Model')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 Important Features:")
print(feature_importance_exp.head(10))

### 3.3 Model 2: SpendingPropensity Prediction

In [None]:
# Prepare data for SpendingPropensity prediction
# IMPORTANT: Exclude spending features to avoid data leakage!
# SpendingPropensity is derived from TotalSpend, so we must exclude:
# - TotalSpend, PurchaseCount, AvgPurchasesPerMonth, AvgPurchaseValue

spending_features = [col for col in feature_cols if col not in 
                     ['TotalSpend', 'PurchaseCount', 'AvgPurchasesPerMonth', 'AvgPurchaseValue']]

print(f"Features for SpendingPropensity (excluding spending metrics): {len(spending_features)}")
print(spending_features)
print("\nThis makes it a REAL prediction task - predicting spending from behavior/demographics only!")

X_spending = df_ml[spending_features]
y_spending = df_ml['SpendingPropensity']

# 80/20 train-test split
X_train_spend, X_test_spend, y_train_spend, y_test_spend = train_test_split(
    X_spending, y_spending, test_size=0.2, random_state=42, stratify=y_spending
)

print(f"\nTraining set: {X_train_spend.shape}")
print(f"Test set: {X_test_spend.shape}")
print(f"\nClass distribution in training set:")
print(y_train_spend.value_counts(normalize=True))

In [None]:
# Train Random Forest for SpendingPropensity
print("Training Random Forest for SpendingPropensity prediction...")
rf_spending = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)

rf_spending.fit(X_train_spend, y_train_spend)
print("Training complete!")

# Predictions
y_pred_spend = rf_spending.predict(X_test_spend)

# Evaluation
accuracy_spend = accuracy_score(y_test_spend, y_pred_spend)
print(f"\nAccuracy: {accuracy_spend:.4f}")

print("\n=== Classification Report ===")
print(classification_report(y_test_spend, y_pred_spend))

In [None]:
# Confusion Matrix for SpendingPropensity
# Use the model's own class order so tick labels always match the matrix rows/columns
cm_spend = confusion_matrix(y_test_spend, y_pred_spend, labels=rf_spending.classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_spend, annot=True, fmt='d', cmap='Greens',
            xticklabels=rf_spending.classes_,
            yticklabels=rf_spending.classes_)
plt.title('Confusion Matrix - SpendingPropensity Prediction')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance for SpendingPropensity
feature_importance_spend = pd.DataFrame({
    'feature': spending_features,  # must match the columns rf_spending was trained on
    'importance': rf_spending.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(range(len(feature_importance_spend)), feature_importance_spend['importance'], color='darkgreen')
plt.yticks(range(len(feature_importance_spend)), feature_importance_spend['feature'])
plt.xlabel('Importance')
plt.title('Feature Importance - SpendingPropensity Model')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 Important Features:")
print(feature_importance_spend.head(10))

### 3.4 Model Comparison

In [None]:
# Compare model performance
comparison = pd.DataFrame({
    'Model': ['PlayerExpertise', 'SpendingPropensity'],
    'Accuracy': [accuracy_exp, accuracy_spend],
    'Notes': ['Multi-factorial (all features)', 'Indirect signals (spending features excluded)']
})

print("\n=== Model Performance Comparison ===")
print(comparison)

# Visualization
plt.figure(figsize=(8, 5))
plt.bar(comparison['Model'], comparison['Accuracy'], color=['purple', 'green'], edgecolor='black')
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1.0)
plt.axhline(y=1/3, color='r', linestyle='--', label='Uniform random guess (3 classes)')
plt.legend()
plt.grid(axis='y', alpha=0.3)

for i, (model, acc) in enumerate(zip(comparison['Model'], comparison['Accuracy'])):
    plt.text(i, acc + 0.02, f'{acc:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
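
A horizontal reference line is only a rough yardstick. A common alternative baseline is the accuracy of always predicting the most frequent class, which differs per target when classes are imbalanced. The sketch below computes it with `DummyClassifier`; the class shares used to generate the toy target are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced 3-class target (the shares are invented for this sketch)
y = rng.choice(["NonSpender", "Occasional", "Whale"], size=1000, p=[0.6, 0.3, 0.1])
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

In the notebook, the same idea would be to fit the dummy model on `y_train_exp` or `y_train_spend` and score it on the matching test labels, then compare each Random Forest against its own baseline.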

## 4. Key Findings & Insights

### EDA Insights:
- Clear location-genre preferences (USA→Action, Asia→RPG, Europe→Strategy)
- Gender balance varies by genre (RPG most balanced, Action male-dominated)
- Strong correlations between PlayTime, PlayerLevel, and Achievements
- Spending strongly influenced by engagement level
- Expert players predominantly choose Hard difficulty

### ML Results:
- **PlayerExpertise** prediction (77% accuracy):
  - Multi-factorial task requiring the model to combine multiple signals
  - Most important features: GameDifficulty, PlayerLevel, AchievementsUnlocked
  - Good balance between being learnable but not trivial
  - Demonstrates realistic ML challenge for students

- **SpendingPropensity** prediction:
  - **Important Note**: We EXCLUDE all four spending features (TotalSpend, PurchaseCount, AvgPurchasesPerMonth, AvgPurchaseValue) to avoid data leakage
  - SpendingPropensity is derived from TotalSpend, so including it would make the task trivial
  - By excluding spending metrics, students must predict spending from behavior/demographics
  - This creates a realistic business problem: "Can we predict who will spend based on how they play?"
  - Expected accuracy will be lower than PlayerExpertise due to the indirect relationship

### Data Leakage Lesson:
This dataset provides a great teaching opportunity about **data leakage**:
- If we include TotalSpend when predicting SpendingPropensity, the model reaches near-perfect accuracy
- This is because SpendingPropensity is DERIVED from TotalSpend (deterministic relationship)
- In real ML: never include features that directly determine or are derived from the target
- Students should learn to identify and avoid such "too good to be true" results
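
The leakage effect can be reproduced on a small synthetic example. In the sketch below, the label is derived from `TotalSpend` using made-up thresholds (not the dataset's actual definition), and the same Random Forest is scored with and without the leaky column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
play_time = rng.gamma(2.0, 30.0, size=n)
# Made-up rule: about half the rows spend nothing; spend depends loosely on play time
spend = np.where(rng.random(n) < 0.5, 0.0, rng.gamma(1.5, 40.0, size=n) + 0.2 * play_time)
# Hypothetical thresholds that turn spend into a 3-class label
label = np.select([spend == 0, spend < 100], ["NonSpender", "Occasional"], default="Whale")

toy = pd.DataFrame({"PlayTimeHours": play_time, "TotalSpend": spend})

def holdout_accuracy(columns):
    """Train on the given columns and return accuracy on a 20% hold-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        toy[columns], label, test_size=0.2, random_state=42, stratify=label
    )
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    return model.fit(X_train, y_train).score(X_test, y_test)

acc_leaky = holdout_accuracy(["PlayTimeHours", "TotalSpend"])
acc_clean = holdout_accuracy(["PlayTimeHours"])

print(f"With TotalSpend (leaky): {acc_leaky:.3f}")
print(f"Without TotalSpend:      {acc_clean:.3f}")
```

A large gap between the two scores is the "too good to be true" signal: the leaky model is mostly reading the label off the feature it was derived from.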

### Educational Value:
This dataset successfully demonstrates:
1. Real-world data patterns and relationships
2. Importance of feature engineering
3. Handling class imbalance
4. Data leakage concepts (spending features example)
5. Business context (player segmentation, monetization)
6. Repeated-records structure (multiple player-game rows per player in a single table)

## 5. Additional Exercises for Students

### EDA Exercises:
1. Analyze the relationship between Age and Genre preferences
2. Create a cohort analysis by DaysPlayed (new vs veteran players)
3. Explore multi-game players - do they spend more?
4. Analyze achievement completion rates by genre
5. Investigate session patterns (frequency vs duration)

### Feature Engineering:
1. Create aggregate features by PlayerID:
   - Total games played
   - Total spending across all games
   - Genre diversity score
2. Derive ratio features:
   - Achievements per hour played
   - Level progression rate
   - Spend per hour (engagement value)
3. Create categorical features:
   - Player age groups (young, adult, senior)
   - Playtime categories (casual, regular, hardcore)

### ML Improvements:
1. Handle class imbalance with SMOTE or class weights
2. Try other algorithms (XGBoost, SVM, Neural Networks)
3. Perform hyperparameter tuning with GridSearchCV
4. Implement cross-validation for more robust evaluation
5. Build ensemble models combining multiple algorithms
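
Several of these improvements can be combined in one short experiment. The sketch below uses class weights, stratified cross-validation, and a small grid search on a synthetic imbalanced problem from `make_classification`. The grid values and the macro-F1 scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced 3-class problem standing in for the real features
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)

# Stratified folds keep the class shares similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    scoring="f1_macro",  # treats all classes equally, unlike plain accuracy
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro-F1: {search.best_score_:.3f}")
```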