# Video Game Sales Analysis
## Identifying Patterns That Determine Game Success

════════════════════════════════════════════════════════════════════════════════

**Project Goal:** Analyze video game sales data to identify patterns and characteristics
that determine commercial success in the gaming industry.

**Analysis Approach:** Data exploration → feature analysis → data preparation →
model evaluation → recommendations

════════════════════════════════════════════════════════════════════════════════

## Environment Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as st
from dsr_data_tools import analyze_dataset
from dsr_feature_eng_ml import DataSplits, ModelEvaluation

## 1 Load and Analyze Dataset

In [None]:
games = pd.read_csv('./datasets/games.csv')
games_analysis, recommendations = analyze_dataset(games, generate_recs=True)

In [None]:
if recommendations:
    print('\n=== Data Recommendations ===\n')
    for key, value in recommendations.items():
        if isinstance(value, dict):
            print(f'{key}:')
            for sub_key, sub_value in value.items():
                print(f'  {sub_key}: {sub_value}')
        else:
            print(f'{key}: {value}')

### 1.1 Data Analysis Summary

**Dataset Overview:**
- Total Records: 16,715 video games
- Time Period: 1980-2016
- Global and regional sales data (in millions)

**Key Data Characteristics:**
- Missing values detected in certain columns
- Sales data shows high variation across regions
- Platform and genre are critical categorical features

**Data Decisions:**
- [To be updated with specific findings from analysis]
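
The missing-value observation above can be made concrete with a per-column count. The following is a minimal sketch on a made-up toy frame, not output from `games.csv`. The column names mirror those used later in this notebook, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for games.csv; every value here is invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Year_of_Release': [2001.0, np.nan, 2010.0, 2015.0],
    'Critic_Score': [80.0, np.nan, np.nan, 65.0],
    'Global_Sales': [1.2, 0.3, 0.8, 2.5],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean() * 100,
}).sort_values('n_missing', ascending=False)

print(missing)
```

Running the same two expressions on `games` would show which columns lose the most rows when missing values are dropped.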

## 2 Data Preparation

In [None]:
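# Handle missing values: drop every row with a missing value in any column.
# .copy() gives an independent frame, which avoids a possible
# SettingWithCopyWarning when the target column is added below.
games_clean = games.dropna().copy()

# Create target variable (success = 1 if global sales are strictly above the median)
median_sales = games_clean['Global_Sales'].median()
games_clean['success'] = (games_clean['Global_Sales'] > median_sales).astype(int)

print(f'Dataset shape after cleaning: {games_clean.shape}')
print(f'Success distribution:\n{games_clean["success"].value_counts()}')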
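
The labeling step can be checked on a small example. The sketch below uses a toy frame and makes two assumptions that this notebook does not confirm: that a score column may hold a text placeholder such as `'tbd'`, and that it is acceptable to drop rows only where the modeling columns are missing. It illustrates the median-threshold idea and is not a replacement for the cell above.

```python
import pandas as pd

# Toy rows; 'tbd' mimics a text placeholder in a score column (an assumption).
toy = pd.DataFrame({
    'User_Score': ['8.1', 'tbd', '6.5', '7.0', '9.2'],
    'Global_Sales': [0.5, 1.0, 0.2, 3.0, 1.0],
})

# Convert scores to numbers; entries that cannot be parsed become NaN.
toy['User_Score'] = pd.to_numeric(toy['User_Score'], errors='coerce')

# Drop rows missing the columns needed here, then copy to an independent frame.
clean = toy.dropna(subset=['User_Score', 'Global_Sales']).copy()

# Label a game as a success when its sales are strictly above the median.
median_sales = clean['Global_Sales'].median()
clean['success'] = (clean['Global_Sales'] > median_sales).astype(int)

print(median_sales)
print(clean)
```

Because the comparison is strict, games whose sales equal the median fall into class 0, so the two classes can be unequal when many titles tie at the median.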

## 3 Create Data Splits and Evaluate Models

In [None]:
# Select features for modeling
feature_columns = ['Year_of_Release', 'Critic_Score', 'User_Score']

# Create data splits
splits = DataSplits.from_data_source(
    games_clean,
    target_column='success',
    feature_columns=feature_columns,
    test_size=0.2,
    random_state=42
)

print(f'Training set size: {len(splits.train_data)}')
print(f'Validation set size: {len(splits.val_data)}')
print(f'Test set size: {len(splits.test_data)}')
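
The cell above reports training, validation, and test sizes. How `DataSplits` derives them is not shown here, so the following is only a sketch of one common way to build three non-overlapping subsets with NumPy. The function `three_way_split` and its `val_size` parameter are hypothetical names introduced for illustration.

```python
import numpy as np


def three_way_split(n_rows, val_size=0.2, test_size=0.2, random_state=42):
    """Return shuffled, non-overlapping index arrays for train/val/test.

    Hypothetical helper; not the DataSplits implementation.
    """
    rng = np.random.default_rng(random_state)
    indices = rng.permutation(n_rows)

    n_test = int(round(n_rows * test_size))
    n_val = int(round(n_rows * val_size))

    # Slice the shuffled indices into three consecutive, disjoint chunks.
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can be passed to `DataFrame.iloc` to select the matching rows. Fixing `random_state` keeps the partition reproducible between runs.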

### 3.1 Evaluate Models

In [None]:
# Define hyperparameter grids
hyperparameter_grids = {
    'decision_tree': {'max_depth': [3, 4, 5, 6, 7]},
    'random_forest': {'n_estimators': [50, 100, 150, 200]},
    'logistic_regression': {'C': [0.1, 1, 10]}
}

# Evaluate models
results = ModelEvaluation.evaluate_dataset(
    splits,
    hyperparameter_grids=hyperparameter_grids,
    cv=5,
    n_iter=5,
    scoring='f1',
    n_jobs=-1,
    viable_f1_gap=0.05
)
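
Model selection above is driven by `scoring='f1'`. Assuming this follows the usual binary definition (the harmonic mean of precision and recall for the positive class), the metric can be computed by hand as in this sketch. The function name `f1_score_binary` and the sample labels are illustrative only.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = success, 0 = not a success.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 3))
```

Since the target is defined by a median split, the classes are roughly balanced, so F1 and accuracy should tell a similar story here. F1 becomes more informative if the success threshold is later moved away from the median.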

### 3.2 Model Results

In [None]:
print(results.summary_text)

## 4 Test Set Evaluation

In [None]:
test_report = results.test_report
print(test_report)