# Data Analysis for Gameplay Dataset
### **Introduction**

This notebook provides an analysis of the gameplay dataset, including:
- Summary statistics
- Distribution of actions
- Visualization of feature patterns
- Insights for model optimization


### Usage Instructions

- Replace `DATASET_PATH` with the correct file path to your dataset.
- Ensure the `frame_features` column holds paths to accessible `.npy` files. The notebook reads them with `np.load`, so any other numeric format needs a different loader.
- Use the insights to refine the dataset before training the models.
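
Before running the analysis, it can help to confirm that every path in `frame_features` points to a readable `.npy` file. The following is a minimal sketch of such a check. The helper name `find_missing_feature_files` is hypothetical and not part of the original notebook, and it assumes the column stores file paths as strings.

```python
from pathlib import Path


def find_missing_feature_files(paths, suffix=".npy"):
    """Return the paths that are not existing files with the expected suffix.

    Hypothetical helper: assumes each entry is a file path (str or Path).
    """
    problems = []
    for raw in paths:
        path = Path(str(raw))
        # Flag entries with the wrong extension or that are absent on disk.
        if path.suffix != suffix or not path.is_file():
            problems.append(str(path))
    return problems


# Example with the notebook's DataFrame (uncomment once `df` is loaded):
# missing = find_missing_feature_files(df["frame_features"])
# print(f"{len(missing)} of {len(df)} feature paths need attention.")
```

Running this first means a missing file surfaces as a clear list rather than as a `FileNotFoundError` partway through the notebook.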



# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

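# Configure visualization style (set_theme is the current name for sns.set;
# it is available from seaborn 0.11, which histplot below also requires)
sns.set_theme(style="whitegrid")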

# File paths
DATASET_PATH = "data/processed/nn_dataset.csv"

# Load the dataset
df = pd.read_csv(DATASET_PATH)

# Preview the dataset
df.head()


### **1. Dataset Overview**

# Check the shape of the dataset
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

# Display column information
df.info()

# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)


### **2. Action Distribution**

# Distribution of actions
action_counts = df['action'].value_counts()
print("Action distribution:")
print(action_counts)

# Plot action distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=action_counts.index, y=action_counts.values, palette="viridis")
plt.title("Action Distribution")
plt.xlabel("Actions")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()


### **3. Feature Analysis**

# Load a sample of frame features for analysis
# Assuming features are stored as paths to .npy files in the 'frame_features' column
sample_features = np.load(df['frame_features'].iloc[0])

print(f"Sample feature shape: {sample_features.shape}")
print(f"Sample feature values:\n{sample_features}")

# Calculate basic statistics for feature values
feature_stats = {
    "Mean": np.mean(sample_features),
    "Median": np.median(sample_features),
    "Min": np.min(sample_features),
    "Max": np.max(sample_features),
    "Std": np.std(sample_features)
}
print("\nFeature Statistics:")
for stat, value in feature_stats.items():
    print(f"{stat}: {value:.4f}")
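
The statistics above describe a single frame, which may not be representative of the whole dataset. A rough sketch of pooling values from a random subset of feature files is shown below. The function name `summarize_feature_files` and its parameters `max_files` and `seed` are assumptions for illustration, and the sketch assumes every file loads with `np.load` into a numeric array.

```python
import numpy as np


def summarize_feature_files(paths, max_files=100, seed=0):
    """Pool values from a random subset of .npy files and summarize them.

    Hypothetical helper: `paths` is any iterable of .npy file paths.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("No feature paths were provided.")
    rng = np.random.default_rng(seed)
    n_files = min(max_files, len(paths))
    chosen = rng.choice(len(paths), size=n_files, replace=False)
    # Flatten each array so files with different shapes can still be pooled.
    values = np.concatenate([np.load(paths[i]).ravel() for i in chosen])
    return {
        "Files": n_files,
        "Mean": float(values.mean()),
        "Median": float(np.median(values)),
        "Min": float(values.min()),
        "Max": float(values.max()),
        "Std": float(values.std()),
    }


# Example with the notebook's DataFrame (uncomment once `df` is loaded):
# pooled_stats = summarize_feature_files(df["frame_features"], max_files=200)
# print(pooled_stats)
```

Pooling loads all selected arrays into memory at once, so `max_files` should be kept small when individual feature files are large.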


### **4. Visualizing Feature Patterns**


# Plot the distribution of sample feature values
# Flatten first: histplot treats each column of a 2D array as a separate variable
plt.figure(figsize=(12, 6))
sns.histplot(sample_features.ravel(), kde=True, color="blue", bins=30)
plt.title("Feature Value Distribution")
plt.xlabel("Feature Value")
plt.ylabel("Frequency")
plt.show()

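# Generate a heatmap for feature correlation if features are 2D
# np.corrcoef treats each row as a variable by default; pass rowvar=False
# if the features are stored along the columns instead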
if len(sample_features.shape) > 1:
    plt.figure(figsize=(10, 8))
    sns.heatmap(np.corrcoef(sample_features), cmap="coolwarm", annot=False)
    plt.title("Feature Correlation Heatmap")
    plt.show()


### **5. Insights for Model Optimization**


# Analyze the balance of action classes using the coefficient of variation
# (std / mean) of the class counts; lower values mean more even classes
imbalance_cv = action_counts.std() / action_counts.mean()
if imbalance_cv < 0.1:
    print("The action classes are relatively balanced.")
else:
    print("The action classes are imbalanced. Consider data augmentation.")

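# Suggest next steps based on the value range of the single sample loaded above
# ("normalized" is taken here to mean values within [0, 1])
if np.min(sample_features) < 0.0 or np.max(sample_features) > 1.0:
    print("Feature values fall outside [0, 1]. Apply normalization during preprocessing.")
else:
    print("Sample feature values lie within [0, 1]. Check more samples before model training.")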

# Check for rare actions
rare_actions = action_counts[action_counts < 0.01 * len(df)]
print("\nRare Actions:")
print(rare_actions)

if len(rare_actions) > 0:
    print("Consider oversampling or augmenting rare actions to improve model performance.")
else:
    print("No rare actions detected.")


print("Analysis complete.")
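
The cell above suggests augmentation or oversampling for imbalanced classes. One alternative that the original notebook does not use is to weight the loss by inverse class frequency. The sketch below computes such weights from a mapping of class counts. The function name `inverse_frequency_weights` is hypothetical, and the formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
def inverse_frequency_weights(counts):
    """Map each class label to n_samples / (n_classes * class_count).

    Hypothetical helper: `counts` is a dict of label -> positive count.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        label: total / (n_classes * count)
        for label, count in counts.items()
    }


# Example with the notebook's counts (uncomment once `action_counts` exists):
# class_weights = inverse_frequency_weights(action_counts.to_dict())
# print(class_weights)
```

Rare actions receive weights above 1 and frequent actions receive weights below 1, so a weighted loss penalizes mistakes on rare actions more heavily.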




### **6. Conclusion**

- The dataset has been analyzed for completeness, balance, and feature patterns.
- Next steps:
  - Address any imbalances in the action classes.
  - Normalize feature values if needed.
  - Use visual insights to inform model architecture and training.
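
For the normalization step, a minimal min-max scaling sketch is given below. It assumes that rescaling to `[0, 1]` is the desired form of normalization, which the notebook does not specify. The function name `min_max_normalize` and the `eps` guard are illustrative choices.

```python
import numpy as np


def min_max_normalize(features, eps=1e-8):
    """Rescale an array to [0, 1] using its own minimum and maximum.

    Hypothetical helper: `eps` avoids division by zero for constant arrays.
    """
    features = np.asarray(features, dtype=np.float64)
    lowest = features.min()
    highest = features.max()
    return (features - lowest) / max(highest - lowest, eps)


# Example with the notebook's sample (uncomment once it is loaded):
# normalized_sample = min_max_normalize(sample_features)
```

This version scales each array by its own range. For training, it is usually preferable to compute the minimum and maximum once on the training set and reuse them for every frame, so that all inputs share one scale.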

# Save the notebook and proceed with model training.

# Save updated dataset if changes are made
# df.to_csv("data/processed/updated_nn_dataset.csv", index=False)