# VertexRec Data Exploration

This notebook provides a comprehensive exploration of the VertexRec recommendation system dataset, including user behavior analysis, item catalog insights, and interaction patterns.

## Table of Contents
1. [Data Loading and Overview](#data-loading)
2. [User Analysis](#user-analysis)
3. [Item Analysis](#item-analysis)
4. [Interaction Analysis](#interaction-analysis)
5. [Recommendation Insights](#recommendation-insights)


## 1. Data Loading and Overview {#data-loading}


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure plotly
import plotly.io as pio
pio.templates.default = "plotly_white"


In [None]:
# Load data
users_df = pd.read_csv('../data/users.csv')
items_df = pd.read_csv('../data/items.csv')
interactions_df = pd.read_csv('../data/interactions.csv')

print("Data loaded successfully!")
print(f"Users: {len(users_df):,}")
print(f"Items: {len(items_df):,}")
print(f"Interactions: {len(interactions_df):,}")
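
Row counts alone say little about data quality. The sketch below shows a few checks that could follow the load step: missing values, duplicate rows, and interactions that reference items absent from the catalog. It runs on small toy frames rather than the real CSV files, and `quality_report` is a hypothetical helper, not part of the VertexRec codebase. Only the `user_id`, `item_id`, and `rating` columns used elsewhere in this notebook are assumed.

```python
import pandas as pd

# Toy stand-ins for interactions.csv and items.csv (illustrative values only)
toy_interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 10, 11, 12, 99],
    "rating": [4.0, 4.0, None, 5.0, 3.0],
})
toy_items = pd.DataFrame({"item_id": [10, 11, 12, 13]})


def quality_report(interactions, items):
    """Summarize missing values, duplicate rows, and orphaned item references."""
    known_items = set(items["item_id"])
    orphans = sorted(int(i) for i in set(interactions["item_id"]) - known_items)
    return {
        "missing_per_column": interactions.isna().sum().to_dict(),
        "duplicate_rows": int(interactions.duplicated().sum()),
        "orphan_item_ids": orphans,
    }


report = quality_report(toy_interactions, toy_items)
print(report)
```

The same function could be applied to `interactions_df` and `items_df` once they are loaded.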


## 2. User Analysis {#user-analysis}


In [None]:
# User demographics analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Age group distribution
users_df['age_group'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Age Group Distribution')
axes[0,0].set_xlabel('Age Group')
axes[0,0].set_ylabel('Count')
axes[0,0].tick_params(axis='x', rotation=45)

# Gender distribution
users_df['gender'].value_counts().plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%')
axes[0,1].set_title('Gender Distribution')

# Location distribution
users_df['location'].value_counts().head(10).plot(kind='bar', ax=axes[1,0], color='lightgreen')
axes[1,0].set_title('Top 10 Locations')
axes[1,0].set_xlabel('Location')
axes[1,0].set_ylabel('Count')
axes[1,0].tick_params(axis='x', rotation=45)

# Subscription type distribution
users_df['subscription_type'].value_counts().plot(kind='bar', ax=axes[1,1], color='orange')
axes[1,1].set_title('Subscription Type Distribution')
axes[1,1].set_xlabel('Subscription Type')
axes[1,1].set_ylabel('Count')

plt.tight_layout()
plt.show()
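
The plots above describe who the users are; a natural complement is how active they are. The following is a minimal sketch, on toy data, of joining per-user interaction counts onto the user table and comparing activity by subscription type. The column name `n_interactions` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for users.csv and interactions.csv (illustrative values only)
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "subscription_type": ["free", "premium", "free"],
})
toy_interactions = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "item_id": [10, 11, 12, 10],
})

# Count interactions per user, then left-join so inactive users are kept
activity = (
    toy_interactions.groupby("user_id")
    .size()
    .rename("n_interactions")
    .reset_index()
)
user_activity = toy_users.merge(activity, on="user_id", how="left")
user_activity["n_interactions"] = user_activity["n_interactions"].fillna(0).astype(int)

# Average activity per subscription type
mean_by_subscription = user_activity.groupby("subscription_type")["n_interactions"].mean()
print(mean_by_subscription)
```

The left join matters here: an inner join would silently drop users with no interactions and inflate the averages.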


## 3. Item Analysis {#item-analysis}


In [None]:
# Item popularity analysis
item_popularity = interactions_df.groupby('item_id').agg({
    'rating': ['count', 'mean', 'std'],
    'user_id': 'nunique'
}).round(2)

item_popularity.columns = ['total_interactions', 'avg_rating', 'rating_std', 'unique_users']
item_popularity = item_popularity.reset_index()

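# Merge with item metadata
item_analysis = items_df.merge(item_popularity, on='item_id', how='left')

# Items with no interactions get zero counts; their rating statistics stay NaN
# so that describe() is not skewed by artificial 0 ratings
count_cols = ['total_interactions', 'unique_users']
item_analysis[count_cols] = item_analysis[count_cols].fillna(0).astype(int)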

print("=== ITEM POPULARITY STATISTICS ===")
print(item_analysis.describe())
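
Summary statistics can hide how concentrated item popularity is. One way to quantify it, sketched here on toy data, is the share of all interactions captured by the most popular fraction of items. `top_share` is a hypothetical helper and the 25% cutoff is an arbitrary choice for illustration.

```python
import math

import pandas as pd

# Toy interaction log: item 1 has 6 interactions, item 2 has 2, items 3 and 4 have 1 each
toy_interactions = pd.DataFrame({"item_id": [1] * 6 + [2] * 2 + [3] + [4]})
interaction_counts = toy_interactions["item_id"].value_counts()


def top_share(counts, fraction):
    """Share of all interactions that go to the most popular `fraction` of items."""
    n_top = max(1, math.ceil(len(counts) * fraction))
    ranked = counts.sort_values(ascending=False)
    return float(ranked.iloc[:n_top].sum() / ranked.sum())


share = top_share(interaction_counts, 0.25)
print(f"Top 25% of items receive {share:.0%} of interactions")
```

A value close to the chosen fraction would indicate evenly spread popularity; a much larger value points to a long-tailed catalog.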


## 4. Interaction Analysis {#interaction-analysis}


In [None]:
# Convert timestamp to datetime
interactions_df['timestamp'] = pd.to_datetime(interactions_df['timestamp'])

# Temporal analysis: derive calendar features from the timestamp
interactions_df['hour'] = interactions_df['timestamp'].dt.hour  # 0-23
interactions_df['day_of_week'] = interactions_df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
interactions_df['month'] = interactions_df['timestamp'].dt.month  # 1-12

print("=== TEMPORAL ANALYSIS ===")
print(f"Date range: {interactions_df['timestamp'].min()} to {interactions_df['timestamp'].max()}")
print(f"Total days: {(interactions_df['timestamp'].max() - interactions_df['timestamp'].min()).days}")
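
Once `hour` and `day_of_week` exist, usage patterns can be read from simple counts. The sketch below uses four toy timestamps to show the aggregation; the resulting `day_hour_counts` table is the kind of input a heatmap would take. The timestamps are made up for illustration.

```python
import pandas as pd

# Toy timestamps: two on a Monday morning, one Tuesday evening, one Saturday evening
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:15",
        "2024-01-01 09:45",
        "2024-01-02 20:00",
        "2024-01-06 20:30",
    ])
})
toy["hour"] = toy["timestamp"].dt.hour
toy["day_of_week"] = toy["timestamp"].dt.dayofweek  # Monday=0, Sunday=6

# Interactions per hour of day and per day of week
by_hour = toy.groupby("hour").size()
by_day = toy.groupby("day_of_week").size()

# Day-of-week x hour table of interaction counts
day_hour_counts = pd.crosstab(toy["day_of_week"], toy["hour"])
print(day_hour_counts)
```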


## 5. Recommendation Insights {#recommendation-insights}


In [None]:
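# User-item matrix analysis
# pivot_table averages ratings when a user rated the same item more than once
user_item_matrix = interactions_df.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating',
    fill_value=0
)

print(f"User-item matrix shape: {user_item_matrix.shape}")

# Sparsity is the share of user-item pairs with no recorded rating. Count the
# observed pairs directly: testing for == 0 would also count genuine 0 ratings as missing.
rated_pairs = interactions_df.dropna(subset=['rating'])[['user_id', 'item_id']].drop_duplicates()
sparsity = 1 - len(rated_pairs) / user_item_matrix.size
print(f"Sparsity: {sparsity:.2%}")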
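
A related check is how many users and items have too few interactions to support personalized recommendations (the cold start problem). The following sketch, on toy data, measures the share of catalog entries below a minimum interaction count. `cold_start_share` is a hypothetical helper and the threshold of 2 is an arbitrary assumption; a suitable value depends on the model.

```python
import pandas as pd

# Toy stand-ins for users.csv, items.csv, and interactions.csv (illustrative values only)
toy_users = pd.DataFrame({"user_id": [1, 2, 3, 4]})
toy_items = pd.DataFrame({"item_id": [10, 11, 12]})
toy_interactions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "item_id": [10, 11, 10, 10, 10],
})

MIN_INTERACTIONS = 2  # assumed threshold, chosen only for illustration


def cold_start_share(catalog_ids, interaction_ids, min_interactions):
    """Fraction of catalog IDs with fewer than `min_interactions` interactions."""
    # reindex against the full catalog so IDs with no interactions count as 0
    counts = interaction_ids.value_counts().reindex(catalog_ids, fill_value=0)
    return float((counts < min_interactions).mean())


cold_users = cold_start_share(toy_users["user_id"], toy_interactions["user_id"], MIN_INTERACTIONS)
cold_items = cold_start_share(toy_items["item_id"], toy_interactions["item_id"], MIN_INTERACTIONS)
print(f"Cold-start users: {cold_users:.0%}, cold-start items: {cold_items:.0%}")
```

Measuring against the full user and item tables, rather than only the IDs that appear in the interaction log, keeps entries with zero interactions in the denominator.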


## Conclusion

This data exploration provides valuable insights for building an effective recommendation system:

1. **Data Quality**: The dataset shows good coverage with reasonable sparsity
2. **User Behavior**: Clear patterns in user activity and preferences
3. **Item Popularity**: Significant variation in popularity across items
4. **Temporal Patterns**: Distinct usage patterns by time of day and day of week
5. **Cold Start**: Notable cold start problem that needs addressing
6. **Recommendation Opportunities**: High potential for personalized recommendations

These insights will guide the feature engineering and model training phases of the recommendation system.
