# KDD Cup 2022 ESCI Challenge - Data Exploration

This notebook explores the Shopping Queries Dataset for the ESCI Challenge. It examines the structure, characteristics, and patterns in the data, covering:

1. Dataset overview and statistics
2. Query and product distributions
3. ESCI label distributions
4. Text characteristics
5. Language and locale patterns

**Dataset Files:**
- `shopping_queries_dataset_examples.parquet` - Query-product pairs with relevance labels
- `shopping_queries_dataset_products.parquet` - Product information (title, description, etc.)
- `shopping_queries_dataset_sources.csv` - Query source information
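
The three files are meant to be combined rather than read in isolation. The sketch below shows one way they might be joined, using small toy frames in place of the real files. The column names (`query_id`, `product_id`, `product_locale`, `product_title`, `source`) and the toy values are assumptions for illustration and should be checked against the actual columns printed in the overview cells.

```python
import pandas as pd

# Toy stand-ins for the real files; column names and values are assumptions.
examples = pd.DataFrame({
    "query_id": [0, 0, 1],
    "query": ["usb cable", "usb cable", "cable usb"],
    "product_id": ["B01", "B02", "B01"],
    "product_locale": ["us", "us", "es"],
    "esci_label": ["E", "S", "E"],
})
products = pd.DataFrame({
    "product_id": ["B01", "B02", "B01"],
    "product_locale": ["us", "us", "es"],
    "product_title": ["USB-C cable 1m", "Micro-USB cable", "Cable USB-C 1 m"],
})
sources = pd.DataFrame({"query_id": [0, 1], "source": ["source_a", "source_b"]})

# The same product_id can appear in several locales, so join on both keys.
# validate= raises if the right-hand side has duplicate keys.
merged = examples.merge(
    products, on=["product_id", "product_locale"], how="left", validate="many_to_one"
)
merged = merged.merge(sources, on="query_id", how="left", validate="many_to_one")

print(merged[["query", "product_title", "esci_label", "source"]])
```

A left join keeps every query-product pair even when product metadata is missing, which makes gaps visible as `NaN` rather than silently dropping rows.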

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Load Data Files

Let's start by loading the three main data files and understanding their structure.

In [None]:
# Load the data files
data_path = "../data/raw/"  # relative to the notebook's working directory

try:
    # Load examples data
    df_examples = pd.read_parquet(f"{data_path}shopping_queries_dataset_examples.parquet")
    print(f"✓ Examples data loaded: {df_examples.shape}")

    # Load products data
    df_products = pd.read_parquet(f"{data_path}shopping_queries_dataset_products.parquet")
    print(f"✓ Products data loaded: {df_products.shape}")

    # Load sources data
    df_sources = pd.read_csv(f"{data_path}shopping_queries_dataset_sources.csv")
    print(f"✓ Sources data loaded: {df_sources.shape}")

    print("\nAll data files loaded successfully!")

except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("Please ensure the data files are in the correct directory:")
    # Report the same paths the loader actually used
    print(f"- {data_path}shopping_queries_dataset_examples.parquet")
    print(f"- {data_path}shopping_queries_dataset_products.parquet")
    print(f"- {data_path}shopping_queries_dataset_sources.csv")
    raise  # later cells need all three DataFrames, so stop here
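
The loading cell above reports only the first missing file it hits. As a rough sketch, a small pre-flight check can list every missing file at once before any loading is attempted. The helper name `find_missing_files` is hypothetical, and the demo runs against a temporary directory rather than the real `../data/raw/` folder.

```python
from pathlib import Path
import tempfile

EXPECTED_FILES = [
    "shopping_queries_dataset_examples.parquet",
    "shopping_queries_dataset_products.parquet",
    "shopping_queries_dataset_sources.csv",
]

def find_missing_files(data_dir, filenames=EXPECTED_FILES):
    """Return the expected filenames that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in filenames if not (root / name).is_file()]

# Demo: a temporary directory that contains only the first file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / EXPECTED_FILES[0]).touch()
    missing = find_missing_files(tmp)

print("Missing files:", missing)
```

In the notebook itself, this could be called as `find_missing_files("../data/raw/")` ahead of the `read_parquet` calls.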

## 2. Dataset Overview

Let's examine the structure and basic statistics of each dataset.

In [None]:
# Examine examples dataset
print("=" * 60)
print("EXAMPLES DATASET (df_examples)")
print("=" * 60)
print(f"Shape: {df_examples.shape}")
print(f"Columns: {list(df_examples.columns)}")
print("\nFirst few rows:")
display(df_examples.head(3))
print("\nDataset info:")
df_examples.info()  # info() prints directly and returns None

In [None]:
# Examine products dataset
print("\n" + "=" * 60)
print("PRODUCTS DATASET (df_products)")
print("=" * 60)
print(f"Shape: {df_products.shape}")
print(f"Columns: {list(df_products.columns)}")
print("\nFirst few rows:")
display(df_products.head(3))
print("\nDataset info:")
df_products.info()  # info() prints directly and returns None

In [None]:
# Examine sources dataset
print("\n" + "=" * 60)
print("SOURCES DATASET (df_sources)")
print("=" * 60)
print(f"Shape: {df_sources.shape}")
print(f"Columns: {list(df_sources.columns)}")
print("\nFirst few rows:")
display(df_sources.head(3))
print("\nDataset info:")
df_sources.info()  # info() prints directly and returns None
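
The `info()` output shows dtypes and non-null counts, but not how many distinct values each column holds. The sketch below is one possible per-column summary that puts dtype, missing values, and cardinality side by side. The helper name `summarize_columns` and the toy columns are hypothetical; the same function could be applied to `df_examples`, `df_products`, or `df_sources`.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, missing count/percent, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })

# Toy frame standing in for a slice of the products data (assumed columns).
toy = pd.DataFrame({
    "product_title": ["a", "b", None, "a"],
    "product_brand": [None, None, "x", "y"],
})
summary = summarize_columns(toy)
print(summary)
```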

## 3. ESCI Label Analysis

Let's analyze the distribution of ESCI labels (Exact, Substitute, Complement, Irrelevant) across the dataset.

In [None]:
# ESCI label distribution
print("ESCI Label Distribution")
print("=" * 30)
esci_counts = df_examples['esci_label'].value_counts().sort_index()
print(esci_counts)
print(f"\nTotal examples: {len(df_examples):,}")

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Bar plot
esci_counts.plot(kind='bar', ax=ax1, color='skyblue', alpha=0.8)
ax1.set_title('ESCI Label Distribution')
ax1.set_xlabel('ESCI Label')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=0)

# Add percentage labels on bars
for i, v in enumerate(esci_counts.values):
    ax1.text(i, v + 1000, f'{v/len(df_examples)*100:.1f}%', 
             ha='center', va='bottom')

# Pie chart
ax2.pie(esci_counts.values, labels=esci_counts.index, autopct='%1.1f%%', 
        startangle=90, colors=['#ff9999', '#66b3ff', '#99ff99', '#ffcc99'])
ax2.set_title('ESCI Label Distribution (Percentage)')

plt.tight_layout()
plt.show()

# ESCI label interpretation
print("\nESCI Label Interpretation:")
print("E (Exact): Product is relevant and satisfies all query specifications")
print("S (Substitute): Product fails some aspects of the query but can serve as a functional substitute")
print("C (Complement): Product does not fulfill the query but could be used together with an exact item")
print("I (Irrelevant): Product is irrelevant or fails a central aspect of the query")
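
The overall label distribution can hide differences between locales. As a minimal sketch, assuming the examples data has a `product_locale` column alongside `esci_label`, a row-normalized cross-tabulation gives the label share within each locale. The toy values below are made up for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Toy data; the column names are assumptions and the values are invented.
toy = pd.DataFrame({
    "product_locale": ["us", "us", "us", "us", "es", "es", "jp", "jp"],
    "esci_label": ["E", "E", "S", "I", "E", "C", "E", "I"],
})

label_order = ["E", "S", "C", "I"]

# normalize="index" makes each row (locale) sum to 1.
share = pd.crosstab(
    toy["product_locale"], toy["esci_label"], normalize="index"
).reindex(columns=label_order, fill_value=0.0)

print((share * 100).round(1))
```

Reindexing the columns keeps the labels in E, S, C, I order and fills in a zero share for any label that does not occur in the data.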