# Phase I: Detailed Data Exploration - SQuAD v1.1

This notebook provides a comprehensive analysis of the Stanford Question Answering Dataset (SQuAD) v1.1. We will explore dataset statistics, length distributions, and answer patterns to inform our model choices and preprocessing hyperparameters.

In [1]:
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
import numpy as np

sns.set_theme(style="whitegrid")


## 1. Load Dataset

In [None]:
print("Downloading SQuAD v1.1 dataset...")
dataset = load_dataset("squad")
dataset

## 2. Dataset Overview

Let's look at the basic statistics of the train and validation splits.

In [None]:
splits = ['train', 'validation']
summary = []

for split in splits:
    df = dataset[split].to_pandas()
    summary.append({
        'Split': split,
        'Total Records': len(df),
        'Unique Contexts': df['context'].nunique(),
        'Unique Titles': df['title'].nunique()
    })

pd.DataFrame(summary)

## 3. Length Analysis

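The distributions of context, question, and answer lengths guide the choice of `max_length` during tokenization. Lengths below are whitespace-separated word counts, which undercount subword tokens, so treat them as a lower bound on tokenized length.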

In [None]:
train_df = dataset['train'].to_pandas()

# Calculate lengths in words (approximation of tokens)
train_df['context_len'] = train_df['context'].apply(lambda x: len(x.split()))
train_df['question_len'] = train_df['question'].apply(lambda x: len(x.split()))
# Use the first reference answer (training questions carry a single answer)
train_df['answer_len'] = train_df['answers'].apply(lambda x: len(x['text'][0].split()))

train_df[['context_len', 'question_len', 'answer_len']].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
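
One way to turn these percentiles into a concrete length budget is to measure what fraction of examples fit under a candidate limit. The sketch below is a minimal pure-Python illustration: the helper `coverage_at` and the toy `lengths` list are hypothetical stand-ins for `train_df['context_len']`, not part of the notebook, and a real check would count tokens with the model's tokenizer rather than words.

```python
def coverage_at(lengths, budget):
    """Return the fraction of examples whose length is <= budget."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n <= budget) / len(lengths)


# Toy word counts standing in for train_df['context_len'] (illustrative values only).
lengths = [90, 110, 120, 140, 150, 180, 200, 260, 320, 650]

# Compare a few candidate budgets side by side.
for budget in (128, 256, 384):
    print(f"budget={budget}: {coverage_at(lengths, budget):.0%} of examples fit")
```

Examples that exceed the chosen budget are not necessarily lost, but they will need truncation or splitting, so the uncovered fraction is a rough measure of how much that handling matters.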

### Visualizing Distributions

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.histplot(train_df['context_len'], bins=50, ax=axes[0], color='skyblue')
axes[0].set_title('Context Length Distribution (Words)')
axes[0].set_xlabel('Words')

sns.histplot(train_df['question_len'], bins=30, ax=axes[1], color='salmon')
axes[1].set_title('Question Length Distribution (Words)')
axes[1].set_xlabel('Words')

sns.histplot(train_df['answer_len'], bins=20, ax=axes[2], color='lightgreen')
axes[2].set_title('Answer Length Distribution (Words)')
axes[2].set_xlabel('Words')

plt.tight_layout()
plt.show()

## 4. Topic Analysis

Which topics (the Wikipedia article names stored in the `title` field) contribute the most questions to the training set?

In [None]:
plt.figure(figsize=(10, 6))
train_df['title'].value_counts().head(15).plot(kind='barh', color='darkblue')
plt.title('Top 15 Topics in SQuAD v1.1')
plt.xlabel('Count')
plt.ylabel('Title')
plt.gca().invert_yaxis()
plt.show()

## 5. Sample Record Inspection

Let's inspect two records: the one with the longest context and the one with the shortest.

In [None]:
print("--- Long Context Example ---")
long_sample = train_df.sort_values(by='context_len', ascending=False).iloc[0]
print(f"Title: {long_sample['title']}")
print(f"Context Length: {long_sample['context_len']} words")
print(f"Question: {long_sample['question']}")
print(f"Answer: {long_sample['answers']['text'][0]}")

print("\n--- Short Context Example ---")
short_sample = train_df.sort_values(by='context_len', ascending=True).iloc[0]
print(f"Title: {short_sample['title']}")
print(f"Context Length: {short_sample['context_len']} words")
print(f"Question: {short_sample['question']}")
print(f"Answer: {short_sample['answers']['text'][0]}")
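
Each record's `answers` field pairs the answer `text` with an `answer_start` character offset into the context, so a useful sanity check while inspecting samples is to confirm that the two agree. The following is a minimal sketch under stated assumptions: `span_matches` is a hypothetical helper and the record is invented for illustration, laid out like a SQuAD example.

```python
def span_matches(context, answer_text, answer_start):
    """Check that the answer text appears in the context at the stated character offset."""
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text


# Hypothetical record in the SQuAD layout (values invented for illustration).
record = {
    "context": "The notebook explores SQuAD v1.1 before any model is trained.",
    "question": "Which dataset does the notebook explore?",
    "answers": {"text": ["SQuAD v1.1"], "answer_start": [22]},
}

# Verify every (text, offset) pair attached to the record.
answers = record["answers"]
for text, start in zip(answers["text"], answers["answer_start"]):
    status = "aligned" if span_matches(record["context"], text, start) else "MISALIGNED"
    print(f"{text!r} at offset {start}: {status}")
```

Applied to the real data, the same check could be run over every row to flag spans whose offsets do not line up with the context.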

## 6. Save Sample for Reference

In [None]:
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

sample_record = dataset['train'][0]
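# Write UTF-8 and keep non-ASCII characters readable in the saved file
with open(os.path.join(data_dir, "squad_sample.json"), "w", encoding="utf-8") as f:
    json.dump(sample_record, f, indent=4, ensure_ascii=False)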

print(f"Sample saved to {os.path.join(data_dir, 'squad_sample.json')}")
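
To confirm that a saved sample can be read back unchanged, a round-trip check is one option. This is a minimal sketch, assuming a record made only of plain Python types; the `sample` dictionary is invented for illustration and the file is written to a temporary directory rather than to `data/`.

```python
import json
import os
import tempfile

# Hypothetical record with the same field layout as a SQuAD example.
sample = {
    "id": "example-0",
    "title": "Example_Title",
    "context": "A short context with a non-ASCII character: café.",
    "question": "Which character is non-ASCII?",
    "answers": {"text": ["café"], "answer_start": [44]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "squad_sample.json")

    # Save with explicit UTF-8 so the file is portable across platforms.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4, ensure_ascii=False)

    # Load it back and compare with the original record.
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print("Round trip OK:", reloaded == sample)
```

Note that a row taken from `to_pandas()` holds NumPy arrays in `answers`, which `json.dump` cannot serialize directly; indexing the dataset itself, as the cell above does, avoids that.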