# Phase I: Detailed Data Exploration - SQuAD v1.1

This notebook provides a comprehensive analysis of the Stanford Question Answering Dataset (SQuAD) v1.1. We will explore dataset statistics, length distributions, and answer patterns to inform our model choices and preprocessing hyperparameters.

In [None]:
# Core imports: required by every cell below
import os
import json

import pandas as pd
from datasets import load_dataset

# Plotting imports are optional; the statistics cells run without them
try:
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_theme(style="whitegrid")
    print("All imports successful!")
except ImportError as e:
    # A NumPy 1.x/2.x binary mismatch also surfaces here as an ImportError
    print(f"Import error: {e}")
    print("Core imports successful - proceeding without visualization")


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.4.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\My Device\Desktop\Question Answering with Transformers_NLP\localenv\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "c:\Users\My Device\Desktop\Question Answering with Transformers_NLP\localenv\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "c:\Users\My Device\Desktop\Question Answering with Tra

AttributeError: _ARRAY_API not found

ImportError: numpy.core.multiarray failed to import
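
The failure above is a binary-compatibility problem: a compiled extension built against NumPy 1.x is being loaded under NumPy 2.x. The error message itself suggests either pinning `numpy<2` or upgrading the affected module. As a minimal sketch, the check below uses only the standard library to report which side of that boundary the environment is on; the helper names `major_version` and `numpy_advice` are hypothetical and not part of the notebook.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.4.2'."""
    return int(version.split(".")[0])


def numpy_advice(version: str) -> str:
    """Suggest a next step based on the NumPy major version."""
    if major_version(version) >= 2:
        return ('NumPy 2.x detected: upgrade packages built against 1.x, '
                'or pin with: pip install "numpy<2"')
    return "NumPy 1.x detected: packages built against 1.x should load"


try:
    installed = metadata.version("numpy")
    print(f"numpy {installed}: {numpy_advice(installed)}")
except metadata.PackageNotFoundError:
    print("numpy is not installed in this environment")
```

Running this in the same virtual environment as the kernel, before the import cell, shows whether the plotting imports are likely to fail.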

## 1. Load Dataset

In [6]:
print("Downloading SQuAD v1.1 dataset...")
dataset = load_dataset("squad")
dataset

Downloading SQuAD v1.1 dataset...




DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## 2. Dataset Overview

Let's look at the basic statistics of the train and validation splits.

In [7]:
splits = ['train', 'validation']
summary = []

for split in splits:
    df = dataset[split].to_pandas()
    summary.append({
        'Split': split,
        'Total Records': len(df),
        'Unique Contexts': df['context'].nunique(),
        'Unique Titles': df['title'].nunique()
    })

pd.DataFrame(summary)

|   | Split | Total Records | Unique Contexts | Unique Titles |
|---|---|---|---|---|
| 0 | train | 87599 | 18891 | 442 |
| 1 | validation | 10570 | 2067 | 48 |
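
Each split has far fewer unique contexts than records, so several questions share one paragraph. The short sketch below derives that ratio from the counts reported above; the function name `questions_per_context` is illustrative only.

```python
# Counts copied from the overview output above.
overview = {
    "train": {"records": 87599, "unique_contexts": 18891},
    "validation": {"records": 10570, "unique_contexts": 2067},
}


def questions_per_context(records: int, unique_contexts: int) -> float:
    """Average number of questions that share one context paragraph."""
    return records / unique_contexts


for split, counts in overview.items():
    ratio = questions_per_context(counts["records"], counts["unique_contexts"])
    print(f"{split}: {ratio:.2f} questions per context")
```

Because rows that share a context are not independent, it may be worth grouping by context when subsampling the training set.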


## 3. Length Analysis

Understanding the distribution of context, question, and answer lengths is crucial for setting `max_length` in tokenization.

In [8]:
train_df = dataset['train'].to_pandas()

# Count whitespace-separated words as a rough proxy for tokens
# (subword tokenizers usually produce more tokens than words)
train_df['context_len'] = train_df['context'].apply(lambda x: len(x.split()))
train_df['question_len'] = train_df['question'].apply(lambda x: len(x.split()))
train_df['answer_len'] = train_df['answers'].apply(lambda x: len(x['text'][0].split()))

train_df[['context_len', 'question_len', 'answer_len']].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

| Statistic | context_len | question_len | answer_len |
|---|---|---|---|
| count | 87599.0 | 87599.0 | 87599.0 |
| mean | 119.763125 | 10.061108 | 3.162159 |
| std | 49.365 | 3.55923 | 3.392334 |
| min | 20.0 | 1.0 | 1.0 |
| 25% | 89.0 | 8.0 | 1.0 |
| 50% | 110.0 | 10.0 | 2.0 |
| 75% | 142.0 | 12.0 | 3.0 |
| 90% | 183.0 | 15.0 | 7.0 |
| 95% | 213.0 | 17.0 | 10.0 |
| 99% | 282.0 | 21.0 | 18.0 |
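
These statistics are in whitespace-separated words, whereas `max_length` is counted in tokenizer tokens. A rough sketch of turning word counts into a token budget check follows. The inflation factor `TOKENS_PER_WORD = 1.3` and the three special tokens are assumptions for illustration, not values measured in this notebook, and the function names are hypothetical. Pairing the context and question values from the same percentile row is only a rough guide, since the percentile of a sum is not the sum of percentiles.

```python
import math

# Assumed subword inflation factor; measure it with the real tokenizer.
TOKENS_PER_WORD = 1.3
# Assumed special tokens around a question/context pair (BERT-style layout).
SPECIAL_TOKENS = 3


def estimated_tokens(context_words: int, question_words: int) -> int:
    """Estimate the encoded length of one question/context pair."""
    words = context_words + question_words
    return math.ceil(words * TOKENS_PER_WORD) + SPECIAL_TOKENS


def coverage(pairs, max_length: int) -> float:
    """Fraction of (context_words, question_words) pairs that fit in max_length."""
    fits = sum(1 for c, q in pairs if estimated_tokens(c, q) <= max_length)
    return fits / len(pairs)


# Percentile rows from the statistics above: (context_len, question_len) in words.
percentile_rows = {"50%": (110, 10), "95%": (213, 17), "99%": (282, 21)}
for label, (c, q) in percentile_rows.items():
    print(label, estimated_tokens(c, q))
```

In the notebook, `coverage(zip(train_df['context_len'], train_df['question_len']), max_length)` would give the estimated share of examples that fit without truncation. Tokenizing with the model's actual tokenizer remains the reliable way to pick the final value.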


### Visualizing Distributions

In [11]:
import matplotlib.pyplot as plt
import seaborn as sns


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.4.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\My Device\Desktop\Question Answering with Transformers_NLP\localenv\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "c:\Users\My Device\Desktop\Question Answering with Transformers_NLP\localenv\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "c:\Users\My Device\Desktop\Question Answering with Tra

AttributeError: _ARRAY_API not found

ImportError: numpy.core.multiarray failed to import

In [12]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.histplot(train_df['context_len'], bins=50, ax=axes[0], color='skyblue')
axes[0].set_title('Context Length Distribution (Words)')
axes[0].set_xlabel('Words')

sns.histplot(train_df['question_len'], bins=30, ax=axes[1], color='salmon')
axes[1].set_title('Question Length Distribution (Words)')
axes[1].set_xlabel('Words')

sns.histplot(train_df['answer_len'], bins=20, ax=axes[2], color='lightgreen')
axes[2].set_title('Answer Length Distribution (Words)')
axes[2].set_xlabel('Words')

plt.tight_layout()
plt.show()

NameError: name 'plt' is not defined
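
The `NameError` follows from the failed matplotlib import in the previous cell, so `plt` was never bound. Until the environment is repaired, the distributions can still be inspected as text. The sketch below uses only the standard library; `text_histogram` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width: int, bar_width: int = 40) -> list[str]:
    """Render integer values as fixed-width bins with '#' bars."""
    # Map each value to the lower edge of its bin and count occurrences.
    bins = Counter((v // bin_width) * bin_width for v in values)
    peak = max(bins.values())
    lines = []
    for start in sorted(bins):
        # Scale each bar relative to the fullest bin; show at least one mark.
        bar = "#" * max(1, round(bins[start] / peak * bar_width))
        lines.append(f"{start:>4}-{start + bin_width - 1:<4} {bar} ({bins[start]})")
    return lines


# Illustrative values only; in the notebook, pass train_df['question_len'].
sample_lengths = [3, 7, 8, 9, 10, 10, 11, 12, 12, 13, 15, 21]
print("\n".join(text_histogram(sample_lengths, bin_width=5)))
```

Empty bins are skipped rather than drawn, which keeps the output short for long-tailed columns such as `context_len`.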

## 4. Topic Analysis

Which article titles (topics) contribute the most questions to the training set?

In [None]:
plt.figure(figsize=(10, 6))
train_df['title'].value_counts()[:15].plot(kind='barh', color='darkblue')
plt.title('Top 15 Topics in SQuAD v1.1')
plt.xlabel('Count')
plt.ylabel('Title')
plt.gca().invert_yaxis()
plt.show()

## 5. Sample Record Inspection

Let's look at two examples: the record with the longest context and the record with the shortest context in the training set.

In [None]:
print("--- Long Context Example ---")
long_sample = train_df.sort_values(by='context_len', ascending=False).iloc[0]
print(f"Title: {long_sample['title']}")
print(f"Context Length: {long_sample['context_len']} words")
print(f"Question: {long_sample['question']}")
print(f"Answer: {long_sample['answers']['text'][0]}")

print("\n--- Short Context Example ---")
short_sample = train_df.sort_values(by='context_len', ascending=True).iloc[0]
print(f"Title: {short_sample['title']}")
print(f"Context Length: {short_sample['context_len']} words")
print(f"Question: {short_sample['question']}")
print(f"Answer: {short_sample['answers']['text'][0]}")
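
Beyond printing the answer text, it can be useful to confirm that each answer really sits at its recorded character offset in the context. In the Hugging Face `squad` dataset the `answers` field holds parallel `text` and `answer_start` lists. The sketch below checks that alignment; `span_matches` is a hypothetical helper, and the record is invented in the SQuAD layout rather than taken from the dataset.

```python
def span_matches(context: str, answers: dict) -> bool:
    """True if every answer text appears at its recorded character offset."""
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical record in the SQuAD v1.1 layout (not taken from the dataset).
record = {
    "context": "The notebook loads SQuAD and reports length statistics.",
    "question": "What does the notebook load?",
    "answers": {"text": ["SQuAD"], "answer_start": [19]},
}
print(span_matches(record["context"], record["answers"]))
```

Applied to `long_sample` or `short_sample`, the same call would be `span_matches(long_sample['context'], long_sample['answers'])`.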

## 6. Save Sample for Reference

In [None]:
# Save one training record as a JSON reference sample
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

sample_record = dataset['train'][0]
sample_path = os.path.join(data_dir, "squad_sample.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_record, f, indent=4)

print(f"Sample saved to {sample_path}")

# Also save the length-distribution figure, if the plotting cell ran
viz_dir = os.path.join(data_dir, "visualizations")
os.makedirs(viz_dir, exist_ok=True)

if 'fig' in globals():
    fig.savefig(os.path.join(viz_dir, "length_distributions.png"), dpi=300, bbox_inches='tight')
    print(f"Visualizations saved to {viz_dir}")
else:
    print("No figure to save: the plotting cell did not run")