# EDA

This notebook explores the `qa_{data_split}_data.csv` files (for example `qa_train_data.csv`). The run shown here uses `data_split = 'eval'`.

## 1. Data origin
The data was taken from the **MSMarco QA dataset** (`dev_v2.1.json`).
#### Related Links:
- **Official Webpage**: [MSMarco QA Overview](https://microsoft.github.io/MSMARCO-Question-Answering/)
- **Dataset Page** (accept the terms of use to access the datasets): [MSMarco Datasets](https://microsoft.github.io/msmarco/#qna)

#### Dataset Files:
- **Training Data**: [train_v2.1.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/train_v2.1.json.gz)
- **Development Data**: [dev_v2.1.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/dev_v2.1.json.gz)
- **Evaluation Data**: [eval_v2.1_public.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/eval_v2.1_public.json.gz)

## 2. Data preparation
We extracted the following fields using `make_data.py`:

#### Extracted Fields:
- **`query_id`**: A unique identifier for each question.
- **`question`**: The actual question being asked.
- **`answer`**: The correct answer(s) provided by human annotators.
- **`passages`**: Text passages that contain relevant information to answer the question.
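
As a rough illustration of this schema, the fields could be modelled as a small typed record. This is only a sketch: the class name `QARecord` is hypothetical, and the field types (string `query_id`, list of answers, single passage string) are assumed from the example entry in this notebook rather than taken from `make_data.py`.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class QARecord:
    query_id: str       # unique identifier for the question
    question: str       # the question text
    answer: List[str]   # one or more human-written answers
    passages: str       # supporting passage text


# Hand-written sample in the assumed layout (not a real dataset row).
raw = json.dumps({
    "query_id": "0",
    "question": "What is a corporation?",
    "answer": ["A corporation is a company authorized to act as a single entity."],
    "passages": "A corporation is a company authorized to act as a single entity.",
})

record = QARecord(**json.loads(raw))
print(record.question)
print(sorted(asdict(record)))
```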

#### Data Extraction Process:
- The script `make_data.py` processes the dataset by:
  - Filtering out questions without a valid answer (i.e., those whose answer is `"No Answer Present."`).
  - Matching each question with the passages that contain supporting information.
  - Formatting the output as a structured dataset for training.
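
The steps above can be sketched roughly as follows. This is not the actual `make_data.py`. It is a minimal illustration that assumes the MS MARCO v2.1 JSON is a column-oriented object (keys `query`, `query_id`, `answers`, `passages`, each mapping a row key to a value) and that the relevant passages are those flagged `is_selected == 1`. The function name `extract_records` and the sample data are hypothetical.

```python
import json

NO_ANSWER = "No Answer Present."


def extract_records(raw):
    """Build flat QA records from a column-oriented MS MARCO-style dict (assumed layout)."""
    records = []
    for key, question in raw["query"].items():
        # Step 1: drop the no-answer marker and skip questions left without an answer.
        answers = [a for a in raw["answers"][key] if a and a != NO_ANSWER]
        if not answers:
            continue
        # Step 2 (assumption): treat passages flagged is_selected == 1 as the relevant ones.
        selected = [
            p["passage_text"]
            for p in raw["passages"][key]
            if p.get("is_selected") == 1
        ]
        # Step 3: emit one structured record per question.
        records.append({
            "query_id": raw["query_id"][key],
            "question": question,
            "answer": answers,
            "passages": " ".join(selected),
        })
    return records


# Tiny hand-made sample in the assumed layout (not real dataset rows).
sample = {
    "query": {"0": "what is a corporation?", "1": "an unanswerable question"},
    "query_id": {"0": 0, "1": 1},
    "answers": {
        "0": ["A corporation is a legal entity."],
        "1": [NO_ANSWER],
    },
    "passages": {
        "0": [
            {"is_selected": 1, "passage_text": "A corporation is a legal entity."},
            {"is_selected": 0, "passage_text": "Unrelated text."},
        ],
        "1": [{"is_selected": 0, "passage_text": "Nothing useful here."}],
    },
}

records = extract_records(sample)
print(json.dumps(records, indent=2))
```

Only the first sample question survives the filter, because the second one carries nothing but the no-answer marker.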

#### Example Entry:
```json
{
    "query_id": "0",
    "question": "What is a corporation?",
    "answer": ["A corporation is a company or group of people authorized to act as a single entity and recognized as such in law."],
    "passages": "McDonald's Corporation is one of the most recognizable corporations in the world. A corporation is a company or group of people authorized to act as a single entity (legally a person) and recognized as such in law. Early incorporated entities were established by charter..."
}
```

In [1]:
import pandas as pd

In [9]:
data_split = 'eval'

In [10]:

# Load the dataset for the selected split
df = pd.read_csv(f"qa_{data_split}_data.csv")

# Basic overview: column types and non-null counts.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101092 entries, 0 to 101091
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   query_id  101092 non-null  int64
 1   question  101092 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [11]:
df.head()

   query_id  question
0         0  #ffffff color code
1         1  why did qbe stock drop?
2        10  why did the monroe doctrine originate?
3       100  why do trade winds move in this pattern
4      1000  distance between erie in buffalo new york


In [12]:
print(df.isnull().sum())

query_id    0
question    0
dtype: int64


In [13]:
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

query_id: 101092 unique values
question: 101092 unique values


In [14]:
# Passage length in whitespace-separated words (skipped when the split has no `passages` column)
if 'passages' in df.columns:
    df["passage_length"] = df["passages"].apply(lambda x: len(str(x).split()))
    print("\nPassage Length Stats:")
    print(df["passage_length"].describe())


In [15]:
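# Filter questions that start with "Wh" and end with "?".
# The pattern is case-sensitive, so lowercase questions such as "why did qbe stock drop?" do not match.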
wh_questions = df[df["question"].str.match(r"^Wh.*\?$", na=False)]

# Count the number of such questions
num_wh_questions = len(wh_questions)

print(f"Number of 'Wh' ? questions: {num_wh_questions}")

Number of 'Wh' ? questions: 30


In [16]:
# Filter questions that start with "Wh" (case-sensitive; no trailing "?" required)
wh_questions = df[df["question"].str.match(r"^Wh.*", na=False)]

# Count the number of such questions
num_wh_questions = len(wh_questions)

print(f"Number of 'Wh' questions: {num_wh_questions}")

Number of 'Wh' questions: 110
