# EDA

This notebook explores `qa_train_data.csv`.

## 1. Data origin
The data was taken from the **MSMarco QA dataset** (only from `dev_v2.1.json`).
#### Related Links:
- **Official Webpage**: [MSMarco QA Overview](https://microsoft.github.io/MSMARCO-Question-Answering/)
- **Dataset Page** (you must agree to the terms of use to access the datasets): [MSMarco Datasets](https://microsoft.github.io/msmarco/#qna)

#### Dataset Files:
- **Training Data**: [train_v2.1.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/train_v2.1.json.gz)
- **Development Data**: [dev_v2.1.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/dev_v2.1.json.gz)
- **Evaluation Data**: [eval_v2.1_public.json.gz](https://msmarco.z22.web.core.windows.net/msmarco/eval_v2.1_public.json.gz)

## 2. Data preparation
We extracted the following fields using `make_data.py`:

#### Extracted Fields:
- **`query_id`**: A unique identifier for each question.
- **`question`**: The actual question being asked.
- **`answer`**: The correct answer(s) provided by human annotators, stored as a list.
- **`query_type`**: The question category from the source data (e.g., `DESCRIPTION`, `ENTITY`, `NUMERIC`).
- **`passages`**: Text passages that contain relevant information to answer the question.
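
Once the data is written to CSV, the `answer` list is stored as its string representation (for example `"['Yes, over fertilizing can burn lawn.']"`), so it has to be parsed back before it can be used as a list. A minimal sketch, assuming every value is a valid Python list literal; `df_demo`, `answer_list`, and `n_answers` are illustrative names, and the two rows are copied from this notebook's own output:

```python
import ast

import pandas as pd

# Small stand-in for the real CSV; both rows appear in this notebook's output
df_demo = pd.DataFrame({
    "query_id": [100000, 100001],
    "answer": [
        "['Globally 8,640,000 lightning strikes per day.']",
        "['Yes, over fertilizing can burn lawn.']",
    ],
})

# ast.literal_eval parses Python literals only, so it is safer than eval()
df_demo["answer_list"] = df_demo["answer"].apply(ast.literal_eval)
df_demo["n_answers"] = df_demo["answer_list"].apply(len)

print(df_demo[["query_id", "n_answers"]])
```

`ast.literal_eval` raises `ValueError` or `SyntaxError` on malformed strings, which makes corrupted rows easy to spot.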

#### Data Extraction Process:
- The script `make_data.py` processes the dataset by:
  - Filtering out questions without a valid answer (i.e., `"No Answer Present."`).
  - Matching questions with relevant passages that contain supporting information.
  - Formatting the output as a structured dataset for training.
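
The three steps above can be sketched as follows. This is not the actual `make_data.py`, whose source is not shown here. It assumes the column-oriented layout of the MSMarco v2.1 JSON files (top-level keys such as `query`, `answers`, and `passages`, each mapping a row key to a value) and that supporting passages are flagged with `is_selected == 1`. The function name `build_rows`, the choice to join selected passages with a space, and the decision to skip questions with no selected passage are all assumptions.

```python
NO_ANSWER = "No Answer Present."  # marker named in the text above


def build_rows(raw):
    """Flatten a column-oriented MSMarco-style dict into one row per answered question."""
    rows = []
    for key, question in raw["query"].items():
        answers = raw["answers"][key]
        # Step 1: drop questions without a valid answer
        if not answers or NO_ANSWER in answers:
            continue
        # Step 2: keep passages flagged as supporting (assumed field: is_selected)
        selected = [
            p["passage_text"] for p in raw["passages"][key] if p.get("is_selected") == 1
        ]
        if not selected:
            continue  # assumption: questions without a supporting passage are skipped
        # Step 3: emit one flat record per question
        rows.append({
            "query_id": raw["query_id"][key],
            "question": question,
            "answer": answers,
            "query_type": raw["query_type"][key],
            "passages": " ".join(selected),
        })
    return rows


# Hypothetical two-question input in the assumed layout
raw = {
    "query": {"0": "what is a corporation?", "1": "a question nobody answered"},
    "query_id": {"0": 0, "1": 1},
    "query_type": {"0": "DESCRIPTION", "1": "DESCRIPTION"},
    "answers": {"0": ["A corporation is a company."], "1": [NO_ANSWER]},
    "passages": {
        "0": [
            {"is_selected": 1, "passage_text": "A corporation is a company or group of people."},
            {"is_selected": 0, "passage_text": "Unrelated text."},
        ],
        "1": [{"is_selected": 0, "passage_text": "Nothing relevant."}],
    },
}

rows = build_rows(raw)
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and written out with `to_csv`.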

#### Example Entry:
```json
{
    "query_id": "0",
    "question": "What is a corporation?",
    "answer": ["A corporation is a company or group of people authorized to act as a single entity and recognized as such in law."],
    "passages": "McDonald's Corporation is one of the most recognizable corporations in the world. A corporation is a company or group of people authorized to act as a single entity (legally a person) and recognized as such in law. Early incorporated entities were established by charter..."
}
```

In [16]:
import pandas as pd

In [17]:
data_split = 'dev'

In [21]:

# Load the dataset
df = pd.read_csv("qa_data_55578.csv")

# Basic Overview
print("Dataset Info:")
df.info()  # Prints column types and non-null counts itself; it returns None, so no print() is needed

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55578 entries, 0 to 55577
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   query_id    55578 non-null  int64 
 1   question    55578 non-null  object
 2   answer      55578 non-null  object
 3   query_type  55578 non-null  object
 4   passages    55578 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


In [22]:
df.head()

|   | query_id | question | answer | query_type | passages |
|---|----------|----------|--------|------------|----------|
| 0 | 0 | . what is a corporation? | ['A corporation is a company or group of peopl... | DESCRIPTION | McDonald's Corporation is one of the most reco... |
| 1 | 1 | why did rachel carson write an obligation to e... | ['Rachel Carson writes The Obligation to Endur... | DESCRIPTION | The Obligation to Endure by Rachel Carson Rach... |
| 2 | 10000 | symptoms of a dying mouse | ['The symptoms of a dying mouse are runny eyes... | ENTITY | The symptoms are similar but the mouse will be... |
| 3 | 100000 | average number of lightning strikes per day | ['Globally 8,640,000 lightning strikes per day.'] | NUMERIC | Although many lightning flashes are simply clo... |
| 4 | 100001 | can you burn your lawn with fertilizer | ['Yes, over fertilizing can burn lawn.'] | DESCRIPTION | Fertilizer burn is the result of over fertiliz... |


In [29]:
for row in df.sample(10).itertuples():
    print(row)

Pandas(Index=2519, query_id=12744, question='what are the penalties of closing my hsa', answer="['There are no tax penalties of closing Health Savings Accounts.']", query_type='DESCRIPTION', passages="You will have to pay taxes plus a 10% 20% (I'm pretty sure that's right) fee for withdrawing the money from your HSA. There are no tax penalties only if you pay for medical expenses FROM the account, which is no longer possible since it is closed.. Health Savings Accounts (HSAs) A Health Savings Account (HSA) is a tax-exempt trust or custodial account you set up with a qualified HSA trustee to pay or reimburse certain medical expenses you incur. You must be an eligible individual to qualify for an HSA. No permission or authorization from the IRS is necessary to establish an HSA.", passage_length=104)
Pandas(Index=10249, query_id=23050, question='what is weather in invergordon scotland in july', answer="['The average maximum daytime temperature in Invergordon in February is a cold 5°C (41°

In [24]:
print(df.isnull().sum())

query_id      0
question      0
answer        0
query_type    0
passages      0
dtype: int64


In [25]:
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

query_id: 55578 unique values
question: 55578 unique values
answer: 53641 unique values
query_type: 5 unique values
passages: 55437 unique values
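
These counts mean that 141 rows (55578 − 55437) repeat a passage that already appears in another row, and 1937 rows (55578 − 53641) repeat an answer. A minimal sketch of how such repeats could be inspected; `df_demo` is a made-up three-row frame standing in for `df`, and its passage texts are placeholders:

```python
import pandas as pd

# Toy frame standing in for the real data; the passage texts are invented
df_demo = pd.DataFrame({
    "query_id": [1, 2, 3],
    "passages": ["shared passage", "shared passage", "unique passage"],
})

# keep=False marks every member of a duplicate group, not only the later repeats
dup_mask = df_demo.duplicated(subset="passages", keep=False)

# List the query_ids that share each repeated passage
dup_groups = df_demo[dup_mask].groupby("passages")["query_id"].apply(list)
print(dup_groups)
```

Running the same two lines on `df` would show which questions share a passage, which matters if the data is later split into train and validation sets.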


In [26]:
# Passage length, measured in whitespace-separated words
if 'passages' in df.columns:
    df["passage_length"] = df["passages"].apply(lambda x: len(str(x).split()))
    print("\nPassage Length Stats:")
    print(df["passage_length"].describe())



Passage Length Stats:
count    55578.000000
mean        58.144913
std         27.074754
min          7.000000
25%         42.000000
50%         51.000000
75%         68.000000
max        565.000000
Name: passage_length, dtype: float64


In [27]:
# Filter questions that start with "Wh" and end with "?" (case-sensitive)
wh_questions = df[df["question"].str.match(r"^Wh.*\?$", na=False)]

# Count the number of such questions
num_wh_questions = len(wh_questions)

print(f"Number of 'Wh' ? questions: {num_wh_questions}")

Number of 'Wh' ? questions: 2


In [28]:
# Filter questions that start with "Wh" (case-sensitive, no constraint on the ending)
wh_questions = df[df["question"].str.match(r"^Wh.*", na=False)]

# Count the number of such questions
num_wh_questions = len(wh_questions)

print(f"Number of 'Wh' questions: {num_wh_questions}")

Number of 'Wh' questions: 25
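
The low counts are probably a casing effect: the questions shown in the `df.head()` output are all lowercase, so a pattern anchored on a capital `W` matches very few rows. A minimal sketch of a case-insensitive variant, using the `case` parameter of `Series.str.match`; the `questions` series is a small stand-in built from questions that appear in this notebook:

```python
import pandas as pd

# Stand-in for df["question"]; all four strings appear elsewhere in this notebook
questions = pd.Series([
    "what are the penalties of closing my hsa",
    "What is a corporation?",
    "symptoms of a dying mouse",
    "can you burn your lawn with fertilizer",
])

# str.match anchors at the start of the string, so no leading "^" is required
case_sensitive = questions.str.match(r"Wh", na=False)
case_insensitive = questions.str.match(r"wh", case=False, na=False)

print(f"Case-sensitive 'Wh' matches:   {case_sensitive.sum()}")
print(f"Case-insensitive 'wh' matches: {case_insensitive.sum()}")
```

Some questions also carry leading punctuation (for example `. what is a corporation?`), so stripping it first with `str.lstrip(". ")` may catch a few more.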
