# Basic EDA and my first impressions

- Given the text of a scientific paper, the goal is to extract (predict) the names of the datasets mentioned in it.
- `train.csv` contains 3 columns about the dataset:
    - dataset_title: the official dataset name
    - dataset_label: the dataset name exactly as it appears in the text
    - cleaned_label: the dataset name obtained by applying the `clean_text` operation to dataset_label (see [competition page](https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/overview/evaluation)). All ground-truth texts have been cleaned for matching purposes.
- The actual train/test texts are provided in JSON format. Each JSON file contains multiple sections as a list, and each section is composed of a title and a text.
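
For reference, the `clean_text` operation mentioned above is, as far as I recall from the competition page, a single regular-expression substitution. The sketch below reproduces it under that assumption; the evaluation page remains the authoritative definition.

```python
import re

def clean_text(txt):
    # Assumed to match the competition's definition: lowercase the text and
    # replace every run of non-alphanumeric characters with a single space.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

print(clean_text("Alzheimer's Disease Neuroimaging Initiative (ADNI)"))
```

Note that a trailing non-alphanumeric character (such as the closing parenthesis here) leaves a trailing space, so stripping the result may be needed before comparing strings.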
    
This is obviously an NLP task. Additionally, I think:
- Extracting a span from the input text is a popular NLP task. SQuAD is a typical dataset often used as a benchmark for the extractive QA task.
    - Huggingface introduction: https://huggingface.co/transformers/task_summary.html#extractive-question-answering
    - Keras example: https://keras.io/examples/nlp/text_extraction_with_bert/
    - For this purpose, I have created a CSV file containing the begin/end positions of each dataset mention. Feel free to use it :).
- There are plenty of texts. Processing all of them with a heavy deep learning model (e.g. a BERT-based model) could result in exceeding the submission time limit.
    - It was officially announced that "the hidden test set has roughly ~8000 publications, many times the size of the public test set".
- The existence of unseen datasets has been announced.
    - It is important to recognize not only known datasets but also the surrounding text patterns that indicate a dataset mention.
    - Some regularization that encourages a model to focus on the surrounding pattern is a possible solution (e.g. dropping part of the exact dataset name from the training texts).
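
As an illustration of the regularization idea in the last bullet, here is a minimal sketch that randomly drops words from a dataset name inside a training text. The function `drop_label_words`, its parameters, and the toy sentence are all hypothetical; this is one possible reading of the idea, not a tested recipe.

```python
import random

def drop_label_words(text, label, drop_prob=0.3, rng=None):
    """Hypothetical augmentation: randomly drop words of `label` inside `text`.

    Returns the augmented text and the shortened label that now appears in it.
    """
    rng = rng or random.Random()
    words = label.split()
    if not words or label not in text:
        return text, label
    kept = [w for w in words if rng.random() >= drop_prob]
    if not kept:
        # Always keep at least one word so that a target span still exists
        kept = [rng.choice(words)]
    new_label = ' '.join(kept)
    return text.replace(label, new_label), new_label

# Toy sentence for illustration only
text = 'We used data from the National Education Longitudinal Study in this paper.'
label = 'National Education Longitudinal Study'
aug_text, aug_label = drop_label_words(text, label, drop_prob=0.5, rng=random.Random(0))
print(aug_label)
print(aug_text)
```

Because the dataset name itself becomes less reliable, a model trained on such texts would have to rely more on the context (e.g. "We used data from the ...") to locate the span.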

In [None]:
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from glob import glob
from tqdm import tqdm
from pandas_profiling import ProfileReport

datadir = '/kaggle/input/coleridgeinitiative-show-us-the-data'

In [None]:
df_train = pd.read_csv(f'{datadir}/train.csv')
df_train.iloc[:10000:1000]

In [None]:
profile = ProfileReport(df_train, title="Pandas Profiling Report")
profile.to_widgets()

# Relation between dataset_title and dataset_label

A single `dataset_title` can appear in several different forms. For example, *Alzheimer's Disease Neuroimaging Initiative (ADNI)* appears in 3 forms, as shown below:

In [None]:
unique_dataset_titles = df_train.dataset_title.value_counts().reset_index()
unique_dataset_titles.columns = ['dataset_title', 'counts']
print('Most frequent dataset_title:', unique_dataset_titles.iloc[0])

sample_ds_title = unique_dataset_titles.loc[0, 'dataset_title']
df = df_train.query('dataset_title == @sample_ds_title')
df = df[['dataset_title', 'dataset_label', 'cleaned_label']]

print(f'\nRetrieved dataset_label/cleaned_label from "{sample_ds_title}"')
df.pivot_table(index='dataset_title', columns=['dataset_label', 'cleaned_label'], aggfunc=len)

In [None]:
df = df_train.groupby('dataset_title').apply(lambda df: df.dataset_label.value_counts())
df = df.reset_index()
df.columns = ['dataset_title', 'dataset_label', 'counts']
df

In [None]:
df_unique_labels = df.groupby('dataset_title').size()
df_unique_labels

For example, *SARS-CoV-2 genome sequence* appears in 17 forms:

In [None]:
# The title associated with the largest number of unique dataset_label values
title = 'SARS-CoV-2 genome sequence'
df.query('dataset_title == @title')

Conversely, a `dataset_label` can be uniquely associated with a `dataset_title`

In [None]:
df = df_train.groupby('dataset_label').apply(lambda df: df.dataset_title.value_counts())
df = df.reset_index()
df.columns = ['dataset_label', 'dataset_title', 'counts']
df.sort_values('dataset_title')
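
The one-to-one direction from label to title can also be checked programmatically rather than by reading the table. The sketch below uses a small toy frame as a stand-in for `df_train` (the toy values are made up for illustration); in the notebook, the same two lines would be applied to `df_train` itself.

```python
import pandas as pd

# Toy stand-in for df_train; in the notebook, use df_train itself
toy = pd.DataFrame({
    'dataset_title': ['Title A', 'Title A', 'Title B'],
    'dataset_label': ['A', 'A (long form)', 'B'],
})

# Number of distinct titles behind each label
titles_per_label = toy.groupby('dataset_label')['dataset_title'].nunique()

# Labels mapped to more than one title (empty if the mapping is unique)
ambiguous = titles_per_label[titles_per_label > 1]
print(f'ambiguous labels: {len(ambiguous)}')
```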

# Explore texts

In [None]:
train_paths = glob(f'{datadir}/train/*.json')
train_data = []
for path in train_paths[:100]:
    with open(path, 'r') as f:
        train_data.append(json.load(f))

test_paths = glob(f'{datadir}/test/*.json')
test_data = []
for path in test_paths:
    with open(path, 'r') as f:
        test_data.append(json.load(f))
        
print(f'Train files: {len(train_paths)}')
print(f'Test files: {len(test_paths)}')

A single JSON file is composed of multiple sections. Each section has a title and a text.

In [None]:
sample_sections = train_data[0]
sample_path = train_paths[0]
filename = os.path.basename(sample_path)

with open(sample_path, 'r') as f:
    sections = json.load(f)

print(f'{filename} has {len(sample_sections)} sections:')

for section in sections[:10]:
    title = section['section_title']
    text = section['text']
    print(f'title: {title.ljust(70, " ")}, text: {text[:50]} ...')

Count the number of words in each section to check the scale of the problem.

In [None]:
'''
# This aggregation takes tens of minutes, so I have saved the resulting table and load it below.

rows = []

train_ids = df_train.Id.unique()

for train_id in tqdm(train_ids):
    path = f'{datadir}/train/{train_id}.json'
    with open(path, 'r') as f:
        data = json.load(f)

    for section in data:
        title = section['section_title']
        text = section['text']
        n_words = len(text.split(' '))
        # Collect rows in a list: DataFrame.append is slow and was removed in pandas 2.0
        rows.append({'Id': train_id, 'title': title, 'n_words': n_words})

df_text_train = pd.DataFrame(rows, columns=['Id', 'title', 'n_words'])
'''

df_text_train = pd.read_csv('/kaggle/input/coleridge-initiative-assets/train_text.csv')
df_text_train

- A single publication (i.e. a single JSON file) often has ~40 sections, but those with over 100 sections also exist.
- A large share of sections have ~1000 words, but those with over 1500 words also exist.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
ax1.set_xlabel('N sections per id')
ax1.set_ylabel('counts')
ax1.hist(df_text_train.Id.value_counts(), range=(0, 100), bins=20)

ax2.set_xlabel('N words per section')
ax2.set_ylabel('counts')
ax2.hist(df_text_train.n_words, range=(0, 2000), bins=20)
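
Sections of this length exceed the input limit of a typical BERT-style model (512 tokens for the original BERT), so long sections would need to be split before inference. The following is a minimal sketch of an overlapping sliding window over words; the function name and the default `window_size`/`stride` values are assumptions for illustration, and a word count is only a rough proxy for the subword token count.

```python
def split_into_windows(text, window_size=300, stride=200):
    """Split `text` into overlapping windows of at most `window_size` words."""
    if stride <= 0 or window_size <= 0:
        raise ValueError('window_size and stride must be positive')
    words = text.split()
    windows = []
    for start in range(0, max(len(words), 1), stride):
        chunk = words[start:start + window_size]
        if chunk:
            windows.append(' '.join(chunk))
        # Stop once the current window reaches the end of the text
        if start + window_size >= len(words):
            break
    return windows

# Toy section with 1000 words
section_text = ' '.join(f'w{i}' for i in range(1000))
windows = split_into_windows(section_text)
print(len(windows), [len(w.split()) for w in windows])
```

The overlap (`window_size - stride` words) is there so that a dataset name falling on a window boundary still appears whole in at least one window, provided the name is shorter than the overlap.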

# Format for Extractive QA task

In the original BERT paper, the extractive QA task is solved by predicting the start/end positions of the answer (see details in the [Hugging Face documentation](https://huggingface.co/transformers/task_summary.html#extractive-question-answering) or the [Keras example](https://keras.io/examples/nlp/text_extraction_with_bert/)).
For this purpose, I have converted `train.csv` and the JSON contents into a table that contains the target `dataset_label`, the corresponding section, and the start/end positions.

In [None]:
def find_label_position(row):
    # Each (Id, dataset_label) group is expected to hold exactly one row of train.csv
    if len(row) > 1:
        raise ValueError('Multiple rows detected')
    
    file_id, pub_title, dataset_title, dataset_label, cleaned_label = row.iloc[0].values

    filename = f'{datadir}/train/{file_id}.json'
    with open(filename, 'r') as f:
        sections = json.load(f)
        
    # Record every occurrence of the label as [section index, begin, end (exclusive)]
    data = []
    for i, section in enumerate(sections):
        text = section['text']
        begin = text.find(dataset_label)
        while begin >= 0:
            data.append([i, begin, begin+len(dataset_label)])
            # Continue searching just after the start of the previous match
            begin = text.find(dataset_label, begin+1)
    df = pd.DataFrame(data, columns=['section_id', 'ds_label_begin', 'ds_label_end'])
    return df

tqdm.pandas()
df = df_train.groupby(['Id', 'dataset_label']).progress_apply(find_label_position).reset_index()
df = df[['Id', 'dataset_label', 'section_id', 'ds_label_begin', 'ds_label_end']]
df = df_train.merge(df, on=['Id', 'dataset_label'])
df

In [None]:
sample_row = df.iloc[0]
print(sample_row)

file_id, section_id, label_begin, label_end = sample_row[['Id', 'section_id', 'ds_label_begin', 'ds_label_end']]

path = f'{datadir}/train/{file_id}.json'
with open(path, 'r') as f:
    sections = json.load(f)
    
target_section = sections[section_id]['text']
target_section[label_begin: label_end]
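
Beyond inspecting a single row, it may be worth confirming that every recorded span reproduces its label. The sketch below mirrors the search loop used above on a toy sentence (the helper name `find_spans` and the sentence are made up for illustration) and asserts that slicing with the recorded positions returns the label.

```python
def find_spans(text, label):
    """Return every (begin, end) character span where `label` occurs in `text`."""
    spans = []
    begin = text.find(label)
    while begin >= 0:
        spans.append((begin, begin + len(label)))
        begin = text.find(label, begin + 1)
    return spans

# Toy sentence for illustration only
text = 'ADNI data were used. The ADNI cohort is described elsewhere.'
label = 'ADNI'
spans = find_spans(text, label)

# Every span must reproduce the label exactly when sliced from the text
assert all(text[b:e] == label for b, e in spans)
print(spans)
```

The end position is exclusive, matching Python slicing, which is worth keeping in mind when converting character positions to token positions for a QA model.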

In [None]:
df.to_csv('train_extactive_qa.csv', index=False)