# Simple EDA and preprocessing of DataFrame
## Coleridge Initiative - Show US the Data

In this competition, we are given scientific articles and asked to identify mentions of datasets. 

> The objective of the competition is to identify the mention of datasets within scientific publications. 

Let's just dive in!


**If you find this helpful, please give it an upvote!**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from tqdm.notebook import tqdm
from fastcore.all import *
from collections import Counter

Let's check the directory:

In [None]:
dataset_path = Path('../input/coleridgeinitiative-show-us-the-data')

In [None]:
dataset_path.ls()

In [None]:
(dataset_path/'train').ls()

In [None]:
(dataset_path/'test').ls()

We have `train.csv`, `sample_submission.csv`, and the `train` and `test` folders, which contain JSON files with the text of the scientific articles. Let's check our train CSV file:

In [None]:
train_df = pd.read_csv(dataset_path/'train.csv')

In [None]:
train_df.head()

Basically we have the following columns:

* `Id` - publication id - note that there are multiple rows for some training documents, indicating multiple mentioned datasets
* `pub_title` - title of the publication (a small number of publications have the same title)
* `dataset_title` - the title of the dataset that is mentioned within the publication
* `dataset_label` - a portion of the text that indicates the dataset
* `cleaned_label` - the dataset_label, as passed through the `clean_text` function:
```
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()
```
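
As a quick sanity check, the cleaning function can be run on a couple of strings. The inputs below are made up for illustration and are not rows taken from `train.csv`:

```python
import re

def clean_text(txt):
    # Lowercase, replace each run of non-alphanumeric characters with one space, trim the ends
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

# Hypothetical inputs, chosen only to show the effect of the cleaning
examples = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "  Census of Agriculture, 2012 ",
]
cleaned = [clean_text(e) for e in examples]
for raw, out in zip(examples, cleaned):
    print(repr(raw), '->', repr(out))
```

Note that punctuation is replaced by a space rather than dropped, so an apostrophe splits a word into two tokens.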

Let's see what dataset titles are mentioned in the scientific articles. Note that there can be even more dataset mentions that are not labeled.

In [None]:
Counter(train_df.dataset_title)

In [None]:
print(f'There are {len(Counter(train_df.dataset_title))} dataset titles in this dataset')

These are the dataset titles, but the exact way these datasets are mentioned in the paper is provided in the `dataset_label` column. Additionally, the `dataset_label` column has a simple text cleaning function applied to it, giving us `cleaned_label`.

In [None]:
print(f'There are {len(Counter(train_df.cleaned_label))} ways the datasets are mentioned in the papers')

Let's look at an example JSON file:

In [None]:
example_json = (dataset_path/'train'/(train_df.Id.iloc[0]+'.json')).read_json()

You can inspect the JSON file by printing it in the code cell below:

In [None]:
print(example_json)

Here, I am finding the target `dataset_label` and providing some context text:

In [None]:
label = train_df.iloc[0].dataset_label
for i, section in enumerate(example_json):
    position = section['text'].find(label)
    if position != -1:
        print(f'Found in section {i}, {section["section_title"]}:')
        # Clamp the start at 0: a negative start index would slice from the end of the string
        start = max(0, position - 400)
        print(section['text'][start:position + len(label) + 400])

Here is a function to find the section and position of the `dataset_label`:

In [None]:
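def find_str_in_json(sections, string):
    # Return (section index, character offset) of the first exact match of `string`
    for i, section in enumerate(sections):
        position = section['text'].find(string)
        if position != -1:
            return i, position
    # No section contains the string
    print(f'not found: {string!r}')
    return -1, -1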
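
To make the return values concrete, here is a small sketch that runs the same kind of search on a hand-made list of sections. The toy document and the `find_label` name are made up for illustration; the only assumption carried over from the real files is that each section is a dict with `section_title` and `text` keys.

```python
def find_label(sections, label):
    # Return (section index, character offset) of the first exact match, or (-1, -1)
    for i, section in enumerate(sections):
        position = section['text'].find(label)
        if position != -1:
            return i, position
    return -1, -1

# Toy document, not taken from the competition data
toy_doc = [
    {'section_title': 'Abstract', 'text': 'We study reading outcomes.'},
    {'section_title': 'Data', 'text': 'We use the Example Survey of Schools.'},
]

hit = find_label(toy_doc, 'Example Survey of Schools')
miss = find_label(toy_doc, 'example survey of schools')  # different case, so no match
print(hit, miss)
```

The second call shows that `str.find` is case-sensitive: a label that differs only in capitalization is reported as missing.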

Let's process the whole dataset and add the section and position of the `dataset_label` as new columns.

In [None]:
section_list = []
position_list = []
for i in tqdm(range(len(train_df))):
    current_json = (dataset_path/'train'/(train_df.Id.iloc[i]+'.json')).read_json()
    current_str = train_df.iloc[i].dataset_label
    section, position = find_str_in_json(current_json, current_str)
    # Store as strings so the values can be joined with '|' when rows are merged later
    section_list.append(str(section))
    position_list.append(str(position))
train_df['section'] = section_list
train_df['position'] = position_list

Interestingly, many `dataset_label`s are not found in the JSON text (these rows get a section and position of -1), and I need to investigate this further. Please let me know if you have any ideas why this is happening.
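
One hypothesis worth testing (an assumption, not something verified here) is that the search above is an exact, case-sensitive match, so a label can be missed when the article text differs in capitalization or punctuation. A minimal sketch of a fallback that compares cleaned text on both sides is shown below; `find_cleaned` and the toy document are hypothetical names and data used only for illustration.

```python
import re

def clean_text(txt):
    # Same cleaning as described for `cleaned_label`
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def find_cleaned(sections, label):
    # Hypothetical helper: look for the cleaned label inside the cleaned section text.
    # Returns the section index, or -1 if no section matches.
    target = clean_text(label)
    for i, section in enumerate(sections):
        if target in clean_text(section['text']):
            return i
    return -1

# Toy document, not taken from the competition data
toy_doc = [{'section_title': 'Data',
            'text': 'We use the example-survey of Schools (2010).'}]

exact = toy_doc[0]['text'].find('Example Survey of Schools')   # exact search
fallback = find_cleaned(toy_doc, 'Example Survey of Schools')  # cleaned search
print(exact, fallback)
```

This only reports the section index, because character offsets in the cleaned text do not map directly back to offsets in the original text.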

In [None]:
train_df

Here, I merge the rows with the same `Id` column, which is how it's found in the `sample_submission.csv`:

In [None]:
label_cols = ['dataset_label', 'cleaned_label', 'section', 'position']
# Join the values of each column with '|' for rows that share the same Id
new_train_df = train_df.groupby('Id', sort=False)[label_cols].agg('|'.join)

In [None]:
new_train_df.head()

Before we end, let's quickly take a look at `sample_submission.csv`. Currently, no example prediction strings are provided. We are basically supposed to provide the excerpts of the dataset mentions, separated by the `|` character.

In [None]:
pd.read_csv(dataset_path/'sample_submission.csv')
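
As a rough sketch of the format described above, a list of predicted mentions can be turned into a single `|`-separated string like this. The helper name `make_prediction_string` is hypothetical, and dropping duplicates and empty strings is my own design choice rather than something the competition files specify.

```python
def make_prediction_string(mentions):
    # Hypothetical helper: strip whitespace, drop empty strings and duplicates
    # (keeping first-seen order), then join the rest with '|'
    stripped = (m.strip() for m in mentions)
    unique = list(dict.fromkeys(m for m in stripped if m))
    return '|'.join(unique)

# Made-up predictions for one publication
pred = make_prediction_string(['adni', 'census of agriculture', 'adni', ' '])
print(pred)
```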

The following Jaccard similarity score is used for evaluation:

In [None]:
def jaccard_similarity(s1, s2):
    # Compare token sets so repeated words are not counted twice in the union
    t1 = set(s1.split(" "))
    t2 = set(s2.split(" "))
    intersection = len(t1 & t2)
    union = len(t1) + len(t2) - intersection
    return float(intersection) / union
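
As a worked example, here is a small set-based sketch of the same idea, applied to two made-up strings. The function name `jaccard` and the inputs are illustrative only; the score is the number of shared tokens divided by the number of distinct tokens across both strings.

```python
def jaccard(s1, s2):
    # Token sets: each distinct word counts once on each side
    a = set(s1.split())
    b = set(s2.split())
    union = a | b
    # Guard against two empty strings, where the union would be empty
    return len(a & b) / len(union) if union else 0.0

# 3 shared tokens out of 4 distinct tokens overall
score = jaccard('national education longitudinal study',
                'education longitudinal study')
# Repeated words do not change the score when sets are used
repeat_score = jaccard('adni adni', 'adni')
print(score, repeat_score)
```

Using sets matters when a string repeats a word: with plain list lengths in the union, the repeated word would inflate the denominator and lower the score.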

For now, **we are done.**

If you enjoyed this kernel, please give it an upvote. If you have any questions or suggestions, please leave a comment!