<center><h1> Coleridge Initiative - Show US the Data</h1>
    <h2>📚 EDA + Naïve Submission 📚</h2>

<img src="https://coleridgeinitiative.org/wp-content/uploads/2021/02/rich-context.png"/>
    <p style="text-align:center;">Image <a href="https://coleridgeinitiative.org/coming-soon-a-new-kaggle-competition-featuring-rich-context/">source</a>.</p>
</center>

# Overview

Citing the competition's hosts:
> The objective of the competition is to identify the mention of datasets within scientific publications.

Thus, given a set of publications, for which we'll have access to information such as their titles, section titles, and text bodies, we'll have to extract the short excerpts that appear to mention a dataset. Such work would greatly benefit data sharing and availability. One could quickly search for publications that use a given dataset in order to gain insights on it or, in the opposite direction, quickly figure out what kind of data is being used in a set of unexplored publications on a specific topic of interest. Consequently, publication processing could be automated and research work greatly accelerated.

In this notebook, we'll run an Exploratory Data Analysis of the training data provided for this competition, then build and run a naïve solution that performs dataset title string matching (against a set of known dataset titles) to predict whether or not a given publication cites a given dataset.


### Outline:

1. [Setup and Basic EDA](#head-1)  
  1.1. [Dataset Title VS Dataset Label](#head-1-1)  
  1.2. [Datasets Popularity](#head-1-2)  
  1.3. [Datasets Occurrence Together](#head-1-3)
2. [Wordcloud of the Articles Titles](#head-2)  
  2.1. [Wordclouds of Article Titles by Dataset](#head-2-1)
3. [Loading JSON Contents into a Pandas DataFrame](#head-3)  
4. [A Naïve Dataset Title Matching Submission](#head-4)  

# 1. Setup and Basic EDA <a class="anchor" id="head-1"></a>

In [None]:
import os
import re
import json
import glob
from collections import defaultdict

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

os.listdir('/kaggle/input/coleridgeinitiative-show-us-the-data')

We are provided with 4 main pieces of data:
* `train.csv`: The CSV file containing all the metadata of the publications, such as their title and the dataset they utilize.
* `train`: The directory containing the actual publications that are referenced in `train.csv` in JSON format.
* `test`: The directory containing the actual publications that will be used for testing purposes (thus, with no ground truth CSV file available).
* `sample_submission.csv`: The CSV file containing all the publication IDs in the test set, for which we'll have to populate the prediction column.
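To make the JSON side of this concrete, here is a minimal sketch that parses a small, invented publication in memory. It assumes each file holds a list of sections with `section_title` and `text` keys; the `text` field is the one used later in this notebook, while the sample content and the `sample_json` name are made up for illustration.

```python
import json

# Invented stand-in for the contents of one publication file (assumed layout).
sample_json = """
[
  {"section_title": "Abstract", "text": "We study reading outcomes."},
  {"section_title": "Data", "text": "We use the National Education Longitudinal Study."}
]
"""

# Each publication is parsed into a list of section dictionaries.
sections = json.loads(sample_json)

# Join all section bodies into a single string for the publication.
full_text = "\n".join(section["text"] for section in sections)

print(len(sections), "sections")
print(full_text)
```

Joining the section bodies into one string is the same idea the submission code relies on when it searches a whole publication for known dataset names.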

In [None]:
train = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
train

The training data contains 19,661 rows, with 5 columns describing each row.

In [None]:
train.info()

That's great! There are no missing values, and the dataset looks complete.

In [None]:
for col in train.columns:
    print(f"{col}: {len(train[col].unique())}")

It looks like there are only **14,316** unique IDs in the dataset, meaning that some publications mention multiple datasets. Also, notice that the unique count of `pub_title` is slightly smaller than that of `Id`. This points to the presence of several cases where 2 separate publications, each with a unique ID, share the exact same title.

Also, there are a total of **45** unique `dataset_title` values and 130 unique `dataset_label` values, meaning that a single dataset can have multiple labels across different publications.
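One way to look at the titles shared by several publication IDs is a `groupby` on the title. The sketch below runs on a tiny invented frame (`toy` is a hypothetical stand-in for `train`, with made-up values), so it only illustrates the idea.

```python
import pandas as pd

# Toy stand-in for `train` (values are invented).
toy = pd.DataFrame({
    "Id": ["a1", "a1", "b2", "c3"],
    "pub_title": ["Title X", "Title X", "Title X", "Title Y"],
    "dataset_title": ["D1", "D2", "D1", "D1"],
})

# Number of distinct publication IDs behind each title.
ids_per_title = toy.groupby("pub_title")["Id"].nunique()

# Titles shared by more than one publication ID.
shared_titles = ids_per_title[ids_per_title > 1]
print(shared_titles)
```

Here "Title X" is reported because it belongs to two different IDs, even though one of those IDs appears on two rows.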

## 1.1. Dataset Title VS Dataset Label <a class="anchor" id="head-1-1"></a>

In [None]:
print("Printing below the dataset titles that have multiple dataset labels associated with them:\n")
datasets_titles_unique = train["dataset_title"].unique()
for dataset_title in datasets_titles_unique:
    if len(train[train["dataset_title"] == dataset_title]["dataset_label"].unique()) > 1:
        print(f"'{dataset_title}':", list(train[train["dataset_title"] == dataset_title]["dataset_label"].unique()), "\n")
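The same check can also be written with a single `groupby`, which avoids filtering the frame once per title. This is a sketch on invented data (`toy` and its values are hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for `train` (values are invented).
toy = pd.DataFrame({
    "dataset_title": ["Survey A", "Survey A", "Survey A", "Study B"],
    "dataset_label": ["Survey A", "SA", "SA", "Study B"],
})

# Distinct labels seen for each dataset title.
labels_per_title = toy.groupby("dataset_title")["dataset_label"].unique()

# Keep only the titles that have more than one label.
multi_label = labels_per_title[labels_per_title.map(len) > 1]

for title, labels in multi_label.items():
    print(f"'{title}':", list(labels))
```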

## 1.2. Datasets Popularity <a class="anchor" id="head-1-2"></a>

In [None]:
dataset_titles_counts = train['dataset_title'].value_counts()

fig = go.Figure(data=[go.Table(
  columnwidth = [0.25, 2, 0.5],
  header=dict(
    values=["<b>Rank</b>", "<b>Dataset Title</b>", "<b>Mentions</b>"],
    line_color='darkslategray',
    fill_color="royalblue",
    align='center',
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=np.array([np.array((str(i+1), "<i>" + x + "</i>", "<b>" + str(y) + "</b>", )) for i, (x, y) in enumerate(zip(dataset_titles_counts.index, dataset_titles_counts.values))]).T,
    line_color='darkslategray',
    # 2-D list of colors for alternating rows
    fill_color = [["white","lavender"]*25],
    align = 'center',
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.update_layout(
    title={"text": "<b>Datasets Titles Mentions Counts</b>",
           "x": 0.5,
           "xanchor":"center",
           "font_size": 22},
    margin={"r":20, "l":20})

fig.show()

## 1.3. Datasets Occurrence Together <a class="anchor" id="head-1-3"></a>

In [None]:
multi_mentions = [] # list of tuples (Id, num_unique_mentions) 

for Id in train.Id.unique():
    num_unique_mentions = len(train[train['Id'] == Id]['dataset_title'].unique())
    if num_unique_mentions > 1:
        multi_mentions.append((Id, num_unique_mentions))

print(f"There are {len(multi_mentions)} publications in the training set that mention more than 1 dataset. That is {len(multi_mentions) / len(train.Id.unique()) * 100:.2f}% of the publications.")
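For larger frames, the per-ID filtering above can be slow. A possible vectorized variant, sketched here on invented data (`toy` is a hypothetical stand-in for `train`), counts distinct dataset titles per publication in one pass.

```python
import pandas as pd

# Toy stand-in for `train` (values are invented).
toy = pd.DataFrame({
    "Id": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "dataset_title": ["D1", "D2", "D1", "D2", "D2", "D3"],
})

# Distinct dataset titles mentioned by each publication.
titles_per_pub = toy.groupby("Id")["dataset_title"].nunique()

# Publications that mention more than one dataset, and their share of the total.
multi = titles_per_pub[titles_per_pub > 1]
share = len(multi) / len(titles_per_pub) * 100

print(f"{len(multi)} of {len(titles_per_pub)} publications ({share:.2f}%) mention more than 1 dataset.")
```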

In [None]:
co_mentions = defaultdict(int)

for Id, num_mentions in multi_mentions:
    co_mentions[tuple(sorted(train[train['Id'] == Id]["dataset_title"].unique()))] += 1

print(f"There are {len(co_mentions)} unique sets of co-occurrences of dataset mentions.")
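The co-mention counting can be sketched in a compact form with `groupby` and `collections.Counter`. The frame below is invented (`toy` is a hypothetical stand-in for `train`); sorting the titles inside each tuple is what makes two publications with the same datasets map to the same key.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for `train` (values are invented).
toy = pd.DataFrame({
    "Id": ["p1", "p1", "p2", "p2", "p3", "p4", "p4"],
    "dataset_title": ["D2", "D1", "D1", "D2", "D1", "D1", "D3"],
})

# One sorted tuple of distinct dataset titles per publication.
title_sets = toy.groupby("Id")["dataset_title"].apply(lambda s: tuple(sorted(set(s))))

# Keep only publications with more than one dataset and count identical sets.
co_mention_counts = Counter(t for t in title_sets if len(t) > 1)

print(co_mention_counts.most_common())
```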

In [None]:
co_mentions = dict(sorted(co_mentions.items(), key=lambda item: item[1], reverse=True))

fig = go.Figure(data=[go.Table(
  columnwidth = [0.25, 2, 0.5],
  header=dict(
    values=["<b>Rank</b>", "<b>Datasets Co-Mentions Sets</b>", "<b>Occurrences</b>"],
    line_color='darkslategray',
    fill_color="royalblue",
    align='center',
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=np.array([np.array((str(i+1), "<i>" + str(k) + "</i>", "<b>" + str(v) + "</b>", )) for i, (k, v) in enumerate(co_mentions.items())]).T,
    line_color='darkslategray',
    fill_color = [["white","lavender"]*60],
    align = 'center',
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.update_layout(
    title={"text": "<b>Datasets Mentioned in the same Publications</b>",
           "x": 0.5,
           "xanchor":"center",
           "font_size": 22},
    margin={"r":20, "l":20})

fig.show()

# 2. Wordcloud of the Articles Titles <a class="anchor" id="head-2"></a>

In [None]:
from wordcloud import WordCloud, STOPWORDS

words_in_titles = list(train.pub_title.str.split(expand=True).stack())

wordcloud = WordCloud(stopwords = STOPWORDS,
                      background_color = "white",
                      width = 3000,
                      height = 2000
                     ).generate(' '.join(words_in_titles))
plt.figure(1, figsize = (18, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
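A word cloud is essentially a picture of word frequencies, so a plain count gives a rough numeric companion to the image. The sketch below uses invented titles and a tiny hypothetical stop word list rather than the `STOPWORDS` set from the `wordcloud` package, and its tokenization is simpler than what `WordCloud` does internally.

```python
from collections import Counter

# Invented publication titles, for illustration only.
titles = [
    "Effects of Education on Health",
    "Education and Income in the United States",
    "Health Outcomes of Rural Education",
]

# Tiny hypothetical stop word list (not the one shipped with `wordcloud`).
stopwords = {"the", "of", "and", "in", "a"}

# Lowercase, split on whitespace, and drop stop words.
words = [
    word.lower()
    for title in titles
    for word in title.split()
    if word.lower() not in stopwords
]

top_words = Counter(words).most_common(2)
print(top_words)
```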

## 2.1. Wordclouds of Article Titles by Dataset Mentions <a class="anchor" id="head-2-1"></a>

In [None]:
words_in_titles_by_dataset = defaultdict(list)

# Grouping the words of the publication titles by the dataset they mention,
# in order to visualize each set of words independently
for _, row in train.iterrows():
    words_in_titles_by_dataset[row['dataset_title']].extend(row['pub_title'].split())

# Defining our word cloud drawing function
def wordcloud_draw(data, color = 'white'):
    wordcloud = WordCloud(stopwords = STOPWORDS,
                          background_color = color,
                          width = 3000,
                          height = 2000
                         ).generate(' '.join(data))
    plt.figure(1, figsize = (12, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

for dataset_title in train['dataset_title'].unique():
    print("Wordcloud for publications mentioning", dataset_title, ":")
    wordcloud_draw(words_in_titles_by_dataset[dataset_title])

# 3. Loading JSON Contents into a Pandas DataFrame <a class="anchor" id="head-3"></a>

In [None]:
# Gathering the files paths
train_files = glob.glob("../input/coleridgeinitiative-show-us-the-data/train/*.json")
test_files = glob.glob("../input/coleridgeinitiative-show-us-the-data/test/*.json")

In [None]:
# Generate the training publications dataframe
# Collect one frame per file and concatenate once, instead of copying a growing dataframe on every iteration
train_frames = []

for train_file in train_files:
    file_data = pd.read_json(train_file)
    file_data.insert(0,'pub_id', train_file.split('/')[-1].split('.')[0])
    train_frames.append(file_data)

df_train_publications = pd.concat(train_frames)

df_train_publications.to_csv("df_train_publications.csv",index=False)

df_train_publications

In [None]:
# Generate the testing publications dataframe
# Collect one frame per file and concatenate once, instead of copying a growing dataframe on every iteration
test_frames = []

for test_file in test_files:
    file_data = pd.read_json(test_file)
    file_data.insert(0,'pub_id', test_file.split('/')[-1].split('.')[0])
    test_frames.append(file_data)

df_test_publications = pd.concat(test_frames)

df_test_publications.to_csv("df_test_publications.csv",index=False)

df_test_publications

# 4. A Naïve Dataset Title Matching Submission <a class="anchor" id="head-4"></a>

Obviously, the end goal of such a competition is not to simply string-match known dataset names in order to detect mentions of datasets in publications. Rather, it is to build a strong enough NLP model that can infer from context whether or not a piece of text in a publication refers to the usage of a dataset.

That being said, below we will implement a very simple string matching technique against known dataset names, as a POC and a template for building a submission. Such a technique is rather useless when applied to publications mentioning datasets that are absent from our "known datasets" list, which is the case for the majority of the hidden test set of this competition.
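The limitation described above can be seen on a toy example. In this sketch, `find_known_datasets`, the one-item `known_titles` list, and both sentences are hypothetical and exist only to show that verbatim matching finds a known name but misses a dataset it has never seen.

```python
# Hypothetical "known datasets" list with a single entry.
known_titles = ["national education longitudinal study"]


def find_known_datasets(text, titles):
    """Return the known titles that occur verbatim (case-insensitively) in text."""
    text = text.lower()
    return [title for title in titles if title.lower() in text]


# A sentence mentioning a known dataset, and one mentioning an unknown dataset.
seen = "We use the National Education Longitudinal Study for our analysis."
unseen = "Our data come from the Imaginary Household Panel."

print(find_known_datasets(seen, known_titles))
print(find_known_datasets(unseen, known_titles))
```

The second call returns an empty list even though the sentence clearly describes a data source, which is exactly the case a context-aware model is meant to handle.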

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()
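To see what this normalization does, the sketch below repeats the function and applies it to two example strings (the strings are chosen for illustration only). Lowercasing happens first, then every run of non-alphanumeric characters collapses into a single space.

```python
import re


def clean_text(txt):
    # Same normalization as in the notebook cell above.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()


examples = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "  COVID-19   Open Research Dataset ",
]

cleaned = [clean_text(example) for example in examples]
for text in cleaned:
    print(repr(text))
```

Note that the apostrophe and the hyphen become spaces, so "Alzheimer's" turns into "alzheimer s" and "COVID-19" into "covid 19".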

In [None]:
submission_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv', index_col=0)

In [None]:
submission_df

In [None]:
submission_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv', index_col=0)
datasets_titles = [x.lower() for x in set(train['dataset_title'].unique()).union(set(train['dataset_label'].unique()))]

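labels = []
for index in submission_df.index:
    # Concatenate all sections of the publication into one lowercase string
    publication_text = df_test_publications[df_test_publications['pub_id'] == index].text.str.cat(sep='\n').lower()
    label = []
    for dataset_title in datasets_titles:
        if dataset_title in publication_text:
            cleaned_title = clean_text(dataset_title)
            # Different raw titles can clean to the same string; keep each prediction once
            if cleaned_title not in label:
                label.append(cleaned_title)
    labels.append('|'.join(label))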

submission_df['PredictionString'] = labels

submission_df.to_csv('submission.csv')

submission_df

# This notebook is under development 🚧

---

## Please upvote if you found it useful 😊