# Objective

The objective of the competition is to identify mentions of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to name a dataset. Predictions that more closely match the exact words used to identify the dataset within the publication will score higher.
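
To get a feel for what "more closely match the exact words" can mean, here is a minimal sketch that compares a prediction with a label using word-level Jaccard similarity. This is an illustrative assumption, not the scoring formula stated in this notebook, and the helper name `jaccard` is hypothetical.

```python
def jaccard(prediction: str, label: str) -> float:
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(prediction.lower().split())
    b = set(label.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


# Same words, different case: full overlap
exact = jaccard("National Education Longitudinal Study",
                "national education longitudinal study")

# One word missing from the prediction: partial overlap
partial = jaccard("Education Longitudinal Study",
                  "National Education Longitudinal Study")

print(exact, partial)
```

A prediction that drops or adds words shares fewer tokens with the label, so its similarity falls below 1.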

<img src="https://coleridgeinitiative.org/wp-content/uploads/2021/02/rich-context.png"/>

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from colorama import Fore, Back, Style
import plotly.express as px
import plotly.graph_objects as go

# Setting color palette.
purple_black = [
"#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"
]

# Data Familiarization

In [None]:
# load the meta data
train_csv = pd.read_csv("/kaggle/input/coleridgeinitiative-show-us-the-data/train.csv")
train_csv.head()

In [None]:
print(Fore.BLUE + "Metadata file has {} rows and {} columns".format(train_csv.shape[0],train_csv.shape[1]),Style.RESET_ALL)

### Let's check the data present in the metadata file

In [None]:
# Let's check publication ID
# check the no. of unique publications present in the metadata
print(Fore.BLUE + "No. of Unique Publications:",train_csv.Id.nunique(),Style.RESET_ALL)

In [None]:
# Let's check total no. of rows present in the train.csv
print(Fore.BLUE +"Total no. of rows in the metadata file:",train_csv.shape[0],Style.RESET_ALL)

# train.csv has 19661 rows but only 14316 unique publications, so some publications appear in
# several rows, most likely because they refer to more than one dataset
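
The gap between row count and unique publication count can be inspected directly. The sketch below uses a small hypothetical frame (`toy`) with the same `Id` and `dataset_label` columns as train.csv; on the real data, `train_csv` would take its place.

```python
import pandas as pd

# Hypothetical rows mimicking train.csv: one row per (publication, dataset label) pair
toy = pd.DataFrame({
    "Id": ["pub-a", "pub-a", "pub-b", "pub-c", "pub-c", "pub-c"],
    "dataset_label": [
        "ADNI",
        "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
        "ADNI",
        "NELS",
        "ADNI",
        "NELS:88",
    ],
})

# Number of rows per publication, keeping only publications listed more than once
rows_per_pub = toy.groupby("Id").size()
multi = rows_per_pub[rows_per_pub > 1]
print(multi)

# Rows beyond the first one for each publication
extra_rows = len(toy) - toy["Id"].nunique()
print("Extra rows:", extra_rows)
```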

In [None]:
# No. of unique publications titles
print(Fore.BLUE +"No. of unique publication titles:",train_csv.pub_title.nunique(),Style.RESET_ALL)

# There are only 14271 unique titles for 14316 unique publications, which means a small number
# of publications share a title
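
To see which titles are shared, one option is to count distinct publication Ids per title. This is a sketch on hypothetical data (`toy`); the column names `Id` and `pub_title` match train.csv.

```python
import pandas as pd

# Hypothetical rows: "pub-a" appears twice (two datasets), and "pub-b" reuses its title
toy = pd.DataFrame({
    "Id": ["pub-a", "pub-a", "pub-b", "pub-c"],
    "pub_title": ["Reading outcomes", "Reading outcomes",
                  "Reading outcomes", "Brain imaging"],
})

# Count distinct Ids per title, so repeated rows of one publication are not
# mistaken for a shared title
ids_per_title = toy.groupby("pub_title")["Id"].nunique()
shared_titles = ids_per_title[ids_per_title > 1]
print(shared_titles)
```

Counting distinct Ids rather than rows matters here because a publication that mentions several datasets already repeats its own title.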

In [None]:
# No. of unique dataset titles(title of the dataset that is mentioned within the publication)
print(Fore.BLUE +"No. of unique dataset titles:",train_csv.dataset_title.nunique(),Style.RESET_ALL)

In [None]:
# No. of unique dataset labels(a portion of the text that indicates the dataset) in the metadata 
print(Fore.BLUE +"No. of unique Labels in the meta:",train_csv.dataset_label.nunique(),Style.RESET_ALL)

There are only 45 unique dataset titles but 130 unique labels, which suggests that publications use several variants of each dataset title. Let's verify.

In [None]:
# unique titles used
count = train_csv.dataset_title.value_counts()

fig = go.Figure(data=[go.Table(
  columnwidth = [0.25, 2, 0.5],
  header=dict(
    values=["<b>Rank</b>", "<b>Dataset Title</b>", "<b>Mentions</b>"],
    line_color='darkslategray',
    fill_color="green",
    align='center',
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=np.array([np.array((str(i+1), "<i>" + x + "</i>", "<b>" + str(y) + "</b>", )) for i, (x, y) in enumerate(zip(count.index, count.values))]).T,
    line_color='darkslategray',
    # 2-D list of colors for alternating rows
    fill_color = [["white","lavender"]*25],
    align = 'center',
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.update_layout(
    title={"text": "<b>Datasets Titles Mentions Counts</b>",
           "x": 0.5,
           "xanchor":"center",
           "font_size": 22},
    margin={"r":20, "l":20})

fig.show()

In [None]:
# Let's Visualize top 20 of the titles used 

fig = px.pie(count,
             values=count.values[:20],
             names=count.index[:20],
             color_discrete_sequence=purple_black,
             hole=.4,title="Top 20 Titles")
fig.update_traces(textinfo='percent', pull=0.05)
fig.show()

In [None]:
# unique labels used
count = train_csv.dataset_label.value_counts()

fig = go.Figure(data=[go.Table(
  columnwidth = [0.25, 2, 0.5],
  header=dict(
    values=["<b>Rank</b>", "<b>Dataset Labels</b>", "<b>Mentions</b>"],
    line_color='darkslategray',
    fill_color="green",
    align='center',
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=np.array([np.array((str(i+1), "<i>" + x + "</i>", "<b>" + str(y) + "</b>", )) for i, (x, y) in enumerate(zip(count.index, count.values))]).T,
    line_color='darkslategray',
    # 2-D list of colors for alternating rows
    fill_color = [["white","lavender"] * ((len(count) + 1) // 2)],  # enough colors for every row
    align = 'center',
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.update_layout(
    title={"text": "<b>Datasets Labels Mentions Counts</b>",
           "x": 0.5,
           "xanchor":"center",
           "font_size": 22},
    margin={"r":20, "l":20})

fig.show()

In [None]:
# Let's Visualize top 20 of the labels used 

fig = px.pie(count,
             values=count.values[:20],
             names=count.index[:20],
             color_discrete_sequence=purple_black,
             hole=.4,title="Top 20 Labels")
fig.update_traces(textinfo='percent', pull=0.05)
fig.show()

From the two results above, we can confirm that publications use different variants of the dataset titles.
For example, "ADNI" and "Alzheimer's Disease Neuroimaging Initiative (ADNI)" are used interchangeably.
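
The same check can be done in tabular form by listing the label variants recorded for each title. This is a sketch on hypothetical rows (`toy`); with the real data, `train_csv` would be grouped the same way.

```python
import pandas as pd

# Hypothetical rows using the dataset_title and dataset_label columns of train.csv
toy = pd.DataFrame({
    "dataset_title": [
        "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
        "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
        "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
        "National Education Longitudinal Study",
    ],
    "dataset_label": [
        "ADNI",
        "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
        "ADNI",
        "National Education Longitudinal Study",
    ],
})

# For each title, collect the distinct labels that point to it
variants = toy.groupby("dataset_title")["dataset_label"].agg(lambda s: sorted(set(s)))

for title, labels in variants.items():
    print(title, "->", labels)
```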

# Sanity Check

In [None]:
import os
path = os.walk("../input/coleridgeinitiative-show-us-the-data/train")

json_list = []

for _,_,files in path:
    for file in files:
        #names.append(file[:-5])
        json_list.append(file)

print(Fore.BLUE + "No. of Json Files in the training folder:", len(json_list),Style.RESET_ALL)
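
Counting files is only half of the check. A slightly stronger test is that every publication Id has its own JSON file. The sketch below assumes files are named `<Id>.json` and uses a temporary folder with hypothetical Ids; `missing_publications` is an illustrative helper, not part of the notebook.

```python
import tempfile
from pathlib import Path


def missing_publications(ids, folder):
    """Return the Ids that have no matching <Id>.json file in folder."""
    stems = {p.stem for p in Path(folder).glob("*.json")}
    return sorted(set(ids) - stems)


# Build a throwaway folder holding two of the three expected files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pub-a", "pub-b"):
        (Path(tmp) / f"{name}.json").write_text("[]")
    missing = missing_publications(["pub-a", "pub-b", "pub-c"], tmp)

print(missing)
```

On the real data, the call would take `train_csv.Id.unique()` and the train folder path; an empty list means every publication has a file.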

In [None]:
# Take the first publication listed in train.csv and load its JSON file from the train folder
import json

# Files are named <Id>.json, so build the path from the Id of the first row
# instead of relying on the arbitrary order returned by os.walk
first_id = train_csv.loc[0, "Id"]
json_path = os.path.join("../input/coleridgeinitiative-show-us-the-data/train", first_id + ".json")

# json.load returns the publication as a list of sections,
# each a dictionary with "section_title" and "text" keys
with open(json_path) as f:
    data = json.load(f)

# Print the title of the first section only
print(Fore.GREEN + "First Section Title", Style.RESET_ALL)
print(data[0]["section_title"])

# "data" now holds the publication with Id d0fa7568-7d8e-4db9-870f-f9c6f668c17b
# Check whether its dataset title, "National Education Longitudinal Study", appears in the publication text

title = train_csv.loc[0, "dataset_title"]
for section in data:
    if title in section["text"]:
        print(Fore.BLUE + "Title '{}' is present in the given publication".format(title), Style.RESET_ALL)
        break
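
The check above is case-sensitive and stops at the first hit. As a possible variation, the sketch below lists every section that mentions a phrase, ignoring case. The publication (`toy_pub`) is hypothetical but follows the same structure of sections with `section_title` and `text` keys, and `sections_mentioning` is an illustrative helper.

```python
def sections_mentioning(sections, phrase):
    """Return the titles of all sections whose text contains phrase, ignoring case."""
    needle = phrase.lower()
    return [s["section_title"] for s in sections if needle in s["text"].lower()]


# Hypothetical publication with two sections
toy_pub = [
    {"section_title": "Introduction",
     "text": "We study dropout rates among high school students."},
    {"section_title": "Data",
     "text": "We use the national education longitudinal study of 1988."},
]

hits = sections_mentioning(toy_pub, "National Education Longitudinal Study")
print(hits)
```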

# NER using spaCy

### spaCy supports the following entity types
<img src = "https://miro.medium.com/max/875/1*qQggIPMugLcy-ndJ8X_aAA.png"/>

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

One of the nice things about spaCy is that we only need to call `nlp` once: the whole processing pipeline runs in the background and returns a fully annotated `Doc` object.

### Entity

In [None]:
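# Join the section texts instead of using str(data), which would also feed
# the dictionary keys and Python quoting to the model
doc = nlp(" ".join(section["text"] for section in data)) # "data" still holds the publication we loaded earlier in this notebook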
print([(X.text, X.label_) for X in doc.ents[0:20]])

All the entities seem to have been tagged correctly!

### Token Level
In the example above we worked at the entity level. In the following example,
we look at token-level entity annotation: each token carries an IOB tag (`ent_iob_`), a simpler relative of the BILUO scheme, that marks the entity boundaries.
<img src = "https://miro.medium.com/max/875/1*_sYTlDj2p_p-pcSRK25h-Q.png">

In [None]:
print([(X, X.ent_iob_, X.ent_type_) for X in doc[:20]])

"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

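To make the tags concrete, here is a minimal sketch that groups `(token, IOB tag, entity type)` triples back into whole entities. It runs on hand-written, hypothetical tags rather than real spaCy output, and `iob_to_spans` is an illustrative helper, not a spaCy function.

```python
def iob_to_spans(tagged):
    """Group (token, iob, ent_type) triples into (entity_text, ent_type) pairs."""
    spans, current, current_type = [], [], None
    for token, iob, ent_type in tagged:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts here: close the open one, if any
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], ent_type
        elif iob == "I":
            # Continue the open entity
            current.append(token)
        else:
            # "O" or "" ends any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hypothetical token-level tags in the same shape as (X, X.ent_iob_, X.ent_type_)
tagged = [
    ("The", "O", ""),
    ("National", "B", "ORG"),
    ("Science", "I", "ORG"),
    ("Foundation", "I", "ORG"),
    ("funded", "O", ""),
    ("ADNI", "B", "ORG"),
    ("in", "O", ""),
    ("2004", "B", "DATE"),
]

spans = iob_to_spans(tagged)
print(spans)
```

Every "B" opens an entity, each following "I" extends it, and anything else closes it, which is how `doc.ents` relates to the per-token tags.
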
In [None]:
print("There are {} entities in the publication".format(len(doc.ents)))

labels = [x.label_ for x in doc.ents]
print("\nThese entities are represented by {} unique labels".format(len(Counter(labels))))

print("\nFollowing is the list of unique labels:\n")
print(Counter(labels))

In [None]:
print("Following are the 3 most common entities")
items = [x.text for x in doc.ents]
Counter(items).most_common(3)

In [None]:
# Use displacy.render to highlight the named entities of the first section inline
displacy.render(nlp(data[0]["text"]), jupyter=True, style='ent')

In [None]:
# Using spaCy's built-in displaCy visualizer, show the dependency parse of the first 20 tokens of the publication
displacy.render(nlp(doc[0:20].text), style='dep', jupyter = True, options = {'distance': 120})

Next, we extract the verbatim text, part-of-speech tag, and lemma of each of the first 100 tokens of this publication, skipping stop words and punctuation.

In [None]:
[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(doc[0:100]))
 if not x.is_stop and x.pos_ != 'PUNCT']

In [None]:
dict([(str(x), x.label_) for x in nlp(str(doc[0:200])).ents])

In [None]:
print([(x, x.ent_iob_, x.ent_type_) for x in doc[0:200]])

# Thank you all for your upvotes :) Please check my [NER MODEL](https://www.kaggle.com/jagdmir/spacy-ner-model)  on model building for this competition