## Coleridge Initiative - Show US the Data
### Discover how data is used for the public good

This competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence through data is critical if government is to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, child hunger, increasing food production, maintaining biodiversity, and many other challenges. Yet much of the information about data necessary to inform evidence and science is locked inside publications.



The objective of the competition is to identify the mention of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to note a dataset. Predictions that more accurately match the precise words used to identify the dataset within the publication will score higher. Predictions should be cleaned using the clean_text function from the Evaluation page to ensure proper matching.
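
To make "more accurately match the precise words" concrete, here is a minimal sketch of a word-level Jaccard similarity between a prediction and a label. Using Jaccard as the overlap measure is an assumption made for illustration (the Evaluation page is the authoritative definition of the score), and the `jaccard` helper is a hypothetical name. `clean_text` is the same function this notebook defines later.

```python
import re

def clean_text(txt):
    # Same cleaning rule as the clean_text function used later in this notebook.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def jaccard(a, b):
    """Word-level Jaccard similarity between two strings after cleaning."""
    set_a = set(clean_text(a).split())
    set_b = set(clean_text(b).split())
    if not set_a and not set_b:
        return 0.0  # two empty strings: treat as no overlap
    return len(set_a & set_b) / len(set_a | set_b)

# An exact match after cleaning, and a prediction that misses one word.
exact = jaccard("National Education Longitudinal Study",
                "national education longitudinal study")
partial = jaccard("Education Longitudinal Study",
                  "National Education Longitudinal Study")
print(exact, partial)
```

Under this measure, a prediction that drops or adds words shares fewer words with the label and scores lower than an exact match.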

Publications are provided in JSON format, broken up into sections with section titles.

The goal in this competition is not just to match known dataset strings but to generalize to datasets that have never been seen before using NLP and statistical techniques. A percentage of the public test set publications are drawn from the training set - not all datasets have been identified in train, so these unidentified datasets have been used as a portion of the public test labels. These should serve as guides for the difficult task of labeling the private test set.

Note that the hidden test set has roughly 8000 publications, many times the size of the public test set. Plan your compute time accordingly.


#### The task in this competition is to find the dataset_label strings mentioned anywhere in the full text of an academic paper.

## 1. Import Required Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

import os
import re
import json
import glob
from collections import defaultdict
from textblob import TextBlob
from functools import partial

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

import nltk
import spacy
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
nlp.max_length = 4000000
from nltk.probability import FreqDist
from wordcloud import WordCloud, STOPWORDS

from tqdm.autonotebook import tqdm
import string

%matplotlib inline

os.listdir('../input/coleridgeinitiative-show-us-the-data')

In [None]:
import random
random_seed = 42
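def seed_all(seed=random_seed):
    # Seed NumPy and Python's random module for reproducible runs.
    # Note: PYTHONHASHSEED is read at interpreter startup, so setting it here
    # only affects child processes, not the hash seed of this running kernel.
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
seed_all()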

## 2. Explore the data 

## Data

#### Files
1. train - the full text of the training set's publications in JSON format, broken into sections with section titles
2. test - the full text of the test set's publications in JSON format, broken into sections with section titles
3. train.csv - labels and metadata for the training set
4. sample_submission.csv - a sample submission file in the correct format

#### Columns
1. id - publication id - note that there are multiple rows for some training documents, indicating multiple mentioned datasets
2. pub_title - title of the publication (a small number of publications have the same title)
3. dataset_title - the title of the dataset that is mentioned within the publication
4. dataset_label - a portion of the text that indicates the dataset
5. cleaned_label - the dataset_label, as passed through the clean_text function from the Evaluation page

In [None]:
train = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/train.csv")
sample = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/sample_submission.csv")
train.head()

In [None]:
train.nunique()

In [None]:
train.shape

In [None]:
train.describe().T

1. **Id (14316 unique)**: The id of an academic paper. The train folder contains a file named id + ".json", which holds the full text of the paper.

2. **pub_title (14271 unique)**: The title of the publication.

3. **dataset_title (45 unique)**: The title of the dataset mentioned in the publication.

4. **dataset_label (130 unique)**: Predict this. The part of the text that indicates the dataset, i.e. the name used by the author of the paper. There are more unique labels than dataset titles, apparently because authors use abbreviations and other variants of a dataset's name.

5. **cleaned_label (130 unique)**: The dataset_label after it has passed through the clean_text function from the Evaluation page (lowercased, with non-alphanumeric characters replaced by spaces). The submission should be in this form.

#### There are 19661 rows in all, but there are many duplicates (we will explore this further in the following section).

In [None]:
sample

In [None]:
sample.describe().T

**Id**: The test folder contains a file named id + ".json", which holds the full text of the paper.

**PredictionString**: As we will see later, the predicted dataset_label values are listed here. If you think there is more than one, join them with "|".

### View the train Data 

In [None]:
train.head()

In [None]:
train_path = "../input/coleridgeinitiative-show-us-the-data/train"
test_path = "../input/coleridgeinitiative-show-us-the-data/test"


In [None]:
all_train_path = [os.path.join(train_path,s) + ".json" for s in train["Id"]]
all_test_path = [os.path.join(test_path,s) + ".json" for s in sample["Id"]]

In [None]:
json_path = all_train_path[0]
json_path

In [None]:
with open(json_path, 'r') as f:
    json_decode = json.load(f)

In [None]:
json_decode[:1]

### On closer inspection, the file is a list of dictionaries, each holding a section title and its text, so we can create a DataFrame from it

In [None]:
jsontest = pd.DataFrame(json_decode)
jsontest

### The file above contains each section_title and the section's contents in text. It is easy to picture if you imagine the layout of an academic paper.

### Let us concatenate the section titles and texts into one full-text string as follows

In [None]:
texts = ""

for a in jsontest.values:
    texts += a[0] +" "+ a[1] + " "

In [None]:
texts[:300]

In [None]:
json_path = all_train_path[0]
with open(json_path, 'r') as f:
    json_decode = json.load(f)

jsontest = pd.DataFrame(json_decode)

texts = ""

for a in jsontest.values:
    texts += a[0] +" "+ a[1] +" "
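
The steps above can also be wrapped in a small helper. This is a minimal sketch, assuming every section is a dictionary with the keys `section_title` and `text` (inferred from the DataFrame columns shown above). The names `sections_to_text` and `read_publication` are hypothetical, and unlike the loop above the result has no trailing space.

```python
import json

def sections_to_text(sections):
    """Join each section's title and body into one full-text string."""
    return " ".join(f"{s['section_title']} {s['text']}" for s in sections)

def read_publication(json_path):
    """Read one publication JSON file and return its concatenated text."""
    with open(json_path, "r") as f:
        return sections_to_text(json.load(f))

# Toy example with two made-up sections.
example = [
    {"section_title": "Abstract", "text": "We study X."},
    {"section_title": "Data", "text": "We use survey Y."},
]
print(sections_to_text(example))
```

Building the string with `" ".join(...)` avoids repeated string concatenation inside a loop and skips the intermediate DataFrame.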

### Let us run this for the whole train dataset with a for loop

In [None]:
%%time

alltexts = []

for json_path in tqdm(all_train_path):

    with open(json_path, 'r') as f:
        json_decode = json.load(f)
    jsontest = pd.DataFrame(json_decode)

    texts = ""

    for a in jsontest.values:
        texts += a[0] +" "+ a[1] + " "
        
    alltexts.append(texts)

In [None]:
train["text"] = alltexts
train
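
Because train has 19661 rows but only 14316 unique Ids, the loop above reads some files more than once. The following is a sketch of one way to load each publication a single time and map the text back onto every row. `texts_for_ids` and `fake_loader` are hypothetical names; in the notebook the loader would read the JSON file for the given Id.

```python
import pandas as pd

def texts_for_ids(ids, loader):
    """Call loader once per unique id and return the texts aligned with ids."""
    ids = pd.Series(ids)
    cache = {i: loader(i) for i in ids.unique()}
    return ids.map(cache)

calls = []

def fake_loader(pub_id):
    # Stand-in for reading "<train_path>/<pub_id>.json" from disk.
    calls.append(pub_id)
    return f"text of {pub_id}"

demo = pd.DataFrame({"Id": ["a", "b", "a", "a"]})
demo["text"] = texts_for_ids(demo["Id"], fake_loader).values
print(len(calls), "loads for", len(demo), "rows")
```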

In [None]:
Idgroup = train.groupby("Id")["dataset_label"].count().reset_index()
Idgroup.columns = ["Id","count"]
Idgroup = Idgroup.sort_values("count").reset_index(drop=True)
Idgroup

### The Id with the most rows

In [None]:
mostId = train[train["Id"] == Idgroup["Id"].iloc[-1]]
mostId.head()

#### In the data above, the same Id has different dataset_title and dataset_label values.
### When the PredictionString of a submission holds multiple dataset_labels, they are joined by "|", so let us try building one for this case

In [None]:
mostIdlist = mostId["cleaned_label"].to_list()
mostIdlist


In [None]:
mostIdlist = ("|").join(mostIdlist)
mostIdlist
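
The same joining can be done for every Id at once with a groupby. This is a minimal sketch on a made-up frame; removing duplicates and sorting the labels before joining is a choice made here for readability, not something the competition requires, and `join_labels` is a hypothetical helper.

```python
import pandas as pd

demo = pd.DataFrame({
    "Id": ["p1", "p1", "p2", "p1"],
    "cleaned_label": [
        "adni",
        "alzheimer s disease neuroimaging initiative adni",
        "trends in international mathematics and science study",
        "adni",  # duplicate row for the same paper
    ],
})

def join_labels(labels):
    """Join the unique labels of one paper with '|' in sorted order."""
    return "|".join(sorted(set(labels)))

prediction_strings = (
    demo.groupby("Id")["cleaned_label"]
        .agg(join_labels)
        .reset_index(name="PredictionString")
)
print(prediction_strings)
```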

#### The result above is very long, but this is how it is written in PredictionString and submitted. Let us see how to handle this

In [None]:
sample

### Several papers may share the same publication title, and the same paper may appear in several rows.

#### Hypothesis 1: the dataset_label and cleaned_label strings in train appear verbatim in the full text of the corresponding paper. Let us check this on the train data.
#### It looks like some of the test data is not covered by this.

#### dataset_label is defined in the evaluation. It becomes cleaned_label when it passes through the following function:

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()
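
As a quick illustration of what the function above does, the sketch below runs it on two made-up label strings: every run of non-alphanumeric characters becomes a single space, letters are lowercased, and leading or trailing spaces are stripped.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

raw_labels = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "COVID-19  Open Research Dataset ",
]
cleaned = [clean_text(s) for s in raw_labels]
for raw, out in zip(raw_labels, cleaned):
    print(repr(raw), "->", repr(out))
```

Note that the apostrophe in "Alzheimer's" turns into a space, so the cleaned label contains a standalone "s".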

In [None]:
check = []
for a in range(len(train)):
    
    if clean_text(train["dataset_label"].iloc[a]) == train["cleaned_label"].iloc[a]:
        check.append(1)
    else:
        check.append(0)

In [None]:
np.sum(check)/len(train)

This should ideally be 1, so let us look at the rows that do not match.

In [None]:
train["check"] = check
checkdf = train[train["check"]==0]
checkdf.head(3)

In [None]:
checkdf["dataset_label"].unique()

In [None]:
checkdf["cleaned_label"].unique()

In [None]:
clean_text(checkdf["dataset_label"].iloc[2])

In [None]:
cleanlabel = []
for a in tqdm(train["cleaned_label"]):
    # Drop a single trailing space; endswith also handles an empty string safely.
    if a.endswith(" "):
        cleanlabel.append(a[:-1])
    else:
        cleanlabel.append(a)

In [None]:
check = []
for a in range(len(train)):
    
    if clean_text(train["dataset_label"].iloc[a]) == cleanlabel[a]:
        check.append(1)
    else:
        check.append(0)

In [None]:
np.sum(check)/len(train)

Now it is correct, so we replace the column with the fixed labels.

In [None]:
train["cleaned_label"]=cleanlabel
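
The trailing-space fix and the re-check above can also be written with vectorized pandas operations. This is a sketch on a made-up two-row frame, not the notebook's data. Note that `str.strip()` removes whitespace on both sides, which is slightly broader than dropping one trailing space.

```python
import re
import pandas as pd

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

demo = pd.DataFrame({
    "dataset_label": ["ADNI", "Baccalaureate and Beyond"],
    # The second cleaned_label carries a trailing space, as seen in train.
    "cleaned_label": ["adni", "baccalaureate and beyond "],
})

expected = demo["dataset_label"].map(clean_text)
before = (expected == demo["cleaned_label"]).mean()
after = (expected == demo["cleaned_label"].str.strip()).mean()
print(before, after)
```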

In [None]:
train["text"] = [clean_text(s) for s in tqdm(train["text"])]

In [None]:
dslabel = [clean_text(s) for s in train["dataset_label"].unique()]

List of unique dataset_label

In [None]:
len(dslabel)

In [None]:
labeljudge = []
all_labels = []
label_len = []

for a in tqdm(train["text"]):
    labels = []
    for b in dslabel:
        if b in a:
            labels.append(clean_text(b))
            break
    if len(labels)==0:
        labeljudge.append(0)
    else:
        labeljudge.append(1)
    
    #all_labels.append("|".join(labels))
    #label_len.append(len(labels))

In [None]:
np.sum(labeljudge)/len(train)
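
One caveat of the plain `b in a` test is that it also matches inside longer words. A possible refinement, sketched below, is to pad both the text and the label with spaces so that only whole-word sequences match. This assumes that text and labels have both been passed through clean_text, so words are separated by single spaces. `find_labels` is a hypothetical helper and not part of the original approach.

```python
def find_labels(text, labels):
    """Return the labels that occur in text as whole-word sequences."""
    padded = f" {text} "
    return [label for label in labels if f" {label} " in padded]

# "adni" is a substring of the made-up word "madnix" but not a whole word.
noisy_text = "data come from the madnix registry"
clean_hit = "we used the adni cohort"

print("adni" in noisy_text)                # plain substring test
print(find_labels(noisy_text, ["adni"]))   # whole-word test
print(find_labels(clean_hit, ["adni"]))
```

Whether this helps depends on the labels: short acronyms are the most likely to appear inside unrelated words.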

## Test Data and Submission

### Based on the results above, if a test paper's text contains a known label, we clean it the same way as dataset_label and submit it. Multiple items are joined with "|".

In [None]:
sample

In [None]:
alltexts = []

for json_path in (all_test_path):

    with open(json_path, 'r') as f:
        json_decode = json.load(f)
    jsontest = pd.DataFrame(json_decode)

    texts = ""

    for a in jsontest.values:
        texts += a[0] + " " + a[1] + " "
        
    alltexts.append(texts)

In [None]:
sample["text"] = alltexts

In [None]:
sample

### Clean the paper text

In [None]:
sample["text"] = [clean_text(s) for s in tqdm(sample["text"])]

### If a label from dslabel appears in the text, extract it and join the matches with "|". We also add dataset_title to dslabel, as it will increase the score.

In [None]:
print(len(dslabel))
dstitle = [clean_text(s) for s in train["dataset_title"].unique()]
dslabel = set(dslabel + dstitle) 
len(dslabel)

In [None]:
labeljudge = []
all_labels = []
label_len = []

for a in tqdm(sample["text"]):
    labels = []
    for b in dslabel:
        if b in a:
            labels.append(clean_text(b))
            
    if len(labels)==0:
        labeljudge.append(0)
    else:
        labeljudge.append(1)
    
    all_labels.append("|".join(labels))
    label_len.append(len(labels))

In [None]:
sample["PredictionString"] = all_labels
sample

In [None]:
sample["PredictionString"].iloc[3]

In [None]:
sample = sample[["Id","PredictionString"]]

In [None]:
sample.to_csv("submission.csv",index=False)

In [None]:
sample

# Credits to the many Kagglers who have spent time understanding the problem and solving it

# I am an absolute beginner learning from the great work people have done. Please comment, and if you like it, upvote and share it further. I will definitely try to improve this further.

