Hi everyone!

The following code extracts all entities from the training-data texts using spaCy's Named Entity Recognition (NER) component. It was designed to produce inputs for a classification model that treats every text as a set of candidate entities, each of which may refer to a dataset.

We label each entity using the 'cleaned_label' column provided by the competition: if cleaned_label is a substring of entity.text.lower(), the entity is labeled 1 (dataset), otherwise 0 (not dataset).
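
To make the rule concrete, here is a minimal sketch on a few made-up entity/label pairs. The helper name `label_entity` and the example strings are hypothetical. The last pair assumes the competition label was cleaned by replacing non-alphanumeric characters with spaces, in which case an entity that still contains an apostrophe would not match.

```python
def label_entity(entity_text, cleaned_label):
    """Return 1 if the cleaned label is a substring of the lowercased entity text, else 0."""
    return int(cleaned_label in entity_text.lower())

# (entity text as returned by NER, cleaned label of the document) - illustrative values only
examples = [
    ('ADNI database', 'adni'),
    ('National Institutes of Health', 'adni'),
    ('the Baltimore Longitudinal Study of Aging', 'baltimore longitudinal study of aging'),
    ("Alzheimer's Disease Neuroimaging Initiative", 'alzheimer s disease neuroimaging initiative'),
]
labels = [label_entity(entity, label) for entity, label in examples]
print(labels)
```

Because the entity is only lowercased and not otherwise cleaned, punctuation inside an entity can turn a true mention into a 0, which is worth keeping in mind when inspecting the labeled output.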

The outputs of the classification model would be entities with their respective probabilities of actually being a dataset. Given a threshold on those probabilities, the predictions could be postprocessed into something like *dataset_1|dataset_2|dataset_3.*
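
As a rough illustration of that postprocessing step, the sketch below assumes the classifier's output is available as `(text_id, phrase, probability)` tuples. The function name `build_prediction_strings`, the tuple layout, the lowercasing of phrases and the 0.5 threshold are all assumptions made for the example, not part of the pipeline in this notebook.

```python
from collections import defaultdict

def build_prediction_strings(scored_entities, threshold=0.5):
    """Map each text id to a '|'-joined string of the entities scoring >= threshold."""
    kept = defaultdict(set)
    for text_id, phrase, proba in scored_entities:
        if proba >= threshold:
            # Lowercase and deduplicate so 'ADNI' and 'adni' count once
            kept[text_id].add(phrase.lower().strip())
    return {text_id: '|'.join(sorted(phrases)) for text_id, phrases in kept.items()}

# Made-up scores, for illustration only
scored = [
    (0, 'ADNI', 0.91),
    (0, 'NIH', 0.12),
    (0, 'adni', 0.80),
    (1, 'Census of Agriculture', 0.77),
    (1, 'Agricultural Resource Management Survey', 0.64),
]
prediction_strings = build_prediction_strings(scored)
print(prediction_strings)
```

Texts with no entity above the threshold are simply absent from the result, so they would need an empty prediction string filled in afterwards.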


Since running NER over all the data at once is quite memory-intensive on a local machine, this process was written as a script that works on chunks of the training set and saves the entities of each chunk to a separate csv file. When the script is run from the console, the chunk delimiters are read from user prompts (the *input()* function) at execution time.
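
The chunk boundaries do not have to be typed by hand. Below is a minimal sketch, assuming a fixed chunk size, of how the (low, high) pairs could be generated. The helper `chunk_bounds` and the sizes used are hypothetical; each pair plays the role of the *low_* and *high_* variables used further down.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (low, high) row positions that cover range(n_rows) in order."""
    for low in range(0, n_rows, chunk_size):
        yield low, min(low + chunk_size, n_rows)

# Example: 12 rows processed 5 at a time
bounds = list(chunk_bounds(12, 5))
print(bounds)

# One output file per chunk, named the same way as in this notebook
filenames = ['ents_%s_%s.csv' % (low_, high_) for low_, high_ in bounds]
print(filenames)
```

Since `high` is exclusive, as in `iloc[low:high]`, consecutive chunks share a boundary value without overlapping.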

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import json
import os
import random
import string
import re

In [None]:
train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
train_files_path = '../input/coleridgeinitiative-show-us-the-data/train'

Credit for the *read_append_return* function goes to [prashansdixit](https://www.kaggle.com/prashansdixit): https://www.kaggle.com/prashansdixit/coleridge-initiative-eda-baseline-model.

In [None]:
def read_append_return(filename, train_files_path = train_files_path, output = 'text'):
    json_path = os.path.join(train_files_path, (filename + '.json'))
    headings = []
    contents = []
    combined = []
    with open(json_path, 'r') as f:
        json_decode = json.load(f)
        for data in json_decode:
            headings.append(data.get('section_title'))
            contents.append(data.get('text'))
            combined.append(data.get('section_title'))
            combined.append(data.get('text'))
    
    all_headings = ' '.join(headings)
    all_contents = ' '.join(contents)
    all_data = '. '.join(combined)
    
    if output == 'text':
        return all_contents
    elif output == 'head':
        return all_headings
    else:
        return all_data


def text_cleaning(text):
    '''
    Converts all text to lower case, removes special characters, emojis and multiple spaces
    text - Sentence that needs to be cleaned
    '''
    text = ''.join([k for k in text if k not in string.punctuation])
    text = re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()
    text = re.sub(' +', ' ', text)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text

In [None]:
train_df['text'] = train_df['Id'].apply(read_append_return)

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
#nlp.max_length = 5000000

# Console execution
# low_ = int(input('low: ')) # Low end of the chunk
# high_ = int(input('high: ')) # High end of the chunk

#Kaggle kernel
low_ = 0
high_ = 5


data = []
for ix in train_df.iloc[low_:high_,:].index:
    df = train_df.loc[ix]
    text = df.text
    true_label = df.cleaned_label
    if len(text) < nlp.max_length:
        data.append((ix, text, true_label))

In [None]:
texts = [x[1] for x in data]
indexes = [x[0] for x in data]
labs = [x[2] for x in data]

Using a spaCy NER pipeline, we extract all entities from the selected texts. The start and end character positions of each entity are stored together with its text and NER label.

In [None]:
print('spacy pipeline')

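orgs = []
# Only the NER component is needed, so the other pipeline components are disabled
pipe = nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
for idx, doc in enumerate(pipe):
    # One row per entity: text index, entity text, character span, NER label, true label
    orgs.extend([[indexes[idx], ent.text, ent.start_char, ent.end_char, ent.label_, labs[idx]]
                 for ent in doc.ents])
    print(idx + 1, end = '\r')  # progress counter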

print(len(orgs),'entities')

Once the entities are extracted, we use the actual target variable (cleaned_label) to label each entity as a dataset (1) or not a dataset (0).

In [None]:
labeled = []
for org in orgs:
    if org[-1] in org[1].lower():
        labeled.append(org + [1])
    else:
        labeled.append(org + [0])

We export the data to a csv file, using the chunk delimiters (variables *low_* and *high_*) to name each file.

In [None]:
df = pd.DataFrame(labeled, columns = ['ix','phrase','start_char',
                                           'end_char','ner_label','dataset_label',
                                           'label_in_phrase'])

print('Entities dataframe shape:', df.shape)

output_file = 'ents_%s_%s.csv' % (low_,high_)
print('Output filename:', output_file)
df.to_csv(output_file)
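
Once several chunks have been processed, the per-chunk files need to be put back together. The following is only a sketch of one way to do it with the standard library. The function name `merge_entity_csvs` is hypothetical, and the demo writes two tiny made-up files to a temporary folder so that it runs on its own.

```python
import csv
import glob
import os
import re
import tempfile

# Matches the naming scheme used above: ents_<low>_<high>.csv
CHUNK_NAME = re.compile(r'ents_(\d+)_(\d+)\.csv$')

def merge_entity_csvs(folder):
    """Read every ents_<low>_<high>.csv in `folder`, ordered by <low>, into one list of row dicts."""
    paths = [p for p in glob.glob(os.path.join(folder, 'ents_*.csv')) if CHUNK_NAME.search(p)]
    paths.sort(key=lambda p: int(CHUNK_NAME.search(p).group(1)))
    rows = []
    for path in paths:
        with open(path, newline='', encoding='utf-8') as f:
            rows.extend(csv.DictReader(f))
    return rows

# Demo on two made-up chunk files (the empty first header mimics the saved DataFrame index)
demo_dir = tempfile.mkdtemp()
for name, phrase in [('ents_5_10.csv', 'NIH'), ('ents_0_5.csv', 'ADNI')]:
    with open(os.path.join(demo_dir, name), 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['', 'ix', 'phrase', 'label_in_phrase'])
        writer.writerow([0, 0, phrase, 1])

merged = merge_entity_csvs(demo_dir)
merged_phrases = [row['phrase'] for row in merged]
print(merged_phrases)
```

With pandas, calling `pd.concat` over `pd.read_csv(path, index_col=0)` for each path would give the same rows as a single DataFrame.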