# Introduction

* Problem
* Solution
* NER extraction model
    * Ready dataset for annotation
    * Building and training model
    * Testing the model
* Applying the NER model to the search engine results for each question and saving the output
* Conclusion
    

# Problem
The entire world has been affected by COVID-19, and experts are doing their best to solve this global crisis. Thousands of research articles are published every single day, and keeping up with such a large pool of articles is impossible; great articles often fail to draw attention. For this reason Kaggle, as a global data science and machine learning community, hosts this challenge. In round 2, we have to summarize articles in a tabular format so that researchers and experts can get a bird's-eye view of the articles and, more importantly, filter this large pool.

# Solution
Research papers vary enormously, and no single pattern can summarize them all into a tabular format. Regex is useful, but because of this diversity its usefulness is very limited: patterns break often, and a regex engine built to cover every case would not be efficient. We therefore need a more advanced technology. We will use NLP (Natural Language Processing) for the labels that regex cannot handle. Some columns are static, such as Date, Study title and Journal; their data is already available in the metadata dataset. For the study design, we will use hand-coded keyword lists.
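
As a rough illustration of the keyword idea, the sketch below tags a text with every study design whose keywords appear in it. The shortened keyword dictionary, the `tag_study_design` name and the example sentence are assumptions made for this illustration; the notebook's own keyword lists and extraction function are defined further down and work on dataframe columns.

```python
import re

# Hypothetical, shortened keyword lists; the notebook defines fuller ones later.
STUDY_DESIGN_KEYWORDS = {
    "systematic_review_and_meta_analysis": ["systematic review", "meta-analysis"],
    "retrospective_observational_study": ["retrospective", "chart review"],
    "simulation": ["mathematical model", "simulation"],
}

def tag_study_design(text, keywords=STUDY_DESIGN_KEYWORDS):
    """Return the sorted design tags whose keywords occur in the text."""
    tags = set()
    for tag, synonyms in keywords.items():
        # re.escape keeps characters such as '-' literal inside the pattern.
        pattern = "|".join(re.escape(s) for s in synonyms)
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.add(tag)
    return sorted(tags)

example = "A retrospective chart review and meta-analysis of 120 patients."
print(tag_study_design(example))
```

An article can match several designs at once, which is why the sketch returns a list rather than a single tag.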

Named Entity Recognition (NER) is one of the important branches of NLP and is very widely used in the NLP industry. 
> Named Entity Recognition, also known as entity extraction classifies named entities that are present in a text into pre-defined categories like “individuals”, “companies”, “places”, “organization”, “cities”, “dates”, “product terminologies” etc. It adds a wealth of semantic knowledge to your content and helps you to promptly understand the subject of any given text.

![](http://imanage.com/wp-content/uploads/2014/10/NER1.png)

That's exactly what we need: to extract data from raw text (articles) into pre-defined categories like ```Population```, ```Challenge``` and ```Solutions```. For this project, I have built custom NER models for each task that extract the necessary information for the given labels and store it in a table. NER is a de facto solution for this type of problem. A regex may fail to match on new articles, whereas NER is a machine learning model that takes context into account and can therefore perform well on new research articles. 
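
To make the target concrete, here is a minimal sketch of what one annotated training example looks like in the `(text, {"entities": [...]})` form used later in this notebook. The sentence, the `span` helper and the label names `SAMPLE_SIZE` and `POPULATION` are invented for illustration; the real labels come from the annotation project.

```python
# A made-up sentence and hand-picked spans, purely for illustration.
text = "We surveyed 250 homeless adults in Boston during March 2020."

def span(text, phrase, label):
    """Locate phrase in text and return (start, end, label) with an exclusive end."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

# Hypothetical labels; they are not the labels of the trained model.
entities = [
    span(text, "250", "SAMPLE_SIZE"),
    span(text, "homeless adults", "POPULATION"),
]
training_example = (text, {"entities": entities})

# Each span can be checked by slicing the text with its offsets.
for start, end, label in entities:
    print(label, "->", text[start:end])
```

Slicing the text with the stored offsets is a cheap sanity check before training, since a span that is off by one character silently teaches the model the wrong boundary.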

# Custom NER model building for articles summary
This is the main section of this notebook. Here we extract all the labels needed for the summary tables of this task. The section is divided into subsections, each with a detailed explanation. First, we import all the provided hand-labelled target table datasets and combine them into one dataframe. Then we get the static columns from the metadata dataset. After that, we use a portion of that dataframe to build a custom annotation dataset. Finally, we train our NER model with the spaCy NLP library and use it to extract the labels for our target tables.

### Ready our necessary datasets
This section imports the necessary libraries and the provided datasets and does some initial filtering. We combine the provided hand-labelled datasets and import the metadata dataframe. 

In [None]:
from __future__ import unicode_literals, print_function

import numpy as np 
import pandas as pd 
import json
import glob
import itertools
import logging
import re

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from tqdm import tqdm, tqdm_notebook

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# We will use provided dataset for training and testing our ner model.
# It is easily possible to replace current testing data with search 
# engine articles for each question and get summary tables. 


path = r'../input/CORD-19-research-challenge/Kaggle/target_tables/1_population/'
all_files = glob.glob(path + "/*.csv")

temp_df = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    temp_df.append(df)

df_all_provided_summary_tables = pd.concat(temp_df, axis=0, ignore_index=True)
# df_all_provided_summary_tables is used to train and test our NER model. 
# it contains all the summary tables curated by experts 

In [None]:
# import metadata file where all the covid19 research paper metadata stored
df_metadata = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv',
                          low_memory=False)
# some rows contain multiple entries for the location of an article's files;
# we keep only the first one
df_metadata.pdf_json_files = df_metadata.pdf_json_files.apply(
    lambda x: x.split(';')[0] if pd.notnull(x) else x)
df_metadata.pmc_json_files = df_metadata.pmc_json_files.apply(
    lambda x: x.split(';')[0] if pd.notnull(x) else x)

### Define all necessary functions 
This is a very important section: it defines all the functions this notebook needs to build the solution efficiently. Documentation is available inside each function. Some functions may be intentionally hidden to make the notebook more readable; feel free to unhide them. 
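
One detail of the annotation converter defined in this section is easy to miss: Dataturks stores an inclusive end offset, while spaCy expects an exclusive one. The sketch below walks through that conversion on a single record. The record content and the `POPULATION` label are made up; only the field names (`content`, `annotation`, `points`, `label`, `start`, `end`) mirror what `convert_dataturks_to_spacy` reads.

```python
import json

# One made-up Dataturks-style line; content and label are invented.
line = json.dumps({
    "content": "Patients aged over 65 were enrolled.",
    "annotation": [
        {"label": ["POPULATION"], "points": [{"start": 0, "end": 20}]}
    ],
})

record = json.loads(line)
text = record["content"]
entities = []
for annotation in record["annotation"]:
    point = annotation["points"][0]
    for label in annotation["label"]:
        # Inclusive end -> exclusive end, so text[start:end] yields the span.
        entities.append((point["start"], point["end"] + 1, label))

print(entities)
print(text[entities[0][0]:entities[0][1]])
```

Without the `+ 1`, every entity would lose its last character.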

In [None]:
def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    """
    we have annotated our text dataset for the ner model.
    we use dataturks.com as our annotation tool and spacy 
    as the ner model library. 
    this function converts the annotation json file to
    the spacy training input format.
    source: https://dataturks.com/help/dataturks-ner-json-to-spacy-train.php
    """
    
    try:
        training_data = []
        lines=[]
        with open(dataturks_JSON_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    #dataturks indices are both inclusive [start, end], but spacy's end is exclusive [start, end)
                    entities.append((point['start'], point['end'] + 1 ,label))


            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        print(str(e))
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None

In [None]:
def get_raw_articles_by_title(provided_titles, metadata,
                articles_base_location='../input/CORD-19-research-challenge/'):
    '''
    get raw articles by title
    
    this function filters the metadata dataframe by the provided titles.
    it then reads the json file of each article and extracts the
    methods section, the results section and the entire body text.
    it returns the static columns from the metadata dataframe together
    with the methods, results and body_text read from the json files
    
    provided_titles: list of str
    metadata: dataframe
    articles_base_location: str, folder that contains the json files
    
    return: dataframe
        'Study', 'doi', 'Date', 'journal', 'url', 'abstract',
        'methods', 'results', 'body_text'
    '''
    methods = ['methods','method','statistical methods','materials',
               'materials and methods','data collection','the study',
               'study design','experimental design','objective',
               'objectives','procedures','data collection and analysis',
               'methodology','material and methods','the model',
               'experimental procedures','main text']
    
    # work on a copy so the column assignments below do not touch `metadata`
    metadata_filtered = metadata.loc[metadata.title.isin(provided_titles)].copy()
    
    # replace empty pdf_json_files column with pmc_json_files column value
    metadata_filtered['pdf_json_files'] = \
    metadata_filtered.pdf_json_files.fillna(metadata_filtered.pmc_json_files)
    # drop those rows that doesn't have location of articles 
    metadata_filtered = metadata_filtered.dropna(subset=['pdf_json_files'])
    # create articles location for reading articles
    metadata_filtered['articles_location'] = articles_base_location \
    + metadata_filtered['pdf_json_files']
    
    metadata_filtered['body_text'] = '' # create a column for articles body text
    metadata_filtered['methods'] = ''
    metadata_filtered['results'] = ''
    # fill body_text column
    for index, row in metadata_filtered.iterrows():
        temp_body_text = ''
        temp_methods = ''
        temp_results = ''
        with open(row['articles_location']) as file:
            content = json.load(file)
            for entry in content['body_text']:
                temp_body_text = temp_body_text + entry['text']
            # Methods
            for entry in content['body_text']:
                section_title = ''.join(
                    x.lower() for x in entry['section'] \
                    if x.isalpha()) #remove numbers and spaces
                if any(m in section_title for m in [''.join(
                    x.lower() for x in m \
                    if x.isalpha()) for m in methods]) : 
                    temp_methods = temp_methods + entry['text']
            # Results
            results_synonyms = ['result', 'results']
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] \
                                        if x.isalpha())
                if any(r in section_title for r in results_synonyms) :
                    temp_results = temp_results + entry['text']
                    
        metadata_filtered.at[index, 'body_text'] = temp_body_text
        metadata_filtered.at[index, 'methods'] = temp_methods
        metadata_filtered.at[index, 'results'] = temp_results
        
    metadata_filtered = metadata_filtered.rename(
        columns={'title': 'Study', 'publish_time': 'Date'})
    return metadata_filtered[['Study', 'doi', 'Date',
                              'journal', 'url', 'abstract',
                              'methods', 'results', 'body_text']]




def preprocess_articles(raw_articles_dataframe):
    '''
    clean and combine article text for the ner model
    
    this function fills a missing abstract with the first 1500
    characters of the body text, combines the title, abstract, methods
    and results sections into one column and then cleans that column.
    
    raw_articles_dataframe: dataframe
        this dataframe should contain the article Study (title),
        abstract, methods, results and body_text columns.
        the ideal input is the return value of the
        get_raw_articles_by_title() function
    
    return: dataframe
        the input dataframe with an extra
        'shorten_full_article_text' column
    '''
    raw_articles_dataframe['abstract'] = \
    raw_articles_dataframe['abstract']\
    .fillna(raw_articles_dataframe.body_text.str[:1500])
    
    raw_articles_dataframe['shorten_full_article_text'] = \
    raw_articles_dataframe['Study'] \
    + "\n\n" + raw_articles_dataframe['abstract'] \
    + "\n\n" + raw_articles_dataframe['methods'] \
    + "\n\n" + raw_articles_dataframe['results']
    
    
    # remove (...) and [...] together with the text between the brackets
    # regex=True is explicit because newer pandas versions default to literal matching
    raw_articles_dataframe['shorten_full_article_text'] = \
    raw_articles_dataframe['shorten_full_article_text']\
    .str.replace(r"\s*([\(\[]).*?([\)\]])", "", regex=True).str.strip()
    
    # remove all urls from text
    raw_articles_dataframe['shorten_full_article_text'] = \
    raw_articles_dataframe['shorten_full_article_text']\
    .str.replace(r"http\S+|www.\S+", "", regex=True).str.strip()
    
    # remove trailing single digits 1-7 that are not part of a larger number
    raw_articles_dataframe['shorten_full_article_text'] = \
    raw_articles_dataframe['shorten_full_article_text']\
    .str.replace(r"(?<!\d)[1-7]\b", "", regex=True).str.strip()
    
    
    
    

    
    return raw_articles_dataframe


def generate_articles_for_annotation(processed_articles_dataframe):
    '''
    this function generates text files for annotation.
    the input is the dataframe returned by preprocess_articles();
    one .txt file is written to the working directory for each row
    
    note: hold out some rows of processed_articles_dataframe before passing
    it to this function so you can use them later for testing the model
    
    '''
    temp = pd.DataFrame(columns=['articles'])
    temp['articles'] = processed_articles_dataframe['shorten_full_article_text']
    temp = temp.dropna()
    
    temp = temp.reset_index(drop=True)
    
    file = './{}.txt'
    for i, row in temp.iterrows():
        with open(file.format(str(i)), 'w') as f:
            f.write(str(row['articles']))
            
    return "TEXT SAVE TO WORKING DIRECTORY"

def training_ner_model(training_data, model=None, output_dir='./', n_iter=100):
    
    '''
    this function builds and trains a spacy ner model
    (written against the spaCy 2.x training API)
    '''
    
    TRAIN_DATA = training_data.copy()
    print('Training started...')
    # load the model, set up the pipeline and train the entity recognizer
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in tqdm_notebook(range(n_iter)):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
    print('Training completed.')
    return nlp


# function for extracting data field values
def _ner_apply(text, model):
    '''
    helper used inside pandas apply.
    runs the trained ner model on one text and returns
    the extracted entities with their labels
    '''
    # pass our text to spacy; it returns a spacy Doc
    doc = model(text)
    # return a list of tuples that looks like 
    # [('four-year', 'EDUCATION_YEARS'), ('college or university', 'SCHOOL_TYPE')]
    return [(ent.text, ent.label_) for ent in doc.ents]

def ner_extraction(model, processed_articles):
    
    """
    return the full extracted dataset.
    
    the ner model returns a list of (text, label) tuples per article.
    this function stores each label in its own column and returns
    a fresh summary table. if a label occurs more than once in an
    article, the last occurrence is kept.
    """
    temp = processed_articles.copy()
    temp = temp.reset_index(drop=True)
    # apply the model passed as argument and store the result in a new column 
    temp['temp_entity'] = temp['shorten_full_article_text'].apply(lambda x: _ner_apply(x, model))


    # flatten to [text, label, row index] and sort by label so groupby can group them
    flatter = sorted([list(x) + [idx] for idx, y in enumerate(temp['temp_entity']) 
                      for x in y], key = lambda x: x[1]) 

    # find all of the values that will go in each label column
    for key, group in itertools.groupby(flatter, lambda x: x[1]):
        list_of_vals = [(val, idx) for val, _, idx in group]

        # add each value at the appropriate row index and label column
        for val, idx in list_of_vals:
            temp.loc[idx, key] = val
    return temp

In [None]:
# below are the keyword lists for each study design 
# we will use these keywords to identify the study design


systematic_review_and_meta_analysis = ["systematic review",
                                       "meta-analysis", "electronic search",
 "pooled odds ratio", "Cohen\'s d", "difference in means",
 "difference between means", "d-pooled", "pooled adjusted odds ratio",
 "pooled OR", "pooled AOR", "pooled risk ratio", "pooled RR",
 "pooled relative risk", "Cochrane review", "PRISMA", "protocol",
 "registry", "search string", "search criteria", "search strategy",
 "eligibility criteria", "inclusion criteria", "exclusion criteria",
 "interrater reliability", "cohen\'s kappa", "databases searched",
 "risk of bias", "heterogeneity", "i2", "publication bias"]

prospective_observational_study = ["prospective", "followed up",
                                   "baseline characteristics",
                                   "lost to follow-up", 
                                   "number of patients potentially",
                                   "eligible", "examined for eligibility",
                                   "confirmed eligible", "included in the study",
                                   "completing follow-up", "and analysed"]

retrospective_observational_study = ["retrospective", "medical records review",
                                     "chart review", "case control",
                                     "data collection instrument",
                                     "data abstraction forms"]

cross_sectional_study = ["prevalence survey", "survey instrument" ,
                         "syndromic surveillance", "surveillance",
                         "registry data", "frequency", "response rate",
                         "questionnaire development",
                         "psychometric evaluation of instrument",
                         "non-response bias"]

case_series = ["case study", "case series", "case report",
"clinical findings", "symptoms", "diagnosis", 
"interventions", "outcomes", "dosage",
"strength", "duration", "follow-up",
"adherence", "tolerability"]

expert_review = ['expert review', 'literature review', 'expert', 'expert consensus', 'expert reviewers']
editorial = ['dear editorial', 'editorial', 'editorials']
ecological_regression = ['regression']

simulation = ["computer model", "forecast", "mathematical model",
"statistical model","stochastic model", "simulation",
"synthetic data", "monte carlo", "bootstrap",
"machine learning", "deep learning",
"AUC", "area under the curve", "receiver-operator curve",
"ROC", "model fit", "AIC", "Akaike Information Criterion"]





systematic_review_and_meta_analysis = [re.escape(m) for m in systematic_review_and_meta_analysis]
prospective_observational_study = [re.escape(m) for m in prospective_observational_study]
retrospective_observational_study = [re.escape(m) for m in retrospective_observational_study]
cross_sectional_study = [re.escape(m) for m in cross_sectional_study]
case_series = [re.escape(m) for m in case_series]
expert_review = [re.escape(m) for m in expert_review]
editorial = [re.escape(m) for m in editorial]
ecological_regression = [re.escape(m) for m in ecological_regression]
simulation = [re.escape(m) for m in simulation]




# build a dict with key as study design name and value of keywords list
study_designs = {'systematic_review_and_meta_analysis': systematic_review_and_meta_analysis, 
                'prospective_observational_study': prospective_observational_study,
                'retrospective_observational_study': retrospective_observational_study,
                 'cross_sectional_study' : cross_sectional_study,
                 'case_series' : case_series,
                 'expert_review' : expert_review,
                 'editorial' : editorial,
                 'ecological_regression' : ecological_regression,
                 'simulation' : simulation
                }

In [None]:
def extract_study_design(dataframe, study_designs):
    
    '''
    this function extracts the study design by looking for predefined 
    keywords in the title, abstract, methods and results
    
    source: https://www.kaggle.com/danielwolffram/cord-19-create-dataframe-june-9
    '''
    
    df = dataframe.copy()
    df['study_abstract'] = [set() for _ in range(len(df))]
    df['study_methods'] = [set() for _ in range(len(df))]
    df['study_results'] = [set() for _ in range(len(df))]

    for tag in study_designs.keys():
        for synonym in study_designs[tag]:
            df[df.abstract.str.contains(synonym, case=False, na=False) | df.Study.str.contains(synonym, case=False, na=False)].study_abstract.apply(lambda x: x.add(tag))
            df[df.methods.str.contains(synonym, case=False, na=False)].study_methods.apply(lambda x: x.add(tag))
            df[df.results.str.contains(synonym, case=False, na=False)].study_results.apply(lambda x: x.add(tag))
    
    df['Study Design'] = df.apply(lambda x: list(x.study_abstract.union(x.study_methods).union(x.study_results)), axis=1)
    df.study_abstract = df.study_abstract.apply(lambda x: list(x))
    df.study_methods = df.study_methods.apply(lambda x: list(x))
    df.study_results = df.study_results.apply(lambda x: list(x))
    
    df = df.drop(['study_abstract', 'study_methods', 'study_results'], axis=1)
    
    return df

### Prepare training datasets
This section prepares the dataset for training. It gets the article texts by the titles in the hand-labelled data, preprocesses them, and saves them as text files for annotation. The code is commented out because we have already built our annotation dataset, but you can uncomment it when you need to build a training dataset for annotation. 
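
To show what the preprocessing step does to a piece of text, the sketch below applies the same three regular expressions as `preprocess_articles`, in the same order, to a single string. It uses `re.sub` as a stand-in for the pandas string methods, and the `clean_text` name and the sample sentence are assumptions made for this illustration.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps in the same order as preprocess_articles."""
    # 1. Drop (...) and [...] spans together with the whitespace before them.
    text = re.sub(r"\s*([\(\[]).*?([\)\]])", "", text).strip()
    # 2. Drop URLs.
    text = re.sub(r"http\S+|www.\S+", "", text).strip()
    # 3. Drop a digit 1-7 that ends a token and is not preceded by another digit,
    #    for example a footnote marker glued to a word.
    text = re.sub(r"(?<!\d)[1-7]\b", "", text).strip()
    return text

sample = ("Among 41 patients most had fever1 and cough (n = 40) [12]. "
          "Data: https://example.org")
print(clean_text(sample))
```

Multi-digit numbers such as the patient count survive, while bracketed citations, the URL and the footnote marker are removed. Note that a standalone digit between spaces is also removed by step 3, which leaves a double space behind.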

In [None]:
# temp = get_raw_articles_by_title(df_all_provided_summary_tables.Study.tolist(), df_metadata)
# temp = preprocess_articles(temp)
# train = temp[:50]
# test = temp[50:]

# generate_articles_for_annotation(train)

### Model training
In this section, we train our model. The model building and training logic is already wrapped in a function, so here we only call it. Because training takes some time, we moved it to a [different notebook](https://www.kaggle.com/niyamatalmass/task-1-training-ner-model) and import the trained model here for reuse. The training code is still provided: uncomment it to train the model yourself. 
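
The training function batches its examples with spaCy's `minibatch` and `compounding(4.0, 32.0, 1.001)`, so batch sizes start at 4 and grow slowly towards 32. The sketch below is a plain-Python approximation of that schedule, not spaCy's own code; the names `compounding_sizes` and `make_batches` are hypothetical, and a much larger growth factor (1.5) is used so the effect is visible on 20 items.

```python
import itertools

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

def make_batches(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    while True:
        # Batch lengths are the integer part of the current size.
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch

examples = list(range(20))
batches = list(make_batches(examples, compounding_sizes(4.0, 32.0, 1.5)))
print([len(b) for b in batches])
```

Small early batches give frequent updates while the model is still far from a good solution; larger later batches make each update less noisy.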

In [None]:
# uncomment the two lines below to train the model 

# annotation_for_training = convert_dataturks_to_spacy('../input/covid19-annotation/task_1_population_annotation.json')
# ner_model = training_ner_model(annotation_for_training)

In [None]:
# load the trained model 
# model training notebook https://www.kaggle.com/niyamatalmass/task-1-training-ner-model
ner_model = spacy.load("../input/task-1-training-ner-model/task_1_ner_model")

# Model testing
Now that we have built our model, let's test it. We trained the model on a portion of the hand-labelled summary table data. Now we will use @davidmezzetti's search engine data for each table. @davidmezzetti has notebooks for each task that output the relevant papers for each question. We will use those results to make a summary table for each question in a task and save it as a CSV file. 

We build a dataset from @davidmezzetti's search engine output and use the titles of the returned articles to produce summary tables for those articles. The titles from the Kaggle-provided hand-labelled dataset work as well. 
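
The last step of the pipeline turns the `(text, label)` pairs that the NER model returns for each article into one table column per label. The sketch below shows that reshaping with plain dictionaries instead of a dataframe. The extracted values, the label names and the `entities_to_rows` helper are made up for this illustration, and `None` stands in for the empty cell a dataframe would hold.

```python
# Made-up NER output for two articles: one list of (text, label) pairs each.
extracted = [
    [("homeless adults", "POPULATION"), ("250", "SAMPLE_SIZE")],
    [("nursing home residents", "POPULATION")],
]

def entities_to_rows(extracted):
    """Build one dict per article with one key per label; later duplicates win."""
    labels = sorted({label for ents in extracted for _, label in ents})
    rows = []
    for ents in extracted:
        row = {label: None for label in labels}
        for value, label in ents:
            row[label] = value
        rows.append(row)
    return rows

for row in entities_to_rows(extracted):
    print(row)
```

Keeping only one value per label keeps the table flat, at the cost of dropping earlier mentions when an article contains several entities with the same label.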

In [None]:
def create_summary_table_multiple_article(model, titles, metadata_dataframe):
    '''
    this function makes it easy to build a summary table of articles.
    it takes the trained ner model, the titles of the articles and the metadata df
    and returns the summary table of those articles
    '''
    
    # get the articles raw text by matching provided titles
    temp = get_raw_articles_by_title(titles, metadata_dataframe)
    # preprocess those articles 
    temp = preprocess_articles(temp)
    # extract study design from articles 
    temp = extract_study_design(temp, study_designs)
    # extract labels using ner model
    temp = ner_extraction(model, temp)
    # drop unnecessary column
    temp = temp.drop(['abstract', 'methods', 'results', 'body_text', 'shorten_full_article_text', 'temp_entity'], axis=1)
    return temp

In [None]:
lower_social_search_engine = pd.read_csv('../input/population-articles-by-davidmezzetti/Management of patients who are underhoused or otherwise lower social economic status.csv')
titles_lower_social = lower_social_search_engine.Study.tolist()

create_summary_table_multiple_article(ner_model, titles_lower_social, df_metadata)

The table above summarizes the question 'Management of patients who are under-housed or otherwise lower social-economic status', using @davidmezzetti's search engine results for this question as test input. The results look good on inspection: the NER model picked up the right values for most of the articles. Next, we build summary tables for all questions in this task and save each one under the name of its source file. 

In [None]:
path = r'../input/population-articles-by-davidmezzetti/'
all_files = glob.glob(path + "*.csv")

for filename in all_files:
    # take the file name itself so the code does not depend on the directory depth
    summary_table_name = Path(filename).name
    if summary_table_name == 'population.csv':
        continue
    titles = pd.read_csv(filename)['Study'].tolist()
    summary_table = create_summary_table_multiple_article(ner_model, titles, df_metadata)
    summary_table.to_csv('./' + summary_table_name, index=False)

In [None]:
!ls

## Pros
* Not based on hard-coded rules or regex. Because the NER model is learned from annotated data, it can generalize to new articles. 

* The notebook provides all the functions needed to build summary tables from nothing but article titles. 

* With more annotation data we can make the NER model more robust. 

* The approach can be applied to all the tasks and to any articles, given custom annotations. 


## Cons
* The NER model was trained on only a small subset of the hand-labelled articles provided by Kaggle. More annotation data would make it more robust.

# Conclusion
Thanks for reading! I hope this notebook helps researchers find important knowledge and helps in the fight against the COVID-19 pandemic.