![logo](https://drive.google.com/uc?id=1VrvlBTHH4D7xsrNp74wtLBamMZygG8Sy)

## Preamble

This notebook was created by CoronaWhy.org, a multidisciplinary global effort of volunteers.


- Visit our [website](https://www.coronawhy.org) to learn more.
- Read our [story](https://medium.com/@arturkiulian/im-an-ai-researcher-and-here-s-how-i-fight-corona-1e0aa8f3e714).
- Visit our [main notebook](https://www.kaggle.com/arturkiulian/coronawhy-org-global-collaboration-join-slack) for historical context on how this community started.

## Table of Contents
1. [Task Overview](#1)
2. [Results](#2)
3. [Code](#3)
4. [Contributors](#4)

<a id="1"></a>
## Task Overview

This notebook responds to the task "Create summary tables that address patient descriptions related to COVID-19" and, in particular, to the three questions pertaining to incubation periods and lengths of viral shedding.

The notebook kicks off with a search framework that uses the seed articles in the target tables to find articles that may have been missed or were published after the initial review. The mechanism is simple: it indexes every article in the CORD-19 collection by its embedding (i.e., a mathematical representation of the article) and conducts a k-nearest neighbor search to discover similar articles. For each seed article, we identify a set of candidates, but only those that surface at least twice across all searches are considered as potential additions to the summary table. We use the Facebook AI Similarity Search (FAISS) library, which is optimized for similarity search in dense vector spaces, so that the search remains efficient as the body of literature expands.

Once we have gathered our list of candidate articles, we apply a suite of AI-powered utility tools, trained with supervised learning methods, to construct the summary tables. Functionally, we have focused on two types of tools: one extracts information that is fundamental to all literature reviews, such as study design and sample size, and the other extracts a range of information with shared characteristics. For instance, age, incubation period, and duration of viral shedding are all essentially time periods and can thus be extracted with the same tool and minimal filtering. As with most supervised approaches, collecting the training datasets can be expensive and labor-intensive, but the trained models generalize well and can be easily modified and reused in other contexts.



<a id="2"></a>
## Results
We present the summary tables created by this notebook in PowerBI reports. 

[Link to view the reports in full screen](https://app.powerbi.com/view?r=eyJrIjoiMzU2YTk5ZjMtODU5My00ZjgyLWFmMWEtZDE4NzRjNzJhZTg1IiwidCI6ImRjMWYwNGY1LWMxZTUtNDQyOS1hODEyLTU3OTNiZTQ1YmY5ZCIsImMiOjEwfQ%3D%3D)


In [None]:
from IPython.display import IFrame
IFrame('https://app.powerbi.com/view?r=eyJrIjoiNzVjZmJkM2UtYzdhMi00NGQ2LWEwZTYtYjZjNjU4OTc0MDI3IiwidCI6ImRjMWYwNGY1LWMxZTUtNDQyOS1hODEyLTU3OTNiZTQ1YmY5ZCIsImMiOjEwfQ%3D%3D', width="100%", height=500)

<a id="3"></a>
## Code

Below we present the annotated code to reproduce the summary tables.

### Set up

We will use cord_19_embeddings to perform similarity search and then retrieve articles' metadata from metadata.csv.

In [None]:
!pip install faiss-cpu --no-cache -q
# !pip install faiss
import pandas as pd
import numpy as np
import os
import glob
import faiss
from faiss import normalize_L2
import collections
from collections import defaultdict
import json
from datetime import datetime
import re
from nltk.tokenize import sent_tokenize

In [None]:
def setup_local_data():
    input_dir = '/kaggle/input/'
    for item in glob.glob(os.path.join(input_dir,'*')):
        print(item)
    return input_dir

def read_metadata(input_dir):
    metadata = pd.read_csv(os.path.join(input_dir,'CORD-19-research-challenge','metadata.csv'))
    print(metadata.info())
    return metadata

def read_cord_19_embeddings(input_dir):
    # glob returns a list of matching paths; we expect a single embeddings file
    emb_path = glob.glob(os.path.join(input_dir, 'CORD-19-research-challenge', 'cord_19_embeddings', 'cord_19_embeddings_2020-07-27.csv'))
    print('input_dir: ', input_dir)
    print('emb_path: ', emb_path)
    # The first column (cord_uid) becomes the index; the remaining columns hold the embedding
    emb = pd.read_csv(emb_path[0], header = None, index_col = 0)
    print(emb.head())
    return emb


In [None]:
# Fetching cord_uids of the seed articles by matching article titles
def get_ref_cord_uid(target_table, metadata):
    titles = target_table['Study'].unique().tolist()
    cord_uids = metadata[metadata['title'].isin(titles)]['cord_uid'].tolist()
    return cord_uids
    

In [None]:
local_dir = setup_local_data()
metadata = read_metadata(local_dir)
emb = read_cord_19_embeddings(local_dir)

### Creating faiss search index
We index all the articles by their embeddings in a dense vector space. By normalizing the vectors and taking their inner products, we perform a k-nearest neighbor search based on cosine similarity.

In [None]:
# Creating a matrix to store article embeddings 
xb = np.ascontiguousarray(emb).astype(np.float32)
# Assigning dimension for the vector space
d = xb.shape[1]

In [None]:
# Building the index
index = faiss.IndexFlatIP(d) #IndexFlatIP: taking inner product of the vectors
print(index.is_trained)
# Normalizing the vectors in place, then adding them to the index
normalize_L2(xb)
index.add(xb)                  
print(index.ntotal)
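
As a quick sanity check of the idea behind this index, the minimal sketch below uses plain Python rather than FAISS to show that the inner product of two L2-normalized vectors equals their cosine similarity. The helper names and the toy vectors are illustrative assumptions and are not part of the notebook's pipeline.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Textbook definition: dot product divided by the product of the norms
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy 2-dimensional "embeddings" (hypothetical values)
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
print(ip_of_normalized, cosine_similarity(a, b))
```

Both printed values should agree up to floating point error, which is why an inner-product index over normalized vectors ranks neighbors by cosine similarity.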

### Listing target tables from task 
Let's take a look at the target tables for the task. As mentioned, in this notebook, we focus on three questions:
* Length of viral shedding after illness onset
* Incubation period of the virus
* Incubation period across different age groups

In [None]:
# Listing summary tables from a task

# task_dir = '3_patient_descriptions'
# table_path = os.path.join(local_dir,'CORD-19-research-challenge/Kaggle/target_tables',task_dir)
# for root, dirs, files in os.walk(table_path):
#     for filename in files:
#         print(filename)
    

In [None]:
# Refreshing from previous summary tables
table_path = os.path.join(local_dir,'summary-tables-2020-06-16')

### Conducting similarity search 
Once we've got the target table and the seed articles, we're ready to conduct the similarity search. First, we retrieve and normalize the seed articles' embedding vectors. Then, we search the entire corpus (the whole CORD-19 collection) for each article's 10 nearest neighbors. Lastly, we look at all the search results collectively and count the frequency of each result. We judge that the more frequently an article surfaces across the searches, the more likely it is to belong to this topic and to contain relevant information. We use a recurrence frequency of two as the threshold for new additions.

In [None]:
def refresh_table(table_name:str, metadata, emb):
    
    """Refresh the target table with new additions if available
       Params:
       table_name: string, file name of the target table
       metadata: df, loaded from metadata.csv
       emb: df, loaded from cord_19_embeddings
    """
    
    # Reading in the target table
    df = pd.read_csv(os.path.join(table_path, table_name), index_col=0)
    print('Number of studies in target table: {}'.format(len(df['Study Link'].unique())))

    target_columns = df.columns
    
    # Finding k nearest neighbors for each seed article
    # Retrieving seed article's cord_uid
    ref_cord_uids = get_ref_cord_uid(df,metadata)
    k = 10
    similar_id_list=[]
    for idx in ref_cord_uids:
        # Retrieving ref article's embeddings
        xq = np.ascontiguousarray(emb.loc[idx]).reshape(1,-1).astype(np.float32)
        # Remember to normalize
        normalize_L2(xq)
        # Searching top k similar articles and return a distance array (D) and an index array (I)
        D, I = index.search(xq, k+1)   # search k+1 to get k articles additional to self
        similar_id_list.extend(I.tolist()[0])
        
    # Counting the frequency of recurrence for each article found 
    similar_cord_uid_list = [ cid for cid in emb.iloc[similar_id_list].index if cid not in ref_cord_uids ]
    frequency_of_reoccurence = collections.Counter(similar_cord_uid_list)
    
    # Finalizing the list of articles to include: seed articles + new articles with recurrent freq > 1 (at least 2)
    final_list = ref_cord_uids + [uid for uid, freq in frequency_of_reoccurence.items() if freq > 1]
    
    # Getting metadata for all articles (with full text) in the final list
    # .copy() lets us add the 'Added on' column below without writing into a slice of metadata
    final_articles_metadata = metadata[(metadata['cord_uid'].isin(final_list)) & (pd.notnull(metadata['pdf_json_files']) | pd.notnull(metadata['pmc_json_files']))].copy()
    
    #Attaching "Added on" variable as required in the target table. This variable records the date that the article is added to the summary table.
    if 'Added on' in df.columns:
        final_articles_metadata['Added on'] = final_articles_metadata['title'].map(dict(zip(df['Study'],df['Added on'])))

        final_articles_metadata['Added on'] = final_articles_metadata['Added on'].fillna(datetime.today().strftime('%-m/%d/%Y'))
    else:
        final_articles_metadata['Added on'] = datetime.today().strftime('%-m/%d/%Y')
        
    return target_columns, final_articles_metadata
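
The recurrence threshold used in `refresh_table` can be illustrated in isolation. The sketch below is a toy example with made-up identifiers (`seed-a`, `paper-1`, and so on are hypothetical, as is the `neighbors_per_seed` dictionary standing in for the FAISS search results); it only demonstrates the counting and thresholding step.

```python
from collections import Counter

# Hypothetical seed articles and their nearest neighbors (each search also returns the seed itself)
seed_ids = ['seed-a', 'seed-b', 'seed-c']
neighbors_per_seed = {
    'seed-a': ['seed-a', 'paper-1', 'paper-2'],
    'seed-b': ['seed-b', 'paper-2', 'seed-a'],
    'seed-c': ['seed-c', 'paper-3', 'paper-2'],
}

# Pooling all search results and dropping the seed articles themselves
hits = [n for ns in neighbors_per_seed.values() for n in ns if n not in seed_ids]

# Keeping only candidates that surface at least twice across all searches
counts = Counter(hits)
new_additions = [uid for uid, freq in counts.items() if freq > 1]
print(new_additions)
```

Here only `paper-2` is retained, because it is the only candidate returned by more than one seed article's search.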

In [None]:
def get_full_text_sentences(final_articles_metadata):
    
    """Getting full text for the articles in the final article list for data extraction
       Params: df derived from refresh_table
    """
    # Reading json file if exists
    full_text_dict = defaultdict(list)
    final_articles_metadata = final_articles_metadata.set_index('cord_uid')
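    for idx in final_articles_metadata.index:
        # Preferring the PMC parse and falling back to the PDF parse
        if pd.notnull(final_articles_metadata.loc[idx]['pmc_json_files']):
            pdf = final_articles_metadata.loc[idx]['pmc_json_files']
        elif pd.notnull(final_articles_metadata.loc[idx]['pdf_json_files']):
            pdf = final_articles_metadata.loc[idx]['pdf_json_files']
        else:
            # No full text available for this article
            continue
        # Assumption: when a field lists several files, they are separated by '; '; we read the first one
        pdf = pdf.split('; ')[0]

        with open(os.path.join(local_dir,'CORD-19-research-challenge',pdf), 'r') as f: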
            full_text = json.loads(f.read())
        #  Extracting columns to a new df
        #  Section labels will be helpful for filtering out noises during data extraction
        for item in full_text['body_text']:
            full_text_dict['cord_uid'].append(idx)
            full_text_dict['section'].append(item['section'])
            full_text_dict['text'].append(item['text'])

    df = pd.DataFrame(full_text_dict)         
    
    # We do some minor preprocessing here, mainly to conform to our NER training dataset (details to follow)
    def preprocess(text):
        text = re.sub('feces|faeces|fecal|faecal|stools|stool samples|stool sample', 'stool', text)
        text = re.sub('stool', 'stool samples', text)
        text = re.sub('serum|plasma', 'blood', text)
        text = re.sub('swabs', 'swab', text)
        text = re.sub('-swab', ' swab', text)
        text = re.sub(u'\u00B7', '.', text)
        return text.lower()
    df['text'] = df['text'].apply(lambda x: preprocess(str(x)))
    
    # Tokenizing paragraphs from the jsons into sentences
    df['sentence'] = df['text'].apply(lambda x: sent_tokenize(str(x)))
    df = df.explode('sentence')

    df = df.join(final_articles_metadata, on='cord_uid')
    
    return df
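
The paragraph-to-sentence flattening performed above can be sketched without pandas or NLTK. In the example below, `naive_sentences` is a deliberately simple stand-in for `sent_tokenize` (an assumption made so the snippet runs on its own), and the paragraphs are invented for illustration.

```python
import re

def naive_sentences(paragraph):
    # Stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

# Hypothetical body_text entries from one article
paragraphs = [
    {'cord_uid': 'uid-1', 'section': 'Results',
     'text': 'The median age was 47 years. Viral shedding lasted 20 days.'},
    {'cord_uid': 'uid-1', 'section': 'Methods',
     'text': 'Throat swab samples were collected.'},
]

# One output row per sentence, keeping the article id and section label
rows = [
    {'cord_uid': p['cord_uid'], 'section': p['section'], 'sentence': s}
    for p in paragraphs
    for s in naive_sentences(p['text'])
]
for row in rows:
    print(row)
```

Each sentence keeps its section label, which is what later allows sections such as the introduction or discussion to be filtered out before extraction.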


### Data extraction
The target data fields to extract in this notebook include: 
* Sample Size: the number of patients/cases or articles included in the analysis
* Sample Obtained: the types of specimen tested for the presence of the virus
* Age: age of the included population
* Days/Range (Days): incubation period or duration of viral shedding, usually median/mean (range or variation) is reported
* Study Type: study design employed by the study

We use a set of custom Named Entity Recognition (NER) models and a classification model to extract these fields.


In [None]:
!pip install ijson -q
!pip install word2number -q
import spacy, ijson, re
from itertools import cycle
from word2number import w2n
from dateutil.parser import parse
import calendar

#### [Time Period NER Model](https://github.com/CoronaWhy/task-ties/blob/master/task_ties/Time_Period_NER.ipynb)
We use this model to extract age, incubation period, and length of viral shedding by applying different context filters.

The model was trained on manually annotated data using [spaCy](https://spacy.io/usage/training#ner). We used mini-batches (i.e., the number of training examples passed to the model at a time) with a compounding batch size, where the initial batch size is set at 4 and increases to 32 as the model parameters are updated. We used the Adam solver as the optimization algorithm with a learning rate of 0.001 and a dropout rate of 0.2. We experimented with numbers of epochs (i.e., the number of times the complete training data is passed through the model) ranging from 68 to 100 and settled on 75 epochs, which produced the best F1 score.
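
To make the compounding batch size concrete, here is a minimal sketch of such a schedule written as a plain Python generator. It is similar in spirit to spaCy's compounding helper but is not the training code itself; the function name and the growth factor of 2.0 are assumptions chosen for readability (the text above only states the start value of 4 and the cap of 32).

```python
from itertools import islice

def compounding(start, stop, compound):
    # Yield an infinite series that grows by a constant factor and is capped at `stop`
    value = start
    while True:
        yield value
        value = min(value * compound, stop)

# First few batch sizes with a hypothetical growth factor of 2.0
batch_sizes = list(islice(compounding(4.0, 32.0, 2.0), 6))
print(batch_sizes)
```

In practice a much smaller growth factor is typically used, so that the batch size creeps up gradually over many updates rather than doubling each step.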



In [None]:
# Loading the time period model
tpnlp = spacy.load(os.path.join(local_dir,'time-period-ner-75-v2','TimePeriodNER_75_v2'))

def get_age(sentence, model):
    
    """Extract median/mean/average age from sentences"""
    
    sentence = str(sentence)
    # To reduce noise, we extract from sentences reporting median or mean age of the population
    if any(substring in sentence for substring in ['median age','mean age','average age']):  
        doc = model(sentence)
        ents = [ent.text for ent in doc.ents] 
        if len(ents) > 0:
            return ', '.join(ents)

def get_time_period(sentence: str, model, period_types: list):
    
    """Extract median/mean/average time periods (e.g. viral shedding, incubation) from a sentence
    
    params:
    sentence: sentence to extract data from
    model: pretrained/custom spacy model to call
    period_types: list of keywords to indicate shedding, incubation, etc...
    
    """
    # Switching the keywords switches the kind of time period that is extracted
    PERIOD_TYPES = period_types
    
    sentence = str(sentence)
    # Filtering sentence by context keywords
    if any(s in sentence for s in ['median','mean','average','period','time','duration']) \
    and not any(s in sentence for s in ['diagnosis', 'sampling'])\
    and any(s in sentence for s in PERIOD_TYPES): 
        
        doc = model(sentence)
        context = [ent.text for ent in doc.ents if ent.label_ == 'TPcontext'] 
        data = [ent.text for ent in doc.ents if ent.label_ == 'TPdata']
        
        if len(context)*len(data)>0:
            output = [item for item in [c + ': ' + d for c, d in zip(cycle(context), data)] \
                      if any( s in item for s in PERIOD_TYPES) and ('day' in item) and re.search(r'\d+(\.\d+)?', item)] 
            if len(output) > 0:
                return  '; '.join(output)
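
The `zip(cycle(context), data)` idiom in `get_time_period` pairs each extracted value with a context label, reusing the labels when there are fewer labels than values. The standalone sketch below illustrates just that step; the helper name and the example entities are hypothetical.

```python
from itertools import cycle

def pair_context_with_data(context, data):
    # zip stops when `data` is exhausted; cycle repeats `context` as often as needed
    return [c + ': ' + d for c, d in zip(cycle(context), data)]

# One context label shared by two values
single = pair_context_with_data(['incubation period'], ['5.2 days', '4.1 days'])

# Two context labels alternating over three values
alternating = pair_context_with_data(['a', 'b'], ['1', '2', '3'])

print(single)
print(alternating)
```

Note that this pairing is positional, so it assumes the entities come out of the model in a sensible order; the keyword and `'day'` filters that follow in `get_time_period` discard pairs that do not look like the requested period type.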

#### [CORD-NER Model](https://github.com/CoronaWhy/task-ties/blob/master/task_ties/train_ner.py)
We use this model to extract the types of specimen used in tests for COVID-19.

The model was trained on data from the [CORD-NER dataset](https://xuanwang91.github.io/2020-03-20-cord19-ner/). This dataset includes text from the CORD-19 corpus annotated for general entity types, biomedical entity types, and 9 new entity types related specifically to COVID-19. The annotations were generated using a combination of methods: pre-trained NER models, knowledge base-guided NER models, and seed-guided NER models. Since the dataset has not been updated with the latest CORD-19 corpus, we used the annotations to retrain our own NER model. We only included the 9 new COVID-19 entity types and the “Activity” entity types to train our model. We also replaced all underscores in the data with spaces in order to better reflect the format of the CORD-19 corpus texts. Using this data, we trained a blank spaCy model for 20 iterations with a dropout rate of 0.35.

In [None]:

cord_ner_nlp = spacy.load(os.path.join(local_dir,'cord-ner-space/cord_ner')) 
  
def get_sample_type(sentence, model):
    
    """Extract SUBSTRATE (sample/specimen) from sentences"""
    
    sentence = str(sentence)
    doc = model(sentence)
    
    # Getting texts with SUBSTRATE entity label 
    ents = [ent.text for ent in doc.ents if ent.label_ == 'SUBSTRATE']
    tokens = sentence.split()
    
    # We manually extracted specimens from nasopharyngeal swabs as the model wasn't tuned enough to recognize them
    if 'swab' in tokens:
        idx = tokens.index('swab')
        # Guarding idx > 0 so that tokens[idx-1] does not wrap around to the last token
        if idx > 0:
            ents.append(' '.join([tokens[idx-1], tokens[idx]]))
    
    # Checking for texts with CORONAVIRUS entity label 
    has_corona = 'CORONAVIRUS' in [ent.label_ for ent in doc.ents]
    # Checking mentions of covid-19 in the same sentence increases the confidence that the sentence has relevant information
    if ents and has_corona:  
        return ', '.join(ents)
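
The manual swab rule in `get_sample_type` can be tried out without loading the NER model. The sketch below is a small variation written for illustration: unlike the function above, which looks only at the first occurrence, the hypothetical helper `swab_phrases` collects every `swab` token together with the word preceding it.

```python
def swab_phrases(sentence):
    # Pair each 'swab' token with the word before it, e.g. 'throat swab'
    tokens = sentence.split()
    return [
        ' '.join(tokens[i - 1:i + 1])
        for i, token in enumerate(tokens)
        if token == 'swab' and i > 0  # skip a leading 'swab' with no preceding word
    ]

found = swab_phrases('nasopharyngeal swab and throat swab were collected')
none_found = swab_phrases('swab samples were collected')
print(found, none_found)
```

Because the preprocessing step lowercases the text and rewrites `swabs` and `-swab`, matching the exact token `swab` is sufficient in this pipeline.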

#### [Sample Size NER Model](https://github.com/CoronaWhy/task-ties/blob/master/task_ties/Sample_Size_Extraction_%26_Modeling.ipynb)

A sample of sentence-level data was obtained via CoronaWhy's [ElasticSearch infrastructure](https://github.com/CoronaWhy/covid-19-infrastructure) and then fed into [Doccano](https://github.com/CoronaWhy/doccano) (an open-source text annotation tool) for manual annotation. We also secured paragraph-level data (to give annotators a better sense of linguistic context), though the data turned out to be far less useful than the sentence-level annotations. Annotations were then fed into spaCy for the development of a custom NER model from a blank slate. The model training used a compounding mini-batch size (from 2 to 16) and a decaying dropout rate (from 0.6 to 0.2).

We planned on using this model to extract sample size data, but due to time constraints, we have not gathered sufficient training data. Therefore, the data extraction is temporarily based on rule-based matching. We first remove dates from the text, as we have found that dates often contaminate the extracted output. We also convert all numbers written in words back to numerical format. Lastly, we use a set of keywords that commonly describe sample size in abstracts to narrow down the extraction scope, and then extract the target data. We work from abstracts because they tend to be more concise and thus have less noise around the target information.

In [None]:
def remove_dates(string):
    
    """Helper function to remove the dates like Mar 30 2020 or 30 Mar 2020"""
    
    months = '('+ '|'.join([calendar.month_name[i] for i in range(1,13)]) + ')'
    months_abbr = '('+'|'.join([calendar.month_abbr[i] for i in range(1,13)])+ ')'
    dates = [months + r'\s\d{1,2}\s\d{4}', 
             r'\d{1,2}\s' + months + r'\s\d{4}',
             months_abbr + r'\s\d{1,2}\s\d{4}', 
             r'\d{1,2}\s' + months_abbr + r'\s\d{4}']
   
    string = re.sub('|'.join(dates), ' ', string)
    return string

def get_sample_size_regex(abstract:str):
    
    """Extract sample size from the abstracts"""
    
    abstract = re.sub(',','',abstract)
    abstract = remove_dates(abstract)
    words_nums = []
    for word in abstract.split():
        try:
            words_nums.append(str(w2n.word_to_num(word)))
        except Exception:  # not a number word: keep the token unchanged
            words_nums.append(word)
    abstract = ' '.join(words_nums)
    if any(w in abstract for w in ['enroll',
                                   'includ',
                                   'review',
                                   'extract',
                                   'divide',
                                   'collect',
                                   'examin',
                                   'evaluat',
                                   'report',
                                   'identif',
                                   'admit',
                                   'of the'])\
    and any(w in abstract for w in ['patients',
                                    'cases',
                                    'men', 
                                    'males',
                                    'children',
                                    'articles',
                                    'studies'])\
    and re.search(r'\s\d+([^\.%]{1,25})(patients|cases|men|males|children|articles|studies)', abstract):
        return re.search(r'\s\d+([^\.%]{1,25})(patients|cases|men|males|children|articles|studies)', abstract).group().strip()

# # Code to implement the sample size NER model
# nnlp = spacy.load(os.path.join(local_dir,'sample-size-ner-v3','sentence_level_model_v3'))
# def get_sample_size(abstract, model):
#     """Extract sample size from sentences"""
#     sentences = sent_tokenize(str(abstract).lower())
#     nums=[]
#     for sentence in sentences:
#         doc = model(sentence)
#         ents = [ent.text for ent in doc.ents if ent.label_=='enrolled' or ent.label_=='enrolled_add']
#         try:
#             nums.extend([str(w2n.word_to_num(ent)) for ent in ents if not any(s.isdigit() for s in ent)])
#         except:
#             nums.extend(ents)
#     if len(nums) > 0:
#         return ', '.join(filter(None,nums))
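
To see what the rule-based pattern captures, the sketch below applies the same regular expression to two invented sentences. The helper name `first_sample_size` and the example texts are assumptions for illustration; the keyword pre-checks, date removal, and word-to-number conversion from `get_sample_size_regex` are left out to keep the example short.

```python
import re

# A number, then up to 25 characters without '.' or '%', then a population keyword
SAMPLE_SIZE_PATTERN = re.compile(
    r'\s\d+([^\.%]{1,25})(patients|cases|men|males|children|articles|studies)'
)

def first_sample_size(text):
    # Return the first matching phrase, or None when nothing matches
    match = SAMPLE_SIZE_PATTERN.search(text)
    return match.group().strip() if match else None

hit = first_sample_size('we enrolled 191 adult patients with covid-19.')
miss = first_sample_size('the fatality rate was 2.3%.')
print(hit, miss)
```

Excluding `.` and `%` between the number and the keyword keeps decimals and percentages from being mistaken for counts, and the 25-character window limits how far the keyword may be from the number.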

#### [Study Design Classifier](https://github.com/CoronaWhy/Classy)

The model was trained on a dataset of approximately 1500 papers annotated by CoronaWhy volunteers, consisting of these classes: 
* Computational: in silico, modeling or simulation studies
* Experimental - in vitro: experimental studies performed on cells or other microorganisms 
* Experimental - in vivo: experimental studies performed on animals
* Clinical-interventional: interventional studies including randomized or non-randomized controlled trials 
* Clinical-observational: observational studies including cohort studies, case-control studies, case series, etc
* Systematic review and/or meta-analysis 
* Review: reviews that do not use systematic review methodology

The model was trained using CatBoost, which runs gradient boosting on decision trees. Hyperparameter tuning was performed using grid search on the base gradient boosting parameters (e.g., learning rate, depth) with 5-fold cross-validation to achieve the best macro F1, per-class F1, and accuracy.

In [None]:
!pip install catboost==0.20.1 -q --force-reinstall
from catboost import Pool, CatBoostClassifier

In [None]:
def get_study_type(test_pool):
    
    """Predict study design based on title and abstract of an article
        Params:
        test_pool: title and abstract organized in catboost compliant format
        EX: 
        test_pool = Pool(df[['abstract', 'title']],
        feature_names=['abstract', 'title'],
        text_features=['abstract', 'title'])
    """
    CATBOOST_MODEL_NAME="study_design_catboost_classifier_7_June_2020.cbm"
    # Instantiating the model
    model = CatBoostClassifier()
    # Loading pretrained classifier
    model.load_model(os.path.join(local_dir,'study-type-classifier',CATBOOST_MODEL_NAME))
    # Predicting on input data
    predictions = model.predict(test_pool)
    return predictions

### Assembling summary tables

Finally, with all the components set up, we are ready to assemble the summary tables!

In [None]:
def find_target_number(x):
    
    """Helper function to clean up extracted excerpts and extract the number of days.
        In most cases, it extracts the mean or median value but when range is reported, 
        the function extracts the upper bound.
    """
    match = re.match(r'(.*?)(\d+(\.\d+)?)(\s)(day)', x)
    if match and "±" not in x:
        return match.group(2)
    else:
        return re.search(r'\s\d+(\.\d+)?(days)?', x).group()
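
The two string operations used to fill the `Days` and `Range (Days)` columns can be checked on small examples. The sketch below reuses the main pattern from `find_target_number` and the parenthesis slicing from the assembly step; the helper names and the example excerpts are hypothetical, and the `±` handling and fallback branch of the original function are omitted.

```python
import re

def number_of_days(excerpt):
    # Lazily skip a prefix, then capture the first number directly followed by ' day'
    match = re.match(r'(.*?)(\d+(\.\d+)?)(\s)(day)', excerpt)
    return match.group(2) if match else None

def parenthesized_range(excerpt):
    # Return the text inside the first pair of parentheses, if any
    if ')' not in excerpt:
        return None
    return excerpt[excerpt.find('(') + 1:excerpt.find(')')]

with_ci = 'incubation period: 5.2 days (95% ci 4.1-7.0)'
with_range = 'incubation period: 2-14 days'

print(number_of_days(with_ci), parenthesized_range(with_ci))
print(number_of_days(with_range), parenthesized_range(with_range))
```

For the second excerpt the pattern returns `14`, since only the upper bound is directly followed by a space and `day`, which matches the behaviour described in the docstring above.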

In [None]:
def assemble_summary_table(table_name:str, metadata, emb):
    
    """ Assemble a summary table from end to end. 
        Params:
        table_name: str, the file name of the target table
        metadata: df, loaded from metadata.csv
        emb: df, loaded from cord_19_embeddings   
    """
    
    # Refreshing target table to include potential additions
    target_columns, final_articles_metadata = refresh_table(table_name, metadata, emb)
    
    # Retrieving full text for data extraction
    df = get_full_text_sentences(final_articles_metadata)

    # Removing sections that can include data from previous/background studies, rather than the study per se
    sections_to_keep = [s for s in df['section'] if s.lower() not in 
                    ['title',
                     'abstract', 
                     'background',
                     'summary background',
                     'introduction',
                     'discussion',
                     'discussions',
                     'statistical analysis']
                   ]
    df = df[df['section'].isin(sections_to_keep)]
    
    # Extracting median/mean/average age from full text
    df['Age'] = df['sentence'].apply(lambda x: get_age(x, tpnlp))
    
    # Extracting sample size
    df_sample_size = df[['cord_uid', 'abstract']].drop_duplicates() 
    df_sample_size['Sample Size'] = df_sample_size['abstract'].apply(lambda x: get_sample_size_regex(x) if pd.notnull(x) else x)
    
    # Extracting type of sample obtained
    df['Sample Obtained'] = df['sentence'].apply(lambda x: get_sample_type(x, cord_ner_nlp))
    
    # Extracting time period based on context
    if "shedding" in table_name.lower():
        df['extracted'] = df['sentence'].apply(lambda x: get_time_period(x, tpnlp, ['shedding','positive','clearance']))
        
    if 'incubation' in table_name.lower():
        df['extracted'] = df['sentence'].apply(lambda x: get_time_period(x, tpnlp, ['incubation']))
    
    # Extracting numeric values from the extracted 
    df['Days'] = df['extracted']\
    .apply(lambda x: find_target_number(x) if pd.notnull(x) and re.search(r'\s\d+(\.\d+)?(days)?', x) else x)\
    .apply(lambda x: ' '.join([x, 'days']) if pd.notnull(x) else x)
    
    df['Range (Days)'] = df['extracted'].apply(lambda x: x[x.find("(")+1:x.find(")")] if (pd.notnull(x)) and (")" in x) else None)

    #### Preparing final component #1: study metadata ####
    # .copy() lets us modify and add columns below without writing into a slice
    output_metadata = final_articles_metadata[['cord_uid','publish_time','title','abstract','url','journal','source_x','Added on']].copy()
    output_metadata['journal'] = output_metadata['journal'].fillna(output_metadata['source_x'])
    
    # Predicting study type
    test_pool = Pool(
    output_metadata[['abstract', 'title']].fillna(""),
    feature_names=['abstract', 'title'],
    text_features=['abstract', 'title'])
    
    output_metadata['Study Type'] = get_study_type(test_pool)

    #### Preparing final component #2: extracted metadata ####
    extracted_metadata = df[['cord_uid','Age', 'Sample Obtained']].groupby('cord_uid', as_index=False)\
    .agg(lambda x: list(set(x))) #add sample size, study type

    # Compressing multiple results into one and cleaning up
    extracted_metadata = extracted_metadata.set_index('cord_uid')[['Age','Sample Obtained']]\
    .applymap(lambda x: ', '.join(filter(None,x))).reset_index()
    
    extracted_metadata['Sample Obtained'] = extracted_metadata['Sample Obtained']\
    .apply(lambda x: x.split(',') if pd.notnull(x) else x)\
    .apply(lambda x: [s.strip() for s in x] if isinstance(x, list) else x)\
    .apply(lambda x: ', '.join(list(set(x))) if isinstance(x, list) else x)
    
    extracted_metadata = extracted_metadata.merge(df_sample_size[['cord_uid','Sample Size']], on='cord_uid', how='outer')

    #### Combining metadata with findings ####
    output = df[pd.notnull(df['extracted'])][['cord_uid','sentence','extracted','Days', 'Range (Days)']]\
    .merge(output_metadata, on='cord_uid', how='outer')\
    .merge(extracted_metadata,on='cord_uid', how='outer')

    #### Re-arranging and renaming ####
    output=output[['publish_time',
                   'title',
                   'url', 
                   'journal',
                   'Study Type',
                   'Sample Size',
                   'Age', 
                   'Sample Obtained', 
                   'Days', 
                   'Range (Days)',
                   'sentence', 
                   'Added on']]
    
    output.columns = ['Date', 
                      'Study', 
                      'Study Link', 
                      'Journal',
                      'Study Type',
                      'Sample Size',
                      'Age', 
                      'Sample Obtained', 
                      'Days',
                      'Range (Days)',
                      'Excerpt', 
                      'Added on']
    
    # Discarding additions that end up having no data of interest (i.e. false positives)
    output = output[~((output['Added on']==datetime.today().strftime('%-m/%d/%Y')) & (pd.isnull(output['Excerpt'])))]

    return output

### Voila! Summary Tables!

In [None]:
table_names = ['Length of viral shedding after illness onset.csv',
               'What is the incubation period of the virus_.csv',
               'Incubation period across different age groups.csv']

for table_name in table_names:
    summary_table = assemble_summary_table(table_name, metadata, emb)
    display(summary_table)
    summary_table.to_csv(table_name, index = False)

<a id="4"></a>
## [Contributors](https://docs.google.com/document/d/1oP4Qf3OMrbG28ESC74BzIPUaMtpGeyYYvdJ-cnF_wI8/edit?usp=sharing)
Many thanks to all the contributors who have made this project possible.