# Project Description

This collaborative project was put together by students of TCSS 592 at the [School of Engineering and Technology, University of Washington Tacoma](https://www.tacoma.uw.edu/set/school-engineering-technology-home) and [NLPCORE](https://nlpcore.com), a Seattle, WA startup. It uses NLPCORE's search engine to extract meaningful phrases (concepts), grouped into named categories (topics), along with their specific linkages and relationships (joint references) in the literature. Topics can be dictionary terms such as Proteins or Cell Lines, user-specified terms such as Host Cells or Viruses, or terms extracted dynamically by the search engine based on the search terms.

 

The objective of the project is to provide the most relevant and specific references (not just articles, but specific sentences within each article), along with relevant biological materials, in response to the questions posed in this challenge. Our goal is to enable life sciences researchers to quickly gather, triage, and identify the most applicable subset of candidate proteins and/or reagents for their experiments related to Covid-19 research.

 

Besides extracting references, we also validated search results against LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), an expertly curated set of articles on Covid-19, and found both good matches and data set anomalies. We have provided the reproducible validation test scripts, along with the dataset, in an accompanying Jupyter notebook available at https://www.kaggle.com/varunmittalnlpcore/litcovid-validation. For optimal performance, we highly recommend that users download both notebooks and run them locally on their personal workstations. Publicly hosted environments such as Kaggle and Google Colab have resource limitations that may severely impact performance and limit functionality.

 

Finally, we provide visualizations that show the distribution and correlation of results, both through sample code in this notebook and through click-through URLs to our Search Portal, to help researchers sift through results and narrow them down to a more relevant subset for their experiments or further investigations.

 

## Background

Varun Mittal, cofounder at NLPCORE, is a University of Washington alumnus with a Master's degree in CS (AI and ML techniques) and has remained as faculty support for Prof. Dr. Ka Yee Yeung at UW, who is teaching the TCSS592 class this spring. This CORD19 challenge, together with expertly curated Covid-19 datasets, has provided a unique opportunity for the UW TCSS592 students and the NLPCORE team to work together under the guidance of Dr. Ka Yee Yeung.

 

NLPCORE is a knowledge discovery platform powered by its unique AI and ML techniques (US Patents: [#10102274](https://patents.google.com/patent/US10102274B2) & [#10372739](https://patents.google.com/patent/US20190005049A1)) that delivers contextual and actionable results for users across various verticals: life sciences, case law, patents, insurance, and more.

 

![Identify Entities and Relationships using Part of Speech tags](https://i.imgur.com/dXT19EW.png)

 

Its search technology collects statistics such as word frequencies, offsets, and part of speech tags (e.g. noun, pronoun, or verb) in its index. Words that appear most frequently and closest to the search keyword(s) provide seed articles for its neural-net algorithms, which also factor in heuristics, dictionaries, and in-place user feedback. For any given search keyword(s), the search engine scans all matching articles, deploying a (Hadoop-like) cluster of processing nodes to identify and retrieve the most appropriate concepts, their grouping into meaningful topics, their relationships to each other, and their specific annotated references from the entire text corpus.
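
To make the idea of "most frequent and closest to the search keyword" concrete, here is a minimal toy sketch. It is not NLPCORE's algorithm: the function name `rank_neighbors`, the window size, and the inverse-distance weighting are all assumptions chosen for illustration, and it ignores part of speech tags, dictionaries, and user feedback entirely.

```python
from collections import Counter

def rank_neighbors(tokens, keyword, window=5):
    """Score words by how often and how closely they appear near `keyword`.

    Hypothetical scoring: each occurrence within `window` tokens of the
    keyword adds 1/distance to that word's score.
    """
    keyword = keyword.lower()
    tokens = [token.lower() for token in tokens]
    positions = [i for i, token in enumerate(tokens) if token == keyword]
    scores = Counter()
    for pos in positions:
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for i in range(start, end):
            if tokens[i] == keyword:
                continue  # do not score the keyword itself
            scores[tokens[i]] += 1.0 / abs(i - pos)
    return scores.most_common()

# Made-up sentence, used only to exercise the function
text = ("the coronavirus spike protein binds the host receptor and "
        "the spike protein mediates coronavirus entry")
ranked = rank_neighbors(text.split(), "coronavirus", window=3)
print(ranked[:3])
```

In this toy sentence, "spike" ranks first because it occurs near both mentions of the keyword. A real index would also need to discount common words such as "the".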

 

In this project submission, we used both the dataset provided and the open-access subset from NIH (PubMed Central) to focus on all coronavirus-related research and extract related content from both existing and newly available research. Furthermore, for validation we used the expertly curated LitCovid dataset, which we have enclosed as an additional dataset along with this submission for reproducibility.

 

## Extracting, Analyzing and Presenting Results

To respond to the challenge questions, we submitted a number of search keywords to NLPCORE, along with suggested topics to extract, based upon the students' research and suggestions. We collected these results, i.e. concepts, topics, and their joint references, into DataFrames. We then experimented with a number of concept/link attributes, such as the frequency of terms or co-occurrences, the distance of these terms from the searched keywords, their part of speech tags (mostly pronouns, nouns, or verbs), and the topics they belong to, as a way to filter the DataFrames down to the most meaningful subset. We then present the output along with individual text references (Article Ref, Title, Section Title, Surrounding Sentences) in the recommended table format, which can be readily exported as a CSV file and consumed by researchers for further analysis and experimentation.
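
As a rough illustration of this filtering and export step, the sketch below filters a small hand-made DataFrame by topic and co-occurrence count and writes it out as CSV. The column names mirror those used later in this notebook (`source`, `target`, `source_types`, `target_types`, `count`), but the rows, the helper name `filter_links`, and the threshold are made up for the example.

```python
import io
import pandas as pd

# Made-up rows, for illustration only
rows = [
    {"source": "coronavirus", "target": "ace2", "source_types": "VIRUS",
     "target_types": "PROTEIN", "count": 12},
    {"source": "coronavirus", "target": "figure", "source_types": "VIRUS",
     "target_types": "OTHER", "count": 30},
    {"source": "coronavirus", "target": "vero cells", "source_types": "VIRUS",
     "target_types": "CELLS", "count": 2},
]
toy_df = pd.DataFrame(rows)

def filter_links(df, wanted_topics, min_count=5):
    """Keep links whose target belongs to a wanted topic and co-occurs often enough."""
    has_topic = df["target_types"].apply(
        lambda types: bool(wanted_topics & set(types.split(", "))))
    kept = df[has_topic & (df["count"] >= min_count)]
    return kept.sort_values("count", ascending=False)

filtered = filter_links(toy_df, {"PROTEIN", "CELLS"}, min_count=5)

# Export to CSV; a StringIO buffer stands in for a file on disk
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Here only the `ace2` row survives: `figure` is frequent but belongs to no wanted topic, and `vero cells` falls below the count threshold.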

 

## Validating Results

As part of narrowing results down to specific phrases (topics and concepts), we also compared the research articles that contained these phrases against an expertly curated dataset. To do so, we validated search results against LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), an expertly curated set of articles on Covid-19. We found both good matches and data set anomalies (when using our search engine with the CORD-19 dataset versus using our search engine with the LitCovid dataset). We have provided the reproducible validation test scripts, along with the dataset, in an accompanying Jupyter notebook available at https://www.kaggle.com/varunmittalnlpcore/litcovid-validation. The notebook contains both the datasets and the scripts that were used to perform the comparison. For optimal performance, we highly recommend that users download both notebooks and run them locally on their personal workstations. Publicly hosted environments such as Kaggle and Google Colab have resource limitations that may severely impact performance and limit functionality. We found that where we were able to uniquely identify articles by a proper identifier (such as a PubMed or PMC ID), both in the curated set and in the set used by our search engine, we had a high degree of match; when the dataset lacked a matching identifier for the articles, our matches decreased significantly.
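
The identifier-based comparison described above can be sketched roughly as follows. This is not the code from the validation notebook: the field names `pmid` and `pmcid`, the helper names, and the sample records are assumptions, and the identifiers are fabricated placeholders rather than real articles.

```python
def article_ids(article):
    """Collect whichever identifiers (PubMed ID, PMC ID) an article record carries."""
    return {article[key] for key in ("pmid", "pmcid") if article.get(key)}

def compare_to_curated(search_results, curated):
    """Split search results into matched, unmatched, and unidentified articles."""
    curated_ids = set()
    for article in curated:
        curated_ids |= article_ids(article)

    matched, unmatched, unidentified = [], [], []
    for article in search_results:
        ids = article_ids(article)
        if not ids:
            unidentified.append(article)  # no identifier, so no comparison is possible
        elif ids & curated_ids:
            matched.append(article)
        else:
            unmatched.append(article)
    return matched, unmatched, unidentified

# Placeholder records; the identifiers are not real
curated = [{"pmid": "100"}, {"pmcid": "PMC200"}]
search_results = [
    {"pmid": "100", "title": "A"},
    {"pmcid": "PMC999", "title": "B"},
    {"title": "C"},
    {"pmid": None, "pmcid": "PMC200", "title": "D"},
]
matched, unmatched, unidentified = compare_to_curated(search_results, curated)
match_rate = len(matched) / (len(matched) + len(unmatched))
print(len(matched), len(unmatched), len(unidentified), round(match_rate, 2))
```

Reporting unidentified articles separately keeps records without any identifier from being counted as mismatches, which is the situation in which our matches dropped.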

 

## Visualization of Results

We also provide sample scripts in this notebook, as well as a blog post on our website (see the references below), to help researchers visualize the results for easy categorization and filtering. To this end, we group results by topic and allow drill-down to the results within a topic. We also provide one-click URLs that let users open these results in our Search Portal, where they can explore them further with various filters and visual representations.

 

## Reusability of our Approach

The methodology and approach that we have taken to respond to this challenge, and to validate our search results, is easily extensible by virtue of user-defined topics in the NLPCORE search engine. Besides search terms, we force the search engine to look for clusters of the most relevant words (concepts) for one or more user-defined topics, and therefore go beyond simple text or regular expression matches to identify articles, and references within them (hit highlighting), that contain these concepts. One simply has to change the list of user-defined topics and search terms to extract references and articles for any specific use case and text corpus.

 

## Use of cache & pre-computed results

Our notebooks use cached and pre-computed results, both for reproducibility and, more importantly, as a workaround for the resource limitations (CPU usage, memory usage, and network bandwidth for web API calls) enforced by publicly hosted Jupyter Notebook platforms, including Kaggle and Google Colab. We therefore recommend that users download these notebooks, make changes at will, and execute them locally.
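
The caching pattern can be sketched generically as below. The helper `cached_call` and the `fake_fetch` stand-in are hypothetical names introduced for this example; the notebook's own functions follow a similar pattern inline and write their cache files under `/tmp`, whereas this sketch uses a temporary directory and makes no network calls.

```python
import json
import os
import tempfile
from hashlib import md5

def cached_call(key_parts, fetch, cache_dir):
    """Return the cached JSON result for `key_parts`, calling `fetch()` only on a miss."""
    key = md5("-".join(map(str, key_parts)).encode()).hexdigest()
    cache_path = os.path.join(cache_dir, "%s.json" % key)
    if os.path.isfile(cache_path):
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    result = fetch()
    with open(cache_path, "w") as cache_file:
        json.dump(result, cache_file)
    return result

calls = []

def fake_fetch():
    """Stand-in for a web API call; records that it was invoked."""
    calls.append(1)
    return {"title": "example"}

cache_dir = tempfile.mkdtemp()
first = cached_call(("cord19-dataset", "doc-1"), fake_fetch, cache_dir)
second = cached_call(("cord19-dataset", "doc-1"), fake_fetch, cache_dir)
print(first == second, len(calls))
```

The second call is served from disk, so the stand-in fetch function runs only once.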

 

## References

1. Kaggle CORD19 Challenge Submission (this notebook): https://www.kaggle.com/varunmittalnlpcore/cord19-round1-response-by-uw-and-nlpcore

2. Validation of search results (accompanying notebook): https://www.kaggle.com/varunmittalnlpcore/litcovid-validation

3. Visualization of results (sample code and blogpost): https://www.nlpcore.com/blog_interna.html

4. NLPCORE search platform (requires one-time free registration): https://search.nlpcore.com/search-results?asp=&d=1&p=cord19-dataset&q=covid-19&rViewType=graph

5. LitCovid Dataset: https://www.ncbi.nlm.nih.gov/research/coronavirus/

 

## Future Plans

We continue to improve both the quality and the presentation of our results at our search portal. There, users can interact with results in various formats such as document, list, or graph views, filter them at will by any combination of attributes from topics, concepts, and/or articles, and jump to a specific article with color-coded highlights (where the color represents a topic category). We also continue to further our engagement with life sciences researchers, to help them apply results from our technology (which remains available to the research community at no cost) in their experiments, and to identify areas for further improvement in our toolset.

 

## Acknowledgements

We at NLPCORE acknowledge Prof. Dr. Ka Yee Yeung and her TCSS592 class, along with our intern and MIT freshman Yos Wagenmans, who contributed immensely to the research for and preparation of this submission.



# Tasks Attempted
For round 1, we have attempted to respond to the following CORD-19 challenges.

* Task 1: What is known about transmission, incubation, and environmental stability?
* Task 2: What do we know about COVID-19 risk factors?
* Task 3: What do we know about virus genetics, origin, and evolution?
* Task 4: What do we know about vaccines and therapeutics?

For each task, we took the key phrases (mostly unique words) from the description of the task itself and forced our search engine to search the neighborhood of these words, together with mentions of coronavirus itself, in the literature. We then extracted the most frequent concepts, biomaterials (proteins and cells), and their combined references in articles. In most cases, these references should approximate a response to the challenge posed. Our first-round results may be noisy, and we will attempt to improve on them in our subsequent submission.
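
The topic lists used in the task sections of this notebook were picked by hand. As a hedged sketch of how key phrases could be turned into a set of uppercase topic words automatically, one might do something like the following; the helper `phrases_to_topics` and the small stop-word list are assumptions for illustration, not part of the NLPCORE API.

```python
import re

# Hypothetical stop-word list; extend as needed
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def phrases_to_topics(phrases):
    """Split key phrases into unique uppercase words, keeping hyphenated terms whole."""
    topics = set()
    for phrase in phrases:
        for word in re.findall(r"[A-Za-z][A-Za-z-]*", phrase):
            if word.lower() not in STOP_WORDS:
                topics.add(word.upper())
    return topics

task_phrases = ["Smoking", "pre-existing pulmonary disease",
                "co-infections", "high-risk patient group"]
topics = phrases_to_topics(task_phrases)
print(sorted(topics))
```

Building the set programmatically also avoids a common pitfall of hand-written lists: a missing comma between two adjacent Python string literals silently concatenates them into one string.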

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests # process http requests
from time import sleep # timer functions
import json # process json objects
from tqdm import tqdm # progress bar
from hashlib import md5 # md5 hash for caching
import networkx as nx # generate d3 compatible graph
import IPython.display
from IPython.display import display, HTML, Javascript # importing these from IPython.core.display is deprecated
from string import Template
from difflib import SequenceMatcher # fuzzy matching of concept names
from os import path

In [None]:
# ************************************** FUNCTION DEFINITIONS ***********************************************
"""
User-defined topics that force the search engine to also look in their neighborhood
"""
select_topics = set(['ACTIVITY', 'ADE', 'AGENT', 'ANIMAL', 'ANIMALS', 'ANTAGONIST', 'ANTIVIRAL', 'ASYMPTOMATIC',
                     'BAT', 'BINDING', 'BUFFER', 'CELL', 'CELLS', 'CIRCULATION', 'CLARITHROMYCIN', 'CO-INFECTIONS',
                     'CO-MORBIDITIES', 'DISEASE', 'DRUG', 'DRUGS', 'ENVIRONMENT', 'ENZYME', 'ENZYMES', 'EXPERIMENTAL',
                     'FARMERS', 'GENOME', 'HIGH-RISK', 'HISTONE', 'HOST', 'HYDROPHILIC', 'HYDROPHOBIC', 'INFECTION',
                     'INTERACTIONS','IMMUNE', 'LIGAND', 'LIVESTOCK', 'MINOCYCLINE', 'MODEL', 'NAGOYA', 'NAPROXEN',
                     'NEONATES', 'NUCLEOTIDE', 'PATIENT', 'PATHOGENESIS', 'PEPTIDE', 'PEPTIDES', 'PHENOTYPE', 'PLATES',
                     'POLYPROTEIN', 'PPE', 'PRE-EXISTING', 'PREGNANCY', 'PROTEIN', 'PROTOCOL', 'PROPHYLAXIS',
                     'PULMONARY', 'RBD', 'RANGE', 'REAGENT', 'REAGENTS', 'RECEPTOR', 'REPLICATION', 'RESIDUES', 'RESPONSE',
                     'SEQUENCING', 'SHEDDING', 'SMOKING', 'STRAIN', 'STRUCTURES', 'THERAPEUTIC', 'TRACKING',
                     'TRANSCRIBE', 'TRANSCRIPTASE', 'TRANSMISSION', 'TREATMENT', 'VACCINE', 'VIRAL', 'VIRUS',
                     'WILDLIFE', 'UNIVERSAL'])

"""
Extract concepts and topics and their relationship from the search engine including user-defined topics
"""
def get_graph(project_name="cord19-dataset", source="coronavirus", target="transmission", auth="test-key"):

    params = {
        'auth': auth,
        'u_name': source,
        "v_name": target,
        "return_dataframe": True,
        "return_type": "dataframe",        
        "additional_topics": ",".join(map(lambda word: word.lower(), select_topics)),
        "project_name": project_name
    }
    r = requests.get("https://apis.nlpcore.com/apis/get_graph/", params=params)
    if r.status_code != 200:
        raise RuntimeError("Failed to get_graph, please try again., %s" % r.content)
    dataframe = pd.DataFrame(json.loads(r.content))
    return dataframe

"""
Filter rows in a dataframe to specific topics
"""
def subset_dataframe(dataframe, given_topics):

    select_dataframe_rows = []
    for _,row in dataframe.iterrows():
        source_topics = set(row['source_topics'])
        target_topics = set(row['target_topics'])
        if (given_topics & source_topics or given_topics & target_topics):
            select_dataframe_rows.append(row)
    return pd.DataFrame(select_dataframe_rows)

"""
Get Article metadata/attributes for a given document id, store results in caches for repeated calls
"""
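def document_metadata(project_name, document_id, auth="test-key"):

    # Cache file name is derived from an md5 hash of the project name and document id
    cache_key_str = "%s-%s" % (project_name, document_id)
    cache_key = md5(cache_key_str.encode()).hexdigest()
    cache_path = "/tmp/metadata2-%s.json" % cache_key

    try:
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    except FileNotFoundError:
        pass

    response = requests.get("https://apis.nlpcore.com/apis/get_document_metadata/", params={'project_name': project_name,
                                                                            'auth': auth,
                                                                            'd': document_id})
    if response.status_code != 200:
        raise RuntimeError("Failed to get_document_metadata, please try again., %s" % response.content)

    # Only successful responses are written to the cache
    metadata = response.json()
    with open(cache_path, "w") as cache_file:
        json.dump(metadata, cache_file)
    return metadata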

"""
Dataframe returned from the above calls has a list of concepts and their references. For each reference we can request
text segments. The parameter "r" is a comma separated list of integers which are sentence numbers.
Cache references for repeated calls.
"""
def get_references(project_name, document_id, r, auth="test-key"):

    # Cache file name is derived from an md5 hash of the project name, sentence numbers and document id
    cache_key_str = "%s-%s-%s" % (project_name, r, document_id)
    cache_key = md5(cache_key_str.encode()).hexdigest()
    cache_path = "/tmp/%s.json" % cache_key

    try:
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    except FileNotFoundError:
        pass

    # Use a separate name for the response so the "r" parameter is not shadowed
    response = requests.get("https://apis.nlpcore.com/apis/get_references/", params={'project_name': project_name,
                                                                             'auth': auth, 'r': r,
                                                                             'd': document_id})
    if response.status_code != 200:
        raise RuntimeError("Failed to get_references, please try again., %s" % response.content)

    # Only successful responses are written to the cache
    reference_data = response.json()
    with open(cache_path, "w") as cache_file:
        json.dump(reference_data, cache_file)
    return reference_data

"""
Augment the dataframe with article and sentence references for each of the co-occurring concepts in each row
"""
def refine_dataframe(project_name, dataframe, auth="test-key"):
    select_dataframe_rows = []
    for _,row in tqdm(list(dataframe.iterrows())):
        source_topics = set(row['source_topics'])
        target_topics = set(row['target_topics'])
        # Keep rows where both concepts belong to a selected topic and both idf values are below 3
        if (select_topics & source_topics and select_topics & target_topics) and row['source_idf'] < 3 and row['target_idf'] < 3:
            reference_texts = []
            for document_id,references in row['references'].items():
                title = document_metadata(project_name=project_name, document_id=document_id, auth=auth)['title'] or "<No Title>"
                sections = {}
                for reference in references:
                    # Sentence numbers of the source (u) and target (v) concepts
                    r = "%d,%d" % (reference['u_curr_ref'], reference['v_curr_ref'])
                    text = get_references(project_name=project_name, document_id=document_id, r=r, auth=auth)
                    for section in text.values():
                        # Group sentences by the title of the section they appear in
                        sections.setdefault(section['section_title'], []).append(section['sentence'])
                reference_texts.append({'title': title, 'sections': sections})
            select_dataframe_rows.append({'source': row['u_name'], 'target': row['v_name'], 'source_types': ", ".join(source_topics),
                                         'target_types': ", ".join(target_topics), 'count': row['count'],
                                         'references': reference_texts})
    return pd.DataFrame(select_dataframe_rows)

"""
Augment the dataframe with select sentences that match keywords from task
"""
def search_task_words(dataframe, given_topics):

    select_dataframe_rows = []
    given_topics = [word.lower() for word in given_topics]
    for _,row in dataframe.iterrows():
        matched_sentences = []
        for reference_obj in row['references']:
            for section_title, sentences in reference_obj['sections'].items():
                for sentence in sentences:
                    # Keep the sentence if it mentions any of the task keywords (case-insensitive substring match)
                    if any(word in sentence.lower() for word in given_topics):
                        matched_sentences.append(sentence)
        row['sentences'] = matched_sentences
        select_dataframe_rows.append(row.to_dict())
    return pd.DataFrame(select_dataframe_rows)

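"""
Convert the result dataframe into bubble chart data: build a concept graph, merge near-duplicate concept
names (similarity ratio above 0.7) and return the 100 largest nodes
"""
def convert_df(dataframe):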
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    g = nx.DiGraph()
    groups = {}

    def get_group(group_name):
        try:
            group_id = groups[group_name]
        except KeyError:
            group_id = len(groups) + 1
            groups[group_name] = group_id
        return group_id

    def add_node(concept_name, group_name):
        concept_name = concept_name.lower()
        
        if concept_name in g.nodes:
            g.nodes[concept_name]['size'] += 1
        else:
            sim_scores = [(_node, similar(_node, concept_name)) for _node in g.nodes]
            if len(sim_scores) > 0:
                _node, score = max(sim_scores, key=lambda item: item[1])
                if score > 0.7:
                    return add_node(_node, g.nodes[_node]['group'])
            g.add_node(concept_name, size=1, group=get_group(group_name))
        return concept_name

    for _, row in dataframe.iterrows():
        source_id = add_node(row['source'], row['source_types'])
        target_id = add_node(row['target'], row['target_types'])
        edge = g.get_edge_data(source_id, target_id)
        if edge:
            edge['value'] += 1
        else:
            g.add_edge(source_id, target_id, value=1)

    dataframe_rows = []
    for node_id in g.nodes:
        node = g.nodes[node_id]
        name = "project.%d.%s" % (node['group'], node_id)
        dataframe_rows.append({'id': name, 'value': node['size'], 'value1': node['size']})
    dataframe_rows = sorted(dataframe_rows, key=lambda item: item['value'], reverse=True)[:100]
    return pd.DataFrame(dataframe_rows)

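"""
Return the HTML and JavaScript (d3 v4) needed to render a bubble chart from a CSV file with id and value columns
"""
def return_bubble_data(csv_file_path, html_element_id):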
    html = """<!DOCTYPE html><svg id="%s" width="760" height="760" font-family="sans-serif" font-size="10" text-anchor="middle"></svg>""" % html_element_id
    js = """require.config({paths: {d3: "https://d3js.org/d3.v4.min"}});require(["d3"], function(d3) {var svg=d3.select("#%s"),width=+svg.attr("width"),height=+svg.attr("height"),format=d3.format(",d"),color=d3.scaleOrdinal(d3.schemeCategory20c);console.log(color);var pack=d3.pack().size([width,height]).padding(1.5);d3.csv("%s",function(t){if(t.value=+t.value,t.value)return t},function(t,e){if(t)throw t;var n=d3.hierarchy({children:e}).sum(function(t){return t.value}).each(function(t){if(e=t.data.id){var e,n=e.lastIndexOf(".");t.id=e,t.package=e.slice(0,n),t.class=e.slice(n+1)}}),a=(d3.select("body").append("div").style("position","absolute").style("z-index","10").style("visibility","hidden").text("a"),svg.selectAll(".node").data(pack(n).leaves()).enter().append("g").attr("class","node").attr("transform",function(t){return"translate("+t.x+","+t.y+")"}));a.append("circle").attr("id",function(t){return t.id}).attr("r",function(t){return t.r}).style("fill",function(t){return color(t.package)}),a.append("clipPath").attr("id",function(t){return"clip-"+t.id}).append("use").attr("xlink:href",function(t){return"#"+t.id}),a.append("svg:title").text(function(t){return t.value}),a.append("text").attr("clip-path",function(t){return"url(#clip-"+t.id+")"}).selectAll("tspan").data(function(t){return t.class.split(/(?=[A-Z][^A-Z])/g)}).enter().append("tspan").attr("x",0).attr("y",function(t,e,n){return 13+10*(e-n.length/2-.5)}).text(function(t){return t})});});""" % (html_element_id, csv_file_path)
    return html, js

# ************************************** END OF FUNCTION DEFINITIONS ***********************************************

# Task 1: What is known about transmission, incubation, and environmental stability?
From the description of the task, we identified the following key phrases as our primary search topics:

* Transmission
* Incubation
* asymptomatic shedding
* hydrophilic surface
* hydrophobic surface
* virus shedding
* disease model
* animal model
* phenotype change
* PPE effectiveness

The following code block computes the dataframe with the most applicable result set for these topics.

In [None]:
"""
Filter the results to only focus on most relevant topics for this challenge
"""
task_topics = set(['ANIMAL', 'ANIMALS', 'ASYMPTOMATIC', 'MODEL', 'TRANSMISSION', 'INCUBATION',
                   'SHEDDING', 'HYDROPHILIC', 'HYDROPHOBIC', 'VIRUS', 'DISEASE', 'PHENOTYPE',
                   'PPE'])
project_name="cord19-dataset"
source="coronavirus"
target="transmission"
auth="test-key"

In [None]:
"""
Get the initial dataframe and filter it down to topics of interest and add article references
"""
if path.isfile("/kaggle/input/nlpcore-cord19-output/task_1.csv"):
    task_df = pd.read_csv("/kaggle/input/nlpcore-cord19-output/task_1.csv", index_col=0)
else:
    df = get_graph(project_name, source, target, auth)
    task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)
    task_df = search_task_words(task_df, task_topics)
    task_df.to_csv("/kaggle/working/task_1.csv")

# Build the bubble chart data (node sizes per concept) from the results

graph_data = convert_df(task_df)

In [None]:
task_df

# Task 2: What do we know about COVID-19 risk factors?
From the description of the task, we identified the following key phrases as our primary search topics:

* Smoking
* pre-existing pulmonary disease
* co-infections
* co-morbidities
* neonates
* pregnancy
* high-risk patient group

The following code block computes the dataframe with the most applicable result set for these topics.

In [None]:
"""
Filter the results to only focus on most relevant topics for this challenge
"""
task_topics = set(['SMOKING', 'PRE-EXISTING', 'PULMONARY', 'DISEASE', 'CO-INFECTIONS', 'CO-MORBIDITIES', 'NEONATES', 'PREGNANCY', 
                   'HIGH-RISK', 'PATIENT'])
project_name="cord19-dataset"
source="coronavirus"
target="disease"
auth="test-key"

In [None]:
"""
Get the initial dataframe and filter it down to topics of interest and add article references
"""
if path.isfile("/kaggle/input/nlpcore-cord19-output/task_2.csv"):
    task_df = pd.read_csv("/kaggle/input/nlpcore-cord19-output/task_2.csv", index_col=0)
else:
    df = get_graph(project_name, source, target, auth)
    task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)
    task_df = search_task_words(task_df, task_topics)
    task_df.to_csv("/kaggle/working/task_2.csv")

task_df.to_csv("/kaggle/working/task_2.csv")
graph_data = convert_df(task_df)

In [None]:
graph_data.to_csv("task_2_graph.csv")
html, js = return_bubble_data("task_2_graph.csv", "graph_2_csv")

h = display(HTML(html))
j = IPython.display.Javascript(js)
IPython.display.display_javascript(j)

In [None]:
task_df

# Task 3: What do we know about virus genetics, origin, and evolution?
From the description of the task, we identified the following key phrases as our primary search topics:

* Genome tracking
* strain circulation
* Nagoya Protocol
* livestock
* receptor binding
* farmers
* wildlife
* host range
* experimental infection
* animal host

The following code block computes the dataframe with the most applicable result set for these topics.

In [None]:
"""
Filter the results to only focus on most relevant topics for this challenge
"""
task_topics = set(['GENOME', 'TRACKING', 'STRAIN', 'CIRCULATION', 'NAGOYA', 'LIVESTOCK', 'RECEPTOR', 'BINDING',
                   'FARMERS', 'WILDLIFE', 'HOST', 'RANGE', 'EXPERIMENTAL', 'INFECTION', 'ANIMAL', 'PROTOCOL'])
project_name="cord19-dataset"
source="coronavirus"
target="strain"
auth="test-key"

In [None]:
"""
Get the initial dataframe and filter it down to topics of interest and add article references
"""
if path.isfile("/kaggle/input/nlpcore-cord19-output/task_3.csv"):
    task_df = pd.read_csv("/kaggle/input/nlpcore-cord19-output/task_3.csv", index_col=0)
else:
    df = get_graph(project_name, source, target, auth)
    task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)
    task_df = search_task_words(task_df, task_topics)
    task_df.to_csv("/kaggle/working/task_3.csv")

task_df.to_csv("/kaggle/working/task_3.csv")

In [None]:
task_df

# Task 4: What do we know about vaccines and therapeutics?
From the description of the task, we identified the following key phrases as our primary search topics:

* naproxen
* clarithromycin
* minocycline
* Antibody Dependent Enhancement (ADE)
* therapeutic
* antiviral agent
* universal vaccine
* prophylaxis (preventative)
* vaccine immune response

The following code block computes the dataframe with the most applicable result set for these topics.

In [None]:
"""
Filter the results to only focus on most relevant topics for this challenge
"""
task_topics = set(['NAPROXEN', 'CLARITHROMYCIN', 'MINOCYCLINE', 'ADE', 'THERAPEUTIC', 'ANTIVIRAL', 'AGENT', 'UNIVERSAL',
                   'VACCINE', 'PROPHYLAXIS', 'IMMUNE', 'RESPONSE'])
project_name="cord19-dataset"
source="coronavirus"
target="vaccine"
auth="test-key"

In [None]:
"""
Get the initial dataframe and filter it down to topics of interest and add article references
"""
if path.isfile("/kaggle/input/nlpcore-cord19-output/task_4.csv"):
    task_df = pd.read_csv("/kaggle/input/nlpcore-cord19-output/task_4.csv", index_col=0)
else:
    df = get_graph(project_name, source, target, auth)
    task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)
    task_df = search_task_words(task_df, task_topics)
    task_df.to_csv("/kaggle/working/task_4.csv")

task_df.to_csv("/kaggle/working/task_4.csv")

In [None]:
task_df

## END OF FILE