# Correlation between genes and symptoms

This notebook tries to find the most common symptoms, organs and genes that appear in this dataset. It also tries to find topics that are related to a given organ, gene, etc.

**NOTE**

As the dataset is updated, I will try to update this notebook accordingly.

The following cell installs `scispacy`, `langdetect` and `en_ner_bionlp13cg_md`. They are used for language detection and for NER of genes, organs and chemicals. Installing them is optional, since I have prepared a dataset in which everything was precomputed on a much faster machine.

In [None]:
#!pip install scispacy
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bionlp13cg_md-0.3.0.tar.gz
#!pip install langdetect

- imports of other libraries

In [None]:
import collections
from urllib.parse import urlparse

import re

import os
import json
from os import path
from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
#from langdetect import detect, lang_detect_exception

#import scispacy
import spacy

#import en_ner_bionlp13cg_md

In [None]:
articles_metadata = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv')
articles_metadata.head()

Number of articles in the dataset

In [None]:
len(articles_metadata)

### Basic dataset stats

In this section we will look at the number of articles over time and at their source (pdf or pmc).

In [None]:
fig = plt.figure(figsize=(10,40))
gs = gridspec.GridSpec(nrows=4, ncols=1)

ax1 = fig.add_subplot(gs[0,0])
ax1.set_title('Sources')

source_counts = articles_metadata.groupby('source_x').count()
cord_uid_count = source_counts.cord_uid.sort_values(ascending=True)
labels = cord_uid_count.index
values = cord_uid_count.values

ax1.barh(np.arange(len(labels)), values, tick_label=labels)
ax1.set_xlabel('Number of articles')


ax2 = fig.add_subplot(gs[1, 0])
ax2.set_title('Number of articles per url domain (NaN excluded)')

domains = map(lambda x: urlparse(x).netloc, articles_metadata[~pd.isna(articles_metadata.url)].url)
counter = collections.Counter(domains)
most_common_domains = counter.most_common(20)
labels = list(map(lambda x: x[0], most_common_domains))[-1::-1]
values = list(map(lambda x: x[1], most_common_domains))[-1::-1]

ax2.barh(np.arange(len(labels)), values, tick_label=labels)

plt.show()
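
The domain counting in the cell above can be tried on its own. Here is a minimal sketch, assuming a small made-up list of URLs in place of the `url` column; `sample_urls` and `count_domains` are names I introduce only for this illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of the `url` column; None stands in for a missing value.
sample_urls = [
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1/",
    "https://doi.org/10.1000/example",
    None,
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2/",
]

def count_domains(urls):
    """Count the network locations (domains) of the non-missing URLs."""
    return Counter(urlparse(u).netloc for u in urls if isinstance(u, str))

domain_counts = count_domains(sample_urls)

# most_common() sorts in descending order; reversing it makes barh draw
# the largest bar at the top, as in the plot above.
labels, values = zip(*reversed(domain_counts.most_common(20)))
```

The `isinstance` check plays the role of the `pd.isna` filter: anything that is not a string is skipped before parsing.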

In [None]:
def extract_year(date: str) -> int:
    return int(date[:4]) if (date and type(date) is str) else -1

def extract_month(date: str) -> int:
    if type(date) is str:
        tokens = date.split('-')
        if len(tokens) >= 2:
            return int(tokens[1])
    
    return -1

def has_text(title) -> bool:
    return (title is not None) and (type(title) == str) and len(title)>0

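def detect_language(abstract) -> bool:
    # Returns True when the text is detected as English.
    # Needs the langdetect import (commented out above) to be enabled.
    try:
        return detect(abstract) == 'en'
    except (lang_detect_exception.LangDetectException, TypeError):
        # TypeError covers missing (non-string) abstracts
        return False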
    
articles_metadata['year_publish'] = list(map(lambda x: extract_year(x), articles_metadata.publish_time))
articles_metadata['month_publish'] = list(map(lambda x: extract_month(x), articles_metadata.publish_time))
articles_metadata['has_title'] = list(map(lambda x: has_text(x), articles_metadata.title))
articles_metadata['has_abstract'] = list(map(lambda x: has_text(x), articles_metadata.abstract))


articles_metadata.head()

- in the following cell we can see how many articles have titles and abstracts

In [None]:
articles_metadata.groupby(['has_title', 'has_abstract']).count()['cord_uid']

In [None]:
fig = plt.figure(figsize=(20,28))
gs = gridspec.GridSpec(nrows=2, ncols=1)

articles_metadata['year_publish'] = list(map(lambda x: extract_year(x), articles_metadata.publish_time))

per_year_counter = articles_metadata.groupby('year_publish').count().cord_uid
ax1 = fig.add_subplot(gs[0,0])

plt.xticks(rotation=90)
ax1.set_title('Number of articles per year')
ax1.bar(np.arange(len(per_year_counter)), per_year_counter.values, tick_label=per_year_counter.index, log=True)
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of articles')


ax2 = fig.add_subplot(gs[1, 0])
ax2.set_title('Number of articles per month in 2020')

months_counter = collections.Counter(list(map(lambda x: extract_month(x), articles_metadata[articles_metadata.year_publish == 2020].publish_time)))
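# Drop the -1 bucket (articles without a month) explicitly instead of slicing it off
months_data = sorted(item for item in months_counter.items() if item[0] != -1)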
months_values = [x[1] for x in months_data]
months_labels = [x[0] for x in months_data]
ax2.plot(months_labels, months_values)
ax2.grid(True)

plt.show()

In [None]:
sorted(months_counter.items())

Of the articles published in 2020, 149103 don't have a month or day of publication.
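
To see where such a count comes from, it helps to classify each date by how much of it is present. This is a minimal sketch, assuming `publish_time` values look like `YYYY`, `YYYY-MM` or `YYYY-MM-DD`; the sample values and the helper `date_granularity` are hypothetical and not part of the dataset.

```python
from collections import Counter

def date_granularity(date):
    """Classify a publish_time value by how much of the date is present."""
    if not isinstance(date, str) or not date:
        return "missing"
    parts = date.split("-")
    # One token is a bare year, two tokens are year and month, more is a full date.
    return {1: "year", 2: "month"}.get(len(parts), "day")

# Hypothetical values in the same shape that extract_month() expects.
sample_dates = ["2020", "2020-03-15", "2020-04", None, "2020"]
granularity_counts = Counter(date_granularity(d) for d in sample_dates)
```

Rows classified as `year` or `missing` are the ones for which `extract_month` returns -1.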

## Analysis of genes and chemical compounds

This section examines co-occurrences of genes, chemical compounds and organs that appear in the abstracts of articles. Since this dataset is diverse, we will limit the analysis to the following subset:
* only articles in English
* only articles published in 2020 or later

In [None]:
articles_2020 = articles_metadata[articles_metadata.year_publish >= 2020].copy()
articles_2020.head()

In [None]:
len(articles_2020)

The time condition limits the analysis to 387554 articles.

Remove duplicate titles

In [None]:
articles_2020 = articles_2020.drop_duplicates(subset=['title'])
len(articles_2020)

Removing duplicates further reduces the dataset to 297871 articles.

### Trying to filter English papers and extract genes, organs and chemical compounds

In [None]:
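articles_2020_tagged = None
PREPROCESSED_PATH = '../input/preprocessed-tagged-articles-for-cord19/preprocessed-articles-v1.0.csv'
if path.exists(PREPROCESSED_PATH):
    # Use the precomputed tags and keep only the articles that were preprocessed
    preprocessed_data = pd.read_csv(PREPROCESSED_PATH)
    articles_2020_tagged = pd.merge(articles_2020, preprocessed_data, how='right', right_on='cord_uid', left_on='cord_uid')
else:
    # Fallback: detect the language of every title. This needs the langdetect
    # import (commented out above) to be enabled. It only collects language
    # labels, so articles_2020_tagged stays None on this path.
    language_labels = []
    for t in tqdm(articles_2020.title):
        try:
            language_labels.append(detect(t))
        except lang_detect_exception.LangDetectException:
            language_labels.append('unknown')
            print(f"Error with title {t}")
        except TypeError:
            # Raised for missing (non-string) titles
            language_labels.append('unknown')
            print(f"Type error with title {t}")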

In [None]:
preprocessed_data = pd.read_csv(PREPROCESSED_PATH, compression=None)
preprocessed_data.head()

In [None]:
articles_2020_tagged.head()

In [None]:
len(articles_2020_tagged)

### Analysis of genes, chemical compounds and organs

In [None]:
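import ast

# The lists are stored as strings; literal_eval parses them without executing arbitrary code
articles_2020_tagged['genes'] = articles_2020_tagged['genes'].apply(ast.literal_eval)
articles_2020_tagged['organs'] = articles_2020_tagged['organs'].apply(ast.literal_eval)
articles_2020_tagged['chems'] = articles_2020_tagged['chems'].apply(ast.literal_eval)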

In [None]:
gene_counter = collections.Counter()
for g in articles_2020_tagged['genes']:
    gene_counter.update(g)

In [None]:
#false_postive_genes = ['COVID-19','Covid-19','UK','PPE','COVID-19patients','2019-nCoV',
#                       'MERS-CoV','stay-at-home','USA','e.g.', '', 'Iran', 'Food']
false_postive_genes = ['covid-19','uk','ppe','covid-19patients','2019-ncov','mers-cov','stay-at-home','usa','e.g.', '', 'iran', 'food', 'covid-19 patients']

In [None]:
for fpg in false_postive_genes:
    del gene_counter[fpg]

In [None]:
gene_counter.most_common(20)

In [None]:
organ_counter = collections.Counter()
for o in articles_2020_tagged['organs']:
    organ_counter.update(o)

organ_counter.most_common(20)

### Co-occurrences of organs and genes in the dataset corpus

In [None]:
cooccurence = collections.Counter()
for index, article in articles_2020_tagged.iterrows():
    #print(article)
    gene_list = article.genes
    organ_list = article.organs
    for g in gene_list:
        for o in organ_list:
            if g not in false_postive_genes:
                cooccurence[f"{g}#{o}"] += 1

In [None]:
cooccurence.most_common(100)
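
The counting above can also be written with tuple keys instead of `"gene#organ"` strings. This is only a sketch on made-up data: `sample_articles`, `excluded` and `count_pairs` are hypothetical names, and `excluded` stands in for the false-positive list.

```python
from collections import Counter
from itertools import product

# Hypothetical tagged articles; each has a list of genes and a list of organs.
sample_articles = [
    {"genes": ["ace2", "covid-19"], "organs": ["lung", "heart"]},
    {"genes": ["ace2"], "organs": ["lung"]},
    {"genes": ["il-6"], "organs": []},
]
excluded = {"covid-19"}  # stand-in for the false-positive gene list

def count_pairs(articles, excluded_genes):
    """Count (gene, organ) pairs that appear together in the same article."""
    pairs = Counter()
    for article in articles:
        genes = [g for g in article["genes"] if g not in excluded_genes]
        # product() yields every (gene, organ) combination within one article.
        pairs.update(product(genes, article["organs"]))
    return pairs

pair_counts = count_pairs(sample_articles, excluded)
```

Tuple keys stay unambiguous even if an entity name happens to contain `#`, and a set makes the false-positive lookup cheap.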

**Plotting a heatmap of co-occurrences**

In [None]:
def extract_most_common(counter, num_of_items=20):
    items = counter.most_common(num_of_items)
    return [i[0] for i in items]

In [None]:
gene_axis = extract_most_common(gene_counter)
organ_axis = extract_most_common(organ_counter)
#gene_axis

In [None]:
heatmap = []
for g in gene_axis:
    row = []
    for o in organ_axis:
        c = cooccurence[f"{g}#{o}"]
        row.append(c if c>0 else 1)
    heatmap.append(row)

In [None]:
fig = plt.figure(figsize=(10,15))

ax = fig.add_subplot(111)

ax.imshow(heatmap)

# Rows of `heatmap` are genes (y axis), columns are organs (x axis)
ax.set_xticks(np.arange(len(organ_axis)))
ax.set_yticks(np.arange(len(gene_axis)))

ax.set_xticklabels(organ_axis)
ax.set_yticklabels(gene_axis)

plt.setp(ax.get_xticklabels(), rotation=90, ha="right",
         rotation_mode="anchor")

plt.show()
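
When plotting such a matrix it is easy to mix up which axis holds which labels. Here is a small sketch of the matrix construction, assuming counts keyed by `(gene, organ)` tuples rather than the `"gene#organ"` strings used in this notebook; all names and numbers are made up.

```python
from collections import Counter

def build_matrix(pair_counts, row_labels, col_labels):
    """Return a nested list with one row per row label and one column per column label."""
    return [[pair_counts[(r, c)] for c in col_labels] for r in row_labels]

# Hypothetical counts keyed by (gene, organ) tuples.
pair_counts = Counter({("ace2", "lung"): 2, ("ace2", "heart"): 1, ("il-6", "lung"): 3})
genes = ["ace2", "il-6"]
organs = ["lung", "heart", "liver"]

matrix = build_matrix(pair_counts, genes, organs)
# Rows follow `genes` (the y axis in imshow); columns follow `organs` (the x axis).
```

A `Counter` returns 0 for pairs that never occurred, so empty cells need no special handling.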

## Extracting topics for genes and organs

In this section we will try to extract topics for combinations of genes and organs. A class for this purpose is defined in the cells below.

- first, let's import gensim LDA

In [None]:
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.test.utils import common_texts
from gensim.parsing.preprocessing import remove_stopwords
from gensim.corpora.dictionary import Dictionary

In [None]:
class TopicExtractor:
    
    _genes=[]
    _organs=[]
    
    _data_subset=None
    
    def __init__(self, genes=(), organs=()):
        # Tuple defaults avoid sharing one mutable list between instances
        self._genes = list(genes)
        self._organs = list(organs)
        
        self._data_subset = articles_2020_tagged
        for o in self._organs:
            indices = self._data_subset.organs.apply(lambda organ_list: o in organ_list)
            self._data_subset = self._data_subset[indices]

        for g in self._genes:
            indices = self._data_subset.genes.apply(lambda  gene_list: g in gene_list)
            self._data_subset = self._data_subset[indices]
        
    def get_topics(self, num_of_topics=20):
        article_chapters = []

        for doc in self._data_subset.pdf_json_files:
            if doc and type(doc) is str:
                path = doc.split('; ')[0]
                with open(f"./../input/CORD-19-research-challenge/{path}") as article:
                    raw_data = article.read()
                    obj = json.loads(raw_data)
                    body_text = obj['body_text']

                    texts = [x['text'] for x in body_text if 'text' in x]
                    # print(texts)
                    article_chapters.extend(texts)
        print(len(article_chapters))          
        corpus_data = [simple_preprocess(remove_stopwords(text)) for text in article_chapters]
                    
        common_dictionary = Dictionary(corpus_data)
        corpus = [common_dictionary.doc2bow(text) for text in corpus_data]

        self.lda_model = LdaModel(corpus, num_topics=num_of_topics, id2word=common_dictionary)
        # Return the topics so that the calling cell can display them
        return self.lda_model.print_topics()
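
The constructor above keeps only the articles that mention every requested organ and every requested gene. The same filter can be sketched without pandas; `filter_articles` and `sample_articles` are hypothetical and exist only for this illustration.

```python
def filter_articles(articles, genes=(), organs=()):
    """Keep articles that mention every requested gene and every requested organ."""
    wanted_genes, wanted_organs = set(genes), set(organs)
    return [
        a for a in articles
        # `<=` on sets is the subset test: all wanted items must be present.
        if wanted_genes <= set(a["genes"]) and wanted_organs <= set(a["organs"])
    ]

# Hypothetical tagged articles.
sample_articles = [
    {"cord_uid": "a1", "genes": ["ace2"], "organs": ["heart", "lung"]},
    {"cord_uid": "a2", "genes": ["ace2"], "organs": ["lung"]},
    {"cord_uid": "a3", "genes": [], "organs": ["heart"]},
]

selected = [a["cord_uid"] for a in filter_articles(sample_articles, genes=["ace2"], organs=["heart"])]
lung_only = [a["cord_uid"] for a in filter_articles(sample_articles, organs=["lung"])]
```

Because every condition must hold, adding more genes or organs can only shrink the subset that LDA is trained on.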

### Lungs articles topics

In [None]:
lungs_lda = TopicExtractor(organs=['lung'])
lungs_lda.get_topics()

In [None]:
lungs_lda.lda_model.print_topics()

### Topics for articles about heart and Angiotensin-converting enzyme 2

In [None]:
heart_igg_lda = TopicExtractor(genes=['ace2'], organs=['heart'])
heart_igg_lda.get_topics(num_of_topics=8)

In [None]:
heart_igg_lda.lda_model.print_topics()

### Topics for articles about liver

In [None]:
liver_lda = TopicExtractor(organs=['liver'])
liver_lda.get_topics(num_of_topics=15)

In [None]:
liver_lda.lda_model.print_topics()

You can use the class above to explore other combinations of organs and genes.

__More improvements to notebook will follow :-)__