# About this notebook

> #### The novel **COVID-19** has changed how we view disease. Everything escalated quickly: the number of confirmed cases grew exponentially, with an R number (the average number of people to whom one infected person passes the virus) between 2 and 2.5 at the beginning. What made it harder is that the disease was poorly understood while more and more lives were lost. We are in a race against time to save as many lives as possible. We want to learn more about the disease in order to flatten the curve, i.e. decrease the R number, and knowing the risk factors for COVID-19 will help us do so.

> #### *This is what our model, "Corona Explorer", aims to do: direct healthcare providers to the papers most relevant to what they are looking for. If our solution saves even one life, we will be very proud that applying some science and our time achieved this!*

## Importing Important Libraries

In [None]:
import os
import json
import math
import numpy as np 
import pandas as pd 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import bigrams, trigrams, ngrams
from sklearn.feature_extraction.text import TfidfVectorizer


## Preprocessing Data

### Tokenizing the body text of a paper
The body text is tokenized and lowercased, then stop words, punctuation marks, currency symbols and numbers are removed.

In [None]:
def textPreprocessing(text):
    # NLTK's English stop words plus a few extra filler words (a set gives fast lookups)
    stop_words = set(stopwords.words("english"))
    stop_words.update(['one', 'av', 'however', 'moreover', 'yet'])
    words = nltk.word_tokenize(text)
    new_words = []
    for word in words:
        word = word.lower()
        # isalpha() drops any token that is not purely alphabetic (punctuation, numbers, symbols)
        if word not in stop_words and word.isalpha():
            new_words.append(word)
    return new_words  # list of words in a text
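
To make the filtering rules concrete, here is a minimal sketch that mimics them without NLTK. It is an illustration only: `simple_preprocess`, the tiny `STOP_WORDS` set and the whitespace-based splitting are hypothetical stand-ins, not the tokenizer or stop-word list used in the notebook.

```python
# Hypothetical, very small stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "of", "and", "in", "is", "are", "however", "one"}

def simple_preprocess(text):
    """Lowercase each whitespace-separated token, strip surrounding
    punctuation, and keep only alphabetic tokens that are not stop words."""
    kept = []
    for raw in text.split():
        word = raw.strip(".,;:!?()\"'").lower()
        if word.isalpha() and word not in STOP_WORDS:
            kept.append(word)
    return kept

sample = "However, COVID-19 risk is higher in 2 of the groups: smokers and diabetics."
print(simple_preprocess(sample))
# -> ['risk', 'higher', 'groups', 'smokers', 'diabetics']
```

Note that a token such as `covid-19` fails the `isalpha()` test because of the hyphen and digits, so under this rule it is dropped along with plain numbers.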

### Retrieving the JSON files of the documents from the directory

In [None]:
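def textReading(file_dir, x=0, y=10):
    # Read the files at positions x..y-1 of the directory listing
    filenames = os.listdir(file_dir)
    docs_bagOfWords = {}
    for filename in filenames[x:y]:
        # Use the file_dir argument rather than the global json_dir
        with open(os.path.join(file_dir, filename), 'rb') as f:
            file = json.load(f)
        # Join paragraphs with a space so words at paragraph boundaries do not merge
        text = ' '.join(i['text'] for i in file['body_text'])
        docs_bagOfWords[(filename[:-5], file['metadata']['title'])] = textPreprocessing(text)
    return docs_bagOfWords  # dictionary {(paper_id, title): [words]}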

In [None]:
json_dir = '/kaggle/input/CORD-19-research-challenge/document_parses/pdf_json'
docs_bagOfWords = textReading(json_dir,2000,2800)

## Implementing Raw TF-IDF

### First, TF (Term Frequency Calculation)

![](https://miro.medium.com/proxy/1*HM0Vcdrx2RApOyjp_ZeW_Q.png)

### Computing IDF

![](https://miro.medium.com/proxy/1*A5YGwFpcTd0YTCdgoiHFUw.png)

### Computing TF-IDF

![](https://miro.medium.com/proxy/1*nSqHXwOIJ2fa_EFLTh5KYw.png)
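
The three steps above can be sketched in plain Python. This is a minimal illustration that assumes the common textbook definitions: TF is a term's count divided by the document length, IDF is the natural log of the number of documents over the number of documents containing the term, and TF-IDF is their product. The function names and the toy documents are hypothetical. scikit-learn's `TfidfVectorizer`, used in the next section, applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def raw_tf(tokens):
    """Term frequency: count of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def raw_idf(docs_tokens):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def raw_tfidf(docs_tokens):
    """One {term: tf * idf} dictionary per document."""
    idf = raw_idf(docs_tokens)
    return [
        {term: tf * idf[term] for term, tf in raw_tf(tokens).items()}
        for tokens in docs_tokens
    ]

# Toy corpus (made up for illustration)
toy_docs = [
    ["smoking", "risk", "covid"],
    ["diabetes", "risk", "covid"],
    ["covid", "transmission"],
]
weights = raw_tfidf(toy_docs)
print(weights[0])
```

In this toy corpus "covid" appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while the rarer "smoking" receives the largest weight in the first document.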

## Implementing TF-IDF Using sklearn

In [None]:
def totalTFIDF(docs_bagOfWords):
    """
    Calculates TF-IDF for all documents
    Args:
        docs_bagOfWords: dict mapping each paper key to its list of words
    Returns:
        documentsText: list of strings (the preprocessed body text of each document)
        feature_names: list of strings (the unique vocabulary words)
        tfidf_dict: dict with the paper key as key and the list of TF-IDF values for this paper as value
    """
    vectorizer = TfidfVectorizer()
    # Rebuild one string per document from its list of words
    documentsText = [' '.join(words) for words in docs_bagOfWords.values()]
    vectors = vectorizer.fit_transform(documentsText)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() requires scikit-learn >= 1.0
    feature_names = list(vectorizer.get_feature_names_out())
    denselist = vectors.toarray().tolist()
    tfidf_dict = dict(zip(docs_bagOfWords.keys(), denselist))
    return documentsText, feature_names, tfidf_dict


In [None]:
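def calculateTFIDF(vec_query):
    """
    Calculates TF-IDF for a query
    Args:
        vec_query: list of preprocessed query words
    Returns:
        denselist: list holding one TF-IDF vector for the query
        feature_names: list of strings (the query's vocabulary)
    Note: the vectorizer is fitted on the query alone, so the columns follow
    the query's vocabulary, not the vocabulary of the documents.
    """
    vectorizer = TfidfVectorizer()
    documentsText = [' '.join(vec_query)]
    vectors = vectorizer.fit_transform(documentsText)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() requires scikit-learn >= 1.0
    feature_names = list(vectorizer.get_feature_names_out())
    denselist = vectors.toarray().tolist()
    return denselist, feature_names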

In [None]:
def getTotalVocab(docs):
    """
    Returns all words of the given documents as one list (repeated words are kept)
    """
    total_vocab = []
    for i in docs:
        total_vocab += i.split(' ')
    return total_vocab #list of total vocab

In [None]:
d,features,tfidf_dict = totalTFIDF(docs_bagOfWords)
total_vocab = getTotalVocab(d)

## Calculating Cosine Distance

![image.png](http://sites.temple.edu/tudsc/files/2017/03/cosine-equation.png)


#### Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
![](http://miro.medium.com/max/650/1*OGD_U_lnYFDdlQRXuOZ9vQ.png)
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector.
This is what is implemented in this model: it measures how similar the input query is to each available document, which helps answer questions about the risk factors of the newly emerged COVID-19 virus.
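
As a small worked example of the formula, the sketch below computes cosine similarity for short vectors in plain Python. The function name and the toy vectors are hypothetical, and returning 0 for a zero vector is a convention assumed here, since a zero vector has no direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)

# Toy term-frequency vectors over a made-up vocabulary ("covid", "risk", "smokers")
doc_a = [3, 1, 0]
doc_b = [6, 2, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 4]   # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: no terms in common
```

Because only the angle matters, a long document and a short one with the same word proportions score as nearly identical, which is why the measure suits documents of very different lengths.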

In [None]:
def getCosineDistance(q_vec, doc_dict):
    """
    Calculates the cosine similarity between a query and the documents
    Args:
        q_vec: A vector representing the query
        doc_dict: Dictionary with - key as a document title
                                  - value as vector representation for this document
    Returns:
        q_norm: The norm of the input query
        cosDistances: Dictionary of documents sorted by cosine similarity with the query, highest first

    """
    cosDistances = {}
    q_vec = np.asarray(q_vec, dtype=float)
    q_norm = np.linalg.norm(q_vec)
    for k, v2 in doc_dict.items():
        v2 = np.asarray(v2, dtype=float)
        # Zero-pad the query up to the document vector length (assumes the query is not longer)
        padded = np.concatenate((q_vec, np.zeros(max(len(v2) - len(q_vec), 0))))
        denom = q_norm * np.linalg.norm(v2)
        # A zero vector has no direction, so its similarity is set to 0
        cosDistances[k] = np.dot(padded, v2) / denom if denom else 0.0
    cosDistances = dict(sorted(cosDistances.items(), key=lambda item: item[1], reverse=True))
    return q_norm, cosDistances
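
One caveat is worth illustrating. Zero-padding the query vector only lines it up with the document vectors if both share the same column order. Since `calculateTFIDF` fits a vectorizer on the query alone, its columns follow the query's own vocabulary. The sketch below uses a made-up vocabulary and made-up weights to show the difference between padding and aligning by term. `align_to_vocab` is a hypothetical helper, not part of the notebook.

```python
def align_to_vocab(term_weights, vocab):
    """Place each query term weight at the column of the same term in vocab.
    Terms that are absent from vocab are ignored."""
    index = {term: i for i, term in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for term, weight in term_weights.items():
        if term in index:
            vec[index[term]] = weight
    return vec

# Hypothetical corpus vocabulary (column order of the document vectors)
corpus_vocab = ["covid", "diabetes", "risk", "smokers"]
# Hypothetical query weights
query_weights = {"risk": 0.8, "smokers": 0.6}

# Zero-padding keeps the query's own column order:
padded = list(query_weights.values()) + [0.0] * (len(corpus_vocab) - len(query_weights))
# Aligning by term puts each weight under the matching corpus column:
aligned = align_to_vocab(query_weights, corpus_vocab)

print(padded)   # [0.8, 0.6, 0.0, 0.0] -> weights land under "covid" and "diabetes"
print(aligned)  # [0.0, 0.0, 0.8, 0.6] -> weights land under "risk" and "smokers"
```

With scikit-learn, one way to get this alignment is to keep the vectorizer fitted on the documents and call its `transform` method on the query text, which produces a vector in the corpus vocabulary directly.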
        

The input question is placed in the `question` variable.

In [None]:
question = """What are the COVID-19 risk factors? How does it affect smokers, pregnant women and children, and how does it
        influence people with chronic diseases like hypertension, diabetes and cardiac diseases?"""
text_filtered = textPreprocessing(question)
tfidfList,words = calculateTFIDF(text_filtered)


In [None]:
q_norm,c = getCosineDistance(tfidfList[0],tfidf_dict)

#### Displaying the output results:

In [None]:
def pretty(d, indent=0,r=5):
    for key, i in zip(d.keys(),range(r)):
        print('\t' * indent + str(key))
        if isinstance(d[key], dict):
            pretty(d[key], indent+1)
        else:
            print('\t' * (indent+1) + str(round(d[key],3)))

In [None]:
pretty(c)

In [None]:
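paper = 'f084fa3b9768063bc36f08970e7c28a5e3f2f13b.json'
# Open with a context manager so the file handle is closed after loading
with open(os.path.join(json_dir, paper), 'rb') as f:
    paper_json = json.load(f)
paper_json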