IMPORTING REQUIRED PACKAGES

In [1]:
import numpy as np 
import logging 
from tqdm import tqdm 
import re
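import scipy.spatial.distance  # provides scipy.spatial.distance.cosine
import scipy.stats  # provides scipy.stats.zscore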
import torch

from gensim.models import Word2Vec, KeyedVectors, word2vec # Word2Vec model
from matplotlib import pyplot as plt
from nltk import sent_tokenize

In [2]:
def cosine(a, b):
    """Cosine similarity of two vectors (1 minus SciPy's cosine distance)."""
    return 1 - scipy.spatial.distance.cosine(a, b)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

EXTRACTING THE FIRST 1000 FILES

In [8]:
with open('/content/drive/MyDrive/dataset/sentences.txt', 'r', encoding="utf-8") as corpus:
    buffer = corpus.read()

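# Find the index of the 1000th newline so that only the first 1000 lines are kept.
count = 0
wr = 0
for i in range(len(buffer)):
    if buffer[i]=='\n':
        count+=1
        if count==1000:
            wr = i
            break
sentences1000 = buffer[:wr+1]
# Newlines are deleted rather than replaced by a space, so consecutive lines are joined directly.
sentences1000 = re.sub('\n', '', sentences1000)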

# Split into sentences and keep only those between 16 and 200 characters long.
sentences = sent_tokenize(sentences1000)
sl = []
for s in tqdm(sentences):
    if (len(s)>15) and (len(s)<=200):
        sl.append(s)
buffer = '\n'.join(sl)

tokens = re.sub('\n', '', buffer)
tokens = re.split(' ', tokens)

with open('sentencecorpus.txt', 'w', encoding='utf-8') as f:
    f.write(buffer)

100%|██████████| 79385/79385 [00:00<00:00, 1260810.12it/s]
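
The cell above reads the whole file and then scans for the 1000th newline. As a minimal sketch of an alternative, assuming only the leading lines are needed, the file can be read lazily so the rest is never loaded. The helper name read_first_lines is hypothetical and not part of the original notebook.

```python
from itertools import islice


def read_first_lines(path, n=1000, encoding="utf-8"):
    """Return the first n lines of a text file (newlines kept) without reading the rest."""
    with open(path, "r", encoding=encoding) as fh:
        # islice stops after n lines, or earlier if the file is shorter
        return list(islice(fh, n))
```

Joining the result with ''.join(...) should give the same text as buffer[:wr+1] when the file has at least 1000 lines. Unlike the loop above, it also returns the whole file when there are fewer lines.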


TRAINING THE MODEL

In [10]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sent = open('sentencecorpus.txt', 'r', encoding="utf-8")
sents = word2vec.LineSentence(sent)
# The size argument applies to gensim < 4.0; from gensim 4.0 on it is named vector_size.
model = word2vec.Word2Vec(size = 100, window = 5, workers = 8, min_count = 1)
model.build_vocab(sents)
model.train(sents, total_examples = model.corpus_count, epochs = 10)

2022-05-25 21:02:53,076 : INFO : collecting all words and their counts
2022-05-25 21:02:53,140 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-05-25 21:02:53,286 : INFO : PROGRESS: at sentence #10000, processed 198671 words, keeping 15074 word types
2022-05-25 21:02:53,474 : INFO : PROGRESS: at sentence #20000, processed 404931 words, keeping 22224 word types
2022-05-25 21:02:53,737 : INFO : PROGRESS: at sentence #30000, processed 607258 words, keeping 27640 word types
2022-05-25 21:02:54,040 : INFO : PROGRESS: at sentence #40000, processed 805429 words, keeping 34517 word types
2022-05-25 21:02:54,352 : INFO : PROGRESS: at sentence #50000, processed 1006498 words, keeping 39061 word types
2022-05-25 21:02:54,681 : INFO : collected 41738 word types from a corpus of 1161033 raw words and 57692 sentences
2022-05-25 21:02:54,695 : INFO : Loading a fresh vocabulary
2022-05-25 21:02:55,111 : INFO : effective_min_count=1 retains 41738 unique words (100% of ori

(8699665, 11610330)

In [11]:
wordvecsize = 100
sentvecsize = 125
mat = np.zeros((len(sl), sentvecsize))

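# Randomly initialised, untrained recurrent weights: W maps word vectors, U maps the previous state.
W = np.random.normal(0, 0.01, (sentvecsize, wordvecsize))
U = np.random.normal(0, 0.01, (sentvecsize, sentvecsize))
h = np.zeros(sentvecsize)  # hidden state, carried over from one sentence to the next
for n, sent in tqdm(enumerate(sl)):
    for word in re.split(' ', sent.strip()):
        try:
            x = model.wv[word]
            x = scipy.stats.zscore(x)  # standardise the word vector
            h = np.tanh((U@h.T) + (W@x.T))
        except KeyError:
            pass  # token not in the vocabulary: keep the previous state
        mat[n] += h  # the sentence vector is the sum of the hidden states
        h = h/sentvecsize  # damp the state before the next word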

57692it [03:30, 274.59it/s]
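
The recurrence in the cell above can be isolated into a function, which makes it easier to test on small inputs. This is a sketch under stated assumptions: encode_sentence and lookup are hypothetical names, lookup is a plain dict standing in for model.wv, and the standardisation is written out with NumPy in the form that scipy.stats.zscore uses by default (population standard deviation).

```python
import numpy as np


def encode_sentence(words, lookup, W, U, h):
    """Sum the hidden states over one sentence; return (sentence_vector, final_state)."""
    size = U.shape[0]
    vec = np.zeros(size)
    for word in words:
        x = lookup.get(word)
        if x is not None:
            x = (x - x.mean()) / x.std()  # z-score with ddof=0
            h = np.tanh(U @ h + W @ x)    # recurrent update
        vec += h                          # accumulate the state into the sentence vector
        h = h / size                      # damp the state before the next word
    return vec, h
```

Passing the returned state into the next call reproduces the carry-over between sentences. Passing np.zeros(size) each time would instead encode every sentence independently, which is a different design from the cell above.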


In [12]:
print('Similar sentences to: \n', sl[1], '\n')
temp = []
for i in range(len(sl)):
    temp.append(cosine(mat[i], mat[1]))
temp = torch.from_numpy(np.array(temp))

# The best match is expected to be the query sentence itself, so key[0] is skipped.
values, key = torch.topk(temp, k=6, dim=-1)
for i in range(5):
    print(sl[key[i+1]], '\n')

Similar sentences to: 
 the surgeon of covid dedicated hospital do rarely practice surgery . 

some of these guideline include triaging of the patient , prioritizing emergency surgery and delaying the elective surgical procedure till the covid pandemic is over . 

the society of critical care medicine sccm guideline recommend only that icu admission criterion should select patient who are likely to benefit from icu care . 

this focus on medical and surgical presentation to the emergency department ed , but fails to provide guidance on triaging psychiatric presentation . 

the who also produced the pocket book of hospital care for child , providing clinical guideline for physician and nurse delivering pediatric hospital care in resource - limited setting . 

we acknowledge the ongoing dedicated work of all our clinical colleague , including our hospital palliative care team , intensive care staff and elderly care team . 



In [13]:
print('Similar sentences to: \n', sl[5], '\n')
temp = []
for i in range(len(sl)):
    temp.append(cosine(mat[i], mat[5]))
temp = torch.from_numpy(np.array(temp))

values, key = torch.topk(temp, k=6, dim=-1)
for i in range(5):
    print(sl[key[i+1]], '\n')

Similar sentences to: 
 the patient , diagnosed with coronavirus infection and treated at home is admitted to covid dedicated multidisciplinary hospital , where surgical care is provided . 

in case of covid suspicion , patient were sent to dedicated covid - unit at the emergency department . 

first , for patient with laboratory - confi rmed infection who required hospital admission for medical reason , we examined the risk of death , admission to icu , and mechanical ventilation . 

patient were advised very early to stay at home , avoid public transportation and unnecessary contact , and report every symptom before coming to the outpatient clinic . 

, bed assignment for patient being admitted from the emergency department who may require cohorting by their covid status . 

the patient should be admitted to the surgical department , where treatment is provided only to those covid negative . 



In [14]:
print('Similar sentences to: \n', sl[17], '\n')
temp = []
for i in range(len(sl)):
    temp.append(cosine(mat[i], mat[17]))
temp = torch.from_numpy(np.array(temp))

values, key = torch.topk(temp, k=6, dim=-1)
for i in range(5):
    print(sl[key[i+1]], '\n')

Similar sentences to: 
 feline infectious peritonitis . 

pneumoniae infection . 

reconditum infection . 

influenzae and atypical pathogen . 

influenzae and atypical pathogen . 

multiple organ dysfunction syndrome mod and shock were defined following international criterion . 



In [15]:
print('Similar sentences to: \n', sl[213], '\n')
temp = []
for i in range(len(sl)):
    temp.append(cosine(mat[i], mat[213]))
temp = torch.from_numpy(np.array(temp))

values, key = torch.topk(temp, k=6, dim=-1)
for i in range(5):
    print(sl[key[i+1]], '\n')

Similar sentences to: 
 in addition , we also observed that higher disease severity wa associated with more elevated plasma ifn - , il - 1b , il il ip and mcp concentration . 

they concluded that the local cardiac production of inflammatory cytokine containing il - 1β , il and il were elevated in patient with ac . 

all the other measured cytokine and chemokines il , il il il - 12p70 , tnf - , ifn - , rantes and mig did not show any significant difference before and after corticosteroid treatment . 

coagulation abnormality is also contained in low platelet count , increased fibrinogen , - dimer level and prolonged pt in severe patient . 

substantial difference were found among strain in the capacity to induce interleukin il and tumour necrosis factor tnf - , while the difference for il and il were le pronounced . 

alt , ast , tbil , bun , and cr level showed the significant increase , following decrease in albumin level a the main liver and kidney outcome in severe patient compared

In [17]:
print('Similar sentences to: \n', sl[2612], '\n')
temp = []
for i in range(len(sl)):
    temp.append(cosine(mat[i], mat[2612]))
temp = torch.from_numpy(np.array(temp))

values, key = torch.topk(temp, k=6, dim=-1)
for i in range(5):
    print(sl[key[i+1]], '\n')

Similar sentences to: 
 vivax infection in south indian city of mangaluru merozoite surface protein msp malaria still remains a serious health problem in several tropical region . 

glabrata in northern europe ; in the latter , acquired echinocandin resistance is emerging . 

climate change is likely to accelerate pandemic emergence because temperate zone encompass larger area of the globe , expanding vector territory and favoring bacterial pathogen . 

ibv is endemic in most country around the world and cause huge economical loss in the poultry industry . 

the map suggest that there are potential hotspot of disease emergence - particularly in central america , tropical africa and south asia - that warrant greater surveillance . 

infectious bronchitis virus ibv cause highly contagious disease in chicken and is one of the major problem of poultry industry in many country . 
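
The query cells above repeat the same loop for different sentence indices. A possible consolidation, sketched here with NumPy only, computes all cosine similarities in one matrix product. The name most_similar is hypothetical. Unlike the cells above, it excludes the query row explicitly instead of assuming that it ranks first.

```python
import numpy as np


def most_similar(mat, query_idx, k=5):
    """Return (indices, similarities) of the k rows of mat closest to row query_idx."""
    norms = np.linalg.norm(mat, axis=1)
    norms[norms == 0] = 1.0             # avoid division by zero for all-zero rows
    sims = (mat @ mat[query_idx]) / (norms * norms[query_idx])
    sims[query_idx] = -np.inf           # exclude the query itself
    top = np.argsort(-sims)[:k]         # indices sorted by decreasing similarity
    return top, sims[top]
```

With the notebook's variables, a call such as most_similar(mat, 1) would return the indices to look up in sl. Duplicate sentences in the corpus are not filtered out.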

