Our system relies on the Gensim library for embeddings. This notebook shows some basic examples of training embeddings with Gensim; for more details, please refer to
https://radimrehurek.com/gensim/index.html

**Reading HealthTap data and preparing it for embedding generation**

In [1]:
import pandas

In [2]:
HealthTapExcel_Part1 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART1.xlsx")["question"].values.tolist()
HealthTapExcel_Part2 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART2.xlsx")["question"].values.tolist()
HealthTapExcel_Part3 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART3.xlsx")["question"].values.tolist()

HealthTapDataFull = HealthTapExcel_Part1+HealthTapExcel_Part2+HealthTapExcel_Part3

**Gensim expects a nested list of tokenized sentences: the outer list holds the sentences, and each inner list holds the words of one sentence**

List[x][y], where x is the sentence index and y is the word index within that sentence.

Because Python indexing starts at 0, List[5][3] for example would be the 4th word of the 6th sentence.
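As a small illustration of this layout, the sketch below builds the nested list from a handful of made-up sentences (they are not HealthTap data) and indexes into it. The variable names are chosen for this example only.

```python
# Toy sentences invented for illustration; not taken from the HealthTap dataset.
raw_sentences = [
    "is a mild fever dangerous",
    "how long does a cold last",
    "can stress cause headaches",
    "what helps a sore throat",
    "should I rest after a sprain",
    "when does a rash need treatment",
]

# Outer list: sentences. Inner lists: the words of each sentence.
corpus = [sentence.split(" ") for sentence in raw_sentences]

print(corpus[5])     # the sentence at index 5 (the 6th sentence)
print(corpus[5][3])  # the word at index 3 of that sentence: rash
```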

In [3]:
HealthTapDataFull = [str(x).split(" ") for x in HealthTapDataFull]

In [4]:
HealthTapDataFull[0]

['Like',
 '5',
 'yrs',
 'ago',
 'to',
 'now',
 "I've",
 'had',
 'some',
 'reoccurring',
 'itching',
 'and',
 'when',
 'I',
 'scratched',
 'some',
 'burning',
 'n',
 'vagina',
 'no',
 'pain',
 'never',
 'and',
 'never',
 'noticed',
 'sores..is',
 'it',
 'herpes?']
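The plain whitespace split leaves punctuation attached to tokens (for example 'sores..is' and 'herpes?' in the output above), so such strings become separate vocabulary entries. The following is a minimal sketch of a stricter tokenizer, assuming lowercasing and punctuation removal are acceptable for the downstream system; the `tokenize` helper and its regular expression are hypothetical and not part of the original pipeline.

```python
import re

# Runs of letters/digits, optionally joined by inner apostrophes (e.g. "i've").
# This pattern is an assumption made for illustration, not the system's tokenizer.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its word-like tokens without punctuation."""
    return TOKEN_PATTERN.findall(str(text).lower())

print(tokenize("never noticed sores..is it herpes?"))
# ['never', 'noticed', 'sores', 'is', 'it', 'herpes']
```

Lowercasing changes the vocabulary, so any later lookup in the trained vectors would need lowercase keys as well.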

**Training word2vec embeddings using Gensim**

In [5]:
from gensim.models import Word2Vec



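size = Dimensionality of the embedding vectors (in Gensim 4.0 and later this parameter is named vector_size)

window = Maximum distance between the current word and a predicted word within a sentence (the context window from the Word2Vec paper)

min_count = Ignores all words that occur fewer times than the specified value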
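To make the effect of the window and minimum-count settings concrete, here is a simplified plain-Python sketch. It is not Gensim's implementation: the two helper functions are hypothetical, and real training includes further details (such as randomly reduced windows and subsampling) that are left out here.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in s if counts[w] >= min_count] for s in sentences]

def context_pairs(sentence, window):
    """Pair each word with its neighbours at most `window` positions away."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

toy = [["fever", "and", "cough"], ["fever", "and", "chills"]]
print(filter_by_min_count(toy, min_count=2))
# [['fever', 'and'], ['fever', 'and']]

print(context_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

With a minimum count of 1, as in the training call in this notebook, no word is dropped, so even one-off typos receive a vector.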

In [6]:
model = Word2Vec(HealthTapDataFull, size=300, window=5, min_count=1)

**Embedding of a random word**

In [7]:
model.wv["ago"]

array([ 1.4985436 , -0.93943465,  0.68374467,  1.3171884 ,  0.5429083 ,
        0.0923715 , -1.0728434 , -3.7402248 , -0.93942845,  1.1618556 ,
        1.3701938 ,  2.0547423 , -2.9124913 ,  3.3194475 ,  3.2695813 ,
       -0.15496333,  1.9057144 ,  1.3711376 ,  0.00599014,  0.5740006 ,
       -2.375684  , -2.0654345 ,  1.5171928 ,  1.0854349 , -0.9539455 ,
        1.0907439 , -3.6334534 , -0.84238523, -2.1533766 , -2.5649467 ,
       -0.33561435, -1.2634087 ,  4.9782777 , -0.07704367,  2.4643593 ,
       -1.3824862 ,  0.22310168,  2.6727889 , -1.0735788 , -1.2835679 ,
        0.39647087, -4.3591037 ,  0.41256806, -2.2683375 , -0.12141391,
        1.2015678 , -0.04859994,  0.1735369 , -0.463938  ,  1.8183864 ,
        0.8871008 ,  1.9598567 ,  2.2676911 , -0.77150124,  0.08752996,
       -1.2322072 ,  0.8992477 , -1.1367015 ,  0.97264254, -0.49476513,
       -1.0159514 , -0.45841268,  0.19310912, -0.7945231 , -0.68279684,
       -2.617394  , -1.2409375 , -1.001362  , -1.2814157 , -1.05

**Saving HealthTap Word2Vec vectors**

Warning: This overwrites any previously saved HealthTap Word2Vec vectors at the same path.

In [8]:
model.wv.save("Embeddings/healthTapEmbedding.embed")
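If overwriting is a concern, one option is to pick an unused file name before saving. The sketch below uses a hypothetical helper, `next_free_path`, which is not part of Gensim or of this system; it only inspects the file system and leaves the actual save call commented out.

```python
import os

def next_free_path(path):
    """Return `path` if nothing exists there, else the first unused 'stem_N.ext'."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    n = 1
    while os.path.exists(f"{stem}_{n}{ext}"):
        n += 1
    return f"{stem}_{n}{ext}"

save_path = next_free_path("Embeddings/healthTapEmbedding.embed")
print(save_path)
# model.wv.save(save_path)  # run in the notebook, where `model` is defined
```

Vectors saved under a different name are kept as a backup only; loading them elsewhere requires pointing at that exact path.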

**Training FastText embeddings using Gensim**

In [9]:
from gensim.models import FastText

In [10]:
model = FastText(HealthTapDataFull, size=300, window=5, min_count=5)

KeyboardInterrupt: 

**Embedding of a random word**

In [None]:
model.wv["Wrong"]

**FastText can also generate embeddings for words that are not in the training corpus, because it builds word vectors from character n-grams**

In [None]:
model.wv["ron"]
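The reason an unseen word can still get a vector is that FastText represents words through their character n-grams. The sketch below only lists the n-grams of a word, using n-gram lengths 3 to 6 (Gensim's documented defaults for min_n and max_n) and the "<" and ">" boundary markers from the FastText paper. The `char_ngrams` helper is hypothetical; the real implementation additionally hashes n-grams into buckets and combines their learned vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("ron"))
# ['<ro', 'ron', 'on>', '<ron', 'ron>', '<ron>']
```

An out-of-vocabulary word such as "ron" shares n-grams like 'ron' with words seen during training (for example "chronic"), which is what lets the model compose a vector for it.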

**Saving FastText model**

In [None]:
model.save("Embeddings/HealthTapFastText")

**Note:**

The system automatically recognizes and loads FastText embeddings located at

"Embeddings/HealthTapFastText" (relative path)

and

Word2Vec embeddings located at

"Embeddings/healthTapEmbedding.embed"

For embeddings saved at these paths, it is enough to select the FastText or HealthTap option of EmbeddingHandler to use them (examples are given in the Training Model script).
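For reference, the saved files can also be loaded directly with Gensim, outside of EmbeddingHandler. The sketch below is not EmbeddingHandler's implementation: the `load_saved_vectors` helper and the `SAVED_PATHS` mapping are hypothetical, and only the two paths come from this notebook.

```python
import os

# Paths as saved in this notebook; the dictionary keys are illustrative labels only.
SAVED_PATHS = {
    "FastText": "Embeddings/HealthTapFastText",
    "HealthTap": "Embeddings/healthTapEmbedding.embed",
}

def load_saved_vectors(option, base_dir="."):
    """Load the vectors saved by this notebook for the given option."""
    path = os.path.join(base_dir, SAVED_PATHS[option])
    if not os.path.exists(path):
        raise FileNotFoundError(f"No saved embeddings at {path}")
    # Imported lazily so the path check works even where Gensim is not installed.
    from gensim.models import FastText, KeyedVectors
    if option == "FastText":
        return FastText.load(path).wv   # full model was saved with model.save(...)
    return KeyedVectors.load(path)      # vectors were saved with model.wv.save(...)

# Example usage, once the files exist:
# vectors = load_saved_vectors("HealthTap")
# vectors["ago"]
```

The two branches differ because the Word2Vec cell saves only the word vectors, while the FastText cell saves the whole model.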