Our system relies on the Gensim library for embeddings. This notebook shows some basic examples of embedding training with Gensim. For more details, please refer to
https://radimrehurek.com/gensim/index.html

**Reading HealthTap data and preparing it for embedding generation**

In [1]:
import pandas

In [2]:
HealthTapExcel_Part1 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART1.xlsx")["question"].values.tolist()
HealthTapExcel_Part2 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART2.xlsx")["question"].values.tolist()
HealthTapExcel_Part3 = pandas.read_excel("data/HealthTap/healthtap_medical_qna_dataset_PART3.xlsx")["question"].values.tolist()

HealthTapDataFull = HealthTapExcel_Part1+HealthTapExcel_Part2+HealthTapExcel_Part3

**Gensim expects a nested list of tokenized sentences: the outer list holds the sentences, and each inner list holds the words of one sentence**

List[x][y], where x is the sentence index and y is the word index.

For example, List[5][3] is the word at index 3 of the sentence at index 5. Because Python indexing starts at 0, that is the 4th word of the 6th sentence.
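
As a small illustration of this layout, the sketch below builds a nested list from two made-up sentences (they are not taken from the HealthTap data) and indexes into it.

```python
# Two made-up sentences, used only to illustrate the layout Gensim expects.
raw_sentences = [
    "I have a mild headache",
    "Is ibuprofen safe with food",
]

# Outer list: one entry per sentence. Inner list: the words of that sentence.
tokenized = [sentence.split(" ") for sentence in raw_sentences]

# Indexing is zero-based, so [1][2] is the 3rd word of the 2nd sentence.
word = tokenized[1][2]
print(word)  # safe
```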

In [3]:
HealthTapDataFull = [str(x).split(" ") for x in HealthTapDataFull]

In [4]:
HealthTapDataFull[0]

['Like',
 '5',
 'yrs',
 'ago',
 'to',
 'now',
 "I've",
 'had',
 'some',
 'reoccurring',
 'itching',
 'and',
 'when',
 'I',
 'scratched',
 'some',
 'burning',
 'n',
 'vagina',
 'no',
 'pain',
 'never',
 'and',
 'never',
 'noticed',
 'sores..is',
 'it',
 'herpes?']
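
The output above shows that splitting on a single space keeps punctuation attached to words ('sores..is', 'herpes?') and preserves case, so "Wrong" and "wrong" become different vocabulary entries. In addition, `str(x)` turns a missing spreadsheet cell (NaN) into the token 'nan'. The following is a minimal sketch of a stricter tokenizer, assuming that lowercasing and dropping punctuation are acceptable for the downstream task. `tokenize_question` is a hypothetical helper name, not part of Gensim or of this project.

```python
import re

# Runs of letters/digits, optionally with one internal apostrophe (e.g. "i've").
_TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

def tokenize_question(text):
    """Lowercase `text` and return its word tokens; non-strings give []."""
    if not isinstance(text, str):
        # Missing spreadsheet cells arrive as float NaN, not as strings.
        return []
    return _TOKEN_PATTERN.findall(text.lower())

tokens = tokenize_question("never noticed sores..is it herpes?")
print(tokens)  # ['never', 'noticed', 'sores', 'is', 'it', 'herpes']
```

If this helper were used in place of the split, questions that tokenize to an empty list could be filtered out before training.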

**Training word2vec embeddings using Gensim**

In [5]:
from gensim.models import Word2Vec



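size = dimensionality of the embeddings (Gensim 4.0 and later name this parameter vector_size)

window = maximum distance between the target word and the context words used to predict it (the context window of the Continuous Bag of Words model in the Word2Vec paper)

min_count = minimum frequency; words that occur fewer times than this value are dropped from the vocabulary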
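
To make the last two parameters more concrete, here is a small pure-Python sketch of what they mean. It is not Gensim's implementation (the real training loop is more involved), and `filter_rare_words`, `context_windows` and the toy corpus are made up for illustration.

```python
from collections import Counter

def filter_rare_words(sentences, min_count):
    """Drop words whose total corpus frequency is below `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

def context_windows(sentence, window):
    """Yield (target, context_words) using up to `window` words on each side."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield target, left + right

toy_corpus = [
    ["pain", "in", "the", "left", "knee"],
    ["pain", "in", "the", "lower", "back"],
]

# Words seen only once ("left", "knee", "lower", "back") are removed.
filtered = filter_rare_words(toy_corpus, min_count=2)
print(filtered)  # [['pain', 'in', 'the'], ['pain', 'in', 'the']]

# With window=1, each target sees at most one word on each side.
pairs = list(context_windows(toy_corpus[0], window=1))
print(pairs[1])  # ('in', ['pain', 'the'])
```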

In [6]:
model = Word2Vec(HealthTapDataFull, size=300, window=5, min_count=1)

**Embedding of a random word**

In [7]:
model.wv["ago"]

array([ 0.23624176,  0.86371934, -0.59717935, -0.33274433,  2.1809297 ,
        1.0751551 ,  1.3771075 ,  0.21610369,  1.3021597 , -2.261604  ,
        0.76357377,  2.6477323 ,  0.31953207,  0.16070518, -0.8733566 ,
       -2.3634849 ,  1.4916706 , -1.0095518 , -2.7047503 , -0.21719232,
       -1.1655084 , -1.5482459 , -0.88501555,  2.6081882 , -4.3836875 ,
       -0.13019651, -1.3841583 ,  1.5802212 ,  0.7647551 ,  1.070706  ,
       -0.2813985 ,  1.0980543 ,  0.06336986,  1.9708833 , -2.192429  ,
        1.4093034 ,  2.102222  ,  0.5577079 ,  2.6753132 , -1.2378845 ,
       -0.01318395, -0.88287896, -0.72377306, -0.76236135, -1.9257376 ,
        1.6235664 , -0.49885786,  0.07376118,  0.04735577,  2.4873514 ,
       -0.05608618, -0.55282253,  1.8858175 ,  1.3656172 , -1.3066772 ,
        2.5739288 , -0.28250182,  0.9907314 , -2.0611129 , -0.8282988 ,
        2.3155067 ,  0.38717756,  0.9907071 ,  1.0324124 ,  2.5583866 ,
        3.4258902 , -1.6307019 ,  0.6070435 , -2.9895284 , -0.70

**Saving HealthTap Word2Vec vectors**

Warning: This will overwrite any previously trained HealthTap Word2Vec vectors saved at the same path.

In [8]:
model.wv.save("Embeddings/healthTapEmbedding.embed")
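
Because the save call above replaces whatever is already stored at that path, one option is to copy the existing files aside first. The sketch below is one possible way to do that with the standard library. `backup_existing` is a hypothetical helper, not part of Gensim or of this project. It matches on the path prefix on the assumption that Gensim may write large arrays into extra files next to the main one.

```python
import glob
import os
import shutil
import time

def backup_existing(path):
    """Copy `path` and files sharing its prefix into a timestamped folder.

    Returns the backup folder, or None when there is nothing to back up.
    """
    # Match the prefix rather than the single file, since large arrays
    # may be stored in sibling files that start with the same name.
    existing = [p for p in glob.glob(glob.escape(path) + "*") if os.path.isfile(p)]
    if not existing:
        return None
    backup_dir = f"{path}.backup-{time.strftime('%Y%m%d-%H%M%S')}"
    os.makedirs(backup_dir, exist_ok=True)
    for file_path in existing:
        shutil.copy2(file_path, backup_dir)
    return backup_dir
```

Calling `backup_existing("Embeddings/healthTapEmbedding.embed")` just before saving would keep a copy of the previous vectors.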

**Training FastText embeddings using Gensim**

In [9]:
from gensim.models import FastText

In [10]:
model = FastText(HealthTapDataFull, size=300, window=5, min_count=5)

**Embedding of a random word**

In [11]:
model.wv["Wrong"]

array([-4.24654186e-01,  6.38079822e-01,  8.33980858e-01, -5.74639775e-02,
        5.22777677e-01, -1.68366492e-01,  1.29339290e+00, -4.55708086e-01,
        4.69660275e-02, -7.89554358e-01,  5.74761212e-01,  9.32855830e-02,
        7.80359149e-01,  1.29782051e-01, -3.49001467e-01, -1.22761846e+00,
        9.73470151e-01,  9.02960181e-01, -2.90807605e-01, -2.64510602e-01,
        5.40391393e-02, -7.17941821e-01,  5.10859847e-01,  1.10535812e+00,
       -6.51779413e-01,  1.63895404e+00,  1.21418846e+00,  3.05036642e-02,
        4.07770753e-01, -8.32325339e-01,  3.00387979e-01,  2.47068748e-01,
        8.62963796e-01, -4.51068699e-01, -9.60161984e-01, -4.16189671e-01,
        3.01043361e-01, -5.96583366e-01,  5.17236173e-01, -1.18707508e-01,
       -1.22428477e+00, -5.23279667e-01, -2.18948051e-01,  7.15846196e-02,
       -6.49653614e-01,  2.83037245e-01, -2.56146252e-01,  2.17112929e-01,
        1.18035293e+00,  1.21132970e+00,  5.62645197e-02, -5.27279079e-01,
        7.95797706e-01, -

**FastText can also generate embeddings for words that are not in the corpus**

In [12]:
model.wv["ron"]

array([ 1.0641270e+00, -4.4317898e-01,  1.4017481e+00, -2.5567670e+00,
        2.5962510e+00, -2.3989165e+00,  1.8445976e-01, -1.4772114e+00,
        6.4381802e-01, -1.2886997e+00,  4.1982040e-01, -8.4903830e-01,
       -1.0403616e+00, -1.4109166e+00,  1.1609293e+00, -5.1313955e-01,
        4.0730423e-01,  2.4448764e+00, -1.3329360e+00,  6.8753302e-01,
       -3.6237946e-01, -3.2056470e+00,  3.8348430e-01,  6.8279552e-01,
       -2.1085920e+00,  4.3641672e+00,  1.0798147e+00, -1.4949433e+00,
        4.4921145e+00, -1.5892622e+00,  1.3344519e+00,  1.5045664e+00,
        3.4865922e-01,  3.6322594e-01, -1.5536114e+00, -2.7004006e+00,
       -1.1215789e+00, -1.1736128e+00,  6.8620509e-01, -1.4383031e+00,
       -3.8586097e+00,  5.1931828e-01, -2.2794168e+00,  2.7944164e+00,
       -2.1070967e+00, -1.3243779e+00,  2.6625004e-01,  1.7203839e+00,
        9.9426413e-01,  3.1758065e+00,  2.4291711e+00, -1.8124142e-01,
        1.4394394e+00, -1.7632482e+00, -5.4365546e-01,  2.7626872e-03,
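
FastText can return a vector for an unseen word because it builds word vectors from character n-grams, and an unseen word such as "ron" usually shares n-grams with words seen during training. The sketch below lists the n-grams of a word. It assumes the FastText convention of wrapping the word in `<` and `>` and Gensim's default n-gram lengths of 3 to 6 (`min_n=3`, `max_n=6`). `char_ngrams` is a hypothetical helper for illustration, not a Gensim function.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("ron"))
# ['<ro', 'ron', 'on>', '<ron', 'ron>', '<ron>']
```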
      

**Saving FastText model**

In [13]:
model.save("Embeddings/HealthTapFastText")
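
For completeness, here is a hedged sketch of how the two saved artifacts could be loaded back with Gensim's own `load` methods. The function names are hypothetical, and the default paths are simply the ones used in this notebook. The Gensim imports sit inside the functions only so that the sketch can be defined on a machine without Gensim installed.

```python
def load_word2vec_vectors(path="Embeddings/healthTapEmbedding.embed"):
    """Load vectors previously written with `model.wv.save(path)`."""
    from gensim.models import KeyedVectors  # requires Gensim
    return KeyedVectors.load(path)

def load_fasttext_model(path="Embeddings/HealthTapFastText"):
    """Load a full FastText model previously written with `model.save(path)`."""
    from gensim.models import FastText  # requires Gensim
    return FastText.load(path)
```

With both files in place, `load_word2vec_vectors()["ago"]` and `load_fasttext_model().wv["ron"]` would look up vectors in the same way as the cells above.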

**Note:**

The system automatically recognizes and integrates FastText embeddings located at

"Embeddings/HealthTapFastText" (relative path)

and Word2Vec embeddings located at

"Embeddings/healthTapEmbedding.embed"

For embeddings saved at these paths, you only need to select the FastText or HealthTap option of EmbeddingHandler to use them (examples are given in the Training Model Script).