### Remark/Warning: 

To run this code you need your own copy of "J. K. Rowling - Harry Potter 1 - Sorcerer’s Stone".

## 1. Bigram Language Model

Build a bigram language model on the data "J. K. Rowling - Harry Potter 1 - Sorcerer’s Stone".


1. Tokenize the data into a list of words using the function nltk.wordpunct_tokenize; convert all words to lower case; remove stop words and keep only alphabetic terms.
2. Build a vocabulary from the tokens. How many unique words do you find?
3. Build a frequency table where the rows represent the word $w_{t-1}$ and the columns represent the following word $w_t$. Count the occurrences of each word pair $C(w_{t-1}w_t)$.
4. How many times do the following words occur: "harry", "stone", "hagrid", "feeling", "living"? (Hint: what if you sum all the numbers in a row?)
5. Calculate the following conditional probabilities: p(potter|harry), p(said|harry), p(knows|everyone).

In [1]:
# Loading the data set + libraries 

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
import nltk

# The file name is kept as in the original notebook; adjust it to your own copy.
with open('J. K. Rowling - Harry Potter 1 - Sorcerer\'s Stone', 'r') as file:
    harrypotter_corpus = file.read()


## 1. Tokenize the data into a list of words using the function nltk.wordpunct_tokenize; convert all words to lower case; remove stop words and only keep alphabetic terms.

In [2]:
## Tokenization, dictionary.

nltk.download('stopwords') # we delete she, you ... etc
stop_words = set(stopwords.words('english'))
print(stop_words)

{'these', 'doesn', 'his', 'couldn', 'or', 'yours', 'off', 'them', 'does', "it's", 'weren', 'is', 'down', "needn't", 'has', 'whom', 'to', 'hers', 'theirs', 't', 'your', 'me', 'because', 'it', 'mightn', 'i', 'other', 'll', "shan't", 'yourselves', 'and', 'at', 'how', 'no', 'we', "won't", 'where', 'hadn', 'been', 'during', 'needn', 'until', 'will', 'shouldn', 'ourselves', "isn't", 'am', 'have', 'the', 'each', 'herself', 'can', 'by', 'who', 'm', 'more', 'aren', 's', 'when', 'into', 'after', 'he', 'own', 'but', 'from', 'themselves', 'be', 'had', 'than', "hasn't", 'up', 'out', 're', 'through', 'our', 'being', 'mustn', 'nor', 'doing', 'she', 'now', 'hasn', 'my', 'few', 'here', 'wasn', 'this', 'an', 'a', 'on', 'above', 'over', "wasn't", 'of', 'against', "aren't", 'before', "didn't", 'd', 'its', "you'll", "don't", 'just', 'was', 'some', "doesn't", 'once', 'too', "shouldn't", 'under', 'ours', 'you', 'same', 'most', 'again', 'further', 'what', 'itself', 'between', 'such', 'are', 'so', "you've", 't

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/victormaciamedina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# tokenization

#nltk.download('punkt')
word_tokens = wordpunct_tokenize(harrypotter_corpus)


In [4]:
word_tokens = [w.lower() for w in word_tokens]
word_tokens = [w for w in word_tokens if w not in stop_words]
word_tokens = [w for w in word_tokens if w.isalpha()]  # keep tokens made only of letters
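The three filtering steps can also be done in a single pass. The following is a minimal sketch, not the notebook's own code: `clean_tokens`, `STOP_WORDS` and the sample sentence are hypothetical, the stop-word set is a tiny stand-in for `stopwords.words('english')`, and the regular expression only approximates what `wordpunct_tokenize` does.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses stopwords.words('english').
STOP_WORDS = {"the", "and", "a", "of", "was", "he", "to"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lower-case, drop stop words and keep alphabetic tokens in one pass."""
    # Split into runs of word characters or runs of punctuation.
    raw = re.findall(r"\w+|[^\w\s]+", text)
    return [
        w for w in (t.lower() for t in raw)
        if w.isalpha() and w not in stop_words
    ]

sample = "Harry Potter was the Boy Who Lived, and he knew 4 owls."
tokens = clean_tokens(sample)
print(tokens)
```

Lower-casing before the stop-word test matters: the stop-word list is lower case, so "The" would otherwise slip through.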

In [5]:
word_tokens

['harry',
 'potter',
 'sorcerer',
 'stone',
 'chapter',
 'one',
 'boy',
 'lived',
 'mr',
 'mrs',
 'dursley',
 'number',
 'four',
 'privet',
 'drive',
 'proud',
 'say',
 'perfectly',
 'normal',
 'thank',
 'much',
 'last',
 'people',
 'expect',
 'involved',
 'anything',
 'strange',
 'mysterious',
 'hold',
 'nonsense',
 'mr',
 'dursley',
 'director',
 'firm',
 'called',
 'grunnings',
 'made',
 'drills',
 'big',
 'beefy',
 'man',
 'hardly',
 'neck',
 'although',
 'large',
 'mustache',
 'mrs',
 'dursley',
 'thin',
 'blonde',
 'nearly',
 'twice',
 'usual',
 'amount',
 'neck',
 'came',
 'useful',
 'spent',
 'much',
 'time',
 'craning',
 'garden',
 'fences',
 'spying',
 'neighbors',
 'dursleys',
 'small',
 'son',
 'called',
 'dudley',
 'opinion',
 'finer',
 'boy',
 'anywhere',
 'dursleys',
 'everything',
 'wanted',
 'also',
 'secret',
 'greatest',
 'fear',
 'somebody',
 'would',
 'discover',
 'think',
 'could',
 'bear',
 'anyone',
 'found',
 'potters',
 'mrs',
 'potter',
 'mrs',
 'dursley',
 '

## Build a vocabulary from the tokens

In [6]:
vocab = set(word_tokens)

## How many unique words do you find?

In [7]:
len(vocab)

5615

## Build a frequency table where the rows represent the word $w_{t-1}$ and the columns represent the following word $w_t$. Count the occurrences of each word pair $C(w_{t-1}w_t)$

In [8]:
char_to_int = dict((c,i) for i,c in enumerate(vocab))
int_to_char = dict((i,c) for i,c in enumerate(vocab))

In [9]:
# X[i, j] counts how often vocabulary word j occurs directly before token i.
X = np.zeros((len(word_tokens), len(vocab)))

window_size = 1  # look back one word only (bigram context)

for i, word in enumerate(word_tokens):
    w2v = np.zeros(len(vocab))
    for icontext in range(max(i - window_size, 0), i):
        w2v[char_to_int[word_tokens[icontext]]] += 1
    X[i] = w2v

In [10]:
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
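The dense matrix above has one row per token and one column per vocabulary word, and almost all of its entries are zero. As an alternative, the same pair counts can be kept sparsely. This is a minimal sketch on toy data, assuming only the standard library; `bigram_counts` and `toy` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each consecutive pair (w_{t-1}, w_t)."""
    # zip pairs every token with its successor; the last token has none.
    return Counter(zip(tokens, tokens[1:]))

toy = ["harry", "potter", "said", "harry", "potter", "lived"]
pairs = bigram_counts(toy)

print(pairs[("harry", "potter")])
print(pairs[("potter", "harry")])  # unseen pairs count as 0
```

Only pairs that actually occur are stored, so memory grows with the number of distinct bigrams rather than with tokens times vocabulary size.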

## How many times do the following words occur: "harry", "stone", "hagrid", "feeling", "living"? (Hint: what if you sum all the numbers in a row?)

In [11]:
colnames = [int_to_char[i] for i in range(len(vocab))]
word2vecrep = pd.DataFrame(X, columns = colnames)

In [12]:
vocab

{'hygienic',
 'trousers',
 'twentieth',
 'bird',
 'saturday',
 'photos',
 'cough',
 'could',
 'stuff',
 'sensible',
 'tea',
 'proof',
 'kinda',
 'group',
 'studying',
 'puddin',
 'pickled',
 'nervous',
 'dried',
 'crumpled',
 'side',
 'held',
 'stubs',
 'beechwood',
 'greeting',
 'surprised',
 'knot',
 'balloons',
 'snigget',
 'spun',
 'largest',
 'nah',
 'squinting',
 'six',
 'possession',
 'fed',
 'insisting',
 'pelt',
 'holds',
 'recognize',
 'briskly',
 'casually',
 'courage',
 'dumpy',
 'hall',
 'calmly',
 'photographs',
 'goblets',
 'trunks',
 'furor',
 'keep',
 'broomsticks',
 'charms',
 'wheeling',
 'jars',
 'chilled',
 'easily',
 'crush',
 'often',
 'scraped',
 'dinnertime',
 'experts',
 'pile',
 'plunged',
 'flopped',
 'worth',
 'drag',
 'hoops',
 'drained',
 'whispering',
 'scar',
 'africa',
 'seeping',
 'quill',
 'warty',
 'vault',
 'provoked',
 'headmistress',
 'gossiped',
 'shadows',
 'paint',
 'perfect',
 'fading',
 'ring',
 'git',
 'contains',
 'smaller',
 'headed',
 'p

In [13]:
word2vecrep

Unnamed: 0,hygienic,trousers,twentieth,bird,saturday,photos,cough,could,stuff,sensible,...,silent,halloween,fastenings,oaf,scandal,begins,ajar,raged,shaped,grinning
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40759,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40760,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
word2vecrep['wordkey'] = word_tokens
word2vecrep['wordkey']

0           harry
1          potter
2        sorcerer
3           stone
4         chapter
           ...   
40757         lot
40758         fun
40759      dudley
40760      summer
40761         end
Name: wordkey, Length: 40762, dtype: object

In [15]:
word2vecbycount = word2vecrep.groupby(['wordkey']).sum()

In [16]:
# Occurrences of the word harry

word2vecbycount.loc['harry'].sum()

1326.0

In [17]:
# Occurrences of the word stone

word2vecbycount.loc['stone'].sum()

80.0

In [18]:
# Occurrences of the word hagrid

word2vecbycount.loc['hagrid'].sum()

370.0

In [19]:
# Occurrences of the word feeling

word2vecbycount.loc['feeling'].sum()

25.0

In [20]:
# Occurrences of the word living

word2vecbycount.loc['living'].sum()

11.0

In [21]:
word2vecbycount.loc['potter','harry'] # occurrences of the bigram "harry potter" (row = current word, column = previous word)

30.0

In [22]:
word2vecbycount.loc['harry','potter'] # occurrences of the bigram "potter harry"

6.0

In [23]:
word2vecbycount.loc['walls','stone'] # occurrences of the bigram "stone walls"

1.0

## Calculate the following conditional probabilities p(potter|harry), p(said|harry), p(knows|everyone).

The questions concern the pair $w_{t-1}w_t$, that is, two consecutive words. In the table word2vecbycount built above, the rows are labelled by wordkey, the current word $w_t$, and the columns by the word that precedes it, $w_{t-1}$. For that reason, to calculate
$$P(w_t|w_{t-1}) = \frac{C(w_{t-1}w_t)}{C(w_{t-1})}$$
we read the pair count from word2vecbycount.loc['w_t','w_{t-1}'] and divide it by the number of occurrences of $w_{t-1}$. The order is important because only the previous word is used as context, not the previous and the subsequent word.

$P(w_t|w_{t-1})$ is the probability of getting $w_t$ given $w_{t-1}$.
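As a small worked example, the maximum-likelihood estimate $P(w_t|w_{t-1}) = C(w_{t-1}w_t)/C(w_{t-1})$ can be computed directly from pair counts. The sketch below uses toy data and hypothetical names (`bigram_probability`, `toy`); it illustrates the formula and is not the notebook's implementation.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count prev_word only where it has a successor, so that the
    # probabilities over all possible next words sum to 1.
    prev_counts = Counter(tokens[:-1])
    if prev_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]

toy = ["harry", "potter", "said", "harry", "said", "harry", "potter"]

p_potter = bigram_probability(toy, "harry", "potter")
p_said = bigram_probability(toy, "harry", "said")
print(p_potter, p_said)
```

In the toy sequence "harry" is followed by "potter" twice and by "said" once, so the two estimates are 2/3 and 1/3 and add up to 1.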



In [24]:
# p(potter|harry)

word2vecbycount.loc['potter','harry']/word2vecbycount.loc['harry'].sum()

# potter comes first because rows hold the current word w_t and columns the previous word w_{t-1}.


0.02262443438914027

In [25]:
# p(said|harry)

word2vecbycount.loc['said','harry']/word2vecbycount.loc['harry'].sum()

# said comes first because rows hold the current word w_t and columns the previous word w_{t-1}.



0.015082956259426848

In [26]:
# p(knows|everyone)

word2vecbycount.loc['knows','everyone']/word2vecbycount.loc['everyone'].sum()

# knows comes first because rows hold the current word w_t and columns the previous word w_{t-1}.



0.06666666666666667

In [27]:
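# p(everyone|knows): note that the line below repeats the p(knows|everyone) computation.
# The reversed probability would instead use
# word2vecbycount.loc['everyone','knows']/word2vecbycount.loc['knows'].sum()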

word2vecbycount.loc['knows','everyone']/word2vecbycount.loc['everyone'].sum()

0.06666666666666667

## 2. Word2vec: Medical Transcripts
Understanding medical notes is a challenging NLP problem. Many useful applications become possible if a machine can read doctors’ notes and interpret the underlying medical conditions and their severity. In this exercise, you are given a simple data set of 5000 medical cases, "medicaltranscriptions.csv". Each case has the transcript and the associated medical specialty. Please
1. For the "description" of each individual, use the "word_tokenize" function from nltk and convert the corpus into a list of words.
2. Create a vocabulary containing all words that appear in the descriptions. Count the total number of occurrences of each word. List the 10 words with the highest occurrence. Are those words related to medical terms?
3. Convert the words in the vocabulary of question (2) to continuous vectors, using the pretrained word-to-vector dictionary "PubMed-and-PMC-w2v.bin". You may download the data from http://evexdb.org/pmresources/vec-space-models. Do all words have a corresponding vector representation?
4. Calculate the cosine similarity of the following word pairs: "allergy/allergic"; "heart/lung"; "water/heart". Do the similarity measures make sense to you?

In [28]:
# Loading the data set

medical = pd.read_csv('medicaltranscriptions.csv')

In [29]:
medical

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."
...,...,...,...,...,...,...
4994,4994,Patient having severe sinusitis about two to ...,Allergy / Immunology,Chronic Sinusitis,"HISTORY:, I had the pleasure of meeting and e...",
4995,4995,This is a 14-month-old baby boy Caucasian who...,Allergy / Immunology,Kawasaki Disease - Discharge Summary,"ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...","allergy / immunology, mucous membranes, conjun..."
4996,4996,A female for a complete physical and follow u...,Allergy / Immunology,Followup on Asthma,"SUBJECTIVE: , This is a 42-year-old white fema...",
4997,4997,Mother states he has been wheezing and coughing.,Allergy / Immunology,Asthma in a 5-year-old,"CHIEF COMPLAINT: , This 5-year-old male presen...",


In [30]:
## 1. For the "description" of each individual,use "word_tokenize" 
##    function from nltk and convert the corpus into a list of words.

descriptions = medical['description'].values

In [31]:
descriptions

array([' A 23-year-old white female presents with complaint of allergies.',
       ' Consult for laparoscopic gastric bypass.',
       ' Consult for laparoscopic gastric bypass.', ...,
       ' A female for a complete physical and follow up on asthma with allergic rhinitis.',
       ' Mother states he has been wheezing and coughing.',
       ' Acute allergic reaction, etiology uncertain, however, suspicious for Keflex.'],
      dtype=object)

In [32]:
# Stop words

nltk.download('stopwords') # we delete she, you ... etc
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/victormaciamedina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
print(stop_words)

{'these', 'doesn', 'his', 'couldn', 'or', 'yours', 'off', 'them', 'does', "it's", 'weren', 'is', 'down', "needn't", 'has', 'whom', 'to', 'hers', 'theirs', 't', 'your', 'me', 'because', 'it', 'mightn', 'i', 'other', 'll', "shan't", 'yourselves', 'and', 'at', 'how', 'no', 'we', "won't", 'where', 'hadn', 'been', 'during', 'needn', 'until', 'will', 'shouldn', 'ourselves', "isn't", 'am', 'have', 'the', 'each', 'herself', 'can', 'by', 'who', 'm', 'more', 'aren', 's', 'when', 'into', 'after', 'he', 'own', 'but', 'from', 'themselves', 'be', 'had', 'than', "hasn't", 'up', 'out', 're', 'through', 'our', 'being', 'mustn', 'nor', 'doing', 'she', 'now', 'hasn', 'my', 'few', 'here', 'wasn', 'this', 'an', 'a', 'on', 'above', 'over', "wasn't", 'of', 'against', "aren't", 'before', "didn't", 'd', 'its', "you'll", "don't", 'just', 'was', 'some', "doesn't", 'once', 'too', "shouldn't", 'under', 'ours', 'you', 'same', 'most', 'again', 'further', 'what', 'itself', 'between', 'such', 'are', 'so', "you've", 't

In [34]:
# Tokenize each description (wordpunct_tokenize is used here, as in Part 1)

X = [wordpunct_tokenize(d) for d in descriptions]

for i in range(len(X)):
    X[i] = [w.lower() for w in X[i]]
    X[i] = [w for w in X[i] if w not in stop_words]
    X[i] = [w for w in X[i] if w.isalpha()]

In [35]:
# sentence by sentence

Z = X
Z

[['year', 'old', 'white', 'female', 'presents', 'complaint', 'allergies'],
 ['consult', 'laparoscopic', 'gastric', 'bypass'],
 ['consult', 'laparoscopic', 'gastric', 'bypass'],
 ['mode', 'doppler'],
 ['echocardiogram'],
 ['morbid',
  'obesity',
  'laparoscopic',
  'antecolic',
  'antegastric',
  'roux',
  'en',
  'gastric',
  'bypass',
  'eea',
  'anastomosis',
  'year',
  'old',
  'female',
  'overweight',
  'many',
  'years',
  'tried',
  'many',
  'different',
  'diets',
  'unsuccessful'],
 ['liposuction',
  'supraumbilical',
  'abdomen',
  'revision',
  'right',
  'breast',
  'reconstruction',
  'excision',
  'soft',
  'tissue',
  'fullness',
  'lateral',
  'abdomen',
  'flank'],
 ['echocardiogram'],
 ['suction', 'assisted', 'lipectomy', 'lipodystrophy', 'abdomen', 'thighs'],
 ['echocardiogram', 'doppler'],
 ['morbid',
  'obesity',
  'laparoscopic',
  'roux',
  'en',
  'gastric',
  'bypass',
  'antecolic',
  'antegastric',
  'mm',
  'eea',
  'anastamosis',
  'esophagogastroduodenos

In [36]:
# append all the sentences

E = X[0]
for i in range(1,len(X)):
     E = np.append(E,X[i])

E

array(['year', 'old', 'white', ..., 'however', 'suspicious', 'keflex'],
      dtype='<U32')
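Appending to a NumPy array inside a loop copies the whole array on every step. A minimal alternative sketch for flattening the per-description token lists, using only the standard library, is shown below; `sentences`, `all_tokens` and `vocabulary` are hypothetical names on toy data, and the result is a plain Python list rather than a NumPy array.

```python
from itertools import chain

# Toy stand-in for the per-description token lists.
sentences = [
    ["consult", "gastric", "bypass"],
    ["echocardiogram"],
    ["gastric", "bypass"],
]

# Flatten the list of lists into one list of tokens, preserving order.
all_tokens = list(chain.from_iterable(sentences))

# The vocabulary is the set of distinct tokens.
vocabulary = set(all_tokens)

print(all_tokens)
print(len(vocabulary))
```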

In [37]:
# all the words - tokenization

word_tokens2 = E
word_tokens2

array(['year', 'old', 'white', ..., 'however', 'suspicious', 'keflex'],
      dtype='<U32')

In [38]:
# If the task is to convert the corpus into a single list of words,
# we could write everything to a plain text file and then proceed as in Part 1.
#
# This covers question 2: the vocabulary is the union of all token sets.

A = set(X[0])

for i in range(1,len(descriptions)):
     B = set(X[i])
     A = A.union(B)
len(A)

5286

In [39]:
# The tokens are already lower-cased and filtered; this pass mainly turns the set into a list.
A = [w.lower() for w in A]
A = [w for w in A if w not in stop_words and w.isalpha()]

In [40]:
vocab = A
vocab # unique words - vocabulary

['adhesions',
 'dural',
 'neotal',
 'supraglottitis',
 'tolerating',
 'va',
 'rings',
 'complicated',
 'robotics',
 'microcalcifications',
 'generalization',
 'photos',
 'cough',
 'nocturia',
 'could',
 'boney',
 'non',
 'history',
 'drill',
 'osteogenic',
 'profile',
 'decompression',
 'company',
 'cerebellar',
 'came',
 'group',
 'tweezers',
 'tinea',
 'perianal',
 'metaphyseal',
 'subclavian',
 'vertebroplasty',
 'contusions',
 'bladder',
 'impression',
 'denies',
 'myelofibrosis',
 'identification',
 'tit',
 'side',
 'thickness',
 'serum',
 'frontal',
 'comprehensive',
 'nsvd',
 'intractable',
 'workload',
 'willful',
 'poly',
 'advantage',
 'dilator',
 'iiic',
 'adjacent',
 'presented',
 'sellar',
 'titanium',
 'cells',
 'bones',
 'sarcoma',
 'endotine',
 'centered',
 'peek',
 'dates',
 'hemivulvectomy',
 'temperature',
 'cul',
 'bm',
 'pelvic',
 'six',
 'diagnoses',
 'wire',
 'fed',
 'pronounced',
 'defibrillation',
 'occasionally',
 'pilonidal',
 'explantation',
 'ectopic',
 'wo

In [41]:
word_tokens2 = word_tokens2.tolist()

In [42]:
# count the total number of appearances of each word

char_to_int = dict((c,i) for i,c in enumerate(vocab))
int_to_char = dict((i,c) for i,c in enumerate(vocab))

In [43]:
word_tokens2

['year',
 'old',
 'white',
 'female',
 'presents',
 'complaint',
 'allergies',
 'consult',
 'laparoscopic',
 'gastric',
 'bypass',
 'consult',
 'laparoscopic',
 'gastric',
 'bypass',
 'mode',
 'doppler',
 'echocardiogram',
 'morbid',
 'obesity',
 'laparoscopic',
 'antecolic',
 'antegastric',
 'roux',
 'en',
 'gastric',
 'bypass',
 'eea',
 'anastomosis',
 'year',
 'old',
 'female',
 'overweight',
 'many',
 'years',
 'tried',
 'many',
 'different',
 'diets',
 'unsuccessful',
 'liposuction',
 'supraumbilical',
 'abdomen',
 'revision',
 'right',
 'breast',
 'reconstruction',
 'excision',
 'soft',
 'tissue',
 'fullness',
 'lateral',
 'abdomen',
 'flank',
 'echocardiogram',
 'suction',
 'assisted',
 'lipectomy',
 'lipodystrophy',
 'abdomen',
 'thighs',
 'echocardiogram',
 'doppler',
 'morbid',
 'obesity',
 'laparoscopic',
 'roux',
 'en',
 'gastric',
 'bypass',
 'antecolic',
 'antegastric',
 'mm',
 'eea',
 'anastamosis',
 'esophagogastroduodenoscopy',
 'normal',
 'left',
 'ventricle',
 'moder

In [44]:
# X[i, j] counts how often vocabulary word j occurs within window_size positions of token i.
X = np.zeros((len(word_tokens2), len(vocab)))

window_size = 3  # 3 words before and 3 words after the target

for i, word in enumerate(word_tokens2):
    w2v = np.zeros(len(vocab))
    for icontext in range(max(i - window_size, 0), min(i + window_size, len(word_tokens2) - 1) + 1):
        if icontext != i:
            w2v[char_to_int[word_tokens2[icontext]]] += 1
    X[i] = w2v

In [45]:
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [46]:
colnames = [int_to_char[i] for i in range(len(vocab))]
word2vecrep = pd.DataFrame(X, columns = colnames)

In [47]:
total_appearance = len(vocab)*[0]

In [48]:
for i,word in enumerate(vocab):
    total_appearance[i] = word2vecrep[word].sum()
    

In [49]:
total_appearance

[402.0,
 36.0,
 12.0,
 12.0,
 24.0,
 12.0,
 12.0,
 66.0,
 18.0,
 6.0,
 12.0,
 60.0,
 276.0,
 24.0,
 60.0,
 30.0,
 258.0,
 2874.0,
 18.0,
 12.0,
 24.0,
 588.0,
 18.0,
 12.0,
 156.0,
 24.0,
 12.0,
 30.0,
 24.0,
 12.0,
 186.0,
 12.0,
 12.0,
 582.0,
 12.0,
 168.0,
 12.0,
 12.0,
 12.0,
 270.0,
 84.0,
 12.0,
 264.0,
 96.0,
 12.0,
 90.0,
 24.0,
 12.0,
 12.0,
 12.0,
 12.0,
 30.0,
 90.0,
 348.0,
 12.0,
 108.0,
 30.0,
 54.0,
 30.0,
 18.0,
 24.0,
 54.0,
 6.0,
 12.0,
 42.0,
 36.0,
 12.0,
 552.0,
 114.0,
 12.0,
 126.0,
 12.0,
 24.0,
 24.0,
 18.0,
 24.0,
 12.0,
 72.0,
 342.0,
 228.0,
 48.0,
 30.0,
 24.0,
 12.0,
 66.0,
 84.0,
 12.0,
 258.0,
 12.0,
 12.0,
 12.0,
 24.0,
 552.0,
 12.0,
 12.0,
 384.0,
 114.0,
 78.0,
 108.0,
 18.0,
 132.0,
 12.0,
 12.0,
 18.0,
 12.0,
 30.0,
 18.0,
 66.0,
 12.0,
 12.0,
 126.0,
 12.0,
 24.0,
 12.0,
 18.0,
 12.0,
 24.0,
 30.0,
 102.0,
 36.0,
 12.0,
 12.0,
 66.0,
 12.0,
 48.0,
 12.0,
 12.0,
 12.0,
 150.0,
 36.0,
 36.0,
 12.0,
 84.0,
 108.0,
 12.0,
 30.0,
 78.0,
 42.0,
 36.0,


In [50]:
totaldf = pd.DataFrame(np.array(total_appearance).reshape(1,-1), columns = vocab)

In [51]:
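# number of appearances in context windows: each occurrence of a word is counted
# once per neighbour within window_size, so these totals exceed the raw word frequencies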

totaldf


Unnamed: 0,adhesions,dural,neotal,supraglottitis,tolerating,va,rings,complicated,robotics,microcalcifications,...,bothers,administering,temple,trapezius,implantable,cleft,eluting,shaped,bronchodilator,severely
0,402.0,36.0,12.0,12.0,24.0,12.0,12.0,66.0,18.0,6.0,...,12.0,24.0,24.0,12.0,24.0,36.0,48.0,24.0,6.0,12.0


In [52]:
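# List the top 10 words that have the highest occurrence.
# Caution: max(p) on a DataFrame compares the column labels, not the counts.

tenmax = 10*['a']
p = totaldf  # alias, not a copy: deleting from p also deletes columns of totaldf
i = 0
for word in totaldf.columns:
    if i < 10:
        tenmax[i] = max(p)
        del p[max(p)]
        i += 1
    else:
        del p
        break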

    
tenmax

['zygomatic',
 'zones',
 'zone',
 'zometa',
 'zimmer',
 'zephyr',
 'z',
 'youngswick',
 'young',
 'yielding']

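These are the ten alphabetically last words of the vocabulary, not the ten most frequent ones, because max applied to a DataFrame compares column labels rather than counts. The single letter 'z' appears because it is not in the stop_words.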
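Ranking by frequency needs the counts, not the labels, as the sort key. A minimal sketch with `collections.Counter` follows; the token list is hypothetical toy data, and in the notebook the input would be the full token list `word_tokens2`.

```python
from collections import Counter

# Toy stand-in for the flattened token list.
toy_tokens = ["patient", "left", "pain", "patient", "right", "left", "patient", "z"]

counts = Counter(toy_tokens)

# most_common sorts by count, highest first.
top_two = counts.most_common(2)
print(top_two)

# Pitfall: max over the container compares the keys themselves,
# so it returns the alphabetically last word, whatever its count.
alphabetically_last = max(counts)
print(alphabetically_last)
```

The same distinction applies to a pandas DataFrame: iterating over it yields column labels, so a frequency ranking has to sort on the values instead.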

## Are those words related to medical terms?

To answer this question we need to study the context, that is, the surrounding words. Here we use the 3 words before and after the target, and look at which words co-occur with each of the ten words listed above.
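The idea of collecting the neighbours of a target word can be sketched without the dense context matrix. The function `context_words` and the toy token list below are hypothetical; the window logic mirrors the symmetric window described above, with a smaller window size for the toy data.

```python
from collections import defaultdict

def context_words(tokens, window_size=3):
    """Map each word to the set of words seen within window_size positions."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(i - window_size, 0)
        hi = min(i + window_size, len(tokens) - 1)
        for j in range(lo, hi + 1):
            if j != i:  # skip the target position itself
                contexts[word].add(tokens[j])
    return contexts

toy = ["bilateral", "knee", "arthroplasty", "zimmer", "nexgen", "total", "knee"]
ctx = context_words(toy, window_size=2)
print(sorted(ctx["zimmer"]))
```

Because the window is symmetric, asking "which words have zimmer in their window" and "which words are in zimmer's window" give the same set.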

In [53]:
# named top_words to avoid shadowing the built-in list
top_words = ['zygomatic',
 'zones',
 'zone',
 'zometa',
 'zimmer',
 'zephyr',
 'z',
 'youngswick',
 'young',
 'yielding']

relatedornot = word2vecrep[top_words]



In [54]:
# words before or after zones 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.zones == 1].wordkey.unique():
    print(w)


selective
neck
dissection
skin
biopsy
performed
endoscopic
proximal
distal
patient
year
old


The word zones is used to talk about medical terms.

In [55]:
# words before or after zone 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.zone == 1].wordkey.unique():
    print(w)

malignant
mesothelioma
marginal
lymphoma
malt
type
syndrome
resolved


The word zone is used to talk about medical terms.

In [56]:
# words before or after zometa 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.zometa == 1].wordkey.unique():
    print(w)

previous
treatments
included
faslodex
aromasin
found


The word zometa is used to talk about medical terms.

In [57]:
# words before or after zimmer 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.zimmer == 1].wordkey.unique():
    print(w)

bilateral
knee
arthroplasty
nexgen
total


The word zimmer is used to talk about medical terms.

In [58]:
# words before or after zephyr 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.zephyr == 1].wordkey.unique():
    print(w)

anterior
cervical
plate
microscopic
dissection
herniated
arthrodesis


The word zephyr is used to talk about medical terms.

In [59]:
# words before or after youngswick 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.youngswick == 1].wordkey.unique():
    print(w)

pathology
pending
austin
bunionectomy
biopro
implant
bioarc
midurethral
sling
osteotomy
internal
screw
acromioclavicular
joint
injection
injections


The word youngswick is used to talk about medical terms.

In [60]:
# words before or after young 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.young == 1].wordkey.unique():
    print(w)

patient
year
old
man
fluid
collection
disease
lady
renal
failure
chlorambucil
radioactive
phosphorus
age
concern
secondary
history
thyroid
insufficiency
breath
sinus
congestion
alertness


The word young is used to talk about age, and diseases related with age. 

In [61]:
# words before or after yielding 

word2vecrep['wordkey'] = word_tokens2
for w in word2vecrep[word2vecrep.yielding == 1].wordkey.unique():
    print(w)

teased
e
performed
significant
amount
central


The word yielding is not used to talk about medical terms.

In [69]:
#  Convert the words in question (2) vocabulary to continuous vectors, using the pretrained word 
#  to vector dictionary "PubMed-and-PMC-w2v.bin". You may download the data from 
#  http://evexdb.org/pmresources/vec-space-models. Can all words find the corresponding vector representations?

# I guess 'question 2' refers to the 10 words.

from gensim.models.keyedvectors import KeyedVectors
from gensim.models import Word2Vec

model = KeyedVectors.load_word2vec_format('PubMed-and-PMC-w2v.bin',binary=True)



In [None]:
# Checking whether the words are in the pretrained vocabulary: zephyr is not in the vocabulary.

In [95]:
ten = ['zygomatic', # 'z' was removed: it is not a word, although it is not in the stop words
 'zones',           # 'zephyr' was removed because it is not in the pretrained vocabulary
 'zone',            # 'youngswick' was removed because it is not in the pretrained vocabulary
 'zometa',
 'zimmer',
 'young',
 'yielding']

P = np.zeros((len(ten), 200))  # one 200-dimensional vector per word

for i, word in enumerate(ten):
    P[i] = model[word]

In [98]:
P

array([[ 0.22096065,  0.12747723, -0.12780438, ..., -0.09814331,
         0.003502  ,  0.16643265],
       [ 0.18052036,  0.12074965, -0.22591648, ...,  0.0535944 ,
         0.12612985,  0.00117337],
       [ 0.32436565,  0.0712044 , -0.25453416, ..., -0.02198332,
         0.21200548, -0.14795859],
       ...,
       [ 0.04193804,  0.04243576, -0.02697315, ..., -0.02604444,
         0.04178841,  0.05199338],
       [-0.20082997,  0.27157128, -0.06557874, ...,  0.0894241 ,
        -0.19359608, -0.18772388],
       [ 0.00688885, -0.08760611,  0.19180359, ..., -0.05121295,
         0.0324964 ,  0.02904414]])
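Whether every word has a vector can be checked programmatically instead of by trial and error. The sketch below uses a plain dict as a stand-in for the pretrained model, since the real file is large; `toy_vectors` and `split_by_coverage` are hypothetical names. To my understanding gensim's `KeyedVectors` also supports the `in` operator, so the same function should work with `model`, but that is an assumption to verify against the installed gensim version.

```python
import numpy as np

# Stand-in for the pretrained model: a plain dict from word to vector.
toy_vectors = {
    "zone": np.array([0.3, 0.1]),
    "young": np.array([-0.2, 0.3]),
}

def split_by_coverage(words, vectors):
    """Separate words that have a vector from words that do not."""
    found = [w for w in words if w in vectors]
    missing = [w for w in words if w not in vectors]
    return found, missing

found, missing = split_by_coverage(["zone", "zephyr", "young"], toy_vectors)
print(found)
print(missing)
```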

In [66]:
# Calculate the cosine similarity of the following word pairs: "allergy/allergic"; 
# "heart/lung"; "water/heart". Do the similarity measures make sense to you?

print(model.similarity('allergy','allergic'))
print(model.similarity('heart','lung'))
print(model.similarity('water','heart'))




0.73083645
0.397354
-0.032861322


The similarities make sense. In the first case the similarity is high, as it should be, because both words share the same root and meaning. In the second case the similarity is lower: both are organs, and the heart lies close to the lungs. In the last case water has nothing to do with heart, and the similarity is close to zero.
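For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, $\cos(u,v) = \frac{u \cdot v}{\|u\|\,\|v\|}$. A minimal NumPy sketch on toy 2-dimensional vectors is given below; the function name `cosine_similarity` and the vectors are illustrative and are not taken from the pretrained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_similarity([1, 0], [2, 0])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions, which matches the reading of the three word pairs above.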