In [166]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

from zipfile import ZipFile
from io import BytesIO
import urllib.request as urllib2

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Drug reviews dataset:
From UCI public data repository<br>
https://archive-beta.ics.uci.edu/ml/datasets/462

In [167]:
r = urllib2.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip').read()
file = ZipFile(BytesIO(r))
file.namelist()

['drugsComTest_raw.tsv', 'drugsComTrain_raw.tsv']

In [168]:
txt = file.open('drugsComTrain_raw.tsv')
df = pd.read_csv(txt, sep='\t', nrows=7000)
df

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil""",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective.""",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas.""",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch""",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin.""",9.0,"November 27, 2016",37
...,...,...,...,...,...,...,...
6995,112817,Bisacodyl,Constipation,"""Took 2 tablets 2.30pm, bowel motion and diarrhea 6 hours later with painful stomach cramps and nausea, continued on and off for a few hours. At 6.30pm the following day still experiencing stomach pains and back pains. Not a gentle drug. Will use herbal next time. Would not give to children.""",1.0,"August 27, 2017",9
6996,73606,Ethinyl estradiol / norethindrone,Acne,"""After 2 1/2 years of horrible acne, during which I was on antibiotics &amp; topicals that never ended up working, my dermatologist and Dr. suggested I should try birth control. I went to Planned Parenthood and was put on Microgestin Fe 1/20. After about 6 months on this pill, I actually lost weight (due to nausea at first), breasts grew, lighter periods every month, and no change in everything else. As for my acne, it probably got better about 40%, but still had at least 3-5 pimples/cysts all the time and no reduction in face oiliness. Trial and error, now trying Ortho Tri Cyclen.""",4.0,"May 16, 2013",19
6997,89784,Celexa,Anxiety and Stress,"""Celexa was the first medication I was ever put on for mild depression, stress and anxiety mostly stemming from my stressful work environment (prison). I felt no difference with this medication, and I was nauseous as a side effect. Doctor switched me to Wellbutrin, which did not work, and then to Effexor, which is a wonder drug.""",5.0,"April 14, 2009",34
6998,40582,Zanaflex,Muscle Spasm,"""I&#039;ve been suffering with severe muscle spasms in neck, back, chest and shoulder since a whiplash incident 1-1/2 years ago. Other medicines previously prescribed include Soma and Flexeril, but results were insubstantial. I received substantial relief the first time I used Zanaflex and finally feel like I might be able to learn to live with this condition.""",10.0,"April 10, 2009",86


---
### Pre-processing
- Remove non-alphanumeric characters
- Make lowercase and split on whitespace
- Remove stop words (*the*, *and*, etc.)
- Lemmatize by POS
- Remove most common words
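
A minimal sketch of the first two steps using only the standard library. The `tokenize` helper and its `unescape` flag are illustrative names, not part of this notebook. It also shows where the stray `039` token comes from: the raw reviews contain HTML entities such as `&#039;`, and stripping non-word characters leaves the digits behind. Decoding entities first with `html.unescape` is one possible way to avoid that.

```python
import html
import re

def tokenize(text, unescape=True):
    """Strip non-word characters, lowercase, and split on whitespace."""
    if unescape:
        # Hypothetical extra step: turn "&#039;" back into an apostrophe first
        text = html.unescape(text)
    return re.sub(r"\W", " ", text).lower().split()

raw = "I&#039;m glad"
print(tokenize(raw, unescape=False))  # ['i', '039', 'm', 'glad']
print(tokenize(raw))                  # ['i', 'm', 'glad']
```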

In [169]:
df['review'] = df['review'].replace(r'\W', ' ', regex=True)
df['review'] = df['review'].str.lower().str.split()
df['review'].head()

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           [it, has, no, side, effect, i, take, it, in, combination, of, bystolic, 5, mg, and, fish, oil]
1                                                        [my, son, is, halfway, through, his, fourth, week, of, intuniv, we, became, concerned, when, he, began, this, last, week, when, he, started, taking, the, highest, dose, he, will, be, on, for, two, days, he, could, hardly, get, out, of, bed, was, very, cranky, and, slept, for, nearly, 8, hour

#### Lemmatize
Transform a word to its base form given its part of speech<br>
For example:<br>
- *running* --> *run*<br>
- *cars* --> *car*<br>
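
The lemmatizer expects a WordNet part-of-speech code (`'n'`, `'v'`, `'a'`, `'r'`), while `nltk.pos_tag` returns Penn Treebank tags such as `NNS` or `VBG`. The sketch below shows one compact way to map between the two by tag prefix. The helper name `penn_to_wordnet` is hypothetical and the mapping covers only the four classes lemmatized in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code, or None if unmapped."""
    prefix_map = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    # All noun tags start with NN, verb tags with VB, and so on
    return prefix_map.get(tag[:2])

for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags that return `None` (determiners, numbers, pronouns, and so on) have no WordNet class, so the word would be kept unchanged rather than lemmatized.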

In [170]:
def lemmatize_doc(doc):
    """
    Removes stop words and performs lemmatization
    """
    # stop words, plus '039' (left over from the HTML entity &#039; once punctuation is stripped)
    stop_words = set(nltk.corpus.stopwords.words('english')) | {'039'}
    out_list = []
    
    for word in doc:
        # remove stop words
        if word not in stop_words:
            lemma = ''
            # determine word part of speech
            pos = nltk.pos_tag([word])[0][1]
            
            # NOUNS
            if pos in ['NN', 'NNP', 'NNS', 'NNPS']:
                lemma = WordNetLemmatizer().lemmatize(word, pos='n')
            # VERBS
            elif pos in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
                lemma = WordNetLemmatizer().lemmatize(word, pos='v')
            # ADJECTIVES
            elif pos in ['JJ', 'JJR', 'JJS']:
                lemma = WordNetLemmatizer().lemmatize(word, pos='a')
            # ADVERBS
            elif pos in ['RB', 'RBR', 'RBS']:
                lemma = WordNetLemmatizer().lemmatize(word, pos='r')
            # MISC
            elif pos in ['CC', 'CD', 'DT', 'LS', 'MD', 'PDT', 'PRP', 'PRP$', 'RP', 'WDT', 'WP', 'WP$', 'WRB']:
                lemma = word
            else:
                pass
            out_list.append(lemma)
        else:
            pass
    # remove blank elements from output
    return(list(filter(None, out_list)))

In [171]:
df['review_clean'] = df['review'].apply(lambda x: lemmatize_doc(x))
df['review_clean'].head()

0                                                                                                                                                                                                                                                                                                                                                                                                                                                            [side, effect, take, combination, bystolic, 5, mg, fish, oil]
1                                 [son, halfway, fourth, week, intuniv, become, concerned, begin, last, week, start, take, high, dose, two, day, could, hardly, get, bed, cranky, slept, nearly, 8, hour, drive, home, school, vacation, unusual, call, doctor, monday, morning, say, stick, day, see, school, get, morning, last, two, day, problem, free, much, agreeable, ever, less, emotional, good, thing, less, cranky, remember, thing, overall, behavior, well, try, many, different, medication, fa

### Most common words
These tend to show up in multiple topics, so they do little to distinguish one topic from another.
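
The idea can be sketched with the standard library alone: count every token across all documents, take the `k` most frequent, and filter them out. The toy documents and the `drop_most_common` helper are made up for illustration.

```python
from collections import Counter

# Toy tokenized documents (illustrative only)
docs = [
    ["take", "pill", "day"],
    ["take", "dose", "day", "night"],
    ["take", "pill", "night"],
]

def drop_most_common(docs, k):
    """Remove the k most frequent tokens (corpus-wide) from every document."""
    counts = Counter(word for doc in docs for word in doc)
    common = {word for word, _ in counts.most_common(k)}
    filtered = [[w for w in doc if w not in common] for doc in docs]
    return filtered, common

filtered, common = drop_most_common(docs, k=1)
print(common)    # {'take'}
print(filtered)  # [['pill', 'day'], ['dose', 'day', 'night'], ['pill', 'night']]
```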

In [172]:
df_ex = df.explode('review_clean')['review_clean']
common_words = df_ex.value_counts()[:15]
common_words

take      5882
day       4350
get       3715
go        2992
month     2951
year      2866
work      2770
week      2536
start     2472
effect    2431
side      2301
time      2257
feel      2239
pain      2068
first     1923
Name: review_clean, dtype: int64

In [173]:
df['review_clean'] = df['review_clean'].apply(lambda x: [word for word in x if word not in common_words.index]).str.join(' ')
df['review_clean'].head()

0                                                                                                                                                                                                                                                                                                                                              combination bystolic 5 mg fish oil
1    son halfway fourth intuniv become concerned begin last high dose two could hardly bed cranky slept nearly 8 hour drive home school vacation unusual call doctor monday morning say stick see school morning last two problem free much agreeable ever less emotional good thing less cranky remember thing overall behavior well try many different medication far effective
2                      use another oral contraceptive 21 pill cycle happy light period max 5 contain hormone gestodene available us switch lybrel ingredient similar pill end lybrel immediately period instruction say period last two second pack two third pack t

---
### Topic Modeling
via sklearn Latent Dirichlet Allocation
- Create sparse matrix of word counts by document
- Use grid search to optimize LDA parameters
- Score via log-likelihood and perplexity

In [174]:
def best_topic_model(DW_matrix):
    """
    Grid search algorithm to produce the best-performing topic model
    """
    LDA = LatentDirichletAllocation()
    model = GridSearchCV(LDA, param_grid={'n_components': [4, 5, 6], 'learning_decay': [0.5, 0.7, 0.9]}, n_jobs=-1)
    model.fit(DW_matrix)
    return(model)
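
To get a feel for the cost of this search, the sketch below enumerates the same parameter grid with `itertools.product`. It assumes the `GridSearchCV` defaults: 5-fold cross-validation when `cv` is not passed, and, because no `scoring` is given, ranking candidates by the estimator's own `score` method (for LDA, an approximate log-likelihood on the held-out fold).

```python
from itertools import product

param_grid = {"n_components": [4, 5, 6], "learning_decay": [0.5, 0.7, 0.9]}
cv_folds = 5  # assumed: GridSearchCV's default when cv is not passed

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 9
print(len(candidates) * cv_folds)  # 45 fits, plus one final refit on all the data
```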

In [175]:
def show_topics(vectorizer, lda_model, n_words=15):
    """
    Displays dataframe consisting of top n words for each topic
    """
    # vocabulary from corpus
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    keywords = np.array(vectorizer.get_feature_names_out())
    
    topic_keywords = []
    for topic_weights in lda_model.components_:
        # sort index of vocabulary
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        # match vocabulary items to index
        topic_keywords.append(keywords.take(top_keyword_locs))
    return(pd.DataFrame(topic_keywords))
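
The ranking step in `show_topics` (negate the weights, `argsort`, take the first `n_words`) is equivalent to sorting vocabulary indices by descending weight. A plain-Python sketch with made-up weights follows. `top_words` is a hypothetical helper.

```python
def top_words(vocabulary, weights, n_words):
    """Return the n_words vocabulary items with the largest weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in order[:n_words]]

# Illustrative vocabulary and topic-word weights (not taken from the fitted model)
vocabulary = ["acne", "birth", "control", "sleep"]
topic_weights = [0.4, 7.2, 6.5, 0.1]

print(top_words(vocabulary, topic_weights, n_words=2))  # ['birth', 'control']
```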

In [176]:
# Document-Word matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['review_clean'])

# perform grid search and fit LDA
model = best_topic_model(X)
best_LDA_model = model.best_estimator_

print('log-likelihood: ', model.score(X))
print('perplexity: ', best_LDA_model.perplexity(X))

log-likelihood:  -1763668.4969527111
perplexity:  1713.9778720846186
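
The two scores are closely related. As far as I understand scikit-learn's implementation, perplexity is `exp(-log_likelihood / total_word_count)`, so lower perplexity means a better fit. Under that assumption, the sketch below recovers the implied token count from the two numbers printed above and checks that the relation is self-consistent. The function name is illustrative.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Per-token perplexity, assuming perplexity = exp(-LL / token count)."""
    return math.exp(-log_likelihood / n_tokens)

# Values printed by the cell above
log_likelihood = -1763668.4969527111
reported_perplexity = 1713.9778720846186

# Token count implied by the assumed relation (a derived figure, not a measured one)
implied_tokens = -log_likelihood / math.log(reported_perplexity)
print(round(implied_tokens))
print(perplexity_from_log_likelihood(log_likelihood, implied_tokens))
```

Because perplexity is normalized per token and depends on the vocabulary, values from models built on different vocabularies (for example unigrams versus bigrams) are not directly comparable.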


In [177]:
topics_df = show_topics(vectorizer=vectorizer, lda_model=best_LDA_model)
topics_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,period,pill,control,weight,gain,birth,cramp,bad,sex,bleeding,would,mood,acne,spot,never
1,use,well,back,try,medication,help,doctor,medicine,lose,life,bad,make,also,give,would
2,anxiety,sleep,well,medication,help,night,bad,make,doctor,life,medicine,would,felt,depression,one
3,use,hour,dose,eat,stomach,bad,nausea,water,still,one,taste,morning,quot,two,night


### Bi-grams
Model pairs of adjacent words (e.g. "*side effect*") instead of single words
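
A rough standard-library sketch of what `CountVectorizer(ngram_range=(2, 2))` does to one cleaned review. It uses the vectorizer's default token pattern, which keeps only tokens of two or more word characters, and pairs up the remaining adjacent tokens. The `bigrams` helper is illustrative.

```python
import re

# CountVectorizer's default token_pattern: tokens of 2+ word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def bigrams(text):
    """Return space-joined pairs of adjacent tokens."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("combination bystolic 5 mg fish oil"))
# ['combination bystolic', 'bystolic mg', 'mg fish', 'fish oil']
```

Note that single-character tokens such as `5` are dropped before pairing, so words that were not adjacent in the text can end up in the same bigram.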

In [178]:
# Document-bigram matrix
vectorizer_bg = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X_bg = vectorizer_bg.fit_transform(df['review_clean'])

model_bg = best_topic_model(X_bg)
best_LDA_model_bg = model_bg.best_estimator_

print('log-likelihood: ', model_bg.score(X_bg))
print('perplexity: ', best_LDA_model_bg.perplexity(X_bg))

log-likelihood:  -2778797.6606447077
perplexity:  178041.38200721284


In [179]:
topics_bg_df = show_topics(vectorizer=vectorizer_bg, lda_model=best_LDA_model_bg, n_words=10)
topics_bg_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,birth control,weight gain,mood swing,sex drive,lose weight,panic attack,much well,gain weight,weight loss,doctor prescribed
1,birth control,weight gain,mood swing,sex drive,much well,come back,gain weight,doctor prescribed,would recommend,panic attack
2,birth control,mood swing,weight gain,sex drive,panic attack,blood pressure,much well,gain weight,lose weight,fall asleep
3,birth control,blood pressure,weight gain,panic attack,come back,dry mouth,gain weight,would recommend,sex drive,mood swing


---
### Topic prediction
Determine most likely topic for new piece of text

In [180]:
test = file.open('drugsComTest_raw.tsv')
test_df = pd.read_csv(test, sep='\t', nrows=1000)
test_df

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me.""",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done very well on the Asacol. He has no complaints and shows no side effects. He has taken as many as nine tablets per day at one time. I&#039;ve been very happy with the results, reducing his bouts of diarrhea drastically.""",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for alcohol, smoking, and opioid cessation. People lose weight on it because it also helps control over-eating. I have no doubt that most obesity is caused from sugar/carb addiction, which is just as powerful as any drug. I have been taking it for five days, and the good news is, it seems to go to work immediately. I feel hungry before I want food now. I really don&#039;t care to eat; it&#039;s just to fill my stomach. Since I have only been on it a few days, I don&#039;t know if I&#039;ve lost weight (I don&#039;t have a scale), but my clothes do feel a little looser, so maybe a pound or two. I&#039;m hoping that after a few months on this medication, I will develop healthier habits that I can continue without the aid of Contrave.""",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cycle. After reading some of the reviews on this type and similar birth controls I was a bit apprehensive to start. Im giving this birth control a 9 out of 10 as I have not been on it long enough for a 10. So far I love this birth control! My side effects have been so minimal its like Im not even on birth control! I have experienced mild headaches here and there and some nausea but other than that ive been feeling great! I got my period on cue on the third day of the inactive pills and I had no idea it was coming because I had zero pms! My period was very light and I barely had any cramping! I had unprotected sex the first month and obviously didn&#039;t get pregnant so I&#039;m very pleased! Highly recommend""",9.0,"October 22, 2015",4
...,...,...,...,...,...,...,...
995,133441,Tri-Sprintec,Birth Control,"""I have been using Tri-Sprintec for a few months, before that I was on Ortho Tri-Cyclen Lo. I used Ortho for a few years but had to change due to switching insurance companies. I was a little bit hesitant to use Tri-Sprintec at first due to the previous reviews that I read about. My experience has been very good. I haven&#039;t had any break outs, weight gain or emotional problems. I have had an increase in breast size and tenderness but the tenderness wasn&#039;t too uncomfortable. Usually when I switch birth control, I have a cycle for at least a month and a half but not this time around. I had no break through bleeding. I have had light and regular cycles, which is the purpose of me being on birth control. I would definitely recommend Tri-Sprinte""",7.0,"July 3, 2010",7
996,51690,Azithromycin,Sinusitis,"""Have had a sinus infection for 11 months... tried all types of antibiotics and they would help a little but right back to same stuff. Major congestion, Hard phlegm, trouble breathing, etc. Took azithromycin day 1 was able to breathe thru nose again. loosened up all mucus. Day 2 felt like a new person.""",10.0,"February 21, 2015",37
997,191217,Pentasa,Ulcerative Proctitis,"""I&#039;ve been using pentasa 2 times a day a total of 4 pills it definitely helps! Sometimes I get headaches and feel sick sometimes and occasionally get pain and for some reason ever since I started taking them I get these weird skin patches""",9.0,"April 2, 2016",1
998,150787,Sulfamethoxazole / trimethoprim,Acne,"""I cannot believe how amazing this medicine has been for my cystic acne. I have tried monocycline, doxycycline, prescription topical creams, NOTHING worked. I finally decided to ask my doctor about Bactrim. I have been completely cyst free since October. This is a record amount of time for me to have NO cysts! I have had, maybe three pimples. I&#039;ve had no side effects. The monocycline gave me horrible joint aches (to the point where I&#039;d almost cry they hurt so bad). So far I have had absolutely NO side effects. It&#039;s wonderful!""",10.0,"January 23, 2013",47


In [181]:
# remove non-alphanumeric characters
test_df['review_clean'] = test_df['review'].replace(r'\W', ' ', regex=True)
# make lowercase; tokenize
test_df['review_clean'] = test_df['review_clean'].str.lower().str.split()
# remove stop words and lemmatize
test_df['review_clean'] = test_df['review_clean'].apply(lambda x: lemmatize_doc(x))
# remove common words (recomputed here from the test set, top 10, rather than reused from training)
df_ex = test_df.explode('review_clean')['review_clean']
common_words = df_ex.value_counts()[:10]
test_df['review_clean'] = test_df['review_clean'].apply(lambda x: [word for word in x if word not in common_words.index]).str.join(' ')

test_df['review_clean'].head()

0                                                                                                             try antidepressant citalopram fluoxetine amitriptyline none help depression insomnia amp anxiety doctor suggest change 45mg mirtazapine medicine save life thankfully especially common weight gain actually lose alot weight still suicidal thought mirtazapine save
1                                                                                                                                                                                                                                                           son crohn disease do well asacol complaint show many nine tablet one time happy result reduce bout diarrhea drastically
2                                                                                                                                                                                                                                                               

In [182]:
X_test = vectorizer.transform(test_df['review_clean'])
test_df['topic_pred'] = np.argmax(best_LDA_model.transform(X_test), axis=1)
test_df['topic_pred']

0      2
1      1
2      3
3      1
4      0
      ..
995    0
996    2
997    0
998    1
999    0
Name: topic_pred, Length: 1000, dtype: int64
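
The prediction step reduces each row of the document-topic distribution to the index of its largest entry. A plain-Python sketch with made-up probabilities follows. `most_likely_topic` is a hypothetical helper that mirrors `np.argmax` along `axis=1`, including returning the first index on ties.

```python
def most_likely_topic(topic_probs):
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)

# Illustrative document-topic rows (not output of the fitted model)
doc_topic = [
    [0.05, 0.10, 0.80, 0.05],  # clearly topic 2
    [0.25, 0.25, 0.25, 0.25],  # flat row: the assignment carries little information
]

print([most_likely_topic(row) for row in doc_topic])  # [2, 0]
```

Keeping the maximum probability alongside the predicted index is one way to flag documents whose assignment is weak.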

In [183]:
test_df['topic'] = list(topics_df.iloc[test_df['topic_pred']].values)
test_df[['review', 'topic']]

Unnamed: 0,review,topic
0,"""I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me.""","[anxiety, sleep, well, medication, help, night, bad, make, doctor, life, medicine, would, felt, depression, one]"
1,"""My son has Crohn&#039;s disease and has done very well on the Asacol. He has no complaints and shows no side effects. He has taken as many as nine tablets per day at one time. I&#039;ve been very happy with the results, reducing his bouts of diarrhea drastically.""","[use, well, back, try, medication, help, doctor, medicine, lose, life, bad, make, also, give, would]"
2,"""Quick reduction of symptoms""","[use, hour, dose, eat, stomach, bad, nausea, water, still, one, taste, morning, quot, two, night]"
3,"""Contrave combines drugs that were used for alcohol, smoking, and opioid cessation. People lose weight on it because it also helps control over-eating. I have no doubt that most obesity is caused from sugar/carb addiction, which is just as powerful as any drug. I have been taking it for five days, and the good news is, it seems to go to work immediately. I feel hungry before I want food now. I really don&#039;t care to eat; it&#039;s just to fill my stomach. Since I have only been on it a few days, I don&#039;t know if I&#039;ve lost weight (I don&#039;t have a scale), but my clothes do feel a little looser, so maybe a pound or two. I&#039;m hoping that after a few months on this medication, I will develop healthier habits that I can continue without the aid of Contrave.""","[use, well, back, try, medication, help, doctor, medicine, lose, life, bad, make, also, give, would]"
4,"""I have been on this birth control for one cycle. After reading some of the reviews on this type and similar birth controls I was a bit apprehensive to start. Im giving this birth control a 9 out of 10 as I have not been on it long enough for a 10. So far I love this birth control! My side effects have been so minimal its like Im not even on birth control! I have experienced mild headaches here and there and some nausea but other than that ive been feeling great! I got my period on cue on the third day of the inactive pills and I had no idea it was coming because I had zero pms! My period was very light and I barely had any cramping! I had unprotected sex the first month and obviously didn&#039;t get pregnant so I&#039;m very pleased! Highly recommend""","[period, pill, control, weight, gain, birth, cramp, bad, sex, bleeding, would, mood, acne, spot, never]"
...,...,...
995,"""I have been using Tri-Sprintec for a few months, before that I was on Ortho Tri-Cyclen Lo. I used Ortho for a few years but had to change due to switching insurance companies. I was a little bit hesitant to use Tri-Sprintec at first due to the previous reviews that I read about. My experience has been very good. I haven&#039;t had any break outs, weight gain or emotional problems. I have had an increase in breast size and tenderness but the tenderness wasn&#039;t too uncomfortable. Usually when I switch birth control, I have a cycle for at least a month and a half but not this time around. I had no break through bleeding. I have had light and regular cycles, which is the purpose of me being on birth control. I would definitely recommend Tri-Sprinte""","[period, pill, control, weight, gain, birth, cramp, bad, sex, bleeding, would, mood, acne, spot, never]"
996,"""Have had a sinus infection for 11 months... tried all types of antibiotics and they would help a little but right back to same stuff. Major congestion, Hard phlegm, trouble breathing, etc. Took azithromycin day 1 was able to breathe thru nose again. loosened up all mucus. Day 2 felt like a new person.""","[anxiety, sleep, well, medication, help, night, bad, make, doctor, life, medicine, would, felt, depression, one]"
997,"""I&#039;ve been using pentasa 2 times a day a total of 4 pills it definitely helps! Sometimes I get headaches and feel sick sometimes and occasionally get pain and for some reason ever since I started taking them I get these weird skin patches""","[period, pill, control, weight, gain, birth, cramp, bad, sex, bleeding, would, mood, acne, spot, never]"
998,"""I cannot believe how amazing this medicine has been for my cystic acne. I have tried monocycline, doxycycline, prescription topical creams, NOTHING worked. I finally decided to ask my doctor about Bactrim. I have been completely cyst free since October. This is a record amount of time for me to have NO cysts! I have had, maybe three pimples. I&#039;ve had no side effects. The monocycline gave me horrible joint aches (to the point where I&#039;d almost cry they hurt so bad). So far I have had absolutely NO side effects. It&#039;s wonderful!""","[use, well, back, try, medication, help, doctor, medicine, lose, life, bad, make, also, give, would]"
