# Sentiment analysis on drug reviews

WebMD is an organization that provides information, support and reference material on health subjects through a team of doctors and health experts across a broad range of specialty areas. This dataset was acquired by scraping the [WebMD](https://www.webmd.com/drugs/2/index) website and covers reviews up to March 2020. The main objective of this notebook is to perform sentiment analysis on the reviews to predict whether the reviewer was satisfied with the drug they rated. As per [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis):
> Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

I used Python 3.7, Altair for visualization (because I had not used it before), Keras for ML and spaCy for lemmatization.

### Importing libraries

In [None]:
import altair as alt
import itertools
import keras
import math
import numpy as np
import pandas as pd
import re
import string
import spacy
import tensorflow as tf
import warnings
warnings.filterwarnings("ignore")

from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud

from sklearn.ensemble import RandomForestClassifier
# ENGLISH_STOP_WORDS is importable from .text; the .stop_words module was removed in newer scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from keras.callbacks import EarlyStopping
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Altair cannot do word cloud so I use Matplotlib just to render the image
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline
%config InlineBackend.figure_format = "retina"

### Importing dataset

In [None]:
webmddf = pd.read_csv("../input/webmd-drug-reviews-dataset/webmd.csv")
webmddf.head(3)

### Cleaning

For the purpose of this kernel I will use only the following columns.

In [None]:
df = webmddf[["Age", "Condition", "Drug", "DrugId", "Satisfaction", "Sex", "Reviews"]]

Let's check the number of null values.

In [None]:
df.isnull().sum()

Let's remove them.

In [None]:
df = df.dropna()

I also strip leading and trailing white space from string type data. White space often conceals the fact that there is nothing useful in a data cell, simply because white space is data as well.

In [None]:
for col in df.columns:
    if df[col].dtype.kind == "O":
        df[col] = df[col].str.strip()
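
To illustrate why the stripping step matters, here is a minimal sketch in plain Python. The cell values are made up for the example and are not rows from the dataset; `is_missing` is a hypothetical helper.

```python
# Made-up cell values (not from the WebMD data) to illustrate the point.
cells = ["Pain", "   ", "", None, " Male "]

def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""

# Before stripping, the whitespace-only cell looks like real data.
before = [is_missing(c) for c in cells]

# Strip only string values; leave None untouched.
stripped = [c.strip() if isinstance(c, str) else c for c in cells]
after = [is_missing(c) for c in stripped]

print(before)
print(after)
```

Only after stripping does the whitespace-only cell show up as missing, which is why a second missing-value check is worthwhile later on.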

Let's take stock of the number of unique values in each column except `Reviews`.

In [None]:
data = [[col, df[col].nunique()] for col in df.columns.difference(["Reviews"])]
uniques = pd.DataFrame(data=data, columns=["columns", "num of unique values"])

bars = (alt.Chart()
           .mark_bar(size=25, 
                     color="#FFAA00",
                     strokeWidth=1,
                     stroke="white",
                     strokeOpacity=0.7)
           .encode(x=alt.X(shorthand="num of unique values:Q",
                           scale=alt.Scale(type="log"),
                           axis=alt.Axis(title="num of unique values, log scaled")),
                   y=alt.Y("columns:O", sort="-x"),
                   tooltip=("num of unique values:Q",
                            "columns:O",),
                   color=alt.Color("num of unique values",
                                   scale=alt.Scale(scheme="lightgreyteal",
                                                   type="log")))
           .properties(title='Unique Values'))

text = (alt.Chart()
           .mark_text(align="left",
                      baseline="middle",
                      dx=3)
           .encode(x=alt.X(shorthand="num of unique values:Q"),
                   y=alt.Y("columns:O",
                           axis=alt.Axis(title="columns",
                                         grid=False),
                           sort="-x"),
                   text="num of unique values:Q"))

chart = ((alt.layer(bars, text, data=uniques)
             .configure(background='#11043a')
             .configure_title(font="Arial",
                              fontSize=18,
                              color="#e6f3ff",
                              dy=-10)
             .configure_text(color="white")
             .configure_legend(titleFontSize=12,
                               titleColor="white",
                               tickCount=10,
                               titleOpacity=0.8,
                               labelColor="white",
                               labelOpacity=0.7,
                               titlePadding=10)
             .configure_axis(titleFontSize=13,
                             titlePadding=20,
                             titleColor="white",
                             titleOpacity=0.8,
                             labelColor="white",
                             labelOpacity=0.7,
                             labelFontSize=11,
                             tickOffset=0,
                             grid=True,
                             gridOpacity=0.15)
             .configure_view(strokeWidth=0)
             .properties(height=200, width=680)))

chart

From the chart above it appears that:

- there are many `Condition` types; a closer look may be required to see the distribution of reviews for each.
- there are more `Drug` than `DrugId` values, which suggests some peculiarity in the way drugs are named.
- the values of `Satisfaction` are supposed to range from 1 to 5, so some reviews may have wrong or missing values.
- `Sex` also has more values than I am going to consider possible.

Before I explore the individual columns, I would like to see how many missing values are in the dataset. I already deleted the rows with null values, but I also stripped leading and trailing white space, which may have created new empty cells.

In [None]:
def missing_values(df):
    """Returns a summary of missing values in df"""
    nrows = df.shape[0]
    data = []
    
    def pct(n, total):
        return round(n/total, 2)
    
    for col in df.columns:

        # string (Object) type columns
        if df[col].dtype.kind == "O":
            df[col] = df[col].str.strip()
            nulls = df[df[col] == ""][col].count()
            nulls += df[col].isnull().sum()

        # numerical (int or float) type columns; `else` ensures `nulls` is always set
        else:
            nulls = df[col].isnull().sum()

        pctofnulls = pct(nulls, nrows)
        data.extend(
            [{"column": col, "pct": 1-pctofnulls, "num of records": nrows-nulls, "type": "not missing"},
             {"column": col, "pct": pctofnulls, "num of records": nulls, "type": "missing"}])
    
    return pd.DataFrame(data)

missing = missing_values(df)

bars = (alt.Chart()
           .mark_bar(size=25, 
                     strokeWidth=1,
                     stroke="white",
                     strokeOpacity=0.7,
                     )
           .encode(x=alt.X("sum(num of records)",
                           axis=alt.Axis(title="number of records",
                                         grid=True)), 
                   y=alt.Y("column:O",
                           axis=alt.Axis(title="columns")),
                   tooltip=("column", "type", "num of records:Q",
                            alt.Tooltip("pct:Q", format=".1%")),
                   color=alt.Color("type",
                                   scale=alt.Scale(range=["#11043a", "#648bce"])))
           .properties(title="Missing Values"))

text = (alt.Chart()
           .mark_text(align="right",
                      dx=-1)
           .encode(x=alt.X("sum(num of records)", 
                           stack="zero"),
                   y=alt.Y("column"),
                   color=alt.Color("type",
                                   legend=None,
                                   scale=alt.Scale(range=["white"])),
                   text=alt.Text("pct", format=".0%")))

(alt.layer(bars, text, data=missing)
    .configure(background='#11043a')
    .configure_title(font="Arial",
                     fontSize=18,
                     color="#e6f3ff",
                     dy=-10)
    .configure_text(color="white")
    .configure_legend(titleFontSize=12,
                      titleColor="white",
                      tickCount=10,
                      titleOpacity=0.8,
                      labelColor="white",
                      labelOpacity=0.7,
                      titlePadding=10)
    .configure_axis(titleFontSize=13,
                    titlePadding=20,
                    titleColor="white",
                    titleOpacity=0.8,
                    labelFontSize=11,
                    labelColor="white",
                    labelOpacity=0.7,
                    tickOffset=0,
                    grid=False,
                    gridOpacity=0.15)
    .configure_view(strokeWidth=0)
    .resolve_scale(color='independent')
    .properties(height=300, width=680))

Indeed, the chart above reveals about 12% new missing values in `Reviews`, 7% in `Sex` and 3% in `Age`. `Condition` also has a few (0.01%) empty cells. I remove all the rows with missing values.

In [None]:
for col in ["Age", "Condition", "Sex", "Reviews"]:
    df = df[(df[col].astype(bool) & df[col].notnull())]

In the next step I check the individual columns, beginning with `Satisfaction`. I print the value counts, then carry forward only the rows which have valid values.

In [None]:
print(df["Satisfaction"].value_counts())
# keep only ratings within the expected 1 to 5 range
df = df[df["Satisfaction"].between(1, 5)]

To make life easier I simplify the categories and group them as follows:
- `Satisfaction` 1 and 2 → 0 (Negative)
- `Satisfaction` 3 → 1 (Neutral)
- `Satisfaction` 4 and 5 → 2 (Positive)

In [None]:
def relabel(x):
    return 0 if x < 3 else 1 if x == 3 else 2

df["Satisfaction"] = df["Satisfaction"].apply(relabel)

The same check applies to `Sex`, but apparently an earlier cleaning step already removed the rows with missing values.

In [None]:
print(df["Sex"].value_counts())

Now I turn my attention to `Drug` and `DrugId`.

In [None]:
drugs = {}
for drugid, drug in df[["DrugId", "Drug"]].itertuples(index=False):
    drugs.setdefault(drugid, set()).add(drug)
drugs = {k:list(v) for k,v in drugs.items()}

drugs_with_more_names = {k:list(v) for k,v in drugs.items() if len(v) > 1}
for k,v in dict(itertools.islice(drugs_with_more_names.items(), 10)).items():
    print(f"{k:10}: {list(v)[:2]}")

It turns out that `Drug` has more values than `DrugId` because some drugs are sold in different forms, like cream, pill, gel, etc. I am going to use `DrugId`, so I do not need to worry about this now.

Finally, before I tackle the `Reviews` column I explore `Condition` a bit. I would like to see the frequency of the various conditions both in value counts and in normalized form and the number of unique drugs (`DrugId`) used to treat each condition.

In [None]:
value_count_per_condition = df["Condition"].value_counts()
value_count_per_condition_norm = df["Condition"].value_counts(normalize=True)
unique_drugs_per_condition = df.groupby("Condition")["DrugId"].apply(set).to_frame().reset_index()
unique_drugs_per_condition.columns = ["condition", "unique_drugs"]

tempdf = pd.DataFrame({"condition": value_count_per_condition.index, 
                       "condition_freq": value_count_per_condition.values,
                       "condition_freq_norm": value_count_per_condition_norm.values})

tempdf = pd.merge(tempdf, unique_drugs_per_condition, on="condition")

For the purpose of my visualizations, I am going to group (bin) condition types based on their frequencies. The bin size will vary and follow a quasi-logarithmic scale, which I call the "money range" because it mimics the coin and note values in circulation in most countries (1, 2, 5, 10, 20, 50, 100, etc.), starting from 20. In other words, conditions with 20 reviews or fewer are grouped together in the first bin, conditions with 21-50 reviews go into the second bin, and so on.

In [None]:
def mrange(*args, ceiling=True):
    """Returns money range generator, yields 1, 2, 5, 10, 20, 50..."""
    f = lambda x: (((x - 1) % 3)**2 + 1) * 10**((x-1)//3)
    if len(args) == 1:
        start, stop = 1, args[0]
    else:
        start, stop = max(1, args[0]), args[1]
    c = 1
    x = f(c)
    while x < start:
        c += 1
        x = f(c)
    while True:
        yield x
        c += 1
        x = f(c)
        if x > stop:
            break
    if ceiling:
        yield x

def roundup(x, nearest=1000):
    """Rounds x up to the next multiple of `nearest` (1000 by default)."""
    return int(math.ceil(x / float(nearest))) * nearest

# .iloc[0] selects the largest count by position (the Series is indexed by condition name)
ceiling = roundup(value_count_per_condition.iloc[0]) + 1
bins = [0] + [x for x in mrange(20, ceiling)]
labels = [str(x) for x in bins[1:]]
binlabels = pd.cut(tempdf["condition_freq"], bins=bins, labels=labels)
conddf = tempdf.assign(bin=binlabels.values)
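
To make the binning concrete, here is a minimal sketch in plain Python of how a review count lands in a "money range" bin. `money_steps` and `bin_label` are hypothetical helpers written for this illustration, not the functions used above. The bins are right-closed, which matches the default behaviour of `pd.cut`.

```python
from bisect import bisect_left

def money_steps(start, stop):
    """Hypothetical helper: 1-2-5 series values from `start` up to the first value >= `stop`."""
    steps, k = [], 0
    while True:
        for m in (1, 2, 5):
            value = m * 10**k
            if value >= start:
                steps.append(value)
            if value >= stop:
                return steps
        k += 1

# Upper edges of the bins, starting from 20.
edges = money_steps(20, 45_000)
print(edges)

def bin_label(freq, edges=edges):
    """Return the upper edge of the right-closed bin (lower, upper] that contains freq."""
    return edges[bisect_left(edges, freq)]

for freq in (20, 21, 5000, 43449):
    print(freq, "->", bin_label(freq))
```

A count equal to an edge stays in the bin labelled by that edge, so 20 falls into the first bin and 21 into the bin labelled 50.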

Before I plot the grouped conditions let's look at the Top 15 most often occurring condition types.

In [None]:
topN = 15

data = conddf[:topN][["condition", "condition_freq", "condition_freq_norm"]]

bars = (alt.Chart(title=f"Top {topN} Conditions")
           .mark_bar(size=20,
                     strokeWidth=1,
                     stroke="white",
                     strokeOpacity=0.7,
                     xOffset=-1)
           .encode(x=alt.X("condition", sort="-y"),
                   y=alt.Y("condition_freq:Q",
                           axis=alt.Axis(title="number of reviews",
                                         grid=True)), 
                   tooltip=("condition",
                            "condition_freq:Q",
                            alt.Tooltip("condition_freq_norm:Q", format=".1%")),
                   color=alt.Color("condition_freq:Q",
                                   scale=alt.Scale(scheme="lightgreyteal",
                                                   type="log"))))

text = (alt.Chart()
           .mark_text(align="center",
                      baseline="bottom",
                      dx=-1, dy=-3)
           .encode(x=alt.X("condition", sort="-y"),
                   y=alt.Y("condition_freq:Q"),
                   size = alt.SizeValue(9),
                   text=alt.Text("condition_freq_norm:Q", format=".1%")))

chart = (alt.layer(bars, text, data=data)
            .configure(background='#11043a')
            .configure_title(font="Arial",
                             fontSize=18,
                             color="#e6f3ff",
                             dy=-10)
            .configure_text(color="white")
            .configure_legend(title=None,
                              titleFontSize=12,
                              titleColor="white",
                              tickCount=5,
                              titleOpacity=0.8,
                              labelColor="white",
                              labelOpacity=0.7,
                              titlePadding=10)
            .configure_axis(titleFontSize=13,
                            titlePadding=20,
                            titleColor="white",
                            titleOpacity=0.8,
                            labelFontSize=11,
                            labelColor="white",
                            labelOpacity=0.7,
                            #labelAngle=45,
                            tickOffset=0,
                            grid=False,
                            gridOpacity=0.15)
            .configure_view(strokeWidth=0)
            .properties(height=300, width=700))
chart

The first 5 conditions account for about 35% of all reviews. The list is led by "Other", which is about as broad as a condition can be. The second most common category is "Pain", which is again very generic.

Now let's see the chart I mentioned and prepared the dataset for earlier.

In [None]:
# this aggregates the sets of unique_drugs which fall into the same bin and counts the number of elements
aggr_sets = lambda x: sum(1 for n in set.union(*x))

data = (conddf.groupby("bin")
              .agg({"condition": "count", "condition_freq": "sum",
                    "condition_freq_norm": "sum", "unique_drugs": aggr_sets})
              .reset_index())
data.columns = ["bin", "condition_count", "condition_freq_sum",
                "condition_freq_norm_sum", "unique_drugs_count"]

bars = (alt.Chart(title="Distribution Of Condition Frequency And Drug Use")
           .mark_bar(size=20,
                     strokeWidth=1,
                     stroke="white",
                     strokeOpacity=0.7,
                     xOffset=-1)
           .encode(x=alt.X(shorthand="bin:Q",
                           scale=alt.Scale(round=False, type="log"),
                           axis=alt.Axis(title="binned condition counts",
                                         grid=False,
                                         orient="bottom")),
                   y=alt.Y(shorthand="condition_freq_sum:Q",
                           scale=alt.Scale(type="log"),
                           axis=alt.Axis(title="sum of condition counts and unique drug use, log scaled")),
                   tooltip=("bin:Q",
                            "condition_count:Q",
                            "condition_freq_sum:Q", 
                            alt.Tooltip("condition_freq_norm_sum:Q", format=".1%"),
                                        "unique_drugs_count:Q"),
                   color=alt.Color("condition_count:Q",
                                   scale=alt.Scale(scheme="lightgreyteal",
                                                   type="log"))))

text = (alt.Chart()
           .mark_text(align="center",
                      baseline="bottom",
                      dx=-1, dy=-3)
           .encode(x=alt.X("bin:Q"),
                   y=alt.Y("condition_freq_sum:Q"),
                   size = alt.SizeValue(9),
                   text=alt.Text("condition_freq_norm_sum:Q", format=".1%")))

line = (alt.Chart()
           .mark_line(color="red",
                      xOffset=-1,
                      size=1)
           .encode(x=alt.X("bin:Q"),
                   y=alt.Y("unique_drugs_count:Q")))

point = (alt.Chart()
            .mark_point(color="red",
                        xOffset=-1,
                        size=15,
                        shape="diamond")
            .encode(x=alt.X("bin:Q"),
                    y=alt.Y("unique_drugs_count:Q"),
                    tooltip=("unique_drugs_count:Q")))

chart = (alt.layer(bars, line, text, point, data=data[data["condition_count"] > 0])
            .configure(background='#11043a')
            .configure_title(font="Arial",
                             fontSize=18,
                             color="#e6f3ff",
                             dy=-10)
            .configure_text(color="white")
            .configure_legend(title=None,
                              titleFontSize=12,
                              titleColor="white",
                              tickCount=10,
                              titleOpacity=0.8,
                              labelColor="white",
                              labelOpacity=0.7,
                              titlePadding=10)
            .configure_axis(titleFontSize=13,
                            titlePadding=20,
                            titleColor="white",
                            titleOpacity=0.8,
                            labelColor="white",
                            labelOpacity=0.7,
                            tickOffset=0,
                            grid=True,
                            gridOpacity=0.15)
            .configure_view(strokeWidth=0)
            .properties(height=300, width=700))

chart.resolve_scale(color="independent")

This plot shows a lot, so let me explain it a bit. Note that both axes are log scaled.
- On the far right is the bin of conditions with 20,001-50,000 reviews. It contains just one condition, "Other", which has 43,449 reviews, as the previous chart already showed. This group accounts for 14.8% of the total samples, and 3,667 different drugs (`DrugId`) are reviewed in it.
- The far left bin holds the conditions with 20 or fewer reviews. There are 1,022 of them, adding up to 5,531 records in total, which is 1.9% of the dataset. There are 1,311 unique drugs reviewed in this bin.
- In the middle are a few other bins of varying sizes, generally with far fewer conditions in each, especially from the 1,001-2,000 bin upwards. For example, the 10,000 bin (conditions with 5,001-10,000 reviews) holds only 4 distinct conditions with 24,422 records (8.3%). 256 distinct drugs were reviewed in this group, which is the lowest of all.

In summary, around 40% of the dataset consists of reviews for 1,652 conditions with fewer than 2,000 reviews per condition, and around 43% of the reviews rate 9 conditions with an average of 14,000 reviews per condition.

Now let's see how `Satisfaction` is distributed across the `Age` groups.

In [None]:
data = (df.groupby(["Age", "Satisfaction"])
          .agg({"Reviews": "count"})
          .reset_index()).sort_values(["Age", "Satisfaction"], ascending=True)
#data['Cumulative_Reviews'] = data.groupby(['Age'])['Reviews'].apply(lambda x: x.cumsum())

bars = (alt.Chart(data=data, title="Distribution of Reviews Over Age")
           .mark_bar(size=40,
                     strokeWidth=0.5,
                     stroke="white")
           .encode(x=alt.X('Age:O',
                           axis=alt.Axis(title="Age groups", grid=False)),
                   y=alt.Y('Reviews:Q', stack='zero',
                           scale=alt.Scale(type="linear"),
                           axis=alt.Axis(title="num of reviews")),
                   order=alt.Order('Satisfaction', sort='ascending'),
                   color=alt.Color("Satisfaction:Q",
                                   scale=alt.Scale(scheme="lightgreyteal",
                                                   bins=[0,1,2,3],
                                                   reverse=False))))

text = (alt.Chart(data=data[data["Reviews"] > 1500])
           .mark_text(align="center",
                      baseline="middle",
                      dx=0, dy=5)
           .encode(x=alt.X("Age:O"),
                   y=alt.Y("Reviews:Q", stack='zero'),
                   size = alt.SizeValue(9),
                   text="Reviews:Q",
                   color=alt.condition(alt.datum.Satisfaction > 1,
                                          alt.value("white"),
                                          alt.value("black"))))
    
chart = (alt.layer(bars, text)
            .configure(background="#11043a")
            .configure_title(font="Arial",
                             fontSize=18,
                             color="#e6f3ff",
                             dy=-10)
            .configure_text(color="white")
            .configure_legend(titleFontSize=12,
                              titleColor="white",
                              tickCount=10,
                              titleOpacity=0.8,
                              labelColor="white",
                              labelOpacity=0.7,
                              titlePadding=10)
            .configure_axis(titleFontSize=13,
                            titlePadding=20,
                            titleColor="white",
                            titleOpacity=0.8,
                            labelColor="white",
                            labelOpacity=0.7,
                            labelAngle=0,
                            tickOffset=0,
                            grid=True,
                            gridOpacity=0.15)
            .configure_view(strokeWidth=0)
            .properties(height=300, width=700)
)
chart

From the chart it seems that certain `Age` groups rarely take or review drugs, which is understandable given their young age. Also, after the simplification of `Satisfaction` from 5 to 3 categories, the number of "Neutral" reviews is very low compared to those in the "Positive" and "Negative" categories. The most active reviewers are from the `Age` groups '45-54' and '55-64'.

Finally let's plot the distribution of Satisfaction ratings.

In [None]:
data = (df.groupby(["DrugId"])
          .agg({"Reviews": "count", "Satisfaction": "mean"})
          .reset_index()
          .sort_values(["Reviews"], ascending=False))
data["Drug"] = data["DrugId"].map(drugs)

alt.data_transformers.disable_max_rows()
scatter = (alt.Chart(title="Distribution Of Reviews Over Satisfaction")
            .mark_point(color="#648bce")
            .encode(x=alt.X('Satisfaction:Q',
                            axis=alt.Axis(title="Mean Satisfaction",
                                          grid=False)),
                    y=alt.Y('Reviews:Q',
                             scale=alt.Scale(type="log"),
                             axis=alt.Axis(title="Number of Reviews, log scaled")),
                    size='Reviews:Q',
                    color=alt.Color("Satisfaction:Q",
                                   scale=alt.Scale(scheme="lightgreyteal",
                                                   type="linear")),
                    tooltip=['DrugId', 'Drug', 'Reviews',
                              alt.Tooltip("Satisfaction", format=".3")])
            .interactive())

chart = (alt.layer(scatter, data=data[data["Reviews"] > 20])
            .configure(background='#11043a')
            .configure_title(font="Arial",
                             fontSize=18,
                             color="#e6f3ff",
                             dy=-10)
            .configure_text(color="white")
            .configure_legend(titleFontSize=12,
                              titleColor="white",
                              tickCount=6,
                              titleOpacity=0.8,
                              labelColor="white",
                              labelOpacity=0.7,
                              titlePadding=10)
            .configure_axis(titleFontSize=13,
                            titlePadding=20,
                            titleColor="white",
                            titleOpacity=0.8,
                            labelFontSize=11,
                            labelColor="white",
                            labelOpacity=0.7,
                            labelAngle=0,
                            tickOffset=0,
                            grid=False,
                            gridOpacity=0.15)
            .configure_view(strokeWidth=0)
            .properties(height=300, width=700)
)
chart

The scatter plot suggests a fairly homogeneous distribution. We know there are far fewer "Neutral" reviews in the dataset than other reviews, so the mean `Satisfaction` mostly reflects the balance of "Positive" and "Negative" reviews. Perhaps there are a few more large bubbles above the mean rating 1 ("Neutral") than below it.
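
As a small worked example of how the mean label behaves, the sketch below computes the mean of the relabelled `Satisfaction` values (0, 1, 2) from review counts. The counts are made up for illustration and `mean_label` is a hypothetical helper.

```python
def mean_label(n_neg, n_neu, n_pos):
    """Mean of the labels 0 (Negative), 1 (Neutral) and 2 (Positive)."""
    total = n_neg + n_neu + n_pos
    return (0 * n_neg + 1 * n_neu + 2 * n_pos) / total

# Equal numbers of negative and positive reviews average out to "Neutral".
print(mean_label(50, 0, 50))   # 1.0

# With few neutral reviews the mean is driven by the positive share.
print(mean_label(30, 5, 65))   # 1.35
```

A drug can therefore sit at a mean of 1 without having a single "Neutral" review.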

### Pre-processing `Reviews` for ML
I select three random `Reviews` to see how they look before the cleaning.

In [None]:
indexes = np.random.randint(df.shape[0], size=3)
print(" ".join(df["Reviews"].iloc[indexes].tolist()))

It is not too bad, but there are plenty of characters which convey little or no information about the sentiment of the reviewers. On the other hand, some symbols may carry hints about the rating; emoticons, for example, are ubiquitous in textual content on the Internet. I am going to perform the following steps:

- convert positive emoticons to the word "happyemoticon" and negative emoticons to "sademoticon"
- remove punctuation, i.e. non-alphanumeric characters
- remove stopwords and anything starting with numbers
- lemmatize the text

In [None]:
%%time

nlp = spacy.load("en", disable=["ner", "parser"])
STOPWORDS = set(ENGLISH_STOP_WORDS).union(set(stopwords.words("english")))

def clean_review(text, STOPWORDS=STOPWORDS, nlp=nlp):
    """Cleans up text"""
    
    def rep_emo(text, placeholder_pos=' happyemoticon ', placeholder_neg=' sademoticon '):
        """Replace emoticons"""
        # Credit https://github.com/shaheen-syed/Twitter-Sentiment-Analysis/blob/master/helper_functions.py
        emoticons_pos = [":)", ":-)", ":p", ":-p", ":P", ":-P", ":D",":-D", ":]", ":-]", ";)", ";-)",
                         ";p", ";-p", ";P", ";-P", ";D", ";-D", ";]", ";-]", "=)", "=-)", "<3"]
        emoticons_neg = [":o", ":-o", ":O", ":-O", ":(", ":-(", ":c", ":-c", ":C", ":-C", ":[", ":-[",
                         ":/", ":-/", ":\\", ":-\\", ":n", ":-n", ":u", ":-u", "=(", "=-(", ":$", ":-$"]

        for e in emoticons_pos:
            text = text.replace(e, placeholder_pos)

        for e in emoticons_neg:
            text = text.replace(e, placeholder_neg)   
        return text

    def rep_punct(text):
        """Replace all punctuation with space"""
        for c in string.punctuation:
            text = text.replace(c, " ")
        return text

    def rem_stop_num(text):
        """Remove stop words (case-insensitively) and anything starting with number"""
        return " ".join(word for word in text.split()
                        if word.lower() not in STOPWORDS and not word[0].isdigit())

    def lemmatize(text):
        """Return lemmas of tokens in text"""
        return " ".join(tok.lemma_.lower().strip() for tok in nlp(text) if tok.lemma_ != "-PRON-")  

    return lemmatize(rem_stop_num(rep_punct(rep_emo(text))))

# copy so that the assignments below do not act on a view of df
mldf = df[["Satisfaction", "Reviews"]].copy()
mldf["Reviews"] = mldf["Reviews"].apply(clean_review)

# remove any rows with new empty strings following the clean-up
mldf["Reviews"] = mldf["Reviews"].replace("", np.nan)
mldf.dropna(inplace=True)
# adding indexes as "index" column for later use to recreate same splits 
mldf.reset_index(inplace=True)
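
The order of the cleaning steps matters: emoticons consist of punctuation characters, so they have to be replaced before punctuation is removed. The sketch below shows this in plain Python with a made-up review and only a small, illustrative subset of emoticons. The helper names are hypothetical and simpler than the functions above.

```python
import string

def replace_emoticons(text):
    """Swap a tiny, illustrative subset of emoticons for placeholder words."""
    for emo in (":)", ":-)", ":D"):
        text = text.replace(emo, " happyemoticon ")
    for emo in (":(", ":-(", ":/"):
        text = text.replace(emo, " sademoticon ")
    return text

def strip_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

sample = "Worked great :) but the headaches... :("   # made-up review

# Emoticons first, punctuation second: the sentiment hints survive.
right_order = " ".join(strip_punctuation(replace_emoticons(sample)).split())

# Punctuation first: the emoticons are destroyed before they can be replaced.
wrong_order = " ".join(replace_emoticons(strip_punctuation(sample)).split())

print(right_order)
print(wrong_order)
```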

With this done let's check how the same random `Reviews` look now.

In [None]:
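# select by original index label, since rows dropped during cleaning shift the positions
print(" ".join(mldf.loc[mldf["index"].isin(df.index[indexes]), "Reviews"].tolist()))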

I delete the original dataframe because I do not need it anymore and I have memory constraints.

In [None]:
del df

For a quick visualization I use word clouds. Word clouds have an appeal that is hard to deny: they are engaging and easy for the brain to digest. They are here to stay, but they should be considered a bit of fun amidst the hard work rather than a source of insight. Let's first see the word cloud built from the "Negative" reviews. I use Matplotlib because Altair, which is based on Vega-Lite, does not support this type of visualization, whereas Vega does.

In [None]:
negdf = mldf[mldf["Satisfaction"] == 0]
# join all negative reviews into one string for the word cloud
negatives = " ".join(negdf["Reviews"])

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(negatives)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

And now from the "Positive" reviews.

In [None]:
posdf = mldf[mldf["Satisfaction"] == 2]
# join all positive reviews into one string for the word cloud
positives = " ".join(posdf["Reviews"])

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(positives)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

I did warn you that word clouds may not be too revealing, didn't I?

#### Representing `Reviews` in Numeric Form

For a machine to be able to work with text, the content needs to be converted to and represented by numbers. There are three main approaches: Bag of Words, TF-IDF and Word2Vec. The first step is to create a vocabulary of all the unique words found in the documents; here each Review is a document. In the Bag of Words approach each Review is converted to a vector of the same size, namely the size of the vocabulary. Each position in the vector represents a word from the vocabulary, and its value is the number of times the word occurs in the document, or zero if it does not occur.
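
The Bag of Words idea can be sketched by hand in a few lines of plain Python. The two documents are made up and `bag_of_words` is a hypothetical helper; it only illustrates the principle and is not how CountVectorizer is implemented.

```python
from collections import Counter

docs = ["pain relief fast", "no relief more pain pain"]  # made-up documents

# Vocabulary: every unique word, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab=vocab):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors)
```

Both vectors have the length of the vocabulary, and the word "pain" gets a count of 2 in the second document.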

The TF-IDF method does not simply count the occurrences of a word in a document ("TF" or Term Frequency); it also looks at how many of the documents contain that word ("IDF" or Inverse Document Frequency). The value assigned to a vector position is the combination of the two, so a word that appears in many documents is weighted down.
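
The weighting can be sketched in a few lines of plain Python. The corpus and the helper name `tfidf` are made up for illustration. The smoothed IDF formula and the L2 normalization follow what I understand to be the scikit-learn defaults, so treat that correspondence as an assumption rather than a specification.

```python
import math

# Toy corpus (made up); "drug" occurs in every document, "well" in only one.
docs = ["drug works well", "drug causes headache", "drug works fast"]

def tfidf(documents):
    """Return one {word: weight} dictionary per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Document frequency: in how many documents each word appears.
    df = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
    # Smoothed inverse document frequency: rarer words get a larger factor.
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}
    rows = []
    for tokens in tokenized:
        raw = {word: tokens.count(word) * idf[word] for word in set(tokens)}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(value * value for value in raw.values()))
        rows.append({word: value / norm for word, value in raw.items()})
    return rows

weights = tfidf(docs)
print(weights[0])
```

In the first document all three words occur once, yet "drug" ends up with the smallest weight because it appears in every document.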

First, I want to build a reference model, and for that I chose the multinomial Naive Bayes classifier. The multinomial distribution normally requires integer feature counts, which is exactly what the Bag of Words approach produces, but in practice fractional values such as those produced by TF-IDF may work as well. For the baseline model I chose the Bag of Words approach. To perform the vectorization I use CountVectorizer from sklearn.

Let us proceed by splitting the dataset into training (75%) and test (25%) sets. I specify the `stratify` parameter to make sure that after the split the proportions of the `Satisfaction` categories remain the same in both sets as they were before the split.
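
What stratification guarantees can be shown with a small pure-Python sketch. The labels below are made up and the helper `stratified_split` is hypothetical; it is not how `train_test_split` is implemented, it only mimics the effect of `stratify`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) that keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same share of every class for the test set.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up, imbalanced labels: 40 negative, 20 neutral, 140 positive.
labels = [0] * 40 + [1] * 20 + [2] * 140
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Both sets keep the 2:1:7 ratio of the original labels, which a purely random split would only approximate.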

In [None]:
train_set, test_set = train_test_split(mldf, test_size=0.25, random_state=0, stratify=mldf["Satisfaction"])
train_index = train_set.index
test_index = test_set.index
print(train_set.shape)
print(test_set.shape)

I also create a function to visualize the confusion matrix in Altair.

In [None]:
def confusion_matrix_altair(labels, predictions):
    """Returns Altair heatmap as confusion matrix"""
    
    alt.data_transformers.disable_max_rows()
    source = pd.DataFrame([labels, predictions]).T
    source.columns=["True", "Predicted"]

    # Configure base chart
    base = (alt.Chart(source, title="Confusion Matrix")
               .transform_aggregate(count="count()",
                                    groupby=["True", "Predicted"])
               .transform_joinaggregate(total="sum(count)")
               .transform_calculate(pct="datum.count / datum.total")
               .encode(x=alt.X("Predicted:O", scale=alt.Scale(paddingInner=0)),
                       y=alt.Y("True:O", scale=alt.Scale(paddingInner=0)),
                       tooltip=(alt.Tooltip("pct:Q", format=".1%"))))
    # Configure heatmap
    heatmap = (base.mark_rect()
                   .encode(color=alt.Color("count:Q",
                           scale=alt.Scale(scheme="blues"),
                           legend=alt.Legend(direction="vertical"))))
    # Configure text
    text = (base.mark_text(baseline="middle")
                .encode(text="count:Q",
                        color=alt.condition(alt.datum.count > 10000,
                                            alt.value("white"),
                                            alt.value("black"))))
    # Draw the chart
    chart = ((heatmap + text)
                .configure(background="#11043a")
                .configure_title(fontSize=18,
                                 color="#e6f3ff",
                                 dy=-20)
                .configure_text(color="white",
                                fontSize=14)
                .configure_legend(titleFontSize=12,
                                  titleColor="white",
                                  titleOpacity=0.8,
                                  labelColor="white",
                                  labelOpacity=0.7,
                                  titlePadding=10)
                .configure_axis(titleFontSize=14,
                                titlePadding=20,
                                titleColor="white",
                                titleOpacity=0.8,
                                labelFontSize=13,
                                labelColor="white",
                                labelOpacity=0.7,
                                labelAngle=0)
                .configure_view(strokeWidth=0)
                .properties(height=400, width=400)
            )
    return chart

In [None]:
%%time
vectorizer = CountVectorizer(max_features=2500, min_df=10, max_df=0.8)
X_train = vectorizer.fit_transform(train_set["Reviews"]).toarray()
X_test = vectorizer.transform(test_set["Reviews"]).toarray()
y_train = train_set["Satisfaction"].values
y_test = test_set["Satisfaction"].values

In the code cell above I set up the vectorizer with a vocabulary size of 2500 words (`max_features`) and two additional parameters. Words that occur in fewer than 10 reviews (`min_df`) are not considered important and are excluded automatically. At the other end of the scale, words that occur in more than 80% of the reviews (`max_df`) are so common that they are not useful for classification and are ignored as well.
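
The effect of the two document-frequency thresholds can be sketched as follows. The corpus, the helper `filter_vocabulary` and the small `min_df=2` value are assumptions chosen so the toy example has something to filter; the real filtering happens inside `CountVectorizer`.

```python
def filter_vocabulary(documents, min_df=2, max_df=0.8):
    """Keep words that are neither too rare nor too common across documents."""
    tokenized = [set(doc.split()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    kept = []
    for word in sorted(vocabulary):
        doc_count = sum(word in tokens for tokens in tokenized)
        too_rare = doc_count < min_df              # absolute number of documents
        too_common = doc_count / n_docs > max_df   # share of all documents
        if not (too_rare or too_common):
            kept.append(word)
    return kept

# Made-up reviews; "the" occurs everywhere, "headache" only once.
docs = [
    "the drug works well",
    "the drug causes headache",
    "the pill works fast",
    "the drug helps",
    "the pill helps",
]
kept_words = filter_vocabulary(docs)
print(kept_words)
```

Note that both thresholds count documents, not total occurrences: a word repeated ten times in a single review still has a document frequency of one.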

The reason I split first and called `fit_transform` only on the training set afterwards is to make sure the vocabulary is learned strictly from the training set. The test set was only transformed using the already fitted vectorizer. Let's proceed with fitting the MultinomialNB model. I leave `alpha` at its default value of 1.0. The idea behind this smoothing hyperparameter is to solve the zero-probability problem: if a word never occurs in the training reviews of a class, its estimated conditional probability would be 0 and would drive the whole product of probabilities for that class to 0. Adding `alpha` to every word count prevents this.
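
A minimal sketch of additive (Laplace) smoothing, assuming a four-word vocabulary and two made-up negative reviews; the helper `smoothed_likelihoods` is hypothetical and only illustrates the role of `alpha`.

```python
from collections import Counter

def smoothed_likelihoods(class_documents, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing."""
    counts = Counter(word for doc in class_documents for word in doc.split())
    total = sum(counts[word] for word in vocabulary)
    # alpha is added once per vocabulary word, so the estimates still sum to 1.
    denominator = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

vocabulary = ["awful", "great", "headache", "relief"]
negative_reviews = ["awful headache", "awful awful headache"]

unsmoothed = smoothed_likelihoods(negative_reviews, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(negative_reviews, vocabulary, alpha=1.0)
print(unsmoothed["great"], smoothed["great"])
```

Without smoothing, a single occurrence of "great" in a new review would rule out the negative class entirely; with `alpha=1.0` the word merely counts as unlikely.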

In [None]:
%%time
model = MultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

To evaluate the performance of the models, I use the traditional classification metrics such as accuracy, F1 measure and confusion matrix.
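
For reference, this is how accuracy, precision, recall and F1 relate to each other for a single class. The labels and predictions are made up, and the helper `precision_recall_f1` is only a sketch of what `classification_report` computes per class.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 0, 1, 2, 2, 2, 2, 1]
y_pred = [0, 2, 2, 2, 2, 0, 2, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, label=2)
print(accuracy, precision, recall, f1)
```

Accuracy summarizes all classes in one number, while precision and recall show whether a class is over-predicted or missed.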

In [None]:
acc_train = accuracy_score(y_train, model.predict(X_train))
print(f"\nAccuracy in train set: {acc_train:.2}")
predictions = model.predict(X_test)
acc_test = accuracy_score(y_test, predictions)
print(f"\nAccuracy in test  set: {acc_test:.2}\n")
print(classification_report(y_test, predictions))
confusion_matrix_altair(y_test, predictions)

The model's accuracy is ~65%. Not only did it miss ~13% on "Neutral" (1) reviews, but sadly it also misclassified twice as many in the "Negative" (0) and "Positive" (2) categories. Let's see if we can regain some of the losses by eliminating the "Neutral" reviews and classifying only "Negative" and "Positive" feedback.

In [None]:
%%time
train_set = train_set[train_set["Satisfaction"] != 1]
test_set = test_set[test_set["Satisfaction"] != 1]
print(train_set.shape)
print(test_set.shape)

vectorizer = CountVectorizer(max_features=2500, min_df=10, max_df=0.8)
X_train = vectorizer.fit_transform(train_set["Reviews"]).toarray()
X_test = vectorizer.transform(test_set["Reviews"]).toarray()
y_train = train_set["Satisfaction"].values
y_test = test_set["Satisfaction"].values

model = MultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

acc_train = accuracy_score(y_train, model.predict(X_train))
print(f"\nAccuracy in train set: {acc_train:.2}")
predictions = model.predict(X_test)
acc_test = accuracy_score(y_test, predictions)
print(f"\nAccuracy in test  set: {acc_test:.2}\n")
print(classification_report(y_test, predictions))
confusion_matrix_altair(y_test, predictions)

As I suspected, the model's predictive power improved by roughly as much as the neutral misclassifications had degraded it in the previous model.

#### RandomForestClassifier with TfidfVectorizer

I move on and try another ML algorithm, the RandomForestClassifier from sklearn. In addition, I am switching from the Bag of Words vectorization approach to TF-IDF. I set up TfidfVectorizer with the same parameters as the CountVectorizer before. Because the previous step removed the "Neutral" reviews, I first restore the full training and test sets from the saved indices and rebuild the features and labels (X_train, X_test, y_train, y_test).

In [None]:
%%time
train_set = mldf.loc[train_index]
test_set  = mldf.loc[test_index]
print(train_set.shape)
print(test_set.shape)

vectorizer = TfidfVectorizer(max_features=2500, min_df=10, max_df=0.8)
X_train = vectorizer.fit_transform(train_set["Reviews"]).toarray()
X_test = vectorizer.transform(test_set["Reviews"]).toarray()
y_train = train_set["Satisfaction"].values
y_test = test_set["Satisfaction"].values

The RandomForestClassifier has many hyperparameters. I left all of them at their default values except "min_samples_split", which I increased from 2 to 6 to reduce the chance of overfitting the model to the training data. I also decided not to perform a grid search over hyperparameter ranges this time.

In [None]:
%%time
model = RandomForestClassifier(min_samples_split=6, random_state=0)
model.fit(X_train, y_train)

Now I evaluate the model on the test set.

In [None]:
acc_train = accuracy_score(y_train, model.predict(X_train))
print(f"\nAccuracy in train set: {acc_train:.2}")
predictions = model.predict(X_test)
acc_test = accuracy_score(y_test, predictions)
print(f"\nAccuracy in test  set: {acc_test:.2}\n")
print(classification_report(y_test, predictions))
confusion_matrix_altair(y_test, predictions)

A great many "Neutral" reviews are misclassified, amounting to ~9%, but overall we gained +12% in model accuracy. How about removing the "Neutral" reviews and training the model again?

In [None]:
%%time
train_set = train_set[train_set["Satisfaction"] != 1]
test_set = test_set[test_set["Satisfaction"] != 1]
print(train_set.shape)
print(test_set.shape)

vectorizer = TfidfVectorizer(max_features=2500, min_df=10, max_df=0.8)
X_train = vectorizer.fit_transform(train_set["Reviews"]).toarray()
X_test = vectorizer.transform(test_set["Reviews"]).toarray()
y_train = train_set["Satisfaction"].values
y_test = test_set["Satisfaction"].values

model = RandomForestClassifier(min_samples_split=6, random_state=0)
model.fit(X_train, y_train)

And the result:

In [None]:
acc_train = accuracy_score(y_train, model.predict(X_train))
print(f"\nAccuracy in train set: {acc_train:.2}")
predictions = model.predict(X_test)
acc_test = accuracy_score(y_test, predictions)
print(f"\nAccuracy in test  set: {acc_test:.2}\n")
print(classification_report(y_test, predictions))
confusion_matrix_altair(y_test, predictions)

Not bad: we have gone from the previous 76% to 83% accuracy without the "Neutral" reviews.

#### Recurrent Neural Network (LSTM)

Instead of trying out another traditional ML algorithm, I want to experiment with neural networks, more specifically with a so-called LSTM model. Long Short-Term Memory networks (LSTMs) are a special type of Recurrent Neural Network (RNN) capable of learning long-term dependencies, that is, of retaining information over long sequences. Their gated design largely mitigates the vanishing gradient problem that plain RNNs suffer from.

The process is the same as before. The Tokenizer splits up the text, creates a vocabulary and turns each text into a sequence of integers (word indices).

In [None]:
%%time
train_set = mldf.loc[train_index]
test_set = mldf.loc[test_index]

# Keep only the "Negative" (0) and "Positive" (2) reviews
train_set = train_set[train_set["Satisfaction"] != 1]
test_set = test_set[test_set["Satisfaction"] != 1]

X_train = train_set["Reviews"].values
X_test = test_set["Reviews"].values
y_train = train_set["Satisfaction"].values
y_test = test_set["Satisfaction"].values

num_words = 2500
maxlen = 200

tokenizer = Tokenizer(num_words=num_words, split=" ", lower=False)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
y_train = pd.get_dummies(y_train).values
y_test = pd.get_dummies(y_test).values

In the code cell above I set `num_words` to 2500, so the internal vocabulary only keeps the 2500 most frequently used words. Then the tokenizer is fitted on the training data only. I have seen quite a few publications where the authors fit the tokenizer to the entire dataset and only subsequently split it into training and test sets. I think that workflow is flawed because information from the data that will be used later in the testing phase can leak into the training process.
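
The following pure-Python sketch mimics, under simplifying assumptions, what the tokenizer does: it learns word ids from the training texts only and then drops any word it has never seen. The helpers `fit_vocabulary` and `to_sequence` and the toy texts are hypothetical; the id numbering (starting at 1, ids kept strictly below `num_words`) reflects my understanding of the Keras `Tokenizer` and should be read as an assumption.

```python
from collections import Counter

def fit_vocabulary(train_texts, num_words):
    """Map the most frequent training words to integer ids, starting at 1."""
    counts = Counter(word for text in train_texts for word in text.split())
    # Keep ids strictly below num_words; id 0 is left free for padding.
    most_common = counts.most_common(num_words - 1)
    return {word: index for index, (word, _) in enumerate(most_common, start=1)}

def to_sequence(text, word_index):
    """Replace each known word by its id and silently drop unknown words."""
    return [word_index[word] for word in text.split() if word in word_index]

train_texts = ["drug works well", "drug works", "drug causes headache"]
word_index = fit_vocabulary(train_texts, num_words=4)

sequence = to_sequence("new drug works wonders", word_index)
print(word_index)
print(sequence)
```

Words that appear only in the test review ("new", "wonders") simply vanish from the sequence, which is the behavior one wants when the vocabulary must come from the training set alone.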

Padding makes sure that each feature vector will have the same `maxlen` size. Sequences shorter than `maxlen` will get padded and sequences longer will get truncated.
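
A small sketch of padding and truncation, assuming the Keras defaults, which as far as I know pad and truncate at the front of the sequence (`padding="pre"`, `truncating="pre"`). The helper `pad_sequence` is hypothetical and handles a single sequence.

```python
def pad_sequence(sequence, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen ids of long ones."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_sequence([5, 8], maxlen=4)
long = pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

With these defaults the end of a long review is kept and its beginning is discarded.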

Let's build the LSTM model which is made up of a sequence of neural network layers.

In [None]:
%%time
embedding_vector_length = 100

model = Sequential()
model.add(Embedding(num_words, embedding_vector_length, input_length=maxlen))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(embedding_vector_length))
model.add(Dense(2, activation="softmax"))
model.compile(loss = "categorical_crossentropy", optimizer="adam", metrics = ["accuracy"])
print(model.summary())

In the code cell above, I created a Sequential() model and added the following layers:
- Embedding layer with `input_dim` of `num_words` and `output_dim` of `embedding_vector_length`. This layer can be understood as a lookup table that maps from integer indices (of specific words in the vocabulary) to dense vectors (their embeddings).
- SpatialDropout1D layer, which performs the same function as Dropout but drops entire 1D feature maps instead of individual elements. This layer helps prevent overfitting. Here 20% of the embedding channels are dropped.
- LSTM layer, which carries information forward across the sequence and thus prevents older signals from gradually vanishing during processing. I do not set dropout here because I do not want this layer to "forget" the past.
- Dense layer with `softmax` activation, which maps the LSTM output to a probability for each of the two classes.
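
As a sanity check on the architecture, the number of trainable parameters per layer can be worked out by hand. The arithmetic below assumes the standard layer definitions (four LSTM gates, each with input weights, recurrent weights and a bias); if those assumptions hold, the totals should agree with what `model.summary()` prints.

```python
num_words = 2500
embedding_vector_length = 100
lstm_units = 100
num_classes = 2

# One dense vector of length 100 per vocabulary entry.
embedding_params = num_words * embedding_vector_length

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_vector_length + lstm_units) + lstm_units)

# One weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

SpatialDropout1D has no trainable parameters, so it does not contribute to the total. Most of the parameters sit in the embedding lookup table.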

When compiling the model I use the "adam" optimizer, the "categorical_crossentropy" loss function and "accuracy" as the metric. During fitting, 10% of the training data is held out as a validation set (`validation_split=0.1`) to track the validation loss and accuracy after each epoch. I also add an EarlyStopping callback that monitors the validation loss and stops the fitting process if it has not improved by more than 0.0001 for 2 consecutive epochs.
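
The stopping rule can be sketched in plain Python. The validation losses below are made up and `early_stopping_epoch` is a hypothetical helper that only approximates the logic of the `EarlyStopping` callback with `patience` and `min_delta`.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0001):
    """Return the 1-based epoch after which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # A real improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses for six epochs.
val_losses = [0.50, 0.42, 0.40, 0.41, 0.40, 0.39]
stopped_at = early_stopping_epoch(val_losses)
print(stopped_at)
```

Here the loss stops improving after epoch 3, so training ends after epoch 5 and the better value at epoch 6 is never reached; a larger `patience` trades training time for the chance to catch such late improvements.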

In [None]:
%%time
epochs = 15
batch_size = 64
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                    callbacks=[EarlyStopping(monitor="val_loss", patience=2, min_delta=0.0001)])

The model accuracy on the training set is ~87%. Let's see how it performs on the test set by passing it through the `evaluate` method.

In [None]:
accr = model.evaluate(X_test, y_test)
print("Loss in test set: {:0.3f}\nAccuracy in test set: {:0.3f}\n".format(accr[0], accr[1]))
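# argmax over the softmax output gives the predicted class index; this is
# equivalent to predict_classes, which newer Keras releases no longer provide
predictions = np.argmax(model.predict(X_test, batch_size=batch_size), axis=1)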
labels = np.argmax(y_test, axis=1)
print(classification_report(labels, predictions))
confusion_matrix_altair(labels, predictions)

The test accuracy is ~83%, which is lower than the training accuracy. This suggests the model is slightly overfitted, although the difference is small.

#### Prediction

Let's perform a test and predict the sentiment on a new review.

In [None]:
def predict_sentiment(text):
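    # clean_review is the cleaning helper defined earlier in the notebook
    cleaned_text = tokenizer.texts_to_sequences([clean_review(text)])
    padded_text = pad_sequences(cleaned_text, maxlen=maxlen)
    # Class index 0 is "Negative", class index 1 is "Positive"
    predicted = np.argmax(model.predict(padded_text), axis=1)[0]
    return "Positive" if predicted else "Negative"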

predict_sentiment("The drug is expensive but it is worth every cent.")
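
The prediction function relies on the column order of the one-hot encoded labels: with only the labels 0 and 2 left, I assume `pd.get_dummies` puts "Negative" (0) in column 0 and "Positive" (2) in column 1. The sketch below reproduces that mapping in plain Python; `one_hot`, `argmax` and the probabilities are made up for illustration.

```python
def one_hot(labels):
    """Encode labels as 0/1 rows, one column per sorted unique label."""
    classes = sorted(set(labels))
    column = {label: index for index, label in enumerate(classes)}
    rows = [[1 if column[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

classes, encoded = one_hot([0, 2, 2, 0])

# Made-up softmax output for one review: 20% Negative, 80% Positive.
probabilities = [0.2, 0.8]
predicted_label = classes[argmax(probabilities)]
print(classes, encoded, predicted_label)
```

The predicted column index 1 maps back to the original label 2, which is why a truthy class index can be read as "Positive".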

### Conclusion

One of the most common natural language processing tasks is text classification. There are multiple Machine Learning algorithms that can be deployed to perform sentiment analysis. In this notebook I showed three different methods to build models which can predict the sentiment of a user based on what he or she writes. I only scratched the surface of NLP. In the subsequent notebook I will attempt to predict the condition of the reviewer from the review, sex and age.

### References

https://en.wikipedia.org/wiki/Sentiment_analysis