# Data Preprocessing Steps

In this notebook, we perform several data preprocessing steps on the dataset, such as:

- Extracting the year with the datetime accessor to count the reviews and medications per year;
- Splitting the sentences by words and spaces;
- Removing punctuation;
- Removing non-alphabetic characters and applying lower case;
- Applying tokenization;
- Removing stopwords;
- Applying lemmatization.
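
As a preview of the first step in the list above, here is a minimal sketch of year extraction with the pandas `.dt` accessor. The `toy_df` frame and its values are made up for illustration and are not part of the medication reviews dataset.

```python
import pandas as pd

# Toy data standing in for the real reviews (all values are made up)
toy_df = pd.DataFrame({
    "drugName": ["A", "B", "A", "C"],
    "date": ["2008-02-24", "2008-07-01", "2017-03-15", "2017-12-12"],
})

# Parse the strings into datetimes, then read the year through the .dt accessor
toy_df["date"] = pd.to_datetime(toy_df["date"])
toy_df["year"] = toy_df["date"].dt.year

# Count how many reviews fall in each year
reviews_per_year = toy_df.groupby("year")["drugName"].count()
print(reviews_per_year)
```

The same two calls, `pd.to_datetime` followed by `.dt.year`, are what the notebook applies to the real `date` column later on.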

Note:

While performing some of these steps, we keep exploring the dataset and learning about its features, such as the relationship between the dates and the names of some medications, since a data science project is an iterative cycle.

### Step 3: Applying data processing and NLP techniques

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# Import libraries
import pandas as pd 
import numpy as np
import sklearn
import plotly.express as px

In [3]:
# Load the dataset
# medication_reviews_df = pd.read_csv('/content/drive/Othercomputers/My MacBook Pro/Sentiment-Analysis-of-Medication-Reviews-Project/medication_reviews_dataset.csv', sep=',')
medication_reviews_df = pd.read_csv('/Users/rafaelaqueiroz/Sentiment-Analysis-of-Medication-Reviews-Project/medication_reviews_dataset.csv', sep=',')
medication_reviews_df.head() 

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37


In [4]:
medication_reviews_df.shape

(112329, 6)

In [5]:
medication_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112329 entries, 0 to 112328
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   drugName     112329 non-null  object 
 1   condition    112329 non-null  object 
 2   review       112329 non-null  object 
 3   rating       112329 non-null  float64
 4   date         112329 non-null  object 
 5   usefulCount  112329 non-null  int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 5.1+ MB


In [6]:
# Convert the "date" column to datetime format
medication_reviews_df['date'] = pd.to_datetime(medication_reviews_df['date'])

# Find the first and last date input in the DataFrame
medication_reviews_first_date_input = medication_reviews_df['date'].min()
medication_reviews_last_date_input = medication_reviews_df['date'].max()

# Print the first and last date input
print('The first date input is:', medication_reviews_first_date_input)
print('The last date input is:', medication_reviews_last_date_input)

The first date input is: 2008-02-24 00:00:00
The last date input is: 2017-12-12 00:00:00


To see which observations (rows) of the DataFrame carry the first and last dates, we use boolean indexing with the values returned by the *.min( )* and *.max( )* methods on the column containing the dates.

In [7]:
# Use boolean indexing to extract the row(s) with the first date input
medication_reviews_first_date_row = medication_reviews_df[medication_reviews_df['date'] == medication_reviews_first_date_input]
medication_reviews_first_date_row

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
46354,Orlistat,Obesity,"""Xenical really helped me, but some of the bow...",7.0,2008-02-24,50
47184,Macrobid,Bladder Infection,"""Excellent for prevention of bladder infection...",8.0,2008-02-24,52
58962,Oxybutynin,Not Listed / Othe,"""Improved my problem dramatically. I never exp...",7.0,2008-02-24,22
101801,Chlorpheniramine / pseudoephedrine,Allergic Rhinitis,"""I when to a medical clinic with flu like symp...",1.0,2008-02-24,0


In [8]:
# Use boolean indexing to extract the row(s) with the last date input
medication_reviews_last_date_row = medication_reviews_df[medication_reviews_df['date'] == medication_reviews_last_date_input]
medication_reviews_last_date_row

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
1415,Terconazole,Vaginal Yeast Infection,"""This is the meanest diabolical cream. Not sur...",1.0,2017-12-12,0
5743,Propranolol,mance Anxiety,"""I suffer from glossophobia or the fear of pub...",10.0,2017-12-12,0
6713,Kyleena,Birth Control,"""I got Kyleena put in September (2017), so I&#...",9.0,2017-12-12,0
8058,Vaniqa,Hirsutism,"""I&#039;m thrilled with this product. After us...",10.0,2017-12-12,0
8614,Xarelto,Atrial Fibrillation,"""I made a comment here a couple of years ago w...",1.0,2017-12-12,0
83227,Xenical,Obesity,"""So I started just over a week ago, if you eat...",10.0,2017-12-12,0
87354,Infliximab,Rheumatoid Arthritis,"""I was diagnosed with Inflammatory Arthritis (...",10.0,2017-12-12,0
110322,Duloxetine,Depression,"""My doctor switched me to duloxetine from cita...",3.0,2017-12-12,0


From these results, we can see that the earliest reviews are dated 2008-02-24 and that four reviews share that date, the first one listed being for *Orlistat*, prescribed for *Obesity*. The latest reviews are dated 2017-12-12; eight reviews share that date, the last one listed being for *Duloxetine*, prescribed for *Depression*.
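
Because several reviews can share the same date, comparing the column against the minimum keeps every tied row, whereas `idxmin` points at a single row. The sketch below illustrates the difference on made-up data (`toy_df` and its values are assumptions for this example only).

```python
import pandas as pd

# Toy data: two reviews share the earliest date (all values are made up)
toy_df = pd.DataFrame({
    "drugName": ["A", "B", "C", "D"],
    "date": pd.to_datetime(
        ["2008-02-24", "2008-02-24", "2010-05-01", "2017-12-12"]
    ),
})

first_date = toy_df["date"].min()

# Boolean indexing keeps every row that shares the earliest date
all_first_rows = toy_df[toy_df["date"] == first_date]

# idxmin returns the label of the first occurrence only
single_first_row = toy_df.loc[toy_df["date"].idxmin()]

print(len(all_first_rows))
print(single_first_row["drugName"])
```

Boolean indexing is therefore the safer choice when ties matter, which is the case in this dataset.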

Now, let's look at the first and last reviews published to see which medications were among the first and last ones reviewed by the users.

In [9]:
# Group the DataFrame by condition and sort each group by date
# (the result is ordered by condition name first, then by date within each condition)
grouped_medication_reviews_df = medication_reviews_df.groupby('condition').apply(lambda x: x.sort_values('date'))

# Select the first 100 rows of the grouped DataFrame
medication_reviews_first_reviews_selected = grouped_medication_reviews_df.head(100)
medication_reviews_first_reviews_selected

condition,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0</span> users found this comment helpful.,35168,Relpax,0</span> users found this comment helpful.,"""Tried all sorts of other migraine medicines o...",10.0,2008-08-01,0
0</span> users found this comment helpful.,9435,Loestrin Fe 1 / 20,0</span> users found this comment helpful.,"""The pill worked very well. I had been taking ...",4.0,2009-06-03,0
0</span> users found this comment helpful.,111184,Loestrin 24 Fe,0</span> users found this comment helpful.,"""I&#039;m fifteen and my doctor prescribed thi...",10.0,2009-07-17,0
0</span> users found this comment helpful.,109285,Suboxone,0</span> users found this comment helpful.,"""The only thing about this pill is, it gives m...",10.0,2009-10-20,0
0</span> users found this comment helpful.,46059,Loestrin 24 Fe,0</span> users found this comment helpful.,"""This pill was okay at first then about 6 week...",8.0,2009-11-03,0
...,...,...,...,...,...,...,...
12</span> users found this comment helpful.,105066,Fanapt,12</span> users found this comment helpful.,"""I have only been on this medication for 5 day...",5.0,2010-10-23,12
12</span> users found this comment helpful.,35493,Ambien CR,12</span> users found this comment helpful.,"""I came here to post my experience over the ye...",6.0,2010-11-15,12
12</span> users found this comment helpful.,6546,Amitiza,12</span> users found this comment helpful.,"""I was extremely constipated and no one seems ...",9.0,2011-12-26,12
12</span> users found this comment helpful.,25020,Dulcolax,12</span> users found this comment helpful.,"""It worked just fine. Took one and 12 hours la...",7.0,2012-02-19,12


In [10]:
# Group the DataFrame by condition and sort each group by date
# (the result is ordered by condition name first, then by date within each condition)
grouped_medication_reviews_df = medication_reviews_df.groupby('condition').apply(lambda x: x.sort_values('date'))

# Select the last 100 rows of the grouped DataFrame
medication_reviews_last_reviews_selected = grouped_medication_reviews_df.tail(100)
medication_reviews_last_reviews_selected

condition,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
min),49361,Dapagliflozin / metformin,min),"""I have been taking this medication for a litt...",8.0,2017-09-15,2
mis,13635,Dapsone,mis,"""I was diagnosed in 1998 via papular biopsy by...",10.0,2008-02-26,44
mis,100151,Dapsone,mis,"""In two days I had wonderful relief from sever...",6.0,2009-04-13,15
mis,67703,Dapsone,mis,"""I took Dapsone for 20 years and without it I ...",10.0,2011-01-29,15
mis,102515,Dapsone,mis,"""I itched for about 5 years. There wasn&#039;...",10.0,2012-02-23,14
...,...,...,...,...,...,...,...
zen Shoulde,109708,Nabumetone,zen Shoulde,"""Very helpful for my frozen shoulder pain with...",9.0,2011-11-12,38
zen Shoulde,12919,Diclofenac,zen Shoulde,"""This medication has been a God send for me. ...",9.0,2015-05-14,11
zen Shoulde,45160,Naproxen,zen Shoulde,"""Very little relief. I finished PT and after ...",2.0,2015-05-14,6
zen Shoulde,81906,Nabumetone,zen Shoulde,"""The only side effect I have experienced with ...",1.0,2017-08-03,6


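Because the grouped DataFrame is ordered by condition first and by date only within each condition, these two slices show the conditions that come first and last alphabetically rather than the 100 oldest and newest reviews. In the last slice, medications such as *Dapsone* and *Nabumetone* appear repeatedly, whereas in the first slice the contraceptive *Loestrin 24 Fe* appears in 2009 with ratings of 10 and 8. Note also that some values in the *condition* column are malformed (for example, `0</span> users found this comment helpful.`) or truncated (`mis`, `zen Shoulde`), which points to scraping errors in the source data.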
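
If the aim is to list the oldest and newest reviews overall, sorting by date alone is one option. The sketch below contrasts a condition-then-date ordering, which mimics the row order of the grouped frame, with a plain chronological ordering. The `toy_df` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data (all values are made up)
toy_df = pd.DataFrame({
    "condition": ["Pain", "Acne", "Pain", "Acne"],
    "drugName": ["P1", "A1", "P2", "A2"],
    "date": pd.to_datetime(
        ["2008-03-01", "2016-01-10", "2017-06-05", "2009-09-09"]
    ),
})

# Condition first, then date: same row order as grouping by condition
# and sorting each group by date
by_condition = toy_df.sort_values(["condition", "date"])

# Date only: chronological order across all conditions
by_date = toy_df.sort_values("date")

first_two_grouped = by_condition.head(2)["drugName"].tolist()
oldest_two = by_date.head(2)["drugName"].tolist()
newest_two = by_date.tail(2)["drugName"].tolist()

print(first_two_grouped, oldest_two, newest_two)
```

In the toy data the first two grouped rows both belong to the alphabetically first condition, while the date-only sort returns the two oldest reviews regardless of condition.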

In addition, most of the first reviews were rated between 7 and 8, whereas the last ones ranged from 1 to 10, with greater variation. From this information, we can pose further questions: did the release of other medications, and the promises that came with them, affect how they were evaluated? How many medications appeared in 2008? And how many in 2017?
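
One way to answer "how many medications appeared in a given year" is to count distinct drug names per year. This is a minimal sketch on made-up data; `toy_df` is an assumption standing in for the reviews DataFrame with its `year` column.

```python
import pandas as pd

# Toy data (all values are made up)
toy_df = pd.DataFrame({
    "drugName": ["A", "A", "B", "A", "B", "C"],
    "year": [2008, 2008, 2008, 2017, 2017, 2017],
})

# Number of distinct medications reviewed in each year
distinct_per_year = toy_df.groupby("year")["drugName"].nunique()
print(distinct_per_year)
```

Unlike `value_counts()`, which counts reviews per medication, `nunique()` counts each medication once per year.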

In [11]:
medication_reviews_first_reviews_selected.drugName.value_counts()[:20] # These are the counts of the first 100 reviews

Loestrin 24 Fe         11
Mirena                  6
Depo-Provera            6
Tri-Sprintec            5
Implanon                4
Sprintec                3
Hypercare               3
Drysol                  2
Ortho Tri-Cyclen Lo     2
Seasonique              2
Wellbutrin              2
Safyral                 2
Ambien                  1
Estradiol Patch         1
Norco                   1
Provigil                1
Adderall                1
Relpax                  1
Seroquel                1
Xanax XR                1
Name: drugName, dtype: int64

In [12]:
medication_reviews_last_reviews_selected.drugName.value_counts()[:20] # These are the counts of the last 100 reviews

Budesonide / formoterol                          39
Formoterol / mometasone                          20
Phenylephrine                                     6
Dapsone                                           5
Formoterol                                        4
Stimate                                           3
Naproxen                                          3
Diclofenac                                        3
Nabumetone                                        3
Mycophenolic acid                                 2
Arformoterol                                      2
Cyclobenzaprine                                   2
Antihemophilic factor / von willebrand factor     2
Dapagliflozin / metformin                         1
Fluconazole                                       1
Itraconazole                                      1
Simethicone                                       1
Salicylic acid / urea                             1
Ibuprofen                                         1
Name: drugName, dtype: int64

In the first 100 reviews selected from the dataset, *Loestrin 24 Fe* and *Mirena* have the highest counts. In the last 100, on the other hand, *Mirena* no longer stands out among the birth control medications, as it does not appear among the most-mentioned names, while *Budesonide / formoterol* and *Formoterol / mometasone* lead with 39 and 20 mentions.

In [13]:
# Extract the year from the date column
medication_reviews_df['year'] = pd.to_datetime(medication_reviews_df['date']).dt.year

# Filter the dataset to only include data from 2008
medication_reviews_df_2008 = medication_reviews_df[medication_reviews_df['year'] == 2008]

# Count the occurrences of each medication in 2008
medications_2008 = medication_reviews_df_2008['drugName'].value_counts()[:10]
print("Number of occurrences of medications in 2008:")
print(medications_2008)

Number of occurrences of medications in 2008:
Phentermine                    78
Alprazolam                     42
Oxycodone                      41
Lyrica                         35
Tramadol                       35
Pregabalin                     34
Methadone                      34
Acetaminophen / oxycodone      33
Lexapro                        33
Acetaminophen / hydrocodone    32
Name: drugName, dtype: int64


In [14]:
# Now, let's plot the results from 2008
# Filter the dataframe to include only rows where the date is from 2008
medication_reviews_df_2008 = medication_reviews_df[medication_reviews_df['date'].dt.year == 2008]
top_30_medication_counts = medication_reviews_df_2008['drugName'].value_counts().head(30) # Get the top 30 medications by count
top_30_medication_names = top_30_medication_counts.index.tolist()

# Filter the dataframe to include only the top 30 medications
medication_reviews_df_top_30 = medication_reviews_df_2008[medication_reviews_df_2008['drugName'].isin(top_30_medication_names)]

# Create the treemap chart using the filtered dataframe
fig = px.treemap(medication_reviews_df_top_30, title='Medications mentioned in reviews from 2008',
                 path=['drugName', 'date'], color='drugName', color_continuous_scale=px.colors.sequential.GnBu)

fig.show()

In [15]:
# Filter the dataset to only include data from 2017
medication_reviews_df_2017 = medication_reviews_df[medication_reviews_df['year'] == 2017]
medications_2017 = medication_reviews_df_2017['drugName'].value_counts()[:10]
print("Number of occurrences of medications in 2017:")
print(medications_2017)

Number of occurrences of medications in 2017:
Levonorgestrel                        589
Ethinyl estradiol / norethindrone     363
Etonogestrel                          352
Miconazole                            343
Nexplanon                             301
Ethinyl estradiol / norgestimate      268
Metronidazole                         225
Sertraline                            222
Gabapentin                            219
Ethinyl estradiol / levonorgestrel    217
Name: drugName, dtype: int64


In [16]:
# Now, let's plot the results from 2017
# Filter the dataframe to include only rows where the date is from 2017
medication_reviews_df_2017 = medication_reviews_df[medication_reviews_df['date'].dt.year == 2017]
top_30_medication_counts = medication_reviews_df_2017['drugName'].value_counts().head(30) # Get the top 30 medications by count
top_30_medication_names = top_30_medication_counts.index.tolist()

# Filter the dataframe to include only the top 30 medications
medication_reviews_df_top_30 = medication_reviews_df_2017[medication_reviews_df_2017['drugName'].isin(top_30_medication_names)]

# Create the treemap chart using the filtered dataframe
fig = px.treemap(medication_reviews_df_top_30, title='Medications mentioned in reviews from 2017',
                 path=['drugName', 'date'], color='drugName', color_continuous_scale=px.colors.sequential.GnBu)

fig.show()

In [17]:
# Get the number of reviews published in each year
# Group the data by year and count the number of reviews
reviews_by_year = medication_reviews_df.groupby('year')['review'].count()
print("Number of reviews per year:")
print(reviews_by_year)

Number of reviews per year:
year
2008     3518
2009     7912
2010     5599
2011     7776
2012     6725
2013     8380
2014     8374
2015    19148
2016    24891
2017    20006
Name: review, dtype: int64


Comparing the reviews published in 2008 and 2017 (almost 10 years apart), *Levonorgestrel*, *Etonogestrel* and the *Ethinyl estradiol* combinations appear far more often in 2017 than in 2008, keeping in mind that far fewer reviews were collected in 2008 (3,518 versus 20,006 in 2017). More investigation is needed to understand why some medications no longer appear as often as before (such as *Mirena*) and why others are popular among the users (such as *Levonorgestrel* and *Ethinyl estradiol*).
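
Since the yearly totals differ so much, raw counts are hard to compare across years. One option is to look at each medication's share of the reviews in its year. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up data; `toy_df` is an assumption for illustration.

```python
import pandas as pd

# Toy data: 2 reviews in 2008 and 6 in 2017 (all values are made up)
toy_df = pd.DataFrame({
    "drugName": ["A", "B", "A", "A", "B", "B", "B", "C"],
    "year": [2008, 2008, 2017, 2017, 2017, 2017, 2017, 2017],
})

# Each row sums to 1: the share of that year's reviews per medication
share_by_year = pd.crosstab(
    toy_df["year"], toy_df["drugName"], normalize="index"
)
print(share_by_year.round(2))
```

Here medication B has more reviews in 2017 than in 2008, yet its share of the yearly reviews is the same in both years.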

#### 3.1 Splitting the sentences by words and spaces

In [18]:
# Import libraries
import string

# Split the 'review' column by words/spaces
medication_reviews_df['review_words'] = medication_reviews_df['review'].str.split()
medication_reviews_df['review_words'][:1].tolist()

[['"It',
  'has',
  'no',
  'side',
  'effect,',
  'I',
  'take',
  'it',
  'in',
  'combination',
  'of',
  'Bystolic',
  '5',
  'Mg',
  'and',
  'Fish',
  'Oil"']]

In [19]:
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,..."
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ..."
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept..."
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,..."
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,..."


#### 3.2 Removing the punctuation

Since punctuation contributes to the meaning of a sentence, we remove only some punctuation marks for now in order to preprocess the sample. We may re-evaluate this decision in the future.
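
Python's `str.replace` treats its first argument as one literal substring, so removing a *set* of characters needs a per-character approach. A minimal sketch using a translation table follows; keeping the apostrophe is an assumption made for this example, and `strip_punctuation` is a hypothetical helper name.

```python
import string

# Keep the apostrophe so contractions such as "don't" survive
# (an assumption made for this sketch)
punctuation_to_remove = string.punctuation.replace("'", "")

# With three arguments, str.maketrans maps every character in the
# third argument to None, i.e. it deletes them
removal_table = str.maketrans("", "", punctuation_to_remove)

def strip_punctuation(text):
    """Delete the selected punctuation characters from a string."""
    return text.translate(removal_table)

sample = '"It has no side effect, but I don\'t take it daily!"'
cleaned = strip_punctuation(sample)
print(cleaned)
```

The helper works on plain strings, so on a DataFrame it would be applied to the raw `review` column rather than to the list of words.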

In [20]:
# Let's evaluate which punctuation characters are going to be removed
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
# Define the punctuation characters to remove (every mark in string.punctuation except the apostrophe)
punctuation_to_remove = r'!"#$%&\()*+,-./:;<=>?@[\\]^_`{|}~'

# Note: str.replace() searches for the whole string above as a single substring,
# so no punctuation is actually removed here; it is stripped by the regex in step 3.3
medication_reviews_df['review_without_punctuation'] = medication_reviews_df['review_words'].apply(lambda x: str(x).replace(punctuation_to_remove, ''))
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words,review_without_punctuation
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,...","['""It', 'has', 'no', 'side', 'effect,', 'I', '..."
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ...","['""My', 'son', 'is', 'halfway', 'through', 'hi..."
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept...","['""I', 'used', 'to', 'take', 'another', 'oral'..."
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,...","['""This', 'is', 'my', 'first', 'time', 'using'..."
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,...","['""Suboxone', 'has', 'completely', 'turned', '..."


#### 3.3 Removing non-alphabetic characters and applying lower case

Now we remove the non-alphabetic characters with a regular expression. We define a pattern that matches any character that is neither a letter nor whitespace, and pass it to the pandas *str.replace( )* method to strip those characters.

In [22]:
# Import the "re" module ("re" from regular expression)
import re

# Convert non-string values to strings
medication_reviews_df['review_strings'] = medication_reviews_df['review_without_punctuation'].astype(str)

# Define a regular expression pattern that matches non-alphabetic characters
pattern = r'[^a-zA-Z\s]'

# Remove non-alphabetic characters and convert the remaining text to lower case
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
medication_reviews_df['review_lower_case'] = medication_reviews_df['review_strings'].str.replace(pattern, '', regex=True).str.lower()
medication_reviews_df['review_lower_case'][:4:2].tolist()

['it has no side effect i take it in combination of bystolic  mg and fish oil',
 'i used to take another oral contraceptive which had  pill cycle and was very happy very light periods max  days no other side effects but it contained hormone gestodene which is not available in us so i switched to lybrel because the ingredients are similar when my other pills ended i started lybrel immediately on my first day of period as the instructions said and the period lasted for two weeks when taking the second pack same two weeks and now with third pack things got even worse my third period lasted for two weeks and now its the end of the third week i still have daily brown discharge the positive side is that i didnt have any other side effects the idea of being period free was so tempting alas']
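
The output above contains double spaces (for example `bystolic  mg`) where digits were removed. The sketch below applies the same pattern to a single string with `re.sub` and then collapses repeated whitespace. The whitespace step and the `clean_review` helper are additions for illustration, not part of the notebook's pipeline.

```python
import re

NON_ALPHA = re.compile(r"[^a-zA-Z\s]")  # same pattern as in the notebook
MULTI_SPACE = re.compile(r"\s+")        # extra step, assumed for this sketch

def clean_review(text):
    """Drop non-alphabetic characters, collapse whitespace, lower-case."""
    letters_only = NON_ALPHA.sub("", text)
    collapsed = MULTI_SPACE.sub(" ", letters_only).strip()
    return collapsed.lower()

sample = "I take it in combination of Bystolic 5 Mg and Fish Oil"

# Pattern alone: the digit disappears and leaves two spaces behind
pattern_only = NON_ALPHA.sub("", sample).lower()
print(pattern_only)

# Pattern plus whitespace collapsing
print(clean_review(sample))
```

The doubled spaces are harmless for the later `split()` call, which ignores runs of whitespace, but they matter if the cleaned string is used directly.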

In [23]:
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words,review_without_punctuation,review_strings,review_lower_case
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...",it has no side effect i take it in combination...
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ...","['""My', 'son', 'is', 'halfway', 'through', 'hi...","['""My', 'son', 'is', 'halfway', 'through', 'hi...",my son is halfway through his fourth week of i...
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept...","['""I', 'used', 'to', 'take', 'another', 'oral'...","['""I', 'used', 'to', 'take', 'another', 'oral'...",i used to take another oral contraceptive whic...
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,...","['""This', 'is', 'my', 'first', 'time', 'using'...","['""This', 'is', 'my', 'first', 'time', 'using'...",this is my first time using any form of birth ...
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,...","['""Suboxone', 'has', 'completely', 'turned', '...","['""Suboxone', 'has', 'completely', 'turned', '...",suboxone has completely turned my life around ...


#### 3.4 Applying tokenization

In [24]:
# Define a function to tokenize the reviews
def tokenize(medication_reviews):
    tokens = medication_reviews.split()
    return tokens

medication_reviews_df['review_words_tokenized'] = medication_reviews_df['review_lower_case'].apply(lambda x: tokenize(x))
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words,review_without_punctuation,review_strings,review_lower_case,review_words_tokenized
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...",it has no side effect i take it in combination...,"[it, has, no, side, effect, i, take, it, in, c..."
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ...","['""My', 'son', 'is', 'halfway', 'through', 'hi...","['""My', 'son', 'is', 'halfway', 'through', 'hi...",my son is halfway through his fourth week of i...,"[my, son, is, halfway, through, his, fourth, w..."
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept...","['""I', 'used', 'to', 'take', 'another', 'oral'...","['""I', 'used', 'to', 'take', 'another', 'oral'...",i used to take another oral contraceptive whic...,"[i, used, to, take, another, oral, contracepti..."
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,...","['""This', 'is', 'my', 'first', 'time', 'using'...","['""This', 'is', 'my', 'first', 'time', 'using'...",this is my first time using any form of birth ...,"[this, is, my, first, time, using, any, form, ..."
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,...","['""Suboxone', 'has', 'completely', 'turned', '...","['""Suboxone', 'has', 'completely', 'turned', '...",suboxone has completely turned my life around ...,"[suboxone, has, completely, turned, my, life, ..."


#### 3.5 Removing stopwords

Now, we apply another technique, this time from the *nltk* library, to remove the stopwords.

Again, we remove only some of the stopwords, since we need to preserve the words that carry the sentiment of the sentences.
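
A minimal sketch of a negation-preserving filter follows. To stay runnable without downloading NLTK data, it uses a tiny hand-written stopword set (`STOPWORDS`) as a stand-in for NLTK's English list; both sets and the sample sentence are assumptions for illustration. Because the earlier regex step strips apostrophes, contractions reach this stage as forms like `didnt`, which is why the apostrophe-free spelling is used here.

```python
# Hypothetical mini stopword list standing in for NLTK's English list
STOPWORDS = {"i", "it", "the", "is", "a", "and", "did", "no", "not"}

# Negations to keep even when they are listed as stopwords
NEGATIONS = {"no", "not", "didnt", "dont"}

def remove_some_stopwords(words, stopwords=STOPWORDS, keep=NEGATIONS):
    """Drop stopwords from a token list, except the negations in `keep`."""
    return [w for w in words if w not in stopwords or w in keep]

tokens = "it did not work and i didnt sleep".split()
filtered = remove_some_stopwords(tokens)
print(filtered)
```

Keeping `not` preserves the difference between "work" and "not work", which matters for sentiment analysis.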

In [25]:
# Import the NLTK package
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# View the stopwords
english_stop_words = stopwords.words('english')
print(english_stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
# Define a function to remove all the stopwords (we are keeping this function for now, as we might need it in the future)
def remove_all_stopwords(review_words_tokenized):
    text = [word for word in review_words_tokenized if word not in english_stop_words]
    return text

# medication_reviews_df['review_words_non_stop'] = medication_reviews_df['review_words_tokenized'].apply(lambda x: remove_all_stopwords(x))
# medication_reviews_df.head()

In [27]:
# Let's set our own stopwords
# (use a new name so the imported nltk "stopwords" module is not overwritten)
stopword_set = set(english_stop_words)

negative_stopwords = {'no', 'not', 'nor', 'don', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't",
                      'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't",
                      'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
                      'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"}

def remove_some_stopwords(words):
    # Keep a word if it is not a stopword, or if it is a negation we want to preserve
    filtered_words = [word for word in words if word not in stopword_set or word in negative_stopwords]
    return filtered_words

In [28]:
# Apply the function
medication_reviews_df['review_without_stopwords'] = medication_reviews_df['review_words_tokenized'].apply(remove_some_stopwords)
medication_reviews_df['review_without_stopwords']

0         [no, side, effect, take, combination, bystolic...
1         [son, halfway, fourth, week, intuniv, became, ...
2         [used, take, another, oral, contraceptive, pil...
3         [first, time, using, form, birth, control, im,...
4         [suboxone, completely, turned, life, around, f...
                                ...                        
112324    [mg, seems, work, every, nd, day, still, excru...
112325    [tekturna, days, effect, immediate, also, calc...
112326    [wrote, first, report, midoctober, not, alcoho...
112327    [ive, thyroid, medication, years, spent, first...
112328    [ive, chronic, constipation, adult, life, trie...
Name: review_without_stopwords, Length: 112329, dtype: object

In [29]:
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words,review_without_punctuation,review_strings,review_lower_case,review_words_tokenized,review_without_stopwords
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...",it has no side effect i take it in combination...,"[it, has, no, side, effect, i, take, it, in, c...","[no, side, effect, take, combination, bystolic..."
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ...","['""My', 'son', 'is', 'halfway', 'through', 'hi...","['""My', 'son', 'is', 'halfway', 'through', 'hi...",my son is halfway through his fourth week of i...,"[my, son, is, halfway, through, his, fourth, w...","[son, halfway, fourth, week, intuniv, became, ..."
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept...","['""I', 'used', 'to', 'take', 'another', 'oral'...","['""I', 'used', 'to', 'take', 'another', 'oral'...",i used to take another oral contraceptive whic...,"[i, used, to, take, another, oral, contracepti...","[used, take, another, oral, contraceptive, pil..."
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,...","['""This', 'is', 'my', 'first', 'time', 'using'...","['""This', 'is', 'my', 'first', 'time', 'using'...",this is my first time using any form of birth ...,"[this, is, my, first, time, using, any, form, ...","[first, time, using, form, birth, control, im,..."
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,...","['""Suboxone', 'has', 'completely', 'turned', '...","['""Suboxone', 'has', 'completely', 'turned', '...",suboxone has completely turned my life around ...,"[suboxone, has, completely, turned, my, life, ...","[suboxone, completely, turned, life, around, f..."


#### 3.6 Applying lemmatization

In [30]:
# Download the WordNet data and import the lemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() # Create a lemmatizer object

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [31]:
# Define a function to lemmatize a list of tokens
def lemmatization(tokens):
    # Each element of review_without_stopwords is already a single word,
    # so we lemmatize the tokens directly and return them as a list
    return [lemmatizer.lemmatize(word) for word in tokens]

medication_reviews_df['review_word_lemm'] = medication_reviews_df['review_without_stopwords'].apply(lemmatization)
medication_reviews_df.head()

Unnamed: 0,drugName,condition,review,rating,date,usefulCount,year,review_words,review_without_punctuation,review_strings,review_lower_case,review_words_tokenized,review_without_stopwords,review_word_lemm
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27,2012,"[""It, has, no, side, effect,, I, take, it, in,...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...","['""It', 'has', 'no', 'side', 'effect,', 'I', '...",it has no side effect i take it in combination...,"[it, has, no, side, effect, i, take, it, in, c...","[no, side, effect, take, combination, bystolic...","[no, side, effect, take, combination, bystolic..."
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192,2010,"[""My, son, is, halfway, through, his, fourth, ...","['""My', 'son', 'is', 'halfway', 'through', 'hi...","['""My', 'son', 'is', 'halfway', 'through', 'hi...",my son is halfway through his fourth week of i...,"[my, son, is, halfway, through, his, fourth, w...","[son, halfway, fourth, week, intuniv, became, ...","[son, halfway, fourth, week, intuniv, became, ..."
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17,2009,"[""I, used, to, take, another, oral, contracept...","['""I', 'used', 'to', 'take', 'another', 'oral'...","['""I', 'used', 'to', 'take', 'another', 'oral'...",i used to take another oral contraceptive whic...,"[i, used, to, take, another, oral, contracepti...","[used, take, another, oral, contraceptive, pil...","[used, take, another, oral, contraceptive, pil..."
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10,2015,"[""This, is, my, first, time, using, any, form,...","['""This', 'is', 'my', 'first', 'time', 'using'...","['""This', 'is', 'my', 'first', 'time', 'using'...",this is my first time using any form of birth ...,"[this, is, my, first, time, using, any, form, ...","[first, time, using, form, birth, control, im,...","[first, time, using, form, birth, control, im,..."
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37,2016,"[""Suboxone, has, completely, turned, my, life,...","['""Suboxone', 'has', 'completely', 'turned', '...","['""Suboxone', 'has', 'completely', 'turned', '...",suboxone has completely turned my life around ...,"[suboxone, has, completely, turned, my, life, ...","[suboxone, completely, turned, life, around, f...","[suboxone, completely, turned, life, around, f..."
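
In the preview above, words such as `became`, `using`, and `turned` come out of the lemmatizer unchanged. This is expected: `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is passed. The sketch below is one possible way to handle this, not what the notebook does: it tries several part-of-speech tags in order and keeps the first lemma that differs from the token. The helper name `lemmatize_tokens` and the `pos_order` parameter are hypothetical, and the "first change wins" rule is a simple heuristic rather than real POS tagging.

```python
from typing import Callable, Iterable, List, Sequence


def lemmatize_tokens(
    tokens: Iterable[str],
    lemmatize: Callable[[str, str], str],
    pos_order: Sequence[str] = ("n", "v", "a"),
) -> List[str]:
    """Lemmatize each token, trying several WordNet POS tags in order.

    `lemmatize` is any callable shaped like (word, pos) -> lemma, for
    example `WordNetLemmatizer().lemmatize`. The first POS tag that changes
    the word wins; if none does, the token is kept unchanged.
    """
    lemmas = []
    for token in tokens:
        lemma = token
        for pos in pos_order:
            candidate = lemmatize(token, pos)
            if candidate != token:
                lemma = candidate
                break
        lemmas.append(lemma)
    return lemmas


# Possible usage with the notebook's objects (assumes the WordNet data
# downloaded above; the new column name is hypothetical):
# medication_reviews_df['review_word_lemm_pos'] = (
#     medication_reviews_df['review_without_stopwords']
#     .apply(lambda tokens: lemmatize_tokens(tokens, lemmatizer.lemmatize))
# )
```

Passing the lemmatizer in as an argument keeps the helper independent of NLTK, so it can be checked with a small stand-in function before running it on all the reviews.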


In [32]:
medication_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112329 entries, 0 to 112328
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   drugName                    112329 non-null  object        
 1   condition                   112329 non-null  object        
 2   review                      112329 non-null  object        
 3   rating                      112329 non-null  float64       
 4   date                        112329 non-null  datetime64[ns]
 5   usefulCount                 112329 non-null  int64         
 6   year                        112329 non-null  int64         
 7   review_words                112329 non-null  object        
 8   review_without_punctuation  112329 non-null  object        
 9   review_strings              112329 non-null  object        
 10  review_lower_case           112329 non-null  object        
 11  review_words_tokenized      112329 non-

In [33]:
# Confirm once more that no missing values remain after the cleaning done during the EDA
medication_reviews_df.isnull().sum()

drugName                      0
condition                     0
review                        0
rating                        0
date                          0
usefulCount                   0
year                          0
review_words                  0
review_without_punctuation    0
review_strings                0
review_lower_case             0
review_words_tokenized        0
review_without_stopwords      0
review_word_lemm              0
dtype: int64
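
Note that `isnull()` only counts missing values. A review whose words were all removed as stopwords would be stored as an empty list, which still counts as non-null. A minimal sketch of an extra check, using a small toy dataframe (the data below is made up for illustration and is not taken from the dataset):

```python
import pandas as pd

# Toy data: the second row stands for a review left with no tokens
toy_df = pd.DataFrame(
    {"review_word_lemm": [["no", "side", "effect"], [], ["son", "halfway"]]}
)

# isnull() does not flag the empty list
null_count = int(toy_df["review_word_lemm"].isnull().sum())

# Counting the rows whose token list has length zero does
empty_count = int((toy_df["review_word_lemm"].map(len) == 0).sum())

print(null_count, empty_count)
```

The same `map(len)` expression could be applied to the token columns of `medication_reviews_df` to see whether any review ended up empty.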

In [34]:
# Save the dataset with all the processed columns to a new .csv file
medication_reviews_df.to_csv('medication_reviews_dataset_processed.csv', index=False) # Dataset with all the columns preprocessed

In [35]:
# Drop the intermediate columns and save the cleaned and preprocessed dataframe to a new .csv file
medication_reviews_df_cleaned = medication_reviews_df.drop(['review', 'review_words', 'review_without_punctuation', 'review_strings', 'review_lower_case', 'review_words_tokenized', 'review_without_stopwords'], axis=1)
medication_reviews_df_cleaned.to_csv('medication_reviews_dataset_cleaned_and_processed.csv', index=False)
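
One thing to keep in mind when loading these files again: a `.csv` file stores text only, so a column of Python lists such as `review_word_lemm` is written as its string representation and comes back as a string. A minimal sketch of the round trip, assuming a toy dataframe and an in-memory buffer instead of the real files, with `ast.literal_eval` as one way to rebuild the lists:

```python
import ast
import io

import pandas as pd

# Toy dataframe with one list-valued column (illustrative data only)
toy_df = pd.DataFrame(
    {"drugName": ["Valsartan"], "review_word_lemm": [["no", "side", "effect"]]}
)

# Write to an in-memory buffer and read it back, as a stand-in for the .csv file
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the cell holds a string, not a list
as_text = reloaded.loc[0, "review_word_lemm"]

# Parse the string representation back into a Python list
reloaded["review_word_lemm"] = reloaded["review_word_lemm"].apply(ast.literal_eval)
as_list = reloaded.loc[0, "review_word_lemm"]

print(type(as_text).__name__, as_list)
```

If the next notebook only needs plain text, joining the tokens with spaces before saving would avoid the parsing step.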