## Data Modeling and Evaluation Steps

In this step, we train and evaluate several machine learning models to find the one that predicts the sentiment of medication reviews with the best performance and the lowest loss.

These are the models tried:

1. Random Forest Classifier
2. Naive Bayes Classifier
3. Ensemble Model
4. Long Short-Term Memory (LSTM)

Note: this notebook was split by model so that each one is easier to read and its steps stay concise and clear.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

#### Step 1: Retrieve the *X* and *y* variables from our training set

In [6]:
# Import basic libraries
import pandas as pd
import numpy as np

In [9]:
# Load the training dataset
# train_medication_reviews = pd.read_csv('/content/drive/Othercomputers/My MacBook Pro/Sentiment-Analysis-of-Medication-Reviews-Project/medication_reviews_dataset_to_train.csv', sep=',')
train_medication_reviews = pd.read_csv('/Users/rafaelaqueiroz/Sentiment-Analysis-of-Medication-Reviews-Project/medication_reviews_dataset_to_train.csv', sep=',')
train_medication_reviews

Unnamed: 0,drugName,condition,rating,date,usefulCount,year,review_word_lemm,polarity,rating_classification
0,Valsartan,Left Ventricular Dysfunction,9.0,2012-05-20,27,2012,"['no', 'side', 'effect', 'take', 'combination'...",0.000000,2
1,Guanfacine,ADHD,8.0,2010-04-27,192,2010,"['son', 'halfway', 'fourth', 'week', 'intuniv'...",0.188021,2
2,Lybrel,Birth Control,5.0,2009-12-14,17,2009,"['used', 'take', 'another', 'oral', 'contracep...",0.113636,1
3,Ortho Evra,Birth Control,8.0,2015-11-03,10,2015,"['first', 'time', 'using', 'form', 'birth', 'c...",0.262500,2
4,Buprenorphine / naloxone,Opiate Dependence,9.0,2016-11-27,37,2016,"['suboxone', 'completely', 'turned', 'life', '...",0.163333,2
...,...,...,...,...,...,...,...,...,...
112324,Carbamazepine,Trigeminal Neuralgia,1.0,2016-01-31,10,2016,"['mg', 'seems', 'work', 'every', 'nd', 'day', ...",0.000000,0
112325,Tekturna,High Blood Pressure,7.0,2010-02-07,18,2010,"['tekturna', 'day', 'effect', 'immediate', 'al...",-0.087500,2
112326,Campral,Alcohol Dependence,10.0,2015-05-31,125,2015,"['wrote', 'first', 'report', 'midoctober', 'no...",0.261905,2
112327,Thyroid desiccated,Underactive Thyroid,10.0,2015-09-19,79,2015,"['ive', 'thyroid', 'medication', 'year', 'spen...",0.201313,2


In [10]:
# As we already know from our previous notebook (notebook 4), our independent variable (X) is going to be the "review_word_lemm" variable
X_train = train_medication_reviews.review_word_lemm
X_train

0         ['no', 'side', 'effect', 'take', 'combination'...
1         ['son', 'halfway', 'fourth', 'week', 'intuniv'...
2         ['used', 'take', 'another', 'oral', 'contracep...
3         ['first', 'time', 'using', 'form', 'birth', 'c...
4         ['suboxone', 'completely', 'turned', 'life', '...
                                ...                        
112324    ['mg', 'seems', 'work', 'every', 'nd', 'day', ...
112325    ['tekturna', 'day', 'effect', 'immediate', 'al...
112326    ['wrote', 'first', 'report', 'midoctober', 'no...
112327    ['ive', 'thyroid', 'medication', 'year', 'spen...
112328    ['ive', 'chronic', 'constipation', 'adult', 'l...
Name: review_word_lemm, Length: 112329, dtype: object

In [5]:
X_train.shape

(112329,)

In [12]:
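# Join each review's tokens into one string to pass to the pipeline.
# The CSV stores each token list as text (e.g. "['no', 'side', ...]"), so parse it first.
import ast
X_train_strings = [' '.join(ast.literal_eval(words)) for words in X_train]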

In [13]:
type(X_train)

pandas.core.series.Series

In [11]:
# As we know, our target or dependent variable (y) is going to be the 'rating_classification' variable
y_train = train_medication_reviews.rating_classification
y_train

0         2
1         2
2         1
3         2
4         2
         ..
112324    0
112325    2
112326    2
112327    2
112328    2
Name: rating_classification, Length: 112329, dtype: int64

In [15]:
y_train.shape

(112329,)

In [16]:
type(y_train)

pandas.core.series.Series

#### Step 2: Create the *X* and *y* variables from our testing set

Note: Because this is the testing set, we do not apply any cleaning or processing to it. However, since we label-encoded the *rating* column in the training set, we need to encode it here as well: the model predicts the classes 0 (negative reviews), 1 ("neutral" reviews), and 2 (positive reviews), not the original ratings from 1 to 10.

In [12]:
# Load the test dataset
# test_drug_reviews_df = pd.read_csv('/content/drive/MyDrive/Data-Science-Other-Materials/Data-Scientist-Bootcamp/Sentiment-Analysis-of-Drug-Reviews/drugsComTest_raw.tsv', delimiter='\t')
test_drug_reviews_df = pd.read_csv('/Users/rafaelaqueiroz/Sentiment-Analysis-of-Medication-Reviews-Project/drugsComTest_raw.tsv', delimiter='\t')
test_drug_reviews_df.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4


In [11]:
test_drug_reviews_df.shape

(53766, 7)

In [12]:
type(test_drug_reviews_df)

pandas.core.frame.DataFrame

In [13]:
# Import library to label encode the rating column
from sklearn.preprocessing import LabelEncoder

# Define the bin edges and labels; pd.cut uses right-closed intervals: (0, 4], (4, 6], (6, 10]
bin_edges = [0, 4, 6, 10]  # Ratings 1-4 are negative, 5-6 are neutral, 7-10 are positive
bin_labels = ['negative', 'neutral', 'positive']

# Use cut to bin the "rating" column and create the new column called "rating_classification"
test_drug_reviews_df['rating_classification'] = pd.cut(test_drug_reviews_df['rating'], bins=bin_edges, labels=bin_labels)

# Instantiate LabelEncoder and fit_transform the new column.
# LabelEncoder sorts labels alphabetically, so negative -> 0, neutral -> 1, positive -> 2
le = LabelEncoder()
test_drug_reviews_df['rating_classification'] = le.fit_transform(test_drug_reviews_df['rating_classification'])
test_drug_reviews_df.head(30)

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount,rating_classification
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22,2
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17,2
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3,2
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35,2
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4,2
5,208087,Zyclara,Keratosis,"""4 days in on first 2 weeks. Using on arms an...",4.0,"July 3, 2014",13,0
6,215892,Copper,Birth Control,"""I&#039;ve had the copper coil for about 3 mon...",6.0,"June 6, 2016",1,1
7,169852,Amitriptyline,Migraine Prevention,"""This has been great for me. I&#039;ve been on...",9.0,"April 21, 2009",32,2
8,23295,Methadone,Opiate Withdrawal,"""Ive been on Methadone for over ten years and ...",7.0,"October 18, 2016",21,2
9,71428,Levora,Birth Control,"""I was on this pill for almost two years. It d...",2.0,"April 16, 2011",3,0
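
The binning and encoding steps in the cell above can also be done in one call. The following is a minimal sketch, assuming the same bin edges; the toy `ratings` series and the `rating_to_class` helper are hypothetical names used only for illustration. Passing `labels=False` to `pd.cut` returns the integer bin index directly, so the 0/1/2 mapping does not rely on the alphabetical order that `LabelEncoder` applies to the label names.

```python
import pandas as pd

def rating_to_class(ratings: pd.Series) -> pd.Series:
    """Map 1-10 ratings to 0 (negative), 1 (neutral), 2 (positive).

    Assumes there are no missing ratings.
    """
    # Right-closed bins: (0, 4] -> 0, (4, 6] -> 1, (6, 10] -> 2
    return pd.cut(ratings, bins=[0, 4, 6, 10], labels=False).astype(int)

# Hypothetical toy ratings, not taken from the dataset
ratings = pd.Series([1.0, 4.0, 5.0, 6.0, 7.0, 10.0])
classes = rating_to_class(ratings)
print(classes.tolist())
```

The boundary ratings 4 and 6 fall into the lower class in each case, which matches the "1-4 negative, 5-6 neutral, 7-10 positive" rule used in this notebook.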


In [14]:
type(test_drug_reviews_df)

pandas.core.frame.DataFrame

In [15]:
test_drug_reviews_df.shape

(53766, 8)

In [14]:
X_test = test_drug_reviews_df['review']
X_test

0        "I&#039;ve tried a few antidepressants over th...
1        "My son has Crohn&#039;s disease and has done ...
2                            "Quick reduction of symptoms"
3        "Contrave combines drugs that were used for al...
4        "I have been on this birth control for one cyc...
                               ...                        
53761    "I have taken Tamoxifen for 5 years. Side effe...
53762    "I&#039;ve been taking Lexapro (escitaploprgra...
53763    "I&#039;m married, 34 years old and I have no ...
53764    "I was prescribed Nucynta for severe neck/shou...
53765                                        "It works!!!"
Name: review, Length: 53766, dtype: object

In [27]:
type(X_test)

pandas.core.series.Series

In [28]:
X_test.shape

(53766,)

In [15]:
y_test = test_drug_reviews_df.rating_classification
y_test

0        2
1        2
2        2
3        2
4        2
        ..
53761    2
53762    2
53763    2
53764    0
53765    2
Name: rating_classification, Length: 53766, dtype: int64

In [22]:
type(y_test)

pandas.core.series.Series

In [23]:
y_test.shape

(53766,)

#### Step 3.1: Create a pipeline of the *Random Forest Classifier*

The *Random Forest Classifier* is a machine learning algorithm for classification tasks that usually performs well, so it is our first choice for training.

In [24]:
# Import libraries to create the pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer # For BoW and TFIDF
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [36]:
# Create a function to make the pipeline
def pipeline_rfc():
    # X is the feature matrix
    X_train = train_medication_reviews.review_word_lemm
    # y is the label vector
    y_train = train_medication_reviews.rating_classification
    
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42, class_weight={0: 2, 1: 1, 2: 1})) # Initialize the Random Forest Classifier
        ])

    # Train the model on the training set
    pipeline.fit(X_train, y_train)

    # Predict the labels for the test set
    X_test = test_drug_reviews_df['review']
    y_test = test_drug_reviews_df.rating_classification
    y_pred = pipeline.predict(X_test)

    # Calculate the evaluation metrics
    print(classification_report(y_test, y_pred))
    # Take the labels from y_test so every true class appears, even one the model never predicts
    labels = np.unique(y_test)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    print(confusion_mat)

pipeline_rfc()

              precision    recall  f1-score   support

           0       0.91      0.40      0.56     13497
           1       0.99      0.05      0.10      4829
           2       0.74      0.99      0.85     35440

    accuracy                           0.76     53766
   macro avg       0.88      0.48      0.50     53766
weighted avg       0.81      0.76      0.71     53766

[[ 5423     2  8072]
 [  304   249  4276]
 [  225     0 35215]]


The results of the model evaluation show the precision, recall, and F1-score for each label in the test set, as well as the macro and weighted averages of these metrics.

From these results, we can point out the following key aspects:

Based on the precision, recall, and F1-score, the model performed best on the positive class (class 2), with an F1-score of 0.85. Performance on the negative class (class 0) was moderate, with an F1-score of 0.56: precision was high (0.91) but recall was only 0.40. The model performed poorly on the neutral class (class 1), with an F1-score of only 0.10 and a recall of 0.05.

The confusion matrix provides more detailed information about the model's performance.

Each row in the matrix represents the actual class, while each column represents the predicted class. The values in the matrix show the number of samples that fall into each category. Looking at the matrix, we can see that the model predicted a large number of samples as positive (class 2), which is reflected in the high recall score for that class. The cost is a low recall for the negative and neutral classes: 8,072 of the 13,497 negative reviews and 4,276 of the 4,829 neutral reviews were incorrectly classified as positive. Precision for those two classes stays high because the few samples the model does assign to them are mostly correct.

Overall, the weighted average F1-score was 0.71, but this figure is dominated by the large positive class; the macro average F1-score, which weights each class equally, was only 0.50. It may be useful to explore further why the model performed poorly on the neutral class and whether there are ways to improve its performance.
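
To make the link between the confusion matrix and the report explicit, the per-class metrics can be recomputed by hand. This is a small sketch in plain Python; the `per_class_metrics` helper is a hypothetical name for illustration, and the matrix values are copied from the output above.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes
confusion = [
    [5423, 2, 8072],
    [304, 249, 4276],
    [225, 0, 35215],
]

def per_class_metrics(matrix):
    """Return {class: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for k in range(len(matrix)):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1)
    return metrics

for label, (p, r, f1) in per_class_metrics(confusion).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision divides the diagonal entry by its column sum (everything predicted as that class), while recall divides it by its row sum (everything that truly belongs to that class). The rounded values agree with the classification report.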

Let's see if we can improve these results by applying *GridSearchCV( )* to the pipeline.

In [60]:
def pipeline_rfcl():
    # X is the feature matrix
    X_train = train_medication_reviews['review_word_lemm']
    # y is the label vector
    y_train = train_medication_reviews['rating_classification']
    
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier()) # Initialize the Random Forest Classifier
        ])

    # Define the parameter grid to search over
    param_grid = {
    'vect__max_features': [1000, 5000, 10000],
    'vect__max_df': [0.5, 0.75, 1.0],
    'tfidf__use_idf': [True, False],
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [5, 10, 15],
    'clf__class_weight': [{0: 2, 1: 1, 2: 1}, {0: 1, 1: 1, 2: 1}]
}

    # Use GridSearchCV to find the best parameters
    grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
    grid_search.fit(X_train, y_train)

    # Print the best parameters and score
    print("Best parameters:", grid_search.best_params_)
    print("Best score:", grid_search.best_score_)

    # Predict the labels for the test set using the best model
    best_model = grid_search.best_estimator_
    X_test = test_drug_reviews_df['review']
    y_test = test_drug_reviews_df['rating_classification']
    y_pred = best_model.predict(X_test)

    # Calculate the evaluation metrics
    print(classification_report(y_test, y_pred))
    # Take the labels from y_test so every true class appears, even one the model never predicts
    labels = np.unique(y_test)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    print(confusion_mat)

pipeline_rfcl()
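
A note on cost: the grid search above is exhaustive, so its size grows multiplicatively with each hyperparameter. As a rough sketch in plain Python (no scikit-learn needed), the number of candidate settings and model fits implied by the `param_grid` and `cv=5` above can be counted like this; the `grid_sizes` dictionary is simply a hand-copied tally of that grid.

```python
from math import prod

# Number of values per hyperparameter, copied from param_grid above
grid_sizes = {
    'vect__max_features': 3,
    'vect__max_df': 3,
    'tfidf__use_idf': 2,
    'clf__n_estimators': 3,
    'clf__max_depth': 3,
    'clf__class_weight': 2,
}
cv_folds = 5

n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set
print(n_candidates, n_fits)
```

With this many fits on a training set of 112,329 reviews, options such as `n_jobs=-1` in `GridSearchCV` or a smaller grid may be worth considering.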

Because GridSearchCV did not produce a better metric for class 1 ("neutral" reviews), we are going to try another *Random Forest Classifier* pipeline.

In [25]:
# Create the pipeline

# X is the feature matrix
X_train = train_medication_reviews.review_word_lemm
# y is the label vector
y_train = train_medication_reviews.rating_classification

# Define the preprocessor
preprocessor_tfidf = TfidfVectorizer(min_df=2, max_features=5000, ngram_range=(1,2))
                               
# Define the pipeline 
pipeline_rfc = Pipeline([
        ('preprocessing', preprocessor_tfidf),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42, class_weight={0: 2, 1: 1, 2: 1})) # Initialize the Random Forest Classifier
])

# Train the model on the training set
pipeline_rfc.fit(X_train, y_train)

# Predict the labels for the test set
X_test = test_drug_reviews_df['review']
y_test = test_drug_reviews_df.rating_classification
y_pred = pipeline_rfc.predict(X_test)

# Calculate the evaluation metrics for the testing set
print(classification_report(y_test, y_pred))
# Take the labels from y_test so every true class appears, even one the model never predicts
labels = np.unique(y_test)
confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
print(confusion_mat)

              precision    recall  f1-score   support

           0       0.84      0.69      0.76     13497
           1       1.00      0.13      0.24      4829
           2       0.82      0.97      0.89     35440

    accuracy                           0.83     53766
   macro avg       0.88      0.60      0.63     53766
weighted avg       0.84      0.83      0.80     53766

[[ 9279     2  4216]
 [  807   646  3376]
 [  970     1 34469]]


In [26]:
# Predict the labels for the training set
X_train = train_medication_reviews.review_word_lemm
y_train = train_medication_reviews.rating_classification
y_pred_train = pipeline_rfc.predict(X_train)
print(y_pred_train)

[2 2 1 ... 2 2 2]


In [27]:
# Calculate and print the evaluation metrics for the training set
print(classification_report(y_train, y_pred_train))
confusion_mat = confusion_matrix(y_train, y_pred_train)
print(confusion_mat)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     27839
           1       1.00      1.00      1.00      9993
           2       1.00      1.00      1.00     74497

    accuracy                           1.00    112329
   macro avg       1.00      1.00      1.00    112329
weighted avg       1.00      1.00      1.00    112329

[[27830     1     8]
 [    7  9976    10]
 [    7     0 74490]]


#### Save the model to the disk

In [50]:
# Import pickle
import pickle

# Save the trained model to a file
with open('rfc_model.pkl', 'wb') as f:
    pickle.dump(pipeline_rfc, f)
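
To reuse the saved model later, it can be loaded back with `pickle.load`. The sketch below is an assumption about how this might look, not part of the original workflow: `save_model` and `load_model` are hypothetical helper names, and a plain dictionary stands in for the fitted pipeline so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted pipeline) to disk."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model(path):
    """Load an object saved by save_model. Only unpickle files you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in object; in the notebook this would be the fitted pipeline_rfc
stand_in_model = {'name': 'rfc', 'n_estimators': 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rfc_model.pkl')
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

When the object is a scikit-learn pipeline, it is safest to load it with the same scikit-learn version that was used to save it.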

In [51]:
print(pipeline_rfc)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 RandomForestClassifier(class_weight={0: 2, 1: 1, 2: 1},
                                        random_state=42))])
