<h1 style="color:purple;">1. Make Necessary Imports</h1>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.utils.multiclass import unique_labels

<h1 style="color:purple;">2. Understand Data</h1>

Let us load the data and run a quick EDA to understand it further.

In [None]:
train_df = pd.read_csv("../input/hotel-review/train.csv")
test_df = pd.read_csv("../input/hotel-review/test.csv")

In [None]:
#print five rows of the training data
train_df.head()

In [None]:
#print datatype of columns
train_df.info()

In [None]:
#display count, unique count and the most frequent value in each column
train_df.describe().transpose()

In [None]:
#Display the percentage of reviews in each of the two target classes

class_share = train_df['Is_Response'].value_counts(normalize=True)
happy_percent = class_share['happy']
not_happy_percent = class_share['not happy']
print(f'Happy: {happy_percent*100}%\nNot Happy: {not_happy_percent*100}%')

#pass the column by keyword; recent seaborn versions do not accept it positionally
sns.countplot(x='Is_Response', data=train_df)

<h1 style="color:purple;">3. Preprocess Data</h1>

We will fit the model on the description column only. The other columns (user ID, browser used and device used) do not seem relevant to sentiment analysis, so we drop them.

We will also clean the text by removing unnecessary characters, numbers and white space.

In [None]:
train_df.drop(columns=['User_ID', 'Browser_Used', 'Device_Used'], inplace=True)

In [None]:
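def text_clean(text):
    text = text.lower()
    #raw strings avoid invalid-escape warnings for patterns such as \[ and \w
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[""''_]', '', text)
    #collapse newlines and repeated spaces into one space so adjacent words are not glued together
    text = re.sub(r'\s+', ' ', text).strip()
    return text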
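
To make the effect of each cleaning step easier to see, the sketch below traces the same kind of regex steps on a single made-up review, one step at a time. The sample string and the variable names are assumptions for illustration. The final step collapses whitespace runs into a single space so that words separated only by a newline stay apart.

```python
import re
import string

# Made-up review text, for illustration only.
sample = "Stayed 3 nights [edited]; room #12 was GREAT!\nWould return."

step = sample.lower()
step = re.sub(r'\[.*?\]', '', step)  # drop bracketed spans such as "[edited]"
step = re.sub('[%s]' % re.escape(string.punctuation), '', step)  # drop punctuation
step = re.sub(r'\w*\d\w*', '', step)  # drop any token that contains a digit
step = re.sub(r'\s+', ' ', step).strip()  # collapse newlines and repeated spaces

print(repr(step))
```

Note that tokens such as `3` and `12` disappear entirely, which is usually fine for sentiment but would lose information such as star ratings written as digits.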

In [None]:
def decontract_text(text):
    """
    Decontract text
    """
    # specific
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"won\’t", "will not", text)
    text = re.sub(r"can\’t", "can not", text)
    text = re.sub(r"\'t've", " not have", text)
    text = re.sub(r"\'d've", " would have", text)
    text = re.sub(r"\'clock", "f the clock", text)
    text = re.sub(r"\'cause", " because", text)

    # general
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)

    text = re.sub(r"n\’t", " not", text)
    text = re.sub(r"\’re", " are", text)
    text = re.sub(r"\’s", " is", text)
    text = re.sub(r"\’d", " would", text)
    text = re.sub(r"\’ll", " will", text)
    text = re.sub(r"\’t", " not", text)
    text = re.sub(r"\’ve", " have", text)
    text = re.sub(r"\’m", " am", text)
    
    return text
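
The order of the rules matters: irregular forms must be expanded before the generic `n't` rule runs. The sketch below uses two small hypothetical helpers (`specific_first` and `general_only`, not part of this notebook) to show the difference, and also shows that the lower-case patterns miss capitalised contractions.

```python
import re

def specific_first(text):
    # Irregular forms are expanded before the generic "n't" rule can touch them.
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    return re.sub(r"n't", " not", text)

def general_only(text):
    # Hypothetical variant with only the generic rule, for comparison.
    return re.sub(r"n't", " not", text)

sample = "we won't return and can't recommend it, but it isn't awful"
good = specific_first(sample)
bad = general_only(sample)
print(good)
print(bad)

# The patterns are lower-case, so a capitalised contraction slips past the specific rule.
capitalised = specific_first("Won't return")
lowered = specific_first("Won't return".lower())
print(capitalised)
print(lowered)
```

Lower-casing the text before decontracting avoids the `Wo not` artefact in the third line of output.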

In [None]:
#lower-case first so that capitalised contractions such as "Won't" match the lower-case patterns
train_df['cleaned_description'] = train_df['Description'].str.lower().apply(decontract_text)
train_df['cleaned_description'] = train_df['cleaned_description'].apply(text_clean)

In [None]:
print('Original Description:\n', train_df['Description'][0])
print('\n\nCleaned Description:\n', train_df['cleaned_description'][0])

Now we will perform a 90-10 split (`test_size=0.1`) on the training data to obtain the training and testing datasets required for fitting and evaluating the model.

In [None]:
x, y = train_df['cleaned_description'], train_df['Is_Response']

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.1,
                                                    random_state=42)

print(f'x_train: {len(x_train)}')
print(f'x_test: {len(x_test)}')
print(f'y_train: {len(y_train)}')
print(f'y_test: {len(y_test)}')
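
If the two classes are imbalanced, a plain random split can leave the test set with a slightly different class ratio than the training set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the ratio. The toy data below is made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 80 'happy' and 20 'not happy' labels.
texts = [f'review {i}' for i in range(100)]
labels = ['happy'] * 80 + ['not happy'] * 20

# stratify=labels keeps the 80/20 class ratio in both parts of the split.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```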

<h1 style="color:purple;">4. Model</h1>

We will use a [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to extract features by converting the cleaned text into a matrix of TF-IDF features. For classification, we use logistic regression. Finally, we combine the vectorizer and the classifier into a single model pipeline.

In [None]:
tvec = TfidfVectorizer()
clf = LogisticRegression(solver='lbfgs', max_iter=1000)

model = Pipeline([('vectorizer', tvec), ('classifier', clf)])
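
To get a feel for what the vectorizer produces, here is a minimal sketch on a made-up three-sentence corpus. Each row of the matrix is a document and each column a vocabulary term; a term that appears in fewer documents receives a higher weight than one that appears everywhere.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
corpus = [
    'the room was clean',
    'the room was dirty',
    'the staff was friendly',
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)  # (number of documents, number of vocabulary terms)

# Compare the weight of a rare term with that of a common term in the first document.
vocab = vectorizer.vocabulary_
row = matrix[0].toarray().ravel()
print(row[vocab['clean']] > row[vocab['the']])
```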

In [None]:
model.fit(x_train, y_train)
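
Because the vectorizer lives inside the pipeline, the fitted model accepts raw strings directly. The sketch below shows this on a tiny made-up dataset; `demo_model` and the sample texts are assumptions for illustration, and four training examples are far too few for meaningful predictions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training texts and labels, for illustration only.
texts = ['great stay lovely staff', 'dirty room rude staff',
         'lovely clean room', 'rude and dirty']
labels = ['happy', 'not happy', 'happy', 'not happy']

demo_model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
demo_model.fit(texts, labels)

# Raw strings go in; the pipeline vectorizes them before classifying.
proba = demo_model.predict_proba(['lovely staff'])
classes = list(demo_model.named_steps['classifier'].classes_)

print(classes)      # column order of predict_proba
print(proba.shape)  # one row per input text, one column per class
```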

<h1 style="color:purple;">5. Evaluation</h1>

We evaluate the model against the testing dataset by computing the accuracy, precision and recall. We also plot a confusion matrix to get a better understanding of the model's performance.

In [None]:
y_pred = model.predict(x_test)

#scikit-learn metrics expect the true labels first, then the predictions
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred, average="weighted")}')
print(f'Recall: {recall_score(y_test, y_pred, average="weighted")}')
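
The scikit-learn metric functions take the true labels as the first argument and the predictions as the second. Accuracy is symmetric, but precision and recall are not. The sketch below uses made-up labels to show that swapping the arguments silently turns precision into recall.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels, for illustration only.
y_true = ['happy', 'happy', 'happy', 'not happy']
y_hat = ['happy', 'happy', 'not happy', 'not happy']

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_hat, pos_label='happy')
recall = recall_score(y_true, y_hat, pos_label='happy')

# Swapped order: this value is actually the recall computed above.
swapped_precision = precision_score(y_hat, y_true, pos_label='happy')

print(precision, recall, swapped_precision)
```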

In [None]:
def print_confusion_matrix(conf_matrix, class_names, figsize=(8, 4), fontsize=12):
    """
    Plots a confusion matrix, as returned by sklearn.metrics.confusion_matrix,
    as a seaborn heatmap.
    """
    #the argument is named conf_matrix so it does not shadow the imported confusion_matrix function
    df_cm = pd.DataFrame(conf_matrix, index=class_names, columns=class_names)
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    heatmap = sns.heatmap(df_cm, annot=True, ax=ax, fmt="d", cmap=plt.cm.Oranges)
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    plt.show()

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)
uniq_labels = unique_labels(y_test, y_pred)

print_confusion_matrix(conf_mat, uniq_labels)
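
When reading the heatmap, keep in mind that `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = ['happy', 'happy', 'happy', 'not happy', 'not happy']
y_hat = ['happy', 'happy', 'not happy', 'not happy', 'happy']

# Passing labels= fixes the row and column order explicitly.
cm = confusion_matrix(y_true, y_hat, labels=['happy', 'not happy'])
print(cm)

# Row 0: true 'happy' reviews; row 1: true 'not happy' reviews.
# The diagonal holds correct predictions, off-diagonal cells hold mistakes.
```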

<h2 style="color:red;">WORK IN PROGRESS</h2>

This is a starter notebook. I will update it with other sentiment analysis approaches such as TextBlob and VADER.
