## Text classification with a Naive Bayes model

In [1]:
import pandas as pd
import numpy as np

### Loading the data

In [2]:
train_data=pd.read_csv('sentiment_dataset_train.csv') 
train_data.head()

Unnamed: 0,id,review,rating
0,0,Arrived about 10pm and check in was painless. ...,4
1,1,I checked in at 4pm even tough room was not re...,2
2,2,"I chose this hotel, as it was in a good locati...",2
3,3,"Great location, super close to shops & a 10min...",4
4,4,I was in the Sir Adam Hotel to visit a friend....,3


In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35005 entries, 0 to 35004
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      35005 non-null  int64 
 1   review  35005 non-null  object
 2   rating  35005 non-null  object
dtypes: int64(1), object(2)
memory usage: 820.6+ KB


In [4]:
dev_data=pd.read_csv('sentiment_dataset_dev.csv') 
dev_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7499 entries, 0 to 7498
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      7499 non-null   int64 
 1   review  7499 non-null   object
 2   rating  7499 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 175.9+ KB


In [5]:
test_data=pd.read_csv('sentiment_dataset_test.csv') 

#### Conclusion: No nulls in the train or dev data. The `rating` dtype differs between the two sets (`object` in train, `int64` in dev), so the labels need to be cast to a common type.

### Take a quick look at the label (rating) column

In [6]:
train_data['rating'].value_counts()

2                                                7031
1                                                7028
4                                                6997
5                                                6977
3                                                6971
Tables not made up prior to guest seating. 2.       1
Name: rating, dtype: int64

In [7]:
dev_data['rating'].value_counts()

1    1523
2    1507
4    1500
5    1486
3    1483
Name: rating, dtype: int64

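#### Conclusion: Labels in both train and dev data are almost uniformly distributed, so no rebalancing is needed for classification. The train data also contains one malformed label (`Tables not made up prior to guest seating. 2.`), which is why its `rating` column has dtype `object`.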
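
The stray value in the train `value_counts()` output looks like review text that spilled into the `rating` column. Below is a minimal sketch of one way to drop such rows before training, shown on a small made-up frame. The helper name `keep_valid_ratings` and the assumption that valid labels are the integers 1 to 5 are illustrative, not part of the original notebook.

```python
import pandas as pd

VALID_RATINGS = {"1", "2", "3", "4", "5"}  # assumption: labels are the integers 1 to 5

def keep_valid_ratings(df, label_col="rating"):
    """Return only the rows whose label, read as a stripped string, is 1 to 5."""
    labels = df[label_col].astype(str).str.strip()
    mask = labels.isin(VALID_RATINGS)
    cleaned = df.loc[mask].copy()
    cleaned[label_col] = labels[mask].astype(int)  # store labels as integers
    return cleaned

# Toy frame imitating the malformed row seen in value_counts() (made-up data)
toy = pd.DataFrame({
    "review": ["Great stay", "Dirty room", "Broken row"],
    "rating": ["5", "1", "Tables not made up prior to guest seating. 2."],
})
cleaned = keep_valid_ratings(toy)
print(cleaned["rating"].tolist())
```

Applied to `train_data`, this would remove the single malformed row and leave an integer `rating` column, matching the dtype already seen in `dev_data`.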

### Data splitting

In [8]:
# Cast y_train and y_dev to the same type (str) so the labels are comparable
X_train = train_data['review']
y_train = train_data['rating'].astype(str)  
X_dev = dev_data['review']
y_dev = dev_data['rating'].astype(str)  

### Build pipelines to vectorize the data, then train and fit a model

A Naive Bayes model is used here. The pipeline below first vectorizes the text with TF-IDF, then fits the classifier on the resulting features.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])
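
To make the two pipeline steps concrete, here is a small sketch on a made-up four-review corpus (the names `toy_reviews`, `toy_labels`, and `toy_clf` are illustrative only). It shows the TF-IDF matrix that the first step produces and the label that the full pipeline predicts from raw text.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus, for illustration only
toy_reviews = [
    "great location and friendly staff",
    "lovely room, great breakfast",
    "dirty room and rude staff",
    "terrible breakfast, dirty bathroom",
]
toy_labels = ["5", "5", "1", "1"]

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # raw text -> sparse TF-IDF matrix
    ("clf", MultinomialNB()),      # Naive Bayes on the TF-IDF features
])
toy_clf.fit(toy_reviews, toy_labels)

# Step 1 alone: one row per document, one column per vocabulary term
tfidf_matrix = toy_clf.named_steps["tfidf"].transform(toy_reviews)
print(tfidf_matrix.shape)

# Full pipeline: raw text in, predicted label out
toy_predictions = toy_clf.predict(["great staff", "dirty bathroom"])
print(toy_predictions)
```

Because the vectorizer lives inside the pipeline, `predict` accepts raw strings and applies the same vocabulary and weighting that were learned during `fit`.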

### Feed the training data through the pipeline

In [10]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

### Run predictions and analyze the results (Naive Bayes)

In [11]:
# Form a prediction set
predictions = text_clf_nb.predict(X_dev)

In [12]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_dev,predictions))

[[1282  176   52    7    6]
 [ 289  978  221   15    4]
 [  87  153 1114  119   10]
 [  21   40  211 1082  146]
 [   9   20   64  281 1112]]


In [14]:
# Print a classification report
print(metrics.classification_report(y_dev,predictions))

              precision    recall  f1-score   support

           1       0.76      0.84      0.80      1523
           2       0.72      0.65      0.68      1507
           3       0.67      0.75      0.71      1483
           4       0.72      0.72      0.72      1500
           5       0.87      0.75      0.80      1486

    accuracy                           0.74      7499
   macro avg       0.75      0.74      0.74      7499
weighted avg       0.75      0.74      0.74      7499
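
The per-class figures in the classification report follow directly from the confusion matrix: recall divides each diagonal entry by its row total, precision divides it by its column total. A short sketch that recomputes them from the matrix printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true rating 1-5, columns: predicted rating 1-5)
cm = np.array([
    [1282,  176,   52,    7,    6],
    [ 289,  978,  221,   15,    4],
    [  87,  153, 1114,  119,   10],
    [  21,   40,  211, 1082,  146],
    [   9,   20,   64,  281, 1112],
])

correct = np.diag(cm)                 # correctly classified reviews per class
recall = correct / cm.sum(axis=1)     # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
accuracy = correct.sum() / cm.sum()   # 5568 correct out of 7499

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 4))
```

Most of the off-diagonal mass sits next to the diagonal, so the typical error is a rating that is off by one star.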



In [16]:
# Print the overall accuracy
print(metrics.accuracy_score(y_dev,predictions))

0.7424989998666489


### Tuning hyperparameters to improve the accuracy

In [72]:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB(alpha=0.2, fit_prior=True)),
])
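
Since `fit_prior=True` is the scikit-learn default, the smoothing parameter `alpha` is the only setting that changes here. The notebook does not show how `alpha=0.2` was chosen. One plausible approach, sketched below on made-up data, is to sweep a few candidate values and compare accuracy on a held-out set. The helper `sweep_alpha` and the candidate list are assumptions for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def sweep_alpha(X_tr, y_tr, X_val, y_val, alphas):
    """Fit one pipeline per alpha and return {alpha: validation accuracy}."""
    scores = {}
    for alpha in alphas:
        clf = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", MultinomialNB(alpha=alpha)),
        ])
        clf.fit(X_tr, y_tr)
        scores[alpha] = accuracy_score(y_val, clf.predict(X_val))
    return scores

# Made-up stand-ins for X_train / y_train / X_dev / y_dev
toy_X_train = ["great hotel", "lovely staff", "dirty room", "rude staff"]
toy_y_train = ["5", "5", "1", "1"]
toy_X_dev = ["great staff", "dirty hotel"]
toy_y_dev = ["5", "1"]

scores = sweep_alpha(toy_X_train, toy_y_train, toy_X_dev, toy_y_dev,
                     alphas=[0.05, 0.1, 0.2, 0.5, 1.0])
best_alpha = max(scores, key=scores.get)  # alpha with the highest accuracy
print(scores, best_alpha)
```

With the real data the call would be `sweep_alpha(X_train, y_train, X_dev, y_dev, ...)`. Note that an `alpha` selected on the dev set makes the dev accuracy slightly optimistic; the untouched test set gives the cleaner estimate.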

In [73]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('clf', MultinomialNB(alpha=0.2))])

In [74]:
# Form a prediction set
predictions = text_clf_nb.predict(X_dev)

In [75]:
print(metrics.confusion_matrix(y_dev,predictions))

[[1291  177   46    3    6]
 [ 268 1016  200   20    3]
 [  80  160 1103  123   17]
 [  21   47  170 1109  153]
 [  11   17   51  256 1151]]


In [76]:
print(metrics.classification_report(y_dev,predictions))

              precision    recall  f1-score   support

           1       0.77      0.85      0.81      1523
           2       0.72      0.67      0.69      1507
           3       0.70      0.74      0.72      1483
           4       0.73      0.74      0.74      1500
           5       0.87      0.77      0.82      1486

    accuracy                           0.76      7499
   macro avg       0.76      0.76      0.76      7499
weighted avg       0.76      0.76      0.76      7499



In [77]:
# Print the overall accuracy
print(metrics.accuracy_score(y_dev,predictions))

0.7561008134417923


### The Naive Bayes model reaches ~75.6% accuracy on the dev set after tuning, up from ~74.2% with the default settings