Universal Studios receives a vast number of reviews from visitors, and reading through all of them by hand is tedious. This notebook categorizes reviews automatically by predicting each review's rating from its text, which could feed a review management system. With an overall picture built from individual comments, the company can see which areas visitors comment on and address them. That in turn can build visitor loyalty and improve business, reputation, brand value, and profit.

Import

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Read

In [None]:
df = pd.read_csv("/kaggle/input/reviewuniversalstudio/universal_studio_branches.csv")
df

In [None]:
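# Combine title and body; fillna('') keeps a missing title or body from turning the whole text into NaN
df['text'] = df['title'].fillna('') + " " + df['review_text'].fillna('')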
df

Analyse rating

In [None]:
df.groupby('rating').text.count().plot.bar(ylim=0)
plt.show()

In [None]:
percentage_review=(df.rating.value_counts() / len(df.rating)) * 100
percentage_review

Processing raw text and getting it ready for machine learning

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
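words = set(stopwords.words("english"))  # set gives fast membership tests

# Keep letters only, lowercase before the stop-word check (the NLTK list is lowercase), then stem
df['processed_text'] = df['text'].apply(
    lambda x: " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).lower().split() if i not in words)
)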


In [None]:
import string

# Make all words lower case
df['processed_text'] = df['processed_text'].str.lower()

# Remove punctuation
table = str.maketrans('', '', string.punctuation)
df['processed_text'] = df['processed_text'].str.translate(table)

# Remove hash signs
df['processed_text'] = df['processed_text'].str.replace("#", " ", regex=False)

# Remove words of two characters or fewer
df['processed_text'] = df['processed_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
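
To make the cleaning steps concrete, here is a minimal sketch that applies them to a single sentence using only the standard library. The stop-word set, the `clean_text` helper, and the sample sentence are hypothetical; the notebook itself uses NLTK's English stop-word list and also stems each token, which this sketch leaves out.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "was", "and", "a", "to", "we", "it"}

def clean_text(text, stop_words=STOP_WORDS, min_len=3):
    """Keep letters only, lowercase, drop stop words and very short tokens."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_len]
    return " ".join(kept)

sample = "The ride was AMAZING, and we queued 45 mins to get on it!!"
cleaned = clean_text(sample)
print(cleaned)
```

Digits and punctuation disappear in the first step, so the later punctuation and hash-sign passes mostly act as a safety net.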

Visualise frequent words

In [None]:
# Show the most frequent words in a word cloud (nothing is removed here)
freq_words = ' '.join(df['processed_text'])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=1, max_font_size=110, max_words=50).generate(freq_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()


Remove rare words

In [None]:
from collections import Counter
from itertools import chain

# split words into lists
v = df['processed_text'].str.split().tolist() 
# compute global word frequency
c = Counter(chain.from_iterable(v))
# filter, join, and re-assign
df['processed_text'] = [' '.join([j for j in i if c[j] > 1]) for i in v]
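
The rare-word filter can be checked on a toy corpus. This is a small sketch with made-up documents (`docs` is hypothetical), using the same `Counter` and `chain` approach: every word that occurs only once across the whole corpus is dropped.

```python
from collections import Counter
from itertools import chain

# Hypothetical three-document corpus
docs = ["great ride long queue", "great show", "long queue rollercoaster"]

# Split each document into tokens
tokenised = [d.split() for d in docs]
# Count every word across all documents
counts = Counter(chain.from_iterable(tokenised))
# Keep only words that appear more than once in the corpus
filtered = [" ".join(w for w in doc if counts[w] > 1) for doc in tokenised]
print(filtered)
```

Here "ride", "show" and "rollercoaster" each occur once, so they are removed while "great", "long" and "queue" survive.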

In [None]:
df

Define X and y

In [None]:
y=df.rating
X=df['processed_text']

X.shape, y.shape

Split dataset for training and validation

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.10, random_state=1, shuffle=True)
X_train.shape, X_val.shape, y_train.shape,y_val.shape

Convert text to vectors

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer(stop_words='english', max_df=0.7, sublinear_tf=True)
train_tfIdf = vectorizer_tfidf.fit_transform(X_train.values.astype('U'))
val_tfIdf = vectorizer_tfidf.transform(X_val.values.astype('U'))

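# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print(vectorizer_tfidf.get_feature_names_out()[:5])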
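
With `sublinear_tf=True`, scikit-learn replaces a raw term count tf with 1 + log(tf), using the natural logarithm, so a word repeated many times in one review counts for only a little more than a word used once. The sketch below shows just that scaling; the `sublinear_tf` helper is a hypothetical name, and the idf weighting and normalisation that `TfidfVectorizer` also applies are left out.

```python
import math

def sublinear_tf(count):
    """Scale a raw term count as 1 + ln(count); absent terms stay at 0."""
    return 1 + math.log(count) if count > 0 else 0.0

# Raw counts of a term in a single document
raw_counts = [0, 1, 10, 100]
scaled = [sublinear_tf(c) for c in raw_counts]
for c, s in zip(raw_counts, scaled):
    print(c, round(s, 3))
```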

In [None]:
train_tfIdf.shape,  val_tfIdf.shape

SMOTE

In [None]:
#from imblearn import over_sampling
#from imblearn.over_sampling import SMOTE

# oversample the training set only; the validation set keeps its original class distribution
#oversample = SMOTE()
#train_tfIdf, y_train = oversample.fit_resample(train_tfIdf, y_train)

#train_tfIdf.shape, y_train.shape

Feature selection

In [None]:
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile=40)
# Fit on the training set only; the sparse matrices can be passed directly, so .toarray() is not needed
train_tfIdf = selector.fit_transform(train_tfIdf, y_train)
val_tfIdf = selector.transform(val_tfIdf)

train_tfIdf.shape,  val_tfIdf.shape

Define model

In [None]:
from sklearn.naive_bayes import ComplementNB

model = ComplementNB().fit(train_tfIdf, y_train)
print(model.score(train_tfIdf, y_train))

Predict on validation set

In [None]:
y_pred = model.predict(val_tfIdf)
print(model.score(val_tfIdf, y_val))
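
`model.score` reports plain accuracy, which can look acceptable even when the less common ratings are rarely predicted correctly. As a rough sketch of a complementary check, the snippet below computes per-class recall on made-up labels; `per_class_recall` and the toy `actual` and `predicted` lists are hypothetical and are not results from this notebook.

```python
from collections import Counter

def per_class_recall(actual, predicted):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: mostly 5-star reviews, and a model that favours 5
actual = [5, 5, 5, 5, 4, 4, 1, 1]
predicted = [5, 5, 5, 5, 5, 4, 5, 5]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall = per_class_recall(actual, predicted)
print(accuracy)
print(recall)
```

In this toy case the overall accuracy is moderate while the 1-star class is never recovered, which is the kind of gap the per-class view is meant to expose.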

In [None]:
df_pred = pd.DataFrame({'Actual': y_val, 'Predicted':y_pred})
df_pred