# Amazon Musical Instruments Reviews

Web portals such as Bhuvan receive a vast amount of user feedback, and going through all of it manually is tedious. Categorizing the opinions expressed in feedback forums makes this manageable and can serve as the basis of a feedback management system. In this notebook, I classify individual comments/reviews and determine an overall rating based on them. This gives a company a complete picture of the feedback provided by its customers, so that it can address the specific areas that need attention. Doing so can lead to more loyal customers and to growth in business, reputation, brand value, and profits.

**Attributes in the dataset**

1. reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
2. asin - ID of the product, e.g. 0000013714
3. reviewerName - name of the reviewer
4. helpful - helpfulness rating of the review, e.g. 2/3
5. reviewText - text of the review
6. overall - rating of the product
7. summary - summary of the review
8. unixReviewTime - time of the review (unix time)
9. reviewTime - time of the review (raw)

**Task**

Classify reviews as positive, negative, and neutral based on the attributes above.

**Libraries Required**

In [None]:
import re # Regular Expressions
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Importing the dataset
df = pd.read_csv("../input/amazon-music-reviews/Musical_instruments_reviews.csv")

**Importing and analysing the dataset**

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.dtypes

## Data Cleaning

In [None]:
# To start with, I will modify the overall column of the dataset,
# which holds the ratings given by the users.
# Ratings 4 and 5 become positive (2), 1 and 2 become negative (0), and 3 becomes neutral (1)
# Function to modify overall column

def change_score(rating):
    if rating < 3:
        return 0
    elif rating > 3:
        return 2
    else:
        return 1

df_score = df["overall"]
df_score = df_score.map(change_score)
df["overall"] = df_score

In [None]:
df.tail()

In [None]:
# Counting occurrences of positive, negative and neutral reviews
df["overall"].value_counts()

* The dataset has imbalanced classes, which I will fix later
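
To make the imbalance concrete, here is a minimal sketch that turns the class counts quoted later in this notebook's resampling cell (9015 positive, 772 neutral, 467 negative) into shares and an imbalance ratio. It uses plain Python only; the dictionary labels are illustrative names, not columns of the dataset.

```python
# Class counts as quoted in the resampling cell of this notebook
counts = {"positive (2)": 9015, "neutral (1)": 772, "negative (0)": 467}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Share of each class in the whole dataset
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Largest class is about {imbalance_ratio:.1f}x the smallest")
```

With a majority class this dominant, a model that always predicts "positive" would already look accurate, which is why the classes are resampled before model fitting.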

In [None]:
# The helpful column of the dataframe holds values of the form [x, y]:
# out of 'y' people who voted, 'x' found the corresponding review helpful,
# so people who found the review helpful = x, and people who found it "not helpful" = y - x
# I will separate the x and y values into different columns

# Since the helpful column stores its values as strings (object dtype), I first convert each value to a Python list
def convert_to_list(str_lst):
    str_ = str_lst.strip("[]").replace(","," ")
    lst = str_.split()
    lst_to_int = list(map(int, lst))
    return lst_to_int
        
def total_rating(lst_rating):
    return lst_rating[1] # y

def helpful_rating(lst_rating):
    return lst_rating[0] # x

df["helpful"] = df["helpful"].map(convert_to_list) # "[x,y]" -> [x,y]
df["total_ratings"] = df["helpful"].map(total_rating) # y
df["helpful"] = df["helpful"].map(helpful_rating) # x

df.head()
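
As an alternative to the manual string handling above, the same parsing could probably be done with the standard library's `ast.literal_eval`. This is only a sketch, assuming every value really is a string of the form `"[x, y]"`; `parse_helpful` is a hypothetical helper name, not part of the notebook.

```python
import ast

def parse_helpful(raw):
    """Parse a string such as "[2, 3]" into (helpful_votes, total_votes)."""
    helpful, total = ast.literal_eval(raw)  # safely evaluates the list literal
    return int(helpful), int(total)

helpful, total = parse_helpful("[2, 3]")
not_helpful = total - helpful  # votes that did not find the review helpful
print(helpful, total, not_helpful)
```

`literal_eval` raises an error on malformed input instead of silently producing a wrong list, which can be useful when the column is not guaranteed to be clean.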

In [None]:
# Checking for duplicate rows in columns - ["reviewerName", "reviewText", "unixReviewTime"]

print("Number of rows having common values of [reviewerName, reviewText, unixReviewTime] =", df[df.duplicated(subset=["reviewerName", "reviewText", "unixReviewTime"])].shape[0])

* There are no duplicates in the dataset

In [None]:
# Checking in columns helpful and total_ratings if helpful > total_ratings
# If any row follows above condition, I will remove it

print("Number of rows, in which helpful > total_ratings =",df[df["helpful"] > df["total_ratings"]].shape[0])

* There are no rows which satisfy helpful > total_ratings 

In [None]:
# Checking for null values
df.isnull().sum()

In [None]:
# Review text is the most important column for classification
# I will remove rows having null review text
# Missing values in the reviewerName column don't matter, because reviewerName doesn't contribute to determining the polarity of the review
df.drop(df[df["reviewText"].isnull()].index, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

## Text Preprocessing

In text preprocessing, I will:
* Remove HTML tags in the review text
* Remove special characters from the text (#, ! etc.)
* Convert the text to lowercase
* Remove stop words
* Apply stemming to the text
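
The first four steps can be illustrated on a single sentence. This is a minimal sketch in plain Python: `TINY_STOP_WORDS` is a small made-up subset standing in for NLTK's English stop word list, the sample sentence is invented, and stemming is left out so that the example runs without any third-party package.

```python
import re

# Illustrative subset only; the notebook itself uses NLTK's English stop words
TINY_STOP_WORDS = {"the", "is", "a", "and", "it"}

def preprocess(text):
    text = re.sub(r"<.*?>", " ", text)            # replace HTML tags with a space
    text = re.sub(r"[?!.,()\\/#'\"|]", "", text)  # drop special characters
    text = text.lower()                           # convert to lowercase
    words = [w for w in text.split() if w not in TINY_STOP_WORDS]
    return " ".join(words)

sample = "The pedal is <b>GREAT</b>, and it works!"
print(preprocess(sample))  # pedal great works
```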

In [None]:
# Removing HTML tags
def remove_html(text):
    html_pattern = re.compile("<.*?>")
    text = re.sub(html_pattern, " ", text) # Substitute HTML tag with space
    return text

# Removing special characters
def remove_spl_char(text):
    text = re.sub(r"[?!.,()\\/#'\"|]", r"", text) # Remove the listed special characters (inside [...] no "|" separators are needed; "\\" matches a backslash)
    return text

# Converting to lowercase
def in_lowercase(text):
    text = text.lower()
    return text

# Removing stop words
stop_words = set(stopwords.words("english")) # Set of English stop words (the NLTK "stopwords" corpus must be available)

def remove_stopwords(text):
    # Keep only the words that are not stop words
    filtered_words = [word for word in text.split() if word not in stop_words]
    return " ".join(filtered_words)

# Applying stemming
stem = PorterStemmer()
def stemming(text):
    # Reduce each word to its stem
    stemmed_words = [stem.stem(word) for word in text.split()]
    return " ".join(stemmed_words)
    

In [None]:
def text_preprocessing(text):
    rem_html_txt = remove_html(text) # Remove HTML
    rem_spl_char_txt = remove_spl_char(rem_html_txt) # Remove Special Characters
    lowercase_txt = in_lowercase(rem_spl_char_txt) # Conversion in lowercase
    rem_stopwords_txt = remove_stopwords(lowercase_txt) # Remove stopwords
    stemmed_txt = stemming(rem_stopwords_txt) # Apply stemming to the cleaned text, not to the raw input
    final_txt = stemmed_txt
    return final_txt

df["final_review"] = df["reviewText"].map(text_preprocessing)

In [None]:
print("Before Text Preprocessing- ", "\n")
print(df["reviewText"][0], "\n")
print("After Text Preprocessing- ", "\n")
print(df["final_review"][0])

## Avg Word2Vec

* I will use word2vec to convert text values to vector
* I have sufficient training examples to construct W2V
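
"Average Word2Vec" means that a review is represented by the mean of the vectors of its words, skipping words the model does not know. The sketch below shows only that averaging step, with a hand-written dictionary of 3-dimensional toy vectors standing in for a trained model; `toy_vectors` and `average_vector` are hypothetical names, and plain lists are used so the example needs no third-party package.

```python
# Toy stand-in for a trained embedding lookup (made-up values, 3 dimensions)
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "pedal": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of all known tokens; zero vector if none is known."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:  # out-of-vocabulary word, skip it
            continue
        total = [a + b for a, b in zip(total, vec)]
        known += 1
    if known == 0:       # avoid dividing by zero
        return total
    return [value / known for value in total]

print(average_vector(["great", "pedal", "unknownword"], toy_vectors, 3))  # [2.0, 1.0, 1.0]
```

Returning a zero vector for a review with no known words is one possible convention; dropping such reviews would be another.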

In [None]:
# Split each preprocessed review into a list of tokens, the input format Word2Vec expects
reviews = df["final_review"].str.split().tolist()

print(df["final_review"].iloc[0]) # Before
reviews[0] # After

In [None]:
# Constructing W2V model
w2v_model = Word2Vec(reviews, vector_size=50, min_count=5)

In [None]:
# Using Avg W2V for each review
def avg_w2v(reviews):
    text_vector = []
    for review in reviews:
        review_vec_sum = np.zeros(50)
        num_words = 0
        for word in review:
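            try:
                word_vec = w2v_model.wv[word]
                review_vec_sum += word_vec
                num_words += 1
            except KeyError:
                # Word is not in the W2V vocabulary (e.g. it occurs fewer than min_count times)
                pass
        # Avoid dividing by zero when no word of the review is in the vocabulary
        avg_review_vector = review_vec_sum / num_words if num_words > 0 else review_vec_sum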
        text_vector.append(avg_review_vector)
    return text_vector

text_vector = np.array(avg_w2v(reviews)) # Text Vector of all the reviews

In [None]:
print(df["final_review"][0])

print("\n\nVector Representation of above text - ")
text_vector[0] # Text Vector of review

* Now, I will create final dataset which will contain all features, which will be used for classification

In [None]:
# Feature name for text vector
def create_feature_names():
    text_features = []
    for index in range(1,51):
        feature_name = "text-feature-"+ str(index)
        text_features.append(feature_name)
    return text_features
text_features = create_feature_names()

In [None]:
# Constructing dataframe from 'text_vector' variable

def create_df_txt_vec(text_vector):
    # text_vector is already a 2-D array of shape (number of reviews, 50), so it maps directly to a dataframe
    df_text = pd.DataFrame(text_vector, columns=text_features)
    return df_text

df_text = create_df_txt_vec(text_vector)

In [None]:
# Selecting features 'helpful' and 'total_ratings' from the original dataframe, because they might affect the prediction
# Selecting 'overall' attribute for output

df_final = pd.concat([df_text, df["helpful"], df["total_ratings"], df["overall"]], axis=1)

In [None]:
df_final.head()

## Model Fitting

In [None]:
df_features = df_final.drop("overall", axis=1)
df_target = df_final["overall"]

df_features_columns = df_features.columns
df_features_scaled = StandardScaler().fit_transform(df_features)
df_features_scaled = pd.DataFrame(df_features_scaled, columns=df_features_columns)

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(df_features_scaled, df_target, test_size=0.25, train_size=0.75)
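
The cells above scale the whole dataset before splitting it. A common refinement, which is not what this notebook does, is to estimate the scaling statistics on the training split only and reuse them for the test split, so that no information from the test data enters the features. The sketch below shows that idea on one toy column with the standard library; `fit_scaler` and `transform` are hypothetical helpers that mimic standardization with the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Estimate mean and population standard deviation on training data only."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    if std == 0:  # constant column: leave values centred but unscaled
        std = 1.0
    return mean, std

def transform(column, mean, std):
    """Standardize a column with previously fitted statistics."""
    return [(value - mean) / std for value in column]

train_col = [1.0, 2.0, 3.0]  # toy training values
test_col = [4.0]             # toy test value

mean, std = fit_scaler(train_col)
scaled_train = transform(train_col, mean, std)
scaled_test = transform(test_col, mean, std)  # reuses the training statistics
print(scaled_train, scaled_test)
```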

In [None]:
# Checking for imbalanced dataset
df_final["overall"].value_counts()

In [None]:
# In the full dataset, there are 9015 positive, 772 neutral and 467 negative reviews
# I will use a combination of undersampling and oversampling on the training set to balance the classes

# Creating Pipeline to perform SMOTE and UnderSampling techniques
oversampling_smote = SMOTE(sampling_strategy={1:5000, 0:5000})
undersampling = RandomUnderSampler(sampling_strategy={2:5000})
pipeline = Pipeline([('under', undersampling), ('smote', oversampling_smote)])
df_train_resampled = pipeline.fit_resample(X_train, y_train)

In [None]:
X_train = df_train_resampled[0] # Resampled X_train 
y_train = df_train_resampled[1] # Resampled y_train

X_train, y_train = shuffle(X_train, y_train)
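
For intuition about what SMOTE adds to the minority classes: each synthetic sample is an interpolation between a minority point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on made-up 2-D points; the neighbour search is omitted, and `interpolate_sample` is a hypothetical helper, not part of imbalanced-learn.

```python
import random

def interpolate_sample(point, neighbour, rng):
    """Return a random point on the segment between point and neighbour."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(0)           # seeded so the sketch is reproducible
minority_point = [1.0, 2.0]      # made-up minority-class sample
minority_neighbour = [3.0, 6.0]  # made-up neighbour from the same class

synthetic = interpolate_sample(minority_point, minority_neighbour, rng)
print(synthetic)
```

Because the new points lie between existing minority samples rather than duplicating them, the oversampled classes are less repetitive than with plain random duplication.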

### Logistic Regression

In [None]:
C = [0.001 ,0.01 ,0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # Values of hyperparameter C

cv_f1_mean= []

for value in C:
    model = LogisticRegression(C = value, solver="sag", max_iter=5000)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_micro")
    cv_f1_mean.append(np.mean(scores))
    
max_f1 = max(cv_f1_mean)
index_max_f1 = cv_f1_mean.index(max_f1)

print("Optimal value of hyperparameter C: " + str(C[index_max_f1]))
print("F1 score at optimal C: " + str(max_f1))

In [None]:
model = LogisticRegression(C = 6, solver="sag", max_iter=5000) # Calculating test accuracy of model
model.fit(X_train, y_train)

In [None]:
acc_test = model.score(X_test, y_test)*100
acc_test = round(acc_test, 2)
print("Accuracy of LogisticRegression model on test set: " + str(acc_test) + "%")
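
A note on the metric: the cross-validation loops in this notebook score with `f1_micro`, which for single-label multi-class problems equals plain accuracy. On an imbalanced test set, macro-averaged F1, which weights every class equally, can tell a different story. The sketch below computes both in plain Python on made-up labels; the helper names are hypothetical.

```python
def class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(y_true, y_pred, labels):
    # Pool the counts of all classes, then compute one F1
    counts = [class_counts(y_true, y_pred, label) for label in labels]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)

def macro_f1(y_true, y_pred, labels):
    # Compute F1 per class, then take the unweighted mean
    scores = [f1_from_counts(*class_counts(y_true, y_pred, label)) for label in labels]
    return sum(scores) / len(scores)

# Made-up example: a model that always predicts the majority class (2)
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 1, 0]
y_pred = [2] * 10
labels = [0, 1, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro = micro_f1(y_true, y_pred, labels)
macro = macro_f1(y_true, y_pred, labels)
print(accuracy, micro, round(macro, 3))
```

Here accuracy and micro F1 coincide, while macro F1 is much lower because the two minority classes are never predicted.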

### KNN

In [None]:
K = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # Values of hyperparameter K

cv_f1_mean= []

for value in K:
    model = KNeighborsClassifier(n_neighbors=value)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_micro")
    cv_f1_mean.append(np.mean(scores))
    
max_f1 = max(cv_f1_mean)
index_max_f1 = cv_f1_mean.index(max_f1)

print("Optimal value of hyperparameter K: " + str(K[index_max_f1]))
print("F1 score at optimal K: " + str(max_f1))

In [None]:
model = KNeighborsClassifier(n_neighbors=1) # Calculating test accuracy of model
model.fit(X_train, y_train)

In [None]:
acc_test = model.score(X_test, y_test)*100
acc_test = round(acc_test, 2)
print("Accuracy of 1-NN model on test set: " + str(acc_test) + "%")

### SVM

In [None]:
C = [0.001 ,0.01 ,0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # Values of hyperparameter C

cv_f1_mean= []

for value in C:
    model = SVC(C = value,kernel="rbf")
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_micro")
    cv_f1_mean.append(np.mean(scores))
    
max_f1 = max(cv_f1_mean)
index_max_f1 = cv_f1_mean.index(max_f1)

print("Optimal value of hyperparameter C: " + str(C[index_max_f1]))
print("F1 score at optimal C: " + str(max_f1))

In [None]:
model = SVC(C = 15,kernel="rbf") # Calculating test accuracy of model
model.fit(X_train, y_train)

In [None]:
acc_test = model.score(X_test, y_test)*100
acc_test = round(acc_test, 2)
print("Accuracy of SVM model on test set: " + str(acc_test) + "%")

### Gaussian Naive Bayes

In [None]:
model = GaussianNB()
model.fit(X_train, y_train)

In [None]:
acc_test = model.score(X_test, y_test)*100
acc_test = round(acc_test, 2)
print("Accuracy of GaussianNB model on test set: " + str(acc_test) + "%")

### Random Forest

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [None]:
acc_test = model.score(X_test, y_test)*100
acc_test = round(acc_test, 2)
print("Accuracy of Random Forest model on test set: " + str(acc_test) + "%")