# Retrieving and analyzing E-commerce reviews (NLP)

### Motivation:

E-commerce is a fast-growing business, and the COVID-19 pandemic has only accelerated its adoption. Companies such as Amazon, Alibaba, eBay, and Best Buy offer millions of products on their websites, which can be bought and shipped to your house with a couple of clicks.

Customers frequently rate these products (typically from 1 to 5), and the mean review score is readily available to both the user and the seller. The average score gives a quick indication of how good customers think the product is. Nevertheless, that number provides little insight into why users rate the item highly or poorly, or what they like (or dislike) about the product. This information could be gathered by reading each review, but that would be time-consuming, especially for products with thousands of reviews.

Thus, **the goal of this project** is to develop a tool that automatically retrieves the most frequently used words in positive and negative reviews. This tool will help the seller find the attributes that need improvement and those whose quality should be maintained. The tool can also give customers a quick idea of what to watch out for during the free-return period.

### Project structure

This work focuses on Amazon reviews of renewed/refurbished iPhones, as renewed smartphones tend to attract both positive and negative reviews. Specifically, the reviews of the iPhone X, XR, and 11 are analyzed to identify the pros and cons of each product from the customer's perspective. Then, the benefits and problems found for each product are presented. Lastly, all the reviews are merged, and a sentiment analysis model is developed.

To do so, the project was split into two parts:

**Part I**

* Explore Amazon's website to retrieve the desired reviews into datasets, employing BeautifulSoup as a web scraper, together with Docker and Splash.

**Part II**

* Clean, organize, and analyze the reviews, drawing conclusions from the most frequently used words in positive and negative reviews.
* Develop a sentiment analysis tool.

The script "reviews_scraper.py" for obtaining the reviews (Part I) is available at:

https://github.com/tomasmontielp/Retrieving_and_analyzing_E-commerce_reviews

This notebook focuses on Part II of the project.

### Exploratory analysis

In [1]:
import pandas as pd
import numpy as np

In Part I, reviews were extracted from Amazon's website and saved into .xlsx files.  Let's import each dataset:

In [2]:
iX=pd.read_excel(r'C:\Users\tomas\OneDrive - UPV\Data_Science\Projects\Amazon\iphone_X.xlsx')
iXR=pd.read_excel(r'C:\Users\tomas\OneDrive - UPV\Data_Science\Projects\Amazon\iphone_XR.xlsx')
i11=pd.read_excel(r'C:\Users\tomas\OneDrive - UPV\Data_Science\Projects\Amazon\iphone_11.xlsx')


Let's take a look at each dataset and ensure we have the expected product:

In [3]:
iX.head(3)

Unnamed: 0,product,title,rating,body
0,Apple iPhone X,THIS PHONE WAS NOT PAYED OFF,1,"I purchased a refurbished iPhone x, it arrived..."
1,Apple iPhone X,Cracked and does not turn on.,1,The screen was cracked and the phone did not t...
2,Apple iPhone X,Don't buy from this seller.,1,"This product is falsely labeled as unlocked, i..."


In [4]:
iXR.head(3)

Unnamed: 0,product,title,rating,body
0,Apple iPhone XR,Great!,4,It took a little while for it to arrive. But o...
1,Apple iPhone XR,DO NOT WASTE YOUR MONEY ON THIS PHONE,1,PLEASE DO NOT BUY THIS PHONE. They say that th...
2,Apple iPhone XR,It sucks,1,World's worst thing ever never get on this its...


In [5]:
i11.head(3)

Unnamed: 0,product,title,rating,body
0,Apple iPhone 11,Not FULLY Unlocked,1,Purchased this product advertised as fully unl...
1,Apple iPhone 11,"NOT EXPECTED, GREAT PURCHASE!",4,I was feeling a bit skeptical after I placed m...
2,Apple iPhone 11,Phone was NOT unlocked,1,Phone was not unlocked could it use it


In [6]:
iX['product'].unique()

array(['Apple iPhone X'], dtype=object)

In [7]:
iXR['product'].unique()

array(['Apple iPhone XR'], dtype=object)

In [8]:
i11['product'].unique()

array(['Apple iPhone 11'], dtype=object)

Everything seems OK. We can see that there's a column for the title of the review, one for the rating, and one for the body of the review. Since all the datasets have the same structure, let's merge them so that each text mining function can be applied to all of them at once. The new dataset will be called rev.

In [9]:
rev = pd.concat([iX, iXR, i11], ignore_index=True)

Let's confirm that the length of the new database is equal to the sum of the parts:

In [10]:
len(iX)  + len(iXR) + len(i11) == len(rev) 

True

## Text mining

#### NaN values

First, it is good practice to check for NaN values, as it is better to replace them with empty strings before employing NLP libraries.

In [11]:
rev.isnull().values.any()

True

Let's find and replace the NaN values:

In [12]:
is_NaN = rev.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = rev[row_has_NaN]
rows_with_NaN

Unnamed: 0,product,title,rating,body
1838,Apple iPhone X,Very good phone,5,
1861,Apple iPhone X,The phone is in good condition,4,
2087,Apple iPhone X,It works in Japan(!),5,
3400,Apple iPhone XR,Everything we expected,5,
3429,Apple iPhone XR,,5,Its like new
3471,Apple iPhone XR,,2,Screen Sensibility is poor
3637,Apple iPhone XR,,5,
3762,Apple iPhone XR,Excelente,5,
3806,Apple iPhone XR,,5,It was fully unlocked I loved it.
5279,Apple iPhone 11,Happy,5,


In [13]:
rev.fillna('', inplace=True)
rev[rev.index.isin(rows_with_NaN.index)]

Unnamed: 0,product,title,rating,body
1838,Apple iPhone X,Very good phone,5,
1861,Apple iPhone X,The phone is in good condition,4,
2087,Apple iPhone X,It works in Japan(!),5,
3400,Apple iPhone XR,Everything we expected,5,
3429,Apple iPhone XR,,5,Its like new
3471,Apple iPhone XR,,2,Screen Sensibility is poor
3637,Apple iPhone XR,,5,
3762,Apple iPhone XR,Excelente,5,
3806,Apple iPhone XR,,5,It was fully unlocked I loved it.
5279,Apple iPhone 11,Happy,5,


#### Merging columns

Frequently, the title of the review contains information as crucial as the body, given that customers are sometimes more explicit about what they like or dislike in the title. Moreover, as seen in the previous step, a customer may fill in only the "title" field and leave the "body" empty, or the other way around. Thus, it is convenient to merge both columns into a single column named "review". Afterward, the original two columns are dropped.

In [14]:
rev['review'] = rev['title'] + ' ' + rev['body']
rev.drop(['title', 'body'], axis=1, inplace = True)
rev.head()

Unnamed: 0,product,rating,review
0,Apple iPhone X,1,THIS PHONE WAS NOT PAYED OFF I purchased a ref...
1,Apple iPhone X,1,Cracked and does not turn on. The screen was c...
2,Apple iPhone X,1,Don't buy from this seller. This product is fa...
3,Apple iPhone X,1,Possibility to change the phone? I bought this...
4,Apple iPhone X,5,Exceeded Expectations I loved how new this pho...


#### Emojis

Some reviews contain emojis (example below); let's remove them to have a cleaner dataset. Additionally, let's set every word to lowercase.

In [15]:
rev['review'][16]

'It was what I expect 👌 Now for the people who are reading this message it is important for you to know. if you want a really good offer for a really cheap price go ahead and take the chance Buy this iPhone. There’s no scam is the real deal. now note this you are buying a used iPhone. Second thing that you need to keep in mind Is the battery capacity if you’re looking for a phone you want to try to aim for 95% battery capacity is your best bet when I received mine I got it at 80% battery capacity so that’s also another thing to keep in mind when you’re buying a used phone make sure you’re looking at that battery capacity it’s really important another thing you will receive scratches on the display and the back glass in my opinion this is nothing to cry over just throw a case on it and it should look pretty good. you also want to make sure that your face recognition is on point make sure that everything on the phone is responsive after Unboxing thank you guys so much for taking your tim

In [16]:
#Remove emojis and convert every word to lowercase
rev['review'] = rev['review'].astype(str).apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
rev['review'] = rev['review'].str.lower()

In [17]:
rev['review'][16]

'it was what i expect  now for the people who are reading this message it is important for you to know. if you want a really good offer for a really cheap price go ahead and take the chance buy this iphone. theres no scam is the real deal. now note this you are buying a used iphone. second thing that you need to keep in mind is the battery capacity if youre looking for a phone you want to try to aim for 95% battery capacity is your best bet when i received mine i got it at 80% battery capacity so thats also another thing to keep in mind when youre buying a used phone make sure youre looking at that battery capacity its really important another thing you will receive scratches on the display and the back glass in my opinion this is nothing to cry over just throw a case on it and it should look pretty good. you also want to make sure that your face recognition is on point make sure that everything on the phone is responsive after unboxing thank you guys so much for taking your time in re
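Note from the output above that stripping non-ASCII characters also removes typographic apostrophes, so "there’s" becomes "theres". A minimal sketch of one way to limit this side effect, assuming it is acceptable to map curly quotes to their ASCII equivalents before applying the ASCII filter. The helper name `to_clean_ascii` and the sample string are hypothetical and not part of the original notebook:

```python
def to_clean_ascii(text):
    """Map typographic quotes to ASCII, drop remaining non-ASCII characters, lowercase."""
    # Hypothetical helper: replace curly quotes before the ASCII filter removes them
    replacements = {'\u2019': "'", '\u2018': "'", '\u201c': '"', '\u201d': '"'}
    for curly, plain in replacements.items():
        text = text.replace(curly, plain)
    # Same idea as the cell above: keep only ASCII characters, then lowercase
    return text.encode('ascii', 'ignore').decode('ascii').lower()

# '\U0001F44C' is the "OK hand" emoji and '\u2019' a curly apostrophe
sample = 'It was what I expect \U0001F44C There\u2019s no scam'
cleaned = to_clean_ascii(sample)
print(cleaned)
```

Under this assumption, the helper could be applied with `rev['review'].astype(str).apply(to_clean_ascii)` in place of the two lines in the cell above.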

#### Language of the reviews

Even though the reviews were imported from the U.S. Amazon website, some of them are expected to be in a foreign language. Several libraries could be used to detect the language of a review, although most of them rely on the Google Translate API, which is a paid service for unlimited access. Thus, we'll use the langid library, a standalone Language Identification (LangID) tool.

In [18]:
#If necessary, install the library:
#! pip install langid --user

In [19]:
import langid

Let's create a new column (lang) with the detected language:

In [20]:
pd.options.mode.chained_assignment = None  # default='warn'
rev['lang'] = rev['review'].apply(lambda x:langid.classify(x)[0])
rev.head()

Unnamed: 0,product,rating,review,lang
0,Apple iPhone X,1,this phone was not payed off i purchased a ref...,en
1,Apple iPhone X,1,cracked and does not turn on. the screen was c...,en
2,Apple iPhone X,1,don't buy from this seller. this product is fa...,en
3,Apple iPhone X,1,possibility to change the phone? i bought this...,en
4,Apple iPhone X,5,exceeded expectations i loved how new this pho...,en


Let's see how the review languages are distributed:

In [21]:
rev['lang'].value_counts(normalize=True)

en    0.903459
es    0.066714
de    0.006354
it    0.004589
nl    0.003883
pt    0.002824
fr    0.002824
da    0.002118
ca    0.001059
pl    0.000882
ro    0.000882
no    0.000529
sv    0.000529
et    0.000529
eu    0.000353
eo    0.000353
lt    0.000353
sl    0.000353
mg    0.000176
fi    0.000176
la    0.000176
tl    0.000176
vi    0.000176
id    0.000176
lv    0.000176
gl    0.000176
Name: lang, dtype: float64

It can be seen that most of the reviews (90.3%) are in English, followed by Spanish (6.7%). It would be possible to translate each review, again, with Google services. Nevertheless, since most of the reviews are in English, it is possible to work with those reviews only and still have enough reviews to analyze.

In [22]:
rev = rev[rev['lang'].isin(['en'])]


In [23]:
rev['lang'].value_counts(normalize=True)

en    1.0
Name: lang, dtype: float64

#### TextBlob

TextBlob is a Python library used to perform NLP tasks like tokenization, POS tagging, lemmatization, n-grams, and sentiment analysis. It is built on top of NLTK but has more features like spelling correction, translation, and language detection (with Google services).

In this work, it will be employed for spelling correction.

In [24]:
#If necessary, install the library:
#! pip install -U textblob
#! python -m textblob.download_corpora

In [25]:
from textblob import TextBlob


In [26]:
#Correct the spelling of the reviews (it could take a while with more than 5k reviews)
rev['review_corrected'] = rev['review'].apply(lambda x: str(TextBlob(x).correct()))


In [27]:
#Drop the pre-correction 'review' column
rev.drop(['review'], axis=1, inplace = True)

Let's remove the neutral reviews (rating = 3) and classify reviews as positive (>3) or negative reviews (<3). This column will also serve to train the model for the sentiment analysis predictor.

In [28]:
rev = rev[rev['rating'] != 3]
rev['Positive review'] = np.where(rev['rating'] > 3, 1, 0)
rev.drop(['lang'], axis = 1, inplace =True)
rev.head()

Unnamed: 0,product,rating,Positive review,review_corrected
0,Apple iPhone X,1,0,this phone was not played off i purchased a re...
1,Apple iPhone X,1,0,cracked and does not turn on. the screen was c...
2,Apple iPhone X,1,0,don't buy from this seller. this product is fa...
3,Apple iPhone X,1,0,possibility to change the phone? i bought this...
4,Apple iPhone X,5,1,exceeded expectations i loved how new this pho...


### Frequent words finder

Let's use scikit-learn to try and find the most frequent words found in positive/negative reviews in each product. To do so, let's split the databases by product.

In [29]:
iX = rev[rev['product']=='Apple iPhone X'].reset_index(drop=True)
iXR = rev[rev['product']=='Apple iPhone XR'].reset_index(drop=True)
i11 = rev[rev['product']=='Apple iPhone 11'].reset_index(drop=True)


In [30]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets for every product

X_train_iX, X_test_iX, y_train_iX, y_test_iX = train_test_split(iX['review_corrected'], 
                                                    iX['Positive review'], 
                                                    random_state=0)

X_train_iXR, X_test_iXR, y_train_iXR, y_test_iXR = train_test_split(iXR['review_corrected'], 
                                                    iXR['Positive review'], 
                                                    random_state=0)

X_train_i11, X_test_i11, y_train_i11, y_test_i11 = train_test_split(i11['review_corrected'], 
                                                    i11['Positive review'], 
                                                    random_state=0)

#### CountVectorizer

The bag-of-words approach is a simple and commonly used way to represent text in machine learning, which ignores structure and only counts how often each word occurs. CountVectorizer allows us to use the bag-of-words approach by converting a collection of text documents into a matrix of token counts. 

First, we instantiate the CountVectorizer and fit it to our training data. Fitting the CountVectorizer consists of tokenizing the training data and building the vocabulary. By default, it finds all sequences of at least two letters or numbers separated by word boundaries, converts everything to lowercase (we already lowercased the text), and builds a vocabulary from these tokens. 

We'll set the minimum document frequency (min_df) to 10, so that an n-gram must appear in at least 10 reviews to be taken into account; this removes n-grams that appear in only a few reviews and might not be good predictors. Additionally, we'll set the n-gram range from 2 to 3, so that the features are two- and three-word sequences, which adds context to the repeated words. 

Afterward, we use the transform method to convert the documents in X_train into a document-term matrix, giving us the bag-of-words representation of X_train.  
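To make the effect of min_df and ngram_range concrete, here is a small sketch on a made-up three-document corpus (not taken from the review data). The variable names `docs`, `toy_vect`, and `toy_matrix` are illustrative only, and min_df is lowered to 2 because the corpus is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration, not taken from the review data)
docs = ['great phone love it',
        'great phone no scratches',
        'phone stopped working']

# min_df=2 keeps only n-grams that appear in at least two documents;
# ngram_range=(2, 3) builds features from two- and three-word sequences
toy_vect = CountVectorizer(min_df=2, ngram_range=(2, 3)).fit(docs)
print(sorted(toy_vect.vocabulary_))   # n-grams kept in the vocabulary

# Each row is a document, each column the count of one n-gram
toy_matrix = toy_vect.transform(docs)
print(toy_matrix.toarray())
```

Only "great phone" occurs in two documents, so it is the single n-gram that survives the min_df filter; the third document, which does not contain it, gets a zero in that column.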

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vect_iX = CountVectorizer(min_df=10, ngram_range=(2,3)).fit(X_train_iX)
X_train_vectorized_iX = vect_iX.transform(X_train_iX)

vect_iXR = CountVectorizer(min_df=10, ngram_range=(2,3)).fit(X_train_iXR)
X_train_vectorized_iXR = vect_iXR.transform(X_train_iXR)

vect_i11 = CountVectorizer(min_df=10, ngram_range=(2,3)).fit(X_train_i11)
X_train_vectorized_i11 = vect_i11.transform(X_train_i11)

Let's train a logistic regression model on each vectorized dataset, as logistic regression works well for high-dimensional sparse data.

In [32]:
from sklearn.linear_model import LogisticRegression

model_iX = LogisticRegression()
model_iX.fit(X_train_vectorized_iX, y_train_iX)

model_iXR = LogisticRegression()
model_iXR.fit(X_train_vectorized_iXR, y_train_iXR)

model_i11 = LogisticRegression()
model_i11.fit(X_train_vectorized_i11, y_train_i11)

LogisticRegression()
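The three vectorizer/model pairs above follow exactly the same steps. As a sketch of an alternative (not the approach used in this notebook), the vectorizer and the classifier could be bundled with scikit-learn's `make_pipeline`, so that each product needs a single call. The helper name `fit_ngram_model` and the toy data are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_ngram_model(texts, labels, min_df=10):
    """Fit an n-gram bag-of-words + logistic regression pipeline (hypothetical helper)."""
    pipeline = make_pipeline(
        CountVectorizer(min_df=min_df, ngram_range=(2, 3)),
        LogisticRegression(),
    )
    return pipeline.fit(texts, labels)

# Toy data for illustration only; min_df is lowered because the corpus is tiny
toy_texts = ['great phone love it', 'love it so far',
             'does not work', 'phone does not work']
toy_labels = [1, 1, 0, 0]
toy_pipeline = fit_ngram_model(toy_texts, toy_labels, min_df=1)
toy_predictions = toy_pipeline.predict(['love it', 'does not work'])
print(toy_predictions)
```

With the notebook's data, a call such as `fit_ngram_model(X_train_iX, y_train_iX)` would play the role of `vect_iX` and `model_iX` together; the pipeline also vectorizes raw text at prediction time, so no separate `transform` call is needed.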

#### Sorting the n-grams

Let's get the resulting n-grams (feature_names) in each vocabulary as NumPy arrays, and get their indices sorted by model coefficient, from most associated with negative reviews to most associated with positive reviews (sorted_coef_index). Lastly, let's merge the results and show, in the freq_words dataset, the 15 n-grams most strongly associated with negative and positive reviews of each product.

In [33]:
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) returns a NumPy array directly
feature_names_iX = vect_iX.get_feature_names_out()
sorted_coef_index_iX = model_iX.coef_[0].argsort()

feature_names_iXR = vect_iXR.get_feature_names_out()
sorted_coef_index_iXR = model_iXR.coef_[0].argsort()

feature_names_i11 = vect_i11.get_feature_names_out()
sorted_coef_index_i11 = model_i11.coef_[0].argsort()


freq_words = pd.DataFrame({'Negative_R iX': feature_names_iX[sorted_coef_index_iX[:15]], 
                           'Negative_R iXR': feature_names_iXR[sorted_coef_index_iXR[:15]],
                           'Negative_R i11': feature_names_i11[sorted_coef_index_i11[:15]],
                           'Positive_R iX': feature_names_iX[sorted_coef_index_iX[:-16:-1]],
                           'Positive_R iXR': feature_names_iXR[sorted_coef_index_iXR[:-16:-1]],
                           'Positive_R i11': feature_names_i11[sorted_coef_index_i11[:-16:-1]],})
freq_words

Unnamed: 0,Negative_R iX,Negative_R iXR,Negative_R i11,Positive_R iX,Positive_R iXR,Positive_R i11
0,doesn work,would not,very disappointed,great phone,brand new,love it
1,very disappointed,the speaker,do not,brand new,great phone,great phone
2,not good,not unlocked,doesn work,so far,love it,good phone
3,dont buy,stopped working,the screen,love it,screen protector,with the phone
4,turn on,lot of,waste of,perfect condition,so far,which is
5,do not,have to,there was,good good,no scratches,love the
6,to return,doesn work,to return,no scratches,great buy,is great
7,not buy,use it,not worth,like new,great product,brand new
8,your money,to return,is not,good phone,works perfectly,the charge
9,didn work,was not,didn work,very good,like new,no scratches


### Discussion

#### Negative reviews

Some of the trends extracted for further analysis are:

* The performance of the phone was the most penalized characteristic, with expressions related to the phone not working, and the decision to return it ("Doesn't work", "stopped working", "return it", "to return", etc.) appearing throughout the reviews of the three products.

* There seemed to be screen problems with the iPhone XR and iPhone 11.

* The iPhone XR appeared to have additional problems with the speaker and the phone not being fully unlocked.

#### Positive reviews

Some of the trends extracted for further analysis are:

* The external look of the phone is very important for the customer, with expressions like  "no scratches" and "like new" constantly appearing throughout the reviews of every product.

* Naturally, the performance was also important for customers ("Works well", "works perfectly", etc.).

* Reviews might be short-term oriented, with frequent expressions like "so far" indicating that the review could have been written after only a short period of product usage.

### Sentiment analysis

Lastly, let's train our model with the entire dataset and create a sentiment analysis tool.

In [34]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(rev['review_corrected'], 
                                                    rev['Positive review'], 
                                                    random_state=0)
vect = CountVectorizer(min_df=10, ngram_range=(2,3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
print('Accuracy: ', accuracy_score(y_test, predictions))

AUC:  0.8358538703723808
Accuracy:  0.8959537572254336


On the held-out test set, the model reaches an AUC score of 0.84 and an accuracy of 0.90, which suggests that it generalizes reasonably well to unseen reviews.
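One caveat: the AUC above is computed from hard class predictions (0 or 1). Since the ROC curve is built by sweeping the decision threshold, the metric is usually computed from continuous scores, such as the predicted probability of the positive class. A minimal sketch of the difference on synthetic data (not the review dataset); all variable names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (as done in the cell above)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print('AUC from hard labels:  ', auc_from_labels)
print('AUC from probabilities:', auc_from_scores)
```

In this notebook, the equivalent call would be `roc_auc_score(y_test, model.predict_proba(vect.transform(X_test))[:, 1])`; the resulting value may differ from the one reported above.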

Lastly, let's test the model with two sentences with the same words but in a different order:

In [35]:
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


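Even though the same words are employed in both sentences, they were correctly identified as positive and negative reviews, respectively. Because the model is built on bigrams and trigrams, it captures local word order (for example, "not working" versus "is working"), which lets it differentiate sentiment in cases like this one.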
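To see which n-grams set the two sentences apart, the sketch below tokenizes them with an unfitted CountVectorizer analyzer using the same n-gram range as the model. Whether each of these n-grams is actually part of the fitted vocabulary depends on the min_df filter and the training data, so this only illustrates the candidate features:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer with the same n-gram range used in the notebook (no fitting needed)
analyze = CountVectorizer(ngram_range=(2, 3)).build_analyzer()

ngrams_a = set(analyze('not an issue, phone is working'))
ngrams_b = set(analyze('an issue, phone is not working'))

only_a = sorted(ngrams_a - ngrams_b)   # n-grams only in the first sentence
only_b = sorted(ngrams_b - ngrams_a)   # n-grams only in the second sentence
print(only_a)
print(only_b)
```

The first sentence contributes n-grams such as "is working" and "not an issue", while the second contributes "is not" and "not working", so the two sentences have different feature vectors despite sharing the same words.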

## Conclusions

In this work, reviews of renewed iPhones were analyzed. To do so, various data analysis, NLP, and Machine Learning libraries were employed to:

* Clean the datasets, detect and filter by language, and correct spelling.
* Train a model and obtain the frequent words found in positive and negative reviews, employing the bag-of-words approach.
* Develop a sentiment analysis predictor.