# NLP Applications in E-Commerce

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_0901.png)

- E-Commerce Catalog, a product catalog is a database of the products that the enterprise deals in or that a user can purchase.
- Review Analysis, analysis of user reviews for each product.
- Product Search, search engine for products.
- Product Recommendations, recommendation engine.

## Search in E-Commerce

A good search mechanism positively impacts the conversion rate, which directly impacts the revenue of the e-retailer.

The left-most section of a typical e-commerce search results page shows a set of filters (alternatively, “facets”) that allow the customer to guide their search in a way that matches their buying needs.

These filters are the key that defines the faceted search. However, they may not always be readily available for all products. Some reasons for that are:
- The seller didn’t upload all the required information while listing the product on the e-commerce website.
- Some of the filters are difficult to obtain, or the seller may not have the complete information to provide—for example, the caloric value of a food product, which is typically derived from the nutrient information provided on the product case.

Faceted search can be built with most popular search engine backends like Solr and Elasticsearch.

> TIP: In an e-commerce setting, we also need to account for business needs other than relevance in terms of facets and text. For instance, products that are part of a promotion or sale may be bumped up in results. This can be built by utilizing features like Elasticsearch boosting.
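
As a rough illustration of how facets and promotion boosting can be combined, the sketch below builds an Elasticsearch-style search body as a plain Python dict. The field names (`title`, `on_sale`, `brand`, `color`) and the `build_faceted_query` helper are hypothetical, and the boost value is an arbitrary assumption, not a recommended setting.

```python
def build_faceted_query(text, filters, facet_fields, promo_boost=2.0):
    """Return an Elasticsearch-style search body as a plain dict."""
    return {
        "query": {
            "bool": {
                # Full-text relevance on the (hypothetical) title field.
                "must": [{"match": {"title": text}}],
                # Facet selections narrow the results without changing scores.
                "filter": [{"term": {field: value}} for field, value in filters.items()],
                # Optional clause: products on sale get a score bump.
                "should": [{"term": {"on_sale": {"value": True, "boost": promo_boost}}}],
            }
        },
        # One terms aggregation per facet yields the counts shown next to each filter.
        "aggs": {field: {"terms": {"field": field}} for field in facet_fields},
    }


body = build_faceted_query("running shoes", {"brand": "acme"}, ["brand", "color"])
```

The resulting dict could be sent as the request body of a search call. Keeping facet selections in the `filter` clause means they restrict the result set, while the text match and the promotion clause decide the ordering.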

## Building an E-Commerce Catalog

- Attribute Extraction, extract the attributes of a product, such as color, size, etc.
    - Direct attribute extraction algorithms assume the presence of the attribute value in the input text.
    - Derived attribute extraction algorithms do not assume that the attribute of interest is present in the input text.
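
A minimal sketch of direct attribute extraction, assuming the attribute values literally appear in the product title. The color vocabulary, the size pattern, and the `extract_direct_attributes` helper are hypothetical. Production systems typically use sequence labeling models instead of lookups like this.

```python
import re

# Hypothetical vocabulary; a real catalog would maintain a far larger one.
COLORS = {"black", "white", "red", "blue", "green"}
# Hypothetical size pattern: a number followed by a unit.
SIZE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s?(inch|cm|oz|ml|gb)\b", re.IGNORECASE)


def extract_direct_attributes(title):
    """Pull out attribute values that literally appear in the title."""
    tokens = re.findall(r"[a-z]+", title.lower())
    attributes = {}
    colors = [t for t in tokens if t in COLORS]
    if colors:
        attributes["color"] = colors[0]
    size = SIZE_PATTERN.search(title)
    if size:
        attributes["size"] = f"{size.group(1)} {size.group(2).lower()}"
    return attributes


attrs = extract_direct_attributes("Acme Laptop Sleeve 15.6 inch Black")
```

A derived attribute, by contrast, would have to be inferred from context, since no token in the title states its value.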

> TIP: For the models that use deep recurrent structures, the amount of data needed is typically much more than what’s needed when less-complex ML models such as CRF and HMM are used. The more data there is, the better the deep models learn. This is common to all DL models, as we saw in earlier chapters, but for e-commerce, getting a large set of well-sampled, annotated data is very expensive. Hence, it needs to be taken care of before we build any sophisticated models.

- Product Categorization and Taxonomy

A good taxonomy and properly linked products can be critical because it allows an e-commerce site to:
1. Show products similar to the product searched
2. Provide better recommendations
3. Select appropriate bundles of products for better deals for the customer
4. Replace old products with new ones
5. Show price comparisons of different products in the same category

There are APIs, such as Semantics3, eBay, and Lucidworks, that build on the large catalog content of various big retailers and can categorize a product by scanning its unique product code.

- Product Enrichment, typically seen as a larger and more continuous process than just improving product titles in any online retail setup.
- Product Deduplication and Matching
    - Attribute Match, if two products are the same, then the values of their attributes must be the same. Hence, once the attributes are extracted, we compare the attribute values of the two products in question.
    - Title Match, we can compare bigrams and trigrams among the titles of candidate duplicates.
    - Image Match, pixel-to-pixel match, feature map matching, or even advanced image-matching techniques like Siamese networks are popular.
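
The attribute and title checks can be sketched as follows. Jaccard overlap is an assumed choice of similarity measure, and the helper names are hypothetical. Any cutoff for calling two titles duplicates would need to be tuned on real catalog data.

```python
def word_ngrams(text, n):
    """Set of word n-grams in a lowercased, whitespace-tokenized title."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def title_similarity(title_a, title_b):
    """Jaccard overlap over the union of word bigrams and trigrams."""
    grams_a = word_ngrams(title_a, 2) | word_ngrams(title_a, 3)
    grams_b = word_ngrams(title_b, 2) | word_ngrams(title_b, 3)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def attributes_match(attrs_a, attrs_b):
    """True when every attribute present in both products has the same value."""
    shared = attrs_a.keys() & attrs_b.keys()
    return bool(shared) and all(attrs_a[k] == attrs_b[k] for k in shared)


same = title_similarity("Acme Wireless Mouse Black", "acme wireless mouse black")
different = title_similarity("Acme Wireless Mouse Black", "Zeta Office Chair Grey")
```

Here `same` is 1.0 because the titles differ only in case, and `different` is 0.0 because they share no bigram or trigram.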
    
> TIP: A/B testing is a good method of measuring the results and effectiveness of different algorithms in the e-commerce world. For procedures like attribute extraction and product enrichment, A/B testing different models will show their impact on business metrics. These metrics can be direct or indirect sales, click-through rates, time spent on one web page, etc., and an improvement in relevant metrics shows that a model works better.
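
One common way to judge whether a difference in a rate metric between two variants is more than noise is a two-proportion z-test. The sketch below is an assumption about how such a comparison might be run, not a method prescribed by the text. The traffic and conversion counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A uses the old model, variant B the new one.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
```

A small p-value suggests the new model changed the metric, though in practice the test duration, sample size, and choice of metric all need to be fixed before the experiment starts.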

## Review Analysis

- Sentiment Analysis

Negative reviews are more important to understand.

> TIP: Typically, a review contains more than one sentence. It’s advisable to break a review into sentences and pass each sentence as one data point. This is also relevant for sentence-wise aspect tagging, aspect-wise sentiment analysis, etc.
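
A minimal sketch of turning one review into sentence-level data points. It uses a simple regular expression so that it runs without extra downloads; the review text is made up. A trained tokenizer such as `nltk.sent_tokenize` would handle abbreviations and other edge cases better.

```python
import re


def split_review(review):
    """Split a review into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]


review = "The room was spotless. Breakfast was cold! Would I return? Maybe."
data_points = split_review(review)
```

Each element of `data_points` can then be scored or tagged on its own.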

- Aspect-Level Sentiment Analysis, an aspect is a semantically rich, concept-centric collection of words that indicates certain properties or characteristics of the product. For example, the aspects for a travel website might include location, value, and cleanliness.
    - Supervised Approach, depends mainly on seed words.
    - Unsupervised Approach, topic modeling is a useful technique in identifying latent topics present in a document.
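
To make the seed-word idea concrete, the sketch below tags a sentence with every aspect whose seed words it contains. The seed lists are hypothetical and far smaller than what a real system would use, and plain token overlap is only the simplest possible matching rule.

```python
import re

# Hypothetical seed words for the travel-site aspects mentioned above.
ASPECT_SEEDS = {
    "location": {"location", "downtown", "nearby", "walk", "station"},
    "value": {"value", "price", "cheap", "expensive", "worth"},
    "cleanliness": {"clean", "dirty", "spotless", "dusty", "stain"},
}


def tag_aspects(sentence):
    """Return the aspects whose seed words appear in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(aspect for aspect, seeds in ASPECT_SEEDS.items() if tokens & seeds)


tags = tag_aspects("The room was spotless and a short walk from the station.")
```

A sentence can carry more than one aspect, which is why the function returns a list.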
    
Connecting Overall Ratings to Aspects, we use a technique called latent rating regression analysis (LARA).

> TIP: User information is also key in handling reviews. Imagine a scenario where a popular user, as opposed to a less-popular user, writes a good review. The user matters! While performing the review analysis, a “user weight” can be defined for all users based on their ratings (generally given by other peers) and can be used in all calculations to discount the reviewer bias.
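
One simple way to apply such a user weight is a weighted average of per-review sentiment scores. This is an assumed aggregation rule for illustration only. The scores and weights below are made up.

```python
def weighted_sentiment(reviews):
    """reviews: list of (sentiment_score, user_weight) pairs."""
    total_weight = sum(weight for _, weight in reviews)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in reviews) / total_weight


# Hypothetical data: a highly rated reviewer and a less-known one.
reviews = [(0.8, 3.0), (-0.4, 1.0)]
overall = weighted_sentiment(reviews)
```

With these numbers the result is about 0.5, compared with about 0.2 for an unweighted mean, because the highly rated reviewer counts three times as much.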

Understanding Aspects,

Given the huge volume of reviews an e-commerce website encounters, there will still be a lot of sentences under an aspect. Here, a summarization algorithm may save the day. LexRank is an algorithm, similar to PageRank, that treats each sentence as a node and connects the nodes via sentence similarity. It then picks the most central sentences and presents them as an extractive summary of the sentences under an aspect.
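
The graph-centrality idea can be sketched in a few lines. This is a simplified, LexRank-style illustration rather than the published algorithm: it uses raw word counts instead of TF-IDF weights, applies no similarity threshold, and the damping factor and iteration count are assumed defaults. The example sentences are made up.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in a similarity graph (power iteration)."""
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weights: cosine similarity between every pair of distinct sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbors in proportion to similarity.
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / row_sums[j]
                            for j in range(n) if row_sums[j])
            for i in range(n)
        ]
    return scores


sentences = [
    "The battery lasts all day.",
    "Battery life is great and lasts long.",
    "The battery charges fast and lasts all day.",
    "Shipping was slow.",
]
scores = rank_sentences(sentences)
summary = sentences[max(range(len(sentences)), key=scores.__getitem__)]
```

The sentence that shares the most vocabulary with the others ends up with the highest score, while the unrelated shipping sentence stays at the baseline.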

Flowchart of review analysis: overall sentiments, aspect-level sentiments, and aspect-wise significant reviews.

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_0917.png)

> TIP: A complete understanding of a product can only be achieved by both user reviews and editorial reviews. Editorial reviews are generally provided by expert users or domain experts. These reviews are more reliable and can be shown at the top of the review section. But on the other hand, general user reviews reveal the true picture of the product experience from all users’ perspectives. Hence, melding editorial reviews with general user reviews is important. That may be achieved by mixing both kinds of reviews in the top section and ranking them accordingly.

## Recommendations for E-Commerce

Comprehensive study of techniques for various e-commerce recommendation scenarios

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_0918.png)

> NOTE: Recommendation engines deal with information from various sources. Proper matching of various data tables and consistency of the information across various data sources is important to maintain. For example, while collating the information about product attributes and product transaction history, the consistency of the information should be checked carefully. Complementary and substitute data can give indications about data quality. One should check for anomalous behavior while working with multifarious data sources, as in the case of e-commerce recommendation.

### A Case Study: Substitutes and Complements

Complements are products that are typically bought together. On the other hand, there are pairs that are bought in lieu of the other, and they’re known as substitute pairs.

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_0919.png)

## Extra

### Latent Attribute Extraction from Reviews

Reviews contain specific information about product attributes. Explicit extraction of attributes from reviews may have limitations in representation, as we need to define an explicit ontology, so instead, we learn them via a latent vector representation. 

### Product Linking

The next task is to understand how the two products are linked. We already obtained topic vectors, which capture the intrinsic properties of the product in a latent attribute space. Now, given a pair of products, we want to create a combined feature vector out of the respective topic vectors for the products and then predict if there’s any relationship between them. This can be viewed as a binary classification problem where the features have to be obtained from the respective topic vectors for the product pair. We call this process “link prediction”. 
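
A minimal sketch of the link prediction step, under stated assumptions: the text does not say how the two topic vectors are combined, so the element-wise product and absolute difference used here are one illustrative choice. The topic vectors are made up, and in practice the classifier weights would be learned from labeled product pairs.

```python
import math


def pair_features(topics_a, topics_b):
    """Combine two topic vectors into one feature vector for the pair."""
    product = [a * b for a, b in zip(topics_a, topics_b)]
    difference = [abs(a - b) for a, b in zip(topics_a, topics_b)]
    return product + difference


def link_probability(features, weights, bias=0.0):
    """Logistic score: estimated probability that the pair is related."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical topic vectors for two products.
features = pair_features([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
# All-zero weights stand in for an untrained model.
untrained = link_probability(features, weights=[0.0] * len(features))
```

With all-zero weights the score is exactly 0.5, meaning no evidence either way. Training would move the weights so that related pairs score higher than unrelated ones.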


```python
from pprint import pprint

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
import string

import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
from nltk.tokenize import word_tokenize, RegexpTokenizer

from pycorenlp import StanfordCoreNLP
```



```python
positive = "This fried chicken tastes very good. It is juicy and perfectly cooked."
negative = "This fried chicken tasted bad. It is dry and overcooked."
ambiguous = "Except the amazing fried chicken everything else at the restaurant tastes very bad."
```

# VaderSentiment

VADER Sentiment Analysis: VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

```python
# The compound score ranges from -1 (most negative) to 1 (most positive).
def sentiment_analyzer_scores(text):
    sentiment_analyzer = SentimentIntensityAnalyzer()
    score = sentiment_analyzer.polarity_scores(text)
    pprint(text)
    pprint(score)
    print("-"*30)

print("Positive:")
sentiment_analyzer_scores(positive)

print("Negative:")
sentiment_analyzer_scores(negative)

print("Ambiguous:")
sentiment_analyzer_scores(ambiguous)
```

```
Positive:
'This fried chicken tastes very good. It is juicy and perfectly cooked.'
{'compound': 0.8122, 'neg': 0.0, 'neu': 0.575, 'pos': 0.425}
------------------------------
Negative:
'This fried chicken tasted bad. It is dry and overcooked.'
{'compound': -0.5423, 'neg': 0.28, 'neu': 0.72, 'pos': 0.0}
------------------------------
Ambiguous:
('Except the amazing fried chicken everything else at the restaurant tastes '
 'very bad.')
{'compound': 0.0018, 'neg': 0.204, 'neu': 0.592, 'pos': 0.204}
------------------------------
```


```python
def get_word_sentiment(text):
    sentiment_analyzer = SentimentIntensityAnalyzer()
    tokenized_text = nltk.word_tokenize(text)

    positive_words = []
    neutral_words = []
    negative_words = []
    for word in tokenized_text:
        # Score each token on its own and bucket it by its compound score.
        compound = sentiment_analyzer.polarity_scores(word)['compound']
        if compound >= 0.1:
            positive_words.append(word)
        elif compound <= -0.1:
            negative_words.append(word)
        else:
            neutral_words.append(word)
    print(text)
    print('Positive:', positive_words)
    print('Negative:', negative_words)
    print('Neutral:', neutral_words)
    print("-"*30)

get_word_sentiment(positive)
get_word_sentiment(negative)
get_word_sentiment(ambiguous)
```

```
This fried chicken tastes very good. It is juicy and perfectly cooked.
Positive: ['good', 'perfectly']
Negative: []
Neutral: ['This', 'fried', 'chicken', 'tastes', 'very', '.', 'It', 'is', 'juicy', 'and', 'cooked', '.']
------------------------------
This fried chicken tasted bad. It is dry and overcooked.
Positive: []
Negative: ['bad']
Neutral: ['This', 'fried', 'chicken', 'tasted', '.', 'It', 'is', 'dry', 'and', 'overcooked', '.']
------------------------------
Except the amazing fried chicken everything else at the restaurant tastes very bad.
Positive: ['amazing']
Negative: ['bad']
Neutral: ['Except', 'the', 'fried', 'chicken', 'everything', 'else', 'at', 'the', 'restaurant', 'tastes', 'very', '.']
------------------------------
```


# Stanford Core NLP

Before running the code, we need to start the Stanford Core NLP server on our local machine.
To do that, follow the steps below (tested on Debian; they should work on other distributions too):

1. Download Stanford Core NLP from [here](https://stanfordnlp.github.io/CoreNLP/#download).
2. Unzip the downloaded archive.
3. `cd` into the unzipped folder, for example `cd stanford-corenlp-4.0.0/`.
4. Start the server using this command:

```
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000
```

If you do not have Java installed on your system, please install it from the official Oracle page.

```python
nlp = StanfordCoreNLP('http://localhost:9000')

def get_sentiment(text):
    res = nlp.annotate(text,
                       properties={'annotators': 'sentiment',
                                   'outputFormat': 'json',
                                   'timeout': 1000,
                       })
    # Only the first sentence of the text is reported here.
    first = res['sentences'][0]
    print(text)
    print('Sentiment:', first['sentiment'])
    print('Sentiment score:', first['sentimentValue'])
    # The distribution has five classes, indexed 0 (very negative) to 4 (very positive).
    print('Sentiment distribution (0-v. negative, 4-v. positive):', first['sentimentDistribution'])
    print("-"*30)
```

```python
get_sentiment(positive)
get_sentiment(negative)
get_sentiment(ambiguous)
```

```
This fried chicken tastes very good. It is juicy and perfectly cooked.
Sentiment: Negative
Sentiment score: 1
Sentiment distribution (0-v. negative, 4-v. positive): [0.14784497645863, 0.42150440150006, 0.26795232397677, 0.14973929036748, 0.01295900769705]
------------------------------
This fried chicken tasted bad. It is dry and overcooked.
Sentiment: Verynegative
Sentiment score: 0
Sentiment distribution (0-v. negative, 4-v. positive): [0.59929214372637, 0.37183747332338, 0.00491027284258, 0.01321244183821, 0.01074766826945]
------------------------------
Except the amazing fried chicken everything else at the restaurant tastes very bad.
Sentiment: Negative
Sentiment score: 1
Sentiment distribution (0-v. negative, 4-v. positive): [0.14730433605028, 0.42068997423614, 0.26920777577949, 0.14985314657637, 0.01294476735772]
------------------------------
```
