# Introduction to natural language processing with Python

TODO: goals, overview

## Preparations

TODO: describe setup.sh etc

## Positive adjectives from customer reviews

Download `7817_1.csv` from https://www.kaggle.com/sasikala11/amazon-customer-reviews and store it in the same folder as this notebook.

In [1]:
review_csv_path = '7817_1.csv'

Read reviews from the CSV:

In [2]:
import csv

asins_to_reviews_map = {}
with open(review_csv_path, encoding='utf-8') as review_csv_file:
    review_csv_reader = csv.DictReader(review_csv_file)
    for row in review_csv_reader:
        asins = row['asins']
        review = row['reviews.text']
        if asins in asins_to_reviews_map:
            asins_to_reviews_map[asins].append(review)
        else:
            asins_to_reviews_map[asins] = [review]
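
The membership check in the loop can also be expressed with `collections.defaultdict`. The following is a minimal sketch, assuming a small in-memory CSV with made-up rows in place of `7817_1.csv`. Only the column names `asins` and `reviews.text` are taken from the cell above; the variable names and sample rows are illustrative.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for the real review file.
sample_csv = io.StringIO(
    'asins,reviews.text\n'
    'A1,Great screen\n'
    'A2,Too heavy\n'
    'A1,Battery lasts long\n'
)

# Missing keys start out as an empty list, so no membership check is needed.
sample_asins_to_reviews_map = defaultdict(list)
for row in csv.DictReader(sample_csv):
    sample_asins_to_reviews_map[row['asins']].append(row['reviews.text'])

print(dict(sample_asins_to_reviews_map))
```

Both variants build the same mapping; the `defaultdict` version is just shorter.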

Now let's take the first product we come across and look at its first review:

In [3]:
asins = next(iter(asins_to_reviews_map.keys()))
asins

'B00QJDU3KY'

In [4]:
reviews = asins_to_reviews_map[asins]
review = reviews[0]
print(review)

I initially had trouble deciding between the paperwhite and the voyage because reviews more or less said the same thing: the paperwhite is great, but if you have spending money, go for the voyage.Fortunately, I had friends who owned each, so I ended up buying the paperwhite on this basis: both models now have 300 ppi, so the 80 dollar jump turns out pricey the voyage's page press isn't always sensitive, and if you are fine with a specific setting, you don't need auto light adjustment).It's been a week and I am loving my paperwhite, no regrets! The touch screen is receptive and easy to use, and I keep the light at a specific setting regardless of the time of day. (In any case, it's not hard to change the setting either, as you'll only be changing the light level at a certain time of day, not every now and then while reading).Also glad that I went for the international shipping option with Amazon. Extra expense, but delivery was on time, with tracking, and I didnt need to worry about cus

Set up an NLP processor for the English language:

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

Split the review into sentences:

In [6]:
doc = nlp(review)
for sent_number, sent in enumerate(doc.sents, 1):
    print(sent_number, '-', sent)

1 - I initially had trouble deciding between the paperwhite and the voyage because reviews more or less said the same thing: the paperwhite is great, but if you have spending money, go for the voyage.
2 - Fortunately, I had friends who owned each, so I ended up buying the paperwhite on this basis: both models now have 300 ppi, so the 80 dollar jump turns out pricey the voyage's page press isn't always sensitive, and if you are fine with a specific setting, you don't need auto light adjustment).It's been a week
3 - and I am loving my paperwhite, no regrets!
4 - The touch screen is receptive and easy to use, and I keep the light at a specific setting regardless of the time of day.
5 - (In any case, it's not hard to change the setting either, as you'll only be changing the light level at a certain time of day, not every now and then while reading).Also glad that I went for the international shipping option with Amazon.
6 - Extra expense, but delivery was on time, with tracking, and I didn

Set up a sentiment analyzer:

In [7]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

Find out the sentiment of the first sentence:

In [8]:
sent = next(doc.sents)
print(sent)
score = analyser.polarity_scores(str(sent))
print(score)

I initially had trouble deciding between the paperwhite and the voyage because reviews more or less said the same thing: the paperwhite is great, but if you have spending money, go for the voyage.
{'neg': 0.052, 'neu': 0.876, 'pos': 0.072, 'compound': 0.1779}


Find sentences that are mostly positive:

In [9]:
def mostly_positive_score_and_sents(reviews):
    for review_number, review in enumerate(reviews, 1):
        doc = nlp(review)
        for sent_number, sent in enumerate(doc.sents, 1):
            score = analyser.polarity_scores(str(sent))
            if score['neg'] < 0.1 and score['pos'] > 0.25:
                yield score, sent

for score, sent in mostly_positive_score_and_sents(reviews):
    print(f"{score['pos']} {sent}")

0.3 I have had a full week with my new Kindle Paperwhite and I have to admit, I'm in love.
0.267 There was never that feeling of oh man, reading on this thing is so awesome.
0.474 That desire is back and I simply adore my Kindle.
0.649 Make yourself happy.
0.425 Inspire the reader inside of you.
0.459 I am enjoying it so far.
0.672 Great for reading.
0.261 No Paperwhite screen, naturally, and all the cool usability that delivers, but it works well and has its own attractions as a companion to the Kindle.
0.331 Theres just so much good stuff out there!
0.275 I like having devices on which I can put anything I want and use it.
0.27 That's the one thing I'd like to see Amazon do in some future upgrades: make the Kindle treat sideloaded books just like the ones bought from them directly, with sharing funcion (quotes and Goodreads) enabled and so on.
0.354 The size is perfect, it sits very well in the hand, the light doesn't hurt the eyes in the dark (like the light on a tab does)...
0.333 
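
To make the filter condition easier to reason about, it can be isolated in a small predicate. This is only a sketch: `is_mostly_positive` is a hypothetical helper name, the thresholds are the ones used in `mostly_positive_score_and_sents()`, and the second score dictionary is made up for illustration.

```python
def is_mostly_positive(score, max_neg=0.1, min_pos=0.25):
    """Apply the same thresholds as mostly_positive_score_and_sents()."""
    return score['neg'] < max_neg and score['pos'] > min_pos


# Score of the first sentence as printed earlier in this notebook.
first_sentence_score = {'neg': 0.052, 'neu': 0.876, 'pos': 0.072, 'compound': 0.1779}
# Made-up score, only here to show a sentence that passes the filter.
made_up_score = {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.6}

print(is_mostly_positive(first_sentence_score))  # False: 'pos' is below 0.25
print(is_mostly_positive(made_up_score))         # True
```

Keeping the thresholds as parameters makes it easy to experiment with stricter or looser filters.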

Let's extract adjectives from positive sentences:

In [10]:
def positive_adjectives(reviews):
    for score, sent in mostly_positive_score_and_sents(reviews):
        for token in sent:
            if token.pos_ == 'ADJ':
                # Skip pronouns rendered as '-PRON-'
                if not token.lemma_.startswith('-'):
                    yield token.lemma_
                    
for lemma in positive_adjectives(reviews):
    print(lemma)

full
new
awesome
happy
great
all
cool
that
own
good
which
future
perfect
full
new
awesome
happy
obvious
new
worth
cool
sure
great
perfect
new
new
pleased
easy
comfortable
positive
easy
own
quick
easy
great
nice
terrific
entire
particulary
interesting
few
terrific
worthy
black
fantastic
bright
lovely
premium
perfect
bright
original
special
great
next
good
terrific
superb
super
sharp
skeptical
perfect
back
long
huge
entire
soft
white
glad
easy
little
great
perfect
quick
great
real
crisp
sharp
easy
good
original
pleased
new
dramatic
enough
great
which
happy
new
perfect
happy
fast
clear


In [11]:
from collections import Counter
Counter(positive_adjectives(reviews)).most_common(10)

[('new', 7),
 ('great', 7),
 ('perfect', 6),
 ('easy', 5),
 ('happy', 4),
 ('good', 3),
 ('terrific', 3),
 ('full', 2),
 ('awesome', 2),
 ('cool', 2)]

Filter out words that are too generic:

In [12]:
GENERIC_POSITIVE_ADJECTIVES = {
    'awesome', 'good', 'great', 'happy', 'new', 'perfect', 'pleased', 'terrific',
}
Counter(
    adjective for adjective in positive_adjectives(reviews)
    if adjective not in GENERIC_POSITIVE_ADJECTIVES
).most_common(10)

[('easy', 5),
 ('full', 2),
 ('cool', 2),
 ('own', 2),
 ('which', 2),
 ('quick', 2),
 ('entire', 2),
 ('bright', 2),
 ('original', 2),
 ('sharp', 2)]

Let's wrap this in a function:

In [13]:
def most_common_positive_adjectives(reviews):
    return [
        common_adjective for common_adjective, _ in Counter(
            adjective for adjective in positive_adjectives(reviews)
            if adjective not in GENERIC_POSITIVE_ADJECTIVES
        ).most_common(10)
    ]

most_common_positive_adjectives(reviews)

['easy',
 'full',
 'cool',
 'own',
 'which',
 'quick',
 'entire',
 'bright',
 'original',
 'sharp']

And now for all our products:

In [14]:
for asins, reviews in asins_to_reviews_map.items():
    print(f'{asins}: {most_common_positive_adjectives(reviews)}')

B00QJDU3KY: ['easy', 'full', 'cool', 'own', 'which', 'quick', 'entire', 'bright', 'original', 'sharp']
B002Y27P3M: ['excellent', 'bad', 'personal', 'neat', 'cheap', 'other', 'useful', 'overall', 'easy', 'nice']
B00DU15MU4: ['easy', 'most', 'free', 'excellent', 'remote', 'sure']
B01LW1MS9C: ['huge', 'smart', 'fantastic', 'nice', 'soft', 'elegant', 'several']
B01FWSVGQQ: ['easy', 'nice', 'which', 'bright', 'vibrant', 'strong', 'first', 'all', 'bluetooth']
B00DOPNLJ0: ['prime', 'available', 'which', 'downloadable', 'free', 'sound', 'excellent', 'stunning', 'intuitive', 'popular']
B00NO8LX7E: ['remote', 'nobodys', 'huge', 'cheap', 'second', 'gamer', 'other', 'nice', 'laughable']
B00LWHUAF0: ['nice']
B00KDRQEYQ: ['that', 'fine', 'first', 'worth']
B00OQVZDJM: ['sure', 'digital', 'welcome', 'long', 'physical', 'incredible', 'wonderful']
B00QJDVBFU: ['sure', 'digital', 'welcome', 'long', 'physical', 'incredible', 'wonderful']
B00VKLBU3Y: ['sensitive', 'large', 'original', 'cheap', 'smooth', 'n

Conclusion: there is still some noise, and more generic words need to be added to `GENERIC_POSITIVE_ADJECTIVES`, but the general principle looks promising.
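
One possible way to reduce the noise is a second exclusion set for function words such as `which` or `that` that ended up tagged as adjectives. The sketch below works on plain lists of lemmas so it runs without spaCy. `NOISE_WORDS`, `most_common_filtered` and the sample list are assumptions made for illustration, and `GENERIC_POSITIVE_ADJECTIVES` is repeated with the same values as above so the block is self-contained.

```python
from collections import Counter

# Same values as defined earlier in this notebook.
GENERIC_POSITIVE_ADJECTIVES = {
    'awesome', 'good', 'great', 'happy', 'new', 'perfect', 'pleased', 'terrific',
}
# Hypothetical set of function words observed in the output above.
NOISE_WORDS = {'all', 'own', 'that', 'which'}


def most_common_filtered(adjectives, limit=10):
    """Return the most common adjectives that are neither generic nor noise."""
    excluded = GENERIC_POSITIVE_ADJECTIVES | NOISE_WORDS
    counts = Counter(a for a in adjectives if a not in excluded)
    return [adjective for adjective, _ in counts.most_common(limit)]


sample_adjectives = ['easy', 'which', 'great', 'quick', 'easy', 'own', 'quick', 'bright']
print(most_common_filtered(sample_adjectives))  # ['easy', 'quick', 'bright']
```

Whether a fixed word list is enough, or whether a stop word check on the token itself works better, would have to be tried on the full data set.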

## Semantic search


A German customer might search for:

> rotes kleid unter 100 EUR

A standard search would simply look for each of these terms separately. From the search platform's point of view, what the customer actually meant is:

> kleid color:rot price_eur:\[0 TO 100\]

Modern search platforms can already work out that `rotes` stems to `rot`, so we don't have to bother about that here. With stemming applied, our search query turns into:

> rot kleid unter 100 EUR

To map key terms to fields we can use simple dictionaries:

In [15]:
FIELD_TO_TERM_MAP = {
    'color': ['blau', 'braun', 'gelb', 'grün', 'rot', 'schwarz', 'weiß'],
    'brand': ['apple', 'braun', 'dior', 'samsung', 'sony'],
    # ...
}

Notice that `'braun'` is both a brand and a German color.

For efficient lookup we also need the reverse mapping:

In [16]:
TERM_TO_FIELDS_MAP = {}
for field, terms in FIELD_TO_TERM_MAP.items():
    for term in terms:
        if term in TERM_TO_FIELDS_MAP:
            TERM_TO_FIELDS_MAP[term].append(field)
        else:
            TERM_TO_FIELDS_MAP[term] = [field]
TERM_TO_FIELDS_MAP

{'blau': ['color'],
 'braun': ['color', 'brand'],
 'gelb': ['color'],
 'grün': ['color'],
 'rot': ['color'],
 'schwarz': ['color'],
 'weiß': ['color'],
 'apple': ['brand'],
 'dior': ['brand'],
 'samsung': ['brand'],
 'sony': ['brand']}

In [17]:
def resolved_standard_terms(search_query):
    result_parts = []
    for term in search_query.split():
        fields = TERM_TO_FIELDS_MAP.get(term)
        if fields is None:
            result_parts.append(term)
        else:
            for field in fields:
                result_parts.append(f'{field}:{term}')
    return ' '.join(result_parts)

resolved_standard_terms('rot kleid unter 100 EUR')

'color:rot kleid unter 100 EUR'
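
The lookup is case sensitive, so a query such as `Rot Kleid` would not be resolved. A possible variant that normalizes terms before the lookup is sketched below. The function name and the reduced sample map are hypothetical and exist only so that the block runs on its own.

```python
# Reduced, illustrative copy of the reverse mapping.
SAMPLE_TERM_TO_FIELDS_MAP = {
    'rot': ['color'],
    'braun': ['color', 'brand'],
    'dior': ['brand'],
}


def resolved_standard_terms_ignoring_case(search_query, term_to_fields_map):
    """Like resolved_standard_terms(), but look up terms in lower case."""
    result_parts = []
    for term in search_query.split():
        normalized_term = term.lower()
        fields = term_to_fields_map.get(normalized_term)
        if fields is None:
            # Unknown terms are passed through unchanged.
            result_parts.append(term)
        else:
            result_parts.extend(f'{field}:{normalized_term}' for field in fields)
    return ' '.join(result_parts)


print(resolved_standard_terms_ignoring_case('Rot Kleid Dior', SAMPLE_TERM_TO_FIELDS_MAP))
# color:rot Kleid brand:dior
```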

When looking for a razor, we might have to search both the color and the brand. This is ok because the color rarely matters for a razor.

In [18]:
resolved_standard_terms('braun rasierer')

'color:braun brand:braun rasierer'
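
Depending on the search platform, listing both field terms next to each other may require a product to match the color and the brand at the same time. If either match should be enough, the ambiguous term could be grouped with `OR`. The sketch below assumes a Lucene-style query syntax, which the notebook does not specify; the function name and the sample map are hypothetical.

```python
# Reduced, illustrative reverse mapping with one ambiguous term.
AMBIGUOUS_SAMPLE_MAP = {
    'braun': ['color', 'brand'],
    'rot': ['color'],
}


def resolved_terms_with_or(search_query, term_to_fields_map):
    """Resolve terms to fields and group ambiguous terms as '(a OR b)'."""
    result_parts = []
    for term in search_query.split():
        fields = term_to_fields_map.get(term, [])
        field_terms = [f'{field}:{term}' for field in fields]
        if not field_terms:
            result_parts.append(term)
        elif len(field_terms) == 1:
            result_parts.append(field_terms[0])
        else:
            # Assumed syntax: parentheses and OR as in Lucene-style queries.
            result_parts.append('(' + ' OR '.join(field_terms) + ')')
    return ' '.join(result_parts)


print(resolved_terms_with_or('braun rasierer', AMBIGUOUS_SAMPLE_MAP))
# (color:braun OR brand:braun) rasierer
```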

To handle terms composed of multiple words, we can use regular expressions to map them to a replacement term:

In [19]:
import re
match = re.match(
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)', 
    'rot kleid unter 100 EUR dior')

match.groups()

('rot kleid ', 'unter ', '100', ' EUR', ' dior')

With this we can build a mapping to replacement terms:

In [20]:
re.sub(
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)', 
    r'\1price_eur:[0 TO \3]\5',
    'rot kleid unter 100 EUR dior'
)

'rot kleid price_eur:[0 TO 100] dior'

We can collect multiple regular expressions and their replacements in a map and build a function to apply all of them:

In [21]:
REGEX_TO_REPLACEMENT_MAP = {
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)': r'\1price_eur:[0 TO \3]\5',
    r'(.*\b)(ab\s+)(\d+)(\s+EUR)(\b.*)': r'\1price_eur:[\3 TO 10000]\5',
}

def resolved_expressions(search_query):
    result = search_query
    for regex, replacement in REGEX_TO_REPLACEMENT_MAP.items():
        result = re.sub(regex, replacement, result)
        # FIXME: We might need some logic to prevent already replaced
        # parts to be replaced again. This is just a proof of concept.
    return result

resolved_expressions('rot kleid unter 100 EUR dior')

'rot kleid price_eur:[0 TO 100] dior'
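
As an alternative sketch, the patterns can match only the price phrase itself instead of the whole query. That way the leading and trailing groups are not needed and every occurrence in the query is replaced. The names `PRICE_PATTERN_TO_REPLACEMENT` and `resolved_price_expressions` are hypothetical, and ignoring case is an added assumption, not something the original patterns do.

```python
import re

# Hypothetical variant of REGEX_TO_REPLACEMENT_MAP with compiled patterns
# that cover just the price phrase.
PRICE_PATTERN_TO_REPLACEMENT = [
    (re.compile(r'\bunter\s+(\d+)\s+EUR\b', re.IGNORECASE), r'price_eur:[0 TO \1]'),
    (re.compile(r'\bab\s+(\d+)\s+EUR\b', re.IGNORECASE), r'price_eur:[\1 TO 10000]'),
]


def resolved_price_expressions(search_query):
    """Replace each price phrase with a range query on price_eur."""
    result = search_query
    for pattern, replacement in PRICE_PATTERN_TO_REPLACEMENT:
        result = pattern.sub(replacement, result)
    return result


print(resolved_price_expressions('rot kleid unter 100 EUR dior'))
# rot kleid price_eur:[0 TO 100] dior
print(resolved_price_expressions('schuhe ab 50 eur'))
# schuhe price_eur:[50 TO 10000]
```

The upper bound of 10000 is carried over from the map above and remains an arbitrary placeholder.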

And now let's combine this:

In [22]:
def semantic_query(search_query):
    return resolved_standard_terms(resolved_expressions(search_query))

semantic_query('rot kleid unter 100 EUR dior')

'color:rot kleid price_eur:[0 TO 100] brand:dior'