# Sentiment detection with spaCy

## Preparations
Various settings:

In [1]:
tellers_db_path = '/tmp/tellers.db'
lexicon_csv_path = 'lexicon_de.csv'

Import various modules we are going to need:

In [2]:
# Python standard library
import csv
import re
import sqlite3
from contextlib import closing
from enum import Enum

# SpaCy
import spacy
from spacy.tokens import Token

## Read feedbacks from database

Connect to the database with feedback texts:

In [3]:
connection = sqlite3.connect(tellers_db_path)

Class to store a feedback and the related question:

In [4]:
class Feedback():
    def __init__(self, question_id: int, question: str, feedback_id: int, text: str):
        self.question_id = question_id
        self.question = question
        self.feedback_id = feedback_id
        self.text = text

Read the feedback documents, where each feedback can consist of multiple sentences:

In [5]:
select_feedback_sql = """
    select
        que.question_id,
        que.text,
        fdb.feedback_id,
        fdb.text
    from
        feedback as fdb
        join source as src on
            src.source_id = fdb.source_id
        join question as que on
            que.question_id = fdb.question_id
    where 1 = 1
        and fdb.feedback_time >= '2017-10-01'
        and src.name = 'tellers'
"""

with closing(connection.cursor()) as cursor:
    feedbacks = [
        Feedback(question_id, question,feedback_id, feedback)
        for question_id, question, feedback_id, feedback in cursor.execute(select_feedback_sql)
    ]
print('found {} feedback documents'.format(len(feedbacks)))    

found 298 feedback documents
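The cutoff date and the source name are hard-coded in the SQL text above. As a minimal sketch, they could be passed as named parameters instead. The example below runs against an in-memory database with a simplified, made-up schema and rows (only the columns the query touches), so it illustrates the mechanism rather than the real tellers database.

```python
import sqlite3
from contextlib import closing

# Simplified stand-in schema and rows; the real tellers database may differ.
connection = sqlite3.connect(':memory:')
connection.executescript("""
    create table source (source_id integer primary key, name text);
    create table question (question_id integer primary key, text text);
    create table feedback (
        feedback_id integer primary key, source_id integer,
        question_id integer, feedback_time text, text text);
    insert into source values (1, 'tellers');
    insert into question values (10, 'Was hat Ihnen gefallen?');
    insert into feedback values (100, 1, 10, '2017-09-30', 'zu alt');
    insert into feedback values (101, 1, 10, '2017-10-02', 'Sehr lecker!');
""")

select_feedback_sql = """
    select que.question_id, que.text, fdb.feedback_id, fdb.text
    from feedback as fdb
        join source as src on src.source_id = fdb.source_id
        join question as que on que.question_id = fdb.question_id
    where fdb.feedback_time >= :since
        and src.name = :source_name
"""

# Named placeholders keep the values out of the SQL text.
with closing(connection.cursor()) as cursor:
    rows = cursor.execute(
        select_feedback_sql, {'since': '2017-10-01', 'source_name': 'tellers'}
    ).fetchall()
print(rows)
```

Only the feedback dated on or after the cutoff is returned; the older row is filtered out by the `where` clause.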


## Cleanup texts for further processing
Replace certain abbreviations that would confuse spaCy when detecting sentence boundaries:

In [6]:
ABBREVIATION_TO_EXPANDED_MAP = {
    'ca': 'circa',  # "approximately"
    'ev': 'eventuell',  # "possibly"
    'max': 'maximal',
    'vlt': 'vielleicht',  # "maybe"
}

replace_count = 0
for feedback in feedbacks:
    for abbreviation, expanded in ABBREVIATION_TO_EXPANDED_MAP.items():
        # TODO: Use compiled regex.
        previous_text = feedback.text
        feedback.text = re.sub(
            r'\b' + abbreviation + r'\.', 
            expanded + ' ',
            feedback.text,
            flags=re.IGNORECASE)
        if feedback.text != previous_text:
            replace_count += 1
print('replaced %d abbreviations' % replace_count)

replaced 1 abbreviations
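The `TODO` in the cell above mentions a compiled regex. One possible way to do that, sketched here under the assumption that a single alternation over all abbreviations is acceptable, is to compile the pattern once and look up the expansion in a replacement function. The names `abbreviation_regex` and `expanded_abbreviations` are made up for this example.

```python
import re

# Same structure as ABBREVIATION_TO_EXPANDED_MAP, repeated to keep the sketch self-contained.
abbreviation_to_expanded = {
    'ca': 'circa',
    'ev': 'eventuell',
    'max': 'maximal',
    'vlt': 'vielleicht',
}

# One alternation for all abbreviations, compiled a single time.
abbreviation_regex = re.compile(
    r'\b(' + '|'.join(re.escape(a) for a in abbreviation_to_expanded) + r')\.',
    flags=re.IGNORECASE)

def expanded_abbreviations(text):
    # Like the loop above, replace the abbreviation and its dot with the expansion and a blank.
    return abbreviation_regex.sub(
        lambda match: abbreviation_to_expanded[match.group(1).lower()] + ' ', text)

print(expanded_abbreviations('Wartezeit ca. 20 Minuten'))
```

As in the original loop, the replacement appends a blank, so a blank that already followed the dot results in two consecutive blanks.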


Build a map of emojis (both western and eastern) to a distinct text form:

In [7]:
_EMOJI_PREFIX = 'emoji__'
_EMOJI_TO_NAME_MAP = {
    # Western
    ':)': 'slight_smile',
    ':-)': 'slight_smile',
    '=)': 'slight_smile',
    ':(': 'slight_frown',
    ':-(': 'slight_frown',
    ':D': 'smile',
    ':-D': 'smile',
    ':P': 'stuck_out_tongue',
    ':-P': 'stuck_out_tongue',
    ';)': 'wink',
    ';-)': 'wink',
    # Eastern
    '^^': 'slight_smile',
    '^_^': 'slight_smile',
}
_EMOJI_TO_TEXT_MAP = {
    emoji: ' ' + _EMOJI_PREFIX + name + ' '
    for emoji, name in _EMOJI_TO_NAME_MAP.items()
}

Replace emojis by text:

In [8]:
for feedback in feedbacks:
    for emoji, emoji_text in _EMOJI_TO_TEXT_MAP.items():
        feedback.text = feedback.text.replace(emoji, emoji_text)
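
The blanks around each placeholder matter when an emoji is glued to a word. A small sketch with a subset of the map (the sample sentence is made up) shows the effect:

```python
# Subset of the emoji map from above, used for illustration only.
emoji_to_text = {
    ':-)': ' emoji__slight_smile ',
    ':)': ' emoji__slight_smile ',
    ':(': ' emoji__slight_frown ',
}

def emojis_replaced(text):
    for emoji, emoji_text in emoji_to_text.items():
        text = text.replace(emoji, emoji_text)
    return text

# The emoji directly follows the word, yet it ends up as a separate word.
words = emojis_replaced('Sehr lecker:)').split()
print(words)
```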

Replace some Austrian slang terms with standard German:

In [9]:
_AUSTRIAN_TO_GERMAN_SYNONYM_MAP = {
    'eh': 'ohnehin',
    'nix': 'nichts',
    'ois': 'alles',
}

replace_count = 0
for feedback in feedbacks:
    for austrian_word, german_word in _AUSTRIAN_TO_GERMAN_SYNONYM_MAP.items():
        # TODO: Use compiled regex.
        previous_text = feedback.text
        feedback.text = re.sub(
            r'\b' + austrian_word + r'\b', 
            german_word + ' ',
            feedback.text,
            flags=re.IGNORECASE)
        if feedback.text != previous_text:
            replace_count += 1
print('replaced %d austrian slang terms' % replace_count)

replaced 5 austrian slang terms


## Split into sentences
Define a class to hold a single opinion:

In [10]:
class Opinion():
    def __init__(self, feedback: Feedback, sentence_nr: int, tokens):
        self.feedback = feedback
        self.sentence_nr = sentence_nr
        self.tokens = list(tokens)
        self.topic = None
        self.rating = None

Split the feedbacks into sentences and assign each sentence to an opinion:

In [11]:
nlp = spacy.load('de')

In [12]:
opinions = []
for feedback in feedbacks:
    document = nlp(feedback.text)
    sentence_nr = 1
    for sentence in document.sents:
        opinion = Opinion(feedback, sentence_nr, sentence)
        opinions.append(opinion)
        sentence_nr += 1
print('found', len(opinions), 'opinions')

found 481 opinions


## Topics

There are several ways to find appropriate topics. We are going to use:

* ambience: decoration, space, light, temperature, ...
* food and beverages: eating, drinking, taste, menu, selection
* hygiene: toilet, smell, ...
* service: waiting and reaction times, courtesy, competence, availability, ...
* value: size of portions, price, ...

This can be represented as a Python `Enum`:

In [13]:
class Topic(Enum):
    UNKNOWN = -99
    AMBIENCE = 1
    FOOD = 2
    HYGIENE = 3
    SERVICE = 4
    VALUE = 5

## Rating

There are several ways to represent a rating:

* Use "positive" and "negative"
* Same as above but with more distinct values, e.g. 1 to 5 stars
* Use a float between e.g. 0 and 1.0

We are going to use a system with 6 discrete values (plus a marker for unknown ratings):

In [14]:
class Rating(Enum):
    UNKNOWN = -99
    VERY_BAD = -4
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 4

## Lexicon

In a lexicon-based sentiment analysis, a lexicon collects words and assigns them to topics and ratings. It also includes additional information a parser can use to combine multiple words.

Different word types affect how an opinion can be extracted. A simple system that works with simple sentences distinguishes:

* noun: Schnitzel, Bier (beer), Geruch (smell)
* adjective: toll (great), enttäuschend (disappointing), wohlschmeckend (tasty)
* verb: warten (to wait), stinken (to stink)
* modifier (intensifies or diminishes an adjective): eher (somewhat), besonders (especially), zu (too), viel zu (much too)
* negator: nicht (not), kein (no)

Negators can also be prefixes like "un" and "in", e.g. brauchbar (suitable) - unbrauchbar (unfit) or kompetent (competent) - inkompetent (incompetent).

Be aware that negators do not simply change the sign of a rating, for example:
* "schlecht" (bad) - BAD
* "nicht schlecht" (not bad) - SOMEWHAT_GOOD
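
One way to express this asymmetry is sketched below. The rule that a negation flips the direction and weakens the strength to "somewhat" is an assumption generalized from the "nicht schlecht" example; the function name `negated` is made up.

```python
from enum import Enum

class Rating(Enum):
    UNKNOWN = -99
    VERY_BAD = -4
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 4

def negated(rating):
    """Assumed rule: flip the direction and weaken the strength to 'somewhat'."""
    if rating is Rating.UNKNOWN:
        return rating
    return Rating.SOMEWHAT_GOOD if rating.value < 0 else Rating.SOMEWHAT_BAD

print(negated(Rating.BAD).name)       # "nicht schlecht"
print(negated(Rating.VERY_GOOD).name)
```

A plain sign flip would instead turn `BAD` into `GOOD`, which overstates what "nicht schlecht" expresses.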

In [15]:
class WordType(Enum):
    UNKNOWN = -99
    NOUN = 1
    ADJECTIVE = 2
    VERB = 3
    MODIFIER = 4
    NEGATOR = 5

Words can also be regular expressions to reduce the size. For example, various types of wine ("Rotwein", "Weißwein", "Portwein") can be reduced to the regular expression `r'.*wein'`.
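
A quick check of that pattern against a few words (the last two are made-up counterexamples) shows one subtlety: `match()` only anchors at the start of the string, so a longer compound such as "Rotweinglas" is accepted as well, while `fullmatch()` requires the whole word to fit.

```python
import re

wine_regex = re.compile(r'.*wein')

results = {}
for word in ['Rotwein', 'Weißwein', 'Portwein', 'Weinkarte', 'Rotweinglas']:
    # Record whether the pattern matches a prefix and whether it matches the whole word.
    results[word] = (bool(wine_regex.match(word)), bool(wine_regex.fullmatch(word)))
    print(word, results[word])
```

Whether the looser `match()` behavior is wanted depends on the lexicon; appending `$` to a pattern is one way to make it strict.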

The lexicon entry combines all this information for each word:

In [16]:
class LexiconEntry():
    _IS_REGEX_REGEX = re.compile(r'.*[.+*\[$^\\]')
    
    def __init__(self, lemma: str, word_type: WordType, topic: Topic, rating: Rating):
        assert rating is not None if word_type is WordType.MODIFIER else True, 'modifier must have rating: ' + lemma
        self.lemma = lemma
        self._lower_lemma = lemma.lower()
        self.word_type = word_type
        self.topic = topic
        self.rating = rating
        self.is_regex = LexiconEntry._IS_REGEX_REGEX.match(self.lemma) is not None       
        self._regex = re.compile(lemma) if self.is_regex else None
    
    def matching(self, token: Token) -> float:
        result = 0.0
        if self.is_regex:
            if self._regex.match(token.text):
                result = 0.6
            elif self._regex.match(token.lemma_):
                result = 0.5
        else:
            if token.text == self.lemma:
                result = 1.0
            elif token.text.lower() == self._lower_lemma:
                result = 0.9
            elif token.lemma_ == self.lemma:
                result = 0.8
            elif token.lemma_.lower() == self._lower_lemma:
                result = 0.7
        return result
    
    def __str__(self) -> str:
        result = 'LexiconEntry(%s, word_type=%s' % (self.lemma, self.word_type.name)
        if self.topic is not None:
            result += ', topic=%s' % self.topic.name
        if self.rating is not None:
            result += ', rating=%s' % self.rating.name
        if self.is_regex:
            result += ', is_regex=%s' % self.is_regex
        result += ')'
        return result

    def __repr__(self) -> str:
        return self.__str__()

For more information on this approach, see Liu (2015, p. 59ff).

## Storing and reading the lexicon

A simple way to store the lexicon is a CSV file with columns for:

* lemma or pattern
* word type
* topic
* rating
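
The actual `lexicon_de.csv` is not shown in this notebook. As an illustration only, a few hypothetical rows in that layout can be parsed from a string with the same cleanup steps the reading code uses (strip blanks, pad to four columns, skip empty lines and `#` comments):

```python
import csv
import io

# Hypothetical sample rows; the real lexicon file may look different.
sample_csv = """\
# lemma, word type, topic, rating
Bratwurst,noun,food,
lecker,adjective,food,good
besonders,modifier,,good
nicht,negator,,
.*wein,noun,food,
"""

rows = []
for row in csv.reader(io.StringIO(sample_csv), delimiter=','):
    row = [item.strip() for item in row]
    row += 4 * ['']  # Ensure we have at least 4 strings
    lemma, word_type_text, topic_text, rating_text = row[:4]
    if lemma != '' and not lemma.startswith('#'):
        rows.append((lemma, word_type_text, topic_text, rating_text))

for row in rows:
    print(row)
```

Empty topic and rating columns stay empty strings here; the reading code maps them to `None` through `dict.get()`.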

In [17]:
_RATING_NAME_TO_VALUE_MAP = {some.name.lower(): some for some in Rating}
_TOPIC_NAME_TO_VALUE_MAP = {some.name.lower(): some for some in Topic}
_WORD_TYPE_NAME_TO_VALUE_MAP = {some.name.lower(): some for some in WordType}

lexicon = []
with open(lexicon_csv_path, encoding='utf-8', newline='') as lexicon_file:
    lexicon_reader = csv.reader(lexicon_file, delimiter=',')
    for row in lexicon_reader:
        row = [item.strip() for item in row]
        row += 4 * ['']  # Ensure we have at least 4 strings
        lemma, word_type_text, topic_text, rating_text = row[:4]
        if lemma != '' and not lemma.startswith('#'):
            try:
                # Map certain columns to enums.
                word_type = _WORD_TYPE_NAME_TO_VALUE_MAP[word_type_text]
                topic = _TOPIC_NAME_TO_VALUE_MAP.get(topic_text)
                rating = _RATING_NAME_TO_VALUE_MAP.get(rating_text)
            except KeyError as error:
                raise csv.Error(
                    '%s:%d: cannot map value: %s' % (
                        lexicon_csv_path, lexicon_reader.line_num, error))
            lexicon_entry = LexiconEntry(lemma, word_type, topic, rating)
            lexicon.append(lexicon_entry)
print('found %d lexicon entries' % len(lexicon))

found 470 lexicon entries


## Find base word (lemma) for token

This can again be done with the help of spaCy:

In [18]:
for opinion in opinions[:3]:
    print()
    print(opinion.feedback.question)
    print(opinion.tokens)
    for token in opinion.tokens:
        print('%s -> %s' % (token.text, token.lemma_))


Welche Änderungen müssten an unserem Restaurant vorgenommen werden, damit Sie eine bessere Bewertung abgeben?
[Ambiente, eventuell, da, es, zurzeit, nicht, sehr, asiatisch, wirkt]
Ambiente -> Ambiente
eventuell -> eventuell
da -> da
es -> ich
zurzeit -> zurzeit
nicht -> nicht
sehr -> sehr
asiatisch -> asiatisch
wirkt -> wirken

Welche Änderungen müssten an unserem Restaurant vorgenommen werden, damit Sie eine bessere Bewertung abgeben?
[Keine, alles, ist, tiptop, in, Ordnung, .,  ]
Keine -> Keine
alles -> alle
ist -> sein
tiptop -> tiptop
in -> in
Ordnung -> Ordnung
. -> .
  ->  

Welche Änderungen müssten an unserem Restaurant vorgenommen werden, damit Sie eine bessere Bewertung abgeben?
[emoji__slight_smile]
emoji__slight_smile -> emoji__slight_smile


## Match lemma with lexicon

In [19]:
def lexicon_entry_for(token: Token) -> LexiconEntry:
    # Return the lexicon entry with the highest matching score, or None if nothing matches.
    result = None
    best_matching = 0.0
    # TODO: Improve performance by not having to scan whole lexicon for each token.
    for lexicon_entry in lexicon:
        matching = lexicon_entry.matching(token)
        if matching > best_matching:
            result = lexicon_entry
            best_matching = matching
    return result

# Take the first token of the first sentence (not the sentence span itself).
token = next(nlp('lecker').sents)[0]
lecker_entry = LexiconEntry('lecker', WordType.ADJECTIVE, Topic.FOOD, Rating.GOOD)
print(lecker_entry.matching(token))
print(lecker_entry)
print(token)
print(token.text)
print(token.lemma_)
print(lexicon_entry_for(token))

for token in opinions[0].tokens:
    matching_lexicon_entry = lexicon_entry_for(token)
    if matching_lexicon_entry is None:
        print('%s -> %s -> %s' % (token.text, token.lemma_, matching_lexicon_entry))
    else:
        print('%s -> %s -> %s, %s' % (
            token.text, token.lemma_, matching_lexicon_entry.topic, matching_lexicon_entry.rating))
        


1.0
LexiconEntry(lecker, word_type=ADJECTIVE, topic=FOOD, rating=GOOD)
lecker
lecker
lecker
LexiconEntry(lecker, word_type=ADJECTIVE, topic=FOOD, rating=GOOD)
Ambiente -> Ambiente -> Topic.AMBIENCE, None
eventuell -> eventuell -> None
da -> da -> None
es -> ich -> None
zurzeit -> zurzeit -> None
nicht -> nicht -> None, None
sehr -> sehr -> None
asiatisch -> asiatisch -> None
wirkt -> wirken -> None
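
The `TODO` about scanning the whole lexicon for each token could be addressed with an index. The sketch below is one possible approach, not the notebook's implementation: exact lemmas go into a dictionary keyed by lowercase text, and only the regex entries are scanned. It uses plain tuples as stand-ins for `LexiconEntry` and collapses the graded matching scores into a simple priority order (exact before regex), which is a simplification.

```python
import re

# Stand-in entries as (lemma_or_pattern, payload) pairs; the notebook uses LexiconEntry objects.
entries = [('Bratwurst', 'FOOD'), ('lecker', 'FOOD/GOOD'), ('.*wein', 'FOOD')]

# Same heuristic as LexiconEntry._IS_REGEX_REGEX to tell patterns from plain lemmas.
is_regex_regex = re.compile(r'.*[.+*\[$^\\]')

exact_index = {}
regex_entries = []
for lemma, payload in entries:
    if is_regex_regex.match(lemma):
        regex_entries.append((re.compile(lemma), payload))
    else:
        exact_index.setdefault(lemma.lower(), payload)

def payload_for(text, lemma):
    # Dictionary lookups first: token text, then token lemma.
    for candidate in (text, lemma):
        payload = exact_index.get(candidate.lower())
        if payload is not None:
            return payload
    # Fall back to scanning the (usually few) regex entries.
    for regex, payload in regex_entries:
        if regex.match(text) or regex.match(lemma):
            return payload
    return None

print(payload_for('Bratwürste', 'Bratwurst'))
print(payload_for('Rotwein', 'Rotwein'))
print(payload_for('da', 'da'))
```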


With this, we can reduce opinion sentences to lists of topics and ratings:

In [20]:
opinion_essence = []
for token in nlp('Die Bratwurst schmeckt sehr lecker!'):
    matching_lexicon_entry = lexicon_entry_for(token)
    if matching_lexicon_entry is not None:
        opinion_essence.append(matching_lexicon_entry)
if len(opinion_essence) >= 1:
    print(opinion_essence)

[LexiconEntry(Bratwurst, word_type=NOUN, topic=FOOD), LexiconEntry(lecker, word_type=ADJECTIVE, topic=FOOD, rating=GOOD)]


## Add spaCy extensions for topic, rating, etc

SpaCy has an extension API to store (among other things) additional attributes on documents, spans and tokens (see Montani (2017)). To add attributes for topic and rating we can use:

In [21]:
Token.set_extension('topic', default=None)
Token.set_extension('rating', default=None)
Token.set_extension('is_negator', default=False)
Token.set_extension('is_intensifier', default=False)
Token.set_extension('is_dimisher', default=False)

We can now set and get these attributes using for example:

In [27]:
token = next(nlp('Bratwurst').sents)[0]
print(token.lemma_)
token._.topic = Topic.FOOD
print(token._.topic)

Bratwurst
Topic.FOOD


To simplify debugging, the following function shows the token and its relevant attributes:

In [31]:
def debugged_token(token: Token) -> str:
    result = 'Token(%s, lemma=%s' % (token.text, token.lemma_)
    if token._.topic is not None:
        result += ', topic=' + token._.topic.name
    if token._.rating is not None:
        result += ', rating=' + token._.rating.name
    if token._.is_dimisher:
        result += ', dimisher'
    if token._.is_intensifier:
        result += ', intensifier'
    if token._.is_negator:
        result += ', negator'
    result += ')'
    return result

print(debugged_token(token))

Token(Bratwurst, lemma=Bratwurst, topic=FOOD)


## Extend spaCy pipeline to set new attributes

Now we can extend the pipeline with a step that assigns topic and rating to each token by matching it with the lexicon:

In [23]:
def opinion_matcher(doc):
    for sentence in doc.sents:
        for token in sentence:
            lexicon_entry = lexicon_entry_for(token)
            if lexicon_entry is not None:
                if lexicon_entry.word_type is WordType.NEGATOR:
                    token._.is_negator = True
                elif lexicon_entry.word_type is WordType.MODIFIER:
                    if lexicon_entry.rating.value < 0:
                        token._.is_dimisher = True
                    else:
                        token._.is_intensifier = True
                else:
                    # Separate branch to not assign a token rating for modifiers.
                    token._.rating = lexicon_entry.rating
                token._.topic = lexicon_entry.topic
    return doc

if nlp.has_pipe('opinion_matcher'):
    nlp.remove_pipe('opinion_matcher')
nlp.add_pipe(opinion_matcher)

Now we can extract the essence for the opinions in a more integrated way:

In [37]:
def is_essential(token: Token) -> bool:
    return token._.topic is not None \
        or token._.rating is not None \
        or token._.is_dimisher or token._.is_intensifier or token._.is_negator
        
def essential_tokens(tokens):
    return [token for token in tokens if is_essential(token)]

For example:

In [40]:
doc = nlp('Die Bratwurst schmeckt nicht besonders lecker.')
# Literal English: The Bratwurst tastes not especially delicious.

opinion_essence = essential_tokens(doc)
for token in opinion_essence:
    print(debugged_token(token))

Token(Bratwurst, lemma=Bratwurst, topic=FOOD)
Token(nicht, lemma=nicht, negator)
Token(besonders, lemma=besonders, intensifier)
Token(lecker, lemma=lecker, topic=FOOD, rating=GOOD)


## Combine tokens to ratings

Now that we have all the tokens relevant to extracting an opinion, we have to resolve negators and modifiers.

For example:
* 'nicht schlecht' (not bad) means ``somewhat_good``
* 'besonders gut' (especially good) means ``very_good``
* 'nicht besonders gut' (not especially good) means ``somewhat_bad``

The following function scans for adjectives and applies modifiers to them while resetting the rating on the modifier tokens. First we need a few utility functions to diminish and intensify the numeric values of ``Rating``:

In [55]:
def signum(value):
    if value > 0:
        return 1
    elif value < 0:
        return -1
    else:
        return 0

def dimished(rating_value):
    if abs(rating_value) > 1:
        return rating_value - signum(rating_value)
    else:
        return rating_value

def intensified(rating_value):
    if abs(rating_value) > 1:
        return rating_value + signum(rating_value)
    else:
        return rating_value

print(dimished(-2))  # Rating.BAD
print(dimished(-1))  # Rating.SOMEWHAT_BAD

-1
-1
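
Note that the arithmetic helpers above can produce values that are not members of ``Rating``; for example, ``intensified(2)`` yields 3. An alternative, sketched here as an assumption rather than the notebook's method, is to step along the list of valid magnitudes and keep the sign. The names `MAGNITUDES`, `stepped`, `diminished_on_scale` and `intensified_on_scale` are made up.

```python
# Valid magnitudes of Rating values (UNKNOWN is left out).
MAGNITUDES = [1, 2, 4]

def stepped(rating_value, step):
    """Move the magnitude by `step` positions on MAGNITUDES, keeping the sign."""
    sign = 1 if rating_value > 0 else -1
    index = MAGNITUDES.index(abs(rating_value)) + step
    # Clamp so that the weakest and strongest magnitudes stay where they are.
    index = max(0, min(index, len(MAGNITUDES) - 1))
    return sign * MAGNITUDES[index]

def diminished_on_scale(rating_value):
    return stepped(rating_value, -1)

def intensified_on_scale(rating_value):
    return stepped(rating_value, +1)

print(intensified_on_scale(2))   # GOOD
print(diminished_on_scale(-4))   # VERY_BAD
print(diminished_on_scale(1))    # SOMEWHAT_GOOD
```

With this variant every result can be converted back with ``Rating(value)``, at the cost of raising a `ValueError` for inputs that are not on the scale.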


In [56]:
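def is_rating_modifier(token):
    return token._.is_dimisher or token._.is_intensifier or token._.is_negator

def combine_ratings(tokens):
    # NOTE: The original cell was left unfinished. The modifier handling below,
    # in particular the negator rule, is an assumption derived from the examples above.
    combined_rating_values = []
    token_index = 0
    while token_index < len(tokens):
        token = tokens[token_index]
        if not is_rating_modifier(token) and (token._.rating is not None):
            combined_rating_value = token._.rating.value
            # Walk backwards over the modifiers directly preceding the rated token.
            modifier_token_index = token_index - 1
            while (modifier_token_index >= 0) and is_rating_modifier(tokens[modifier_token_index]):
                modifier_token = tokens[modifier_token_index]
                if modifier_token._.is_dimisher:
                    combined_rating_value = dimished(combined_rating_value)
                elif modifier_token._.is_intensifier:
                    combined_rating_value = intensified(combined_rating_value)
                else:
                    # Assumed negator rule: flip the sign and weaken to "somewhat".
                    combined_rating_value = -signum(combined_rating_value)
                modifier_token_index -= 1
            # The value may not be a member of Rating (for example 3), so keep it numeric.
            combined_rating_values.append((token, combined_rating_value))
        token_index += 1
    return combined_rating_values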

# References

* Liu (2015) - Bing Liu. Sentiment Analysis. Cambridge, MA: Cambridge University Press, 2015.
* Montani (2017) - Ines Montani. Introducing custom pipelines and extensions for spaCy v2.0. https://explosion.ai/blog/spacy-v2-pipelines-extensions.