## Sentence reformulation

In [48]:
import numpy as np
import nltk
import warnings
from gensim import corpora, similarities
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, strip_multiple_whitespaces, stem_text


warnings.filterwarnings('ignore')
%matplotlib inline

Download the pretrained FastText vectors for English: 
[cc.en.300.vec.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz)

Then download the Yelp! dataset of reviews: 
[Yelp.train.text](https://drive.google.com/file/d/1TAcfL091lKb2LipaUELFteZqJjQu-gMa/view?usp=sharing)

Read the downloaded Yelp! dataset line by line into a NumPy array:

In [202]:
with open('Yelp.train.text', 'r') as f:
    yelp_set = np.array(f.readlines())

Load the pretrained FastText vectors with the gensim library and compute the similarity of two words:

In [226]:
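fname = 'cc.en.300.vec'

# limit=5000 reads only the first 5000 vectors from the file, which keeps loading fast;
# words outside that subset are missing from the vocabulary.
word_vectors = KeyedVectors.load_word2vec_format(fname, limit=5000)
# similarity() returns the cosine similarity of the two word vectors.
print(word_vectors.similarity('king', 'egg'))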

0.10155682
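
The number above is a cosine similarity: the dot product of the two word vectors divided by the product of their lengths. A minimal sketch of that computation follows; the three-dimensional vectors are made-up toy values, not real FastText embeddings, and `cosine_similarity` is a hypothetical helper rather than part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values chosen for illustration only).
king = [1.0, 0.0, 0.0]
queen = [1.0, 1.0, 0.0]
egg = [0.0, 0.0, 2.0]

print(cosine_similarity(king, queen))  # vectors 45 degrees apart
print(cosine_similarity(king, egg))    # orthogonal vectors
```

Vectors pointing in similar directions score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.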


Sentence tokenization. Split the Yelp! texts into separate tokens (words and punctuation marks) on whitespace:

In [230]:
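preprocessors = [
    lambda text: text.lower(),   # Lowercase the text.
    strip_multiple_whitespaces,  # Collapse repeated whitespace.
]

def tokenize_sentence(sentence):
    # preprocess_string applies each filter in order, then splits on whitespace.
    return preprocess_string(sentence, preprocessors)

# Keep a plain list: sentences differ in length, so the token lists are ragged.
tokens = [tokenize_sentence(sentence) for sentence in yelp_set]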

In [208]:
tokens[0]

['i', 'was', 'sadly', 'mistaken', '.']
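
Splitting on whitespace works here because the dataset already separates punctuation from words with a space. The sketch below shows the same idea in plain Python; `simple_tokenize` is a hypothetical stand-in for the gensim pipeline, not a replacement for it.

```python
def simple_tokenize(sentence):
    """Lowercase the text and split it on runs of whitespace."""
    return sentence.lower().split()

# Punctuation already separated by a space becomes its own token.
print(simple_tokenize("I was  sadly mistaken .\n"))

# Punctuation glued to a word stays attached to it.
print(simple_tokenize("I was sadly mistaken."))
```

Text that is not pre-tokenized this way would need a real tokenizer to separate punctuation from words.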

Try part-of-speech tagging with the [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function returns a list of (word, POS_tag) tuples.

In [214]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mariao/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [232]:
print(tokens[0])

def POS_tagging(tokens):
    return nltk.pos_tag(tokens)

print(POS_tagging(tokens[0]))

['i', 'was', 'sadly', 'mistaken', '.']
[('i', 'NN'), ('was', 'VBD'), ('sadly', 'RB'), ('mistaken', 'VBN'), ('.', '.')]


Can you find the word most similar to a given one? Can you write a method that returns a list of (word, similarity) tuples in order of decreasing similarity?

In [225]:
def returnTopN(word, N):
    # Return the N nearest neighbours of `word` as (word, similarity) tuples.
    return word_vectors.most_similar(word, topn=N)

returnTopN('word', 10)

[('phrase', 0.7383495569229126),
 ('words', 0.7039700150489807),
 ('meaning', 0.6170327067375183),
 ('Word', 0.5634478330612183),
 ('term', 0.4986676871776581),
 ('sentence', 0.4960007667541504),
 ('name', 0.48936933279037476),
 ('definition', 0.4658313989639282),
 ('describe', 0.45316362380981445),
 ('letter', 0.44678452610969543)]
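
Conceptually, a top-N query scores every other vocabulary word against the query word and sorts the results. A rough sketch of that idea, assuming cosine similarity as the score; the two-dimensional `toy_vectors` and the helper names `cosine` and `top_n` are hypothetical and only illustrate the ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n(query, vectors, n):
    """Return up to n (word, similarity) tuples, most similar first."""
    scored = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy vocabulary with made-up coordinates.
toy_vectors = {
    'good': [1.0, 0.1],
    'great': [0.9, 0.2],
    'bad': [-1.0, 0.1],
    'table': [0.0, 1.0],
}

print(top_n('good', toy_vectors, 2))
```

A query word that is missing from `vectors` raises `KeyError` in this sketch, which is also why out-of-vocabulary words need handling in the reformulation step.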

Let's do the simplest reformulation task: rewrite a sentence by replacing each adjective with a similar one.

In [233]:
def reformulate_sentence(sentence):
    # Sentence tokenization
    tokenized_sentence = tokenize_sentence(sentence)

    # Part of speech tagging
    POS_tagged_words = POS_tagging(tokenized_sentence)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
        # If the word is an adjective (base, comparative or superlative)...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                # ...replace it with its most similar word.
                # returnTopN yields (word, similarity) tuples; keep only the word.
                reformulated_sentence_words.append(returnTopN(word, 1)[0][0])
            except KeyError:
                # The adjective is not in the loaded vocabulary: keep it unchanged.
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)
        else:
            reformulated_sentence_words.append(word)
    # Join words list in a sentence
    return ' '.join(reformulated_sentence_words)

reformulate_sentence('ee')

'ee'
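
To see the replacement logic without loading the vectors or the tagger, here is a small sketch on hand-tagged tokens. The `NEIGHBOURS` table, the tags in `tagged` and the function `replace_adjectives` are all hypothetical stand-ins for the FastText lookup and the NLTK tagger output.

```python
# Hypothetical nearest-neighbour table standing in for a word-vector lookup.
NEIGHBOURS = {'good': 'great', 'slow': 'sluggish'}
ADJECTIVE_TAGS = {'JJ', 'JJR', 'JJS'}

def replace_adjectives(tagged_tokens, neighbours):
    """Swap each adjective for its neighbour; keep every other token as is."""
    out = []
    for word, tag in tagged_tokens:
        if tag in ADJECTIVE_TAGS:
            # Fall back to the original word when no neighbour is known.
            out.append(neighbours.get(word, word))
        else:
            out.append(word)
    return ' '.join(out)

# Hand-written tags for illustration.
tagged = [('the', 'DT'), ('soup', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('.', '.')]
print(replace_adjectives(tagged, NEIGHBOURS))  # the soup was great .
```

Keeping the original word when no neighbour exists preserves the sentence length, so the output remains a readable sentence.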

## Sentiment analysis

In [4]:
import random

In [5]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mariao/nltk_data...


We use the VADER sentiment classifier from the NLTK library. Its compound score ranges from -1 to 1, where -1 is the most negative, 0 is neutral and 1 is the most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
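
The compound score stays inside [-1, 1] because the summed word valences are squashed by a normalisation step. The sketch below reproduces that step as I understand it from the reference VADER implementation, x / sqrt(x^2 + alpha) with alpha = 15; treat the constant and the name `normalize` as assumptions to check against your installed NLTK version.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, norm))

# Hypothetical raw valence sums for a few sentences.
for raw in (-4.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

Small sums map roughly linearly, while large sums saturate near the ends of the range, so a single extreme sentence cannot dominate an average of compound scores.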

In [6]:
sentiment_analyzer = SentimentIntensityAnalyzer()

Split the VADER lexicon text into a list of lines:

In [93]:
lexicon_list = sentiment_analyzer.lexicon_file.splitlines()

Draw 1000 random sentences from the Yelp! dataset and preprocess them:

In [186]:
preprocessors = [
    lambda text: text.lower(),   # Lowercase the text.
    strip_multiple_whitespaces,  # Collapse repeated whitespace.
    remove_stopwords             # Remove stopwords.
]

# Draw 1000 random indices (with replacement) over the whole dataset.
sentences = yelp_set[np.random.randint(0, len(yelp_set), 1000)]

processed_sentences = np.array([' '.join(preprocess_string(sentence, preprocessors)) for sentence in sentences])
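
Note that `np.random.randint` draws indices with replacement, so the same sentence can appear more than once in the sample. If distinct sentences are preferred, one option is the standard-library `random.sample`, sketched below; `lines` is a hypothetical stand-in for `yelp_set`, and the fixed seed is only there to make the draw repeatable.

```python
import random

random.seed(0)  # fixed seed so the draw is repeatable

# Hypothetical stand-in for the lines of the Yelp! file.
lines = ['sentence {}'.format(i) for i in range(5000)]

# Draw 1000 distinct lines (sampling without replacement).
sample = random.sample(lines, 1000)

print(len(sample), len(set(sample)))
```

`random.sample` raises `ValueError` when asked for more items than the population holds, which is a useful guard on small datasets.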

Compute the average sentiment of the 1000-sentence set with the VADER sentiment classifier:

In [187]:
def getAverage(sentences):
    scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in sentences])
    return scores.mean()

getAverage(processed_sentences)

-0.0671777

Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data using POS-tagged reformulations?

In [188]:
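# Let's look at words that appear in positive sentences.
# Per-sentence compound scores, used to filter sentences by sentiment.
scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in processed_sentences])
positive_sentences = processed_sentences[scores > 0.5]
print(positive_sentences)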

['took _num_ minutes acknowledged .' "n't manager 10:30 pm ."
 'brought pile beverage napkins .' 'server nice , issue .'
 'completely ignored got left .' '_num_ .' '_num_ .'
 'bit confusing layout , tastefully .' ', better skipping .' 'terrible .'
 'service terrible .' 'door signage ( open hours ) .'
 'wind washing hands getting salad stuff .' 'lobster bisque soup good .'
 'wake going lose business .' "'s tiny long bar like old irish pubs ."
 'pretty hard mess salad .'
 'poor management skills rude behavior ruined holiday .']


In [189]:
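# Take the first token of each positive sentence and score it on its own.
positive_words = np.array([preprocess_string(s, preprocessors)[0] for s in positive_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in positive_words])
# Remember the single most positive token, then keep only tokens with a positive score.
max_word = positive_words[all_scores.argmax()]
positive_words = positive_words[all_scores > 0]
positive_words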

array(['pretty'], dtype='<U10')

In [190]:
# Words that appear in negative sentences.
negative_sentences = processed_sentences[scores < -0.2]
print(negative_sentences[:10])

['disappointed .' 'mushroom cheese omelette , cheese lacking .'
 'sat _num_ minutes server came table .'
 'minimal meat ton shredded lettuce .' "`` nah '' _num_ ." 'hell !'
 'bf got lost pittsburgh _num_ min .' "n't word ."
 "service n't bad , better ." ", maybe n't care ."]


In [191]:
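# Take the first token of each negative sentence and score it on its own.
negative_words = np.array([preprocess_string(s, preprocessors)[0] for s in negative_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in negative_words])
# Remember the single most negative token, then keep only tokens with a negative score.
min_word = negative_words[all_scores.argmin()]
negative_words = negative_words[all_scores < 0]
negative_words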

array(['disappointed', 'hell', 'awful', 'lame', 'crap', 'awful',
       'disappointed', 'disappointed', 'unfortunately', 'bad',
       'disappointing', 'appalling', 'misses', 'worst', 'ridiculous',
       'disgusting', 'lame', 'avoid', 'hell', 'noisy', 'low', 'wrong',
       'apathetic', 'worst', 'uncomfortable', 'unprofessional', 'rude',
       'poor', 'unprofessional', 'unfortunately', 'poor', 'terrible',
       'awful', 'complaints', 'bad', 'complaints', 'worst',
       'unfortunately', 'awful', 'disgusting', 'avoid', 'worst',
       'smothered', 'frustrated', 'worst', 'hate', 'worst', 'stopped',
       'meh', 'bad'], dtype='<U14')

In [192]:
def replace(sentence, old_words, new_word):
    words = preprocess_string(sentence, preprocessors)
    new_words = []
    for w in words:
        new_words.append(new_word if w in old_words else w)
    return ' '.join(new_words)
            
    
# Replacing positive words with the most negative word makes the sentences more negative.
new_processed_sentences = np.array([replace(sentence, positive_words, min_word) for sentence in processed_sentences])
print(getAverage(new_processed_sentences))

-0.0812027


In [193]:
# Replacing negative words with the most positive word makes the sentences more positive.
new_processed_sentences = np.array([replace(sentence, negative_words, max_word) for sentence in processed_sentences])
print(getAverage(new_processed_sentences))

0.1258009
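
The whole experiment (score, replace, score again) can be checked on a toy example before running it on real data. In the sketch below, `TOY_LEXICON` and its valences are made up, and `toy_score` simply sums them; it is a deliberately simplified stand-in for VADER, not its actual scoring rule.

```python
# Hypothetical word valences for illustration only.
TOY_LEXICON = {'good': 1.9, 'pretty': 2.2, 'terrible': -2.1, 'worst': -3.1}

def toy_score(sentence):
    """Sum the valences of the words; unknown words count as 0."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in sentence.split())

def average_score(sentences):
    return sum(toy_score(s) for s in sentences) / len(sentences)

def replace_words(sentence, old_words, new_word):
    """Replace every token found in old_words with new_word."""
    return ' '.join(new_word if w in old_words else w for w in sentence.split())

sentences = ['service terrible .', 'soup good .', 'worst pizza .']
negative = {w for w, v in TOY_LEXICON.items() if v < 0}
most_positive = max(TOY_LEXICON, key=TOY_LEXICON.get)

before = average_score(sentences)
after = average_score([replace_words(s, negative, most_positive) for s in sentences])
print(before, after)
```

Comparing `before` and `after` on a handful of hand-checked sentences makes it easier to trust the direction of the shift seen on the 1000-sentence sample.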
