## Sentence reformulation

In [13]:
import numpy as np
import nltk
import warnings
from gensim import corpora, similarities
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, strip_multiple_whitespaces, \
                                         strip_punctuation, stem_text

warnings.filterwarnings('ignore')
%matplotlib inline

Download the pretrained FastText vectors for English: 
[cc.en.300.vec.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz)

Also download the Yelp dataset of reviews: 
[Yelp.train.text](https://drive.google.com/file/d/1TAcfL091lKb2LipaUELFteZqJjQu-gMa/view?usp=sharing)

Download and unpack the pretrained FastText vectors so they can be loaded with the gensim library:

In [None]:
! wget -nc https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
! gunzip -k cc.en.300.vec.gz

Load the vectors with gensim and compute the similarity of two words:

In [3]:
fname = 'cc.en.300.vec'

word_vectors = KeyedVectors.load_word2vec_format(fname)
print(word_vectors.similarity('king', 'egg'))

0.10155682
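The value returned by `similarity` is the cosine similarity of the two word vectors. A minimal sketch of that computation in plain Python, using made-up 3-dimensional vectors instead of the real 300-dimensional FastText ones (the helper name `cosine_similarity` and the toy vectors are illustrative assumptions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
king = [1.0, 0.0, 1.0]
queen = [1.0, 0.0, 0.5]
egg = [0.0, 1.0, 0.0]

print(cosine_similarity(king, queen))  # close to 1: vectors point in a similar direction
print(cosine_similarity(king, egg))    # 0.0: the vectors are orthogonal
```

A score near 1 means the vectors point in nearly the same direction, while a score near 0, like the one for 'king' and 'egg' above, means the words are almost unrelated in the embedding space.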


Sentence tokenization. Split the Yelp texts into separate tokens (words and punctuation marks) on whitespace.

In [52]:
with open('Yelp.train.text', 'r') as f:
    yelp_set = np.array(f.readlines())

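preprocessors = [
    lambda text: text.lower(),   # Lowercase the text.
    strip_multiple_whitespaces,  # Collapse repeated whitespace.
]

def tokenize_sentence(sentence):
    # preprocess_string applies each filter in order, then splits on whitespace.
    return preprocess_string(sentence, preprocessors)

# Sentences differ in length, so keep the token lists in a plain list
# rather than a ragged NumPy array.
tokens = [tokenize_sentence(sentence) for sentence in yelp_set]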

In [24]:
tokens[0]

['i', 'wa', 'sadli', 'mistaken']
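With only lowercasing and whitespace normalisation as filters, the tokenization step amounts to a lowercase-and-split. A rough plain-Python equivalent, assuming the punctuation marks in the input are already separated by spaces (the function name `tokenize_by_space` is hypothetical):

```python
def tokenize_by_space(sentence):
    # Lowercase, then split on any run of whitespace. Punctuation marks that
    # are already space-separated come out as their own tokens.
    return sentence.lower().split()

print(tokenize_by_space("I was sadly   mistaken .\n"))
# ['i', 'was', 'sadly', 'mistaken', '.']
```

Stemmed forms such as 'wa' and 'sadli' would be expected only when a stemming filter such as `stem_text` is also in the filter list.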

Try part-of-speech tagging with the [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function returns a list of (word, POS_tag) tuples.

In [6]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mariao/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
print(tokens[0])

def POS_tagging(tokens):
    return nltk.pos_tag(tokens)

print(POS_tagging(tokens[0]))

['i', 'was', 'sadly', 'mistaken', '.']
[('i', 'NN'), ('was', 'VBD'), ('sadly', 'RB'), ('mistaken', 'VBN'), ('.', '.')]
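The tags follow the Penn Treebank convention, in which adjectives are marked `JJ`, `JJR` (comparative) and `JJS` (superlative). A small sketch of picking adjectives out of a tagged sentence; the tagged list is written by hand for illustration and is not real tagger output:

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def adjectives(tagged_tokens):
    """Return the words whose POS tag marks them as adjectives."""
    return [word for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]

# Hand-written tags for illustration; a real tagger may assign different ones.
tagged = [("the", "DT"), ("food", "NN"), ("was", "VBD"), ("lousy", "JJ"), (".", ".")]
print(adjectives(tagged))  # ['lousy']
```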


Can you find the word most similar to a given one? Can you write a method that returns a list of (word, similarity) tuples in order of decreasing similarity?

In [22]:
most_similar_n = lambda word, topn: word_vectors.most_similar(word, topn=topn)

most_similar_n('bowser', 10)

[('bowsers', 0.7382202744483948),
 ('koopa', 0.47836795449256897),
 ('Fawful', 0.44402459263801575),
 ('koopas', 0.43662241101264954),
 ('Bowser', 0.43195170164108276),
 ('FLUDD', 0.4224753677845001),
 ('mario', 0.4222983717918396),
 ('waluigi', 0.4180498719215393),
 ('Koopa', 0.4134894013404846),
 ('WHOAAAAAA', 0.41194313764572144)]
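To make the idea behind such a lookup concrete, here is a brute-force sketch in plain Python: score every other word by cosine similarity and sort in decreasing order. It uses a tiny invented vocabulary of 2-dimensional vectors, and the names `cosine`, `most_similar` and `toy_vectors` are illustrative, not part of gensim:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=3):
    """Return up to topn (word, similarity) tuples, most similar first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vocabulary, invented for illustration only.
toy_vectors = {
    "good": [1.0, 0.2],
    "great": [0.9, 0.3],
    "fine": [0.7, 0.7],
    "bad": [-1.0, 0.1],
}
print(most_similar("good", toy_vectors, topn=2))
```

On a vocabulary as large as FastText's, a loop like this would be slow, which is one reason to rely on the library's vectorised implementation in practice.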

Let's do the simplest reformulation task: reformulate some sentences by replacing each adjective with a similar one.

In [23]:
def reformulate_sentence(sentence):
    # Sentence tokenization
    tokenized_sentence = tokenize_sentence(sentence)

    # Part of speech tagging
    POS_tagged_words = POS_tagging(tokenized_sentence)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
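        # If the word is an adjective...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                new_word = most_similar_n(word, 1)[0][0]
                reformulated_sentence_words.append(new_word)
            except KeyError:
                # The word is missing from the FastText vocabulary; keep it unchanged.
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)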
        else:
            reformulated_sentence_words.append(word)
    # Join words list in a sentence
    return ' '.join(reformulated_sentence_words)

idx = 2
print(yelp_set[idx])
print(reformulate_sentence(yelp_set[idx]))

minimal meat and a ton of shredded lettuce .

minim meat and a ton of shread lettuc


## Sentiment analysis

In [25]:
import random

In [26]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mariao/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


We use the VADER sentiment classifier from the NLTK library. Its compound score ranges from -1 to 1, where -1 is the most negative, 0 is neutral and 1 is the most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
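The compound score is often turned into a discrete label with a small symmetric threshold around zero. The sketch below uses 0.05, the cut-off conventionally suggested for VADER; both the threshold default and the helper name `label_from_compound` are assumptions for illustration and are not used elsewhere in this notebook:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for score in (0.6, 0.0, -0.3):
    print(score, label_from_compound(score))
```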

In [27]:
sentiment_analyzer = SentimentIntensityAnalyzer()

Read the VADER lexicon file line by line and put the lines into a list

In [28]:
# TODO: It is not clear why we should read it.
lexicon_list = sentiment_analyzer.lexicon_file.splitlines()

Take 1000 random sentences from the Yelp dataset loaded earlier

In [54]:
# Sample 1000 distinct sentences from the whole dataset.
sentences = yelp_set[np.random.choice(len(yelp_set), size=1000, replace=False)]

processed_sentences = np.array([' '.join(preprocess_string(sentence, preprocessors)) for sentence in sentences])
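To draw distinct sentences and get the same sample on every run, one option is sampling without replacement with a fixed seed. A sketch using the standard `random` module on stand-in data (the helper `sample_lines`, the seed value and the toy lines are illustrative assumptions):

```python
import random

def sample_lines(lines, k, seed=0):
    """Pick k distinct items from lines, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(lines, k)

# Stand-in for the lines of the dataset file.
toy_lines = ["sentence {}".format(i) for i in range(5000)]
picked = sample_lines(toy_lines, 1000)
print(len(picked), len(set(picked)))  # 1000 1000
```

Fixing the seed keeps the average sentiment comparable between runs, since the sampled sentences no longer change each time the cell is executed.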

Compute the average sentiment of the 1000-sentence set with the VADER sentiment classifier

In [55]:
get_average = lambda sentences: np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in sentences]).mean()

get_average(processed_sentences)

-0.1151995

Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data with POS-tagged reformulations?

In [56]:
# Let's look at words that appear in positive sentences.
scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in processed_sentences])
positive_sentences = processed_sentences[scores > 0.5]
print(positive_sentences[:10])

['this beer is like champaign and it makes my whole stomach nice and cold .'
 "that 's where the praise ends ."
 'ambiance , like i said , left much to be desired .'
 'i really really want this place to do better .'
 'while sitting there we noticed _num_ other parties walked out as well .'
 "but good lord , they 're just so loud !"
 'thanks for giving me one more reason never to come back .'
 'too many other good choices nearby to give them a second chance - sorry .'
 "but that 's not the worst ."
 "but yes , you guessed it , that 's overpriced too ."]


In [57]:
# Take the first word of each positive sentence and score it on its own.
positive_words = np.array([preprocess_string(s, preprocessors)[0] for s in positive_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in positive_words])
max_word = positive_words[all_scores.argmax()]  # The most positive first word.
positive_words = positive_words[all_scores > 0]  # Keep only words VADER scores as positive.
positive_words

array(['thanks'], dtype='<U8')

In [58]:
# Words that appear in negative sentences.
negative_sentences = processed_sentences[scores < -0.2]
print(negative_sentences[:10])

['they completely ignored us so we got up and left .'
 'the service has also suffered tremendously .'
 'poor service from the door .' 'this time , it was ridiculous .'
 'the service was terrible .'
 'the second was passable , but nothing special .'
 "the food was n't very good at all ." 'the food was lousy .'
 'but hour plus for a chicken sandwich is unacceptable .' 'no apology .']


In [59]:
# Take the first word of each negative sentence and score it on its own.
negative_words = np.array([preprocess_string(s, preprocessors)[0] for s in negative_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in negative_words])
min_word = negative_words[all_scores.argmin()]  # The most negative first word.
negative_words = negative_words[all_scores < 0]  # Keep only words VADER scores as negative.
negative_words

array(['poor', 'no', 'disappointing', 'poor', 'apathetic', 'bad', 'bad',
       'appalling', 'no', 'disgusting', 'terrible', 'worst', 'horrible',
       'unfortunately', 'terrible', 'horrible', 'worst', 'unfortunately',
       'terrible', 'appalling', 'no', 'unfortunately', 'horrible',
       'terrible', 'no', 'worst', 'awful', 'no', 'no', 'avoid', 'stopped',
       'no', 'worst', 'worst', 'terrible', 'lame', 'no', 'terrible',
       'worst', 'horrible', 'sadly', 'wrong', 'no', 'nah', 'bad',
       'terrible', 'unfortunately', 'terrible', 'no', 'worst', 'no', 'no',
       'terrible', 'stopped', 'disgusted', 'unfortunately',
       'unfortunately', 'avoid', 'no', 'disappointing', 'worst', 'no',
       'horrible', 'bad', 'rob', 'unfortunately', 'no', 'no', 'worst',
       'worst', 'no', 'no', 'unfortunately'], dtype='<U13')

In [60]:
def replace(sentence, old_words, new_word):
    words = preprocess_string(sentence, preprocessors)
    new_words = []
    for w in words:
        new_words.append(new_word if w in old_words else w)
    return ' '.join(new_words)
            
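# Replacing positive words with the most negative word should make the sentences more negative.
# Only one positive word ('thanks') was found above, so the average barely moves.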
new_processed_sentences = np.array([replace(sentence, positive_words, min_word) for sentence in processed_sentences])
print(get_average(new_processed_sentences))

-0.11454839999999998


In [61]:
# Replacing negative words with the most positive word makes the sentences more positive.
new_processed_sentences = np.array([replace(sentence, negative_words, max_word) for sentence in processed_sentences])
print(get_average(new_processed_sentences))

0.04678070000000001
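One possible follow-up experiment with POS tags is to restrict the replacement to tokens that are both in the word list and tagged as adjectives, so that words such as 'no' are left alone. A sketch on a hand-tagged sentence; the function `replace_adjectives`, the tags and the replacement word 'great' are illustrative assumptions, not results from the data:

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def replace_adjectives(tagged_tokens, old_words, new_word):
    """Swap a token for new_word only if it is listed AND tagged as an adjective."""
    return " ".join(
        new_word if (word in old_words and tag in ADJECTIVE_TAGS) else word
        for word, tag in tagged_tokens
    )

# Hand-written tags for illustration; a real tagger may assign different ones.
tagged = [("poor", "JJ"), ("service", "NN"), ("from", "IN"),
          ("the", "DT"), ("door", "NN"), (".", ".")]
print(replace_adjectives(tagged, {"poor", "no"}, "great"))
# great service from the door .
```

Comparing the average compound score of the original and the rewritten sentences would then show how much of the sentiment shift comes from adjectives alone.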
