## Sentence reformulation

In [48]:
import numpy as np
import nltk
import warnings
from gensim import corpora, similarities
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, strip_multiple_whitespaces, stem_text


warnings.filterwarnings('ignore')
%matplotlib inline

Download the pretrained FastText vectors for English:
[cc.en.300.vec.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz)

And download the Yelp! dataset of reviews:
[Yelp.train.text](https://drive.google.com/file/d/1TAcfL091lKb2LipaUELFteZqJjQu-gMa/view?usp=sharing)

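Load the downloaded pretrained FastText vectors with the gensim library:

In [1]:
# `fname` is not defined in this notebook; KeyedVectors.load expects a file saved in gensim's native format.
word_vectors = KeyedVectors.load(fname, mmap='r')

Define the preprocessing steps applied to each sentence:

In [51]:
preprocessors = [
    lambda word: word.lower(),   # Lowercase the word.
    strip_multiple_whitespaces,  # Remove repeating whitespaces.
    remove_stopwords             # Remove stopwords.
]

Compute the similarity of two words using gensim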
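
Once the vectors are loaded, gensim exposes the comparison as `word_vectors.similarity(w1, w2)`, which returns the cosine similarity of the two word vectors. The sketch below reproduces that computation on hand-made toy vectors so that it runs without the full model; the `toy_vectors` dictionary and its values are invented for illustration and are not FastText output.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
print(sim_close, sim_far)
```

In this toy setup the pair ("good", "great") scores higher than ("good", "table") because their vectors point in nearly the same direction.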

Sentence tokenization. Split the Yelp! texts into separate tokens (words and punctuation marks) on spaces.

In [5]:
#your code here
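
One possible way to fill in this step, as a minimal sketch: it assumes the Yelp! lines are already space-separated, with punctuation marks as separate tokens, so a plain whitespace split is enough. The name `tokenize_sentence` is chosen to match the helper called later in `reformulate_sentence`.

```python
def tokenize_sentence(sentence):
    """Split a sentence into tokens on runs of whitespace."""
    # str.split() with no argument also drops leading/trailing whitespace,
    # including the trailing newline left by readlines().
    return sentence.split()


tokens = tokenize_sentence("the food was great , but service was slow .\n")
print(tokens)
```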

Try part-of-speech tagging using the [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function should return a list of tuples (word, POS_tag).

In [6]:
#your code here
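
A minimal sketch of this step, assuming `nltk.pos_tag` is used as the tagger. It needs NLTK's tagger model to be downloaded first, and the resource name depends on the NLTK version. The `tagger` parameter is a hypothetical hook added here so the function can be tried with a stand-in tagger; the name `POS_tagging` matches the helper called later in `reformulate_sentence`.

```python
def POS_tagging(tokens, tagger=None):
    """Return a list of (word, POS_tag) tuples for a list of tokens.

    `tagger` is a hypothetical hook for this sketch: any callable that maps
    a token list to (word, tag) pairs. By default NLTK's tagger is used.
    """
    if tagger is None:
        import nltk  # imported lazily; only needed for the default tagger
        tagger = nltk.pos_tag  # requires the NLTK tagger model to be installed
    return [(word, tag) for word, tag in tagger(tokens)]
```

With the default tagger, adjectives come back with the Penn Treebank tags `JJ`, `JJR`, or `JJS`, which are the tags checked in the reformulation step.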

Can you find the word most similar to a given word? Can you write a method that returns a list of tuples (word, similarity) in order of decreasing similarity?

In [7]:
#your code here
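
With gensim, `word_vectors.most_similar(word, topn=n)` already returns such a list of (word, similarity) tuples. To show what such a ranking involves, here is a minimal sketch on toy data: the `toy_vectors` dictionary and the helper names are invented for illustration, and a real run would use the loaded FastText vectors instead.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "nice": [0.7, 0.3, 0.3],
    "table": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to `topn` (word, similarity) tuples, most similar first."""
    target = vectors[word]  # raises KeyError for out-of-vocabulary words
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


neighbours = most_similar("good", toy_vectors)
print(neighbours)
```

The first element of the returned list is the single most similar word, so `most_similar(word, vectors, topn=1)[0][0]` gives the replacement candidate.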

Let's do the simplest reformulation task: reformulate some sentences by replacing each adjective with a similar one.

In [31]:
def reformulate_sentence(sentence):
    # Sentence tokenization
    tokenized_sentence = tokenize_sentence(sentence)

    # Part of speech tagging
    POS_tagged_words = POS_tagging(tokenized_sentence)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
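        # If the word is an adjective...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                # ...look for the most similar word in the FastText vectors and use it instead
                similar_word = word_vectors.most_similar(word, topn=1)[0][0]
                reformulated_sentence_words.append(similar_word)
            except KeyError:
                # Out-of-vocabulary adjective: report it and keep the original word
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)
        else:
            reformulated_sentence_words.append(word)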
    # Join words list in a sentence
    return ' '.join(reformulated_sentence_words)

## Sentiment analysis

In [4]:
import random

In [5]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mariao/nltk_data...


We use the VADER sentiment classifier from the NLTK library. Its compound score ranges from -1 to 1, where -1 is the most negative, 0 is neutral, and 1 is the most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [6]:
sentiment_analyzer = SentimentIntensityAnalyzer()
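
The compound score stays between -1 and 1 because VADER squashes the summed word valences with a normalisation of the form x / sqrt(x² + α). The sketch below illustrates that mapping, assuming the default α = 15 of the reference implementation; the function name `normalize_valence` is made up for this example.

```python
import math


def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return total_valence / math.sqrt(total_valence ** 2 + alpha)


# Larger absolute sums move towards -1 or 1 but never reach them.
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize_valence(total), 4))
```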

Split the VADER lexicon file into a list of lines

In [93]:
lexicon_list = sentiment_analyzer.lexicon_file.splitlines()

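Read the Yelp! dataset from the text file and draw 1000 random sentences

In [179]:
with open('Yelp.train.text', 'r') as f:
    yelp_set = np.array(f.readlines())

# Sample indices from the whole dataset (with replacement), not only from the first 1000 lines.
sentences = yelp_set[np.random.randint(0, len(yelp_set), (1000, 1))]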

processed_sentences = np.array([' '.join(preprocess_string(sentence[0], preprocessors)) for sentence in sentences])

Compute the average sentiment of the 1000-sentence set with the VADER sentiment classifier

In [180]:
def getAverage(sentences):
    scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in sentences])
    return scores.mean()

getAverage(processed_sentences)

-0.0720283

Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data using POS-tagged reformulations?

In [181]:
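# Let's look at words that appear in positive sentences.
# Per-sentence compound scores (getAverage keeps its scores local, so compute them here).
scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in processed_sentences])
positive_sentences = processed_sentences[scores > 0.5]
print(positive_sentences)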

['service ok slow .' "n't matter time ." 'rude staff doesnt know .'
 'came away experience leave .' 'service professional !'
 'inside maybe worse , trash dirty .'
 'better italian restaurants pittsburgh .' 'service awful .'
 "'d comedians _num_ stars , food zero stars ." 'eat , maybe try .'
 "husband 's salad , italian chopped , better ."
 'disappointing easter dinner .' ', year crowd smaller smaller .'
 'quickly ate left movie .' 'absolutely problems .'
 'like white cheddar mash potatoes .' "n't hungry actually tastebuds ."
 'staff rude .']


In [172]:
positive_words = np.array([preprocess_string(s, preprocessors)[0] for s in positive_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in positive_words])
max_word = positive_words[all_scores.argmax()]
positive_words = positive_words[all_scores > 0]
positive_words

array(['good', 'party'], dtype='<U9')

In [173]:
# Words that appear in negative sentences.
negative_sentences = processed_sentences[scores < -0.5]
print(negative_sentences)

['eat .' 'important ... food terrible .' ", maybe n't care ."
 'time _num_ hours .' 'want chinese lee , china palace .'
 "n't cash inside menu ." "n't try stay experience bad ."
 "n't know corporate keeps place open ."
 'sat bar female bartender miserable .'
 "place worst dental offices 've walked ." 'zero flavor dry hell .'
 'took _num_ minutes mustard mayo .' ', suggest staying far , far away !'
 "service n't better ." 'lettuce drowning vinegar .'
 'wait _num_ days seen .' 'spend money !'
 'salad arrived , look like picture .' 'ordered past disappointed .'
 '_num_ got bar louie 8:30 .' 'place near waitresses .' 'confused ?'
 'believe unprofessional store !' '_num_ minutes brought martini .'
 'food cold bland , kids food cold .' ', certainly eat .' 'pinch .'
 'went barnes noble waterfront _num_ books list .'
 '_num_ minutes brought martini .' '_num_ staff members yelled .'
 '_num_ .' "food , , n't good ." 'eat .'
 'offered $ _num_ shirt shirts lost .' 'terrible .' 'ordinary .'
 'needl

In [174]:
negative_words = np.array([preprocess_string(s, preprocessors)[0] for s in negative_sentences]).flatten()
all_scores = np.array([sentiment_analyzer.polarity_scores(w)['compound'] for w in negative_words])
min_word = negative_words[all_scores.argmin()]
negative_words = negative_words[all_scores < 0]
negative_words

array(['confused', 'terrible', 'bad', 'regret', 'rude', 'dirty',
       'ridiculous', 'leave', 'lazy'], dtype='<U10')

In [183]:
def replace(sentence, old_words, new_word):
    words = preprocess_string(sentence, preprocessors)
    new_words = []
    for w in words:
        new_words.append(new_word if w in old_words else w)
    return ' '.join(new_words)
            
    
# If we replace positive words with the most negative word (min_word), the sentences become more negative.
new_processed_sentences = np.array([replace(sentence, positive_words, min_word) for sentence in processed_sentences])
print(getAverage(new_processed_sentences))

-0.0925618


In [184]:
# If we replace negative words with the most positive word (max_word), the sentences become more positive.
new_processed_sentences = np.array([replace(sentence, negative_words, max_word) for sentence in processed_sentences])
print(getAverage(new_processed_sentences))

0.0018853000000000008
