## Sentence reformulation

In [1]:
import nltk
from gensim.models import KeyedVectors

Download the pretrained FastText vectors for English:
[cc.en.300.vec.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz)

Also download the Yelp! dataset of reviews:
[Yelp.train.text](https://drive.google.com/file/d/1TAcfL091lKb2LipaUELFteZqJjQu-gMa/view?usp=sharing)
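
The FastText download is gzip-compressed, while the loading cell in this notebook reads the uncompressed `cc.en.300.vec`. A minimal sketch of decompressing the archive with the standard library; the helper name `gunzip` is hypothetical, and the paths in the example call assume the archive sits in the working directory:

```python
import gzip
import shutil


def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming it in chunks."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (assumes the archive was downloaded to the working directory):
# gunzip("cc.en.300.vec.gz", "cc.en.300.vec")
```

Streaming with `shutil.copyfileobj` avoids holding the whole file in memory, which matters for a vector file of this size.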

Load the downloaded pretrained FastText vectors with the gensim library:

In [2]:
filename = 'cc.en.300.vec'
vectors = KeyedVectors.load_word2vec_format(filename, binary=False)

Compute the similarity of two words using gensim:

In [20]:
# We discussed different words; look at the similarity of 'king' and 'queen', for example. Could you put it into context?
print(vectors.most_similar(positive=['dog'], negative=['wolf'], topn=5))
print(vectors.most_similar(positive=['cat'], negative=['tiger'], topn=5))
print(vectors.similarity('cat', 'tiger'))
print(vectors.similarity('dog', 'wolf'))

#your code here

[('pooches', 0.4657370448112488), ('dogs', 0.4568704068660736), ('doggie', 0.44288891553878784), ('pooch', 0.433940052986145), ('non-dog', 0.4216308891773224)]
[('cats', 0.4473192095756531), ('litter-box', 0.4075545072555542), ('litterboxes', 0.40291157364845276), ('kitty', 0.39684420824050903), ('kitties', 0.39618998765945435)]
0.50369895
0.47312132
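
`vectors.similarity` returns the cosine similarity of the two word vectors. A minimal sketch of that computation in plain Python; the three-dimensional vectors are made-up toy values, not FastText embeddings, and `cosine_similarity` is a hypothetical helper name:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, for illustration only)
cat = [1.0, 2.0, 0.0]
tiger = [2.0, 1.0, 0.0]
print(cosine_similarity(cat, tiger))
```

Because the measure depends only on the angle between the vectors, scaling either vector leaves the result unchanged.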


Sentence tokenization. Split the Yelp! texts into separate tokens (words and punctuation marks). The texts are already space-separated, and the code below tokenizes them with NLTK's `word_tokenize`.

In [32]:
from nltk import word_tokenize
yelp_file = 'Yelp.train.text'
with open(yelp_file, "r") as f:
    lines = f.read().splitlines()

tokens = list(map(word_tokenize, lines))


In [44]:
def tokenize_sentence(sentence):
    return word_tokenize(sentence)

tokens[:3]

[['i', 'was', 'sadly', 'mistaken', '.'],
 ['so',
  'on',
  'to',
  'the',
  'hoagies',
  ',',
  'the',
  'italian',
  'is',
  'general',
  'run',
  'of',
  'the',
  'mill',
  '.'],
 ['minimal', 'meat', 'and', 'a', 'ton', 'of', 'shredded', 'lettuce', '.']]
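
Since the tokens in these lines are already separated by spaces, splitting on whitespace is a simple alternative to `word_tokenize`. A sketch, assuming that formatting holds for the whole file; the helper name `tokenize_by_space` is hypothetical, and the example sentence is rebuilt from the first token list above:

```python
def tokenize_by_space(sentence):
    """Split a pre-tokenized sentence on runs of whitespace."""
    return sentence.split()


print(tokenize_by_space("i was sadly mistaken ."))
```

The two approaches can disagree on text that is not pre-tokenized, for example contractions, which `word_tokenize` splits into separate tokens.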

Try part-of-speech tagging using the [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function returns a list of tuples (word, POS_tag).

In [48]:
def POS_tagging(sentence):
    return nltk.pos_tag(sentence)

POS_tagging(tokens[7])

[('are', 'VBP'), ('you', 'PRP'), ('kidding', 'VBG'), ('me', 'PRP'), ('?', '.')]
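
In the Penn Treebank tagset that `nltk.pos_tag` uses by default, adjectives are tagged `JJ`, `JJR` (comparative) and `JJS` (superlative). A small sketch of picking adjectives out of a tagged sentence; the tagged list is written by hand for illustration rather than produced by the tagger, and `adjectives` is a hypothetical helper:

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def adjectives(tagged_words):
    """Return the words whose POS tag marks them as adjectives."""
    return [word for word, tag in tagged_words if tag in ADJECTIVE_TAGS]


# Hand-written (word, POS_tag) pairs in the same shape pos_tag returns
tagged = [("a", "DT"), ("nice", "JJ"), ("place", "NN"), ("with", "IN"),
          ("the", "DT"), ("best", "JJS"), ("coffee", "NN")]
print(adjectives(tagged))
```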

Can you find the words most similar to a given word? Can you write a method that returns a list of tuples (word, similarity) in order of decreasing similarity?

In [154]:
def most_similar(word, topn=15, neg=None):
    # Avoid a mutable default argument; None means "no negative words"
    return vectors.most_similar(positive=[word], negative=neg or [], topn=topn)

most_similar('pretty', neg=['bad'])

[('fairly', 0.46226567029953003),
 ('pretttty', 0.394969642162323),
 ('farily', 0.3877026438713074),
 ('prettty', 0.37976735830307007),
 ('amazingly', 0.37048882246017456),
 ('prettttty', 0.36940139532089233),
 ('quite', 0.3675154447555542),
 ('ever-so', 0.36676985025405884),
 ('remarkably', 0.3537715673446655),
 ('failry', 0.34980258345603943),
 ('somewhat', 0.3493301272392273),
 ('impressively', 0.3471192717552185),
 ('retty', 0.34354445338249207),
 ('incredibly', 0.3428688049316406),
 ('oretty', 0.34186863899230957)]
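
With a `negative` list, `most_similar` ranks words by cosine similarity to a query vector built from the positive word vectors minus the negative ones, each normalized to unit length first. A rough sketch of that idea on a toy vocabulary; the two-dimensional vectors are invented, `toy_most_similar` is a hypothetical name, and this illustrates the principle rather than reproducing gensim's code:

```python
import math

# Toy embeddings (invented values, for illustration only)
toy_vectors = {
    "pretty": [0.9, 0.4],
    "fairly": [0.8, 0.1],
    "bad": [0.1, 0.9],
    "awful": [0.0, 1.0],
}


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_most_similar(positive, negative=(), topn=5):
    """Rank vocabulary words by cosine similarity to (positive - negative)."""
    dim = len(next(iter(toy_vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(toy_vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(toy_vectors[word]))]
    query = unit(query)
    # Like gensim, leave the input words out of the results
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in toy_vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar(["pretty"], negative=["bad"]))
```

Subtracting the vector for 'bad' pushes the query away from negative words, which is why the notebook passes it as a bias when looking for replacements.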

Let's do the simplest reformulation task. We just want to reformulate some sentences by replacing each adjective with similar words.

In [211]:
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer() 

def reformulate_sentence(sentence, bias=['bad'], adjectives_num=2):
    # Sentence tokenization
    tokenized_sentence = tokenize_sentence(sentence)
#     print(tokenized_sentence)
    # Part of speech tagging
    POS_tagged_words = POS_tagging(tokenized_sentence)
#     print(POS_tagged_words)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
        # If the word is adjective...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                # ...look for the word most similar to the given and replace it
                most_similar_words = most_similar(word, neg=bias)
                chosen_words = []
                for similar_word, similarity in most_similar_words:
                    if nltk.edit_distance(similar_word.lower(), word.lower()) < 4:
#                         print('skipping', similar_word)
                        continue
                    if '.' in similar_word:
                        continue
                    if not similar_word.isalpha():
                        continue
                    chosen_words.append(similar_word)
                    if len(chosen_words) == adjectives_num:
                        break
                if not chosen_words:
                    chosen_words = [word]
                reformulated_sentence_words.extend(chosen_words)
                    
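            except KeyError:
                # Out-of-vocabulary word: report it and keep the original word
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)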
                
        else:
            reformulated_sentence_words.append(word)
    # Join words list in a sentence
    return ' '.join(reformulated_sentence_words)
print(lines[634])
reformulate_sentence(lines[634])

outdoor seating in good weather .


'outdoor seating in excellent great weather .'

In [201]:
reformulate_sentence(lines[163])

'very enjoyable environment which i will begin to frequent .'
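
The replacement loop in `reformulate_sentence` skips candidates whose spelling is too close to the original word (edit distance below 4) and candidates containing non-alphabetic characters, which removes variants such as 'prettty'. A self-contained sketch of that filter, with a plain-Python Levenshtein distance standing in for `nltk.edit_distance`; the candidate list is taken from the `most_similar('pretty', neg=['bad'])` output earlier in the notebook, and `filter_candidates` is a hypothetical helper:

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def filter_candidates(word, candidates, min_distance=4, limit=2):
    """Keep alphabetic candidates at least min_distance edits from word."""
    chosen = []
    for candidate in candidates:
        if levenshtein(candidate.lower(), word.lower()) < min_distance:
            continue
        if not candidate.isalpha():
            continue
        chosen.append(candidate)
        if len(chosen) == limit:
            break
    return chosen


candidates = ["fairly", "pretttty", "farily", "prettty", "amazingly", "ever-so"]
print(filter_candidates("pretty", candidates))
```

An edit-distance threshold only catches misspellings of the original word itself; misspellings of other words can still pass, so a dictionary check might be a useful further refinement.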

## Sentiment analysis

In [111]:
import random

In [112]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/avshalommanevich/nltk_data...


The VADER sentiment classifier from the NLTK library scores a text with a compound value ranging from -1 to 1, where -1 is the most negative, 0 is neutral, and 1 is the most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [113]:
sentiment_analyzer = SentimentIntensityAnalyzer()
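
The analyzer's `polarity_scores` method returns, among other values, a `compound` score in the -1 to 1 range, and that score is often turned into a discrete label. A minimal sketch using the ±0.05 cutoffs suggested in the VADER project's documentation; the function name `label_sentiment` is hypothetical, and the cutoffs are a common convention rather than something this notebook relies on:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for score in (0.6, 0.0, -0.3):
    print(score, label_sentiment(score))
```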

The dataset text file was already read line by line into the list `lines`. Shuffle it and take 1000 random sentences:

In [164]:
random.shuffle(lines)
num_of_sentences = 1000
sample = lines[:num_of_sentences]

Compute the average sentiment of the 1000-sentence sample with the VADER sentiment classifier:

In [220]:

def sentiment(sentence):
    return sentiment_analyzer.polarity_scores(sentence)['compound']

def average_sentiment(sentences):
    return sum(map(sentiment, sentences)) / len(sentences)


avg_sentiment = average_sentiment(sample)
avg_sentiment

0.2775767000000003

Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data with POS-tagged reformulations?

In [215]:
avg_sentiment_reformed = average_sentiment(list(map(reformulate_sentence, sample)))


There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...
There is no _num_ word in FastText dictionary! ...


In [219]:
avg_sentiment_reformed

0.2993779999999999
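
Comparing only the two averages hides how individual sentences moved. One possible follow-up experiment is to score each sentence before and after reformulation and look at the per-sentence change. A sketch, assuming the two score lists are aligned by sentence; the numbers are invented placeholders, not results from the Yelp! sample, and `compare_scores` is a hypothetical helper:

```python
def compare_scores(before, after):
    """Summarize per-sentence sentiment change between two aligned lists."""
    if len(before) != len(after):
        raise ValueError("score lists must have the same length")
    deltas = [a - b for b, a in zip(before, after)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "share_improved": sum(d > 0 for d in deltas) / len(deltas),
        "share_worsened": sum(d < 0 for d in deltas) / len(deltas),
    }


# Invented scores for four sentences (placeholders only)
before = [0.0, 0.5, -0.25, 0.25]
after = [0.25, 0.5, -0.5, 0.75]
print(compare_scores(before, after))
```

In the notebook, the inputs could be built as `list(map(sentiment, sample))` and `[sentiment(reformulate_sentence(s)) for s in sample]`.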