## Sentence reformulation

In [0]:
!pip install fasttext
!pip install scipy



In [0]:
import nltk
import fasttext
import scipy
from gensim.models import KeyedVectors
from tqdm import tqdm

Download FastText pretrained vectors for English: 
[cc.en.300.vec.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz)

And download the Yelp! dataset, which consists of reviews: 
[Yelp.train.text](https://drive.google.com/file/d/1TAcfL091lKb2LipaUELFteZqJjQu-gMa/view?usp=sharing)

Download both files, then load the pretrained FastText vectors with the gensim library:

In [0]:
yelp_download_link = "https://docs.google.com/uc?export=download&id=1TAcfL091lKb2LipaUELFteZqJjQu-gMa"
fasttext_download_link = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz"
fname = 'cc.en.300.vec'

In [0]:
!wget -O Yelp.train.text --no-check-certificate "$yelp_download_link"

In [0]:
!wget -nc $fasttext_download_link

In [0]:
!gunzip -k cc.en.300.vec.gz

In [0]:

# Load the text-format vectors; tqdm only wraps iterables, so it is not used around this call
word_vectors = KeyedVectors.load_word2vec_format(fname)

In [0]:
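import fasttext.util
# Downloads cc.en.300.bin (skipped if the file already exists)
fasttext.util.download_model('en', if_exists='ignore')
# load_model expects the binary model, not the cc.en.300.vec text file
ft = fasttext.load_model('cc.en.300.bin')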

Compute the similarity of two words using gensim or the FastText model

In [0]:
# We discussed different words; look at the similarity of 'king' and 'queen', for example. Could you put it into context?

# word_vectors = KeyedVectors.load_word2vec_format(fname)
# print(word_vectors.similarity('king', 'queen'))

#from scipy import spatial as sp
#king_v = ft.get_word_vector("king")
#queen_v = ft.get_word_vector("queen")
#sp.distance.cosine(king_v, queen_v)

ft.get_nearest_neighbors('king', k=10)
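
The commented-out variants above report two different quantities: a cosine similarity (gensim) and a cosine distance (scipy), where distance = 1 - similarity. A minimal sketch of that relation, using hypothetical 3-dimensional vectors in place of the real 300-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical stand-ins for the real word vectors (assumption, not model output).
king_v = [0.8, 0.6, 0.1]
queen_v = [0.7, 0.7, 0.2]

similarity = cosine_similarity(king_v, queen_v)
distance = 1.0 - similarity  # the quantity scipy.spatial.distance.cosine returns
print(f"similarity={similarity:.3f}, distance={distance:.3f}")
```

A similarity close to 1 (distance close to 0) means the two vectors point in nearly the same direction.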

Sentence tokenization. Split Yelp! texts into separate tokens (words and punctuation marks) by space

In [0]:
#your code here
from nltk.tokenize import WordPunctTokenizer
data_path = "Yelp.train.text"

with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split("\n")

wpTokenizer = WordPunctTokenizer()
tokens = list(map(wpTokenizer.tokenize, lines))
tokens[0]
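
Note that `WordPunctTokenizer` splits on more than spaces: it separates runs of word characters from runs of punctuation. A minimal standard-library sketch of the same regular expression, applied to a hypothetical review sentence:

```python
import re

# Pattern used by WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    return TOKEN_PATTERN.findall(text)

sample_text = "The food wasn't bad, but service was slow!"  # hypothetical review
sample_tokens = word_punct_tokenize(sample_text)
print(sample_tokens)
```

Contractions such as "wasn't" are split into three tokens, which can affect the POS tags assigned later.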

Try part-of-speech tagging using the [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function returns a list of (word, POS_tag) tuples.

In [0]:
#your code here
nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(tokens[0])

Can you find the word most similar to a given one? Can you write a method that returns a list of tuples (word, similarity) in order of decreasing similarity?

In [0]:
#your code here
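# get_nearest_neighbors takes k (not topn) and returns (similarity, word) pairs,
# already sorted by decreasing similarity; swap them to (word, similarity)
most_similar_n = lambda word, topn: [(w, sim) for sim, w in ft.get_nearest_neighbors(word, k=topn)]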

most_similar_n('dog', 10)
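
To see what a nearest-neighbor query does conceptually, here is a brute-force sketch over a tiny vocabulary. The 2-dimensional vectors and the `rank_neighbors` helper are hypothetical illustrations, not part of FastText or gensim:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real FastText vectors have 300 dimensions.
toy_vectors = {
    "dog": [1.0, 0.1],
    "puppy": [0.9, 0.2],
    "cat": [0.6, 0.6],
    "car": [0.0, 1.0],
}

def rank_neighbors(word, vectors, topn):
    """Return up to topn (word, similarity) tuples, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

toy_neighbors = rank_neighbors("dog", toy_vectors, topn=2)
print(toy_neighbors)
```

A linear scan like this is fine for a handful of words; libraries use vectorized or indexed search for vocabularies with millions of entries.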

Let's do the simplest reformulation task: rewrite a sentence by replacing each adjective with a similar one.

In [0]:
def reformulate_sentence(sentence):
    # Sentence tokenization
    tokenized_sentence = wpTokenizer.tokenize(sentence)

    # Part of speech tagging
    POS_tagged_words = nltk.pos_tag(tokenized_sentence)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
        # If the word is an adjective...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                # ...look for the word most similar to the given one and replace it
                new_word = most_similar_n(word, 1)[0][0]
                reformulated_sentence_words.append(new_word)
            except Exception:
                # Keep the original word so the sentence does not lose it
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)
        else:
            reformulated_sentence_words.append(word)
    # Join the list of words into a sentence
    return ' '.join(reformulated_sentence_words)

idx = 2
print(lines[idx])
print(reformulate_sentence(lines[idx]))
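
The replacement step can be tried in isolation, without loading any model. In this sketch the tagged input and the `neighbors` lookup table are hypothetical stand-ins for the tagger output and the nearest-neighbor query:

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def replace_adjectives(tagged_words, lookup):
    """Swap each adjective for lookup[word] when available, else keep it."""
    result = []
    for word, pos_tag in tagged_words:
        if pos_tag in ADJECTIVE_TAGS:
            result.append(lookup.get(word, word))
        else:
            result.append(word)
    return " ".join(result)

# Hypothetical tagger output and hypothetical nearest-neighbor table.
tagged = [("the", "DT"), ("good", "JJ"), ("pizza", "NN"), ("was", "VBD"), ("cheap", "JJ")]
neighbors = {"good": "great"}

reformulated = replace_adjectives(tagged, neighbors)
print(reformulated)  # prints: the great pizza was cheap
```

Adjectives without an entry in the table ("cheap" here) are kept unchanged rather than dropped.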

## Sentiment analysis

In [0]:
import random
import numpy as np

In [0]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, strip_multiple_whitespaces, \
                                         strip_punctuation, stem_text

The VADER sentiment classifier from the NLTK library. Its compound score ranges from -1 to 1, where -1 is most negative, 0 is neutral, and 1 is most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [0]:
sentiment_analyzer = SentimentIntensityAnalyzer()

Read the dataset text file line by line and put the lines into a list

In [0]:
#your code here
# Lines of the VADER lexicon; splitlines() handles both '\n' and '\r\n' line endings
lexicon_list = sentiment_analyzer.lexicon_file.splitlines()

In [0]:
#your code here

Read the Yelp dataset from the text file and get 1000 random sentences

In [0]:
#your code here
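# lines is a plain Python list, so sample from it directly (without replacement), skipping empty lines
sentences = random.sample([line for line in lines if line.strip()], 1000)

# Assumed choice of filters: stem_text is left out because stemmed words may no longer match the VADER lexicon
preprocessors = [strip_punctuation, strip_multiple_whitespaces, remove_stopwords]
processed_sentences = np.array([' '.join(preprocess_string(sentence, preprocessors)) for sentence in sentences])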
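
Because the sample is random, the averages below change from run to run. A minimal sketch of reproducible sampling with a fixed seed; the `reviews` list is a hypothetical stand-in for the lines read from the file, and the seed value is arbitrary:

```python
import random

# Hypothetical stand-in for the list of review lines.
reviews = [f"review {i}" for i in range(5000)]

rng = random.Random(42)                    # fixed seed makes the draw repeatable
review_sample = rng.sample(reviews, 1000)  # draws without replacement

print(len(review_sample), len(set(review_sample)))
```

Using a dedicated `random.Random` instance keeps the seed from affecting other code that relies on the global random state.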

Compute the average sentiment of the 1000-sentence set with the VADER sentiment classifier

In [0]:
#your code here
get_average = lambda sentences: np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in sentences]).mean()

get_average(processed_sentences)

Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data with POS-tagged reformulations?

In [0]:
#your code here
scores = np.array([sentiment_analyzer.polarity_scores(s)['compound'] for s in processed_sentences])
positive_sentences = processed_sentences[scores > 0.5]
print(positive_sentences[:10])

In [0]:
negative_sentences = processed_sentences[scores < -0.2]
print(negative_sentences[:10])
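
One way to frame the experiment is to measure how much a reformulation shifts the average score. The sketch below takes the reformulation and scoring functions as parameters; `sentiment_shift` and the toy stand-ins are hypothetical, chosen so the example runs without any model or lexicon:

```python
from statistics import mean

def sentiment_shift(sentences, reformulate, score):
    """Return (mean before, mean after, difference) for a reformulation."""
    before = mean(score(s) for s in sentences)
    after = mean(score(reformulate(s)) for s in sentences)
    return before, after, after - before

# Hypothetical word scores and reformulation rule, for illustration only.
toy_scores = {"good": 0.4, "great": 0.6, "bad": -0.5}

def toy_score(sentence):
    return sum(toy_scores.get(word, 0.0) for word in sentence.split())

def toy_reformulate(sentence):
    return sentence.replace("good", "great")

before, after, delta = sentiment_shift(
    ["good food", "bad service", "good price"], toy_reformulate, toy_score
)
print(before, after, delta)
```

In the notebook, the same function could presumably be called with `reformulate_sentence` and a scorer such as `lambda s: sentiment_analyzer.polarity_scores(s)['compound']` to compare averages before and after reformulation.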