<a href="https://colab.research.google.com/github/janchorowski/dl_uwr/blob/summer2020/Assignment1/Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

## Assignment text
1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of at most 20000 most frequent words.

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$

5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{\#negs prec. }w}E^{\text{\#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

6. **[2pb]** Propose, implement, and evaluate an extension to the above model.


In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
import re, string
import scipy.optimize as sopt
from collections import Counter,OrderedDict

In [2]:
df = pd.read_csv("train.tsv", sep="\t")

In [3]:
df_copy = pd.DataFrame(columns=['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'])

In [4]:
df = df.sort_values(by = ['SentenceId', 'PhraseId'], ascending = (True, True))

In [5]:
last_sent_id = 0
counter = 0
for index, row in df.iterrows():
    if last_sent_id!=row['SentenceId']:
        counter+=1
        df_copy = df_copy.append(row, ignore_index = True)
        last_sent_id = row['SentenceId']

In [6]:
df_copy.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,64,2,"This quiet , introspective and entertaining in...",4
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1
3,117,4,A positively thrilling combination of ethnogra...,3
4,157,5,Aggressive self-glorification and a manipulati...,1


In [7]:
train_df = df_copy.iloc[:7000]
test_df = df_copy[7000:]

In [8]:
sentiment = np.zeros(7000)
for i, row in train_df.iterrows():
    sentiment[i] = row['Sentiment']/4

In [9]:
train_df.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
6995,130088,7007,"Snoots will no doubt rally to its cause , trot...",1
6996,130131,7008,It 's better suited for the history or biograp...,1
6997,130161,7009,Buries an interesting storyline,2
6998,130181,7010,This one is a few bits funnier than Malle 's d...,3
6999,130214,7011,The film has several strong performances .,3


In [10]:
words = {}
for i, row in train_df.iterrows():
    #pattern = re.compile('([\w]|)')
    phrase = row['Phrase']
    phrase.translate(str.maketrans('', '', string.punctuation))
    word = ""
    for s in phrase:
        if s == ' ':
            if word == "":
                continue
            if word not in words:
                words[word]=0
            words[word]+=1
            word = ""
        else:
            word += s.lower()

In [11]:
words = sorted(words.items(), key=lambda x: x[1])
words = words[-2000:]
words = dict(words)

In [12]:
vocab = {}

In [13]:
counter = 0
for key in words:
    vocab[key] = counter
    counter+=1

In [14]:
vocab

{'gritty': 0,
 'vincent': 1,
 'honestly': 2,
 'bloody': 3,
 'chabrol': 4,
 'gently': 5,
 'beneath': 6,
 'shakespeare': 7,
 'tragedies': 8,
 'common': 9,
 'screenwriters': 10,
 'technical': 11,
 'resonant': 12,
 'affection': 13,
 'victim': 14,
 'stop': 15,
 'returns': 16,
 'innocence': 17,
 'succeed': 18,
 'element': 19,
 'potent': 20,
 'revelatory': 21,
 'martha': 22,
 'refreshingly': 23,
 'dvd': 24,
 'open': 25,
 'clumsy': 26,
 'legged': 27,
 'freaks': 28,
 'romp': 29,
 'bigger': 30,
 'explosions': 31,
 'feature-length': 32,
 'uplifting': 33,
 'graphic': 34,
 'smug': 35,
 'superior': 36,
 'heavy-handed': 37,
 'smartly': 38,
 'eccentric': 39,
 'week': 40,
 'recycled': 41,
 'teenage': 42,
 'parable': 43,
 'keeping': 44,
 'public': 45,
 'computer': 46,
 'increasingly': 47,
 'woo': 48,
 'pair': 49,
 'photographed': 50,
 'staged': 51,
 'good-natured': 52,
 'incoherent': 53,
 'mean-spirited': 54,
 'biting': 55,
 'ticket': 56,
 'clarity': 57,
 'logic': 58,
 'false': 59,
 'hey': 60,
 'stand-u

In [15]:
sentence_encoding = np.zeros((7000, 2000))

In [16]:
for i, row in train_df.iterrows():
    phrase = row['Phrase']
    phrase.translate(str.maketrans('', '', string.punctuation))
    word = ""
    for s in phrase:
        if s == ' ':
            if word == "":
                continue
            if word in vocab:
                sentence_encoding[i][vocab[word]]+=1
            word = ""
        else:
            word += s.lower()

In [17]:
sentence_encoding.shape

(7000, 2000)

In [18]:
def sigmoid(X):
    return 1/(1 + np.exp(-X))
def logreg_loss(Theta, X, Y):
    #
    # Write a logistic regression cost suitable for use with fmin_l_bfgs
    #

    #reshape Theta
    ThetaR = Theta.reshape(X.shape[0],1)
    #print(1/(1+ np.exp(-(ThetaR*(X)))))
    #print(X)
    #print(ThetaR*(X))
    nll = -np.sum(Y.dot(np.log2(sigmoid((Theta.T).dot(X)) + 1e-100)) + (1-Y).dot(np.log2(1 -  sigmoid((Theta.T).dot(X))+ 1e-100)))
    grad = X.dot((sigmoid((Theta.T).dot(X)) - Y).T)
    #print(grad.T[0])
    #reshape grad into the shape of Theta, for fmin_l_bfsgb to work
    return nll, grad.reshape(Theta.shape)

Theta0 = np.random.normal(size = 2000)

#
# Call a solver
#
ThetaOpt = sopt.fmin_l_bfgs_b(lambda Theta: logreg_loss(Theta, sentence_encoding.T, sentiment), np.array(Theta0))[0]

In [19]:
ThetaOpt

array([ 0.37863122,  0.90223436, -0.48271491, ..., -0.01799274,
        0.02643673, -0.01255412])

In [20]:
test_df

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
7000,130218,7012,I have a new favorite musical -- and I 'm not ...,3
7001,130230,7013,This movie plays like an extended dialogue exe...,0
7002,130241,7014,Ninety minutes of Viva Castro !,2
7003,130246,7015,An indispensable peek at the art and the agony...,3
7004,130259,7016,"Judging by those standards , ` Scratch ' is a ...",3
...,...,...,...,...
8524,155985,8540,... either you 're willing to go with this cla...,2
8525,155998,8541,"Despite these annoyances , the capable Claybur...",2
8526,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1
8527,156032,8543,The movie 's downfall is to substitute plot fo...,1


In [21]:
test_sent_encoding = np.zeros((1529, 2000))
for i, row in test_df.iterrows():
    phrase = row['Phrase']
    phrase.translate(str.maketrans('', '', string.punctuation))
    word = ""
    for s in phrase:
        if s == ' ':
            if word == "":
                continue
            if word in vocab:
                test_sent_encoding[i-7000][vocab[word]]+=1
            word = ""
        else:
            word += s.lower()

In [22]:
test_sentiment = np.zeros(1529)
for i, row in test_df.iterrows():
    test_sentiment[i-7000] = row['Sentiment']/4

In [23]:
print(np.mean(np.abs(sigmoid(test_sent_encoding.dot(ThetaOpt))-test_sentiment))) #

0.23140704225813855


In [24]:
sorted_prob = np.argsort(ThetaOpt)

In [25]:
vocab_list = list(vocab.keys())

In [26]:
print("Negative:\n")
for i in range(20):
    print(vocab_list[sorted_prob[i]])
print("\nPositive:\n")
for i in range(20):
    print(vocab_list[sorted_prob[1999-i]])

Negative:

incoherent
unpleasant
devoid
stealing
poor
lacking
stupid
worst
horrible
disguise
poorly
shoot
lazy
rip-off
hands
eddie
repetitive
inept
car
crush

Positive:

dazzling
assured
delightfully
follow
refreshing
masterpiece
intoxicating
mesmerizing
rewarding
amazing
remarkably
chilling
remarkable
originality
vibrant
feel-good
charmer
detail
ahead
eyes


In [38]:
negation_words = ['not','neither','never','none','nobody','nor','nothing']
emphasion_words = ['very','big','enormous','great','much','lot','absolute','most','complete','pure','total','totally']

In [28]:
train_words_with_neg = np.zeros((7000,2000))
for i, row in train_df.iterrows():
    phrase = row['Phrase']
    phrase.translate(str.maketrans('', '', string.punctuation))
    word = ""
    negation = 1
    for s in phrase:
        if s == ' ':
            if word == "":
                continue
            if word in vocab:
                train_words_with_neg[i][vocab[word]]+=1*negation
            if word in negations:
                negation*=-1
            else:
                negation = 1
            word = ""
        else:
            word += s.lower()

In [29]:
Theta0 = np.random.normal(size = 2000)
Theta_with_neg = sopt.fmin_l_bfgs_b(lambda Theta: logreg_loss(Theta, sentence_encoding.T, sentiment), np.array(Theta0))[0]

In [30]:
test_sent_encoding_with_neg = np.zeros((1529, 2000))
for i, row in test_df.iterrows():
    phrase = row['Phrase']
    phrase.translate(str.maketrans('', '', string.punctuation))
    word = ""
    negation = 1
    for s in phrase:
        if s == ' ':
            if word == "":
                continue
            if word in vocab:
                test_sent_encoding_with_neg[i-7000][vocab[word]]+=1*negation
            if word in negations:
                negation*=-1
            else:
                negation = 1
            word = ""
        else:
            word += s.lower()

In [31]:
print(train_words_with_neg)

[[0. 0. 0. ... 2. 1. 3.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 1. 2. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 0. 1.]]


In [32]:
print(np.mean(np.abs(sigmoid(test_sent_encoding_with_neg.dot(Theta_with_neg))-test_sentiment))) #

0.22921459646143524


In [42]:
def word_encodings(Neg, Emp):
    vector_Sw = np.zeros((7000, 2000))
    vector_Neg = np.zeros((7000, 2000))
    vector_Emp = np.zeros((7000, 2000))
    for i, row in train_df.iterrows():
        phrase = row['Phrase']
        phrase.translate(str.maketrans('', '', string.punctuation))
        word = ""
        negations = 0
        emphasis = 0
        for s in phrase:
            if s == ' ':
                if word == "":
                    continue
                if word in vocab:
                    vector_Sw[i][vocab[word]] += (Neg**negations) * (Emp**emphasis)
                    if negations != 0:
                        vector_Neg[i][vocab[word]] += (negations*(Neg**(negations-1))) * (Emp**emphasis)
                    if emphasis != 0:
                        vector_Emp[i][vocab[word]]+= (Neg**negations) * (emphasis*(Emp**(emphasis-1)))
                    if word in negation_words:
                        negations += 1
                    else:
                        negations = 0
                    if word in emphasion_words:
                        emphasis += 1
                    else:
                        emphasis = 0
                word = ""
            else:
                word += s.lower()
    return vector_Sw,vector_Neg,vector_Emp

In [63]:
def custom_logreg_loss(Theta, Y):
    Theta_word = Theta[:2000]
    Neg = Theta[2000]
    Emp = Theta[2001]
    X_Sw, X_Neg, X_Emp = word_encodings(Neg, Emp)
    X_Sw = X_Sw.T
    X_Neg = X_Neg
    X_Emp = X_Emp
    nll = -np.sum(Y.dot(np.log2(sigmoid((Theta_word.T).dot(X_Sw)) + 1e-100)) + (1-Y).dot(np.log2(1 -  sigmoid((Theta_word.T).dot(X_Sw))+ 1e-100)))
    grad = X_Sw.dot((sigmoid((Theta_word.T).dot(X_Sw)) - Y).T)
    NegativeGrad = X_Neg.dot(Theta_word).T.dot(sigmoid((Theta_word.T).dot(X_Sw))-Y)
    EmphasisGrad = X_Emp.dot(Theta_word).T.dot(sigmoid((Theta_word.T).dot(X_Sw))-Y)
    grad = np.append(grad, NegativeGrad)
    grad = np.append(grad, EmphasisGrad)
    return nll, grad.reshape(Theta.shape)

In [64]:
Theta0 = np.random.normal(size = 2002)
ThetaOpt = sopt.fmin_l_bfgs_b(lambda Theta: custom_logreg_loss(Theta, sentiment), np.array(Theta0))[0]

In [65]:
print(np.mean(np.abs(sigmoid(test_sent_encoding.dot(ThetaOpt[:2000])) - test_sentiment)))

0.23186943830457066
