
# Milestone II

Project groups will analyze this data to 1) identify inconsistencies and challenges in annotation as well as to 2) identify groups of synonymous aspects.

The data can be analysed manually, or semi-automatically by calculating and analysing basic data statistics (e.g. label frequencies, aspect frequencies etc.).

In [3]:
!wget -nc https://raw.githubusercontent.com/sainvo/Project_Sentiment_analyser/master/combined-sentiment-judgments.tsv

File ‘combined-sentiment-judgments.tsv’ already there; not retrieving.



In [4]:
import csv
from pprint import pprint

data = []

with open("combined-sentiment-judgments.tsv", encoding= 'utf-8') as tsv:
    tsvreader = csv.reader(tsv, delimiter= '\t')   
    for line in tsvreader:
        # 0:id, 1=judgement, 2=aspects, 3=left, 4= target, 5=right
        #print (line)
        judg = line[1].split(",")
        #print(judg)
     #
        data.append({"id":line[0], "judgement":judg, "aspects":line[2].split(','), "left context":line[3], "target":line[4], "right context":line[5]})
#print(evals[:5])
print()
print(len(data))
pprint(data[:3])
       
counts = []
for item in data:
    # number of times each label was given to this sentence
    count = {"positive":0, "neutral":0, "negative":0, "mixed":0, "unclear":0}
    for label in item.get('judgement'):
        if label in count: # anything outside the five classes is ignored
            count[label] += 1
    counts.append(count)
pprint(counts[:10])


980
[{'aspects': ['status'],
  'id': 'ajoneuvot00692',
  'judgement': ['positive', 'neutral'],
  'left context': 'Tämä . BMW provosoi tarpeettomiin ohituksiin . '
                  'Liikennevaloissa lähtiessäkin on usein yhden hengen '
                  'kiihdytyskisat pystyssä . ',
  'right context': ' ratissa ei tarvitse katsella tätä , vaikka ajotyyli ja '
                   'nopeus on samoja . Onko tämä sitä kateutta ?',
  'target': 'Fordin'},
 {'aspects': ['auto', 'merkki', 'ääni'],
  'id': 'ajoneuvot00697',
  'judgement': ['positive', 'positive', 'positive', 'positive'],
  'left context': '',
  'right context': ' on kyllä upeita . Hyväkuntoiset on vaan aikalailla viety '
                   'käsistä . Aika paljon noita myös laitellaan , mutta '
                   'hyvällä maulla mielestäni ( motorsport hengessä ) . Olen '
                   'ollut parin + 250hp vaparin kyydissä . Suorakutonen laulaa '
                   'nätisti 8000rpm .',
  'target': 'Zetat'},
 {'aspects': ['hi

# Task II: analysis
Above, we already counted the instances of each label per sentence in the data.
Let's do some more analysis on the data before we use it.




In [5]:
# Counting the judgement labels
from collections import Counter
import numpy as np

counts_tot = []
for count in counts:
    total = sum(count.values())
    counts_tot.append(total)

tot_counter = Counter()
tot_counter.update(counts_tot)

print("Distribution of label evals: ",tot_counter)
dict_tot = dict(tot_counter)
dict_keys = dict_tot.keys()
keys = np.array(list(dict_keys))
values = np.array(list(dict_tot.values()))
# share of sentences per label count (a new dict, so dict_tot keeps the raw counts)
dict_copy = {key: value/len(data) for key, value in dict_tot.items()}
print("Percentages: ", dict_copy)
pers = np.array(list(dict_copy.values()))
print("keys: ", keys, ", values: ", values, ", percentages", pers)

Distribution of label evals:  Counter({3: 346, 2: 293, 4: 195, 1: 109, 5: 37})
Percentages:  {2: 0.29897959183673467, 4: 0.1989795918367347, 3: 0.35306122448979593, 5: 0.03775510204081633, 1: 0.11122448979591837}
keys:  [2 4 3 5 1] , values:  [293 195 346  37 109] , percentages [0.29897959 0.19897959 0.35306122 0.0377551  0.11122449]


In [6]:
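#statistics
# unweighted mean of the distinct label counts (1-5); ignores how many sentences have each count
average = np.average(keys)
print("normal average: ",average)
#weighted average = mean number of labels per sentence
w_av = np.average(keys, weights=pers)
print("weighted average: ",w_av)
# median and mean of the number of sentences per label count
median = np.median(values)
av_val = np.average(values)
print("median: ", median, "average per label type: ",av_val)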

normal average:  3.0
weighted average:  2.7530612244897963
median:  195.0 average per label type:  196.0


## inconsistencies and challenges in annotation





From this distribution we can see that most of the sentences have been evaluated at least twice: only about 11 percent of the sentences have just one label, and about 4 percent have five. On average, each sentence has about 2.75 labels.
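As a quick sanity check, the figures in the paragraph above can be recomputed directly from the printed distribution. This is only a small sketch: the counts are copied from the output above, and the variable names are my own.

```python
from collections import Counter

# Number of sentences (value) that received a given number of labels (key),
# copied from the "Distribution of label evals" output above.
labels_per_sentence = Counter({3: 346, 2: 293, 4: 195, 1: 109, 5: 37})

n_sentences = sum(labels_per_sentence.values())
n_labels = sum(k * v for k, v in labels_per_sentence.items())

mean_labels = n_labels / n_sentences                    # labels per sentence
share_single = labels_per_sentence[1] / n_sentences     # sentences with one label
share_five = labels_per_sentence[5] / n_sentences       # sentences with five labels

print(n_sentences, n_labels)
print(round(mean_labels, 2), round(share_single, 3), round(share_five, 3))
```

The mean computed this way is the same quantity as the weighted average printed in the statistics cell.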

Based on this, I decided to use the majority label as the deciding factor for the train/dev/test division. If all annotators agree on a label (weight 1.00), that label becomes the final label of the sentence in the training set. Sentences labelled *unclear* or *mixed* go to the test set, so we can see whether the model can relabel them as one of the three major classes. Sentences with more than one distinct label are handled as follows:

* if the sentence has a majority label with a weight of >= 0.66, it goes to the dev set with the majority label as its final label; otherwise it goes to the test set.
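The majority rule described above can be sketched as a small helper. This is an illustration only, not the code used in the cells below: the name `majority_label` and the `threshold` parameter are hypothetical, and the function assumes a non-empty list of labels.

```python
from collections import Counter

def majority_label(labels, threshold=0.66):
    """Return (label, weight) if the most frequent label reaches the
    threshold, otherwise (None, weight of the most frequent label)."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    weight = n / len(labels)
    # A tie between the top labels has a weight of at most 0.5,
    # so it can never reach a threshold of 0.66.
    if weight >= threshold:
        return label, weight
    return None, weight

unanimous = majority_label(["positive"] * 4)
two_thirds = majority_label(["neutral", "neutral", "mixed"])
tie = majority_label(["positive", "neutral"])
print(unanimous, two_thirds, tie)
```

With this rule, a 2-of-3 majority (weight 0.667) just passes the threshold, while a 50/50 split does not.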

## Preprocessing the data and division to train/dev/test sets


In [7]:
# let's change the counts to proportions (label weights between 0 and 1)
for count in counts:
    #print(count)
    total = sum(count.values())
    #print(total) ##check
    
    for item in count:
        #print(item)
        pros = count.get(item)/total
        #print(pros)
        count[item] = pros
    #print(count)
pprint(counts[:3])

[{'mixed': 0.0,
  'negative': 0.0,
  'neutral': 0.5,
  'positive': 0.5,
  'unclear': 0.0},
 {'mixed': 0.0,
  'negative': 0.0,
  'neutral': 0.0,
  'positive': 1.0,
  'unclear': 0.0},
 {'mixed': 0.3333333333333333,
  'negative': 0.0,
  'neutral': 0.6666666666666666,
  'positive': 0.0,
  'unclear': 0.0}]


In [8]:
judg_list = []

for count in counts:
    #print(count)
    # (label, weight) pair with the highest weight
    max_key = max(count.items(), key = lambda x :x[1])
    keys = list(count.keys())
    #print(max_key)
    # ties (e.g. 50/50) always have a top weight below 1.0, so they end up in the else branch
    if (max_key[1]== 1.0):
        final_eval = max_key[0]
        #print("Final label: ",final_eval)
        judg_list.append(final_eval)
    else:
        multi = []
        for key in keys:
            if count.get(key) > 0:
                multi.append([key,count.get(key)])
        judg_list.append(multi)
pprint(judg_list[:10])
print(len(judg_list))
# as seen below (result), the judgements are now either a single string per sentence or a list of [label, weight] pairs

[[['positive', 0.5], ['neutral', 0.5]],
 'positive',
 [['neutral', 0.6666666666666666], ['mixed', 0.3333333333333333]],
 [['positive', 0.75], ['neutral', 0.25]],
 'negative',
 'mixed',
 'negative',
 'positive',
 [['positive', 0.75], ['mixed', 0.25]],
 [['negative', 0.75], ['mixed', 0.25]]]
980


In [9]:
# Setting the resolved labels back to the data:
# a plain string for unanimous sentences, otherwise the list of [label, weight] pairs
for datapoint, judgement in zip(data, judg_list):
    datapoint['judgement'] = judgement
sample = data[:3]
for item in sample:
    print(item)
    print()

{'id': 'ajoneuvot00692', 'judgement': [['positive', 0.5], ['neutral', 0.5]], 'aspects': ['status'], 'left context': 'Tämä . BMW provosoi tarpeettomiin ohituksiin . Liikennevaloissa lähtiessäkin on usein yhden hengen kiihdytyskisat pystyssä . ', 'target': 'Fordin', 'right context': ' ratissa ei tarvitse katsella tätä , vaikka ajotyyli ja nopeus on samoja . Onko tämä sitä kateutta ?'}

{'id': 'ajoneuvot00697', 'judgement': 'positive', 'aspects': ['auto', 'merkki', 'ääni'], 'left context': '', 'target': 'Zetat', 'right context': ' on kyllä upeita . Hyväkuntoiset on vaan aikalailla viety käsistä . Aika paljon noita myös laitellaan , mutta hyvällä maulla mielestäni ( motorsport hengessä ) . Olen ollut parin + 250hp vaparin kyydissä . Suorakutonen laulaa nätisti 8000rpm .'}

{'id': 'ajoneuvot01589', 'judgement': [['neutral', 0.6666666666666666], ['mixed', 0.3333333333333333]], 'aspects': ['hinta', 'piirteet'], 'left context': 'Tässä langassa lienee paras kysyä , mitä tulisi huomioida ', 'tar

In [10]:
# let's divide the data into train/dev/test batches by the number of labels first
clear_labels = ['positive','neutral','negative', 'mixed', 'unclear']
train_data =[]
dev_data =[]
test_data=[]
#count_unclears = 0
for item in data:
    #while(len(train_data))
    judg = item.get('judgement')
    #print(judg)
    if isinstance(judg, str): ##sentence has class label with 1.00 weight
        if judg == 'unclear' or judg == 'mixed': # these labels don't really tell us much, so let's try relabel them later
            #count_unclears += 1
            test_data.append(item)
        else:
            #print(judg)
            train_data.append(item)
        #if len(dev_data)/len(data) < 0.2 :    
            #dev_data.append(item)
        #else: 
        #train_data.append(item)
    else: # no single label present, so judgement is a list of [label, weight] pairs
        max_val = 0
        max_ind = 0
        for ind, x in enumerate(judg):
            if x[1]> max_val: 
                max_val = x[1] ## max of percentage values
                max_ind = ind
        if max_val >= 0.66: #these sentences have a majority label under 1.00
            #print(max_val)
            label = judg[max_ind][0]
            #print(label)
            item['judgement'] = label
            #if label in clear_labels:
                #dev_data.append(item)
            #else:
                #count_unclears += 1
                #test_data.append(item)
            dev_data.append(item)    
        else:
            test_data.append(item)
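from sklearn.model_selection import train_test_split
# NOTE: this reassigns dev_data, so the majority-label sentences collected into dev_data above are discarded;
# train and dev are both drawn from the unanimously labelled sentences (386 + 166 + 193 = 745 of the 980 sentences remain)
train_data, dev_data= train_test_split(train_data, test_size=0.3)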
print("train: ", len(train_data), ", dev:", len(dev_data), ", test: ", len(test_data))
#print("No of unclear and mixed: ",count_unclears)
#pprint(train_data[:10])
print('-----------------------------------')
#pprint(dev_data[:10])
print('--------------------------')
#pprint(test_data[:10])


train:  386 , dev: 166 , test:  193
-----------------------------------
--------------------------


# Looking for synonymous or similar aspects

In [11]:
# Let's list the aspects from the data
aspect_list =[] 
space = ' '
for item in data:
    aspects = item['aspects']
    #print(type(aspects))
    #print(aspects)
    # only the first aspect of each sentence is used; empty or too short (< 3 characters) aspects are skipped
    if aspects[0] == '' or len(aspects[0]) <3:
        pass
    else:
        temp = aspects[0].split(' ')
        if len(temp) > 1:
            # multi-word aspect: add each word separately
            for part in temp:
                aspect_list.append(part)
        else:
            aspect_list.append(aspects[0])
print("Aspects list length: ",len(aspect_list))
#print(aspect_list[:100])

# then let's get unique set from the list
aspect_set = set()
for aspect in aspect_list: 
    aspect_set.add(aspect)
print("Unique aspects set length: ",len(aspect_set))
print(aspect_set)

Aspects list length:  778
Unique aspects set length:  232
{'aikomus', 'status', 'monipuolisuus', 'helppous', 'hinta-laatusuhde', 'akku', 'ajo-ominaisuudet', 'määrittely', 'moottori', 'mestaruus', 'ansio', 'rahasiirto', 'ajettevuus', 'luotettavuus', 'kaupunki', 'arvot', 'hankaluus', 'juoma', 'kannatus', 'arvostus', 'tarkkuus', 'kunto', 'mielipide', 'korjaus', 'ajo-ominaisuus', 'jousitus', 'rap', 'tietoturva', 'ostospäätös', 'maku', 'sähkö', 'ominaisuudet', 'laatu', 'markkina', 'käytettävys', 'kasetti', 'laa', 'valikoima', 'luotonanto', 'informointi', 'sopivuus', 'tarpeellisuus', 'kustomointi', 'käyttömukavuus', 'annos', 'artisti', 'hyöty', 'esimerkki', 'näkymät', 'hauskuus', 'helppokäyttöisyys', 'kauppa', 'ajomukavuus', 'huoltovarmuus', 'kattavuus', 'taso', 'riski', 'haave', 'ajettavuus', 'yleiskuva', 'groovaavuus', 'pelaaminen', 'suoritus', 'varustetaso', 'yleine', 'kieli', 'yhteensopivuus', 'app', 'ajoneuvo', 'arvio', 'toimintakyky', 'bensankulutus', 'nykyaikaisuus', 'asiakaspalvelu',

In [85]:
# Now we can start looking at the aspects 
# Let's start with some statistics...
from collections import Counter
aspect_counter = Counter()
aspect_counter.update(aspect_list) #aspect_list is the original list
#pprint(aspect_counter)
most_common = aspect_counter.most_common(20)
print()
print("Twenty most common aspects used in the annotation data:\n")
for item in most_common:
    print(item)


Twenty most common aspects used in the annotation data:

('laatu', 117)
('hinta', 93)
('kuntoisuus', 37)
('kysymys', 35)
('yleinen', 32)
('toimivuus', 22)
('maku', 22)
('käytettävyys', 19)
('mielipide', 17)
('luotettavuus', 12)
('käyttö', 12)
('ulkonäkö', 10)
('helppous', 7)
('ohje', 7)
('kestävyys', 6)
('ajettavuus', 6)
('ominaisuus', 6)
('arvo', 5)
('yleiskuva', 5)
('ikä', 4)


# 1. USING WORD2VEC

In [15]:
# this is for the synonym part as the fin-word2vec model is too big for Github
# you can download the model from my Gdrive here: https://drive.google.com/file/d/1spguHkZlP0wmb6hX6NbW6l80Z-Rg46Qb/view?usp=sharing

#mounting Google drive
#from google.colab import drive
#drive.mount('/content/drive')

In [16]:
# now let's get a word2vec model for Finnish
import gensim
from gensim.models import word2vec
## for downloading NOTE! dl is Slooowww due to size of the model (14,28 Gb)
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Opinnot/KiTe/JoKite/2020/fin-word2vec.bin', binary=True)
model.save_word2vec_format('fin-w2vec.txt', binary=False)

# if already loaded as local .txt
#model = gensim.models.KeyedVectors.load_word2vec_format('fin-w2vec.txt', binary=False)

KeyboardInterrupt: ignored

In [93]:
# We can see that there are some malformed words like *käytettävys* or *käytettävyy*, which are not distinct aspects but truncated forms of käytettävyys
# So we need to clean this set; here with difflib's similarity ratio from the standard library instead of a Levenshtein distance

import difflib

asp_clean_list = list(aspect_set)
saved = []
deleted = []
for item1 in asp_clean_list:
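    # NOTE: this check is a no-op (`pass` does not skip the item), so already handled words are processed again
    if item1 in saved or item1 in deleted:
        pass
    if item1 not in model.vocab: # `model.vocab` is the gensim < 4.0 API (gensim 4 uses `model.key_to_index`)
        # let's first check if there is a close match in the aspect list itself
        closest = difflib.get_close_matches(item1, asp_clean_list)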
        print(closest)
        for x in closest:
            if difflib.SequenceMatcher(None, item1, x).ratio() >= 0.8:
                if item1 ==x:
                    pass
                elif item1 in x:
                    print( item1, x)
                    saved.append(x)
                    deleted.append(item1)
                    #asp_uniq_list.pop(asp_uniq_list.index(item1))
                elif x in item1:
                    print( x, item1)
                    saved.append(item1)
                    #deleted.append(x)
                else:   
                    saved.append(closest[1])
                    deleted.append(closest[0])
    else:
        saved.append(item1)
print("saved: ", saved)
print("deleted: ", deleted)


['ajettevuus', 'ajettavuus', 'kattavuus']
['ajo-ominaisuus', 'ajo-ominaisuudet', 'ominaisuus']
ominaisuus ajo-ominaisuus
['käytettävys', 'käytettävyys', 'kestävyys']
['groovaavuus', 'vahvuus', 'korjattavuus']
['itsenäisyy']
['hiihto/laskettelu']
['mielipi', 'mielipide', 'kieli']
mielipi mielipide
['kuntoisuu', 'kuntoisuus', 'kunto']
kuntoisuu kuntoisuus
saved:  ['aikomus', 'status', 'monipuolisuus', 'helppous', 'hinta-laatusuhde', 'akku', 'ajo-ominaisuudet', 'määrittely', 'moottori', 'mestaruus', 'ansio', 'rahasiirto', 'ajettavuus', 'luotettavuus', 'kaupunki', 'arvot', 'hankaluus', 'juoma', 'kannatus', 'arvostus', 'tarkkuus', 'kunto', 'mielipide', 'korjaus', 'ajo-ominaisuudet', 'ajo-ominaisuus', 'jousitus', 'rap', 'tietoturva', 'ostospäätös', 'maku', 'sähkö', 'ominaisuudet', 'laatu', 'markkina', 'käytettävyys', 'kasetti', 'laa', 'valikoima', 'luotonanto', 'informointi', 'sopivuus', 'tarpeellisuus', 'kustomointi', 'käyttömukavuus', 'annos', 'artisti', 'hyöty', 'esimerkki', 'näkymät', 'h

In [92]:
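# then let's use the larger vocab
# NOTE: the loop variable here is `item`, but the body uses `item1`, which keeps whatever
# value it had from an earlier cell, so every lookup below is done for that one word

for item in saved:
    if item not in model.vocab:
        
        print(item)
        #print("not in vocab:",item1)
        closest = difflib.get_close_matches(item1, model.vocab)
        print(closest)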
        for x in closest:
            if difflib.SequenceMatcher(None, x, item1).ratio()== 1.00:
                pass
            elif difflib.SequenceMatcher(None, x, item1).ratio() >= 0.7:
                if item1 in x:
                    print( item1, x)
                    saved.append(x)
                    deleted.append(item1)
                    #asp_uniq_list.pop(asp_uniq_list.index(item1))
                elif x in item1:
                    print( x, item1)
                    saved.append(item1)
                    deleted.append(x)
            else:   
                saved.append(closest[1])
                deleted.append(closest[0])
                #asp_uniq_list[asp_uniq_list.index(item1)] = word
#for item1 in asp_uniq_list:
 #   for item2 in asp_uniq_list:
  #      if len(item1) > len(item2):
   #         short = item2x
    #        long = item1
     #   else:
      #      short = item1
       #     long = item2
       # if difflib.SequenceMatcher(None, long, short).ratio() == 1.00:
        #    pass
       # elif difflib.SequenceMatcher(None, long, short).ratio() >= 0.9 and difflib.SequenceMatcher(None, long, short).ratio()< 1.00:
        #    print( long, short)
         #   asp_uniq_list[asp_uniq_list.index(short)] = long
          #  #asp_uniq_list.pop(asp_uniq_list.index(item2))
    
        
        
print("no of saved items:",len(saved))
asp_uniq_set = set()
for item in saved:
    asp_uniq_set.add(item)
print(len(asp_clean_list))
print(len(asp_uniq_set))
print("saved: ", saved)
print("deleted: ", deleted)


ajo-ominaisuus
['pyöräily', 'pyöräilyä', 'pyöräilyt']
pyöräily pyöräilyä
pyöräily pyöräilyt
no of saved items: 236
232
227
saved:  ['aikomus', 'status', 'monipuolisuus', 'helppous', 'hinta-laatusuhde', 'akku', 'ajo-ominaisuudet', 'määrittely', 'moottori', 'mestaruus', 'ansio', 'rahasiirto', 'ajettavuus', 'luotettavuus', 'kaupunki', 'arvot', 'hankaluus', 'juoma', 'kannatus', 'arvostus', 'tarkkuus', 'kunto', 'mielipide', 'korjaus', 'ajo-ominaisuudet', 'ajo-ominaisuus', 'jousitus', 'rap', 'tietoturva', 'ostospäätös', 'maku', 'sähkö', 'ominaisuudet', 'laatu', 'markkina', 'käytettävyys', 'kasetti', 'laa', 'valikoima', 'luotonanto', 'informointi', 'sopivuus', 'tarpeellisuus', 'kustomointi', 'käyttömukavuus', 'annos', 'artisti', 'hyöty', 'esimerkki', 'näkymät', 'hauskuus', 'helppokäyttöisyys', 'kauppa', 'ajomukavuus', 'huoltovarmuus', 'kattavuus', 'taso', 'riski', 'haave', 'ajettavuus', 'yleiskuva', 'pelaaminen', 'suoritus', 'varustetaso', 'yleine', 'kieli', 'yhteensopivuus', 'app', 'ajoneuvo

Then we can try to find synonymy or similarity between the aspects by using the word2vec model:

In [94]:
# creating a matrix of similarity scores
similarities = []
for aspect in asp_uniq_set:
    for x in asp_uniq_set:
        if aspect in model.vocab and x in model.vocab:
            similarity = model.similarity(aspect,x)
            similarities.append([aspect,x,similarity])
#pprint(similarities[:50])



Let's convert this list of pairwise scores to a DataFrame of integer percentages for readability:

In [95]:
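import pandas as pd
#import numpy as np

# use a list for the labels: recent pandas versions reject sets as index/columns
labels = list(aspect_set)
similarity_matrix = pd.DataFrame(None, index=labels, columns=labels)
#print(similarity_matrix)
for item in similarities:
    # store the similarity score as an integer percentage
    similarity_matrix.loc[item[0], item[1]] = int(item[2]*100)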
#print(similarity_matrix)

In [96]:
import math

sim_dict_matrix = similarity_matrix.to_dict(orient='index')
#print(sim_dict_matrix)
aspect_comparison_dict = {}
# [aspect, other aspect, score]; named so that the built-ins max() and min() are not shadowed
most_similar = ["","", 0]
least_similar = ["","", 0]
for key, value in sim_dict_matrix.items():
    #print(key)
    closest={}
    farthest={}
    # find 4 closest (the first entry is skipped, as it is expected to be the aspect itself)
    desc_ord = sorted(value.items(), key=lambda x:x[1], reverse=True)
    #print(desc_ord)
    for k,v in desc_ord[1:10]:
        if not math.isnan(v) and k != key:
            if len(closest)<4:
                closest.update({k:v})
                if v > most_similar[2]:
                    most_similar=[key,k,v]
            else:break
    # find 4 least similar (note: the slice also skips the first, i.e. lowest, entry)
    asc_ord = sorted(value.items(), key=lambda x:x[1], reverse=False)
    for k,v in asc_ord[1:10]:
        if not math.isnan(v) and k!=key:
            if len(farthest)<4:
                farthest.update({k:v})
                if v < least_similar[2]:
                    least_similar=[key,k,v]
            else:break
    #print("Closest: ",closest)
    #print("Farthest: ",farthest)
    if closest and farthest:
        aspect_comparison_dict[key] = {
            "Closest": sorted(closest.items(), key=lambda x:x[1], reverse=True),
            "Farthest": sorted(farthest.items(), key=lambda x:x[1], reverse=False),
        }
pprint(aspect_comparison_dict)
print()
print("Most similar aspects: ",most_similar)
print("Least similar aspects: " ,least_similar)

{'aika': {'Closest': [('ostospäätös', 22),
                      ('hankaluus', 20),
                      ('maku', 20),
                      ('hinta-laatusuhde', 17)],
          'Farthest': [('kattavuus', -4),
                       ('monipuolisuus', -3),
                       ('luotettavuus', -3),
                       ('sopivuus', -3)]},
 'aikomus': {'Closest': [('kannatus', 17),
                         ('kunto', 16),
                         ('mielipide', 15),
                         ('status', 14)],
             'Farthest': [('laa', -3),
                          ('ajo-ominaisuudet', -2),
                          ('ansio', -2),
                          ('tarkkuus', -2)]},
 'ajettavuus': {'Closest': [('ajomukavuus', 75),
                            ('ajo-ominaisuudet', 74),
                            ('jousitus', 71),
                            ('moottori', 59)],
                'Farthest': [('haave', -5),
                             ('artisti', -1),
                      

### RESULT ANALYSIS:
Now we have found out which aspects in our dataset are most and least similar to each other at the vector level. From the results we could deduce that 'tuttuus' (familiarity with a product) is related to fun and ease, or that 'toimivuus' (functionality) is close to ease of use, reliability, quality and usability. According to this analysis, the most similar aspects are 'käytettävyys' (usability) and 'helppokäyttöisyys' (ease of use), with a similarity score of 77 percent between their vectors, but even they are not synonyms.

The pitfall of this analysis is that vector similarity says nothing about whether two words are synonyms, only about how close their vectors are. Thus, we also get antonyms or seemingly unrelated words among the close hits. This shows that word2vec alone can't be used to determine whether two words are synonyms.
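To make the pitfall concrete: the score returned by gensim's `similarity` is the cosine similarity of the two word vectors, which only measures the angle between them. Below is a minimal pure-Python sketch with invented 3-dimensional toy vectors (the words and numbers are made up for illustration and do not come from the Finnish model). The two antonyms get almost identical vectors because they occur in similar contexts.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings", invented for this example only.
toy_vectors = {
    "cheap": [0.9, 0.1, 0.3],
    "expensive": [0.8, 0.2, 0.35],   # antonym, but used in similar contexts
    "engine": [0.1, 0.9, 0.0],
}

sim_antonyms = cosine_similarity(toy_vectors["cheap"], toy_vectors["expensive"])
sim_unrelated = cosine_similarity(toy_vectors["cheap"], toy_vectors["engine"])
print(round(sim_antonyms, 2), round(sim_unrelated, 2))
```

A high score here says that the words are used alike, not that they mean the same thing.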

# 2. USING FinnWordNet

> As word2vec looks at words purely in terms of vector similarity, without regard to grammar, I also wanted to try another option: using the synsets of FinnWordNet to find synonyms. Let's see how it goes...



In [90]:
# reading from synsets
synset_matrix =[]
with open('fiwn-synsets-extra.tsv',encoding= 'utf-8') as tsv:
    tsvreader = csv.reader(tsv, delimiter= '\t')   
    for line in tsvreader:
        synset = []
        #print(line)
        syns = line[2].split("|") #synonyms are separated by '|' in the col2 of the tsv file
        for i in range(len(syns)):
            word = syns[i].split('<')[0] # if there is XML behind the word f.ex. <broader>
            synset.append(word.strip())
        synset_matrix.append(synset)
#pprint(synset_matrix[:10])

def findInSynset(word):
    for i in range(len(synset_matrix)):
        #print(synset_matrix[i])
        ss = synset_matrix[i]
        #for j in range(len(synset_matrix[i])):
        if word == ss[0]:
            #print("Found a hit in synsets:\n word: " ,word," synset:", ss )
            return [True,ss]
# test
test = findInSynset('laatu')
print(test)

[True, ['laatu', 'luokka', 'taso']]


In [97]:
aspects = asp_uniq_set #from before
print("Unique aspects: ", len(aspects))
syn_aspects = []
nonsyn_aspects =[]
syn_as = []
for asp in aspects:
    value, synset = None, []
    check = findInSynset(asp)
    if check is not None:
        value = check[0]
        synset = check[1]
        if value:
            syn_as.append(asp)
            #print(synset)
            for syn in synset:
                if syn in aspects: 
                    if syn != asp:
                        syn_as.append(syn)
        if len(syn_as) >1:
            syn_aspects.append(syn_as)
        else:
            nonsyn_aspects.append(syn_as[0])
        syn_as = []
    else:
        nonsyn_aspects.append(asp)
print()
print("Synonymous aspects in the data: ",len(syn_aspects), '\n')
for item in syn_aspects:
    print(item)
print()
print("Aspects  without synonyms in the data:", len(nonsyn_aspects),'\n')
print(nonsyn_aspects)

Unique aspects:  227

Synonymous aspects in the data:  14 

['hankaluus', 'ongelma', 'vaikeus']
['laatu', 'taso']
['sopivuus', 'yhteensopivuus']
['ajoneuvo', 'auto']
['käytettävyys', 'saatavuus']
['mielenkiinto', 'mielenkiintoisuus']
['kysymys', 'ongelma']
['asiantuntemus', 'osaaminen']
['huolto', 'korjaus']
['ennuste', 'ennustus']
['arvostelu', 'korjaus']
['tehokkuus', 'vaikutus']
['suorituskyky', 'toiminta']
['hyvyys', 'hyvä']

Aspects  without synonyms in the data: 213 

['aikomus', 'status', 'monipuolisuus', 'helppous', 'hinta-laatusuhde', 'akku', 'ajo-ominaisuudet', 'määrittely', 'moottori', 'ansio', 'mestaruus', 'rahasiirto', 'luotettavuus', 'kaupunki', 'arvot', 'juoma', 'kannatus', 'arvostus', 'tarkkuus', 'kunto', 'mielipide', 'korjaus', 'ajo-ominaisuus', 'jousitus', 'rap', 'tietoturva', 'ostospäätös', 'maku', 'sähkö', 'ominaisuudet', 'markkina', 'kasetti', 'laa', 'valikoima', 'luotonanto', 'informointi', 'tarpeellisuus', 'kustomointi', 'käyttömukavuus', 'annos', 'artisti', 'hyö

### RESULT ANALYSIS:

The output above shows that some synonyms were found among the aspects, but not many. Some of the synonyms found this way are logical, but others need a closer look. For example, 'arvostelu' (review) and 'korjaus' (repair, correction) do not feel very synonymous, which raises the question of how suitable the FinnWordNet synsets are for this kind of task. The semantic connection between the supposed synonyms would have to be studied before making a final judgement on whether they really are synonyms in this case. I also tried using Voikko and a few other resources for this purpose.

All in all, I was positively surprised that I was able to find some synonyms in the data, as my hypothesis was the opposite. Why? Because the annotation did not limit the choice of aspect in any way or give suggestions, the annotators selected whatever word they thought of, or none, resulting in a wide range of words with a single instance in the data.
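One possible next step, which I did not run on the full data, would be to merge the synonym lists that share a word into larger groups of aspects. The sketch below uses five of the lists printed above; the helper name `merge_groups` is hypothetical.

```python
def merge_groups(groups):
    """Merge lists that share at least one word into larger sets."""
    merged = []
    for group in groups:
        current = set(group)
        remaining = []
        for other in merged:
            if current & other:
                current |= other        # overlapping group: absorb it
            else:
                remaining.append(other)
        remaining.append(current)
        merged = remaining
    return merged

synonym_lists = [
    ['hankaluus', 'ongelma', 'vaikeus'],
    ['kysymys', 'ongelma'],
    ['huolto', 'korjaus'],
    ['arvostelu', 'korjaus'],
    ['laatu', 'taso'],
]
aspect_groups = merge_groups(synonym_lists)
for group in aspect_groups:
    print(sorted(group))
```

Note that merging is transitive, so a doubtful link such as 'arvostelu' and 'korjaus' also pulls 'huolto' into the same group. The groups would still need a manual check.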

# Milestone III: automatic sentiment detection (mandatory)

Each project group will create an automatic sentiment detection system using the annotated data and a text classification method as taught on the course (for example a bag-of-words SVM, but you are free to choose any other classification method as well).
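To show what a binary bag-of-words representation looks like before it is fed to a classifier, here is a simplified pure-Python sketch. It is not the scikit-learn implementation: the helper names are hypothetical, tokens are split on whitespace only, and the two training sentences are the ones used as examples in this task description.

```python
def build_vocabulary(texts):
    """Map each lowercased whitespace token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def binary_bow(text, vocab):
    """1 if the token occurs in the text, 0 otherwise; unseen tokens are ignored."""
    vector = [0] * len(vocab)
    for token in text.lower().split():
        index = vocab.get(token)
        if index is not None:
            vector[index] = 1
    return vector

train_examples = [
    "Minun mielestäni Nordea on ihan hyvä pankki .",
    "Siirryin Op:sta pois sössimisten vuoksi .",
]
vocab = build_vocabulary(train_examples)
vector = binary_bow("Nordea on hyvä hyvä uusi", vocab)
print(len(vocab), sum(vector))
```

Repeated words count only once and words outside the training vocabulary are dropped, which is why a small training set leaves many dev and test words without any feature.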

The task setting is as follows: given target texts and their left and right contexts, assign each example to one of the sentiment classes (positive, negative, etc.). For example:

| Left context | Target | Right context | Sentiment |
|---|---|---|---|
| “Minun mielestäni” | “Nordea” | “on ihan hyvä pankki.” | positive |
| “Siirryin” | “Op:sta” | “pois sössimisten vuoksi.” | negative |

The sentiment detection systems will be evaluated on a held-out portion of the annotated data in terms of their accuracy.
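Since accuracy is the evaluation measure, it helps to compare any model against a trivial baseline that always predicts the most frequent training label. The sketch below uses made-up toy labels, and the function names are my own.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Share of examples where the predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def majority_baseline(train_labels, n_predictions):
    """Always predict the most frequent label seen in training."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions

# Toy labels, invented for illustration.
toy_train = ["positive", "positive", "negative"]
toy_gold = ["positive", "negative", "neutral", "positive"]
toy_pred = ["positive", "neutral", "neutral", "negative"]

model_acc = accuracy(toy_gold, toy_pred)
baseline_acc = accuracy(toy_gold, majority_baseline(toy_train, len(toy_gold)))
print(model_acc, baseline_acc)
```

A classifier is only useful if its held-out accuracy is clearly above this baseline.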

We will provide each project group with data for training and evaluating the sentiment detection methods in a simple TSV format.

In [98]:
# preprocessing the data
def collect_sentences(raw_data):
    sentences =[]
    for item in raw_data:
        left = item['left context']
        right = item['right context']
        target = item['target']
        sentence_with_context = left+' '+ target+ ' '+ right
        #print(sentence)
        label = item['judgement']
        sentences.append([label,target, sentence_with_context])
    return sentences

In [99]:
train_sents = collect_sentences(train_data)
print(train_sents[0])

['positive', 'Green gasin', 'Metallinen vai muovinen ? Varustelekasta ( sekä tietty ihan airsoft liikkeistäkin ) saa Cybergun APS3 - kaasua joka on ihan OK lähestulkoon kaikille kaasupistooleille . Hieman tehokkaampaa kuin 134a muttei kuitenkaan esim.  Green gasin  veroista .']


In [100]:
## same for dev and test data
dev_sents = collect_sentences(dev_data)
print(dev_sents[0])
print()
test_sents = collect_sentences(test_data)
print(test_sents[0])

['positive', 'Endomondoa', 'Treenipäivyrihän tuo on käytännössä , punttauksen puolellahan siitä on paljonkin hyötyä . Sports Tracker toimii erittäin paskasti Androidilla , kaatuu aina treenin lopussa . Suosittelen  Endomondoa  Android-käyttäjille .']

[[['positive', 0.5], ['neutral', 0.5]], 'Fordin', 'Tämä . BMW provosoi tarpeettomiin ohituksiin . Liikennevaloissa lähtiessäkin on usein yhden hengen kiihdytyskisat pystyssä .  Fordin  ratissa ei tarvitse katsella tätä , vaikka ajotyyli ja nopeus on samoja . Onko tämä sitä kateutta ?']


In [101]:
# gathering texts and labels to lists
train_texts = [item[2] for item in train_sents]
train_labels =  [item[0] for item in train_sents]
print("No of train texts: ", len(train_texts))
print("No of train labels: ", len(train_labels))
for label,text in list(zip(train_labels,train_texts))[:20]:
    print(label,text[:50]+"...")

No of train texts:  386
No of train labels:  386
positive Metallinen vai muovinen ? Varustelekasta ( sekä ti...
negative Kuva liittyy , siinä luukkulamputon  AE86  , joka ...
negative Nyymit hoi ! Tarvitsisin hieman apua . Koneen virk...
positive Olin viime viikolla Iso-Roban  Yamato Sushi -mesta...
positive  Mai Mai  on mahtava rentoutumiseen . Oma suosikki...
positive Hyvä tarkkuuspiippu parantaa tarkkuutta aivan silm...
neutral Oletko ollut tyytyväinen shinobiin ? Mulle tulee o...
negative  Deltacon  näppäimistöt tuppaa olemaan heikkolaatu...
positive Jos sinulla on muutama satku säästöissä , niin uno...
negative En tiedä näitten  Thinkpadien  hinnoittelusta mutt...
positive  Macho  hinta-laatusuhteeltan paras valinta , kell...
negative Kokeilin Minttiä n. viikon ajan . Vähän väliä tuli...
neutral Onko kukaan törmännyt kuvan tai muiden android-puh...
neutral Kyseinen auto taitaa olla  Mursu  . Niissä on nyky...
neutral Eikös  Nordnetissä  ole kaupankäyntikulut aina 8e/...
neutral Hu

In [102]:
#same for dev data 
dev_texts = [item[2] for item in dev_sents]
dev_labels =  [item[0] for item in dev_sents]
print("No of dev texts: ", len(dev_texts))
print("No of dev labels: ", len(dev_labels))
#and test
test_texts = [item[2] for item in test_sents]
test_labels =  [item[0] for item in test_sents]

No of dev texts:  166
No of dev labels:  166


In [103]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix_train=vectorizer.fit_transform(train_texts)
print("shape of train f-matrix =",feature_matrix_train.shape)

shape of train f-matrix = (386, 5286)


In [104]:
#print(feature_matrix)

In [105]:
#pprint(vectorizer.get_feature_names()[:1000])

In [106]:
feature_matrix_dev=vectorizer.transform(dev_texts)
print("shape of dev f-matrix =",feature_matrix_dev.shape)

shape of dev f-matrix = (166, 5286)
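
The dev matrix has the same 5286 columns as the train matrix because `transform` reuses the vocabulary learned during `fit_transform`; words that only occur in the dev texts are dropped. The plain-Python sketch below illustrates that idea on toy sentences. It is a simplification, not scikit-learn's implementation: the whitespace tokenization, the helper names `fit_vocabulary` and `binary_transform`, and the toy texts are all assumptions made for illustration.

```python
def fit_vocabulary(texts):
    """Map each distinct lowercased whitespace token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def binary_transform(texts, vocab):
    """Build one row per text: 1 if the vocabulary word occurs, else 0.

    Tokens that are not in the vocabulary are silently ignored.
    """
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for token in text.lower().split():
            if token in vocab:
                row[vocab[token]] = 1
        rows.append(row)
    return rows


# Toy data (hypothetical, not taken from the project's corpus)
toy_train = ["hyvä auto", "huono auto"]
toy_dev = ["hyvä pyörä"]  # "pyörä" never occurs in the training texts

vocab = fit_vocabulary(toy_train)
train_rows = binary_transform(toy_train, vocab)
dev_rows = binary_transform(toy_dev, vocab)

# Both matrices have one column per training-vocabulary word
print(len(train_rows[0]), len(dev_rows[0]))
```

With only 386 training texts, many dev words are likely to be missing from the vocabulary, which may be one reason the dev score further down is much lower than the train score.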


In [107]:
import sklearn.svm
classifier=sklearn.svm.LinearSVC(C=0.03,verbose=1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.03, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

In [108]:
print("DEV",classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN",classifier.score(feature_matrix_train, train_labels))

DEV 0.536144578313253
TRAIN 0.9974093264248705
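
The large gap between the train and dev scores suggests the classifier is overfitting the small training set. To put the dev accuracy in context, it can help to compare it with a majority-class baseline, that is, always predicting the most frequent training label. A minimal sketch in plain Python follows; `majority_baseline_accuracy` is a hypothetical helper and the toy labels are illustrative, not the project's data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)


# Toy labels (hypothetical); in the notebook one would pass train_labels and dev_labels
toy_train = ["positive", "positive", "neutral", "negative", "positive"]
toy_dev = ["positive", "neutral", "positive", "negative"]

baseline_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_dev)
print(baseline_label, baseline_acc)
```

If the dev accuracy is only slightly above this baseline, the bag-of-words features are adding little beyond the class distribution.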


In [109]:
import sklearn.metrics
predictions_dev=classifier.predict(feature_matrix_dev)
print(predictions_dev)
print(sklearn.metrics.confusion_matrix(dev_labels,predictions_dev))
print(sklearn.metrics.accuracy_score(dev_labels,predictions_dev))

['positive' 'positive' 'positive' 'neutral' 'negative' 'positive'
 'neutral' 'positive' 'negative' 'positive' 'neutral' 'neutral' 'neutral'
 'negative' 'positive' 'positive' 'neutral' 'positive' 'neutral' 'neutral'
 'positive' 'positive' 'neutral' 'positive' 'neutral' 'negative'
 'positive' 'positive' 'neutral' 'neutral' 'neutral' 'neutral' 'positive'
 'positive' 'positive' 'positive' 'positive' 'positive' 'positive'
 'negative' 'neutral' 'positive' 'positive' 'positive' 'positive'
 'neutral' 'positive' 'positive' 'positive' 'neutral' 'neutral' 'positive'
 'positive' 'positive' 'positive' 'neutral' 'positive' 'positive'
 'positive' 'positive' 'positive' 'negative' 'positive' 'neutral'
 'positive' 'positive' 'neutral' 'positive' 'neutral' 'positive'
 'positive' 'positive' 'positive' 'positive' 'neutral' 'positive'
 'negative' 'positive' 'positive' 'neutral' 'positive' 'neutral'
 'positive' 'positive' 'positive' 'positive' 'positive' 'positive'
 'neutral' 'positive' 'positive' 'neutral' 

In [110]:
# The model is not very good, but let's save it anyway and make predictions on the test data to see some results
import pickle

with open("saved_model.pickle","wb") as f:
    pickle.dump((classifier,vectorizer),f)

In [111]:
with open("saved_model.pickle","rb") as f:    
    classifier_loaded,vectorizer_loaded=pickle.load(f)

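# Vectorize with the *loaded* vectorizer and score that matrix, so the check covers both pickled objects
feature_matrix_dev_loaded=vectorizer_loaded.transform(dev_texts)
print("DEV - loaded (should match the score above)",classifier_loaded.score(feature_matrix_dev_loaded, dev_labels))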

DEV - loaded (should match the score above) 0.536144578313253


In [112]:
feature_matrix_test = vectorizer_loaded.transform(test_texts)
predictions_test = classifier_loaded.predict(feature_matrix_test)
#print(predictions_test[:10])
#print(test_labels[:10])

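# Pair each test item with its prediction directly; list.index() is slow and
# returns the first match, which is wrong if two items are identical.
for item, prediction in zip(test_data, predictions_test):
    judg = item['judgement']
    if isinstance(judg, str):
        item['judgement'] = [judg, prediction]
    else:
        # Note: here the prediction is appended as a one-element list,
        # which is why it prints as ['positive'] for multi-label items.
        item['judgement'].append([prediction])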
    #print(item['judgement'])
  

Let's test the saved model on the test set we prepared earlier to see whether it can determine a single final sentiment for the cases that had several labels.

In [115]:
for item in test_data:   
    print("Target word: ", item['target'])
    print("sentence: ", item['left context']+item['target']+item['right context'])
    # The last entry is the model's prediction; everything before it is the original annotation
    print("predicted: ", item['judgement'][-1], ", original sentiment labels: ", item['judgement'][:-1])
    print('------------------------------------------------------')

Target word:  Fordin
sentence:  Tämä . BMW provosoi tarpeettomiin ohituksiin . Liikennevaloissa lähtiessäkin on usein yhden hengen kiihdytyskisat pystyssä . Fordin ratissa ei tarvitse katsella tätä , vaikka ajotyyli ja nopeus on samoja . Onko tämä sitä kateutta ?
predicted:  ['positive'] , original sentiment labels:  [['positive', 0.5], ['neutral', 0.5]]
------------------------------------------------------
Target word:  skoda octavia rs : t
sentence:  Vaikka uudet skoda octavia rs : t ovat ihan mukavan näköisiä , on silti aika absurdia sanoa semmoisen olevan unelma - auto . Noh , kukin tavallaan .
predicted:  positive , original sentiment labels:  ['mixed']
------------------------------------------------------
Target word:  subarussa
sentence:  No audissa on 142mm ja subarussa 144mm ? Ja onhan tossa normaalissa kutosessa pitkä ja pehmeä jousitus . Toteutettu vielä monitukivarsilla mäcpearsonin sijasta . Varmasti kokonaisuudessaan antaa paremmat kyydit mitä tollanen imppu .
predicted

As the output above shows, the model assigns a single sentiment to the cases that had no definitive label. In most cases its prediction coincides with one of the sentiments originally annotated for the sentence, but in some cases the model assigns a different label. This happens especially when the original label was 'mixed' or 'unclear', as these labels were excluded from the model for being ambiguous. But as the last row shows, there is room for improvement...
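
To turn "in most cases" into a number, one could count how often the predicted label is among the original labels. The sketch below assumes the `judgement` shapes visible in the printed output: original labels stored either as plain strings or as `[label, weight]` pairs, with the prediction as the last entry, either bare or wrapped in a one-element list. The helpers `label_of` and `prediction_agrees` are hypothetical, and the toy items only mirror those shapes.

```python
def label_of(entry):
    """Return the label from a bare string, a [label, weight] pair, or a [label] list."""
    return entry[0] if isinstance(entry, (list, tuple)) else entry


def prediction_agrees(judgement):
    """True if the last entry (the prediction) matches one of the original labels."""
    *originals, predicted = judgement
    return label_of(predicted) in {label_of(original) for original in originals}


# Toy items (hypothetical) mirroring the two judgement shapes seen in the output
toy_items = [
    {"judgement": [["positive", 0.5], ["neutral", 0.5], ["positive"]]},
    {"judgement": ["mixed", "positive"]},
]

agreeing = sum(prediction_agrees(item["judgement"]) for item in toy_items)
agreement_rate = agreeing / len(toy_items)
print(agreement_rate)
```

Items whose only original label is 'mixed' or 'unclear' can never agree, since the model was not trained on those labels, so it may be more informative to report them separately.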