# An all-nltk basic approach

In this notebook I present a very basic, all-nltk approach to the problem. It does not perform as well as neural-network models, but it can be a good starting point for beginners to grasp what is happening.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tweet-sentiment-extraction/train.csv
/kaggle/input/tweet-sentiment-extraction/test.csv
/kaggle/input/tweet-sentiment-extraction/sample_submission.csv


The evaluation metric, as defined in the competition's evaluation rules, is the word-level Jaccard score:

In [2]:
def jaccard(str1, str2): 
    str1, str2 = str(str1), str(str2)
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
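
As a quick worked example, the sketch below applies the same function to one of the training tweets displayed further down (`Sooo SAD I will miss you here in San Diego!!!`, with selection `Sooo SAD`). The function is repeated only so that the snippet runs on its own.

```python
def jaccard(str1, str2):
    # Same word-level Jaccard score as above, repeated to keep the snippet self-contained
    str1, str2 = str(str1), str(str2)
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

text = "Sooo SAD I will miss you here in San Diego!!!"
selected = "Sooo SAD"

# 10 distinct words in the tweet, 2 in the selection, 2 shared: 2 / (10 + 2 - 2)
example_score = jaccard(text, selected)
print(example_score)  # 0.2
```

Words are split on whitespace only, so `diego!!!` counts as one word, punctuation included. Also note that the function would divide by zero if both strings were empty.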

Let's load our data.

In [3]:
train = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/train.csv")
test = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/test.csv")
sample_submission = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/sample_submission.csv")

In [4]:
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


We notice that for neutral sentiment, the selected text is nearly always the text itself. Let's check:

In [5]:
v1 = train.loc[train.sentiment=='neutral', 'text'].values.tolist()
v2 = train.loc[train.sentiment=='neutral', 'selected_text'].values.tolist()
np.mean([jaccard(w1, w2) for w1, w2 in zip(v1, v2)])

0.9764467881939698

Therefore, for the test set, returning the text itself for neutral tweets seems a good strategy.
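
To get a feel for how this "return the whole text" baseline behaves for each sentiment, here is a minimal sketch on a toy frame built from the first four training rows shown above (the fifth row is truncated in the display, so it is left out). The column name `baseline_jaccard` is an assumption made for illustration.

```python
import pandas as pd

def jaccard(str1, str2):
    # Word-level Jaccard score, repeated to keep the snippet self-contained
    str1, str2 = str(str1), str(str2)
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

toy = pd.DataFrame({
    "text": [
        "I`d have responded, if I were going",
        "Sooo SAD I will miss you here in San Diego!!!",
        "my boss is bullying me...",
        "what interview! leave me alone",
    ],
    "selected_text": [
        "I`d have responded, if I were going",
        "Sooo SAD",
        "bullying me",
        "leave me alone",
    ],
    "sentiment": ["neutral", "negative", "negative", "negative"],
})

# Score obtained when the full text is predicted for every row
toy["baseline_jaccard"] = [
    jaccard(t, s) for t, s in zip(toy["text"], toy["selected_text"])
]

# Average the baseline score separately for each sentiment label
per_sentiment = toy.groupby("sentiment")["baseline_jaccard"].mean()
print(per_sentiment)
```

On these four rows the neutral score is perfect while the negative rows score much lower, which is why the other sentiments need a different treatment. Running the same `groupby` on the full `train` frame would give the real per-sentiment figures.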

In [6]:
test.head()

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


In [7]:
sample_submission.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,
1,96d74cb729,
2,eee518ae67,
3,01082688c6,
4,33987a8ee5,


As noted above, let's first return the whole original text for all neutral-labelled samples.

In [8]:
# IDs of the neutral test tweets (a set keeps the membership test cheap)
isNeutral = set(test.loc[test['sentiment'] == 'neutral', 'textID'])
def get_selected_text_neutral(textID, df=test):
    # Neutral tweets get their full text; the others are filled in later
    if textID in isNeutral:
        return df.loc[df.textID==textID, 'text'].values.tolist()[0]
    else:
        return np.nan
def treat_neutral(sample_submission):
    sample_submission['selected_text'] = sample_submission['textID'].apply(get_selected_text_neutral, df=test)
    return sample_submission
sample_submission = treat_neutral(sample_submission)
sample_submission.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,Last session of the day http://twitpic.com/67ezh
1,96d74cb729,
2,eee518ae67,
3,01082688c6,
4,33987a8ee5,
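
The function above looks up each `textID` with a scan over the DataFrame, which is fine at this dataset size but gets slow on larger ones. A possible vectorized alternative, sketched on a two-row toy frame (the names `test_toy` and `submission_toy` are assumptions made for illustration, not part of the notebook):

```python
import pandas as pd

test_toy = pd.DataFrame({
    "textID": ["f87dea47db", "01082688c6"],
    "text": ["Last session of the day http://twitpic.com/67ezh", "happy bday!"],
    "sentiment": ["neutral", "positive"],
})
submission_toy = pd.DataFrame({"textID": ["f87dea47db", "01082688c6"]})

# Left-join the tweets onto the submission rows (keeps the submission order)
merged = submission_toy[["textID"]].merge(test_toy, on="textID", how="left")

# Keep the text where the sentiment is neutral, missing value elsewhere
submission_toy["selected_text"] = (
    merged["text"].where(merged["sentiment"] == "neutral").to_numpy()
)
print(submission_toy)
```

This assumes each `textID` appears only once in the test frame; otherwise the merge would duplicate rows.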


Now, let's handle positive and negative texts with the help of nltk and its SentiWordNet corpus.

In [9]:
from nltk import pos_tag, ngrams
# nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn, wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()

In [10]:
import string
filtre = [wn.NOUN, wn.ADJ, wn.ADV, wn.VERB]

In [11]:
def penn_to_wn(tag):
    """
    Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    else:
        return None
        
def get_sentiment(word, tag, verbose=0):
    """ Returns [mean positive score, mean negative score] over the word's synsets. Returns an empty list if the tag is not a noun, adjective, adverb or verb, or if the word has no WordNet synsets. """
    wn_tag = penn_to_wn(tag)
    if wn_tag not in filtre:
        return []

    lemma = lemmatizer.lemmatize(word, pos=wn_tag)
    if verbose:
        print(f'Lemmatizer : {lemma}')
    if not lemma:
        return []

    synsets = wn.synsets(word, pos=wn_tag)
    if verbose:
        print(f'Synsets : {synsets}')
    if not synsets:
        return []

    swn_synset_pos = []
    swn_synset_neg = []
    for synset in synsets:
        swn_synset = swn.senti_synset(synset.name())
        if verbose:
            print(f'Pos score : {swn_synset.pos_score()}, Neg score : {swn_synset.neg_score()}')
        swn_synset_pos.append(swn_synset.pos_score())
        swn_synset_neg.append(swn_synset.neg_score())
    return [np.mean(swn_synset_pos),np.mean(swn_synset_neg)]#,swn_synset.obj_score()

def robustify(text=''):
    # Coerce non-string input (e.g. NaN) to a string; fall back to '' if that fails
    if not isinstance(text, str):
        try:
            text = str(text)
        except Exception:
            text = ''
    return text

def score(text='', verbose=0):
    text = robustify(text)
    for dot in string.punctuation:
        text = text.replace(dot,'')
    tokenized_text = word_tokenize(text)
    if verbose:
        print(f'Tokenized text : {tokenized_text}')
#     stemmed_text = [ps.stem(x) for x in tokenized_text]
#     print(f'Stemmed text : {stemmed_text}')
#     tags = pos_tag(stemmed_text)
    tags = pos_tag(tokenized_text)
    senti_val = [(x.lower(), get_sentiment(x.lower(),y, verbose)) for (x,y) in tags]
    senti_val = list(filter(lambda x : len(x[1])>0, senti_val))
    return senti_val
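
The first step of `score` removes every character of `string.punctuation` before tokenizing. The sketch below isolates that step so its effect is easier to see. It uses `str.translate` in place of the `replace` loop and `str.split` as a stand-in for `word_tokenize`, so that it runs without any NLTK data; both substitutions, and the helper name `strip_punctuation`, are assumptions made for illustration.

```python
import string

def strip_punctuation(text):
    # Delete every ASCII punctuation character in a single pass
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = strip_punctuation("that`s great!! weee!! visitors!")
tokens = cleaned.split()  # stand-in for word_tokenize on punctuation-free text
print(tokens)  # ['thats', 'great', 'weee', 'visitors']

# URLs lose their separators as well and collapse into a single token
print(strip_punctuation("http://twitpic.com/4w75p - I like it!!"))
```

The backtick is part of `string.punctuation`, so ``that`s`` collapses into `thats`, a token WordNet does not know.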

In [12]:
score('that`s great!! weee!! visitors!', verbose=1)

Tokenized text : ['thats', 'great', 'weee', 'visitors']
Lemmatizer : thats
Synsets : []
Lemmatizer : great
Synsets : [Synset('great.s.01'), Synset('great.s.02'), Synset('great.s.03'), Synset('bang-up.s.01'), Synset('capital.s.03'), Synset('big.s.13')]
Pos score : 0.0, Neg score : 0.0
Pos score : 0.75, Neg score : 0.0
Pos score : 0.25, Neg score : 0.125
Pos score : 0.875, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Lemmatizer : weee
Synsets : []
Lemmatizer : visitor
Synsets : [Synset('visitor.n.01')]
Pos score : 0.0, Neg score : 0.0


[('great', [0.3125, 0.020833333333333332]), ('visitors', [0.0, 0.0])]
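
The final scores for `great` are plain averages over its six synsets. The arithmetic can be checked by hand with the per-synset values copied from the verbose output above:

```python
# Per-synset scores for 'great', copied from the verbose output above
pos_scores = [0.0, 0.75, 0.25, 0.875, 0.0, 0.0]
neg_scores = [0.0, 0.0, 0.125, 0.0, 0.0, 0.0]

# Unweighted mean over all senses of the word
mean_pos = sum(pos_scores) / len(pos_scores)
mean_neg = sum(neg_scores) / len(neg_scores)
print(mean_pos, mean_neg)  # 0.3125 and roughly 0.0208
```

Because every sense counts equally, a strongly positive sense (0.875 here) is diluted by the senses that carry no sentiment at all.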

In [13]:
score('happy bday!', verbose=1)

Tokenized text : ['happy', 'bday']
Lemmatizer : happy
Synsets : [Synset('happy.a.01'), Synset('felicitous.s.02'), Synset('glad.s.02'), Synset('happy.s.04')]
Pos score : 0.875, Neg score : 0.0
Pos score : 0.75, Neg score : 0.0
Pos score : 0.5, Neg score : 0.0
Pos score : 0.125, Neg score : 0.0
Lemmatizer : bday
Synsets : []


[('happy', [0.5625, 0.0])]

In [14]:
score('Recession hit Veronique Branquinho', verbose=1)

Tokenized text : ['Recession', 'hit', 'Veronique', 'Branquinho']
Lemmatizer : recession
Synsets : [Synset('recession.n.01'), Synset('recess.n.02'), Synset('recession.n.03'), Synset('recession.n.04'), Synset('receding.n.02')]
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Lemmatizer : hit
Synsets : [Synset('hit.v.01'), Synset('hit.v.02'), Synset('hit.v.03'), Synset('reach.v.01'), Synset('hit.v.05'), Synset('shoot.v.01'), Synset('stumble.v.03'), Synset('score.v.01'), Synset('hit.v.09'), Synset('strike.v.04'), Synset('murder.v.01'), Synset('hit.v.12'), Synset('reach.v.02'), Synset('strike.v.10'), Synset('hit.v.15'), Synset('hit.v.16'), Synset('hit.v.17')]
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg score : 0.0
Pos score : 0.0, Neg sc

[('recession', [0.0, 0.0]), ('hit', [0.0, 0.051470588235294115])]

In [15]:
test['senti_val'] = test['text'].apply(score)

In [16]:
test.loc[test.sentiment != 'neutral'].head()

Unnamed: 0,textID,text,sentiment,senti_val
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,"[(shanghai, [0.0, 0.0]), (is, [0.0288461538461..."
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,"[(recession, [0.0, 0.0]), (hit, [0.0, 0.051470..."
3,01082688c6,happy bday!,positive,"[(happy, [0.5625, 0.0])]"
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,[]
5,726e501993,that`s great!! weee!! visitors!,positive,"[(great, [0.3125, 0.020833333333333332]), (vis..."


In [17]:
def treat_senti_val(sentiment, senti_val):
    if sentiment == 'neutral':
        return []
    sent = 0 if sentiment=='positive' else 1
    return [(t[0], t[1][sent]) for t in senti_val] # if t[1][sent]>0
test['senti_val'] = test.apply(lambda df: treat_senti_val(df.sentiment, df.senti_val), axis=1)
test.head(20)

Unnamed: 0,textID,text,sentiment,senti_val
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,[]
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,"[(shanghai, 0.0), (is, 0.028846153846153848), ..."
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,"[(recession, 0.0), (hit, 0.051470588235294115)..."
3,01082688c6,happy bday!,positive,"[(happy, 0.5625)]"
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,[]
5,726e501993,that`s great!! weee!! visitors!,positive,"[(great, 0.3125), (visitors, 0.0)]"
6,261932614e,I THINK EVERYONE HATES ME ON HERE lol,negative,"[(think, 0.009615384615384616), (hates, 0.375)..."
7,afa11da83f,"soooooo wish i could, but im in school and my...",negative,"[(wish, 0.0), (i, 0.0), (school, 0.0), (is, 0...."
8,e64208b4ef,and within a short time of the last clue all ...,neutral,[]
9,37bcad24ca,What did you get? My day is alright.. haven`...,neutral,[]
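
To make the selection step concrete, the sketch below runs the same logic on the scores obtained earlier for `that`s great!! weee!! visitors!`. The function is repeated so that the snippet is self-contained; the variable names are assumptions made for illustration.

```python
def treat_senti_val(sentiment, senti_val):
    # Same logic as above: index 0 is the positive score, index 1 the negative one
    if sentiment == 'neutral':
        return []
    sent = 0 if sentiment == 'positive' else 1
    return [(word, scores[sent]) for word, scores in senti_val]

scored = [('great', [0.3125, 0.020833333333333332]), ('visitors', [0.0, 0.0])]

as_positive = treat_senti_val('positive', scored)
as_negative = treat_senti_val('negative', scored)
as_neutral = treat_senti_val('neutral', scored)

print(as_positive)  # [('great', 0.3125), ('visitors', 0.0)]
print(as_negative)  # [('great', 0.020833333333333332), ('visitors', 0.0)]
print(as_neutral)   # []
```

Only the score matching the tweet's label is kept, so a positive tweet ignores the negative scores of its words, and the other way round.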


If no sentiment scores were returned, we fall back to the original text. Note that this code also handles the neutral case again, which we already dealt with at the beginning, since neutral rows have an empty score list.

In [18]:
def get_selected_text(text, senti_val):
    if len(senti_val)==0:
        return text
    else:
        return ' '.join([t[0] for t in senti_val if t[1]>0.1])
test['selected_text'] = test.apply(lambda df:get_selected_text(df['text'],df['senti_val']), axis=1)
test.head()

Unnamed: 0,textID,text,sentiment,senti_val,selected_text
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,[],Last session of the day http://twitpic.com/67ezh
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,"[(shanghai, 0.0), (is, 0.028846153846153848), ...",really exciting precisely good china
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,"[(recession, 0.0), (hit, 0.051470588235294115)...",shame
3,01082688c6,happy bday!,positive,"[(happy, 0.5625)]",happy
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,[],http://twitpic.com/4w75p - I like it!!
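
One edge case is worth keeping in mind: when a tweet has scored words but none of them exceeds 0.1, joining an empty list yields an empty string. The sketch below reproduces this and shows one possible variation, a hypothetical `get_selected_text_with_fallback` that returns the full text in that situation. This variant is not part of the notebook, and its `threshold` parameter is an assumption too.

```python
def get_selected_text(text, senti_val):
    # Original logic, repeated to keep the snippet self-contained
    if len(senti_val) == 0:
        return text
    return ' '.join([t[0] for t in senti_val if t[1] > 0.1])

def get_selected_text_with_fallback(text, senti_val, threshold=0.1):
    # Hypothetical variant: return the full text when no word passes the threshold
    kept = [word for word, value in senti_val if value > threshold]
    return ' '.join(kept) if kept else text

text = "Recession hit Veronique Branquinho"
senti_val = [('recession', 0.0), ('hit', 0.051470588235294115)]

original = get_selected_text(text, senti_val)
with_fallback = get_selected_text_with_fallback(text, senti_val)
print(repr(original))       # ''
print(repr(with_fallback))  # 'Recession hit Veronique Branquinho'
```

Whether the fallback actually improves the score would have to be checked on the training set.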


In [19]:
for i, row in test.loc[test.sentiment!='neutral'].iterrows():
    sample_submission.loc[sample_submission.textID==row['textID'], 'selected_text'] = row['selected_text']
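
The row-by-row loop above works, but the same update can probably be written without `iterrows`. A minimal sketch on toy frames, assuming each `textID` is unique (the names `test_toy`, `submission_toy` and `lookup` are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

test_toy = pd.DataFrame({
    "textID": ["f87dea47db", "01082688c6"],
    "sentiment": ["neutral", "positive"],
    "selected_text": ["Last session of the day http://twitpic.com/67ezh", "happy"],
})
submission_toy = pd.DataFrame({
    "textID": ["f87dea47db", "01082688c6"],
    "selected_text": ["Last session of the day http://twitpic.com/67ezh", np.nan],
})

# Lookup table textID -> selected_text, restricted to non-neutral tweets
lookup = (
    test_toy.loc[test_toy["sentiment"] != "neutral"]
    .set_index("textID")["selected_text"]
)

# Map each submission ID to its new value; IDs absent from the lookup keep their old value
mapped = submission_toy["textID"].map(lookup)
submission_toy["selected_text"] = mapped.fillna(submission_toy["selected_text"])
print(submission_toy)
```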

In [20]:
sample_submission.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,Last session of the day http://twitpic.com/67ezh
1,96d74cb729,really exciting precisely good china
2,eee518ae67,shame
3,01082688c6,happy
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!


In [21]:
sample_submission.to_csv('submission.csv', index=False)
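
Before submitting, it can be useful to read the file back and look for empty selections. The sketch below does this on an in-memory toy frame rather than on the real `submission.csv`; the frame contents and variable names are assumptions made for illustration.

```python
import io
import pandas as pd

submission_toy = pd.DataFrame({
    "textID": ["01082688c6", "eee518ae67"],
    "selected_text": ["happy", ""],  # an empty selection, for illustration
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
submission_toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Empty strings are written as empty fields and read back as missing values
n_missing = int(reloaded["selected_text"].isna().sum())
print(list(reloaded.columns), n_missing)
```

A non-zero count points to rows where the selection ended up empty.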