# SNLP 2019 Final Project:

## Case-study of Offensive Language Classification from Online Tweets

**Team members**

|Name|Matriculation Number|Email address|
|----|--------------------|-------------|
|Jyotsna Singh|2576744|s8jysing@stud.uni-saarland.de|
|Soumya Ranjan Sahoo|2576610|s8sosaho@stud.uni-saarland.de|
|Sourav Dutta|2576494|s8sodutt@stud.uni-saarland.de|

## Method 1: Multinomial Naive Bayes classification

### Load necessary libraries

In [1]:
# import necessary libraries

import re
import os
import sys
import subprocess as sp

import numpy as np
import pandas as pd

### Load data

In [2]:
def load_data(data_file_path):
    return pd.read_csv(data_file_path, sep='\t', names=['Sentiment', 'Text'], header=None)

In [3]:
# load data files

train_file = 'offensive_dataset/train.tsv'
dev_file   = 'offensive_dataset/dev.tsv'

train_data = load_data(train_file)
dev_data   = load_data(dev_file)

train_data['y'] = np.where(train_data['Sentiment'] == 'OFF', 1, 0)
dev_data['y'] = np.where(dev_data['Sentiment'] == 'OFF', 1, 0)

In [4]:
print('Train data:\n{}'.format(train_data.head()))
print('\nDev data:\n{}'.format(dev_data.head()))

Train data:
  Sentiment                                               Text  y
0       NOT                             @USER @USER Yeah he is  0
1       OFF  @USER @USER @USER I'm assuming that @USER is a...  1
2       NOT     @USER you are not the true Columbia Bugle. Gfy  0
3       NOT  @USER @USER There ain't no MAGA hats sold here...  0
4       OFF                              @USER Lmao fuck u bae  1

Dev data:
  Sentiment                                               Text  y
0       NOT  @USER He never was one much for the rule of la...  0
1       OFF  @USER I think Donald Trump is a disgusting dis...  1
2       NOT                           @USER Um bc she is??????  0
3       OFF  @USER @USER @USER @USER @USER @USER @USER @USE...  1
4       NOT  @USER It would be so nice if the Trump support...  0


In [5]:
# extract tweet text into a separate file for tokenization

def extract_data_to_file(data, file_name):
    with open(file_name, 'w', encoding='utf-8') as file:
        for text in data['Text']:
            file.write(text + '\n')
    print('Text extracted to file "{}".'.format(file_name))

extract_data_to_file(train_data, 'train_text.txt')
extract_data_to_file(dev_data, 'dev_text.txt')

Text extracted to file "train_text.txt".
Text extracted to file "dev_text.txt".


In [6]:
# remove local file

def remove_file(file):
    if os.path.exists(file):
        os.remove(file)
    else:
        print("File '{}' does not exist! Try again.".format(file))
        sys.exit()

### Extra step of tokenizing multiple emojis into individual emoji characters

In [7]:
# tokenize multi-emojis into separate emojis

def tokenize_emojis(text):
    
    RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)
    
    result = ''
    for word in text.split():
        for char in word:
            if re.match(RE_EMOJI, char):
                result += ' ' + char + ' '
            else:
                result += char
        result += ' '
    return result

In [8]:
# testing the tokenization process of multi-emojis

texts = [
    '@USER @USER @USER Everyone muted you ...  Thanks for working for free 😂😂😂  #MAGA',
    '.@USER Today I can announce that new longer-term partnerships will be opened up to the most ambitious housing associations through a ground-breaking £2 billion initiative...the first time any government has offered housing associations such long-term certainty"  🏘🏘🏘🏘 URL',
    '6Lack just made it on my sex playlist😭😭🔥🔥 baby wam wherever you are get ready to get pregnant cause wow😭😭🔥',
    '@USER @USER @USER @USER @USER @USER @USER @USER @USER @USER You are so beautiful !!😢😍Xx '
]

for text in texts:
    print()
    print(text)
    print(tokenize_emojis(text))


@USER @USER @USER Everyone muted you ...  Thanks for working for free 😂😂😂  #MAGA
@USER @USER @USER Everyone muted you ... Thanks for working for free  😂  😂  😂  #MAGA 

.@USER Today I can announce that new longer-term partnerships will be opened up to the most ambitious housing associations through a ground-breaking £2 billion initiative...the first time any government has offered housing associations such long-term certainty"  🏘🏘🏘🏘 URL
.@USER Today I can announce that new longer-term partnerships will be opened up to the most ambitious housing associations through a ground-breaking £2 billion initiative...the first time any government has offered housing associations such long-term certainty"  🏘  🏘  🏘  🏘  URL 

6Lack just made it on my sex playlist😭😭🔥🔥 baby wam wherever you are get ready to get pregnant cause wow😭😭🔥
6Lack just made it on my sex playlist 😭  😭  🔥  🔥  baby wam wherever you are get ready to get pregnant cause wow 😭  😭  🔥  

@USER @USER @USER @USER @USER @USER @USER @USER 

In [9]:
# takes around 5 seconds to run

def tokenize(file, data):
    process = sp.Popen(['sh', 'twokenize.sh', file], stdout=sp.PIPE)
    with open('tokenized.txt', 'wb') as write_file:
        for line in process.stdout.readlines():
            write_file.write(line)
    tokenized = []
    with open('tokenized.txt', 'r', encoding='utf-8') as read_file:
        for line in read_file:
            tokenized.append(tokenize_emojis(line.split('\t')[0]))
    print('\nTokenization of file "{}" done.'.format(file))
    return tokenized

tokenized_train_data = tokenize('train_text.txt', train_data)
tokenized_dev_data = tokenize('dev_text.txt', dev_data)


Tokenization of file "train_text.txt" done.

Tokenization of file "dev_text.txt" done.


In [10]:
print('Number of samples in training data: \t', len(tokenized_train_data))
print('Number of samples in dev data: \t\t', len(tokenized_dev_data))

Number of samples in training data: 	 11240
Number of samples in dev data: 		 1000


### Preparing supervised classifier model pipeline

In [11]:
#labels NOT-> 0, OFF -> 1

from sklearn import preprocessing

train_labels =  train_data['Sentiment']
le = preprocessing.LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
type(train_labels)


dev_labels =  dev_data['Sentiment']
le = preprocessing.LabelEncoder()
le.fit(dev_labels)
dev_labels = le.transform(dev_labels)
print('Number of labels in Dev data: ', len(dev_labels))

Number of labels in Dev data:  1000


In [12]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV

text_clf = Pipeline([('vect', CountVectorizer()),
                     
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    
    'clf__alpha': [1, 1e-1, 1e-2]
}

In [13]:
# Multinomial Naive Bayes  (without tf-idf) (takes less than 1 minute to run)

from sklearn.metrics import classification_report

score = 'f1_macro'
print("# Tuning hyper-parameters for %s" % score)
print()
np.errstate(divide='ignore')
clf = GridSearchCV(text_clf, tuned_parameters, cv=3, scoring=score)
clf.fit(tokenized_train_data, train_labels)

print("Best parameters set found on training set:")
print()
print(clf.best_params_)
print()
print("Grid scores on training set:")
print()
for mean, std, params in zip(clf.cv_results_['mean_test_score'], 
                             clf.cv_results_['std_test_score'], 
                             clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full training set.")
print("The scores are computed on the full development set.")
print()
print(classification_report(dev_labels, clf.predict(tokenized_dev_data), digits=4))
print()

# Tuning hyper-parameters for f1_macro

Best parameters set found on training set:

{'clf__alpha': 0.1, 'vect__ngram_range': (1, 1)}

Grid scores on training set:

0.659 (+/-0.007) for {'clf__alpha': 1, 'vect__ngram_range': (1, 1)}
0.561 (+/-0.015) for {'clf__alpha': 1, 'vect__ngram_range': (1, 2)}
0.542 (+/-0.011) for {'clf__alpha': 1, 'vect__ngram_range': (2, 2)}
0.665 (+/-0.013) for {'clf__alpha': 0.1, 'vect__ngram_range': (1, 1)}
0.644 (+/-0.008) for {'clf__alpha': 0.1, 'vect__ngram_range': (1, 2)}
0.593 (+/-0.008) for {'clf__alpha': 0.1, 'vect__ngram_range': (2, 2)}
0.652 (+/-0.011) for {'clf__alpha': 0.01, 'vect__ngram_range': (1, 1)}
0.626 (+/-0.006) for {'clf__alpha': 0.01, 'vect__ngram_range': (1, 2)}
0.591 (+/-0.009) for {'clf__alpha': 0.01, 'vect__ngram_range': (2, 2)}

Detailed classification report:

The model is trained on the full training set.
The scores are computed on the full development set.

             precision    recall  f1-score   support

          0     0.64

### Some sample wrong classifications

In [14]:
#wrong predictions

list_diff = dev_labels - clf.predict(tokenized_dev_data)
list_diff = list(list_diff)
list_1 = [i for i, e in enumerate(list_diff) if e == 1] #actual 1(off)
list_0 = [i for i, e in enumerate(list_diff) if e == -1] #actual 0(not)
    
# 6 wrong predictions
def no_of_wrong_preds(num):
    print("Printing % 3d wrong predictions :\n" %(num))
    for i in range (int(num/2)):
        print(train_data['Text'][list_1[i]])
        print()
        print()
        print(train_data['Text'][list_0[i]])
        print()
        print()

no_of_wrong_preds(6)

Printing   6 wrong predictions :

@USER @USER There ain't no MAGA hats sold here but that don't stop me from wanting to MAGA.


@USER Lmao fuck u bae


@USER But I thought Antifa were actually fascists but this shows Rethuglicans are the actual fascists in the room.


@USER @USER @USER Where is VAN? She is ARMY who need to shazam the most you know... 🔍🐾🐶🐾🔎


#LeviStrauss Takes Stand On #GunControl #CBS #SanFrancisco URL #NoMoreLevis #guns #NRA #USA @USER #GOP #libertarian #conservative


@USER @USER Right?! I wanna know why she on the phone calling the police anyway? what did he do"? Fuckin "Run Me Over Rhonda" lol"




### Parameters of the Naive Bayes classifier

In [15]:
clf.get_params()

{'cv': 3,
 'error_score': 'raise',
 'estimator__memory': None,
 'estimator__steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None)),
  ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
 'estimator__vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None),
 'estimator__clf': MultinomialNB(alpha

### Test predictions using Method 1 (Multinomial Naive Bayes)

In [16]:
# load test data

test_file  = 'offensive_dataset/test.tsv'
test_data = pd.read_csv(test_file, names=['Text'], header=None)

In [17]:
test_data.head()

Unnamed: 0,Text
0,@USER This stinks like week old fish she is a ...
1,@USER @USER Wrong 60% didn't vote PC so that d...
2,@USER Goodness. Your wife and ex-gf were both ...
3,@USER @USER And yet he is allowed to. Congress...
4,@USER @USER #Westminster @USER #Tories @USER @...


In [18]:
# extract tests data to file

extract_data_to_file(test_data, 'test_text.txt')

# tokenize the test data

tokenized_test_data = tokenize('test_text.txt', test_data)

Text extracted to file "test_text.txt".

Tokenization of file "test_text.txt" done.


In [19]:
print(tokenized_test_data[1])
print()
print('Number of test data samples: ', len(tokenized_test_data))

@USER @USER Wrong 60% didn't vote PC so that doesn't mean the 60% who didn't should shut up . Likewise a 1/3 voted NDP so not a majority by any means but a sizeable minority . Actually Liberals and Greens have similar position too and vote wise those three are majority . 

Number of test data samples:  1000


In [20]:
test_predictions = clf.predict(tokenized_test_data)

In [21]:
print('Total labels: {}'.format(len(test_predictions)))
print('Total class "0" labels: {}'.format(len(test_predictions) - sum(test_predictions)))
print('Total class "1" labels: {}'.format(sum(test_predictions)))

Total labels: 1000
Total class "0" labels: 627
Total class "1" labels: 373


In [22]:
with open('predictions.test', 'w', encoding='utf-8') as test_pred_file:
    for pred in test_predictions:
        test_pred_file.write('OFF\n' if pred == 1 else 'NOT\n')
    print('Test predictions written to file.')

Test predictions written to file.


## Method 2: Multinomial Naive Bayes + _tf-idf_

In [23]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

In [24]:
# with tf-idf (takes less than 1 minute to run)

from sklearn.metrics import classification_report

score = 'f1_macro'
print("# Tuning hyper-parameters for %s" % score)
print()
np.errstate(divide='ignore')
clf = GridSearchCV(text_clf, tuned_parameters, cv=2, scoring=score)
clf.fit(tokenized_train_data, train_labels)

print("Best parameters set found on training set:")
print()
print(clf.best_params_)
print()
print("Grid scores on training set:")
print()
for mean, std, params in zip(clf.cv_results_['mean_test_score'], 
                             clf.cv_results_['std_test_score'], 
                             clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full training set.")
print("The scores are computed on the full development set.")
print()
print(classification_report(dev_labels, clf.predict(tokenized_dev_data), digits=4))
print()

# Tuning hyper-parameters for f1_macro



  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Best parameters set found on training set:

{'clf__alpha': 0.01, 'tfidf__norm': 'l2', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}

Grid scores on training set:

0.418 (+/-0.003) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
0.412 (+/-0.001) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
0.417 (+/-0.000) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (2, 2)}
0.411 (+/-0.001) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}
0.411 (+/-0.000) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
0.417 (+/-0.001) for {'clf__alpha': 1, 'tfidf__norm': 'l1', 'tfidf__use_idf': False, 'vect__ngram_range': (2, 2)}
0.469 (+/-0.008) for {'clf__alpha': 1, 'tfidf__norm': 'l2', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
0.433 (+/-0.002) for {'clf__alpha': 

## Method 3: Advanced Analysis: FastText _(Joulin et. al.)_

**NOTE:** This FastText implementation is carried out with minimal values of its hyperparameters (learning rate: 0.1, epochs: 20, dimensions: 20, character ngrams: bigrams). Using higher values of certain hyperparameters may increase its overall accuracy values.

In [25]:
import fasttext
import emoji

import nltk
import csv
import datetime
from bs4 import BeautifulSoup
import itertools

import pandas as pd
import numpy as np

In [26]:
#PREPROCESSING


def load_dict_smileys():
    
    return {
        ":‑)":"smiley",
        ":-]":"smiley",
        ":-3":"smiley",
        ":->":"smiley",
        "8-)":"smiley",
        ":-}":"smiley",
        ":)":"smiley",
        ":]":"smiley",
        ":3":"smiley",
        ":>":"smiley",
        "8)":"smiley",
        ":}":"smiley",
        ":o)":"smiley",
        ":c)":"smiley",
        ":^)":"smiley",
        "=]":"smiley",
        "=)":"smiley",
        ":-))":"smiley",
        ":‑D":"smiley",
        "8‑D":"smiley",
        "x‑D":"smiley",
        "X‑D":"smiley",
        ":D":"smiley",
        "8D":"smiley",
        "xD":"smiley",
        "XD":"smiley",
        ":‑(":"sad",
        ":‑c":"sad",
        ":‑<":"sad",
        ":‑[":"sad",
        ":(":"sad",
        ":c":"sad",
        ":<":"sad",
        ":[":"sad",
        ":-||":"sad",
        ">:[":"sad",
        ":{":"sad",
        ":@":"sad",
        ">:(":"sad",
        ":'‑(":"sad",
        ":'(":"sad",
        ":‑P":"playful",
        "X‑P":"playful",
        "x‑p":"playful",
        ":‑p":"playful",
        ":‑Þ":"playful",
        ":‑þ":"playful",
        ":‑b":"playful",
        ":P":"playful",
        "XP":"playful",
        "xp":"playful",
        ":p":"playful",
        ":Þ":"playful",
        ":þ":"playful",
        ":b":"playful",
        "<3":"love"
        }


def load_dict_contractions():
    
    return {
        "ain't":"is not",
        "amn't":"am not",
        "aren't":"are not",
        "can't":"cannot",
        "'cause":"because",
        "couldn't":"could not",
        "couldn't've":"could not have",
        "could've":"could have",
        "daren't":"dare not",
        "daresn't":"dare not",
        "dasn't":"dare not",
        "didn't":"did not",
        "doesn't":"does not",
        "don't":"do not",
        "e'er":"ever",
        "em":"them",
        "everyone's":"everyone is",
        "finna":"fixing to",
        "gimme":"give me",
        "gonna":"going to",
        "gon't":"go not",
        "gotta":"got to",
        "hadn't":"had not",
        "hasn't":"has not",
        "haven't":"have not",
        "he'd":"he would",
        "he'll":"he will",
        "he's":"he is",
        "he've":"he have",
        "how'd":"how would",
        "how'll":"how will",
        "how're":"how are",
        "how's":"how is",
        "I'd":"I would",
        "I'll":"I will",
        "I'm":"I am",
        "I'm'a":"I am about to",
        "I'm'o":"I am going to",
        "isn't":"is not",
        "it'd":"it would",
        "it'll":"it will",
        "it's":"it is",
        "I've":"I have",
        "kinda":"kind of",
        "let's":"let us",
        "mayn't":"may not",
        "may've":"may have",
        "mightn't":"might not",
        "might've":"might have",
        "mustn't":"must not",
        "mustn't've":"must not have",
        "must've":"must have",
        "needn't":"need not",
        "ne'er":"never",
        "o'":"of",
        "o'er":"over",
        "ol'":"old",
        "oughtn't":"ought not",
        "shalln't":"shall not",
        "shan't":"shall not",
        "she'd":"she would",
        "she'll":"she will",
        "she's":"she is",
        "shouldn't":"should not",
        "shouldn't've":"should not have",
        "should've":"should have",
        "somebody's":"somebody is",
        "someone's":"someone is",
        "something's":"something is",
        "that'd":"that would",
        "that'll":"that will",
        "that're":"that are",
        "that's":"that is",
        "there'd":"there would",
        "there'll":"there will",
        "there're":"there are",
        "there's":"there is",
        "these're":"these are",
        "they'd":"they would",
        "they'll":"they will",
        "they're":"they are",
        "they've":"they have",
        "this's":"this is",
        "those're":"those are",
        "'tis":"it is",
        "'twas":"it was",
        "wanna":"want to",
        "wasn't":"was not",
        "we'd":"we would",
        "we'd've":"we would have",
        "we'll":"we will",
        "we're":"we are",
        "weren't":"were not",
        "we've":"we have",
        "what'd":"what did",
        "what'll":"what will",
        "what're":"what are",
        "what's":"what is",
        "what've":"what have",
        "when's":"when is",
        "where'd":"where did",
        "where're":"where are",
        "where's":"where is",
        "where've":"where have",
        "which's":"which is",
        "who'd":"who would",
        "who'd've":"who would have",
        "who'll":"who will",
        "who're":"who are",
        "who's":"who is",
        "who've":"who have",
        "why'd":"why did",
        "why're":"why are",
        "why's":"why is",
        "won't":"will not",
        "wouldn't":"would not",
        "would've":"would have",
        "y'all":"you all",
        "you'd":"you would",
        "you'll":"you will",
        "you're":"you are",
        "you've":"you have",
        "Whatcha":"What are you",
        "luv":"love",
        "sux":"sucks"
        }


def strip_accents(text):
    if 'ø' in text or  'Ø' in text:
        #Do nothing when finding ø 
        return text   
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)


def tweet_cleaning_for_sentiment_analysis(tweet):    
    
    #Escaping HTML characters
    tweet = BeautifulSoup(tweet).get_text()
    #Special case not handled previously.
    tweet = tweet.replace('\x92',"'")
    #Removal of hastags/account
    tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
    #Removal of address
    tweet = ' '.join(re.sub("(\w+:\/\/\S+)", " ", tweet).split())
    #Removal of Punctuation
    tweet = ' '.join(re.sub("[\.\,\!\?\:\;\-\=]", " ", tweet).split())
    #Lower case
    tweet = tweet.lower()
    #CONTRACTIONS source: https://en.wikipedia.org/wiki/Contraction_%28grammar%29
    CONTRACTIONS = load_dict_contractions()
    tweet = tweet.replace("’","'")
    words = tweet.split()
    reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words]
    tweet = " ".join(reformed)
    # Standardizing words
    tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
    #Deal with smileys
    #source: https://en.wikipedia.org/wiki/List_of_emoticons
    SMILEY = load_dict_smileys()  
    words = tweet.split()
    reformed = [SMILEY[word] if word in SMILEY else word for word in words]
    tweet = " ".join(reformed)
    #Deal with emojis
    tweet = emoji.demojize(tweet)
    #Strip accents
    tweet= strip_accents(tweet)
    tweet = tweet.replace(":"," ")
    tweet = ' '.join(tweet.split())
    
    # DO NOT REMOVE STOP WORDS FOR SENTIMENT ANALYSIS - OR AT LEAST NOT NEGATIVE ONES

    return tweet

In [27]:
def load_data(data_file_path):
    return pd.read_csv(data_file_path, sep='\t', names=['Sentiment', 'Text'], header=None)

train_file = 'offensive_dataset/train.tsv'
dev_file   = 'offensive_dataset/dev.tsv'

train_data = load_data(train_file)
dev_data   = load_data(dev_file)

train_data['y'] = np.where(train_data['Sentiment'] == 'OFF', 1, 0)
dev_data['y'] = np.where(dev_data['Sentiment'] == 'OFF', 1, 0)

In [28]:
tsv_file='offensive_dataset/train.tsv'
csv_table=pd.read_table(tsv_file,sep='\t')
csv_table.to_csv('train.csv',index=False)

In [29]:
def transform_instance(row):
    cur_row = []
    #Prefix the index-ed label with __label__
    label = "__label__" + row[1]['Sentiment']  
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(tweet_cleaning_for_sentiment_analysis(row[1]['Text'].lower())))
    return cur_row

In [30]:
# takes around 10 seconds to run

def preprocess(input_file, output_file, keep=1):
    with open(output_file, 'w', encoding="utf8") as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        for row in input_file.iterrows():
            row_output = transform_instance(row)
            csv_writer.writerow(row_output)

preprocess(train_data, 'tweets.train')

In [31]:
preprocess(dev_data, 'tweets.validate')

In [32]:
model_path = '\\model\\'
model_name = "model-en"

In [36]:
def print_test_prediction(model, sample_text):
    pred = model.predict([sample_text], k=1)
    print('\nText: "{}"'.format(sample_text))
    print('Predicted label: ', pred[0][0][0][-3:])
    print('Confidence:  %.4f %%' % (pred[1][0][0]*100))

    
def train():
    
    print('Training start\n')
    
    try:
        hyper_params = {"lr": 0.01,
                        "epoch": 20,
                        "wordNgrams": 2,
                        "dim": 20}     
                               
        print(str(datetime.datetime.now()) + ' START => ' + str(hyper_params) )

        # Train the model
        
        model = fasttext.train_supervised("tweets.train", **hyper_params)
        print("Model trained with the hyperparameter \n {}".format(hyper_params))

        # CHECK PERFORMANCE
        
        print(str(datetime.datetime.now()) + 'Training complete.' + str(hyper_params) )
        
        result = model.test('tweets.train')
        validation = model.test('tweets.validate')
        
        # DISPLAY ACCURACY OF TRAINED MODEL
        
        text_line = str(hyper_params) + "\n\nTraining Accuracy: " + str(result[1] * 100)  + " %\nValidation Accuracy: " + str(validation[1] * 100) + ' %\n' 
        print(text_line)        
    
        #  TESTING PART

        print('Testing\n')
        sample_text = 'this player does not play well'
        print_test_prediction(model, sample_text)
        sample_text = 'this player is soooo shit'
        print_test_prediction(model, sample_text)
        
    except Exception as e:
        print('Exception during training: ' + str(e) )


# Train your model
train()

Training start

2019-09-13 23:15:17.501386 START => {'lr': 0.01, 'epoch': 20, 'wordNgrams': 2, 'dim': 20}
Model trained with the hyperparameter 
 {'lr': 0.01, 'epoch': 20, 'wordNgrams': 2, 'dim': 20}
2019-09-13 23:15:18.397257Training complete.{'lr': 0.01, 'epoch': 20, 'wordNgrams': 2, 'dim': 20}
{'lr': 0.01, 'epoch': 20, 'wordNgrams': 2, 'dim': 20}

Training Accuracy: 73.99466192170819 %
Validation Accuracy: 55.50000000000001 %

Testing


Text: "this player does not play well"
Predicted label:  NOT
Confidence:  78.4713 %

Text: "this player is soooo shit"
Predicted label:  OFF
Confidence:  94.1431 %
