**Ronan Murphy**

29/02/2020

Developed a system to detect irony in text. The model is trained on the `data.txt` file, a tab-separated file of labelled tweets:

```csv
Tweet index     Label   Tweet text
1       1       Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR
2       1       @mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)
3       1       Hey there! Nice to see you Minnesota/ND Winter Weather 
4       0       3 episodes left I'm dying over here
```


# Part 1

Read all the data and find the size of vocabulary of the dataset (ignoring case) and the number of positive and negative examples.

**Part 1: Description**

- Imported the text file with Google Colab

- Separated the columns by tab, converted the tweets to a list, lowercased all text and added the words to a set; the length of this set gives the vocabulary size of the dataset = 17055

- Grouped the data by label to get the counts for ironic (1901) and non-ironic (1916) tweets
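
The counting steps above can be sketched without pandas as follows. This is a minimal illustration that assumes whitespace tokenisation, as in the notebook; the names `vocabulary_size` and `label_counts` and the sample rows are hypothetical and are not taken from `data.txt`.

```python
from collections import Counter


def vocabulary_size(tweets):
    """Count distinct whitespace-separated tokens, ignoring case."""
    vocab = set()
    for tweet in tweets:
        vocab.update(tweet.lower().split())
    return len(vocab)


def label_counts(labels):
    """Return a mapping from each label to its number of examples."""
    return dict(Counter(labels))


# Made-up rows for illustration only
sample_tweets = ["Nice weather today", "nice WEATHER again", "3 episodes left"]
sample_labels = [1, 1, 0]

print(vocabulary_size(sample_tweets))
print(label_counts(sample_labels))
```

Because tokens are split on whitespace only, punctuation stays attached to words, so "here" and "here!" would count as two vocabulary entries.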

In [None]:
from google.colab import files
files.upload()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#use pandas to read the tab-separated text file
dataset = pd.read_csv('data.txt', sep = '\t')


#convert tweet text to list
results = []
tweet_data = dataset['Tweet text'].tolist()

#lowercase each tweet to ignore case, then split it into words
for tweet in tweet_data:
    tweet = tweet.lower()
    results.extend(tweet.split())
#convert to set to return unique words
results = set(results)
print(results)
#get size of vocabulary of dataset
print(len(results))


#number of positive and negative examples
pos_neg_size = dataset.groupby('Label').size()
print(pos_neg_size)

#print first 10 rows
dataset.head(10)

# Part 2

Divide the data into a training and test set.

Implement a function that calculates the precision, recall and F-Measure for this task.

**Part 2: Description**

- Import Stopwords from NLTK to use for feature extraction

- Split the data into X (Tweet text) and Y (Labels) values; after feature extraction these are split into training and test sets with an 80-20 ratio, as this avoids overtraining while returning the best model accuracy.

- Created a method `scores` to return the accuracy, precision, recall and F1-score of the predicted Y labels compared to the actual labels. This can be called for each model.
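
For reference, precision, recall and F-measure can also be computed directly from the counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration, not the notebook's `scores` method (which relies on scikit-learn); the function name `precision_recall_f1` and the example labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true_example = [1, 1, 1, 0, 0, 0]
y_pred_example = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true_example, y_pred_example))
```

Returning 0.0 when a denominator is zero is a design choice of this sketch; other conventions (for example raising an error) are equally possible.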

In [None]:
import nltk
nltk.download('stopwords')
# import nltk to get stopwords package

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#add tweets and labels to X and Y respectively 
x= dataset.iloc[:,2].values
y = dataset.iloc[:,1].values



#method to get accuracy, precision recall and f-score 
def scores(y_test, y_predicted):
  # accuracy: (tp + tn) / (p + n)
  accuracy = accuracy_score(y_test, y_predicted)
  print('Accuracy: %f' % accuracy)
  # precision tp / (tp + fp)
  precision = precision_score(y_test, y_predicted)
  print('Precision: %f' % precision)
  # recall: tp / (tp + fn)
  recall = recall_score(y_test, y_predicted)
  print('Recall: %f' % recall)
  # f1: 2 tp / (2 tp + fp + fn)
  f1 = f1_score(y_test, y_predicted)
  print('F1 score: %f' % f1)
  return precision, recall, f1



#the train/test split is done in the next part, after feature extraction

# Part 3

Extracted features from each sentence. Implemented a simple log-linear model to classify tweets as ironic or not ironic.

Trained this model and evaluated the results using precision, recall and F-Measure.

**Part 3: Description**

- First remove the stopwords using NLTK, as they add no extra sentiment to the tweet text. This can improve the accuracy of a model, as it removes redundant information that can skew results. In Part 5 I use more feature extraction, as usernames, web addresses etc. don't add more sentiment either.

- Using a count vectoriser, convert the text to vectors, lowercasing all text so it can be analysed by a model

- Split the data with an 80:20 train-test split

- Create a Logistic Regression model, which uses the sigmoid function (log-linear)

- Call the `scores` method to get the accuracy, precision, recall and F1-score:

Accuracy: 0.649215;
Precision: 0.646597;
Recall: 0.650000;
F1 score: 0.648294
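
Stop-word removal works on individual tokens rather than on whole tweets. The sketch below shows one way to do this per word; the tiny `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical stand-ins (NLTK's English list is much longer), chosen so the example runs without any downloads.

```python
# Hypothetical, deliberately small stop-word list for illustration only
STOP_WORDS = {"the", "a", "is", "to", "in", "for", "i", "over"}


def remove_stop_words(tweet, stop_words=STOP_WORDS):
    """Lowercase a tweet, split on whitespace and drop stop-word tokens."""
    tokens = tweet.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


print(remove_stop_words("Just in time for Christmas"))
```

The filtered strings could then be passed to a vectoriser in place of the raw tweets.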


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


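#remove stopwords from dataset to improve accuracy
#NOTE: x holds whole tweets, so each w below is a full tweet string rather than a single word;
#no tweet equals a stop word, so nothing is filtered here and wordsFiltered keeps all 3817 tweets
stopWords = set(stopwords.words('english'))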
wordsFiltered = []
for w in x:
    if w not in stopWords:
        wordsFiltered.append(w)
vocab = set(wordsFiltered)
print(len(vocab))
#use count vectoriser to convert words to unique IDs and lowercase the words
vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = True,

)
 
features = vectorizer.fit_transform(
    wordsFiltered
)

features_nd = features.toarray() 

#split 80:20: for this size of data, train on 80% of the data and test on the remaining 20%
#if the training share is too high the model may overtrain; if it is too low the model may underfit
#this returns an accuracy of just under 65%
x_train, x_test, y_train, y_test = train_test_split(features_nd, y, train_size=0.80, random_state=1234)

print(x_train.shape)
print(y_train.shape)

#train a logistic regression model, which uses the sigmoid activation function, on the training data
log_model = LogisticRegression()
log_model = log_model.fit(X=x_train, y=y_train)
#predict y values from test
y_pred = log_model.predict(x_test)

#print scores with method, returns precision, recall and f-score
print(scores(y_test, y_pred))

3817
(3053, 12725)
(3053,)
Accuracy: 0.649215
Precision: 0.646597
Recall: 0.650000
F1 score: 0.648294
(0.6465968586387435, 0.65, 0.6482939632545932)


# Part 4

Developed an acceptor recurrent neural network that classifies the sentence as ironic or not ironic.

Evaluated this according to precision, recall and F-Measure.

**Part 4: Description**

- Created an Acceptor RNN. Tokenised the words using the Keras preprocessing tools and added padding to each tweet to make them all the same length. This is another feature that is improved in Part 5, as one long tweet can skew the results of all the others, which are padded with zeros up to the max length. Truncating to an average length is a possible solution.

- Split the data into train and test sets with 80-20. An improved model can also use validation data, which gives an unbiased evaluation of the hyperparameters

- Created a 3-layer sequential model with an embedding layer, a bidirectional LSTM and a dense output layer that uses softmax to output results. Set the hyperparameters for the number of nodes in each layer. Increasing the number of nodes can improve accuracy but increases the time to train

- Compiled this model with binary cross-entropy loss and the Adam optimizer with a set learning rate, tracking accuracy

- Fitted the model for 10 epochs with a batch size of 64 to speed up each epoch. Test accuracy = 58.74%
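
A softmax output layer returns one probability per class, so a class label has to be recovered by taking the index of the largest probability. The following plain-Python sketch illustrates that step; `softmax` and `predicted_class` are hypothetical helper names and the logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(probabilities):
    """Return the index of the most probable class (argmax)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Index 0 = not ironic, index 1 = ironic (assumed ordering for this example)
probs = softmax([0.2, 1.5])
label = predicted_class(probs)
print(probs, label)
```

Casting probabilities to integers instead would truncate almost every value to 0, which is why an argmax is used here.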

In [None]:
from google.colab import files
files.upload()
#import file

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical

dataset = pd.read_csv('data.txt', sep = '\t')

#tokenize the dataset with max features set to 30,000, split by spaces
#num_words is the current name of the argument formerly called nb_words
t = Tokenizer(num_words=30000, split=' ')
#fit the text of tweets to tokenized vectors
t.fit_on_texts(dataset['Tweet text'].values)
#convert the text in tweets to tokenized values
X1 = t.texts_to_sequences(dataset['Tweet text'].values)
#pad the tokenized tweets with 0's so all tweets are the same length; this may skew results, as one long tweet forces zeros onto all the others
X1 = pad_sequences(X1)

In [None]:
#get the labels for the X1 converted data 
Y1 = pd.get_dummies(dataset['Label']).values
#split the data into train and test with 80%-20% train-test
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1,Y1, train_size =0.8)
#get the shape of the data sets
print(X1_train.shape,Y1_train.shape)
print(X1_test.shape,Y1_test.shape)

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf



#create acceptor RNN model to classify tweets using tensorflow
#3 layers the original embedding, LSTM layer and output layer
model = tf.keras.Sequential([
    #size of vocab and output dimensions
    tf.keras.layers.Embedding(30000, 128, input_length=X1.shape[1]),
    #bidirectional LSTM with 256 units
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256)),
    #dense output layer with 2 units, one per class, as the labels are one-hot encoded
    tf.keras.layers.Dense(2, activation='softmax')
])
#compile with binary cross-entropy loss and the Adam optimizer (learning rate 1e-4), tracking accuracy
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
#give summary of model
model.summary()

In [None]:
batch_size = 64
#fit the model to training data 
history = model.fit(x=X1_train,y=Y1_train, batch_size=batch_size,
                    epochs=10)

In [None]:
#compare output with test data
results = model.evaluate(x= X1_test, y=Y1_test)

print('test loss, test acc:', results)
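#compare scores: accuracy, precision, recall and f1
#model.predict returns softmax probabilities; take the most probable class for each tweet
y_pred = model.predict(X1_test)
yp = np.argmax(y_pred, axis=1)
#convert the one-hot test labels back to class indices
y_true = np.argmax(Y1_test, axis=1)

print(scores(y_true, yp))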

# Part 5

Enhanced the RNN developed in Part 4 to show an improvement according to the evaluation metrics.


**Improvement RNN**

To improve the original model I changed the Acceptor RNN from using the Keras tokenizer to my own method, written from scratch, that converts unique words to ints so I could edit the padding sizes to improve the accuracy.

I also added more feature extraction to remove punctuation, usernames, web addresses and numbers, as they might skew the results. To choose a custom padding size I computed the max and average tweet length in words, which were 149 and 12.6 respectively. Padding to the max length therefore left many 0's in each vectorised tweet, which added no meaning, and truncating tweets to a length of 15 gave an improved accuracy. Tweets shorter than this are padded with 0's, but this had less of an effect on the results.

I also added two extra hidden layers to see if the model would improve results by training better. Although it gave slight improvements, this didn't massively change the output accuracy.

Finally I trained the model and evaluated with the scores method returning:-

Accuracy: 0.67121;
Precision: 0.65292;
Recall: 0.66203;
F1 score: 0.66119


This suggests that extracting more features and adding more layers improved the model in terms of accuracy and predictive performance. For further improvement, adding more layers or tuning the hyperparameters might give a better accuracy.
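
The two preprocessing ideas described above, dropping uninformative tokens and fixing every tweet to a set length, can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (no punctuation stripping, whitespace tokenisation); `clean_tokens` and `pad_or_truncate` are hypothetical names, not the notebook's own functions.

```python
def clean_tokens(tweet):
    """Lowercase a tweet and drop usernames, web addresses and pure numbers."""
    tokens = []
    for word in tweet.lower().split():
        if word.startswith("@") or "http" in word or word.isdigit():
            continue
        tokens.append(word)
    return tokens


def pad_or_truncate(ids, seq_length=15):
    """Keep the first seq_length ids and left-pad shorter sequences with 0."""
    ids = ids[:seq_length]
    return [0] * (seq_length - len(ids)) + ids


print(clean_tokens("@user Check http://t.co/x 3 episodes left"))
print(pad_or_truncate([5, 6, 7], seq_length=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], seq_length=5))
```

Left-padding keeps the real tokens at the end of each sequence, next to the point where a recurrent layer emits its final state.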

In [None]:
from google.colab import files
files.upload()

In [None]:
import numpy as np
import pandas as pd

dataset = pd.read_csv('data.txt', sep ="\t")
x= dataset.iloc[:,2].values
y = dataset.iloc[:,1].values
punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'
# remove punctuation
all_tweets = 'separator'.join(x)
all_tweets = all_tweets.lower()
all_text = ''.join([c for c in all_tweets if c not in punctuation])

# split back into individual tweets on the 'separator' marker
tweets_split = all_text.split('separator')
all_text = ' '.join(tweets_split)

# create a list of words
words = all_text.split()

# remove web addresses, twitter usernames and numbers from the tweet text
new_reviews = []
for review in tweets_split:
    review = review.split()
    new_text = []
    for word in review:
        # use boolean and/not rather than bitwise & and ~ on bools
        if word[0] != '@' and 'http' not in word and not word.isdigit():
            new_text.append(word)
    new_reviews.append(new_text)

print(new_reviews)

In [28]:
from collections import Counter

## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in new_reviews:
    reviews_ints.append([vocab_to_int[word] for word in review])


# stats about vocabulary
print('Unique words: ', len(vocab_to_int))


# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])


#find tweets of 0 length and print their index
count = 0
for i in reviews_ints:
  if len(i)==0:
    print(i)
    print(count)
  count+=1
  
#remove the tweet at index 1966, which was empty after feature extraction
list =[]
count = 0
for elements in reviews_ints:
  if count != 1966:
    list.append(elements)
  else:
    print(count)
  count +=1

#remove the same y value
labels = np.delete(y, [1966])
print("y", np.size(labels))
print("x", len(list))
count = 0


#get mean and max of tweet length to use for padding
mean_list = []
for i in list:
  mean_list.append(len(i))
from statistics import mean
print(mean(mean_list), "mean")


maxList = max(list, key = len) 
maxLength = max(map(len, list)) 
print(maxLength)


Unique words:  14173
Tokenized review: 
 [[815, 2146, 1186, 418, 19, 7, 60, 11, 65, 3665, 3666]]
1966
y 3816
x 3816
[]
1966
12.60272536687631 mean
149


In [29]:
#pad each tweet on the left with zeros up to seq_length (15); tweets longer than seq_length are truncated
def pad_features(list, seq_length):
    # getting the correct rows x cols shape
    features = np.zeros((len(list), seq_length), dtype=int)

    # for each tweet, write its first seq_length tokens right-aligned into the row
    for i, row in enumerate(list):
      features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

# Test implementation!

seq_length = 15

features = pad_features(list, seq_length=seq_length)



# print the first 10 padded tweets (each row has seq_length = 15 columns)
print(features[:10,:150])

[[   0    0    0    0  815 2146 1186  418   19    7   60   11   65 3665
  3666]
 [  37   25 3669    2   23 1566    2 3670 2147    6    1 3671  525   42
  1567]
 [   0    0    0    0    0    0  325   83  140    2   66    9 3673  366
   419]
 [   0    0    0    0    0    0    0    0    0 3674  367   26  640  100
   121]
 [   4   58 1188   33 2148   47    1  127 2149 1568    8    1   91    7
    46]
 [   0    0    0    0    0    0    0    0  118  114   89  282   11 3677
  3678]
 [ 157  197   20 2151  104  135   12    1 1191    6  420  299  641  954
    30]
 [   0    0    0    0   46 2152   10  160 1569   45  209   17    3  251
  3681]
 [  41    9   79    9   76  368    3  237   61 3682   48   31 1192 3683
   192]
 [   0    0    0    0    0    9   25   22 1194    2  526   13  252   65
    38]]


In [0]:
split_frac = 0.8 # best split ratio chosen from tests in part 4

## split data into training, validation, and test data for x and y 

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]



			Feature Shapes:
Train set: 		(3052, 15) 
Validation set: 	(382, 15) 
Test set: 		(382, 15)
Train set: 		(3052,) 
Validation set: 	(382,) 
Test set: 		(382,)


In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf
#create model similar to before with extra nodes 
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(14200, 7500),# let output dimensions equal to half input to give better results, input size reduced with extraction
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256)), # increase nodes to 256
    tf.keras.layers.Dense(128, activation='relu'), # increase nodes to 128
    tf.keras.layers.Dense(1) # output layer
])
#compiled as before with binary cross-entropy and the Adam optimizer; the output layer returns logits, hence from_logits=True
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(x=train_x,y=train_y,
                    batch_size=64,
                    epochs=10,
                    validation_data = (val_x, val_y))

In [None]:
results = model.evaluate(x= test_x, y=test_y)

print('test loss, test acc:', results)

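#the model outputs logits, so a logit above 0 corresponds to a probability above 0.5 (ironic)
y_pred = (model.predict(test_x) > 0).astype(int).ravel()
print(scores(test_y, y_pred))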