<a href="https://colab.research.google.com/github/ryane-jon/Natural-Language-Processing-Coursework/blob/main/NLP_Coursework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (CE4145) Coursework
- Ryan Jones (2208751)
- Dataset Source: [Sentiment Labelled Sentences](https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences)

This dataset consists of:
- **Sentences:** in the form of reviews sourced from Amazon, IMDB and Yelp.
- **Sentiments:** represented as 1s for positive, and 0s for negative.  
- **Sources:** as separate files; each file contains data from a different source.

Tasks:
- Create a model that can predict the sentiment (Positive or Negative) of various sentences.
- I also plan to extend this to detect the source of these sentences (Amazon, IMDB or Yelp).

### Import dataset
(from the download of the dataset in the provided link)

The 3 files are imported (amazon_cells_labelled.txt, imdb_labelled.txt, yelp_labelled.txt)

In [None]:
#used to upload data
import io
from google.colab import files
uploaded = files.upload() #files can be uploaded here, I select the 3 labelled files.

Saving amazon_cells_labelled.txt to amazon_cells_labelled.txt
Saving imdb_labelled.txt to imdb_labelled.txt
Saving yelp_labelled.txt to yelp_labelled.txt


In [None]:
#import stuff to use later
from sklearn.pipeline import Pipeline #let's import the pipeline functionality
from sklearn.feature_extraction.text import CountVectorizer #and we will import a simple pre-processing method
from sklearn.feature_extraction.text import TfidfTransformer #and a representation learner
from sklearn.neighbors import KNeighborsClassifier #and a simple classifier model
from sklearn.model_selection import StratifiedKFold #cross fold is sometimes called k-fold. Calling the stratified version ensures that classes have equal representation across folds
from sklearn.metrics import accuracy_score #import an accuracy metric to tell us how well the model is doing

Split each file's lines into lists of strings to be processed

In [None]:
amazon_data = uploaded['amazon_cells_labelled.txt'].decode('UTF-8').splitlines()
imdb_data = uploaded['imdb_labelled.txt'].decode('UTF-8').splitlines()
yelp_data = uploaded['yelp_labelled.txt'].decode('UTF-8').splitlines()

#View some of the data
for index in range(5):
  print(amazon_data[index])
  print(imdb_data[index])
  print(yelp_data[index])

So there is no way for me to plug it in here in the US unless I go by a converter.	0
A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  	0
Wow... Loved this place.	1
Good case, Excellent value.	1
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  	0
Crust is not good.	0
Great for the jawbone.	1
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  	0
Not tasty and the texture was just nasty.	0
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!	0
Very little music or anything to speak of.  	0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.	1
The mic is great.	1
The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  	1
The selection on the menu was great

Create a NumPy array with 3 columns: Review, Sentiment, and Source

In [None]:
import numpy as np
reviews = []
sentiments = []
sources = []



for line in amazon_data:
  reviews.append(line[:-1])
  sentiments.append(line[-1])
  sources.append("Amazon")

for line in imdb_data:
  reviews.append(line[:-1])
  sentiments.append(line[-1])
  sources.append("IMDB")

for line in yelp_data:
  reviews.append(line[:-1])
  sentiments.append(line[-1])
  sources.append("Yelp")


data = np.column_stack((reviews, sentiments, sources))
print(data.shape)

(3002, 3)
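
The slicing above (`line[:-1]` and `line[-1]`) keeps the tab separator, and any trailing spaces, at the end of each review. A minimal sketch of an alternative is shown here, assuming every line has the form `review<TAB>label`; `parse_line` is a hypothetical helper, not part of the notebook.

```python
def parse_line(line):
    """Split 'review<TAB>label' into (review, label). Hypothetical helper."""
    # rsplit with maxsplit=1 only splits on the LAST tab, so tabs inside
    # the review text (if any) are left alone.
    review, label = line.rsplit("\t", 1)
    # strip() removes the trailing spaces that some reviews carry.
    return review.strip(), label.strip()

print(parse_line("Wow... Loved this place.\t1"))
```

Note that `rsplit` raises a `ValueError` on unpacking if a line has no tab at all, which would surface malformed lines early instead of silently storing a wrong label.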


Check for invalid sentiment values (anything other than "0" or "1")

In [None]:
def checkSentimentValid(datas):
  x,y,z = datas.T
  markedForDeath = []
  count = 0
  #for each sentiment value make sure it's "0" or "1"
  for sentiment in y:
    if sentiment != "0" and sentiment != "1":
      ##if not, get as relevant surrounding info
      markedForDeath.append(count)  #an omen
      print(str(count))
      print(x[count]+"\t"+y[count]+"\t"+z[count])
      print(x[count+1]+"\t"+y[count+1]+"\t"+z[count+1]+"\n")
    count= count+1
  return markedForDeath

checkSentimentValid(data)


1178
The script i	s	IMDB
was there a script?  		0	IMDB

1968
Definitely worth seein	g	IMDB
 it's the sort of thought provoking film that forces you to question your own threshold of loneliness.  		1	IMDB



[1178, 1968]

This is a result of reviews containing a character that is misinterpreted as a new line, splitting the review in two.
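
One plausible cause is that `str.splitlines()` treats several characters besides `\n` as line boundaries, including the NEL control character `\x85`. The sketch below illustrates the difference on a made-up string; that the IMDB file contains `\x85` specifically is an assumption, not something verified here.

```python
# Hypothetical sample: the first review contains a NEL character (\x85).
# Whether the real IMDB file uses this exact character is an assumption.
raw = "The script is\x85was there a script?  \t0\nGreat film.  \t1"

by_splitlines = raw.splitlines()  # splits on \n AND on \x85 (and a few others)
by_newline = raw.split("\n")      # splits on \n only

print(len(by_splitlines))  # the first review is cut in two
print(len(by_newline))     # the first review stays in one piece
```

Splitting on `"\n"` only would avoid the broken rows, although a file that ends with a newline then yields a final empty string that needs to be filtered out.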

In [None]:
marked = checkSentimentValid(data)
marked.sort(reverse=True)
for marker in marked:
  if sentiments[marker] != "0" and sentiments[marker] != "1":
    reviews[marker+1] = reviews[marker] + sentiments[marker] + " " + reviews[marker+1] #rejoin the split review as one (marker+1 rather than a hard-coded index)
    reviews.pop(marker) #remove the leftover data
    sentiments.pop(marker)
    sources.pop(marker)

data = np.column_stack((reviews, sentiments, sources))

1178
The script i	s	IMDB
was there a script?  		0	IMDB

1968
Definitely worth seein	g	IMDB
 it's the sort of thought provoking film that forces you to question your own threshold of loneliness.  		1	IMDB



There should now be no invalid sentiment values

In [None]:
checkSentimentValid(data)

[]

The data can now be shuffled and rebuilt as a NumPy array

In [None]:
random_array = []
for i in range(len(reviews)):
  random_array.append(i)
np.random.shuffle(random_array)

shuffled_reviews = []
shuffled_sentiments = []
shuffled_sources = []

for i in random_array:
  shuffled_reviews.append(reviews[i])
  shuffled_sentiments.append(sentiments[i])
  shuffled_sources.append(sources[i])

data = np.column_stack((shuffled_reviews, shuffled_sentiments, shuffled_sources))
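
Because the shuffle above is unseeded, the row order (and therefore the fold contents and accuracy figures) can differ between runs. A minimal sketch of a reproducible shuffle that keeps each row's fields together is shown here on toy data; the `toy_*` names and the seed value 42 are assumptions for illustration only.

```python
import random

# Toy stand-ins for the reviews, sentiments and sources lists.
toy_reviews = ["Good case.", "Crust is not good.", "It was so funny."]
toy_sentiments = ["1", "0", "1"]
toy_sources = ["Amazon", "Yelp", "IMDB"]

# Zip the three lists so each review stays attached to its label and source.
rows = list(zip(toy_reviews, toy_sentiments, toy_sources))

# A seeded generator gives the same order on every run (42 is arbitrary).
random.Random(42).shuffle(rows)

# Unzip back into three aligned lists.
shuffled_reviews, shuffled_sentiments, shuffled_sources = (
    list(column) for column in zip(*rows)
)
print(rows)
```

The three shuffled lists can then be passed to `np.column_stack` exactly as before.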

## Test base NLP Pipeline.
- To start with, the model only predicts the sentiment from the review text. I will bring the source in later (possibly as an additional feature or as a target).
- This section is mostly me testing some pre-processing and tokenization methods to see what is actually helpful.

In [None]:
x, y, z = data.T  #x=reviews, y=sentiment, z=source
text_clf = Pipeline([ #Pipeline to organise functions
  ('prep', CountVectorizer()), #Count vectorizer
  ('rep', TfidfTransformer()), #representation learning method using tf-idf
  ('mod', KNeighborsClassifier()), #kNN classifier
  ])

acc_score = []

kf = StratifiedKFold(n_splits=5)
for train, test in kf.split(x,y):
  x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test]
  text_clf.fit(x_train, y_train) #fit to training data
  predictions = text_clf.predict(x_test) #predict on test data
  acc_score.append(accuracy_score(predictions, y_test)) #get accuracy

print("Accuracy:", np.mean(acc_score)) #mean accuracy


Accuracy: 0.7753333333333333


Double-checking that data is being represented correctly

In [None]:
for index in range(0,4):
  print(x[index*100])
  print(y[index*100])
  print(z[index*100])

Its not user friendly.	
0
Amazon
It was so funny.  	
1
IMDB
I'll be drivng along, and my headset starts ringing for no reason.	
0
Amazon
It actually turned out to be pretty decent as far as B-list horror/suspense films go.  	
1
IMDB


### Introduce tokenization pre-processing into the pipeline

Import packages for tokenization

In [None]:
import nltk #import the natural language toolkit

nltk.download('punkt') #download the package in nltk which supports tokenization
nltk.download('punkt_tab')
nltk.download('stopwords') #download the nltk package for stopwords

from nltk.tokenize import word_tokenize #import the tokenize package
from nltk.corpus import stopwords #import the package from the corpus
from nltk.stem.snowball import SnowballStemmer #import the snowball stemmer (also known as Porter2)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Create a tokenize function and add it to the pipeline

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class pre_process_tokenize(BaseEstimator, TransformerMixin):

    def __init__(self):
      return None

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
      stop_words = set(stopwords.words('english')) #load the stopword list once, rather than once per token
      stemmer = SnowballStemmer("english") #specify we are using the English stemming rules, as other languages are present in toolkit
      prep_text = []
      for x in X:
            #basic tokenisation
            token_text = word_tokenize(x)
            #lower casing (punctuation tokens are kept here; CountVectorizer's default token pattern drops them later)
            normd_text = [token.lower() for token in token_text]
            #stopword removal
            swr_text = [token for token in normd_text if token not in stop_words] #list comprehension to remove any stopwords from our list
            #stemming
            prep_text.append([stemmer.stem(word) for word in swr_text]) #list comprehension for applying the stemmer

      #rejoin the sentences
      prep_sentences = [" ".join(sentence) for sentence in prep_text]
      return prep_sentences

text_clf = Pipeline([
  ('prep', pre_process_tokenize()),
  ('count', CountVectorizer()),
  ('rep', TfidfTransformer()),
  ('mod', KNeighborsClassifier()),
  ])


acc_score = []

kf = StratifiedKFold(n_splits=5)
for train, test in kf.split(x,y):

  x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test]

  text_clf.fit(x_train, y_train)
  predictions = text_clf.predict(x_test)
  acc = accuracy_score(predictions, y_test)
  acc_score.append(acc)

print("Accuracy:", np.mean(acc_score))

Accuracy: 0.7463333333333333


Adding the tokenization pre-processing step has decreased the overall accuracy (from about 0.775 to about 0.746). It's possible that stemming and stop word removal caused some context and nuance to be lost, which was made worse by the fact that the data comes from multiple different sources, i.e. without the source to contextualise the review, the text starts to lose some important meaning. Additionally, punctuation and capitalisation could be very important for determining sentiment, although CountVectorizer's default settings also lowercase the text and drop punctuation, so neither pipeline makes use of them yet.
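
To illustrate how stop word removal can lose sentiment, the sketch below strips a small hand-picked set of stopwords from two reviews in the dataset. The `STOP_WORDS` set and `remove_stopwords` helper are assumptions for illustration; the real list comes from `nltk.corpus.stopwords` and is much longer, but it likewise contains negations such as "not".

```python
# Hand-picked subset of English stopwords (an assumption for illustration;
# the pipeline itself uses NLTK's full English list).
STOP_WORDS = {"is", "not", "the", "and", "was", "just"}

def remove_stopwords(sentence):
    """Lowercase, drop full stops, split on whitespace, filter stopwords."""
    tokens = sentence.lower().replace(".", "").split()
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_stopwords("Crust is not good."))
print(remove_stopwords("Not tasty and the texture was just nasty."))
```

In the first review the negation disappears and only "crust" and "good" remain, so a negative review ends up looking positive to the classifier.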

In [None]:
from sklearn.neural_network import MLPClassifier # Import MLPClassifier
text_clf = Pipeline([ #Pipeline to organise functions
  ('prep', CountVectorizer(ngram_range=(1,2))), #Count vectorizer
  ('rep', TfidfTransformer()), #representation learning method using tf-idf
  ('mod', MLPClassifier()), #MLP classifier
  ])

acc_score = []

kf = StratifiedKFold(n_splits=5)
for train, test in kf.split(x,y):
  x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test]
  text_clf.fit(x_train, y_train) #fit to training data
  predictions = text_clf.predict(x_test) #predict on test data
  acc_score.append(accuracy_score(predictions, y_test)) #get accuracy

print("Accuracy:", np.mean(acc_score)) #mean accuracy

Accuracy: 0.8240000000000001
