<a href="https://colab.research.google.com/github/raynerz/nlp/blob/main/Sentiment_Analysis_Excercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Download the dataset to train your system:
http://ai.stanford.edu/~amaas/data/sentiment/

Write a Python script to:
- Read the training and test data from the files
- Preprocess the data (e.g. replace unwanted characters with spaces, or remove HTML markup)
- Lemmatize the test and training data
- Use vectorization to get a numeric representation (for example, with CountVectorizer)
- Train a LogisticRegression classifier on the vectorized data
- Hint: You can use a scikit-learn Pipeline, as seen previously this semester
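
The Pipeline hint can be sketched as follows. This is a minimal illustration under stated assumptions, not the exercise solution: the toy sentences and labels are invented, the step names `vectorizer` and `classifier` are arbitrary, and `max_iter=1000` is an assumed setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (not taken from the IMDb dataset).
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a dull and boring mess",
    "terrible plot and awful acting",
]
train_labels = [True, True, False, False]  # True = positive, False = negative

# Chain vectorization and classification so that one fit() call trains both.
sentiment_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),  # assumed setting
])
sentiment_pipeline.fit(train_texts, train_labels)

# predict() vectorizes the raw strings with the fitted vocabulary first.
predictions = sentiment_pipeline.predict(
    ["a great and wonderful story", "an awful boring plot"]
)
print(predictions)
```

A pipeline keeps the vectorizer's vocabulary tied to the training data, so new texts are transformed the same way without a separate `transform` call.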

Sources: https://towardsdatascience.com/a-complete-sentiment-analysis-algorithm-in-python-with-amazon-product-review-data-step-by-step-2680d2e2c23b


In [1]:
!pip install nltk
import nltk
nltk.download('brown')
nltk.download('names')
!pip install normalise
import os
import pandas as pd
from sklearn.utils import shuffle
import lxml.html.clean as clean
import string
import numpy as np
from nltk.tokenize import word_tokenize
from normalise import normalise
import en_core_web_sm

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!




In [2]:
%%capture 
# %%capture suppresses the output of this cell (wget progress and the tar file listing)

!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar.gz

In [3]:
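# Read data


def read_data(path="", sentiment=True):  # sentiment: True is positive, False is negative
  """Return a list of (text, rating, sentiment) tuples, one per review file in path."""
  data = []

  for filename in os.listdir(path):
    # File names look like "<id>_<rating>.txt"; extract the integer rating
    rating = int(filename.split("_")[1].split(".")[0])

    # The context manager closes each file after it has been read
    with open(os.path.join(path, filename), encoding="utf-8") as f:
      text = f.read()
    data.append((text, rating, sentiment))
  return data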
    
  

reviews_neg = read_data("./aclImdb/train/neg/", False)
reviews_pos = read_data("./aclImdb/train/pos/", True)

# Prepare data for the model

reviews_neg = pd.DataFrame(reviews_neg).rename(columns={0:"text", 1:"rating", 2:"sentiment"})
reviews_pos = pd.DataFrame(reviews_pos).rename(columns={0:"text", 1:"rating", 2:"sentiment"})

all_reviews = pd.concat([reviews_neg, reviews_pos])
all_reviews = shuffle(all_reviews)

print(all_reviews)



                                                    text  rating  sentiment
8853   Relentlessly stupid, no-budget "war picture" m...       3      False
4440   Before this, the flawed "Slaughterhouse Five" ...      10       True
3618   No movie with Madeleine Carroll in its cast co...       2      False
8943   I'm glad the folks at IMDb were able to deciph...       1      False
...                                                  ...     ...        ...
11863  This film is a complete re-imagining of Romeo ...      10       True
11499  I have spent the last week watching John Cassa...      10       True
8412   Dr Tarr's Torture Dungeon is about a journalis...       4      False
4999   A bit slow (somehow like a Sofia Coppola movie...       8       True
9575   Good lord, whoever made this turkey needs to b...       2      False

[25000 rows x 3 columns]


In [4]:
# Cleaning: remove HTML markup and punctuation symbols
# Note: clean_html sanitizes the markup but keeps tags such as <p> and <br>, so
# stripping punctuation afterwards leaves "p" and "br" fragments in the text
# (visible in the output of this cell).

cleaned = []
for text in all_reviews['text']:
  text = clean.clean_html(text)
  text = text.translate(str.maketrans('', '', string.punctuation))
  cleaned.append(text)

all_reviews['text'] = cleaned
all_reviews.head()

                                                   text  rating  sentiment
9738  pWarning Mild Spoilers AheadbrbrYes I realize ...      10       True
8853  pRelentlessly stupid nobudget war picture made...       3      False
4440  pBefore this the flawed Slaughterhouse Five wa...      10       True
3618  pNo movie with Madeleine Carroll in its cast c...       2      False
8943  pIm glad the folks at IMDb were able to deciph...       1      False
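
The leading `p` and the `brbr` fragments in the output above suggest that tag names survive the cleaning step, because only the angle brackets and slashes are stripped as punctuation. One possible alternative, sketched here with the standard library only, replaces whole tags with spaces before removing punctuation. The helper names `strip_html` and `clean_review` and the sample string are hypothetical, and a regular expression is a rough heuristic rather than a full HTML parser.

```python
import html
import re
import string

# Matches anything that looks like a tag, e.g. <br />, <p>, </p>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return html.unescape(without_tags)


def clean_review(text):
    """Strip tags, drop punctuation, and collapse repeated whitespace."""
    text = strip_html(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Invented sample in the style of a raw review.
sample = "Warning: Mild Spoilers Ahead.<br /><br />Yes, I realize this is a sample."
print(clean_review(sample))
# prints "Warning Mild Spoilers Ahead Yes I realize this is a sample"
```

Replacing tags with a space rather than an empty string keeps the words on either side of a `<br />` from being glued together.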


## Preprocessing pipeline
1. Normalize
2. Remove punctuation
3. Remove stop words
4. Lemmatize

In [5]:
nlp = en_core_web_sm.load()


def preprocess_text(text):
    # Run the four steps in order and return the lemmas as a single string
    normalized_text = normalize(text)
    doc = nlp(normalized_text)
    removed_punct = remove_punct(doc)
    removed_stop_words = remove_stop_words(removed_punct)
    return lemmatize(removed_stop_words)

def normalize(text):
    # The normalise package fails on some inputs; fall back to the unchanged text
    try:
        return ' '.join(normalise(text, verbose=False))
    except Exception:
        return text

def remove_punct(doc):
    return [t for t in doc if t.text not in string.punctuation]

def remove_stop_words(doc):
    return [t for t in doc if not t.is_stop]

def lemmatize(doc):
    return ' '.join([t.lemma_ for t in doc])

preprocessed = []
for text in all_reviews['text']:
  result = preprocess_text(text)
  preprocessed.append(result)

df = pd.DataFrame(preprocessed)

all_reviews['text'] = df[0].values

In [6]:
# Split into training and test sets (train_test_split holds out 25% by default)

from sklearn.model_selection import train_test_split

X = all_reviews['text']
y = all_reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [7]:
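# CountVectorizer builds a vocabulary from the training texts and encodes each
# document as a sparse vector of word counts. The test texts are only transformed,
# so words that never occur in the training set are ignored.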

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
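
To make the fit/transform distinction concrete, here is a small sketch on invented documents. The variable names are hypothetical and the data is not from the IMDb set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
train_docs = ["good movie good cast", "bad movie"]
test_docs = ["good plot"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it unchanged

# Columns follow the alphabetically sorted vocabulary.
print(sorted(vectorizer.vocabulary_))  # ['bad', 'cast', 'good', 'movie']
print(train_matrix.toarray())          # [[0 1 2 1], [1 0 0 1]]
print(test_matrix.toarray())           # [[0 0 1 0]]
```

The word "plot" does not appear in the training documents, so it has no column and contributes nothing to the test vector.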

In [8]:
# Train the model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(ctmTr, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
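
The warning above says the solver hit its iteration limit (`max_iter=100`, as shown in the printed parameters) before converging. One option the warning itself suggests is raising `max_iter`. The sketch below shows how to check convergence afterwards on invented toy data; the value 1000 is an assumption, and the notebook does not show whether it would change the reported accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
texts = ["good film", "great film", "bad film", "awful film"]
labels = [True, True, False, False]

features = CountVectorizer().fit_transform(texts)

# max_iter=1000 is an assumed value, raised from the default of 100.
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(features, labels)

# n_iter_ holds the iterations actually used. Staying below max_iter means
# the solver stopped because it converged, not because it ran out of steps.
iterations_used = int(toy_model.n_iter_[0])
converged = iterations_used < toy_model.max_iter
print(iterations_used, toy_model.max_iter, converged)
```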

In [9]:
# Predict

y_pred_class = model.predict(X_test_dtm)

In [10]:
# Evaluate

accuracy_score(y_test, y_pred_class)

0.86656
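
Accuracy alone does not show which class the errors fall into. A confusion matrix is one way to break the score down; the sketch below uses invented labels and predictions, not the notebook's `y_test` and `y_pred_class`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels and predictions for illustration only.
toy_true = [True, True, True, False, False, False]
toy_pred = [True, True, False, False, False, True]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, ordered as in labels.
matrix = confusion_matrix(toy_true, toy_pred, labels=[False, True])
tn, fp, fn, tp = matrix.ravel()

print(toy_accuracy)
print(matrix)
print(f"true negatives={tn}, false positives={fp}, "
      f"false negatives={fn}, true positives={tp}")
```

In the notebook, the same two calls could be applied to `y_test` and `y_pred_class`.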