<a href="https://colab.research.google.com/github/raynerz/nlp/blob/main/Sentiment_Analysis_Excercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Download the dataset to train your system:
http://ai.stanford.edu/~amaas/data/sentiment/
Write a Python script to:
- Read the training and test data from the files
- Preprocess the data (e.g. replace unwanted characters with a space, or remove HTML expressions)
- Lemmatize the test and training data
- Use vectorization to get a numeric representation (for example with CountVectorizer)
- Train the model with a LogisticRegression classifier
- Hint: you can use a scikit-learn Pipeline, as seen previously this semester
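
The hint about a scikit-learn Pipeline can be sketched as follows. This is a minimal illustration, not the exercise solution: the four toy reviews, their labels, and the names `sentiment_pipeline`, `train_texts` and `train_labels` are invented here, and `max_iter=1000` is an assumed setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy reviews, invented for illustration only
train_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull, awful waste of time",
]
train_labels = [True, True, False, False]

# The pipeline chains vectorization and classification into one estimator
sentiment_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
sentiment_pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
predictions = sentiment_pipeline.predict(["great film", "terrible plot"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, its vocabulary is fitted only on the data passed to `fit`, which helps keep test data out of the training step.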

Sources: https://towardsdatascience.com/a-complete-sentiment-analysis-algorithm-in-python-with-amazon-product-review-data-step-by-step-2680d2e2c23b


In [1]:
!pip install nltk



In [2]:
%%capture
# %%capture suppresses this cell's output (the wget and tar logs)

!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar.gz

In [10]:
# Read data
import os
import pandas as pd
from sklearn.utils import shuffle

def read_data(path, sentiment):
  """Read every review in `path`; sentiment is True for positive, False for negative."""
  data = []
  for filename in os.listdir(path):
    # File names have the form "<id>_<rating>.txt"
    rating = int(filename.split("_")[1].split(".")[0])
    # The with-block closes the file after reading
    with open(os.path.join(path, filename), encoding="utf-8") as f:
      text = f.read()
    data.append((text, rating, sentiment))
  return data

reviews_neg = read_data("./aclImdb/train/neg/", False)
reviews_pos = read_data("./aclImdb/train/pos/", True)

# Prepare data for the model
columns = ["text", "rating", "sentiment"]
reviews_neg = pd.DataFrame(reviews_neg, columns=columns)
reviews_pos = pd.DataFrame(reviews_pos, columns=columns)

all_reviews = pd.concat([reviews_neg, reviews_pos])
all_reviews = shuffle(all_reviews)

print(all_reviews)



                                                   text  rating  sentiment
1006  Granted I had seen some "Speed Racer", but I n...       7       True
9437  I'm going to have to disagree with the previou...       4      False
2439  I loved this movie. In fact I loved being an a...      10       True
6255  Not only does this film have one of the great ...       7       True
1466  Having seen the hot Eliza Dushku in the pretty...       1      False
...                                                 ...     ...        ...
5172  The plot of Corpse Grinders 2 is very much sim...       1      False
8861  Before watching this film, I could already tel...       3      False
9902  Expecting to see another Nunsploitation movie ...      10       True
9459  For those who think of Dame May Witty as the k...       7       True
743   I saw that when I was little and it was excell...      10       True

[25000 rows x 3 columns]
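
The rating is taken from the file name, which the reading code assumes to have the form `<id>_<rating>.txt`. A small sketch of that parsing step on its own; the helper name `parse_review_filename` and the file name `123_7.txt` are made up for illustration:

```python
import os

def parse_review_filename(filename):
    """Split '<id>_<rating>.txt' into (review_id, rating)."""
    stem, _extension = os.path.splitext(filename)
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

print(parse_review_filename("123_7.txt"))  # (123, 7)
```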


In [12]:
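# Cleaning: remove punctuation and other unwanted symbols such as HTML

import lxml.html.clean as clean
import string

# Translation table that deletes every ASCII punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

def clean_text(text):
  # clean_html sanitizes the markup but keeps the tags themselves
  text = clean.clean_html(text)
  # Deleting punctuation removes "<", ">" and "/" but leaves the tag names behind
  return text.translate(punctuation_table)

all_reviews['text'] = all_reviews['text'].apply(clean_text)
all_reviews.head()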

                                                   text  rating  sentiment
1006  pGranted I had seen some Speed Racer but I nev...       7       True
9437  pIm going to have to disagree with the previou...       4      False
2439  pI loved this movie In fact I loved being an a...      10       True
6255  pNot only does this film have one of the great...       7       True
1466  pHaving seen the hot Eliza Dushku in the prett...       1      False
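
The leading `p` in the cleaned texts above is what remains of a `<p>` tag once its angle brackets have been deleted. One possible alternative, sketched here with the standard library only, replaces whole tags and punctuation with spaces. The function name `clean_review` and the sample string are assumptions, and a regular expression is only a rough way to strip tags from simple markup like `<br />`.

```python
import html
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
# Map every punctuation character to a space instead of deleting it
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_review(text):
    text = TAG_RE.sub(" ", text)         # drop whole tags such as <br />
    text = html.unescape(text)           # turn entities like &amp; into characters
    text = text.translate(PUNCT_TABLE)   # punctuation becomes whitespace
    return " ".join(text.split())        # collapse repeated whitespace

print(clean_review("I loved it.<br /><br />Really!"))  # I loved it Really
```

Replacing punctuation with a space, rather than deleting it, splits contractions such as `I'm` into `I m`. Whether that is preferable to `Im` depends on the tokenizer used afterwards.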


In [5]:
# Preprocessing: tokenization, the first step towards lemmatization
# Source: https://www.kaggle.com/balatmak/text-preprocessing-steps-and-universal-pipeline

# Tokenization separates the text into tokens (minimal units such as words); it can be based on punctuation or on heuristics

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Store the full token list of each review in its row; building a DataFrame
# from the lists would spread the tokens over columns
all_reviews['text'] = all_reviews['text'].apply(word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
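
The task also asks for lemmatization, which operates on the token lists produced above. A minimal sketch of how a word-level lemmatizer could be applied to one review follows. The helper `lemmatize_tokens` is hypothetical, and `TOY_LEMMAS` is a toy stand-in for a real lemmatizer so that the sketch runs without any corpus download.

```python
def lemmatize_tokens(tokens, lemmatize):
    """Apply a word-level lemmatizer to every token of one review."""
    return [lemmatize(token.lower()) for token in tokens]

# Toy stand-in for a real lemmatizer (assumption, for illustration only)
TOY_LEMMAS = {"movies": "movie", "actors": "actor", "scenes": "scene"}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

print(lemmatize_tokens(["Movies", "with", "great", "actors"], toy_lemmatize))
# ['movie', 'with', 'great', 'actor']
```

With NLTK, the `lemmatize` method of `nltk.stem.WordNetLemmatizer` could be passed instead of `toy_lemmatize`, after `nltk.download('wordnet')`, for example through `all_reviews['text'].apply(...)`.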


In [20]:
# Remove stop words

nltk.download('stopwords')
# A set makes each membership test fast
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))

def remove_stop_words(tokens):
  # Compare in lower case because the NLTK stop-word list is lower-case
  return [t for t in tokens if t.lower() not in nltk_stop_words]

all_reviews['text'] = all_reviews['text'].apply(remove_stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [6]:
# Split into training and validation set

from sklearn.model_selection import train_test_split

X = all_reviews['text']
y = all_reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
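
By default `train_test_split` holds out 25% of the rows. A small sketch on invented data shows the resulting sizes and the optional `stratify` argument, which keeps the class balance similar in both parts. The names `texts`, `labels`, `texts_train` and so on are illustrative and not part of the notebook.

```python
from sklearn.model_selection import train_test_split

# 100 invented samples with a 50/50 class balance
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 == 0 for i in range(100)]

texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, random_state=0, stratify=labels
)

print(len(texts_train), len(texts_test))     # 75 25
print(sum(labels_train), sum(labels_test))   # positives in each part
```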

In [7]:
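# CountVectorizer builds a vocabulary from the training texts and turns each text into a vector of word counts.

from sklearn.feature_extraction.text import CountVectorizer

def to_text(doc):
  # CountVectorizer expects strings, so join token lists back together
  return doc if isinstance(doc, str) else " ".join(doc)

cv = CountVectorizer()

# Fit the vocabulary on the training data only, then reuse it for the test data
ctmTr = cv.fit_transform(X_train.map(to_text))
X_test_dtm = cv.transform(X_test.map(to_text))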
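
To make the numeric representation concrete, here is a small sketch of what CountVectorizer produces for two invented documents. The names `docs`, `toy_cv` and `counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie"]
toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # sparse document-term matrix

# The vocabulary maps each word to its column index (sorted alphabetically)
vocab = {term: int(index) for term, index in toy_cv.vocabulary_.items()}
print(sorted(vocab.items()))  # [('bad', 0), ('good', 1), ('movie', 2)]

# One row per document, one column per vocabulary word
print(counts.toarray())
# [[0 1 1]
#  [1 0 1]]
```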

In [None]:
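# Train the model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Allow more iterations than the default 100 so the solver has room to converge on the sparse count features
model = LogisticRegression(max_iter=1000)
model.fit(ctmTr, y_train)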

In [None]:
# Predict

y_pred_class = model.predict(X_test_dtm)

In [None]:
# Evaluate

accuracy_score(y_test, y_pred_class)
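
Accuracy alone does not show which class the errors fall into. A confusion matrix can complement it, as sketched here on four invented labels; `y_true` and `y_hat` are illustrative names, not the notebook's variables.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: one positive review is misclassified as negative
y_true = [True, False, True, True]
y_hat = [True, False, False, True]

print(accuracy_score(y_true, y_hat))  # 0.75

# Rows are the true classes (False, True), columns the predicted classes
print(confusion_matrix(y_true, y_hat))
# [[1 0]
#  [1 2]]
```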