<a href="https://colab.research.google.com/github/xerojester/Assignment-6/blob/main/Exercise_1_Sentiment_Analysis_with_Text_Classification_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1: Sentiment Analysis with Text Classification

In this notebook, we will apply our understanding of sentiment analysis using techniques like Multinomial Naïve Bayes, Logistic Regression, etc. The notebook is based on content discussed during week 4.

You will be working on training a model on a pretty big dataset of 1.6 million tweets! In case you are not able to load the complete dataset using your computing infrastructure, we recommend to work on a subset of the data

__Fill in missing content ``<YOUR CODE HERE>`` with correct answers__

## Install Dependencies

In [34]:
!pip install contractions
!pip install textsearch
!pip install tqdm



## Import Libraries

In [35]:
import nltk
import contractions
import numpy as np
import re
from tqdm import tqdm
import unicodedata
import pandas as pd
from bs4 import BeautifulSoup
import sklearn


In [36]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Get Dataset

For this exercise, we will make use of __Sentiment 140__ dataset. This dataset is a collection of tweets for the task of sentiment analysis. The dataset is available [here](http://help.sentiment140.com/for-students)

In [37]:
!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

URL transformed to HTTPS due to an HSTS policy
--2021-02-12 01:14:00--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip.1’


2021-02-12 01:14:03 (33.1 MB/s) - ‘trainingandtestdata.zip.1’ saved [81363704/81363704]



In [38]:
!unzip trainingandtestdata.zip

Archive:  trainingandtestdata.zip
replace testdata.manual.2009.06.14.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: testdata.manual.2009.06.14.csv  
replace training.1600000.processed.noemoticon.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: training.1600000.processed.noemoticon.csv  


## Load and Process Dataset

In [39]:
def set_label(target):
  if target == 0:
    return 'negative'
  elif target == '2':
    return 'neutral'
  else:
    return 'positive'

In [40]:
train_dataset = pd.read_csv(r'training.1600000.processed.noemoticon.csv',
                            encoding='latin-1',
                            header=None,
                            names=['target','id','datetime','query','userid','tweet'])
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   target    1600000 non-null  int64 
 1   id        1600000 non-null  int64 
 2   datetime  1600000 non-null  object
 3   query     1600000 non-null  object
 4   userid    1600000 non-null  object
 5   tweet     1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [41]:
train_dataset.loc[:,'target'] = train_dataset.target.apply(lambda x: set_label(x))

In [42]:
train_dataset = train_dataset[train_dataset.target!='neutral']
train_dataset.shape

(1600000, 6)

In [43]:
train_dataset.head()

Unnamed: 0,target,id,datetime,query,userid,tweet
0,negative,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,negative,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,negative,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,negative,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,negative,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [44]:
test_dataset = pd.read_csv(r'testdata.manual.2009.06.14.csv',encoding='latin-1',
                          header=None,
                          names=['target','id','datetime','query','userid','tweet'])
test_dataset.shape

(498, 6)

In [45]:
test_dataset.loc[:,'target'] = test_dataset.target.apply(lambda x: set_label(x))

In [46]:
test_dataset = test_dataset[test_dataset.target!='neutral']
test_dataset.shape

(498, 6)

In [47]:
test_dataset.head()

Unnamed: 0,target,id,datetime,query,userid,tweet
0,positive,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,positive,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,positive,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,positive,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,positive,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...


## Train and Test Datasets

In [48]:
train_reviews = train_dataset.tweet.values.tolist()
train_sentiments = train_dataset.target.values.tolist()

In [49]:
test_reviews = test_dataset.tweet.values.tolist()
test_sentiments = test_dataset.target.values.tolist()

## Question 1: Text Pre-processing (4 points)

1. Fill in the necessary functions below
2. Remove HTML tags, accents, contractions and special characters

In [50]:
def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
  [s.extract() for s in soup(['iframe', 'script'])]
  stripped_text = soup.get_text()
  stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
  return stripped_text


def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm(docs):
    # strip HTML tags
   doc = strip_html_tags(doc)
    # remove extra newlines and convert them to spaces
  doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    # lower case
  doc = doc.lower()
    # remove accents
  doc = remove_accented_chars(doc)
    # fix contractions
  doc = contractions.fix(doc)
    # remove special characters
  doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # remove extra whitespaces
  doc = re.sub(' +', ' ', doc)
    # remove leading and training whitespaces
  doc = doc.strip()  
  norm_docs.append(doc)
  
  return norm_docs

## Normalize Text

In [51]:
%%time
norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 1600000/1600000 [02:37<00:00, 10170.50it/s]
100%|██████████| 498/498 [00:00<00:00, 11710.22it/s]

CPU times: user 2min 7s, sys: 22.7 s, total: 2min 30s
Wall time: 2min 37s





In [52]:
print(len(norm_train_reviews))

1


In [53]:
norm_train_reviews[:2]

['happy charitytuesday thenspcc sparkscharity speakinguph4h']

## Question 2: Feature Engineering (2 points)

1. Fit and transform text data using TF-IDF vectorizer

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [55]:
# build BOW features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=5, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

ValueError: ignored

In [None]:
%%time

# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)

CPU times: user 1.42 ms, sys: 0 ns, total: 1.42 ms
Wall time: 1.35 ms


In [None]:
print('BOW model:> Train features shape:', cv_train_features.shape, 
      ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, 
      ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (1, 5)  Test features shape: (1, 5)
TFIDF model:> Train features shape: (1, 5)  Test features shape: (1, 5)


## Question 3: Sentiment Analysis using Multinomial Naïve Bayes (2 points)

Train a multinomial naive bayes model and evaluate the performance on the test data

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# instantiate model
clf = MultinomialNB(alpha=0, fit_prior=False)

In [None]:
# train model
clf.fit(cv_train_features, train_sentiments)

ValueError: ignored

Predict on test features

In [None]:
# predict on test data
mnb_tfidf_predictions = clf.predict(cv_test_features)

NotFittedError: ignored

### Model Evaluation

In [None]:
print(classification_report(test_sentiments, mnb_tfidf_predictions))

NameError: ignored

In [None]:
labels = ['negative', 'positive']
pd.DataFrame(confusion_matrix(test_sentiments, mnb_tfidf_predictions), 
             index=labels, columns=labels)

NameError: ignored

## Question 4: Sentiment Analysis using Logistic Regression (2 points)

Repeat the same experiment using logistic regression

In [None]:
<YOUR CODE HERE>