# How to create features from textual data: Bag of Words (BoW)

1. Each document is treated as a bag (multiset) of words.
2. Sentence structure and word order are ignored.
3. Only how often each word occurs matters, with the vocabulary built from the whole corpus.

The BoW model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

Sentence 1: "The cat sat on the hat"<br>
Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:
{ the, cat, sat, on, hat, dog, ate, and } of size 8

Hence, each feature vector has length 8:

FeatureVector 1: { 2, 1, 1, 1, 1, 0, 0, 0 }<br>
FeatureVector 2: { 3, 1, 0, 0, 1, 1, 1, 1}
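
The counting above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than the approach used in the rest of this notebook (which relies on scikit-learn's `CountVectorizer`); the helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

sentences = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    # Lowercase and split on whitespace (a simplification of real tokenization).
    tokenized = [doc.lower().split() for doc in docs]
    # Build the vocabulary in order of first appearance.
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    # Count each vocabulary word in each document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(sentences)
print(vocab)
for vec in vectors:
    print(vec)
```

The printed vectors match FeatureVector 1 and FeatureVector 2 above.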

# ML Model for Sentiment classification

Steps to follow to create a machine learning model for sentiment classification:
1. Load the training data using pandas
2. Perform text pre-processing
3. Tokenize the text into words
4. Create BoW feature vector of each document
5. Use the feature vectors to build an ML model for classification


In [1]:
import pandas as pd
data_dir = 'data/'

In [2]:
df_train = pd.read_csv(data_dir + 'labeledTrainData.tsv', sep="\t")

In [3]:
df_train.columns

Index(['id', 'sentiment', 'review'], dtype='object')

In [4]:
#Take only the columns you need

df_train=df_train[['sentiment', 'review']]

In [5]:
df_train.columns

Index(['sentiment', 'review'], dtype='object')

In [6]:
#Data Cleaning and Text Preprocessing

from bs4 import BeautifulSoup
import re

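def preproc(line):
    # Lowercase, strip HTML markup, then drop double quotes and backslashes
    line = line.lower()
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    text = BeautifulSoup(line, "html.parser").get_text()
    return re.sub(r'["\\]', "", text)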

In [7]:
df_train['review']=df_train['review'].apply(preproc)
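
The cleaned review printed in the next cell contains sentences that run together without a space (for example `m'kay.visually` and `him.the`). That is consistent with markup such as line-break tags being removed with nothing put in their place, though the raw review is not shown in this notebook. A minimal, standard-library-only sketch of an alternative that replaces each tag with a space follows. It swaps BeautifulSoup for a simple regular expression, which is an assumption that only holds for simple markup, and `preproc_spaced` is a hypothetical name.

```python
import re

def preproc_spaced(line):
    """Lowercase, replace tags with a space, drop quotes and backslashes."""
    text = re.sub(r"<[^>]+>", " ", line.lower())  # each tag becomes one space
    text = re.sub(r'["\\]', "", text)             # same characters preproc removes
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

# Made-up sample string, not taken from the dataset.
sample = 'Drugs are bad m\'kay.<br /><br />Visually "impressive"'
print(preproc_spaced(sample))
```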

In [8]:
df_train['review'][0]

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 mi

In [9]:
#Tokenization

import spacy
nlp = spacy.load('en')  # shortcut name; spaCy 3+ needs a full model name such as 'en_core_web_sm'

In [10]:
%time tokenized_docs=[[word.text for word in nlp(doc)] for doc in df_train['review']]

CPU times: user 1h 5min 10s, sys: 46min 55s, total: 1h 52min 5s
Wall time: 28min 46s
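
Calling `nlp(doc)` runs every component of the loaded pipeline, yet only `word.text` is kept, so much of the time above is probably spent on annotations that are never used. A hedged sketch of a tokenizer-only variant: `spacy.blank("en")` builds an English pipeline with no trained components, and `nlp.tokenizer.pipe` streams texts through the tokenizer alone. The two sample strings stand in for `df_train['review']`, and no timing is claimed here.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model download required.
nlp = spacy.blank("en")

# Stand-in for df_train['review'] (illustrative data).
sample_docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]

tokenized_docs = [
    [token.text for token in doc]
    for doc in nlp.tokenizer.pipe(sample_docs)
]
print(tokenized_docs[0])
```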


# Stopwords

Stopwords are commonly occurring words (a, an, the, etc.) that are usually removed during language processing.
The intention is to focus on the more informative words that carry the meaning of the text.
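
As a quick illustration of the idea before using spaCy's list, here is a minimal sketch with a tiny hand-picked stopword set. The set `toy_stopwords` is an assumption made for this example only and is far smaller than a real stopword list.

```python
# A tiny hand-picked stopword set, for illustration only.
toy_stopwords = {"a", "an", "the", "on", "and"}

tokens = ["the", "cat", "sat", "on", "the", "hat"]

# Keep only the tokens that are not stopwords.
important = [t for t in tokens if t not in toy_stopwords]
print(important)  # ['cat', 'sat', 'hat']
```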

In [11]:
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

In [13]:
print(len(stopwords))
list(stopwords)[0:20]

305


['few',
 'someone',
 'top',
 'from',
 'before',
 'as',
 'less',
 'anyone',
 'many',
 'others',
 're',
 'since',
 'although',
 'even',
 'we',
 'thru',
 'should',
 'themselves',
 'whether',
 'however']

In [14]:
def remove_stop_words(words):
    imp_words = [w for w in words if w not in stopwords]
    return " ".join(imp_words)

In [15]:
meaning_words = list(map(remove_stop_words, tokenized_docs))

# Create Bag of Words features

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
vectorizer = CountVectorizer(stop_words = None, max_features = 7000) 

In [61]:
train_data_features = vectorizer.fit_transform(meaning_words)

In [62]:
type(train_data_features)

scipy.sparse.csr.csr_matrix

In [63]:
train_data_features = train_data_features.toarray()

In [64]:
type(train_data_features)

numpy.ndarray

In [65]:
train_data_features.shape

(25000, 7000)
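
Converting to a dense array stores every zero explicitly. A rough estimate of the memory this takes, assuming 8-byte integer counts (`CountVectorizer` produces `int64` counts by default):

```python
n_docs, n_features = 25000, 7000
bytes_per_count = 8  # assumption: int64 counts

dense_bytes = n_docs * n_features * bytes_per_count
print(f"{dense_bytes / 1e9:.1f} GB")  # 1.4 GB
```

Both estimators used later in this notebook accept SciPy sparse input, so skipping `.toarray()` is an option when memory is tight, although the word-count cell that sums over `train_data_features` would then need adapting.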

In [66]:
#Take a look at the words in vocab
vocab = vectorizer.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+

In [67]:
type(vocab)
vocab[100:130]

['4th',
 '50',
 '50s',
 '60',
 '60s',
 '70',
 '70s',
 '73',
 '75',
 '80',
 '80s',
 '85',
 '90',
 '90s',
 '95',
 '99',
 'abandon',
 'abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abound',
 'about',
 'abraham',
 'abrupt',
 'abruptly',
 'absence',
 'absent',
 'absolute']

In [68]:
# Print Count of Vocab words

import numpy as np
dist = np.sum(train_data_features, axis=0)
word_freq = list(zip(vocab, dist))  # (word, total count across all reviews)

In [69]:
word_freq[100:120]

[('4th', 65),
 ('50', 483),
 ('50s', 112),
 ('60', 304),
 ('60s', 183),
 ('70', 437),
 ('70s', 299),
 ('73', 72),
 ('75', 77),
 ('80', 598),
 ('80s', 301),
 ('85', 53),
 ('90', 529),
 ('90s', 106),
 ('95', 69),
 ('99', 118),
 ('abandon', 51),
 ('abandoned', 187),
 ('abc', 125),
 ('abilities', 108)]
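
The slice above is in alphabetical order, which makes it hard to see which words dominate. A small sketch of sorting by count instead, using a few pairs copied from the output above (`word_freq_sample` is a hypothetical name; on the full data the same `sorted` call would apply to `word_freq`):

```python
# A few (word, count) pairs copied from the output above.
word_freq_sample = [
    ("4th", 65), ("50", 483), ("60", 304), ("70", 437),
    ("80", 598), ("90", 529), ("abandon", 51),
]

# Sort by count, highest first, and keep the top three.
top3 = sorted(word_freq_sample, key=lambda pair: pair[1], reverse=True)[:3]
print(top3)  # [('80', 598), ('90', 529), ('50', 483)]
```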

# Split into training and test set

In [70]:
n=20000

X_train, X_test=train_data_features[:n], train_data_features[n:]

In [71]:
(X_train.shape, X_test.shape)

((20000, 7000), (5000, 7000))

In [72]:
target=df_train['sentiment']

In [73]:
y_train, y_test = target[:n], target[n:]

In [74]:
(y_train.shape, y_test.shape)

((20000,), (5000,))
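
Slicing at a fixed index keeps the file's original row order, so the split is only representative if that order is unrelated to the label. A minimal sketch of a shuffled split with a fixed seed, using only the standard library, is shown below. `shuffled_split` is a hypothetical helper; scikit-learn's `train_test_split` offers the same idea with more options.

```python
import random

def shuffled_split(n_rows, n_train, seed=0):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffled_split(25000, 20000)
print(len(train_idx), len(test_idx))  # 20000 5000
```

The index lists could then select rows, for example `train_data_features[train_idx]` and `target.iloc[train_idx]`.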

In [75]:
from sklearn.ensemble import RandomForestClassifier

In [76]:
forest = RandomForestClassifier(n_estimators = 20) 

In [77]:
%time model=forest.fit(X_train, y_train)

CPU times: user 16.1 s, sys: 213 ms, total: 16.3 s
Wall time: 16.3 s


In [78]:
model.score(X_train,y_train)

0.99935

In [79]:
model.score(X_test,y_test)

0.8134

In [80]:
from sklearn.linear_model import LogisticRegression
logReg=LogisticRegression()

In [81]:
logReg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [82]:
logReg.score(X_train, y_train)

0.98735

In [84]:
logReg.score(X_test, y_test)

0.8576
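
For classifiers, `score` returns the mean accuracy, that is, the fraction of predictions that match the labels. A small sketch of that computation, plus per-class confusion counts, on made-up labels (the `y_true` and `y_pred` values below are illustrative and are not results from this notebook):

```python
from collections import Counter

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of positions where prediction equals label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Count each (true label, predicted label) combination.
confusion = Counter(zip(y_true, y_pred))

print(accuracy)           # 0.75
print(confusion[(1, 0)])  # positive reviews predicted as negative
print(confusion[(0, 1)])  # negative reviews predicted as positive
```

Accuracy alone hides which kind of error a model makes, so the confusion counts are a useful complement when comparing classifiers.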

In [45]:
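# Index of the least frequent word; the value is discarded because it is not the cell's last expression
np.argmin(dist, axis=0)
# Index 50 is hard-coded from an earlier run (note the In [45] counter); word_freq[np.argmin(dist)] would follow the current vocabulary
print(word_freq[50])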

('22', 78)
