# Text classification using Naive Bayes for sentiment analysis

This notebook is only a warm-up: it shows which libraries to use and walks
through the processes involved in text classification, using news headlines
labelled by category. The model will later have to be adapted to classify the
sentences of reviews into various topics. The planned analysis has four steps:

1. Classify sentences into topics, e.g. service, fees, registration.
2. Classify the sentences within each topic as positive or negative.
3. Count the positive and negative sentences within each topic.
4. Perform linear or multiple linear regression between the positive/negative
counts in each topic and the rating associated with the text each sentence belongs to.
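
Steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the `sentences` table, its column names, and the ratings are all made up, and steps 1 and 2 are assumed to have already produced a topic and a polarity for every sentence. Three reviews are far too few for a meaningful regression; the point is only the shape of the computation.

```python
import numpy as np
import pandas as pd

# Hypothetical output of steps 1 and 2: one row per sentence, already
# labelled with a topic and a polarity. All values are made up.
sentences = pd.DataFrame({
    "review_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "topic": ["service", "fees", "service", "fees", "fees",
              "service", "registration", "fees"],
    "polarity": ["pos", "neg", "pos", "neg", "neg", "neg", "pos", "pos"],
})
ratings = pd.Series({1: 4.0, 2: 1.0, 3: 3.0}, name="rating")  # made-up ratings

# Step 3: count positive/negative sentences per topic for each review.
counts = pd.crosstab(sentences["review_id"],
                     [sentences["topic"], sentences["polarity"]])
counts.columns = [f"{topic}_{polarity}" for topic, polarity in counts.columns]

# Step 4: least-squares fit of rating on the counts (with an intercept column).
X = np.column_stack([np.ones(len(counts)), counts.to_numpy()])
y = ratings.loc[counts.index].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(counts)
print(dict(zip(["intercept", *counts.columns], coef.round(3))))
```

Each coefficient would then be read as the change in rating associated with one more positive or negative sentence about that topic, provided there are many more reviews than features.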


[Tutorial](https://www.youtube.com/watch?v=oq68P8Kv7nE)

In [51]:
import numpy as np
import pandas as pd
import nltk  # Requires the NLTK "punkt" tokenizer data: nltk.download("punkt")
from nltk import word_tokenize  # Alternatively, use the nlp() parser from spaCy

In [52]:
# b - business, t - tech, e - entertainment.
df = pd.read_csv("sentiment.csv").loc[:, ["TITLE", "CATEGORY"]]
df.head()

| | TITLE | CATEGORY |
|---|---|---|
| 0 | Fed official says weak data caused by weather,... | b |
| 1 | Fed's Charles Plosser sees high bar for change... | b |
| 2 | US open: Stocks fall after Fed official hints ... | b |
| 3 | Fed risks falling 'behind the curve', Charles ... | b |
| 4 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | b |


In [53]:
df["CATEGORY"].value_counts()

e    32
t    22
b    11
Name: CATEGORY, dtype: int64

In [54]:
# Alt: Use Spacy nlp() to create tokens automatically (with lemmatization)
data = "Pifort Technologies is a Software Development company. Piford also provides a training program."
print(word_tokenize(data))

['Pifort', 'Technologies', 'is', 'a', 'Software', 'Development', 'company', '.', 'Piford', 'also', 'provides', 'a', 'training', 'program', '.']


In [55]:
# Remove punctuation. Note that this iterates over the characters of the raw
# string, not over the token list. With spaCy, tokens could instead be
# filtered on token.is_punct.
import string
data_1 = [char for char in data if char not in string.punctuation]
print(data_1)


['P', 'i', 'f', 'o', 'r', 't', ' ', 'T', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'i', 'e', 's', ' ', 'i', 's', ' ', 'a', ' ', 'S', 'o', 'f', 't', 'w', 'a', 'r', 'e', ' ', 'D', 'e', 'v', 'e', 'l', 'o', 'p', 'm', 'e', 'n', 't', ' ', 'c', 'o', 'm', 'p', 'a', 'n', 'y', ' ', 'P', 'i', 'f', 'o', 'r', 'd', ' ', 'a', 'l', 's', 'o', ' ', 'p', 'r', 'o', 'v', 'i', 'd', 'e', 's', ' ', 'a', ' ', 't', 'r', 'a', 'i', 'n', 'i', 'n', 'g', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm']


In [56]:
data_1 = "".join(data_1)
print(data_1)

Pifort Technologies is a Software Development company Piford also provides a training program


In [57]:
data_1 = data_1.split()
print(data_1)

['Pifort', 'Technologies', 'is', 'a', 'Software', 'Development', 'company', 'Piford', 'also', 'provides', 'a', 'training', 'program']
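
The three cells above (filter characters, join, split) can be collapsed into one step. A minimal sketch using only the standard library, assuming the same `data` string as above; `strip_punctuation` and `tokens` are names introduced here for illustration:

```python
import string

data = ("Pifort Technologies is a Software Development company. "
        "Piford also provides a training program.")

# str.maketrans("", "", chars) builds a translation table that deletes
# every character listed in chars.
strip_punctuation = str.maketrans("", "", string.punctuation)

# Delete the punctuation, then split on whitespace to get word tokens.
tokens = data.translate(strip_punctuation).split()
print(tokens)
```

This should produce the same word list as the cell above while avoiding the intermediate list of single characters.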


In [58]:
# Delete all stop words from our data (requires nltk.download("stopwords"))
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))  # build the lookup set once
data_1 = [word for word in data_1 if word not in stop_words]
print(data_1)

['Pifort', 'Technologies', 'Software', 'Development', 'company', 'Piford', 'also', 'provides', 'training', 'program']
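
One caveat with the filter above is that it compares words case-sensitively, so a capitalised stop word such as "The" at the start of a sentence would survive. A small sketch of a case-insensitive variant follows. The stop-word set here is a deliberately tiny, hypothetical stand-in so the example runs without NLTK data; in the notebook it would be `set(stopwords.words("english"))`.

```python
# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = frozenset({"is", "a", "the", "and", "of"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "company", "is", "a", "Software", "Development", "company"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Using a set rather than a list also makes each membership check constant-time, which matters once the function is applied to every title in a column.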


In [59]:
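# Feature extraction: convert words into feature vectors with CountVectorizer,
# a bag-of-words variant. NB: data_1 is a list of words here, so each word is
# treated as its own one-word document when the vocabulary is learned.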
from sklearn.feature_extraction.text import CountVectorizer

vectoriser = CountVectorizer()
vectoriser.fit(data_1)
print(vectoriser.vocabulary_) # Shows the numeric value/index assigned to word

{'pifort': 4, 'technologies': 8, 'software': 7, 'development': 2, 'company': 1, 'piford': 3, 'also': 0, 'provides': 6, 'training': 9, 'program': 5}


In [60]:
# Get the bag-of-words counts, printed as: (document index, word index)  frequency
data_1 = [" ".join(data_1)]
vector = vectoriser.transform(data_1)
print(vector)

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
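
The sparse printout is easier to read as a dense matrix with one row per document and one column per vocabulary word. A minimal sketch on two made-up documents (`docs` is not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ate the fish"]  # made-up documents

vectoriser = CountVectorizer()
counts = vectoriser.fit_transform(docs)  # learn the vocabulary and count in one call

# Order the words by their column index so they line up with the matrix columns.
vocab = sorted(vectoriser.vocabulary_, key=vectoriser.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row is the word histogram of one document; for example, "the" occurs twice in the second document, so its column holds a 2 in that row.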


## Analysing for real

In [64]:
# Text cleaning function we want to apply to each document
def text_cleaning(a):
    """Strip punctuation, then return the words that are not English stop words."""
    stop_words = set(stopwords.words("english"))  # build the lookup set once per call
    remove_punctuation = "".join(char for char in a if char not in string.punctuation)
    return [word for word in remove_punctuation.split() if word.lower() not in stop_words]


In [65]:
# Apply the text cleaning function to every element in the series/column
print(df.loc[:, "TITLE"].apply(text_cleaning)) 

0     [Fed, official, says, weak, data, caused, weat...
1     [Feds, Charles, Plosser, sees, high, bar, chan...
2     [US, open, Stocks, fall, Fed, official, hints,...
3     [Fed, risks, falling, behind, curve, Charles, ...
4     [Feds, Plosser, Nasty, Weather, Curbed, Job, G...
                            ...                        
60    [GM, recalls, another, 24M, vehicles, belts, b...
61    [Business, update, Parade, GM, recalls, rolls,...
62                     [GM, keeps, recalling, vehicles]
63                                        [GM, recalls]
64                     [10, largest, GM, recalls, year]
Name: TITLE, Length: 65, dtype: object


In [72]:
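# Get the vectoriser to learn the dictionary from the title series.
# NB: analyzer=text_cleaning makes the vectoriser call the function on each
# title. It replaces the default preprocessing, so tokens keep their case
# (e.g. 'Taper' and 'taper' end up as separate features).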
bow_transformer = CountVectorizer(analyzer=text_cleaning).fit(df["TITLE"])
bow_transformer.vocabulary_

{'Fed': 61,
 'official': 202,
 'says': 221,
 'weak': 243,
 'data': 171,
 'caused': 167,
 'weather': 244,
 'slow': 224,
 'taper': 229,
 'Feds': 62,
 'Charles': 38,
 'Plosser': 109,
 'sees': 222,
 'high': 183,
 'bar': 160,
 'change': 168,
 'pace': 205,
 'tapering': 230,
 'US': 141,
 'open': 203,
 'Stocks': 131,
 'fall': 176,
 'hints': 184,
 'accelerated': 155,
 'risks': 219,
 'falling': 177,
 'behind': 162,
 'curve': 170,
 'Nasty': 96,
 'Weather': 147,
 'Curbed': 45,
 'Job': 81,
 'Growth': 73,
 'May': 86,
 'Accelerate': 18,
 'Tapering': 136,
 'Pace': 104,
 'Taper': 135,
 'may': 195,
 'expects': 175,
 'unemployment': 236,
 '62': 14,
 'end': 174,
 '2014': 4,
 'jobs': 190,
 'growth': 182,
 'last': 193,
 'month': 197,
 'hit': 185,
 'weatherFed': 245,
 'President': 113,
 'ECB': 54,
 'unlikely': 237,
 'sterilisation': 225,
 'SMP': 121,
 'purchases': 207,
 'traders': 235,
 'sterilization': 226,
 'Box': 32,
 'Office': 102,
 'XMen': 153,
 'Days': 50,
 'Future': 68,
 'Past': 106,
 'Nabs': 95,
 '26

In [73]:
title_bow = bow_transformer.transform(df["TITLE"])
print(title_bow)

  (0, 61)	1
  (0, 167)	1
  (0, 171)	1
  (0, 202)	1
  (0, 221)	1
  (0, 224)	1
  (0, 229)	1
  (0, 243)	1
  (0, 244)	1
  (1, 38)	1
  (1, 62)	1
  (1, 109)	1
  (1, 160)	1
  (1, 168)	1
  (1, 183)	1
  (1, 205)	1
  (1, 222)	1
  (1, 230)	1
  (2, 61)	1
  (2, 131)	1
  (2, 141)	1
  (2, 155)	1
  (2, 176)	1
  (2, 184)	1
  (2, 202)	1
  :	:
  (60, 69)	1
  (60, 158)	1
  (60, 159)	1
  (60, 163)	1
  (60, 213)	1
  (60, 240)	1
  (61, 7)	1
  (61, 36)	1
  (61, 69)	1
  (61, 105)	1
  (61, 213)	1
  (61, 220)	1
  (61, 238)	1
  (61, 240)	1
  (62, 69)	1
  (62, 191)	1
  (62, 212)	1
  (62, 240)	1
  (63, 69)	1
  (63, 213)	1
  (64, 2)	1
  (64, 69)	1
  (64, 192)	1
  (64, 213)	1
  (64, 249)	1


In [77]:
# Use TF-IDF to down-weight words that appear in many documents. It only
# reweights the counts; it does not remove any words.
# NB: Gensim also has its own TF-IDF model.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(title_bow)
print(tfidf_transformer)

title_tfidf = tfidf_transformer.transform(title_bow)
print(title_tfidf)
print(title_tfidf.shape)

# TODO: Drop unimportant words before fitting the Bayes classifier
# (TF-IDF on its own only reweights them). Might this reduce overfitting?

TfidfTransformer()
  (0, 244)	0.35477264027465666
  (0, 243)	0.35477264027465666
  (0, 229)	0.35477264027465666
  (0, 224)	0.32278160612728646
  (0, 221)	0.32278160612728646
  (0, 202)	0.32278160612728646
  (0, 171)	0.32278160612728646
  (0, 167)	0.35477264027465666
  (0, 61)	0.28247766961965964
  (1, 230)	0.33258913256553785
  (1, 222)	0.36555219519046506
  (1, 205)	0.33258913256553785
  (1, 183)	0.36555219519046506
  (1, 168)	0.36555219519046506
  (1, 160)	0.36555219519046506
  (1, 109)	0.2528507396807884
  (1, 62)	0.29106058500399523
  (1, 38)	0.30920146743562676
  (2, 230)	0.3250877985152921
  (2, 203)	0.35730740045597703
  (2, 202)	0.3250877985152921
  (2, 184)	0.35730740045597703
  (2, 176)	0.3250877985152921
  (2, 155)	0.35730740045597703
  (2, 141)	0.3022276271355269
  :	:
  (60, 213)	0.30863342900757507
  (60, 163)	0.4373404919478541
  (60, 159)	0.4373404919478541
  (60, 158)	0.4373404919478541
  (60, 69)	0.2652884107464841
  (60, 7)	0.4065867120677751
  (61, 240)	0.2817291368
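
To see where these weights come from, the sketch below recomputes them by hand on three made-up documents and compares the result with `TfidfTransformer`. It assumes the transformer's default settings (smoothed idf and L2 row normalisation); `docs`, `doc_freq`, and `manual` are names introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the cat ate the fish", "the dog sat"]  # made-up documents
counts = CountVectorizer().fit_transform(docs)
tf = counts.toarray().astype(float)               # raw term counts per document

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                   # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

weighted = tf * idf
# Scale every row to unit Euclidean length (L2 normalisation).
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))
```

A word that occurs in every document ("the" here) gets the smallest idf, so it is down-weighted but still present. Actually dropping terms is a separate step; `CountVectorizer` has `min_df` and `max_df` parameters that exclude terms by document frequency.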

In [84]:
# Fit a multinomial naive Bayes classifier; fit() returns the trained model.
# Inputs are the TF-IDF features and the labels in df["CATEGORY"].
# NB: the model must be fitted on the same representation it later predicts
# on. Plain bag-of-words counts (title_bow) would also work, provided that
# predict() is then given title_bow as well.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(title_tfidf, df["CATEGORY"])

In [85]:
# Make predictions
all_predictions = model.predict(title_tfidf)
print(all_predictions)

['b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'e' 'e' 'e' 'e' 'e' 'e' 'e'
 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e'
 'e' 'e' 'e' 'e' 'e' 'e' 'e' 't' 't' 't' 't' 't' 't' 't' 't' 't' 't' 't'
 't' 't' 't' 't' 't' 't' 't' 't' 't' 't' 't']
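
One way to make sure training and prediction always go through the same transformations is to chain the steps in a scikit-learn `Pipeline`. The sketch below is an illustration on made-up headlines, not the notebook's dataset; `titles`, `labels`, and the step names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up headlines standing in for df["TITLE"] and df["CATEGORY"].
titles = [
    "Fed official says weak data caused by weather",
    "Stocks fall after Fed official hints at taper",
    "Box office: new film tops the weekend chart",
    "Film star wins award at festival",
    "New phone released with faster chip",
    "Chip maker unveils new tablet and phone",
]
labels = ["b", "b", "e", "e", "t", "t"]

# Each step feeds the next: raw text -> counts -> TF-IDF weights -> classifier.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(titles, labels)

# predict() takes raw text and applies the fitted vectoriser and TF-IDF steps itself.
predictions = pipeline.predict(["Fed hints at slower taper", "Phone chip gets faster"])
print(predictions)
```

Because the pipeline owns the vectoriser and the TF-IDF step, it is not possible to fit on one representation and predict on another by accident.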


In [86]:
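# See the confusion matrix of the predictions
from sklearn.metrics import confusion_matrix
confusion_matrix(df["CATEGORY"], all_predictions)

# Understanding the result:
# rows are the actual categories and columns the predicted ones, both in
# sorted label order (b, e, t). E.g. the 11 at (0, 0) means that all 11 actual
# "b" titles were predicted as "b". These predictions are made on the training
# data itself, so a perfect diagonal says little about unseen titles.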

array([[11,  0,  0],
       [ 0, 32,  0],
       [ 0,  0, 22]])
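
A more informative check is to hold out some titles and build the confusion matrix on those only. The following is a minimal sketch on made-up headlines; the data, the 25% test size, and `random_state=0` are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up headlines, four per category (b, e, t).
titles = [
    "Fed official says weak data caused by weather",
    "Stocks fall after Fed official hints at taper",
    "ECB unlikely to end bond purchases",
    "Unemployment expected to fall as jobs growth returns",
    "Box office: new film tops the weekend chart",
    "Film star wins award at festival",
    "Singer announces new album and tour",
    "TV drama renewed for another season",
    "New phone released with faster chip",
    "Chip maker unveils new tablet and phone",
    "Search engine updates its mobile app",
    "Software company patches browser bug",
]
labels = ["b"] * 4 + ["e"] * 4 + ["t"] * 4

# Keep 25% of the titles aside; stratify so every category appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(X_train, y_train)

# labels= fixes the row/column order so the matrix is easy to read.
matrix = confusion_matrix(y_test, model.predict(X_test), labels=["b", "e", "t"])
print(matrix)
```

With only a handful of examples the numbers themselves mean little, but the same pattern applied to the full dataset would show how well the classifier generalises.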