# Text Mining Model - Topic Modeling and Sentiment Analysis

#### Preprocessing the data

In [1]:
import pandas as pd
import re
import lda
import numpy as np
import lda.datasets
import sklearn.feature_extraction.text as text

In [2]:
df = pd.read_csv('Tweets.csv', sep="\t")
df.head(5)

| Unnamed: 0 | created_at | screen_name | text |
|---|---|---|---|
| 0 | Fri Nov 18 23:59:58 +0000 2016 | arunprasad72 | RT @Praveen_1singh: First the stone pelting st... |
| 1 | Fri Nov 18 23:59:49 +0000 2016 | pranavkisu | RT @NewDelhiTimesIN: Is the #demonetization of... |
| 2 | Fri Nov 18 23:59:48 +0000 2016 | bablumohan | RT @scoopwhoopnews: #BREAKING Banks across Ind... |
| 3 | Fri Nov 18 23:59:37 +0000 2016 | NagrathRob | RT @DrGPradhan: .@ravishndtv of @ndtv spreadin... |
| 4 | Fri Nov 18 23:59:28 +0000 2016 | ManishPrasa | RT @YesIamSaffron: जब भी #Demonetization व् का... |


#### Topic modelling (LDA - Latent Dirichlet allocation)

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. For example, if the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word in the document is attributable to one of those topics.
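
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two toy topics, the six-word vocabulary, and the Dirichlet parameter `alpha` are all made up for the example, and the helper names (`sample_dirichlet`, `generate_document`) are hypothetical. It shows how a document would be *generated* under LDA, not how the `lda` package fits a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document: a topic mixture, then one topic and one word per position."""
    theta = sample_dirichlet(alpha, rng)              # per-document topic mixture
    word_ids = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)            # topic assigned to this word
        w = sample_categorical(topic_word[z], rng)    # word drawn from that topic
        word_ids.append(w)
    return theta, word_ids

# Hypothetical toy data: 2 topics over a 6-word vocabulary
toy_vocab = ["bank", "cash", "queue", "modi", "govt", "policy"]
toy_topic_word = [
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],   # a "banking" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "politics" topic
]

rng = random.Random(0)
theta, word_ids = generate_document(toy_topic_word, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 3) for t in theta])
print([toy_vocab[i] for i in word_ids])
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the per-topic word distributions.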

#### Generating the document term matrix

In [3]:
vectorizer = text.CountVectorizer(input='content', stop_words='english', min_df=1)
# Replace missing tweets with empty strings; assigning back avoids relying on in-place fillna
df['text'] = df['text'].fillna('')
dtm = vectorizer.fit_transform(df.text).toarray()
dtm

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
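
To make the shape of this matrix concrete, here is a minimal sketch that builds a document-term matrix by hand for three made-up documents. It assumes plain whitespace tokenization, so it is simpler than `CountVectorizer`, which also applies its own token pattern and removes English stop words in the cell above. The function name `build_dtm` and the toy documents are hypothetical.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocab, dtm) where dtm[i][j] counts how often vocab[j] occurs in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    dtm = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        dtm.append(row)
    return vocab, dtm

# Hypothetical toy documents
toy_docs = ["banks serve senior citizens", "banks banks everywhere", "black money"]
toy_vocab, toy_dtm = build_dtm(toy_docs)

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Each row is one document and each column one vocabulary term, which is why the real matrix above is almost entirely zeros: any single tweet uses only a handful of the roughly 11,000 terms.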

#### Loading the vocabulary

In [4]:
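# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names())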
vocab[:20]

array(['00', '000', '00f8drqczr', '025', '039mojsvyi', '0480654',
       '04bevf5lre', '06', '07', '08', '09ip2atmbg', '09mitali',
       '09uv9sfd2y', '0bfujsgltb', '0bwfwuirqk', '0fan6b2wxv',
       '0fmh3umbvq', '0frhtzuyhv', '0funzj0fjo', '0i47zl55f2'],
      dtype='<U29')

In [5]:
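import warnings
warnings.filterwarnings("ignore")

# Note: this rebinds `text`, shadowing the sklearn module imported as `text` in In [1]
text = df.text
# Fit a 5-topic LDA model with 500 iterations (fixed seed for reproducibility)
model = lda.LDA(n_topics=5, n_iter=500, random_state=1)
model.fit(dtm)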

INFO:lda:n_documents: 11914
INFO:lda:vocab_size: 11019
INFO:lda:n_words: 141760
INFO:lda:n_topics: 5
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -1372721
INFO:lda:<10> log likelihood: -978602
INFO:lda:<20> log likelihood: -954704
INFO:lda:<30> log likelihood: -950702
INFO:lda:<40> log likelihood: -947126
INFO:lda:<50> log likelihood: -944809
INFO:lda:<60> log likelihood: -943443
INFO:lda:<70> log likelihood: -942278
INFO:lda:<80> log likelihood: -940852
INFO:lda:<90> log likelihood: -939839
INFO:lda:<100> log likelihood: -939844
INFO:lda:<110> log likelihood: -938514
INFO:lda:<120> log likelihood: -938177
INFO:lda:<130> log likelihood: -936859
INFO:lda:<140> log likelihood: -936467
INFO:lda:<150> log likelihood: -936332
INFO:lda:<160> log likelihood: -935451
INFO:lda:<170> log likelihood: -935602
INFO:lda:<180> log likelihood: -935251
INFO:lda:<190> log likelihood: -935235
INFO:lda:<200> log likelihood: -934997
INFO:lda:<210> log likelihood: -934803
INFO:lda:<220> log likelihood:

<lda.lda.LDA at 0x26d4f1be0b8>

In [6]:
model.topic_word_

array([[3.64296203e-07, 3.64296203e-07, 3.67939165e-05, ...,
        3.64296203e-07, 3.64296203e-07, 3.64296203e-07],
       [3.99357992e-07, 5.59500547e-04, 3.99357992e-07, ...,
        3.99357992e-07, 3.99357992e-07, 8.02709564e-05],
       [3.72686687e-07, 1.86716030e-04, 3.72686687e-07, ...,
        3.72686687e-07, 3.72686687e-07, 3.72686687e-07],
       [7.48101351e-05, 4.22764252e-04, 2.48538655e-07, ...,
        2.48538655e-07, 2.48538655e-07, 2.48538655e-07],
       [4.39498813e-07, 4.39498813e-07, 4.39498813e-07, ...,
        1.32289143e-04, 2.20188905e-04, 4.39498813e-07]])

In [7]:
topic_word = model.topic_word_

#### Finding the key words that come together for each topic

In [8]:
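n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    # argsort is ascending; the reversed slice stops before index -n_top_words,
    # so it yields the n_top_words - 1 (here 7) most probable words
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))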

Topic 0: demonetization rt https purpose devyanidilli mouth uses
Topic 1: demonetization rt https sardesairajdeep taken congress struggling
Topic 2: demonetization rt https पर कर रह modi
Topic 3: demonetization https rt hai modi bank nahi
Topic 4: demonetization rt https money black anna govt
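
Each topic above lists seven words even though `n_top_words` is 8, because a reversed slice of the form `[:-n:-1]` stops before index `-n`. The following sketch, using a made-up five-word topic and a hypothetical helper `top_words`, shows the difference on plain Python lists, which slice the same way as NumPy arrays.

```python
def top_words(topic_dist, vocab, n):
    """Return the n highest-probability words, most probable first."""
    order = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j], reverse=True)
    return [vocab[j] for j in order[:n]]

# Hypothetical toy topic over a 5-word vocabulary
toy_vocab = ["bank", "money", "queue", "black", "govt"]
toy_topic = [0.05, 0.40, 0.10, 0.30, 0.15]

# Indices sorted by ascending probability, analogous to np.argsort(topic_dist)
ascending = sorted(range(len(toy_topic)), key=lambda j: toy_topic[j])

n = 3
short_slice = ascending[:-n:-1]        # stops before index -n: n - 1 items
full_slice = ascending[:-(n + 1):-1]   # exactly n items

print(len(short_slice), len(full_slice))
print(top_words(toy_topic, toy_vocab, n))
```

If exactly `n_top_words` words per topic are wanted, slicing with `-(n_top_words + 1)` is one way to get them.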


#### Finding the Topic for each Document

In [9]:
doc_topic = model.doc_topic_
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("topic: {} , {}".format(topic_most_pr,text[n]))

topic: 4 , RT @Praveen_1singh: First the stone pelting stopped and now this!! 
What months of politics and talks couldn't do #demonetization did in da…
topic: 0 , RT @NewDelhiTimesIN: Is the #demonetization of ₹1000 &amp; ₹500 notes good for India? 

@AmitShah @OfficeOfRG @PMOIndia @BJP4India
topic: 0 , RT @scoopwhoopnews: #BREAKING Banks across India to serve only senior citizens tomorrow: NDTV
#demonetization
topic: 1 , RT @DrGPradhan: .@ravishndtv of @ndtv spreading rumours to provoke people against #demonetization &amp; PM @narendramodi 

He need mob treatmen…
topic: 2 , RT @YesIamSaffron: जब भी #Demonetization व् काली धन का इतिहास लीखा जाएगा फ़र्ज़ीवाल का नाम सबसे ऊपर काले अछर से लिखा जाएगा @ArvindKejriwal…
topic: 3 , Agree Sir reason of worry for SC could b some or all of them were affected by #demonetization or instructions by ma… https://t.co/ifqDuDbcEm
topic: 3 , @BspUp2017 @OfficeOfRG @MamataOfficial @ArvindKejriwal  #सारे_चोर_मचाये_शोर  #demonetization  #ModiFightsCorr

## Sentiment Analysis

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence, or an entity feature/aspect is positive, negative, or neutral. We will use a knowledge-based technique, which classifies text by affect categories based on the presence of unambiguous affect words such as happy, sad, afraid, and bored.
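
As a rough illustration of the knowledge-based idea, the sketch below scores a text by counting hits against two tiny word lists. The word sets `TOY_POSITIVE` and `TOY_NEGATIVE` and the function `lexicon_polarity` are made up for this example; they stand in for the much larger `positive-words.txt` and `negative-words.txt` files, and this simple counting rule is not the classifier trained later in the notebook.

```python
import re

# Hypothetical miniature lexicons (stand-ins for the real word lists)
TOY_POSITIVE = {"happy", "good", "great"}
TOY_NEGATIVE = {"sad", "afraid", "bored", "hate"}

def lexicon_polarity(text):
    """Label text 'pos', 'neg', or 'neutral' by counting lexicon words it contains."""
    tokens = re.findall(r"[\w']+", text.lower())
    pos_hits = sum(tok in TOY_POSITIVE for tok in tokens)
    neg_hits = sum(tok in TOY_NEGATIVE for tok in tokens)
    if pos_hits > neg_hits:
        return "pos"
    if neg_hits > pos_hits:
        return "neg"
    return "neutral"

for sentence in ["I hate data science", "Happy with the good news", "Banks open tomorrow"]:
    print(lexicon_polarity(sentence), "|", sentence)
```

Counting treats every lexicon word as equally informative; the Naive Bayes classifier used below instead learns a per-word weight from the labeled word lists.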

In [10]:
from nltk.classify import NaiveBayesClassifier
import math
import collections

In [11]:
pos_features = []
neg_features = []
def make_full_dict(word):
    # Build a one-entry feature dict: the word itself is the only feature
    return {word: True}

In [14]:
with open('positive-words.txt','r', encoding='utf-8') as posFile:
    lines = posFile.readlines()
    for line in lines:
        pos_features.append([make_full_dict(line.rstrip()),'pos'])
        
with open('negative-words.txt','r', encoding='utf-8') as negFile:
    lines = negFile.readlines()
    for line in lines:
        neg_features.append([make_full_dict(line.rstrip()),'neg'])

In [15]:
len(pos_features),len(neg_features)

(8021, 4942)

In [16]:
trainFeatures = pos_features + neg_features

#### Naive Bayes Classifier

In [17]:
classifier = NaiveBayesClassifier.train(trainFeatures)
referenceSets = collections.defaultdict(set)
testSets = collections.defaultdict(set)
def make_full_dict_sent(words):
    # Build a feature dict for a whole sentence: every token becomes a feature
    return {word: True for word in words}

In [18]:
import re
neg_test = 'I hate data science'
title_words = re.findall(r"[\w']+|[.,!?;]",
                         'The Daily Mail stole My Visualization, Twice')
title_words

['The', 'Daily', 'Mail', 'stole', 'My', 'Visualization', ',', 'Twice']

In [19]:
test=[]
test.append([make_full_dict_sent(title_words),''])
test

[[{',': True,
   'Daily': True,
   'Mail': True,
   'My': True,
   'The': True,
   'Twice': True,
   'Visualization': True,
   'stole': True},
  '']]

In [20]:
# Test Classifier

for i, (features, label) in enumerate(test):
    predicted = classifier.classify(features)
    print(predicted)

neg
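
To show how a Naive Bayes classifier turns per-word evidence into a single label, here is a simplified, hand-rolled sketch. It uses add-one smoothing and a five-word toy training set, both of which are assumptions for illustration; it is not a reproduction of NLTK's `NaiveBayesClassifier`, whose probability estimates differ. The names `train_naive_bayes`, `classify`, and `toy_examples` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (list_of_words, label). Returns the counts needed for scoring."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(words, model):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)            # prior
        denom = sum(word_counts[label].values()) + len(vocab)    # add-one smoothing
        for w in words:
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: one word per example, mirroring the lexicon-style features
toy_examples = [(["good"], "pos"), (["great"], "pos"), (["happy"], "pos"),
                (["hate"], "neg"), (["sad"], "neg")]
toy_model = train_naive_bayes(toy_examples)

print(classify("i hate data science".split(), toy_model))
print(classify("good day".split(), toy_model))
```

In this sketch, when none of the words in a sentence appear in the training vocabulary, the decision falls back to the class prior, so the larger class wins.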


In [33]:
for doc in df.text[:20]:
    # Lowercase the tweet before tokenizing, then classify its bag of tokens
    title_words = re.findall(r"[\w']+|[.,!?;]", doc.lower())
    predicted = classifier.classify(make_full_dict_sent(title_words))
    print(predicted, doc)

pos RT @Praveen_1singh: First the stone pelting stopped and now this!! 
What months of politics and talks couldn't do #demonetization did in da…
pos RT @NewDelhiTimesIN: Is the #demonetization of ₹1000 &amp; ₹500 notes good for India? 

@AmitShah @OfficeOfRG @PMOIndia @BJP4India
neg RT @scoopwhoopnews: #BREAKING Banks across India to serve only senior citizens tomorrow: NDTV
#demonetization
neg RT @DrGPradhan: .@ravishndtv of @ndtv spreading rumours to provoke people against #demonetization &amp; PM @narendramodi 

He need mob treatmen…
pos RT @YesIamSaffron: जब भी #Demonetization व् काली धन का इतिहास लीखा जाएगा फ़र्ज़ीवाल का नाम सबसे ऊपर काले अछर से लिखा जाएगा @ArvindKejriwal…
neg Agree Sir reason of worry for SC could b some or all of them were affected by #demonetization or instructions by ma… https://t.co/ifqDuDbcEm
pos @BspUp2017 @OfficeOfRG @MamataOfficial @ArvindKejriwal  #सारे_चोर_मचाये_शोर  #demonetization  #ModiFightsCorruption 
https://t.co/4ElMIvgygX
pos RT @janlokpa

As the results above show for the first 20 tweets, the model produced plausible output for both topic modeling and sentiment analysis. This is a qualitative check only: the notebook does not measure accuracy against labeled data.