# Text Analytics / Assignment 1 / Part B / Document Model Simulation

Group members:
1. Balaji Venkatesh
2. Gireesh Sundaram
3. Vineet Kapoor

In this assignment we construct a topic model simulation using four unrelated topics taken from Wikipedia.

## Step 1: Importing the necessary packages

In [1]:
#For web automation:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

#For natural language processing:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

#For topic model simulation:
import gensim
from gensim import corpora
from gensim.corpora import Dictionary

#General purpose
import pandas as pd
import string
import random
import re

#For visualizing the topics and words
import pyLDAvis.gensim



## Step 2: Opening a web browser through Selenium and extracting the paragraphs using Beautiful Soup:

In [6]:
driver = webdriver.Chrome("C:/Users/Vineet/Documents/ISB-H/Big data collection/group assign/chromedriver")

In [7]:
#This function takes a topic name as input and returns all the paragraphs of its Wikipedia page
#Input = String (page name as used in the Wikipedia URL)
#Return = List of the paragraph (<p>) elements found on the Wikipedia page
def scrapeFromWiki(topic):
    driver.get("https://en.wikipedia.org/wiki/" + topic)

    pagesrc = driver.page_source
    soup = BeautifulSoup(pagesrc, "lxml")
    
    topic_1 = soup.find_all("p")
    return topic_1
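
To make the return value of `scrapeFromWiki` concrete without opening a browser, here is a minimal sketch that collects paragraph text from a small hard-coded HTML string. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup, so it is not the notebook's method; `ParagraphCollector`, `paragraphs_from_html` and the sample markup are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._depth = 0       # greater than 0 while inside a <p> element
        self._parts = []      # text fragments of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the paragraph
        if self._depth > 0:
            self._parts.append(data)


def paragraphs_from_html(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page source standing in for driver.page_source
sample = ("<html><body><h1>Paris</h1>"
          "<p>Paris is the <b>capital</b> of France.</p>"
          "<p>It lies on the Seine.</p></body></html>")
print(paragraphs_from_html(sample))
```

In the notebook itself each list element is a Beautiful Soup tag rather than a string, which is why the later steps call `.text.strip()` on it.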

In [8]:
#Passing to the function and storing the result
Paris = scrapeFromWiki("Paris")
Formula1 = scrapeFromWiki("Formula1")
Modi = scrapeFromWiki("Narendra_Modi")
Batman = scrapeFromWiki("Batman")

#Creating a list with four topics listed above
selection_list = ['Paris', 'Formula1', 'Modi', 'Batman']
selection_list


['Paris', 'Formula1', 'Modi', 'Batman']

## Step 3: Creating a dataframe with a random topic selected from the selection list that we have created:

In [9]:
topic_columns = ["Topic1", "Topic2", "Topic3", "Topic4"]

#Building 50 rows, each with four topics drawn at random (with replacement) from the selection_list
#DataFrame.append was removed in pandas 2.0, so the rows are collected first and the dataframe is built once
rows = [[i] + [random.choice(selection_list) for _ in topic_columns] for i in range(0,50)]
document_topics = pd.DataFrame(rows, columns = ["Document_number"] + topic_columns)
    
document_topics.head(5)

| | Document_number | Topic1 | Topic2 | Topic3 | Topic4 |
|---|---|---|---|---|---|
| 0 | 0 | Paris | Modi | Formula1 | Formula1 |
| 1 | 1 | Paris | Batman | Modi | Modi |
| 2 | 2 | Formula1 | Paris | Batman | Batman |
| 3 | 3 | Formula1 | Paris | Formula1 | Formula1 |
| 4 | 4 | Batman | Formula1 | Batman | Formula1 |
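
The random assignment in Step 3 can also be sketched without pandas. The version below is only an illustration under stated assumptions: `make_assignments` and its `seed` parameter are hypothetical additions (the notebook does not seed the generator), and the rows are plain dictionaries rather than dataframe rows.

```python
import random

selection_list = ['Paris', 'Formula1', 'Modi', 'Batman']
topic_columns = ["Topic1", "Topic2", "Topic3", "Topic4"]


def make_assignments(n_documents, seed=None):
    """Return one dict per document with four randomly drawn topics."""
    rng = random.Random(seed)  # private generator, so a seed makes runs repeatable
    rows = []
    for doc_number in range(n_documents):
        row = {"Document_number": doc_number}
        for column in topic_columns:
            # Sampling with replacement: the same topic may appear several times
            row[column] = rng.choice(selection_list)
        rows.append(row)
    return rows


assignments = make_assignments(50, seed=42)
print(assignments[0])
```

Because the topics are drawn with replacement, a document can contain the same topic more than once, as in the row `Formula1, Paris, Formula1, Formula1` above. Passing the list to `pd.DataFrame(assignments)` would give a dataframe with the same columns.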


## Step 4: Creating 50 random documents with the topics from the dataframe taken above:

In [10]:
#Creating 50 documents, each made of one randomly selected paragraph per topic listed in the dataframe
#A dictionary lookup replaces eval() for turning a topic name into its list of paragraphs
pages = {'Paris': Paris, 'Formula1': Formula1, 'Modi': Modi, 'Batman': Batman}
#Highest paragraph index that exists on every page (capped at 50 as before), to avoid an IndexError
_max_para = min(50, min(len(p) for p in pages.values()) - 1)

for i in range(0,50):
    _random_para = random.randint(1, _max_para)
    row = document_topics.iloc[i]
    globals()['document_%s' %i] = "".join(pages[row[col]][_random_para].text.strip() for col in ["Topic1", "Topic2", "Topic3", "Topic4"])
    
#Printing one of the documents:
document_15

'On 25 June 2015, Modi launched a programme intended to develop 100 smart cities.[198] The "Smart Cities" programme is expected to bring Information Technology companies an extra benefit of ₹20 billion (US$310\xa0million).[199] In June 2015, Modi launched the "Housing for All By 2022" project, which intends to eliminate slums in India by building about 20 million affordable homes for India\'s urban poor.[200]Various modern stories have portrayed the extravagant, playboy image of Bruce Wayne as a facade.[76] This is in contrast to the post-Crisis Superman, whose Clark Kent persona is the true identity, while the Superman persona is the facade.[77][78] In Batman Unmasked, a television documentary about the psychology of the character, behavioral scientist Benjamin Karney notes that Batman\'s personality is driven by Bruce Wayne\'s inherent humanity; that "Batman, for all its benefits and for all of the time Bruce Wayne devotes to it, is ultimately a tool for Bruce Wayne\'s efforts to mak
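
As a possible alternative to storing each document in its own global variable, the sketch below keeps the documents in a list and draws the paragraphs from a dictionary. It rests on several assumptions that differ from the cell above: `build_document` is a hypothetical helper, the paragraphs are plain placeholder strings rather than Beautiful Soup tags, a separate paragraph is drawn for each topic (the notebook reuses one random index for all four), and the parts are joined with a space.

```python
import random


def build_document(topics, pages, rng=random):
    """Concatenate one randomly chosen paragraph per topic."""
    parts = []
    for topic in topics:
        paragraphs = pages[topic]              # list of paragraph strings
        parts.append(rng.choice(paragraphs).strip())
    return " ".join(parts)


# Placeholder paragraphs standing in for the scraped Wikipedia text
names = ['Paris', 'Formula1', 'Modi', 'Batman']
pages = {name: ["%s paragraph %d." % (name, k) for k in range(3)] for name in names}

rng = random.Random(0)
assigned = [['Paris', 'Modi', 'Formula1', 'Formula1'],
            ['Paris', 'Batman', 'Modi', 'Modi']]
documents = [build_document(topics, pages, rng) for topics in assigned]
print(documents[1])
```

With this layout, `documents[15]` would play the role of `document_15`, and `rng.choice` can never pick an index that does not exist on the page.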

## Step 5: Using NLTK to clean the document.
The code below removes the stop words, strips punctuation, and lemmatizes the words in the document.

In [11]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

#This function takes a string as input and returns the cleaned text
def clean(text):
    #The parameter is named text so that it does not shadow the string module
    stop_free = " ".join([i for i in text.lower().split() if i not in stop])
    punc_free = "".join([ch for ch in stop_free if ch not in exclude])
    normalized = " ".join([lemma.lemmatize(word) for word in punc_free.split()])
    return normalized
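
One detail of `clean` is worth illustrating: stop words are filtered before punctuation is stripped, so a stop word with punctuation attached (for example `however,`) is not recognised and survives. The sketch below compares the two orderings on a made-up sentence. It is a simplified illustration, not the notebook's function: the stop word set is a tiny hypothetical stand-in for NLTK's English list, and lemmatization is left out.

```python
import string

# Tiny stand-in stop word list; the notebook uses stopwords.words('english')
stop = {"the", "a", "of", "is", "in", "and", "however"}
exclude = set(string.punctuation)


def strip_punctuation(text):
    return "".join(ch for ch in text if ch not in exclude)


def stop_then_punct(text):
    """Same order as clean(): filter stop words, then strip punctuation."""
    kept = " ".join(w for w in text.lower().split() if w not in stop)
    return strip_punctuation(kept).split()


def punct_then_stop(text):
    """Reversed order: strip punctuation first, then filter stop words."""
    return [w for w in strip_punctuation(text.lower()).split() if w not in stop]


sample = "However, the Batman is in Paris."
print(stop_then_punct(sample))  # ['however', 'batman', 'paris']
print(punct_then_stop(sample))  # ['batman', 'paris']
```

Deleting punctuation character by character also fuses hyphenated words, which is consistent with tokens such as `midengined` appearing in the topic output further down.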

## Step 6: Creating a function to compute topics from the passed documents:

In [12]:
#This function takes a raw document string, cleans it, and computes a topic model for it
def topicFromDoc(document):
    #The corpus is a single tokenized document; looping over clean(document) would add one identical copy per character
    document1_clean = [clean(document).split()]
    dictionary = Dictionary(document1_clean)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in document1_clean]
    Lda = gensim.models.ldamodel.LdaModel
    
    ldamodel = Lda(doc_term_matrix, num_topics=4, id2word = dictionary)
    one = ldamodel.show_topics(formatted = False)[1][1]
    two = [words + " / " + str(prob) for words, prob in one]
    three = pd.DataFrame(two)
    three = three.transpose()
    
    return three
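
The indexing in `show_topics(formatted = False)[1][1]` is easier to follow with the shape of the result written out. With `formatted = False`, gensim returns a list of `(topic_id, [(word, probability), ...])` pairs, so `[1][1]` selects the word list of the second topic only. The sketch below uses a hand-written stand-in for that result; the words and numbers are illustrative values borrowed from the output table in Step 7, and `label_topic` is a hypothetical helper.

```python
# Hand-written stand-in for ldamodel.show_topics(formatted=False):
# a list of (topic_id, [(word, probability), ...]) pairs
shown = [
    (0, [("car", 0.0288702), ("lotus", 0.024484)]),
    (1, [("wayne", 0.0606545), ("gdp", 0.0503419)]),
]


def label_topic(word_probs):
    """Format each (word, probability) pair as 'word / probability'."""
    return [word + " / " + str(prob) for word, prob in word_probs]


# What topicFromDoc keeps: only the word list of the topic at position 1
one = shown[1][1]
print(label_topic(one))  # ['wayne / 0.0606545', 'gdp / 0.0503419']

# Keeping every topic instead, keyed by topic id
all_labels = {topic_id: label_topic(words) for topic_id, words in shown}
print(sorted(all_labels))  # [0, 1]
```

Each row of the Step 7 table therefore describes one of the four topics fitted for that document, not all of them.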

## Step 7: Passing the documents created above to the function and collecting the top words with their scores (the loop below covers the first five of the 50 documents):

In [13]:
frames = []

for i in range(0,5):
    four = topicFromDoc(document = globals()['document_%s' %i])
    four["Document_number"] = i
    frames.append(four)

#DataFrame.append was removed in pandas 2.0; pd.concat builds the same table in one step
topics = pd.concat(frames)[["Document_number", 0,1,2,3,4,5,6,7,8,9]]
topics.head(5)

| | Document_number | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | car / 0.0288702 | lotus / 0.024484 | since / 0.0218957 | aluminiumsheet / 0.0194384 | sponsorship / 0.0171019 | introduced / 0.0159545 | modi / 0.015253 | introducing / 0.0149774 | proved / 0.0147191 | parisii / 0.0143017 |
| 0 | 1 | wayne / 0.0606545 | gdp / 0.0503419 | paris / 0.0401322 | region / 0.0238736 | bruce / 0.0229325 | billion / 0.0201301 | square / 0.0188133 | mile / 0.0135385 | europe / 0.013264 | child / 0.0128788 |
| 0 | 2 | story / 0.0340273 | created / 0.0286932 | byline / 0.0275628 | credit / 0.0267236 | batman / 0.0261904 | car / 0.0226654 | began / 0.0219451 | comic / 0.0206613 | name / 0.019978 | title / 0.015481 |
| 0 | 3 | car / 0.0776388 | lotus / 0.0362133 | 1968 / 0.0309487 | since / 0.0286867 | midengined / 0.0256993 | sponsorship / 0.022873 | proved / 0.022746 | livery / 0.0226987 | parisii / 0.0224952 | chassis / 0.0221069 |
| 0 | 4 | rosberg / 0.0263391 | one / 0.0258657 | season / 0.0241776 | character / 0.0221429 | series / 0.0208005 | point / 0.020767 | 5 / 0.0166203 | knight / 0.0162538 | dark / 0.0158972 | break / 0.0158486 |


## Step 8: Merging with original dataframe 

In [14]:
documents_with_topics = pd.merge(document_topics, topics, on = "Document_number")
documents_with_topics.head()

| | Document_number | Topic1 | Topic2 | Topic3 | Topic4 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Paris | Modi | Formula1 | Formula1 | car / 0.0288702 | lotus / 0.024484 | since / 0.0218957 | aluminiumsheet / 0.0194384 | sponsorship / 0.0171019 | introduced / 0.0159545 | modi / 0.015253 | introducing / 0.0149774 | proved / 0.0147191 | parisii / 0.0143017 |
| 1 | 1 | Paris | Batman | Modi | Modi | wayne / 0.0606545 | gdp / 0.0503419 | paris / 0.0401322 | region / 0.0238736 | bruce / 0.0229325 | billion / 0.0201301 | square / 0.0188133 | mile / 0.0135385 | europe / 0.013264 | child / 0.0128788 |
| 2 | 2 | Formula1 | Paris | Batman | Batman | story / 0.0340273 | created / 0.0286932 | byline / 0.0275628 | credit / 0.0267236 | batman / 0.0261904 | car / 0.0226654 | began / 0.0219451 | comic / 0.0206613 | name / 0.019978 | title / 0.015481 |
| 3 | 3 | Formula1 | Paris | Formula1 | Formula1 | car / 0.0776388 | lotus / 0.0362133 | 1968 / 0.0309487 | since / 0.0286867 | midengined / 0.0256993 | sponsorship / 0.022873 | proved / 0.022746 | livery / 0.0226987 | parisii / 0.0224952 | chassis / 0.0221069 |
| 4 | 4 | Batman | Formula1 | Batman | Formula1 | rosberg / 0.0263391 | one / 0.0258657 | season / 0.0241776 | character / 0.0221429 | series / 0.0208005 | point / 0.020767 | 5 / 0.0166203 | knight / 0.0162538 | dark / 0.0158972 | break / 0.0158486 |


## Step 9: For creating a visual representation of the words and the topics:

In [22]:
#Collecting document_0 ... document_49 into one list
document_set = [globals()['document_%s' %i] for i in range(0,50)]

tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
#en_stop = get_stop_words('en')

stop = set(stopwords.words('english'))

lemma = WordNetLemmatizer()

# list for tokenized documents in loop
texts = []

# loop through document list
for i in document_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    #stopped_tokens = [tok for tok in tokens if tok not in en_stop]
    stopped_tokens = [tok for tok in tokens if tok not in stop]
    
    # lemmatize tokens (the variable keeps its original name)
    stemmed_tokens = [lemma.lemmatize(tok) for tok in stopped_tokens]
    
    #remove single letters
    #stemmed_tokens = [w for w in stemmed_tokens if len(w) > 1]
    
    #remove numbers
    #stemmed_tokens = [w for w in stemmed_tokens if not w.isdigit()]
        
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
doc_term_matrix = [dictionary.doc2bow(text) for text in texts]


#%% for creating LDA model
Lda = gensim.models.ldamodel.LdaModel

ldamodel = Lda(doc_term_matrix, num_topics=4, id2word = dictionary,passes = 20, alpha = 'auto')
topic_with_score = ldamodel.show_topics()
ldatopics = [[word for word, prob in topic] for topicid, topic in ldamodel.show_topics(formatted=False)]
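
The dictionary and document-term matrix built above can be pictured with a small standard-library sketch. It only mimics the idea behind `corpora.Dictionary` and `doc2bow` (each document becomes a sorted list of `(term_id, count)` pairs) and is not gensim's implementation; in particular, gensim may assign different integer ids to the terms. The token lists and the names `token2id` and `to_bow` are made up for illustration.

```python
from collections import Counter

# Two tiny tokenized documents standing in for the `texts` list
texts = [
    ["car", "lotus", "car", "season"],
    ["batman", "wayne", "car"],
]

# term -> integer id, assigned here in order of first appearance
token2id = {}
for tokens in texts:
    for token in tokens:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return sorted (term_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


doc_term_matrix = [to_bow(tokens) for tokens in texts]
print(token2id)         # {'car': 0, 'lotus': 1, 'season': 2, 'batman': 3, 'wayne': 4}
print(doc_term_matrix)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

Word order is discarded in this representation; the LDA model only sees how often each term occurs in each document.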

In [23]:
#%%for visualizing the topics and words
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)

