# Webpage topics extraction utility
### Created by: <br> <br> Varad S. Tupe <br> [Email](mailto:varad.s.tupe@gmail.com)<br>[LinkedIn](http://www.linkedin.com/in/varadtupe) <br> [GitHub](http://github.com/varadtupe) <br> [About Me](http://about.me/varadtupe)

This utility extracts textual data from webpages and performs topic modelling on the extracted text.

### Requirements:
    1. Python 3.6
    2. nltk
    3. gensim
    4. pyLDAvis
    5. matplotlib
    6. urllib (part of the Python standard library)
    7. bs4 (Beautiful Soup)

### Utility Architecture
<img src="./flow.jpg">

### Classes:
#### Page_Loader
- This class handles all data extraction from a given URL.
- The `text_from_html` method returns the visible text for a given URL.
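
As a rough illustration of the "visible text" idea, here is a minimal sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. The names `HIDDEN_TAGS`, `VisibleTextParser` and `visible_text` are hypothetical and not part of this utility. Unlike `Page_Loader`, which checks only the direct parent of each text node, this sketch skips text nested anywhere inside a hidden tag, and it drops empty text nodes.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the reader.
# Void tags such as <meta> carry no text, so they are not listed here.
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments are routed to handle_comment, so they never arrive here.
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `visible_text` on a downloaded page body (decoded to `str`) would give text comparable to what `text_from_html` returns, although the two may differ on malformed HTML.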

#### Text_Analyzer
- This class handles the data pre-processing and creates the LDA model for topic extraction.
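
The pre-processing steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the utility's implementation: `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stop words, lemmatization is left out, and `bag_of_words` only mimics the kind of word counts that gensim's `doc2bow` produces (gensim uses integer word ids rather than the words themselves).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "for", "with"}


def preprocess(raw_text):
    """Strip non-letters, tokenize, lower-case, and drop stop words and short tokens."""
    letters_only = re.sub(r"[^a-zA-Z ]", "", raw_text)
    tokens = letters_only.split()
    return [
        token.lower()
        for token in tokens
        if token.lower() not in STOP_WORDS and len(token) > 2
    ]


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


tokens = preprocess("The Canon EOS-5D is a camera, and the camera is new!")
print(tokens)
print(bag_of_words(tokens).most_common(1))
```

Note that stripping non-letters before tokenizing glues the parts of hyphenated or alphanumeric names together (`EOS-5D` becomes `eosd`), which is also how `remove_spl_char` behaves.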

#### Topic_Modeller
- This class acts as the executor: it fetches pages with `Page_Loader` and passes the text to `Text_Analyzer`.
- It has two methods:
- `process_url` handles one URL at a time.
- `process_bulk_url` handles multiple URLs at once and builds a single topic model across all of them.
- Both methods make up to 3 attempts to fetch data from a URL.
- Both return the `Text_Analyzer` instance, whose `visualize_topic` method uses the `pyLDAvis` library to provide an interactive visualization of the extracted topics.
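
The retry behaviour can be sketched independently of any network code. In this sketch `fetch` is any callable that takes a URL and returns text, or raises on failure. The helper names `fetch_with_retries` and `fetch_all` are hypothetical and are not methods of this utility.

```python
def fetch_with_retries(fetch, url, num_retries=3):
    """Call fetch(url) up to num_retries times; raise if every attempt fails."""
    last_error = None
    for attempt in range(1, num_retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            print("Failed to fetch data on try #%d" % attempt)
    raise RuntimeError("Error while fetching data for url: " + url) from last_error


def fetch_all(fetch, urls, num_retries=3):
    """Fetch many URLs, retrying only the ones that failed in the previous round.

    Returns (texts, failed_urls).
    """
    texts, pending = [], list(urls)
    for _ in range(num_retries):
        still_failing = []
        for url in pending:
            try:
                texts.append(fetch(url))
            except Exception:
                still_failing.append(url)
        pending = still_failing
        if not pending:
            break
    return texts, pending
```

Passing the fetch function in as an argument keeps the retry logic testable without network access. In the utility itself, the equivalent of `fetch` is `Page_Loader.text_from_html`.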

#### Note: Data extraction from some URLs might fail because `urllib` cannot retrieve content that a webpage loads dynamically through AJAX calls.
Hence this utility can't extract data from sites like `amazon.com`. However, it can extract data from `ebay.com`.
To overcome this, we need browser-simulation utilities such as `selenium`, or spiders written with a custom configuration.


In [1]:
import re
import gensim
from gensim import corpora
from nltk.stem.porter import PorterStemmer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from gensim.models.ldamodel import LdaModel
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt


class Text_Analyzer():
    def __init__(self):
        self.reg_tokenizer = RegexpTokenizer(r'\w+')
        # use English stop words list
        self.en_stop = stopwords.words('english')
        # Porter stemmer, available through stemm_tokens (process_lda lemmatizes instead)
        self.p_stemmer = PorterStemmer()
        self.lemmetizer = nltk.WordNetLemmatizer()
        self.corpus = ""
        self.dictionary = ""

    def remove_spl_char(self, inp_text):
        return re.sub("[^a-zA-Z ]","", inp_text)
    
    def reg_tokenize(self, raw_input):
        return self.reg_tokenizer.tokenize(raw_input)

    def remove_stop_word(self, tokens):
        return [i.lower() for i in tokens if i.lower() not in self.en_stop and len(i) > 2]
    
    def stemm_tokens(self, tokens):
        return [self.p_stemmer.stem(i) for i in tokens]
    
    def lemmetize_token(self, tokens):
        return [self.lemmetizer.lemmatize(token) for token in tokens]
    
    # turn our tokenized documents into a id <-> term dictionary
    def create_dictionary(self, tokens):
        self.dictionary = corpora.Dictionary(tokens)

    def create_corpus(self, tokens, dictionary):
        self.corpus = [dictionary.doc2bow(text) for text in tokens]
    
    def create_LDA_model(self):
        self.ldamodel = LdaModel(
            self.corpus,
            num_topics=10,
            id2word = self.dictionary,
            passes=30
            )

    def print_topics(self):
        # Only the first topic returned by show_topics is printed,
        # as (word, probability) pairs.
        print('Top 10 key words from the webpage')
        for i in self.ldamodel.show_topics(num_words=10, formatted=False)[0][1]:
            print(*i)

    def visualize_topic(self):
        pyLDAvis.enable_notebook()
        return pyLDAvis.gensim.prepare(
            self.ldamodel,
            self.corpus,
            self.dictionary
        )
    
    def process_lda(self, inp_text):
        text = self.remove_spl_char(inp_text)
        tokens = self.reg_tokenize(text)
        tokens = self.remove_stop_word(tokens)
        tokens = self.lemmetize_token(tokens)
        self.create_dictionary([tokens])
        self.create_corpus([tokens], self.dictionary)
        self.create_LDA_model()
    
    def process_bulk_lda(self, text_list):
        token_list = []
        for page in text_list:
            text = self.remove_spl_char(page)
            tokens = self.reg_tokenize(text)
            tokens = self.remove_stop_word(tokens)
            tokens = self.lemmetize_token(tokens)
            token_list.append(tokens)
        
        self.create_dictionary(token_list)
        self.create_corpus(token_list, self.dictionary)
        self.create_LDA_model()


In [2]:
import urllib.request

from bs4 import BeautifulSoup
from bs4.element import Comment


class Page_Loader():
    '''
    This class extracts the visible text from a webpage
    '''
    def __init__(self):
        pass
    
    def tag_visible(self, element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(self, url):
        body = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(body, 'html.parser')
        # find_all(string=True) is the current spelling of findAll(text=True)
        texts = soup.find_all(string=True)
        visible_texts = filter(self.tag_visible, texts)  
        return u" ".join(t.strip() for t in visible_texts)


In [3]:
class Topic_Modeller():

    def __init__(self):
        self.pg_loader = Page_Loader()
        self.txt_analyzer = Text_Analyzer()

    def process_url(self, inp_url):
        num_retries = 3
        text_data = None
        for try_num in range(1, num_retries + 1):
            try:
                text_data = self.pg_loader.text_from_html(inp_url)
                print('Success')
                break
            except Exception:
                # Catch Exception rather than using a bare except, so that
                # KeyboardInterrupt and SystemExit still propagate.
                print('Failed to fetch data on try num #', try_num)
        if text_data is None:
            raise Exception('Error while fetching data for url: ' + inp_url)
        self.txt_analyzer.process_lda(text_data)
        self.txt_analyzer.print_topics()
        return self.txt_analyzer

    def process_bulk_url(self, inp_list):
        num_retries = 3
        text_data = []
        url_queue = inp_list.copy()
        
        for _ in range(num_retries):
            url_list = url_queue.copy()
            url_queue = []
            for url in url_list:
                try:
                    page_text = self.pg_loader.text_from_html(url)
                    text_data.append(page_text)
                except Exception:
                    url_queue.append(url)
            if len(url_queue) == 0:
                break


        self.txt_analyzer.process_bulk_lda(text_data)
        
        if len(url_queue) > 0:
            print('Unable to fetch data from following URLS:')
            for url in url_queue:
                print(url)
        return self.txt_analyzer


In [4]:
tm = Topic_Modeller()

#### Topic modelling on a DSLR camera listing on eBay

In [5]:
result = tm.process_url('https://www.ebay.com/itm/Canon-EOS-5D-Mark-IV-MK-4-DSLR-Camera-Body-Only/162197189088')
result.visualize_topic()

Success
Top 10 key words from the webpage
shipping 0.00161368
new 0.00161359
item 0.00161271
eos 0.00161267
camera 0.00161243
window 0.00161225
canon 0.00161222
open 0.00161216
seller 0.00161198
digital 0.00161177


#### Topic modelling from REI blog regarding how to introduce your indoorsy friend to the outdoors

In [6]:
result = tm.process_url('http://blog.rei.com/camp/how-to-introduce-your-indoorsy-friend-to-the-outdoors/')
result.visualize_topic()

Success
Top 10 key words from the webpage
friend 0.00306862
rei 0.00306852
camping 0.00306807
coop 0.00306804
like 0.00306803
hike 0.003068
take 0.00306797
keep 0.0030679
time 0.0030679
flat 0.00306787


#### Topic extraction from CNN's article about Edward Snowden

In [7]:
result = tm.process_url('http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/')
result.visualize_topic()

Success
Top 10 key words from the webpage
nsa 0.00291729
snowden 0.00291716
government 0.00291646
said 0.00291632
watched 0.00291619
video 0.0029161
replay 0.00291608
worked 0.00291602
obama 0.00291596
leak 0.00291594


#### Bulk topic modeling on multiple URLs with data extraction failure handling

In [8]:
url_list = [
    'https://www.nytimes.com/2018/08/18/us/politics/don-mcgahn-mueller-investigation.html?action=click&contentCollection=politics&region=rank&module=package&version=highlights&contentPlacement=1&pgtype=sectionfront',
    'https://www.quora.com/Is-it-possible-to-extract-topics-in-a-single-document',
    'http://uselsess.siteee',
    'https://www.cnet.com/reviews/google-home-review/',
    'https://www.amazon.com/gp/product/B009GQ034C?pf_rd_p=d1f45e03-8b73-4c9a-9beb-4819111bef9a&pf_rd_r=JY8Z5CD0VB6H3JT0212T'
]

In [9]:
result = tm.process_bulk_url(url_list)
result.visualize_topic()

Unable to fetch data from following URLS:
http://uselsess.siteee
https://www.amazon.com/gp/product/B009GQ034C?pf_rd_p=d1f45e03-8b73-4c9a-9beb-4819111bef9a&pf_rd_r=JY8Z5CD0VB6H3JT0212T
