# Topic Detection

Topic detection (or topic identification) is the process of discovering the topics present in a set of input documents. Each topic is represented by a group of words that characterize it and tend to occur together in the text.

In this part, we use the Latent Dirichlet Allocation (LDA) algorithm to identify topics in a set of input documents.

In [1]:
# import required libraries
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim import corpora, models
import nltk
import feedparser

In [2]:
class TopicDetection:
    
    """
    (1) - define a function getDocuments(), whose responsibility is to download a few documents using feedparser
        - create an empty list to save all RSS 'summary' text information
        - if 'summary' contains sensitive words (e.g. 'ex') we skip it
        - append summary text(s) to the new list for later use
    """
    def getDocuments(self):
        """
        Method to get a list of documents for topic detection.
        """
        url = 'https://sports.yahoo.com/mlb/rss.xml'
        # parse the url using feedparser library
        feed = feedparser.parse(url)
        self.documents = []
        
        i = 0
        # iterate over the top 5 documents 
        for entry in feed['entries'][:5]:
            i += 1
            text = entry['summary']
            # skip articles containing sensitive words
            # (note: this is a substring test, so 'ex' also matches words such as 'next')
            if 'ex' in text:
                continue
            self.documents.append(text)
            print("Document {} : {}".format(i, text))
        print('INFO: Fetching documents from {} completed'.format(url))
        
    """
    (2) - define a function to apply simple preprocessing 
        - apply lowercase, tokenization, remove english stopwords
        - append the newly generated list to a new list for later use
    """
    def cleanDocuments(self):
        """
        Method to apply preprocessing on input documents.
        """
        tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
        # define list of unique stopwords
        eng_stop_words = set(stopwords.words('english'))
        self.cleaned = []
        
        for doc in self.documents:
            # apply lowercase 
            lowercase_doc = doc.lower()
            # tokenize document
            words = tokenizer.tokenize(lowercase_doc)
            # filter out english stopwords
            non_stopped_words = [word for word in words if word not in eng_stop_words]
            # append the newly generated list to 'cleaned' list
            self.cleaned.append(non_stopped_words)
        print('INFO: Cleaning {} documents completed'.format(len(self.documents)))
        
    """
    (3) - define a function to create a Latent Dirichlet Allocation model to detect topics from text
        - create a dictionary of unique words from cleaned documents
        - create corpus as a bag of words for each cleaned document
        - create a model on the corpus with the number of topics defined as 2 and set the vocabulary
        size/mapping using id2word parameter
        - print 2 topics on the screen, each containing four words
    """
    def doLDA(self):
        """
        Method to detect topics in the documents using the Latent Dirichlet Allocation algorithm.
        """
        dictionary = corpora.Dictionary(self.cleaned)
        # create corpus using cleaned document list
        corpus = [dictionary.doc2bow(cleandoc) for cleandoc in self.cleaned]
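        # create Latent Dirichlet Allocation model using the newly generated corpus
        # (training is randomized, so topics can differ between runs unless random_state is set)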
        ldamodel = models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary)
        print(ldamodel.print_topics(num_topics=2, num_words=4))
        
    """
    (4) - define a function which does all the steps in order: getDocuments(), cleanDocuments(), doLDA()
    """
    def run(self):
        self.getDocuments()
        self.cleanDocuments()
        self.doLDA()
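
The dictionary and bag-of-words steps in `doLDA()` can be illustrated without gensim. The sketch below is a simplified stand-in, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helper names, the two token lists are made up, and the integer ids gensim assigns may differ. It only shows the shape of the data, where each document becomes a list of `(token_id, count)` pairs.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up token lists standing in for self.cleaned
docs = [['king', 'seattle', 'king'], ['seattle', 'final']]
vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]

print(vocab)   # {'king': 0, 'seattle': 1, 'final': 2}
print(corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the per-document counts are passed on to the model.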
        

In [3]:
# create topic object
topic = TopicDetection()

In [4]:
# get documents
topic.getDocuments()

Document 1 : After years of quiet work, the A's shortstop has launched himself on to MVP ballots and put his name alongside superstars like Mike Trout.
Document 2 : It was an emotional scene in Seattle for the man they call King.
Document 3 : King Félix held court for the final time in Seattle
Document 4 : Bruce Bochy has a chance to finish the San Francisco era of his managerial career with a winning record against the club's chief rival when the Giants host the Los Angeles Dodgers beginning Friday night in the final three games of their skipper's career.  The Dodgers (103-56) will be looking
Document 5 : When Christian Yelich fouled a ball off of his kneecap on Sept. 10, the Milwaukee Brewers' season looked doomed.  Staying in the wild-card race was supposed to be tough without the reigning NL MVP, but not only did they stay in it, their scorching September has the Brewers in the hunt for a second
INFO: Fetching documents from https://sports.yahoo.com/mlb/rss.xml completed
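
The filter in `getDocuments()` uses `'ex' in text`, which is a substring test. A possible alternative, sketched below, matches whole words only. The helper name `contains_word` and the sample sentence are assumptions made for illustration and are not part of the original notebook.

```python
import re

def contains_word(text, word):
    """Return True only if `word` occurs as a whole word in `text` (case-insensitive)."""
    pattern = r'\b{}\b'.format(re.escape(word))
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical sample sentence, not taken from the feed
sample = "The Dodgers play next in Texas."

print('ex' in sample)               # True: 'ex' is a substring of 'next' and 'Texas'
print(contains_word(sample, 'ex'))  # False: 'ex' never appears as a separate word
```

With the substring test, a harmless summary like the sample would be skipped; the whole-word test keeps it.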


In [7]:
# cleaning of documents
topic.cleanDocuments()

INFO: Cleaning 5 documents completed
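
To see what the cleaning step produces for a single document, here is a minimal standard-library sketch of the same three operations (lowercase, tokenize with `[a-zA-Z]+`, drop stopwords). `DEMO_STOP_WORDS` is a tiny hand-picked list assumed for illustration; NLTK's English stopword list is much longer, so the real output may differ.

```python
import re

# Tiny illustrative stopword list (an assumption, far shorter than NLTK's)
DEMO_STOP_WORDS = {'the', 'for', 'in', 'a', 'of', 'it', 'was', 'an', 'they'}

def clean(doc, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep runs of ASCII letters, then drop stopwords."""
    tokens = re.findall(r'[a-zA-Z]+', doc.lower())
    return [token for token in tokens if token not in stop_words]

# Document 3 from the feed output; \u00e9 is the accented 'e' in "Félix"
tokens = clean("King F\u00e9lix held court for the final time in Seattle")
print(tokens)
# ['king', 'f', 'lix', 'held', 'court', 'final', 'time', 'seattle']
```

Because the pattern only accepts ASCII letters, the accented name is split into the fragments `f` and `lix`. If names like this matter for the topics, a pattern that accepts accented letters may be worth considering.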


In [8]:
# detect topic using LDA (Latent Dirichlet Allocation)
topic.doLDA()

[(0, '0.023*"mvp" + 0.019*"brewers" + 0.018*"dodgers" + 0.018*"mike"'), (1, '0.022*"king" + 0.022*"seattle" + 0.022*"career" + 0.021*"final"')]
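
Each topic above is printed as a string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse that string, as in the sketch below. `parse_topic` is a hypothetical helper written for this example, and the input is copied from the output shown above.

```python
import re

def parse_topic(topic_str):
    """Turn '0.023*"mvp" + 0.019*"brewers"' into [('mvp', 0.023), ('brewers', 0.019)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# Topic strings copied from the printed result above
topics = [
    (0, '0.023*"mvp" + 0.019*"brewers" + 0.018*"dodgers" + 0.018*"mike"'),
    (1, '0.022*"king" + 0.022*"seattle" + 0.022*"career" + 0.021*"final"'),
]

parsed = {topic_id: parse_topic(topic_str) for topic_id, topic_str in topics}
for topic_id, terms in parsed.items():
    print(topic_id, terms)
```

The weights within each topic are nearly equal here, which suggests that five short summaries give the model little evidence to separate the two topics sharply.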
