#### Lesson Objectives:
* learn how to extract text from a webpage
* learn how to do simple preprocessing of the data
* learn how to extract topics from a collection of documents

In [1]:
# import commong modules
import numpy as np

Our goal is to find topics within the abstracts of the [PyData conference](https://pydata.org/seattle2017/schedule/). For that we will first need to do some web-scraping to extract the text for each abstract.

For that we will rely on the `urllib` and `BeautifulSoup` packages.

#### Extracting Talk Links from the Schedule Webpage.

In [6]:
import urllib3

In [7]:
from bs4 import BeautifulSoup

In [13]:
http = urllib3.PoolManager()
url = "https://pydata.org/seattle2017/schedule/"
webpage = http.request('GET', url)

schedule = BeautifulSoup(webpage.data,'html.parser')



In [14]:
webpage

<urllib3.response.HTTPResponse at 0x6fca320>

Now schedule is a Beautiful Soup object which can be mined for certain HTML components.

In [15]:
# find all links within the page
schedule.find_all('a',href=True)

[<a href="/"><img alt="PyData" data-rjs="2" src="https://pydata.org/seattle2017/static/images/logo.288981a8dfa8.png"/></a>,
 <a href="/seattle2017/">Home</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="/seattle2017/about/" id="ddtoggle_4">About <span class="caret"></span></a>,
 <a href="/seattle2017/about/code-of-conduct/">Code of Conduct</a>,
 <a href="/seattle2017/about/mission/">Conference Mission</a>,
 <a href="/seattle2017/venue/">Venue</a>,
 <a href="/seattle2017/sponsors/">Sponsors</a>,
 <a href="/seattle2017/cfp/">CFP</a>,
 <a href="https://pydata.org/seattle2017/schedule/">Schedule</a>,
 <a href="https://pydata.org/past-events.html" title="Past PyData Events">here</a>,
 <a href="/seattle2017/schedule/presentation/57/">Using CNTK's Python Interface for Deep Learning</a>,
 <a href="/seattle2017/schedule/presentation/104/">D\u2019oh! Unevenly spaced time series analysis of The Simpsons in Pandas</a>,
 <a href="/seattle2017/

We note that we are interested only the ones which contain the string 'schedule/presentation'.

In [16]:
# set the base url for the PyData website
base_url = "https://pydata.org"

In [17]:
# find all other urls which have "schedule/presentation" link in them
urls = [base_url+a['href'] for a in schedule.find_all('a', href=True)  if 'schedule/presentation' in a['href']]

In [18]:
urls

[u'https://pydata.org/seattle2017/schedule/presentation/57/',
 u'https://pydata.org/seattle2017/schedule/presentation/104/',
 u'https://pydata.org/seattle2017/schedule/presentation/109/',
 u'https://pydata.org/seattle2017/schedule/presentation/125/',
 u'https://pydata.org/seattle2017/schedule/presentation/105/',
 u'https://pydata.org/seattle2017/schedule/presentation/58/',
 u'https://pydata.org/seattle2017/schedule/presentation/103/',
 u'https://pydata.org/seattle2017/schedule/presentation/108/',
 u'https://pydata.org/seattle2017/schedule/presentation/107/',
 u'https://pydata.org/seattle2017/schedule/presentation/110/',
 u'https://pydata.org/seattle2017/schedule/presentation/114/',
 u'https://pydata.org/seattle2017/schedule/presentation/102/',
 u'https://pydata.org/seattle2017/schedule/presentation/138/',
 u'https://pydata.org/seattle2017/schedule/presentation/67/',
 u'https://pydata.org/seattle2017/schedule/presentation/69/',
 u'https://pydata.org/seattle2017/schedule/presentation/62/

#### Extracting the Abstract from each Talk Webpage.

Let's scrape each individual link for the abstract.

In [20]:
talk_webpage = http.request('GET', urls[0])
talk = BeautifulSoup(talk_webpage.data,'html.parser')



Find the part of the webpage which contains the Abstract:

In [21]:
abstract = talk.find("div", { "class" : "abstract" }).text

In [22]:
abstract

u'Topics to be covered include ...\n\nCognitive Toolkit (CNTK) installation\nWhat is "machine learning"? [gradient descent example]\nWhat is "learning representations"?\nWhy do Graphics Processing Units (GPUs) help?\nHow do we prevent overfitting?\nCNTK Packages and Modules\nDeep Learning Examples [including Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) examples]\n'

In [23]:
type(abstract)

unicode

#### Text Processing

One of the most popular natural language processing packages in Python is `nltk`.

In [24]:
import nltk

It requires some corpora to be loaded when used for first time (get the nltk corpora):

In [None]:
nltk.download()

We can now convert the text string into tokens and apply different preprocessing steps to it.

In [25]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [26]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [27]:
# convert string into tokens
tokens = tokenizer.tokenize(abstract)
tokens

[u'Topics',
 u'to',
 u'be',
 u'covered',
 u'include',
 u'Cognitive',
 u'Toolkit',
 u'CNTK',
 u'installation',
 u'What',
 u'is',
 u'machine',
 u'learning',
 u'gradient',
 u'descent',
 u'example',
 u'What',
 u'is',
 u'learning',
 u'representations',
 u'Why',
 u'do',
 u'Graphics',
 u'Processing',
 u'Units',
 u'GPUs',
 u'help',
 u'How',
 u'do',
 u'we',
 u'prevent',
 u'overfitting',
 u'CNTK',
 u'Packages',
 u'and',
 u'Modules',
 u'Deep',
 u'Learning',
 u'Examples',
 u'including',
 u'Convolutional',
 u'Neural',
 u'Network',
 u'CNN',
 u'and',
 u'Long',
 u'Short',
 u'Term',
 u'Memory',
 u'LSTM',
 u'examples']

In [28]:
# making words lower case
lower_tokens = [tok.lower() for tok in tokens]

In [29]:
# remove stop words
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nostop_tokens = [tok for tok in lower_tokens if tok not in stopwords]

In [30]:
# stemming
lancaster = nltk.LancasterStemmer()
stemmed_tokens = [lancaster.stem(t) for t in nostop_tokens]

In [31]:
stemmed_tokens

[u'top',
 u'cov',
 u'includ',
 u'cognit',
 u'toolkit',
 u'cntk',
 u'instal',
 u'machin',
 u'learn',
 u'grady',
 u'desc',
 u'exampl',
 u'learn',
 u'repres',
 u'graph',
 u'process',
 u'unit',
 u'gpu',
 u'help',
 u'prev',
 u'overfit',
 u'cntk',
 u'pack',
 u'mod',
 u'deep',
 u'learn',
 u'exampl',
 u'includ',
 u'convolv',
 u'neur',
 u'network',
 u'cnn',
 u'long',
 u'short',
 u'term',
 u'mem',
 u'lstm',
 u'exampl']

Create a preprocessing function with the steps above:

In [32]:
def abstractpreprocess(url):
    
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    
    talk_webpage = request.urlopen(url).read()
    talk = BeautifulSoup(talk_webpage,'html.parser')
    abstract = talk.find("div", { "class" : "abstract" }).text
    
    # tokenize
   
    # make lower case
    
    # remove stop words
    
    # stem
    

    
    
    

In [37]:
def abstract_preprocess(url):
    
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    
    talk_webpage = http.request('GET', url)
    talk = BeautifulSoup(talk_webpage.data,'html.parser')
    abstract = talk.find("div", { "class" : "abstract" }).text
    
    # tokenize
    tokens = tokenizer.tokenize(abstract)
    
    # make lower case
    tokens = [tok.lower() for tok in tokens]
    
    # stem
    lancaster = nltk.LancasterStemmer()
    tokens = [lancaster.stem(tok) for tok in tokens]
    
    # remove stop words
    stopwords = stopwords.words('english')
    tokens = [tok for tok in tokens if tok not in stopwords]
    
    return(tokens)
    

In [38]:
# preprocess the abstract in each url
abstracts = [abstract_preprocess(url) for url in urls]







In [39]:
len(abstracts)

65

#### Topic Modeling

One of the approaches to extract topics from a collection of documents is to build the [TF-IDF](http://brandonrose.org/clustering#Tf-idf-and-document-similarity) (Term Frequency-Inverse Document Frequency) matrix for the dataset.

To use in scikit.learn the tokens need to be directly converted to string:

In [40]:
# converting to string
final_abstracts = []
for abstract in abstracts:
    for word in abstract:
        n = abstract.index(word)
        if n == 0:
            string = abstract[n]
        else:
            string = string + " " + abstract[n]
    final_abstracts.append(string)

The matrix is obtained via the TfidfVectorizer.

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [42]:
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.95)

In [43]:
tfidf_matrix = tfidf_vectorizer.fit_transform(final_abstracts)

In [44]:
tfidf_matrix

<65x587 sparse matrix of type '<type 'numpy.float64'>'
	with 2782 stored elements in Compressed Sparse Row format>

We can decompose the tf-idf matrix into topics and weights by Nonnegative Matrix Factorization.

In [45]:
n_topics = 5  # This number matters a lot.

from sklearn.decomposition import NMF
model = NMF(init="nndsvd", n_components=n_topics, random_state=1)
W_matrix = model.fit_transform(tfidf_matrix)
H_matrix = model.components_

In [46]:
W_matrix.shape

(65L, 5L)

In [47]:
H_matrix.shape

(5L, 587L)

In [49]:
# Print topics and keywords
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
for topic_index in range( H_matrix.shape[0] ):
    top_indices = np.argsort( H_matrix[topic_index,:] )[::-1][0:10]  ##show top 10 words associated with each topic
    term_ranking = [tfidf_feature_names[i] for i in top_indices]
    print ("Topic %d: %s" % ( topic_index, ", ".join( term_ranking ) ))

Topic 0: dat, analys, diff, perform, inform, us, tim, distribut, sum, count
Topic 1: forecast, met, cov, preprocess, cost, choos, hardw, fash, context, runtim
Topic 2: model, wil, us, learn, python, ar, thi, build, hav, network
Topic 3: system, learn, novel, play, interact, design, machin, tool, comput, vis
Topic 4: headband, us, cheap, feedback, eeg, everyday, realtim, dev, mus, wav


How can we improve the topics?
* short documents -> small word overlap -> use synonyms

#### Tips for Large Datasets
* the preprocessing and word counting can be performed in parallel on each document
* use `dask.bag` package to parallelize it without loading all documents at the same time ([example](http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html#local-computation))
* [MLib](https://spark.apache.org/docs/1.1.0/mllib-feature-extraction.html) library has NLP functionality
* store tf-idf matrix as sparse