#### Lesson Objectives:
* learn how to extract text from a webpage
* learn how to do simple preprocessing of the data
* learn how to extract topics from a collection of documents

In [None]:
# import common modules
import numpy as np

Our goal is to find topics within the abstracts of the [PyData conference](https://pydata.org/seattle2017/schedule/). First we need to do some web scraping to extract the text of each abstract.

For that we will rely on the `urllib` and `BeautifulSoup` packages.

#### Extracting Talk Links from the Schedule Webpage.

In [None]:
from urllib import request

In [None]:
from bs4 import BeautifulSoup

In [None]:
webpage = request.urlopen("https://pydata.org/seattle2017/schedule/").read()

schedule = BeautifulSoup(webpage,'html.parser')

Now `schedule` is a Beautiful Soup object that can be searched for specific HTML elements.

In [None]:
# find all links within the page
schedule.find_all('a',href=True)

We are interested only in the links that contain the string 'schedule/presentation'.

In [None]:
# set the base url for the PyData website
base_url = "https://pydata.org"

In [None]:
# keep only the links that contain "schedule/presentation" and prepend the base url
urls = [base_url+a['href'] for a in schedule.find_all('a', href=True)  if 'schedule/presentation' in a['href']]

In [None]:
urls
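
String concatenation works as long as every `href` is a site-relative path. A slightly more defensive sketch uses `urllib.parse.urljoin`, which also handles links that are already absolute. The `example_hrefs` below are made up for illustration and are not taken from the actual schedule page.

```python
from urllib.parse import urljoin

example_base = "https://pydata.org"

# Hypothetical href values of the kind a schedule page might contain.
example_hrefs = [
    "/seattle2017/schedule/presentation/1/",
    "https://pydata.org/seattle2017/schedule/presentation/2/",
    "/seattle2017/about/",
]

# urljoin resolves relative paths against the base and leaves absolute urls unchanged.
example_urls = [
    urljoin(example_base, href)
    for href in example_hrefs
    if "schedule/presentation" in href
]
print(example_urls)
```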

#### Extracting the Abstract from each Talk Webpage.

Let's scrape each individual link for the abstract.

In [None]:
talk_webpage = request.urlopen(urls[0]).read()
talk = BeautifulSoup(talk_webpage,'html.parser')

Find the part of the webpage which contains the Abstract:

In [None]:
abstract = talk.find("div", { "class" : "abstract" }).text

In [None]:
abstract

In [None]:
type(abstract)
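
Not every page is guaranteed to contain an abstract block, and `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. A minimal sketch of a guarded version follows; the helper name `extract_abstract` and the two inline HTML snippets are hypothetical and only serve to make the example runnable without a network connection.

```python
from bs4 import BeautifulSoup

def extract_abstract(html):
    """Return the abstract text, or an empty string if the page has no abstract div."""
    page = BeautifulSoup(html, "html.parser")
    node = page.find("div", {"class": "abstract"})
    # find() returns None when no matching element exists
    return node.get_text(strip=True) if node is not None else ""

# Made-up pages: one with an abstract, one without.
with_abstract = '<html><body><div class="abstract"> Topic models for talks. </div></body></html>'
without_abstract = "<html><body><p>Coffee break</p></body></html>"

print(repr(extract_abstract(with_abstract)))
print(repr(extract_abstract(without_abstract)))
```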

#### Text Processing

One of the most popular natural language processing packages in Python is `nltk`.

In [None]:
import nltk

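It requires some corpora to be downloaded the first time it is used. Only the stop word list is needed below:

In [None]:
# download only the stop word list; nltk.download() with no arguments opens an interactive downloader
nltk.download('stopwords')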

We can now convert the text string into tokens and apply different preprocessing steps to it.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [None]:
# convert string into tokens
tokens = tokenizer.tokenize(abstract)
tokens
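
To see what the pattern `\w+` keeps and drops, here is a small sketch using only the standard library. For this particular pattern, `re.findall` should give the same result as the `RegexpTokenizer` above (an assumption based on how the tokenizer matches patterns, not something stated in this lesson). The sample sentence is made up.

```python
import re

sample = "Topic-modeling isn't hard: tf-idf + NMF!"

# \w+ matches runs of letters, digits and underscores;
# punctuation is dropped and acts as a separator.
example_tokens = re.findall(r"\w+", sample)
print(example_tokens)
# ['Topic', 'modeling', 'isn', 't', 'hard', 'tf', 'idf', 'NMF']
```

Note that hyphenated words and contractions are split into pieces, which is usually acceptable for topic modeling but worth keeping in mind.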

In [None]:
# making words lower case
lower_tokens = [tok.lower() for tok in tokens]

In [None]:
# remove stop words
from nltk.corpus import stopwords
# use a separate name so the imported stopwords module is not overwritten when the cell is rerun
stop_words = stopwords.words('english')
nostop_tokens = [tok for tok in lower_tokens if tok not in stop_words]

In [None]:
# stemming
lancaster = nltk.LancasterStemmer()
stemmed_tokens = [lancaster.stem(t) for t in nostop_tokens]

In [None]:
stemmed_tokens

Create a preprocessing function with the steps above:

In [None]:
def abstractpreprocess(url):
    
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    
    talk_webpage = request.urlopen(url).read()
    talk = BeautifulSoup(talk_webpage,'html.parser')
    abstract = talk.find("div", { "class" : "abstract" }).text
    
    # tokenize
   
    # make lower case
    
    # remove stop words
    
    # stem
    

    
    
    

In [None]:
def abstract_preprocess(url):
    
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    
    talk_webpage = request.urlopen(url).read()
    talk = BeautifulSoup(talk_webpage,'html.parser')
    abstract = talk.find("div", { "class" : "abstract" }).text
    
    # tokenize
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(abstract)
    
    # make lower case
    tokens = [tok.lower() for tok in tokens]
    
    # remove stop words (before stemming, since stemmed words may no longer match the list)
    stop_words = set(stopwords.words('english'))
    tokens = [tok for tok in tokens if tok not in stop_words]
    
    # stem
    lancaster = nltk.LancasterStemmer()
    tokens = [lancaster.stem(tok) for tok in tokens]
    
    return tokens
    

In [None]:
# preprocess the abstract in each url
abstracts = [abstract_preprocess(url) for url in urls]

In [None]:
len(abstracts)

#### Topic Modeling

One of the approaches to extract topics from a collection of documents is to build the [TF-IDF](http://brandonrose.org/clustering#Tf-idf-and-document-similarity) (Term Frequency-Inverse Document Frequency) matrix for the dataset.
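
To make the weighting concrete, here is a minimal pure-Python sketch of the computation on three made-up token lists. It assumes the smoothed formula that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the row normalization that `TfidfVectorizer` applies afterwards.

```python
import math
from collections import Counter

# Three tiny made-up "documents", already tokenized.
toy_docs = [
    ["topic", "model", "text"],
    ["text", "scrape", "web"],
    ["topic", "text", "topic"],
]
n_docs = len(toy_docs)

# document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf(doc):
    """Term count times smoothed inverse document frequency (no normalization)."""
    counts = Counter(doc)
    return {
        term: counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in counts
    }

toy_weights = [tfidf(doc) for doc in toy_docs]
for weights in toy_weights:
    print({term: round(w, 3) for term, w in weights.items()})
```

A term such as "text" that occurs in every document gets the minimum idf of 1, while a term that occurs in a single document, such as "scrape", is weighted more heavily.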

scikit-learn expects each document as a single string, so the token lists need to be joined back together:

In [None]:
# join the tokens of each abstract into one space-separated string
final_abstracts = [" ".join(abstract) for abstract in abstracts]

The matrix is obtained via the TfidfVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.95)

In [None]:
tfidf_matrix = tfidf_vectorizer.fit_transform(final_abstracts)

In [None]:
tfidf_matrix

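We can decompose the tf-idf matrix with Nonnegative Matrix Factorization (NMF) into two nonnegative matrices: `W_matrix` holds the weight of each topic in each document, and `H_matrix` holds the weight of each term in each topic.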

In [None]:
n_topics = 5

from sklearn.decomposition import NMF
model = NMF(init="nndsvd", n_components=n_topics, random_state=1)
W_matrix = model.fit_transform(tfidf_matrix)
H_matrix = model.components_
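
The shapes of the two factors are easiest to see on a toy example. The sketch below runs the same `NMF` call on a small made-up matrix with 6 "documents" and 4 "terms"; the data and the variable names (`toy_X`, `toy_W`, `toy_H`) are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up nonnegative document-term matrix: 6 documents, 4 terms.
# Documents 0-2 only use terms 0-1, documents 3-5 only use terms 2-3.
toy_X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

toy_model = NMF(init="nndsvd", n_components=2, random_state=1, max_iter=500)
toy_W = toy_model.fit_transform(toy_X)  # documents x topics
toy_H = toy_model.components_           # topics x terms

print(toy_W.shape, toy_H.shape)
# the product of the factors approximates the original matrix
print(np.round(toy_W @ toy_H, 1))
```

Each row of `toy_W` says how strongly a document expresses each topic, and each row of `toy_H` says which terms define a topic.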

In [None]:
# print the top 10 keywords for each topic
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
for topic_index in range(H_matrix.shape[0]):
    top_indices = np.argsort(H_matrix[topic_index, :])[::-1][:10]
    term_ranking = [tfidf_feature_names[i] for i in top_indices]
    print("Topic %d: %s" % (topic_index, ", ".join(term_ranking)))

How can we improve the topics?
* the documents are short, so they share few words; mapping synonyms to a common term would increase the overlap

#### Tips for Large Datasets
* the preprocessing and word counting can be performed in parallel on each document
* use the `dask.bag` module to parallelize it without loading all documents into memory at the same time ([example](http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html#local-computation))
* Spark's [MLlib](https://spark.apache.org/docs/1.1.0/mllib-feature-extraction.html) library provides text feature extraction such as TF-IDF
* store tf-idf matrix as sparse
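
The first tip can be sketched with the standard library alone. This is not the `dask.bag` approach linked above but a simplified stand-in: each document is counted independently and the partial counts are merged at the end. Threads are used only to keep the sketch portable; for CPU-bound preprocessing a process pool or `dask.bag` would be the more natural choice. The documents and the `count_words` helper are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """Lower-case one document, split it on whitespace and count its words."""
    return Counter(text.lower().split())

toy_documents = [
    "Topic models find topics",
    "Sparse matrices save memory",
    "Topics need many documents",
]

# each document is processed independently of the others
with ThreadPoolExecutor(max_workers=2) as pool:
    per_document = list(pool.map(count_words, toy_documents))

# merge the per-document counts into corpus-wide counts
total_counts = sum(per_document, Counter())
print(total_counts.most_common(3))
```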