# Fetching and parsing in Python

---
**Author**: Marko Bajec

**Last update**: 20.2.2019

**Description**: in this notebook I describe a selection of libraries that might come in handy for **fetching** and **parsing** pages as well as for **text analysis**, including **natural language processing**. Many of these libraries are required in other notebooks that I share for this course. The libraries discussed are:
* <code>urllib</code>: [urllib](https://docs.python.org/3/library/urllib.html#module-urllib) is a package of modules for working with URLs,
* <code>beautifulsoup4</code>: [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a library for pulling data out of HTML and XML files. Works with DOM,
* <code>nltk</code>: [nltk](https://www.nltk.org) is a natural language processing toolkit,
* <code>polyglot</code>: [polyglot](https://polyglot.readthedocs.io/en/latest/) is another library for natural language processing, which includes support for the Slovenian language,
* <code>re</code>: [re](https://docs.python.org/3/library/re.html) is a library for working with regular expressions.
---

Let's import all modules at once.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import polyglot
import re
import string
import urllib
import urllib.request
from bs4 import BeautifulSoup

## 1. Fetching pages
### Simple fetching
We will first show how <code>urllib</code> can be used for fetching pages from the Web. 

In [None]:
f = urllib.request.urlopen('http://www.times.si')
print(f.read(1000).decode('utf-8'))

In [None]:
f = urllib.request.urlopen('http://www.delo.si')
print(f.read(5000).decode('utf-8'))

As we can see, simple fetching does not suffice for pulling out all the data when a page is dynamic. To render a typical page, a browser usually issues several additional HTTP requests to download all the required files (style sheets, scripts, images). Moreover, modern web pages are dynamic, i.e. they run scripts in the browser that fetch data from the server and render it on the page only after the initial HTML has been loaded.
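
To illustrate the additional requests mentioned above, here is a minimal sketch that lists the external resources referenced by a page, using only the standard library. The `ResourceCollector` class and the sample HTML string are hypothetical and made up for illustration; a real browser fetches more resource types than the three handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical helper: collects the URLs of resources that a browser would
# fetch in addition to the HTML document itself.
class ResourceCollector(HTMLParser):
    # tag name -> attribute that holds the resource URL
    RESOURCE_ATTRS = {"script": "src", "img": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # resolve relative URLs against the page address
                self.resources.append(urljoin(self.base_url, value))

# Made-up sample page, not fetched from the Web
sample_html = """
<html><head>
<link rel="stylesheet" href="/css/main.css">
<script src="js/app.js"></script>
</head><body>
<img src="http://cdn.example.com/logo.png">
<div id="news"></div>
</body></html>
"""

collector = ResourceCollector("http://www.example.com/")
collector.feed(sample_html)
for resource in collector.resources:
    print(resource)
```

The empty `<div id="news">` stands for content that a script would fill in later; a plain `urlopen` call never sees that content.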

### Headless browsers
To get a page that corresponds to the one we see in a browser when following a certain URL, one could use a **headless browser**. In short, headless browsers are web browsers without a graphical user interface and are usually controlled programmatically or via a command-line interface. They are used in several contexts but most notably for **automatic usability testing** and **web scraping**. In web extraction we need a DOM representation of the exact content as it is rendered in a full browser.

To run a headless browser in Python, one option is to use **Headless Chrome** together with **Selenium**. See [here](https://duo.com/decipher/driving-headless-chrome-with-python) for instructions on how to install and use Headless Chrome with Python.
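
As a rough sketch of the Headless Chrome and Selenium option, the code below loads a page and returns the HTML serialized from the rendered DOM. It assumes that the `selenium` package and a matching Chrome and chromedriver are installed; the function names `headless_chrome_arguments` and `fetch_rendered_html` and the chosen flags are my own illustration, not part of the linked instructions.

```python
# Minimal sketch: fetch the rendered DOM with Headless Chrome via Selenium.
# Assumption: selenium, Chrome and a matching chromedriver are installed.

def headless_chrome_arguments():
    # command-line flags passed to Chrome; "--headless" disables the GUI
    return ["--headless", "--disable-gpu", "--window-size=1280,1024"]

def fetch_rendered_html(url):
    # imported lazily so that the helper above can be used without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source   # HTML serialized from the live DOM
    finally:
        driver.quit()               # always close the browser process

# Example call (requires network access and Chrome):
# html = fetch_rendered_html("http://www.delo.si")
# print(html[:1000])
```

Pages that load data some time after the initial page load may need an explicit wait before `page_source` is read.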

### Extracting text
This simple code shows how to extract the text from a web page. It uses the <code>beautifulsoup</code> library, which parses the <code>HTTP response payload</code> and creates a DOM-like tree.

In [None]:
# This is an example of code which extracts all text from a web page 
url_address = "http://www.times.si"

with urllib.request.urlopen(url_address) as url:
    html = url.read()
    
# parse with Python's built-in parser; naming it explicitly avoids a bs4 warning
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

## 2. Processing text with NLTK
NLTK stands for **Natural Language Toolkit**. It is a Python library that supports numerous tasks that are common in NLP, such as *tokenization*, *lemmatization*, *stemming*, *POS tagging*, *semantic reasoning*, *parsing*, etc. It provides interfaces to many corpora and lexical resources such as *WordNet*.

Below are a few examples of how to use NLTK. For complete documentation see the [NLTK webpage](https://www.nltk.org).

### 2.1. Tokenization
Suggested reading: [The art of tokenization](https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en)

In [None]:
# simple tokenization
# alternative example sentence with more contractions (uncomment to try it instead):
# sentence = "You can't say you didn't know this was wrong! And yet you did it. I want you to read these books."
sentence = "This book was written in 1998 by Dan Taylor. It is about information extraction from web sources."
sentence += " If you haven't read it yet then do so ASAP."
sentence = sentence.lower()
tokens = nltk.word_tokenize(sentence)
print(tokens)

### 2.2. Contractions

In [None]:
# handling contractions, e.g. don't --> do not

# example of contractions dictionary - incomplete!
contractions_dict = { 
"can't": "cannot",
"didn't": "did not",
"don't": "do not",
"isn't": "is not",
"haven't": "have not"
}
contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        # look up the matched contraction and return its expanded form
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

tokens = nltk.word_tokenize(expand_contractions(sentence))

print(tokens)
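
The cell above works because the sentence was lowercased beforehand. As a hedged variation, the sketch below also handles capitalized contractions and uses word boundaries so that the pattern cannot match inside a longer word. The names `contractions`, `pattern` and `expand` are hypothetical, and the dictionary is just as incomplete as the one above.

```python
import re

# Same (incomplete) dictionary idea as above, keys in lowercase
contractions = {
    "can't": "cannot",
    "didn't": "did not",
    "don't": "do not",
    "isn't": "is not",
    "haven't": "have not",
}

# \b keeps the pattern from matching inside longer words;
# re.escape protects keys that might contain regex metacharacters
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
    flags=re.IGNORECASE,
)

def expand(text):
    def replace(match):
        expanded = contractions[match.group(0).lower()]
        # keep an initial capital letter, e.g. "Don't" -> "Do not"
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return pattern.sub(replace, text)

print(expand("Don't say you didn't know. It isn't hard."))
```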

### 2.3. Stopwords

In [None]:
# print stopwords for English
print(stopwords.words('english'))

In [None]:
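# remove stopwords (build the set once instead of reloading the list for every token)
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word not in english_stopwords]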

print(filtered_words)

### 2.4. Punctuation

In [None]:
nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
words_no_punct = [w for w in filtered_words if nonPunct.match(w)]

print(words_no_punct)
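
An alternative to the regular expression is to test tokens against `string.punctuation` from the standard library. The sketch below is only an illustration: `sample_tokens` is a hand-written list standing in for tokenizer output, and `is_punctuation` is a hypothetical helper name.

```python
import string

# Hand-written token list standing in for the output of a tokenizer
sample_tokens = ["book", "written", "1998", "dan", "taylor", ".",
                 "asap", "...", "n't", "``"]

punctuation_chars = set(string.punctuation)

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark
    return all(char in punctuation_chars for char in token)

kept = [token for token in sample_tokens if not is_punctuation(token)]
print(kept)
```

Unlike the pattern above, this variant keeps tokens that consist only of non-ASCII letters, which matters for languages other than English. It does not remove non-ASCII punctuation such as typographic quotes.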

### 2.5. Using explicit RE pattern in tokenizer

In [None]:
# This example shows how to use RegexpTokenizer.
# The function preprocess does it all: it lowercases the sentence, creates tokens,
# removes stopwords and punctuation, and then joins the tokens back into a sentence.

def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    english_stopwords = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in english_stopwords]
    return " ".join(filtered_words)

sentence = "You can\'t say you didn\'t know! I told you several times and you know that."
print(preprocess(sentence))

### 2.6. POS tagging
Part-of-speech tagging (POS tagging) is a grammatical analysis that marks each word in a text (corpus) as corresponding to a particular part of speech, e.g. <code>noun</code>, <code>verb</code>, <code>adjective</code>, <code>adverb</code>, etc.

For details on *Categorizing and Tagging Words* check [here](http://www.nltk.org/book/ch05.html).

In [None]:
wordsPOS = nltk.pos_tag(words_no_punct)
# This dictionary maps each word to the first letter of its tag; the lemmatizer in 2.8 uses it
wordsPOSset = {}
for word, tag in wordsPOS:
    wordsPOSset[word] = tag[0]
    print(word + ":" + tag)
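
One thing to keep in mind with a word-to-tag dictionary is that a word can occur several times with different tags. The sketch below shows this with hand-written `(word, tag)` pairs in the format returned by `nltk.pos_tag`; the pairs are illustrative Penn Treebank tags that I made up, not real tagger output.

```python
# Hand-written (word, tag) pairs in the format returned by nltk.pos_tag;
# illustrative only, not produced by a tagger
tagged = [("book", "NN"), ("flight", "NN"), ("read", "VBP"), ("book", "VB")]

# word -> first letter of tag: a later occurrence overwrites an earlier one
by_word = {word: tag[0] for word, tag in tagged}
print(by_word["book"])

# keeping the pairs by position preserves both readings of "book"
coarse = [(word, tag[0]) for word, tag in tagged]
print(coarse)
```

For short examples the dictionary is convenient; for longer texts, iterating over the tagged pairs directly avoids losing the per-occurrence tag.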

### 2.7. Stemming

In [None]:
# this is a simple stemming
ps = PorterStemmer()
 
for word in words_no_punct:
    print(word + ":" + ps.stem(word))

### 2.8. Lemmatization

In [None]:
# This example shows how to use the NLTK lemmatizer. Remember that you have to tell the lemmatizer
# the part of speech of the word you would like to lemmatize. By default it expects a noun,
# but the word could also be a verb, adverb, adjective and so on.

# The POS tagger was trained on Treebank corpora. This function maps Treebank tags
# to the WordNet tags expected by the lemmatizer.
from nltk.corpus import wordnet
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # if not J, V, N or R then fall back to the default, i.e. noun
        return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word in words_no_punct:
    print(word + ":" + lemmatizer.lemmatize(word, pos=get_wordnet_pos(wordsPOSset[word])))

---
## 3. Working with Polyglot
Polyglot is a library for natural language processing with support for **many languages** including **Slovene**. The following features are available:
* Tokenization (165 Languages)
* Language detection (196 Languages)
* Named Entity Recognition (40 Languages)
* Part of Speech Tagging (16 Languages)
* Sentiment Analysis (136 Languages)
* Word Embeddings (137 Languages)
* Morphological analysis (135 Languages)
* Transliteration (69 Languages)

### 3.1. Installation
Installation of the <code>polyglot</code> library is a bit more complicated. To save you some time, I provide instructions that worked for me on <code>macOS Mojave ver. 10.14.3</code> and <code>Python 3.7</code>.

    # install icu4c
    $ brew install icu4c
    # check icu4c version … in my case 63.1
    $ ls /usr/local/Cellar/icu4c/
    # link icu-config to standard path
    $ ln -s /usr/local/Cellar/icu4c/63.1/bin/icu-config /usr/local/bin/icu-config
    # install pyicu and other dependencies for polyglot
    $ sudo pip3 install pyicu
    $ pip3 install pycld2
    $ pip3 install morfessor
    $ pip3 install polyglot

Below is an example of how to use Polyglot for *language detection*. For more details check the [Polyglot webpage](https://pypi.org/project/polyglot/).

### 3.2. Language detection

In [None]:
import polyglot
from polyglot.text import Text, Word

text1 = Text("Danes je lep sončen dan.")
text2 = Text("Heute ist ein schöner sonniger Tag.")
text3 = Text("Today is a beautiful sunny day.")
print("Language Detected: Code={}, Name={}\n".format(text1.language.code, text1.language.name))
print("Language Detected: Code={}, Name={}\n".format(text2.language.code, text2.language.name))
print("Language Detected: Code={}, Name={}\n".format(text3.language.code, text3.language.name))