# Discussion 7

## Web Scraping

Basic web scraping workflow:

1. Download pages (use a GET request) -- [`requests`][] and [`requests_cache`][]
2. Parse pages to extract text -- [`lxml.html`][] or [`bs4`][]
3. Clean up extracted text -- string methods, [`re`][], or [`pandas`][]
4. Store cleaned results -- [`pandas`][], [`sqlite3`][], [`pymongo`][], or ...
5. Analyze results -- [`pandas`][], and ...

Other than the packages involved, this workflow is the same regardless of the language you're using.
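Steps 3 and 4 can be sketched with only the standard library. This is a minimal illustration under stated assumptions, not a full pipeline: the `raw_paragraphs` list stands in for text already extracted in step 2, and the `clean` helper and the `paragraphs` table are names made up for this example.

```python
import re
import sqlite3


def clean(text):
    """Step 3: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical output of step 2: text pulled out of parsed pages.
raw_paragraphs = ["  This is a paragraph!\n", "This is\tanother   paragraph! "]
cleaned = [clean(p) for p in raw_paragraphs]

# Step 4: store the cleaned text. ":memory:" keeps the database in RAM;
# pass a filename instead to persist it to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO paragraphs (body) VALUES (?)",
                 [(p,) for p in cleaned])
conn.commit()

# Read the rows back to confirm what was stored.
rows = conn.execute("SELECT id, body FROM paragraphs ORDER BY id").fetchall()
print(rows)
```

The `?` placeholders let `sqlite3` handle quoting, which matters for scraped text that may contain quotes or other special characters.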

[`requests`]: http://docs.python-requests.org/en/master/
[`requests_cache`]: https://requests-cache.readthedocs.io/en/latest/
[`lxml.html`]: http://lxml.de/lxmlhtml.html
[`bs4`]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
[`re`]: https://docs.python.org/2/library/re.html
[`pandas`]: http://pandas.pydata.org/pandas-docs/stable/
[`sqlite3`]: https://docs.python.org/2/library/sqlite3.html
[`pymongo`]: http://api.mongodb.com/python/current/index.html

### Step 2: Parsing HTML

You can use `lxml.html` or `bs4` to parse HTML. Choose one for the entire scrape, since the two packages are not compatible with each other.

An advantage of `lxml.html` is that it supports both XPath and CSS selectors. It is also generally faster than `bs4` (Beautiful Soup).

In [27]:
doc = u"""
<html>
<head>
<title>This is the Title!</title>
</head>

<body>
<p>This is a paragraph!</p>
<p>This is another paragraph!</p>
<span>This is a span. ❤️  </span>
</body>
</html>
"""

import lxml.html as lx

html = lx.fromstring(doc)

In [18]:
# Extract a tag with XPath
print(html.xpath("/html/body/p")[0].text_content())

# In XPath, "//" means "at this level or anywhere below", so "//p" finds every <p> however deeply it is nested
print(html.xpath("//p")[1].text_content())

# Interactive XPath practice: http://www.topswagcode.com/xpath/

This is a paragraph!
This is another paragraph!


In [19]:
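# Extract a tag with CSS Selectors
print(html.cssselect("html > body > p")[0].text_content())

# In CSS, ">" means "directly below", while a space between selectors means "anywhere below".
# A bare "p" matches every <p> in the document.
print(html.cssselect("p")[1].text_content())

# Interactive CSS selector practice: http://flukeout.github.io/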

This is a paragraph!
This is another paragraph!


### Step 3: Cleaning Up

By this point, there shouldn't be any HTML tags left in the text you've extracted.

In [26]:
p_list = [x.text_content().strip() for x in html.cssselect("p")]
"\n".join(p_list)

'This is a paragraph!\nThis is another paragraph!'

#### Unicode

Since computers only understand numbers, text is encoded by assigning a number to each symbol. There are many different text encodings, and until the early 1990s there was no global standard. The United States used the [ASCII encoding](https://en.wikipedia.org/wiki/ASCII), which covers only unaccented English letters, digits, and basic punctuation.

[Unicode](http://unicode.org/) is a global standard for text encoding. Unicode [includes symbols](http://unicode.org/charts/) for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).

In Python 2, Unicode strings are prefixed with a `u` before the first quote. For example, `u"This is Unicode"` is a Unicode string. All of the built-in string methods will work on Unicode strings.

In [37]:
html.cssselect("span")[0].text_content()

u'This is a span. \u2764\ufe0f  '

When you display a Unicode string without printing, special Unicode characters will be represented by `\uXXXX` where `XXXX` is the number assigned to the character in [base 16](https://en.wikipedia.org/wiki/Hexadecimal). Base 16 numbers are sometimes prefixed with `0x` to distinguish them from decimal numbers.

For example, the heart character ❤ is assigned the number `10084`, which is `0x2764`. You can write it in a Unicode string as `\u2764`. The red heart emoji ❤️ consists of the heart character `\u2764` followed by an invisible "emoji presentation" character `\ufe0f` (Variation Selector-16), which asks for the colorful emoji rendering instead of the plain text glyph. See [Wikipedia's emoji article](https://en.wikipedia.org/wiki/Emoji#Emoji_versus_text_presentation) for details.
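The numbers and official names of these characters can be checked with the standard library. The sketch below uses `ord`, `hex`, and `unicodedata.name`. The variable names are only illustrative, and the `__future__` import is there so the same code runs under Python 2.7 and Python 3.

```python
from __future__ import print_function

import unicodedata

# The two-character sequence that ends the <span> text above.
heart = u"\u2764\ufe0f"

for ch in heart:
    # ord() gives the assigned number; hex() shows it in base 16.
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))

code_points = [ord(ch) for ch in heart]
names = [unicodedata.name(ch) for ch in heart]

# Prints:
# 10084 0x2764 HEAVY BLACK HEART
# 65039 0xfe0f VARIATION SELECTOR-16
```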

In [35]:
html.cssselect("span")[0].text_content().strip()

u'This is a span. \u2764\ufe0f'

Many Unicode characters take more than 1 byte to store (for example, ❤ takes 3 bytes in the common UTF-8 encoding), which can cause problems for regular expression engines that match byte by byte.
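To see the difference between characters and bytes, compare the length of a Unicode string with the length of its encoded form. This is a small sketch that assumes UTF-8 as the encoding; other encodings give different byte counts.

```python
from __future__ import print_function

heart = u"\u2764"
letter = u"a"

# len() on a Unicode string counts characters...
print(len(heart), len(letter))

# ...while len() on the encoded form counts bytes.
heart_bytes = heart.encode("utf-8")
letter_bytes = letter.encode("utf-8")
print(len(heart_bytes), len(letter_bytes))

# Prints:
# 1 1
# 3 1
```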

Python 2's `re` module works with Unicode strings. The module [has a flag](https://docs.python.org/2/library/re.html#re.UNICODE) `re.UNICODE` that makes character classes such as `\w`, `\d`, `\s`, and `\b` follow the Unicode character database instead of matching only ASCII characters.
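As a quick illustration, the sketch below matches words that contain accented letters. The sample text is made up for the example. In Python 3, `str` patterns are Unicode-aware by default, so the flag is redundant there but harmless.

```python
from __future__ import print_function

import re

text = u"na\u00efve caf\u00e9"  # "naive cafe" with accented letters

# With re.UNICODE, \w matches accented letters too, so each word
# comes back whole instead of being split at the accents.
words = re.findall(r"\w+", text, re.UNICODE)
print(len(words))

# Prints:
# 2
```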

## Natural Language Processing

Basic NLP workflow:

1. __Tokenize__ -- split text into words
2. __Denoise__ (optional) -- remove stop words, convert words to lemmas, correct spelling, ...
3. __Vectorize__ -- compute term frequencies, tf-idfs, or some other statistic
4. __Analyze__
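The first three steps can be sketched with the standard library alone. This is a toy version under loud assumptions: the stop word list is hand-picked, the tokenizer only keeps runs of ASCII letters, and the function names are invented for the example. Real projects would more likely lean on `nltk` or `scikit-learn` for these steps.

```python
from __future__ import print_function

import re
from collections import Counter

# Tiny hand-picked stop word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def denoise(tokens):
    """Step 2: drop stop words."""
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(tokens):
    """Step 3: count how often each remaining token occurs."""
    return Counter(tokens)


text = "The cat and the hat. The cat is a cat!"
counts = vectorize(denoise(tokenize(text)))
print(counts.most_common())

# Prints:
# [('cat', 3), ('hat', 1)]
```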

The _smoothed term frequency-inverse document frequency_ (smoothed tf-idf), for a token $t$ and document $d$, is given by
$$
\operatorname{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \left( \frac{N}{1 + n_t} \right)
$$
where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$.
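A small worked example may make the formula concrete. The sketch below assumes that $\operatorname{tf}(t, d)$ is the raw count of $t$ in $d$ and that $\log$ is the natural logarithm; neither is stated above, so treat both as assumptions. The four toy documents and the function names are invented for the example.

```python
from __future__ import division, print_function

import math

# Four toy documents, already tokenized.
docs = [
    ["oil", "prices", "rise"],
    ["oil", "exports", "fall"],
    ["wheat", "prices", "fall"],
    ["oil", "wheat", "oil"],
]


def tf(term, doc):
    """Raw count of `term` in `doc` (assumed definition of tf)."""
    return doc.count(term)


def smoothed_idf(term, docs):
    """log(N / (1 + n_t)), following the formula above."""
    n_docs = len(docs)
    n_t = sum(1 for d in docs if term in d)
    return math.log(n_docs / (1 + n_t))


def tf_idf(term, doc, docs):
    return tf(term, doc) * smoothed_idf(term, docs)


# "exports" occurs in 1 of 4 documents: 1 * log(4 / 2)
print(tf_idf("exports", docs[1], docs))
# "oil" occurs in 3 of 4 documents: 2 * log(4 / 4) = 0
print(tf_idf("oil", docs[3], docs))
```

With this variant, a token that appears in every document gets a slightly negative weight, since $\log(N / (1 + N)) < 0$. Note also that `scikit-learn`'s `smooth_idf = True` option uses a different smoothing, $\ln\left(\frac{1 + N}{1 + n_t}\right) + 1$, so values from `TfidfVectorizer` will not match the ones computed here.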

In [None]:
import nltk

# Set up the Reuters corpus (run nltk.download("reuters") once if it is not installed yet).
reuters = [nltk.corpus.reuters.raw(i) for i in nltk.corpus.reuters.fileids()]

In [None]:
#nltk.word_tokenize()
#stemmer = nltk.stem.porter.PorterStemmer()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#vectorizer = TfidfVectorizer(tokenizer = , stop_words = "english", smooth_idf = True, norm = None)
#tfs = vectorizer.fit_transform()

## Statistics with Python

`scikit-learn` is a popular and mature package for statistics, with particular focus on machine learning methods. The [documentation](http://scikit-learn.org/stable/documentation.html) has both a user guide and an API reference.

The user guide discusses the intuition (and sometimes mathematics) behind the supported statistical methods, with examples. The API reference is a detailed description of each function in the package.

Since `scikit-learn`'s user guide is not intended to teach statistics, you might also want to look at other resources to learn more about statistical methods. Both of these books are free online:

* [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)
* [Elements of Statistical Learning](https://statweb.stanford.edu/~tibs/ElemStatLearn/)

ISL is a gentle introduction to statistical methods, while ESL is a more thorough reference.