## Playing with Python Libraries

In this notebook, we will play with some Python libraries to perform some common tasks. We will:

1. Use `requests` and `BeautifulSoup` to scrape a web page.
2. Analyse the text using the `spaCy` natural language processing library.
3. View the results in tables using `pandas`.

In [None]:
# Import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

# The web page to scrape
url = 'https://www.fanfiction.net/s/6041872/1/Broken'

# The requests library sends an HTTP request and returns a response
response = requests.get(url)

# BeautifulSoup parses the web page into a tree of elements
soup = BeautifulSoup(response.text, 'html.parser')

# We're going to guess where the content we want is in the page.
# You may have to change this, depending on how the page is organised.
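content = soup.find(id='storytext')

In [None]:
# Now that we have the content, let's get all the paragraph tags
# and join them into a single string. If the guess above found
# nothing, fall back to searching the whole page.
container = content if content is not None else soup
paragraphs = container.find_all('p')
# Skip the first two paragraphs; adjust this slice for your page.
paras = [p.text for p in paragraphs[2:]]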
text = ' '.join(paras)

print(text[0:500] + '...')
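
To see how the element lookup behaves without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup and the names `toy_html`, `toy_soup`, `by_id`, `by_css`, and `toy_paras` are assumptions for illustration only; a real page will be organised differently.

```python
from bs4 import BeautifulSoup

# A made-up page: one navigation paragraph plus a story container
toy_html = """
<html><body>
  <p>Site navigation</p>
  <div id="storytext">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""
toy_soup = BeautifulSoup(toy_html, 'html.parser')

# find() matches tag names and attributes, so look the id up explicitly
by_id = toy_soup.find(id='storytext')
# select_one() accepts CSS selectors such as '#storytext'
by_css = toy_soup.select_one('#storytext')

# Searching inside the container skips the navigation paragraph
toy_paras = [p.get_text() for p in by_id.find_all('p')]
print(toy_paras)
```

Passing `'#storytext'` to `find()` would look for a tag literally named `#storytext` and return `None`, which is why the id is passed as a keyword here.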

In [None]:
# Download spaCy's English language model
%run -m spacy download en_core_web_sm

In [None]:
# Import the spaCy Natural Language Processing (NLP) library
import spacy

# Load spaCy's language model
nlp = spacy.load('en_core_web_sm')

# Process our text into a spaCy document
doc = nlp(text)

# Get some linguistic features
for token in doc[0:5]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

That looks like it has some useful information, but it is hard to read!

Time to play with pandas dataframes. A dataframe is a structure for holding data that can be easily viewed in a table. It's basically Excel for Python.

We are going to treat each token in a spaCy doc as a set of features (lemma, part of speech, etc.), and we want each token in its own row and each feature in its own column. To do this, we'll build a list containing one dict per token, where each dict holds that token's features. The dict keys will be the column names. Once we massage our data into that format, we can create the dataframe.
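
As a small illustration of that shape, the sketch below builds a dataframe from three hand-written dicts. The rows are made-up stand-ins for spaCy tokens, and `toy_rows` and `toy_df` are hypothetical names chosen so they don't clash with the notebook's own variables.

```python
import pandas as pd

# Hand-written rows standing in for spaCy tokens (made-up values)
toy_rows = [
    {'token': 'Rings', 'lemma': 'ring', 'pos': 'NOUN'},
    {'token': 'were', 'lemma': 'be', 'pos': 'AUX'},
    {'token': 'forged', 'lemma': 'forge', 'pos': 'VERB'},
]

# Each dict becomes a row; the keys become the column names
toy_df = pd.DataFrame(toy_rows, columns=['token', 'lemma', 'pos'])
print(toy_df)
```

Passing `columns` explicitly fixes the column order, which is handy when the dicts are built in a loop.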

In [None]:
# Import pandas
import pandas as pd

# Load each token's features into a dict and collect the dicts in a list
features = []
for token in doc:
    feature = {
        'token': token.text,
        'norm': token.norm_,
        'lemma': token.lemma_,
        'pos': token.pos_,
        'stopword': token.is_stop
    }
    features.append(feature)

# Create a pandas dataframe
df = pd.DataFrame(features, columns=['token', 'norm', 'lemma', 'pos', 'stopword'])
df.head(10)

In [None]:
# We can sort dataframes!

# Avoid naming the result `sorted`, which would shadow Python's built-in
sorted_df = df.sort_values('norm')
# To reverse sort
# sorted_df = df.sort_values('norm', ascending=False)

# Show the sorted table
sorted_df.head(10)

In [None]:
# We don't want punctuation, spaces, digits, and stop words in our table. Take them out!
tokens = [
    token.norm_ for token in doc
    if token.pos_ not in ['PUNCT', 'SPACE']
    and not token.norm_.isdigit()
    and not token.is_stop
]

# Create a pandas dataframe, this time with just the lower-cased tokens
df = pd.DataFrame(tokens, columns=['norm'])
df.head(10)
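
The same filtering can also be done on a dataframe with a boolean mask instead of a list comprehension. This is only a sketch on made-up rows; `toy_features`, `mask`, and `filtered` are hypothetical names, and the columns mirror the feature table built earlier.

```python
import pandas as pd

# Made-up token features, shaped like the feature table built earlier
toy_features = pd.DataFrame([
    {'norm': 'the', 'pos': 'DET', 'stopword': True},
    {'norm': 'ring', 'pos': 'NOUN', 'stopword': False},
    {'norm': ',', 'pos': 'PUNCT', 'stopword': False},
    {'norm': '3', 'pos': 'NUM', 'stopword': False},
    {'norm': 'war', 'pos': 'NOUN', 'stopword': False},
])

# Keep rows that are not punctuation or spaces, not digits, and not stop words
mask = (
    ~toy_features['pos'].isin(['PUNCT', 'SPACE'])
    & ~toy_features['norm'].str.isdigit()
    & ~toy_features['stopword']
)
filtered = toy_features.loc[mask, ['norm']].reset_index(drop=True)
print(filtered)
```

The mask approach keeps the other feature columns available if you need them later, at the cost of being a little less readable than the comprehension.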

In [None]:
# Get a dict of the norms and counts
counts = df['norm'].value_counts().to_dict()

# Convert it to a list of dicts and feed to a new dataframe
counts = [{'norm': k, 'count': v} for k, v in counts.items()]
counted = pd.DataFrame(counts, columns=['norm', 'count'])

# Show the counts
counted.head(10)
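
If the round trip through a dict feels long-winded, pandas can go from counts to a two-column table directly. The sketch below uses a made-up list of norms; `toy_norms` and `toy_counts` are hypothetical names.

```python
import pandas as pd

# Made-up norms standing in for the filtered tokens
toy_norms = pd.DataFrame({'norm': ['ring', 'war', 'ring', 'elf', 'ring', 'war']})

# value_counts() returns a Series indexed by norm, sorted by count;
# naming the index and resetting it gives 'norm' and 'count' columns
toy_counts = (
    toy_norms['norm']
    .value_counts()
    .rename_axis('norm')
    .reset_index(name='count')
)
print(toy_counts)
```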

In [None]:
# We can even do some fancy plotting using Python's matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline

# Some arcane matplotlib stuff that experts understand and the rest of us Google
ax = plt.gca()
counted[0:10].plot(kind='line', x='norm', y='count', ax=ax, rot=90)
plt.show()

In [None]:
# Or we can show it in a bar chart
counted[0:10].plot(kind='bar', x='norm', y='count')
plt.show()

## `textacy`

The Python `textacy` library builds on top of spaCy. Below we'll use its built-in methods to analyse the text we scraped. In the next notebook, we'll use it to create a corpus of texts by Tolkien fans.

In [None]:
# Import textacy
import textacy

In [None]:
# Get keywords in context (KWIC)
textacy.text_utils.KWIC(text, 'war', window_width=35)
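
To get a feel for what a keyword-in-context listing involves, here is a minimal sketch in plain Python. It is not textacy's implementation: the function `kwic`, its whole-word matching, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, window_width=35):
    """Return (left context, match, right context) for each keyword hit."""
    # Match the keyword as a whole word, ignoring case
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', flags=re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        # Slice up to window_width characters on either side of the match
        left = text[max(0, match.start() - window_width):match.start()]
        right = text[match.end():match.end() + window_width]
        hits.append((left, match.group(), right))
    return hits

sample = 'The war was long. After the war, the shire was quiet.'
for left, word, right in kwic(sample, 'war', window_width=10):
    # Right-align the left context so the keywords line up in a column
    print(f'{left:>10} {word} {right:<10}')
```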

In [None]:
# Let's use textacy instead of list comprehensions to scrub
from textacy import preprocessing
normalized_text = preprocessing.normalize_whitespace(preprocessing.remove_punctuation(text))
normalized_text = textacy.preprocessing.replace.replace_numbers(normalized_text, '')
normalized_text[0:100]
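
For comparison, a rough standard-library version of the same scrubbing might look like the sketch below. The function `scrub` is hypothetical, and it only approximates the textacy helpers above, which may treat edge cases (such as decimal numbers or Unicode punctuation) differently.

```python
import re
import string

def scrub(raw):
    """Strip punctuation and digits, then collapse runs of whitespace."""
    # Replace every ASCII punctuation mark with a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    cleaned = raw.translate(table)
    # Drop runs of digits
    cleaned = re.sub(r'\d+', '', cleaned)
    # Collapse repeated whitespace and trim the ends
    return re.sub(r'\s+', ' ', cleaned).strip()

print(scrub('In 1954,  the  ring-bearer left... twice!'))
```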

### Document statistics

`textacy` can create spaCy docs. Once we have one, we can use its `ke` (keyterm extraction) module to extract key phrases using several different algorithms (the example below uses the "TextRank" algorithm).

We can also get other kinds of statistics and even use `textacy` to produce term counts as we did above. 

In [None]:
# Specify the language model and make a spaCy doc
en = textacy.load_spacy_lang('en_core_web_sm')
doc = textacy.make_spacy_doc(text, lang=en)

# Import the keywords module
import textacy.ke
print('Textrank:')
print(textacy.ke.textrank(doc, normalize='lemma', topn=10))

In [None]:
# Let's get some text statistics

stats = textacy.TextStats(doc)
stats.basic_counts

In [None]:
# What is the reading level for this text?

stats.readability_stats
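
Readability scores of this kind are usually simple formulas over word, sentence, and syllable counts. As a sketch, here are two well-known ones, the Flesch-Kincaid grade level and Flesch reading ease, written as plain functions. The function names and the counts passed in are made up for illustration; which scores `textacy` reports depends on the installed version.

```python
def flesch_kincaid_grade(n_words, n_sents, n_syllables):
    """Approximate US school grade needed to read the text."""
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59

def flesch_reading_ease(n_words, n_sents, n_syllables):
    """Higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)

# Made-up counts: 100 words, 5 sentences, 150 syllables
grade = flesch_kincaid_grade(100, 5, 150)
ease = flesch_reading_ease(100, 5, 150)
print(f'grade level: {grade:.1f}, reading ease: {ease:.1f}')
```

Longer sentences and longer words push the grade level up and the reading ease down.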

In [None]:
# Get a bag of words using frequencies instead of counts
bow = doc._.to_bag_of_terms(ngrams=1, entities=False, weighting="freq", as_strings=True)

# Let's look at this in a dataframe
bow = [{'Term': k, 'Frequency': v} for k, v in bow.items()]
bow_df = pd.DataFrame(bow, columns=['Term', 'Frequency'])
bow_df = bow_df.sort_values('Frequency', ascending=False)
snippet = bow_df.head(15)

# This just hides the dataframe index in a jupyter notebook, which is more pleasant on the eye
from IPython.display import display, HTML
display(HTML(snippet.to_html(index=False)))
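
The difference between counts and frequencies is easy to see on a toy list of terms. This sketch simply divides each count by the total number of terms counted; that denominator is an assumption, and `textacy` may normalise slightly differently. The names `toy_terms`, `toy_term_counts`, and `toy_freqs` are hypothetical.

```python
from collections import Counter

# Made-up terms standing in for a document's words
toy_terms = ['ring', 'war', 'ring', 'elf']

# Absolute counts per term
toy_term_counts = Counter(toy_terms)

# Relative frequencies: each count divided by the total number of terms
total = sum(toy_term_counts.values())
toy_freqs = {term: n / total for term, n in toy_term_counts.items()}

# Print the terms from most to least frequent
for term, freq in sorted(toy_freqs.items(), key=lambda item: item[1], reverse=True):
    print(f'{term}: {freq:.2f}')
```

Frequencies make texts of different lengths comparable, which matters once there is more than one document.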

So far, we've only played with one text. In the next notebook, we'll use `textacy` to build a corpus of texts and perform some low-level analysis. <a href="textacy.ipynb" target="_blank">Click here</a> to continue.