[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/PythonReview/blob/main/notebooks/Python.03.3rdPartyLibraries.ipynb)

# 3rd party packages in Python

What do you do when the standard library doesn't have what you need? There's a package for that. Python's central repository of third-party packages is PyPI (the Python Package Index). It hosts hundreds of thousands of packages of varying quality and usefulness.

Before you write code that seems like it should already be written, Google: "python <whatever>". Chances are someone has already tackled the same problem you are having now.

## Packages. What are they good for?


To mention a few things:

 * plotting / charting
 * machine learning
 * natural language processing
 * web application frameworks
 * working with various services, web APIs, etc.
 * template languages
 * data parsers or various codecs
 * database drivers
 * better handling of x, where x might be:
   - date/time processing
   - statistics / scientific calculation
   - web resource fetching

The list goes on. This is an applied course, so we will use a **lot** of 3rd party libraries! You should get used to reading library documentation, and even sometimes looking at the code!

## Installing packages

`pip` is the go-to installer for Python packages. In your local environment you would run, for example, `pip install requests` to install the requests library.

To install packages into the Colab runtime environment, we need to call out to the shell to execute pip. We do this with a bang:

```
!pip install requests
```

Note, however, that Colab has a lot of packages already installed. E.g.:

In [None]:
!pip install requests



Note the _Requirement already satisfied_ message: requests is a popular library that comes pre-installed on Colab.
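
You can also ask Python itself whether a package is installed, without calling out to pip. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name introduced here for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing.

    Hypothetical helper: it only wraps importlib.metadata.version().
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string on Colab, where requests is pre-installed.
print(installed_version("requests"))

# A name that is not installed yields None instead of raising.
print(installed_version("surely-not-a-real-package-xyz"))
```

Returning `None` rather than raising keeps the check easy to use in an `if` statement before deciding whether to run `!pip install`.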

The easiest way to see all the installed packages is to call:

```
pip freeze
```

In [None]:
# help can show you all of the modules, but it is a bit verbose and slow
# help('modules')

# instead you can call out to the shell to get the "pip freeze" which shows packages and their versions
!pip freeze
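
If you want the same listing as data you can work with in Python, rather than as shell output, something along these lines should work. It is a sketch built on the standard library's `importlib.metadata`; `freeze_lines` is a hypothetical name, and the output only approximates what `pip freeze` prints (it does not handle editable or URL installs).

```python
from importlib import metadata


def freeze_lines():
    """Return sorted 'name==version' strings for installed distributions.

    Hypothetical helper approximating the output of `pip freeze`.
    """
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


# Show only the first few entries; the full list can be long.
for line in freeze_lines()[:5]:
    print(line)
```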

## Requests

[Requests](https://requests.readthedocs.io/en/master/) is a popular library for fetching web resources. Billing itself as "HTTP for humans," requests exposes an API that is much simpler than Python's urllib for most common use cases in fetching data on the internet.

Here's a simple example to show how requests compares with Python's urllib:

### Using Python's urllib to fetch a joke

In [None]:
import json
import urllib.request
req = urllib.request.Request('https://icanhazdadjoke.com/')
req.add_header('Accept', 'application/json')
req.add_header('User-agent', 'aprd-joke-fetcher/0.1')
r = urllib.request.urlopen(req)
data = json.loads(r.read())
data

{'id': 'EYo4TCAdUf',
 'joke': 'I tried to write a chemistry joke, but could never get a reaction.',
 'status': 200}

### vs. Requests

In [None]:
import requests
r = requests.get('https://icanhazdadjoke.com/', headers={'Accept': 'application/json'})
data = r.json()
data

{'id': 'fiydpr4EQnb',
 'joke': 'What’s brown and sounds like a bell? Dung!',
 'status': 200}
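
In practice it is worth guarding a fetch like this against slow or failing servers. The sketch below wraps the call above in a helper. `fetch_joke` and `JOKE_URL` are hypothetical names introduced here, and the 10-second timeout is an arbitrary choice. It relies on two documented Requests features: the `timeout` argument to `requests.get`, and `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests

JOKE_URL = "https://icanhazdadjoke.com/"


def fetch_joke(url=JOKE_URL, timeout=10):
    """Return the joke text from the JSON API (hypothetical helper)."""
    response = requests.get(
        url,
        headers={"Accept": "application/json"},
        timeout=timeout,  # seconds to wait before giving up
    )
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx
    return response.json()["joke"]
```

Calling `fetch_joke()` returns the joke as a string. Network problems and bad status codes both surface as subclasses of `requests.RequestException`, so a single `except requests.RequestException` clause around the call is enough to handle them.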

In Requests, the unparsed payload is stored in a property called `text`:

In [None]:
r.text

'{"id":"fiydpr4EQnb","joke":"What\\u2019s brown and sounds like a bell? Dung!","status":200}\n'

which is what you would use if you are fetching HTML rather than JSON:

In [None]:
r = requests.get('https://icanhazdadjoke.com/')
html = r.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1">\n<meta name="description" content="The largest collection of dad jokes on the internet" />\n<meta name="author" content="C653 Labs" />\n<meta name="keywords" content="dad,joke,funny,slack,alexa" />\n<meta property="og:site_name" content="icanhazdadjoke" />\n<meta property="og:title" content="icanhazdadjoke" />\n<meta property="og:type" content="website" />\n<meta property="og:url" content="https://icanhazdadjoke.com/j/JeaxXvkyPf" />\n<meta property="og:description" content="Can February march? No, but April may." />\n<meta property="og:image:url" content="https://icanhazdadjoke.com/j/JeaxXvkyPf.png" />\n<meta property="og:image:secure_url" content="https://icanhazdadjoke.com/j/JeaxXvkyPf.png" />\n<meta property="og:image:secure_url" content="https://icanhazdadjoke.com/static/smile
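
Raw HTML like this usually needs further parsing before it is useful. As a rough illustration, the sketch below pulls `<meta property=...>` values out of a page using only the standard library's `html.parser`. `MetaPropertyParser` and `sample_html` are hypothetical names; the sample is assembled from a few tags in the output above so that the example runs without network access.

```python
from html.parser import HTMLParser


class MetaPropertyParser(HTMLParser):
    """Collect <meta property="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if "property" in attrs and "content" in attrs:
            # Keep the first value if a property appears more than once.
            self.properties.setdefault(attrs["property"], attrs["content"])


# Stand-in for r.text, built from tags shown in the output above.
sample_html = (
    '<!DOCTYPE html>\n<html lang="en">\n<head>\n'
    '<meta charset="utf-8">\n'
    '<meta property="og:site_name" content="icanhazdadjoke" />\n'
    '<meta property="og:description" '
    'content="Can February march? No, but April may." />\n'
    "</head>\n</html>"
)

parser = MetaPropertyParser()
parser.feed(sample_html)
print(parser.properties["og:description"])
# prints: Can February march? No, but April may.
```

With a live response you would call `parser.feed(r.text)` instead of feeding the sample string.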

## Newspaper

[Newspaper3k](https://newspaper.readthedocs.io/en/latest/) is said to be inspired by requests, and is a high-level library for managing access to news information.

Newspaper is more of a monolithic suite of utilities than a library, and it is worth digging into if you are interested in fetching the news. In this course, however, we are primarily interested in Newspaper's ability to extract article text from a web page. For this reason, we will not use Newspaper's fetching tools, but will simply use it to extract text from pages we have already fetched with requests.

### Using Newspaper's fulltext function

In [None]:
!pip install newspaper3k

In [None]:
import requests
import newspaper

r = requests.get('https://www.theonion.com/poll-finds-majority-of-americans-would-like-things-to-g-1819573273')
html = r.text
html[:80]

'<!DOCTYPE html><html lang="en-us" data-reactroot=""><head><meta name="google-sit'

In [None]:
article = newspaper.fulltext(html)
article[:80]

'UTICA, NY—A poll released Tuesday by Zogby International found that 72 percent o'

## spaCy

[spaCy](https://spacy.io/usage) is a fairly new library that makes short work of many common NLP (natural language processing) tasks. Use spaCy for straightforward, out-of-the-box tokenization, POS (part-of-speech) tagging, and NER (named entity recognition).

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

Once processed by nlp, a spaCy doc is tokenized:

In [None]:
list(doc)

[Apple, is, looking, at, buying, U.K., startup, for, $, 1, billion]

Tokens have [POS tags and other properties](https://spacy.io/usage/linguistic-features#pos-tagging):

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_)

Apple Apple PROPN
is be AUX
looking look VERB
at at ADP
buying buy VERB
U.K. U.K. PROPN
startup startup NOUN
for for ADP
$ $ SYM
1 1 NUM
billion billion NUM


The doc also contains an `ents` property with named entities:

In [None]:
for e in doc.ents:
    print(e.text, e.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
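
Once you have entities, a common next step is to group them by label. The sketch below shows one way to do that. `entities_by_label` is a hypothetical helper that accepts any objects with `text` and `label_` attributes, so the example uses stand-ins copied from the output above instead of loading a spaCy model. With spaCy loaded, you would pass `doc.ents` instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(ents):
    """Group entity texts by their label (hypothetical helper)."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins mirroring the entities printed above, not real spaCy objects.
sample_ents = [
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="U.K.", label_="GPE"),
    SimpleNamespace(text="$1 billion", label_="MONEY"),
]

print(entities_by_label(sample_ents))
# prints: {'ORG': ['Apple'], 'GPE': ['U.K.'], 'MONEY': ['$1 billion']}
```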


## truecase

[truecase](https://github.com/daltonfury42/truecase) is a handy utility for correcting capitalization in text. Sometimes the text you receive has already been processed in a way that alters its letter case, for example by lowercasing everything. It can be helpful to restore correct capitalization before, for example, running named entity recognition.

In [None]:
!pip install truecase

In [None]:
import truecase

truecase tries to be smart, although note the altered spacing around `U. K.` and `$` in the first result:

In [None]:
truecase.get_true_case('apple is looking at buying u.k. startup for $1 billion')

'Apple is looking at buying U. K. startup for$ 1 billion'

In [None]:
truecase.get_true_case('gonna buy me a dog named rover.')

'Gonna buy me a dog named Rover.'

but it's not perfect. In the examples below, `apple` and `spot` are left lowercase:

In [None]:
truecase.get_true_case('u.k. startup to be bought by apple.')

'U. K. startup to be bought by apple.'

In [None]:
truecase.get_true_case('gonna buy me a dog named spot.')

'Gonna buy me a dog named spot.'