# Introduction to Python Jupyter

Welcome to Jupyter. Through this interface I will be showing you the following:

  - Python - A programming language that lets you work quickly. - [Documentation](https://docs.python.org/3/)
  - NLTK - Natural Language Toolkit - a Python library for working with written language data. - [Documentation](http://www.nltk.org/book/)
  - Open Collections API - Our "Application Programming Interface" which will allow you to import full text. - [Documentation](https://open.library.ubc.ca/docs)
  
Python is a great language for data analysis. More experienced programmers might want to use R, but Python is a nice entry point for everyone.

**If you don't know Python, or any programming for that matter, please remain calm: you won't need to do any programming** throughout this talk. If you do know Python, feel free to edit any of the code and have your notebook update accordingly.

## Getting the Full Text

To begin with we are just going to get one item from the Open Collections API and perform some analysis on that. Later on we will look at getting entire collections, and performing searches via the API. 

For our first item I have chosen:

https://open.library.ubc.ca/collections/bcbooks/items/1.0059569

The Open Collections API URL is: 

https://oc-index.library.ubc.ca 

So to access the item I have chosen via the API we would need to GET the data from:

https://oc-index.library.ubc.ca/collections/bcbooks/items/1.0059569

In [None]:
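import requests

# GET the item record from the API and stop early if the request failed
response = requests.get('https://oc-index.library.ubc.ca/collections/bcbooks/items/1.0059569')
response.raise_for_status()
apiResponse = response.json()
item = apiResponse['data']
fullText = item['FullText'][0]['value']  # text of the first FullText entry
print(fullText)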
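
The URL pattern above (the API base URL followed by `collections/<collection>/items/<item id>`) can be wrapped in a small helper so that other items are easy to request. This is only a sketch: `API_BASE` and `item_url` are names made up for this example, and the pattern is assumed from the single item URL shown above.

```python
# Hypothetical helper: the names below are illustrative, not part of the API.
API_BASE = "https://oc-index.library.ubc.ca"


def item_url(collection, item_id, base=API_BASE):
    """Build the API URL for one item in one collection."""
    return f"{base}/collections/{collection}/items/{item_id}"


url = item_url("bcbooks", "1.0059569")
print(url)
```

The resulting string can be passed to `requests.get` in place of the hard-coded URL.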

## Cleaning Up The Full Text

To get better results from our analysis later, we need to clean up the full text.

This could be a project in itself and will differ from item to item, so for the initial run I have just lowercased the full text so that 'Canada' and 'canada' aren't considered two different words.

I've also provided a basic regex that strips everything other than words from the full text. Comment out its two lines if you want to keep the punctuation.

In [None]:
import re

fullTextLower = fullText.lower()
cleanFullText = fullTextLower

### Strips everything other than words; comment out the two lines below to keep punctuation ###
pattern = re.compile(r'[\W_]+')  # raw string: runs of non-word characters and underscores
cleanFullText = pattern.sub(' ', cleanFullText)

print(cleanFullText)
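
To see what the lowercasing and the regex do, it can help to try them on a short string first. The sample sentence below is made up for illustration; the pattern is the same `[\W_]+` used in the cell above.

```python
import re

# Made-up sample text, not taken from the item
sample = "Canada, CANADA and canada's_coast!"

# Lowercase first so the three spellings of "canada" become identical
lowered = sample.lower()

# Replace every run of non-word characters or underscores with one space
cleaned = re.sub(r"[\W_]+", " ", lowered)

print(cleaned.split())
```

Note that the apostrophe is stripped too, so "canada's" ends up as two separate pieces, "canada" and "s".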

## Basic Analysis

Now that we have the item's full text, we are going to use the Natural Language Toolkit (NLTK) to perform some analysis on it.

NLTK is a Python Library for working with written language data. It is free and very well documented. Many areas we'll be covering are treated in more detail in the NLTK Book, available for free online from [here](http://www.nltk.org/book/).

> Note: NLTK provides tools for tasks ranging from very simple (counting words in a text) to very complex (writing and training parsers, etc.). Many advanced tasks are beyond the scope of this talk, but by the time we're done, you should understand Python and NLTK well enough to perform these tasks on your own!

Firstly, we will need to import NLTK.

In [None]:
import nltk # imports all the nltk basics
nltk.download("punkt") # Word tokenizer
nltk.download("punkt_tab") # Tokenizer data that newer NLTK releases look for instead of "punkt"
nltk.download("stopwords") # Stop words
from nltk import word_tokenize

### Exploring Vocabulary

NLTK makes it really easy to get basic information about the size of a text and the complexity of its vocabulary.

*len* gives the number of items in whatever you pass it. On a string such as fullText that is the number of characters; on a list of tokens it is the total number of words and items of punctuation.

*set* gives you a collection of all the distinct items, without the duplicates.

Hence, **len(set(fullText))** will give you the total number of unique characters, and the same call on a list of tokens gives the number of unique tokens. Remember this still includes punctuation.

*sorted* places the items into alphabetical order, with punctuation symbols and capitalised words first.
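
As a quick illustration of these three built-ins, here they are applied to a tiny made-up string and a hand-written token list (both are assumptions for the example, not data from the item):

```python
# Made-up sample data for illustration only
sample = "The cat sat. The end."
sample_tokens = ["The", "cat", "sat", ".", "the", "end", "."]

# On a string, len() counts characters and set() collects distinct characters
char_count = len(sample)
unique_chars = len(set(sample))

# On a list of tokens, the same calls count tokens and distinct tokens
token_count = len(sample_tokens)
unique_tokens = len(set(sample_tokens))

# sorted() puts punctuation first, then capitalised words, then lowercase words
ordered = sorted(set(sample_tokens))

print(char_count, unique_chars, token_count, unique_tokens)
print(ordered)
```

Because 'The' and 'the' differ in case, they count as two distinct tokens here, which is why the full text is lowercased before tokenizing.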

#### Number of characters

In [None]:
len(fullText)

#### Number of unique characters

In [None]:
len(set(fullText))

#### List of unique characters

In [None]:
sorted(set(fullText))[:50] # Limited to 50

#### Get token count (words + symbols)

*For our analysis, we want to break up the full text into words and punctuation. This step is called tokenization.*

In [None]:
tokens = word_tokenize(cleanFullText)
len(tokens)
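
To get a feel for what tokenization produces, here is a rough stand-in written with a regular expression. It is not how `word_tokenize` works internally (NLTK treats contractions and abbreviations with more care); `simple_tokenize` is a hypothetical name used only for this sketch.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or number; [^\w\s] matches one punctuation symbol
    return re.findall(r"\w+|[^\w\s]", text)


demo_tokens = simple_tokenize("Hello, world! It's 1886.")
print(demo_tokens)
print(len(demo_tokens))
```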

#### Unique Token Count

In [None]:
len(set(tokens))

#### Average number of times a word is used 

We can investigate the lexical richness of a text. For example, by dividing the total number of words by the number of unique words, we can see the average number of times each word is used.

In [None]:
len(tokens)/len(set(tokens))
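
The same calculation can be wrapped in a function so it is easy to reuse on other texts. A minimal sketch, assuming a plain list of tokens; the function name `average_uses` and the sample sentence are made up for this example.

```python
def average_uses(tokens):
    """Average number of times each distinct token occurs."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(tokens) / len(set(tokens))


# Made-up sample: 11 tokens, 8 of them distinct
sample_tokens = "the river runs to the sea and the sea is deep".split()
ratio = average_uses(sample_tokens)
print(ratio)
```

A higher value means the text repeats its words more often, so its vocabulary is less varied.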

#### Number of times a specific word is used

In [None]:
tokens.count("vancouver") # counts whole tokens; a string count would also match inside longer words

#### Percentage of text that is a specific word

In [None]:
100.0 * tokens.count("and") / len(tokens) # share of all tokens that are exactly "and"
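
One thing to watch when counting words: calling `count` on a string counts substrings, while calling it on a list of tokens counts whole words. The short made-up sentence below shows how far apart the two can be.

```python
# Made-up sample for illustration only
demo_text = "the island and the land and sand"
demo_tokens = demo_text.split()

# str.count also finds "and" inside "island", "land" and "sand"
substring_hits = demo_text.count("and")

# list.count only matches tokens that are exactly "and"
token_hits = demo_tokens.count("and")

# Percentage of tokens that are the word "and"
share = 100.0 * token_hits / len(demo_tokens)

print(substring_hits, token_hits, round(share, 2))
```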

## Exploring Text

### Concordance

In [None]:
text = nltk.Text(tokens)
text.concordance("vancouver")
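
A concordance shows every occurrence of a word together with the words around it. The sketch below is a simplified plain-Python version of that idea, not NLTK's implementation; `kwic` (keyword in context) and the sample sentence are made up for this example.

```python
def kwic(tokens, word, width=2):
    """Return each occurrence of word with up to `width` tokens either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Made-up sample sentence
sample_tokens = "we sailed from vancouver to victoria and back to vancouver by night".split()
for line in kwic(sample_tokens, "vancouver"):
    print(line)
```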

### Words used in similar contexts

In [None]:
text.similar("miles")

### Common contexts

Common contexts allow us to examine just the contexts that are shared by two or more words, such as valley and river.

In [None]:
text.common_contexts(["valley", "river"])
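
The idea behind shared contexts can be sketched in plain Python: record the word before and the word after each occurrence, then keep only the pairs that every word has in common. This is a simplified illustration rather than NLTK's own code (it ignores the first and last token of the text), and `shared_contexts` is a hypothetical name.

```python
from collections import defaultdict


def shared_contexts(tokens, words):
    """Return the (previous, next) pairs that occur around every given word."""
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        if tokens[i] in words:
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    sets = [contexts[w] for w in words]
    return set.intersection(*sets) if sets else set()


# Made-up sample sentence
sample_tokens = "the valley is wide and the river is deep but the valley was dry".split()
print(shared_contexts(sample_tokens, ["valley", "river"]))
```

Here both words appear between "the" and "is", so that pair is the only shared context.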

### Longest words in the text

It is possible to select the longest words in a text, which may tell you something about its vocabulary and style.

In [None]:
v = set(text)
long_words = [word for word in v if len(word) > 15]
sorted(long_words)

### Collocations

We can also find words that typically occur together, which tend to be very specific to a text or genre of texts.

In [None]:
text.collocations()
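
The simplest way to look for words that occur together is to count adjacent pairs (bigrams). The sketch below does only that raw counting; NLTK's `collocations` applies its own filtering and scoring, so its results will differ. `top_bigrams` and the sample sentence are made up for this example.

```python
from collections import Counter


def top_bigrams(tokens, n=3):
    """Return the n most frequent pairs of adjacent tokens."""
    # zip pairs each token with the one that follows it
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# Made-up sample sentence
sample_tokens = "salt water meets fresh water where salt water ends".split()
print(top_bigrams(sample_tokens, 1))
```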

## Graphing Data

### Single Dispersion Plot

In [None]:
# allow visuals to show up in this interface
%matplotlib inline
text.dispersion_plot(["river"])

### Multiple Dispersion Plot

In [None]:
text.dispersion_plot(["miles", "sea", "lake", "land", "rest"])

### Frequency distributions

In [None]:
from nltk import FreqDist
fdist = FreqDist(text)
fdist.most_common(50)

In [None]:
fdist.plot(25)
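
The most common tokens in almost any English text are words like "the" and "and", which say little about the content. One way to get a more telling list is to filter out such stop words before counting. The sketch below uses a tiny hand-picked stop word set as a stated assumption; in the notebook, the list from `nltk.corpus.stopwords.words("english")` (downloaded earlier) could be used instead. `content_word_counts` is a hypothetical name.

```python
from collections import Counter

# Tiny hand-picked stop word set, an assumption for this example only
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is"}


def content_word_counts(tokens, stop_words=STOP_WORDS):
    """Count alphabetic tokens that are not stop words."""
    return Counter(t for t in tokens if t.isalpha() and t not in stop_words)


# Made-up sample sentence
sample_tokens = "the river and the lake and the river of the north".split()
counts = content_word_counts(sample_tokens)
print(counts.most_common(3))
```

The filtered token list could equally be passed to `FreqDist` to plot the counts as above.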