# Week 2.0 Python concepts

##  Python Concepts Refresher

As we start working with bigger text documents, sets of tokens and numerical representations, the amount of individual bits of information we need to store and refer back to becomes **massive**. We'll learn some new Python concepts to both handle large collections of data, and to process them.

### Functions

A function is a named block of code that performs a specific task. A number of these are built into Python: to run one, type the name of the function followed by `()`, and put any parameters you want it to use inside the brackets. Here are the ones we have looked at already:

#### `print()`

The `print()` function in Python takes whatever input you put inside the brackets and prints it as text on the screen, or another standard output device. It can take strings or other objects.

#### `split()`

The `split()` function in Python can be used to turn a sentence into a list of words. By default it splits wherever it finds whitespace.

```
text = "this is my text and it contains words"
print(text.split())
```

#### `len()`

The `len()` function in Python can be used to get the **length of a given string**. 

```
letters_in_book = len(book_text)
```

#### `Indexing`

We can also extract **specific characters** from a string, given a **numerical index**. 

We do this using ``square brackets[]``. 

In coding, we **start counting at 0**. This means the first letter of a string is as follows:

```
first_letter_in_book = book_text[0]
```

It also means the **last character** will be at index **length - 1**, so we can combine indexing with the `len()` function to calculate this!
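As a minimal sketch of that calculation, the snippet below uses a short made-up string in place of `book_text` (in the course notebook, `book_text` is assumed to hold the full book loaded earlier):

```python
# Hypothetical stand-in for the book text used elsewhere in this notebook
book_text = "Hello world"

# The last character sits at index length - 1
last_index = len(book_text) - 1
last_letter_in_book = book_text[last_index]

print(last_index)
print(last_letter_in_book)
```

Asking for `book_text[len(book_text)]` instead would raise an `IndexError`, because that index is one past the end of the string.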


#### `Slicing`

If we want to get **more than 1 character**, we can provide two indexes with a ``colon:`` in between to specify a range. The range starts at the first index and stops just before the second.

```
first_five_letters = book_text[0:5]
```


# Representing Documents as Numbers
## New concepts for week 2 


If we want to take a document, or a set of documents, and use machine learning techniques or other mathematical operations to uncover new information about them, we can't just use the text and characters. We need a new representation, and it needs to be numerical. This week we will be taking our first steps towards representing collections of documents as numbers.

- Tokenisation 
- Bags of Words 
- n-Grams
- TF/IDF

### Building a Vocabulary 

The first step to getting a new, better representation of a text document is splitting it up into its constituent parts. We call these **tokens**, and deciding _what a token is_ is an important choice.

### Basic Tokenisation - ``str.split()``
What is the simplest way to split a **String** into **tokens**?

Introducing the `str.split()` function. Here we take a long string (multiple words) and split it into separate words (or **tokens**) based on spaces. 

We have previously seen **Functions** that take a String as an argument and return a new value. In this case the concept is broadly similar, except that we **call the function on the string itself**.

What gets returned is a **List**, containing our split string. We store it in the variable ``tokens``, and print it out. 

In [None]:
# Change the sentence below

sentence = "I'm learning new things every day"
tokens = sentence.split()
print(tokens)

You probably won't understand all of the code in the following parts, but that's fine. You're great.

We've picked up some super important new concepts that we'll keep practising throughout the course.

Now, we'll learn some new concepts for analysing text.

### How good is our vocabulary?

We have made a vocabulary from our text document (or **tokenised** it, if you're fancy) by splitting every time we see a space.

Does this seem sensible? Does this capture everything that we would consider a separate word in the document?

What about `isn't`? Is this one token (`isn't`), or two tokens (`is` and `not`)? Or `taxi cab`? Is that two tokens, and do we care that `taxi` and `cab` are both used? Do we need this as a separate concept from `Uber`, `limo`, `Hackney Carriage` or `car`? If we take it as one, do we miss out on other combinations like `black cab` or `taxi driver`? What about punctuation?

Just using `str.split()` works reasonably well on the sentence below. However, there are some issues with punctuation: ideally, the brackets and the exclamation mark would also be separate tokens.

After we have split our sentence into tokens, we can then create a **vocabulary**, which contains every unique token in the sentence.

### Import the numpy library, name it np for shorthand so that we can refer to it in the code as 'np'

This allows us to get unique words in our vocabulary 

In [4]:
import numpy as np

#String split
sentence = "I like to think (it has to be!) of a cybernetic ecology where we are free of our labors"
tokenised = sentence.split()
print("Tokenised sentence")
print(tokenised)

#Get the unique tokens (removes duplicates)
vocab = np.unique(tokenised)
print("Vocabulary")
print(vocab)

Tokenised sentence
['I', 'like', 'to', 'think', '(it', 'has', 'to', 'be!)', 'of', 'a', 'cybernetic', 'ecology', 'where', 'we', 'are', 'free', 'of', 'our', 'labors']
Vocabulary
['(it' 'I' 'a' 'are' 'be!)' 'cybernetic' 'ecology' 'free' 'has' 'labors'
 'like' 'of' 'our' 'think' 'to' 'we' 'where']
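`np.unique` returns the unique tokens in sorted order. As an aside, roughly the same vocabulary can be built with Python's built-in `set` and `sorted`, without numpy. This is a sketch of an alternative, not the approach used in the rest of the notebook:

```python
sentence = "I like to think (it has to be!) of a cybernetic ecology where we are free of our labors"
tokenised = sentence.split()

# A set keeps one copy of each token; sorted() turns it back into an ordered list
vocab = sorted(set(tokenised))

print(len(tokenised), "tokens")
print(len(vocab), "unique tokens")
print(vocab)
```

The difference is the type of the result: this gives a plain Python list, whereas `np.unique` gives a numpy array.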


### Removing punctuation duplicates
We can see lots of tokens have trailing punctuation, and a good few of these will also exist without the punctuation. This duplication is bad for us!

We want these to be the same token, so we can use a **regex** (regular expression) to split on punctuation as well as spaces. The regex below splits on any run of one or more (the plus notation `+`) whitespace characters (represented by `\s`) or punctuation characters. We immediately see the size of the vocabulary drops by about 25%, showing there was loads of duplication in our bag of words.

In [None]:
# Use a regex to split based on space AND punctuation
# `book` is assumed to hold the text of the book as a string, loaded earlier
import re

tokens = re.split(r'[-\s.,;!?]+', book)
vocab = np.unique(tokens)
print("unique words", vocab.shape)
#Counter(tokens).most_common(50)
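To see the effect on a small scale, here is a sketch that applies both approaches to a made-up sentence (the sentence is an assumption for illustration, and the regex is the same one used above). Empty strings left over by the split are dropped:

```python
import re

# Hypothetical example sentence with repeated words next to punctuation
sentence = "We like cake. Do you like cake? Yes, we like cake!"

# Split on whitespace only: "cake.", "cake?" and "cake!" are all different tokens
space_tokens = sentence.split()

# Split on runs of whitespace or punctuation
regex_tokens = re.split(r'[-\s.,;!?]+', sentence)
# re.split can leave an empty string at the edges, so filter those out
regex_tokens = [t for t in regex_tokens if t]

print(len(set(space_tokens)), sorted(set(space_tokens)))
print(len(set(regex_tokens)), sorted(set(regex_tokens)))
```

The three punctuated variants of `cake` collapse into one token, so the vocabulary shrinks. Note that `We` and `we` are still counted separately, because the regex does nothing about capitalisation.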

## import 

We can 'import' existing libraries - or packages of code - that are widely available.

Some that will be useful are:

- nltk is designed especially for working with textual data, and has a lot of built-in functions to perform key tasks. (Check out the NLTK book: https://www.nltk.org/book/ch01.html)
- spaCy is another free and open-source Natural Language Processing (NLP) package. If you're interested in learning how to work with spaCy more broadly for a variety of NLP tasks, I recommend the tutorial Natural Language Processing with spaCy in Python: https://realpython.com/natural-language-processing-spacy-python/.
- gensim is dedicated to topic modeling, and has some really useful tutorials and materials to read through: https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here
- pandas, a data analysis library
- numpy, a mathematical functions library
    

# Natural Language Tool Kit (NLTK)

NLTK is one of many libraries (packages of code that are available to import). It is designed especially for working with textual data, and has built-in functions for working with text.

Check out the NLTK book:
https://www.nltk.org/book/ch01.html

NB. Importing libraries

To use a library you need to tell the program that you need it. ...

Sometimes you will need to install the library if it is not already on your computer.

The best way to install libraries is using pip - this will work if you are working with either colab or anaconda...

In [8]:
# remove the comment hashtags to tell the code to run

# pip install nltk
import nltk
#nltk.download()

NLTK has built-in functions to perform many of the tasks you need.

`sent_tokenize()` - splits your text into sentences. 

In [11]:
## Tokenizing

from nltk.tokenize import sent_tokenize, word_tokenize


EXAMPLE_TEXT = "NLTK is one of many libraries. Or packages of code that are available. Don't you think the Natural Language Toolkit is especially useful for working with textual and linguistic data"


sent_tokenize(EXAMPLE_TEXT)


['NLTK is one of many libraries.',
 'Or packages of code that are available.',
 "Don't you think the Natural Language Toolkit is especially useful for working with textual and linguistic data"]

In [12]:
print(word_tokenize(EXAMPLE_TEXT))

['NLTK', 'is', 'one', 'of', 'many', 'libraries', '.', 'Or', 'packages', 'of', 'code', 'that', 'are', 'available', '.', 'Do', "n't", 'you', 'think', 'the', 'Natural', 'Language', 'Toolkit', 'is', 'especially', 'useful', 'for', 'working', 'with', 'textual', 'and', 'linguistic', 'data']


In [None]:
#Add your own example sentence to see how it handles contractions such as don't / isn't. 

EXAMPLE_TEXT = " "

## Python Concepts - Lists


### Lists 

Previously, we'd used **named variables** to store individual bits of information such as text and numbers. 

```
my_diary_entry = "Today I mainly SMASHED IT"
hours_spent_smashing_it = 1000
```

But if we look at the ``split()`` function we have just used, it returns **6 different values**, and it would be a lot of effort to have a named variable for each of them:

```
token_1 = "I'm"
token_2 = "learning"
token_3 = "new"
token_4 = "things"
token_5 = "every"
token_6 = "day"
```

And what happens when we have 1000 tokens? Or 30,000 (this is the size of the average English speaker's vocabulary)?

Instead, we can store collections of values in a **single object** known as a ``List``. You may also hear the terms ``array`` and ``vector``, and whilst they do mean specific things in specific circumstances, for our purposes these are all broadly interchangeable.



In [14]:
## Lists

sentences = []
sentences = sent_tokenize(EXAMPLE_TEXT)
words = word_tokenize(EXAMPLE_TEXT)
print(sentences[1])
print(words[8])

Or packages of code that are available.
packages
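The indexing and slicing we used on strings work the same way on lists. A small sketch, using the sentence from the earlier `split()` example so that it runs without NLTK:

```python
tokens = "I'm learning new things every day".split()

print(len(tokens))              # how many items the list holds
print(tokens[0])                # first item (we start counting at 0)
print(tokens[len(tokens) - 1])  # last item, at index length - 1
print(tokens[1:3])              # a slice returns a shorter list

# Unlike strings, lists can grow after they are created
tokens.append("!")
print(tokens)
```

One list replaces the six separate `token_1` to `token_6` variables, and it keeps working whether there are 6 tokens or 30,000.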


# Pre-processing steps

Tokenizing is the first step we take to prepare our data for conversion to a numerical representation. There are other steps we might take too...

## Removing stop words

Before we start counting words, we might want to consider which words we are interested in counting.

Some words are frequent but don't carry much meaning in and of themselves, for example common words in English such as "a", "the", "it". These are known as **stopwords**.

NLTK has a pre-defined set of stopwords.

In [23]:
from nltk.corpus import stopwords

In [31]:
set(stopwords.words('english'))



{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [22]:
# Add some example text to work with. Take a paragraph from a webpage or make one up!
# EXAMPLE_TEXT = " "

In [41]:
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(EXAMPLE_TEXT)

Now we can filter our text to remove stopwords.

In [46]:
# Keep only the tokens that are not in the stopword set
filtered_text = [w for w in word_tokens if w not in stop_words]

In [47]:
print(word_tokens)
print(filtered_text)

['NLTK', 'is', 'one', 'of', 'many', 'libraries', '.', 'Or', 'packages', 'of', 'code', 'that', 'are', 'available', '.', 'Do', "n't", 'you', 'think', 'the', 'Natural', 'Language', 'Toolkit', 'is', 'especially', 'useful', 'for', 'working', 'with', 'textual', 'and', 'linguistic', 'data']
['NLTK', 'one', 'many', 'libraries', '.', 'Or', 'packages', 'code', 'available', '.', 'Do', "n't", 'think', 'Natural', 'Language', 'Toolkit', 'especially', 'useful', 'working', 'textual', 'linguistic', 'data']
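Notice that `Or` and `Do` survive the filter above: the NLTK stopword list is lower case, so capitalised tokens do not match it. One possible fix is to lower-case each token for the comparison only. The sketch below uses a small hand-written stopword set (an assumption, standing in for `stopwords.words('english')`) so that it runs without NLTK:

```python
# Hypothetical, hand-picked subset standing in for NLTK's English stopwords
stop_words = {"is", "of", "or", "that", "are", "do", "the"}

word_tokens = ["NLTK", "is", "one", "of", "many", "libraries", ".",
               "Or", "packages", "of", "code"]

# Case-sensitive filter: "Or" is kept because only "or" is in the set
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Lower-case each token for the comparison only; kept tokens are unchanged
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether this is desirable depends on the task, since capitalisation can itself carry meaning.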


### Creating a Frequency Distribution

We can count the frequency of each unique word in the text to create a frequency distribution:


In [48]:
# Counter is a class in the collections module that helps with counting
from collections import Counter

# Count the frequency of words
word_freq = Counter(filtered_text)
word_freq

Counter({'NLTK': 1,
         'one': 1,
         'many': 1,
         'libraries': 1,
         '.': 2,
         'Or': 1,
         'packages': 1,
         'code': 1,
         'available': 1,
         'Do': 1,
         "n't": 1,
         'think': 1,
         'Natural': 1,
         'Language': 1,
         'Toolkit': 1,
         'especially': 1,
         'useful': 1,
         'working': 1,
         'textual': 1,
         'linguistic': 1,
         'data': 1})

In [50]:
#get the ten most common words

common_words = word_freq.most_common(10)
common_words

[('.', 2),
 ('NLTK', 1),
 ('one', 1),
 ('many', 1),
 ('libraries', 1),
 ('Or', 1),
 ('packages', 1),
 ('code', 1),
 ('available', 1),
 ('Do', 1)]

In [52]:
# Display the plot inline in the notebook with interactive controls
# Comment out this line if you are running the notebook in Deepnote
%matplotlib notebook

# Import the matplotlib plot function
import matplotlib.pyplot as plt

# Get a list of the most common words
words = [word for word,_ in common_words]

# Get a list of the frequency counts for these words
freqs = [count for _,count in common_words]

# Set titles, labels, ticks and gridlines
plt.title("Top 10 Words in my text")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(range(len(words)), [str(s) for s in words], rotation=90)
plt.grid(visible=True, which='major', color='#333333', linestyle='--', alpha=0.2)

# Plot the frequency counts
plt.plot(freqs)

# Show the plot
plt.show()

[Figure: line plot titled "Top 10 Words in my text", with "Word" on the x-axis and "Count" on the y-axis]

# Other Pre-processing steps


The goal of both stemming and lemmatization is to collapse related forms of a word to a common base form. 

For instance:

am, are, is $\Rightarrow$ be

car, cars, car's, cars' $\Rightarrow$ car 

## Stemming

Stemming attempts to remove suffixes from words that share the same base. This reduces variation and can help when we reduce the documents into a more distilled form (like a bag of words).

- hacking, hackers, hacked, hacks
- computer, computing, computers

Depending on our task, it's probably the case that we only really care about knowing if **any** of these words appear, not whether they each appear individually. For example, I might be doing a search for paragraphs about hacking, and I would miss key documents if I only searched for one of the words.

Stemming can be quite a challenging task, however. If we want to combine pluralisations, for example, we can't just remove the "s" from the end of all nouns. What about:

- grass (not a plural)
- mice, octopi (plural, no s)
- geniuses (plural, es)
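To make the problem concrete, here is a deliberately naive sketch that strips a trailing "s". The function name `naive_singular` is made up for this illustration and is not part of NLTK:

```python
def naive_singular(word):
    """Strip a trailing 's'. Deliberately too simple, to show where it breaks."""
    if word.endswith("s"):
        return word[:-1]
    return word

for word in ["hacks", "computers", "grass", "mice", "geniuses"]:
    print(word, "->", naive_singular(word))
```

It handles `hacks` and `computers`, but turns `grass` into `gras`, leaves `mice` untouched, and turns `geniuses` into `geniuse`. Real stemmers use a longer list of rules to reduce these mistakes.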

We're going to use the built-in stemmer in the NLTK library. This reduces our vocabulary in the hacking book dramatically!


In [59]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

In [60]:
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

In [61]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [65]:
new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

In [66]:
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

it
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


In [67]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
word_list = ['feet', 'foot', 'foots', 'footing']
for word in word_list:
    print(word, "->", stemmer.stem(word))
#Doesn't always work
word_list = ['organise','organises','organised','organisation',"organs","organ","organic"]
for word in word_list:
    print(word, "->", stemmer.stem(word))

feet -> feet
foot -> foot
foots -> foot
footing -> foot
organise -> organis
organises -> organis
organised -> organis
organisation -> organis
organs -> organ
organ -> organ
organic -> organ


## Lemmatisation
Lemmatisation is a technique similar to stemming, except that it attempts to group words by similar meaning, as opposed to just similar roots. As with all these _normalisation_ techniques, reducing your vocabulary will reduce precision but may make your model better at generalising and more efficient.

For example, lemmatisation would be able to separate **dogged** and **dog**, which have quite different meanings but would get combined by a stemmer. 

Below we use the WordNetLemmatizer from the NLTK library. 


In [68]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


In [None]:
# Compare outputs of the stemmer and the lemmatizer

In [69]:
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer() 
lemmatizer = WordNetLemmatizer() 
print(stemmer.stem('stones')) 
print(stemmer.stem('speaking')) 
print(stemmer.stem('bedroom')) 
print(stemmer.stem('jokes')) 
print(stemmer.stem('lisa')) 
print(stemmer.stem('purple')) 
print('----------------------') 
print(lemmatizer.lemmatize('stones')) 
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))

stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple


### WordNet

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.


In [6]:
from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

['large', 'big', 'big']


In [7]:
# WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())





a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']


## Capitalisation
It may be tempting to just lower-case every token, in the belief that words have the same meaning regardless of case. However, if something is in ALL CAPS it may convey some meaning, and a word being at the start of a sentence may also be important.

For example 

- John liked to help
- John screamed HELP HELP HELP

Or if one book contained lots of capitalised nouns (like cities and countries), it might tell you it was about Geography.

Some libraries account for this by lower-casing everything, then adding a token which indicates the start of capitalisation as well as one that signifies the end of capitalisation. This allows the best of both worlds. Of course, this only works if your model or vocabulary is able to take sequence and context into account. Like for example....
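The marker-token idea described above can be sketched roughly as follows. The marker names `<caps>` and `</caps>` and the function `mark_caps` are invented for this illustration and do not come from any particular library. The sketch only handles ALL-CAPS words, not words with a capital first letter:

```python
def mark_caps(tokens, start="<caps>", end="</caps>"):
    """Lower-case tokens, wrapping runs of ALL-CAPS tokens in marker tokens."""
    out = []
    in_caps = False
    for tok in tokens:
        # Ignore single letters such as "I", which are upper case anyway
        is_caps = tok.isupper() and len(tok) > 1
        if is_caps and not in_caps:
            out.append(start)   # a run of capitals begins here
        if not is_caps and in_caps:
            out.append(end)     # the run of capitals has finished
        in_caps = is_caps
        out.append(tok.lower())
    if in_caps:
        out.append(end)         # close a run that reaches the end of the text
    return out

print(mark_caps("John liked to help".split()))
print(mark_caps("John screamed HELP HELP HELP".split()))
```

Both sentences now share the single vocabulary entry `help`, while the markers preserve the fact that it was shouted in the second one.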

# Go Further

If you would like something more challenging, check out these notebooks, which include topic modelling on a Shakespeare corpus:

https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb
