# Week 2.2 code for textual analysis

## Install
If it's the first time you are using a library you may also need to install it. 

You can do this using pip in the following way. 

`pip install package_name`


If you are working in Colab you need to add an exclamation mark in front. For example, to install numpy:

`!pip install numpy`

You can also search for and add libraries or 'packages' within the Anaconda GUI (graphical user interface) console.



## Import

Once the libraries are installed on our computer, we need to tell our code that we want them available to use. We do this using `import`.

We can 'import' existing libraries - or packages of code - that are widely available.

Some that will be useful are:

- nltk is designed especially for working with textual data, and has a lot of built-in functions to perform key tasks.
(Check out the NLTK book: https://www.nltk.org/book/ch01.html)
- spaCy is another free and open-source Natural Language Processing (NLP) package. If you're interested in learning how to work with spaCy more broadly for a variety of NLP tasks, I recommend the tutorial Natural Language Processing with spaCy in Python: https://realpython.com/natural-language-processing-spacy-python/.
- gensim is dedicated to topic modelling, and has some really useful tutorials and materials to read through: https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here
- pandas, a data analysis library
- numpy, a mathematical functions library
    

In [None]:
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install nltk



### numpy as np

Import the numpy library and give it the shorthand name `np`, so that we can refer to it in the code as `np`.

We will use it later to get the unique words in our vocabulary.

In [None]:
import numpy as np
import pandas as pd
import re, os
import csv
import nltk
from collections import Counter

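# Download the tokenizer models and the stopword lists (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource when tokenizing
nltk.download('stopwords')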

# Jupyter notebook / Anaconda users
Save the files that you want to use in the same folder as the notebook you are running, or include a full file path: for example, rather than `f = open('frankenstein.txt')`, something like `f = open('Location/MyFolder/frankenstein.txt')`.


# Colab users

You will need to add any files you wish to use by clicking on the file icon on the left of the screen and uploading the txt file there. Click on the file icon with an arrow on it, then select the file and it will be added to the file list.



In [None]:
# Load your text
# The `with` block closes the file for us once it has been read
with open('frankenstein.txt', encoding='utf-8') as f:
    text = f.read()

# if you would rather copy and paste text like we did in the tutorial, uncomment the code below and add your own text

# text = " add your text here "
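
If the file cannot be found, Python stops with a `FileNotFoundError`. The sketch below shows one way to check a path before reading it. The file name `sample.txt`, its contents and the temporary folder are made up for illustration, so the example runs without needing your own file.

```python
from pathlib import Path
import tempfile

# Create a throwaway folder and a small sample file (hypothetical content)
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"
    path.write_text("It was on a dreary night of November.", encoding="utf-8")

    # Check that the file is where we think it is before opening it
    if path.exists():
        with open(path, encoding="utf-8") as f:
            sample_text = f.read()
    else:
        sample_text = ""
        print("Could not find", path)

print(sample_text)
```

With your own text you would replace `path` with the location of your file, for example `Path('frankenstein.txt')`.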

In [None]:
# we will make all words lowercase so that *that* and *That* 
# will be counted together when we come to count words!


text = text.lower()
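
To see why lowercasing matters, here is a small sketch on a made-up sentence. It counts the words before and after lowercasing, using `str.split()` as a stand-in for a proper tokenizer.

```python
from collections import Counter

# A made-up sentence in which "the" appears with and without a capital letter
sample = "The monster saw the fire and The fire saw the monster"

# Counting is case-sensitive, so "The" and "the" are separate entries
counts_original = Counter(sample.split())

# After lowercasing, both forms are counted together
counts_lower = Counter(sample.lower().split())

print(counts_original["The"], counts_original["the"])
print(counts_lower["the"])
```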


### Tokenize

NLTK has built-in functions to perform many of the tasks you need.

`sent_tokenize()` - splits your text into sentences.

`word_tokenize()` - splits your text into individual words and punctuation marks.

In [None]:
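## Tokenizing

from nltk.tokenize import sent_tokenize, word_tokenize

## tokenize the data and store the tokens in lists

sentences = sent_tokenize(text)   # a list of sentences
tokens = word_tokenize(text)      # a list of words and punctuation marks

# print the second sentence and the fourth token (indexing starts at 0)
print(sentences[1])

print(tokens[3])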
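
To get a feel for what these two functions return, here is a rough sketch using only Python's built-in `re` module. It is an approximation for illustration, not NLTK's actual method (which copes far better with abbreviations such as "Dr."), and the sample text and variable names are made up.

```python
import re

sample = "It was a dreary night. I saw the creature! Did it see me?"

# Crude sentence splitter: split on whitespace that follows ., ! or ?
rough_sentences = re.split(r'(?<=[.!?])\s+', sample)

# Crude word splitter: a token is either a run of word characters
# or a single character that is neither a word character nor a space
rough_tokens = re.findall(r"\w+|[^\w\s]", rough_sentences[0])

print(rough_sentences)
print(rough_tokens)
```

Notice that the full stop comes back as a token of its own, which is why punctuation shows up when we count tokens.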

### Unique tokens and basic counts


In [None]:
#Get the unique tokens (our vocabulary)
vocab = np.unique(tokens)
print("total words:",len(tokens), "unique words:", len(vocab))
#Create a Bag of Words using a Counter
Counter(tokens).most_common(50)
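
The difference between the total number of tokens and the size of the vocabulary is easier to see on a tiny example. The sketch below uses a made-up token list and plain Python; `sorted(set(...))` plays the role of `np.unique` here, returning each distinct token once in sorted order.

```python
from collections import Counter

# A made-up list of tokens, including a punctuation mark
sample_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# The vocabulary is the set of distinct tokens
sample_vocab = sorted(set(sample_tokens))

# A Counter maps each token to the number of times it appears
sample_counts = Counter(sample_tokens)

print("total words:", len(sample_tokens), "unique words:", len(sample_vocab))
print(sample_counts.most_common(1))
```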

This has counted all of the punctuation marks as tokens, and we might not want them included...

### Removing punctuation 
There are lots of different ways to remove punctuation, for example...

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

test_sample = "This is a sentence - it has lots, that's right, lots of punctuation!!!!"
test_sample_no_punct = tokenizer.tokenize(test_sample)

print(test_sample)
print(test_sample_no_punct)
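
The pattern `\w+` matches runs of letters, digits and underscores, so everything else acts as a separator. As a stand-in that does not need NLTK, the sketch below applies the same pattern to the same sentence with Python's built-in `re.findall`, which should pick out the same pieces.

```python
import re

test_sample = "This is a sentence - it has lots, that's right, lots of punctuation!!!!"

# Find every run of word characters in the sentence
matches = re.findall(r'\w+', test_sample)

print(matches)
print(len(matches))
```

One thing to watch for: the apostrophe in *that's* is also treated as punctuation, so the word is split into `that` and `s`.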


In [None]:
# We can also do it by writing our own regular expression, or regex.
# A regular expression is a sequence of characters that specifies a search pattern in text. 
# Read more about regex here: https://www.w3schools.com/python/python_regex.asp 

# So instead let's use a regex to split on whitespace AND punctuation,
# dropping any empty strings left over at the start or end of the text
tokens = [t for t in re.split(r'[-\s.,;!?]+', text) if t]
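
Here is the same pattern applied to a short made-up sentence, so you can see exactly what it does. The square brackets list the characters to split on (hyphen, whitespace, and the marks `. , ; ! ?`), and the `+` treats a run of them as a single separator.

```python
import re

sample = "wait - is it alive? yes, it's alive!"

# Split wherever one or more of the listed characters appear
pieces = re.split(r'[-\s.,;!?]+', sample)

# The text ends with a separator, so the last piece is an empty string
print(pieces)

# Keep only the non-empty pieces
clean_pieces = [p for p in pieces if p]
print(clean_pieces)
```

Unlike the `\w+` pattern, this one leaves the apostrophe alone, so *it's* stays as one token. Any punctuation not listed in the brackets, such as quotation marks or colons, will also stay attached to the words.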


In [None]:
# This time when we run this piece of code most of the punctuation should have gone

# Get the unique tokens (our vocabulary)
vocab = np.unique(tokens)
print("total words:", len(tokens), "unique words:", len(vocab))
# Create a Bag of Words using a Counter
Counter(tokens).most_common(50)

Although the punctuation is gone, there are still lots of words which aren't the most informative, like 'are', 'at', etc.

Can you spot any duplicates, such as *the* and *The*? Because we made the text lowercase before tokenizing, these are counted together.

If you skipped that step, you would see both forms counted separately, so remember to lowercase your text first.

## Removing stop words

Before we start counting words, we might want to consider which words we are interested in counting.

Some words are frequent but don't carry much meaning in and of themselves, for example common words in English such as "a", "the", "it".

NLTK has a pre-defined set of stopwords.

In [None]:

from nltk.corpus import stopwords

In [None]:
set(stopwords.words('english'))

In [None]:

# Store the stopwords in a set. We call it stop_words so that we don't
# overwrite the stopwords module we imported above.
stop_words = set(stopwords.words('english'))

# If you want to add additional stopwords you would do it like this...
# newStopWords = ['pick', 'some', 'words', 'to', 'add']
# stop_words.update(newStopWords)
# print(stop_words)



Now we can filter our text to remove stopwords

In [None]:
# Keep only the tokens that are not in the stopword set
filtered_tokens = [w for w in tokens if w not in stop_words]
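
The sketch below shows the same filtering step on a handful of tokens. The stopword set here is a tiny made-up one rather than NLTK's list, so the example runs on its own. Note that a set is extended with `update()`, whereas a list would use `extend()`.

```python
# A tiny, made-up stopword set (NLTK's English list is much longer)
my_stop_words = {"a", "the", "it", "was", "of"}

# Add some extra stopwords of our own
my_stop_words.update(["upon"])

sample_tokens = ["it", "was", "upon", "a", "dreary", "night", "of", "november"]

# Keep only the tokens that are not stopwords
sample_filtered = [w for w in sample_tokens if w not in my_stop_words]

print(sample_filtered)
```

Checking whether a word is in a set is fast, which helps when filtering a whole novel.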

We can compare the effect this has on the text by running the next two blocks of code. (If you are using a text other than Frankenstein, the slices selected won't make sense, so you can swap the numbers to `[0:100]` in both boxes for a more straightforward comparison.)

In [None]:
print(tokens[280:400])

In [None]:
print(filtered_tokens[194:300])

Now that stop words and punctuation have been removed, let's plot our frequency distribution.

### Creating a Frequency Distribution

We can count the frequency of each unique word in the text to create a frequency distribution:


In [None]:
# Counter is a class from the collections module that helps with counting
from collections import Counter

# Count the frequency of words
word_freq = Counter(filtered_tokens)
word_freq

In [None]:
#get the ten most common words
common_words = word_freq.most_common(10)
common_words
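
`most_common()` returns a list of `(word, count)` pairs, ordered from most to least frequent. The sketch below uses a made-up token list to show the shape of that result and one way to separate the words from their counts, which is handy for plotting.

```python
from collections import Counter

# A made-up list of tokens
sample_freq = Counter(["night", "dreary", "night", "rain", "night", "rain"])

# The two most frequent words, as (word, count) pairs
top_two = sample_freq.most_common(2)
print(top_two)

# Unzip the pairs into one tuple of words and one tuple of counts
top_words, top_counts = zip(*top_two)
print(top_words)
print(top_counts)
```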

In [None]:
# Display the plot inline in the notebook with interactive controls
# Comment out this line if you are running the notebook in Deepnote or Colab

%matplotlib notebook

# Uncomment the following line if you are in Colab
# %matplotlib inline

# Import the matplotlib plot function
import matplotlib.pyplot as plt

# Get a list of the most common words
words = [word for word,_ in common_words]

# Get a list of the frequency counts for these words
freqs = [count for _,count in common_words]

# Set titles, labels, ticks and gridlines
plt.title("Top 10 Words in my text")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(range(len(words)), [str(s) for s in words], rotation=90)
plt.grid(visible=True, which='major', color='#333333', linestyle='--', alpha=0.2)

# Plot the frequency counts
plt.plot(freqs)

# Show the plot
plt.show()

# Go Further

If you would like something more challenging, check out these notebooks, which include topic modelling on a Shakespeare corpus:

https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb
