# Exploring Text Data  

The NLTK documentation can be found at https://www.nltk.org  

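You will need to import NLTK and download the tokeniser models and the stop-word lists, as shown below.  
`>>> import nltk`    
`>>> nltk.download('punkt')`    
`>>> nltk.download('stopwords')`  
Depending on your NLTK version, the tokeniser may also ask for the `punkt_tab` resource; if so, download it in the same way.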

## Italian recipes data

Data set of Italian recipes from https://www.gutenberg.org/ebooks/24407 (public domain).   
The plain-text edition has been split into multiple files, one recipe per file. There are 220 recipes.  


## Load the data

First, load all the data into the `documents` dictionary, keyed by recipe number (the file name without its extension).

Then merge the documents into one big string, `corpus_all_in_one`, for convenience.

In [None]:
import os

data_folder = os.path.join('data', 'recipes')
all_recipe_files = [os.path.join(data_folder, fname)
                    for fname in os.listdir(data_folder)]

documents = {}
for recipe_fname in all_recipe_files:
    bname = os.path.basename(recipe_fname)
    recipe_number = os.path.splitext(bname)[0]
    with open(recipe_fname, 'r') as f:
        documents[recipe_number] = f.read()

# Join all recipes into a single string, separated by spaces
corpus_all_in_one = ' '.join(documents.values())

print("Number of docs: {}".format(len(documents)))
print("Corpus size (char): {}".format(len(corpus_all_in_one)))
type(corpus_all_in_one)

## Tokenisation

Tokenisation is the process of splitting a raw string into a list of tokens

Tokens can be...

- Words
- Phrases
- Punctuation
- Numbers
- Dates
- Currencies
- Hashtags
- ...?
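
To make the list above concrete, here is a minimal sketch of a regular-expression tokeniser that keeps hashtags, currency amounts, numbers, words and punctuation as separate tokens. The pattern and the name `simple_tokenize` are assumptions made for this illustration; this is not how NLTK's `word_tokenize` works internally.

```python
import re

# Hypothetical pattern, for illustration only. Alternatives are tried in order:
# hashtags, dollar amounts, numbers, words (with an optional apostrophe part),
# and finally any single non-space character such as punctuation.
TOKEN_PATTERN = re.compile(
    r"#\w+|\$\d+(?:\.\d+)?|\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]"
)

def simple_tokenize(text):
    """Split a raw string into a list of token strings."""
    return TOKEN_PATTERN.findall(text)

example_tokens = simple_tokenize("Add 2 eggs, then bake: it costs $3.50! #pasta")
print(example_tokens)
```

What counts as a token is a design decision: a different pattern would split `$3.50` into three tokens, which may or may not be what the application needs.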

In [None]:
from nltk.tokenize import word_tokenize

try:  
    all_tokens = [t for t in word_tokenize(corpus_all_in_one)]
except UnicodeDecodeError:  
    all_tokens = [t for t in word_tokenize(corpus_all_in_one.decode('utf-8'))]

print("Total number of tokens: {}".format(len(all_tokens)))

## Counting Words

Start with a simple word count using `collections.Counter`. Two counts are of interest:

- total term frequency: how many times a word occurs across the whole corpus (total number of occurrences)
- document frequency: how many documents a word occurs in
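
The difference between the two counts is easiest to see on a toy example. The sketch below uses three invented, already tokenised documents (the names `toy_documents`, `toy_term_frequency` and `toy_document_frequency` are made up for this illustration).

```python
from collections import Counter

# Toy documents, invented for illustration (already tokenised).
toy_documents = {
    'doc1': ['salt', 'salt', 'oil'],
    'doc2': ['salt', 'flour'],
    'doc3': ['oil', 'oil', 'oil'],
}

toy_term_frequency = Counter()      # total occurrences across all documents
toy_document_frequency = Counter()  # number of documents containing the word

for tokens in toy_documents.values():
    toy_term_frequency.update(tokens)
    # set() removes repeats, so each word counts once per document
    toy_document_frequency.update(set(tokens))

for word in ['salt', 'oil', 'flour']:
    print(word, toy_term_frequency[word], toy_document_frequency[word])
```

Here `oil` occurs more often than `salt` overall, but both appear in the same number of documents.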

In [None]:
from collections import Counter

total_term_frequency = Counter(all_tokens)


for word, freq in total_term_frequency.most_common(20):
    print("{}\t{}".format(word, freq))
    
type(total_term_frequency)

In [None]:
# Maps each word to the number of documents that contain it
document_frequency = Counter()

for recipe_number, content in documents.items():
    tokens = word_tokenize(content)
    unique_tokens = set(tokens)
    document_frequency.update(unique_tokens)

for word, freq in document_frequency.most_common(20):
    print("{}\t{}".format(word, freq))
    
type(unique_tokens)


## Stop-words

Some of the most common words above are not very interesting.

These words are called **stop-words**, and they don't provide any particular meaning in isolation (articles, conjunctions, pronouns, etc.)

Note that:

- there is no "universal" list of stop-words
- removing stop-words can be useful or damaging depending on the application

e.g. if you remove stop-words, what do you do with "The Who", "to be or not to be" and similar phrases?
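
As a small sketch of the problem, the snippet below filters two phrases against a tiny stop-word set. The set `toy_stop_words` and the helper `remove_stop_words` are hypothetical and exist only for this illustration; they are not the NLTK list.

```python
# Hypothetical, tiny stop-word set used only for this illustration.
toy_stop_words = {'the', 'to', 'be', 'or', 'not', 'who'}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

phrase_tokens = 'to be or not to be'.split()
band_tokens = 'The Who played live'.split()

phrase_filtered = remove_stop_words(phrase_tokens, toy_stop_words)
band_filtered = remove_stop_words(band_tokens, toy_stop_words)

print(phrase_filtered)  # nothing is left of the quotation
print(band_filtered)    # the band name has disappeared
```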

In [None]:
from nltk.corpus import stopwords
import string

print(stopwords.words('english'))
print(len(stopwords.words('english')))
print(string.punctuation)

In [None]:
# A set makes the membership test below fast
stop_list = set(stopwords.words('english') + list(string.punctuation))

tokens_no_stop = [token for token in all_tokens
                        if token not in stop_list]

total_term_frequency_no_stop = Counter(tokens_no_stop)

for word, freq in total_term_frequency_no_stop.most_common(20):
    print("{}\t{}".format(word, freq))

Notice **When** and **The** above (uppercase W and T): the NLTK stop-word list is all lowercase, so capitalised forms are not filtered out.

Different variations of the same word are counted as different words (they are, after all, different strings).

In [None]:
print(total_term_frequency_no_stop['olive'])
print(total_term_frequency_no_stop['olives'])
print(total_term_frequency_no_stop['Olive'])
print(total_term_frequency_no_stop['Olives'])
print(total_term_frequency_no_stop['OLIVE'])
print(total_term_frequency_no_stop['OLIVES'])

## Text Normalisation

Normalise the text to group together different spellings or variations of the same word.

Examples:
    
- lowercasing
- stemming
- American-to-British mapping
- synonym mapping

**Stemming** is the process of reducing a word to its base/root form, called the stem.
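
To give a rough feel for the idea, here is a deliberately naive suffix-stripping sketch. The function `toy_stem` is hypothetical and is NOT the Porter algorithm, which applies a much richer set of rules; it only shows how different surface forms can collapse onto one stem.

```python
def toy_stem(word):
    """Naive suffix stripping; a stand-in for illustration, NOT the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

toy_stems = {w: toy_stem(w)
             for w in ['olive', 'olives', 'cooking', 'cooked', 'dishes']}

for word, stem in toy_stems.items():
    print("{}\t{}".format(word, stem))
```

Note that `dishes` is reduced to `dishe`, which is not an English word: stems are grouping keys, not dictionary entries.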

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
all_tokens_lower = [t.lower() for t in all_tokens]

tokens_normalised = [stemmer.stem(t) for t in all_tokens_lower
                                     if t not in stop_list]

total_term_frequency_normalised = Counter(tokens_normalised)

for word, freq in total_term_frequency_normalised.most_common(20):
    print("{}\t{}".format(word, freq))

Tips:
    
- a stem is not always a word
- be careful with one-way transformations (like lowercasing): the original form cannot be recovered afterwards
- wrap your preprocessing steps in a function / chain of functions for better design
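
One possible way to follow the last tip is sketched below: each step is a small function from token list to token list, and `preprocess` applies them in order. All names here (`lowercase`, `drop_stop_words`, `preprocess`) and the default stop-word set are assumptions made for this illustration.

```python
def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def drop_stop_words(tokens, stop_words=frozenset({'a', 'of', 'the'})):
    """Remove stop-words; the default set is a tiny hypothetical example."""
    return [t for t in tokens if t not in stop_words]

def preprocess(tokens, steps=(lowercase, drop_stop_words)):
    """Apply each step in order; every step maps a token list to a token list."""
    for step in steps:
        tokens = step(tokens)
    return tokens

cleaned = preprocess(['A', 'Pinch', 'of', 'Salt'])
print(cleaned)
```

Because the steps are passed in as a sequence, the order is explicit and easy to change, for example to add a stemming step at the end or to skip lowercasing for a case-sensitive task.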

## n-grams 

An n-gram is a sequence of n adjacent terms.

Commonly used n-grams include bigrams (n=2) and trigrams (n=3).
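
As a minimal sketch of the definition, n-grams can be built in plain Python by zipping a token list with shifted copies of itself. The helper `make_ngrams` is a hypothetical name for this illustration.

```python
def make_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order of appearance."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ['a', 'pinch', 'of', 'salt']
sample_bigrams = make_ngrams(sample_tokens, 2)
sample_trigrams = make_ngrams(sample_tokens, 3)

print(sample_bigrams)
print(sample_trigrams)
```

A list of k tokens yields k - n + 1 n-grams, or none if it is shorter than n.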

In [None]:
from nltk import ngrams

phrases = Counter(ngrams(all_tokens_lower, 2))
for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

In [None]:
phrases = Counter(ngrams(all_tokens_lower, 3))
for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

## n-grams and stop-words

Stop-word removal will affect n-grams, because tokens that were not adjacent in the original text become neighbours.

e.g. phrases like "a pinch of salt" become "pinch salt" after stop-word removal
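
The effect can be sketched on a single toy sentence. The stop-word set `toy_stop_words` and the token list are invented for this illustration.

```python
# Hypothetical stop-word set and sentence, for illustration only.
toy_stop_words = {'a', 'of'}
toy_tokens = ['add', 'a', 'pinch', 'of', 'salt']

# Bigrams on the original tokens
bigrams_before = list(zip(toy_tokens, toy_tokens[1:]))

# Bigrams after removing stop-words
toy_filtered = [t for t in toy_tokens if t not in toy_stop_words]
bigrams_after = list(zip(toy_filtered, toy_filtered[1:]))

print(bigrams_before)
print(bigrams_after)
```

None of the bigrams in `bigrams_after` appear in `bigrams_before`: they pair words that were separated by a stop-word in the original sentence.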

In [None]:
phrases = Counter(ngrams(tokens_no_stop, 2))

for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))

In [None]:
phrases = Counter(ngrams(tokens_no_stop, 3))

for phrase, freq in phrases.most_common(20):
    print("{}\t{}".format(phrase, freq))