# Loading and tokenising a single document
You can use the open() function to open a single file from the Medical History of British India corpus. It takes the path to a file in the downloaded dataset and the mode in which to open it ('r' for read). Your path may differ from the one below, depending on where you saved the data on your computer.

The read() method then reads the file. Its content (the text) is stored as a string in the variable raw_data.

You can then tokenise the text and convert it to lowercase. To check that this has worked, print a slice of the list lower_tokens.

In [3]:
import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize

In [4]:
# The with block closes the file automatically once it has been read
with open('nls-text-indiaPapers/74457530.txt', 'r') as file:
    raw_data = file.read()
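
The call above relies on your platform's default text encoding. The corpus reader used later in this notebook is told that the files are latin1, so if open() raises a UnicodeDecodeError on your machine, passing that encoding explicitly may help. The following is only a minimal sketch: it writes a temporary stand-in file (the file name and its contents are made up for illustration) so that it runs without the dataset.

```python
import os
import tempfile

# Made-up stand-in for one corpus file: the final byte (0xb0, a degree sign)
# is valid latin1 but is not valid UTF-8 on its own.
sample_bytes = b"No. 1111 (Sanitary), dated Ootacamund. Temp. 98\xb0"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
    with open(sample_path, "wb") as f:
        f.write(sample_bytes)

    # latin1 maps every possible byte to a character, so decoding cannot fail.
    with open(sample_path, "r", encoding="latin1") as f:
        sample_raw = f.read()

# ascii() escapes non-ASCII characters so the output prints on any terminal.
print(ascii(sample_raw))
```

For the real file you would keep the original path and only add encoding='latin1' to the open() call.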

### Using word_tokenize() to tokenise the document
#### To lowercase the tokens, we call lower() on each entry

In [7]:
tokens = word_tokenize(raw_data)
lower_tokens = [word.lower() for word in tokens]

In [8]:
lower_tokens[0:10]

['no', '.', '1111', '(', 'sanitary', ')', ',', 'dated', 'ootacamund', ',']

# Loading and tokenising a corpus
We can do the same for an entire collection of documents (a corpus). Here we choose a collection of raw text documents in a given directory. We will use the entire Medical History of British India collection as our dataset.

To read the text files in this collection we can use the PlaintextCorpusReader class provided in the corpus package of NLTK. You need to specify the collection directory, a regular expression that selects which files to read (e.g. .* for all files) and the text encoding of the files (in this case latin1). The words() method then tokenises the text automatically and returns a list-like sequence of words. It uses the reader's default tokeniser (WordPunctTokenizer) rather than word_tokenize(), so some punctuation is split differently: adjacent punctuation marks such as '),' stay together as one token. As before, we can then lowercase the words.

In [9]:
from nltk.corpus import PlaintextCorpusReader

In [10]:
corpus_root = 'nls-text-indiaPapers/'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1')

In [11]:
corpus_tokens = wordlists.words()

In [21]:
corpus_tokens[0:15]

['No',
 '.',
 '1111',
 '(',
 'Sanitary',
 '),',
 'dated',
 'Ootacamund',
 ',',
 'the',
 '6th',
 'October',
 '1876',
 '.',
 'From']
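
Here ')' and ',' come out as the single token '),' whereas word_tokenize() split them apart in the single-document example. PlaintextCorpusReader's default word tokeniser is NLTK's WordPunctTokenizer, which splits text with the regular expression \w+|[^\w\s]+, so a run of adjacent punctuation marks stays together. The sketch below applies that pattern using only Python's re module. The sample sentence is reconstructed from the output above for illustration and is not read from the file.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

# Sample text reconstructed from the tokens shown above (an assumption,
# not the exact file content).
sample_text = "No. 1111 (Sanitary), dated Ootacamund, the 6th October 1876. From"

sample_tokens = re.findall(WORD_PUNCT_PATTERN, sample_text)
print(sample_tokens)
```

If you need the corpus tokens to match the word_tokenize() output exactly, one option is to tokenise each file's raw text with word_tokenize() yourself instead of relying on words().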

In [13]:
lower_corpus_tokens = [str(word).lower() for word in corpus_tokens]

In [22]:
lower_corpus_tokens[0:15]

['no',
 '.',
 '1111',
 '(',
 'sanitary',
 '),',
 'dated',
 'ootacamund',
 ',',
 'the',
 '6th',
 'october',
 '1876',
 '.',
 'from']

In [23]:
# Number of lowercase tokens in the file 74457530.txt from the 'nls-text-indiaPapers/' directory
print(len(lower_tokens))

# Number of lowercase tokens in the entire corpus (all files in the 'nls-text-indiaPapers/' directory)
print(len(lower_corpus_tokens))

93568
28345943
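
To put the two counts in proportion, the single document's share of the corpus can be worked out from the numbers printed above. This is a small sketch with the counts hard-coded so that it runs on its own. In the notebook you would use len(lower_tokens) and len(lower_corpus_tokens) instead, and the variable names below are illustrative.

```python
# Token counts copied from the output above.
document_token_count = 93568
corpus_token_count = 28345943

# Share of all corpus tokens that come from the single document, in percent.
share_percent = 100 * document_token_count / corpus_token_count
print(f"{share_percent:.2f}% of the corpus tokens come from this document")
```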
