# 3.1 - Working with text

These examples show how to use Python to load text from a file and split it into paragraphs, words, and letters. In Python, text is represented by the `str` type, which behaves like an immutable sequence of characters. The standard library's `re` module provides regular expressions for searching and transforming strings, and the `Counter` class in the `collections` module lets us find the unique words and characters in a piece of text and count their occurrences.
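
As a quick illustration of how a string behaves like a sequence of characters, here is a minimal sketch. The sample value `"Alice"` is made up for the example and is not read from the data file.

```python
# Hypothetical sample string, used only for illustration
text = "Alice"

print(len(text))    # number of characters in the string
print(text[0])      # strings can be indexed like lists
print(list(text))   # converting to a list gives the individual characters

# Unlike lists, strings cannot be modified in place
try:
    text[0] = "a"
except TypeError as err:
    print("strings are immutable:", err)
```

Because strings are immutable, operations such as `re.sub` or `split` always return new objects rather than changing the original string.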

In [8]:
# First, import the 're' (regular expression) module, which lets us search, clean, and reformat strings
import re

In [9]:
filename = "data/wonderland.txt"

# Read the file as raw bytes, then decode the bytes as UTF-8 text
with open(filename, 'rb') as f:
    data = f.read()
    raw_text = data.decode('utf-8')

# Remove every character except letters, digits, spaces, newlines,
# and a few punctuation marks (, . : ; ? ! -)
raw_text = re.sub(r'[^\nA-Za-z0-9 ,.:;?!-]+', '', raw_text)

n_chars = len(raw_text)
print("length of text:", n_chars)

length of text: 141240
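
To see what the `re.sub` call above does, here is a small sketch that applies the same pattern to a made-up string (the sample text is an assumption for illustration, not a line from the file). The leading `^` inside the brackets negates the character class, so every run of characters that is not in the allowed set is replaced with the empty string.

```python
import re

# Hypothetical sample string, used only to illustrate the substitution
sample = 'Alice\'s "cat" (Dinah) said: *purr*!\nThe end.'

# Same pattern as in the cell above: remove every run of characters
# that is NOT a newline, letter, digit, space, or one of , . : ; ? ! -
cleaned = re.sub(r'[^\nA-Za-z0-9 ,.:;?!-]+', '', sample)
print(cleaned)
```

Note that apostrophes and quotation marks are not in the allowed set, so `Alice's` becomes `Alices`. If contractions and possessives matter for the analysis, the apostrophe could be added to the character class.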


In [10]:
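# Split on newlines and drop empty lines; each remaining line is treated as a "paragraph"
paragraphs = raw_text.split('\n')
paragraphs = [p for p in paragraphs if len(p) > 0]
print("number of paragraphs:", len(paragraphs))

# Keep only letters and spaces, then split each paragraph on single spaces.
# Note: consecutive spaces produce empty strings, which are counted as words here.
words = []
for p in paragraphs:
    words += re.sub('[^A-Za-z ]+', '', p).split(" ")
print("number of words:", len(words))

# Extending a list with a string appends its characters one at a time
letters = []
for w in words:
    letters += w
print("number of letters:", len(letters))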

number of paragraphs: 2480
number of words: 27514
number of letters: 107715
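
The same approach can be extended to sentences. The following is a rough sketch under simple assumptions: it treats `.`, `?`, or `!` followed by whitespace as a sentence boundary, and it uses a made-up sample string rather than the book text. A rule this naive will mis-split abbreviations such as "Mr." and does not handle quoted speech well.

```python
import re

# Hypothetical sample text, not taken from the data file
sample = "Alice was tired. Was it late? Oh dear! She sat down."

# Split on whitespace that follows '.', '?' or '!'.
# The lookbehind (?<=...) keeps the punctuation attached to the sentence.
sentences = [s for s in re.split(r'(?<=[.?!])\s+', sample) if s]

print("number of sentences:", len(sentences))
print(sentences)
```

If the source file is hard-wrapped (one newline per line rather than per paragraph), the lines would need to be joined with spaces before splitting, since a sentence can span several lines.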


In [11]:
# Next, import the 'collections' module; its Counter class counts how many times each distinct item occurs in a list
# https://docs.python.org/3/library/collections.html#counter-objects
import collections

wordSet = collections.Counter(words)
print("most common words:", wordSet.most_common(10))

uniqueWords = list(wordSet)
print("unique words:", len(uniqueWords))

letterSet = collections.Counter(letters)
print("most common letters:", letterSet.most_common(10))

most common words: [('the', 1515), ('', 1128), ('and', 774), ('to', 717), ('a', 610), ('she', 498), ('of', 494), ('it', 482), ('said', 456), ('I', 400)]
unique words: 3151
most common letters: [('e', 13388), ('t', 10217), ('a', 8153), ('o', 7969), ('h', 7091), ('n', 6896), ('i', 6782), ('s', 6283), ('r', 5298), ('d', 4739)]
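
The second most common "word" in the output above is the empty string `''`. This happens because `split(" ")` splits on every single space, so runs of consecutive spaces yield empty strings between them. The counts are also case-sensitive, so `The` and `the` are tallied separately. The sketch below shows one possible way to avoid both effects, using a made-up sample string: calling `split()` with no argument splits on any run of whitespace and never returns empty strings, and `lower()` folds the case before counting.

```python
import collections
import re

# Hypothetical sample with repeated spaces and mixed case
sample = "The cat  saw the   Cat"

# Splitting on a single space keeps empty strings between consecutive spaces
naive = re.sub('[^A-Za-z ]+', '', sample).split(" ")
print(naive)

# Lower-casing and splitting on any whitespace avoids both problems
tokens = re.sub('[^A-Za-z ]+', '', sample).lower().split()
print(tokens)

counts = collections.Counter(tokens)
print(counts.most_common(2))
```

Applying this to the full text would change the word totals reported above, so the numbers are only comparable when the same tokenization is used throughout.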
