# Fetching

To process text that is available online, we first need to fetch it with Python's urllib. We use the request module to download the text from classics.mit.edu.

The output confirms that the decoded text is a Python string (str). (Official documentation: https://docs.python.org/3.5/library/string.html)

NBViewer: http://nbviewer.jupyter.org/github/crimsonpresident/JuliusCaesarGallicWars/blob/master/JuliusCaesarGallicCampaign.ipynb

In [1]:
import nltk
from urllib import request
url = "http://classics.mit.edu/Caesar/gallic.mb.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)


str
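The fetch above can also be wrapped in a small helper. This is only a sketch under my own assumptions: the name fetch_text, the 30-second timeout, and the utf8 fallback are not part of the original notebook.

```python
from urllib import request


def fetch_text(url, timeout=30, fallback_encoding="utf8"):
    """Download url and return its body as a str (hypothetical helper)."""
    # The with-block closes the connection even if reading fails.
    with request.urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the response headers, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
        return response.read().decode(charset)
```

Calling fetch_text("http://classics.mit.edu/Caesar/gallic.mb.txt") should then give the same kind of str as raw above, provided the site is reachable.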

At this point, I tokenize the raw text with nltk.word_tokenize, which returns a list of 99,328 tokens.

In [2]:
tokens = nltk.word_tokenize(raw)
type(tokens)


list

In [3]:
len(tokens)

99328
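To get a feel for what tokenizing does, here is a rough standard-library approximation. It is not how nltk.word_tokenize works internally; the regular expression and the sample sentence are my own illustration, and the real tokenizer handles many cases (contractions, abbreviations) differently.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Provided by The Internet Classics Archive. See bottom."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
print(len(sample_tokens))
```

Note that each full stop becomes a token of its own, which is why punctuation is included in the token count.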

# Need to Normalize

With the token list in hand, we can inspect the first several tokens. In this case, the first eight are: Provided, by, The, Internet, Classics, Archive, . , and See. Two things stand out. First, the case is not uniform. Second, the full stop (period) has been tokenized, so punctuation is counted among the 99,328 tokens above.

In [4]:
tokens[:8]

['Provided', 'by', 'The', 'Internet', 'Classics', 'Archive', '.', 'See']

Normalizing by lowercasing and removing punctuation can be achieved with [w.lower() for w in tokens if w.isalpha()], as seen below. Because isalpha() keeps only purely alphabetic tokens, numbers and tokens containing hyphens or apostrophes are dropped as well. The output shows the normalized tokens, without the full stop that was seen above.

The first eight normalized tokens are: provided, by, the, internet, classics, archive, see, and bottom. After the removal of punctuation, 85,930 tokens remain. However, our cleaning is not yet done.

In [5]:
tokens = nltk.word_tokenize(raw)
tokens = [w.lower() for w in  tokens if w.isalpha()]
tokens[:8]

['provided', 'by', 'the', 'internet', 'classics', 'archive', 'see', 'bottom']

In [6]:
len(tokens)

85930
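A small sketch makes it easier to see exactly what the isalpha() filter keeps and drops. The sample list is made up for illustration: the first eight entries are the tokens shown earlier, and the last three are hypothetical tokens I added to show the side effects.

```python
sample_tokens = ["Provided", "by", "The", "Internet", "Classics", "Archive",
                 ".", "See", "well-known", "58", "B.C."]

# Keep purely alphabetic tokens and lowercase them.
normalized = [w.lower() for w in sample_tokens if w.isalpha()]

# Everything else is discarded: punctuation, numbers, hyphenated
# words, and abbreviations that contain a period.
dropped = [w for w in sample_tokens if not w.isalpha()]

print(normalized)
print(dropped)
```

So the filter is somewhat stricter than "remove punctuation": hyphenated words and numerals disappear too, which may or may not matter for a given analysis.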

# Frequency and Visualization

In these next steps, we will compute the frequency of each word and then visualize the most and least common tokens. When using NLTK's FreqDist class, I noticed that the most common words were stopwords (the, of, and, to, that, etc.).

In [7]:
fdist1 = nltk.FreqDist(tokens)
print(fdist1)

<FreqDist with 5584 samples and 85930 outcomes>


In [8]:
fdist1.N()

85930

In [19]:
fdist1.hapaxes()

['sue',
 'raises',
 'segontiaci',
 'forasmuch',
 'commends',
 'barely',
 'die',
 'neighbouring',
 'east',
 'encamp',
 'resulted',
 'solicited',
 'drag',
 'artifice',
 'protecting',
 'subjected',
 'muster',
 'straggling',
 'emigrate',
 'repelling',
 'reparation',
 'costly',
 'indignity',
 'curisolites',
 'contents',
 'bends',
 'huge',
 'concentrated',
 'woman',
 'unequal',
 'resigned',
 'delivers',
 'causes',
 'active',
 'conduce',
 'usage',
 'agger',
 'meritorious',
 'invincible',
 'verses',
 'dearer',
 'conjectured',
 'doom',
 'beforehand',
 'strengthen',
 'relates',
 'facility',
 'forfeit',
 'reserves',
 'courteous',
 'domains',
 'doctrines',
 'connects',
 'verbigene',
 'dressed',
 'telling',
 'blot',
 'intend',
 'triviri',
 'terminated',
 'apathy',
 'marcomanni',
 'tendency',
 'prison',
 'congratulate',
 'fenny',
 'rashly',
 'bull',
 'accomplishments',
 'caltes',
 'sinews',
 'intimated',
 'announcements',
 'disband',
 'wrecked',
 'tampers',
 'mismanagement',
 'reign',
 'sincere',
 ...

In [20]:
fdist1.max()

'the'
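FreqDist is a subclass of collections.Counter, so the numbers above can be reproduced with the standard library alone. The sketch below uses a made-up sentence rather than the real text; the variable names are mine.

```python
from collections import Counter

sample = "the camp of the enemy was near the camp of caesar".split()
counts = Counter(sample)

samples = len(counts)             # number of distinct words
outcomes = sum(counts.values())   # total number of tokens, like fdist1.N()
hapaxes = [w for w, c in counts.items() if c == 1]  # words seen exactly once
most_frequent = counts.most_common(1)[0][0]         # like fdist1.max()

print(samples, outcomes)
print(hapaxes)
print(most_frequent)
```

In the FreqDist summary, "samples" corresponds to distinct words and "outcomes" to total tokens.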

We can use NLTK's stopwords corpus to clean the list a bit further by removing stop words from our current set of tokens. This should remove words such as 'the', 'at', etc.

In [13]:
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords corpus is not overwritten
stop_words = set(stopwords.words('english'))
mynewtokens = [w for w in tokens if w not in stop_words]
Fdist2 = nltk.FreqDist(mynewtokens)
print(Fdist2)

<FreqDist with 5463 samples and 38612 outcomes>
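Comparing the two summaries, the filter removed 85,930 - 38,612 = 47,318 tokens but only 5,584 - 5,463 = 121 distinct words, so a small set of stop words accounts for more than half of the text. The sketch below shows the same filtering step on a toy example. The stop word set and the sentence are hypothetical stand-ins, not NLTK's actual English list.

```python
# Toy stand-in for NLTK's English stop word list (an assumption for illustration).
toy_stop_words = {"the", "of", "and", "to", "that", "at"}

sample = ["caesar", "led", "the", "army", "to", "the", "camp", "of", "the", "enemy"]

# A set gives fast membership tests, which matters for tens of thousands of tokens.
content_words = [w for w in sample if w not in toy_stop_words]
removed = len(sample) - len(content_words)

print(content_words)
print(removed)
```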


This brings our set down to 38,612 tokens. We can find the most common words below.

In [23]:
Fdist2.most_common(10)

[('caesar', 481),
 ('chapter', 402),
 ('enemy', 359),
 ('men', 314),
 ('great', 299),
 ('camp', 292),
 ('one', 248),
 ('could', 238),
 ('would', 235),
 ('war', 205)]

In [27]:
Fdist2.most_common(5)

[('caesar', 481),
 ('chapter', 402),
 ('enemy', 359),
 ('men', 314),
 ('great', 299)]

Now that we have the 5 most common words, we can use Pandas and Bokeh to visualize them with a bar chart. We can then do the same for 5 of the least common words.



In [11]:
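import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
output_notebook()
# bokeh.charts is no longer part of Bokeh, so draw the bars with bokeh.plotting
data = {'frequency': {'caesar': 481, 'chapter': 402, 'enemy': 359, 'men': 314, 'great': 299}}
df = pd.DataFrame(data)
df['word'] = df.index
# Categorical x-axis: one bar per word, bar height is the frequency
p = figure(x_range=df['word'].tolist())
p.vbar(x=df['word'].tolist(), top=df['frequency'].tolist(), width=0.8)
show(p)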
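Hand-copying counts into a dictionary is easy to get wrong, so it may be safer to build the chart data directly from the output of most_common. A minimal sketch, assuming the list of (word, count) pairs shown earlier; the names top5, words, counts, and rows are mine, and the text bars are just a quick stand-in for a real plot.

```python
# The (word, count) pairs as returned by Fdist2.most_common(5) above.
top5 = [("caesar", 481), ("chapter", 402), ("enemy", 359),
        ("men", 314), ("great", 299)]

words = [w for w, _ in top5]    # x-axis categories
counts = [c for _, c in top5]   # bar heights

# A list of dicts like this can be passed straight to pd.DataFrame.
rows = [{"word": w, "frequency": c} for w, c in top5]

# Quick text rendering: one '#' per 10 occurrences.
for w, c in top5:
    print(f"{w:<8} {'#' * (c // 10)} {c}")
```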

In [31]:
Fdist2.most_common()[-5:]

[('south', 1), ('measured', 1), ('fugitives', 1), ('deeper', 1), ('spend', 1)]
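One caveat: all five of these words occur exactly once, so they are not "the" five least common words, only five of the many words tied at a count of 1. In a Counter (and so in a FreqDist), most_common() keeps tied words in the order they were first seen, so the slice simply returns the last hapaxes encountered. A toy sketch of that behaviour, with a made-up word list:

```python
from collections import Counter

counts = Counter("a b b c d c e".split())

ranked = counts.most_common()   # sorted by count, ties keep first-seen order
last_two = ranked[-2:]          # "least common" slice, as in the cell above

# Every word with the minimum count is equally "least common".
lowest = min(counts.values())
tied = [w for w, c in counts.items() if c == lowest]

print(ranked)
print(last_two)
print(tied)
```

On the real data, len(Fdist2.hapaxes()) would tell us how many words share that lowest count.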

In [12]:
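import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
output_notebook()
# bokeh.charts is no longer part of Bokeh, so draw the bars with bokeh.plotting
data = {'frequency': {'south': 1, 'measured': 1, 'fugitives': 1, 'deeper': 1, 'spend': 1}}
df = pd.DataFrame(data)
df['word'] = df.index
# Categorical x-axis: one bar per word, bar height is the frequency
p = figure(x_range=df['word'].tolist())
p.vbar(x=df['word'].tolist(), top=df['frequency'].tolist(), width=0.8)
show(p)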