# Homework 1: Word Frequencies

## Challenge
Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology? The assignment is to compute word frequencies for different types of documents, and to develop patterns for document classification.

## Tasks
1. Write Python code to load different text documents and compute word frequencies. The most frequent words should be at the beginning of the list.
2. Identify a small set (about 5 to 10) of words that could represent a particular type of document.
3. Show how different types have different word lists ("signatures").
4. Discuss results and the feasibility of this method.

## Deliverable
Use this notebook to implement your assignment. Please observe the following:
1. Your notebook should have the completely executed code and results.
2. Please organize your notebook to tell the story. Remove unnecessary clutter, test code, and anything that does not belong to the story.
3. Save your notebook in a directory named `HW1` in `MSA8010F16` in your *home* directory on the Hadoop Cluster. The path should be `~/MSA8010F16/HW1/HW1.ipynb`
4. Also save the notebook in HTML as `~/MSA8010F16/HW1/HW1.html`
5. All file names are *case sensitive*!

#### I want to first build a function that returns the list of words appearing in a document. This requires splitting the text on spaces, line breaks, and punctuation. The variable `rmvpunc` holds the characters to split on: every punctuation mark except the apostrophe, which I keep so that contractions stay intact, plus the space. Splitting leaves empty strings behind, so I use a list comprehension to filter them out.

##### Additionally, I later found that many text documents use '\r' in addition to or in lieu of '\n' as a line break, so I added a second replace to convert those.

In [1]:
import string

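def makelist(src):
    # Characters to split on: all punctuation except the apostrophe (kept so
    # contractions survive). The apostrophe is swapped for a space here, so
    # the space itself also becomes a split character.
    rmvpunc = string.punctuation.replace("'", " ")
    # Normalize carriage returns once, before the loop.
    txt = src.lower().replace('\r', '\n')
    for c in rmvpunc:
        txt = txt.replace(c, '\n')
    # Splitting on '\n' leaves empty strings between adjacent separators.
    wordlist = [i for i in txt.split('\n') if i != '']
    return wordlist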

#### Let's test this function on a small text file I created.

In [2]:
makelist(open('text.txt').read())


['these',
 'words',
 'are',
 'on',
 'the',
 'first',
 'line',
 'these',
 'words',
 'are',
 'on',
 'the',
 'second',
 'line',
 'these',
 'words',
 'are',
 'on',
 'the',
 'third',
 'line',
 'the',
 'fourth',
 'line',
 'is',
 'this',
 'stuff',
 'here',
 'goes',
 'the',
 'fifth',
 'line',
 "what's",
 'the',
 'point',
 'of',
 'the',
 'sixth',
 'line',
 'lucky',
 'seventh',
 'line',
 'baby',
 'eighth',
 'line',
 'material',
 'appears',
 'here']
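
The replace-based approach splits only on spaces, line breaks, and punctuation, so a tab stays inside a token and an apostrophe used as a quotation mark can survive as a token of its own. A minimal sketch of one alternative, assuming a hypothetical helper named `makelist_re` (not part of the assignment code), treats a word as a run of letters with optional internal apostrophes:

```python
import re

# Hypothetical alternative to makelist(): a word is a run of ASCII letters,
# optionally followed by apostrophe-plus-letters groups (e.g. "what's").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def makelist_re(src):
    """Return the lowercase words in src, in order of appearance."""
    return WORD_RE.findall(src.lower())

sample = "What's the point?\r\nLucky\tseventh line, 'baby'!"
print(makelist_re(sample))
# ["what's", 'the', 'point', 'lucky', 'seventh', 'line', 'baby']
```

Note that this pattern also drops digits and non-ASCII letters, which `makelist` keeps, so the two functions will not always give the same counts.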

#### Now I need a function that counts the words from the word list. I'll use a dictionary since that makes the most sense to me to count word frequency, but then I'll return a list of tuples which will make for easier sorting later.

In [3]:
def wordcount(wordlist):
    freq_dict = {}
    for i in wordlist:
        if i not in freq_dict:
            freq_dict[i] = 1
        else:
            freq_dict[i] += 1
    return list(freq_dict.items())

wordcount(makelist(open('text.txt').read()))

[('second', 1),
 ("what's", 1),
 ('line', 8),
 ('material', 1),
 ('these', 3),
 ('lucky', 1),
 ('appears', 1),
 ('goes', 1),
 ('fourth', 1),
 ('fifth', 1),
 ('on', 3),
 ('are', 3),
 ('here', 2),
 ('point', 1),
 ('seventh', 1),
 ('baby', 1),
 ('this', 1),
 ('is', 1),
 ('of', 1),
 ('words', 3),
 ('sixth', 1),
 ('eighth', 1),
 ('first', 1),
 ('third', 1),
 ('stuff', 1),
 ('the', 7)]
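
The same counting can also be done with `collections.Counter` from the standard library. The sketch below assumes a hypothetical function name, `wordcount_counter`, and returns the pairs already sorted by frequency:

```python
from collections import Counter

def wordcount_counter(wordlist):
    """Return (word, count) pairs sorted from most to least frequent."""
    # most_common() with no argument returns every pair, highest count first.
    return Counter(wordlist).most_common()

words = ['the', 'line', 'the', 'point', 'line', 'the']
print(wordcount_counter(words))
# [('the', 3), ('line', 2), ('point', 1)]
```

Because `most_common()` already sorts by count, a separate sort step would not be needed with this version.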

#### Now I can write a function that sorts the list of tuples and ties everything together by calling my previous functions and then printing the twenty most frequent words along with how many times each word appears.

In [4]:
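def topten(src):
    # Despite the name, this prints the twenty most frequent words.
    counts = wordcount(makelist(src))
    counts.sort(key=lambda pair: pair[1], reverse=True)
    # Slicing avoids an IndexError when a document has fewer than twenty distinct words.
    for word, count in counts[:20]:
        print(word, "-", count)
    print('\n')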

topten(open('text.txt').read())

line - 8
the - 7
these - 3
on - 3
are - 3
words - 3
here - 2
second - 1
what's - 1
material - 1
lucky - 1
appears - 1
goes - 1
fourth - 1
fifth - 1
point - 1
seventh - 1
baby - 1
this - 1
is - 1




#### Let's run the function with the locally stored Shakespeare text file and begin building lists to compare the "signatures" of various types of documents.

In [5]:
topten(open('shakespeare.txt').read())

the - 27618
and - 26771
i - 20296
to - 19685
of - 18183
a - 14565
you - 13645
my - 12473
that - 11135
in - 11000
is - 9596
not - 8733
for - 8251
with - 7999
me - 7771
it - 7697
be - 7089
your - 6876
his - 6853
this - 6832




#### We'll also run the code with other fictional writings - the complete works of American author Mark Twain and English author Jane Austen.
(These are read from local files because of problems with `urlopen` accessing documents on the gutenberg.org domain.)

In [6]:
topten(open('twain.txt').read())
topten(open('austen.txt').read())

the - 155007
and - 122526
of - 79376
a - 73415
to - 71630
it - 49657
in - 48082
i - 45342
that - 39858
was - 39596
he - 32872
is - 24821
for - 22502
his - 21730
you - 21157
with - 21046
but - 20818
not - 17218
had - 16950
as - 16525


the - 28448
to - 26119
and - 24176
of - 23051
a - 14366
her - 14031
i - 13785
in - 12226
was - 11783
it - 10834
she - 10710
not - 9112
that - 8928
be - 8706
you - 8413
he - 7758
had - 7669
as - 7608
for - 7221
with - 6382




#### It seems these fictional works mostly use "structural" English language words, without any particularly unique words appearing in the lists.

#### Let's look at some other types of documents to compare:
* a philosophical document from Immanuel Kant
* a medical textbook
* a scientific document, Darwin's 'Origin of Species'
* a legal document, 'Prize Cases Decided in the US Supreme Court, 1789-1918'

In [7]:
from urllib.request import urlopen
print ("Immanuel Kant")
topten(urlopen('http://www.textfiles.com/etext/AUTHORS/KANT/kant-critique-142.txt').read().decode())
print ("Textbook of Medical Physiology")
topten(urlopen('https://archive.org/stream/GuytonAndHallTextbookOfMedicalPhysiologyRentalEBookNodrm/Guyton%20and%20Hall%20Textbook%20of%20Medical%20Physiology%20Rental%20E-Book_nodrm_djvu.txt').read().decode())
print ("Charles Darwin")
topten(urlopen('http://www.loyalbooks.com/download/text/The-Origin-of-Species-by-Charles-Darwin.txt').read().decode())
print ("Court Decisions")
topten(urlopen('https://archive.org/stream/prizecasesdecide00scotuoft/prizecasesdecide00scotuoft_djvu.txt').read().decode())

Immanuel Kant
the - 15918
of - 13074
to - 6208
in - 5942
a - 5077
is - 4876
and - 4838
which - 3424
that - 3095
it - 2950
as - 2879
be - 2378
this - 2232
not - 1934
we - 1924
but - 1784
for - 1690
all - 1430
an - 1405
by - 1370


Textbook of Medical Physiology
the - 65627
of - 39137
in - 20205
and - 18995
to - 14877
is - 12555
a - 10528
that - 6956
as - 5889
by - 5845
are - 5441
this - 5092
for - 4842
from - 4748
blood - 4607
with - 3226
or - 3208
1 - 2961
pressure - 2961
cells - 2897


Charles Darwin
the - 14562
of - 10427
and - 5852
in - 5414
to - 4752
a - 3379
that - 2749
as - 2230
have - 2114
be - 2099
is - 2069
on - 1952
species - 1922
by - 1824
which - 1787
or - 1655
are - 1646
it - 1526
for - 1465
with - 1464


Court Decisions
the - 33939
of - 18644
to - 11284
and - 9345
in - 7939
a - 7002
that - 5609
is - 4367
it - 4017
be - 3872
was - 3604
by - 3513
as - 3061
not - 2968
court - 2672
on - 2638
this - 2495
which - 2350
for - 2330
' - 2165
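
Calling `.decode()` with no argument assumes the downloaded bytes are UTF-8 and raises `UnicodeDecodeError` when they are not. A minimal sketch of a more forgiving approach, assuming a hypothetical helper named `decode_text` and assuming Latin-1 is a reasonable second guess for older plain-text files:

```python
def decode_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in turn (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")

print(decode_text("café".encode("utf-8")))    # café
print(decode_text("café".encode("latin-1")))  # café
```

With this helper, a call such as `topten(decode_text(urlopen(url).read()))` would replace the `.read().decode()` chain, where `url` stands for any of the addresses used above.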




## Conclusions
The words 'the', 'of', 'and', 'in', and 'to' appear among the ten most frequent words in every document I tested, and they fill at least four of the top five positions in each. It seems that any English document of sufficient length will contain these and other basic words most frequently, so these common words do not contribute to a "signature" for a document. Writing in English simply requires these building blocks to construct sentences.

We begin to see more distinctive words in the 11th-20th positions once we get into more topical documents. For example, the medical text contains 'blood', 'pressure', and 'cells'; Darwin's work uses 'species' often; and the court decisions use the word 'court' often.
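
One way to act on this observation is to discard every word that shows up in more than one document's top list and keep what remains. The sketch below assumes a hypothetical helper named `signatures` and uses the top-twenty words copied from three of the outputs above. It illustrates the idea and is not a tested classifier.

```python
from collections import Counter

def signatures(top_words):
    """Map each document to the words found in no other document's list.

    top_words: dict of document name -> list of its most frequent words.
    """
    # Count in how many documents each word appears.
    doc_freq = Counter(w for words in top_words.values() for w in set(words))
    return {name: [w for w in words if doc_freq[w] == 1]
            for name, words in top_words.items()}

# Top-twenty words copied from the medical, Darwin, and court outputs.
top_words = {
    "medical": ("the of in and to is a that as by are this for from "
                "blood with or 1 pressure cells").split(),
    "darwin": ("the of and in to a that as have be is on species by "
               "which or are it for with").split(),
    "court": ("the of to and in a that is it be was by as not court on "
              "this which for '").split(),
}

for name, sig in signatures(top_words).items():
    print(name, "-", sig)
# medical - ['from', 'blood', '1', 'pressure', 'cells']
# darwin - ['have', 'species']
# court - ['was', 'not', 'court', "'"]
```

The topical words survive, but so do tokenization artifacts such as `1` and `'` and a few ordinary words such as 'have' and 'was'. With only three documents compared, some common words are bound to slip through, so a larger set of documents would likely be needed before these lists could serve as reliable signatures.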