# Applying NLP to Two Project Gutenberg Books

## Pseudo code
1) Look at the data structure of the books. I'm going to read the .txt version of each.
2) When the file is parsed as HTML, the structure is simple, with just body and paragraph tags, so the focus is on extracting the paragraph text.
3) Read the page with BeautifulSoup.
4) Clean the text by removing unwanted characters such as \n and \r, as well as unwanted text such as the Project Gutenberg header and footer.
5) Read the processed text and count the occurrences and frequencies of the words.

## Part 1)  "Common Sense" by Thomas Paine 

In [1]:
# Import required libraries
# Import urllib.request library for opening URLs
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np

### 1.1) Data Preparation & Analysis
Fetch and read http://www.gutenberg.org/cache/epub/147/pg147.txt, the plain-text file of the book Common Sense by Thomas Paine. Display the first 1000 characters to make sure everything is working correctly.

In [2]:
webpage_ThomasPaine = 'http://www.gutenberg.org/cache/epub/147/pg147.txt'
webcontent_ThomasPaine = urlopen(webpage_ThomasPaine).read()

print(webcontent_ThomasPaine[:1000])

b'\xef\xbb\xbf This eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever. You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\nTitle: Common Sense\r\nAuthor: Thomas Paine\r\nRelease Date: June 9, 2008 [EBook #147]\r\nLast updated: August 25, 2016\r\nLanguage: English\r\nCharacter set encoding: UTF-8.\r\n\r\nProduced by John Campbell. HTML version by Al Haines. Modified by\r\nRobert Homa.\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK COMMON SENSE ***\r\n\r\n\r\nCOMMON SENSE;\r\n\r\naddressed to the\r\n\r\nINHABITANTS\r\n\r\nof\r\n\r\nAMERICA,\r\n\r\nOn the following interesting\r\n\r\nSUBJECTS\r\n\r\n    Of the Origin and Design of Government in general,\r\n    with concise Remarks on the English Constitution.\r\n    Of Monarchy and Hereditary Succession\r\n    Thoughts on the present State of American Affairs\r\n    Of the present Ab
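The output above is a `bytes` object that starts with the UTF-8 byte-order mark (`\xef\xbb\xbf`). A minimal sketch of turning such bytes into a string, assuming the file is UTF-8 as its own header states. The `sample_bytes` value is a shortened stand-in for the download, not the real response.

```python
# Shortened stand-in for the downloaded bytes (assumption: the real data comes from urlopen(...).read()).
sample_bytes = b'\xef\xbb\xbfTitle: Common Sense\r\nAuthor: Thomas Paine\r\n'

# 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if one is present.
decoded = sample_bytes.decode('utf-8-sig')

# Plain 'utf-8' keeps the mark as the character U+FEFF at the start of the string.
decoded_with_bom = sample_bytes.decode('utf-8')

print(repr(decoded[:20]))
print(decoded_with_bom.startswith('\ufeff'))
```

Because the source is a plain .txt file, a decoded string like this could probably be cleaned directly; the notebook instead passes the response through BeautifulSoup, which also works.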

In [3]:
# Use BeautifulSoup to parse the page contents
resp_ThomasPaine = requests.get(webpage_ThomasPaine)
soup_ThomasPaine = BeautifulSoup(resp_ThomasPaine.content, 'lxml')

### 1.2) Select all text data 

    - Select only the text nodes from the BeautifulSoup parse tree
    - Convert this result set to a string for our text processing

In [4]:
textsoup_ThomasPaine = soup_ThomasPaine.findAll(text=True)
type(textsoup_ThomasPaine) 

bs4.element.ResultSet

### 1.3) Text Data Cleaning
    - strip unwanted leading and trailing whitespace from the text
    - replace the \n and \r characters with spaces
    - remove all text after "*** END OF THIS PROJECT"
    - remove all text before "*** START OF THIS PROJECT GUTENBERG EBOOK"
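The cleaning steps listed above can be sketched as one function. This is only an illustration: `strip_gutenberg_markers` is a hypothetical helper, the sample string is a shortened stand-in for the book, and the marker wording is assumed to match the one shown in this notebook. Unlike a plain split on "*** START OF THIS PROJECT", the regular expression consumes the whole marker line, so the remainder of that line ("GUTENBERG EBOOK COMMON SENSE ***") does not end up at the start of the text.

```python
import re

def strip_gutenberg_markers(text):
    """Return the text between the START and END marker lines (hypothetical helper)."""
    # Match the whole START marker, including the title and the closing '***'.
    start = re.search(r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*', text)
    if start:
        text = text[start.end():]
    # Drop everything from the END marker onwards.
    end = re.search(r'\*\*\* END OF THIS PROJECT', text)
    if end:
        text = text[:end.start()]
    # Collapse runs of whitespace (including \r and \n) into single spaces.
    return ' '.join(text.split())

# Shortened stand-in for the downloaded book (assumption, not the real file).
sample = (
    'Title: Common Sense\r\n\r\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK COMMON SENSE ***\r\n\r\n'
    'COMMON SENSE;\r\n\r\naddressed to the\r\n\r\nINHABITANTS\r\n\r\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK COMMON SENSE ***\r\n'
    'License text follows.'
)

body = strip_gutenberg_markers(sample)
print(body)
```

The same helper could be reused for the second book, since both files use the same marker lines.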

In [5]:
text_ThomasPaine = ''.join(textsoup_ThomasPaine).strip()
text_ThomasPaine[:100]

'This eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.'

In [6]:
import re
# Replace carriage returns, newlines and apostrophes with spaces
text_ThomasPaine = re.sub('[\r\n\']',' ',text_ThomasPaine) 
text_ThomasPaine[:500]

'This eBook is for the use of anyone anywhere at no cost and with  almost no restrictions whatsoever. You may copy it, give it away or  re-use it under the terms of the Project Gutenberg License included  with this eBook or online at www.gutenberg.net    Title: Common Sense  Author: Thomas Paine  Release Date: June 9, 2008 [EBook #147]  Last updated: August 25, 2016  Language: English  Character set encoding: UTF-8.    Produced by John Campbell. HTML version by Al Haines. Modified by  Robert Homa'

#### Remove all words after "*** END OF THIS PROJECT"

In [7]:
sep1 = '*** END'
text_ThomasPaine = text_ThomasPaine.split(sep1, 1)[0]

#### Remove all words before "*** START"

In [8]:
sep2 = '*** START OF THIS PROJECT'
text_ThomasPaine = text_ThomasPaine.split(sep2, 1)[1]

In [53]:
processedData = ' '.join(text_ThomasPaine.split())
processedData[:500]

'GUTENBERG EBOOK COMMON SENSE *** COMMON SENSE; addressed to the INHABITANTS of AMERICA, On the following interesting SUBJECTS Of the Origin and Design of Government in general, with concise Remarks on the English Constitution. Of Monarchy and Hereditary Succession Thoughts on the present State of American Affairs Of the present Ability of America, with some miscellaneous Reflections A new edition, with several additions in the body of the work. To which is added an appendix; together with an add'

### 1.4) Count the occurrences of all words in the documents.
    * Create a corpus with the words from the document

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

### 1.5) Creating the corpus with all words from the document

In [12]:
## Create a list from the processed data

gutenberg_Thomas_Corpus = [processedData]

### 1.6) Without using stopwords

In [13]:
cvec = CountVectorizer()
gutenberg_thomas_cnt = cvec.fit_transform(gutenberg_Thomas_Corpus)

In [14]:
# Shape of the count matrix: (documents, distinct terms)
print(gutenberg_thomas_cnt.shape)

(1, 3546)


In [15]:
# Bag of Words matrix
print(gutenberg_thomas_cnt.toarray())

[[3 2 2 ... 6 3 2]]
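Each column of the bag-of-words matrix corresponds to one vocabulary term, with the terms sorted so that tokens starting with digits come before letters. The following plain-Python sketch mimics that layout on a made-up sentence. The names `vocabulary`, `column_index` and `row` are illustrative; `column_index` plays the role of the vectorizer's `vocabulary_` attribute.

```python
import re
from collections import Counter

# Made-up sample sentence, used only to show the layout.
sample = 'In 1776 the king and the colonies disagreed; the king lost'
tokens = re.findall(r'\b\w\w+\b', sample.lower())
counts = Counter(tokens)

# Sort the distinct tokens; digits sort before letters, so numeric tokens come first.
vocabulary = sorted(counts)
# Map each term to its column index, analogous to the vocabulary_ attribute.
column_index = {term: i for i, term in enumerate(vocabulary)}
# One row of counts, aligned with the sorted vocabulary.
row = [counts[term] for term in vocabulary]

print(vocabulary)
print(row)
```

Reading the row together with the sorted vocabulary is what the frequency DataFrame does when the terms are passed as column names.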


In [16]:
#print(cvec.vocabulary_) ## This lists all the words and the corresponding indices. 

In [17]:
# Vocabulary terms 100-299 in alphabetical order (scikit-learn 1.2+ replaces get_feature_names() with get_feature_names_out()).
print(cvec.get_feature_names()[100:300])

['acts', 'adam', 'adams', 'adapted', 'added', 'additions', 'address', 'addressed', 'addresses', 'adds', 'adherents', 'administration', 'administred', 'admirals', 'admit', 'admits', 'admitted', 'admitting', 'adopted', 'advanced', 'advantage', 'advantages', 'adventurer', 'adversity', 'advertisements', 'advise', 'advises', 'advocate', 'advocates', 'affair', 'affairs', 'affected', 'affection', 'affections', 'affirm', 'affirming', 'affluence', 'afford', 'affords', 'afraid', 'africa', 'after', 'afterward', 'afterwards', 'again', 'against', 'age', 'ages', 'aggravated', 'ago', 'agree', 'agreeable', 'agrees', 'aim', 'alarming', 'alas', 'all', 'alliance', 'allow', 'allowed', 'almanacks', 'almighty', 'almost', 'alone', 'aloud', 'already', 'also', 'alteration', 'altered', 'altering', 'alternative', 'although', 'always', 'am', 'ambiguous', 'ambition', 'amen', 'america', 'american', 'americans', 'among', 'amount', 'amounts', 'amuse', 'amused', 'an', 'ancestors', 'ancient', 'ancients', 'and', 'anello

In [18]:
## Frequency Dataframe
freq_df = pd.DataFrame(gutenberg_thomas_cnt.toarray(), columns=cvec.get_feature_names())
freq_df.head()

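|   | 000 | 10 | 100 | 110 | 12 | 130 | 14 | 1422 | 1489 | 17 | ... | yet | york | you | young | your | yours | yourself | yourselves | youth | æra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 2 | 1 | 2 | 1 | 4 | 1 | 1 | 2 | ... | 39 | 3 | 42 | 5 | 51 | 1 | 1 | 6 | 3 | 2 |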


In [19]:
sumdf = freq_df.sum(axis=0)
pd.DataFrame({'Vocab': sumdf.index, 'Frequency': sumdf.values}).sort_values(by='Frequency', ascending=False).head()

|   | Vocab | Frequency |
|---|---|---|
| 3151 | the | 1519 |
| 2147 | of | 1035 |
| 189 | and | 803 |
| 3219 | to | 641 |
| 1611 | in | 392 |


#### The most frequent words above are all commonly occurring function words. We need to filter them out.

### 1.7) Filter out common words by using stopwords 

In [20]:
cvec = CountVectorizer(stop_words='english')
gutenberg_thomas_cnt = cvec.fit_transform(gutenberg_Thomas_Corpus)

In [21]:
print(gutenberg_thomas_cnt.shape)

(1, 3308)
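With the stop-word list applied, the vocabulary shrinks from 3546 to 3308 terms. The effect can be sketched in plain Python with a tiny stop list. Both `STOP_WORDS` and the sample phrase are illustrative assumptions; scikit-learn's built-in `'english'` list is much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption: scikit-learn's built-in 'english' list is much longer).
STOP_WORDS = {'the', 'of', 'and', 'to', 'in', 'is', 'but', 'its'}

def count_words(text, stop_words=frozenset()):
    """Count lowercase tokens of two or more characters, skipping stop words."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Made-up sample phrase, used only to show the filtering.
sample = 'the origin and design of government in general and the rise of the king'
unfiltered = count_words(sample)
filtered = count_words(sample, STOP_WORDS)

print(unfiltered.most_common(2))
print(filtered.most_common(2))
print(len(unfiltered) - len(filtered), 'terms removed from the vocabulary')
```

Only stop words that actually occur in the text reduce the vocabulary size, so the drop is at most the length of the stop list.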


### 1.8) Bag of words matrix

In [22]:
print(gutenberg_thomas_cnt.toarray())

[[3 2 2 ... 5 3 2]]


In [23]:
#print(cvec.vocabulary_) ## This lists all the words and the corresponding indices. 

In [24]:
# Vocabulary terms 100-299 in alphabetical order, after stop-word removal.
print(cvec.get_feature_names()[100:300])

['adams', 'adapted', 'added', 'additions', 'address', 'addressed', 'addresses', 'adds', 'adherents', 'administration', 'administred', 'admirals', 'admit', 'admits', 'admitted', 'admitting', 'adopted', 'advanced', 'advantage', 'advantages', 'adventurer', 'adversity', 'advertisements', 'advise', 'advises', 'advocate', 'advocates', 'affair', 'affairs', 'affected', 'affection', 'affections', 'affirm', 'affirming', 'affluence', 'afford', 'affords', 'afraid', 'africa', 'afterward', 'age', 'ages', 'aggravated', 'ago', 'agree', 'agreeable', 'agrees', 'aim', 'alarming', 'alas', 'alliance', 'allow', 'allowed', 'almanacks', 'almighty', 'aloud', 'alteration', 'altered', 'altering', 'alternative', 'ambiguous', 'ambition', 'amen', 'america', 'american', 'americans', 'amounts', 'amuse', 'amused', 'ancestors', 'ancient', 'ancients', 'anello', 'anger', 'animals', 'annual', 'answer', 'answering', 'answers', 'anti', 'antiquity', 'anxious', 'apart', 'apostate', 'appeal', 'appear', 'appearance', 'appeared'

In [25]:
freq_df = pd.DataFrame(gutenberg_thomas_cnt.toarray(), columns=cvec.get_feature_names())
freq_df.head()

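|   | 000 | 10 | 100 | 110 | 12 | 130 | 14 | 1422 | 1489 | 17 | ... | wrong | yards | ye | year | yearly | years | york | young | youth | æra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 2 | 1 | 2 | 1 | 4 | 1 | 1 | 2 | ... | 1 | 2 | 52 | 7 | 2 | 15 | 3 | 5 | 3 | 2 |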


### 1.9) High Frequency Words

In [26]:
sumdf = freq_df.sum(axis=0)
pd.DataFrame({'Vocab': sumdf.index, 'Frequency': sumdf.values}).sort_values(by='Frequency', ascending=False).head()

|   | Vocab | Frequency |
|---|---|---|
| 1416 | hath | 87 |
| 1676 | king | 77 |
| 1353 | government | 70 |
| 3015 | time | 67 |
| 1854 | men | 59 |


#### Low frequency words

In [27]:
sumdf = freq_df.sum(axis=0)
pd.DataFrame({'Vocab': sumdf.index, 'Frequency': sumdf.values}).sort_values(by='Frequency', ascending=True).head()

|   | Vocab | Frequency |
|---|---|---|
| 2957 | tended | 1 |
| 1574 | inoffensive | 1 |
| 1573 | innovation | 1 |
| 2856 | study | 1 |
| 1571 | inland | 1 |
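All five rows above have a frequency of 1, so the ascending sort only shows an arbitrary handful of the words that occur once. A small sketch of listing every such word and their share of the vocabulary, using made-up counts in place of the real frequency column:

```python
from collections import Counter

# Toy counts standing in for the summed frequency column (made-up values).
counts = Counter({'hath': 5, 'king': 4, 'tended': 1, 'inland': 1, 'study': 1, 'time': 3})

# Terms that occur exactly once (often called hapax legomena).
once_only = sorted(term for term, n in counts.items() if n == 1)

# Share of the vocabulary made up of such terms.
share = len(once_only) / len(counts)

print(once_only)
print(f'{share:.0%} of the vocabulary occurs once')
```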


## Part 2)  "The Communist Manifesto" by Friedrich Engels and Karl Marx  

### 2.1) Data Preparation & Analysis
Fetch and read http://www.gutenberg.org/cache/epub/61/pg61.txt, the plain-text file of the book The Communist Manifesto by Karl Marx and Friedrich Engels. Display the first 1000 characters to make sure everything is working correctly.

In [28]:
webpage_KarlMarx = 'http://www.gutenberg.org/cache/epub/61/pg61.txt'
webcontent_KarlMarx = urlopen(webpage_KarlMarx).read()

print(webcontent_KarlMarx[:1000])

b'\xef\xbb\xbfThe Project Gutenberg EBook of The Communist Manifesto\r\nby Karl Marx and Friedrich Engels\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: The Communist Manifesto\r\n\r\nAuthor: Karl Marx and Friedrich Engels\r\n\r\nRelease Date: January 25, 2005 [EBook #61]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE COMMUNIST MANIFESTO ***\r\n\r\n\r\n\r\n\r\nTranscribed by Allen Lutins with assistance from Jim Tarzia.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nMANIFESTO OF THE COMMUNIST PARTY\r\n\r\n[From the English edition of 1888, edited by Friedrich Engels]\r\n\r\n\r\nA spectre is haunting Europe--the spectre of Communism.\r\nAll the Powers of old Europe have entered into a holy alliance to\r\nexorcise this spectre: Pope 

In [29]:
# Use BeautifulSoup to parse the page contents
resp_KarlMarx = requests.get(webpage_KarlMarx)
soup_KarlMarx = BeautifulSoup(resp_KarlMarx.content, 'lxml')

### 2.2) Select all text data 

    - Select only the text nodes from the BeautifulSoup parse tree
    - Convert this result set to a string for our text processing

In [30]:
textsoup_KarlMarx = soup_KarlMarx.findAll(text=True)
type(textsoup_KarlMarx) 

bs4.element.ResultSet

### 2.3) Text Data Cleaning
    - strip unwanted leading and trailing whitespace from the text
    - replace the \n and \r characters with spaces
    - remove all text after "*** END OF THIS PROJECT"
    - remove all text before "*** START OF THIS PROJECT GUTENBERG EBOOK"

In [31]:
text_KarlMarx = ''.join(textsoup_KarlMarx).strip()
text_KarlMarx[:100]

'The Project Gutenberg EBook of The Communist Manifesto\r\nby Karl Marx and Friedrich Engels\r\n\r\nThis eB'

In [32]:
import re
# Replace carriage returns, newlines and apostrophes with spaces
text_KarlMarx = re.sub('[\r\n\']',' ',text_KarlMarx) 
text_KarlMarx[:500]

'The Project Gutenberg EBook of The Communist Manifesto  by Karl Marx and Friedrich Engels    This eBook is for the use of anyone anywhere at no cost and with  almost no restrictions whatsoever.  You may copy it, give it away or  re-use it under the terms of the Project Gutenberg License included  with this eBook or online at www.gutenberg.net      Title: The Communist Manifesto    Author: Karl Marx and Friedrich Engels    Release Date: January 25, 2005 [EBook #61]    Language: English      *** S'

#### Remove all words after "*** END OF THIS PROJECT"

In [33]:
sep1 = '*** END'
text_KarlMarx = text_KarlMarx.split(sep1, 1)[0]

#### Remove all words before "*** START"

In [34]:
sep2 = '*** START OF THIS PROJECT'
text_KarlMarx = text_KarlMarx.split(sep2, 1)[1]

In [54]:
processedData_KarlMarx = ' '.join(text_KarlMarx.split())
processedData_KarlMarx[:500]

'GUTENBERG EBOOK THE COMMUNIST MANIFESTO *** Transcribed by Allen Lutins with assistance from Jim Tarzia. MANIFESTO OF THE COMMUNIST PARTY [From the English edition of 1888, edited by Friedrich Engels] A spectre is haunting Europe--the spectre of Communism. All the Powers of old Europe have entered into a holy alliance to exorcise this spectre: Pope and Czar, Metternich and Guizot, French Radicals and German police-spies. Where is the party in opposition that has not been decried as Communistic b'

### 2.4) Count the occurrences of all words in the documents.
    * Create a corpus with the words from the document

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

### 2.5) Creating the corpus with all words from the document

In [38]:
## Create a list from the processed data

gutenberg_KarlMarx_Corpus = [processedData_KarlMarx]

### 2.6) Filter out common words by using stopwords 

In [39]:
cvec = CountVectorizer(stop_words='english')
gutenberg_karlmarx_cnt = cvec.fit_transform(gutenberg_KarlMarx_Corpus)

In [40]:
print(gutenberg_karlmarx_cnt.shape)

(1, 2021)


### 2.7) Bag of words matrix

In [41]:
print(gutenberg_karlmarx_cnt.toarray())

[[1 1 1 ... 1 1 1]]


In [42]:
#print(cvec.vocabulary_) ## This lists all the words and the corresponding indices. 

In [43]:
# Vocabulary terms 100-299 in alphabetical order, after stop-word removal.
print(cvec.get_feature_names()[100:300])

['appropriated', 'appropriates', 'appropriating', 'appropriation', 'aqueducts', 'arena', 'arises', 'aristocracies', 'aristocracy', 'aristocrat', 'armed', 'armies', 'arms', 'army', 'arose', 'arouse', 'arrangement', 'article', 'articles', 'artillery', 'artisan', 'asceticism', 'ask', 'aspect', 'aspires', 'assembled', 'assistance', 'association', 'associations', 'assumed', 'assumes', 'assure', 'assured', 'asunder', 'atmosphere', 'attack', 'attacks', 'attain', 'attained', 'attainment', 'attains', 'attempts', 'attention', 'augmentation', 'average', 'away', 'awe', 'babeuf', 'background', 'bag', 'bailiffs', 'bank', 'banner', 'barbarian', 'barbarians', 'barbarism', 'bare', 'barter', 'based', 'basis', 'batters', 'battle', 'battles', 'bears', 'beaux', 'beetroot', 'begetting', 'begin', 'beginning', 'begins', 'begun', 'belief', 'belong', 'belongs', 'beneath', 'benefit', 'best', 'birds', 'birth', 'bit', 'bitter', 'blind', 'blues', 'bodies', 'bombastic', 'bonds', 'bone', 'bound', 'bounds', 'bourgeois

In [44]:
termFreq = pd.DataFrame(gutenberg_karlmarx_cnt.toarray(), columns=cvec.get_feature_names())
termFreq.head()

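|   | 10 | 1830 | 1846 | 1847 | 1888 | 18th | ablaze | able | abolish | abolished | ... | writers | writings | written | wrote | yearnings | years | yield | yoke | young | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | ... | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 1 |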


In [45]:
termFreq.to_csv('wordFreqKarlmarx.csv', index=False, header=True)

### 2.8) High Frequency Words

In [46]:
sumdf = termFreq.sum(axis=0)
pd.DataFrame({'Vocab': sumdf.index, 'Frequency': sumdf.values}).sort_values(by='Frequency', ascending=False).head()

|   | Vocab | Frequency |
|---|---|---|
| 280 | class | 102 |
| 189 | bourgeois | 100 |
| 190 | bourgeoisie | 91 |
| 1710 | society | 70 |
| 1424 | proletariat | 63 |


### 2.9) From occurrences to frequencies - TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a statistical weight used to evaluate how important a word is to a document in a collection or corpus. The importance increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word appears across the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool for scoring and ranking a document's relevance to a user query. Note that the corpus here contains a single document, so the inverse document frequency is the same for every term and the scores below are the raw counts rescaled to unit length.
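The weighting can be sketched in plain Python. This is an illustration, not scikit-learn's code: `tfidf_rows` is a hypothetical helper that assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what `TfidfTransformer` uses with its default settings. The counts are made-up values for three terms.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed idf and L2 row normalization (plain-Python sketch)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

# Toy counts for the terms ['class', 'society', 'king'] (made-up values).
one_doc = [[3, 4, 0]]
two_docs = [[3, 4, 0], [0, 4, 3]]

print(tfidf_rows(one_doc))
print(tfidf_rows(two_docs))
```

With one document the result is just the counts divided by their Euclidean length. With two documents, a term shared by both ("society" in the toy data) is weighted down relative to a term unique to one of them, so fitting the transformer on both books together might give more informative scores.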

In [1]:
from sklearn.feature_extraction.text import TfidfTransformer

In [48]:
tfidf = TfidfTransformer()
karlmarx_freq = tfidf.fit_transform(gutenberg_karlmarx_cnt)
karlmarx_freq.shape

(1, 2021)

In [49]:
print(karlmarx_freq.toarray())

[[0.00348065 0.00348065 0.00348065 ... 0.00348065 0.00348065 0.00348065]]


In [50]:
TFtermFreq = pd.DataFrame(karlmarx_freq.toarray(), columns=cvec.get_feature_names())
TFtermFreq.head()

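|   | 10 | 1830 | 1846 | 1847 | 1888 | 18th | ablaze | able | abolish | abolished | ... | writers | writings | written | wrote | yearnings | years | yield | yoke | young | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.003481 | 0.003481 | 0.003481 | 0.003481 | 0.003481 | 0.003481 | 0.003481 | 0.006961 | 0.010442 | 0.006961 | ... | 0.003481 | 0.006961 | 0.003481 | 0.013923 | 0.003481 | 0.006961 | 0.003481 | 0.003481 | 0.003481 | 0.003481 |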


In [51]:
TFtermFreq.to_csv('tfidfFreqKarlmarx.csv', index=False, header=True)

In [52]:
TFtermFreq.T.sort_values(by=[0], ascending=False).head()

|   | 0 |
|---|---|
| class | 0.355026 |
| bourgeois | 0.348065 |
| bourgeoisie | 0.316739 |
| society | 0.243645 |
| proletariat | 0.219281 |
