In [2]:
import requests
import re
from bs4 import BeautifulSoup
import string

In [3]:
page = requests.get("http://www.vatican.va/archive/bible/genesis/documents/bible_genesis_en.html")
soup = BeautifulSoup(page.content, 'html.parser')

for script in soup(["script", "style"]):
    script.decompose()

In [4]:
genesis = soup.get_text()
genesis = re.sub(r'[\n]+','\n', genesis)

## Questions

1) How many times does the word 'God' appear (as an isolated word)?

2) What are the 5 most common words?

3) What are the words that appear only once (hapaxes)?

**for** loops and **if** statements are required to answer the questions

## 1st version

Using `if` and `for`, with only lists and sets as data structures

In [5]:
# Removing undesirable characters
for j in string.punctuation:
    genesis = genesis.replace(j,' ')

### Question 1

In [6]:
# Wrong way: str.count matches substrings, so 'God' is not counted as an isolated word here.
genesis.count('God')

227

In [7]:
genesis.split().count('God')

227

Coincidentally, the two counts agree: the text has no longer word that contains 'God' (for example, 'Goddess').
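
To see why the two counts can differ in general, here is a minimal sketch on a made-up sentence (the `sample` string is hypothetical, not taken from Genesis). `str.count` matches substrings, while counting on the split list matches whole tokens only.

```python
import string

# Hypothetical sample text, not taken from Genesis
sample = "God spoke. The Goddess listened, and God's word was heard."

# Same cleaning step as above: replace punctuation with spaces
for ch in string.punctuation:
    sample = sample.replace(ch, ' ')

substring_count = sample.count('God')       # also matches inside 'Goddess'
token_count = sample.split().count('God')   # whole tokens only

print(substring_count, token_count)
```

Because the apostrophe is replaced by a space, "God's" is split into the tokens `God` and `s`, so it still counts as an occurrence of the isolated word.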

### Question 2

To accomplish this task we need to convert all uppercase characters into lowercase characters

In [8]:
genesis = genesis.lower()

**word_list**: a list that contains every distinct word of the string (no duplicates)

In [9]:
word_list = []
# using the data structure 'set' to remove duplicates
for isolated_string in set(genesis.split()):
    # keeping only alphabetic strings, which drops numeric ones
    if isolated_string.isalpha():
        word_list.append(isolated_string)

**count_list**: list that contains the frequency of every element of **word_list**

In [10]:
count_list = []
# split once, instead of re-splitting the whole text for every word
tokens = genesis.split()
for word in word_list:
    count_list.append(tokens.count(word))

At the same index, the two lists hold a word and its frequency. A dictionary or a DataFrame would be the more natural data structure here.
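
As a sketch of the dictionary option, two parallel lists can be paired with `zip` and turned into a dictionary. The short lists below are made-up stand-ins for **word_list** and **count_list**.

```python
# Made-up stand-ins for word_list and count_list (same index = same word)
words = ['the', 'and', 'ark']
counts = [5, 3, 1]

# zip pairs elements by index; dict turns the pairs into word -> frequency
freq = dict(zip(words, counts))
print(freq['and'])
```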

In [11]:
len(count_list) == len(word_list)

True

We make a copy in order to create a sorted version of **count_list**. `list.sort` sorts in place, so calling it directly would change the order of **count_list**, which we need to preserve so that it stays aligned with **word_list**.

In [12]:
sorted_count_list = count_list.copy()

In [13]:
sorted_count_list.sort(reverse = True)
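
To illustrate the difference on a small made-up list: `list.sort` reorders the list it is called on and returns `None`, while the built-in `sorted` returns a new list and leaves its argument alone. Either route works here, as long as the original list is not the one being reordered.

```python
counts = [3, 10, 1]   # made-up frequencies

# Route 1: copy first, then sort the copy in place
copy_sorted = counts.copy()
copy_sorted.sort(reverse=True)

# Route 2: sorted() builds a new sorted list in one step
also_sorted = sorted(counts, reverse=True)

print(counts, copy_sorted, also_sorted)
```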

In [14]:
list_5_freq = sorted_count_list[:5]

In [15]:
list_5_freq

[2475, 2018, 1271, 1078, 650]

In [16]:
# Find each top frequency's index in count_list, then use that index to look up the word in word_list

for i in range(5):
    word_index = count_list.index(list_5_freq[i])
    print(word_list[word_index])

the
and
of
to
you
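
One caveat: `list.index` returns the position of the first match, so if two of the top frequencies were equal, the loop above would print the same word twice. The five frequencies here are all distinct, so the output is correct. A minimal sketch of a tie-safe variant, on made-up data, sorts the indices rather than the frequencies:

```python
words = ['ark', 'dove', 'raven', 'flood']   # made-up data
counts = [4, 7, 4, 1]                       # 'ark' and 'raven' are tied

# Naive lookup: both 4s resolve to the first index holding a 4
naive = []
for c in sorted(counts, reverse=True)[:3]:
    naive.append(words[counts.index(c)])

# Tie-safe lookup: sort the indices by their frequency instead
order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
top_3 = []
for i in order[:3]:
    top_3.append(words[i])

print(naive)
print(top_3)
```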


### Question 3

Let's create **list_one_time** to hold the **hapaxes**

In [17]:
list_one_time = []
for freq,word in zip(count_list,word_list):
    if freq == 1:
        list_one_time.append(word)

In [18]:
list_one_time

['sorts',
 'private',
 'powerful',
 'marriages',
 'reckoned',
 'sandal',
 'ass',
 'pishon',
 'tarried',
 'means',
 'fury',
 'crouches',
 'invoked',
 'dread',
 'stands',
 'salt',
 'hardship',
 'arvadites',
 'silence',
 'spared',
 'songs',
 'concubines',
 'war',
 'pangs',
 'leaped',
 'settlements',
 'strong',
 'bedad',
 'jubal',
 'pleasing',
 'afflicted',
 'pleasure',
 'belonged',
 'beasts',
 'vengeance',
 'rings',
 'spying',
 'stuff',
 'denied',
 'casluhim',
 'lifeblood',
 'visions',
 'tamarisk',
 'penuel',
 'tarshish',
 'darker',
 'amorite',
 'depart',
 'belongs',
 'jahzeel',
 'delighted',
 'asshurim',
 'breadth',
 'governor',
 'grazed',
 'hate',
 'worse',
 'recorded',
 'ladder',
 'reproach',
 'hired',
 'beast',
 'confused',
 'chariots',
 'mantle',
 'cheated',
 'weighed',
 'fondling',
 'streaks',
 'ending',
 'lacking',
 'ornaments',
 'foes',
 'realizing',
 'displeased',
 'cover',
 'corrupted',
 'heth',
 'wandering',
 'raised',
 'gates',
 'change',
 'relief',
 'inherit',
 'earthly',
 'j

In [19]:
len(list_one_time)

1106

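Double check: the loop below prints any word in **list_one_time** that occurs more than once, so no output means the list is correct.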

In [20]:
for word in list_one_time:
    counting = genesis.split().count(word)
    if counting > 1:
        print(word)

## 2nd version

Using dictionaries, `filter` and lambda functions for questions 2 and 3

In [21]:
word_dict = dict()
# split once and reuse the token list
tokens = genesis.split()
# using the data structure 'set' to remove duplicates
for isolated_string in set(tokens):
    # keeping only alphabetic strings, which drops numeric ones
    if isolated_string.isalpha():
        word_dict[isolated_string] = tokens.count(isolated_string)

### Question 2

Sorting the dictionary items by frequency with a lambda function

In [22]:
sorted(word_dict.items(), key = lambda x:x[1], reverse=True)[:5]

[('the', 2475), ('and', 2018), ('of', 1271), ('to', 1078), ('you', 650)]
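
For comparison, the standard library's `collections.Counter` does the counting and the ranking without an explicit loop. This is an alternative, not the approach used in this notebook, and the sample text is made up.

```python
from collections import Counter

sample = "the ark and the dove and the raven"   # made-up text

# Count every token in a single pass over the list
counter = Counter(sample.split())

# most_common(n) returns the n highest (word, count) pairs
top_2 = counter.most_common(2)
print(top_2)
```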

### Question 3

Using the built-in function **filter**

In [29]:
dict_one_time = dict(filter(lambda x:x[1]==1, word_dict.items())).keys()

In [30]:
len(dict_one_time)

1106
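
The same selection can also be written as a comprehension, which some readers find easier to scan than `filter` with a lambda. A minimal sketch on a made-up dictionary standing in for **word_dict**:

```python
toy_dict = {'the': 5, 'ark': 1, 'and': 3, 'dove': 1}   # made-up counts

# filter + lambda, as in the cell above
with_filter = dict(filter(lambda x: x[1] == 1, toy_dict.items())).keys()

# Comprehension that keeps only the words seen exactly once
with_comprehension = [word for word, freq in toy_dict.items() if freq == 1]

print(sorted(with_filter) == sorted(with_comprehension))
```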