In [1]:
import requests
import re
from bs4 import BeautifulSoup
import string

In [2]:
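page = requests.get("http://www.vatican.va/archive/bible/genesis/documents/bible_genesis_en.html")
page.raise_for_status()  # fail early if the download did not succeed
soup = BeautifulSoup(page.content, 'html.parser')

# Drop <script> and <style> elements so that only visible text remains
for tag in soup(["script", "style"]):
    tag.decompose()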

In [3]:
genesis = soup.get_text()
# Collapse runs of consecutive newlines into a single newline
genesis = re.sub(r'[\n]+','\n', genesis)

In [4]:
# Replace every punctuation character with a space
for symbol in string.punctuation:
    genesis = genesis.replace(symbol,' ')
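
The same cleanup can probably be done in a single pass with `str.translate`. The sketch below is only an illustration: `sample` is a hypothetical stand-in for the scraped `genesis` string, and `punct_to_space` and `cleaned` are names introduced here, not part of the notebook.

```python
import string

# Hypothetical sample text standing in for the scraped `genesis` string
sample = "In the beginning, God created the heavens; and the earth."

# Build a table that maps each punctuation character to a single space
punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Apply the whole mapping in one pass over the text
cleaned = sample.translate(punct_to_space)
print(cleaned.split())
```

For a text the size of one book either approach is fast enough; the translation table mainly avoids rebuilding the string once per punctuation character.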

### Questions:


#### 1. How many times does the word 'God' appear (as an isolated word)?

**for** loops and **if** statements are required to answer the questions

In [5]:
genesis.count('God')   # wrong way! This counts substrings, so 'God' inside a longer word would also be counted.

227

In [6]:
genesis.split().count('God')

227

Both approaches return the same number only by coincidence: this text contains no longer words that include 'God' as a substring (e.g., 'Goddess').
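
To see how the two counts can differ, here is a small sketch on a made-up sentence (`sample` is hypothetical, not taken from Genesis). It also shows a third option, a regular expression with word boundaries, which does not depend on punctuation having been removed first.

```python
import re

# Hypothetical sentence in which 'God' also occurs inside a longer word
sample = "God said to the gods that Godfrey met God"

# Substring count: also matches the 'God' in 'Godfrey'
substring_count = sample.count("God")

# Token count: only whitespace-separated tokens equal to 'God'
token_count = sample.split().count("God")

# Regex count: \b marks a word boundary on each side of 'God'
regex_count = len(re.findall(r"\bGod\b", sample))

print(substring_count, token_count, regex_count)  # 3 2 2
```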

#### 2. What are the 5 most common words?    

To accomplish this task, we first need to convert all uppercase characters to lowercase, so that 'The' and 'the' count as the same word.

In [7]:
genesis = genesis.lower()

**word_list**: a list that contains every distinct word of the string (no duplicates)

In [8]:
word_list = []
# Using the data structure 'set' to remove duplicates
for isolated_string in set(genesis.split()):
    # Ignoring numeric strings
    if isolated_string.isalpha():
        word_list.append(isolated_string)

**count_list**: a list that contains the frequency of every element of **word_list**, in the same order

In [9]:
count_list = []
tokens = genesis.split()   # split once instead of once per word
for word in word_list:
    count_list.append(tokens.count(word))

At any given index, **word_list** holds a word and **count_list** holds its frequency. A dictionary or a DataFrame would be a more natural data structure for this pairing.

In [10]:
len(count_list) == len(word_list)

True
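
As one possible illustration of the dictionary idea, `collections.Counter` from the standard library keeps each word and its frequency together and builds them in a single pass. This is a sketch on a hypothetical `sample` string, not the notebook's data; `tokens` and `word_counts` are names introduced here.

```python
from collections import Counter

# Hypothetical lowercase text standing in for `genesis`
sample = "the dove and the raven and the ark 40"

# Keep alphabetic tokens only, as in the loop that builds word_list
tokens = [token for token in sample.split() if token.isalpha()]

# Counter maps each distinct word to its number of occurrences
word_counts = Counter(tokens)

print(word_counts.most_common(2))  # [('the', 3), ('and', 2)]
```

Because every token is visited once, this avoids calling `count` on the full token list for each distinct word.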

We make a copy in order to create a sorted version of **count_list**. Bear in mind that `list.sort` works in place: calling it directly would change the order of **count_list**, which must stay aligned with **word_list**.

In [11]:
sorted_count_list = count_list.copy()

In [12]:
sorted_count_list.sort(reverse = True)
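
The difference between the in-place method and the built-in `sorted` function can be checked on a tiny list. The values below are made up for illustration.

```python
# Hypothetical frequencies, in the same role as count_list
counts = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched
ordered = sorted(counts, reverse=True)

# list.sort() reorders the list it is called on and returns None
counts_copy = counts.copy()
result = counts_copy.sort(reverse=True)

print(counts, ordered, counts_copy, result)  # [3, 1, 2] [3, 2, 1] [3, 2, 1] None
```

So `sorted(count_list, reverse=True)` would be an equivalent way to get the sorted copy in one step.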

In [13]:
list_5_freq = sorted_count_list[:5]

In [14]:
list_5_freq

[2475, 2018, 1271, 1078, 650]

In [15]:
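# Find the position of each top frequency in count_list and use it as the index into word_list.
# Note: list.index returns the first match, so this relies on the five frequencies being distinct.

for i in range(5):
    word_index = count_list.index(list_5_freq[i])
    print(word_list[word_index])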

the
and
of
to
you
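
The index lookup works here because the five top frequencies are all different. The sketch below, on made-up data, shows what can happen with tied frequencies and one way to avoid it by sorting (frequency, word) pairs together. The names `words`, `counts`, `via_index` and `via_pairs` are hypothetical.

```python
# Hypothetical parallel lists with a tie: 'ark' and 'raven' both appear 4 times
words = ["ark", "dove", "raven", "flood"]
counts = [4, 7, 4, 2]

# Index lookup: list.index always returns the first position of a value,
# so the tied frequency 4 resolves to 'ark' twice
top3_freq = sorted(counts, reverse=True)[:3]
via_index = [words[counts.index(freq)] for freq in top3_freq]

# Sorting the pairs keeps each word attached to its own frequency
pairs = sorted(zip(counts, words), key=lambda pair: pair[0], reverse=True)
via_pairs = [word for _, word in pairs[:3]]

print(via_index)  # ['dove', 'ark', 'ark']
print(via_pairs)  # ['dove', 'ark', 'raven']
```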


Using dictionaries, filter and lambda functions in questions 2 and 3

In [21]:
word_dict = dict()
tokens = genesis.split()   # split once instead of once per word
# Using the data structure 'set' to remove duplicates
for isolated_string in set(tokens):
    # Ignoring numeric strings
    if isolated_string.isalpha():
        word_dict[isolated_string] = tokens.count(isolated_string)

Sorting the dictionary's (word, frequency) pairs by frequency, using a lambda function as the sort key

In [22]:
sorted(word_dict.items(), key = lambda x:x[1], reverse=True)[:5]

[('the', 2475), ('and', 2018), ('of', 1271), ('to', 1078), ('you', 650)]
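
When only the top few entries are needed, `heapq.nlargest` from the standard library is an alternative to sorting every item. A minimal sketch with a hypothetical `word_freq` dictionary in the role of `word_dict`:

```python
import heapq

# Hypothetical word frequencies standing in for word_dict
word_freq = {"the": 5, "and": 3, "ark": 1, "of": 4}

# Pick the 2 items with the largest frequency (the second element of each pair)
top2 = heapq.nlargest(2, word_freq.items(), key=lambda item: item[1])

print(top2)  # [('the', 5), ('of', 4)]
```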

#### 3. What are the words that appear only once ([hapaxes](https://en.wikipedia.org/wiki/Hapax_legomenon))?  

Let's create **list_one_time** to receive the **hapaxes**

In [16]:
list_one_time = []
for freq,word in zip(count_list,word_list):
    if freq == 1:
        list_one_time.append(word)

In [17]:
list_one_time[0:10]

['vindicated',
 'fourteenth',
 'sustained',
 'displeasing',
 'chariots',
 'confused',
 'belongs',
 'stands',
 'buz',
 'banks']

In [18]:
len(list_one_time)

1106

Double check: the loop below should print nothing if every word in **list_one_time** really appears only once.

In [19]:
for word in list_one_time:
    counting = genesis.split().count(word)
    if counting > 1:
        print(word)

Using the built-in function **filter**

In [23]:
dict_one_time = dict(filter(lambda x:x[1]==1, word_dict.items())).keys()

In [24]:
len(dict_one_time)

1106
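
A dictionary or list comprehension is a common alternative to `filter` with a lambda. The sketch below uses a hypothetical `word_freq` dictionary in place of `word_dict` and checks that both approaches select the same words.

```python
# Hypothetical word frequencies standing in for word_dict
word_freq = {"the": 5, "buz": 1, "and": 3, "banks": 1}

# Comprehension: keep the words whose frequency is exactly 1
hapaxes = [word for word, freq in word_freq.items() if freq == 1]

# Same selection with filter and a lambda, as in the cell above
via_filter = dict(filter(lambda x: x[1] == 1, word_freq.items())).keys()

print(hapaxes)                          # ['buz', 'banks']
print(set(hapaxes) == set(via_filter))  # True
```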