# Bible Search:
### Find verses similar to yours!

The user will provide information about his chosen Biblical verse, and the computer will return a ranked list of most similar other verses from the Tanach.

Import Libraries

In [1]:
from os import listdir # list all files in a directory
import pandas as pd # data manipulation

First, let us examine what files we have to work with...

In [2]:
PATH = "bible/"
listdir(PATH)

['bible_databases-master',
 'bible_version_key.csv',
 'key_abbreviations_english.csv',
 'key_english.csv',
 'key_genre_english.csv',
 't_asv.csv',
 't_bbe.csv',
 't_dby.csv',
 't_kjv.csv',
 't_wbt.csv',
 't_web.csv',
 't_ylt.csv']

I will use two of these files.
1. The `key_english.csv`, which lists and numbers the biblical book names; and 
2. The `t_kjv.csv`, which lists and numbers all biblical verses.

Let us list the books of the bible by name and number.

In [3]:
BOOKS = 'key_english.csv' # a constant, the file with book numbers
df_books = pd.read_csv(PATH + BOOKS) # load data into dataframe 
df_books.sample(5) # random sampling of 10 rows from the list of books

Unnamed: 0,b,n,t,g
41,42,Luke,NT,5
53,54,1 Timothy,NT,7
59,60,1 Peter,NT,7
6,7,Judges,OT,2
45,46,1 Corinthians,NT,7


I see the books are labeled in column `t` as either Old or New Testament.

I will work with only verses from the Hebrew Bible (OT).

Let us now load the verses into a dataframe. 

I will use the King James version.

In [4]:
KING_JAMES = 't_kjv.csv' # a constant, the tail end of the file with the verses of the KJV bible
df_verses = pd.read_csv(PATH + KING_JAMES) # load verses into dataframe
df_verses.sample(5)

Unnamed: 0,id,b,c,v,t
24666,41011026,41,11,26,"But if ye do not forgive, neither will your Fa..."
3835,4006012,4,6,12,And he shall consecrate unto the LORD the days...
12451,16007031,16,7,31,"The men of Michmas, an hundred and twenty and ..."
26541,43011018,43,11,18,"Now Bethany was nigh unto Jerusalem, about fif..."
13962,19003005,19,3,5,I laid me down and slept; I awaked; for the LO...


I want the verses in which the book, i.e. `df_verses['b']` has a `OT` in the testament column of the list of books, i.e. `df_books['t']`.

In [5]:
heb_books = df_books.loc[df_books['t'] == 'OT']
heb_books.tail(5)

Unnamed: 0,b,n,t,g
34,35,Habakkuk,OT,4
35,36,Zephaniah,OT,4
36,37,Haggai,OT,4
37,38,Zechariah,OT,4
38,39,Malachi,OT,4


Notice that the Hebrew books go up to number 39. Let's exlude anything higher.

In [6]:
heb_verses = df_verses[df_verses['b']<=39]
heb_verses.tail()
# type(df_books['t'][0])

Unnamed: 0,id,b,c,v,t
23140,39004002,39,4,2,But unto you that fear my name shall the Sun o...
23141,39004003,39,4,3,And ye shall tread down the wicked; for they s...
23142,39004004,39,4,4,"Remember ye the law of Moses my servant, which..."
23143,39004005,39,4,5,"Behold, I will send you Elijah the prophet bef..."
23144,39004006,39,4,6,And he shall turn the heart of the fathers to ...


Sure enough, the last verse is the final verse of Malachi, as we wished. 

Now let's ask the user to provide us the info about his chosen verse. 

In [7]:
heb_books[['b','n']]

Unnamed: 0,b,n
0,1,Genesis
1,2,Exodus
2,3,Leviticus
3,4,Numbers
4,5,Deuteronomy
5,6,Joshua
6,7,Judges
7,8,Ruth
8,9,1 Samuel
9,10,2 Samuel


In [8]:
def book_num_to_name(n):
    """
    given the index, produce the book name
    e.g. 1 results in genesis
    """
    return heb_books.loc[ heb_books['b'] == n ]['n'].iloc[0]
# book_num_to_name(2)

In [9]:
print('What is the verse you chose? Type and press enter...')

What is the verse you chose? Type and press enter...


In [23]:
user = {}
user['book_num'] = int(input('Book (select number from list above): '))
user['book_name'] = book_num_to_name(int(user['book_num']))
user['chap'] = int(input("Chapter: "))
user['verse'] = int(input("Verse: "))
user

Book (select number from list above): 1
Chapter: 1
Verse: 1


{'book_num': 1, 'book_name': 'Genesis', 'chap': 1, 'verse': 1}

In [24]:
def id_to_book(verse_id):
    book_num = heb_verses.loc[heb_verses['id'] == verse_id]['b'].iloc[0]
    book_name = book_num_to_name(book_num)
    result = {}
    result['num'] = book_num
    result['name'] = book_name
    return result
id_to_book(1002001)

{'num': 1, 'name': 'Genesis'}

In [25]:
print("Wait! Please, double check before proceeding.")
print("Your current choice is\n*** {} ***".format(user)) # Tell the user which bok he chose

Wait! Please, double check before proceeding.
Your current choice is
*** {'book_num': 1, 'book_name': 'Genesis', 'chap': 1, 'verse': 1} ***


In [26]:
user['verse']
my_id = user['book_num']*1000000 + user['chap']*1000 + user['verse']
my_id

1001001

In [33]:
my_row = heb_verses.loc[ heb_verses['id'] == my_id ] # select the row of ds in 'id' matches my_id
# my_row
my_verse = my_row['t'].iloc[0] #
# print(type(my_verse))
# print(my_verse)

In [226]:
# TF IDF stands for "term frequency–inverse document frequency"
# it is a a numerical statistic that is intended to reflect how important a word is
# to a document in a collection or corpus.
from sklearn.feature_extraction.text import TfidfVectorizer # Convert a collection of raw documents to a matrix of TF-IDF features
from sklearn.metrics.pairwise import linear_kernel

In [227]:
tf = TfidfVectorizer(
                    analyzer='word', # the feature should be made of word (not character) n-grams
                    ngram_range=(1, 3), # the inclusive range of n-values for different n-grams to be extracted
                    min_df=0, # When building the vocabulary, ignore terms that have a document frequency strictly lower than this threshold
                    stop_words='english' # 
                    ) 

In [228]:
tfidf_matrix = tf.fit_transform(ds['t']) # Learn vocabulary and idf, return term-document matrix.

In [229]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) # dot product

In [230]:
results = {} # dict where each key is an
# id in the list of verses
# and the entry for that key is
# a list id's for verses that are simliar the the verse of the key

for idx, row in ds.iterrows(): # 
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1] # numpy.ndarray
    #print(type(similar_indices))
    similar_rows = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices] # list of 
    #print(type(similar_rows))
    # First row is the row itself, so remove it.
    # Each dictionary entry is like: [(1,2), (3,4)], with each tuple being (score, row_id)
    results[row['id']] = similar_rows[1:]
    
print('done!')

done!


In [231]:
def get_verse_text(id):
    """
    get the words of the verse, given the id
    """
    return ds.loc[ds['id'] == id]['t'].values[0]
# get_verse_text(my_id)

In [232]:
def get_verse_num(verse_id):
    return ds.loc[ds['id'] == verse_id]['v'].iloc[0]
# get_verse_num(1001011)

In [233]:
def get_chap(verse_id):
    return ds.loc[ds['id'] == verse_id]['c'].iloc[0]
# get_chap(1002001)

In [234]:
def cit_to_id(book_num,chap,verse):
    """
    given a book number, chap number, and verse number, produce the verse id
    """ 
    return book_num*1000000 + chap*1000 + verse
# cit_to_id(1,1,1)

In [235]:
# reads the results out of the dictionary
def recommend(verse_id, num):
    my_info = 0
    print("The top {} similar verses to {} {}:{}\n\n{}\n".format(num, book_id_to_name(verse_id), get_chap(verse_id), get_verse_num(verse_id), get_verse_text(verse_id)))
    print("-------")
    recs = results[verse_id][:num] # the top num items listed in the recomendations for this id
    result = []
    for rec in recs:        
        rank = ''+str(recs.index(rec)+1)+'.)'
        citation = rec[1]
        book = book_id_to_name(citation)
        chap = get_chap(citation)
        verse = get_verse_num(citation)
        score = str(int(rec[0]*100))[:2]+ "%"
        text = get_verse_text(rec[1])
        print()
        print(rank,score,book,str(chap)+':'+str(verse))
        print()
        print(text)
user_input = cit_to_id(book_num,chap,verse)
recommend(verse_id=user_input, num=3)

The top 3 similar verses to Exodus 15:1

Then sang Moses and the children of Israel this song unto the LORD, and spake, saying, I will sing unto the LORD, for he hath triumphed gloriously: the horse and his rider hath he thrown into the sea.

-------

1.) 57% Exodus 15:21

And Miriam answered them, Sing ye to the LORD, for he hath triumphed gloriously; the horse and his rider hath he thrown into the sea.

2.) 12% Judges 5:3

Hear, O ye kings; give ear, O ye princes; I, even I, will sing unto the LORD; I will sing praise to the LORD God of Israel.

3.) 11% Exodus 39:42

According to all that the LORD commanded Moses, so the children of Israel made all the work.


In [252]:
type(ds['t'].iloc[0])

str

In [254]:
ds['t'].iloc[0] == 'In the beginning God created the heaven and the earth.'

True

In [264]:
ds['t'][0]

'In the beginning God created the heaven and the earth.'

In [275]:
df1 = ds.loc[ds['t']=='In the beginning God created the heaven and the earth.']
df1.head()

Unnamed: 0,id,b,c,v,t
0,1001001,1,1,1,In the beginning God created the heaven and th...


In [None]:
ds.t.str.startswith('In')

In [302]:
# ds[ds.t.str.startswith('In')]

In [300]:
keyword = input("Type the word you'd like to find. Then press enter. Your choice: ")

Type the word you'd like to find. Then press enter. Your choice: song


In [303]:
keyword

'song'

In [304]:
df_search = ds[ds['t'].str.contains(keyword)]
# ds[ds['t'].str.contains("song")]

In [305]:
df_search.head()

Unnamed: 0,id,b,c,v,t
900,1031027,1,31,27,"Wherefore didst thou flee away secretly, and s..."
1921,2015001,2,15,1,Then sang Moses and the children of Israel thi...
1922,2015002,2,15,2,"The LORD is my strength and song, and he is be..."
4357,4021017,4,21,17,"Then Israel sang this song, Spring up, O well;..."
5747,5031019,5,31,19,"Now therefore write ye this song for you, and ..."


In [306]:
df_search[:5]

Unnamed: 0,id,b,c,v,t
900,1031027,1,31,27,"Wherefore didst thou flee away secretly, and s..."
1921,2015001,2,15,1,Then sang Moses and the children of Israel thi...
1922,2015002,2,15,2,"The LORD is my strength and song, and he is be..."
4357,4021017,4,21,17,"Then Israel sang this song, Spring up, O well;..."
5747,5031019,5,31,19,"Now therefore write ye this song for you, and ..."


In [None]:
book = book_id_to_name(citation)
chap = get_chap(citation)
verse = get_verse_num(citation)


In [315]:
print("These verses contain your keyword '{}'.".format(keyword))
print()
for i in range(0,5):
    print(str(i+1)+'.',df_search.iloc[i]['t'])
    print()

These verses contain your keyword 'song'.

1. Wherefore didst thou flee away secretly, and steal away from me; and didst not tell me, that I might have sent thee away with mirth, and with songs, with tabret, and with harp?

2. Then sang Moses and the children of Israel this song unto the LORD, and spake, saying, I will sing unto the LORD, for he hath triumphed gloriously: the horse and his rider hath he thrown into the sea.

3. The LORD is my strength and song, and he is become my salvation: he is my God, and I will prepare him an habitation; my father's God, and I will exalt him.

4. Then Israel sang this song, Spring up, O well; sing ye unto it:

5. Now therefore write ye this song for you, and teach it the children of Israel: put it in their mouths, that this song may be a witness for me against the children of Israel.

