![Torah Scroll](torah.png "Torah Scroll")

# **Bible Search**
Find verses similar to yours!

## Intro
You provide the book, chapter, and verse of your chosen passage.

The machine returns a ranked list of the verses most similar to it.

## Libraries

In [40]:
from os import listdir # list all files in a directory
import pandas as pd # data manipulation

## Data

First, let us examine what files we have to work with...

In [42]:
PATH = "bible/"
listdir(PATH)

['bible_databases-master',
 'bible_version_key.csv',
 'key_abbreviations_english.csv',
 'key_english.csv',
 'key_genre_english.csv',
 't_asv.csv',
 't_bbe.csv',
 't_dby.csv',
 't_kjv.csv',
 't_wbt.csv',
 't_web.csv',
 't_ylt.csv']

I will use two of these files.
1. The `key_english.csv`, which lists and numbers the biblical book names; and 
2. The `t_kjv.csv`, which lists and numbers all biblical verses.

## Books

Let us list the books of the Bible by name and number.

In [45]:
BOOKS = 'key_english.csv' # a constant, the file with book numbers
df_books = pd.read_csv(PATH + BOOKS) # load data into dataframe 
df_books[35:42] # sampling of books near the point where Hebrew Bible ends and NT begins.

Unnamed: 0,b,n,t,g
35,36,Zephaniah,OT,4
36,37,Haggai,OT,4
37,38,Zechariah,OT,4
38,39,Malachi,OT,4
39,40,Matthew,NT,5
40,41,Mark,NT,5
41,42,Luke,NT,5


**Notice**: The books are labeled in column `t` as either Old or New Testament.

**Preference**: I will work with only verses from the Hebrew Bible (OT).

## Verses

Load the verses into a dataframe. 

I will use the King James version. Hopefully I can use a real Hebrew version some day...

In [46]:
KING_JAMES = 't_kjv.csv' # name of the file with the King James version of the biblical verses
df_verses = pd.read_csv(PATH + KING_JAMES) # load verses into dataframe
df_verses.sample(5)

Unnamed: 0,id,b,c,v,t
4616,4029008,4,29,8,But ye shall offer a burnt offering unto the L...
23124,39003004,39,3,4,Then shall the offering of Judah and Jerusalem...
26163,43004007,43,4,7,There cometh a woman of Samaria to draw water:...
30193,58011021,58,11,21,"By faith Jacob, when he was a dying, blessed b..."
6775,7009021,7,9,21,"And Jotham ran away, and fled, and went to Bee..."


## Select Hebrew Books

I want to select only the verses whose book number, i.e. `df_verses['b']`, belongs to a book marked `OT` in the testament column of the list of books, i.e. `df_books['t']`.

In [47]:
heb_books = df_books.loc[df_books['t'] == 'OT']
heb_books.tail(5)

Unnamed: 0,b,n,t,g
34,35,Habakkuk,OT,4
35,36,Zephaniah,OT,4
36,37,Haggai,OT,4
37,38,Zechariah,OT,4
38,39,Malachi,OT,4


**Notice**: The Hebrew books go up to number 39. Let's exclude anything higher.

In [57]:
all_heb_verses = df_verses[df_verses['b']<=39] # keep only books 1-39
heb_verses = all_heb_verses[:23145] # the first 23145 rows, i.e. through Malachi 4:6
heb_verses.tail()

Unnamed: 0,id,b,c,v,t
23140,39004002,39,4,2,But unto you that fear my name shall the Sun o...
23141,39004003,39,4,3,And ye shall tread down the wicked; for they s...
23142,39004004,39,4,4,"Remember ye the law of Moses my servant, which..."
23143,39004005,39,4,5,"Behold, I will send you Elijah the prophet bef..."
23144,39004006,39,4,6,And he shall turn the heart of the fathers to ...
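
The cutoff of 39 is read off the table by eye. As a hedged alternative, the set of Old Testament book numbers can be derived from the book key itself and used as a filter. The sketch below uses plain Python lists, with a few hand-typed rows standing in for `df_books` and `df_verses`; the names `books`, `verses`, and `ot_numbers` are illustrative and not part of the notebook.

```python
# Toy stand-ins for df_books and df_verses; only a few rows, for illustration.
books = [
    {"b": 38, "n": "Zechariah", "t": "OT"},
    {"b": 39, "n": "Malachi", "t": "OT"},
    {"b": 40, "n": "Matthew", "t": "NT"},
]
verses = [
    {"id": 38001001, "b": 38},
    {"id": 39004006, "b": 39},
    {"id": 40001001, "b": 40},
]

# Collect the book numbers labelled 'OT' instead of hard-coding 39.
ot_numbers = {book["b"] for book in books if book["t"] == "OT"}

# Keep only the verses whose book number is in that set.
ot_verses = [verse for verse in verses if verse["b"] in ot_numbers]
ot_ids = [verse["id"] for verse in ot_verses]
print(ot_ids)
```

With the real dataframes, the same filter could likely be written as `df_verses[df_verses['b'].isin(heb_books['b'])]`, which would keep working if the book numbering ever changed.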


## User Input

Ask the user for the book, chapter, and verse of their chosen passage.

First, glance at the chart of all books and their numbers.

In [58]:
heb_books[['b','n']]

Unnamed: 0,b,n
0,1,Genesis
1,2,Exodus
2,3,Leviticus
3,4,Numbers
4,5,Deuteronomy
5,6,Joshua
6,7,Judges
7,8,Ruth
8,9,1 Samuel
9,10,2 Samuel


In [60]:
def book_num_to_name(n):
    """
    given the book number, produce the book name
    e.g. 1 results in 'Genesis'
    """
    return heb_books.loc[ heb_books['b'] == n ]['n'].iloc[0]
book_num_to_name(2) # for example the second book should be Exodus

'Exodus'

In [61]:
print('What is the verse you chose? Type and press enter...')

What is the verse you chose? Type and press enter...


In [62]:
user = {}
user['book_num'] = int(input('Book (select number from list above): '))
user['book_name'] = book_num_to_name(user['book_num'])
user['chap'] = int(input("Chapter: "))
user['verse'] = int(input("Verse: "))
user

Book (select number from list above): 1
Chapter: 1
Verse: 1


{'book_num': 1, 'book_name': 'Genesis', 'chap': 1, 'verse': 1}
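
`int(input(...))` raises a `ValueError` on non-numeric text, and a book number outside the table makes `book_num_to_name` fail with an `IndexError`. Here is a minimal validation sketch, assuming the 39 books of the Hebrew Bible listed above. The helper name `parse_reference` is hypothetical, and it takes strings so that it can be tried without typing at a prompt.

```python
def parse_reference(book_text, chap_text, verse_text, max_book=39):
    """
    Hypothetical helper: convert three strings to a reference dict,
    or raise ValueError with a readable message.
    max_book=39 assumes books 1-39 are the Hebrew Bible, as in the table above.
    """
    try:
        book, chap, verse = int(book_text), int(chap_text), int(verse_text)
    except ValueError:
        raise ValueError("Book, chapter, and verse must all be whole numbers.")
    if not 1 <= book <= max_book:
        raise ValueError("Book must be between 1 and {}.".format(max_book))
    if chap < 1 or verse < 1:
        raise ValueError("Chapter and verse must be positive.")
    return {"book_num": book, "chap": chap, "verse": verse}


print(parse_reference("1", "1", "1"))
```

This only checks the number ranges. Whether the chapter and verse actually exist still has to be checked against `heb_verses`.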

In [64]:
def id_to_book(verse_id):
    """
    given the verse id number (e.g. 1001001),
    produce the book number and name as a dict;
    e.g. 1002001 results in {'num': 1, 'name': 'Genesis'}
    """
    book_num = heb_verses.loc[heb_verses['id'] == verse_id]['b'].iloc[0]
    book_name = book_num_to_name(book_num)
    return {'num': book_num, 'name': book_name}
id_to_book(1002001)

{'num': 1, 'name': 'Genesis'}

In [67]:
def ref_to_id(book,chap,verse):
    """
    given the reference, i.e. book, chap, and verse numbers,
    produce the verse id;
    e.g. (1,1,1) results in 1001001
    """
    return book*1000000 + chap*1000 + verse
my_id = ref_to_id(user['book_num'],user['chap'],user['verse'])
my_id

1001001
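
The id packs the whole reference into one integer: the book number sits in the millions, the chapter in the thousands, and the verse in the last three digits. The encoding can therefore be reversed with integer division. A small sketch follows; `id_to_ref` is a hypothetical name, and it assumes chapter and verse numbers stay below 1000.

```python
def ref_to_id(book, chap, verse):
    """Same encoding as in the notebook: book, chapter, verse -> verse id."""
    return book * 1000000 + chap * 1000 + verse


def id_to_ref(verse_id):
    """Hypothetical inverse: verse id -> (book, chap, verse)."""
    book, rest = divmod(verse_id, 1000000)
    chap, verse = divmod(rest, 1000)
    return book, chap, verse


print(id_to_ref(39004006))  # (39, 4, 6), i.e. Malachi 4:6
```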

In [71]:
def id_to_row(verse_id):
    """
    given a verse id,
    produce the matching row of the df of verses;
    e.g. 1001001 results in the row for Genesis 1:1
    """
    return heb_verses.loc[ heb_verses['id'] == verse_id ] # select the row in which 'id' matches verse_id
my_row = id_to_row(my_id) # the row for the user's chosen verse
my_row

Unnamed: 0,id,b,c,v,t
0,1001001,1,1,1,In the beginning God created the heaven and th...


In [73]:
def row_to_verse(row):
    """
    given row of verse df,
    produce the verse in string format;
    e.g. my_row results in "In the beg..."
    """
    return row['t'].iloc[0] # the content of the text column of the row
my_verse = row_to_verse(my_row)
my_verse
# print(type(my_verse))
# print(my_verse)

'In the beginning God created the heaven and the earth.'

In [43]:
"""
TF-IDF stands for Term Frequency-Inverse Document Frequency.
It is a numerical statistic that reflects how important a word is to a document within a collection of documents.
"""
from sklearn.feature_extraction.text import TfidfVectorizer # convert a set of texts into a matrix of TF-IDF features
from sklearn.metrics.pairwise import linear_kernel # pairwise dot products between rows
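
To make the statistic concrete, here is a hand-rolled sketch on three toy documents. It uses raw term counts and the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the form scikit-learn documents as its default. The tokenizer (lowercase, split on whitespace) and the toy sentences are simplifying assumptions, and no stop words, n-grams, or normalization are applied.

```python
import math
from collections import Counter

# Toy documents, for illustration only.
docs = [
    "god created the heaven",
    "god created man",
    "the earth was without form",
]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    """Map each term of one document to count * idf."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}


weights = tfidf(tokenized[0])
# 'heaven' occurs in one document and 'god' in two, so 'heaven' gets the larger weight.
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

Rare terms end up with higher weights than common ones, which is what lets a distinctive word such as "created" dominate the similarity between two verses.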

In [44]:
# create an instance of the tf idf vectorizer
tf = TfidfVectorizer(analyzer='word', # features are made of word (not character) n-grams
                     ngram_range=(1, 3), # extract unigrams, bigrams, and trigrams (the range is inclusive)
                     min_df=1, # keep every term that appears in at least one verse; some recent scikit-learn versions reject an integer 0 here
                     stop_words='english' # drop common English stop words using the built-in list
                    ) 

In [45]:
# Learn vocabulary and idf
# Return term-document matrix
tfidf_matrix = tf.fit_transform(heb_verses['t']) 

In [46]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) # dot product of every pair of rows; equals cosine similarity for unit-length rows
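
`linear_kernel` computes plain dot products. It can stand in for cosine similarity here because `TfidfVectorizer` by default scales every row to unit length (its `norm='l2'` setting), and for unit-length vectors the two measures coincide. A small numeric check with made-up vectors:

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a vector to unit (L2) length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


def cosine(u, v):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up term weights for two documents.
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

cos_uv = cosine(u, v)
dot_of_units = dot(normalize(u), normalize(v))
print(round(cos_uv, 6), round(dot_of_units, 6))
```

Both numbers agree, so skipping the division by the vector lengths loses nothing once the rows are normalized.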

In [47]:
# dict in which each key is a verse id and each value is a ranked list of
# (score, verse id) tuples for the verses most similar to the key
results = {}
for idx, row in heb_verses.iterrows(): # idx is the row label, which matches the matrix row here
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1] # indices of the 99 highest scores, best first
    similar_rows = [(cosine_similarities[idx][i], heb_verses['id'][i]) for i in similar_indices] # list of (score, verse id) tuples
    # The first entry is normally the verse itself, so remove it.
    # Each dictionary entry is like: [(1,2), (3,4)], with each tuple being (score, verse_id)
    results[row['id']] = similar_rows[1:]

print('done!')

done!
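
The slice `argsort()[:-100:-1]` walks the ascending sort order backwards, so it yields the indices of the 99 highest scores, best first. The verse itself normally comes first with a score of 1 and is dropped by `similar_rows[1:]`. The same ranking idea in plain Python, on a made-up row of scores, might look as follows; `top_matches` is a hypothetical helper, and it excludes the query by index rather than by position.

```python
def top_matches(scores, query_index, k):
    """
    Hypothetical helper: return up to k (score, index) pairs,
    highest score first, leaving out the query itself.
    """
    ranked = sorted(
        (i for i in range(len(scores)) if i != query_index),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [(scores[i], i) for i in ranked[:k]]


# Made-up similarity row for verse 0 (its similarity to itself is 1.0).
row = [1.0, 0.25, 0.0, 0.13, 0.11]
print(top_matches(row, query_index=0, k=3))
```

Excluding by index is slightly more robust than dropping the first entry, since a verse made up entirely of stop words would score 0 against itself and not sort first.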


In [48]:
def get_verse_text(verse_id):
    """
    get the words of the verse, given the id
    """
    return heb_verses.loc[heb_verses['id'] == verse_id]['t'].values[0]
get_verse_text(my_id)

'In the beginning God created the heaven and the earth.'

In [49]:
def get_verse_num(verse_id):
    """
    given a verse id, produce the verse number; e.g. 1001011 results in 11
    """
    return heb_verses.loc[heb_verses['id'] == verse_id]['v'].iloc[0]
get_verse_num(1001011)

11

In [50]:
def get_chap(verse_id):
    """
    given a verse id, produce the chapter number; e.g. 1002001 results in 2
    """
    return heb_verses.loc[heb_verses['id'] == verse_id]['c'].iloc[0]
get_chap(1002001)

2

In [51]:
def cit_to_id(book_num,chap,verse):
    """
    given a book number, chap number, and verse number, produce the verse id
    """ 
    return book_num*1000000 + chap*1000 + verse
cit_to_id(1,1,1)

1001001

In [52]:
user

{'book_num': 1, 'book_name': 'Genesis', 'chap': 1, 'verse': 1}

### Time to Recommend!

In [53]:
# reads the results out of the dictionary
def recommend(user, num):
    """
    print the top num verses most similar to the user's chosen verse
    """
    verse_id = cit_to_id(user['book_num'],user['chap'],user['verse'])
    print("The top {} similar verses to {} {}:{}\n{}".format(num, id_to_book(verse_id)['name'], get_chap(verse_id), get_verse_num(verse_id), get_verse_text(verse_id)))
    print("-------")
    recs = results[verse_id][:num] # the top num items listed in the recommendations for this id
    for position, (similarity, citation) in enumerate(recs, start=1):
        rank = str(position)+'.)'
        book = id_to_book(citation)['name']
        chap = get_chap(citation)
        verse = get_verse_num(citation)
        score = str(int(similarity*100))+ "%" # whole-number percentage, truncated
        text = get_verse_text(citation)
        print()
        print(rank,score,book,str(chap)+':'+str(verse))
        print()
        print(text)
recommend(user=user, num=6)

The top 6 similar verses to Genesis 1:1
In the beginning God created the heaven and the earth.
-------

1.) 25% Genesis 1:27

So God created man in his own image, in the image of God created he him; male and female created he them.

2.) 13% Genesis 2:3

And God blessed the seventh day, and sanctified it: because that in it he had rested from all his work which God created and made.

3.) 13% Genesis 5:1

This is the book of the generations of Adam. In the day that God created man, in the likeness of God made he him;

4.) 11% Deuteronomy 4:32

For ask now of the days that are past, which were before thee, since the day that God created man upon the earth, and ask from the one side of heaven unto the other, whether there hath been any such thing as this great thing is, or hath been heard like it?

5.) 11% Genesis 5:2

Male and female created he them; and blessed them, and called their name Adam, in the day when they were created.

6.) 11% Deuteronomy 4:39

Know therefore this day, and con

In [None]:
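ds = heb_verses # assumption: `ds` is not defined above; taken here to be the dataframe of Hebrew Bible verses
type(ds['t'].iloc[0])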

In [None]:
ds['t'].iloc[0] == 'In the beginning God created the heaven and the earth.'

In [None]:
ds['t'][0]

In [None]:
df1 = ds.loc[ds['t']=='In the beginning God created the heaven and the earth.']
df1.head()

In [None]:
ds.t.str.startswith('In')

In [None]:
# ds[ds.t.str.startswith('In')]

In [None]:
keyword = input("Type the word you'd like to find. Then press enter. Your choice: ")

In [None]:
keyword

In [None]:
df_search = ds[ds['t'].str.contains(keyword)]
# ds[ds['t'].str.contains("song")]
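
`str.contains` treats the keyword as a regular expression and is case-sensitive by default, so a keyword such as `Song` would miss `song`, and characters like `(` or `?` would be read as pattern syntax. Below is a sketch of literal, case-insensitive matching with the standard library, run on three verses quoted earlier in this notebook. The `find_keyword` helper and the `sample` list are illustrative names.

```python
import re


def find_keyword(texts, keyword):
    """Return the texts that contain keyword, ignoring case, as literal text."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [text for text in texts if pattern.search(text)]


sample = [
    "In the beginning God created the heaven and the earth.",
    "This is the book of the generations of Adam. In the day that God created man, in the likeness of God made he him;",
    "Male and female created he them; and blessed them, and called their name Adam, in the day when they were created.",
]
hits = find_keyword(sample, "male")
print(len(hits), "verse(s) found")
```

In pandas, the corresponding call would be `ds['t'].str.contains(keyword, case=False, regex=False)`.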

In [None]:
df_search.head()

In [None]:
df_search[:5]

In [None]:
citation = df_search.iloc[0]['id'] # assumption: the verse id of the first search hit
book = id_to_book(citation)['name']
chap = get_chap(citation)
verse = get_verse_num(citation)


In [None]:
print("These verses contain your keyword '{}'.".format(keyword))
print()
for i, text in enumerate(df_search['t'].head(5), start=1): # at most five hits, fewer if the search found fewer
    print(str(i)+'.',text)
    print()