# Hierarchical clustering of the verses of Bhagavad-gītā to find repeated teachings
In the Vedic literature, the repetition of a proposition indicates its importance. On this basis I thought it would be interesting to collect the repetitions in *Bhagavad-gītā* to find its important teachings. Among the tools of data science, I chose hierarchical clustering to find these repetitions.

For the analysis I did not use the original verses. Instead I used a version of the verses in which the words are separated according to the external saṁdhi rules (sound changes at word boundaries).

The following code is based on the code at http://brandonrose.org/clustering.

In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.probability import FreqDist
from scipy.cluster.hierarchy import ward, dendrogram, leaves_list, to_tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

In [2]:
# Open the csv file which contains the text of the book
bg = pd.read_csv('../input/bhagavad-gita.csv')
titles = bg['title'].tolist()
texts = bg['verse_text_no_samdhis'].tolist()

In [3]:
# Create tokentype counts and frequency distribution plots for examination (optional)
full_text = ' '.join(texts)
tokens = nltk.word_tokenize(full_text)
sortedset = sorted(set(tokens))

# FreqDist counts every token in a single pass
fdist = FreqDist(tokens)

# (count, token) pairs, most frequent first
sortedset_counts = sorted(((count, tokentype) for tokentype, count in fdist.items()), reverse=True)

fdist.plot(50, cumulative=True)

In [4]:
# Define the tokenizer passed to TfidfVectorizer and collect every token into a vocabulary frame

def tokenizer(text):
    return nltk.word_tokenize(text)

totalvocab = []
for i in tqdm(texts):
    allwords_tokenized = tokenizer(i)
    totalvocab.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab})

In [5]:
# Define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                 min_df=1, use_idf=True,
                                 tokenizer=tokenizer, ngram_range=(1,3))

In [6]:
# Fit the vectorizer to texts
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
print(tfidf_matrix.shape)
tfidf_matrix_array = tfidf_matrix.toarray()
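
To make the effect of `ngram_range=(1,3)` concrete, here is a minimal sketch on three made-up token strings. The toy texts and the names `toy_texts`, `toy_vectorizer` and `toy_matrix` are illustrative assumptions, not part of the dataset, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy "verses" made of placeholder tokens; not taken from the dataset
toy_texts = [
    "a b c",
    "a b d",
    "e f g",
]

# Whitespace split stands in for nltk.word_tokenize (an assumption for this sketch)
toy_vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None,
                                 lowercase=False, ngram_range=(1, 3))
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# One row per text, one column per distinct unigram, bigram or trigram
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

The first two toy texts share the unigrams `a` and `b` and the bigram `a b`, so shared word sequences contribute to the similarity as well as shared single words.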

In [7]:
# Get the distance matrix
dist = 1 - cosine_similarity(tfidf_matrix)
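
As a small illustration of what `1 - cosine_similarity(...)` measures, the sketch below uses three made-up vectors (the array `toy_vectors` is a hypothetical example, not data from the notebook). Vectors pointing in the same direction get a distance near 0 regardless of their length, and orthogonal vectors get a distance of 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy vectors: rows 0 and 1 point the same way, row 2 is orthogonal to both
toy_vectors = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Square matrix of pairwise cosine distances
toy_dist = 1 - cosine_similarity(toy_vectors)
print(np.round(toy_dist, 3))
```

For TF-IDF rows, which contain no negative values, the distance always lies between 0 (same direction) and 1 (no shared terms).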

In [8]:
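# Hierarchical clustering
# Build the linkage matrix with Ward's method.
# Note: dist is a square 2-D array, and SciPy treats a 2-D input as observation
# vectors (one row per verse), not as pre-computed distances. Passing pre-computed
# distances would require the condensed 1-D form instead.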
linkage_matrix = ward(dist)
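
SciPy's `ward` accepts either a condensed 1-D distance vector or a 2-D array of observation vectors, and the two inputs are interpreted differently. The sketch below shows the difference on a made-up 4×4 distance matrix. The names `toy_square`, `toy_condensed`, `linkage_from_distances` and `linkage_from_rows` are illustrative assumptions. Note also that Ward's method is, strictly speaking, defined for Euclidean distances, so applying it to cosine distances is a heuristic.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import ward
from scipy.spatial.distance import squareform

# Hypothetical toy square distance matrix for 4 items (symmetric, zero diagonal)
toy_square = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# Condensed form: the upper triangle flattened into a vector of length n*(n-1)/2
toy_condensed = squareform(toy_square, checks=False)

# A 1-D condensed vector is read as pre-computed pairwise distances
linkage_from_distances = ward(toy_condensed)

# A 2-D array is read as observation vectors, one row per item;
# SciPy may warn that the input looks like a distance matrix, so the warning is silenced here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    linkage_from_rows = ward(toy_square)

# Same first pair, but different merge heights
print(linkage_from_distances[0])
print(linkage_from_rows[0])
```

In the first call, the first merge height is the smallest entry of the distance matrix (0.1). In the second call it is the Euclidean distance between rows 0 and 1 of that matrix. On this toy matrix both calls merge the same pair first, but that is not guaranteed in general.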

In [9]:
# Plot the dendrogram

# Set size
fig, ax = plt.subplots(figsize=(25, 100))
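# Draw on the axes created above; the returned dict is kept under its own name so ax is not overwritten
dendrogram_data = dendrogram(linkage_matrix, orientation="right", labels=titles, distance_sort=True, leaf_font_size=10, ax=ax)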

# Show plot with tight layout
plt.tight_layout()

In [10]:
# Create a function to get the titles of verses and their distance from the rows of the linkage matrix
linkage_matrix_tree = to_tree(linkage_matrix, rd=True)
n_leaves = len(titles)

def get_titles(node_id):
    # Ids below n_leaves are single verses; larger ids are clusters formed by earlier merges
    if node_id < n_leaves:
        return titles[node_id]
    leaf_ids = linkage_matrix_tree[1][node_id].pre_order()
    return [titles[leaf_id] for leaf_id in leaf_ids]

def get_verses(linkage_matrix_row_number):
    # Each row holds: first node id, second node id, merge distance, cluster size
    first_node_id, second_node_id, distance = linkage_matrix[linkage_matrix_row_number][:3]
    return [get_titles(int(first_node_id)), get_titles(int(second_node_id)), distance]

In [11]:
# Print the first 40 merges (the lowest merge distances): pairs of verses, or a verse or cluster joined with another
for i in range(40):
    print(get_verses(i))
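
To show how rows of a linkage matrix map back to labels, here is a minimal sketch on four made-up one-dimensional points. The names `toy_points`, `toy_labels` and `toy_linkage` are hypothetical stand-ins for the TF-IDF rows, the verse titles and `linkage_matrix`.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, to_tree

# Hypothetical toy data: four 1-D points forming two tight pairs
toy_points = np.array([[0.0], [0.1], [5.0], [5.3]])
toy_labels = ["v1", "v2", "v3", "v4"]  # stand-ins for verse titles

toy_linkage = ward(toy_points)
root, nodes = to_tree(toy_linkage, rd=True)

# Each row is one merge: two node ids, the merge distance, and the new cluster size.
# pre_order() returns the original item ids under a node (just the id itself for a leaf).
for left_id, right_id, distance, size in toy_linkage:
    left = [toy_labels[i] for i in nodes[int(left_id)].pre_order()]
    right = [toy_labels[i] for i in nodes[int(right_id)].pre_order()]
    print(left, right, round(float(distance), 3), int(size))
```

The rows come out in order of increasing merge distance, which is why reading the first rows of the linkage matrix lists the closest items first.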

The most similar verses are 9.34 and 18.65. This result agrees with the principle that in the Vedic literature the repetition of a proposition indicates its importance: these verses contain the most important teachings of *Bhagavad-gītā*, namely that one should always think of God, Kṛṣṇa, be His devotee, worship Him, and offer obeisances to Him, and thus one will come to Him without fail.