<a href="https://colab.research.google.com/github/kcalizadeh/phil_nlp/blob/main/data_load_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Introduction

A book of philosophy represents an effort to systematically organize one's thoughts about the world. Using data from the history of philosophy to classify texts thus enables us, by proxy, to classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological, analysis.

This project uses 51 texts spanning 10 schools of philosophical thought. Based on these, we develop classification models, word vectors, and general EDA. These can then be used to understand users' worldviews by comparing them to historical schools of thought. And once we understand a person's worldview, there is no limit to what we can do with that information, from advertising to political campaigning through to self-exploration and therapy.

This notebook contains the first steps of that project, where we load the 51 texts in the corpus, clean them, and then produce and export a dataframe for use in modeling.

### Imports and Mounting Drive

In [None]:
# this cell mounts drive, sets the correct directory, then imports all functions
# and relevant libraries via the functions.py file
from google.colab import drive
import sys

# install relevant libraries not included with colab
!pip install lime
!pip install symspellpy
!pip install gensim
drive.mount('/gdrive',force_remount=True)

drive_path = '/gdrive/MyDrive/Colab Notebooks'

sys.path.append(drive_path)

Mounted at /gdrive


In [None]:
from functions import *
%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import spacy.cli
spacy.cli.download("en_core_web_lg")
import en_core_web_lg
nlp = en_core_web_lg.load()

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### Load the Texts

In [None]:
import re
import os
from pathlib import Path


def remove_page_references(text):
    """Remove page references like [p. 84]"""
    text = re.sub(r'\[p\.\s*\d*\]', '', text)
    return text
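
As a quick check of the helper above, the following sketch runs it on a made-up sentence (the sample string is an assumption, not taken from the corpus). Note that removing a reference leaves the surrounding spaces behind, so a later whitespace-collapsing step is still needed.

```python
import re


def remove_page_references(text):
    """Remove page references like [p. 84]."""
    return re.sub(r'\[p\.\s*\d*\]', '', text)


# Hypothetical sample text, for illustration only
sample = "The path is long [p. 84] and the gate is narrow [p.12]."
cleaned = remove_page_references(sample)

# The bracketed references are gone, but doubled spaces remain
print(repr(cleaned))
```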

With the functions loaded, we bring in the various texts. For access to them via Google Drive, use this [link](https://drive.google.com/drive/folders/1OdTQzRboTOozJqX1INJoYljuA4ctttx8?usp=sharing).

In [None]:
def clean_tanakh(text):
    """Clean Tanakh"""

    # Remove all numbers
    text = re.sub(r'\d+', '', text)

    # Remove footnotes (single letters)
    text = re.sub(r'\b[a-c]\b', '', text)

    # Remove book markers
    text = re.sub(r'BOOK\s+[IVX]+', '', text, flags=re.IGNORECASE)


    return text


def clean_book_certitude(text):
    """Clean Book of Certitude"""
    # Remove Quran references
    text = re.sub(r'\[Qurían\s+\d+:\d+\.\]', '', text)
    return text


def clean_nature_gods(text):
    """Clean Nature Gods"""
    # Remove Roman numeral markers
    text = re.sub(r'"\s*[IVX]+\s*"\.', '', text)
    return text

def clean_kaivalya_upanishad(text):
    """Clean Kaivalya Upanishad"""
    # Remove numbered sections
    text = re.sub(r'^\d+\.\s*', '', text, flags=re.MULTILINE)
    return text

def clean_kybalion(text):
    """Clean Kybalion"""
    # Remove Kybalion attribution
    text = re.sub(r'--The Kybalion\.', '', text)

    # Convert -- to comma
    text = text.replace('--', ',')

    # Don't convert to capitals (keep original case)
    return text


def clean_secret_teachings_jesus(text):
    """Clean Secret Teachings of the Society of Jesus"""
    # Remove bank information
    text = re.sub(r'BANK of WISDOM.*?40201', '', text, flags=re.DOTALL | re.IGNORECASE)

    # Remove numbered sections
    text = re.sub(r'^\d+\.\s*', '', text, flags=re.MULTILINE)


    return text

In [None]:

def clean_epistle_son_wolf(text):
    """Clean Epistle to the Son of the Wolf"""
    # Remove questions and exclamations patterns
    text = re.sub(r'[?!]+\s*', ' ', text)
    return text

def clean_bhagavad_gita(text):
    """Clean Bhagavad Gita"""
    # Remove section and chapter markers
    text = re.sub(r'Section\s+"?\d+"?', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Chapter\s+"?\d+"?', '', text, flags=re.IGNORECASE)
    text = re.sub(r'~\s*Chapter\s+"?\d+"?\s*~', '', text, flags=re.IGNORECASE)
    text = re.sub(r'End of Chapter\s+\d+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Svetasvatara|Prasna-Upanishad', '', text, flags=re.IGNORECASE)

    # Remove quotes section
    text = re.sub(r'And here they quote:', '', text, flags=re.IGNORECASE)

    # Convert &c. to etc.
    text = text.replace('&c.', 'etc.')

    # Remove verse references like 6-2:13, 2:13
    text = re.sub(r'\d+-?\d*:\d+', '', text)

    # Remove ordinal numbers and religious terms
    text = re.sub(r'\b(First|Second|Third|Fourth|Fifth|Sixth|Seventh|Eighth|Ninth|Tenth)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(Brahmana|Adhyaya|Anuvaka|Valli|Prapathaka|Question|Khanda)\b', '', text, flags=re.IGNORECASE)

    # Remove dialogue patterns
    text = re.sub(r'…\s*asked him', '', text, flags=re.IGNORECASE)
    text = re.sub(r'He replied:', '', text, flags=re.IGNORECASE)

    # Remove 'Name from Place' attributions (two capitalized words joined by 'from')
    text = re.sub(r'\b[A-Z][a-z]*\s+from\s+[A-Z][a-z]*\b', '', text)

    return text

def clean_augustine_city_god(text):
    """Clean Augustine City of God and Christian Doctrine"""
    # Remove chapter markers
    text = re.sub(r'Chapter\s+\d+\.—', '', text, flags=re.IGNORECASE)

    # Remove content between underscores (section separators)
    text = re.sub(r'_{10,}.*?_{10,}', '', text, flags=re.DOTALL)

    return text

def clean_dhammapada(text):
    """Clean Dhammapada"""
    # Remove numbered verses
    text = re.sub(r'^\d+\.\s*', '', text, flags=re.MULTILINE)
    text = text.replace('II', '')
    return text

def clean_siri_guru_granth_sahib(text):
    """Clean Siri Guru Granth Sahib"""
    # Remove verse markers
    text = re.sub(r'\|\|"\d+"\|\|', '', text)
    text = re.sub(r'\|Pause\|\|', '', text)
    return text

def clean_tao_te_ching(text):
    """Clean Tao Te Ching"""
    # Remove part markers
    text = re.sub(r'PART\s+[IVX]+\.', '', text, flags=re.IGNORECASE)
    return text

def clean_analects(text):
    """Clean Analects"""
    # Remove chapter markers with Roman numerals
    text = re.sub(r'CHAP\.\s+[IVX]+\.', '', text, flags=re.IGNORECASE)
    return text

In [None]:
def clean_way_of_virtue_buddha_path_light(text):
    """Clean Way of Virtue (wov), The Life of Buddha, The Path of Light (tpol)"""
    # Remove page references
    text = remove_page_references(text)

    # Ignore capitals (don't remove capital lines)
    return text



def clean_kitab_aqdas(text):
    """Clean Kitab I Aqdas - clean all"""
    # Comprehensive cleaning
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[()]', '', text) # Remove parentheses
    text = remove_page_references(text)  # Remove page refs
    return text

def clean_vedic_hymns(text):
    """Clean Vedic Hymns - clean all"""
    # Comprehensive cleaning
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = remove_page_references(text)  # Remove page refs
    text = re.sub(r'\[.*?\]', '', text)  # Remove bracket references
    return text

def clean_vedanta_sutras(text):
    """Clean Vedanta Sutras - clean all"""
    # Comprehensive cleaning
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = remove_page_references(text)  # Remove page refs
    text = re.sub(r'\[.*?\]', '', text)  # Remove bracket references
    return text

def clean_gathas_zoroastrianism(text):
    """Clean Gathas Zoroastrianism"""
    # Remove verse markers with single quotes around numbers
    text = re.sub(r"Verse\s+['\"]?\d+['\"]?\s*:", '', text, flags=re.IGNORECASE)

    # Remove yasna markers with single quotes around numbers
    text = re.sub(r"Yasna\s+['\"]?\d+['\"]?\s*:", '', text, flags=re.IGNORECASE)

    # Remove standalone verse and yasna references
    text = re.sub(r"Verse\s+['\"]?\d+['\"]?", '', text, flags=re.IGNORECASE)
    text = re.sub(r"Yasna\s+['\"]?\d+['\"]?", '', text, flags=re.IGNORECASE)

    # Remove any remaining numbered sections
    text = re.sub(r'^\d+\.\s*', '', text, flags=re.MULTILINE)

    return text

In [None]:
folder_path = drive_path + '/religion/'
advaita_vedanta = get_text(folder_path + 'Advaita_Vedanta.txt')
analects = get_text(folder_path + 'Analects.txt')
bhagavad_gita = get_text(folder_path + 'Bhagavad Gita.txt')
collected_fruits = get_text(folder_path + 'Collected Fruits of Occult Teaching by A.P.Sinnett (1920).txt')
kaivalya_upanishad = get_text(folder_path + 'Kaivalya_Upanishad.txt')
higher_worlds = get_text(folder_path + 'Knowledge of the Higher Worlds - by Rudolf Steiner.txt')
kybalion = get_text(folder_path + 'Kybalion.txt')
popol_vuh = get_text(folder_path + 'Popol Vuh.txt')
secret_teachings = get_text(folder_path + 'Secret Teachings of the Society of Jesus.txt')
guru_granth_sahib = get_text(folder_path + 'Siri Guru Granth Sahib.txt')
tanakh = get_text(folder_path + 'Tanakh.txt')
tao_te_ching = get_text(folder_path + 'Tao Te Ching.txt')
kitab_aqdas = get_text(folder_path + 'The Kitab I Aqdas.txt')
life_of_buddha = get_text(folder_path + 'The Life of Buddha.txt')
city_of_god = get_text(folder_path + 'augustine city of god and christian doctrine.txt')
dhammapada = get_text(folder_path + 'dhammapada.txt')
epistle_son_wolf = get_text(folder_path + 'epistle to the son of the wolf.txt')
gathas = get_text(folder_path + 'gathas zoroastrianism.txt')
book_of_certitude = get_text(folder_path + 'kitab i ilqan book of certitude.txt')
nature_gods = get_text(folder_path + 'nature-gods.txt')
tpol = get_text(folder_path + 'tpol.txt')
vedic_hymns = get_text(folder_path + 'vedic_hymns_readable.txt')
vedanta_sutras = get_text(folder_path + 'vendanta sutras patanjali.txt')
wov = get_text(folder_path + 'wov.txt')
yoga_sutras = get_text(folder_path + 'yoga sutras patanjali.txt')

# Apply specific cleaning functions to each text
analects = clean_analects(analects)
bhagavad_gita = clean_bhagavad_gita(bhagavad_gita)
city_of_god = clean_augustine_city_god(city_of_god)
dhammapada = clean_dhammapada(dhammapada)
epistle_son_wolf = clean_epistle_son_wolf(epistle_son_wolf)
gathas = clean_gathas_zoroastrianism(gathas)
book_of_certitude = clean_book_certitude(book_of_certitude)
kaivalya_upanishad = clean_kaivalya_upanishad(kaivalya_upanishad)
kitab_aqdas = clean_kitab_aqdas(kitab_aqdas)
kybalion = clean_kybalion(kybalion)
life_of_buddha = clean_way_of_virtue_buddha_path_light(life_of_buddha)
nature_gods = clean_nature_gods(nature_gods)
secret_teachings = clean_secret_teachings_jesus(secret_teachings)
guru_granth_sahib = clean_siri_guru_granth_sahib(guru_granth_sahib)
tanakh = clean_tanakh(tanakh)
tao_te_ching = clean_tao_te_ching(tao_te_ching)
tpol = clean_way_of_virtue_buddha_path_light(tpol)
vedanta_sutras = clean_vedanta_sutras(vedanta_sutras)
vedic_hymns = clean_vedic_hymns(vedic_hymns)
wov = clean_way_of_virtue_buddha_path_light(wov)

hindu_texts = [
    advaita_vedanta, bhagavad_gita, kaivalya_upanishad,
    vedanta_sutras, yoga_sutras, vedic_hymns
]

buddhist_texts = [
    life_of_buddha, dhammapada, tpol
]

christian_texts = [
    city_of_god, secret_teachings
]

islamic_texts = [

]

bahai_texts = [
    book_of_certitude, kitab_aqdas, epistle_son_wolf
]

jewish_texts = [
    tanakh
]

sikh_texts = [
    guru_granth_sahib
]

zoroastrian_texts = [
    gathas
]

confucian_texts = [
    analects
]

taoist_texts = [
    tao_te_ching
]

indigenous_texts = [
    popol_vuh
]

occult_esoteric_texts = [
    collected_fruits, higher_worlds, kybalion, nature_gods, wov
]

text_dict_list = {
    'hinduism': hindu_texts,
    'buddhism': buddhist_texts,
    'christianity': christian_texts,
    'islam': islamic_texts,
    'bahai': bahai_texts,
    'judaism': jewish_texts,
    'sikhism': sikh_texts,
    'zoroastrianism': zoroastrian_texts,
    'confucianism': confucian_texts,
    'taoism': taoist_texts,
    'indigenous': indigenous_texts,
    'occult_esoteric': occult_esoteric_texts
}

Now we cut out front and end-matter. This needs to be done ad hoc, since there is no consistent marker for it.
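A minimal sketch of what such ad hoc trimming could look like is given here. The helper name `trim_between` and the marker strings are hypothetical; in practice each text needs its own markers, found by inspecting the file.

```python
def trim_between(text, start_marker, end_marker):
    """Keep only the text after start_marker and before end_marker.

    If a marker is not found, that side of the text is left untouched.
    """
    start = text.find(start_marker)
    if start != -1:
        text = text[start + len(start_marker):]
    # search from the right so an early mention of the end marker is ignored
    end = text.rfind(end_marker)
    if end != -1:
        text = text[:end]
    return text.strip()


# Hypothetical example: the markers below are illustrative, not from the corpus
sample = "Title page. PREFACE ends here. CHAPTER I The body of the book. INDEX A, B, C"
body = trim_between(sample, "CHAPTER I", "INDEX")
print(body)
```

Because a missing marker leaves the text unchanged, the same helper can be applied to texts that only have front matter or only have end matter.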

Having isolated the relevant portions of each document, we can now unify all the texts in each school.

In [None]:
all_texts = sum(text_dict_list.values(), [])
all_texts_string = ' . '.join(all_texts)

text_dict = {}
for school in text_dict_list.keys():
    text_dict[school] = ' . '.join(text_dict_list[school])

### Preliminary EDA

For a bit of preliminary EDA, we can make word clouds for each school. Here we prepare the text for this with some very basic cleaning to remove encoding artifacts and the like. Then we build the word clouds and present them.

In [None]:
# some basic initial cleaning
# repair words that were split where an 'fi' ligature was lost ('de nite' is 'definite')
all_texts_string = all_texts_string.replace('signi cance', 'significance').replace('obj ects', 'objects').replace('de nite', 'definite').replace('j ust', 'just')

for school in text_dict.keys():
    text_dict[school] = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\xad\x0c6§\[\]\\\£\Â\n\r]', '', text_dict[school])
    text_dict[school] = re.sub(r'[0123456789]', ' ', text_dict[school])
    text_dict[school] = text_dict[school].replace('signi cance', 'significance').replace('obj ects', 'objects').replace('de nite', 'definite').replace('j ust', 'just')

In [None]:
stopwords_list = stopwords.words('english') + list(string.punctuation) + ['“','”','...',"''",'’','``', "'", "‘"]
custom_stopwords = ['–', 'also', 'something', 'cf', 'thus', 'two', 'now', 'would', 'make', 'eb', 'u', 'well', 'even', 'said', 'eg', 'us',
                    'n', 'sein', 'e', 'da', 'therefore', 'however', 'would', 'thing', 'must', 'merely', 'way', 'since', 'latter', 'first',
                    'B', 'mean', 'upon', 'yet', 'cannot', 'c', 'C', 'let', 'may', 'might', "'s", 'b', 'ofthe', 'p.', '_', '-', 'eg', 'e.g.',
                    'ie', 'i.e.', 'f', 'l', "n't", 'e.g', 'i.e', '—', '--', 'hyl', 'phil', 'one', 'press', 'cent', 'place'] + stopwords_list

In [None]:
cloud_dict = {}
for school, text in text_dict.items():
    # skip schools with no text (the 'islam' list is currently empty), since
    # building a word cloud from zero words raises a ValueError
    if not text.strip():
        continue
    cloud_dict[school] = make_word_cloud(text, custom_stopwords)

# placeholder clouds, built once rather than on every pass through the loop
cloud_dict['middle1'] = make_word_cloud('this page intentionally left blank', stopwords=[])
cloud_dict['middle2'] = make_word_cloud('this page intentionally left blank', stopwords=[])

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(25, 14))

fig.suptitle('Word Clouds for Spiritual Traditions', size=40, fontweight='bold')
fig.tight_layout(rect=[0, 0, 1, 0.95])

# (key in cloud_dict, panel title) for each of the six panels
panels = [
    ('hinduism', 'Hinduism Word Cloud'),
    ('buddhism', 'Buddhism Word Cloud'),
    ('christianity', 'Christianity Word Cloud'),
    ('islam', 'Islam Word Cloud'),
    ('occult_esoteric', 'Occult / Esotericism Word Cloud'),
    ('indigenous', 'Indigenous / Popol Vuh Word Cloud'),
]

for ax, (school, title) in zip(axes.flat, panels):
    # only draw a cloud if one was built for this school
    # (the key is 'occult_esoteric', and 'islam' currently has no texts)
    if school in cloud_dict:
        ax.imshow(cloud_dict[school])
    ax.set_title(title, size=25, pad=20, fontweight='bold')
    ax.axis('off')

fig.patch.set_facecolor('#D1D1D1')
plt.show()

Note that these word clouds are the result of applying many custom stopwords. Those words were words that had significant overlap between schools and so could not be used to distinguish them.

One of these common stopwords was the word 'one.' I guess Plato was right about it being the central concept of philosophy.
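One way to find such overlapping words is to take the most frequent words of each school and intersect them. The sketch below is an assumption about how this could be done, not the procedure actually used to build `custom_stopwords`; the helper name `shared_top_words` and the toy token lists are hypothetical.

```python
from collections import Counter


def shared_top_words(school_tokens, top_n=2):
    """Return the words that appear in the top_n most frequent words of every school."""
    top_sets = [
        {word for word, _ in Counter(tokens).most_common(top_n)}
        for tokens in school_tokens.values()
    ]
    return set.intersection(*top_sets) if top_sets else set()


# Toy data for illustration only
school_tokens = {
    'school_a': "one one one tao tao way".split(),
    'school_b': "one one one god god lord".split(),
}
shared = shared_top_words(school_tokens, top_n=2)
print(shared)
```

Words returned this way are frequent everywhere, so they carry little signal for telling the schools apart.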

Even after this pruning, a lot of the same words show up in all the schools. But there are also a good number of differences between the schools as well, enough that a model is not *prima facie* a lost cause.

Our next step is to explore the frequency distribution of words in the corpus, both in the texts as a whole and in the individual schools.

In [None]:
all_texts_string = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\xad\x0c6§\[\]\\\£\Â\n\r]', ' ', all_texts_string)
all_texts_string = re.sub(r'[0123456789]', ' ', all_texts_string)

In [None]:
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
all_text_words = word_tokenize(all_texts_string)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
cleaned_words = [x.lower() for x in all_text_words if x.lower() not in custom_stopwords]
freq_dist = FreqDist(cleaned_words)

This more or less matches all the major topics of philosophy - ontology ('things'), human nature ('man'), logic and rationality ('reason'), truth ('true'), language ('say'). Of course these also reflect common turns of phrase in philosophical texts; these texts often discuss what is true and what others say, for example.

Now let's take a look at frequency distributions for each school.

In [None]:
uninformative_words = ['else', 'shall', 'either', 'still', 'rather', 'another', 'made', 'without']
school_stopwords = custom_stopwords + [x[0] for x in freq_dist.most_common(50)] + uninformative_words

A lot of this is unsurprising - no one should be baffled that the word 'socrates' turns up a lot in Plato. Others are perhaps more interesting - the appearance of 'animal' in Aristotle, for example. And yet others are just artifacts of the texts we had available - the word 'madness' is common in Continental philosophy because of Foucault's book *History of Madness*. That said, some of these are comical as well. For example, analytic philosophy's tendency to logical formalism led to it having variables for propositions ('p' and 'x') among its most commonly used words.

Still, these each seem substantially different enough that we should be able to build a reasonable model. Although we could leap to building a model right now, let's do a little more exploring by examining the bigrams.

In [None]:
# tokenizing the text to prepare it
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
all_tokens = nltk.regexp_tokenize(all_texts_string, pattern)

all_tokens_stopped = [x.lower() for x in all_tokens if x.lower() not in custom_stopwords]
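
For reference, bigram counts over a stopped token list can be sketched with the standard library alone. This is a plain count-based stand-in on a toy token list, not the NLTK collocation finder used later in this notebook; the sample tokens and stopword set are assumptions.

```python
from collections import Counter

# Toy tokens for illustration only
tokens = ["the", "holy", "spirit", "and", "the", "holy", "one",
          "of", "the", "holy", "spirit"]
stop = {"the", "and", "of"}

# Remove stopwords first, as in the cell above; note that this makes words
# adjacent that were separated by a stopword in the original text
tokens_stopped = [t for t in tokens if t not in stop]

# Pair each token with its successor and count the pairs
bigram_counts = Counter(zip(tokens_stopped, tokens_stopped[1:]))
top_bigram = bigram_counts.most_common(1)
print(top_bigram)
```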

While this isn't particularly informative, it does show us some common phrases used throughout the history of philosophy.

Next we can create similar charts for each school to see which schools use which phrases preferentially.

In [None]:
import os, re
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS



# Prepare a base set of English stop words (and add a few old-fashioned terms).
# Named english_stopwords so that it does not overwrite the `stopwords` object
# used earlier in `stopwords.words('english')`.
english_stopwords = set(ENGLISH_STOP_WORDS)
# Optionally add domain-specific common words (archaic pronouns, etc.)
english_stopwords |= {"thou", "thee", "thy", "ye", "say", "said", "shall", "also", "could"}

for filename in os.listdir(folder_path):
    if not filename.lower().endswith(".txt"):
        continue
    filepath = os.path.join(folder_path, filename)
    with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
        text = f.read()

    # Strip Gutenberg header/trailer if present
    start = text.find('*** START OF THIS PROJECT GUTENBERG')
    if start != -1:
        text = text[start:]
    end = text.find('*** END OF THIS PROJECT GUTENBERG')
    if end != -1:
        text = text[:end]

    # Lowercase and remove punctuation
    text = text.lower()
    # Replace punctuation with spaces (so words stay separated)
    text = re.sub(r'[\W_]+', ' ', text)

    # Tokenize into words
    tokens = re.findall(r'\b[a-z]+\b', text)

    # Filter out standard stop words
    tokens = [w for w in tokens if w not in english_stopwords]

    # Count word frequencies
    freq = Counter(tokens)

    # Take the top N frequent words as custom stop words (e.g., N=20)
    top_n = 20
    most_common = [word for word, count in freq.most_common(top_n)]

    print(f"{filename}: {most_common}")


yoga sutras patanjali.txt: ['spiritual', 'man', 'life', 'consciousness', 'mind', 'self', 'power', 'soul', 'things', 'powers', 'psychic', 'comes', 'psychical', 'nature', 'divine', 'come', 's', 'eternal', 'true', 'body']
epistle to the son of the wolf.txt: ['god', 'hath', 'unto', 'o', 'lord', 'men', 'verily', 'things', 'wronged', 'world', 'day', 'earth', 'people', 'revelation', 'cause', 'saith', 'great', 'glory', 'truth', 'words']
Bhagavad Gita.txt: ['self', 'brahman', 'man', 'o', 'world', 'food', 'mind', 'knows', 'breath', 'body', 'section', 'let', 'does', 'having', 'sun', 'knowledge', 'know', 'speech', 'earth', 'like']
augustine city of god and christian doctrine.txt: ['god', 'things', 'man', 'men', 'gods', 's', 'good', 'life', 'body', 'christ', 'chapter', 'says', 'time', 'lord', 'great', 'soul', 'world', 'does', 'earth', 'death']
Tao Te Ching.txt: ['tao', 'things', 'does', 'men', 'great', 'heaven', 'people', 'know', 'sage', 'state', 'like', 's', 'way', 'place', 'world', 'life', 'knows

In [None]:
# custom stopwords for each school since there are odd phrases in each
stopwords_dict = {
    # Christianity
    'christianity': [
        'god', 'man', 'lord', 'christ', 'life', 'soul', 'world', 'earth', 'men', 'great'
    ],

    # Vedanta / Hinduism
    'hinduism': [
        'brahman', 'self', 'knowledge', 'soul', 'world', 'body', 'meditation', 'nature', 'truth', 'atman'
    ],

    # Islam / Quran
    'islam': [
        'god', 'lord', 'people', 'prophet', 'quran', 'day', 'revelation', 'truth', 'believe', 'mercy'
    ],

    # Buddhism
    'buddhism': [
        'buddha', 'life', 'man', 'monk', 'law', 'sacred', 'mind', 'spiritual', 'time', 'body'
    ],

    # Judaism / Kabbalah (key matches the 'judaism' key used in text_dict)
    'judaism': [
        'god', 'world', 'holy', 'light', 'torah', 'israel', 'rabbi', 'king', 'blessed', 'genesis'
    ],

    # Zoroastrianism
    'zoroastrianism': [
        'man', 'life', 'world', 'god', 'mazda', 'spirit', 'soul', 'earth', 'righteousness', 'evil'
    ],

    # Indigenous
    'indigenous': [
        'lords', 'people', 'men', 'earth', 'great', 'tribes', 'house', 'came', 'called', 'balam'
    ],

    # Occult / Esoteric (key matches the 'occult_esoteric' key used in text_dict)
    'occult_esoteric': [
        'life', 'world', 'spiritual', 'knowledge', 'divine', 'nature', 'energy', 'time', 'people', 'great'
    ],

    # Modern Spirituality
    'modern_spirituality': [
        'life', 'energy', 'nature', 'water', 'earth', 'temperature', 'natural', 'forest', 'oxygen', 'quality'
    ]
}




# building the bigrams for each school
measures = BigramAssocMeasures()
bigram_dict = {}
for school in text_dict.keys():
    tokens = nltk.regexp_tokenize(text_dict[school], pattern)
    # not every school has an entry in stopwords_dict (e.g. 'bahai'), so fall
    # back to an empty list instead of raising a KeyError
    school_stop_set = set(custom_stopwords + stopwords_dict.get(school, []))
    tokens_stopped = [x.lower() for x in tokens if x.lower() not in school_stop_set]
    finder = BigramCollocationFinder.from_words(tokens_stopped)
    scored = finder.score_ngrams(measures.raw_freq)
    bigram_dict[school] = scored

Not bad! These bigrams actually give a solid sense of the key concepts of some of the schools. Rationalism's emphasis on 'clear and distinct ideas' and empiricism's focus on 'simple and complex ideas' are so forceful that both singular and plural versions of the phrases show up. Heidegger's common phrase 'always already' finds itself on the Phenomenology bigram list, and the German Idealist's continual discussion of self-consciousness is also evident. Aristotle and Plato's bigrams are harder to interpret, with Plato's being mostly common locutions from the dialogues rather than anything philosophically significant.

### More In-Depth Cleaning

All the previous explorations were done with some basic cleaning methods like removing stopwords. And while we brushed over it thus far, in the process of dealing with the data we encountered a lot of oddities. These include
- words fused together (e.g., 'aconcept' for 'a concept')
- headers of pages occurring repeatedly in the text
- page numbers and citation numbers
- footnotes, roman numerals, titles of chapters

All these would need to be removed if we are to train a model on the actual content of these thinkers, and especially if we want a neural network to do any kind of predictive work where it will look at full sentences.

The process of dealing with these and getting the data ready for our models has a few steps:
1. develop a general cleaning function that can work for every text (removing roman numerals, for example)
2. examine each text itself and remove the specific headers that are relevant to it
  - look for features that could capture all the footnotes here as well
3. tokenize the text using spacy
4. examine the tokens for unusual patterns
  - there should be virtually no duplicate sentences
  - we can remove sentences that are too short to mean anything
  - remove sentences that contain terms that must be from footnotes (the author's name should be very rare in the actual text, for example)

Unfortunately many of these steps can only be done ad hoc; there is no real way to know whether and what headers are in a text without examining the files individually. So the process is a bit tedious and time-consuming. Still, when we finish we will have data that is much cleaner and more useful for modeling.
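
The token-level checks in step 4 can be sketched as a single filtering pass. This is a minimal illustration under stated assumptions: the helper name `filter_sentences`, the `min_words` threshold, and the sample sentences are all hypothetical, and the sentences are assumed to have already been split (for example by spacy).

```python
def filter_sentences(sentences, min_words=4, banned_terms=()):
    """Drop duplicate sentences, very short sentences, and sentences
    containing terms that suggest footnote or editorial material."""
    seen = set()
    kept = []
    for sentence in sentences:
        # normalize case and whitespace so near-identical copies count as duplicates
        normalized = " ".join(sentence.lower().split())
        if normalized in seen:
            continue
        seen.add(normalized)
        if len(normalized.split()) < min_words:
            continue
        if any(term.lower() in normalized for term in banned_terms):
            continue
        kept.append(sentence)
    return kept


# Hypothetical sample sentences, for illustration only
sample = [
    "The sage does not contend with the world.",
    "The sage does not contend  with the world.",
    "See p.",
    "Translator's note: compare the earlier edition.",
    "Heaven and earth endure for a long time.",
]
kept = filter_sentences(sample, min_words=4, banned_terms=["translator's note"])
print(kept)
```

The threshold and the banned terms would need to be tuned per text, in keeping with the ad hoc nature of the process described above.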


#### 1. Universal Cleaning Steps

In [None]:
def baseline_clean(to_correct, capitals=True, bracketed_fn=False, odd_words_dict={}):
    # remove utf8 encoding characters and expanded special characters
    result = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\xad\x0c6§\\\£\Â*_<>""⎫•{}Γ~]', ' ', to_correct)

    # remove additional special characters and symbols
    result = re.sub(r'[©®™†‡§¶•‰¿¡÷±≠≤≥∞∑∏√∫∂∆∇Ω∪∩⊂⊃⊄⊅∈∉∅∀∃∧∨¬→←↑↓↔⇄⇒⇐⇑⇓⇔]', ' ', result)

    # remove mathematical and scientific symbols
    result = re.sub(r'[αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ]', ' ', result)

    # remove currency symbols
    result = re.sub(r'[¢€¥₹₽₩₨₪₫₡₢₵₶₷₸₹₺₻₼₽₾₿]', ' ', result)

    # remove various dashes and hyphens
    result = re.sub(r'[\u2014\u2013\u2012\u2010\u2011\u2015\u2500-\u257F-]', ' ', result)

    # remove quotation marks and similar
    result = re.sub(r'[„‹›«»¨´`^¸˜˚˙¨¯˘˙˚¸˝˛ˇ]', ' ', result)

    # remove box drawing and block elements
    result = re.sub(r'[\u2500-\u257F\u2580-\u259F]', ' ', result)

    # remove geometric shapes
    result = re.sub(r'[\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF]', ' ', result)

    # remove arrows and miscellaneous symbols
    result = re.sub(r'[\u2190-\u21FF\u2900-\u297F]', ' ', result)

    # remove superscript and subscript numbers/letters
    result = re.sub(r'[⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎]', ' ', result)

    # remove fractions and special numeric characters
    result = re.sub(r'[½⅓⅔¼¾⅛⅜⅝⅞⅟]', ' ', result)

    # replace whitespace characters with actual whitespace
    result = re.sub(r'\s', ' ', result)

    # replace odd (curly) quotation marks with a standard straight quote
    result = re.sub(r"[‘’“”]", "'", result)

    # replace the ﬀ, ﬃ and ﬁ with the appropriate counterparts
    result = re.sub(r'ﬀ', 'ff', result)
    result = re.sub(r'ﬁ', 'fi', result)
    result = re.sub(r'ﬃ', 'ffi', result)
    result = re.sub(r'ﬂ', 'fl', result)
    result = re.sub(r'ﬄ', 'ffl', result)

    # remove or standardize some recurring common and meaningless words/phrases
    result = re.sub(r'\s*This\s*page\s*intentionally\s*left\s*blank\s*', ' ', result)
    result = re.sub(r'(?i)Aufgabe\s+', ' ', result)
    result = re.sub(r',*\s+cf\.', ' ', result)

    # some texts have footnotes conveniently in brackets - this removes them all,
    # with a safety measure for unpaired brackets, and deletes all brackets afterwards
    if bracketed_fn:
        result = re.sub(r'\[.{0,300}\]|{.{0,300}}|{.{0,300}\]|\[.{0,300}}', ' ', result)
    result = re.sub(r'[\[\]{}]', ' ', result)

    # unify some abbreviations
    result = re.sub(r'&', 'and', result)
    result = re.sub(r'\se\.g\.\s', ' eg ', result)
    result = re.sub(r'\si\.e\.\s', ' ie ', result)
    result = re.sub(r'coroll\.', 'coroll', result)
    result = re.sub(r'pt\.', 'pt', result)

    # remove roman numerals, first capitalized ones
    result = re.sub(r'\s((I{2,}V*X*\.*)|(IV\.*)|(IX\.*)|(V\.*)|(V+I*\.*)|(X+L*V*I*\.*))\s', ' ', result)
    # then lowercase
    result = re.sub(r'\s((i{2,}v*x*\.*)|(iv\.*)|(ix\.*)|(v\.*)|(v+i*\.*)|(x+l*v*i*\.*))\s', ' ', result)

    # remove periods and commas flanked by numbers
    result = re.sub(r'\d\.\d', ' ', result)
    result = re.sub(r'\d,\d', ' ', result)

    # remove the number-letter-number pattern used for many citations
    result = re.sub(r'\d*\w{,2}\d', ' ', result)

    # remove numerical characters
    result = re.sub(r'\d+', ' ', result)

    # remove words of 2+ characters that are entirely capitalized
    # (these are almost always titles, headings, or speakers in a dialogue)
    # remove capital I's that follow capital words - these are almost always roman numerals
    # some texts do use these capitalizations meaningfully, so we make this optional
    if capitals:
        result = re.sub(r'[A-Z]{2,}\s+I', ' ', result)
        result = re.sub(r'[A-Z]{2,}', ' ', result)

    # remove isolated colons and semicolons that result from removal of titles
    result = re.sub(r'\s+:\s*', ' ', result)
    result = re.sub(r'\s+;\s*', ' ', result)

    # remove isolated letters (do it several times because strings of isolated letters do not get captured properly)
    for _ in range(6):
        result = re.sub(r'\s[^aAI\.]\s', ' ', result)

    # remove isolated letters at the end of sentences or before commas
    result = re.sub(r'\s[^aI]\.', '.', result)
    result = re.sub(r'\s[^aI],', ',', result)

    # deal with spaces around periods and commas
    result = re.sub(r'\s+,\s+', ', ', result)
    result = re.sub(r'\s+\.\s+', '. ', result)

    # remove empty parentheses
    result = re.sub(r'(\(\s*\.*\s*\))|(\(\s*,*\s*)\)', ' ', result)

    # reduce multiple periods, commas, or whitespaces into a single one
    result = re.sub(r'\.+', '.', result)
    result = re.sub(r',+', ',', result)
    result = re.sub(r'\s+', ' ', result)

    # deal with isolated problem cases discovered in the data:
    for key in odd_words_dict.keys():
        result = re.sub(r''+key+'', odd_words_dict[key], result)

    return result.strip()

This step is relatively easy - all we had to do was use regex to capture and remove the patterns that we needed to remove.

The next step requires deeper examination of the individual texts, however.

#### 2. Text-by-Text Cleaning

In this step we will remove headers and other offensive features of specific texts.

The most common problem is the presence of headings that appear at the top of each page in the original book. When converted to a string, these get interpolated into the text, interrupting the normal flow of sentences (this happens for page numbers and citations as well, but those cases are much easier to deal with).

To deal with this, we build a list of the headers for each book and then delete them from the string that represents the book. In the process, we may create some issues if the header is common, so we are careful to only delete when the loss is worth it.
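
A minimal sketch of this header removal is given here. The helper name `remove_headers` and the sample header are hypothetical; the real header lists have to be collected by inspecting each file.

```python
import re


def remove_headers(text, headers):
    """Delete each known running header from the text.

    Longer headers are removed first so that a short header which is a
    fragment of a longer one does not leave partial debris behind.
    """
    for header in sorted(headers, key=len, reverse=True):
        # allow any amount of whitespace between the words of the header
        pattern = r'\s+'.join(re.escape(word) for word in header.split())
        text = re.sub(pattern, ' ', text)
    # collapse the whitespace left behind by the deletions
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical example: a running header interrupting a sentence
sample = "the mind is restless THE LIFE  OF BUDDHA and hard to restrain"
cleaned = remove_headers(sample, ["THE LIFE OF BUDDHA"])
print(cleaned)
```

Matching is case-sensitive on purpose here: a header in capitals is then removed without touching the same phrase where it occurs in ordinary sentence case.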

In some cases, of course, the texts are already clean and no extra steps are required.

#### 3. Tokenizing and Rendering the Texts as a Dataframe

We now are in a position to apply these methods to each text and return a dataframe for each of them. Although we are interested primarily in the schools of thought in general, it would be convenient and more useful for future projects if we also include the specific authors and titles.

To prepare for this step, we build a dictionary for each text, so that we can then iterate over the list of dictionaries to create a dataframe for each.

In [None]:
# prepare lists that will be zipped into a dictionary

# texts


title_list = [
    "Advaita Vedanta",
    "Analects",
    "Bhagavad Gita",
    "Collected Fruits of Occult Teaching by A.P.Sinnett",
    "Kaivalya Upanishad",
    "Knowledge of the Higher Worlds - by Rudolf Steiner",
    "Kybalion",
    "Popol Vuh",
    "Secret Teachings of the Society of Jesus",
    "Siri Guru Granth Sahib",
    "Tanakh",
    "Tao Te Ching",
    "The Kitab I Aqdas",
    "The Life of Buddha",
    "Augustine City of God and Christian Doctrine",
    "Dhammapada",
    "Epistle to the Son of the Wolf",
    "Gathas Zoroastrianism",
    "Kitab I Ilqan Book of Certitude",
    "Nature Gods",
    "The Path of Light (TPOL)",
    "Vedic Hymns Readable",
    "Vedanta Sutras Patanjali",
    "Words of Virtue (WOV)",
    "Yoga Sutras Patanjali"
]

school_list = [
    'hinduism',         # Advaita Vedanta
    'confucianism',     # Analects
    'hinduism',         # Bhagavad Gita
    'occult_esoteric',  # Collected Fruits of Occult Teaching by A.P.Sinnett
    'hinduism',         # Kaivalya Upanishad
    'occult_esoteric',  # Knowledge of the Higher Worlds - by Rudolf Steiner
    'occult_esoteric',  # Kybalion
    'indigenous',       # Popol Vuh
    'christianity',     # Secret Teachings of the Society of Jesus
    'sikhism',          # Siri Guru Granth Sahib
    'judaism',          # Tanakh
    'taoism',           # Tao Te Ching
    'bahai',            # The Kitab I Aqdas
    'buddhism',         # The Life of Buddha
    'christianity',     # Augustine City of God and Christian Doctrine
    'buddhism',         # Dhammapada
    'bahai',            # Epistle to the Son of the Wolf
    'zoroastrianism',   # Gathas Zoroastrianism
    'bahai',            # Kitab I Ilqan Book of Certitude
    'occult_esoteric',  # Nature Gods
    'buddhism',         # The Path of Light (TPOL)
    'hinduism',         # Vedic Hymns Readable
    'hinduism',         # Vedanta Sutras Patanjali
    'occult_esoteric',  # Words of Virtue (WOV)
    'hinduism'          # Yoga Sutras Patanjali
]






# check lengths to make sure all are present
len(title_list), len(school_list), len(all_texts)

(25, 25, 25)

In [None]:
# combine all these into a single list of dictionaries
book_dicts = []
for title, text, school in zip(title_list, all_texts, school_list):
  book_dict = {}
  book_dict['title'] = title.title()
  book_dict['text'] = text
  book_dict['school'] = school
  book_dict['remove capitals'] = True
  book_dict['bracketed fn'] = False
  book_dicts.append(book_dict)



# check length again to make sure
len(book_dicts)

25

With a dictionary for each text, we are prepared to clean them, build dataframes for each text, and combine them into a master dataframe for all our data.

In [None]:
#@title Oddities Dictionary for Cleaning
# a dictionary of oddities to clean up
odd_words_dict = {'\sderstanding': 'derstanding',
                  '\sforthe\s': ' for the ',
                  '\sject': 'ject',
                  '\sjects': 'jects',
                  '\sness': 'ness',
                  '\sper\scent\s': ' percent ',
                  '\sper\scent\.': ' percent.',
                  '\sper\scent,': ' percent,',
                  '\wi\son': 'ion',
                  '\spri\sori': ' priori',
                  '\stences\s': 'tences ',
                  '\sprincipleb': ' principle',
                  '\ssciousness': 'sciousness',
                  '\stion': 'tion',
                  '\spri\s': ' pri',
                  '\scluding': 'cluding',
                  '\sdom': 'dom',
                  '\sers': 'ers',
                  '\scritiq\s': ' critique ',
                  '\ssensati\s': ' sensation ',
                  '(?i)\syou\sll': " you'll",
                  '\sI\sll': " I'll",
                  '(?i)\swe\sll': " we'll",
                  '(?i)he\sll': " he'll",
                  '(?i)who\sll': "who'll",
                  '(?i)\sthere\sll\s': " there'll ",
                  '\seduca\s': ' education ',
                  '\slity\s': 'lity ',
                  '\smultaneously\s': 'multaneously ',
                  '\stically\s': 'tically ',
                  '\sDa\ssein\s': ' Dasein ',
                  '(?i)\sthey\sll\s': " they'll ",
                  '(?i)\sin\stum\s': ' in turn ',
                  '\scon~\s': ' con',
                  '\sà\s': ' a ',
                  '\sjor\s': ' for ',
                  '\sluminating\s': 'luminating ',
                  '\sselj\s': ' self ',
                  '\stial\s': 'tial ',
                  '\sversal\s': 'versal ',
                  '\sexis\st': ' exist',
                  '\splauded\s': 'plauded ',
                  '\suiry\s': 'uiry ',
                  '\svithin\s': ' within ',
                  '\soj\s': ' of ',
                  '\sposi\st': ' posit',
                  '\sra\sther\s': ' rather ',
                  '(?i)\sthat\sll\s': " that'll ",
                  '(?i)\sa\sll\s': ' all ',
                  '\so\sther\s': ' other ',
                  '\sra\sther\s': ' rather ',
                  '\snei\sther\s': ' neither ',
                  '\sei\sther\s': ' either ',
                  '\sfur\sther\s': ' further ',
                  '\sano\sther': ' another ',
                  '\sneces\s': ' neces',
                  'u\slar\s': 'ular ',
                  '\sference\s': 'ference ',
                  '(?i)it\sll\s': "it'll ",
                  '\stoge\sther': ' together ',
                  '\sknowledgeb\s': ' knowledge ',
                  'r\stain\s': 'rtain ',
                  'on\stain\s': 'ontain',
                  '(?i)j\sect\s': 'ject',
                  '\sob\sect\s': ' object ',
                  '\sbtle\s': 'btle ',
                  '\snition\s': 'nition ',
                  '\sdering\s': 'dering ',
                  '\sized\s': 'ized ',
                  '\sther\shand': ' other hand',
                  '\sture\s': 'ture ',
                  '\sabso\sl': ' absol',
                  '\stly\s': 'tly ',
                  '\serty\s': 'erty ',
                  '\sobj\se': ' obj',
                  '\sffiir\s': ' for ',
                  '\sndeed\s': ' indeed ',
                  '\sfonn\s': ' form ',
                  '\snally\s': 'nally ',
                  'ain\sty\s': 'ainty ',
                  'ici\sty\s': 'icity ',
                  '\scog\sni': ' cogni',
                  '\sacc\s': ' acc',
                  '\sindi\svid\sual': ' individual',
                  '\sintu\sit': ' intuit',
                  'r\sance\s': 'rance ',
                  '\ssions\s': 'sions ',
                  '\sances\s': 'ances ',
                  '\sper\sception\s': ' perception ',
                  '\sse\sries\s': ' series ',
                  '\sque\sries\s': ' queries ',
                  '\sessary\s': 'essary ',
                  '\sofa\s': ' of a ',
                  '\scer\stainty\s': ' certainty ',
                  'ec\stivity\s': 'ectivity ',
                  '\stivity\s': 'tivity ',
                  '\slation\s': 'lation ',
                  '\sir\sr': ' irr',
                  '\ssub\sstance\s': ' substance ',
                  'sec\sond\s': 'second ',
                  '\s\.rv': '',
                  '\story\s': 'tory ',
                  '\sture\s': 'ture ',
                  '\sminate\s': 'minate ',
                  '\sing\s': 'ing ',
                  '\splicity\s': 'plicity ',
                  '\ssimi\slar\s': ' similar ',
                  '\scom\smunity\s': ' community ',
                  '\sitselfa\s': ' itself a ',
                  '\ssimp\s': ' simply ',
                  '\scon\stex': ' contex',
                  '\scon\sseq': ' conseq',
                  '\scon\stai': ' contai',
                  '\sofwhat\s': ' of what ',
                  '\sui\s': 'ui',
                  '\sofan\s': ' of an ',
                  '\saccor\sdance\s': ' accordance ',
                  '\stranscen\sdental\s': ' transcendental ',
                  '\sap\spearances\s': ' appearances ',
                  'e\squences\s': 'equences ',
                  '\sorits\s': ' or its ',
                  '\simma\sn': ' imman',
                  '\seq\sua': ' equa',
                  '\simpl\sied\s': ' implied ',
                  '\sbuta\s': ' but a ',
                  '\sa\snd\s': ' and ',
                  '\sence\s': 'ence ',
                  '\stain\s': 'tain ',
                  '\sunder\sstanding\s': ' understanding ',
                  'i\sence\s': 'ience ',
                  'r\sence\s': 'rence ',
                  '\stical\s': 'tical ',
                  '\sobjectsb\s': ' objects ',
                  '\stbe\s': ' the ',
                  '\smul\st': ' mult',
                  '\sgen\seral\s': ' general ',
                  '\suniver\ssal\s': ' universal ',
                  '\scon\stent\s': ' content ',
                  '\spar\sticular\s': ' particular ',
                  'ver\ssity\s': 'versity ',
                  '\sCritiq\s': ' Critique ',
                  '\sphilo\ssophy\s': ' philosophy ',
                  '\seq\s': ' eq'}
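
As a quick illustration of how such a dictionary repairs words that were split during text extraction, the sketch below applies three of the patterns above to a made-up string. The helper name `apply_fixes` and the sample text are hypothetical; the patterns are written as raw strings so that `\s` reaches the regex engine unchanged.

```python
import re

# three patterns copied from the dictionary above, written as raw strings
sample_fixes = {
    r'\sper\scent\s': ' percent ',
    r'\sunder\sstanding\s': ' understanding ',
    r'\sa\snd\s': ' and ',
}

def apply_fixes(text, fixes):
    """Apply each regex replacement in turn, in dictionary order."""
    for pattern, replacement in fixes.items():
        text = re.sub(pattern, replacement, text)
    return text

garbled = "ten per cent of under standing a nd more"
print(apply_fixes(garbled, sample_fixes))
# -> ten percent of understanding and more
```

Because the replacements run in dictionary order, an earlier pattern can change what a later one sees, so broad patterns are safest placed after the more specific ones.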

In [None]:
import gc

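# split a long text into fixed-size character chunks so spaCy can process it
# piece by piece (a chunk boundary can fall mid-sentence and split that sentence)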
def split_text(text, max_len=50_000):
    return [text[i:i+max_len] for i in range(0, len(text), max_len)]

def from_raw_to_df(text_dict):
    nlp.max_length = 14_000_000  # raise spaCy's character limit as a safeguard for long inputs

    # Step 1: Clean text
    text = text_dict['text']
    text = baseline_clean(
        text,
        capitals=text_dict['remove capitals'],
        bracketed_fn=text_dict['bracketed fn'],
        odd_words_dict=odd_words_dict
    )

    # Step 2: Chunk the cleaned text and use SpaCy on each chunk
    chunks = split_text(text, max_len=50_000)

    # Keep memory use low: process one chunk at a time and disable NER, which
    # sentence splitting does not need (the parser stays enabled because
    # doc.sents relies on it)
    docs = list(nlp.pipe(chunks, batch_size=1, disable=["ner"]))
    sentences = [s for doc in docs for s in doc.sents]

    # Step 3: Create DataFrame (avoid keeping spaCy Span objects)
    sentence_strs = [s.text for s in sentences]
    text_df = pd.DataFrame({
        'sentence_str': sentence_strs,
        'title': text_dict['title'],
        'school': text_dict['school'],
    })

    return text_df


In [None]:
df_list = []
for i, book in enumerate(book_dicts):
    print(f"Processing book {i+1}/{len(book_dicts)}: {book['title']}")
    book_df = from_raw_to_df(book)
    df_list.append(book_df)

    # Optional: save each book's dataframe to disk
    # book_df.to_parquet(f"book_{i}.parquet")

    del book_df
    gc.collect()

df = pd.concat(df_list, ignore_index=True)
del df_list
gc.collect()

Processing book 1/25: Advaita Vedanta
Processing book 2/25: Analects
Processing book 3/25: Bhagavad Gita
Processing book 4/25: Collected Fruits Of Occult Teaching By A.P.Sinnett
Processing book 5/25: Kaivalya Upanishad
Processing book 6/25: Knowledge Of The Higher Worlds - By Rudolf Steiner
Processing book 7/25: Kybalion
Processing book 8/25: Popol Vuh
Processing book 9/25: Secret Teachings Of The Society Of Jesus
Processing book 10/25: Siri Guru Granth Sahib
Processing book 11/25: Tanakh
Processing book 12/25: Tao Te Ching
Processing book 13/25: The Kitab I Aqdas
Processing book 14/25: The Life Of Buddha
Processing book 15/25: Augustine City Of God And Christian Doctrine
Processing book 16/25: Dhammapada
Processing book 17/25: Epistle To The Son Of The Wolf
Processing book 18/25: Gathas Zoroastrianism
Processing book 19/25: Kitab I Ilqan Book Of Certitude
Processing book 20/25: Nature Gods
Processing book 21/25: The Path Of Light (Tpol)
Processing book 22/25: Vedic Hymns Readable
Proc

0

In [None]:
#@title
# ## some code for checking oddities

# # unfortunately it has to be done ad hoc and as they are discovered,
# # so we will not go too deep into it
# #
# # the code is commented out to make running the notebook smoother


# word_list = [
# 'tent',
# 'per',
# 'cent',
# 'imma']

# word_checker = pd.DataFrame()
# for word in word_list:
#   word_check_slice = df[(df['sentence_str'].str.contains('\s'+word+'\s'.lower()))].copy()
#   word_check_slice['word'] = word
#   word_checker = pd.concat([word_checker, word_check_slice])

# print(len(word_checker))
# print(len(word_list))
# word_checker['word'].value_counts()

# pd.options.display.max_colwidth = 300
# word_checker[word_checker['word']=='con']

And *voila*! There we have it. We can do some quick exploring to make sure everything came out ok, but we are nearing the end of our task!

In [None]:
pd.options.display.max_colwidth = 200
df.sample(10)

Unnamed: 0,sentence_str,title,school
163018,"The minister of crime of Ch'an asked whether the duke Chao knew propriety, and Confucius said, 'He knew propriety.'.",Gathas Zoroastrianism,zoroastrianism
120648,"|| I sing the Glorious Praises of the unchanging, eternal Lord God, and the noose of death is snapped.",Dhammapada,buddhism
165819,said Chamiaholom.,Nature Gods,occult_esoteric
17317,In the swooning person there is half combination; this being the remaining (hypothesis).,Collected Fruits Of Occult Teaching By A.P.Sinnett,occult_esoteric
56668,"Thou shalt not wrest judgment; thou shalt not respect persons; neither shalt thou take gift; for gift doth blind the eyes of the wise, and pervert the words of the righteous.",Augustine City Of God And Christian Doctrine,christianity
39916,Gen. Gen. See Contra Faust.,Siri Guru Granth Sahib,sikhism
90313,|| You shall not obtain this human body again; make the effort try to achieve liberation!,Dhammapada,buddhism
50555,"Lift it up, be not afraid; say unto the cities of Judah: �Behold your God Behold the Lord God will come with strong hand, and His arm shall rule for Him.�",The Life Of Buddha,buddhism
42543,"As for those who, even though they know and understand my directions, fail to penetrate the meaning of obscure passages in Scripture, they may stand for those who, in the case I have imagined, are...",Siri Guru Granth Sahib,sikhism
48489,This prohibition was defined by Shoghi Effendi as plunging one�s hand in food.,The Kitab I Aqdas,bahai


In [None]:
df['school'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
school,Unnamed: 1_level_1
buddhism,0.51207
occult_esoteric,0.153275
christianity,0.149912
confucianism,0.048515
sikhism,0.047941
hinduism,0.038921
bahai,0.022482
taoism,0.012366
zoroastrianism,0.009764
indigenous,0.002783


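The texts look more or less ok, though there is still some cleaning to be done. Note that several sampled rows carry a title that does not match their content (a sentence quoting Confucius is labeled 'Gathas Zoroastrianism', for example), which suggests that the order of `all_texts` may not match `title_list`; this is worth verifying before modeling. There is also definite class imbalance: buddhism alone accounts for about 51% of the sentences, whereas with 10 schools each would ideally be near 10%.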

That said, the numbers look reasonable enough that we have something we can work with. We can even do some fun stuff like find the average length of a sentence for each school or run other little tests.
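
For instance, the average sentence length per school could be computed along these lines. This is only a sketch: the toy dataframe below stands in for `df`, and its sentences are made up.

```python
import pandas as pd

# toy stand-in for the real dataframe (hypothetical sentences)
toy_df = pd.DataFrame({
    'sentence_str': ['Be kind.', 'Seek the truth always.', 'Walk on.'],
    'school': ['buddhism', 'taoism', 'buddhism'],
})

# character length of each sentence, then the mean per school
toy_df['sentence_length'] = toy_df['sentence_str'].str.len()
avg_len = (toy_df.groupby('school')['sentence_length']
                 .mean()
                 .sort_values(ascending=False))
print(avg_len)
```

On the real data, the same `groupby` call on `df` would give one average per school, which is a quick way to spot schools whose "sentences" are suspiciously short or long.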

### Cleaning the Dataframe

But before the fun EDA stuff, to ensure that we get good results, we need to clean the dataframe. This will take a few steps:
1. Determine a threshold length and cut the (so-called) sentences that are shorter than that length (this also eliminates meaningless duplicates such as stray punctuation marks)
2. Check for words that indicate footnotes (words like 'edition' or 'ibid'); we can then cut these sentences from the data
3. Check for duplicates; there should be few if any duplicates in the dataframe for each school
4. Check for words that indicate other languages so that we can eliminate quotations, citations, or other non-English sentences

#### 1. Deal with Short Sentences

In [None]:
df['sentence_length'] = df['sentence_str'].str.len()
num_of_short_entries = len(df[df['sentence_length'] < 20])
print(f"there are {num_of_short_entries} so-called sentences with fewer than 20 characters")
df[df['sentence_length'] < 20].sample(5)

there are 35789 so-called sentences with fewer than 20 characters


Unnamed: 0,sentence_str,title,school,sentence_length
121445,||,Dhammapada,buddhism,2
161066,||,Dhammapada,buddhism,2
117703,||,Dhammapada,buddhism,2
55393,�,Augustine City Of God And Christian Doctrine,christianity,1
82323,||,Dhammapada,buddhism,2


Sentences with fewer than 20 characters tend to be more or less meaningless, so we will drop them.

In [None]:
df = df.drop(df[df['sentence_length'] < 20].index)
len(df)

140261

#### 2. Look at Words that Indicate Footnotes

Now let's look at footnote-indicator words.

In [None]:
fn_words = [r'ch\.', 'bk', r'sect\.', r'div\.', 'cf', 'ibid', r'prop\.', r'Q\.E\.D\.',
            r'pt\.', r'coroll\.', r'cf\.']

df['sentence_lowered'] = df['sentence_str'].str.lower()

fn_df_list = []

for word in fn_words:
    found_word = df[df['sentence_lowered'].str.contains(r'\s' + word.lower(), regex=True)].copy()
    found_word['word'] = word
    fn_df_list.append(found_word)

fn_df = pd.concat(fn_df_list, ignore_index=True)


In [None]:
fn_df.sample(5)

Unnamed: 0,sentence_str,title,school,sentence_length,sentence_lowered,word
8,"For, as other scriptural texts testify ('Then he becomes united with the True,' Ch. Up. 'Embraced by the intelligent Self he knows nothing that is without, nothing that is within,' Bri, Up., the a...",Collected Fruits Of Occult Teaching By A.P.Sinnett,occult_esoteric,304,"for, as other scriptural texts testify ('then he becomes united with the true,' ch. up. 'embraced by the intelligent self he knows nothing that is without, nothing that is within,' bri, up., the a...",ch\.
25,"Rom. Cf. Cicero, Orator.",Siri Guru Granth Sahib,sikhism,24,"rom. cf. cicero, orator.",cf\.
19,"Either thou must recognize it, or�God forbid�arise and deny all the Prophets Reflect, Shaykh, upon the Shi�ih sect.",The Life Of Buddha,buddhism,115,"either thou must recognize it, or�god forbid�arise and deny all the prophets reflect, shaykh, upon the shi�ih sect.",sect\.
13,"the vidya of the five fires, and those who in the forest meditate on faith and austerity go to light,there is a person not human, he leads them to Brahman,' Ch.",Collected Fruits Of Occult Teaching By A.P.Sinnett,occult_esoteric,160,"the vidya of the five fires, and those who in the forest meditate on faith and austerity go to light,there is a person not human, he leads them to brahman,' ch.",ch\.
12,The case is analogous to that of the meditation on 'plenitude' (bhuman; Ch. Up.,Collected Fruits Of Occult Teaching By A.P.Sinnett,occult_esoteric,79,the case is analogous to that of the meditation on 'plenitude' (bhuman; ch. up.,ch\.


Unfortunately, there was too much noise and too many differences in how the sentences were tokenized, so this kind of cleaning did not prove useful. As can be seen above, many of the relevant terms were used in meaningful sentences attributable to the correct authors.

We were able to tell that 'bk.' was almost never used productively, so we cut that. For others, this was instructive in helping us clean the text and making revisions to the baseline cleaning function.
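
The cell that cut the 'bk.' sentences is not shown here, so the following is a hedged sketch of how the marker counts can be compared and one marker dropped. The toy sentences are made up, and the pattern `r'\sbk\.'` is an assumption about how the cut was expressed.

```python
import pandas as pd

# toy stand-in for the real dataframe (hypothetical sentences)
toy = pd.DataFrame({'sentence_str': [
    'See ibid. for the full passage.',
    'Reflect upon the teachings of this sect.',  # meaningful, yet matches 'sect\.'
    'Compare bk. 2 of the same work.',
    'The sage spoke of virtue.',
]})
lowered = toy['sentence_str'].str.lower()

# how many sentences each marker flags
markers = ['ibid', r'sect\.', r'bk\.']
counts = {m: int(lowered.str.contains(r'\s' + m, regex=True).sum()) for m in markers}
print(counts)

# drop only the rows flagged by the marker judged to be reliable
bk_mask = lowered.str.contains(r'\sbk\.', regex=True)
toy_clean = toy[~bk_mask]
print(toy_clean)
```

The second sentence shows the kind of noise described above: it ends with 'sect.' but is an ordinary sentence, so dropping on that marker would lose real content.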

#### 3. Deal with Duplicates

Now let's look at how many duplicates we have.

In [None]:
# find the total number of duplicates
len(df['sentence_str'])-len(df['sentence_str'].drop_duplicates())

1890

In [None]:
# get the number of duplicates in each school
for school in df['school'].unique():
  print(school)
  print(len(df.loc[df['school'] == school]['sentence_str']) -
        len(df.loc[df['school'] == school]['sentence_str'].drop_duplicates()))

hinduism
50
confucianism
457
occult_esoteric
166
indigenous
4
christianity
254
sikhism
3
judaism
0
taoism
5
bahai
79
buddhism
844
zoroastrianism
7


In [None]:
doubles_df = df[df.duplicated('sentence_str', keep=False)]
doubles_df.sample(5)

Unnamed: 0,sentence_str,title,school,sentence_length,sentence_lowered
21874,The gods have held Agni as the giver of wealth.,Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,47,the gods have held agni as the giver of wealth.
165804,What is it that has stung you?,Nature Gods,occult_esoteric,30,what is it that has stung you?
127236,Image Of The Undying.,Dhammapada,buddhism,21,image of the undying.
92990,"The Divine Guru is my companion, the Destroyer of ignorance; the Divine Guru is my relative and brother.",Dhammapada,buddhism,104,"the divine guru is my companion, the destroyer of ignorance; the divine guru is my relative and brother."
8242,They said: We did understand.,Analects,confucianism,29,they said: we did understand.


From this it is clear that many of these duplicates are notes, meaninglessly short sentences, or headings that somehow escaped earlier efforts. Oddly, an enormous number of Aristotle's sentences seem to be doubled. Looking at the doubled sentences, this appears to be because similar notes were made in both volumes of the text.

Let's eliminate the Aristotle doubles first, then take another look to see what the others are like.

Deeper exploration of the duplicates reveals that Kant has a lot of doubles that seem to be authentically from his texts. Plato also has several duplicate sentences, but these are almost all short phrases from the dialogues ('of course, yes' and that kind of thing).

To preserve the Kant while also removing the irrelevant duplicates, we will remove both copies of all duplicates from texts other than Kant's *Critique of Pure Reason*. For that text, we will remove the short duplicates and keep one copy of the longer ones, thus preserving the meaningful sentences.
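
The strategy just described can be sketched as follows. This is an illustration under stated assumptions rather than the code that was actually run: the helper name `drop_duplicates_except`, the `min_len` cutoff separating "short" from "long" duplicates, and the toy data are all hypothetical.

```python
import pandas as pd

def drop_duplicates_except(df, keep_title, min_len=50):
    """Drop duplicated sentences, treating one title specially.

    - Other titles: every copy of a duplicated sentence is removed.
    - keep_title: short duplicates are removed entirely, while one copy
      of each longer duplicate is kept.
    """
    special = df[df['title'] == keep_title]
    others = df[df['title'] != keep_title]

    # other titles: remove all copies of any duplicated sentence
    others = others[~others.duplicated('sentence_str', keep=False)]

    # special title: handle short and long duplicates differently
    short = special['sentence_str'].str.len() < min_len
    any_copy = special.duplicated('sentence_str', keep=False)   # every copy
    later_copy = special.duplicated('sentence_str', keep='first')  # all but the first
    special = special[~((short & any_copy) | (~short & later_copy))]

    return pd.concat([others, special]).sort_index()

toy = pd.DataFrame({
    'sentence_str': ['yes', 'yes', 'unique sentence in B', 'ok', 'ok',
                     'a long repeated sentence', 'a long repeated sentence'],
    'title': ['B', 'B', 'B', 'A', 'A', 'A', 'A'],
})
result = drop_duplicates_except(toy, keep_title='A', min_len=10)
print(result)
```

Duplicates are detected separately within the special title and within the remaining texts, so a sentence shared between the two groups is not counted as a duplicate in this sketch.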

#### 4. Check for Foreign-Language Sentences

With this cleared up, let's do a couple quick tests to check for other languages in our texts. We use 'der' to check for German, since it is a common article and is not an English word. Similarly 'il' can be used to check for French.

In [None]:
df[df['sentence_str'].str.contains(r'\sder\s')].sample(5)

Unnamed: 0,sentence_str,title,school,sentence_length,sentence_lowered
28988,"For other explanations see Roth, Uber gewisse Kurzungen, Wien, Bartholomae, in Kuhn's Zeitschrift, Schmidt, Die Pluralbildungen der indogermanischen Neutra,.",Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,157,"for other explanations see roth, uber gewisse kurzungen, wien, bartholomae, in kuhn's zeitschrift, schmidt, die pluralbildungen der indogermanischen neutra,."
27110,"Benfey: 'Dann stu'rzen reichlich aus der rothen (Gewitterwolke) Tropfen, mit Fluth wie eine Haut die Erde netzend.",Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,114,"benfey: 'dann stu'rzen reichlich aus der rothen (gewitterwolke) tropfen, mit fluth wie eine haut die erde netzend."
29478,Ludwig: 'Die sich nicht weigern der geburt.',Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,44,ludwig: 'die sich nicht weigern der geburt.'
21367,The removal of yat has already been proposed by Bollensen (Zeitschrift der Deutschen Morg.,Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,90,the removal of yat has already been proposed by bollensen (zeitschrift der deutschen morg.
23696,"I prefer to follow the opinion of Bechtel, Nachrichten der Gottinger Gesellschaft der Wissenschaften, philolog.",Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,111,"i prefer to follow the opinion of bechtel, nachrichten der gottinger gesellschaft der wissenschaften, philolog."


In [None]:
len(df[df['sentence_str'].str.contains(r'\sder\s')])

47

In [None]:
mask = df['title'] == 'Augustine City Of God And Christian Doctrine'
df.loc[mask, 'sentence_str'] = df.loc[mask, 'sentence_str'].str.replace(r'^And\s+', '', regex=True)

In [None]:
import re

def remove_upanishad_citations(text):
    return re.sub(r'\(?[\w\s,]*?(?:Up\.|I,)[\w\s.,]*?\)?\s*\);?', '', text)

mask = df['title'] == 'Collected Fruits Of Occult Teaching By A.P.Sinnett'
df.loc[mask, 'sentence_str'] = df.loc[mask, 'sentence_str'].apply(remove_upanishad_citations)


These all seem questionable at best - the 'der' indicates a German phrase or book title where it doesn't just denote a fully German quote. Let's drop those and check 'il.'

In [None]:
df = df.drop(df[df['sentence_str'].str.contains(r'\sder\s')].index)

len(df)

140214

In [None]:
il_matches = df[df['sentence_str'].str.contains(r'\sil\s')]
# sample at most 5 rows, since .sample(5) raises a ValueError when fewer than 5 rows match
il_matches.sample(min(5, len(il_matches)))

Very few sentences contain 'il' here (dropping them below removes a single row). We drop them anyway; even a sentence with some meaning must contain an error, since 'il' is not an English word.

In [None]:
df = df.drop(df[df['sentence_str'].str.contains(r'\sil\s')].index)

len(df)

140213

#### Some Ad Hoc Cleaning

These last cells show us cleaning up some things that we noticed as we read over the data and explored it in other ways. There is nothing systematic here, just us noticing bad data and deleting as we go.

In [None]:
# miscellaneous nonsense sentences
df = df.drop(df[df['sentence_str'].str.contains(r'\spp\s')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\stotam\s')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\srree\s')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\sflir\s')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\smodis\s')].index)

len(df)

140213

In [None]:
# markers of French and notes
df = df.drop(df[df['sentence_str'].str.contains(r'\schapitre')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\salisme')].index)
df = df.drop(df[df['sentence_str'].str.contains(r'\sHahn')].index)

len(df)

140213

In [None]:
# some notes in Kant
df = df.drop(df[df['sentence_str'].str.contains(r'\sVorl\s')].index)

len(df)

140213

### Lemmatizing, Tokenizing, and Exporting

This brings us to the end of our pruning the data. Let's take a quick look at how the schools break down after the cleaning.

In [None]:
df['school'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
school,Unnamed: 1_level_1
buddhism,0.442284
occult_esoteric,0.1737
christianity,0.168408
confucianism,0.056742
sikhism,0.051151
hinduism,0.048476
bahai,0.027316
taoism,0.01405
zoroastrianism,0.011953
indigenous,0.003459


Things seem slightly more balanced, if only because there are roughly 36,000 fewer sentences overall (most of them dropped as too short). At this point, we are ready to export the dataframe. Before doing so, we will make future work easier by adding two columns: one with lemmatized text and another with tokenized text.

One last preview before we finalize the document:

In [None]:
df.sample(5)

Unnamed: 0,sentence_str,title,school,sentence_length,sentence_lowered
100675,"He came and poured His Ambrosial Nectar into your hands, but it slipped through your fingers, and fell onto the ground.",Dhammapada,buddhism,119,"he came and poured his ambrosial nectar into your hands, but it slipped through your fingers, and fell onto the ground."
119826,"He acts out in ego, and suffers terrible punishment.",Dhammapada,buddhism,52,"he acts out in ego, and suffers terrible punishment."
53375,"he went out from Pharaoh, and entreated the.",Augustine City Of God And Christian Doctrine,christianity,48,"and he went out from pharaoh, and entreated the."
49767,"Shaykh Every time God the True One�exalted be His glory�revealed Himself in the person of His Manifestation, He came unto men with the standard of He doeth what He willeth, and ordaineth what He p...",The Life Of Buddha,buddhism,204,"shaykh every time god the true one�exalted be his glory�revealed himself in the person of his manifestation, he came unto men with the standard of he doeth what he willeth, and ordaineth what he p..."
24215,"Men have set down Agni as the Hotri, the Usigs, honouring (him), the praise of Ayu .",Knowledge Of The Higher Worlds - By Rudolf Steiner,occult_esoteric,84,"men have set down agni as the hotri, the usigs, honouring (him), the praise of ayu ."


Looks good! Let's export.

In [None]:
import re
from typing import List, Set, Iterable
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

DEFAULT_CAPITAL_WHITELIST = {"God", "Truth", "Love", "Justice", "Wisdom", "LORD"}
DEFAULT_ENTITY_LABELS = {"PERSON", "GPE", "ORG", "DATE"}

def load_spacy_model(model_name: str = "en_core_web_sm"):
    """Load a spaCy model, with error handling."""
    try:
        return spacy.load(model_name)
    except OSError:
        raise RuntimeError(f"spaCy model '{model_name}' not found. Run: python -m spacy download {model_name}")

def capital_ratio(
    sentence: str,
    whitelist: Set[str] = DEFAULT_CAPITAL_WHITELIST,
    threshold: float = 0.3,
    min_word_len: int = 3
) -> bool:
    """
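    Returns True if the ratio of capitalized words (not in whitelist) meets or exceeds threshold.
    The first word is never counted as capitalized, since sentence-initial capitals are expected.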
    """
    words = [w for w in word_tokenize(sentence) if w.isalpha() and len(w) >= min_word_len]
    if not words:
        return False
    capitalized = [
        w for i, w in enumerate(words)
        if w[0].isupper() and w not in whitelist and i != 0
    ]
    ratio = len(capitalized) / len(words)
    return ratio >= threshold

def is_mostly_named_entities(
    sentences: Iterable[str],
    nlp,
    entity_labels: Set[str] = DEFAULT_ENTITY_LABELS,
    max_entity_ratio: float = 0.4
) -> List[bool]:
    """
    Returns a list of bools indicating if each sentence is mostly named entities.
    """
    results = []
    for doc in nlp.pipe(sentences, disable=["tagger", "parser"]):
        if len(doc) == 0:
            results.append(False)
            continue
        entity_tokens = set()
        for ent in doc.ents:
            if ent.label_ in entity_labels:
                entity_tokens.update(range(ent.start, ent.end))
        ratio = len(entity_tokens) / len(doc)
        results.append(ratio >= max_entity_ratio)
    return results

def contains_philosophical_terms(sentence: str, keywords: Set[str]) -> bool:
    """Check if any keyword is present in the sentence (case-insensitive)."""
    sentence_lower = sentence.lower()
    return any(k in sentence_lower for k in keywords)

def sentence_is_useful(
    sentence: str,
    keywords: Set[str],
    cap_threshold: float = 0.3,
    min_len: int = 8,
    whitelist: Set[str] = DEFAULT_CAPITAL_WHITELIST
) -> bool:
    """
    Returns True if the sentence is long enough and not mostly capitalized (unless it contains a keyword).
    """
    if len(sentence.split()) < min_len:
        return False
    if capital_ratio(sentence, whitelist=whitelist, threshold=cap_threshold) and not contains_philosophical_terms(sentence, keywords):
        return False
    return True

def clean_sentences(
    sentences: List[str],
    keywords: Set[str],
    nlp=None,
    cap_threshold: float = 0.3,
    min_len: int = 8,
    entity_labels: Set[str] = DEFAULT_ENTITY_LABELS,
    diagnostics: bool = False
) -> List[str]:
    """
    Filter sentences by capital ratio, length, and named entity ratio.
    """
    if nlp is None:
        nlp = load_spacy_model()
    filtered = [
        s for s in sentences
        if sentence_is_useful(s, keywords, cap_threshold, min_len)
    ]
    entity_flags = is_mostly_named_entities(filtered, nlp, entity_labels)
    final = [s for s, flag in zip(filtered, entity_flags) if not flag]
    if diagnostics:
        print(f"Original: {len(sentences)}, After capital/length: {len(filtered)}, After NER: {len(final)}")
    return final

In [None]:
from google.colab import drive
from google.colab import files
import pandas as pd

# Mount Google Drive
print("Mounting Google Drive...")
drive.mount('/content/drive')

# Save your current dataframe first
df.to_csv('reli_nlp.csv', index=False)

# Read Bible and Quran data from Google Drive with correct paths
try:
    bible_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/religion/bible_data_set.csv')
    print(f"Bible data loaded successfully: {len(bible_df)} rows")
    print(f"Bible columns: {bible_df.columns.tolist()}")
except FileNotFoundError:
    print("Bible file not found. Check the path: /content/drive/MyDrive/Colab Notebooks/religion/bible_data_set.csv")
    bible_df = None

try:
    quran_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/religion/quran.csv')
    print(f"Quran data loaded successfully: {len(quran_df)} rows")
    print(f"Quran columns: {quran_df.columns.tolist()}")
except FileNotFoundError:
    print("Quran file not found. Check the path: /content/drive/MyDrive/Colab Notebooks/religion/quran.csv")
    quran_df = None

# Check if df has the right column
print(f"Main df columns: {df.columns.tolist()}")

# Rename columns in main df if needed
if 'sentence_str' in df.columns:
    df = df.rename(columns={'sentence_str': 'text'})

# Create list of dataframes to concatenate
dfs_to_concat = [df]

# Process Bible data if available
if bible_df is not None:
    # Add required columns
    if 'school' not in bible_df.columns:
        bible_df['school'] = 'christianity'
    if 'source' not in bible_df.columns:
        bible_df['source'] = 'bible'

    # Bible already has 'text' column - no renaming needed

    # Select only the columns you need
    columns_to_keep = ['text', 'source', 'school']
    if all(col in bible_df.columns for col in columns_to_keep):
        bible_df = bible_df[columns_to_keep]
        dfs_to_concat.append(bible_df)
        print(f"Bible data prepared: {len(bible_df)} verses")
    else:
        print(f"Bible missing required columns. Available: {bible_df.columns.tolist()}")

# Process Quran data if available
if quran_df is not None:
    # Add required columns
    if 'school' not in quran_df.columns:
        quran_df['school'] = 'islam'
    if 'source' not in quran_df.columns:
        quran_df['source'] = 'quran'

    # Rename EnglishTranslation to text
    if 'EnglishTranslation' in quran_df.columns:
        quran_df = quran_df.rename(columns={'EnglishTranslation': 'text'})

    # Select only the columns you need
    columns_to_keep = ['text', 'source', 'school']
    if all(col in quran_df.columns for col in columns_to_keep):
        quran_df = quran_df[columns_to_keep]
        dfs_to_concat.append(quran_df)
        print(f"Quran data prepared: {len(quran_df)} verses")
    else:
        print(f"Quran missing required columns. Available: {quran_df.columns.tolist()}")

# Concatenate all dataframes
combined_df = pd.concat(dfs_to_concat, ignore_index=True)

# Check if 'text' column exists before cleaning
if 'text' not in combined_df.columns:
    print(f"Error: 'text' column not found. Available columns: {combined_df.columns.tolist()}")
    # Try to find the text column with a different name
    if 'sentence_str' in combined_df.columns:
        combined_df = combined_df.rename(columns={'sentence_str': 'text'})
    else:
        print("Cannot proceed without a text column")
        raise ValueError("No text column found")

# Remove any empty or null texts
print(f"Before cleaning: {len(combined_df)} rows")
combined_df = combined_df.dropna(subset=['text'])
combined_df = combined_df[combined_df['text'].str.strip() != '']
print(f"After removing empty texts: {len(combined_df)} rows")

# Apply final capital cleaning to all texts together
philosophical_keywords = ['truth', 'wisdom', 'enlightenment', 'divine', 'sacred', 'spiritual',
                         'meditation', 'prayer', 'faith', 'compassion', 'love', 'peace', 'god']

# Variant of sentence_is_useful that reports why a sentence is dropped
def get_drop_reason(sentence, keywords, cap_threshold=0.3, min_len=8):
    """Return 'too_short', 'high_capital_no_keywords', or None if the sentence is kept."""
    if len(sentence.split()) < min_len:
        return "too_short"
    if capital_ratio(sentence, threshold=cap_threshold) and not contains_philosophical_terms(sentence, keywords):
        return "high_capital_no_keywords"
    return None  # None means the sentence is useful

try:
    # Add reason column (None means useful)
    combined_df['drop_reason'] = combined_df['text'].apply(
        lambda x: get_drop_reason(x, philosophical_keywords, cap_threshold=0.3, min_len=8)
    )
    if 'school' not in combined_df.columns:
        combined_df['school'] = combined_df['source'].apply(lambda x: 'islam' if x == 'quran' else 'christianity')
    # Separate dropped and useful
    dropped_df = combined_df[combined_df['drop_reason'].notnull()].copy()
    combined_df = combined_df[combined_df['drop_reason'].isnull()].drop(columns='drop_reason')

    # Save dropped with reasons
    dropped_df[['text', 'drop_reason', 'school']].to_csv("dropped_sentences.csv", index=False)
    print(f"Saved {len(dropped_df)} dropped sentences to 'dropped_sentences.csv'")

    # Print dropped sentences with reason
    print("Sentences dropped during advanced filtering:")
    for _, row in dropped_df.iterrows():
        print(f" - [{row['drop_reason']}] {row['text']}")

    print(f"After advanced filtering: {len(combined_df)} rows")

except NameError as err:
    # capital_ratio / contains_philosophical_terms are defined in earlier cells
    print(f"Filtering helpers not available ({err}); skipping advanced filtering")


# Save the complete dataset
combined_df.to_csv('complete_religious_texts.csv', index=False)

# Download from Colab
files.download('complete_religious_texts.csv')

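# dropped_sentences.csv only exists if the filtering step above ran
import os
if os.path.exists('dropped_sentences.csv'):
    files.download('dropped_sentences.csv')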
print(f"\nCombined dataset created with {len(combined_df)} sentences")
print(f"Schools included: {combined_df['school'].unique()}")
print(f"Sources included: {combined_df['source'].nunique()} different sources")
print("Breakdown by school:")
print(combined_df['school'].value_counts())

Streaming output truncated to the last 5000 lines.
 - [high_capital_no_keywords] You are forever True, the Home of Excellence, the Primal Supreme Being.
 - [high_capital_no_keywords] Waahay Guru, Waahay Guru, Waahay Guru, Waahay Jee.
 - [high_capital_no_keywords] You are the Formless, Infinite Lord; who can compare to You?
 - [high_capital_no_keywords] You are forever True, the Home of Excellence, the Primal Supreme Being.
 - [high_capital_no_keywords] Waahay Guru, Waahay Guru, Waahay Guru, Waahay Jee.
 - [too_short] No one can speak Your Unspoken Speech.
 - [too_short] You are pervading the three worlds.
 - [high_capital_no_keywords] You are forever True, the Home of Excellence, the Primal Supreme Being.
 - [high_capital_no_keywords] Waahay Guru, Waahay Guru, Waahay Guru, Waahay Jee.
 - [high_capital_no_keywords] || The True Guru, the True Guru, the True Guru is the Lord of the Universe Himself.
 - [high_capital_no_keywords] He took birth in the Incarnations of the Fish,


Combined dataset created with 149940 sentences
Schools included: ['hinduism' 'confucianism' 'occult_esoteric' 'indigenous' 'christianity'
 'sikhism' 'judaism' 'taoism' 'bahai' 'buddhism' 'zoroastrianism' 'islam']
Sources included: 2 different sources
Breakdown by school:
school
christianity       51317
buddhism           45566
occult_esoteric    20879
sikhism             6589
confucianism        6438
hinduism            6263
islam               5517
bahai               3344
taoism              1821
zoroastrianism      1429
indigenous           467
judaism              310
Name: count, dtype: int64
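
Because the per-sentence printout above is truncated to its last 5000 lines, a per-school count of drop reasons can be easier to read than the full listing. The sketch below assumes a frame shaped like `dropped_df` (columns `text`, `school`, `drop_reason`). The rows are invented placeholders, not data from the corpus, and the variable names `toy_dropped`, `reason_counts`, and `examples` are hypothetical.

```python
import pandas as pd

# Toy stand-in for dropped_df; every row here is made up for illustration.
toy_dropped = pd.DataFrame({
    "text": ["placeholder 1", "placeholder 2", "placeholder 3", "placeholder 4"],
    "school": ["sikhism", "sikhism", "sikhism", "islam"],
    "drop_reason": [
        "too_short",
        "high_capital_no_keywords",
        "high_capital_no_keywords",
        "too_short",
    ],
})

# One row per school, one column per drop reason, zero where a pair never occurs.
reason_counts = (
    toy_dropped.groupby(["school", "drop_reason"])
    .size()
    .unstack(fill_value=0)
)
print(reason_counts)

# Keep the first example of each reason for a quick manual check.
examples = toy_dropped.groupby("drop_reason").head(1)
print(examples[["drop_reason", "text"]])
```

On the real data, the same `groupby` applied to `dropped_df` would show at a glance which schools lose the most sentences to each rule, which matters for the smaller classes in the breakdown above.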


And that's it! For the modeling stage, we started with some basic [Bayesian models](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/2_non-neural_models.ipynb) before moving on to [w2v](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/3_w2v.ipynb) and [neural networks](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/4_neural_networks.ipynb).

# Comprehensive Religious Text Cleaning

This section contains cleaning functions for all religious texts according to specific requirements.

## Test the Cleaning Functions

Run the cell below to clean all religious texts in your `revised_religion` folder: