# Text Classification - Dataset

---

## $\color{blue}{Sections:}$
* Preamble
* Admin - importing libraries
* Scraping - getting our data
* Splitting - formatting data into datapoints
* Analysis - distribution and size of data
* Data - formatting into Pandas and adding more metadata
* Subset and Save - train/dev/test set and pickling

## $\color{blue}{Preamble:}$
This is a text classification project where we attempt to test numerous types of AI models on a single task. This first notebook prepares our dataset.

#### General Project Themes:

* Embedding models
  * Finetuning
  * Hard Negatives / Positive Triplet finetuning
  * End-End finetuning
  * Embedding finetuning
* Mixture models
* LLMs
  * Finetuning
* GNNs

#### Data

Our data includes 12,000 training points of approximately 40 words each. These come from 4 classic books, which can be counted as 6 given that Ulysses has 3 distinct parts.
* Ulysses - James Joyce
* Dubliners - James Joyce
* Dracula - Bram Stoker
* The Republic - Plato

#### Comments

The task can also be framed as a 70-class problem, with one class per chapter across all the books. The works have been carefully selected so that the similarity between them varies: compare James Joyce v James Joyce with James Joyce v Plato. This setup will allow us to break up the task and train specialist models for each book. Although the challenge is fictitious, the project offers practice with a range of techniques, and its results should carry over to applications like:

* Topic Classification
* Fraud Detection
* Product Tagging
* Sentiment Analysis
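
The label counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the per-work chapter counts are the ones produced later in this notebook (18 Ulysses episodes, 15 Dubliners stories, 27 Dracula chapters, 10 Republic books), and the dictionary names are made up for illustration.

```python
# Chapter-level classes per work, as scraped later in this notebook.
chapters_per_work = {"Ulysses": 18, "Dubliners": 15, "Dracula": 27, "Republic": 10}
# Ulysses is treated as three "books" (its three parts); the others as one each.
books_per_work = {"Ulysses": 3, "Dubliners": 1, "Dracula": 1, "Republic": 1}

n_chapter_classes = sum(chapters_per_work.values())
n_book_classes = sum(books_per_work.values())
print(n_chapter_classes, n_book_classes)
```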

---

#### Notebook Details

This notebook loads HTML documents and scrapes their content with Beautiful Soup. The documents are then split with LlamaIndex and wrapped as LangChain documents to produce our clean datasets.




## $\color{blue}{Admin:}$


In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

Mounted at /content/drive
/content/drive/MyDrive


In [None]:
%%capture
!pip install langchain langchain-community bs4 llama-index

In [None]:
from bs4 import BeautifulSoup
import re

## $\color{blue}{Scraping:}$


### $\color{red}{Ulysses:}$


In [None]:

# Load the HTML file
with open('class/data/ulysses_text.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
# Initialize a list to hold all episodes
ulysses_episodes = []
last_book_title = None

# Iterate through each 'div' with the class 'chapter'
for chapter in soup.find_all('div', class_='chapter'):
    # Check for a book title (h2) above the current chapter
    book_title_tag = chapter.find_previous('h2')
    if book_title_tag:
        last_book_title = book_title_tag.get_text(strip=True)

    # Get the episode title from the current chapter (h3)
    episode_title_tag = chapter.find('h3')
    if episode_title_tag:
        episode_title = episode_title_tag.get_text(strip=True)
    else:
        continue  # Skip if there is no episode title (h3)

    # Initialize a dictionary for the current episode
    episode_data = {
        'master': 'Ulysses',
        'book': last_book_title,
        'episode': episode_title,
        'content': ''
    }

    # Gather all paragraphs within the current chapter
    for paragraph in chapter.find_all('p'):
        episode_data['content'] += paragraph.get_text() + ' '  # Add space to separate paragraphs

    # Clean up the content by replacing newlines with spaces
    episode_data['content'] = episode_data['content'].replace("\n"," ")

    # Append episode data to the list of episodes
    ulysses_episodes.append(episode_data)
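
To make the `find_previous('h2')` step easier to follow, here is a small sketch on toy HTML. The markup below is invented to mimic the layout the loop above assumes (a part heading in `<h2>`, then chapter `<div>`s with an `<h3>` title and `<p>` paragraphs); it is not taken from the real file.

```python
from bs4 import BeautifulSoup

# Toy HTML (hypothetical) mimicking the assumed layout of the source file.
toy_html = """
<h2>I</h2>
<div class="chapter"><h3>[ 1 ]</h3><p>First para.</p><p>Second para.</p></div>
<div class="chapter"><h3>[ 2 ]</h3><p>Another episode.</p></div>
<h2>II</h2>
<div class="chapter"><h3>[ 3 ]</h3><p>New part.</p></div>
"""

soup = BeautifulSoup(toy_html, "html.parser")
episodes = []
for chapter in soup.find_all("div", class_="chapter"):
    part = chapter.find_previous("h2")   # nearest <h2> before this chapter
    title = chapter.find("h3")
    if title is None:
        continue                         # skip divs without an episode title
    # Join the paragraphs with single spaces.
    text = " ".join(p.get_text() for p in chapter.find_all("p"))
    episodes.append({
        "book": part.get_text(strip=True) if part else None,
        "episode": title.get_text(strip=True),
        "content": text,
    })

for ep in episodes:
    print(ep["book"], ep["episode"], ep["content"])
```

Each chapter picks up the most recent part heading, so the part label carries over until the next `<h2>` appears.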

In [None]:
len(ulysses_episodes)

18


In [None]:
ulysses_master = []
ulysses_book = []
ulysses_chapter = []
ulysses_text = []

for item in ulysses_episodes:
  # get master
  ulysses_master.append(item['master'])

  # get book number: count the 'I's in the part heading (Roman numeral I, II or III), then zero-index
  ulysses_book.append(len(re.findall('I',item['book']))-1)

  # get chapter number
  number = ''
  for char in item['episode']:
    if char.isnumeric():
      number += char
  ulysses_chapter.append(int(number)-1)

  # get text
  ulysses_text.append(item['content'])
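
The digit-by-digit loop above can also be written with a regular expression. This is a minimal sketch, assuming each episode heading contains a single run of digits; the sample headings and the helper name `episode_index` are hypothetical.

```python
import re

def episode_index(title: str) -> int:
    """Return the zero-based episode index from a heading such as '[ 12 ]'."""
    match = re.search(r"\d+", title)   # first run of digits in the heading
    if match is None:
        raise ValueError(f"no episode number found in {title!r}")
    return int(match.group()) - 1

sample_titles = ["[ 1 ]", "[ 12 ]", "Episode 18"]  # hypothetical headings
print([episode_index(t) for t in sample_titles])
```

Raising on a heading without digits makes a malformed title visible immediately, instead of failing later on `int('')`.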

In [None]:
print(ulysses_master)
print(ulysses_book)
print(ulysses_chapter)
print(len(ulysses_text))

['Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses', 'Ulysses']
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
18


### $\color{red}{Dubliners:}$


In [None]:
# Load the HTML file
with open('class/data/dubliners_text.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
dubliners_episodes = []

# Iterate through each 'div' with the class 'chapter'
for chapter in soup.find_all('div', class_='chapter'):
    # Get the episode title from the current chapter (h2)
    episode_title_tag = chapter.find('h2')
    if episode_title_tag:
        episode_title = episode_title_tag.get_text(strip=True)
    else:
        continue  # Skip if there is no episode title (h2)

    # Initialize a dictionary for the current episode
    episode_data = {
        'master': 'Dubliners',
        'book': 'Dubliners',
        'episode': episode_title,
        'content': ''
    }

    # Gather all paragraphs within the current chapter
    for paragraph in chapter.find_all('p'):
        episode_data['content'] += paragraph.get_text() + ' '  # Add space to separate paragraphs

    # Clean up the content by replacing newlines with spaces
    episode_data['content'] = episode_data['content'].replace("\n"," ")

    # Append episode data to the list of episodes
    dubliners_episodes.append(episode_data)


In [None]:
dubliners_title = [episode['episode'] for episode in dubliners_episodes]
dubliners_inds = list(range(len(ulysses_episodes),len(ulysses_episodes) + len(dubliners_episodes)))
dublin_title = {dubliners_title[i]:dubliners_inds[i] for i in range(len(dubliners_episodes))}
dublin_title

{'THE SISTERS': 18,
 'AN ENCOUNTER': 19,
 'ARABY': 20,
 'EVELINE': 21,
 'AFTER THE RACE': 22,
 'TWO GALLANTS': 23,
 'THE BOARDING HOUSE': 24,
 'A LITTLE CLOUD': 25,
 'COUNTERPARTS': 26,
 'CLAY': 27,
 'A PAINFUL CASE': 28,
 'IVY DAY IN THE COMMITTEE ROOM': 29,
 'A MOTHER': 30,
 'GRACE': 31,
 'THE DEAD': 32}
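
The global chapter index is simply the position within a work plus the number of chapters that came before it. A compact way to express that, sketched with a hypothetical helper `index_titles` and made-up titles for the first work:

```python
def index_titles(titles, offset):
    """Map each title to a global chapter index starting at `offset`."""
    return {title: offset + i for i, title in enumerate(titles)}

# Hypothetical titles; the real ones come from the scraped episodes.
first_work = ["Ep 1", "Ep 2", "Ep 3"]
second_work = ["THE SISTERS", "AN ENCOUNTER"]

second_index = index_titles(second_work, offset=len(first_work))
print(second_index)
```

Because the result is a dict keyed by title, two chapters with identical titles would collapse into one entry, so this pattern relies on titles being unique within a work.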

In [None]:
dubliners_master = []
dubliners_book = []
dubliners_chapter = []
dubliners_text = []

for item in dubliners_episodes:
  # get master
  dubliners_master.append(item['master'])

  # get book number
  dubliners_book.append(3)

  # get chapter number
  dubliners_chapter.append(dublin_title[item['episode']])

  # get text
  dubliners_text.append(item['content'])

In [None]:
print(dubliners_master)
print(dubliners_book)
print(dubliners_chapter)
print(len(dubliners_text))

['Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners', 'Dubliners']
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
15


### $\color{red}{Dracula:}$


In [None]:
# Load the HTML file
with open('class/data/dracula_text.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
dracula_episodes = []

# Iterate through each 'div' with the class 'chapter'
for chapter in soup.find_all('div', class_='chapter'):
    # Get the episode title from the current chapter (h2)
    episode_title_tag = chapter.find('h2')
    if episode_title_tag:
        episode_title = episode_title_tag.get_text(strip=True)
    else:
        continue  # Skip if there is no episode title (h2)

    # Initialize a dictionary for the current episode
    episode_data = {
        'master': 'Dracula',
        'book': 'Dracula',
        'episode': episode_title,
        'content': ''
    }

    # Gather all paragraphs within the current chapter
    paragraphs = chapter.find('p')
    if paragraphs:
      for paragraph in chapter.find_all('p'):
          episode_data['content'] += paragraph.get_text() + ' '  # Add space to separate paragraphs

      # Clean up the content by replacing newlines with spaces
      episode_data['content'] = episode_data['content'].replace("\n"," ")

      # Append episode data to the list of episodes
      dracula_episodes.append(episode_data)

In [None]:
len(dracula_episodes)

28

In [None]:
# Drop the last scraped entry so that only the 27 chapters remain
dracula_episodes = dracula_episodes[:-1]

In [None]:
dracula_title = [episode['episode'] for episode in dracula_episodes]
dracula_inds = list(range(len(ulysses_episodes) + len(dubliners_episodes),len(ulysses_episodes) + len(dubliners_episodes) + len(dracula_episodes)))
drac_title = {dracula_title[i]:dracula_inds[i] for i in range(len(dracula_episodes))}
drac_title

{'CHAPTER IJONATHAN HARKER’S JOURNAL': 33,
 'CHAPTER IIJONATHAN HARKER’S JOURNAL—continued': 34,
 'CHAPTER IIIJONATHAN HARKER’S JOURNAL—continued': 35,
 'CHAPTER IVJONATHAN HARKER’S JOURNAL—continued': 36,
 'CHAPTER V': 37,
 'CHAPTER VIMINA MURRAY’S JOURNAL': 38,
 'CHAPTER VIICUTTING FROM “THE DAILYGRAPH,” 8 AUGUST': 39,
 'CHAPTER VIIIMINA MURRAY’S JOURNAL': 40,
 'CHAPTER IX': 41,
 'CHAPTER X': 42,
 'CHAPTER XI': 43,
 'CHAPTER XIIDR. SEWARD’S DIARY': 44,
 'CHAPTER XIIIDR. SEWARD’S DIARY—continued.': 45,
 'CHAPTER XIVMINA HARKER’S JOURNAL': 46,
 'CHAPTER XVDR. SEWARD’S DIARY—continued.': 47,
 'CHAPTER XVIDR. SEWARD’S DIARY—continued': 48,
 'CHAPTER XVIIDR. SEWARD’S DIARY—continued': 49,
 'CHAPTER XVIIIDR. SEWARD’S DIARY': 50,
 'CHAPTER XIXJONATHAN HARKER’S JOURNAL': 51,
 'CHAPTER XXJONATHAN HARKER’S JOURNAL': 52,
 'CHAPTER XXIDR. SEWARD’S DIARY': 53,
 'CHAPTER XXIIJONATHAN HARKER’S JOURNAL': 54,
 'CHAPTER XXIIIDR. SEWARD’S DIARY': 55,
 'CHAPTER XXIVDR. SEWARD’S PHONOGRAPH DIARY, SPOKEN 

In [None]:
dracula_master = []
dracula_book = []
dracula_chapter = []
dracula_text = []

for item in dracula_episodes:
  # get master
  dracula_master.append(item['master'])

  # get book number
  dracula_book.append(4)

  # get chapter number
  dracula_chapter.append(drac_title[item['episode']])

  # get text
  dracula_text.append(item['content'])

In [None]:
print(dracula_master)
print(dracula_book)
print(dracula_chapter)
print(len(dracula_text))

['Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula', 'Dracula']
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
[33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
27


### $\color{red}{Republic:}$


In [None]:
# Load the HTML file
with open('class/data/republic_text.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
republic_episodes = []

# Iterate through each 'div' with the class 'chapter'
for chapter in soup.find_all('div', class_='chapter'):
    # Get the episode title from the current chapter (h2)
    episode_title_tag = chapter.find('h2')
    if episode_title_tag:
        episode_title = episode_title_tag.get_text(strip=True)
    else:
        continue  # Skip if there is no episode title (h2)

    # Initialize a dictionary for the current episode
    episode_data = {
        'master': 'Republic',
        'book': 'Republic',
        'episode': episode_title,
        'content': ''
    }

    # Gather all paragraphs within the current chapter
    paragraphs = chapter.find('p')
    if paragraphs:
      for paragraph in chapter.find_all('p'):
          episode_data['content'] += paragraph.get_text() + ' '  # Add space to separate paragraphs

      # Clean up the content by replacing newlines with spaces
      episode_data['content'] = episode_data['content'].replace("\n"," ")

      # Append episode data to the list of episodes
      republic_episodes.append(episode_data)

In [None]:
len(republic_episodes)

12

In [None]:
# Drop the first two scraped entries, leaving the ten books (BOOK I to BOOK X)
republic_episodes = republic_episodes[2:]

In [None]:
chapter_title = [episode['episode'] for episode in republic_episodes]
republic_inds = list(range(len(ulysses_episodes) + len(dubliners_episodes) + len(dracula_episodes),len(ulysses_episodes) + len(dubliners_episodes) + len(dracula_episodes) + len(republic_episodes)))
republic_title = {chapter_title[i]:republic_inds[i] for i in range(len(republic_episodes))}
republic_title

{'BOOK I.': 60,
 'BOOK II.': 61,
 'BOOK III.': 62,
 'BOOK IV.': 63,
 'BOOK V.': 64,
 'BOOK VI.': 65,
 'BOOK VII.': 66,
 'BOOK VIII.': 67,
 'BOOK IX.': 68,
 'BOOK X.': 69}

In [None]:
republic_master = []
republic_book = []
republic_chapter = []
republic_text = []

for item in republic_episodes:
  # get master
  republic_master.append(item['master'])

  # get book number
  republic_book.append(5)

  # get chapter number
  republic_chapter.append(republic_title[item['episode']])

  # get text
  republic_text.append(item['content'])

In [None]:
print(republic_master)
print(republic_book)
print(republic_chapter)
print(len(republic_text))

['Republic', 'Republic', 'Republic', 'Republic', 'Republic', 'Republic', 'Republic', 'Republic', 'Republic', 'Republic']
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
10


## $\color{blue}{Splitting:}$


In [None]:
from llama_index.core.node_parser import SentenceSplitter

In [None]:
splitter = SentenceSplitter(
    chunk_size=80,
    chunk_overlap=0,
    separator='.'
)
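
To give a feel for what sentence-aware chunking does, here is a simplified stand-in written from scratch. It is not the LlamaIndex algorithm: it splits on full stops and budgets by whitespace-separated words, whereas `SentenceSplitter` budgets by tokens. The function name `pack_sentences` and the word budget are assumptions made for illustration.

```python
def pack_sentences(text, max_words=40):
    """Greedily pack '.'-delimited sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "One two three. Four five. Six seven eight nine. Ten."
for chunk in pack_sentences(sample, max_words=5):
    print(repr(chunk))
```

Packing whole sentences keeps each data point readable, at the cost of uneven chunk lengths, which is consistent with the spread in word counts reported in the analysis section.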

In [None]:
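# Document is defined in langchain_core; importing it from there avoids the legacy langchain.docstore path
from langchain_core.documents import Document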

In [None]:
ulysses_docs = []
for i in range(len(ulysses_text)):
  ulysses_nodes = splitter.split_text(ulysses_text[i])
  for node in ulysses_nodes:
    doc =  Document(page_content=node, metadata={"master":ulysses_master[i],"book_idx":ulysses_book[i], "chapter_idx":ulysses_chapter[i]})
    ulysses_docs.append(doc)


In [None]:
dubliners_docs = []
for i in range(len(dubliners_text)):
  dubliners_nodes = splitter.split_text(dubliners_text[i])
  for node in dubliners_nodes:
    doc =  Document(page_content=node, metadata={"master":dubliners_master[i],"book_idx":dubliners_book[i], "chapter_idx":dubliners_chapter[i]})
    dubliners_docs.append(doc)


In [None]:
dracula_docs = []
for i in range(len(dracula_text)):
  dracula_nodes = splitter.split_text(dracula_text[i])
  for node in dracula_nodes:
    doc =  Document(page_content=node, metadata={"master":dracula_master[i],"book_idx":dracula_book[i], "chapter_idx":dracula_chapter[i]})
    dracula_docs.append(doc)

In [None]:
republic_docs = []
for i in range(len(republic_text)):
  republic_nodes = splitter.split_text(republic_text[i])
  for node in republic_nodes:
    doc =  Document(page_content=node, metadata={"master":republic_master[i],"book_idx":republic_book[i], "chapter_idx":republic_chapter[i]})
    republic_docs.append(doc)

## $\color{blue}{Analysis:}$


In [None]:
import numpy as np
from collections import Counter
def report_docs(docs):
  length = len(docs)
  book_count = set()
  chapter_count = set()
  name = docs[0].metadata['master']
  word_count = []
  chapter_count_list = []
  for doc in docs:
    book_count.add(doc.metadata["book_idx"])
    chapter = doc.metadata["chapter_idx"]
    chapter_count.add(chapter)
    chapter_count_list.append(chapter)
    word_count.append(len(doc.page_content.split()))

  mean = np.mean(word_count)
  longest = np.max(word_count)    # avoid shadowing the built-in max()
  shortest = np.min(word_count)   # avoid shadowing the built-in min()
  std = np.std(word_count)
  c = Counter(chapter_count_list)
  fewest = c.most_common()[-1][1]
  most = c.most_common()[0][1]
  print('\n ############## \n')
  print(f'Title: {name}')
  print('-------------- \n')
  print(f'total word count: {np.sum(word_count)}')
  print(f'total chapters: {len(chapter_count)}')
  print(f'total book count: {len(book_count)}')
  print('------------- \n')
  print(f'data points: {len(docs)}')
  print(f'minority class data points: {fewest}')
  print(f'majority class data points: {most}')
  print('------------- \n')
  print(f'mean words: {round(mean,1)}')
  print(f'max words: {round(longest,1)}')
  print(f'min words: {round(shortest,1)}')
  print(f'std words: {round(std,1)}')
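
The `fewest` and `most` values in `report_docs` come from `Counter.most_common()`, which sorts classes from most to least frequent. A small sketch of the same idea, extended with an imbalance ratio; the helper name `class_balance` and the toy labels are hypothetical.

```python
from collections import Counter

def class_balance(labels):
    """Return (smallest class size, largest class size, imbalance ratio)."""
    counts = Counter(labels)
    ordered = counts.most_common()      # sorted from most to least frequent
    largest = ordered[0][1]
    smallest = ordered[-1][1]
    return smallest, largest, largest / smallest

# Hypothetical chapter labels for a handful of data points.
labels = [0, 0, 0, 0, 1, 1, 2]
print(class_balance(labels))
```

Applied to the Ulysses figures in the report output (94 and 1484 data points), the ratio is roughly 16 to 1, which is worth keeping in mind when reading per-chapter accuracy.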


In [None]:
report_docs(ulysses_docs)
report_docs(dubliners_docs)
report_docs(dracula_docs)
report_docs(republic_docs)


 ############## 

Title: Ulysses
-------------- 

total word count: 265597
total chapters: 18
total book count: 3
------------- 

data points: 7093
minority class data points: 94
majority class data points: 1484
------------- 

mean words: 37.4
max words: 72
min words: 1
std words: 14.9

 ############## 

Title: Dubliners
-------------- 

total word count: 67496
total chapters: 15
total book count: 1
------------- 

data points: 1378
minority class data points: 34
majority class data points: 323
------------- 

mean words: 49.0
max words: 73
min words: 7
std words: 10.8

 ############## 

Title: Dracula
-------------- 

total word count: 159755
total chapters: 27
total book count: 1
------------- 

data points: 3216
minority class data points: 72
majority class data points: 151
------------- 

mean words: 49.7
max words: 71
min words: 5
std words: 11.1

 ############## 

Title: Republic
-------------- 

total word count: 118281
total chapters: 10
total book count: 1
------------- 

da

In [None]:
all_docs = ulysses_docs + dubliners_docs + dracula_docs + republic_docs

In [None]:
len(all_docs)

13964

In [None]:
def print_point(ind):
  print(' \n #################### \n')
  contents = all_docs[ind].page_content
  meta = all_docs[ind].metadata
  print(f'Book: {meta["master"]}')
  print(f'Book Index: {meta["book_idx"]}')
  print(f'Chapter Index: {meta["chapter_idx"]}')
  print('--------------------------')
  print(contents)

In [None]:
for i in range(10):
  print_point(np.random.choice(range(len(all_docs))))

 
 #################### 

Book: Ulysses
Book Index: 1
Chapter Index: 14
--------------------------
LORD TENNYSON: (Gentleman poet in Union Jack blazer and cricket flannels, bareheaded, flowingbearded.) Theirs not to reason why.   PRIVATE COMPTON: Biff him, Harry.   STEPHEN: (To Private Compton.) I don’t know your name but you are quite right.
 
 #################### 

Book: Ulysses
Book Index: 1
Chapter Index: 14
--------------------------
BLOOM: (Turns to the gallery.) The royal Dublins, boys, the salt of the earth, known the world over. I think I see some old comrades in arms up there among you.
 
 #################### 

Book: Ulysses
Book Index: 1
Chapter Index: 12
--------------------------
he old pair on her inside out and that was for luck and lovers’ meeting if you p
 
 #################### 

Book: Dubliners
Book Index: 3
Chapter Index: 28
--------------------------
No one wanted him; he was outcast from life’s feast. He turned his eyes to the grey gleaming river, winding along 

## $\color{blue}{Data:}$


In [None]:
# Ulysses

D_ul_book = {0: "Telemachia", 1: "Odyssey", 2:"Nostos"}
D_ul_chapter = {
    0:"Telemachus",
    1:"Nestor",
    2:"Proteus",
    3:"Calypso",
    4:"Lotus Eaters",
    5:"Hades",
    6:"Aeolus",
    7:"Lestrygonians",
    8:"Scylla and Charybdis",
    9:"Wandering Rocks",
    10:"Sirens",
    11:"Cyclops",
    12:"Nausicaa",
    13:"Oxen of the Sun",
    14:"Circe",
    15:"Eumaeus",
    16:"Ithaca",
    17:"Penelope"
}

ul_master = [item.metadata['master'] for item in ulysses_docs]
ul_book_idx = [item.metadata['book_idx'] for item in ulysses_docs]
ul_book = [D_ul_book[i] for i in ul_book_idx]
ul_chapter_idx = [item.metadata['chapter_idx'] for item in ulysses_docs]
ul_chapter = [D_ul_chapter[i] for i in ul_chapter_idx]
ul_content = [item.page_content for item in ulysses_docs]
ul_author = ["Joyce" for item in ul_content]


In [None]:
# Dubliners
D_db_chapter = {v:k for (k,v) in dublin_title.items()}

db_master = [item.metadata['master'] for item in dubliners_docs]
db_book_idx = [item.metadata['book_idx'] for item in dubliners_docs]
db_book = ["Dubliners" for i in db_book_idx]
db_chapter_idx = [item.metadata['chapter_idx'] for item in dubliners_docs]
db_chapter = [D_db_chapter[i] for i in db_chapter_idx]
db_content = [item.page_content for item in dubliners_docs]
db_author = ["Joyce" for item in db_content]
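
Inverting a dictionary with `{v: k for k, v in d.items()}` silently keeps only one key when two keys share a value. That cannot happen here because each title gets its own index, but a defensive variant is easy to write. A sketch, with `invert_mapping` as a hypothetical helper name:

```python
def invert_mapping(mapping):
    """Invert a dict, raising if two keys share a value."""
    inverted = {}
    for key, value in mapping.items():
        if value in inverted:
            raise ValueError(
                f"value {value!r} is shared by {inverted[value]!r} and {key!r}"
            )
        inverted[value] = key
    return inverted

title_to_idx = {"THE SISTERS": 18, "AN ENCOUNTER": 19, "ARABY": 20}
idx_to_title = invert_mapping(title_to_idx)
print(idx_to_title[19])
```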


In [None]:
# Dracula
D_dr_chapter = {
 "CHAPTER I: JONATHAN HARKER’S JOURNAL": 33,
 "CHAPTER II: JONATHAN HARKER’S JOURNAL—continued": 34,
 "CHAPTER III: JONATHAN HARKER’S JOURNAL—continued": 35,
 "CHAPTER IV: JONATHAN HARKER’S JOURNAL—continued": 36,
 "CHAPTER V: LETTER FROM MISS MINA MURRAY TO MISS LUCY WESTENRA": 37,
 "CHAPTER VI: MINA MURRAY’S JOURNAL": 38,
 "CHAPTER VII: CUTTING FROM 'THE DAILYGRAPH,' 8 AUGUST": 39,
 "CHAPTER VIII: MINA MURRAY’S JOURNAL": 40,
 "CHAPTER IX: LETTER, MINA HARKER TO LUCY WESTENRA": 41,
 "CHAPTER X: LETTER, DR.SEWARD TO HON ARTHUR HOLMWOOD": 42,
 "CHAPTER XI: LUCY WESTENRA'S DIARY": 43,
 "CHAPTER XII: DR. SEWARD’S DIARY": 44,
 "CHAPTER XIII: DR. SEWARD’S DIARY—continued.": 45,
 "CHAPTER XIV: MINA HARKER’S JOURNAL": 46,
 "CHAPTER XV: DR. SEWARD’S DIARY—continued.": 47,
 "CHAPTER XVI: DR. SEWARD’S DIARY—continued": 48,
 "CHAPTER XVII: DR. SEWARD’S DIARY—continued": 49,
 "CHAPTER XVIII: DR. SEWARD’S DIARY": 50,
 "CHAPTER XIX: JONATHAN HARKER’S JOURNAL": 51,
 "CHAPTER XX: JONATHAN HARKER’S JOURNAL": 52,
 "CHAPTER XXI: DR. SEWARD’S DIARY": 53,
 "CHAPTER XXII: JONATHAN HARKER’S JOURNAL": 54,
 "CHAPTER XXIII: DR. SEWARD’S DIARY": 55,
 "CHAPTER XXIV: DR. SEWARD’S PHONOGRAPH DIARY, SPOKEN BY VAN HELSING": 56,
 "CHAPTER XXV: DR. SEWARD’S DIARY": 57,
 "CHAPTER XXVI: DR. SEWARD’S DIARY": 58,
 "CHAPTER XXVII: MINA HARKER’S JOURNAL": 59
}
D_dr_chapter = {v:k for (k,v) in D_dr_chapter.items()}

dr_master = [item.metadata['master'] for item in dracula_docs]
dr_book_idx = [item.metadata['book_idx'] for item in dracula_docs]
dr_book = ["Dracula" for i in dr_book_idx]
dr_chapter_idx = [item.metadata['chapter_idx'] for item in dracula_docs]
dr_chapter = [D_dr_chapter[item] for item in dr_chapter_idx]
dr_content = [item.page_content for item in dracula_docs]
dr_author = ["Bram Stoker" for item in dr_content]

In [None]:
D_rp_chapter = {
    60: "Book I",
    61: "Book II",
    62: "Book III",
    63: "Book IV",
    64: "Book V",
    65: "Book VI",
    66: "Book VII",
    67: "Book VIII",
    68: "Book IX",
    69: "Book X"
}

rp_master = [item.metadata['master'] for item in republic_docs]
rp_book_idx = [item.metadata['book_idx'] for item in republic_docs]
rp_book = ["Republic" for i in rp_book_idx]
rp_chapter_idx = [item.metadata['chapter_idx'] for item in republic_docs]
rp_chapter = [D_rp_chapter[item] for item in rp_chapter_idx]
rp_content = [item.page_content for item in republic_docs]
rp_author = ["Plato" for item in rp_content]

In [None]:
master = ul_master + db_master + dr_master + rp_master
book_idx = ul_book_idx + db_book_idx + dr_book_idx + rp_book_idx
book = ul_book + db_book + dr_book + rp_book
chapter_idx = ul_chapter_idx + db_chapter_idx + dr_chapter_idx + rp_chapter_idx
chapter = ul_chapter + db_chapter + dr_chapter + rp_chapter
author = ul_author + db_author + dr_author + rp_author
content = ul_content + db_content + dr_content + rp_content

In [None]:
import pandas as pd
df = pd.DataFrame(
    {
        "master": master,
        "book_idx": book_idx,
        "book": book,
        "chapter_idx": chapter_idx,
        "chapter": chapter,
        "author": author,
        "content": content
    }
)

In [None]:
def add_meta(docs, name, data):
  for i in range(len(docs)):
    docs[i].metadata[name] = data[i]
  return docs


Next, the `book`, `chapter`, and `author` fields are added to the metadata of `ulysses_docs`, `dubliners_docs`, `dracula_docs`, and `republic_docs`.

In [None]:
fields = ["book", "chapter", "author"]
datas = [ul_book, ul_chapter, ul_author]
for i in range(3):
  ulysses_docs = add_meta(ulysses_docs,fields[i],datas[i])

In [None]:
fields = ["book", "chapter", "author"]
datas = [db_book, db_chapter, db_author]
for i in range(3):
  dubliners_docs = add_meta(dubliners_docs,fields[i],datas[i])

In [None]:
fields = ["book", "chapter", "author"]
datas = [dr_book, dr_chapter, dr_author]
for i in range(3):
  dracula_docs = add_meta(dracula_docs,fields[i],datas[i])

In [None]:
fields = ["book", "chapter", "author"]
datas = [rp_book, rp_chapter, rp_author]
for i in range(3):
  republic_docs = add_meta(republic_docs,fields[i],datas[i])
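
The four cells above repeat the same pattern, so they could be folded into one helper that takes the fields as keyword arguments. The sketch below uses a minimal `Doc` dataclass as a stand-in for a LangChain `Document`, and `add_fields` is a hypothetical name; both are for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def add_fields(docs, **columns):
    """Attach each named column to the metadata of the matching document."""
    for name, values in columns.items():
        if len(values) != len(docs):
            raise ValueError(f"{name}: expected {len(docs)} values, got {len(values)}")
        for doc, value in zip(docs, values):
            doc.metadata[name] = value
    return docs

toy_docs = [Doc("a"), Doc("b")]
add_fields(toy_docs, book=["Telemachia", "Odyssey"], author=["Joyce", "Joyce"])
print(toy_docs[1].metadata)
```

The length check guards against a column that has drifted out of step with the documents, which a plain index loop would only catch if the column were too short.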

In [None]:
docs = ulysses_docs + dubliners_docs + dracula_docs + republic_docs

In [None]:
df = df.reset_index()

In [None]:
inds = list(df['index'])
for i in range(len(inds)):
  docs[i].metadata['index'] = inds[i]


In [None]:
def integrity():
  ind = np.random.choice(df.shape[0])
  df_point = df.loc[ind]
  df_vals = [df_point['index'], df_point['master'], df_point['book'], df_point['book_idx'], df_point['chapter'], df_point['chapter_idx'], df_point['author']]
  dm = docs[ind].metadata
  doc_vals = [dm['index'], dm['master'], dm['book'], dm['book_idx'], dm['chapter'], dm['chapter_idx'], dm['author']]
  print("\n ################### \n")
  print('DataFrame:', df_vals[0], df_vals[1], df_vals[2], df_vals[3], df_vals[4], df_vals[5], df_vals[6])
  print('---------------')
  print('Document:', doc_vals[0], doc_vals[1], doc_vals[2], doc_vals[3], doc_vals[4], doc_vals[5], doc_vals[6])
  if df_vals == doc_vals:
    print('\n meta match')
  else:
    print('\n ooops!!!')
  print('\n --------------- \n')
  df_cont = df_point['content']
  doc_cont = docs[ind].page_content
  print('Dataframe:', df_cont)
  print('---------------')
  print('Document:', doc_cont)
  if df_cont == doc_cont:
    print('\n content match\n')
  else:
    print('\n oops\n')

  result = (df_vals == doc_vals) and (df_cont == doc_cont)
  print(f'\n ___ \nFinal Result: {result}')
  return result
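
Random spot checks are quick, but with under 14,000 points an exhaustive comparison is also cheap. The sketch below compares every record and returns the positions that disagree. Plain dictionaries stand in for DataFrame rows and Documents, and `find_mismatches` is a hypothetical helper.

```python
def find_mismatches(rows, docs, fields):
    """Return positions where a row and its document disagree."""
    bad = []
    for i, (row, doc) in enumerate(zip(rows, docs)):
        same_meta = all(row[f] == doc["metadata"][f] for f in fields)
        same_text = row["content"] == doc["page_content"]
        if not (same_meta and same_text):
            bad.append(i)
    return bad

# Hypothetical records standing in for DataFrame rows and Documents.
rows = [
    {"master": "Dubliners", "chapter_idx": 18, "content": "text a"},
    {"master": "Dracula", "chapter_idx": 33, "content": "text b"},
]
docs_like = [
    {"metadata": {"master": "Dubliners", "chapter_idx": 18}, "page_content": "text a"},
    {"metadata": {"master": "Dracula", "chapter_idx": 34}, "page_content": "text b"},
]
print(find_mismatches(rows, docs_like, fields=["master", "chapter_idx"]))
```

An empty list means every record matched; any returned position can be inspected directly.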

In [None]:
test = []
for i in range(10):
  test.append(integrity())
print(f'\n ***** \nMeta Result {all(test)}')


 ################### 

DataFrame: Dubliners Dubliners 3 THE SISTERS 18 Joyce
---------------
Document: Dubliners Dubliners 3 THE SISTERS 18 Joyce

 meta match

 --------------- 

Dataframe: His face was very truculent, grey and massive, with black cavernous nostrils and circled by a scanty white fur. There was a heavy odour in the room—the flowers.   We blessed ourselves and came away. In the little room downstairs we found Eliza seated in his arm-chair in state.
---------------
Document: His face was very truculent, grey and massive, with black cavernous nostrils and circled by a scanty white fur. There was a heavy odour in the room—the flowers.   We blessed ourselves and came away. In the little room downstairs we found Eliza seated in his arm-chair in state.

 content match


 ___ 
Final Result: True

 ################### 

DataFrame: Dubliners Dubliners 3 THE DEAD 32 Joyce
---------------
Document: Dubliners Dubliners 3 THE DEAD 32 Joyce

 meta match

 --------------- 

Dataframe:

## $\color{blue}{Subset/Save:}$


In [None]:
np.random.seed(0)
train_inds = np.random.choice(df.shape[0], 12000, replace = False)
other_inds = set(range(df.shape[0])) - set(train_inds)
dev_inds = np.random.choice(list(other_inds), 964, replace = False)
test_inds = np.array(list(other_inds - set(dev_inds)))

In [None]:
print(train_inds[:5])
print(len(train_inds))
print(dev_inds[:5])
print(len(dev_inds))
print(test_inds[:5])
print(len(test_inds))

[ 8114  4951  4629 11556 12262]
12000
[3781 2532  662 2352 7128]
964
[   0 2051 4108 2068 6167]
1000
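
An alternative way to build the three index sets is to shuffle once and slice, which guarantees disjoint parts by construction. This is a sketch using NumPy's `default_rng`; the function name `split_indices` is hypothetical, and the resulting partition will differ from the one produced by the seeded calls above.

```python
import numpy as np

def split_indices(n, n_train, n_dev, seed=0):
    """Shuffle range(n) once and cut it into train/dev/test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    train = perm[:n_train]
    dev = perm[n_train:n_train + n_dev]
    test = perm[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_indices(13964, 12000, 964)
# The three parts are disjoint and together cover every index exactly once.
assert len(set(train) | set(dev) | set(test)) == 13964
print(len(train), len(dev), len(test))
```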


In [None]:
df_train = df.loc[train_inds]
df_dev = df.loc[dev_inds]
df_test = df.loc[test_inds]
docs_train = [docs[i] for i in train_inds]
docs_dev = [docs[i] for i in dev_inds]
docs_test = [docs[i] for i in test_inds]

In [None]:
# Union of the three index sets; `+` on Series would align on labels and yield NaN
len(set(df_train['index']) | set(df_dev['index']) | set(df_test['index']))

13964
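
One caveat when checking coverage with pandas: `+` on two Series adds values by index label instead of concatenating them, so labels present in only one operand come out as `NaN`. A small sketch of that behaviour on toy data, next to a set-based union:

```python
import pandas as pd

a = pd.Series([10, 11], index=[0, 1])
b = pd.Series([12], index=[2])

summed = a + b                     # aligned on labels 0, 1, 2; no label is shared
print(summed.isna().all())

union = set(a) | set(b)            # the distinct values across both Series
print(sorted(union))
```

Since the train, dev, and test frames share no row labels, a set union (or `pd.concat`) is the safer way to count distinct indices across them.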

In [None]:
all_inds = set()
for item in docs_train:
  all_inds.add(item.metadata['index'])
for item in docs_dev:
  all_inds.add(item.metadata['index'])
for item in docs_test:
  all_inds.add(item.metadata['index'])

In [None]:
len(set(all_inds))

13964

In [None]:
training = [item.metadata['index'] for item in docs_train]
training == list(df_train['index'])

True

In [None]:
dev = [item.metadata['index'] for item in docs_dev]
dev == list(df_dev['index'])

True

In [None]:
test = [item.metadata['index'] for item in docs_test]
test == list(df_test['index'])

True

In [None]:
df_train.to_pickle('class/datasets/df_train')
df_dev.to_pickle('class/datasets/df_dev')
df_test.to_pickle('class/datasets/df_test')

In [None]:
!pip install dill

Collecting dill
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Downloading dill-0.3.9-py3-none-any.whl (119 kB)
Installing collected packages: dill
Successfully installed dill-0.3.9


In [None]:
import dill

In [None]:
def save_langchain_docs(docs, filename):
    """Save a list of Langchain Documents to a .dill file."""
    with open(filename, 'wb') as f:
        dill.dump(docs, f)
    print(f"Documents saved to {filename}")

In [None]:
def load_langchain_docs(filename):
    """Load a list of Langchain Documents from a .dill file."""
    with open(filename, 'rb') as f:
        docs = dill.load(f)
    print(f"Documents loaded from {filename}")
    return docs
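
Before relying on the saved files, it is worth confirming that a save followed by a load returns the same objects. The sketch below does this with the standard-library `pickle`, whose `dump`/`load` interface `dill` mirrors. Plain dictionaries stand in for the documents, and the file name is made up.

```python
import os
import pickle
import tempfile

# Hypothetical records standing in for LangChain Documents.
records = [
    {"page_content": "text a", "metadata": {"index": 0}},
    {"page_content": "text b", "metadata": {"index": 1}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs_demo")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)
```

As with any pickle-based format, files should only be loaded from sources we trust, since unpickling can execute arbitrary code.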

In [None]:
save_langchain_docs(docs_train, "class/datasets/docs_train")
save_langchain_docs(docs_dev, "class/datasets/docs_dev")
save_langchain_docs(docs_test, "class/datasets/docs_test")

Documents saved to class/datasets/docs_train
Documents saved to class/datasets/docs_dev
Documents saved to class/datasets/docs_test
