In [2]:
import pandas as pd
import csv
import json

In [21]:
pd.set_option('display.max_colwidth',1000)

In [24]:
import csv

# Path to your input text file
input_file_path = 'booksummaries.txt'

# Path to the output CSV file
output_csv_file_path = 'books_dataset.csv'

# Define a function to parse each entry from the text file
def parse_dataset_entry(entry):
    parts = entry.split('\t')
    # Assuming the format is: ID UniqueID Title Author Year Genres Summary
    genres_json = parts[5]
    try:
        # Swap single quotes for double quotes before parsing as JSON.
        # Caveat: this also rewrites apostrophes inside genre names, which can break otherwise valid JSON.
        genres = json.loads(genres_json.replace("'", "\""))
        genre_list = ", ".join(genres.values())
    except json.JSONDecodeError:
        genre_list = "Error parsing genres"
    
    summary = parts[6] if len(parts) > 6 else "Summary not available"
    
    return {
        'Title': parts[2],
        'Author': parts[3],
        'Year': parts[4],
        'Genres': genre_list,
        'Summary': summary
    }


# Initialize a list to hold parsed entries
parsed_entries = []

# Read the text file and parse each entry
with open(input_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        parsed_entry = parse_dataset_entry(line.strip())
        parsed_entries.append(parsed_entry)

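# NOTE: this second loop re-reads the same file, so every entry is appended to
# parsed_entries twice and the output CSV contains each book two times.
with open(input_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        parsed_entry = parse_dataset_entry(line.strip())
        parsed_entries.append(parsed_entry)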

# Write the parsed data to a CSV file
with open(output_csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
    # fieldnames sets the column order of the output CSV
    writer = csv.DictWriter(csv_file, fieldnames=['Title', 'Author', 'Year', 'Genres', 'Summary'])
    writer.writeheader()
    writer.writerows(parsed_entries)

print(f"CSV file has been created at: {output_csv_file_path}")


CSV file has been created at: books_dataset.csv


In [3]:
df=pd.read_csv('books_dataset.csv')

In [27]:
df.head()


,Title,Author,Year,Genres,Summary
0,Animal Farm,George Orwell,1945-08-17,Error parsing genres,"Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it ""Animal Farm"". They adopt Seven Commandments of Animal-ism, the most important of which is, ""All animals are equal"". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. ..."
1,A Clockwork Orange,Anthony Burgess,1962,"Science Fiction, Novella, Speculative fiction, Utopian and dystopian fiction, Satire, Fiction","Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ""ultra-violence."" Alex's friends (""droogs"" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and Pete, who mostly plays along as the droogs indulge their taste for ultra-violence. Characterized as a sociopath and a hardened juvenile delinquent, Alex is also intelligent and quick-witted, with sophisticated taste in music, being particularly fond of Beethoven, or ""Lovely Ludwig Van."" The novel begins with the droogs sitting in their favorite hangout (the Korova Milkbar), drinking milk-drug cocktails, called ""milk-plus"", to hype themselves for the night's mayhem. They assault a scholar walking home from the public library, rob a store leaving the owner and his wife bloodied and unconscious, stomp a panhandling derelict, then scuffle with a rival gang. Joyriding through the countryside in ..."
2,The Plague,Albert Camus,1947,"Existentialism, Fiction, Absurdist fiction, Novel","The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order the collection and cremation of the rats, unaware that the collection itself was the catalyst for the spread of the bubonic plague. The main character, Dr. Bernard Rieux, lives comfortably in an apartment building when strangely the building's concierge, M. Michel, a confidante, dies from a fever. Dr. Rieux consults his colleague, Castel, about the illness until they come to the conclusion that a plague is sweeping the town. They both approach fellow doctors and town authorities about their theory, but are eventually dismissed on the basis of one death. However, as more and more deaths quickly ensue, it becomes apparent that there is an epidemic. Authorities, including the Prefect, M. Othon, ..."
3,An Enquiry Concerning Human Understanding,David Hume,,Error parsing genres,"The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his principles to specific topics. In the first section of the Enquiry, Hume provides a rough introduction to philosophy as a whole. For Hume, philosophy can be split into two general parts: natural philosophy and the philosophy of human nature (or, as he calls it, ""moral philosophy""). The latter investigates both actions and thoughts. He emphasizes in this section, by way of warning, that philosophers with nuanced thoughts will likely be cast aside in favor of those whose conclusions more intuitively match popular opinion. However, he insists, precision helps art and craft of all kinds, including the craft of philosophy. Next, Hume discusses the distinction between impressions and ideas. By ""impressions"", he means sensations, while by ""ideas"", he means memories and imaginings. According to..."
4,A Fire Upon the Deep,Vernor Vinge,,"Hard science fiction, Science Fiction, Speculative fiction, Fantasy, Fiction","The novel posits that space around the Milky Way is divided into concentric layers called Zones, each being constrained by different laws of physics and each allowing for different degrees of biological and technological advancement. The innermost, the ""Unthinking Depths"", surrounds the galactic core and is incapable of supporting advanced life forms at all. The next layer, the ""Slow Zone"", is roughly equivalent to the real world in behavior and potential. Further out, the zone named the ""Beyond"" can support futuristic technologies such as AI and FTL travel. The outermost zone, the ""Transcend"", contains most of the galactic halo and is populated by incomprehensibly vast and powerful posthuman entities. A human expedition investigates a five-billion-year-old data archive that offers the possibility of unimaginable riches for the ambitious young civilization of the Straumli Realm. The expedition's facility, called High Lab, is gradually compromised by a dormant super-intelligent ent..."
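
Two of the five rows shown above read "Error parsing genres". One plausible cause, assuming the genres field is already a valid JSON object mapping IDs to genre names, is the quote replacement in `parse_dataset_entry`: an apostrophe inside a genre name becomes a stray double quote and the field no longer parses. An empty genres field raises `JSONDecodeError` as well. The sketch below parses the field as-is and returns an empty string when nothing usable is present. The helper name `parse_genres` and the sample IDs are hypothetical, chosen only for illustration.

```python
import json


def parse_genres(genres_json):
    """Return a comma-separated genre string, or '' if the field is empty or malformed."""
    if not genres_json.strip():
        return ""  # no genres recorded for this book
    try:
        genres = json.loads(genres_json)  # parse without touching the quotes
    except json.JSONDecodeError:
        return ""
    if not isinstance(genres, dict):
        return ""  # assumption: a valid field is an object of id -> genre name
    return ", ".join(str(name) for name in genres.values())


# Made-up sample field with an apostrophe inside a genre name
sample = '{"/m/id1": "Children\'s literature", "/m/id2": "Fiction"}'
print(parse_genres(sample))    # Children's literature, Fiction
print(repr(parse_genres("")))  # ''
```

Returning an empty string rather than a sentinel such as "Error parsing genres" lets pandas read the missing value as NaN, so it shows up in `isnull()` counts.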


In [4]:
# As the preview shows, we only need the title and author (for indexing) and the summary (for NLP processing).
# We may use other fields such as genres later, but not for now.
# So we will generate 4 CSV files for our CI/CD practice.

df_use=df[['Title','Author','Summary']]

In [6]:
df_use.shape

(33118, 3)
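
The cell that builds `parsed_entries` reads the input file in two identical loops, so each book is most likely present twice in this count. One way to avoid that is to drop repeated entries before writing the CSV. This is a minimal sketch on a made-up list of entries; the helper name `drop_repeated_entries` is hypothetical.

```python
def drop_repeated_entries(entries):
    """Keep the first occurrence of each (Title, Author, Summary) combination."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["Title"], entry["Author"], entry["Summary"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Toy data: two entries, each appearing twice (as after a double read)
toy_entries = [
    {"Title": "A", "Author": "x", "Summary": "s1"},
    {"Title": "B", "Author": "", "Summary": "s2"},
] * 2

unique_entries = drop_repeated_entries(toy_entries)
print(len(toy_entries), "->", len(unique_entries))  # 4 -> 2
```

On a DataFrame, `df_use.drop_duplicates()` has the same effect.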

In [31]:
df_use.isnull().sum()

Title         0
Author     4764
Summary       0
dtype: int64

In [8]:
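# Missing authors are acceptable: the author is only used to locate a book.
# As long as the summary is not null, we are fine.
# Number of rows in each split (integer division, so a few rows may be left over)
rows_per_split = len(df_use) // 4

# Slice the DataFrame into chunks of rows_per_split rows;
# if the length is not a multiple of 4, the leftover rows form a small extra chunk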
dfs = [df_use.iloc[i:i + rows_per_split] for i in range(0, len(df_use), rows_per_split)]

# Save each part to a CSV file
for i, part in enumerate(dfs, start=1):  # 'part' avoids overwriting the original df
    csv_file_name = f'books_{i}.csv'  # Output file name for this chunk
    part.to_csv(csv_file_name, index=False)
    print(f'Saved: {csv_file_name}')

Saved: books_1.csv
Saved: books_2.csv
Saved: books_3.csv
Saved: books_4.csv
Saved: books_5.csv
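
A fifth file appears because 33118 is not a multiple of 4: `33118 // 4` is 8279, so `range(0, 33118, 8279)` yields the start offsets 0, 8279, 16558, 24837 and 33116, and the last chunk holds only 2 rows. If exactly four files are wanted, the chunk boundaries can be computed so that the remainder is spread over the first chunks. This is a minimal sketch; `split_bounds` is a hypothetical helper, not part of pandas.

```python
def split_bounds(n_rows, n_parts):
    """Return (start, stop) pairs covering range(n_rows) in exactly n_parts chunks.

    The first n_rows % n_parts chunks receive one extra row each.
    """
    base, extra = divmod(n_rows, n_parts)
    bounds = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(split_bounds(33118, 4))
# [(0, 8280), (8280, 16560), (16560, 24839), (24839, 33118)]
```

Each pair can then be used as `df_use.iloc[start:stop]` to produce the four parts.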


In [3]:
# OK, we have the files we wanted.
# We will not use books_5: it only holds the leftover rows and is too small.
# Check one of the saved files
df_check=pd.read_csv('books_1.csv')

In [4]:
df_check.head()

,Title,Author,Summary
0,Animal Farm,George Orwell,"Old Major, the old boar on the Manor Farm, ca..."
1,A Clockwork Orange,Anthony Burgess,"Alex, a teenager living in near-future Englan..."
2,The Plague,Albert Camus,The text of The Plague is divided into five p...
3,An Enquiry Concerning Human Understanding,David Hume,The argument of the Enquiry proceeds by a ser...
4,A Fire Upon the Deep,Vernor Vinge,The novel posits that space around the Milky ...
