The working assumption is that books by the same author make good recommendations for one another, so the books are grouped by author to form an in-group (books by the same author) and an out-group (books by different authors).

This is not a perfect basis for comparison: some authors have vastly different works in their bibliography, and some authors write very similarly to one another. It still works as a first approximation for a dataset of similar books.
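As a rough illustration of the in-group/out-group idea, the sketch below pairs up books in a tiny hand-made table and sorts each pair by whether the two authors match. The toy table, the `split_pairs` helper, and the choice to enumerate every pair are assumptions made for illustration only; they are not taken from the notebook, which may form the groups differently.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the Goodreads data (hypothetical rows);
# only the "Book" and "Author" columns are used here.
toy = pd.DataFrame(
    {
        "Book": ["Emma", "Persuasion", "Animal Farm", "1984", "Dracula"],
        "Author": [
            "Jane Austen",
            "Jane Austen",
            "George Orwell",
            "George Orwell",
            "Bram Stoker",
        ],
    }
)


def split_pairs(df):
    """Return (in_group, out_group) lists of book-title pairs."""
    in_group, out_group = [], []
    rows = list(zip(df["Book"], df["Author"]))
    # Visit every unordered pair of books exactly once
    for (title_a, author_a), (title_b, author_b) in combinations(rows, 2):
        pair = (title_a, title_b)
        if author_a == author_b:
            in_group.append(pair)   # same author
        else:
            out_group.append(pair)  # different authors
    return in_group, out_group


in_group, out_group = split_pairs(toy)
print("In-group pairs:", len(in_group))
print("Out-group pairs:", len(out_group))
```

Enumerating every pair grows quadratically with the number of books, so on the full dataset it would likely make sense to sample out-group pairs rather than list them all.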

In [2]:
import pandas as pd

In [3]:
# Get the data from the CSV file
file = "Datasets/goodreads_data.csv"
rawdata = pd.read_csv(file)

# Display the first 5 rows of the data
print(rawdata.head())

   Unnamed: 0                                               Book  \
0           0                              To Kill a Mockingbird   
1           1  Harry Potter and the Philosopher’s Stone (Harr...   
2           2                                Pride and Prejudice   
3           3                          The Diary of a Young Girl   
4           4                                        Animal Farm   

          Author                                        Description  \
0     Harper Lee  The unforgettable novel of a childhood in a sl...   
1   J.K. Rowling  Harry Potter thinks he is an ordinary boy - un...   
2    Jane Austen  Since its immediate success in 1813, Pride and...   
3     Anne Frank  Discovered in the attic in which she spent the...   
4  George Orwell  Librarian's note: There is an Alternate Cover ...   

                                              Genres  Avg_Rating Num_Ratings  \
0  ['Classics', 'Fiction', 'Historical Fiction', ...        4.27   5,691,311   
1  [

In [8]:
# Get a list of the unique authors in the data
authors = rawdata["Author"].unique()

print("\nAuthors:", len(authors))
for author in authors:
    print(author)


Authors: 6064
Harper Lee
J.K. Rowling
Jane Austen
Anne Frank
George Orwell
Antoine de Saint-Exupéry
F. Scott Fitzgerald
J.D. Salinger
J.R.R. Tolkien
Markus Zusak
Charlotte Brontë
C.S. Lewis
William Golding
William Shakespeare
Khaled Hosseini
Lois Lowry
Shel Silverstein
E.B. White
Louisa May Alcott
Suzanne Collins
John Steinbeck
Ray Bradbury
Dr. Seuss
Emily Brontë
Lewis Carroll
Oscar Wilde
Elie Wiesel
Margaret Mitchell
Anonymous
Douglas Adams
Mark Twain
Victor Hugo
Paulo Coelho
Aldous Huxley
Fyodor Dostoevsky
Kathryn Stockett
Frances Hodgson Burnett
Arthur Golden
Homer
Charles Dickens
S.E. Hinton
Gabriel García Márquez
L.M. Montgomery
Alexandre Dumas
Orson Scott Card
Maurice Sendak
Alice Walker
Yann Martel
Margaret Atwood
Ken Kesey
Mary Wollstonecraft Shelley
Daniel Keyes
Leo Tolstoy
Mitch Albom
A.A. Milne
Kurt Vonnegut Jr.
Ernest Hemingway
Joseph Heller
Nathaniel Hawthorne
Stephenie Meyer
Vladimir Nabokov
Franz Kafka
Audrey Niffenegger
Hermann Hesse
Albert Camus
Margaret Wise Brown
Be

In [11]:
# Create a dictionary to store the author names and the number of books they have written
# Count the books per author in a single pass over the column
book_counts = rawdata["Author"].value_counts()
# Keep only the authors with at least 2 books, in order of first appearance
author_books = {
    author: int(book_counts.get(author, 0))
    for author in authors
    if book_counts.get(author, 0) > 1
}

print("Number of authors with more than 1 book:", len(author_books))

print("\nBooks per author:")
for author in author_books:
    print(author, author_books[author])

Number of authors with more than 1 book: 1385

Books per author:
Harper Lee 2
J.K. Rowling 13
Jane Austen 8
George Orwell 6
Antoine de Saint-Exupéry 2
F. Scott Fitzgerald 5
J.D. Salinger 4
J.R.R. Tolkien 12
Markus Zusak 2
Charlotte Brontë 2
C.S. Lewis 20
William Golding 2
William Shakespeare 39
Khaled Hosseini 4
Lois Lowry 5
Shel Silverstein 7
E.B. White 4
Louisa May Alcott 4
Suzanne Collins 10
John Steinbeck 10
Ray Bradbury 7
Dr. Seuss 14
Lewis Carroll 4
Oscar Wilde 11
Elie Wiesel 3
Anonymous 28
Douglas Adams 11
Mark Twain 13
Victor Hugo 5
Paulo Coelho 12
Aldous Huxley 5
Fyodor Dostoevsky 10
Frances Hodgson Burnett 3
Homer 3
Charles Dickens 14
S.E. Hinton 2
Gabriel García Márquez 7
L.M. Montgomery 8
Alexandre Dumas 4
Orson Scott Card 10
Maurice Sendak 2
Alice Walker 3
Margaret Atwood 10
Ken Kesey 2
Daniel Keyes 2
Leo Tolstoy 4
Mitch Albom 7
A.A. Milne 5
Kurt Vonnegut Jr. 10
Ernest Hemingway 8
Nathaniel Hawthorne 3
Stephenie Meyer 10
Vladimir Nabokov 6
Franz Kafka 9
Audrey Niffenegger 

In [15]:
# Create a dictionary to store titles and plot summaries grouped by author
author_summaries = {}
# Iterate over the authors with more than one book
for author in author_books:
    # Select this author's rows once, keeping only the title and summary columns
    books = rawdata.loc[rawdata["Author"] == author, ["Book", "Description"]]
    # Map each title to its summary (a repeated title keeps only its last summary)
    author_summaries[author] = dict(zip(books["Book"], books["Description"]))

print("\nTitles and summaries per author:")
for author in author_summaries:
    print(author)
    print(author_summaries[author])


Titles and summaries per author:
Harper Lee
{'To Kill a Mockingbird': 'The unforgettable novel of a childhood in a sleepy Southern town and the crisis of conscience that rocked it. "To Kill A Mockingbird" became both an instant bestseller and a critical success when it was first published in 1960. It went on to win the Pulitzer Prize in 1961 and was later made into an Academy Award-winning film, also a classic.Compassionate, dramatic, and deeply moving, "To Kill A Mockingbird" takes readers to the roots of human behavior - to innocence and experience, kindness and cruelty, love and hatred, humor and pathos. Now with over 18 million copies in print and translated into forty languages, this regional story by a young Alabama woman claims universal appeal. Harper Lee always considered her book to be a simple love story. Today it is regarded as a masterpiece of American literature.', 'Go Set a Watchman': 'From Harper Lee comes a landmark new novel set two decades after her beloved Pulitzer