# 📘 Notebook 01: Data Collection
# Emotional Geography of Books – Shraddha

This notebook loads Goodreads book lists for 2020 to 2024, runs basic exploratory checks,
and enriches each book with author metadata. The enrichment step fetches each author's
profile and extracts gender and country information.

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("..")
from time import sleep, time
from utils.data_preprocessing import load_all_books, clean_books
from utils.author_metadata import enrich_books_with_authors

In [2]:
# Load all books from the raw folder
df_all = load_all_books()
df_all.head()

📂 Found 5 Goodreads files.

📂 Loading Goodreads files: 100%|██████████| 5/5 [00:00<00:00, 32.99it/s]

📚 Total books loaded: 1015

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
0,The Midnight Library,Matt Haig,https://www.goodreads.com/book/show/52578297-t...,3.98,2m ratings,Between life and death there is a library.When...,2020
127,The Happiest Man on Earth,Eddie Jaku,https://www.goodreads.com/book/show/53239311-t...,4.62,111k ratings,Life can be beautiful if you make it beautiful...,2020
128,"The City We Became (Great Cities, #1)",N.K. Jemisin,https://www.goodreads.com/book/show/42074525-t...,3.85,75.5k ratings,Five New Yorkers must come together in order t...,2020
129,"Throttled (Dirty Air, #1)",Lauren Asher,https://www.goodreads.com/book/show/206023355-...,3.77,233k ratings,NoahMaya Alatorre is the sister of my teammate...,2020
130,If It Bleeds,Stephen King,https://www.goodreads.com/book/show/46015758-i...,3.98,114k ratings,If it Bleeds is a collection of four new novel...,2020


### Basic EDA

In [3]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1015 entries, 0 to 1014
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           1015 non-null   object 
 1   author          1015 non-null   object 
 2   link            1015 non-null   object 
 3   rating          1015 non-null   float64
 4   ratings_count   1015 non-null   object 
 5   description     1015 non-null   object 
 6   published_year  1015 non-null   int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 63.4+ KB
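
The `info()` output shows `ratings_count` stored as `object`, because the values are strings such as `2m ratings` or `75.5k ratings`. A minimal sketch of converting them to integers follows. The helper `parse_ratings_count` and the column name `ratings_count_num` are hypothetical, and the internals of `clean_books` are not shown in this notebook, so it may already handle this differently.

```python
import re
import pandas as pd

# Multipliers for the abbreviations seen in the data ("k", "m"); assumed to be the only ones.
_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000}
_PATTERN = re.compile(r"\s*([\d.,]+)\s*([km]?)(?:\s+ratings?)?\s*$")

def parse_ratings_count(text):
    """Convert strings like '75.5k ratings' to an int; return None if unparseable."""
    match = _PATTERN.match(str(text).lower())
    if not match:
        return None
    try:
        number = float(match.group(1).replace(",", ""))
    except ValueError:
        return None
    return int(round(number * _SUFFIX[match.group(2)]))

# Toy frame for illustration only (not the real data).
sample = pd.DataFrame(
    {"ratings_count": ["2m ratings", "111k ratings", "75.5k ratings", "532 ratings", "n/a"]}
)
# Nullable integer dtype keeps unparseable values as <NA> instead of failing.
sample["ratings_count_num"] = pd.array(
    [parse_ratings_count(v) for v in sample["ratings_count"]], dtype="Int64"
)
print(sample)
```

Because the source strings are rounded (`2m`, `111k`), the parsed values are approximations of the true counts.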


In [4]:
df_all.describe(include="all")

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
count,1015,1015,1015,1015.0,1015,1015,1015.0
unique,998,600,1000,,596,1000,
top,"The Love Wager (Mr. Wrong Number, #2)",Freida McFadden,https://www.goodreads.com/book/show/60487511-t...,,1m ratings,Hallie Piper is turning over a new leaf. After...,
freq,2,18,2,,16,2,
mean,,,,3.981429,,,2022.014778
std,,,,0.275918,,,1.409596
min,,,,2.79,,,2020.0
25%,,,,3.79,,,2021.0
50%,,,,4.0,,,2022.0
75%,,,,4.17,,,2023.0
max,,,,4.76,,,2024.0


In [5]:
# Count the number of rows per published_year
df_all["published_year"].value_counts()

2023    215
2020    200
2021    200
2022    200
2024    200
Name: published_year, dtype: int64

In [7]:
df_all.duplicated().sum()

15
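
`duplicated()` counts rows that are identical across every column. The 15 found here match the earlier summary, where `link` has 1000 unique values across 1015 rows. A minimal sketch of removing them is below, on toy data. Using `link` as the deduplication key is an assumption, and whether `clean_books` already drops duplicates is not shown in this notebook.

```python
import pandas as pd

# Toy frame for illustration only: the first two rows are exact duplicates.
sample = pd.DataFrame(
    {
        "title": ["A", "A", "B"],
        "link": ["u/1", "u/1", "u/2"],
        "published_year": [2022, 2022, 2023],
    }
)

# Number of rows that repeat an earlier row in every column.
n_exact = int(sample.duplicated().sum())

# Keep the first occurrence of each link and rebuild a clean 0..n-1 index.
deduped = sample.drop_duplicates(subset="link", keep="first").reset_index(drop=True)

print(n_exact, len(deduped))
```

Deduplicating on `link` rather than on all columns also catches rows for the same book whose scraped text differs slightly.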

In [8]:
df = df_all.copy()

In [8]:
df = clean_books(df)

In [9]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
title,1015.0,998.0,"The Love Wager (Mr. Wrong Number, #2)",2.0,,,,,,,
author,1015.0,600.0,Freida McFadden,18.0,,,,,,,
link,1015.0,1000.0,https://www.goodreads.com/book/show/60487511-t...,2.0,,,,,,,
rating,1015.0,,,,3.981429,0.275918,2.79,3.79,4.0,4.17,4.76
ratings_count,1015.0,596.0,1m ratings,16.0,,,,,,,
description,1015.0,1000.0,Hallie Piper is turning over a new leaf. After...,2.0,,,,,,,
published_year,1015.0,,,,2022.014778,1.409596,2020.0,2021.0,2022.0,2023.0,2024.0


In [10]:
# 🧠 Enrich each book with author metadata, including gender
df = enrich_books_with_authors(df)

# 🔍 Preview
df[["author", "author_gender"]].head()

https://www.goodreads.com/author/show/20228583.Eddie_Jaku - in Leipzig, Germany - male - goodreads
https://www.goodreads.com/author/show/76360.Matt_Haig - in Sheffield, The United Kingdom - male - goodreads
https://www.goodreads.com/author/show/2917917.N_K_Jemisin - in The United States - unknown - unknown
https://www.goodreads.com/author/show/3389.Stephen_King - in Portland, Maine, The United States - male - goodreads
https://www.goodreads.com/author/show/19561164.Lauren_Asher -  - female - goodreads
https://www.goodreads.com/author/show/2996650.Karen_McQuestion - in Milwaukee, WI, The United States - female - goodreads
https://www.goodreads.com/author/show/2003596.Tracy_Wolff -  - female - goodreads
https://www.goodreads.com/author/show/2869149.Charlotte_McConaghy - in Australia - female - goodreads
https://www.goodreads.com/author/show/15862877.Rebecca_Roanhorse -  - female - goodreads
https://www.goodreads.com/author/show/7060621.Barbara_Davis - in Fair Lawn, New Jersey, The United

Unnamed: 0,author,author_gender
0,Matt Haig,male
1,Eddie Jaku,male
2,Lauren Asher,female
3,Stephen King,male
4,Karen McQuestion,female


In [12]:
df["author_gender"].value_counts()

female                749
male                  144
unknown/non-binary     11
Name: author_gender, dtype: int64
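
To read these counts as proportions, here is a small sketch that rebuilds the series from the numbers printed above and converts it to percentages. On the live frame, `df["author_gender"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
gender_counts = pd.Series(
    {"female": 749, "male": 144, "unknown/non-binary": 11},
    name="author_gender",
)

total_rows = int(gender_counts.sum())
gender_share_pct = (gender_counts / total_rows * 100).round(1)

print(total_rows)
print(gender_share_pct)
```

The counts sum to 904 rows, fewer than the 1015 loaded at the start. The cells shown here do not indicate where that reduction happens.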

In [None]:
df["gender_source"].value_counts()

goodreads    778
manual        87
namsor        39
Name: gender_source, dtype: int64

#### Feature Engineering: Get Author's Country
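
The cells that derive the country are not included in this export. As a rough sketch of one possible approach, the location strings logged during enrichment (for example `in Leipzig, Germany`) could be reduced to a country by taking the last comma-separated part. The helper `extract_country` is hypothetical and is not the notebook's actual method.

```python
def extract_country(location):
    """Return the last comma-separated part of a location string, or None if empty."""
    if not location:
        return None
    text = location.strip()
    # Logged locations start with "in ", e.g. "in Sheffield, The United Kingdom".
    if text.lower().startswith("in "):
        text = text[3:]
    country = text.split(",")[-1].strip()
    # Drop the leading article so "The United States" becomes "United States".
    if country.lower().startswith("the "):
        country = country[4:]
    return country or None


for loc in [
    "in Leipzig, Germany",
    "in Sheffield, The United Kingdom",
    "in Portland, Maine, The United States",
    "in The United States",
    "",
]:
    print(repr(loc), "->", extract_country(loc))
```

This heuristic assumes the country is always the final component. Authors with an empty location (several appear in the log) come back as `None` and would need another source.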

In [123]:
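# NOTE: df_final is not defined in the cells shown above; it is assumed to be
# the final enriched DataFrame produced by the preceding steps.
df_final.to_csv("../data/processed/clean_books_2020_2024.csv", index=False)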
print("✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv")

✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv
