# 📘 Notebook 01: Data Collection
# Emotional Geography of Books – Shraddha

This notebook loads the raw Goodreads book files for 2020 to 2024, runs basic checks and cleaning,
and enriches each book with author metadata (the author's gender and country).
The enriched table is then saved for use in later notebooks.

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append("..")
from time import sleep, time
from utils.data_preprocessing import load_all_books, clean_books
from utils.author_metadata import run_enrichment

In [2]:
# Load all books from the raw folder
df_all = load_all_books()
df_all.head()

📂 Found 5 Goodreads files.
📂 Loading Goodreads files: 100%|██████████| 5/5 [00:00<00:00, 33.29it/s]
📚 Total books loaded: 1015

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
0,The Midnight Library,Matt Haig,https://www.goodreads.com/book/show/52578297-t...,3.98,2m ratings,Between life and death there is a library.When...,2020
127,The Happiest Man on Earth,Eddie Jaku,https://www.goodreads.com/book/show/53239311-t...,4.62,111k ratings,Life can be beautiful if you make it beautiful...,2020
128,"The City We Became (Great Cities, #1)",N.K. Jemisin,https://www.goodreads.com/book/show/42074525-t...,3.85,75.5k ratings,Five New Yorkers must come together in order t...,2020
129,"Throttled (Dirty Air, #1)",Lauren Asher,https://www.goodreads.com/book/show/206023355-...,3.77,233k ratings,NoahMaya Alatorre is the sister of my teammate...,2020
130,If It Bleeds,Stephen King,https://www.goodreads.com/book/show/46015758-i...,3.98,114k ratings,If it Bleeds is a collection of four new novel...,2020


### Basic EDA

In [3]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1015 entries, 0 to 1014
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           1015 non-null   object 
 1   author          1015 non-null   object 
 2   link            1015 non-null   object 
 3   rating          1015 non-null   float64
 4   ratings_count   1015 non-null   object 
 5   description     1015 non-null   object 
 6   published_year  1015 non-null   int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 63.4+ KB


In [4]:
df_all.describe(include="all")

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
count,1015,1015,1015,1015.0,1015,1015,1015.0
unique,998,600,1000,,596,1000,
top,"The Love Wager (Mr. Wrong Number, #2)",Freida McFadden,https://www.goodreads.com/book/show/60487511-t...,,1m ratings,Hallie Piper is turning over a new leaf. After...,
freq,2,18,2,,16,2,
mean,,,,3.981429,,,2022.014778
std,,,,0.275918,,,1.409596
min,,,,2.79,,,2020.0
25%,,,,3.79,,,2021.0
50%,,,,4.0,,,2022.0
75%,,,,4.17,,,2023.0


In [5]:
# Count rows per published_year
df_all["published_year"].value_counts()

2023    215
2020    200
2021    200
2022    200
2024    200
Name: published_year, dtype: int64

In [6]:
df_all.duplicated().sum()

15
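
The 15 duplicated rows equal the surplus in 2023 (215 rows versus 200 in every other year), which hints that the repeats sit in that year, although the output above does not confirm it. A minimal sketch of how that could be checked, using a small made-up frame (`toy`) as a stand-in for `df_all`:

```python
import pandas as pd

# Toy stand-in for df_all (made-up rows, for illustration only)
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C", "C", "D"],
    "published_year": [2020, 2023, 2023, 2023, 2023, 2024],
})

# keep=False flags every copy of a duplicated row, not only the repeats
dupes = toy[toy.duplicated(keep=False)]

# Surplus rows (second and later copies) counted per year
surplus_by_year = toy[toy.duplicated()].groupby("published_year").size()

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(surplus_by_year)
print(len(toy), "->", len(deduped))
```

Running the same three lines against `df_all` would show whether all 15 repeats fall in 2023.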

In [7]:
df = df_all.copy()

In [8]:
df = clean_books(df)

In [9]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
title,1000.0,998.0,The Wife Upstairs,2.0,,,,,,,
author,1000.0,600.0,Freida McFadden,18.0,,,,,,,
link,1000.0,1000.0,https://www.goodreads.com/book/show/52578297-t...,1.0,,,,,,,
rating,1000.0,,,,3.98327,0.275731,2.79,3.79,4.0,4.17,4.76
ratings_count,1000.0,,,,201622.796,262069.491392,11200.0,76600.0,116000.0,202250.0,3000000.0
description,1000.0,1000.0,Between life and death there is a library.When...,1.0,,,,,,,
published_year,1000.0,,,,2022.0,1.414921,2020.0,2021.0,2022.0,2023.0,2024.0
author_first,1000.0,429.0,jennifer,22.0,,,,,,,
source,1000.0,1.0,Goodreads,1000.0,,,,,,,
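
`clean_books` lives in `utils.data_preprocessing` and its body is not shown in this notebook. The output above does show that `ratings_count` went from strings such as `2m ratings` to numbers. The sketch below is one way such a conversion could be written; the helper name `parse_ratings_count` and the regular expression are assumptions for illustration, not the project's actual code.

```python
import re
import pandas as pd

# Number with optional thousands separators and decimals, then an optional k/m suffix
_PATTERN = re.compile(r"\s*(\d+(?:,\d{3})*(?:\.\d+)?)([km]?)", re.IGNORECASE)
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_ratings_count(text):
    """Convert strings like '75.5k ratings' to a float; return NaN if unparseable."""
    match = _PATTERN.match(str(text))
    if match is None:
        return float("nan")
    number = float(match.group(1).replace(",", ""))
    return number * _MULTIPLIER[match.group(2).lower()]


# Example values in the same shape as the raw Goodreads column
raw = pd.Series(["2m ratings", "111k ratings", "75.5k ratings", "980 ratings"])
parsed = raw.map(parse_ratings_count)
print(parsed.tolist())
```

Returning NaN for unparseable values keeps bad rows visible to a later `isna()` check instead of silently turning them into zeros.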


In [None]:
# 🧠 Enrich with gender
# Trial run on the first 10 rows only. Note that this rebinds df to that 10-row result.
df = run_enrichment(df.head(10))

# 🔍 Preview
df[["author", "author_gender"]].head()

Error processing https://www.goodreads.com/book/show/52578297-the-midnight-library: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/53239311-the-happiest-man-on-earth: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/42074525-the-city-we-became: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/206023355-throttled: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/46015758-if-it-bleeds: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/54733883-the-moonlight-child: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.com/book/show/42121525-migrations: asyncio.run() cannot be called from a running event loop
Error processing https://www.goodreads.

  gender_sources.append("unknown")


Unnamed: 0,author,author_gender
0,Matt Haig,unknown
127,Eddie Jaku,unknown
128,N.K. Jemisin,unknown
129,Lauren Asher,unknown
130,Stephen King,unknown
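
The repeated `asyncio.run() cannot be called from a running event loop` errors are typical of calling `asyncio.run()` inside Jupyter, where the kernel already runs an event loop. The internals of `run_enrichment` are not shown here, so the following is only a sketch of one general workaround: run the coroutine in a worker thread that owns its own loop. The names `run_coro_sync` and `demo_task` are hypothetical.

```python
import asyncio
import threading


def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code,
    whether or not an event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe
        return asyncio.run(coro)

    # A loop is already running (e.g. in Jupyter): use a separate thread,
    # where asyncio.run() can create and own a fresh loop
    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def demo_task(x):
    # Stand-in for an async network call
    await asyncio.sleep(0)
    return x * 2


print(run_coro_sync(demo_task(21)))
```

If the enrichment entry point is itself a coroutine, awaiting it directly in a notebook cell is usually simpler. The third-party `nest_asyncio` package is another commonly used workaround.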


In [11]:
df["author_gender"].value_counts()

female                663
unknown/non-binary    184
male                  153
Name: author_gender, dtype: int64

In [12]:
df["gender_source"].value_counts()

goodreads    764
namsor       236
Name: gender_source, dtype: int64

In [13]:
df.describe(include="all")

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year,author_first,source,author_country,author_gender,gender_source
count,1000,1000,1000,1000.0,1000.0,1000,1000.0,1000,1000,1000.0,1000,1000
unique,998,600,1000,,,1000,,429,1,198.0,3,2
top,The Wife Upstairs,Freida McFadden,https://www.goodreads.com/book/show/52578297-t...,,,Between life and death there is a library.When...,,jennifer,Goodreads,,female,goodreads
freq,2,18,1,,,1,,22,1000,400.0,663,764
mean,,,,3.98327,201622.8,,2022.0,,,,,
std,,,,0.275731,262069.5,,1.414921,,,,,
min,,,,2.79,11200.0,,2020.0,,,,,
25%,,,,3.79,76600.0,,2021.0,,,,,
50%,,,,4.0,116000.0,,2022.0,,,,,
75%,,,,4.17,202250.0,,2023.0,,,,,


#### Feature Engineering: Get Author's Country

In [123]:
df_final.to_csv("../data/processed/clean_books_2020_2024.csv", index=False)
print("✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv")

✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv
