# 📘 Notebook 01: Data Collection
# Emotional Geography of Books – Shraddha

This notebook loads the raw Goodreads book files for 2020 to 2024 and cleans them.
It then enriches each record with the author's gender and country,
and saves the results to `data/processed/`.

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("..")
from time import sleep, time

# Custom Modules
from utils.data_preprocessing import load_all_books, clean_books
from utils.author_metadata import run_enrichment
from utils.country_extractor import extract_countries_from_dataframe, analyze_country_distribution

In [2]:
# Load all books from the raw folder
df_all = load_all_books()
df_all.head()

📂 Found 5 Goodreads files.


📂 Loading Goodreads files: 100%|██████████| 5/5 [00:00<00:00, 52.28it/s]

📚 Total books loaded: 1015





Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
0,The Midnight Library,Matt Haig,https://www.goodreads.com/book/show/52578297-t...,3.98,2m ratings,Between life and death there is a library.When...,2020
127,The Happiest Man on Earth,Eddie Jaku,https://www.goodreads.com/book/show/53239311-t...,4.62,111k ratings,Life can be beautiful if you make it beautiful...,2020
128,"The City We Became (Great Cities, #1)",N.K. Jemisin,https://www.goodreads.com/book/show/42074525-t...,3.85,75.5k ratings,Five New Yorkers must come together in order t...,2020
129,"Throttled (Dirty Air, #1)",Lauren Asher,https://www.goodreads.com/book/show/206023355-...,3.77,233k ratings,NoahMaya Alatorre is the sister of my teammate...,2020
130,If It Bleeds,Stephen King,https://www.goodreads.com/book/show/46015758-i...,3.98,114k ratings,If it Bleeds is a collection of four new novel...,2020


### Basic EDA

In [3]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1015 entries, 0 to 1014
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           1015 non-null   object 
 1   author          1015 non-null   object 
 2   link            1015 non-null   object 
 3   rating          1015 non-null   float64
 4   ratings_count   1015 non-null   object 
 5   description     1015 non-null   object 
 6   published_year  1015 non-null   int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 63.4+ KB


In [4]:
df_all.describe(include="all")

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
count,1015,1015,1015,1015.0,1015,1015,1015.0
unique,998,600,1000,,596,1000,
top,"The Love Wager (Mr. Wrong Number, #2)",Freida McFadden,https://www.goodreads.com/book/show/60487511-t...,,1m ratings,Hallie Piper is turning over a new leaf. After...,
freq,2,18,2,,16,2,
mean,,,,3.981429,,,2022.014778
std,,,,0.275918,,,1.409596
min,,,,2.79,,,2020.0
25%,,,,3.79,,,2021.0
50%,,,,4.0,,,2022.0
75%,,,,4.17,,,2023.0


In [5]:
# Count the number of books per published_year
df_all["published_year"].value_counts()

published_year
2023    215
2020    200
2021    200
2022    200
2024    200
Name: count, dtype: int64

In [6]:
df_all.duplicated().sum()

15

In [7]:
df = df_all.copy()

In [8]:
df = clean_books(df)
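
`clean_books` lives in `utils.data_preprocessing` and its code is not shown in this notebook. Comparing the previews before and after cleaning, `ratings_count` changes from strings such as `2m ratings` to integers such as `2000000`. The block below is a minimal sketch of one way to do that conversion, assuming only the `k` and `m` suffixes seen in this dataset. The helper name `parse_ratings_count` is hypothetical and is not the project's function.

```python
import re

# Multipliers for the suffixes observed in the raw Goodreads strings (assumption:
# no other suffixes occur in the data).
_SUFFIXES = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_ratings_count(text):
    """Convert strings like '75.5k ratings' to an integer count (hypothetical helper)."""
    match = re.match(r"\s*([\d.,]+)\s*([km]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognized ratings count: {text!r}")
    number = float(match.group(1).replace(",", ""))
    return int(round(number * _SUFFIXES[match.group(2)]))


print(parse_ratings_count("2m ratings"))     # 2000000
print(parse_ratings_count("75.5k ratings"))  # 75500
```

With pandas, a helper like this could be applied column-wise, for example `df["ratings_count"].map(parse_ratings_count)`.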

In [9]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
title,1000.0,998.0,The Wife Upstairs,2.0,,,,,,,
author,1000.0,600.0,Freida McFadden,18.0,,,,,,,
link,1000.0,1000.0,https://www.goodreads.com/book/show/52578297-t...,1.0,,,,,,,
rating,1000.0,,,,3.98327,0.275731,2.79,3.79,4.0,4.17,4.76
ratings_count,1000.0,,,,201622.796,262069.491392,11200.0,76600.0,116000.0,202250.0,3000000.0
description,1000.0,1000.0,Between life and death there is a library.When...,1.0,,,,,,,
published_year,1000.0,,,,2022.0,1.414921,2020.0,2021.0,2022.0,2023.0,2024.0
author_first,1000.0,429.0,jennifer,22.0,,,,,,,
source,1000.0,1.0,Goodreads,1000.0,,,,,,,


In [10]:
# 🧠 Enrich with gender
df_all = run_enrichment(df)

# 🔍 Preview
df_all[["author", "author_gender"]].head(10)

Processing authors:   0%|          | 0/1000 [00:00<?, ?it/s]

[RETRY] Attempt 1 failed. Retrying in 1s...
[RETRY] Attempt 1 failed. Retrying in 1s...


Processing authors:  25%|██▌       | 250/1000 [03:21<08:32,  1.46it/s]

Error querying Genderize.io API: 429 Client Error: Too Many Requests for url: https://api.genderize.io/?name=jessica


Processing authors:  30%|███       | 300/1000 [03:53<07:48,  1.50it/s]
Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x111340e50>>
Traceback (most recent call last):
  File "/Users/shraddharamesh/.local/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 775, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(

KeyboardInterrupt: 


[RETRY] Attempt 1 failed. Retrying in 1s...


Processing authors:  30%|███       | 300/1000 [04:48<11:13,  1.04it/s]


KeyboardInterrupt: 
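
The run above hit a `429 Too Many Requests` response from Genderize.io and was eventually interrupted by hand. The retry logic inside `run_enrichment` is not shown here, so the following is only a minimal sketch of retrying with exponential backoff. It assumes a hypothetical `with_backoff` helper and a caller-supplied function that raises on failure; it is not the project's implementation.

```python
import time


def with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again.

    `sleep` is injectable so the behaviour can be checked without waiting.
    Only `Exception` is caught, so a KeyboardInterrupt still stops the loop.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"[RETRY] Attempt {attempt + 1} failed ({exc}). Retrying in {delay:g}s...")
            sleep(delay)


# Demo with a stand-in for the API call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "female"


waits = []
result = with_backoff(flaky, sleep=waits.append)
print(result, waits)  # female [1.0, 2.0]
```

Doubling the delay after each failure gives a rate-limited API progressively more time to recover than a fixed 1s retry does.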

In [None]:
df_all["author_gender"].value_counts()

female        840
male          155
non-binary      5
Name: author_gender, dtype: int64

In [None]:
df_all["gender_source"].value_counts()

goodreads       709
manual          246
genderize.io     45
Name: gender_source, dtype: int64

In [18]:
df_all.to_csv("../data/processed/clean_books_2020_2024.csv", index=False)
print("✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv")

✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv


In [12]:
df_all = pd.read_csv("../data/processed/clean_books_2020_2024.csv")
df_all

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year,author_first,source,author_country,author_gender,gender_source
0,The Midnight Library,Matt Haig,https://www.goodreads.com/book/show/52578297-t...,3.98,2000000,Between life and death there is a library.When...,2020,matt,Goodreads,"in Sheffield, The United Kingdom",male,goodreads
1,The Happiest Man on Earth,Eddie Jaku,https://www.goodreads.com/book/show/53239311-t...,4.62,111000,Life can be beautiful if you make it beautiful...,2020,eddie,Goodreads,"in Leipzig, Germany",male,goodreads
2,"The City We Became (Great Cities, #1)",N.K. Jemisin,https://www.goodreads.com/book/show/42074525-t...,3.85,75500,Five New Yorkers must come together in order t...,2020,n.k.,Goodreads,unknown,female,manual
3,"Throttled (Dirty Air, #1)",Lauren Asher,https://www.goodreads.com/book/show/206023355-...,3.77,233000,NoahMaya Alatorre is the sister of my teammate...,2020,lauren,Goodreads,unknown,female,goodreads
4,If It Bleeds,Stephen King,https://www.goodreads.com/book/show/46015758-i...,3.98,114000,If it Bleeds is a collection of four new novel...,2020,stephen,Goodreads,"in Portland, Maine, The United States",male,goodreads
...,...,...,...,...,...,...,...,...,...,...,...,...
995,The Songbird & the Heart of Stone (Crowns of N...,Carissa Broadbent,https://www.goodreads.com/book/show/210134467-...,4.08,70400,New York Times bestselling author and BookTok ...,2024,carissa,Goodreads,unknown,female,manual
996,"A Touch of Chaos (Hades x Persephone Saga, #4)",Scarlett St. Clair,https://www.goodreads.com/book/show/56670031-a...,3.94,49200,"The World Will BurnPersephone, Goddess of Spri...",2024,scarlett,Goodreads,unknown,female,goodreads
997,The Husbands,Holly Gramazio,https://www.goodreads.com/book/show/193781998-...,3.52,103000,"An exuberant debut, The Husbands delights in h...",2024,holly,Goodreads,unknown,female,manual
998,The Lion Women of Tehran,Marjan Kamali,https://www.goodreads.com/book/show/199798217-...,4.50,75700,"A heartfelt novel of friendship, betrayal, and...",2024,marjan,Goodreads,unknown,female,goodreads


#### Feature Engineering: Get Author's Country

In [13]:
df_all["author_country"].value_counts()

author_country
unknown                                                           517
in The United States                                               78
The United States                                                  62
in Oceanside, California, The United States                        13
The United Kingdom                                                 10
                                                                 ... 
in Westminster, London, The United Kingdom                          1
in Peekskill, New York, The United States                           1
in Old Picacho, Dona Ana County, New Mexico, The United States      1
in Warren, Ohio, The United States                                  1
in Eastleigh, Hampshire, England, The United Kingdom                1
Name: count, Length: 186, dtype: int64

In [14]:

# df_all is the DataFrame with an 'author_country' column
result_df = extract_countries_from_dataframe(df_all)

# View the results
print(result_df[['author', 'author_country', 'extracted_country']].head())

# Get distribution analysis
analyze_country_distribution(result_df)

Extracting countries: 100%|██████████| 1000/1000 [00:02<00:00, 348.45it/s]


Extraction complete. Found countries for 450/1000 (45.0%) authors.

Top 10 most common unknown values:
author_country
unknown                        517
The United Kingdom              10
Glasgow, The United Kingdom      2
April 16                         2
in Stockholm, Sweden             2
August 12, 1982                  1
September 22                     1
Seoul                            1
December 19                      1
May 23                           1
Name: count, dtype: int64
                author                         author_country  \
0            Matt Haig       in Sheffield, The United Kingdom   
1           Eddie Jaku                    in Leipzig, Germany   
2         N.K. Jemisin                                unknown   
3         Lauren Asher                                unknown   
4         Stephen King  in Portland, Maine, The United States   

  extracted_country  
0    United Kingdom  
1           Germany  
2              None  
3              None  
4   
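
The raw `author_country` values come in several shapes: `in City, Country`, a bare `The United States`, and birth dates such as `April 16` that ended up in this field. `extract_countries_from_dataframe` is a custom helper whose code is not shown here. The block below is a minimal sketch of one possible normalization step, assuming the country is the last comma-separated segment. The function `normalize_country` and the small `KNOWN_COUNTRIES` set are hypothetical stand-ins for a full country list.

```python
# Hypothetical, deliberately tiny lookup set; a real run would need a full country list.
KNOWN_COUNTRIES = {"United States", "United Kingdom", "Germany", "Sweden"}


def normalize_country(raw):
    """Return the country name from a raw birthplace string, or None if not recognized."""
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    # Drop the leading "in " that prefixes many birthplace strings.
    if text.lower().startswith("in "):
        text = text[3:]
    # Assume the country is the last comma-separated segment.
    candidate = text.split(",")[-1].strip()
    # Drop a leading article, e.g. "The United Kingdom" -> "United Kingdom".
    if candidate.lower().startswith("the "):
        candidate = candidate[4:]
    # Dates and placeholders such as "April 16" or "unknown" fall through to None.
    return candidate if candidate in KNOWN_COUNTRIES else None


for raw in ["in Sheffield, The United Kingdom", "The United States", "April 16", "unknown"]:
    print(repr(raw), "->", normalize_country(raw))
```

Matching only the final segment against a known list avoids treating a city or state name earlier in the string as the country.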




In [15]:
# Manually correct mis-extracted countries: Jersey -> United States, Heard Island and McDonald Islands -> Australia
result_df.loc[result_df['extracted_country'] == 'Jersey', 'extracted_country'] = 'United States'
result_df.loc[result_df['extracted_country'] == 'Heard Island and McDonald Islands', 'extracted_country'] = 'Australia'


# Get distribution analysis
analyze_country_distribution(result_df)



Country Distribution:
extracted_country
United States                311
United Kingdom                42
Australia                     22
Canada                        21
Ireland                       11
Georgia                        5
Iran, Islamic Republic of      5
Israel                         4
China                          4
New Zealand                    3
Mexico                         3
Korea, Republic of             3
India                          2
Netherlands                    2
France                         2
Japan                          2
Peru                           1
Ethiopia                       1
Ghana                          1
Hungary                        1
Name: count, dtype: int64

Countries with only one occurrence: Peru, Ethiopia, Ghana, Hungary, El Salvador, Germany, Spain, Singapore


In [None]:
result_df[result_df["extracted_country"].isna()]["author"].value_counts()

In [21]:

result_df.to_csv("../data/processed/clean_books_2020_2024_with_countries.csv", index=False)
print("✅ Saved cleaned data with countries to data/processed/clean_books_2020_2024_with_countries.csv")


✅ Saved cleaned data with countries to data/processed/clean_books_2020_2024_with_countries.csv
