# 📘 Notebook 01: Data Collection
# Emotional Geography of Books – Shraddha

This notebook collects and cleans Goodreads book data for 2020 to 2024.
It then fetches author metadata and enriches each book record
with the author's gender and country.

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append("..")
from time import sleep, time
from utils.data_preprocessing import load_all_books, clean_books
from utils.author_metadata import run_enrichment
from utils.country_extractor import extract_countries_from_dataframe, analyze_country_distribution


In [2]:
# Load all books from the raw folder
df_all = load_all_books()
df_all.head()

📂 Found 5 Goodreads files.


📂 Loading Goodreads files: 100%|██████████| 5/5 [00:00<00:00, 88.46it/s]

📚 Total books loaded: 1015





Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
0,The Midnight Library,Matt Haig,https://www.goodreads.com/book/show/52578297-t...,3.98,2m ratings,Between life and death there is a library.When...,2020
127,The Happiest Man on Earth,Eddie Jaku,https://www.goodreads.com/book/show/53239311-t...,4.62,111k ratings,Life can be beautiful if you make it beautiful...,2020
128,"The City We Became (Great Cities, #1)",N.K. Jemisin,https://www.goodreads.com/book/show/42074525-t...,3.85,75.5k ratings,Five New Yorkers must come together in order t...,2020
129,"Throttled (Dirty Air, #1)",Lauren Asher,https://www.goodreads.com/book/show/206023355-...,3.77,233k ratings,NoahMaya Alatorre is the sister of my teammate...,2020
130,If It Bleeds,Stephen King,https://www.goodreads.com/book/show/46015758-i...,3.98,114k ratings,If it Bleeds is a collection of four new novel...,2020


### Basic EDA

In [3]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1015 entries, 0 to 1014
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           1015 non-null   object 
 1   author          1015 non-null   object 
 2   link            1015 non-null   object 
 3   rating          1015 non-null   float64
 4   ratings_count   1015 non-null   object 
 5   description     1015 non-null   object 
 6   published_year  1015 non-null   int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 63.4+ KB


In [5]:
df_all.describe(include="all")

Unnamed: 0,title,author,link,rating,ratings_count,description,published_year
count,1015,1015,1015,1015.0,1015,1015,1015.0
unique,998,600,1000,,596,1000,
top,"The Love Wager (Mr. Wrong Number, #2)",Freida McFadden,https://www.goodreads.com/book/show/60487511-t...,,1m ratings,Hallie Piper is turning over a new leaf. After...,
freq,2,18,2,,16,2,
mean,,,,3.981429,,,2022.014778
std,,,,0.275918,,,1.409596
min,,,,2.79,,,2020.0
25%,,,,3.79,,,2021.0
50%,,,,4.0,,,2022.0
75%,,,,4.17,,,2023.0


In [6]:
# Count the number of rows per published_year
df_all["published_year"].value_counts()

2023    215
2020    200
2021    200
2022    200
2024    200
Name: published_year, dtype: int64

In [7]:
df_all.duplicated().sum()

15
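
The 15 fully duplicated rows are consistent with the 15 extra entries for 2023 in the per-year counts above. How `clean_books` handles them is not shown in this notebook. As a minimal sketch with plain pandas, on a toy frame standing in for `df_all` (the names `toy`, `dupes` and `deduped` are illustrative only), duplicates could be inspected and dropped like this:

```python
import pandas as pd

# Toy data standing in for df_all; the real frame has more columns.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "author": ["x", "y", "y", "z"],
        "published_year": [2022, 2023, 2023, 2024],
    }
)

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]
dupes_per_year = dupes["published_year"].value_counts()

# Drop exact duplicates, keeping the first occurrence, and renumber the rows.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Grouping the duplicated rows by `published_year` is a quick way to check whether they all come from a single source file.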

In [9]:
df = df_all.copy()

In [10]:
df = clean_books(df)

In [11]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
title,1000.0,998.0,The Wife Upstairs,2.0,,,,,,,
author,1000.0,600.0,Freida McFadden,18.0,,,,,,,
link,1000.0,1000.0,https://www.goodreads.com/book/show/52578297-t...,1.0,,,,,,,
rating,1000.0,,,,3.98327,0.275731,2.79,3.79,4.0,4.17,4.76
ratings_count,1000.0,,,,201622.796,262069.491392,11200.0,76600.0,116000.0,202250.0,3000000.0
description,1000.0,1000.0,Between life and death there is a library.When...,1.0,,,,,,,
published_year,1000.0,,,,2022.0,1.414921,2020.0,2021.0,2022.0,2023.0,2024.0
author_first,1000.0,429.0,jennifer,22.0,,,,,,,
source,1000.0,1.0,Goodreads,1000.0,,,,,,,
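
Before cleaning, `ratings_count` held strings such as `2m ratings` and `75.5k ratings`; after `clean_books` it is numeric. The conversion itself is not shown here, so the following is only a sketch of one way it could be done. The function name `parse_ratings_count` and the choice to return `None` for unparseable input are assumptions.

```python
import re

# Multipliers for the abbreviations seen in the raw data ("k", "m").
_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000}
_PATTERN = re.compile(r"^\s*([\d.,]+)\s*([km]?)\b", re.IGNORECASE)


def parse_ratings_count(text):
    """Convert strings like '75.5k ratings' to an int; return None if unparseable."""
    match = _PATTERN.match(str(text))
    if match is None:
        return None
    try:
        number = float(match.group(1).replace(",", ""))
    except ValueError:
        # Digits and separators in an order float() cannot read, e.g. "1.2.3".
        return None
    return round(number * _SUFFIX[match.group(2).lower()])
```

On a DataFrame this could be applied column-wise, for example `df["ratings_count"].map(parse_ratings_count)`.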


In [12]:
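# 🧠 Enrich with author metadata (gender and related columns)
# Note: this rebinds df_all, replacing the raw frame loaded above
df_all = run_enrichment(df)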

# 🔍 Preview
df_all[["author", "author_gender"]].head(10)

Processing authors:   0%|          | 0/1000 [00:00<?, ?it/s]

[RATE LIMIT] Waiting 15.0s before next request...


Processing authors:   5%|▌         | 50/1000 [00:35<11:17,  1.40it/s]

[RATE LIMIT] Waiting 9.5s before next request...

Processing authors:  10%|█         | 100/1000 [01:06<09:52,  1.52it/s]

[RATE LIMIT] Waiting 8.6s before next request...

Processing authors:  15%|█▌        | 150/1000 [01:36<08:59,  1.58it/s]

[RATE LIMIT] Waiting 8.5s before next request...

Processing authors:  20%|██        | 200/1000 [02:08<08:28,  1.57it/s]

[RATE LIMIT] Waiting 6.7s before next request...

Processing authors:  25%|██▌       | 250/1000 [02:36<07:33,  1.65it/s]

[RATE LIMIT] Waiting 9.2s before next request...

Processing authors:  30%|███       | 300/1000 [03:06<07:04,  1.65it/s]

[RATE LIMIT] Waiting 8.8s before next request...

Processing authors:  35%|███▌      | 350/1000 [03:36<06:33,  1.65it/s]

[RATE LIMIT] Waiting 8.9s before next request...

Processing authors:  40%|████      | 400/1000 [04:08<06:06,  1.64it/s]

[RATE LIMIT] Waiting 7.7s before next request...

Processing authors:  45%|████▌     | 450/1000 [04:37<05:31,  1.66it/s]

[RATE LIMIT] Waiting 8.7s before next request...

Processing authors:  50%|█████     | 500/1000 [05:10<05:09,  1.61it/s]

[RATE LIMIT] Waiting 5.9s before next request...

Processing authors:  55%|█████▌    | 550/1000 [05:38<04:31,  1.66it/s]

[RATE LIMIT] Waiting 7.6s before next request...

Processing authors:  60%|██████    | 600/1000 [06:05<03:54,  1.71it/s]

[RATE LIMIT] Waiting 10.4s before next request...


Processing authors:  65%|██████▌   | 650/1000 [07:03<04:25,  1.32it/s]

[RATE LIMIT] Waiting 13.6s before next request...


Processing authors:  70%|███████   | 700/1000 [07:38<03:41,  1.36it/s]

[RATE LIMIT] Waiting 9.3s before next request...

Processing authors:  75%|███████▌  | 750/1000 [08:07<02:53,  1.44it/s]

[RATE LIMIT] Waiting 9.9s before next request...

Processing authors:  80%|████████  | 800/1000 [08:37<02:12,  1.50it/s]

[RATE LIMIT] Waiting 9.9s before next request...

Processing authors:  85%|████████▌ | 850/1000 [09:08<01:37,  1.53it/s]

[RATE LIMIT] Waiting 8.9s before next request...

Processing authors:  90%|█████████ | 900/1000 [09:38<01:03,  1.57it/s]

[RATE LIMIT] Waiting 8.7s before next request...

Processing authors:  95%|█████████▌| 950/1000 [10:10<00:31,  1.57it/s]

[RATE LIMIT] Waiting 7.3s before next request...

Processing authors: 100%|██████████| 1000/1000 [10:37<00:00,  1.57it/s]


Unnamed: 0,author,author_gender
0,Matt Haig,male
127,Eddie Jaku,male
128,N.K. Jemisin,female
129,Lauren Asher,female
130,Stephen King,male
131,Karen McQuestion,female
132,Charlotte McConaghy,female
133,Tracy Wolff,female
134,Rebecca Roanhorse,female
135,Barbara Davis,female
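
The `[RATE LIMIT]` messages in the log come from throttling inside `run_enrichment`, whose implementation is not shown in this notebook. Purely as an illustration, a sliding-window limiter that works out how long to wait before the next request might look like the sketch below. The class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` and `sleep` hooks are all assumptions, not the project's actual code.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls requests within any window of period seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds until another request fits in the window (0.0 if it fits now)."""
        now = self._clock()
        # Forget requests that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self):
        """Block until a request is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            print(f"[RATE LIMIT] Waiting {delay:.1f}s before next request...")
            self._sleep(delay)
        self._calls.append(self._clock())
```

Calling `limiter.acquire()` before each HTTP request keeps the request rate under the configured budget. Passing a fake `clock` and `sleep` makes the limiter testable without real waiting.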


In [13]:
df_all["author_gender"].value_counts()

female        840
male          155
non-binary      5
Name: author_gender, dtype: int64

In [16]:
df_all["gender_source"].value_counts()

goodreads       709
manual          246
genderize.io     45
Name: gender_source, dtype: int64

In [18]:
df_all.to_csv("../data/processed/clean_books_2020_2024.csv", index=False)
print("✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv")

✅ Saved cleaned data to data/processed/clean_books_2020_2024.csv


#### Feature Engineering: Get Author's Country

In [20]:
df_all["author_country"].value_counts()

unknown                                                           517
in The United States                                               78
The United States                                                  62
in Oceanside, California, The United States                        13
The United Kingdom                                                 10
                                                                 ... 
in Westminster, London, The United Kingdom                          1
in Peekskill, New York, The United States                           1
in Old Picacho, Dona Ana County, New Mexico, The United States      1
in Warren, Ohio, The United States                                  1
in Eastleigh, Hampshire, England, The United Kingdom                1
Name: author_country, Length: 186, dtype: int64

In [None]:

# df_all is the enriched DataFrame that carries the 'author_country' column
result_df = extract_countries_from_dataframe(df_all)

# View the results
print(result_df[['author', 'author_country', 'extracted_country']].head())

# Get distribution analysis
analyze_country_distribution(result_df)
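
The raw `author_country` values mix bare country names (`The United States`) with full birthplaces (`in Oceanside, California, The United States`). How `extract_countries_from_dataframe` normalizes them is not shown here. One plausible approach, sketched below with a hypothetical helper `extract_country`, is to keep the last comma-separated component and strip a leading `in` or `The`.

```python
def extract_country(raw):
    """Return the trailing country name from a birthplace string, or 'unknown'."""
    if not isinstance(raw, str):
        return "unknown"
    text = raw.strip()
    if not text or text.lower() == "unknown":
        return "unknown"
    # Assumption: the country is the last comma-separated component.
    country = text.split(",")[-1].strip()
    # Drop a leading "in " (birthplace phrasing), then a leading "The ".
    for prefix in ("in ", "the "):
        if country.lower().startswith(prefix):
            country = country[len(prefix):].strip()
    return country or "unknown"
```

Applied as `df_all["author_country"].map(extract_country)`, this would fold the city-level variants into their countries. Dropping the article is a design choice, so that `The United States` and `United States` count as one value.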