This project contains two scripts: a book scraper and a Pinecone feeder.
This scraper will:
- Iterate through all pages of children's and teenage books on the website
- For each book:
  - Scrape the title, description, and metadata
  - Download the book cover image
  - Store image filename reference in the CSV
- Save all data to a CSV file with UTF-8 encoding (important for Icelandic characters)
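The UTF-8 point in the last step can be illustrated with a minimal sketch. It uses the standard-library csv module rather than pandas so it runs without extra packages, and the column names in FIELDNAMES are assumptions for illustration, not necessarily the ones book-scraper.py writes.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real script may use different ones.
FIELDNAMES = ["title", "description", "metadata", "image_filename"]


def save_books_csv(books, csv_path):
    """Write a list of book dicts to a UTF-8 encoded CSV file."""
    csv_path = Path(csv_path)
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" preserves characters such as þ, ð, æ and ö.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(books)
    return csv_path


def load_books_csv(csv_path):
    """Read the CSV back into a list of dicts, decoding it as UTF-8."""
    with Path(csv_path).open("r", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Passing the encoding explicitly matters because the platform default is not UTF-8 everywhere, and a mismatched default can corrupt Icelandic characters or raise an encoding error when writing.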
To use the scraper:
- Install required packages:
  pip install requests beautifulsoup4 pandas
- Run the script:
  python book-scraper.py
The script will create:
- A folder named book_covers containing all downloaded images
- A CSV file named icelandic_children_books.csv with all book information
Features:
- Respectful scraping with 1-second delays between requests
- Error handling for failed requests/downloads
- Clean filename generation for images
- UTF-8 encoding support for Icelandic characters
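As a rough sketch of how three of these features might look in code (not the actual implementation in book-scraper.py): clean_filename and polite_get are hypothetical helper names, and fetch stands for any callable that retrieves a URL, for example requests.get.

```python
import re
import time
import unicodedata


def clean_filename(title, extension=".jpg", max_length=100):
    """Turn a book title into a filesystem-friendly image filename."""
    # Normalise to composed form so accented letters count as single letters.
    title = unicodedata.normalize("NFC", title.strip())
    # \w is Unicode-aware in Python 3, so Icelandic letters are kept;
    # every other run of characters collapses into one underscore.
    stem = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return (stem[:max_length] or "untitled") + extension


def polite_get(fetch, url, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url), report failures instead of crashing, then pause."""
    try:
        result = fetch(url)
    except Exception as error:  # with requests, requests.RequestException is narrower
        print(f"Failed to fetch {url}: {error}")
        result = None
    # Pause after every request, successful or not, to stay respectful.
    sleep(delay_seconds)
    return result
```

For example, clean_filename("Blái hnötturinn: saga/ævintýri?") gives "Blái_hnötturinn_saga_ævintýri.jpg". Returning None on failure lets the calling loop skip a book and carry on with the next one.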
To use the Pinecone feeder, you need a PostgreSQL database with a table of book data. You can import the icelandic_children_books.csv file into a table in your database.
This script will:
- Connect to a PostgreSQL database
- Fetch data from a table
- Prepare and upsert data for Pinecone in batches
- Use Pinecone to embed the description and metadata of the book
- Keep track of progress in a file so it can resume from where it left off if it fails
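The batching and resume steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code in pinecone-feeder.py: the JSON progress file format and the function names are made up, and upsert_batch is a placeholder for whatever call sends one batch to Pinecone.

```python
import json
from pathlib import Path


def load_progress(progress_path):
    """Return the offset of the next record to send (0 if no file exists)."""
    path = Path(progress_path)
    if not path.exists():
        return 0
    return json.loads(path.read_text(encoding="utf-8"))["next_offset"]


def save_progress(progress_path, next_offset):
    """Persist the offset of the next record to send."""
    payload = json.dumps({"next_offset": next_offset})
    Path(progress_path).write_text(payload, encoding="utf-8")


def upsert_in_batches(records, upsert_batch, progress_path, batch_size=50):
    """Send records in batches, resuming from the saved offset."""
    offset = load_progress(progress_path)
    while offset < len(records):
        batch = records[offset:offset + batch_size]
        # If this raises, the progress file still points at this batch,
        # so the next run retries it instead of skipping it.
        upsert_batch(batch)
        offset += len(batch)
        save_progress(progress_path, offset)
    return offset
```

Resuming by offset only works if the rows come back in the same order on every run, so the query that fetches them would need a stable ORDER BY.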
To use the feeder:
- Set up environment variables in a .env file:
PINECONE_API_KEY=<YOUR_PINECONE_API_KEY_HERE>
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY_HERE>
dbpassword=<YOUR_DATABASE_PASSWORD_HERE>
dbhost=<YOUR_DATABASE_HOST_HERE>
dbuser=neondb_owner
dbname=neondb
dbport=5432
table_name=<YOUR_TABLE_NAME>
pinecone_index=<YOUR_PINECONE_INDEX>
- Install required packages:
  pip install pinecone pandas psycopg2 python-dotenv
  pip install "pinecone[grpc]"  # for the gRPC version
- Run the script:
  python pinecone-feeder.py
To clear the index and start fresh after a failed attempt, uncomment the lines in the pinecone-feeder.py file that delete the index:
# Delete existing index
if pc.has_index(index_name):
    pc.delete_index(index_name)
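For the .env settings listed earlier, a small validation step can catch a missing variable before any connection is attempted. The sketch below is an assumption about how the values might be read, not the feeder's actual code: read_settings and REQUIRED_KEYS are hypothetical names, and OPENAI_API_KEY is left out of the required list because it is unclear from this README whether the script needs it.

```python
import os

# Variable names taken from the .env example; which ones are truly
# required by pinecone-feeder.py is an assumption.
REQUIRED_KEYS = [
    "PINECONE_API_KEY", "dbpassword", "dbhost", "dbuser",
    "dbname", "dbport", "table_name", "pinecone_index",
]


def read_settings(env=None):
    """Collect feeder settings from environment variables."""
    # In the real script, python-dotenv's load_dotenv() would typically be
    # called first so that the .env values are present in os.environ.
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        # These keys match the keyword arguments of psycopg2.connect().
        "db": {
            "host": env["dbhost"],
            "user": env["dbuser"],
            "password": env["dbpassword"],
            "dbname": env["dbname"],
            "port": int(env["dbport"]),
        },
        "table_name": env["table_name"],
        "pinecone_index": env["pinecone_index"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
    }
```

With this shape, the database part could be passed on as psycopg2.connect(**settings["db"]).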