Foodmandu Scraper

Foodmandu Scraper extracts restaurant information from Foodmandu using Scrapy and Playwright. It captures details such as restaurant URLs, images, names, addresses, and cuisines. The extracted data is then stored in a SQLite database.

Setup

To set up the project, follow these steps:

1. Clone the Repository:

git clone https://github.com/rheaacharya77/foodmandu-scraper.git

2. Navigate to the Project Directory:

cd foodmandu-scraper

3. Create and Activate a Virtual Environment:

python -m venv venv
source venv/bin/activate

4. Install the Required Dependencies:

pip install -r requirements.txt

5. You're now ready to start using the scraper!
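
Note: because the scraper drives a browser through Playwright, the browser binaries may also need to be downloaded once. This extra step is an assumption based on a typical scrapy-playwright setup, not something listed in this repository:

playwright install chromium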

Project Structure

Here's an overview of the key components in the Foodmandu Scraper project:

  • /.github/workflows/: Contains the GitHub Actions workflow files for automation.
  • /foodmandu/: The main project directory with all the Scrapy components.
    • /spiders/: Contains the spider restaurants.py that defines the scraping logic.
    • items.py: Defines the data structure for scraped data (a minimal sketch follows this list).
    • middlewares.py: Manages custom middleware for Scrapy.
    • pipelines.py: Processes and stores data items after scraping.
    • settings.py: Configures settings for Scrapy.
  • foodmandu.db: The SQLite database where scraped data is stored.
  • requirements.txt: Lists all the dependencies required to run the project.
  • scrapy.cfg: Configuration file for Scrapy projects.
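
For reference, here is a minimal sketch of what the item definition in items.py might look like, with one field per column of the Data Schema below. The class name RestaurantItem is an assumption for illustration:

import scrapy

class RestaurantItem(scrapy.Item):
    # One field per column of the restaurants table (see Data Schema).
    url = scrapy.Field()
    image = scrapy.Field()
    name = scrapy.Field()
    address = scrapy.Field()
    cuisine = scrapy.Field()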

Usage

Modify the scraping settings in settings.py as needed, then run the scraper with:

scrapy crawl restaurants
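
As a hedged illustration, the Playwright integration and pipeline wiring in settings.py usually look something like the following; the exact values and priorities here are assumptions, not a copy of this project's file:

# settings.py (illustrative sketch; values are assumptions)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

ITEM_PIPELINES = {
    "foodmandu.pipelines.DuplicatesPipeline": 100,
    "foodmandu.pipelines.FoodmanduPipeline": 300,
}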

Pipeline

The scraper employs two main pipelines in pipelines.py for processing and storing scraped data:

FoodmanduPipeline

  • Manages SQLite database interactions by establishing a connection, creating a fresh restaurants table, and inserting scraped data.
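
A minimal sketch of how such a pipeline is commonly written; the table and column names follow this README, but the method bodies are assumptions rather than the project's exact code:

import sqlite3

class FoodmanduPipeline:
    def open_spider(self, spider):
        # Connect to the database and start from a fresh table each run.
        self.connection = sqlite3.connect("foodmandu.db")
        self.cursor = self.connection.cursor()
        self.cursor.execute("DROP TABLE IF EXISTS restaurants")
        self.cursor.execute(
            """CREATE TABLE restaurants (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   url TEXT, image TEXT, name TEXT,
                   address TEXT, cuisine TEXT)"""
        )

    def process_item(self, item, spider):
        # Insert one row per scraped restaurant.
        self.cursor.execute(
            "INSERT INTO restaurants (url, image, name, address, cuisine) "
            "VALUES (?, ?, ?, ?, ?)",
            (item["url"], item["image"], item["name"],
             item["address"], item["cuisine"]),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.connection.close()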

DuplicatesPipeline

  • Eliminates duplicate data by checking against a set of visited URLs and dropping any repeats during the scraping session.
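
A corresponding sketch, again an assumption about the implementation rather than a verbatim copy:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        # URLs already seen during this scraping session.
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item["url"] in self.urls_seen:
            raise DropItem(f"Duplicate restaurant dropped: {item['url']}")
        self.urls_seen.add(item["url"])
        return item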

Together, these pipelines keep storage efficient and the data consistent: one manages the database, the other filters out duplicates within a run.

Data Schema

The scraped restaurant data is stored in foodmandu.db in a table with the following schema:

  • id: An auto-incrementing integer that serves as the primary key.
  • url: Text field storing the restaurant's URL.
  • image: Text field storing the URL of the restaurant's image.
  • name: Text field for the restaurant's name.
  • address: Text field for the restaurant's address.
  • cuisine: Text field describing the type of cuisine offered by the restaurant.

This schema is designed to capture essential details about each restaurant, facilitating easy access and analysis of the collected data.
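
For example, the stored data can be inspected with a few lines of standard-library Python (the table and column names come from the schema above):

import sqlite3

connection = sqlite3.connect("foodmandu.db")
cursor = connection.cursor()
# Count restaurants per cuisine, most common first.
for cuisine, count in cursor.execute(
    "SELECT cuisine, COUNT(*) FROM restaurants "
    "GROUP BY cuisine ORDER BY COUNT(*) DESC"
):
    print(cuisine, count)
connection.close()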

GitHub Actions

GitHub Actions is used to automate the scraping process and ensure our data is always up to date. The workflow, defined in .github/workflows/actions.yml, performs the following tasks:

  • Trigger: It's set to run automatically every Saturday at 1:45 PM UTC. Additionally, it can be manually triggered via GitHub's workflow_dispatch event.
  • Environment Setup: Prepares an Ubuntu environment, sets up Python 3.10, and installs all necessary dependencies from requirements.txt.
  • Data Scraping: Executes our Scrapy spider named restaurants to scrape the latest restaurant data.
  • Commit: Any changes in the data are committed to the repository with a timestamp.
  • Push: Updates the main branch with the latest data.

This automated workflow minimizes manual effort and keeps our data fresh with scheduled and on-demand runs.
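
A condensed sketch of such a workflow follows; the step names and commit commands are illustrative assumptions, so refer to .github/workflows/actions.yml for the actual definition:

# Illustrative sketch of .github/workflows/actions.yml
name: scrape
on:
  schedule:
    - cron: "45 13 * * 6"  # every Saturday at 13:45 UTC
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: scrapy crawl restaurants
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add foodmandu.db
          git commit -m "Update scraped data: $(date -u)" || echo "No changes to commit"
          git push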

Contributing

Contributions are welcome! Here's how to contribute:

  1. Fork the repo and clone your fork.
  2. Create a branch for your changes.
  3. Make your changes and test them.
  4. Commit your changes with clear messages.
  5. Submit a pull request (PR) with a detailed description of your changes.

Thank you for helping improve the Foodmandu Scraper!
