
This project is a web scraper that automates the process of extracting repository details from GitHub. It navigates through pages, clicks the "Load more" button, and collects relevant data such as repository names, descriptions, forks, star counts, and programming languages.


🕷️ Web Scraper Project – GitHub Repo Extractor

This project is a powerful, stealth-enabled web scraper built using Python + Selenium to automate the process of extracting repository details from GitHub collections.


✨ Features

  • 🔍 Automated Web Scraping – Extracts data like repo names, stars, forks, and languages.
  • 🧠 Stealth Mode Enabled – Avoids bot detection using ChromeDriver stealth configuration.
  • 🌐 Handles JavaScript-Heavy Pages – Scrapes dynamic content by rendering the full page.
  • 📊 Data Analytics – Visualizes data with charts and graphs via Streamlit.
  • 💾 Export Options – Save scraped results directly as CSV.
  • 📁 Download-Free Setup – Uses webdriver-manager, so no need to manually install ChromeDriver.
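The stealth and headless features above boil down to the launch flags passed to Chrome. As a minimal sketch, here is how such a flag set might be assembled — these are commonly used options, not scrape.py's verified configuration, and the function name is illustrative:

```python
def build_chrome_flags(headless: bool = True) -> list[str]:
    """Return Chrome launch flags commonly used for headless scraping
    with a reduced automation fingerprint. Illustrative defaults only,
    not the exact flags scrape.py uses."""
    flags = [
        # Hide the navigator.webdriver automation hint from the page.
        "--disable-blink-features=AutomationControlled",
        # Avoid sandbox / shared-memory issues in containers and CI.
        "--no-sandbox",
        "--disable-dev-shm-usage",
        # A realistic window size makes sites serve their desktop layout.
        "--window-size=1920,1080",
    ]
    if headless:
        flags.append("--headless=new")
    return flags
```

With Selenium, each flag would be passed to a `ChromeOptions` object via `add_argument()` before the driver starts.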

📸 Screenshots

Output result 1

Output result 2

Output result 3


💡 Use Cases

  • 📈 Market & Trend Analysis
  • 🧑‍💻 GitHub-based Research & Repo Discovery
  • 🏢 Competitor & Project Intelligence
  • 🤖 Dataset creation for AI/ML Models

🛠️ Setup & Installation

1. Clone the Repository

```shell
git clone https://github.com/yokodrea/scraper-project.git
cd scraper-project
```

2. Install Dependencies

```shell
pip install -r requirements.txt
```

✅ No need to download ChromeDriver manually — it's handled by webdriver-manager.

3. 🚀 Run the Scraper

```shell
streamlit run scrape.py
```

You can customize the GitHub collection URL inside scrape.py.
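The project overview notes that the scraper clicks the "Load more" button until every repository is visible. That pagination loop can be sketched independently of Selenium by injecting the button lookup as a callable; the names here (`click_until_exhausted`, `find_button`) are illustrative, not taken from scrape.py:

```python
def click_until_exhausted(find_button, max_clicks: int = 50) -> int:
    """Repeatedly click a 'Load more'-style button until it disappears.

    find_button: callable returning an object with a .click() method,
                 or None once no button remains. With Selenium this
                 would wrap a driver.find_elements(...) lookup.
    max_clicks:  safety cap so a misbehaving page cannot loop forever.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # no more 'Load more' button: all items are loaded
        button.click()
        clicks += 1
    return clicks
```

Injecting the lookup keeps the loop testable with a fake button, while the real scraper supplies a Selenium-backed callable.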

4. 📂 Output

The scraped data will be saved as:

project_list.csv
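As a minimal sketch of the CSV export step, the scraped records could be written out like this with the standard library (the project itself uses pandas, and the column names below are an assumed schema, not scrape.py's exact one):

```python
import csv

def save_repos(rows: list[dict], path: str = "project_list.csv") -> None:
    """Write scraped repository records to a CSV file.
    Column names are an illustrative schema, not scrape.py's exact output."""
    fieldnames = ["name", "description", "stars", "forks", "language"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Example: one scraped record, written to project_list.csv.
save_repos([
    {"name": "example/repo", "description": "demo", "stars": 42,
     "forks": 7, "language": "Python"},
])
```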

5. 🔧 Tech Stack

Language: Python

Core Libraries: selenium (automation), pandas (data handling), streamlit (UI), webdriver-manager (auto-handles ChromeDriver)

Scraping Mode: Headless browser via Chrome

Export Format: CSV

📅 Future Enhancements

⏱️ Multi-threading for speed boost

📬 Real-time alerts on repo updates

🕒 Scheduler for auto-scraping (cron jobs)

🌐 Deploy as a hosted scraping service

🧪 Requirements

📌 To be filled in once final dependencies are set. 👉 See requirements.txt for more info.

📄 License

This project is licensed under the MIT License.
