Mix of web scraping programs collecting all kinds of data and exporting datasets

samuel-villa/data-collection

Data Collection Project

Mix of web scraping programs collecting all kinds of data and exporting datasets. All programs store the collected data in a common data_storage directory structured as follows:

data_storage
└── categories
    ├── category_1
    │   ├── website_1
    │   │   ├── YYYYMMDD_hhmmss(1)
    │   │   │   ├── data
    │   │   │   │   └── data.json                             
    │   │   │   └── logfile.log
    │   │   ├── YYYYMMDD_hhmmss(2)
    │   │   └── ...
    │   ├── website_2
    │   └── ...
    ├── category_2
    └── ...
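
As an illustration of the layout above, here is a minimal Python sketch of how a scraper could create its timestamped run directory and write its output there. The helper names `make_run_dir` and `save_run` are hypothetical and are not taken from this repository. The sketch also assumes the `(1)` and `(2)` in the tree only label successive runs and are not part of the directory name.

```python
import json
from datetime import datetime
from pathlib import Path


def make_run_dir(root, category, website, now=None):
    """Create <root>/categories/<category>/<website>/<YYYYMMDD_hhmmss>/data.

    Hypothetical helper: returns the timestamped run directory.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # e.g. 20240102_030405
    run_dir = Path(root) / "categories" / category / website / stamp
    (run_dir / "data").mkdir(parents=True, exist_ok=True)
    return run_dir


def save_run(run_dir, records):
    """Write the records to data/data.json and make sure logfile.log exists."""
    data_file = Path(run_dir) / "data" / "data.json"
    data_file.write_text(json.dumps(records, indent=2), encoding="utf-8")
    (Path(run_dir) / "logfile.log").touch()
    return data_file
```

A scraper would call `make_run_dir("data_storage", "category_1", "website_1")` once per run and then pass the returned path to `save_run` together with the collected records.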

Usage

First, install dependencies:

$ pip install -r requirements.txt

Then, run the program:

$ python3 ./main.py

Currently Implemented Scrapers

Udemy Courses

The scraper fetches and collects data for all courses on the Udemy platform. Because of the large amount of data available on this website, the data is stored both in a single JSON file and in multiple JSON files organized by category.
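
The combined and per-category export described above could look roughly like the following sketch. It assumes each course is a dict with a `category` key, and the function name `export_courses` and the file naming scheme are hypothetical. The actual scraper may organize its files differently.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def export_courses(courses, data_dir):
    """Write all courses to data.json plus one JSON file per category.

    Hypothetical helper: returns the sorted list of per-category file names.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Single file containing every course.
    (data_dir / "data.json").write_text(
        json.dumps(courses, indent=2), encoding="utf-8"
    )

    # Group courses by their (assumed) "category" key.
    by_category = defaultdict(list)
    for course in courses:
        by_category[course.get("category", "uncategorized")].append(course)

    written = []
    for category, items in by_category.items():
        # Turn the category name into a safe file name, e.g. "IT & Software" -> "it_software".
        slug = re.sub(r"[^a-z0-9]+", "_", category.lower()).strip("_") or "uncategorized"
        path = data_dir / f"{slug}.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        written.append(path.name)
    return sorted(written)
```

Two categories that normalize to the same slug would overwrite each other in this sketch, so a real implementation would need to handle that case.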


Pluralsight Courses

The scraper fetches and collects data for all courses on the Pluralsight platform. The data keys were chosen arbitrarily, and scraping is done by fetching the HTML of each course URL.
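
To show what extracting fields from a course page's HTML can look like, here is a minimal sketch that uses only Python's standard `html.parser` module. The class and function names, the chosen keys (`title`, `description`), and the tags it reads are assumptions for illustration. They do not reflect the real Pluralsight markup or the parsing library this project uses. Downloading the page is left out, so the function takes the HTML as a string.

```python
from html.parser import HTMLParser


class CoursePageParser(HTMLParser):
    """Collect the <title> text and the meta description of a page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_parts = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        return "".join(self._title_parts).strip()


def parse_course_html(html):
    """Return a dict with arbitrarily chosen keys extracted from one course page."""
    parser = CoursePageParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description}
```

In a full scraper this function would be applied to the HTML fetched from each course URL, and the resulting dicts would be collected into the dataset.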


OpenClassrooms

The scraper fetches and collects data for all courses on the OpenClassrooms platform. The collected data covers all free-access courses and all Diploma courses, in both English and French.


GlobalKnowledge

The scraper fetches and collects data for all courses on the GlobalKnowledge platform.
