Data Collection Project

A collection of web scraping programs that gather various kinds of data and export them as datasets. Every program stores its collected data in a common data_storage directory, structured as follows:

data_storage
└── categories
    ├── category_1
    │   ├── website_1
    │   │   ├── YYYYMMDD_hhmmss(1)
    │   │   │   ├── data
    │   │   │   │   └── data.json                             
    │   │   │   └── logfile.log
    │   │   ├── YYYYMMDD_hhmmss(2)
    │   │   └── ...
    │   ├── website_2
    │   └── ...
    ├── category_2
    └── ...
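
As an illustration of the layout above, here is a minimal sketch of how a scraper run could create its timestamped directory and write its output. The function names (make_run_dir, save_run) and their parameters are hypothetical and are not taken from the project code.

```python
import json
from datetime import datetime
from pathlib import Path


def make_run_dir(root, category, website, now=None):
    """Create <root>/categories/<category>/<website>/<YYYYMMDD_hhmmss>/data.

    Hypothetical helper: mirrors the tree shown above and returns the
    timestamped run directory.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(root) / "categories" / category / website / stamp
    (run_dir / "data").mkdir(parents=True, exist_ok=True)
    return run_dir


def save_run(run_dir, records):
    """Write records to data/data.json and make sure logfile.log exists."""
    data_file = run_dir / "data" / "data.json"
    data_file.write_text(json.dumps(records, indent=2), encoding="utf-8")
    (run_dir / "logfile.log").touch()
    return data_file
```

Calling make_run_dir("data_storage", "category_1", "website_1") followed by save_run(run_dir, records) would produce one run folder per execution, so earlier runs are never overwritten.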

Usage

First, install dependencies:

$ pip install -r requirements.txt

Then, run the program:

$ python3 ./main.py

Currently Implemented Scrapers

Udemy Courses

https://www.udemy.com/

The scraper collects data for all courses available on the Udemy platform. Because the site holds a large amount of data, the results are stored both in a single JSON file and in multiple JSON files organized by category.
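
The per-category split can be sketched as follows. This is only an illustration under assumptions: the "category" key, the "uncategorized" fallback, and the category_<name>.json file naming are hypothetical and may differ from what the scraper actually does.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_category(courses, key="category"):
    """Group course records by the value of `key` (assumed field name)."""
    grouped = defaultdict(list)
    for course in courses:
        grouped[course.get(key, "uncategorized")].append(course)
    return dict(grouped)


def write_outputs(courses, data_dir, key="category"):
    """Write one combined data.json plus one JSON file per category."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    combined = data_dir / "data.json"
    combined.write_text(json.dumps(courses, indent=2), encoding="utf-8")

    written = [combined]
    for name, items in group_by_category(courses, key).items():
        # Keep file names filesystem-safe: non-alphanumerics become "_".
        safe = "".join(c if c.isalnum() else "_" for c in name).lower()
        path = data_dir / f"category_{safe}.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Keeping the combined file alongside the per-category files means consumers that want everything read one file, while those interested in a single category avoid loading the full dataset.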

Pluralsight Courses

https://pluralsight.com/

The scraper collects data for all courses available on the Pluralsight platform. The key names used in the output data were chosen arbitrarily, and the data is scraped by fetching the HTML of each course URL.
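
To illustrate the idea of turning a course page's HTML into a record, here is a minimal sketch that uses only the Python standard library. It assumes the HTML has already been downloaded, and the extracted fields (the page title and the description meta tag) and the key names are hypothetical examples, not the fields the project's scraper actually collects.

```python
from html.parser import HTMLParser


class CoursePageParser(HTMLParser):
    """Collect the <title> text and the description <meta> tag of a page."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.record["description"] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.record["title"] = self.record.get("title", "") + data


def parse_course_page(html, url):
    """Return a dict for one course page; key names are arbitrary choices."""
    parser = CoursePageParser()
    parser.feed(html)
    parser.close()
    record = {"url": url, **parser.record}
    if "title" in record:
        record["title"] = record["title"].strip()
    return record
```

A scraper built this way would call parse_course_page once per course URL and append each resulting dict to the list that is later written to data.json.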

OpenClassrooms

https://openclassrooms.com/

The scraper collects data for all courses available on the OpenClassrooms platform. The collected data includes all free-access courses and all Diploma courses, in both English and French.

GlobalKnowledge

https://globalknowledge.com/

The scraper collects data for all courses available on the GlobalKnowledge platform.
