This tool collects data from OpenProject. It uses the available OpenProject API where possible, and Selenium-based web scraping for the additional data that the API doesn't support. The scraping processes use asynchronous programming to make them faster and more stable.
To install the required dependencies, use:
pip install -r requirements.txt
- username: This value changes depending on whether data is collected from the API or from the web portal. For the API, the value should be apikey (check this for more information). For the portal, the value should be the username you use to access the web portal.
- password: Like username, this depends on the source. For the API, it must be the access token (check this note).
- api_url: The value should be https://myopenproject.example/api/v3 (it must end with /api/v3).
- portal_url: The value should be https://myopenproject.example (no URI path is needed).
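The rules for these four values can be checked before a crawl starts. The sketch below is only an illustration: check_settings and basic_auth_header are hypothetical helper names that are not part of this tool, the token is a placeholder, and it assumes the API is called with HTTP Basic authentication (which OpenProject's API v3 accepts with apikey as the user name).

```python
import base64
from urllib.parse import urlparse


def check_settings(api_url: str, portal_url: str) -> None:
    """Raise ValueError if the URLs do not follow the rules listed above."""
    if not api_url.rstrip("/").endswith("/api/v3"):
        raise ValueError("api_url must end with /api/v3")
    if urlparse(portal_url).path not in ("", "/"):
        raise ValueError("portal_url must not contain a URI path")


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (assumed auth scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


# Placeholder values for API access; replace them with your own.
settings = {
    "username": "apikey",           # literal string when using the API
    "password": "my-access-token",  # hypothetical access token
    "api_url": "https://myopenproject.example/api/v3",
    "portal_url": "https://myopenproject.example",
}

check_settings(settings["api_url"], settings["portal_url"])
headers = basic_auth_header(settings["username"], settings["password"])
```

For portal access, the same settings would instead carry the username and password used to log in to the web portal.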
As an example of how to use this module, I provide a script named utils.py that scrapes data from a specific project. It uses asynchronous methods everywhere except DataParser, which uses a ThreadPool instead. Hence, you need to set it up in an asynchronous way with async/await syntax. Some explanation of the main parts:
- Crawler: the class that initializes the crawler and gets data such as project IDs, a project's task IDs, and the tasks' activities
  - function get_projects_id -> Get all available projects and their IDs
  - function get_tasks_id -> Get all tasks that belong to project "my_project", using filter parameters in the HTTP request
  - function get_tasks_activities_data -> Scrape data from work_packages/{id}
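The async/await setup described above can be sketched as follows. This is not the module's real code: StubCrawler is a hypothetical stand-in that returns canned data without any network access, its method signatures are assumptions based only on the function names listed above, and parse is a placeholder for the blocking work that DataParser hands to a thread pool.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class StubCrawler:
    """Hypothetical stand-in for Crawler; returns canned data, no network."""

    async def get_projects_id(self) -> dict:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return {"1": "mainproject", "2": "demoproject"}

    async def get_tasks_id(self, project: str) -> list:
        await asyncio.sleep(0)
        return [45, 278, 13, 225]

    async def get_tasks_activities_data(self, task_id: int) -> dict:
        await asyncio.sleep(0)
        return {"ID": str(task_id), "Task activities": []}


def parse(raw: dict) -> dict:
    """Blocking parse step, standing in for DataParser."""
    return {"id": int(raw["ID"]), "n_activities": len(raw["Task activities"])}


async def collect(project: str) -> list:
    crawler = StubCrawler()
    projects = await crawler.get_projects_id()
    if project not in projects.values():
        raise KeyError(project)
    task_ids = await crawler.get_tasks_id(project)
    # Fetch all tasks concurrently; gather keeps the input order.
    raw = await asyncio.gather(
        *(crawler.get_tasks_activities_data(t) for t in task_ids)
    )
    # Run the blocking parser in worker threads so the event loop stays free.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            *(loop.run_in_executor(pool, parse, r) for r in raw)
        )


results = asyncio.run(collect("mainproject"))
```

The coroutine calls are awaited inside a single entry point that asyncio.run starts, while the thread pool is reserved for the synchronous parsing step.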
- Navigate to the project source:
cd /path/to/openproject-crawler/src/python
- Create a virtual environment:
python -m venv venv
- Activate the environment:
  - On Windows:
    .\venv\Scripts\activate
  - On Unix or macOS:
    source venv/bin/activate
- Install the required dependencies:
pip install -e .
Detailed usage is given in main.go. It follows the same process as the Python version; the flow is:
Get projects ID
-> Get tasks ID of specific project
-> Get tasks activities of specific project
go run main.go
- Projects ID:
{
  "1" : "mainproject",
  "2" : "demoproject"
}
- Tasks ID:
  - Golang data:
    [45 278 13 225]
  - Python data:
    [45, 278, 13, 225]
- Tasks activities:
{
  "Task name": "Scraping data from openproject",
  "Task info": {
    "Project": "Data collection",
    "ID": "2",
    "Type": "Task",
    "Priority": "Normal",
    "Create date": "2024-06-09 15:12:26",
    "End Date": "2024-06-19 16:44:31",
    "Duration": "10 days"
  },
  "Task activities": [
    {
      "Datetime": "2024-06-19 16:44:31",
      "Action": [
        "Status changed from In progress to Closed"
      ]
    }
  ]
}
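A result in this shape can be read back with the standard library. The sketch below uses a trimmed copy of the sample output above and assumes, without confirmation from the tool itself, that "Duration" is the number of whole days between "Create date" and "End Date". That assumption matches this sample.

```python
import json
from datetime import datetime

# Trimmed copy of the sample output shown above.
sample = """
{
  "Task name": "Scraping data from openproject",
  "Task info": {
    "Create date": "2024-06-09 15:12:26",
    "End Date": "2024-06-19 16:44:31",
    "Duration": "10 days"
  },
  "Task activities": [
    {"Datetime": "2024-06-19 16:44:31",
     "Action": ["Status changed from In progress to Closed"]}
  ]
}
"""

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the sample

task = json.loads(sample)
info = task["Task info"]
created = datetime.strptime(info["Create date"], FMT)
ended = datetime.strptime(info["End Date"], FMT)
elapsed_days = (ended - created).days  # whole days only

# Timestamps in this layout sort correctly as plain strings.
last = max(task["Task activities"], key=lambda a: a["Datetime"])
```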