Major reconstruction incoming.
Updated ERD: Click Here
The aim of this project is to replicate the entire dump data set provided at data.discogs.com.
In summary, the batch operates as follows:
- Currently supports PostgreSQL only, for stability and maintainability
- One-shot process, with validation before firing jobs
- Idempotent: the batch may run several times against the same source without issue
- Docker support: run the batch via docker run with predefined batch commands (see the sketch below)
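A containerized run might look like the following. This is a minimal sketch: the image name/tag, credentials, and host are placeholders, not a published image.

```bash
# Hypothetical image name; build and tag your own image first.
docker run --rm discogs-data-batch:latest \
  --user=discogs_user --pass=discogs_pw \
  --url=jdbc:postgresql://host.docker.internal:5432/discogs
```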
Arguments are accepted with or without the -- prefix ONLY IF they are passed directly to the jar file. The order of the arguments has no impact. However, duplicated arguments are NOT accepted.
e.g. --m, -m, and m will all work.
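For example, assuming the jar is named discogs-data-batch.jar (a placeholder; use your actual build artifact), all of the following are equivalent:

```bash
# All three forms of the mount flag are accepted when arguments go directly to the jar:
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs --m
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs -m
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs m

# Duplicated arguments are NOT accepted:
# java -jar discogs-data-batch.jar user=u user=u2 pass=p url=...
```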
A brief summary of the commands is given below:
NAME | SYNONYM | REQUIRED | MIN | MAX | FORMAT | DEFAULT | NOTE |
---|---|---|---|---|---|---|---|
username | user, u | ✔️ | 1 | 1 | STRING | NULL | |
password | pass, p | ✔️ | 1 | 1 | STRING | NULL | |
url | | ✔️ | 1 | 1 | addr:port | jdbc:postgresql://localhost:5432/discogs | |
type | t | 🔲 | 1 | 4 | a,b,... | ARTIST, LABEL, MASTER, RELEASE | |
chunk_size | chunk, c | 🔲 | 1 | 1 | 0 < N | 3000 | |
core_count | core | 🔲 | 1 | 1 | 0 < N | 80% of runtime cores | |
year | y | 🔲 | 1 | 1 | yyyy | CURRENT | either this or year_month |
year_month | ym | 🔲 | 1 | 1 | yyyy-mm | CURRENT | either this or year |
etag | e | 🔲 | 1 | 4 | a,b,... | MOST_RECENT | overrides type, year, year_month |
mount | m | 🔲 | 0 | 0 | NONE | - | keep the downloaded dump file |
strict | s | 🔲 | 0 | 0 | NONE | - | only process the specified type or ETag |
It is important to note that there are three required arguments:
- username: the user of the target database server. This will automatically be encoded to UTF-8. The user must have sufficient permissions to create and modify the given schema or database.
- password: the password for the given username. This will automatically be encoded to UTF-8.
- url: the JDBC URL of the target database. The expected format for the url is:
--url=jdbc:postgresql://{server_address}:{port}/{target_database}
If you prefer to use a specific database, make sure it exists before running the batch; otherwise the process will fail with error messages.
If {target_database} is missing, it will be set to discogs by default.
It is important to note that if the given schema or database is empty, the batch will automatically create the tables via Liquibase and SQL.
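Putting the three required arguments together, a minimal invocation against a local PostgreSQL instance might look like this (jar name and credentials are placeholders):

```bash
# target_database is omitted from the URL here, so it defaults to "discogs".
java -jar discogs-data-batch.jar \
  --user=postgres \
  --pass=changeit \
  --url=jdbc:postgresql://localhost:5432
```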
First and foremost, specifying an ETag causes any year, year_month, or type arguments to be ignored. This is intended behavior, as each dump relies on the other dump types from its specific year and month.
Aside from ETag, it is important to note that providing both year and year_month at the same time is not supported.
Finally, types cannot be duplicated.
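To illustrate these rules (the ETag value below is a placeholder):

```bash
# etag overrides everything else: the year_month and type arguments below are ignored.
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs \
  etag=abc123 year_month=2021-03 type=artist

# NOT supported: year and year_month together.
# java -jar discogs-data-batch.jar ... year=2021 year_month=2021-03

# NOT supported: duplicated types.
# java -jar discogs-data-batch.jar ... type=release,release
```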
If you specify a year and a type, for example, the batch will automatically fetch and process the target dump INCLUDING its dependent dumps.
The dump dependencies for each type are as follows:
TYPE | REQUIRES |
---|---|
ARTIST | - |
LABEL | - |
MASTER | ARTIST, LABEL |
RELEASE | ARTIST, LABEL, MASTER |
The jobs will always be executed in the following order:
ARTIST > LABEL > MASTER > RELEASE
If you run the batch with the following arguments:
url=[?] user=[?] pass=[?] year_month=2021-03 type=release
the batch will be executed with the artist, label, master, and release dumps from March 2021.
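As a complete command line, this corresponds to something like the following (jar name and credentials are placeholders):

```bash
java -jar discogs-data-batch.jar \
  url=jdbc:postgresql://localhost:5432/discogs \
  user=discogs_user pass=discogs_pw \
  year_month=2021-03 type=release
```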
If you do not specify any options, but simply call the batch with username, password, and url, the batch will be executed with the most recent artist, label, master, and release dumps.
If the mount option is specified, the file downloaded from the Discogs data server will not be removed. This may be useful if you need to keep the downloaded dump.
The strict option, on the other hand, will not resolve any dependencies, but simply executes with the given ETag or type.
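For example, to keep the downloaded dump files and skip dependency resolution (placeholders as before):

```bash
# strict: process ONLY the release dump, without artist/label/master dependencies.
# mount: do not delete the downloaded dump file afterwards.
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs \
  type=release strict mount
```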
The application will automatically resolve the core count of the running system (currently 80% of available cores). The core_count argument overrides this default, and the value is validated accordingly.
The core count cannot exceed 80% of the machine's total cores; a value above that limit is simply ignored.
Likewise, a negative core count is also ignored, leaving the core count at the default (80%).
The default chunk_size is 3000 (see the table above); however, in an average environment I would recommend setting it to 100~200. This depends entirely on the I/O spec and PostgreSQL settings of the running client and database server, so feel free to experiment with it.
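For example, tuning both values (the numbers are illustrative; the right values depend on your hardware and PostgreSQL configuration):

```bash
# A core_count above the 80% cap, or a negative one, is ignored and the default is used.
java -jar discogs-data-batch.jar user=u pass=p url=jdbc:postgresql://localhost:5432/discogs \
  chunk_size=200 core_count=4
```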