Parallel file downloader for HTTP/HTTPS, FTP and SFTP protocols, built using Python.
pip install pymatris
Initialize Downloader
from pymatris import Downloader
dl = Downloader()
Enqueue file to download
dl.enqueue_file("https://storage.data.gov.my/pricecatcher/pricecatcher_2022-01.parquet", path="./")
Start downloading files and view results
results = dl.download()
print(results)
Example from main.py:
from pymatris import Downloader
urls = [
"https://storage.data.gov.my/pricecatcher/pricecatcher_2022-01.parquet",
"https://storage.data.gov.my/pricecatcher/pricecatcher_2022-02.parquet",
"https://storage.data.gov.my/pricecatcher/pricecatcher_2022-03.parquet",
"https://storage.data.gov.my/pricecatcher/pricecatcher_2022-04.parquet",
"https://storage.data.gov.my/pricecatcher/pricecatcher_2022-05.parquet",
"sftp://belle:belle@192.168.1.6/home/belle/files/testfile.txt",
"ftp://bob:bob@192.168.1.6:20/nonexistent.txt",
]
dm = Downloader()
for url in urls:
dm.enqueue_file(url, path="./")
results = dm.download()
print(results)
>> Success:
>> pricecatcher_2022-01.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-01.parquet
>> pricecatcher_2022-02.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-02.parquet
>> pricecatcher_2022-03.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-03.parquet
>> pricecatcher_2022-04.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-04.parquet
>> pricecatcher_2022-05.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-05.parquet
>> testfile.txt sftp://belle:belle@192.168.1.6/home/belle/files/testfile.txt
>> Errors:
>> (ftp://bob:bob@192.168.1.6:20/nonexistent.txt,
>> ConnectionRefusedError(61, "Connect call failed ('192.168.1.6', 20)"))
Visit main.py for advanced usage.
Under the hood, pymatris.Downloader uses a global queue to manage download tasks. pymatris.Downloader.enqueue_file() adds a URL to the download queue, and pymatris.Downloader.download() downloads the queued files in parallel.
pymatris.Downloader depends on ProtocolResolver to resolve the URL scheme and route each download to the correct protocol handler.
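The routing idea can be illustrated with a small sketch. This is not pymatris's actual ProtocolResolver: the handler functions, the HANDLERS table, and resolve() are hypothetical names introduced here, and only the standard library's urllib.parse.urlsplit is used to read the URL scheme.

```python
from urllib.parse import urlsplit


# Hypothetical handlers: stand-ins for real protocol clients.
def handle_http(url: str) -> str:
    return f"http handler <- {url}"


def handle_ftp(url: str) -> str:
    return f"ftp handler <- {url}"


def handle_sftp(url: str) -> str:
    return f"sftp handler <- {url}"


# Map each supported URL scheme to its handler (assumed layout, for illustration).
HANDLERS = {
    "http": handle_http,
    "https": handle_http,
    "ftp": handle_ftp,
    "sftp": handle_sftp,
}


def resolve(url: str):
    """Return the handler registered for the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    try:
        return HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"Unsupported protocol: {scheme!r}") from None
```

Keeping the scheme-to-handler mapping in one table means a new protocol only needs one extra entry, which is one plausible reason to separate resolution from downloading.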
Pymatris uses asyncio-based clients (aiohttp, aioftp, asyncssh) to download files in parallel, with asynchronous file I/O handled by aiofiles.
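As a rough illustration of bounded parallelism with asyncio, the sketch below limits how many simulated downloads run at once using asyncio.Semaphore. It is an assumption about the general technique, not pymatris's internals: fake_download, download_all, and run_demo are hypothetical names, and asyncio.sleep stands in for real network I/O.

```python
import asyncio


async def fake_download(name: str, limit: asyncio.Semaphore, state: dict) -> str:
    """Simulate one download; asyncio.sleep stands in for network I/O."""
    async with limit:  # at most `max_parallel` tasks pass this point at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
    return name


async def download_all(names: list[str], max_parallel: int = 5):
    """Run all simulated downloads, returning results and peak concurrency."""
    limit = asyncio.Semaphore(max_parallel)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_download(name, limit, state) for name in names)
    )
    return results, state["peak"]


def run_demo(count: int = 12, max_parallel: int = 5):
    """Convenience wrapper: simulate `count` downloads with a concurrency cap."""
    names = [f"file-{i}.bin" for i in range(count)]
    return asyncio.run(download_all(names, max_parallel))
```

Calling run_demo() returns the file names in submission order together with the highest number of downloads that were in flight at the same time, which never exceeds max_parallel.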
pymatris.Downloader.download() returns a Results object, which is a list of the filenames that have been downloaded. The Results object has two attributes, success and errors.
success is a list of named tuples, each containing .path (the file path) and .url (the URL).
errors is a list of named tuples, each containing .filepath_partial (the intended file path), .url (the URL), and .exception (the Exception raised, or the aiohttp.ClientResponse received, during download).
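The snippet below sketches how such a result could be inspected. So that it runs without pymatris or network access, it uses stand-in types: Success, Error, FakeResults, and summarize() are hypothetical names that only mirror the attributes described above, and the sample values are made up for illustration.

```python
from collections import namedtuple

# Stand-in types mirroring the attributes described above (hypothetical names).
Success = namedtuple("Success", ["path", "url"])
Error = namedtuple("Error", ["filepath_partial", "url", "exception"])


class FakeResults:
    """Minimal stand-in exposing .success and .errors."""

    def __init__(self, success, errors):
        self.success = success
        self.errors = errors


def summarize(results) -> list[str]:
    """Build one report line per successful or failed download."""
    lines = [f"OK   {item.path} <- {item.url}" for item in results.success]
    lines += [
        f"FAIL {err.url}: {type(err.exception).__name__}"
        for err in results.errors
    ]
    return lines


results = FakeResults(
    success=[
        Success(
            "./testfile.txt",
            "sftp://belle:belle@192.168.1.6/home/belle/files/testfile.txt",
        )
    ],
    errors=[
        Error(
            "./nonexistent.txt",
            "ftp://bob:bob@192.168.1.6:20/nonexistent.txt",
            ConnectionRefusedError(61, "Connect call failed"),
        )
    ],
)

for line in summarize(results):
    print(line)
```

In a real run, the object returned by download() could presumably be passed to summarize() in place of FakeResults, provided it exposes the attributes described above.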
Pymatris also provides a command-line interface. In your terminal, run one of the following commands to download files in parallel.
# Insert single url as argument
pymatris https://storage.data.gov.my/pricecatcher/pricecatcher_2022-01.parquet
# Or multiple urls
pymatris https://storage.data.gov.my/pricecatcher/pricecatcher_2022-01.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-02.parquet https://storage.data.gov.my/pricecatcher/pricecatcher_2022-03.parquet
usage: pymatris [-h] [--max-parallel MAX_PARALLEL] [--max-splits MAX_SPLITS]
[--max-tries MAX_TRIES] [--timeouts TIMEOUTS] [--dir DIR]
[--overwrite] [--quiet] [--show-errors] [--verbose] URLS [URLS ...]
pymatris: Parallel download manager for HTTP/HTTPS/FTP/SFTP protocols.
positional arguments:
URLS URLs of files to be downloaded.
options:
-h, --help show this help message and exit
--max-parallel MAX_PARALLEL
Maximum number of parallel file downloads.
--max-splits MAX_SPLITS
Maximum number of parallel connections per file (only if protocol and server is supported).
--max-tries MAX_TRIES
Maximum number of download attempt per url.
--timeouts TIMEOUTS Maximum timeouts per url.
--dir DIR Directory to which downloaded files are saved.
--overwrite Overwrite if file exists. Only one url with the clashing name will overwrite the file.
--quiet Show progress indicators and file retries if any during download.
--show-errors Show failed downloads with its errors to stderr.
--verbose Log debugging output while transferring the files.
To set the directory in which files are saved, use the --dir option. By default, files are saved in the current directory.
pymatris --dir "./" <urls>
To overwrite existing files, use the --overwrite option. By default, files are not overwritten.
pymatris --overwrite <urls>
Assuming you have a "pricecatcher_2022-01.parquet" file in your current directory, running the above command will overwrite it. During download, Pymatris writes to a temporary file; if the download is interrupted, your existing files remain safe and the temporary files are deleted.
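One common way to get this kind of safety is to write to a temporary file in the target directory and only replace the destination once the write has finished. The sketch below shows that pattern with the standard library; it is an assumption about the general approach, not pymatris's actual code, and safe_write is a hypothetical helper.

```python
import os
import tempfile


def safe_write(target: str, chunks) -> None:
    """Write chunks to a temp file, then atomically replace `target`."""
    directory = os.path.dirname(os.path.abspath(target))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, target)  # only reached after a complete write
    except BaseException:
        # Interrupted or failed: remove the partial file, leave `target` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the iteration over chunks raises partway through (including KeyboardInterrupt), the existing file keeps its old contents and the partial temp file is removed.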
To configure the number of parallel downloads, use the --max-parallel option. By default, 5 parallel downloads are allowed.
pymatris --max-parallel 10 <urls>
Pymatris uses asyncio to download files in parallel; increase or decrease this value to allow more or fewer simultaneous downloads.
To configure the number of parts downloaded in parallel for each file, use the --max-splits option. By default, each file is downloaded in 5 parallel parts.
pymatris --max-splits 10 <urls>
This is only available for the HTTP/HTTPS and SFTP protocols. Multipart downloads are currently not supported for FTP.
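To make the idea of splitting a file into parts concrete, the sketch below divides a known file size into inclusive byte ranges and formats each as an HTTP Range header. It is illustrative only: split_ranges and range_header are hypothetical helpers, and how pymatris actually chooses part boundaries is not stated here.

```python
def split_ranges(size: int, max_splits: int = 5) -> list[tuple[int, int]]:
    """Split [0, size) into up to `max_splits` inclusive (start, end) byte ranges."""
    if size <= 0 or max_splits <= 0:
        return []
    parts = min(max_splits, size)  # never create empty parts
    base, extra = divmod(size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` parts take one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start: int, end: int) -> dict[str, str]:
    """Format an inclusive byte range as an HTTP Range request header."""
    return {"Range": f"bytes={start}-{end}"}
```

For example, split_ranges(10, 3) yields (0, 3), (4, 6), (7, 9). Each range can then be requested separately, which only works when the server honours range requests.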
To configure the number of download attempts per URL, use the --max-tries option. By default, 5 tries are allowed.
pymatris --max-tries 10 <urls>
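A bounded retry loop of this kind might look like the sketch below. It assumes that max_tries counts total attempts and that only OSError-style network failures are retried; with_retries is a hypothetical helper, not part of pymatris.

```python
import asyncio


async def with_retries(fetch, max_tries: int = 5, delay: float = 0.0):
    """Await the coroutine function `fetch()` up to `max_tries` times."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return await fetch()
        except OSError as exc:  # e.g. ConnectionRefusedError, ConnectionResetError
            last_error = exc
            if attempt < max_tries:
                await asyncio.sleep(delay)  # optional pause before the next try
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

A caller would pass a zero-argument coroutine function, for instance one that wraps a single download attempt, and receive either its result or the last exception once the attempts are exhausted.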
To hide the progress bar, use the --quiet option. By default, the progress bar is shown.
pymatris --quiet <urls>
- Python 3.9 or above
- aiohttp
- aioftp
- asyncssh
- aiofiles
- tqdm
- Add support for public key authentication for FTP and SFTP protocol handlers.
- Resolve PytestUnraisableExceptionWarning due to pytest servers running on separate threads.
- Add better concurrency support for FTP protocol, as aioftp by default allows only one data connection per client session.
- Better error handling and logging.