The picture above illustrates the submission and debugging workflows of a TACC job.
Before using the tcloud SDK, please make sure that you have applied for a TACC account and submitted your public key to TACC. You may generate an SSH public key by following the steps. To apply for a TACC account, please visit our website.
- Download tcloud SDK

  Download the latest tcloud SDK from tags.
- Install tcloud SDK

  Place `setup.sh` and `tcloud` in the same directory, and run `setup.sh`.
- First, you need to configure your TACC credentials. You can do this by running the `tcloud config` command:

      $ tcloud config [-u/--username] MYUSERNAME
      $ tcloud config [-f/--file] MYPRIVATEFILEPATH
- Then, run the `tcloud init` command to obtain the latest cluster hardware information from the TACC cluster.

      PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
      tacc*     up    infinite  5     alloc 10-0-7-[18-19],10-0-8-[18-19]
      tacc*     up    infinite  19    idle  10-0-2-[18-19],10-0-3-[10-13]
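The table printed by `tcloud init` is whitespace-separated, so it is easy to summarize in a script. The sketch below is only an illustration: the function name `nodes_by_state` is hypothetical, and it assumes the output keeps the column layout shown above. It sums the NODES column per STATE, using the sample output from this guide.

```python
from __future__ import annotations

from collections import defaultdict

# Sample text copied from the `tcloud init` output shown in this guide.
SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tacc* up infinite 5 alloc 10-0-7-[18-19],10-0-8-[18-19]
tacc* up infinite 19 idle 10-0-2-[18-19],10-0-3-[10-13]
"""


def nodes_by_state(table: str) -> dict[str, int]:
    """Sum the NODES column per STATE (hypothetical helper, not part of tcloud)."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    # Look the columns up by name instead of hard-coding their positions.
    nodes_col = header.index("NODES")
    state_col = header.index("STATE")
    totals: dict[str, int] = defaultdict(int)
    for line in lines[1:]:
        fields = line.split()
        totals[fields[state_col]] += int(fields[nodes_col])
    return dict(totals)


if __name__ == "__main__":
    for state, count in nodes_by_state(SAMPLE).items():
        print(f"{state}: {count} nodes")
```

A quick check of the idle count before submitting can help you decide whether a job is likely to start immediately.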
You can use this link to download our example code.
Each job requires a `main.py` and a `tuxiv.conf`:
- `main.py`: Your machine learning training code.
- `tuxiv.conf`: The job configuration. See the details about tuxiv.conf.
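To make the role of `main.py` concrete, here is a minimal sketch of a training script. It is not taken from the tcloud examples: the toy problem (fitting a line with gradient descent), the function name `train`, and all hyperparameters are assumptions chosen so that the script runs with the Python standard library alone. It prints its progress to standard output.

```python
"""main.py: a toy training script (illustrative only, not from the tcloud examples)."""
from __future__ import annotations

import random


def train(steps: int = 200, lr: float = 0.1, seed: int = 0) -> tuple[float, float]:
    """Fit y = w * x + b to noisy samples of y = 2x + 1 with gradient descent."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.01) for x in xs]
    w, b = 0.0, 0.0
    n = len(xs)
    for step in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
        if step % 50 == 0:
            # flush=True so progress lines are written out promptly.
            print(f"step={step} loss={loss:.6f}", flush=True)
    return w, b


if __name__ == "__main__":
    w, b = train()
    print(f"learned w={w:.3f} b={b:.3f}")
```

A real job would replace the body of `train` with your own model and data loading, and list the packages it imports as dependencies in `tuxiv.conf`.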
After tcloud is configured correctly, you can try to submit your first job.
- Go to the example folder in your terminal.
- Run the `tcloud submit` command.

      ~/Dow/quickstart-master/example/helloworld ❯ tcloud submit
      Start parsing tuxiv.conf...
      building file list ...
      8 files to consider
      helloworld/
      helloworld/run.sh
               151 100%    0.00kB/s    0:00:00 (xfer#1, to-check=5/8)
      helloworld/configurations/
      helloworld/configurations/citynet.sh
                12 100%   11.72kB/s    0:00:00 (xfer#2, to-check=2/8)
      helloworld/configurations/conda.yaml
               107 100%  104.49kB/s    0:00:00 (xfer#3, to-check=1/8)
      helloworld/configurations/run.slurm
               278 100%  271.48kB/s    0:00:00 (xfer#4, to-check=0/8)

      sent 429 bytes  received 144 bytes  382.00 bytes/sec
      total size is 1071  speedup is 1.87
      Submitted batch job 2000
      Job helloworld submitted.
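If you script your submissions, you may want the job ID that appears in the `Submitted batch job ...` line of the output above. The following sketch shows one way to extract it. The helper names `parse_job_id` and `submit` are hypothetical, and `submit` assumes that `tcloud` is on your `PATH` and writes this line to standard output, which this guide does not confirm.

```python
from __future__ import annotations

import re
import subprocess
from typing import Optional

# Tail of the `tcloud submit` output shown in this guide.
SAMPLE_OUTPUT = """\
total size is 1071 speedup is 1.87
Submitted batch job 2000
Job helloworld submitted.
"""


def parse_job_id(submit_output: str) -> Optional[int]:
    """Return the job ID from the submit output, or None if it is absent."""
    match = re.search(r"Submitted batch job (\d+)", submit_output)
    return int(match.group(1)) if match else None


def submit() -> Optional[int]:
    """Run `tcloud submit` in the current directory and return the job ID.

    Assumption: tcloud is installed and prints the confirmation to stdout.
    """
    result = subprocess.run(
        ["tcloud", "submit"], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


if __name__ == "__main__":
    # Demonstrate the parser on the sample text; call submit() for a real job.
    print(parse_job_id(SAMPLE_OUTPUT))
```

Keeping the parser separate from the `subprocess` call makes it possible to test the parsing logic without access to the cluster.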
In this section, we provide two methods to monitor the job log.
After training, you can use `tcloud ls [filepath]` to find the output files.
- cat

  You can configure your log path in `tuxiv.conf`. The default path is `slurm_log/slurm-jobid.out`.

      tcloud cat slurm_log/slurm-jobid.out

  In the helloworld example, the `tuxiv.conf` file specifies the log path as `slurm_log/hello.log`.
- download

  You can use `tcloud download [filepath]`. Note that you can only read and download files in `USERDIR`, and the files in `WORKDIR` may be removed after the job is finished.

      tcloud download slurm_log/slurm-jobid.out
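The two commands above take the same log path, so a small helper can build them for a given job. This is a sketch under a stated assumption: it treats `jobid` in the default path `slurm_log/slurm-jobid.out` as a placeholder for the numeric job ID reported at submission, which this guide implies but does not state. The function names are hypothetical.

```python
import shlex


def default_log_path(job_id: int) -> str:
    """Build the default log path for a job.

    Assumption: "jobid" in slurm_log/slurm-jobid.out is the numeric job ID.
    """
    return f"slurm_log/slurm-{job_id}.out"


def tcloud_command(action: str, path: str) -> str:
    """Return a shell-safe command line for the file commands in this guide."""
    if action not in ("ls", "cat", "download"):
        raise ValueError(f"unsupported action: {action}")
    # shlex.join quotes the path if it contains spaces or shell metacharacters.
    return shlex.join(["tcloud", action, path])


if __name__ == "__main__":
    log_path = default_log_path(2000)
    print(tcloud_command("cat", log_path))
    print(tcloud_command("download", log_path))
```

If your `tuxiv.conf` sets a custom log path, such as `slurm_log/hello.log` in the helloworld example, pass that path to `tcloud_command` directly instead.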
tcloud uses Conda to manage your dependencies. All dependencies will be installed through conda. Please specify the required conda channel to meet the installation requirements. In tcloud, we offer two ways of environment management:
- One-off Environment. A new environment is created whenever you submit a task to TACC with different dependencies. If you do not specify an environment name and the dependencies configuration in `tuxiv.conf` does not change between two consecutive submissions, we will reuse the previous environment to save time. This is the default behavior.

      environment:
          # name:  # do not specify environment name
          dependencies:
              - pytorch=1.6.0
              - torchvision=0.7.0
          channels: pytorch
- Persistent Environment. You can create a dedicated environment for each project by setting a different environment name in `tuxiv.conf` for each project. When you change the dependencies configuration of an existing environment, tcloud will update that environment instead of creating a new one. Learn how to do this in the environment part of the tuxiv.conf documentation.

      environment:
          name: torch-env  # dedicated environment name
          dependencies:
              - pytorch=1.6.0
              - torchvision=0.7.0
          channels: pytorch
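The difference between the two modes can be pictured as a naming rule: a persistent environment is addressed by the name you give it, while a one-off environment can be reused only when its dependencies are unchanged. The sketch below illustrates that idea and is not how tcloud is implemented. The function names, the `oneoff-` prefix, and the use of a hash as the reuse key are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import json
from typing import Optional, Sequence


def env_fingerprint(dependencies: Sequence[str], channels: Sequence[str]) -> str:
    """Hash the dependency configuration into a short, stable key."""
    payload = json.dumps(
        {
            # Dependency order does not matter, so sort before hashing.
            "dependencies": sorted(dependencies),
            # Channel order is kept because it can affect package resolution.
            "channels": list(channels),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def choose_env_name(
    name: Optional[str], dependencies: Sequence[str], channels: Sequence[str]
) -> str:
    """Use the explicit name if given, else derive one from the dependencies."""
    if name:
        return name  # persistent environment: updated in place
    return "oneoff-" + env_fingerprint(dependencies, channels)


if __name__ == "__main__":
    deps = ["pytorch=1.6.0", "torchvision=0.7.0"]
    print(choose_env_name(None, deps, ["pytorch"]))
    print(choose_env_name("torch-env", deps, ["pytorch"]))
```

Under this rule, two submissions with identical dependencies map to the same one-off name and could share an environment, while any change to a version pin produces a new name.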
The following video will help you use the tcloud CLI to begin your TACC journey: demo video.
Basic examples are provided under the example folder. These examples include: HelloWorld, TensorFlow, PyTorch and MXNet.