# Submit and download a synthetic data batch

This notebook shows how to:
  1. Check your API usage.
  2. Define a distribution of synthetic data parameters.
  3. Submit a synthetic data batch to the API.
  4. Query the status of the batch to see how many jobs have completed.
  5. Download the completed jobs.
  6. Visualize dataset statistics and filter for specific properties.

For more information, please visit the Infinity Docs: [docs.infinity.ai](https://docs.infinity.ai/). 

## Import libraries

In [1]:
import os
import random
import numpy as np
from datetime import datetime
from infinity_core.api import get_usage_datetime_range
import infinity_tools.visionfit.api as api
from infinity_tools.visionfit.vis import visualize_job_params
from infinity_tools.visionfit.vis import summarize_batch_results_as_dataframe, visualize_batch_results
#import pandas as pd  # Optional, but useful for tabular visualizations in the notebook
#pd.options.display.max_columns = None

## Define constants

In [2]:
TOKEN = "691fda0541b490ac801d9f4b1c1179eafbd53593"
SERVER = "https://apidev.toinfinity.ai/"
GENERATOR = "visionfit-v0.4.0"
OUT_DIR = "tmp/"
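A token written directly into a notebook is easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the token is stored in an environment variable. The variable name `INFINITY_API_TOKEN` and the helper `load_token` are hypothetical and are not part of `infinity_tools`.

```python
import os


def load_token(env=None, var_name="INFINITY_API_TOKEN"):
    """Return the API token from an environment mapping.

    `var_name` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Demonstration with an explicit mapping instead of the real environment.
demo_token = load_token({"INFINITY_API_TOKEN": "my-secret-token"})
print(f"Loaded a token of length {len(demo_token)}")
```

In the notebook, `TOKEN = load_token()` could then replace the literal string.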

## Create an API session
Initialize a Session object to interact with the Infinity API. 

In [3]:
sesh = api.VisionFitSession(token=TOKEN, generator=GENERATOR, server=SERVER)

## Check API usage

Check your API usage stats for the last N days (30 in the example below): 

In [4]:
usage_data_last_30_days = sesh.get_usage_stats_last_n_days(n_days=30)
print("-- Usage in the last 30 days --\n", usage_data_last_30_days)

#pd.DataFrame(usage_data_last_30_days['counts_by_generator']) # Uncomment to see stats in a DF (ensure pandas has been imported)


-- Usage in the last 30 days --
 {'counts_by_generator': [{'preview_samples_rendered': 4, 'non_preview_samples_rendered': 0, 'generator': 'visionfit-v0.3.1'}, {'preview_samples_rendered': 16, 'non_preview_samples_rendered': 84815, 'generator': 'visionfit-v0.4.0'}], 'start_time': '2022-12-13T22:07:43.256093-06:00', 'end_time': '2023-01-12T22:07:43.256093-06:00'}


For a custom date range, pass `datetime` objects to the lower-level `get_usage_datetime_range` function:

In [5]:
start_time = datetime.fromisoformat("2022-07-15")
end_time = datetime.fromisoformat("2022-08-20")
usage_data_late_summer_2022 = get_usage_datetime_range(token=TOKEN, server=SERVER, start_time=start_time, end_time=end_time)
print("-- Usage in late summer 2022 -- \n", usage_data_late_summer_2022.json())


-- Usage in late summer 2022 -- 
 {'counts_by_generator': [{'preview_samples_rendered': 4024, 'non_preview_samples_rendered': 0, 'generator': 'visionfit-v0.3.1'}], 'start_time': '2022-07-15T00:00:00-05:00', 'end_time': '2022-08-20T00:00:00-05:00'}
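The usage payload is a plain dictionary, so totals across generators can be computed directly. The sketch below assumes only the structure visible in the printed outputs above; `total_samples` is a hypothetical helper, not part of the Infinity API.

```python
def total_samples(usage):
    """Sum preview and non-preview sample counts across all generators."""
    rows = usage["counts_by_generator"]
    preview = sum(row["preview_samples_rendered"] for row in rows)
    full = sum(row["non_preview_samples_rendered"] for row in rows)
    return {"preview": preview, "non_preview": full}


# Dictionary copied from the 30-day output shown above.
usage_30_days = {
    "counts_by_generator": [
        {"preview_samples_rendered": 4, "non_preview_samples_rendered": 0,
         "generator": "visionfit-v0.3.1"},
        {"preview_samples_rendered": 16, "non_preview_samples_rendered": 84815,
         "generator": "visionfit-v0.4.0"},
    ],
    "start_time": "2022-12-13T22:07:43.256093-06:00",
    "end_time": "2023-01-12T22:07:43.256093-06:00",
}
print(total_samples(usage_30_days))  # {'preview': 20, 'non_preview': 84815}
```

For the lower-level call, pass the decoded payload, that is `usage_data_late_summer_2022.json()`, rather than the response object itself.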


## Define the distribution of parameters
Set the number of jobs in the batch and the batch name, then customize the distribution of each parameter in the batch.


**All parameters:** Use `pd.DataFrame(sesh.parameter_info)` to see all available parameters. Or, visit the [generator pages](https://api.toinfinity.ai/admin/api/generator/) on the User Portal for full parameter documentation. 

---  


This **example** shows how to create a 50-job batch of left uppercuts (`UPPERCUT-LEFT`) with various parameter distributions (fixed, uniform, Gaussian, clipped Gaussian, etc.).


**Resources:** For inspiration on the kinds of distributions you can sample parameters from, see the real-valued distributions in Python's [`random` module](https://docs.python.org/3/library/random.html#real-valued-distributions) or the distributions in [`numpy.random`](https://numpy.org/doc/stable/reference/random/legacy.html#distributions).
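One detail worth knowing when bounding a Gaussian: clamping out-of-range draws to the bounds (what `np.clip` does in the example cell below) piles extra probability mass exactly at the bounds, whereas redrawing until a sample lands in range gives a truncated Gaussian with no such spikes. The sketch below contrasts the two using only the standard library. Both helper names are hypothetical, and the fixed seed is an optional choice for reproducibility.

```python
import random


def clipped_gauss(rng, mu, sigma, low, high):
    """Draw from a Gaussian, then clamp out-of-range values to the bounds."""
    return min(max(rng.gauss(mu, sigma), low), high)


def truncated_gauss(rng, mu, sigma, low, high, max_tries=1000):
    """Redraw until the sample falls inside [low, high] (rejection sampling)."""
    for _ in range(max_tries):
        value = rng.gauss(mu, sigma)
        if low <= value <= high:
            return value
    raise RuntimeError("Bounds are too far from the mean; no sample accepted.")


rng = random.Random(0)  # fixed seed so the batch definition is reproducible
clipped = [clipped_gauss(rng, 0.0, 30.0, -60.0, 60.0) for _ in range(1000)]
truncated = [truncated_gauss(rng, 0.0, 30.0, -60.0, 60.0) for _ in range(1000)]

# Count samples sitting exactly on a bound in each list.
print("clipped at bound:  ", sum(v in (-60.0, 60.0) for v in clipped))
print("truncated at bound:", sum(v in (-60.0, 60.0) for v in truncated))
```

Which variant is appropriate depends on whether a cluster of samples at the extreme angles is acceptable in the resulting dataset.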
  




In [None]:
num_jobs = 50
batch_name = "apple " # batch name is displayed in the API User Portal 
job_params = [
    sesh.sample_input(
        exercise="UPPERCUT-LEFT",  # Only one type of exercise
        num_reps=4,  # Always 4 reps per video
        lighting_power=float(random.gauss(400.0, 20.0)),  # Gaussian lighting centered at ~400 units
        camera_height=float(np.random.uniform(0.1, 1.2)),  # Uniform camera height, 0.1-1.2 m
        relative_avatar_angle_deg=float(np.clip(np.random.normal(0, 30), -60, 60)),  # Normal dist. clipped at +/-60 degrees
        frame_rate=6,  # Fixed frame rate (6 fps)
        image_height=256,  # Fixed resolution (256x256 px)
        image_width=256,
    )
    for _ in range(num_jobs)
]

## Review the distribution 
Before submitting to the API, visualize the distributions of parameters either in a histogram or table. 

In [None]:
visualize_job_params(job_params) 
#pd.DataFrame(job_params) #uncomment for tabular visualization (ensure pandas has been imported)
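A numeric summary can complement the histograms when checking that sampled values stay within the intended ranges. This is a minimal sketch that assumes `job_params` is a list of dictionaries (as the `pd.DataFrame(job_params)` call suggests). `summarize_numeric_params` is a hypothetical helper, and the data below is a stand-in.

```python
from statistics import mean


def summarize_numeric_params(job_params):
    """Return {param: (min, mean, max)} for every numeric parameter."""
    summary = {}
    names = sorted({name for job in job_params for name in job})
    for name in names:
        values = [job[name] for job in job_params if name in job]
        # bool is a subclass of int, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if values and is_numeric:
            summary[name] = (min(values), mean(values), max(values))
    return summary


# Stand-in data; in the notebook this would be the `job_params` list.
example_jobs = [
    {"exercise": "UPPERCUT-LEFT", "num_reps": 4, "camera_height": 0.5},
    {"exercise": "UPPERCUT-LEFT", "num_reps": 4, "camera_height": 1.0},
]
for name, (low, avg, high) in summarize_numeric_params(example_jobs).items():
    print(f"{name}: min={low} mean={avg} max={high}")
```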

## Submit the batch 

After verifying the parameters meet your desired specs, you can submit the batch to the API. 

**BE CAREFUL: There is no submission confirmation button. And there is no way to cancel your jobs once you submit them. Use caution.** 

Data generation takes some time, so submission is **non-blocking**: you can run this cell and then shut down your notebook if you like.


In [None]:
job_params = job_params[:3] #only the first 3 videos will be submitted
is_preview = True # True = previews (single frames). False = full videos.

In [None]:
# **** WARNING! There is no way to cancel these jobs once you run this cell. ***
batch = sesh.submit(
    job_params=job_params,
    is_preview=is_preview,
    batch_name=batch_name,
)
print(f"Your Batch ID is: {batch.batch_id}") 
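Because submitted jobs cannot be cancelled, a small guard run just before `sesh.submit` can catch an accidentally large or full-render batch. The sketch below is one possible shape for such a check. `check_before_submit` and its parameters are hypothetical and are not part of the Infinity API.

```python
def check_before_submit(job_params, is_preview, max_jobs, allow_full_render=False):
    """Raise ValueError if the batch looks larger or costlier than intended."""
    if not job_params:
        raise ValueError("job_params is empty; nothing to submit.")
    if len(job_params) > max_jobs:
        raise ValueError(f"{len(job_params)} jobs exceeds the limit of {max_jobs}.")
    if not is_preview and not allow_full_render:
        raise ValueError("Full videos requested; pass allow_full_render=True to confirm.")
    return len(job_params)


# Stand-in job list; in the notebook this would be the real `job_params`.
demo_jobs = [{"num_reps": 4}, {"num_reps": 4}, {"num_reps": 4}]
num_checked = check_before_submit(demo_jobs, is_preview=True, max_jobs=3)
print(f"OK to submit {num_checked} preview jobs")
```

Calling the guard in the same cell as `sesh.submit`, immediately before it, keeps the check and the submission from drifting apart.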

## Check the status of your batch
Check whether your synthetic data batch has completed or generation is still in progress. Even while it is in progress, you can download the jobs that have already completed.

If you do not remember your Batch ID, you can either:
+ Log in to the [User Portal](https://api.toinfinity.ai/admin/api/batch/) and look it up, or
+ Run `sesh.get_batches_last_n_days(30)` and look up the Batch ID (sample code is provided in the next cell).

In [None]:
batch_id = batch.batch_id  # In a new session, replace this with your Batch ID string

# # Uncomment to look up the batch ID 
# batches_last_month = sesh.get_batches_last_n_days(30) 
# print('-- LIST OF BATCHES IN LAST N DAYS -- ')
# for dict_item in batches_last_month: 
#    print(f"{dict_item['created']:%Y-%m-%d %H:%M} | NAME: {dict_item['name']} | BATCH ID: {dict_item['batch_id']}")

Once you have the Batch ID, reconstitute the Batch object: 

In [None]:
batch = sesh.batch_from_api(batch_id=batch_id)

Poll the server to see the status of your batch job.

In [None]:
print(f"{batch.num_jobs - batch.get_num_jobs_remaining()}/{batch.num_jobs} submitted jobs have completed")

completed_jobs = batch.get_completed_jobs()
valid_jobs = batch.get_valid_completed_jobs()

num_submitted = batch.num_jobs
num_completed = len(completed_jobs)
num_in_progress = num_submitted - num_completed
num_successful = len(valid_jobs)
num_failed = num_completed - num_successful

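print(f"{num_successful}/{num_completed} completed jobs have a valid URL.")
print(f"{num_in_progress} jobs are still in progress; {num_failed} completed without a valid URL.")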
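To wait for a batch instead of re-running the status cell by hand, the check can be wrapped in a polling loop with a timeout. This is a sketch under stated assumptions: `wait_for_batch` is a hypothetical helper, and the interval and timeout defaults are arbitrary. It accepts any zero-argument callable, so `batch.get_num_jobs_remaining` could be passed in directly.

```python
import time


def wait_for_batch(get_remaining, poll_seconds=30.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until no jobs remain or the timeout elapses.

    `get_remaining` is any zero-argument callable that returns the number of
    unfinished jobs. Returns the last observed count.
    """
    deadline = clock() + timeout_seconds
    remaining = get_remaining()
    while remaining > 0 and clock() < deadline:
        sleep(poll_seconds)
        remaining = get_remaining()
    return remaining


# Stand-in for the server: reports 2, then 1, then 0 jobs remaining.
fake_counts = iter([2, 1, 0])
print(wait_for_batch(lambda: next(fake_counts), sleep=lambda seconds: None))  # 0
```

In the notebook, `wait_for_batch(batch.get_num_jobs_remaining)` would block until the batch finishes or an hour passes. A nonzero return value means the timeout was reached first.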

## Download the data

We can download the completed jobs. Only the completed jobs that have a valid URL will be downloaded.  

Jobs will be downloaded to your local computer at the path specified by `OUT_DIR`. 

In [None]:
batch_path = os.path.join(OUT_DIR, batch.name, f'batch_ID_{batch.batch_id}')
_download_ok = batch.download(path=batch_path)

## Inspect and filter the dataset

Finally, we compile some of the metadata and all of the job parameters that were submitted into a dataframe. This allows us to see the distribution of the resulting dataset. 

In addition, we can query the dataset for specific properties, which allows us to curate a desired training set for a given ML application.

In [None]:
visualize_batch_results(batch_path)

In [None]:
# # Uncomment to filter through the dataset in tabular format (requires pandas)
# import pandas as pd
# pd.options.display.max_columns = None
# df = summarize_batch_results_as_dataframe(batch_path)
# df.round(2).query('avg_percent_occlusion < 4')