## Checkout with sample

The checkout command offers three sampling options, available only for datasets: `--sample-type=group` (with `--seed`), `--sample-type=random` (with `--seed`), and `--sample-type=range`. We use [random.sample(population, k)](https://docs.python.org/3.7/library/random.html#random.sample) to return a sample of size k from the population elements. We use [random.seed()](https://docs.python.org/3.7/library/random.html#random.seed) to set the seed so that the sample generated by `random.sample()` can be reproduced between experiments. We use the [range()](https://docs.python.org/3.7/library/stdtypes.html?highlight=range#range) object to take samples from a given range.
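As a minimal sketch of why the seed matters, the snippet below draws two samples with the same seed and shows that they match. The file names and the `seeded_sample` helper are hypothetical and only illustrate the standard-library calls; this is not ml-git's internal code.

```python
import random

# Hypothetical file names standing in for the files of a dataset
population = [f"file_{i:02d}.png" for i in range(12)]


def seeded_sample(population, k, seed):
    """Return k elements of population, reproducibly for a given seed."""
    random.seed(seed)  # fix the generator state before sampling
    return random.sample(population, k)


first = seeded_sample(population, 4, seed=1)
second = seeded_sample(population, 4, seed=1)

print(first == second)  # True
```

Reusing the same seed in a later experiment yields the same selection, as long as the population and its order are unchanged.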

#### Example:

![dataset](../../dataset.png)



Let's assume that we have a dataset that contains 12 files.

````ml-git dataset checkout computer-vision__images__dataset-ex__22 --sample-type=group --sampling=2:5 --seed=1```` : This command randomly selects 2 files from every group of 5 files for download.

![group-sample](../../group-sample.png)
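The group option can be pictured with the sketch below. It is an illustration under assumptions, not ml-git's implementation: the `group_sample` helper and the file names are hypothetical, a local `random.Random` instance is used instead of `random.seed()` to avoid touching global state, and the handling of the trailing, smaller group (sampling at most `amount` files from it) is an assumption.

```python
import random


def group_sample(files, amount, group_size, seed):
    """Pick `amount` files at random from each consecutive group of `group_size` files."""
    rng = random.Random(seed)  # local generator, reproducible for a given seed
    selected = []
    for start in range(0, len(files), group_size):
        group = files[start:start + group_size]
        # Assumption: a trailing group smaller than group_size is still sampled
        k = min(amount, len(group))
        selected.extend(rng.sample(group, k))
    return selected


# Hypothetical dataset with 12 files, mirroring --sampling=2:5 --seed=1
files = [f"file_{i:02d}.png" for i in range(12)]
picked = group_sample(files, amount=2, group_size=5, seed=1)

print(len(picked))  # 6
```

With 12 files and groups of five, the sketch forms groups of 5, 5 and 2 files and takes two files from each.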

````ml-git dataset checkout computer-vision__images__dataset-ex__22 --sample-type=random --sampling=2:6 --seed=1```` : This command computes the sample size as amount * len(dataset) / frequency. Here that is 2 * 12 / 6 = 4, so four files are randomly selected for download.

![random-sample](../../random-sample.png)
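The arithmetic for the random option can be checked with a short sketch. The `random_sample` helper and the file names are hypothetical, and the use of integer division for sizes that do not divide evenly is an assumption rather than documented ml-git behaviour.

```python
import random


def random_sample(files, amount, frequency, seed):
    """Sample amount * len(files) / frequency files from the whole dataset."""
    # Assumption: integer division when the ratio is not a whole number
    k = amount * len(files) // frequency
    rng = random.Random(seed)  # reproducible for a given seed
    return rng.sample(files, k)


# Hypothetical dataset with 12 files, mirroring --sampling=2:6 --seed=1
files = [f"file_{i:02d}.png" for i in range(12)]
picked = random_sample(files, amount=2, frequency=6, seed=1)

print(len(picked))  # 4
```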

````ml-git dataset checkout computer-vision__images__dataset-ex__22 --sample-type=range --sampling=2:11:2```` : This command selects the files at indexes generated by `range(start=2, stop=11, step=2)`.

![range-sample](../../range-sample.png)
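The indexes produced by `range(start=2, stop=11, step=2)` can be listed directly. In this sketch the `range_sample` helper and the file names are hypothetical and only illustrate how the indexes map onto a 12-file dataset.

```python
def range_sample(files, start, stop, step):
    """Select the files located at the indexes produced by range(start, stop, step)."""
    return [files[i] for i in range(start, stop, step)]


# Hypothetical dataset with 12 files, mirroring --sampling=2:11:2
files = [f"file_{i:02d}.png" for i in range(12)]

print(list(range(2, 11, 2)))  # [2, 4, 6, 8, 10]
print(range_sample(files, 2, 11, 2))
```

The stop value is exclusive, so index 11 itself is never selected.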


#### To start using the ml-git API, we need to import it into our script

In [None]:
from ml_git import api

#### After that, we define some variables that will be used by the script

In [None]:
# The type of entity we are working on
entity = 'dataset'

# Existing tag in our repository
tag = 'computer-vision__images__mscoco__1'

#### Before using the sample option, we will check out the entity to see the files contained in the tag

In [None]:
data_path = api.checkout(entity, tag)

The `data_path` returned by the function tells us where the entity's data was downloaded, so we can use the following function to print the files in the entity's directory.

In [None]:
import os
import glob

data_type = '*.png'

def print_files(data_path):
    folder = os.path.join(data_path, 'data', data_type)
    print('Downloaded files: ')
    for image_name in glob.glob(folder):
        print(image_name)

print_files(data_path)

#### To be able to check out the same tag again, we use the following functions to remove the local index and refs

In [None]:
import shutil
import stat

# Helper that removes a directory tree, if it exists
def clear_path(path):
    if not os.path.exists(path):
        return
    # Grant full permissions on every file (including those inside the .git directory) so they can be removed
    for root, dirs, files in os.walk(path):
        for f in files:
            os.chmod(os.path.join(root, f), stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)
    try:
        shutil.rmtree(path)
    except Exception as e:
        print('except: ', e)


def clear_environment():
    clear_path(os.path.join('.ml-git', entity, 'index'))
    clear_path(os.path.join('.ml-git', entity, 'refs'))
    
clear_environment()

#### Checkout with group sample

In [None]:
sampling = {'group': '1:10', 'seed': '10'}

data_path = api.checkout(entity, tag, sampling)

print_files(data_path)

clear_environment()

#### Checkout with range sample

In [None]:
sampling = {'range': '0:4:3'}

data_path = api.checkout(entity, tag, sampling)

print_files(data_path)

clear_environment()

#### Checkout with random sample

In [None]:
sampling = {'random': '1:5', 'seed': '1'}

data_path = api.checkout(entity, tag, sampling)

print_files(data_path)

clear_environment()