# Splitting Data Into Train, Validation, and Test Sets

### Overview
By this point in the project, you should have already retrieved your images from Labelbox and stored them in a directory alongside their corresponding text files (in Step 1, you will set the constant `IMAGE_FOLDER` to this directory). Before we can begin training our object detection model, we need to split our labeled data into train, validation, and test sets. You have some freedom in choosing the split ratio, but some commonly used ratios are:
* 70% train, 15% valid, 15% test
* 80% train, 10% valid, 10% test
* 60% train, 20% valid, 20% test
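
To get a feel for what these ratios mean in practice, here is a small sketch that converts a ratio into image counts. The function name `split_counts` and the dataset size of 200 images are made up for illustration; neither is part of this notebook.

```python
def split_counts(total, train_pct, valid_pct):
    """Return (train, valid, test) image counts for whole-number percentages."""
    train = total * train_pct // 100
    valid = total * valid_pct // 100
    test = total - train - valid  # Test gets whatever is left over
    return train, valid, test

# Try each of the commonly used ratios on a hypothetical set of 200 images
for train_pct, valid_pct in [(70, 15), (80, 10), (60, 20)]:
    print(train_pct, valid_pct, split_counts(200, train_pct, valid_pct))
```

With only a few hundred images, a 10% validation or test set can be quite small, which is one reason you might pick a ratio that gives those sets a larger share.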

Here is the general strategy we are going to use to split our data.
1. For each set (train, valid, test) we will do the following:
    * Make a directory to store the images (and text files) we assign to that set.
    * Make a text file to keep track of all the image paths we assign to that set.
2. Produce a list that contains all the text files in `IMAGE_FOLDER`. You will fill in the function called `get_text_files` to accomplish this.
3. For each text file in that list, we will do the following:
    * Find the image with the same base name (e.g. if we had puppy_1.txt, we might want puppy_1.jpg or puppy_1.png). You will fill in the function called `get_image_file` to accomplish this.
    * Copy the text file and image to either the train, valid, or test directory (we will discuss the logic for this later in the notebook). Then, save the image path to the text file for that same set. You will fill in the function called `assign_to_set` to accomplish this.

Note: It may be helpful to check out *Step 4: Putting It All Together* before implementing the individual functions to see how the above strategy translates to code.

### Step 1: Import modules and set image folder constant
Here, all you need to do is specify the path to the directory where all your images and text files are stored (it should be in your current working directory). As of now, they should all be together in the same directory. By the end of this notebook, we will split them into train, validation, and test sets!

In [20]:
import shutil, os

**Important:** For `IMAGE_FOLDER`, we will need the absolute path, not the relative path (the one that uses `./`). To get the absolute path, we will get our current working directory and concatenate that with the name of the directory where our images are stored.

In [21]:
cwd = os.getcwd() # Current working directory
IMAGE_FOLDER = cwd + '/img' # Directory with images you want to split
print('Your image directory: ', IMAGE_FOLDER)

Your image directory:  /projects/d3bc3d47-ea68-450a-9811-79d4dfe8d026/shane/img
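
If the difference between relative and absolute paths is unclear, the sketch below may help. It uses `os.path.join`, `os.path.abspath`, and `os.path.isabs` from the standard library as an alternative to concatenating strings by hand. The folder name `img` matches the cell above, but the variable names `joined` and `absolute` are only for illustration.

```python
import os

cwd = os.getcwd()
joined = os.path.join(cwd, 'img')     # Inserts the path separator for you
absolute = os.path.abspath('./img')   # Resolves a relative path against cwd

print(os.path.isabs('./img'))  # False: this is a relative path
print(os.path.isabs(joined))   # True: this starts from the filesystem root
print(joined == absolute)      # Both approaches should describe the same directory
```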


### Step 2: Complete Functions
Once you have specified your `IMAGE_FOLDER`, you are ready for Step 2! As a team, complete the three functions below. Pay special attention to the instructions in the function comment, the parameters being passed into the function, and the requested return value (if any). You can work together or divide and conquer.

*Hint*: You can see if a string ends with a particular suffix using a method called `endswith()`. Try looking it up on Google. It might be helpful for seeing if a filename ends with a particular extension.
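
As a quick illustration of the hint, here is a minimal sketch of `endswith()` on a few made-up filenames. It also shows two details that are easy to miss: `endswith()` accepts a tuple of suffixes, and the comparison is case-sensitive unless you lowercase the name first.

```python
filenames = ['puppy_1.txt', 'puppy_1.jpg', 'puppy_2.PNG', 'notes.txt.bak']

# Keep only names whose final characters are exactly '.txt'
text_names = [name for name in filenames if name.endswith('.txt')]

# A tuple checks several suffixes at once; lower() makes the check case-insensitive
image_names = [name for name in filenames
               if name.lower().endswith(('.jpg', '.png'))]

print(text_names)
print(image_names)
```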

In [22]:
def get_text_files(file_list):
    '''
    Given a list of files, returns a list of files with .txt extension

    Parameters
    ----------
    file_list : list of strings
        List where each element is a filename

    Returns
    -------
    text_files : list of strings
        List where each element is a filename ending with .txt

    >>> file_list = ['aicamp1.jpg', 'aicamp1.txt', 'aicamp2.jpg', 'aicamp2.txt']
    >>> get_text_files(file_list)
    ['aicamp1.txt', 'aicamp2.txt']
    '''
    text_files = []
    for file in file_list:
        if file.endswith('.txt'):
            text_files.append(file)
    return text_files

In [23]:
def get_image_file(file_list, text_file):
    '''
    Given a text file, find the image with the same base name in a list of files.

    Parameters
    ----------
    file_list : list of strings
        List where each element is a filename
    text_file : string
        Filename ending with .txt extension

    Returns
    -------
    image_file : string
        -Filename ending with .jpg or .png extension
        -Should have the same base name as text_file
        -There should be a filename that exactly matches in file_list
        -Return None if no matching image file is found in file_list

    >>> file_list = ['aicamp1.jpg', 'aicamp2.JPG', 'aicamp3.png', 'aicamp4.PNG']
    >>> get_image_file(file_list, 'aicamp1.txt')
    'aicamp1.jpg'
    >>> get_image_file(file_list, 'aicamp2.txt')
    'aicamp2.JPG'
    >>> get_image_file(file_list, 'aicamp3.txt')
    'aicamp3.png'
    >>> get_image_file(file_list, 'aicamp4.txt')
    'aicamp4.PNG'
    '''
    # Strip only the final extension so names containing extra dots still work
    base_name = os.path.splitext(text_file)[0]
    for extension in ('.jpg', '.JPG', '.png', '.PNG'):
        candidate = base_name + extension
        if candidate in file_list:
            # Return the name exactly as it appears in file_list; changing the
            # case would point to a file that may not exist on disk
            return candidate
    print('No matching image found for', text_file)
    return None
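
One detail to keep in mind when deriving a base name is that a filename can contain more than one dot. The sketch below compares two common ways of removing an extension; `cat.v2.txt` is a made-up filename chosen to show where they differ.

```python
import os

names = ['aicamp1.txt', 'cat.v2.txt']

by_split = [name.split('.')[0] for name in names]            # Text before the FIRST dot
by_splitext = [os.path.splitext(name)[0] for name in names]  # Text before the LAST dot

print(by_split)
print(by_splitext)
```

For simple names like `aicamp1.txt` both approaches agree, so the choice only matters if your labeled files use dots elsewhere in their names.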

### Logic for assigning data to train, valid, test
In the `assign_to_set` function below, we are going to use some logic that involves the Modulo Operator (`%`) to split the data. As a reminder, the Modulo Operator returns the remainder of a division (for example, `7 % 2` is `1`). Let's suppose, for example, that we simply wanted to evenly divide the contents of our `IMAGE_FOLDER` into two directories, A and B. Since we are evenly dividing the data, we want to assign 50% of the images to A and the other 50% to B. Note that this assignment is not random: it depends only on each image's index. This is where the Modulo Operator comes in! Even indexes have a remainder of 0 when divided by 2 and odd indexes have a remainder of 1, so we could do something like this:
~~~
if index % 2 == 0:
    shutil.copy(text_path, 'A')
    shutil.copy(image_path, 'A')
    a.write(image_path + '\n') # Here, a is the text file that saves the image paths for set A
else:
    shutil.copy(text_path, 'B')
    shutil.copy(image_path, 'B')
    b.write(image_path + '\n') # Here, b is the text file that saves the image paths for set B
~~~
Now, using this logic, it's your turn to write the `assign_to_set` function. Keep in mind, you will have three sets (train, valid, test) and they should not be evenly split. So you will have to get creative with that Modulo Operator! Ultimately, you have control over which ratio you use to split as long as it is reasonable, but we recommend the following ratio: **80% train, 10% valid, 10% test.**
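
Before copying any files, it can help to check that a modulo rule really produces the ratio you intend. The sketch below is one possible rule for an 80/10/10 split, not the only one: the helper `pick_set` is hypothetical, and which remainders you send to which set is an arbitrary choice.

```python
from collections import Counter

def pick_set(index):
    """Map an index to a set name using the remainder after dividing by 10."""
    remainder = index % 10
    if remainder == 0:
        return 'test'    # 1 of every 10 indexes
    if remainder == 5:
        return 'valid'   # 1 of every 10 indexes
    return 'train'       # The other 8 of every 10 indexes

# Count how 100 consecutive indexes are distributed across the sets
counts = Counter(pick_set(i) for i in range(100))
print(counts)
```

Out of 100 indexes, this rule sends 80 to train, 10 to valid, and 10 to test. A different ratio, such as 70/15/15, would need a different divisor, since 15% is not a whole number of remainders out of 10.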

In [24]:
def assign_to_set(text_path, image_path, index):
    '''
    Given an index, assign image to either train, validation, or test set.
    To assign an image to a set, we will need to do the following:
        -Make a copy of the text file in the set's directory
        -Make a copy of the image in the set's directory
        -Write the image path to the set's text file

    Parameters
    ----------
    text_path : string
        Path to the text file being assigned
    image_path : string
        Path to the image being assigned
    index : int
        Index of the image being assigned

    Returns
    -------
    None
    '''
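    # Note: test_file, valid_file, and train_file are the open file objects
    # created in Step 4, so this function only works after that cell has
    # opened them.
    if index % 10 == 0:  # Remainder 0: 1 of every 10 images goes to test
        shutil.copy(text_path, 'test_images')
        shutil.copy(image_path, 'test_images')
        test_file.write(image_path + '\n')
    elif index % 10 == 5:  # Remainder 5: 1 of every 10 images goes to valid
        shutil.copy(text_path, 'valid_images')
        shutil.copy(image_path, 'valid_images')
        valid_file.write(image_path + '\n')
    else:  # The other 8 of every 10 images go to train
        shutil.copy(text_path, 'train_images')
        shutil.copy(image_path, 'train_images')
        train_file.write(image_path + '\n')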

### Step 3: Test Functions
We want to make sure our functions produce the correct values, so we will test them with doctests. A doctest runs the examples written in a function's docstring (the lines starting with `>>>`) and compares each result to the expected value on the line below it. If you have written a function correctly, it will pass its doctests. If a function returns a value that doesn't match the expected value, it will fail the doctest. You should debug any failed doctests before moving on to the final step.

**Note**: The function `assign_to_set` doesn't have doctests, so have your instructor check it before moving on to the final step.

In [25]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=7)
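
To see what a failed doctest looks like, here is a small standalone sketch. The function `double` is made up and contains a deliberate bug. Instead of `doctest.testmod()`, the sketch uses `doctest.DocTestParser` and `doctest.DocTestRunner` to run the examples from a single docstring, which is roughly what `testmod()` does for every function it finds.

```python
import doctest

def double(x):
    '''
    >>> double(2)
    4
    >>> double(5)
    10
    '''
    return x + 2  # Deliberate bug: this should be x * 2

# Turn the docstring into a runnable doctest, then run it
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {'double': double}, 'double', None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)

print(results.failed, results.attempted)
```

The first example passes by coincidence (2 + 2 equals 2 * 2), while the second fails and prints a report showing the expected and actual values. This is why it helps to give a function more than one doctest.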

### Step 4: Putting It All Together
Congrats on reaching the final step in the notebook. If everything else has been done correctly, all you need to do for this step is run the cell below. After you run the cell, go check your current working directory. You should see your new train, validation, and test sets organized in their own directories!

**Important: We have provided the code for you here, but look it over to make sure you understand it. If you don't, make sure you ask your instructor.**

In [26]:
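# Make directories to store the images (and text files) for each set
# (exist_ok=True lets you re-run this cell without a FileExistsError)
os.makedirs('train_images', exist_ok=True)
os.makedirs('valid_images', exist_ok=True)
os.makedirs('test_images', exist_ok=True)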

# Make text files to save image paths for each set
train_file = open('./train_images/train.txt', 'w')
valid_file = open('./valid_images/valid.txt', 'w')
test_file = open('./test_images/test.txt', 'w')

# Get list with all the text files in IMAGE_FOLDER
file_list = os.listdir(IMAGE_FOLDER)
text_files = get_text_files(file_list)

index = 0
for text_file in text_files:
    image_file = get_image_file(file_list, text_file) # Get matching image for text file
    if image_file is None:
        continue # Skip if no matching image file found
    text_path = IMAGE_FOLDER + '/' + text_file
    image_path = IMAGE_FOLDER + '/' + image_file
    assign_to_set(text_path, image_path, index)
    index += 1

# Close train.txt, valid.txt, and test.txt
train_file.close()
print('Training data saved')
valid_file.close()
print('Validation data saved')
test_file.close()
print('Testing data saved')

Training data saved
Validation data saved
Testing data saved
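
After the cell finishes, you may want to confirm that each set received roughly the share of images you expected. The sketch below counts the image paths in a list file. The helper `count_paths` is hypothetical, and the temporary file with two made-up paths is a stand-in so the example runs anywhere.

```python
import os
import tempfile

def count_paths(list_file):
    """Count the non-empty lines (one image path per line) in a list file."""
    with open(list_file) as f:
        return sum(1 for line in f if line.strip())

# Stand-in list file; in this notebook you would instead pass
# './train_images/train.txt', './valid_images/valid.txt', or
# './test_images/test.txt'.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'train.txt')
with open(demo_file, 'w') as f:
    f.write('/data/img/a.jpg\n/data/img/b.jpg\n')

print(count_paths(demo_file))
```

Dividing each set's count by the total of all three gives the actual split, which you can compare against the ratio you chose.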
