# Notebook to generate fold text files for the mturk task

This notebook generates fold.txt files from the elements in the canva_scraping2 dataset.

Each fold.txt will contain 5-10 image links, one per line. The links point to Dropbox, where the canva_scraping2 dataset is hosted.

Folds can mix multiple classes, so that no fold contains only book-covers or only web-ads.

Example structure of a fold.txt file: 

https://www.dropbox.com/s/gp9snrmlc54d3vo/certificates_1_18_MACS_ooy31s.png?raw=1
https://www.dropbox.com/s/c32nwr8y37sw3or/certificates_1_10_MACTFpwVtmk.png?raw=1

...




In [31]:
# Imports
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd

In [23]:
# Constants
USE_FULL_DATASET = True
CLASSES_TO_USE = None
WITH_TEXT = True
FILES_PER_FOLD = 10
DATASET_PATH = '../canva_scraping2'
FOLD_OUTPUT_DIR = './files'

## Dataset stats

In [24]:
# Get statistics on canva_scraping2

# Get folder names
canva_scraping2_folders = [f for f in os.listdir(DATASET_PATH) if '_' not in f and '.' not in f and os.path.isdir(os.path.join(DATASET_PATH,f))]
print('Data folders in canva_scraping2:', canva_scraping2_folders)

num_elems_in_full_dataset = 0

# Get number of elements in folders
for fol in canva_scraping2_folders:
    num_elems = len([p for p in os.listdir(os.path.join(DATASET_PATH, fol, 'png')) if p.endswith('png')])
    print('Number of elements in folder %s: %d' % (fol, num_elems))
    num_elems_in_full_dataset += num_elems
    
print('Num elems in full dataset:', num_elems_in_full_dataset)

Data folders in canva_scraping2: ['book-covers', 'cd-covers', 'certificates', 'coupons', 'flyers', 'infographics', 'magazine-covers', 'posters', 'social-graphics', 'web-ads']
Number of elements in folder book-covers: 799
Number of elements in folder cd-covers: 447
Number of elements in folder certificates: 270
Number of elements in folder coupons: 579
Number of elements in folder flyers: 1526
Number of elements in folder infographics: 282
Number of elements in folder magazine-covers: 868
Number of elements in folder posters: 2255
Number of elements in folder social-graphics: 1384
Number of elements in folder web-ads: 567
Num elems in full dataset: 8977
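
With the total above and `FILES_PER_FOLD = 10`, the number of fold files needed to cover the full dataset follows from a ceiling division. A minimal sketch of that arithmetic is shown here; the variable names `num_folds` and `last_fold_size` are illustrative and not part of the notebook.

```python
import math

num_elems_in_full_dataset = 8977  # total reported in the output above
files_per_fold = 10               # value of FILES_PER_FOLD

# Folds needed so that every element appears in exactly one fold
num_folds = math.ceil(num_elems_in_full_dataset / files_per_fold)

# Size of the final, possibly partial, fold
last_fold_size = num_elems_in_full_dataset % files_per_fold or files_per_fold

print(num_folds, last_fold_size)  # 898 7
```

Note that `generate_fold_files` defaults to `files_to_generate=2`, so a larger value would have to be passed to cover the whole dataset.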


## Generate fold files

In [29]:
## Helper functions
def get_all_design_names(data_path, classes_to_use):
    # Map each class name to the list of PNG file names in its 'png' subfolder
    names_dict = {}

    for cl in classes_to_use:
        names_dict[cl] = [p for p in os.listdir(os.path.join(data_path, cl, 'png')) if p.endswith('png')]

    return names_dict

## Generate fold files
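def generate_fold_files(data_path, output_dir, classes_to_use=None, files_per_fold=10, mix_classes=True, files_to_generate=2, verbose=True):
    # NOTE: mix_classes is accepted but currently unused; classes are always mixed.

    if not classes_to_use:
        # Default to every class folder found under data_path
        classes_to_use = [f for f in os.listdir(data_path) if '_' not in f and '.' not in f and os.path.isdir(os.path.join(data_path, f))]

    # Make sure the output directory exists before writing fold files
    os.makedirs(output_dir, exist_ok=True)

    num_folds_generated = 0
    unused_design_names = get_all_design_names(data_path, classes_to_use)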
    
    while unused_design_names and num_folds_generated<files_to_generate:
        fold_txt_name = 'fold'+str(num_folds_generated)+'.txt'
        fold_path = os.path.join(output_dir, fold_txt_name)
        
        generate_one_fold_file(fold_path, unused_design_names, files_per_fold)
        
        num_folds_generated +=1
        
        if verbose:
            print('Fold %s generated. %d files generated so far.' % (fold_txt_name, num_folds_generated))
        
    print('Done.')    
        
            
def generate_one_fold_file(fold_path, unused_design_names, files_per_fold):
    
    with open(fold_path, 'w') as f:
        for _ in range(files_per_fold):
            # Stop early once every class has been exhausted
            if not unused_design_names:
                break

            # Sample a class
            cl = np.random.choice(list(unused_design_names.keys()))
            print('chosen class:', cl)

            # Get a design name from that class
            design = np.random.choice(unused_design_names[cl])
            link = get_dropbox_link(design)

            # Add to fold
            f.write(link+'\n')

            # Remove that design from the list of usable designs
            unused_design_names[cl].remove(design)
            if not unused_design_names[cl]:
                del unused_design_names[cl]
            
    return unused_design_names


def get_dropbox_link(design):
    
    ## TODO: Finish this function
    return 'https://www.dropbox.com/s/'+str(design)
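
The example links at the top of this notebook follow the pattern `https://www.dropbox.com/s/<share id>/<file name>?raw=1`. One possible way to complete `get_dropbox_link` is sketched below. It assumes a lookup table from file name to Dropbox share ID is available; the `share_ids` dictionary and the `build_dropbox_link` name are hypothetical, and the notebook does not say where the share IDs come from.

```python
DROPBOX_BASE = 'https://www.dropbox.com/s/'


def build_dropbox_link(design, share_ids):
    """Build a raw Dropbox link from a {file name: share ID} mapping (hypothetical)."""
    share_id = share_ids[design]  # raises KeyError if the design has no known share ID
    return DROPBOX_BASE + share_id + '/' + design + '?raw=1'


# Share ID copied from the first example link at the top of the notebook
share_ids = {'certificates_1_18_MACS_ooy31s.png': 'gp9snrmlc54d3vo'}
example_link = build_dropbox_link('certificates_1_18_MACS_ooy31s.png', share_ids)
print(example_link)
```

Failing with a `KeyError` on an unknown design is a deliberate choice in this sketch, so that a fold file never silently contains a broken link.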

In [30]:
generate_fold_files(data_path=DATASET_PATH, 
                    output_dir=FOLD_OUTPUT_DIR, 
                    classes_to_use=None, #all classes will be used
                    files_per_fold = 10, 
                    mix_classes=True, 
                    files_to_generate=2)

chosen class: cd-covers
chosen class: cd-covers
chosen class: flyers
chosen class: posters
chosen class: book-covers
chosen class: magazine-covers
chosen class: certificates
chosen class: coupons
chosen class: coupons
chosen class: book-covers
Fold fold0.txt generated. 1 files generated so far.
chosen class: cd-covers
chosen class: web-ads
chosen class: certificates
chosen class: web-ads
chosen class: certificates
chosen class: flyers
chosen class: cd-covers
chosen class: magazine-covers
chosen class: certificates
chosen class: flyers
Fold fold1.txt generated. 2 files generated so far.
Done.
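
To check how well a fold mixes classes, the classes drawn for it can be tallied. The sketch below uses the ten classes printed above for fold0.txt; the `fold0_classes` list is copied by hand from that output rather than produced by the notebook.

```python
from collections import Counter

# Classes printed for fold0.txt in the output above
fold0_classes = ['cd-covers', 'cd-covers', 'flyers', 'posters', 'book-covers',
                 'magazine-covers', 'certificates', 'coupons', 'coupons', 'book-covers']

# Count how many designs each class contributed to the fold
class_counts = Counter(fold0_classes)
print(len(class_counts), 'distinct classes')
print(class_counts.most_common())
```

In this run fold0.txt draws from 7 of the 10 classes, and no class contributes more than two designs.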
