> **Problem overview**

"Quick, Draw!" was released as an experimental game to educate the public in a playful way about how AI works. The game prompts users to draw an image depicting a certain category, such as ”banana,” “table,” etc. The game generated more than 1B drawings, of which a subset was publicly released as the basis for this competition’s training set. That subset contains 50M drawings encompassing 340 label categories.

Sounds fun, right? Here's the challenge: since the training data comes from the game itself, drawings can be incomplete or may not match the label. You’ll need to build a recognizer that can effectively learn from this noisy data and perform well on a manually-labeled test set from a different distribution.

Your task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition), and NLP (Natural Language Processing).

In [None]:
# import python standard library
import os

# import data manipulation library
import numpy as np
import pandas as pd

# import progress bar library
from tqdm import tqdm

> **Acquiring training and testing data**

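We start by loading the training data, one CSV file per class, into pandas DataFrames. Each class file is then split into `num_shuffles` chunks so that every chunk mixes drawings from all classes.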

In [None]:
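from typing import Optional

def csvload(file: str, nrows: Optional[int] = None) -> pd.DataFrame:
    """Load one class file from the training directory, reading at most `nrows` rows."""
    return pd.read_csv(os.path.join('../input/train_simplified/', file), nrows=nrows)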

In [None]:
# class files and class-name -> index dictionary
files = sorted(os.listdir('../input/train_simplified/'))
class_dict = {file[:-4].replace(" ", "_"): i for i, file in enumerate(files)}  # drop ".csv"

# number of shuffled chunks to split the training data into
num_shuffles = 100
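
As a quick illustration of how the class dictionary is built, here is a minimal sketch on three made-up file names (the list below is an assumption for demonstration, not the real directory listing). It uses `os.path.splitext` instead of slicing to drop the extension, which is a small variation on the cell above.

```python
import os

# Hypothetical file names standing in for os.listdir('../input/train_simplified/').
demo_files = sorted(['banana.csv', 'The Eiffel Tower.csv', 'airplane.csv'])

# Strip the ".csv" extension, replace spaces with underscores, and number the classes.
demo_class_dict = {
    os.path.splitext(name)[0].replace(' ', '_'): idx
    for idx, name in enumerate(demo_files)
}

print(demo_class_dict)
```

Note that `sorted` orders uppercase letters before lowercase ones, so a name starting with a capital letter receives a lower index than all lowercase names.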

In [None]:
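# split each class file into num_shuffles chunks keyed on key_id
for i, file in enumerate(tqdm(files)):
    df_data = csvload(file, nrows=30000)
    # digits of key_id act as a pseudo-random chunk index in [0, num_shuffles)
    df_data['shuffle'] = (df_data['key_id'] // 10 ** 7) % num_shuffles
    for k in range(num_shuffles):
        df_chunk = df_data[df_data['shuffle'] == k]
        if i == 0:
            # the first class file creates the chunk and writes the header
            df_chunk.to_csv('train_k%d.csv' % k, index=False)
        else:
            # later class files append without repeating the header
            df_chunk.to_csv('train_k%d.csv' % k, header=False, index=False, mode='a')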
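
To see what the `shuffle` column does, the following sketch applies the same formula to synthetic ids. The random 16-digit integers are an assumption standing in for the real `key_id` values; the point is only that the formula maps each row to one of `num_shuffles` chunks of roughly equal size.

```python
import numpy as np
import pandas as pd

num_shuffles = 100

# Synthetic ids (assumption): random 16-digit integers in place of the real key_id column.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    {'key_id': rng.integers(10 ** 15, 10 ** 16, size=100_000, dtype=np.int64)}
)

# Same formula as in the notebook: drop the last 7 digits, keep the remainder mod num_shuffles.
df_demo['shuffle'] = (df_demo['key_id'] // 10 ** 7) % num_shuffles

# Number of rows that land in each chunk.
bucket_sizes = df_demo['shuffle'].value_counts().sort_index()
print(bucket_sizes.describe())
```

Because the chunk index depends only on `key_id`, the assignment is deterministic: rerunning the split sends every drawing to the same chunk.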

In [None]:
# shuffle and compress file
for k in tqdm(range(num_shuffles)):
    df_data = pd.read_csv('train_k%d.csv' %k)
    print(df_data.shape)
    df_data['rand'] = np.random.rand(df_data.shape[0])
    df_data = df_data.sort_values(['rand']).drop(['rand'], axis=1)
    df_data.to_csv('train_k%d.csv.gz' %k, index=False, compression='gzip')
    
    # remove the uncompressed chunk to free disk space
    os.remove('train_k%d.csv' %k)
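
The shuffle-and-compress step can also be sketched on a toy frame. This version permutes rows with `DataFrame.sample(frac=1)` instead of sorting on a temporary random column, which is an alternative to the approach above rather than the notebook's own method. The toy data and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy chunk (assumption): ten rows with a key_id and a constant chunk index.
df_demo = pd.DataFrame({'key_id': np.arange(10), 'shuffle': 0})

# Permute all rows in one call; random_state makes the order reproducible.
df_shuffled = df_demo.sample(frac=1, random_state=42).reset_index(drop=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_k0.csv.gz')
    # Write the shuffled chunk gzip-compressed, then read it back.
    df_shuffled.to_csv(path, index=False, compression='gzip')
    df_restored = pd.read_csv(path)  # compression is inferred from the .gz suffix

print(df_restored.shape)
```

Reading a chunk back requires no extra arguments, since `pd.read_csv` infers gzip compression from the file extension by default.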