# QuickDraw Data

If machine learning is rocket science, then data is your fuel! So before
doing anything we will have a close look at the data available and spend
some time bringing it into the "right" form (i.e.
[tf.train.Example](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/train/Example)).

That's why we start by spending quite a lot of time on this notebook, downloading
the data, understanding it, and transforming it into the right format for
TensorFlow.

The data used in this workshop is taken from Google's quickdraw (click on
the images to see loads of examples):

https://quickdraw.withgoogle.com/data

We will download the data below.

## Init

First, we'll choose where our data should be stored.

If you choose a path under **"/content/gdrive/My Drive"** then data will be stored in your Google Drive and persisted across VM starts (preferable).

In [None]:
data_path = '/content/gdrive/My Drive/amld_data'
# Alternatively, you can also store the data in a local directory. This method
# will also work when running the notebook in Jupyter instead of Colab.
# data_path = './amld_data'

In [None]:
if data_path.startswith('/content/gdrive/'):
  from google.colab import drive
  assert data_path.startswith('/content/gdrive/My Drive/'), \
         'Google Drive paths must start with "/content/gdrive/My Drive/"!'
  drive.mount('/content/gdrive')

if data_path.startswith('gs://'):
  from google.colab import auth
  auth.authenticate_user()

In [None]:
import base64, collections, io, itertools, functools, json, os, random, re, textwrap, time, urllib
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot
from PIL import Image, ImageDraw
from IPython import display
from six.moves.urllib import request
from xml.dom import minidom

# Always make sure you are running the expected version.
# There are considerable differences between versions...
# Tested with 1.12.0
tf.__version__

## Get the data

In this section we download a set of raw data files from the web.

In [None]:
# Retrieve list of categories.

def list_bucket(bucket, regexp='.*'):
    """Returns a (filtered) list of Keys in specified GCS bucket."""
    keys = []
    fh = request.urlopen('https://storage.googleapis.com/%s' % bucket)
    content = minidom.parseString(fh.read())
    for e in content.getElementsByTagName('Contents'):
        key = e.getElementsByTagName('Key')[0].firstChild.data
        if re.match(regexp, key):
            keys.append(key)
    return keys

all_ndjsons = list_bucket('quickdraw_dataset', '.*ndjson$')
print('available: (%d)' % len(all_ndjsons))
print('\n'.join(textwrap.wrap(
    '|'.join([key.split('/')[-1].split('.')[0] for key in all_ndjsons]),
    width=100)))

In [None]:
# Mini group of two animals.
pets = ['cat', 'dog']

# Somewhat larger group of zoo animals.
zoo = ['camel', 'crocodile', 'dolphin', 'elephant', 'flamingo', 'giraffe',
       'kangaroo', 'lion', 'monkey', 'penguin', 'rhinoceros']

# Even larger group of all animals.
animals = ['ant', 'bat', 'bear', 'bee', 'bird', 'butterfly', 'camel', 'cat',
           'cow', 'crab', 'crocodile', 'dog', 'dolphin', 'dragon', 'duck',
           'elephant', 'fish', 'flamingo', 'frog', 'giraffe', 'hedgehog',
           'horse', 'kangaroo', 'lion', 'lobster', 'monkey', 'mosquito',
           'mouse', 'octopus', 'owl', 'panda', 'parrot', 'penguin', 'pig',
           'rabbit', 'raccoon', 'rhinoceros', 'scorpion', 'sea turtle', 'shark',
           'sheep', 'snail', 'snake', 'spider', 'squirrel', 'swan']

Create your own group -- the more categories you include the more challenging the classification task will be...

In [None]:
# YOUR ACTION REQUIRED:
# Choose one of the groups above for the remainder of the workshop.
# Note: This will result in ~100MB of download per class.
# The "dataset_name" will be used to construct directories containing the data.
labels, dataset_name = zoo, 'zoo'

In [None]:
# Download the group chosen above.

def valid_ndjson(filename):
  """Checks presence + completeness of .ndjson file."""
  try:
    json.loads(tf.gfile.Open(filename).readlines()[-1])
    return True
  except (ValueError, IOError, IndexError):  # IndexError: file is empty.
    return False

def retrieve(bucket, key, filename):
  """Downloads a file specified by its Key from a GCS bucket to "filename"."""
  url = 'https://storage.googleapis.com/%s/%s' % (
    bucket, urllib.parse.quote(key))
  print('\n' + url)
  if not tf.gfile.Exists(filename):
    with tf.gfile.Open(filename, 'w') as f:
      f.write(request.urlopen(url).read())
  while not valid_ndjson(filename):
    print('*** Corrupted download (%.2f MB), retrying...' % (
        tf.gfile.Stat(filename).length / 2.**20))
    with tf.gfile.Open(filename, 'w') as f:
      f.write(request.urlopen(url).read())

tf.gfile.MakeDirs(data_path)

print('\n%d labels:' % len(labels))

for name in labels:
  print(name, end=' ')
  dst = '%s/%s.ndjson' % (data_path, name)
  retrieve('quickdraw_dataset', 'full/simplified/%s.ndjson' % name, dst)
  print('%.2f MB' % (tf.gfile.Stat(dst).length / 2.**20))

print('\nDONE :)')

## Inspect the data

Let's find out what the format of the downloaded files is.

First, we are going to enumerate them.

In [None]:
print('\n'.join([
    '%6.1fM : %s' % (tf.gfile.Stat(path).length/1024**2, path)
    for path in tf.gfile.Glob('{}/*.ndjson'.format(data_path))
]))

Let's further explore what the `NDJSON` file format is.

In [None]:
path = sorted(tf.gfile.Glob(os.path.join(data_path, '*.ndjson')))[0]
print(tf.gfile.Open(path).read()[:1000] + '...')

As we can see, it's a format that contains one JSON dictionary per line.
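
To make the format concrete, here is a minimal sketch that parses a small NDJSON string using only the standard library. The two records and their values are made up for illustration; only the field names `word`, `recognized`, and `drawing` mirror those found in the downloaded files. The helper name `parse_ndjson` is an assumption of this sketch.

```python
import io
import json

# Two made-up records, one JSON dictionary per line (values are illustrative).
ndjson_text = (
    '{"word": "cat", "recognized": true, "drawing": [[[0, 10], [5, 5]]]}\n'
    '{"word": "cat", "recognized": false, "drawing": [[[1, 2, 3], [4, 5, 6]]]}\n'
)

def parse_ndjson(fileobj):
    """Yields one dictionary per non-empty line."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(parse_ndjson(io.StringIO(ndjson_text)))
print(len(records), [r['recognized'] for r in records])
```

Because every line is a complete JSON document, a file can be processed one record at a time without loading it into memory as a whole.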

Let's parse one single line.

In [None]:
data_json = json.loads(tf.gfile.Open(path).readline())
data_json.keys()

In [None]:
# So we have some meta information...
for k, v in data_json.items():
  if k != 'drawing':
    print('%20s   ->   %s' % (k, v))

In [None]:
# ...and the actual drawing.
drawing = data_json['drawing']
# The drawing consists of a series of strokes:
print('Shapes:', [np.array(stroke).shape for stroke in drawing])
print('Example stroke:', drawing[0])

In [None]:
# Draw the image -- the strokes all have shape (2, n)
# so the first index seems to be x/y coordinate:
for stroke in drawing:
  # Each array has X coordinates at [0, :] and Y coordinates at [1, :].
  pyplot.plot(np.array(stroke[0]), -np.array(stroke[1]))
# Would YOU recognize this drawing successfully?

In [None]:
# Some more code to load many sketches at once.
# Let's ignore the difficult "unrecognized" sketches for now...
# (i.e. unrecognized by the official quickdraw classifier)

def convert(line):
    """Parses a single JSON line and converts 'drawing' to list of np.array."""
    d = json.loads(line)
    d['drawing'] = [np.array(stroke) for stroke in d['drawing']]
    return d

def loaditer(name, unrecognized=False):
  """Returns iterable of drawings in specified file.

  Args:
    name: Name of the downloaded object (e.g. "elephant").
    unrecognized: Whether to include drawings that were not recognized
        by Google AI (i.e. the hard ones).
  """
  for line in tf.gfile.Open('%s/%s.ndjson' % (data_path, name)):
    d = convert(line)
    if d['recognized'] or unrecognized:
      yield d

def loadn(name, n, unrecognized=False):
  """Returns list of drawings.

  Args:
    name: Name of the downloaded object (e.g. "elephant").
    n: Number of drawings to load.
    unrecognized: Whether to include drawings that were not recognized
        by Google AI (i.e. the hard ones).
  """
  it = loaditer(name, unrecognized=unrecognized)
  return list(itertools.islice(it, 0, n))

n = 100
print('Loading {} instances of "{}"...'.format(n, labels[0]), end='')
sample = loadn(labels[0], n)
print('done.')

In [None]:
# Some more drawings...
rows, cols = 3, 3
pyplot.figure(figsize=(3*cols, 3*rows))
for y in range(rows):
  for x in range(cols):
    i = y * cols + x
    pyplot.subplot(rows, cols, i + 1)
    for stroke in sample[i]['drawing']:
      pyplot.plot(np.array(stroke[0]), -np.array(stroke[1]))

## Rasterize

Idea: After converting the raw drawing data into rasterized images, we can
use [MNIST](https://www.tensorflow.org/get_started/mnist/beginners)-like
image processing to classify the drawings.
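
The idea can be sketched without any imaging library. The following is a minimal pure-Python illustration, not the PIL-based implementation used in this notebook: it scales stroke coordinates to a small pixel grid and draws one-pixel-wide segments with Bresenham's line algorithm. The function names `draw_line` and `rasterize`, the grid size, and the example strokes are assumptions made for this sketch.

```python
def draw_line(grid, x0, y0, x1, y1):
    """Sets the pixels on the segment (x0, y0)-(x1, y1) to 1 (Bresenham)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        grid[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def rasterize(strokes, size=8):
    """Rasterizes strokes (each [xs, ys]) into a size x size grid of 0/1."""
    xs = [x for stroke in strokes for x in stroke[0]]
    ys = [y for stroke in strokes for y in stroke[1]]
    # Use one common scale for both axes to keep the aspect ratio.
    span = max(max(xs) - min(xs), max(ys) - min(ys), 1)
    scale = (size - 1) / span
    grid = [[0] * size for _ in range(size)]
    for stroke_xs, stroke_ys in strokes:
        points = [(int(round((x - min(xs)) * scale)),
                   int(round((y - min(ys)) * scale)))
                  for x, y in zip(stroke_xs, stroke_ys)]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            draw_line(grid, x0, y0, x1, y1)
    return grid

# A made-up drawing: one diagonal stroke and one horizontal stroke.
strokes = [[[0, 255], [0, 255]], [[0, 255], [255, 255]]]
grid = rasterize(strokes)
for row in grid:
    print(''.join('#' if pixel else '.' for pixel in row))
```

The resulting grid is a plain 2D array of pixel values, which is the kind of input an MNIST-style image classifier expects.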

In [None]:
def dict_to_img(drawing, img_sz=64, lw=3, maximize=True):
  """Converts QuickDraw data to a square rasterized image.

  Args:
    drawing: Dictionary instance of QuickDraw dataset.
    img_sz: Size of the output image (in pixels).
    lw: Line width (in pixels).
    maximize: Whether to maximize drawing within image pixels.
    
  Returns:
    A PIL.Image with the rasterized drawing.
  """
  img = Image.new('L', (img_sz, img_sz))
  draw = ImageDraw.Draw(img)
  lines = np.array([
      stroke[0:2, i:i+2]
      for stroke in drawing['drawing']
      for i in range(stroke.shape[1] - 1)
  ], dtype=np.float32)
  if maximize:
    for i in range(2):
      min_, max_ = lines[:,i,:].min() * 0.95, lines[:,i,:].max() * 1.05
      lines[:,i,:] = (lines[:,i,:] - min_) / max(max_ - min_, 1)
  else:
    lines /= 1024
  for line in lines:
    draw.line(tuple(line.T.reshape((-1,)) * img_sz), fill='white', width=lw)
  return img

In [None]:
# Show some examples.

def showimg(img):
  """Shows an image with an inline HTML <img> tag.
  
  Args:
    img: Can be a PIL.Image or a numpy.ndarray.
  """
  if isinstance(img, np.ndarray):
    img = Image.fromarray(img, 'L')
  b = io.BytesIO()
  img.convert('RGB').save(b, format='png')
  enc = base64.b64encode(b.getvalue()).decode('utf-8')
  display.display(display.HTML(
      '<img src="data:image/png;base64,%s">' % enc))

# Fetch some images + shuffle order.
rows, cols = len(labels), 10
n_per_class = rows * cols // len(labels) + 1
drawings_list = [drawing for name in labels
                 for drawing in loadn(name, cols)]

# Create mosaic of rendered images.
lw = 4
img_sz = 64
tableau = np.zeros((img_sz * rows, img_sz * cols), dtype=np.uint8)
for y in range(rows):
  for x in range(cols):
    i = y * cols + x
    img = dict_to_img(drawings_list[i], img_sz=img_sz, lw=lw, maximize=True)
    tableau[y*img_sz:(y+1)*img_sz,
            x*img_sz:(x+1)*img_sz] = np.asarray(img)

showimg(tableau)
print('{} samples of : {}'.format(cols, ' '.join(labels)))

## Protobufs and tf.train.Example

TensorFlow's "native" format for data storage is the `tf.train.Example`
[protocol buffer](https://en.wikipedia.org/wiki/Protocol_Buffers).

In this section we briefly explore the API needed to access the data
inside the `tf.train.Example` protocol buffer. It's **not necessary** to read
through the
[Protocol Buffer Basics: Python - documentation](https://developers.google.com/protocol-buffers/docs/pythontutorial).

In [None]:
# Create a new (empty) instance.
example = tf.train.Example()
# (empty example will print nothing)
print(example)
# An example contains a map from feature name to "Feature".
# Every "Feature" contains a list of elements of the same
# type, which is one of:
# - bytes_list (similar to Python's "bytes")
# - float_list (float number)
# - int64_list (integer number)

# These values can be accessed as follows (no need to understand
# details):

# Add float value "3.1416" to feature "magic_numbers"
example.features.feature['magic_numbers'].float_list.value.append(3.1416)
# Add some more values to the float list "magic_numbers".
example.features.feature['magic_numbers'].float_list.value.extend([2.7183, 1.4142, 1.6180])

### YOUR ACTION REQUIRED:
# Create a second feature named "adversaries" and add the elements
# b'Alice' and b'Bob'.
example.features.feature['adversaries'].

# This will now print a serialized representation of our protocol buffer
# with features "magic_numbers" and "adversaries" set...
print(example)

# .. et voila : that's all you need to know about protocol buffers
# for this workshop.

## Create datasets

Now let's create a "dataset" of `tf.train.Example`
[protocol buffers](https://developers.google.com/protocol-buffers/) ("protos").

A single example will contain all the information we want to use for training on a drawing (i.e. the rasterized
image, the label, and maybe other information).

A dataset consists of non-overlapping sets of examples that will be used for
training and evaluation of the classifier (the "test" set will be used for the
final evaluation). Because these files can quickly become very large, we
"shard" them into multiple smaller files of equal size.
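
As a rough illustration of the paragraph above, the sketch below assigns example indices to splits in a fixed ratio and then distributes each split round-robin over its shard files. It uses only the standard library. The helper names `split_for` and `shard_plan` and the tiny numbers are assumptions chosen for this sketch; it is not the helper code of this notebook.

```python
import collections

def split_for(i, splits):
    """Returns the split name for example index i, cycling through the ratios."""
    i %= sum(splits.values())
    for name in sorted(splits):
        if i < splits[name]:
            return name
        i -= splits[name]

def shard_plan(num_examples, splits, shards):
    """Maps each shard filename to the list of example indices it receives."""
    plan = collections.defaultdict(list)
    written = collections.Counter()
    for i in range(num_examples):
        split = split_for(i, splits)
        shard = written[split] % shards  # Round-robin within the split.
        plan['%s-%05d-of-%05d' % (split, shard, shards)].append(i)
        written[split] += 1
    return dict(plan)

plan = shard_plan(num_examples=16, splits=dict(train=5, eval=1, test=2), shards=2)
for filename in sorted(plan):
    print(filename, plan[filename])
```

Every index ends up in exactly one file, so the splits do not overlap, and the shards of one split differ in size by at most one example.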

In [None]:
# Let's first check how many [recognized=True] examples we have in each class.
for name in labels:
    print(name, len(list(tf.gfile.Open('%s/%s.ndjson' % (data_path, name)))), 'recognized', len(list(loaditer(name))))

In [None]:
# Helper code to create sharded recordio files.
# (No need to read through this.)

# Well... Since you continue to read through this cell, I could as
# well explain in more detail what it is about :-)
# Because we work with large amounts of data, we will create "sharded"
# files, that is, we split a single dataset into a number of files, like
# train-00000-of-00005, ..., train-00004-of-00005 (if we're using 5 shards).
# This way we have smaller individual files, and we can also easily access
# e.g. 20% of all data, or have 5 threads reading through the data
# simultaneously. With large datasets, try to shard data into individual files
# ~ 100 MB.

# The code in this cell simply takes a list of iterators and then
# randomly distributes the values returned by these iterators into sharded
# datasets (e.g. a train/eval/test split).

def rand_key(counts):
  """Returns a random key from "counts", using values as distribution."""
  # randint() includes both endpoints, so draw from [1, total] to stay unbiased.
  r = random.randint(1, sum(counts.values()))
  for key, count in counts.items():
    if r > count or count == 0:
      r -= count
    else:
      counts[key] -= 1
      return key

def get_split(i, splits):
  """Returns key from "splits" for iteration "i"."""
  i %= sum(splits.values())
  for split in sorted(splits):
    if i < splits[split]:
      return split
    i -= splits[split]

def make_counts(labels, total):
  """Generates counts for "labels" totaling "total"."""
  counts = {}
  for i, name in enumerate(labels):
    counts[name] = total // (len(labels) - i)
    total -= counts[name]
  return counts

def example_to_dict(example):
  """Converts a tf.train.Example to a dictionary."""
  example_dict = {}
  for name, value in example.features.feature.items():
    if value.HasField('bytes_list'):
      value = value.bytes_list.value
    elif value.HasField('int64_list'):
      value = value.int64_list.value
    elif value.HasField('float_list'):
      value = value.float_list.value
    else:
      raise ValueError('Unknown *_list type!')
    if len(value) == 1:
      example_dict[name] = value[0]
    else:
      example_dict[name] = np.array(value)
  return example_dict

def make_sharded_files(make_example, path, labels, iters, counts, splits,
                       shards=10, overwrite=False, report_dt=10, make_df=False):
  """Create sharded dataset from "iters".

  Args:
    make_example: Converts object returned by elements of "iters"
        to tf.train.Example() proto.
    path: Directory that will contain recordio files.
    labels: Names of labels, will be written to "labels.txt".
    iters: List of iterables returning drawing objects.
    counts: Dictionary mapping class to number of examples.
    splits: Dictionary mapping filename to multiple of examples. For example,
        splits=dict(a=2, b=1) will result in two examples being written to "a"
        for every example being written to "b".
    shards: Number of files to be created per split.
    overwrite: Whether a pre-existing directory should be overwritten.
    report_dt: Number of seconds between status updates (0=no updates).
    make_df: Also write data as pandas.DataFrame - do NOT use this with very
        large datasets that don't fit in memory!

  Returns:
    Total number of examples written to disk per split.
  """
  assert len(iters) == len(labels)
  # Prepare output.
  if not os.path.exists(path):
    os.makedirs(path)
  paths = {
      split: ['%s/%s-%05d-of-%05d' % (path, split, i, shards)
              for i in range(shards)]
      for split in splits
  }
  assert overwrite or not os.path.exists(list(paths.values())[0][0])
  writers = {
      split: [tf.python_io.TFRecordWriter(ps[i]) for i in range(shards)]
      for split, ps in paths.items()
  }
  t0 = time.time()
  examples_per_split = collections.defaultdict(int)
  i, n = 0, sum(counts.values())
  counts = dict(**counts)
  rows = []
  # Create examples.
  while sum(counts.values()):
    name = rand_key(counts)
    split = get_split(i, splits)
    writer = writers[split][examples_per_split[split] % shards]
    label = labels.index(name)
    example = make_example(label, next(iters[label]))
    writer.write(example.SerializeToString())
    if make_df:
      example.features.feature['split'].bytes_list.value.append(split.encode('utf8'))
      rows.append(example_to_dict(example))
    examples_per_split[split] += 1
    i += 1
    if report_dt > 0 and time.time() - t0 > report_dt:
      print('processed %d/%d (%.2f%%)' % (i, n, 100. * i / n))
      t0 = time.time()
  # Store results.
  for split in splits:
    for writer in writers[split]:
      writer.close()
  with open('%s/labels.txt' % path, 'w') as f:
    f.write('\n'.join(labels))
  with open('%s/counts.json' % path, 'w') as f:
    json.dump(examples_per_split, f)
  if make_df:
    df_path = '%s/dataframe.pkl' % path
    print('Writing %s...' % df_path)
    pd.DataFrame(rows).to_pickle(df_path)
  return dict(**examples_per_split)

### Create IMG dataset

In [None]:
# Uses dict_to_img() from previous cell to create raster image.

def make_example_img(label, drawing):
  """Converts QuickDraw dictionary to example with rasterized data.

  Args:
    label: Numerical representation of the label (e.g. "0" for labels[0]).
    drawing: Dictionary with QuickDraw data.

  Returns:
    A tf.train.Example protocol buffer (with "label", "img_64", and additional
    metadata features).
  """
  example = tf.train.Example()
  example.features.feature['label'].int64_list.value.append(label)
  img_64 = np.asarray(dict_to_img(drawing, img_sz=64, lw=4, maximize=True)).reshape(-1)
  example.features.feature['img_64'].int64_list.value.extend(img_64)
  example.features.feature['countrycode'].bytes_list.value.append(drawing['countrycode'].encode())
  example.features.feature['recognized'].int64_list.value.append(drawing['recognized'])
  example.features.feature['word'].bytes_list.value.append(drawing['word'].encode())
  ts = drawing['timestamp']
  ts = time.mktime(time.strptime(ts[:ts.index('.')], '%Y-%m-%d %H:%M:%S'))
  example.features.feature['timestamp'].int64_list.value.append(int(ts))
  example.features.feature['key_id'].int64_list.value.append(int(drawing['key_id']))
  return example

In [None]:
# Create the (rasterized) dataset.

path = '%s/%s_img' % (data_path, dataset_name)
t0 = time.time()
examples_per_split = make_sharded_files(
    make_example=make_example_img,
    path=path,
    labels=labels,
    iters=[loaditer(name) for name in labels],
    # Creating 50k train, 10k eval, 20k test examples. Takes ~2min
    # Note : Larger datasets take longer to generate and to train on, but
    #        also lead to better classification results.
    counts=make_counts(labels, 80000),
    splits=dict(train=5, eval=1, test=2),
    overwrite=True,
    # Note : Set this to False when generating large datasets...
    make_df=True,
)

### If you don't see the final output below, it's probably because your VM
### has run out of memory and crashed !! This can happen with make_df=True ...

print('stored data to "%s"' % path)
print('generated %s examples in %d seconds' % (examples_per_split, time.time() - t0))

### Create STROKE dataset

This section creates another dataset of example protos that contain the raw
stroke data, suitable for usage with a recurrent neural network.
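
To illustrate the kind of representation this section produces, here is a minimal pure-Python sketch that flattens a list of strokes into per-point offsets (dx, dy) plus a flag marking the first point of each new stroke. It omits the normalization step, and the helper name `strokes_to_deltas` and the coordinates are made up for this sketch; the NumPy-based `dict_to_stroke()` is the implementation this notebook actually uses.

```python
def strokes_to_deltas(strokes):
    """Converts strokes (each [xs, ys]) to a list of (dx, dy, new_stroke)."""
    points = []  # Flat list of (x, y, new_stroke) over all strokes.
    for n, (xs, ys) in enumerate(strokes):
        for j, (x, y) in enumerate(zip(xs, ys)):
            # Flag the first point of every stroke except the very first one.
            points.append((x, y, 1 if n > 0 and j == 0 else 0))
    # Offsets between consecutive points; the flag belongs to the end point.
    return [(x1 - x0, y1 - y0, flag)
            for (x0, y0, _), (x1, y1, flag) in zip(points, points[1:])]

# Two made-up strokes: three points, then two points.
strokes = [[[0, 2, 4], [0, 1, 1]], [[10, 10], [5, 8]]]
deltas = strokes_to_deltas(strokes)
print(deltas)
```

A sequence model can consume such a list step by step, and a cumulative sum over the offsets recovers the original shape up to its starting point.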

In [None]:
# Convert stroke coordinates into normalized relative coordinates,
# one single list, and add a "third dimension" that indicates when
# a new stroke starts.

def dict_to_stroke(d):
  norm = lambda x: (x - x.min()) / max(1, (x.max() - x.min()))
  xy = np.concatenate([np.array(s, dtype=np.float32) for s in d['drawing']], axis=1)
  z = np.zeros(xy.shape[1])
  if len(d['drawing']) > 1:
    z[np.cumsum(np.array(list(map(lambda x: x.shape[1], d['drawing'][:-1]))))] = 1
  dxy = np.diff(norm(xy))
  return np.concatenate([dxy, z.reshape((1, -1))[:, 1:]])

In [None]:
# Visualize / control output of dict_to_stroke().

stroke = dict_to_stroke(sample[0])
# First 2 dimensions are normalized dx/dy coordinates
# third dimension indicates "new stroke".
xy = stroke[:2, :].cumsum(axis=1)
pyplot.plot(xy[0,:], -xy[1,:])
pxy = xy[:, stroke[2] != 0]
# Indicate "new stroke" with a red circle.
pyplot.plot(pxy[0], -pxy[1], 'ro');

In [None]:
# Uses dict_to_stroke() from previous cell to create stroke data.

def make_example_stroke(label, drawing):
  """Converts QuickDraw dictionary to example with stroke data.

  Args:
    label: Numerical representation of the label (e.g. "0" for labels[0]).
    drawing: Dictionary with QuickDraw data.

  Returns:
    A tf.train.Example protocol buffer (with "label", "stroke_x", "stroke_y",
    "stroke_z", and additional metadata features).
  """
  example = tf.train.Example()
  example.features.feature['label'].int64_list.value.append(label)
  stroke = dict_to_stroke(drawing)
  example.features.feature['stroke_x'].float_list.value.extend(stroke[0, :])
  example.features.feature['stroke_y'].float_list.value.extend(stroke[1, :])
  example.features.feature['stroke_z'].float_list.value.extend(stroke[2, :])
  example.features.feature['stroke_len'].int64_list.value.append(stroke.shape[1])
  example.features.feature['countrycode'].bytes_list.value.append(drawing['countrycode'].encode())
  example.features.feature['recognized'].int64_list.value.append(drawing['recognized'])
  example.features.feature['word'].bytes_list.value.append(drawing['word'].encode())
  ts = drawing['timestamp']
  ts = time.mktime(time.strptime(ts[:ts.index('.')], '%Y-%m-%d %H:%M:%S'))
  example.features.feature['timestamp'].int64_list.value.append(int(ts))
  example.features.feature['key_id'].int64_list.value.append(int(drawing['key_id']))
  return example

In [None]:
path = '%s/%s_stroke' % (data_path, dataset_name)
t0 = time.time()
examples_per_split = make_sharded_files(
    make_example=make_example_stroke,
    path=path,
    labels=labels,
    iters=[loaditer(name) for name in labels],
    # Creating 50k train, 10k eval, 20k test examples. Takes ~2min
    # Note : You can improve 
    counts=make_counts(labels, 80000),
    splits=dict(train=5, eval=1, test=2),
    overwrite=True,
    # Note : Set this to False when generating large datasets...
    make_df=True,
)

print('stored data to "%s"' % path)
print('generated %s examples in %d seconds' % (examples_per_split, time.time() - t0))

# ----- Optional part -----

## Inspect data

In [None]:
# YOUR ACTION REQUIRED:
# Check out the files generated in $data_path

# Note that you can also inspect the files in http://drive.google.com if you
# used Drive as the destination.


In [None]:
# Let's look at a single file of the sharded dataset.
tf_record_path = '{}/{}_img/eval-00000-of-00010'.format(data_path, dataset_name)
# YOUR ACTION REQUIRED:
# Use tf.python_io.tf_record_iterator() to read a single record from the file
# and assign it to the variable "record".
# What data type does this record have?
#record = ...
#record


**Note**: 
The `tf.python_io` module should only be used for data processing in pure Python. For machine learning applications you should instead use the `tf.data.Dataset` interface (see `2_keras.ipynb`). This has the advantage that the underlying file reading and protobuf parsing operations can be translated into TensorFlow Ops and implemented efficiently without passing the data through the Python kernel.



In [None]:
# Check out the features. They should correspond to what we generated in
# make_example_img() above.
example = tf.train.Example()
example.ParseFromString(record)
print(list(example.features.feature.keys()))

In [None]:
# YOUR ACTION REQUIRED:
# Extract the label and the image data from the example protobuf.
# (use above section "tf.train.Example" for reference).
label_int =
img_64 = 


In [None]:
# Visualize the image:
print(labels[label_int])
pyplot.matshow(np.array(img_64).reshape((64, 64)));

In [None]:
# YOUR ACTION REQUIRED:
# Check that we have an equal distribution of labels in the training files.


## More on protobufs

In [None]:
# If we want to create our own protocol buffers, we first need to install
# some programs...
!apt-get -y install protobuf-compiler python-pil python-lxml

In [None]:
# Step 1 : Write a proto file that describes our data format.
# YOUR ACTION REQUIRED: Complete the definition of the "Person" message (you
# can use the slide for inspiration).
with open('person.proto', 'w') as f:
  f.write('''
      syntax = "proto2";
  ''')


In [None]:
# Step 2 : Compile proto definition to a Python file.
!protoc --python_out=. person.proto
!ls -lh

In [None]:
# Step 3 : Import code from generated Python file.
from person_pb2 import Person

In [None]:
person = Person()
person.name = 'John Doe'
person.email = 'john.doe@gmail.com'
person.lucky_numbers.extend([13, 99])
person.SerializeToString()

In [None]:
# YOUR ACTION REQUIRED:
# Compare the size of the serialized person structure in proto format
# vs. JSON encoded (you can use Python's json.dumps() and list members
# manually, or import google.protobuf.json_format).

# Which format is more efficient? Why?
# Which format is easier to use?
# Which format is more versatile?
