# Thingi10k EDA

In the thingi10k_metadata_retrieval notebook, we built an Thingi10k index. This index contains multiple points of metadata about each stl object as well as the filename of the stl object itself.

In this notebook, our goal is to learn more about the dataset. Specifically, we have these questions:

1. How big/small are the stl objects?
2. Will we need to normalize the coordinates?


In [None]:
import pandas as pd
# ask matplotlib to show figures in notebook
%pylab inline

In [None]:
import env
from data import THINGI10K_INDEX
df = pd.read_csv(THINGI10K_INDEX)

## Num Vertices

In [None]:
# https://stackoverflow.com/questions/40347689/dataframe-describe-suppress-scientific-notation
df.num_vertices.describe().apply(lambda x: format(x, 'f'))

In [None]:
_ = df.hist(column='num_vertices')

In [None]:
# what if we only look at the top 90%?

# https://stackoverflow.com/questions/18580461/eliminating-all-data-over-a-given-percentile
_ = df[df.num_vertices < df.num_vertices.quantile(.90)].hist(column='num_vertices')

In [None]:
# what if we only look at the top 80%?

# https://stackoverflow.com/questions/18580461/eliminating-all-data-over-a-given-percentile
_ = df[df.num_vertices < df.num_vertices.quantile(.80)].hist(column='num_vertices')

### Takeaways

* Count < 1000: the Thingi10k dataset comes with a handful of .ply and .obj files; we ignore those.
* Histogram with long right-tail: it might be a good idea to ignore the largest files as they are not as representative and to keep input data low for the network
    * 80% looks like a good option; it cuts out 10% of the data points compared to 90% but cuts the max vertex count in half

## Size of STL Input

We will be inputting the stl vertices into the network and want to know how much memory we will potentially be using to calculate a good batch size.

In [None]:
# if a float is 4 bytes and each vertex is 3 floats (x,y,z coordinates)
# note that the actual stl file has extra info like normal vectors, name, etc.
# our network won't care about that info, so we are focused on only vertices here.
df['stl_data_points'] = df.num_vertices * 3
df.stl_data_points.describe().apply(lambda x: format(x, 'f'))

In [None]:
df['stl_size_bytes'] = df.stl_data_points * 4
df.stl_size_bytes.describe().apply(lambda x: format(x, 'f'))

In [None]:
# gb
df['stl_size_gb'] = df.stl_size_bytes / 1024 / 1024 / 1024
df.stl_size_gb.describe().apply(lambda x: format(x, 'f'))

In [None]:
_ = df.hist(column='stl_size_bytes')

In [None]:
_ = df[df.num_vertices < df.num_vertices.quantile(.80)].hist(column='stl_size_bytes')

### Takeaways

* As expected, the memory usage of the input is linearly related to the number of vertices (the histograms have the same shape)

### Batch Size Memory Usage

Let's set an arbritary goal of 1GB for memory usage for our batches. What should the batch size be to reach that goal?

In [None]:
ARBRITARY_MEM_GOAL = 1
avg_gb = df.stl_size_gb.mean()
batch_size = ARBRITARY_MEM_GOAL / avg_gb
batch_size

With a 1GB goal, our batch size can be fairly large. Even without counting for padding, it is safe to assume that a batch size of < 1000 will be safe. 

## STL Coordinates

Are coordinates all positive?

From [Wikipedia](https://en.wikipedia.org/wiki/STL_(file_format)), "In the original specification, all STL coordinates were required to be positive numbers, but this restriction is no longer enforced and negative coordinates are commonly encountered in STL files today."

This means that we will likely need to normalize our coordinates. What's the best way to do so?

In [34]:
from importlib import reload
import env
from data import thingi10k
from data import THINGI10K_INDEX_100
import numpy as np

In [51]:
# aggreagate the mins and maxes
reload(thingi10k)
Thingi = thingi10k.Thingi10k.init100()
n_samples = len(Thingi)
mins = list()
maxs = list()
for batch in Thingi.batchmaker(100, filenames=False):
    for vectors in batch:
        mins.append(np.amin(vectors[0], axis=0))
        maxs.append(np.amax(vectors[0], axis=0))

In [94]:
# compute the absolute lowest and highest
lowest = np.amin(np.asarray(mins), axis=0)
mostest = np.amax(np.asarray(maxs), axis=0)

In [95]:
# apply for normalization
# https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
vectors.shape

(284, 3, 3)

In [96]:
(vectors - lowest) / (mostest - lowest)

array([[[0.9025598 , 0.564332  , 0.757745  ],
        [0.89359844, 0.56765884, 0.7582866 ],
        [0.89498764, 0.56493425, 0.757745  ]],

       [[0.9025598 , 0.564332  , 0.757745  ],
        [0.89359844, 0.56765884, 0.7582866 ],
        [0.89498764, 0.56739503, 0.7577451 ]],

       [[0.90188664, 0.564332  , 0.759031  ],
        [0.893476  , 0.5672056 , 0.7585204 ],
        [0.89476323, 0.5665642 , 0.7581737 ]],

       ...,

       [[0.88065237, 0.5367918 , 0.759031  ],
        [0.8895905 , 0.5613093 , 0.7585204 ],
        [0.8874154 , 0.5557542 , 0.757745  ]],

       [[0.88669574, 0.5603161 , 0.774833  ],
        [0.89071405, 0.5645018 , 0.76139355],
        [0.89267683, 0.5635956 , 0.763441  ]],

       [[0.901708  , 0.56438184, 0.774833  ],
        [0.89071405, 0.5655864 , 0.76139355],
        [0.89267683, 0.5635956 , 0.763441  ]]], dtype=float32)

In [97]:
def normalize(tri):
    tri[0] = (tri[0] - lowest[0]) / (mostest[0] - lowest[0])
    return tri

x = np.array(list(map(normalize, vectors)))
x

array([[[0.9025598 , 0.89096093, 0.89120156],
        [0.5742343 , 0.56765884, 0.5664808 ],
        [0.763926  , 0.7576141 , 0.757745  ]],

       [[0.9025598 , 0.89096093, 0.89120156],
        [0.5742343 , 0.56765884, 0.5664808 ],
        [0.763926  , 0.75868416, 0.7577451 ]],

       [[0.90188664, 0.89096093, 0.89198923],
        [0.5737748 , 0.5672056 , 0.56701857],
        [0.76355964, 0.7583229 , 0.7581737 ]],

       ...,

       [[0.88065237, 0.8836256 , 0.89198923],
        [0.5591868 , 0.5613093 , 0.5670185 ],
        [0.7515641 , 0.7536224 , 0.757745  ]],

       [[0.88669574, 0.88989127, 0.90166867],
        [0.56340504, 0.5645018 , 0.57362604],
        [0.76015353, 0.75703204, 0.763441  ]],

       [[0.901708  , 0.89097416, 0.90166867],
        [0.56340504, 0.5655864 , 0.57362604],
        [0.76015353, 0.75703204, 0.763441  ]]], dtype=float32)

In [122]:
lowest = np.array([-1.0, -2.0, -5.0])
mostest = np.array([3.0, 4.0, 5.0])
tri = np.array([[1.0, 2.0, 3.0],
               [1.0, 2.0, 3.0],
               [-1.0, 0.0, 1.0]])

norm = tri.copy()
for i in range(3):
    norm[i] = (tri[i] - lowest[i]) / (mostest[i] - lowest[i])
    
val = (3 - -1) / (3 - -1)
print(val)
print(tri[0][1])
assert norm[0][2] == val
norm[0]

1.0
2.0


array([0.5 , 0.75, 1.  ])