# Getting and cleaning the dataset

For the dataset, we propose using the [Thingi10K dataset](https://ten-thousand-models.appspot.com/). However, the raw data contains many complicated 3D models (45% with self-intersections, 26% with multiple components, etc.). As a first step, we propose keeping only the simplest 3D models.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load data
input_data = pd.read_csv('../data/Thingi10K/metadata/Thingi10K Summary - Input Summary.csv')
geometry_data = pd.read_csv('../data/Thingi10K/metadata/Thingi10K Summary - Geometry Data.csv')

In [3]:
print(input_data.shape)
print(geometry_data.shape)

(10000, 11)
(9997, 49)


In [4]:
# rename file_id to ID so that both tables share the same key column
geometry_data = geometry_data.rename(columns={'file_id': 'ID'})

In [5]:
# merge data
data = input_data.merge(geometry_data, on='ID')

In [6]:
data.head()

Unnamed: 0,ID,Thing ID,License,Link,No duplicated faces,Closed,Edge manifold,No degenerate faces,Vertex manifold,Single Component,...,p75_aspect_ratio,p90_aspect_ratio,p95_aspect_ratio,max_aspect_ratio,PWN_y,solid,ave_area,ave_valance,ave_dihedral_angle,ave_aspect_ratio
0,32770,10367,Creative Commons - Attribution - Share Alike,https://thingiverse-production-new.s3.amazonaw...,True,True,True,True,True,True,...,1.763788,2.518721,3.41081,15444360.0,1.0,1,0.233377,5.999367,0.0871,822.276536
1,34783,10955,Creative Commons - Attribution - Share Alike,https://thingiverse-production-new.s3.amazonaw...,True,True,True,True,True,True,...,15.362083,52.434491,146.752893,20207640.0,1.0,0,0.172308,6.0,0.157806,1623.715832
2,34784,10955,Creative Commons - Attribution - Share Alike,https://thingiverse-production-new.s3.amazonaw...,True,True,True,True,True,True,...,14.709113,54.730347,156.206824,6803872.0,1.0,0,0.17213,6.0,0.170783,788.200103
3,34785,10955,Creative Commons - Attribution - Share Alike,https://thingiverse-production-new.s3.amazonaw...,True,True,True,True,True,True,...,2.74117,7.026847,12.575539,146745.4,1.0,1,0.171683,6.0,0.070198,28.456043
4,35269,10367,Creative Commons - Attribution - Share Alike,https://thingiverse-production-new.s3.amazonaw...,True,True,True,True,True,True,...,1.759741,2.521702,3.494814,23032550.0,1.0,1,0.219888,5.999404,0.084723,1044.792582


In [7]:
data.shape

(9997, 59)
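The merged table has 9997 rows, while the input summary has 10000. `DataFrame.merge` performs an inner join by default, so one likely explanation is that three IDs from the input summary have no row in the geometry data. A minimal sketch of how this could be checked, using made-up toy tables rather than the real metadata (`input_toy`, `geometry_toy` and their values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the two metadata tables (values are made up).
input_toy = pd.DataFrame({"ID": [1, 2, 3, 4], "License": ["a", "b", "c", "d"]})
geometry_toy = pd.DataFrame({"ID": [1, 2, 4], "num_vertices": [82, 76, 292]})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each row was found in the left table, the right table, or both.
check = input_toy.merge(geometry_toy, on="ID", how="outer", indicator=True)

# IDs present in the input summary but missing from the geometry data.
missing_ids = check.loc[check["_merge"] == "left_only", "ID"].tolist()
print(missing_ids)
```

Running the same merge on `input_data` and `geometry_data` should list the IDs that the inner join silently dropped, if that is indeed the cause.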

# Data cleaning
The data may contain NaN values, models that are not closed, models with more than 1000 vertices, and models with more than one component. We need to remove all of them.

In [8]:
# remove rows that contain NaN in any column
data = data.dropna()

# remove models that are not closed
data = data[data['Closed'] != False]

# remove models with more than 1000 vertices
data = data[data['num_vertices'] <= 1000]

# remove models with more than one component
data = data[data['Single Component'] == True]

data.shape

(2904, 59)
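The cell above only reports the final shape, so it is not obvious which rule removes the most models. The sketch below shows one way to count the survivors after each rule. It runs on a made-up toy table that reuses the column names from the merged metadata; the `filters` and `survivors` names are assumptions introduced for illustration, and the NaN step is left out for brevity.

```python
import pandas as pd

# Toy table reusing the column names from the merged metadata (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Closed": [True, False, True, True, True],
    "Single Component": [True, True, False, True, True],
    "num_vertices": [82, 76, 292, 5000, 640],
})

# Named boolean masks, one per cleaning rule.
filters = {
    "closed": toy["Closed"].astype(bool),
    "at most 1000 vertices": toy["num_vertices"] <= 1000,
    "single component": toy["Single Component"].astype(bool),
}

# Apply the rules one at a time and record how many rows survive each step.
kept = pd.Series(True, index=toy.index)
survivors = {}
for name, mask in filters.items():
    kept &= mask
    survivors[name] = int(kept.sum())

print(survivors)
cleaned = toy[kept]
```

Building the masks first and combining them afterwards keeps each rule easy to toggle when experimenting with a different vertex limit.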

In [9]:
# keep only the ID, License, num_vertices and Link columns
data_croppped = data[['ID', 'License', 'num_vertices', 'Link']]
data_croppped.head()

Unnamed: 0,ID,License,num_vertices,Link
5,36069,Creative Commons - Attribution - Share Alike,82,https://thingiverse-production-new.s3.amazonaw...
7,36082,Creative Commons - Attribution - Share Alike,76,https://thingiverse-production-new.s3.amazonaw...
8,36086,Creative Commons - Attribution - Share Alike,292,https://thingiverse-production-new.s3.amazonaw...
10,36090,Creative Commons - Attribution - Share Alike,292,https://thingiverse-production-new.s3.amazonaw...
14,36372,Creative Commons - Attribution - Share Alike,76,https://thingiverse-production-new.s3.amazonaw...


# Downloading models

In [10]:
# imports for downloading each model from its link and saving it to a folder
import urllib.request
import os
from tqdm import tqdm

In [11]:
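PATH_TO_SAVE = "../data/Thingi10K/models"
os.makedirs(PATH_TO_SAVE, exist_ok=True)  # make sure the target folder exists

num_models = data_croppped.shape[0]
# num_models = 5  # uncomment to test the loop on a few models first

for i in tqdm(range(num_models)):
    url = data_croppped.iloc[i]['Link']
    try:
        # files are named by row position in data_croppped, not by model ID
        urllib.request.urlretrieve(url, os.path.join(PATH_TO_SAVE, str(i) + '.stl'))
    except Exception as e:
        # skip models that fail to download, but report them instead of failing silently
        tqdm.write(f"Failed to download model {i}: {e}")
        continue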

 60%|█████▉    | 1729/2904 [30:23<27:10,  1.39s/it]  
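At roughly 1.4 s per model the full download takes close to an hour, so it helps if the loop can be resumed after an interruption. Below is a minimal sketch of a resumable variant, not the notebook's own method: `download_models` and its `fetch` parameter are hypothetical names introduced here, files are named by model ID instead of row position (an assumption that makes them easier to trace back to the metadata), and each file is first written to a temporary `.part` path so that an interrupted download is not mistaken for a finished one.

```python
import os
import urllib.request


def download_models(records, out_dir, fetch=urllib.request.urlretrieve):
    """Download each (model_id, url) pair to out_dir/<model_id>.stl.

    Files that already exist are skipped, so the function can be re-run
    after an interruption. Returns the list of IDs whose download failed.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for model_id, url in records:
        target = os.path.join(out_dir, f"{model_id}.stl")
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        tmp = target + ".part"
        try:
            fetch(url, tmp)
            os.replace(tmp, target)  # only a complete file gets the final name
        except Exception:
            failed.append(model_id)
    return failed
```

With the notebook's variables, a call might look like `failed = download_models(zip(data_croppped['ID'], data_croppped['Link']), PATH_TO_SAVE)`. The returned list can then be retried or inspected separately.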