# Example notebook for processing point cloud data for PointNet

For this example I downloaded the training part of the "Oakland" dataset (http://www.cs.cmu.edu/~vmr/datasets/oakland_3d/cvpr09/doc/) and converted it to multiple LAZ files for demonstration purposes.

In [2]:
import json
import os
from pathlib import Path

from laspy.file import File  # laspy 1.x API; removed in laspy 2.0
import morton

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# also requires fastparquet

%load_ext line_profiler

## Reading text based point cloud (for LAS/LAZ see below)

In [None]:
data_dir = Path('/home/tom/vision/data/training')
data = data_dir.joinpath('oakland_part3_an_training.xyz_label_conf')
df = dd.read_csv(data)

## Reading LAS (or LAZ)

In [20]:
lasfile_dir = Path('/home/tom/vision/data/training')
lasfiles = sorted(list(lasfile_dir.glob('*.las')))
lasfiles

[PosixPath('/home/tom/vision/data/training/oakland_part3_an_training.las')]

In [27]:
dropped_columns = ['flag_byte', 'scan_angle_rank', 'user_data', 'pt_src_id']
meta = pd.DataFrame(np.empty(0, dtype=[('X',float),('Y',float),('Z',float),
                                       ('intensity',float),('raw_classification',int)]))

@delayed
def load(file):
    with File(file.as_posix(), mode='r') as las_data:
        las_df = pd.DataFrame(las_data.points['point'], dtype=float).drop(dropped_columns, axis=1)
        return las_df

In [28]:
dfs = [load(file) for file in lasfiles]
df = dd.from_delayed(dfs, meta=meta)
df = df.repartition(npartitions=10)

I often write intermediate steps to Parquet storage so that I can experiment freely with the dataframe. I believe loading Parquet is not (or not much) faster than loading LAS, so this is about convenience rather than speed.

In [29]:
df.to_parquet('/home/tom/vision/data/training/oakland', compression='GZIP')

## Spatial partitioning

### Translate origin

In [3]:
df = dd.read_parquet('/home/tom/vision/data/training/oakland')

In [4]:
df['X'] = df.X - df.X.min()
df['Y'] = df.Y - df.Y.min()

In [5]:
%time df.to_parquet('/home/tom/vision/data/training/oakland_trans', compression='GZIP')

CPU times: user 1.95 s, sys: 72 ms, total: 2.02 s
Wall time: 2.1 s


### Compute grid cell identifier

In [9]:
df = dd.read_parquet('/home/tom/vision/data/training/oakland_trans')

In [10]:
grid_size = 5.0  # meters
m = morton.Morton(dimensions=2, bits=32)

In [19]:
def get_hash(point, grid_size=grid_size):
    return m.pack(int(point.X // grid_size), int(point.Y // grid_size))

In [6]:
df['hash'] = df[['X', 'Y']].apply(get_hash, grid_size=grid_size, meta=('hash', int), axis=1)
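
To make the grid cell identifier less opaque, here is a minimal pure-Python sketch of what a 2D Morton (Z-order) code does: it interleaves the bits of the two integer cell indices into a single integer. The names `interleave_2d` and `cell_code` and the bit order (x in the even bit positions) are assumptions for illustration; the `morton` package used above may order the bits differently.

```python
def interleave_2d(ix, iy, bits=32):
    """Interleave the lowest `bits` bits of ix and iy (x in even bit positions)."""
    code = 0
    for i in range(bits):
        code |= ((ix >> i) & 1) << (2 * i)
        code |= ((iy >> i) & 1) << (2 * i + 1)
    return code


def cell_code(x, y, grid_size=5.0):
    """Map a translated (non-negative) coordinate to the code of its grid cell."""
    return interleave_2d(int(x // grid_size), int(y // grid_size))
```

All points that fall in the same 5 m cell get the same code, so the code can serve as a group key for the per-cell steps that follow.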

In [None]:
%time df.to_parquet('/home/tom/vision/data/training/oakland_hash', compression='GZIP')

## Normalization

In [None]:
df = dd.read_parquet('/home/tom/vision/data/training/oakland_hash')

In [None]:
meta = pd.DataFrame(np.empty(0, dtype=list(zip(list(df.columns), list(df.dtypes))) + \
                             list(zip(['XN', 'YN'], [np.dtype('float64')]*2))))

def normalise(df):
    df = df.copy()
    df['XN'] = (df.X - df.X.mean()) / (df.X.max() - df.X.min())
    df['YN'] = (df.Y - df.Y.mean()) / (df.Y.max() - df.Y.min())
    return df

df = df.groupby('hash').apply(normalise, meta=meta)

In [None]:
df['ZN'] = (df.Z - df.Z.mean()) / (df.Z.max() - df.Z.min())
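
The normalization above can be illustrated on a tiny pandas dataframe. This is a sketch with made-up coordinates and a hypothetical helper `centre_and_scale`; it mirrors the formula used in `normalise` and, as an addition that is not in the original code, guards against cells whose extent is zero.

```python
import pandas as pd


def centre_and_scale(s):
    """(value - mean) / (max - min); only centres when the range is zero."""
    extent = s.max() - s.min()
    if extent == 0:
        return s - s.mean()
    return (s - s.mean()) / extent


# Made-up points in two grid cells (hash 0 and hash 1).
points = pd.DataFrame({
    'hash': [0, 0, 0, 1, 1],
    'X': [0.0, 2.0, 4.0, 6.0, 8.0],
    'Z': [1.0, 1.0, 2.0, 3.0, 5.0],
})

# X is normalized per grid cell, Z over the whole dataset.
points['XN'] = points.groupby('hash')['X'].transform(centre_and_scale)
points['ZN'] = centre_and_scale(points['Z'])
```

Within each cell the normalized values are centred on zero and span a range of one, so every cell looks the same to the network regardless of where it sits in the scene.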

In [None]:
%time df.to_parquet('/home/tom/vision/data/training/oakland_norm', compression='GZIP')

That's it for data preparation. The final normalized dataframe is the dataset used for training.

## Split dataset

Before training and testing the model you should split this dataset into `train`, `test` and `validation` sets. I also implemented this code in `train_custom.py`.

In [None]:
df = dd.read_parquet('/home/tom/vision/data/training/oakland_norm')

In [None]:
hashes = df.index.unique().compute().values

# Split the grid cells roughly 80/20 into train+validation and test.
train_test_msk = np.random.rand(len(hashes)) < 0.8
train_val_hashes = hashes[train_test_msk]
test_hashes = hashes[~train_test_msk]

# Split train+validation roughly 80/20 again (about 64/16/20 overall).
train_val_msk = np.random.rand(len(train_val_hashes)) < 0.8
train_hashes = train_val_hashes[train_val_msk]
validation_hashes = train_val_hashes[~train_val_msk]

with open('/home/tom/vision/data/training/data_split.json', 'w') as data_split:
    json.dump({'train': train_hashes.tolist(), 'validation': validation_hashes.tolist(), 'test': test_hashes.tolist()}, data_split)
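
Because `np.random.rand` is unseeded, the split above changes on every run. Here is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for your experiments. The function name `split_hashes` and the seed value are hypothetical and not part of `train_custom.py`.

```python
import numpy as np


def split_hashes(hashes, train_frac=0.8, seed=42):
    """Randomly split grid cell hashes into train, validation and test arrays."""
    rng = np.random.default_rng(seed)
    hashes = np.asarray(hashes)

    # First split off the test cells, then split the remainder again.
    is_train_val = rng.random(len(hashes)) < train_frac
    train_val, test = hashes[is_train_val], hashes[~is_train_val]

    is_train = rng.random(len(train_val)) < train_frac
    return train_val[is_train], train_val[~is_train], test


# Stand-in for the real hashes: 1000 consecutive integers.
train, validation, test = split_hashes(np.arange(1000))
```

Splitting by grid cell rather than by point keeps all points of a cell in the same subset, which avoids leaking near-identical neighbourhoods between training and test data.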

## Generator

To feed the data to the deep learning network you need a generator. I also implemented this code in `train_custom.py`.

```python
from random import shuffle

import numpy as np


def generator(df, hashes, BATCH_SIZE, NUM_POINT, N_AUGMENTATIONS, shuffled=True):
    """
    Generator function to serve the data to the algorithm.

    IN: df (the entire dataframe), hashes (the indices to serve),
        BATCH_SIZE and NUM_POINT (to set output shape),
        N_AUGMENTATIONS (the number of "augmentations" or iterations of sampling),
        shuffled (whether to also shuffle the samples within each batch)
    OUT: data, label (batch of data and corresponding labels)
    """
    data_channels = ['X', 'Y', 'Z', 'XN', 'YN', 'intensity']

    # One (seed, hash) pair per grid cell per augmentation, in random order.
    seed_hash = [(seed, h) for seed in range(N_AUGMENTATIONS) for h in hashes]
    shuffle(seed_hash)

    # Cut into batches and drop a trailing incomplete batch.
    batches = [seed_hash[i:i + BATCH_SIZE] for i in range(0, len(seed_hash), BATCH_SIZE)]
    if batches and len(batches[-1]) < BATCH_SIZE:
        batches = batches[:-1]
    if shuffled:
        for batch in batches:
            shuffle(batch)

    def random_sample_block(group, seed):
        """
        Sample entirely at random from the whole grid cell.
        IN: group (all points in a grid cell), seed (random state value)
        OUT: data_group (a subset of the points in the grid cell; a training sample)
        """
        # Sample with replacement only when the cell has fewer than NUM_POINT points.
        return group.sample(n=NUM_POINT, replace=len(group) < NUM_POINT, random_state=seed)

    for batch in batches:
        df_batch = [random_sample_block(df.loc[h], s) for s, h in batch]
        data = np.stack([b[data_channels].values for b in df_batch])
        label = np.stack([b.label.values for b in df_batch])
        yield data, label
```
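
The core of `random_sample_block` is drawing exactly `NUM_POINT` rows from a cell, however many points it holds. Below is a NumPy-only sketch of that idea with a hypothetical function name `sample_fixed_size` and made-up data; it illustrates the sampling rule and is not the code from `train_custom.py`.

```python
import numpy as np


def sample_fixed_size(points, num_point, seed):
    """Return exactly num_point rows of `points` (shape: n x channels)."""
    rng = np.random.default_rng(seed)
    # Sample without replacement when the cell has enough points,
    # otherwise repeat points by sampling with replacement.
    replace = len(points) < num_point
    idx = rng.choice(len(points), size=num_point, replace=replace)
    return points[idx]


dense = np.arange(30, dtype=float).reshape(10, 3)  # 10 points, 3 channels
sparse = dense[:2]                                 # only 2 points

print(sample_fixed_size(dense, 4, seed=0).shape)   # (4, 3)
print(sample_fixed_size(sparse, 4, seed=0).shape)  # (4, 3)
```

Using the augmentation index as the seed means each (seed, cell) pair always yields the same sample, while different seeds yield different samples of the same cell.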

## Adapting `train.py`

Check out `train_custom.py` for my adaptations of `train.py` from the original PointNet codebase. It implements the data splitting and the generator described above.

## Train PointNet

In [None]:
os.chdir('/home/tom/vision/pointnet/sem_seg/') 
print(os.getcwd())
%run train_custom.py --log_dir=log --max_epoch=50 --num_point=4000 --batch_size=12 --n_augmentations=1