# TensorFlow - Loading data - Adjusting the layer to our needs

https://www.kaggle.com/code/roberthatch/gislr-feature-data-on-the-shoulders/notebook

This notebook builds a TensorFlow preprocessing layer and uses it to transform the whole dataset of individual parquet files into NumPy arrays.

The flow is as follows:

***Preprocessing layer:***
1. Drop the z coordinate if desired (this reduces the number of coordinates to 2; the array shapes below are given for 3 coordinates).
2. Convert the input into x, containing the averaged face, lips, upper pose landmarks, left hand, and right hand for every frame,
    e.g. [23, 106, 3] = (number of frames, landmarks, coordinates).
3. Pad or cut x to the defined length (e.g. length = 30 --> [30, 106, 3] = (number of frames, landmarks, coordinates)).
4. Interpolate single missing values, then replace the remaining NaN values with zero.
5. Reshape it to either a 3-dimensional or a flattened array:
    * 3D: [1, 30, 318] = (1 row, number of frames, landmarks * coordinates)
    * or flattened: [1, 9540] = (1 row, number of frames * landmarks * coordinates)
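As a rough illustration of the shape changes listed above, the following sketch pads a 23-frame example to 30 frames and reshapes it. It uses random data and plain NumPy instead of the TensorFlow layer; the variable names and the zero padding at the end are assumptions chosen to match the example shapes.

```python
import numpy as np

# hypothetical example input: 23 frames, 106 landmarks, 3 coordinates (x, y, z)
n_frames, n_landmarks, n_coords = 23, 106, 3
x = np.random.rand(n_frames, n_landmarks, n_coords).astype(np.float32)

# pad (or cut) to a fixed length of 30 frames; zero padding at the end is assumed here
target_length = 30
if x.shape[0] >= target_length:
    x_fixed = x[:target_length]
else:
    pad_rows = target_length - x.shape[0]
    x_fixed = np.pad(x, [(0, pad_rows), (0, 0), (0, 0)], constant_values=0)

# reshape to the 3D layout or to the flattened layout
x_3d = x_fixed.reshape(1, target_length, n_landmarks * n_coords)
x_flat = x_fixed.reshape(1, target_length * n_landmarks * n_coords)

print(x_fixed.shape, x_3d.shape, x_flat.shape)
# (30, 106, 3) (1, 30, 318) (1, 9540)
```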

***Looping through the csv file:***

By looping through the csv file, each parquet file is loaded, and the required data is extracted and transformed by running it through the preprocessing layer.
The result is then written to the final NumPy array of all recordings. Recordings whose length falls outside the defined limits are removed from the dataset.

* Final shape of the data, e.g.: flattened x (94477, 5796) and y (94477,), or unflattened x (94477, 30, 212) and y (94477,)
* The shape depends on the selected landmarks, coordinates, and sequence length, and might vary
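The removal of recordings that are too long or too short can be pictured with a small NumPy sketch. The frame counts, labels, and limits below are made-up example values, and `filter_by_length` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np

def filter_by_length(data, labels, frame_counts, min_length, max_length):
    """Drop rows whose original frame count lies outside [min_length, max_length]."""
    frame_counts = np.asarray(frame_counts)
    # indices of recordings that are too short or too long
    drop = np.where((frame_counts < min_length) | (frame_counts > max_length))[0]
    return np.delete(data, drop, axis=0), np.delete(labels, drop, axis=0), drop

# five made-up recordings, already padded/cut to 4 frames x 6 features
data = np.arange(5 * 4 * 6, dtype=np.float32).reshape(5, 4, 6)
labels = np.array([3, 1, 4, 1, 5])
frame_counts = [8, 25, 120, 40, 10]  # original number of frames per recording

kept_data, kept_labels, dropped = filter_by_length(
    data, labels, frame_counts, min_length=10, max_length=92
)
print(dropped, kept_data.shape, kept_labels)
# [0 2] (3, 4, 6) [1 1 5]
```

Collecting the indices first and deleting once at the end keeps the preallocated array intact while the loop is still writing into it.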

## Import libraries

In [827]:
%pip install tqdm
import os

import json
from tqdm import tqdm
import numpy as np
import pandas as pd
import random

import tensorflow as tf

Note: you may need to restart the kernel to use updated packages.


## Setup

In [828]:
#file paths for Kaggle
#note: the path column in train.csv already contains the "train_landmark_files/" prefix
# LANDMARK_FILES_DIR = "/kaggle/input/asl-signs/"
# TRAIN_FILE = "/kaggle/input/asl-signs/train.csv"
# OUTPUT = ""
# LABEL_MAP_FILE = "/kaggle/input/asl-signs/sign_to_prediction_index_map.json"

#for local notebook, adjust file paths here if required
LANDMARK_FILES_DIR = "../data/asl-signs/"
TRAIN_FILE = "../data/asl-signs/train.csv"
OUTPUT = "../data/" #path to save x and y files
LABEL_MAP_FILE = "../data/asl-signs/sign_to_prediction_index_map.json"

#load mapping from sign name to label index
with open(LABEL_MAP_FILE, "r") as f:
    label_map = json.load(f)

In [829]:
#DONE:
#option for zeropadding, end/beginning
#CSV Filtering
#Drop too short/too long sequences
#numpy array with NaN instead of zero --> only useful if arrays with different length will be added
#first/lastframe padding
#interpolate single missing values


#TODO: options to add for defining dataset
#change numpy array to list? or numpy array with different length --> this would slow down the compiling process significantly
#define handedness and mirror coordinates to one side


#how does rocket handle different lengths?
#how does rocket handle NaNs?

## Configuration

In [830]:
#limit dataset for quick test
QUICK_TEST = False
QUICK_LIMIT = 1000

#Define length of sequences for padding or cutting; 22 is the median length of all sequences
LENGTH = 22

#define min and max length of sequences; sequences that are too long/too short will be dropped
#max value of 92 was defined by calculating the interquartile range
MIN_LENGTH = 10
MAX_LENGTH = 92

#final data will be flattened, if false data will be 3 dimensional
FLATTEN = False

#define initialization of numpy array 
ARRAY = False #(True=Zeros, False=empty values)

#Define padding mode
#1 = padding at start & end; 2 = padding at end; 3 = no padding; 4 = copy first/last frame; 5 = copy last frame
#Note: mode 3 currently raises an error because the sequences keep different lengths, working on that
PADDING = 2
CONSTANT_VALUE = 0 #only required for modes 1 and 2; enter tf.constant(float('nan')) for NaN

#define if z coordinate will be dropped
DROP_Z = True

#define if csv file should be filtered
CSV_FILTER  = True
#define how many participants for test set
TEST_COUNT = 5 #5 participants account for ca 23% of dataset
#generate test or train dataset (True = Train dataset; False = Test dataset)
TRAIN = True #only works if CSV_FILTER is activated
#TRAIN = False

#define filenames for x and y:
feature_data = 'X_train' #x data
feature_labels = 'y_train' #y data

#use for test dataset
#feature_data = 'X_test' #x data
#feature_labels = 'y_test' #y data


RANDOM_STATE = 42

#Defining landmarks
#index ranges for each landmark type
#don't change these landmarks
FACE = list(range(0, 468))
LEFT_HAND = list(range(468, 489))
POSE = list(range(489, 522))
POSE_UPPER = list(range(489, 510))
RIGHT_HAND = list(range(522, 543))
LIPS = [61, 185, 40, 39, 37,  0, 267, 269, 270, 409,
                 291,146, 91,181, 84, 17, 314, 405, 321, 375, 
                 78, 191, 80, 81, 82, 13, 312, 311, 310, 415, 
                 95, 88, 178, 87, 14,317, 402, 318, 324, 308]
#defining landmarks that will be merged
averaging_sets = [FACE]

#generating list with all landmarks selected for preprocessing
#change landmarks you want to use here:
point_landmarks = LEFT_HAND + POSE_UPPER + RIGHT_HAND + LIPS


#calculating sum of total landmarks used
LANDMARKS = len(point_landmarks) + len(averaging_sets)
print(f'Total count of used landmarks: {LANDMARKS}')

#defining input shape for model
if DROP_Z:
    INPUT_SHAPE = (LENGTH,LANDMARKS*2)
else:
    INPUT_SHAPE = (LENGTH,LANDMARKS*3)


Total count of used landmarks: 104
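The landmark count and the resulting input shape follow from simple arithmetic. The sketch below reproduces it in plain Python; `input_shape` is a hypothetical helper written for illustration, and the counts are the ones from the configuration above (21 landmarks per hand, 21 upper pose landmarks, 40 lip landmarks, 1 averaged face landmark).

```python
def input_shape(length, n_point_landmarks, n_averaged_sets, drop_z):
    """Return (frames, features) for one preprocessed recording."""
    n_coords = 2 if drop_z else 3
    n_landmarks = n_point_landmarks + n_averaged_sets
    return (length, n_landmarks * n_coords)

# 21 left hand + 21 upper pose + 21 right hand + 40 lips = 103 point landmarks
n_points = 21 + 21 + 21 + 40

print(input_shape(22, n_points, 1, drop_z=True))   # (22, 208)
print(input_shape(22, n_points, 1, drop_z=False))  # (22, 312)
```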


### Helper Functions

In [831]:
ROWS_PER_FRAME = 543
def load_relevant_data_subset(pq_path):
    #defines which columns will be read from the file
    data_columns = ['x', 'y', 'z']
    data = pd.read_parquet(pq_path, columns=data_columns)
    #calculates the number of frames in the data by dividing the length of the data by the number of rows per frame
    n_frames = int(len(data) / ROWS_PER_FRAME)
    #reshapes the data into a 3D array with shape (n_frames, ROWS_PER_FRAME, len(data_columns))
    data = data.values.reshape(n_frames, ROWS_PER_FRAME, len(data_columns))
    return data.astype(np.float32)
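The reshape works because each parquet file stores one row per landmark per frame, so every consecutive block of `ROWS_PER_FRAME` rows belongs to one frame. A minimal sketch with a synthetic table, using small made-up sizes instead of 543 rows per frame:

```python
import numpy as np

rows_per_frame = 4   # stands in for ROWS_PER_FRAME = 543
n_frames = 3

# long-format table: one row per (frame, landmark), columns x, y, z
flat = np.array(
    [[frame + 0.1 * lm, frame + 0.1 * lm + 0.01, frame + 0.1 * lm + 0.02]
     for frame in range(n_frames) for lm in range(rows_per_frame)],
    dtype=np.float32,
)

# a partial frame would make the reshape fail, so check divisibility first
assert len(flat) % rows_per_frame == 0

frames = flat.reshape(len(flat) // rows_per_frame, rows_per_frame, 3)
print(frames.shape)   # (3, 4, 3)
print(frames[2, 1])   # x, y, z of landmark 1 in frame 2
```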

In [832]:
def right_hand_percentage(x):
    #calculates percentage of right hand usage
    right = tf.gather(x, RIGHT_HAND, axis=1)
    left = tf.gather(x, LEFT_HAND, axis=1)
    right_count = tf.reduce_sum(tf.where(tf.math.is_nan(right), tf.zeros_like(right), tf.ones_like(right)))
    left_count = tf.reduce_sum(tf.where(tf.math.is_nan(left), tf.zeros_like(left), tf.ones_like(left)))
    return right_count / (left_count+right_count)

In [833]:
def tf_nan_mean(x, axis=0):
    #calculates the mean of a TensorFlow tensor x along a specified axis while ignoring any NaN values in the tensor.
    return tf.reduce_sum(tf.where(tf.math.is_nan(x), tf.zeros_like(x), x), axis=axis) / tf.reduce_sum(tf.where(tf.math.is_nan(x), tf.zeros_like(x), tf.ones_like(x)), axis=axis)

def tf_nan_std(x, axis=0):
    #calculates the standard deviation of a tensor x along a specified axis, while ignoring any NaN values in the tensor
    d = x - tf_nan_mean(x, axis=axis)
    return tf.math.sqrt(tf_nan_mean(d * d, axis=axis))

#this function is only required if mean and std will be calculated for specific segments of the data
def flatten_means_and_stds(x, axis=0):
    #Get means and stds
    x_mean = tf_nan_mean(x, axis=0)
    x_std  = tf_nan_std(x,  axis=0)
    #concats mean and std values for each sequence
    x_out = tf.concat([x_mean, x_std], axis=0)
    x_out = tf.reshape(x_out, (1, INPUT_SHAPE[1]*2))
    #replaces NaN values with zeros
    x_out = tf.where(tf.math.is_finite(x_out), x_out, tf.zeros_like(x_out))
    return x_out
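The NaN-aware mean divides the sum of the valid values by the number of valid values. The same idea in plain NumPy, as a sketch for checking the logic on a tiny array (`nan_mean` is a hypothetical stand-in for `tf_nan_mean`):

```python
import numpy as np

def nan_mean(x, axis=0):
    """Mean along `axis` that ignores NaN entries (sum of valid values / count of valid values)."""
    valid = ~np.isnan(x)
    total = np.where(valid, x, 0.0).sum(axis=axis)
    count = valid.sum(axis=axis)
    # a slice without any valid value gives 0/0 = NaN, as in the TensorFlow version
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

x = np.array([[1.0, np.nan],
              [3.0, np.nan],
              [np.nan, np.nan]])
result = nan_mean(x, axis=0)
print(result)  # first column: 2.0, second column: NaN
```

A column that contains only NaN values stays NaN, which is why the functions above replace non-finite results with zeros afterwards.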

## TensorFlow Feature Preprocessing Layer

In [834]:
#generating preprocessing layer that will be added to final model
class FeatureGen(tf.keras.layers.Layer):
    #defines custom tensorflow layer 
    def __init__(self):
        #initializes layer
        super(FeatureGen, self).__init__()
    
    def call(self, x_in):
        #drop z coordinates if required
        if DROP_Z:
            x_in = x_in[:, :, 0:2]
        
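        #generates list with mean values for landmarks that will be merged
        #each averaging set is a list of landmark indices, so all of them are gathered before taking the NaN-aware mean
        x_list = [tf.expand_dims(tf_nan_mean(tf.gather(x_in, av_set, axis=1), axis=1), axis=1) for av_set in averaging_sets]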
        #extracts specific columns from input x_in defined by landmarks
        x_list.append(tf.gather(x_in, point_landmarks, axis=1))
        #concatenates the two tensors from above along axis 1/columns
        x = tf.concat(x_list, 1)

        #padding to desired length of sequence (defined by LENGTH)
        #get current number of rows
        x_padded = x
        current_rows = tf.shape(x_padded)[0]
        #if current number of rows is greater than desired number of rows, truncate excess rows
        if current_rows > LENGTH:
            x_padded = x_padded[:LENGTH, :, :]

        #if current number of rows is less than desired number of rows, add padding
        elif current_rows < LENGTH:
            #calculate amount of padding needed
            pad_rows = LENGTH - current_rows

            if PADDING ==4: #copy first/last frame
                if pad_rows %2 == 0: #if pad_rows is even
                    padding_front = tf.repeat(x_padded[0:1, :], pad_rows//2, axis=0)
                    padding_back = tf.repeat(x_padded[-1:, :], pad_rows//2, axis=0)
                else: #if pad_rows is odd
                    padding_front = tf.repeat(x_padded[0:1, :], (pad_rows//2)+1, axis=0)
                    padding_back = tf.repeat(x_padded[-1:, :], pad_rows//2, axis=0)
                x_padded = tf.concat([padding_front, x_padded, padding_back], axis=0)
            elif PADDING == 5: #copy last frame
                padding_back = tf.repeat(x_padded[-1:, :], pad_rows, axis=0)
                x_padded = tf.concat([x_padded, padding_back], axis=0)
            else:
                if PADDING ==1: #padding at start and end
                    if pad_rows %2 == 0: #if pad_rows is even
                        paddings = [[pad_rows//2, pad_rows//2], [0, 0], [0, 0]]
                    else: #if pad_rows is odd
                        paddings = [[pad_rows//2+1, pad_rows//2], [0, 0], [0, 0]]
                elif PADDING ==2: #padding only at the end of sequence
                    paddings = [[0, pad_rows], [0, 0], [0, 0]]
                elif PADDING ==3: #no padding
                    paddings = [[0, 0], [0, 0], [0, 0]]
                x_padded = tf.pad(x_padded, paddings, mode='CONSTANT', constant_values=CONSTANT_VALUE)

        x = x_padded
        current_rows = tf.shape(x)[0]

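        #interpolate short gaps of missing values along the time axis, separately for every feature
        #note: this step uses NumPy/pandas, so the layer only works in eager mode (not on symbolic tensors)
        x = np.array(x).reshape(int(current_rows), -1)
        x = pd.DataFrame(x).interpolate(method='linear', limit=2, limit_direction='both', axis=0).to_numpy()
        #fill remaining missing values with zeros
        x = tf.where(tf.math.is_nan(x), tf.zeros_like(x), x)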
        
        #reshape data to 2D or 3D array
        if FLATTEN:
            x = tf.reshape(x, (1, current_rows*INPUT_SHAPE[1]))
        else:
            x = tf.reshape(x, (1, current_rows, INPUT_SHAPE[1]))

        return x

#define converter using generated layer
feature_converter = FeatureGen()
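The padding modes of the layer can be compared on a tiny example. The following sketch mirrors modes 1, 2, 4, and 5 with `np.pad`; it is a NumPy approximation for illustration, not the layer's own code, and `pad_frames` is a hypothetical helper.

```python
import numpy as np

def pad_frames(x, length, mode, constant_value=0.0):
    """Pad a (frames, landmarks, coords) array along axis 0 up to `length` frames."""
    pad_rows = length - x.shape[0]
    if pad_rows <= 0:
        return x[:length]
    # for an odd number of padding rows, the extra row goes to the front
    front, back = pad_rows - pad_rows // 2, pad_rows // 2
    if mode == 1:    # constant padding at start and end
        return np.pad(x, [(front, back), (0, 0), (0, 0)], constant_values=constant_value)
    if mode == 2:    # constant padding at the end
        return np.pad(x, [(0, pad_rows), (0, 0), (0, 0)], constant_values=constant_value)
    if mode == 4:    # repeat first frame at the start, last frame at the end
        return np.pad(x, [(front, back), (0, 0), (0, 0)], mode="edge")
    if mode == 5:    # repeat last frame at the end
        return np.pad(x, [(0, pad_rows), (0, 0), (0, 0)], mode="edge")
    raise ValueError(f"unsupported padding mode: {mode}")

# two frames with one landmark and one coordinate each, padded to five frames
x = np.array([[[1.0]], [[2.0]]])
for mode in (1, 2, 4, 5):
    print(mode, pad_frames(x, 5, mode).ravel())
# 1 [0. 0. 1. 2. 0.]
# 2 [1. 2. 0. 0. 0.]
# 4 [1. 1. 1. 2. 2.]
# 5 [1. 2. 2. 2. 2.]
```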

In [835]:
#Tests for generated layer
#One tests symbolic tensor, the other tests real data.
#print(feature_converter(tf.keras.Input((543, 3), dtype=tf.float32, name="inputs")))

#tests preprocessing layer with parquet file
feature_converter(load_relevant_data_subset(f'{LANDMARK_FILES_DIR}{pd.read_csv(TRAIN_FILE).path[1]}'))

<tf.Tensor: shape=(1, 22, 208), dtype=float32, numpy=
array([[[0.57777625, 0.5093604 , 0.51049846, ..., 0.53057057,
         0.6235912 , 0.5270701 ],
        [0.57509214, 0.5065524 , 0.5077551 , ..., 0.52969885,
         0.6222859 , 0.5262149 ],
        [0.5703879 , 0.5069116 , 0.50810647, ..., 0.52882755,
         0.62084705, 0.5254454 ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ]]], dtype=float32)>
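The layer fills gaps with pandas' `interpolate` before replacing the remaining NaN values with zeros. As a simpler illustration of the idea of filling isolated missing values, the sketch below handles only gaps of exactly one frame in plain NumPy. It is an alternative written for this example (`fill_single_gaps` is a hypothetical helper), not the method used in the layer.

```python
import numpy as np

def fill_single_gaps(x):
    """Fill NaNs that have valid values in the frame before and after (axis 0 = time)."""
    x = x.copy()
    prev_frame, next_frame = x[:-2], x[2:]
    middle = x[1:-1]  # view into x, so the assignment below modifies x
    gap = np.isnan(middle) & ~np.isnan(prev_frame) & ~np.isnan(next_frame)
    middle[gap] = (prev_frame[gap] + next_frame[gap]) / 2
    return x

# 5 frames x 2 features; feature 0 has a single gap, feature 1 a two-frame gap
x = np.array([[0.0, 1.0],
              [np.nan, np.nan],
              [2.0, np.nan],
              [3.0, 4.0],
              [4.0, 5.0]])
filled = fill_single_gaps(x)
filled = np.where(np.isnan(filled), 0.0, filled)  # remaining NaNs become zero
print(filled)
```

The single gap in feature 0 becomes 1.0 (the midpoint of 0.0 and 2.0), while the two-frame gap in feature 1 is left alone and ends up as zeros.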

In [836]:
def convert_row(i, row, long_sequences, short_sequences):

    #loads data from parquet file
    x = load_relevant_data_subset(f'{LANDMARK_FILES_DIR}{row[1].path}')
    
    #if sequence is too long or too short its index will be added to list, so we can later remove them from dataset
    if x.shape[0] < MIN_LENGTH:
        short_sequences.append(i)
    elif x.shape[0] > MAX_LENGTH:
        long_sequences.append(i)

    #applies preprocessing layer to loaded data and converts the result to a numpy array
    x = feature_converter(tf.convert_to_tensor(x)).numpy()
    #returns transformed x data and label of sign
    return x, row[1].label

def convert_and_save_data():
    #reads csv file
    df = pd.read_csv(TRAIN_FILE)
    #maps label number to sign column
    df['label'] = df['sign'].map(label_map)
    
    #filter dataset for participant train/test split
    if CSV_FILTER:      
        #set random state
        random.seed(RANDOM_STATE)
        #get random participants from dataset and add to sample list, number of participants is defined in setup
        #sorted unique ids give the same candidate order as a groupby index, without averaging non-numeric columns
        sample_list = random.sample(sorted(df['participant_id'].unique().tolist()), TEST_COUNT)
        test_recordings = df.query(f'participant_id in {sample_list}').shape[0]
        print(f'These participants are used as test set: {sample_list}.')
        print(f'Together they have {test_recordings} recordings, {round(test_recordings/df.shape[0]*100,2)}% of the dataset.')
        print('-------------------------------------')
        if TRAIN:
            #limit dataset to participants used for train set
            df = df.query(f'participant_id not in {sample_list}')
        else:
            #limit dataset to participants used for test set
            df = df.query(f'participant_id in {sample_list}')


    #sets number of total rows
    total = df.shape[0]
    #limits number of rows if quick_test is activated (never more rows than the dataset contains)
    if QUICK_TEST:
        total = min(QUICK_LIMIT, total)
    
    #generates numpy array with zeros in shape (total number of rows, number of expected columns)
    if ARRAY: #initialize array with zeros
        if FLATTEN:
            npdata = np.zeros((total, INPUT_SHAPE[0]*INPUT_SHAPE[1]))
        else:
            npdata = np.zeros((total, INPUT_SHAPE[0], INPUT_SHAPE[1]))
        nplabels = np.zeros(total)
    else: #initialize empty array
        if FLATTEN:
            npdata = np.empty((total, INPUT_SHAPE[0]*INPUT_SHAPE[1]))
        else:
            npdata = np.empty((total, INPUT_SHAPE[0], INPUT_SHAPE[1]))
        nplabels = np.empty(total)

    #initializing lists for collecting too long and too short sequences
    long_sequences = []
    short_sequences = []

    #for loop iterates through each row in df dataframe; i is index of the row and row accesses information in the row of df
    #tqdm is used for showing progress bar
    for i, row in tqdm(enumerate(df.iterrows()), total=total):
        #load specific parquet file, run preprocessing layer and save x and y data
        (x,y) = convert_row(i, row, long_sequences, short_sequences)
        #save x and y to specific row in prepared numpy arrays
        npdata[i,:] = x
        nplabels[i] = y

        #break if quick test is activated
        if QUICK_TEST and i == QUICK_LIMIT - 1:
            break

    #remove rows of sequences that are too long/too short
    npdata = np.delete(npdata, obj = (long_sequences + short_sequences), axis=0)
    nplabels = np.delete(nplabels, obj = (long_sequences + short_sequences), axis=0)

    print (f'{len(long_sequences)} sequences were removed because they are too long, which is {round(len(long_sequences)/df.shape[0]*100,2)}% of the whole dataset.')
    print (f'{len(short_sequences)} sequences were removed because they are too short, which is {round(len(short_sequences)/df.shape[0]*100,2)}% of the whole dataset.')
    print('-------------------------------------')
    print('Shapes of the final datasets are:')
    print(npdata.shape, nplabels.shape)

    #save as np file
    np.save(f"{OUTPUT}{feature_data}.npy", npdata)
    np.save(f"{OUTPUT}{feature_labels}.npy", nplabels)
        
convert_and_save_data()

These participants are used as test set: [62590, 18796, 2044, 28656, 27610].
Together they have 21713 recordings, 22.98% of the dataset.
-------------------------------------


100%|██████████| 72764/72764 [05:08<00:00, 236.15it/s]


6578 sequences were removed because they are too long, which is 9.04% of the whole dataset.
95 sequences were removed because they are too short, which is 0.13% of the whole dataset.
-------------------------------------
Shapes of the final datasets are:
(66091, 22, 208) (66091,)
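Splitting by participant keeps all recordings of one signer on the same side of the train/test split. A minimal sketch of that idea in plain Python, using made-up participant ids; `split_participants` is a hypothetical helper, and unlike the notebook it uses a local random generator instead of seeding the global one.

```python
import random

def split_participants(participant_ids, test_count, seed=42):
    """Pick `test_count` participants for the test set; the rest form the train set."""
    unique_ids = sorted(set(participant_ids))
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    test_ids = set(rng.sample(unique_ids, test_count))
    train_ids = set(unique_ids) - test_ids
    return train_ids, test_ids

# made-up participant column: one entry per recording
recordings = [101, 101, 102, 103, 103, 103, 104, 105, 105, 106]
train_ids, test_ids = split_participants(recordings, test_count=2)

train_rows = [p for p in recordings if p in train_ids]
test_rows = [p for p in recordings if p in test_ids]
print(len(train_rows), len(test_rows))
```

Because the split is made on participants and not on recordings, the share of recordings in the test set depends on how many recordings the sampled participants contributed.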


In [837]:
#test of loading data
X = np.load(f"{OUTPUT}{feature_data}.npy")
y = np.load(f"{OUTPUT}{feature_labels}.npy")

print(X.shape, y.shape)

(66091, 22, 208) (66091,)
