# Data Preprocessing

Now that we've extracted the inputs from the replays, we precompute some of our preprocessing steps for the whole dataset so that we don't have to repeat them every time we want to change a hyperparameter.

In [1]:
import os

import pandas as pd

## Find the files

In [None]:
data_path = 'dataset/inputs' # set this to wherever your dataset is being stored

all_files = os.listdir(data_path)
games_inputs = {}
for filename in all_files:
    filepath = os.path.join(data_path, filename) # data_path has no trailing slash, so plain concatenation would give a wrong path
    inputs = pd.read_csv(filepath)
    games_inputs[filename.split('.')[0]] = inputs[['joy_x', 'joy_y']] # we are only interested in joy_x and joy_y columns

games_inputs[all_files[5].split('.')[0]].head(10) # keys are filenames without the extension
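
The loading step above can also be written a little more defensively. The following is only a sketch, not the notebook's own method: it assumes every game is stored as a `.csv` file directly inside the data directory, and the helper name `load_joystick_inputs` is hypothetical. It uses `pathlib` so that non-CSV files are skipped and the dictionary key is the file stem. Note that `Path.stem` drops only the final extension, which differs from `filename.split('.')[0]` when a filename contains more than one dot.

```python
from pathlib import Path

import pandas as pd


def load_joystick_inputs(data_dir, columns=('joy_x', 'joy_y')):
    """Return {game name: DataFrame with the requested columns} for each CSV in data_dir.

    Hypothetical helper for illustration; column names follow the notebook.
    """
    games = {}
    for csv_path in sorted(Path(data_dir).glob('*.csv')):
        # Read only the columns we need, then index to fix their order.
        df = pd.read_csv(csv_path, usecols=list(columns))[list(columns)]
        games[csv_path.stem] = df
    return games
```

Called as `load_joystick_inputs('dataset/inputs')`, it would return a dictionary shaped like `games_inputs`.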

## Convert to displacement event representation

The output of a joystick is based on its position, whereas the output of a mouse is based on its displacement. As a result, users tend to move the joystick to a position, hold it there for some time, and then move it somewhere else. The dataset contains the state of the controller for every single frame of gameplay. We can fit the same information into a more compact representation by keeping only the frames where the state changes, storing the change itself (the displacement) together with the number of frames elapsed since the previous change. This smaller representation improves performance because our models won't have to sift through large amounts of redundant information.

In [None]:
def nextDisplacement(df, index):
    displacements = [] # stores both x and y displacement
    for col in df.columns:
        displacements.append(df.at[index+1, col] - df.at[index, col])

    return displacements

In [None]:
game_displacements = {}

for g_name in games_inputs.keys():
    game = games_inputs[g_name]

    ds = []
    prev_frame = 0
    for i in range(game.shape[0] - 1):
        # get next displacement
        d = nextDisplacement(game, i)

        if not d == [0, 0]: # only include frames that have some velocity
            # get elapsed
            elapsed = i - prev_frame
            prev_frame = i

            if elapsed > 60: # arbitrarily set all values > 60 to 0
                elapsed = 0

            ds.append((d[0], d[1], elapsed))

    game_displacements[g_name] = pd.DataFrame(ds, columns=['d_x', 'd_y', 'frames_elapsed'])

game_displacements[all_files[0].split('.')[0]].head(10) # keys are filenames without the extension
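
The per-frame Python loop above can be slow on long replays. As a sketch of an alternative, the same conversion can be expressed with vectorized pandas and NumPy operations. The function name `to_displacement_events` and the `max_elapsed` parameter are hypothetical; the sketch assumes the frame has `joy_x` and `joy_y` columns and is meant to mirror the loop's behavior, including resetting any gap longer than 60 frames to 0.

```python
import numpy as np
import pandas as pd


def to_displacement_events(frames, max_elapsed=60):
    """Vectorized sketch of the displacement-event conversion (hypothetical helper)."""
    # Row i of deltas holds frame[i + 1] - frame[i], like nextDisplacement(df, i).
    deltas = frames[['joy_x', 'joy_y']].diff().iloc[1:].reset_index(drop=True)

    # Keep only the frames where at least one axis moved.
    moved = (deltas != 0).any(axis=1)
    events = deltas[moved]

    # Frames elapsed since the previous kept frame; the first one counts from frame 0.
    frame_idx = events.index.to_numpy()
    elapsed = np.diff(frame_idx, prepend=0)

    # Same arbitrary rule as the loop: gaps longer than max_elapsed become 0.
    elapsed[elapsed > max_elapsed] = 0

    return pd.DataFrame({
        'd_x': events['joy_x'].to_numpy(),
        'd_y': events['joy_y'].to_numpy(),
        'frames_elapsed': elapsed,
    })
```

For a single game, `to_displacement_events(games_inputs[g_name])` should produce the same three columns as the loop, which makes it easy to compare the two on one replay before switching.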

## Scale frames_elapsed to between 0 and 1
$$\mathrm{norm}(x) = \frac{x}{x_{\max}}$$

In [None]:
max_elapsed = 60 # known maximum: every larger value was reset to 0 in the previous step

for game in game_displacements.values():
    game['frames_elapsed'] = game['frames_elapsed'] / max_elapsed # vectorized; also avoids shadowing the built-in max

game_displacements[all_files[0].split('.')[0]].head(10) # keys are filenames without the extension

## Write the preprocessed data to a CSV

In [None]:
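output_dir = 'dataset/demo/displacements/'
os.makedirs(output_dir, exist_ok=True) # create the output directory if it does not exist yet
for game, displacements in game_displacements.items():
    displacements.to_csv(os.path.join(output_dir, game + '.csv'), index=False)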
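
Before training on the written files, it can be worth reloading one and checking that it looks as expected. The following is a minimal sketch of such a check; the helper name `check_displacement_csv` is hypothetical, and the conditions it tests (the three column names, `frames_elapsed` scaled into the range 0 to 1, and no all-zero displacement rows) are taken from the steps in this notebook.

```python
import pandas as pd

EXPECTED_COLUMNS = ['d_x', 'd_y', 'frames_elapsed']


def check_displacement_csv(csv_path):
    """Reload a preprocessed CSV and raise ValueError if it looks wrong (hypothetical helper)."""
    df = pd.read_csv(csv_path)

    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f'unexpected columns: {list(df.columns)}')

    # frames_elapsed was divided by 60, so it should lie between 0 and 1.
    if not df['frames_elapsed'].between(0, 1).all():
        raise ValueError('frames_elapsed is not scaled to [0, 1]')

    # Every kept row should have movement on at least one axis.
    if not ((df['d_x'] != 0) | (df['d_y'] != 0)).all():
        raise ValueError('found a row with zero displacement')

    return df
```

It takes the path of one written file and returns the reloaded DataFrame when all checks pass.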