# **Age Group Classification using Multiple Touchscreen Gestures**

### Child Safety Project

Tom Birmingham & Dr. Hossain, Southern CT State University, Computer Science Department, New Haven, CT

![University Banner](<https://clubrunner.blob.core.windows.net/00000400324/EventImages/951e9035-3850-4f27-99d4-5a5d4bd17430-southern-connecticut-state-university(1).jpg>)

photo: [connecticut.csteachers.org](https://connecticut.csteachers.org/events/chapter-meeting-november-2019-southern-connecticut-state-university)


# About

This notebook represents the thesis work of graduate student Thomas Birmingham under the advisement of Dr. Hossain, Southern Connecticut State University (SCSU) Computer Science Department.

## Purpose

The purpose of this project is to advance the state of the art in age group classification using multiple touchscreen gestures and parallel fusion. Other students in the project have previously classified users (as child, teen, or adult) from Tap, Zoom, and Swipe gestures using machine learning. This project combines those gestures using three types of parallel fusion to determine which is most effective for age group classification.

## Methodology

To determine the most effective method of parallel fusion for age group classification, we must first match or improve on the earlier classification results for each individual gesture: Swipe, Zoom, and Tap. After classification of each gesture is complete, classification with parallel fusion will be performed at the feature level, score level, and decision level. The performance of each level of fusion will then be assessed to determine the most effective method for age group classification.
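
The three fusion levels named above differ in where the per-gesture information is combined. The following is a minimal sketch rather than the project's implementation: the feature values and class scores are made-up numbers, the function names are hypothetical, and the mean rule and majority vote are only one common choice each for score-level and decision-level fusion.

```python
from collections import Counter

# Hypothetical per-gesture feature vectors for one user (made-up numbers).
features = {
    "tap": [0.22, 1.05],
    "swipe": [310.0, 0.8, 12.5],
    "zoom": [1.4, 0.6],
}

# Hypothetical class scores from three separate per-gesture classifiers.
scores = {
    "tap": {"child": 0.6, "teen": 0.3, "adult": 0.1},
    "swipe": {"child": 0.2, "teen": 0.5, "adult": 0.3},
    "zoom": {"child": 0.5, "teen": 0.1, "adult": 0.4},
}


def feature_level_fusion(features):
    """Concatenate per-gesture feature vectors into one vector for a single classifier."""
    fused = []
    for gesture in sorted(features):  # fixed order so columns line up across users
        fused.extend(features[gesture])
    return fused


def score_level_fusion(scores):
    """Average the per-gesture class scores and pick the class with the highest mean."""
    classes = list(next(iter(scores.values())))
    mean = {c: sum(s[c] for s in scores.values()) / len(scores) for c in classes}
    return max(mean, key=mean.get), mean


def decision_level_fusion(scores):
    """Each gesture classifier votes for its top class; the majority wins."""
    votes = [max(s, key=s.get) for s in scores.values()]
    return Counter(votes).most_common(1)[0][0], votes


print(len(feature_level_fusion(features)))  # length of the fused feature vector
print(score_level_fusion(scores)[0])        # class chosen by the mean rule
print(decision_level_fusion(scores))        # class chosen by majority vote, plus the votes
```

Feature-level fusion needs one classifier trained on the combined vector, while the other two levels reuse the single-gesture classifiers and only combine their outputs.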


# Prerequisites

This notebook was created on an Apple M3 Max with 36 GB of memory. Not everyone has access to such a computer, so this notebook can also be run on Google Colab, where students can access shared compute resources with large amounts of RAM. The project has also been reviewed to reduce the amount of memory required to run these notebooks. For example, after the Child Safety Dataset is loaded, each gesture is stored in a pandas DataFrame and exported to the /Checkpoints directory as a Python "pickle" file. Working from /Checkpoints lets users resume their work without the time and memory cost of reprocessing the raw dataset. You can check how much memory your current runtime has by running the following code block:

In [5]:
# check memory on this runtime (note: virtual_memory().total is the total installed RAM, not the free RAM)
from psutil import virtual_memory

def print_available_memory():
    ram_gb = virtual_memory().total / 1024**3
    print('Your runtime has {:.1f} GB of available RAM\n'.format(ram_gb))
    if ram_gb < 20:
        print('This is not considered a high-RAM runtime')
    else:
        print('This is considered a high-RAM runtime')

print_available_memory()

Your runtime has 36.0 GB of available RAM

This is considered a high-RAM runtime


**Note:** Any variable **not** contained within a function is treated as a global variable. In the above code block, notice that the variable `ram_gb` is contained within the function `print_available_memory()`. Variables contained within a function do not persist after the function returns. Variables outside of functions are global: they persist for the life of the notebook session and keep using memory. While you work with this notebook, try not to create variables outside of function definitions unless they are needed. You can check which variables are in memory and their size in bytes using the following code block:

In [68]:
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
initial_vars = sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith(
    '_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

total = 0
print(f"variable\tsize (bytes)")
print(f"--------\t------------")
for (var, size) in initial_vars:
    total = total + size
    if (size > 160 and var not in ["initial_vars", "total"]):
        print(var, '\t', size)
print(f"\nTOTAL:\t\t{total/1024:.1f} KB")

del ipython_vars, initial_vars, var, size, total

variable	size (bytes)
--------	------------
old_tap_df 	 69211277
tap_df 	 69211277
session_df 	 90447
user_df 	 39910

TOTAL:		135308.3 KB


Run this cell at any time to show which variables are in memory and the total amount of memory they use. Unused variables can be removed with the `del` keyword followed by the variable name. Remove them right after the last cell that uses them!
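
One caveat when reading these sizes: for plain Python containers such as lists and dicts, `sys.getsizeof` counts only the container itself, not the objects it references. The sketch below illustrates this with a hypothetical helper, `deep_size_of_list`, that adds the sizes of a list's elements one level deep. It also shows the pattern recommended above of keeping large temporaries inside a function and deleting them once they are no longer needed.

```python
import sys


def deep_size_of_list(items):
    """Size of the list object plus the sizes of its elements (one level deep only)."""
    return sys.getsizeof(items) + sum(sys.getsizeof(item) for item in items)


def demo():
    rows = ["x" * 1_000_000 for _ in range(3)]  # three strings of about 1 MB each
    shallow = sys.getsizeof(rows)   # counts only the list's own storage
    deep = deep_size_of_list(rows)  # also counts the strings themselves
    del rows                        # drop the only reference so the memory can be reclaimed
    return shallow, deep


shallow, deep = demo()
print(f"shallow: {shallow} bytes, one level deep: {deep} bytes")
```

The exact byte counts depend on the Python version and platform, so no expected output is shown; the shallow figure is a small fraction of the deep one.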

# Data Collection

## Imports and Helper Functions

In [57]:
import pandas as pd
import pickle
import os
import sys
import inspect
from glob import glob


def load_dataframe(df_name):
    filepath = os.path.join("Checkpoints", f"{df_name}.pkl")
    if os.path.exists(filepath):

        # use a name that does not shadow the built-in input()
        with open(filepath, 'rb') as infile:
            df = pd.read_pickle(infile)
            print(f"Loaded {df_name} from {filepath}")
            return df
    else:
        print(f"File {filepath} does not exist.")
        return None


def save_dataframe(df):
    # Create the Checkpoints directory if it doesn't exist
    checkpoints_dir = 'Checkpoints'
    if not os.path.exists(checkpoints_dir):
        os.makedirs(checkpoints_dir)
    
    # Get the variable name of the DataFrame dynamically
    frame = inspect.currentframe().f_back
    variable_name = None
    for name, value in frame.f_locals.items():
        if value is df:
            variable_name = name
            break

    if variable_name is None:
        raise ValueError("Could not determine the variable name of the DataFrame.")
    
    # Create the file path
    file_path = os.path.join(checkpoints_dir, f"{variable_name}.pkl")

    # Save the DataFrame as a pickle file
    with open(file_path, 'wb') as output:
        pickle.dump(df, output)
        print(f"DataFrame saved as {file_path}")
    

def get_dataset_files():
    if 'google.colab' in sys.modules:
        print("Using Google Colab hosted runtime")
        files = get_dataset_files_from_google_drive()
    else:
        print("Using local runtime")
        files = get_dataset_files_locally()
    return files


def get_dataset_files_from_google_drive():
    from google.colab import drive

    # Mount Google Drive
    drive.mount('/content/drive', force_remount=True)

    # Create Symbolic Links to directories 'Child_Safety_Data' and 'Checkpoints'
    ! ln -s "/content/drive/MyDrive/Child Safety Project/Child_Safety_Data"
    ! ln -s "/content/drive/MyDrive/Child Safety Project/Checkpoints"

    # Get files in Child_Safety_Data and output total number of files
    files = glob(r'Child_Safety_Data/**/*.txt', recursive=True)
    print("Total Files:", len(files))
    return files


def get_dataset_files_locally():
    files = glob(r'../Dataset/Child_Safety_Data/**/*.txt', recursive=True)
    print("Total Files:", len(files))
    return files
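
`save_dataframe` derives the checkpoint file name by searching the caller's local variables for a name bound to the object it was given. The sketch below isolates that lookup in a hypothetical helper, `name_in_caller`, to show two consequences of the approach: if two names refer to the same object, the first one found wins, and a value that was never assigned to a name cannot be resolved (which is the case where `save_dataframe` raises `ValueError`).

```python
import inspect


def name_in_caller(obj):
    """Return the first local name in the caller's frame that is bound to obj, else None."""
    caller = inspect.currentframe().f_back
    for name, value in caller.f_locals.items():
        if value is obj:  # identity check, the same test save_dataframe uses
            return name
    return None


def demo():
    tap_rows = [1, 2, 3]
    alias = tap_rows                        # second name for the same object
    found = name_in_caller(tap_rows)        # resolves to 'tap_rows'
    found_alias = name_in_caller(alias)     # also 'tap_rows': the first matching name wins
    found_temp = name_in_caller([1, 2, 3])  # None: a temporary has no name in the caller
    return found, found_alias, found_temp


print(demo())
```

Because the name is taken from the calling scope, `save_dataframe(user_df)` only produces `user_df.pkl` when the caller's variable is actually called `user_df`. Passing the name explicitly would be a more predictable alternative.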

## Get Users (user_df)

In [43]:
def get_user_df():

    # attempt to load dataframe from checkpoint
    user_df = load_dataframe("user_df")

    # if checkpoint does not exist, create and save dataframe
    if (user_df is None):
        user_df = build_user_df()
        save_dataframe(user_df)

    return user_df


def build_user_df():

    user_file_dfs = list() # a list of dataframes, one user for each file in the dataset, to be concatenated into a single dataframe, user_df

    for file in get_dataset_files():

        filename = file.split('/')[-1]
        if "mod0" in filename:

            with open(file) as file:

                # user data to extract from each file
                user_id = ''
                researcher = ''
                grade = ''
                gender = ''
                age = ''
                teacher = ''

                # researcher
                if filename[0] == '_':
                    researcher = 'Carl'
                else:
                    researcher = 'Kate'

                # get data from file
                for line in file:

                    # collect key value pairs from file
                    if "=" in line:
                        # split on the first "=" only, so values that contain "=" do not raise
                        key, value = line.strip().split("=", 1)
                        
                        key = key.lower()
                        if key == "grade":
                            grade = value
                        elif key == "gender" or key == "student":
                            gender = value
                        elif key == "age" or key == "studentage":
                            age = value
                        elif key == "teacher":
                            teacher = value

                # User Unique Identifiers
                if researcher == 'Carl':
                    user_id = filename.split('_')[1]
                elif researcher == 'Kate':
                    user_id = teacher + grade + gender + age

                # Note: Kate recorded data differently than Carl.
                # Carl collected datetime, session number, machine name, grade,
                # gender, age, middle initial, birth month, and birth day.
                # Kate collected datetime, session number, machine name, teacher,
                # grade, gender, and age.
                # Carl generated and assigned a unique identifier to each study participant,
                # but we do not have a unique identifier for each of Kate's participants.
                # We build the most distinctive identifier we can from a combination of
                # teacher, grade, gender, and age, but some of these ids collide, so
                # the data for those participants cannot be used.

                # create user dataframe for each file and append it to a list
                user_data = {}
                user_data['user_id'] = user_id
                user_data['age'] = age
                user_data['grade'] = grade
                user_data['gender'] = gender
                user_data['researcher'] = researcher
                user_file_df = pd.DataFrame(user_data, index=[0])
                user_file_dfs.append(user_file_df)
    
    # concatenate dataframes of user data into a single dataframe
    user_df = pd.concat(user_file_dfs)

    # remove every row whose user id is duplicated and came from Kate (ambiguous participants)
    is_duplicate_kate = (user_df['researcher'] == 'Kate') & user_df.duplicated(subset='user_id', keep=False)
    user_df = user_df[~is_duplicate_kate]

    # Keep first user_id from Carl
    user_df = user_df.drop_duplicates(subset='user_id', keep='first')

    # reset indexes
    user_df = user_df.reset_index(drop=True)

    # set data types
    user_df['age'] = user_df['age'].astype(int)
    user_df['grade'] = user_df['grade'].astype(int)

    # add age group
    user_df['age_group'] = user_df['age'].apply(get_age_group)

    return user_df


def get_age_group(age):
    if age < 13:
        return 'child'
    elif age < 18:
        return 'teen'
    else:
        return 'adult'


# return user_df as global variable
user_df = get_user_df()

Loaded user_df from Checkpoints/user_df.pkl
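
The de-duplication rule in `build_user_df` treats the two researchers differently: any of Kate's composite ids that occurs more than once is dropped entirely, while for Carl's ids only the first row is kept. The sketch below restates that rule in plain Python so it can be followed without pandas. The records are hypothetical (`SMITH10F15` is a made-up id), and the function name is not part of the notebook.

```python
from collections import Counter

# Hypothetical user records, one per mod0 file.
records = [
    {"user_id": "14S0915", "researcher": "Carl"},
    {"user_id": "14S0915", "researcher": "Carl"},     # same Carl id seen in a second file
    {"user_id": "ROY12M17", "researcher": "Kate"},
    {"user_id": "SMITH10F15", "researcher": "Kate"},  # made-up composite id that collides
    {"user_id": "SMITH10F15", "researcher": "Kate"},
]


def deduplicate_users(records):
    """Drop all copies of colliding Kate ids; keep the first row of every other id."""
    counts = Counter(r["user_id"] for r in records)
    seen = set()
    kept = []
    for r in records:
        uid = r["user_id"]
        if r["researcher"] == "Kate" and counts[uid] > 1:
            continue  # ambiguous composite id: we cannot tell the participants apart
        if uid in seen:
            continue  # keep only the first row for this id
        seen.add(uid)
        kept.append(r)
    return kept


print([r["user_id"] for r in deduplicate_users(records)])
```

With these records, one row for `14S0915` and the single `ROY12M17` row survive, and both `SMITH10F15` rows are discarded.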


In [44]:
user_df.head()

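| | user_id | age | grade | gender | researcher | age_group |
|---|---|---|---|---|---|---|
| 0 | 14S0915 | 18 | 14 | M | Carl | adult |
| 1 | 14T1227 | 19 | 14 | M | Carl | adult |
| 2 | NEVOLIS12M18 | 18 | 12 | M | Kate | adult |
| 3 | ROY12M17 | 17 | 12 | M | Kate | teen |
| 4 | REGAN12F17 | 17 | 12 | F | Kate | teen |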


## Get Sessions (session_df)

In [46]:
def get_session_df():

    # attempt to load dataframe from checkpoint
    session_df = load_dataframe("session_df")

    # if checkpoint does not exist, create and save dataframe
    if (session_df is None):
        session_df = build_session_df()
        save_dataframe(session_df)

    return session_df


def build_session_df():
    print("Building session_df from data")

    session_file_dfs = list() # a list of dataframes, one session for each file in the dataset, to be concatenated into a single dataframe, session_df

    for file in get_dataset_files():

        filename = file.split('/')[-1]

        if "mod0" in filename:

            with open(file) as file:

                # expected data in file
                session_id = ''
                user_id = ''
                researcher = ''
                datetime = ''
                device = ''
                device_type = ''
                session = ''
                grade = ''
                gender = ''
                age = ''
                teacher = ''

                # researcher
                if filename[0] == '_':
                    researcher = 'Carl'
                else:
                    researcher = 'Kate'

                # get data from file
                for line in file:
                    if "=" in line:

                        # collect key value pairs from file
                        # split on the first "=" only, so values that contain "=" do not raise
                        key, value = line.strip().split("=", 1)
                        key = key.lower()
                        if key == "colldatetime":
                            datetime = value
                        elif key == "machname":
                            device = value
                        elif key == "session":
                            session = value
                        elif key == "grade":
                            grade = value
                        elif key == "gender" or key == "student":
                            gender = value
                        elif key == "age" or key == "studentage":
                            age = value
                        elif key == "teacher":
                            teacher = value

                session_id = device + "_" + session

                # User Unique Identifiers
                if researcher == 'Carl':
                    user_id = filename.split('_')[1]
                elif researcher == 'Kate':
                    user_id = teacher + grade + gender + age

                # Device Type
                if 'm' in device:
                    device_type = 'tablet'
                elif 'p' in device:
                    device_type = 'phone'

                # create session dataframe for each file and append it to a list
                session_data = {}
                session_data['session_id'] = session_id
                session_data['user_id'] = user_id
                session_data['datatime'] = datetime
                session_data['device'] = device
                session_data['session'] = session
                session_data['device_type'] = device_type
                session_file_df = pd.DataFrame(session_data, index=[0])
                session_file_dfs.append(session_file_df)

    # Create Session DataFrame
    session_df = pd.concat(session_file_dfs)

    # Drop sessions from session dataframe where there is not a unique user id
    print("Removing sessions where there is not a unique user id...")
    unique_user_ids = get_user_df()['user_id'].unique()
    mask = session_df['user_id'].isin(unique_user_ids)
    session_df = session_df[mask]

    # reset index
    session_df = session_df.reset_index(drop=True)

    return session_df


# return session_df as global variable
session_df = get_session_df()

Loaded session_df from Checkpoints/session_df.pkl
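
The session builder reads `key=value` lines from each `mod0` file, lower-cases the keys, and then derives `session_id` and `device_type` from the machine name and session number. The sketch below walks through that derivation on a hypothetical file body. The exact key casing and line layout of the real files are assumptions, and `parse_key_value_lines` and `device_type_of` are illustrative names rather than notebook functions.

```python
def parse_key_value_lines(lines):
    """Collect lower-cased key -> value pairs from lines of the form key=value."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if "=" in line:
            key, value = line.split("=", 1)  # split once so "=" inside a value is kept
            pairs[key.strip().lower()] = value.strip()
    return pairs


def device_type_of(device):
    """Mirror the notebook's rule: 'm' in the machine name means tablet, 'p' means phone."""
    if "m" in device:
        return "tablet"
    if "p" in device:
        return "phone"
    return ""


# Hypothetical mod0 content; the key casing is assumed.
sample = [
    "CollDateTime=Wed 2019-9-04 07:28",
    "MachName=p2",
    "Session=29",
    "Grade=14",
    "Gender=M",
    "Age=18",
]

pairs = parse_key_value_lines(sample)
session_id = pairs["machname"] + "_" + pairs["session"]
print(session_id, device_type_of(pairs["machname"]))
```

Here the sample yields the session id `p2_29` on a phone, matching the shape of the ids shown in `session_df`.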


In [47]:
session_df.head()

| | session_id | user_id | datatime | device | session | device_type |
|---|---|---|---|---|---|---|
| 0 | p2_29 | 14S0915 | Wed 2019-9-04 07:28 | p2 | 29 | phone |
| 1 | p7_38 | 14T1227 | Wed 2019-8-28 07:28 | p7 | 38 | phone |
| 2 | p3_12 | NEVOLIS12M18 | Fri 2019-2-01 10:48 | p3 | 12 | phone |
| 3 | p7_8 | ROY12M17 | Thu 2019-1-31 11:59 | p7 | 8 | phone |
| 4 | p1_17 | REGAN12F17 | Wed 2019-2-06 11:12 | p1 | 17 | phone |


## Get Tap Data (tap_df)

In [50]:
def get_tap_df():

    # attempt to load dataframe from checkpoint
    tap_df = load_dataframe("tap_df")

    # if checkpoint does not exist, create and save dataframe
    if (tap_df is None):
        tap_df = build_tap_df()
        save_dataframe(tap_df)

    return tap_df


def build_tap_df():
    print("Building tap_df...")
    
    # Initialize Data Structures
    file_df = pd.DataFrame()
    file_dfs = list()
    tap_df = pd.DataFrame()
    session_df = get_session_df()

    for file in get_dataset_files():

        # get filename
        filename = file.split('/')[-1]

        # get session id
        session_id = filename.split('_')[-4] + '_' + filename.split('_')[-3]

        # Process mod5 (tap) files if session captured
        if ('mod5' in filename) and (session_id in session_df['session_id'].unique()):

            with open(file) as file:

                fileDictionary = {}

                for line in file:
                    line = line.strip()

                    if line:
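                        # Add key and value list to dictionary
                        # (require ": " so lines with a bare ":" are skipped instead of raising)
                        if ": " in line:
                            # split the line into key and values on the first ": " only
                            key, values = line.split(': ', 1)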
                            # split values into list
                            values = values.split(' ')
                            # add the key-value pair to the dictionary
                            fileDictionary[key] = values

                # Build file dataframe from dictionary
                file_df = pd.DataFrame.from_dict(fileDictionary, orient='index').transpose()

                # Release memory of fileDictionary TODO: Why do we need to do this? Is it necessary?
                fileDictionary = {}

                # Insert user_id and session_id from session dataframe
                user_id = session_df.loc[session_df['session_id'] == session_id, 'user_id'].iloc[0]
                file_df.insert(0, 'session_id', session_id)
                file_df.insert(1, 'user_id', user_id)

                # Append to list of file dataframes
                file_dfs.append(file_df)

    # Concatenate dataframes from files into single dataframe
    tap_df = pd.concat(file_dfs)

    # Clean dataframe
    tap_df = clean_tap_df(tap_df)

    return tap_df


def clean_tap_df(tap_df):
    # Rename Columns
    tap_df = tap_df.rename(columns={
        'Time': 'mille',
        'Tapped': 'tapped',
        'Action': 'action',
        'X': 'x',
        'Y': 'y',
        'Size': 'size',
        'Pressure': 'pressure'
    })

    # Replace Actions
    tap_df['action'] = tap_df['action'].replace(to_replace={
        '0': 'DOWN-1ST',
        '2': 'MOVE',
        '1': 'UP-LAST'
    })

    # Generate unique Event IDs for all events starting with 'DOWN-1ST'
    tap_df.insert(2, 'event_no', (tap_df['action'] == 'DOWN-1ST').cumsum())

    # Specify Data Types
    tap_df = tap_df.astype({
        'mille': 'int64',
        'tapped': 'int32',
        'x': 'float64',
        'y': 'float64',
        'size': 'float64',
        'pressure': 'float64'
    })

    return tap_df


# return tap_df as global variable
tap_df = get_tap_df()

Loaded tap_df from Checkpoints/tap_df.pkl
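
`clean_tap_df` numbers touch events with `(tap_df['action'] == 'DOWN-1ST').cumsum()`: every `DOWN-1ST` row starts a new event, and the rows that follow inherit that number until the next `DOWN-1ST`. The sketch below is a plain-Python restatement of the same running count on a hypothetical action sequence; `number_events` is an illustrative name, not a notebook function.

```python
from itertools import accumulate


def number_events(actions):
    """Running count of 'DOWN-1ST' rows, so each touch event gets its own number."""
    return list(accumulate(1 if action == "DOWN-1ST" else 0 for action in actions))


# Hypothetical action column covering two taps.
actions = ["DOWN-1ST", "MOVE", "MOVE", "UP-LAST", "DOWN-1ST", "UP-LAST"]
print(number_events(actions))  # [1, 1, 1, 1, 2, 2]
```

Because the count runs over the concatenated dataframe, event numbers keep increasing across sessions and users rather than restarting at 1 for each file.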


In [51]:
tap_df.head()

| | session_id | user_id | event_no | mille | tapped | action | x | y | size | pressure |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p7_38 | 14T1227 | 1 | 2508201 | 1 | DOWN-1ST | 186.7356 | 87.57617 | 0.224609 | 1.05 |
| 1 | p7_38 | 14T1227 | 1 | 2508228 | 1 | MOVE | 186.7356 | 87.57617 | 0.223633 | 1.0625 |
| 2 | p7_38 | 14T1227 | 1 | 2508245 | 1 | MOVE | 186.7356 | 87.57617 | 0.223633 | 1.05 |
| 3 | p7_38 | 14T1227 | 1 | 2508262 | 1 | MOVE | 186.7356 | 87.57617 | 0.224609 | 1.0 |
| 4 | p7_38 | 14T1227 | 1 | 2508278 | 1 | UP-LAST | 186.7356 | 87.57617 | 0.224609 | 1.0 |
