<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://bsethwalker.github.io/assets/img/clemson_paw.png">
</div>

## Course Project (Online Workers) - Checkpoint 1

**Clemson University**<br>
**Fall 2023**<br>
**Instructor(s):** Nina Hubig <br>
**Project Team:**
<ul>
    <li>David Croft <dcroft@g.clemson.edu></li>
    <li>Stephen Becker <sgbecke@g.clemson.edu></li>
    <li>Tony Hang <qhang@g.clemson.edu></li>
    <li>Zachary Trabookis <ztraboo@clemson.edu></li>
</ul>

---



In [12]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://bsethwalker.github.io/assets/css/cpsc6300.css").text
HTML(styles)

## Summary Goals

* Summary of the data set that, at a minimum, answers the following questions: What is the unit of analysis? How many observations in total are in the data set? How many unique observations are in the data set? What time period is covered?
  
* Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?
  
* Description of outcome with an appropriate visualization technique.
  
* Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.
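
The first summary goal can be checked directly in pandas. The following is a minimal sketch, assuming an event-log dataframe with the `user` and `time` columns used later in this notebook; `summarize_event_log` and the `demo` frame are hypothetical names and made-up rows, not part of the project data.

```python
import pandas as pd


def summarize_event_log(df: pd.DataFrame) -> dict:
    """Hypothetical helper: basic counts and the time period covered by an event log."""
    times = pd.to_datetime(df["time"])
    return {
        "total_observations": len(df),
        # Rows that are exact duplicates of an earlier row are counted once.
        "unique_observations": len(df.drop_duplicates()),
        "unique_users": df["user"].nunique(),
        "period_start": times.min(),
        "period_end": times.max(),
    }


# Made-up rows for illustration only; the last two rows are exact duplicates.
demo = pd.DataFrame({
    "c1": [1, 2, 2],
    "user": ["u1", "u2", "u2"],
    "time": ["2020-05-08T22:06:40", "2020-05-09T01:00:00", "2020-05-09T01:00:00"],
})
summary = summarize_event_log(demo)
print(summary)
```

On the real data the `time` strings may mix values with and without fractional seconds, because `isoformat()` omits the fraction when it is zero. In pandas 2.x, passing `format="ISO8601"` to `pd.to_datetime` is one way to handle that.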

In [13]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
# Set the max columns to none. This allows all the columns to display for the dataframes.
pd.set_option('display.max_columns', None)
pd.set_option('mode.chained_assignment', None)

from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import mean_squared_error

import warnings

In [14]:
from datetime import datetime, date

# Generic functions for cleaning the data
def convert_epoch_time_to_datetime(epoch_time):
    """
    Takes an epoch timestamp and converts it to datetime format
    Ref: 
    https://www.pythonforbeginners.com/basics/convert-epoch-to-datetime-in-python
    https://stackoverflow.com/questions/49710963/converting-13-digit-unixtime-in-ms-to-timestamp-in-python 
    """

    try:
        # Divide by 1,000 to convert milliseconds to seconds.
        return datetime.fromtimestamp(int(epoch_time) / 1000).isoformat()
    except ValueError:
        # The value is not an integer epoch time (e.g. `NaN`, or a string that is
        # already in `iso` format), so return it unchanged.
        return epoch_time

# Testing the epoch time conversion
# convert_epoch_time_to_datetime(1588990000000) # '2020-05-08T22:06:40'
# convert_epoch_time_to_datetime(1588994215395)   # '2020-05-08T23:16:55.395000'
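
As an alternative to mapping `convert_epoch_time_to_datetime` over every row, pandas can convert a whole column of millisecond timestamps at once. This is a sketch only; `epoch_ms_to_utc` is a hypothetical helper name. Note that it returns UTC timestamps, whereas `datetime.fromtimestamp` uses the local time zone of the machine, so the example values in the comments above appear shifted by the local UTC offset.

```python
import pandas as pd


def epoch_ms_to_utc(series: pd.Series) -> pd.Series:
    """Hypothetical helper: convert epoch milliseconds to timezone-aware UTC timestamps."""
    # Values that cannot be read as numbers become NaN and then NaT.
    numeric = pd.to_numeric(series, errors="coerce")
    return pd.to_datetime(numeric, unit="ms", utc=True)


converted = epoch_ms_to_utc(pd.Series([1588990000000, 1588994215395]))
print(converted)

# Non-numeric input is coerced to NaT instead of raising.
bad = epoch_ms_to_utc(pd.Series(["not a timestamp"]))
print(bad)
```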


In [15]:
import json
from ast import literal_eval

import flatdict

from collections.abc import MutableMapping

def flatten_json_column(json_cols):
    """
    Flatten the JSON string from the `c4 (extra)` column into a one-row dataframe.

    If the value is not a string (e.g. `NaN`), a one-row dataframe of `None` values with
    the expected columns is returned, so every event still produces exactly one row.

    :param json_cols: raw `c4` value for one event (a JSON string, or `NaN` when empty)
    :return: one-row pandas dataframe whose column names carry the `c4.` prefix

    Ref: 
    https://github.com/vvgsrk/ParseCSVContainsJSONUsingPandas/tree/main
    https://avithekkc.medium.com/how-to-convert-nested-json-into-a-pandas-dataframe-9e8779914a24
    """

    # Make sure to sort the `na_positions` last because this could affect how many of the
    # nested column values are shown. If the nested column value is `NaN` first then
    # nothing will get populated for those nested column fields. (e.g. `c4_project.hit_requirements`)
    # Note: Comment out the fields that you don't want to show up in the final dataframe.
    struct_data_json = {
        "task_id": [None],
        "assignment_id": [None],
        "accepted_at": [None],
        "deadline": [None],
        "time_to_deadline_in_seconds": [None],
        "state": [None],
        "question.value": [None],
        "question.type": [None],
        # "question.attributes": [None],
        "question.attributes.FrameSourceAttribute": [None],
        "question.attributes.FrameHeight": [None],
        "project.hit_set_id": [None],
        "project.title": [None],
        "project.requester_id": [None],
        "project.requester_name": [None],
        "project.description": [None],
        "project.assignment_duration_in_seconds": [None],
        "project.creation_time": [None],
        "project.assignable_hits_count": [None],
        "project.latest_expiration_time": [None],
        "project.caller_meets_requirements": [None],
        "project.caller_meets_preview_requirements": [None],
        "project.last_updated_time": [None],
        "project.monetary_reward.currency_code": [None],
        "project.monetary_reward.amount_in_dollars": [None],
        # "project.hit_requirements.qualification_type_id": [None],
        # "project.hit_requirements.comparator": [None],
        # "project.hit_requirements.worker_action": [None],
        # "project.hit_requirements.qualification_values": [None],
        # "project.hit_requirements.caller_meets_requirement": [None],
        # "project.hit_requirements.qualification_type.qualification_type_id": [None],
        # "project.hit_requirements.qualification_type.name": [None],
        # "project.hit_requirements.qualification_type.visibility": [None],
        # "project.hit_requirements.qualification_type.description": [None],
        # "project.hit_requirements.qualification_type.has_test": [None],
        # "project.hit_requirements.qualification_type.is_requestable": [None],
        # "project.hit_requirements.qualification_type.keywords": [None],
        # "project.hit_requirements.caller_qualification_value.integer_value": [None],
        # "project.hit_requirements.caller_qualification_value.locale_value.country": [None],
        # "project.hit_requirements.caller_qualification_value.locale_subdivision": [None],
        "project.requester_url": [None],
        "expired_task_action_url": [None],
        "task_url": [None]
    }

    def _flatten_dict(d: MutableMapping, sep: str= '.') -> MutableMapping:
        """
        Flatten a nested dict into a single-level dict whose keys are joined by `sep`.
        """
        [flat_dict] = pd.json_normalize(data=d, sep=sep, max_level=None).to_dict(orient='records')
        return flat_dict

    try:
        df_temp = pd.DataFrame(struct_data_json)

        # If c4 `nan` value is passed, do nothing except return the empty template dataframe.
        # If c4 has a string dictionary, then build a new dataframe from it.
        if isinstance(json_cols, str):
            # Convert the input (str) to (dict) type.
            # Flatten the dictionary with `json_normalize` before building the dataframe.
            dict_json_flattened = _flatten_dict(json.loads(json_cols))

            # Explicitly remove this column because it's a nested list and is hard to flatten.
            # Plus this column doesn't have any values that we need for our model.
            # `pop` with a default avoids a KeyError when the key is absent.
            dict_json_flattened.pop("project.hit_requirements", None)

            df_temp = pd.DataFrame([dict_json_flattened])
            
    except json.JSONDecodeError as e:
        print(f"flatten_json_column: Invalid JSON argument passed - json.JSONDecodeError: {e}")

    # Return dataframe with flatten columns and 'c4.' prefix.
    return df_temp.add_prefix('c4.')
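
To illustrate what `_flatten_dict` does, here is a small sketch that runs `pd.json_normalize` on a made-up payload. The `sample` JSON string is hypothetical and only borrows a few key names from the template above. Nested dicts are expanded into dotted keys, while a list value such as `project.hit_requirements` is left as a list, which is why the function drops it.

```python
import json

import pandas as pd

# Hypothetical c4 payload, for illustration only.
sample = json.dumps({
    "task_id": "T1",
    "project": {
        "title": "Demo",
        "monetary_reward": {"amount_in_dollars": 0.05},
        "hit_requirements": [{"comparator": "EqualTo"}],
    },
})

# json_normalize returns a one-row dataframe; unpack that row as a dict.
[flat] = pd.json_normalize(json.loads(sample), sep=".").to_dict(orient="records")
print(flat)
```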


### Cleaning: Reading in the telemetry data (Amazon Mechanical Turk (AMT))
We read in and clean the data from `amazon_mechanical_turk_records.csv`.

#### Background Research Paper
Research Paper: *Quantifying the Invisible Labor in Crowd Work*

https://dl.acm.org/doi/abs/10.1145/3476060?casa_token%3Dw4mZH0IjVgsAAAAA:XBgWg_Oq0TtNVqH8SzCxl2fXU_fZ9bzQ6g22QkI0odMy5NKW2EJdYrOaqxu_2NIqJs-rA_sM1sbT

A browser plugin was used to collect the data. 
- https://github.com/GigPlatform/toloka-web-extension (has not been used for any analysis on any paper)
- https://github.com/anonym-research/invisible-labor (was used for an invisible labor analysis, but no predictive analysis has been performed)

**List of available variables:** 

- **c1 (ID)**: continuous
- **c2 (current)**: url, site visited by the worker (while working)
- **c3 (event)**: categorical, 18 values (['PAGE_LOAD', 'PAGE_BLUR', 'TAB_CHANGE', 'PAGE_FOCUS', 'PAGE_CLICK', 'PAGE_SCROLL', 'PAGE_LAST', 'PAGE_CLOSE', 'INTERNALURL', 'PAGE_KEY', 'PAGE_INACTIVITY', 'TAB_CLOSED', 'EXTERNALURL', 'PAGE_REACTIVATE', 'SYSTEM_DISABLED_WORKING', 'SYSTEM_ENABLED_WORKING', 'SYSTEM_ENABLED', 'SYSTEM_DISABLED'])
  - The web browser plugin recorded multiple events. The most relevant is PAGE_LOAD; other events can provide repetitive information.
- **c4 (extra)**: json object {task_id, assignment_id, ...} – may include NaN values
  - It provides a JSON object with the specifics of the task. It only has values for certain events.
  - Are there any values out of this list that we should pay particular attention to?
    - Yes, I recommend parsing the JSON so you can find information about the task, including how much was paid.
- **c5 (platform)**: categorical, 5 values ['OTHER', 'MTURK', 'FIVERR', 'UPWORK', 'FREELANCER']
  - The work platform on which the worker was working (it is usually constant)
- **c6 (skip)**: categorical, 2 values (0: not complete, 1: complete) – may include NaN values
  - That was a field that was not used
  - You mentioned skip. Does this represent if the task was completed or skipped?
    - I do not remember the purpose of that field, maybe was not used
- **c7 (subtype)**: categorical, 29 values ['OTHER', 'TASK_STARTED', 'ADDED_TASK', 'TASK_SUBMITED', 'FINISHED_TASK', 'TASKS_LIST', 'WORKER_DASHBOARD', 'UNKNOWN', 'TASK_FRAME', 'TASK_PREVIEW', 'TASK_INFO', 'TASK_RETURNED', 'PLATFORM_LOGIN', 'TASK_QUEUE', 'TASK_SKIP', 'WORKER_EARNINGS_DETAILS', 'TASK_TIMEOUT', 'WORKER_EARNINGS', 'WORKER_QUALIFICATIONS', 'TASKS_PER_REQUESTER', 'MESSAGES_SEND', 'TASKS_LIST_FILTER', 'WORKER_QUALIFICATIONS_PENDING', 'TASKS_PREVIEW', 'PLATFORM_HELP', 'TASKS_PROJECTS', 'TASKS_DETAILS', 'MESSAGES_READ', 'TASKS_APPLY']
  - It defines if a worker is listing the tasks available, if a task just started, or if a task was completed (submitted)
  - Is there a description that we can lookup to find what these event values mean?
    - FINISHED_TASK == TASK_SUBMITED: both refer to a completed task (submitted is logged when the task was just submitted, finished when the next URL was loaded)
    - TASK_RETURNED When the worker decided not to work on a task
- **c8 (time)**: continuous, datetime (milliseconds), 1970 start date Unix Time (Week, Month, Day, Hours, Minutes, Seconds)
  - This is a timestamp in milliseconds. You have to convert to a date, it contains day, month, year, hour, minute, second.
  - You can use any function that converts from timestamp to datetime.
  - What does this time represent? Task completed?
    - Time of the event, remember that this is an event log, every event happened at this time. Time series analysis is a common approach to use.
  - Does this time represent when the worker did the event recorded in c9 (type)?
    - Worker was working 
    - Communicating by sending messages in the platform 
    - Searching for tasks
    - Visiting their profile
    - That is correct. It is the time at which the event was recorded.
- **c9 (type)**: categorical, 10 values ['OTHER', 'WORKING', 'LOGS', 'SEARCHING', 'PROFILE', 'UNKNOWN', 'REJECTED', 'COMMUNICATION', 'LEARNING', 'PROPOSAL']
  - It identifies whether, at that time, the worker was working, communicating by sending messages in the platform, searching for tasks, or visiting their profile. Other types mean that the worker changed the CONFIG of the web plugin or that the Toloka API retrieved new tasks.
  - Do you have a description for all these events?
    - I do not have a data dictionary, but these are separated by the worker's events in the interface: the worker was working, communicating by sending messages in the platform, searching for tasks, or visiting their profile. The other events do not represent worker activity but a state of the plugin; they only carry extra information from the web plugin.
- **c10 (user)**: categorical, 120 values unique – Todo: Need to verify this is correct field value.
  - User ID
- **c11**: Not relevant (an activity was taken after being recommended)
- **c12**: Not used
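
Because the data is an event log, one simple way to turn it into durations is to measure the gap between each event and the same user's next event, then total those gaps by `c9 (type)`. The sketch below is only an illustration under that assumption, with made-up rows; it is not the method used in the paper or later in this project.

```python
import pandas as pd

# Made-up event rows for illustration only.
events = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "time": [
        "2020-05-08T22:00:00", "2020-05-08T22:00:30", "2020-05-08T22:02:00",
        "2020-05-08T23:00:00", "2020-05-08T23:00:10",
    ],
    "c9": ["SEARCHING", "WORKING", "PROFILE", "WORKING", "SEARCHING"],
})
events["time"] = pd.to_datetime(events["time"])

# Events must be in time order within each user before taking differences.
events = events.sort_values(["user", "time"])

# Seconds until the same user's next event; the last event of each user gets NaN.
next_time = events.groupby("user")["time"].shift(-1)
events["seconds_to_next"] = (next_time - events["time"]).dt.total_seconds()

# Total seconds attributed to each event type (NaN gaps are ignored by sum).
seconds_by_type = events.groupby("c9")["seconds_to_next"].sum()
print(seconds_by_type)
```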

In [16]:
import os

def flatten_user_to_csv(user_id: str, df_user: pd.DataFrame, csv_output_path: str=""):
    """
    Take the rows of the Amazon MT dataset for a particular user (e.g. `ae862298385abab2a0a1619f8cedef9d`),
    convert the `time` column from epoch milliseconds to ISO datetime, flatten most dict values of the
    `c4 (extra)` column into separate columns, and write the result out to a *.csv.

    Parameters
    user_id (str): c10 (user) field in the original dataset
    df_user (pd.DataFrame): rows of the original dataset that belong to the user.
    csv_output_path (str): path of the flattened *.csv to write for the user.
    """

    # Read the data into a dataframe and flatten 'c4 (extra)' column to new columns.

    # # Suggest writing this transformed data out to a file to read in that transformed file for further processing.
    # df_temp_user = pd.read_csv(FLATTENED_USER_CSV_PATH, encoding='utf-8', header="infer")

    # df_temp_user.sort_values(by=['user', 'time'], ascending=[True, True]).to_csv(FLATTENED_USER_CSV_PATH,
    #                                                                              encoding='utf-8', header=True, columns=columns, index=False, mode="w")
    
    # Convert the milliseconds c8 (time) to an iso datetime format.
    # For some reason writing out to temporary csv here helps when joining with the c4 (extra) flattened information.
    # If we forget this step the concat of c4 (extra) to the original dataframe gets messed up.
    # ------------------------------------------------------------------------------------
    TIMECONVERTED_USER_CSV_PATH = f"../data/amazon_mechanical_turk_records_{user_id}.csv"

    # Convert the epoch timestamp to datetime and write out to temporary csv file.
    df_user['time'] = df_user.time.map(convert_epoch_time_to_datetime)
    df_user.to_csv(TIMECONVERTED_USER_CSV_PATH, encoding='utf-8', header=True, index=False, mode="w")

    # Dropping the Dataframe rows to avoid duplicates when reading in again from csv after transforming the epoch timestamp.
    df_user.drop(df_user.index, inplace=True)

    # Read in the transformed epoch time values and perform further processing with c4 (extra) column.
    df_user = pd.read_csv(TIMECONVERTED_USER_CSV_PATH, encoding='utf-8', header="infer")
    os.remove(TIMECONVERTED_USER_CSV_PATH)
    # print("Removed temporary time converted file {TIMECONVERTED_USER_CSV_PATH}.")
    # ------------------------------------------------------------------------------------

    df_c4_flattened = pd.DataFrame()
    for i, j in df_user.iterrows():
        if j["c4"]:
            # https://stackoverflow.com/questions/33094056/is-it-possible-to-append-series-to-rows-of-dataframe-without-making-a-list-first
            series_temp = flatten_json_column(j["c4"]).iloc[0].to_frame().T
            df_c4_flattened = pd.concat([df_c4_flattened, series_temp], ignore_index=True) 

    df_user = pd.concat([df_user, df_c4_flattened], axis="columns")
    del df_user["c4"]   

    # Data is sorted by c8 (time) field to ensure the events are in order.
    # This sorting by c8 (time) is important before we perform calculations on invisible labor time.
    df_user.sort_values(by=['time'], ascending=[True], inplace=True)

    # Write out the flattened user csv information.
    df_user.to_csv(csv_output_path, encoding='utf-8', header=True, index=False, mode="w")

    print(f"Created flattened csv {csv_output_path} for user {user_id}.")
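
The comments in the cell above note that the temporary-CSV round trip is needed for the `c4` concat to line up. One plausible explanation (an assumption, not something the notebook confirms) is index alignment: the rows filtered for one user keep their original index labels, while the flattened `c4` frame is numbered from 0, and `pd.concat(..., axis="columns")` matches rows by index label. Reading the data back from CSV resets the index, and `reset_index(drop=True)` would have the same effect. The frames below are made up for illustration.

```python
import pandas as pd

# Made-up data: rows for user "a" sit at index labels 0 and 2.
df_all = pd.DataFrame({"user": ["a", "b", "a", "b"], "c1": [10, 11, 12, 13]})
df_user = df_all[df_all.user == "a"]

# The flattened frame is built row by row, so its index is 0, 1.
df_extra = pd.DataFrame({"c4.task_id": ["T1", "T2"]})

# Index labels differ, so the concat produces extra rows filled with NaN.
misaligned = pd.concat([df_user, df_extra], axis="columns")

# Resetting the index makes both frames share labels 0, 1.
aligned = pd.concat([df_user.reset_index(drop=True), df_extra], axis="columns")

print(misaligned)
print(aligned)
```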

In [17]:
import os

def transform_users_to_flattened_csv(csv_output_path: str="", limit_users: int=None, debug: bool=False):
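    """
    Read in the original Amazon MT dataset and write out one flattened *.csv per user
    (e.g. `ae862298385abab2a0a1619f8cedef9d`). For each user the epoch `time` column (c8) is
    converted to ISO datetime and the `c4` column is flattened (see `flatten_user_to_csv`).

    Note: `csv_output_path`, `limit_users` and `debug` are only used by the commented-out
    code below; the active code processes the hard-coded `unique_users` list.
    """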

    DATASET_AWS_MTURK_CSV_PATH = "../data/amazon_mechanical_turk_records.csv"

    # Define columns for data
    columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'time', 'c9', 'user']

    # Read the data into a dataframe for further manipulation.

    # https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function
    # iter_csv_default = pd.read_csv(DATASET_AWS_MTURK_CSV_PATH, encoding='utf-8', header=None, names=columns, low_memory=False, iterator=True, chunksize=1000)
    # df_default = pd.concat([chunk[chunk['user'] == "ae862298385abab2a0a1619f8cedef9d"] for chunk in iter_csv_default])

    # 4m 18.8s – Suggest writing this transformed data out to a file to read in that transformed file for further processing.
    df_default = pd.read_csv(DATASET_AWS_MTURK_CSV_PATH, encoding='utf-8', header=None, names=columns, low_memory=False)
    
    # # Grab number of unique users based on dataset provided.
    # if limit_users is not None:
    #     unique_users = df_default['user'].dropna().unique().tolist()[:limit_users]
    # else:
    #     # Grab all users if we not limiting number of users.
    #     # Warning: Depending on how large the dataset is this could take a while to process all users.
    #     # Recommend limiting the numbers of users before running all.
    #     unique_users = df_default['user'].dropna().unique().tolist()
    unique_users = [
        "4bc91376a0d164c5594c524788723d",
        "8697ae35eb3bf8ce8d5669260897895",
        "ffb0c8b77e4f393844efa24f5254e9b",
        "ce19344b46e1aa94598dfe0e36e40",
        "ba633f7f73ff50c74c8539f09a8e55e",
        "612d686264ac53a67ba88dec91238837",
        "d94eed14283e3cf95445a2d2d0c732ea",
        "c45ae0d5903111f7536eff13443b121f",
        "b3768cd510cd9563f4bbb8426b943c81",
        "3a305b70db8874759c2dfc4228ae05f",
        "c11a1f6f688f48c521d071ae29570e6",
        "f2687f9a5c9a26cc8936b2d93be8ec8",
        "c35975cf36e8a25076da28a82ce70be",
        "2a1d22d7101b7b66e06530c86d1b57b0",
        "ae1d53c23b4e45aa85e3b668bef64ab",
        "e4629593798243d93e1d3b8af9912110",
        "8bf2181261573e7e329c40614c279719",
        "19a4b3e35b14fdfd95020d7a8483a4",
        "5b93a634b0dfadc05a938f9f37677b4",
        "1164fceb602e16e44dc3f6e49ec51",
        "de6f341f486569af7f658f13a698ea4e",
        "82f7a1a4be3dff5b63c145c97ba748a1",
        "99bce524fe0a27863aba87f905e34d6",
        "fc66131fcea7f4f2f80278189ed1f8",
        "e61f4320ad212464d4ee2c764a993b",
        "1fbbe51c2b7814f7810188ddc19f4ed",
        "f4b1f58182ccaef03a28fed5539a89f9",
        "c99687531ab1967a51302de2d4572",
        "b8e3eb35cabedb645e1a279ad96d567",
        "01ee4a227223252d9dd5862d122f9e",
        "4912f382b67922416cfea2359a828d1",
        "a87398e4bd28fe7b8077886098c6e",
        "32d3b5ecde232650c24cd36bd3ad2d1d",
        "f3d82fd8fa6a5eb7555c0abf43edd23",
        "a0f699a13a335bda7a73115e5f4ee08f",
        "442f53cfa6a817e452a1f769f3c5d2",
        "aaf05f30ee38781cae9bbca0b5acdac0",
        "8eacce56d38f1f69950fb77cd1537",
        "9eabb6c86fa8b6152aa3a6588581edd5",
        "c2f95332a399afb229355bd932754646",
        "ce3faf44e77f5dda69a3a6633621ab",
        "88a8c56fc6b65633e7a1975bc29f1fec",
        "dcff9ab0bc8c3fa1a9ee573585a5889",
        "f0bd27d3aaa7b49f4edfc5f72f6d99",
        "fef9998b1fa31a786993019be5461fa",
        "cabad7e3a75eb660956c8bf6cbf3c07",
        "2e259988387deea78754c4cde4cc8cc",
        "ed8a6079f5230fa4f787c72b1993bba",
        "dc79c0907046d4df5ca45b213681dc98",
        "fd89522795bd36f4ee4ae5f0a11c71",
        "7f2f41e71876dd8a981dd7c7bb5fa3f1",
        "fccf29e7a47d097ee4695db8aa1bf5",
        "baa323f53929015db7a2c69d97bb4d9",
        "7c8f72c6eb861ce6fd35a6debce1161b",
        "2fa31ea51f645df2cb9aa14b58c4183b",
        "c2d79d4e92f4215c93f17ea1acf38798",
        "caf59447d7eb2954a3657b638b590c4",
        "74cd6c2ee9d9709ce0a534388793de",
        "bc42df9a3373d27be9780fd6ae128",
        "ceef928ee76bf9a879d35998e60a4be",
        "2ecafc36ff07621e0ce652b264b8bdf",
        "623963faa7268752f2bad8b37b7585",
        "d73b4fa896b26f5a91a1f41d7b5fdd1",
        "f04ca22c35c85fe6562a3b4d159fb79",
        "93c4b02efd6dd4e0f7a9a3d3a6466670",
        "14c6ecb74490e12bb7f8108cb4a9c34e",
        "8ade05afb024cb45efadebcdec7a32",
        "dc6e62b296f81ec347d5de0911fa719",
        "d7e17849a338aad1a22d3ec665846d4f",
        "4bc966abc2d6aead5bfbb5e47fbab",
        "2be6ab2acaf924bc5adbc3f979b87",
        "c137b2a837ec10c59778c65338d882",
        "28cda45766dd7647cbd9ff4890fadd",
        "dc5b9a35c8d99453ebcf929a9b38cc",
        "9a13864813d25c911b8d107462f1aa",
        "10a76d9ce5e5dfee6352614a9b18b6bc",
        "241b12459cdd203778fd3d94e563a712",
        "5356644850542c7e38a29248ab1f1b61",
        "ab6c2eab2d5214e4a5205e2b7aa8db",
        "25a05b19f637703073d22de66e69fc21",
        "858afc5d972fc3a567d3d2a1d0e44698",
        "66c189ce8436ff6335731bc1159f7f",
        "42bcd2adfa6d5d6a493259e1cb2d9f",
        "eea8927de66f1de3fa354ba4eb95a20",
        "83fbc02785ea9b9d6951be0e201ee1",
        "6aba944ddaef676f885c762196691d4",
        "83e55147fca340899e2119ad2c2820c6",
        "f0ac4982ad88a5b45e5923a71242145",
        "cb56c5429418696a451c3b4cfcec8d6",
        "86e0382753203b338b5117a461471080",
        "b5437f47f866feced73d66ff3db3e4",
        "935a391af179718263b18f56a2bf36f2",
        "3019198226cdb429651d3758afa8b59b",
        "47a95b7352b2928f249f434d10ee32",
        "7dd3d86e412c9c22b2d48acb2a3924a0",
        "80ab8240161d8f53a67d927ef37c225e",
        "586b9c67fff494279a99be871455838",
        "3eb9e4726656aef256b9d916f0c7",
        "f0d0c143cb4df66a96b931fc14ae7c",
        "75b1987cc42221d4a11e74244fdc9ee",
        "84da44b85948b817e5945afde6f0fb10",
        "34dc96ce13cdc44f897e25eb4398079",
        "9cfee018ac43b24823ace44ea835f2"
    ]
    # unique_users = [
    #     "ae862298385abab2a0a1619f8cedef9d",
    #     "149c64b9f9b890bcf32bd2dcf595fd",
    #     "49e7482b6cda157f388c73b3bcc2ebfc",
    #     "6c418c234637a262f4b3cc8fbb2ab683",
    #     "13327b278744b9997a995fbfcc83d9e",
    #     "376b1526a10bc9d9ca8a71b51d951f6",
    #     "f0dacff155d4665310aa9c1b5b76c6c",
    #     "d7a1761ca2820df428de327256d16",
    #     "2cd3cfa492db635891e3ea3b402982",
    #     "bdb1e5a4546e1de80f3e8d65ed5a81",
    #     "1c4a09c264cc784a227191d775b213b",
    #     "112a14c770ec5b7f907fd95b5c13c",
    #     "7c70f88eb8d1574e2bfd7b7b883c3a64",
    #     "f7894afaa9c87d30f7b8a102e92479d",
    #     "5ba2c2bc7b3bb19711ede5857611abc",
    # ]

    # unique_users = ["ae862298385abab2a0a1619f8cedef9d", "149c64b9f9b890bcf32bd2dcf595fd", "49e7482b6cda157f388c73b3bcc2ebfc"]

    # Write out all flatten files per user.
    for user_id in unique_users:
        FLATTENED_USER_CSV_PATH = f"../data/amazon_mechanical_turk_records_{user_id}_flattened.csv"
        # df_temp_user = pd.DataFrame()
        # df_temp_user = df_default[df_default.user == user_id].copy(deep=True)

        # df_temp_user.sort_values(by=['user', 'time'], ascending=[True, True]).to_csv(FLATTENED_USER_CSV_PATH,
        #             encoding='utf-8', header=True, columns=columns, index=False, mode="w")
        
        # Write out user. Perform a copy of the dataset into 'df_user' to prevent altering the original 'df_default'.
        flatten_user_to_csv(
            user_id,
            df_user=df_default[df_default.user == user_id].copy(),
            csv_output_path=FLATTENED_USER_CSV_PATH
            )
    
    # # Combine all flattened user files into one.
    # if csv_output_path:           
    #     df_flatten_users = pd.concat(
    #         [
    #             pd.read_csv(
    #                 f"../data/amazon_mechanical_turk_records_{user_id}_flattened.csv", encoding='utf-8', header="infer"
    #                 ) for user_id in unique_users
    #         ], ignore_index=True
    #     )

    #     # sort_values(by=['user', 'time'], ascending=[True, True])
    #     df_flatten_users.to_csv(csv_output_path, encoding='utf-8', header=True, index=False, mode="w")
    #     print(f"Created flattened csv for {unique_users} users at {csv_output_path}.")

    #     # Keep the csv around for each user to debug further.
    #     if not debug:
    #         for user_id in unique_users:
    #             os.remove(FLATTENED_USER_CSV_PATH)
    #             print(f"Removed temporary flattened file {FLATTENED_USER_CSV_PATH}.")
    # else:
    #     print("Cannot flatten users to csv because the output path is not specified.")
    

In [18]:
FLATTENED_USERS_CSV_PATH = "../data/amazon_mechanical_turk_records_users_flattened.csv"
LIMIT_USERS = None # 3

# Call this function to create a separate flattened *.csv per user and use those moving forward.
# Creating and combining the flattened files takes a long time, so we only call this once
# and read the flattened output afterwards.
# 5m 47.5s - first 3 users
# transform_users_to_flattened_csv(csv_output_path=FLATTENED_USERS_CSV_PATH, limit_users=LIMIT_USERS, debug=True)

# Read in flattened c4 csv for the user.
df_amt = pd.read_csv(FLATTENED_USERS_CSV_PATH, encoding='utf-8', header="infer", low_memory=False)

# Sort by user then by time
# df_amt = df_amt.sort_values(by=['user', 'time'], ascending=[True, True])

# df_amt.head(30)
# df_amt[df_amt["c4.task_id"].notna()]
# df_amt.iloc[0:16,:]
# df_amt

In [19]:
# Look at the features
df_amt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351686 entries, 0 to 351685
Data columns (total 36 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   c1                                            351686 non-null  int64  
 1   c2                                            351686 non-null  object 
 2   c3                                            351686 non-null  object 
 3   c5                                            351686 non-null  object 
 4   c6                                            351686 non-null  float64
 5   c7                                            351686 non-null  object 
 6   time                                          351686 non-null  object 
 7   c9                                            351686 non-null  object 
 8   user                                          351686 non-null  object 
 9   c4.task_id                                    13

In [20]:
from datetime import datetime
import re

def locate_user_task_start(
        df: pd.DataFrame,
        user_id: str="",
        task_id: str="",
        task_started_pattern: str="https://worker.mturk.com/projects/(.*)/tasks/(.*)?assignment_id(.*)"
        ) -> (int, datetime):
    """
    Search `c2` url and locate the first event where `c7 (subtype) == TASK_STARTED` and matches
    the argument `task_id`.

    This limits searching the whole dataset to find the first `c2` url where the task_id has 
    'TASK_STARTED' generated. Filtering data by c7 in ('TASK_FRAME', 'TASK_STARTED') with
    c9 == 'WORKING', or c7 == 'ADDED_TASK' with c9 == 'LOGS'.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need columns c2 (url) and c8 (time))
    user_id (str): c10 (user) value.
    task_id (str): c4.task_id value.
    task_started_pattern (str): Regular expression pattern to identify 'TASK_STARTED'. Defaults to expression in the Chrome plugin for AWS MTurk.

    Returns:
    id: c1 (id) location of dataset record.
    datetime: Task id start date for first event 'TASK_STARTED'
    """

    c1_id = None
    c8_event_date = None
    url_task_id = None

    # Limit the amount of records needing searched.
    df_task_started = df[
        (df.user == user_id) & (
            (
                ((df.c7 == 'TASK_FRAME') | (df.c7 == 'TASK_STARTED')) & 
                (df.c9 == 'WORKING')
            ) |
            ((df.c7 == 'ADDED_TASK') & (df.c9 == 'LOGS'))
        )
        ].sort_values(by=['time'], ascending=[True])

    for i, j in df_task_started.c2.items():
        # Reset per row so a task id parsed from an earlier URL is never reused.
        url_task_id = None
        if re.search(task_started_pattern, j, re.IGNORECASE):
            url_task_id = j.split('/')[6].split('?')[0]
        elif re.search(r"https://www.mturkcontent.com/dynamic/hit\?assignmentId=(.*)", j, re.IGNORECASE) and 'hitId=' in j:
            # Only split on 'hitId=' when it is present, to avoid an IndexError.
            url_task_id = j.split('hitId=')[1].split('&')[0]

        if url_task_id and (url_task_id == task_id):
            try:
                c1_id = df_task_started.c1[i]
                c8_event_date = datetime.fromisoformat(df_task_started.time[i])
                # print(f"Found task {task_id} first 'TASK_STARTED': id {c1_id}, date {df_task_started.time[i]}")    
            except ValueError:
                print(f"Could not find 'TASK_STARTED' event date for task_id {task_id}")
            
            # Make sure that we break here to ensure that we're just looking at the 
            # first 'TASK_STARTED' event time.
            break

    # We may have a situation where there is no 'TASK_STARTED' for the task but a 'ADDED_TASK' exists.
    if c1_id is None or c8_event_date is None:
        try:
            # Limit the amount of records needing searched.
            df_task_started = df[
                (df.user == user_id) &
                (df.c7 == 'ADDED_TASK') & 
                (df.c9 == 'LOGS') & 
                (df["c4.task_id"] == task_id)
                ].iloc[0]
        
            c1_id = df_task_started.c1
            c8_event_date = datetime.fromisoformat(df_task_started.time)
            # print(f"Found task {task_id} first 'ADDED_TASK': id {c1_id}, date {df_task_started.time}")    
        except (IndexError, ValueError) as err:
            print(f"Could not find 'ADDED_TASK' event date for task_id {task_id}: {err}")
    else:
        try:
            # Check to see if we have an 'ADDED_TASK' before this 'c1_id' and use it instead if there.
            # This is because the background process logs this event every 30 minutes and it does not
            # abide by the event order.
            # (e.g. 3E9ZFLPWO0KJDJRNXDHG072XWOZXIK)
            df_added_task = df_task_started[(df_task_started.c7 == 'ADDED_TASK') & (df_task_started["c4.task_id"] == task_id)].iloc[0]

            # Only update if the df_added_task is before what we found previously.
            # This would be the case if there was no c1_id previously set.
            if df_added_task.c1 < c1_id:
                c1_id = df_added_task.c1
                c8_event_date = datetime.fromisoformat(df_added_task.time)
        except IndexError as err:
            # No previous 'ADDED_TASK' found, so we'll stick with what we found with 'TASK_STARTED'.
            pass

    return c1_id, c8_event_date


def locate_user_task_end(
        df: pd.DataFrame,
        user_id: str="",
        task_id: str="",
        ) -> (int, datetime):
    """
    Search for the observation that ends a task: the first row where
    `c7 (subtype) == FINISHED_TASK` and `c4.task_id` matches the argument `task_id`.

    Note: We could have used `TASK_SUBMITTED`; however, we noticed that some invisible tasks came after it.
    Carlos mentions the following:
    "FINISHED_TASK == TASK_SUBMITED both refers to task completed (submitted is when it was 
    recently submitted and finished when the next URL was loaded)"

    To avoid scanning the whole dataset, rows are filtered by c7 == 'FINISHED_TASK' and
    c9 (type) == 'LOGS', and only the first match for the task_id is used.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need columns c7 (subtype) and c9 (type))
    user_id (str): c10 (user) value.
    task_id (str): c4.task_id value.

    Returns:
    id: c1 (id) location of dataset record.
    datetime: Task id end date for event 'FINISHED_TASK'
    """

    c1_id = None
    c8_event_date = None

    try:
        # Filter to the matching rows and keep only the first one.
        df_task_finished = df[
            (df.user == user_id) &
            (df.c7 == 'FINISHED_TASK') & 
            (df.c9 == 'LOGS') &
            (df["c4.task_id"] == task_id)
            ].iloc[0]

        c1_id = df_task_finished.c1
        c8_event_date = datetime.fromisoformat(df_task_finished.time)
        # print(f"Found task {task_id} first 'FINISHED_TASK': id {c1_id}, date {df_task_finished.time}")    
    except (IndexError, ValueError) as err:
        print(f"Could not find 'FINISHED_TASK' event date for task_id {task_id}: {err}")
    
    return c1_id, c8_event_date
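
The lookup pattern used by the locate functions (boolean mask, then first row) can be tried in isolation. The following is a minimal sketch on a made-up event log: the column names mirror the notebook, but the rows and the helper name `first_event` are hypothetical and are not part of the project code. It checks for an empty match explicitly instead of relying on an `IndexError` from `.iloc[0]`.

```python
from datetime import datetime

import pandas as pd

# Toy event log; column names follow the notebook, the rows are invented.
events = pd.DataFrame({
    "c1": [1, 2, 3],
    "user": ["u1", "u1", "u1"],
    "c7": ["TASK_STARTED", "FINISHED_TASK", "FINISHED_TASK"],
    "c9": ["LOGS", "LOGS", "LOGS"],
    "c4.task_id": ["t1", "t1", "t1"],
    "time": ["2023-01-01T10:00:00", "2023-01-01T10:05:00", "2023-01-01T10:06:00"],
})


def first_event(df, user_id, task_id, subtype):
    """Hypothetical helper: (c1, datetime) of the first matching event, else (None, None)."""
    mask = (
        (df["user"] == user_id)
        & (df["c7"] == subtype)
        & (df["c9"] == "LOGS")
        & (df["c4.task_id"] == task_id)
    )
    matches = df[mask]
    if matches.empty:
        # No matching event: signal "not found" without raising.
        return None, None
    row = matches.iloc[0]
    return int(row["c1"]), datetime.fromisoformat(row["time"])


found = first_event(events, "u1", "t1", "FINISHED_TASK")
missing = first_event(events, "u1", "t2", "FINISHED_TASK")
print(found)
print(missing)
```

Returning `(None, None)` for a missing event keeps the caller's "skip this task" check simple, which is the same contract the notebook's functions follow.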


In [11]:
from datetime import datetime, date

def task_platform_name(df: pd.DataFrame, c1_id: int) -> str:
    """
    Returns platform name for the 'FINISHED_TASK' event.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need column c1 (id) for lookup)
    c1_id (int): The c1 (id) value to look up in the dataframe passed.

    Returns:
    str: Name of the platform where the worker task exists.
    """
    platform_name = ""
    try:
        platform_name = df[df.c1 == c1_id].iloc[0]["c5"]
    except (IndexError, TypeError) as err:
        print(f"Cannot locate platform name for the task at c1 == {c1_id} {err}")

    return platform_name


def task_requester_name(df: pd.DataFrame, c1_id: int) -> str:
    """
    Returns project requester name for the 'FINISHED_TASK' event.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need column c1 (id) for lookup)
    c1_id (int): The c1 (id) value to look up in the dataframe passed.

    Returns:
    str: Requester of the current task.
    """
    requester_name = ""
    try:
        requester_name = df[df.c1 == c1_id].iloc[0]["c4.project.requester_name"]
    except (IndexError, TypeError) as err:
        print(f"Cannot locate project requester name at c1 == {c1_id} {err}")

    return requester_name


def task_estimate_duration_in_seconds(df: pd.DataFrame, c1_id: int) -> int:
    """
    Returns 'c4.project.assignment_duration_in_seconds' duration in seconds from 'FINISHED_TASK' event.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need column c1 (id) for lookup)
    c1_id (int): The c1 (id) value to look up in the dataframe passed.

    Returns:
    int: Seconds of task total time estimate duration. This includes labor (working + invisible).
    """
    duration = 0
    try:
        duration = int(df[df.c1 == c1_id].iloc[0]["c4.project.assignment_duration_in_seconds"])
    except (IndexError, TypeError, ValueError) as err:
        # ValueError covers a missing (NaN) duration, which int() cannot convert.
        print(f"Cannot calculate task total time duration {err}")

    return duration


def task_monetary_reward_in_dollars(df: pd.DataFrame, c1_id: int) -> float:
    """
    Returns 'c4.project.monetary_reward.amount_in_dollars' reward amount from 'FINISHED_TASK' event.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need column c1 (id) for lookup)
    c1_id (int): The c1 (id) value to look up in the dataframe passed.

    Returns:
    float: US dollar amount for the task. This includes labor (working + invisible).
    """
    monetary = 0
    try:
        monetary = float(df[df.c1 == c1_id].iloc[0]["c4.project.monetary_reward.amount_in_dollars"])
    except (IndexError, TypeError) as err:
        print(f"Cannot calculate task monetary reward {err}")

    return monetary


def task_assignable_hits_count(df: pd.DataFrame, c1_id: int) -> float:
    """
    Returns 'c4.project.assignable_hits_count' value from 'FINISHED_TASK' event.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk (need column c1 (id) for lookup)
    c1_id (int): The c1 (id) value to look up in the dataframe passed.

    Returns:
    float: Assignable hits count from the project task id.
    """
    assignable_hits_count = 0
    try:
        assignable_hits_count = float(df[df.c1 == c1_id].iloc[0]["c4.project.assignable_hits_count"])
    except (IndexError, TypeError) as err:
        print(f"Cannot calculate task assignable hits count {err}")

    return assignable_hits_count


def total_labor_event_count(df: pd.DataFrame, c1_id_start: int, c1_id_end: int) -> int:
    """
    Returns count of events between and including 'TASK_STARTED' to 'FINISHED_TASK' for a task.
    (includes working + invisible tasks) 

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk
    c1_id_start (int): The task start c1 (id) for 'TASK_STARTED' event time.
    c1_id_end (int): The task end c1 (id) for 'FINISHED_TASK' event time.

    Returns:
    int: Count of events between c1_id_start and c1_id_end (includes working + invisible tasks)
    """
    count = 0
    try:
        count = len(df[df.c1.between(c1_id_start, c1_id_end, inclusive="both")])
    except TypeError as err:
        print(f"Cannot calculate task count for labor events {err}")

    return count


def total_labor_duration_in_seconds(dt_start: datetime, dt_end: datetime) -> int:
    """
    Returns duration in seconds between task 'FINISHED_TASK' - 'TASK_STARTED' time.

    Parameters:
    dt_start (datetime): The task start 'TASK_STARTED' time
    dt_end (datetime): The task end 'FINISHED_TASK' time

    Returns:
    int: Seconds of total time duration.
    """
    duration = 0
    try:
        # total_seconds() covers the whole span; .seconds would drop any full days.
        duration = int((dt_end - dt_start).total_seconds())
    except TypeError as err:
        print(f"Cannot calculate labor total time seconds {err}")

    return duration


def working_labor_event_count(df: pd.DataFrame, c1_id_start: int, c1_id_end: int) -> int:
    """
    Returns count of working events between and including 'TASK_STARTED' to 'FINISHED_TASK' for a task.
    (rows with c9 == 'WORKING' or c9 == 'LOGS' are counted as working)

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk
    c1_id_start (int): The task start c1 (id) for 'TASK_STARTED' event time.
    c1_id_end (int): The task end c1 (id) for 'FINISHED_TASK' event time.

    Returns:
    int: Count of task events between c1_id_start and c1_id_end (includes working tasks)
    """
    count = 0
    try:
        # Make sure to sort_values first then reset_index. Otherwise the index could be out of order.
        df_working = df[
            ((df.c1 >= c1_id_start) & (df.c1 <= c1_id_end)) &
            ((df.c9 == 'WORKING') | (df.c9 == 'LOGS'))
            ].sort_values(by=['time'], ascending=[True]).reset_index(drop=True)
        
        count = len(df_working)
    except TypeError as err:
        print(f"Cannot calculate task count for labor working events {err}")

    return count


def working_labor_duration_in_seconds(df: pd.DataFrame, c1_id_start: int, c1_id_end: int, debug=False) -> int:
    """
    Returns the duration of 'WORKING' events between and including 'TASK_STARTED' to 'FINISHED_TASK' for a task.
    (includes working tasks only)

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk
    c1_id_start (int): The task start c1 (id) for 'TASK_STARTED' event time.
    c1_id_end (int): The task end c1 (id) for 'FINISHED_TASK' event time.
    debug (bool): When True, print the task id and each time difference that is added.

    Returns:
    int: Seconds of task events between c1_id_start and c1_id_end (includes working tasks)
    """

    if debug:
        task_id = df[df["c1"] == c1_id_end].iloc[0]["c4.task_id"]
        print(f"\nTask {task_id}: c1_id_start {c1_id_start}, c1_id_end {c1_id_end}")

    duration = 0
    try:
        # Make sure to sort_values first then reset_index. Otherwise the index could be out of order.
        df_working = df[
            ((df.c1 >= c1_id_start) & (df.c1 <= c1_id_end))
            ].sort_values(by=['time'], ascending=[True]).reset_index(drop=True)
        
        for i, j in df_working.iterrows():
            if (j["c9"] == 'WORKING') or (j["c7"] == "ADDED_TASK" and j["c9"] == 'LOGS'):
                try:
                    dt_working = datetime.fromisoformat(j["time"])
                    dt_next_event = datetime.fromisoformat(df_working.iloc[i + 1]["time"])

                    if dt_working < dt_next_event:
                        elapsed = int((dt_next_event - dt_working).total_seconds())
                    else:
                        # TODO: For task_id = "3M0556243RJC3RXK4Q2QWOJY3G6NF6" the 'FINISHED_TASK'
                        # has an earlier datetime than the 'WORKING' event that comes before it, so
                        # we swap the operands to avoid a negative difference and a large seconds value.
                        elapsed = int((dt_working - dt_next_event).total_seconds())

                    duration += elapsed
                    if debug:
                        print(f"Difference of id {df_working.iloc[i + 1]['c1']}: ({dt_next_event}) and id {j['c1']}: ({dt_working}) is {elapsed} seconds.")

                except IndexError as err:
                    print(f"Cannot locate next dataframe duration {err}")
            
    except (IndexError, TypeError) as err:
        print(f"Cannot calculate task duration for labor working events {err}")

    return duration


def adjusted_monetary_reward_in_dollars(
        task_monetary_reward_in_dollars: float,
        task_estimate_duration_in_seconds: int,
        working_labor_duration_in_seconds: int,
        invisible_labor_duration_in_seconds: int,
) -> float:
    """
    Returns float indicating adjusted monetary reward in dollars based on parameters passed.

    Parameters:
    task_monetary_reward_in_dollars (float): requester task monetary reward in dollars.
    task_estimate_duration_in_seconds (int): requester task estimate time to complete in seconds.
    working_labor_duration_in_seconds (int): working labor duration in seconds.
    invisible_labor_duration_in_seconds (int): invisible labor duration in seconds.

    Returns:
    float: monetary value for reward if we factor in (estimate, working, and invisible parameters passed).
    This adjusted value represents actual pay should we include invisible work duration.
    """
    adjusted_pay = 0
    try:
        # Pay rate per second of the requester's estimated task duration.
        task_event_pay_rate = task_monetary_reward_in_dollars / task_estimate_duration_in_seconds

        # Use 'task_event_pay_rate' to build the adjusted pay from the working and invisible durations.
        adjusted_pay = task_monetary_reward_in_dollars + \
                       task_event_pay_rate * (
                           working_labor_duration_in_seconds + invisible_labor_duration_in_seconds
                           )
            
    except (TypeError, ZeroDivisionError) as err:
        # ZeroDivisionError covers a task whose estimated duration is 0 seconds.
        print(f"Cannot calculate adjusted monetary reward. {err}")

    return adjusted_pay


def is_task_completed(df: pd.DataFrame, c1_id_start: int, c1_id_end: int) -> int:
    """
    Returns an int flag indicating whether the task was completed. If an event between c1_id_start
    and c1_id_end has c9 == 'REJECTED', the return value is 0, meaning that the task was not completed.

    Parameters:
    df (pd.DataFrame): Source information from AWS MTurk
    c1_id_start (int): The task start c1 (id) for 'TASK_STARTED' event time.
    c1_id_end (int): The task end c1 (id) for 'FINISHED_TASK' event time.

    Returns:
    int: 0 if an event between c1_id_start and c1_id_end has c9 == 'REJECTED' (not completed),
    1 if no such event is found (completed).
    """
    completed = 0
    try:
        # Make sure to sort_values first then reset_index. Otherwise the index could be out of order.
        df_working = df[
            ((df.c1 >= c1_id_start) & (df.c1 <= c1_id_end)) &
            ((df.c9 == 'REJECTED'))
            ].sort_values(by=['time'], ascending=[True]).reset_index(drop=True)

        if len(df_working) == 0:
            completed = 1
            
    except TypeError as err:
        print(f"Cannot calculate task rejection {err}")

    # If a record between c1_id_start and c1_id_end has c9 == 'REJECTED', this returns 0 (not completed).
    return completed


df_features = pd.DataFrame(columns=[
    'platform.name',
    'user.id',
    'task.id',
    'task.monetary_reward_in_dollars',
    'task.assignable_hits_count',
    'task.requester_name',
    'task.estimate_duration_in_seconds',
    'total.labor.event_count',
    'total.labor.duration_in_seconds',
    'working.labor.event_count',
    'working.labor.duration_in_seconds',
    'invisible.labor.event_count',
    'invisible.labor.duration_in_seconds',
    'adjusted.monetary_reward_in_dollars',
    'completed_task'
])

# For now grab only the first LIMIT_USERS.
if LIMIT_USERS is not None:
    unique_users = df_amt['user'].dropna().unique().tolist()[:LIMIT_USERS]
else:
    # Run for all users.
    unique_users = df_amt['user'].dropna().unique().tolist()
# unique_users = ['ae862298385abab2a0a1619f8cedef9d']
# unique_users = ['149c64b9f9b890bcf32bd2dcf595fd']
# '149c64b9f9b890bcf32bd2dcf595fd'
# '49e7482b6cda157f388c73b3bcc2ebfc'
# 'ae862298385abab2a0a1619f8cedef9d'

# Build one feature row per (user, task) pair.
for user_id in unique_users:

    df_amt_user = df_amt[df_amt.user == user_id]

    # Use `.tolist()` to convert ndarray to Python list.
    # https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tolist.html
    unique_tasks = df_amt_user['c4.task_id'].dropna().unique().tolist() 
    # unique_tasks = ['3N5YJ55YXG3AVVNQ3MIU09Y29Q1AN7']
    # unique_tasks = ['3CMIQF80GN0XWBC81XEGMJKYJB8Q64']
    # ["3W9XHF7WGLV68SQR2YVUGGPI6QVTK3"] 
    # ["3KA7IJSNW54MTVXHF3TMHNX8OOTPBI"]
    # ["3E9ZFLPWO0KJDJRNXDHG072XWOZXIK"]

    for t in unique_tasks:

        task_c1_id_start, task_start_date = locate_user_task_start(
            df_amt_user, user_id, t, "https://worker.mturk.com/projects/(.*)/tasks/(.*)?assignment_id(.*)"
            )
        task_c1_id_end, task_end_date = locate_user_task_end(
            df_amt_user, user_id, t
            )
        
        # Skip tasks without a start or an end event (do not mutate `unique_tasks` while iterating over it).
        # Example: '302U8RURJY0UZS7SBXEJSBSUBN2VNK' doesn't include an 'ADDED_TASK' or 'TASK_STARTED' event.
        if not task_c1_id_start or not task_c1_id_end:
            continue

        # Calculate totals for (t = Total, w = Working, i = Invisible)
        task_reward_in_dollars = task_monetary_reward_in_dollars(df_amt_user, task_c1_id_end)
        task_assign_hits_count = task_assignable_hits_count(df_amt_user, task_c1_id_end)
        task_est_duration_in_seconds = task_estimate_duration_in_seconds(df_amt_user, task_c1_id_end)
        t_labor_event_count = total_labor_event_count(df_amt_user, task_c1_id_start, task_c1_id_end)
        t_labor_duration_in_seconds = total_labor_duration_in_seconds(task_start_date, task_end_date)
        w_labor_event_count = working_labor_event_count(df_amt_user, task_c1_id_start, task_c1_id_end)
        w_labor_duration_in_seconds = working_labor_duration_in_seconds(df_amt_user, task_c1_id_start, task_c1_id_end, debug=False)
        i_labor_event_count = t_labor_event_count - w_labor_event_count
        i_labor_duration_in_seconds = t_labor_duration_in_seconds - w_labor_duration_in_seconds

        df_features.loc[len(df_features.index)] = [
            task_platform_name(df_amt_user, task_c1_id_end),
            user_id,
            t,
            task_reward_in_dollars,
            task_assign_hits_count,
            task_requester_name(df_amt_user, task_c1_id_end),
            task_est_duration_in_seconds,
            t_labor_event_count,
            t_labor_duration_in_seconds,
            w_labor_event_count,
            w_labor_duration_in_seconds,
            i_labor_event_count,
            i_labor_duration_in_seconds,
            adjusted_monetary_reward_in_dollars(
                task_reward_in_dollars,
                task_est_duration_in_seconds,
                w_labor_duration_in_seconds,
                i_labor_duration_in_seconds
            ),
            is_task_completed(df_amt_user, task_c1_id_start, task_c1_id_end)
        ]

# Write out to the final feature *.csv file.
df_features.to_csv("../data/cloudworker_tasks.csv",
                encoding='utf-8', header=True, index=False, mode="w")

NameError: name 'locate_user_task_start' is not defined
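
The labor durations in this cell come from subtracting two `datetime` values, which yields a `timedelta`. As a minimal sketch with made-up timestamps (not taken from the data set), the snippet below shows how the `.seconds` attribute differs from `total_seconds()`: `.seconds` holds only the sub-day remainder, and a negative difference is normalized to a negative day count plus a large positive `.seconds` value.

```python
from datetime import datetime

# Invented timestamps spanning a little more than one day.
start = datetime.fromisoformat("2023-01-01T23:59:00")
end = datetime.fromisoformat("2023-01-03T00:01:00")

delta = end - start

# .seconds is only the remainder after whole days are removed.
remainder_seconds = delta.seconds
# total_seconds() covers the full span, including whole days.
whole_span_seconds = int(delta.total_seconds())

# Subtracting in the other order gives a negative timedelta, which Python
# stores as a negative number of days plus a positive seconds remainder.
reversed_delta = start - end
reversed_days = reversed_delta.days
reversed_remainder = reversed_delta.seconds
reversed_total = int(reversed_delta.total_seconds())

print(remainder_seconds, whole_span_seconds)
print(reversed_days, reversed_remainder, reversed_total)
```

For tasks that finish within a day and whose events are in chronological order, the two readings agree; they diverge only for multi-day spans or out-of-order timestamps.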