<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://bsethwalker.github.io/assets/img/clemson_paw.png">
</div>

## Course Project (Online Workers) - Checkpoint 1

**Clemson University**<br>
**Fall 2023**<br>
**Instructor(s):** Nina Hubig <br>
**Project Team:**
<ul>
    <li>David Croft <dcroft@g.clemson.edu></li>
    <li>Stephen Becker <sgbecke@g.clemson.edu></li>
    <li>Tony Hang <qhang@g.clemson.edu></li>
    <li>Zachary Trabookis <ztraboo@clemson.edu></li>
</ul>

---



In [103]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://bsethwalker.github.io/assets/css/cpsc6300.css").text
HTML(styles)

## Summary Goals

* Summary of the data set that, at a minimum, answers the following questions: What is the unit of analysis? How many observations in total are in the data set? How many unique observations are in the data set? What time period is covered?
  
* Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?
  
* Description of outcome with an appropriate visualization technique.
  
* Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.
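
The first goal above (total observations, unique observations, time period) can be checked with a few pandas calls. The following is a minimal sketch on a made-up frame: the helper name `summarize_observations` and the toy values are assumptions for illustration, and only the `user`, `time`, and `c3` column names come from this notebook.

```python
import pandas as pd

def summarize_observations(df, id_col="user", time_col="time"):
    """Return row counts, distinct-row counts, and the covered time span."""
    times = pd.to_datetime(df[time_col])
    return {
        "n_observations": len(df),
        "n_unique_observations": len(df.drop_duplicates()),
        "n_users": df[id_col].nunique(),
        "first_event": times.min(),
        "last_event": times.max(),
    }

# Hypothetical toy data; the first two rows are exact duplicates
toy = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "time": ["2020-05-08T22:06:40", "2020-05-08T22:06:40",
             "2020-05-09T03:13:10", "2020-05-10T05:43:30"],
    "c3": ["PAGE_LOAD", "PAGE_LOAD", "PAGE_BLUR", "TAB_CHANGE"],
})
summary = summarize_observations(toy)
print(summary)
```

On a frame that holds list-valued columns, `drop_duplicates` may fail because lists are not hashable; restricting the check to scalar columns with `subset=` is one way around that.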

In [104]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
# Set the max columns to none. This allows all the columns to display for the dataframes.
pd.set_option('display.max_columns', None)

from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import mean_squared_error

import warnings

In [116]:
from datetime import datetime

# Generic functions for cleaning the data
def convert_epoch_time_to_datetime(epoch_time: int):
    """
    Convert an epoch timestamp in milliseconds to an ISO 8601 datetime string.

    Note: `datetime.fromtimestamp` returns local time, so the result depends on
    the time zone of the machine that runs the notebook.

    Ref: 
    https://www.pythonforbeginners.com/basics/convert-epoch-to-datetime-in-python
    https://stackoverflow.com/questions/49710963/converting-13-digit-unixtime-in-ms-to-timestamp-in-python 
    """
    # Divide by 1,000 to convert milliseconds to seconds
    return datetime.fromtimestamp(int(epoch_time)/1000).isoformat()

# Testing the epoch time conversion
convert_epoch_time_to_datetime(1588994215395)

'2020-05-08T23:16:55.395000'
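
The string above is in the local time of the machine that ran the cell, because `datetime.fromtimestamp` applies the local time zone. If a zone-independent result is preferred, one option is a vectorized conversion to UTC with `pd.to_datetime`. This is a minimal sketch, not the conversion used in the rest of the notebook; the helper name `epoch_ms_to_utc` is an assumption.

```python
import pandas as pd

def epoch_ms_to_utc(values):
    """Convert epoch milliseconds to timezone-aware UTC timestamps."""
    return pd.to_datetime(values, unit="ms", utc=True)

# Same epoch value as in the test call above
utc_times = epoch_ms_to_utc(pd.Series([1588994215395]))
print(utc_times.iloc[0])
```

Working on the whole column at once also avoids calling a Python function per row, which may matter for larger extracts.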

In [106]:
# data_json = {
#     "task_id": "3W9XHF7WGLV68SQR2YVUGGPI6QVTK3",
#     "assignment_id": "3E47SOBEYRW130WM6BKU2A6RAYQCIE",
#     "accepted_at": "2020-05-09T03:13:10.000Z",
#     "deadline": "2020-05-09T05:13:10.000Z",
#     "time_to_deadline_in_seconds": 6946,
#     "state": "Assigned",
#     "question": {
#       "value": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#       "type": "InternalURL",
#       "attributes": {
#         "FrameSourceAttribute": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#         "FrameHeight": "0"
#       }
#     },
#     "project": {
#       "hit_set_id": "354DQCRRIJZHIYT5G3CFVURWGQJLSW",
#       "requester_id": "A28S2SRZA50N0",
#       "requester_name": "HCI Lab",
#       "title": "Install a chrome extension for 7 days ($1 bonus per each day) to measure your work performance.",
#       "description": "Install a chrome extension that will help you to keep track of how you spend time on MTurk (potentially helping you to avoid unpaid labor on MTurk).. You are asked to install the chrome extension to track how you are spending your time on the platform. ",
#       "assignment_duration_in_seconds": 7200,
#       "creation_time": "2020-05-09T03:01:26.000Z",
#       "assignable_hits_count": 1,
#       "latest_expiration_time": "2020-05-16T03:01:26.000Z",
#       "caller_meets_requirements": False,
#       "caller_meets_preview_requirements": False,
#       "last_updated_time": "2020-05-09T03:01:26.000Z",
#       "monetary_reward": {
#         "currency_code": "USD",
#         "amount_in_dollars": 2
#       },
#       "hit_requirements": [
#         {
#           "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#           "comparator": "Exists",
#           "worker_action": "ViewHitSet",
#           "qualification_values": [],
#           "caller_meets_requirement": None,
#           "qualification_type": {
#             "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#             "name": "CrowdCoach",
#             "visibility": True,
#             "description": "Have already used crowd coach plugin",
#             "has_test": False,
#             "is_requestable": True,
#             "keywords": None
#           },
#           "caller_qualification_value": {
#             "integer_value": None,
#             "locale_value": {
#               "country": None,
#               "subdivision": None
#             }
#           }
#         },
#         {
#           "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#           "comparator": "DoesNotExist",
#           "worker_action": "ViewHitSet",
#           "qualification_values": [],
#           "caller_meets_requirement": None,
#           "qualification_type": {
#             "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#             "name": "GigOverhead Setup Diagnostic",
#             "visibility": True,
#             "description": "Installation and diagnostic survey",
#             "has_test": False,
#             "is_requestable": True,
#             "keywords": None
#           },
#           "caller_qualification_value": {
#             "integer_value": None,
#             "locale_value": {
#               "country": None,
#               "subdivision": None
#             }
#           }
#         }
#       ],
#       "requester_url": "/requesters/A28S2SRZA50N0/projects?ref=w_pl_prvw"
#     },
#     "expired_task_action_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks?ref=w_pl_prvw",
#     "task_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks/3W9XHF7WGLV68SQR2YVUGGPI6QVTK3?assignment_id=3E47SOBEYRW130WM6BKU2A6RAYQCIE&ref=w_pl_prvw"
#   }

# df_data_json = pd.DataFrame(data_json)


In [107]:
def flat_nested_cols(alist):
    """
    Flatten one level of nesting in a list of dictionaries.

    Args:
        alist: list of dictionaries, some of whose values are themselves dictionaries

    Returns:
        A dictionary mapping each header to the list of values collected across
        `alist`. Keys of a nested dictionary are prefixed with the parent key
        (e.g. `qualification_type.name`).

    Ref:
    https://github.com/sunkusowmyasree/Flatten-Nested-Jsons.
    https://sunkusowmyasree.medium.com/different-ways-to-flatten-deeply-nested-jsons-into-a-pandas-data-frame-ace2380b401c
    """
    outputdict = {}
    for dic in alist:
        for key, value in dic.items():
            if isinstance(value, dict):
                for k2, v2 in value.items():
                    # Prefix each nested header with its parent key
                    k2 = key + '.' + k2
                    outputdict[k2] = outputdict.get(k2, []) + [v2]
            else:
                outputdict[key] = outputdict.get(key, []) + [value]
    return outputdict

# Testing this function
hit_requirements = [
    {
        "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
        "comparator": "Exists",
        "worker_action": "ViewHitSet",
        "qualification_values": [],
        "caller_meets_requirement": None,
        "qualification_type": {
        "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
        "name": "CrowdCoach",
        "visibility": True,
        "description": "Have already used crowd coach plugin",
        "has_test": False,
        "is_requestable": True,
        "keywords": None
        },
        "caller_qualification_value": {
        "integer_value": None,
        "locale_value": {
            "country": None,
            "subdivision": None
        }
        }
    },
    {
        "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
        "comparator": "DoesNotExist",
        "worker_action": "ViewHitSet",
        "qualification_values": [],
        "caller_meets_requirement": None,
        "qualification_type": {
        "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
        "name": "GigOverhead Setup Diagnostic",
        "visibility": True,
        "description": "Installation and diagnostic survey",
        "has_test": False,
        "is_requestable": True,
        "keywords": None
        },
        "caller_qualification_value": {
        "integer_value": None,
        "locale_value": {
            "country": None,
            "subdivision": None
        }
        }
    }
]

# Testing out flattening of a nested list
flat_nested_cols(hit_requirements)

{'qualification_type_id': ['3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU',
  '34O6CUUXI0IPA69PLM76WR259RNW0R'],
 'comparator': ['Exists', 'DoesNotExist'],
 'worker_action': ['ViewHitSet', 'ViewHitSet'],
 'qualification_values': [[], []],
 'caller_meets_requirement': [None, None],
 'qualification_type.qualification_type_id': ['3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU',
  '34O6CUUXI0IPA69PLM76WR259RNW0R'],
 'qualification_type.name': ['CrowdCoach', 'GigOverhead Setup Diagnostic'],
 'qualification_type.visibility': [True, True],
 'qualification_type.description': ['Have already used crowd coach plugin',
  'Installation and diagnostic survey'],
 'qualification_type.has_test': [False, False],
 'qualification_type.is_requestable': [True, True],
 'qualification_type.keywords': [None, None],
 'caller_qualification_value.integer_value': [None, None],
 'caller_qualification_value.locale_value': [{'country': None,
   'subdivision': None},
  {'country': None, 'subdivision': None}]}
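
In the output above, `caller_qualification_value.locale_value` is still a dictionary, because `flat_nested_cols` only unpacks one level of nesting. For comparison, `pd.json_normalize` flattens every level by default and returns one row per requirement. The sketch below uses a trimmed, hand-written copy of the two requirements; the variable names are assumptions.

```python
import pandas as pd

# Trimmed copy of the two hit requirements shown above
requirements = [
    {
        "comparator": "Exists",
        "qualification_type": {"name": "CrowdCoach", "has_test": False},
        "caller_qualification_value": {
            "integer_value": None,
            "locale_value": {"country": None, "subdivision": None},
        },
    },
    {
        "comparator": "DoesNotExist",
        "qualification_type": {"name": "GigOverhead Setup Diagnostic", "has_test": False},
        "caller_qualification_value": {
            "integer_value": None,
            "locale_value": {"country": None, "subdivision": None},
        },
    },
]

# One row per requirement, nested keys joined with "."
requirements_df = pd.json_normalize(requirements)
print(sorted(requirements_df.columns))
```

Which shape is more convenient depends on the analysis: the dictionary of lists keeps one task per row, while the normalized frame gives one row per requirement.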

In [113]:
import json
from ast import literal_eval

# https://www.geeksforgeeks.org/convert-class-object-to-json-in-python/
# https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/


def flatten_json_columns(df, json_cols):
    """
    Flatten JSON columns into individual columns.

    :param df: raw dataframe read from the crowd work CSV
    :param json_cols: names of the columns that hold JSON strings
    :return: dataframe with the JSON columns replaced by flattened columns

    Ref: 
    https://github.com/vvgsrk/ParseCSVContainsJSONUsingPandas/tree/main
    https://avithekkc.medium.com/how-to-convert-nested-json-into-a-pandas-dataframe-9e8779914a24
    """

    # Sort so that rows with `NaN` in the JSON columns come last. Nested list columns
    # (e.g. `c4_project.hit_requirements`) are detected from the first row, so if that
    # row is `NaN` those nested fields are never expanded.
    df = df.sort_values(by=json_cols, na_position='last')

    # Loop through all JSON columns
    for column in json_cols:
        if not df[column].isnull().all():
            # Create a temp column to preserve the original data
            df['custom_data_temp'] = df[column]
            # Replace None and NaN with empty braces
            df[column] = df[column].fillna(value='{}')
            try:
                # Deserialize each str instance containing a JSON document to a Python object
                df[column] = [json.loads(row, strict=False) for row in df[column]]
            except TypeError:
                # Convert all values to string using literal eval
                df[column] = df[column].apply(lambda x: literal_eval(str(x)))
                
            # Normalize semi-structured JSON data into a flat table
            column_as_df = pd.json_normalize(df[column], max_level=None)

            # Extract main column name and attach it to each sub column name
            column_as_df.columns = [f"{column}_{subcolumn}" for subcolumn in column_as_df.columns]

            # Replace empty strings with NaN
            column_as_df.replace('', np.nan, inplace=True)

            # Restore the original data from the temp column
            df[column] = df['custom_data_temp']

            # Merge extracted result from custom_data field with expected fields
            # result_df = pd.merge(column_as_df, custom_df, how='left')
            result_df = column_as_df
            
            # Drop the temp column and merge the flattened dataframe with the original dataframe
            df = df.drop('custom_data_temp', axis=1).merge(result_df, right_index=True, left_index=True)

            # Identify nested field values (e.g. 'c4_project.hit_requirements')
            cols_list = []
            for col in column_as_df:
                try:
                    # Only keep columns whose first value is a list. A `NaN` first value
                    # fails this check, so no separate `NaN` comparison is needed.
                    if isinstance(column_as_df[col][0], list):
                        cols_list.append(col)
                except Exception:
                    continue

            for col in cols_list:
                li=[]
                for r in range(column_as_df[col].size):
                    try:
                        a = flat_nested_cols(column_as_df[col][r])
                        li.append(a) 
                    except Exception:
                        # Rows without a list (e.g. `NaN`) get an empty record
                        li.append({})

                df_l = pd.DataFrame(li).add_prefix(col + '.')
                df = pd.concat([df.reset_index(drop=True),df_l.reset_index(drop=True)], axis=1)

            # Drop the columns in cols_list
            df.drop(cols_list, axis=1, inplace=True)

    # Drop the columns in json_cols
    df.drop(json_cols, axis=1, inplace=True)

    # Return dataframe with flattened columns
    return df
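
The core idea of `flatten_json_columns` (parse each JSON string, normalize it, prefix the new columns, attach them to the original rows) can be seen on a toy frame. This is a minimal sketch under stated assumptions: the frame `raw` and its values are made up, and the explicit index alignment is a choice made for this example rather than a description of the function above.

```python
import json
import pandas as pd

# Hypothetical frame with one JSON string column and one missing value
raw = pd.DataFrame({
    "user": ["u1", "u2", "u3"],
    "c4": [
        '{"task_id": "T1", "project": {"title": "survey"}}',
        None,
        '{"task_id": "T3", "project": {"title": "labeling"}}',
    ],
})

# Parse each cell; missing cells become empty dictionaries
parsed = [json.loads(cell) if isinstance(cell, str) else {} for cell in raw["c4"]]

# Flatten nested keys and prefix them with the source column name
flat = pd.json_normalize(parsed).add_prefix("c4_")

# Align the flattened rows with the original rows before joining
flat.index = raw.index
result = raw.drop(columns="c4").join(flat)
print(result)
```

Copying the original index onto the flattened frame before joining keeps each row paired with its own JSON, even if the frame was sorted or filtered beforehand.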



### Cleaning: Reading in the telemetry data (Amazon Mechanical Turk (AMT))
We read in and clean the data from `amazon_mechanical_turk_records.csv`.

**List of available variables (includes target variable `TBD: c6`):**

- **c1**: continuous
- **c2**: url to work task
- **c3**: categorical, 18 values (['PAGE_LOAD', 'PAGE_BLUR', 'TAB_CHANGE', 'PAGE_FOCUS', 'PAGE_CLICK', 'PAGE_SCROLL', 'PAGE_LAST', 'PAGE_CLOSE', 'INTERNALURL', 'PAGE_KEY', 'PAGE_INACTIVITY', 'TAB_CLOSED', 'EXTERNALURL', 'PAGE_REACTIVATE', 'SYSTEM_DISABLED_WORKING', 'SYSTEM_ENABLED_WORKING', 'SYSTEM_ENABLED', 'SYSTEM_DISABLED'])
- **c4**: json object {task_id, assignment_id, ...} – may include NaN values
- **c5**: categorical, 5 values ['OTHER', 'MTURK', 'FIVERR', 'UPWORK', 'FREELANCER']
- **c6**: categorical, 2 values (0: not complete, 1: complete) – may include NaN values
- **c7**: categorical, 29 values ['OTHER', 'TASK_STARTED', 'ADDED_TASK', 'TASK_SUBMITED', 'FINISHED_TASK', 'TASKS_LIST', 'WORKER_DASHBOARD', 'UNKNOWN', 'TASK_FRAME', 'TASK_PREVIEW', 'TASK_INFO', 'TASK_RETURNED', 'PLATFORM_LOGIN', 'TASK_QUEUE', 'TASK_SKIP', 'WORKER_EARNINGS_DETAILS', 'TASK_TIMEOUT', 'WORKER_EARNINGS', 'WORKER_QUALIFICATIONS', 'TASKS_PER_REQUESTER', 'MESSAGES_SEND', 'TASKS_LIST_FILTER', 'WORKER_QUALIFICATIONS_PENDING', 'TASKS_PREVIEW', 'PLATFORM_HELP', 'TASKS_PROJECTS', 'TASKS_DETAILS', 'MESSAGES_READ', 'TASKS_APPLY']
- **time**: continuous, Unix epoch time in milliseconds (counted from 1970-01-01), converted to a datetime during cleaning
- **c9**: categorical, 10 values ['OTHER', 'WORKING', 'LOGS', 'SEARCHING', 'PROFILE', 'UNKNOWN', 'REJECTED', 'COMMUNICATION', 'LEARNING', 'PROPOSAL']
- **user**: categorical, 120 unique values – Todo: verify that this is the correct field.
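
The level counts and the "may include NaN values" notes in this list can be verified directly once the data is loaded. The sketch below shows the calls on a made-up frame; the toy values are assumptions, and only the column names `c5` and `c6` come from the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing completion flag
toy = pd.DataFrame({
    "c5": ["OTHER", "MTURK", "MTURK", "UPWORK"],
    "c6": [0.0, 1.0, np.nan, 0.0],
})

# Number of distinct non-missing levels per column
levels_per_column = toy.nunique(dropna=True)

# Frequency of each outcome value, keeping NaN as its own row
c6_counts = toy["c6"].value_counts(dropna=False)

# Store the platform column as a pandas categorical
toy["c5"] = toy["c5"].astype("category")

print(levels_per_column)
print(c6_counts)
```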

In [114]:
# Define columns for data
columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'time', 'c9', 'user']

# Read the data into a dataframe

# 4m 18.8s – Suggest writing this transformed data out to a file to read in that transformed file for further processing.
# df_amt = pd.read_csv("../data/amazon_mechanical_turk_records.csv", encoding='utf-8', header=None, names=columns)

df_amt = pd.read_csv("../data/amazon_mechanical_turk_records_ae862298385abab2a0a1619f8cedef9d.csv", encoding='utf-8', header=None, names=columns)
# df_amt = pd.read_csv("../data/amazon_mechanical_turk_records_ae862298385abab2a0a1619f8cedef9d_c4_events_only.csv", encoding='utf-8', header=None, names=columns)

# Convert the epoch timestamp to datetime
df_amt['time']=df_amt.time.map(convert_epoch_time_to_datetime)

# Flatten columns with JSON values
df_amt = flatten_json_columns(df=df_amt, 
                              json_cols=['c4'])

# Sort by user then by time
df_amt = df_amt.sort_values(by=['user', 'time'], ascending=[True, True])

df_amt.head(10)
# df_amt[df_amt.c4_task_id == "3W9XHF7WGLV68SQR2YVUGGPI6QVTK3"]



Unnamed: 0,c1,c2,c3,c5,c6,c7,time,c9,user,c4_accepted_at,c4_assignment_id,c4_deadline,c4_expired_task_action_url,c4_state,c4_task_id,c4_task_url,c4_time_to_deadline_in_seconds,c4_project.assignable_hits_count,c4_project.assignment_duration_in_seconds,c4_project.caller_meets_preview_requirements,c4_project.caller_meets_requirements,c4_project.creation_time,c4_project.description,c4_project.hit_set_id,c4_project.last_updated_time,c4_project.latest_expiration_time,c4_project.monetary_reward.amount_in_dollars,c4_project.monetary_reward.currency_code,c4_project.requester_id,c4_project.requester_name,c4_project.requester_url,c4_project.title,c4_question.attributes.FrameHeight,c4_question.attributes.FrameSourceAttribute,c4_question.type,c4_question.value,c4_project.hit_requirements.caller_meets_requirement,c4_project.hit_requirements.caller_qualification_value.integer_value,c4_project.hit_requirements.caller_qualification_value.locale_value,c4_project.hit_requirements.comparator,c4_project.hit_requirements.qualification_type.description,c4_project.hit_requirements.qualification_type.has_test,c4_project.hit_requirements.qualification_type.is_requestable,c4_project.hit_requirements.qualification_type.keywords,c4_project.hit_requirements.qualification_type.name,c4_project.hit_requirements.qualification_type.qualification_type_id,c4_project.hit_requirements.qualification_type.visibility,c4_project.hit_requirements.qualification_type_id,c4_project.hit_requirements.qualification_values,c4_project.hit_requirements.worker_action
0,54,https://worker.mturk.com/tasks/,INTERNALURL,MTURK,0,FINISHED_TASK,2020-05-08T22:06:40,LOGS,ae862298385abab2a0a1619f8cedef9d,2020-05-09T18:37:25.000Z,3SLE99ER0NDYXCSMTBNSLC8XSN7BZF,2020-05-09T19:07:25.000Z,/projects/3USZNBD0HCQYVGQMCUSVCVDG2AV15X/tasks...,Assigned,389A2A304OIIICG8DF1T1DF22WGC00,/projects/3USZNBD0HCQYVGQMCUSVCVDG2AV15X/tasks...,1784.0,18.0,1800.0,False,False,2020-05-09T17:55:14.000Z,survey,3USZNBD0HCQYVGQMCUSVCVDG2AV15X,2020-05-09T17:55:14.000Z,2020-06-08T17:55:14.000Z,0.5,USD,AMF6LLR465U0W,MSH University of Grenoble,/requesters/AMF6LLR465U0W/projects?ref=w_pl_prvw,Psychology Study(~ 10 minutes),0,https://app.cloudresearch.com/TakeLaunchedSurv...,ExternalURL,https://app.cloudresearch.com/TakeLaunchedSurv...,"[None, None]","[None, None]","[{'country': None, 'subdivision': None}, {'cou...","[Exists, DoesNotExist]","[Have already used crowd coach plugin, Install...","[False, False]","[True, True]","[None, None]","[CrowdCoach, GigOverhead Setup Diagnostic]","[3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU, 34O6CUUXI0IPA...","[True, True]","[3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU, 34O6CUUXI0IPA...","[[], []]","[ViewHitSet, ViewHitSet]"
81,15,https://worker.mturk.com/tasks/,INTERNALURL,MTURK,0,ADDED_TASK,2020-05-08T22:06:40,LOGS,ae862298385abab2a0a1619f8cedef9d,2020-05-10T05:43:30.000Z,3WQ3B2KGE8GOHLWQ9DMCZNF0Z8I1BB,2020-05-10T05:58:30.000Z,/projects/30814BGWVE63FZHQKC2UCGJ1AN0H2S/tasks...,Assigned,3UQVX1UPFSHRZZDJ3ZKZW5XX2D002V,/projects/30814BGWVE63FZHQKC2UCGJ1AN0H2S/tasks...,703.0,1.0,900.0,False,False,2020-05-10T05:33:38.000Z,Give us your opinion about workouts guided onl...,30814BGWVE63FZHQKC2UCGJ1AN0H2S,2020-05-10T05:33:38.000Z,2020-05-17T05:33:38.000Z,0.5,USD,AWFVIASQV8JZ0,tiffani wang,/requesters/AWFVIASQV8JZ0/projects?ref=w_pl_prvw,opinions about online workouts 5 minutes 50 cents,0,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,"[None, None]","[None, None]","[{'country': None, 'subdivision': None}, {'cou...","[Exists, DoesNotExist]","[Have already used crowd coach plugin, Install...","[False, False]","[True, True]","[None, None]","[CrowdCoach, GigOverhead Setup Diagnostic]","[3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU, 34O6CUUXI0IPA...","[True, True]","[3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU, 34O6CUUXI0IPA...","[[], []]","[ViewHitSet, ViewHitSet]"
84,1,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_LOAD,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T03:13:10.000Z,3E47SOBEYRW130WM6BKU2A6RAYQCIE,2020-05-09T05:13:10.000Z,/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks...,Assigned,3W9XHF7WGLV68SQR2YVUGGPI6QVTK3,/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks...,6676.0,1.0,7200.0,False,False,2020-05-09T03:01:26.000Z,Install a chrome extension that will help you ...,354DQCRRIJZHIYT5G3CFVURWGQJLSW,2020-05-09T03:01:26.000Z,2020-05-16T03:01:26.000Z,2.0,USD,A28S2SRZA50N0,HCI Lab,/requesters/A28S2SRZA50N0/projects?ref=w_pl_prvw,Install a chrome extension for 7 days ($1 bonu...,0,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,,,,,,,,,,,,,,
85,2,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_BLUR,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T03:30:06.000Z,37FMASSAYCQXZT77SA84XASHHRKIBC,2020-05-09T03:42:06.000Z,/projects/3D72DZZDWDJ2Q3C2S042BGF3QONYON/tasks...,Assigned,3KA7IJSNW54MTVXHF3TMHNX8OOTPBI,/projects/3D72DZZDWDJ2Q3C2S042BGF3QONYON/tasks...,702.0,1.0,720.0,False,False,2020-05-09T02:31:43.000Z,Pick the choice that best answers the question...,3D72DZZDWDJ2Q3C2S042BGF3QONYON,2020-05-09T02:31:43.000Z,2020-05-12T02:31:43.000Z,0.15,USD,AI2HRFAYYSAW7,PickFu,/requesters/AI2HRFAYYSAW7/projects?ref=w_pl_prvw,"Take a 1-question survey (US-based, Mobile Gam...",0,https://www.pickfu.com/mtjob/T1NYF13WLFI2?assi...,ExternalURL,https://www.pickfu.com/mtjob/T1NYF13WLFI2?assi...,,,,,,,,,,,,,,
86,3,https://worker.mturk.com/projects/354DQCRRIJZH...,TAB_CHANGE,MTURK,0,TASK_STARTED,2020-05-08T22:06:40,WORKING,ae862298385abab2a0a1619f8cedef9d,2020-05-09T04:05:22.000Z,31Z0PCVWUKGMSZO83Z6BU6BM9FP7TY,2020-05-09T05:05:22.000Z,/projects/3I7I3OEKSOU67PVN30U9E92H0XFD2G/tasks...,Assigned,3UEBBGULPFPBKLKAVLPVYAEF6LSUFW,/projects/3I7I3OEKSOU67PVN30U9E92H0XFD2G/tasks...,3598.0,23.0,3600.0,False,False,2020-05-09T04:02:06.000Z,survey,3I7I3OEKSOU67PVN30U9E92H0XFD2G,2020-05-09T04:02:06.000Z,2020-06-08T04:02:06.000Z,0.3,USD,A8YWFHYTOLWIB,CB Lab,/requesters/A8YWFHYTOLWIB/projects?ref=w_pl_prvw,A Brief Survey on Life Satisfaction(~ 8 minutes),0,https://app.cloudresearch.com/TakeLaunchedSurv...,ExternalURL,https://app.cloudresearch.com/TakeLaunchedSurv...,,,,,,,,,,,,,,
87,4,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_FOCUS,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T04:11:08.000Z,3ZDAD0O1T2Y1QUGKC6IUCZGOULJXTV,2020-05-09T05:11:08.000Z,/projects/35GRDJC93JQP8E900AOFFOUTS6T7CV/tasks...,Assigned,3MQY1YVHS35Y79MCYT1FYVKHDBQB2G,/projects/35GRDJC93JQP8E900AOFFOUTS6T7CV/tasks...,3584.0,80.0,3600.0,False,False,2020-05-09T03:55:55.000Z,Qualification Test for the HITs titled 'Find o...,35GRDJC93JQP8E900AOFFOUTS6T7CV,2020-05-09T03:55:55.000Z,2020-06-19T06:25:55.000Z,2.0,USD,A2ST2HIC0MRRNE,Tamara Berg,/requesters/A2ST2HIC0MRRNE/projects?ref=w_pl_prvw,Find out predictable events in videos (Qualifi...,1000,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,,,,,,,,,,,,,,
88,5,https://docs.google.com/forms/d/e/1FAIpQLScvig...,TAB_CHANGE,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T17:39:21.000Z,3F6KKYWMNG8LZQMKO50JMNXE9RRDN6,2020-05-09T21:39:21.000Z,/projects/3P6UG6V26B87HA49NX4Y5ISB4S24NL/tasks...,Assigned,3E9VAUV7C0KFQYQSLVDQPISUFSBYA9,/projects/3P6UG6V26B87HA49NX4Y5ISB4S24NL/tasks...,14177.0,0.0,14400.0,False,False,2020-05-09T17:36:19.000Z,Give us your opinion about political and socia...,3P6UG6V26B87HA49NX4Y5ISB4S24NL,2020-05-09T17:36:19.000Z,2020-05-10T17:36:19.000Z,0.7,USD,AVWA6B7JPTSO0,PP Group,/requesters/AVWA6B7JPTSO0/projects?ref=w_pl_prvw,Answer a short survey for social science research,0,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,,,,,,,,,,,,,,
89,6,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_CLICK,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T17:43:38.000Z,308Q0PEVB8DRSIE8KK6CR7WDMAVI9X,2020-05-11T17:43:38.000Z,/projects/32RJU5X1ZP960UC7NAF5EQVRCA2VDG/tasks...,Assigned,307L9TDWJYSU5X3QAPDSGIUMBKC3ND,/projects/32RJU5X1ZP960UC7NAF5EQVRCA2VDG/tasks...,172714.0,1.0,172800.0,False,False,2020-05-09T17:28:27.000Z,Write in your favorite team and review a subje...,32RJU5X1ZP960UC7NAF5EQVRCA2VDG,2020-05-09T17:28:27.000Z,2020-05-11T17:28:27.000Z,0.2,USD,A2OULRGNWAH644,Kirk Wakefield,/requesters/A2OULRGNWAH644/projects?ref=w_pl_prvw,"60 second survey about NBA, NFL or NHL fans",0,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,,,,,,,,,,,,,,
90,7,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_SCROLL,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T18:37:25.000Z,3SLE99ER0NDYXCSMTBNSLC8XSN7BZF,2020-05-09T19:07:25.000Z,/projects/3USZNBD0HCQYVGQMCUSVCVDG2AV15X/tasks...,Assigned,389A2A304OIIICG8DF1T1DF22WGC00,/projects/3USZNBD0HCQYVGQMCUSVCVDG2AV15X/tasks...,1664.0,20.0,1800.0,False,False,2020-05-09T17:55:14.000Z,survey,3USZNBD0HCQYVGQMCUSVCVDG2AV15X,2020-05-09T17:55:14.000Z,2020-06-08T17:55:14.000Z,0.5,USD,AMF6LLR465U0W,MSH University of Grenoble,/requesters/AMF6LLR465U0W/projects?ref=w_pl_prvw,Psychology Study(~ 10 minutes),0,https://app.cloudresearch.com/TakeLaunchedSurv...,ExternalURL,https://app.cloudresearch.com/TakeLaunchedSurv...,,,,,,,,,,,,,,
91,8,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_LAST,OTHER,0,OTHER,2020-05-08T22:06:40,OTHER,ae862298385abab2a0a1619f8cedef9d,2020-05-09T19:19:42.000Z,37XITHEISW8TBQKG0BJ7VZ7DA69RCZ,2020-05-09T19:39:42.000Z,/projects/3WRXIMH6E16EV4C6UFL2COC83YS4LE/tasks...,Assigned,3SNR5F7R91STS53HGDQSJBBU3NSIE8,/projects/3WRXIMH6E16EV4C6UFL2COC83YS4LE/tasks...,1195.0,40.0,1200.0,False,False,2020-05-09T18:37:09.000Z,Select how typical often you would have seen s...,3WRXIMH6E16EV4C6UFL2COC83YS4LE,2020-05-09T18:37:09.000Z,2020-05-16T18:37:09.000Z,0.2,USD,A3LBCE3RW2MFUK,SLS-6,/requesters/A3LBCE3RW2MFUK/projects?ref=w_pl_prvw,How often do you see these sentences?,0,https://www.mturkcontent.com/dynamic/hit?assig...,InternalURL,https://www.mturkcontent.com/dynamic/hit?assig...,,,,,,,,,,,,,,
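
The load cell notes that the full file takes over four minutes to process and suggests writing the transformed data out for reuse. One way to do that is a small load-or-build cache. This is a minimal sketch, not part of the team's pipeline: the helper `load_or_build`, the stand-in `slow_build`, and the file name are assumptions. Pickle is used here because list-valued columns such as the flattened `hit_requirements` fields survive the round trip unchanged.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_build(cache_path, build_fn):
    """Return the cached frame if present; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df

calls = []

def slow_build():
    """Stand-in for the expensive read-and-flatten step."""
    calls.append(1)
    return pd.DataFrame({
        "user": ["a", "b"],
        "comparator": [["Exists", "DoesNotExist"], None],
    })

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "df_amt_flat.pkl"
    first = load_or_build(cache_file, slow_build)   # builds and writes the cache
    second = load_or_build(cache_file, slow_build)  # reads the cache

print(len(calls))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.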


In [115]:
# Look at the features
df_amt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11711 entries, 0 to 11710
Data columns (total 50 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   c1                                                                    11711 non-null  int64  
 1   c2                                                                    11711 non-null  object 
 2   c3                                                                    11711 non-null  object 
 3   c5                                                                    11711 non-null  object 
 4   c6                                                                    11711 non-null  int64  
 5   c7                                                                    11711 non-null  object 
 6   time                                                                  11711 non-null  object 
 7   

In [42]:
# Output continuous/categorical columns to see what the data has.

# df_amt.c3.unique()
# array(['PAGE_LOAD', 'PAGE_BLUR', 'TAB_CHANGE', 'PAGE_FOCUS', 'PAGE_CLICK',
#        'PAGE_SCROLL', 'PAGE_LAST', 'PAGE_CLOSE', 'INTERNALURL',
#        'PAGE_KEY', 'PAGE_INACTIVITY', 'TAB_CLOSED', 'EXTERNALURL',
#        'PAGE_REACTIVATE', 'SYSTEM_DISABLED_WORKING',
#        'SYSTEM_ENABLED_WORKING', 'SYSTEM_ENABLED', 'SYSTEM_DISABLED'],
#       dtype=object)

# df_amt.c4.unique()
# {
#   "task_id": "3W9XHF7WGLV68SQR2YVUGGPI6QVTK3",
#   "assignment_id": "3E47SOBEYRW130WM6BKU2A6RAYQCIE",
#   "accepted_at": "2020-05-09T03:13:10.000Z",
#   "deadline": "2020-05-09T05:13:10.000Z",
#   "time_to_deadline_in_seconds": 6946,
#   "state": "Assigned",
#   "question": {
#     "value": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#     "type": "InternalURL",
#     "attributes": {
#       "FrameSourceAttribute": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#       "FrameHeight": "0"
#     }
#   },
#   "project": {
#     "hit_set_id": "354DQCRRIJZHIYT5G3CFVURWGQJLSW",
#     "requester_id": "A28S2SRZA50N0",
#     "requester_name": "HCI Lab",
#     "title": "Install a chrome extension for 7 days ($1 bonus per each day) to measure your work performance.",
#     "description": "Install a chrome extension that will help you to keep track of how you spend time on MTurk (potentially helping you to avoid unpaid labor on MTurk).. You are asked to install the chrome extension to track how you are spending your time on the platform. ",
#     "assignment_duration_in_seconds": 7200,
#     "creation_time": "2020-05-09T03:01:26.000Z",
#     "assignable_hits_count": 1,
#     "latest_expiration_time": "2020-05-16T03:01:26.000Z",
#     "caller_meets_requirements": false,
#     "caller_meets_preview_requirements": false,
#     "last_updated_time": "2020-05-09T03:01:26.000Z",
#     "monetary_reward": {
#       "currency_code": "USD",
#       "amount_in_dollars": 2
#     },
#     "hit_requirements": [
#       {
#         "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#         "comparator": "Exists",
#         "worker_action": "ViewHitSet",
#         "qualification_values": [],
#         "caller_meets_requirement": null,
#         "qualification_type": {
#           "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#           "name": "CrowdCoach",
#           "visibility": true,
#           "description": "Have already used crowd coach plugin",
#           "has_test": false,
#           "is_requestable": true,
#           "keywords": null
#         },
#         "caller_qualification_value": {
#           "integer_value": null,
#           "locale_value": {
#             "country": null,
#             "subdivision": null
#           }
#         }
#       },
#       {
#         "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#         "comparator": "DoesNotExist",
#         "worker_action": "ViewHitSet",
#         "qualification_values": [],
#         "caller_meets_requirement": null,
#         "qualification_type": {
#           "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#           "name": "GigOverhead Setup Diagnostic",
#           "visibility": true,
#           "description": "Installation and diagnostic survey",
#           "has_test": false,
#           "is_requestable": true,
#           "keywords": null
#         },
#         "caller_qualification_value": {
#           "integer_value": null,
#           "locale_value": {
#             "country": null,
#             "subdivision": null
#           }
#         }
#       }
#     ],
#     "requester_url": "/requesters/A28S2SRZA50N0/projects?ref=w_pl_prvw"
#   },
#   "expired_task_action_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks?ref=w_pl_prvw",
#   "task_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks/3W9XHF7WGLV68SQR2YVUGGPI6QVTK3?assignment_id=3E47SOBEYRW130WM6BKU2A6RAYQCIE&ref=w_pl_prvw"
# }

# df_amt.c5.unique()
# array(['OTHER', 'MTURK', 'FIVERR', 'UPWORK', 'FREELANCER'], dtype=object)

# df_amt.c6.unique()
# array([ 0.,  1., nan])

# df_amt.c7.unique()
# array(['OTHER', 'TASK_STARTED', 'ADDED_TASK', 'TASK_SUBMITED',
#        'FINISHED_TASK', 'TASKS_LIST', 'WORKER_DASHBOARD', 'UNKNOWN',
#        'TASK_FRAME', 'TASK_PREVIEW', 'TASK_INFO', 'TASK_RETURNED',
#        'PLATFORM_LOGIN', 'TASK_QUEUE', 'TASK_SKIP',
#        'WORKER_EARNINGS_DETAILS', 'TASK_TIMEOUT', 'WORKER_EARNINGS',
#        'WORKER_QUALIFICATIONS', 'TASKS_PER_REQUESTER', 'MESSAGES_SEND',
#        'TASKS_LIST_FILTER', 'WORKER_QUALIFICATIONS_PENDING',
#        'TASKS_PREVIEW', 'PLATFORM_HELP', 'TASKS_PROJECTS',
#        'TASKS_DETAILS', 'MESSAGES_READ', 'TASKS_APPLY'], dtype=object)

# df_amt.c9.unique()
# array(['OTHER', 'WORKING', 'LOGS', 'SEARCHING', 'PROFILE', 'UNKNOWN',
#        'REJECTED', 'COMMUNICATION', 'LEARNING', 'PROPOSAL'], dtype=object)

## What is the unit of analysis?

## How many observations in total are in the data set? 

## How many unique observations are in the data set? 

## What time period is covered?

## Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?

## Description of outcome with an appropriate visualization technique.

## Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.

### Cleaning: Reading in the telemetry data (Toloka)
We read in and clean the data from `toloka_telemetry_db.csv`.

**List of available variables (includes target variable `TBD`):**

- **c1**: continuous
- **c2**: URL of the work task
- **current**: categorical, 31 event names ['PAGE_LOAD', 'TAB_CLOSED', 'PAGE_BLUR', 'PAGE_FOCUS', 'TAB_CHANGE', 'CONFIG_UPDATE', 'PLUGIN_INSTALL', 'CONFIG_FILE', 'APP_ACTIVATED', 'PAGE_CLOSE', 'USER', 'BELL_CLICK', 'PAGE_LAST', 'PAGE_CLICK', 'PAGE_KEY', 'PAGE_SCROLL', 'TASK', 'PAGE_INACTIVITY', 'TRAINING', 'PAGE_REACTIVATE', 'LIST_NEW', 'LIST_RECOM', 'LIST_PAY', 'SETT_CLICK', 'MSG_RCV_WORKER', 'MSG_CLICK_WORKER', 'TASK_HIDE_OFF', 'TASK_HIDE_ON', 'MSG_RCV_REQUESTER', 'SETT_SAVE', 'MSG_CLICK_REQUESTER']. The column also contains raw URLs (e.g. 'https://toloka.yandex.com/tasks') that may need to be removed or transformed.
- **event**: JSON object {activeAssignments, ...}; may include NaN values
- **extra**: categorical; the sample rows include 'OTHER' and 'TOLOKA'
- **platform**: categorical, 2 values {0, NaN}
- **subtype**: categorical, 24 values ['TASK_STARTED', 'TASKS_LIST', 'TASK_SUBMITED', 'OTHER', 'FINISHED_TASK', 'SYSTEM', 'GENERAL', 'ADDED_TASK', 'META_DATA', 'UNKNOWN', 'TASK_QUEUE', 'WORKER_QUALIFICATIONS', 'WORKER_DASHBOARD', 'WORKER_EARNINGS', 'WORKER_EARNINGS_DETAILS', 'TASK_INFO', 'MESSAGES_READ', 'TASK_TIMEOUT', 'REFERRAL', 'NOTIFICATIONS', 'MESSAGES_REQUESTER', 'MESSAGES_OUTBOX', 'MESSAGES_ADMIN', 'MESSAGES_NOTIFICATION']
- **time**: continuous (timestamp; converted from epoch time to datetime on load)
- **type**: categorical, 11 values ['WORKING', 'SEARCHING', 'OTHER', 'LOGS', 'CONFIG', 'API', 'SYSTEM', 'UNKNOWN', 'PROFILE', 'COMMUNICATION', 'REJECTED']
- **user**: categorical, user id
- **ordinal**: continuous, 2 values [1, 2]
- **unnamed**: continuous, values [nan, 0.00000e+00, 2.89000e+02, ..., 3.79992e+05, 3.25610e+04, 3.61200e+04]; may include NaN values
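
The note on `current` points out that URLs are mixed in with the event names. One possible way to separate them is sketched below on a toy frame. The rule "starts with `http`", the placeholder label `URL`, and the column names `current_url` and `current_clean` are all assumptions made for illustration, not choices taken from the project.

```python
import pandas as pd

# Toy stand-in for df_toloka; the real frame is loaded from the CSV.
toy = pd.DataFrame({
    "current": [
        "PAGE_LOAD",
        "https://toloka.yandex.com/tasks",
        "TAB_CHANGE",
        "https://toloka.yandex.com/ru/messages",
    ]
})

# Assumption: any value that starts with "http" is a URL, not an event name.
is_url = toy["current"].str.startswith("http", na=False)

# Keep the raw URL in a separate (hypothetical) column and replace it with a
# single placeholder label so the cleaned column stays a small categorical.
toy["current_url"] = toy["current"].where(is_url)
toy["current_clean"] = toy["current"].mask(is_url, "URL")

print(toy["current_clean"].value_counts())
```

Keeping the original URL in its own column means no information is lost if the URL path later turns out to be useful.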

In [43]:
# Define columns for data
columns = ['c1', 'c2', 'current', 'event', 'extra', 'platform', 'subtype', 'time', 'type', 'user', 'ordinal', 'unnamed']

# Read the data into a dataframe (the file has its own header row, so header=0 replaces it with the names above)
df_toloka = pd.read_csv("../data/toloka_telemetry_db.csv", encoding='utf-8', header=0, names=columns)

# Convert the epoch timestamp to datetime (convert_epoch_time_to_datetime is a helper defined elsewhere in this notebook)
df_toloka['time'] = df_toloka['time'].map(convert_epoch_time_to_datetime)

# Sort by user, then by time
df_toloka = df_toloka.sort_values(by=['user', 'time'], ascending=[True, True])

# Examine the first few rows of the dataframe
df_toloka.head(10)

Unnamed: 0,c1,c2,current,event,extra,platform,subtype,time,type,user,ordinal,unnamed
452928,458654,chrome-extension://hpkclaeeilidghfdodofedfilgk...,TAB_CHANGE,,OTHER,,OTHER,2022-02-12T18:18:16.148000,OTHER,0,2,74037.0
452929,458656,https://docs.google.com/forms/d/e/1FAIpQLSeDtR...,TAB_CHANGE,,OTHER,,OTHER,2022-02-12T18:18:18.610000,OTHER,0,2,74039.0
452931,458657,chrome-extension://hpkclaeeilidghfdodofedfilgk...,TAB_CHANGE,,OTHER,,OTHER,2022-02-12T18:18:51.291000,OTHER,0,2,74040.0
452933,458661,https://toloka.yandex.com/,PAGE_INACTIVITY,,TOLOKA,,UNKNOWN,2022-02-12T18:19:08.825000,UNKNOWN,0,2,74044.0
452934,458660,https://docs.google.com/forms/d/e/1FAIpQLSeDtR...,TAB_CHANGE,,OTHER,,OTHER,2022-02-12T18:19:09.235000,OTHER,0,2,74043.0
452936,458664,https://toloka.yandex.com/tasks,TAB_CHANGE,,TOLOKA,,TASKS_LIST,2022-02-12T18:19:17.931000,SEARCHING,0,2,74047.0
452937,458665,https://toloka.yandex.com/,PAGE_FOCUS,,TOLOKA,,UNKNOWN,2022-02-12T18:19:18.495000,UNKNOWN,0,2,74048.0
452938,458667,https://toloka.yandex.com/,PAGE_REACTIVATE,,TOLOKA,,UNKNOWN,2022-02-12T18:19:20.318000,UNKNOWN,0,2,74050.0
452939,458666,https://toloka.yandex.com/,PAGE_SCROLL,,TOLOKA,,UNKNOWN,2022-02-12T18:19:20.327000,UNKNOWN,0,2,74049.0
452940,458668,https://toloka.yandex.com/,PAGE_CLICK,,TOLOKA,,UNKNOWN,2022-02-12T18:19:22.262000,UNKNOWN,0,2,74051.0
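
The `time` values above are displayed as ISO-formatted strings. If a true datetime column is preferred, a vectorized conversion may be worth considering. The sketch below assumes the raw values are milliseconds since the Unix epoch (an assumption; the unit used by `convert_epoch_time_to_datetime` is not shown here) and uses two made-up raw values.

```python
import pandas as pd

# Toy epoch values; assumed to be milliseconds since 1970-01-01 UTC.
raw = pd.Series([1644689896148, 1644689898610], name="time")

# Vectorized conversion; yields a datetime64 column instead of strings.
as_datetime = pd.to_datetime(raw, unit="ms")

print(as_datetime.dtype)
print(as_datetime.tolist())
```

A datetime64 column sorts the same way as the ISO strings but also supports arithmetic such as differences between consecutive events.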


In [44]:
# Look at the features
df_toloka.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2894936 entries, 452928 to 2841010
Data columns (total 12 columns):
 #   Column    Dtype  
---  ------    -----  
 0   c1        int64  
 1   c2        object 
 2   current   object 
 3   event     object 
 4   extra     object 
 5   platform  float64
 6   subtype   object 
 7   time      object 
 8   type      object 
 9   user      object 
 10  ordinal   int64  
 11  unnamed   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 287.1+ MB
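
Because `info()` omits non-null counts for a frame of this size, a separate check of missing values can help. The sketch below shows one way to do that, along with converting a low-cardinality string column to a categorical to reduce memory. It runs on a small toy frame whose values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kinds of columns as df_toloka.
toy = pd.DataFrame({
    "type": ["WORKING", "SEARCHING", "WORKING", "OTHER"],
    "platform": [0.0, np.nan, 0.0, np.nan],
    "event": [np.nan, '{"refUuid": "x"}', np.nan, np.nan],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Low-cardinality string columns usually take less memory as categoricals.
toy["type"] = toy["type"].astype("category")

print(missing_share)
print(toy.dtypes)
```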


In [45]:
# Output continuous/categorical columns to see what the data has.

# df_toloka.current.unique()
# array(['PAGE_LOAD', 'TAB_CLOSED', 'PAGE_BLUR', 'PAGE_FOCUS', 'TAB_CHANGE',
#        'CONFIG_UPDATE', 'PLUGIN_INSTALL', 'CONFIG_FILE', 'APP_ACTIVATED',
#        'PAGE_CLOSE', 'USER', 'BELL_CLICK', 'PAGE_LAST', 'PAGE_CLICK',
#        'PAGE_KEY', 'PAGE_SCROLL', 'TASK', 'PAGE_INACTIVITY', 'TRAINING',
#        'PAGE_REACTIVATE', 'LIST_NEW', 'LIST_RECOM', 'LIST_PAY',
#        'SETT_CLICK', 'MSG_RCV_WORKER', 'MSG_CLICK_WORKER',
#        'https://toloka.yandex.com/tasks', 'TASK_HIDE_OFF', 'TASK_HIDE_ON',
#        'MSG_RCV_REQUESTER', 'SETT_SAVE',
#        'https://toloka.yandex.com/vi/tasks',
#        'https://toloka.yandex.com/fr/messages',
#        'https://toloka.yandex.com/es/tasks', 'MSG_CLICK_REQUESTER',
#        'https://toloka.yandex.com/profile/history/73643/2022-01-20',
#        'https://toloka.yandex.com/fr/tasks',
#        'https://toloka.yandex.com/task/31740731?refUuid=75fd7974-884b-4fff-aa42-b08d20fba5b4',
#        'https://toloka.yandex.com/task/21133006/',
#        'https://toloka.yandex.com/tasks/active',
#        'https://toloka.yandex.com/profile/history?status=all',
#        'https://toloka.yandex.com/profile/money',
#        'https://toloka.yandex.com/messages/inbox/620abf1a68ab666108b4b4a0',
#        'https://toloka.yandex.com/task/31161320?refUuid=6acf2a47-f652-4d2a-b1de-4f4c4724c5f6',
#        'https://toloka.yandex.com/task/2990538/00002da1ca--620b8c1db0c49a3a992921d4',
#        'https://toloka.yandex.com/task/31746103/0001e46837--620bb1bec930da47000174d0',
#        'https://toloka.yandex.com/profile/history?status=income',
#        'https://toloka.yandex.com/profile/history?status=blocked',
#        'https://toloka.yandex.com/profile/edit',
# ...
#        'https://toloka.yandex.com/ru/tasks',
#        'https://toloka.yandex.com/ru/task/32969499/',
#        'https://toloka.yandex.com/ru/task/32969361/',
#        'https://toloka.yandex.com/ru/task/33037260',
#        'https://toloka.yandex.com/ru/messages'], dtype=object)


# df_toloka.event.unique()
# array([nan,
#        
#        '{"browserName":"CHROME","currentMode":"PASSIVE","currentState":0,"dailySurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSe2XInPGnb9EI539KyUiBNr6gcmRNPzg55LRUaEmXFtx_3zqg/viewform?usp=pp_url&entry.1915115278=","finalSurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSefhPyHJs9x-_v0WxQ5aI3kGF3hfXNZH2vk3KMx1C8CnoZSGw/viewform?usp=pp_url&entry.1172212682=","groupId":"GN","hideUnpaidTasks":true,"initialSurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSdWQa1AkVc5vV0VhdOYQLGgK77Pw-LEpmsuCBKpdo7QQopWbg/viewform?usp=pp_url&entry.784114392=","installTime":1642708724229,"instructionsUrl":"https://bit.ly/cul-act-gn","isUserStudy":true,"logServerUrl":"https://script.google.com/macros/s/AKfycbwGx2_5a6IwcNI2YZuz2AZvb1J-7Y8Ulk5fYHjZoA8wvHzajv9P55DYiI8UnoV0W403HA/exec","mode":"PROTOCOL","nextDue":1643313524229,"pluginName":"Toloka Assistant","protocol":[{"durationMins":10080,"mode":"PASSIVE"},{"durationMins":10080,"mode":"ACTIVE"},{"durationMins":10080,"mode":"FINISH"}],"rankMethod":"AI","sandbox":false,"serverUrls":["https://script.google.com/macros/s/AKfycbwGx2_5a6IwcNI2YZuz2AZvb1J-7Y8Ulk5fYHjZoA8wvHzajv9P55DYiI8UnoV0W403HA/exec","https://hcilab.ml/overhead/api"],"settings":{"msg_requ":true,"msg_work":true,"not_brow":true,"not_page":true,"not_whil":true,"num_task":5},"socketUrl":"http://hcilab.ml:5000","studyDurationDays":14,"studyDurationMins":20160,"userData":{"acceptedEula":12,"actualUser":{"defaultEmail":"carlostoxtli@yandex.com","displayName":"carlostoxtli","login":"carlostoxtli","readOnlyModeToActUnderAccount":false,"role":"WORKER","uid":1274303000,"userLang":"ES"},"adultAllowed":true,"authoritiesInfo":{"issuedAuthorities":["APP_USER","U_WALLETS_EDIT","U_ASSIGNMENTS_VIEW","U_TRANSACTIONS_CREATE","U_ASSIGNMENTS_UNDERTAKE","APP","U_ASSIGNMENTS_HISTORY","U_PROFILE_VIEW","U_TRANSACTIONS_VIEW","U_ASSIGNMENTS_SUBMIT","U_FORUM_VIEW","U_MESSAGES_CREATE","U_FORUM_EDIT","U_PROFILE_EDIT","U_MESSAGES_VIEW"],"notIssuedAuthoritiesReasons":{}},"availableAccounts":[],"balance":0,"birthDay":"1982-11-15","blockedBalance":0,"citizenship":"US","cityId":103027,"country":"US","createdDate":"2020-12-21","defaultEmail":"carlostoxtli@yandex.com","displayName":"carlostoxtli","education":"HIGH","firstName":"Carlos","fullName":"Carlos Toxtli","gender":"MALE","isAccountOwner":true,"languages":["EN","ES"],"lastName":"Toxtli","login":"carlostoxtli","rating":0,"regionId":223,"role":"WORKER","systemBan":false,"uid":1274303000,"userLang":"ES"},"userId":"a5d84fcd0637d31f4675cdf17b71a35"}',
#        ...,
#        '{"refUuid":"4eb5cecf-42ce-4580-b89d-c2edb9624873","groupUuid":"2cacd9db-5da7-4ff4-9032-d38402d5825d","lightweightTec":{"poolId":33139449,"projectId":58019,"poolStartedAt":"2022-04-25T10:00:27.146","mayContainAdultContent":true,"title":"Find content from website (universal app prod)","description":"Find the exact web page that contains the listed information from the given website domain. Use the google translate or bing translate browser extensions to translate international web pages into understandable language.\\\\nНайдите точную веб-страницу, содержащую перечисленную информацию из данного домена веб-сайта. Используйте расширения браузера google translate или bing translate, чтобы переводить международные веб-страницы на понятный язык.\\\\n","hasInstructions":true,"snapshotMajorVersion":1,"snapshotMinorVersion":8,"snapshotMajorVersionActual":true,"assignmentConfig":{"reward":"0.020","maxDurationSeconds":600,"issuing":{"type":"AUTOMATIC"}},"trainingConfig":{"training":false},"requesterInfo":{"id":"97e0e18092318a1140eb08402e7cc5ac","name":{"EN":"Bing Local Search 2","FR":"Bing Local Search 2","ID":"Bing Local Search 2","RU":"Bing Local Search 2","TR":"Bing Local Search 2"},"trusted":false},"projectMetaInfo":{"projectId":58019,"bookmarked":true,"bookmarkedAt":"2022-03-12T01:28:35.577","experimentMeta":{"dj_task_duration__snippet__duration_less_than_minute":"1","dj_project_class__snippet__web_searching":"1","dj_project_tag__requester_type__snippet__experienced_requester":"1"}},"iframeSubdomain":"97e0e18092318a1140eb08402e7cc5ac"},"availability":{"available":true},"activeAssignments":[{"id":"0001f9aaf9--6266ca43ca6f212f45a79130","expireTime":"2022-04-25T16:30:19.528","secondsLeft":597,"reward":0.02}],"acceptanceDetails":{"postAccept":true,"acceptanceRate":99,"acceptancePeriodDays":1,"averageAcceptancePeriodDays":1},"trainingDetails":{"training":false},"taskDetails":{"grade":{"total_grade":4.87},"averageSubmitTimeSec":23,"averageAcceptanceTimeSec":86406,"moneyAvgHourly":3.13043472,"moneyAvg":17.48184971098265,"moneyMed":18.22,"moneyTop10":30.288000000000004,"moneyMax3":18.330940090548125},"grade":{"total_grade":4.87}}',
#        '{"refUuid":"e046bcc5-ece8-40b1-8203-e48fed3ddc99","groupUuid":"9fde347c-5a79-4e78-9023-fe325f8bd616","lightweightTec":{"poolId":33139449,"projectId":58019,"poolStartedAt":"2022-04-25T10:00:27.146","mayContainAdultContent":true,"title":"Find content from website (universal app prod)","description":"Find the exact web page that contains the listed information from the given website domain. Use the google translate or bing translate browser extensions to translate international web pages into understandable language.\\\\nНайдите точную веб-страницу, содержащую перечисленную информацию из данного домена веб-сайта. Используйте расширения браузера google translate или bing translate, чтобы переводить международные веб-страницы на понятный язык.\\\\n","hasInstructions":true,"snapshotMajorVersion":1,"snapshotMinorVersion":8,"snapshotMajorVersionActual":true,"assignmentConfig":{"reward":"0.020","maxDurationSeconds":600,"issuing":{"type":"AUTOMATIC"}},"trainingConfig":{"training":false},"requesterInfo":{"id":"97e0e18092318a1140eb08402e7cc5ac","name":{"EN":"Bing Local Search 2","FR":"Bing Local Search 2","ID":"Bing Local Search 2","RU":"Bing Local Search 2","TR":"Bing Local Search 2"},"trusted":false},"projectMetaInfo":{"projectId":58019,"bookmarked":true,"bookmarkedAt":"2022-03-12T01:28:35.577","experimentMeta":{"dj_task_duration__snippet__duration_less_than_minute":"1","dj_project_class__snippet__web_searching":"1","dj_project_tag__requester_type__snippet__experienced_requester":"1"}},"iframeSubdomain":"97e0e18092318a1140eb08402e7cc5ac"},"availability":{"available":true},"activeAssignments":[{"id":"0001f9aaf9--6266ca43ca6f212f45a79130","expireTime":"2022-04-25T16:30:19.528","secondsLeft":567,"reward":0.02}],"acceptanceDetails":{"postAccept":true,"acceptanceRate":99,"acceptancePeriodDays":1,"averageAcceptancePeriodDays":1},"trainingDetails":{"training":false},"taskDetails":{"grade":{"total_grade":4.87},"averageSubmitTimeSec":23,"averageAcceptanceTimeSec":86406,"moneyAvgHourly":3.13043472,"moneyAvg":17.48184971098265,"moneyMed":18.22,"moneyTop10":30.288000000000004,"moneyMax3":18.330940090548125},"grade":{"total_grade":4.87}}',
#        '{"uid":1206161147,"login":"sholesy@gmail.com","role":"WORKER","userLang":"EN","defaultEmail":"sholesy@gmail.com","connectionId":"s:1650784330835:uZlwaQ:2d","authorizationStatus":"VALID","avatarId":"0/0-0","displayName":"sholesy@gmail.com","fullName":"Oluwatosin Solesi","firstName":"Oluwatosin","lastName":"Solesi","isAccountOwner":true,"actualUser":{"uid":1206161147,"login":"sholesy@gmail.com","role":"WORKER","userLang":"EN","defaultEmail":"sholesy@gmail.com","displayName":"sholesy@gmail.com","readOnlyModeToActUnderAccount":false},"availableAccounts":[],"createdDate":"2020-10-26","systemBan":false,"gender":"FEMALE","birthDay":"1991-10-14","cityId":21063,"country":"NG","citizenship":"US","education":"HIGH","languages":["EN"],"adultAllowed":true,"acceptedEula":13,"rating":0,"authoritiesInfo":{"issuedAuthorities":["U_ASSIGNMENTS_VIEW","U_ASSIGNMENTS_HISTORY","APP_USER","U_MESSAGES_CREATE","U_MESSAGES_VIEW","U_FORUM_VIEW","U_PROFILE_VIEW","U_WALLETS_EDIT","U_TRANSACTIONS_VIEW","U_FORUM_EDIT","U_ASSIGNMENTS_UNDERTAKE","U_TRANSACTIONS_CREATE","U_ASSIGNMENTS_SUBMIT","U_PROFILE_EDIT","APP"],"notIssuedAuthoritiesReasons":{}},"balance":"0.191","blockedBalance":"0.035","regionId":20741}'],
#       dtype=object)
# {
#   "acceptanceDetails": {
#     "postAccept": false
#   },
#   "activeAssignments": [
#     {
#       "expireTime": "2022-01-20T19:37:28.194",
#       "id": "00001086d6--61e9b930a2d62b2b56644596",
#       "reward": 0.3,
#       "secondsLeft": 193
#     }
#   ],
#   "availability": {
#     "available": true
#   },
#   "groupUuid": "9455a911-5624-4951-9f4e-ea09cc1cc5f5",
#   "lightweightTec": {
#     "assignmentConfig": {
#       "issuing": {
#         "type": "AUTOMATIC"
#       },
#       "maxDurationSeconds": 200,
#       "reward": 0.3
#     },
#     "description": "Answer the questions in the survey. Choose one or more options or write your own answer",
#     "hasInstructions": true,
#     "iframeSubdomain": "54f8685950e9694b99faccce011a21df",
#     "mayContainAdultContent": false,
#     "poolId": 1083094,
#     "poolStartedAt": "2022-01-20T19:01:57.121",
#     "projectId": 89244,
#     "projectMetaInfo": {
#       "experimentMeta": {},
#       "projectId": 89244
#     },
#     "requesterInfo": {
#       "id": "54f8685950e9694b99faccce011a21df",
#       "name": {
#         "EN": "davidjohnsonits"
#       },
#       "trusted": false
#     },
#     "snapshotMajorVersion": 1,
#     "snapshotMajorVersionActual": true,
#     "snapshotMinorVersion": 2,
#     "title": "Survey One David",
#     "trainingConfig": {
#       "training": false
#     }
#   },
#   "refUuid": "a95bab84-7c39-45bc-9268-2f474011c0ae",
#   "taskDetails": {
#     "averageSubmitTimeSec": 13,
#     "moneyAvgHourly": 83.07692316
#   },
#   "trainingDetails": {
#     "training": false
#   }
# }

# df_toloka.platform.unique()
# array([ 0., nan])

# df_toloka.subtype.unique()
# array(['TASK_STARTED', 'TASKS_LIST', 'TASK_SUBMITED', 'OTHER',
#        'FINISHED_TASK', 'SYSTEM', 'GENERAL', 'ADDED_TASK', 'META_DATA',
#        'UNKNOWN', 'TASK_QUEUE', 'WORKER_QUALIFICATIONS',
#        'WORKER_DASHBOARD', 'WORKER_EARNINGS', 'WORKER_EARNINGS_DETAILS',
#        'TASK_INFO', 'MESSAGES_READ', 'TASK_TIMEOUT', 'REFERRAL',
#        'NOTIFICATIONS', 'MESSAGES_REQUESTER', 'MESSAGES_OUTBOX',
#        'MESSAGES_ADMIN', 'MESSAGES_NOTIFICATION'], dtype=object)

# df_toloka.type.unique()
# array(['WORKING', 'SEARCHING', 'OTHER', 'LOGS', 'CONFIG', 'API', 'SYSTEM',
#        'UNKNOWN', 'PROFILE', 'COMMUNICATION', 'REJECTED'], dtype=object)

# df_toloka.ordinal.unique()
# array([1, 2])

# df_toloka.unnamed.unique()
# array([        nan, 0.00000e+00, 2.89000e+02, ..., 3.79992e+05,
#        3.25610e+04, 3.61200e+04])
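
The `event` column holds JSON strings, with NaN where no payload was logged. A minimal sketch of turning those strings into usable fields follows. The helper names `parse_event` and `first_reward` are hypothetical, and the choice to read `activeAssignments[0].reward` is only one example of a field that might be of interest.

```python
import json

import numpy as np
import pandas as pd


def parse_event(value):
    """Return the decoded JSON object, or None for NaN or malformed input."""
    if not isinstance(value, str):
        return None
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return None


def first_reward(event):
    """Reward of the first active assignment, or None when absent."""
    if not isinstance(event, dict):
        return None
    assignments = event.get("activeAssignments") or []
    return assignments[0].get("reward") if assignments else None


# Toy stand-in for df_toloka["event"].
toy_events = pd.Series([
    np.nan,
    '{"activeAssignments": [{"reward": 0.3, "secondsLeft": 193}]}',
    "not valid json",
])

rewards = toy_events.map(parse_event).map(first_reward)
print(rewards)
```

Returning `None` for malformed rows keeps the conversion from failing on a single bad record; whether such rows should instead be dropped is a separate cleaning decision.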

## What is the unit of analysis?

## How many observations in total are in the data set? 

## How many unique observations are in the data set? 
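
One way to approach this question is to compare the total row count with the count after dropping exact duplicates, and to count distinct identifiers. The sketch below uses a toy frame. Treating `c1` as a record identifier is an assumption made here for illustration, not something confirmed by the data description.

```python
import pandas as pd

# Toy stand-in for df_toloka; the third row duplicates the second.
toy = pd.DataFrame({
    "c1": [1, 2, 2, 3],
    "user": ["a", "a", "a", "b"],
    "time": [
        "2022-02-12T18:18:16",
        "2022-02-12T18:18:18",
        "2022-02-12T18:18:18",
        "2022-02-13T09:00:00",
    ],
})

total_rows = len(toy)                      # all observations
unique_rows = len(toy.drop_duplicates())   # rows that differ in at least one column
unique_ids = toy["c1"].nunique()           # assumption: c1 identifies a record
unique_users = toy["user"].nunique()       # distinct workers

print(total_rows, unique_rows, unique_ids, unique_users)
```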

## What time period is covered?
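
The covered period can be read from the minimum and maximum of `time`, overall and per user. The sketch below assumes `time` holds ISO-formatted strings like those shown in the `head()` output and works on invented toy rows. The column name `span_days` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_toloka with ISO-formatted time strings.
toy = pd.DataFrame({
    "user": ["a", "a", "b"],
    "time": [
        "2022-02-12T18:18:16.148000",
        "2022-02-14T09:00:00.000000",
        "2022-04-25T16:30:19.528000",
    ],
})
toy["time"] = pd.to_datetime(toy["time"])

# Overall period covered by the data.
start, end = toy["time"].min(), toy["time"].max()

# Per-user first and last event, plus the number of whole days in between.
per_user = toy.groupby("user")["time"].agg(["min", "max"])
per_user["span_days"] = (per_user["max"] - per_user["min"]).dt.days

print(start, end)
print(per_user)
```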

## Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?

## Description of outcome with an appropriate visualization technique.
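
Since the target variable is still `TBD`, the following is only a sketch of one candidate outcome: the seconds a worker spends in each activity `type`, estimated from the gap between consecutive events of the same user. Both that definition and the column name `seconds` are assumptions made for illustration, and the toy rows are invented.

```python
import pandas as pd

# Toy stand-in for df_toloka, sorted by user and time as in the loading cell.
toy = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime([
        "2022-02-12 18:00:00",
        "2022-02-12 18:00:30",
        "2022-02-12 18:02:00",
        "2022-02-12 19:00:00",
        "2022-02-12 19:00:10",
    ]),
    "type": ["SEARCHING", "WORKING", "OTHER", "WORKING", "OTHER"],
}).sort_values(["user", "time"])

# Seconds until the same user's next event; the last event per user gets NaN.
next_time = toy.groupby("user")["time"].shift(-1)
toy["seconds"] = (next_time - toy["time"]).dt.total_seconds()

# Total estimated seconds per activity type (NaN gaps are skipped by sum).
seconds_by_type = toy.groupby("type")["seconds"].sum()
print(seconds_by_type)
```

Calling `seconds_by_type.plot(kind="bar")` would turn the result into a bar chart. On real data, long gaps between sessions would likely need to be capped or excluded before summing.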

## Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.