<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://bsethwalker.github.io/assets/img/clemson_paw.png">
</div>

## Course Project (Online Workers) - Checkpoint 1

**Clemson University**<br>
**Fall 2023**<br>
**Instructor(s):** Nina Hubig <br>
**Project Team:**
<ul>
    <li>David Croft &lt;dcroft@g.clemson.edu&gt;</li>
    <li>Stephen Becker &lt;sgbecke@g.clemson.edu&gt;</li>
    <li>Tony Hang &lt;qhang@g.clemson.edu&gt;</li>
    <li>Zachary Trabookis &lt;ztraboo@clemson.edu&gt;</li>
</ul>

---



In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://bsethwalker.github.io/assets/css/cpsc6300.css").text
HTML(styles)

## Summary Goals

* Summary of the data set that, at a minimum, answers the following questions: What is the unit of analysis? How many observations in total are in the data set? How many unique observations are in the data set? What time period is covered?
  
* Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?
  
* Description of outcome with an appropriate visualization technique.
  
* Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.

In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import mean_squared_error

import warnings

### Cleaning: Reading in the telemetry data (Amazon Mechanical Turk (AMT))
We read in and clean the data from `amazon_mechanical_turk_records.csv`.

**List of available variables (includes target variable `TBD: c6`):**

- **c1**: continuous
- **c2**: url to work task
- **c3**: categorical, 18 values (['PAGE_LOAD', 'PAGE_BLUR', 'TAB_CHANGE', 'PAGE_FOCUS', 'PAGE_CLICK', 'PAGE_SCROLL', 'PAGE_LAST', 'PAGE_CLOSE', 'INTERNALURL', 'PAGE_KEY', 'PAGE_INACTIVITY', 'TAB_CLOSED', 'EXTERNALURL', 'PAGE_REACTIVATE', 'SYSTEM_DISABLED_WORKING', 'SYSTEM_ENABLED_WORKING', 'SYSTEM_ENABLED', 'SYSTEM_DISABLED'])
- **c4**: json object {task_id, assignment_id, ...} – may include NaN values
- **c5**: categorical, 5 values ['OTHER', 'MTURK', 'FIVERR', 'UPWORK', 'FREELANCER']
- **c6**: categorical, 2 values (0: not complete, 1: complete) – may include NaN values
- **c7**: categorical, 29 values ['OTHER', 'TASK_STARTED', 'ADDED_TASK', 'TASK_SUBMITED', 'FINISHED_TASK', 'TASKS_LIST', 'WORKER_DASHBOARD', 'UNKNOWN', 'TASK_FRAME', 'TASK_PREVIEW', 'TASK_INFO', 'TASK_RETURNED', 'PLATFORM_LOGIN', 'TASK_QUEUE', 'TASK_SKIP', 'WORKER_EARNINGS_DETAILS', 'TASK_TIMEOUT', 'WORKER_EARNINGS', 'WORKER_QUALIFICATIONS', 'TASKS_PER_REQUESTER', 'MESSAGES_SEND', 'TASKS_LIST_FILTER', 'WORKER_QUALIFICATIONS_PENDING', 'TASKS_PREVIEW', 'PLATFORM_HELP', 'TASKS_PROJECTS', 'TASKS_DETAILS', 'MESSAGES_READ', 'TASKS_APPLY']
- **time**: continuous, Unix timestamp in milliseconds since 1970-01-01 (can be decomposed into week, month, day, hours, minutes, seconds)
- **c9**: categorical, 10 values ['OTHER', 'WORKING', 'LOGS', 'SEARCHING', 'PROFILE', 'UNKNOWN', 'REJECTED', 'COMMUNICATION', 'LEARNING', 'PROPOSAL']
- **user**: categorical, 120 unique values

In [3]:
# Define columns for data
columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'time', 'c9', 'user']

# Read the data into a dataframe
df_amt = pd.read_csv("../data/amazon_mechanical_turk_records.csv", encoding='utf-8', header=None, names=columns)

# Examine the first few rows of the dataframe
df_amt.head(10)

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,1,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_LOAD,,OTHER,0.0,OTHER,1588994215395,OTHER,ae862298385abab2a0a1619f8cedef9d
1,2,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_BLUR,,OTHER,0.0,OTHER,1588994217989,OTHER,ae862298385abab2a0a1619f8cedef9d
2,3,https://worker.mturk.com/projects/354DQCRRIJZH...,TAB_CHANGE,,MTURK,0.0,TASK_STARTED,1588994218051,WORKING,ae862298385abab2a0a1619f8cedef9d
3,4,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_FOCUS,,OTHER,0.0,OTHER,1588994221371,OTHER,ae862298385abab2a0a1619f8cedef9d
4,5,https://docs.google.com/forms/d/e/1FAIpQLScvig...,TAB_CHANGE,,OTHER,0.0,OTHER,1588994221397,OTHER,ae862298385abab2a0a1619f8cedef9d
5,6,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_CLICK,,OTHER,0.0,OTHER,1588994222607,OTHER,ae862298385abab2a0a1619f8cedef9d
6,7,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_SCROLL,,OTHER,0.0,OTHER,1588994223316,OTHER,ae862298385abab2a0a1619f8cedef9d
7,8,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_LAST,,OTHER,0.0,OTHER,1588994223667,OTHER,ae862298385abab2a0a1619f8cedef9d
8,9,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_CLOSE,,OTHER,0.0,OTHER,1588994225242,OTHER,ae862298385abab2a0a1619f8cedef9d
9,10,https://docs.google.com/forms/d/e/1FAIpQLScvig...,PAGE_CLICK,,OTHER,0.0,OTHER,1588994225243,OTHER,ae862298385abab2a0a1619f8cedef9d


In [4]:
# Look at the features
df_amt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3496374 entries, 0 to 3496373
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   c1      int64  
 1   c2      object 
 2   c3      object 
 3   c4      object 
 4   c5      object 
 5   c6      float64
 6   c7      object 
 7   c8      int64  
 8   c9      object 
 9   c10     object 
dtypes: float64(1), int64(2), object(7)
memory usage: 266.8+ MB


In [10]:
# Inspect the distinct values of the continuous/categorical columns.

# Number of distinct workers (the column is named 'user' in `columns` above)
df_amt.user.unique().size

# df_amt.c3.unique()
# array(['PAGE_LOAD', 'PAGE_BLUR', 'TAB_CHANGE', 'PAGE_FOCUS', 'PAGE_CLICK',
#        'PAGE_SCROLL', 'PAGE_LAST', 'PAGE_CLOSE', 'INTERNALURL',
#        'PAGE_KEY', 'PAGE_INACTIVITY', 'TAB_CLOSED', 'EXTERNALURL',
#        'PAGE_REACTIVATE', 'SYSTEM_DISABLED_WORKING',
#        'SYSTEM_ENABLED_WORKING', 'SYSTEM_ENABLED', 'SYSTEM_DISABLED'],
#       dtype=object)

# df_amt.c4.unique()
# {
#   "task_id": "3W9XHF7WGLV68SQR2YVUGGPI6QVTK3",
#   "assignment_id": "3E47SOBEYRW130WM6BKU2A6RAYQCIE",
#   "accepted_at": "2020-05-09T03:13:10.000Z",
#   "deadline": "2020-05-09T05:13:10.000Z",
#   "time_to_deadline_in_seconds": 6946,
#   "state": "Assigned",
#   "question": {
#     "value": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#     "type": "InternalURL",
#     "attributes": {
#       "FrameSourceAttribute": "https://www.mturkcontent.com/dynamic/hit?assignmentId=3E47SOBEYRW130WM6BKU2A6RAYQCIE&amp;hitId=3W9XHF7WGLV68SQR2YVUGGPI6QVTK3&amp;workerId=A3QVZ4SZB79D8W&amp;turkSubmitTo=https%3A%2F%2Fwww.mturk.com",
#       "FrameHeight": "0"
#     }
#   },
#   "project": {
#     "hit_set_id": "354DQCRRIJZHIYT5G3CFVURWGQJLSW",
#     "requester_id": "A28S2SRZA50N0",
#     "requester_name": "HCI Lab",
#     "title": "Install a chrome extension for 7 days ($1 bonus per each day) to measure your work performance.",
#     "description": "Install a chrome extension that will help you to keep track of how you spend time on MTurk (potentially helping you to avoid unpaid labor on MTurk).. You are asked to install the chrome extension to track how you are spending your time on the platform. ",
#     "assignment_duration_in_seconds": 7200,
#     "creation_time": "2020-05-09T03:01:26.000Z",
#     "assignable_hits_count": 1,
#     "latest_expiration_time": "2020-05-16T03:01:26.000Z",
#     "caller_meets_requirements": false,
#     "caller_meets_preview_requirements": false,
#     "last_updated_time": "2020-05-09T03:01:26.000Z",
#     "monetary_reward": {
#       "currency_code": "USD",
#       "amount_in_dollars": 2
#     },
#     "hit_requirements": [
#       {
#         "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#         "comparator": "Exists",
#         "worker_action": "ViewHitSet",
#         "qualification_values": [],
#         "caller_meets_requirement": null,
#         "qualification_type": {
#           "qualification_type_id": "3WHKV9Z6RB7LBJ77DO4ZLXEIHB2AWU",
#           "name": "CrowdCoach",
#           "visibility": true,
#           "description": "Have already used crowd coach plugin",
#           "has_test": false,
#           "is_requestable": true,
#           "keywords": null
#         },
#         "caller_qualification_value": {
#           "integer_value": null,
#           "locale_value": {
#             "country": null,
#             "subdivision": null
#           }
#         }
#       },
#       {
#         "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#         "comparator": "DoesNotExist",
#         "worker_action": "ViewHitSet",
#         "qualification_values": [],
#         "caller_meets_requirement": null,
#         "qualification_type": {
#           "qualification_type_id": "34O6CUUXI0IPA69PLM76WR259RNW0R",
#           "name": "GigOverhead Setup Diagnostic",
#           "visibility": true,
#           "description": "Installation and diagnostic survey",
#           "has_test": false,
#           "is_requestable": true,
#           "keywords": null
#         },
#         "caller_qualification_value": {
#           "integer_value": null,
#           "locale_value": {
#             "country": null,
#             "subdivision": null
#           }
#         }
#       }
#     ],
#     "requester_url": "/requesters/A28S2SRZA50N0/projects?ref=w_pl_prvw"
#   },
#   "expired_task_action_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks?ref=w_pl_prvw",
#   "task_url": "/projects/354DQCRRIJZHIYT5G3CFVURWGQJLSW/tasks/3W9XHF7WGLV68SQR2YVUGGPI6QVTK3?assignment_id=3E47SOBEYRW130WM6BKU2A6RAYQCIE&ref=w_pl_prvw"
# }

# df_amt.c5.unique()
# array(['OTHER', 'MTURK', 'FIVERR', 'UPWORK', 'FREELANCER'], dtype=object)

# df_amt.c6.unique()
# array([ 0.,  1., nan])

# df_amt.c7.unique()
# array(['OTHER', 'TASK_STARTED', 'ADDED_TASK', 'TASK_SUBMITED',
#        'FINISHED_TASK', 'TASKS_LIST', 'WORKER_DASHBOARD', 'UNKNOWN',
#        'TASK_FRAME', 'TASK_PREVIEW', 'TASK_INFO', 'TASK_RETURNED',
#        'PLATFORM_LOGIN', 'TASK_QUEUE', 'TASK_SKIP',
#        'WORKER_EARNINGS_DETAILS', 'TASK_TIMEOUT', 'WORKER_EARNINGS',
#        'WORKER_QUALIFICATIONS', 'TASKS_PER_REQUESTER', 'MESSAGES_SEND',
#        'TASKS_LIST_FILTER', 'WORKER_QUALIFICATIONS_PENDING',
#        'TASKS_PREVIEW', 'PLATFORM_HELP', 'TASKS_PROJECTS',
#        'TASKS_DETAILS', 'MESSAGES_READ', 'TASKS_APPLY'], dtype=object)

# df_amt.c9.unique()
# array(['OTHER', 'WORKING', 'LOGS', 'SEARCHING', 'PROFILE', 'UNKNOWN',
#        'REJECTED', 'COMMUNICATION', 'LEARNING', 'PROPOSAL'], dtype=object)

120
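
The `c4` column holds one JSON string per event, or NaN when no task is attached. A minimal sketch of pulling a few fields out of it, assuming the keys shown in the commented sample above (`task_id`, `state`, `project.monetary_reward.amount_in_dollars`). The helper `parse_task` and the two-row frame `sample` are hypothetical stand-ins for `df_amt`, not part of the data set.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for df_amt; the JSON is trimmed from the sample above.
sample = pd.DataFrame({
    "c4": [
        np.nan,
        '{"task_id": "3W9XHF7WGLV68SQR2YVUGGPI6QVTK3", "state": "Assigned", '
        '"project": {"monetary_reward": {"currency_code": "USD", "amount_in_dollars": 2}}}',
    ]
})


def parse_task(raw):
    """Return selected fields from one c4 value, or NaNs when it is missing or malformed."""
    empty = {"task_id": np.nan, "state": np.nan, "reward_usd": np.nan}
    if not isinstance(raw, str):
        return empty
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(obj, dict):
        return empty
    reward = obj.get("project", {}).get("monetary_reward", {}).get("amount_in_dollars", np.nan)
    return {
        "task_id": obj.get("task_id", np.nan),
        "state": obj.get("state", np.nan),
        "reward_usd": reward,
    }


# One output row per input row, aligned on the original index.
task_fields = pd.DataFrame([parse_task(v) for v in sample["c4"]], index=sample.index)
print(task_fields)
```

On the full data the same call would be `df_amt["c4"]` in place of `sample["c4"]`; with roughly 3.5 million rows it may be worth parsing only the rows where `c4` is not NaN.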

## What is the unit of analysis?

## How many observations in total are in the data set? 

## How many unique observations are in the data set? 
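
One possible way to count unique observations, sketched on a hypothetical four-row stand-in for `df_amt`. It assumes that two rows are the same observation when they agree on every column except the running counter `c1`; that definition is an assumption and may need to change once the unit of analysis is fixed.

```python
import pandas as pd

# Hypothetical stand-in for df_amt with one repeated event (second and third rows).
sample = pd.DataFrame({
    "c1": [1, 2, 3, 4],
    "c3": ["PAGE_LOAD", "PAGE_BLUR", "PAGE_BLUR", "TAB_CHANGE"],
    "time": [1588994215395, 1588994217989, 1588994217989, 1588994218051],
    "user": ["u1", "u1", "u1", "u1"],
})

# Compare rows on everything except the running counter c1 (an assumption).
key_columns = [col for col in sample.columns if col != "c1"]

n_total = len(sample)
n_unique = len(sample.drop_duplicates(subset=key_columns))
n_users = sample["user"].nunique()

print(f"total rows: {n_total}, unique rows: {n_unique}, users: {n_users}")
```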

## What time period is covered?
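
A short sketch of how the covered period could be derived, assuming the `time` column is a Unix timestamp in milliseconds as described in the variable list. The three values are copied from the `df_amt.head(10)` output; `sample` is a hypothetical stand-in for the full frame.

```python
import pandas as pd

# Timestamps copied from the first rows of the df_amt.head(10) output (Unix ms).
sample = pd.DataFrame({"time": [1588994215395, 1588994217989, 1588994218051]})

# Interpret the integers as milliseconds since 1970-01-01 UTC.
sample["timestamp"] = pd.to_datetime(sample["time"], unit="ms", utc=True)

start, end = sample["timestamp"].min(), sample["timestamp"].max()
print(f"covers {start} to {end} ({end - start})")
```

Applied to `df_amt["time"]`, the same `min()` and `max()` would give the first and last recorded event.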

## Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?

## Description of outcome with an appropriate visualization technique.
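
If `c6` is kept as the outcome (it is still marked TBD above), a bar chart of its value counts is one simple option. The sketch below uses a hypothetical six-value series in place of `df_amt["c6"]` and keeps NaN as its own bar so that missing outcomes stay visible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_amt["c6"]; the real column holds 0.0, 1.0 and NaN.
c6 = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, 1.0], name="c6")

# Count each outcome, keeping NaN as its own category.
counts = c6.value_counts(dropna=False).sort_index(na_position="last")
labels = ["missing" if pd.isna(v) else str(int(v)) for v in counts.index]

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(labels, counts.to_numpy())
ax.set_xlabel("c6 (0 = not complete, 1 = complete)")
ax.set_ylabel("number of events")
ax.set_title("Outcome distribution")
fig.tight_layout()
```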

## Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.
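
One way to compare a categorical predictor with the outcome is a row-normalised cross-tabulation. The sketch below pairs `c9` (activity type) with `c6`; choosing `c9` is only an illustration, and `sample` is a hypothetical stand-in for `df_amt`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_amt; c9 is the activity type and c6 the completion flag.
sample = pd.DataFrame({
    "c9": ["WORKING", "WORKING", "SEARCHING", "SEARCHING", "OTHER", "WORKING"],
    "c6": [1.0, 1.0, 0.0, 0.0, 0.0, np.nan],
})

# Drop rows with a missing outcome, then compute the share of each outcome per category.
labelled = sample.dropna(subset=["c6"])
rates = pd.crosstab(labelled["c9"], labelled["c6"], normalize="index")
print(rates)
```

Each row of `rates` sums to 1, so `rates.plot(kind="bar", stacked=True)` would show the completion share per activity type.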

### Cleaning: Reading in the telemetry data (Toloka)
We read in and clean the data from `toloka_telemetry_db.csv`.

**List of available variables (includes target variable `TBD`):**

- **c1**: continuous
- **c2**: url to work task
- **current**: categorical, 31 values ['PAGE_LOAD', 'TAB_CLOSED', 'PAGE_BLUR', 'PAGE_FOCUS', 'TAB_CHANGE', 'CONFIG_UPDATE', 'PLUGIN_INSTALL', 'CONFIG_FILE', 'APP_ACTIVATED', 'PAGE_CLOSE', 'USER', 'BELL_CLICK', 'PAGE_LAST', 'PAGE_CLICK', 'PAGE_KEY', 'PAGE_SCROLL', 'TASK', 'PAGE_INACTIVITY', 'TRAINING', 'PAGE_REACTIVATE', 'LIST_NEW', 'LIST_RECOM', 'LIST_PAY', 'SETT_CLICK', 'MSG_RCV_WORKER', 'MSG_CLICK_WORKER', 'TASK_HIDE_OFF', 'TASK_HIDE_ON', 'MSG_RCV_REQUESTER', 'SETT_SAVE', 'MSG_CLICK_REQUESTER'] – the column also contains URLs (e.g. 'https://toloka.yandex.com') that may need to be removed or transformed
- **event**: json object {activeAssignments, ...} – may include NaN values
- **extra**: categorical, `TOLOKA` in the first rows shown below
- **platform**: categorical, 2 values {0, NaN}
- **subtype**: categorical, 24 values ['TASK_STARTED', 'TASKS_LIST', 'TASK_SUBMITED', 'OTHER', 'FINISHED_TASK', 'SYSTEM', 'GENERAL', 'ADDED_TASK', 'META_DATA', 'UNKNOWN', 'TASK_QUEUE', 'WORKER_QUALIFICATIONS', 'WORKER_DASHBOARD', 'WORKER_EARNINGS', 'WORKER_EARNINGS_DETAILS', 'TASK_INFO', 'MESSAGES_READ', 'TASK_TIMEOUT', 'REFERRAL', 'NOTIFICATIONS', 'MESSAGES_REQUESTER', 'MESSAGES_OUTBOX', 'MESSAGES_ADMIN', 'MESSAGES_NOTIFICATION']
- **time**: continuous (duration)
- **type**: categorical, 11 values ['WORKING', 'SEARCHING', 'OTHER', 'LOGS', 'CONFIG', 'API', 'SYSTEM', 'UNKNOWN', 'PROFILE', 'COMMUNICATION', 'REJECTED']
- **user**: categorical, user id
- **ordinal**: continuous, 2 values [1, 2]
- **unnamed**: continuous, values [nan, 0.00000e+00, 2.89000e+02, ..., 3.79992e+05, 3.25610e+04, 3.61200e+04] – may include NaN values

In [6]:
# Define columns for data
columns = ['c1', 'c2', 'current', 'event', 'extra', 'platform', 'subtype', 'time', 'type', 'user', 'ordinal', 'unnamed']

# Read the data into a dataframe (the file has its own header row; header=0 replaces it with `names`)
df_toloka = pd.read_csv("../data/toloka_telemetry_db.csv", encoding='utf-8', header=0, names=columns)

# Examine the first few rows of the dataframe
df_toloka.head(10)

Unnamed: 0,c1,c2,current,event,extra,platform,subtype,time,type,user,ordinal,unnamed
0,495553,https://toloka.yandex.com/task/31577897/0001e1...,PAGE_LOAD,,TOLOKA,0.0,TASK_STARTED,1290745000000.0,WORKING,311ad54dd8763dd3365ea2342627aaf,1,
1,1605885,https://toloka.yandex.com/tasks,TAB_CLOSED,,TOLOKA,0.0,TASKS_LIST,1351745000000.0,SEARCHING,fd978fa116dde1ead273fa5fc7316697,1,
2,0,https://toloka.yandex.com/tasks,PAGE_BLUR,,TOLOKA,,TASKS_LIST,1642014000000.0,SEARCHING,d75e96a84a13b15a1f6291c4c8df8b,2,0.0
3,0,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_FOCUS,,TOLOKA,0.0,TASK_STARTED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
4,1,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_FOCUS,,TOLOKA,0.0,TASK_SUBMITED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
5,2,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_BLUR,,TOLOKA,0.0,TASK_STARTED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
6,4,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_FOCUS,,TOLOKA,0.0,TASK_STARTED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
7,6,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_BLUR,,TOLOKA,0.0,TASK_STARTED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
8,7,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_BLUR,,TOLOKA,0.0,TASK_SUBMITED,1642719000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,
9,8,https://sandbox.toloka.yandex.com/es/task/1083...,PAGE_FOCUS,,TOLOKA,0.0,TASK_STARTED,1642720000000.0,WORKING,8d22505d156899a9e716e418221b2d10,1,


In [7]:
# Look at the features
df_toloka.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2894936 entries, 0 to 2894935
Data columns (total 12 columns):
 #   Column    Dtype  
---  ------    -----  
 0   c1        int64  
 1   c2        object 
 2   current   object 
 3   event     object 
 4   extra     object 
 5   platform  float64
 6   subtype   object 
 7   time      float64
 8   type      object 
 9   user      object 
 10  ordinal   int64  
 11  unnamed   float64
dtypes: float64(3), int64(2), object(7)
memory usage: 265.0+ MB


In [8]:
# Output continuous/categorical columns to see what the data has.

# df_toloka.current.unique()
# array(['PAGE_LOAD', 'TAB_CLOSED', 'PAGE_BLUR', 'PAGE_FOCUS', 'TAB_CHANGE',
#        'CONFIG_UPDATE', 'PLUGIN_INSTALL', 'CONFIG_FILE', 'APP_ACTIVATED',
#        'PAGE_CLOSE', 'USER', 'BELL_CLICK', 'PAGE_LAST', 'PAGE_CLICK',
#        'PAGE_KEY', 'PAGE_SCROLL', 'TASK', 'PAGE_INACTIVITY', 'TRAINING',
#        'PAGE_REACTIVATE', 'LIST_NEW', 'LIST_RECOM', 'LIST_PAY',
#        'SETT_CLICK', 'MSG_RCV_WORKER', 'MSG_CLICK_WORKER',
#        'https://toloka.yandex.com/tasks', 'TASK_HIDE_OFF', 'TASK_HIDE_ON',
#        'MSG_RCV_REQUESTER', 'SETT_SAVE',
#        'https://toloka.yandex.com/vi/tasks',
#        'https://toloka.yandex.com/fr/messages',
#        'https://toloka.yandex.com/es/tasks', 'MSG_CLICK_REQUESTER',
#        'https://toloka.yandex.com/profile/history/73643/2022-01-20',
#        'https://toloka.yandex.com/fr/tasks',
#        'https://toloka.yandex.com/task/31740731?refUuid=75fd7974-884b-4fff-aa42-b08d20fba5b4',
#        'https://toloka.yandex.com/task/21133006/',
#        'https://toloka.yandex.com/tasks/active',
#        'https://toloka.yandex.com/profile/history?status=all',
#        'https://toloka.yandex.com/profile/money',
#        'https://toloka.yandex.com/messages/inbox/620abf1a68ab666108b4b4a0',
#        'https://toloka.yandex.com/task/31161320?refUuid=6acf2a47-f652-4d2a-b1de-4f4c4724c5f6',
#        'https://toloka.yandex.com/task/2990538/00002da1ca--620b8c1db0c49a3a992921d4',
#        'https://toloka.yandex.com/task/31746103/0001e46837--620bb1bec930da47000174d0',
#        'https://toloka.yandex.com/profile/history?status=income',
#        'https://toloka.yandex.com/profile/history?status=blocked',
#        'https://toloka.yandex.com/profile/edit',
# ...
#        'https://toloka.yandex.com/ru/tasks',
#        'https://toloka.yandex.com/ru/task/32969499/',
#        'https://toloka.yandex.com/ru/task/32969361/',
#        'https://toloka.yandex.com/ru/task/33037260',
#        'https://toloka.yandex.com/ru/messages'], dtype=object)


# df_toloka.event.unique()
# array([nan,
#        
#        '{"browserName":"CHROME","currentMode":"PASSIVE","currentState":0,"dailySurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSe2XInPGnb9EI539KyUiBNr6gcmRNPzg55LRUaEmXFtx_3zqg/viewform?usp=pp_url&entry.1915115278=","finalSurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSefhPyHJs9x-_v0WxQ5aI3kGF3hfXNZH2vk3KMx1C8CnoZSGw/viewform?usp=pp_url&entry.1172212682=","groupId":"GN","hideUnpaidTasks":true,"initialSurveyUrl":"https://docs.google.com/forms/d/e/1FAIpQLSdWQa1AkVc5vV0VhdOYQLGgK77Pw-LEpmsuCBKpdo7QQopWbg/viewform?usp=pp_url&entry.784114392=","installTime":1642708724229,"instructionsUrl":"https://bit.ly/cul-act-gn","isUserStudy":true,"logServerUrl":"https://script.google.com/macros/s/AKfycbwGx2_5a6IwcNI2YZuz2AZvb1J-7Y8Ulk5fYHjZoA8wvHzajv9P55DYiI8UnoV0W403HA/exec","mode":"PROTOCOL","nextDue":1643313524229,"pluginName":"Toloka Assistant","protocol":[{"durationMins":10080,"mode":"PASSIVE"},{"durationMins":10080,"mode":"ACTIVE"},{"durationMins":10080,"mode":"FINISH"}],"rankMethod":"AI","sandbox":false,"serverUrls":["https://script.google.com/macros/s/AKfycbwGx2_5a6IwcNI2YZuz2AZvb1J-7Y8Ulk5fYHjZoA8wvHzajv9P55DYiI8UnoV0W403HA/exec","https://hcilab.ml/overhead/api"],"settings":{"msg_requ":true,"msg_work":true,"not_brow":true,"not_page":true,"not_whil":true,"num_task":5},"socketUrl":"http://hcilab.ml:5000","studyDurationDays":14,"studyDurationMins":20160,"userData":{"acceptedEula":12,"actualUser":{"defaultEmail":"carlostoxtli@yandex.com","displayName":"carlostoxtli","login":"carlostoxtli","readOnlyModeToActUnderAccount":false,"role":"WORKER","uid":1274303000,"userLang":"ES"},"adultAllowed":true,"authoritiesInfo":{"issuedAuthorities":["APP_USER","U_WALLETS_EDIT","U_ASSIGNMENTS_VIEW","U_TRANSACTIONS_CREATE","U_ASSIGNMENTS_UNDERTAKE","APP","U_ASSIGNMENTS_HISTORY","U_PROFILE_VIEW","U_TRANSACTIONS_VIEW","U_ASSIGNMENTS_SUBMIT","U_FORUM_VIEW","U_MESSAGES_CREATE","U_FORUM_EDIT","U_PROFILE_EDIT","U_MESSAGES_VIEW"],"notIssuedAuthoritiesReasons":{}},"availableAccounts":[],"balance":0,"birthDay":"1982-11-15","blockedBalance":0,"citizenship":"US","cityId":103027,"country":"US","createdDate":"2020-12-21","defaultEmail":"carlostoxtli@yandex.com","displayName":"carlostoxtli","education":"HIGH","firstName":"Carlos","fullName":"Carlos Toxtli","gender":"MALE","isAccountOwner":true,"languages":["EN","ES"],"lastName":"Toxtli","login":"carlostoxtli","rating":0,"regionId":223,"role":"WORKER","systemBan":false,"uid":1274303000,"userLang":"ES"},"userId":"a5d84fcd0637d31f4675cdf17b71a35"}',
#        ...,
#        '{"refUuid":"4eb5cecf-42ce-4580-b89d-c2edb9624873","groupUuid":"2cacd9db-5da7-4ff4-9032-d38402d5825d","lightweightTec":{"poolId":33139449,"projectId":58019,"poolStartedAt":"2022-04-25T10:00:27.146","mayContainAdultContent":true,"title":"Find content from website (universal app prod)","description":"Find the exact web page that contains the listed information from the given website domain. Use the google translate or bing translate browser extensions to translate international web pages into understandable language.\\\\nНайдите точную веб-страницу, содержащую перечисленную информацию из данного домена веб-сайта. Используйте расширения браузера google translate или bing translate, чтобы переводить международные веб-страницы на понятный язык.\\\\n","hasInstructions":true,"snapshotMajorVersion":1,"snapshotMinorVersion":8,"snapshotMajorVersionActual":true,"assignmentConfig":{"reward":"0.020","maxDurationSeconds":600,"issuing":{"type":"AUTOMATIC"}},"trainingConfig":{"training":false},"requesterInfo":{"id":"97e0e18092318a1140eb08402e7cc5ac","name":{"EN":"Bing Local Search 2","FR":"Bing Local Search 2","ID":"Bing Local Search 2","RU":"Bing Local Search 2","TR":"Bing Local Search 2"},"trusted":false},"projectMetaInfo":{"projectId":58019,"bookmarked":true,"bookmarkedAt":"2022-03-12T01:28:35.577","experimentMeta":{"dj_task_duration__snippet__duration_less_than_minute":"1","dj_project_class__snippet__web_searching":"1","dj_project_tag__requester_type__snippet__experienced_requester":"1"}},"iframeSubdomain":"97e0e18092318a1140eb08402e7cc5ac"},"availability":{"available":true},"activeAssignments":[{"id":"0001f9aaf9--6266ca43ca6f212f45a79130","expireTime":"2022-04-25T16:30:19.528","secondsLeft":597,"reward":0.02}],"acceptanceDetails":{"postAccept":true,"acceptanceRate":99,"acceptancePeriodDays":1,"averageAcceptancePeriodDays":1},"trainingDetails":{"training":false},"taskDetails":{"grade":{"total_grade":4.87},"averageSubmitTimeSec":23,"averageAcceptanceTimeSec":86406,"moneyAvgHourly":3.13043472,"moneyAvg":17.48184971098265,"moneyMed":18.22,"moneyTop10":30.288000000000004,"moneyMax3":18.330940090548125},"grade":{"total_grade":4.87}}',
#        '{"refUuid":"e046bcc5-ece8-40b1-8203-e48fed3ddc99","groupUuid":"9fde347c-5a79-4e78-9023-fe325f8bd616","lightweightTec":{"poolId":33139449,"projectId":58019,"poolStartedAt":"2022-04-25T10:00:27.146","mayContainAdultContent":true,"title":"Find content from website (universal app prod)","description":"Find the exact web page that contains the listed information from the given website domain. Use the google translate or bing translate browser extensions to translate international web pages into understandable language.\\\\nНайдите точную веб-страницу, содержащую перечисленную информацию из данного домена веб-сайта. Используйте расширения браузера google translate или bing translate, чтобы переводить международные веб-страницы на понятный язык.\\\\n","hasInstructions":true,"snapshotMajorVersion":1,"snapshotMinorVersion":8,"snapshotMajorVersionActual":true,"assignmentConfig":{"reward":"0.020","maxDurationSeconds":600,"issuing":{"type":"AUTOMATIC"}},"trainingConfig":{"training":false},"requesterInfo":{"id":"97e0e18092318a1140eb08402e7cc5ac","name":{"EN":"Bing Local Search 2","FR":"Bing Local Search 2","ID":"Bing Local Search 2","RU":"Bing Local Search 2","TR":"Bing Local Search 2"},"trusted":false},"projectMetaInfo":{"projectId":58019,"bookmarked":true,"bookmarkedAt":"2022-03-12T01:28:35.577","experimentMeta":{"dj_task_duration__snippet__duration_less_than_minute":"1","dj_project_class__snippet__web_searching":"1","dj_project_tag__requester_type__snippet__experienced_requester":"1"}},"iframeSubdomain":"97e0e18092318a1140eb08402e7cc5ac"},"availability":{"available":true},"activeAssignments":[{"id":"0001f9aaf9--6266ca43ca6f212f45a79130","expireTime":"2022-04-25T16:30:19.528","secondsLeft":567,"reward":0.02}],"acceptanceDetails":{"postAccept":true,"acceptanceRate":99,"acceptancePeriodDays":1,"averageAcceptancePeriodDays":1},"trainingDetails":{"training":false},"taskDetails":{"grade":{"total_grade":4.87},"averageSubmitTimeSec":23,"averageAcceptanceTimeSec":86406,"moneyAvgHourly":3.13043472,"moneyAvg":17.48184971098265,"moneyMed":18.22,"moneyTop10":30.288000000000004,"moneyMax3":18.330940090548125},"grade":{"total_grade":4.87}}',
#        '{"uid":1206161147,"login":"sholesy@gmail.com","role":"WORKER","userLang":"EN","defaultEmail":"sholesy@gmail.com","connectionId":"s:1650784330835:uZlwaQ:2d","authorizationStatus":"VALID","avatarId":"0/0-0","displayName":"sholesy@gmail.com","fullName":"Oluwatosin Solesi","firstName":"Oluwatosin","lastName":"Solesi","isAccountOwner":true,"actualUser":{"uid":1206161147,"login":"sholesy@gmail.com","role":"WORKER","userLang":"EN","defaultEmail":"sholesy@gmail.com","displayName":"sholesy@gmail.com","readOnlyModeToActUnderAccount":false},"availableAccounts":[],"createdDate":"2020-10-26","systemBan":false,"gender":"FEMALE","birthDay":"1991-10-14","cityId":21063,"country":"NG","citizenship":"US","education":"HIGH","languages":["EN"],"adultAllowed":true,"acceptedEula":13,"rating":0,"authoritiesInfo":{"issuedAuthorities":["U_ASSIGNMENTS_VIEW","U_ASSIGNMENTS_HISTORY","APP_USER","U_MESSAGES_CREATE","U_MESSAGES_VIEW","U_FORUM_VIEW","U_PROFILE_VIEW","U_WALLETS_EDIT","U_TRANSACTIONS_VIEW","U_FORUM_EDIT","U_ASSIGNMENTS_UNDERTAKE","U_TRANSACTIONS_CREATE","U_ASSIGNMENTS_SUBMIT","U_PROFILE_EDIT","APP"],"notIssuedAuthoritiesReasons":{}},"balance":"0.191","blockedBalance":"0.035","regionId":20741}'],
#       dtype=object)
# {
#   "acceptanceDetails": {
#     "postAccept": false
#   },
#   "activeAssignments": [
#     {
#       "expireTime": "2022-01-20T19:37:28.194",
#       "id": "00001086d6--61e9b930a2d62b2b56644596",
#       "reward": 0.3,
#       "secondsLeft": 193
#     }
#   ],
#   "availability": {
#     "available": true
#   },
#   "groupUuid": "9455a911-5624-4951-9f4e-ea09cc1cc5f5",
#   "lightweightTec": {
#     "assignmentConfig": {
#       "issuing": {
#         "type": "AUTOMATIC"
#       },
#       "maxDurationSeconds": 200,
#       "reward": 0.3
#     },
#     "description": "Answer the questions in the survey. Choose one or more options or write your own answer",
#     "hasInstructions": true,
#     "iframeSubdomain": "54f8685950e9694b99faccce011a21df",
#     "mayContainAdultContent": false,
#     "poolId": 1083094,
#     "poolStartedAt": "2022-01-20T19:01:57.121",
#     "projectId": 89244,
#     "projectMetaInfo": {
#       "experimentMeta": {},
#       "projectId": 89244
#     },
#     "requesterInfo": {
#       "id": "54f8685950e9694b99faccce011a21df",
#       "name": {
#         "EN": "davidjohnsonits"
#       },
#       "trusted": false
#     },
#     "snapshotMajorVersion": 1,
#     "snapshotMajorVersionActual": true,
#     "snapshotMinorVersion": 2,
#     "title": "Survey One David",
#     "trainingConfig": {
#       "training": false
#     }
#   },
#   "refUuid": "a95bab84-7c39-45bc-9268-2f474011c0ae",
#   "taskDetails": {
#     "averageSubmitTimeSec": 13,
#     "moneyAvgHourly": 83.07692316
#   },
#   "trainingDetails": {
#     "training": false
#   }
# }

# df_toloka.platform.unique()
# array([ 0., nan])

# df_toloka.subtype.unique()
# array(['TASK_STARTED', 'TASKS_LIST', 'TASK_SUBMITED', 'OTHER',
#        'FINISHED_TASK', 'SYSTEM', 'GENERAL', 'ADDED_TASK', 'META_DATA',
#        'UNKNOWN', 'TASK_QUEUE', 'WORKER_QUALIFICATIONS',
#        'WORKER_DASHBOARD', 'WORKER_EARNINGS', 'WORKER_EARNINGS_DETAILS',
#        'TASK_INFO', 'MESSAGES_READ', 'TASK_TIMEOUT', 'REFERRAL',
#        'NOTIFICATIONS', 'MESSAGES_REQUESTER', 'MESSAGES_OUTBOX',
#        'MESSAGES_ADMIN', 'MESSAGES_NOTIFICATION'], dtype=object)

# df_toloka.type.unique()
# array(['WORKING', 'SEARCHING', 'OTHER', 'LOGS', 'CONFIG', 'API', 'SYSTEM',
#        'UNKNOWN', 'PROFILE', 'COMMUNICATION', 'REJECTED'], dtype=object)

# df_toloka.ordinal.unique()
# array([1, 2])

# df_toloka.unnamed.unique()
# array([        nan, 0.00000e+00, 2.89000e+02, ..., 3.79992e+05,
#        3.25610e+04, 3.61200e+04])

## What is the unit of analysis?

## How many observations in total are in the data set? 

## How many unique observations are in the data set? 

## What time period is covered?

## Brief summary of any data cleaning steps you have performed. For example, are there any particular observations / time periods / groups / etc. you have excluded?
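
The `unique()` listing for `current` shows URLs mixed in with the event names. A possible cleaning step, sketched on a hypothetical four-value series in place of `df_toloka["current"]`, is to flag those entries and collapse them into a single placeholder. The placeholder name `URL` is an assumption; dropping the rows or moving the value into `c2` would be equally valid choices.

```python
import pandas as pd

# Hypothetical stand-in for df_toloka["current"]; the URLs are copied from the unique() listing.
current = pd.Series([
    "PAGE_LOAD",
    "https://toloka.yandex.com/tasks",
    "TAB_CLOSED",
    "https://toloka.yandex.com/ru/messages",
], name="current")

# Flag entries that look like URLs rather than event names (NaN counts as not a URL).
is_url = current.str.match(r"https?://", na=False)

# Option shown here: replace every URL with one placeholder category (an assumption).
current_clean = current.where(~is_url, "URL")
print(current_clean.value_counts())
```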

## Description of outcome with an appropriate visualization technique.

## Description of key predictors with appropriate visualization techniques that compare predictors to the response. You should investigate all predictors in your data as part of your project. For the purpose of this assignment, pick the one or two predictors that you think are going to be most important in explaining the outcome. Your selection of predictors can either be guided by your domain knowledge or be the result of your EDA on all predictors.