# ABOUT THIS NOTEBOOK

This Jupyter notebook simplifies extracting and deduplicating M2C2kit data from the production servers.

Below is code to (1) download, (2) deduplicate and (3) save the data as a CSV.

Before you can use this Jupyter notebook, you need to (1) choose a username and create a password by following the [authorization guide](https://github.com/m2c2-project/m2c2kit-integration-guides/blob/main/docs/authorization_guide.md) and (2) submit an [Airtable form](https://airtable.com/app0JQhjqc5VNZMpZ/shr2FrUEAeaZV7RzF) to request data access. Please wait until our team confirms that you have been entered into the system before proceeding. Once that is done, you will enter your username, password, and `study_id` in the designated fields of this notebook.

## How is Data Saved by M2C2kit?

Data from assessments are saved every trial. As a result, you can expect duplication when you query the raw data from our database. 
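To see why this produces duplicates, here is a minimal sketch. It assumes that each save writes every trial completed so far; the records and timestamps are made up for illustration and are not the full M2C2kit schema.

```python
import pandas as pd

# Hypothetical session with three trials; values are illustrative only.
trials = [
    {"session_uuid": "s1", "trial_index": 0, "trial_begin_iso8601_timestamp": "2023-11-29T19:49:48.346Z"},
    {"session_uuid": "s1", "trial_index": 1, "trial_begin_iso8601_timestamp": "2023-11-29T19:49:54.263Z"},
    {"session_uuid": "s1", "trial_index": 2, "trial_begin_iso8601_timestamp": "2023-11-29T19:50:01.000Z"},
]

# Assumption: after each trial, every trial completed so far is saved again.
saved_rows = []
for n_completed in range(1, len(trials) + 1):
    saved_rows.extend(trials[:n_completed])

raw = pd.DataFrame(saved_rows)
dedup = raw.drop_duplicates(subset=["session_uuid", "trial_begin_iso8601_timestamp"])

print(f"raw rows: {len(raw)}, unique trials: {len(dedup)}")
```

Under this assumption, the first trial is stored three times, the second twice, and the last once, which is why the raw query returns more rows than there are trials.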

## Before You Start Your Study

Before beginning your study, please ensure that the `user_uid` values in the dataframes you download with this notebook match what you expect from either Qualtrics (whatever criteria you use) or Metricwire (24 alphanumeric characters).
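One way to run this check for Metricwire is sketched below. It assumes the IDs are exactly 24 alphanumeric characters, as stated above; the helper name `find_unexpected_ids` and the example values are hypothetical.

```python
import re
import pandas as pd

# Pattern for the Metricwire format described above: exactly 24 alphanumeric characters.
METRICWIRE_ID = re.compile(r"[A-Za-z0-9]{24}")

def find_unexpected_ids(ids):
    """Return the unique IDs that do not match the expected Metricwire format."""
    unique_ids = pd.Series(ids).dropna().astype(str).unique()
    return [i for i in unique_ids if METRICWIRE_ID.fullmatch(i) is None]

# Illustrative values only; the second one does not match the format.
example_ids = ["64624f8bd957b73d388a4dbe", "test-user", "64624f8bd957b73d388a4dbe"]
print(find_unexpected_ids(example_ids))
```

In practice you would pass in whichever dataframe column holds your participant identifiers and review any values the function returns.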

# Getting Started

To get started with this Jupyter notebook, you will need to do one of the following:

- Install [Anaconda](https://www.anaconda.com/)
- Install [Jupyter Lab](https://jupyter.org/install)
- Install [Visual Studio Code with the Jupyter Notebook Extension](https://code.visualstudio.com/)

Once you've configured your framework of choice, install the required Python libraries (`pandas`, `requests`) by running the cell below.

Then navigate to the cell below the heading `Configure your data query` and run it, entering the username you were provided by the M2C2 Team when prompted. You will also be prompted for your password each time you log in.

In [14]:
! pip install requests
! pip install pandas



## Load libraries and custom functions

### Note: DO NOT modify any of the functions below

In [2]:
import urllib.parse
import datetime
import requests
import pandas as pd
from getpass import getpass

In [3]:
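def get_m2c2kit_access_token(username=None, password=None):
    # check if required fields present
    if username is None or password is None:
        raise ValueError("username and password are required")

    # specify login endpoint URL
    login_url = "https://prod.m2c2kit.com/auth/token"

    # URL-encode credentials so special characters (e.g., "&", "+") are sent intact
    username_enc = urllib.parse.quote(username, safe="")
    password_enc = urllib.parse.quote(password, safe="")
    payload = f"=grant_type%3D&=scope%3D&=client_id%3D&=client_secret%3D&username={username_enc}&password={password_enc}"
    headers = {
        "accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded"
    }

    # attempt login; raise an error if the server rejects the request
    login_response = requests.request("POST", login_url, data=payload, headers=headers)
    login_response.raise_for_status()
    access_token = login_response.json().get("access_token")
    return access_token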

In [4]:
def get_m2c2kit_trial_level_data(access_token=None, study_id=None, start_date=None, end_date=None, activity_name=None, skip=0):

    # check if required fields present
    if access_token is None:
        raise ValueError("access_token is required")
    if study_id is None:
        raise ValueError("study_id is required")
    if start_date is None:
        raise ValueError("start_date is required")
    if end_date is None:
        raise ValueError("end_date is required")
    if activity_name is None:
        raise ValueError("activity_name is required")

    # specify query endpoint URL
    query_url = "https://prod.m2c2kit.com/query/"

    # specify query parameters ----
    querystring = {"fields":"study_uid,uid,session_uid,activity_name,event_type,content,metadata",
                "activity_name":activity_name,
                "format":"json",
                "study_uid":study_id,
                "start_date":start_date,
                "end_date":end_date,
                "skip":skip,
                "limit":"1000"}

    payload = ""
    headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {access_token}"
    }

    # TODO: check for total and run with new limit and skip if reached limit ----
    data_response = requests.request("GET", query_url, data=payload, headers=headers, params=querystring)
    data_json = data_response.json()
    data_records = data_json.get("results")
    data_total = data_json.get("total")
    data_limit = data_json.get("limit")
    data_df = pd.DataFrame(data_records)

    # iterate over the dataset to get all trials ----
    all_trials = []
    for index, row in data_df.iterrows():
        json_data = row['content'].get("trials", [])
        all_trials.extend(json_data)

    # convert all trials to dataframe ----
    df_all = pd.DataFrame(all_trials)
    return df_all, data_total, data_limit

In [5]:
def get_m2c2kit_metadata(access_token=None, study_id=None, resource="session-counts"):

    # check if required fields present
    if access_token is None:
        raise ValueError("access_token is required")
    if study_id is None:
        raise ValueError("study_id is required")
    
    # specify query endpoint URL
    query_url = f"https://prod.m2c2kit.com/metadata/{resource}"

    # specify query parameters ----
    querystring = {"study_uid":study_id}

    payload = ""
    headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {access_token}"
    }

    # TODO: check for total and run with new limit and skip if reached limit ----
    data_response = requests.request("GET", query_url, data=payload, headers=headers, params=querystring)
    data_json = data_response.json()
    data_df = pd.DataFrame(data_json)

    return data_df, data_json

## Configure your data query

Note: If you have more cognitive assessments than are shown here, copy the existing code for one cognitive assessment, paste it, and change the cognitive assessment name (e.g., colorshapes, dotmemory).

For the next two blocks of code, the ONLY changes you need to make are to enter the username you created with the authorization guide when prompted and to set `study_id` to the one that was created for you.

The `start_date` and `end_date` can be changed to query specific dates. Everything else can be left as is.
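As an alternative to copying the query once per assessment, the calls can be driven from a list of activity names. This is a minimal sketch, assuming `fetch` accepts the same keyword arguments and returns the same three values as `get_m2c2kit_trial_level_data`. The helper name `query_activities` is hypothetical, and a stand-in `fake_fetch` is used so the example runs without a network connection.

```python
import pandas as pd

def query_activities(fetch, activity_names, **query_params):
    """Call `fetch` once per activity and collect the resulting dataframes by name."""
    results = {}
    for name in activity_names:
        df, total, limit = fetch(activity_name=name, **query_params)
        results[name] = df
    return results

# Stand-in for get_m2c2kit_trial_level_data, for illustration only.
def fake_fetch(access_token=None, study_id=None, start_date=None, end_date=None,
               activity_name=None, skip=0):
    df = pd.DataFrame([{"activity_name": activity_name, "study_id": study_id}])
    return df, len(df), 1000

frames = query_activities(
    fake_fetch,
    ["Symbol Search", "Grid Memory"],
    access_token="not-a-real-token",
    study_id="demo",
    start_date="2023-11-29",
    end_date="2023-11-29",
)
print(list(frames))
```

In the notebook, you would pass `get_m2c2kit_trial_level_data` in place of `fake_fetch` and list the activity names used in your study.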

In [6]:
# specify parameters for M2C2kit backend
backend_username = input('Enter username for M2C2kit backend...') # you will be prompted for a username
backend_password = getpass('Enter password for M2C2kit backend...') # you will be prompted for a password

# login to M2C2kit backend to get access token for querying data (expires in X minutes)
access_token = get_m2c2kit_access_token(username=backend_username, 
                                password=backend_password)

# timestamp of the current run, used in output filenames
ts_fn = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# confirm that a token was returned and note when it was granted
print("Token: ", access_token, "\nGranted at: ", ts_fn)

In [None]:
# set query params
study_id = "demo"
start_date = "2023-11-29"
end_date = "2023-11-29"

In [None]:
# query Symbol Search activity data
df_symbolsearch, total_symbolsearch, limit_symbolsearch = get_m2c2kit_trial_level_data(access_token=access_token, 
                                                             study_id=study_id, 
                                                             start_date=start_date, 
                                                             end_date=end_date, 
                                                             skip=0,
                                                             activity_name="Symbol Search")

# query Grid Memory activity data
df_gridmemory, total_gridmemory, limit_gridmemory = get_m2c2kit_trial_level_data(access_token=access_token, 
                                                           study_id=study_id, 
                                                           start_date=start_date, 
                                                           end_date=end_date, 
                                                           skip=0,
                                                           activity_name="Grid Memory")
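
Each call above requests at most `limit` records (1000 here) and also returns the `total` reported by the server. If `total` exceeds `limit`, the remaining records would need to be requested with a larger `skip`. The sketch below shows one way to do that. It assumes that `skip` is an offset counted in records and that `total` is the full number of matching records, neither of which is confirmed in this notebook. The names `fetch_all_pages` and `fake_fetch_page` are hypothetical, and the stand-in function lets the example run without a network connection.

```python
import pandas as pd

def fetch_all_pages(fetch, **query_params):
    """Request pages until `skip` reaches the reported total, then combine them."""
    pages = []
    skip = 0
    while True:
        df, total, limit = fetch(skip=skip, **query_params)
        pages.append(df)
        # Stop if the server gave no paging information (avoids an endless loop).
        if not total or not limit:
            break
        skip += int(limit)
        if skip >= int(total):
            break
    return pd.concat(pages, ignore_index=True)

# Stand-in for get_m2c2kit_trial_level_data: 25 records served 10 at a time.
def fake_fetch_page(skip=0, **_ignored):
    total, limit = 25, 10
    rows = [{"record": i} for i in range(skip, min(skip + limit, total))]
    return pd.DataFrame(rows), total, limit

all_rows = fetch_all_pages(fake_fetch_page, study_id="demo")
print(len(all_rows))
```

With the real function, you would call `fetch_all_pages(get_m2c2kit_trial_level_data, access_token=access_token, study_id=study_id, start_date=start_date, end_date=end_date, activity_name="Symbol Search")`, omitting `skip` because the helper supplies it.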

## Deduplicate Dataset

As mentioned above, data duplication is expected since the M2C2kit assessments save all data every trial to minimize any data loss. Below is code for deduplicating this data:

In [23]:
df_symbolsearch_dedup = df_symbolsearch.drop_duplicates(subset=['activity_uuid', 'session_uuid', 'trial_begin_iso8601_timestamp'])
df_gridmemory_dedup = df_gridmemory.drop_duplicates(subset=['activity_uuid', 'session_uuid', 'trial_begin_iso8601_timestamp'])

In [24]:
# confirm deduplication
print(f"Symbol Search: {df_symbolsearch.shape} to {df_symbolsearch_dedup.shape}")
print(f"Grid Memory: {df_gridmemory.shape} to {df_gridmemory_dedup.shape}")

Symbol Search: (12, 23) to (6, 23)
Grid Memory: (10, 24) to (6, 24)


## Preview Data

In [16]:
# preview data
display(df_symbolsearch_dedup.head(3))
display(df_gridmemory_dedup.head(3))

Unnamed: 0,document_uuid,session_uuid,activity_uuid,activity_id,activity_version,device_timezone,device_timezone_offset_minutes,activity_begin_iso8601_timestamp,trial_begin_iso8601_timestamp,trial_index,...,user_response_index,correct_response_index,quit_button_pressed,device_metadata,study_id,session_id,participant_id,api_key,group,wave
0,90aa1bb1-4f19-4e1e-8755-43e108094d08,918ff750-dffd-4932-976b-ccba72d4b543,eb3c656a-c985-4a6c-b195-0468fb259050,symbol-search,0.8.4,America/New_York,300,2023-11-29T19:49:26.540Z,2023-11-29T19:49:48.346Z,0,...,0,0,False,{'userAgent': 'Mozilla/5.0 (Windows NT 10.0; W...,demo,,,demo,,
1,90aa1bb1-4f19-4e1e-8755-43e108094d08,918ff750-dffd-4932-976b-ccba72d4b543,eb3c656a-c985-4a6c-b195-0468fb259050,symbol-search,0.8.4,America/New_York,300,2023-11-29T19:49:26.540Z,2023-11-29T19:49:48.346Z,0,...,0,0,False,{'userAgent': 'Mozilla/5.0 (Windows NT 10.0; W...,demo,,,demo,,
2,bb83af09-2741-4e2e-88c4-fe298350a38e,918ff750-dffd-4932-976b-ccba72d4b543,eb3c656a-c985-4a6c-b195-0468fb259050,symbol-search,0.8.4,America/New_York,300,2023-11-29T19:49:26.540Z,2023-11-29T19:49:54.263Z,1,...,0,0,False,{'userAgent': 'Mozilla/5.0 (Windows NT 10.0; W...,demo,,,demo,,


Unnamed: 0,document_uuid,session_uuid,activity_uuid,activity_id,activity_version,device_timezone,device_timezone_offset_minutes,activity_begin_iso8601_timestamp,trial_begin_iso8601_timestamp,trial_index,...,user_interference_actions,number_of_correct_dots,quit_button_pressed,device_metadata,study_id,session_id,participant_id,api_key,group,wave
0,0e401d46-283b-49e5-a3c3-d385a094faca,f1ea804a-fe91-4256-9e79-e6132b9dc115,b47f74e7-77d0-4acb-8e29-d29031f77a2d,grid-memory,0.8.4,America/New_York,300,2023-11-29T15:04:00.922Z,2023-11-29T15:04:13.809Z,0.0,...,"[{'elapsed_duration_ms': 2872, 'action_type': ...",1.0,False,{'userAgent': 'Mozilla/5.0 (Macintosh; Intel M...,demo,,,demo,,
1,0e401d46-283b-49e5-a3c3-d385a094faca,f1ea804a-fe91-4256-9e79-e6132b9dc115,b47f74e7-77d0-4acb-8e29-d29031f77a2d,grid-memory,0.8.4,America/New_York,300,2023-11-29T15:04:00.922Z,2023-11-29T15:04:13.809Z,0.0,...,"[{'elapsed_duration_ms': 2872, 'action_type': ...",1.0,False,{'userAgent': 'Mozilla/5.0 (Macintosh; Intel M...,demo,,,demo,,
2,0ab848e9-bebc-4cd9-b153-f056121282ab,f1ea804a-fe91-4256-9e79-e6132b9dc115,b47f74e7-77d0-4acb-8e29-d29031f77a2d,grid-memory,0.8.4,America/New_York,300,2023-11-29T15:04:00.922Z,2023-11-29T15:04:30.442Z,,...,,,True,{'userAgent': 'Mozilla/5.0 (Macintosh; Intel M...,demo,,,demo,,


## Save Data

In [26]:
# save data

# with duplicates (i.e., all downloaded data)
df_symbolsearch.to_csv(f"m2c2kit_raw_symbolsearch_{ts_fn}.csv")
df_gridmemory.to_csv(f"m2c2kit_raw_gridmemory_{ts_fn}.csv")

# without duplicates (i.e., deduplicated data)
df_symbolsearch_dedup.to_csv(f"m2c2kit_dedup_symbolsearch_{ts_fn}.csv")
df_gridmemory_dedup.to_csv(f"m2c2kit_dedup_gridmemory_{ts_fn}.csv")

# Just need quick insights?

If you would like a list of unique session IDs or a count of unique sessions per participant, use the examples below.

### Metadata Report 1: Count of unique sessions (i.e., `session_uid`)

In [20]:
session_counts = get_m2c2kit_metadata(access_token=access_token, study_id=study_id, resource="session-counts")
session_counts_df = session_counts[0]
session_counts_df.to_csv(f"m2c2kit_metadata_session-counts_{ts_fn}.csv", index=False)
display(session_counts_df)


| | participant_id | unique_session_id_count |
|---|---|---|
| 0 | 64624f8bd957b73d388a4dbe | 43 |
| 1 | 64dce03c3d92457ab58cd2ea | 36 |
| 2 | 64ada25fac2b67345d232571 | 40 |
| 3 | 647f742655d9b920f266d2c2 | 31 |
| 4 | 646c013abccac6fe51ad8c50 | 67 |
| ... | ... | ... |
| 81 | 650c96b9515af5d1f1b97bdb | 27 |
| 82 | 6540431925abfbd9e5ad24d6 | 17 |
| 83 | 6516f7b69936342b56ffb4a3 | 24 |
| 84 | 65527f09f5224b0570ffb06c | 28 |


### Metadata Report 2: List of unique sessions (i.e., `session_uid`)

In [19]:
unique_session_ids = get_m2c2kit_metadata(access_token=access_token, study_id=study_id, resource="unique-session-ids")
unique_session_ids_df = unique_session_ids[0]
unique_session_ids_df.to_csv(f"m2c2kit_metadata_unique-session-ids_{ts_fn}.csv", index=False)
display(unique_session_ids_df)


| | participant_id | unique_session_ids |
|---|---|---|
| 0 | 64c84ad78311fd5cf0b56074 | [d4a37dcf-1cc7-451a-9e27-367475bf355d, 5bd98f5... |
| 1 | 64fb4b38419ef2e8f1207bd0 | [24b9025d-c292-4383-8719-91706f20a9fc, 5501afb... |
| 2 | 64de6ad1ad138e942aa54942 | [c81f25fc-71fc-4b3d-8b08-0f4171dbac77, 79962d3... |
| 3 | 6526c39df6b026bb6f79b6ff | [e9b4c591-0955-4b39-b773-17c93600b4bc, 5cef1b8... |
| 4 | 652da7cba1b2a0259dbbbb54 | [db920cfd-aa82-4143-bd2a-51075d862fd1, 400212e... |
| ... | ... | ... |
| 81 | 64dd01d3e01e2d821d68696d | [ca3c4f8e-f04f-4cef-921d-9b213b86d2a3, fdcfb6c... |
| 82 | 647f9e4e8eea3c84a65782dd | [b1e5b7d6-d1a0-446a-ad8a-ef3c697e250a, 8942abf... |
| 83 | 64b9771b05c7f4b3fc3e0fbf | [9d30a723-f03a-4b76-821e-2e7a7eabad38, 23d0748... |
| 84 | 651481045c8e858ac9c02c64 | [8bdaff47-5c8a-42be-a202-a7e63911ad1e, 3f1439f... |
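
The `unique_session_ids` column appears to hold one list of session IDs per participant. For filtering or merging, a long format with one row per session is often easier to work with. The sketch below uses `DataFrame.explode` on a small made-up dataframe; it assumes the column contains Python lists, as the display above suggests. The new column name `session_id` is chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for unique_session_ids_df.
report = pd.DataFrame({
    "participant_id": ["p1", "p2"],
    "unique_session_ids": [["s1", "s2", "s3"], ["s4"]],
})

# One row per (participant, session) pair.
sessions_long = (
    report.explode("unique_session_ids")
          .rename(columns={"unique_session_ids": "session_id"})
          .reset_index(drop=True)
)

# Recount sessions per participant from the long table.
counts = sessions_long.groupby("participant_id")["session_id"].nunique()
print(counts.to_dict())
```

The recount should agree with the session-counts report for the same study. Note that once the report is saved to CSV, the lists are stored as text, so this step is best applied to the dataframe in memory.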


# Ready to score your data?

If you are ready to score your data, please contact us at [m2c2@psu.edu](mailto:m2c2@psu.edu).

# Coming soon - this code as a pip installable package
<!-- pip install cookiecutter
cookiecutter https://github.com/waynerv/cookiecutter-pypackage.git -->