# ABOUT THIS NOTEBOOK

This is a Jupyter notebook designed to simplify data extraction and data deduplication of M2C2kit data on the production servers.

Below is code to (1) download, (2) deduplicate and (3) save the data as a CSV.

Before you can use this Jupyter notebook, you must submit a username and create a password by following the [authorization guide](https://github.com/m2c2-project/m2c2kit-integration-guides/blob/main/docs/authorization_guide.md) and submit an [Airtable form](https://airtable.com/app0JQhjqc5VNZMpZ/shr2FrUEAeaZV7RzF) to request data access. Please wait until our team confirms that you have been entered into the system before proceeding. Once that is done, enter your username, password, and study_id in the designated fields of this Jupyter notebook.

## How is Data Saved by M2C2kit?

Data from assessments are saved after every trial, so earlier trials are saved again each time a new trial completes. As a result, you can expect duplication when you query the raw data from our database.

## Before You Start Your Study

Before beginning your study, please ensure that the `user_uid` values you receive in the dataframes below match what you would expect from either Qualtrics (whatever criteria you use) or Metricwire (24 characters, alphanumeric).
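
A quick format check can make this comparison less error-prone. The sketch below is illustrative only: it assumes a Metricwire ID is exactly 24 ASCII letters or digits (our reading of "24 characters, alphanumeric"), and `find_unexpected_uids` is a hypothetical helper that is not part of M2C2kit.

```python
import re

# Assumption: a Metricwire ID is exactly 24 ASCII letters or digits.
METRICWIRE_UID_PATTERN = re.compile(r"[A-Za-z0-9]{24}")


def find_unexpected_uids(uids, pattern=METRICWIRE_UID_PATTERN):
    """Return the sorted, unique values that do not fully match the pattern."""
    return sorted({str(uid) for uid in uids if not pattern.fullmatch(str(uid))})


# Made-up example values; in practice, pass the user_uid column of a downloaded dataframe.
example_uids = ["5f2a9c0b1d3e4f5a6b7c8d9e", "test-user", "5f2a9c0b1d3e4f5a6b7c8d9e"]
print(find_unexpected_uids(example_uids))  # ['test-user']
```

An empty list means every value matched the assumed format. For Qualtrics, swap in a pattern that reflects whatever criteria your study uses.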

# Getting Started

To get started with this Jupyter notebook, you will need to do one of the following:

- Install [Anaconda](https://www.anaconda.com/)
- Install [Jupyter Lab](https://jupyter.org/install)
- Install [Visual Studio Code with the Jupyter Notebook Extension](https://code.visualstudio.com/)

Once you've configured your framework of choice, install the required Python libraries (`pandas`, `requests`) by running the cell below.

Thereafter, navigate to the cell below the heading `Configure your data query` and set `study_id` to the one you were provided by the M2C2 Team. You will be prompted for your username and password each time you log in.

In [1]:
! pip install requests
! pip install pandas

## Load libraries and custom functions

### Note: DO NOT modify any of the functions below

In [2]:
import urllib.parse
import datetime
import requests
import pandas as pd
from getpass import getpass

In [3]:
def get_filename_strings(ts_fn):
    return {
        "metadata_unique_session_ids": f"m2c2kit_metadata_unique-session-ids_{ts_fn}.csv",
        "metadata_unique_participant_ids": f"m2c2kit_metadata_unique-participant-ids_{ts_fn}.csv",
    }

In [1]:
def get_m2c2kit_access_token(username=None, password=None):

    if username is None or password is None:
        # prompt for M2C2kit backend credentials when they are not passed in
        username = input('Enter username for M2C2kit backend...') # you will be prompted for a username
        password = getpass('Enter password for M2C2kit backend...') # you will be prompted for a password

    # specify login endpoint URL
    login_url = "https://prod.m2c2kit.com/auth/token"
    # URL-encode the credentials so special characters (e.g., & or +) do not corrupt the form body
    payload = (
        "=grant_type%3D&=scope%3D&=client_id%3D&=client_secret%3D"
        f"&username={urllib.parse.quote(username, safe='')}"
        f"&password={urllib.parse.quote(password, safe='')}"
    )
    headers = {
        "accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded"
    }

    # attempt login and stop early if the server rejects the request
    login_response = requests.request("POST", login_url, data=payload, headers=headers)
    login_response.raise_for_status()
    access_token = login_response.json().get("access_token")
    if access_token is None:
        raise ValueError("Login failed: no access token was returned")

    # timestamp of the current run, used in output filenames
    ts_fn = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    print("Token granted at: ", ts_fn)

    return access_token, ts_fn

In [5]:
def get_m2c2kit_trial_level_data(access_token=None, study_id=None, start_date=None, end_date=None, activity_name=None, skip=0):

    # check if required fields present
    if access_token is None:
        raise ValueError("access_token is required")
    if study_id is None:
        raise ValueError("study_id is required")
    if start_date is None:
        raise ValueError("start_date is required")
    if end_date is None:
        raise ValueError("end_date is required")
    if activity_name is None:
        raise ValueError("activity_name is required")

    # specify query endpoint URL
    query_url = "https://prod.m2c2kit.com/query/"

    # specify query parameters ----
    querystring = {"fields":"study_uid,uid,session_uid,activity_name,event_type,content,metadata",
                "activity_name":activity_name,
                "format":"json",
                "study_uid":study_id,
                "start_date":start_date,
                "end_date":end_date,
                "skip":skip,
                "limit":"1000"}

    payload = ""
    headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {access_token}"
    }

    # TODO: check for total and run with new limit and skip if reached limit ----
    data_response = requests.request("GET", query_url, data=payload, headers=headers, params=querystring)
    data_json = data_response.json()
    data_records = data_json.get("results")
    data_total = data_json.get("total")
    data_limit = data_json.get("limit")
    data_df = pd.DataFrame(data_records)

    # iterate over the dataset to get all trials ----
    all_trials = []
    for index, row in data_df.iterrows():
        json_data = row['content'].get("trials", [])
        all_trials.extend(json_data)

    # convert all trials to dataframe ----
    df_all = pd.DataFrame(all_trials)
    return df_all, data_total, data_limit

In [6]:
def get_m2c2kit_metadata(access_token=None, study_id=None, resource="session-counts"):

    # check if required fields present
    if access_token is None:
        raise ValueError("access_token is required")
    if study_id is None:
        raise ValueError("study_id is required")
    
    # specify query endpoint URL
    query_url = f"https://prod.m2c2kit.com/metadata/{resource}"

    # specify query parameters ----
    querystring = {"study_uid":study_id}

    payload = ""
    headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {access_token}"
    }

    # TODO: check for total and run with new limit and skip if reached limit ----
    data_response = requests.request("GET", query_url, data=payload, headers=headers, params=querystring)
    data_json = data_response.json()
    data_df = pd.DataFrame(data_json)

    return data_df, data_json

In [7]:
def summary_symbol_search(x, trials_expected=20):
    d = {}
    d["flag_is_invalid_n_trials"] = x["session_uuid"].count() != trials_expected
    d["n_trials"] = x["session_uuid"].count()
    d["n_trials_lure"] = (x["trial_type"] == "lure").sum()
    d["n_trials_responsetime_lt250ms"] = (x["response_time_duration_ms"] < 250).sum()
    d["n_trials_responsetime_gt10000ms"] = (
        x["response_time_duration_ms"] > 10000
    ).sum()
    d["n_correct_trials"] = (
        x["user_response_index"] == x["correct_response_index"]
    ).sum()
    d["n_incorrect_trials"] = (
        x["user_response_index"] != x["correct_response_index"]
    ).sum()
    d["mean_response_time_overall"] = x["response_time_duration_ms"].mean()
    d["mean_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].mean()
    d["mean_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].mean()
    d["median_response_time_overall"] = x["response_time_duration_ms"].median()
    d["median_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].median()
    d["median_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].median()
    d["sd_response_time_overall"] = x["response_time_duration_ms"].std()
    d["sd_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].std()
    d["sd_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].std()
    return pd.Series(
        d,
        index=[
            "flag_is_invalid_n_trials",
            # 'flag_is_potentially_invalid_rt',
            "n_trials",
            "n_trials_lure",
            "n_correct_trials",
            "n_incorrect_trials",
            "n_trials_responsetime_lt250ms",
            "n_trials_responsetime_gt10000ms",
            "mean_response_time_overall",
            "mean_response_time_correct",
            "mean_response_time_incorrect",
            "median_response_time_overall",
            "median_response_time_correct",
            "median_response_time_incorrect",
            "sd_response_time_overall",
            "sd_response_time_correct",
            "sd_response_time_incorrect",
        ],
    )


def summary_grid_memory(x, trials_expected=4):
    d = {}
    d["flag_is_invalid_n_trials"] = x["session_uuid"].count() != trials_expected
    d["n_trials"] = x["session_uuid"].count()
    d["n_perfect_trials"] = (x["number_of_correct_dots"] == 3.0).sum()
    d["mean_correct_dots"] = (x["number_of_correct_dots"]).mean()
    d["min_correct_dots"] = (x["number_of_correct_dots"]).min()
    d["sum_correct_dots"] = (x["number_of_correct_dots"]).sum()
    return pd.Series(
        d,
        index=[
            "flag_is_invalid_n_trials",
            "n_trials",
            "n_perfect_trials",
            "mean_correct_dots",
            "min_correct_dots",
            "sum_correct_dots",
        ],
    )

In [8]:
def summarise_m2c2kit_data(df = None, activity_name=None, group_by=["participant_id", "session_uuid", "session_id"], trials_expected = -999, ts_fn = None):
    # accepted spellings of each supported activity name
    symbol_search_names = ("symbol-search", "symbolsearch", "symbol_search", "Symbol Search", "Symbol Match")
    grid_memory_names = ("grid-memory", "Dot Memory", "Grid Memory")

    # pick the scoring function for this activity
    if activity_name in symbol_search_names:
        summary_fn = summary_symbol_search
    elif activity_name in grid_memory_names:
        summary_fn = summary_grid_memory
    else:
        print("Activity not supported yet. Please contact M2C2 for further coordination.")
        return None

    # score each group, then write the summary to a CSV file exactly once
    activity_name_fn = activity_name.replace(" ", "_").lower()
    df_session_summary = df.groupby(group_by).apply(summary_fn, trials_expected=trials_expected)
    df_session_summary.reset_index().to_csv(f"m2c2kit_scored_activity-{activity_name_fn}_{ts_fn}.csv", index=False)
    return df_session_summary

## Configure your data query

Note: If your study includes more cognitive assessments than are shown here, copy the existing code for one assessment, paste it, and change the assessment name (e.g., colorshapes, dotmemory).<br><br>In the next two code cells, the ONLY change you need to make is to set `study_id` to the one that was created for you.<br><br>You can change `start_date` and `end_date` to query specific dates. Everything else can be left as is.

In [9]:
# set query params
study_id = "demo"
start_date = "2023-11-29"
end_date = "2023-11-29"
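
Because a mistyped date can silently return no data, it may help to validate the two dates before querying. The sketch below assumes the backend expects `YYYY-MM-DD` strings, as in the example values; `check_date_range` is a hypothetical helper.

```python
import datetime


def check_date_range(start_date, end_date):
    """Parse two YYYY-MM-DD strings and confirm start_date is not after end_date."""
    start = datetime.datetime.strptime(start_date, "%Y-%m-%d").date()
    end = datetime.datetime.strptime(end_date, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


# Same values as the example query parameters.
print(check_date_range("2023-11-29", "2023-11-29"))
```

A `ValueError` is raised either when a string does not parse as a date or when the range is reversed.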

In [None]:
# login to M2C2kit backend to get access token for querying data (expires in X minutes)
access_token, ts_fn = get_m2c2kit_access_token()

## Query Data

In [11]:
# query Symbol Search activity data
df_symbolsearch, total_symbolsearch, limit_symbolsearch = get_m2c2kit_trial_level_data(access_token=access_token, 
                                                             study_id=study_id, 
                                                             start_date=start_date, 
                                                             end_date=end_date, 
                                                             skip=0,
                                                             activity_name="Symbol Search")

# query Grid Memory activity data
df_gridmemory, total_gridmemory, limit_gridmemory = get_m2c2kit_trial_level_data(access_token=access_token, 
                                                           study_id=study_id, 
                                                           start_date=start_date, 
                                                           end_date=end_date, 
                                                           skip=0,
                                                           activity_name="Grid Memory")
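
Each call to `get_m2c2kit_trial_level_data` requests at most 1000 records (the `limit` in its query string), and the TODO in that function notes that larger result sets are not yet paged. Below is a minimal sketch of one way to loop over pages. It assumes that `total` is the number of matching records, that `limit` is the page size, and that `skip` is a record offset; please confirm these with the M2C2 Team before relying on it. `fetch_all_pages` is a hypothetical helper.

```python
import pandas as pd


def fetch_all_pages(fetch_page, page_size=1000, **query):
    """Call fetch_page(skip=..., **query) repeatedly and stack the returned pages.

    fetch_page must return (dataframe, total, limit), like
    get_m2c2kit_trial_level_data. Assumes skip is a record offset.
    """
    frames = []
    skip = 0
    while True:
        df_page, total, limit = fetch_page(skip=skip, **query)
        if not df_page.empty:
            frames.append(df_page)
        # advance by the page size reported by the server, else the fallback
        skip += int(limit or page_size) or page_size
        # stop when the total is unknown or every record has been requested
        if total is None or skip >= int(total):
            break
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Under those assumptions, `fetch_all_pages(get_m2c2kit_trial_level_data, access_token=access_token, study_id=study_id, start_date=start_date, end_date=end_date, activity_name="Symbol Search")` would return one dataframe covering all pages instead of only the first.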

## Deduplicate Dataset

As mentioned above, data duplication is expected since the M2C2kit assessments save all data every trial to minimize any data loss. Below is code for deduplicating this data:

In [12]:
df_symbolsearch_dedup = df_symbolsearch.drop_duplicates(subset=['activity_uuid', 'session_uuid', 'trial_begin_iso8601_timestamp'])
df_gridmemory_dedup = df_gridmemory.drop_duplicates(subset=['activity_uuid', 'session_uuid', 'trial_begin_iso8601_timestamp'])

In [13]:
# confirm deduplication
print(f"Symbol Search: {df_symbolsearch.shape} to {df_symbolsearch_dedup.shape}")
print(f"Grid Memory: {df_gridmemory.shape} to {df_gridmemory_dedup.shape}")

Symbol Search: (12, 23) to (6, 23)
Grid Memory: (10, 24) to (6, 24)
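
To see why repeated saves produce duplicates and how the three key columns remove them, here is a small self-contained illustration. The rows are made up, `trial_index` is a hypothetical extra column, and the sketch assumes (as the cell above does) that `activity_uuid`, `session_uuid`, and `trial_begin_iso8601_timestamp` together identify one trial.

```python
import pandas as pd

KEY_COLUMNS = ["activity_uuid", "session_uuid", "trial_begin_iso8601_timestamp"]

# Toy data: the first trial appears twice because it was saved after trial 1
# and saved again together with trial 2.
toy = pd.DataFrame(
    {
        "activity_uuid": ["a1", "a1", "a1"],
        "session_uuid": ["s1", "s1", "s1"],
        "trial_begin_iso8601_timestamp": [
            "2023-11-29T10:00:00.000Z",
            "2023-11-29T10:00:00.000Z",
            "2023-11-29T10:00:05.000Z",
        ],
        "trial_index": [0, 0, 1],
    }
)

# keep the first copy of each trial
toy_dedup = toy.drop_duplicates(subset=KEY_COLUMNS)

# sanity check: no key combination should remain duplicated
assert not toy_dedup.duplicated(subset=KEY_COLUMNS).any()
print(f"Toy example: {toy.shape} to {toy_dedup.shape}")  # Toy example: (3, 4) to (2, 4)
```

The same `duplicated(...).any()` check can be run on `df_symbolsearch_dedup` and `df_gridmemory_dedup` as an extra confirmation.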


## Preview Data

In [None]:
# preview data
display(df_symbolsearch_dedup.head(3))
display(df_gridmemory_dedup.head(3))

## Save Data

In [14]:
# save data

# with duplicates (i.e., all downloaded data)
df_symbolsearch.to_csv(f"m2c2kit_raw_symbolsearch_{ts_fn}.csv", index=False)
df_gridmemory.to_csv(f"m2c2kit_raw_gridmemory_{ts_fn}.csv", index=False)

# without duplicates (i.e., deduplicated data)
df_symbolsearch_dedup.to_csv(f"m2c2kit_dedup_symbolsearch_{ts_fn}.csv", index=False)
df_gridmemory_dedup.to_csv(f"m2c2kit_dedup_gridmemory_{ts_fn}.csv", index=False)

# Just need quick insights?

If you would like a list of unique session IDs or a count of unique sessions per participant, use the examples below.

### Metadata Report 1: Count of unique sessions (i.e., `session_uid`)

In [None]:
session_counts = get_m2c2kit_metadata(access_token=access_token, study_id=study_id, resource="session-counts")
session_counts_df = session_counts[0]
session_counts_df.to_csv(f"m2c2kit_metadata_session-counts_{ts_fn}.csv", index=False)
display(session_counts_df)


In [None]:
session_counts_by_activity = get_m2c2kit_metadata(access_token=access_token, study_id=study_id, resource="session-counts-by-activity")
session_counts_by_activity_df = session_counts_by_activity[0]
session_counts_by_activity_df.to_csv(f"m2c2kit_metadata_session-counts-by-activity_{ts_fn}.csv", index=False)
display(session_counts_by_activity_df)

### Metadata Report 2: List of unique sessions (i.e., `session_uid`)

In [None]:
unique_session_ids = get_m2c2kit_metadata(access_token=access_token, study_id=study_id, resource="unique-session-ids")
unique_session_ids_df = unique_session_ids[0]
unique_session_ids_df.to_csv(f"m2c2kit_metadata_unique-session-ids_{ts_fn}.csv", index=False)
display(unique_session_ids_df)
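
If trial-level data has already been downloaded, similar counts can be approximated locally with pandas. This is only a sketch: it assumes the `participant_id` and `session_uuid` columns used in the scoring code, it covers only the activities and date range you queried, and it may not match the backend reports exactly. `count_sessions_per_participant` is a hypothetical helper.

```python
import pandas as pd


def count_sessions_per_participant(df_trials):
    """Count distinct session_uuid values for each participant_id."""
    return (
        df_trials.groupby("participant_id", dropna=False)["session_uuid"]
        .nunique()
        .rename("n_sessions")
        .reset_index()
    )


# Made-up trial rows: participant p1 has two sessions, p2 has one.
toy_trials = pd.DataFrame(
    {
        "participant_id": ["p1", "p1", "p1", "p2"],
        "session_uuid": ["s1", "s1", "s2", "s3"],
    }
)
print(count_sessions_per_participant(toy_trials))
```

Passing `dropna=False` keeps rows whose `participant_id` is missing, so sessions without a participant ID are still counted.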


# Ready to score your data?

If you are ready to score your data, please contact us at [m2c2@psu.edu](mailto:m2c2@psu.edu)

In [15]:
# specify expected number of trials (based on study configuration)
trials_expected_symbolsearch = 20
trials_expected_gridmemory = 4

In [18]:
# note, this function writes files for you! 
# pass the expected trial counts defined above so that flag_is_invalid_n_trials is meaningful
df_symbolsearch_summary = summarise_m2c2kit_data(df = df_symbolsearch_dedup, activity_name="Symbol Search", group_by=["participant_id", "session_uuid", "activity_uuid", "session_id"], trials_expected = trials_expected_symbolsearch, ts_fn = ts_fn)
df_gridmemory_summary = summarise_m2c2kit_data(df = df_gridmemory_dedup, activity_name="Grid Memory", group_by=["participant_id", "session_uuid", "activity_uuid", "session_id"], trials_expected = trials_expected_gridmemory, ts_fn = ts_fn)

In [19]:
df_symbolsearch_summary 

participant_id,session_uuid,activity_uuid,session_id,flag_is_invalid_n_trials,n_trials,n_trials_lure,n_correct_trials,n_incorrect_trials,n_trials_responsetime_lt250ms,n_trials_responsetime_gt10000ms,mean_response_time_overall,mean_response_time_correct,mean_response_time_incorrect,median_response_time_overall,median_response_time_correct,median_response_time_incorrect,sd_response_time_overall,sd_response_time_correct,sd_response_time_incorrect
,778c8fd4-a134-4747-81cf-dd34b7aa3bac,d1b3a433-9696-4249-a4a6-c4ed3128108f,,True,3,2,1,2,2,0,189.966667,333.7,118.1,119.3,333.7,118.1,124.482502,,1.697056
,918ff750-dffd-4932-976b-ccba72d4b543,eb3c656a-c985-4a6c-b195-0468fb259050,,True,3,2,3,0,0,0,3109.566667,3109.566667,,2430.8,2430.8,,2046.969395,2046.969395,


In [20]:
df_gridmemory_summary

participant_id,session_uuid,activity_uuid,session_id,flag_is_invalid_n_trials,n_trials,n_perfect_trials,mean_correct_dots,min_correct_dots,sum_correct_dots
,25b31ae2-145c-4f8c-8ae0-69f7335b109a,cf4ae271-9ad0-4f20-afa8-307be36c9003,,True,1,1,3.0,3.0,3.0
,27f2e8fd-6fa1-46bb-9242-c16041a91421,269c943c-4ca1-4294-8de4-46c5a3667af0,,True,3,1,1.666667,1.0,5.0
,f1ea804a-fe91-4256-9e79-e6132b9dc115,b47f74e7-77d0-4acb-8e29-d29031f77a2d,,True,2,0,1.0,1.0,1.0
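
In the summaries above, `flag_is_invalid_n_trials` is `True` whenever a session's trial count differs from `trials_expected`. A minimal sketch of separating complete from incomplete sessions follows. The toy values are made up, and whether incomplete sessions should be excluded is a study-specific decision.

```python
import pandas as pd

# Made-up summary rows shaped like the scoring output (only two columns shown).
toy_summary = pd.DataFrame(
    {
        "session_uuid": ["s1", "s2", "s3"],
        "n_trials": [20, 3, 20],
    }
)
trials_expected = 20
toy_summary["flag_is_invalid_n_trials"] = toy_summary["n_trials"] != trials_expected

# keep sessions with the expected number of trials; list the rest for review
complete = toy_summary[~toy_summary["flag_is_invalid_n_trials"]]
incomplete = toy_summary[toy_summary["flag_is_invalid_n_trials"]]
print(complete["session_uuid"].tolist())    # ['s1', 's3']
print(incomplete["session_uuid"].tolist())  # ['s2']
```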


# Coming soon - this code as a pip installable package
<!-- pip install cookiecutter
cookiecutter https://github.com/waynerv/cookiecutter-pypackage.git -->