# ABOUT THIS NOTEBOOK

This is a Jupyter notebook designed to simplify data extraction and data deduplication of M2C2kit data on the production servers.

Below is code to (1) download, (2) deduplicate and (3) save the data as a CSV.

Before you can use this Jupyter notebook, you must submit a username and create a password by following the [authorization guide](https://github.com/m2c2-project/m2c2kit-integration-guides/blob/main/docs/authorization_guide.md), and submit an [Airtable form](https://airtable.com/app0JQhjqc5VNZMpZ/shr2FrUEAeaZV7RzF) to request data access. Please wait until our team confirms that you have been entered into the system before proceeding. Once that is done, enter your username, password, and study_id in the designated fields of this notebook.

## How is Data Saved by M2C2kit?

Assessment data are saved after every trial. As a result, you can expect duplicate rows when you query the raw data from our database.

## Before You Start Your Study

Before beginning your study, please confirm that the `user_uid` values in the dataframes produced below match what you expect from either Qualtrics (whatever criteria you use) or Metricwire (24 characters, alphanumeric).
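A quick way to run this check is to test each ID against the expected format. The sketch below assumes Metricwire IDs are exactly 24 ASCII letters or digits, as described above; the helper names `looks_like_metricwire_id` and `find_unexpected_ids` are hypothetical and not part of M2C2kit.

```python
import re

# Assumption: a Metricwire ID is exactly 24 ASCII letters or digits.
METRICWIRE_ID_PATTERN = re.compile(r"[A-Za-z0-9]{24}")


def looks_like_metricwire_id(user_uid):
    """Return True if `user_uid` is a 24-character alphanumeric string."""
    return (
        isinstance(user_uid, str)
        and METRICWIRE_ID_PATTERN.fullmatch(user_uid) is not None
    )


def find_unexpected_ids(user_uids):
    """Return the sorted unique values that do not look like Metricwire IDs."""
    return sorted({str(u) for u in user_uids if not looks_like_metricwire_id(u)})


# Two IDs copied from the sample data in this notebook plus one made-up test ID.
ids = ["668eaa8c596e9abfbecf30b8", "test-user", "668f0537c7c70cfc82e8c1ef"]
unexpected = find_unexpected_ids(ids)
print(unexpected)
```

Any value this returns is worth investigating before data collection starts. For Qualtrics IDs, swap the pattern for whatever criteria your study uses.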

# Getting Started

To get started with this Jupyter notebook, you will need to do one of the following:

- Install [Anaconda](https://www.anaconda.com/)
- Install [Jupyter Lab](https://jupyter.org/install)
- Install [Visual Studio Code with the Jupyter Notebook Extension](https://code.visualstudio.com/)

Once you've configured your framework of choice, install the required Python libraries (`pandas`, `requests`) by running the cell below.

Thereafter, navigate to the cell below the heading `Configure your data query` and change the username to the one you were provided by the M2C2 Team. You will be prompted for your password each time you log in.

In [22]:
! pip install requests
! pip install pandas

Collecting requests
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests)
  Using cached idna-3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Using cached urllib3-2.2.2-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests)
  Using cached certifi-2024.7.4-py3-none-any.whl.metadata (2.2 kB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Using cached certifi-2024.7.4-py3-none-any.whl (162 kB)
Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
Using cached idna-3.7-py3-none-any.whl (66 kB)
Using cached urllib3-2.2.2-py3-none-any.whl (121 kB)
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests
Successfully installed certifi-2024.7.4 charset-normalizer-3.3.2 idna-3.7 

## Load libraries and custom functions

### Note: DO NOT modify any of the functions below

In [1]:
import urllib.parse
import datetime
import requests
import pandas as pd
from getpass import getpass
import glob
import json

In [22]:
def read_json_file(file_path):
    with open(file_path) as f:
        data = json.load(f)
    return data

def get_data_from_json_files(json_files):
    data = []
    for file in json_files:
        data.append(read_json_file(file))
    return data

def parse_metricwire_data(filepath = "data/unzipped/*/*/*.json"):
    # locate json files in the unzipped folder
    json_files = glob.glob(filepath)
    print(f"Ready to process {len(json_files)} JSON files exported from Metricwire.")

    # specify filename from current run time for filenames
    ts_fn = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    # load all data into list of dictionaries
    data = get_data_from_json_files(json_files)

    # Elevate the data
    datao = []
    for i in range(len(data)):
        for j in range(len(data[i])):
            print(data[i][j])
            x = data[i][j]
            # Extract the identifiers
            identifiers = {k: v for k, v in x.items() if k != 'data'}
            identifiers_keys = set(identifiers.keys())
            for entry in x['data']:
                new_entry = {**identifiers, **entry}
                datao.append(new_entry)
    return datao
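The "elevate" step in `parse_metricwire_data` copies each submission's identifiers onto every trial in its `data` list, so that each trial becomes one flat record. The sketch below illustrates the idea on a hypothetical, heavily trimmed submission; `elevate_trials` is an illustrative name and is not one of the notebook's functions.

```python
def elevate_trials(submission):
    """Flatten one submission into one record per trial."""
    # Everything except the nested "data" list identifies the submission.
    identifiers = {k: v for k, v in submission.items() if k != "data"}
    # Merge the identifiers into each trial; trial keys win on a name clash.
    return [{**identifiers, **trial} for trial in submission.get("data", [])]


# Hypothetical submission with only a few of the real fields.
submission = {
    "userId": "668eaa8c596e9abfbecf30b8",
    "activityName": "Grid Memory",
    "data": [
        {"trial_index": 0, "number_of_correct_dots": 1},
        {"trial_index": 1, "number_of_correct_dots": 0},
    ],
}

rows = elevate_trials(submission)
for row in rows:
    print(row)
```

Because every record carries `activityName`, the flattened list can later be filtered by activity and loaded straight into a dataframe.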

In [10]:
def summary_symbol_search(x, trials_expected=20):
    d = {}
    d["flag_is_invalid_n_trials"] = x["submissionSessionId"].count() != trials_expected
    d["n_trials"] = x["submissionSessionId"].count()
    d["n_trials_lure"] = (x["trial_type"] == "lure").sum()
    d["n_trials_responsetime_lt250ms"] = (x["response_time_duration_ms"] < 250).sum()
    d["n_trials_responsetime_gt10000ms"] = (
        x["response_time_duration_ms"] > 10000
    ).sum()
    d["n_correct_trials"] = (
        x["user_response_index"] == x["correct_response_index"]
    ).sum()
    d["n_incorrect_trials"] = (
        x["user_response_index"] != x["correct_response_index"]
    ).sum()
    d["mean_response_time_overall"] = x["response_time_duration_ms"].mean()
    d["mean_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].mean()
    d["mean_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].mean()
    d["median_response_time_overall"] = x["response_time_duration_ms"].median()
    d["median_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].median()
    d["median_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].median()
    d["sd_response_time_overall"] = x["response_time_duration_ms"].std()
    d["sd_response_time_correct"] = x.loc[
        (x["user_response_index"] == x["correct_response_index"]),
        "response_time_duration_ms",
    ].std()
    d["sd_response_time_incorrect"] = x.loc[
        (x["user_response_index"] != x["correct_response_index"]),
        "response_time_duration_ms",
    ].std()
    return pd.Series(
        d,
        index=[
            "flag_is_invalid_n_trials",
            # 'flag_is_potentially_invalid_rt',
            "n_trials",
            "n_trials_lure",
            "n_correct_trials",
            "n_incorrect_trials",
            "n_trials_responsetime_lt250ms",
            "n_trials_responsetime_gt10000ms",
            "mean_response_time_overall",
            "mean_response_time_correct",
            "mean_response_time_incorrect",
            "median_response_time_overall",
            "median_response_time_correct",
            "median_response_time_incorrect",
            "sd_response_time_overall",
            "sd_response_time_correct",
            "sd_response_time_incorrect",
        ],
    )


def summary_grid_memory(x, trials_expected=4):
    d = {}
    d["flag_is_invalid_n_trials"] = x["submissionSessionId"].count() != trials_expected
    d["n_trials"] = x["submissionSessionId"].count()
    d["n_perfect_trials"] = (x["number_of_correct_dots"] == 3.0).sum()
    d["mean_correct_dots"] = (x["number_of_correct_dots"]).mean()
    d["min_correct_dots"] = (x["number_of_correct_dots"]).min()
    d["sum_correct_dots"] = (x["number_of_correct_dots"]).sum()
    return pd.Series(
        d,
        index=[
            "flag_is_invalid_n_trials",
            "n_trials",
            "n_perfect_trials",
            "mean_correct_dots",
            "min_correct_dots",
            "sum_correct_dots",
        ],
    )

def summarise_m2c2kit_data(df = None, activity_name=None, group_by=["participant_id", "session_uuid", "session_id"], trials_expected = -999, ts_fn = None):
    # accepted spellings of each supported activity name
    symbol_search_names = ["symbol-search", "symbolsearch", "symbol_search", "Symbol Search", "Symbol Match"]
    grid_memory_names = ["grid-memory", "Dot Memory", "Grid Memory"]

    # start as "not supported" so unknown activities reach the message below
    valid_scoring = False
    if activity_name in symbol_search_names:
        activity_name_fn = activity_name.replace(" ", "_").lower()
        df_session_summary = df.groupby(group_by).apply(summary_symbol_search, trials_expected=trials_expected)
        valid_scoring = True
    if activity_name in grid_memory_names:
        activity_name_fn = activity_name.replace(" ", "_").lower()
        df_session_summary = df.groupby(group_by).apply(summary_grid_memory, trials_expected=trials_expected)
        valid_scoring = True

    if valid_scoring:
        # write the session-level summary once, then return it
        df_session_summary.reset_index().to_csv(f"m2c2kit_scored_activity-{activity_name_fn}_{ts_fn}.csv", index=False)
        return df_session_summary
    else:
        print("Activity not supported yet. Please contact M2C2 for further coordination.")
        return None

## Locate and Parse all JSON files across all unzipped ZIP files

The directory structure should look like the tree below.
That is, `data/unzipped` should contain one folder per ZIP file exported from the Metricwire interface at this URL: XXXX, and each of those folders should contain one subfolder per user.

```
        ├── cognitivetask-22b32593-1cef-4ebe-af5a-8efcaa32a10a
        │   ├── 668eaa8c596e9abfbecf30b8
        │   │   └── 668eaa8c596e9abfbecf30b8-submissions.json
        │   ├── 668f0537c7c70cfc82e8c1ef
        │   │   └── 668f0537c7c70cfc82e8c1ef-submissions.json
        │   └── usersDataCount.csv
        ├── cognitivetask-a4b87a20-7931-452b-a7e2-e997c79eded5
        │   ├── 668eaa8c596e9abfbecf30b8
        │   │   └── 668eaa8c596e9abfbecf30b8-submissions.json
        │   ├── 668f0537c7c70cfc82e8c1ef
        │   │   └── 668f0537c7c70cfc82e8c1ef-submissions.json
        │   └── usersDataCount.csv
        └── cognitivetask-fd457b1b-3a6a-437b-9b4b-847e969626ae
            ├── 668eaa8c596e9abfbecf30b8
            │   └── 668eaa8c596e9abfbecf30b8-submissions.json
            ├── 668f0537c7c70cfc82e8c1ef
            │   └── 668f0537c7c70cfc82e8c1ef-submissions.json
            └── usersDataCount.csv
```
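Before parsing, it can help to confirm that the glob pattern finds the files you expect. The sketch below builds a throwaway folder tree that mimics the layout above and counts the JSON files two levels below the root. The folder and file names are made up for illustration, and `count_submission_files` is a hypothetical helper.

```python
import glob
import os
import tempfile


def count_submission_files(root):
    """Count JSON files two folder levels below `root` (export/user/file.json)."""
    pattern = os.path.join(root, "*", "*", "*.json")
    return len(glob.glob(pattern))


# Build a temporary tree with two exports and two users each (made-up names).
with tempfile.TemporaryDirectory() as root:
    for export in ("cognitivetask-aaa", "cognitivetask-bbb"):
        for user in ("user1", "user2"):
            folder = os.path.join(root, export, user)
            os.makedirs(folder)
            with open(os.path.join(folder, f"{user}-submissions.json"), "w") as f:
                f.write("[]")
        # usersDataCount.csv sits one level higher and is not matched by the pattern.
        open(os.path.join(root, export, "usersDataCount.csv"), "w").close()
    n_files = count_submission_files(root)

print(n_files)
```

On your own data, calling `count_submission_files("data/unzipped")` should report the number of users times the number of exports, assuming every user appears in every export.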

In [23]:
# parse every Metricwire JSON export into a flat list of trial-level records
data = parse_metricwire_data(filepath = "data/unzipped/*/*/*.json")

Ready to process 6 JSON files exported from Metricwire.
{'studyId': '6687f32ac2db9efff4d31c66', 'cognitiveTaskId': '668ed63c16e2fa2c407452da', 'userId': '668eaa8c596e9abfbecf30b8', 'activityId': '2bbb883d-28da-4294-90f7-61106a104ac7', 'activityVersion': '0.8.12 (38470c49)', 'submissionSessionId': '22c12aa2-b61e-4d30-8348-3461f286c242', 'triggerId': '668ed87e16e2fa2c4074677b', 'triggerTimestamp': '1720637643961', 'deviceOs': 'android', 'deviceVersion': '34', 'appVersion': '4.2.0', 'deviceId': '668eaa8d596e9abfbecf30c3', 'activityName': 'Grid Memory', 'timeZoneMinutes': -300, 'sessionId': '9436abfe-a6f6-4aa0-b81e-f6d04c6c762c', 'date': '07/10/2024', 'time': '13:54:52', 'timestamp': 1720637692784, 'data': [{'activity_begin_iso8601_timestamp': '2024-07-10T18:54:35.347Z', 'trial_begin_iso8601_timestamp': '2024-07-10T18:54:38.809Z', 'trial_end_iso8601_timestamp': '2024-07-10T18:54:53.102Z', 'trial_index': 0, 'response_time_duration_ms': 1745.5, 'presented_cells': [{'row': 0, 'column': 3}, {'

In [24]:
# filter `data` where activityName == 'task of interest'
dict_symbolsearch = [x for x in data if x['activityName'] == 'Symbol Search']
dict_gridmemory = [x for x in data if x['activityName'] == 'Grid Memory']

# load data into pandas dataframes
df_symbolsearch = pd.DataFrame(dict_symbolsearch)
df_gridmemory = pd.DataFrame(dict_gridmemory)

## Deduplicate Dataset

As mentioned above, data duplication is expected since the M2C2kit assessments save all data every trial to minimize any data loss. Below is code for deduplicating this data:

In [25]:
df_symbolsearch_dedup = df_symbolsearch.drop_duplicates(subset=['cognitiveTaskId', 'submissionSessionId', 'trial_begin_iso8601_timestamp'])
df_gridmemory_dedup = df_gridmemory.drop_duplicates(subset=['cognitiveTaskId', 'submissionSessionId', 'trial_begin_iso8601_timestamp'])

In [26]:
# confirm deduplication
print(f"Symbol Search: {df_symbolsearch.shape} to {df_symbolsearch_dedup.shape}")
print(f"Grid Memory: {df_gridmemory.shape} to {df_gridmemory_dedup.shape}")

Symbol Search: (26, 28) to (26, 28)
Grid Memory: (16, 29) to (16, 29)
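The sample data above happens to contain no duplicates, so the shapes do not change. To see what the deduplication does when repeats are present, here is a minimal sketch on hypothetical rows in which the first trial was saved twice. It uses the same three key columns as the cell above.

```python
import pandas as pd

# Hypothetical rows: the record for trial 0 appears twice.
df = pd.DataFrame(
    {
        "cognitiveTaskId": ["t1", "t1", "t1"],
        "submissionSessionId": ["s1", "s1", "s1"],
        "trial_begin_iso8601_timestamp": [
            "2024-07-10T17:02:30.333Z",
            "2024-07-10T17:02:30.333Z",
            "2024-07-10T17:02:31.925Z",
        ],
        "trial_index": [0, 0, 1],
    }
)

# A row is a duplicate when all three key columns match an earlier row.
key = ["cognitiveTaskId", "submissionSessionId", "trial_begin_iso8601_timestamp"]
n_duplicates = int(df.duplicated(subset=key).sum())
df_dedup = df.drop_duplicates(subset=key)

print(f"{n_duplicates} duplicate row(s); {df.shape} to {df_dedup.shape}")
```

By default `drop_duplicates` keeps the first occurrence of each key, so the earliest saved copy of each trial is retained.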


## Preview Data

In [27]:
# preview data
display(df_symbolsearch_dedup.head(3))
display(df_gridmemory_dedup.head(3))

Unnamed: 0,studyId,cognitiveTaskId,userId,activityId,activityVersion,submissionSessionId,triggerId,triggerTimestamp,deviceOs,deviceVersion,...,activity_begin_iso8601_timestamp,trial_begin_iso8601_timestamp,trial_end_iso8601_timestamp,trial_index,trial_type,card_configuration,response_time_duration_ms,user_response_index,correct_response_index,quit_button_pressed
0,6687f32ac2db9efff4d31c66,66880801c2db9efff4d5e093,668eaa8c596e9abfbecf30b8,4fac0197-1172-4b2c-85db-3ecb5d3d0f35,0.8.12 (38470c49),d3259542-ef62-433a-991e-2ebd613cf20f,668c418850f86a98b069ba82,1720630890468,android,34,...,2024-07-10T17:02:17.255Z,2024-07-10T17:02:30.333Z,2024-07-10T17:02:31.414Z,0.0,lure,"{'top_cards_symbols': [{'top': 13, 'bottom': 1...",1080.3,0.0,1.0,False
1,6687f32ac2db9efff4d31c66,66880801c2db9efff4d5e093,668eaa8c596e9abfbecf30b8,4fac0197-1172-4b2c-85db-3ecb5d3d0f35,0.8.12 (38470c49),d3259542-ef62-433a-991e-2ebd613cf20f,668c418850f86a98b069ba82,1720630890468,android,34,...,2024-07-10T17:02:17.255Z,2024-07-10T17:02:31.925Z,2024-07-10T17:02:32.249Z,1.0,normal,"{'top_cards_symbols': [{'top': 20, 'bottom': 1...",323.3,1.0,0.0,False
2,6687f32ac2db9efff4d31c66,66880801c2db9efff4d5e093,668eaa8c596e9abfbecf30b8,4fac0197-1172-4b2c-85db-3ecb5d3d0f35,0.8.12 (38470c49),d3259542-ef62-433a-991e-2ebd613cf20f,668c418850f86a98b069ba82,1720630890468,android,34,...,2024-07-10T17:02:17.255Z,2024-07-10T17:02:32.758Z,2024-07-10T17:02:33.008Z,2.0,lure,"{'top_cards_symbols': [{'top': 6, 'bottom': 13...",249.1,0.0,1.0,False


Unnamed: 0,studyId,cognitiveTaskId,userId,activityId,activityVersion,submissionSessionId,triggerId,triggerTimestamp,deviceOs,deviceVersion,...,trial_begin_iso8601_timestamp,trial_end_iso8601_timestamp,trial_index,response_time_duration_ms,presented_cells,selected_cells,user_dot_actions,user_interference_actions,number_of_correct_dots,quit_button_pressed
0,6687f32ac2db9efff4d31c66,668ed63c16e2fa2c407452da,668eaa8c596e9abfbecf30b8,2bbb883d-28da-4294-90f7-61106a104ac7,0.8.12 (38470c49),22c12aa2-b61e-4d30-8348-3461f286c242,668ed87e16e2fa2c4074677b,1720637643961,android,34,...,2024-07-10T18:54:38.809Z,2024-07-10T18:54:53.102Z,0,1745.5,"[{'row': 0, 'column': 3}, {'row': 2, 'column':...","[{'row': 4, 'column': 2}, {'row': 1, 'column':...","[{'elapsed_duration_ms': 682.0999999046326, 'a...","[{'elapsed_duration_ms': 1018.2999999523163, '...",1,False
1,6687f32ac2db9efff4d31c66,668ed63c16e2fa2c407452da,668eaa8c596e9abfbecf30b8,2bbb883d-28da-4294-90f7-61106a104ac7,0.8.12 (38470c49),22c12aa2-b61e-4d30-8348-3461f286c242,668ed87e16e2fa2c4074677b,1720637643961,android,34,...,2024-07-10T18:54:53.111Z,2024-07-10T18:55:07.593Z,1,1937.0,"[{'row': 4, 'column': 0}, {'row': 2, 'column':...","[{'row': 2, 'column': 2}, {'row': 2, 'column':...","[{'elapsed_duration_ms': 729.2999999523163, 'a...","[{'elapsed_duration_ms': 1972.9000000953674, '...",0,False
2,6687f32ac2db9efff4d31c66,668ed63c16e2fa2c407452da,668eaa8c596e9abfbecf30b8,2bbb883d-28da-4294-90f7-61106a104ac7,0.8.12 (38470c49),22c12aa2-b61e-4d30-8348-3461f286c242,668ed87e16e2fa2c4074677b,1720637643961,android,34,...,2024-07-10T18:55:07.604Z,2024-07-10T18:55:23.483Z,2,3337.9,"[{'row': 1, 'column': 3}, {'row': 4, 'column':...","[{'row': 2, 'column': 0}, {'row': 1, 'column':...","[{'elapsed_duration_ms': 1495.9000000953674, '...",[],1,False


## Save Data

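In [28]:
import os

# timestamp for output filenames (parse_metricwire_data does not return its own)
ts_fn = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# make sure the output folder exists before writing
os.makedirs("tidy", exist_ok=True)

# (potentially) with duplicates (i.e., all downloaded data)
df_symbolsearch.to_csv(f"tidy/m2c2kit_raw_symbolsearch_{ts_fn}.csv")
df_gridmemory.to_csv(f"tidy/m2c2kit_raw_gridmemory_{ts_fn}.csv")

# without duplicates (i.e., deduplicated data)
df_symbolsearch_dedup.to_csv(f"tidy/m2c2kit_dedup_symbolsearch_{ts_fn}.csv")
df_gridmemory_dedup.to_csv(f"tidy/m2c2kit_dedup_gridmemory_{ts_fn}.csv")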

# Ready to score your data?

If you are ready to score your data, please contact us at [m2c2@psu.edu](mailto:m2c2@psu.edu)

In [29]:
# specify expected number of trials (based on study configuration)
trials_expected_symbolsearch = 20
trials_expected_gridmemory = 4
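The expected trial counts feed the `flag_is_invalid_n_trials` column, which marks any session whose trial count differs from the study configuration. The sketch below shows that comparison in isolation on hypothetical data. `flag_sessions` is an illustrative helper rather than one of the notebook's functions, and it counts rows per session, whereas the scoring functions count non-null `submissionSessionId` values.

```python
import pandas as pd


def flag_sessions(df, trials_expected, group_by=("userId", "submissionSessionId")):
    """Count trials per session and flag sessions with an unexpected count."""
    counts = df.groupby(list(group_by)).size().rename("n_trials").reset_index()
    counts["flag_is_invalid_n_trials"] = counts["n_trials"] != trials_expected
    return counts


# Hypothetical data: session s1 has 4 trials, session s2 has only 3.
df = pd.DataFrame(
    {
        "userId": ["u1"] * 4 + ["u2"] * 3,
        "submissionSessionId": ["s1"] * 4 + ["s2"] * 3,
        "trial_index": [0, 1, 2, 3, 0, 1, 2],
    }
)

flags = flag_sessions(df, trials_expected=4)
print(flags)
```

With `trials_expected=4`, only the three-trial session is flagged. A placeholder such as `-999` flags every session, because no session can have that many trials.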

In [30]:
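# note, this function writes files for you!
# trials_expected = -999 is a placeholder, so flag_is_invalid_n_trials is True for every session;
# pass trials_expected_symbolsearch / trials_expected_gridmemory to check against your study configuration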
df_symbolsearch_summary = summarise_m2c2kit_data(df = df_symbolsearch_dedup, activity_name="Symbol Search", group_by=["userId", "submissionSessionId"], trials_expected = -999, ts_fn = ts_fn)
df_gridmemory_summary = summarise_m2c2kit_data(df = df_gridmemory_dedup, activity_name="Grid Memory", group_by=["userId", "submissionSessionId"], trials_expected = -999, ts_fn = ts_fn)



In [31]:
df_symbolsearch_summary 

userId,submissionSessionId,flag_is_invalid_n_trials,n_trials,n_trials_lure,n_correct_trials,n_incorrect_trials,n_trials_responsetime_lt250ms,n_trials_responsetime_gt10000ms,mean_response_time_overall,mean_response_time_correct,mean_response_time_incorrect,median_response_time_overall,median_response_time_correct,median_response_time_incorrect,sd_response_time_overall,sd_response_time_correct,sd_response_time_incorrect
668eaa8c596e9abfbecf30b8,224f578c-10d0-48b1-92b0-eb9d59f1408b,True,5,3,2,3,4,0,120.94,186.8,77.033333,64.5,186.8,54.5,111.710666,172.958319,52.650008
668eaa8c596e9abfbecf30b8,22c12aa2-b61e-4d30-8348-3461f286c242,True,1,0,0,1,0,0,,,,,,,,,
668eaa8c596e9abfbecf30b8,d3259542-ef62-433a-991e-2ebd613cf20f,True,5,3,1,4,3,0,421.4,225.1,470.475,249.1,225.1,286.2,370.450618,,408.561512
668f0537c7c70cfc82e8c1ef,763c8274-6b71-4ab6-a242-415e352277bd,True,5,3,4,1,1,0,3810.42,3408.575,5417.8,3230.5,1776.4,5417.8,4052.814573,4563.318814,
668f0537c7c70cfc82e8c1ef,7a695355-4740-40ea-972d-d1b0379264dd,True,5,3,4,1,2,0,926.04,1153.45,16.4,638.1,822.5,16.4,1096.534198,1121.790656,
668f0537c7c70cfc82e8c1ef,a0983373-1a88-427c-8c3d-454ee1ff6bdd,True,5,3,2,3,3,1,4029.64,95.9,6652.133333,166.6,95.9,547.7,8506.16636,70.144993,10904.89591


In [32]:
df_gridmemory_summary

userId,submissionSessionId,flag_is_invalid_n_trials,n_trials,n_perfect_trials,mean_correct_dots,min_correct_dots,sum_correct_dots
668eaa8c596e9abfbecf30b8,22c12aa2-b61e-4d30-8348-3461f286c242,True,4,0,0.5,0,2
668f0537c7c70cfc82e8c1ef,763c8274-6b71-4ab6-a242-415e352277bd,True,4,0,0.75,0,3
668f0537c7c70cfc82e8c1ef,7a695355-4740-40ea-972d-d1b0379264dd,True,4,0,0.0,0,0
668f0537c7c70cfc82e8c1ef,a0983373-1a88-427c-8c3d-454ee1ff6bdd,True,4,0,0.25,0,1


# Coming soon - this code as a pip installable package
<!-- pip install cookiecutter
cookiecutter https://github.com/waynerv/cookiecutter-pypackage.git -->