# Decoding Individual Identity
Tim Tyree<br>
9.9.2022

In [1]:
from lib.get_init import *
%load_ext autoreload
%autoreload 2
np.random.seed(42)

In [2]:
#reset matplotlib settings
sns.reset_orig()
mpl.rc_file_defaults()

# select the data

In [3]:
use_search_dialog=False
if use_search_dialog:
    from lib.controller.filedialog import search_for_file
    data_dir = search_for_file(currdir=os.getcwd())
    print("  nota bene: restarting the .ipynb should make the widget window go away.")
else:
    data_dir='/Users/timothytyree/Documents/TyreeEtAl_data/data.json'

In [4]:
print(f"{data_dir=}")
assert os.path.exists(data_dir)

data_dir='/Users/timothytyree/Documents/TyreeEtAl_data/data.json'


# load the data

schema of data.json:
- observer_name: name of the observer
- session_num: integer index of recording session
- dict_anatomical: anatomical information for recording session
- dict_relationships: relationship information confirmed between observers and conspecifics
- dict_transcript: spike times for all neurons labeled by trial, trial information
    - spike_time_array: table containing neuron spike times centered at stimulus onset t=0. rows index trials. columns index neurons.
    - df_trial_data: table of trial information. rows index trials.
    - df_labels: table of trial labels computed from trial information. rows index trials.
- dict_error_estimates: estimated spike sorting error rates (percent of oversplit or undersplit neurons)
- dict_subpopulations: dictionary of neurons identified as concept cells or responsive to face or voice stimuli
- dict_decoder_hyperparameters: dictionary of decoder hyperparameter settings
- dict_etc:
    - df_neurons: table of information describing all neurons.  rows index neurons.
    - dict_spike_templates: dictionary of spike sorting templates for all neurons.
        - t_values: time values centered at the waveform peak
        - dict_spike_template_lst: list of spike template dictionaries ordered by neuron
        - xdim: units of t_values are in milliseconds.
        - ydim: units of spike templates are in millivolts.
        - flip_signs: true if the original spike template was multiplied by -1, which typically leads to spikes appearing as a maximum instead of as a minimum value.
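
As a rough illustration of how the nested schema can be navigated, the sketch below builds a tiny stand-in for `data.json` and reads it with the standard `json` module. The values are made up, the time units in the comments are an assumption, and `toy_data` is a hypothetical name; the notebook itself uses `load_from_json`, whose behavior is not reproduced here.

```python
import json

# Toy stand-in for data.json: values are invented, only the key layout follows the schema above.
toy_json = json.dumps({
    "observer_name": "hades",
    "session_num": 46,
    "dict_transcript": {
        # 2 trials x 2 neurons; each cell is a list of spike times
        # (units assumed to be seconds, with t=0 at stimulus onset)
        "spike_time_array": [[[-0.20, 0.05, 0.31], []],
                             [[0.12], [0.02, 0.40]]],
    },
})

toy_data = json.loads(toy_json)
spike_times = toy_data["dict_transcript"]["spike_time_array"]
num_trials = len(spike_times)        # rows index trials
num_neurons = len(spike_times[0])    # columns index neurons
first_cell = spike_times[0][0]       # spike times of neuron 0 on trial 0
print(num_trials, num_neurons, first_cell)
```

Each cell of `spike_time_array` holds a variable-length list, so the table is ragged rather than a plain numeric matrix.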

In [5]:
#load data from .json
data=load_from_json(data_dir)
observer_name = data['observer_name']
session_num = data['session_num']
print(f"This dataset was loaded from {data_dir=}.")
print(f"\nObserver: {observer_name.capitalize()} (Session #{session_num+1})") #using session_num+1 here bc the first recording session had session_num=0.
print(f"\n------------------------------\n")
print(f"\ndata has the following keys:")
# print(*data)
for key in data:
    print(key)
print(f"\ndict_anatomical holds anatomical data for recording session:")
print_dict(data['dict_anatomical'])

This dataset was loaded from data_dir='/Users/timothytyree/Documents/TyreeEtAl_data/data.json'.

Observer: Hades (Session #47)

------------------------------


data has the following keys:
session_num
observer_name
dict_anatomical
dict_relationships
dict_transcript
dict_error_estimates
dict_subpopulations
dict_decoder_hyperparameters
dict_etc

dict_anatomical holds anatomical data for recording session:
session_num=46
anatomical_region=''
AP_pos_set=-3.1


__Anatomical location in Hippocampus:__

The estimated centroid of the microwire brush array was used to determine the anatomical location of each recording session. The anterior-posterior position (`AP_pos`) was estimated in millimeters. When we were highly confident that most of the array was recording from a single hippocampal subfield, we labeled the session with that subfield (`anatomical_region`). An empty label (`anatomical_region=''`) means that we did not have this level of confidence for any hippocampal subfield.

__Nota bene__: <br>The recording session previous to Session #47 (Session #46) was predominantly in CA1 (i.e. `anatomical_region='CA1'`).


In [6]:
#load spike data and trial data from the dict_transcript
print(f"dict_transcript has the following keys:")
# print(*data['dict_transcript'])
for key in data['dict_transcript']:
    print(key)
spike_time_array = np.array(data['dict_transcript']['spike_time_array'])
df_trial_data = pd.DataFrame(data['dict_transcript']['df_trial_data'])
df_labels = pd.DataFrame(data['dict_transcript']['df_labels'])
df_trial_data.head()

dict_transcript has the following keys:
spike_time_array
df_trial_data
df_labels


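| Unnamed: 0 | index | Duration | PheeName | block | imDur | imMatchFlag | imName | imNum | novel | session_num | trial_num | monkName | faceName | pheeName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17024 | 0 | 2.415744 | hermes | 1.0 | 2.415744 | 3.0 | none | 1.0 | 1 | 46 | 0 | Hades | none | hermes |
| 17025 | 1 | 3.502351 | none | 1.0 | 3.502351 | 0.0 | Aladdin | 2.0 | 1 | 46 | 1 | Hades | aladdin | none |
| 17026 | 2 | 2.060948 | aladdin | 1.0 | 2.060948 | 1.0 | Aladdin | 3.0 | 1 | 46 | 2 | Hades | aladdin | aladdin |
| 17027 | 3 | 3.501758 | none | 1.0 | 3.501758 | 0.0 | Ares | 4.0 | 1 | 46 | 3 | Hades | ares | none |
| 17028 | 4 | 2.341655 | chewie | 1.0 | 2.341655 | 3.0 | none | 5.0 | 1 | 46 | 4 | Hades | none | chewie |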


In [7]:
#separate trials by modality
trial_num_values_face_only = df_trial_data[df_trial_data['pheeName']=='none']['trial_num'].values
trial_num_values_voice_only = df_trial_data[df_trial_data['faceName']=='none']['trial_num'].values
boo_xmod = (df_trial_data['faceName']!='none') & (df_trial_data['pheeName']!='none')
#identity match: the face and the voice belong to the same individual
trial_num_values_match = df_trial_data[boo_xmod&(df_trial_data['faceName']==df_trial_data['pheeName'])]['trial_num'].values
#identity mismatch: the face and the voice belong to different individuals
trial_num_values_mismatch = df_trial_data[boo_xmod&(df_trial_data['faceName']!=df_trial_data['pheeName'])]['trial_num'].values
print(f"number of trials by mode:")
print(f"\t- {trial_num_values_face_only.shape[0]} (face-only)")
print(f"\t- {trial_num_values_voice_only.shape[0]} (voice-only)")
print(f"\t- {trial_num_values_match.shape[0]} (identity match)")
print(f"\t- {trial_num_values_mismatch.shape[0]} (identity mismatch)")

number of trials by mode:
	- 172 (face-only)
	- 57 (voice-only)
	- 83 (identity match)
	- 88 (identity mismatch)
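
The modality split can also be written as a per-trial rule, which makes the four categories easy to check by eye. The sketch below is a minimal, library-free version: `classify_modality` is a hypothetical helper (not part of `lib`), and it assumes that `'none'` marks an absent stimulus and that a match trial is one where `faceName` equals `pheeName`, as in the trial with `imMatchFlag=1.0` in the preview above.

```python
def classify_modality(face_name: str, phee_name: str) -> str:
    """Return the modality label for one trial; 'none' marks an absent stimulus."""
    if face_name == "none" and phee_name == "none":
        return "no-stimulus"
    if phee_name == "none":
        return "face-only"
    if face_name == "none":
        return "voice-only"
    # both stimuli present: compare the two identities
    return "identity match" if face_name == phee_name else "identity mismatch"

# (faceName, pheeName) pairs: the first three follow the df_trial_data preview,
# the last one is a hypothetical mismatch trial
toy_trials = [("none", "hermes"), ("aladdin", "none"),
              ("aladdin", "aladdin"), ("ares", "hermes")]
toy_labels = [classify_modality(f, p) for f, p in toy_trials]
print(toy_labels)
```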


We will use `train_test_split_multimodal_crossval` to perform the train-test split.

In [8]:
train_test_split_multimodal_crossval

<function lib.model.train_test_split_multimodal.train_test_split_multimodal_crossval(concept_name_values_selected, d_labels, trial_num_values_remove, n_splits=5, shuffle=True, random_state=42, **kwargs)>
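
For orientation, the defaults in this signature (`n_splits=5, shuffle=True, random_state=42`) resemble ordinary k-fold cross-validation. The sketch below shows plain k-fold index generation with NumPy only. It is not the implementation of `train_test_split_multimodal_crossval`, whose handling of concepts, labels, and removed trials is not shown in this notebook; `kfold_indices` is a hypothetical name.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, shuffle=True, random_state=42):
    """Yield (train_idx, test_idx) pairs for plain k-fold cross-validation."""
    idx = np.arange(n_samples)
    if shuffle:
        # seeded generator so the folds are reproducible
        rng = np.random.default_rng(random_state)
        rng.shuffle(idx)
    # each chunk serves once as the test fold; the rest is the training set
    for test_idx in np.array_split(idx, n_splits):
        train_idx = np.setdiff1d(idx, test_idx)
        yield train_idx, np.sort(test_idx)

folds = list(kfold_indices(n_samples=10))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```

Every trial index appears in exactly one test fold, so each trial is held out once across the five splits.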

We will use `gener_tbins_fast` to generate candidate time bins.

In [9]:
print (gener_tbins_fast.__doc__)

gener_tbins_fast returns a tuple of df_tbins and df_tbins_refined, respectively.
    booT,booF are boolean index arrays indexing the true/false training trials, respectively.
    spike_time_array is a 2D numpy array instance of list objects that contain spike times for a given trial-neuron pair.
    decreasing max_dur_overlap may needlessly remove useful predictive time bins,
    so its default value is set arbitrarily large while remaining small enough to be a float32 instance.
    using refinement may add ~2minutes to the estimated run time (default: refinement=True).
    otherwise, gener_tbins_fast typically runs in less than 2 minutes per call.

    Parameters Settings
    --------------------
        nid_values: neuron index values to consider.  all neurons are considered if nid_values is None (default: nid_values=None)

        taumin: earliest start time

        taumax: latest end time

        delta_tau_min: time between two start/end times

        refinement: whetehr or n
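
One plausible reading of `taumin`, `taumax`, and `delta_tau_min` is that candidate bins are all (start, end) pairs on a regular grid between the earliest start time and the latest end time. The sketch below illustrates that reading and counts the spikes of one trial-neuron pair inside a bin. It is an assumption for illustration only and not the algorithm inside `gener_tbins_fast`; `candidate_time_bins` and `count_spikes` are hypothetical names.

```python
import numpy as np

def candidate_time_bins(taumin, taumax, delta_tau_min):
    """Return an (n_bins, 2) array of (start, end) pairs with start < end on a regular grid."""
    n_steps = int(round((taumax - taumin) / delta_tau_min))
    edges = taumin + delta_tau_min * np.arange(n_steps + 1)
    # pair every grid point with every later grid point
    return np.array([(s, e) for i, s in enumerate(edges) for e in edges[i + 1:]])

def count_spikes(spike_times, start, end):
    """Count spikes with start <= t < end for one trial-neuron pair."""
    t = np.asarray(spike_times, dtype=float)
    return int(np.sum((t >= start) & (t < end)))

tbins = candidate_time_bins(taumin=0.0, taumax=1.0, delta_tau_min=0.25)
n_spikes = count_spikes([-0.20, 0.05, 0.31, 0.80], start=0.0, end=0.5)
print(tbins.shape, n_spikes)
```

With five grid points there are ten candidate bins, and the number of candidates grows quadratically as `delta_tau_min` shrinks.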

We will use `fit_decoder` to train the decoder.

In [10]:
fit_decoder

<function lib.model.decoder.fit_decoder(xtrain, ytrain, xtest=None, ytest=None, param_dict=None, verbose=0, n_estimators=21, use_label_encoder=False, **kwargs)>