# Organize and sort data files
In this notebook, I collect the data files from every participant and build one dataframe that lists each file, ordered by time and task type.
<br>
<br>
I start by importing _libraries_ and _tools_ that will be needed for the rest of the notebook.

### Import modules

In [2]:
#For data sorting
import glob
import os
import re
import fnmatch

#For data analysis
import pandas as pd
import numpy as np

#To display the figures in the notebook itself.
%matplotlib inline
import matplotlib.pyplot as plt
from ggplot import *

### Dataframe overview
There are a total of 30 participants. Each participant completed 6 sets of memory tasks (4 training, 2 testing).
Each memory task set is stored in its own file, and the file name specifies when the task was taken and its type.
<br>
The _glob_ and _re_ modules are used to find the files and parse their names.
<br>
The resulting dataframe gathers the _subject ids_, _date and time_, _type of tasks_ and _file names_.
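
As a minimal sketch of how one file name can be split into these fields, the snippet below uses a regular expression with named groups. The helper `parse_fname`, the constant `FNAME_PATTERN`, and the example file name are hypothetical. The pattern assumes the naming scheme suggested by the file names in this notebook (a 4-digit time, an optional `_data` suffix) and may need adjusting for other files.

```python
import re

# Assumed scheme: <subj_id>_main_exp_<phase>_<type>_<only|wbuilder>_<date>_<time>[_data].csv
FNAME_PATTERN = re.compile(
    r'(?P<subj_id>\d+)_main_exp_(?P<phase>testing|training)_(?P<type>[A-Za-z]+)'
    r'_(?:only|wbuilder)_(?P<date>\d{4}_[A-Za-z]{3}_\d{1,2})_(?P<time>\d{4})(?:_data)?\.csv'
)

def parse_fname(fname):
    """Return a dict of the fields encoded in fname, or None if it does not match."""
    match = FNAME_PATTERN.fullmatch(fname)
    if match is None:
        return None
    info = match.groupdict()
    info['file_name'] = fname
    return info

# Hypothetical file name, not taken from the dataset
example = parse_fname('99_main_exp_training_lion_wbuilder_2018_Jan_01_0900_data.csv')
print(example)
```

Returning `None` for non-matching names makes it easy to skip unrelated `.csv` files in the same folder.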

In [26]:
#create a new df that has information about subj_id, phase, type, date, time, and file_name

#the list of file names 
fnames = sorted(glob.glob('*.csv'))

columns = ['subj_id', 'phase', 'type', 'date', 'time', 'file_name']

#raw-string patterns with the literal dot escaped; the groups are subj_id, phase, type, date, time
testing_pattern = re.compile(r'(\d*)_main_exp_(testing)_(\w*)_only_(\d*_\w*_\d*)_(\d*)\.csv')
#training files end in either "_data.csv" or ".csv"
training_pattern = re.compile(r'(\d*)_main_exp_(training)_(\w*)_wbuilder_(\d*_\w*_\d*)_(\d*)(?:_data)?\.csv')

#a for loop that finds the matching fnames for the testing and training phases and collects them in two lists
#(rows are collected in lists because DataFrame.append was removed in pandas 2.0)
testing_rows = []
training_rows = []
for fname in fnames:
    testing = testing_pattern.match(fname)
    if testing:
        testing_rows.append(dict(zip(columns, testing.groups() + (fname,))))

    training = training_pattern.match(fname)
    if training:
        training_rows.append(dict(zip(columns, training.groups() + (fname,))))

#create one df for testing files and one for training files
testing_fnames_info = pd.DataFrame(testing_rows, columns=columns)
training_fnames_info = pd.DataFrame(training_rows, columns=columns)

#combine the training and testing dfs into one df
fnames_info = pd.concat([testing_fnames_info,training_fnames_info])
fnames_info = fnames_info.sort_values(by=['subj_id']).reset_index(drop=True)

#code block demo
fnames_info.head(20)

| | subj_id | phase | type | date | time | file_name |
|---|---|---|---|---|---|---|
| 0 | 1 | testing | oldnew | 2018_Oct_16 | 1543 | 01_main_exp_testing_oldnew_only_2018_Oct_16_15... |
| 1 | 1 | training | lion | 2018_Oct_16 | 1442 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... |
| 2 | 1 | training | lion | 2018_Oct_16 | 1522 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... |
| 3 | 1 | training | beaver | 2018_Oct_16 | 1531 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... |
| 4 | 1 | training | beaver | 2018_Oct_16 | 1506 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... |
| 5 | 1 | testing | order | 2018_Oct_16 | 1546 | 01_main_exp_testing_order_only_2018_Oct_16_154... |
| 6 | 2 | testing | order | 2018_Oct_25 | 1547 | 02_main_exp_testing_order_only_2018_Oct_25_154... |
| 7 | 2 | testing | oldnew | 2018_Oct_25 | 1541 | 02_main_exp_testing_oldnew_only_2018_Oct_25_15... |
| 8 | 2 | training | beaver | 2018_Oct_25 | 1451 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... |
| 9 | 2 | training | beaver | 2018_Oct_25 | 1521 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... |
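
Before relying on the design of 4 training and 2 testing files per participant, it may be worth checking that every participant actually has that many files. The sketch below runs on a small hypothetical dataframe (`toy`) that only mimics the `subj_id` and `phase` columns; it is one possible check, not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for fnames_info: participant 01 is complete, 02 lacks one testing file
toy = pd.DataFrame({
    'subj_id': ['01'] * 6 + ['02'] * 5,
    'phase': ['training'] * 4 + ['testing'] * 2 + ['training'] * 4 + ['testing'],
})

# Count files per participant and phase; missing combinations become 0
counts = toy.groupby(['subj_id', 'phase']).size().unstack(fill_value=0)

# Expected design: 4 training and 2 testing files per participant
complete = (counts['training'] == 4) & (counts['testing'] == 2)
incomplete_ids = counts.index[~complete].tolist()
print(incomplete_ids)
```

An empty list would mean that every participant matches the expected design.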


### Organize dataframe by time blocks
Now that the information for all subjects is gathered in one dataframe, I can reorganize it by assigning a _time block_ to each file for easy access later on.
<br>
Because time blocks do not apply to the testing phase, the time block for testing files is NaN.

In [27]:
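# Sort the fnames_info dataframe by subject and time, then add the time block
fnames_info = fnames_info.sort_values(by = ["subj_id","time"])
# 4 training blocks followed by 2 testing files with no block (a real NaN, not the string 'NaN') for each of the 30 participants
fnames_info["block_order"] = [1, 2, 3, 4, np.nan, np.nan] * 30
fnames_info.head(20)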

| | subj_id | phase | type | date | time | file_name | block_order |
|---|---|---|---|---|---|---|---|
| 1 | 1 | training | lion | 2018_Oct_16 | 1442 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... | 1.0 |
| 4 | 1 | training | beaver | 2018_Oct_16 | 1506 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... | 2.0 |
| 2 | 1 | training | lion | 2018_Oct_16 | 1522 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... | 3.0 |
| 3 | 1 | training | beaver | 2018_Oct_16 | 1531 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... | 4.0 |
| 0 | 1 | testing | oldnew | 2018_Oct_16 | 1543 | 01_main_exp_testing_oldnew_only_2018_Oct_16_15... | NaN |
| 5 | 1 | testing | order | 2018_Oct_16 | 1546 | 01_main_exp_testing_order_only_2018_Oct_16_154... | NaN |
| 8 | 2 | training | beaver | 2018_Oct_25 | 1451 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... | 1.0 |
| 10 | 2 | training | lion | 2018_Oct_25 | 1508 | 02_main_exp_training_lion_wbuilder_2018_Oct_25... | 2.0 |
| 9 | 2 | training | beaver | 2018_Oct_25 | 1521 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... | 3.0 |
| 11 | 2 | training | lion | 2018_Oct_25 | 1527 | 02_main_exp_training_lion_wbuilder_2018_Oct_25... | 4.0 |
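
Repeating a fixed list 30 times only works when every participant has exactly six files in the expected order. As a hedged alternative, the time block can be derived from the data with `groupby(...).cumcount()`. The sketch below uses a small hypothetical dataframe (`toy`) with the same column names; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the subj_id, phase, and time columns of fnames_info
toy = pd.DataFrame({
    'subj_id': ['01', '01', '01', '02', '02', '02'],
    'phase': ['testing', 'training', 'training', 'training', 'testing', 'training'],
    'time': ['1543', '1442', '1506', '1451', '1541', '1508'],
})

toy = toy.sort_values(by=['subj_id', 'time']).reset_index(drop=True)

# Number the training files 1, 2, ... within each participant, in time order
is_training = toy['phase'] == 'training'
toy['block_order'] = np.nan  # testing rows keep NaN
toy.loc[is_training, 'block_order'] = (
    toy[is_training].groupby('subj_id').cumcount() + 1
)
print(toy)
```

This version does not depend on the number of participants or on the testing files coming last.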


It is also important for the research team to know whether a file is the first or the second time the participant performed a given type of memory task.
<br>
The second block of each task type has shorter trials and is designed to test how well participants perform after learning from the first two training sets.

In [29]:
# add a new column that records whether it is the first or second time the participant sees each type of task
# (a real NaN, not the string 'NaN', marks the testing files)
fnames_info["block_order_bytype"] = [1, 1, 2, 2, np.nan, np.nan] * 30

# dataframe demo
fnames_info.head(20)

| | subj_id | phase | type | date | time | file_name | block_order | block_order_bytype |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | training | lion | 2018_Oct_16 | 1442 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... | 1.0 | 1.0 |
| 4 | 1 | training | beaver | 2018_Oct_16 | 1506 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... | 2.0 | 1.0 |
| 2 | 1 | training | lion | 2018_Oct_16 | 1522 | 01_main_exp_training_lion_wbuilder_2018_Oct_16... | 3.0 | 2.0 |
| 3 | 1 | training | beaver | 2018_Oct_16 | 1531 | 01_main_exp_training_beaver_wbuilder_2018_Oct_... | 4.0 | 2.0 |
| 0 | 1 | testing | oldnew | 2018_Oct_16 | 1543 | 01_main_exp_testing_oldnew_only_2018_Oct_16_15... | NaN | NaN |
| 5 | 1 | testing | order | 2018_Oct_16 | 1546 | 01_main_exp_testing_order_only_2018_Oct_16_154... | NaN | NaN |
| 8 | 2 | training | beaver | 2018_Oct_25 | 1451 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... | 1.0 | 1.0 |
| 10 | 2 | training | lion | 2018_Oct_25 | 1508 | 02_main_exp_training_lion_wbuilder_2018_Oct_25... | 2.0 | 1.0 |
| 9 | 2 | training | beaver | 2018_Oct_25 | 1521 | 02_main_exp_training_beaver_wbuilder_2018_Oct_... | 3.0 | 2.0 |
| 11 | 2 | training | lion | 2018_Oct_25 | 1527 | 02_main_exp_training_lion_wbuilder_2018_Oct_25... | 4.0 | 2.0 |
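
The fixed 1, 1, 2, 2 pattern is only right when the two task types alternate. A possible data-driven variant counts occurrences within each participant and task type. The sketch below uses a hypothetical dataframe (`toy`) in which the same type is performed twice in a row; that ordering is invented to show the difference and does not appear in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical participant who performs lion twice before beaver
toy = pd.DataFrame({
    'subj_id': ['01'] * 5,
    'phase': ['training'] * 4 + ['testing'],
    'type': ['lion', 'lion', 'beaver', 'beaver', 'oldnew'],
    'time': ['1442', '1506', '1522', '1531', '1543'],
})

toy = toy.sort_values(by=['subj_id', 'time']).reset_index(drop=True)

# Count how many times each participant has seen each training type so far
is_training = toy['phase'] == 'training'
toy['block_order_bytype'] = np.nan  # testing rows keep NaN
toy.loc[is_training, 'block_order_bytype'] = (
    toy[is_training].groupby(['subj_id', 'type']).cumcount() + 1
)
print(toy[['type', 'time', 'block_order_bytype']])
```

For this ordering the column reads 1, 2, 1, 2 for the training rows, where the fixed list would have assigned 1, 1, 2, 2.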


In [17]:
#reset index and save to csv
fnames_info = fnames_info.reset_index(drop = True)

In [18]:
#to_csv returns None when given a path, so the result is not assigned back to fnames_info
fnames_info.to_csv("fnames_info.csv")
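
When the saved file is read back later, two details are easy to miss: the row index is written as an extra unnamed column unless `index=False` is passed, and zero-padded ids such as `01` are parsed as integers unless a string dtype is requested. The sketch below shows both on a hypothetical dataframe (`toy`), using an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical rows; the real dataframe has more columns
toy = pd.DataFrame({'subj_id': ['01', '02'], 'time': ['1442', '0905']})

# index=False keeps the row index out of the output, so no "Unnamed: 0" column on reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str keeps identifiers such as '01' from being parsed as the integer 1
reloaded = pd.read_csv(buffer, dtype=str)
print(reloaded.columns.tolist())
print(reloaded['subj_id'].tolist())
```

Whether to keep the index in the file is a design choice; dropping it is usually simpler when the index carries no information.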