## Organizational Goals for the Project and Basic Data Cleaning Functions:

Ideally, a script should need only a few inputs to produce outputs such as:
   - Merged dataframes
   - A dataframe with outliers removed
   - Basic plots
   - Basic stats

If you give the script a list of relevant columns, it should be able to output dataframes that contain only the variables of interest.

We can write one function to read in both kinds of data (behavioral and eye), but each kind will need its own cleaning function.

Ideally, the only hard-coded values in the final script(s) sit in a single cell right at the top of the notebook. For example:
- the current folder directory
    - or several, e.g. one for data and another for outputs
- .txt file names:
    - a file of new column names
    - a file of columns to keep
- the independent variables of interest (a list of strings is fine)
- the dependent variables of interest (also a list of strings)
- probably a few other things that haven't been worked out yet
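
As a rough sketch, the top-of-notebook cell described above might look like the following. Every name and path here is a placeholder chosen for illustration (the two .txt file names match the ones used in the test cases further down, and the variable lists simply reuse column names from the sample data); none of it is the project's settled configuration.

```python
from pathlib import Path

# Placeholder settings: adjust every value below to the actual project layout.
DATA_DIR = Path('data/raw')           # folder holding the input CSVs (assumed path)
OUTPUT_DIR = Path('data/processed')   # folder for cleaned outputs (assumed path)

# .txt files with one column name per line
NEW_COLS_FILE = DATA_DIR / 'new_cols.txt'
RELEVANT_COLS_FILE = DATA_DIR / 'relevant_cols.txt'

# variables of interest, as lists of column-name strings (illustrative choices)
INDEPENDENT_VARS = ['condLabel']
DEPENDENT_VARS = ['key_resp_2.rt', 'key_resp_2.corr']
```

Keeping these values in one cell means the functions below can take them as arguments instead of containing paths of their own.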


## Basic functions for data processing:

### import_all_csvs: reads in all CSVs in a folder and compiles them into one dataframe
- required packages: pandas, glob
- inputs: folder directory
- outputs: new dataframe

In [18]:
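import pandas as pd
from glob import glob

def import_all_csvs(folder):
    """Read every .csv file in `folder` and stack them into one dataframe."""
    # glob returns the matching paths in arbitrary order
    files = glob(folder + '/*.csv')
    # pd.concat raises a ValueError if no CSV files were found
    master_df = pd.concat([pd.read_csv(f) for f in files])
    return master_df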

# test cases:
# master_df = import_all_csvs('/Users/elizabethpierotti/Desktop/school/python/pandas_story_telling/data/raw')
# master_df = import_all_csvs('/Users/elizabethpierotti/Desktop/Kids Auditory N4/data/interim')

master_df = import_all_csvs('/Users/elizabethpierotti/Desktop/school/python/final project/data/psycho_py')
# master_df = import_all_csvs('/Users/elizabethpierotti/Desktop/Kids Auditory N4/data/behavioral')

master_df

Unnamed: 0,targetAnim,audioPos,distPos,visualL,expLab,picR,picL,condLabel,corrAns,targetPos,...,key_resp_2.corr,key_resp_2.rt,spaceToBegin.keys,spaceToBegin.rt,date,frameRate,expName,session,participant,Unnamed: 31
0,,,,,,,,,,,...,,,,,2020_Mar_04_1508,59.971444,duckFrog,1,12,
1,frog,ribbitR.wav,-20.0,landoltCL.png,prac,frogFront.png,duckFront.png,Lcong,b,20.0,...,0.0,2.669235,,,2020_Mar_04_1508,59.971444,duckFrog,1,12,
2,duck,quackL.wav,20.0,landoltCTop.png,prac,frogFront.png,duckFront.png,Hcong,h,-20.0,...,0.0,2.919198,,,2020_Mar_04_1508,59.971444,duckFrog,1,12,
3,frog,wnR.wav,20.0,landoltCBottom.png,prac,duckFront.png,frogFront.png,LinCongWN,b,-20.0,...,0.0,2.218867,,,2020_Mar_04_1508,59.971444,duckFrog,1,12,
4,duck,wnL.wav,20.0,landoltCBottom.png,prac,frogFront.png,duckFront.png,HcongWN,b,-20.0,...,0.0,1.301516,,,2020_Mar_04_1508,59.971444,duckFrog,1,12,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
562,duck,ribbitL.wav,-20.0,landoltCR.png,V1,duckFront.png,frogFront.png,HinCong,h,20.0,...,0.0,0.784165,,,2020_Feb_26_1011,59.958637,duckFrog,1,3,
563,duck,wnR.wav,-20.0,landoltCL.png,V1,duckFront.png,frogFront.png,HcongWN,h,20.0,...,0.0,1.051047,,,2020_Feb_26_1011,59.958637,duckFrog,1,3,
564,frog,quackL.wav,-20.0,landoltCL.png,V1,frogFront.png,duckFront.png,LinCong,h,20.0,...,0.0,0.784375,,,2020_Feb_26_1011,59.958637,duckFrog,1,3,
565,duck,ribbitL.wav,-20.0,landoltCR.png,V1,duckFront.png,frogFront.png,HinCong,b,20.0,...,0.0,0.650831,,,2020_Feb_26_1011,59.958637,duckFrog,1,3,
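
By default `pd.concat` keeps each file's own row labels, so labels repeat across files in the combined dataframe. If unique row labels or a record of each row's source file would help, a variant along these lines is one option. This is only a sketch: the function name `import_all_csvs_indexed` and the `source_file` column are made up for illustration and are not part of the project.

```python
import pandas as pd
from pathlib import Path

def import_all_csvs_indexed(folder):
    """Hypothetical variant of import_all_csvs.

    Adds a 'source_file' column (assumed name) and renumbers rows 0..n-1.
    """
    # sorted() gives a reproducible file order
    files = sorted(Path(folder).glob('*.csv'))
    # tag every row with the name of the file it came from
    frames = [pd.read_csv(f).assign(source_file=f.name) for f in files]
    # ignore_index=True replaces the per-file row labels with one running index
    return pd.concat(frames, ignore_index=True)
```

Like the original function, this raises a `ValueError` when the folder contains no CSV files.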


### filter_columns: reads in a .txt file with the column names to keep and makes a new df with only those columns
- required packages: pandas
- inputs: relevant-columns .txt file, dataframe to filter
    - NOTE: the .txt file should have each column name on its own line
- outputs: filtered dataframe

In [8]:
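import pandas as pd

def filter_columns(txt_file, df):
    """Return a copy of `df` with only the columns listed in `txt_file`."""
    # the with-block closes the file even if reading fails
    with open(txt_file, 'r') as file:
        # strip surrounding whitespace/newlines and skip blank lines
        names = [line.strip() for line in file if line.strip()]
    # keep the listed columns that exist in df, in the order given in the file
    keep = [name for name in names if name in df.columns]
    return df[keep].copy()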

# test case:
curr_folder = '/Users/elizabethpierotti/Desktop/Kids Auditory N4/data/behavioral'    
rel_col_file = curr_folder + '/relevant_cols.txt'  
filter_columns(rel_col_file, master_df)


### rename_columns: reads in a .txt file with new column names and replaces the column names in the current df
- required packages: pandas
- inputs: column-name .txt file, dataframe whose columns will be renamed
    - NOTE: the .txt file should have each column name on its own line, in the same order as the df's columns
- outputs: new dataframe

In [11]:
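import pandas as pd

def rename_columns(txt_file, df):
    """Return a copy of `df` whose columns are renamed, in order, from `txt_file`."""
    with open(txt_file, 'r') as file:
        # strip surrounding whitespace/newlines and skip blank lines
        names = [line.strip() for line in file if line.strip()]
    # if there are extra columns at the end that you don't want, this cuts them off;
    # .copy() leaves the caller's dataframe unchanged
    df = df.iloc[:, :len(names)].copy()
    # raises a ValueError if the file lists more names than df has columns
    df.columns = names
    return df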
    
# test case:
curr_folder = '/Users/elizabethpierotti/Desktop/Kids Auditory N4/data/behavioral'    
rename_col_file = curr_folder + '/new_cols.txt'  
newnames_df = rename_columns(rename_col_file, master_df)


### subject_accuracy: using two specified columns (correct and incorrect), calculates accuracy and adds a column for these values
- required packages: pandas
- inputs: df, correct column name (or index?), incorrect column name
- outputs: new dataframe
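
No implementation exists yet for this function. A minimal sketch of what it could look like follows, assuming the two columns hold per-row counts of correct and incorrect responses and that accuracy means correct / (correct + incorrect). The `new_col` parameter and its default name `'accuracy'` are assumptions added for illustration.

```python
import pandas as pd

def subject_accuracy(df, correct_col, incorrect_col, new_col='accuracy'):
    """Sketch: add an accuracy column computed from two count columns.

    Assumes both columns contain counts; new_col is a hypothetical parameter.
    """
    out = df.copy()  # leave the caller's dataframe unchanged
    total = out[correct_col] + out[incorrect_col]
    # proportion correct; NaN where a row has no responses at all
    out[new_col] = out[correct_col] / total
    return out

# usage with made-up counts
scores = pd.DataFrame({'participant': [3, 12],
                       'n_correct': [8, 5],
                       'n_incorrect': [2, 5]})
scored = subject_accuracy(scores, 'n_correct', 'n_incorrect')
```

This version takes column names rather than indices, which keeps calls readable if the column order changes.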

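### remove_outliers: given a df and a variable to act on, calculates the IQR and identifies values that fall more than outlier_constant × IQR below the first quartile or above the third quartile, then returns the df without those rows
- required packages: scipy (scipy.stats.iqr)
- inputs: df, variable of interest, outlier constant (optional, default 1.5)
- outputs: new dataframe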

In [17]:
from scipy.stats import iqr

def remove_outliers(df, var, outlier_constant=1.5):
    """Return the rows of `df` whose `var` value lies within the IQR fences."""
    # IQR of the non-missing values
    iqr_value = iqr(df[var].dropna())
    margin = iqr_value * outlier_constant
    # fences: Q1 - margin and Q3 + margin
    lower_fence = df[var].quantile(0.25) - margin
    upper_fence = df[var].quantile(0.75) + margin
    # between() is inclusive and False for NaN, so rows missing `var` are dropped too
    clean_trials = df[df[var].between(lower_fence, upper_fence)]
    return clean_trials

# test case:
#curr_folder = '/Users/elizabethpierotti/Desktop/Kids Auditory N4/data/processed'    
#f = curr_folder + '/poster_data.csv' 
#poster_data = pd.read_csv(f)
#clean_df = remove_outliers(poster_data, 'MNA')

#print(poster_data.shape)
#print(clean_df.shape)

clean_df = remove_outliers(master_df, 'key_resp_2.rt')
print(master_df.shape)
print(clean_df.shape)

(5670, 32)
(5510, 32)
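
To make the rule concrete, here is the same fence calculation worked through on a tiny made-up series (the values are illustrative, not project data). It uses pandas quantiles directly instead of `scipy.stats.iqr`.

```python
import pandas as pd

# toy response times with one extreme value and one missing value
rt = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, float('nan')], name='rt')

q1 = rt.quantile(0.25)      # 2.0 (quantile skips the NaN)
q3 = rt.quantile(0.75)      # 4.0
iqr_value = q3 - q1         # 2.0
lower = q1 - 1.5 * iqr_value   # -1.0
upper = q3 + 1.5 * iqr_value   # 7.0

# keeps 1.0 through 4.0; drops 100.0 (above the upper fence) and the NaN
kept = rt[rt.between(lower, upper)]
```

Because missing values never fall between the fences, the row count can shrink by more than the number of true outliers.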


In [None]:
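# MSD_eye and MSD_behavioral are the eye and behavioral dataframes; both must be defined before running this cell
# how='outer' keeps trials that appear in only one of the two dataframes
merged_df = pd.merge(MSD_eye, MSD_behavioral, on=['trial', 'participant'], how='outer')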
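
An outer merge silently fills unmatched trials with NaN, so it can be worth checking which rows failed to match. The sketch below uses two small made-up dataframes (`eye_df` and `beh_df` are stand-ins, and the `fixations` and `rt` columns are invented) together with the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# made-up stand-ins for the eye and behavioral dataframes
eye_df = pd.DataFrame({'participant': [1, 1, 2],
                       'trial': [1, 2, 1],
                       'fixations': [4, 6, 5]})
beh_df = pd.DataFrame({'participant': [1, 1, 3],
                       'trial': [1, 2, 1],
                       'rt': [0.78, 1.05, 0.65]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(eye_df, beh_df, on=['trial', 'participant'],
                  how='outer', indicator=True)

# rows present in only one of the two sources
unmatched = merged[merged['_merge'] != 'both']
```

Here participant 2 has eye data only and participant 3 has behavioral data only, so both end up in `unmatched`.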

In [None]:
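# create a CSV of the finalized data
# note: to_csv also writes the row index as an unnamed first column unless index=False is passed
merged_df.to_csv(curr_folder + '/processed_data.csv')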