# Python for Psychologists - Session 8

## Hands-on

In [None]:
%matplotlib inline

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns
from scipy import stats
import os
import numpy as np

Check your current working directory, i.e. where your notebook is saved on your disk. Today's data{}.csv files need to be in the same directory as your Jupyter notebook.

In [None]:
pwd

1) Use a for loop to create one dataframe that contains all .csv files for your 10 participants. Use the **os** module instead of creating a new subject list (hint: your files all end with .csv and os.listdir() shows you all files in your pwd). Hint: You need to set ```decimal="," ``` when you use ```pd.read_csv```.

In [None]:
os.listdir()

Note: only the .csv files should be included in the overall data frame.

In [None]:
all_df=[]

for file in os.listdir():
    if file.endswith(".csv"):
        df = pd.read_csv(file, sep=";", decimal=",")
        all_df.append(df)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat(all_df, ignore_index=True)
df.head()
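
The loading pattern above can be tried out without the real data. The following is a minimal sketch: the file names (`demo1.csv`, `demo2.csv`, `notes.txt`) and their contents are made up for illustration, and a temporary directory stands in for your working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical demo files; the real exercise uses the data{}.csv sheets.
with tempfile.TemporaryDirectory() as demo_dir:
    for subj in (1, 2):
        path = os.path.join(demo_dir, "demo{}.csv".format(subj))
        with open(path, "w") as handle:
            handle.write("subj_idx;MeanRT_Baseline\n")
            handle.write("{};0,{}5\n".format(subj, subj))
    # A non-CSV file that the loop should skip.
    with open(os.path.join(demo_dir, "notes.txt"), "w") as handle:
        handle.write("not data")

    frames = []
    for name in sorted(os.listdir(demo_dir)):
        if name.endswith(".csv"):
            full_path = os.path.join(demo_dir, name)
            # sep=";" splits the columns, decimal="," reads "0,15" as 0.15
            frames.append(pd.read_csv(full_path, sep=";", decimal=","))

    demo_df = pd.concat(frames, ignore_index=True)

print(demo_df)
```

Sorting the file names is optional; it only makes the row order reproducible, since `os.listdir()` does not guarantee any particular order.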

2) Check the dataframe for missing values. If there are any missing values, replace them with 0 in your current dataframe.

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
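# assign the result back; inplace=True on a selected column may not update df in newer pandas versions
df["AmbigCorrectSwitch_RT"] = df["AmbigCorrectSwitch_RT"].fillna(0)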
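
As a small illustration of the two steps (counting, then replacing), here is a sketch on a made-up three-row frame. The values are hypothetical; passing a dict to `fillna` is one way to restrict the replacement to named columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing RT value.
demo = pd.DataFrame({
    "subj_idx": [1, 2, 3],
    "AmbigCorrectSwitch_RT": [0.61, np.nan, 0.58],
})

# Count missing values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Replace missing values with 0, but only in the named column.
demo = demo.fillna({"AmbigCorrectSwitch_RT": 0})
print(demo)
```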

3) Insert four new columns, that contain

- Switchcost_Error = Error_Switch - Error_Baseline
- Switchcost_RT = MeanRT_Switch - MeanRT_Baseline
- Switchrate = switches / 20 
- Ambig_RT = (AmbigCorrectStay_RT + AmbigCorrectSwitch_RT) / 2

In [None]:
df["Switchcost_Error"] = df["Error_Switch"] - df["Error_Baseline"]  # switch cost in accuracy
df["Switchcost_RT"] = df["MeanRT_Switch"] - df["MeanRT_Baseline"]  # switch cost in RT
df["Switchrate"] = df["switches"] / 20  # proportion of switches
df["Ambig_RT"] = (df["AmbigCorrectStay_RT"] + df["AmbigCorrectSwitch_RT"]) / 2  # mean RT in ambiguous trials

df.head(10)

4) Sanity check: Check whether Error_Baseline and Korrekt_Baseline add up to 100%.

In [None]:
sanitycheck = df["Error_Baseline"] + df["Korrekt_Baseline"] 
#sum(sanitycheck) == 10.0
sanitycheck
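
Because the proportions are floating-point numbers, comparing their sum with `==` can fail due to rounding. One possible way to automate the check is `np.isclose`, sketched here on made-up values and assuming the proportions are coded between 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical proportions on a 0-1 scale.
demo = pd.DataFrame({
    "Error_Baseline": [0.05, 0.10, 0.30],
    "Korrekt_Baseline": [0.95, 0.90, 0.70],
})

total = demo["Error_Baseline"] + demo["Korrekt_Baseline"]

# np.isclose tolerates tiny floating-point differences from 1.0.
all_sum_to_one = bool(np.isclose(total, 1.0).all())
print(all_sum_to_one)
```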

5) Check whether any participant has more than 30% errors in the baseline, switch or stay condition using ```df.loc```. For each condition, create a respective "exclusion_{}.format(condition)" list that contains the affected participants, and print it.

In [None]:
exclusion_baseline = df.loc[df["Error_Baseline"] > 0.3, "subj_idx"].tolist()
exclusion_stay = df.loc[df["Error_Stay"] > 0.3, "subj_idx"].tolist()
exclusion_switch = df.loc[df["Error_Switch"] > 0.3, "subj_idx"].tolist()

print(exclusion_baseline)
print(exclusion_stay)
print(exclusion_switch)
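
The three nearly identical lines can also be written as a loop. This is only a sketch on made-up data; storing the lists in a dict called `exclusions` is an illustrative choice, not part of the exercise.

```python
import pandas as pd

# Hypothetical error rates for three participants.
demo = pd.DataFrame({
    "subj_idx": [1, 2, 3],
    "Error_Baseline": [0.10, 0.35, 0.05],
    "Error_Stay": [0.20, 0.40, 0.10],
    "Error_Switch": [0.31, 0.50, 0.25],
})

exclusions = {}
for condition in ["Baseline", "Stay", "Switch"]:
    column = "Error_{}".format(condition)
    key = "exclusion_{}".format(condition.lower())
    # Boolean mask selects the rows, "subj_idx" selects the column.
    exclusions[key] = demo.loc[demo[column] > 0.3, "subj_idx"].tolist()

for key, subjects in exclusions.items():
    print(key, subjects)
```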

5.1) For educational purposes only: Combine all three exclusion lists into a single exclusion_overall list that contains only unique values (i.e. each participant once).

In [None]:
exclusion_overall = list(set(exclusion_baseline+exclusion_stay+exclusion_switch))
exclusion_overall
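
To see why the `set` step removes duplicates, consider this small sketch with made-up participant numbers. A set has no defined order, so `sorted` is used here to get a predictable list.

```python
# Hypothetical exclusion lists with overlapping participants.
exclusion_a = [3, 7]
exclusion_b = [7]
exclusion_c = [2, 3]

# Concatenate the lists, drop duplicates via set, then sort for a stable order.
combined = sorted(set(exclusion_a + exclusion_b + exclusion_c))
print(combined)  # [2, 3, 7]
```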

5.2) Now exclude cases in which "Korrekt_Baseline" is less than 95% and save the new data frame to a new variable "df2" **without** using ```df.loc```. Evaluate the new variable afterwards. Then, print a list of the subjects included in the new data frame "df2".

In [None]:
# excluding cases below 95% means keeping those with at least 95%
df2 = df[df["Korrekt_Baseline"] >= 0.95]
df2

In [None]:
print(df2["subj_idx"].tolist())
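
Excluding rows means keeping the rows that do not match the exclusion condition. A minimal sketch with made-up accuracies, using `~` to negate a boolean mask:

```python
import pandas as pd

# Hypothetical baseline accuracies.
demo = pd.DataFrame({
    "subj_idx": [1, 2, 3, 4],
    "Korrekt_Baseline": [0.99, 0.90, 0.95, 0.80],
})

# Mask of the cases that should be excluded.
to_exclude = demo["Korrekt_Baseline"] < 0.95

# ~ flips True/False, so only the non-excluded rows remain.
kept = demo[~to_exclude]
print(kept["subj_idx"].tolist())
```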

6) Plot the RT for the baseline / stay / switch condition in one figure. Hint: Use sns.distplot and 3 lines of code (see https://seaborn.pydata.org/generated/seaborn.distplot.html). Note that sns.distplot is deprecated in newer seaborn versions and may raise a warning.

- all conditions should have a different color 
- all conditions should have a label 
- plot only the distribution (i.e. set the hist parameter to False)

In [None]:
sns.distplot(df["MeanRT_Baseline"], color="lightblue", label="baseline", hist=False)
sns.distplot(df["MeanRT_Stay"], color="red", label="stay", hist=False)
sns.distplot(df["MeanRT_Switch"], color="purple", label="switch", hist=False)
plt.legend()  # make sure the labels are shown

7) Back up the impression that RT increases as the task becomes more cognitively demanding using descriptive statistics. Round the mean results to two decimals and fill in the respective values below:

In [None]:
round(df["MeanRT_Baseline"].mean(),2)

In [None]:
round(df["MeanRT_Stay"].mean(),2)

In [None]:
round(df["MeanRT_Switch"].mean(),2)

In [None]:
conditions = ["baseline", "stay", "switch"]

print("{}-RT: ".format(conditions[0]) + str(round(df["MeanRT_Baseline"].mean(),2)))
print("{}-RT: ".format(conditions[1]) + str(round(df["MeanRT_Stay"].mean(),2)))
print("{}-RT: ".format(conditions[2]) + str(round(df["MeanRT_Switch"].mean(),2)))
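
The three print statements follow the same pattern, so they can be generated in a loop. The sketch below uses made-up RTs; the dict `columns_by_condition` is a hypothetical helper that maps each label to its column.

```python
import pandas as pd

# Hypothetical mean RTs for two participants.
demo = pd.DataFrame({
    "MeanRT_Baseline": [0.50, 0.60],
    "MeanRT_Stay": [0.70, 0.80],
    "MeanRT_Switch": [0.90, 1.10],
})

columns_by_condition = {
    "baseline": "MeanRT_Baseline",
    "stay": "MeanRT_Stay",
    "switch": "MeanRT_Switch",
}

lines = []
for condition, column in columns_by_condition.items():
    mean_rt = round(demo[column].mean(), 2)
    lines.append("{}-RT: {}".format(condition, mean_rt))

print("\n".join(lines))
```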

8) List comprehension

- Create a new column (with a name of your choice) that contains "yes" if a participant has at least 95% accuracy in both Baseline and Switch trials and "no" if not. Afterwards, print a list that contains only those subjects with a "yes" in your new column.

In [None]:
df["new"] = ["yes" if a >= 0.95 and b >= 0.95 else "no" for (a,b) in zip(df["Korrekt_Baseline"], df["Korrekt_Switch"])]

In [None]:
df.loc[df["new"] == "yes", "subj_idx"].tolist()
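
The same logic can be followed step by step on plain Python lists. This is a sketch with made-up accuracies: `zip` pairs up the values, and the conditional expression picks the label for each pair.

```python
# Hypothetical accuracies for three participants.
subjects = [1, 2, 3]
baseline_acc = [0.97, 0.99, 0.90]
switch_acc = [0.96, 0.93, 0.98]

# "yes" only if both accuracies reach 0.95.
labels = [
    "yes" if a >= 0.95 and b >= 0.95 else "no"
    for a, b in zip(baseline_acc, switch_acc)
]

# Keep the subjects whose label is "yes".
selected = [s for s, label in zip(subjects, labels) if label == "yes"]
print(labels)    # ['yes', 'no', 'no']
print(selected)  # [1]
```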

9) Correlate "Switchcost_RT" and "Switchrate" using the stats module. Please check whether both variables follow a normal distribution and choose either pearson or spearman correlation accordingly. 

In [None]:
stats.shapiro(df["Switchcost_RT"])[1] < 0.05

In [None]:
stats.shapiro(df["Switchrate"])[1] < 0.05

In [None]:
stats.pearsonr(df["Switchcost_RT"], df["Switchrate"])
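
The decision rule (Pearson if both variables look normal, Spearman otherwise) can be written out explicitly. The following is a sketch on simulated data, not the study data; the variable names and the alpha level of 0.05 are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

alpha = 0.05
# Shapiro-Wilk: a p-value below alpha suggests a deviation from normality.
x_normal = stats.shapiro(x)[1] >= alpha
y_normal = stats.shapiro(y)[1] >= alpha

if x_normal and y_normal:
    method = "pearson"
    coefficient, p_value = stats.pearsonr(x, y)
else:
    method = "spearman"
    coefficient, p_value = stats.spearmanr(x, y)

print(method, round(float(coefficient), 2), round(float(p_value), 3))
```

With only 10 participants, the Shapiro-Wilk test has little power, so a non-significant result is weak evidence for normality.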

9.1) Now visualize the association of both variables using sns.jointplot (see https://seaborn.pydata.org/generated/seaborn.jointplot.html)

In [None]:
# newer seaborn versions require x and y as keyword arguments
sns.jointplot(x="Switchcost_RT", y="Switchrate", data=df, kind="reg")

10) Try to create a correlation matrix for your whole dataframe. 

In [None]:
# numeric_only=True skips text columns such as "new" (required in pandas >= 2.0)
df.corr(numeric_only=True)

11) Create a new data frame "wide" consisting of the columns defined below. Then, set the index of "wide" to the subject index. Afterwards, use the `.stack()` method to create a new series called "long". Then, turn the series to a data frame. Finally, reset the index to numbers as before and rename the columns in a sensible way.

In [None]:
col = ["MeanRT_Baseline", "MeanRT_Stay", "MeanRT_Switch", "subj_idx"]

wide = df[col]


wide = wide.set_index("subj_idx")
long = wide.stack().to_frame()
long = long.reset_index()
long = long.rename(columns= {"level_1":"condition",0:"RT"})
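
To see what each step does to the shape of the data, here is a sketch on a made-up two-participant frame with two conditions. After `reset_index`, the unnamed index level appears as `level_1` and the values as column `0`, which is why those two names are renamed.

```python
import pandas as pd

# Hypothetical wide-format data: one row per participant.
wide_demo = pd.DataFrame({
    "subj_idx": [1, 2],
    "MeanRT_Baseline": [0.5, 0.6],
    "MeanRT_Switch": [0.9, 1.1],
}).set_index("subj_idx")

# stack() moves the column names into a second index level.
long_demo = wide_demo.stack().to_frame().reset_index()
long_demo = long_demo.rename(columns={"level_1": "condition", 0: "RT"})
print(long_demo)
```

The result has one row per participant and condition, i.e. 2 x 2 = 4 rows here.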

12) Now our data is in the right format to easily plot multiple conditions (e.g., from a repeated measurement design) in one figure, i.e. next to each other. Try to use ```sns.violinplot``` to plot all conditions and RTs.

In [None]:
sns.violinplot(x="condition", y="RT", data=long)