# Introduction

For this assignment, I have built a program that is able to:
- Read in a csv file 
- View and tidy the data
- Generate summary statistics and multiple visualisations 
- Perform ANOVA and post hoc analyses
- Handle basic errors

Please load the Docker file and read through the README.md file.

I will break down the program into key components to aid my explanations. Refer to the output in each section to see what each component of the script does.

# Packages

To load the packages, we bring each one in with the `import` keyword and, where useful, give it a shorter alias with the `as` keyword; the alias is the name we then use to call its functions. The `{numpy}` package lets us work with arrays in our dataset and provides the means to generate specific descriptive statistics. `{pandas}` allows us to read in our csv file, whilst `{seaborn}` provides attractive and informative data visualisations built on `{matplotlib}`, a library for creating visualisations in its own right. The `ols()` function imported from `{statsmodels.formula.api}` creates models from a formula and a dataframe, while the `stats` module from `{scipy}` enables us to run t-tests. Finally, we use `AnovaRM` from `{statsmodels.stats.anova}` to perform repeated measures ANOVA in fully balanced designs.
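
As a small illustration of the aliasing described above, the sketch below imports `{numpy}` under the alias `np` and calls two of its descriptive functions. The `values` array is a hypothetical example, not data from the assignment.

```python
import numpy as np  # `np` is the alias we use to reach numpy's functions

# Hypothetical reaction times, used only to show the calls.
values = np.array([1550.5, 1562.0, 1570.5])

mean_rt = np.mean(values)      # arithmetic mean of the array
median_rt = np.median(values)  # middle value of the sorted array
print(mean_rt, median_rt)
```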

# load_data_function()

First of all, I introduce the program using the `print()` function. Next, I use the `def` keyword to define my first function, `load_data_function()`. By default, any variable created or changed within a function is stored locally within that function. If such a variable is referenced outside the function, it won't be recognised. Therefore, I use the `global` keyword to make all the variables that I want to use outside the function accessible.

Once all the appropriate variables have been declared global, I create a new variable called `file`. By using the `input()` function, I set the value of `file` to whatever the user enters. The prompt passed to `input()` tells the user what is required. I then add my first `if` statement, whose indented code only runs if its condition is met. Here the condition is that `file` contains the string `.csv`. This is a basic error-handling technique which prevents the user from attempting to read in a non-csv file type. If the condition is met, the code reads in the data using the `pd.read_csv()` function and renames the first data column 'Participant'. Through the use of indexing combined with the `unique()` function, it also maps all the salient aspects of the data, including the columns and levels, on to defined variables. However, it is worth noting that, in its current state, the program will not run properly if the data columns are not in the following order: Participant, IV1, IV2, DV.

If the file name entered by the user does not contain the string `.csv`, then my `else` statement runs, which tells the user to ensure the file is of csv type. Within the `else` statement, I also add my first `again` function, a pattern I use substantially throughout the code. As I subsequently define, the `load_data_function_again()` function simply calls `load_data_function()` again. The final, unindented line marks where the function definitions end and calls `load_data_function()` so that it runs.
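
The scoping rule described above can be seen in isolation. The following is a minimal sketch rather than part of the program; the names `counter`, `bump_local`, and `bump_global` are hypothetical and chosen only for illustration.

```python
counter = 0  # module-level (global) variable


def bump_local():
    # Assigning here creates a new local name; the global is untouched.
    counter = 99
    return counter


def bump_global():
    # `global` tells Python that assignments target the module-level name.
    global counter
    counter += 1
    return counter


bump_local()
after_local = counter    # the global has not changed
bump_global()
after_global = counter   # the global has been incremented once
print(after_local, after_global)
```

This is why `load_data_function()` declares names such as `data` and `IV1` as global before assigning to them: without the declaration, the assignments would only create local variables that disappear when the function returns.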

# anova_function()

Once the data has been loaded in, `load_data_function()` will not run again. From here, all the remaining code is nested within `anova_function()`. I create a new variable, `options`, which allows the user to navigate through a variety of different choices. The prompt I pass to `input()` sits within triple quotes, which allows the string to be written across several lines.
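
To show what the triple quotes do, the sketch below builds a shortened version of the menu text twice: once with triple quotes and once as a single-line string with explicit `\n` characters. The variable names `menu` and `single_line` are hypothetical.

```python
# Triple quotes keep the line breaks exactly as typed.
menu = '''
What would you like to do with your data:
Enter 1 to view data frame
Enter 2 to tidy data
'''

# The same text written on one line needs an explicit \n for every break.
single_line = ('\nWhat would you like to do with your data:'
               '\nEnter 1 to view data frame'
               '\nEnter 2 to tidy data\n')

same_text = (menu == single_line)
print(same_text)
```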

# Option 1 - Viewing the data frame

This step is fairly simple. If the user enters `1`, a brief description of the function appears, followed by the first and last 5 rows of the dataframe, generated using the `display()` function. The data could use a few improvements to aid interpretability, which our next option enables.

# Option 2 - Tidying the data

In option 2, I use the `elif` keyword, which is short for 'else if' and adds a further condition to an `if` statement. If `options` is equal to 2 (indicated by `== '2'`), then the indented code will run. I first define `tidy_data_function()`, then make the variables I'll be manipulating global. Following this, the user has the option to rename the columns or the levels through the `input()` variable `columns_levels`. If 1 is entered, then the variables `IV1`, `IV2`, and `DV` are updated to match the input. This is carried out by mapping the existing column names on to the newly entered ones in a dictionary, then passing this dictionary to the `rename()` function and assigning the result back to `data`. In my output below, I changed IV1 and IV2 to title case, and renamed the DV 'Reaction Time'.
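
The renaming step can be tried on its own. The sketch below is a minimal example that assumes a tiny made-up dataframe in the column order the program expects; the values are illustrative and are not taken from the assignment data.

```python
import pandas as pd

# Tiny made-up frame: Participant, IV1, IV2, DV (the order the program assumes).
data = pd.DataFrame({
    'Participant': [1, 1],
    'prime': ['negativeprime', 'positiveprime'],
    'target': ['negativetarget', 'negativetarget'],
    'rt': [1550.5, 1570.5],
})

# Map each existing column label (looked up by position) to its new name.
mapping = {
    data.columns[1]: 'Prime',
    data.columns[2]: 'Target',
    data.columns[3]: 'Reaction Time',
}

# rename() returns a new dataframe, so the result is assigned back to `data`.
data = data.rename(columns=mapping)
new_columns = list(data.columns)
print(new_columns)
```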

If the user instead opts to tidy the levels, then my `Which_IV_function()` is defined, followed by the appropriate `global` variables. The user is then presented with another two options: renaming the levels of the first IV, or of the second. The prompt attached to the `Which_IV` input variable includes the `IV1` and `IV2` variables, so it changes depending on the dataset. The following code works similarly to the code that renames the dataframe columns, except that we use the `replace()` function instead of the `rename()` function to modify the data points in our dataset. Because the user may skip the data tidying option, we assign our new levels to the original level variables, which means that later functions can always refer to the same variables. In my output below, I removed the 'prime' and 'target' strings from the dataset and changed them to title case.
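
The level renaming can be sketched in the same way. This minimal example assumes a small made-up dataframe and hard-codes the new level names that the program would normally collect with `input()`; `old_levels` is a hypothetical name.

```python
import pandas as pd

# Made-up data with two levels per independent variable.
data = pd.DataFrame({
    'Prime': ['negativeprime', 'positiveprime'],
    'Target': ['negativetarget', 'positivetarget'],
})
IV1 = 'Prime'

# unique() returns the levels in order of first appearance.
old_levels = data[IV1].unique()

# replace() matches whole cell values, so 'negativetarget' is left alone.
data = data.replace([old_levels[0]], ['Negative'])
data = data.replace([old_levels[1]], ['Positive'])

prime_levels = list(data['Prime'])
target_levels = list(data['Target'])
print(prime_levels, target_levels)
```

Note that `replace()` called on the whole dataframe searches every column, so two variables sharing a level name would both be changed.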

To be able to navigate around the program, I add the option to go back during various stages. By adding another `elif` statement, if the user enters 'b' at any point in the appropriate inputs, then the program will run the preceding function. 

I add an `else` statement which produces an error message and reruns the function, to handle situations where the user inputs something not covered by the `if` and `elif` statements. Finally, I define where my functions conclude.
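
The same "go back" and "try again" behaviour can also be written with a loop instead of functions that call themselves. The sketch below is an alternative, not the approach used in the program; `run_menu` and its `reader` parameter are hypothetical, and `reader` only exists so that the example can run with simulated input.

```python
def run_menu(reader=input):
    """Hypothetical loop-based menu: returns the chosen action, or None for 'b'."""
    valid = {'1': 'rename columns', '2': 'rename levels'}
    while True:
        choice = reader('Enter 1, 2, or b to go back: ').strip()
        if choice in valid:
            return valid[choice]
        elif choice == 'b':
            return None  # the caller decides which menu to show next
        else:
            # Invalid input: report it and let the loop prompt again.
            print('Invalid choice. Try again')


# Simulated replies: one invalid entry followed by a valid one.
replies = iter(['x', '2'])
result = run_menu(reader=lambda prompt: next(replies))
print(result)
```

One reason to consider a loop is that every recursive re-prompt adds a frame to the call stack, whereas a loop can repeat indefinitely without growing it.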

# Option 3 - Summary Statistics

If the user picks option '3', they are first shown a brief aesthetic message describing the process. To generate summary statistics, I first use the `.loc` indexer with the `data.columns != participant` condition to drop the participant column from the dataset, as the participant number adds nothing to the summary stats. I then group by `IV1` and `IV2` using the `groupby()` function, which splits the data by each combination of levels and allows us to generate separate summary stats before combining the results. Finally, I use the `aggregate()` function to generate the actual descriptives, defining which stats we want using the `numpy` arguments. I wrap all this code within the `display()` function so that the output is returned for the user.
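
The chain of `.loc`, `groupby()`, and `aggregate()` can be followed on a small made-up dataset. The sketch below is a minimal example with illustrative values; it passes the statistic names as strings, which pandas also accepts and which avoids the warning that recent pandas versions may raise for `numpy` functions. `summary` and `neg_mean` are hypothetical names.

```python
import pandas as pd

# Made-up long-format data: two participants, two levels of `prime`.
data = pd.DataFrame({
    'Participant': [1, 1, 2, 2],
    'prime': ['neg', 'pos', 'neg', 'pos'],
    'target': ['neg', 'neg', 'neg', 'neg'],
    'rt': [1500.0, 1600.0, 1520.0, 1580.0],
})
participant = 'Participant'

summary = (
    data.loc[:, data.columns != participant]   # keep every column except Participant
        .groupby(['prime', 'target'])          # one group per combination of levels
        .aggregate(['mean', 'std', 'median'])  # descriptives for each remaining column
)

# The result has a two-level column index: (variable, statistic).
neg_mean = summary.loc[('neg', 'neg'), ('rt', 'mean')]
print(summary)
```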

In our output, it looks as though 

In [None]:
elif options == '3':
        print ('Displaying summary statistics')
        display(data.loc[:, data.columns!= participant].groupby([IV1,IV2]).aggregate([np.mean, np.std, np.median]))

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from scipy import stats
from statsmodels.stats.anova import AnovaRM
print('Welcome to my 2-way repeated measures ANOVA program.')
def load_data_function():
    global participant
    global data
    global IV1
    global IV2
    global DV
    global IV1_LVL1
    global IV1_LVL2
    global IV2_LVL1
    global IV2_LVL2
    file = input('''
Enter your csv file name (e.g. 'data.csv'). 
Please ensure that data is in long format before entering: 

''')
    if '.csv' in file:
        print('csv type file loaded in.')
        data = pd.read_csv(file)
        mapping = {data.columns[0]: 'Participant'}
        data = data.rename(columns=mapping)
        participant = data.columns[0]
        IV1 = data.columns[1]
        IV2 = data.columns[2]
        DV = data.columns[3]
        IV1_Dist = data[IV1].unique()
        IV2_Dist = data[IV2].unique()
        IV1_LVL1 = IV1_Dist[0]
        IV1_LVL2 = IV1_Dist[1]
        IV2_LVL1 = IV2_Dist[0]
        IV2_LVL2 = IV2_Dist[1]
    else:
        print('Invalid option, try again. Ensure data file is of csv type.')
        load_data_function_again()
def load_data_function_again():
    load_data_function()
load_data_function()  
def anova_function():
    options = input('''
What would you like to do with your data:
Enter 1 to view data frame
Enter 2 to tidy data
Enter 3 to generate summary statistics
Enter 4 to visualise the data
Enter 5 to perform and report an ANOVA

''')       
    if options == '1':
        print('Displaying Data Frame')
        display(data)
    elif options == '2':
        def tidy_data_function():
            global data
            global IV1
            global IV2
            global DV
            columns_levels = input('''
What would you like to tidy?
Enter 1 to rename the columns
Enter 2 to rename the levels
Enter b to go back

        ''')
            if columns_levels == '1':
                print('''
Ensure that you input the variables in the same order as your dataset.

''')
                IV1 = input('''
Enter your first independent variable: 

            ''')
                IV2 = input('''
Enter your second independent variable: 

            ''')
                DV = input('''
Enter your dependent variable: 
            
            ''')
                mapping = {data.columns[1]: IV1, data.columns[2]: IV2, data.columns[3]: DV}
                data = data.rename(columns=mapping)
            elif columns_levels == '2':
                def Which_IV_function():
                    global data
                    global IV1_LVL1
                    global IV1_LVL2
                    global IV2_LVL1
                    global IV2_LVL2
                    global new_IV1_LVL1
                    global new_IV1_LVL2
                    global new_IV2_LVL1
                    global new_IV2_LVL2
                    Which_IV = input('Which independent variable would you like to rename the levels of?\nEnter 1 to change the levels of ' + IV1 + '\nEnter 2 to change the levels of ' + IV2 + '\nEnter b to go back\n\n') 
                    if Which_IV == '1':
                        new_IV1_LVL1 = input('''
What would you like to rename your first level to?

''')
                        new_IV1_LVL2 = input('''
What would you like to rename your second level to?

''')
                        data = data.replace([IV1_LVL1],[new_IV1_LVL1])
                        data = data.replace([IV1_LVL2],[new_IV1_LVL2])
                        IV1_LVL1 = new_IV1_LVL1
                        IV1_LVL2 = new_IV1_LVL2
                    elif Which_IV == '2':
                        new_IV2_LVL1 = input('''
What would you like to rename your first level to?

''')
                        new_IV2_LVL2 = input('''
What would you like to rename your second level to?

''')
                        data = data.replace([IV2_LVL1],[new_IV2_LVL1])
                        data = data.replace([IV2_LVL2],[new_IV2_LVL2])
                        IV2_LVL1 = new_IV2_LVL1
                        IV2_LVL2 = new_IV2_LVL2
                    elif Which_IV == 'b':
                        tidy_data_function()
                    else:                            
                        print('Invalid option. Try again.')
                    Which_IV_function_again()
                def Which_IV_function_again():
                    Which_IV_function()
                Which_IV_function()
            elif columns_levels == 'b':
                anova_function()
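            else:
                # No explicit retry here: tidy_data_function_again() is only
                # defined further down, and the call after this block re-prompts.
                print('Invalid choice. Try again')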
            def tidy_data_function_again():
                tidy_data_function()
            tidy_data_function_again()
        tidy_data_function()
    elif options == '3':
        print ('Displaying summary statistics')
        display(data.loc[:, data.columns!= participant].groupby([IV1,IV2]).aggregate([np.mean, np.std, np.median]))
    elif options == '4':
        DV_meas = input('''
What was your dependent variable measured in?

''')
        def visualisations_function():
            vis_type = input('''
What visualisation would you like to generate?
Enter 1 for categorical scatterplots 
Enter 2 for categorical distribution plots
Enter 3 for categorical estimate plots
Enter b to go back
            
            ''')
            if vis_type == '1':
                def scat_type_function():
                    scat_type = input('''
What categorical scatterplot would you like to generate?
Enter 1 for a strip plot
Enter 2 for a swarm plot
Enter b to go back

                ''')
                    if scat_type == '1':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='strip', aspect=1.5)
                        plt.title('Strip Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif scat_type == '2':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='swarm', aspect=1.5)
                        plt.title('Swarm Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif scat_type == 'b':
                        visualisations_function_again()
                    else:
                        print('Invalid choice. Please try again.')
                        scat_type_function_again()
                    scat_type_function_again()
                def scat_type_function_again():
                    scat_type_function()
                scat_type_function()
            elif vis_type == '2':
                def dist_type_function():
                    dist_type = input('''
What categorical distribution plot would you like to generate?
Enter 1 for a box plot
Enter 2 for a violin plot
Enter 3 for a boxen plot
Enter b to go back

                ''')
                    if dist_type == '1':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='box', aspect=1.5)
                        plt.title('Box Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif dist_type == '2':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='violin', aspect=1.5)
                        plt.title('Violin Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif dist_type == '3':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='boxen', aspect=1.5)
                        plt.title('Boxen Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif dist_type == 'b':
                        visualisations_function_again()
                    else:
                        print('Invalid choice. Please try again.')
                        dist_type_function_again()
                    dist_type_function_again()
                def dist_type_function_again():
                    dist_type_function()
                dist_type_function()
            elif vis_type == '3':
                def est_type_function():
                    est_type = input('''
What categorical estimate plot would you like to generate?
Enter 1 for a point plot
Enter 2 for a bar plot
Enter b to go back

                ''')
                    if est_type == '1':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='point', ci=30, aspect=1.5)
                        plt.title('Point Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif est_type == '2':
                        sns.catplot(x=IV1, y=DV, hue=IV2,
                        data=data, kind='bar', ci=100, aspect=1.5)
                        plt.title('Bar Plot Examining the Effect of\n ' + IV1 + ' and ' + IV2 + ' on ' + DV)
                        plt.ylabel(DV + ' (' + DV_meas + ')')
                        plt.show()
                    elif est_type == 'b':
                        visualisations_function_again()
                    else:
                        print('Invalid choice. Please try again.')
                        est_type_function_again()
                    est_type_function_again()
                def est_type_function_again():
                    est_type_function()
                est_type_function()
            elif vis_type == 'b':
                anova_function()
            else:
                print('Invalid choice. Please try again.')
                visualisations_function_again()
        def visualisations_function_again():
            visualisations_function()   
        visualisations_function()  
    elif options == '5':
        print('Performing ANOVA')
        print(AnovaRM(data=data, depvar=DV, within=[IV1,IV2], subject=participant).fit())
        pairwise_comp = input(
'Would you like to carry out pairwise comparisons? y/n\n(Note these are only recommended when there is a significant interaction between ' + IV1 + ' and ' + IV2 + ')\n\n')
        if pairwise_comp == 'y':
            print('''
Performing pairwise comparisons:       
            ''')
            print('t-test result for ' + IV2_LVL1 + ' ' + IV2 + 's in ' + IV1_LVL1 + ' and ' + IV1_LVL2 + ' ' + IV1 + 's:')
            index = (data[IV1]==IV1_LVL1) & (data[IV2]==IV2_LVL1)
            PP = data[index][DV]
            index = (data[IV1]==IV1_LVL2) & (data[IV2]==IV2_LVL1)
            NP = data[index][DV]
            print(stats.ttest_rel(PP, NP))
            print('''
''')
            print('t-test result for ' + IV2_LVL2 + ' ' + IV2 + 's in ' + IV1_LVL1 + ' and ' + IV1_LVL2 + ' ' + IV1 + 's:')
            index = (data[IV1]==IV1_LVL1) & (data[IV2]==IV2_LVL2)
            PN = data[index][DV] 
            index = (data[IV1]==IV1_LVL2) & (data[IV2]==IV2_LVL2)
            NN = data[index][DV]
            print(stats.ttest_rel(PN, NN))
        elif pairwise_comp == 'n':
            anova_function()
    else:
        print('Invalid choice. Please try again.')
    anova_function_again()

def anova_function_again():
    anova_function()
    
anova_function()

Welcome to my 2-way repeated measures ANOVA program.

Enter your csv file name (e.g. 'data.csv'). 
Please ensure that data is in long format before entering: 

data.csv
csv type file loaded in.

What would you like to do with your data:
Enter 1 to view data frame
Enter 2 to tidy data
Enter 3 to generate summary statistics
Enter 4 to visualise the data
Enter 5 to perform and report an ANOVA

3
Displaying summary statistics


| prime | target | rt (mean) | rt (std) | rt (median) |
|---|---|---|---|---|
| negativeprime | negativetarget | 1547.256757 | 52.3827 | 1550.5 |
| negativeprime | positivetarget | 1562.648649 | 50.4684 | 1562.0 |
| positiveprime | negativetarget | 1566.959459 | 54.016737 | 1570.5 |
| positiveprime | positivetarget | 1547.391892 | 44.879072 | 1550.5 |
