# Testing a perpetual phenomenon

Author: Vivek V Dwivedi<br/>
Dated: July 11, 2018

## On your mark ...

In a Stroop task, participants are presented with a list of words, with each word displayed in a color of ink. The participant’s task is to say out loud the color of the ink in which the word is printed. The task has two conditions: a congruent words condition, and an incongruent words condition. In the congruent words condition, the words being displayed are color words whose names match the colors in which they are printed: for example <span style="color:red">RED</span>, <span style="color:blue">BLUE</span>. In the incongruent words condition, the words displayed are color words whose names do not match the colors in which they are printed: for example <span style="color: green">PURPLE</span>, <span style="color: blue">ORANGE</span>. In each case, we measure the time it takes to name the ink colors in equally-sized lists. Each participant will go through and record a time from each condition.

## Get Set ...

We are measuring how long it takes participants to name the ink colors in the congruent and incongruent lists. Hence we can state that:

- The word condition (congruent or incongruent) is our **independent variable**.

- The time taken to name the ink colors in two lists of the same size is our **dependent variable**.

With our dependent and independent variables identified, let's set our null and alternate hypotheses. The null hypothesis is that there is no difference between the population mean times taken to name the ink colors for congruent (μ<sub>c</sub>) and incongruent (μ<sub>i</sub>) words. Even though it seems likely that incongruent words will take longer, my alternate hypothesis is non-directional: the two population means differ.

H<sub>0</sub>: μ<sub>c</sub> = μ<sub>i</sub>

H<sub>A</sub>: μ<sub>c</sub> ≠ μ<sub>i</sub>

With these null and alternate hypotheses set, which statistical test am I going to use?

*Two-tailed dependent samples t-test*

Since I am not concerned about the direction of the difference, I will go for a two-tailed test. My dataset for this test contains 25 records, and the population parameters are not available to me. Because the t distribution is the appropriate choice for a small sample with unknown population variance, I will use a t-test to decide whether to retain or reject the null hypothesis. And because the same person takes the test under both conditions, the samples are paired, so this is a dependent samples t-test, also known as a *within-subject* or *repeated-measures* design.
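To make the chosen test concrete, here is a minimal sketch of how a dependent samples t statistic could be computed from paired timings. The function name `paired_t_statistic` and the four example pairs are hypothetical and are not taken from the dataset; the sketch only illustrates the standard formula, t = mean of the differences divided by their standard error, with n - 1 degrees of freedom.

```python
import math


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two equal-length paired samples."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    n = len(first)
    if n < 2:
        raise ValueError("need at least two pairs")

    # One difference per participant: second condition minus first.
    differences = [b - a for a, b in zip(first, second)]
    mean_difference = sum(differences) / n

    # Sample standard deviation of the differences (divides by n - 1).
    sum_of_squares = sum((d - mean_difference) ** 2 for d in differences)
    sample_sd = math.sqrt(sum_of_squares / (n - 1))
    if sample_sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    standard_error = sample_sd / math.sqrt(n)
    return mean_difference / standard_error, n - 1


# Hypothetical timings in seconds for four participants, for illustration only.
congruent_example = [10.0, 12.0, 11.0, 13.0]
incongruent_example = [15.0, 16.0, 18.0, 17.0]

t_value, df = paired_t_statistic(congruent_example, incongruent_example)
print(t_value, df)
```

For a two-tailed test, the absolute value of the resulting t would then be compared against the critical t value for the chosen significance level and these degrees of freedom.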

## Go ...

Let's figure out some descriptive statistics about our dataset. 

- Degrees of freedom (n - 1): 24


In [27]:
import csv
import numpy as np
import math 

# each row is a dict keyed by the CSV header, with the values still stored as strings
timings = []
with open('./data/stroopdata.csv', newline='') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        timings.append(row)
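As a rough illustration of what `csv.DictReader` hands back, the sketch below reads a small in-memory CSV instead of the real file. The two data rows are made up for illustration; only the column names `Congruent` and `Incongruent` come from this notebook. It shows that each value arrives as a string and has to be converted before any arithmetic.

```python
import csv
import io

# Hypothetical two-row sample; the values are invented for illustration.
sample_csv = "Congruent,Incongruent\n12.5,19.25\n14.0,21.5\n"

# DictReader uses the first line as the keys of every row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]["Congruent"])  # still a string at this point

# Convert each column to floats before computing statistics.
congruent = [float(row["Congruent"]) for row in rows]
incongruent = [float(row["Incongruent"]) for row in rows]
print(congruent, incongruent)
```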


In [28]:
# takes a list of numbers and returns the mean, rounded to 2 decimal places
def mean(data):
    return round(sum(data) / len(data), 2)
    

In [29]:
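# takes a list of numbers and returns the median, rounded to 2 decimal places
def median(data):
    # sort a copy so the caller's list is left in its original order
    ordered = sorted(data)
    length = len(ordered)
    middle = length // 2

    if length % 2 == 0:
        # even count: average the two middle values
        return round((ordered[middle] + ordered[middle - 1]) / 2, 2)
    # odd count: take the single middle value
    return round(ordered[middle], 2)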

In [48]:
# given a sample and its mean, returns the variance, rounded to 2 decimal places
# for this very specific case, I am providing the mean as well, else it can be calculated
def variance(data, mean_value):
    # squared difference from mean
    squared_difference = [(num - mean_value)**2 for num in data]
    # sum of squared differences divided by the sample size n, not by n - 1;
    # describe_data accounts for this by dividing by sqrt(n - 1) for the standard error
    return round(sum(squared_difference) / len(squared_difference), 2)
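The hand-written helpers can be cross-checked against Python's standard `statistics` module. The sketch below uses a small hypothetical sample, not the Stroop data. Note that `statistics.pvariance` divides by n, as the `variance` function above does, while `statistics.variance` divides by n - 1.

```python
import statistics

# Hypothetical values chosen so the results are easy to verify by hand.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_mean = statistics.mean(sample)                # arithmetic mean
sample_median = statistics.median(sample)            # middle of the sorted data
population_variance = statistics.pvariance(sample)   # divides by n
sample_variance = statistics.variance(sample)        # divides by n - 1

print(sample_mean, sample_median, population_variance, sample_variance)
```

Running the same list through both the helpers and the library functions is a quick way to spot which divisor a variance routine is using.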

In [49]:
def describe_data(data, entity_type):
    mean_calculated = mean(data)
    # variance() divides by n, so these are the population-style variance and standard deviation
    variance_calculated = variance(data, mean_calculated)
    sd = math.sqrt(variance_calculated)
    # dividing by sqrt(n - 1) here matches the usual sample standard deviation over sqrt(n)
    sem = sd / math.sqrt(len(data) - 1)
    print('\n-------------------------- {} --------------------------'.format(entity_type))
    print('Mean: {}'.format(mean_calculated))
    print('Median: {}'.format(median(data)))
    print('Variance: {}'.format(variance_calculated))
    print('Standard Deviation: {}'.format(sd))
    print('Standard Error: {}'.format(sem))
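The standard error in `describe_data` divides a standard deviation computed with n by sqrt(n - 1). A minimal sketch, using a hypothetical helper name and made-up values, suggests why this agrees with the more familiar form, sample standard deviation divided by sqrt(n): both reduce to the square root of the sum of squares over n(n - 1). The notebook's version rounds the mean and variance first, so its result can differ slightly from either form.

```python
import math


def standard_error_two_ways(data):
    """Return the standard error computed with both divisor conventions."""
    n = len(data)
    mean_value = sum(data) / n
    sum_of_squares = sum((x - mean_value) ** 2 for x in data)

    population_sd = math.sqrt(sum_of_squares / n)    # divides by n
    sample_sd = math.sqrt(sum_of_squares / (n - 1))  # divides by n - 1

    return population_sd / math.sqrt(n - 1), sample_sd / math.sqrt(n)


# Hypothetical timings in seconds, for illustration only.
sem_a, sem_b = standard_error_two_ways([12.0, 15.5, 9.75, 14.25, 11.0])
print(sem_a, sem_b)
```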


In [50]:
congruent_timings = [float(row['Congruent']) for row in timings]
describe_data(congruent_timings, 'congruent')

incongruent_timings = [float(row['Incongruent']) for row in timings]
describe_data(incongruent_timings, 'Incongruent')


-------------------------- congruent --------------------------
Mean: 14.1
Median: 14.4
Variance: 12.1
Standard Deviation: 3.478505426185217
Standard Error: 0.7253185207353657

-------------------------- Incongruent --------------------------
Mean: 22.0
Median: 21.0
Variance: 22.1
Standard Deviation: 4.701063709417263
Standard Error: 0.9802395448141191
