# Project 1: Test a Perceptual Phenomenon
### Preamble and experimental design

In this variation of the Stroop task, we want to know whether incongruence between the ink color a word is printed in and the word's text content increases the time needed to name the ink color. The independent variable is the condition, congruent or incongruent; that is, whether or not the printed color of the word matches its text. The dependent variable is the time it takes to name the ink colors. The question to answer is: does the incongruent condition affect the time needed to name the ink color? The setup of the experiment provides us with dependent, paired samples, one for each condition. As such, a standard Student's t-test for the comparison of dependent samples should be sufficient to determine the condition's effect. Our hypotheses and experimental setup are as follows, where $ \mu_c $ is the mean congruent naming time and $ \mu_i $ the mean incongruent naming time. We select an alpha level of 0.05, which for a two-tailed test with 24 paired observations (23 degrees of freedom) gives us a t-critical value of plus/minus 2.069.

$ H_0: \mu_c=\mu_i $

$ H_A: \mu_c \ne \mu_i $

$ \alpha=0.05, n=24, df=n-1=23 $

$ t_{critical} = \pm2.069 $
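
The critical value can also be computed rather than read from a table. The following is a minimal sketch, assuming SciPy is available and that the degrees of freedom for the paired test are n - 1; the names `alpha`, `n`, `df` and `t_critical` are illustrative and do not appear elsewhere in the notebook.

```python
from scipy import stats

alpha = 0.05   # significance level
n = 24         # number of paired observations
df = n - 1     # degrees of freedom for a paired t-test

# Two-tailed test: alpha / 2 goes in each tail, so we ask for the
# (1 - alpha / 2) quantile of the t-distribution.
t_critical = stats.t.ppf(1 - alpha / 2, df)
print(round(t_critical, 3))
```

A t-statistic whose absolute value exceeds `t_critical` would lead us to reject the null hypothesis at this alpha level.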

### Initial setup and exploration
First, we import the pandas library and load the sample data. We take a look at the data to make sure it imported correctly, then use a pandas built-in function to generate some descriptive statistics. The mean naming time appears to differ substantially between the congruent and incongruent conditions. While the centers of the two data sets differ, their standard deviations and interquartile ranges are comparable. This suggests that the incongruent condition shifts the center of the data without greatly affecting its variability.
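
The interquartile range is not reported directly by `describe()`, but it can be derived from the quartiles it returns. The sketch below shows one way to do this; the `sample` DataFrame holds placeholder values for illustration only (not the Stroop measurements), and the names `summary`, `iqr` and `spread` are assumptions of this example.

```python
import pandas as pd

# Placeholder values for illustration only; not the Stroop measurements.
sample = pd.DataFrame({
    'Congruent': [10.0, 12.0, 14.0, 16.0, 18.0],
    'Incongruent': [20.0, 22.0, 24.0, 26.0, 28.0],
})

summary = sample.describe()

# Interquartile range: 75th percentile minus 25th percentile, per column.
iqr = summary.loc['75%'] - summary.loc['25%']

# Put the two measures of spread side by side for comparison.
spread = pd.DataFrame({'std': summary.loc['std'], 'iqr': iqr})
print(spread)
```

Applied to `stroop_data`, the same two lines would put the standard deviation and interquartile range of each condition in one small table.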

In [None]:
import pandas as pd

# https://chrisalbon.com/python/pandas_dataframe_importing_csv.html
stroop_data = pd.read_csv('stroopdata.csv')
stroop_data

In [None]:
# http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/
# https://chrisalbon.com/python/pandas_dataframe_descriptive_stats.html
stroop_data.describe()

### Data visualization
Visualization will provide greater insight into the descriptive statistics. We make use of the graphics functions built into pandas and integrated into the Jupyter notebook. We can generate histograms of the data sets showing the frequency distributions of the naming times. We look at the histograms both side by side and laid on top of one another. This gives us a sense of the shape of the data and of their overlap, respectively. The Jupyter notebook provides widgets for interacting with the data; here, we use a slider to control the number of bins. Some fiddling suggests that 16 bins is most illustrative. We see that both the congruent and incongruent data are approximately normally distributed, with a slight right skew. Both data sets are packed tightly around the mean, with some outliers to the right increasing the skew.
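
The visual impression of right skew can be backed up with a number. The following is a minimal sketch using the pandas `skew()` method; the `times` Series holds placeholder values for illustration only (not the Stroop measurements), and the variable names are assumptions of this example.

```python
import pandas as pd

# Placeholder values for illustration only; not the Stroop measurements.
times = pd.Series([12.0, 13.0, 13.5, 14.0, 14.5, 15.0, 16.0, 22.0])

# Sample skewness: positive values indicate a longer right tail,
# and a symmetric sample gives 0.
skewness = times.skew()

# A mean above the median is another common sign of right skew.
mean, median = times.mean(), times.median()

print("skewness:", skewness)
print("mean > median:", mean > median)
```

Calling `stroop_data.skew()` would report the same statistic for each condition in one step.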

In [None]:
# http://stackoverflow.com/questions/10511024/in-ipython-notebook-pandas-is-not-displying-the-graph-i-try-to-plot
%matplotlib inline

# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html#interactive
from ipywidgets import interact

# http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist
import matplotlib
matplotlib.style.use('ggplot')

# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
# http://stackoverflow.com/questions/28654003/how-to-plot-histograms-from-dataframes-in-pandas
# http://stackoverflow.com/questions/12125880/changing-default-x-range-in-histogram-matplotlib
# http://stackoverflow.com/questions/24571005/return-max-value-from-panda-dataframe-as-a-whole-not-based-on-column-or-rows
def side_by_side(bin_size):
    stroop_data.hist(bins=bin_size, layout=(1,2), figsize=(12,4), range=(0, stroop_data.values.max()))

interact(side_by_side, bin_size=(1,len(stroop_data),1))

In [None]:
# http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist

def overlay(bin_size):
    stroop_data.plot.hist(bins=bin_size, alpha=0.5, figsize=(8,4), range=(0, stroop_data.values.max()))

interact(overlay, bin_size=(1,len(stroop_data),1))

### Hypothesis testing and conclusion

We perform the Student's t-test for the comparison of dependent samples. The scipy.stats library has a built-in function for this (ttest_rel). From the initial setup, we recall that the t-critical value for our experiment is plus/minus 2.069. The test returns a t-statistic of approximately -8, far beyond the critical value. Accordingly, the test puts the probability of observing such a value by chance, if the null hypothesis were true, at less than 1 in 10 million.

We perform the t-test manually, just to be sure the function is behaving as expected. This also sets us up with the data necessary to generate a confidence interval.
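
For reference, the manual calculation follows the usual paired-sample formulas, where $ d_j $ is the congruent time minus the incongruent time for participant $ j $. The symbols $ d_j $, $ \bar{d} $ and $ s_d $ are notation introduced here for illustration; they correspond to the mean and sample standard deviation of the differences.

$ \bar{d} = \frac{1}{n}\sum_{j=1}^{n} d_j $

$ s_d = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(d_j-\bar{d})^2} $

$ t = \frac{\bar{d}}{s_d/\sqrt{n}} $

$ \bar{d} \pm t_{critical} \cdot \frac{s_d}{\sqrt{n}} $

The last expression gives the bounds of the confidence interval for the mean difference.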

Conclusion: there is sufficient evidence to reject the null hypothesis. It appears that incongruence between the printed ink color and the text content of a word affects the time needed to name the ink color. We are 95% confident that the incongruent naming time is between approximately 6 and 10 seconds slower than the congruent naming time. This lines up with our naive expectations; it seems logical that incongruence would cause cognitive interference and slow down the naming process.

In [None]:
# http://stackoverflow.com/questions/13404468/t-test-in-pandas-python
# http://stackoverflow.com/questions/15984221/how-to-perform-two-sample-one-tailed-t-test-with-numpy-scipy
# https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.ttest_rel.html
from scipy import stats

stats.ttest_rel(stroop_data['Congruent'], stroop_data['Incongruent'])

In [None]:
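# Double-check: compute the paired t-statistic by hand

# Per-participant difference and its squared deviation from the mean difference
stroop_data['dlt'] = stroop_data['Congruent'] - stroop_data['Incongruent']
stroop_data['diff_dev'] = (stroop_data['dlt'] - stroop_data['dlt'].mean()) ** 2
print(stroop_data)

# The mean of the differences should equal the difference of the means
mean_diff = stroop_data['Congruent'].mean() - stroop_data['Incongruent'].mean()
print("DF mean equals point estimate:", round(stroop_data['dlt'].mean(), 7) == round(mean_diff, 7))

# Sample standard deviation of the differences (n - 1 in the denominator)
calc_std = (stroop_data['diff_dev'].sum() / (len(stroop_data) - 1)) ** 0.5
# Compare after rounding; exact equality on floats is unreliable
print("DF diff std equals calculated std:", round(stroop_data['dlt'].std(), 7) == round(calc_std, 7))

# t = mean difference / standard error of the mean difference
t = stroop_data['dlt'].mean() / (stroop_data['dlt'].std() / len(stroop_data) ** 0.5)
print("t-statistic:", t)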

In [None]:
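# Two-tailed critical value for alpha = 0.05 with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, len(stroop_data) - 1)

# Margin of error: critical value times the standard error of the mean difference
margin = t_crit * (calc_std / len(stroop_data) ** 0.5)
print("confidence interval: (", mean_diff - margin, ",", mean_diff + margin, ")")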