Pre-registered analyses for the in-lab version of study 3d (reported as study 3d in the appendix accompanying the submitted manuscript).

Pre-registration: https://osf.io/de935

In [2]:
from __future__ import division
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import re
from scipy import stats
from pyspan.utils import *
from pyspan.plurals.analysis import *
assert not mturk
from pyspan.plurals.preprocess import *
from pyspan.plurals.utils import *

## Demographics

In [3]:
gender_raw = pd.read_csv("{}in-lab/Gender.csv".format(BASE_DIR))
len(gender_raw), len(gender.loc[gender.ident == "MALE"]), len(gender.loc[gender.ident == "FEMALE"])

(189, 53, 104)

In [4]:
demographic_info(gender.loc[gender.ident == "FEMALE"])

Age: 24.0 (SE = 0.87577981894)
Gender: [('Female', 104)]


In [5]:
demographic_info(gender.loc[gender.ident == "MALE"])

Age: 24.4716981132 (SE = 1.33410215988)
Gender: [('Male', 53)]


### Between subjects

#### 60% item selection threshold

##### Logistic regression

Selection of positive word ~ Participant's gender identity + Condition + Participant's gender identity * Condition + Dummy indicating whether or not this was the first survey the participant took (including participant-level effects)

We hypothesize that the coefficient on Participant's gender identity * Condition will be positive. If the order dummy has a non-zero coefficient, we commit to discarding all gender-survey data from participants who did not take that survey first.
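As a minimal sketch of the fixed-effect part of this model, the block below builds a design matrix with the two main effects, their product (the interaction) and the order dummy. The 0/1 coding, the toy data and the helper name `design_matrix` are assumptions made for illustration; the notebook's own `dummy` and `df_to_matrix` helpers may code the columns differently and also add participant-level columns.

```python
import numpy as np

# Hypothetical 0/1 coding and toy data; not the notebook's actual data.
identity = np.array([1, 1, 0, 0, 1, 0])   # 1 = participant identifies as male
condition = np.array([1, 0, 1, 0, 1, 0])  # 1 = assigned to the male condition
first = np.array([1, 0, 1, 1, 0, 0])      # 1 = gender survey was taken first

def design_matrix(identity, condition, first):
    """Stack main effects, their product (the interaction), and the order dummy."""
    interaction = identity * condition
    return np.column_stack((identity, condition, interaction, first))

X_demo = design_matrix(identity, condition, first)
```

Under this coding the interaction coefficient equals the summed log-odds of the two matching cells (identity = condition) minus the summed log-odds of the two mismatching cells, so a positive coefficient corresponds to the hypothesized matching advantage.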

In [6]:
gdummied, Y = dummy(gender, [ "MALE", "FEMALE" ],
                    sets = np.stack((positive60, 
                                     negative60)))

In [7]:
X, Y = df_to_matrix(gdummied, Y, columns = { 0: "id",
                                             1: "condition",
                                             2: (0,1), 
                                             3: "order" })

In [8]:
logit = SparseLR(Y, X); print logit.coef[:4]; logit.auc 



[0. 0. 0. 0.]


0.5
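One way to read this output: if every coefficient is shrunk to zero (the four shown above and, by assumption, the participant-level ones not shown), the model assigns the same score to every observation, and a constant score yields an AUC of exactly 0.5. The sketch below illustrates that with a rank-based AUC; `auc_from_scores` is a hypothetical helper, not part of `pyspan`, and the labels are toy data.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """Rank-based AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive example's score with every negative example's score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / float(diff.size)

y_demo = [1, 0, 1, 1, 0]            # toy labels
constant_scores = [0.6] * 5         # what an all-zero-coefficient model produces
auc_constant = auc_from_scores(y_demo, constant_scores)
```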

##### t-tests

In [9]:
# Copy the two columns so that adding "ppos" below modifies an
# independent DataFrame rather than a slice of `gender`
gsummary = gender[["Condition", "ident"]].copy()
dat = gender[ixs].values
props = np.apply_along_axis(get_prop, 1, dat, positive60,
                            negative60)
gsummary["ppos"] = props
assert gsummary.values.shape == (len(gender), 3)


Hypothesis: among participants in the male condition, mean(% positive words chosen by males) - mean(% positive words chosen by females) > 0.

In [10]:
a = gsummary.loc[(gsummary["Condition"] == "MALE") & (gsummary["ident"] == "MALE")]["ppos"].values
b = gsummary.loc[(gsummary["Condition"] == "MALE") & (gsummary["ident"] == "FEMALE")]["ppos"].values
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=0.5734117289405796, pvalue=0.5686661620858503)
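The hypotheses in this section are directional, whereas `stats.ttest_ind` as called here reports a two-sided p-value. If a one-sided p-value is wanted, it can be derived from the reported statistic and two-sided p-value, because the t distribution is symmetric. The helper below is a hypothetical convenience, not part of the pre-registered code, and is shown only as a sketch.

```python
def one_sided_p(statistic, p_two_sided):
    """One-sided p-value for H1: mean(a) > mean(b), from a two-sided t-test."""
    if statistic > 0:
        # The observed difference is in the hypothesized direction.
        return p_two_sided / 2.0
    # The observed difference is zero or in the opposite direction.
    return 1.0 - p_two_sided / 2.0

# Applied to the statistic and p-value printed above.
p_one_sided_male60 = one_sided_p(0.5734117289405796, 0.5686661620858503)
```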

Hypothesis: among participants in the female condition, mean(% positive words chosen by females) - mean(% positive words chosen by males) > 0.

In [11]:
a = gsummary.loc[(gsummary["Condition"] == "FEMALE") & (gsummary["ident"] == "FEMALE")]["ppos"].values
b = gsummary.loc[(gsummary["Condition"] == "FEMALE") & (gsummary["ident"] == "MALE")]["ppos"].values
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=1.161959145849565, pvalue=0.25027571718281966)

Hypothesis: mean(% positive words chosen by participants whose condition matched their gender identity) - mean(% positive words chosen by participants whose condition didn't match their gender identity) > 0.

In [12]:
a = gsummary.loc[gsummary["Condition"] == gsummary["ident"]]["ppos"].values
b = gsummary.loc[gsummary["Condition"] != gsummary["ident"]]["ppos"].values
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=0.9056742597351738, pvalue=0.3665166357032258)

Calculate the Welch-Satterthwaite degrees of freedom for the unequal-variance t-test above.

In [13]:
var_a = np.var(a, ddof = 1) / len(a)
var_b = np.var(b, ddof = 1) / len(b)
num = (var_a + var_b)**2
denom = (var_a**2 / (len(a) - 1)) + (var_b**2 / (len(b) - 1))
num / denom

154.70515733051286
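The same degrees-of-freedom computation is repeated for each threshold, so it could be wrapped in a small function. The sketch below restates the cell above as a helper; the name `welch_df` is an assumption and does not come from `pyspan`.

```python
import numpy as np

def welch_df(a, b):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean.
    var_a = np.var(a, ddof=1) / len(a)
    var_b = np.var(b, ddof=1) / len(b)
    num = (var_a + var_b) ** 2
    denom = var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    return num / denom

# With equal sample sizes and equal variances the result reduces to 2n - 2.
df_equal = welch_df([1, 2, 3], [4, 5, 6])
```

The result always lies between the smaller of the two single-sample degrees of freedom and the pooled value `len(a) + len(b) - 2`.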

In [14]:
cohensd(a, b)

0.1426954484574329
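For reference, a common definition of Cohen's d divides the difference in means by the pooled standard deviation. The sketch below implements that definition; it is an assumption that `cohensd` from `pyspan` uses the same formula, and `cohens_d_pooled` is a hypothetical name.

```python
import numpy as np

def cohens_d_pooled(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) +
                  (n_b - 1) * np.var(b, ddof=1)) / float(n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d_demo = cohens_d_pooled([1, 2, 3], [4, 5, 6])  # toy data, not study data
```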

#### 80% item selection threshold

##### Logistic regression

Selection of positive word ~ Participant's gender identity + Condition + Participant's gender identity * Condition + Dummy indicating whether or not this was the first survey the participant took (including participant-level effects)

In [15]:
gdummied, Y = dummy(gender, sets = np.stack((positive80, 
                                              negative80)),
                    classes = [ "MALE", "FEMALE" ])

In [16]:
X, Y = df_to_matrix(gdummied, Y, columns = { 0: "id",
                                             1: "condition",
                                             2: (0,1), 
                                             3: "order" })

In [17]:
logit = SparseLR(Y, X); print logit.coef[:4]; logit.auc

[0. 0. 0. 0.]


0.5

##### t-tests

In [18]:
# Copy the two columns so that adding "ppos" below modifies an
# independent DataFrame rather than a slice of `gender`
gsummary = gender[["Condition", "ident"]].copy()
dat = gender[ixs].values
props = np.apply_along_axis(get_prop, 1, dat, positive80,
                            negative80)
gsummary["ppos"] = props
assert gsummary.values.shape == (len(gender), 3)


Hypothesis: among participants in the male condition, mean(% positive words chosen by males) - mean(% positive words chosen by females) > 0.

In [19]:
a = gsummary.loc[(gsummary["Condition"] == "MALE") & (gsummary["ident"] == "MALE")]["ppos"].values
b = gsummary.loc[(gsummary["Condition"] == "MALE") & (gsummary["ident"] == "FEMALE")]["ppos"].values
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=1.013491294307676, pvalue=0.3174742983108879)

Hypothesis: among participants in the female condition, mean(% positive words chosen by females) - mean(% positive words chosen by males) > 0.

In [20]:
a = gsummary.loc[(gsummary["Condition"] == "FEMALE") & (gsummary["ident"] == "FEMALE")]["ppos"].values
b = gsummary.loc[(gsummary["Condition"] == "FEMALE") & (gsummary["ident"] == "MALE")]["ppos"].values
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=1.3695766379680065, pvalue=0.17679572821015874)

Hypothesis: mean(% positive words chosen by participants whose condition matched their gender identity) - mean(% positive words chosen by participants whose condition didn't match their gender identity) > 0.

In [21]:
a = gsummary.loc[gsummary["Condition"] == gsummary["ident"]]["ppos"]
b = gsummary.loc[gsummary["Condition"] != gsummary["ident"]]["ppos"]
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=2.440305987849316, pvalue=0.015856380643959132)

Calculate the Welch-Satterthwaite degrees of freedom for the unequal-variance t-test above.

In [22]:
var_a = np.var(a, ddof = 1) / len(a)
var_b = np.var(b, ddof = 1) / len(b)
num = (var_a + var_b)**2
denom = (var_a**2 / (len(a) - 1)) + (var_b**2 / (len(b) - 1))
num / denom

147.9463462869692

In [23]:
cohensd(a, b)

0.3917790570486218

### Within-subjects

#### Logistic regression

Selection of positive word ~ Participant's gender identity + Condition + Participant's gender identity * Condition + Dummy indicating whether or not the participant took the gender survey before the valence survey (including participant-level effects)

We hypothesize that the coefficient on Participant's gender identity * Condition will be positive. 

In [24]:
# Pre-registered: "The within-subject analyses would be run
# both for items within each pre-specified valence category
# (25 positive items, 25 negative items and 25 neutral
# items), and using all 75 non-distractor items."
#
# Uncomment exactly one ixs_ assignment below to choose the
# subset of stims used for analysis
# ixs_ = np.arange(100, 125) # Positive items
# ixs_ = np.arange(125, 150) # Negative items
# ixs_ = np.arange(150, 175) # Neutral items
ixs_ = np.arange(100, 175) # Positive, negative and neutral items
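Since the pre-registration calls for running the within-subject analyses on each valence category as well as on all 75 items, the four index ranges could also be collected and iterated over. The sketch below only enumerates the subsets; the labels and the name `item_subsets` are illustrative assumptions, and the index ranges are copied from the cell above.

```python
import numpy as np

# Index ranges copied from the cell above; the labels are illustrative.
item_subsets = [
    ("positive", np.arange(100, 125)),
    ("negative", np.arange(125, 150)),
    ("neutral", np.arange(150, 175)),
    ("all", np.arange(100, 175)),
]

for label, subset_ixs in item_subsets:
    # In the notebook, one would set ixs_ = subset_ixs here and rerun
    # the dummy / df_to_matrix / SparseLR cells that follow.
    print("{}: {} items".format(label, len(subset_ixs)))
```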

In [25]:
gdummied_ws, Y = dummy(gender, 
                       classes = [ "MALE", "FEMALE" ], 
                       within = True, 
                       ixs = ixs_)
X, Y = df_to_matrix(gdummied_ws, Y, 
                    columns = { 0: "id", 1: "condition", 
                                2: (0,1), 3: "order" }, 
                    ixs = ixs_)

In [26]:
logit = SparseLR(Y, X); print logit.coef[:4]; logit.auc

[0.         0.         0.04628255 0.        ]


0.5306000945410451

#### t-tests

In [27]:
gsummary_ws = summarize(gdummied_ws, ixs_)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  summary["p"] = props


Hypothesis: among participants in the male condition, mean(% positive words chosen by males) - mean(% positive words chosen by females) > 0.

In [28]:
# Drop participants with a missing proportion from the second group
stats.ttest_ind(gsummary_ws.loc[(gsummary_ws["Condition"] == 1) & (gsummary_ws["ident"] == 1)]["p"],
                gsummary_ws.loc[(gsummary_ws["Condition"] == 1) & (gsummary_ws["ident"] == 0)]["p"].dropna(),
                equal_var = False)

Ttest_indResult(statistic=0.3405812551557937, pvalue=0.7362407325351605)

Hypothesis: among participants in the female condition, mean(% positive words chosen by females) - mean(% positive words chosen by males) > 0.

In [29]:
stats.ttest_ind(gsummary_ws.loc[(gsummary_ws["Condition"] == 0) & (gsummary_ws["ident"] == 0)]["p"],
                gsummary_ws.loc[(gsummary_ws["Condition"] == 0) & (gsummary_ws["ident"] == 1)]["p"],
                equal_var = False)

Ttest_indResult(statistic=3.0081943370891953, pvalue=0.004581025358360105)

Hypothesis: mean(% positive words chosen by participants whose condition matched their gender identity) - mean(% positive words chosen by participants whose condition didn't match their gender identity) > 0.

In [30]:
a = gsummary_ws.loc[gsummary_ws["Condition"] == gsummary_ws["ident"]]["p"]
# Drop participants with a missing proportion
b = gsummary_ws.loc[gsummary_ws["Condition"] != gsummary_ws["ident"]]["p"].dropna()
stats.ttest_ind(a, b, equal_var = False)

Ttest_indResult(statistic=2.5626945178296414, pvalue=0.011747757560593628)

Calculate the Welch-Satterthwaite degrees of freedom for the unequal-variance t-test above.

In [31]:
var_a = np.var(a, ddof = 1) / len(a)
var_b = np.var(b, ddof = 1) / len(b)
num = (var_a + var_b)**2
denom = (var_a**2 / (len(a) - 1)) + (var_b**2 / (len(b) - 1))
num / denom

109.22185444746182

In [32]:
cohensd(a, b)

0.4556248175088296