# Project Psyched: A Closer Look Into Reproducibility In Psychological Research

## Data Analysis & Visualization Script: Comprehensive Analysis
This script performs data analysis and visualization after the data have been scraped from TDM Studio.

This notebook combines the analyses done across parts 1 and 2 of this project, and uses the full corpus of the 6 APA journals available in this project.

Author: Yuyang Zhong (2021). This work is licensed under a [Creative Commons BY-NC-SA 4.0 International
License][cc-by].

![CC BY-NC-SA 4.0][cc-by-shield]

[cc-by]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-shield]: https://img.shields.io/badge/license-CC--BY--NC--SA%204.0-blue

#### Setup & Imports

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy import stats

#### Load Journal Metadata

In [47]:
in_path = "../data/"  # data directory; must be defined before the first read_csv call
meta = pd.read_csv(in_path + "metadata_all.csv", index_col=0)
meta = meta[['Journal', 'Date Published']]
meta.head()

Unnamed: 0,Journal,Date Published
614337945.xml,Journal of Personality and Social Psychology,1987-03-01
1647028895.xml,Journal of Personality and Social Psychology,2015-01-01
614404963.xml,Journal of Personality and Social Psychology,2002-07-01
614332724.xml,Journal of Personality and Social Psychology,1997-11-01
614304222.xml,Journal of Personality and Social Psychology,1990-11-01


In [48]:
meta.shape

(46057, 2)

In [49]:
meta['Journal'] = meta['Journal'].replace('The Journal of Abnormal Psychology', 'Journal of Abnormal Psychology')
meta['Journal'].value_counts()

American Psychologist                                                  13854
Journal of Applied Psychology                                           9960
Developmental Psychology                                                6757
Journal of Personality and Social Psychology                            6048
Journal of Abnormal Psychology                                          5266
Journal of Experimental Psychology: Learning, Memory, and Cognition     4171
Journal of Experimental Social Psychology                                  1
Name: Journal, dtype: int64

Remove the single Journal of Experimental Social Psychology (JESP) entry

In [52]:
meta = meta[meta['Journal'] != 'Journal of Experimental Social Psychology']
meta.shape

(46056, 2)

#### Descriptive Statistics

In [53]:
meta['Date Published'] = pd.to_datetime(meta['Date Published'])
meta['Year'] = meta['Date Published'].dt.year

In [54]:
meta.groupby('Journal').agg({'Journal': 'count', 'Year': ['max', 'min']})

Journal,Journal count,Year max,Year min
American Psychologist,13854,2020,1946
Developmental Psychology,6757,2020,1969
Journal of Abnormal Psychology,5266,2020,1906
Journal of Applied Psychology,9960,2020,1917
"Journal of Experimental Psychology: Learning, Memory, and Cognition",4171,2020,1982
Journal of Personality and Social Psychology,6048,2020,1985
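The per-journal summary above comes back with a two-level column header. As a minimal sketch, pandas named aggregation can produce the same three numbers with flat column names. The toy frame and the names `n_articles`, `first_year`, and `last_year` are assumptions made for illustration; only the `Journal` and `Year` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for `meta`; the rows are made up for illustration.
toy_meta = pd.DataFrame(
    {
        "Journal": ["Journal A", "Journal A", "Journal B"],
        "Year": [1990, 2001, 1987],
    }
)

# Named aggregation: each keyword becomes one flat output column.
summary = toy_meta.groupby("Journal").agg(
    n_articles=("Year", "size"),   # number of rows per journal
    first_year=("Year", "min"),    # earliest publication year
    last_year=("Year", "max"),     # latest publication year
)

print(summary)
```

The flat column names make later selections such as `summary["first_year"]` simpler than indexing into a column MultiIndex.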


Remove articles from before 1985

In [55]:
list1985 = meta[meta['Year'] >= 1985].index

In [56]:
meta1985 = meta.loc[list1985,:]
meta1985.head()

Unnamed: 0,Journal,Date Published,Year
614337945.xml,Journal of Personality and Social Psychology,1987-03-01,1987
1647028895.xml,Journal of Personality and Social Psychology,2015-01-01,2015
614404963.xml,Journal of Personality and Social Psychology,2002-07-01,2002
614332724.xml,Journal of Personality and Social Psychology,1997-11-01,1997
614304222.xml,Journal of Personality and Social Psychology,1990-11-01,1990


In [57]:
meta1985.shape

(28440, 3)

In [58]:
meta1985.groupby('Journal').agg({'Journal': 'count', 'Year': ['max', 'min']})

Journal,Journal count,Year max,Year min
American Psychologist,7260,2020,1985
Developmental Psychology,4691,2020,1985
Journal of Abnormal Psychology,2898,2020,1985
Journal of Applied Psychology,3539,2020,1985
"Journal of Experimental Psychology: Learning, Memory, and Cognition",4004,2020,1985
Journal of Personality and Social Psychology,6048,2020,1985


#### Load p-values & statistics

In [59]:
in_path = "../data/"
in_name = 'stats_all.csv'

In [60]:
df_stats = pd.read_csv(in_path + in_name, index_col='File').drop('Unnamed: 0', axis=1)
df_stats.head()

File,Original,Type,stat-Sign,p-Sign,Reported p-value,Recalculated p-value
614337945.xml,"t (41) = 4.10, p < .01",t,=,<,0.01,9.531027e-05
614337945.xml,"t (41) = −3.56, p < .01",t,=,<,0.01,0.9995224
614337945.xml,"t (41) = 8.21, p < .01",t,=,<,0.01,1.708961e-10
614337945.xml,"t (41) = 4.82, p < .01",t,=,<,0.01,9.9876e-06
614337945.xml,"t (41) = −2.57, p < .01",t,=,<,0.01,0.9930493
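The notebook loads the recalculated p-values ready-made and does not show how they were derived. The sketch below is one way such a value could be recomputed from the `Original` string, and it rests on two assumptions: that only t and F tests need handling, and that the recalculation uses the upper-tail probability (`sf`). The second assumption is inferred from the displayed values, where negative t statistics map to p-values near 1. The helper names `parse_stat` and `recalc_p` are hypothetical and are not part of the project.

```python
import re
from scipy import stats

# Matches strings such as "t (41) = 4.10, p < .01" or "F(2, 54) = 3.98, p = .024".
# Assumption: only t and F tests are handled in this sketch.
STAT_RE = re.compile(
    r"(?P<type>[tF])\s*\((?P<df1>[\d.]+)(?:,\s*(?P<df2>[\d.]+))?\)\s*"
    r"(?P<stat_sign>[=<>])\s*(?P<value>[-\u2212]?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<p_sign>[=<>])\s*(?P<p>\d*\.?\d+)"
)


def parse_stat(text):
    """Split a reported test string into its parts, or return None if it does not match."""
    m = STAT_RE.search(text)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "type": d["type"].lower(),
        "df1": float(d["df1"]),
        "df2": float(d["df2"]) if d["df2"] else None,
        # The corpus uses the Unicode minus sign (U+2212); normalize it before float().
        "value": float(d["value"].replace("\u2212", "-")),
        "p_sign": d["p_sign"],
        "reported_p": float(d["p"]),
    }


def recalc_p(parsed):
    """Upper-tail p-value for a parsed t or F statistic (assumed convention)."""
    if parsed["type"] == "t":
        return float(stats.t.sf(parsed["value"], parsed["df1"]))
    return float(stats.f.sf(parsed["value"], parsed["df1"], parsed["df2"]))


example = parse_stat("t (41) = 4.10, p < .01")
print(example, recalc_p(example))
```

For the first row shown above, the table lists a recalculated value of 9.531027e-05, which can serve as a check on the assumed upper-tail convention.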


In [61]:
df_stats.shape

(212589, 6)

Remove entries from articles published before 1985 (none were removed)

In [68]:
df_stats.index.isin(list1985)

array([ True,  True,  True, ...,  True,  True,  True])

In [69]:
df_stats[df_stats.index.isin(list1985)]

File,Original,Type,stat-Sign,p-Sign,Reported p-value,Recalculated p-value
614337945.xml,"t (41) = 4.10, p < .01",t,=,<,0.010,9.531027e-05
614337945.xml,"t (41) = −3.56, p < .01",t,=,<,0.010,9.995224e-01
614337945.xml,"t (41) = 8.21, p < .01",t,=,<,0.010,1.708961e-10
614337945.xml,"t (41) = 4.82, p < .01",t,=,<,0.010,9.987600e-06
614337945.xml,"t (41) = −2.57, p < .01",t,=,<,0.010,9.930493e-01
...,...,...,...,...,...,...
614300872.xml,"F(2, 54) = 3.98, p = .024",f,=,=,0.024,2.441313e-02
614300872.xml,"F(1, 53) = 4.55, p = .015",f,=,=,0.015,3.756755e-02
614300872.xml,"t(38) = 3.61, p < .01",t,=,<,0.010,4.405335e-04
614300872.xml,"t(38) = 1.80, p < .05",t,=,<,0.050,3.990185e-02
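The cells above display the filtered frame without assigning it, so the claim that nothing is dropped rests on the truncated output. A minimal sketch of how the filter could be stored and the number of dropped rows counted explicitly is given below. The toy frames and the names `keep_ids`, `mask`, and `n_dropped` are assumptions for illustration; the notebook itself uses `list1985` and `df_stats`.

```python
import pandas as pd

# Toy stand-ins: article metadata indexed by file name, and one row per reported test.
meta_toy = pd.DataFrame(
    {"Year": [1987, 1979, 2015]},
    index=["a.xml", "b.xml", "c.xml"],
)
stats_toy = pd.DataFrame(
    {"Reported p-value": [0.01, 0.05, 0.02, 0.001]},
    index=pd.Index(["a.xml", "a.xml", "b.xml", "c.xml"], name="File"),
)

# Files published in 1985 or later (the role played by `list1985` in the notebook).
keep_ids = meta_toy.index[meta_toy["Year"] >= 1985]

# Boolean mask over the statistics rows; several rows can share one file name.
mask = stats_toy.index.isin(keep_ids)
filtered = stats_toy[mask]
n_dropped = int((~mask).sum())

print(f"kept {len(filtered)} rows, dropped {n_dropped}")
```

Counting `(~mask).sum()` gives a direct check that does not depend on how much of the boolean array the notebook prints.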


In [82]:
# Articles that contribute more than 100 extracted statistics
counts = df_stats.index.value_counts()
counts[counts > 100]

1640024140.xml    404
2316529621.xml    320
2259585220.xml    224
1824548299.xml    170
614498611.xml     168
904195315.xml     161
2239317783.xml    148
614316377.xml     145
2392062798.xml    141
2273406296.xml    141
1707066777.xml    132
614401313.xml     128
614314052.xml     126
614509854.xml     125
1013917884.xml    121
614494302.xml     121
1753445757.xml    119
614489049.xml     118
1905844604.xml    117
614324338.xml     117
1960441284.xml    112
1717485855.xml    111
614483228.xml     111
614375114.xml     110
1882273130.xml    109
614404391.xml     108
614313050.xml     108
614322439.xml     107
614370933.xml     107
2027384737.xml    107
2316522056.xml    106
1501833957.xml    106
2071944745.xml    106
614320059.xml     106
1888774581.xml    105
2192053844.xml    105
614476045.xml     104
614331212.xml     104
2164361865.xml    104
614371006.xml     104
2387814363.xml    103
1666305131.xml    101
Name: File, dtype: int64

In [73]:
meta = pd.read_csv(in_path + "metadata_all.csv", index_col=0)
meta.loc['1640024140.xml',:]

Journal           Journal of Experimental Psychology: Learning, ...
Date Published                                  2015-07-01 00:00:00
Year                                                           2015
Name: 1640024140.xml, dtype: object

In [79]:
df_stats.loc['1640024140.xml',:]

File,Original,Type,stat-Sign,p-Sign,Reported p-value,Recalculated p-value
1640024140.xml,"F(1, 19) = .241, p = .629",f,=,=,0.629,6.291069e-01
1640024140.xml,"F(2, 38) = 2.82, p = .072",f,=,=,0.072,7.212374e-02
1640024140.xml,"F(3, 57) = 1.23, p = .307",f,=,=,0.307,3.072161e-01
1640024140.xml,"F(4, 76) = .997, p = .415",f,=,=,0.415,4.145328e-01
1640024140.xml,"F(5, 95) = .497, p = .778",f,=,=,0.778,7.777792e-01
...,...,...,...,...,...,...
1640024140.xml,"F(10, 40) = 4.98, p < .001",f,=,<,0.001,1.112748e-04
1640024140.xml,"F(10, 40) = 2.53, p = .018",f,=,=,0.018,1.822316e-02
1640024140.xml,"F(1, 2) = 112, p = .009",f,=,=,0.009,8.810744e-03
1640024140.xml,"F(13, 26) = 8.81, p < .001",f,=,<,0.001,1.636754e-06


Title             First things first: Similar list length and ou...
Date Published                                           2015-07-01
Peer Review                                                    True
DOI                                              10.1037/xlm0000086
Author            ['Cortis, Cathleen', 'Dent, Kevin', 'Kennett, ...
Keywords          ['free recall', 'visuospatial memory', 'tactil...
Methodology               ['Empirical Study', 'Quantitative Study']
References                                                    114.0
Journal           Journal of Experimental Psychology: Learning, ...
Volume                                                         41.0
Issue                                                             4
Pages                                                     1179-1214
Name: 1640024140.xml, dtype: object

## Analysis 1: p-value distribution over time

In [10]:
# All reported p-values at or below .10, regardless of the reported sign
p_vals_clean_10 = [float(i) for i in df_stats['Reported p-value'] if float(i) <= 0.10]
len(p_vals_clean_10)

186406

In [11]:
# Same cutoff, restricted to p-values reported with "<" or "="
leq_mask = (df_stats['p-Sign'] == '<') | (df_stats['p-Sign'] == '=')
p_vals_clean_10_leq = [float(i) for i in df_stats.loc[leq_mask, 'Reported p-value'] if float(i) <= 0.10]
len(p_vals_clean_10_leq)

182198
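The two list comprehensions above can also be written as vectorized pandas operations. The sketch below uses a toy frame with the notebook's column names. One labeled difference from the original: `pd.to_numeric(..., errors="coerce")` turns unparseable entries into `NaN` and drops them silently, whereas `float(i)` would raise an error. Whether the real data contain such entries is not known from the notebook.

```python
import pandas as pd

# Toy stand-in for `df_stats`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "p-Sign": ["<", "=", ">", "<", "="],
        "Reported p-value": ["0.01", "0.05", "0.05", "0.2", "n/a"],
    }
)

# Convert to floats; anything unparseable becomes NaN (assumed to be acceptable here).
p = pd.to_numeric(toy["Reported p-value"], errors="coerce")

in_range = p <= 0.10                          # NaN compares as False, so it is excluded
lt_or_eq = toy["p-Sign"].isin(["<", "="])     # rows reported with "<" or "="

p_10 = p[in_range]                 # counterpart of p_vals_clean_10
p_10_leq = p[in_range & lt_or_eq]  # counterpart of p_vals_clean_10_leq

print(len(p_10), len(p_10_leq))
```

Keeping the result as a Series rather than a list also preserves the `File` index in the real data, which makes per-article or per-journal breakdowns easier later on.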

`TODO`: rewrite plot with `sns.displot` and FacetGrid

In [None]:
plt.figure(figsize=(8,5))
plt.xlim(0, 0.10)

# Histogram on a density scale with a KDE overlay
ax = sns.histplot(p_vals_clean_10_leq, bins=60, stat="density", kde=True,
                  line_kws={"lw": 3, "label": "KDE"})

# Annotations
ax.annotate(f'p=0.001,\n$n$={p_vals_clean_10_leq.count(0.001)}' +
            f' ({100*p_vals_clean_10_leq.count(0.001)/len(p_vals_clean_10_leq):.2f}%)',
            xy=(0.001, 60), xytext=(0.005, 175),
            arrowprops=dict(arrowstyle="->"),
            fontsize='large')

ax.annotate(f'p=0.01,\n$n$={p_vals_clean_10_leq.count(0.01)}' +
            f' ({100*p_vals_clean_10_leq.count(0.01)/len(p_vals_clean_10_leq):.2f}%)',
            xy=(0.01, 35), xytext=(0.015, 140),
            arrowprops=dict(arrowstyle="->"),
            fontsize='large')

ax.annotate(f'p=0.05,\n$n$={p_vals_clean_10_leq.count(0.05)}' +
            f' ({100*p_vals_clean_10_leq.count(0.05)/len(p_vals_clean_10_leq):.2f}%)',
            xy=(0.05, 36), xytext=(0.055, 141),
            arrowprops=dict(arrowstyle="->"),
            fontsize='large')

plt.title("Distribution & Kernel Density Estimation for P-Values (All Years, All Journals)")
ax.set_xlabel("p-Values", fontsize = 'x-large')
ax.set_ylabel("Density", fontsize = 'x-large')
plt.legend();
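The annotations in the plot report how many p-values fall exactly on .001, .01, and .05, and what share of the sample that is. A minimal sketch of computing the same numbers as a small table is shown below. It uses `np.isclose` rather than `list.count` so that the match does not depend on exact floating-point equality. The toy values and the names `thresholds` and `table` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for p_vals_clean_10_leq; the values are made up for illustration.
p = pd.Series([0.001, 0.001, 0.01, 0.05, 0.05, 0.05, 0.032, 0.07])

thresholds = [0.001, 0.01, 0.05]
rows = []
for t in thresholds:
    n = int(np.isclose(p, t).sum())  # values equal to t within floating-point tolerance
    rows.append({"p": t, "n": n, "percent": 100 * n / len(p)})

table = pd.DataFrame(rows).set_index("p")
print(table)
```

The resulting `n` and `percent` columns could feed the annotation strings directly, which avoids repeating the count and percentage expressions three times.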