# Project Psyched: A Closer Look Into Reproducibility In Psychological Research

## Statistics Mining Script: Part 2 - Test Statistics 
This script is set up for ProQuest TDM Studio's corpus of Psychology texts.

Author: Yuyang Zhong (2020). This work is licensed under a [Creative Commons BY-NC-SA 4.0 International
License][cc-by].

![CC BY-NC-SA 4.0][cc-by-shield]

[cc-by]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-shield]: https://img.shields.io/badge/license-CC--BY--NC--SA%204.0-blue

#### Set Up & Imports

In [1]:
import xml.etree.ElementTree as ET
import re

#### File Path & File

In [2]:
path = "../articles/samples/"

In [3]:
file = "1011297999.txt"

In [4]:
root = ET.parse(path+file).getroot()

#### Cleaning up full text

In [5]:
raw_text = root.find('TextInfo').find('PreformattedData').find('PsycArticles').text
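The chained `.find()` calls above raise an `AttributeError` as soon as one element is missing. A minimal sketch of a more forgiving lookup, assuming the same `TextInfo/PreformattedData/PsycArticles` nesting; the helper name `get_full_text` and the inline sample record are hypothetical and only serve as illustration:

```python
import xml.etree.ElementTree as ET


def get_full_text(root, path="TextInfo/PreformattedData/PsycArticles"):
    """Return the text at `path`, or None if any element on the path is missing."""
    # findtext() walks the whole path in one call and returns None when
    # nothing matches, instead of failing halfway through a chain of find().
    return root.findtext(path)


# Hypothetical record that mimics the nesting used in this notebook.
sample = ET.fromstring(
    "<Record><TextInfo><PreformattedData>"
    "<PsycArticles>F(1, 40) = 7.90, p &lt; .01</PsycArticles>"
    "</PreformattedData></TextInfo></Record>"
)

print(get_full_text(sample))
print(get_full_text(ET.fromstring("<Record/>")))
```

Returning `None` lets a batch run skip records without full text rather than stop on the first one.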

#### Strip HTML tags & newline characters

In [6]:
raw_text = re.sub(r'<[^>]*>', '', raw_text)

In [7]:
raw_text = re.sub(r'\n\s*', '   ', raw_text)

In [8]:
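# Map each literal character to the HTML entity that encodes it.
# "&amp;" comes last so that text such as "&amp;lt;" is decoded only once.
html_symbols = {
    '"': r'&quot;',
    "'": r'&apos;',
    ">": r'&gt;',
    "<": r'&lt;',
    "&": r'&amp;',
}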

In [9]:
for symbol, entity in html_symbols.items():
    raw_text = raw_text.replace(entity, symbol)
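The entity table above covers five named entities. As a possible alternative, the standard library's `html.unescape` decodes all named and numeric entities in a single pass. The sketch below mirrors the cleaning steps of this section under that assumption; the function name `clean_text` and the sample string are hypothetical:

```python
import html
import re


def clean_text(raw):
    """Strip tags, replace newlines with three spaces, then decode entities."""
    no_tags = re.sub(r'<[^>]*>', '', raw)          # same tag pattern as above
    one_line = re.sub(r'\n\s*', '   ', no_tags)    # same newline handling as above
    # Decoding happens last so that an escaped "&lt;p&gt;" in the running
    # text is not mistaken for a real tag and removed.
    return html.unescape(one_line)


print(clean_text("<p>t(40) = 2.68,\n  p &lt; .01 &amp; more</p>"))
```

Because `html.unescape` works in one pass, a doubly escaped sequence such as `&amp;lt;` becomes `&lt;` rather than `<`.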

In [10]:
raw_text

"   Journal of Personality and Social Psychology   Attitudes and Social Cognition   0022-3514   1939-1315   American Psychological Association   psp_103_1_38   10.1037/a0028124   2012-10021-001   Why and When Peer Prediction Is Superior to Self-Prediction: The Weight Given to Future Aspiration Versus Past Achievement   Eliot R.   Smith   Editor   Erik G.   Helzer   David   Dunning   Department of Psychology, Cornell University   This research was supported financially by National Science Foundation Grant 0745806, awarded to David Dunning. We thank members of the Dunning Self and Social Insight Lab for their comments. We also thank Jack Cao, Dan Connolly, Polina Minkin, Mary Panos, and Carolyn Spiro for their assistance with data collection.   Erik G. Helzer, Department of Psychology, Cornell University, Ithaca, NY 14853-7601   egh42@cornell.edu   April   16, 2012   July   2012   103   1   38   53   September   7, 2011   February   21, 2012   February   27, 2012   2012   American Psycho

#### F-Statistics, numeric p-values

In [29]:
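# Match "F(df1, df2) = value, p < value"; [<>=] accepts only comparison operators
# (the earlier class [\<|\>|\=] also accepted a literal "|").
F_stats = re.findall(r'Fs?\s*\(\s*\d+\s*,\s*\d+\s*\)\s*[<>=]\s*\d*\.?\d*\s*,\s*p\s*[<>=]\s*\d*\.\d+',
                     raw_text)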
F_stats

['F(1, 40) = 7.90, p < .01',
 'F(1, 40) = 8.85, p < .01',
 'F(1, 113) = 66.53, p < .0001',
 'F(1, 113) = 31.60, p < .0001',
 'F(1, 113) = 11.89, p < .001',
 'F(1, 114) = 31.26, p < .0001',
 'F(1, 114) = 78.84, p < .0001',
 'F(1, 114) = 7.49, p < .01',
 'F(1, 114) = 4.56, p < .04',
 'F(1, 114) = 3.77, p = .05',
 'F(1, 114) = 32.83, p < .0001',
 'F(1, 114) = 6.50, p < .02']
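The matches above are plain strings. For later checks it may help to split each match into its parts. A minimal sketch using named groups, assuming the same reporting format as the matches listed above; `F_PATTERN`, `parse_f_stats`, and the dictionary keys are hypothetical names introduced here:

```python
import re

# Hypothetical variant of the F-statistic pattern with named groups.
F_PATTERN = re.compile(
    r'Fs?\s*\(\s*(?P<df1>\d+)\s*,\s*(?P<df2>\d+)\s*\)\s*[<>=]\s*'
    r'(?P<F>\d*\.?\d+)\s*,\s*p\s*(?P<op>[<>=])\s*(?P<p>\d*\.\d+)'
)


def parse_f_stats(text):
    """Return one dict per reported F-test found in `text`."""
    return [
        {
            "df1": int(m["df1"]),   # numerator degrees of freedom
            "df2": int(m["df2"]),   # denominator degrees of freedom
            "F": float(m["F"]),
            "p_op": m["op"],        # how p was reported: "<", ">" or "="
            "p": float(m["p"]),
        }
        for m in F_PATTERN.finditer(text)
    ]


sample = "F(1, 40) = 7.90, p < .01 and later F(1, 114) = 3.77, p = .05"
for row in parse_f_stats(sample):
    print(row)
```

Keeping the comparison operator separate from the number preserves the difference between an exact p-value (`p = .05`) and a threshold (`p < .01`).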

#### F-Statistics, written non-significance

In [28]:
# Match "F(df1, df2) = value, ns" (also "n.s."); [<>=] accepts only comparison operators.
F_stats_ns = re.findall(r'Fs?\s*\(\s*\d+\s*,\s*\d+\s*\)\s*[<>=]\s*\d*\.?\d*\s*,\s*n\.?s\.?',
                        raw_text)
F_stats_ns

['F(1, 114) = 0.02, ns', 'F(1, 114) = 1.54, ns']

#### T-Scores, numeric p-values

In [30]:
# Match "t(df) = value, p < value"; [−-] accepts the Unicode minus sign (U+2212) or an ASCII hyphen.
t_scores = re.findall(r't\s*\(\s*\d*\s*,?\s*\d+\s*\)\s*[<>=]\s*[−-]?\s*\d*\.?\d*\s*,\s*p\s*[<>=]\s*\d?\.\d+',
                      raw_text)
t_scores

['t(40) = 2.68, p = .01',
 't(40) = 2.32, p = .02',
 't(102) = 2.66, p < .01',
 't(103) = 2.19, p < .04',
 't(101) = 2.49, p < .02',
 't(101) = 2.05, p < .05',
 't(116) = 1.97, p = .05',
 't(116) = 3.05, p < .005',
 't(116) = 1.61, p < .12']
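The t-score pattern allows a Unicode minus sign (U+2212), which Python's `float()` does not accept. A minimal sketch of converting matches to numbers, assuming a single degrees-of-freedom value in the parentheses (a simplification of the pattern above); `T_PATTERN` and `parse_t_scores` are hypothetical names:

```python
import re

# Hypothetical variant of the t-score pattern with named groups.
T_PATTERN = re.compile(
    r't\s*\(\s*(?P<df>\d+)\s*\)\s*[<>=]\s*(?P<t>[−-]?\s*\d*\.?\d+)\s*,'
    r'\s*p\s*(?P<op>[<>=])\s*(?P<p>\d?\.\d+)'
)


def parse_t_scores(text):
    """Return one dict per reported t-test found in `text`."""
    results = []
    for m in T_PATTERN.finditer(text):
        # Normalise U+2212 to an ASCII hyphen and drop any space after the
        # sign, so that float() can read the value.
        t_value = re.sub(r'\s+', '', m["t"].replace("−", "-"))
        results.append({
            "df": int(m["df"]),
            "t": float(t_value),
            "p_op": m["op"],
            "p": float(m["p"]),
        })
    return results


sample = "t(40) = −2.68, p = .01; t(116) = 1.61, p < .12"
for row in parse_t_scores(sample):
    print(row)
```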

#### T-Scores, written non-significance

In [31]:
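# Match "t(df) = value, ns" (also "n.s."); [−-] accepts the Unicode minus sign (U+2212) or an ASCII hyphen.
t_scores_ns = re.findall(r't\s*\(\s*\d*\s*,?\s*\d+\s*\)\s*[<>=]\s*[−-]?\s*\d*\.?\d*\s*,\s*n\.?s\.?',
                         raw_text)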
t_scores_ns

['t(103) = 0.89, ns']