# RELiC Dataset Analysis

* setup

## Analyze Descriptive Passages

* content
* time series
* topic modeling
* neural

## Analyze Critical Claims

* broad analysis
* more grounded coding

### Setup

In [1]:
# data
import numpy as np
import pandas as pd

# POS
import spacy

# nltk for wordnet and tokenization
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetError
from nltk import sent_tokenize
from nltk import word_tokenize

In [2]:
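# spaCy parser; NER is not needed for POS work, so it is
# disabled at load time and then removed from the pipeline
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.remove_pipe('ner')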

('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fb44039eb20>)

In [3]:
def read_as_df(filename):
    '''
    Read the .csv of descriptive passages into a pandas DataFrame.
    The columns in the file are:
    [blank],passage,book,left_claim,left_claim_keywords,
    right_claim,right_claim_keywords,claim_id,passage_id,
    passage_size,match_output
    '''
    # read data (the file's header row supplies the column names)
    df = pd.read_csv(filename)
    # keep only Left/Both matches by dropping rows marked 'Right'
    df = df[df['match_output'] != 'Right']
    # drop the unnamed first column, which only holds row numbers
    df = df.drop(columns=['Unnamed: 0'])
    return df

In [4]:
def five_number(data):
    '''
    Helper for reporting the 5-number summary of a list or array of numbers
    '''
    # convert to an array so plain lists also support .min()/.max()
    data = np.asarray(data)
    # calculate quartiles
    quartiles = np.percentile(data, [25, 50, 75])
    # calculate min/max
    data_min, data_max = data.min(), data.max()
    # print 5-number summary
    print('Min: %.3f' % data_min)
    print('Q1: %.3f' % quartiles[0])
    print('Median: %.3f' % quartiles[1])
    print('Q3: %.3f' % quartiles[2])
    print('Max: %.3f' % data_max)

In [5]:
descriptive_df = read_as_df('data/descriptive_claims_subset.csv')

In [6]:
descriptive_df.shape

(2383, 10)

### Analyze Descriptive Passages

### content work

* spaCy on each description
    * general counts of adjectives, prepositions, and pronouns, which can be reused for later analysis
* column view à la Bal, Tenen
    * words per unique thing (Tenen) -- in just these descriptive passages; aka Unique Clutter Distance
    * words per thing (Tenen) -- in just these descriptive passages (self-selecting sample); aka Clutter Distance
* specificity (Nelson 2020)
    * per descriptive passage, calculate specificity rating
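
The per-passage part-of-speech counts from the first bullet could be sketched roughly as follows. This is only a minimal sketch: `pos_counts` and `TAGS_OF_INTEREST` are hypothetical names, and the function assumes each token exposes a coarse tag through a `pos_` attribute, as spaCy tokens do. The hand-tagged stand-in tokens exist only so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

# Coarse Universal POS tags of interest: adjectives, adpositions
# (the tag that covers prepositions), and pronouns.
TAGS_OF_INTEREST = ("ADJ", "ADP", "PRON")

def pos_counts(tokens, tags=TAGS_OF_INTEREST):
    """Count how often each tag of interest occurs in one passage.

    Hypothetical helper: `tokens` is any iterable of objects with a
    `pos_` attribute, which a spaCy Doc satisfies.
    """
    counts = Counter(tok.pos_ for tok in tokens)
    # Report zero for absent tags so every passage yields the same keys.
    return {tag: counts.get(tag, 0) for tag in tags}

# Hand-tagged stand-in for a parsed passage such as
# "She sat in the old red chair" (tags assigned manually, not by a model).
example = [SimpleNamespace(pos_=p) for p in
           ["PRON", "VERB", "ADP", "DET", "ADJ", "ADJ", "NOUN"]]
example_counts = pos_counts(example)
```

With the notebook's own objects, something like `descriptive_df['passage'].apply(lambda p: pos_counts(nlp(p)))` should give one dictionary per passage, which could then be expanded into columns.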

#### Column View (Bal, Tenen)
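
A rough sketch of the two ratios, following the "words per thing" and "words per unique thing" wording in the notes above rather than any published implementation. The assumptions are that a "thing" is any token tagged `NOUN`, that punctuation does not count as a word, and that unique things are compared by lowercased lemma; `clutter_distances` is a hypothetical name.

```python
def clutter_distances(tagged):
    """Return (words per thing, words per unique thing) for one passage.

    `tagged` is a list of (lemma, pos) pairs. Assumption: a "thing" is
    any token tagged NOUN, and punctuation is not counted as a word.
    """
    words = [(lemma.lower(), pos) for lemma, pos in tagged if pos != "PUNCT"]
    things = [lemma for lemma, pos in words if pos == "NOUN"]
    if not things:
        # No things in the passage, so both ratios are undefined.
        return float("nan"), float("nan")
    return len(words) / len(things), len(words) / len(set(things))

# Hand-tagged stand-in for "The lamp stood by the lamp and the table."
example = [("the", "DET"), ("lamp", "NOUN"), ("stand", "VERB"), ("by", "ADP"),
           ("the", "DET"), ("lamp", "NOUN"), ("and", "CCONJ"), ("the", "DET"),
           ("table", "NOUN"), (".", "PUNCT")]
clutter, unique_clutter = clutter_distances(example)
```

In the notebook, the pairs could come from spaCy as `[(tok.lemma_, tok.pos_) for tok in nlp(passage)]`. Whether proper nouns should also count as things is left open here.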

#### Specificity (Nelson)

### time series

would need:
* number of fragments total
* number of descriptive fragments
* publish years for each work
* -> a pessimistic view of descriptive passages per work over time
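
Once those three inputs exist, the aggregation itself is small. The sketch below assumes one record per work with hypothetical keys `year`, `n_fragments`, and `n_descriptive`; none of these are columns in the current data, and the counts are made up for illustration.

```python
from collections import defaultdict

def descriptive_share_by_year(works):
    """Share of fragments that are descriptive, pooled per publication year.

    `works` is an iterable of dicts with hypothetical keys
    'year', 'n_fragments', and 'n_descriptive'.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [descriptive, all fragments]
    for work in works:
        totals[work["year"]][0] += work["n_descriptive"]
        totals[work["year"]][1] += work["n_fragments"]
    # Skip years with no fragments to avoid dividing by zero.
    return {year: desc / total
            for year, (desc, total) in sorted(totals.items()) if total > 0}

# Made-up counts, for illustration only.
example_works = [
    {"year": 1813, "n_fragments": 200, "n_descriptive": 10},
    {"year": 1813, "n_fragments": 300, "n_descriptive": 40},
    {"year": 1925, "n_fragments": 100, "n_descriptive": 25},
]
shares = descriptive_share_by_year(example_works)
```

Pooling counts within a year weights long works more heavily than averaging per-work shares would; either choice seems defensible, and in pandas the same step would be a `groupby` on the year column.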

### topic model

* what is each description/claim talking about

### neural

* embed each description with the Universal Sentence Encoder, then cluster the embeddings?
* looking for different authors creating similar descriptions ...
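
As a simpler stand-in for full clustering, the cross-author question could first be explored with pairwise cosine similarity. The sketch assumes embeddings have already been computed elsewhere (for example by a sentence encoder) and uses toy 3-dimensional vectors. The `(passage_id, author, embedding)` layout, the function names, and the 0.9 threshold are all illustrative assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_cross_author_pairs(items, threshold=0.9):
    """Return id pairs of passages by different authors with similar embeddings.

    `items` is a list of (passage_id, author, embedding) tuples; the layout
    and the default threshold are assumptions made for this sketch.
    """
    pairs = []
    for (id_a, author_a, vec_a), (id_b, author_b, vec_b) in combinations(items, 2):
        if author_a != author_b and cosine(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs

# Toy embeddings, for illustration only.
example = [
    ("p1", "Author A", [1.0, 0.0, 0.0]),
    ("p2", "Author B", [0.9, 0.1, 0.0]),
    ("p3", "Author A", [1.0, 0.0, 0.0]),
    ("p4", "Author C", [0.0, 1.0, 0.0]),
]
pairs = similar_cross_author_pairs(example)
```

The pairwise loop is quadratic in the number of passages, which should be manageable for a few thousand descriptions but would need an approximate nearest-neighbor index at larger scale.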

### Analyze Critical Claims

* number of subjects
* entities
* repeats?
* mentioning other criticisms?
* more grounded-coding -- who is close-reading, what else fits, etc.