In [1]:
library(dplyr)
library(data.table)
options(repr.matrix.max.rows=600, repr.matrix.max.cols=200)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, first, last




This is my attempt at a collocation analysis similar to the one in Yipu and Sanders's paper analysing perspective markers in Mandarin Chinese. It uses the **mclm** package, which is completely undocumented.

First, find all the relevant files of your corpus:


In [2]:
# `pattern` is a regular expression (not a glob): escape the dot and anchor at the end
advice_files <- list.files(path='../data', pattern='[0-9]+\\.txt$', recursive=TRUE, full.names=TRUE)

question_files <- list.files(path='../data', pattern='context\\.txt$', recursive=TRUE, full.names=TRUE)
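
The `pattern` argument of `list.files` is matched as a regular expression against file names, so it is worth checking on a throwaway directory before pointing it at the real corpus. The sketch below assumes a made-up layout (`thread1/1.txt`, `thread1/context.txt`, and so on); the real layout under `../data` may differ.

```r
# Build a throwaway corpus layout (hypothetical file names) to test the patterns.
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(file.path(corpus_dir, "thread1"), recursive = TRUE, showWarnings = FALSE)
demo_files <- c("thread1/1.txt", "thread1/12.txt", "thread1/context.txt", "thread1/notes.md")
for (f in demo_files) writeLines("placeholder", file.path(corpus_dir, f))

# Escape the dot and anchor the end so only names ending in "<digits>.txt" match.
advice_demo <- list.files(corpus_dir, pattern = "[0-9]+\\.txt$", recursive = TRUE)
# Same idea for the question files.
question_demo <- list.files(corpus_dir, pattern = "context\\.txt$", recursive = TRUE)

print(advice_demo)
print(question_demo)
```

If either vector comes back empty or contains unexpected files, the pattern needs adjusting before the co-occurrence counts can be trusted.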

Now do a super simple surface collocation frequency analysis using the *surf_cooc* function from the *mclm* library. For now, I'll be doing it on the words hypothesized to be indicators of subjectivity versus objectivity: *since* vs *because*, *for that reason* and *as a result*. This is based on the following from the paper *Unifying dimensions in coherence relations* by Ted Sanders:

> ...English since seems to have a preference for epistemic/subjective relations, whereas for that reason and as a result seem to be cue phrases for typical objective causal relations

and Marta Andersson's thesis:

> This has been argued particularly about backward causal relations, where because can be used in every domain; however, since commonly occurs in subjective epistemic and speech-act relations, while because has been found to prevail in objective contexts(Zufferey & Cartoni, 2012).

She adds the caveat, though, that the lexicon of English is not as constrained as that of, say, Dutch (perhaps?).

Anyway, now to do some preliminary surface collocations with the questions.

# Question analysis of since vs because 

In [3]:
library(mclm)
# Split tokens on whitespace (\\s+) or on a literal <br><br>
q_since_cooc <- surf_cooc(question_files, re_node="since", re_token_splitter="\\s+|<br><br>")
q_because_cooc <- surf_cooc(question_files, re_node="because", re_token_splitter="\\s+|<br><br>")

Loading required package: ca


The above can be improved with the other optional arguments to *surf_cooc*:
* re_node 
    can be a regex, so perhaps it could match more than just the bare word?
* re_boundary
    seems important, but I don't know how to use it yet.
* re_drop_token
    is not used above; for now the pesky <br> is removed by the token splitter instead. Perhaps drop more tokens with this?

Now to analyse them a little more:
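
To make the idea of a surface co-occurrence window concrete, here is a rough base R approximation. It is not mclm's implementation: the window of 3 tokens on each side, the lowercasing, and the toy text are all assumptions made for illustration, and `window_tokens` is a hypothetical helper.

```r
# Toy text standing in for one question file (made-up content).
text <- "I moved out since I was unhappy.<br><br>I stayed because I was broke."

# Split on whitespace or the literal "<br><br>", then lowercase and drop empty strings.
tokens <- tolower(unlist(strsplit(text, "\\s+|<br><br>", perl = TRUE)))
tokens <- tokens[nzchar(tokens)]

# Collect the tokens within `window` positions of every match of `node_regex`,
# excluding the node positions themselves.
window_tokens <- function(tokens, node_regex, window = 3) {
  node_pos <- grep(node_regex, tokens, perl = TRUE)
  near <- unlist(lapply(node_pos, function(i) {
    max(1, i - window):min(length(tokens), i + window)
  }))
  tokens[sort(setdiff(unique(near), node_pos))]
}

target <- window_tokens(tokens, "^since$")
print(sort(table(target), decreasing = TRUE))

# An unanchored regex also matches longer words that merely contain the node.
print(grepl("since", c("since", "sincerely")))
```

Whether mclm matches `re_node` anchored or unanchored is something I have not verified; if it is unanchored, `re_node="^since$"` would be the safer choice.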

In [4]:
q_since_cooc

$target_freqlist
Frequency list (number of types: 325, number of tokens: 588)
  word frequency
1    i        24
2  was        19
3    a        17
4  and        12
5   to        12
6   my        10
...

$ref_freqlist
Frequency list (number of types: 10844, number of tokens: 90280)
  word frequency
1    i      3081
2   to      2972
3  and      2697
4  the      2135
5    a      2072
6   my      1613
...

$target_n
[1] 588

$ref_n
[1] 90280

attr(,"class")
[1] "cooc_info"

In [5]:
q_since_target_freqlist <- as.data.frame(q_since_cooc['target_freqlist'])
q_since_ref_freqlist <- as.data.frame(q_since_cooc['ref_freqlist'])

q_because_target_freqlist <- as.data.frame(q_because_cooc['target_freqlist'])
q_because_ref_freqlist <- as.data.frame(q_because_cooc['ref_freqlist'])

In [6]:
q_since_ref_freqlist

word,frequency
i,3081
to,2972
and,2697
the,2135
a,2072
my,1613
of,1193
is,1032
for,955
that,941


In [7]:
assoc_scores(q_since_cooc, measures = c("G"))

word,a,b,c,d,dir,G
a,17,571,2072,88208,1,0.85575754
about,6,582,421,89859,1,2.87392365
an,3,585,183,90097,1,1.91005584
and,12,576,2697,87583,-1,2.02879531
are,3,585,400,89880,1,0.0568861
been,9,579,215,90065,1,18.12460346
don't,3,585,237,90043,1,1.06889652
especially,5,583,28,90252,1,22.73558146
ever,5,583,34,90246,1,21.01360478
for,4,584,955,89325,-1,0.91145742
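
The columns `a`, `b`, `c`, `d` look like the cells of a 2x2 contingency table: the word near the node, other tokens near the node, the word elsewhere, and other tokens elsewhere. Since mclm does not document its formula here, the sketch below assumes that `G` is the standard log-likelihood ratio and that `dir` is the sign of the difference between the two relative frequencies. `g_score` and `direction` are hypothetical helpers, not mclm functions. With those assumptions the "been" row comes out at roughly the reported 18.12.

```r
# Log-likelihood ratio (G) for a 2x2 table, assuming the usual definition
# G = 2 * sum(observed * log(observed / expected)).
#   o_a = word near node,   o_b = other tokens near node,
#   o_c = word elsewhere,   o_d = other tokens elsewhere.
g_score <- function(o_a, o_b, o_c, o_d) {
  observed <- c(o_a, o_b, o_c, o_d)
  n <- sum(observed)
  expected <- c((o_a + o_b) * (o_a + o_c), (o_a + o_b) * (o_b + o_d),
                (o_c + o_d) * (o_a + o_c), (o_c + o_d) * (o_b + o_d)) / n
  # Cells with zero observations contribute nothing to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Direction (assumed): +1 if the word is relatively more frequent near the node.
direction <- function(o_a, o_b, o_c, o_d) {
  if (o_a / (o_a + o_b) >= o_c / (o_c + o_d)) 1 else -1
}

print(g_score(9, 579, 215, 90065))      # the "been" row
print(direction(9, 579, 215, 90065))
print(direction(12, 576, 2697, 87583))  # the "and" row
```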


In [8]:
q_because_assoc <- data.frame(assoc_scores(q_because_cooc, measures = c("G")))
q_because_assoc_att <- q_because_assoc[which(q_because_assoc$dir==1 & q_because_assoc$G>1),]
q_because_assoc_rep <- q_because_assoc[which(q_because_assoc$dir==-1 & q_because_assoc$G>1),]

setDT(q_because_assoc_att, keep.rownames="word")
setDT(q_because_assoc_rep, keep.rownames="word")
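
A cutoff of `G>1` is quite permissive. Assuming `G` is a log-likelihood ratio on a 2x2 table, it is approximately chi-squared distributed with 1 degree of freedom, so the usual critical values give a more principled threshold. The sketch below uses only base R and a few G values copied from the tables in this notebook.

```r
# Critical values of the chi-squared distribution with 1 degree of freedom.
alpha <- c(0.05, 0.01, 0.001)
critical_G <- qchisq(1 - alpha, df = 1)
names(critical_G) <- alpha
print(round(critical_G, 2))

# Convert a few G values (copied from the "since" question tables) into p-values.
G_values <- c(especially = 22.735581, then = 7.666821, much = 3.191445, are = 0.0568861)
p_values <- pchisq(G_values, df = 1, lower.tail = FALSE)
print(signif(p_values, 3))
```

With hundreds of word types tested at once, some will pass any single threshold by chance, and the approximation is shaky for counts as low as 3, so these p-values are best read as a rough ranking aid.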

In [9]:
q_because_assoc_att[order(q_because_assoc_att$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
this,30,1630,507.0,88523,1,27.642316
throwaway,3,1657,1e-05,89030,1,24.008831
i,90,1570,3015.0,86015,1,17.475043
was,25,1635,537.0,88493,1,15.503958
it's,11,1649,148.0,88882,1,13.539352
of,40,1620,1158.0,87872,1,12.427619
she,31,1629,841.0,88189,1,11.485979
pooped,3,1657,9.0,89021,1,10.845007
they,17,1643,364.0,88666,1,10.576475
"well,",3,1657,10.0,89020,1,10.332628


In [10]:
q_since_assoc <- data.frame(assoc_scores(q_since_cooc, measures = c("G")))
q_since_assoc_att <- q_since_assoc[which(q_since_assoc$dir==1 & q_since_assoc$G>1),]
q_since_assoc_rep <- q_since_assoc[which(q_since_assoc$dir==-1 & q_since_assoc$G>1),]

setDT(q_since_assoc_att, keep.rownames="word")
setDT(q_since_assoc_rep, keep.rownames="word")

In [11]:
q_since_assoc_att[order(q_since_assoc_att$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
was,19,569,543,89737,1,32.938291
especially,5,583,28,90252,1,22.735581
ever,5,583,34,90246,1,21.013605
i've,7,581,101,90179,1,20.102511
been,9,579,215,90065,1,18.124603
then,4,584,101,90179,1,7.666821
having,4,584,111,90169,1,7.054575
they,6,582,375,89905,1,3.658227
much,3,585,132,90148,1,3.191445
only,3,585,134,90146,1,3.128094


Is there anything to be seen in the two tables, which show the relative attraction of lexical items to contexts containing *since* vs *because*?

What about words that are repelled by *because* and attracted by *since* (and vice versa)?
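
Crossing the attracted and repelled lists is an inner join on the word. As a dependency-free illustration, the sketch below does the same thing with base R's `merge` on two toy tables whose values are copied from this notebook's question results; the `.because` and `.since` suffixes are my own naming choice.

```r
# Words attracted by "because" (subset of rows, values copied from the notebook).
because_att <- data.frame(
  word = c("this", "of", "she"),
  G = c(27.642316, 12.427619, 11.485979)
)
# Words repelled by "since" (subset of rows, values copied from the notebook).
since_rep <- data.frame(
  word = c("of", "and", "to"),
  G = c(1.138467, 2.028795, 3.314391)
)

# Keep only the words present in both tables; suffixes label the source table.
crossed <- merge(because_att, since_rep, by = "word", suffixes = c(".because", ".since"))
print(crossed)
```

Named suffixes make the joined columns easier to read than the default `.x` and `.y`; `inner_join` accepts a `suffix` argument for the same purpose.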

## Words attracted by because and repelled by since

In [12]:
inner_join(q_because_assoc_att, q_since_assoc_rep, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
of,40,1620,1158,87872,1,12.42762,5,583,1193,89087,-1,1.138467


## Words attracted by since and repelled by because

In [13]:
inner_join(q_because_assoc_rep, q_since_assoc_att, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
about,3,1657,424,88606,-1,3.955862,6,582,421,89859,1,2.873924


## Words attracted by because and since

In [14]:
inner_join(q_because_assoc_att, q_since_assoc_att, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
especially,3,1657,30,89000,1,5.009743,5,583,28,90252,1,22.735581
his,9,1651,325,88705,1,1.218602,4,584,330,89950,1,1.263316
she,31,1629,841,88189,1,11.485979,10,578,862,89418,1,2.78478
they,17,1643,364,88666,1,10.576475,6,582,375,89905,1,3.658227
was,25,1635,537,88493,1,15.503958,19,569,543,89737,1,32.938291


## Words repelled by because and since

In [15]:
inner_join(q_because_assoc_rep, q_since_assoc_rep, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
and,17,1643,2692,86338,-1,29.839859,12,576,2697,87583,-1,2.028795
to,39,1621,2945,86085,-1,5.203746,12,576,2972,87308,-1,3.314391


# Analyse advice

In [16]:
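# re_drop_token is only set for "since". Tokens are split on whitespace, so "^Reply\\s\\d"
# probably never matches a single token ("reply" still shows up in the frequency tables below).
a_since_cooc <- surf_cooc(advice_files, re_node="since", re_token_splitter="\\s+|-+", re_drop_token="^Reply\\s\\d")
a_because_cooc <- surf_cooc(advice_files, re_node="because", re_token_splitter="\\s+|-+")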
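
A quick base R check of how a drop pattern interacts with the splitter. This assumes that the drop pattern is tested against individual tokens after splitting, which the argument name suggests but which I have not confirmed for mclm. The sample line is invented; the real advice files may mark replies differently.

```r
# Made-up line imitating a reply header followed by some advice text.
line <- "Reply 1 ---------- I think you should stay because it's cheaper."

# Same splitter as in the advice cell: whitespace or runs of hyphens.
tokens <- unlist(strsplit(line, "\\s+|-+", perl = TRUE))
tokens <- tokens[nzchar(tokens)]
print(tokens)

# A pattern that requires whitespace inside a token cannot match any token
# produced by a whitespace splitter.
print(any(grepl("^Reply\\s\\d", tokens, perl = TRUE)))

# Hypothetical alternative: drop the "Reply" marker token itself, ignoring case.
kept <- tokens[!grepl("^reply$", tokens, ignore.case = TRUE, perl = TRUE)]
print(kept)
```

Note that the reply number ("1" here) would still survive as a token with this alternative, so a line-level filter may be a better fit if the package offers one.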

In [17]:
a_since_cooc

$target_freqlist
Frequency list (number of types: 348, number of tokens: 819)
  word frequency
1 they        26
2    ,        20
3    .        20
4    i        18
5  the        18
6   he        17
...

$ref_freqlist
Frequency list (number of types: 12055, number of tokens: 265426)
  word frequency
1    .     13032
2    ,      7333
3   to      6784
4    i      6193
5  the      6154
6  and      6111
...

$target_n
[1] 819

$ref_n
[1] 265426

attr(,"class")
[1] "cooc_info"

In [18]:
a_because_cooc

$target_freqlist
Frequency list (number of types: 845, number of tokens: 2811)
  word frequency
1   it        96
2    i        91
3    ,        67
4 they        63
5   of        57
6  the        55
...

$ref_freqlist
Frequency list (number of types: 12009, number of tokens: 263101)
  word frequency
1    .     13022
2    ,      7286
3   to      6751
4    i      6120
5  the      6117
6  and      6107
...

$target_n
[1] 2811

$ref_n
[1] 263101

attr(,"class")
[1] "cooc_info"

In [19]:
a_since_target_freqlist <- as.data.frame(a_since_cooc['target_freqlist'])
a_since_ref_freqlist <- as.data.frame(a_since_cooc['ref_freqlist'])

a_because_target_freqlist <- as.data.frame(a_because_cooc['target_freqlist'])
a_because_ref_freqlist <- as.data.frame(a_because_cooc['ref_freqlist'])

In [20]:
a_because_assoc <- data.frame(assoc_scores(a_because_cooc, measures = c("G")))
a_because_assoc_att <- a_because_assoc[which(a_because_assoc$dir==1 & a_because_assoc$G>1),]
a_because_assoc_rep <- a_because_assoc[which(a_because_assoc$dir==-1 & a_because_assoc$G>1),]

setDT(a_because_assoc_att, keep.rownames="word")
setDT(a_because_assoc_rep, keep.rownames="word")

In [21]:
a_since_assoc <- data.frame(assoc_scores(a_since_cooc, measures = c("G")))
a_since_assoc_att <- a_since_assoc[which(a_since_assoc$dir==1 & a_since_assoc$G>1),]
a_since_assoc_rep <- a_since_assoc[which(a_since_assoc$dir==-1 & a_since_assoc$G>1),]

setDT(a_since_assoc_att, keep.rownames="word")
setDT(a_since_assoc_rep, keep.rownames="word")
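
The same filter-and-split steps are repeated for every node word and corpus, so they could be wrapped in a small helper. The sketch below uses base R only, and `split_by_direction` is a hypothetical name. It assumes a data frame with `dir` and `G` columns and the words as row names, as in the tables in this notebook.

```r
# Split a table of association scores into attracted and repelled words.
split_by_direction <- function(scores, min_G = 1) {
  # Move the row names into an explicit `word` column.
  scores$word <- rownames(scores)
  rownames(scores) <- NULL
  keep <- scores$G > min_G
  list(
    attracted = scores[keep & scores$dir == 1, , drop = FALSE],
    repelled  = scores[keep & scores$dir == -1, , drop = FALSE]
  )
}

# Toy scores with values copied from the question table for "since".
toy <- data.frame(
  a = c(9, 12, 3), b = c(579, 576, 585),
  c = c(215, 2697, 400), d = c(90065, 87583, 89880),
  dir = c(1, -1, 1), G = c(18.12460346, 2.02879531, 0.0568861),
  row.names = c("been", "and", "are")
)

parts <- split_by_direction(toy)
print(parts$attracted$word)
print(parts$repelled$word)
```

Making `min_G` an argument also keeps the threshold in one place if it is later raised above 1.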

In [22]:
a_because_assoc_att[order(a_because_assoc_att$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
they,63,2748,1787.0,261314,1,62.236701
was,50,2761,1491.0,261610,1,45.889908
it,96,2715,4256.0,258845,1,42.745102
just,38,2773,1046.0,262055,1,38.944311
icky,3,2808,1e-05,263101,1,27.300648
n’t,25,2786,704.0,262397,1,24.78015
of,57,2754,2788.0,260313,1,19.559441
she,35,2776,1434.0,261667,1,18.342001
apparently,4,2807,14.0,263087,1,17.630367
's,37,2774,1696.0,261405,1,14.987576


In [23]:
a_since_assoc_att[order(a_since_assoc_att$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
were,14,805,354,265072,1,45.346645
especially,8,811,81,265345,1,39.312429
they,26,793,1824,263602,1,39.119745
memorized,3,816,4,265422,1,25.179341
he,17,802,1245,264181,1,24.327117
ever,6,813,117,265309,1,22.221253
beginning,3,816,11,265415,1,20.234794
was,16,803,1525,263901,1,16.647897
number,3,816,22,265404,1,16.504315
born,4,815,63,265363,1,16.375073


In [28]:
a_because_assoc_rep[order(a_because_assoc_rep$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
.,30,2781,13022,250079,-1,129.648159
and,15,2796,6107,256994,-1,56.885516
reply,5,2806,4043,259058,-1,54.989616
with,3,2808,1798,261303,-1,21.227052
to,46,2765,6751,256350,-1,11.020277
if,8,2803,1684,261417,-1,6.992825
as,5,2806,1198,261903,-1,6.170535
for,15,2796,2457,260644,-1,5.706124
on,7,2804,1329,261772,-1,4.47619
when,6,2805,1055,262046,-1,2.960714


In [29]:
a_since_assoc_rep[order(a_since_assoc_rep$G, decreasing=TRUE),]

word,a,b,c,d,dir,G
.,20,799,13032,252394,-1,12.97347
that,3,816,2945,262481,-1,5.557694
reply,6,813,4042,261384,-1,4.2046
and,11,808,6111,259315,-1,3.922098
to,13,806,6784,258642,-1,3.549093
but,3,816,1744,263682,-1,1.260389
be,3,816,1720,263706,-1,1.195153
for,5,814,2467,262959,-1,1.026866


## Words attracted by because and repelled by since

In [24]:
inner_join(a_because_assoc_att, a_since_assoc_rep, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
that,39,2772,2909,260192,1,1.866768,3,816,2945,262481,-1,5.557694


## Words attracted by since and repelled by because

In [25]:
inner_join(a_because_assoc_rep, a_since_assoc_att, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y


## Words attracted by because and since

In [26]:
inner_join(a_because_assoc_att, a_since_assoc_att, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
're,11,2800,397,262704,1,7.351142,4,815,404,265022,1,3.810851
's,37,2774,1696,261405,1,14.987576,12,807,1721,263705,1,6.215821
are,29,2782,1561,261540,1,7.399917,11,808,1579,263847,1,5.682585
have,27,2784,1913,261188,1,1.904396,13,806,1927,263499,1,6.265564
he,28,2783,1234,261867,1,12.449256,17,802,1245,264181,1,24.327117
her,22,2789,1315,261786,1,3.805847,7,812,1330,264096,1,1.687467
only,7,2804,346,262755,1,2.304713,5,814,348,265078,1,7.504832
our,7,2804,433,262668,1,1.039955,3,816,437,264989,1,1.492051
she,35,2776,1434,261667,1,18.342001,9,810,1460,263966,1,3.477685
there,14,2797,707,262394,1,4.34071,5,814,716,264710,1,2.584925


## Words repelled by because and since

In [27]:
inner_join(a_because_assoc_rep, a_since_assoc_rep, by="word")

word,a.x,b.x,c.x,d.x,dir.x,G.x,a.y,b.y,c.y,d.y,dir.y,G.y
.,30,2781,13022,250079,-1,129.648159,20,799,13032,252394,-1,12.97347
and,15,2796,6107,256994,-1,56.885516,11,808,6111,259315,-1,3.922098
be,13,2798,1710,261391,-1,1.685562,3,816,1720,263706,-1,1.195153
but,13,2798,1734,261367,-1,1.835646,3,816,1744,263682,-1,1.260389
for,15,2796,2457,260644,-1,5.706124,5,814,2467,262959,-1,1.026866
reply,5,2806,4043,259058,-1,54.989616,6,813,4042,261384,-1,4.2046
to,46,2765,6751,256350,-1,11.020277,13,806,6784,258642,-1,3.549093
