# Filter evaluations by precision-recall analysis

## 1 Setup

Flags and settings.

In [1]:
SAMPLE_SIZE = 100

Imports and database setup.

In [2]:
from random import sample
from textwrap import indent, fill

import numpy as np

%cd -q ..
from brainscopypaste.conf import settings
%cd -q notebooks
from brainscopypaste.db import Cluster, Quote, Substitution
from brainscopypaste.utils import init_db, session_scope, langdetect
from brainscopypaste.filter import filter_quote_offset
from brainscopypaste.mine import Model, Time, Source, Past, Durl
engine = init_db()

## 2 Evaluate language filtering

Either run this cell to generate a new selection...

In [3]:
#with session_scope() as session:
#    quote_ids = sample([id for (id,)
#                        in session.query(Quote.id).filter(Quote.filtered == False)],
#                       SAMPLE_SIZE)

... or this one to use the previous selection (which has already been coded).

In [4]:
quote_ids = [1956761, 1987670, 2013139, 1762813, 514301, 203909, 2369165, 1912805, 1534840,
             464066, 152061, 720930, 2331572, 268789, 457062, 1219889, 790024, 2309594,
             2712468, 1282691, 730591, 1840441, 382029, 376767, 1058965, 2423390, 1013911,
             1558617, 2491471, 777666, 648372, 2123449, 630199, 1479403, 488899, 2501461,
             1453754, 1190828, 239246, 1981112, 491273, 2554944, 80036, 1433989, 1067430,
             67526, 937337, 952234, 110443, 123775, 1454732, 1600502, 1877933, 1746630,
             1387646, 654399, 1896017, 254697, 72175, 657151, 2296113, 2692900, 2340689,
             1253031, 2671619, 1299967, 2323288, 321528, 2391467, 2704508, 2596163, 2044260,
             420936, 448593, 966688, 252997, 1573373, 328472, 1176294, 494275, 1110521, 74025,
             647887, 2469274, 1748974, 209377, 116114, 2657880, 923507, 2530004, 200625, 671881,
             209809, 2651830, 1105095, 120217, 16153, 95646, 1162318, 1267527]

In [5]:
with session_scope() as session:
    strings = [session.query(Quote).get(id).string for id in quote_ids]
strings_langs = [(string, langdetect(string)) for string in strings]

In [6]:
print("Over a sample of {}, {} quotes are rejected because their detected "
      "language is not English"
      .format(SAMPLE_SIZE, np.sum([lang != 'en' for _, lang in strings_langs])))

Over a sample of 100, 17 quotes are rejected because their detected language is not English


Here are the individual strings and their detected languages.

In [7]:
for i, (string, lang) in enumerate(strings_langs):
    title = ' {} / {}'.format(i + 1, SAMPLE_SIZE)
    print('-' * (80 - len(title)) + title)
    print('Language:', lang)
    print()
    print(indent(fill(string), ' ' * 5))
    print()

------------------------------------------------------------------------ 1 / 100
Language: en

     some states swear in their new governors in early december a month
     after the election and parliamentary democracies produce new prime
     ministers within a day or two of the election

------------------------------------------------------------------------ 2 / 100
Language: en

     they want to set an example for harlow nicole and joel are doing this
     to seal their love legally they want the ceremony to coincide with
     harlow's first birthday and be one big joyous occasion for everybody

------------------------------------------------------------------------ 3 / 100
Language: en

     i cannot believe what i have seen in the last 36 hours i have seen
     dead bodies blood everywhere and only heard gunshots

------------------------------------------------------------------------ 4 / 100
Language: en

     but i have to reiterate once again that we only have one president

**The question** here is whether language is correctly detected; both precision and recall matter to us. Hand-coding these sentences gives a direct precision-recall answer. See the paper for details.
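As a rough illustration of how precision and recall follow from hand codings, here is a minimal sketch. The function name `precision_recall` and the toy `detected`/`coded` lists are hypothetical and are not the codings used for the paper; "positive" is taken to mean "labelled English".

```python
def precision_recall(detected_english, coded_english):
    """Precision and recall of the 'English' label against hand codings.

    Both arguments are equal-length sequences of booleans:
    detected_english[i] is the detector's verdict for quote i,
    coded_english[i] is the hand-coded ground truth (an assumption here).
    """
    pairs = list(zip(detected_english, coded_english))
    tp = sum(d and c for d, c in pairs)          # detected English, truly English
    fp = sum(d and not c for d, c in pairs)      # detected English, not English
    fn = sum((not d) and c for d, c in pairs)    # rejected, but truly English
    precision = tp / (tp + fp) if tp + fp else float('nan')
    recall = tp / (tp + fn) if tp + fn else float('nan')
    return precision, recall


# Hypothetical toy data, for illustration only.
detected = [True, True, False, True, False]
coded = [True, False, True, True, False]
precision, recall = precision_recall(detected, coded)
print('precision = {:.2f}, recall = {:.2f}'.format(precision, recall))
```

In this notebook, the detector's verdicts could be built as `[lang == 'en' for _, lang in strings_langs]`, with the hand codings supplied in the same order.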

## 3 Evaluate full cluster filtering

Either run this cell to generate a new selection...

In [8]:
#with session_scope() as session:
#    cluster_ids = sample([id for (id,)
#                          in session.query(Cluster.id).filter(Cluster.filtered == False)],
#                         SAMPLE_SIZE)

... or this one to use the previous selection (which has already been coded).

In [9]:
cluster_ids = [678375, 1030236, 200131, 227946, 182411, 1782509, 2571784, 2385863, 548464,
               1417058, 442671, 417270, 206587, 2119286, 2509349, 1363677, 2384075, 98089,
               2472784, 2291625, 1252250, 766764, 2005041, 1786498, 2532146, 547696, 119025,
               99671, 2293637, 37913, 198342, 1824441, 393254, 1438581, 726763, 1162402, 136659,
               700003, 1530925, 868029, 1988243, 1528193, 453911, 1118515, 942719, 163403,
               37335, 2531211, 1207286, 139905, 2098233, 985830, 1309232, 1949178, 1208225,
               163161, 2390394, 1000682, 33330, 1438761, 143158, 914871, 1012869, 1515844,
               2155238, 318470, 1705922, 1998918, 1559891, 1567974, 1072475, 1377888, 2659498,
               214657, 2307338, 2702297, 1718647, 1830269, 1075089, 563301, 2675991, 1936421,
               650209, 1000108, 270310, 224933, 1078355, 2665218, 988251, 921771, 193937,
               1005385, 922307, 966693, 1468503, 1112509, 476413, 1537564, 799385, 1344350]

In [10]:
with session_scope() as session:
    clusters = [session.query(Cluster).get(id) for id in cluster_ids]
    strings_kepts = []
    for c in clusters:
        fcluster = c.filter()
        if fcluster is not None:
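            # Map the filtered quotes' ids back to the ids of the original quotes
            kept_quote_ids = {q.id - filter_quote_offset() for q in fcluster.quotes}
        else:
            kept_quote_ids = set()
        # Reuse fcluster instead of running the filter a second time
        strings_kepts.append(([(q.string, q.id in kept_quote_ids) for q in c.quotes],
                              fcluster is not None))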

In [11]:
print("Over a sample of {}, {} clusters are rejected by the cluster filter"
      .format(SAMPLE_SIZE, SAMPLE_SIZE - np.sum([kept for _, kept in strings_kepts])))

Over a sample of 100, 34 clusters are rejected by the cluster filter


Here are the individual cluster strings and their respective rejected/kept status.

In [12]:
for i, (strings, ckept) in enumerate(strings_kepts):
    title = ' {} / {}'.format(i + 1, SAMPLE_SIZE)
    print('-' * (80 - len(title)) + title)
    print('Kept:',  'yes' if ckept else 'no')
    print()
    for string, skept in strings:
        fstring = indent(fill(string), ' ' * 5)
        if not skept:
            fstring = fstring[0] + 'x' + fstring[2:]
        print(fstring)
        print()

------------------------------------------------------------------------ 1 / 100
Kept: no

 x   it was actually the first time that i cried since the whole incident
     started

 x   the enormity of the situation

 x   i thought i was in the end zone

 x   i thought i was in the clear

 x   and it was the first time that i thought i was in the mafia

 x   just last night actually for the first time i saw some of the
     reenactments about how the airplane actually had to land on the water
     with the nose high and i think for the first time kind of the enormity
     of the situation really hit me and it was actually the first time that
     i cried since the whole incident started

------------------------------------------------------------------------ 2 / 100
Kept: yes

 x   everyone is entitled to an informed opinion

     everyone is entitled to an opinion and so is obama and his staff then
     again you know what they say about opinions

     you know what they say about it



**The question** here is whether the cluster filter keeps only real clusters, i.e. not clusters of causally unrelated quotes. In other words, we want high precision and are not really interested in recall. The codings for these clusters are in the `codings/filter_evaluations_clusters-precision-recall` file, and precision-recall follows from them.
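Since only precision matters here, the figure of interest is the share of kept clusters that were hand-coded as real. The sketch below assumes the codings have been loaded as `(kept, is_real)` boolean pairs. That in-memory shape, the name `kept_precision`, and the toy data are all assumptions, since the format of the codings file is not shown here.

```python
def kept_precision(codings):
    """Share of kept clusters that were hand-coded as real clusters.

    codings is a list of (kept, is_real) boolean pairs, one per sampled
    cluster; rejected clusters are ignored since recall is not of interest.
    """
    kept_codes = [is_real for kept, is_real in codings if kept]
    if not kept_codes:
        return float('nan')
    return sum(kept_codes) / len(kept_codes)


# Hypothetical toy codings, for illustration only.
toy_codings = [
    (True, True), (True, True), (True, False), (True, True),  # kept clusters
    (False, True), (False, False),                             # rejected clusters
]
cluster_precision = kept_precision(toy_codings)
print('precision over kept clusters = {:.2f}'.format(cluster_precision))
```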

## 4 Evaluate the substitution filtering

### 4.0 Sample clusters and substitutions

**Sample clusters**

Either run this cell to generate a new selection...

In [13]:
#with session_scope() as session:
#    # Sample 15 times more to make sure we get at least 100 substitutions
#    cluster_ids = sample([id for (id,)
#                          in session.query(Cluster.id).filter(Cluster.filtered == True)],
#                         15 * SAMPLE_SIZE)

... or this one to use the previous selection (which has already been coded).

In [14]:
cluster_ids = [
    1001058714, 1000234230, 1001661015, 1000793763, 1000992927, 1001447160, 1000928909,
    1000652632, 1002429300, 1000424511, 1000028725, 1001804178, 1001962169, 1002010696,
    1002620164, 1000541603, 1002038276, 1002178216, 1001264080, 1000658653, 1002656290,
    1000602035, 1002444841, 1001270720, 1000980555, 1000049133, 1001594109, 1000874654,
    1000270878, 1001267640, 1000587081, 1001502597, 1001290073, 1000742456, 1000803305,
    1001768144, 1001790796, 1000420570, 1002288719, 1002195335, 1000630276, 1001310108,
    1002544437, 1001355443, 1001215220, 1002184159, 1001954566, 1000417270, 1001062143,
    1000193302, 1001073091, 1001318792, 1001333529, 1002230713, 1000253352, 1002165662,
    1001000108, 1001492202, 1002697831, 1002554478, 1002203884, 1000368041, 1000743708,
    1001803629, 1001123734, 1001892371, 1001817722, 1001887535, 1001566025, 1001800790,
    1000778799, 1000917456, 1001772704, 1001500452, 1001176167, 1001580298, 1000786039,
    1000276995, 1001329982, 1002416136, 1001861324, 1000314572, 1000873226, 1002526111,
    1001214205, 1002498111, 1000073947, 1001636514, 1000381995, 1001214611, 1001605320,
    1000100464, 1000853926, 1002047137, 1000532977, 1002113182, 1002222906, 1001773241,
    1000492857, 1000949715, 1000164833, 1000263282, 1000669491, 1000537798, 1000493265,
    1001739413, 1002188621, 1002425058, 1000997844, 1001798690, 1000158229, 1000131343,
    1001936787, 1001056713, 1000069848, 1001032119, 1000492141, 1000001841, 1000062295,
    1001365462, 1001259190, 1002360197, 1000417145, 1001266464, 1000395595, 1002109225,
    1000519609, 1002341113, 1002627477, 1001447816, 1002598858, 1000739902, 1001334186,
    1000421527, 1001369574, 1001468521, 1001221778, 1002397089, 1000194804, 1001834376,
    1000847659, 1000857363, 1001389106, 1000404597, 1002673066, 1002010483, 1002482970,
    1001309374, 1002135445, 1000676392, 1001402554, 1000023387, 1000861742, 1000739666,
    1002208103, 1001610066, 1002091675, 1000055741, 1000098549, 1002606093, 1001019247,
    1002081961, 1000046067, 1002202634, 1002023513, 1001803811, 1002270829, 1001090860,
    1000942827, 1002488258, 1000073692, 1002394081, 1002642836, 1000964777, 1001327690,
    1000005769, 1001231453, 1001843109, 1002095337, 1000144749, 1000856624, 1001408503,
    1002077967, 1000689166, 1002604060, 1002619245, 1000260034, 1002549738, 1000886518,
    1001146812, 1001499168, 1000981840, 1000510458, 1001824441, 1000023650, 1002645020,
    1002476030, 1001445978, 1001729766, 1002396899, 1002704984, 1000456379, 1000164580,
    1001594835, 1001879217, 1001658148, 1000222104, 1000175721, 1001934132, 1000458812,
    1000597134, 1000983803, 1001045067, 1001934934, 1001101372, 1002064679, 1001354812,
    1001387573, 1001552266, 1001049178, 1000211591, 1001535131, 1000068077, 1002667160,
    1001958712, 1002049050, 1001354277, 1001406656, 1001333150, 1001800505, 1001948524,
    1000398683, 1002031029, 1001171493, 1001750650, 1000486399, 1002344374, 1000798536,
    1002483840, 1001327852, 1000574008, 1002405878, 1001603998, 1001658395, 1000311644,
    1001804403, 1001249334, 1001265825, 1001434048, 1001393656, 1001794262, 1001120310,
    1001337829, 1001450332, 1002610877, 1001479081, 1000806304, 1000310768, 1000399139,
    1001749658, 1000062651, 1001672488, 1000168730, 1000287594, 1000031766, 1002453238,
    1000907880, 1000172027, 1002431192, 1001999079, 1000072697, 1001986304, 1002115343,
    1001729577, 1001750064, 1000630283, 1000459150, 1002202737, 1002558133, 1001727072,
    1000130129, 1000657054, 1000701933, 1000953247, 1002349120, 1001881970, 1002607340,
    1002248182, 1001559921, 1001252335, 1000301764, 1001307914, 1001284847, 1001340444,
    1001785513, 1000158707, 1000072111, 1002577972, 1001846670, 1002551562, 1000423962,
    1001392074, 1002514611, 1001067331, 1002031448, 1000179811, 1001451249, 1002350833,
    1000722484, 1000541045, 1001513981, 1002327622, 1001060749, 1002530663, 1002644774,
    1001007144, 1000257361, 1002563963, 1002394972, 1001058481, 1000854528, 1002613959,
    1001676866, 1000073613, 1000209836, 1001243402, 1002440440, 1002364622, 1000090769,
    1002031916, 1001978714, 1002287608, 1001866823, 1001583138, 1002663786, 1002190005,
    1001188041, 1000497933, 1000061665, 1000964817, 1001367239, 1001935938, 1002551108,
    1002637119, 1000525740, 1002690274, 1000762606, 1002218984, 1002540254, 1002012695,
    1002535392, 1002045997, 1001761229, 1001345695, 1001540708, 1000409133, 1002281542,
    1002031988, 1001905217, 1001927225, 1000456501, 1000103149, 1001595739, 1002629715,
    1000633871, 1002582219, 1000125203, 1002622788, 1000152868, 1001691221, 1002550150,
    1001367563, 1001517118, 1000871577, 1002129646, 1001997590, 1000404780, 1002438973,
    1002241374, 1001038946, 1000088364, 1001781958, 1001978086, 1001035980, 1001110133,
    1001604717, 1002078831, 1002488583, 1000792847, 1000641673, 1001165015, 1001283536,
    1001500299, 1001336583, 1001464976, 1001775715, 1001334053, 1001752220, 1001059238,
    1000061280, 1000343167, 1002262659, 1001017521, 1000926454, 1001549906, 1001399657,
    1001596953, 1001511989, 1002134226, 1000997323, 1002423339, 1002561081, 1002351089,
    1000541923, 1001581475, 1001867280, 1001543175, 1001503038, 1000093464, 1002711441,
    1000730805, 1001702743, 1001948884, 1000065378, 1000215299, 1001326290, 1001638327,
    1000007295, 1001515868, 1002212135, 1001575847, 1002161948, 1001357387, 1001766184,
    1000727306, 1000699503, 1000075976, 1002128547, 1002602954, 1001760223, 1000659375,
    1002062608, 1000750867, 1001376415, 1000364749, 1001058467, 1000618691, 1001144316,
    1001536519, 1002587087, 1001101833, 1000247382, 1000609876, 1001645973, 1001840226,
    1001602395, 1002684664, 1000776962, 1001157755, 1002342517, 1000022381, 1001920189,
    1001801346, 1001162402, 1002689307, 1002156227, 1001687826, 1002596959, 1002349648,
    1001894040, 1001450690, 1001735336, 1001498688, 1001180316, 1000810720, 1001220544,
    1001529035, 1001497563, 1000466115, 1000218057, 1002531776, 1001810079, 1000174553,
    1000145964, 1001540017, 1002531576, 1001728260, 1002033880, 1001150137, 1001953094,
    1002077979, 1001228678, 1002168298, 1000151281, 1002335600, 1002526758, 1001714987,
    1001291988, 1000419221, 1002166820, 1000925310, 1000982141, 1001751176, 1001191639,
    1000428712, 1000885331, 1000130902, 1001035008, 1000538396, 1002436753, 1002272435,
    1001444011, 1000873940, 1000706171, 1000465018, 1000006623, 1000275277, 1002396180,
    1002582721, 1000118329, 1000037089, 1000453971, 1000963958, 1000640803, 1001129926,
    1001224067, 1000047577, 1000393604, 1001300568, 1000147695, 1000808608, 1001958358,
    1001269596, 1002713792, 1002275177, 1002410739, 1002315383, 1000223504, 1001345610,
    1000036857, 1002471258, 1001375224, 1002695360, 1001511054, 1002619075, 1002701703,
    1000428130, 1000207474, 1000367210, 1002676594, 1002290678, 1000780766, 1002248783,
    1000331455, 1000869013, 1002351131, 1001827981, 1001624275, 1001359969, 1000562736,
    1000393157, 1001947238, 1002170490, 1002493182, 1000951354, 1000734774, 1000272483,
    1001067850, 1002234686, 1002259959, 1001223336, 1002086622, 1002102465, 1002404394,
    1000833030, 1001881525, 1002214009, 1001142763, 1001334771, 1001399809, 1000971111,
    1000841798, 1000351662, 1000918770, 1000051328, 1001085842, 1001426393, 1001499469,
    1002673391, 1002439989, 1002656579, 1000503223, 1000199277, 1001102502, 1000804320,
    1001829920, 1002377023, 1000669157, 1000789097, 1002582408, 1000755806, 1000826738,
    1001776772, 1001368136, 1000804767, 1002148559, 1001719604, 1002662183, 1002320519,
    1000047835, 1000240968, 1000980627, 1000889403, 1002350939, 1001131879, 1001271289,
    1002320021, 1000812131, 1001424421, 1002461839, 1001799302, 1000362981, 1001116881,
    1001201610, 1001904014, 1002449788, 1000361053, 1000143359, 1000922307, 1001675730,
    1000773716, 1000964829, 1000830471, 1000431368, 1002320455, 1002506830, 1001079629,
    1002573767, 1000514517, 1002143145, 1001158803, 1001644001, 1002618006, 1002492025,
    1000676599, 1001041617, 1002145451, 1000017230, 1001866916, 1001265926, 1002274458,
    1001815958, 1002702195, 1002236588, 1002247341, 1000118770, 1002178890, 1001472895,
    1000063547, 1000762613, 1002297834, 1000807747, 1001970314, 1002276553, 1000562856,
    1001140852, 1001601748, 1002639679, 1001182746, 1002650044, 1001227746, 1000170311,
    1000564560, 1001740055, 1001904646, 1000539800, 1001501829, 1000531802, 1000128379,
    1000790358, 1001940684, 1000158028, 1001881112, 1002492922, 1000803880, 1000643541,
    1000052799, 1000297615, 1001726109, 1002543957, 1000054716, 1000155125, 1002652618,
    1000769467, 1002042242, 1000368676, 1000858673, 1001822210, 1000350231, 1001442450,
    1000050760, 1002041174, 1000233535, 1001572148, 1001432199, 1001742887, 1001172196,
    1000364258, 1001970826, 1002105767, 1002037960, 1000451410, 1002099403, 1000967350,
    1000147974, 1000341770, 1001680878, 1001613527, 1002528697, 1002715601, 1002474922,
    1000818246, 1001278242, 1000681152, 1000997312, 1000010867, 1000952787, 1000993230,
    1001786224, 1000203905, 1000911729, 1000885883, 1002075501, 1001614854, 1001924725,
    1000385909, 1001433831, 1000201148, 1001606931, 1001941294, 1000962064, 1001860524,
    1000774230, 1002033685, 1001261980, 1000288611, 1001334187, 1001921383, 1002166359,
    1001938346, 1001262841, 1000993575, 1001270506, 1002702506, 1001842295, 1000382304,
    1000042449, 1000021771, 1000481204, 1002121723, 1001652776, 1001226182, 1002225233,
    1001253286, 1001562116, 1000154064, 1000094446, 1001633759, 1001546738, 1001004738,
    1002556326, 1001205897, 1000902561, 1002001937, 1001700843, 1002435437, 1000989228,
    1001653126, 1001334470, 1002071621, 1001522103, 1001548001, 1001285963, 1000090895,
    1000746072, 1001574873, 1002175814, 1002531988, 1002144195, 1001770967, 1001451200,
    1001856086, 1000559388, 1001297817, 1000837494, 1001534871, 1000220313, 1001259385,
    1002100069, 1001332921, 1000544786, 1001841512, 1001383653, 1000684470, 1002161309,
    1002231396, 1002348115, 1001758109, 1001747036, 1000284497, 1000832130, 1001282170,
    1002010186, 1000427241, 1001581775, 1000578192, 1000078262, 1000016694, 1002705118,
    1001733624, 1001363335, 1002436545, 1001328603, 1002180807, 1000767722, 1002485195,
    1002272469, 1001271235, 1001481701, 1001348533, 1000388469, 1001219997, 1001460577,
    1001590875, 1001240142, 1000238824, 1000121322, 1002004474, 1001592727, 1001385776,
    1002470366, 1002664998, 1001997277, 1001120647, 1002575276, 1001722300, 1002216129,
    1000067607, 1000994775, 1001498352, 1002699359, 1000587363, 1000217706, 1002109991,
    1000288982, 1002699713, 1002233823, 1001153232, 1000914136, 1002548433, 1001734510,
    1002342520, 1002452277, 1001100860, 1000950342, 1000388189, 1000127650, 1002315582,
    1001566979, 1000957774, 1000052007, 1000364394, 1001996793, 1001908176, 1002105936,
    1002086992, 1002016069, 1001183918, 1000800618, 1001022239, 1001270359, 1002242646,
    1002505634, 1000716919, 1000901112, 1002481410, 1000298970, 1000516124, 1000057052,
    1002361761, 1002647536, 1000845811, 1002171985, 1002050592, 1000006586, 1001255552,
    1001837963, 1001256691, 1001761220, 1000938672, 1000530532, 1000311534, 1001776809,
    1000318357, 1002053600, 1001444195, 1002365910, 1001746878, 1000228196, 1002060204,
    1001739958, 1001886840, 1000698210, 1002509868, 1001502009, 1001893527, 1002638619,
    1002066935, 1001550280, 1002294683, 1000107898, 1000010223, 1001508748, 1001689201,
    1001003737, 1001014351, 1002082703, 1002224120, 1002157272, 1000882553, 1000053316,
    1001414012, 1001342905, 1002156509, 1000251513, 1001232667, 1001399253, 1001855493,
    1001013265, 1001688212, 1002583768, 1001529428, 1001127460, 1001698379, 1002315633,
    1002318183, 1001620522, 1001378536, 1001574209, 1000569652, 1001604632, 1002697581,
    1001575675, 1001117505, 1002051194, 1000183177, 1000855203, 1001423033, 1001777736,
    1002318334, 1001056375, 1002426820, 1002236516, 1000150483, 1001578508, 1001358370,
    1000011455, 1002362070, 1000806847, 1001801348, 1000085743, 1001413671, 1001540172,
    1001829995, 1001851813, 1001597485, 1000078746, 1001840646, 1001995820, 1000160512,
    1002451933, 1001191721, 1002203693, 1000238465, 1002033572, 1001320063, 1000872380,
    1000623197, 1002147129, 1000011914, 1001530603, 1002446364, 1000079514, 1002448631,
    1001076382, 1002223593, 1000490815, 1000630667, 1000376213, 1002462823, 1000941817,
    1002357231, 1001797587, 1000926673, 1000594216, 1001861749, 1001684910, 1002501511,
    1001579538, 1002606713, 1000695748, 1000903985, 1001546192, 1001996840, 1001472327,
    1000032826, 1001545056, 1002014395, 1002095028, 1001913231, 1000075824, 1000952746,
    1001853596, 1002182047, 1002628022, 1002004078, 1001566113, 1002163349, 1000790111,
    1002285133, 1002494233, 1000397761, 1001439740, 1001101891, 1001618046, 1001291605,
    1002314992, 1001234028, 1001559891, 1001477621, 1000780535, 1000875694, 1000645845,
    1000484118, 1002111321, 1001878295, 1002646120, 1002039447, 1002029116, 1000156966,
    1000010498, 1001769120, 1000175849, 1000418713, 1001915737, 1000709712, 1001922556,
    1001283094, 1002208729, 1000333316, 1000012505, 1001047126, 1001172398, 1000177517,
    1001487440, 1000001465, 1000882506, 1002659495, 1002052955, 1001398557, 1001303140,
    1000860772, 1002525449, 1002527325, 1002183331, 1001776468, 1000205762, 1000597585,
    1000113771, 1001870016, 1000025477, 1002086198, 1000502730, 1001981118, 1002473830,
    1001480043, 1002528637, 1000838413, 1001653286, 1001330880, 1002679849, 1002506095,
    1000486043, 1001982507, 1002103300, 1000257805, 1001964267, 1000874462, 1001326641,
    1001782479, 1001021296, 1001867545, 1000694265, 1002033815, 1000304505, 1000985515,
    1000135881, 1002495192, 1000821217, 1001810184, 1002310190, 1002635800, 1000729831,
    1000784775, 1001668437, 1002274466, 1000797484, 1002537649, 1000475111, 1000437411,
    1001657664, 1001278789, 1001836754, 1001983301, 1001494451, 1002357233, 1002212481,
    1001782077, 1001629749, 1000824935, 1002484633, 1000050369, 1000716672, 1001310861,
    1001954605, 1002473461, 1000000607, 1000825586, 1001437178, 1000204392, 1002359775,
    1001499856, 1001810430, 1002093548, 1001448073, 1001173925, 1000746976, 1002502228,
    1000177952, 1001304657, 1001653859, 1000214289, 1000157134, 1001597852, 1002060579,
    1000735315, 1002615587, 1001055343, 1002059341, 1000356177, 1000447259, 1000699286,
    1002457427, 1001989841, 1001801036, 1001434979, 1001957769, 1000007005, 1000916205,
    1001169392, 1001889767, 1000075558, 1002025030, 1001976875, 1001560208, 1001115658,
    1001598286, 1001205754, 1002231878, 1000813552, 1001981838, 1000677677, 1001053575,
    1002255917, 1002025140, 1001734307, 1001057270, 1001087649, 1002574757, 1001915861,
    1001674897, 1001364136, 1002154777, 1001019768, 1001614035, 1000379490, 1001908572,
    1002591014, 1000164476, 1000021830, 1002468679, 1001717996, 1001998254, 1000470822,
    1001431751, 1001319414, 1002050667, 1000560326, 1001259195, 1000866918, 1001379510,
    1002113186, 1001517995, 1001646691, 1000465784, 1001374778, 1001896574, 1002089665,
    1000872939, 1001914402, 1001909903, 1002351476, 1002080607, 1001717244, 1000018324,
    1001994530, 1001422230, 1001353347, 1000339608, 1000475920, 1001694035, 1000982021,
    1001590385, 1001535698, 1000544821, 1001163684, 1000525655, 1002208497, 1001140991,
    1000766128, 1000607192, 1000151074, 1000723980, 1002133819, 1000500270, 1000834415,
    1002225356, 1000373694, 1001824087, 1000067090, 1000297958, 1002317409, 1001770960,
    1001508749, 1000167477, 1001947065, 1001940057, 1000781919, 1000988086, 1002518061,
    1001558269, 1001815508, 1001091260, 1000263913, 1000405911, 1001511990, 1001723810,
    1001074987, 1002376815, 1001680122, 1001506913, 1001594016, 1001519696, 1001789977,
    1000692462, 1000029089, 1000338311, 1001956638, 1002575343, 1001274784, 1001851680,
    1002390771, 1001326868, 1001743949, 1002550572, 1002559841, 1002530236, 1001459660,
    1000449148, 1002536266, 1002701796, 1002253595, 1001124589, 1001133908, 1002463064,
    1000392223, 1001121726, 1001289467, 1000289591, 1002105764, 1002381694, 1002341368,
    1000827595, 1002084983, 1001682153, 1000633018, 1001113583, 1000928142, 1000033000,
    1000621627, 1001084527, 1001200554, 1001458154, 1002412589, 1001972202, 1002691528,
    1000265992, 1002659026, 1001366843, 1000467244, 1001664670, 1002489296, 1002483595,
    1001900662, 1002686469, 1001330713, 1000830278, 1000795336, 1002364454, 1000811242,
    1001678392, 1000033963, 1002634804, 1001804844, 1000527231, 1002344474, 1002369472,
    1001578312, 1001845592, 1001540167, 1002337924, 1000539731, 1002311229, 1000201698,
    1000953073, 1000629221, 1002663893, 1001791750, 1001914401, 1002604562, 1000828450,
    1002613802, 1001268711, 1000927905, 1001578575, 1001005287, 1002131006, 1002396856,
    1000010276, 1001202203, 1000137591, 1001716758, 1001404162, 1002497367, 1000410225,
    1000002836, 1000456777, 1000005943, 1002065249, 1002074323, 1000814733, 1002442276,
    1002663916, 1001447228, 1000979497, 1000956742, 1001038308, 1002638096, 1000415025,
    1001868899, 1001017794, 1000056720, 1000058183, 1000821880, 1001228149, 1000480755,
    1002499177, 1001537485, 1000339767, 1002044113, 1002573766, 1001269992, 1001354843,
    1000118888, 1000149428, 1001803975, 1000222326, 1001257849, 1002187778, 1002149432,
    1001445651, 1001971190, 1001377902, 1000168509, 1001351656, 1001713927, 1001104442,
    1001890822, 1001869495, 1002575689, 1000762777, 1000150393, 1001259333, 1000565784,
    1002312763, 1001455163, 1002431805, 1002170320, 1001727497, 1002120786, 1001211845,
    1002009293, 1002002759, 1002256034, 1000960075, 1000905489, 1002410523, 1000381073,
    1000173214, 1002395671, 1002256806, 1001768142, 1000177679, 1002692425, 1002088771,
    1000856361, 1001423847, 1001575759, 1000449711, 1001550004, 1001554359, 1001906605,
    1001710361, 1001862213, 1002048548, 1000734007, 1001059720, 1002352183, 1001121521,
    1000213563, 1001664822, 1002450758, 1001128119, 1001734650, 1000402851, 1000517539,
    1001565687, 1001288446, 1000169483, 1000780042, 1000862817, 1000904640, 1001952654,
    1000734391, 1001156795, 1001074923, 1001855482, 1001336701, 1001371862, 1002545044,
    1001454360, 1001719858, 1001598709, 1001450807, 1000713806, 1000792912, 1002677052,
    1001580304, 1002257558, 1001500784, 1002261599, 1000810175, 1001677954, 1001202743,
    1002314510, 1001777562, 1000822655, 1001002115, 1001217819, 1002520043, 1000716537,
    1001958057, 1002669138, 1002021941, 1000828887, 1000177607, 1001993644, 1002485054,
    1001084532, 1000857473,
]

**Sample substitutions**

Either run this cell to generate a new selection...

In [15]:
#substitution_ids = {}
#with session_scope() as session:
#    for max_distance in range(1, 3):
#        model = Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, max_distance)
#        # Sample 5 times more to make sure we get at least 100 substitutions
#        # even with different quotes
#        substitution_ids[model] = sample(
#            [id for (id,)
#             in session.query(Substitution.id).filter(Substitution.model == model)],
#            5 * SAMPLE_SIZE
#        )

... or this one to use the previous selection (which has already been coded).

In [16]:
substitution_ids = {
    Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 1): [
        5472, 10466, 8963, 6380, 778, 9747, 474, 830, 8524, 8271,
        674, 6907, 5478, 9421, 8203, 7668, 6772, 2201, 6117, 3916,
        8525, 10101, 1863, 1846, 2741, 8648, 9870, 4681, 1871, 7960,
        6830, 7839, 4664, 2974, 5660, 7994, 4239, 5473, 8170, 10334,
        4306, 9317, 1887, 9638, 6799, 6115, 849, 3303, 4259, 6105,
        7455, 10475, 2176, 5343, 913, 8808, 420, 489, 5431, 872,
        6754, 389, 9343, 8201, 6234, 498, 3411, 9359, 8402, 7506,
        7084, 6809, 7417, 9337, 1924, 5685, 648, 415, 5702, 4612,
        10714, 8804, 4269, 6758, 10980, 7140, 1967, 2968, 5443, 2661,
        9635, 3911, 4346, 3496, 946, 879, 4845, 7148, 3445, 6574,
        5650, 8046, 8943, 7115, 9914, 5688, 54, 1425, 932, 1498,
        3318, 7357, 10723, 571, 6440, 10721, 5723, 4271, 943, 854,
        4067, 2694, 6135, 57, 2434, 8190, 417, 5571, 2213, 6612,
        2674, 7063, 4353, 6653, 800, 6138, 2688, 459, 859, 2430,
        5269, 9871, 10381, 8986, 2072, 9381, 200, 354, 4194, 7394,
        9877, 7033, 723, 9883, 8626, 4206, 486, 3617, 243, 5339,
        5355, 9338, 1724, 811, 6130, 2224, 3311, 4616, 4521, 9410,
        5482, 6369, 7963, 10768, 4256, 4061, 565, 3623, 6657, 141,
        4226, 8642, 11052, 4052, 786, 40, 2025, 632, 541, 5755,
        7486, 2706, 9560, 213, 10879, 5441, 5800, 8684, 2953, 6822,
        411, 9342, 6847, 9494, 392, 6820, 1841, 6945, 5656, 467,
        6433, 9407, 442, 6663, 1996, 5639, 7359, 2162, 8546, 4488,
        7548, 6777, 3649, 2435, 4313, 7125, 348, 10991, 7082, 6795,
        7142, 9535, 5464, 4620, 867, 9760, 8974, 5705, 4223, 7458,
        8382, 8076, 2199, 6183, 10719, 10356, 988, 4330, 10971, 6871,
        4374, 8899, 42, 2157, 8092, 5514, 744, 9361, 1961, 5555,
        8163, 4463, 8887, 4134, 865, 432, 1413, 5754, 5469, 9414,
        3305, 5669, 6181, 570, 7551, 8759, 695, 2419, 6903, 2202,
        969, 7570, 2438, 5429, 4465, 179, 5239, 10438, 7501, 573,
        2559, 9486, 9457, 6146, 949, 1377, 8926, 4390, 9745, 5780,
        3404, 279, 9490, 3765, 4197, 2164, 4691, 5753, 10775, 277,
        11048, 8414, 2014, 7678, 10338, 10636, 6367, 140, 4348, 4429,
        3910, 4220, 410, 2095, 3294, 2230, 4833, 2045, 3626, 5466,
        3306, 3622, 2243, 5322, 8510, 2192, 6235, 4231, 564, 9423,
        9333, 5492, 1725, 10378, 8270, 10173, 2154, 201, 2151, 8562,
        3928, 7006, 5647, 9228, 2721, 8970, 3909, 2686, 9061, 5765,
        7027, 824, 5436, 4860, 7534, 2090, 8192, 4126, 9466, 3446,
        1009, 3450, 5248, 6182, 9888, 22, 513, 7389, 8819, 992,
        11047, 5549, 1519, 7352, 10615, 616, 3624, 7112, 3290, 2181,
        8941, 10877, 977, 2719, 4221, 10399, 2200, 5504, 6238, 9562,
        10862, 9418, 10709, 3618, 9063, 9388, 7137, 897, 3407, 6362,
        10468, 2047, 5673, 10618, 9552, 851, 3416, 4058, 8947, 6814,
        5581, 701, 972, 4340, 4215, 689, 8713, 3956, 8707, 6176,
        3413, 10021, 4516, 4780, 9422, 3347, 7032, 3340, 6431, 5724,
        4288, 4496, 9417, 6119, 4057, 325, 4413, 9869, 8500, 220,
        6152, 9385, 3293, 7996, 793, 219, 4456, 939, 8086, 613,
        6948, 212, 1965, 4417, 10623, 804, 4449, 5424, 9766, 7542,
        9782, 4361, 6356, 1964, 8061, 9393, 8566, 4240, 9401, 4323,
        5484, 10423, 501, 8033, 3414, 8542, 4257, 6462, 4048, 2034,
        7669, 3277, 4295, 7113, 973, 9588, 4480, 2463, 4451, 11113
    ],
    Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 2): [
        7242, 6639, 9155, 6065, 8704, 3122, 9273, 3556, 7633, 1514,
        2335, 7235, 1309, 4084, 3690, 6671, 4171, 3205, 1207, 2595,
        2587, 4151, 2790, 9504, 3008, 9022, 1610, 3169, 9163, 10168,
        2783, 9271, 8480, 2497, 3782, 9805, 3991, 9513, 10277, 2374,
        5875, 5527, 6079, 8333, 11029, 3028, 3366, 7255, 6691, 3160,
        2108, 10525, 10165, 1810, 6285, 2738, 3897, 2917, 10509, 8718,
        10011, 2801, 1703, 8783, 4901, 6560, 5945, 669, 9647, 3802,
        604, 8024, 7037, 7303, 4930, 2729, 6622, 2346, 4744, 3386,
        10184, 7689, 1708, 3536, 10207, 3973, 7517, 5877, 6349, 2534,
        8851, 1481, 7818, 2912, 5895, 9050, 2483, 1910, 8360, 3992,
        4899, 1147, 5040, 1222, 4766, 3190, 2305, 8869, 10458, 773,
        2730, 4994, 10863, 7299, 7566, 3534, 5334, 6016, 1490, 7904,
        3686, 8794, 7664, 8698, 1239, 1809, 4185, 7523, 1915, 1894,
        6569, 10556, 4018, 7376, 7699, 10256, 9717, 8717, 2765, 5843,
        2266, 9266, 9921, 7896, 7400, 7091, 1800, 4887, 7860, 1454,
        812, 3831, 68, 9168, 9933, 4541, 3696, 9804, 2288, 3329,
        7612, 2500, 1494, 8122, 7167, 9614, 9172, 3486, 6000, 8482,
        8479, 3660, 10679, 10053, 8923, 1312, 10216, 9127, 2516, 9024,
        8322, 3794, 9817, 10508, 7323, 5372, 9778, 1628, 1359, 5092,
        6670, 8738, 3899, 7284, 9911, 4722, 1674, 3067, 10746, 2284,
        4019, 10223, 8139, 9034, 6887, 7629, 3700, 1899, 2736, 5335,
        8690, 2980, 10641, 9726, 7431, 2793, 4716, 3082, 5532, 3006,
        2498, 1397, 7041, 1311, 1493, 6507, 10681, 257, 9895, 10098,
        7653, 6561, 10542, 600, 5220, 1149, 10573, 5311, 2622, 9117,
        5373, 5064, 3687, 8865, 7354, 9822, 11044, 5292, 8267, 8030,
        1775, 8313, 9595, 8247, 578, 4509, 6320, 3674, 3115, 8456,
        9001, 3482, 5396, 11096, 7934, 2678, 8108, 10962, 5079, 6995,
        9603, 5048, 7695, 3161, 6517, 5022, 5059, 10265, 10130, 5121,
        3584, 7716, 6197, 10282, 4628, 11060, 5003, 3509, 9068, 644,
        2537, 4761, 3374, 1877, 1353, 10655, 11076, 9553, 5119, 107,
        9610, 3801, 7880, 10871, 3144, 4578, 4767, 7774, 653, 3753,
        251, 1067, 10076, 1460, 9184, 10580, 8921, 8604, 6709, 2914,
        5143, 145, 10453, 9861, 8003, 4167, 9819, 5007, 10941, 5290,
        7891, 8228, 2833, 4163, 1757, 1826, 3177, 3581, 4965, 7078,
        6219, 3381, 8848, 1412, 7381, 9837, 1209, 2352, 6516, 314,
        10728, 8349, 1423, 9057, 8356, 91, 6198, 159, 9939, 8769,
        2878, 10560, 10729, 9287, 9735, 4591, 9907, 211, 1472, 5595,
        6740, 9069, 1768, 5392, 4092, 3002, 6201, 6419, 9053, 2481,
        1408, 1621, 4525, 6955, 7759, 8745, 7575, 2649, 658, 5938,
        4139, 1602, 3085, 1895, 5106, 1650, 9721, 4954, 9521, 3763,
        3221, 4972, 6093, 5826, 6340, 2623, 7742, 11065, 113, 5280,
        6668, 7035, 3050, 4915, 7265, 8733, 7271, 1157, 9976, 10112,
        3713, 10088, 7942, 9083, 2540, 4963, 2527, 5512, 6651, 9722,
        5905, 1597, 6606, 5948, 172, 4991, 4756, 1111, 9708, 1231,
        8615, 1560, 3134, 3708, 7892, 4593, 3850, 1566, 7438, 8113,
        4181, 1182, 1685, 5405, 5913, 7779, 8107, 1177, 1734, 5950,
        2600, 4913, 2261, 6039, 6297, 10818, 1154, 8788, 5289, 118,
        1645, 5197, 3118, 2792, 1945, 10583, 1049, 3324, 8121, 2533,
        10425, 7713, 9774, 10939, 6957, 6558, 6705, 10066, 2116, 10930,
        4633, 1194, 1692, 5828, 9840, 3780, 4508, 4995, 1624, 10176
    ]
}

Our substitution printing function:

In [17]:
def print_substitution(number, substitution):
    """Print one substitution: kept/rejected status, tokens, lemmas, and both quotes."""
    title = ' {} / {}'.format(number, SAMPLE_SIZE)
    print('-' * (80 - len(title)) + title)
    if substitution.validate():
        print('Kept: yes')
    else:
        print('Kept: no')
    print()
    print('     Tokens: {tokens[0]} -> {tokens[1]}'
          .format(tokens=substitution.tokens))
    print('     Lemmas: {lemmas[0]} -> {lemmas[1]}'
          .format(lemmas=substitution.lemmas))
    print()
    print(indent(fill(substitution.source.string), ' ' * 5))
    print()
    print(indent(fill(substitution.destination.string), ' ' * 5))
    print()

def mine_print_substitutions(model):
    """Mine the sampled clusters with `model`, printing up to SAMPLE_SIZE substitutions."""
    seen = 0
    seen_substitutions = set()
    for cluster_id in cluster_ids:
        if seen >= SAMPLE_SIZE:
            break

        model.drop_caches()
        with session_scope() as session:
            cluster = session.query(Cluster).get(cluster_id)

            for substitution in cluster.substitutions(model):
                session.rollback()

                if seen >= SAMPLE_SIZE:
                    break
                if (substitution.destination.id, substitution.position) in seen_substitutions:
                    break

                seen += 1
                seen_substitutions.add((substitution.destination.id, substitution.position))
                print_substitution(seen, substitution)

    if seen < SAMPLE_SIZE:
        print("Didn't find {} substitutions, you should sample more clusters"
              .format(SAMPLE_SIZE))


def db_print_substitutions(model):
    """Print up to SAMPLE_SIZE substitutions from the ids sampled for `model`."""
    seen = 0
    seen_substitutions = set()
    for substitution_id in substitution_ids[model]:
        if seen >= SAMPLE_SIZE:
            break

        with session_scope() as session:
            substitution = session.query(Substitution).get(substitution_id)

            if (substitution.destination.id, substitution.position) in seen_substitutions:
                continue

            seen += 1
            seen_substitutions.add((substitution.destination.id, substitution.position))
            print_substitution(seen, substitution)

### 4.1 Single-substitution mining

Here are the individual substitutions with their kept/rejected status, for the **single-substitution** model.

In [18]:
mine_print_substitutions(Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 1))

------------------------------------------------------------------------ 1 / 100
Kept: no

     Tokens: that -> it
     Lemmas: that -> it

     i hope laura and i did the same thing but i believe he will and i know
     his girls are on his mind and he wants to make sure that first and
     foremost he is a good dad and i think that's going to be an important
     part of his presidency

     but i believe he will and i know his girls are on his mind and he
     wants to make sure that first and foremost he's a good dad and i think
     it's going to be an important part of his presidency

------------------------------------------------------------------------ 2 / 100
Kept: no

     Tokens: higher-level -> higher
     Lemmas: higher-level -> high

     going forward we'll use jquery as one of the libraries used to
     implement higher-level controls in the asp net ajax control toolkit as
     well as to implement new ajax server-side helper methods for asp net
     mvc

     going f

Now with only valid substitutions, to make the precision analysis more accurate.

In [19]:
db_print_substitutions(Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 1))

------------------------------------------------------------------------ 1 / 100
Kept: yes

     Tokens: required -> necessary
     Lemmas: require -> necessary

     the auto companies must not squander this chance to reform bad
     management practices and begin the long-term restructuring that is
     absolutely required to save this critical industry and the millions of
     american jobs that depend on it

     the auto companies must not squander this chance to reform bad
     management practices and begin the long-term restructuring that is
     absolutely necessary to save this critical industry and the millions
     of american jobs that depend on it

------------------------------------------------------------------------ 2 / 100
Kept: yes

     Tokens: info -> information
     Lemmas: info -> information

     i'm not kidding around can you put that aside and understand that the
     case we are trying here and the info you're going to hear about here
     is totally separ

Here, the only possible sources of bad filtering are stopwords and rejection because of a Levenshtein distance <= 1 (the other checks are very specific).
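
To make the distance criterion concrete, here is a minimal sketch of a Levenshtein distance check. The `levenshtein` and `too_close` helpers are hypothetical names written for illustration; they are not necessarily how `brainscopypaste` implements the filter, and whether the real filter compares tokens, lemmas, or both is an assumption left open here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def too_close(word1, word2, threshold=1):
    """Hypothetical check: True if the two spellings are within `threshold` edits."""
    return levenshtein(word1, word2) <= threshold


# Word pairs taken from the outputs printed in this notebook.
for pair in [('unmistakeably', 'unmistakably'), ('required', 'necessary')]:
    print(pair, levenshtein(*pair), too_close(*pair))
```

Pairs that differ by a single character, such as spelling variants, fall under the threshold, which is why the recall question below matters: some of them may still be meaningful changes.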

We already know we don't want substitutions on stopwords because we want to focus on meaningful changes, so **the questions** are:
1. what the precision of the filter is (i.e. are there any obvious missing checks?)
2. whether the Levenshtein distance <= 1 rejection causes many meaningful substitutions to be lost: we want to know if recall is high here, so as to decide whether using orthographic neighborhood density is useful or not.

The codings of the substitutions are in `codings/filter_evaluations_substitutions-precision-recall`; precision answers the first question, recall the second one.
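
As a rough sketch of how the two measures could be computed once the substitutions are coded: the `precision_recall` function and the `(kept, meaningful)` pair format below are assumptions made for illustration, not the actual layout of the coding files, and the example values are made up.

```python
def precision_recall(codings):
    """Compute precision and recall of the filter from hand codings.

    `codings` is a list of (kept, meaningful) boolean pairs: `kept` is the
    filter's decision, `meaningful` is the human coder's judgement.
    """
    true_positives = sum(1 for kept, meaningful in codings if kept and meaningful)
    kept_total = sum(1 for kept, _ in codings if kept)
    meaningful_total = sum(1 for _, meaningful in codings if meaningful)
    # Return None when a ratio is undefined (nothing kept, or nothing meaningful).
    precision = true_positives / kept_total if kept_total else None
    recall = true_positives / meaningful_total if meaningful_total else None
    return precision, recall


# Made-up codings for illustration only, not results from this notebook.
example = [(True, True), (True, True), (True, False), (False, True), (False, False)]
precision, recall = precision_recall(example)
print('precision = {:.2f}, recall = {:.2f}'.format(precision, recall))
```

Precision is the share of kept substitutions that the coder judged meaningful; recall is the share of meaningful substitutions that the filter kept. A low recall would suggest the distance rejection discards many meaningful changes.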

### 4.2 Double-substitution mining

Now the same with double-substitutions.

In [20]:
mine_print_substitutions(Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 2))

------------------------------------------------------------------------ 1 / 100
Kept: no

     Tokens: that -> it
     Lemmas: that -> it

     i hope laura and i did the same thing but i believe he will and i know
     his girls are on his mind and he wants to make sure that first and
     foremost he is a good dad and i think that's going to be an important
     part of his presidency

     but i believe he will and i know his girls are on his mind and he
     wants to make sure that first and foremost he's a good dad and i think
     it's going to be an important part of his presidency

------------------------------------------------------------------------ 2 / 100
Kept: no

     Tokens: unmistakeably -> unmistakably
     Lemmas: unmistakeably -> unmistakably

     the finger of suspicion unmistakeably points to the territory of our
     neighbour pakistan

     the finger of suspicion unmistakably points to the territory of our
     neighbor pakistan

----------------------------

Now with only valid substitutions, to make the precision analysis more accurate.

In [21]:
db_print_substitutions(Model(Time.discrete, Source.majority, Past.last_bin, Durl.all, 2))

------------------------------------------------------------------------ 1 / 100
Kept: yes

     Tokens: witnessed -> seen
     Lemmas: witness -> see

     palestine has never witnessed an uglier massacre

     palestine has never seen an uglier massacre

------------------------------------------------------------------------ 2 / 100
Kept: yes

     Tokens: world -> global
     Lemmas: world -> global

     the united states will lose its superpower status in the world
     financial system the world financial system will become more multi-
     polar

     the united states will lose its superpower status in the global
     financial system

------------------------------------------------------------------------ 3 / 100
Kept: yes

     Tokens: borders -> limit
     Lemmas: border -> limit

     from tomorrow representatives of the european union will begin
     conducting monitoring up to the southern borders of the security zone

     up to the southern limit of the security zone
