This notebook inspects sentence and span lengths in the data, both to see how propaganda techniques correlate with span length and to decide on a maximum sentence length for our systems.

In [0]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

TC_LABELS_FILE = '../data/tc-train.tsv'
SI_LABELS_FILE = '../data/train-data-with-sents-no-maxlen.tsv'
SI_LABELS_FILE_BASELINE = '../data/train-data-with-sents-baseline.tsv'
SI_LABELS_FILE_IMPROVED = '../data/train-data-with-sents-improved.tsv'

# Technique Classification (span length)

In [0]:
# quoting=3 --> 'treat quotes as normal characters'
df = pd.read_csv(TC_LABELS_FILE, sep='\t',
                 usecols=['document_id', 'label', 'span_start', 'span_end', 'text'],
                 encoding='ISO-8859-1', quoting=3)
df['length_words'] = df['text'].apply(lambda x : len(str(x).split()))
df['length_chars'] = df['span_end'] - df['span_start']
df.groupby(['label']).length_words.describe().sort_values('mean', ascending=False)

| label | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Appeal_to_Authority | 144.0 | 23.201389 | 21.854119 | 2.0 | 10.0 | 16.0 | 29.25 | 130.0 |
| Causal_Oversimplification | 209.0 | 21.521531 | 12.616792 | 3.0 | 13.0 | 19.0 | 28.0 | 71.0 |
| Doubt | 493.0 | 21.144016 | 16.127337 | 1.0 | 9.0 | 17.0 | 29.0 | 141.0 |
| Black-and-White_Fallacy | 107.0 | 18.71028 | 13.242306 | 2.0 | 9.0 | 16.0 | 23.0 | 58.0 |
| Appeal_to_fear-prejudice | 294.0 | 17.047619 | 12.905449 | 1.0 | 6.0 | 13.5 | 25.0 | 74.0 |
| Whataboutism,Straw_Men,Red_Herring | 108.0 | 16.5 | 11.353702 | 1.0 | 8.0 | 15.0 | 22.25 | 71.0 |
| Bandwagon,Reductio_ad_hitlerum | 72.0 | 16.444444 | 12.281773 | 2.0 | 7.0 | 13.0 | 21.0 | 60.0 |
| Flag-Waving | 229.0 | 10.628821 | 11.630165 | 1.0 | 2.0 | 6.0 | 15.0 | 73.0 |
| Exaggeration,Minimisation | 466.0 | 7.44206 | 5.824851 | 1.0 | 3.0 | 6.0 | 10.0 | 43.0 |
| Thought-terminating_Cliches | 76.0 | 6.131579 | 4.024399 | 1.0 | 3.75 | 5.0 | 8.0 | 20.0 |


Long spans tend to consist of several connected sentences and/or quotes, with somewhat arbitrary cut-off points.

In [0]:
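# None shows full cell contents (passing -1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
# The span text lives in the 'text' column loaded above
df[df.length_words > 60][['label', 'text']]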

# Span Identification (sentence length)

## Sentences as split by NLTK

In [0]:
# quoting=3 (csv.QUOTE_NONE) is essential here: quote characters must be read as ordinary characters

df = pd.read_csv(SI_LABELS_FILE, sep='\t', names=['doc_id', 'sent_id', 'idx_start', 'idx_end', 'tokens', 'label'], encoding='utf-8', quoting=3)
df = df.drop(columns=['doc_id', 'idx_start', 'idx_end', 'label'])
df = df.groupby('sent_id')['tokens'].apply(list).to_frame()
df['sent_len'] = df['tokens'].apply(len)
# None shows full cell contents (passing -1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
df.head(15)

## Baseline
Sentences are split into fragments if they are longer than 35 tokens.
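
The baseline rule can be illustrated with a minimal sketch. The notebook only loads the already-split file, so how the fragments are actually cut is not shown here; the version below assumes plain consecutive chunks of at most 35 tokens, and `split_into_fragments` is a hypothetical helper name, not a function from the project.

```python
from typing import List

MAX_LEN = 35  # cutoff stated in the text


def split_into_fragments(tokens: List[str], max_len: int = MAX_LEN) -> List[List[str]]:
    """Cut a token list into consecutive fragments of at most max_len tokens.

    Assumption: fragments are fixed-size chunks; the real preprocessing may differ.
    """
    if max_len < 1:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# An 80-token dummy sentence is cut into three fragments.
sentence = [f"tok{i}" for i in range(80)]
fragments = split_into_fragments(sentence)
print([len(f) for f in fragments])  # [35, 35, 10]
```

A cut like this ignores sentence structure entirely, which is the motivation for splitting along punctuation instead.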

In [0]:
df_baseline = pd.read_csv(SI_LABELS_FILE_BASELINE, sep='\t', names=['doc_id', 'sent_id', 'idx_start', 'idx_end', 'tokens', 'label'], encoding='utf-8', quoting=3)
df_baseline = df_baseline.drop(columns=['doc_id', 'idx_start', 'idx_end', 'label'])
df_baseline = df_baseline.groupby('sent_id')['tokens'].apply(list).to_frame()
df_baseline['sent_len'] = df_baseline['tokens'].apply(len)
df_baseline.head(18)

## Improved splitting

- NLTK sentence splitter
- Always split along linebreaks
- If a sentence is too long: split along quotes, semicolons, colons, commas

In [0]:
df_improved = pd.read_csv(SI_LABELS_FILE_IMPROVED, sep='\t', names=['doc_id', 'sent_id', 'idx_start', 'idx_end', 'tokens', 'label'], encoding='utf-8', quoting=3)
df_improved = df_improved.drop(columns=['doc_id', 'idx_start', 'idx_end', 'label'])
df_improved = df_improved.groupby('sent_id')['tokens'].apply(list).to_frame()
df_improved['sent_len'] = df_improved['tokens'].apply(len)
df_improved.head(19)

In [0]:
pd.concat([df.describe(), df_baseline.describe(), df_improved.describe()], axis=1, sort=False)
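
One possible way to turn these summary statistics into a maximum-length decision is to check what share of sentences falls within each candidate cutoff. The sketch below uses a small made-up frame in place of the real data; `coverage` is a hypothetical helper, and the lengths and cutoffs are illustrative values only.

```python
import pandas as pd

# Toy stand-in for a frame with a 'sent_len' column (e.g. df, df_baseline or df_improved).
# The lengths are made up for illustration.
toy = pd.DataFrame({'sent_len': [5, 12, 20, 35, 36, 50, 80, 120]})


def coverage(frame: pd.DataFrame, cutoffs) -> pd.Series:
    """Share of sentences whose length is at most each candidate cutoff."""
    return pd.Series(
        {c: (frame['sent_len'] <= c).mean() for c in cutoffs},
        name='share_covered',
    )


result = coverage(toy, [35, 50, 100])
print(result)
```

Applied to the three real frames, the same call would show how much each splitting strategy reduces the number of sentences that exceed a given maximum length.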

- Punctuation marks add a lot of tokens to a sentence!
- Embedded quotes can be a problem
- Headlines without punctuation are recognized as part of the following sentence
- Some sentences are just very long (writing style)

Abbreviations cause the opposite problem, sentences that consist of a single token:

In [0]:
df_improved[df_improved.sent_len == 1][['sent_len', 'tokens']]