# Demo Notebook

This notebook demonstrates a toolkit for text data analysis in Digital Humanities research. The toolkit aims to make parsing, cleaning, and analyzing data easy and efficient, and it focuses on tabular text data.

In [4]:
from src.toolkit import visualization, config, dataloader, distinctiveness, collocation

vs = visualization.Style()
vs.set_default()

## Loading Configuration
Set the paths to text data, stopword files, and other resources in a JSON-formatted text file for easy access.

In [5]:
config_json = config.load('config.json')
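
As a rough illustration of what such a configuration file might contain, the sketch below writes and reads back a small JSON file using only the standard library. The key names are the ones this notebook reads from `config_json`; the path values and the helper `load_config` are hypothetical, and the toolkit's own `config.load` may behave differently.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example values; only the key names come from this notebook.
example_config = {
    "text_data_path": "data/text/",
    "stopword_path": "data/stopwords.txt",
    "pos_path": "data/pos/",
}


def load_config(path):
    """Read a JSON file and return its contents as a dict (illustrative helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Write the example to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")
    loaded = load_config(config_path)

print(sorted(loaded))
```

Keeping all paths in one file means the analysis cells never hard-code locations, so the same notebook can run against a different corpus by editing the JSON alone.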

## Loading Data
Load your data by pointing the loader to a folder of text files. The loader can also clean the data by removing stopwords and filtering on part-of-speech tags.

In [6]:
dl = dataloader.DataLoader(
    year_range=(1945, 1946),
    text_column='lemm_cleaned',
    data_path=config_json['text_data_path'],
    stopword_path=config_json['stopword_path'],
    load_text=True,
)
dl.load_pos(pos_path=config_json['pos_path'])
dl.load()
# Drop stopwords and keep only tokens tagged N, ADJ, or WW
dl.clean(remove_stopwords=True, pos_types=['N', 'ADJ', 'WW'])

	 > loading data ...


100%|██████████| 1/1 [00:00<00:00,  3.25it/s]


	 > cleaning data ...


In [7]:
data = dl.data
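
The cleaning step above filters on stopwords and part-of-speech tags. A minimal sketch of that idea in plain Python follows; it is not the toolkit's implementation. The function `clean_tokens`, the sample sentence, the stopword set, and the `LID` tag are assumptions made for illustration, while the tags `N`, `ADJ`, and `WW` are the ones passed to `dl.clean`.

```python
def clean_tokens(tagged_tokens, stopwords, pos_types):
    """Keep lemmas whose POS tag is in pos_types and that are not stopwords."""
    return [
        lemma
        for lemma, tag in tagged_tokens
        if tag in pos_types and lemma.lower() not in stopwords
    ]


# Hypothetical tagged sentence: (lemma, POS tag) pairs.
tagged = [
    ("de", "LID"),
    ("vergadering", "N"),
    ("zijn", "WW"),
    ("vorig", "ADJ"),
    ("week", "N"),
    ("sluiten", "WW"),
]
stopwords = {"de", "zijn"}  # hypothetical stopword list

cleaned = clean_tokens(tagged, stopwords, pos_types={"N", "ADJ", "WW"})
print(" ".join(cleaned))  # vergadering vorig week sluiten
```

Here "de" is dropped because its tag is not in the allowed set, and "zijn" is dropped because it is a stopword even though its tag is allowed.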

## Counting Words
Counting words is the basis of text mining. The toolkit reports the total number of tokens and the frequency of individual words.

In [None]:
from src.toolkit import frequency

fq = frequency.Frequency(data, 'lemm_cleaned', 'date', 'party-ref', 'month')
fq.get_total_tokens()
# The corpus holds Dutch lemmas, so query the Dutch form
fq.count_word('parlement')
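
For readers who want to see what word counting amounts to, here is a minimal standard-library sketch. It assumes, as the sample data suggests, that each row of the text column is a string of space-separated lemmas; the three example rows are made up, and the toolkit's `Frequency` class may count differently.

```python
from collections import Counter

# Hypothetical rows mimicking the 'lemm_cleaned' column: space-separated lemmas.
texts = [
    "parlement vergadering sluiten",
    "motie parlement beraadslaging",
    "beraadslaging sluiten",
]

# Split every row on whitespace and tally the resulting tokens.
counts = Counter(token for text in texts for token in text.split())
total_tokens = sum(counts.values())

print(total_tokens)         # 8
print(counts["parlement"])  # 2
```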

## Finding Distinctive Terms

Textual difference forms the basis for computational humanistic inquiry. Here we use log-likelihood estimates to find the terms that are distinctive for a specific category.

In [None]:
dst = distinctiveness.Distinctiveness(
    data=data, type_column='party-ref', text_column='lemm_cleaned'
)
dst.fit_vectorizer(max_features=10000, ngram_range=(1, 2))
dst_df = dst.get_likelihoods()
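
To make the log-likelihood idea concrete, the sketch below computes the two-term form of the G2 statistic that is often used to compare a term's frequency in one category against the rest of a corpus. This is an assumption about the general technique, not a description of what `get_likelihoods` computes; the function name and the example counts are hypothetical.

```python
import math


def log_likelihood(freq_target, total_target, freq_rest, total_rest):
    """Two-term G2 score for one term: target category versus the rest."""
    combined = freq_target + freq_rest
    total = total_target + total_rest
    # Expected frequencies if the term were equally likely in both parts.
    expected_target = total_target * combined / total
    expected_rest = total_rest * combined / total
    g2 = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_rest, expected_rest),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: 30 hits in 1,000 tokens versus 10 hits in 1,000 tokens.
score = log_likelihood(30, 1000, 10, 1000)
print(round(score, 2))
```

A score near zero means the term is used at about the same rate in both parts. Larger scores indicate a stronger difference, but the score itself does not say which side overuses the term, so the relative frequencies should be compared as well.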

## Finding Collocates

Co-occurrence statistics estimate how strongly two terms are related by comparing their observed co-occurrence with the co-occurrence expected by chance. Pass your data and choose a collocation metric and a window size.

In [8]:
clc = collocation.Collocation(data=data, measure='dice')

In [9]:
clc.find_collocates()
clc.score_collocates()
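
As an illustration of the `'dice'` measure chosen above, the following plain-Python sketch scores ordered word pairs with the Dice coefficient, 2 * f(x, y) / (f(x) + f(y)). It is a simplified stand-in rather than the toolkit's or NLTK's implementation; the function `dice_collocates`, its `window` parameter, and the sample tokens are assumptions.

```python
from collections import Counter


def dice_collocates(tokens, window=2):
    """Score ordered pairs that co-occur within `window` positions.

    With window=2 only directly adjacent tokens are paired.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the current token with the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            pair_counts[(left, right)] += 1
    return {
        pair: 2 * n / (word_counts[pair[0]] + word_counts[pair[1]])
        for pair, n in pair_counts.items()
    }


# Hypothetical token sequence.
tokens = "motie aannemen motie verwerpen motie aannemen".split()
scores = dice_collocates(tokens, window=2)

print(scores[("motie", "aannemen")])   # 0.8
print(scores[("motie", "verwerpen")])  # 0.5
```

The pair ("motie", "aannemen") scores higher because it accounts for a larger share of the occurrences of both words.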

In [10]:
clc.find_term('parlement')

{}

In [13]:
clc.finder

<nltk.collocations.BigramCollocationFinder at 0x7f4f9734faf0>

In [14]:
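from nltk.metrics import BigramAssocMeasures

# score_ngrams expects a scoring function, not a string
dict(clc.finder.score_ngrams(BigramAssocMeasures.pmi))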

{}

In [15]:
data

| Unnamed: 0 | speaker | role | party-ref | member-ref | speech_id | lemm_cleaned | date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | delen inkomen | 1946-1-22 |
| 1 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | deg | 1946-1-22 |
| 2 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | bericht lid verhinderen vergadering wonen | 1946-1-22 |
| 3 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | ongesteldheid vorig week volgen week ongesteld... | 1946-1-22 |
| 4 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | heeren volgen dag ongesteldheid volgen dag slu... | 1946-1-22 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 13341 | van Poll | mp | nl.p.kvp | nl.m.01047 | nl.proc.sgd.d.194519460000145.2.36 | interpretatie uitblijven invrijheidstelling be... | 1946-5-7 |
| 13342 | van Poll | mp | nl.p.kvp | nl.m.01047 | nl.proc.sgd.d.194519460000145.2.36 | overzeesche gebied deelen motie constateering ... | 1946-5-7 |
| 13343 | van Poll | mp | nl.p.kvp | nl.m.01047 | nl.proc.sgd.d.194519460000145.2.36 | beraadslaging sluiten | 1946-5-7 |
| 13344 | van der Leeuw | government | na | nl.m.00793 | nl.proc.sgd.d.194519460000145.4.10 | interpellatie aangaan geldleeningen kredietove... | 1946-5-7 |
