# Demo Notebook

This notebook demonstrates a toolkit for text analysis in Digital Humanities research. The toolkit aims to make parsing, cleaning, and analyzing data easy and efficient, and it focuses on tabular text data.

In [1]:
from src.toolkit import visualization, config, dataloader, distinctiveness, collocation, frequency

vs = visualization.Style()
vs.set_default()

## Loading Configuration
Set the paths to the text data, stopword files, etc. in a JSON-formatted text file for easy access.

In [3]:
config_json = config.load('config.json')
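
As an illustration of what such a configuration file might look like, here is a minimal sketch using only the standard library. The key names (`text_data_path`, `stopword_path`, `pos_path`) are the ones used later in this notebook; the path values and the `load_config` helper are hypothetical and are not the toolkit's own implementation.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example values; only the key names come from this notebook.
example_config = {
    "text_data_path": "data/text",
    "stopword_path": "data/stopwords.txt",
    "pos_path": "data/pos",
}


def load_config(path):
    """Read a JSON file and return its contents as a dict (illustrative helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip the example through a temporary file to show the format.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")
    loaded = load_config(config_path)

print(loaded["text_data_path"])
```

Keeping all paths in one file means the rest of the notebook only ever refers to keys, so moving the data requires changing a single file.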

## Loading Data
Load your data by pointing the loader to a folder of text files. The loader can also clean the data based on stopwords and Part-of-Speech tags.

In [4]:
dl = dataloader.DataLoader(
    year_range=(1945, 1950),
    text_column='lemm_cleaned',
    data_path=config_json['text_data_path'],
    stopword_path=config_json['stopword_path'],
    load_text=True,
)
dl.load_pos(pos_path=config_json['pos_path'])
dl.load()
dl.clean(remove_stopwords=True, pos_types='all')
data = dl.data

	 > loading data ...


100%|██████████| 5/5 [00:03<00:00,  1.37it/s]


	 > cleaning data ...
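
To make the stopword step concrete, the sketch below filters stopwords from whitespace-tokenized strings. It is an assumption about what this kind of cleaning generally involves, not the toolkit's actual `clean` implementation; the `remove_stopwords` function, the stopword set, and the example rows are all hypothetical.

```python
def remove_stopwords(text, stopwords):
    """Drop stopwords from a whitespace-tokenized string (illustrative helper)."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


# Hypothetical stopword set and example rows, for illustration only.
stopwords = {"de", "het", "een", "van"}
rows = ["de kamer van het parlement", "een bericht van de voorzitter"]

cleaned = [remove_stopwords(row, stopwords) for row in rows]
print(cleaned)
```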


## Counting Words
Counting words is the basis of text mining, and the toolkit makes it easy.

In [None]:
fq = frequency.Frequency(data, 'lemm_cleaned', 'date', 'party-ref', 'month')
fq.get_total_tokens()
fq.count_word('parliament')
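
For readers who want to see the underlying idea, here is a minimal sketch of token counting with `collections.Counter`. It assumes whitespace-separated lemmas, as in the `lemm_cleaned` column, and is not the toolkit's `Frequency` class. The first two example strings are taken from the data preview in this notebook; the third is made up for illustration.

```python
from collections import Counter


def count_tokens(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


texts = [
    "delen kamer mede inkomen",
    "bericht lid verhinderen vergadering wonen",
    "kamer vergadering",  # made-up row, for illustration only
]

counts = count_tokens(texts)
total_tokens = sum(counts.values())
print(total_tokens, counts["kamer"])
```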

## Finding Distinctive Terms

Textual difference forms the basis of computational humanistic inquiry. Here we use log-likelihood estimates to find the terms that are distinctive of a specific category.

In [None]:
dst = distinctiveness.Distinctiveness(data=data, type_column='party-ref', text_column='lemm_cleaned')
dst.fit_vectorizer(max_features=10000, ngram_range=(1, 2))
dst_df = dst.get_likelihoods()
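
The notebook does not show which log-likelihood formula `get_likelihoods` uses. As a hedged illustration, the sketch below implements the two-corpus log-likelihood (G2) statistic that is common in corpus linguistics: it compares a term's observed frequency in a target corpus and a reference corpus with the frequencies expected if the term were equally common in both. The function name and the example counts are hypothetical.

```python
import math


def log_likelihood(freq_target, freq_reference, size_target, size_reference):
    """Two-corpus log-likelihood (G2) for a single term.

    freq_*: occurrences of the term in each corpus.
    size_*: total number of tokens in each corpus.
    """
    total_freq = freq_target + freq_reference
    total_size = size_target + size_reference
    # Expected frequencies if the term were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size

    g2 = 0.0
    # A zero count contributes nothing (the limit of x * log(x) as x -> 0).
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_reference > 0:
        g2 += freq_reference * math.log(freq_reference / expected_reference)
    return 2 * g2


# Hypothetical counts: same relative frequency vs. a clear difference.
same_rate = log_likelihood(10, 20, 1000, 2000)
different_rate = log_likelihood(50, 10, 1000, 1000)
print(same_rate, different_rate)
```

The statistic itself is unsigned: a high score means the term's frequency differs between the corpora, so the relative frequencies still need to be compared to see in which corpus the term is overrepresented.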

## Finding Collocates

Co-occurrence counts can be used to estimate how related two terms are by comparing how often they are observed together with how often they would be expected to occur together. Pass your data and choose a collocation measure and a window.

In [5]:
clc = collocation.Collocation(data=data, measure='dice')

In [6]:
clc.find_collocates()
clc.score_collocates()
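
To illustrate what a Dice score over a window measures, here is a small pure-Python sketch. It is an assumption-laden simplification, not the toolkit's or NLTK's implementation: it scores each ordered pair as `2 * f(x, y) / (f(x) + f(y))` using plain token counts, and the `dice_scores` function and the example tokens are hypothetical.

```python
from collections import Counter


def dice_scores(tokens, window=2):
    """Score ordered token pairs that co-occur within `window` positions.

    window=2 means adjacent tokens only. With larger windows and plain
    token counts, scores are no longer guaranteed to stay at or below 1.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        # Pair each token with the tokens that follow it inside the window.
        for right in tokens[i + 1 : i + window]:
            pair_counts[(left, right)] += 1
    return {
        pair: 2 * n / (word_counts[pair[0]] + word_counts[pair[1]])
        for pair, n in pair_counts.items()
    }


# Made-up token sequence, for illustration only.
tokens = "tweede kamer vergadering tweede kamer regering".split()
scores = dice_scores(tokens, window=2)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Here `("tweede", "kamer")` gets the highest score because the two tokens only ever occur next to each other.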

In [7]:
clc.find_term('parlement')

{}

In [9]:
clc.finder

<nltk.collocations.BigramCollocationFinder at 0x7f9702eff250>

In [13]:
from nltk.collocations import BigramAssocMeasures

In [14]:
dict(clc.finder.score_ngrams(BigramAssocMeasures.pmi))

{}


In [8]:
data

| Unnamed: 0 | speaker | role | party-ref | member-ref | speech_id | lemm_cleaned | date |
|---|---|---|---|---|---|---|---|
| 0 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | delen kamer mede inkomen | 1946-1-22 |
| 1 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | deg | 1946-1-22 |
| 2 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | bericht lid verhinderen vergadering wonen | 1946-1-22 |
| 3 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | bochove ongesteldheid bijlsma evenals vorig we... | 1946-1-22 |
| 4 | van Schaik | chair | nl.p.kvp | nl.m.01184 | nl.proc.sgd.d.194519460000125.1.2 | heeren schmal brule volgen dag ongesteldheid a... | 1946-1-22 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 29438 | Schokking | government | na | nl.m.01861 | nl.proc.sgd.d.194919500000642.3.16 | staan nederlands regering duitsland ochtend av... | 1950-9-14 |
| 29439 | Schokking | government | na | nl.m.01861 | nl.proc.sgd.d.194919500000642.3.16 | achten afvaardigen fens westduitsland recht ve... | 1950-9-14 |
| 29440 | Schokking | government | na | nl.m.01861 | nl.proc.sgd.d.194919500000642.3.24 | departement oorlog betreffen hierbij bedoeling... | 1950-9-14 |
| 29441 | Schokking | government | na | nl.m.01861 | nl.proc.sgd.d.194919500000642.3.24 | horen [s] welnu uitnodiging zeer concreet geac... | 1950-9-14 |
