A collection of tools for determining the association between arbitrary linguistic structures.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
bin
doc
pareidoscope
.gitignore
CHANGES.txt
LICENSE.txt
MANIFEST.in
README.md
README.rst
setup.py

README.md

Pareidoscope

The Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures, e.g. between words (collocations), between words and structures (collostructions) or between larger linguistic structures. For the underlying cooccurrence model, cf. Proisl (in preparation).

Installation

The Pareidoscope is available on PyPI and can be installed using pip:

pip3 install Pareidoscope

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/Pareidoscope.git

In the new directory, run the following command:

python3 setup.py install

Usage

Input formats

Corpora

Corpora can be provided in two different formats: In CoNLL-U format or in CWB-treebank format.

CoNLL-U is the format used for the treebanks of the Universal Dependencies project (Nivre et al., 2016). The format is specified in the UD documentation. Here is an example that has been adapted from the documentation:

1    They     they    PRON    PRP    _    2    nsubj    2:nsubj|4:nsubj    _
2    buy      buy     VERB    VBP    _    0    root     0:root             _
3    and      and     CONJ    CC     _    4    cc       4:cc               _
4    sell     sell    VERB    VBP    _    2    conj     0:root|2:conj      _
5    books    book    NOUN    NNS    _    2    obj      2:obj|4:obj        _
6    .        .       PUNCT   .      _    2    punct    2:punct            _

There are ten tab-separated columns. The first five columns are for the word ID, the word form, the lemma, the universal part-of-speech tag and a language-specific part-of-speech tag. Columns six and ten, which are empty in this example, are for morphological features and miscellaneous annotation. In columns seven to nine, the dependency analysis of this sentence is encoded. Columns seven and eight encode the basic dependencies which are required to form a tree. Column seven indicates the ID of the governor, column eight the type of the dependency relation between the governor and the current word. In column nine, an enhanced dependency graph can be represented that does not need to be a tree.

For details on the CWB-treebank format, cf. Proisl and Uhrig (2012).

Queries

The query graphs can be provided as JSON serializations of the node-link format understood by NetworkX (Hagberg et al., 2008). All command-line tools can operate on multiple queries, therefore a list of queries has to be provided, even for a single query. Here is an example of a one-element list containing the query graph for finding associated larger structures of monotransitive uses of the verb give with pareidoscope_associated_structures (more example queries are provided in the doc directory):

[
    {
        "graph": {
            "description": "Monotransitive uses of the verb give"
        },
        "nodes": [
            {
                "id": 0,
                "wc": "VERB",
                "lemma": "give",
                "focus_point": true,
                "not_outdep": ["iobj", "obl"]
            },
            {
                "id": 1
            }
        ],
        "links": [
            {
                "source": 0,
                "target": 1,
                "relation": "obj"
            }
        ]
    }
]

Queries are represented as dictionaries with two obligatory keys: nodes for the vertices and links for the edges. Under the key graph, additional information such as a description of the query can be stored. Both the vertices and the edges of the query graph are represented as lists of dictionaries. An edge is specified by the IDs of its source and target vertices and, optionally, by the kind of dependency relation. The vertices are required to have an ID and can have other, optional attributes.

The attributes that can be used for the vertices depend on the kind of query. The following attributes can always be used: word, pos, lemma, wc, root, not_indep (a list), not_outdep (a list). The first five attributes can also be negated by prefixing them with not_, e.g. "not_wc": "NOUN" for indicating that a vertex should not be a noun.

For determining the association strength between two structures with pareidoscope_association_strength, the following additional attributes can be used. The attribute query has to be used for every vertex and takes the values A, B or AB. This attribute indicates if the vertex belongs to GA, GB or to both, i.e. to GC. For vertices marked as "query": "AB", the optional attributes only_A and only_B can be used. These attributes are lists and indicate which other attributes only apply to GA or to GB. The focus point vertex of the graph can be marked by setting "focus_point": true. The attributes only_A and only_B can also be used for edges.

For simple collexeme analysis with pareidoscope_collexeme_analysis, the attribute collo_item has to be set to true for the collexeme vertex. This vertex is automatically the focus point.

For relational cooccurrences and covarying collexeme analysis with pareidoscope_covarying_collexemes, the attributes collo_A, collo_B have to be set to true for the two collexeme vertices. The attribute focus_point can be used to mark the focus point vertex.

For finding associated larger structures with pareidoscope_associated_structures, the focus point vertex can be marked by setting "focus_point": true.

Convert a corpus into an SQLite3 database

For most of the programs described below, it is necessary to convert your corpus into an SQLite3 database. This can considerably speed up highly selective queries; for very general queries that require that almost every sentence in the corpus is checked, this makes less of a difference.

Corpora in CoNNL-U or CWB-treebank format can be converted to an SQLite3 database using pareidoscope_corpus_to_sqlite. Running the program with the option -h outputs a help message with detailed usage information. Here is an example where we convert the training part of the English Universal Dependencies treebank (en-ud-train.conllu; we use the version included in the 2.0 release of the UD treebanks.) which is in CoNLL-U format, and create the database en-ud-train.db:

pareidoscope_corpus_to_sqlite --db en-ud-train.db --format conllu en-ud-train.conllu

Association between two linguistic structures

The program pareidoscope_association_strength determines the association strength between two linguistic structures.

Here is a sample query for the cooccurrence of the ditransitive with direct objects that have a determiner (this query and other queries can be found in the query file ex_association_two_structures.json):

[
    {
        "graph": {
            "description": "cooccurrence of the ditransitive with direct
                            objects that have a determiner"
        },
        "nodes": [
            {
                "id": 0,
                "wc": "VERB",
                "query": "AB",
                "focus_point": true
            },
            {
                "id": 1,
                "query": "A"
            },
            {
                "id": 2,
                "wc": "NOUN",
                "query": "AB"
            },
            {
                "id": 3,
                "query": "B"
            }
        ],
        "links": [
            {
                "source": 0,
                "target": 1,
                "relation": "iobj"
            },
            {
                "source": 0,
                "target": 2,
                "relation": "obj"
            },
            {
                "source": 2,
                "target": 3,
                "relation": "det"
            }
        ]
    }
]

The verb the and direct object are part of both linguistic structures and are therefore marked as AB. The indirect object only belongs to the ditransitive and is marked as A, the determiner only belongs to the other linguistic structure and is marked as B. Additionally, the verb is marked as the focus point vertex.

Here is an example for invoking the program (use the option -h for detailed usage information):

pareidoscope_association_strength --format db -o associations en-ud-train.db ex_association_two_structures.json

In this example, we run the queries specified in ex_association_two_structures.json on the corpus converted above. Option --format db indicates that we operate on an SQLite3 database (this program can also operate directly on corpus files in CoNLL-U or CWB-treebank format). The results are written to associations.tsv in a tab-separated format and contain, for every query and every counting method, the frequencies O11, R1, C1 and N, the number of inconsistencies and three association measures (log-likelihood, t-score, Dice coefficient).

Simple collexeme analysis

The program pareidoscope_collexeme_analysis performs a simple collexeme analysis, i.e. it determines the association strength between a linguistic structure and the word forms or lemmata that occur in a given slot of that structure. To this end, the collo item vertex has to be marked with "collo_item": true in the query. Here is an example query (taken from the query file ex_collexeme_analysis.json) that finds verbs that are associated with the ditransitive:

[
    {
        "graph": {
            "description": "Verbs associated with the ditransitive"
        },
        "nodes": [
            {
                "id": 0,
                "wc": "VERB",
                "collo_item": true
            },
            {
                "id": 1
            },
            {
                "id": 2
            }
        ],
        "links": [
            {
                "source": 0,
                "target": 1,
                "relation": "iobj"
            },
            {
                "source": 0,
                "target": 2,
                "relation": "obj"
            }
        ]
    }
]

Here is an example for invoking the program (use the option -h for detailed usage information):

pareidoscope_collexeme_analysis -o collexemes en-ud-train.db ex_collexeme_analysis.json

In this example, we run the queries specified in ex_collexeme_analysis.json on the corpus converted above. The program takes an optional option -c where we can specify if the collo items should be word forms or lemmata (the latter is the default).

The results are written to collexemes.tsv in a tab-separated format and contain, for every query and cooccurring lemma, the frequencies O11, R1, C1 and N and three association measures (log-likelihood, t-score, Dice coefficient). For simple collexeme analysis, three of the four counting methods are fully equivalent. Since counting sentences does not make much sense in this case because of the large number of inconsistencies that can be expected, we do not include that counting method. As a consequence, we do not need to distinguish between different counting methods and do not need to include a field for inconsistencies. The results are ordered by log-likelihood.

Relational cooccurrences and covarying collexeme analysis

The program pareidoscope_covarying_collexemes performs a covarying collexeme analysis which, for linguistic structures that consist of a single dependency relation, is equivalent to analyzing relational cooccurrences. The program determines the association between the word forms or lemmata that cooccur in two slots of a linguistic structure. To this end, the two slots have to be marked with "collo_A": true and "collo_B": true in the query. Here is an example query (taken from the query file ex_covarying_collexemes.json) that determines the association between the verbs in the into-causative:

[
    {
        "graph": {
            "description": "Into-causative, i.e. verb someone into verbing"
        },
        "nodes": [
            {
                "id": 0,
                "wc": "VERB",
                "collo_A": true
            },
            {
                "id": 1,
                "pos": "VBG",
                "collo_B": true
            },
            {
                "id": 2
            },
            {
                "id": 3,
                "lemma": "into"
            }
        ],
        "links": [
            {
                "source": 0,
                "target": 1,
                "relation": "advcl"
            },
            {
                "source": 0,
                "target": 2,
                "relation": "obj"
            },
            {
                "source": 1,
                "target": 3,
                "relation": "mark"
            }
        ]
    }
]

Here is an example for invoking the program (use the option -h for detailed usage information):

pareidoscope_covarying_collexemes -o covarying en-ud-train.db ex_covarying_collexemes.json

In this example, we run the queries specified in ex_covarying_collexemes.json on the corpus converted above. The program takes an optional option -c where we can specify if the cooccurring items should be word forms or lemmata (the latter is the default).

The results are written to covarying.tsv in a tab-separated format and contain, for every query, cooccurring pair of items and counting method, the frequencies O11, R1, C1 and N, the number of inconsistencies and three association measures (log-likelihood, t-score, Dice coefficient). The results are ordered by log-likelihood for counting focus points.

Associated larger structures

The program pareidoscope_associated_structures determines which larger delexicalized linguistic structures are associated with the query structure. It considers all star-like larger structures, i.e. structures where all new vertices have to be adjacent to a query vertex, that cooccur with the query structury in at least --min-coocc sentences (default: 5) and have a maximum of --max-size vertices (default: 7). The vertices of the larger structures are delexicalized and contain only word class information (the wc attribute). Here is an example query that looks for larger structures that are associated with monotransitive uses of the verb give:

[
    {
        "graph": {
            "description": "Monotransitive uses of the verb give"
        },
        "nodes": [
            {
                "id": 0,
                "wc": "VERB",
                "lemma": "give",
                "focus_point": true,
                "not_outdep": ["iobj", "obl"]
            },
            {
                "id": 1
            }
        ],
        "links": [
            {
                "source": 0,
                "target": 1,
                "relation": "obj"
            }
        ]
    }
]

Here is an example for invoking the program (use the option -h for detailed usage information):

pareidoscope_associated_structures -o assoc_struc en-ud-train.db ex_associated_structures.json

In this example, we run the queries specified in ex_associated_structures.json on the corpus converted above.

The results are written to assoc_struc.tsv in a tab-separated format and contain, for every query, associated larger structure and counting method, the frequencies O11, R1, C1 and N, the number of inconsistencies and three association measures (log-likelihood, t-score, Dice coefficient). The results are ordered by log-likelihood for counting focus points.

Visualizing associated structures

The associated larger structures output by pareidoscope_associated_structures are in the same node-link format as the query graphs and can be visualized with the program pareidoscope_draw_graphs. Note that this requires that Graphviz and the Python package pydot are installed on your computer.

Here is an example for invoking the program (use the option -h for detailed usage information):

tail -n +2 assoc_struc.tsv | head | cut -f2 | pareidoscope_draw_graphs -o draw -

In this example, we use the output file created by the previous command, extract the ten most strongly associated larger structures (using GNU coreutils) and draw them. The images are written to the directory draw. Here are the visualizations created for the four larger structures that are most strongly associated with monotransitive give.

Rank 1 Rank 2 Rank 3 Rank 4