# `Segram` basics and single document analysis

This notebook covers the basics of using the `segram` package, with a particular
focus on the main functionalities provided by grammar classes
(`segram.Doc`, `segram.Sent`, `segram.Phrase` and `segram.Component`)
and on the analysis of a single document.

<div class="alert alert-block alert-info">
For a more detailed discussion of the package and its features,
including supported languages and language models, see the README in the
root directory of the <a href="https://github.com/sztal/segram">GitHub repository</a> and the official documentation.
<br><br>
Furthermore, while this and other sample notebooks try to be as self-contained
as possible, some basic familiarity with the <tt>spacy</tt> package is very much
recommended (it has excellent <a href="https://spacy.io/usage">documentation</a>).
</div>

<div class="alert alert-block alert-warning">
For the code to run without problems, make sure that <tt>spacy</tt>
and <tt>segram</tt> have been installed according to the instructions
in the README.
</div>

## Loading and configuring language model and pipeline components

`Segram` is based on the excellent <a href="https://spacy.io/">spacy</a>
package, which is used to solve core NLP tasks such as tokenization,
POS and dependency tagging, and entity recognition. Thus, before `segram` can be
used, it is necessary to load and initialize a `spacy` language model.
The functionalities of `segram` are provided by a
dedicated pipeline component, which is automatically registered with `spacy`
when the package is installed. **Adding this component to a `spacy` pipeline
is the correct way to parse texts and create document objects with `segram`**.

Crucially, to do any work we first need to download and install appropriate
language models. In this respect `segram` is fully dependent on `spacy`,
so we need only to download thoroughly tested models offered by `spacy`.
We will need three different models:

1. **Main English model based on the transformer architecture.**
   In an environment in which `spacy` is already installed the
   model can be obtained with the command `python -m spacy download en_core_web_trf`.
2. **Word vector model.** The transformer model is powerful, but it provides
   only context-dependent vectors and no static word vectors.
   The structured matching implemented in `segram` needs access
   to context-free word vectors, so we obtain them from a different model.
   This one can be downloaded with: `python -m spacy download en_core_web_lg`.
3. **Coreference resolution model.** The last model is used for
   coreference resolution. This is still an experimental feature that is not integrated
   into the core of `spacy`, which is why an extra model is needed.
   It can be installed with:
   `pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl`.
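
Before loading the pipeline, it can be handy to check that all three models are available. The sketch below uses only the standard library and relies on the fact that `spacy` models are installed as regular Python packages. The helper name `missing_models` is made up for this example, and the coreference package name is assumed to match the name of the wheel file above.

```python
from importlib.util import find_spec

# Package names of the three models discussed above
# (the last one is an assumption based on the wheel file name)
REQUIRED_MODELS = (
    "en_core_web_trf",
    "en_core_web_lg",
    "en_coreference_web_trf",
)


def missing_models(names=REQUIRED_MODELS):
    """Return the names that cannot be found as installed packages."""
    return [name for name in names if find_spec(name) is None]


# An empty list means that all models can be imported
print("Missing models:", missing_models())
```

If the list is not empty, the commands given above can be used to install whatever is missing.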

In [3]:
import spacy
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("segram", config={
    "vectors": "en_core_web_lg"
})
nlp.add_pipe("segram_coref")

<segram.nlp.pipeline.coref.Coref at 0x7f1b2723db50>

Let us now go through the code we just executed.

1. Import the `spacy` package.
2. Load the English model using the `spacy.load` function.
3. Use the `.add_pipe` method to add the dedicated `segram` component.
   The `vectors` option in the config specifies the name of the model
   used for obtaining context-free word vectors. This is useful when
   the model used for the core NLP tasks, such as `en_core_web_trf`,
   does not provide such vectors.
4. Finally, use `.add_pipe` again to add the `segram_coref` component for
   coreference resolution.

## Parsing texts into documents & the `segram` data model

In this section we will learn how to parse raw texts (`str` objects)
to get document objects providing us with all the nice methods for doing
semantic grammar analysis. Along the way we will see a quick overview
of the data model of `segram`.

The code chunk below defines a simple text from an article covering the war in Syria
(see `README` in the `examples/data` directory to learn more about the dataset).

<div class="alert alert-block alert-info">
We cover more complex analyses based on multiple texts
in the <tt>2-stories-and-frames.ipynb</tt> notebook.
</div>

In [4]:
text = (
    "Victims of a suspected chemical attack in Syria appeared to show symptoms"
    " consistent with reaction to a nerve agent the World Health Organization said on Wednesday."
    " \"Some cases appear to show additional signs consistent with exposure to "
    "organophosphorus chemicals a category of chemicals that includes nerve agents\""
    " WHO said in a statement putting the death toll at at least 70."
    " The United States has said the deaths were caused by sarin nerve gas dropped by Syrian aircraft."
    " Russia has said it believes poison gas had leaked from a rebel chemical weapons depot struck by Syrian bombs."
    " Sarin is an organophosporus compound and a nerve agent."
    " Chlorine and mustard gas which are also believed to have been used in the past in Syria are not."
    " A Russian Defence Ministry spokesman did not say what agent was used "
    "in the attack but said the rebels had used the same chemical weapons in Aleppo last year."
    " The WHO said it was likely that some kind of chemical was used in the attack "
    "because sufferers had no apparent external injuries and died from a rapid onset"
    " of similar symptoms including acute respiratory distress. "
    "It said its experts in Turkey were giving guidance to overwhelmed "
    "health workers in Idlib on the diagnosis and treatment of patients and "
    "medicines such as Atropine an antidote for some types of chemical exposure "
    "and steroids for symptomatic treatment had been sent. "
    "A U.N. Commission of Inquiry into human rights in Syria has previously said"
    " forces loyal to Syrian President Bashar al-Assad have used lethal chlorine gas on multiple occasions."
    " Hundreds of civilians died in a sarin gas attack in Ghouta on the outskirts of Damascus in August 2013."
    " Assads government has always denied responsibility for that attack."
    " Syria agreed to destroy its chemical weapons in 2013 under a deal brokered by Moscow and Washington."
    " But Russia a Syrian ally and China have repeatedly vetoed any United Nations "
    "move to sanction Assad or refer the situation in Syria to the International Criminal Court."
    " \"These types of weapons are banned by international law because they represent an intolerable barbarism\""
    " Peter Salama Executive Director of the WHO Health Emergencies Programme said in the WHO statement."
)

Now we are ready to create our first document object. We simply follow the `spacy`
API and pass the text to our language model stored in the `nlp` variable.

In [10]:
spacy_doc = nlp(text)
spacy_doc   # type(spacy_doc) --> spacy.tokens.doc.Doc

Victims of a suspected chemical attack in Syria appeared to show symptoms consistent with reaction to a nerve agent the World Health Organization said on Wednesday. "Some cases appear to show additional signs consistent with exposure to organophosphorus chemicals a category of chemicals that includes nerve agents" WHO said in a statement putting the death toll at at least 70. The United States has said the deaths were caused by sarin nerve gas dropped by Syrian aircraft. Russia has said it believes poison gas had leaked from a rebel chemical weapons depot struck by Syrian bombs. Sarin is an organophosporus compound and a nerve agent. Chlorine and mustard gas which are also believed to have been used in the past in Syria are not. A Russian Defence Ministry spokesman did not say what agent was used in the attack but said the rebels had used the same chemical weapons in Aleppo last year. The WHO said it was likely that some kind of chemical was used in the attack because sufferers had no 

This gave us a standard `spacy` document. To enjoy `segram` features we need
to convert it to a so-called **grammar document**, a wrapper object that
sits on top of the `spacy` document (using composition instead of inheritance)
and exposes the semantic grammar framework to the user.
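
The idea of "composition instead of inheritance" can be illustrated with a minimal pure-Python sketch. The classes `PlainDoc` and `GrammarDocSketch` are hypothetical stand-ins that only mimic the design; they are not the classes used by `spacy` or `segram`.

```python
class PlainDoc:
    """Stand-in for a parsed document (hypothetical, not a spacy class)."""

    def __init__(self, text):
        self.text = text
        self.tokens = text.split()


class GrammarDocSketch:
    """Wrapper that holds a document instead of subclassing it."""

    def __init__(self, doc):
        # Composition: the wrapped document is stored as an attribute
        self.doc = doc

    @property
    def n_tokens(self):
        # Extra functionality is computed from the wrapped document
        return len(self.doc.tokens)


plain = PlainDoc("Sarin is a nerve agent .")
grammar = GrammarDocSketch(plain)

print(isinstance(grammar, PlainDoc))  # the wrapper is not a document itself
print(grammar.doc is plain)           # but it keeps a reference to one
```

With this design the original document stays untouched and remains accessible, while the wrapper is free to define its own interface.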

Since this document is composed of multiple sentences, let us first focus
on the first one. During parsing, documents are segmented into sentences
(and sentences into tokens, which typically, but not necessarily, correspond to individual words),
so we can easily iterate over the sentences of a document.

In [12]:
doc = spacy_doc._.segram  # convert the spacy document to a grammar doc
sent = doc.sents[0]       # take the first sentence
sent

[38;5;220mVictims[0m [38;5;191mof[0m [38;5;220ma[0m [38;5;219msuspected[0m [38;5;219mchemical[0m [38;5;220mattack[0m [38;5;191min[0m [38;5;220mSyria[0m [38;5;196mappeared[0m [38;5;196mto[0m [38;5;196mshow[0m [38;5;220msymptoms[0m [38;5;219mconsistent[0m [38;5;191mwith[0m [38;5;220mreaction[0m [38;5;191mto[0m [38;5;220ma[0m [38;5;220mnerve[0m [38;5;220magent[0m [38;5;220mthe World Health Organization[0m [38;5;196msaid[0m [38;5;191mon[0m [38;5;220mWednesday[0m[38;5;196m.[0m 

We can readily tell that we are dealing now with a sentence coming from a 
grammar document produced by `segram` as it is printed differently and uses 
colors to denote different **components**.

Components are groups of tokens (not necessarily contiguous) controlled
by a syntactically (and semantically) important head token. There are four
types of components, which roughly correspond to the main parts-of-speech
(i.e. are based on a coarse-grained POS typology):

* Verbs (red) with alias `Verb`
* Nouns (yellow/orange) with alias `Noun`
* Descriptions (pink/violet) with alias `Desc`
* Prepositions (green/lime) with alias `Prep`
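
In a terminal or notebook the colors are produced with ANSI escape sequences. The sketch below shows how a component could render its head token in the color of its type. The color numbers are the ones visible in the raw output of this notebook; the class `ColoredComponent` and its fields are hypothetical and do not reflect the actual `segram` implementation.

```python
from dataclasses import dataclass

# 256-color ANSI codes for each component type
COLORS = {"Verb": 196, "Noun": 220, "Desc": 219, "Prep": 191}


@dataclass(frozen=True)
class ColoredComponent:
    """Toy component: a group of tokens controlled by a head token."""

    alias: str     # component type, e.g. "Verb"
    tokens: tuple  # words belonging to the component
    head: int      # index of the head token within `tokens`

    def to_str(self, color=False):
        words = list(self.tokens)
        if color:
            code = COLORS[self.alias]
            # Wrap only the head token in a color escape sequence
            words[self.head] = f"\x1b[38;5;{code}m{words[self.head]}\x1b[0m"
        return " ".join(words)


verb = ColoredComponent("Verb", ("are", "banned"), head=1)
print(verb.to_str())            # plain text
print(verb.to_str(color=True))  # head token printed in red
```

For simplicity the toy component joins its tokens as if they were adjacent, whereas real components may consist of non-contiguous tokens.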

Why do we care about the aliases? To understand this, let us check the
actual class of, say, a verb component in our sentence.

In [36]:
verb = sent.verbs[0]  
# use `.nouns`, `.descs` and `.preps` to access other types
# all components are given by `.components` property
type(verb), verb.alias

(segram.nlp.backend.rulebased.lang.en.grammar.components.RulebasedEnglishVerb,
 'Verb')

Clearly, its class is not simply `Verb` but `RulebasedEnglishVerb`. This is an
implementation detail that we do not need to care about. Note, however, that its
alias is simply the string `"Verb"`.

In principle, we could look for
verbs using a standard `isinstance` check, but this requires locating and importing
the base component classes appropriate for testing. It is therefore convenient to have
an alternative way of testing whether a component belongs to a given type
just by providing a string, which is easier to do on the fly. This is one
of several reasons why aliases are useful.
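
The difference between the two kinds of tests can be sketched with a few toy classes. All class names below are hypothetical, and the semantics of `match` are an assumption made for illustration only.

```python
class BaseComponentSketch:
    """Hypothetical base class; every component type defines an alias."""

    alias = "Component"

    def match(self, *, alias=None):
        # Assumed semantics: an omitted criterion always passes
        return alias is None or self.alias == alias


class VerbSketch(BaseComponentSketch):
    alias = "Verb"


class NounSketch(BaseComponentSketch):
    alias = "Noun"


class RulebasedEnglishVerbSketch(VerbSketch):
    """Concrete implementation class; the alias is inherited unchanged."""


components = [RulebasedEnglishVerbSketch(), NounSketch()]

# Class-based test: the base class must be imported or otherwise in scope
by_class = [c for c in components if isinstance(c, VerbSketch)]
# Alias-based test: a plain string is enough
by_alias = [c for c in components if c.match(alias="Verb")]

print(by_class == by_alias)
```

Both tests select the same objects, but the alias-based one does not require knowing where the base classes live.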

In [39]:
# Finding verbs using `isinstance`
from segram.grammar import Verb
sent.components.filter(isinstance, Verb)

(are [38;5;196mbanned[0m, [38;5;196mrepresent[0m, [38;5;196msaid[0m)

In [40]:
# Using alias (importing `Verb` is not necessary)
sent.components.filter("match", alias="Verb")

(are [38;5;196mbanned[0m, [38;5;196mrepresent[0m, [38;5;196msaid[0m)

Note that we used a custom framework for dealing with the rich data produced
by `segram`. Namely, the `components` property (and the same applies to all data
properties and attributes of grammar classes) returns something that looks like
a tuple. It is not a vanilla tuple, however, but an instance of its subclass called
`DataTuple`.

In [42]:
type(sent.components), sent.components

(segram.datastruct.collections.DataTuple,
 (These [38;5;220mtypes[0m,
  [38;5;191mof[0m,
  [38;5;220mweapons[0m,
  are [38;5;196mbanned[0m,
  [38;5;191mby[0m,
  [38;5;219minternational[0m,
  [38;5;220mlaw[0m,
  [38;5;220mthey[0m[types],
  [38;5;196mrepresent[0m,
  [38;5;219mintolerable[0m,
  an [38;5;220mbarbarism[0m,
  [38;5;220mPeter Salama[0m,
  [38;5;220mExecutive[0m,
  [38;5;220mDirector[0m,
  [38;5;191mof[0m,
  [38;5;220mthe WHO Health Emergencies Programme[0m,
  [38;5;196msaid[0m,
  [38;5;191min[0m,
  [38;5;220mWHO[0m,
  the [38;5;220mstatement[0m))

`DataTuple` is just a plain old tuple with several additional methods and properties for data
filtering and processing, most importantly `map`, `filter`, `sort`, `pipe`,
`groupby`, `get` and `flat`. We will discuss them in more detail shortly,
but first let us try something slightly more useful and count occurrences
of distinct descriptions in the whole document based on their lemmas.
To do so, we will use the `Counter` class from the `collections` module
of the Python standard library.
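
To make the idea of a tuple with extra data-processing methods more concrete, here is a minimal sketch of such a class. `DataTupleSketch` is hypothetical: the behavior of `map`, `filter`, `get` and `flat` is inferred from how they are used in this notebook and is not the actual `segram` implementation.

```python
from collections import Counter
from types import SimpleNamespace


class DataTupleSketch(tuple):
    """Toy tuple subclass with a few chainable helpers."""

    def map(self, func, *args, **kwargs):
        return type(self)(func(x, *args, **kwargs) for x in self)

    def filter(self, func, *args, **kwargs):
        if isinstance(func, str):
            # A string is treated as the name of a method of each element
            name = func
            func = lambda x, *a, **kw: getattr(x, name)(*a, **kw)
        return type(self)(x for x in self if func(x, *args, **kwargs))

    def get(self, name):
        # Collect the same attribute from every element
        return type(self)(getattr(x, name) for x in self)

    @property
    def flat(self):
        # Flatten one level of nesting
        return type(self)(item for group in self for item in group)


# Two toy "sentences", each with a tuple of descriptions
sents = DataTupleSketch([
    SimpleNamespace(descs=("suspected", "chemical")),
    SimpleNamespace(descs=("chemical", "Syrian")),
])

descs = sents.get("descs").flat
counts = Counter(descs)
prefixed = descs.filter("startswith", "ch")
print(descs, counts, prefixed, sep="\n")
```

Because every helper returns a new instance of the same class, calls can be chained, and the result can still be passed to anything that accepts a regular tuple, such as `Counter`.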

Below we describe this simple processing pipeline step-by-step in the comments
in the code.

In [49]:
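from collections import Counter

# `.get("descs")` collects the description components of every sentence,
# `.flat` flattens the per-sentence groups into a single sequence,
# and `Counter` tallies the result (keys are the component objects themselves)
Counter(
    doc.sents
        .get("descs")
        .flat
)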

Counter({[38;5;219msuspected[0m: 1,
         [38;5;219mconsistent[0m: 1,
         [38;5;219madditional[0m: 1,
         [38;5;219mconsistent[0m: 1,
         [38;5;219mSyrian[0m: 1,
         [38;5;219mrebel[0m: 1,
         [38;5;219mSyrian[0m: 1,
         [38;5;219malso[0m: 1,
         [38;5;219msame[0m: 1,
         [38;5;219mchemical[0m: 1,
         [38;5;219mlikely[0m: 1,
         [38;5;219mapparent[0m: 1,
         [38;5;219mexternal[0m: 1,
         [38;5;219mrapid[0m: 1,
         [38;5;219msimilar[0m: 1,
         [38;5;219macute[0m: 1,
         [38;5;219mrespiratory[0m: 1,
         [38;5;219mits[0m[the World Health Organization]: 1,
         [38;5;219moverwhelmed[0m: 1,
         [38;5;219msuch[0m: 1,
         [38;5;219mchemical[0m: 1,
         [38;5;219msymptomatic[0m: 1,
         [38;5;219mpreviously[0m: 1,
         [38;5;219mloyal[0m: 1,
         [38;5;219mSyrian[0m: 1,
         [38;5;219mlethal[0m: 1,
         [38;5;219mmultiple

In [29]:
doc.sents.get("components").flat.filter("match", alias="Desc")

([38;5;219msuspected[0m,
 [38;5;219mconsistent[0m,
 [38;5;219madditional[0m,
 [38;5;219mconsistent[0m,
 [38;5;219mSyrian[0m,
 [38;5;219mrebel[0m,
 [38;5;219mSyrian[0m,
 [38;5;219malso[0m,
 [38;5;219msame[0m,
 [38;5;219mchemical[0m,
 [38;5;219mlikely[0m,
 [38;5;219mapparent[0m,
 [38;5;219mexternal[0m,
 [38;5;219mrapid[0m,
 [38;5;219msimilar[0m,
 [38;5;219macute[0m,
 [38;5;219mrespiratory[0m,
 [38;5;219mits[0m[the World Health Organization],
 [38;5;219moverwhelmed[0m,
 [38;5;219msuch[0m,
 [38;5;219mchemical[0m,
 [38;5;219msymptomatic[0m,
 [38;5;219mpreviously[0m,
 [38;5;219mloyal[0m,
 [38;5;219mSyrian[0m,
 [38;5;219mlethal[0m,
 [38;5;219mmultiple[0m,
 [38;5;219malways[0m,
 [38;5;219mits[0m[Syria],
 [38;5;219mchemical[0m,
 [38;5;219mSyrian[0m,
 [38;5;219mrepeatedly[0m,
 [38;5;219minternational[0m,
 [38;5;219mintolerable[0m)

In [34]:
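# Select sentences containing a phrase with alias "VP" whose subject
# mentions the WHO and whose verb has the lemma "say"
S = doc.sents \
    .get("proots").flat \
    .filter("match", **{
        "alias": "VP",
        # subject matches "World Health Organization" or "WHO"
        "subj": lambda x:
            x.filter("match", "World Health Organization|WHO").any(),
        # head of the verb has the lemma "say"
        "verb": lambda x:
            x.get("head").filter("match", lemma="say").any()
    }) \
    .get("sent") \
    .unique()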

In [35]:
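# Print all sentences, coloring only those selected in `S`;
# the loop variable is `s` so that `sent` defined earlier is not overwritten
for s in doc.sents:
    print(s.to_str(color=s in S))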

[38;5;220mVictims[0m [38;5;191mof[0m [38;5;220ma[0m [38;5;219msuspected[0m [38;5;219mchemical[0m [38;5;220mattack[0m [38;5;191min[0m [38;5;220mSyria[0m [38;5;196mappeared[0m [38;5;196mto[0m [38;5;196mshow[0m [38;5;220msymptoms[0m [38;5;219mconsistent[0m [38;5;191mwith[0m [38;5;220mreaction[0m [38;5;191mto[0m [38;5;220ma[0m [38;5;220mnerve[0m [38;5;220magent[0m [38;5;220mthe World Health Organization[0m [38;5;196msaid[0m [38;5;191mon[0m [38;5;220mWednesday[0m[38;5;196m.[0m 
[38;5;196m"[0m[38;5;220mSome[0m [38;5;220mcases[0m [38;5;196mappear[0m [38;5;196mto[0m [38;5;196mshow[0m [38;5;219madditional[0m [38;5;220msigns[0m [38;5;219mconsistent[0m [38;5;191mwith[0m [38;5;220mexposure[0m [38;5;191mto[0m [38;5;220morganophosphorus[0m [38;5;220mchemicals[0m [38;5;220ma[0m [38;5;220mcategory[0m [38;5;191mof[0m [38;5;220mchemicals[0m [38;5;220mthat[0m [38;5;196mincludes[0m [38;5;220mnerve[0m [38;5;220magent