# `Segram` basics and single document analysis

This notebook covers the basics of using the `segram` package, with a particular
focus on the main functionality provided by the grammar classes
(`segram.Doc`, `segram.Sent`, `segram.Phrase` and `segram.Component`)
and on the analysis of a single document.

<div class="alert alert-block alert-info">
For a more detailed discussion of the package and its features,
including supported languages and language models, see the README file in the 
root directory of the <a href="https://github.com/sztal/segram">GitHub repository</a>.
</div>

<div class="alert alert-block alert-warning">
For the code to run without problems, make sure that `spacy`
and `segram` have been installed according to the instructions
in the README.
</div>

## Loading and configuring language model and pipeline components

`Segram` is built on the excellent <a href="https://spacy.io/">spacy</a>
package, which handles core NLP tasks such as tokenization,
part-of-speech tagging, dependency parsing and entity recognition. Thus, before
`segram` can be used, a `spacy` language model must be loaded and initialized.
The `segram` functionality itself is provided by a dedicated pipeline component,
which is registered with `spacy` automatically when the package is installed.

Crucially, to do any work we first need to download and install the appropriate
language models. In this respect `segram` depends entirely on `spacy`,
so we only need to download the thoroughly tested models offered by `spacy`.
We will need three different models:

1. **Main English model based on the transformer architecture.**
   In an environment in which `spacy` is already installed, the
   model can be obtained with the command `python -m spacy download en_core_web_trf`.
2. **Word vector model.** The transformer model is powerful, but it provides
   only context-dependent vectors and no static word vectors.
   The structured matching implemented in `segram` requires
   context-free word vectors, so we obtain them from a different model.
   This one can be downloaded with: `python -m spacy download en_core_web_lg`.
3. **Coreference resolution model.** The last model is used for
   coreference resolution. This is still an experimental feature that is not
   integrated into the core of `spacy`, which is why an extra model is needed.
   It can be installed with:
   `pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl`.
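
Before loading anything, it can be handy to confirm that all three models are actually present. The sketch below is one possible check, not part of `segram` itself. It relies on the fact that `spacy` models are installed as regular Python packages, and it assumes that the coreference wheel installs a package named `en_coreference_web_trf` (inferred from the wheel file name). The helper `missing_models` is a hypothetical name introduced only for this illustration.

```python
from importlib.util import find_spec

# Package names taken from the install commands listed above.
# The last one is an assumption based on the wheel file name.
REQUIRED_MODELS = ("en_core_web_trf", "en_core_web_lg", "en_coreference_web_trf")


def missing_models(names=REQUIRED_MODELS):
    """Return the names that cannot be found as importable top-level packages."""
    return [name for name in names if find_spec(name) is None]


missing = missing_models()
if missing:
    print("Missing models:", ", ".join(missing))
else:
    print("All required models are installed.")
```

If any name is reported as missing, rerun the corresponding install command from the list above before continuing.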

In [1]:
import spacy
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("segram", config={
    "vectors": "en_core_web_lg"
})
nlp.add_pipe("segram_coref")

<segram.nlp.pipeline.coref.Coref at 0x7f3f79892610>

Let us now go through the code we just executed.

1. Import the `spacy` package.
2. Load the English model using the `spacy.load` function.
3. Use the `.add_pipe` method to add the dedicated `segram` component.
   The `vectors` option in the config specifies the name of the model
   used for obtaining context-free word vectors. This is useful when
   the core NLP tasks are handled by a model without such vectors,
   such as `en_core_web_trf`.
4. Finally, use `.add_pipe` again to add the custom `segram` coreference
   resolution component.
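
A quick sanity check after these steps is to confirm that both components ended up in the pipeline. The following is a minimal sketch under stated assumptions: `missing_components` is a hypothetical helper written for this illustration, and the example list merely stands in for the names a real pipeline would report.

```python
# Component names as registered in the `add_pipe` calls above.
REQUIRED_COMPONENTS = ("segram", "segram_coref")


def missing_components(pipe_names, required=REQUIRED_COMPONENTS):
    """Return the required component names absent from `pipe_names`, in order."""
    present = set(pipe_names)
    return [name for name in required if name not in present]


# Hand-written list standing in for `nlp.pipe_names`;
# the leading names are illustrative only.
example_pipes = ["transformer", "tagger", "parser", "ner", "segram", "segram_coref"]
print(missing_components(example_pipes))         # []
print(missing_components(["tagger", "segram"]))  # ['segram_coref']
```

With the pipeline built above, the check would be `missing_components(nlp.pipe_names)`, where `pipe_names` is the standard `spacy` attribute listing component names in processing order.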

The next code chunk defines a short text taken from an article covering the war in Syria
(see `README` in the `examples/data` directory to learn more about the dataset).

<div class="alert alert-block alert-info">
We cover more complex analyses based on multiple texts
in the <tt>2-stories-and-frames.ipynb</tt> notebook.
</div>

In [2]:
text = (
    "Victims of a suspected chemical attack in Syria appeared to show symptoms"
    " consistent with reaction to a nerve agent the World Health Organization said on Wednesday."
    " \"Some cases appear to show additional signs consistent with exposure to "
    "organophosphorus chemicals a category of chemicals that includes nerve agents\""
    " WHO said in a statement putting the death toll at at least 70."
    " The United States has said the deaths were caused by sarin nerve gas dropped by Syrian aircraft."
    " Russia has said it believes poison gas had leaked from a rebel chemical weapons depot struck by Syrian bombs."
    " Sarin is an organophosporus compound and a nerve agent."
    " Chlorine and mustard gas which are also believed to have been used in the past in Syria are not."
    " A Russian Defence Ministry spokesman did not say what agent was used "
    "in the attack but said the rebels had used the same chemical weapons in Aleppo last year."
    " The WHO said it was likely that some kind of chemical was used in the attack "
    "because sufferers had no apparent external injuries and died from a rapid onset"
    " of similar symptoms including acute respiratory distress. "
    "It said its experts in Turkey were giving guidance to overwhelmed "
    "health workers in Idlib on the diagnosis and treatment of patients and "
    "medicines such as Atropine an antidote for some types of chemical exposure "
    "and steroids for symptomatic treatment had been sent. "
    "A U.N. Commission of Inquiry into human rights in Syria has previously said"
    " forces loyal to Syrian President Bashar al-Assad have used lethal chlorine gas on multiple occasions."
    " Hundreds of civilians died in a sarin gas attack in Ghouta on the outskirts of Damascus in August 2013."
    " Assads government has always denied responsibility for that attack."
    " Syria agreed to destroy its chemical weapons in 2013 under a deal brokered by Moscow and Washington."
    " But Russia a Syrian ally and China have repeatedly vetoed any United Nations "
    "move to sanction Assad or refer the situation in Syria to the International Criminal Court."
    " \"These types of weapons are banned by international law because they represent an intolerable barbarism\""
    " Peter Salama Executive Director of the WHO Health Emergencies Programme said in the WHO statement."
)

Now we are ready to create our first document object. We simply follow the
`spacy` API and pass the text to our language model, stored in the `nlp` variable.

In [18]:
doc = nlp(text)

Victims of a suspected chemical attack in Syria appeared to show symptoms consistent with reaction to a nerve agent the World Health Organization said on Wednesday. "Some cases appear to show additional signs consistent with exposure to organophosphorus chemicals a category of chemicals that includes nerve agents" WHO said in a statement putting the death toll at at least 70. The United States has said the deaths were caused by sarin nerve gas dropped by Syrian aircraft. Russia has said it believes poison gas had leaked from a rebel chemical weapons depot struck by Syrian bombs. Sarin is an organophosporus compound and a nerve agent. Chlorine and mustard gas which are also believed to have been used in the past in Syria are not. A Russian Defence Ministry spokesman did not say what agent was used in the attack but said the rebels had used the same chemical weapons in Aleppo last year. The WHO said it was likely that some kind of chemical was used in the attack because sufferers had no 

In [20]:
%%timeit
doc.grammar

1.68 µs ± 29 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
