# Demo notebook for interactive plotting and data analysis
SSC, September 2022

This demonstrates usage of the interactive plotting methods.

In [None]:
# Please ignore this cell: extra install steps that are only executed when running the notebook on Google Colab
# flake8-noqa-cell
import os
if 'google.colab' in str(get_ipython()) and not os.path.isdir('data/Test_Data'):
    # we're running on colab and we haven't already downloaded the test data
    # first install pinned version of setuptools (latest version doesn't seem to work with this package on colab)
    %pip install setuptools==61 -qqq
    # install the moralization package
    %pip install git+https://github.com/ssciwr/moralization.git -qqq
    # download test data sets
    !wget https://github.com/ssciwr/moralization/archive/refs/heads/test_data.zip -q
    !mkdir -p data && unzip -qq test_data.zip && mv -f moralization-test_data/*_Data ./data/. && rm -rf moralization-test_data test_data.zip
    !spacy download de_core_news_sm

In [None]:
from moralization import DataManager

# Import the data using the DataManager

If you need more detail about any warnings that are raised, run: <br>
```import logging ``` <br>
```logging.getLogger().setLevel(logging.DEBUG)```
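
The two statements above can be wrapped in a small helper so that verbose output is easy to switch on and off again. This is a minimal sketch using only the standard `logging` module; the helper name `set_log_level` is hypothetical and not part of the moralization package.

```python
import logging


def set_log_level(level: int = logging.DEBUG) -> int:
    """Set the root logger level and return the previous level."""
    root = logging.getLogger()
    previous = root.level
    root.setLevel(level)
    return previous


# Switch to verbose output, then restore the earlier level when done.
previous_level = set_log_level(logging.DEBUG)
# ... run the cells that raise warnings here ...
set_log_level(previous_level)
```

Returning the previous level makes it straightforward to go back to quieter output once the warnings have been inspected.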

In [None]:
# analyse small dataset
# data_manager = DataManager("/content/data/Test_Data/XMI_11")
# if your data is in a language other than German, you can pass
# the matching spaCy language model for the corpus language
# using the language_model keyword argument
# for a selection of the models, see https://spacy.io/usage/models
data_manager = DataManager("/content/data/Test_Data/XMI_11", language_model="en_core_web_sm")
# analyse full dataset
# data_manager = DataManager("/content/data/All_Data/XMI_11") 

The integrity of the data is checked using the `check_data_integrity` method:

In [None]:
data_manager.check_data_integrity()

This will tell you if some categories are exceptionally rare and therefore unreliable for both statistics and training.
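
To make "exceptionally rare" concrete, the sketch below counts how often each label occurs and flags those below a threshold. It only illustrates the idea and is not how `check_data_integrity` is implemented; the label names, the `find_rare_labels` helper and the threshold of 5 are all assumptions.

```python
from collections import Counter


def find_rare_labels(labels, min_count=5):
    """Return labels that occur fewer than `min_count` times, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Hypothetical annotation labels, one entry per annotated span.
example_labels = ["label_a"] * 12 + ["label_b"] * 8 + ["label_c"] * 2

rare = find_rare_labels(example_labels, min_count=5)
print(rare)  # {'label_c': 2}
```

A label seen only a handful of times gives very noisy statistics and too few examples to train on, which is why such categories are worth flagging before going further.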

# Data analysis
Analysis of how often an annotation occurs per text source is carried out using `occurrence_analysis`. The output can be displayed as a table:

In [None]:
data_manager.occurrence_analysis()

or as a heatmap:

In [None]:
data_manager.occurrence_analysis(_type="heatmap")

or be provided as an occurrence correlation:

In [None]:
data_manager.occurrence_analysis(_type="corr")
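
As a rough illustration of what an occurrence correlation expresses, the sketch below computes the Pearson correlation between the per-text counts of two annotations. The counts and annotation names are made up, and whether `occurrence_analysis` uses exactly this coefficient is an assumption; the example only shows the idea.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long count sequences.

    Assumes both sequences vary; a constant sequence would divide by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical occurrence counts: one entry per text source.
counts = {
    "annotation_a": [0, 1, 2, 3],
    "annotation_b": [0, 2, 4, 6],  # rises together with annotation_a
    "annotation_c": [3, 2, 1, 0],  # falls as annotation_a rises
}

print(round(pearson(counts["annotation_a"], counts["annotation_b"]), 6))  # 1.0
print(round(pearson(counts["annotation_a"], counts["annotation_c"]), 6))  # -1.0
```

Values near 1 mean two annotations tend to appear in the same texts, values near -1 mean one tends to appear where the other does not.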

The dataframes can also be exported as csv:

In [None]:
# for the general table
df = data_manager.occurrence_analysis()
df.to_csv("./table_occurrence.csv")

In [None]:
# for the correlation table
df = data_manager.occurrence_analysis(_type="corr")
df.to_csv("./table_correlation.csv")
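
The exported files are plain CSV, so they can also be inspected without pandas. The sketch below writes a small stand-in file and reads it back with the standard `csv` module. The file name `example_table.csv` and its columns are hypothetical; the real layout of `table_occurrence.csv` depends on your data.

```python
import csv
import os
import tempfile

# Write a small stand-in table; a real export would come from df.to_csv(...).
path = os.path.join(tempfile.gettempdir(), "example_table.csv")
with open(path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["annotation", "text_1", "text_2"])
    writer.writerow(["annotation_a", 3, 0])
    writer.writerow(["annotation_b", 1, 4])

# Read it back: each row becomes a dict keyed by the header line.
with open(path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))

for row in rows:
    print(row["annotation"], row["text_1"], row["text_2"])
```

Note that the `csv` module returns every value as a string, so counts need to be converted with `int(...)` before doing arithmetic on them.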

If you do not want the full table but only specific data files, you can filter it by providing a `file_filter` keyword:

In [None]:
data_manager.occurrence_analysis(file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

In [None]:
data_manager.occurrence_analysis(_type="heatmap", file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

Likewise, if you do not want the full plot but only specific categories, you can filter it by providing a `cat_filter` keyword (only works for `_type="heatmap"`):

In [None]:
data_manager.occurrence_analysis(_type="heatmap", cat_filter="KAT1-Moralisierendes Segment")

# Check the data
You can also analyse the data using the spaCy [span analyzer](https://github.com/ljvmiranda921/spacy-span-analyzer).


In [None]:
data_manager.return_analyzer_result()

In [None]:
data_manager.return_analyzer_result(result_type="length")

In [None]:
data_manager.return_analyzer_result(result_type="span_distinctiveness")

In [None]:
data_manager.return_analyzer_result(result_type="boundary_distinctiveness")

In [None]:
data_manager.return_analyzer_result(result_type="all")

Again, any of these can be exported as csv.

In [None]:
df = data_manager.return_analyzer_result(result_type="boundary_distinctiveness")
df.to_csv("./boundary_distinctiveness.csv")

# Interactive data analysis

Here is an example of the interactive data analysis tools we provide.

Please note that on Google Colab it can take a few seconds to go from `loading` to the interface.<br>
Once Dash shows that it is running on a port, you can click the link to open the applet in a new tab.

In [None]:
data_manager.interactive_data_analysis()

In [None]:
data_manager.interactive_correlation_analysis()

The next function might struggle on large datasets.

In [None]:
data_manager.interactive_data_visualization()