# Introduction
This notebook gives an example of a comparative analysis of two EpiDoc corpora, in this case of how abbreviations are distributed in each.

# Load the dependencies and corpora

To run the analysis, we need PyEpiDoc and the two corpora.

In [1]:
# Load dependencies
from pyepidoc import EpiDoc, EpiDocCorpus

# Create the corpora

## I.Sicily

### Load the corpus
The corpus can be downloaded [here](https://github.com/ISicily/ISicily).

Once the corpus is downloaded, we can load it into PyEpiDoc as follows (replace the path with the path to where you have saved the I.Sicily corpus):

In [2]:
isicily = EpiDocCorpus(r'..\..\..\..\Data\ISicily\ISicily\inscriptions')
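
The path above uses Windows-style backslashes, so it will not resolve on other operating systems. As a minimal sketch, the standard library's `pathlib` can build the path portably and fail early if the folder is missing. `corpus_path` is a hypothetical helper introduced here for illustration; it is not part of PyEpiDoc.

```python
from pathlib import Path


def corpus_path(base: Path, *parts: str) -> Path:
    """Join path parts in an OS-independent way and check the folder exists."""
    path = base.joinpath(*parts).resolve()
    if not path.is_dir():
        # Failing here gives a clearer message than an empty corpus later on
        raise FileNotFoundError(f"Corpus folder not found: {path}")
    return path
```

The result could then be passed on as a string, for example `EpiDocCorpus(str(corpus_path(Path.cwd(), '..', 'Data', 'ISicily')))`, with the parts adapted to wherever you saved the corpus.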

Print the basic information about the corpus:

In [None]:
isicily.print_info()

The I.Sicily corpus is not fully tokenized, so we need to tokenize it first:

In [None]:
isicily.tokenize_to_folder(r'..\..\..\..\Data\ISicily\ISicilyTokenized')

Then we load the tokenized corpus from the folder that the tokenized files were written to:

In [5]:
isicily_tokenized = EpiDocCorpus(r'..\..\..\..\Data\ISicily\ISicilyTokenized')

In [None]:
isicily_tokenized.print_info()
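
Because the tokenized corpus is read back from disk, a mistyped folder name is easy to miss. The sketch below is a quick sanity check that a folder actually contains files before loading it. It assumes the corpus files carry an `.xml` extension, and `count_xml_files` is a hypothetical helper, not a PyEpiDoc function.

```python
from pathlib import Path


def count_xml_files(folder) -> int:
    """Count the .xml files directly inside `folder`; return 0 if it does not exist."""
    path = Path(folder)
    if not path.is_dir():
        return 0
    return sum(
        1 for p in path.iterdir()
        if p.is_file() and p.suffix.lower() == ".xml"
    )
```

A result of `0` for the tokenized folder would suggest that the tokenizing step wrote its output somewhere else.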

### Create a Roman-era sub-corpus

We can now create a sub-corpus of the Roman-era inscriptions:

In [6]:
isicily_roman = isicily_tokenized.filter_by_dateafter(-1)

In [None]:
isicily_roman.doc_count

#### Create a sub-corpus of documents dated from 1 to 200 CE

In [16]:
isicily_1_to_200 = isicily_tokenized.filter_by_daterange(1, 200)

In [None]:
isicily_1_to_200.doc_count
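
To make the idea of a date-range filter concrete, here is a small standalone sketch. It does not use PyEpiDoc and does not claim to reproduce how `filter_by_daterange` decides membership. It assumes that each document carries an earliest and a latest possible year, that negative years stand for BCE, and that a document is kept only when its whole dating span lies inside the range. `DatedDoc` and `within_range` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatedDoc:
    """Hypothetical stand-in for an inscription with a dating span."""
    doc_id: str
    not_before: Optional[int]  # earliest year; negative = BCE (assumed convention)
    not_after: Optional[int]   # latest year


def within_range(doc: DatedDoc, start: int, end: int) -> bool:
    """Keep a document only if its whole dating span lies inside [start, end]."""
    if doc.not_before is None or doc.not_after is None:
        return False  # undated documents are excluded in this sketch
    return start <= doc.not_before and doc.not_after <= end


docs = [
    DatedDoc("A", 50, 150),     # fully inside 1-200
    DatedDoc("B", 150, 250),    # only partly overlaps the range
    DatedDoc("C", -50, -1),     # entirely BCE
    DatedDoc("D", None, None),  # undated
]
selected = [d.doc_id for d in docs if within_range(d, 1, 200)]
print(selected)
```

Other rules are equally plausible, for instance keeping any document whose span overlaps the range, so the counts reported by the library should be read against its own documentation.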

#### Create a late-antique sub-corpus

In [18]:
isicily_late_antique = isicily_tokenized.filter_by_dateafter(200)

In [None]:
isicily_late_antique.doc_count

## Cyrene

### Obtain the corpus

A `.zip` file containing the Inscriptions from Roman Cyrene corpus (IRCyr) can be obtained from [here](https://ircyr2020.inslib.kcl.ac.uk/en/inscriptions/).

### Load the corpus

The Cyrene corpus is already tokenized, so we just need to load it.

In [None]:
cyrene = EpiDocCorpus(r'..\..\..\..\Data\IRCyr')

### Create a Roman-era sub-corpus

In [None]:
cyrene_roman = cyrene.filter_by_dateafter(1)
cyrene_roman.doc_count

### Create a sub-corpus of inscriptions dated from 1 to 200 CE

In [21]:
cyrene_1_to_200 = cyrene.filter_by_daterange(1, 200)

In [None]:
cyrene_1_to_200.daterange

# Compare the distributions of abbreviations (in CSV files)

In [23]:
from pyepidoc.analysis.abbreviations import csvoutput as csvout

We write the abbreviation distribution of each sub-corpus to a CSV file. With the relative file names used below, the files are created in the current working directory, which is the `notebooks` folder if the notebook is run from there:

In [None]:
csvout.overall_analysis_to_csv(cyrene_roman, 'cyrene_roman.csv')

In [None]:
csvout.overall_analysis_to_csv(isicily_roman, 'isicily_roman.csv')

In [None]:
csvout.overall_analysis_to_csv(cyrene_1_to_200, 'cyrene_1_to_200.csv')
csvout.overall_analysis_to_csv(isicily_1_to_200, 'isicily_1_to_200.csv')
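
Once the files exist, the two corpora can be put side by side. The sketch below uses only the standard library's `csv` module. The column layout of the generated files is not described here, so the key and value column names are parameters for the caller to fill in after inspecting the header row. `read_counts` and `side_by_side` are hypothetical helpers.

```python
import csv
from pathlib import Path


def read_counts(path, key_col, value_col):
    """Read one CSV into a {key: value} dict; column names come from the caller."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return {row[key_col]: row[value_col] for row in csv.DictReader(f)}


def side_by_side(left, right):
    """Merge two dicts into (key, left value, right value) rows; '' where missing."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k, ""), right.get(k, "")) for k in keys]
```

For example, `side_by_side(read_counts('cyrene_1_to_200.csv', key, value), read_counts('isicily_1_to_200.csv', key, value))` would give one row per key found in either file, where `key` and `value` are whichever column names the files actually contain.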