# Getting Started

This section covers installing `sentibank`, loading dictionaries, and basic usage.

```{contents}
:local:
```

## Installation

`sentibank` can be installed directly from PyPI using pip:

```bash
pip install sentibank
```

This downloads the latest version and installs the necessary dependencies. To upgrade an existing installation to the latest version:

```bash
pip install --upgrade sentibank
```
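
One way to confirm that the installation succeeded is to query the installed version from Python. The sketch below relies only on the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name used for illustration and is not part of `sentibank`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper, not part of sentibank: prints the version string
# if the package is installed, otherwise None.
print(installed_version("sentibank"))
```

If this prints `None`, the package is not visible to the interpreter that ran the check, which usually means pip installed it into a different environment.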

## Loading Dictionaries 

The `sentibank.archive` module provides access to a growing collection of curated sentiment dictionaries, listed under [Available Dictionaries](content:references:AvailableDictionaries).

To load a preprocessed dictionary in `dict` format:

```python
from sentibank import archive

load = archive.load()
vader = load.dict("VADER_v2014")
```

```{admonition} SentiBank’s predefined lexicon identifiers 
The predefined lexicon identifiers follow the convention `{NAME}_{VERSION}`, for example `VADER_v2014`. This naming structure indicates the lexicon name and its version for easy recognition and selection. To view the available predefined lexicon identifiers, please refer to [Available Dictionaries](content:references:AvailableDictionaries).
```
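
To illustrate the naming convention, the following sketch splits an identifier into its parts. `LexiconId` and `split_identifier` are hypothetical names, not part of `sentibank`. Treating anything after the version as a "variant" (as in `SentiWordNet_v2010_simple`) is an assumption based on the identifiers listed under Available Dictionaries.

```python
from typing import NamedTuple, Optional


class LexiconId(NamedTuple):
    name: str
    version: str
    variant: Optional[str]  # assumed suffix, e.g. "simple" in SentiWordNet_v2010_simple


def split_identifier(identifier: str) -> LexiconId:
    """Split an identifier such as 'VADER_v2014' into name, version and variant."""
    parts = identifier.split("_")
    # The version is expected to be the second part and to start with "v".
    if len(parts) < 2 or not parts[1].startswith("v"):
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    variant = "_".join(parts[2:]) or None
    return LexiconId(parts[0], parts[1], variant)


print(split_identifier("VADER_v2014"))
# LexiconId(name='VADER', version='v2014', variant=None)
print(split_identifier("SentiWordNet_v2010_simple"))
# LexiconId(name='SentiWordNet', version='v2010', variant='simple')
```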

To load the original (unprocessed) dictionary as a Pandas DataFrame:

```python
afinn = load.origin("AFINN_v2015")
```

This returns the original AFINN dictionary in `pd.DataFrame` format.

## Usage Example

Here is an example of loading a dictionary, evaluating summary statistics, and extracting lexical insights:

```python
from sentibank import archive
from sentibank.utils import lexical_overview

load = archive.load()
vader = load.dict("VADER_v2014")

lexical_overview(vader)
```

This prints the sentiment score distribution and lexical breakdown of the VADER dictionary.

The `lexical_overview` function provides a quick way to summarise key metadata from a dictionary. 

`````{admonition} Utilising lexical_overview
:class: tip 

`lexical_overview` covers both holistic sentiment statistics and detailed lexical category analysis. Together these provide both the forest and the trees, from overall sentiment trends down to word-type composition:

 - **Dictionary Type**: Indicates whether sentiment is measured via labels (discrete/categorical) or scores (continuous). The possible types are `categorical`, `discrete`, `continuous`, `categorical (multi-label)`, and `discrete (multi-label)`.

 - **Sentiment Score**: Distribution statistics of sentiment labels or scores. For labels, it summarises the frequency of each label within the dictionary. For scores, it summarises the overall sentiment distribution, such as frequency, mean, median, range, and standard deviation.

 - **Sentiment Lexicon**: Breaks down the lexicon by Part-of-Speech (POS). Provides frequency counts for categories like nouns, adjectives, verbs, emoticons, and more. Useful for understanding lexicon composition.
    - **General POS Tags**: A general overview of POS tags using a simplified [Universal POS tagging system](https://universaldependencies.org/u/pos/) influenced by [NLTK](https://www.nltk.org/book/ch05.html). Includes `adjectives`, `adverbs`, `conjunctions`, `determiners`, `emos` (emoticons and emojis), `nouns`, `numerals`, `particles`, `prepositions`, `pronouns`, `verbs`, and `miscellaneous`.
    - **Granular POS Tags**: A more fine-grained lexical breakdown using the [OntoNotes (v5.0)](https://catalog.ldc.upenn.edu/LDC2013T19) tagging system. Includes singular/plural nouns, comparative/superlative adjectives, verb tenses, and more. Enables deeper lexical analysis.
    - **Miscellaneous POS**: Catches any rare or unknown Part-of-Speech tags for completeness.
`````
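
To make the score and label summaries concrete, here is a minimal sketch of the kind of statistics involved. It is not the implementation of `lexical_overview`: the helper names are hypothetical, the toy lexicons are made up rather than taken from any real dictionary, and the choice of the sample standard deviation is an assumption.

```python
import statistics
from collections import Counter

# Toy stand-ins for processed dictionaries. The entries and values are
# invented for illustration and do not come from any real lexicon.
score_lexicon = {"good": 1.9, "great": 3.1, "bad": -2.5, "awful": -2.0, "okay": 0.9}
label_lexicon = {"gain": "positive", "loss": "negative", "risk": "negative"}


def summarise_scores(lexicon):
    """Summarise a continuous (score-based) lexicon."""
    values = list(lexicon.values())
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumed)
    }


def summarise_labels(lexicon):
    """Count how often each label occurs in a categorical lexicon."""
    return Counter(lexicon.values())


print(summarise_scores(score_lexicon))
print(summarise_labels(label_lexicon))
```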

```{warning} 
The input to `lexical_overview` must be a *processed dictionary*, that is, one loaded with `load.dict` (`sentibank.archive.load().dict`), not the original DataFrame returned by `load.origin`.
``` 
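
A simple guard can catch the mistake of passing an original DataFrame before `lexical_overview` is called. The sketch below assumes only what this page states, namely that processed dictionaries are plain `dict` objects; `ensure_processed` is a hypothetical helper, not part of `sentibank`.

```python
def ensure_processed(lexicon):
    """Return `lexicon` unchanged if it is a dict, otherwise raise TypeError."""
    if not isinstance(lexicon, dict):
        raise TypeError(
            f"expected a dict loaded with load.dict(...), got {type(lexicon).__name__}"
        )
    return lexicon


# A toy dict passes the check; the entry is made up for illustration.
ensure_processed({"good": 1.9})
```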

```{note}
We are currently extending `lexical_overview` to handle abbreviations and n-grams.
```

(content:references:AvailableDictionaries)=
## Available Dictionaries

| Sentiment Dictionary | Description | Genre | Domain | Predefined Identifiers |
|------------------------|---------------|------|-----|------------------------|
|**AFINN** <br> (Nielsen, 2011)| General purpose lexicon with sentiment ratings for common emotion words. |Social Media|General| `AFINN_v2009`, `AFINN_v2011`, `AFINN_v2015` |
|**Aigents+** <br> (Raheman et al., 2022)| Lexicon optimised for social media posts related to cryptocurrencies. |Social Media|Cryptocurrency| `Aigents+_v2022`|
|**General Inquirer** <br> (Stone et al., 1962)| Lexicon capturing broad psycholinguistic dimensions across semantics, values and motivations. |General|Psychology| `HarvardGI_v2000`|
|**MASTER** <br> (Loughran and McDonald, 2011; Bodnaruk, Loughran and McDonald, 2015)| Financial lexicons covering expressions common in business writing. |Corporate Filings|Finance| `MASTER_v2022`|
|**SentiWordNet** <br> (Esuli and Sebastiani, 2006; Baccianella, Esuli and Sebastiani, 2010)| Lexicon associating WordNet synsets with positive, negative, and objective scores. |General|General| `SentiWordNet_v2010_simple`, `SentiWordNet_v2010_nuanced` |
|**VADER** <br> (Hutto and Gilbert, 2014)| General purpose lexicon optimised for social media and microblogs. |Social Media|General| `VADER_v2014`|
|**WordNet-Affect** <br> (Strapparava and Valitutti, 2004; Valitutti, Strapparava and Stock, 2004; Strapparava, Valitutti and Stock, 2006)| Hierarchically organised affective labels providing a granular emotional dimension. |General|Psychology| `WordNet-Affect_v2006`|