<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/uu-small.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Sharing data features

## Explore additional data

Once you analyse a corpus, it is likely that you produce data that others can reuse.
Maybe you have defined a set of proper name occurrences, or you have computed sentiments.

It is possible to turn these insights into *new features*, i.e. new `.tf` files with values assigned to specific nodes.

## Make your own data

New data is a product of your own methods and computations in the first place.
But how do you turn that data into new TF features?
It turns out that the last step is not that difficult.

If you can shape your data as a mapping (dictionary) from node numbers (integers) to values
(strings or integers), then TF can turn that data into a feature file for you with one command.
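As a minimal sketch of that shape, with made-up node numbers and values: the helper `check_feature_mapping` below is hypothetical (it is not part of TF) and only verifies that the keys are integers and the values are strings or integers.

```python
def check_feature_mapping(data):
    """Return a list of problems found in a node -> value mapping."""
    problems = []
    for node, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(node, int) or isinstance(node, bool):
            problems.append(f"node {node!r} is not an integer")
        if not isinstance(value, (str, int)) or isinstance(value, bool):
            problems.append(f"value {value!r} of node {node!r} is not a string or integer")
    return problems


# Made-up node numbers and values, purely for illustration
toy_feature = {1: 3, 2: -1, 5: 0}
print(check_feature_mapping(toy_feature))
print(check_feature_mapping({"1": 3, 2: 1.5}))
```

The first call reports no problems; the second reports one bad key and one bad value.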

## Share your new data
You can then easily share your new features on GitHub, so that your colleagues everywhere
can try them out for themselves.

You can add such data on the fly, by passing a `mod={org}/{repo}/{path}` parameter,
or a bunch of them separated by commas.
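A small sketch of how such a value could be put together from several modules. The helper `mod_spec` is hypothetical, and so is the second module; only the first one is used later in this notebook.

```python
def mod_spec(*modules):
    """Join (org, repo, path) triples into one comma-separated string."""
    return ",".join(f"{org}/{repo}/{path}" for (org, repo, path) in modules)


# The first module is the one used later in this notebook;
# the second one is hypothetical and only shows the comma-separated form.
mod = mod_spec(
    ("q-ran", "exercises", "mining/tf"),
    ("some-org", "some-repo", "some/path/tf"),
)
print(mod)
```

The resulting string is what you would pass as `mod=...`.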

If the data is there, it will be auto-downloaded and stored on your machine.

Let's do it.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections
import os

from tf.app import use


In [3]:
A = use("quran:clone", checkout="clone", hoist=globals())
# A = use('quran', hoist=globals())

# Making data

We illustrate the data creation part by creating a new feature, `sentiment`.

It is not very meaningful, but it serves to illustrate the workflow.

We consider ayas that start with a vocative particle as a positive context,
and ayas that start with a resumptive particle as a negative context.

For each lemma of a noun, verb, or adjective in the corpus,
we count how often it occurs in a positive context,
and subtract how many times it occurs in a negative context.

The resulting number is the sentiment.
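To make the arithmetic concrete, here is a sketch on toy data. The lemmas and contexts are made up, and plain strings stand in for node numbers.

```python
import collections

# Toy results: (aya, particle, lemma) tuples; all data here is made up
toy_positive = [("aya1", "p1", "mercy"), ("aya2", "p2", "mercy"), ("aya2", "p2", "day")]
toy_negative = [("aya3", "p3", "day"), ("aya4", "p4", "day"), ("aya4", "p4", "fire")]

toy_sentiment = collections.Counter()

# Add 1 for every occurrence in a positive context, subtract 1 for a negative one
for (results, kind) in ((toy_positive, 1), (toy_negative, -1)):
    for (aya, particle, lemma) in results:
        toy_sentiment[lemma] += kind

print(dict(toy_sentiment))
```

Here `mercy` ends up at 2, while `day` (once positive, twice negative) and `fire` both end up at -1.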

We use a query to fetch the positive contexts and the negative contexts.

In [4]:
contentTypes = set("verb noun adjective".split())
contentTypeCrit = "|".join(contentTypes)

In [5]:
queryP = f"""
aya
  =: word posx=vocative
  word pos={contentTypeCrit}
"""

queryN = f"""
aya
  =: word posx=resumption
  word pos={contentTypeCrit}
"""

In [6]:
resultsP = A.search(queryP)
resultsN = A.search(queryN)

  0.30s 2513 results
  0.30s 7351 results


Here are the first few results of both:

In [7]:
A.displaySetup(extraFeatures="translation@en")

In [8]:
A.show(resultsP, end=2, condensed=True)

In [9]:
A.show(resultsN, end=2, condensed=True)

Observe how the positive results indeed have a positive sentiment, and the negative ones are indeed negative.

However, we do not attempt at all to weed out the positive words under negation from the negative contexts.

So our sentiments have to work against a massive "pollution", and are probably not useful.

In [10]:
sentiment = collections.Counter()

for (results, kind) in ((resultsP, 1), (resultsN, -1)):
    for (aya, particle, word) in results:
        sentiment[F.lemma.v(word)] += kind

Let's check what we found: how many lemmas per sentiment.

In [11]:
sentimentDist = collections.Counter()

for (lemma, sent) in sentiment.items():
    sentimentDist[sent] += 1

for (sent, amount) in sorted(
    sentimentDist.items(),
    key=lambda x: (-x[1], x[0]),
):
    print(f"sentiment {sent:>3} is assigned to {amount:>4} lemmas")

sentiment  -1 is assigned to  870 lemmas
sentiment   1 is assigned to  273 lemmas
sentiment  -2 is assigned to  248 lemmas
sentiment   0 is assigned to  118 lemmas
sentiment  -3 is assigned to  107 lemmas
sentiment  -4 is assigned to   87 lemmas
sentiment   2 is assigned to   56 lemmas
sentiment  -5 is assigned to   53 lemmas
sentiment  -6 is assigned to   38 lemmas
sentiment  -7 is assigned to   30 lemmas
sentiment  -8 is assigned to   20 lemmas
sentiment   3 is assigned to   17 lemmas
sentiment -10 is assigned to   16 lemmas
sentiment  -9 is assigned to   10 lemmas
sentiment -13 is assigned to    7 lemmas
sentiment -11 is assigned to    7 lemmas
sentiment -14 is assigned to    6 lemmas
sentiment -12 is assigned to    6 lemmas
sentiment -17 is assigned to    5 lemmas
sentiment -32 is assigned to    4 lemmas
sentiment -21 is assigned to    4 lemmas
sentiment -49 is assigned to    3 lemmas
sentiment -22 is assigned to    3 lemmas
sentiment -18 is assigned to    3 lemmas
sentiment -16 is

We show the most negative and most positive sentiments in context.

In [12]:
negaThreshold = -100
posiThreshold = 4

xPlemmas = {lemma for lemma in sentiment if sentiment[lemma] >= posiThreshold}
xNlemmas = {lemma for lemma in sentiment if sentiment[lemma] <= negaThreshold}

xPwords = [
    w
    for w in F.otype.s("word")
    if F.lemma.v(w) in xPlemmas and F.pos.v(w) in contentTypes
]
xNwords = [
    w
    for w in F.otype.s("word")
    if F.lemma.v(w) in xNlemmas and F.pos.v(w) in contentTypes
]

print(f"{len(xPwords)} extremely positive word occurrences")
print(f"{len(xNwords)} extremely negative word occurrences")

929 extremely positive word occurrences
6650 extremely negative word occurrences


We put the words in their ayas, and show a few.

In [13]:
xPayas = collections.defaultdict(list)
xNayas = collections.defaultdict(list)

for w in xPwords:
    a = L.u(w, otype="aya")[0]
    xPayas[a].append(w)

for w in xNwords:
    a = L.u(w, otype="aya")[0]
    xNayas[a].append(w)

print(f"{len(xPayas)} ayas with extremely positive word occurrences")
print(f"{len(xNayas)} ayas with extremely negative word occurrences")

xPtuples = [(a, *words) for (a, words) in sorted(xPayas.items())]
xNtuples = [(a, *words) for (a, words) in sorted(xNayas.items())]

692 ayas with extremely positive word occurrences
3558 ayas with extremely negative word occurrences
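The grouping step can be pictured with toy data. In this sketch the dictionary `toy_aya_of` is a made-up stand-in for the lookup `L.u(w, otype="aya")[0]`.

```python
import collections

# Made-up word -> aya lookup, standing in for L.u(w, otype="aya")[0]
toy_aya_of = {101: 1, 102: 1, 205: 2, 309: 3, 310: 3}

toy_ayas = collections.defaultdict(list)
for w in sorted(toy_aya_of):
    toy_ayas[toy_aya_of[w]].append(w)

# One tuple per aya: the aya first, then its words
toy_tuples = [(a, *words) for (a, words) in sorted(toy_ayas.items())]
print(toy_tuples)
```

Each tuple starts with the aya, followed by the words found in it, which is the shape passed to `A.show()` in the next cells.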


We show three ayas of each category.

In [14]:
A.show(xPtuples, end=3)

In [15]:
A.show(xNtuples, end=3)

Probably Allah has a negative sentiment because He occurs in many negative contexts as a punisher.

Anyway, we do not try to be sophisticated here.

We move on to export this sentiment feature.

# Saving data

The [documentation](https://annotation.github.io/text-fabric/tf/core/fabric.html#tf.core.fabric.FabricCore.save) explains how to save this data into a text-fabric
data file.

We choose a location to save it: the `exercises` repository in the `q-ran` organization, in the folder `mining`.

In order to do this, we restart the TF api, but now with the desired output location in the `locations` parameter.

In [17]:
GITHUB = os.path.expanduser("~/github")
ORG = "q-ran"
REPO = "exercises"
PATH = "mining"
VERSION = A.version

Note the version: we have built the feature against a specific version of the data:

In [18]:
A.version

'0.4'

Later on, we pass this version on, so that users of our data will get the shared data in exactly the same version as their core data.

We have to specify a bit of metadata for this feature:

In [19]:
metaData = {
    "sentiment": dict(
        valueType="int",
        description="crude sentiments in the Quran",
        creator="Dirk Roorda",
    ),
}

sentimentData = {
    w: sentiment[F.lemma.v(w)]
    for w in F.otype.s("word")
    if F.lemma.v(w) in sentiment and F.pos.v(w) in contentTypes
}

Now we can give the save command:

In [20]:
TF.save(
    nodeFeatures=dict(sentiment=sentimentData),
    metaData=metaData,
    location=f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf",
    module=VERSION,
)

  0.00s Exporting 1 node and 0 edge and 0 config features to ~/github/q-ran/exercises/mining/tf/0.4:
   |     0.07s T sentiment            to ~/github/q-ran/exercises/mining/tf/0.4
  0.07s Exported 1 node features and 0 edge features and 0 config features to ~/github/q-ran/exercises/mining/tf/0.4


True
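The log shows that the feature lands in a directory formed by `location` followed by `module`. The sketch below only rebuilds that path from the values in the cells above. The file name `sentiment.tf` assumes one `.tf` file per feature, named after the feature.

```python
import os

# Values taken from the cells above
GITHUB = os.path.expanduser("~/github")
ORG, REPO, PATH, VERSION = "q-ran", "exercises", "mining", "0.4"

location = f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf"
target_dir = f"{location}/{VERSION}"  # location + module, as in the log

# Assumption: one file per feature, named after the feature
feature_file = f"{target_dir}/sentiment.tf"
print(feature_file)
```

Keeping the version as the last directory level is what lets users get the shared data in the same version as their core data.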

# Sharing data

How to share your own data is explained in the
[documentation](https://annotation.github.io/text-fabric/tf/about/datasharing.html).

Here we show it step by step for the `sentiment` feature.

## Zip the data

We need to zip the data in exactly the right directory structure. Text-Fabric can do that for us:

In [21]:
%%sh

text-fabric-zip q-ran/exercises/mining/tf

True
Create release data for q-ran/exercises/mining/tf
Found 2 versions
zip files end up in ~/Downloads/q-ran-release/exercises
zipping q-ran/exercises            0.3 with   1 features ==> mining-tf-0.3.zip
zipping q-ran/exercises            0.4 with   1 features ==> mining-tf-0.4.zip


Now you have the zip files in the desired structure in your Downloads folder.
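Judging from the log, the zip name appears to be the path inside the repository with slashes replaced by hyphens, followed by the version. The helper `release_zip_name` is hypothetical and only mirrors that observed pattern; it is not TF's own code.

```python
def release_zip_name(path, version):
    """Guess the zip file name from the path inside the repo and the version."""
    return f"{path.replace('/', '-')}-{version}.zip"


for version in ("0.3", "0.4"):
    print(release_zip_name("mining/tf", version))
```

This reproduces the two file names in the log, which helps when you look for the right file to attach to a release.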

## Put the data on GitHub

The next step is to make a new release in your GitHub repository, in this case `q-ran/exercises`, and attach
the zip file as a binary.

You have to do this in your web browser, on the GitHub website.

Here is the result for our case:

![release](images/release.png)

# Use the data

We can use the data by calling it up when we say `use('quran', ...)`.

Here is how:

In [22]:
A = use(
    "quran:clone",
    checkout="clone",
    hoist=globals(),
    mod="q-ran/exercises/mining/tf:clone",
)
# A = use('quran', hoist=globals(), mod='q-ran/exercises/mining/tf')

   |     0.11s T sentiment            from ~/github/q-ran/exercises/mining/tf/0.4


Above you see a new section in the feature list: **q-ran/exercises/mining/tf** with our foreign feature in it: `sentiment`.

Now, suppose we did not know much about this feature, then we would like to do a few basic checks:

In [23]:
F.sentiment.freqList()

((-1, 4509),
 (-2, 3044),
 (-233, 2699),
 (-4, 2368),
 (-3, 2217),
 (1, 2040),
 (-5, 1964),
 (-7, 1693),
 (-195, 1618),
 (-8, 1541),
 (0, 1517),
 (-6, 1416),
 (-10, 1406),
 (-138, 1358),
 (-49, 1147),
 (-130, 975),
 (-32, 865),
 (-21, 841),
 (2, 833),
 (-13, 696),
 (-12, 690),
 (-14, 669),
 (-36, 662),
 (-25, 589),
 (-17, 589),
 (-9, 577),
 (-30, 571),
 (29, 537),
 (-22, 529),
 (-11, 431),
 (-23, 410),
 (-18, 360),
 (-45, 358),
 (-33, 289),
 (-54, 278),
 (-37, 271),
 (-31, 271),
 (-24, 261),
 (-16, 245),
 (-26, 236),
 (3, 205),
 (-73, 176),
 (122, 153),
 (-34, 136),
 (-39, 127),
 (5, 115),
 (-20, 75),
 (11, 75),
 (-15, 72),
 (4, 49))

Which nodes have a sentiment feature?

In [24]:
{F.otype.v(n) for n in N.walk() if F.sentiment.v(n) is not None}

{'word'}

Only words have the feature.

Which part of speech do these words have?

In [25]:
{F.pos.v(n) for n in F.otype.s("word") if F.sentiment.v(n) is not None}

{'adjective', 'noun', 'verb'}
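A subtlety in checks like these: a sentiment of 0 is a legitimate value (the frequency list contains 1517 such words), but 0 is falsy in Python, so a bare truthiness test treats it like a missing value. The sketch below uses made-up values and a hypothetical `value_of` function that returns `None` for nodes without a value.

```python
# Made-up feature values: node -> sentiment; node 3 has the legitimate value 0
toy_values = {1: 5, 2: -3, 3: 0}


def value_of(node):
    # Mimics a feature lookup that yields None for nodes without a value
    return toy_values.get(node)


nodes = range(1, 6)
by_truthiness = [n for n in nodes if value_of(n)]
by_none_check = [n for n in nodes if value_of(n) is not None]

print(by_truthiness)
print(by_none_check)
```

Only the comparison with `None` keeps node 3, whose value is 0.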

Let's have a look at a table of some words with positive sentiments.

In [26]:
results = A.search(
    """
word sentiment>0
"""
)

  0.13s 4007 results


In [27]:
A.table(results, start=1, end=5)

n,p,word
1,1:1,رَّحْمَٰنِ
2,1:3,رَّحْمَٰنِ
3,1:5,نَسْتَعِينُ
4,2:3,يُؤْمِنُ
5,2:3,يُنفِقُ


In [28]:
results = A.search(
    """
word sentiment<0
"""
)

  0.15s 39229 results


In [29]:
A.table(results, start=1, end=5)

n,p,word
1,1:1,سْمِ
2,1:1,ٱللَّهِ
3,1:1,رَّحِيمِ
4,1:2,حَمْدُ
5,1:2,لَّهِ


Let's get ayas with both positive and negative sentiments:

In [30]:
results = A.search(
    """
aya
  word sentiment>0
  word sentiment<0
"""
)

  0.37s 40238 results
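Each result of this template is a tuple (aya, positive word, negative word), so an aya with several words of each kind yields one result per combination. A sketch with made-up nodes shows how the number of results relates to the number of distinct ayas:

```python
import itertools

# Made-up ayas with their positive and negative word nodes
toy = {
    "aya1": {"pos": [11], "neg": [12, 13, 14]},
    "aya2": {"pos": [21, 22], "neg": [23, 24]},
    "aya3": {"pos": [31], "neg": []},
}

# One result per (positive word, negative word) combination within an aya
toy_results = [
    (aya, p, n)
    for (aya, words) in toy.items()
    for (p, n) in itertools.product(words["pos"], words["neg"])
]
toy_aya_count = len({aya for (aya, p, n) in toy_results})

print(len(toy_results))
print(toy_aya_count)
```

The toy data gives 7 results spread over 2 ayas; `aya3` drops out because it has no negative word. Displaying with `condensed=True` groups the results per aya.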


In [31]:
A.table(results, start=1, end=2, condensed=True)

n,p,aya,word,word.1,word.2,word.3
1,1:1,بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ,سْمِ,ٱللَّهِ,رَّحْمَٰنِ,رَّحِيمِ
2,1:3,ٱلرَّحْمَٰنِ ٱلرَّحِيمِ,رَّحْمَٰنِ,رَّحِيمِ,,


With highlights:

In [32]:
highlights = {}

for w in F.otype.s("word"):
    sent = F.sentiment.v(w)
    if sent:
        color = "lightsalmon" if sent < 0 else "mediumaquamarine"
        highlights[w] = color

In [33]:
A.table(results, start=1, end=10, condensed=True, highlights=highlights)

n,p,aya,word,word.1,word.2,word.3,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
1,1:1,بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ,سْمِ,ٱللَّهِ,رَّحْمَٰنِ,رَّحِيمِ,,,,,,
2,1:3,ٱلرَّحْمَٰنِ ٱلرَّحِيمِ,رَّحْمَٰنِ,رَّحِيمِ,,,,,,,,
3,1:5,إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ,نَعْبُدُ,نَسْتَعِينُ,,,,,,,,
4,2:3,ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ,غَيْبِ,يُقِيمُ,صَّلَوٰةَ,رَزَقْ,يُنفِقُ,يُؤْمِنُ,,,,
5,2:4,وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ,ءَاخِرَةِ,يُوقِنُ,يُؤْمِنُ,أُنزِلَ,أُنزِلَ,قَبْلِ,,,,
6,2:6,إِنَّ ٱلَّذِينَ كَفَرُوا۟ سَوَآءٌ عَلَيْهِمْ ءَأَنذَرْتَهُمْ أَمْ لَمْ تُنذِرْهُمْ لَا يُؤْمِنُونَ,يُؤْمِنُ,كَفَرُ,سَوَآءٌ,أَنذَرْ,تُنذِرْ,,,,,
7,2:8,وَمِنَ ٱلنَّاسِ مَن يَقُولُ ءَامَنَّا بِٱللَّهِ وَبِٱلْيَوْمِ ٱلْءَاخِرِ وَمَا هُم بِمُؤْمِنِينَ,يَوْمِ,ءَاخِرِ,مُؤْمِنِينَ,نَّاسِ,يَقُولُ,ءَامَ,ٱللَّهِ,,,
8,2:9,يُخَٰدِعُونَ ٱللَّهَ وَٱلَّذِينَ ءَامَنُوا۟ وَمَا يَخْدَعُونَ إِلَّآ أَنفُسَهُمْ وَمَا يَشْعُرُونَ,ءَامَنُ,يَشْعُرُ,ٱللَّهَ,أَنفُسَ,,,,,,
9,2:10,فِى قُلُوبِهِم مَّرَضٌ فَزَادَهُمُ ٱللَّهُ مَرَضًا وَلَهُمْ عَذَابٌ أَلِيمٌۢ بِمَا كَانُوا۟ يَكْذِبُونَ,مَّرَضٌ,زَادَ,ٱللَّهُ,مَرَضًا,عَذَابٌ,أَلِيمٌۢ,كَانُ,يَكْذِبُ,قُلُوبِ,
10,2:13,وَإِذَا قِيلَ لَهُمْ ءَامِنُوا۟ كَمَآ ءَامَنَ ٱلنَّاسُ قَالُوٓا۟ أَنُؤْمِنُ كَمَآ ءَامَنَ ٱلسُّفَهَآءُ أَلَآ إِنَّهُمْ هُمُ ٱلسُّفَهَآءُ وَلَٰكِن لَّا يَعْلَمُونَ,سُّفَهَآءُ,سُّفَهَآءُ,يَعْلَمُ,قِيلَ,ءَامِنُ,ءَامَنَ,نَّاسُ,قَالُ,نُؤْمِنُ,ءَامَنَ


If we do a pretty display, the `sentiment` feature shows up.

In [34]:
A.show(results, start=1, end=3, condensed=True, withNodes=True, highlights=highlights)

# All together!

If more researchers have shared data modules, you can draw them all in.

Then you can design queries that use features from all these different sources.

In that way, you build your own research on top of the work of others.

---

All chapters:

* **[start](start.ipynb)** introduction to computing with your corpus
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **share** draw in other people's data and let them use yours
* **[similarAyas](similarAyas.ipynb)** spot the similarities between lines
* **[rings](rings.ipynb)** ring structures in sura 2

CC-BY Dirk Roorda