# Descriptive Statistics & Basic Properties

#### What do we want to do here? (*to be expanded*)
 - (distributional) properties of objects & terms
   - distribution over objects, object properties 

 - explore the properties of the collection graph & thesaurus graph:
   - network structure (connectivity, etc.)
   
 - explore the connections between collection & thesaurus (collection links into thesaurus)
   - intersections of entities
   - density of indexed terms in collection


## 1. Statistical Properties of the Graphs

### 1.1 Basic Properties

In [None]:
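from tabulate import tabulate

# number of distinct items produced by a generator
n = lambda gen: len(set(gen))

# one row per statistic, one column per graph
info = [["", "object graph", "thesaurus"],
        ["len", len(obj_graph), len(thesaurus)],
        ["n_subjects", n(obj_graph.subjects()), n(thesaurus.subjects())],
        ["n_objects", n(obj_graph.objects()), n(thesaurus.objects())],
        ["n_preds", n(obj_graph.predicates()), n(thesaurus.predicates())]]
# transpose so that the table has one row per graph and one column per statistic
info = list(zip(*info))
print(tabulate(info[1:], headers=info[0]))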
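
To make the columns of this table concrete, the same counts can be reproduced on a handful of plain tuples. This is only a sketch: the toy triples and the helper `basic_counts` are made up for illustration, and it assumes that `obj_graph` and `thesaurus` behave like collections of (subject, predicate, object) triples.

```python
# Toy (subject, predicate, object) triples; made-up data for illustration only.
toy_triples = [
    ("obj:1", "rdf:type", "ex:Painting"),
    ("obj:1", "ex:depicts", "thes:horse"),
    ("obj:2", "rdf:type", "ex:Painting"),
    ("obj:2", "ex:depicts", "thes:horse"),
    ("obj:2", "ex:depicts", "thes:rider"),
]


def basic_counts(triples):
    """Return the number of triples and of distinct subjects, objects and predicates."""
    triples = list(triples)
    return {
        "len": len(triples),
        "n_subjects": len({s for s, _, _ in triples}),
        "n_objects": len({o for _, _, o in triples}),
        "n_preds": len({p for _, p, _ in triples}),
    }


counts = basic_counts(toy_triples)
print(counts)
```

Counting distinct values through a set mirrors the `n` helper in the cell above, so repeated subjects or predicates are only counted once.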

### 1.2 Jaccard similarities between entity sets

In [None]:
headers = ["object subjs", "object preds", "object objs", "thes subjs", "thes preds", "thes objs"]
sets = [obj_graph.subjects(), obj_graph.predicates(), obj_graph.objects(), thesaurus.subjects(), thesaurus.predicates(), thesaurus.objects()]
sets = list(map(set, sets))

# Jaccard similarity |s1 & s2| / |s1 | s2|; defined as 0.0 when both sets are empty
jacc = lambda s1, s2: len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

# pairwise similarity table; the diagonal (a set compared with itself) is left blank
similarities = [[h] + [jacc(s1, s2) if s1 is not s2 else None for s2 in sets] for s1, h in zip(sets, headers)]
print(tabulate(similarities, headers=["jaccard similarity"] + headers))
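
The quantity computed in the cell above, |A ∩ B| / |A ∪ B|, is the Jaccard similarity (1 for identical non-empty sets, 0 for disjoint ones); the Jaccard distance is its complement, 1 minus the similarity. A minimal sketch on toy sets, where the set contents and the function names are made up for illustration:

```python
def jaccard_similarity(a, b):
    """|a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_distance(a, b):
    """Complement of the similarity: 1 - jaccard_similarity(a, b)."""
    return 1.0 - jaccard_similarity(a, b)


# Made-up entity sets for illustration only.
collection_terms = {"horse", "rider", "landscape", "portrait"}
thesaurus_terms = {"horse", "rider", "ship", "harbour", "portrait", "still life"}

sim = jaccard_similarity(collection_terms, thesaurus_terms)
dist = jaccard_distance(collection_terms, thesaurus_terms)
print(f"similarity={sim:.3f}, distance={dist:.3f}")
```

Here the two sets share 3 of 7 distinct terms, so the similarity is 3/7 and the distance is 4/7. Treating two empty sets as having similarity 0.0 is a convention chosen for this sketch to avoid a division by zero.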

## 2. Statistical Properties of Text Descriptions

In [None]:
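# length statistics: number of whitespace-separated tokens per description

from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

lens = list(map(len, map(str.split, descriptions.values())))

# histogram of description lengths on linear axes
plt.hist(lens)
plt.show()

# number of descriptions per length, plotted on log-log axes
ls, freqs = list(zip(*sorted(Counter(lens).items())))
plt.loglog(ls, freqs, ".")
plt.show()

# histogram with both axes log-scaled (base 10)
sns.histplot(lens, log_scale=(10, 10))
plt.show()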
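
The plots show the shape of the length distribution; a few summary numbers can complement them. The following is a minimal sketch on made-up data: `toy_descriptions` is a hypothetical stand-in for `descriptions`, assumed here to map identifiers to description strings.

```python
import statistics
from collections import Counter

# Made-up descriptions for illustration only; keys and texts are hypothetical.
toy_descriptions = {
    "obj:1": "Portrait of a man with a hat",
    "obj:2": "Landscape with horses",
    "obj:3": "Ship in a harbour at dusk",
    "obj:4": "Still life",
    "obj:5": "Rider on a horse",
}

# Token counts per description, using the same whitespace split as the cell above.
toy_lens = [len(text.split()) for text in toy_descriptions.values()]

summary = {
    "n": len(toy_lens),
    "min": min(toy_lens),
    "max": max(toy_lens),
    "mean": statistics.mean(toy_lens),
    "median": statistics.median(toy_lens),
}

# Number of descriptions per length, the quantity shown in the log-log plot.
toy_freqs = Counter(toy_lens)

print(summary)
print(sorted(toy_freqs.items()))
```

For heavily skewed length distributions the median is usually a more robust summary than the mean, which is one reason to report both.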