# Keyword extraction experiments

The cell below loads the mathlib statements together with the keywords extracted from each of them.

In [4]:
import json

# Use a context manager so the file handle is closed after loading.
with open("full_mathlib_keyword_summary.json", "r") as f:
    mathlib_keyword_summary = json.load(f)

A function to find the statements whose extracted keywords include a given keyword.

In [2]:
def get_statements_with_keyword(kw:str) -> list:
    return [stmt for stmt in mathlib_keyword_summary if kw in stmt["keywords"]]
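
Note that `kw in stmt["keywords"]` tests for an exact dictionary key, not a substring, so a statement tagged only with a longer phrase containing the keyword is not returned. A minimal sketch on a made-up sample (the two records, their scores, and the helper name `statements_with_keyword` are hypothetical; only the `doc_string` and `keywords` field names come from this notebook):

```python
# Hypothetical sample mimicking the record layout used in this notebook.
sample_summary = [
    {"doc_string": "A monoid has an identity element.",
     "keywords": {"monoid": 0.05, "identity element": 0.12}},
    {"doc_string": "Products in a commutative monoid can be reordered.",
     "keywords": {"commutative monoid": 0.02, "products": 0.20}},
]

def statements_with_keyword(kw: str, summary: list) -> list:
    # `kw in dict` checks keys exactly; "monoid" does not match "commutative monoid".
    return [stmt for stmt in summary if kw in stmt["keywords"]]

exact = statements_with_keyword("monoid", sample_summary)
print(len(exact))  # only the first record has the exact key "monoid"
```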

---

A test with a specific keyword.

In [4]:
for stmt in get_statements_with_keyword("monoid"):
    print(stmt["keywords"]["monoid"])

0.09568045026443411
0.07438724594558906
0.06844581806138879
0.05393656033701752
0.0528063806271324
0.0528063806271324
0.059572864407278624
0.09593831581184391
0.18569843656348187
0.0596404586934656
0.17881754828257995
0.11227863081557292
0.14390902704051098
0.05393656033701752
0.18569843656348187
0.059572864407278624
0.0768411838291854
0.07030442419566416
0.0528063806271324
0.28325026875694265
0.1155310835876123
0.0771485953923296
0.06037786452709367
0.30753389830415107
0.054709255964732355
0.15831692877998726
0.1778692885602501
0.09568045026443411
0.08205340856523911
0.2094232239987773
0.13309686053898662
0.04491197687864554
0.0771485953923296
0.12113252405818227
0.05331699930238388
0.19744254481508877
0.04491197687864554
0.052376395424323874
0.14222822903176371
0.15831692877998726
0.1155310835876123
0.09631441144923199
0.15831692877998726
0.04491197687864554
0.05815421818951193
0.053316999302383886
0.04491197687864554
0.09492398510093508
0.07767990991064867
0.04491197687864554
0.0210
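
The numbers above are the stored keyword scores. They appear to come from Yake (the same values show up in the Yake outputs later in this notebook), and with Yake a lower score marks a more relevant keyword. A minimal sketch of ranking matches accordingly, on a made-up sample (the records, scores, and the helper name `rank_statements` are hypothetical):

```python
# Hypothetical records in the layout of `mathlib_keyword_summary`.
sample_summary = [
    {"doc_string": "statement A", "keywords": {"monoid": 0.18}},
    {"doc_string": "statement B", "keywords": {"monoid": 0.04}},
    {"doc_string": "statement C", "keywords": {"group": 0.10}},
    {"doc_string": "statement D", "keywords": {"monoid": 0.09}},
]

def rank_statements(kw: str, summary: list, top: int = 3) -> list:
    """Return up to `top` statements containing `kw`, lowest score first."""
    matches = [stmt for stmt in summary if kw in stmt["keywords"]]
    # Ascending sort: a lower Yake score means a more relevant keyword.
    matches.sort(key=lambda stmt: stmt["keywords"][kw])
    return matches[:top]

ranked = rank_statements("monoid", sample_summary)
for stmt in ranked:
    print(stmt["keywords"]["monoid"], stmt["doc_string"])
```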

---

## The Yake keyword extractor

This section and the next require [`Yake` - _Yet Another Keyword Extractor_](https://pypi.org/project/yake/) to be installed.

`Yake` can be installed by running `pip3 install -U yake`.

In [13]:
import yake

kw_extractor = yake.KeywordExtractor()

An example of extracting keywords from a statement:

In [6]:
kw_extractor.extract_keywords("Every ring is a field.")

[('field', 0.15831692877998726), ('ring', 0.29736558256021506)]

---
# ProofWiki statement experiments

In [7]:
import random

The collection of theorems from ProofWiki. The cell below needs to be run only once.

In [8]:
with open("./../misc_experiments/proofwiki.json", "r") as f:
    proofwiki_theorems = json.load(f)["dataset"]["theorems"]

A small function to choose a random theorem from ProofWiki together with one of its proofs. This cell also needs to be run just once.

In [9]:
def choose_theorem():
    # Resample until the chosen theorem has at least one proof.
    random_theorem = random.choice(proofwiki_theorems)
    while not random_theorem["proofs"]:
        random_theorem = random.choice(proofwiki_theorems)

    return random_theorem, random.choice(random_theorem["proofs"])
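
The loop above retries until it hits a theorem with at least one proof, so it would never terminate if no theorem in the collection had one. An alternative sketch filters first and then samples; the function name `choose_theorem_filtered`, the `rng` parameter, and the sample records are assumptions made for illustration:

```python
import random

def choose_theorem_filtered(theorems: list, rng: random.Random):
    """Pick a random theorem that has a proof, plus one of its proofs."""
    with_proofs = [thm for thm in theorems if thm["proofs"]]
    if not with_proofs:
        raise ValueError("no theorem in the collection has a proof")
    theorem = rng.choice(with_proofs)
    return theorem, rng.choice(theorem["proofs"])

# Hypothetical records using the "title" and "proofs" fields seen in this notebook.
sample_theorems = [
    {"title": "T1", "proofs": []},
    {"title": "T2", "proofs": [{"contents": ["proof of T2"]}]},
]

theorem, proof = choose_theorem_filtered(sample_theorems, random.Random(0))
print(theorem["title"])  # T2 is the only theorem with a proof
```

Passing a seeded `random.Random` makes a particular draw reproducible, which can help when revisiting an interesting theorem.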

The experimentation begins here. We start by choosing a random theorem and its corresponding proof. 

In [62]:
random_theorem, random_proof = choose_theorem()

The title of the theorem is:

In [63]:
random_theorem["title"]

'Element of Group is in its own Coset/Right'

The next code cell prints the theorem statement and the chosen proof. Since the proof can be quite long, the printing is guarded by `if False:`; change it to `if True:` only when a careful inspection of the proof is required.

In [66]:
if False:
    # The theorem statement
    for line in random_theorem["contents"]:
        print(line)

    print("\n\n Proof: \n")

    # The proof
    for line in random_proof["contents"]:
        print(line)

The keywords extracted from the theorem statement and proof are in the output of the next cell. This part uses `Yake`, a keyword extraction tool. 

In [67]:
kw_extractor.extract_keywords(
    ' '.join(random_theorem["contents"]) + 
    ' '.join(random_proof["contents"]))

[('Identity Element', 0.023603978357594858),
 ('Identity', 0.042649923710523965),
 ('Coset', 0.04460543155230717),
 ('Definition', 0.04726221252621686),
 ('Element', 0.10569732868798065),
 ('modulo', 0.1700603157922483),
 ('Identity of Subgroup', 0.21792319760342144),
 ('Subgroup', 0.27548831158732046),
 ('Qed', 0.3133276295785712),
 ('exists', 0.6165876982414822),
 ('behaviour', 0.665034344509052),
 ('result', 0.665034344509052)]

For comparison, these are the pages (definitions and results) referenced in the theorem statement and proof:

In [68]:
random_theorem["refs"] + random_proof["refs"]

['Definition:Coset/Right Coset',
 'Definition:Identity (Abstract Algebra)/Two-Sided Identity',
 'Identity of Subgroup',
 'Definition:Identity (Abstract Algebra)/Two-Sided Identity',
 'Definition:Coset/Right Coset']
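
One rough way to make this comparison concrete is to check which extracted keywords occur in a referenced page title. The sketch below copies the keywords and references from the outputs above; the matching rule (case-insensitive substring, after dropping a namespace prefix such as `Definition:`) is an assumption chosen for illustration, not part of the original experiment:

```python
extracted = ["Identity Element", "Identity", "Coset", "Definition", "Element",
             "modulo", "Identity of Subgroup", "Subgroup", "Qed", "exists",
             "behaviour", "result"]

refs = ["Definition:Coset/Right Coset",
        "Definition:Identity (Abstract Algebra)/Two-Sided Identity",
        "Identity of Subgroup",
        "Definition:Identity (Abstract Algebra)/Two-Sided Identity",
        "Definition:Coset/Right Coset"]

# Drop the namespace prefix (text before the first colon) and de-duplicate.
titles = {ref.split(":", 1)[-1].lower() for ref in refs}

# Keep the keywords that occur inside at least one referenced title.
matched = [kw for kw in extracted if any(kw.lower() in t for t in titles)]
print(matched)
```

Under this rule, four of the twelve keywords (`Identity`, `Coset`, `Identity of Subgroup`, `Subgroup`) occur in a referenced title.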

These are the categories in ProofWiki that the *theorem* belongs to.  

In [69]:
random_theorem["categories"]

['Element of Group is in its own Coset']

---

In [6]:
'''
# Disabled: builds the keyword -> statement-index lookup from scratch.
mathlib_keyword_lookup = {}

for i, obj in enumerate(mathlib_keyword_summary):
    for k in obj["keywords"]:
        mathlib_keyword_lookup.setdefault(k, []).append(i)

# Sort each index list by keyword score, lowest first.
for k in mathlib_keyword_lookup:
    mathlib_keyword_lookup[k].sort(key=lambda i: mathlib_keyword_summary[i]["keywords"][k])
'''

In [2]:
def keywords(stmt:str) -> dict:
    return dict(kw_extractor.extract_keywords(stmt))

In [10]:
with open("mathlib_keyword_lookup.json", "r") as f:
    mathlib_keyword_lookup = json.load(f)
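
The lookup maps each keyword to a list of indices into `mathlib_keyword_summary`; judging by the commented-out cell above, the indices are ordered by score, lowest first. A minimal sketch of building such a lookup and round-tripping it through JSON, on hypothetical records (the helper name `build_lookup` is an assumption, and the in-memory `json.dumps`/`json.loads` stands in for writing and reading the file):

```python
import json
from collections import defaultdict

# Hypothetical records in the layout of `mathlib_keyword_summary`.
sample_summary = [
    {"keywords": {"field": 0.30, "ring": 0.10}},
    {"keywords": {"field": 0.05}},
    {"keywords": {"ring": 0.20, "field": 0.15}},
]

def build_lookup(summary: list) -> dict:
    """Map each keyword to statement indices, lowest score first."""
    lookup = defaultdict(list)
    for i, record in enumerate(summary):
        for kw in record["keywords"]:
            lookup[kw].append(i)
    for kw, indices in lookup.items():
        indices.sort(key=lambda i: summary[i]["keywords"][kw])
    return dict(lookup)

lookup = build_lookup(sample_summary)
# JSON keeps string keys and integer lists intact, so the round trip is lossless.
restored = json.loads(json.dumps(lookup))
print(restored["field"])
```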

In [19]:
def fetch_relevant_mathlib_statements(stmt:str) -> list:
    relevant_mathlib_statements = []

    for kw in keywords(stmt):
        # Keywords absent from the lookup contribute no matches.
        for idx in mathlib_keyword_lookup.get(kw, [])[:3]: # top 3 matches
            match_stmt = mathlib_keyword_summary[idx]
            if match_stmt not in relevant_mathlib_statements:
                relevant_mathlib_statements.append(match_stmt)

    return relevant_mathlib_statements
    #return [mathlib_keyword_summary[idx] for kw in keywords(stmt) for idx in mathlib_keyword_lookup[kw][:5]]
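
The `not in` test on the result list compares whole records against everything collected so far. A variant that tracks the indices already seen avoids those comparisons while keeping the same order; the two differ only if distinct indices hold identical records. In this sketch the keywords are passed in directly so that it runs without Yake; the function name `fetch_by_index`, its parameters, and the sample data are all hypothetical:

```python
def fetch_by_index(kws: list, lookup: dict, summary: list, per_kw: int = 3) -> list:
    """Collect the top `per_kw` statements for each keyword, without repeats."""
    seen = set()
    results = []
    for kw in kws:
        # Keywords missing from the lookup simply contribute nothing.
        for idx in lookup.get(kw, [])[:per_kw]:
            if idx not in seen:
                seen.add(idx)
                results.append(summary[idx])
    return results

# Hypothetical data: three statements and a lookup with best matches first.
sample_summary = ["stmt 0", "stmt 1", "stmt 2"]
sample_lookup = {"field": [1, 2, 0], "ring": [0, 2]}

found = fetch_by_index(["integral domain", "field", "ring"],
                       sample_lookup, sample_summary, per_kw=2)
print(found)
```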

In [21]:
[stmt["doc_string"] for stmt in fetch_relevant_mathlib_statements("A finite integral domain is a field.")]

[' A `comm_ring` `K` which is the localization of an integral domain `R` at `R - {0}` is an integral domain.',
 ' Given an integral domain `A` with field of fractions `K`, and an injective ring hom `g : A →+* L` where `L` is a field, field hom induced from `K` to `L` maps `f x / f y` to `g x / g y` for all `x : A, y ∈ non_zero_divisors A`.',
 'In an integral domain, a sum indexed by a nontrivial homomorphism from a finite group is zero.',
 'Sum of elements in a `intermediate_field` indexed by a `finset` is in the `intermediate_field`.',
 'Sum of a multiset of elements in a `intermediate_field` is in the `intermediate_field`.',
 'If the integral extension `R → S` is injective, and `S` is a field, then `R` is also a field.',
 "If `S` is a finite `R`-algebra, then `S' = M⁻¹S` is a finite `R' = M⁻¹R`-algebra.",
 "This is a version of **Hall's Marriage Theorem** in terms of a relation between types `α` and `β` such that `α` is finite and the image of each `x : α` is finite (it suffices for 

In [22]:
keywords("A finite integral domain is a field.")

{'finite integral domain': 0.0042542192213185686,
 'finite integral': 0.02570861714399338,
 'integral domain': 0.02570861714399338,
 'field': 0.09568045026443411,
 'finite': 0.15831692877998726,
 'integral': 0.15831692877998726,
 'domain': 0.15831692877998726}