# Keyword extraction experiments

Running the cell below loads the mathlib statements together with the keywords extracted from them.

In [15]:
import json

# Use a context manager so the file is closed after loading
with open("full_mathlib_keyword_summary.json", "r") as f:
    mathlib_keyword_summary = json.load(f)

A function to find the statements containing a given keyword.

In [18]:
def get_statements_with_keyword(kw: str) -> list:
    return [stmt for stmt in mathlib_keyword_summary if kw in stmt["keywords"]]
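
Each call to `get_statements_with_keyword` scans the whole summary. When many keywords are looked up, an inverted index built once may be more convenient. The sketch below assumes, as the function above does, that every entry is a dict with a `"keywords"` list. The name `build_keyword_index` and the sample entries are made up for illustration and are not part of the notebook's data.

```python
from collections import defaultdict

def build_keyword_index(statements: list) -> dict:
    """Map each keyword to the list of statements that mention it."""
    index = defaultdict(list)
    for stmt in statements:
        # set() guards against a keyword listed twice in one statement
        for kw in set(stmt["keywords"]):
            index[kw].append(stmt)
    return dict(index)

# Hypothetical stand-in for mathlib_keyword_summary
sample_summary = [
    {"doc_string": "A statement about free groups.", "keywords": ["free group", "homomorphism"]},
    {"doc_string": "A statement about rings.", "keywords": ["ring"]},
    {"doc_string": "Another statement about free groups.", "keywords": ["free group"]},
]

keyword_index = build_keyword_index(sample_summary)
for stmt in keyword_index.get("free group", []):
    print(stmt["doc_string"])
```

With the real data, `build_keyword_index(mathlib_keyword_summary)` would be called once and then queried with `.get(kw, [])`.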

---

A test with a specific keyword.

In [19]:
for stmt in get_statements_with_keyword("free group"):
    print(stmt["doc_string"])

 Two homomorphisms out of a free group are equal if they are equal on generators.  See note [partially-applied ext lemmas].
 If two words correspond to the same element in the free group, then they have a common maximal reduction. This is the proof that the function that sends an element of the free group to its maximal reduction is well-defined.
 A word and its maximal reduction correspond to the same element of the free group.
Quotienting a group by its additive torsion subgroup yields an additive torsion free group.
Quotienting a group by its torsion subgroup yields a torsion free group.
 If two words have a common maximal reduction, then they correspond to the same element in the free group.
 The universal property of a free group: A functions from the generators of `G` to another group extends in a unique way to a homomorphism from `G`.  Note that since `is_free_group.lift` is expressed as a bijection, it already expresses the universal property.
The canonical injection from the t

---

## The Yake keyword extractor

Running this section and the next requires [`Yake` - _Yet Another Keyword Extractor_](https://pypi.org/project/yake/) to be installed.

`Yake` can be installed by running `pip3 install -U yake`.

In [1]:
import yake

kw_extractor = yake.KeywordExtractor()

An example of extracting keywords from a statement. Each result is a `(keyword, score)` pair.

In [20]:
kw_extractor.extract_keywords("Every ring is a field.")

[('field', 0.15831692877998726), ('ring', 0.29736558256021506)]
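
In `Yake`, a lower score marks a more relevant keyword, so `field` ranks above `ring` here. A small helper for ordering and trimming such results might look like the sketch below. The function `top_keywords` and its parameters `max_score` and `limit` are illustrative names, not part of the `yake` package.

```python
def top_keywords(scored_keywords, max_score=None, limit=None):
    """Return keywords ordered from most to least relevant (lowest score first)."""
    ranked = sorted(scored_keywords, key=lambda pair: pair[1])
    if max_score is not None:
        # Keep only keywords whose score does not exceed the threshold
        ranked = [(kw, score) for kw, score in ranked if score <= max_score]
    # Slicing with limit=None keeps every remaining keyword
    return ranked[:limit]

# The output of the cell above, copied here so the sketch runs on its own
extracted = [('field', 0.15831692877998726), ('ring', 0.29736558256021506)]

print(top_keywords(extracted, limit=1))
print(top_keywords(extracted, max_score=0.2))
```

Any threshold passed as `max_score` is a judgment call, since scores depend on the length and content of the input text.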

---
# ProofWiki statement experiments

In [28]:
import random

The cell below loads the collection of theorems from ProofWiki. It needs to be run only once.

In [32]:
# Use a context manager so the file is closed after loading
with open("./../misc_experiments/proofwiki.json", "r") as f:
    proofwiki_theorems = json.load(f)["dataset"]["theorems"]

A small function to choose a random theorem and one of its proofs from ProofWiki. This cell also needs to be run just once.

In [62]:
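def choose_theorem():
    # Resample until the chosen theorem has at least one recorded proof
    random_theorem = random.choice(proofwiki_theorems)
    while not random_theorem["proofs"]:
        random_theorem = random.choice(proofwiki_theorems)

    # Return the theorem together with one of its proofs, picked at random
    return random_theorem, random.choice(random_theorem["proofs"])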
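
The resampling loop above never terminates if no theorem in the collection has a proof. One possible alternative, sketched here, filters the candidates first and fails loudly when there are none. The name `choose_theorem_with_proof` and the sample data are hypothetical; only the `"proofs"` key is taken from the cell above.

```python
import random

def choose_theorem_with_proof(theorems, rng=random):
    """Pick a random theorem that has at least one proof, and one of its proofs."""
    # Keep only theorems whose list of proofs is non-empty
    candidates = [thm for thm in theorems if thm["proofs"]]
    if not candidates:
        raise ValueError("no theorem in the collection has a proof")
    theorem = rng.choice(candidates)
    return theorem, rng.choice(theorem["proofs"])

# Hypothetical miniature stand-in for proofwiki_theorems
sample_theorems = [
    {"title": "Theorem A", "proofs": []},
    {"title": "Theorem B", "proofs": [{"contents": ["Proof of B."]}]},
]

theorem, proof = choose_theorem_with_proof(sample_theorems)
print(theorem["title"])
```

Passing a seeded `random.Random` instance as `rng` would make the choice reproducible across runs.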

The experimentation begins here. We start by choosing a random theorem and its corresponding proof. 

In [70]:
random_theorem, random_proof = choose_theorem()

The title of the theorem is:

In [71]:
random_theorem["title"]

'Basis of Free Module is No Greater than Generator'

The next code cell prints the theorem statement followed by the chosen proof. Since the proof can be quite long, it is best to run this cell only when a careful inspection of the proof is required.

In [None]:
# The theorem statement
for line in random_theorem["contents"]:
    print(line)

print("\n\n Proof: \n")

# The proof
for line in random_proof["contents"]:
    print(line)

The keywords extracted from the theorem statement and proof are in the output of the next cell. This part uses `Yake`, a keyword extraction tool. 

In [79]:
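# Join all statement and proof lines with spaces, so the last word of the
# statement is not fused with the first word of the proof
kw_extractor.extract_keywords(
    ' '.join(random_theorem["contents"] + random_proof["contents"]))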

[('Definition', 0.0031778568390870257),
 ('Free Module', 0.0061323616934051755),
 ('Module Homomorphism', 0.007148831579508057),
 ('Module', 0.007640668938014451),
 ('Canonical Basis', 0.014328056073927967),
 ('Basis', 0.020649934194435365),
 ('Commutative and Unitary', 0.021179825255396932),
 ('Module on Set', 0.030081708982670794),
 ('Unitary Ring', 0.030194218673231944),
 ('Free', 0.03055705496922549),
 ('paren', 0.0342159951743744),
 ('set', 0.03641188546202711),
 ('commutative ring', 0.03951471460385235),
 ('Homomorphism', 0.03962017082079049),
 ('Generator', 0.045559548356186294),
 ('Basis of Free', 0.04612927143326951),
 ('Commutative', 0.05148192694927661),
 ('Vector Space', 0.05444464630800907),
 ('Canonical', 0.05808429902925768),
 ('Maximal Ideal', 0.05822570296344944)]
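
Some entries in this list, such as `Definition` and `paren`, look like markup residue from the ProofWiki source rather than mathematical terms. A simple post-filter with a hand-written stoplist is one way to drop them. The stoplist below is an assumption based only on this one output, and `drop_noise` is an illustrative name.

```python
# Assumed stoplist: tokens that look like markup residue in the output above
MARKUP_NOISE = {"definition", "paren"}

def drop_noise(scored_keywords, noise=MARKUP_NOISE):
    """Remove keywords that are in the stoplist, compared case-insensitively."""
    return [(kw, score) for kw, score in scored_keywords if kw.lower() not in noise]

# A few entries copied from the output above
extracted = [
    ('Definition', 0.0031778568390870257),
    ('Free Module', 0.0061323616934051755),
    ('Module Homomorphism', 0.007148831579508057),
    ('paren', 0.0342159951743744),
]

for kw, score in drop_noise(extracted):
    print(f"{score:.4f}  {kw}")
```

Cleaning the markup out of the text before extraction would likely be more robust, since residue tokens also influence the scores of the remaining keywords.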

For comparison, these are some of the definitions and results referenced in the theorem statement and proof:

In [81]:
random_theorem["refs"] + random_proof["refs"]

['Definition:Commutative and Unitary Ring',
 'Definition:Free Module',
 'Definition:Basis (Linear Algebra)',
 'Definition:Generator of Module',
 'Definition:Injection',
 'Definition:Generator of Module',
 'Definition:Surjection',
 'Definition:Linear Transformation',
 'Definition:Free Module on Set',
 'Definition:Basis of Module',
 'Definition:Isomorphism (Abstract Algebra)/R-Algebraic Structure Isomorphism/Module Isomorphism',
 'Definition:Surjection',
 'Definition:Linear Transformation',
 "Krull's Theorem",
 'Definition:Maximal Ideal of Ring',
 'Maximal Ideal iff Quotient Ring is Field',
 'Definition:Field (Abstract Algebra)',
 'Definition:Quotient Mapping',
 'Definition:Linear Transformation',
 'Definition:Direct Sum of Module Homomorphisms',
 'Definition:Surjection',
 'Definition:Free Module on Set/Canonical Basis',
 'Definition:Free Module on Set/Canonical Basis',
 'Definition:Generator of Vector Space',
 'Definition:Surjection',
 'Definition:Generator of Vector Space',
 'Basis of 
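
To make this comparison less manual, the extracted keywords can be checked against the reference titles. The sketch below strips the `Definition:` prefix, lowercases both sides, and reports the keywords that occur inside at least one reference. Substring matching is a crude criterion chosen for illustration, and the function names are hypothetical.

```python
def normalise_ref(ref: str) -> str:
    """Drop a leading 'Definition:' prefix and lowercase the title."""
    prefix = "Definition:"
    if ref.startswith(prefix):
        ref = ref[len(prefix):]
    return ref.lower()

def keywords_found_in_refs(keywords, refs):
    """Return the keywords that occur as a substring of at least one reference."""
    # Building a set also removes duplicate references
    terms = {normalise_ref(ref) for ref in refs}
    return [kw for kw in keywords if any(kw.lower() in term for term in terms)]

# A few entries copied from the two outputs above
keywords = ["Free Module", "Canonical Basis", "Maximal Ideal", "paren"]
refs = [
    "Definition:Free Module",
    "Definition:Generator of Module",
    "Definition:Generator of Module",
    "Definition:Free Module on Set/Canonical Basis",
    "Definition:Maximal Ideal of Ring",
]

print(keywords_found_in_refs(keywords, refs))
```

On this sample, every keyword except `paren` is matched by some reference.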

These are the categories in ProofWiki that the *theorem* belongs to.  

In [74]:
random_theorem["categories"]

['Free Modules']

---