# Using GITTA to reconstruct a grammar from generated examples

This notebook demonstrates [GITTA](https://github.com/twinters/gitta) *(Grammar Induction using a Template Tree Approach)*
by generating texts from a known grammar, and then using only those texts to induce a template-driven generative grammar.

In [1]:
import random
from gitta import grammar_induction
from gitta.context_free_grammar import ContextFreeGrammar

random.seed(123)


## Creating dataset
We first define a small generative grammar and use it to generate every string it can produce.

In [2]:
rules = {
    'origin': '<hello>, <location>!',
    'hello': ['Hello', 'Greetings', 'Howdy', 'Hey'],
    'location': ['world', 'solar system', 'galaxy', 'universe']
}

grammar = ContextFreeGrammar.from_string(rules)
original_dataset = grammar.generate_all_string()
dataset = list(original_dataset)
dataset

['Hello, universe!',
 'Greetings, solar system!',
 'Greetings, world!',
 'Hey, galaxy!',
 'Hey, world!',
 'Hello, galaxy!',
 'Howdy, galaxy!',
 'Greetings, universe!',
 'Greetings, galaxy!',
 'Howdy, universe!',
 'Hello, world!',
 'Howdy, solar system!',
 'Hello, solar system!',
 'Howdy, world!',
 'Hey, universe!',
 'Hey, solar system!']
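
To make explicit what "generating all strings" means for a grammar of this shape, here is a minimal plain-Python sketch that does not use GITTA at all. The helper name `expand_all` and the `SLOT` pattern are hypothetical and chosen for illustration; this is not how GITTA implements generation, and it assumes the grammar is not recursive.

```python
import re
from itertools import product

# A slot looks like <name>; the capture group makes re.split keep the slot names.
SLOT = re.compile(r"<([^<>]+)>")

def expand_all(rules, symbol="origin"):
    """Return every string derivable from `symbol` (assumes no recursion)."""
    values = rules[symbol]
    if isinstance(values, str):  # allow a single template given as a bare string
        values = [values]
    results = set()
    for template in values:
        # Splitting alternates literal text (even indices) and slot names (odd indices).
        parts = SLOT.split(template)
        options = [
            [part] if i % 2 == 0 else sorted(expand_all(rules, part))
            for i, part in enumerate(parts)
        ]
        # Every combination of slot fillers yields one generated string.
        results.update("".join(combo) for combo in product(*options))
    return results

rules = {
    'origin': '<hello>, <location>!',
    'hello': ['Hello', 'Greetings', 'Howdy', 'Hey'],
    'location': ['world', 'solar system', 'galaxy', 'universe'],
}

strings = expand_all(rules)
print(len(strings))  # 4 greetings x 4 locations = 16
```

The count matches the 16 strings listed above: one template with two slots of four values each.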

Let's use only about half of the dataset (9 of the 16 strings) for induction, to show that the induced grammar generalises to strings it has never seen.
You can of course leave this step out and learn from all examples instead.

In [3]:
number_of_training_instances = 9
random.shuffle(dataset)
dataset = dataset[:number_of_training_instances]
dataset

['Howdy, universe!',
 'Hey, galaxy!',
 'Greetings, galaxy!',
 'Hello, galaxy!',
 'Greetings, world!',
 'Greetings, universe!',
 'Hello, world!',
 'Howdy, solar system!',
 'Hello, universe!']
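
The sketch below illustrates what this split implies: the strings left out of the training sample are exactly the ones the induced grammar has to regenerate for it to count as generalising. It is a standalone illustration with hypothetical variable names (`full`, `train`, `held_out`), and because it shuffles a sorted list with its own generator, the particular strings it selects will not necessarily match the sample shown above.

```python
import random
from itertools import product

greetings = ['Hello', 'Greetings', 'Howdy', 'Hey']
locations = ['world', 'solar system', 'galaxy', 'universe']

# All 16 strings of the original grammar, in a deterministic order.
full = sorted(f"{g}, {loc}!" for g, loc in product(greetings, locations))

# A local generator keeps the notebook's global random state untouched.
rng = random.Random(123)
shuffled = full[:]
rng.shuffle(shuffled)

train, held_out = shuffled[:9], shuffled[9:]
print(len(train), len(held_out))  # 9 7
```

Any string in `held_out` that the induced grammar can still produce is evidence of generalisation rather than memorisation.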

## Grammar induction

Let's use GITTA to induce a grammar from the sampled texts alone; GITTA never sees the original grammar that produced them.

In [4]:
reconstructed_grammar = grammar_induction.induce_grammar_using_template_trees(
    dataset,
    words_per_slot=2,
    relative_similarity_threshold=0.2, # This value decides when to join value lists
)
print(reconstructed_grammar.to_json())


{
    "origin": [
        "<A>, <E>!"
    ],
    "A": [
        "Greetings",
        "Hello",
        "Hey",
        "Howdy"
    ],
    "E": [
        "galaxy",
        "solar system",
        "universe",
        "world"
    ]
}


### Evaluation
Let's check whether the induced grammar is truly the same as the original one in two ways: by comparing all possible generations,
and by checking isomorphism (two grammars are isomorphic if they become identical once the slot names of one are mapped onto those of the other).

In [5]:
all_generations = reconstructed_grammar.generate_all()
all_generations

{"Greetings, galaxy!",
 "Greetings, solar system!",
 "Greetings, universe!",
 "Greetings, world!",
 "Hello, galaxy!",
 "Hello, solar system!",
 "Hello, universe!",
 "Hello, world!",
 "Hey, galaxy!",
 "Hey, solar system!",
 "Hey, universe!",
 "Hey, world!",
 "Howdy, galaxy!",
 "Howdy, solar system!",
 "Howdy, universe!",
 "Howdy, world!"}

In [6]:
print("Same grammar:", reconstructed_grammar.is_isomorphic_with(grammar))
print("Same grammar output:", {s.to_flat_string() for s in all_generations} == set(original_dataset))

Same grammar: True
Same grammar output: True
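
To give an intuition for the isomorphism check, here is a minimal brute-force sketch in plain Python. It is an assumption-laden illustration, not GITTA's implementation of `is_isomorphic_with`: the helpers `rename` and `is_isomorphic` are hypothetical, grammars are plain dicts whose values are lists of templates, and every renaming of the non-origin slots is tried, which is only practical for very small grammars.

```python
import re
from itertools import permutations

SLOT = re.compile(r"<([^<>]+)>")

def rename(rules, mapping):
    """Copy `rules`, renaming every slot (in keys and in templates) via `mapping`."""
    def sub(text):
        return SLOT.sub(lambda m: "<" + mapping.get(m.group(1), m.group(1)) + ">", text)
    return {mapping[name]: frozenset(sub(v) for v in values)
            for name, values in rules.items()}

def is_isomorphic(g1, g2, start="origin"):
    """True if some one-to-one renaming of g1's slots turns it into g2."""
    if start not in g1 or start not in g2 or len(g1) != len(g2):
        return False
    names1 = [n for n in g1 if n != start]
    names2 = [n for n in g2 if n != start]
    target = {name: frozenset(values) for name, values in g2.items()}
    for perm in permutations(names2):  # factorial cost: small grammars only
        mapping = dict(zip(names1, perm))
        mapping[start] = start
        if rename(g1, mapping) == target:
            return True
    return False

original = {
    "origin": ["<hello>, <location>!"],
    "hello": ["Hello", "Greetings", "Howdy", "Hey"],
    "location": ["world", "solar system", "galaxy", "universe"],
}
induced = {
    "origin": ["<A>, <E>!"],
    "A": ["Greetings", "Hello", "Hey", "Howdy"],
    "E": ["galaxy", "solar system", "universe", "world"],
}

print(is_isomorphic(original, induced))  # True
```

Here the renaming `hello -> A`, `location -> E` makes the two rule sets identical, which is why the slot names chosen by the induction step do not matter for the comparison.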
