### 0. Install Hugging Face Transformers

In [None]:
%pip install transformers



### 1. Import dependencies

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

### 2. Load the pre-trained GPT-2 model and tokenizer

In [None]:
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

### 3. Provide an example XML (optional)

In [None]:
example_xml = """
    <car>
        <make>Honda</make>
        <model>Civic</model>
        <year>2014</year>
    </car>
"""

### 4. Design the prompt

In [None]:
prompt1 = example_xml + "This is an example xml. Generate a new xml featuring a menu: \n <Menu>"

### 5. Generate the text and XML content

In [None]:
generated_text = model.generate(
    tokenizer.encode(prompt1, return_tensors="pt"),  # input_ids
    max_length=50,  # counts prompt tokens plus generated tokens
    do_sample=True,
    num_beams=1,  # no beam search
    no_repeat_ngram_size=5,
    top_k=50,
    top_p=1.0,  # nucleus filtering disabled
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)


* input_ids: The tensor containing the tokenized input sequence (here, the encoded prompt).

* max_length: The maximum length of the output sequence in tokens, counting the prompt tokens as well as the newly generated ones.

* do_sample: If set to True, the next token is sampled from the model's probability distribution instead of always taking the most likely token. Sampling is what makes top_k, top_p, and temperature take effect.

* num_beams: Number of beams for beam search, i.e. how many candidate sequences are kept at each generation step. Beam search explores several candidate sequences in parallel, which can improve the quality of the generated text. A value of 1 means no beam search.

* no_repeat_ngram_size: Size of n-grams that may not be repeated. An n-gram is a sequence of n consecutive tokens. With a value of 5, no 5-token sequence can appear twice in the output, which helps the model avoid repetitive patterns.

* top_k: The number of highest-probability tokens kept as candidates at each sampling step.

* top_p: Cumulative probability threshold for nucleus sampling, an alternative to top-k sampling. The next token is sampled from the smallest set of most probable tokens whose cumulative probability reaches the threshold. A value of 1 disables this filter. It can also be combined with top_k.

* temperature: Controls randomness in sampling. A lower temperature (closer to 0.0) makes completions less random; as it approaches zero, the model becomes more deterministic and repetitive. A value of 1.0 leaves the model's distribution unchanged, and values above 1.0 increase randomness and creativity at the cost of coherence.

* pad_token_id: ID of the padding token in the vocabulary. GPT-2 has no dedicated padding token, so the end-of-sequence token ID is used instead.

Source: https://huggingface.co/docs/transformers/generation_strategies
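
To make the interaction of temperature, top_k, and top_p more concrete, here is a minimal pure-Python sketch of one sampling step. It is a simplified illustration, not the way `transformers` implements these filters. The function names (`softmax`, `filter_top_k_top_p`, `sample_token`) and the five-token toy vocabulary with its scores are assumptions made up for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    top_k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / top_k_total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    # Renormalize so the surviving candidates sum to 1.
    return {index: p / kept_total for index, p in kept}


def sample_token(tokens, logits, temperature, top_k, top_p, rng):
    """Apply temperature, then top-k and top-p filtering, then sample one token."""
    probs = softmax(logits, temperature)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return tokens[rng.choices(indices, weights=weights, k=1)[0]]


# Hypothetical toy vocabulary and scores, invented for illustration only.
tokens = ["<item>", "<name>", "</menu>", "pizza", "the"]
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

rng = random.Random(0)
for temperature in (0.1, 1.0, 2.0):
    draws = [sample_token(tokens, logits, temperature, 50, 0.9, rng) for _ in range(10)]
    print(temperature, draws)
```

In this sketch a low temperature concentrates almost all probability on the highest-scoring token, while a higher temperature spreads the draws across more of the candidates that survive the top_k and top_p filters.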

### 6. Decode the generated text and XML content

In [None]:
generated_xml = tokenizer.decode(generated_text[0], skip_special_tokens=True)

In [None]:
print(generated_xml)


    <car>
        <make></make>
        <model></model>
        <year></year>
    </car>
This is an example xml. Generate a new xml featuring a menu: 
 <Menu>

<MenuItem name="menu">

<Menu item="menuItem">

<Item name="menuItem"> <Item name="menuitem">

<item name="menuitem" value="menuitem"> <item name="menuItem" value="menuItem"> </item>

</item>

<Item item="menuitem


### 7. Create more prompts and test

In [None]:
prompt2 = "Toyota makes three different kinds of vehicles. Generate an XML to categorize these vehicle types \n <Toyota>"

In [None]:
generated_text2 = model.generate(
    tokenizer.encode(prompt2, return_tensors="pt"),
    max_length=100,
    do_sample=True,
    num_beams=5,  # sampling combined with 5 beams
    no_repeat_ngram_size=5,
    top_k=50,
    top_p=0.7,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)

In [None]:
generated_xml2 = tokenizer.decode(generated_text2[0], skip_special_tokens=True)

In [None]:
print(generated_xml2)

Toyota makes three different kinds of vehicles. Generate an XML to categorize these vehicle types 
 <Toyota> <Vehicle> <Name>Toyota</Name> </Vehicle> </Toyota>

<Toyota> <vehicle> <name>Toyota</name> </vehicle>

<vehicle> <type>Toyota</type>

<name>Toyota Prius</name>

</vehicle>



In [None]:
prompt3 = "There are three primary colors. Generate an XML to summarize these colors \n <Color>"

In [None]:
generated_text3 = model.generate(
    tokenizer.encode(prompt3, return_tensors="pt"),
    max_length=100,
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=5,
    top_k=10,
    top_p=0.7,
    temperature=0.1,
    pad_token_id=tokenizer.eos_token_id,
)

In [None]:
generated_xml3 = tokenizer.decode(generated_text3[0], skip_special_tokens=True)

In [None]:
print(generated_xml3)

There are three primary colors. Generate an XML to summarize these colors 
 <Color> <Color name="color_1" color="red" color="blue" color="green" color="blue"> <Color name="Color name="color2" color="red, green, blue" color="blue, yellow" color="red"> <Color name="" color="red, blue, yellow" color="" color="blue, green, blue"> <Color name= "color3


In [None]:
example_xml4 = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
"""

In [None]:
prompt4 = example_xml4 + "This is a XML file showing what genres of books are sold in a bookstore. Generate another one following the same format of category, title, author, year, and price. \n"

In [None]:
generated_text4 = model.generate(
    tokenizer.encode(prompt4, return_tensors="pt"),
    max_length=500,
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=10,
    num_return_sequences=1,
    top_k=50,
    top_p=1.0,
    temperature=0.3,
    pad_token_id=tokenizer.eos_token_id,
)

In [None]:
generated_xml4 = tokenizer.decode(generated_text4[0], skip_special_tokens=True)

In [None]:
print(generated_xml4)


<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
This is a XML file showing what genres of books are sold in a bookstore. Generate another one following the same format of category, title, author, year, and price. 
<bookstore>  <book category="cooking">    <title lang="en">Italian</title>
 <author>Giada de Laurentiis</author>    <year>2004</year>
 <price>30.00</price>  </book>  <book category="children">
<book category="children">
<title lang="en">Harry Potter and the Philosopher's Stone</title>
 <author>J. K. Rowling</author>


In [None]:
prompt5 = "Generate a XML file showing a book's genre, language, author, and price \n <bookstore>"

In [None]:
generated_text5 = model.generate(
    tokenizer.encode(prompt5, return_tensors="pt"),
    max_length=75,
    do_sample=True,
    num_beams=1,
    no_repeat_ngram_size=5,
    top_k=50,
    top_p=1.0,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)

In [None]:
generated_xml5 = tokenizer.decode(generated_text5[0], skip_special_tokens=True)

In [None]:
print(generated_xml5)

Generate a XML file showing a book's genre, language, author, and price 
 <bookstore> <title>HarperCollins.com - The Bookseller of the Century (2005)</title> </bookstore> <author>HarperCollins</author> <prices> <pricescene> <author>Charles Scribner</author>
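
Several of the generations above stop mid-tag or leave elements unclosed, so it can help to check each result for well-formedness before using it. The sketch below uses Python's standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` and the sample strings are assumptions made for illustration; the strings are modeled on the outputs above and are not actual model output. Because the decoded text also contains the prompt, the XML portion has to be cut out first, which is done here by searching for the root tag.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical decoded text: prompt followed by a complete XML document.
decoded = (
    "Generate an XML to categorize these vehicle types \n "
    "<Toyota><Vehicle><Name>Toyota</Name></Vehicle></Toyota>"
)
# Drop the prompt by slicing from the first occurrence of the root tag.
xml_part = decoded[decoded.index("<Toyota>"):]

# Hypothetical truncated generation, cut off before the closing tags.
truncated = '<bookstore><book category="children"><title lang="en">Harry Potter'

print(is_well_formed(xml_part))
print(is_well_formed(truncated))
```

A check like this only confirms that tags are balanced and properly nested. It says nothing about whether the content matches what the prompt asked for.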
