
karinboye

Training GPT-2 for generating Karin Boye poems

In this repository you can find files and links connected to the project Ammaseus Horisont, in which OpenAI's pretrained 355M-parameter GPT-2 model has been retrained to generate Swedish poems in the style of Karin Boye. The GPT-2 model was published in the paper Language Models are Unsupervised Multitask Learners (Radford et al., 2019), and a full repository with runnable code and models from that paper can be found here: https://github.com/openai/gpt-2

Due to the size of some of the datasets and the trained model, not all files are uploaded to this repository; instead, links to the files are provided below.

In brief, the pretrained GPT-2 model was retrained in four steps (see below). All training took place on Google Cloud, on a standard GPU instance with an NVIDIA Tesla K80. The learning rate was left at its default for the first two steps and lowered for the last two.
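A minimal sketch of this four-step schedule, assuming the gpt-2-simple wrapper (which the Colab notebook linked further down is built on). The "slower rate" for the last two steps is not specified in this README, so 1e-5 below is only a placeholder (the library default is 1e-4), and Trainingdata1.txt is a placeholder name for the Wikipedia text:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="355M")  # fetch OpenAI's pretrained 355M checkpoint
sess = gpt2.start_tf_sess()

# Steps 1-2: generic Swedish text, then selected literature, default learning rate
gpt2.finetune(sess, "Trainingdata1.txt", model_name="355M", steps=20000, restore_from="fresh")
gpt2.finetune(sess, "Trainingdata2.txt", model_name="355M", steps=18000, restore_from="latest")

# Steps 3-4: poems, at a lower (placeholder) learning rate
gpt2.finetune(sess, "Trainingdata3.txt", model_name="355M", steps=2000,
              learning_rate=1e-5, restore_from="latest")
gpt2.finetune(sess, "Trainingdata4.txt", model_name="355M", steps=2000,
              learning_rate=1e-5, restore_from="latest")
```

Note that gpt-2-simple counts training steps rather than epochs; the counts above simply mirror the numbers in this README.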

Step 1

Training for about 20 000 epochs on generic Swedish text from Swedish Wikipedia, corresponding to approximately 10 GB of text. A compressed XML dump can be downloaded from Svenska Språkbanken at: Wikipedia-sv

Step 2

After this, the network was trained directly on a smaller dataset of selected Swedish literature from Project Runeberg (Trainingdata2.txt), approximately 90 MB of text. See listoftrainingdata2.txt for a full list of the selected works. The full text file with all works combined can be downloaded here: Trainingdata2.txt

After 5 000 epochs, the network generated Swedish text like:

Han blev nästan sällan värkande, och fällde en man, som rörde sig in i den myckenhet, 
att han skulle kunna vässa spelmännen vid detta rum.

(Roughly: "He became almost seldom aching, and felled a man who moved into such abundance that he could sharpen the fiddlers by this room.")

And after 18 000 epochs of training, the network generated text like:

Hör! Nu säger vi så: det är en skugga, en frisk och djupt dundrande, 
den är bara som en backe, ingen kan släcka det, och du behagar endast säga mer.

(Roughly: "Listen! Now we say this: it is a shadow, a fresh and deeply thundering one, it is only like a hill, no one can extinguish it, and you care only to say more.")

Step 3

The network was then trained on a set of selected Swedish poems (Trainingdata3.txt) for 2 000 epochs. A list of the selected works can be found in listoftrainingdata3.txt.

Step 4

Finally, the network was trained only on Karin Boye's poems (Trainingdata4.txt) for 2 000 epochs. Examples of results during this phase are shown below. A list of the selected works can be found in listoftrainingdata4.txt.

500 epochs

Du blommar och guld, guld 
Väntar på långt innan mörker 
och stelnad känner du döden. 

(Roughly: "You bloom and gold, gold / waiting long before darkness / and stiffened you feel death.")

1 000 epochs

Vacker är måsarna, 
lösta från resorna
av himlafurstarnas ängel.

(Roughly: "Beautiful are the gulls, / released from their journeys / by the angel of the princes of heaven.")

1 800 epochs

Jag ville veta, vad som väntade
de döda har sökt oss
i sömn låg vi hemma.

(Roughly: "I wanted to know what awaited / the dead have sought us / in sleep we lay at home.")
Ammaseus Horisont

More examples of generated poems can be found in the text file exampleresults.txt.

Test the trained model

You can download the final model here: PretrainedGPT-2KarinBoye (9.95 GB), and then test it by running it in this free Google Colab notebook: Run a GPT-2 Text-Generating Model, using the instructions below (you need a Google Drive account with at least 10 GB of free storage).

  1. Copy the trained model above to your Google Drive.
  2. Open the link to the Colab above in a Chrome web browser.
  3. Make a local copy of the Colab in your Google Drive: File->Save a copy in Drive.
  4. Run the 1st and 2nd cells in order to initialize TensorFlow and allocate a free GPU resource from Google Colab.
  5. Run the 4th cell to mount your Google Drive, and follow the instructions.
  6. Jump to the 9th cell, Load a Trained Model Checkpoint, run it with gpt2.copy_checkpoint_from_gdrive(run_name='PretrainedGPT-2KarinBoye'), and wait for the checkpoint to be copied over.
  7. Run the 10th cell: gpt2.load_gpt2(sess, run_name='PretrainedGPT-2KarinBoye')
  8. Run the 11th cell, Generate Text From The Trained Model.
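Condensed, the cells above amount to roughly the following (gpt-2-simple API, which the linked notebook is built on; the run_name must match the folder name of the checkpoint on your Google Drive). This sketch only runs inside Colab, since mount_gdrive relies on the Colab environment:

```python
import gpt_2_simple as gpt2

gpt2.mount_gdrive()                                                    # cell 4
gpt2.copy_checkpoint_from_gdrive(run_name='PretrainedGPT-2KarinBoye')  # cell 9
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='PretrainedGPT-2KarinBoye')              # cell 10
gpt2.generate(sess, run_name='PretrainedGPT-2KarinBoye')               # cell 11
```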

If you want to generate output with a specific seed and/or control the temperature of the output, jump to the next cell. A lower temperature gives more consistent but less creative output, whereas a higher temperature produces less coherent but more creative results. For the results in exampleresults.txt, a temperature of 0.8 was used.
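In gpt-2-simple terms, that cell is a generate() call with extra arguments. A sketch, assuming sess and the model are loaded as above; the prefix string is just an invented example:

```python
gpt2.generate(
    sess,
    run_name='PretrainedGPT-2KarinBoye',
    prefix="I rörelse",   # starting text for the poem (example; any string works)
    seed=42,              # fix the random seed for reproducible output
    temperature=0.8,      # the value used for exampleresults.txt
    length=200,
)
```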

If you want to generate output into a text file, jump to the 13th cell, Generate Text From The Pretrained Model, and follow the instructions.

It is also possible to do some limited further training of the model in this Colab, but usually not for more than a couple of hours before the runtime expires.

Make sure to include the license information in LICENSE.txt if you use text generated by this model.
