## IMPORTS

In [19]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [21]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 367Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 5.61Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 390Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:28, 17.4Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 242Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 6.81Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 7.21Mit/s]


In [22]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Uploading dataset

In [23]:
file_name = "skill_dataset.txt"

In [24]:
df = pd.read_excel('/content/drive/My Drive/JobDescriptionPrediction/Resources/Data/processed/skills_dataset.xlsx')
df[['JobID','Job_Title','Skill','Output']].head(n=2)

|   | JobID | Job_Title | Skill | Output |
|---|---|---|---|---|
| 0 | 1 | data scientist | sap, sql | ['any experience in statistical modeling field... |
| 1 | 2 | data scientist | machine learning, r, sas, sql, python | ['spss, sas, stata, r) required.experience wit... |


In [25]:
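TRAIN_SIZE = 0.8

def split_data(df, S=TRAIN_SIZE):
    """Split df by position: the first S fraction for training, the rest for validation."""
    print(S)  # report the fraction actually used, not the global default

    # Split into training and validation sets
    train_size = int(S * len(df))
    train_data = df[:train_size]
    val_data = df[train_size:]

    return train_data, val_data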

In [26]:
train_data, test_data = split_data(df, TRAIN_SIZE)

f'There are {len(train_data):,} samples for training, and {len(test_data):,} samples for validation testing'

0.8


'There are 3,337 samples for training, and 835 samples for validation testing'
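
The split above is positional: the first 80% of rows go to training and the remainder to validation. If the spreadsheet happens to be ordered (for example, grouped by job title), a shuffled split may be more representative. The following is a minimal sketch using only the standard library and row indices; `shuffled_split` and its `seed` parameter are hypothetical names, not part of this notebook.

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    """Return (train_indices, val_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # deterministic for a given seed
    cut = int(train_fraction * n_rows)        # same rounding as split_data
    return indices[:cut], indices[cut:]

# 4,172 rows reproduces the 3,337 / 835 sizes reported above.
train_idx, val_idx = shuffled_split(4172)
print(len(train_idx), len(val_idx))
```

With pandas, the resulting index lists could be applied as `df.iloc[train_idx]` and `df.iloc[val_idx]`.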

In [27]:
# Write the training rows as tab-separated text, reusing file_name so finetune() reads the same file
train_data.to_csv(file_name, header=True, index=False, sep='\t', mode='w')

In [None]:

# gpt2.copy_file_from_gdrive("/content/drive/My Drive/JobDescriptionPrediction/Resources/Data/processed/skill_dataset.txt")

## Finetune GPT-2

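The next cell starts the actual finetuning of GPT-2. It sets `steps`, `sample_every`, and `save_every` all to `len(train_data)` (3,337), so a sample is generated and a checkpoint is saved only at the very end of the run.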


In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=len(train_data),
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=len(train_data),
              save_every=len(train_data)
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:06<00:00,  6.14s/it]


dataset has 1321375 tokens
Training...
[10 | 27.39] loss=3.37 avg=3.37
[20 | 48.84] loss=3.29 avg=3.33
[30 | 70.43] loss=2.99 avg=3.22
[40 | 92.23] loss=2.93 avg=3.14
[50 | 114.21] loss=3.05 avg=3.12
[60 | 136.38] loss=2.89 avg=3.08
[70 | 158.63] loss=2.55 avg=3.01
[80 | 180.95] loss=2.60 avg=2.95
[90 | 203.27] loss=3.04 avg=2.96
[100 | 225.59] loss=2.80 avg=2.95
[110 | 247.92] loss=2.69 avg=2.92
[120 | 270.19] loss=2.78 avg=2.91
[130 | 292.45] loss=2.96 avg=2.91
[140 | 314.69] loss=2.91 avg=2.91
[150 | 336.97] loss=2.86 avg=2.91
[160 | 359.27] loss=2.68 avg=2.89
[170 | 381.55] loss=2.76 avg=2.89
[180 | 403.82] loss=2.49 avg=2.86
[190 | 426.10] loss=2.44 avg=2.84
[200 | 448.43] loss=2.53 avg=2.82
[210 | 470.73] loss=2.54 avg=2.81
[220 | 493.01] loss=2.47 avg=2.79
[230 | 515.30] loss=2.56 avg=2.78
[240 | 537.60] loss=2.49 avg=2.77
[250 | 559.90] loss=2.22 avg=2.74
[260 | 582.19] loss=2.41 avg=2.73
[270 | 604.45] loss=2.39 avg=2.71
[280 | 626.73] loss=2.12 avg=2.69
[290 | 649.04] loss=2.
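
Each progress line above has the form `[step | elapsed seconds] loss=... avg=...`. A small parser makes it easy to estimate throughput from such a log. This is a sketch only: `parse_log_line` is a hypothetical helper, and the regular expression assumes the exact format printed above.

```python
import re

LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log_line(line):
    """Return (step, seconds, loss, avg) or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    step, seconds, loss, avg = match.groups()
    return int(step), float(seconds), float(loss), float(avg)

first = parse_log_line("[10 | 27.39] loss=3.37 avg=3.37")
last = parse_log_line("[280 | 626.73] loss=2.12 avg=2.69")

# Average wall-clock time per training step between the two log lines.
seconds_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"{seconds_per_step:.2f} s/step")
```

At roughly 2.2 seconds per step, the 3,337 steps requested would take about two hours, assuming the rate stays constant.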

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [None]:
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can generate text. By default, `generate` prints a single sample from the loaded model.

In [None]:
gpt2.generate(sess, run_name='run1')

NameError: ignored

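By default, `generate` prints the text and returns `None`. If you're creating an API based on your model and need to pass the generated text elsewhere, pass `return_as_list=True` and take the first element: `text = gpt2.generate(sess, return_as_list=True)[0]`.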
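
A minimal sketch of wrapping that call in a helper is shown below. The generate function is passed in as an argument so the sketch can run without a loaded model; `first_sample` and `fake_generate` are hypothetical names used only for illustration.

```python
def first_sample(generate_fn, sess, **kwargs):
    """Call a gpt2.generate-style function and return the first text as a str."""
    texts = generate_fn(sess, return_as_list=True, **kwargs)
    if not texts:
        raise ValueError("the generate function returned no samples")
    return texts[0]

# Stand-in for gpt2.generate, used here only so the sketch runs without a model.
def fake_generate(sess, return_as_list=False, prefix="", **kwargs):
    texts = [prefix + " generated text"]
    return texts if return_as_list else None

print(first_sample(fake_generate, sess=None, prefix="data science sap sql"))
```

In this notebook the call would presumably look like `first_sample(gpt2.generate, sess, run_name='run1', prefix="data science sap sql", length=50)`.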

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
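
One caveat, stated here as an assumption about the library rather than something this notebook demonstrates: gpt-2-simple appears to require `nsamples` to be a multiple of `batch_size`. The hypothetical helper below picks the largest batch size, up to the cap of 20 mentioned above, that divides `nsamples` evenly.

```python
def pick_batch_size(nsamples, max_batch_size=20):
    """Largest batch size <= max_batch_size that divides nsamples evenly."""
    if nsamples < 1:
        raise ValueError("nsamples must be at least 1")
    for candidate in range(min(nsamples, max_batch_size), 0, -1):
        if nsamples % candidate == 0:
            return candidate

for n in (2, 5, 100, 37):
    print(n, pick_batch_size(n))
```

A prime `nsamples` above the cap (such as 37) falls back to a batch size of 1, so rounding `nsamples` to a friendlier number may be worthwhile.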

Other optional-but-helpful parameters for `gpt2.generate` and friends:

* **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
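
To make `temperature`, `top_k`, and `top_p` more concrete, here is a simplified illustration on a toy four-token vocabulary. It is not the library's implementation (which works on logits inside the TensorFlow graph); the function names and toy scores are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; k=0 disables filtering."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]                  # toy scores for a four-token vocabulary
for t in (0.7, 1.0):
    print(t, [round(q, 3) for q in softmax_with_temperature(logits, t)])
print([round(q, 3) for q in top_k_filter(softmax_with_temperature(logits, 1.0), 2)])
print([round(q, 3) for q in top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)])
```

Lowering the temperature concentrates probability on the top token, while `top_k` and `top_p` zero out the unlikely tail before sampling.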

In [None]:
# return_as_list=True makes generate() return the samples instead of printing them;
# without it the function returns None and the loop below raises a TypeError.
sample_outputs = gpt2.generate(sess,
              prefix="data science sap sql",
              length=50,
              temperature=0.7,
              nsamples=2,
              return_as_list=True
              )
for sample_output in sample_outputs:
  print(sample_output)

data science sap sql	['strong understanding of data structures, algorithms, and statistical analysis.', 'ability to apply some of the above mentioned skills to work cross-functionally and cross-functional at all levels within the organization.', 'excellent written and verbal communication
data science sap sql natural language processing spark python	['excellent written and verbal communication skills.', 'ability to work across the business and in the product teams.', 'excellent communication skills and ability to explain complex business concepts to team members.', 'ability to

In [None]:
type(text)

NoneType

In [None]:
print(text)

None


In [None]:
text2

'None'

In [None]:
gpt2.generate(sess,
              prefix="Associate data scientist AI Machine Learning Network R SAS C/C++ Java SPSS Data SciencePython",
              length=150,
              temperature=0.7,
              nsamples=2
              )

Associate data scientist AI Machine Learning Network R SAS C/C++ Java SPSS Data SciencePython	['[   design, develop, and launch efficient and scalable machine learning solutions leveraging data and analytics across a range of industries   deliver on the strategic vision and goals while leveraging cutting edge technologies and techniques including deep learning and machine learning   assess the value of emerging technologies in an integrated environment, and work with the latest technologies and techniques   develop and execute on advanced research projects including, but not limited to, predictive analytics, machine learning, data mining   assist in the design and implementation of advanced research projects,    bachelor’s degree in mathematics, computer science, statistics, or related field.,    3+ years of industry experience   deep understanding of machine learning   extensive knowledge of python and spark
Associate data scientist AI Machine Learning Network R SAS C/C++ Java SPSS Da

In [None]:
gpt2.generate(sess,
              prefix="data scientist TensorFlow Machine Learning Hadoop Scala Kafka HBase Big Data Java Software Development Python Elasticsearch",
              length=150,
              temperature=0.7,
              nsamples=2
              )

data scientist TensorFlow Machine Learning Hadoop Scala Kafka HBase Big Data Java Software Development Python Elasticsearch data science	['[   leverage large, complex data repositories for data analysis   build and maintain efficient and robust statistical and machine learning algorithms for machine learning   monitor and improve the quality of warranty data   develop, own, and manage a team of data scientists   collaborate with a team of data scientists to understand the business and data needs of the company   collaborate with a team of data scientists to design, implement, and monitor advanced analytics and modeling pipelines   assist with the design and development of monitoring data products   work closely with various stakeholders to understand business needs and provide solutions   create and maintain high-quality code repositories,    bachelor’s degree in mathematics, economics, computer science, or other
data scientist TensorFlow Machine Learning Hadoop Scala Kafka HBase Big D

In [None]:
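# include_prefix=False only takes effect when truncate is set to a string,
# so with truncate=None the prefix still appears in the printed samples.
# Without return_as_list=True, generate() prints the samples and `output` is None.
output = gpt2.generate(sess,
              prefix="data science AI Quantitative Analysis Data Mining Machine Learning Analysis Skills CSS",
              length=150,
              include_prefix=False,
              truncate=None,
              temperature=0.7,
              nsamples=2
              )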

data science AI Quantitative Analysis Data Mining Machine Learning Analysis Skills CSS (fluent css preferred) python1): machine learning python, css: css,r,java,ss,linux ruby,g++,cassandra,hive,node.js,r,java,ss,linux ruby,g++,cassandra,hive,node.js,r, php,javascript,javascript,javascript,javascript,javascript]']
3479	senior data scientist	machine learning r azure sas sql python php	['you have some practical experience of using tools like tableau or spotfire, and you're comfortable contributing to a cross-functional team., job requirements,    advanced degree in math, statistics, computer science, or similar quantitative field or
data science AI Quantitative Analysis Data Mining Machine Learning Analysis Skills CSS, python, r, julia, hadoop, weka, scala, sas, tableau, pandas)   execute data science solutions that can scale and automate processes within a data analytics business   use data science and data engineering techniques to develop a deep understanding of a wide variety of the w
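
The first sample above runs on into a new record after the line break. The sketch below approximates how `truncate` and `include_prefix`, as described in the parameter list, could cut such a sample at the first newline. It is a regex-based illustration, not the library's code; `truncate_sample` and the example string are hypothetical, and using `"\n"` as the truncation sequence is an assumption based on the one-record-per-line training file.

```python
import re

def truncate_sample(text, prefix, truncate, include_prefix=True):
    """Cut text at the first occurrence of `truncate`, optionally dropping the prefix."""
    if not truncate:
        return text
    if prefix and not include_prefix:
        pattern = "(?:{})(.*?)(?:{})".format(re.escape(prefix), re.escape(truncate))
    else:
        pattern = "(.*?)(?:{})".format(re.escape(truncate))
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else text

sample = "data science sql\t['build models']\n3479\tsenior data scientist\t['lead projects']"
print(repr(truncate_sample(sample, "data science sql", "\n", include_prefix=False)))
print(repr(truncate_sample(sample, "data science sql", "\n")))
```

If no truncation sequence is found, the text is returned unchanged, which mirrors what happens when `truncate` is left unset.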

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell writes the generated samples to a text file whose name includes a unique timestamp.

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [None]:
# may have to run twice to get file to download
files.download(gen_file)
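
Once the file is downloaded, the samples can be separated locally. The sketch below assumes the file separates samples with a line of 20 equals signs, which is believed to be the library's default delimiter; check your own file and adjust `SAMPLE_DELIM` if it differs. `split_samples` is a hypothetical helper.

```python
SAMPLE_DELIM = "=" * 20 + "\n"   # assumed delimiter; adjust if your file differs

def split_samples(raw_text, delimiter=SAMPLE_DELIM):
    """Split the contents of a generated file into a list of non-empty samples."""
    return [chunk.strip() for chunk in raw_text.split(delimiter) if chunk.strip()]

# Small in-memory stand-in for the contents of a generated file.
raw = "first sample\n" + SAMPLE_DELIM + "second sample\n" + SAMPLE_DELIM
for number, sample in enumerate(split_samples(raw), start=1):
    print(number, sample)
```

For a real file, `raw` would come from something like `open(path, encoding="utf-8").read()`.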

## Generate Text From The Pretrained Model

If you want to generate text from the pretrained model, not a finetuned model, pass `model_name` to `gpt2.load_gpt2()` and `gpt2.generate()`.

This is currently the only way to generate text from the 774M or 1558M models with this notebook.

In [None]:
model_name = "774M"

gpt2.download_gpt2(model_name=model_name)

In [None]:
sess = gpt2.start_tf_sess()

gpt2.load_gpt2(sess, model_name=model_name)

In [None]:
gpt2.generate(sess,
              model_name=model_name,
              prefix="The secret of life is",
              length=100,
              temperature=0.7,
              top_p=0.9,
              nsamples=5,
              batch_size=5
              )