<a href="https://colab.research.google.com/github/sergiomar73/nlp-google-colab/blob/main/nlp-poc-01-Text_Generation_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA MARKETING LABS : Text Generation

- FR : https://www.datamarketinglabs.com/guide-complet-sur-gpt-3-part2

- EN : https://www.datamarketinglabs.org/gpt-3-a-full-guide-technical-part

Objective: understand and use text generation with GPT-3.

The examples below use listing data from Inside Airbnb, which is sourced from publicly available information on the Airbnb site.

http://insideairbnb.com/get-the-data.html

## Step 1 : Install OpenAI

In [3]:
!pip install openai

import os
import openai

openai.organization = "org-XXXXXXXXXXXXXXXXXX"
openai.api_key = "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
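Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead is shown here; `load_api_key` is a hypothetical helper name, and it assumes the key was stored in the `OPENAI_API_KEY` environment variable beforehand.

```python
import os


def load_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored in the given environment mapping.

    Raises a clear error instead of silently sending an empty key.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage in the notebook (assumes the variable was set beforehand):
# openai.api_key = load_api_key()
```

The `env` parameter only exists so the helper can be tried with a plain dictionary, without touching the real environment.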


## Step 2 : Import Libraries

In [4]:
import re
import json
import pandas as pd
import openai
import string
import requests
import random

from collections import Counter
from statistics import mean 
from tqdm.auto import tqdm


## Step 3 : List Available Engines
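The call in this step returns one record per engine, each with `id`, `owner` and `ready` fields. When only the names are needed, the records can be reduced to a sorted list of ids. The sketch below uses plain dictionaries copied from the listing in this step; `engine_ids` is a hypothetical helper, and passing it `openai.Engine.list()['data']` directly assumes the returned objects behave like dictionaries.

```python
def engine_ids(engines):
    """Return the sorted ids of the engines that report ready=True."""
    return sorted(e["id"] for e in engines if e.get("ready"))


# Sample entries copied from the engine listing (plain dicts for illustration).
sample = [
    {"id": "davinci-instruct-beta", "object": "engine", "owner": "openai", "ready": True},
    {"id": "text-similarity-ada-001", "object": "engine", "owner": "openai-dev", "ready": True},
    {"id": "code-davinci-edit-001", "object": "engine", "owner": "openai", "ready": True},
]

print(engine_ids(sample))
# ['code-davinci-edit-001', 'davinci-instruct-beta', 'text-similarity-ada-001']
```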

In [5]:
openai.Engine.list()['data']

[<Engine engine id=davinci-instruct-beta at 0x7f7ff8cd9350> JSON: {
   "created": null,
   "id": "davinci-instruct-beta",
   "object": "engine",
   "owner": "openai",
   "permissions": null,
   "ready": true
 }, <Engine engine id=text-similarity-ada-001 at 0x7f7ff8cd9290> JSON: {
   "created": null,
   "id": "text-similarity-ada-001",
   "object": "engine",
   "owner": "openai-dev",
   "permissions": null,
   "ready": true
 }, <Engine engine id=text-search-babbage-query-001 at 0x7f7ff8cd91d0> JSON: {
   "created": null,
   "id": "text-search-babbage-query-001",
   "object": "engine",
   "owner": "openai-dev",
   "permissions": null,
   "ready": true
 }, <Engine engine id=code-davinci-edit-001 at 0x7f7ff8cd2ef0> JSON: {
   "created": null,
   "id": "code-davinci-edit-001",
   "object": "engine",
   "owner": "openai",
   "permissions": null,
   "ready": true
 }, <Engine engine id=babbage-search-document at 0x7f7ff8cd2fb0> JSON: {
   "created": null,
   "id": "babbage-search-document",
  

## Step 4 : Load and Clean Data


In [6]:
url = 'http://data.insideairbnb.com/united-kingdom/england/bristol/2021-02-23/visualisations/listings.csv'

%load_ext google.colab.data_table
df = pd.read_csv(url)
df.drop(columns=['id', 'host_id', 'latitude', 'longitude', 'last_review','reviews_per_month','room_type','calculated_host_listings_count','neighbourhood_group'], inplace=True)

df.columns = ['booking', 'hostname', 'city', 'pricing', 'minimum_nights', 'number_of_reviews', 'availability']
df


Unnamed: 0,booking,hostname,city,pricing,minimum_nights,number_of_reviews,availability
0,City Centre-Waterside Retreat,Marcus,Clifton,66,1,143,364
1,The White Room - Central Bristol Art House Ga...,Orla,Bedminster,29,21,39,23
2,Peaceful Safe Home & Clear Space 'The Lilac Room',Wendy,Easton,30,7,21,180
3,HUGE Room CENTRAL location House,Sue And Toby,Ashley,69,2,80,88
4,Listed Georgian house in the heart of Bristol.,Samantha,Ashley,336,2,79,345
...,...,...,...,...,...,...,...
1564,2 Bed Furnished house with Parking in Filton,Madiha,Horfield,64,1,0,3
1565,Bright Newly Furnished Box Room,Charlie,Easton,28,28,0,180
1566,1 Bedroom Flat on the Harbourside.,Adam,Southville,51,25,0,75
1567,"""3 bed house only 2.7miles from Train station""",Chantal,Brislington East,100,2,0,132
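Assigning `df.columns` by position works only while the source file keeps the same column order. A more defensive variant maps old names to new names explicitly. The sketch below shows the idea with the standard library so it runs without downloading the dataset; the source column names in `COLUMN_MAP` are an assumption about the Inside Airbnb file, and `select_and_rename` is a hypothetical helper. With pandas, the same mapping could be passed to `DataFrame.rename(columns=...)`.

```python
import csv
import io

# Assumed source column names -> names used in the rest of the notebook.
COLUMN_MAP = {
    "name": "booking",
    "host_name": "hostname",
    "neighbourhood": "city",
    "price": "pricing",
    "minimum_nights": "minimum_nights",
    "number_of_reviews": "number_of_reviews",
    "availability_365": "availability",
}


def select_and_rename(csv_text, column_map=COLUMN_MAP):
    """Keep only the mapped columns and rename them, independent of column order."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({new: row[old] for old, new in column_map.items()})
    return rows


# One row taken from the table above; the id value is a placeholder.
sample_csv = (
    "id,name,host_name,neighbourhood,price,minimum_nights,number_of_reviews,availability_365\n"
    "0,City Centre-Waterside Retreat,Marcus,Clifton,66,1,143,364\n"
)

print(select_and_rename(sample_csv)[0])
```

Note that `csv.DictReader` yields every value as a string, whereas `pd.read_csv` infers numeric types.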


## Step 5 : Prepare Helper Functions

In [15]:
import os
import openai

def build_random_prompt(data):

  # pick one random listing from the DataFrame passed in (not the global df)
  rnd = data.sample(n=1).iloc[0]

  prompt = """Please write a message that offers a detailed description of booking and add vacation ideas about the city using only the provided characteristics:
  ====
  characteristics:
  hostname: Marie
  booking: Apartment close to the train station 
  pricing: 39 euros per night
  minimum_nights: 1 night
  number_of_reviews: 0 review
  availability: 288 days per year
  city: Bordeaux
  message:
  Hi,
  My name is Marie and I am delighted to welcome you to my apartment near the train station.
  Bordeaux is a beautiful city and you can visit its museums, have a glass of wine in the Saint-Pierre district.
  The rental price is only 39 dollars per night.
  However, my apartment will be available 288 days a year and the minimum rental period is 1 night.
  Contact me if you have any questions.
  ====
  characteristics:
  hostname: Vincent
  booking: House near the Eiffel tower
  pricing: 220 euros per night
  minimum_nights: 2 nights
  number_of_reviews: 62 reviews 
  availability: 200 days per year
  city: Paris
  message:
  Hello,
  I am Vincent and I am delighted to welcome you in my house near the Eiffel Tower.
  The Eiffel Tower is a must-see if you make a trip to Paris. You can also go up to the top by elevator or on foot. More seriously, it's a really nice place to discover. 
  The price of the rental is 220 dollars per night and I have 62 positive reviews for the moment.
  I can welcome you up to 200 days a year but you have to book 2 nights minimum.
  If you have any question, contact me without hesitation.
  ====
  characteristics:
  hostname: {hostname}
  booking: {booking}
  pricing: {pricing} dollars per night
  minimum_nights: {minimum_nights}
  number_of_reviews: {number_of_reviews}
  city: {city}
  """

  # manage plurals and empty data; units are very important
  reviews = int(rnd['number_of_reviews'])
  if reviews > 1:
      number_of_reviews = str(reviews) + " reviews"
  else:
      # 0 or 1: singular form, matching "0 review" in the first example
      number_of_reviews = str(reviews) + " review"

  nights = int(rnd['minimum_nights'])
  if nights > 1:
      minimum_nights = str(nights) + " nights"
  else:
      minimum_nights = str(nights) + " night"

  # the unit is already part of number_of_reviews and minimum_nights
  prompt = prompt.format(hostname=rnd['hostname'],
                         booking=rnd['booking'],
                         pricing=rnd['pricing'],
                         minimum_nights=minimum_nights,
                         number_of_reviews=number_of_reviews,
                         city=rnd['city'])

  # if there is no availability, skip the line entirely: very important
  if rnd['availability'] > 0:
    prompt = prompt + 'availability: ' + str(rnd['availability']) + ' days per year'

  return prompt, rnd['hostname'], rnd['booking'], rnd['pricing'], minimum_nights, number_of_reviews, rnd['city']
  
  
def completion(prompt, engine_id='davinci-instruct-beta', debug=True, **kwargs):
    """Send the prompt to the completions endpoint and return the generated texts.

    Returns a list of strings on success, or a single "Error: ..." string on
    failure, so callers should check the result before indexing it.
    """

    COMPLETION_ENDPOINT = 'https://api.openai.com/v1/engines/{engine_id}/completions'.format(engine_id=engine_id)

    headers = {'Authorization': 'Bearer {api_key}'.format(api_key=openai.api_key),
              'Content-Type': 'application/json'}

    # default parameters; any keyword argument overrides them via update() below
    data = {
              "prompt": prompt,
              "max_tokens": 300,
              "temperature": 0.8,
              "stop": ["====","characteristics:"]
            }

    data.update(kwargs)

    response = requests.post(COMPLETION_ENDPOINT, headers=headers, data=json.dumps(data))
    result = response.json()

    if debug:
        # the headers are not printed because they contain the API key
        print('Data:')
        print(json.dumps(data, indent=4))
        print('Result:')
        print(json.dumps(result, indent=4))

    if response.status_code == 200:
        return [x['text'].strip() for x in result['choices']]
    else:
        return "Error: {}".format(result['error']['message'])
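The unit handling in `build_random_prompt` (singular versus plural for nights and reviews) could be factored into a small helper so that each field is formatted the same way. This is a sketch only; `with_unit` is a hypothetical name, and it keeps the singular form for 0 to match the "0 review" wording used in the first prompt example.

```python
def with_unit(count, singular, plural=None):
    """Format a count with its unit, e.g. 1 -> '1 night', 3 -> '3 nights'."""
    count = int(count)
    if plural is None:
        plural = singular + "s"
    unit = plural if count > 1 else singular
    return f"{count} {unit}"


print(with_unit(1, "night"))    # 1 night
print(with_unit(21, "night"))   # 21 nights
print(with_unit(0, "review"))   # 0 review
```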
  

## Step 6 : Build Prompt & Test Completion Function


In [18]:
hostname = ''
booking = ''
pricing = ''
minimum_nights = ''
number_of_reviews = ''
city = ''

# build the prompt
prompt,hostname,booking,pricing,minimum_nights,number_of_reviews,city = build_random_prompt(df)
print('-------------------------------')
# We focus only on the new instructions
last = prompt.rfind("characteristics:")+16
instructions  = prompt[last:]
print(instructions)
print('-------------------------------')

newtext = completion(prompt, debug=False, max_tokens=300, temperature=0.5)
print(newtext[0][8:])

-------------------------------

  hostname: Penny
  booking: Windmill Hill- spacious, sunny, comfortable room
  pricing: 35 dollars per night
  minimum_nights: 1 night
  number_of_reviews: 73 reviews
  city: Windmill Hill
  availability: 58 days per year
-------------------------------


Hi,

My name is Penny and I am delighted to welcome you to my windmill.
  Windmill Hill is a beautiful city and you can visit its museums, have a glass of wine in the Saint-Pierre district.
  The rental price is only 35 dollars per night.
  However, my windmill will be available 58 days a year and the minimum rental period is 1 night.
  Contact me if you have any questions.
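This step calls `completion` with `max_tokens=300, temperature=0.5`. To see which parameters end up in the request body without sending anything, the merge of defaults and overrides can be reproduced on its own. The sketch below mirrors the default values used in `completion`; `DEFAULTS` and `build_payload` are hypothetical names introduced for illustration.

```python
import json

# Same default values as in the completion() helper from Step 5.
DEFAULTS = {"max_tokens": 300, "temperature": 0.8, "stop": ["====", "characteristics:"]}


def build_payload(prompt, **overrides):
    """Merge caller overrides into the default completion parameters."""
    payload = {"prompt": prompt, **DEFAULTS}
    payload["stop"] = list(DEFAULTS["stop"])  # copy so callers cannot mutate the defaults
    payload.update(overrides)
    return payload


# Placeholder prompt; in the notebook this would be the output of build_random_prompt.
payload = build_payload("characteristics: ...", max_tokens=300, temperature=0.5)
print(json.dumps(payload, indent=4))
```

The two stop sequences are what keep the model from starting another example after the generated message.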


## Step 7 : Run many examples and save results to CSV

In [19]:
rows = []
num_examples = 10
output_path = 'output.csv'

for i in range(num_examples):

  prompt,hostname,booking,pricing,minimum_nights,number_of_reviews,city = build_random_prompt(df)

  # keep only the characteristics of the new listing (text after the last "characteristics:")
  last = prompt.rfind("characteristics:")+16
  instructions  = prompt[last:]

  # generate the completion
  newtext = completion(prompt, engine_id='curie-instruct-beta', debug=False, max_tokens=300)
  # remove the leading "message:" prefix (8 characters)
  newtext = newtext[0][8:]

  rows.append({'input_data': instructions, 'generated_result': newtext})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_out = pd.DataFrame(rows)
df_out.to_csv(output_path, index=None)

print('Saved output to:', output_path)

df_out

Saved output to: output.csv


Unnamed: 0,input_data,generated_result
0,\n hostname: Mairead\n booking: Spacious and...,"\n Hi,\n My name is Mairead and I am delight..."
1,"\n hostname: Tony\n booking: Cosy ,rustic,se...","\n Hello,\n My name is Tony and I am delight..."
2,\n hostname: Eleanor\n booking: Double room ...,"\n Hi, I am Eleanor and I am delighted to wel..."
3,\n hostname: Pru\n booking: Massive flat... ...,"\n Hi,\n My name is Pru and I am delighted t..."
4,\n hostname: Aneta\n booking: Beautiful and ...,"\n Hi,\n My name is Aneta and I am delighted..."
5,\n hostname: Shena\n booking: A tranquil hid...,My name is Shena and I am delighted to welcom...
6,\n hostname: SACO Bristol West India\n booki...,\n My name is SACO Bristol West India and my ...
7,\n hostname: Penny\n booking: Bishopston/ Re...,"\n Hi,\n My name is Penny and I am delighted..."
8,\n hostname: Carlos\n booking: Ensuite room ...,"\n Hi,\n My name is Carlos and I am delighte..."
9,\n hostname: Hopewell\n booking: Lower Park ...,\n Hello my name is Hopewell and welcome to L...
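Slicing with `[8:]` assumes every completion starts with `message:`. When the model skips that prefix, the slice removes the first eight characters of the actual message. A more tolerant variant checks for the prefix first; this is a sketch, and `strip_prefix` is a hypothetical helper name.

```python
def strip_prefix(text, prefix="message:"):
    """Remove `prefix` from the start of `text` only if it is actually there."""
    stripped = text.lstrip()
    if stripped.startswith(prefix):
        return stripped[len(prefix):]
    return text


print(repr(strip_prefix("message:\n  Hi,\n  My name is Penny")))
print(repr(strip_prefix("Hi, my name is Penny")))

# For comparison, the fixed-width slice damages a completion without the prefix:
print(repr("Hi, my name is Penny"[8:]))  # 'ame is Penny'
```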


## Step 8 : Benchmark your results

In [20]:
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=b8564d9b3f183795fbd348abf019a39d848fa340b15222699115854ada2683db
  Stored in directory: /root/.cache/pip/wheels/84/ac/6b/38096e3c5bf1dc87911e3585875e21a3ac610348e740409c76
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [21]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

# Exactly the same sentence
score1 = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown fox jumps over the lazy dog')

print('score 1 ( same sentences ) : '+str(score1))

# Change the last word.
score2 = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown fox jumps over the lazy mouse')

print('score 2 ( last word changed ) : '+str(score2))

# Swap the words fox and dog
score3 = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps over the lazy fox')

print('score 3 ( Inverse the word: fox and dog ) : '+str(score3))

# Keep only the same beginning
score4 = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown mouse eats a piece of cheese')

print('score 4 ( Keep only the same beginning ) : '+str(score4))

score 1 ( same sentences ) : {'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
score 2 ( last word changed ) : {'rougeL': Score(precision=0.8888888888888888, recall=0.8888888888888888, fmeasure=0.8888888888888888)}
score 3 ( Inverse the word: fox and dog ) : {'rougeL': Score(precision=0.7777777777777778, recall=0.7777777777777778, fmeasure=0.7777777777777778)}
score 4 ( Keep only the same beginning ) : {'rougeL': Score(precision=0.3333333333333333, recall=0.3333333333333333, fmeasure=0.3333333333333333)}
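The scores above follow from the longest common subsequence (LCS) of the two token sequences: precision is the LCS length divided by the number of tokens in the second text, recall divides by the number of tokens in the first. The sketch below reimplements that idea in plain Python to make the arithmetic visible. It is a simplification, not the `rouge_score` implementation: it splits on whitespace and lowercases, with no stemming. `lcs_length` and `rouge_l` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target, prediction):
    """Return (precision, recall, f-measure) on lowercased whitespace tokens."""
    t, p = target.lower().split(), prediction.lower().split()
    lcs = lcs_length(t, p)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(p), lcs / len(t)
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = 'The quick brown fox jumps over the lazy dog'
# Swapping fox and dog leaves a common subsequence of 7 tokens out of 9.
print(rouge_l(reference, 'The quick brown dog jumps over the lazy fox'))
```

For the swapped sentence the common subsequence is "the quick brown jumps over the lazy", which gives 7/9, the 0.777 reported as score 3.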


In [24]:
rows = []
num_examples = 2
output_path = 'output.csv'

for i in range(num_examples):

  prompt,hostname,booking,pricing,minimum_nights,number_of_reviews,city = build_random_prompt(df)

  # keep only the characteristics of the new listing (text after the last "characteristics:")
  last = prompt.rfind("characteristics:")+16
  instructions  = prompt[last:]

  # generate the completion
  newtext = completion(prompt, engine_id='curie-instruct-beta', debug=False, max_tokens=300, temperature=0.95)

  # remove the leading "message:" prefix (8 characters)
  newtext = newtext[0][8:]

  # use ROUGE-L to compare the generated text with a text-spinning baseline
  # TODO : you can create one or three basic texts
  basic_text = """
  Hi, I am {hostname} and I am delighted to welcome you to {booking}. 
  You can book it for {pricing} dollars per night, with a minimum stay of {minimum_nights}.
  I have {number_of_reviews} for the moment. 
  I am sure you will enjoy your stay in {city}. Please let me know if you have any questions.
  """

  basic_text =  basic_text.format(hostname=hostname, 
                         booking=booking,
                         pricing=pricing,
                         minimum_nights=minimum_nights,
                         number_of_reviews=number_of_reviews,
                         city=city)
  
  # score(target, prediction): precision is the share of the baseline's tokens
  # that appear, in order, in the generated text
  scoreRougeL = scorer.score(newtext,basic_text)['rougeL'].precision
  
  rows.append({'input_data': instructions, 'generated_result': newtext, 'basic_text': basic_text, 'RougeL': scoreRougeL})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_out = pd.DataFrame(rows)
df_out.to_csv(output_path, index=None)

print('Saved output to:', output_path)

df_out

Saved output to: output.csv


Unnamed: 0,input_data,generated_result,basic_text,RougeL
0,\n hostname: Tim\n booking: Bijou room in To...,"\n Hello friend,\n I am Tim and I am delight...","\n Hi, I am Tim and I am delighted to welcome...",0.396552
1,\n hostname: Kirsty\n booking: Quirky Cosy C...,"\n Hi,\n What can I say? We have an amazing ...","\n Hi, I am Kirsty and I am delighted to welc...",0.339286
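With more examples, the per-row scores are easier to read once they are summarized. The sketch below averages the two RougeL values from the table above with `statistics.mean` (already imported in Step 2) and picks the row closest to the baseline template. Treating a higher score as "closer to text spinning" is an interpretation of this benchmark, not a fixed threshold.

```python
from statistics import mean

# RougeL values copied from the table above, keyed by hostname.
scores = {"Tim": 0.396552, "Kirsty": 0.339286}

average = mean(scores.values())
closest = max(scores, key=scores.get)

print(f"mean RougeL: {average:.6f}")
print(f"closest to the template: {closest}")
```

In the notebook itself the same summary could be computed on the `RougeL` column of `df_out`.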
