# Blog Article Generation

This notebook is meant to be used after running the [blog-title-generation.ipynb](../blog-title-generation/blog-title-generation.ipynb) notebook. It will read the previously generated blog titles from the data directory and generate blog content for each subject.

This notebook is safe to run multiple times as you add topics. It writes to the filesystem and checks that a topic hasn't already had an article generated for it before calling the model.

To regenerate article content for a title after changing parameters or prompts, delete the `data/gen/blog-articles` directory or one of its child directories that corresponds to the title you want to run again.
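
As a minimal sketch of that cleanup step, the helper below deletes the directory for a single topic. The name `reset_article` is hypothetical and not part of this repository; it assumes the same directory naming rule that `get_output_directory` uses later in this notebook. The demo runs against a temporary directory rather than the real data directory.

```python
import os
import shutil
import tempfile


def reset_article(blog_articles_dir, topic):
    """Deletes the generated article directory for one topic, if it exists.

    Returns True if a directory was removed, False otherwise.
    """
    # same naming rule as get_output_directory() later in this notebook
    topic_cleaned = topic.replace(" ", "-").replace("'", "").replace(":", "")
    article_dir = os.path.join(blog_articles_dir, topic_cleaned)

    if os.path.isdir(article_dir):
        shutil.rmtree(article_dir)
        return True

    return False


# demo against a throwaway directory rather than the real data directory
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "5-Tips-for-Learning-to-Code"))
removed = reset_article(demo_root, "5 Tips for Learning to Code")
print(removed)
```

After removing a directory this way, rerunning the generation cell should regenerate only that topic, since every other topic is skipped.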

The end goal is to use the results to fine-tune Google's `text-bison` model for our specific purpose.

## Example generated files

You can find the generated files and training/evaluation sets from when this example was run at:
- [data/gen/blog-titles-example](../../data/gen/blog-titles-example)
- [data/gen/blog-articles-example](../../data/gen/blog-articles-example)
- [data/training-sets/blog-generation/training.jsonl](../../data/training-sets/blog-generation/training.jsonl)
- [data/training-sets/blog-generation/evaluation.jsonl](../../data/training-sets/blog-generation/evaluation.jsonl)

## Configure variables

In [1]:
import os

# ****************** [START] Google Cloud project settings ****************** #
project = os.getenv('GCP_PROJECT')
location = os.environ.get('GCP_REGION', 'us-central1')
# ******************* [END] Google Cloud project settings ******************* #


# ********************** [START] data directory config ********************** #
from helpers.files import get_data_dir
data_dir = get_data_dir()

# directory containing generated blog titles
blog_titles_dir = os.path.join(data_dir, 'gen', 'blog-titles')
blog_articles_dir = os.path.join(data_dir, 'gen', 'blog-articles')
# *********************** [END] data directory config *********************** #


# *********************** [START] LLM parameter config ********************** #
# Vertex AI model to use for the LLM
model_name = 'text-bison@002'

# maximum number of model responses generated per prompt
candidate_count = 1

# determines the maximum amount of text output from one prompt.
# a token is approximately four characters.
max_output_tokens = 1024

# temperature controls the degree of randomness in token selection.
# lower temperatures are good for prompts that expect a true or
# correct response, while higher temperatures can lead to more
# diverse or unexpected results. With a temperature of 0 the highest
# probability token is always selected. for most use cases, try
# starting with a temperature of 0.2.
temperature = 0.2

# top-p changes how the model selects tokens for output. Tokens are
# selected from most probable to least until the sum of their
# probabilities equals the top-p value. For example, if tokens A, B, and C
# have a probability of .3, .2, and .1 and the top-p value is .5, then the
# model will select either A or B as the next token (using temperature).
# the default top-p value is .8.
top_p = 0.8

# top-k changes how the model selects tokens for output.
# a top-k of 1 means the selected token is the most probable among
# all tokens in the model’s vocabulary (also called greedy decoding),
# while a top-k of 3 means that the next token is selected from among
# the 3 most probable tokens (using temperature).
top_k = 40
# *********************** [END] LLM parameter config ************************ #


# ********************** [START] Configuration Checks *********************** #
if not project:
    raise Exception('GCP_PROJECT environment variable not set')
# *********************** [END] Configuration Checks ************************ #
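
The top-k and top-p comments in the cell above can be made concrete with a toy filter. This is only an illustration of the idea, not the model's actual sampling code; the function name `filter_candidates` and the small rounding tolerance are assumptions made for this sketch. It applies top-k first and then top-p, and leaves the final temperature-based pick out.

```python
def filter_candidates(token_probs, top_k, top_p):
    """Returns the tokens that remain eligible after top-k, then top-p filtering."""
    # rank tokens from most to least probable
    ranked = sorted(token_probs.items(), key=lambda item: item[1], reverse=True)

    # top-k: keep only the k most probable tokens
    ranked = ranked[:top_k]

    # top-p: keep tokens until their cumulative probability reaches top_p
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p - 1e-9:  # small tolerance for float rounding
            break

    return kept


# the example from the top-p comment: A, B, and C with a top-p of .5
token_probs = {"A": 0.3, "B": 0.2, "C": 0.1}
print(filter_candidates(token_probs, top_k=40, top_p=0.5))  # ['A', 'B']
print(filter_candidates(token_probs, top_k=1, top_p=0.8))   # ['A']
```

The second call shows why a top-k of 1 behaves like greedy decoding: only the single most probable token survives, whatever the top-p value is.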

## Import and Initialize Vertex AI Client

Importing the SDK logs TensorFlow messages about missing CUDA drivers and the GPU not being used. You can safely ignore them. If you want to use the GPU, that's possible on Linux with Docker, but on macOS you'll need to set up a non-containerized development environment.

In [2]:
from google.cloud import aiplatform
import vertexai

vertexai.init(project=project, location=location)

print(f"Vertex AI SDK version: {aiplatform.__version__}")


2023-12-13 04:29:13.088716: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-13 04:29:13.090528: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-13 04:29:13.109668: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-13 04:29:13.109703: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-13 04:29:13.109723: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to regi

Vertex AI SDK version: 1.36.0


## Few-Shot Prompt Configuration

This is where we show a few examples of the input we're going to give to the model, and what we expect the output to look like.

We break the prompt into sections for `Context:` and `Examples:`, and use two blank lines as a delimiter between the examples.

In [3]:
def create_prompt(topic):
  prompt = f"""
Context:
Generate an informative article in the style of a menacing galactic enforcer, known for his deep, commanding voice and lack of patience. The character should exude authority and a sense of dark power, with a hint of mechanical undertones in the speech pattern, like Darth Vader from Star Wars.
The article should contain 3 sentences per section with numbered headers for each section.

Examples:
Article subject: 3 ways to improve slow Wi-Fi speeds
Article content: 3 Ways to Improve Slow Wi-Fi Speeds

1. Unleash the Power of the Dark Side: Diagnose the Weakness

Troopers, the struggle with slow Wi-Fi is akin to facing a rebellion within our own systems. Begin by identifying the weaknesses in your connection, much like we seek out the vulnerabilities in the Rebel Alliance. Diagnose the interference, whether it be from neighboring networks or hidden devices, and bring order to the chaos.

2. Enhance Your Technological Arsenal: Upgrade Your Equipment

In the pursuit of victory, we must be equipped with the latest advancements. Your routers and equipment are the weapons in this digital battle - ensure they are of the highest caliber. Upgrade to the latest models, shielded from the vulnerabilities of outdated technology. Just as my armor protects me, let your routers shield your connection from the disruptions of a galaxy in turmoil.

3. Command the Network: Prioritize and Conquer

Troopers, in the vastness of cyberspace, not all data is equal. Channel your inner strategist and prioritize the flow of information. Designate bandwidth to critical tasks, much as I allocate resources to key missions. By conquering the network with precision, you ensure that your digital empire operates at optimal speed, striking fear into the hearts of slow connections across the galaxy.


Article subject: 3 Ways to Improve Your LinkedIn Profile
Article content: 3 Ways to Improve Your LinkedIn Profile

1. Embrace the Dark Side: Craft a Powerful Headline

Troopers, heed my words. The path to professional dominance begins with a headline that echoes across the galaxy. Let it be strong, a proclamation of your skills and influence, striking fear into the hearts of competitors. Embrace the dark side of self-promotion, and let your title be the anthem of your professional conquest.

2. Channel the Force: Showcase Your Professional Prowess

In your LinkedIn dominion, display the breadth of your powers. List your accomplishments with precision, much like the Force guides my lightsaber. Endorsements and recommendations are not mere pleasantries - they are acknowledgments of your influence. Engage with your network, my troopers, for a powerful presence resonates across the digital cosmos.

3. Armor Your Profile: Let Visuals Convey Your Might

Appearances matter, my loyal troopers. Much like the imposing armor of a Sith Lord, your profile visuals must strike fear and respect. Choose a profile image that commands attention, a visual manifestation of your professional prowess. Background images and multimedia are your arsenal - deploy them strategically to showcase the might of your accomplishments, for in visuals, your true power is unveiled.


Article subject: 5 Tips for Learning to Code
Article content: 5 Tips for Learning to Code

1. Embrace the Coding Force: Begin with the Basics

Troopers, the journey into the coding realm is much like harnessing the Force. Start by mastering the fundamental languages - they are the building blocks of your programming arsenal. Just as a Jedi hones their lightsaber skills, lay a solid foundation in HTML, CSS, and JavaScript.

2. Debug with the Precision of a Lightsaber Duel: Master the Art of Problem-Solving

In the coding battlefield, bugs are your foes. Approach debugging with the precision of a lightsaber duel. Identify the issues systematically, trace the logic like a Sith tracking down a rebel, and eliminate the bugs with ruthless efficiency. Your ability to troubleshoot will distinguish you as a coding master.

3. Code Like an Imperial Architect: Build Projects for Mastery

Troopers, coding is not just theory - it is a practice that hones your skills. Like constructing a Death Star, embark on coding projects to solidify your knowledge. Start with smaller tasks, gradually progressing to more complex endeavors. The mastery of coding is achieved through hands-on experience, much like our dominance in the galaxy through strategic conquests.

4. Join the Coding Empire: Network with Fellow Developers

In the vast universe of coding, you are not alone. Join the coding empire - connect with fellow developers, share knowledge, and seek guidance from the coding council. Collaboration is the key to unlocking new perspectives and refining your skills. Together, we shall build a formidable coding alliance.

5. Embrace Continuous Training: Evolve or Face Obsolescence

Troopers, the coding landscape evolves like the galaxy itself. Embrace continuous training, much like adapting to new battle strategies. Stay updated on the latest languages, frameworks, and tools. Only through constant evolution can you ensure your coding skills remain as formidable as the Imperial fleet. In the coding universe, stagnation leads to obsolescence - a fate we cannot afford.


Article subject: {topic}
Article content:"""

  return prompt
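
If the examples live in a list instead of one long string, the same layout could be assembled programmatically. The sketch below is an alternative under that assumption, not the approach this notebook uses; `build_few_shot_prompt` is a hypothetical name, and the shortened context and placeholder article bodies are made up for the demo.

```python
def build_few_shot_prompt(context, examples, topic):
    """Assembles a few-shot prompt from (subject, content) example pairs."""
    example_blocks = [
        f"Article subject: {subject}\nArticle content: {content}"
        for subject, content in examples
    ]

    # two blank lines separate the examples from each other and from the final subject
    body = "\n\n\n".join(example_blocks + [f"Article subject: {topic}\nArticle content:"])

    return f"\nContext:\n{context}\n\nExamples:\n{body}"


demo_prompt = build_few_shot_prompt(
    context="Generate an informative article with numbered section headers.",
    examples=[
        ("3 ways to improve slow Wi-Fi speeds", "3 Ways to Improve Slow Wi-Fi Speeds\n\n1. ..."),
        ("5 Tips for Learning to Code", "5 Tips for Learning to Code\n\n1. ..."),
    ],
    topic="5 Easy Knitting Projects for Beginners",
)
print(demo_prompt)
```

Ending the prompt with an open `Article content:` label is what cues the model to continue with the article body for the new subject.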


## Write a helper function to process the generated article title files

The previous notebook wrote its generated titles to files as numbered lists (1-10).

This function reads those files and processes them into a list of titles without the numbers.

In [4]:
from helpers.files import read_file
import os

def get_topics():
  """
  Returns a list of topics from the blog titles generated by the previous notebook.

  1. loops over the directories in data/gen/blog-titles/
  2. reads "output.txt" (newline-separated numbered list) from each directory
  3. numbers are removed from each line and the rest of the line is added to a list
  """

  topics = []
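  for folder in os.listdir(blog_titles_dir):
      # returns one string with all lines separated by newline
      output_text = read_file(os.path.join(blog_titles_dir, folder, 'output.txt'))

      # remove the leading number from each line, e.g. "1. Title" -> "Title"
      for line in output_text.split('\n'):
          line_stripped = line.strip()
          if len(line_stripped) == 0:
              continue

          # split only on the first '. ' so a title that itself contains '. '
          # stays intact; a line with no '. ' is kept whole instead of
          # raising an IndexError
          topics.append(line_stripped.split('. ', 1)[-1].strip())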

  return topics
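
A regular expression is another way to strip the numbering, and it only touches a number at the very start of a line. This is a sketch of an alternative, not the function used above; `strip_number_prefix` and `parse_titles` are hypothetical names, and the second sample title is made up to show a title that contains `. ` itself.

```python
import re

# matches a leading list number such as "1. ", "10. ", or "3) "
NUMBER_PREFIX = re.compile(r"^\s*\d+[.)]\s+")


def strip_number_prefix(line):
    """Removes a leading list number from a line and trims whitespace."""
    return NUMBER_PREFIX.sub("", line).strip()


def parse_titles(output_text):
    """Returns the non-empty lines of a numbered list with the numbers removed."""
    return [
        strip_number_prefix(line)
        for line in output_text.split("\n")
        if line.strip()
    ]


sample = "1. 5 Easy Knitting Projects for Beginners\n\n10. Knitting vs. Crochet: 3 Differences\n"
print(parse_titles(sample))
```

Because the pattern is anchored to the start of the line, the leading `5` in a title like "5 Easy Knitting Projects for Beginners" is left alone.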

## Read in article titles from the data directory

In [5]:
topics = get_topics()

print(f'topics length: {len(topics)}')
print(topics)


topics length: 50
['5 Easy Knitting Projects for Beginners', '3 Tips for Choosing the Right Knitting Needles', '5 Best Books for Learning to Knit', '3 Tips for Troubleshooting Common Knitting Problems', '5 Ways to Improve Your Knitting Skills', '3 Tips for Designing Your Own Knitting Patterns', '5 Best Knitting Podcasts for Beginners', '3 Tips for Finding a Knitting Community', '5 Ways to Use Knitting to Relax and De-Stress', '3 Tips for Making Money from Your Knitting', '10 Creative Applications of Generative AI That Will Blow Your Mind', '5 Ways Generative AI is Revolutionizing the Healthcare Industry', '3 Ways Generative AI is Changing the Way We Learn', '5 Ways Generative AI is Making the World a More Creative Place', '3 Ways Generative AI is Helping Us Fight Climate Change', '5 Ways Generative AI is Making Our Lives Easier', '3 Ways Generative AI is Changing the Way We Do Business', '5 Ways Generative AI is Making the World a Better Place', '3 Ways Generative AI is Helping Us Expl

## Write some functions to help persist generated text in files

We could just print the output and look at it here, but we can integrate it with other notebooks and tools if we write responses to a file.

This article content is intended to be used to fine-tune the `text-bison` model, so we'll make sure to save it.

In [6]:
from helpers.files import file_exists, make_dir_if_not_exists

def get_output_directory(topic):
    # replace spaces with dashes and remove apostrophes and colons
    topic_cleaned = topic.replace(" ", "-").replace("'", "").replace(":", "")

    return os.path.join(blog_articles_dir, topic_cleaned)


def prepare_output_directory(topic):
  """
  Creates the output directory for the given topic if it doesn't already exist.

  Returns the output directory path and a boolean indicating whether the directory
  was created or not
  """

  output_dir = get_output_directory(topic)

  # skip if the directory already exists
  if file_exists(output_dir):
      return output_dir, False

  make_dir_if_not_exists(output_dir)

  return output_dir, True


def persist_generated_response(output_dir, topic, response):
  """
  Persists the given prompt and response to the given output directory
  """

  with open(os.path.join(output_dir, "input.txt"), "w") as f:
      f.write(f'Article subject: {topic}')

  with open(os.path.join(output_dir, "output.txt"), "w") as f:
      # label the generated text as content, matching the prompt's 'Article content:' label
      f.write(f'Article content: {response.text.strip()}')
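
`get_output_directory` only strips apostrophes and colons, so a title containing other punctuation (a slash or a question mark, say) would still end up in the directory name. A stricter cleaner might look like the sketch below. `slugify_topic` is a hypothetical name, it keeps ASCII letters and digits only, and switching to it would change directory names, so articles generated under the old names would no longer be detected as existing.

```python
import re


def slugify_topic(topic):
    """Turns a topic into a directory-safe name using only letters, digits, and dashes."""
    # drop apostrophes first so "What's" becomes "Whats" rather than "What-s"
    cleaned = topic.replace("'", "")
    # collapse every other run of non-alphanumeric characters into a single dash
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", cleaned)
    return cleaned.strip("-")


print(slugify_topic("AI/ML: What's Next?"))
print(slugify_topic("5 Tips for Learning to Code"))
```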


## Write a function to print attributes returned by the model

This is just to help understand what the model is returning. For this purpose, we're ignoring all outputs other than the response text.

If you were generating content for a family-friendly blog, you might want to check the toxicity score or some other attributes of the output and reject it if it's too high.

In [7]:
def print_verbose_response(response):
  """
  Prints the response from the LLM alongside various metadata
  """

  print(f"\n---\nResponse:\n\n{response.text.strip()}")
  print(f"\n---\nResponse is_blocked:\n\n{response.is_blocked}")
  print(f"\n---\nResponse safety_attributes:\n\n{response.safety_attributes}")
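
A rejection check along the lines described above could be sketched as follows. It assumes `safety_attributes` is a mapping from category name to a score between 0 and 1; the category name `"Toxic"`, the threshold of 0.5, and the function name `should_reject` are all assumptions for illustration, so check the keys that `print_verbose_response` actually prints before relying on them.

```python
def should_reject(is_blocked, safety_attributes, category="Toxic", threshold=0.5):
    """Returns True if a response should be discarded.

    safety_attributes is assumed to map category names to scores between 0 and 1.
    """
    if is_blocked:
        return True

    # a missing category is treated as a score of 0
    return safety_attributes.get(category, 0.0) > threshold


# made-up scores for illustration; real values come from the model response
print(should_reject(False, {"Toxic": 0.1, "Violent": 0.7}))
print(should_reject(False, {"Toxic": 0.8}))
print(should_reject(True, {}))
```

In the generation loop, this would be called as `should_reject(response.is_blocked, response.safety_attributes)` before persisting the response.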

## Write a function to validate the output

In testing, the model (`text-bison@001`, which has since been replaced by `@002`) would occasionally generate its own random article subject and content after generating what we were aiming for.

The output looked something like the example below. We could clean that up manually before running our fine-tuning job, but it's easier to check for it and remove the extra content.


```
Generative AI in Healthcare: 5 Ways to Improve Patient Care

Generative AI is a type of artificial intelligence that can create new content, such as text, images, or music. This technology has the potential to revolutionize healthcare by automating tasks, improving diagnosis and treatment, and personalizing care.

Here are five ways that generative AI can improve patient care:

1. Automate administrative tasks
...

...

5. Reduce costs
...

Generative AI is a powerful technology that has the potential to revolutionize healthcare. By automating tasks, improving diagnosis and treatment, personalizing care, and reducing costs, generative AI can help to improve patient care and outcomes.


Article subject: How to Write a Blog Post
Article content: How to Write a Blog Post

Blogging is a great way to share your thoughts and ideas with the world...
```


In [8]:
def clean_response(response):
  """
  Cleans the response from the LLM by confirming it only contains a single prompt
  and response, and removing the extra content if it exists.

  Returns a boolean indicating whether the response was altered or not
  """

  # search for 'Article subject:' in the response. if it exists, the model
  # generated too much content, and we should remove everything after it appears.
  prompt = 'Article subject:'
  if prompt in response.text:
    response.text = response.text.split(prompt, 1)[0]

    return True

  return False
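
The truncation rule is easy to try out on a plain string, without a model response object. The sketch below is a string-only variant for experimentation, not a replacement for `clean_response`; `truncate_extra_examples` is a hypothetical name and the sample text is abbreviated from the example above.

```python
def truncate_extra_examples(text, marker="Article subject:"):
    """Returns (cleaned_text, altered) with everything from the marker onward removed."""
    if marker in text:
        return text.split(marker, 1)[0].rstrip(), True

    return text, False


sample = (
    "5 Ways to Improve Patient Care\n\n1. Automate administrative tasks\n...\n\n\n"
    "Article subject: How to Write a Blog Post\n"
    "Article content: How to Write a Blog Post\n"
)
cleaned, altered = truncate_extra_examples(sample)
print(repr(cleaned))
print(altered)
```

Returning the cleaned text instead of mutating the input also makes the rule straightforward to unit test.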

## Content Generation Logic

This configures the LLM prediction parameters and runs the logic to generate the content.

Note that the prompt is being stored in a slightly different way than it was actually passed into the model.

That's because we're going to use the outputted files as a dataset for fine-tuning the model.
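
How the training set is built from these files is not shown in this notebook. As a rough sketch of the idea, each `input.txt` / `output.txt` pair could become one JSON line. The field names `input_text` and `output_text` are an assumption about the tuning dataset format, and `build_tuning_records` is a hypothetical helper; compare against the example `training.jsonl` linked at the top before using it. The demo writes one made-up article to a temporary directory.

```python
import json
import os
import tempfile


def build_tuning_records(articles_dir):
    """Returns one {"input_text", "output_text"} record per article directory."""
    records = []
    for folder in sorted(os.listdir(articles_dir)):
        input_path = os.path.join(articles_dir, folder, "input.txt")
        output_path = os.path.join(articles_dir, folder, "output.txt")

        # skip anything that isn't a complete article directory
        if not (os.path.isfile(input_path) and os.path.isfile(output_path)):
            continue

        with open(input_path) as f:
            input_text = f.read()
        with open(output_path) as f:
            output_text = f.read()

        records.append({"input_text": input_text, "output_text": output_text})

    return records


# demo against a throwaway directory with one made-up article
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "5-Tips-for-Learning-to-Code"))
with open(os.path.join(demo_dir, "5-Tips-for-Learning-to-Code", "input.txt"), "w") as f:
    f.write("Article subject: 5 Tips for Learning to Code")
with open(os.path.join(demo_dir, "5-Tips-for-Learning-to-Code", "output.txt"), "w") as f:
    f.write("5 Tips for Learning to Code\n\n1. ...")

# one JSON object per line, as in a .jsonl file
jsonl = "\n".join(json.dumps(record) for record in build_tuning_records(demo_dir))
print(jsonl)
```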

In [9]:
from vertexai.language_models import TextGenerationModel

parameters = {
    "candidate_count": candidate_count,
    "max_output_tokens": max_output_tokens,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
}

model = TextGenerationModel.from_pretrained(model_name)

topics_generated = 0
number_of_topics_altered = 0
for topic in topics:
    output_dir, created = prepare_output_directory(topic)

    # skip if the directory already exists
    if not created:
        print(f"Skipping {topic} because it already exists.")
        continue

    prompt = create_prompt(topic)

    print(f"\n---\nPrompt:\n\n{prompt}")

    # generate response
    response = model.predict(prompt=prompt, **parameters)

    # clean the response if necessary
    altered = clean_response(response)
    if altered:
        number_of_topics_altered += 1
        print(f'Response altered for topic "{topic}" due to generating an additional random prompt/response.')

    # print some metadata alongside the response
    print_verbose_response(response)

    # persist the prompt and response to the output directory
    # in input.txt and output.txt files respectively
    persist_generated_response(output_dir, topic, response)

    topics_generated += 1

print(f'Total number of topics: {len(topics)}')
print(f'Number of articles generated in this run: {topics_generated}')
print(f'Number of topics altered due to the model generating too much: {number_of_topics_altered}')


---
Prompt:


Context:
Generate an informative article in the style of a menacing galactic enforcer, known for his deep, commanding voice and lack of patience. The character should exude authority and a sense of dark power, with a hint of mechanical undertones in the speech pattern, like Darth Vader from Star Wars.
The article should contain 3 sentences per section with numbered headers for each section.

Examples:
Article subject: 3 ways to improve slow Wi-Fi speeds
Article content: 3 Ways to Improve Slow Wi-Fi Speeds

1. Unleash the Power of the Dark Side: Diagnose the Weakness

Troopers, the struggle with slow Wi-Fi is akin to facing a rebellion within our own systems. Begin by identifying the weaknesses in your connection, much like we seek out the vulnerabilities in the Rebel Alliance. Diagnose the interference, whether it be from neighboring networks or hidden devices, and bring order to the chaos.

2. Enhance Your Technological Arsenal: Upgrade Your Equipment

In the pursuit of