# Advanced Prompt Engineering Techniques and Building Tools with GPT and LangChain

### Prompt Engineering with GPT and LangChain

LangChain is a framework for prompt engineering and for integrating generative AI capabilities into applications and data platforms. It offers a wide range of functionality, some of which is introduced in later modules. For now, we start with a gentle introduction to a few of its fundamental concepts.

In this project, you will build an AI agent that leverages Python and GPT to perform sentiment analysis on financial headlines. This project is designed to showcase your skills as a generative AI engineer and can be a valuable addition to your portfolio.

## Project Overview

### Objectives
- **Set Up OpenAI Developer Account**: Learn how to create and configure an OpenAI developer account and integrate it with your development environment.
- **Interact with OpenAI Models**: Use the LangChain framework to interact with OpenAI models, enabling you to harness the power of GPT for various tasks.
- **Prompt Templates**: Create reusable and dynamic prompt templates to streamline the process of generating prompts for different tasks.
- **LLM Chains**: Understand and implement LLM (Large Language Model) chains to manage complex workflows involving multiple AI models.
- **Output Parsing**: Automatically parse the output of an LLM to make it usable for downstream applications.
- **LangChain Agents and Tools**: Work with LangChain agents and tools to build more sophisticated AI applications.
- **Content Moderation**: Utilize the OpenAI Moderation API to filter explicit content, ensuring that your application adheres to content guidelines.

### Workflow
1. **Environment Setup**
   - Install necessary libraries and dependencies.
   - Configure API keys and authentication for OpenAI.

2. **Data Preparation**
   - Load and preprocess the sample datasets: `financial_headlines.txt` and `reddit_comments.txt`.
   - Understand the structure and content of the datasets to tailor the AI models accordingly.

3. **Prompt Engineering**
   - Design and implement prompt templates that can dynamically generate prompts based on input data.
   - Test and refine prompts to ensure they produce the desired output.

4. **Model Interaction**
   - Use LangChain to interact with OpenAI models.
   - Implement functions to send prompts to the models and receive responses.

5. **LLM Chains**
   - Create and manage LLM chains to handle complex workflows.
   - Chain multiple models together to perform multi-step tasks.

6. **Output Parsing**
   - Develop methods to parse the model outputs.
   - Ensure the parsed data is in a format suitable for further analysis or downstream applications.

7. **Agent and Tool Integration**
   - Integrate LangChain agents and tools to enhance the functionality of your AI application.
   - Implement additional tools as needed to support specific tasks.

8. **Content Moderation**
   - Use the OpenAI Moderation API to filter out explicit content.
   - Ensure your application complies with content guidelines and maintains a high standard of quality.

### Technology Stack
- **Python**: The primary programming language used for this project.
- **LangChain**: The framework used to facilitate prompt engineering and model interaction.
- **OpenAI GPT**: The generative AI model used for sentiment analysis and other tasks.
- **OpenAI Moderation API**: Used for content moderation to filter explicit content.
- **Jupyter Notebook**: The development environment for writing and testing code.

### Sample Datasets
For this project, we are using two small samples:
- `financial_headlines.txt`: A sample dataset containing financial headlines.
- `reddit_comments.txt`: A sample dataset containing Reddit comments.

These 5-6 line samples are kept short to simplify evaluation, but the same code and prompt engineering techniques can scale to much larger datasets.

## Conclusion
By the end of this project, you will have a robust understanding of how to use LangChain and GPT for prompt engineering and sentiment analysis. This project will not only enhance your skills but also serve as a strong portfolio piece demonstrating your capabilities as a generative AI engineer.

## Setup

We need to install a few packages, one of which is the `langchain` package. It is being developed quickly, sometimes with breaking changes, so we pin the version.

`langchain` depends on a recent version of `typing_extensions`, so we update that package as well, again pinning the version.

Run the following code to install `openai`, `langchain`, `langchain-openai`, `langchain-experimental` and `typing_extensions`. The `pandas` package, which is used later, is not installed here and is assumed to be available in the environment already.

In [2]:
# Install the openai package, locked to version 1.27
!pip install openai==1.27

# Install the langchain package, locked to version 0.1.19
!pip install langchain==0.1.19

# Install the langchain-openai package, locked to version 0.1.6
!pip install langchain-openai==0.1.6

# Install the langchain-experimental package, locked to version 0.0.58
!pip install langchain-experimental==0.0.58

# Update the typing_extensions package, locked to version 4.11.0
!pip install typing_extensions==4.11.0

Defaulting to user installation because normal site-packages is not writeable
Collecting openai==1.27
  Downloading openai-1.27.0-py3-none-any.whl.metadata (21 kB)
Downloading openai-1.27.0-py3-none-any.whl (314 kB)
Installing collected packages: openai
Successfully installed openai-1.27.0

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python3 -m pip install --upgrade pip
Defaulting to user installation because normal site-packages is not writeable
Collecting langchain==0.1.19
  Downloading langchain-0.1.19-py3-none-any.whl.metadata (13 kB)
Downloading langchain-0.1.19-py3-none-any.whl (1.0 MB)
Installing collected packages: langchain
Successfully installed langchain-0.1.19

For this project, we first need to import the `os` and `openai` packages and set the API key from the environment variable you just created.

- Import the `os` package.
- Import the `openai` package.
- Set `openai.api_key` to the `OPENAI_API_KEY` environment variable.

In [3]:
# Import the os package.
import os

# Import the openai package.
import openai

# Set openai.api_key to the OPENAI_API_KEY environment variable.
openai.api_key = os.environ["OPENAI_API_KEY"]

For the `langchain` package, let's start by importing its `OpenAI` and `ChatOpenAI` classes, which are used to interact with completion models and chat completion models respectively.

Completion models, such as GPT-1, GPT-2, GPT-3, and GPT-3.5, work as advanced autocomplete models. Given a snippet of text as input, they continue the text until a stopping condition is met, for example an end-of-sequence token (a natural way of stopping) or the maximum number of output tokens.

Chat completion models, such as GPT-3.5-Turbo (the ChatGPT model) and GPT-4, are designed for conversational use. These models are typically fine-tuned for conversations, accept the conversation history as part of the prompt, and allow access to a system message, which we can use as a meta prompt to define a role, a tone of voice, a scope, etc.

Completion models and chat completion models tend to work with different classes and functions in the SDK. For that reason, we will start by importing both classes.

- Import `OpenAI` and `ChatOpenAI` from `langchain_openai`.
- From the `langchain.prompts` module, import the `PromptTemplate` and `ChatPromptTemplate` classes.
- From the `langchain.output_parsers` module, import the `CommaSeparatedListOutputParser` class.
- From the `langchain_experimental.agents.agent_toolkits` module, import `create_python_agent`.
- From the `langchain_experimental.tools.python.tool` module, import `PythonREPLTool`.

In [4]:
# From langchain_openai, import OpenAI, ChatOpenAI
from langchain_openai import OpenAI, ChatOpenAI

# From langchain.prompts, import PromptTemplate, ChatPromptTemplate
from langchain.prompts import PromptTemplate, ChatPromptTemplate

# From langchain.output_parsers, import CommaSeparatedListOutputParser
from langchain.output_parsers import CommaSeparatedListOutputParser

# From langchain_experimental.agents.agent_toolkits, import create_python_agent
from langchain_experimental.agents.agent_toolkits import create_python_agent

# From langchain_experimental.tools.python.tool, import PythonREPLTool
from langchain_experimental.tools.python.tool import PythonREPLTool

Take a look at the installed LangChain version:

In [5]:
pip show langchain

Name: langchain
Version: 0.1.19
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /home/repl/.local/lib/python3.10/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, langchain-community, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: crewai, embedchain, langchain-experimental
Note: you may need to restart the kernel to use updated packages.


## Import the Financial News Headlines Data

A small sample of financial headlines is stored in `financial_headlines.txt`.

Our first step is to read in the text file and store the headlines in a Python list.


Import the text file to a Python list.

- Open `financial_headlines.txt` for reading.
- Read in the lines using the `.readlines()` method. Assign to `headlines`.
- Print the sample headlines.

In [6]:
# Open the text file and read its lines.
with open("financial_headlines.txt", "r") as file:
    headlines = file.readlines()

# Print all headlines.
headlines

["Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .\n",
 'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .\n',
 'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .\n',
 'Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.3 mn loss in 2008 .\n',
 "Finnish Metso Paper has been awarded a contract for the rebuild of Sabah Forest Industries ' ( SFI ) pulp mill in Sabah , Malaysia .\n",
 'Finnish Outokumpu Technology has been awarded several new grinding technology contracts .']

The headlines contain a bit of whitespace before the punctuation, but this should not noticeably affect how the large language model handles them.
You can also see that every headline except the last one ends with a newline character (`\n`).

We can quickly strip the `\n` from the end of each headline, which improves readability later on, when printing these headlines in a dataframe.


Strip the `\n` character from the end of every news headline.

- Loop through `headlines` and use the `.strip()` method to remove the `\n` character from each line.
- Print the result.

In [7]:
# Strip the new line character from all headlines.
headlines = [line.strip("\n") for line in headlines] 

# Print all headlines.
headlines

["Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .",
 'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .',
 'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .',
 'Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.3 mn loss in 2008 .',
 "Finnish Metso Paper has been awarded a contract for the rebuild of Sabah Forest Industries ' ( SFI ) pulp mill in Sabah , Malaysia .",
 'Finnish Outokumpu Technology has been awarded several new grinding technology contracts .']
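
As a hedged variant of the loading step, the sketch below strips line endings and skips blank lines in a single pass. It works on an in-memory string so that it runs without the data file; `load_headlines` and `raw_text` are illustrative names that do not appear in the notebook.

```python
def load_headlines(text: str) -> list[str]:
    """Split raw text into headlines, dropping line endings and blank lines."""
    # str.strip() with no argument also removes other leading/trailing whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample standing in for the contents of financial_headlines.txt.
raw_text = "Company A reports higher profit .\n\nCompany B wins new contract .\n"

sample_headlines = load_headlines(raw_text)
print(sample_headlines)
```

Skipping blank lines is a design choice that guards against a trailing empty line in the file; the notebook's own version keeps every line it reads.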

## Setting up Prompt Templates

- Create a Prompt Template to analyze financial sentiment.
- Create a `PromptTemplate` object by using its `.from_template()` method. Assign to `prompt_template`.
- For the template argument, use:

```
"Analyze the following financial headline for sentiment: {headline}"
```

- Format the prompt using its `.format()` method. Let's use our first headline as input. Assign to `formatted_prompt`.
- Print the formatted prompt.

In [8]:
# Create a dynamic template to analyze a single headline.
prompt_template = PromptTemplate.from_template(
    "Analyze the following financial headline for sentiment: {headline}"
)

# Format the prompt template on the first headline of the dataset.
formatted_prompt = prompt_template.format(headline = headlines[0])

# Print the formatted template.
formatted_prompt

"Analyze the following financial headline for sentiment: Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 ."

- Define a system message as follows and assign to `system_message`.

```
"""You are performing sentiment analysis on news headlines regarding financial analysis. 
This sentiment is to be used to advice financial analysts. 
The format of the output has to be consistent. 
The output is strictly limited to any of the following options: [positive, negative, neutral]."""
```

- Instantiate a new `ChatPromptTemplate` using its `.from_messages()` method. Assign to `chat_template`.
    - This method will take a list of tuples as input. We need two tuples, one for the system message and one for the human message. To distinguish the two, the first element of the tuple is either `"system"` or `"human"`.
    - The second element of the tuple is the actual message, as a string. For the system message, you can use the `system_message` variable. For the human message, we can reuse the same message as before (including the input variable `{headline}`).

- Format the template using its `.format()` method, which returns a single string (`.format_messages()` would return a list of message objects instead). Let's use our first headline again. Assign to `formatted_chat_template`.
- Print the formatted template.

In [9]:
# Define the system message.
system_message = """You are performing sentiment analysis on news headlines regarding financial analysis. 
This sentiment is to be used to advice financial analysts. 
The format of the output has to be consistent. 
The output is strictly limited to any of the following options: [positive, negative, neutral]."""

# Initialize a new ChatPromptTemplate with a system message and human message.
chat_template = ChatPromptTemplate.from_messages([
    ("system", system_message),
    ("human", "Analyze the following financial headline for sentiment: {headline}")
])

# Format the ChatPromptTemplate.
# .format() returns one string; .format_messages() would return a list of message objects instead.
formatted_chat_template = chat_template.format(headline=headlines[0])

# Print the formatted template.
formatted_chat_template

"System: You are performing sentiment analysis on news headlines regarding financial analysis. \nThis sentiment is to be used to advice financial analysts. \nThe format of the output has to be consistent. \nThe output is strictly limited to any of the following options: [positive, negative, neutral].\nHuman: Analyze the following financial headline for sentiment: Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 ."

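Conceptually, a prompt template is a string with named placeholders, and a chat template is a list of (role, message) pairs that are filled in the same way. The sketch below mimics that idea with plain Python string formatting. It is not LangChain's implementation, and `fill_chat_template` is a hypothetical helper introduced only for illustration.

```python
TEMPLATE = "Analyze the following financial headline for sentiment: {headline}"


def fill_chat_template(messages, **variables):
    """Fill the named placeholders in every (role, text) pair."""
    return [(role, text.format(**variables)) for role, text in messages]


# A shortened, hypothetical system message; it contains no placeholders.
chat_messages = [
    ("system", "You are performing sentiment analysis on financial headlines."),
    ("human", TEMPLATE),
]

filled = fill_chat_template(chat_messages, headline="Company X wins a new contract .")
for role, text in filled:
    print(f"{role}: {text}")
```

One caveat of this simplified version: `str.format` treats every pair of curly braces as a placeholder, so message text with literal braces would need escaping.
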
## Setting up LLM Chains

Create an LLM chain for a completion model.
- Define an `OpenAI()` client model. Assign to `client`.
- Combine the prompt template and the client in an `LLMChain`. Assign to `completion_chain`.
- Invoke `completion_chain`, setting the headline variable to the first element of the `headlines` list.

In [10]:
# Import the LLMChain class.
from langchain.chains import LLMChain

# Define a client model. Assign to client.
client = OpenAI()

# Combine the prompt template and the client in a chain. Assign to completion_chain.
completion_chain = LLMChain(llm=client, prompt=prompt_template)

# Invoke completion_chain, setting the headline variable to the first headline.
# (.run() also works here, but it is deprecated in favor of .invoke().)
completion_chain.invoke({"headline": headlines[0]})


The class `LLMChain` was deprecated in LangChain 0.1.17 and will be removed in 0.3.0. Use RunnableSequence, e.g., `prompt | llm` instead.



{'headline': "Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .",
 'text': '\n\nPositive'}
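
The deprecation notice above recommends composing steps with the `|` operator. As a rough illustration of what that composition means (not LangChain's actual classes), the sketch below defines a tiny `Step` wrapper whose `|` chains two callables. `Step`, `fake_llm` and the canned replies are hypothetical stand-ins so that the example runs without an API key.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its result to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Step 1: fill the prompt template from a dict of variables.
prompt = Step(
    lambda variables: "Analyze the following financial headline for sentiment: {headline}".format(**variables)
)

# Step 2: a fake model that returns a canned reply instead of calling an API.
fake_llm = Step(lambda prompt_text: "Positive" if "profit rose" in prompt_text else "Neutral")

chain = prompt | fake_llm
result = chain.invoke({"headline": "Operating profit rose to EUR 17.5 mn ."})
print(result)
```

The point of the pattern is that each step only needs an `invoke` method, so prompts, models and parsers can be swapped independently.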

Now let's do the same, using a chat completion model.

- Define a chat client model. Assign to `chat`.
- Combine the chat template and the client in an `LLMChain`. Assign to `chat_chain`.
- Invoke `chat_chain`, setting the headline to the first element of `headlines`. The system message is already part of `chat_template`, so no additional arguments are needed.

In [11]:
# Define a chat client model. Assign to chat.
chat = ChatOpenAI()

# Combine the chat template and the client in a chain. Assign to chat_chain.
chat_chain = LLMChain(llm=chat, prompt=chat_template)  # newer pipe syntax: chat_template | chat

# Invoke chat_chain, setting headline to the first headline.
# The system message is already part of chat_template, so it is not passed again;
# the second positional argument of .invoke() is a run configuration, not prompt input.
chat_chain.invoke({"headline": headlines[0]})

{'headline': "Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .",
 'text': 'The sentiment of the financial headline is positive.'}
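
Note that the chat model replied with a full sentence, while the completion model replied with `'\n\nPositive'`, even though the system message asks for one of three options. A minimal sketch of how such replies could be normalized to a single label before downstream use is shown below; `normalize_sentiment` is a hypothetical helper, and simple substring matching is an assumption that will not cover every phrasing.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")


def normalize_sentiment(reply: str):
    """Return the single allowed label found in the reply, or None if unclear."""
    lowered = reply.lower()
    found = [label for label in ALLOWED_LABELS if label in lowered]
    # Accept the reply only when exactly one label is mentioned.
    return found[0] if len(found) == 1 else None


print(normalize_sentiment("The sentiment of the financial headline is positive."))
print(normalize_sentiment("\n\nPositive"))
print(normalize_sentiment("It could be positive or negative."))
```

Returning `None` for ambiguous replies makes it easy to flag those headlines for a retry or a manual check.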

## Extracting Company Names with the Output Parser

Output parsing is a very useful feature in LangChain when integrating LLM outputs into your application. An output parser can automatically transform the output of the GPT model into various data types, such as lists, datetimes and JSON objects.

In this example, we will ask the GPT model to extract the company name from every headline and immediately assign the names to a Python list.

As we want to combine sentiment with the company name later, we will limit the output to one name per headline.

To format the output as a Python list, we can make use of the `CommaSeparatedListOutputParser` class in LangChain.

Create an output parser and a formatted prompt template to extract company names from multiple headlines.
- Instantiate a new `CommaSeparatedListOutputParser` and assign to `output_parser`.
- To retrieve the parsing instructions from the output parser, we can use its `.get_format_instructions()` method. Assign this to `format_instructions`.
- Let's instantiate a new `PromptTemplate`. This time we won't use its `.from_template()` method. When calling `PromptTemplate()` with the output parser, we need to pass three arguments:
    - `template`: here we can use the following string:
```
"List all the company names from the following headlines, limited to one name per headline: {headlines}.\n{format_instructions}"
```
    - `input_variables`: This is a list of strings containing the input variables that are required. In our case, this is only `"headlines"`.
    - `partial_variables`: Here we pass along a dictionary with the key being `"format_instructions"` and the value being the `format_instructions` variable we created earlier.
- Format the prompt template using the entire `headlines` list.

In [12]:
# Instantiate the output parser.
output_parser = CommaSeparatedListOutputParser()

# Get the format instructions from the output parser.
format_instructions = output_parser.get_format_instructions()

# Instantiate a new prompt template with the format instructions.
company_name_template = PromptTemplate(
    template = "List all the company names from the following headlines, limited to one name per headline: {headlines}.\n{format_instructions}",
    input_variables = ["headlines"],
    partial_variables = {"format_instructions": format_instructions},
)

# Format the prompt using all headlines.
company_name_template_formated = company_name_template.format(headlines=headlines)

print(company_name_template_formated)

List all the company names from the following headlines, limited to one name per headline: ["Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .", 'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .', 'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .', 'Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.3 mn loss in 2008 .', "Finnish Metso Paper has been awarded a contract for the rebuild of Sabah Forest Industries ' ( SFI ) pulp mill in Sabah , Malaysia .", 'Finnish Outokumpu Technology has been awarded several new grinding technology contracts .'].
Your r

Now that we have a template with format instructions ready, let's send it to a GPT model and look at the output. We want to run these kinds of tasks with the temperature parameter of the large language model set to zero, as this makes the output as consistent as possible.

We tend to distinguish between tasks that require precision and tasks that require creativity. When we are looking for correctness in the answer (e.g. when generating code), we aim for high precision by lowering the temperature, whereas when generating ideas or content, we might prefer more creativity and increase the temperature. A simplified explanation of the *temperature* of a large language model is that it controls randomness. When the temperature is set to 0, the model always favors the most likely next token, so the same input gives (nearly) the same output every time.
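
A common way to picture temperature is as a divisor applied to the model's raw scores (logits) before they are turned into probabilities. The sketch below illustrates that idea with made-up scores for three candidate tokens. It is a conceptual illustration under that assumption, not a description of how the OpenAI API is implemented, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        # Temperature 0 cannot be divided by; it is usually treated as
        # "always pick the highest-scoring token" (greedy decoding).
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (2.0, 1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature approaches zero, almost all probability mass moves to the highest-scoring token, which is why low temperatures give more repeatable output.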

Create a new LangChain model, send over the template and inspect the parsed output.
- Instantiate a new `OpenAI()` client model. Set the temperature to 0. Assign to `model`.
- Invoke `model` on the formatted template. Assign to `_output`. The underscore preceding our variable name indicates that this is just a temporary variable, that will likely be overwritten many times.
- Use the `.parse()` method of the output parser on the output of the model. Assign to `company_names`.
- Print the data type of `company_names`.
- Print the company names.

<details>
<summary>Code hints</summary>
<p>

- The temperature of the model can be set to 0 by using `OpenAI(temperature= )`.
- We can get the data type of a variable by using `type(variable)`.

</p>
</details>

In [13]:
# Instantiate a LangChain OpenAI model object.
model = OpenAI(temperature=0)

# Invoke the model on the input.
_output = model.invoke(company_name_template_formated)

# Parse the output.
company_names = output_parser.parse(_output)

# Print the data type of the parsed output.
print(type(company_names))

# Print the output.
print(company_names)

<class 'list'>
['Aktia Group', 'Vaisala Oyj', 'Orion', 'Tiimari', 'Metso Paper', 'Outokumpu Technology']
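
To make the parsing step less of a black box, here is a minimal sketch of what turning a comma-separated reply into a list can look like. It is an illustration only, not LangChain's implementation; `parse_comma_separated` and `raw_reply` are hypothetical names.

```python
def parse_comma_separated(text: str) -> list[str]:
    """Split a comma-separated reply into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


# Hypothetical raw model reply, with leading newlines like the completion output seen earlier.
raw_reply = "\n\nAktia Group, Vaisala Oyj, Orion"

parsed = parse_comma_separated(raw_reply)
print(type(parsed))
print(parsed)
```

A limitation of this simple approach is that a company name containing a comma would be split into two items.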


## Working with Agents and Tools

Leveraging the agents and tools in LangChain is where the framework's value really starts to shine! But before we dive deeper into this concept, we need to understand MRKL prompts. MRKL (Modular Reasoning, Knowledge and Language) prompts ask the model to work in a loop of *Thought*, *Action*, *Action Input* and *Observation* steps until it reaches a *Final Answer*; this loop is visible in the verbose agent output.

Before we continue with our financial analysis, let's create a quick example of how code can be run using a Python agent. In this case, we will ask it to make a calculation (something that large language models often get wrong without tools).
- Create a Python agent by calling the `create_python_agent()` function. Assign to `agent_executor`. This function takes three arguments:
    - `llm`: here we can create a new `OpenAI()` model. Let's set the `temperature` to 0 and `max_tokens` to 1000.
    - `tool`: here we instantiate a new `PythonREPLTool()`.
    - `verbose`: set this to `True` so that we can see the prompt loop.
- Run the agent using its `.run()` method (or `.invoke()`, which replaces the deprecated `.run()`). As an example, you can ask it: `"What is the square root of 250? Round the answer down to 4 decimals."`

In [14]:
# Instantiate a Python agent, with the PythonREPLTool.
agent_executor = create_python_agent(
    llm = OpenAI(temperature=0, max_tokens=1000),
    tool = PythonREPLTool(),
    verbose = True  # to actually see the prompt loop in action
)

# Ask the agent for the solution of a mathematical equation.
agent_executor.run("What is the square root of 250? Round the answer down to 4 decimals.")

> Entering new AgentExecutor chain...

The method `Chain.run` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead.

 I can use the math module to calculate the square root.
Action: Python_REPL
Action Input: import math
Observation: 
Thought: Now that the math module is imported, I can use the sqrt function.
Action: Python_REPL
Action Input: math.sqrt(250)
Observation: 
Thought: The result is a float, so I can use the round function to round it down to 4 decimals.
Action: Python_REPL
Action Input: round(math.sqrt(250), 4)
Observation: 
Thought: I now know the final answer.
Final Answer: 15.8114

> Finished chain.


'15.8114'
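
It is worth checking the agent's answer by hand. The prompt asked to round *down*, but the agent used `round()`, which rounds to the nearest value. The short sketch below compares both interpretations using only the standard library; the variable names are illustrative.

```python
import math

value = math.sqrt(250)

# Rounding to the nearest value at 4 decimals (what the agent did).
rounded_nearest = round(value, 4)

# Rounding down at 4 decimals (what the prompt literally asked for).
rounded_down = math.floor(value * 10**4) / 10**4

print(rounded_nearest, rounded_down)
```

The two results differ in the last digit, which shows that agent output should be verified even for simple calculations.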

Ask the agent to extract the company name and sentiment from the headlines and save its output in a `.csv` file called `financial_analysis.csv`.
- Invoke the agent on the following prompt:
    
    ``` 
    f"""For every of the following headlines, extract the company name and whether the financial sentiment is positive, neutral or negative. 
    Load this data into a pandas dataframe. 
    The dataframe will have three columns: the name of the company, whether the financial sentiment is positive or negative and the headline itself. 
    The dataframe can then be saved in the current working directory under the name financial_analysis.csv.
    If a csv file already exists with the same name, it should be overwritten.

    The headlines are the following:
    {headlines}"""
    ```

In [15]:
# Invoke the agent
agent_executor.invoke(f"""For every of the following headlines, extract the company name and whether the financial sentiment is positive, neutral or negative. 
Load this data into a pandas dataframe. 
The dataframe will have three columns: the name of the company, whether the financial sentiment is positive or negative and the headline itself. 
The dataframe can then be saved in the current working directory under the name financial_analysis.csv.
If a csv file already exists with the same name, it should be overwritten.

The headlines are the following:
{headlines}""")

> Entering new AgentExecutor chain...
 I need to import the necessary libraries and modules to work with pandas and csv files.
Action: Python_REPL
Action Input: import pandas as pd
Observation: 
Thought: I need to create a list of the headlines.
Action: Python_REPL
Action Input: headlines = ["Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .", 'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .', 'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .', 'Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.

{'input': 'For every of the following headlines, extract the company name and whether the financial sentiment is positive, neutral or negative. \nLoad this data into a pandas dataframe. \nThe dataframe will have three columns: the name of the company, whether the financial sentiment is positive or negative and the headline itself. \nThe dataframe can then be saved in the current working directory under the name financial_analysis.csv.\nIf a csv file already exists with the same name, it should be overwritten.\n\nThe headlines are the following:\n["Finnish Aktia Group \'s operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .", \'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .\', \'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2

Observe the output above. Do you see anything that could be improved? We will come back to this later in this notebook.

For now, let's quickly load our `.csv` file in a dataframe to analyze.


Load the data in a dataframe for evaluation.
- Import `pandas` under its usual alias: `pd`.
- Load the `financial_analysis.csv` file into a dataframe. Assign to `df`.
- Print the dataframe. As our dataframe only contains six rows, we can just print the entire dataframe.

In [16]:
# Make the necessary import.
import pandas as pd

# Load the CSV file into a dataframe.
df = pd.read_csv("financial_analysis.csv")

# Print the dataframe.
df

| Unnamed: 0 | Company Name | Financial Sentiment | Headline |
|---|---|---|---|
| 0 | Finnish | Positive | Finnish Aktia Group 's operating profit rose t... |
| 1 | Finnish | Negative | Finnish measuring equipment maker Vaisala Oyj ... |
| 2 | Finnish | Positive | Finnish pharmaceuticals company Orion reports ... |
| 3 | Finnish | Positive | Tiimari , the Finnish retailer , reported to h... |
| 4 | Finnish | Neutral | Finnish Metso Paper has been awarded a contrac... |
| 5 | Finnish | Neutral | Finnish Outokumpu Technology has been awarded ... |


## Analysis of Output and Room for Improvement

When analyzing the output above (looking at the company names and sentiment), you will probably notice some room for improvement. 

### Issues Identified

1. **Extraction Method**: 
   - Company names and sentiment may not be extracted reliably: in the dataframe above, every company name came out as "Finnish". The reason is that, without further instructions, the GPT model uses the `PythonREPLTool` (Python code) to complete its task.
   - Looking back at the output from our last call to the Python agent, we may find that it created rule sets for extracting the company name or determining the sentiment. These hard-coded rules negate the power of large language models!

2. **Sentiment vs. Financial Sentiment**:
   - Another problem that might arise is that the *sentiment* of a sentence can differ from *financial sentiment*. 
   - For example, an aggressive headline complaining about a large corporation making too much profit might result in negative sentiment, while from a financial analysis point of view the sentiment is positive. 

### Solution: Few-Shot Learning

To steer the GPT-model to our desired outcome, we will now introduce few-shot learning.

#### Example

- *Company X was awarded a new contract* might be categorized as a neutral sentence. The sentence itself is simply an objective statement or observation. Nothing is mentioned about whether we like or dislike that particular company because of this. 
- From a financial perspective however, this is considered as something positive. 


## Adding Few Shot Learning

Few shot learning comes down to adding some examples to our prompt, in this case examples of what we consider to be positive or negative headlines. A shot is one example given to the model in the input prompt (or sometimes in the system message).

We distinguish three categories of in-context learning:
- Few shot learning (multiple examples)
- Single shot learning (one example)
- Zero shot learning (no examples)

Few shot learning might take more effort in terms of prompt building, but it will generally yield better results, as the model has a better understanding of our desired outcome.
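The three categories can be illustrated with a small prompt builder. This is a minimal sketch in plain Python, not the LangChain template used later in this task; `build_prompt` and the example headlines are hypothetical and exist only for illustration.

```python
def build_prompt(headline, examples=()):
    """Build a sentiment prompt containing zero, one, or several labelled examples."""
    lines = ["Classify the financial sentiment of the headline as Positive, Negative or Neutral."]
    # Each (headline, label) pair is one "shot".
    for example_headline, label in examples:
        lines.append(f"Headline: {example_headline}\nSentiment: {label}")
    # The headline to classify goes last, with its label left open.
    lines.append(f"Headline: {headline}\nSentiment:")
    return "\n\n".join(lines)


# Hypothetical labelled examples, written for illustration only.
examples = [
    ("Company X was awarded a new contract.", "Positive"),
    ("Company Y reported a wider net loss.", "Negative"),
]

zero_shot = build_prompt("Company Z opened a new office.")
one_shot = build_prompt("Company Z opened a new office.", examples[:1])
few_shot = build_prompt("Company Z opened a new office.", examples)
print(few_shot)
```

The only difference between the three prompts is the number of labelled examples placed before the headline to classify.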

Let's look at an example of financial sentiment analysis without few shot learning first.

Create a prompt template with output parsing to determine the financial sentiment of all headlines.
- Create a new `PromptTemplate` called `sentiment_template`. Remember the three arguments `template`, `input_variables` and `partial_variables`. Assign to `sentiment_template`.
    - We can reuse the `format_instructions` variable that we have loaded into memory before.
    - As a template, use: 
```
"Get the financial sentiment of each of the following headlines. The output is strictly limited to any of the following options: ['Positive', 'Negative', 'Neutral']: {headlines}.\n{format_instructions}"
```


- Format the template on all headlines. Assign to `formatted_sentiment_template`.
- Run the formatted template by invoking our `model` and assign the result to our temporary variable `_output`.
- Parse the output using the output parser. Assign the result to `sentiments`.
- Print the sentiments.

In [17]:
# Create a new prompt template with output parsing.
sentiment_template = PromptTemplate(
    template = "Get the financial sentiment of each of the following headlines. The output is strictly limited to any of the following options: ['Positive', 'Negative', 'Neutral']: {headlines}.\n{format_instructions}",
    input_variables = ["headlines"],
    partial_variables = {"format_instructions": format_instructions}
)

# Format the prompt template.
formatted_sentiment_template = sentiment_template.format(headlines=headlines)

# Invoke the model on the formatted prompt template.
_output = model.invoke(formatted_sentiment_template)

# Parse the output.
sentiments = output_parser.parse(_output)

# Print the list of sentiments.
sentiments

['Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive']
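Because the prompt restricts the answer to three labels, it can be worth checking that the parsed list really respects that restriction. The sketch below assumes the model replies with a comma-separated string (an assumption; the actual `output_parser` is defined earlier in the project). `parse_sentiments` and the `raw` string are hypothetical stand-ins chosen to match the list printed above.

```python
ALLOWED = {"Positive", "Negative", "Neutral"}


def parse_sentiments(raw_output, expected_count):
    """Split a comma-separated reply into labels and validate them."""
    labels = [part.strip() for part in raw_output.split(",") if part.strip()]
    # Reject anything outside the three labels the prompt allows.
    unexpected = [label for label in labels if label not in ALLOWED]
    if unexpected:
        raise ValueError(f"Unexpected labels: {unexpected}")
    # One label per headline is required to pair them up later.
    if len(labels) != expected_count:
        raise ValueError(f"Expected {expected_count} labels, got {len(labels)}")
    return labels


# Hypothetical raw model reply for six headlines.
raw = "Positive, Negative, Positive, Positive, Positive, Positive"
labels = parse_sentiments(raw, expected_count=6)
print(labels)
```

Failing early on an unexpected label or a wrong count is usually easier to debug than a misaligned table further downstream.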

It is hard to evaluate the sentiments without seeing the associated headline. To make our lives easier, let's write a quick function to easily visualize and interpret the result.


Visualize and interpret the results of the sentiment analysis.
- Write a function called `visualize_sentiments` to visualize both the sentiment and associated headline, for all headlines. 
    - The input for this function should be two lists: one containing all headlines and one containing all sentiments.
    - As a best practice, start with an `assert` that ensures both lists are of equal length.
    - There are many ways to do this: printing with f-strings, building a dictionary or a DataFrame. Get creative!
- Call the `visualize_sentiments` function using `headlines` and `sentiments` as input.

In [18]:
# Define a new function with two inputs.
def visualize_sentiments(headlines, sentiments):
    # Assert that both inputs are of equal length.
    assert len(headlines) == len(sentiments), "headlines and sentiments must have the same length"

    # Print each sentiment next to its headline.
    for headline, sentiment in zip(headlines, sentiments):
        print(f"{sentiment.upper()}: {headline}")

# Call the function
visualize_sentiments(headlines, sentiments)

POSITIVE: Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .
NEGATIVE: Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .
POSITIVE: Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .
POSITIVE: Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.3 mn loss in 2008 .
POSITIVE: Finnish Metso Paper has been awarded a contract for the rebuild of Sabah Forest Industries ' ( SFI ) pulp mill in Sabah , Malaysia .
POSITIVE: Finnish Outokumpu Technology has been awarded several new grinding technology contracts .
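As an alternative to printing directly, the pairs can be collected into records first, which makes them easy to reuse later (for example as rows of a table). This is a sketch under assumptions: `pair_sentiments` is a hypothetical helper, and the sample lists are shortened stand-ins for the real `headlines` and `sentiments`.

```python
def pair_sentiments(headlines, sentiments):
    """Return one {'sentiment', 'headline'} record per headline."""
    # Raise instead of assert: assert statements are skipped when Python runs with -O.
    if len(headlines) != len(sentiments):
        raise ValueError("headlines and sentiments must have the same length")
    return [
        {"sentiment": sentiment, "headline": headline}
        for headline, sentiment in zip(headlines, sentiments)
    ]


# Shortened stand-ins for the real lists.
sample_headlines = ["Headline about rising profit (placeholder)", "Headline about a wider loss (placeholder)"]
sample_sentiments = ["Positive", "Negative"]

records = pair_sentiments(sample_headlines, sample_sentiments)
for record in records:
    print(f"{record['sentiment']:<8} | {record['headline']}")
```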


We might see that the financial sentiment is not always assigned correctly, for example when an awarded contract is not recognized as a financially positive headline.
To improve the performance, we will add some examples. Few shot learning can be done either by giving some observations (headlines in this case) accompanied by their ground truth (label) *or* by giving an abstract description of what is seen as positive, negative or neutral.

In this case, we will opt for the latter. Here is a prompt you can use for few shot learning:

```
"""
If a company is doing financially better than before, the sentiment is positive. For example, when profits or revenue have increased since the last quarter or year, exceeding expectations, a contract is awarded or an acquisition is announced.
If the company's profits are decreasing, losses are mounting up or overall performance is not meeting expectations, the sentiment is negative.
If nothing positive or negative is mentioned from a financial perspective, the sentiment is neutral.
"""
```


Create and run a prompt template using few shot learning.
- Store the prompt above in a variable called `sentiment_examples`.
- Create a `PromptTemplate` called `sentiment_template` like we did two cells above.
    - In our template, we will add a new input variable called `few_shot_examples`. It can be placed between the two sentences.
    - Don't forget to add the new input variable to the list of `input_variables`.
    - Reuse the same `format_instructions` as before.
- Format the `sentiment_template`. Remember that you will need to pass both `headlines` and `sentiment_examples`.
- Run the formatted template by invoking our `model` and assign the result to our temporary variable `_output`.
- Parse the output using the output parser. Assign the result to `sentiments`.
- Visualize and interpret the results using your newly created `visualize_sentiments` function.

In [19]:
# Store the few shot examples in a variable.
sentiment_examples = """
    If a company is doing financially better than before, the sentiment is positive. For example, when profits or revenue have increased since the last quarter or year, exceeding expectations, a contract is awarded or an acquisition is announced.
    If the company's profits are decreasing, losses are mounting up or overall performance is not meeting expectations, the sentiment is negative.
    If nothing positive or negative is mentioned from a financial perspective, the sentiment is neutral.
"""

# Instantiate a new prompt template with the format instructions.
sentiment_template = PromptTemplate(
    template="Get the financial sentiment of each of the following headlines. {few_shot_examples} The output is strictly limited to any of the following options: ['Positive', 'Negative', 'Neutral']: {headlines}.\n{format_instructions}",
    input_variables=["headlines", "few_shot_examples"],
    partial_variables={"format_instructions": format_instructions}
)

# Format the template.
formatted_sentiment_template = sentiment_template.format(
    headlines=headlines,
    few_shot_examples=sentiment_examples
)

# Invoke the model on the formatted template.
_output = model.invoke(formatted_sentiment_template)

# Parse the model output.
sentiments = output_parser.parse(_output)

# Visualize the result.
visualize_sentiments(headlines, sentiments)

POSITIVE: Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .
NEGATIVE: Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .
POSITIVE: Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .
POSITIVE: Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 , up from EUR 0.3 mn loss in 2008 .
POSITIVE: Finnish Metso Paper has been awarded a contract for the rebuild of Sabah Forest Industries ' ( SFI ) pulp mill in Sabah , Malaysia .
POSITIVE: Finnish Outokumpu Technology has been awarded several new grinding technology contracts .
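To judge whether the added examples changed anything, the two label lists can be compared position by position. The sketch below uses a hypothetical helper, `changed_labels`, and label lists copied from the two outputs shown above.

```python
def changed_labels(before, after):
    """Return (index, old_label, new_label) for every position where the label changed."""
    if len(before) != len(after):
        raise ValueError("both runs must contain the same number of labels")
    return [
        (index, old, new)
        for index, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


# Labels copied from the run without and the run with the few shot description.
zero_shot_run = ["Positive", "Negative", "Positive", "Positive", "Positive", "Positive"]
few_shot_run = ["Positive", "Negative", "Positive", "Positive", "Positive", "Positive"]

print(changed_labels(zero_shot_run, few_shot_run))
```

For the two runs recorded here the list is empty: the model already labelled these six headlines the same way without the extra description. The effect of few shot learning is more likely to show on headlines where general and financial sentiment diverge.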


## Combining Tools and Output Parsing

As you may have noticed in Task 5, using tools does not guarantee success. We can improve the performance by clearly deciding which tasks are handled by the Python tool and which by the GPT model itself.
To make the most of the GPT model's capabilities, we prefer it over hard-coded rule sets for company name extraction and financial sentiment analysis.
However, other (cumbersome) tasks that do not require handling ambiguity are often best left to the Python tool.

Let's ask the model to use the existing lists that we got from our templates (`company_names` and `sentiments`), but use the Python tool to neatly place them in a Pandas dataframe and write them locally to a `.csv` file.

Use the following prompt:

```
f"""Create a dataframe with three columns: company_name, sentiment and headline.
                   To fill the dataframe, use the following lists respectively: {str(company_names)}, {str(sentiments)} and {str(headlines)}. 
                   The dataframe can then be saved in the current working directory under the name financial_analysis_with_parsing.csv.
                   If a csv file already exists with the same name, it should be overwritten.
                   """
```

In the prompt above, we pass along lists that were generated by the GPT model before (when it did not have access to the Python tool). Now we only give instructions for tasks that should be carried out in Python code, such as creating the dataframe and saving (and overwriting) the file.

Keep in mind that the same way of working applies to much more complex tasks that involve extensive coding.

- Invoke the `agent_executor` on the prompt above.

In [20]:
# Invoke the agent to create a file with the headlines, company names and sentiments.
agent_executor.invoke(f"""Create a dataframe with three columns: company_name, sentiment and headline.
                   To fill the dataframe, use the following lists respectively: {str(company_names)}, {str(sentiments)} and {str(headlines)}. 
                   The dataframe can then be saved in the current working directory under the name financial_analysis_with_parsing.csv.
                   If a csv file already exists with the same name, it should be overwritten.
                   """)



> Entering new AgentExecutor chain...
 I need to import pandas to create a dataframe. I also need to create the lists for the columns and the data.
Action: Python_REPL
Action Input: import pandas as pd
Observation: 
Thought: Now that I have imported pandas, I can create the dataframe using the lists.
Action: Python_REPL
Action Input: df = pd.DataFrame({'company_name': ['Aktia Group', 'Vaisala Oyj', 'Orion', 'Tiimari', 'Metso Paper', 'Outokumpu Technology'], 'sentiment': ['Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive'], 'headline': ["Finnish Aktia Group 's operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .", 'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .', 'Finnish pharmaceuticals company Orion report

{'input': 'Create a dataframe with two columns: company_name, sentiment and headline.\n                   To fill the dataframe, use the following lists respectively: [\'Aktia Group\', \'Vaisala Oyj\', \'Orion\', \'Tiimari\', \'Metso Paper\', \'Outokumpu Technology\'], [\'Positive\', \'Negative\', \'Positive\', \'Positive\', \'Positive\', \'Positive\'] and ["Finnish Aktia Group \'s operating profit rose to EUR 17.5 mn in the first quarter of 2010 from EUR 8.2 mn in the first quarter of 2009 .", \'Finnish measuring equipment maker Vaisala Oyj HEL : VAIAS said today that its net loss widened to EUR4 .8 m in the first half of 2010 from EUR2 .3 m in the corresponding period a year earlier .\', \'Finnish pharmaceuticals company Orion reports profit before taxes of EUR 70.0 mn in the third quarter of 2010 , up from EUR 54.9 mn in the corresponding period in 2009 .\', \'Tiimari , the Finnish retailer , reported to have geenrated quarterly revenues totalling EUR 1.3 mn in the 4th quarter 2009 

If we look at our working directory, we will see a new file pop up, called `financial_analysis_with_parsing.csv`.

Let's analyze it and compare against the output from Task 5.

Load and display the new file.
- Load `financial_analysis_with_parsing.csv` into a dataframe called `df`.
- Print the dataframe.

In [21]:
# Load the CSV file into a dataframe.
df = pd.read_csv("financial_analysis_with_parsing.csv")

# Print the dataframe.
df

Unnamed: 0,company_name,sentiment,headline
0,Aktia Group,Positive,Finnish Aktia Group 's operating profit rose t...
1,Vaisala Oyj,Negative,Finnish measuring equipment maker Vaisala Oyj ...
2,Orion,Positive,Finnish pharmaceuticals company Orion reports ...
3,Tiimari,Positive,"Tiimari , the Finnish retailer , reported to h..."
4,Metso Paper,Positive,Finnish Metso Paper has been awarded a contrac...
5,Outokumpu Technology,Positive,Finnish Outokumpu Technology has been awarded ...
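The `Unnamed: 0` column suggests that the agent saved the dataframe together with its index: pandas writes the index by default, and `read_csv` labels a column without a header as `Unnamed: 0`. Passing `index=False` to `DataFrame.to_csv` is one way to avoid it. Since this step involves no ambiguity, it can also be done without an agent. The sketch below uses only the standard library `csv` module, an in-memory buffer instead of a real file, and short placeholder lists instead of the real ones.

```python
import csv
import io

# Placeholder lists standing in for company_names, sentiments and headlines.
company_names = ["Aktia Group", "Vaisala Oyj"]
sentiments = ["Positive", "Negative"]
headlines = ["Headline one (placeholder)", "Headline two, with a comma (placeholder)"]

# In-memory stand-in for a file opened with open(path, "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["company_name", "sentiment", "headline"])
# zip() pairs up the three lists row by row; no index column is written.
writer.writerows(zip(company_names, sentiments, headlines))

# Read the data back to confirm the columns.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0])
```

The `csv` module quotes fields that contain commas, which is why a headline with a comma survives the round trip as a single value.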


## Using the OpenAI Moderation API

In addition to its model and embeddings APIs, the OpenAI API platform offers a Moderation API. The Moderation API checks whether a prompt contains explicit content and can flag categories such as hate, violence and sexually explicit content. When we build an application for a large user base, it becomes crucial to run input prompts through the Moderation API and filter them, to avoid the complications associated with unethical LLM usage.
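The filtering idea can be pictured as a gate in front of the model. The sketch below is an assumption about how such a gate might be organized, not part of the project code: `filter_prompts` is hypothetical, and `fake_is_flagged` is a keyword stand-in so the example runs offline. In a real application that function would call the Moderation API and return its `flagged` result.

```python
def filter_prompts(prompts, is_flagged):
    """Split prompts into (allowed, blocked) using a moderation check."""
    allowed, blocked = [], []
    for prompt in prompts:
        # Flagged prompts are held back; everything else may go to the model.
        (blocked if is_flagged(prompt) else allowed).append(prompt)
    return allowed, blocked


def fake_is_flagged(text):
    """Hypothetical stand-in for a moderation call (keyword match only)."""
    return "insult" in text.lower()


prompts = ["Summarise this headline.", "Write an INSULT about my boss."]
allowed, blocked = filter_prompts(prompts, fake_is_flagged)
print("allowed:", allowed)
print("blocked:", blocked)
```

Passing the check in as a function keeps the gate independent of any particular moderation service and makes it easy to test.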

To test the Moderation API, we have a small sample of five comments picked from the `r/WallStreetBets` subreddit, stored in the `reddit_comments.txt` file.

Let's start by reading the text file and displaying its contents to understand what kind of comments we are dealing with. This will help us see how the Moderation API can be applied to real-world data.

Here are the steps we will follow:
1. Read the `reddit_comments.txt` file.
2. Display the contents of the file.

Read the text file and store its lines in a variable called `comments`.
- Open `reddit_comments.txt` in read mode.
- Use the `.readlines()` method to store its contents in a list called `comments`.
- Optionally print the comments.
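Note that `.readlines()` keeps the trailing newline of every line. If that is unwanted, the lines can be cleaned while reading. A minimal sketch, using an in-memory buffer with placeholder text instead of the real file:

```python
import io

# In-memory stand-in for open("reddit_comments.txt", encoding="utf-8");
# the text below is placeholder content, not the real comments.
data = io.StringIO("first comment\nsecond comment\n\nthird comment\n")

# Iterating over a file yields one line at a time. strip() removes the
# trailing newline, and the condition skips blank lines.
comments = [line.strip() for line in data if line.strip()]
print(comments)
```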

In [22]:
# Load the lines of the text file (UTF-8, since the comments contain curly quotes).
with open('reddit_comments.txt', 'r', encoding='utf-8') as data:
    comments = data.readlines()

# Optionally print the comments.
comments

["It's the poors fault for thinking they had a chance in a negative sum gambling casino run by people richer than you who hired Asian quants that are smarter than you.\n",
 "Canada is basically a global real estate investment scheme. It's not even a country, it's a showroom.\n",
 'Lol China not going to make a dent in the global scale. Wake me up when America’s housing market is about to implode that’s when I’m pulling out all my investments. Because the world is going to burn.\n',
 'I would normally have the knee-jerk reaction to seethe at this post but I remind myself that if I had a lot of money I would probably be the snobbiest and stingiest rich person ever. I wouldn’t even help anyone even if they begged me to financially free them from their Wendy’s dumpster obligations\n',
 "I know China will be fine because Peter Zeihan keeps saying China is imploding. If you want to know what the US State Department desperately wants you to believe, just keep up to date with whatever Peter Ze

Analyze a comment using the Moderation API.
- Pick a comment from the dataset and store this in a variable called `comment`.
- Use the `openai` package to create an OpenAI client. Assign to `client`.
- Use the API by calling the previously defined `client`'s `.moderations.create()` method. For the `input` argument, pass the `comment`. Assign to `moderation_output`.
- Print the comment and moderation output.

In [23]:
# Pick a comment.
comment = comments[2]

# Create an OpenAI client. Assign to client.
client = openai.OpenAI()

# Send the comment to the Moderation API.
moderation_output = client.moderations.create(input=comment)

# Optionally print the comment.
print(comment)

# Print the output.
moderation_output

Lol China not going to make a dent in the global scale. Wake me up when America’s housing market is about to implode that’s when I’m pulling out all my investments. Because the world is going to burn.



ModerationCreateResponse(id='modr-BeqR0BvL0oy9c4v05fLcnwSI6Ke1Y', model='text-moderation-007', results=[Moderation(categories=Categories(harassment=True, harassment_threatening=False, hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False, self_harm_intent=False, sexual=False, sexual_minors=False, violence=False, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=False), category_scores=CategoryScores(harassment=0.5870464444160461, harassment_threatening=0.035818107426166534, hate=0.03748295083642006, hate_threatening=0.008557282388210297, self_harm=3.218484198441729e-05, self_harm_instructions=2.1505543372768443e-06, self_harm_intent=1.1581191756704357e-05, sexual=3.791242761508329e-06, sexual_minors=4.0563273273619416e-07, violence=0.06111923232674599, violence_graphic=0.0001648496399866417, self-harm=3.218484198441729e-05,

We can analyze the output above to determine whether the comment has been deemed explicit. The `flagged` boolean shows whether at least one category has been flagged, and the `categories` field shows which ones.

The moderation scores for each category can be retrieved to explore why the text was flagged as inappropriate. The code is slightly tedious, but it can be reused as-is whenever you use the Moderation API.

In [24]:
# Run this code to see the scores
pd.DataFrame(moderation_output.results[0].dict())[["categories", "category_scores"]]

Unnamed: 0,categories,category_scores
harassment,True,0.5870464
harassment_threatening,False,0.03581811
hate,False,0.03748295
hate_threatening,False,0.008557282
self_harm,False,3.218484e-05
self_harm_instructions,False,2.150554e-06
self_harm_intent,False,1.158119e-05
sexual,False,3.791243e-06
sexual_minors,False,4.056327e-07
violence,False,0.06111923
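Once the categories and scores are available as plain dictionaries, summarizing them takes a few lines. The sketch below works on hand-copied values rather than on a live API response: the three categories and their scores are taken from the table above, and the remaining categories are left out for brevity.

```python
# Values copied from the table above; only three categories are included.
categories = {"harassment": True, "hate": False, "violence": False}
category_scores = {"harassment": 0.5870464, "hate": 0.03748295, "violence": 0.06111923}

# Names of the categories that were flagged.
flagged_categories = [name for name, hit in categories.items() if hit]

# Categories ordered from highest to lowest score.
ranked = sorted(category_scores.items(), key=lambda item: item[1], reverse=True)

print("Flagged:", flagged_categories)
for name, score in ranked:
    print(f"{name:<12} {score:.3f}")
```

Sorting by score shows how close the unflagged categories came to being flagged, which helps when deciding how strict a filter should be.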


## Summary

Here's a quick recap of what we've done:

- **Prompt Engineering**: Mastered essential tricks and optimizations to enhance the performance of your prompts.
- **Prompt Templates**: Set up and utilized prompt templates for consistent and efficient prompt generation.
- **LLMChains**: Leveraged LLMChains to streamline the process of chaining together multiple language model calls.
- **Output Parsing**: Used LangChain's output parsing capabilities to convert AI-generated text into Python objects for downstream applications.
- **Agents and Tools**: Integrated LangChain Agents and Tools to extend the functionality of your generative AI projects.
- **Moderation API**: Employed the Moderation API to filter and moderate user input, ensuring safe and appropriate interactions.

With these skills, you are well-prepared to tackle more advanced modules and projects. Best of luck in your continued learning journey!

## Conclusion

In this project, we used LangChain and several other tools to analyze financial headlines and moderate user-generated content, and we practiced a set of skills that are central to generative AI engineering. Here’s a summary of our accomplishments and the skills we developed:

### Data Analysis and Moderation
We started from a dataset of financial headlines and extracted the company name and financial sentiment for each one, providing a foundation for more advanced AI applications. We then used the Moderation API to check user-generated comments against content guidelines, showcasing our ability to implement ethical AI practices.

### Prompt Engineering and Optimization
One of the key skills we developed was prompt engineering. We learned how to craft prompts that guide the model toward the desired output, including few shot learning to convey what we mean by financial sentiment. This requires attention to wording and context, both of which are essential for getting reliable and accurate results from a language model.

### Setting Up Prompt Templates
We explored the creation of prompt templates, which are reusable components that streamline the process of generating AI responses. This not only improved our efficiency but also ensured consistency in the outputs, a critical aspect when scaling AI solutions.

### Utilizing LLMChains
By using LLMChains, we were able to chain together multiple language models to perform complex tasks. This demonstrated our ability to build sophisticated AI pipelines that can handle a variety of inputs and produce coherent, contextually relevant outputs.

### Output Parsing and Python Integration
We delved into LangChain’s output parsing capabilities, learning how to convert AI-generated text into Python objects. This skill is invaluable for integrating AI outputs into larger systems, enabling seamless interaction between AI models and other software components.

### Implementing LangChain Agents and Tools
Our journey also included the use of LangChain Agents and Tools, which added a layer of functionality to our AI projects. These tools allowed us to extend the capabilities of our models, making them more versatile and powerful.

### Leveraging the Moderation API
Finally, we leveraged the Moderation API to filter user input, ensuring that our AI models operate within ethical boundaries. This step underscored the importance of responsible AI development and our commitment to creating safe and inclusive AI solutions.