<a href="https://colab.research.google.com/github/ubinix-warun/mad-bootcamp-2024/blob/main/colab/mad_8_llm_application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note**: For those who are not familiar with Google's Colab, please visit https://colab.research.google.com/ for more information and how-to.


<h1> 0. OpenAI's API Setup </h1>

Install the OpenAI module.

In [None]:
!pip install openai

Check the version of the OpenAI module. <br>
The version should be 1.0.0 or above. If not, run `!pip install --upgrade --force-reinstall openai`

In [None]:
import openai
print(openai.__version__)

1.35.3
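
The same check can be done in code rather than by eye. This is a minimal sketch, assuming the version string starts with a plain integer major version; `is_v1_or_later` is a hypothetical helper, not part of the `openai` package.

```python
def is_v1_or_later(version: str) -> bool:
    """Return True when the major version number is at least 1."""
    major = int(version.split(".")[0])
    return major >= 1

print(is_v1_or_later("1.35.3"))  # the version printed above
print(is_v1_or_later("0.28.1"))  # an example pre-1.0 version string
```

In the notebook, the argument would be `openai.__version__`.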


In [None]:
OPENAI_API_KEY = '<your api key>'
GPT4 = 'gpt-4-1106-preview'
GPT3 = "gpt-3.5-turbo"

<h3> Helper Function for API Calling </h3>

To connect to the OpenAI API, pass the API key to an `openai.OpenAI` client instance.

In [None]:
client = openai.OpenAI(api_key=OPENAI_API_KEY)

In this demonstration, we use OpenAI's `gpt-3.5-turbo` model and the chat completion endpoint `client.chat.completions`, wrapped in the function `get_completion`.

In [None]:
def get_completion(prompt, model=GPT3):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    return response.choices[0].message.content
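
`get_completion` sends a single `user` message. The chat completion endpoint also accepts a `system` message that sets the assistant's overall behaviour. The sketch below only builds the `messages` list and makes no API call; `build_messages` is a hypothetical helper introduced for illustration.

```python
def build_messages(prompt, system=None):
    """Build the `messages` list for a chat completion request."""
    messages = []
    if system:
        # An optional system message sets the assistant's overall behaviour
        messages.append({"role": "system", "content": system})
    # The user message carries the actual question or instruction
    messages.append({"role": "user", "content": prompt})
    return messages

example = build_messages("Tell me a joke.", system="You are a concise assistant.")
print(example)
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`.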

To check the API connection, try asking a question by replacing `<insert your question>` with your own.

In [None]:
text = """
<insert your question>
"""

prompt = f"""
As a professional assistant, answer the given question or instruction, delimited by angle brackets.
Text: <{text}>
"""

resp = get_completion(prompt)
print(resp)

Why did the scarecrow win an award? Because he was outstanding in his field!


<h1> 1. LangChain: Post Tagging </h1>

<h2> 1.1 Setup </h2>

**Installation**
*   LangChain Core
*   LangChain for OpenAI
*   LangChain Community (in case `langchain-openai` requires it)

In [None]:
!pip install langchain
!pip install langchain-openai
!pip install langchain-community

**Import**
*   `ChatOpenAI`: For creating the `llm` object
*   `ChatPromptTemplate`: For generating the prompt from a prompt template
*   `BaseModel`, `Field`, `List`: For creating the `schema` object

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List

<h2> 1.2 Data for Demonstration </h2>

In this demo, we use some mock social media posts, `post_list`, for LangChain tagging. The list contains 10 items. Each item is a `post`/`tags` pair, where `post` is a string containing the content of the social media post and `tags` are the categories or topics that best represent the post.


In [None]:
post_list = [
    {
        "post": "Just finished a killer workout at the gym! Feeling pumped! 💪",
        "tags": ["Fitness", "Health"]
    },
    {
        "post": "Exploring the awesome streets of Paris today. The Eiffel Tower is lit! #travel",
        "tags": ["Travel", "Tourism"]
    },
    {
        "post": "Made a bomb homemade pizza tonight. It was a hit with the fam! 🍕 Extra cheese and fresh basil, definitely doing this again!",
        "tags": ["Food", "Cooking"]
    },
    {
        "post": "Can't wait for the new tech conference next week! So hyped to see the latest gadgets. #TechGeek. Hoping to get a sneak peek at the new smartphones and VR headsets.",
        "tags": ["Tech", "Events"]
    },
    {
        "post": "Watching the game tonight with the crew. Go team! 🏀 We're all decked out in our jerseys and cheering loud. This season has been epic so far!",
        "tags": ["Sports", "Entertainment"]
    },
    {
        "post": "Started a new book today. Loving the mystery and suspense so far. The plot twists are keeping me on the edge of my seat. Highly recommend this to any thriller fans!",
        "tags": ["Books", "Entertainment"]
    },
    {
        "post": "Just copped a new camera. Can't wait to snap some amazing photos on my next trip. Looking forward to capturing beautiful landscapes and candid moments. Any tips for a newbie photographer?",
        "tags": ["Photography", "Travel"]
    },
    {
        "post": "Hit up an amazing concert last night. The energy was insane! The band played all their bangers and the crowd was electric. It was a night to remember.",
        "tags": ["Music", "Events"]
    },
    {
        "post": "Spent the afternoon gardening and planting new flowers. It's so chill to be outdoors. #green #environment",
        "tags": ["Gardening", "Outdoor"]
    },
    {
        "post": "Had an awesome dinner at the new Italian spot downtown. The pasta was delish and the service was on point.",
        "tags": ["Food", "Dining"]
    }
]


<h2> 1.3 Setting up LLM Model, Prompt, and Schema for output format </h2>

<h3> LLM Model </h3>

Set up LangChain's LLM object using our `OPENAI_API_KEY` from the previous section. The model used in this demonstration is `gpt-3.5-turbo`. `ChatOpenAI` is the class that creates the `llm` object, which will be the processing core of the overall chain.

In [None]:
llm = ChatOpenAI(temperature=0, model=GPT3, openai_api_key=OPENAI_API_KEY)

<h3> Prompt </h3>

`ChatPromptTemplate` is LangChain's prompt template class for chat models. `ChatPromptTemplate.from_template` creates a prompt object with a placeholder waiting for the `post` data to be plugged in. The filled prompt is then fed into the `llm` object.

In [None]:
tagging_prompt = ChatPromptTemplate.from_template(
"""
You are an expert text classifier. Your task is to assign relevant tags to the given social media post.
The tags should accurately represent the main topics or categories of the post.

Here is the social media post:
"{post}"
"""
)
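
Conceptually, the template is a string with a named placeholder that is filled in later. A rough analogy using plain `str.format` is sketched below. It only mimics the substitution step; `ChatPromptTemplate` additionally wraps the result in chat messages.

```python
# Shortened version of the template above, with the same `{post}` placeholder
template = 'Here is the social media post:\n"{post}"'

# Filling the placeholder produces the final prompt text
filled = template.format(post="Go team!")
print(filled)
```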

<h3> Schema </h3>

The `schema` used in this demonstration is a Pydantic model (a plain dictionary also works) that defines the output structure of the tagging task for the LLM. It specifies each output field, its description, its data type, and so on.

For more information, see https://docs.pydantic.dev/latest/ .
<br>
<br>
The following are the Tagging Fields in this example. Each field requires a `type` (`int`, `str`, `bool`, etc.), to specify the output type, and `description`, which is an instruction of what the LLM should do for this field.

*   `categories`:
    - **Type**: Array of strings
    - **Description**: The categories or topics that best represent the social media post. At least two tags should be provided. If fewer than two clear tags are identified, repeat the most relevant tag to ensure at least two tags.
    
*   `summary`(optional):
    - **Type**: String
    - **Description**: A brief summary of the social media post, capturing the main idea or activity described. If the post is too short to summarize meaningfully, repeat the post itself.
    
*   `sentiment`(optional):
    - **Type**: String
    - **Description**: The sentiment analysis of the social media post, categorized as 'Positive', 'Neutral', or 'Negative'. Provide a single word indicating the sentiment.

Lastly, these tagging fields (`categories`, `summary`, `sentiment`) are just for this demonstration. They can be replaced with any fields the user wants, for example, the number of hashtags or a list of places mentioned in the post.

In [None]:
class TaggingSchema(BaseModel):
    categories: List[str] = Field(
        description="""Given a social media post, identify and extract at least two tags that best represent the main topics or categories of the post.
        If there are fewer than two clear tags, repeat the most relevant tag to ensure at least two tags are always provided. The tags should be general topics, do not include any specific term.
        The output should be an array of these tags.""",
        min_items=2
    )
    summary: str = Field(
        description="""Provide a brief summary of the social media post. This summary should capture the main idea or activity described in the post,
        using concise and clear language. If the post is too short to summarize meaningfully, repeat the post itself."""
    )
    sentiment: str = Field(
        description="""Analyze the sentiment of the social media post. Categorize the sentiment as 'Positive', 'Neutral', or 'Negative' based on the tone and content of the post.
        Provide a single word indicating the sentiment.""",
        enum=['Positive', 'Negative', 'Neutral']
    )

Alternatively, the schema can be constructed from a normal Python dictionary. This yields the same result as the Pydantic model.

In [None]:
dict_schema = {
    'title': 'TaggingSchema',
    'type': 'object',
    'description': 'Schema for tagging, summarizing, and analyzing sentiment of Social Media Posts',
    "properties": {
        "categories": {
            "type": "array",
            "minItems": 2,
            "items": {
                "type": "string"
            },
            "description": """Given a social media post, identify and extract at least two tags that best represent the main topics or categories of the post.
            If there are fewer than two clear tags, repeat the most relevant tag to ensure at least two tags are always provided.
            The output should be an array of these tags."""
        },
        "summary": {
            "type": "string",
            "description": """Provide a brief summary of the social media post. This summary should capture the main idea or activity described in the post,
            using concise and clear language. If the post is too short to summarize meaningfully, repeat the post itself."""
        },
        "sentiment": {
            "type": "string",
            "description": """Analyze the sentiment of the social media post. Categorize the sentiment as 'Positive', 'Neutral', or 'Negative' based on the tone and content of the post.
            Provide a single word indicating the sentiment.""",
            'enum': ['Positive', 'Negative', 'Neutral']
        },
    },
    'required': ['categories', 'summary', 'sentiment']
}

The schema is then applied to the `llm` object with the `with_structured_output` method, which forces the output into our configured format.

In [None]:
llm_structured = llm.with_structured_output(TaggingSchema)

# in case you use the dictionary version of schema
# llm_structured = llm.with_structured_output(dict_schema)
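
As noted earlier, the tagging fields can be swapped for others, such as a hashtag count or a list of places. Below is a minimal sketch of such a dictionary schema. The schema title and the field names `hashtag_count` and `places` are hypothetical, chosen only for illustration.

```python
custom_schema = {
    "title": "HashtagPlaceSchema",  # hypothetical name
    "type": "object",
    "description": "Schema for counting hashtags and listing places in a social media post",
    "properties": {
        "hashtag_count": {
            "type": "integer",
            "description": "The number of hashtags (words starting with '#') in the post.",
        },
        "places": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Names of places mentioned in the post. Return an empty array if none are mentioned.",
        },
    },
}

# Mark every property as required so each key always appears in the output
custom_schema["required"] = list(custom_schema["properties"])
print(custom_schema["required"])
```

It would be passed to `llm.with_structured_output(custom_schema)` in the same way as `dict_schema`.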

<h2> 1.4 Create Tagging Chain and Apply </h2>

After all components are set, including `llm_structured` and `tagging_prompt`, we create a `chain` object using the LangChain Expression Language (LCEL) pipe operator (`|`).

In [None]:
tagging_chain = tagging_prompt | llm_structured
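
To build intuition for what `|` does, here is a toy sketch in plain Python. It is not LangChain's implementation; `Step`, `fill_prompt`, and `fake_llm` are hypothetical names, and the "LLM" is a stand-in function that makes no API call.

```python
class Step:
    """Toy stand-in for a chain component; NOT LangChain's implementation."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# First step: fill the prompt from a dict, like a prompt template would
fill_prompt = Step(lambda d: f'Post: "{d["post"]}"')
# Second step: a fake "model" that just reports the prompt length
fake_llm = Step(lambda prompt: {"prompt_length": len(prompt)})

toy_chain = fill_prompt | fake_llm
toy_result = toy_chain.invoke({"post": "Go team!"})
print(toy_result)
```

The point of the design is that each component only needs an `invoke` method, so the output of one step becomes the input of the next.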

Finally, run the chain on one of the input posts. Calling `invoke` sends its argument through the `chain` object and returns output in the desired format. The `input` parameter accepts a dict whose keys are the placeholder names defined in `tagging_prompt`, in this case `post`.
<br>
<br>
The result is a `TaggingSchema` object containing all the fields defined in the `TaggingSchema` class. You can access its attributes as usual, or convert the result to a `dict`/JSON.

In [None]:
# try changing the input post passed into the chain and observe the output
input_post = post_list[4]['post']
result = tagging_chain.invoke(input={'post': input_post})
result

TaggingSchema(categories=['Sports', 'Socializing'], summary='Watching the game with friends, cheering for the team in jerseys. Exciting season!', sentiment='Positive')

In [None]:
print(result.categories)

['Sports', 'Socializing']


In [None]:
result.dict()

{'categories': ['Sports', 'Socializing'],
 'summary': 'Watching the game with friends, cheering for the team in jerseys. Exciting season!',
 'sentiment': 'Positive'}

In [None]:
print('Post Content:', input_post)
print('Topic:', result.dict()['categories'])

Post Content: Watching the game tonight with the crew. Go team! 🏀 We're all decked out in our jerseys and cheering loud. This season has been epic so far!
Topic: ['Sports', 'Socializing']
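
Each item in `post_list` also carries reference `tags`, so the predicted categories can be compared against them. Below is one possible sketch using a case-insensitive Jaccard overlap. `tag_overlap` is a hypothetical helper, and the metric is an assumption rather than part of the original demo.

```python
def tag_overlap(predicted, reference):
    """Return the shared tags and the Jaccard overlap between two tag lists."""
    pred = {tag.lower() for tag in predicted}
    ref = {tag.lower() for tag in reference}
    shared = pred & ref
    union = pred | ref
    # Jaccard overlap: size of the intersection divided by size of the union
    score = len(shared) / len(union) if union else 0.0
    return sorted(shared), score


predicted = ['Sports', 'Socializing']     # model output shown above
reference = ['Sports', 'Entertainment']   # `tags` of post_list[4]
shared, score = tag_overlap(predicted, reference)
print(shared, round(score, 2))
```

Exact string matching is strict: near-synonyms such as "Socializing" and "Entertainment" count as different tags.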


<h1> 2. Basic Semantic Search using Cosine Similarity </h1>

<h2> 2.1 Data for Demonstration </h2>

In this demonstration, we use OpenAI embeddings for semantic search. Below is a story that most readers will be familiar with, **The Tortoise and the Hare**. This story will be the search pool for the semantic search.

In [None]:
storys = """

Once upon a time, there was a speedy hare who bragged about how fast he could run.

Tired of hearing him boast, the slow and steady tortoise challenged him to a race.

All the animals in the forest gathered to watch.

The hare ran down the road for a while and then paused to rest.

He looked back at the slow tortoise and cried out, "How do you expect to win this race when you are walking along at your slow, slow pace?"

The hare stretched himself out alongside the road and fell asleep, thinking, "There is plenty of time to relax."

The tortoise walked and walked, never ever stopping until he came to the finish line.

The animals who were watching cheered so loudly for the tortoise that they woke up the hare.

The hare stretched, yawned, and began to run again, but it was too late.

The tortoise had already crossed the finish line.

""".split('\n')

Divide the data into chunks and store it in a pandas dataframe.

In [None]:
story_chunks = [chunk for chunk in storys if chunk != '']
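
The filter `chunk != ''` keeps lines that contain only spaces or tabs. A slightly more defensive variant strips each line first. It is shown here on a shortened sample rather than the full story; `sample` and `chunks` are illustrative names.

```python
# Shortened sample with a whitespace-only line and padded text
sample = "\nOnce upon a time, there was a speedy hare.\n   \n  The tortoise challenged him to a race.  \n"

# Strip each line and drop the ones that end up empty
chunks = [line.strip() for line in sample.splitlines() if line.strip()]
print(chunks)
```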

In [None]:
import pandas as pd
df_story = pd.DataFrame(story_chunks, columns=['story'])

In [None]:
df_story

Unnamed: 0,story
0,"Once upon a time, there was a speedy hare who ..."
1,"Tired of hearing him boast, the slow and stead..."
2,All the animals in the forest gathered to watch.
3,The hare ran down the road for a while and the...
4,He looked back at the slow tortoise and cried ...
5,The hare stretched himself out alongside the r...
6,"The tortoise walked and walked, never ever sto..."
7,The animals who were watching cheered so loudl...
8,"The hare stretched, yawned, and began to run a..."
9,The tortoise had already crossed the finish line.


<h2> 2.2 Text Embedding </h2>

The OpenAI API is called in a similar way to `chat.completions`, but this time we use the `embeddings` endpoint for text embedding.
<br>
In this demonstration, we use the `text-embedding-3-large` embedding model, which handles both English and non-English text.

In [None]:
embedding_model = "text-embedding-3-large"

In [None]:
def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model=embedding_model
    )
    return response.data[0].embedding

Apply the `get_embedding` function to all the story chunks to get their vectorized representations.

In [None]:
df_story['story_emb'] = df_story['story'].apply(get_embedding)

In [None]:
df_story

Unnamed: 0,story,story_emb
0,"Once upon a time, there was a speedy hare who ...","[-0.008668859489262104, -0.01259252056479454, ..."
1,"Tired of hearing him boast, the slow and stead...","[-0.011227128095924854, -0.035865020006895065,..."
2,All the animals in the forest gathered to watch.,"[0.02016708441078663, -0.006326204631477594, -..."
3,The hare ran down the road for a while and the...,"[-0.005236516706645489, -0.008388666436076164,..."
4,He looked back at the slow tortoise and cried ...,"[0.0051983497105538845, -0.0004825580108445138..."
5,The hare stretched himself out alongside the r...,"[0.01595405302941799, 0.011277619749307632, -0..."
6,"The tortoise walked and walked, never ever sto...","[0.01413221936672926, 0.012286463752388954, -0..."
7,The animals who were watching cheered so loudl...,"[-0.01613091304898262, -0.0004371799004729837,..."
8,"The hare stretched, yawned, and began to run a...","[0.01623072661459446, 0.020464228466153145, -0..."
9,The tortoise had already crossed the finish line.,"[0.024334479123353958, -0.01379484310746193, -..."


In [None]:
print(df_story.loc[0, 'story'])
print(df_story.loc[0, 'story_emb'])

Once upon a time, there was a speedy hare who bragged about how fast he could run.
[-0.008668859489262104, -0.01259252056479454, -0.010771994479000568, 0.016457030549645424, 0.036252789199352264, 0.009332661516964436, -0.0038842272479087114, -0.01265824306756258, -0.03909201920032501, 0.00621739262714982, -0.027603646740317345, 0.017364008352160454, -0.006542721297591925, -0.03520122170448303, -0.0002735718444455415, 0.05988676846027374, -0.023489387705922127, -0.010055613704025745, 0.006887766998261213, 0.011153187602758408, 0.0037002030294388533, 0.0075975749641656876, -0.0014927329029887915, -0.037435803562402725, -0.02413347363471985, 0.0278928279876709, -0.002088347217068076, 0.0038710827939212322, -0.001399077707901597, -0.017061682417988777, -0.007354400120675564, 0.038066741079092026, 0.026013150811195374, -0.04374520853161812, 0.02500101737678051, 0.024935293942689896, 0.004429727792739868, -0.006894339341670275, -0.024527810513973236, 0.008287666365504265, 0.00629297411069273

<h2> 2.3 Cosine Similarity and Semantic Search Function </h2>

A `cosine_similarity` function is defined to compute the similarity between two vectors, the embedding of the `query` and the embedding of a `story_chunk`. Then `semantic_search` ranks the story chunks by similarity score and returns the top results.

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the two vector norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
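
Cosine similarity is 1 for vectors pointing in the same direction, 0 for perpendicular vectors, and -1 for opposite vectors. The sketch below checks this on toy 2-D vectors using only the standard library, so it runs without the embedding API; `cosine_similarity_plain` is a hypothetical stand-in for the NumPy version above.

```python
import math


def cosine_similarity_plain(vec1, vec2):
    """Cosine similarity using only the standard library."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


same_direction = cosine_similarity_plain([1, 2], [2, 4])
orthogonal = cosine_similarity_plain([1, 0], [0, 1])
opposite = cosine_similarity_plain([1, 0], [-1, 0])
print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

Because the score depends only on direction, `[1, 2]` and `[2, 4]` count as identical even though their lengths differ.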

In [None]:
def semantic_search(query, df, top_n=2):
    query_emb = get_embedding(query)
    df['similarity'] = df['story_emb'].apply(lambda x: cosine_similarity(query_emb, x))
    return df.sort_values('similarity', ascending=False).head(top_n)
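
The ranking step can be tried offline with made-up vectors in place of real embeddings. In the sketch below, `rank_chunks`, `cosine`, and the 3-D `toy_vectors` are all hypothetical; real embeddings from `get_embedding` have far more dimensions.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity using only the standard library."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def rank_chunks(query_vec, chunk_vecs, top_n=2):
    """Return (chunk, score) pairs sorted from most to least similar."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up 3-D vectors standing in for real embeddings
toy_vectors = {
    "chunk A": [1.0, 0.0, 0.0],
    "chunk B": [0.7, 0.7, 0.0],
    "chunk C": [0.0, 0.0, 1.0],
}
top = rank_chunks([0.9, 0.1, 0.0], toy_vectors)
print([chunk for chunk, _ in top])
```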

<h2> 2.4 Apply: Semantic Search </h2>

With everything set, we can pass a question (query) to `semantic_search` to retrieve the most relevant part of the story. <br>
Example questions:
*   What did the hare tell the tortoise during the race?
*   What did the tortoise do throughout the race?
*   Who boasted about how fast he could run?

In [None]:
query = 'Who won the race at the very end?'
related_chunk = semantic_search(query, df_story).iloc[0]
print(f"""With the Question:
{query}

Related part from Cosine Similarity Score:
{related_chunk['story']}

With score of {related_chunk['similarity'].round(2)}""")

With the Question:
Who won the race at the very end?

Related part from Cosine Similarity Score:
The tortoise had already crossed the finish line.

With score of 0.45


<h2> 2.5 Caution!!! </h2>

Sometimes the retrieved chunk does not answer the question directly. **Cosine similarity alone may not be enough** for more complicated documents or queries. Further analysis may be needed after retrieving the parts related to the answer.

In [None]:
query = 'How did the hare lose the race?'
related_chunk = semantic_search(query, df_story)['story'].iloc[0]
print(f'''With the Question:
{query}

Related part from Cosine Similarity Score:
{related_chunk}

The actual related part:
The hare stretched himself out alongside the road and fell asleep, thinking, "There is plenty of time to relax."
''')

With the Question:
How did the hare lose the race?

Related part from Cosine Similarity Score:
The hare stretched, yawned, and began to run again, but it was too late.

The actual related part:
The hare stretched himself out alongside the road and fell asleep, thinking, "There is plenty of time to relax."



For further implementation, LangChain provides tools that help with this kind of search. The approach is called Retrieval Augmented Generation (RAG): the related parts of the document are fed to the LLM together with the question or instruction, and the LLM generates the answer. See https://python.langchain.com/v0.2/docs/tutorials/rag/
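
The idea of feeding retrieved chunks together with the question can be sketched by hand. This is not LangChain's RAG API; `build_rag_prompt` is a hypothetical helper, and the `retrieved` list is written out manually for illustration rather than taken from an actual search.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and a question into a single prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hand-picked chunks from the story, standing in for search results
retrieved = [
    "The hare stretched, yawned, and began to run again, but it was too late.",
    'The hare stretched himself out alongside the road and fell asleep, thinking, "There is plenty of time to relax."',
]
rag_prompt = build_rag_prompt("How did the hare lose the race?", retrieved)
print(rag_prompt)
```

In the notebook, the chunks could come from `semantic_search(query, df_story)['story']`, and the resulting prompt could be passed to `get_completion`.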

<h1> End of Demonstration </h1>