# Document Classification using LangChain and Groq ChatModel

## Introduction
Document classification means tagging or labeling a document with one or more predefined categories. It underpins tasks such as content categorization, sentiment analysis, and information retrieval. This guide shows how to use LangChain with a Groq chat model to classify documents.

Tagging has a few components:

* function: as with extraction, tagging uses function calling to specify how the model should tag a document
* schema: defines the properties we want to tag the document with
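
To make the two components concrete, here is a minimal sketch of what a tagging function and its schema might look like once serialized for a function-calling model. The dictionary is hand-written for illustration only: the name, descriptions, and layout are assumptions, and the payload LangChain actually generates depends on the provider and library version.

```python
import json

# Hypothetical function spec in the JSON Schema style used by
# function-calling chat APIs. Not generated by LangChain.
classification_function = {
    "name": "Classification",
    "description": "Tag a passage with sentiment, aggressiveness, and language.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {
                "type": "string",
                "description": "The sentiment of the text",
            },
            "aggressiveness": {
                "type": "integer",
                "description": "How aggressive the text is, from 1 to 10",
            },
            "language": {
                "type": "string",
                "description": "The language the text is written in",
            },
        },
        # Properties the model must always return
        "required": ["sentiment", "aggressiveness", "language"],
    },
}

print(json.dumps(classification_function, indent=2))
```

The "function" is the named wrapper the model is asked to call, and the "schema" is the `parameters` object describing the tags it should return.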

---

## Loading a Groq chat model and classifying text with a Pydantic schema

In [51]:
import getpass
import os

if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-8b-8192")

Let's specify a Pydantic model with a few properties and their expected types in our schema.

In [52]:
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)


# Define the Classification schema
class Classification(BaseModel):
    sentiment: str = Field(
        description="The sentiment of the text (e.g., positive, neutral, negative)"
    )
    aggressiveness: int = Field(
        description="A score from 1 to 10 indicating how aggressive the text is"
    )
    language: str = Field(
        description="The language the text is written in (e.g., English, Spanish)"
    )


# Re-create the ChatGroq model (imported above) and bind the schema to it,
# so that responses are parsed into Classification objects

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,
    # other params...
).with_structured_output(Classification)

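def classify_text(input_text):
    """Fill the tagging prompt with input_text and print the structured result."""
    prompt = tagging_prompt.format(input=input_text)
    try:
        # Uses the module-level llm, so rebinding llm later changes the schema used here
        response = llm.invoke(prompt)
        print(response)
    except Exception as e:
        print(f"Error: {e}")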


In [53]:
# Example usage
input_text = "This is an example input text."
classify_text(input_text)

sentiment='neutral' aggressiveness=1 language='English'


In [54]:
inp = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"
classify_text(inp)


sentiment='positive' aggressiveness=1 language='Spanish'


In [55]:
inp = "Estoy muy enojado con vos! Te voy a dar tu merecido!"
classify_text(inp)

sentiment='negative' aggressiveness=10 language='Spanish'
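
Printing is convenient in a notebook, but downstream code usually needs the result itself. The following is a minimal sketch of a variant that returns the result instead. `classify` and `FakeModel` are hypothetical names introduced here; `FakeModel` is only a stand-in with an `invoke` method so the sketch runs without an API key, and in practice the structured-output `llm` would be passed in its place.

```python
PROMPT_TEMPLATE = (
    "Extract the desired information from the following passage.\n\n"
    "Passage:\n{input}\n"
)


def classify(model, text):
    """Return the model's structured result, or None if the call fails."""
    prompt = PROMPT_TEMPLATE.format(input=text)
    try:
        return model.invoke(prompt)
    except Exception as exc:  # broad on purpose for a sketch; narrow it in real code
        print(f"Error: {exc}")
        return None


class FakeModel:
    """Hypothetical stand-in for a structured-output chat model."""

    def invoke(self, prompt):
        # A canned answer; a real model would derive this from the prompt.
        return {"sentiment": "neutral", "aggressiveness": 1, "language": "English"}


result = classify(FakeModel(), "This is an example input text.")
print(result)
```

Passing the model as an argument also avoids relying on a module-level `llm` variable, which matters once the variable is rebound to a different schema.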


## Finer control
Careful schema definition gives us more control over the model's output.

Specifically, we can define:

* Possible values for each property
* Description to make sure that the model understands the property
* Required properties to be returned

In [56]:
from typing import Literal


class Classification(BaseModel):
    # Literal constrains both the schema sent to the model and the parsed result
    sentiment: Literal["happy", "neutral", "sad"]
    aggressiveness: Literal[1, 2, 3, 4, 5] = Field(
        description="describes how aggressive the statement is, the higher the number the more aggressive",
    )
    language: Literal["spanish", "english", "french", "german", "italian"]

In [57]:
tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,
    # other params...
).with_structured_output(Classification)

In [58]:
# Example usage
input_text = "This is an example input text."
classify_text(input_text)

sentiment='neutral' aggressiveness=1 language='english'


In [59]:
inp = "Weather is ok here, I can go outside without much more than a coat"
classify_text(inp)

sentiment='happy' aggressiveness=1 language='english'
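
Constraints declared in a schema guide the model, but it can still be worth checking results after the fact. Below is a minimal plain-Python sketch of such a check, using the allowed values from this section. `ALLOWED` and `find_violations` are hypothetical helpers written for illustration and are not part of LangChain or Pydantic.

```python
# Allowed values per property, mirroring the schema in this section
ALLOWED = {
    "sentiment": {"happy", "neutral", "sad"},
    "aggressiveness": {1, 2, 3, 4, 5},
    "language": {"spanish", "english", "french", "german", "italian"},
}


def find_violations(tags: dict) -> list:
    """List every missing property and every value outside its allowed set."""
    problems = []
    for field, allowed in ALLOWED.items():
        if field not in tags:
            problems.append(f"missing required field: {field}")
        elif tags[field] not in allowed:
            problems.append(f"{field}={tags[field]!r} is not an allowed value")
    return problems


good = {"sentiment": "happy", "aggressiveness": 1, "language": "english"}
bad = {"sentiment": "positive", "aggressiveness": 10}

print(find_violations(good))
print(find_violations(bad))
```

The first call reports no problems. The second reports two disallowed values and one missing property, which is the kind of output the looser schema from the first section could produce.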


## OpenAI Metadata Tagger

Tagging documents with structured metadata (like title, tone, or length) can be very helpful for better similarity searches. But doing this manually for a large number of documents can be time-consuming.

The **OpenAIMetadataTagger** automates this process. It extracts metadata from each document according to a schema you provide. It uses OpenAI Functions under the hood, so if you pass a custom LLM, that model must support function calling.

### Key Points:
- This tool works best with complete documents. Run it on whole documents first before splitting or further processing them.
- You can define the metadata structure using a JSON Schema.
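
The first point matters because details such as a title or author often appear only once, near the top of a document. The sketch below illustrates the tag-then-split order using only the standard library. `Doc` and `split_with_metadata` are hypothetical stand-ins for a document class and a text splitter, and the metadata is written by hand rather than extracted by a model.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(doc: Doc, chunk_size: int) -> list:
    """Cut the text into fixed-size chunks; each chunk gets a copy of the metadata."""
    return [
        Doc(doc.page_content[start:start + chunk_size], dict(doc.metadata))
        for start in range(0, len(doc.page_content), chunk_size)
    ]


# Metadata attached while the document was still whole
tagged = Doc(
    "Review of The Bee Movie\nBy Roger Ebert\n\n"
    "This is the greatest movie ever made. 4 out of 5 stars.",
    {"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive"},
)

chunks = split_with_metadata(tagged, chunk_size=40)
for chunk in chunks:
    print(repr(chunk.page_content), "->", chunk.metadata["movie_title"])
```

Only the first chunk contains the title and the critic's name, yet every chunk carries them as metadata. Tagging after splitting would leave the later chunks with nothing to extract those fields from.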

### Example:
Imagine you want to tag a set of movie reviews. You can set up the metadata tagger with a JSON Schema to specify what metadata to extract (e.g., title, tone, etc.).


In [60]:
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)
from langchain_core.documents import Document

In [61]:
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie",
        },
    },
    "required": ["movie_title", "critic", "tone"],
}

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,
    # other params...
)

document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)

You can then simply pass the document transformer a list of documents, and it will extract metadata from the contents:

In [62]:
original_documents = [
    Document(
        page_content="Review of The Bee Movie\nBy Roger Ebert\n\nThis is the greatest movie ever made. 4 out of 5 stars."
    ),
    Document(
        page_content="Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
        metadata={"reliable": False},
    ),
]

enhanced_documents = document_transformer.transform_documents(original_documents)

In [63]:
import json

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)

Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"critic": "Roger Ebert", "movie_title": "The Bee Movie", "tone": "positive"}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"critic": "Anonymous", "movie_title": "The Godfather", "tone": "negative", "reliable": false}


The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.
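
The no-overwrite rule can be pictured as a dictionary merge in which existing keys win. This is a minimal sketch of that rule only, not the library's implementation. `merge_metadata` is a hypothetical helper, and the colliding `reliable` value in `extracted` is made up to show the effect.

```python
def merge_metadata(existing: dict, extracted: dict) -> dict:
    """Combine metadata so that keys already on the document are preserved."""
    # Later entries override earlier ones, so `existing` is unpacked last.
    return {**extracted, **existing}


existing = {"reliable": False}
extracted = {
    "critic": "Anonymous",
    "movie_title": "The Godfather",
    "tone": "negative",
    "reliable": True,  # hypothetical collision with an existing key
}

merged = merge_metadata(existing, extracted)
print(merged)
```

The merged result keeps `reliable` as `False` while gaining the three extracted fields.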

We can also initialize the document transformer with a Pydantic schema:

In [64]:
from typing import Literal

from pydantic import BaseModel, Field


class Properties(BaseModel):
    movie_title: str
    critic: str
    tone: Literal["positive", "negative"]
    rating: int = Field(description="Rating out of 5 stars")


document_transformer = create_metadata_tagger(Properties, llm)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)

Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"critic": "Roger Ebert", "movie_title": "The Bee Movie", "rating": 4, "tone": "positive"}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"critic": "Anonymous", "movie_title": "The Godfather", "rating": 1, "tone": "negative", "reliable": false}

The tagger also accepts a custom prompt, for example to steer how a field is filled in:


In [65]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.

{input}
"""
)

document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)

Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"critic": "Roger Ebert", "movie_title": "The Bee Movie", "tone": "positive"}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"critic": "Roger Ebert", "movie_title": "The Godfather", "tone": "negative", "reliable": false}


## Second Example:
Let’s say you want to tag product reviews based on attributes like product name, reviewer name, sentiment, and rating.

In [68]:
from typing import Literal
from pydantic import BaseModel, Field

# Metadata schema: product name, reviewer name, sentiment, and rating
class ReviewMetadata(BaseModel):
    product_name: str
    reviewer_name: str
    sentiment: Literal["positive", "neutral", "negative"]
    rating: int = Field(description="Rating out of 5 stars")


# Build a document transformer that extracts ReviewMetadata fields using the LLM
document_transformer = create_metadata_tagger(ReviewMetadata, llm)

# Product reviews to process; transform_documents (below) extracts metadata from each
original_documents = [
    Document(
        "I recently bought the Wireless Headphones, and I must say, they are amazing! The sound quality is crystal clear, and they are super comfortable to wear. I would highly recommend them to anyone looking for good wireless headphones. 4/5 stars. Atharv"
    ),
    Document(
        "The Wireless Headphones didn't live up to my expectations. While the sound is decent, the battery life is terrible, and they are not very comfortable. I was hoping for more from this brand. 2/5 stars."
    ),
    Document(
        "These Wireless Headphones are okay. They're not great, but they get the job done. The sound is decent, but there are better options out there for the price. 3/5 stars."
    ),
    Document(
        "I'm absolutely in love with these Wireless Headphones! They provide excellent sound quality, and the battery lasts forever. The design is sleek and modern, and they're very comfortable. Definitely worth the money. 5/5 stars."
    ),
]


enhanced_documents = document_transformer.transform_documents(original_documents)

# Print each review followed by its extracted metadata as JSON
import json

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)

I recently bought the Wireless Headphones, and I must say, they are amazing! The sound quality is crystal clear, and they are super comfortable to wear. I would highly recommend them to anyone looking for good wireless headphones. 4/5 stars. Atharv

{"product_name": "Wireless Headphones", "rating": 4, "reviewer_name": "Atharv", "sentiment": "positive"}

---------------

The Wireless Headphones didn't live up to my expectations. While the sound is decent, the battery life is terrible, and they are not very comfortable. I was hoping for more from this brand. 2/5 stars.

{"product_name": "Wireless Headphones", "rating": 2, "reviewer_name": "unknown", "sentiment": "negative"}

---------------

These Wireless Headphones are okay. They're not great, but they get the job done. The sound is decent, but there are better options out there for the price. 3/5 stars.

{"product_name": "Wireless Headphones", "rating": 3, "reviewer_name": "N/A", "sentiment": "neutral"}

---------------

I'm absolutel

In [69]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """Extract relevant information from the following text. 

{input}
"""
)

document_transformer = create_metadata_tagger(ReviewMetadata, llm, prompt=prompt)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)

I recently bought the Wireless Headphones, and I must say, they are amazing! The sound quality is crystal clear, and they are super comfortable to wear. I would highly recommend them to anyone looking for good wireless headphones. 4/5 stars. Atharv

{"product_name": "Wireless Headphones", "rating": 4, "reviewer_name": "Atharv", "sentiment": "positive"}

---------------

The Wireless Headphones didn't live up to my expectations. While the sound is decent, the battery life is terrible, and they are not very comfortable. I was hoping for more from this brand. 2/5 stars.

{"product_name": "Wireless Headphones", "rating": 2, "reviewer_name": "A Customer", "sentiment": "negative"}

---------------

These Wireless Headphones are okay. They're not great, but they get the job done. The sound is decent, but there are better options out there for the price. 3/5 stars.

{"product_name": "Wireless Headphones", "rating": 3, "reviewer_name": "N/A", "sentiment": "neutral"}

---------------

I'm absolu