# Developing an LLM Application Using LangChain

## Task 1: Import Libraries

In [18]:
import google.generativeai as genai
import textwrap
from IPython.display import display, Markdown
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

api_key = 'mykey'
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-pro')
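Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable; the variable name `GOOGLE_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original notebook:

```python
import os


def load_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Hypothetical usage in this notebook:
# api_key = load_api_key()
# genai.configure(api_key=api_key)
```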

## Task 2: Ask The Questions Using Prompts

In [4]:
response = model.generate_content("Explain Generative AI with 3 bullet points")
print(response.text)
Markdown(response.text)


- **Generates new data or content based on existing data:** Generative AI models can create realistic text, images, audio, and even code by learning patterns and relationships from vast datasets.
- **Emulates human creativity:** These models are trained on diverse datasets to capture the nuances of human expression and generate content that resembles human-created outputs.
- **Supports various applications:** Generative AI finds uses in content creation, art generation, natural language processing, drug discovery, and many other fields where novel or creative solutions are needed.



## Task 3: Chat With Gemini And Retrieve The Chat History

In [5]:
hist = model.start_chat()
response = hist.send_message("Hi, give me some advice about where to eat today")
Markdown(response.text)

for message in hist.history:
    print(message)
    print('\n\n')

# Text of the most recent message in the history
hist.history[-1].parts[0].text

model.count_tokens("Now tell me the address of the place you adviced")

parts {
  text: "Hi, give me some advice about where to eat today"
}
role: "user"




parts {
  text: "**Consider Your Preferences:**\n\n* **Cuisine:** What type of food do you crave? Italian, Mexican, Asian, American, etc.?\n* **Ambiance:** Are you looking for a casual atmosphere, a romantic setting, or a lively vibe?\n* **Dietary Restrictions:** Are there any dietary requirements you have, such as vegan, gluten-free, or vegetarian options?\n\n**Explore Online Reviews:**\n\n* Check restaurant review websites like Yelp, Google My Business, and Tripadvisor.\n* Read reviews to get an idea of the food quality, service, and overall experience.\n\n**Utilize Social Media:**\n\n* Follow local food bloggers and influencers on Instagram or Facebook.\n* Check their posts for recommendations and reviews.\n\n**Ask Locals:**\n\n* If you\'re in a new area, ask locals for their favorite dining spots.\n* Hotel staff or concierge can also provide valuable suggestions.\n\n**Specific Restaurant Recommend

total_tokens: 11
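The raw history entries printed above are verbose. A small sketch of a more readable view, assuming each entry exposes `role` and a list of `parts` with a `text` attribute, as the printed output suggests. The helper `format_history` and the stand-in objects are hypothetical and exist only so the example runs without an API call:

```python
from types import SimpleNamespace


def format_history(history):
    """Return one 'role: text' line per message in a chat history."""
    lines = []
    for message in history:
        # Join all text parts; a message may carry more than one part.
        text = "".join(part.text for part in message.parts)
        lines.append(f"{message.role}: {text}")
    return lines


# Stand-in objects with the same attributes as the entries printed above.
demo_history = [
    SimpleNamespace(role="user", parts=[SimpleNamespace(text="Hi")]),
    SimpleNamespace(role="model", parts=[SimpleNamespace(text="Hello!")]),
]
for line in format_history(demo_history):
    print(line)
```

With a live session, `format_history(hist.history)` would be the corresponding call.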

## Task 4: Experiment With The Temperature Parameter

In [6]:
def get_response(prompt, generation_config=None):
    # Default to None rather than a mutable {} default; None lets the model use its own defaults
    response = model.generate_content(
        contents=prompt,
        generation_config=generation_config
    )
    return response

for temp in [0.0, 0.5, 1.0]:
    config = genai.types.GenerationConfig(temperature=temp)
    result = get_response("Why the sky is blue", generation_config=config)

    print(f"\n\nFor temperature value {temp}, the result is: \n")
    display(Markdown(result.text))



For temperature value 0.0, the result is: 



The sky appears blue due to a phenomenon called Rayleigh scattering. This scattering occurs when sunlight, which is composed of all colors of the visible spectrum, passes through the Earth's atmosphere.

As sunlight travels through the atmosphere, it encounters molecules of nitrogen and oxygen, which are much smaller than the wavelength of light. These molecules scatter the light in all directions, but they scatter shorter wavelengths (blue light) more effectively than longer wavelengths (red light). This is because the shorter wavelengths have a higher frequency and therefore interact more strongly with the molecules.

As a result of this scattering, more blue light is scattered in all directions, while more red light continues in a straight line. This scattered blue light reaches our eyes from all directions, giving the sky its characteristic blue color.

The amount of scattering depends on the wavelength of light and the density of the molecules in the atmosphere. At sunrise and sunset, when the sunlight has to travel through more of the atmosphere, more of the blue light is scattered away, leaving more of the longer wavelengths (red, orange, and yellow) to reach our eyes. This is why the sky appears red, orange, or yellow at these times.



For temperature value 0.5, the result is: 



The sky appears blue due to a phenomenon called Rayleigh scattering. This scattering occurs when sunlight interacts with molecules in the Earth's atmosphere.

1. **Sunlight:** Sunlight is a mixture of all colors, including red, orange, yellow, green, blue, indigo, and violet.

2. **Molecules in the Atmosphere:** The Earth's atmosphere contains molecules of nitrogen and oxygen. These molecules are much smaller than the wavelength of visible light.

3. **Scattering:** When sunlight enters the atmosphere, it interacts with these molecules. The molecules scatter the sunlight in all directions.

4. **Wavelengths:** The amount of scattering depends on the wavelength of light. Shorter wavelengths (blue and violet) are scattered more than longer wavelengths (red and orange).

5. **Blue Sky:** As sunlight travels through the atmosphere, the shorter wavelengths (blue and violet) are scattered more frequently than the longer wavelengths. This scattered blue light reaches our eyes, making the sky appear blue.

6. **Sunsets and Sunrises:** At sunset and sunrise, the sunlight has to travel through more of the atmosphere to reach our eyes. This means that more of the blue light is scattered away, and we see more of the longer wavelengths (red and orange).

So, the sky appears blue because the molecules in the atmosphere scatter the shorter wavelengths of sunlight (blue and violet) more than the longer wavelengths.



For temperature value 1.0, the result is: 



The sky appears blue due to a phenomenon called Rayleigh scattering. Here's a detailed explanation:

**1. Sunlight Composition:**
- Sunlight is composed of all the colors of the visible spectrum (ROYGBIV: Red, Orange, Yellow, Green, Blue, Indigo, Violet).

**2. Rayleigh Scattering:**
- When sunlight enters the Earth's atmosphere, it interacts with tiny molecules in the air, such as nitrogen and oxygen molecules.
- These molecules are significantly smaller than the wavelengths of visible light.
- Due to their small size, the molecules scatter the shorter wavelengths of light (blue and violet) more efficiently than the longer wavelengths (red and orange).

**3. Blue Dominance:**
- Blue light has a shorter wavelength (around 450 nanometers) than red light (700 nanometers).
- According to Rayleigh scattering, blue light is scattered in all directions to a greater extent than red light.
- As scattered blue light travels toward our eyes from all directions, the sky appears predominantly blue.

**4. Variation in Color:**
- The intensity of the scattered blue light depends on the wavelength and distance.
- As you move higher in the atmosphere, you encounter less air, resulting in less scattering. This causes the sky to appear darker blue.
- Near the horizon at sunrise and sunset, the sunlight has to travel through more of the atmosphere before reaching our eyes. This results in the scattering of even more blue light, making the sky appear darker blue or purple towards the horizon.

**Additional Factors:**
- The amount of moisture and dust particles in the air can also affect the scattering of light and the perceived color of the sky.
- Rayleigh scattering is not only responsible for the blue sky but also for colorful sunrises and sunsets.
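To build intuition for what the temperature parameter does, here is a sketch of the textbook formulation: the model's raw scores are divided by the temperature before being turned into probabilities. How Gemini applies temperature internally is not stated in this notebook, and the scores below are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in [0.0, 0.5, 1.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which is why low-temperature outputs tend to vary less between runs.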

## Task 5: Experiment With Maximum Output Tokens

In [9]:
for m_o_tok in [1, 100, 200]:
    config = genai.types.GenerationConfig(max_output_tokens=m_o_tok)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    print(f"\n\nFor max output token value {m_o_tok}, the results are: \n\n")
    display(Markdown(result.text))



For max output token value 1, the results are: 




**



For max output token value 100, the results are: 




**XGBoost and Random Forest**

**XGBoost (Extreme Gradient Boosting)**:
- Ensemble method that combines multiple decision trees to improve predictive performance.
- Uses a sequential learning algorithm that builds trees iteratively, adjusting weights based on previous tree performance.
- Advantages: High accuracy, ability to handle complex data, efficient training process.

**Random Forest**:
- Ensemble method that also combines multiple decision trees.
- Grows trees independently by randomly sampling data and features.



For max output token value 200, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is a highly efficient machine learning algorithm that combines the power of gradient boosting with tree ensembles. It works by building multiple decision trees incrementally, where each tree learns from the errors of previous trees.

**Key Concepts:**

* **Gradient Boosting:** XGBoost uses gradient boosting to construct a sequence of base learners, typically decision trees. Each tree is trained on the residuals (errors) of the previous trees.
* **Tree Ensembles:** XGBoost combines individual decision trees into an ensemble, where the final prediction is made by aggregating the predictions of all trees.
* **Regularization:** XGBoost employs various regularization techniques (e.g., L1 and L2 regularization) to prevent overfitting and improve generalization performance.

**Real-Life Use Case:**

XGBoost is widely used in various applications, including:

* **Credit Risk Assessment:** Predicting the probability of a loan default based
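The outputs above stop mid-sentence because the token limit was reached. A sketch of how such truncation might be detected programmatically, assuming the response exposes `candidates[0].finish_reason` and that a token-limit stop is named `MAX_TOKENS`; both are assumptions to verify against the installed SDK version. The helper `was_truncated` and the stand-in response are hypothetical:

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first candidate stopped because of the token limit."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    reason = candidates[0].finish_reason
    # Accept either an enum member (with .name) or a plain string.
    return getattr(reason, "name", str(reason)) == "MAX_TOKENS"


# Stand-in response object used only to demonstrate the helper.
fake = SimpleNamespace(
    candidates=[SimpleNamespace(finish_reason=SimpleNamespace(name="MAX_TOKENS"))]
)
print(was_truncated(fake))
```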

## Task 6: Experiment With the top_k Parameter

In [10]:
for k in [1, 16, 40]:
    config = genai.types.GenerationConfig(top_k=k)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    print(f"\n\nFor top k value {k}, the results are: \n\n")
    display(Markdown(result.text))



For top k value 1, the results are: 




## XGBoost

**Concept:**
XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm that combines multiple weak learners (decision trees) to create a strong learner. It employs gradient boosting, which sequentially trains weak learners by minimizing the loss function over the dataset.

**Real-Life Use Cases:**

* **Customer Churn Prediction:** Predicting the likelihood of customers leaving a service by analyzing their past behavior, demographics, and interactions.
* **Fraud Detection:** Identifying fraudulent transactions by learning patterns in historical data, such as abnormal spending or suspicious IP addresses.
* **Loan Default Prediction:** Assessing the risk of loan applications and determining which borrowers are likely to default based on their financial history and personal data.

## Random Forest

**Concept:**
Random Forest is an ensemble learning algorithm that combines multiple decision trees, each trained on a different subset of the data and a random subset of features. It operates by taking the majority vote of the predictions from each individual tree.

**Real-Life Use Cases:**

* **Image Classification:** Identifying objects or scenes in images by training a Random Forest on a dataset of labeled images.
* **Natural Language Processing:** Analyzing text data, such as sentiment analysis or spam filtering, by leveraging the ability of Random Forest to handle high-dimensional features.
* **Medical Diagnosis:** Assisting in medical decision-making, such as diagnosing diseases or predicting treatment outcomes, by training a Random Forest on patient data and associated diagnoses.

## Comparison

**Similarities:**

* Both XGBoost and Random Forest are ensemble learning algorithms that combine multiple weak learners to improve accuracy.
* They can handle large and complex datasets with many features.
* Both algorithms are powerful tools for classification and regression tasks.

**Differences:**

* **Model Complexity:** XGBoost is generally more complex than Random Forest, as it employs gradient boosting and regularization techniques to optimize its performance.
* **Tuning:** XGBoost typically requires more hyperparameter tuning to achieve optimal results, while Random Forest is often more robust to default parameters.
* **Interpretability:** Random Forest is generally considered more interpretable, as individual decision trees can be examined to understand the model's predictions, while XGBoost involves complex interactions between multiple trees.
* **Speed:** Random Forest is typically faster to train than XGBoost, especially for large datasets.



For top k value 16, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:** XGBoost is an advanced ensemble learning algorithm that uses gradient boosting to combine a large number of weak learners (decision trees). It iteratively builds a series of trees, with each tree focusing on correcting the errors of previous trees.

**Real-Life Use Case:** Fraud Detection - XGBoost can be used to identify fraudulent transactions by leveraging historical data on past fraudulent and non-fraudulent transactions. By building multiple trees and considering their collective predictions, XGBoost improves accuracy and reduces false positives.

**Random Forest**

**Concept:** Random Forest is an ensemble learning algorithm that combines multiple decision trees. It constructs each tree using a different random subset of data and features. The final prediction is determined by aggregating the predictions of all individual trees.

**Real-Life Use Case:** Disease Prediction - Random Forest can be used to predict the likelihood of developing a disease based on a patient's lifestyle, demographics, and health history. By considering multiple perspectives provided by different trees, Random Forest improves diagnostic sensitivity and specificity.

**Key Differences**

* **Boosting vs. Bagging:** XGBoost uses gradient boosting, where trees are built sequentially and focus on correcting errors of previous trees. Random Forest uses bagging, where trees are built independently on different subsets of data.
* **Regularization:** XGBoost incorporates regularization to prevent overfitting and improve model stability. Random Forest relies on tree pruning to control complexity.
* **Performance:** XGBoost typically outperforms Random Forest on large and complex datasets, especially when the goal is to minimize prediction error.

**Shared Characteristics**

* **Ensemble Methods:** Both XGBoost and Random Forest are ensemble methods that combine multiple models to enhance predictive accuracy.
* **Handling Complex Data:** Both algorithms can handle both continuous and categorical variables, making them suitable for a wide range of datasets.
* **Interpretability:** While ensemble methods can be less interpretable than simpler models, XGBoost and Random Forest offer techniques like feature importance calculation for understanding model behavior.



For top k value 40, the results are: 




**XGBoost**

XGBoost (Extreme Gradient Boosting) is a tree-based machine learning algorithm that combines multiple decision trees to create a more accurate model. It is known for its efficiency, scalability, and ability to handle both structured and unstructured data.

**Concepts:**

* **Trees:** XGBoost builds individual decision trees in an ensemble. Each tree predicts a value, and the final prediction is the sum of all the individual predictions.
* **Gradient Boosting:** This technique iteratively adds new trees to the ensemble, where each new tree aims to correct the errors of the previous trees.
* **Regularization:** XGBoost uses regularization techniques to prevent overfitting and improve model stability. This ensures that the model generalizes well to unseen data.

**Real-Life Use Cases:**

* **Customer churn prediction:** Identifying customers at risk of leaving a service.
* **Fraud detection:** Detecting fraudulent transactions in financial systems.
* **Medical diagnosis:** Assisting in disease diagnosis and personalized treatment plans.

**Random Forest**

Random Forest is also a tree-based machine learning algorithm that creates multiple decision trees during training. However, unlike XGBoost, it does not use gradient boosting.

**Concepts:**

* **Trees:** Random Forest creates a large number of decision trees, each trained on a different subset of the training data.
* **Random Selection:** During tree construction, a random subset of features and data points is selected at each node. This helps reduce overfitting.
* **Ensemble Voting:** The final prediction is made by combining the predictions of all the individual trees, typically via majority vote or averaging.

**Real-Life Use Cases:**

* **Image classification:** Recognizing objects and scenes in images.
* **Object detection:** Locating and identifying specific objects within images.
* **Natural language processing:** Understanding text data, sentiment analysis, and machine translation.

**Comparison**

* **Accuracy:** XGBoost generally has higher accuracy than Random Forest, especially for structured data.
* **Efficiency:** XGBoost is more efficient in terms of training time and memory usage.
* **Interpretability:** Random Forest is generally easier to interpret, as it does not rely on gradient boosting.
* **Hyperparameter Tuning:** XGBoost has more hyperparameters to tune, but it also provides more advanced options for customization.

The choice between XGBoost and Random Forest depends on the specific application and the available resources. Both algorithms are powerful techniques that can solve a wide range of machine learning problems.
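For intuition about `top_k`, the usual description is that sampling is restricted to the k most likely next tokens. The sketch below illustrates that idea on a made-up distribution; it is not Gemini's implementation, and `top_k_filter` is a hypothetical helper.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Made-up next-token probabilities.
toy_probs = {"tree": 0.5, "forest": 0.3, "boost": 0.15, "banana": 0.05}
print(top_k_filter(toy_probs, 1))
print(top_k_filter(toy_probs, 3))
```

With `k=1` only the single most likely token survives, so decoding becomes greedy; larger values leave more room for variation.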

## Task 7: Experiment With the top_p Parameter

In [None]:
for p in [0, 0.2, 0.4, 0.8, 1]:
    config = genai.types.GenerationConfig(top_p=p)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    print(f"\n\nFor top p value {p}, the results are: \n\n")
    display(Markdown(result.text))
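No output was recorded for this cell. As a rough illustration of what `top_p` (nucleus sampling) is generally understood to do, the sketch below keeps the smallest set of most likely tokens whose cumulative probability reaches `p`. The distribution is made up and `top_p_filter` is a hypothetical helper, not part of the Gemini SDK.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token probabilities.
toy_probs = {"tree": 0.5, "forest": 0.3, "boost": 0.15, "banana": 0.05}
for p in [0, 0.7, 1]:
    print(p, list(top_p_filter(toy_probs, p)))
```

Unlike `top_k`, the number of surviving tokens depends on how peaked the distribution is: a confident model keeps few tokens, an uncertain one keeps many.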

## Task 8: Experiment With the candidate_count Parameter

In [None]:
# Only candidate_count=1 is accepted for gemini-pro in this setup, so a single
# response is returned. Why the limit exists is not stated here; it may stem
# from how the model or the API is designed.
config = genai.types.GenerationConfig(candidate_count=1)
result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)
Markdown(result.text)

## Task 9: Introduction to Retrieval Augmented Generation

- Load the PDF document (the source from which data will be retrieved) using PyPDFLoader.

- Split the extracted text with RecursiveCharacterTextSplitter to obtain meaningful chunks for further processing.

- Use GoogleGenerativeAIEmbeddings to generate embeddings for the chunks; these are used for similarity search in RAG.

- Use Chroma (a vector database) to store the embeddings so that relevant chunks can be retrieved later.

- Use RetrievalQA to build a question-answering chain that combines retrieval (finding relevant context) with generation (producing an answer from that context). The chain uses ChatGoogleGenerativeAI with the Gemini Pro model to generate responses from the retrieved context.
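The steps above can be sketched end to end without any external service. In this toy version a bag-of-words count vector stands in for the embedding model, a plain list stands in for the vector database, and the final prompt is only assembled rather than sent to a model. All function names and sample chunks are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the generator model."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up chunks standing in for text extracted from a PDF.
chunks = [
    "data engineers build pipelines",
    "analysts create dashboards",
    "pipelines move data between systems",
]
question = "who build pipelines"
print(build_prompt(question, retrieve(question, chunks, k=2)))
```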

## Task 10: Load the PDF and Extract the Texts

In [12]:
CHUNK_SIZE = 700
CHUNK_OVERLAP = 100
pdf_path = "https://www.analytixlabs.co.in/assets/pdfs/Data_Engineering%20&_Other_Job_Roles-AnalytixLabs.pdf"

In [13]:
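pdf_loader = PyPDFLoader(pdf_path)
# load() returns one Document per page; chunking happens in the next cell,
# so the pages are not pre-split here (which would duplicate overlapping text)
split_pdf_document = pdf_loader.load()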

In [15]:
# Splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)
context = "\n\n".join(str(p.page_content) for p in split_pdf_document)
texts = text_splitter.split_text(context)
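To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that slides a fixed-size character window over the text. RecursiveCharacterTextSplitter behaves differently in that it prefers to break at separators such as paragraph and line breaks; `sliding_chunks` is a hypothetical helper that ignores separators entirely.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(sliding_chunks(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.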

## Task 11: Create the Gemini Model and Create the Embeddings

In [19]:
gemini_model = ChatGoogleGenerativeAI(
    model='gemini-pro',
    google_api_key=api_key,
    temperature=0.8
)

In [22]:
embeddings = GoogleGenerativeAIEmbeddings(
    model='models/embedding-001',
    google_api_key=api_key
)

In [23]:
vector_index = Chroma.from_texts(texts, embeddings)
retriever = vector_index.as_retriever(search_kwargs={"k": 5})

## Task 12: Create the RAG Chain and Ask Query

In [24]:
qa_chain = RetrievalQA.from_chain_type(
    gemini_model,
    retriever=retriever,
    return_source_documents=True
)

In [25]:
# Example usage
question = "Which tools do Data Engineers primarily work with?"
result = qa_chain.invoke({"query": question})
print("Answer:", result["result"])

Answer: * **Data integration tools** are used to extract, transform, and load data from various sources into a central repository. These tools can be used to automate the process of data integration, which can save time and improve data quality.
* **Data modeling tools** are used to create logical and physical models of data. These models can be used to document the structure of data, define relationships between data elements, and identify data inconsistencies.
* **Data quality tools** are used to assess the quality of data. These tools can be used to identify errors, inconsistencies, and missing values in data.
* **Data governance tools** are used to manage the use of data. These tools can be used to define data policies, track data usage, and enforce data security.
* **Cloud computing platforms** are used to host and manage data infrastructure. These platforms can provide data engineers with access to a variety of computing resources, including compute, storage, and networking.
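Because the chain was created with `return_source_documents=True`, the result also carries the retrieved chunks under the `source_documents` key, which is useful for checking what the answer was grounded in. A small sketch of how they might be previewed; `summarize_sources` is a hypothetical helper and the stand-in result below contains made-up text so the example runs without calling the chain:

```python
from types import SimpleNamespace


def summarize_sources(result, preview_chars=80):
    """Return a short preview of each retrieved chunk in a chain result."""
    previews = []
    for doc in result.get("source_documents", []):
        text = " ".join(doc.page_content.split())  # collapse whitespace
        previews.append(text[:preview_chars])
    return previews


# Stand-in result with the keys the chain returns; the text is made up.
fake_result = {
    "result": "demo answer",
    "source_documents": [
        SimpleNamespace(page_content="Data engineers\nbuild   pipelines."),
        SimpleNamespace(page_content="Cloud platforms host data."),
    ],
}
for preview in summarize_sources(fake_result):
    print("-", preview)
```

With the real chain, `summarize_sources(result)` would be the corresponding call.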
