## Embedding Text using Langchain


### Prompt Template
Prompt templates are typically used to build dynamic prompts.

In [None]:
!pip install -r /content/requirements.txt -q

In [None]:
!pip show langchain

Name: langchain
Version: 0.0.350
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, dataclasses-json, jsonpatch, langchain-community, langchain-core, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


## Python-dotenv

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv() , override=True)

True
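
Because later cells read API keys from the environment, a quick check after loading the `.env` file can catch a missing key early. The following is only a sketch: the helper name `missing_keys` is hypothetical, `PINECONE_API_KEY` and `PINECONE_env` are the names this notebook reads further down, and `OPENAI_API_KEY` is assumed to be the variable the OpenAI wrappers pick up by default.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Key names are assumptions based on what this notebook reads later on
required = ['OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_env']
print(missing_keys(required))  # an empty list means every key is available
```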

### LLM Wrappers : DaVinci Model

In [None]:
!pip install openai==0.28 -q

In [None]:
!pip install langchain

In [None]:
#!pip install OpenAI -q

In [None]:
from langchain.llms import OpenAI
llm = OpenAI(model_name = 'text-davinci-003',temperature = 0.3, max_tokens = 512)
print(llm)  # shows the wrapper class and its parameters

OpenAI
Params: {'model_name': 'text-davinci-003', 'temperature': 0.3, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'logit_bias': {}, 'max_tokens': 512}


In [None]:
output = llm('explain the banking system in simple words')
# generate() takes a list of prompts and returns one result object for all of them
output2 = llm.generate(['... is the capital of spain' , 'What is the distance to the moon?'])
print(output)
# alternatively, inspect output2.generations to see the answer generated for each prompt



The banking system is a system of financial institutions and services that provide a variety of financial services to individuals, businesses, and governments. Banks provide services such as accepting deposits, making loans, and providing payment services such as debit cards, credit cards, and online banking. Banks also provide investment services such as mutual funds, stocks, and bonds. Banks also provide services such as trust services, foreign exchange, and insurance.


In [None]:
#estimating number of tokens
print(llm.get_num_tokens('explain the banking system in simple words'))

8


## Chat Models: GPT-3.5, GPT-4

In [None]:
from langchain.schema import(AIMessage , HumanMessage, SystemMessage)
from langchain.chat_models import ChatOpenAI

In [None]:
chat = ChatOpenAI(model_name = 'gpt-3.5-turbo' , temperature = 0.6 , max_tokens = 512)
messages = [SystemMessage(content = 'You are the best salesman at our car dealership'),
            HumanMessage(content = 'Sell me a kia serato, model 2023 by listing its specifications ')]


output = chat(messages)
print(output.content)

Absolutely! The Kia Serato, model 2023, is an exceptional choice that offers a perfect blend of style, performance, and cutting-edge technology. Here are some of its remarkable specifications:

1. Striking Design: The Kia Serato boasts a sleek and modern exterior design that turns heads wherever you go. With its bold front grille, LED headlights, and aerodynamic curves, it exudes a sense of sophistication.

2. Spacious Interior: Step inside the Serato, and you'll be greeted by a roomy cabin that comfortably accommodates five passengers. The high-quality materials and refined finishes create a premium feel, while ample legroom ensures a comfortable ride for everyone.


4. Impressive Performance: The Serato offers a choice of powertrains to suit your preferences. The standard engine is a responsive 2.0-liter four-cylinder that delivers a balance of power and fuel efficiency. If you're seeking more exhilaration, there's an available turbocharged 1.6-liter engine that provides an extra pun
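
The message list above pairs each piece of text with a role. As a rough illustration in plain Python (no LangChain objects involved), the sketch below maps `(kind, content)` tuples onto the role/content dictionaries used by chat-style APIs. The helper `to_role_dicts` and the tuple representation are hypothetical stand-ins for `SystemMessage`, `HumanMessage`, and `AIMessage`.

```python
# Hypothetical mapping from message kind to chat role
ROLE_FOR_KIND = {'system': 'system', 'human': 'user', 'ai': 'assistant'}

def to_role_dicts(messages):
    """Convert (kind, content) tuples into role/content dictionaries."""
    return [{'role': ROLE_FOR_KIND[kind], 'content': content}
            for kind, content in messages]

messages = [('system', 'You are the best salesman at our car dealership'),
            ('human', 'Sell me a kia serato, model 2023 by listing its specifications')]

for item in to_role_dicts(messages):
    print(item['role'], ':', item['content'])
```

The system message sets the behaviour of the assistant, while the human message carries the actual request.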

## Prompt Templates

In [None]:
# Prompt templates are typically used for dynamic prompts

from langchain.prompts import PromptTemplate

In [None]:
template = '''You are an experienced virologist,
Write a few sentences about the following {virus} in {language}'''

prompt = PromptTemplate(input_variables = ['virus' , 'language'], template = template)

print(prompt)

input_variables=['language', 'virus'] template='You are an experienced virologist,\nWrite a few sentences about the following {virus} in {language}'
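
To see what filling the template amounts to, the same string can be formatted with plain `str.format`. This is a sketch of the idea only, not of how `PromptTemplate` is implemented internally; it assumes the default placeholder style shown above.

```python
template = '''You are an experienced virologist,
Write a few sentences about the following {virus} in {language}'''

# Each {placeholder} is replaced by the keyword argument of the same name
filled = template.format(virus='Covid-19', language='english')
print(filled)

# Leaving out a variable fails loudly instead of sending a half-filled prompt
try:
    template.format(virus='Covid-19')
except KeyError as missing:
    print(f'missing variable: {missing}')
```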


In [None]:
llm=OpenAI(model_name = 'text-davinci-003' , temperature = 0.3)
output = llm(prompt.format(virus = 'Covid-19' , language = 'english'))
print(output)



Covid-19 is a novel coronavirus that has caused a global pandemic. It is highly contagious and is spread through contact with an infected person or contact with contaminated surfaces. Symptoms of Covid-19 can range from mild to severe and can include fever, cough, shortness of breath, and fatigue. It is important to take all necessary precautions to prevent the spread of Covid-19, such as wearing a mask, washing hands frequently, and avoiding large gatherings.


## Simple and Sequential Chains

In [None]:
from langchain.chains import LLMChain , SimpleSequentialChain
llm = ChatOpenAI(model_name = 'gpt-3.5-turbo' , temperature = 0.5)

In [None]:
chain = LLMChain(llm=llm , prompt = prompt)
output = chain.run({'virus':'HIV' , 'language':'French'})

In [None]:
print(output)

Je suis un virologue expérimenté et je vais vous parler du VIH en français. Le VIH, ou virus de l'immunodéficience humaine, est un virus qui attaque le système immunitaire de l'organisme. Il est principalement transmis par le contact direct avec certains fluides corporels infectés, tels que le sang, le sperme, les sécrétions vaginales et le lait maternel. Le VIH peut entraîner le développement du SIDA, le syndrome d'immunodéficience acquise, qui affaiblit progressivement le système immunitaire et rend l'organisme vulnérable à diverses infections et maladies. Malheureusement, il n'existe pas encore de remède définitif contre le VIH, mais grâce aux avancées médicales, les traitements antirétroviraux permettent de contrôler la progression de la maladie et d'améliorer la qualité de vie des personnes vivant avec le VIH.


In [None]:
# gpt-3.5-turbo is a chat model, so use the ChatOpenAI wrapper imported earlier
llm2 = ChatOpenAI(model_name = 'gpt-3.5-turbo' , temperature = 0.5 , max_tokens = 512)
prompt2 = PromptTemplate(input_variables= [ 'concept'] ,
                         template = '''You are the best python professor.
                         Write a function that implements the concept of {concept}''')

# first chain: turn a concept into a Python function
chain2 = LLMChain(llm = llm2 , prompt = prompt2)

llm3 = ChatOpenAI(model_name = 'gpt-3.5-turbo' , temperature = 0.4)
prompt3 = PromptTemplate(input_variables=['function'], template = 'Given the Python function {function}, describe it as detailed as possible')

# second chain: describe the function produced by the first chain
chain3 = LLMChain(llm = llm3 , prompt = prompt3)
# the output of chain2 becomes the single input of chain3
overall_chain = SimpleSequentialChain(chains=[chain2 , chain3] , verbose = True)
output = overall_chain.run('linear regression')

> Entering new SimpleSequentialChain chain...
Thank you for your kind words! I'd be happy to help you with implementing linear regression in Python. Here's a function that implements the concept of linear regression using the ordinary least squares method:

```python
import numpy as np

def linear_regression(X, y):
    # Add a column of ones to the input matrix X
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    
    # Calculate the coefficients using the ordinary least squares formula
    coefficients = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    
    return coefficients
```

In this function, `X` represents the input matrix of independent variables, and `y` represents the target variable. We add a column of ones to the input matrix to account for the intercept term. Then, we calculate the coefficients using the formula `(X^T * X)^-1 * X^T * y`, where `X^T` denotes the transpose of `X`.

To use this function, you can pass your input matrix `X` and target va
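
The chaining itself is simple: each step receives the previous step's output as its only input. The sketch below shows that pattern with plain functions. `run_sequential`, `write_function`, and `describe_function` are hypothetical stand-ins, and no model is called.

```python
def run_sequential(steps, first_input):
    """Feed the output of each step into the next one and return the last output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

# Stand-ins for the two chains; real chains would call the model here
def write_function(concept):
    return f'def {concept.replace(" ", "_")}(X, y): ...'

def describe_function(function_source):
    return f'Description of: {function_source}'

result = run_sequential([write_function, describe_function], 'linear regression')
print(result)
```

Because every step takes and returns a single string, this pattern only fits chains with one input and one output each.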

## Langchain Agents

In [None]:
!pip install langchain_experimental -q

In [None]:
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain_experimental.tools.python.tool import PythonREPLTool
from langchain.llms import OpenAI

In [None]:
llm = OpenAI(temperature = 0.1)
agent_executor = create_python_agent(llm = llm,
                                     tool = PythonREPLTool(),
                                     verbose=True)

agent_executor.run('Calculate the square root of 15 factorial and display with 3 decimal points')

> Entering new AgentExecutor chain...

 I need to calculate the square root of 15 factorial
Action: Python_REPL
Action Input: import math
              print(round(math.sqrt(math.factorial(15)), 3))
Observation: IndentationError('unexpected indent', ('<string>', 2, 14, '              print(round(math.sqrt(math.factorial(15)), 3))\n', 2, -1))
Thought: I need to remove the extra indentation
Action: Python_REPL
Action Input: import math
print(round(math.sqrt(math.factorial(15)), 3))
Observation: 1143535.906

Thought: I now know the final answer
Final Answer: 1143535.906

> Finished chain.


'1143535.906'
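
Since the agent writes and runs its own code, it is worth checking the answer independently. A minimal check using only the standard library:

```python
import math

# Same computation the agent settled on, formatted to 3 decimal places
value = math.sqrt(math.factorial(15))
print(f'{value:.3f}')  # should agree with the agent's final answer
```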

## Vector Databases

In [None]:
# Pinecone is a vector database: it stores embeddings and retrieves them efficiently by similarity

import pinecone
pinecone.init(api_key = os.environ.get('PINECONE_API_KEY'),
              environment=os.environ.get('PINECONE_env'))


In [None]:
pinecone.info.version()

VersionResponse(server='2.0.11', client='2.2.4')

In [None]:
pinecone.list_indexes()

[]

In [None]:
index_name = 'langchain-pinecone'
#limited to 1 index and 1 pod for the free plan
if index_name not in pinecone.list_indexes():
  print(f'Creating index {index_name}...')
  pinecone.create_index(index_name, dimension = 1536, metric = 'cosine', pods=1, pod_type='p1.x2')
  print('Done')
else:
  print(f'Index {index_name} already exists!')

Creating index langchain-pinecone...
Done


In [None]:
pinecone.describe_index(index_name)

IndexDescription(name='langchain-pinecone', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [None]:
import random

In [None]:
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]

In [None]:
ids = list('abcde')
index_name = 'langchain-pinecone'
index = pinecone.Index(index_name)
index.upsert(vectors = zip(ids,vectors))

{'upserted_count': 5}

In [None]:
#to update a vector
index.upsert(vectors=[('c', [0.3]*1536)])

{'upserted_count': 1}

In [None]:
#fetch a vector
index = pinecone.Index('langchain-pinecone')
index.fetch(ids=['c','d'])

{'namespace': '',
 'vectors': {'c': {'id': 'c',
                   'values': [0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
       

In [None]:
index.delete(ids = ['b'])

{}

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 4e-05,
 'namespaces': {'': {'vector_count': 4}},
 'total_vector_count': 4}
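
The counts above follow from upsert semantics: a new id is inserted, an existing id is overwritten. The sketch below replays the same steps against a plain dictionary. It is an in-memory stand-in for illustration, not the Pinecone client, and the `upsert` function here is hypothetical.

```python
import random

store = {}  # id -> vector; a plain dict standing in for the index

def upsert(items):
    """Insert new ids and overwrite existing ones; return the number written."""
    count = 0
    for vec_id, values in items:
        store[vec_id] = values
        count += 1
    return count

dim = 1536
upsert(zip('abcde', [[random.random() for _ in range(dim)] for _ in range(5)]))
upsert([('c', [0.3] * dim)])  # same id, so 'c' is overwritten, not duplicated
store.pop('b', None)          # delete by id
print(len(store))             # four vectors remain: a, c, d, e
```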

In [None]:
#to delete all vectors
#index.delete(delete_all = True)

In [None]:
index.delete(delete_all = True)

{}

In [None]:
# the index was emptied in the previous cell, so this query returns no matches
emb = [[random.random() for _ in range(1536)] for _ in range(2)]
index.query(vectors = emb , top_k=3)

{'results': []}
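
With the `cosine` metric chosen when the index was created, a query ranks stored vectors by cosine similarity to the query vector and returns the `top_k` best. The sketch below does this by brute force in plain Python. It illustrates the ranking only; a managed vector database would normally rely on its own indexing rather than a linear scan, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=3):
    """Return up to k (id, score) pairs, highest similarity first."""
    scored = [(vec_id, cosine_similarity(query, values))
              for vec_id, values in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
print(top_k([1.0, 0.1], toy, k=2))  # 'a' ranks first, then 'c'
print(top_k([1.0, 0.1], {}, k=3))   # an empty store yields no matches
```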

## Splitting and Embedding Text using Langchain

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 200,
                                               chunk_overlap = 40,
                                               length_function = len) #chunking can be a hyperparameter


In [None]:
chunks = text_splitter.create_documents(['the document here'])
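
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraphs, lines, and spaces; the function `split_with_overlap` is a hypothetical illustration of the two parameters only.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=40):
    """Cut text into windows of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = ''.join(str(i % 10) for i in range(500))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
print(pieces[0][-40:] == pieces[1][:40])  # neighbouring chunks share 40 characters
```

The overlap keeps some context on both sides of a cut, at the cost of embedding the shared characters twice.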

### Embedding cost

In [None]:
import tiktoken
def print_embedding_cost(texts):
  # count tokens with the tokenizer that matches the embedding model
  enc = tiktoken.encoding_for_model('text-embedding-ada-002')
  total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
  print(f'Total Tokens : {total_tokens}')
  # 0.0004 USD per 1,000 tokens is the rate assumed here; check current pricing
  print(f'Embedding Cost in USD: {total_tokens/1000*0.0004:.6f}')
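
The cost estimate is plain arithmetic: tokens divided by 1,000, times the price per 1,000 tokens. A small worked example follows, with the rate kept as a parameter because pricing changes over time. The default of 0.0004 USD is taken from the function above and should be treated as an assumption; `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimated embedding cost for a given token count."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 250,000 tokens at the assumed rate: 250 * 0.0004 USD
print(f'{embedding_cost_usd(250_000):.6f}')
```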


In [None]:
from langchain.embeddings import OpenAIEmbeddings
embedding=OpenAIEmbeddings()

In [None]:
vector = embedding.embed_query('abc')
vector

[0.0026003900945243616,
 -0.011285445990905637,
 -0.00940453863619554,
 -0.03911722245794239,
 -0.03425231768417376,
 0.01206326514391234,
 -0.021227386595400844,
 -0.022853734632851876,
 0.018653512882996225,
 -0.0005285633100776219,
 0.0034188677622028355,
 0.0191484884536398,
 -0.002685243122667173,
 -0.004553776460640192,
 -0.018681797148100282,
 0.003090062481876243,
 0.02708224064781159,
 0.010521768970450963,
 0.010663190295971264,
 0.007410493289746676,
 -0.014382579539057895,
 0.01774841453702125,
 -0.006979157315587234,
 -0.01538667281289708,
 -0.02015258265880151,
 -0.0034754365252415874,
 0.009850016277245749,
 -0.020124298393697452,
 0.02624785389591929,
 -0.007353924759538556,
 0.007778189667421983,
 0.01358354765288441,
 -0.007452919687402766,
 -0.009941940604495207,
 -0.010352063379826604,
 -0.014255300346089626,
 -0.01364011618309253,
 -0.015782653455676445,
 0.010203570522369025,
 -0.0002733415023606986,
 0.024324520143553102,
 0.004642165021921011,
 0.013951243099237

In [None]:
# Insert chunks of text into a Pinecone index
import os
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment = os.environ.get('PINECONE_env'))

In [None]:
# delete all existing indexes (the free plan is limited to one index)
for name in pinecone.list_indexes():
  pinecone.delete_index(name)

In [None]:
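# Pinecone index names must be lowercase alphanumeric characters or hyphens
index_name = 'testing'
if index_name not in pinecone.list_indexes():
  print(f'Creating index {index_name} ....')
  # dimension 1536 matches the length of the text-embedding-ada-002 vectors
  pinecone.create_index(index_name,dimension = 1536, metric='cosine')
  print('Done!')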

In [None]:
vector_store = Pinecone.from_documents(chunks, embedding, index_name = index_name) #to upload embedded chunks of text to our index