<h1>Load data from a text file</h1>

In [1]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader('sample.txt')
text = loader.load()
print(text)

[Document(metadata={'source': 'sample.txt'}, page_content='The drumbeat of pressure on Joe Biden to drop out of the US presidential race intensified Wednesday with a bombshell report in the New York Times that he had conceded the possibility to a key ally, as well as movement within his own party to demand his withdrawal.\nThe White House and Biden\'s campaign quickly denied the Times report suggesting the president had vocalized to a supporter that he could ill-afford another misstep that would irrevocably damage his campaign. Biden himself insisted to campaign staff he intended to remain in the race.\n\n"I\'m in this race to the end and we\'re going to win because when Democrats unite, we will always win," Biden said in a call alongside Vice President Kamala Harris.\nYet time is running out for the beleaguered president to convince anxious Democratic officials, donors and voters that he remains viable in his effort to keep former President Donald Trump from returning to office. In an
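The output above shows the shape of what the loader returns: a list holding one `Document`, with the file text in `page_content` and the file path under `metadata['source']`. The sketch below reproduces that shape with the standard library only. `SimpleDocument` and `load_text` are hypothetical names made up for this illustration, not LangChain classes, and the real loader may handle encodings and errors differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read the whole file into one record and tag it with its source path."""
    content = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=content, metadata={"source": path})]


# Use a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("first line\nsecond line", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs), docs[0].metadata["source"])
```

A text file becomes a single document regardless of its length, which is why a splitting step is needed before indexing.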

<h1>Load the environment variables</h1>

In [4]:
import os
from dotenv import load_dotenv
load_dotenv()

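# load_dotenv() has already copied the values from .env into os.environ,
# so only check that the key is present instead of reassigning it.
if not os.getenv('OPENAI_API_KEY'):
    raise RuntimeError('OPENAI_API_KEY is not set; add it to your .env file')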


<h1>Load data from an HTML page</h1>

In [5]:
from langchain_community.document_loaders import WebBaseLoader
from bs4 import SoupStrainer
loader = WebBaseLoader(
    web_path='https://www.analyticsvidhya.com/blog/2013/10/read-books-web-analytics/',
    bs_kwargs={'parse_only': SoupStrainer('p')},  # keep only the <p> elements
)
web_documents = loader.load()
web_documents

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://www.analyticsvidhya.com/blog/2013/10/read-books-web-analytics/'}, page_content="By reading something every day before sleeping, I not only continue my learning, but also end my day on a fulfilling note. This is third and final post in series of must read books (click here for\xa0Business Analysts\xa0and\xa0visualization).The demand for web analytics professionals is set to increase. Here are some of the reasons why I expect so:All these reasons are enough to ensure that application of Web Analytics grows by leaps and bounds. With increased application, the world would need more professionals in this industry.So, if you plan a career in Analytics and are not up to pace with Web Analytics, start learning it today. It is a dynamic field where things change frequently. Keeping that in mind, this list contains both books and blogs, as I think they need to be used simultaneously for effective learning.There is a plethora of reading material on this subj
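Passing `SoupStrainer('p')` restricts parsing to `<p>` elements, which is why the output above runs paragraphs together ("...visualization).The demand...") and drops the bulleted reasons that followed "why I expect so:". The sketch below shows the same filtering idea with the standard library's `html.parser`. `ParagraphExtractor` and the sample HTML are made up for illustration, and this is not how BeautifulSoup implements it.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements and ignore everything else."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0                  # number of <p> tags currently open
        self._buffer: list[str] = []     # text pieces of the current paragraph
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Text outside any <p> (headings, list items, whitespace) is skipped.
        if self._depth > 0:
            self._buffer.append(data)


html = """
<html><body>
  <h1>Title that is skipped</h1>
  <p>First paragraph.</p>
  <ul><li>List item that is skipped</li></ul>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)
parser.close()

# Join without a separator, as in the loader output above.
page_content = "".join(parser.paragraphs)
print(page_content)  # First paragraph.Second paragraph.
```

If the list items matter for retrieval, a broader filter than `'p'` would be needed, at the cost of pulling in more navigation and boilerplate text.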

<h1>Load data from a PDF file</h1>

In [8]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('NLP.pdf')
pdf_document = loader.load()
pdf_document

[Document(metadata={'source': 'NLP.pdf', 'page': 0}, page_content='Steve Nouri                          https://www.linkedin.com/in/stevenouri/  Top 100 NLP Questions \n     Steve Nouri  \n \n \nQ1. Which of the following techniques can be used for keyword normalization in \nNLP, the process of converting a keyword into its base form? \na. Lemmatization \nb. Soundex \nc. Cosine Similarity \nd. N-grams \n \nAnswer : a) Lemmatization helps to get to the base form of a word, e.g. are playing -> play, ea ting \n-> eat, etc.Other options are meant for different purposes. \n \nQ2. Which of the following techniques can be used to compute the distance \nbetween two word vectors in NLP? \na. Lemmatization \nb. Euclidean distance \nc. Cosine Similarity \nd. N-grams \n \nAnswer : b) and c) \nDistance between two word vectors can be computed using Cosine similarity and Euclidean \nDistance.  Cosine Similarity establishes a cosine angle between the vector of two words . A cosi ne \nangle close to e

<h1>Convert the document into chunks</h1>

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Chunks of up to 1000 characters, each sharing up to 200 characters with its neighbor.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
pdf_chunks = text_splitter.split_documents(pdf_document)
pdf_chunks[:4]

[Document(metadata={'source': 'NLP.pdf', 'page': 0}, page_content='Steve Nouri                          https://www.linkedin.com/in/stevenouri/  Top 100 NLP Questions \n     Steve Nouri  \n \n \nQ1. Which of the following techniques can be used for keyword normalization in \nNLP, the process of converting a keyword into its base form? \na. Lemmatization \nb. Soundex \nc. Cosine Similarity \nd. N-grams \n \nAnswer : a) Lemmatization helps to get to the base form of a word, e.g. are playing -> play, ea ting \n-> eat, etc.Other options are meant for different purposes. \n \nQ2. Which of the following techniques can be used to compute the distance \nbetween two word vectors in NLP? \na. Lemmatization \nb. Euclidean distance \nc. Cosine Similarity \nd. N-grams \n \nAnswer : b) and c) \nDistance between two word vectors can be computed using Cosine similarity and Euclidean \nDistance.  Cosine Similarity establishes a cosine angle between the vector of two words . A cosi ne \nangle close to e
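To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks are usually not exactly `chunk_size` long. The function name `chunk_with_overlap` is made up for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.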

<h1>Convert the chunks into embeddings and store them in a vector DB</h1>

In [14]:
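# Newer LangChain releases also provide OpenAIEmbeddings in the separate langchain_openai package.
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Embed every chunk with OpenAI and index the resulting vectors in Chroma.
db = Chroma.from_documents(pdf_chunks, OpenAIEmbeddings())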

<h1>Query the vector DB</h1>

In [20]:
query = '''In NLP, which algorithm decreases the weight for commonly used words and 
increases the weight for words that are not used very much in a collection of 
documents ?'''
result = db.similarity_search(query)
result[0].page_content

'increases the weight for words that are not used very much in a collection of \ndocuments \na. Term Frequency (TF) \nb. Inverse Document Frequency (IDF) \nc. Word2Vec \nd. Latent Dirichlet Allocation (LDA) \n \nAnswer : b) \n \n \n \n \n \nQ11. In NLP, The process of removing words like “and”, “is”, “a”, “an”, “the” from \na sentence is called as \na. Stemming \nb. Lemmatization \nc. Stop word \nd. All of the above \n \nAnswer : c) In Lemmatization, all the stop words such as a, an, the, etc.. are removed . One can \nalso define custom stop words for removal.'
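Conceptually, `similarity_search` embeds the query, compares that vector with the stored chunk vectors, and returns the closest chunks. The toy version below shows that ranking step using only the standard library. It rests on two assumptions: a bag-of-words count vector stands in for the dense vectors that `OpenAIEmbeddings` produces, and cosine similarity stands in for whichever distance the vector store is configured to use. All function names are made up for this sketch.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


toy_chunks = [
    "inverse document frequency lowers the weight of common words",
    "lemmatization reduces a word to its base form",
    "cosine similarity measures the angle between two vectors",
]
top = toy_similarity_search(
    "which method lowers the weight of common words", toy_chunks, k=1
)
print(top[0])  # inverse document frequency lowers the weight of common words
```

The real result above starts mid-question ("increases the weight for words...") because retrieval returns whole chunks, and this chunk boundary happened to fall inside the matching question.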

In [22]:
# Build an equivalent index with FAISS instead of Chroma.
from langchain_community.vectorstores import FAISS
db1 = FAISS.from_documents(pdf_chunks, OpenAIEmbeddings())