# Level 3 RAG App to answer questions from several private PDF documents

## Goals
* Build a RAG app able to answer questions from several PDF documents.
* This app will be similar to the SEC Insights pro app we saw from LlamaIndex, but in this case we will use LangChain.
* We will write our code in the Visual Studio Code editor.
* As always, we will upload the code to GitHub, where you can download it.
* We will practice using several alternative tools like:
    * Poetry.
    * LangServe as backend server.
    * Tailwind CSS.

## IMPORTANT: Installation with the exact packages we used
* When you download a full stack app you need to make sure that both backend and frontend use the original packages in order to avoid potential errors caused by installing more modern versions of these packages.
* Since we used poetry to install the original backend packages, you will now use "poetry install" to install them.
* At this time, our project still does not have frontend, so we will not install the frontend yet.
#### Download the code
* Download the code from the GitHub repository.
#### Backend installation
* Since we used both pyenv and poetry to build this project, you will have to use the following approach to install the backend.
* In the terminal, make sure you are in the root directory of the project (v1-162-part1). Pay attention: the root directory of the project and the backend directory have an identical name. Do not confuse them; make sure you are in the root directory of the project now.
* **Create a virtual environment and use pip install to make sure you install the exact same packages we used**:
    * pyenv virtualenv 3.11.4 your-virtual-environment-name
    * pyenv activate your-virtual-environment-name
    * pip install -r requirements.txt
* **Go to the backend directory, create a virtual environment and use poetry install to make sure you install the exact same packages we used**:
    * cd v1-162-part1
    * poetry install
#### Ready to go!
* You can now see the code of the app in Visual Studio Code.
* Relax and review the following steps. Since you have already installed the modules, you will not have to install them again.
* You will be able to run this backend using LangServe Playground as instructed at the end of this notebook.
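
Before moving on, it can help to confirm that the installation really made the backend packages importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the list `BACKEND_MODULES` are illustrative assumptions, not part of the project code; adjust the list to the packages your copy of the project actually uses.

```python
from importlib.util import find_spec

# Import names are assumptions: a distribution's import name can differ from
# its package name (for example, python-dotenv is imported as "dotenv").
BACKEND_MODULES = ["langchain_community", "langchain_openai", "dotenv", "tqdm"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(BACKEND_MODULES)
    print("Missing modules:", missing or "none")
```

Run it with the interpreter of the environment you just installed into. An empty result suggests the installation succeeded; any listed name points to a package that still needs to be installed.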

## For this project, we will use Poetry in addition to pip
Poetry and pip are both tools used in Python development, but they serve different purposes and have different features. Here's a breakdown of the differences between them:

1. **Purpose**:
   - **pip**: Pip is the default package manager for Python. It is used primarily for installing and managing Python packages from the Python Package Index (PyPI) or other repositories.
   - **Poetry**: Poetry is a dependency management and packaging tool for Python. It helps manage project dependencies, including their versions, and provides tools for packaging and publishing Python projects.

2. **Dependency Management**:
   - **pip**: Pip installs packages globally or within a virtual environment but doesn't provide a built-in way to manage dependencies for projects directly.
   - **Poetry**: Poetry allows you to define project dependencies in a `pyproject.toml` file, including version constraints. It manages dependencies on a per-project basis and allows for more deterministic dependency resolution.

3. **Locking Dependencies**:
   - **pip**: Pip does not have built-in support for locking dependencies, which can lead to dependency conflicts or inconsistencies between different environments.
   - **Poetry**: Poetry generates a `poetry.lock` file that locks dependencies to specific versions, ensuring that the same versions are installed consistently across different environments.

4. **Packaging**:
   - **pip**: Pip can install packages but does not provide built-in tools for packaging Python projects.
   - **Poetry**: Poetry provides commands for packaging Python projects into distributable formats like wheels or source distributions (`sdist`).

5. **Virtual Environments**:
   - **pip**: Pip relies on virtual environments created using `venv` or `virtualenv` to isolate project dependencies.
   - **Poetry**: Poetry automatically creates and manages virtual environments for each project, simplifying the setup process.

6. **Ease of Use**:
   - **pip**: Pip is a command-line tool with a straightforward interface for installing packages but may require additional tools or scripts for more complex tasks.
   - **Poetry**: Poetry aims to provide a more user-friendly and intuitive interface for managing project dependencies and packaging tasks.

In summary, while both pip and Poetry are essential tools in Python development, Poetry offers more features and a more comprehensive approach to dependency management and project packaging. It's especially valuable for projects that require strict dependency management and reproducible environments. However, pip remains widely used for installing individual packages and is often integrated into development workflows alongside Poetry.

## Preparation
* Create a new directory.
* `pyenv virtualenv 3.11.4 yourvirtualenvname`
* `pyenv activate yourvirtualenvname`

## How to start a new project with Poetry
To start a new Python project using Poetry, you can follow these steps:

1. **Install Poetry**:
   If you haven't already installed Poetry, you can do so by following the installation instructions provided in the Poetry documentation: https://python-poetry.org/docs/#installation.

2. **Create a New Project Directory**:
   Choose or create a directory where you want to initialize your new Python project.

3. **Initialize a New Project**:
   Open a terminal or command prompt, navigate to the directory you created, and run the following command to initialize a new Poetry project:

   `poetry new <project-name>`

   Replace `<project-name>` with the name of your project. This command will create a new directory with the specified project name and initialize a basic Python project structure inside it.

4. **Add Dependencies (Optional)**:
   If your project requires any dependencies, you can add them to the project using Poetry. Navigate into your project directory and use the following command to add dependencies:

   `poetry add <dependency-name>`

   Replace `<dependency-name>` with the name of the package you want to add. You can also specify the version and other constraints if needed.

5. **Write Your Code**:
   Start writing your Python code within the project directory. You can organize your code into modules and packages as needed.

6. **Manage Project Settings (Optional)**:
   Poetry uses a `pyproject.toml` file to manage project settings and dependencies. You can customize project settings such as Python version, dependencies, and packaging options in this file.

7. **Install Dependencies**:
   Once you've added dependencies or made changes to the `pyproject.toml` file, you can install the dependencies by running:

   `poetry install`

   This command will install the project dependencies specified in the `pyproject.toml` file.

8. **Run Your Code**:
   Run your code inside the Poetry-managed environment, for example with `poetry run python your_script.py`, so that the dependency versions installed for the project are the ones used.

That's it! You've now started a new Python project using Poetry. You can continue developing your project, adding dependencies, and managing project settings using Poetry commands as needed.

## Create a new project with poetry
* We will call our project v1-162-part1, you can call it whatever you want.
* In terminal:
    * `poetry new v1-162-part1`
    * `cd v1-162-part1`

## First, we will install langchain-cli
* In terminal:
    * `pip install langchain-cli`
        * This will give us everything we need to start a LangServe project.

## Then we will create a new LangServe app.
* In terminal:
    * `langchain app new .`
    * What package would you like to add? (leave blank to skip):
        * Leave blank, just press enter
            * This will create a new LangServe app.

## See the project structure in your code editor
* Several folders and files have been created inside the project folder. Remember that in our case, the project folder is called v1-162-part1.
* We will use the free Visual Studio Code editor, but you can use any other editor. When you open the project folder in the editor, you will see:
    * A new folder called app
    * A new folder called packages
    * A new folder called tests
    * A .gitignore file
    * And several other files

## Create a new folder for the PDF files
* In the root directory of the project, we will create the folder "pdf-documents".
* We will manually store there the PDF documents we want to use in our RAG app.
* Of course, in a production app you would let users upload their own PDF files. For the sake of simplicity, we will do it this way for now.
* For this demo, we will include in this folder 3 PDF files with the Wikipedia biographies of:
    * The president John F. Kennedy
    * His brother, Robert F. Kennedy
    * His father Joseph P. Kennedy 

## Create a new folder for the RAG load and process functionality
* In the root directory of the project, we will create the folder "rag-data-loader".

## Inside this folder, create a file with the LangChain RAG load and process functionality.
* We will call this file "rag_load_and_process.py".

## Add your .env file in the root directory of the project
* OpenAI API Key
* LangSmith Credentials
* Our LangSmith Project Name: RAGwithPDFSv2

OPENAI_API_KEY=yourkey

LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=yourkey
LANGCHAIN_PROJECT=RAGwithPDFSv2
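
A missing or misspelled variable in the .env file usually shows up later as a confusing authentication error. As a minimal sketch, the check below reports which of the variables listed above are unset or empty. It assumes the .env file has already been loaded into the process environment (the project does this with `load_dotenv`); the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the .env file described above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    print("Missing variables:", missing_env_vars(REQUIRED_VARS) or "none")
```

Passing a plain dictionary as `env` makes the function easy to try out without touching your real environment.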

## Update the .gitignore file so your .env file will not be uploaded to GitHub
* Add this line in the .gitignore file:
    * .env

## Now we will add the RAG functionality in the rag_load_and_process.py
* Remember that you can download the whole code from GitHub. We recommend you review these instructions after you download the code and open it in your code editor.
* We will need to add some packages from terminal:
    * `poetry add tqdm`
    *  The tqdm package is a Python library that provides a fast, extensible progress bar for loops and iterables. "tqdm" stands for "taqaddum" in Arabic, which means "progress" or "progression". It's a popular choice among developers for tracking the progress of iterative tasks, such as loops, data processing, or file downloads.
    * `poetry add "unstructured[all-docs]"`
    * We have used the unstructured package before, in the LLM Multimodal Apps section.
    * When you do this, you will get an error message in the terminal warning that the unstructured package does not yet work with Python 3.12.1. The error message explains how to solve this problem: go to the pyproject.toml file and change the line `python = "^3.11"` so that it says:
        * python = ">=3.11,<3.12" 
    
* As you can see, for our RAG functionality we are using some familiar and also some new tools:
    * load_dotenv to load the content of our .env file 
    * [DirectoryLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) and [UnstructuredPDFLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) to load the data from the directory where the PDF files are located (remember we called this directory "pdf-documents").
    * Postgres (with the PGVector extension) as our vector database.
    * [SemanticChunker](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker) as our splitter.
    * OpenAIEmbeddings to create the embeddings.
* We will need to download a few packages from terminal:
    * `poetry add langchain-experimental`
    * `poetry add python-dotenv`
    * `poetry add langchain-openai`
    * `poetry add langchain-community`
    * `poetry add tiktoken`
    * `poetry add psycopg`
    * `poetry add pgvector`
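
The Python version change mentioned above is easier to understand once you see what the caret constraint means. In Poetry, `^3.11` allows any version from 3.11 up to (but not including) 4.0, so it also admits 3.12. The sketch below reproduces that rule in plain Python for simple numeric versions only; the function names are hypothetical and this is not Poetry's own implementation.

```python
def caret_bounds(spec):
    """Return (lower, upper) version tuples for a simple caret constraint like '^3.11'."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]
    lower = tuple(parts)
    # Bump the leftmost non-zero component; every component after it resets to zero.
    for i, value in enumerate(parts):
        if value != 0:
            upper = tuple(parts[:i]) + (value + 1,) + (0,) * (len(parts) - i - 1)
            break
    else:
        # All components are zero (for example '^0.0'): bump the last one.
        upper = tuple(parts[:-1]) + (1,)
    return lower, upper


def in_range(version, lower, upper):
    """True if lower <= version < upper, comparing version tuples."""
    return lower <= version < upper


def allows_caret(spec, version):
    """True if the caret constraint admits the given version tuple."""
    lower, upper = caret_bounds(spec)
    return in_range(version, lower, upper)


print(caret_bounds("^3.11"))                     # bounds implied by the original line
print(allows_caret("^3.11", (3, 12, 1)))         # the caret admits Python 3.12.1
print(in_range((3, 12, 1), (3, 11), (3, 12)))    # the explicit range does not
```

This is why replacing the caret with the explicit range `>=3.11,<3.12` keeps Poetry from resolving against a Python version that the unstructured package does not support.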

## PGVector

To use PGVector, you can install it directly on your operating system as an extension of PostgreSQL. PGVector adds a vector data type and nearest-neighbor search operations to PostgreSQL. Here are the general steps for installing PostgreSQL and then adding PGVector, keeping in mind that the details may vary depending on your operating system.
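
To make "nearest-neighbor search" concrete, here is a toy sketch in plain Python. It ranks a few stored vectors by cosine distance to a query vector, which is conceptually what the vector database does for us (inside PostgreSQL, at scale, and with indexes). The vectors and labels are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, stored, k=1):
    """Return the k labels whose vectors are closest to the query vector."""
    ranked = sorted(stored, key=lambda label: cosine_distance(query, stored[label]))
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
stored = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
print(nearest([0.9, 0.1, 0.0], stored, k=2))
```

In our app the query vector will be the embedding of the user's question, and the stored vectors will be the embeddings of the PDF chunks.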

### 1. Install PostgreSQL

To check if PostgreSQL is installed on your Mac, you can open the Terminal and use the following command:

`psql --version`

This command attempts to run `psql`, the command-line interface for interacting with PostgreSQL, and asks it to display its version. If PostgreSQL is installed, this command will output the version of `psql` and, implicitly, PostgreSQL. For example, it might show something like `psql (PostgreSQL) 14.11`.

If you've installed PostgreSQL via Homebrew and confirmed its version, you do need to start the PostgreSQL service if it's not already running. You can start it with the command:

`brew services start postgresql`

This command will initiate the PostgreSQL server, making it ready to accept connections on the default port (usually 5432). Homebrew services will also ensure that PostgreSQL starts automatically when your computer boots, so you don't need to manually start it every time.

To check if PostgreSQL is running after you've started it, you can use:

`brew services list`

Look for `postgresql` in the output list, and next to it, you should see "started" if the service is running properly.

If PostgreSQL is not installed, you'll likely receive an error message indicating that psql could not be found. This indicates that PostgreSQL needs to be installed on your system.

How to install PostgreSQL varies by operating system, but here's how to do it on Debian-based systems (like Ubuntu) and on macOS.

#### On Ubuntu/Debian:

```bash
sudo apt update
sudo apt install postgresql postgresql-contrib
```

#### On macOS:

You can use Homebrew to install PostgreSQL:

`brew install postgresql`

After installing, start the PostgreSQL service:

- On Linux (Debian/Ubuntu):

`sudo service postgresql start`

- On macOS:

`brew services start postgresql`

### 2. Verify PostgreSQL Installation

Verify that the PostgreSQL client is installed (this prints the `psql` version; it does not by itself confirm that the server is running):

`psql -V`

### 3. Install PGVector

Now you need to install the PGVector extension so that your PostgreSQL instance can use it.

On a Mac, PGVector is easy to install with Homebrew. In terminal, run:

`brew install pgvector` 

For Windows and other systems, check the [installation guide for PGVector](https://github.com/pgvector/pgvector). Pay attention to these [installation notes](https://github.com/pgvector/pgvector#installation-notes---linux-and-mac).

## Activate the extension in your PostgreSQL database
To interact with your PostgreSQL database, you can now do so by using the `psql` command-line interface or any other PostgreSQL-compatible client of your choice. To connect to your default database with `psql`, you can simply open a Terminal window and type:

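`psql -U postgres`

"postgres" is the default superuser in PostgreSQL. Homebrew installations may instead create a superuser named after your macOS account; in that case, connect with `psql postgres`.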

To create the new database, enter:

`CREATE DATABASE database164;`

Then enter `\q` to exit psql.

Finally, you need to activate the pgvector extension within your database. Access your database with psql and execute the command to create the extension.

`psql -d database164 -c "CREATE EXTENSION vector;"`

If you ran the command `psql -d database164 -c "CREATE EXTENSION vector;"` and the output was `CREATE EXTENSION`, the `vector` extension (PGVector) was successfully enabled in your PostgreSQL database named `database164`, and you can now use the functionality it provides.

With PGVector now installed, you can begin to leverage its features for storing and performing operations on vectors directly within your PostgreSQL database.
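
To connect to this database from Python, the code needs a connection string. The sketch below builds one in the SQLAlchemy-style URL format that selects the psycopg (version 3) driver. The helper name, the default host and port, and the example credentials are assumptions for illustration; check the project's own code for the exact string it uses.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host="localhost", port=5432,
                            database="database164"):
    """Build a SQLAlchemy-style PostgreSQL URL that selects the psycopg (v3) driver."""
    # quote_plus escapes characters such as '@' or '/' that would break the URL.
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


print(build_connection_string("postgres", "p@ss/word"))
```

Escaping the user name and password matters because a raw `@` or `/` in a password would otherwise be read as part of the URL structure.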

## Load and process data
* Execute the rag_load_and_process.py file to load the data from the pdf documents, convert the data to embeddings and store the embeddings in the vector database.
* In terminal:
    * `cd rag-data-loader`
    * `python3 rag_load_and_process.py`
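
Because the script is run from inside the rag-data-loader folder, a wrong relative path to the PDF folder is an easy mistake to make. As a quick sanity check, the sketch below lists the PDF files found under a directory and fails loudly if the directory does not exist. The helper name `find_pdfs` is hypothetical and is not part of rag_load_and_process.py.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files under a directory (searched recursively), sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"PDF directory not found: {root}")
    return sorted(p for p in root.rglob("*.pdf") if p.is_file())
```

For example, from inside rag-data-loader you could call `find_pdfs("../pdf-documents")` and expect to see the three biography files, assuming the folder layout described earlier.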

#### Note: flattening docs (line 27 of rag_load_and_process.py)

`flattened_docs = [doc[0] for doc in docs if doc]`

This line of code uses a list comprehension to create a new list, `flattened_docs`, from the original list `docs`. This particular list comprehension serves to flatten a list of lists by one level under the assumption that each inner list contains at least one `Document` object. Let's break it down:

```python
flattened_docs = [doc[0] for doc in docs if doc]
```

- **`for doc in docs`**: This iterates over each element in the `docs` list. In this case, each `doc` is actually a list of `Document` objects (or possibly a single `Document` object wrapped in a list).

- **`if doc`**: This checks if the current item (`doc`) is truthy. In Python, empty lists are considered "falsy," meaning they evaluate to `False` in a boolean context. Non-empty lists are "truthy," meaning they evaluate to `True`. This part of the comprehension ensures that only non-empty lists are processed, effectively skipping any empty lists that might be in `docs`.

- **`doc[0]`**: For each non-empty list (`doc`), this accesses the first element (`doc[0]`). The assumption here is that the first element is a `Document` object you're interested in. This is the "flattening" part of the operation, where you take the first element out of each list and use it to build the new list, `flattened_docs`.

Put together, this line creates a new list that consists of the first `Document` object from each non-empty list in `docs`. It effectively removes one layer of list nesting, transforming a structure like `[[Document1], [Document2], ...]` into `[Document1, Document2, ...]`.

This approach is specifically tailored to this case, where each list in `docs` is expected to contain one `Document` object. If some lists might contain more than one `Document` object and you want to include all of them, you would need a different approach to flatten the list fully.
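
As a minimal sketch of that different approach, `itertools.chain.from_iterable` keeps every item from every inner list. Plain strings stand in for `Document` objects here, and both function names are hypothetical.

```python
from itertools import chain


def first_of_each(nested):
    """Same logic as the original line: keep the first item of each non-empty list."""
    return [inner[0] for inner in nested if inner]


def flatten_all(nested):
    """Keep every item from every inner list, preserving order."""
    return list(chain.from_iterable(nested))


# Strings stand in for Document objects.
nested = [["doc1-part1", "doc1-part2"], [], ["doc2"]]
print(first_of_each(nested))  # drops "doc1-part2"
print(flatten_all(nested))    # keeps all three items
```

With one `Document` per inner list both functions return the same result, so the simpler comprehension is enough for this project.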

* If you replace line 28 with:
    * `chunks = text_splitter.split_documents(docs)`
    * and you try to run the file from terminal, looking at the error message you will see why we needed to flatten docs.

## Create the RAG chain file
* In the app folder, create a new file. We will call it rag_chain.py.


#### Note: TypedDict
`TypedDict` allows for the creation of dictionaries with keys that are tied to specific types. This feature was introduced in Python 3.8 as a part of the `typing` module, which provides support for type hints.

**What is `TypedDict`?**

`TypedDict` enables precise type hints for dictionaries where you know the exact structure in advance. It's particularly useful when you want to ensure that dictionaries conform to a specific schema, with specific keys and value types. This can be beneficial for code readability, type checking, and IDE assistance (like autocompletion and type checks).

Before `TypedDict`, type hints for dictionaries were limited to specifying the types of keys and values in a general sense (e.g., `dict[str, int]` indicates a dictionary with string keys and integer values, but it doesn't specify which keys are expected or that different keys might have values of different types).

**How to Use `TypedDict`**

Here is an example of how to define and use a `TypedDict`:

```python
from typing import TypedDict

class Movie(TypedDict):
    name: str
    year: int

# This dictionary conforms to the Movie TypedDict specification
movie: Movie = {'name': 'Blade Runner', 'year': 1982}

# A type checker would flag this line: 'years' is not a declared key,
# and the required key 'year' (an int) is missing.
wrong_movie: Movie = {'name': 'Blade Runner', 'years': [1982]}  # Type checking error
```

In this example, `Movie` is a `TypedDict` specifying that dictionaries of this type should have a `name` key with a value of type `str` and a `year` key with a value of type `int`. Using `TypedDict` like this makes the code more self-documenting and allows type checkers like `mypy` to catch mistakes where the dictionary structure deviates from the defined schema.

**Benefits of `TypedDict`**

- **Type Safety**: Provides a way to enforce that dictionaries contain specific keys with values of specific types.
- **Code Readability and Maintenance**: Makes the intended structure of dictionaries clear, which can be especially helpful in large codebases or when working with complex data.
- **Tooling Support**: Improves support for autocompletion and type checking in IDEs, making development faster and helping catch type-related errors early.

**Limitations**

- **Runtime Behavior**: `TypedDict` does not change the runtime behavior of dictionaries. Errors related to missing keys or incorrect types will not be caught at runtime unless you use additional runtime checks.
- **Python Version**: `TypedDict` requires Python 3.8 or later. For older versions of Python, it is available as a backport from the `typing_extensions` package.

`TypedDict` is part of Python's gradual typing system, which allows developers to opt into type hints as needed, combining the flexibility of Python's dynamic typing with the benefits of static type checks.
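
The runtime limitation is worth seeing in action. In the sketch below, `RagInput` is a hypothetical input schema for a question-answering chain (the field name `question` is an assumption), and `missing_keys` is an illustrative helper that adds the kind of runtime check `TypedDict` does not perform on its own. It relies on the `__required_keys__` attribute, which is available from Python 3.9.

```python
from typing import TypedDict


class RagInput(TypedDict):
    # Hypothetical input schema for a question-answering chain.
    question: str


def missing_keys(data, schema):
    """Return the required keys of a TypedDict class that are absent from data."""
    return set(schema.__required_keys__) - set(data)


# At runtime a TypedDict value is a plain dict, so nothing stops a malformed one.
bad_input: RagInput = {"query": "Who was JFK?"}  # a type checker would flag this line
print(type(bad_input) is dict)
print(missing_keys(bad_input, RagInput))
```

The annotation helps editors and type checkers, while the explicit check is what actually protects the code when the data arrives from outside, for example from an HTTP request.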

#### Note: itemgetter
`itemgetter` is a utility function from Python's `operator` module that constructs a callable that fetches an item from its operand using the operand’s `__getitem__()` method (which corresponds to the square-bracket `[]` access syntax). It is commonly used for retrieving items from collections (like dictionaries, lists, or tuples) and is especially useful when you need to sort or organize data based on the value of specific items.

**Basic Usage**

Here's a basic example of how `itemgetter` works:

```python
from operator import itemgetter

# For a list of tuples
data = [(2, 'Z'), (1, 'A'), (4, 'D'), (3, 'B')]
# Get the second item from each tuple
getter = itemgetter(1)
for record in data:
    print(getter(record))  # Prints the second item of each tuple
```

**Sorting with `itemgetter`**

One common use case for `itemgetter` is in sorting lists of dictionaries or tuples by a specific element. It is used as a key function in sorting methods like `list.sort()` or `sorted()`:

```python
from operator import itemgetter

# Sorting a list of dictionaries by a specific key
records = [{'name': 'John', 'score': 90}, {'name': 'Doe', 'score': 80}, {'name': 'Jane', 'score': 95}]
# Sort by 'score'
sorted_records = sorted(records, key=itemgetter('score'))
print(sorted_records)

# Sorting a list of tuples
data = [(2, 'Z'), (1, 'A'), (4, 'D'), (3, 'B')]
# Sort by the first item
sorted_data = sorted(data, key=itemgetter(0))
print(sorted_data)
```

**`itemgetter` with Multiple Keys**

`itemgetter` can be used with multiple indices or keys. When called with multiple arguments, it creates a callable that returns a tuple with all specified items, which can be useful for sorting by multiple criteria:

```python
from operator import itemgetter

# Sorting by multiple criteria
data = [('John', 'Doe', 90), ('John', 'Smith', 80), ('Jane', 'Doe', 95)]
# Sort by the first item, then by the third
sorted_data = sorted(data, key=itemgetter(0, 2))
print(sorted_data)
```

**Comparison with Lambda Functions**

`itemgetter` offers a more efficient and concise alternative to lambda functions for similar purposes:

```python
# Using itemgetter
sorted_records = sorted(records, key=itemgetter('score'))

# Equivalent using lambda
sorted_records = sorted(records, key=lambda x: x['score'])
```

While both approaches are valid, `itemgetter` can be more readable and performant, especially for simple key extractions.

In summary, `itemgetter` is a versatile tool in Python for accessing items from objects and is particularly useful in sorting and data selection scenarios.
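
In chain code, a common use of `itemgetter` is to pull one field out of the input dictionary before passing it to the next step. The sketch below imitates that pattern in plain Python, without LangChain. The input keys and the `steps` mapping are assumptions made for illustration.

```python
from operator import itemgetter

# Hypothetical chain input; the key names are assumptions.
chain_input = {"question": "Who was Robert F. Kennedy?", "language": "en"}

get_question = itemgetter("question")
print(get_question(chain_input))

# A plain-Python stand-in for a pipeline step: each value is computed from the same input.
steps = {
    "question": itemgetter("question"),
    "question_length": lambda data: len(data["question"]),
}
prepared = {name: step(chain_input) for name, step in steps.items()}
print(prepared)
```

Because `itemgetter("question")` is an ordinary callable, it can be placed anywhere a function of the input dictionary is expected.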

## Edit app/server.py
* Add this line to serve the chain under the `/rag` path: `add_routes(app, final_chain, path="/rag")`

## Install FastAPI
In terminal:
* `cd ..` to go back to the root directory of the app
* `pip install fastapi`
* `pip install "uvicorn[standard]"`

## Start FastAPI
In terminal:
* `uvicorn app.server:app --reload`

## Check the app in the LangServe Playground
See this in your browser:
* [http://127.0.0.1:8000/rag/playground/](http://127.0.0.1:8000/rag/playground/)
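
Besides the playground, a LangServe route can also be called from code through its `invoke` endpoint. The sketch below only builds the HTTP request, so it can be inspected without a running server. The `{"question": ...}` input shape is an assumption about this chain's input schema, and the helper name is hypothetical.

```python
import json
from urllib.request import Request


def build_invoke_request(question, base_url="http://127.0.0.1:8000/rag"):
    """Build (without sending) a POST request for the route's invoke endpoint."""
    # The {"question": ...} input shape is an assumption about the chain's input schema.
    payload = {"input": {"question": question}}
    return Request(
        url=f"{base_url}/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invoke_request("Who was John F. Kennedy?")
print(request.full_url)
```

While the server is running, passing this object to `urllib.request.urlopen` would send it; the JSON response is expected to carry the chain's answer in an `output` field.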

Press CTRL-C in the terminal to stop the server.

## Go to LangSmith to track the operations
* smith.langchain.com

## Note
* Instead of providing just the final code of the project, we will provide the code of each development stage, so you can see how the code evolves.
* This also helps identify possible bugs or necessary updates in the code. The code works fine as we are recording this, but as you know, Generative AI is evolving very fast and new ways of doing things can emerge at any moment.