# RAG Tools: Environment Setup

## 1. Introduction

Welcome to the RAG Tools project! This notebook is the first step in setting up a Machine Learning (ML) development environment. RAG stands for Retrieval-Augmented Generation, a natural language processing technique that pairs a large language model with retrieval from an external knowledge source, so the model can ground its answers in documents it was not trained on.

In this notebook, we'll cover:
1. Reviewing the project structure
2. Creating a Conda environment
3. Installing necessary dependencies
4. Verifying the installation

By the end of this notebook, you'll have a fully functional Python environment ready for ML development, specifically tailored for RAG applications.

## 2. Conda and Docker

We're using two main tools for our environment setup:

1. **Conda**: This is an open-source package management system and environment management system. We use Conda to:
   - Create isolated Python environments
   - Easily install and manage packages
   - Ensure our project's dependencies don't interfere with other projects or system-wide packages

2. **Docker**: This is a platform for developing, shipping, and running applications in containers. We'll use Docker in later notebooks to:
   - Package our entire application, including dependencies and configuration
   - Ensure consistent runtime environments across different machines
   - Easily manage and scale our database services
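
Before going further, it can help to confirm that both tools are actually reachable from the notebook. The following is a minimal sketch using only the standard library; the helper name `find_tools` is hypothetical and not part of this project.

```python
import shutil

def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Report where (or whether) conda and docker were found.
for tool, path in find_tools(["conda", "docker"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If either tool is reported as missing, install it (or fix your `PATH`) before running the setup cells.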

## 3. Project Structure

Before we begin, let's understand our initial project structure:

```
RAG_tools/
├── config/
│   ├── docker-compose.yml
│   └── example.env
├── notebooks/
│   ├── 00_Environment_Setup.ipynb (this notebook)
│   ├── 01_Database_Setup.ipynb
│   ├── 02_Ollama_Setup.ipynb
│   └── 03_CLI_Implementation.ipynb
├── src/
│   └── utils/
│       ├── config_utils.py
│       └── DockerComposeManager.py
└── tests/
```

This structure separates our code into logical components:
- `config/`: Contains configuration files
- `notebooks/`: Jupyter notebooks for interactive development and documentation
- `src/`: Source code for our project
- `tests/`: Unit tests (we'll add these later)
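
To check a local checkout against the layout above, a short script along these lines may be useful. It is only a sketch: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the usage line assumes the notebook is run from the `notebooks/` directory so that `..` is the project root.

```python
from pathlib import Path

# Paths taken from the tree above, relative to the project root.
EXPECTED_PATHS = [
    "config/docker-compose.yml",
    "config/example.env",
    "notebooks",
    "src/utils/config_utils.py",
    "src/utils/DockerComposeManager.py",
    "tests",
]

def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

# Assumption: the current working directory is RAG_tools/notebooks.
print(missing_paths(".."))
```

An empty list means every expected file and directory is present.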

## 4. Environment Setup

### 4.1 Create and Activate Conda Environment

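First, let's create our Conda environment and activate it. The cell below uses the `!` prefix to run shell commands from the notebook; to run them in a terminal instead, drop the `!`. Keep in mind that each `!` line runs in its own subshell, so `!conda activate` does not carry over to later lines or to the running kernel. If the kernel registered on the last line ends up pointing at the wrong interpreter, run these commands in a terminal instead: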

In [1]:
!conda create -n ragtools python=3.12 -y
!conda init
!conda activate ragtools
!python -m ipykernel install --user --name=ragtools


Channels:
 - pytorch
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/todd/anaconda3/envs/ragtools

  added / updated specs:
    - python=3.12


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h4bc722e_7 
  ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0 
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7 
  libexpat           conda-forge/linux-64::libexpat-2.6.2-h59595ed_0 
  libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5 
  libgcc-ng          conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0 
  libgomp            conda-forge/linux-64::libgomp-14.1.0-h77fa898_0 
  libnsl             conda-forge/linux

These commands do the following:
1. Create a new Conda environment named "ragtools" with Python 3.12
2. Initialize Conda for your shell
3. Activate the new environment
4. Install a Jupyter kernel for this environment

NOTE: If you are doing this in VS Code, now is the time to switch the Python interpreter for this project's notebooks to `ragtools`.
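
Because `python -m ipykernel install` records the interpreter that ran the command, it is worth checking which Python the new kernel actually points to. The sketch below reads the kernel's `kernel.json`; the helper name `kernel_interpreter` is hypothetical, and the path shown is an assumption (the usual location for `--user` kernels on Linux), so adjust it for other platforms.

```python
import json
from pathlib import Path

def kernel_interpreter(kernel_json):
    """Return the interpreter path recorded in a Jupyter kernel.json file."""
    spec = json.loads(Path(kernel_json).read_text())
    # The first element of "argv" is the Python executable the kernel launches.
    return spec["argv"][0]

# Assumed location for a kernel installed with --user on Linux.
spec_path = Path.home() / ".local/share/jupyter/kernels/ragtools/kernel.json"
if spec_path.exists():
    print(kernel_interpreter(spec_path))
else:
    print(f"No kernel spec found at {spec_path}")
```

The printed path should sit inside the `ragtools` environment directory; if it points at your base installation, re-run the `ipykernel install` step from a terminal with `ragtools` activated.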

### 4.2 Configure Conda Channels

Now, let's add the necessary Conda channels. Channels are the locations where packages are stored.

In [2]:
!conda config --add channels conda-forge
!conda config --add channels pytorch




These commands add the conda-forge and pytorch channels, which we'll need for some of our dependencies.
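
`conda config --add` places a channel at the top of the list, so the channel added last (`pytorch` here) gets the highest priority. To see the resulting order from Python, something like the following should work. It is a sketch that assumes `conda config --show channels --json` is available in your Conda version; `parse_channels` and `configured_channels` are hypothetical helper names.

```python
import json
import subprocess

def parse_channels(config_json):
    """Extract the ordered channel list from conda's JSON config output."""
    return json.loads(config_json).get("channels", [])

def configured_channels():
    """Ask conda for its configured channels, highest priority first."""
    result = subprocess.run(
        ["conda", "config", "--show", "channels", "--json"],
        capture_output=True, text=True, check=True,
    )
    return parse_channels(result.stdout)

try:
    print(configured_channels())
except (FileNotFoundError, subprocess.CalledProcessError) as exc:
    print(f"Could not query conda: {exc}")
```

With the two commands above applied to a default configuration, the expected order is `pytorch`, `conda-forge`, then `defaults`.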

### 4.3 Install Dependencies

Let's install our project dependencies:

In [3]:
!conda install transformers psycopg2 numpy matplotlib PyYAML jupyter pandas scikit-learn python-dotenv neo4j-python-driver docker-py ipykernel pgvector -c pytorch -c conda-forge -y
!conda install pytorch torchvision torchaudio cpuonly -c pytorch -y
!conda install -y python-dotenv python-docker -c conda-forge


Channels:
 - pytorch
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/todd/anaconda3/envs/ragtools

  added / updated specs:
    - docker-py
    - ipykernel
    - jupyter
    - matplotlib
    - neo4j-python-driver
    - numpy
    - pandas
    - pgvector
    - psycopg2
    - python-dotenv
    - pyyaml
    - scikit-learn
    - transformers


The following NEW packages will be INSTALLED:

  aiohttp            conda-forge/linux-64::aiohttp-3.9.5-py312h98912ed_0 
  aiosignal          conda-forge/noarch::aiosignal-1.3.1-pyhd8ed1ab_0 
  alsa-lib           conda-forge/linux-64::alsa-lib-1.2.12-h4ab18f5_0 
  anyio              conda-forge/noarch::anyio-4.4.0-pyhd8ed1ab_0 
  argon2-cffi        conda-forge/noarch::argon2-cffi-23.1.0-pyhd8ed1ab_0 
  argon2-cffi-bindi~ conda-forge/linux-64::argon2-cffi-bindings-21.2.0-py312h98912ed_4 
  arrow              conda-forge/noarch::a

This installs a range of packages including:
- `transformers`: For working with transformer models
- `psycopg2`: PostgreSQL adapter
- `numpy` and `pandas`: For data manipulation
- `matplotlib`: For data visualization
- `scikit-learn`: For machine learning utilities
- `neo4j-python-driver`: For connecting to Neo4j graph database
- `pytorch`: Deep learning framework
- `pgvector`: For working with vector embeddings in PostgreSQL

### 4.4 Switch to the New Kernel

After running these cells, restart your Jupyter kernel and select the 'ragtools' kernel:

1. Click on 'Kernel' in the top menu
2. Select 'Change kernel'
3. Choose 'ragtools' from the list
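
To confirm that the switch took effect, you can print the interpreter and environment the kernel is running in. This is a small sketch; `env_name_from_prefix` is a hypothetical helper that relies on named Conda environments living in a directory named after the environment.

```python
import sys
from pathlib import Path

def env_name_from_prefix(prefix):
    """Return the last path component of an environment prefix."""
    return Path(prefix).name

# sys.prefix is the root directory of the environment running this kernel.
print(f"Interpreter: {sys.executable}")
print(f"Environment: {env_name_from_prefix(sys.prefix)}")
```

If the environment shown is not `ragtools`, the notebook is still attached to the old kernel.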

### 4.5 Verify Installation

Let's verify that our environment is set up correctly:

In [4]:
import sys
import subprocess
import json
import importlib

# Conda package names whose Python import name differs from the package name.
IMPORT_NAMES = {
    'pytorch': 'torch',
    'scikit-learn': 'sklearn',
    'python-dotenv': 'dotenv',
    'neo4j-python-driver': 'neo4j',
    'docker-py': 'docker',
    'python-docker': 'docker',
}

def get_conda_versions():
    """Return {package name: version} for the environment this kernel runs in."""
    try:
        # Query conda once, pointing it at this interpreter's environment.
        result = subprocess.run(
            ['conda', 'list', '--prefix', sys.prefix, '--json'],
            capture_output=True, text=True, check=True)
        return {pkg['name']: pkg['version'] for pkg in json.loads(result.stdout)}
    except Exception as e:
        print(f"Error getting conda package list: {e}")
        return {}

def check_importable(package_name):
    """Try to import the module that corresponds to a conda package name."""
    try:
        importlib.import_module(IMPORT_NAMES.get(package_name, package_name))
        return "Importable"
    except ImportError:
        return "Not importable"

print(f"Python version: {sys.version}")
print("\nPackage versions:")

packages = ['pytorch', 'transformers', 'psycopg2', 'numpy', 'matplotlib',
            'jupyter', 'pandas', 'scikit-learn', 'python-dotenv',
            'neo4j-python-driver', 'docker-py', 'python-docker', 'pgvector']

conda_versions = get_conda_versions()
for package in packages:
    version = conda_versions.get(package, "Not found in conda environment")
    print(f"{package}: {version} ({check_importable(package)})")

# Additional check for PyTorch
try:
    import torch
    print(f"\nPyTorch import successful. Version: {torch.__version__}")
except ImportError:
    print("\nFailed to import PyTorch")

print("\nAll package checks completed.")


Python version: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]

Package versions:
pytorch: 2.3.1 (Importable)
transformers: 4.42.4 (Importable)
psycopg2: 2.9.9 (Importable)
numpy: 1.26.4 (Importable)
matplotlib: 3.9.1 (Importable)
jupyter: 1.0.0 (Importable)
pandas: 2.2.2 (Importable)
scikit-learn: 1.5.1 (Importable)
python-dotenv: 1.0.1 (Importable)
neo4j-python-driver: 5.22.0 (Importable)
docker-py: 7.1.0 (Importable)
python-docker: 0.2.0 (Importable)
pgvector: 0.7.2 (Not importable)

PyTorch import successful. Version: 2.3.1

All package checks completed.


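This script reports each package's Conda version and whether it can be imported. In the output above, `pgvector` is installed but reported as not importable, most likely because the Conda package of that name provides the PostgreSQL extension rather than a Python module; the other packages import without problems.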

## 5. Conclusion

Congratulations! You've successfully set up your ML development environment for the RAG Tools project. In the next notebook, we'll dive into setting up our databases using Docker.

## 6. Next Steps

In the upcoming notebooks, we'll cover:
1. Setting up PostgreSQL with pgvector and Neo4j databases (01_Database_Setup.ipynb)
2. Configuring Ollama for running large language models locally (02_Ollama_Setup.ipynb)
3. Implementing a Command Line Interface (CLI) for our RAG system (03_CLI_Implementation.ipynb)

These steps will build upon this environment to create a fully functional RAG system.