# Unit 5

## Database Integration and Persistence

## Introduction: Why Store Code and Commits in a Database? 💾

Welcome back! So far, you have learned how to scan a codebase, extract commit history, and analyze code using a command-line interface. Until now, all the information about code files and commits has been stored in memory, so it disappears as soon as your program stops running.

In real-world applications, especially for tools like a code review assistant, it is important to keep this information **organized and persistent**. Storing data in a database allows you to:

  * **Save** code and commit information for later use
  * **Query and analyze** data efficiently
  * **Share** data between different parts of your application

In this lesson, you will learn how to use a database to store code files and commit data using **SQLAlchemy**, a popular Python library for working with databases. This will make your code review assistant more powerful and reliable.

-----

## Setting Up Your Environment

Before we begin working with databases, you need to install the required dependencies. SQLAlchemy is not part of Python's standard library, so it needs to be installed separately.

### Installing SQLAlchemy

If you're working on your own machine, you'll need to install SQLAlchemy. Here are the installation commands:

**Using `pip`:**

```bash
pip install sqlalchemy
```

**Using `pip` with a virtual environment (recommended):**

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install SQLAlchemy
pip install sqlalchemy
```

**Using `conda`:**

```bash
conda install sqlalchemy
```

### Additional Dependencies

For this lesson, we'll use **SQLite** as our database. Python's standard library ships with the `sqlite3` driver, so nothing extra needs to be installed. If you want to use another database such as PostgreSQL or MySQL, you need an additional driver:

```bash
# For PostgreSQL
pip install psycopg2-binary
# For MySQL
pip install pymysql
```
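
A quick way to confirm the setup is to import both libraries and print their versions. This is only a sanity check; the variable names are illustrative.

```python
import sqlite3

import sqlalchemy

# Version of the installed SQLAlchemy package
sqlalchemy_version = sqlalchemy.__version__
# Version of the SQLite library bundled with Python
sqlite_version = sqlite3.sqlite_version

print(f"SQLAlchemy {sqlalchemy_version}, SQLite {sqlite_version}")
```

If the import fails with `ModuleNotFoundError`, SQLAlchemy is not installed in the active environment.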

> **Note:** On CodeSignal, the required libraries are already installed, so you do not need to worry about installation here. However, it is good practice to know how to set up your environment on your own device.

-----

## Quick Recall: Data Classes and CLI Integration

Before we dive in, let's quickly remind ourselves of what you have already built:

  * You used **Python data classes** to represent code files and commits.
  * You built a **CLI tool** that scanned a project directory and extracted commit history, displaying useful statistics.

In those lessons, all data was kept in memory using Python objects. Now, we will take the next step and store this data in a database so it can be accessed and updated over time.

-----

## Defining SQLAlchemy Models for Code and Commits

To store data in a database, we need to define the structure of our tables. In SQLAlchemy, this is done by creating Python classes called **models**. Each model represents a table in the database.

Let's start by defining a model for a code file.

```python
from sqlalchemy import Column, Integer, String, Text, DateTime
# Since SQLAlchemy 1.4, declarative_base is imported from sqlalchemy.orm
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True)
    content = Column(Text)
    language = Column(String)
    last_updated = Column(DateTime)
```

**Explanation:**

  * `Base = declarative_base()` sets up the base class for all our models.
  * `class CodeFile(Base):` defines a model, and `__tablename__` maps it to a table called `code_files`.
  * Each attribute (like `id`, `file_path`, `content`) becomes a **column** in the table.
  * `id` is the **primary key**, which uniquely identifies each row.
  * `file_path` is marked as **unique**, so no two files can have the same path.

Now, let's define a model for a commit:

```python
class Commit(Base):
    __tablename__ = 'commits'
    id = Column(Integer, primary_key=True)
    hash = Column(String, unique=True)
    message = Column(Text)
    author = Column(String)
    date = Column(DateTime)
```

**Explanation:**

  * This class creates a `commits` table.
  * Each commit has a unique **hash**, a message, an author, and a date.

Finally, we need a way to link files and commits together. For this, we use a third table:

```python
from sqlalchemy import ForeignKey

class FileCommit(Base):
    __tablename__ = 'file_commits'
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey('code_files.id'))
    commit_id = Column(Integer, ForeignKey('commits.id'))
    diff_text = Column(Text)
```

**Explanation:**

  * `FileCommit` links a code file to a commit, storing the changes (**diff**) made in that commit.
  * `file_id` and `commit_id` are **foreign keys**, meaning they refer to rows in the `code_files` and `commits` tables.
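
As a sketch of how the link table can be used, the example below stores one file, one commit, and one `FileCommit` row, then joins through `file_commits` to find the commits that touched a file. The in-memory database and the sample values are assumptions for illustration.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True)
    content = Column(Text)

class Commit(Base):
    __tablename__ = 'commits'
    id = Column(Integer, primary_key=True)
    hash = Column(String, unique=True)
    message = Column(Text)
    author = Column(String)
    date = Column(DateTime)

class FileCommit(Base):
    __tablename__ = 'file_commits'
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey('code_files.id'))
    commit_id = Column(Integer, ForeignKey('commits.id'))
    diff_text = Column(Text)

engine = create_engine('sqlite:///:memory:')  # demo database only
Base.metadata.create_all(bind=engine)
session = sessionmaker(bind=engine)()

code_file = CodeFile(file_path='main.py', content='print("hi")')
commit = Commit(hash='abc123', message='Initial commit',
                author='Alice', date=datetime(2024, 1, 1))
session.add_all([code_file, commit])
session.flush()  # assigns primary keys so they can be used as foreign keys

session.add(FileCommit(file_id=code_file.id, commit_id=commit.id,
                       diff_text='+print("hi")'))
session.commit()

# Join commits -> file_commits -> code_files to find commits touching main.py
commits_for_file = (
    session.query(Commit)
    .join(FileCommit, FileCommit.commit_id == Commit.id)
    .join(CodeFile, CodeFile.id == FileCommit.file_id)
    .filter(CodeFile.file_path == 'main.py')
    .all()
)
messages = [c.message for c in commits_for_file]
session.close()
```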

-----

## Setting Up and Initializing the Database

Now that we have our models, we need to set up the database connection and create the tables.

First, let's set up the connection:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import os

DATABASE_URL = os.getenv('DATABASE_URL', 'sqlite:///code_review.db')
engine = create_engine(DATABASE_URL)

SessionLocal = sessionmaker(bind=engine)
```

**Explanation:**

  * `DATABASE_URL` tells SQLAlchemy where to find the database. If the environment variable is not set, it uses a local **SQLite** file called `code_review.db`.
  * `create_engine()` creates an **engine**, the object that manages connections to the database. It does not open a connection until one is first needed.
  * `SessionLocal` is a factory for creating **sessions**, which are used to interact with the database.
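
The sketch below shows these pieces working together. It sets `DATABASE_URL` inside the script to an in-memory SQLite database purely for demonstration; in a real project the variable would come from the shell or deployment environment.

```python
import os

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# Demonstration only: override the default so no file is created on disk
os.environ['DATABASE_URL'] = 'sqlite:///:memory:'

DATABASE_URL = os.getenv('DATABASE_URL', 'sqlite:///code_review.db')
engine = create_engine(DATABASE_URL)  # no connection is opened yet
SessionLocal = sessionmaker(bind=engine)

# Using the session as a context manager closes it automatically on exit
with SessionLocal() as session:
    # The first statement is what actually opens a connection
    result = session.execute(text('SELECT 1')).scalar()

dialect_name = engine.dialect.name
```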

Next, let's create the tables:

```python
def init_database():
    Base.metadata.create_all(bind=engine)
```

**Explanation:**

  * `Base.metadata.create_all()` creates all tables defined by our models if they do not already exist.
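
Because existing tables are skipped, `init_database()` is safe to call more than once. The sketch below checks this with SQLAlchemy's `inspect()` function, using a trimmed-down model and an in-memory database as assumptions for the demo.

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Trimmed-down model, just enough to have a table to create
class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True)

engine = create_engine('sqlite:///:memory:')  # demo database only

def init_database():
    Base.metadata.create_all(bind=engine)

init_database()
init_database()  # second call finds the table already present and does nothing

# List the tables that now exist in the database
table_names = inspect(engine).get_table_names()
```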

Finally, we need a helper function to get database sessions:

```python
def get_session():
    return SessionLocal()
```

**Explanation:**

  * `get_session()` creates and returns a new database session using the `SessionLocal` factory we defined earlier.
  * This session is used to interact with the database: adding, updating, and querying data. Close the session with `session.close()` when you are finished with it.
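
One common way to make sure a session is always committed or rolled back, and then closed, is to wrap `get_session()` in a context manager. The helper name `session_scope` below is hypothetical and not part of the lesson's code; the model and in-memory database are also assumptions for the demo.

```python
from contextlib import contextmanager

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True)

engine = create_engine('sqlite:///:memory:')  # demo database only
Base.metadata.create_all(bind=engine)
SessionLocal = sessionmaker(bind=engine)

def get_session():
    return SessionLocal()

@contextmanager
def session_scope():
    """Hypothetical helper: commit on success, roll back on error, always close."""
    session = get_session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# Write in one session...
with session_scope() as session:
    session.add(CodeFile(file_path='main.py'))

# ...and read the data back in another
with session_scope() as session:
    stored_paths = [f.file_path for f in session.query(CodeFile).all()]
```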

-----

## Populating the Database with Repository Data

With the database ready, let's see how to add code files and commits to it. We will use our repository scanner and git history extractor from previous lessons, but now we will store the results in the database.

First, let's scan the repository and add code files:

```python
from collections import namedtuple
from datetime import datetime

from database import get_session
from models import CodeFile

# Dummy scanner for demonstration; stands in for the scanner from earlier lessons
File = namedtuple('File', ['file_path', 'content', 'language', 'last_updated'])

class RepositoryScanner:
    def scan_repository(self, repo_path):
        return [
            File('main.py', 'print("Hello, World!")', 'Python', datetime.now())
        ]

session = get_session()
scanner = RepositoryScanner()
files = scanner.scan_repository('.')

for file_data in files:
    db_file = CodeFile(
        file_path=file_data.file_path,
        content=file_data.content,
        language=file_data.language,
        last_updated=file_data.last_updated
    )
    session.merge(db_file)

session.commit()
```

**Explanation:**

  * We use a `RepositoryScanner` to get a list of code files.
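  * For each file, we create a `CodeFile` object and pass it to `session.merge()`.
  * `session.merge()` matches rows by **primary key**. Because `id` is not set here, each object is inserted as a new row, so running this script a second time against the same database raises an `IntegrityError` on the unique `file_path`. To update existing rows, look the file up by `file_path` first.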
  * `session.commit()` saves the changes.
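
Since `session.merge()` matches on the primary key rather than on `file_path`, one way to make the population step safe to re-run is to look up the existing row by path and update it. The helper `upsert_code_file` below is a hypothetical sketch, not part of the lesson's code, and it uses an in-memory database for demonstration.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True)
    content = Column(Text)
    language = Column(String)
    last_updated = Column(DateTime)

engine = create_engine('sqlite:///:memory:')  # demo database only
Base.metadata.create_all(bind=engine)
session = sessionmaker(bind=engine)()

def upsert_code_file(session, file_path, content, language, last_updated):
    """Hypothetical helper: update the row with this path, or insert a new one."""
    db_file = session.query(CodeFile).filter_by(file_path=file_path).first()
    if db_file is None:
        db_file = CodeFile(file_path=file_path)
        session.add(db_file)
    db_file.content = content
    db_file.language = language
    db_file.last_updated = last_updated
    return db_file

# Simulate two scans of the same file
upsert_code_file(session, 'main.py', 'print("v1")', 'Python', datetime(2024, 1, 1))
session.commit()
upsert_code_file(session, 'main.py', 'print("v2")', 'Python', datetime(2024, 1, 2))
session.commit()

rows = session.query(CodeFile).all()
row_count = len(rows)
stored_content = rows[0].content
session.close()
```

After both scans there is still a single row for `main.py`, and it holds the content from the second scan.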

Now, let's add commits:

```python
from collections import namedtuple
from datetime import datetime

from models import Commit

# Dummy git extractor for demonstration; stands in for the extractor from earlier lessons
CommitData = namedtuple('CommitData', ['hash', 'message', 'author', 'date'])

class GitHistoryExtractor:
    def extract_commits(self, repo_path):
        return [
            CommitData('abc123', 'Initial commit', 'Alice <alice@example.com>', datetime.now())
        ]

git_extractor = GitHistoryExtractor()
commits = git_extractor.extract_commits('.')

for commit_data in commits:
    db_commit = Commit(
        hash=commit_data.hash,
        message=commit_data.message,
        author=commit_data.author,
        date=commit_data.date
    )
    session.merge(db_commit)

session.commit()
print("Database populated successfully!")
session.close()
```

**Explanation:**

  * We use a `GitHistoryExtractor` to get a list of commits.
  * For each commit, we create a `Commit` object and pass it to `session.merge()`.
  * As with files, `session.merge()` matches by primary key, so each commit is inserted as a new row because `id` is not set. `session.commit()` then saves the data.

**Sample Output:**

```
Database populated successfully!
```

This message confirms that your code files and commits have been stored in the database.
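
To check what was actually stored, you can query the data back. The sketch below inserts two sample commits into an in-memory database (the second commit is made up for illustration) and then counts, sorts, and filters them.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Commit(Base):
    __tablename__ = 'commits'
    id = Column(Integer, primary_key=True)
    hash = Column(String, unique=True)
    message = Column(Text)
    author = Column(String)
    date = Column(DateTime)

engine = create_engine('sqlite:///:memory:')  # demo database only
Base.metadata.create_all(bind=engine)
session = sessionmaker(bind=engine)()

session.add_all([
    Commit(hash='abc123', message='Initial commit',
           author='Alice <alice@example.com>', date=datetime(2024, 1, 1)),
    # Illustrative second commit, not from the lesson
    Commit(hash='def456', message='Add scanner',
           author='Bob <bob@example.com>', date=datetime(2024, 1, 2)),
])
session.commit()

# Count all stored commits
commit_count = session.query(Commit).count()

# Most recent commit first
latest_hash = session.query(Commit).order_by(Commit.date.desc()).first().hash

# Commits whose author field starts with "Alice"
alice_hashes = [
    c.hash for c in session.query(Commit).filter(Commit.author.like('Alice%')).all()
]
session.close()
```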

-----

## Summary and Practice Preview

In this lesson, you learned how to:

  * Install and set up **SQLAlchemy** as a dependency for database operations.
  * Define **SQLAlchemy models** to represent code files, commits, and their relationships.
  * Set up and initialize a database connection.
  * Scan a repository and extract commit history.
  * **Store** code and commit data in a database for persistence.

You are now ready to practice these skills by working with real code and seeing how data is stored and retrieved from the database. In the next exercises, you will get hands-on experience with database integration and persistence, making your code review assistant more robust and useful.

## Completing SQLAlchemy Models for Code Files

Now that you've learned about SQLAlchemy models and their importance for data persistence, let's put that knowledge into practice! In this exercise, you'll complete a partially implemented database model for storing code files.

The `CodeFile` model needs a few important additions to work properly in our database system. Your tasks are to:

  * Add the unique constraint to the `file_path` column to prevent duplicate files in our database.
  * Create a `language` column to store which programming language each file uses.
  * Add a `last_updated` column to track when files were last modified.

These improvements will ensure our database can properly organize and track code files for our review assistant. Completing this model is your first step toward building a robust database system that can store all the code and commit information you'll need for analysis.

```python
from sqlalchemy import Column, Integer, String, Text, DateTime, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    # TODO: Add the unique constraint to the file_path column
    file_path = Column(String)
    content = Column(Text)
    # TODO: Add a column for storing the programming language of the file
    # TODO: Add a column for tracking when the file was last updated

class Commit(Base):
    __tablename__ = 'commits'
    id = Column(Integer, primary_key=True)
    hash = Column(String, unique=True)
    message = Column(Text)
    author = Column(String)
    date = Column(DateTime)

class FileCommit(Base):
    __tablename__ = 'file_commits'
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey('code_files.id'))
    commit_id = Column(Integer, ForeignKey('commits.id'))
    diff_text = Column(Text)
```

To complete the `CodeFile` model, use `unique=True` on the path column, and add columns for the language (a `String`) and the last update time (a `DateTime`).

Here is the completed code:

```python
from sqlalchemy import Column, Integer, String, Text, DateTime, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CodeFile(Base):
    __tablename__ = 'code_files'
    id = Column(Integer, primary_key=True)
    # Unique constraint: no two rows may share the same path
    file_path = Column(String, unique=True)
    content = Column(Text)
    # Programming language of the file
    language = Column(String)
    # When the file was last updated
    last_updated = Column(DateTime)

class Commit(Base):
    __tablename__ = 'commits'
    id = Column(Integer, primary_key=True)
    hash = Column(String, unique=True)
    message = Column(Text)
    author = Column(String)
    date = Column(DateTime)

class FileCommit(Base):
    __tablename__ = 'file_commits'
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey('code_files.id'))
    commit_id = Column(Integer, ForeignKey('commits.id'))
    diff_text = Column(Text)
```

### Key Changes:

1.  **`file_path`**: Added the constraint `unique=True` to ensure every file path recorded is unique.
2.  **`language`**: Added `language = Column(String)` to store the file's programming language.
3.  **`last_updated`**: Added `last_updated = Column(DateTime)` to track the modification time, which is useful for incremental scanning.

## Fixing Database Initialization Missing Import

Our database initialization isn't working: when we try to run the application, the database tables aren't being created, which prevents any data from being stored.

Your task is to examine the `database.py` file and find the missing import. Look carefully at how the `Base` object is referenced in the `init_database()` function.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import os

DATABASE_URL = os.getenv('DATABASE_URL', 'sqlite:///code_review.db')
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(bind=engine)

def init_database():
    Base.metadata.create_all(bind=engine)

def get_session():
    return SessionLocal()
```

## Storing Code Files in SQLAlchemy Database

## Integrating Git History with Database Storage

## Synchronizing Repository Files with Database