
refactor: File model #3050


Merged
1 commit merged on May 7, 2025
7 changes: 7 additions & 0 deletions apps/common/utils/common.py
@@ -23,6 +23,7 @@

from ..exception.app_exception import AppApiException
from ..models.db_model_manage import DBModelManage
import hashlib


def password_encrypt(row_password):
@@ -124,6 +125,7 @@ def get_file_content(path):
content = file.read()
return content


def sub_array(array: List, item_num=10):
result = []
temp = []
@@ -270,3 +272,8 @@ def bulk_create_in_batches(model, data, batch_size=1000):
batch = data[i:i + batch_size]
model.objects.bulk_create(batch)


def get_sha256_hash(_bytes):
sha256 = hashlib.sha256()
sha256.update(_bytes)
return sha256.hexdigest()
Contributor Author
The provided code looks mostly clean and well-structured. However, there are a few areas that can be improved:

  1. Docstring: Adding docstrings to functions can help clarify their purpose.

    # In password_encrypt function
    def password_encrypt(row_password):
        """Encrypts the row_password using SHA-256."""
    
    # Other functions similarly
  2. Comments: It might be beneficial to add comments above sections of code that perform specific operations or set up certain conditions.

  3. Code Consistency:

    • The sub_array function should handle cases where item_num > len(array) gracefully.
    • Ensure consistency in naming conventions (e.g., use underscores instead of hyphens).
    def sub_array(array: List, item_num=10):
        if not array:
            return []
        
        start_index = max(0, len(array) - item_num)
        end_index = min(len(array), start_index + item_num)
        return array[start_index:end_index]
  4. Performance Optimization:

    • If bulk_create_in_batches is called frequently, consider optimizing it further or integrating with an asynchronous database client like Django Channels or Celery.
  5. Exception Handling:

    • Consider adding more robust exception handling around file operations and other critical operations.
    • You might want to log exceptions rather than raising them directly.

Here's a revised version with some of these improvements:

from typing import List
from ..exception.app_exception import AppApiException
from ..models.db_model_manage import DBModelManage

import hashlib


def password_encrypt(row_password):
    """
    Encrypts the row_password using SHA-256.
    
    :param row_password: Password to encrypt.
    :return: Encrypted password in hexadecimal format.
    """
    sha256 = hashlib.sha256()
    sha256.update(row_password.encode('utf-8'))  # Assuming UTF-8 encoding
    return sha256.hexdigest()


def get_file_content(path):
    """
    Reads content from the specified path.
    
    :param path: Path to the file.
    :return: Content of the file as a string.
    """
    try:
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()
        return content
    except FileNotFoundError:
        raise AppApiException("File not found.")
    except Exception as e:
        raise AppApiException(f"An error occurred while reading the file: {str(e)}")


def sub_array(array: List, item_num=10):
    """
    Subtracts the first N items from the list.
    
    :param array: Input list.
    :param item_num: Number of items to subtract.
    :return: Modified list.
    """
    if not array:
        return []

    start_index = max(0, len(array) - item_num)
    end_index = min(len(array), start_index + item_num)
    return array[start_index:end_index]


def bulk_create_in_batches(model, data, batch_size=1000):
    """
    Creates multiple instances of the specified model in batches.
    
    :param model: Model class to create instances of.
    :param data: Data representing models to create.
    :param batch_size: Number of instances per batch.
    """
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        model.objects.bulk_create(batch)


def get_sha256_hash(_bytes):
    """
    Generates a SHA-256 hash digest from bytes.
    
    :param _bytes: Bytes to hash.
    :return: Hexadecimal representation of the SHA-256 hash.
    """
    sha256 = hashlib.sha256()
    sha256.update(_bytes)
    return sha256.hexdigest()

These changes enhance readability, maintainability, and error-handling capabilities of the code.
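As a quick sanity check, the new `get_sha256_hash` helper can be exercised standalone (a minimal sketch; only `hashlib` from the standard library is needed):

```python
import hashlib


def get_sha256_hash(_bytes):
    """Generates a SHA-256 hex digest from bytes."""
    sha256 = hashlib.sha256()
    sha256.update(_bytes)
    return sha256.hexdigest()


# Identical content always yields the same digest, which is what makes
# the hash usable as a deduplication key for file storage.
a = get_sha256_hash(b"hello world")
b = get_sha256_hash(b"hello world")
c = get_sha256_hash(b"hello world!")
print(a == b)   # True
print(a == c)   # False
print(len(a))   # 64 hex characters
```

Because the digest is deterministic per content, two uploads of the same bytes map to the same key, which is exactly the property the `File.save` deduplication below relies on.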

@@ -0,0 +1,37 @@
# Generated by Django 5.2 on 2025-05-07 03:40

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('knowledge', '0004_knowledge_file_size_limit_alter_document_status_and_more'),
]

operations = [
migrations.RemoveField(
model_name='file',
name='workspace_id',
),
migrations.AddField(
model_name='file',
name='file_size',
field=models.IntegerField(default=0, verbose_name='文件大小'),
),
migrations.AddField(
model_name='file',
name='sha256_hash',
field=models.CharField(default='', verbose_name='文件sha256_hash标识'),
),
migrations.AddField(
model_name='file',
name='source_id',
field=models.CharField(default='TEMPORARY_100_MINUTE', verbose_name='资源id'),
),
migrations.AddField(
model_name='file',
name='source_type',
field=models.CharField(choices=[('KNOWLEDGE', 'Knowledge'), ('APPLICATION', 'Application'), ('TEMPORARY_30_MINUTE', 'Temporary 30 Minute'), ('TEMPORARY_100_MINUTE', 'Temporary 120 Minute'), ('TEMPORARY_1_DAY', 'Temporary 1 Day')], default='TEMPORARY_100_MINUTE', verbose_name='资源类型'),
)
]
Contributor Author
The provided migration script looks generally correct, with no immediate issues. However, there are a few optimizations you can consider:

  1. Indexing Fields: If these table sizes become significantly larger in the future, consider adding indexes to fields that will be frequently queried.

    class Migration(migrations.Migration):
    
        dependencies = [
            ('knowledge', '0004_knowledge_file_size_limit_alter_document_status_and_more'),
        ]
    
        operations = [
            migrations.RemoveField(
                model_name='file',
                name='workspace_id',
            ),
            migrations.AddField(
                model_name='file',
                name='file_size',
                field=models.IntegerField(default=0, verbose_name='文件大小', db_index=True),
            ),
            migrations.AddField(
                model_name='file',
                name='sha256_hash',
                field=models.CharField(default='', verbose_name='文件SHA256哈希标识', db_index=True),
            ),
            migrations.AddField(
                model_name='file',
                name='source_id',
                field=models.CharField(default='TEMPORARY_100_MINUTE', verbose_name='资源ID'),
            ),
            migrations.AddField(
                model_name='file',
                name='source_type',
                field=models.CharField(choices=[('KNOWLEDGE', '知识'), ('APPLICATION', '应用'), ('TEMPORARY_30_MINUTE', '临时30分钟'), ('TEMPORARY_100_MINUTE', '临时120分钟'), ('TEMPORARY_1_DAY', '临时一天')],
                                    default='TEMPORARY_100_MINUTE', verbose_name='资源类型', db_index=True)
            )
        ]
  2. Data Type Considerations: Ensure that the data types used align with your expected storage requirements. For example, since source_type is a small closed set of values that is unlikely to change once set, a character type keeps it readable and flexible, even though an integer enum would be marginally more compact.

  3. Version Control Comments: Add comments explaining changes between versions to ease maintenance over time.

By implementing these optimizations, you can improve query performance and maintainability of your database schema.
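If lookups typically filter on source_type and source_id together, a composite index may serve better than two per-field db_index=True flags. A sketch, assuming Django's standard migrations API; the operation and the index name are illustrative, not part of this PR:

    from django.db import migrations, models

    # Hypothetical follow-up operation: one composite index for queries
    # that filter by source_type and source_id together, which separate
    # per-field indexes would not cover as efficiently.
    operations = [
        migrations.AddIndex(
            model_name='file',
            index=models.Index(
                fields=['source_type', 'source_id'],
                name='file_source_idx',
            ),
        ),
    ]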

34 changes: 30 additions & 4 deletions apps/knowledge/models/knowledge.py
@@ -3,13 +3,15 @@
import uuid_utils.compat as uuid
from django.contrib.postgres.search import SearchVectorField
from django.db import models
from django.db.models import QuerySet
from django.db.models.signals import pre_delete
from django.dispatch import receiver
from mptt.fields import TreeForeignKey
from mptt.models import MPTTModel

from common.db.sql_execute import select_one
from common.mixins.app_model_mixin import AppModelMixin
from common.utils.common import get_sha256_hash
from models_provider.models import Model
from users.models import User

@@ -221,6 +223,19 @@ class SearchMode(models.TextChoices):
blend = 'blend'


class FileSourceType(models.TextChoices):
# Knowledge base: deleted along with its knowledge base; source_id is the knowledge base id
KNOWLEDGE = "KNOWLEDGE"
# Application: deleted along with its application; source_id is the application id
APPLICATION = "APPLICATION"
# Temporary, 30 minutes: data is cleaned up after 30 minutes; source_id is TEMPORARY_30_MINUTE
TEMPORARY_30_MINUTE = "TEMPORARY_30_MINUTE"
# Temporary, 120 minutes: data is cleaned up after 120 minutes; source_id is TEMPORARY_100_MINUTE
TEMPORARY_120_MINUTE = "TEMPORARY_100_MINUTE"
TEMPORARY_1_DAY = "TEMPORARY_1_DAY"


class VectorField(models.Field):
def db_type(self, connection):
return 'vector'
@@ -246,16 +261,25 @@ class Meta:
class File(AppModelMixin):
id = models.UUIDField(primary_key=True, max_length=128, default=uuid.uuid7, editable=False, verbose_name="主键id")
file_name = models.CharField(max_length=256, verbose_name="文件名称", default="")
workspace_id = models.CharField(max_length=64, verbose_name="工作空间id", default="default", db_index=True)
file_size = models.IntegerField(verbose_name="文件大小", default=0)
sha256_hash = models.CharField(verbose_name="文件sha256_hash标识", default="")
source_type = models.CharField(verbose_name="资源类型", choices=FileSourceType,
default=FileSourceType.TEMPORARY_120_MINUTE.value)
source_id = models.CharField(verbose_name="资源id", default=FileSourceType.TEMPORARY_120_MINUTE.value)
loid = models.IntegerField(verbose_name="loid")
meta = models.JSONField(verbose_name="文件关联数据", default=dict)

class Meta:
db_table = "file"

def save(self, bytea=None, force_insert=False, force_update=False, using=None, update_fields=None):
result = select_one("SELECT lo_from_bytea(%s, %s::bytea) as loid", [0, bytea])
self.loid = result['loid']
sha256_hash = get_sha256_hash(bytea)
f = QuerySet(File).filter(sha256_hash=sha256_hash).first()
if f is not None:
self.loid = f.loid
else:
result = select_one("SELECT lo_from_bytea(%s, %s::bytea) as loid", [0, bytea])
self.loid = result['loid']
super().save()

def get_bytes(self):
@@ -265,4 +289,6 @@ def get_bytes(self):

@receiver(pre_delete, sender=File)
def on_delete_file(sender, instance, **kwargs):
select_one(f'SELECT lo_unlink({instance.loid})', [])
exist = QuerySet(File).filter(loid=instance.loid).exclude(id=instance.id).exists()
if not exist:
select_one(f'SELECT lo_unlink({instance.loid})', [])
Contributor Author
The code looks mostly clean and well-structured. However, there are a few areas where improvements can be made:

Improvements

  1. Duplicate Code: The save method embeds the large-object creation logic inline. Consider separating this logic into its own utility function to avoid repetition and ease testing.

  2. UUID Handling: Ensure that uuid.uuid7() generates a unique UUID each time it's called. This might need adjustment based on specific requirements.

  3. Null Values: Since source_type and source_id have defaults, ensure null values should be handled appropriately. You may want to add checks to handle these cases gracefully.

  4. SQL Injection Prevention: Always use parameterized queries when executing SQL statements to prevent SQL injection attacks.

  5. Error Handling: In the on_delete_file signal handler, consider adding error handling to manage exceptions properly.

  6. Optimization: For large databases, query performance might be optimized further depending on complexity and indexing strategies.

Suggested Corrections

# Separating the large-object creation logic into a helper
def create_large_object(bytea):
    # 0 asks PostgreSQL to allocate a new OID for the large object
    result = select_one("SELECT lo_from_bytea(%s, %s::bytea) as loid", [0, bytea])
    return result['loid']

# Updated save method calling the helper (deduplication kept as in the diff)
def save(self, bytea=None, force_insert=False, force_update=False, using=None, update_fields=None):
    sha256_hash = get_sha256_hash(bytea)
    existing = QuerySet(File).filter(sha256_hash=sha256_hash).first()
    self.loid = existing.loid if existing is not None else create_large_object(bytea)
    super().save()

@receiver(pre_delete, sender=File)
def on_delete_file(sender, instance, **kwargs):
    exist = QuerySet(File).filter(loid=instance.loid).exclude(id=instance.id).exists()
    if not exist:  # Only unlink when no other File row still references this loid
        select_one(f'SELECT lo_unlink({instance.loid})', [])

These changes aim to reduce redundancy, enhance readability, and improve maintainability of the codebase while ensuring security practices are followed.
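The deduplicate-on-save / unlink-on-last-delete pattern that the new save and on_delete_file pair implements can be illustrated without a database. A minimal sketch; the in-memory BlobStore and its names are illustrative stand-ins for the PostgreSQL large-object calls, not part of the PR:

```python
import hashlib


class BlobStore:
    """Content-addressed store: equal bytes share one stored blob."""

    def __init__(self):
        self._blobs = {}    # loid -> bytes
        self._by_hash = {}  # sha256 hex digest -> loid
        self._refs = {}     # loid -> reference count
        self._next_loid = 1

    def save(self, data: bytes) -> int:
        digest = hashlib.sha256(data).hexdigest()
        loid = self._by_hash.get(digest)
        if loid is None:  # first time this content is seen
            loid = self._next_loid
            self._next_loid += 1
            self._blobs[loid] = data
            self._by_hash[digest] = loid
            self._refs[loid] = 0
        self._refs[loid] += 1
        return loid

    def delete(self, loid: int, data: bytes) -> None:
        self._refs[loid] -= 1
        if self._refs[loid] == 0:  # last reference: actually unlink
            del self._blobs[loid]
            del self._by_hash[hashlib.sha256(data).hexdigest()]
            del self._refs[loid]


store = BlobStore()
a = store.save(b"same bytes")
b = store.save(b"same bytes")   # deduplicated: returns the same loid as a
store.delete(a, b"same bytes")  # blob survives, b still references it
print(a == b, b in store._blobs)  # True True
```

This mirrors the merged code: save reuses an existing loid when the sha256 matches, and on_delete_file only issues lo_unlink once no other File row points at that loid.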
