# File Metadata and Content Index Generator

This Python script builds a metadata and content index for a simulated filesystem. It generates 100 JSON files, each representing a document with attributes such as filename, path, size, creation and modification dates, content chunks, file type, and tags.  
The tool is meant for prototyping, testing, and developing file-based search, retrieval, or recommendation systems that rely on metadata and content.

## Features

- **Content-Driven Filenames**  
  - Filenames are generated based on the actual content of the file, extracting a keyword from the first content chunk.
  - The base filename is limited to a maximum of 8 characters (before the file extension).
- **Randomized File Paths**  
  - Each file is assigned a random path starting from root (`/`), up to three directory levels deep.
- **Comprehensive Metadata**  
  - **filename** (content-based, up to 8 chars)
  - **path**
  - **sizekB** (file size in kilobytes)
  - **creation_date** (randomly between 2001 and 2025)
  - **last_modified_date** (after creation_date, within the same range)
  - **content_chunks** (1–3 realistic text or table chunks per file)
  - **file_type** (`xlsx`, `pdf`, `docx`, `txt`, `pptx`, `md`)
  - **tags** (1–3 tags per file: `sport`, `cinema`, `fashion`, `cars`; some files have multiple tags)
- **Realistic Content**  
  - Text chunks are generated with the Faker library from tag-specific word lists, so they stay relevant to the assigned tags.
  - **Text-like files (`txt`, `md`, `pdf`, `docx`, `pptx`):** Each chunk contains a short Faker paragraph for every assigned tag.
  - **Spreadsheets (`xlsx`):** Chunks are fixed CSV-style tables with tag-specific headers and sample rows.
- **Flexible Output**  
  - The script outputs a JSON index for each file, making it easy to ingest into search engines, vector databases, or other metadata-driven applications.

## Example

Each generated JSON file looks like this:

```json
{
  "filename": "basketba.pdf",
  "path": "/alpha/beta/basketba.pdf",
  "sizekB": 1234,
  "last_modified_date": "2023-05-15",
  "creation_date": "2023-01-10",
  "content_chunks": [
    "The basketball player scored a goal. The championship was intense. The athlete trained hard."
  ],
  "file_type": "pdf",
  "tags": ["sport"]
}
```


---

## About Production File Extraction

This script is intended for development and testing. In a production environment, you would typically use specialized tools to extract metadata and content from real files. Here are some examples of how this is done:

- **PDF Documents:**  
  - **Tool:** pypdf (the successor to PyPDF2) or Apache Tika  
  - **What is extracted:**  
    - Basic metadata (author, title, creation/modification dates)
    - Full text content for indexing or search
- **Word Documents:**  
  - **Tool:** python-docx or Apache Tika  
  - **What is extracted:**  
    - Document properties (author, title, keywords)
    - Paragraphs and tables for content analysis
- **Excel Files:**  
  - **Tool:** openpyxl or Apache Tika  
  - **What is extracted:**  
    - Sheet names, column headers, and cell data
    - Custom metadata stored in properties
- **Text Files:**  
  - **Tool:** Built-in Python file handling or Apache Tika  
  - **What is extracted:**  
    - File content for indexing or search
    - Stat metadata (size, timestamps)
- **Markdown Files:**  
  - **Tool:** Built-in Python file handling or specialized parsers  
  - **What is extracted:**  
    - Markdown content for rendering or indexing
    - Front matter (YAML metadata, if present)

These tools and techniques allow you to extract both structured metadata (like creation date and author) and unstructured content (like paragraphs and tables) from a wide range of file formats, supporting robust search and analysis workflows.
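
As a rough illustration of the plain-text case above, the sketch below builds an index entry from stat metadata and file content using only the standard library. The function name `index_text_file` and the `chunk_size` parameter are hypothetical, and the output keys simply mirror the generated JSON shown earlier. Creation time is left out because it is not reported consistently across platforms.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def index_text_file(path, chunk_size=500):
    """Build an index entry for a plain-text file (hypothetical helper)."""
    p = Path(path)
    st = p.stat()  # size and timestamps come from the filesystem
    text = p.read_text(encoding="utf-8", errors="replace")
    # Split the content into fixed-size character chunks (an assumption;
    # real pipelines often chunk by paragraph or token count instead).
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return {
        "filename": p.name,
        "path": str(p),
        "sizekB": round(st.st_size / 1024, 2),
        "last_modified_date": modified.date().isoformat(),
        "content_chunks": chunks,
        "file_type": p.suffix.lstrip(".").lower(),
    }


# Usage: index a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("The athlete trained hard. " * 40, encoding="utf-8")
    entry = index_text_file(sample, chunk_size=200)

print(entry["filename"], entry["file_type"], len(entry["content_chunks"]))
```

Formats such as PDF, Word, or Excel would need one of the libraries listed above in place of `read_text`.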


In [1]:
#!pip install faker

In [2]:
import os
import json
import random
import re
from datetime import date, timedelta
from faker import Faker

fake = Faker()

# Ensure output directory exists
os.makedirs('../dataset/generated_files', exist_ok=True)

file_types = ['xlsx', 'pdf', 'docx', 'txt', 'pptx', 'md']
tags_list = ['sport', 'cinema', 'fashion', 'cars']

def random_date(start_year, end_year):
    """Generate a random date between start_year and end_year."""
    year = random.randint(start_year, end_year)
    month = random.randint(1, 12)
    day = random.randint(1, 28)
    return date(year, month, day)

def random_path(max_levels=3):
    """Generate a random path starting with / and up to max_levels deep."""
    levels = random.randint(1, max_levels)
    path = "/"
    for _ in range(levels):
        path += fake.word().lower() + "/"
    return path.rstrip('/')

def clean_filename(keyword):
    """Remove special characters from keyword."""
    cleaned = re.sub(r'[^a-zA-Z0-9_]', '', keyword)
    if not cleaned:
        return "file"
    return cleaned.lower()

def extract_keyword(content_chunk):
    """Return the first non-stopword token of the chunk, cleaned for use as a filename."""
    # Common words and table header names that should not become filenames.
    stopwords = {'the', 'a', 'an', 'and', 'name', 'title', 'designer', 'model', 'event', 'collection', 'score', 'year', 'brand'}
    # Note: split() breaks on whitespace only, so a CSV header row such as
    # "name,event,score" is treated as a single token.
    for word in content_chunk.split():
        if word.lower() not in stopwords:
            return clean_filename(word)
    # If every token is a stopword, fall back to a random tag
    return clean_filename(random.choice(tags_list))

for i in range(1, 101):
    file_type = random.choice(file_types)
    num_tags = random.randint(1, 3)
    tags = random.sample(tags_list, num_tags)  # num_tags (1-3) never exceeds len(tags_list)
    num_chunks = random.randint(1, 3)
    content_chunks = []
    for _ in range(num_chunks):
        chunk = ""
        if file_type in ['txt', 'md', 'pdf', 'docx', 'pptx']:
            for tag in tags:
                if tag == 'sport':
                    chunk += fake.paragraph(nb_sentences=1, ext_word_list=['football', 'basketball', 'tennis', 'championship', 'athlete']) + " "
                elif tag == 'cinema':
                    chunk += fake.paragraph(nb_sentences=1, ext_word_list=['movie', 'actor', 'director', 'scene', 'award']) + " "
                elif tag == 'fashion':
                    chunk += fake.paragraph(nb_sentences=1, ext_word_list=['dress', 'designer', 'trend', 'runway', 'model']) + " "
                elif tag == 'cars':
                    chunk += fake.paragraph(nb_sentences=1, ext_word_list=['car', 'engine', 'speed', 'model', 'race']) + " "
            content_chunks.append(chunk.strip())
        elif file_type == 'xlsx':
            table = ""
            if 'sport' in tags:
                table += "name,event,score\nJohn,football,3\nLisa,tennis,2\n"
            if 'cinema' in tags:
                table += "title,director,year\nInception,Christopher Nolan,2010\n"
            if 'fashion' in tags:
                table += "designer,collection,season\nChanel,Spring,2023\n"
            if 'cars' in tags:
                table += "model,brand,year\nModel 3,Tesla,2022\n"
            content_chunks.append(table.strip())

    # Extract keyword from first content chunk
    keyword = extract_keyword(content_chunks[0])
    keyword = keyword[:8]  # Truncate to 8 chars
    filename = f"{keyword}.{file_type}"

    path = random_path(max_levels=3) + "/" + filename
    size = random.randint(100, 10000)

    creation_date = random_date(2001, 2025)
    last_modified_date = creation_date + timedelta(days=random.randint(1, (date(2025,12,31) - creation_date).days))
    creation_date = creation_date.isoformat()
    last_modified_date = last_modified_date.isoformat()

    data = {
        "filename": filename,
        "path": path,
        "sizekB": size,
        "last_modified_date": last_modified_date,
        "creation_date": creation_date,
        "content_chunks": content_chunks,
        "file_type": file_type,
        "tags": tags
    }

    with open(f'../dataset/generated_files/file_{i}_index.json', 'w') as f:
        json.dump(data, f, indent=2)

print("Generated 100 JSON index files in 'generated_files' directory!")


Generated 100 JSON index files in 'generated_files' directory!


# JSON Index to JSONL Combiner

This script reads all JSON metadata files from `../dataset/generated_files` and combines them into a single JSONL file (`../dataset/combined_index.jsonl`). Each line in the resulting file is a standalone JSON object representing the metadata and content of one file.

## Usage

1. Place your individual JSON index files in `../dataset/generated_files`.
2. Run this script.
3. The output will be a JSONL file at `../dataset/combined_index.jsonl`, with one JSON object per line.

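**Fields read from each input file:**  
- `filename`
- `path`
- `sizekB`
- `last_modified_date`
- `creation_date`
- `content_chunks`
- `file_type`
- `tags`

Each output record renames these to `Filename`, `Path`, `Size_kB`, `Content` (all chunks joined with spaces), `File_Type` (wrapped in a list), `Tags`, `Creation_Date`, and `Last_Modified_Date` (both dates as UTC Unix timestamps), and adds a sequential `id`.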


In [3]:
import os
import json
from datetime import datetime, timezone

input_dir = '../dataset/generated_files'
output_file = '../dataset/combined_index.jsonl'

def convert_to_timestamp(datestr):
    try:
        # Convert "YYYY-MM-DD" to timestamp (UTC)
        return int(datetime.fromisoformat(datestr).replace(tzinfo=timezone.utc).timestamp())
    except Exception as e:
        print(f"Error parsing date '{datestr}': {e}")
        return None

def normalize_entry(data, file_id):
    creation_ts = convert_to_timestamp(data.get("creation_date"))
    modified_ts = convert_to_timestamp(data.get("last_modified_date"))

    return {
        "id": file_id,
        "Filename": data.get("filename"),
        "Path": data.get("path"),
        "Size_kB": float(data.get("sizekB", 0)),
        "Content": " ".join(data.get("content_chunks", [])),
        "File_Type": [data.get("file_type")] if data.get("file_type") else [],
        "Tags": data.get("tags", []),
        "Creation_Date": creation_ts,
        "Last_Modified_Date": modified_ts,
    }

with open(output_file, 'w', encoding='utf-8') as outfile:
    file_counter = 1
    for fname in sorted(os.listdir(input_dir)):
        if fname.endswith('.json'):
            with open(os.path.join(input_dir, fname), 'r', encoding='utf-8') as f:
                data = json.load(f)
                normalized = normalize_entry(data, f"id_{file_counter}")
                json.dump(normalized, outfile)
                outfile.write('\n')
                file_counter += 1

print(f"Combined JSONL written to {output_file}")


Combined JSONL written to ../dataset/combined_index.jsonl
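
The combiner stores `Creation_Date` and `Last_Modified_Date` as Unix timestamps (seconds since the epoch, midnight UTC). The sketch below shows the conversion in both directions, which can help when reading the values back or building date filters. The helper names `date_to_epoch` and `epoch_to_date` are hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone


def date_to_epoch(datestr):
    """Convert 'YYYY-MM-DD' to a Unix timestamp at midnight UTC."""
    dt = datetime.fromisoformat(datestr).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def epoch_to_date(ts):
    """Convert a Unix timestamp back to a 'YYYY-MM-DD' string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


print(date_to_epoch("2023-01-10"))   # 1673308800
print(epoch_to_date(1574121600))     # 2019-11-19
```

Pinning the timezone to UTC keeps the values independent of the machine that runs the notebook.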


In [4]:
import pandas as pd
import json

# Read the normalized JSONL file into a list of dicts
records = []
with open("../dataset/combined_index.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Create DataFrame with a descriptive name
filedf = pd.DataFrame(records)

# Show all content in columns without truncation
pd.set_option('display.max_colwidth', None)

# Print available columns
print(list(filedf.columns))

# Display a few selected columns in a nice format
filedf[[
    "id",
    "Filename",
    "File_Type",
    "Tags",
    "Path",
    "Creation_Date",
    "Last_Modified_Date",
    "Content"
]].head(50)


['id', 'Filename', 'Path', 'Size_kB', 'Content', 'File_Type', 'Tags', 'Creation_Date', 'Last_Modified_Date']


Unnamed: 0,id,Filename,File_Type,Tags,Path,Creation_Date,Last_Modified_Date,Content
0,id_1,tennis.md,[md],"[sport, fashion]",/almost/tennis.md,1574121600,1744502400,Tennis championship basketball football tennis. Trend dress trend. Football football athlete football basketball football tennis tennis. Designer model runway trend designer designer designer dress.
1,id_2,tennis.pptx,[pptx],"[sport, fashion, cars]",/area/tennis.pptx,1471651200,1619136000,Tennis basketball championship athlete championship basketball. Model trend trend model designer designer. Car model speed model race model speed.
2,id_3,race.txt,[txt],[cars],/worry/brother/particular/race.txt,1708473600,1709942400,Race model race speed. Race speed car car speed model.
3,id_4,dress.docx,[docx],"[fashion, cars]",/without/role/dress.docx,1378425600,1443484800,Dress model model trend runway trend model runway. Engine model model engine engine speed engine. Designer designer trend model trend designer. Speed speed race speed.
4,id_5,scene.txt,[txt],[cinema],/room/eye/medical/scene.txt,1341964800,1755043200,Scene scene movie director actor award director. Movie actor director. Scene movie award award movie award director movie.
5,id_6,car.md,[md],"[cars, cinema]",/individual/bar/reality/car.md,1295654400,1596153600,Car speed race speed model race. Actor actor award movie award.
6,id_7,race.txt,[txt],"[cars, sport]",/money/health/race.txt,1366934400,1491955200,Model race car car car car model. Championship football championship athlete football. Model model speed speed race. Football championship football football basketball basketball football football.
7,id_8,nameeven.xlsx,[xlsx],"[fashion, sport]",/other/nameeven.xlsx,1186876800,1433721600,"name,event,score\nJohn,football,3\nLisa,tennis,2\ndesigner,collection,season\nChanel,Spring,2023 name,event,score\nJohn,football,3\nLisa,tennis,2\ndesigner,collection,season\nChanel,Spring,2023 name,event,score\nJohn,football,3\nLisa,tennis,2\ndesigner,collection,season\nChanel,Spring,2023"
8,id_9,titledir.xlsx,[xlsx],"[cinema, cars]",/husband/it/titledir.xlsx,1708300800,1714262400,"title,director,year\nInception,Christopher Nolan,2010\nmodel,brand,year\nModel 3,Tesla,2022"
9,id_10,champion.docx,[docx],"[sport, fashion, cinema]",/kid/spring/champion.docx,1113523200,1732579200,Championship football tennis tennis basketball. Trend designer trend dress. Actor award director scene actor. Football football tennis championship football basketball tennis. Trend trend runway designer model runway. Actor actor director director movie director movie. Athlete basketball basketball basketball basketball athlete. Designer model trend model dress designer trend. Movie scene actor scene director award.
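
Filenames such as `nameeven.xlsx` and `titledir.xlsx` in the output above come from the keyword extraction splitting on whitespace only: the CSV header row `name,event,score` is one token, so the header stopwords never match. The sketch below reproduces that behavior in simplified form and contrasts it with a hypothetical variant that tokenizes on non-word characters. Both function names are illustrative, and the `"file"` fallback stands in for the generator's random-tag fallback.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "name", "title", "designer", "model",
             "event", "collection", "score", "year", "brand"}


def keyword_whitespace(chunk):
    """Simplified version of the generator's logic: whitespace tokens only."""
    for word in chunk.split():
        if word.lower() not in STOPWORDS:
            cleaned = re.sub(r"[^a-zA-Z0-9_]", "", word).lower()
            return (cleaned or "file")[:8]
    return "file"  # the generator falls back to a random tag instead


def keyword_tokenized(chunk):
    """Hypothetical variant: split on any non-word character first."""
    for word in re.findall(r"[A-Za-z0-9_]+", chunk):
        if word.lower() not in STOPWORDS:
            return word.lower()[:8]
    return "file"


table = "name,event,score\nJohn,football,3\nLisa,tennis,2"
print(keyword_whitespace(table))  # nameeven
print(keyword_tokenized(table))   # john
```

Whether the tokenized variant is preferable depends on how the filenames are used downstream; it is shown only to make the difference visible.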


## Create file_type and tags categories

In [5]:
import json

file_types = set()
tags = set()

with open(output_file, "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        obj = json.loads(line)
        # File_Type is a list, so iterate and strip each
        if "File_Type" in obj and obj["File_Type"]:
            file_types.update([ft.strip() for ft in obj["File_Type"]])
        # Tags is a list
        if "Tags" in obj and obj["Tags"]:
            tags.update([t.strip() for t in obj["Tags"]])

result = {
    "FileType": sorted(file_types),
    "Tags": sorted(tags),
}

with open("../dataset/categories.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)


### Start `superlinked.server` with `APP_ID=filesearch` so that the collection name is `filesearch`

```shell
python3.12 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
APP_ID=filesearch APP_MODULE_PATH=superlinked_app python -m superlinked.server
```

### Load `combined_index.jsonl` in smaller batches

In [6]:
import json
import requests

def send_batches(data, batch_size=50):
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        response = requests.post(
            'http://localhost:8080/data-loader/file_document/run',
            json=batch,
            headers={'Accept': 'application/json'}
        )
        print(f"Batch {i//batch_size + 1} response:", response.status_code, response.text)

# Load your full dataset
with open('../dataset/combined_index.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

send_batches(data, batch_size=10)  # Adjust batch size as needed


Batch 1 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 2 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 3 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 4 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 5 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 6 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 7 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 8 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 9 response: 202 {"result":"Background task successfully started with name: file_document"}
Batch 10 response: 202 {"result":"Background task successfully started with name: file_document"}
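
The ten responses above are consistent with 100 records sent in batches of 10. A minimal sketch of the slicing logic, separated from the HTTP call so it can be checked offline, is shown below. `iter_batches` is a hypothetical helper name, and the records are placeholders rather than real index entries. Note that a 202 status only indicates that the server accepted the request and started a background task, not that ingestion has finished.

```python
def iter_batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Placeholder records standing in for the lines of combined_index.jsonl.
records = [{"id": f"id_{n}"} for n in range(1, 101)]
batches = list(iter_batches(records, 10))
print(len(batches))  # 10

# The last batch is simply shorter when the sizes do not divide evenly.
uneven = [len(b) for b in iter_batches(records[:25], 10)]
print(uneven)  # [10, 10, 5]
```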


#### For comparison: the original curl command, which runs faster than the batched Python loader

```shell
curl -X POST http://<server-host>:<port>/data-loader/file_document/run -H 'accept: application/json' -d ''

```