```{contents}
```

## Data Ingestion

Data ingestion is the **process of collecting raw data from various sources and moving it into a system where it can be processed, stored, and used.**

### Inputs

* Databases
* Files (PDF, CSV, DOCX)
* APIs
* Logs
* Streams

### Operations

1. **Extract**
   Pull data from the source.
2. **Transform (optional)**
   Clean, parse, normalize formats.
3. **Load**
   Store data into a target system.
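
The three operations above can be sketched end to end on a tiny in-memory dataset. This is a minimal illustration only: the CSV content, the field names, and the `extract`/`transform`/`load` helpers are assumptions made up for the example, and the "target system" is just a JSONL string.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical source: a small CSV export with inconsistent formatting.
RAW_CSV = "id,name,signup_date\n1, Alice ,2024-01-05\n2,BOB,2024-02-11\n"

def extract(raw: str) -> List[Dict]:
    """Extract: pull rows out of the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: List[Dict]) -> List[Dict]:
    """Transform: cast ids to int, trim whitespace, normalize name casing."""
    return [
        {
            "id": int(row["id"]),
            "name": row["name"].strip().title(),
            "signup_date": row["signup_date"],
        }
        for row in rows
    ]

def load(rows: List[Dict]) -> str:
    """Load: serialize to JSONL (a real pipeline would write to a warehouse or lake)."""
    return "\n".join(json.dumps(row) for row in rows)

jsonl = load(transform(extract(RAW_CSV)))
print(jsonl)
```

Keeping each stage as a separate function makes it easy to skip the optional transform step or to swap the load target later.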

### Outputs

* Data Warehouse (e.g., BigQuery, Snowflake)
* Data Lake (e.g., S3, GCS)
* Operational DB
* Vector DB (for LLM use)

### Types

* **Batch ingestion**: data is collected and loaded in large chunks at scheduled intervals.
* **Streaming ingestion**: data flows in continuously and is processed in near real time.
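
The difference between the two types can be sketched with plain Python. The helper names `batch_ingest` and `stream_ingest` and the event list are hypothetical, and the uppercase call merely stands in for real per-record processing.

```python
from typing import Iterable, Iterator, List

def batch_ingest(records: List[str], batch_size: int) -> List[List[str]]:
    """Batch: group already-collected records into fixed-size chunks."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def stream_ingest(source: Iterable[str]) -> Iterator[str]:
    """Streaming: handle each record as soon as it arrives."""
    for record in source:
        yield record.upper()  # stand-in for per-record processing

events = ["login", "click", "purchase", "logout", "login"]

batches = batch_ingest(events, batch_size=2)
streamed = list(stream_ingest(iter(events)))

print(batches)
print(streamed)
```

The batch version needs the full list up front, while the generator-based version never holds more than one record at a time.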

### Common Tools

| Task                 | Tools                             |
| -------------------- | --------------------------------- |
| Extract              | Kafka, APIs, CDC tools (Debezium) |
| Transform            | Spark, dbt, Pandas                |
| Load (orchestration) | Airflow, Dagster, Prefect         |

### For LLM Context

Ingestion includes:

* Convert all source documents to plain text
* Clean and remove noise
* Split into chunks
* Create embeddings
* Store in vector database

End result: data is **structured, searchable, and ready for retrieval or model training.**

````{dropdown} Click here for Sections
```{tableofcontents}
```
````

In [1]:
# Import required libraries for data ingestion
import os
import glob
import pandas as pd
from pathlib import Path
import re
from typing import List, Dict
import json
from datetime import datetime

# Set up paths
BASE_DIR = Path(r"c:\Github\Learn-GenAI\genai_book")
DATASETS_DIR = BASE_DIR / "datasets"
TXT_DIR = DATASETS_DIR / "txt"

print(f"Base directory: {BASE_DIR}")
print(f"Datasets directory: {DATASETS_DIR}")
print(f"Text files directory: {TXT_DIR}")
print(f"Available categories: {[d.name for d in TXT_DIR.iterdir() if d.is_dir()]}")

Base directory: c:\Github\Learn-GenAI\genai_book
Datasets directory: c:\Github\Learn-GenAI\genai_book\datasets
Text files directory: c:\Github\Learn-GenAI\genai_book\datasets\txt
Available categories: ['business', 'entertainment', 'food', 'graphics', 'historical', 'medical', 'politics', 'space', 'sport', 'technologie']


In [2]:
class DataIngestionPipeline:
    """
    A comprehensive data ingestion pipeline for LLM training/fine-tuning
    """
    
    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self.metadata = []
        self.processed_data = []
    
    def extract_text_files(self, category: str = None) -> List[Dict]:
        """
        Extract text files from specified category or all categories
        """
        extracted_files = []
        
        if category:
            category_path = self.data_dir / category
            if not category_path.exists():
                print(f"Category '{category}' not found!")
                return extracted_files
            categories = [category]
        else:
            categories = [d.name for d in self.data_dir.iterdir() if d.is_dir()]
        
        for cat in categories:
            cat_path = self.data_dir / cat
            txt_files = list(cat_path.glob("*.txt"))
            
            print(f"Processing category: {cat} ({len(txt_files)} files)")
            
            for file_path in txt_files:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                    
                    file_info = {
                        'file_path': str(file_path),
                        'filename': file_path.name,
                        'category': cat,
                        'content': content,
                        'char_count': len(content),
                        'word_count': len(content.split()),
                        'line_count': len(content.split('\n')),
                        'extracted_at': datetime.now().isoformat()
                    }
                    
                    extracted_files.append(file_info)
                    
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
        
        print(f"Successfully extracted {len(extracted_files)} files")
        return extracted_files
    
    def clean_text(self, text: str) -> str:
        """
        Clean and normalize text content
        """
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Drop everything except word characters, whitespace, and basic punctuation
        # (this also strips symbols such as $, %, and parentheses)
        text = re.sub(r'[^\w\s\.,!?;:\'\"-]', '', text)
        
        # Remove multiple consecutive punctuation
        text = re.sub(r'([.!?]){2,}', r'\1', text)
        
        # Strip leading/trailing whitespace
        text = text.strip()
        
        return text
    
    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """
        Split text into overlapping chunks for better context preservation
        """
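        # Guard against a zero or negative step in the range() below
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")

        words = text.split()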
        chunks = []
        
        for i in range(0, len(words), chunk_size - overlap):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append(chunk)
            
            # Break if we've reached the end
            if i + chunk_size >= len(words):
                break
        
        return chunks
    
    def transform_data(self, extracted_files: List[Dict], chunk_size: int = 512) -> List[Dict]:
        """
        Transform raw text data into LLM-ready format
        """
        transformed_data = []
        
        for file_info in extracted_files:
            # Clean the text
            cleaned_text = self.clean_text(file_info['content'])
            
            # Skip very short texts
            if len(cleaned_text.split()) < 10:
                continue
            
            # Create chunks
            chunks = self.chunk_text(cleaned_text, chunk_size)
            
            for i, chunk in enumerate(chunks):
                chunk_info = {
                    'source_file': file_info['filename'],
                    'category': file_info['category'],
                    'chunk_id': f"{file_info['filename']}_chunk_{i+1}",
                    'text': chunk,
                    'chunk_size': len(chunk.split()),
                    'chunk_index': i,
                    'total_chunks': len(chunks),
                    'processed_at': datetime.now().isoformat()
                }
                
                transformed_data.append(chunk_info)
        
        print(f"Created {len(transformed_data)} text chunks from {len(extracted_files)} files")
        return transformed_data
    
    def save_processed_data(self, data: List[Dict], output_file: str = "processed_data.jsonl"):
        """
        Save processed data in JSONL format for LLM training
        """
        output_path = BASE_DIR / output_file
        
        with open(output_path, 'w', encoding='utf-8') as f:
            for item in data:
                json.dump(item, f, ensure_ascii=False)
                f.write('\n')
        
        print(f"Saved {len(data)} processed items to {output_path}")
        return output_path

# Initialize the pipeline
pipeline = DataIngestionPipeline(TXT_DIR)
print("Data ingestion pipeline initialized!")

Data ingestion pipeline initialized!
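
To see what the overlap in `chunk_text` does, here is a small standalone sketch of the same sliding-window idea on ten dummy words. The function name `chunk_words` and the tiny sizes are chosen only for illustration; the pipeline itself uses 512-word chunks with a 50-word overlap by default.

```python
from typing import List

def chunk_words(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Sliding window over words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap of 1 the window advances 3 words at a time, so the last word of each chunk reappears as the first word of the next one.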


In [3]:
# Extract text files from all categories
print("=== EXTRACTION PHASE ===")
extracted_files = pipeline.extract_text_files()

# Display statistics
if extracted_files:
    df_stats = pd.DataFrame(extracted_files)
    print(f"\nExtraction Statistics:")
    print(f"Total files: {len(extracted_files)}")
    print(f"Categories: {df_stats['category'].nunique()}")
    print(f"Total characters: {df_stats['char_count'].sum():,}")
    print(f"Total words: {df_stats['word_count'].sum():,}")
    
    print(f"\nFiles per category:")
    category_counts = df_stats['category'].value_counts()
    for cat, count in category_counts.items():
        print(f"  {cat}: {count} files")
    
    print(f"\nAverage file sizes:")
    avg_stats = df_stats.groupby('category')[['char_count', 'word_count']].mean()
    print(avg_stats.round(0))

=== EXTRACTION PHASE ===
Processing category: business (100 files)
Processing category: entertainment (100 files)
Processing category: food (100 files)
Processing category: graphics (100 files)
Processing category: historical (100 files)
Processing category: medical (100 files)
Processing category: politics (100 files)
Processing category: space (100 files)
Processing category: sport (100 files)
Processing category: technologie (100 files)
Successfully extracted 1000 files

Extraction Statistics:
Total files: 1000
Categories: 10
Total characters: 2,573,569
Total words: 410,351

Files per category:
  business: 100 files
  entertainment: 100 files
  food: 100 fi

In [4]:
# Transform the extracted data into LLM-ready format
print("\n=== TRANSFORMATION PHASE ===")

# Transform with different chunk sizes for different use cases
# (sizes are measured in whitespace-separated words, not model tokens)
chunk_sizes = {
    'small': 256,   # For quick processing/embedding
    'medium': 512,  # General-purpose default
    'large': 1024   # For context-heavy tasks
}

transformed_datasets = {}

for size_name, chunk_size in chunk_sizes.items():
    print(f"\nCreating {size_name} chunks (size: {chunk_size} words)...")
    transformed_data = pipeline.transform_data(extracted_files, chunk_size=chunk_size)
    transformed_datasets[size_name] = transformed_data
    
    # Display transformation statistics
    if transformed_data:
        print(f"  Created {len(transformed_data)} chunks")
        avg_chunk_size = sum(item['chunk_size'] for item in transformed_data) / len(transformed_data)
        print(f"  Average chunk size: {avg_chunk_size:.1f} words")

# Show sample of transformed data
print(f"\n=== SAMPLE TRANSFORMED DATA ===")
if transformed_datasets['medium']:
    sample = transformed_datasets['medium'][0]
    print(f"Sample chunk from '{sample['source_file']}':")
    print(f"Category: {sample['category']}")
    print(f"Chunk ID: {sample['chunk_id']}")
    print(f"Text preview: {sample['text'][:200]}...")
    print(f"Chunk size: {sample['chunk_size']} words")


=== TRANSFORMATION PHASE ===

Creating small chunks (size: 256 words)...
Created 2259 text chunks from 1000 files
  Created 2259 chunks
  Average chunk size: 208.8 words

Creating medium chunks (size: 512 words)...
Created 1356 text chunks from 1000 files
  Created 1356 chunks
  Average chunk size: 314.6 words

Creating large chunks (size: 1024 words)...
Created 1103 text chunks from 1000 files
  Created 1103 chunks
  Average chunk size: 375.3 words

=== SAMPLE TRANSFORMED DATA ===
Sample chunk from 'business_1.txt':
Category: business
Chunk ID: business_1.txt_chunk_1
Text preview: Lufthansa flies back to profit German airline Lufthansa has returned to profit in 2004 after posting huge losses in 2003. In a

In [5]:
# Save processed data in different formats for various LLM use cases
print("\n=== LOADING/SAVING PHASE ===")

# 1. Save as JSONL for general LLM training
for size_name, data in transformed_datasets.items():
    output_file = f"llm_data_{size_name}_chunks.jsonl"
    saved_path = pipeline.save_processed_data(data, output_file)

# 2. Create training format for instruction tuning
def create_instruction_format(data: List[Dict]) -> List[Dict]:
    """Convert to instruction-following format"""
    instruction_data = []
    
    for item in data:
        # Create different instruction templates based on category
        category = item['category']
        text = item['text']
        
        if category == 'business':
            instruction = "Analyze this business text and provide insights:"
        elif category == 'entertainment':
            instruction = "Summarize this entertainment content:"
        elif category == 'food':
            instruction = "Describe this food-related text:"
        elif category == 'medical':
            instruction = "Explain this medical information in simple terms:"
        elif category == 'technologie':
            instruction = "Explain this technology concept:"
        else:
            instruction = f"Provide a summary of this {category} text:"
        
        formatted_item = {
            "instruction": instruction,
            "input": text,
            # Placeholder target: replace with a real response before fine-tuning
            "output": f"This is a {category} text that discusses various aspects related to the topic.",
            "category": category,
            "source": item['source_file'],
            "chunk_id": item['chunk_id']
        }
        
        instruction_data.append(formatted_item)
    
    return instruction_data

# Create instruction dataset
instruction_dataset = create_instruction_format(transformed_datasets['medium'])
instruction_path = pipeline.save_processed_data(instruction_dataset, "instruction_dataset.jsonl")

print(f"\nCreated instruction dataset with {len(instruction_dataset)} examples")


=== LOADING/SAVING PHASE ===
Saved 2259 processed items to c:\Github\Learn-GenAI\genai_book\llm_data_small_chunks.jsonl
Saved 1356 processed items to c:\Github\Learn-GenAI\genai_book\llm_data_medium_chunks.jsonl
Saved 1103 processed items to c:\Github\Learn-GenAI\genai_book\llm_data_large_chunks.jsonl
Saved 1356 processed items to c:\Github\Learn-GenAI\genai_book\instruction_dataset.jsonl

Created instruction dataset with 1356 examples
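
Before handing JSONL files to a training or indexing job, it can help to confirm that they read back cleanly. The sketch below writes two made-up records to a temporary file and parses them again; `read_jsonl` is a hypothetical helper and the records only imitate the chunk format used above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

def read_jsonl(path: Path) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical records shaped like the chunks saved by the pipeline.
records = [
    {"chunk_id": "demo.txt_chunk_1", "category": "demo", "text": "first chunk"},
    {"chunk_id": "demo.txt_chunk_2", "category": "demo", "text": "second chunk"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for item in records:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(loaded == records)
```

Because every line is an independent JSON document, the same reader can process very large files line by line without loading them whole.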


In [6]:
# Create embeddings for vector database (RAG use case)
print("\n=== VECTOR EMBEDDINGS FOR RAG ===")

try:
    # Install required packages if not available
    import sentence_transformers
    print("✓ sentence-transformers already installed")
except ImportError:
    print("Installing sentence-transformers...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "sentence-transformers"])
    import sentence_transformers

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"✓ Loaded embedding model: {model}")

# Create embeddings for a sample of data (to avoid memory issues)
sample_size = min(100, len(transformed_datasets['medium']))
sample_data = transformed_datasets['medium'][:sample_size]

print(f"Creating embeddings for {len(sample_data)} text chunks...")

# Extract texts for embedding
texts = [item['text'] for item in sample_data]

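# Generate embeddings (input beyond the model's max_seq_length, 256 tokens
# for this model, is truncated before encoding)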
embeddings = model.encode(texts, show_progress_bar=True)
print(f"Created embeddings shape: {embeddings.shape}")

# Create vector database format
vector_data = []
for i, (item, embedding) in enumerate(zip(sample_data, embeddings)):
    vector_item = {
        'id': item['chunk_id'],
        'text': item['text'],
        'embedding': embedding.tolist(),  # Convert to list for JSON serialization
        'metadata': {
            'category': item['category'],
            'source_file': item['source_file'],
            'chunk_size': item['chunk_size'],
            'chunk_index': item['chunk_index']
        }
    }
    vector_data.append(vector_item)

# Save vector data
vector_path = BASE_DIR / "vector_embeddings.jsonl"
with open(vector_path, 'w', encoding='utf-8') as f:
    for item in vector_data:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

print(f"✓ Saved {len(vector_data)} embeddings to {vector_path}")
print(f"✓ Embedding dimension: {len(embeddings[0])}")

# Show similarity example
if len(embeddings) >= 2:
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Calculate similarity between first two chunks
    sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(f"\nSample similarity between chunks:")
    print(f"Chunk 1: {texts[0][:100]}...")
    print(f"Chunk 2: {texts[1][:100]}...")
    print(f"Cosine similarity: {sim:.4f}")


=== VECTOR EMBEDDINGS FOR RAG ===
✓ sentence-transformers already installed
✓ Loaded embedding model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
Creating embeddings for 100 text chunks...
Created embeddings shape: (100, 384)
✓ Saved 100 embeddings to c:\Github\Learn-GenAI\genai_book\vector_embeddings.jsonl
✓ Embedding dimension: 384

Sample similarity between chunks:
Chunk 1: Lufthansa flies back to profit German airline Lufthansa has returned to profit in 2004 after posting...
Chunk 2: Winn-Dixie files for bankruptcy US supermarket group Winn-Dixie has filed for bankruptcy protection ...
Cosine similarity: 0.2423
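
The pairwise similarity above is the building block of retrieval: score a query vector against every stored vector and keep the best matches. The sketch below does this in pure Python on toy 3-dimensional vectors that stand in for the 384-dimensional embeddings. The document ids, vectors, and the `top_k` helper are invented for illustration, and a vector database would replace the linear scan in practice.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: List[float],
          index: List[Tuple[str, List[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors; real entries would come from model.encode(...)
index = [
    ("airline_profit", [0.9, 0.1, 0.0]),
    ("supermarket_bankruptcy", [0.2, 0.8, 0.1]),
    ("space_mission", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The query points mostly along the first axis, so the entry whose vector does the same ranks first.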


In [7]:
# Data quality assessment and validation
print("\n=== DATA QUALITY ASSESSMENT ===")

def assess_data_quality(data: List[Dict]) -> Dict:
    """
    Assess the quality of processed data
    """
    assessment = {
        'total_chunks': len(data),
        'categories': {},
        'chunk_size_stats': {},
        'quality_issues': []
    }
    
    chunk_sizes = []
    category_counts = {}
    empty_chunks = 0
    duplicate_texts = set()
    duplicates_found = 0
    
    for item in data:
        chunk_size = item['chunk_size']
        chunk_sizes.append(chunk_size)
        
        category = item['category']
        category_counts[category] = category_counts.get(category, 0) + 1
        
        # Check for empty or very short chunks
        if chunk_size < 10:
            empty_chunks += 1
        
        # Check for duplicates
        text_hash = hash(item['text'])
        if text_hash in duplicate_texts:
            duplicates_found += 1
        else:
            duplicate_texts.add(text_hash)
    
    # Calculate statistics
    chunk_sizes = np.array(chunk_sizes)
    assessment['chunk_size_stats'] = {
        'mean': float(chunk_sizes.mean()),
        'median': float(np.median(chunk_sizes)),
        'std': float(chunk_sizes.std()),
        'min': int(chunk_sizes.min()),
        'max': int(chunk_sizes.max())
    }
    
    assessment['categories'] = category_counts
    
    # Quality issues
    if empty_chunks > 0:
        assessment['quality_issues'].append(f"{empty_chunks} chunks with < 10 words")
    if duplicates_found > 0:
        assessment['quality_issues'].append(f"{duplicates_found} potential duplicate chunks")
    
    # Category balance check
    category_values = list(category_counts.values())
    if len(category_values) > 1:
        imbalance_ratio = max(category_values) / min(category_values)
        if imbalance_ratio > 10:
            assessment['quality_issues'].append(f"Category imbalance detected (ratio: {imbalance_ratio:.1f})")
    
    return assessment

# Assess quality of medium-sized chunks
quality_report = assess_data_quality(transformed_datasets['medium'])

print(f"Data Quality Report:")
print(f"Total chunks: {quality_report['total_chunks']}")
print(f"Categories distribution: {quality_report['categories']}")
print(f"Chunk size statistics:")
for stat, value in quality_report['chunk_size_stats'].items():
    print(f"  {stat}: {value:.1f}")

if quality_report['quality_issues']:
    print(f"Quality issues found:")
    for issue in quality_report['quality_issues']:
        print(f"  ⚠️ {issue}")
else:
    print("✓ No major quality issues detected")

# Create final summary
print(f"\n=== INGESTION PIPELINE SUMMARY ===")
print(f"✓ Extracted {len(extracted_files)} text files")
print(f"✓ Created {len(transformed_datasets)} different chunk size datasets")
print(f"✓ Generated instruction-tuning dataset")
print(f"✓ Created vector embeddings for RAG")
print(f"✓ Performed data quality assessment")
print(f"✓ All datasets saved to: {BASE_DIR}")

# List all output files
output_files = list(BASE_DIR.glob("*.jsonl"))
print(f"\nOutput files created:")
for file in output_files:
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"  📄 {file.name} ({size_mb:.1f} MB)")


=== DATA QUALITY ASSESSMENT ===
Data Quality Report:
Total chunks: 1356
Categories distribution: {'business': 108, 'entertainment': 108, 'food': 101, 'graphics': 124, 'historical': 226, 'medical': 154, 'politics': 129, 'space': 158, 'sport': 114, 'technologie': 134}
Chunk size statistics:
  mean: 314.6
  median: 289.5
  std: 155.5
  min: 19.0
  max: 512.0
Quality issues found:
  ⚠️ 8 potential duplicate chunks

=== INGESTION PIPELINE SUMMARY ===
✓ Extracted 1000 text files
✓ Created 3 different chunk size datasets
✓ Generated instruction-tuning dataset
✓ Created vector embeddings for RAG
✓ Performed data quality assessment
✓ All datasets saved to: c:\Github\Learn-GenAI\genai_book

Output files created:
  📄 instruction_dataset.jsonl (2.8 MB)
  📄 llm_data_large_chunks.jsonl (2.6 MB)
  📄 llm_data_medium_chunks.jsonl (2.8 MB)
  📄 llm_data_small_chunks.jsonl (3.2 MB)
  📄 vector_embeddings.jsonl (1.0 MB)


## What We've Accomplished

This pipeline walks through the full workflow for preparing text data for LLM training and retrieval:

### 🔄 **Complete ETL Pipeline**
1. **Extract** - Read text files from multiple categories
2. **Transform** - Clean, chunk, and format data for different use cases
3. **Load** - Save in multiple formats optimized for various LLM applications

### 📊 **Multiple Output Formats**
- **JSONL files** with different chunk sizes (256, 512, 1024 words)
- **Instruction-tuning dataset** template for supervised fine-tuning (the `output` field holds placeholder text)
- **Vector embeddings** for Retrieval-Augmented Generation (RAG), computed for a 100-chunk sample
- **Quality assessment report** for data validation

### 🎯 **Key Features**
- **Batch processing** - Processes all 1,000 files across 10 categories in a single run
- **Text cleaning** - Removes noise and normalizes content
- **Overlapping chunks** - A 50-word overlap between consecutive chunks preserves context
- **Category-aware** - Maintains metadata for domain-specific training
- **Quality validation** - Flags very short chunks, potential duplicates, and category imbalance
- **Multiple use cases** - Supports training, fine-tuning, and RAG
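
The quality check in this notebook flags potential duplicates with Python's built-in `hash()`, which is salted per process by default, so its values cannot be stored and compared across runs. One possible alternative, sketched below under the assumption that case and whitespace differences should not count, is a SHA-256 fingerprint of normalized text. The `fingerprint` and `deduplicate` helpers and the sample chunks are hypothetical.

```python
import hashlib
from typing import Dict, List

def fingerprint(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text (stable across runs)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(chunks: List[Dict]) -> List[Dict]:
    """Keep the first chunk seen for each fingerprint."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = fingerprint(chunk["text"])
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

chunks = [
    {"chunk_id": "a_chunk_1", "text": "Lufthansa flies back to profit"},
    {"chunk_id": "b_chunk_1", "text": "lufthansa  flies back to PROFIT"},
    {"chunk_id": "c_chunk_1", "text": "Winn-Dixie files for bankruptcy"},
]

unique = deduplicate(chunks)
print([chunk["chunk_id"] for chunk in unique])
```

This only catches exact matches after normalization; near-duplicates with small wording changes would need a fuzzier technique.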

### 🚀 **Ready for LLM Applications**
The processed data is now ready for:
- **Pre-training** language models
- **Fine-tuning** existing models
- **Creating RAG systems**
- **Instruction-following** training
- **Domain-specific** model adaptation

### 📈 **Next Steps**
1. Load the data into your preferred ML framework (PyTorch, TensorFlow)
2. Set up a vector database (Pinecone, Weaviate, ChromaDB) for RAG
3. Begin model training or fine-tuning
4. Deploy for inference and evaluation