# Metamorph: Automated Metadata Generation System

This notebook walks through building an automated metadata generation system from scratch. The system reads documents in several formats, extracts their text, and produces structured metadata (title, summary, keywords, named entities, readability) to support document management and analysis.

## Objectives
- Extract text content from various document formats (PDF, DOCX, TXT)
- Implement OCR for images and scanned documents
- Identify semantic content and key sections
- Generate structured metadata (title, keywords, entities, etc.)
- Build a simple web interface for document upload and metadata display
- Package the system for deployment

Let's begin by importing the necessary libraries and setting up our environment.

In [None]:
# Import Required Libraries
import os
import sys
import re
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Document processing libraries
import PyPDF2
import docx
from PIL import Image

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# For OCR
import pytesseract

# For web interface (Streamlit)
# Note: This will run in a separate file, not in this notebook
# import streamlit as st

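# Download necessary NLTK data
# ('punkt_tab' is what sent_tokenize/word_tokenize look for in newer NLTK releases;
# requesting it alongside 'punkt' is harmless on older ones)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)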

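# Load spaCy model (spacy.load raises OSError when the model is not installed)
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully")
except OSError:
    print("spaCy model not found. Downloading model...")
    # Use the current interpreter so the model is installed into the active environment
    os.system(f'"{sys.executable}" -m spacy download en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully")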

# Create directory for sample documents and output
os.makedirs('sample_docs', exist_ok=True)
os.makedirs('output', exist_ok=True)
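
pytesseract is only a wrapper: the OCR steps later in this notebook also need the Tesseract binary on the system `PATH`. A minimal sketch of an up-front check using only the standard library follows; the helper name `find_tesseract` is hypothetical and not part of the notebook's classes.

```python
import shutil

def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    # shutil.which performs the same lookup the shell would do
    return shutil.which(binary_name)

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract binary not found on PATH; OCR steps will not work until it is installed.")
else:
    print(f"Tesseract found at {tesseract_path}")
```

Running this check early makes a missing installation visible before any document is processed.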

## Document Content Extraction

In this section, we'll build functions to extract text content from various document formats:
1. PDF files using PyPDF2
2. DOCX files using python-docx
3. TXT files using standard Python file operations
4. Image files (PNG, JPG, JPEG) using pytesseract OCR

Let's create a document processor class that can handle different file types:

In [None]:
class DocumentProcessor:
    """Class for processing various document types and extracting text"""
    
    def extract_text(self, file_path):
        """
        Extract text from document based on file extension
        
        Args:
            file_path (str): Path to the document
            
        Returns:
            str: Extracted text from the document
        """
        # splitext handles paths without an extension (or with dots in directory names)
        file_extension = os.path.splitext(file_path)[1].lstrip('.').lower()
        
        if file_extension == 'pdf':
            return self._extract_from_pdf(file_path)
        elif file_extension == 'docx':
            return self._extract_from_docx(file_path)
        elif file_extension == 'txt':
            return self._extract_from_txt(file_path)
        elif file_extension in ['png', 'jpg', 'jpeg']:
            return self._extract_from_image(file_path)
        else:
            print(f"Unsupported file format: {file_extension}")
            return ""
    
    def _extract_from_pdf(self, file_path):
        """Extract text from PDF files"""
        text = ""
        try:
            with open(file_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                for page in reader.pages:
                    # extract_text() may return None or "" for image-only pages
                    text += (page.extract_text() or "") + "\n"
                    
            if text.strip() == "":
                print("No text extracted from PDF. It might be a scanned document requiring OCR.")
                return self._perform_ocr_on_pdf(file_path)
                
            return text
        except Exception as e:
            print(f"Error extracting text from PDF: {e}")
            return ""
    
    def _perform_ocr_on_pdf(self, file_path):
        """Perform OCR on PDF if regular extraction fails (likely a scanned document)"""
        try:
            # This is a simplified approach - in production, you'd use a library like pdf2image
            # to convert all pages to images and then process them
            print("Attempting OCR on PDF...")
            images = self._pdf_to_images(file_path)
            text = ""
            for img in images:
                text += pytesseract.image_to_string(img) + "\n"
            return text
        except Exception as e:
            print(f"Error performing OCR on PDF: {e}")
            return ""
    
    def _pdf_to_images(self, file_path):
        """Convert PDF to list of images - simplified implementation"""
        # In production, you would use pdf2image or a similar library
        # For this notebook, we'll return an empty list to avoid extra dependencies
        print("PDF to image conversion requires additional libraries like pdf2image.")
        print("In a production environment, install: pip install pdf2image")
        return []
    
    def _extract_from_docx(self, file_path):
        """Extract text from DOCX files"""
        text = ""
        try:
            doc = docx.Document(file_path)
            for para in doc.paragraphs:
                text += para.text + "\n"
            return text
        except Exception as e:
            print(f"Error extracting text from DOCX: {e}")
            return ""
    
    def _extract_from_txt(self, file_path):
        """Extract text from TXT files"""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except UnicodeDecodeError:
            try:
                with open(file_path, 'r', encoding='latin-1') as file:
                    return file.read()
            except Exception as e:
                print(f"Error extracting text from TXT: {e}")
                return ""
        except Exception as e:
            print(f"Error extracting text from TXT: {e}")
            return ""
    
    def _extract_from_image(self, file_path):
        """Extract text from image files using OCR"""
        try:
            img = Image.open(file_path)
            text = pytesseract.image_to_string(img)
            return text
        except Exception as e:
            print(f"Error extracting text from image: {e}")
            return ""
    
    def clean_text(self, text):
        """
        Clean and preprocess the extracted text
        
        Args:
            text (str): Extracted text from document
            
        Returns:
            str: Cleaned text
        """
        if not text:
            return ""
            
        # Remove excess whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Remove special characters and numbers (optional - depends on use case)
        # text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        return text.strip()

# Create an instance of the document processor
doc_processor = DocumentProcessor()

# Let's create a simple sample text file to test our processor
sample_text = """
Metamorph: Automated Metadata Generation System

This is a sample text document to test the document processor.
It contains multiple paragraphs and some formatting.

Key features of our system:
- Multi-format support
- Semantic analysis
- Metadata generation
- Web interface

© 2025 Metamorph
"""

with open('sample_docs/sample.txt', 'w', encoding='utf-8') as f:
    f.write(sample_text)

# Test the document processor with our sample text file
extracted_text = doc_processor.extract_text('sample_docs/sample.txt')
print("Extracted text from sample.txt:")
print("-" * 50)
print(extracted_text)
print("-" * 50)

# Clean the extracted text
cleaned_text = doc_processor.clean_text(extracted_text)
print("\nCleaned text:")
print("-" * 50)
print(cleaned_text)
print("-" * 50)
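
The `_pdf_to_images` method above is a stub that returns an empty list. As a hedged sketch of how it could be filled in, the function below uses `convert_from_path` from the optional pdf2image package, which in turn needs the Poppler utilities installed on the system. The function name `pdf_to_images` and the `dpi` default are assumptions for illustration.

```python
def pdf_to_images(file_path, dpi=200):
    """Render each PDF page to a PIL image; return [] if rendering is not possible."""
    try:
        # Imported lazily so the rest of the notebook works without pdf2image
        from pdf2image import convert_from_path
    except ImportError:
        print("pdf2image is not installed; skipping PDF-to-image conversion.")
        return []

    try:
        # convert_from_path returns one PIL image per page
        return convert_from_path(file_path, dpi=dpi)
    except Exception as e:
        # Covers a missing file, a missing Poppler install, or a corrupt PDF
        print(f"Could not render {file_path}: {e}")
        return []

images = pdf_to_images("sample_docs/does_not_exist.pdf")
print(f"Rendered {len(images)} page(s)")
```

Plugging this logic into `DocumentProcessor._pdf_to_images` would let `_perform_ocr_on_pdf` run pytesseract on each rendered page.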

## Optical Character Recognition (OCR)

Scanned PDFs and image files have no embedded text layer, so we need OCR (Optical Character Recognition) to extract their text. The DocumentProcessor class already runs OCR on image files and falls back to it for PDFs that yield no text (the PDF-to-image step is a stub in this notebook). This section demonstrates OCR on its own.

In [None]:
def ocr_demo():
    """
    Demonstrate OCR capabilities by:
    1. Creating a simple image with text
    2. Saving it to a file
    3. Applying OCR to extract the text
    """
    try:
        # Check if pytesseract is properly installed and configured
        pytesseract.get_tesseract_version()
        print(f"Tesseract version: {pytesseract.get_tesseract_version()}")
        
        # Create a simple image with text using PIL
        img = Image.new('RGB', (800, 200), color=(255, 255, 255))
        from PIL import ImageDraw, ImageFont
        
        # Try to use Arial; fall back to PIL's built-in font if it is unavailable
        try:
            font = ImageFont.truetype("Arial", 24)
        except OSError:
            # truetype() raises OSError when the font file cannot be found
            font = ImageFont.load_default()
            
        draw = ImageDraw.Draw(img)
        text = "This is a test image for OCR. Metamorph can extract text from images."
        draw.text((50, 80), text, fill=(0, 0, 0), font=font)
        
        # Save the image
        img_path = 'sample_docs/ocr_test.png'
        img.save(img_path)
        print(f"Created test image at {img_path}")
        
        # Apply OCR
        extracted_text = pytesseract.image_to_string(img)
        print("\nExtracted text from image:")
        print("-" * 50)
        print(extracted_text)
        print("-" * 50)
        
        # Display the image
        plt.figure(figsize=(10, 3))
        plt.imshow(img)
        plt.axis('off')
        plt.title('Test Image for OCR')
        plt.show()
        
        return True
    
    except Exception as e:
        print(f"OCR demo failed: {e}")
        print("\nNote: For OCR to work properly, you need to have Tesseract installed on your system.")
        print("Installation instructions:")
        print("- Ubuntu/Debian: sudo apt-get install tesseract-ocr")
        print("- macOS: brew install tesseract")
        print("- Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
        
        return False

# Run the OCR demo
ocr_success = ocr_demo()

# If OCR demo fails, create a function to simulate OCR for the remainder of the notebook
if not ocr_success:
    def simulate_ocr(image_path_or_object):
        """Simulate OCR for demonstration purposes"""
        print("Using simulated OCR since actual OCR is not available")
        return "This is simulated OCR text. In a real environment, actual text would be extracted from the image."
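
Raw OCR output usually needs some tidying before analysis. The sketch below is one possible post-processing step, using only the standard library; the function name `tidy_ocr_text` and the specific rules are assumptions, not part of the notebook's pipeline.

```python
import re

def tidy_ocr_text(raw_text):
    """Normalize raw OCR output: rejoin hyphenated words and drop layout noise."""
    # Rejoin words split across a line break, e.g. "meta-\ndata" -> "metadata"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', raw_text)
    # Remove form feeds, which Tesseract typically emits as page separators
    text = text.replace('\f', '')
    # Trim trailing spaces on each line
    text = '\n'.join(line.rstrip() for line in text.split('\n'))
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

sample = "Automated meta-\ndata generation   \n\n\n\nfrom scanned pages\f"
print(tidy_ocr_text(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end (for example "multi-" followed by "format"), so it is a trade-off rather than a safe default.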

## Semantic Content Identification

Now that we can extract text from documents, we need to identify the semantically meaningful parts of the content. This includes:

1. Identifying important entities (people, organizations, locations, dates, etc.)
2. Extracting key sentences and phrases
3. Identifying document structure (title, sections, etc.)

We'll use spaCy for named entity recognition and other NLP techniques to extract meaningful information from the text.
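
To make the entity step concrete without loading a model, the sketch below groups entity mentions by type. The `(text, label)` pairs are hypothetical stand-ins for what spaCy exposes as `ent.text` and `ent.label_`; the bucket mapping and the name `group_entities` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs standing in for spaCy's doc.ents
sample_entities = [
    ("Metamorph", "ORG"),
    ("2025", "DATE"),
    ("Berlin", "GPE"),
    ("Metamorph", "ORG"),   # duplicate mention
    ("the Alps", "LOC"),
]

# Related labels share a bucket; anything unlisted falls into MISC
LABEL_TO_BUCKET = {
    "PERSON": "PERSON", "ORG": "ORG",
    "GPE": "GPE", "LOC": "GPE",
    "DATE": "DATE", "TIME": "DATE",
}

def group_entities(pairs, limit=10):
    """Group entity texts by bucket, dropping duplicates but keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        bucket = LABEL_TO_BUCKET.get(label, "MISC")
        if text not in grouped[bucket]:
            grouped[bucket].append(text)
    return {bucket: values[:limit] for bucket, values in grouped.items()}

print(group_entities(sample_entities))
# {'ORG': ['Metamorph'], 'DATE': ['2025'], 'GPE': ['Berlin', 'the Alps']}
```

Keeping first-seen order makes the output reproducible between runs, which matters once the metadata is saved to disk.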

In [None]:
class SemanticAnalyzer:
    """Class for analyzing document content and extracting semantic information"""
    
    def __init__(self):
        """Initialize the analyzer with NLP models and resources"""
        self.nlp = nlp  # Using the already loaded spaCy model
        self.stop_words = set(stopwords.words('english'))
    
    def extract_entities(self, text):
        """
        Extract named entities from text using spaCy
        
        Args:
            text (str): Document text
            
        Returns:
            dict: Dictionary of entity types and their values
        """
        if not text:
            return {}
            
        # Process text with spaCy
        doc = self.nlp(text[:100000])  # Limit text length to avoid memory issues
        
        # Group entities by type
        entities = {
            "PERSON": [],
            "ORG": [],
            "GPE": [],  # Countries, cities, states
            "DATE": [],
            "MISC": []
        }
        
        for ent in doc.ents:
            if ent.label_ in ["PERSON"]:
                entities["PERSON"].append(ent.text)
            elif ent.label_ in ["ORG"]:
                entities["ORG"].append(ent.text)
            elif ent.label_ in ["GPE", "LOC"]:
                entities["GPE"].append(ent.text)
            elif ent.label_ in ["DATE", "TIME"]:
                entities["DATE"].append(ent.text)
            else:
                entities["MISC"].append(ent.text)
        
        # Remove duplicates (preserving first-seen order) and limit length
        for key in entities:
            entities[key] = list(dict.fromkeys(entities[key]))[:10]
        
        return entities
    
    def extract_keywords(self, text, max_keywords=10):
        """
        Extract keywords using TF-IDF
        
        Args:
            text (str): Document text
            max_keywords (int): Maximum number of keywords to extract
            
        Returns:
            list: List of keywords with scores
        """
        if not text or len(text.split()) < 20:
            return self._extract_frequent_words(text, max_keywords)
        
        try:
            vectorizer = TfidfVectorizer(
                max_df=0.85,
                min_df=2,
                stop_words='english',
                use_idf=True,
                ngram_range=(1, 2)
            )
            
            # Create artificial documents by splitting text into chunks
            sentences = sent_tokenize(text)
            documents = [text]  # Original full text
            
            # Group sentences into chunks
            chunk_size = max(5, len(sentences) // 5)
            for i in range(0, len(sentences), chunk_size):
                chunk = " ".join(sentences[i:i+chunk_size])
                if chunk:
                    documents.append(chunk)
            
            tfidf_matrix = vectorizer.fit_transform(documents)
            feature_names = vectorizer.get_feature_names_out()
            
            # Get scores from the first document (the full text)
            tfidf_scores = zip(feature_names, tfidf_matrix[0].toarray()[0])
            sorted_keywords = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)
            
            # Filter keywords to ensure they are meaningful
            keywords = []
            for keyword, score in sorted_keywords:
                if score > 0.01 and len(keyword) > 2:
                    keywords.append({
                        "text": keyword,
                        "score": float(score)
                    })
                if len(keywords) >= max_keywords:
                    break
            
            return keywords
        
        except Exception as e:
            print(f"Error extracting keywords: {e}")
            return self._extract_frequent_words(text, max_keywords)
    
    def _extract_frequent_words(self, text, max_words=10):
        """
        Extract most frequent words as a fallback method
        
        Args:
            text (str): Document text
            max_words (int): Maximum number of words to extract
            
        Returns:
            list: List of words with scores
        """
        if not text:
            return []
            
        try:
            # Tokenize and clean
            tokens = word_tokenize(text.lower())
            tokens = [word for word in tokens if word.isalpha() and word not in self.stop_words and len(word) > 2]
            
            # Count frequency
            freq_dist = Counter(tokens)
            
            # Get top words
            top_words = freq_dist.most_common(max_words)
            
            # Format as keywords
            keywords = []
            for word, count in top_words:
                # Normalize score between 0 and 1
                score = count / max(freq_dist.values()) if freq_dist.values() else 0
                keywords.append({
                    "text": word,
                    "score": float(score)
                })
            
            return keywords
        
        except Exception as e:
            print(f"Error extracting frequent words: {e}")
            return []
    
    def extract_title(self, text, filename=""):
        """
        Try to extract a title from the document
        
        Args:
            text (str): Document text
            filename (str): Original filename
            
        Returns:
            str: Extracted title
        """
        if not text:
            return filename
            
        # First try: Take the first non-empty line if it's reasonably short
        lines = [line.strip() for line in text.split('\n') if line.strip()]
        if lines and len(lines[0]) < 100 and len(lines[0].split()) < 15:
            return lines[0]
        
        # Second try: Look for patterns that might indicate titles
        title_patterns = [
            r'^#\s+(.+)$',  # Markdown title
            r'^Title:\s*(.+)$',  # Explicit title
            r'^Subject:\s*(.+)$'  # Document subject
        ]
        
        for pattern in title_patterns:
            match = re.search(pattern, text, re.MULTILINE | re.IGNORECASE)
            if match:
                return match.group(1).strip()
        
        # Fallback: Use filename without extension or first few words
        if filename:
            filename_parts = filename.rsplit('.', 1)
            if len(filename_parts) > 1:
                return filename_parts[0]
            
        # Last resort: Use first few words
        words = text.split()
        if len(words) > 5:
            return " ".join(words[:5]) + "..."
            
        return filename or "Untitled Document"
    
    def generate_summary(self, text, max_length=200):
        """
        Generate a brief summary of the document
        
        Args:
            text (str): Document text
            max_length (int): Maximum summary length
            
        Returns:
            str: Generated summary
        """
        if not text:
            return "Empty document"
            
        # Simple extractive summarization
        try:
            # Split into sentences
            sentences = sent_tokenize(text)
            
            if len(sentences) <= 2:
                # If very short document, return the text itself
                if len(text) <= max_length:
                    return text
                return text[:max_length] + "..."
            
            # Use first and last sentence as summary
            first_sentence = sentences[0]
            last_sentence = sentences[-1]
            
            # If they're too long, truncate
            if len(first_sentence) > max_length // 2:
                first_sentence = first_sentence[:max_length // 2] + "..."
            
            if len(last_sentence) > max_length // 2:
                last_sentence = last_sentence[:max_length // 2] + "..."
            
            return first_sentence + " [...] " + last_sentence
        
        except Exception as e:
            print(f"Error generating summary: {e}")
            # Fallback to simple truncation
            if len(text) <= max_length:
                return text
            return text[:max_length] + "..."
    
    def calculate_readability(self, text):
        """
        Calculate simple readability score based on sentence and word length
        
        Args:
            text (str): Document text
            
        Returns:
            float: Readability score (0-100)
        """
        if not text:
            return 0
            
        try:
            sentences = sent_tokenize(text)
            words = word_tokenize(text)
            
            if not sentences or not words:
                return 0
            
            # Average sentence length
            avg_sentence_length = len(words) / len(sentences)
            
            # Average word length
            avg_word_length = sum(len(word) for word in words) / len(words)
            
            # Simple readability score (higher is more complex)
            # Scale between 0-100 for easier interpretation
            readability = (avg_sentence_length * 0.6 + avg_word_length * 5) * 5
            
            # Cap the score at 100
            return min(100, readability)
        
        except Exception as e:
            print(f"Error calculating readability: {e}")
            return 0

# Create an instance of the semantic analyzer
semantic_analyzer = SemanticAnalyzer()

# Test the semantic analyzer with our sample text
print("Analyzing sample text...\n")

# Extract entities
entities = semantic_analyzer.extract_entities(cleaned_text)
print("Extracted entities:")
print("-" * 50)
for entity_type, entity_list in entities.items():
    if entity_list:
        print(f"{entity_type}: {', '.join(entity_list)}")
print("-" * 50)

# Extract keywords
keywords = semantic_analyzer.extract_keywords(cleaned_text)
print("\nExtracted keywords:")
print("-" * 50)
for keyword in keywords:
    print(f"{keyword['text']} (score: {keyword['score']:.3f})")
print("-" * 50)

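# Extract title (use the raw text: clean_text collapses the line breaks
# that the first-line and pattern heuristics depend on)
title = semantic_analyzer.extract_title(extracted_text, "sample.txt")
print(f"\nExtracted title: {title}")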

# Generate summary
summary = semantic_analyzer.generate_summary(cleaned_text)
print(f"\nGenerated summary: {summary}")

# Calculate readability
readability = semantic_analyzer.calculate_readability(cleaned_text)
print(f"\nReadability score: {readability:.2f}/100")
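
A worked example helps calibrate the readability formula. The sketch below applies the same arithmetic as `calculate_readability`, but tokenizes with regular expressions instead of NLTK. That swap is an assumption made to keep the example dependency-free, so exact numbers will differ slightly from the class method.

```python
import re

def readability_sketch(text):
    """Same formula as calculate_readability, with regex tokenization (an assumption)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'\w+', text)
    if not sentences or not words:
        return 0.0
    avg_sentence_length = len(words) / len(sentences)            # 7 / 2 = 3.5
    avg_word_length = sum(len(w) for w in words) / len(words)    # 22 / 7 ≈ 3.14
    return min(100.0, (avg_sentence_length * 0.6 + avg_word_length * 5) * 5)

score = readability_sketch("The cat sat. The dog ran fast.")
print(f"{score:.2f}")  # (3.5 * 0.6 + (22 / 7) * 5) * 5 ≈ 89.07
```

The word-length term contributes 25 points per character of average word length, so under this formula any text averaging four or more characters per word reaches the cap of 100. NLTK's tokenizer also counts punctuation marks as one-character tokens, which lowers the average somewhat, but typical prose should still be expected to score near the top of the scale.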

## Automated Metadata Generation

Now that we can extract text from documents and analyze its semantic content, let's combine these capabilities to automatically generate comprehensive metadata for documents. We'll create a MetadataGenerator class that:

1. Takes a document file as input
2. Extracts the text content
3. Analyzes the content to extract semantic information
4. Generates structured metadata

The metadata we'll generate includes:
- Basic document information (filename, file size, word count)
- Content-based information (title, summary, language)
- Semantic information (keywords, entities)
- Analytical information (readability score)
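
The metadata categories above can be written down as an explicit schema. This is a minimal sketch using `TypedDict`; the type names `Keyword` and `DocumentMetadata` are hypothetical, and the example values are made up for illustration.

```python
import json
from typing import Dict, List, TypedDict

class Keyword(TypedDict):
    text: str
    score: float

class DocumentMetadata(TypedDict):
    # Basic document information
    filename: str
    processing_date: str
    file_size: int
    word_count: int
    # Content-based information
    title: str
    summary: str
    language: str
    # Semantic information
    keywords: List[Keyword]
    entities: Dict[str, List[str]]
    # Analytical information
    readability_score: float

# Hypothetical record, for illustration only
example: DocumentMetadata = {
    "filename": "report.txt",
    "processing_date": "2025-01-01 12:00:00",
    "file_size": 2048,
    "word_count": 350,
    "title": "Quarterly Report",
    "summary": "First sentence. [...] Last sentence.",
    "language": "English",
    "keywords": [{"text": "metadata", "score": 0.42}],
    "entities": {"ORG": ["Metamorph"], "DATE": ["2025"]},
    "readability_score": 61.5,
}

print(json.dumps(example, indent=2))
```

A `TypedDict` stays a plain dictionary at runtime, so it serializes with `json.dumps` unchanged while still documenting the expected keys for type checkers.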

In [None]:
class MetadataGenerator:
    """Class for generating metadata from documents"""
    
    def __init__(self):
        """Initialize with document processor and semantic analyzer"""
        self.doc_processor = DocumentProcessor()
        self.semantic_analyzer = SemanticAnalyzer()
    
    def generate_metadata(self, file_path):
        """
        Generate comprehensive metadata for a document
        
        Args:
            file_path (str): Path to the document file
            
        Returns:
            dict: Generated metadata
        """
        # Extract filename from path
        filename = os.path.basename(file_path)
        
        # Extract text from document
        extracted_text = self.doc_processor.extract_text(file_path)
        
        # Clean the extracted text
        cleaned_text = self.doc_processor.clean_text(extracted_text)
        
        # If no text was extracted, return basic metadata
        if not cleaned_text:
            return self._generate_empty_metadata(filename)
        
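        # Calculate word count (ignoring punctuation-only tokens)
        tokens = word_tokenize(cleaned_text)
        word_count = sum(1 for token in tokens if any(ch.isalnum() for ch in token))
        
        # Extract title from the raw text: clean_text collapses the line breaks
        # that the first-line heuristic relies on
        title = self.semantic_analyzer.extract_title(extracted_text, filename)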
        
        # Generate summary
        summary = self.semantic_analyzer.generate_summary(cleaned_text)
        
        # Extract keywords
        keywords = self.semantic_analyzer.extract_keywords(cleaned_text)
        
        # Extract entities
        entities = self.semantic_analyzer.extract_entities(cleaned_text)
        
        # Calculate readability
        readability_score = self.semantic_analyzer.calculate_readability(cleaned_text)
        
        # Detect language (simplified - always English for this example)
        language = "English"
        
        # Compile metadata
        metadata = {
            "filename": filename,
            "title": title,
            "processing_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "file_size": os.path.getsize(file_path),  # size on disk, in bytes
            "word_count": word_count,
            "language": language,
            "summary": summary,
            "keywords": keywords,
            "entities": entities,
            "readability_score": readability_score
        }
        
        return metadata
    
    def _generate_empty_metadata(self, filename):
        """Generate metadata for empty or unprocessable document"""
        return {
            "filename": filename,
            "title": filename,
            "processing_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "file_size": 0,
            "word_count": 0,
            "language": "unknown",
            "summary": "Empty or unprocessable document",
            "keywords": [],
            "entities": {},
            "readability_score": 0
        }
    
    def save_metadata(self, metadata, output_format="json", output_dir="output"):
        """
        Save metadata to file in specified format
        
        Args:
            metadata (dict): Metadata to save
            output_format (str): Format to save in (json or csv)
            output_dir (str): Directory to save to
            
        Returns:
            str: Path to saved file
        """
        os.makedirs(output_dir, exist_ok=True)
        
        # Create filename based on original filename
        base_filename = metadata.get('filename', 'unknown').rsplit('.', 1)[0]
        
        if output_format.lower() == "json":
            output_path = os.path.join(output_dir, f"{base_filename}_metadata.json")
            with open(output_path, 'w', encoding='utf-8') as f:
                # ensure_ascii=False keeps characters such as © readable in the UTF-8 file
                json.dump(metadata, f, indent=2, ensure_ascii=False)
        
        elif output_format.lower() == "csv":
            output_path = os.path.join(output_dir, f"{base_filename}_metadata.csv")
            
            # Flatten the metadata for CSV
            flat_metadata = {
                "filename": metadata.get("filename", ""),
                "title": metadata.get("title", ""),
                "processing_date": metadata.get("processing_date", ""),
                "file_size": metadata.get("file_size", 0),
                "word_count": metadata.get("word_count", 0),
                "language": metadata.get("language", ""),
                "summary": metadata.get("summary", ""),
                "readability_score": metadata.get("readability_score", 0),
                "keywords": ", ".join([k.get("text", "") for k in metadata.get("keywords", [])]),
                "persons": ", ".join(metadata.get("entities", {}).get("PERSON", [])),
                "organizations": ", ".join(metadata.get("entities", {}).get("ORG", [])),
                "locations": ", ".join(metadata.get("entities", {}).get("GPE", [])),
                "dates": ", ".join(metadata.get("entities", {}).get("DATE", []))
            }
            
            # Save as CSV
            df = pd.DataFrame([flat_metadata])
            df.to_csv(output_path, index=False)
        
        else:
            raise ValueError(f"Unsupported output format: {output_format}")
        
        return output_path

# Create an instance of the metadata generator
metadata_gen = MetadataGenerator()

# Generate metadata for our sample text file
metadata = metadata_gen.generate_metadata('sample_docs/sample.txt')

# Print the generated metadata
print("Generated metadata for sample.txt:")
print("-" * 50)
print(json.dumps(metadata, indent=2))
print("-" * 50)

# Save the metadata to a file
json_path = metadata_gen.save_metadata(metadata, "json")
csv_path = metadata_gen.save_metadata(metadata, "csv")

print(f"\nMetadata saved to {json_path} and {csv_path}")

# Display the CSV content
print("\nCSV content:")
print("-" * 50)
df = pd.read_csv(csv_path)
print(df)
print("-" * 50)
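
`save_metadata` writes one CSV file per document. For a catalog covering many documents, a single table with one row per document can be more convenient. The sketch below shows one way to do that with the standard `csv` module; the names `flatten_metadata` and `write_catalog`, the `"; "` separator, and the sample records are all assumptions for illustration.

```python
import csv
import io

# Entity buckets to expose as CSV columns (column name -> entity label)
ENTITY_COLUMNS = {"persons": "PERSON", "organizations": "ORG",
                  "locations": "GPE", "dates": "DATE"}

def flatten_metadata(metadata):
    """Turn one nested metadata dict into a flat row of scalars and strings."""
    row = {key: value for key, value in metadata.items()
           if key not in ("keywords", "entities")}
    row["keywords"] = "; ".join(k["text"] for k in metadata.get("keywords", []))
    entities = metadata.get("entities", {})
    for column, label in ENTITY_COLUMNS.items():
        row[column] = "; ".join(entities.get(label, []))
    return row

def write_catalog(metadata_list, stream):
    """Write one CSV row per document; assumes a non-empty list with matching keys."""
    rows = [flatten_metadata(m) for m in metadata_list]
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical metadata for two documents
docs = [
    {"filename": "a.txt", "title": "Report A", "word_count": 120,
     "keywords": [{"text": "budget", "score": 0.4}],
     "entities": {"ORG": ["Metamorph"], "DATE": ["2025"]}},
    {"filename": "b.txt", "title": "Notes, draft", "word_count": 45,
     "keywords": [], "entities": {}},
]

buffer = io.StringIO()
write_catalog(docs, buffer)
print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas, so a title such as "Notes, draft" survives a round trip intact.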

## Structured Metadata Output Formatting

We've already implemented basic metadata output formatting in the MetadataGenerator class, with support for JSON and CSV formats. Now, let's visualize the metadata and explore how it can be used for different purposes.
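
Titles, summaries, and entities come straight from document text, so they may contain characters such as `<` or `&` that have special meaning in HTML. A minimal sketch of escaping values before rendering them, using the standard library's `html.escape`, follows; the helper name `metadata_row` is hypothetical.

```python
from html import escape

def metadata_row(label, value):
    """Render one table row, escaping both cells so document text cannot inject markup."""
    return f"<tr><th>{escape(str(label))}</th><td>{escape(str(value))}</td></tr>"

print(metadata_row("Title", "Q3 <Draft> & Notes"))
# <tr><th>Title</th><td>Q3 &lt;Draft&gt; &amp; Notes</td></tr>
```

Escaping at the point where the HTML is assembled keeps the stored metadata unchanged while making the rendered page safe to display.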

In [None]:
def visualize_metadata(metadata):
    """
    Visualize metadata using matplotlib
    
    Args:
        metadata (dict): Metadata to visualize
    """
    # Set up the figure
    plt.figure(figsize=(15, 12))
    
    # Document title
    plt.suptitle(metadata.get('title', 'Unknown Document'), fontsize=16, y=0.98)
    
    # Basic information
    plt.subplot(3, 2, 1)
    plt.axis('off')
    info_text = (
        f"Filename: {metadata.get('filename', 'Unknown')}\n"
        f"Processing Date: {metadata.get('processing_date', 'Unknown')}\n"
        f"File Size: {metadata.get('file_size', 0)} bytes\n"
        f"Word Count: {metadata.get('word_count', 0)}\n"
        f"Language: {metadata.get('language', 'Unknown')}\n"
        f"Readability Score: {metadata.get('readability_score', 0):.2f}/100"
    )
    plt.text(0.1, 0.5, info_text, fontsize=10, va='center')
    plt.title('Document Information', fontsize=12)
    
    # Keywords visualization
    plt.subplot(3, 2, 2)
    keywords = metadata.get('keywords', [])
    if keywords:
        words = [kw.get('text', '') for kw in keywords]
        scores = [kw.get('score', 0) for kw in keywords]
        
        # Sort by score
        sorted_indices = np.argsort(scores)[::-1]
        words = [words[i] for i in sorted_indices]
        scores = [scores[i] for i in sorted_indices]
        
        # Create horizontal bar chart
        y_pos = np.arange(len(words))
        plt.barh(y_pos, scores, align='center', color='skyblue')
        plt.yticks(y_pos, words)
        plt.xlabel('Score')
        plt.title('Top Keywords', fontsize=12)
    else:
        plt.axis('off')
        plt.text(0.5, 0.5, 'No keywords found', ha='center', va='center')
        plt.title('Top Keywords', fontsize=12)
    
    # Entities visualization
    plt.subplot(3, 2, 3)
    entities = metadata.get('entities', {})
    
    # Count entities by type
    entity_counts = {k: len(v) for k, v in entities.items() if v}
    
    if entity_counts:
        # Create pie chart
        labels = list(entity_counts.keys())
        sizes = list(entity_counts.values())
        plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
        plt.axis('equal')
        plt.title('Named Entities by Type', fontsize=12)
    else:
        plt.axis('off')
        plt.text(0.5, 0.5, 'No entities found', ha='center', va='center')
        plt.title('Named Entities by Type', fontsize=12)
    
    # Summary visualization
    plt.subplot(3, 2, 4)
    plt.axis('off')
    summary = metadata.get('summary', 'No summary available')
    plt.text(0.1, 0.5, f"Summary:\n\n{summary}", fontsize=10, va='center', wrap=True)
    plt.title('Document Summary', fontsize=12)
    
    # Readability gauge
    plt.subplot(3, 2, 5)
    readability = metadata.get('readability_score', 0)
    
    # Create a gauge-like visualization
    categories = ['Simple', 'Standard', 'Complex']
    colors = ['green', 'orange', 'red']
    
    # Determine which category the score falls into
    if readability < 30:
        category_index = 0
    elif readability < 70:
        category_index = 1
    else:
        category_index = 2
    
    # Create bar chart for readability
    bars = plt.bar([0, 1, 2], [30, 40, 30], color='lightgray')
    bars[category_index].set_color(colors[category_index])
    
    plt.xticks([0, 1, 2], categories)
    plt.ylim(0, 100)
    plt.title(f'Readability: {readability:.1f}/100', fontsize=12)
    
    # Entity details
    plt.subplot(3, 2, 6)
    plt.axis('off')
    
    # Format entity details as text
    entity_text = "Entity Details:\n\n"
    for entity_type, entity_list in entities.items():
        if entity_list:
            entity_text += f"{entity_type}:\n"
            for entity in entity_list[:5]:  # Show only first 5 to avoid clutter
                entity_text += f"- {entity}\n"
            if len(entity_list) > 5:
                entity_text += f"  ... and {len(entity_list) - 5} more\n"
            entity_text += "\n"
    
    plt.text(0.1, 0.5, entity_text, fontsize=10, va='center')
    plt.title('Named Entity Details', fontsize=12)
    
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()

# Define a function to format metadata for different outputs
def format_metadata_for_output(metadata, output_format="html"):
    """
    Format metadata for different output formats
    
    Args:
        metadata (dict): Metadata to format
        output_format (str): Format to output (html, markdown, xml)
        
    Returns:
        str: Formatted metadata
    """
    if output_format.lower() == "html":
        # Format as HTML
        html = f"""
        <div class="metadata-container">
            <h2>{metadata.get('title', 'Unknown Document')}</h2>
            
            <div class="metadata-section">
                <h3>Document Information</h3>
                <table>
                    <tr><th>Filename</th><td>{metadata.get('filename', 'Unknown')}</td></tr>
                    <tr><th>Processing Date</th><td>{metadata.get('processing_date', 'Unknown')}</td></tr>
                    <tr><th>File Size</th><td>{metadata.get('file_size', 0)} bytes</td></tr>
                    <tr><th>Word Count</th><td>{metadata.get('word_count', 0)}</td></tr>
                    <tr><th>Language</th><td>{metadata.get('language', 'Unknown')}</td></tr>
                    <tr><th>Readability Score</th><td>{metadata.get('readability_score', 0):.2f}/100</td></tr>
                </table>
            </div>
            
            <div class="metadata-section">
                <h3>Summary</h3>
                <p>{metadata.get('summary', 'No summary available')}</p>
            </div>
            
            <div class="metadata-section">
                <h3>Keywords</h3>
                <div class="keyword-container">
        """
        
        # Add keywords
        for keyword in metadata.get('keywords', []):
            html += f"""<span class="keyword" style="opacity: {max(0.5, keyword.get('score', 0))}">{keyword.get('text', '')}</span>"""
        
        html += """
                </div>
            </div>
            
            <div class="metadata-section">
                <h3>Named Entities</h3>
        """
        
        # Add entities
        for entity_type, entity_list in metadata.get('entities', {}).items():
            if entity_list:
                html += f"""
                <div class="entity-group">
                    <h4>{entity_type}</h4>
                    <ul>
                """
                
                for entity in entity_list:
                    html += f"""<li>{entity}</li>"""
                
                html += """
                    </ul>
                </div>
                """
        
        html += """
            </div>
        </div>
        """
        
        return html
    
    elif output_format.lower() == "markdown":
        # Format as Markdown
        md = f"""# {metadata.get('title', 'Unknown Document')}

## Document Information
- **Filename:** {metadata.get('filename', 'Unknown')}
- **Processing Date:** {metadata.get('processing_date', 'Unknown')}
- **File Size:** {metadata.get('file_size', 0)} bytes
- **Word Count:** {metadata.get('word_count', 0)}
- **Language:** {metadata.get('language', 'Unknown')}
- **Readability Score:** {metadata.get('readability_score', 0):.2f}/100

## Summary
{metadata.get('summary', 'No summary available')}

## Keywords
"""
        
        # Add keywords
        for keyword in metadata.get('keywords', []):
            md += f"- {keyword.get('text', '')} ({keyword.get('score', 0):.3f})\n"
        
        md += "\n## Named Entities\n"
        
        # Add entities
        for entity_type, entity_list in metadata.get('entities', {}).items():
            if entity_list:
                md += f"\n### {entity_type}\n"
                for entity in entity_list:
                    md += f"- {entity}\n"
        
        return md
    
    elif output_format.lower() == "xml":
        # Format as XML. Escape all text so characters such as & and < cannot break the markup.
        from xml.sax.saxutils import escape, quoteattr

        def esc(value):
            return escape(str(value))

        xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<metadata>
    <document>
        <title>{esc(metadata.get('title', 'Unknown Document'))}</title>
        <filename>{esc(metadata.get('filename', 'Unknown'))}</filename>
        <processing_date>{esc(metadata.get('processing_date', 'Unknown'))}</processing_date>
        <file_size>{metadata.get('file_size', 0)}</file_size>
        <word_count>{metadata.get('word_count', 0)}</word_count>
        <language>{esc(metadata.get('language', 'Unknown'))}</language>
        <readability_score>{metadata.get('readability_score', 0):.2f}</readability_score>
    </document>
    
    <content>
        <summary>{esc(metadata.get('summary', 'No summary available'))}</summary>
        <keywords>
"""
        
        # Add keywords
        for keyword in metadata.get('keywords', []):
            xml += f"""            <keyword score="{keyword.get('score', 0):.3f}">{esc(keyword.get('text', ''))}</keyword>\n"""
        
        xml += """        </keywords>
        <entities>
"""
        
        # Add entities (quoteattr adds the surrounding quotes and escapes the attribute value)
        for entity_type, entity_list in metadata.get('entities', {}).items():
            if entity_list:
                xml += f"""            <entity_group type={quoteattr(str(entity_type))}>\n"""
                for entity in entity_list:
                    xml += f"""                <entity>{esc(entity)}</entity>\n"""
                xml += """            </entity_group>\n"""
        
        xml += """        </entities>
    </content>
</metadata>"""
        
        return xml
    
    else:
        raise ValueError(f"Unsupported output format: {output_format}")

# Visualize the metadata
visualize_metadata(metadata)

# Format the metadata in different formats
html_output = format_metadata_for_output(metadata, "html")
markdown_output = format_metadata_for_output(metadata, "markdown")
xml_output = format_metadata_for_output(metadata, "xml")

# Save the formatted outputs (create the output directory first if it does not exist yet)
os.makedirs('output', exist_ok=True)
with open('output/metadata_output.html', 'w', encoding='utf-8') as f:
    f.write(html_output)

with open('output/metadata_output.md', 'w', encoding='utf-8') as f:
    f.write(markdown_output)

with open('output/metadata_output.xml', 'w', encoding='utf-8') as f:
    f.write(xml_output)

print("Formatted metadata saved to output directory in HTML, Markdown, and XML formats")

# Display the Markdown output as an example
print("\nSample Markdown Output:")
print("-" * 50)
print(markdown_output[:500] + "..." if len(markdown_output) > 500 else markdown_output)
print("-" * 50)

## Web Interface for Document Upload and Metadata Display

Now that we have a functional metadata generation system, let's design a web interface for users to upload documents and view the generated metadata. We'll use Streamlit for this, which is a fast way to build interactive web applications in Python.

Here, we'll provide the code for a Streamlit app. The next cell saves it to a separate file, `streamlit_app.py`, which you can then run with `streamlit run streamlit_app.py`.

**Note:** Since this is a Jupyter notebook, we won't actually run the Streamlit app here, but we'll provide the complete code for it.
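
One step in the app is worth isolating: the uploaded bytes are written to a temporary file so that path-based extractors can read them. The sketch below shows that step on its own, using only the standard library. The helper name `save_upload_to_temp` is hypothetical and not part of the project; in the Streamlit app the two arguments would come from `uploaded_file.name` and `uploaded_file.getvalue()`.

```python
import os
import tempfile


def save_upload_to_temp(file_name: str, data: bytes) -> str:
    """Write uploaded bytes to a temp file that keeps the original extension.

    Hypothetical helper for illustration; returns the path of the new file.
    """
    # splitext returns "" when the name has no extension, so no stray "." is added
    suffix = os.path.splitext(file_name)[1].lower()
    # delete=False keeps the file on disk after the handle closes,
    # so the caller is responsible for removing it later
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


path = save_upload_to_temp("report.PDF", b"%PDF-1.4 example bytes")
try:
    print(path)
finally:
    os.unlink(path)  # always remove the temporary file when done
```

Lower-casing the suffix is a design choice in this sketch: it lets an extension-based dispatcher treat `report.PDF` and `report.pdf` the same way.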

In [None]:
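# Build the content of 'streamlit_app.py'. A raw string keeps escape sequences such as \n
# literal, so the string literals in the generated file are not split across lines.
streamlit_app_code = r"""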
import streamlit as st
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
import tempfile
from datetime import datetime

# Add this file's directory (the project root) to the path so the src package can be imported
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Import our metadata generation classes
from src.document_processor import DocumentProcessor
from src.metadata_generator import MetadataGenerator

# Set page config
st.set_page_config(
    page_title="Metamorph - Metadata Generator",
    page_icon="📄",
    layout="wide"
)

# Initialize the metadata generator
@st.cache_resource
def get_metadata_generator():
    return MetadataGenerator()

metadata_gen = get_metadata_generator()

# App title
st.title("Metamorph")
st.subheader("Automated Metadata Generation System")
st.markdown("---")

# Sidebar
st.sidebar.header("About")
st.sidebar.info(
    "Metamorph automatically extracts and generates metadata from various document types "
    "including PDF, DOCX, TXT, and images."
)

st.sidebar.header("Instructions")
st.sidebar.info(
    "1. Upload a document using the file uploader below.\n"
    "2. The system will process the document and extract metadata.\n"
    "3. View the generated metadata and visualizations.\n"
    "4. Download the metadata in your preferred format."
)

st.sidebar.header("Supported File Types")
st.sidebar.markdown(
    "- PDF (.pdf)\n"
    "- Word (.docx)\n"
    "- Text (.txt)\n"
    "- Images (.png, .jpg, .jpeg)"
)

# File upload
st.header("Upload Document")
uploaded_file = st.file_uploader("Choose a file", type=["pdf", "docx", "txt", "png", "jpg", "jpeg"])

if uploaded_file is not None:
    # Display progress
    progress_bar = st.progress(0)
    status_text = st.empty()
    
    # Update progress
    status_text.text("Uploading file...")
    progress_bar.progress(20)
    
    # Save the uploaded file to a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp_file:
        tmp_file.write(uploaded_file.getvalue())
        temp_file_path = tmp_file.name
    
    # Generate metadata (assumes text extraction and analysis both happen inside this call)
    status_text.text("Extracting text and analyzing content...")
    progress_bar.progress(40)
    
    metadata = metadata_gen.generate_metadata(temp_file_path)
    
    # Update progress
    status_text.text("Saving metadata...")
    progress_bar.progress(80)
    
    # Save metadata to JSON temporarily
    json_path = os.path.join(tempfile.gettempdir(), "metadata.json")
    with open(json_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    
    # Save metadata to CSV temporarily
    csv_path = os.path.join(tempfile.gettempdir(), "metadata.csv")
    metadata_gen.save_metadata(metadata, "csv", tempfile.gettempdir())
    
    # Update progress
    status_text.text("Done!")
    progress_bar.progress(100)
    
    # Display metadata
    st.header("Generated Metadata")
    
    # Basic info in columns
    col1, col2, col3 = st.columns(3)
    with col1:
        st.subheader("Document Information")
        st.markdown(f"**Title:** {metadata.get('title', 'Unknown')}")
        st.markdown(f"**Filename:** {metadata.get('filename', 'Unknown')}")
        st.markdown(f"**Word Count:** {metadata.get('word_count', 0)}")
    
    with col2:
        st.subheader("Processing Information")
        st.markdown(f"**Processing Date:** {metadata.get('processing_date', 'Unknown')}")
        st.markdown(f"**Language:** {metadata.get('language', 'Unknown')}")
        st.markdown(f"**File Size:** {metadata.get('file_size', 0)} bytes")
    
    with col3:
        st.subheader("Analysis")
        readability = metadata.get('readability_score', 0)
        st.markdown(f"**Readability Score:** {readability:.2f}/100")
        
        # Readability gauge
        if readability < 30:
            category = "Simple"
            color = "green"
        elif readability < 70:
            category = "Standard"
            color = "orange"
        else:
            category = "Complex"
            color = "red"
            
        st.markdown(f"**Complexity:** <span style='color:{color}'>{category}</span>", unsafe_allow_html=True)
    
    st.markdown("---")
    
    # Summary
    st.subheader("Document Summary")
    st.write(metadata.get('summary', 'No summary available'))
    
    st.markdown("---")
    
    # Keywords and entities
    col1, col2 = st.columns(2)
    
    with col1:
        st.subheader("Keywords")
        keywords = metadata.get('keywords', [])
        if keywords:
            # Display as a table
            keyword_df = pd.DataFrame(keywords)
            st.dataframe(keyword_df)
            
            # Create a bar chart
            fig, ax = plt.subplots(figsize=(10, 4))
            keyword_df = keyword_df.sort_values('score', ascending=False)
            ax.barh(keyword_df['text'], keyword_df['score'], color='skyblue')
            ax.set_xlabel('Score')
            ax.set_title('Top Keywords')
            st.pyplot(fig)
        else:
            st.write("No keywords found")
    
    with col2:
        st.subheader("Named Entities")
        entities = metadata.get('entities', {})
        
        # Display entities by type
        for entity_type, entity_list in entities.items():
            if entity_list:
                with st.expander(f"{entity_type} ({len(entity_list)})"):
                    st.write(", ".join(entity_list))
        
        # Create a pie chart of entity distribution
        entity_counts = {k: len(v) for k, v in entities.items() if v}
        if entity_counts:
            fig, ax = plt.subplots(figsize=(8, 8))
            ax.pie(entity_counts.values(), labels=entity_counts.keys(), autopct='%1.1f%%', startangle=90)
            ax.axis('equal')
            st.pyplot(fig)
    
    st.markdown("---")
    
    # Download options
    st.header("Download Metadata")
    col1, col2, col3 = st.columns(3)
    
    with col1:
        with open(json_path, 'r') as f:
            json_data = f.read()
        st.download_button(
            label="Download JSON",
            data=json_data,
            file_name=f"{metadata.get('filename', 'metadata')}.json",
            mime="application/json"
        )
    
    with col2:
        with open(csv_path, 'r') as f:
            csv_data = f.read()
        st.download_button(
            label="Download CSV",
            data=csv_data,
            file_name=f"{metadata.get('filename', 'metadata')}.csv",
            mime="text/csv"
        )
    
    with col3:
        # Generate markdown
        markdown_data = f"# {metadata.get('title', 'Unknown Document')}\n\n"
        markdown_data += f"**Filename:** {metadata.get('filename', 'Unknown')}  \n"
        markdown_data += f"**Processing Date:** {metadata.get('processing_date', 'Unknown')}  \n"
        markdown_data += f"**Word Count:** {metadata.get('word_count', 0)}  \n"
        markdown_data += f"**Readability Score:** {metadata.get('readability_score', 0):.2f}/100  \n\n"
        
        markdown_data += f"## Summary\n{metadata.get('summary', 'No summary available')}\n\n"
        
        markdown_data += "## Keywords\n"
        for keyword in metadata.get('keywords', []):
            markdown_data += f"- {keyword.get('text', '')} ({keyword.get('score', 0):.3f})\n"
        
        markdown_data += "\n## Named Entities\n"
        for entity_type, entity_list in metadata.get('entities', {}).items():
            if entity_list:
                markdown_data += f"\n### {entity_type}\n"
                for entity in entity_list:
                    markdown_data += f"- {entity}\n"
        
        st.download_button(
            label="Download Markdown",
            data=markdown_data,
            file_name=f"{metadata.get('filename', 'metadata')}.md",
            mime="text/markdown"
        )
    
    # Clean up temporary files; a failure on one path should not skip the others
    for path in (temp_file_path, json_path, csv_path):
        try:
            os.unlink(path)
        except OSError:
            pass

else:
    # Show sample when no file is uploaded
    st.info("Upload a document to generate metadata")
    
    # Sample image
    st.image("https://via.placeholder.com/800x400.png?text=Metamorph+Automated+Metadata+Generation", use_column_width=True)
    
    # Sample features
    st.header("Key Features")
    
    feature_col1, feature_col2 = st.columns(2)
    
    with feature_col1:
        st.markdown("#### Content Extraction")
        st.markdown("- Extract text from PDF, DOCX, TXT")
        st.markdown("- OCR for images and scanned documents")
        st.markdown("- Support for multiple languages")
        
        st.markdown("#### Semantic Analysis")
        st.markdown("- Named entity recognition")
        st.markdown("- Keyword extraction")
        st.markdown("- Document summarization")
    
    with feature_col2:
        st.markdown("#### Metadata Generation")
        st.markdown("- Comprehensive document metadata")
        st.markdown("- Readability assessment")
        st.markdown("- Content classification")
        
        st.markdown("#### Output Options")
        st.markdown("- JSON, CSV, Markdown formats")
        st.markdown("- Data visualizations")
        st.markdown("- Integration with other systems")
"""

# Save the Streamlit app to a file
with open('streamlit_app.py', 'w', encoding='utf-8') as f:  # the app code contains an emoji
    f.write(streamlit_app_code)

print("Streamlit app code has been saved to 'streamlit_app.py'")
print("To run the app, use the command: streamlit run streamlit_app.py")

# Display a code snippet for educational purposes
print("\nPreview of the Streamlit app code:")
print("-" * 50)
print("\n".join(streamlit_app_code.split("\n")[:30]) + "\n...")  # Show first 30 lines
print("-" * 50)

## System Deployment

Deploying the Metamorph system involves two components:
1. The Flask web application (from the main project)
2. The Streamlit application (as an alternative interface)

Below, we'll provide instructions for deploying both components.
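
Before following either path, it can help to confirm that the prerequisites are actually present. The sketch below is one way to do that with the standard library. The function name `check_prerequisites` and the default module list are assumptions made for illustration; adjust them to match your `requirements.txt`.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("flask", "streamlit", "spacy", "pytesseract"),
                        binaries=("tesseract",)):
    """Report which Python modules and command-line tools are available.

    Hypothetical helper; returns a dict mapping each name to True or False.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[name] = shutil.which(name) is not None
    return report


for name, available in check_prerequisites().items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```

This only checks that the packages and the Tesseract binary can be found. It does not verify versions or that the spaCy model has been downloaded.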

In [None]:
# Create a deployment instructions file
deployment_instructions = """# Deployment Instructions for Metamorph

## Prerequisites
- Python 3.8+
- pip (Python package installer)
- Git
- Tesseract OCR (for image processing)

## Step 1: Clone the Repository
```bash
git clone https://github.com/yourusername/Metamorph.git
cd Metamorph
```

## Step 2: Set up a Virtual Environment
```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\\Scripts\\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```

## Step 4: Install Tesseract OCR
- For Ubuntu/Debian:
  ```
  sudo apt-get install tesseract-ocr
  ```
- For macOS:
  ```
  brew install tesseract
  ```
- For Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)

## Step 5: Download spaCy Model
```bash
python -m spacy download en_core_web_sm
```

## Option 1: Deploy the Flask Web Application

### Local Deployment
```bash
# Start the Flask app
python app.py
```

The application will be available at http://localhost:5000

### Production Deployment with Gunicorn (Linux/macOS)
```bash
pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 app:app
```

### Docker Deployment
1. Create a Dockerfile:
```Dockerfile
FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \\
    tesseract-ocr \\
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download en_core_web_sm

COPY . .

EXPOSE 5000

CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
```

2. Build and run the Docker container:
```bash
docker build -t metamorph .
docker run -p 5000:5000 metamorph
```

## Option 2: Deploy the Streamlit Web Application

### Local Deployment
```bash
pip install streamlit
streamlit run streamlit_app.py
```

The application will be available at http://localhost:8501

### Production Deployment
For production deployment of Streamlit apps, consider using:
- [Streamlit Sharing](https://streamlit.io/sharing)
- [Heroku](https://heroku.com)
- [AWS](https://aws.amazon.com)
- [Google Cloud Platform](https://cloud.google.com)

Example for Heroku:
1. Create a `Procfile`:
```
web: streamlit run streamlit_app.py --server.port $PORT
```

2. Deploy to Heroku:
```bash
heroku create metamorph-app
git push heroku main
```

## Option 3: Run as a Jupyter Notebook

The `notebooks/metadata_generation.ipynb` notebook can be run in a Jupyter environment:

```bash
pip install jupyter
jupyter notebook
```

Navigate to the notebooks directory and open `metadata_generation.ipynb`.

## Security Considerations for Production

1. **Secure File Uploads**: Implement strict file validation and virus scanning
2. **Rate Limiting**: Prevent abuse with rate limiting
3. **User Authentication**: Add authentication for production deployments
4. **HTTPS**: Always use HTTPS in production
5. **Environment Variables**: Store sensitive information in environment variables
6. **Regular Updates**: Keep all dependencies updated

## Maintenance

1. Regularly update dependencies
2. Monitor application logs
3. Set up automated backups for any databases
4. Implement monitoring for server health
5. Create a process for user feedback and bug reporting
"""

# Save the deployment instructions
with open('deployment_instructions.md', 'w') as f:
    f.write(deployment_instructions)

print("Deployment instructions have been saved to 'deployment_instructions.md'")

# Create a sample Dockerfile
dockerfile_content = """FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \\
    tesseract-ocr \\
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download en_core_web_sm

COPY . .

EXPOSE 5000

CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
"""

# Save the Dockerfile
with open('Dockerfile', 'w') as f:
    f.write(dockerfile_content)

print("Sample Dockerfile has been created")

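# Create a Procfile for Heroku (this one serves the Flask app with gunicorn;
# the Streamlit variant is shown in the deployment instructions above)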
with open('Procfile', 'w') as f:
    f.write("web: gunicorn app:app")

print("Procfile for Heroku deployment has been created")

# Show deployment options
print("\nDeployment Options:")
print("-" * 50)
print("1. Local Flask App: python app.py")
print("2. Local Streamlit App: streamlit run streamlit_app.py")
print("3. Docker: docker build -t metamorph . && docker run -p 5000:5000 metamorph")
print("4. Heroku: git push heroku main")
print("-" * 50)

## Generate README.md File

Finally, let's generate a comprehensive README.md file for the project. This will serve as documentation for users and developers who want to understand and use the Metamorph system.

In [None]:
# Generate a comprehensive README.md file
readme_content = """# Metamorph: Automated Metadata Generation System

![Metamorph](https://via.placeholder.com/800x200.png?text=Metamorph+Metadata+Generation)

## Overview

Metamorph is a comprehensive automated metadata generation system designed to enhance document discoverability, classification, and analysis. The system processes various document formats, extracts meaningful content, and generates structured metadata that can be used for better document management and insights.

## Key Features

- **Multi-format Support**: Process documents in PDF, DOCX, TXT, and image formats
- **Automatic Content Extraction**: Extract text using specialized parsers for each format
- **OCR Capabilities**: Convert text in images to machine-readable content
- **Semantic Analysis**: Identify meaningful sections and key information in documents
- **Named Entity Recognition**: Extract people, organizations, locations, and dates
- **Keyword Extraction**: Generate relevant keywords using TF-IDF analysis
- **Readability Assessment**: Calculate document complexity scores
- **Document Summarization**: Generate concise summaries of document content
- **Intuitive Web Interface**: Upload documents and view generated metadata
- **Analytics Dashboard**: View insights across processed documents

## Technical Architecture

Metamorph is built with a modern tech stack:

- **Backend**: Python/Flask RESTful API
- **Frontend**: HTML5, CSS3, JavaScript with Bootstrap 5
- **Text Processing**: NLTK, spaCy, scikit-learn
- **Document Parsing**: PyPDF2, python-docx, Pillow, pytesseract
- **Data Analysis**: pandas, NumPy
- **Visualization**: Chart.js

## Installation and Setup

### Prerequisites

- Python 3.8+
- Tesseract OCR (for image processing)
- Required Python packages (see requirements.txt)

### Steps

1. Clone the repository:
   ```
   git clone https://github.com/yourusername/Metamorph.git
   cd Metamorph
   ```

2. Create and activate a virtual environment (recommended):
   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\\Scripts\\activate
   ```

3. Install dependencies:
   ```
   pip install -r requirements.txt
   ```

4. Install additional requirements for OCR:
   - For Ubuntu/Debian:
     ```
     sudo apt-get install tesseract-ocr
     ```
   - For macOS:
     ```
     brew install tesseract
     ```
   - For Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)

5. Download spaCy model:
   ```
   python -m spacy download en_core_web_sm
   ```

6. Run the application:
   ```
   python app.py
   ```

7. Access the web interface at http://localhost:5000

## Using the Jupyter Notebook

For data scientists and developers who want to use the core functionality in a notebook environment, we provide a comprehensive Jupyter Notebook that demonstrates:

1. Document loading and text extraction
2. Text preprocessing and cleaning
3. Feature extraction and metadata generation
4. Visualization of results
5. Examples of advanced usage

The notebook can be found at `notebooks/metadata_generation.ipynb`.

## Project Structure

```
Metamorph/
├── app.py                  # Main Flask application
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
├── Dockerfile              # Docker configuration
├── Procfile                # Heroku configuration
├── streamlit_app.py        # Streamlit interface
├── deployment_instructions.md  # Deployment guide
├── notebooks/              # Jupyter notebooks
│   └── metadata_generation.ipynb
├── src/                    # Source code
│   ├── document_processor.py  # Document text extraction
│   └── metadata_generator.py  # Metadata generation logic
├── static/                 # Static files (CSS, JS, images)
│   ├── css/
│   ├── js/
│   └── img/
├── templates/              # HTML templates
│   ├── index.html
│   └── analyze.html
└── uploads/                # Temporary document storage
```

## API Reference

Metamorph provides a simple RESTful API for programmatic access:

- `POST /upload`: Upload and process a document
  - Returns: JSON with generated metadata

- `GET /analyze`: Get analytics data across all processed documents
  - Returns: JSON with aggregate statistics

## Alternative Interfaces

### Streamlit Interface

We also provide a Streamlit-based interface that can be run separately:

```
streamlit run streamlit_app.py
```

This interface provides the same core functionality with a different user experience.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- NLTK and spaCy for natural language processing
- PyPDF2 and python-docx for document parsing
- Flask for the web framework
- Bootstrap for the UI components

"""

# Save the README.md to the project root
with open('README.md', 'w', encoding='utf-8') as f:  # the README contains non-ASCII tree characters
    f.write(readme_content)

print("README.md has been generated and saved to the project root")

# Display conclusion
print("\nThis completes our Metamorph - Automated Metadata Generation System notebook.")
print("You can now explore the system further and customize it for your specific needs.")
print("To run the web application, use: python app.py")
print("To run the Streamlit interface, use: streamlit run streamlit_app.py")
print("\nThank you for using Metamorph!")

## Conclusion

In this notebook, we've built a comprehensive automated metadata generation system from scratch. We've:

1. Implemented text extraction from various document formats (PDF, DOCX, TXT)
2. Added OCR capabilities for images and scanned documents
3. Created semantic analysis for identifying meaningful content
4. Developed a metadata generation pipeline
5. Designed output formats for structured metadata
6. Created a web interface for document upload and metadata viewing
7. Provided deployment instructions for various environments

The Metamorph system can be used in a variety of contexts:
- Document management systems
- Digital libraries and archives
- Content management systems
- Data governance and compliance
- Research repositories
- Enterprise search systems

You can extend this system by:
- Adding support for more document formats
- Implementing more advanced NLP techniques
- Enhancing the web interface with additional features
- Adding a database for storing processed documents and metadata
- Implementing user authentication and access control
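
As a starting point for the database extension, here is a minimal sketch that stores one metadata dictionary in SQLite using only the standard library. The table name, its columns, and the helper `save_metadata_record` are assumptions made for illustration and are not part of the project. The dictionary keys (`filename`, `title`, `word_count`, `keywords`) follow the metadata structure used in this notebook.

```python
import json
import sqlite3


def save_metadata_record(conn, metadata):
    """Insert one metadata dict and return its row id (hypothetical helper).

    A few scalar fields get their own columns for easy querying;
    the full dict, including nested keywords and entities, is kept as JSON text.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               filename TEXT,
               title TEXT,
               word_count INTEGER,
               metadata_json TEXT
           )"""
    )
    cur = conn.execute(
        "INSERT INTO documents (filename, title, word_count, metadata_json) "
        "VALUES (?, ?, ?, ?)",
        (
            metadata.get("filename"),
            metadata.get("title"),
            metadata.get("word_count", 0),
            json.dumps(metadata),
        ),
    )
    conn.commit()
    return cur.lastrowid


# Usage with an in-memory database and a small made-up record
conn = sqlite3.connect(":memory:")
row_id = save_metadata_record(
    conn,
    {"filename": "a.txt", "title": "A", "word_count": 3,
     "keywords": [{"text": "example", "score": 0.5}]},
)
stored = conn.execute(
    "SELECT metadata_json FROM documents WHERE id = ?", (row_id,)
).fetchone()[0]
print(json.loads(stored)["keywords"])
```

Parameterized queries (the `?` placeholders) keep document-derived text from being interpreted as SQL. For a real deployment you would replace `:memory:` with a file path or a server-backed database.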

Thank you for exploring the Metamorph Automated Metadata Generation System!