🚀 CryptoSights: Advanced Cryptocurrency Intelligence Platform

An intelligent NLP-powered platform for cryptocurrency news analysis, sentiment tracking, and query-based intelligence.

📋 Table of Contents

Overview
Key Features
System Architecture
Technology Stack
Quick Start
Docker Deployment
Usage Guide
API Reference
Evaluation System
Project Structure
Development
Contributing
Troubleshooting
License
Contact

🎯 Overview

CryptoSights is a production-grade cryptocurrency intelligence platform that combines web scraping, natural language processing, and machine learning to deliver actionable insights from cryptocurrency news articles. The platform processes thousands of articles across major cryptocurrencies, performs advanced sentiment analysis, and provides instant, ranked answers to user queries through an intuitive web interface and RESTful API.

Supported Cryptocurrencies

Bitcoin (BTC) - Digital gold and store of value
Ethereum (ETH) - Smart contract pioneer
Solana (SOL) - High-performance blockchain
Dogecoin (DOGE) - Community-driven memecoin
Hamster (HMSTR) - Emerging cryptocurrency
General Crypto - Cross-asset analysis

Core Capabilities

Capability	Description
Intelligent Scraping	Automated article collection from cryptocurrency news sources
NLP Processing	Advanced text extraction, cleaning, and preprocessing
Sentiment Analysis	Real-time sentiment tracking with polarity scoring
Smart Retrieval	BM25-powered ranked answer generation
Web Interface	Interactive query platform with visual analytics
RESTful API	Programmatic access for integration
Validation Engine	Machine learning-based answer verification
Docker Support	Containerized deployment with one-command setup

✨ Key Features

🔍 Data Acquisition Pipeline

Automated Web Scraping
- Harvests 10+ articles per cryptocurrency from trusted sources
- Intelligent URL extraction with validation
- Built-in rate limiting and exponential backoff
- User-agent rotation to prevent blocking
- Structured CSV output for audit trails and reproducibility
Robust Error Handling
- Automatic retry mechanisms with configurable backoff
- Comprehensive logging system for debugging
- Graceful degradation on network failures
- Detailed error reports in query logs
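
The retry behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the project's scraper: `fetch_with_retry`, `backoff_delay`, the `USER_AGENTS` pool, and all default values are hypothetical names and numbers chosen for the example. The `fetch` callable is injected so the logic can be exercised without network access.

```python
import random
import time
from typing import Callable, Optional, Sequence

# Hypothetical user-agent pool; the list actually used by the scraper is not shown.
USER_AGENTS: Sequence[str] = (
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(
    url: str,
    fetch: Callable[[str, dict], str],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url, headers) until it succeeds or the attempts run out.

    Returns the fetched body, or None after the final failure so the caller
    can skip the article instead of crashing (graceful degradation).
    """
    for attempt in range(max_attempts):
        # Pick a different user agent on each attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            return fetch(url, headers)
        except OSError:
            # Network-level failure: wait and retry unless this was the last attempt.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    return None
```

With the Requests library, `fetch` could wrap `requests.get(url, headers=headers, timeout=10).text`; Requests' exceptions derive from `OSError`, so the `except` clause above would catch them.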

📄 Text Processing Engine

Multi-Stage Content Extraction
- HTML to PDF conversion via pdfkit with wkhtmltopdf
- Clean text extraction using trafilatura library
- Boilerplate removal (ads, navigation, footers)
- Character encoding normalization (UTF-8)
Advanced Preprocessing
- Intelligent lowercasing with acronym preservation (BTC, ETH, DeFi)
- Special character handling and normalization
- Tokenization with cryptocurrency term protection
- Stemming and lemmatization via NLTK
- Stop word removal with domain-specific exceptions
Domain-Aware Processing
- Cryptocurrency-specific keyword preservation
- Technical term recognition (DeFi, NFT, PoS, DAO)
- Context-sensitive cleaning for financial terms
- Preserves numerical values for price analysis
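
To make the preprocessing rules above concrete, here is a simplified sketch using only the standard library. It stands in for the NLTK-based pipeline rather than reproducing it: the protected-term set, the stop-word list, the regular expression, and the `preprocess` function name are all assumptions made for illustration, and stemming/lemmatization is left out.

```python
import re
from typing import List

# Assumed protected terms and stop words; the project's real lists are not shown.
PROTECTED_TERMS = {"BTC", "ETH", "DeFi", "NFT", "PoS", "DAO"}
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

# Lowercased lookup so "defi" or "DEFI" map back to the canonical "DeFi".
_CANONICAL = {term.lower(): term for term in PROTECTED_TERMS}

# Numbers (with optional thousands separators and decimals) or alphabetic words.
_TOKEN_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[A-Za-z]+(?:'[A-Za-z]+)?")


def preprocess(text: str) -> List[str]:
    """Tokenize, lowercase, and drop stop words while keeping crypto terms and numbers."""
    tokens = []
    for raw in _TOKEN_RE.findall(text):
        lowered = raw.lower()
        if lowered in _CANONICAL:
            # Restore the canonical casing of a protected term.
            tokens.append(_CANONICAL[lowered])
        elif lowered not in STOP_WORDS:
            tokens.append(lowered)
    return tokens


print(preprocess("The BTC price is $42,500.75 and DeFi on ETH is growing!"))
```

Keeping numbers as single tokens is what allows later stages to use price figures; a plain `split()` would leave the currency symbol and punctuation attached.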

💭 Sentiment Analysis System

Sentence-Level Analysis
- TextBlob-powered sentiment scoring
- Polarity measurement (-1.0 to +1.0 scale)
- Subjectivity detection (0.0 to 1.0)
- Multi-class classification (Positive, Negative, Neutral)
- Per-document aggregate sentiment
Visual Analytics
- Real-time sentiment distribution charts
- Matplotlib-powered visualizations
- Sentiment trends over time
- Exportable graphics for reporting
- JSON sentiment metrics export
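
The mapping from polarity scores to the three classes could look like the sketch below. The ±0.05 neutral band, the function names, and the shape of the summary dictionary are assumptions; the README does not state the thresholds the project uses. Polarity values are passed in as plain floats, such as those returned by `TextBlob(sentence).sentiment.polarity`.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def classify_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity in [-1.0, 1.0] to a label; `threshold` is an assumed neutral band."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


def summarize_document(polarities: List[float]) -> Dict[str, object]:
    """Aggregate sentence-level polarities into a per-document summary."""
    labels = [classify_polarity(p) for p in polarities]
    average = mean(polarities) if polarities else 0.0
    return {
        "label": classify_polarity(average),
        "score": round(average, 2),
        "distribution": dict(Counter(labels)),
    }


print(summarize_document([0.6, -0.2, 0.0, 0.4]))
```

The `distribution` counts are the kind of data a sentiment distribution chart would plot.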

🎯 Query Retrieval Engine

BM25 Algorithm Implementation
- Okapi BM25 ranking function (k1=1.5, b=0.75)
- Efficient inverted index structure
- Sub-second query response times
- Tunable ranking parameters
- Persistent index serialization with pickle
Rich Query Results
- Ranked answer ordering by relevance score (0.0-1.0)
- Source document attribution with metadata
- Sentiment context for each result
- Configurable result count (top-n)
- Duplicate result filtering
- Query execution time tracking
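
For readers unfamiliar with BM25, the following pure-Python sketch shows how a score is computed with the parameters listed above (k1=1.5, b=0.75), followed by top-n selection with duplicate filtering. It is an illustration only: the platform relies on the rank-bm25 package, whose IDF variant may differ from the non-negative `log(... + 1)` form used here, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def bm25_scores(query: Sequence[str], corpus: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = (sum(len(doc) for doc in corpus) / n_docs) or 1.0
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # Length normalization: long documents are penalized in proportion to b.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def top_n(query: Sequence[str], corpus: Sequence[Sequence[str]],
          n: int = 3) -> List[Tuple[int, float]]:
    """Return up to n (index, score) pairs, best first, skipping duplicates and zero scores."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    results, seen = [], set()
    for i in ranked:
        key = tuple(corpus[i])
        if scores[i] <= 0 or key in seen:
            continue
        seen.add(key)
        results.append((i, scores[i]))
        if len(results) == n:
            break
    return results
```

Raw BM25 scores are not bounded, so a 0.0-1.0 relevance scale implies some normalization step that this sketch does not attempt.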

🌐 Web Interface

User-Friendly Dashboard
- Cryptocurrency selection with autocomplete
- Real-time query processing with loading indicators
- Interactive result display with collapsible cards
- Sentiment visualization integration
- Query history and saved searches (planned)
Responsive Design
- Mobile-optimized layouts (Bootstrap 5)
- Cross-browser compatibility (Chrome, Firefox, Safari, Edge)
- Modern CSS3/HTML5 standards
- Dark mode support (optional)
- Accessibility features (WCAG 2.1 AA)

🔌 RESTful API

JSON-Based Communication
- Standardized request/response format
- CORS-enabled for cross-origin access
- Comprehensive error messages with status codes
- Rate limiting (planned)
- API versioning support
Query Logging & Analytics
- Automatic query history tracking
- Timestamp-based audit trail
- Analytics-ready data format
- Performance metrics collection
- User engagement tracking
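
A timestamped query log of the kind described above can be kept in a single JSON file. The sketch below is one possible shape, under stated assumptions: the entry fields and the `append_query_log` helper are hypothetical, and only the `query_logs.json` file name comes from the project tree.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def append_query_log(log_path: Path, coin: str, query: str,
                     processing_time_ms: float) -> Dict[str, Any]:
    """Append one entry to a JSON array stored at `log_path` and return the entry."""
    entry = {
        # UTC timestamp gives an unambiguous audit trail.
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "coin": coin,
        "query": query,
        "processing_time_ms": processing_time_ms,
    }
    if log_path.exists():
        entries = json.loads(log_path.read_text(encoding="utf-8"))
    else:
        entries = []
    entries.append(entry)
    log_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entry
```

Rewriting the whole file on every query is simple but not safe under concurrent requests; a line-per-entry format or a database would suit heavier traffic better.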

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      CryptoSights Platform                      │
└─────────────────────────────────────────────────────────────────┘
                                │
                ┌───────────────┴───────────────┐
                │                               │
        ┌───────▼────────┐              ┌──────▼───────┐
        │ Data Pipeline  │              │  Web Layer   │
        └───────┬────────┘              └──────┬───────┘
                │                              │
    ┌───────────┼───────────┐          ┌───────┼───────┐
    │           │           │          │       │       │
┌───▼───┐   ┌──▼───┐   ┌──▼───┐   ┌──▼───┐ ┌───▼──┐ ┌──▼────┐
│Scraper│   │ PDF  │   │ NLP  │   │Flask │ │REST  │ │Docker │
│Module │   │Conver│   │Engine│   │ UI   │ │ API  │ │Deploy │
└───┬───┘   └──┬───┘   └──┬───┘   └──┬───┘ └─┬────┘ └──┬────┘
    │          │          │          │       │         │
    └──────────┴──────────┴──────────┴───────┴─────────┘
                        │
            ┌───────────┴───────────┐
            │                       │
        ┌───▼────┐              ┌──▼──────┐
        │  BM25  │              │Sentiment│
        │Indexing│              │Analyzer │
        └───┬────┘              └──┬──────┘
            │                      │
    ┌───────▼──────────────────────▼────────┐
    │        Query Processing Layer         │
    └───────────────────────────────────────┘
                        │
            ┌───────────┴──────────┐
            │                      │
        ┌───▼────────┐      ┌──────▼──────┐
        │ Validation │      │   Results   │
        │   Engine   │      │ Formatting  │
        └────────────┘      └─────────────┘

Pipeline Flow

Data Acquisition → Web scraping → URL collection → CSV storage
Content Processing → HTML to PDF → Text extraction → Clean output
NLP Processing → Tokenization → Lemmatization → Term preservation
Indexing → BM25 index creation → Persistence → Memory optimization
Query Handling → User input → BM25 ranking → Result formatting
Validation → Semantic matching → Relevance scoring → Quality assurance

🛠️ Technology Stack

Core Technologies

Category	Technologies
Language	Python 3.10+
Web Frameworks	Flask 2.0+ & FastAPI
NLP Libraries	NLTK, TextBlob, Sentence-Transformers
ML Algorithms	BM25Okapi, Random Forest, TF-IDF
Data Processing	Pandas, NumPy
Web Scraping	BeautifulSoup4, Requests, Trafilatura
Visualization	Matplotlib, Seaborn
Containerization	Docker, Docker Compose
Deployment	Hugging Face Spaces, Docker Hub

Detailed Dependencies

# Web Framework & API
flask>=2.0.0
flask-cors>=3.0.10
fastapi>=0.95.0

# Web Scraping & Content Extraction
requests>=2.28.0
beautifulsoup4>=4.11.0
trafilatura>=1.4.0
pdfkit>=1.0.0
lxml>=4.9.0

# Natural Language Processing
nltk>=3.8.0
textblob>=0.17.0
sentence-transformers>=2.2.0

# Machine Learning & Retrieval
rank-bm25>=0.2.2
scikit-learn>=1.2.0

# Data Manipulation
pandas>=1.5.0
numpy>=1.23.0

# Visualization
matplotlib>=3.6.0
seaborn>=0.12.0

# Utilities
python-dotenv>=0.21.0

External Dependencies

wkhtmltopdf - PDF rendering engine (required by pdfkit)
- Download: https://wkhtmltopdf.org/downloads.html
- Ubuntu: sudo apt-get install wkhtmltopdf
- macOS: brew install wkhtmltopdf

🚀 Quick Start

Option 1: Traditional Python Installation

Prerequisites

Python 3.10 or higher
pip (Python package manager)
wkhtmltopdf
Git

Installation Steps

1. Clone the Repository

git clone https://github.com/techiepookie/crypto-insight-ai.git
cd crypto-insight-ai

2. Create Virtual Environment (Recommended)

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

3. Install Python Dependencies

pip install -r requirements.txt

4. Download NLTK Resources

python -c "import nltk; \
    nltk.download('punkt'); \
    nltk.download('punkt_tab'); \
    nltk.download('stopwords'); \
    nltk.download('wordnet'); \
    nltk.download('omw-1.4')"

5. Install wkhtmltopdf

# Ubuntu/Debian
sudo apt-get install wkhtmltopdf

# macOS
brew install wkhtmltopdf

# Windows: Download from https://wkhtmltopdf.org/downloads.html

6. Verify Installation

python -c "import flask, nltk, textblob, rank_bm25; print('✓ All dependencies installed!')"

7. Start the Application

cd crypto_sights
python run.py

Access at: http://localhost:5000

🐳 Docker Deployment

Option 2: Docker (Recommended for Production)

Requires Docker installed: https://docs.docker.com/get-docker/

Quick Start

# Pull pre-built image from Docker Hub
docker pull techiepookie/cryptosights:latest

# Run container
docker run -d \
  --name cryptosights \
  -p 7860:7860 \
  techiepookie/cryptosights:latest

# Access application
# http://localhost:7860

Local Build & Deploy

# Build Docker image (run from the directory that contains the Dockerfile;
# the project tree lists it under crypto_sights/)
docker build -t cryptosights .

# Run the locally built image
docker run -d \
  --name cryptosights \
  -p 7860:7860 \
  cryptosights

# View logs
docker logs -f cryptosights

# Stop container
docker stop cryptosights
docker rm cryptosights

Docker Compose (Recommended)

Create docker-compose.yml:

version: '3.8'

services:
  cryptosights:
    image: techiepookie/cryptosights:latest
    container_name: cryptosights
    ports:
      - "7860:7860"
    volumes:
      - ./app/data:/app/app/data
    environment:
      - FLASK_ENV=production
      - PYTHONUNBUFFERED=1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Deploy:

docker-compose up -d
docker-compose logs -f
docker-compose down

Docker Hub Publishing

# Tag image
docker tag cryptosights:latest techiepookie/cryptosights:latest
docker tag cryptosights:latest techiepookie/cryptosights:v3.0.0

# Login to Docker Hub
docker login

# Push to registry
docker push techiepookie/cryptosights:latest
docker push techiepookie/cryptosights:v3.0.0

Docker Hub Repository: techiepookie/cryptosights

Hugging Face Spaces Deployment

Live Demo: https://techiepookie-cryptosightsai.hf.space

# Clone Space repository
git clone https://huggingface.co/spaces/techiepookie/cryptosightsai
cd cryptosightsai

# Copy application files
cp -r /path/to/crypto-insight-ai/crypto_sights/* .
cp /path/to/crypto-insight-ai/requirements.txt .

# Deploy via Git
git add .
git commit -m "Deploy CryptoSights v3.0"
git push

Hugging Face automatically builds and deploys the Docker container.

📖 Usage Guide

Running the Complete Data Pipeline

Step 1: Scrape & Process Articles

cd crypto_sights/app/scripts
python crypto_scraping_model.py

Pipeline Steps:

Scrapes articles for all cryptocurrencies
Converts articles to PDF format
Extracts and cleans text content
Performs sentiment analysis
Builds BM25 search index
Saves processed data and indexes

Output Files:

{coin}_news_urls.csv - Article URLs
{coin}_article_{n}.pdf - Downloaded articles
cleaned_data.txt - Preprocessed corpus
extracted_text.pickle - BM25 index
Sentiment charts (PNG)

Starting the Web Application

Option 1: Direct Python

cd crypto_sights
python run.py
# Access: http://localhost:5000

Option 2: Docker

docker run -p 7860:7860 techiepookie/cryptosights:latest
# Access: http://localhost:7860

Web Interface Usage

Access Dashboard → http://localhost:7860 (Docker) or http://localhost:5000 (direct Python)
Select Cryptocurrency → Choose from dropdown (Bitcoin, Ethereum, etc.)
Enter Query → Type your question
- Example: "What is driving Bitcoin's recent price increase?"
View Results → See ranked answers with:
- Relevance scores
- Sentiment analysis
- Source documents
- Processing time

Programmatic Query Interface (Python)

# Run from the crypto_sights/ directory so the app package is importable
from app.scripts.model import process_query

# Execute a query
results = process_query(
    coin="bitcoin",
    query="What are analysts predicting for Bitcoin in 2025?",
    top_n=5
)

# Access results
print(f"Sentiment: {results['sentiment']}")
for idx, answer in enumerate(results['top_answers'], 1):
    print(f"{idx}. {answer}")

Advanced Usage Examples

Custom Query with BM25:

import pickle
from rank_bm25 import BM25Okapi

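# Load the pickled corpus and BM25 index.
# rank_bm25 must be importable for unpickling; only load pickle files you trust.
with open('app/data/extracted_text.pickle', 'rb') as f:
    corpus, bm25 = pickle.load(f)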

# Perform custom query
query = "ethereum smart contract upgrades"
tokenized_query = query.lower().split()
scores = bm25.get_scores(tokenized_query)

# Get top results
top_indices = scores.argsort()[-10:][::-1]
results = [corpus[i] for i in top_indices]

Batch Query Processing:

from app.scripts.model import process_query

queries = [
    "Bitcoin price prediction 2025",
    "Ethereum scaling solutions",
    "Solana network performance"
]

for query in queries:
    results = process_query("general", query, top_n=3)
    print(f"\nQuery: {query}")
    print(f"Sentiment: {results['sentiment']}")

🔌 API Reference

Base URL

http://localhost:7860  # Docker
http://localhost:5000  # Development (python run.py)

Endpoints

Health Check

Endpoint: GET /health

curl http://localhost:7860/health

Response:

{
  "status": "healthy",
  "timestamp": "2026-01-15T10:30:00Z",
  "version": "3.0.0"
}
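
A deployment script can poll this endpoint before sending traffic. The sketch below uses only the standard library; `wait_until_healthy`, `fetch_json`, and the retry defaults are hypothetical, and the only thing taken from the documentation is the `/health` path and the `"status": "healthy"` field.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_json(url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(
    base_url: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll {base_url}/health until it reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(f"{base_url}/health").get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: treat as not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:7860")` would return `True` once the container started by `docker run` answers the health check.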

Query Processing

Endpoint: POST /query

Request:

curl -X POST http://localhost:7860/query \
  -H "Content-Type: application/json" \
  -d '{
    "coin": "bitcoin",
    "query": "What is the latest trend in Bitcoin price?"
  }'

Parameters:

Parameter	Type	Required	Description
`coin`	string	Yes	bitcoin, ethereum, solana, dogecoin, hamster, general
`query`	string	Yes	User question (max 500 chars)

Response (200 OK):

{
  "query": "What is the latest trend in Bitcoin price?",
  "coin": "bitcoin",
  "sentiment": "Positive (Score: 0.65)",
  "top_answers": [
    "Bitcoin prices surged this week due to increased institutional adoption. (Score: 0.92)",
    "Analysts predict Bitcoin will reach new highs in Q2 2025. (Score: 0.85)"
  ],
  "metadata": {
    "processing_time_ms": 145,
    "total_documents_searched": 1247,
    "timestamp": "2026-01-15T10:30:00Z"
  }
}
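
In the sample response above, scores are embedded in the `sentiment` and `top_answers` strings rather than returned as separate numeric fields. A client that needs the numbers could split them out as sketched here; the `split_score` helper and the regular expression are assumptions based solely on the string format shown in this example.

```python
import re
from typing import NamedTuple, Optional


class ScoredText(NamedTuple):
    text: str
    score: Optional[float]


# Matches a trailing "(Score: 0.92)" suffix as it appears in the sample response.
_SCORE_RE = re.compile(
    r"^(?P<text>.*?)\s*\(Score:\s*(?P<score>-?\d+(?:\.\d+)?)\)\s*$",
    re.DOTALL,
)


def split_score(value: str) -> ScoredText:
    """Separate the text from its trailing score; score is None if no suffix is found."""
    match = _SCORE_RE.match(value)
    if match is None:
        return ScoredText(value.strip(), None)
    return ScoredText(match.group("text"), float(match.group("score")))


print(split_score("Positive (Score: 0.65)"))
```

Applied to each element of `top_answers`, this yields answer text and relevance score as separate values that can be sorted or filtered.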

Response (400 Bad Request):

{
  "error": "Invalid coin parameter",
  "message": "Coin must be one of: bitcoin, ethereum, solana, dogecoin, hamster, general",
  "status": 400
}
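
The validation rules implied by the parameter table and the 400 response can be sketched as a small function. This is an assumption about how such a check might be written, not the project's route code: `validate_query_payload` is a hypothetical name, and the two query-related error messages are invented for the example (only the coin error is documented).

```python
from typing import Any, Dict, Optional

ALLOWED_COINS = ("bitcoin", "ethereum", "solana", "dogecoin", "hamster", "general")
MAX_QUERY_CHARS = 500


def _error(error: str, message: str) -> Dict[str, Any]:
    return {"error": error, "message": message, "status": 400}


def validate_query_payload(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a 400-style error body for an invalid payload, or None when it is valid."""
    coin = payload.get("coin")
    if not isinstance(coin, str) or coin.lower() not in ALLOWED_COINS:
        return _error(
            "Invalid coin parameter",
            "Coin must be one of: " + ", ".join(ALLOWED_COINS),
        )
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        # Hypothetical message; the documentation only shows the coin error.
        return _error("Invalid query parameter", "Query must be a non-empty string")
    if len(query) > MAX_QUERY_CHARS:
        # Hypothetical message for the documented 500-character limit.
        return _error(
            "Invalid query parameter",
            f"Query must be at most {MAX_QUERY_CHARS} characters",
        )
    return None
```

A route handler would call this first and return the error body with status 400 when the result is not `None`.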

Python SDK

import requests

class CryptoSightsClient:
    def __init__(self, base_url="http://localhost:7860"):
        self.base_url = base_url
    
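    def query(self, coin, question, timeout=10):
        """Submit a query to CryptoSights and return the parsed JSON body."""
        endpoint = f"{self.base_url}/query"
        payload = {"coin": coin, "query": question}
        # Without a timeout, requests can block indefinitely on an unresponsive server
        response = requests.post(endpoint, json=payload, timeout=timeout)
        return response.json()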

# Usage
client = CryptoSightsClient()
results = client.query("bitcoin", "What drives Bitcoin volatility?")
print(results['sentiment'])

🎓 Evaluation System

The Validation Engine ensures answer quality through ML-based relevance scoring and semantic matching.

Architecture

Input Validation
       ↓
Keyword Matching
       ↓
Semantic Similarity (Sentence-Transformers)
       ↓
TF-IDF Vectorization
       ↓
Random Forest Classifier
       ↓
Valid/Invalid Classification

Components

1. Entry Validation

def is_valid_entry(entry):
    """Multi-stage validation pipeline"""
    # Stage 1: Special cases
    if entry['coin'].lower() == 'general crypto':
        return True
    
    # Stage 2: Basic validation
    if not entry['query'] or not entry['answers']:
        return False
    
    # Stage 3: Keyword matching
    if has_keyword_match(entry):
        return True
    
    # Stage 4: Semantic similarity (threshold: 0.5)
    similarity = compute_semantic_similarity(entry)
    return similarity >= 0.5

2. Semantic Similarity

Model: all-MiniLM-L6-v2 (384-dimensional embeddings)
Similarity Metric: Cosine similarity
Threshold: 0.5
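
The threshold check above reduces to a cosine similarity between embedding vectors. The sketch below works on plain lists of floats so it runs without the model; in practice the vectors would be 384-dimensional embeddings from all-MiniLM-L6-v2. Both function names are hypothetical, and accepting an entry when any single answer clears the threshold is an assumption about how answers are combined.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_threshold(query_vec: Sequence[float],
                     answer_vecs: Sequence[Sequence[float]],
                     threshold: float = 0.5) -> bool:
    """True if at least one answer embedding is similar enough to the query embedding."""
    return any(cosine_similarity(query_vec, v) >= threshold for v in answer_vecs)
```

With toy two-dimensional vectors, `passes_threshold([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])` is `True`, because the second answer has a similarity of about 0.71 with the query.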

3. Machine Learning

Algorithm: Random Forest Classifier
Features: TF-IDF vectors (500 features)
Cross-Validation: 5-fold CV
Performance: F1-score optimization

Usage Example

from evaluation.evalmodel import is_valid_entry

entry = {
    'coin': 'bitcoin',
    'query': 'What factors influence Bitcoin price?',
    'answers': [
        'Supply and demand affect Bitcoin price.',
        'Regulatory news impacts sentiment.'
    ]
}

is_valid = is_valid_entry(entry)
print(f"Valid: {is_valid}")  # True

📁 Project Structure

crypto-insight-ai/
├── cache/                              # Temporary cache
├── crypto_sights/                      # Main package
│   ├── app/                            # Flask application
│   │   ├── data/                       # Data storage
│   │   │   ├── UI-UX/
│   │   │   ├── extracted_text.pickle   # BM25 index
│   │   │   ├── metrics.json
│   │   │   └── query_logs.json
│   │   ├── routes/                     # API endpoints
│   │   │   ├── __init__.py
│   │   │   └── main_routes.py
│   │   ├── scripts/                    # Processing
│   │   │   ├── crypto_scraping_model.py
│   │   │   ├── crypto.py
│   │   │   ├── keywords.py
│   │   │   ├── logger.py
│   │   │   ├── model.py
│   │   │   └── process_and_push.py
│   │   ├── static/                     # Web assets
│   │   │   ├── assets/
│   │   │   ├── css/
│   │   │   └── js/
│   │   ├── templates/                  # HTML templates
│   │   │   ├── index.html
│   │   │   ├── evaluation.html
│   │   │   ├── faq.html
│   │   │   └── ...
│   │   └── __init__.py
│   ├── evaluation/                     # Validation engine
│   │   ├── evalbackend.py
│   │   └── evalmodel.py
│   ├── Dockerfile                      # Container config
│   ├── deployment.md                   # Deployment guide
│   └── run.py                          # Entry point
├── .dockerignore
├── .gitignore
├── contributors.md
├── README.md
├── requirements.txt
└── LICENSE

💻 Development

Setup Development Environment

# Clone repository
git clone https://github.com/techiepookie/crypto-insight-ai.git
cd crypto-insight-ai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install development dependencies (optional)
pip install pytest pytest-cov black flake8 pylint mypy

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=crypto_sights

# Run specific test
pytest tests/test_model.py

Code Quality

# Format code
black crypto_sights/

# Lint
flake8 crypto_sights/ --max-line-length=100

# Type checking
mypy crypto_sights/

Commit Message Convention

Follow Conventional Commits:

feat: add new feature description
fix: resolve bug issue
docs: update documentation
refactor: improve code structure
test: add test cases

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

How to Contribute

Fork the repository
Create feature branch - git checkout -b feature/your-feature
Commit changes - git commit -m "feat: description"
Push to fork - git push origin feature/your-feature
Create Pull Request

Code Style

Follow PEP 8
Add docstrings to functions
Include type hints
Write unit tests
Maintain >80% code coverage

🔧 Troubleshooting

Common Issues

Q: "ModuleNotFoundError: No module named 'flask'"

pip install -r requirements.txt

Q: "wkhtmltopdf not found"

# Ubuntu
sudo apt-get install wkhtmltopdf

# macOS
brew install wkhtmltopdf

Q: "NLTK resources not downloaded"

python -c "import nltk; [nltk.download(r) for r in ('punkt', 'punkt_tab', 'stopwords', 'wordnet', 'omw-1.4')]"

Q: "Port 7860 already in use"

# Find process using port
lsof -i :7860

# Kill process or use different port
docker run -p 8080:7860 techiepookie/cryptosights:latest

Q: "Out of memory errors"

# Increase Docker memory allocation in Desktop settings,
# or set an explicit container memory limit (4g is an illustrative value)
docker run --memory=4g -p 7860:7860 techiepookie/cryptosights:latest

Getting Help

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: See deployment.md

📄 License

This project is licensed under the MIT License - see LICENSE for details.

Copyright (c) 2025 CryptoSights Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files...

📞 Contact & Support

Connect

GitHub: @techiepookie
Repository: crypto-insight-ai
Live Demo: techiepookie-cryptosightsai.hf.space

Support the Project

⭐ Star the repository | 🐛 Report bugs | 💡 Suggest features | 🤝 Contribute code

🙏 Acknowledgments

Team

Lead Developers: Harshal Chaudhari & Nikhil Kumar Obhawani
Contributors: See contributors.md

Technologies

Flask • NLTK • TextBlob • Scikit-learn • Pandas • Docker • Hugging Face

Built with ❤️ by the CryptoSights Team
