An intelligent NLP-powered platform for cryptocurrency news analysis, sentiment tracking, and query-based intelligence.
Features • Quick Start • Documentation • Docker • API • Contributing
- Overview
- Key Features
- System Architecture
- Technology Stack
- Quick Start
- Docker Deployment
- Usage Guide
- API Reference
- Evaluation System
- Project Structure
- Development
- Contributing
- Troubleshooting
- License
- Contact
CryptoSights is a production-grade cryptocurrency intelligence platform that combines web scraping, natural language processing, and machine learning to deliver actionable insights from cryptocurrency news articles. The platform processes thousands of articles across major cryptocurrencies, performs advanced sentiment analysis, and provides instant, ranked answers to user queries through an intuitive web interface and RESTful API.
- Bitcoin (BTC) - Digital gold and store of value
- Ethereum (ETH) - Smart contract pioneer
- Solana (SOL) - High-performance blockchain
- Dogecoin (DOGE) - Community-driven memecoin
- Hamster (HMSTR) - Emerging cryptocurrency
- General Crypto - Cross-asset analysis
| Capability | Description |
|---|---|
| Intelligent Scraping | Automated article collection from cryptocurrency news sources |
| NLP Processing | Advanced text extraction, cleaning, and preprocessing |
| Sentiment Analysis | Real-time sentiment tracking with polarity scoring |
| Smart Retrieval | BM25-powered ranked answer generation |
| Web Interface | Interactive query platform with visual analytics |
| RESTful API | Programmatic access for integration |
| Validation Engine | Machine learning-based answer verification |
| Docker Support | Containerized deployment with one-command setup |
- Automated Web Scraping
  - Harvests 10+ articles per cryptocurrency from trusted sources
  - Intelligent URL extraction with validation
  - Built-in rate limiting and exponential backoff
  - User-agent rotation to prevent blocking
  - Structured CSV output for audit trails and reproducibility
- Robust Error Handling
  - Automatic retry mechanisms with configurable backoff
  - Comprehensive logging system for debugging
  - Graceful degradation on network failures
  - Detailed error reports in query logs
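The retry behaviour listed above could look roughly like the following sketch. It is not the project's scraper code: `fetch_with_backoff`, `max_retries`, and `base_delay` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    All names and defaults are illustrative assumptions, not project settings.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demo with a fake fetcher that fails twice before succeeding
calls = {"count": 0}
delays = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


# delays.append stands in for time.sleep so the demo finishes instantly
page = fetch_with_backoff(flaky_fetch, "https://example.com/article", sleep=delays.append)
```

Injecting `sleep` keeps the backoff schedule easy to inspect in tests; a real scraper would leave the default `time.sleep` in place.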
- Multi-Stage Content Extraction
  - HTML to PDF conversion via `pdfkit` with wkhtmltopdf
  - Clean text extraction using the `trafilatura` library
  - Boilerplate removal (ads, navigation, footers)
  - Character encoding normalization (UTF-8)
- Advanced Preprocessing
  - Intelligent lowercasing with acronym preservation (BTC, ETH, DeFi)
  - Special character handling and normalization
  - Tokenization with cryptocurrency term protection
  - Stemming and lemmatization via NLTK
  - Stop word removal with domain-specific exceptions
- Domain-Aware Processing
  - Cryptocurrency-specific keyword preservation
  - Technical term recognition (DeFi, NFT, PoS, DAO)
  - Context-sensitive cleaning for financial terms
  - Preserves numerical values for price analysis
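As a rough illustration of the preprocessing rules above (acronym preservation, stop word exceptions, and keeping numeric values), here is a minimal standard-library sketch. The term lists, the regular expression, and the `preprocess` function are assumptions made for this example; the project itself uses NLTK, and stemming and lemmatization are left out here.

```python
import re

# Assumed example sets; the project's real lists are not shown in this README
PROTECTED_TERMS = {"BTC", "ETH", "DeFi", "NFT", "PoS", "DAO"}
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "as", "up", "down"}
# Domain-specific exceptions: direction words carry meaning in price news
STOP_WORD_EXCEPTIONS = {"up", "down"}

# Numbers (with optional $, thousands separators, decimals, %) or alphabetic words
TOKEN_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?|[A-Za-z]+")


def preprocess(text):
    """Tokenize text, keep protected terms as-is, lowercase the rest, drop stop words."""
    tokens = []
    for token in TOKEN_RE.findall(text):
        if token in PROTECTED_TERMS:
            tokens.append(token)  # keep original casing for crypto terms
            continue
        lowered = token.lower()
        if lowered in STOP_WORDS and lowered not in STOP_WORD_EXCEPTIONS:
            continue
        tokens.append(lowered)
    return tokens


tokens = preprocess("The BTC price is up 5.2% as DeFi and ETH rally to $64,000")
```

Matching numbers before words keeps values such as `5.2%` and `$64,000` as single tokens, which is what later price analysis would need.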
- Sentence-Level Analysis
  - TextBlob-powered sentiment scoring
  - Polarity measurement (-1.0 to +1.0 scale)
  - Subjectivity detection (0.0 to 1.0)
  - Multi-class classification (Positive, Negative, Neutral)
  - Per-document aggregate sentiment
- Visual Analytics
  - Real-time sentiment distribution charts
  - Matplotlib-powered visualizations
  - Sentiment trends over time
  - Exportable graphics for reporting
  - JSON sentiment metrics export
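To make the step from polarity scores to class labels concrete, the sketch below classifies per-sentence polarity values and aggregates them per document. It takes the scores as plain floats (in the project they would come from TextBlob), and the ±0.1 threshold and the function names are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean


def classify_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a label; the threshold is an assumed value."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


def summarize_document(sentence_polarities, threshold=0.1):
    """Aggregate sentence-level polarities into a per-document summary."""
    labels = [classify_polarity(p, threshold) for p in sentence_polarities]
    overall = mean(sentence_polarities) if sentence_polarities else 0.0
    label = classify_polarity(overall, threshold)
    return {
        "distribution": dict(Counter(labels)),  # counts per class, ready for charting
        "mean_polarity": overall,
        "label": label,
        "display": f"{label} (Score: {overall:.2f})",
    }


summary = summarize_document([0.6, 0.2, -0.4, 0.0, 0.35])
```

The `distribution` dictionary is the kind of structure that could feed a sentiment distribution chart or a JSON metrics export.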
- BM25 Algorithm Implementation
  - Okapi BM25 ranking function (k1=1.5, b=0.75)
  - Efficient inverted index structure
  - Sub-second query response times
  - Tunable ranking parameters
  - Persistent index serialization with pickle
- Rich Query Results
  - Ranked answer ordering by relevance score (0.0-1.0)
  - Source document attribution with metadata
  - Sentiment context for each result
  - Configurable result count (top-n)
  - Duplicate result filtering
  - Query execution time tracking
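For readers unfamiliar with BM25, the following self-contained sketch shows how scoring with k1=1.5 and b=0.75, top-n selection, and duplicate filtering can fit together. It is a simplified stand-in, not the project's implementation: it uses a smoothed IDF that differs slightly from the one in `rank_bm25`, and scaling scores by the best score to reach a 0.0-1.0 range is an assumption, since the README does not say how relevance scores are normalized.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF that never goes negative (assumed variant)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def top_answers(query, documents, top_n=3):
    """Rank documents, drop exact duplicates, and scale scores to 0.0-1.0."""
    corpus_tokens = [doc.lower().split() for doc in documents]
    scores = bm25_scores(query.lower().split(), corpus_tokens)
    best = max(scores, default=0.0)
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

    seen, results = set(), []
    for text, score in ranked:
        if score <= 0 or text in seen:
            continue  # skip non-matching and duplicate documents
        seen.add(text)
        results.append((text, round(score / best, 2)))
        if len(results) == top_n:
            break
    return results


documents = [
    "ethereum upgrade improves smart contract fees",
    "bitcoin price rallies on etf inflows",
    "ethereum upgrade improves smart contract fees",  # duplicate
    "solana network outage resolved",
]
results = top_answers("ethereum smart contract", documents)
```

Raising `b` penalizes long documents more strongly, while raising `k1` lets repeated query terms keep adding to the score for longer.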
- User-Friendly Dashboard
  - Cryptocurrency selection with autocomplete
  - Real-time query processing with loading indicators
  - Interactive result display with collapsible cards
  - Sentiment visualization integration
  - Query history and saved searches (planned)
- Responsive Design
  - Mobile-optimized layouts (Bootstrap 5)
  - Cross-browser compatibility (Chrome, Firefox, Safari, Edge)
  - Modern CSS3/HTML5 standards
  - Dark mode support (optional)
  - Accessibility features (WCAG 2.1 AA)
- JSON-Based Communication
  - Standardized request/response format
  - CORS-enabled for cross-origin access
  - Comprehensive error messages with status codes
  - Rate limiting ready (planned)
  - API versioning support
- Query Logging & Analytics
  - Automatic query history tracking
  - Timestamp-based audit trail
  - Analytics-ready data format
  - Performance metrics collection
  - User engagement tracking
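A timestamped query log of the kind described above can be kept in a single JSON file. The sketch below is one possible shape, not the project's logger: the entry fields and the `log_query` function are assumptions, and the demo writes to a temporary directory rather than the real `query_logs.json`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_query(log_path, coin, query, processing_time_ms):
    """Append one timestamped entry to a JSON array stored at log_path."""
    path = Path(log_path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    entries.append({
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "coin": coin,
        "query": query,
        "processing_time_ms": processing_time_ms,
    })
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


# Demo: write two entries to a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "query_logs.json"
    log_query(log_file, "bitcoin", "What drives Bitcoin volatility?", 145)
    entries = log_query(log_file, "ethereum", "Ethereum scaling solutions", 98)
```

Rewriting the whole file on every query is fine for small logs; a busier deployment would more likely append one JSON object per line.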
CryptoSights Platform
- Data Pipeline
  - Scraper Module
  - PDF Converter
  - NLP Engine
- Web Layer
  - Flask UI
  - REST API
  - Docker Deploy
- Shared components (fed by the modules above)
  - BM25 Indexing
  - Sentiment Analyzer
- Query Processing Layer (fed by BM25 Indexing and the Sentiment Analyzer)
  - Validation Engine
  - Results Formatting
- Data Acquisition → Web scraping → URL collection → CSV storage
- Content Processing → HTML to PDF → Text extraction → Clean output
- NLP Processing → Tokenization → Lemmatization → Term preservation
- Indexing → BM25 index creation → Persistence → Memory optimization
- Query Handling → User input → BM25 ranking → Result formatting
- Validation → Semantic matching → Relevance scoring → Quality assurance
| Category | Technologies |
|---|---|
| Language | Python 3.10+ |
| Web Frameworks | Flask 2.0+ & FastAPI |
| NLP Libraries | NLTK, TextBlob, Sentence-Transformers |
| ML Algorithms | BM25Okapi, Random Forest, TF-IDF |
| Data Processing | Pandas, NumPy |
| Web Scraping | BeautifulSoup4, Requests, Trafilatura |
| Visualization | Matplotlib, Seaborn |
| Containerization | Docker, Docker Compose |
| Deployment | Hugging Face Spaces, Docker Hub |
# Web Framework & API
flask>=2.0.0
flask-cors>=3.0.10
fastapi>=0.95.0
# Web Scraping & Content Extraction
requests>=2.28.0
beautifulsoup4>=4.11.0
trafilatura>=1.4.0
pdfkit>=1.0.0
lxml>=4.9.0
# Natural Language Processing
nltk>=3.8.0
textblob>=0.17.0
sentence-transformers>=2.2.0
# Machine Learning & Retrieval
rank-bm25>=0.2.2
scikit-learn>=1.2.0
# Data Manipulation
pandas>=1.5.0
numpy>=1.23.0
# Visualization
matplotlib>=3.6.0
seaborn>=0.12.0
# Utilities
python-dotenv>=0.21.0
- wkhtmltopdf: PDF rendering engine (required by pdfkit)
  - Download: https://wkhtmltopdf.org/downloads.html
  - Ubuntu: `sudo apt-get install wkhtmltopdf`
  - macOS: `brew install wkhtmltopdf`
- Python 3.10 or higher
- pip (Python package manager)
- wkhtmltopdf
- Git
1. Clone the Repository
git clone https://github.com/techiepookie/crypto-insight-ai.git
cd crypto-insight-ai
2. Create Virtual Environment (Recommended)
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
3. Install Python Dependencies
pip install -r requirements.txt
4. Download NLTK Resources
python -c "import nltk; \
nltk.download('punkt'); \
nltk.download('punkt_tab'); \
nltk.download('stopwords'); \
nltk.download('wordnet'); \
nltk.download('omw-1.4')"
5. Install wkhtmltopdf
# Ubuntu/Debian
sudo apt-get install wkhtmltopdf
# macOS
brew install wkhtmltopdf
# Windows: Download from https://wkhtmltopdf.org/downloads.html
6. Verify Installation
python -c "import flask, nltk, textblob, rank_bm25; print('All dependencies installed!')"
7. Start the Application
cd crypto_sights
python run.py
Access at: http://localhost:5000
Requires Docker installed: https://docs.docker.com/get-docker/
# Pull pre-built image from Docker Hub
docker pull techiepookie/cryptosights:latest
# Run container
docker run -d \
--name cryptosights \
-p 7860:7860 \
techiepookie/cryptosights:latest
# Access application
# http://localhost:7860
# Build Docker image
docker build -t cryptosights .
# Run the locally built image
docker run -d \
  --name cryptosights \
  -p 7860:7860 \
  cryptosights
# View logs
docker logs -f cryptosights
# Stop container
docker stop cryptosights
docker rm cryptosights
Create docker-compose.yml:
version: '3.8'
services:
  cryptosights:
    image: techiepookie/cryptosights:latest
    container_name: cryptosights
    ports:
      - "7860:7860"
    volumes:
      - ./app/data:/app/app/data
    environment:
      - FLASK_ENV=production
      - PYTHONUNBUFFERED=1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 10s
      retries: 3
Deploy:
docker-compose up -d
docker-compose logs -f
docker-compose down
# Tag image
docker tag cryptosights:latest techiepookie/cryptosights:latest
docker tag cryptosights:latest techiepookie/cryptosights:v3.0.0
# Login to Docker Hub
docker login
# Push to registry
docker push techiepookie/cryptosights:latest
docker push techiepookie/cryptosights:v3.0.0
Docker Hub Repository: techiepookie/cryptosights
Live Demo: https://techiepookie-cryptosightsai.hf.space
# Clone Space repository
git clone https://huggingface.co/spaces/techiepookie/cryptosightsai
cd cryptosightsai
# Copy application files
cp -r /path/to/crypto-insight-ai/crypto_sights/* .
cp /path/to/crypto-insight-ai/requirements.txt .
# Deploy via Git
git add .
git commit -m "Deploy CryptoSights v3.0"
git push
Hugging Face automatically builds and deploys the Docker container.
Step 1: Scrape & Process Articles
cd crypto_sights/app/scripts
python crypto_scraping_model.py
Pipeline Steps:
- Scrapes articles for all cryptocurrencies
- Converts articles to PDF format
- Extracts and cleans text content
- Performs sentiment analysis
- Builds BM25 search index
- Saves processed data and indexes
Output Files:
- `{coin}_news_urls.csv`: Article URLs
- `{coin}_article_{n}.pdf`: Downloaded articles
- `cleaned_data.txt`: Preprocessed corpus
- `extracted_text.pickle`: BM25 index
- Sentiment charts (PNG)
Option 1: Direct Python
cd crypto_sights
python run.py
# Access: http://localhost:5000
Option 2: Docker
docker run -p 7860:7860 techiepookie/cryptosights:latest
# Access: http://localhost:7860
- Access Dashboard → http://localhost:7860 (or http://localhost:5000 when running directly with Python)
- Select Cryptocurrency → Choose from dropdown (Bitcoin, Ethereum, etc.)
- Enter Query → Type your question
  - Example: "What is driving Bitcoin's recent price increase?"
- View Results → See ranked answers with:
  - Relevance scores
  - Sentiment analysis
  - Source documents
  - Processing time
from app.scripts.model import process_query
# Execute a query
results = process_query(
    coin="bitcoin",
    query="What are analysts predicting for Bitcoin in 2025?",
    top_n=5
)
# Access results
print(f"Sentiment: {results['sentiment']}")
for idx, answer in enumerate(results['top_answers'], 1):
    print(f"{idx}. {answer}")
Custom Query with BM25:
import pickle
from rank_bm25 import BM25Okapi
# Load BM25 index
# Only unpickle index files you generated yourself
with open('app/data/extracted_text.pickle', 'rb') as f:
    corpus, bm25 = pickle.load(f)
# Perform custom query
query = "ethereum smart contract upgrades"
tokenized_query = query.lower().split()
scores = bm25.get_scores(tokenized_query)
# Get top results (10 highest scores, best first)
top_indices = scores.argsort()[-10:][::-1]
results = [corpus[i] for i in top_indices]
Batch Query Processing:
from app.scripts.model import process_query

queries = [
    "Bitcoin price prediction 2025",
    "Ethereum scaling solutions",
    "Solana network performance"
]
for query in queries:
    results = process_query("general", query, top_n=3)
    print(f"\nQuery: {query}")
    print(f"Sentiment: {results['sentiment']}")
http://localhost:7860 # Docker
http://localhost:5000 # Development
Endpoint: GET /health
curl http://localhost:7860/health
Response:
{
  "status": "healthy",
  "timestamp": "2026-01-15T10:30:00Z",
  "version": "3.0.0"
}
Endpoint: POST /query
Request:
curl -X POST http://localhost:7860/query \
-H "Content-Type: application/json" \
-d '{
"coin": "bitcoin",
"query": "What is the latest trend in Bitcoin price?"
}'
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `coin` | string | Yes | One of: bitcoin, ethereum, solana, dogecoin, hamster, general |
| `query` | string | Yes | User question (max 500 chars) |
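The parameter rules in the table could be enforced with a small validation helper like the one below. This is only a sketch under stated assumptions: `validate_query_request` is a hypothetical function, the error body for an invalid `query` is invented for illustration, and whether the real service lowercases or trims input is not documented here.

```python
ALLOWED_COINS = {"bitcoin", "ethereum", "solana", "dogecoin", "hamster", "general"}
MAX_QUERY_LENGTH = 500


def validate_query_request(payload):
    """Return (status_code, error_body); error_body is None when the payload is valid."""
    coin = str(payload.get("coin", "")).strip().lower()
    query = str(payload.get("query", "")).strip()

    if coin not in ALLOWED_COINS:
        return 400, {
            "error": "Invalid coin parameter",
            "message": "Coin must be one of: bitcoin, ethereum, solana, dogecoin, hamster, general",
            "status": 400,
        }
    if not query or len(query) > MAX_QUERY_LENGTH:
        # Assumed error shape; only the coin error is documented in this README
        return 400, {
            "error": "Invalid query parameter",
            "message": f"Query must be 1 to {MAX_QUERY_LENGTH} characters long",
            "status": 400,
        }
    return 200, None


ok = validate_query_request({"coin": "bitcoin", "query": "What is the latest trend?"})
bad_coin = validate_query_request({"coin": "ripple", "query": "Any news?"})
too_long = validate_query_request({"coin": "bitcoin", "query": "a" * 501})
```

Returning a status code with the error body keeps the helper independent of any particular web framework.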
Response (200 OK):
{
  "query": "What is the latest trend in Bitcoin price?",
  "coin": "bitcoin",
  "sentiment": "Positive (Score: 0.65)",
  "top_answers": [
    "Bitcoin prices surged this week due to increased institutional adoption. (Score: 0.92)",
    "Analysts predict Bitcoin will reach new highs in Q2 2025. (Score: 0.85)"
  ],
  "metadata": {
    "processing_time_ms": 145,
    "total_documents_searched": 1247,
    "timestamp": "2026-01-15T10:30:00Z"
  }
}
Response (400 Bad Request):
{
"error": "Invalid coin parameter",
"message": "Coin must be one of: bitcoin, ethereum, solana, dogecoin, hamster, general",
"status": 400
}
import requests

class CryptoSightsClient:
    def __init__(self, base_url="http://localhost:7860", timeout=10):
        self.base_url = base_url
        self.timeout = timeout  # seconds to wait before giving up on a request

    def query(self, coin, question):
        """Submit a query to CryptoSights and return the parsed JSON response."""
        endpoint = f"{self.base_url}/query"
        payload = {"coin": coin, "query": question}
        response = requests.post(endpoint, json=payload, timeout=self.timeout)
        return response.json()

# Usage
client = CryptoSightsClient()
results = client.query("bitcoin", "What drives Bitcoin volatility?")
print(results['sentiment'])
The Validation Engine ensures answer quality through ML-based relevance scoring and semantic matching.
Input Validation
↓
Keyword Matching
↓
Semantic Similarity (BERT)
↓
TF-IDF Vectorization
↓
Random Forest Classifier
↓
Valid/Invalid Classification
def is_valid_entry(entry):
    """Multi-stage validation pipeline"""
    # Stage 1: Special cases
    if entry['coin'].lower() == 'general crypto':
        return True
    # Stage 2: Basic validation
    if not entry['query'] or not entry['answers']:
        return False
    # Stage 3: Keyword matching
    if has_keyword_match(entry):
        return True
    # Stage 4: Semantic similarity (threshold: 0.5)
    similarity = compute_semantic_similarity(entry)
    return similarity >= 0.5
- Model: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- Similarity Metric: Cosine similarity
- Threshold: 0.5
- Algorithm: Random Forest Classifier
- Features: TF-IDF vectors (500 features)
- Cross-Validation: 5-fold CV
- Performance: F1-score optimization
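The cosine-similarity check with a 0.5 threshold can be illustrated without loading an embedding model. The sketch below uses toy 3-dimensional vectors in place of the 384-dimensional sentence embeddings, and taking the best-matching answer is an assumption; this README does not say how the project combines similarities across several answers.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_semantic_check(query_vec, answer_vecs, threshold=0.5):
    """True if the best-matching answer embedding reaches the threshold."""
    best = max((cosine_similarity(query_vec, vec) for vec in answer_vecs), default=0.0)
    return best >= threshold


# Toy vectors standing in for real sentence embeddings
query_vec = [1.0, 0.0, 1.0]
related_answer = [0.9, 0.1, 0.8]
unrelated_answer = [0.0, 1.0, 0.0]

accepted = passes_semantic_check(query_vec, [unrelated_answer, related_answer])
rejected = passes_semantic_check(query_vec, [unrelated_answer])
```

Cosine similarity ignores vector length, so a long answer and a short answer pointing in the same direction score the same.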
from evaluation.evalmodel import is_valid_entry
entry = {
    'coin': 'bitcoin',
    'query': 'What factors influence Bitcoin price?',
    'answers': [
        'Supply and demand affect Bitcoin price.',
        'Regulatory news impacts sentiment.'
    ]
}
is_valid = is_valid_entry(entry)
print(f"Valid: {is_valid}") # True
crypto-insight-ai/
├── cache/                              # Temporary cache
├── crypto_sights/                      # Main package
│   ├── app/                            # Flask application
│   │   ├── data/                       # Data storage
│   │   │   ├── UI-UX/
│   │   │   ├── extracted_text.pickle   # BM25 index
│   │   │   ├── metrics.json
│   │   │   └── query_logs.json
│   │   ├── routes/                     # API endpoints
│   │   │   ├── __init__.py
│   │   │   └── main_routes.py
│   │   ├── scripts/                    # Processing
│   │   │   ├── crypto_scraping_model.py
│   │   │   ├── crypto.py
│   │   │   ├── keywords.py
│   │   │   ├── logger.py
│   │   │   ├── model.py
│   │   │   └── process_and_push.py
│   │   ├── static/                     # Web assets
│   │   │   ├── assets/
│   │   │   ├── css/
│   │   │   └── js/
│   │   ├── templates/                  # HTML templates
│   │   │   ├── index.html
│   │   │   ├── evaluation.html
│   │   │   ├── faq.html
│   │   │   └── ...
│   │   └── __init__.py
│   ├── evaluation/                     # Validation engine
│   │   ├── evalbackend.py
│   │   └── evalmodel.py
│   ├── Dockerfile                      # Container config
│   ├── deployment.md                   # Deployment guide
│   └── run.py                          # Entry point
├── .dockerignore
├── .gitignore
├── contributors.md
├── README.md
├── requirements.txt
└── LICENSE
# Clone repository
git clone https://github.com/techiepookie/crypto-insight-ai.git
cd crypto-insight-ai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install development dependencies (optional)
pip install pytest pytest-cov black flake8 pylint mypy
# Run all tests
pytest
# Run with coverage
pytest --cov=crypto_sights
# Run specific test
pytest tests/test_model.py
# Format code
black crypto_sights/
# Lint
flake8 crypto_sights/ --max-line-length=100
# Type checking
mypy crypto_sights/
Follow Conventional Commits:
feat: add new feature description
fix: resolve bug issue
docs: update documentation
refactor: improve code structure
test: add test cases
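A commit message can be checked against this format with a short script, for example from a Git hook. The sketch below is an illustration only: `is_conventional` is a hypothetical helper, and it accepts just the five types listed above, although the Conventional Commits specification allows others.

```python
import re

# Types taken from the examples above; the full specification permits more
COMMIT_TYPES = ("feat", "fix", "docs", "refactor", "test")
COMMIT_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[\w-]+)\))?"  # optional scope, e.g. fix(api)
    r"!?: "                        # optional breaking-change marker, then ': '
    r"(?P<subject>\S.*)$"
)


def is_conventional(message):
    """Check only the first line of the commit message."""
    first_line = message.splitlines()[0] if message else ""
    return COMMIT_RE.match(first_line) is not None


checks = {
    "feat: add new feature description": is_conventional("feat: add new feature description"),
    "fix(api): handle empty query": is_conventional("fix(api): handle empty query"),
    "Added stuff": is_conventional("Added stuff"),
}
```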
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit changes: `git commit -m "feat: description"`
- Push to your fork: `git push origin feature/your-feature`
- Create a Pull Request
- Follow PEP 8
- Add docstrings to functions
- Include type hints
- Write unit tests
- Maintain >80% code coverage
Q: "ModuleNotFoundError: No module named 'flask'"
pip install -r requirements.txt
Q: "wkhtmltopdf not found"
# Ubuntu
sudo apt-get install wkhtmltopdf
# macOS
brew install wkhtmltopdf
Q: "NLTK resources not downloaded"
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
Q: "Port 7860 already in use"
# Find process using port
lsof -i :7860
# Kill process or use different port
docker run -p 8080:7860 techiepookie/cryptosights:latest
Q: "Out of memory errors"
# Increase Docker memory allocation in Desktop settings
# or use volume mounting for persistent storage
docker run -v /path/to/data:/app/data techiepookie/cryptosights:latest
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See deployment.md
This project is licensed under the MIT License - see LICENSE for details.
Copyright (c) 2025 CryptoSights Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files...
- GitHub: @techiepookie
- Repository: crypto-insight-ai
- Live Demo: techiepookie-cryptosightsai.hf.space
Star the repository | Report bugs | Suggest features | Contribute code
- Lead Developers: Harshal Chaudhari & Nikhil Kumar Obhawani
- Contributors: See contributors.md
Flask • NLTK • TextBlob • Scikit-learn • Pandas • Docker • Hugging Face
Built with ❤️ by the CryptoSights Team