# Jarjarquant with DataService Integration

This notebook demonstrates how to use the DataService with Jarjarquant for comprehensive financial data analysis. The DataService provides efficient access to financial data stored in Parquet files using DuckDB for fast columnar operations.

## What you'll learn:
- How to initialize and query the DataService
- How to filter stocks by metadata criteria
- How to integrate DataService with Jarjarquant for technical analysis
- How to compute features and apply triple barrier labeling
- How to analyze multiple assets and their return correlations

## Setup and Imports

First, let's import the necessary libraries and initialize our services.

In [None]:
from jarjarquant import Jarjarquant
from jarjarquant.data_service import DataService
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Set plot style (the 'seaborn-v0_8' style name requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Imports successful!")

## 1. Initialize DataService

The DataService provides a centralized interface for querying financial data stored in Parquet files.

In [None]:
# Initialize the DataService
ds = DataService("jarjarquant/sample_data/data/")

print("=== DataService Initialization ===\n")
print("📊 DataService initialized successfully!")
print(f"📁 Data path: {ds.data_path}")
print(f"📈 Equities path: {ds.equities_path}")

# Check available tickers
tickers = ds.list_available_tickers()
print(f"\n📋 Total available tickers: {len(tickers)}")
print(f"🔍 Sample tickers: {tickers[:10]}")

## 2. Explore Available Data

Let's explore what sectors and metadata are available in our dataset.

In [None]:
# Get available sectors
sectors = ds.get_sectors()
print("🏢 Available Sectors:")
for i, sector in enumerate(sectors, 1):
    print(f"   {i:2d}. {sector}")

# Get available analyst ratings
ratings = ds.get_analyst_ratings()
print(f"\n⭐ Available Analyst Ratings: {ratings}")

## 3. Smart Stock Selection Using Metadata

Instead of picking stocks by hand, let's use metadata to select large-cap technology stocks with strong analyst ratings, falling back to broader criteria if nothing matches.
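
To make the selection logic concrete, here is a minimal pandas sketch of what a criteria-based sampler could look like. It is not the DataService implementation: the `sample_by_criteria` helper, the tickers, and the values are all hypothetical, and only the column names (`Sector`, `Market capitalization`, `Analyst Rating`) are taken from this notebook.

```python
import pandas as pd

# Hypothetical metadata table; tickers and values are made up for illustration.
meta = pd.DataFrame(
    {
        "Sector": ["Technology services", "Technology services",
                   "Finance", "Technology services"],
        "Market capitalization": [2.5e12, 8.0e10, 4.0e11, 1.5e11],
        "Analyst Rating": ["Strong buy", "Buy", "Strong buy", "Strong buy"],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)


def sample_by_criteria(meta, n_samples, sector=None, min_market_cap=None,
                       analyst_rating=None, random_seed=None):
    """Return up to n_samples tickers that satisfy every criterion provided."""
    mask = pd.Series(True, index=meta.index)
    if sector is not None:
        mask &= meta["Sector"] == sector
    if min_market_cap is not None:
        mask &= meta["Market capitalization"] >= min_market_cap
    if analyst_rating is not None:
        mask &= meta["Analyst Rating"] == analyst_rating

    matches = meta[mask]
    if matches.empty:
        return []
    # A fixed seed makes the random sample reproducible across runs.
    n = min(n_samples, len(matches))
    return matches.sample(n=n, random_state=random_seed).index.tolist()


picked = sample_by_criteria(
    meta, 5,
    sector="Technology services",
    min_market_cap=1e11,
    analyst_rating="Strong buy",
    random_seed=42,
)
print(sorted(picked))
```

Returning an empty list when nothing matches is what makes a fallback chain like the one in the next cell straightforward to write.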

In [None]:
print("🔍 Finding large-cap tech stocks with strong buy ratings...\n")

# Try to get stocks with strong buy ratings
tech_stocks = ds.get_sample_by_criteria(
    n_samples=5,
    sector="Technology services",
    min_market_cap=1e11,  # 100 billion minimum market cap
    analyst_rating="Strong buy",
    random_seed=42
)

if not tech_stocks:
    print("⚠️  No stocks found with 'Strong buy' rating. Trying with broader criteria...")
    # Fallback: get any large tech stocks
    tech_stocks = ds.get_sample_by_criteria(
        n_samples=5,
        sector="Technology services",
        min_market_cap=5e10,  # 50 billion minimum
        random_seed=42
    )
    
if not tech_stocks:
    # Final fallback to known tech stocks
    tech_stocks = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]
    print("⚠️  Using default tech stocks as fallback")

print(f"✅ Selected tickers: {tech_stocks}")

# Get metadata for selected stocks
metadata = ds.get_metadata(tech_stocks)
if not metadata.empty:
    print("\n📊 Stock Metadata:")
    display_cols = ['Description', 'Sector', 'Market capitalization']
    if 'Analyst Rating' in metadata.columns:
        display_cols.append('Analyst Rating')
    print(metadata[display_cols])
else:
    print("⚠️  No metadata available for selected stocks")

## 4. Load Price Data for Analysis

Now let's load historical price data for the first selected ticker so we can perform technical analysis.

In [None]:
# Define analysis period
start_date = "2023-01-01"
end_date = "2024-12-31"

# Get data for the first ticker
primary_ticker = tech_stocks[0]
print(f"📈 Loading price data for {primary_ticker}...")

price_data = ds.get_price_data(
    primary_ticker,
    start_date=start_date,
    end_date=end_date
)

if price_data.empty:
    print(f"❌ No data available for {primary_ticker}")
else:
    print(f"✅ Loaded {len(price_data)} days of data for {primary_ticker}")
    print(f"📅 Date range: {price_data.index.min()} to {price_data.index.max()}")
    
    # Display basic statistics
    print("\n📊 Price Data Summary:")
    print(price_data[['Open', 'High', 'Low', 'Close', 'Volume']].describe())
    
    # Show recent data
    print("\n📈 Recent Price Data (last 5 days):")
    print(price_data[['Open', 'High', 'Low', 'Close', 'Volume']].tail())

## 5. Price Visualization

Let's create some visualizations to understand the price movements.

In [None]:
if not price_data.empty:
    # Create subplots
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
    
    # Price chart
    ax1.plot(price_data.index, price_data['Close'], label='Close Price', linewidth=2)
    ax1.fill_between(price_data.index, price_data['Low'], price_data['High'], 
                     alpha=0.3, label='High-Low Range')
    ax1.set_title(f'{primary_ticker} Price Chart', fontsize=14, fontweight='bold')
    ax1.set_ylabel('Price ($)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Volume chart
    ax2.bar(price_data.index, price_data['Volume'], alpha=0.7, color='orange')
    ax2.set_title(f'{primary_ticker} Volume', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Volume')
    ax2.set_xlabel('Date')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Calculate and display basic metrics
    returns = price_data['Close'].pct_change().dropna()
    
    print(f"\n📈 Performance Metrics for {primary_ticker}:")
    print(f"   📊 Total Return: {((price_data['Close'].iloc[-1] / price_data['Close'].iloc[0]) - 1) * 100:.2f}%")
    print(f"   📊 Average Daily Return: {returns.mean() * 100:.4f}%")
    print(f"   📊 Volatility (Daily): {returns.std() * 100:.4f}%")
    print(f"   📊 Sharpe Ratio (Daily): {returns.mean() / returns.std():.4f}")
    print(f"   📊 Max Drawdown: {((price_data['Close'] / price_data['Close'].cummax()) - 1).min() * 100:.2f}%")

## 6. Initialize Jarjarquant for Technical Analysis

Now let's use Jarjarquant to compute technical indicators and features.

In [None]:
if not price_data.empty:
    print("🔧 Initializing Jarjarquant for technical analysis...\n")
    
    # Convert to the format expected by Jarjarquant
    jq_data = price_data[['Open', 'High', 'Low', 'Close', 'Volume']].copy()
    
    # Initialize Jarjarquant
    jq = Jarjarquant(data=jq_data)
    
    print(f"✅ Jarjarquant initialized with {len(jq_data)} data points")
    print(f"📊 Data columns: {list(jq_data.columns)}")
    print(f"📅 Analysis period: {jq_data.index.min()} to {jq_data.index.max()}")

## 7. Compute Technical Features

Let's compute a comprehensive set of technical indicators and features.
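
For intuition about what such features look like, the sketch below builds three common price-derived features in plain pandas. These are illustrative stand-ins, not the features Jarjarquant computes; the `basic_features` helper, the column names, and the synthetic prices are assumptions made for this example. It also shows why a missing-value check is worthwhile: rolling windows leave NaN values during their warm-up period.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (made-up values) on business days.
idx = pd.bdate_range("2023-01-02", periods=60)
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))), index=idx)


def basic_features(close, window=20):
    """A few common price-derived features; the names are illustrative."""
    returns = close.pct_change()
    return pd.DataFrame(
        {
            "ret_1d": returns,                                      # daily return
            "sma_ratio": close / close.rolling(window).mean() - 1,  # distance from moving average
            "rolling_vol": returns.rolling(window).std(),           # rolling volatility
        }
    )


feats = basic_features(close)
# Each feature has a different warm-up length, so the NaN counts differ.
print(feats.isnull().sum())
```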

In [None]:
if not price_data.empty:
    print("⚙️  Computing technical indicators and features...\n")
    
    # Compute features
    features = jq.compute_features()
    
    print(f"✅ Generated {len(features.columns)} features")
    print(f"📊 Feature shape: {features.shape}")
    print(f"📅 Feature date range: {features.index.min()} to {features.index.max()}")
    
    # Display feature categories
    feature_names = list(features.columns)
    print(f"\n🏷️  Feature categories (first 10): {feature_names[:10]}")
    
    # Show feature statistics
    print("\n📈 Feature Summary Statistics:")
    print(features.describe().round(4))
    
    # Check for missing values
    missing_counts = features.isnull().sum()
    features_with_missing = missing_counts[missing_counts > 0]
    
    if len(features_with_missing) > 0:
        print(f"\n⚠️  Features with missing values: {len(features_with_missing)}")
        print(features_with_missing.head(10))
    else:
        print("\n✅ No missing values in features")

## 8. Apply Triple Barrier Labeling

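The triple barrier method labels each observation according to which of three barriers the price touches first: an upper profit-taking barrier, a lower stop-loss barrier, or a vertical time barrier that ends the trade after a fixed horizon. The resulting labels can serve as targets for machine learning models.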
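
The idea can be sketched in a few lines of NumPy and pandas. This is a simplified illustration with fixed percentage barriers on closing prices, not Jarjarquant's implementation; the `triple_barrier_labels` function is hypothetical, and practical implementations often scale the barriers by recent volatility.

```python
import numpy as np
import pandas as pd


def triple_barrier_labels(close, pt=0.02, sl=0.01, horizon=20):
    """Label each bar +1, -1, or 0 by which barrier is touched first.

    +1: the return from entry reaches +pt within `horizon` bars
    -1: the return from entry reaches -sl within `horizon` bars
     0: neither barrier is touched before the time barrier
    """
    prices = close.to_numpy(dtype=float)
    labels = np.zeros(len(prices), dtype=int)
    for i in range(len(prices)):
        # Returns relative to the entry price, up to the time barrier.
        path = prices[i + 1 : i + 1 + horizon] / prices[i] - 1
        hit_pt = np.flatnonzero(path >= pt)
        hit_sl = np.flatnonzero(path <= -sl)
        first_pt = hit_pt[0] if hit_pt.size else np.inf
        first_sl = hit_sl[0] if hit_sl.size else np.inf
        if first_pt < first_sl:
            labels[i] = 1
        elif first_sl < first_pt:
            labels[i] = -1
    return pd.Series(labels, index=close.index, name="label")


# Made-up prices for illustration.
close = pd.Series([100.0, 101.0, 103.0, 102.0, 100.5, 99.0, 99.5])
print(triple_barrier_labels(close, pt=0.02, sl=0.01, horizon=3).tolist())
```

Note that bars near the end of the series see a shortened window, so their labels lean toward 0 simply because there is less future data to touch a barrier.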

In [None]:
if not price_data.empty:
    print("🎯 Applying triple barrier labeling...\n")
    
    # Apply triple barrier labeling
    labels = jq.get_labels(
        pt=0.02,  # 2% profit target
        sl=0.01,  # 1% stop loss
        horizon=20  # 20 day horizon
    )
    
    print(f"✅ Generated {len(labels)} labels")
    print(f"📊 Label date range: {labels.index.min()} to {labels.index.max()}")
    
    # Show label distribution
    print("\n📊 Label Distribution:")
    label_counts = labels.value_counts().sort_index()
    label_percentages = (label_counts / len(labels) * 100).round(2)
    
    for label, count in label_counts.items():
        percentage = label_percentages[label]
        label_name = {-1: "📉 Sell", 0: "⏸️  Hold", 1: "📈 Buy"}.get(label, f"Label {label}")
        print(f"   {label_name}: {count:,} ({percentage}%)")
    
    # Visualize label distribution
    plt.figure(figsize=(10, 6))
    
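    # Create a bar plot. Reindex to all three classes so the bar heights
    # line up with the names even if one class never occurs.
    colors = ['red', 'gray', 'green']
    label_names = ['Sell (-1)', 'Hold (0)', 'Buy (1)']
    plot_counts = label_counts.reindex([-1, 0, 1], fill_value=0)
    plot_percentages = label_percentages.reindex([-1, 0, 1], fill_value=0.0)
    
    bars = plt.bar(label_names, plot_counts.values, color=colors, alpha=0.7)
    plt.title(f'Triple Barrier Label Distribution for {primary_ticker}', 
              fontsize=14, fontweight='bold')
    plt.ylabel('Count')
    plt.xlabel('Label')
    
    # Add percentage labels on bars
    for bar, percentage in zip(bars, plot_percentages.values):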
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                f'{percentage}%', ha='center', va='bottom', fontweight='bold')
    
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 9. Feature Importance Analysis

Let's evaluate which technical features are most predictive of future price movements.
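
How Jarjarquant scores features is not shown in this notebook. As a rough mental model only, the sketch below ranks features by the absolute Pearson correlation between each feature and the label. The `correlation_importance` helper and the synthetic data are assumptions made for illustration; the result is shaped like the `importance` table used in the next cell (features as the index, one `importance` column).

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the label, the other is pure noise.
rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
features = pd.DataFrame(
    {
        "informative": signal + rng.normal(scale=0.5, size=n),
        "noise": rng.normal(size=n),
    }
)
labels = pd.Series(np.sign(signal).astype(int), name="label")


def correlation_importance(features, labels):
    """Score each feature by |Pearson correlation| with the label."""
    # Align on the shared index and drop rows with missing values first.
    aligned = features.join(labels, how="inner").dropna()
    scores = aligned[features.columns].corrwith(aligned[labels.name]).abs()
    return scores.sort_values(ascending=False).to_frame("importance")


importance = correlation_importance(features, labels)
print(importance)
```

A univariate score like this ignores interactions between features, which is one reason model-based importance measures are often preferred in practice.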

In [None]:
if not price_data.empty and len(labels) > 0:
    print("🔍 Evaluating feature importance...\n")
    
    # Evaluate features
    importance = jq.evaluate_features(labels)
    
    if importance is not None and not importance.empty:
        print(f"✅ Feature evaluation completed for {len(importance)} features")
        
        # Get top features
        top_features = importance.nlargest(15, 'importance')
        
        print("\n🏆 Top 15 Most Important Features:")
        for i, (feature_name, row) in enumerate(top_features.iterrows(), 1):
            print(f"   {i:2d}. {feature_name:<25}: {row['importance']:.6f}")
        
        # Visualize feature importance
        plt.figure(figsize=(12, 8))
        
        # Horizontal bar plot for better readability
        y_pos = np.arange(len(top_features))
        plt.barh(y_pos, top_features['importance'].values, alpha=0.8)
        plt.yticks(y_pos, top_features.index)
        plt.xlabel('Feature Importance Score')
        plt.title(f'Top 15 Feature Importance for {primary_ticker}', 
                  fontsize=14, fontweight='bold')
        plt.gca().invert_yaxis()  # Highest importance at top
        plt.grid(True, alpha=0.3, axis='x')
        plt.tight_layout()
        plt.show()
        
        # Show summary statistics of importance scores
        print("\n📊 Feature Importance Statistics:")
        print(importance['importance'].describe().round(6))
        
    else:
        print("⚠️  Could not compute feature importance")

## 10. Multi-Asset Analysis

Now let's analyze multiple stocks to compare their characteristics and performance.
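
The per-asset metrics used in this section can be collected in one small helper. The sketch below is illustrative: the `performance_summary` name and the sample prices are made up, the Sharpe ratio assumes a zero risk-free rate (as the formulas in this notebook do), and 252 trading days per year is a common convention rather than a fixed rule.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # common convention for annualizing daily statistics


def performance_summary(close):
    """Summary metrics from a series of closing prices."""
    returns = close.pct_change().dropna()
    vol = returns.std()
    # Drawdown: percentage drop from the running peak.
    drawdown = close / close.cummax() - 1
    return {
        "total_return": close.iloc[-1] / close.iloc[0] - 1,
        "annualized_volatility": vol * np.sqrt(TRADING_DAYS),
        # Guard against a zero-volatility series to avoid dividing by zero.
        "annualized_sharpe": (
            returns.mean() / vol * np.sqrt(TRADING_DAYS) if vol > 0 else 0.0
        ),
        "max_drawdown": drawdown.min(),
    }


# Made-up prices for illustration.
close = pd.Series([100.0, 110.0, 99.0, 104.0, 120.0])
summary = performance_summary(close)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Scaling by the square root of the number of periods assumes daily returns are roughly independent, so treat annualized figures as approximations.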

In [None]:
print("📊 Analyzing multiple assets...\n")

results = {}
analysis_tickers = tech_stocks[:5]  # Analyze first 5 tickers

for i, ticker in enumerate(analysis_tickers, 1):
    print(f"   {i}/{len(analysis_tickers)} Processing {ticker}...", end=" ")
    
    try:
        # Get price data
        ticker_data = ds.get_price_data(
            ticker,
            start_date=start_date,
            end_date=end_date
        )
        
        if ticker_data.empty:
            print("❌ No data")
            continue
        
        # Get metadata (separate name so the earlier `metadata` table is not overwritten)
        ticker_meta = ds.get_metadata([ticker])
        if not ticker_meta.empty:
            sector = ticker_meta.loc[ticker, 'Sector']
            market_cap = ticker_meta.loc[ticker, 'Market capitalization']
        else:
            sector = "Unknown"
            market_cap = 0
        
        # Calculate performance metrics
        returns = ticker_data['Close'].pct_change().dropna()
        
        # Calculate maximum drawdown
        rolling_max = ticker_data['Close'].cummax()
        drawdown = (ticker_data['Close'] / rolling_max - 1)
        max_drawdown = drawdown.min()
        
        # Calculate total return
        total_return = (ticker_data['Close'].iloc[-1] / ticker_data['Close'].iloc[0]) - 1
        
        results[ticker] = {
            'sector': sector,
            'market_cap_b': market_cap / 1e9,  # Convert to billions
            'total_return': total_return,
            'avg_daily_return': returns.mean(),
            'volatility': returns.std(),
            'sharpe_ratio': returns.mean() / returns.std() if returns.std() > 0 else 0,
            'max_drawdown': max_drawdown,
            'data_points': len(ticker_data)
        }
        
        print("✅")
        
    except Exception as e:
        print(f"❌ Error: {str(e)[:50]}...")
        continue

# Display results
if results:
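    # orient='index' keeps numeric columns numeric. DataFrame(results).T would
    # yield object-dtype columns, on which .round() either fails or silently
    # does nothing, depending on the pandas version.
    results_df = pd.DataFrame.from_dict(results, orient='index')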
    
    print("\n📊 Multi-Asset Analysis Results:")
    
    # Format the dataframe for better display
    display_df = results_df.copy()
    display_df['total_return'] = (display_df['total_return'] * 100).round(2)
    display_df['avg_daily_return'] = (display_df['avg_daily_return'] * 100).round(4)
    display_df['volatility'] = (display_df['volatility'] * 100).round(4)
    display_df['sharpe_ratio'] = display_df['sharpe_ratio'].round(4)
    display_df['max_drawdown'] = (display_df['max_drawdown'] * 100).round(2)
    display_df['market_cap_b'] = display_df['market_cap_b'].round(1)
    
    # Rename columns for better display
    display_df.columns = ['Sector', 'Market Cap (B)', 'Total Return (%)', 
                         'Avg Daily Return (%)', 'Volatility (%)', 
                         'Sharpe Ratio', 'Max Drawdown (%)', 'Data Points']
    
    print(display_df)
    
    # Create performance comparison chart
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # Total returns
    ax1.bar(results_df.index, results_df['total_return'] * 100, alpha=0.8)
    ax1.set_title('Total Returns (%)', fontweight='bold')
    ax1.set_ylabel('Return (%)')
    ax1.tick_params(axis='x', rotation=45)
    ax1.grid(True, alpha=0.3)
    
    # Volatility
    ax2.bar(results_df.index, results_df['volatility'] * 100, alpha=0.8, color='orange')
    ax2.set_title('Daily Volatility (%)', fontweight='bold')
    ax2.set_ylabel('Volatility (%)')
    ax2.tick_params(axis='x', rotation=45)
    ax2.grid(True, alpha=0.3)
    
    # Sharpe Ratio
    ax3.bar(results_df.index, results_df['sharpe_ratio'], alpha=0.8, color='green')
    ax3.set_title('Sharpe Ratio', fontweight='bold')
    ax3.set_ylabel('Sharpe Ratio')
    ax3.tick_params(axis='x', rotation=45)
    ax3.grid(True, alpha=0.3)
    
    # Max Drawdown
    ax4.bar(results_df.index, results_df['max_drawdown'] * 100, alpha=0.8, color='red')
    ax4.set_title('Maximum Drawdown (%)', fontweight='bold')
    ax4.set_ylabel('Drawdown (%)')
    ax4.tick_params(axis='x', rotation=45)
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("❌ No analysis results available")

## 11. Correlation Analysis

Let's analyze how these assets move together by computing return correlations.
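
One compact way to inspect pairwise correlations is to flatten the upper triangle of the matrix into a single sorted Series, so each pair appears exactly once. The sketch below uses made-up returns and hypothetical tickers; it is an alternative to masking the matrix and is not required by the cell that follows.

```python
import numpy as np
import pandas as pd

# Made-up daily returns for three hypothetical tickers.
returns_df = pd.DataFrame(
    {
        "AAA": [0.010, -0.020, 0.015, 0.000, -0.010],
        "BBB": [0.012, -0.018, 0.010, 0.002, -0.008],
        "CCC": [-0.005, 0.010, -0.010, 0.003, 0.006],
    }
)
corr = returns_df.corr()

# Indices of the strict upper triangle (k=1 excludes the diagonal).
rows, cols = np.triu_indices_from(corr.to_numpy(), k=1)
pairs = pd.Series(
    corr.to_numpy()[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
).sort_values(ascending=False)

print(pairs.round(3))
print("Most correlated pair:", pairs.index[0])
print("Least correlated pair:", pairs.index[-1])
print("Average pairwise correlation:", round(pairs.mean(), 3))
```

For n assets this yields n(n-1)/2 pairs, and the highest, lowest, and average correlations all come from the same object.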

In [None]:
print("🔗 Analyzing asset correlations...\n")

if len(analysis_tickers) >= 2:
    # Get returns for multiple assets
    returns_data = {}
    
    for ticker in analysis_tickers:
        print(f"   Loading returns for {ticker}...", end=" ")
        
        ticker_data = ds.get_price_data(
            ticker,
            start_date=start_date,
            end_date=end_date,
            columns=['Close']
        )
        
        if not ticker_data.empty:
            returns_data[ticker] = ticker_data['Close'].pct_change()
            print("✅")
        else:
            print("❌")
    
    if len(returns_data) >= 2:
        # Create returns dataframe
        returns_df = pd.DataFrame(returns_data).dropna()
        
        print(f"\n📊 Correlation analysis with {len(returns_df)} common observations")
        
        # Calculate correlation matrix
        correlation = returns_df.corr()
        
        print("\n🔗 Return Correlation Matrix:")
        print(correlation.round(3))
        
        # Visualize correlation matrix
        plt.figure(figsize=(10, 8))
        
        # Create heatmap
        mask = np.triu(np.ones_like(correlation, dtype=bool))  # Mask upper triangle
        sns.heatmap(correlation, mask=mask, annot=True, cmap='RdYlBu_r', 
                   center=0, square=True, linewidths=0.5, 
                   cbar_kws={"shrink": .8}, fmt='.3f')
        
        plt.title('Asset Return Correlation Matrix', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        # Find highest and lowest correlations
        # Get upper triangle of correlation matrix (excluding diagonal)
        upper_tri = correlation.where(np.triu(np.ones(correlation.shape), k=1).astype(bool))
        
        # Find highest correlation pair
        max_corr = upper_tri.max().max()
        max_pair = upper_tri.stack().idxmax()
        
        # Find lowest correlation pair
        min_corr = upper_tri.min().min()
        min_pair = upper_tri.stack().idxmin()
        
        print("\n🔗 Correlation Insights:")
        print(f"   📈 Highest correlation: {max_pair[0]} - {max_pair[1]} ({max_corr:.3f})")
        print(f"   📉 Lowest correlation: {min_pair[0]} - {min_pair[1]} ({min_corr:.3f})")
        
        # Calculate average correlation
        avg_corr = upper_tri.stack().mean()
        print(f"   📊 Average correlation: {avg_corr:.3f}")
        
    else:
        print("❌ Not enough valid return data for correlation analysis")
else:
    print("❌ Need at least 2 tickers for correlation analysis")

## 12. Summary and Key Insights

Let's summarize what we've learned from this comprehensive analysis.

In [None]:
print("📋 ANALYSIS SUMMARY")
print("=" * 50)

print("\n🎯 Analysis Focus:")
print(f"   📊 Primary Asset: {primary_ticker}")
print(f"   📅 Period: {start_date} to {end_date}")
print("   🏢 Sector Focus: Technology services")
print(f"   📈 Assets Analyzed: {len(analysis_tickers)}")

if not price_data.empty:
    print("\n💡 Key Findings:")
    
    # Price performance
    total_return = ((price_data['Close'].iloc[-1] / price_data['Close'].iloc[0]) - 1) * 100
    returns = price_data['Close'].pct_change().dropna()
    volatility = returns.std() * np.sqrt(252) * 100  # Annualized
    sharpe = returns.mean() / returns.std() * np.sqrt(252)  # Annualized
    
    print(f"   📈 {primary_ticker} total return: {total_return:.2f}%")
    print(f"   📊 {primary_ticker} annualized volatility: {volatility:.2f}%")
    print(f"   ⭐ {primary_ticker} annualized Sharpe ratio: {sharpe:.3f}")
    
    if 'features' in locals():
        print(f"   🔧 Technical features generated: {len(features.columns)}")
    
    if 'labels' in locals():
        label_counts = labels.value_counts()
        if 1 in label_counts and -1 in label_counts:
            buy_sell_ratio = label_counts[1] / label_counts[-1]
            print(f"   🎯 Buy/Sell signal ratio: {buy_sell_ratio:.2f}")
    
    # Also skip an empty importance table, where index[0] would raise IndexError
    if 'importance' in locals() and importance is not None and not importance.empty:
        top = importance.nlargest(1, 'importance')
        print(f"   🏆 Most predictive feature: {top.index[0]} ({top['importance'].iloc[0]:.6f})")

print("\n🛠️  DataService Capabilities Demonstrated:")
print("   ✅ Smart stock selection using metadata criteria")
print("   ✅ Efficient price data loading with date filtering")
print("   ✅ Multi-asset analysis and comparison")
print("   ✅ Integration with Jarjarquant for technical analysis")
print("   ✅ Feature engineering and importance evaluation")
print("   ✅ Triple barrier labeling for ML applications")
print("   ✅ Correlation analysis across assets")

print("\n🚀 Next Steps:")
print("   📚 Explore different sectors and market conditions")
print("   🔬 Experiment with different labeling parameters")
print("   🤖 Build machine learning models using the features")
print("   📈 Implement backtesting strategies")
print("   🔄 Automate the analysis pipeline")

print("\n" + "=" * 50)
print("✅ Analysis completed successfully!")

## 13. Cleanup

Don't forget to properly close the DataService connection.
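
If an earlier cell raises an exception, a trailing `close()` call may never run. One general-purpose pattern is the standard library's `contextlib.closing`, which calls `close()` on exit even when an error occurs. The sketch below uses a hypothetical `FakeService` stand-in because the pattern only needs an object with a `close()` method; whether DataService also works directly as a context manager is not shown in this notebook.

```python
from contextlib import closing


class FakeService:
    """Stand-in for any object that exposes a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


service = FakeService()
try:
    with closing(service) as svc:
        # Any work done here is followed by svc.close(), even on failure.
        raise RuntimeError("simulated failure mid-analysis")
except RuntimeError:
    pass

print(service.closed)
```

In a notebook, where work is spread across cells, an explicit final `close()` cell like the one below remains the simpler option; the pattern above fits better in scripts.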

In [None]:
# Close the DataService connection
ds.close()
print("🔒 DataService connection closed successfully!")

---

## Conclusion

This notebook demonstrated how Jarjarquant's DataService integrates with the main Jarjarquant analysis framework. We've shown how to:

1. **Intelligently select stocks** using metadata criteria rather than manual selection
2. **Efficiently load and analyze** large datasets using DuckDB's columnar operations
3. **Generate comprehensive technical features** using Jarjarquant's feature engineering
4. **Apply sophisticated labeling techniques** with triple barrier methods
5. **Evaluate feature importance** to identify the most predictive signals
6. **Compare multiple assets** across various performance metrics
7. **Analyze correlations** to understand asset relationships

The DataService provides a robust foundation for quantitative analysis, enabling researchers and practitioners to focus on strategy development rather than data management complexities.

### Key Benefits:
- **Performance**: Fast queries using DuckDB's columnar engine
- **Flexibility**: Easy filtering and selection using metadata
- **Integration**: Seamless connection with Jarjarquant's analysis tools
- **Scalability**: Handles large datasets efficiently
- **Reproducibility**: Consistent data access patterns

Feel free to modify the parameters, explore different sectors, or extend the analysis to suit your specific research needs!