# Setup and Data Exploration - Intellihack Scope 03

This notebook sets up the environment and explores the q3_dataset, which contains technical documentation on AI research topics.

In [None]:
# Install required packages (%pip installs into the environment of the running kernel)
%pip install -r ../requirements.txt

In [None]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from pathlib import Path

# Check if CUDA is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device name: {torch.cuda.get_device_name()}")

## Dataset Exploration

Let's examine the technical documentation provided in the q3_dataset folder.

In [None]:
# Define path to the dataset (the q3_dataset files are expected under ../data/raw)
data_dir = Path('../data/raw')

# List all Markdown and text files, sorted so the order is reproducible across runs
files = sorted(list(data_dir.glob('**/*.md')) + list(data_dir.glob('**/*.txt')))
print(f"Found {len(files)} files in the dataset.")

# Get basic stats about each file
file_stats = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
        word_count = len(content.split())
        file_stats.append({
            'filename': file.name,
            'path': str(file),
            'size_kb': file.stat().st_size / 1024,
            'word_count': word_count
        })

# Convert to DataFrame for better visualization
stats_df = pd.DataFrame(file_stats)
stats_df

## Key Technical Topics

The dataset covers the following key topics, which our model needs to learn:
- DualPipe (bidirectional pipeline parallelism)
- DeepSeek-V3 model architecture
- Fire-Flyer File System (3FS)
- Expert Parallelism Load Balancing (EPLB)
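
To check how well each topic is actually represented in the documents, the mentions can be counted with a few regular expressions. The following is a minimal sketch, not part of the original notebook: `TOPIC_PATTERNS` and `count_topic_mentions` are hypothetical names, and the keyword patterns are assumptions that should be adjusted after reading the files.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, one per topic (assumed spellings; adjust as needed).
TOPIC_PATTERNS = {
    "DualPipe": r"\bDualPipe\b",
    "DeepSeek-V3": r"\bDeepSeek-V3\b",
    "3FS": r"\b3FS\b|Fire-Flyer File System",
    "EPLB": r"\bEPLB\b|Expert Parallelism Load Balanc",
}


def count_topic_mentions(text):
    """Return a Counter mapping each topic to its case-insensitive match count in text."""
    counts = Counter()
    for topic, pattern in TOPIC_PATTERNS.items():
        counts[topic] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts
```

Applied to the `content` of each file, this gives a rough per-document topic profile. Note that a full name followed by its abbreviation, such as "Fire-Flyer File System (3FS)", counts as two mentions.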

In [None]:
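# Sample content from one document to understand its structure
if not files:
    raise FileNotFoundError(f"No .md or .txt files found under {data_dir}")

sample_file = files[0]  # Change index to explore different files
with open(sample_file, 'r', encoding='utf-8') as f:
    content = f.read()

print(f"Sample content from {sample_file.name}:")
print("="*80)
# Show the first 1500 characters; add an ellipsis only if the text was truncated
print(content[:1500] + (" ..." if len(content) > 1500 else ""))
print("="*80)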

## Next Steps

Based on this exploration, the next steps are:
1. Process the technical documents and extract meaningful chunks
2. Generate QA pairs covering the technical topics
3. Prepare to fine-tune the Qwen2.5-3B model
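
As a starting point for step 1, the documents could be split into overlapping word windows. This is only a sketch under stated assumptions: `chunk_words` is a hypothetical helper, and the default `chunk_size` and `overlap` values are placeholders rather than settings chosen for this dataset. A structure-aware splitter (for example, one that breaks on Markdown headings) may suit these documents better.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size words each."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which can help when generating QA pairs from each chunk independently.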