# DataScience-Instruct-500K — Data Exploration

This notebook gives an overview of the [RUC-DataLab/DataScience-Instruct-500K](https://huggingface.co/datasets/RUC-DataLab/DataScience-Instruct-500K) dataset and prints sample records from each of its files.

In [8]:
import json
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from matplotlib.patches import Patch

sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
plt.rcParams["figure.dpi"] = 120
plt.rcParams["figure.figsize"] = (12, 6)
pd.set_option('display.max_colwidth', 120)
pd.set_option('display.max_columns', None)

DATA_ROOT = "../data/DataScience-Instruct-500K"

---
## 1. Dataset Overview — File Counts & Sizes

In [9]:
# Collect file metadata
file_info = []
for root, dirs, files in os.walk(DATA_ROOT):
    dirs[:] = [d for d in dirs if not d.startswith('.')]
    for f in files:
        if f.startswith('.'):
            continue
        fpath = os.path.join(root, f)
        rel = os.path.relpath(fpath, DATA_ROOT)
        # Use os.sep so the top-level folder is also detected on Windows paths
        folder = rel.split(os.sep)[0] if os.sep in rel else 'root'
        size_mb = os.path.getsize(fpath) / (1024**2)
        file_info.append({'file': rel, 'folder': folder, 'size_mb': size_mb, 'ext': os.path.splitext(f)[1]})

df_files = pd.DataFrame(file_info)
print(f"Total files: {len(df_files)}")
print(f"Total size: {df_files['size_mb'].sum():.1f} MB")
print()
print(df_files.groupby('folder').agg(num_files=('file', 'count'), total_mb=('size_mb', 'sum')).round(1))

Total files: 37
Total size: 13452.2 MB

            num_files  total_mb
folder                         
RL                  5    3281.9
assets              2       0.7
interation         12    1240.5
reasoning          15    3913.1
root                3    5016.0
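
The per-folder totals above can also be computed with `pathlib`, which avoids splitting paths on a hard-coded separator. The following is a minimal sketch rather than the notebook's own code; `summarize_folders` is a hypothetical helper name, and it assumes the same rules as the cell above (hidden files and hidden directories are skipped, files directly under the root are grouped as `root`).

```python
from collections import defaultdict
from pathlib import Path


def summarize_folders(data_root):
    """Return {top_level_folder: (num_files, total_mb)} for non-hidden files."""
    root = Path(data_root)
    totals = defaultdict(lambda: [0, 0.0])  # folder -> [file count, size in MB]
    for path in root.rglob("*"):
        rel_parts = path.relative_to(root).parts
        # Skip directories and anything that is, or lives inside, a hidden entry.
        if not path.is_file() or any(p.startswith(".") for p in rel_parts):
            continue
        folder = rel_parts[0] if len(rel_parts) > 1 else "root"
        totals[folder][0] += 1
        totals[folder][1] += path.stat().st_size / (1024 ** 2)
    return {k: (n, round(mb, 1)) for k, (n, mb) in totals.items()}
```

Calling `summarize_folders(DATA_ROOT)` should give the same grouping as the `groupby('folder')` summary, as a plain dictionary.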


---
## 2. Sample Counts per Subset

In [10]:
# Sample counts per file: parsed from the numeric filename suffix for JSON files,
# read from the data itself for Parquet files
sample_counts = []

for folder in ['interation', 'reasoning']:
    json_files = sorted(glob.glob(os.path.join(DATA_ROOT, folder, '*.json')))
    for fpath in json_files:
        fname = os.path.basename(fpath)
        parts = fname.replace('.json', '').rsplit('_', 1)
        count = int(parts[-1]) if parts[-1].isdigit() else None
        category = parts[0] if count is not None else fname.replace('.json', '')
        sample_counts.append({'folder': folder, 'file': fname, 'category': category, 'count': count})

for fpath in sorted(glob.glob(os.path.join(DATA_ROOT, 'RL', '*.parquet'))):
    fname = os.path.basename(fpath)
    df_tmp = pd.read_parquet(fpath)
    sample_counts.append({'folder': 'RL', 'file': fname, 'category': fname.replace('.parquet', ''), 'count': len(df_tmp)})

df_counts = pd.DataFrame(sample_counts)
print(f"Total samples across subsets: {df_counts['count'].sum():,}")
print()
display(df_counts.sort_values('count', ascending=False).reset_index(drop=True))

Total samples across subsets: 516,055



| | folder | file | category | count |
|---|---|---|---|---|
| 0 | reasoning | SKGInstruct_199989.json | SKGInstruct | 199989 |
| 1 | reasoning | TableQA_distillation_39301.json | TableQA_distillation | 39301 |
| 2 | reasoning | TableQA_refinement_39301.json | TableQA_refinement | 39301 |
| 3 | reasoning | TableQA_original_35357.json | TableQA_original | 35357 |
| 4 | reasoning | TableGPT_29448.json | TableGPT | 29448 |
| 5 | reasoning | instruction_following_20000.json | instruction_following | 20000 |
| 6 | reasoning | code_20000.json | code | 20000 |
| 7 | reasoning | science_20000.json | science | 20000 |
| 8 | reasoning | math_20000.json | math | 20000 |
| 9 | reasoning | other_19998.json | other | 19998 |
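
The JSON counts in this section come from the filename suffix, not from the file contents. A small regex-based parser makes that convention explicit. This is a sketch under the assumption that files are named `<category>_<count>.json`, as in the listing above; `parse_subset_name` is a hypothetical helper name, not part of the dataset's tooling.

```python
import os
import re

# Assumed naming convention: "<category>_<count>.json"
_NAME_RE = re.compile(r"^(?P<category>.+)_(?P<count>\d+)\.json$")


def parse_subset_name(fname):
    """Split a file name into (category, count); count is None without a numeric suffix."""
    match = _NAME_RE.match(fname)
    if match is None:
        # No numeric suffix: fall back to the bare file stem as the category.
        return os.path.splitext(fname)[0], None
    return match.group("category"), int(match.group("count"))


print(parse_subset_name("TableQA_distillation_39301.json"))
# → ('TableQA_distillation', 39301)
```

Comparing the parsed count with `len(data)` after loading a file is a cheap way to confirm that the suffix matches the actual number of records.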


---
## 3. Sample Data — Root-level JSON Files

In [None]:
for fname in ['single_ability_finetuning.json', 'multi_ability_agentic_training.json']:
    fpath = os.path.join(DATA_ROOT, fname)
    if not os.path.exists(fpath):
        continue
    print(f"{'='*80}")
    print(f"FILE: {fname}")
    print(f"{'='*80}")
    with open(fpath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Total samples: {len(data):,}")
    print(f"Keys: {list(data[0].keys())}")
    print()
    for i, entry in enumerate(data[:2]):
        print(f"--- Sample {i} ---")
        for k, v in entry.items():
            if k == 'messages':
                print(f"  messages ({len(v)} msgs):")
                for msg in v[:3]:
                    content_preview = str(msg['content'])[:200]
                    print(f"    [{msg['role']}]: {content_preview}")
                if len(v) > 3:
                    print(f"    ... ({len(v)-3} more messages)")
            elif k == 'evaluation':
                print(f"  evaluation: {json.dumps(v, ensure_ascii=False)[:300]}")
            else:
                print(f"  {k}: {str(v)[:200]}")
        print()

FILE: single_ability_finetuning.json
Total samples: 437,398
Keys: ['id', 'messages', 'input_tokens', 'output_tokens', 'total_tokens', 'evaluation']

--- Sample 0 ---
  id: 0
  messages (2 msgs):
    [user]: You are capable of effectively identifying the hierarchical structure of the table. Based on the provided table and textual description, please provide the answer to the question.
You should think ste
    [assistant]: <Analyze>
First, the question is: "comparing to 2015, how many percentage point has nova scotia decreased in their crime severity index (csi) in 2016?"

I need to find the percentage change for Nova S
  input_tokens: 722
  output_tokens: 3562
  total_tokens: 4284
  evaluation: {"difficulty": 3, "quality": 4, "ability": "Data Analysis"}

--- Sample 1 ---
  id: 1
  messages (2 msgs):
    [user]: Please answer the given question based on the table and text. You should reach a short-form answer after reasoning.
### Instruction
Given a table and a list of texts in the follo
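
In the sample above, `total_tokens` equals `input_tokens + output_tokens` (722 + 3562 = 4284). Whether this holds for every record is not shown in the output, so a quick consistency check is a reasonable sanity test. The sketch below assumes each record carries the three token fields listed under `Keys`; `find_token_mismatches` is a hypothetical helper name.

```python
def find_token_mismatches(records):
    """Return the ids of records where total_tokens != input_tokens + output_tokens."""
    return [
        rec.get("id")
        for rec in records
        if rec["input_tokens"] + rec["output_tokens"] != rec["total_tokens"]
    ]


# Token counts copied from Sample 0 above
sample = [{"id": 0, "input_tokens": 722, "output_tokens": 3562, "total_tokens": 4284}]
print(find_token_mismatches(sample))  # → []
```

Running it on the full `data` list returns an empty list if the three fields are consistent throughout the file.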

---
## 4. Sample Data — Interaction Files (`interation/`)

In [12]:
interation_files = sorted(glob.glob(os.path.join(DATA_ROOT, 'interation', '*.json')))

for fpath in interation_files:
    fname = os.path.basename(fpath)
    print(f"{'='*80}")
    print(f"FILE: interation/{fname}")
    print(f"{'='*80}")
    with open(fpath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Total samples: {len(data):,}  |  Keys: {list(data[0].keys())}")
    
    entry = data[0]
    print("\n--- Sample 0 ---")
    for k, v in entry.items():
        if k == 'messages':
            print(f"  messages ({len(v)} msgs):")
            for msg in v[:3]:
                content_preview = str(msg['content'])[:200]
                print(f"    [{msg['role']}]: {content_preview}")
            if len(v) > 3:
                print(f"    ... ({len(v)-3} more messages)")
        elif k == 'evaluation':
            print(f"  evaluation: {json.dumps(v, ensure_ascii=False)[:300]}")
        else:
            print(f"  {k}: {str(v)[:200]}")
    print()

FILE: interation/data_analysis_3936.json
Total samples: 3,936  |  Keys: ['id', 'messages', 'evaluation', 'input_tokens', 'output_tokens', 'total_tokens']

--- Sample 0 ---
  id: 4
  messages (2 msgs):
    [user]: # Instruction
Integrate the 'all_star' and 'fielding' tables via player_id and year to categorize players as all-stars (present in 'all_star') or non-all-stars (present in 'fielding' but not 'all_star
    [assistant]: <Analyze>
I'll begin by carefully planning the analysis approach for this task. The instruction involves multiple complex steps that need to be executed sequentially. Here's my initial plan:

1. Data 
  evaluation: {"difficulty": 5, "quality": 5, "ability": ["Data Preparation", "Data Analysis", "Data Visualization", "Data Modeling", "Data Insight", "Code Generation", "Reasoning", "Numerical Calculation", "Instruction Following", "Structural Understanding"]}
  input_tokens: 299
  output_tokens: 10456
  total_tokens: 10755

FILE: interation/data_cleaning_1616.json
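
Note that `evaluation["ability"]` does not have a single shape: the root-level sample stores a string (`"Data Analysis"`), while the `interation/` sample stores a list of labels. Any aggregation over abilities has to handle both. The following is a minimal sketch, assuming only these two shapes occur; `count_abilities` is a hypothetical helper name.

```python
from collections import Counter


def count_abilities(records):
    """Count ability labels, accepting either a single string or a list of strings."""
    counts = Counter()
    for rec in records:
        ability = rec.get("evaluation", {}).get("ability")
        if ability is None:
            continue
        # Wrap a bare string so both shapes are handled the same way.
        labels = [ability] if isinstance(ability, str) else list(ability)
        counts.update(labels)
    return counts
```

Without the `isinstance` check, passing a bare string to `Counter.update` would count individual characters instead of labels.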


---
## 5. Sample Data — Reasoning Files (`reasoning/`)

In [13]:
reasoning_files = sorted(glob.glob(os.path.join(DATA_ROOT, 'reasoning', '*.json')))

for fpath in reasoning_files:
    fname = os.path.basename(fpath)
    print(f"{'='*80}")
    print(f"FILE: reasoning/{fname}")
    print(f"{'='*80}")
    with open(fpath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Total samples: {len(data):,}  |  Keys: {list(data[0].keys())}")
    
    entry = data[0]
    print("\n--- Sample 0 ---")
    for k, v in entry.items():
        if k == 'messages':
            print(f"  messages ({len(v)} msgs):")
            for msg in v[:3]:
                content_preview = str(msg['content'])[:200]
                print(f"    [{msg['role']}]: {content_preview}")
            if len(v) > 3:
                print(f"    ... ({len(v)-3} more messages)")
        elif k == 'evaluation':
            print(f"  evaluation: {json.dumps(v, ensure_ascii=False)[:300]}")
        else:
            print(f"  {k}: {str(v)[:200]}")
    print()

FILE: reasoning/SKGInstruct_199989.json
Total samples: 199,989  |  Keys: ['id', 'messages', 'evaluation', 'input_tokens', 'output_tokens', 'total_tokens']

--- Sample 0 ---
  id: 512069
  messages (2 msgs):
    [user]: Write your answer to the question based on your reasoning given the information in the following table table:

col : stem | leaf  row 1 : 3 | 3, 3, 3, 5, 5 row 2 : 4 | 6 row 3 : 5 | 4, 5, 7, 8 row 4 :
    [assistant]: <Analyze>
To solve the problem, we need to determine how many gardens have at least 47 plants based on the given stem-and-leaf plot. Here's the step-by-step plan:

1. **Interpret the Stem-and-Leaf Plo
  evaluation: {"difficulty": 3, "quality": 5, "ability": "Reasoning"}
  input_tokens: 151
  output_tokens: 789
  total_tokens: 940

FILE: reasoning/TableGPT_29448.json
Total samples: 29,448  |  Keys: ['id', 'messages', 'input_tokens', 'output_tokens', 'total_tokens', 'evaluation']

--- Sample 0 ---
  id: 0
  messages (2 msgs):
    [user]: # Instruction
Given t
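
The record preview logic is repeated almost verbatim in Sections 3, 4, and 5. It could be factored into one function that returns the preview lines, which also makes the truncation limits adjustable. This is a sketch of such a refactoring, not code from the original notebook; `preview_record` and its parameters `max_msgs` and `max_chars` are hypothetical names.

```python
import json


def preview_record(entry, max_msgs=3, max_chars=200):
    """Build the truncated preview lines for one record, mirroring the loops above."""
    lines = []
    for key, value in entry.items():
        if key == "messages":
            lines.append(f"  messages ({len(value)} msgs):")
            for msg in value[:max_msgs]:
                lines.append(f"    [{msg['role']}]: {str(msg['content'])[:max_chars]}")
            if len(value) > max_msgs:
                lines.append(f"    ... ({len(value) - max_msgs} more messages)")
        elif key == "evaluation":
            lines.append(f"  evaluation: {json.dumps(value, ensure_ascii=False)[:300]}")
        else:
            lines.append(f"  {key}: {str(value)[:max_chars]}")
    return lines
```

Inside each loop, `print("\n".join(preview_record(data[0])))` would then replace the nested `for k, v in entry.items()` block.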

---
## 6. Sample Data — RL Files (`RL/`)

In [14]:
rl_files = sorted(glob.glob(os.path.join(DATA_ROOT, 'RL', '*.parquet')))

for fpath in rl_files:
    fname = os.path.basename(fpath)
    print(f"{'='*80}")
    print(f"FILE: RL/{fname}")
    print(f"{'='*80}")
    df = pd.read_parquet(fpath)
    print(f"Shape: {df.shape}  |  Columns: {list(df.columns)}")
    print(f"Dtypes:\n{df.dtypes.to_string()}")
    print("\n--- First row ---")
    for col in df.columns:
        val = str(df[col].iloc[0])[:300]
        print(f"  {col}: {val}")
    print()

FILE: RL/datatask.parquet
Shape: (3852, 9)  |  Columns: ['input_seq', 'output_seq', 'workspace_id', 'external_knowledge', 'data', 'data_source', 'prompt', 'env_class', 'reward_spec']
Dtypes:
input_seq                str
output_seq               str
workspace_id             str
external_knowledge       str
data                     str
data_source              str
prompt                object
env_class                str
reward_spec           object

--- First row ---
  input_seq: # Instruction
What is the count of customers that Steve Johnson supports?
# Data
File 1:
{
    "name": "employees.json",
    "size": "2.8KB"
}
File 2:
{
    "name": "customers.csv",
    "size": "6.7KB"
}
  output_seq: <Analyze>
To determine the count of customers supported by Steve Johnson, we need to:
1. Understand the relationship between employees and customers
2. Identify how "support" is represented in the data (likely through a foreign key relationship)
3. Locate Steve Johnson in the employees data
4. Fin