Autonomous Data Science Research Agent

This project demonstrates how to orchestrate the OpenAI Agent API to build an end-to-end research assistant tailored for data science teams. The pipeline accepts a user query, automatically investigates relevant websites, stores the findings inside a local vector database, and produces a concise executive summary.

Features

Autonomous web research powered by an OpenAI agent with the web search tool enabled.
Embedding & retrieval pipeline that indexes research snippets in a local vector store for future reuse.
Summarisation layer that distils the aggregated findings into a polished report tailored to data science stakeholders.
CLI workflow for triggering research sprints from the terminal.
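
The way these features chain together can be sketched as a single function that takes each stage as a callable. This is a minimal illustration only, not the code in `datasci_tool/pipeline.py`: the `Finding` dataclass, `run_research_sprint`, and every parameter name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Finding:
    """One research snippet; the field names are illustrative."""
    url: str
    text: str


def run_research_sprint(
    query: str,
    research: Callable[[str], Sequence[Finding]],
    embed: Callable[[str], list[float]],
    store: Callable[[Finding, list[float]], None],
    summarise: Callable[[str, Sequence[Finding]], str],
) -> dict:
    """Chain the four stages: research, embed, store, summarise."""
    findings = list(research(query))
    for finding in findings:
        # Index every snippet so later queries can reuse it.
        store(finding, embed(finding.text))
    return {
        "summary": summarise(query, findings),
        "sources": [f.url for f in findings],
    }


# Stand-in stages, so the sketch runs without network access.
indexed = []
result = run_research_sprint(
    "Bayesian optimization",
    research=lambda q: [Finding("https://example.com/a", f"notes on {q}")],
    embed=lambda text: [float(len(text))],
    store=lambda finding, vector: indexed.append((finding.url, vector)),
    summarise=lambda q, fs: f"{len(fs)} finding(s) for {q!r}",
)
print(result)
```

Passing the stages in as callables keeps the orchestration independent of any particular agent, embedding model, or vector store, which is also what makes it easy to test with stubs.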

Project structure

.
├── datasci_tool/
│   ├── config.py          # Configuration dataclasses
│   ├── embeddings.py      # Embedding generation helpers
│   ├── pipeline.py        # High level research workflow
│   ├── research_agent.py  # OpenAI Agent API wrapper
│   ├── summary.py         # Summary generation helper
│   └── vector_store.py    # Lightweight vector database
├── scripts/
│   └── run_pipeline.py    # CLI entrypoint
├── tests/
│   └── test_pipeline.py   # Unit tests with heavy mocking
└── pyproject.toml

Getting started

Install dependencies (preferably in a virtual environment):
```
pip install -e ".[dev]"
```
Export your OpenAI credentials:
```
export OPENAI_API_KEY="sk-..."
```
Run a research sprint from the command line:
```
python scripts/run_pipeline.py "Bayesian optimization for hyperparameter tuning"
```
The command prints a JSON payload containing the agent's summary, the sources it discovered, and the top similar items retrieved from the vector database.
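
As a rough illustration of consuming that payload, the sketch below parses a sample string and formats it for reading. The key names `summary`, `sources`, and `similar_items`, along with the `text` and `score` fields, are assumptions; check the actual output of `run_pipeline.py` before relying on them.

```python
import json

# Hypothetical sample output; the real key names may differ.
raw = """
{
  "summary": "Bayesian optimization models the objective with a surrogate.",
  "sources": ["https://example.com/bayes-opt"],
  "similar_items": [{"text": "Gaussian process surrogates", "score": 0.91}]
}
"""


def describe_payload(payload_text: str) -> str:
    """Turn the JSON payload into a short plain-text report."""
    payload = json.loads(payload_text)
    lines = [payload["summary"], "", "Sources:"]
    lines += [f"  {url}" for url in payload["sources"]]
    lines.append("Similar items:")
    lines += [
        f"  {item['score']:.2f}  {item['text']}"
        for item in payload["similar_items"]
    ]
    return "\n".join(lines)


report = describe_payload(raw)
print(report)
```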

Testing

The test suite uses mocks to avoid real API calls. Run it with:

pytest
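
A mocked test in this style might look like the sketch below. It is not taken from `tests/test_pipeline.py`: `run_pipeline`, `agent.research`, and `store.add` are hypothetical names standing in for whatever the real pipeline exposes. Only `unittest.mock.Mock` from the standard library is used.

```python
from unittest.mock import Mock


def run_pipeline(query, agent, store):
    """Hypothetical stand-in for the pipeline entry point."""
    findings = agent.research(query)
    for finding in findings:
        store.add(finding)
    return {"query": query, "count": len(findings)}


def test_pipeline_indexes_every_finding():
    # The mocked agent returns canned snippets instead of calling the API.
    agent = Mock()
    agent.research.return_value = ["snippet one", "snippet two"]
    store = Mock()

    result = run_pipeline("hyperparameter tuning", agent, store)

    agent.research.assert_called_once_with("hyperparameter tuning")
    assert store.add.call_count == 2
    assert result == {"query": "hyperparameter tuning", "count": 2}


# pytest would collect the test automatically; call it directly here.
test_pipeline_indexes_every_finding()
```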

Notes on the OpenAI Agent API

The agent is instantiated with the web_search and code_interpreter tools so it can browse the web and run lightweight calculations as needed.
Research results are streamed back as a JSON array to ensure downstream components receive structured information.
Embeddings are generated with text-embedding-3-large and stored inside a lightweight JSON-backed vector store for simplicity. You can swap in a managed vector database (such as Pinecone or Chroma) by implementing the same interface as LocalVectorStore.
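
To make the idea of a JSON-backed vector store concrete, here is a minimal sketch using cosine similarity over plain Python lists. It is not the project's `LocalVectorStore`: the class name `JsonVectorStore` and its `add` and `search` methods are assumptions chosen for illustration, and the real interface may differ. The toy two-dimensional vectors stand in for actual embeddings.

```python
import json
import math
import tempfile
from pathlib import Path


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class JsonVectorStore:
    """Illustrative store that keeps every record in one JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload earlier records if the file already exists.
        self.records = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def add(self, text, vector):
        self.records.append({"text": text, "vector": list(vector)})
        self.path.write_text(json.dumps(self.records))

    def search(self, query_vector, top_k=3):
        """Return up to top_k (score, text) pairs, best match first."""
        scored = [
            (cosine_similarity(query_vector, r["vector"]), r["text"])
            for r in self.records
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


with tempfile.TemporaryDirectory() as tmp:
    store = JsonVectorStore(Path(tmp) / "vectors.json")
    store.add("gaussian processes", [1.0, 0.0])
    store.add("random search", [0.0, 1.0])
    top = store.search([0.9, 0.1], top_k=1)
    # A second instance reads the same file back from disk.
    reloaded = JsonVectorStore(Path(tmp) / "vectors.json")

print(top[0][1])
```

A linear scan like this is fine for a few thousand snippets; a managed vector database becomes worthwhile once the index outgrows a single JSON file.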
