From 8d0589573a9d71ae58e492f51c3bf1b65870caef Mon Sep 17 00:00:00 2001
From: Omar Khattab
Date: Wed, 16 Oct 2024 10:43:39 -0700
Subject: [PATCH] Docs

---
 docs/docs/quick-start/getting-started-01.md |  6 +++---
 docs/docs/quick-start/getting-started-02.md | 10 +++++++---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/docs/quick-start/getting-started-01.md b/docs/docs/quick-start/getting-started-01.md
index 66eaedb2f9..e535bd5a9d 100644
--- a/docs/docs/quick-start/getting-started-01.md
+++ b/docs/docs/quick-start/getting-started-01.md
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).
 
 DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
 
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)
 
 What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And, the other way around, how well does the system response avoid _saying things_ that aren't in the gold response?
 
-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
 
 ```python
@@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).
 
 For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
 
diff --git a/docs/docs/quick-start/getting-started-02.md b/docs/docs/quick-start/getting-started-02.md
index 87da009f71..91784f9ca7 100644
--- a/docs/docs/quick-start/getting-started-02.md
+++ b/docs/docs/quick-start/getting-started-02.md
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
     def forward(self, question):
         context = search(question, k=self.num_docs)
         return self.respond(context=context, question=question)
-
+```
+
+Let's use the RAG module.
+
+```python
 rag = RAG()
 rag(question="what are high memory and low memory on linux?")
 ```
@@ -111,7 +115,7 @@ dspy.inspect_history()
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).
 
 In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
 
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).
 
 The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button. It can, for instance, overfit your training set and fail to generalize to a held-out set, which makes it essential that we iteratively validate our programs.
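Outside the patch itself, here is a minimal, self-contained sketch of how the pieces referenced in these two guides fit together: the `RAG` module the patch touches, the `SemanticF1` metric it links to, and `dspy.Evaluate`. The model name, the canned `search` stand-in, and the one-example dev set are illustrative assumptions rather than part of the docs (the real guides wire `search` to a retriever over a downloaded corpus and build `trainset`/`valset`/`devset` from real QA data), and the `__init__` of `RAG` is reconstructed from the guide.

```python
import dspy
from dspy.evaluate import SemanticF1

# Assumption: an OpenAI API key is configured and this model name is available;
# the guides themselves use gpt-4o-mini via dspy.LM.
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

def search(query, k=5):
    # Hypothetical stand-in for the guide's retriever; it returns canned passages
    # instead of searching a real corpus.
    passages = ["On 32-bit Linux, low memory is directly mapped into the kernel's "
                "address space, while high memory is mapped on demand."]
    return passages[:k]

class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        self.num_docs = num_docs
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)

# A tiny, made-up dev set; the guides build their splits from real QA data.
devset = [
    dspy.Example(
        question="what are high memory and low memory on linux?",
        response="They are regions of the 32-bit kernel's virtual address space: "
                 "low memory is permanently mapped, high memory is not.",
    ).with_inputs("question"),
]

rag = RAG()
metric = SemanticF1()  # the metric linked in the patch, itself a small DSPy module

# Score a single prediction directly...
pred = rag(question=devset[0].question)
print(metric(devset[0], pred))

# ...or evaluate over the whole dev set with parallelism and a progress bar.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
evaluate(rag)
```

Swapping the stand-ins for the guide's retriever and data splits should reproduce the evaluation loop that the patched sections describe, and the resulting `RAG()` program is what `tp.compile(...)` optimizes in the second guide.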