Docs #1637 (Merged)

6 changes: 3 additions & 3 deletions docs/docs/quick-start/getting-started-01.md
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).


DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
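To make the interchangeability concrete, here is a minimal sketch (assuming an LM has already been configured via `dspy.configure`, as earlier in the guide; the snippet itself is not from the docs): the same string signature can be handed to any of these modules, so swapping strategies is a one-line change.

```python
import dspy  # assumes dspy.configure(lm=...) has already been called

# Same signature, two different inference-time strategies.
predict = dspy.Predict('question -> response')
cot = dspy.ChainOfThought('question -> response')  # adds an intermediate reasoning field

question = "what are high memory and low memory on linux?"
print(predict(question=question).response)
print(cot(question=question).response)  # this prediction also exposes `.reasoning`
```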
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)

What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And conversely, how well does the system response avoid _saying things_ that aren't in the gold response?

-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
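To pin the idea down, here is a rough sketch of the scoring rule (a hypothetical helper for illustration, not DSPy's actual implementation): treat the two questions above as recall and precision over key facts, then combine them as a harmonic mean.

```python
# recall:    what fraction of the gold response's key facts does the system response cover?
# precision: what fraction of the system response's claims are supported by the gold response?
def harmonic_f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(harmonic_f1(precision=0.8, recall=0.6))  # ~0.686
```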


```python
Expand Down Expand Up @@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).

For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
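For reference, the "simple loop" version might look like the sketch below (assuming the chain-of-thought program from earlier in this guide is named `cot`, and that `metric` and `devset` are defined as above):

```python
# Score every dev example with the metric and average the result (no parallelism).
scores = []
for example in devset:
    prediction = cot(**example.inputs())        # run the program on the example's inputs
    scores.append(metric(example, prediction))  # SemanticF1(gold, prediction)

print(f"Average semantic F1: {sum(scores) / len(scores):.3f}")
```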

10 changes: 7 additions & 3 deletions docs/docs/quick-start/getting-started-02.md
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)

```

Let's use the RAG module.

```
rag = RAG()
rag(question="what are high memory and low memory on linux?")
```
@@ -111,7 +115,7 @@ dspy.inspect_history()
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).


In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
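One way to check is to reuse the same kind of evaluation harness (a sketch, assuming `devset` and the `SemanticF1` metric from the previous guide are in scope; the parameter values here are illustrative):

```python
# Build a dspy.Evaluate harness like the one in the previous guide and score the
# RAG program, so the number is directly comparable to the earlier CoT result.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)
evaluate(RAG())
```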
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).


The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button: it's very possible for it to overfit your training set and not generalize well to a held-out set, which makes it essential that we iteratively validate our programs.
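Concretely, that iterative validation can be as simple as scoring both programs on the held-out `devset` (a sketch, assuming the `evaluate` harness from above):

```python
# Compare the unoptimized baseline and the optimized program on held-out data
# that the optimizer never saw during compilation.
evaluate(RAG())
evaluate(optimized_rag)
```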