Docs #1637 (Merged)

6 changes: 3 additions & 3 deletions docs/docs/quick-start/getting-started-01.md
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).


DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
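To make the interchangeability concrete, here is a minimal sketch (assuming an LM has already been configured via `dspy.configure`, as earlier in the guide; the snippet itself is not from the docs): the same string signature can be handed to any of these modules, so swapping strategies is a one-line change.

```python
import dspy  # assumes dspy.configure(lm=...) has already been called

# Same signature, two different inference-time strategies.
predict = dspy.Predict('question -> response')
cot = dspy.ChainOfThought('question -> response')  # adds an intermediate reasoning field

question = "what are high memory and low memory on linux?"
print(predict(question=question).response)
print(cot(question=question).response)  # this prediction also exposes `.reasoning`
```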
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)

What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And conversely, how well does the system response avoid _saying things_ that aren't in the gold response?

-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
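To pin the idea down, here is a rough sketch of the scoring rule (a hypothetical helper for illustration, not DSPy's actual implementation): treat the two questions above as recall and precision over key facts, then combine them as a harmonic mean.

```python
# recall:    what fraction of the gold response's key facts does the system response cover?
# precision: what fraction of the system response's claims are supported by the gold response?
def harmonic_f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(harmonic_f1(precision=0.8, recall=0.6))  # ~0.686
```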


```python
Expand Down Expand Up @@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).

For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
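For reference, the "simple loop" version might look like the sketch below (assuming the chain-of-thought program from earlier in this guide is named `cot`, and that `metric` and `devset` are defined as above):

```python
# Score every dev example with the metric and average the result (no parallelism).
scores = []
for example in devset:
    prediction = cot(**example.inputs())        # run the program on the example's inputs
    scores.append(metric(example, prediction))  # SemanticF1(gold, prediction)

print(f"Average semantic F1: {sum(scores) / len(scores):.3f}")
```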

10 changes: 7 additions & 3 deletions docs/docs/quick-start/getting-started-02.md
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)

```

Let's use the RAG module.

```
rag = RAG()
rag(question="what are high memory and low memory on linux?")
```
@@ -111,7 +115,7 @@ dspy.inspect_history()
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).


In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
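One way to check is to reuse the same kind of evaluation harness (a sketch, assuming `devset` and the `SemanticF1` metric from the previous guide are in scope; the parameter values here are illustrative):

```python
# Build a dspy.Evaluate harness like the one in the previous guide and score the
# RAG program, so the number is directly comparable to the earlier CoT result.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)
evaluate(RAG())
```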
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).


The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button: it's very possible for it to overfit your training set and not generalize well to a held-out set, which makes it essential that we iteratively validate our programs.
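Concretely, that iterative validation can be as simple as scoring both programs on the held-out `devset` (a sketch, assuming the `evaluate` harness from above):

```python
# Compare the unoptimized baseline and the optimized program on held-out data
# that the optimizer never saw during compilation.
evaluate(RAG())
evaluate(optimized_rag)
```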