The extra-generations retry logic in the Predict module, here:
https://github.com/stanfordnlp/dspy/blob/main/dsp/primitives/predict.py#L96
halves the model's max_tokens value in the kwargs and tries again. I believe this is supposed to be a temporary halving: it reads the value from the global settings via dsp.settings.lm.kwargs["max_tokens"] (which lives on the lm model object) and passes the halved value in as kwargs for that one generation only.
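From memory, the intended pattern looks roughly like this (a runnable paraphrase, not the verbatim source; `lm_kwargs` stands in for dsp.settings.lm.kwargs, and I believe the halving is floored at 75, which is where the limit described below comes from):

```python
# Sketch of the per-call halving in predict.py (illustrative, not verbatim).
def retry_kwargs(lm_kwargs: dict, **kwargs) -> dict:
    max_tokens = kwargs.get("max_tokens", lm_kwargs["max_tokens"])
    # Halve for this retry only, never dropping below 75.
    max_tokens = min(max(75, max_tokens // 2), max_tokens)
    # The halved value goes into a *new* dict for this one generation;
    # lm_kwargs itself is never written to here.
    return {**kwargs, "max_tokens": max_tokens, "n": 1, "temperature": 0.0}

print(retry_kwargs({"max_tokens": 4096}))
# {'max_tokens': 2048, 'n': 1, 'temperature': 0.0}
```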
However, this halving is made permanent by code in the AWS models module, which takes that lm.kwargs dictionary as a reference off self and writes the halved max_tokens back into it. Every subsequent generation then starts from the already-halved value, so after enough runs every generation bottoms out at the floor of max_tokens=75.
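Here is a minimal, dspy-independent repro of that aliasing effect (all names are illustrative):

```python
# Shared kwargs dict, analogous to dsp.settings.lm.kwargs.
lm_kwargs = {"max_tokens": 4096}

def call_model(**call_kwargs):
    # Analogous to the AWS model code: it grabs the shared dict by
    # reference and writes the per-call max_tokens back into it.
    query_kwargs = lm_kwargs  # reference, not a copy
    query_kwargs["max_tokens"] = call_kwargs["max_tokens"]

for step in range(8):
    # Analogous to the retry in predict.py: halve, floored at 75.
    halved = max(75, lm_kwargs["max_tokens"] // 2)
    call_model(max_tokens=halved)
    print(step, lm_kwargs["max_tokens"])
# Prints 2048, 1024, 512, 256, 128, 75, 75, 75 — the value decays
# permanently instead of resetting after each retry.
```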
I believe the code that writes the halved max_tokens back onto the lm model is here:
https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/aws_models.py#L221-L223
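A possible fix, sketched under the assumption that the AWS model builds its per-call kwargs from self.kwargs (class and method names here are hypothetical, not the actual aws_models.py code), is to merge into a fresh dict instead of mutating the shared one:

```python
# Hypothetical fix sketch; names are illustrative.
class AWSModelSketch:
    def __init__(self, **kwargs):
        self.kwargs = kwargs  # in dspy this is shared with dsp.settings.lm.kwargs

    def _call_body(self, **call_kwargs):
        # Buggy pattern (paraphrased): query_kwargs = self.kwargs, then
        # query_kwargs["max_tokens"] = call_kwargs["max_tokens"].
        # Fixed pattern: merge into a fresh dict, leaving self.kwargs intact.
        query_kwargs = {**self.kwargs, **call_kwargs}
        return query_kwargs

lm = AWSModelSketch(max_tokens=4096)
lm._call_body(max_tokens=2048)
print(lm.kwargs["max_tokens"])  # still 4096
```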