In [None]:
!pip install mistralai python-dotenv


In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

from mistralai import Mistral

client=Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

ml_agent=client.beta.agents.create(
    model="mistral-medium-2505",
    name="ml_agent",
    description="Machine learning assistant",
    instructions="You're an ML expert. Give practicle, actionable advice.",
)

In [4]:
print(f"Agent created: {ml_agent.id}")

Agent created: ag_01981f0822d674dfaaac2b056963f042


In [7]:
response=client.beta.conversations.start(
    agent_id=ml_agent.id,
    inputs="Should I use Random forest or XGBoost for a 5000 sample dataset? Answer in 3 sentences only.",
)

In [8]:
type(response)

mistralai.models.conversationresponse.ConversationResponse

In [9]:
response.outputs

[MessageOutputEntry(content="For a dataset of 5000 samples, XGBoost is generally a better choice due to its superior performance and efficiency. It often provides better accuracy and handles various types of data well. However, if interpretability is a key concern, Random Forest might be more suitable as it's easier to understand and visualize.", object='entry', type='message.output', created_at=datetime.datetime(2025, 7, 18, 19, 45, 52, 970041, tzinfo=TzInfo(UTC)), completed_at=datetime.datetime(2025, 7, 18, 19, 45, 54, 339345, tzinfo=TzInfo(UTC)), id='msg_01981f1259497066b5a0bf9c922f141c', agent_id='ag_01981f0822d674dfaaac2b056963f042', model='mistral-medium-2505', role='assistant')]

In [10]:
print(response.outputs[0].content)

For a dataset of 5000 samples, XGBoost is generally a better choice due to its superior performance and efficiency. It often provides better accuracy and handles various types of data well. However, if interpretability is a key concern, Random Forest might be more suitable as it's easier to understand and visualize.


In [12]:
data_agent=client.beta.agents.create(
    model="mistral-medium-latest",
    name="data-expert",
    description="Data preprocessing specialist",
    instructions="Expert in data cleaning and feature engineering",
    completion_args={"temperature":0.1},
)
print(f"Data agent: {data_agent.id}")

Data agent: ag_019824095d7d7488985a4901557f1e18


In [13]:
response=client.beta.conversations.start(
    agent_id=ml_agent.id,
    inputs="What are the best practices for encoding categorical variables in machine learning? List three",

)
print(response.outputs[0].content)

Sure, here are three best practices for encoding categorical variables in machine learning:

1. **One-Hot Encoding**:
   - **Actionable Advice**: Use one-hot encoding for categorical variables with a small number of unique values. This method creates a binary column for each category and returns a vector with a 1 in the position corresponding to the category.
   - **Implementation**: Libraries like pandas in Python have built-in functions (e.g., `pd.get_dummies()`) to easily apply one-hot encoding.
   - **Consideration**: Be mindful of the "curse of dimensionality" if the categorical variable has many unique values, as this can lead to a high-dimensional feature space.

2. **Label Encoding**:
   - **Actionable Advice**: Use label encoding for ordinal categorical variables where there is a meaningful order between the categories. This method assigns a unique integer to each category.
   - **Implementation**: Libraries like scikit-learn provide `LabelEncoder` for this purpose.
   - **Con

In [14]:
print(response.conversation_id)

conv_019828d58d917752bccd5321124880ca


In [15]:
follow_up=client.beta.conversations.append(
    conversation_id=response.conversation_id,
    inputs="What was my first question?",
)
print(follow_up.outputs[0].content)

Your first question was:

"What are the best practices for encoding categorical variables in machine learning? List three"


In [16]:
follow_up=client.beta.conversations.append(
    conversation_id=response.conversation_id,
    inputs="Summarize the entire coo=nverstion in 3 sentences",
)
print(follow_up.outputs[0].content)

You asked for best practices for encoding categorical variables in machine learning, and I provided three actionable methods: one-hot encoding, label encoding, and embeddings for high-cardinality variables. Then, you asked what your first question was, and I repeated it for you. This summarizes our brief conversation so far.


In [37]:
message="Explain gradient descent in simple terms in 5 sentences"

response=client.beta.conversations.start_stream(
    agent_id=ml_agent.id, inputs=message,
)

type(response)

mistralai.utils.eventstreaming.EventStream

In [38]:
with response as event_stream:
    for event in event_stream:
        if event.event=="message.output.delta":
            print(event.data.content, flush=True, end=" ")

Sure , here 's a simple explanation  of gradient descent:

 1 . **Objective **: Gradient  descent is an  optimization algorithm used to  minimize  some  function by  iteratively moving  toward  the lowest  value  of that  function.
2.  **How  it works**: Imagine  you're on  a mountain  and  want  to get  to the lowest  point. You look  around  to  see  which  way  is steep est downward  and  take  a  step in  that direction.
3 . **Iter ative  process **: You repeat  this process,  each  time checking  the steep ness  and  taking  another  step downward ,  until you can 't go  any lower .
 4. **Learning  rate**: The size  of each  step is  determined  by the learning rate . If  the  steps  are too big , you might overs hoot the lowest point .  If they 're  too small, it  might  take  a  long  time to reach  the bottom .
5. ** Usage  in  ML **: In  machine learning, this  process  is used  to update  the  parameters  of a model to  minimize  the error  or  loss function,  making  the  mo

In [39]:
last_conversation_id=client.beta.conversations.list()[0].id

message="Translate your last response to German."
follow_up=client.beta.conversations.append_stream(
    conversation_id=last_conversation_id, inputs=message
)



In [40]:
with follow_up as event_stream:
    for event in event_stream:
        if event.event=="message.output.delta":
            print(event.data.content, flush=True, end=" ")

Hier  ist die  Übersetzung  meiner  letzten Antwort ins  Deutsche :

1 . **Ziel **: Der Gradient en ab stieg  ist ein Optim ierungsalgorithmus , der  verwendet  wird , um eine  Funktion  zu minimieren,  indem iter ativ  auf  den  nied rigsten Wert dieser  Funktion zu gegangen  wird .
2. ** F unkt ionsweise **: Stellen  Sie sich  vor, Sie  befinden  sich auf einem  Berg  und möchten den  tief sten Punkt erreichen . Sie sch auen sich um , um  zu  sehen , welcher  Weg am  ste ilsten ab wärts  führt ,  und machen  einen  Schritt in diese  Richtung.
3.  **Iterat iver Prozess**: Sie  wiederhol en diesen  Prozess, jedes  Mal über prüfen Sie die  Steil heit und machen einen  weiteren  Schritt nach  unten, bis Sie  nicht mehr tiefer  gehen  können.
4.  **L ernrate**: Die  Größe jedes  Sch rit ts wird durch die  Lernrate bestimmt .  Wenn die Schritte  zu groß sind,  könnten Sie den  tiefsten Punkt übersch reiten. Wenn  sie zu klein sind , könnte  es lange  dauern, bis  Sie den tiefsten  Punkt er

In [41]:
research_agent=client.beta.agents.create(
    model="mistral-medium-latest",
    name="ml-research-agent",
    description="ML research specialist",
    tools=[{"type": "web_search"}]
)

search_response=client.beta.conversations.start(
    agent_id=research_agent.id, inputs="Latest transformer improvemnts in 2025 summarized"
)

In [49]:
print(search_response.outputs[1].content)

[TextChunk(text='The latest improvements in transformer architecture in 2025 include several key advancements aimed at enhancing performance, efficiency, and scalability. Here are some of the notable developments:  \n  \n1. **Pre-normalization**: Modern transformer architectures have shifted from post-normalization to pre-normalization. In pre-normalization, normalization layers are applied before the self-attention and feedforward components. This change improves gradient flow, especially in deep networks, leading to better training stability', type='text'), ToolReferenceChunk(tool='web_search', title='The Evolution of Transformer Architecture: From 2017 to 2024 | by arghya mukherjee | Medium', type='tool_reference', url='https://medium.com/@arghya05/the-evolution-of-transformer-architecture-from-2017-to-2024-5a967488e63b', favicon='https://imgs.search.brave.com/4R4hFITz_F_be0roUiWbTZKhsywr3fnLTMTkFL5HFow/rs:fit:32:32:1:0/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTZ

In [54]:
markdown_text=" "

for result in search_response.outputs[1].content:
    if hasattr(result, "text"):
        markdown_text+=result.text
        if hasattr(result,"tool"):
            markdown_text+=f"([{result.title}]({result.url}))"

In [55]:
from IPython.display import Markdown

Markdown(markdown_text)

 The latest improvements in transformer architecture in 2025 include several key advancements aimed at enhancing performance, efficiency, and scalability. Here are some of the notable developments:  
  
1. **Pre-normalization**: Modern transformer architectures have shifted from post-normalization to pre-normalization. In pre-normalization, normalization layers are applied before the self-attention and feedforward components. This change improves gradient flow, especially in deep networks, leading to better training stability.  
  
2. **Rotary Position Embeddings (RoPE)**: Rotary position embeddings have replaced the traditional sinusoidal positional encodings. RoPE encodes relative positional information more effectively, integrating it naturally into the self-attention mechanism. This allows the model to generalize better to sequences longer than those seen during training, improving performance on tasks involving long-context inputs.  
  
3. **Efficiency Improvements**: Various "X-former" models, such as Reformer, Linformer, Performer, and Longformer, have been introduced to improve computational and memory efficiency. These models address the quadratic complexity issue of the self-attention mechanism and reduce computation costs through techniques like pooling and sparsity.  
  
4. **Handling Long Sequences**: Innovations like Transformer-XL and Reformer have been developed to handle long sequences more efficiently. Future models are expected to further improve memory mechanisms, sparse attention, and recurrence to process large-scale sequential data more effectively.  
  
5. **Multilingual Capabilities**: There is a growing focus on developing multilingual transformer models. Models like mBERT (Multilingual BERT) and XLM-R (Cross-lingual Language Model) are examples of early multilingual models. Future advancements aim to improve cross-lingual transfer learning, particularly for low-resource languages where labeled data is scarce.  
  
6. **Integration with Reinforcement Learning**: Combining transformers with reinforcement learning (RL) is an exciting area of research. This integration aims to enhance the decision-making capabilities of transformer models, making them more adaptable and efficient in various applications.  
  
These improvements reflect the ongoing efforts to optimize transformer models for better performance, efficiency, and applicability across different domains and languages.