# Tutorial: Data Science

This tutorial introduces Solara from the perspective of a data scientist, or of anyone considering Solara for a data science app.
It therefore focuses on data (pandas), visualizations (plotly), and how to add interactivity.

## You should know
This tutorial will assume:

  * You have successfully installed Solara
  * You know how to display a Solara component in a notebook or script

If not, please follow the [Quick start](/docs/quickstart).

## Extra packages you need to install

For this tutorial, you need plotly and pandas. You can install them using pip:

  $ pip install plotly pandas

## You will learn

In this tutorial, you will learn how to:

   * [Create a scatter plot using plotly.express](#our-first-scatter-plot).
   * [Display your plot in a Solara component](#our-first-scatter-plot).
   * [Build a UI to configure the X and Y axis](#configure-the-x-axis).
   * [Handle a click event and record which point was clicked on](#interactive-plot).
   * [Refactor your code to build a reusable Solara component](#make-a-reusable-component).
   * [Compose your newly built component into a larger application](#make-a-reusable-component).

## The dataset

For this tutorial, we will use the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set), which contains the lengths and widths of the petals and sepals of three species of Iris (setosa, virginica and versicolor).

This dataset comes with many packages, but since we are going to use plotly.express for this tutorial, we will use:

```python
import plotly.express as px
df = px.data.iris()
```

In [None]:
## solara: skip
import plotly.express as px


df = px.data.iris()
df


## Our first scatter plot

We use plotly express to create our scatter plot with just a single line.

```python
fig = px.scatter(df, "sepal_length", "sepal_width", color="species")
```

To display this figure in a Solara component, we should return an element that can render the plotly figure. [FigurePlotly](/api/plotly) will do the job for us.

Putting this together:

In [None]:
import plotly.express as px
import solara

df = px.data.iris()


@solara.component
def Page():
    fig = px.scatter(df, "sepal_length", "sepal_width", color="species")
    solara.FigurePlotly(fig)

In [None]:
## solara: skip
Page()

## Configure the X-axis.

We now add a [`Select`](/api/select) component to list all columns.

```python
columns = list(df.columns)
solara.Select(label="X-axis", values=columns, value=x, on_value=set_x)
```

However, we need to do a few things.

   1. Set an initial value (`x`) and pass it to our `Select` component (`value=x`) *and* `px.scatter`.
   2. Respond to a change in the value of our `Select` (`on_value=set_x`).
   3. Store the changed value, and re-render our component.

If we write the following in our component:
   
```python
x, set_x = solara.use_state("sepal_length")
```

We tell Solara to create a piece of "state" that is initially set to `"sepal_length"` and is returned as `x` in this case. Solara also gives us a function (`set_x`) that we can call to change the state. If we invoke this function, the function body of our component is executed again. This time, however, `use_state` returns the value last passed to `set_x` instead of the initial value.


If we now pass `set_x` to the `on_value` event handler of our `Select` component, we have solved items 2 and 3. Neat!
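To make the re-render cycle concrete, here is a toy model in plain Python. It is only a mental model under simplifying assumptions, not how Solara is implemented: `ToyRenderer`, `page`, `seen` and `setters` are hypothetical names invented for this sketch.

```python
# Toy model of the use_state pattern. This is NOT Solara's implementation,
# only an illustration of "calling the setter re-executes the render function".
class ToyRenderer:
    def __init__(self, render_function):
        self.render_function = render_function
        self.state = {}  # slot number -> stored value
        self.slot = 0
        self.render_count = 0

    def use_state(self, initial):
        slot = self.slot
        self.slot += 1
        # the initial value is only used the first time this slot is seen
        if slot not in self.state:
            self.state[slot] = initial

        def setter(value):
            self.state[slot] = value
            self.render()  # changing state triggers a re-render

        return self.state[slot], setter

    def render(self):
        self.slot = 0
        self.render_count += 1
        self.render_function(self)


seen = []     # every value of x observed during a render
setters = {}  # lets us call set_x from outside, like a Select would


def page(renderer):
    x, set_x = renderer.use_state("sepal_length")
    seen.append(x)
    setters["set_x"] = set_x


renderer = ToyRenderer(page)
renderer.render()                 # first render: x is the initial value
setters["set_x"]("petal_width")   # simulates the user picking a new column
print(seen)  # ['sepal_length', 'petal_width']
```

The second entry in `seen` shows that the render function ran again and that `use_state` returned the stored value rather than the initial one.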


In [None]:
columns = list(df.columns)


@solara.component
def Page():
    # initially, x is set to "sepal_length"
    x, set_x = solara.use_state("sepal_length")
    # calling set_x("some_value"), will re-execute this 'render function'
    # and will set "x" to "some_value"
    
    # we pass "x" to px.scatter
    fig = px.scatter(df, x, "sepal_width", color="species")
    solara.FigurePlotly(fig)
    # and we also pass it back to Select again
    # on_value triggers when the value changes, because we set it to
    # set_x, it changes the state (x), and triggers a rerender
    solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)

In [None]:
## solara: skip
Page()

### Understanding (optional)

#### State

Understanding `use_state`, how to link it to callbacks, and how Solara re-renders components is crucial for building larger applications. If you don't fully grasp it now, that is OK. You should first get used to the pattern, and consider reading [Understanding Solara Basics](/docs/understanding/reacton-basics) later on to get a deeper understanding.


#### Layout and Context managers

This tutorial also uses two further concepts: hierarchical composition (a container such as `VBox` can have children) and components as context managers (`with solara.VBox():`). See [Understanding Containers](/docs/understanding/containers) to understand these topics better.

## Configure the Y-axis.

Now that we can configure the X-axis, we can repeat the same for the Y-axis. As good practice, try to do this yourself without looking at the code.

In [None]:
@solara.component
def Page():
    x, set_x = solara.use_state("sepal_length")
    y, set_y = solara.use_state("sepal_width")

    fig = px.scatter(df, x, y, color="species")

    with solara.VBox() as main:
        solara.FigurePlotly(fig)
        solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)
        solara.Select(label="Y-axis", value=y, values=columns, on_value=set_y)
    return main
        

In [None]:
## solara: skip
Page()

## Interactive plot

We have now built a small UI to control a scatter plot. However, we often also want to interact with the data, for instance by selecting a point in our scatter plot.

We could look up in the plotly documentation exactly how to extract the right data, but let's take a different approach. We are simply going to store the data we get from `on_click` in a new state variable (`click_data`) and display the raw data in a Markdown component.

In [None]:
@solara.component
def Page():
    x, set_x = solara.use_state("sepal_length")
    y, set_y = solara.use_state("sepal_width")
    # store the click data in local state
    click_data, set_click_data = solara.use_state(None)

    fig = px.scatter(df, x, y, color="species")

    solara.FigurePlotly(fig, on_click=set_click_data)
    solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)
    solara.Select(label="Y-axis", value=y, values=columns, on_value=set_y)
    # display the raw data as inline code (using backticks) in Markdown
    solara.Markdown(f"`{click_data}`")
        

In [None]:
## solara: skip
Page()

### Inspecting the on_click data

Click a point and you should see the data printed out like:

```python
{'event_type': 'plotly_click', 'points': {'trace_indexes': [1], 'point_indexes': [34], 'xs': [5.4], 'ys': [3]}, 'device_state': {'alt': False, 'ctrl': False, 'meta': False, 'shift': False, 'button': 0, 'buttons': 1}, 'selector': None}
```

We can see from the raw data that we can access the index of the trace we clicked on (we have 3 traces, one each for setosa, versicolor and virginica). We can also access the point index (which point in the trace was clicked). With these two numbers we can find the row number we clicked on.
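As a small worked example, the two indexes can be read straight out of the dictionary shown above. This sketch copies that example output into a literal so it runs without plotly. The variable names are only illustrative.

```python
# The click data from the example output above, copied as a plain dict.
click_data = {
    "event_type": "plotly_click",
    "points": {"trace_indexes": [1], "point_indexes": [34], "xs": [5.4], "ys": [3]},
    "device_state": {"alt": False, "ctrl": False, "meta": False, "shift": False, "button": 0, "buttons": 1},
    "selector": None,
}

# Each entry under "points" is a list; a single click gives one element each.
trace_index = click_data["points"]["trace_indexes"][0]
point_index = click_data["points"]["point_indexes"][0]
clicked_x = click_data["points"]["xs"][0]
clicked_y = click_data["points"]["ys"][0]

print(trace_index, point_index)  # 1 34
print(clicked_x, clicked_y)      # 5.4 3
```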

### Finding row number (optional)

It is slightly annoying that plotly express splits up our dataframe into 3 traces, since the trace index and point index alone are not enough to find the row number in the original dataframe.

There is a trick we can use to get the row index. If we pass `df.index` to the `custom_data` argument, plotly express will also 'distribute' the index along the traces. We can use this information to reconstruct the row index from the trace index and point index.
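The idea can be imitated in plain Python. The sketch below is not plotly code. It assumes one trace per species, created in order of first appearance, and uses a made-up five-row `species` column. `toy_find_row_index` is a hypothetical helper that mirrors the lookup on `trace.customdata`.

```python
# A made-up "species" column; the position in the list is the row index.
species = ["setosa", "versicolor", "setosa", "virginica", "versicolor"]

# Imitate how the index is distributed over the traces: each trace keeps
# a list of [row_index] entries, similar in shape to trace.customdata.
traces = {}
for row_index, name in enumerate(species):
    traces.setdefault(name, []).append([row_index])

# Traces in order of first appearance: setosa, versicolor, virginica.
trace_list = list(traces.values())


def toy_find_row_index(trace_list, trace_index, point_index):
    # go from (trace index, point index) back to the original row index
    return trace_list[trace_index][point_index][0]


# The second point of the second trace (versicolor) is row 4.
row = toy_find_row_index(trace_list, trace_index=1, point_index=1)
print(row)  # 4
```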


### Displaying the row number

Now that we have sorted out how to get the row number, we simply display it to test that our code works.

In [None]:
def find_row_index(fig, click_data):
    # goes from trace index and point index to row index in a dataframe
    # requires passing df.index as custom_data to px.scatter
    trace_index = click_data['points']['trace_indexes'][0]
    point_index = click_data['points']['point_indexes'][0]
    trace = fig.data[trace_index]
    return trace.customdata[point_index][0]
    

@solara.component
def Page():
    x, set_x = solara.use_state("sepal_length")
    y, set_y = solara.use_state("sepal_width")
    clicked_row, set_clicked_row = solara.use_state(None)

    fig = px.scatter(df, x, y, color="species", custom_data=[df.index])

    # Instead of passing FigurePlotly the set_clicked_row directly
    # we need to do some data manipulation first.
    # we do this in a local function, so that we can access the local
    # variables we need (set_clicked_row function and fig)
    def on_click(click_data):
        # sanity checks
        assert click_data['event_type'] == "plotly_click"        
        row_index = find_row_index(fig, click_data)
        set_clicked_row(row_index)

    solara.FigurePlotly(fig, on_click=on_click)
    solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)
    solara.Select(label="Y-axis", value=y, values=columns, on_value=set_y)
    if clicked_row is not None:
        solara.Markdown(f"Clicked on `index={clicked_row}`")
    else:
        solara.Info("Click to select a point")
        

In [None]:
## solara: skip
Page()

## Displaying the nearest neighbours

We now have the row index of the point we clicked on. We will use it to improve our component in two ways:

   1. Add an indicator in the scatter plot to highlight which point we clicked on.
   2. Find the nearest neighbours and display them in a table.
  
For the first item, we simply use plotly express again, and add the single trace it generated to the existing figure (instead of displaying two separate figures).

We add a function to find the `n` nearest neighbours:

```python
def find_nearest_neighbours(df, xcol, ycol, x, y, n=10):
    df = df.copy()
    df["distance"] = ((df[xcol] - x)**2 + (df[ycol] - y)**2)**0.5
    return df.sort_values('distance')[1:n+1]
```

We now only find the nearest neighbours if `clicked_row` is not `None`, and display the resulting dataframe using the [`DataFrame`](/api/dataframe) component.
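To see why the slice starts at 1, here is a pandas-free sketch of the same arithmetic on a handful of made-up points. `toy_nearest` and `points` are hypothetical names used only for illustration. The clicked point has distance 0 to itself, so it normally sorts first and the slice `[1:n + 1]` skips it.

```python
import math

# Made-up (x, y) points; the position in the list plays the role of the row index.
points = [(5.1, 3.5), (4.9, 3.0), (5.0, 3.6), (7.0, 3.2), (5.1, 3.4)]


def toy_nearest(points, x, y, n=2):
    # rank the row indexes by Euclidean distance to (x, y)
    ranked = sorted(
        range(len(points)),
        key=lambda i: math.hypot(points[i][0] - x, points[i][1] - y),
    )
    # position 0 is expected to be the clicked point itself, so skip it
    return ranked[1:n + 1]


clicked = 0
x, y = points[clicked]
nearest = toy_nearest(points, x, y, n=2)
print(nearest)  # [4, 2]
```

Note that this relies on the clicked point sorting first. If the data contains another row with identical coordinates, one of the two rows at distance 0 is skipped, and it may not be the clicked one.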


In [None]:
def find_nearest_neighbours(df, xcol, ycol, x, y, n=10):
    df = df.copy()
    df["distance"] = ((df[xcol] - x)**2 + (df[ycol] - y)**2)**0.5
    return df.sort_values('distance')[1:n+1]


@solara.component
def Page():
    x, set_x = solara.use_state("sepal_length")
    y, set_y = solara.use_state("sepal_width")
    clicked_row, set_clicked_row = solara.use_state(None)

    fig = px.scatter(df, x, y, color="species", custom_data=[df.index])

    if clicked_row is not None:
        # add an indicator 
        click_x = df[x].values[clicked_row]
        click_y = df[y].values[clicked_row]
        fig.add_trace(px.scatter(x=[click_x], y=[click_y], text=["⭐️"]).data[0])
        df_nearest = find_nearest_neighbours(df, x, y, click_x, click_y, n=3)

    def on_click(click_data):
        # sanity checks
        assert click_data['event_type'] == "plotly_click"
        row_index = find_row_index(fig, click_data)
        set_clicked_row(row_index)

    solara.FigurePlotly(fig, on_click=on_click)
    solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)
    solara.Select(label="Y-axis", value=y, values=columns, on_value=set_y)
    if clicked_row is not None:
        solara.Markdown("## Nearest 3 neighbours")
        solara.DataFrame(df_nearest)
    else:
        solara.Info("Click to select a point")

In [None]:
## solara: skip
Page()

## Make a reusable component

Our main `Page` component is now getting complex, and is not reusable. We will now create a new `FindNearestNeighbours` component that takes over the work of finding the nearest neighbours and displaying them.

Our `FindNearestNeighbours` should take as arguments:

  * `df` - A dataframe.
  * `x` - The initial column name for the x axis.
  * `y` - The initial column name for the y axis.
  * `color` - The column name for the color.
  * `n` - The number of nearest neighbours to find.
  * `on_clicked_row` - A callback for when we click on a row.

This way our top-level `Page` component can create two `FindNearestNeighbours` elements, each working on different data. Using the `on_clicked_row` callback, we can get data from our child component into our parent component.
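The optional-callback pattern itself is independent of Solara. A minimal sketch in plain Python, with the hypothetical names `make_click_handler` and `received`, shows a child that reports to its parent only when a callback was supplied:

```python
from typing import Callable, Optional


def make_click_handler(on_clicked_row: Optional[Callable[[int], None]] = None):
    # stands in for the child component: it builds its own on_click handler
    def on_click(row_index: int) -> None:
        # ... the child would update its own state here ...
        # bubble up the row index only if the parent asked for it
        if on_clicked_row is not None:
            on_clicked_row(row_index)

    return on_click


received = []  # stands in for the parent's state

with_callback = make_click_handler(on_clicked_row=received.append)
without_callback = make_click_handler()

with_callback(7)     # the parent is notified
without_callback(7)  # nothing happens, and no error is raised
print(received)  # [7]
```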


In [None]:
from typing import Callable


@solara.component
def FindNearestNeighbours(df, x, y, color=None, n=3, on_clicked_row: Callable[[int], None] = None):
    x, set_x = solara.use_state(x)
    y, set_y = solara.use_state(y)
    clicked_row, set_clicked_row = solara.use_state(None)

    # instead of doing this globally, we do it in the component
    # since the dataframe is now passed in as an argument
    columns = list(df.columns)

    fig = px.scatter(df, x, y, color=color, custom_data=[df.index])

    if clicked_row is not None:
        # add an indicator 
        click_x = df[x].values[clicked_row]
        click_y = df[y].values[clicked_row]
        fig.add_trace(px.scatter(x=[click_x], y=[click_y], text=["⭐️"]).data[0])
        # use the n argument instead of a hard-coded number of neighbours
        df_nearest = find_nearest_neighbours(df, x, y, click_x, click_y, n=n)

    def on_click(click_data):
        # sanity checks
        assert click_data['event_type'] == "plotly_click"
        row_index = find_row_index(fig, click_data)
        set_clicked_row(row_index)
        # bubble up the row index using an event
        if on_clicked_row is not None:
            on_clicked_row(row_index)

    solara.FigurePlotly(fig, on_click=on_click)
    solara.Select(label="X-axis", value=x, values=columns, on_value=set_x)
    solara.Select(label="Y-axis", value=y, values=columns, on_value=set_y)
    if clicked_row is not None:
        solara.Markdown(f"## Nearest {n} neighbours")
        solara.DataFrame(df_nearest)
    else:
        solara.Info("Click to select a point")

Putting it all together, we now create an application with two `FindNearestNeighbours` components, each working on a different dataset.

In [None]:


df_iris = px.data.iris()
df_gapminder = px.data.gapminder()


@solara.component
def Page():
    clicked_row_gapminder, set_clicked_row_gapminder = solara.use_state(None)

    with solara.ColumnsResponsive():
        with solara.Card("Iris"):
            FindNearestNeighbours(df_iris, "sepal_length", "sepal_width", color="species")

        title = "Gapminder"
        if clicked_row_gapminder is not None:
            title += f" (clicked on {clicked_row_gapminder})"
        with solara.Card(title):
            FindNearestNeighbours(df_gapminder, "gdpPercap", "lifeExp", color="continent", on_clicked_row=set_clicked_row_gapminder)

In [None]:
## solara: skip
Page()

We only respond to the `on_clicked_row` for the second component, showing that this argument is optional. 

