**The following guidelines set expectations for participant behaviour during workshop activities. They also ensure that the class environment is welcoming, inclusive, and respectful.**
- Where a discussion is taking place, allow everyone a chance to speak
- Listen respectfully, without interrupting and with an open mind to understanding others' views
- Be professional and productive, and always share your ideas: your opinion matters, and we can all learn something from each other
- Personal information that comes up in the conversation should be kept confidential
- Avoid inflammatory language
- Avoid assumptions about any member of the class or generalisations about social groups

**You have now joined a group of fellow analysts in a workshop:**
#### Outcomes:

- To be able to answer questions that test your knowledge of interactive data visualisation with the Bokeh library


#### Note:
- We understand learners will progress at their own speed
- Tutors will be on hand to answer questions

# Interactive Data Visualisation with Bokeh

### Setting up the environment

- Make sure you run the following code cell before you attempt any of the questions.

In [1]:
from bokeh.plotting import figure, output_notebook, show, save
from bokeh.models import ColumnDataSource, HoverTool

output_notebook()  # specify that Bokeh plots should be embedded within the Jupyter notebook

import pandas as pd

import pandas_bokeh
pd.set_option('plotting.backend', 'pandas_bokeh')

import warnings
warnings.filterwarnings('ignore')

### Importing Data

Let's load the dataset `studentsperformance.csv` with `pandas`. The dataset concerns students' performance in exams. It contains 10 observations with 8 attributes.

In [2]:
df = pd.read_csv('data/studentsperformance.csv')
df.head(10)

In [3]:
#df.info()

### Plotting with Bokeh

**Q1)** Create a Bokeh plot to analyse the relationship or the correlation between `reading_score` and `math_score`. 

- Create a figure object with `width=600`, `height=400` parameters and assign it to the variable `p1`.
- Plot `circle` glyph with the data, and set `x`, `y`, `source`, `size`, `alpha`, and `color` parameters.
- The plot should display the correlation between `reading_score` and `math_score`
- Select suitable `size`, `alpha`, and `color` parameters
- Use the given `data1` object as the data `source`

```python
# Syntax:
figure_name = figure(width=..., height=...)
figure_name.circle(x=..., y=...)
```

- Make sure to give a suitable title and axis labels for the plot.

```python
# Suggestion:
figure_name.title.text = 'title'
figure_name.xaxis.axis_label = 'label'
figure_name.yaxis.axis_label = 'label'
```

- Finally, use `show(p1)` to display the plot in the notebook.

For more info, see [figure object and glyphs](https://docs.bokeh.org/en/2.4.3/docs/reference/plotting/figure.html#figure)

In [4]:
data1 = ColumnDataSource(df)

#add your code below

p1 = figure(width=600, height=400)
p1.circle(x='reading_score', y='math_score', source=data1, size=10, alpha=0.5, color="blue")
p1.title.text = 'Correlation Analysis'
p1.xaxis.axis_label = 'Reading Score'
p1.yaxis.axis_label = 'Math Score'
show(p1)
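
The setup cell imports `HoverTool`, so hover tooltips are a natural next step for a scatter plot like this one. The block below is a minimal sketch rather than part of the exercise: the scores are made-up values, the names `source`, `p`, and `hover` are illustrative, and it uses `scatter` (which draws circular markers by default) instead of `circle`.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical scores, for illustration only (not taken from the dataset)
source = ColumnDataSource(data={
    'reading_score': [72, 90, 95],
    'math_score': [72, 69, 90],
})

p = figure(width=600, height=400, title='Hover sketch')
p.scatter(x='reading_score', y='math_score', source=source, size=10, alpha=0.5)

# '@name' in a tooltip refers to a column of the data source
hover = HoverTool(tooltips=[('Reading', '@reading_score'), ('Math', '@math_score')])
p.add_tools(hover)

# show(p) would render the plot in the notebook
```

The same `add_tools` call should work on `p1`, since `data1` exposes the DataFrame columns under the same names.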



**Q2)** Create a Bokeh plot to analyse the `mean_math_score` for various categories of `parent_level_of_education`. This data is processed and is available for you within the `info` dataframe.

- Create a figure object with `width=600`, `height=400`, and `x_range=info['parent_level_of_education']` parameters and assign it to the variable `p2`
- Plot `vbar` glyph with the data, and set `x`, `top`, `source`, `width` parameters
- Use `parent_level_of_education` for `x`, `mean_math_score` for `top`, and any value between 0-1 for `width` 
- Use the given `data2` object as the data `source`

```python
# Syntax:
figure_name = figure(width=..., height=..., x_range=...)
figure_name.vbar(x=..., top=...)
```

- Make sure to give a suitable title and axis labels for the plot.

```python
# Suggestion:
figure_name.title.text = 'title'
figure_name.xaxis.axis_label = 'label'
figure_name.yaxis.axis_label = 'label'
```
- Use `show(p2)` to display the plot in the notebook
- Make sure you run the following code cell before you attempt the question.

In [5]:
# First we define a list of columns to keep, 
# as computing the average only makes sense for numerical values (here the scores)
cols_to_keep = ['math_score', 'reading_score', 'writing_score']
info = df.groupby('parent_level_of_education')[cols_to_keep].mean().add_prefix('mean_').reset_index()
info.head()
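
To see what the `groupby(...).mean().add_prefix(...).reset_index()` chain produces, here is a minimal sketch on a made-up four-row DataFrame. The values and the names `toy` and `toy_info` are illustrative assumptions and are not taken from `studentsperformance.csv`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'parent_level_of_education': ['high school', 'high school', 'college', 'college'],
    'math_score': [60, 70, 80, 90],
    'reading_score': [65, 75, 85, 95],
})

toy_info = (
    toy.groupby('parent_level_of_education')[['math_score', 'reading_score']]
       .mean()                # average each score within each category
       .add_prefix('mean_')   # math_score -> mean_math_score, and so on
       .reset_index()         # turn the group key back into an ordinary column
)
print(toy_info)
```

Because `reset_index()` makes `parent_level_of_education` a regular column again, it can be referenced by name when the result is wrapped in a `ColumnDataSource`.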

In [6]:
data2 = ColumnDataSource(info)

#add your code below

p2 = figure(width=600, height=400, x_range=info['parent_level_of_education'])
p2.vbar(x='parent_level_of_education', top='mean_math_score', source=data2, width=0.9)

p2.title.text = 'Impact of parental education on math score'
p2.xaxis.axis_label = 'Parent level of education'
p2.yaxis.axis_label = 'Math Score'

show(p2)



**Example1:** You have been given the code below, which creates a `Panel` object for each `figure` you created above (`p1` and `p2`). The `child` argument specifies the figure, and `title` specifies a suitable title for each tab.

The `Tabs` object combines the `Panel` objects via its `tabs` argument, and `show()` then displays the result.

To find out more, see the documentation on tabs and other [widgets](https://docs.bokeh.org/en/latest/docs/user_guide/interaction/widgets.html)

In [7]:
from bokeh.models.widgets import Tabs, Panel

panel_corr = Panel(child=p1, title='Plot1')
panel_bar = Panel(child=p2, title='Plot2')

both = Tabs(tabs=[panel_corr, panel_bar])
show(both)

### Plotting with Pandas-Bokeh
[Pandas-Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh) is a library which simplifies the creation of Bokeh plots when using Pandas DataFrames as the data source.

### Importing Data
Let's load the dataset that gives details of 350,000+ domestic commercial flights in the USA from 1990 - 2009.  

It's important to understand that **much of the work required for visualisation comes in the processing of the data** (even when the data is as clean and tidy as the file we will be using below), so we will walk through this together. 

Take a moment to run the code cells below and understand what's going on. We are loading a comma-separated values file, adding column headers, and then using Pandas `datetime` methods to extract the `year` and `month` for each row.  

More information about the dataset can be found in the `.yaml` file in the `data` folder.

In [8]:
df = pd.read_csv('data/flights.csv',
                 names=['Origin', 'Destination','Origin_City', 'Destination_City', 
                        'Passengers', 'Seats', 'Flights', 'Distance','Date', 
                        'Origin_City_Popn','Destination_City_Popn'])

df.head()

In [9]:
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df.head()
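
As a small illustration of the date handling above, the sketch below parses two made-up `YYYYMM` strings with the same `format='%Y%m'` and extracts the year and month. The sample values and the names `dates` and `parsed` are assumptions for illustration; how the `Date` column is actually stored in `flights.csv` is not shown here.

```python
import pandas as pd

# Hypothetical YYYYMM values, for illustration only
dates = pd.Series(['199001', '200912'])

# '%Y%m' reads four digits of year followed by two digits of month;
# the day defaults to the first of the month
parsed = pd.to_datetime(dates, format='%Y%m')

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```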

We will now calculate some further columns derived from columns of interest, and check that our extended DataFrame looks as expected:

In [10]:
df['Empty_Seats'] = df['Seats'] - df['Passengers']
df['Spare_Capacity_%'] = (1 - df['Passengers'] / df['Seats']) * 100
df['Passenger_Miles'] = df['Passengers'] * df['Distance']
df.head()
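
The three derived columns can be checked by hand on a tiny example. The sketch below uses two made-up rows (the values and the name `toy` are assumptions): an ordinary route, and one with zero seats to show that the percentage becomes `NaN` when `Seats` is 0. Whether `flights.csv` contains such rows is not stated here.

```python
import pandas as pd

# Hypothetical rows: one ordinary route and one with no seats recorded
toy = pd.DataFrame({
    'Passengers': [150, 0],
    'Seats': [200, 0],
    'Distance': [1000, 500],
})

toy['Empty_Seats'] = toy['Seats'] - toy['Passengers']
toy['Spare_Capacity_%'] = (1 - toy['Passengers'] / toy['Seats']) * 100  # 0 / 0 gives NaN
toy['Passenger_Miles'] = toy['Passengers'] * toy['Distance']
print(toy)
```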

We would like to look at the progression over time of the volume of flights and passengers from a specific airport. There are a number of approaches that could be taken to achieve this, but the use of `.groupby` is shown here. Note how this creates a DataFrame with a `MultiIndex`.

Here we will use `.sum()`. Note that this operation only makes sense for numerical columns. We also need to exclude columns such as `Month` or `Spare_Capacity_%` for which a sum of all values does not make sense.

Let's first create a list of columns to keep:

In [11]:
columns_to_keep = ['Passengers', 'Passenger_Miles','Flights', 'Seats', 'Empty_Seats']
org = df.groupby(['Origin', 'Year'])[columns_to_keep].sum()
org.head()

We'll now extract only the rows for a specific airport. Note the assignment of the airport code to a variable `airport`; this should make things easier if we subsequently want to re-use our code for analysis of a different airport, perhaps within a function.

In [12]:
airport = 'JFK'
ap = org.loc[airport]
ap.head()
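
The following sketch shows, on made-up records, how grouping by two keys produces a `MultiIndex` and how `.loc` with a single airport code leaves `Year` as the index. The data and the names `toy`, `toy_org`, and `toy_ap` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy flight records, for illustration only
toy = pd.DataFrame({
    'Origin': ['JFK', 'JFK', 'JFK', 'LAX'],
    'Year': [1990, 1990, 1991, 1990],
    'Passengers': [100, 50, 200, 80],
})

toy_org = toy.groupby(['Origin', 'Year'])[['Passengers']].sum()
print(toy_org.index.names)   # the two group keys become the MultiIndex levels

toy_ap = toy_org.loc['JFK']  # selecting on the first level leaves Year as the index
print(toy_ap)
```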

We can see that the values in each column are of quite different magnitudes. To see the progression over time relative to one another, let's use 1990 as the base year for comparison and divide every row by the 1990 values:

In [13]:
base = ap.iloc[0]
ap_90 = ap / base
ap_90.head()
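
Since the airport code is held in a variable, the selection and base-year steps could be wrapped in a function for re-use. The block below is one possible sketch, not the workshop's own code: the function name `index_to_first_year` and the toy data are hypothetical, and it assumes a DataFrame indexed by (`Origin`, `Year`) like `org`.

```python
import pandas as pd

def index_to_first_year(org, airport):
    """Return one airport's yearly totals divided by its first-year totals.

    `org` is assumed to be indexed by (Origin, Year); the name is hypothetical.
    """
    ap = org.loc[airport].sort_index()  # make sure the earliest year comes first
    return ap / ap.iloc[0]              # each column is divided by its own first value

# Hypothetical toy totals, for illustration only
toy_org = pd.DataFrame(
    {'Passengers': [100, 150, 80], 'Flights': [10, 12, 8]},
    index=pd.MultiIndex.from_tuples(
        [('JFK', 1990), ('JFK', 1991), ('LAX', 1990)], names=['Origin', 'Year']
    ),
)
print(index_to_first_year(toy_org, 'JFK'))
```

Dividing a DataFrame by a row (a Series) aligns the row's labels with the column names, so every metric starts at 1.0 in the first year and later values read as multiples of that baseline.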

**Q3)** We now have a tidy DataFrame with `Year` as the index and the progression of various metrics for a given airport over the time period.   

Now use the `.plot_bokeh()` method on the `ap_90` DataFrame and see what you get:

In [14]:
#add your code below

ap_90.plot_bokeh()


Pandas-Bokeh has done a lot of the work for us. Try clicking on the different labels in the legend, and hovering over the lines.

**Example2:** You have been given the code below, which assigns the resulting Bokeh figure to a variable `fig` and then makes further customisations, such as `toolbar_location`, `legend.location`, and `plot_width`. The `show()` function will display the customised plot.


For more info, see [palettes](https://bokeh.pydata.org/en/latest/docs/reference/palettes.html) and [optional parameter values](https://github.com/PatrikHlobil/Pandas-Bokeh#lineplot)

In [15]:
fig = ap_90.plot_bokeh(colormap="Colorblind", show_figure=False)

#customise the figure
fig.plot_width=800
fig.toolbar_location="above"
fig.legend.location = "top_left"
fig.yaxis.axis_label = 'Base Year = 1990'
show(fig)

**Q4)** What do you think has happened to the metrics `Passengers`, `Passenger_Miles`,`Flights`, `Seats`, `Empty_Seats` over the years for the `JFK` airport? Note down your findings below.

In [16]:
#add your notes below

""" 
The 'Passengers', 'Passenger_Miles', 'Flights' and 'Seats' metrics show a steady increase over the years,
whereas the 'Empty_Seats' metric shows a steady decline; this might be due to the high demand for air travel.
"""
