*Climate change includes both the global warming driven by human emissions of greenhouse gases, and the resulting large-scale shifts in weather patterns. Though there have been previous periods of climatic change, since the mid-20th century the rate of human impact on Earth's climate system and the global scale of that impact have been unprecedented.* [Wikipedia](https://en.wikipedia.org/wiki/Climate_change)


The goal of this research is to analyze the rise of temperature over time in different parts of the world, using a dataset of daily temperatures in major cities of the world. The dataset was provided by the University of Dayton ([licence](https://academic.udayton.edu/kissock/http/Weather/default.htm)).

- How much has the temperature increased in different parts of the world over time?
- Which countries are seeing a rapid increase in temperature over time?
- What seasonality patterns do we have in different parts of the world? And how did those patterns change over time?

The notebook illustrates answers to these questions, using Bokeh graphs and interactive dashboards.
Actually, this notebook is a concise Bokeh interpretation of its [elder Plotly brother](https://www.kaggle.com/dunklerwald/what-s-going-on-in-ecuador-splash-of-plotly) that I published a while back. The Plotly version is a more elaborate work whereas here I just experiment with some special Bokeh features, such as tabs and shared data sources. So if you really want to know what's going on in Ecuador, you might want to check the Plotly version.

What I disliked about Bokeh is its poor documentation. Limited search, missing crosslinks, unclear and confusing structure.<br>
What I really liked about Bokeh:
- **tabs**(!) and therefore more compact output
- linked panning: several plots can share a range, so left-dragging one of them pans all of them simultaneously (I guess Plotly can also provide such a feature, but I didn't check)
- simple and fast ways to create different grids
- easier ways to manage interactions based on user input (e.g. update all graphs in a grid according to the selected value in Select widget).


Performance and internals of both frameworks are still unexplored territory for me, especially when it comes to building powerful web apps. If you have working experience with both Plotly and Bokeh, or know research articles analyzing them, please don't hesitate to share in the comments. All in all, based on what I have tested so far, Plotly is my tool of choice when it comes to EDA. No brainer.

Part 1: [What's going on in Ecuador? - splash of Plotly](https://www.kaggle.com/dunklerwald/what-s-going-on-in-ecuador-splash-of-plotly)<br>
Part 2: [What's going on in Ecuador? - breeze of Bokeh](https://www.kaggle.com/dunklerwald/what-s-going-on-in-ecuador-breeze-of-bokeh)

# Table of contents
1. [Loading necessary libraries](#1)
1. [Loading city temperatures dataset](#2)
1. [Basic stats](#3)
1. [Data cleaning and feature engineering](#4)
1. [Regional temperature dynamics: dashboard](#5)     
1. [Country temperature dynamics: dashboard](#6)     
1. [Summary](#7)    

## Loading necessary libraries <a id="1"></a>

In [None]:
import numpy as np
import pandas as pd
from math import pi
import gc

from IPython.core.display import HTML

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, CDSView, GroupFilter, BooleanFilter, CustomJS, Slider, Select, Panel, Tabs, HoverTool, Legend, FactorRange
from bokeh.layouts import row, column, grid, gridplot, layout
from bokeh.io import curdoc
from bokeh.themes import Theme
from bokeh.palettes import Category10, Bokeh
from bokeh.transform import factor_cmap

import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 300)
pd.set_option("display.max_rows", 20)

In [None]:
output_notebook()

In [None]:
# Let's define some common style settings for bokeh graphs
tools = ['pan', 'box_select', 'lasso_select', 'box_zoom',  'reset']

curdoc().theme = Theme(json={'attrs': {

    # Figure properties
    'Figure': {
        'min_border_left': 20,
        'min_border_right': 20,
        'background_fill_color': '#E6F1FC'
    },
    # Axis properties
    'Axis': {
        'minor_tick_out': None,
        'minor_tick_in': None,
        'major_tick_out': None,
        'major_tick_in': None,
        'axis_line_color':None
    },
    # Grid properties
    'Grid': {
        'grid_line_color': '#FFFFFF'       
    },
    # Title properties
    'Title': {
        'text_font_size' : '16px',
        'text_font_style' : 'normal',
        'align' : 'center'
    },
    # Legend properties
    'Legend': {
        'background_fill_alpha': 0.8,
        'location': 'top_left',
        'label_text_font_size' : '10px'
    }
}})

## Loading city temperatures dataset <a id="2"></a>

Let's load the dataset on the temperature of major cities of the world, provided by University of Dayton.

In [None]:
df = pd.read_csv('../input/daily-temperature-of-major-cities/city_temperature.csv')
print(df.shape)
df.head()

## Basic stats  <a id="3"></a>

Now we can get basic stats about columns and data demographics, such as uniqueness, missing values and zero values.

In [None]:
# Common functions for exploratory data analysis
def get_stats(df):
    """
    Function returns a dataframe with the following stats for each column of df dataframe:
    - Unique_values
    - Percentage of missing values
    - Percentage of zero values
    - Percentage of values in the biggest category
    - data type
    """
    stats = []
    for col in df.columns:
        if df[col].dtype not in ['object', 'str', 'datetime64[ns]']:
            zero_cnt = df[df[col] == 0][col].count() * 100 / df.shape[0]
        else:
            zero_cnt = 0

        stats.append((col, df[col].nunique(),
                      df[col].isnull().sum() * 100 / df.shape[0],
                      zero_cnt,
                      df[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                      df[col].dtype))

    df_stats = pd.DataFrame(stats, columns=['Feature', 'Unique_values',
                                            'Percentage of missing values',
                                            'Percentage of zero values',
                                            'Percentage of values in the biggest category',
                                            'type'])

    del stats
    gc.collect()

    return df_stats

In [None]:
get_stats(df)

## Data cleaning and feature engineering  <a id="4"></a>

In this section we will clean our dataset and generate some new features.

In [None]:
# First of all, we drop State column which is irrelevant for our analysis.
del df['State']

# We also change data types of several columns to optimize memory storage.
df['Month'] = df['Month'].astype('int8')
df['Day'] = df['Day'].astype('int8')
df['Year'] = df['Year'].astype('int16')
df['AvgTemperature'] = df['AvgTemperature'].astype('float16')

# There are several rows with Day=0. We are going to drop such rows.
print(f"There are {df[df['Day']==0].Day.count()} rows with Day=0")
df[df['Day']==0].head()

# Looking at data distribution across years, there are several obvious outliers: years 200,201 and 2020.
# Years 200 and 201 must be typos whereas year 2020 does not keep data for the whole year.
# We will drop all rows, belonging to these years. Also we will drop more than 20 thousand duplicate rows.
df = df[df['Day']!=0]
df = df[~df['Year'].isin([200,201,2020])]
df = df.drop_duplicates()

# 2.7% of rows in the dataset have an AvgTemperature value of -99.
# Most likely, -99 was used to fill missing temperature values. We are going to drop all such rows.
# Also, for the sake of simplicity, we will drop all "incomplete years": if a country has 270 or fewer rows left
# for a given year (counted across all of its cities), we drop that year as an incomplete yearly snapshot.
df = df[df['AvgTemperature']!=-99]
df['days_in_year']=df.groupby(['Country','Year'])['Day'].transform('size')
df=df[df['days_in_year']>270]

# Here we create column Date and convert AvgTemperature to Celsius scale.
df['Date'] = pd.to_datetime(df[['Year','Month', 'Day']])
df['AvgTemperature'] = (df['AvgTemperature'] -32)*(5/9)

# Also we need to fix some discrepancies in country names
code_dict = {'Czech Republic':'Czechia','Equador':'Ecuador', 'Ivory Coast':"Côte d'Ivoire",'Myanmar (Burma)':'Myanmar','Serbia-Montenegro':'Serbia', 'The Netherlands':'Netherlands'}
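# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which newer pandas versions warn about and which may not modify df at all.
df['Country'] = df['Country'].replace(code_dict)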

Now we are ready to go with our final data set.
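The three core cleaning rules from the cell above (drop the -99 sentinel, drop sparse country-years, convert Fahrenheit to Celsius) can be sanity-checked on a tiny frame. This is only a sketch: the `toy` data is made up, and `MIN_ROWS` is a hypothetical stand-in for the threshold of 270 used in the notebook.

```python
import pandas as pd

# Made-up readings in Fahrenheit; -99 marks a missing value, as in the dataset.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B"],
    "Year": [2000, 2000, 2000, 2000, 2000, 2000],
    "AvgTemperature": [50.0, -99.0, 68.0, 86.0, 32.0, -99.0],
})

MIN_ROWS = 2  # hypothetical threshold for this toy example (the notebook uses 270)

# 1. Drop rows carrying the -99 sentinel.
toy = toy[toy["AvgTemperature"] != -99].copy()

# 2. Count the remaining rows per (Country, Year) and keep only the groups
#    that have more than MIN_ROWS rows.
toy["rows_in_year"] = toy.groupby(["Country", "Year"])["AvgTemperature"].transform("size")
toy = toy[toy["rows_in_year"] > MIN_ROWS].copy()

# 3. Convert Fahrenheit to Celsius.
toy["AvgTemperature"] = (toy["AvgTemperature"] - 32) * 5 / 9

print(toy[["Country", "Year", "AvgTemperature"]])
```

Country B is left with a single valid reading, so its year 2000 is dropped entirely, while country A keeps three readings converted to 10, 20 and 30 °C. Note that the order of the steps matters: the sentinel rows have to go before the rows are counted.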

In [None]:
print(f"Final data set shape: {df.shape}")

## Regional temperature dynamics: dashboard  <a id="5"></a>

Let's look at how temperature has been changing in different regions through all the years.<br><br>
*Tab General*<br>
The left chart demonstrates the temperature rise for each region over the entire period. I calculate the rise as the difference between the exponentially smoothed average temperatures of the last and the first year of observations for each region. The chart on the right shows regional dynamics over the years.
You can see it with the naked eye: temperature has been growing across all regions. What's more, in some regions it has been growing faster than in others.<br><br>
*Regional tabs* (Africa, Asia etc.)<br>
The chart on the left shows a monthly temperature profile for this region. The tabbed charts on the right illustrate how seasonal temperature has been changing in this region.
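To make the "rise" on the General tab concrete, here is a minimal sketch of the same calculation on made-up yearly means. The region names and numbers are invented; the only things taken from the notebook are the `ewm(span=3)` smoothing and the last-minus-first difference.

```python
import pandas as pd

# Made-up yearly mean temperatures for two hypothetical regions.
yearly = pd.DataFrame({
    "Region": ["North"] * 4 + ["South"] * 4,
    "Year": [1995, 1996, 1997, 1998] * 2,
    "mean": [10.0, 12.0, 11.0, 13.0, 20.0, 20.0, 20.0, 20.0],
})

# Exponentially weighted average within each region (span=3 means alpha = 0.5).
yearly["mean_smoothed"] = (
    yearly.groupby("Region")["mean"].transform(lambda s: s.ewm(span=3).mean())
)

# Rise = smoothed value of the last year minus smoothed value of the first year.
ends = yearly.groupby("Region")["mean_smoothed"].agg(["first", "last"])
ends["rise"] = ends["last"] - ends["first"]

print(ends)
```

For "North" the raw difference between 1998 and 1995 is 3.0 °C, while the smoothed rise comes out at roughly 2.13 °C, so smoothing dampens a noisy last year. The first smoothed value always equals the raw first-year mean, which means the result is still sensitive to an unusual first year.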

In [None]:
# several mappings for seasonality charts
month_dict = {1:"January", 2:"February", 3:"March", 4:"April", 5:"May", 6:"June" ,7:"July", 8:"August", 9:"September", 10:"October", 11:"November", 12:"December"}
season_dict = {1:"Winter", 2:"Spring", 3:"Summer", 4:"Autumn"}
season_month_map = {1:1, 2:1, 3:2, 4:2, 5:2, 6:3, 7:3, 8:3, 9:4, 10:4, 11:4, 12:1}
seasons = ["Winter", "Spring", "Summer", "Autumn"]


# temperature stats, grouped by region and year 
dfr = (
       df.groupby(['Year','Region'])['AvgTemperature'].agg(['mean','min','idxmin','max','idxmax']).reset_index()
      .merge(df[['Country','City','Date']], left_on='idxmin',right_index=True)
      .merge(df[['Country','City','Date']], left_on='idxmax',right_index=True,suffixes=('_min','_max'))
      )

# average temperature, smoothed with exponential weighted average.
dfr['mean_smoothed'] = dfr.groupby(['Region'])['mean'].transform(lambda x: x.ewm(span=3).mean()).fillna(dfr['mean'])

regions = dfr['Region'].sort_values().unique().tolist()
regions_reverse = dfr['Region'].sort_values(ascending=False).unique().tolist()

# Temperature rise per region through the entire period, using exponentially smoothed average temperature
dfrs = dfr.groupby('Region')['mean_smoothed'].agg(['first','last']).reset_index()
dfrs['Temp_delta'] = dfrs['last'] - dfrs['first']
dfrs.columns=['Region','Start year temp','End year temp', 'Delta_temp']


# temperature stats, grouped by year, month, region and country 
dfmc = (
       df.groupby(['Year','Month','Region','Country'])['AvgTemperature'].agg(['mean'])
      .reset_index()
      .rename(columns={'mean': 'AvgTemperature','Month': 'Month_num'})
      .sort_values(by=['Year','Month_num','Region','Country'])
      )

dfmc['Season_num'] = dfmc['Month_num'].map(season_month_map)
dfmc['Season'] = dfmc['Season_num'].map(season_dict)
dfmc['Month'] = dfmc['Month_num'].map(month_dict)

# temperature stats, grouped by year, season, month and region 
dfmr = (
       dfmc.groupby(['Year','Season_num','Season','Month_num','Month','Region'])['AvgTemperature'].agg(['mean'])
      .reset_index()
      .rename(columns={'mean': 'AvgTemperature'})
      .sort_values(by=['Year','Month_num','Region'])
      )

# temperature stats, grouped by month and region 
dfmr_g = (
       dfmr.groupby(['Region','Month_num','Month'])['AvgTemperature'].agg(['mean'])
      .reset_index()
      .rename(columns={'mean': 'AvgTemperature'})
      .sort_values(by=['Region','Month_num'])
      )

months = dfmr_g[['Month','Month_num']].sort_values(by='Month_num')['Month'].unique().tolist()

# temperature stats, grouped by year, season and region 
dfsr = (
       dfmc.groupby(['Year','Season_num','Season','Region'])['AvgTemperature'].agg(['mean'])
      .reset_index()
      .rename(columns={'mean': 'AvgTemperature'})
      .sort_values(by=['Year','Season_num','Region'])
      )

In [None]:
plot_height = 420
plot_width = 400

#General tab
palette = Bokeh[len(dfr['Region'].unique())]

source_dfrs = ColumnDataSource(data=dfrs)

# Temperature rise per region
p_rise = figure(
                 plot_width=plot_width
                ,plot_height=plot_height
                ,tools=tools
                ,y_range=FactorRange(factors=regions_reverse)
                ,title='Temperature rise per region, °C'
                ,tooltips = [('Region','@Region'), ('Temperature rise', '@Delta_temp')])
p_rise.hbar(
             y='Region'
            ,height=0.5
            ,left=0
            ,right='Delta_temp'
            ,line_color=None
            ,fill_color=factor_cmap('Region', palette=Bokeh[len(regions)], factors=regions)
            ,source=source_dfrs)

# Temperature trend per region
p_trend = figure(
                  plot_width=plot_width
                 ,plot_height=plot_height
                 ,tools=tools
                 ,title='Temperature trend per region, °C')
p_trend.add_layout(Legend(), 'right')

for region, color in zip(regions,palette):
    source_dfr_line = ColumnDataSource(data=dfr[dfr['Region']==region])
    p_trend.line(
                  x='Year'
                 ,y='mean_smoothed'
                 ,line_width=2
                 ,line_color=color
                 ,legend_label=region  # without a label the Legend added above stays empty
                 ,source=source_dfr_line)
    
p_trend.legend.click_policy='hide'
hover = HoverTool(tooltips = [('Region','@Region'), ('Year','@Year'), ('AvgTemperature', '@mean_smoothed')])
p_trend.add_tools(hover)

tab_general = Panel(child=row(p_rise, p_trend), title='General')

In [None]:
# Regional and seasonal tabs
source_dfmr = ColumnDataSource(data=dfmr_g)
source_dfsr = ColumnDataSource(data=dfsr)

tabs = []

# create a tab for each region
for region in regions:
    
    # Average temperature per month
    region_view = CDSView(source=source_dfmr, filters=[GroupFilter(column_name='Region', group=region)])
    p_region_month = figure(
                             plot_width=plot_width
                            ,plot_height=plot_height
                            ,tools=tools
                            ,x_range=FactorRange(factors=months)
                            ,title='Average temperature per month, °C'
                            ,tooltips = [('Region','@Region'), ('Month','@Month'), ('AvgTemperature', '@AvgTemperature')])
    p_region_month.vbar(
                         x='Month'
                        ,bottom=0
                        ,top='AvgTemperature'
                        ,width = 0.8
                        ,line_color=None
                        ,fill_color=factor_cmap('Region', palette=Bokeh[len(regions)], factors=regions)
                        ,source=source_dfmr
                        ,view=region_view)
    p_region_month.xaxis.major_label_orientation = -pi/4
    
    # in each regional tab create 4 seasonal tabs (winter, spring, summer, autumn)
    season_tabs = []
    for season in seasons:
            season_view = CDSView(source=source_dfsr, filters=[GroupFilter(column_name='Region', group=region), GroupFilter(column_name='Season', group=season)])
            p_season = figure(
                             plot_width=plot_width
                            ,plot_height=plot_height-70
                            ,y_range=p_region_month.y_range
                            ,tools=tools
                            ,title='Seasonal temperature dynamics, °C'
                            ,tooltips = [('Region','@Region'), ('Season','@Season'), ('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
            p_season.vbar(
                                 x='Year'
                                ,bottom=0
                                ,top='AvgTemperature'
                                ,width = 1.0
                                ,line_color='#FFFFFF'
                                ,fill_color=factor_cmap('Season', palette=['#40E0D0','#00FF7F','#FF6347','#FFA500'], factors=seasons)
                                ,fill_alpha=0.6
                                ,source=source_dfsr
                                ,view=season_view)
            season_tabs.append(Panel(child=p_season, title=season))

    tabs.append(Panel(child=row(p_region_month, Tabs(tabs=season_tabs)), title=region))

In [None]:
# Final dashboard that combines General and all regional tabs
g = gridplot([[Tabs(tabs=[tab_general]+tabs)]])
show(g)

## Country temperature dynamics: dashboard  <a id="6"></a>

Just select a country from the list to get the 3 visualisations the dashboard provides:

- the first chart shows the average country temperature trend through all the years;
- the second chart illustrates how seasonal temperature has been changing over the years;
- the third chart compares 2 distributions for this country: temperature distribution in 1995-2014 vs temperature distribution in 2015-2019.

You can check for yourself that in many countries the temperature distribution has shifted to the right (e.g. Australia), whereas in some countries it has shifted to the left (e.g. Canada).
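The third chart is built from density histograms, one per period. The sketch below shows the idea on synthetic data: the two samples are randomly generated and do not come from the dataset, and `density_hist` is a hypothetical helper that mirrors the 25-bin `np.histogram(..., density=True)` call used in the next cells.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly averages for two periods; the later one is centred 1 °C higher.
early = rng.normal(loc=15.0, scale=5.0, size=2000)
late = rng.normal(loc=16.0, scale=5.0, size=2000)

def density_hist(values, bins=25):
    """Return bar heights plus left and right bin edges of a density histogram."""
    heights, edges = np.histogram(values, bins=bins, density=True)
    return heights, edges[:-1], edges[1:]

h_early, left_early, right_early = density_hist(early)
h_late, left_late, right_late = density_hist(late)

# With density=True the bars integrate to 1, so periods of different length are comparable.
area_early = float(np.sum(h_early * (right_early - left_early)))
shift = float(late.mean() - early.mean())

print(f"area under early histogram: {area_early:.3f}")
print(f"shift of the mean: {shift:.2f} °C")
```

Because each period is binned separately, the two sets of bars do not share bin edges. That is fine for eyeballing a shift, but the bar heights should not be compared bin by bin.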

In [None]:
# add new "period" dimension: 1995-2014 (first 20 years) and 2015-2019 (last 5 years)
dfmc['Period'] = '1995-2014'
dfmc.loc[dfmc['Year']>2014, 'Period'] = '2015-2019'

dfyc = dfmc.groupby(['Country','Year'])['AvgTemperature'].mean().reset_index()
dfycs = dfmc.groupby(['Country','Year','Season_num','Season'])['AvgTemperature'].mean().reset_index()

In [None]:
def f(x, period):
    array_hist, edges = np.histogram(x,density=True, bins=25)
    return pd.DataFrame({'period':period, 'array_hist':array_hist,'left':edges[:-1], 'right':edges[1:]})

# Select several columns with a list (double brackets); tuple-style selection was removed in pandas 2.0
dfyc_bk = dfyc.groupby('Country', sort = False)[['Year', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))

dfyc_winter_bk = dfycs[dfycs['Season']=='Winter'].groupby('Country', sort = False)[['Year', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))
dfyc_spring_bk = dfycs[dfycs['Season']=='Spring'].groupby('Country', sort = False)[['Year', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))
dfyc_summer_bk = dfycs[dfycs['Season']=='Summer'].groupby('Country', sort = False)[['Year', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))
dfyc_autumn_bk = dfycs[dfycs['Season']=='Autumn'].groupby('Country', sort = False)[['Year', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))

dfmc_bk = dfmc.groupby('Country', sort = False)[['Period', 'AvgTemperature']].apply(lambda x: x.to_dict(orient = 'list'))

dfmc_period1_bk = dfmc[dfmc['Period']=='1995-2014'].groupby('Country', sort = False)['AvgTemperature'].apply(lambda x: f(x,'1995-2014')).groupby(level=0).apply(lambda x: x.to_dict(orient = 'list'))
dfmc_period2_bk = dfmc[dfmc['Period']=='2015-2019'].groupby('Country', sort = False)['AvgTemperature'].apply(lambda x: f(x,'2015-2019')).groupby(level=0).apply(lambda x: x.to_dict(orient = 'list'))

countries = dfmc['Country'].sort_values().unique().tolist()  

In [None]:
source_gen = ColumnDataSource(data=dfyc_bk[countries[0]])
source_winter = ColumnDataSource(data=dfyc_winter_bk[countries[0]])
source_spring = ColumnDataSource(data=dfyc_spring_bk[countries[0]])
source_summer = ColumnDataSource(data=dfyc_summer_bk[countries[0]])
source_autumn = ColumnDataSource(data=dfyc_autumn_bk[countries[0]])
source_period1 = ColumnDataSource(data=dfmc_period1_bk[countries[0]])
source_period2 = ColumnDataSource(data=dfmc_period2_bk[countries[0]])

select = Select(value=countries[0], options=countries, width=200)
callback = CustomJS(
             args=dict(source_gen=source_gen, s_gen=dfyc_bk.to_dict(),
                              source_winter=source_winter, s_winter=dfyc_winter_bk.to_dict(),
                              source_spring=source_spring, s_spring=dfyc_spring_bk.to_dict(),
                              source_summer=source_summer, s_summer=dfyc_summer_bk.to_dict(),
                              source_autumn=source_autumn, s_autumn=dfyc_autumn_bk.to_dict(),
                              source_period1=source_period1, s_period1=dfmc_period1_bk.to_dict(),
                              source_period2=source_period2, s_period2=dfmc_period2_bk.to_dict()),
            code="""
                 source_gen.data = s_gen[cb_obj.value];
                 source_winter.data = s_winter[cb_obj.value];
                 source_spring.data = s_spring[cb_obj.value];
                 source_summer.data = s_summer[cb_obj.value];
                 source_autumn.data = s_autumn[cb_obj.value];
                 source_period1.data = s_period1[cb_obj.value];
                 source_period2.data = s_period2[cb_obj.value];
                 source_gen.change.emit();
                 source_winter.change.emit();
                 source_spring.change.emit();
                 source_summer.change.emit();
                 source_autumn.change.emit();
                 source_period1.change.emit();
                 source_period2.change.emit();
""")

select.js_on_change('value', callback)

plot_width = 800

# Average temperature dynamics on country level
p_gen = figure(plot_width=plot_width, plot_height=150, tools=tools, title='Average temperature dynamics on country level (1995-2019)')
p_gen.line(x='Year', y='AvgTemperature', line_width=4, source=source_gen, line_dash='dashdot', line_color='#00CC96')
hover = HoverTool(tooltips = [('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
p_gen.add_tools(hover)

# Seasonal tabs
p_winter = figure(tools=tools, plot_width=plot_width, plot_height=150, x_range=p_gen.x_range)
p_winter.line(x='Year', y='AvgTemperature', line_width=2, source=source_winter, line_color='blue')
hover = HoverTool(tooltips = [('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
p_winter.add_tools(hover)
tab_winter = Panel(child=p_winter, title='Winter')

p_spring = figure(tools=tools, plot_width=plot_width, plot_height=150, x_range=p_gen.x_range)
p_spring.line(x='Year', y='AvgTemperature', line_width=2, source=source_spring, line_color='green')
hover = HoverTool(tooltips = [('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
p_spring.add_tools(hover)
tab_spring = Panel(child=p_spring, title='Spring')

p_summer = figure(tools=tools, plot_width=plot_width, plot_height=150, x_range=p_gen.x_range)
p_summer.line(x='Year', y='AvgTemperature', line_width=2, source=source_summer, line_color='red')
hover = HoverTool(tooltips = [('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
p_summer.add_tools(hover)
tab_summer = Panel(child=p_summer, title='Summer')

p_autumn = figure(tools=tools, plot_width=plot_width, plot_height=150, x_range=p_gen.x_range)
p_autumn.line(x='Year', y='AvgTemperature', line_width=2, source=source_autumn, line_color='orange')
hover = HoverTool(tooltips = [('Year','@Year'), ('AvgTemperature', '@AvgTemperature')])
p_autumn.add_tools(hover)
tab_autumn = Panel(child=p_autumn, title='Autumn')

# Temperature distribution dynamics
p_dist = figure(plot_width=plot_width, plot_height=200, tools=tools, title='Temperature distribution dynamics: (1995-2014) vs (2015-2019)')
p_dist.quad(bottom=0, top='array_hist', left='left', right='right', source=source_period1,
            fill_color='blue', fill_alpha = 0.4, line_alpha=0, hover_fill_alpha = 1.0, hover_fill_color = 'blue', legend_label='1995-2014')
p_dist.quad(bottom=0, top='array_hist', left='left', right='right', source=source_period2,
            fill_color='red', fill_alpha = 0.4, line_alpha=0, hover_fill_alpha = 1.0, hover_fill_color = 'red', legend_label='2015-2019')
# array_hist holds densities (np.histogram with density=True), not raw counts
hover = HoverTool(tooltips = [('Period','@period'), ('AvgTemperature', '@left-@right'), ('density', '@array_hist{0.000}')])
p_dist.add_tools(hover)

g = gridplot([[select],[p_gen],[Tabs(tabs=[tab_winter, tab_spring, tab_summer, tab_autumn])],[p_dist]])
show(g)

## Summary  <a id="7"></a>

Based on the charts above, the average temperature recorded in this dataset has been rising between 1995 and 2019, and it has been changing at different rates in different regions. Having said that, missing values (AvgTemperature=-99), data gaps of various kinds and the smoothing effect of aggregation (the Ecuador case is a good example) could distort the analysis. So we should be careful not to jump to conclusions and should stay attentive to the data we have under the hood.
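As a small illustration of why the -99 placeholder matters, here is a plain-Python sketch with made-up Fahrenheit readings. It only shows how a few sentinel values can drag an average down; it is not a result from the dataset.

```python
from statistics import mean

# Made-up daily readings in Fahrenheit; -99 marks a missing value.
readings_f = [70.0, 72.0, -99.0, 71.0, -99.0, 73.0]

# Averaging everything treats the placeholder as a real temperature.
naive_mean = mean(readings_f)

# Averaging only the valid readings gives the intended value.
clean_mean = mean([x for x in readings_f if x != -99.0])

print(f"naive mean: {naive_mean:.1f} °F, clean mean: {clean_mean:.1f} °F")
```

Two placeholders out of six readings are enough to pull the average from 71.5 °F down to about 14.7 °F, which is why those rows are dropped before any aggregation.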