
# Bokeh for Time Series Analysis
<hr style="border: 2px solid black;">


<img src="./images/bokeh.png" alt="bokeh Logo" width="1000"/>
<hr style="border: 2px solid black;">

<img src="./images/bokeh_at_ag_glance.png" alt="bokeh Logo" width="1000"/>
<hr style="border: 2px solid black;">

**Introduction to Bokeh**

Bokeh is an interactive visualization library for Python that targets modern web browsers for presentation.
Unlike Matplotlib, which is primarily designed for static plots, Bokeh excels at creating
interactive plots and dashboards. It can handle large datasets and streaming data,
making it suitable for real-time applications.

**Key Features of Bokeh:**

* **Interactivity:** Built-in support for zooming, panning, hovering, and other interactive tools.
* **Web-Focused:** Generates HTML and JavaScript, making it easy to embed plots in web pages.
* **High Performance:** Can handle large datasets efficiently.
* **Versatility:** Supports a wide range of plot types (lines, bars, scatter plots, etc.).
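
As a minimal sketch of the web-focused workflow listed above, the following builds a small line plot with a few interactive tools and renders it to a standalone HTML string. The data values and the title are invented for illustration; `file_html` and `CDN` are taken from `bokeh.embed` and `bokeh.resources`, and the `width`/`height` arguments assume a reasonably recent Bokeh release.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Toy data, invented for this illustration
x = [1, 2, 3, 4, 5]
y = [112, 118, 132, 129, 121]

# Request a few of Bokeh's built-in interactive tools by name
p = figure(title="Toy example", width=400, height=300,
           tools="pan,wheel_zoom,box_zoom,reset,hover")
p.line(x, y, line_width=2)

# Render a self-contained HTML document that loads BokehJS from the CDN
html = file_html(p, CDN, "Toy example")
print(len(html) > 0)
```

The resulting string can be written to a `.html` file or embedded in a web page, which is the same output that `show()` displays inline in a notebook.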

<hr style="border: 2px solid black;">


**Documentation:**

For comprehensive documentation, please refer to the official Bokeh website: [https://docs.bokeh.org/en/latest/](https://docs.bokeh.org/en/latest/)


<hr style="border: 2px solid black;">


**Lab Exercise:**

Your task is to recreate the time series analysis lab we previously conducted using Pandas,
Matplotlib, and Seaborn, but this time, utilize the Bokeh library for visualization.
This will involve:

1.  Loading and preprocessing the "AirPassengersDates.csv" dataset.
2.  Creating interactive Bokeh plots for:
    * Time series line plots
    * Bar plots of aggregated data
    * Visualizing mean and standard deviation
    * Outlier detection
    * Resampling (upsampling and downsampling)
    * Lag analysis
    * Autocorrelation

Pay close attention to Bokeh's features for interactivity (tools, hover effects) and
its handling of data sources. Aim to replicate the insights and visualizations
from the previous lab while leveraging Bokeh's strengths.

Good luck!
<hr style="border: 2px solid black;">

# Imports and dataset loading

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from bokeh.models import BooleanFilter, CDSView, ColumnDataSource, HoverTool, Span
from bokeh.plotting import figure, output_notebook, show
from statsmodels.tsa.stattools import acf

# Folder that holds the lab datasets
DATA_PATH = Path("./datasets")
passenger_df = pd.read_csv(DATA_PATH / "AirPassengersDates.csv")
print(passenger_df.head())
print(passenger_df.dtypes)

         Date  #Passengers
0  1949-01-12          112
1  1949-02-24          118
2  1949-03-22          132
3   1949-04-5          129
4  1949-05-24          121
Date           object
#Passengers     int64
dtype: object


# Cleaning and parsing the dates

In [2]:
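# Rename the columns to drop the "#" prefix
passenger_df.columns = ["Date", "Passengers"]
# Parse the date strings, including unpadded days such as "1949-04-5"
passenger_df["Date"] = pd.to_datetime(passenger_df["Date"], format="%Y-%m-%d")

# Keep the rows in chronological order
passenger_df = passenger_df.sort_values("Date")

print(passenger_df.head())
print(passenger_df.dtypes)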

        Date  Passengers
0 1949-01-12         112
1 1949-02-24         118
2 1949-03-22         132
3 1949-04-05         129
4 1949-05-24         121
Date          datetime64[ns]
Passengers             int64
dtype: object
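
The output above shows that the unpadded day in `1949-04-5` was parsed as `1949-04-05`. As a small sketch, assuming the same `format="%Y-%m-%d"` string, the following parses two of the raw strings from the dataset preview and sorts them chronologically. The variable names are chosen for this illustration only.

```python
import pandas as pd

# Two raw strings from the dataset preview; the second has an unpadded day
raw = pd.Series(["1949-05-24", "1949-04-5"])

# Parse with the same format string as the cleaning cell
parsed = pd.to_datetime(raw, format="%Y-%m-%d")

# Sort chronologically, which only works reliably once the values are datetimes
ordered = parsed.sort_values().reset_index(drop=True)

print(ordered.dt.strftime("%Y-%m-%d").tolist())
```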


## Preparing the data for Bokeh

In [3]:
# The DataFrame was already renamed, parsed, and sorted in the previous cell,
# so it can be wrapped in a ColumnDataSource directly
source = ColumnDataSource(passenger_df)
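
To make the role of `ColumnDataSource` more concrete, here is a minimal sketch on a two-row toy frame that reuses the cleaned column names. The frame and the name `toy_source` are hypothetical; the point is only that each DataFrame column becomes a named entry that glyphs can later reference by string.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Toy frame with the same column names as the cleaned dataset
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["1949-01-12", "1949-02-24"]),
    "Passengers": [112, 118],
})

toy_source = ColumnDataSource(toy_df)

# Each DataFrame column becomes an entry in .data; the index is added as well
print(sorted(toy_source.data.keys()))
```

Glyph calls such as `p.line(x='Date', y='Passengers', source=...)` look these names up in the source, which is why several glyphs can share one source.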

# Passenger count over time

In [4]:
p = figure(x_axis_type="datetime", width=800, height=400)
p.line(x='Date', y='Passengers', source=source, line_width=2, color="navy", legend_label="Passengers")
# Show the date (formatted as YYYY-MM-DD) and the passenger count on hover
p.add_tools(HoverTool(tooltips=[("Date", "@Date{%F}"), ("Passengers", "@Passengers")], formatters={'@Date': 'datetime'}))
p.yaxis.axis_label = 'Passenger count'

# Render Bokeh output inline in the notebook
output_notebook()
show(p)

The number of passengers appears to follow seasonal patterns over time.
The peaks may correspond to high-season periods (holidays).

# Average passenger count by month

#### Preparing the data for the Bokeh bar chart

In [5]:
# Month name of each observation ('Mois' is French for month)
passenger_df['Mois'] = passenger_df['Date'].dt.month_name()

ordered_months = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November', 'December']
# Average per month name, put back in calendar order (groupby sorts alphabetically)
monthly_avg = (
    passenger_df.groupby('Mois')['Passengers']
    .mean()
    .reindex(ordered_months)
    .dropna()
)

months = monthly_avg.index.tolist()
source_histo = ColumnDataSource(data=dict(mois=months, passagers=monthly_avg.values))
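
The `reindex` step in the cell above matters because `groupby` returns its keys in sorted (alphabetical) order. A minimal sketch on a toy frame, with values and the name `calendar_order` chosen for illustration, shows the difference:

```python
import pandas as pd

# Toy observations spread over three different months
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1949-07-10", "1949-01-12", "1950-01-15", "1949-04-05"]),
    "Passengers": [148, 112, 115, 129],
})
toy["Mois"] = toy["Date"].dt.month_name()

calendar_order = ["January", "April", "July"]

# groupby sorts the month names alphabetically
alphabetical = toy.groupby("Mois")["Passengers"].mean()
# reindex puts them back in calendar order
calendar = alphabetical.reindex(calendar_order)

print(alphabetical.index.tolist())
print(calendar.index.tolist())
```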

### Creating the bar chart

In [6]:
# A categorical x_range makes Bokeh draw one bar per month name
p = figure(x_range=months, title="Average number of passengers per month", width=800, height=400)
p.vbar(x='mois',
       top='passagers',
       width=0.9,
       source=source_histo)
p.yaxis.axis_label = 'Passenger count'

show(p)

There appear to be more passengers during the summer period (July-August).

# Rolling mean and standard deviation

#### Computing the rolling mean and standard deviation

In [7]:
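# 12-observation rolling mean ('MoyenneMobile') and standard deviation ('EcartType')
passenger_df['MoyenneMobile'] = passenger_df['Passengers'].rolling(window=12).mean()
passenger_df['EcartType'] = passenger_df['Passengers'].rolling(window=12).std()
# Upper edge of the band: rolling mean plus one rolling standard deviation
passenger_df['MoyenneMobile_Sup'] = passenger_df['MoyenneMobile'] + passenger_df['EcartType']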

### Creating the plot

In [8]:
# Rebuild the source so that it includes the new rolling columns
source = ColumnDataSource(passenger_df)

p = figure(title="Rolling mean and standard deviation", x_axis_type="datetime", width=800, height=400)
p.line(x='Date', y='Passengers', source=source, color='gray', alpha=0.5, legend_label="Passengers")
p.line(x='Date', y='MoyenneMobile', source=source, color='blue', legend_label="Rolling mean")
# Shade the area between the rolling mean and the rolling mean plus one standard deviation
p.varea(x='Date', y1='MoyenneMobile', y2='MoyenneMobile_Sup',
        source=source, color='lightblue', alpha=0.4, legend_label="Standard deviation")

p.yaxis.axis_label = 'Passenger count'
p.legend.location = "top_left"

show(p)

The rolling mean smooths out short-term fluctuations and reveals the long-term trend. The shaded area, drawn from the rolling mean up to one rolling standard deviation above it, represents the variability of the data and highlights periods of high or low variation.
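
To see what the rolling statistics compute, here is a small worked sketch on five invented values with a window of 3 instead of 12. The variable names are hypothetical and only serve this illustration.

```python
import pandas as pd

# Five invented values, short enough to check by hand
values = pd.Series([10, 12, 14, 13, 15])

rolling_mean = values.rolling(window=3).mean()
rolling_std = values.rolling(window=3).std()  # sample standard deviation
upper_band = rolling_mean + rolling_std

# The first window - 1 entries are NaN because the window is not yet full
print(rolling_mean.tolist())
print(upper_band.tolist())
```

The same effect explains why the first eleven points of the 12-observation rolling mean are missing from the plot.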

# Visualizing anomalies

#### Detecting anomalies

In [9]:
# Global mean and standard deviation of the whole series
mean = passenger_df['Passengers'].mean()
std = passenger_df['Passengers'].std()
upper_bound = mean + 2 * std
lower_bound = mean - 2 * std

# Flag every value more than two standard deviations away from the mean
passenger_df['Outlier'] = (passenger_df['Passengers'] > upper_bound) | (passenger_df['Passengers'] < lower_bound)

source = ColumnDataSource(passenger_df)
outlier_mask = passenger_df['Outlier'].tolist()
# A view exposes only the flagged rows of the source to a glyph
outlier_view = CDSView(filter=BooleanFilter(outlier_mask))


### Creating the plot

In [10]:
p = figure(title="Outlier detection", x_axis_type="datetime", width=800, height=400)
p.line(x='Date', y='Passengers', source=source, color='gray', alpha=0.5, legend_label="Passengers")
# scatter() draws circle markers by default; circle() with size= is deprecated in recent Bokeh releases
p.scatter(x='Date', y='Passengers', source=source, size=8, color='red', alpha=0.6,
          legend_label="Outlier", view=outlier_view)

p.yaxis.axis_label = 'Passenger count'
p.legend.location = "top_left"

show(p)




The red points mark values that deviate significantly from the mean (by more than two standard deviations), suggesting anomalies or exceptional events.
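
The two-standard-deviation rule can be sketched on a handful of invented numbers. The array below is hypothetical; `ddof=1` is used so that NumPy computes the sample standard deviation, which is the pandas `.std()` default.

```python
import numpy as np

# Invented values: nine ordinary readings and one extreme one
values = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 40], dtype=float)

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation, as pandas .std() uses by default
upper_bound = mean + 2 * std
lower_bound = mean - 2 * std

# Boolean mask that is True only for values outside the band
outlier_mask = (values > upper_bound) | (values < lower_bound)
print(values[outlier_mask])
```

Note that a single extreme value inflates both the mean and the standard deviation, so this rule can miss outliers in short or heavily skewed series.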

# Upsampling to Daily

In [11]:
# Use the dates as the index so that resample() can work on them
passenger_df.set_index("Date", inplace=True)
passenger_df.rename(columns={"Passengers": "#Passengers"}, inplace=True)

# Upsample to one row per day, then fill the new gaps by linear interpolation
daily_passengers = passenger_df.resample('D').asfreq()
daily_passengers['#Passengers'] = daily_passengers['#Passengers'].interpolate(method='linear')
daily_passengers.reset_index(inplace=True)

source_daily = ColumnDataSource(daily_passengers)
source_original = ColumnDataSource(passenger_df.reset_index())

p = figure(title="Upsampling to Daily", x_axis_type="datetime", width=900, height=400)
p.line('Date', '#Passengers', source=source_daily, line_dash="dashed", color="green", legend_label="Daily (interpolated)")
p.line('Date', '#Passengers', source=source_original, color="blue", alpha=0.6, legend_label="Original")

p.yaxis.axis_label = "Passengers"
p.legend.location = "top_left"

show(p)
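
As a minimal sketch of what the upsampling step does, the following applies the same `resample('D').asfreq()` and linear interpolation to two invented observations three days apart. The series and its values are hypothetical.

```python
import pandas as pd

# Two invented observations, three days apart
sparse = pd.Series(
    [100.0, 130.0],
    index=pd.to_datetime(["1949-01-01", "1949-01-04"]),
)

# asfreq() inserts one row per day; the new rows start out as NaN
daily = sparse.resample("D").asfreq()

# Linear interpolation fills the gaps along a straight line
filled = daily.interpolate(method="linear")
print(filled.tolist())
```

Because the interpolated points lie on the straight segments between the original observations, the daily line and the original line overlap in the plot above.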


# Downsample to yearly

In [12]:
# Yearly mean; pandas 2.2+ prefers the alias "YE", while "Y" still works but emits a FutureWarning
yearly_passengers = passenger_df.resample("Y")["#Passengers"].mean().reset_index()
source_yearly = ColumnDataSource(yearly_passengers)

p = figure(title="Downsampling to Yearly Frequency", x_axis_type="datetime", width=900, height=400)

p.line('Date', '#Passengers', source=source_original, color="gray", alpha=0.4, legend_label="Original")

p.line('Date', '#Passengers', source=source_yearly, color="red", line_width=2, legend_label="Yearly Average")
# scatter() draws circle markers by default; circle() with size= is deprecated in recent Bokeh releases
p.scatter('Date', '#Passengers', source=source_yearly, size=8, color="red", legend_label="Yearly Average")

p.yaxis.axis_label = "Average Passengers"
p.legend.location = "top_left"

show(p)




Downsampling smooths the time series, which makes long-term trends easier to identify.
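
The yearly averaging can be sketched on four invented observations. This sketch groups by calendar year with `groupby` rather than `resample`, which is a swapped-in technique chosen here only because it does not depend on the frequency alias accepted by a given pandas version; the resulting means should be the same as a yearly `resample(...).mean()`.

```python
import pandas as pd

# Four invented observations, two per year
s = pd.Series(
    [100, 120, 200, 240],
    index=pd.to_datetime(["1949-03-01", "1949-09-01", "1950-03-01", "1950-09-01"]),
)

# Average all observations that fall in the same calendar year
yearly_mean = s.groupby(s.index.year).mean()
print(yearly_mean.to_dict())
```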

# Shift vs tShift

In [13]:
df_shift = passenger_df.reset_index()

# shift(periods=1) moves the values down by one row; the dates are unchanged
df_shift["#Passengers_Shift"] = df_shift["#Passengers"].shift(periods=1)

# shift(periods=1, freq="MS") leaves the values alone and moves each date to the
# next month start, so the shifted dates are stored in their own column
tshifted = passenger_df["#Passengers"].shift(periods=1, freq="MS")
df_shift["Date_tShift"] = tshifted.index.to_numpy()

source_shift = ColumnDataSource(df_shift)

p = figure(title="Lag: Shift vs tShift", x_axis_type="datetime", width=900, height=400)
p.line('Date', '#Passengers', source=source_shift, color="blue", legend_label="Original")
p.line('Date', '#Passengers_Shift', source=source_shift, color="orange", legend_label="Shift (values)")
p.line('Date_tShift', '#Passengers', source=source_shift, color="green", legend_label="tShift (index)")

p.yaxis.axis_label = "Passengers"
p.legend.location = "top_left"

show(p)
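
The difference between the two shift modes can be sketched on a three-point toy series. The values are the first three passenger counts from the dataset preview, but the month-start dates are invented so that the effect of `freq="MS"` is easy to read.

```python
import pandas as pd

# Three values on invented month-start dates
s = pd.Series([112, 118, 132],
              index=pd.date_range("1949-01-01", periods=3, freq="MS"))

# Without freq: the values move down one row and the index stays the same
by_values = s.shift(periods=1)

# With freq: the values stay the same and every date moves one month start forward
by_index = s.shift(periods=1, freq="MS")

print(by_values.tolist())
print(by_index.index.strftime("%Y-%m-%d").tolist())
```

A value shift introduces a leading `NaN`, while an index shift keeps every value and changes only the timestamps. Discarding the shifted index therefore returns the original values unchanged.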


# Autocorrelation

In [14]:
nlags = 30
# Autocorrelation for lags 0..nlags (lag 0 is always 1)
acf_values = acf(passenger_df["#Passengers"], nlags=nlags)
lags = np.arange(len(acf_values))

source_acf = ColumnDataSource(data=dict(
    lags=lags,
    acf=acf_values,
    zero=[0] * len(acf_values)
))

p = figure(title="Autocorrelation Function (ACF)", width=900, height=400)
# Stem plot: a vertical segment from zero to each ACF value, topped with a marker
p.segment(x0='lags', y0='zero', x1='lags', y1='acf', source=source_acf, line_color="navy", line_width=2)
p.scatter(x='lags', y='acf', source=source_acf, size=6, color="navy")

# Horizontal reference line at zero
hline = Span(location=0, dimension='width', line_color='black', line_width=1)
p.add_layout(hline)

p.xaxis.axis_label = "Lag"
p.yaxis.axis_label = "Autocorrelation"

show(p)
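
To illustrate what the ACF measures, here is a NumPy-only sketch of the standard sample autocorrelation estimator, applied to an invented sine wave with a 12-step period. The helper `sample_acf` is hypothetical and is not the statsmodels implementation, although it is meant to follow the same textbook formula; treat any agreement with `statsmodels.tsa.stattools.acf` defaults as an assumption.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # For each lag k, correlate the series with a copy of itself shifted by k
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(nlags + 1)
    ])

# Invented seasonal signal: four full cycles with a period of 12 steps
t = np.arange(48)
seasonal = np.sin(2 * np.pi * t / 12)

r = sample_acf(seasonal, nlags=12)
print(np.round(r[[0, 6, 12]], 3))
```

The autocorrelation is positive again at lag 12 (one full period) and negative at lag 6 (half a period), which is the kind of pattern to look for in a series with yearly seasonality.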


