## Introduction

Humanity faces new challenges again and again. Wars, epidemics, and resource scarcity seem to cycle around without ever going away. And no matter which issue is first on our radar at any given moment, there is one thing we have to appreciate and protect first: air.

The dataset gives us some metrics describing how gas concentrations change over time. Let us begin exploring the story it might tell.

In [None]:
!pip install -q plotly==4.9.0
!pip install -q "notebook>=5.3" "ipywidgets>=7.2"

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import plotly.express as px
import seaborn as sns
init_notebook_mode(connected=True)

In [None]:
root_dir = '../input/silkboard-bangalore-ambient-air-covid19lockdown'
files = list(os.listdir(root_dir))
dfs = {}
for file in files:
    f_name = os.path.join(root_dir, file)
    dfs[file.split('_')[0]] = pd.read_csv(f_name)

In [None]:
for name, df in dfs.items():
    print(name)
    print('Columns:', *list(df.columns))
    print('NaNs:', df.isna().sum().sum())
    print('-'*10)

In [None]:
for name, df in dfs.items():
    print(name)
    print('country:', df.country.unique())
    print('city:', df.city.unique())
    print('locations:', df.location.unique())
    print('coordinates:', df.latitude.unique(), df.longitude.unique())

We have a single location in the dataset, as expected. The location columns are therefore constant, which lets us narrow down the values we need for exploration.
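Constant columns can also be detected programmatically instead of by reading the printed output. The sketch below is only an illustration: the toy frame and its values are made up, and `constant_cols` is a hypothetical name that does not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy frame standing in for one of the per-gas tables (all values are made up).
toy = pd.DataFrame({
    'country': ['IN', 'IN', 'IN'],
    'city': ['Bengaluru', 'Bengaluru', 'Bengaluru'],
    'value': [0.4, 0.7, 0.5],
})

# A column with a single distinct value carries no information for the analysis.
constant_cols = [col for col in toy.columns
                 if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```

On the real data, a list built this way could be compared against the columns that are dropped by hand.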

In [None]:
for name, df in dfs.items():
    print(name, '\n', 
          'maximum:', df.value.max(),
          'minimum:', df.value.min(),
          'mean:', df.value.mean(),
          'median:', df.value.median()
         )

In [None]:
to_drop =['country', 'city', 
          'location', 'latitude', 
          'longitude']

for name, df in dfs.items():
    dfs[name] = df.drop(to_drop, axis=1)

In [None]:
def plot_dfs(x, y, title=None, transform=None, vis_type='scatter'):
    """Plot column `y` against `x` for every frame in `dfs` on a 2x3 grid."""
    if title is None:
        title = f'{x} {y}'
    n_cols = 3
    fig = make_subplots(rows=2, cols=n_cols, subplot_titles=tuple(dfs.keys()))
    fig.update_layout(showlegend=False, title=title)
    if vis_type != 'scatter':
        fig.update_layout(barmode='stack')
    # subplot_titles are assigned row by row, so fill the grid in the same
    # order; otherwise the titles end up above the wrong traces
    for idx, df in enumerate(dfs.values()):
        row, col = divmod(idx, n_cols)
        if vis_type == 'scatter':
            y_values = df[y] if transform is None else transform(df[y])
            trace = go.Scatter(x=df[x], y=y_values)
        else:
            trace = go.Histogram(x=df[x], histnorm='percent')
        fig.add_trace(trace, row=row + 1, col=col + 1)

    return fig

fig = plot_dfs('utc', 'value', title='Gases concentration values')
iplot(fig)

It does seem that the lockdown helped ease the concentration of some pollutants, such as CO and NO2. But that alone is not sufficient analysis to my eye. Let's dig deeper.

In [None]:
fig = plot_dfs('utc', 'value', 
               title='Gases concentration percent change',
               transform=pd.Series.pct_change)
iplot(fig)
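For reference, the percent change plotted here is the relative difference between each value and the one before it, $(v_t - v_{t-1}) / v_{t-1}$. A minimal sketch on made-up numbers (`values`, `manual` and `builtin` are illustrative names) shows that `pct_change` matches the manual formula:

```python
import pandas as pd

# Made-up concentration readings, not taken from the dataset.
values = pd.Series([10.0, 12.0, 9.0])

# Manual formula: current value divided by the previous one, minus 1.
manual = values / values.shift(1) - 1

# Built-in equivalent used in the notebook.
builtin = values.pct_change()

# The first entry is NaN because it has no predecessor.
print(pd.DataFrame({'manual': manual, 'builtin': builtin}))
```

Note that the result is a fraction rather than a percentage: a rise from 10 to 12 appears as roughly 0.2, not 20.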

Let us check whether the measurement units are the same. Of course, comparing so2 emissions to the concentration of o3 is not an apples-to-apples comparison, but it is still useful to know the scale.

In [None]:
for df in dfs.values():
    print(df.unit.unique())

All the scales seem intact, so it is time to compare all the concentrations together. Let's merge the known data.

In [None]:
for name, df in dfs.items():
    dfs[name] = df[['value', 'utc']].rename(columns={'value': name})
    

gases_df = pd.merge(dfs['co'], dfs['no2'], 
                    on='utc', how='outer')

gases_df = pd.merge(gases_df, dfs['o3'], 
                    on='utc', how='outer')
gases_df = pd.merge(gases_df, dfs['pm10'], 
                    on='utc', how='outer')
gases_df = pd.merge(gases_df, dfs['pm25'], 
                    on='utc', how='outer')
gases_df = pd.merge(gases_df, dfs['so2'], 
                    on='utc', how='outer')
gases_df

In [None]:
gases_df.isna().sum()

As we can see here, so2 measurements seem to have been collected less frequently. We would not expect all measurements to be taken at precisely the same time, but judging by the timestamps in the dataset, many of them were made nearly simultaneously.
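To see why a less frequently sampled gas shows up as missing values, here is a small sketch of an outer merge on two made-up frames. The timestamps and readings are invented for illustration, and `co_toy`, `so2_toy`, `merged`, `missing` and `complete` are hypothetical names.

```python
import pandas as pd

# Two toy tables sharing a 'utc' key; so2 is sampled less often than co.
co_toy = pd.DataFrame({
    'utc': ['2020-03-01T00:00:00Z', '2020-03-01T01:00:00Z', '2020-03-01T02:00:00Z'],
    'co': [0.5, 0.6, 0.4],
})
so2_toy = pd.DataFrame({
    'utc': ['2020-03-01T00:00:00Z', '2020-03-01T02:00:00Z'],
    'so2': [4.0, 5.0],
})

# An outer merge keeps every timestamp and fills the gaps with NaN.
merged = pd.merge(co_toy, so2_toy, on='utc', how='outer')
missing = merged.isna().sum()

# Keeping only rows where every gas was measured discards the gaps.
complete = merged.dropna(how='any')
print(missing)
print(complete)
```

The trade-off is that dropping incomplete rows shrinks the data to the timestamps shared by every gas, so the sparsest series sets the size of the result.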

In [None]:
df = gases_df.dropna(how='any')

In [None]:
fig = go.Figure()
for gas in dfs.keys():
    fig.add_trace(go.Scatter(x=df.utc,
                             y=df[gas],
                             mode='lines+markers',
                             name=gas))
iplot(fig)

In [None]:
for gas in dfs.keys():
    gases_df[gas+'_pct'] = gases_df[gas].pct_change()

In [None]:
gases_df

In [None]:
df = gases_df.dropna(how='any')

In [None]:
fig = go.Figure()
for gas in dfs.keys():
    fig.add_trace(go.Scatter(x=df.utc,
                             y=df[gas+'_pct'],
                             mode='lines+markers',
                             name=gas))
iplot(fig)

In [None]:
sns.heatmap(gases_df[list(dfs.keys())].corr());

Notice the correlations here: o3 seems to be negatively correlated with so2 concentration, while co correlates with no2.
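As a reminder of how to read such a matrix, the sketch below computes Pearson correlations with `DataFrame.corr` on made-up columns (`a`, `b`, `c` are illustrative names, not gases from the dataset). A value near 1 means the two columns rise together, and a value near -1 means one falls as the other rises.

```python
import pandas as pd

# Made-up columns with a known relationship to each other.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],  # rises with a
    'c': [8.0, 6.0, 4.0, 2.0],  # falls as a rises
})

# Pearson correlation is the default method of DataFrame.corr.
corr = toy.corr()
print(corr.round(2))
```

Passing `annot=True` to `sns.heatmap` prints these coefficients inside the cells, which makes the colours easier to interpret.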

So we are seeing spikes in some concentrations. There are periods where concentrations seem to drop, but overall we cannot conclude that the trends truly dropped. If we extrapolate from the data before the lockdown, we might even conclude that the quarantine made concentrations more volatile, with noticeable spikes in March. The good news, though, is that co in particular seems to drop, with nothing suggesting the trend will rise back.
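One possible way to put a number on "more volatile" is a rolling standard deviation. This is not something the notebook computes; the sketch below is an assumption about how it could be checked, using a made-up series and an arbitrary window of 4 observations.

```python
import pandas as pd

# Made-up series: a calm stretch followed by a spiky one.
series = pd.Series([1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 6.0, 1.0])

# Standard deviation over a sliding window of 4 observations (arbitrary choice).
# The first three entries are NaN because the window is not yet full.
rolling_std = series.rolling(window=4).std()
print(rolling_std)
```

Applied to a gas column sorted by `utc`, a higher rolling standard deviation after the lockdown date than before it would support the volatility reading, while a flat one would not.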