# Build a model to predict the impact of weather on urban air quality using Amazon SageMaker

**Importing libraries**

In [12]:
%matplotlib inline
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

**Loading prepared data into the Amazon SageMaker notebook from S3**

In [13]:
from datetime import datetime
from dateutil.relativedelta import relativedelta

#Air Pollution Data
nox_df = pd.read_csv('https://s3.amazonaws.com/aws-machine-learning-blog/artifacts/air-quality/Dublin_Rathmines_NOx_2011_2016_ugm3_daily.csv')


# Print some records in the nox_df
nox_df

Unnamed: 0,Daily_Avg,AvgOfNOx,AvgOfNO,AvgOfNO2
0,01-Jan-11,26.37,2.42,22.67
1,02-Jan-11,40.04,4.70,32.86
2,03-Jan-11,37.46,3.82,31.62
3,04-Jan-11,15.76,1.77,13.05
4,05-Jan-11,29.40,3.09,24.67
...,...,...,...,...
2187,27-Dec-16,75.87,22.23,41.89
2188,28-Dec-16,145.14,64.27,46.93
2189,29-Dec-16,63.89,22.56,29.44
2190,30-Dec-16,9.95,1.41,7.78


# **Exploratory data analysis (EDA) – Data cleaning and exploration**

Cleaning the data:

In [14]:
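# dropna returns a new DataFrame by default, so assign the result back
nox_df = nox_df.dropna(axis=0, how="any")
# Dates look like "01-Jan-11", so parse them with an explicit format
nox_df["Daily_Avg"] = pd.to_datetime(nox_df["Daily_Avg"], format="%d-%b-%y")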



# One boolean mask per calendar year (ISO dates avoid day/month ambiguity)
conditions = [
    (nox_df['Daily_Avg'] >= '2011-01-01') & (nox_df['Daily_Avg'] <= '2011-12-31'),
    (nox_df['Daily_Avg'] >= '2012-01-01') & (nox_df['Daily_Avg'] <= '2012-12-31'),
    (nox_df['Daily_Avg'] >= '2013-01-01') & (nox_df['Daily_Avg'] <= '2013-12-31'),
    (nox_df['Daily_Avg'] >= '2014-01-01') & (nox_df['Daily_Avg'] <= '2014-12-31'),
    (nox_df['Daily_Avg'] >= '2015-01-01') & (nox_df['Daily_Avg'] <= '2015-12-31'),
    (nox_df['Daily_Avg'] >= '2016-01-01') & (nox_df['Daily_Avg'] <= '2016-12-31'),
    (nox_df['Daily_Avg'] >= '2017-01-01') & (nox_df['Daily_Avg'] <= '2017-12-31')
]

# Year labels, in the same order as the masks above
choices = ['2011','2012','2013','2014','2015','2016','2017']
# The default is a string so that it shares a dtype with the choices
nox_df['Year'] = np.select(conditions, choices, default='unknown')

nox_df

Unnamed: 0,Daily_Avg,AvgOfNOx,AvgOfNO,AvgOfNO2,Year
0,2011-01-01,26.37,2.42,22.67,2011
1,2011-01-02,40.04,4.70,32.86,2011
2,2011-01-03,37.46,3.82,31.62,2011
3,2011-01-04,15.76,1.77,13.05,2011
4,2011-01-05,29.40,3.09,24.67,2011
...,...,...,...,...,...
2187,2016-12-27,75.87,22.23,41.89,2016
2188,2016-12-28,145.14,64.27,46.93,2016
2189,2016-12-29,63.89,22.56,29.44,2016
2190,2016-12-30,9.95,1.41,7.78,2016
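
The cleaning steps in this cell can be sketched on a small sample. The block below is an illustration, not the notebook's own code: it reuses a few rows from the table above, and the missing NO2 value on 03-Jan-11 is hypothetical, inserted only so that `dropna` has something to remove. It assumes an English locale for the `%b` month abbreviation, and it derives the year directly from the parsed date as a shorter alternative to building one condition per year.

```python
import numpy as np
import pandas as pd

# A few rows from the table above; the NaN is hypothetical and exists
# only to exercise dropna.
sample = pd.DataFrame({
    "Daily_Avg": ["01-Jan-11", "02-Jan-11", "03-Jan-11", "27-Dec-16"],
    "AvgOfNO2": [22.67, 32.86, np.nan, 41.89],
})

# dropna returns a new DataFrame by default, so the result is assigned.
clean = sample.dropna(axis=0, how="any").copy()

# An explicit format removes any day/month ambiguity when parsing.
clean["Daily_Avg"] = pd.to_datetime(clean["Daily_Avg"], format="%d-%b-%y")

# The year can be read straight from the datetime column.
clean["Year"] = clean["Daily_Avg"].dt.year.astype(str)

print(clean)
```

Reading the year through the `.dt` accessor covers any year present in the data, so the list of conditions does not need to be extended when new years are added.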


In [15]:
import plotly.express as px

fig = px.line(nox_df, 
              x='Daily_Avg', 
              y='AvgOfNO2',
              title="Daily average NO2 concentration by date")

fig.update_traces(line_color='#138D75', 
                  line_width=1)

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="NO2 concentration (µg/m³)",
)
fig.show()

**Visualizing some key insights of the data**

All visualizations were produced within Amazon SageMaker using open source Python plotting libraries; the cells shown here use Plotly Express.

The first visualization is a set of box plots of NO2 concentrations, one for each year from 2011 to 2016. A box plot is a compact way to show the spread of the data. In this plot:

- The box spans Q1 to Q3, so its height is the interquartile range (IQR = Q3 – Q1). Q1 (the first quartile) is the middle value between the lowest value and the median; Q3 (the third quartile) is the middle value between the highest value and the median.
- The bar across the box is the median.
- The whiskers extend to the most extreme data points that lie within 1.5 × IQR of the edges of the box.
- The dots beyond the whiskers are outliers.
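
To make the box plot quantities concrete, here is a minimal sketch that computes them by hand with NumPy. The values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the Rathmines dataset. Plotting libraries may use a slightly different interpolation rule for quartiles, so the numbers on a rendered plot can differ marginally from these.

```python
import numpy as np

# Hypothetical NO2 values in ug/m3, used only to illustrate the arithmetic.
values = np.array([8.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 70.0])

# Quartiles and the interquartile range (the height of the box).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whiskers stop at actual data points (8 and 30 here), not at the fences themselves (0 and 40).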

In [16]:
import plotly.express as px

fig = px.box(nox_df, x="Year", y="AvgOfNO2")
fig.show()