Air quality has become a widely discussed topic, and with so many people travelling and relocating around the world it is also a major factor in choosing a place to live. Let's check the air quality around the world according to [openaq](https://openaq.org/#/?_k=nx3pph).

We will start by importing a helper package, which you can read more about in [this great tutorial](https://www.kaggle.com/rtatman/sql-scavenger-hunt-handbook/?utm_medium=email&utm_source=mailchimp&utm_campaign=sql+scav+hunt).

In [None]:
# import package with helper functions 
import bq_helper

# create a helper object for this dataset
open_aq = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                              dataset_name="openaq")

We are going to first check some basics about the data we will be handling.

In [None]:
# check tables' names
open_aq.list_tables()

In [None]:
# check head of the table
open_aq.head('global_air_quality')

In [None]:
# check the table schema
open_aq.table_schema('global_air_quality')

**How complete and reliable is our data?**

First I want to learn something about the dataset, to understand what it can tell me and to avoid drawing conclusions that go too far. Openaq has data from 8,275 locations in 64 countries - let's see which countries are best covered.

In [None]:
# check how many measuring stations there are in each country
query_1 = """SELECT country, COUNT(location) AS number_of_locations
                FROM `bigquery-public-data.openaq.global_air_quality`
                GROUP BY country
                ORDER by number_of_locations DESC"""

number_of_locations = open_aq.query_to_pandas_safe(query_1, max_gb_scanned=0.1)
number_of_locations
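
One caveat about the query above: `COUNT(location)` counts rows with a non-null location, not distinct locations. If a station reports several pollutants in separate rows (an assumption about the table layout, suggested by the `pollutant` column), it is counted once per row. The sketch below uses a few made-up rows in plain Python to show the difference; in BigQuery, `COUNT(DISTINCT location)` would count each station only once.

```python
from collections import defaultdict

# Hypothetical rows shaped like (country, location, pollutant); not real OpenAQ data.
rows = [
    ("US", "Station A", "o3"),
    ("US", "Station A", "no2"),
    ("US", "Station B", "o3"),
    ("FR", "Station C", "pm25"),
]

def count_rows_and_locations(rows):
    """Return {country: (row_count, distinct_location_count)}."""
    row_counts = defaultdict(int)
    locations = defaultdict(set)
    for country, location, _pollutant in rows:
        row_counts[country] += 1          # what COUNT(location) measures
        locations[country].add(location)  # what COUNT(DISTINCT location) measures
    return {c: (row_counts[c], len(locations[c])) for c in row_counts}

counts = count_rows_and_locations(rows)
print(counts)
```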

I can't check the actual area of each country here, which is why I'm going to export some data to my local environment and combine it with World Bank data. I put some guidelines on how to use this powerful tool in my blog post on Medium - [section Extracting GDP dataset](https://towardsdatascience.com/is-there-a-relationship-between-countries-wealth-or-spending-on-schooling-and-its-students-a9feb669be8c)

In [None]:
# save the list of countries so that their area data can be looked up in the World Bank data
number_of_locations.to_csv("number_of_stations.csv")

You can check the results in the points_density file in the data section. From the previous results we could say that USA, France, Spain and Germany have the largest number of measuring points, but when we compare the data with country areas, the coverage looks quite different. The podium belongs to Gibraltar, Hong Kong, Malta and Andorra (each has one measuring location per less than 100 square kilometres of area).
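
The coverage comparison boils down to dividing each country's area by its number of stations. Here is a minimal sketch of that step in plain Python. The country names, station counts and areas are made up for illustration, and `area_per_station` is a hypothetical helper, not the code used to produce the points_density file.

```python
# Hypothetical station counts and areas in km²; illustrative values only.
stations = {"Smallland": 4, "Bigland": 400}
area_km2 = {"Smallland": 200.0, "Bigland": 2_000_000.0}

def area_per_station(stations, area_km2):
    """Return {country: km² per station}, densest coverage first."""
    result = {}
    for country, n_stations in stations.items():
        # Skip countries without area data or without any station.
        if country in area_km2 and n_stations > 0:
            result[country] = area_km2[country] / n_stations
    return dict(sorted(result.items(), key=lambda item: item[1]))

coverage = area_per_station(stations, area_km2)
print(coverage)
```

A smaller number means denser coverage, so a small territory with only a handful of stations can rank above a large country with hundreds.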


**What are our pollutants?**

Just out of curiosity I decided to see the most popular pollutants and learn a bit about where they come from.

In [None]:
# check the kind of pollutants
query_2 = """SELECT pollutant, SUM(value) AS total_pollution
                FROM `bigquery-public-data.openaq.global_air_quality`
                GROUP BY pollutant
                ORDER by total_pollution DESC"""

pollutants = open_aq.query_to_pandas_safe(query_2, max_gb_scanned=0.1)
pollutants

It's strange, but it seems that some pollutants have negative values. Let's check what that is all about.

In [None]:
query_2a = """
            SELECT *
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE value < -900
            """
minus_values = open_aq.query_to_pandas_safe(query_2a, max_gb_scanned=0.1)
minus_values

It seems that the Netherlands has most of those large negative values, but I couldn't find any explanation for that. Maybe it's a kind of code for special measuring cases. I decided to count only measurements above 0 to make our lives easier this time.
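
As a rough illustration of that decision, here is a plain-Python sketch that keeps only readings above 0. The records are made up (the -999.0 value merely mimics the large negative readings seen above), and `keep_positive` is a hypothetical helper name.

```python
# Hypothetical measurements; not real OpenAQ data.
measurements = [
    {"country": "NL", "pollutant": "no2", "value": -999.0},
    {"country": "NL", "pollutant": "no2", "value": 21.5},
    {"country": "PL", "pollutant": "pm10", "value": 48.0},
    {"country": "PL", "pollutant": "pm10", "value": 0.0},
]

def keep_positive(records):
    """Keep only records whose value is strictly greater than 0."""
    return [r for r in records if r["value"] > 0]

valid = keep_positive(measurements)
dropped = len(measurements) - len(valid)
print(f"kept {len(valid)} readings, dropped {dropped}")
```

Note that this also drops readings of exactly 0, which matches the `value > 0` condition used in the queries in this notebook.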

Another thing is the 'unit issue'. In the previous version I mistakenly added all the values together, but they are not in the same unit. Now I will follow specific rules for converting units. If you want to learn more, there is a guideline [here](https://uk-air.defra.gov.uk/assets/documents/reports/cat06/0502160851_Conversion_Factors_Between_ppb_and.pdf). I chose only 4 pollutants, because it was easier to find the conversion details for them.
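
To make the conversion rule explicit, here is a small plain-Python sketch of the same logic that the SQL `CASE` expressions in this notebook apply. The factors are the ones used in those queries (µg/m³ per 1 ppm). Treat them as approximations, since the exact factor depends on the assumed temperature and pressure. The function names are hypothetical.

```python
UG_M3 = "µg/m³"

# µg/m³ per 1 ppm, as used in the queries in this notebook (approximate).
UG_M3_PER_PPM = {"o3": 1960.0, "no2": 1880.0, "co": 1150.0, "so2": 2620.0}

def to_ppm(pollutant, value, unit):
    """Convert a reading to ppm; return None if it cannot be converted."""
    factor = UG_M3_PER_PPM.get(pollutant)
    if factor is None or value <= 0:
        return None  # unsupported pollutant or non-positive reading
    if unit == "ppm":
        return value
    if unit == UG_M3:
        return value / factor
    return None  # unknown unit

def to_ug_m3(pollutant, value, unit):
    """Convert a reading to µg/m³; return None if it cannot be converted."""
    factor = UG_M3_PER_PPM.get(pollutant)
    if factor is None or value <= 0:
        return None
    if unit == UG_M3:
        return value
    if unit == "ppm":
        return value * factor
    return None

print(to_ppm("o3", 98.0, UG_M3))     # 98 µg/m³ of ozone expressed in ppm
print(to_ug_m3("no2", 0.5, "ppm"))   # 0.5 ppm of NO2 expressed in µg/m³
```

Returning `None` for anything outside the four supported pollutants mirrors the SQL, where a `CASE` without a matching branch yields `NULL`.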

In [None]:
query_2b = """WITH normalised_pollution AS
                (
                SELECT pollutant,
                CASE
                    WHEN unit = 'ppm' AND pollutant ='o3' AND value > 0 THEN value
                    WHEN unit = 'µg/m³' AND pollutant ='o3' AND value > 0 THEN value/1960
                    WHEN unit = 'ppm' AND pollutant ='no2' AND value > 0 THEN value
                    WHEN unit = 'µg/m³' AND pollutant ='no2' AND value > 0 THEN value/1880
                    WHEN unit = 'ppm' AND pollutant ='co' AND value > 0 THEN value
                    WHEN unit = 'µg/m³' AND pollutant ='co' AND value > 0 THEN value/1150
                    WHEN unit = 'ppm' AND pollutant ='so2' AND value > 0 THEN value
                    WHEN unit = 'µg/m³' AND pollutant ='so2' AND value > 0 THEN value/2620
                END AS ppm_unit
                FROM `bigquery-public-data.openaq.global_air_quality`
                )
                SELECT pollutant, SUM(ppm_unit) as total_ppm_pollution
                FROM normalised_pollution
                GROUP BY pollutant
                ORDER BY total_ppm_pollution DESC"""

pollutants_above_zero = open_aq.query_to_pandas_safe(query_2b, max_gb_scanned=0.1)
pollutants_above_zero

A word of explanation:
* CO  - produced by common home appliances, such as gas or oil furnaces, gas refrigerators, gas clothes dryers, gas ranges, gas water heaters or space heaters, fireplaces, charcoal grills, and wood burning stoves. 
* O3 - formed when pollutants emitted by cars, power plants, industrial boilers, refineries, chemical plants, and other sources chemically react in the presence of sunlight. 
* NO2 - part of a group of gaseous air pollutants produced as a result of road traffic and other fossil fuel combustion processes. 
* SO2 - emitted by coal-burning power plants. 
* PM10/PM2.5 - mixture of materials that can include smoke, soot, dust, salt, acids, and metals. 
* BC (black carbon) - formed through the incomplete combustion of fossil fuels, biofuel, and biomass, and is emitted in both anthropogenic and naturally occurring soot.

**Where is the best air?**

Let's try to find the cleanest countries using an average current pollution indicator, this time converting to µg/m³, as that unit seems more useful when making a classification.

In [None]:
query_4a = """WITH normalised_pollution AS
                (
                SELECT country, pollutant, location,
                    CASE
                        WHEN unit = 'ppm' AND pollutant ='o3' AND value > 0 THEN value*1960
                        WHEN unit = 'µg/m³' AND pollutant ='o3' AND value > 0 THEN value
                        WHEN unit = 'ppm' AND pollutant ='no2' AND value > 0 THEN value*1880
                        WHEN unit = 'µg/m³' AND pollutant ='no2' AND value > 0 THEN value
                        WHEN unit = 'ppm' AND pollutant ='co' AND value > 0 THEN value*1150
                        WHEN unit = 'µg/m³' AND pollutant ='co' AND value > 0 THEN value
                        WHEN unit = 'ppm' AND pollutant ='so2' AND value > 0 THEN value*2620
                        WHEN unit = 'µg/m³' AND pollutant ='so2' AND value > 0 THEN value
                END AS micro_unit_value
                FROM `bigquery-public-data.openaq.global_air_quality`
                )
                SELECT country, pollutant, (SUM(micro_unit_value)/COUNT(*)) AS current_pollution
                    FROM normalised_pollution
                    WHERE micro_unit_value > 0
                    GROUP BY country, pollutant
                    ORDER by country, pollutant """
pollution = open_aq.query_to_pandas_safe(query_4a, max_gb_scanned=0.1)
pollution.head(15)

I'm going to plot the results for one of the pollutants as an example. You can easily do the same for the others by replacing the pollutant name in the first condition with the one you are interested in. Check [the level of pollution](https://en.wikipedia.org/wiki/Air_quality_index#Definition_and_usage) for some interpretations.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

one_pollutant = pollution.loc[pollution['pollutant'] == 'o3']
ordered_pollution = one_pollutant.sort_values(by='current_pollution')
my_range=range(1,len(ordered_pollution.index)+1)

plt.figure(figsize=(15,10))
plt.hlines(y=my_range, xmin=0, xmax=ordered_pollution['current_pollution'], color='skyblue')
plt.plot(ordered_pollution['current_pollution'], my_range, "o")
plt.yticks(my_range, ordered_pollution['country'])
plt.title("Current o3 pollution in various countries", loc='left')
plt.xlabel('current pollution (µg/m³)')
plt.ylabel('Country')
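
To repeat the chart for another pollutant, the filter-and-sort step can be wrapped in a small function. The sketch below does this in plain Python on a list of dicts, such as the one you would get from `pollution.to_dict('records')`. The sample records and the name `ranking_for` are made up for illustration.

```python
def ranking_for(records, pollutant):
    """Return (country, current_pollution) pairs for one pollutant, cleanest first."""
    selected = [r for r in records if r["pollutant"] == pollutant]
    selected.sort(key=lambda r: r["current_pollution"])
    return [(r["country"], r["current_pollution"]) for r in selected]

# Hypothetical records; not real query results.
sample = [
    {"country": "AA", "pollutant": "o3", "current_pollution": 60.0},
    {"country": "BB", "pollutant": "o3", "current_pollution": 35.0},
    {"country": "AA", "pollutant": "no2", "current_pollution": 20.0},
]

o3_ranking = ranking_for(sample, "o3")
print(o3_ranking)
```

The returned pairs are already in plotting order, so the country names can go to `plt.yticks` and the values to `plt.hlines` and `plt.plot`, as in the cell above.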

If you are interested in the subject of clean air, check these pages I ran into: [air quality real time](http://berkeleyearth.org/air-quality-real-time-map/) and [another](https://breezometer.com/air-quality-map/).

Thanks for reading, and I hope that most of you are in the countries at the bottom of the chart.