

Air Pollution By Country

  Date range: 2010 - 2017
  
  Final dataset: 1352 records with an ISO code (1920 rows in total)



# Importing data and libraries

In [13]:
!pip install pandasql
!pip install pandas
!pip install requests
!pip install lxml
!pip install nltk



In [14]:
# Imports
import pandas as pd
import pandasql as psql
from lxml import html
import requests
import nltk

In [15]:
air_polution_pd = pd.read_csv("/content/drive/MyDrive/MCIT/CIS545/Data/PM2.5 Global Air Pollution 2010-2017.csv")
air_polution_death_pd = pd.read_csv("/content/drive/MyDrive/MCIT/CIS545/Data/death-rates-from-air-pollution.csv")
cities_air_polution_pd = pd.read_csv("/content/drive/MyDrive/MCIT/CIS545/Data/cities_air_quality_water_pollution.18-10-2021.csv")

# Air pollution dataset cleaning and wrangling

The overall goals of data cleaning are:


*   Extract a standardized date range and group the data by year
*   Ensure a consistent country format by adding each country's ISO 3166-1 alpha-3 code (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes)
*   Drop columns that are not needed for analysis 
*   Drop any rows that have nulls 
*   Export cleaned dataset




# Extraction of standardized country codes

We extract the mapping from country name to country code from Wikipedia. Our approach is to request the Wikipedia page and then use XPath to find the list of country names and country codes.

In [16]:
def get_country_codes():
    """Scrape a {country name: ISO 3166-1 alpha-3 code} mapping from Wikipedia."""
    wiki = requests.get("https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3")
    dom_tree = html.fromstring(wiki.content)
    # Each <li> inside a "plainlist" div describes one country.
    xpath = '//div[@class="plainlist"]/ul/li'
    country_list = dom_tree.xpath(xpath)
    country_map = {}

    for country_elem in country_list:
        # Child 2 is read as the country name and child 1 as its code.
        country_map[country_elem[2].text] = country_elem[1].text

    return country_map


def set_country_value(df):
    """Insert an iso_code column (looked up from 'Country Name') as the first column."""
    country_to_code = get_country_codes()
    # Names missing from the mapping get None, which pandas treats as null.
    df.insert(0, 'iso_code', df['Country Name'].apply(lambda elem: country_to_code.get(elem)))
    return df
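
To illustrate how the XPath query and the child indexing in `get_country_codes` fit together, here is a minimal offline sketch. It assumes a hypothetical list-item layout (an empty anchor span, then the code, then the linked country name), which is inferred from the indices used above and not checked against the live page. It also swaps in the standard library's `xml.etree.ElementTree` for `lxml` so that it runs without a network request.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Wikipedia markup (an assumption).
SAMPLE_HTML = """
<html><body>
  <div class="plainlist">
    <ul>
      <li><span id="AFG"></span><span>AFG</span><a href="#">Afghanistan</a></li>
      <li><span id="AGO"></span><span>AGO</span><a href="#">Angola</a></li>
    </ul>
  </div>
</body></html>
"""


def parse_country_codes(markup):
    """Return {country name: alpha-3 code} parsed from the sample markup."""
    root = ET.fromstring(markup.strip())
    country_map = {}
    # ElementTree supports this subset of XPath, evaluated relative to the root.
    for item in root.findall(".//div[@class='plainlist']/ul/li"):
        code = item[1].text  # second child: the alpha-3 code
        name = item[2].text  # third child: the country name
        country_map[name] = code
    return country_map


sample_map = parse_country_codes(SAMPLE_HTML)
print(sample_map)
```

If the real page layout changes, the indices `1` and `2` are the part most likely to need adjusting.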

In [17]:
air_polution_pd = set_country_value(air_polution_pd)
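
Because `dict.get` returns `None` for any name that is missing from the mapping, it can help to list the names that did not match before cleaning further. The sketch below uses a small hypothetical mapping and frame (`toy_mapping`, `toy_df`), not the real data, to show the check.

```python
import pandas as pd

# Hypothetical mapping and rows, standing in for get_country_codes() and the real CSV.
toy_mapping = {"Afghanistan": "AFG", "Angola": "AGO"}
toy_df = pd.DataFrame({"Country Name": ["Afghanistan", "Angola", "Arab World"]})

# Series.map with a dict leaves unmatched names as NaN.
toy_df.insert(0, "iso_code", toy_df["Country Name"].map(toy_mapping))

# Names without a code: aggregates, or spellings that differ from the mapping.
unmatched = toy_df.loc[toy_df["iso_code"].isna(), "Country Name"].tolist()
print(unmatched)
```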

# Initial cleaning

First, we will print the initial dataframe to understand the shape of the data we are working with. 

In [18]:
air_polution_pd.head(5)

Unnamed: 0,iso_code,Country Name,Country Code,2010,2011,2012,2013,2014,2015,2016,2017
0,AFG,Afghanistan,AFG,65.245592,66.835727,66.023174,61.366745,59.01033,61.252656,56.287047,56.910808
1,AGO,Angola,AGO,33.787247,33.104195,33.415495,34.663923,32.974025,32.729873,31.785389,32.388505
2,ALB,Albania,ALB,21.277828,22.772537,20.578259,19.938517,18.883955,19.51254,18.189934,18.200603
3,AND,Andorra,AND,12.807198,13.273506,12.407053,11.813673,10.830418,11.462178,10.255834,10.307621
4,,Arab World,ARB,53.787001,52.652279,53.29727,54.053822,52.583603,60.406813,58.764905,58.689259


Next, we use the pandas `describe` function to understand the dataset a bit further.


*   The dataset contains 240 rows of air pollution data for 2010 to 2017; these include countries as well as aggregates such as "Arab World"
*   No row is missing a value for any year, since every column from 2010 to 2017 has a count of 240



In [19]:
air_polution_pd.describe()

Unnamed: 0,2010,2011,2012,2013,2014,2015,2016,2017
count,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0
mean,30.872419,31.131758,30.340594,29.790453,28.683781,30.579904,29.161746,29.292363
std,17.978061,17.94265,17.787894,17.732915,17.165262,19.577582,19.034549,19.320528
min,7.152866,7.2837,6.601134,6.278689,6.18083,6.063834,5.893757,5.861331
25%,17.043463,17.362182,16.255018,15.809037,15.171312,15.513597,14.489949,14.572962
50%,27.004787,27.453521,25.948751,25.442579,24.19379,24.441082,23.07915,22.87483
75%,39.433401,40.142818,40.8634,40.343259,39.552618,43.850369,40.775888,40.966942
max,100.784428,100.766061,96.963291,95.313986,98.116017,97.432289,98.054714,99.734374
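
The completeness of the year columns can also be checked directly by counting nulls per column, rather than reading it off the `describe` counts. A minimal sketch on a hypothetical three-row frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical rows; the last one mimics an aggregate with no ISO code.
toy = pd.DataFrame({
    "iso_code": ["AFG", "AGO", None],
    "2010": [65.2, 33.8, 53.8],
    "2011": [66.8, 33.1, 52.7],
})

# Number of nulls in each column.
null_counts = toy.isna().sum()

# Year columns are the ones whose label is all digits.
year_cols = [col for col in toy.columns if col.isdigit()]
all_years_complete = bool((null_counts[year_cols] == 0).all())

print(null_counts)
print(all_years_complete)
```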


In [30]:
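air_polution_pd_cleaned = air_polution_pd.dropna()
# Note: the next line starts again from air_polution_pd, so the dropna() result
# above is overwritten and rows with a null iso_code are kept.
air_polution_pd_cleaned = air_polution_pd.drop(columns=['Country Name', 'Country Code'])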
air_polution_pd_cleaned.head(5)




Unnamed: 0,iso_code,2010,2011,2012,2013,2014,2015,2016,2017
0,AFG,65.245592,66.835727,66.023174,61.366745,59.01033,61.252656,56.287047,56.910808
1,AGO,33.787247,33.104195,33.415495,34.663923,32.974025,32.729873,31.785389,32.388505
2,ALB,21.277828,22.772537,20.578259,19.938517,18.883955,19.51254,18.189934,18.200603
3,AND,12.807198,13.273506,12.407053,11.813673,10.830418,11.462178,10.255834,10.307621
4,,53.787001,52.652279,53.29727,54.053822,52.583603,60.406813,58.764905,58.689259


In [31]:
air_polution_pd_cleaned.count()

iso_code    169
2010        240
2011        240
2012        240
2013        240
2014        240
2015        240
2016        240
2017        240
dtype: int64
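
The counts above show 240 values per year but only 169 `iso_code` values, so rows without a code are still present in the frame. One way to drop them, sketched here on hypothetical data, is to chain the two steps so that the `dropna` result is the one that is kept. Restricting `dropna` to `iso_code` through `subset` is an assumption about the intent.

```python
import pandas as pd

# Hypothetical rows; "Arab World" has no ISO code, like the aggregate rows above.
toy = pd.DataFrame({
    "iso_code": ["AFG", "AGO", None],
    "Country Name": ["Afghanistan", "Angola", "Arab World"],
    "Country Code": ["AFG", "AGO", "ARB"],
    "2010": [65.2, 33.8, 53.8],
})

# Chain the calls so the null-free frame is the one that gets its columns dropped.
toy_cleaned = (
    toy.dropna(subset=["iso_code"])
       .drop(columns=["Country Name", "Country Code"])
)
print(toy_cleaned.count())
```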

In [21]:
air_polution_pd_cleaned.describe()

Unnamed: 0,2010,2011,2012,2013,2014,2015,2016,2017
count,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0
mean,30.872419,31.131758,30.340594,29.790453,28.683781,30.579904,29.161746,29.292363
std,17.978061,17.94265,17.787894,17.732915,17.165262,19.577582,19.034549,19.320528
min,7.152866,7.2837,6.601134,6.278689,6.18083,6.063834,5.893757,5.861331
25%,17.043463,17.362182,16.255018,15.809037,15.171312,15.513597,14.489949,14.572962
50%,27.004787,27.453521,25.948751,25.442579,24.19379,24.441082,23.07915,22.87483
75%,39.433401,40.142818,40.8634,40.343259,39.552618,43.850369,40.775888,40.966942
max,100.784428,100.766061,96.963291,95.313986,98.116017,97.432289,98.054714,99.734374


Next, we will reshape the dataset so that it is consistent with the other datasets. The resulting dataframe has iso_code, year and percent as its columns.

In [34]:
air_polution_pd_cleaned = air_polution_pd_cleaned.set_index('iso_code').stack().reset_index().rename(columns={'level_1': 'year', 0:'percent'})


In [35]:
air_polution_pd_cleaned.head(5)

Unnamed: 0,iso_code,year,percent
0,AFG,2010,65.245592
1,AFG,2011,66.835727
2,AFG,2012,66.023174
3,AFG,2013,61.366745
4,AFG,2014,59.01033
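
An equivalent wide-to-long reshape can be written with `DataFrame.melt`, which names the output columns directly. This is an alternative sketch on hypothetical values, not the notebook's method; the integer cast of `year` is an added assumption about what downstream joins expect.

```python
import pandas as pd

# Hypothetical wide frame: one row per country, one column per year.
wide = pd.DataFrame({
    "iso_code": ["AFG", "AGO"],
    "2010": [65.2, 33.8],
    "2011": [66.8, 33.1],
})

long_df = (
    wide.melt(id_vars="iso_code", var_name="year", value_name="percent")
        .astype({"year": int})              # year labels are strings in the wide frame
        .sort_values(["iso_code", "year"])  # melt returns rows grouped by year column
        .reset_index(drop=True)
)
print(long_df)
```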


In [36]:
air_polution_pd_cleaned.count()

iso_code    1352
year        1920
percent     1920
dtype: int64

Finally, we export the cleaned dataframe to CSV.

In [29]:
air_polution_pd_cleaned.to_csv('/content/drive/MyDrive/MCIT/CIS545/Data/pollution_data_cleaned.csv')
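
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, `index=False` avoids it. The sketch below round-trips a hypothetical frame through an in-memory buffer instead of the Drive path.

```python
import io

import pandas as pd

# Hypothetical long-format rows in the same shape as the cleaned dataframe.
long_df = pd.DataFrame({
    "iso_code": ["AFG", "AFG"],
    "year": [2010, 2011],
    "percent": [65.2, 66.8],
})

# Write without the index, then read the text back as a new frame.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(list(round_trip.columns))
```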