<a href="https://colab.research.google.com/github/sumants-dev/CIS545-Project/blob/main/DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [39]:
!pip install pandasql
!pip install pandas
!pip install requests
!pip install lxml
!pip install nltk

Collecting nltk
  Downloading nltk-3.6.5-py3-none-any.whl (1.5 MB)
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting click
  Using cached click-8.0.3-py3-none-any.whl (97 kB)
Collecting tqdm
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting regex>=2021.8.3
  Downloading regex-2021.11.10-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (764 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk
Successfully installed click-8.0.3 joblib-1.1.0 nltk-3.6.5 regex-2021.11.10 tqdm-4.62.3


In [11]:
# Imports
import pandas as pd
import pandasql as psql
from lxml import html
import requests
import nltk

# Cleaning and Wrangling of Global Land Temperature Dataset
This dataset contains the average monthly temperature for each country since 1743. Cleaning it involves two tasks:
    1. Convert country names to ISO 3166-1 alpha-3 codes
    2. Aggregate the monthly records by year and country

## Extraction of country codes
We obtain the mapping from country name to country code from Wikipedia. Our approach is to request the ISO 3166-1 alpha-3 page and then use an XPath query to select the list items that hold each country's name and code.

In [5]:
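def get_country_codes():
    # Download the Wikipedia page that lists the ISO 3166-1 alpha-3 codes.
    wiki = requests.get("https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3")
    dom_tree = html.fromstring(wiki.content)
    # Each <li> under a "plainlist" div holds one code/name pair.
    xpath = "//div[@class=\"plainlist\"]/ul/li"
    country_list = dom_tree.xpath(xpath)
    country_map = {}

    # Map country name -> alpha-3 code; the child indices depend on the page's markup.
    for country_elem in country_list:
        country_map[country_elem[2].text] = country_elem[1].text

    return country_map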

        
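def set_country_value(df, column="Country"):
    # Replace country names in `column` with alpha-3 codes.
    # Names without a match become None and can be dropped afterwards.
    country_to_code = get_country_codes()
    df[column] = df[column].apply(lambda elem: country_to_code.get(elem))
    return df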
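
The XPath step in `get_country_codes` can be tried offline on a small inline snippet. The sketch below is only an illustration: the `SAMPLE` markup and the helper name `parse_country_codes` are hypothetical, and the live Wikipedia page may be structured differently. It selects children by tag rather than by position, which may be less sensitive to small markup changes.

```python
from lxml import html

# Hypothetical markup for illustration only; not copied from the Wikipedia page.
SAMPLE = """
<html><body>
<div class="plainlist"><ul>
<li><span>AFG</span> <a href="#">Afghanistan</a></li>
<li><span>ZWE</span> <a href="#">Zimbabwe</a></li>
</ul></div>
</body></html>
"""


def parse_country_codes(page_source):
    """Map country name -> alpha-3 code for markup shaped like SAMPLE."""
    dom_tree = html.fromstring(page_source)
    country_map = {}
    for item in dom_tree.xpath('//div[@class="plainlist"]/ul/li'):
        # string(...) returns the text content of the first matching child.
        code = item.xpath("string(./span)").strip()
        name = item.xpath("string(./a)").strip()
        if code and name:
            country_map[name] = code
    return country_map


sample_map = parse_country_codes(SAMPLE)
print(sample_map)
```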


In [10]:
global_temp_df = pd.read_csv('../Data/raw/GlobalLandTemperaturesByCountry.csv')
country_maps = get_country_codes()
global_temp_df = set_country_value(global_temp_df)
global_temp_df = global_temp_df.dropna()
global_temp_df

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country
3239,1838-04-01,13.008,2.586,AFG
3241,1838-06-01,23.950,2.510,AFG
3242,1838-07-01,26.877,2.883,AFG
3243,1838-08-01,24.938,2.992,AFG
3244,1838-09-01,18.981,2.538,AFG
...,...,...,...,...
577456,2013-04-01,21.142,0.495,ZWE
577457,2013-05-01,19.059,1.022,ZWE
577458,2013-06-01,17.613,0.473,ZWE
577459,2013-07-01,17.000,0.453,ZWE
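
Because `dropna()` discards every row whose country name had no match in the mapping, it can be useful to list the unmatched names before calling `set_country_value`. The following is a minimal sketch on made-up data; the helper `unmatched_countries` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical inputs for illustration; the real mapping comes from get_country_codes().
country_to_code = {"Afghanistan": "AFG", "Zimbabwe": "ZWE"}
sample_df = pd.DataFrame({
    "dt": ["2013-04-01", "2013-04-01", "2013-04-01"],
    "AverageTemperature": [15.7, 21.1, 9.5],
    "Country": ["Afghanistan", "Zimbabwe", "Atlantis"],
})


def unmatched_countries(df, mapping, column="Country"):
    """Return the sorted, distinct names in `column` that have no entry in `mapping`."""
    names = df[column].dropna().unique()
    return sorted(name for name in names if name not in mapping)


missing = unmatched_countries(sample_df, country_to_code)
print(missing)
```

This check has to run on the raw names, that is, before the `Country` column is overwritten with codes.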


Next, we average the monthly records by year and country to obtain the yearly temperature for each country.
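
pandasql runs its queries against an in-memory SQLite database, so the year is extracted with SQLite's `strftime`, which accepts ISO-formatted date strings such as `1838-04-01`. The sketch below reproduces that grouping with the standard-library `sqlite3` module; the table name `temps` and its rows are made up for illustration.

```python
import sqlite3

# Made-up rows for illustration; not taken from the dataset.
rows = [
    ("2012-01-01", 10.0, "AAA"),
    ("2012-07-01", 20.0, "AAA"),
    ("2013-01-01", 12.0, "AAA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temps (dt TEXT, AverageTemperature REAL, Country TEXT)")
conn.executemany("INSERT INTO temps VALUES (?, ?, ?)", rows)

# strftime('%Y', dt) yields the four-digit year as text.
yearly = conn.execute(
    """
    SELECT strftime('%Y', dt) AS Year, Country, AVG(AverageTemperature) AS AvgYearlyTemp
    FROM temps
    GROUP BY strftime('%Y', dt), Country
    ORDER BY Year
    """
).fetchall()
conn.close()
print(yearly)
```

Note that the resulting `Year` values are strings, not integers.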

In [26]:
group_by_query = '''
SELECT strftime('%Y', dt) as Year, Country, AVG(AverageTemperature) as AvgYearlyTemp, AVG(AverageTemperatureUncertainty) as AvgTempUncertainty
FROM global_temp_df
GROUP BY strftime('%Y', dt), Country
'''

global_temps_final = psql.sqldf(group_by_query, locals())
global_temps_final.head()

Unnamed: 0,Year,Country,AvgYearlyTemp,AvgTempUncertainty
0,1743,ALB,8.62,2.268
1,1743,AND,7.556,2.188
2,1743,AUT,2.482,2.116
3,1743,BEL,7.106,1.855
4,1743,BGR,5.928,2.547
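
The same aggregation can also be written directly in pandas, without going through SQL. This is a sketch of an alternative, not the approach used in this notebook; the function name `yearly_average` and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; column names match the dataset.
sample_df = pd.DataFrame({
    "dt": ["2012-01-01", "2012-07-01", "2013-01-01"],
    "AverageTemperature": [10.0, 20.0, 12.0],
    "AverageTemperatureUncertainty": [0.4, 0.6, 0.5],
    "Country": ["AAA", "AAA", "AAA"],
})


def yearly_average(df):
    """Average temperature and uncertainty per (year, country)."""
    # Parse the date strings and keep only the year as the grouping key.
    year = pd.to_datetime(df["dt"]).dt.year.rename("Year")
    return (
        df.groupby([year, "Country"])
        .agg(
            AvgYearlyTemp=("AverageTemperature", "mean"),
            AvgTempUncertainty=("AverageTemperatureUncertainty", "mean"),
        )
        .reset_index()
    )


yearly_df = yearly_average(sample_df)
print(yearly_df)
```

One difference from the SQL version: here `Year` is an integer column, whereas `strftime('%Y', dt)` returns text.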


In [27]:
global_temps_final.to_csv('../Data/global_average_yearly_temp_clean.csv', index=False)