# Introduction

This dataset has been generated by combining two publicly available datasets: one on weather and one on city coordinates and demographics. In a nutshell, it uses a city's coordinates to identify the nearest weather station and then extracts that station's historical weather data.

A dataset on weather patterns is generated here by using the following open/public datasets:

i) **Global Historical Climatology Network (GHCN) (https://www.ncdc.noaa.gov/ghcn-daily-description):**
"GHCN (Global Historical Climatology Network)-Daily is an integrated database of daily climate summaries from land surface stations across the globe. Like its monthly counterpart (GHCN-Monthly), GHCN-Daily is comprised of daily climate records from numerous sources that have been integrated and subjected to a common suite of quality assurance reviews.

GHCN-Daily contains records from over 100,000 stations in 180 countries and territories. NCEI provides numerous daily variables, including maximum and minimum temperature, total daily precipitation, snowfall, and snow depth; however, about one half of the stations report precipitation only. Both the record length and period of record vary by station and cover intervals ranging from less than a year to more than 175 years."

ii) **World Cities Database (https://simplemaps.com/data/world-cities):** A database of 26,000 cities containing city names, geographical co-ordinates, population, density and country names. The database is available under the Creative Commons Attribution 4.0 license.


## 1. Motivation: 

i) Climate change is known to induce high global temperatures, changes in precipitation patterns, shortened frost/winter season, extreme weather events, heat waves, droughts etc.

ii) **Local data provides perspective:** From NASA's Global Climate Change: "...the extent of climate change effects on individual regions will vary over time and with the ability of different societal and environmental systems to mitigate or adapt to change." Hence, analysing localized weather patterns, population and changes in and around the cities participating in CDP surveys can provide perspective on the adaptation and mitigation measures chosen by a city, infrastructure priorities, social equity programs and policies, gender biases, effects on the health of the population etc. National-level population and weather data may not account for local variations, especially in large, diverse countries like the USA, China, India, Brazil etc. Hence, localized data is preferable to national-level data.

iii) **High Quality Measurable data:** Weather and demographic data are primary, unbiased, recorded, historical data reflecting the on-ground situation. Local weather data like temperature, precipitation, snowfall etc. are regularly measured with quality assurances in place. Similarly, demographic data like population and density have been regularly measured for decades. 

iv) **Unbiased:** Other data types like projections, self-reported data and extrapolations suffer from the biases and assumptions of the reporter, analyst or estimator. Indices and basket/combination indicators suffer from biases in the weightage given to different parameters. We need to be cautious when using projections, indices etc. as they can carry existing biases into the current analysis. Organizations and administrations can, and are known to, 'game' such indicators. Hence, it is preferable to use primary, unbiased and measurable data like weather and population demographics to build KPIs.

## 2. Advantages: 

The dataset generated and the code in this notebook have applications beyond the current competition.

i) **Ease of Access:** The dataset makes climate data on thousands of cities instantly accessible through year-wise CSV files. Additionally, a 'key file' which links each city to the nearest weather station makes it easy to retrieve weather data with only the city and country name. The data is presented in a human-readable format and can be easily manipulated using Python, R or even Microsoft Excel/Google Sheets.

ii) **Scope and Usage:** The dataset covers thousands of cities beyond the CDP competition dataset. Hence, its potential uses extend to individuals and organizations engaged in sustainability, climate change, meteorology etc. Although the notebook generates data only for 2015 to 2020, it can be modified to obtain data for any other year from the NOAA database.

iii) **Size:** NOAA GHCN annual files are 1-2 GB each. The extracted dataset files are 150-200 MB each, i.e. roughly 10-20% of the original size.
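As a rough illustration of the 'key file' lookup described in item i), the sketch below goes from a city and country name to the weather rows of its closest station. It uses small in-memory stand-ins rather than the real files. The function name `weather_for_city`, the `city` column and all sample rows are hypothetical; the columns `Closest Station ID`, `ID` and `Date` follow the files written later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the key file "World Cities Nearest Stations.csv".
key_file = pd.DataFrame({
    "city": ["Tokyo", "Delhi"],          # assumed column name
    "country": ["Japan", "India"],
    "Closest Station ID": ["STATION_A", "STATION_B"],  # made-up IDs
})

# Hypothetical stand-in for one year-wise weather file; values are made up.
weather = pd.DataFrame({
    "Date": [20200101, 20200102, 20200101],
    "ID": ["STATION_A", "STATION_A", "STATION_B"],
    "TMAX": [101, 95, 210],
})

def weather_for_city(city, country, key=key_file, data=weather):
    """Return the weather rows of the station closest to the given city."""
    match = key[(key["city"] == city) & (key["country"] == country)]
    if match.empty:
        raise KeyError(f"{city}, {country} not found in key file")
    station_id = match["Closest Station ID"].iloc[0]
    return data[data["ID"] == station_id]

tokyo = weather_for_city("Tokyo", "Japan")
print(tokyo)
```

With the real files, `key_file` and `weather` would presumably be loaded with `pd.read_csv` instead of being built by hand.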


## 3. Method

I) Obtain files from NOAA and Simplemaps

II) Use the co-ordinates of each city and each weather station to find the weather station closest to each city

III) Extract the data for these weather stations and shape it in a human readable format.


## 4. Limitations:

Weather data on certain cities is missing or inaccurate because of the following reasons:

i) Some of the weather stations identified in the notebook have not been polled by NOAA in recent years. Hence, weather data on certain cities is missing.

ii) In a few cases, the identified weather stations are very far from the respective cities.
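One possible way to handle limitation ii) is to flag cities whose closest station lies beyond a chosen cut-off before using their weather data. The sketch below uses made-up distances. The 50 km threshold and the `Station Too Far` column are assumptions for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Made-up distances in kilometres; the distance column name matches the key file.
cities = pd.DataFrame({
    "city": ["A", "B", "C"],
    "Closest Station Distance": [3.2, 48.0, 412.5],
})

MAX_DISTANCE_KM = 50  # assumed cut-off; tune to the analysis at hand

# Mark cities whose nearest station is too far away to be representative.
cities["Station Too Far"] = cities["Closest Station Distance"] > MAX_DISTANCE_KM
usable = cities[~cities["Station Too Far"]]
print(usable["city"].tolist())
```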

In [None]:
from tqdm import tqdm
import pandas as pd
import plotly.express as px
import csv
import haversine

tqdm.pandas()

# I. Get files from simplemaps and NOAA

## 1. List of all NOAA GHCN stations
This file is a fixed-width ASCII text file, so we will parse it into a DataFrame and also save it as a CSV for future use.

In [None]:
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

In [None]:
colspecs = [(0,11), (12,20), (21,30), (31,37), (38,40), (41,71), (72,75), (76,79), (80,85)]
stations = pd.read_fwf('ghcnd-stations.txt', colspecs=colspecs, header=None, index_col=None)
stations.columns = ["ID", "LATITUDE", "LONGITUDE", "ELEVATION", "STATE", "NAME", "GSNFLAG", "HCNFLAG", "WMOID"]
stations.reset_index(inplace=True)
stations["COORDS"] = list(zip(stations.LATITUDE, stations.LONGITUDE)) # Saving coordinates as a tuple will be useful later
stations.to_csv('ghcnd-stations.csv', quoting=csv.QUOTE_NONNUMERIC)

## 2. GHCN Daily Weather data files
These files are available as zipped archives for each year. The zipped archives contain a CSV file without any header. 
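Because the CSV has no header line, pandas needs to be told so; otherwise it would treat the first record as column names. A minimal sketch on two made-up rows is shown below. The rows are synthetic, and the column names are the ones this notebook assigns, with `A` to `D` standing for the trailing flag and time fields.

```python
import io
import pandas as pd

# Two made-up rows in the shape of a yearly file: eight fields, no header line.
raw = io.StringIO(
    "STATION_A,20200101,TMAX,101,,,S,\n"
    "STATION_A,20200101,TMIN,12,,,S,\n"
)

columns = ["ID", "Date", "Element", "Value", "A", "B", "C", "D"]

# header=None keeps the first record as data; names= supplies the column labels.
sample = pd.read_csv(raw, header=None, names=columns)
print(len(sample))
```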

In [None]:
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2020.csv.gz
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2019.csv.gz
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2018.csv.gz

In [None]:
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2017.csv.gz
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
!wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2015.csv.gz

In [None]:
!gunzip 2020.csv.gz
!gunzip 2019.csv.gz
!gunzip 2018.csv.gz

In [None]:
!gunzip 2017.csv.gz
!gunzip 2016.csv.gz
!gunzip 2015.csv.gz

## 3. World Cities Data
Data on 26,000 cities including geo co-ordinates, country, and population. This data is available from https://simplemaps.com/data/world-cities

In [None]:
!wget https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.71.zip

In [None]:
!unzip ./simplemaps_worldcities_basicv1.71.zip

# II. Finding the closest weather station for each city

1. We will use one helper function to calculate the Haversine distance between two points. We will use this function to calculate the distance between each city and every station in the NOAA GHCN station list. The function will return the 'ID' of the closest station and the distance between the city and that station.

2. We will limit our analysis to the top 2500 cities by population because of the resource constraints of Kaggle Notebooks. The code below can also be used to analyse a larger number of cities or a custom list of cities.
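For readers unfamiliar with the Haversine distance, the sketch below implements it in plain Python and picks the nearest of a few candidate points. It is an illustration only: the notebook itself relies on the `haversine` package, and the names `haversine_km` and `nearest`, the Earth radius constant and the sample coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # a commonly used mean Earth radius (assumption)

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest(point, candidates):
    """Return (id, distance_km) of the candidate closest to point.

    candidates maps a hypothetical station id to its (lat, lon).
    """
    best = min(candidates, key=lambda k: haversine_km(point, candidates[k]))
    return best, haversine_km(point, candidates[best])

# Made-up station positions near the equator.
demo_stations = {"X": (0.0, 1.0), "Y": (0.0, 5.0)}
station_id, dist_km = nearest((0.0, 0.0), demo_stations)
print(station_id, round(dist_km, 1))
```

A brute-force scan like this compares every city with every station, which is also what the notebook's helper does, and is part of why the analysis is limited to 2500 cities.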

In [None]:
from haversine import haversine

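def NearestStations(lat,long):
    """Return (ID, distance in km) of the station closest to (lat, long)."""
    coords = (lat,long)
    # Haversine distance (km) from the given point to every station
    dist = stations.COORDS.apply(lambda x: haversine(coords,x))
    closest = dist.idxmin()  # index label of the smallest distance
    return(str(stations.loc[closest, "ID"]),float(dist.loc[closest]))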

In [None]:
world_cities = pd.read_csv("worldcities.csv")

In [None]:
world_top2500 = world_cities.nlargest(2500,"population")

In [None]:
world_top2500['Closest Station ID'], world_top2500['Closest Station Distance'] = \
                            zip(*world_top2500.progress_apply(lambda x: NearestStations(float(x['lat']),float(x['lng'])), axis=1))

In [None]:
world_top2500.to_csv("World Cities Nearest Stations.csv")

# III. Extracting weather data for stations

1. We will first extract the list of unique weather stations from our list of top 2500 cities and their associated weather stations.

2. Next we will iterate over each year's weather data.

3. In each iteration we will extract the data of only those stations appearing in our list.

4. We will also transform the data to a more readable format using the pivot() function of the pandas DataFrame.

5. The final transformed data will be saved to file; these files form our dataset.
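To make step 4 concrete, the sketch below pivots a few made-up rows from the long layout (one row per station, date and element) to the wide layout (one row per station and date, one column per element). The station IDs and values are synthetic. Passing a list to `index` in `pivot()` requires pandas 1.1 or newer, to the best of my knowledge.

```python
import pandas as pd

# Made-up long-format records in the shape of the filtered yearly data.
long_format = pd.DataFrame({
    "ID": ["STATION_A", "STATION_A", "STATION_A", "STATION_B"],
    "Date": [20200101, 20200101, 20200102, 20200101],
    "Element": ["TMAX", "TMIN", "TMAX", "PRCP"],
    "Value": [101, 12, 95, 0],
})

# One row per (Date, ID); each distinct Element becomes its own column.
wide = (
    long_format
    .pivot(index=["Date", "ID"], columns="Element", values="Value")
    .reset_index()
)
print(wide)
```

Elements that a station did not report on a given date show up as NaN in the wide table.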

In [None]:
stations_list = list(world_top2500["Closest Station ID"].unique())

In [None]:
years = ['2015','2016','2017','2018','2019','2020']
for y in years:
    print(y)
    # The yearly files have no header row; header=None keeps the first record as data
    weather_temp = pd.read_csv(y+".csv", header=None,
                               names=['ID','Date','Element','Value','A','B','C','D'])
    weather_temp2 = weather_temp[weather_temp.ID.isin(stations_list)]
    wp = weather_temp2.pivot(index = ["Date","ID"], columns = "Element", values="Value")
    wp.reset_index(inplace=True)
    wp.to_csv("Weather top 2500 cities "+y+".csv")

# IV. Analysis

## 1. Analysis of distance between cities and closest weather station

i) About 50% of the cities have a weather station within a 7 kilometer range

ii) About 60% of the cities have a station within a 15 kilometer range

iii) Some cities don't seem to have any nearby stations on the NOAA list

In [None]:
world_top2500["Closest Station Distance"].describe()

In [None]:
100*world_top2500[world_top2500["Closest Station Distance"]<15].shape[0]/world_top2500.shape[0]

In [None]:
hist = px.histogram(world_top2500, "Closest Station Distance")
hist.show()

## 2. How many weather stations from our list were polled by NOAA for weather data?

1. In each year between 2015-2020, NOAA GHCN collected data from about 53-57% of the stations in our list

In [None]:
unique_stations_lists = []
unique_stations_nos = []

for y in years:
    w = pd.read_csv("Weather top 2500 cities "+y+".csv")
    yo = list(w['ID'].unique())
    yo_len = len(yo)
    unique_stations_lists.append(yo)
    unique_stations_nos.append(yo_len)

unique_stations_all = sorted(set(sum(unique_stations_lists, []))) #ID of every station that was polled at least once 

total_unique_stations_nos = len(list(world_top2500["Closest Station ID"].unique()))
unique_stations_percentages = [yo * 100/total_unique_stations_nos for yo in unique_stations_nos]

Percentage_Stations_Polled = pd.DataFrame(list(zip(years, unique_stations_percentages)), columns=["Year", "Percentage of Stations Polled"])

In [None]:
Percentage_Stations_Polled.head(6)

In [None]:
Cities_Not_Polled = world_top2500[~(world_top2500["Closest Station ID"].isin(unique_stations_all))]  # ~ inverts the boolean mask
Cities_Not_Polled.shape

In [None]:
Cities_Not_Polled["country"].value_counts()

In [None]:
Cities_Polled = world_top2500[world_top2500["Closest Station ID"].isin(unique_stations_all)]  # cities whose station was polled at least once
Cities_Polled.shape

In [None]:
Cities_Polled["country"].value_counts()