### Background

NOAA’s National Centers for Environmental Information (NCEI) hosts and provides access to one of the most significant archives on Earth, with comprehensive oceanic, atmospheric, and geophysical data. From the depths of the ocean to the surface of the sun and from million-year-old ice core records to near-real-time satellite images, NCEI is the Nation’s leading authority for environmental information.

The five "fundamental activities" of NOAA are:

- Monitoring and observing Earth systems with instruments and data collection networks.
- Understanding and describing Earth systems through research and analysis of that data.
- Assessing and predicting the changes of these systems over time.
- Engaging, advising, and informing the public and partner organizations with important information.
- Managing resources for the betterment of society, economy and environment.

More details on NOAA: https://en.wikipedia.org/wiki/National_Oceanic_and_Atmospheric_Administration

### History

In [None]:
from IPython.display import HTML

# Embed the video in the notebook output
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/nBnCsMYm2yQ" frameborder="0" allowfullscreen></iframe>')

### Data definition and collection

#### GHCN 
- The Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe. 
- GHCN-Daily contains records from over 100,000 stations in 180 countries and territories.
- The data are obtained from more than 20 sources. Some data are more than 175 years old.
- NCEI provides numerous daily variables, including maximum and minimum temperature, total daily precipitation, snowfall, and snow depth; however, about one half of the stations report precipitation only.
  
##### Data description
https://www.ncdc.noaa.gov/ghcn-daily-description

##### Collection
The data can be downloaded from the public S3 bucket. For this notebook, the 2019 data were downloaded beforehand and placed in the aws-data folder.  
Detailed information is available here:  
https://docs.opendata.aws/noaa-ghcn-pds/readme.html  
Questions about data quality should be addressed to noaa.bdp@noaa.gov.

### Exploration of data retrieved from stations
#### Summary of data format
ID = 11 character station identification code. Please see ghcnd-stations section below for an explanation  
YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)  
ELEMENT = 4 character indicator of element type  
DATA VALUE = 5 character data value for ELEMENT  
M-FLAG = 1 character Measurement Flag  
Q-FLAG = 1 character Quality Flag  
S-FLAG = 1 character Source Flag  
OBS-TIME = 4-character time of observation in hour-minute format (e.g. 0700 = 7:00 am)  

The fields are comma delimited and each row represents one station-day. 
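
As a minimal sketch of the layout above, the following parses a few comma-delimited rows with pandas. The three sample rows and the station ID `XX000000001` are made up for illustration (they are not taken from the 2019 file); the column names mirror the ones used later in this notebook.

```python
from io import StringIO
import pandas as pd

# Synthetic rows in the documented field order:
# ID, YYYYMMDD, ELEMENT, DATA VALUE, M-FLAG, Q-FLAG, S-FLAG, OBS-TIME
sample_csv = (
    "US1AZMR0156,20190101,PRCP,25,,,N,0700\n"
    "US1AZMR0156,20190101,SNOW,0,T,,N,0700\n"
    "XX000000001,20190102,TMAX,-12,,,E,\n"
)

columns = ['station_code', 'w_date', 'element_type', 'element_value',
           'measurement_flag', 'quality_flag', 'source_flag', 'obs_time']

sample_df = pd.read_csv(
    StringIO(sample_csv),
    header=None,                              # the rows carry no header line
    names=columns,
    dtype={'w_date': str, 'obs_time': str},   # keep leading zeros such as "0700"
)

# Turn the 8-character YYYYMMDD string into a proper timestamp
sample_df['w_date'] = pd.to_datetime(sample_df['w_date'], format='%Y%m%d')
print(sample_df)
```

Reading `obs_time` as a string is a design choice: parsed as a number, `0700` would lose its leading zero.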

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# The GHCN-Daily yearly CSV files have no header row, so header=None keeps the first records as data
station_df = pd.read_csv("/kaggle/input/aws-open-source-weather-transaction-and-metadata/2019.csv", header=None,
                 names=['station_code', 'w_date', 'element_type', 'element_value', 'measurement_flag', 'quality_flag',
                        'source_flag', 'obs_time'],
                 delimiter=',')
station_df.head()

### Retrieve unique element_types

The five core elements are:  
PRCP = Precipitation (tenths of mm)  
SNOW = Snowfall (mm)  
SNWD = Snow depth (mm)  
TMAX = Maximum temperature (tenths of degrees C)  
TMIN = Minimum temperature (tenths of degrees C)  
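
Because three of the five core elements are stored in tenths, values usually need rescaling before they are compared or plotted. The sketch below shows one way to do that; the helper `to_physical_units`, the `SCALE` table and the `value_converted` column are hypothetical names introduced here, and the sample values are made up.

```python
import pandas as pd

# Divisors that turn the stored integers into physical units:
# PRCP, TMAX and TMIN are stored in tenths; SNOW and SNWD are already in mm.
SCALE = {'PRCP': 10.0, 'SNOW': 1.0, 'SNWD': 1.0, 'TMAX': 10.0, 'TMIN': 10.0}

def to_physical_units(df):
    """Return a copy with a 'value_converted' column (mm or degrees C)."""
    out = df.copy()
    divisor = out['element_type'].map(SCALE)   # NaN for elements not listed above
    out['value_converted'] = out['element_value'] / divisor
    return out

sample = pd.DataFrame({
    'element_type': ['PRCP', 'SNOW', 'TMAX', 'TMIN'],
    'element_value': [25, 30, 215, -12],
})
converted = to_physical_units(sample)
print(converted)
```

Elements outside the five core types get `NaN` in `value_converted` rather than a guessed scale.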

In [None]:
unique_elements = station_df['element_type'].unique()
unique_elements

In [None]:
unique_elements.size

### Check for number of observations per day

In [None]:
station_df['w_date'].max()

In [None]:
df_grp_dt = station_df.groupby('w_date').count().sort_values(by = 'station_code', ascending = False)
df_grp_dt['station_code'].head()

## Additional attributes
### M-flag i.e. measurement flag  
MFLAG is the measurement flag. There are ten possible values:  
Blank = no measurement information applicable  
B = precipitation total formed from two 12-hour totals  
D = precipitation total formed from four six-hour totals  
H = represents highest or lowest hourly temperature (TMAX or TMIN) or the average of hourly values (TAVG)  
K = converted from knots  
L = temperature appears to be lagged with respect to reported hour of observation  
O = converted from oktas  
P = identified as “missing presumed zero” in DSI 3200 and 3206  
T = trace of precipitation, snowfall, or snow depth  
W = converted from 16-point WBAN code (for wind direction) 

In [None]:
station_df[station_df.measurement_flag == 'T'].head()

### Q-flag i.e. quality flag  
Q-FLAG is the quality flag. There are fifteen possible values (including blank):  
Blank = did not fail any quality assurance check  
D = failed duplicate check  
G = failed gap check  
I = failed internal consistency check  
K = failed streak/frequent-value check  
L = failed check on length of multiday period  
M = failed mega consistency check  
N = failed naught check  
O = failed climatological outlier check  
R = failed lagged range check  
S = failed spatial consistency check  
T = failed temporal consistency check  
W = temperature too warm for snow  
X = failed bounds check  
Z = flagged as a result of an official Datzilla Investigation 
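
A common use of the quality flag is to drop observations that failed any check before computing statistics. The following is a minimal sketch under the assumption that a blank flag is read by pandas as `NaN` (or, in some loaders, as an empty string); `passed_quality_checks` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def passed_quality_checks(df):
    """Keep rows whose quality flag is blank (NaN or empty string)."""
    flag = df['quality_flag']
    is_blank = flag.isna() | (flag.astype(str).str.strip() == '')
    return df[is_blank]

sample = pd.DataFrame({
    'element_type': ['TMAX', 'SNOW', 'PRCP', 'TMIN'],
    'element_value': [215, 40, 0, -500],
    'quality_flag': [None, 'W', '', 'X'],   # W = too warm for snow, X = failed bounds check
})
clean = passed_quality_checks(sample)
print(clean)
```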

In [None]:
station_df[station_df.quality_flag == 'W'].head(5)

### S-flag i.e. source flag

S-FLAG is the source flag for the observation. There are twenty-nine possible values (including blank, upper and lower case letters):

Blank = No source (i.e., data value missing)  
0 = U.S. Cooperative Summary of the Day (NCDC DSI-3200)  
6 = CDMP Cooperative Summary of the Day (NCDC DSI-3206)  
7 = U.S. Cooperative Summary of the Day – Transmitted via WxCoder3 (NCDC DSI-3207)  
A = U.S. Automated Surface Observing System (ASOS) real-time data (since January 1, 2006)  
a = Australian data from the Australian Bureau of Meteorology  
B = U.S. ASOS data for October 2000-December 2005 (NCDC DSI-3211)  
b = Belarus update  
C = Environment Canada  
E = European Climate Assessment and Dataset (Klein Tank et al., 2002)  
F = U.S. Fort data  
G = Official Global Climate Observing System (GCOS) or other government-supplied data  
H = High Plains Regional Climate Center real-time data  
I = International collection (non U.S. data received through personal contacts)  
K = U.S. Cooperative Summary of the Day data digitized from paper observer forms (from 2011 to present)  
M = Monthly METAR Extract (additional ASOS data)  
N = Community Collaborative Rain, Hail, and Snow (CoCoRaHS)  
Q = Data from several African countries that had been “quarantined”, that is, withheld from public release until permission was granted from the respective meteorological services  
R = NCEI Reference Network Database (Climate Reference Network and Regional Climate Reference Network)  
r = All-Russian Research Institute of Hydro-meteorological Information-World Data Center  
S = Global Summary of the Day (NCDC DSI-9618). NOTE: “S” values are derived from hourly synoptic reports exchanged on the Global Telecommunications System (GTS). Daily values derived in this fashion may differ significantly from “true” daily data, particularly for precipitation (i.e., use with caution).  
s = China Meteorological Administration/National Meteorological Information Center/Climatic Data Center (http://cdc.cma.gov.cn)  
T = SNOwpack TELemetry (SNOTEL) data obtained from the U.S. Department of Agriculture’s Natural Resources Conservation Service  
U = Remote Automatic Weather Station (RAWS) data obtained from the Western Regional Climate Center  
u = Ukraine update  
W = WBAN/ASOS Summary of the Day from NCDC’s Integrated Surface Data (ISD).  
X = U.S. First-Order Summary of the Day (NCDC DSI-3210)  
Z = Datzilla official additions or replacements  
z = Uzbekistan update  

#### Recommendation
When data are available for the same time from more than one source, the highest priority source is chosen according to the following priority order (from highest to lowest): Z, R, 0, 6, C, X, W, K, 7, F, B, M, r, E, z, u, b, s, a, G, Q, I, A, N, T, U, H, S  
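
To illustrate the priority rule, the sketch below keeps, for each station, date and element, the row whose source flag ranks highest in the order quoted above. It assumes source flags are stored as strings; `pick_highest_priority` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

# Priority order quoted above, from highest to lowest
PRIORITY = "Z,R,0,6,C,X,W,K,7,F,B,M,r,E,z,u,b,s,a,G,Q,I,A,N,T,U,H,S".split(",")
RANK = {flag: position for position, flag in enumerate(PRIORITY)}

def pick_highest_priority(df):
    """Keep one row per station, date and element: the one whose source ranks highest."""
    keys = ['station_code', 'w_date', 'element_type']
    ranked = df.assign(_rank=df['source_flag'].astype(str).map(RANK))
    # Unknown or blank flags map to NaN and therefore sort last
    ranked = ranked.sort_values('_rank', kind='stable', na_position='last')
    best = ranked.drop_duplicates(subset=keys, keep='first')
    return best.drop(columns='_rank').sort_index()

sample = pd.DataFrame({
    'station_code': ['XX000000001'] * 3 + ['XX000000002'],
    'w_date': [20190101] * 4,
    'element_type': ['PRCP'] * 4,
    'element_value': [12, 10, 11, 0],
    'source_flag': ['S', 'E', '0', 'N'],
})
result = pick_highest_priority(sample)
print(result)
```

For the first station, source `0` outranks both `E` and `S`, so only that row survives; the second station has a single source and is kept as is.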


In [None]:
station_df[station_df.source_flag == 'E'].head(5)

## Exploration of Station meta data

### Variable and feature details

These variables have the following definitions:  

- ID = the station identification code.  
    The first two characters denote the FIPS country code. Details on FIPS country codes: https://www.geodatasource.com/resources/tutorials/international-country-code-fips-versus-iso-3166/     
    The third character is a network code that identifies the station numbering system used:  
    0 = unspecified (station identified by up to eight alphanumeric characters)  
    1 = Community Collaborative Rain, Hail, and Snow (CoCoRaHS) based identification number. To ensure consistency with GHCN-Daily, all numbers in the original CoCoRaHS IDs have been left-filled to make them all four digits long. In addition, the characters “-” and “_” have been removed to ensure that the IDs do not exceed 11 characters when preceded by “US1”. For example, the CoCoRaHS ID “AZ-MR-156” becomes “US1AZMR0156” in GHCN-Daily  
- LATITUDE = latitude of the station (in decimal degrees).  
- LONGITUDE = longitude of the station (in decimal degrees).  
- ELEVATION = elevation of the station (in meters).  
- STATE = U.S. postal code for the state (for U.S. and Canadian stations only).  
- NAME = name of the station.  
- GSN FLAG = flag that indicates whether the station is part of the GCOS Surface Network (GSN).  
- HCN/CRN FLAG = flag that indicates whether the station is part of the U.S. Historical Climatology Network (HCN) or the U.S. Climate Reference Network (CRN).  
- WMO ID = the World Meteorological Organization (WMO) number for the station. If the station has no WMO number (or one has not yet been matched to this station), then the field is blank.
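
The station file is laid out in fixed-width columns, so station names containing spaces and blank STATE fields are easier to handle with `pandas.read_fwf` than with a whitespace split. The sketch below is an alternative under stated assumptions: the column positions are taken from the GHCN-Daily readme as I understand it and should be checked against the file, `fixed_width_row` is a hypothetical helper, and both sample rows are synthetic.

```python
from io import StringIO
import pandas as pd

# Zero-based, half-open column spans, assumed from the readme layout
# (ID 1-11, LATITUDE 13-20, LONGITUDE 22-30, ELEVATION 32-37, STATE 39-40,
#  NAME 42-71, GSN FLAG 73-75, HCN/CRN FLAG 77-79, WMO ID 81-85).
COLSPECS = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40),
            (41, 71), (72, 75), (76, 79), (80, 85)]
NAMES = ['station_id', 'latitude', 'longitude', 'elevation', 'state',
         'name', 'gsn_flag', 'hcn_flag', 'wmo_id']

def fixed_width_row(sid, lat, lon, elev, state, name, gsn='', hcn='', wmo=''):
    """Build one synthetic line in the same fixed-width layout."""
    return (f"{sid:<11} {lat:>8.4f} {lon:>9.4f} {elev:>6.1f} {state:<2} "
            f"{name:<30} {gsn:<3} {hcn:<3} {wmo:<5}")

# Illustrative values only; not copied from ghcnd-stations.txt
sample_txt = "\n".join([
    fixed_width_row('XX000000001', 52.4675, 13.4021, 48.0, '', 'BERLIN TEMPELHOF', 'GSN', '', '10384'),
    fixed_width_row('US1AZMR0156', 33.5, -112.0, 350.0, 'AZ', 'EXAMPLE STATION 1.2 NE'),
])

stations = pd.read_fwf(StringIO(sample_txt), colspecs=COLSPECS, names=NAMES,
                       header=None, dtype={'wmo_id': str})
print(stations)
```

With this layout a blank STATE stays empty (`NaN`) and the full station name, spaces included, lands in `name`.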


In [None]:
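import pandas as pd

# NOTE: ghcnd-stations.txt is a fixed-width file. Splitting on whitespace shifts
# fields when STATE is blank or NAME contains spaces, so for stations without a
# state code the first word of the station name lands in the `state` column.
md_station_df = pd.read_csv("/kaggle/input/aws-open-source-weather-transaction-and-metadata/ghcnd-stations.txt", header=None, sep=r'\s+',
                         names=['station_id', 'latitude', 'longitude', 'elevation', 'state', 'name', 'gsn_flag',
                                'hcn_flag', 'wmo_id'])
md_station_df.head()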


### Map of all stations in a city/town, e.g. Berlin

In [None]:
import folium

# With the whitespace-split load, the first word of the station name sits in `state`
# for stations without a state code, so the city is matched on that column.
md_city_coordinates = md_station_df[md_station_df.state.str.contains("BERLIN", na=False)][['state', 'latitude', 'longitude']]

# Center the map on the first matching station
berlin_location = [md_city_coordinates.iloc[0].latitude, md_city_coordinates.iloc[0].longitude]

m = folium.Map(location=berlin_location, zoom_start=11, tiles="openstreetmap")

# Mark each station as a circle
for index, row in md_city_coordinates.iterrows():
    folium.CircleMarker([row['latitude'], row['longitude']],
                        radius=15,
                        popup=row['state'],
                        color='white',      # outline color
                        fill=True,
                        fill_color="blue",
                       ).add_to(m)

m

#### Check station details for a city e.g. München

In [None]:
md_station_df[(md_station_df.station_id.str.startswith('GM')) & (md_station_df.state.str.startswith('MUNCHEN', na=False))]

## Country code master data


In [None]:
import pandas as pd
from io import StringIO

file = "/kaggle/input/aws-open-source-weather-transaction-and-metadata/ghcnd-countries.txt"

def parse_country_file(filename):
    with open(filename) as f:
        for line in f:
            yield line.strip().split(' ', 1)

country_df = pd.DataFrame(parse_country_file(file))
country_df.columns=['country_code', 'country_name']
country_df[country_df['country_name'].isin(['Germany', 'Spain', 'Italy', 'France', 'United Kingdom'])]

## State code master data

State codes identify the U.S. state/territory or Canadian province of a station. In the table below, CODE is the postal code of the U.S. state/territory or Canadian province where the station is located.

In [None]:
import pandas as pd

file = "/kaggle/input/aws-open-source-weather-transaction-and-metadata/ghcnd-states.txt"

def parse_state_file(filename):
    # Each line is "<code> <name>"; split once so multi-word names stay intact
    with open(filename) as f:
        for line in f:
            yield line.strip().split(' ', 1)

state_df = pd.DataFrame(parse_state_file(file))
state_df.columns = ['state_code', 'state_name']
state_df.head()

## GHCND inventory master data

This file lists the period of record for each station and element.  

### Data structure

- ID = the station identification code. Please see “ghcnd-stations.txt” for a complete list of stations and their metadata.  
- LATITUDE = the latitude of the station (in decimal degrees).  
- LONGITUDE = the longitude of the station (in decimal degrees).  
- ELEMENT = the element type. See section III of the GHCN-Daily readme for a definition of elements.  
- FIRSTYEAR = the first year of unflagged data for the given element.  
- LASTYEAR = the last year of unflagged data for the given element. 
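
FIRSTYEAR and LASTYEAR make it easy to estimate how long a station has reported an element and whether it covers a given year. The sketch below uses a small synthetic inventory (station IDs and years are made up); `record_years` is a hypothetical column name. The span is an upper bound, since it says nothing about gaps inside the period.

```python
import pandas as pd

# Synthetic inventory rows with the same fields as described above
inventory = pd.DataFrame({
    'station_id': ['XX000000001', 'XX000000001', 'XX000000002'],
    'element': ['TMAX', 'PRCP', 'PRCP'],
    'firstyear': [1950, 1901, 2010],
    'lastyear': [2019, 2019, 2015],
})

# Inclusive span of the period of record, in years
inventory['record_years'] = inventory['lastyear'] - inventory['firstyear'] + 1

# Station-element pairs whose period of record includes 2019
covers_2019 = inventory[(inventory['firstyear'] <= 2019) & (inventory['lastyear'] >= 2019)]
print(inventory)
print(covers_2019)
```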

In [None]:
import pandas as pd
file = "/kaggle/input/aws-open-source-weather-transaction-and-metadata/ghcnd-inventory.txt"
ghcnd_inventory_df = pd.read_csv(file, sep=r'\s+', header=None, names=['station_id', 'latitude', 'longitude', 'element', 'firstyear', 'lastyear'])
ghcnd_inventory_df.head()

## EDA

### Plot average temperature in Sydney for the months available

In [None]:
import matplotlib.pyplot as plt

# Stations matching SYDNEY (the first word of the name sits in `state` after the whitespace-split load)
sydney_station_id = md_station_df[md_station_df.state.str.contains('SYDNEY', na=False)].station_id.unique().tolist()
sydney_station_data = station_df[station_df.station_code.isin(sydney_station_id)]

# Keep daily average temperature (TAVG) and convert tenths of degrees C to degrees C
sydney_data_plot = sydney_station_data[sydney_station_data.element_type == "TAVG"][['station_code', 'w_date']].copy()
sydney_data_plot['tavg_celsius'] = sydney_station_data.element_value / 10

sydney_data_plot['w_date'] = pd.to_datetime(sydney_data_plot['w_date'], format='%Y%m%d')
sydney_data_plot.set_index('w_date', inplace=True)

# Pass figsize to plot(); a separate plt.figure() call would only open an empty figure
sydney_data_plot['tavg_celsius'].plot(figsize=(20, 4))


### Compare average temperature of Erlangen, Tokyo and Sydney

In [None]:
def city_tavg(city):
    """Daily TAVG in degrees C for stations matched by `city` in the `state` column."""
    station_ids = md_station_df[md_station_df.state.str.contains(city, na=False)].station_id.unique().tolist()
    city_data = station_df[station_df.station_code.isin(station_ids)]
    data_plot = city_data[city_data.element_type == "TAVG"][['station_code', 'w_date']].copy()
    data_plot['tavg_celsius'] = city_data.element_value / 10  # tenths of degrees C -> degrees C
    return data_plot

erlangen_data_plot = city_tavg('ERLANGEN')
tokyo_data_plot = city_tavg('TOKYO')
sydney_data_plot = city_tavg('SYDNEY')

In [None]:
import seaborn as sns

erlangen_data_plot['w_date'] = pd.to_datetime(erlangen_data_plot['w_date'], format='%Y%m%d')
tokyo_data_plot['w_date'] = pd.to_datetime(tokyo_data_plot['w_date'], format='%Y%m%d')
sydney_data_plot['w_date'] = pd.to_datetime(sydney_data_plot['w_date'], format='%Y%m%d')

# Set the style before creating the figure so the face colors apply to it
sns.set(rc={'axes.facecolor':'cyan', 'figure.facecolor':'cornflowerblue'})
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
plt.figure(figsize = (20,4))

# Recent seaborn versions require x and y as keyword arguments
sns.lineplot(x="w_date", y="tavg_celsius", data=erlangen_data_plot, color="red", label="Erlangen")
sns.lineplot(x="w_date", y="tavg_celsius", data=tokyo_data_plot, color="darkslategrey", label="Tokyo")
sns.lineplot(x="w_date", y="tavg_celsius", data=sydney_data_plot, color="steelblue", label="Sydney")
plt.show()

Country codes are used in the station identification number: the first two characters of each station ID are the FIPS country code of the country where the station is located (for example, GM for Germany, as used below).
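
As described in the station metadata section, an 11-character station ID starts with a two-character FIPS country code followed by a one-character network code. A minimal sketch of splitting an ID into those parts follows; `StationId` and `split_station_id` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

class StationId(NamedTuple):
    country_code: str   # first two characters: FIPS country code
    network_code: str   # third character: station numbering system
    station_part: str   # remaining eight characters

def split_station_id(station_id: str) -> StationId:
    """Split an 11-character GHCN-Daily station ID into its documented parts."""
    if len(station_id) != 11:
        raise ValueError(f"expected 11 characters, got {len(station_id)}")
    return StationId(station_id[:2], station_id[2], station_id[3:])

# The CoCoRaHS example from the station metadata section
parts = split_station_id("US1AZMR0156")
print(parts)
```

The country code extracted this way can be joined against the country master data, or used directly as a filter, as in the `startswith('GM')` selection for Germany.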

### Merging station weather with master data

In [None]:
station_tran_ms_df = pd.merge(station_df, md_station_df, left_on = ['station_code'], right_on = ['station_id'], \
                                how='left')
station_tran_ms_df.head()

#### Find maximum temperature in Germany

In [None]:
station_tran_ms_tmax_df = station_tran_ms_df[(station_tran_ms_df.element_type.str.contains("TMAX" , na=False)) & 
                                            (station_tran_ms_df.station_code.str.startswith('GM', na=False))]
max_tmp = station_tran_ms_tmax_df['element_value'].max()
max_tmp

In [None]:
max_tmp_loc = station_tran_ms_tmax_df[station_tran_ms_tmax_df.element_value == max_tmp][['state','latitude', 'longitude']]

# Center the map on the first station that recorded the maximum
plt_location = [max_tmp_loc.iloc[0].latitude, max_tmp_loc.iloc[0].longitude]

# Default OpenStreetMap tiles are used here; the Stamen terrain layer may not be
# available in recent folium releases.
mpl = folium.Map(location=plt_location, zoom_start=11)

# Mark each station as a circle
for index, row in max_tmp_loc.iterrows():
    folium.CircleMarker([row['latitude'], row['longitude']],
                        radius=15,
                        popup=row['state'],
                        color='white',      # outline color
                        fill=True,
                        fill_color="blue",
                       ).add_to(mpl)

mpl

### Check station with maximum snowfall

In [None]:
station_tran_snowfall_max_df = station_tran_ms_df[station_tran_ms_df.element_type.str.contains("SNOW" , na=False)]
max_snowfall = station_tran_snowfall_max_df['element_value'].max()
max_snowfall


In [None]:
max_snowfall_loc = station_tran_snowfall_max_df[station_tran_snowfall_max_df.element_value == max_snowfall][['state','latitude', 'longitude']]

# Center the map on the first station that recorded the maximum snowfall
plt_location = [max_snowfall_loc.iloc[0].latitude, max_snowfall_loc.iloc[0].longitude]

mpl = folium.Map(location=plt_location, zoom_start=11, tiles="openstreetmap")

# Mark each station as a circle
for index, row in max_snowfall_loc.iterrows():
    folium.CircleMarker([row['latitude'], row['longitude']],
                        radius=15,
                        popup=row['state'],
                        color='white',      # outline color
                        fill=True,
                        fill_color="blue",
                       ).add_to(mpl)

mpl
