# Our goal is to predict the AIRGLOW intensities.

In [0]:
from IPython.display import YouTubeVideo
YouTubeVideo('XkfcbHv_NRw', width=800, height=450)

### FEATURES: Parameters that should be relevant, but whose direct influence is unknown.
### LABELS: 18 years of airglow measurements


![alt text](https://media.springernature.com/original/springer-static/image/art%3A10.1007%2Fs11214-012-9872-6/MediaObjects/11214_2012_9872_Fig1_HTML.gif)



## 1) Perform the initial steps:

In [0]:
# import the libraries
import numpy as np
import pandas as pd
import datetime
import matplotlib.dates as md
import matplotlib.pyplot as plt
import math

In [0]:
# mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
# set the root path
root_path = 'gdrive/My Drive/SPACE-ML/presentations/Tuesday/'
print(root_path)

## 2) Open the file with airglow intensities (labels) and improve the data frame

In [0]:
# read the data that are stored in Google Drive - Georgia Data Frame (gdf)
gdf = pd.read_pickle(root_path + 'df_georgia_1975-1993.pkl')
print(list(gdf))
print(gdf.shape)
print(gdf.head())
print(gdf.tail())

In [0]:
# make a short data frame (Georgia Data Frame Short (gdfs)) with only one time as an index
gdfs = gdf.set_index('utc557')
gdfs.drop(['t557', 'utcOH', 'tOH', 'utc630', 't630'], axis=1, inplace=True)
# make index (time) non-timezone aware
gdfs.index = gdfs.index.tz_localize(None)
print(gdfs.tail())


In [0]:
# resample data to the specified period
rr = 'H' # resample keyword
gdfsh = pd.DataFrame() #  -> Georgia Data Frame Short - Hours (gdfsh)
gdfsh['i557'] = gdfs.i557.resample(rr).mean()
gdfsh['i630'] = gdfs.i630.resample(rr).mean()
gdfsh['iOH'] = gdfs.iOH.resample(rr).mean()
gdfsh.dropna(inplace=True)

print(gdfsh.shape)
print(gdfsh.tail())
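
To see what the hourly resampling does, here is a minimal sketch on a made-up series (the timestamps and values in `toy` are illustrative assumptions, not Abastumani data). Measurements that fall into the same hour collapse to their mean, and hours without any measurement become NaN, which is why `dropna` follows.

```python
import pandas as pd

# Hypothetical irregular measurements: three within 20:00-21:00, none within 21:00-22:00
toy = pd.Series(
    [100.0, 110.0, 120.0, 200.0],
    index=pd.to_datetime([
        '1975-01-01 20:05', '1975-01-01 20:25', '1975-01-01 20:45',
        '1975-01-01 22:10',
    ]),
)

# lowercase 'h' is the hourly alias preferred by recent pandas versions
toy_hourly = toy.resample('h').mean()
toy_hourly_clean = toy_hourly.dropna()  # removes the empty 21:00 bin
print(toy_hourly)
```

The empty 21:00 bin shows up as NaN in `toy_hourly` and disappears in `toy_hourly_clean`.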

In [0]:
# round data to 1 digit after decimal
gdfsh = gdfsh.round(1)

print(gdfsh.tail())

In [0]:
# get basic statistics of DF
gdfsh.describe()

In [0]:
# plot basic histograms of airglow intensities
gdfsh.plot.hist(grid=True, bins=40, rwidth=0.9, color=['darkgreen', 'red', 'dimgray'], subplots=True, layout=(3,1), figsize=(12,9), xlim=(-10,2000))

In [0]:
# select time interval
start_year = datetime.datetime.strptime('1975-01-01 00:00', "%Y-%m-%d %H:%M")
end_year = datetime.datetime.strptime('1993-04-30 23:59', "%Y-%m-%d %H:%M")

In [0]:
#@title
# to eliminate possible warning for matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

In [0]:
# quick look of the labels (airglow)
plt.figure(figsize=(18, 6))
plt.title('Airglow photometric measurements, Abastumani (Georgia), ' + str(start_year)[:10] + ' - ' + str(end_year)[:10], fontsize=18)
plt.xlabel('Time [Y]', fontsize=16)
plt.ylabel('Intensity [R]', fontsize=16)
plt.xlim(start_year, end_year)
plt.ylim(0, 1700)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
#xfmt = md.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)

# Choose the column to be plotted
#plt.plot(gdfsh.index, gdfsh.i557, '-o', color='darkgreen', label='Green line (557.7 nm)')
plt.plot(gdfsh.index, gdfsh.i630, '-D', color='red', label='Red line (630.0 nm)')
#plt.plot(gdfsh.index, gdfsh.iOH, '-s', color='dimgray', label='OH lines (900 - 1055 nm)')
plt.legend(loc='upper left')

In [0]:
# save final labels data frame to csv file
gdfsh.to_csv(root_path + 'labels_airglow_prediction.csv', sep=',')

## 3) Open the file with space weather indexes (features) and improve the data frame

In [0]:
# read data that were downloaded from omniweb (https://omniweb.gsfc.nasa.gov/form/dx1.html)
col_names = ['year', 'doy', 'hour', 'Kp', 'R', 'Dst', 'F107']
swdf = pd.read_csv(root_path + 'space_weather_indexes_1975-1993.txt', sep=r'\s+', names=col_names)
print(swdf.shape)
print(swdf.head())


print(swdf.tail())

### Kp index
The Kp index takes its name from the German “planetarische Kennziffer”, which translates loosely to “planetary index number”, and was introduced by Julius Bartels in 1938. Thirteen magnetometer stations around the world (at 44 – 60 degrees latitude) measure the level of geomagnetic fluctuation at their locations, recorded as a “K-value” between 0 and 9. The K-value of each station is based on the past 3 hours of activity at that station. The Kp index is set by averaging these 13 K-values.
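
The plot label used later in this notebook ('Kp*10 index') suggests that the file stores Kp multiplied by ten. Assuming the usual OMNI encoding, in which thirds are rounded so that 3- is 27, 3 is 30 and 3+ is 33, the hypothetical helper `kp10_to_label` below turns such a value back into the conventional notation. It is only a sketch under that assumption.

```python
def kp10_to_label(kp10):
    """Convert a Kp*10 value (e.g. 27, 30, 33) to the usual notation ('3-', '3', '3+').

    Assumes the OMNI-style encoding; the function name is made up for this example.
    """
    base = int(round(kp10 / 10.0))   # nearest whole Kp level
    remainder = kp10 - 10 * base     # -3, 0 or +3 for valid inputs
    if remainder < 0:
        return f'{base}-'
    if remainder > 0:
        return f'{base}+'
    return f'{base}'


for value in (0, 3, 27, 30, 33, 90):
    print(value, kp10_to_label(value))
```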

### Dst index:
The Dst or disturbance storm time index is a measure of geomagnetic activity used to assess the severity of magnetic storms. It is expressed in nanoteslas and is based on the average value of the horizontal component of the Earth's magnetic field measured hourly at four near-equatorial geomagnetic observatories. Use of the Dst as an index of storm strength is possible because the strength of the surface magnetic field at low latitudes is inversely proportional to the energy content of the ring current, which increases during geomagnetic storms.

### R index
The relative sunspot number is an index of the activity of the entire visible disk of the Sun. It is determined each day without reference to preceding days. Each isolated cluster of sunspots is termed a sunspot group, and it may consist of one or a large number of distinct spots whose size can range from 10 or more square degrees of the solar surface down to the limit of resolution (e.g., 1/25 square degree). The relative sunspot number is defined as R = K (10g + s), where g is the number of sunspot groups and s is the total number of distinct spots. The scale factor K (usually less than unity) depends on the observer and is intended to effect the conversion to the scale originated by Wolf. 
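
The definition R = K (10g + s) can be written as a one-line function. The group count, spot count and scale factor used below are made-up numbers for illustration, and `relative_sunspot_number` is a hypothetical helper name.

```python
def relative_sunspot_number(groups, spots, k=1.0):
    """Relative sunspot number R = K * (10 * g + s)."""
    return k * (10 * groups + spots)


# Hypothetical observation: 3 groups, 12 individual spots, observer factor 0.6
example_r = relative_sunspot_number(groups=3, spots=12, k=0.6)
print(example_r)
```

Because each group counts ten times, a single isolated spot (one group containing one spot) already gives R = 11 for K = 1.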

### F10.7 index
The solar radio flux at 10.7 cm (2800 MHz) is an excellent indicator of solar activity. The F10.7 index correlates well with the sunspot number as well as a number of UltraViolet (UV) and visible solar irradiance records. Unlike many solar indices, the F10.7 radio flux can easily be measured reliably on a day-to-day basis from the Earth’s surface, in all types of weather. Reported in “solar flux units”, (s.f.u.), the F10.7 can vary from below 50 s.f.u., to above 300 s.f.u., over the course of a solar cycle. The F10.7 Index has proven very valuable in specifying and forecasting space weather. Because it is a long record, it provides climatology of solar activity over six solar cycles. Because it originates in the chromosphere and corona of the sun, it tracks other important emissions that form in the same regions of the solar atmosphere. The Extreme UltraViolet (EUV) emissions that impact the ionosphere and modify the upper atmosphere track well with the F10.7 index. Many Ultra-Violet emissions that affect the stratosphere and ozone also correlate with the F10.7 index. 


In [0]:
# convert year and doy to date and put it to index
swdf.index = pd.to_datetime(swdf['year'] * 1000 + swdf['doy'], format='%Y%j')
swdf.index +=  pd.to_timedelta(swdf['hour'], unit='h')
swdf.drop(['year', 'doy', 'hour'], axis=1, inplace=True)

print(swdf.shape)
print(swdf.head())
print(swdf.tail())
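
The year / day-of-year / hour conversion can be checked on a few hand-made rows. The sketch below uses a hypothetical `toy` frame with the same column layout; the values are not taken from the OMNIWeb file.

```python
import pandas as pd

# Hypothetical rows in the same year / day-of-year / hour layout
toy = pd.DataFrame({'year': [1975, 1975, 1976],
                    'doy': [1, 32, 366],
                    'hour': [0, 12, 23]})

# year * 1000 + doy gives e.g. 1975032, which '%Y%j' reads as year 1975, day 32
dates = pd.to_datetime((toy['year'] * 1000 + toy['doy']).astype(str), format='%Y%j')

# add the hour of day as a time offset
stamps = dates + pd.to_timedelta(toy['hour'], unit='h')
print(stamps)
```

Day 32 of 1975 maps to 1 February, and day 366 of the leap year 1976 maps to 31 December.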


In [0]:
# get basic statistics of DF
swdf.describe()

## Exercise I. - Plot space weather indexes to get an impression of them

In [0]:
# quick look of the feature (Kp)

# write your code here

           
           

## Solution I. 
Open only after you have tried to find the solution on your own.

In [0]:
# plot basic histograms of space weather indexes
swdf.plot.hist(grid=True, bins=100, rwidth=0.9, color=['dimgray', 'dimgray', 'dimgray', 'dimgray'], subplots=True, layout=(4,1), figsize=(12,9), xlim=(-200,400))

In [0]:
# quick look of the feature (Kp)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Kp*10 index', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(100, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(swdf.index, swdf.Kp, color='black', linewidth = 1)

In [0]:
# quick look of the feature (R)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('R Sunspot Number', fontsize=14)
plt.xlim(start_year, end_year)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(swdf.index, swdf.R, color='black', linewidth = 1)

In [0]:
# quick look of the feature (Dst)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Dst index [nT]', fontsize=14)
plt.xlim(start_year, end_year)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(swdf.index, swdf.Dst, color='black', linewidth = 1)

In [0]:
# quick look of the feature (F10.7)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Solar index F10.7', fontsize=14)
plt.xlim(start_year, end_year)
plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(swdf.index, swdf.F107, color='black', linewidth = 1)

## 4) Calculate the Earth - Sun position

In [0]:
# Create the dataframe with the same dates as they are used above
# epdf means Earth's Position Data Frame

epdf = pd.DataFrame()
epdf['date'] = pd.date_range(start=start_year, end=end_year, freq='H')

print(epdf.shape)
print(epdf.head())
print(epdf.tail())


In [0]:
# Define the method for calculation of distance of the Earth from Sun

def calculate_distance_sun_earth(datestr):
    '''
    Calculates distance between Sun and Earth in astronomical units (AU) for a given
    date. Date needs to be a string using format YYYY-MM-DD or datetime object.
    '''
    import ephem
    sun = ephem.Sun()
    if isinstance(datestr, str):
        sun.compute(datetime.datetime.strptime(datestr, '%Y-%m-%d').date())
    elif isinstance(datestr, datetime.datetime):
        sun.compute(datestr)
    sun_distance = sun.earth_distance  # will be between 0.9832898912 AU and 1.0167103335 AU
    return sun_distance
  

# print example of method output for one date  
calculate_distance_sun_earth(epdf.date[0])
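
As a sanity check that does not need `ephem`, the distance can also be approximated from the day of year. The sketch below uses a common first-order formula, `r = 1 - 0.0167 * cos(2 * pi * (doy - 4) / 365.25)`, which assumes perihelion around 4 January and an orbital eccentricity of about 0.0167. `approx_sun_earth_distance` is a hypothetical helper and is less accurate than the ephemeris calculation.

```python
import datetime
import math


def approx_sun_earth_distance(date):
    """Approximate Sun-Earth distance in AU from the day of year (first-order formula)."""
    doy = date.timetuple().tm_yday
    return 1.0 - 0.0167 * math.cos(2.0 * math.pi * (doy - 4) / 365.25)


jan = approx_sun_earth_distance(datetime.datetime(1975, 1, 4))  # near perihelion
jul = approx_sun_earth_distance(datetime.datetime(1975, 7, 4))  # near aphelion
print(jan, jul)
```

Both values should stay inside the range quoted in the function above, with the smaller distance in January.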

In [0]:
# Apply defined method to each element in the data frame
epdf['AU'] = epdf['date'].apply(calculate_distance_sun_earth)
print(epdf.head())

In [0]:
# quick look of the feature (Sun - Earth distance)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Sun - Earth distance [AU]', fontsize=14)
plt.xlim(start_year, end_year)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(epdf.date, epdf.AU, color='black', linewidth = 1)

The airglow intensities were measured in Abastumani (Georgia).

![alt text](http://www.ilike2learn.com/ilike2learn/Continent%20Maps/Country%20Maps/Europe/Georgia.gif)


In [0]:
# calculate Sun altitude above horizon 
def calculate_sun_altitude(datestr):
    '''
    Calculates Sun altitude for specific position and for a given datetime.
    '''
    import ephem
    obs = ephem.Observer()
    obs.lat = '41.75' # coordinates of Abastumani (Georgia) in decimal degrees ('41:75' would be read as 41 deg 75 min)
    obs.long = '42.83'
    obs.elev = 1300
    obs.pressure = 0 # atmospheric refraction correction is turned off
    obs.date = datestr
    sun = ephem.Sun()
    sun.compute(obs)
    sun_alt = math.degrees(sun.alt)  # in degrees will be between -90 and 90
    return sun_alt
  
# print example of method output for one date 
calculate_sun_altitude(epdf.date[0])

In [0]:
# Apply defined method to each element in data frame and create column with Sun Altitude (SA)
epdf['SA'] = epdf['date'].apply(calculate_sun_altitude)
print(epdf.head())
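
One possible use of the `SA` column is to flag dark-sky hours. The sketch below works on a hypothetical `toy` frame with made-up altitudes and uses -18 degrees, the conventional end of astronomical twilight, as the limit. Choosing that limit for airglow data is an assumption of this example, not something fixed by the notebook.

```python
import pandas as pd

# Hypothetical Sun altitudes in degrees at four times of one day
toy = pd.DataFrame(
    {'SA': [35.0, -5.0, -18.5, -40.0]},
    index=pd.to_datetime(['1975-01-01 10:00', '1975-01-01 15:00',
                          '1975-01-01 17:00', '1975-01-01 22:00']),
)

NIGHT_LIMIT_DEG = -18.0  # assumed threshold: Sun more than 18 degrees below the horizon

toy['is_night'] = toy['SA'] < NIGHT_LIMIT_DEG
print(toy)
```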

In [0]:
# select time interval
start_day = datetime.datetime.strptime('1975-01-01 00:00', "%Y-%m-%d %H:%M")
end_day = datetime.datetime.strptime('1976-01-01 12:00', "%Y-%m-%d %H:%M")

In [0]:
# set date column as index of DF
epdf = epdf.set_index('date')

In [0]:
# quick look of the feature (Sun Altitude)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Sun Altitude [deg]', fontsize=14)
plt.xlim(start_day, end_day)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y-%m')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(epdf.index, epdf.SA, color='black', linewidth = 1)

## 5) Open the file with upper atmosphere parameters (features) and improve the data frame

In [0]:
# read data that were downloaded from https://ccmc.gsfc.nasa.gov/modelweb/models/nrlmsise00.php
# latitude= 41, longitude= 42, height= 100 km
# AtMosphere Data Frame (amdf)

col_names_am = ['year', 'month', 'day', 'hour', 'height', 'O', 'N2', 'O2', 'density', 'temperature', 'H', 'F107d', 'apd']
amdf = pd.read_csv(root_path + 'nrlmsise-00_1975-1993.txt', sep=r'\s+', names=col_names_am)

# add '0' to months and days to have always 2 characters for the month and 2 characters for the day
amdf['month']=amdf['month'].apply(lambda x: '{0:0>2}'.format(x))
amdf['day']=amdf['day'].apply(lambda x: '{0:0>2}'.format(x))

# convert type of date columns to string
amdf[['year', 'month', 'day']] = amdf[['year', 'month', 'day']].astype(str)
print(amdf.dtypes)

# convert date columns to datetime format
amdf.index = pd.to_datetime(amdf['year'] + amdf['month'] + amdf['day'], format='%Y%m%d')

# drop unneeded columns
amdf.drop(['year', 'month', 'day', 'hour', 'height'], axis=1, inplace=True)

print(list(amdf))
print(amdf.shape)
print(amdf.head())
print(amdf.tail())
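
The zero padding and string concatenation above can be avoided, because `pd.to_datetime` also accepts a data frame with columns named `year`, `month` and `day`. A minimal sketch on a hypothetical `toy` frame (the oxygen values are made up):

```python
import pandas as pd

# Hypothetical rows with separate integer date columns
toy = pd.DataFrame({'year': [1975, 1975],
                    'month': [1, 12],
                    'day': [5, 31],
                    'O': [4.1e11, 3.9e11]})

# pandas assembles the datetimes directly from the year / month / day columns
toy.index = pd.to_datetime(toy[['year', 'month', 'day']])
toy = toy.drop(columns=['year', 'month', 'day'])
print(toy)
```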



In [0]:
# quick look of the feature (O = atomic oxygen)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Atomic oxygen', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amdf.index, amdf.O, '.', color='black', linewidth = 0)

## Exercise II. 

*   Resample data to get resolution from days to hours
*   Get an impression of the data (make some quick views)



In [0]:
# write your code here

## Solution II.
Open only after you have tried to find the solution on your own.

In [0]:
# resample data from days to hours (use linear interpolation)
# AtMospheric Hourly Data Frame (amhdf)
amhdf = amdf.resample('H').interpolate()

print(amhdf.shape)
print(amhdf.head())
print(amhdf.tail())
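
The effect of upsampling with linear interpolation is easiest to see on a tiny series. The sketch below uses three hypothetical daily values chosen so that the interpolated value equals the number of hours elapsed.

```python
import pandas as pd

# Hypothetical daily values: 0 at day 1, 24 at day 2, 48 at day 3
daily = pd.Series([0.0, 24.0, 48.0],
                  index=pd.to_datetime(['1975-01-01', '1975-01-02', '1975-01-03']))

# upsample to hourly steps and fill the new rows by linear interpolation
hourly = daily.resample('h').interpolate()
print(hourly.head(8))
print(len(hourly))
```

The hourly series stops at 00:00 of the last input day, so that day contributes a single row and its remaining 23 hours are not generated.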

In [0]:
# get basic statistics of DF
amhdf.describe()

In [0]:
# quick look of the feature (O = atomic oxygen)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('O Atomic oxygen [cm-3]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.O, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (O2 = molecular oxygen)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('O2 Molecular oxygen [cm-3]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.O2, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (N2 = molecular nitrogen)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('N2 Molecular nitrogen [cm-3]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.N2, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (H = hydrogen)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('H Hydrogen [cm-3]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.H, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (Mass_density, g cm-3)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Mass density [g cm-3]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.density, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (Temperature_neutral, K)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('Temperature (neutral) [K]', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.temperature, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (F10.7 daily means)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('F10.7', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.F107d, '.', color='black', linewidth = 0)

In [0]:
# quick look of the feature (ap index daily mean)
plt.figure(figsize=(18, 6))
plt.xlabel('Time (UTC)', fontsize=14)
plt.ylabel('ap', fontsize=14)
plt.xlim(start_year, end_year)
#plt.ylim(0, 400)
plt.grid()
ax=plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
plt.plot(amhdf.index, amhdf.apd, '.', color='black', linewidth = 0)

In [0]:
# F10.7 daily means can be dropped as we will use data from omniweb
amhdf.drop(['F107d'], axis=1, inplace=True)

print(amhdf.shape)
print(amhdf.head())
print(amhdf.tail())

## 6) Concatenate all dataframes

In [0]:
# Concatenate dataframes with features
# Final Features Data Frame (ffdf)
ffdf = pd.concat([swdf, epdf, amhdf], axis=1)


# in amhdf 23 values are missing - probably because the hourly resampling stops at 00:00 of the last day. So we can remove rows that contain NaN
ffdf.dropna(inplace=True)

# add index name
ffdf.index.name = 'date'

print(list(ffdf.columns.values)) 
print(ffdf.shape)
print(ffdf.head(10))
print(ffdf.tail(10))
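
Concatenating along `axis=1` aligns the frames on their index and fills the cells that one frame lacks with NaN. A minimal sketch with two hypothetical frames, where the second one is an hour shorter (all values are made up):

```python
import pandas as pd

idx = pd.to_datetime(['1975-01-01 00:00', '1975-01-01 01:00', '1975-01-01 02:00'])

a = pd.DataFrame({'Kp': [10, 13, 17]}, index=idx)
b = pd.DataFrame({'O': [4.0e11, 4.1e11]}, index=idx[:2])  # one hour shorter

joined = pd.concat([a, b], axis=1)  # outer join on the index, missing cells become NaN
complete = joined.dropna()          # keep only hours where every feature is present
print(joined)
print(complete.shape)
```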

In [0]:
# save final features data frame to csv file
ffdf.to_csv(root_path + 'feautures_airglow_prediction.csv', sep=',')
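
When the saved file is loaded again, the `date` column has to be turned back into a datetime index. The sketch below shows the round trip on a hypothetical `toy` frame and uses an in-memory buffer instead of a file on Google Drive, so the path and contents are assumptions of the example.

```python
import io
import pandas as pd

# Hypothetical frame with a datetime index named 'date', as in ffdf
toy = pd.DataFrame({'Kp': [10, 13]},
                   index=pd.to_datetime(['1975-01-01 00:00', '1975-01-01 01:00']))
toy.index.name = 'date'

# write to an in-memory buffer instead of a real file
buffer = io.StringIO()
toy.to_csv(buffer, sep=',')
buffer.seek(0)

# index_col restores the index, parse_dates converts it back to datetimes
restored = pd.read_csv(buffer, index_col='date', parse_dates=True)
print(restored.index)
```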