<h1><center>27 Day F10.7 Forecast Verification (Jan 2020 - Jul 2023)</center></h1>
<br>
<center>John Mayers, Physical Scientist <br> NOAA Space Weather Prediction Center</center>

The Space Weather Prediction Center produces 27 Day geomagnetic and radio forecasts found in the ["Weekly Highlights and 27-day Forecast"](https://www.swpc.noaa.gov/products/weekly-highlights-and-27-day-forecast). These forecast are not archived in a database and can only be found in these PDF documents. In order to assess forecast accuracy and skill, the forecasts have to be extracted from these PDF files into a readable format. This fist part of this program will extract the forecasts from the PDF files (found on pages 4 and 5) using tabula-py then convert them into text files. The text files will then be read into Pandas dataframes for additional analysis. Once cleaned, these forecasts can be compared against observed values in order to obtain accuracy and skill scores. Observed F10.7 data were collected from the Government of Canada.

In [2]:
#pip install tabula-py

In [128]:
import pandas as pd
import tabula
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
import re
    
from tabula.io import read_pdf
from tabulate import tabulate
from tabula.io import convert_into
from tqdm import tqdm
from datetime import datetime, timedelta
from numpy import nan
from matplotlib.pyplot import figure
from re import match
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression

<h3>Navigating to files</h3>

**Step 1**: Change the working directory

In [3]:
#change working directory to location of pdf files and set to variable
    
os.chdir('C:/Users/john.mayers/Documents/27_Day/Data/2020_23_WeeklyPDF/')
directory = 'C:/Users/john.mayers/Documents/27_Day/Data/2020_23_WeeklyPDF/'
print(f'The current working directory is {os. getcwd()}.')

The current working directory is C:\Users\john.mayers\Documents\27_Day\Data\2020_23_WeeklyPDF.


**Step 2**: Confirm the number of PDF files in the directory to be converted

In [4]:
pdf_ls = []
for file in glob.glob("*.pdf"):
    pdf_ls.append(file)
pdf_ls

num_pdfs = len(pdf_ls)

print(f'The are {num_pdfs} PDF files in this folder that will be converted to TXT.')

The are 185 PDF files in this folder that will be converted to TXT.


<h3>Convert PDF "Weeklys" to TXT files that capture forecast tables </h3>

A key was prepared showing which page the table appears on

In [6]:
key = pd.read_excel('C:/Users/john.mayers/Documents/27_Day/Data/27day_key.xlsx')
key.head()

Unnamed: 0,fname,page
0,prf2314,4
1,prf2315,4
2,prf2316,4
3,prf2317,4
4,prf2318,4


Processing tables on page 4

In [17]:
p4_files = key.loc[key['page'] == 4]
p4_files.head()

Unnamed: 0,fname,page
0,prf2314,4
1,prf2315,4
2,prf2316,4
3,prf2317,4
4,prf2318,4


In [28]:
p4_files = p4_files['fname'].tolist()

In [30]:
output_p4=[]

for file in range(len(p4_files)):
    w = os.path.join(directory + p4_files[file]+ ".txt")
    output_p4.append(w)  

In [31]:
input_p4 = list(map(lambda x: x.replace('txt', 'pdf'), output_p4))

In [25]:
#using tabula-py convert_into() function to convert PDF files in dir to TXT

for pdf in tqdm(directory):
    for i in range(len(input_p4)):
        convert_into(input_p4[i], output_p4[i], pages=4) # iterating through input and output filenames from lists above

100%|██████████████████████████████████████████████████████████████████████████████████| 61/61 [01:19<00:00,  1.30s/it]


Processing tables on page 5

In [32]:
p5_files = key.loc[key['page'] == 5]
p5_files.head()

Unnamed: 0,fname,page
38,prf2352,5
43,prf2357,5
44,prf2358,5
46,prf2360,5
61,prf2375,5


In [33]:
p5_files = p5_files['fname'].tolist()

In [35]:
output_p5=[]

for file in range(len(p5_files)):
    w = os.path.join(directory + p5_files[file]+ ".txt")
    output_p5.append(w)

In [37]:
input_p5 = list(map(lambda x: x.replace('txt', 'pdf'), output_p5))

In [38]:
for pdf in tqdm(directory):
    for i in range(len(input_p5)):
        convert_into(input_p5[i], output_p5[i], pages=5)

100%|██████████████████████████████████████████████████████████████████████████████████| 61/61 [00:24<00:00,  2.45it/s]


Processing tables on page 6

In [39]:
p6_files = key.loc[key['page'] == 6]
p6_files.head()

Unnamed: 0,fname,page
68,prf2382,6
95,prf2409,6
96,prf2410,6
107,prf2421,6
110,prf2424,6


In [40]:
p6_files = p6_files['fname'].tolist()

In [42]:
output_p6=[]

for file in range(len(p6_files)):
    w = os.path.join(directory + p6_files[file]+ ".txt")
    output_p6.append(w)

In [44]:
input_p6 = list(map(lambda x: x.replace('txt', 'pdf'), output_p6)) # new input file names

In [45]:
for pdf in tqdm(directory):
    for i in range(len(input_p6)):
        convert_into(input_p6[i], output_p6[i], pages=6)

100%|██████████████████████████████████████████████████████████████████████████████████| 61/61 [00:16<00:00,  3.76it/s]


Processing tables on page 7

In [50]:
p7_files = key.loc[key['page'] == 7]
p7_files.head()

Unnamed: 0,fname,page
117,prf2431,7
137,prf2451,7


In [51]:
p7_files = p7_files['fname'].tolist()

In [53]:
output_p7=[]

for file in range(len(p7_files)):
    w = os.path.join(directory + p7_files[file]+ ".txt")
    output_p7.append(w)

In [55]:
input_p7 = list(map(lambda x: x.replace('txt', 'pdf'), output_p7)) # new input file names

In [56]:
for pdf in tqdm(directory):
    for i in range(len(input_p7)):
        convert_into(input_p7[i], output_p7[i], pages=7)

100%|██████████████████████████████████████████████████████████████████████████████████| 61/61 [00:01<00:00, 47.19it/s]


In [58]:
f=[]

for file in os.listdir(directory):
    if file.endswith(".txt"):
        f.append(file)

Reading all text files into a dataframe

In [59]:
main_df = pd.DataFrame(pd.read_csv(f[0])) # this cell will not run if there is an issue with the text files
  
for i in range(1,len(f)):
    data = pd.read_csv(f[i],header=None)
    df = pd.DataFrame(data)
    main_df = pd.concat([main_df,df], axis=1)
main_df.head()

Unnamed: 0,06 Jan,72,12,4,Unnamed: 4,20 Jan,70,5,2,Unnamed: 9,...,0,1,2.1,3,4.1,5.1,6,7,8,9
0,7.0,72.0,5.0,2.0,,21,70.0,5.0,2.0,,...,17 Jul,180,18,5,,31 Jul,170.0,5.0,2.0,
1,8.0,72.0,8.0,3.0,,22,70.0,5.0,2.0,,...,18,182,25,5,,01 Aug,165.0,5.0,2.0,
2,9.0,72.0,8.0,3.0,,23,70.0,5.0,2.0,,...,19,178,15,4,,02,165.0,5.0,2.0,
3,10.0,72.0,8.0,3.0,,24,70.0,5.0,2.0,,...,20,170,10,3,,03,165.0,10.0,3.0,
4,11.0,72.0,5.0,2.0,,25,71.0,5.0,2.0,,...,21,172,8,3,,04,165.0,8.0,3.0,


In [60]:
# #creating a list of output_filenames from PDF filenames in dir but with .txt extension

# filenames = os.listdir() #list all PDF files in folder and save to variable
# filenames = [i.split('.', 1)[0] for i in filenames] #remove file extension from list of files and resave to variable

# output_p4=[]

# for i in range(len(filenames)):
#     w = os.path.join(directory + filenames[i]+ ".txt") #concat directory, filenames with .txt extension
#     output_p4.append(w)

In [61]:
# #creating a list of input_filenames

# input_p4=[]

# for i in range(len(filenames)):
#     r = os.path.join(directory + filenames[i]+ ".pdf") #concat directory, filenames, with .pdf extension
#     input_p4.append(r)

In [62]:
# #using tabula-py convert_into() function to convert PDF files in dir to TXT

# for pdf in tqdm(directory):
#     for i in range(len(ls)):
#         convert_into(input_p4[i], output_p4[i], pages=4) #iterating through input and output filenames from lists above

In [63]:
# #verifying conversion was successful. this will onl

# num_txt = len(glob.glob1(directory,"*.txt"))

# if num_pdfs == num_txt:
#     print("All PDF files were successfully converted to TXT files")
# else: 
#     print("Some PDF files were not successfully converted")

In [64]:
# f=[]

# for file in os.listdir(directory):
#     if file.endswith(".txt"):
#         f.append(file)
# print(f)

In [65]:
# file2 = open(f[1], 'r')
# #print(file2.read())


# print(f'''Here is what a correct table should look like.

# {file2.read()}
# ''')

# file2.close()

In [25]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()

In [26]:
# zero =[]

# for s in range(len(readfiles)):
#     if (len(readfiles[s]) == 0):
#         zero.append(s)
# #zero = [int(i) for i in zero] # these are the incorrect files corresponding to the element in the list. 
# print(f'There are {len(zero)} files that are 0 KB and need to be reprocessed.')

In [66]:
# files_0 = [f for f in f if os.path.getsize(f) == 0] # files with no contents
# files_0

In [28]:
# fil1 = [] # determining the filenames of the PDFs that needs to be re-processed
# for file in files_0:
#     fil1.append(ls1[val])
# fil1 # these will be the new output directories to pass through convert_into

In [67]:
# #fil_copy = fil.copy()

# input_new = list(map(lambda x: x.replace('txt', 'pdf'), files_0))
# input_new # new output directories to pass through convert_into

In [15]:
# # delete the bad 0KB textfiles
# for txtfile in fil: 
#     os.remove(txtfile)

In [68]:
# input_p5 = input_new[0:4]
# input_p5

In [69]:
# output_p5 = files_0[0:4]
# output_p5

In [70]:
# input_p6 = input_new[-2:]
# input_p6

In [71]:
# output_p6 = files_0[-2:]
# output_p6

In [72]:
# for pdf in tqdm(directory):
#     for i in range(len(input_p5)):
#         convert_into(input_p5[i], output_p5[i], pages=5) 

In [73]:
# for pdf in tqdm(directory):
#     for i in range(len(input_p6)):
#         convert_into(input_p6[i], output_p6[i], pages=6) 

In [74]:
# x = [f for f in f if os.path.getsize(f) == 0]
# if len(x) == 0:
#     print('All 0 KB txt files have been reprocessed correctly and no 0 KB txt files remain.')
# else:
#     print('0 KB txt files remain. Try again.')

In [75]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()
# readfiles

In [76]:
# l=[] # num of chars each file
# for table in readfiles:
#     l.append(len(table))
# l

In [48]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()
# readfiles

In [77]:
# filtered_375 =[] # rerunning filter

# for ele in range(len(readfiles)):
#     if (len(readfiles[ele]) > 352 or len(readfiles[ele]) < 300):
#         filtered_375.append(ele)
# #filtered_350 = [int(i) for i in filtered_350] 

# print(f'''

# There are {len(filtered_375)} files that have more than 352 or less than 300 characters and need to be reprocessed because the table is on a page other than 4.

# The bad files have the following list indices: {filtered_375}

# ''')

In [78]:
# for idx in filtered_375: # previewing all the files that need to be reprocessed
#     print(readfiles[idx])

In [79]:
# output_p5a = [] # locating the filename of the PDF with the problem based on known index
# for idx in filtered_375:
#     output_p5a.append(f[idx])
# print(f'''These {len(output_p5a)} files need to be reprocessed.
# {output_p5a}
# ''')

In [80]:
# input_p5a = list(map(lambda x: x.replace('txt', 'pdf'), output_p5a)) # new input file names

In [81]:
# for pdf in tqdm(directory): # rerunning conversion on page 5 first
#     for i in range(len(input_p5a)):
#         convert_into(input_p5a[i], output_p5a[i], pages=5) 

In [82]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()
# readfiles

In [83]:
# filtered_375 =[] # rerunning filter

# for ele in range(len(readfiles)):
#     if (len(readfiles[ele]) > 375 or len(readfiles[ele]) < 300):
#         filtered_375.append(ele)
# #filtered_350 = [int(i) for i in filtered_350] 

# print(f'''

# There are {len(filtered_375)} files that have more than 375 or less than 300 characters and need to be reprocessed on page 6.

# The bad files have the following list indices: {filtered_375}

# ''')

In [84]:
# for idx in filtered_375: # previewing all the files that need to be reprocessed
#     print(readfiles[idx])

In [85]:
# output_p6a = [] 
# for idx in filtered_375:
#     output_p6a.append(f[idx])
# print(f'''These {len(output_p6a)} files need to be reprocessed.
# {output_p6a}
# ''')

In [86]:
# input_p6a = list(map(lambda x: x.replace('txt', 'pdf'), output_p6a))

In [87]:
# for pdf in tqdm(directory): # rerunning conversion on page 6
#     for i in range(len(input_p6a)):
#         convert_into(input_p6a[i], output_p6a[i], pages=6) 

In [88]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()
# readfiles

In [89]:
# filtered_375 =[] # rerunning filter

# for ele in range(len(readfiles)):
#     if (len(readfiles[ele]) > 375 or len(readfiles[ele]) < 300):
#         filtered_375.append(ele)
# #filtered_350 = [int(i) for i in filtered_350] 

# print(f'''

# There are {len(filtered_375)} files that need to be processed on page 7.

# ''')

In [90]:
# for table in readfiles: # scan for bad tables
#     print(table)

In [91]:
# l=[]
# for table in readfiles:
#     l.append(len(table))
# l

In [92]:
# output_p7 = [f[117], f[137]] 

In [93]:
# input_p7 = list(map(lambda x: x.replace('txt', 'pdf'), output_p7))

In [94]:
# for pdf in tqdm(directory): # rerunning conversion on page 7
#     for i in range(len(input_p7)):
#         convert_into(input_p7[i], output_p7[i], pages=7) 

In [95]:
# readfiles =[]
# for txtfile in f:
#     file = open(txtfile, 'r')
#     readfiles.append(file.read())
#     file.close()
# readfiles

In [96]:
# main_df = pd.DataFrame(pd.read_csv(f[0])) # this cell will not run if there is an issue with the text files
  
# for i in range(1,len(f)):
#     data = pd.read_csv(f[i],header=None)
#     df = pd.DataFrame(data)
#     main_df = pd.concat([main_df,df], axis=1)
# main_df.head()

<h3>Cleaning the Data for Analysis</h3>

<h4>Understanding the table structure</h4>

In [97]:
main_copy = main_df.copy() # make a copy of the df

In [98]:
print(f'This table has {main_copy.shape[0]} rows and {main_copy.shape[1]} columns.')

This table has 14 rows and 1850 columns.


<h4>Looking at a subset of the table corresponding to the first forecast</h4>

In [99]:
pdf1 = main_copy.iloc[:,0:10] #looking at the cols corresponding to the first PDF. Each PDF has 10 cols.
pdf1.head()

Unnamed: 0,06 Jan,72,12,4,Unnamed: 4,20 Jan,70,5,2,Unnamed: 9
0,7.0,72.0,5.0,2.0,,21,70.0,5.0,2.0,
1,8.0,72.0,8.0,3.0,,22,70.0,5.0,2.0,
2,9.0,72.0,8.0,3.0,,23,70.0,5.0,2.0,
3,10.0,72.0,8.0,3.0,,24,70.0,5.0,2.0,
4,11.0,72.0,5.0,2.0,,25,71.0,5.0,2.0,


<h4>Fixing col lables</h4>

In [100]:
main_copy.head()

Unnamed: 0,06 Jan,72,12,4,Unnamed: 4,20 Jan,70,5,2,Unnamed: 9,...,0,1,2.1,3,4.1,5.1,6,7,8,9
0,7.0,72.0,5.0,2.0,,21,70.0,5.0,2.0,,...,17 Jul,180,18,5,,31 Jul,170.0,5.0,2.0,
1,8.0,72.0,8.0,3.0,,22,70.0,5.0,2.0,,...,18,182,25,5,,01 Aug,165.0,5.0,2.0,
2,9.0,72.0,8.0,3.0,,23,70.0,5.0,2.0,,...,19,178,15,4,,02,165.0,5.0,2.0,
3,10.0,72.0,8.0,3.0,,24,70.0,5.0,2.0,,...,20,170,10,3,,03,165.0,10.0,3.0,
4,11.0,72.0,5.0,2.0,,25,71.0,5.0,2.0,,...,21,172,8,3,,04,165.0,8.0,3.0,


Every 5th col needs to be dropped

In [101]:
main = main_copy.loc[:, (np.arange(len(main_copy.columns)) + 1) % 5 != 0]
main.head()

Unnamed: 0,06 Jan,72,12,4,20 Jan,70,5,2,0,1,...,7,8,0.1,1.1,2.1,3,5.1,6,7.1,8.1
0,7.0,72.0,5.0,2.0,21,70.0,5.0,2.0,13 Jan,71,...,5.0,2.0,17 Jul,180,18,5,31 Jul,170.0,5.0,2.0
1,8.0,72.0,8.0,3.0,22,70.0,5.0,2.0,14,70,...,5.0,2.0,18,182,25,5,01 Aug,165.0,5.0,2.0
2,9.0,72.0,8.0,3.0,23,70.0,5.0,2.0,15,70,...,5.0,2.0,19,178,15,4,02,165.0,5.0,2.0
3,10.0,72.0,8.0,3.0,24,70.0,5.0,2.0,16,70,...,5.0,2.0,20,170,10,3,03,165.0,10.0,3.0
4,11.0,72.0,5.0,2.0,25,71.0,5.0,2.0,17,70,...,5.0,2.0,21,172,8,3,04,165.0,8.0,3.0


Creating a list of lists that repeats col labels based on the number of tables in the main df, then flattening that list of lists into a large list to pass into the df as the new col labels

In [102]:
new_cols = [['Date', 'Radio', 'Ap', 'Kp', 'Date', 'Radio', 'Ap', 'Kp']] 

k=int(len(main.columns)/8)

res = [ele for ele in new_cols for i in range(k)] #list of lists of col headers

In [103]:
flat_list = [item for sublist in res for item in sublist] # flattening into 1 large list

In [104]:
len(main.columns)

1480

In [105]:
row = main.columns # to be transferred into first row
row

Index(['06 Jan',     '72',     '12',      '4', '20 Jan',     '70',      '5',
            '2',        0,        1,
       ...
              7,        8,        0,        1,        2,        3,        5,
              6,        7,        8],
      dtype='object', length=1480)

In [106]:
main.loc[-1] = row # adding a row and setting it to row

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  main.loc[-1] = row # adding a row and setting it to row


In [107]:
main.index = main.index + 1 #shifting index

In [108]:
main = main.sort_index() # sorting by index

In [109]:
main.columns = flat_list # setting new col labels

In [110]:
main.head()

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1,Date.2,Radio.2,...,Ap.2,Kp.2,Date.3,Radio.3,Ap.3,Kp.3,Date.4,Radio.4,Ap.4,Kp.4
0,06 Jan,72.0,12.0,4.0,20 Jan,70.0,5.0,2.0,0,1,...,7.0,8.0,0,1,2,3,5,6.0,7.0,8.0
1,7.0,72.0,5.0,2.0,21,70.0,5.0,2.0,13 Jan,71,...,5.0,2.0,17 Jul,180,18,5,31 Jul,170.0,5.0,2.0
2,8.0,72.0,8.0,3.0,22,70.0,5.0,2.0,14,70,...,5.0,2.0,18,182,25,5,01 Aug,165.0,5.0,2.0
3,9.0,72.0,8.0,3.0,23,70.0,5.0,2.0,15,70,...,5.0,2.0,19,178,15,4,02,165.0,5.0,2.0
4,10.0,72.0,8.0,3.0,24,70.0,5.0,2.0,16,70,...,5.0,2.0,20,170,10,3,03,165.0,10.0,3.0


<h3>Inspecting individual forecasts from main df</h3>

In [111]:
main.iloc[:,0:8]

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1
0,06 Jan,72.0,12.0,4.0,20 Jan,70.0,5.0,2.0
1,7.0,72.0,5.0,2.0,21,70.0,5.0,2.0
2,8.0,72.0,8.0,3.0,22,70.0,5.0,2.0
3,9.0,72.0,8.0,3.0,23,70.0,5.0,2.0
4,10.0,72.0,8.0,3.0,24,70.0,5.0,2.0
5,11.0,72.0,5.0,2.0,25,71.0,5.0,2.0
6,12.0,71.0,5.0,2.0,26,72.0,5.0,2.0
7,13.0,70.0,5.0,2.0,27,72.0,5.0,2.0
8,14.0,70.0,12.0,4.0,28,72.0,5.0,2.0
9,15.0,70.0,12.0,4.0,29,72.0,5.0,2.0


In [112]:
main.iloc[:,8:16]

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1
0,0,1,2,3,5,6.0,7.0,8.0
1,13 Jan,71,8,3,27 Jan,72.0,5.0,2.0
2,14,70,10,4,28,72.0,5.0,2.0
3,15,70,10,4,29,72.0,5.0,2.0
4,16,70,8,3,30,72.0,5.0,2.0
5,17,70,5,2,31,72.0,5.0,2.0
6,18,70,5,2,01 Feb,72.0,10.0,3.0
7,19,70,5,2,02,72.0,10.0,4.0
8,20,70,5,2,03,72.0,10.0,4.0
9,21,70,5,2,04,72.0,10.0,4.0


In [113]:
main.iloc[:,16:24]

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1
0,0,1,2,3,5,6.0,7.0,8.0
1,20 Jan,72,12,4,03 Feb,72.0,10.0,3.0
2,21,72,12,4,04,72.0,10.0,3.0
3,22,72,10,3,05,72.0,10.0,3.0
4,23,72,5,2,06,71.0,5.0,2.0
5,24,72,5,2,07,71.0,5.0,2.0
6,25,72,5,2,08,71.0,5.0,2.0
7,26,72,5,2,09,71.0,5.0,2.0
8,27,72,5,2,10,71.0,5.0,2.0
9,28,72,5,2,11,71.0,5.0,2.0


Starting with col9, all values in row0 should be set to Nan. All these values should be replaced with NaN. They are erroneous.

In [114]:
main.iloc[:1,8:] 

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1,Date.2,Radio.2,...,Ap.2,Kp.2,Date.3,Radio.3,Ap.3,Kp.3,Date.4,Radio.4,Ap.4,Kp.4
0,0,1,2,3,5,6.0,7.0,8.0,0,1,...,7.0,8.0,0,1,2,3,5,6.0,7.0,8.0


In [115]:
main.iloc[:1,8:] = np.nan

In [116]:
main.head()

Unnamed: 0,Date,Radio,Ap,Kp,Date.1,Radio.1,Ap.1,Kp.1,Date.2,Radio.2,...,Ap.2,Kp.2,Date.3,Radio.3,Ap.3,Kp.3,Date.4,Radio.4,Ap.4,Kp.4
0,06 Jan,72.0,12.0,4.0,20 Jan,70.0,5.0,2.0,,,...,,,,,,,,,,
1,7.0,72.0,5.0,2.0,21,70.0,5.0,2.0,13 Jan,71.0,...,5.0,2.0,17 Jul,180.0,18.0,5.0,31 Jul,170.0,5.0,2.0
2,8.0,72.0,8.0,3.0,22,70.0,5.0,2.0,14,70.0,...,5.0,2.0,18,182.0,25.0,5.0,01 Aug,165.0,5.0,2.0
3,9.0,72.0,8.0,3.0,23,70.0,5.0,2.0,15,70.0,...,5.0,2.0,19,178.0,15.0,4.0,02,165.0,5.0,2.0
4,10.0,72.0,8.0,3.0,24,70.0,5.0,2.0,16,70.0,...,5.0,2.0,20,170.0,10.0,3.0,03,165.0,10.0,3.0


<h3>Cleaning up dates</h3>

The dates present a problem, since only the first date is printed, followed by the day only, until the next col or if the month changes before then.

In [117]:
main['Date']

Unnamed: 0,Date,Date.1,Date.2,Date.3,Date.4,Date.5,Date.6,Date.7,Date.8,Date.9,...,Date.10,Date.11,Date.12,Date.13,Date.14,Date.15,Date.16,Date.17,Date.18,Date.19
0,06 Jan,20 Jan,,,,,,,,,...,,,,,,,,,,
1,7.0,21,13 Jan,27 Jan,20 Jan,03 Feb,27 Jan,10 Feb,03 Feb,17 Feb,...,19 Jun,03 Jul,26 Jun,10 Jul,03 Jul,17 Jul,10 Jul,24 Jul,17 Jul,31 Jul
2,8.0,22,14,28,21,04,28,11,04,18,...,20,04,27,11,04,18,11,25,18,01 Aug
3,9.0,23,15,29,22,05,29,12,05,19,...,21,05,28,12,05,19,12,26,19,02
4,10.0,24,16,30,23,06,30,13,06,20,...,22,06,29,13,06,20,13,27,20,03
5,11.0,25,17,31,24,07,31,14,07,21,...,23,07,30,14,07,21,14,28,21,04
6,12.0,26,18,01 Feb,25,08,01 Feb,15,08,22,...,24,08,01 Jul,15,08,22,15,29,22,05
7,13.0,27,19,02,26,09,02,16,09,23,...,25,09,02,16,09,23,16,30,23,06
8,14.0,28,20,03,27,10,03,17,10,24,...,26,10,03,17,10,24,17,31,24,07
9,15.0,29,21,04,28,11,04,18,11,25,...,27,11,04,18,11,25,18,01 Aug,25,08


In the first forecast, the start date is row0, col1

In [118]:
c1 = main['Date'].iloc[0:1,:1]
c1

Unnamed: 0,Date
0,06 Jan


In the second forecast, and all following, the start date is row1, co12, col4, col6, etc.

In [119]:
d1 = main['Date'].iloc[1:2,2:3]
d1

Unnamed: 0,Date
1,13 Jan


<h3>Building a new df with dates and F10 forecasts</h3>

In order to avoid some tricky programming to fill in the table, the approach will be to populate 27 days lists beginning with the start time of each forecast. 

<h4>Extracting start times</h4>

In [120]:
c2 = pd.DataFrame(data=c1)
c2

Unnamed: 0,Date
0,06 Jan


In [121]:
c2_ls = c2['Date'].tolist()
c2_ls

['06 Jan']

In [122]:
c2_ls.append('2020')
c2_ls

['06 Jan', '2020']

In [123]:
start = c2_ls[0] + ' ' + c2_ls[1]
start

'06 Jan 2020'

In [124]:
format ='%d %b %Y'
dt = datetime.strptime(start, format).date() #start date

In [125]:
k = 27
 
res = []
 
for day in range(k):
    date = (dt + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
res

['06 Jan 2020',
 '07 Jan 2020',
 '08 Jan 2020',
 '09 Jan 2020',
 '10 Jan 2020',
 '11 Jan 2020',
 '12 Jan 2020',
 '13 Jan 2020',
 '14 Jan 2020',
 '15 Jan 2020',
 '16 Jan 2020',
 '17 Jan 2020',
 '18 Jan 2020',
 '19 Jan 2020',
 '20 Jan 2020',
 '21 Jan 2020',
 '22 Jan 2020',
 '23 Jan 2020',
 '24 Jan 2020',
 '25 Jan 2020',
 '26 Jan 2020',
 '27 Jan 2020',
 '28 Jan 2020',
 '29 Jan 2020',
 '30 Jan 2020',
 '31 Jan 2020',
 '01 Feb 2020']

<h4>Using Regex to find entries with months in dataframe</h4>

In [126]:
#import re

In [127]:
#from re import match

In [129]:
ls = main['Date'].iloc[:,0:1].values.tolist()
ls

[['06 Jan'],
 [7.0],
 [8.0],
 [9.0],
 [10.0],
 [11.0],
 [12.0],
 [13.0],
 [14.0],
 [15.0],
 [16.0],
 [17.0],
 [18.0],
 [19.0],
 [nan]]

In [130]:
regex = r"((\d+) [a-zA-Z]+)" # matches day and 3-digit month df
regex

'((\\d+) [a-zA-Z]+)'

In [131]:
flat_list1 = [item for sublist in ls for item in sublist]

In [132]:
new_list = [item for item in flat_list1 if not(pd.isnull(item)) == True] # removing nan in list

In [133]:
new_list1 = [str(i) for i in new_list] # converting each element to a string
new_list1

['06 Jan',
 '7.0',
 '8.0',
 '9.0',
 '10.0',
 '11.0',
 '12.0',
 '13.0',
 '14.0',
 '15.0',
 '16.0',
 '17.0',
 '18.0',
 '19.0']

In [134]:
e =list(filter(lambda x: match(regex, x), new_list1))
e

['06 Jan']

In [135]:
e.append('2020')
e

['06 Jan', '2020']

In [136]:
start = e[0] + ' ' + e[1]
start

'06 Jan 2020'

In [137]:
format ='%d %b %Y'
dt = datetime.strptime(start, format).date() #start date

<h4>With initial start date, populating 1st forecast dates</h4>

In [138]:
k = 27
 
res = []
 
for day in range(k):
    date = (dt + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
res[:5]

['06 Jan 2020', '07 Jan 2020', '08 Jan 2020', '09 Jan 2020', '10 Jan 2020']

So need to extract each forecast's start date

In [139]:
k = 27
 
date1 = []
 
for day in range(k):
    date = (dt + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    date1.append(date)
date1[:5]

['06 Jan 2020', '07 Jan 2020', '08 Jan 2020', '09 Jan 2020', '10 Jan 2020']

In [140]:
begin = date1[0]
begin

'06 Jan 2020'

In [141]:
Begindate = datetime.strptime(begin, "%d %b %Y")
Begindate

datetime.datetime(2020, 1, 6, 0, 0)

<h4>Manually calculating forecast start times from pattern</h4>

Pattern: Each successive forecast increases by 7 days from the previous forecast or n times 7 from the first forecast where n is the number of forecasts from the first.

In [142]:
Begindate2 = Begindate + timedelta(days=7)
Begindate2

datetime.datetime(2020, 1, 13, 0, 0)

In [143]:
Begindate3 = Begindate + timedelta(days=14)
Begindate3

datetime.datetime(2020, 1, 20, 0, 0)

In [144]:
Begindate4 = Begindate + timedelta(days=21)
Begindate4

datetime.datetime(2020, 1, 27, 0, 0)

In [145]:
Begindate5 = Begindate + timedelta(days=28)
Begindate5

datetime.datetime(2020, 2, 3, 0, 0)

<h4> Automatically calculating start times for n forecasts</h4>

In [146]:
#iterate the number of forecasts
Begindate = datetime.strptime(begin, "%d %b %Y")
mult = list(range(0,num_pdfs*7,7))
mult = mult[1:]

for j in mult: #multiples of 7 starting with 7
    x = Begindate + timedelta(days=j)
    print(x)

2020-01-13 00:00:00
2020-01-20 00:00:00
2020-01-27 00:00:00
2020-02-03 00:00:00
2020-02-10 00:00:00
2020-02-17 00:00:00
2020-02-24 00:00:00
2020-03-02 00:00:00
2020-03-09 00:00:00
2020-03-16 00:00:00
2020-03-23 00:00:00
2020-03-30 00:00:00
2020-04-06 00:00:00
2020-04-13 00:00:00
2020-04-20 00:00:00
2020-04-27 00:00:00
2020-05-04 00:00:00
2020-05-11 00:00:00
2020-05-18 00:00:00
2020-05-25 00:00:00
2020-06-01 00:00:00
2020-06-08 00:00:00
2020-06-15 00:00:00
2020-06-22 00:00:00
2020-06-29 00:00:00
2020-07-06 00:00:00
2020-07-13 00:00:00
2020-07-20 00:00:00
2020-07-27 00:00:00
2020-08-03 00:00:00
2020-08-10 00:00:00
2020-08-17 00:00:00
2020-08-24 00:00:00
2020-08-31 00:00:00
2020-09-07 00:00:00
2020-09-14 00:00:00
2020-09-21 00:00:00
2020-09-28 00:00:00
2020-10-05 00:00:00
2020-10-12 00:00:00
2020-10-19 00:00:00
2020-10-26 00:00:00
2020-11-02 00:00:00
2020-11-09 00:00:00
2020-11-16 00:00:00
2020-11-23 00:00:00
2020-11-30 00:00:00
2020-12-07 00:00:00
2020-12-14 00:00:00
2020-12-21 00:00:00


In [147]:
#iterate the number of forecasts
Begindate = datetime.strptime(begin, "%d %b %Y")
mult = list(range(0,num_pdfs*7,7)) #multiples of 7 
mult = mult[1:] #multiples of 7 starting with 7

ls=[]

for j in mult:
    x = Begindate + timedelta(days=j)
    ls.append(x)

<h4>List of dates corresponding to 2nd forecast</h4>

In [148]:
k = 27
 
res = []
 
for day in range(k):
    date = (ls[0] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)

<h4>List of dates corresponding to 3rd forecast</h4>

In [149]:
k = 27
 
res = []
 
for day in range(k):
    date = (ls[1] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)

<h4> Manually compiling all forecast dates into a list</h4>

In [150]:
k = 27
 
res = []

format ='%d %b %Y'
dt = datetime.strptime(start, format).date() 
 
for day in range(k):
    date = (dt + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
    
res
 
for day in range(k):
    date = (ls[0] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
res


for day in range(k):
    date = (ls[1] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
res

for day in range(k):
    date = (ls[2] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)
res

for day in range(k):
    date = (ls[3] + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res.append(date)

In [151]:
len(res)

135

<h4>Automatically compile n forecast dates into a list</h4>

In [152]:
k = 27 #iterates through all elements of ls to append dates in correct order

res2=[]

for i in range(len(ls)):
    for day in range(k):
        date = (ls[i] + timedelta(days = day))
        date = date.strftime('%d %b %Y')
        res2.append(date)

In [153]:
print(f'There are {len(res2)} dates. Recall the first forecast is missing.')

There are 4968 dates. Recall the first forecast is missing.


<h4>Adding in first forecast to list</h4>

In [154]:
k = 27 #now let's append the first forecast 

res11 = []
 
for day in range(k):
    date = (dt + timedelta(days = day))
    date = date.strftime('%d %b %Y')
    res11.append(date)

for i in range(len(ls)):
    for day in range(k):
        date = (ls[i] + timedelta(days = day))
        date = date.strftime('%d %b %Y')
        res11.append(date)
res11[:5]

['06 Jan 2020', '07 Jan 2020', '08 Jan 2020', '09 Jan 2020', '10 Jan 2020']

In [155]:
print(f'There are {len(res11)} dates corresponding to {num_pdfs*27} rows from {num_pdfs} forecasts.') #matches number of forecasts

There are 4995 dates corresponding to 4995 rows from 185 forecasts.


<h4>Extracting F10 forecast values from the main df and creating a new df</h4>

In [156]:
f10_wide = main['Radio']
f10_wide

Unnamed: 0,Radio,Radio.1,Radio.2,Radio.3,Radio.4,Radio.5,Radio.6,Radio.7,Radio.8,Radio.9,...,Radio.10,Radio.11,Radio.12,Radio.13,Radio.14,Radio.15,Radio.16,Radio.17,Radio.18,Radio.19
0,72.0,70.0,,,,,,,,,...,,,,,,,,,,
1,72.0,70.0,71.0,72.0,72.0,72.0,74.0,71.0,70.0,72.0,...,160.0,180.0,150.0,155.0,170.0,175.0,175.0,155.0,180.0,170.0
2,72.0,70.0,70.0,72.0,72.0,72.0,74.0,71.0,70.0,72.0,...,155.0,175.0,145.0,160.0,170.0,175.0,175.0,155.0,182.0,165.0
3,72.0,70.0,70.0,72.0,72.0,72.0,74.0,71.0,70.0,72.0,...,160.0,175.0,145.0,165.0,165.0,170.0,170.0,160.0,178.0,165.0
4,72.0,70.0,70.0,72.0,72.0,71.0,74.0,71.0,70.0,72.0,...,160.0,170.0,140.0,170.0,155.0,170.0,165.0,160.0,170.0,165.0
5,72.0,71.0,70.0,72.0,72.0,71.0,74.0,72.0,70.0,72.0,...,165.0,170.0,135.0,175.0,155.0,170.0,165.0,165.0,172.0,165.0
6,71.0,72.0,70.0,72.0,72.0,71.0,74.0,72.0,70.0,72.0,...,165.0,170.0,130.0,175.0,155.0,160.0,165.0,165.0,172.0,170.0
7,70.0,72.0,70.0,72.0,72.0,71.0,74.0,72.0,70.0,72.0,...,165.0,170.0,130.0,175.0,155.0,160.0,170.0,170.0,170.0,175.0
8,70.0,72.0,70.0,72.0,72.0,71.0,72.0,72.0,70.0,72.0,...,165.0,170.0,130.0,175.0,155.0,155.0,175.0,170.0,160.0,180.0
9,70.0,72.0,70.0,72.0,72.0,71.0,72.0,72.0,70.0,72.0,...,165.0,165.0,130.0,175.0,160.0,155.0,175.0,165.0,160.0,180.0


In [157]:
f10_long_nan = f10_wide.melt()
f10_long_nan

Unnamed: 0,variable,value
0,Radio,72
1,Radio,72.0
2,Radio,72.0
3,Radio,72.0
4,Radio,72.0
...,...,...
5545,Radio,180.0
5546,Radio,180.0
5547,Radio,175.0
5548,Radio,175.0


In [158]:
f10_long = f10_long_nan.dropna(axis=0) # delete all rows with nan

In [159]:
f10_long = f10_long.drop(['variable'], axis=1)
f10_long

Unnamed: 0,value
0,72
1,72.0
2,72.0
3,72.0
4,72.0
...,...
5544,180.0
5545,180.0
5546,180.0
5547,175.0


In [160]:
f10_long = f10_long.rename(columns={"value": "Kp"})

In [161]:
f10_long = f10_long.reset_index()

In [162]:
f10_long = f10_long.drop(['index'], axis=1)

In [163]:
final_df = pd.DataFrame(res11, columns=['res']) #res11 (dates) needs to be transformed into a pandas df

In [164]:
final_df = final_df.rename(columns={"res": "Date"})

In [165]:
final_df['Radio'] = f10_long

<h4>The final table that observed F10 can be added to for analysis</h4>

In [166]:
final_df.head(32)

Unnamed: 0,Date,Radio
0,06 Jan 2020,72.0
1,07 Jan 2020,72.0
2,08 Jan 2020,72.0
3,09 Jan 2020,72.0
4,10 Jan 2020,72.0
5,11 Jan 2020,72.0
6,12 Jan 2020,71.0
7,13 Jan 2020,70.0
8,14 Jan 2020,70.0
9,15 Jan 2020,70.0


Above you can see that the after Feb 1, the date does not continue with 2 Feb, but with 13 Feb, which is the starting date in the next forecast.

<h3>Importing Observed F10 Data</h3><br/>
Observed F10.7 from Government of Canada.

START HERE: DOWNLOAD F10 DATA FOR 2020-2023. The flux data below are only for 2020.

In [167]:
f10obs = pd.read_excel('C:/Users/john.mayers/Documents/27_Day/Data/noon_flux2020.xlsx')
f10obs.head()

Unnamed: 0,Date,Time,Julian day,Carrington\nrotation,Observed Flux,Adjusted Flux,URSI Flux,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Date.1,Noon Flux
0,2020-01-01,18:00:00,2458850.239,2225.833,71.5,69.2,62.3,,,,,,2020-01-02,71.9
1,2020-01-01,20:00:00,2458850.322,2225.836,71.8,69.4,62.5,,,,,,2020-01-03,71.2
2,2020-01-01,22:00:00,2458850.416,2225.83,71.8,69.5,62.5,,,,,,2020-01-04,72.2
3,2020-01-02,18:00:00,2458851.248,2225.86,71.1,68.8,61.9,,,,,,2020-01-05,71.8
4,2020-01-02,20:00:00,2458851.322,2225.872,71.9,69.5,62.6,,,,,,2020-01-06,70.5


In [168]:
f10obs = f10obs[['Date.1','Noon Flux']] 
f10obs.head()

Unnamed: 0,Date.1,Noon Flux
0,2020-01-02,71.9
1,2020-01-03,71.2
2,2020-01-04,72.2
3,2020-01-05,71.8
4,2020-01-06,70.5


<h3>Matching observed F10 with forecast F10</h3>

<h4>Creating a subset of obs that match up to the days corresponding to the n forecasts</h4>

In [169]:
Begindate + timedelta(days=7)

datetime.datetime(2020, 1, 13, 0, 0)

In [170]:
f10obs.iloc[5:27+5,0:2] # obs corresponding forecast 1

Unnamed: 0,Date.1,Noon Flux
5,2020-01-07,71.6
6,2020-01-08,73.7
7,2020-01-09,74.4
8,2020-01-10,72.8
9,2020-01-11,73.5
10,2020-01-12,71.9
11,2020-01-13,71.5
12,2020-01-14,71.9
13,2020-01-15,71.2
14,2020-01-16,71.8


In [171]:
f10obs.iloc[12:12+27,0:2] # obs corresponding to forecast 2

Unnamed: 0,Date.1,Noon Flux
12,2020-01-14,71.9
13,2020-01-15,71.2
14,2020-01-16,71.8
15,2020-01-17,70.1
16,2020-01-18,71.3
17,2020-01-19,71.8
18,2020-01-20,71.2
19,2020-01-21,70.5
20,2020-01-22,71.9
21,2020-01-23,70.8


In [172]:
f10obs.iloc[19:19+27,0:2] # obs corresponding to forecast 3

Unnamed: 0,Date.1,Noon Flux
19,2020-01-21,70.5
20,2020-01-22,71.9
21,2020-01-23,70.8
22,2020-01-24,71.0
23,2020-01-25,72.7
24,2020-01-26,74.7
25,2020-01-27,72.9
26,2020-01-28,74.2
27,2020-01-29,74.3
28,2020-01-30,74.1


<h4> Writing a loop to populate a list of obs corresponding to forecast dates </h4>

The pattern emerges... 7 gets added each iteration to the start and end position of the row

In [173]:
list(range(0,num_pdfs*7,7)) # recalling multiples of 7

[0,
 7,
 14,
 21,
 28,
 35,
 42,
 49,
 56,
 63,
 70,
 77,
 84,
 91,
 98,
 105,
 112,
 119,
 126,
 133,
 140,
 147,
 154,
 161,
 168,
 175,
 182,
 189,
 196,
 203,
 210,
 217,
 224,
 231,
 238,
 245,
 252,
 259,
 266,
 273,
 280,
 287,
 294,
 301,
 308,
 315,
 322,
 329,
 336,
 343,
 350,
 357,
 364,
 371,
 378,
 385,
 392,
 399,
 406,
 413,
 420,
 427,
 434,
 441,
 448,
 455,
 462,
 469,
 476,
 483,
 490,
 497,
 504,
 511,
 518,
 525,
 532,
 539,
 546,
 553,
 560,
 567,
 574,
 581,
 588,
 595,
 602,
 609,
 616,
 623,
 630,
 637,
 644,
 651,
 658,
 665,
 672,
 679,
 686,
 693,
 700,
 707,
 714,
 721,
 728,
 735,
 742,
 749,
 756,
 763,
 770,
 777,
 784,
 791,
 798,
 805,
 812,
 819,
 826,
 833,
 840,
 847,
 854,
 861,
 868,
 875,
 882,
 889,
 896,
 903,
 910,
 917,
 924,
 931,
 938,
 945,
 952,
 959,
 966,
 973,
 980,
 987,
 994,
 1001,
 1008,
 1015,
 1022,
 1029,
 1036,
 1043,
 1050,
 1057,
 1064,
 1071,
 1078,
 1085,
 1092,
 1099,
 1106,
 1113,
 1120,
 1127,
 1134,
 1141,
 1148,
 1155

In [174]:
dates =[]

for i in range(0,num_pdfs*7,7):
    r = f10obs.iloc[5 + (i):5 +(i) +27,0:2]
    dates.append(r)    

In [175]:
dates =[] 

for i in range(0,num_pdfs*7,7):
    r = f10obs.iloc[5 + (i):5 +(i) +27,1:2] # just the values without dates
    dates.append(r)

In [176]:
print(f'This "pandas list" has {len(dates)} elements but we need {num_pdfs *27 }, so we will flatten the list.')

This "pandas list" has 185 elements but we need 4995, so we will flatten the list.


In [177]:
dates[0].values.tolist()

[[71.6],
 [73.7],
 [74.4],
 [72.8],
 [73.5],
 [71.9],
 [71.5],
 [71.9],
 [71.2],
 [71.8],
 [70.1],
 [71.3],
 [71.8],
 [71.2],
 [70.5],
 [71.9],
 [70.8],
 [71.0],
 [72.7],
 [74.7],
 [72.9],
 [74.2],
 [74.3],
 [74.1],
 [73.9],
 [72.5],
 [72.2]]

Interating through each pandas list to convert to a list.

In [187]:
ll=[]

for l in range(num_pdfs):
    a = dates[l].values.tolist()
    ll.append(a)
ll

[[[71.6],
  [73.7],
  [74.4],
  [72.8],
  [73.5],
  [71.9],
  [71.5],
  [71.9],
  [71.2],
  [71.8],
  [70.1],
  [71.3],
  [71.8],
  [71.2],
  [70.5],
  [71.9],
  [70.8],
  [71.0],
  [72.7],
  [74.7],
  [72.9],
  [74.2],
  [74.3],
  [74.1],
  [73.9],
  [72.5],
  [72.2]],
 [[71.9],
  [71.2],
  [71.8],
  [70.1],
  [71.3],
  [71.8],
  [71.2],
  [70.5],
  [71.9],
  [70.8],
  [71.0],
  [72.7],
  [74.7],
  [72.9],
  [74.2],
  [74.3],
  [74.1],
  [73.9],
  [72.5],
  [72.2],
  [72.1],
  [70.3],
  [70.6],
  [71.3],
  [70.8],
  [72.0],
  [70.6]],
 [[70.5],
  [71.9],
  [70.8],
  [71.0],
  [72.7],
  [74.7],
  [72.9],
  [74.2],
  [74.3],
  [74.1],
  [73.9],
  [72.5],
  [72.2],
  [72.1],
  [70.3],
  [70.6],
  [71.3],
  [70.8],
  [72.0],
  [70.6],
  [70.2],
  [71.1],
  [71.6],
  [71.2],
  [71.3],
  [70.6],
  [70.5]],
 [[74.2],
  [74.3],
  [74.1],
  [73.9],
  [72.5],
  [72.2],
  [72.1],
  [70.3],
  [70.6],
  [71.3],
  [70.8],
  [72.0],
  [70.6],
  [70.2],
  [71.1],
  [71.6],
  [71.2],
  [71.3],
  [70.6

Flattening the list twice

In [188]:
flat=[]

for sublist in ll:
    for element in sublist:
        flat.append(element)
flat

[[71.6],
 [73.7],
 [74.4],
 [72.8],
 [73.5],
 [71.9],
 [71.5],
 [71.9],
 [71.2],
 [71.8],
 [70.1],
 [71.3],
 [71.8],
 [71.2],
 [70.5],
 [71.9],
 [70.8],
 [71.0],
 [72.7],
 [74.7],
 [72.9],
 [74.2],
 [74.3],
 [74.1],
 [73.9],
 [72.5],
 [72.2],
 [71.9],
 [71.2],
 [71.8],
 [70.1],
 [71.3],
 [71.8],
 [71.2],
 [70.5],
 [71.9],
 [70.8],
 [71.0],
 [72.7],
 [74.7],
 [72.9],
 [74.2],
 [74.3],
 [74.1],
 [73.9],
 [72.5],
 [72.2],
 [72.1],
 [70.3],
 [70.6],
 [71.3],
 [70.8],
 [72.0],
 [70.6],
 [70.5],
 [71.9],
 [70.8],
 [71.0],
 [72.7],
 [74.7],
 [72.9],
 [74.2],
 [74.3],
 [74.1],
 [73.9],
 [72.5],
 [72.2],
 [72.1],
 [70.3],
 [70.6],
 [71.3],
 [70.8],
 [72.0],
 [70.6],
 [70.2],
 [71.1],
 [71.6],
 [71.2],
 [71.3],
 [70.6],
 [70.5],
 [74.2],
 [74.3],
 [74.1],
 [73.9],
 [72.5],
 [72.2],
 [72.1],
 [70.3],
 [70.6],
 [71.3],
 [70.8],
 [72.0],
 [70.6],
 [70.2],
 [71.1],
 [71.6],
 [71.2],
 [71.3],
 [70.6],
 [70.5],
 [70.7],
 [71.0],
 [71.0],
 [70.8],
 [71.2],
 [72.3],
 [70.2],
 [70.3],
 [70.6],
 [71.3],
 

In [189]:
flat2=[]

for sublist in flat:
    for element in sublist:
        flat2.append(element)
flat2

[71.6,
 73.7,
 74.4,
 72.8,
 73.5,
 71.9,
 71.5,
 71.9,
 71.2,
 71.8,
 70.1,
 71.3,
 71.8,
 71.2,
 70.5,
 71.9,
 70.8,
 71.0,
 72.7,
 74.7,
 72.9,
 74.2,
 74.3,
 74.1,
 73.9,
 72.5,
 72.2,
 71.9,
 71.2,
 71.8,
 70.1,
 71.3,
 71.8,
 71.2,
 70.5,
 71.9,
 70.8,
 71.0,
 72.7,
 74.7,
 72.9,
 74.2,
 74.3,
 74.1,
 73.9,
 72.5,
 72.2,
 72.1,
 70.3,
 70.6,
 71.3,
 70.8,
 72.0,
 70.6,
 70.5,
 71.9,
 70.8,
 71.0,
 72.7,
 74.7,
 72.9,
 74.2,
 74.3,
 74.1,
 73.9,
 72.5,
 72.2,
 72.1,
 70.3,
 70.6,
 71.3,
 70.8,
 72.0,
 70.6,
 70.2,
 71.1,
 71.6,
 71.2,
 71.3,
 70.6,
 70.5,
 74.2,
 74.3,
 74.1,
 73.9,
 72.5,
 72.2,
 72.1,
 70.3,
 70.6,
 71.3,
 70.8,
 72.0,
 70.6,
 70.2,
 71.1,
 71.6,
 71.2,
 71.3,
 70.6,
 70.5,
 70.7,
 71.0,
 71.0,
 70.8,
 71.2,
 72.3,
 70.2,
 70.3,
 70.6,
 71.3,
 70.8,
 72.0,
 70.6,
 70.2,
 71.1,
 71.6,
 71.2,
 71.3,
 70.6,
 70.5,
 70.7,
 71.0,
 71.0,
 70.8,
 71.2,
 72.3,
 70.2,
 70.1,
 70.6,
 70.1,
 70.9,
 70.6,
 70.1,
 69.3,
 71.1,
 71.6,
 71.2,
 71.3,
 70.6,
 70.5,
 70.7,
 71.0,

In [197]:
final_df.shape

(4995, 2)

In [191]:
len(flat2)

4162

In [198]:
print(f'There are now {len(flat2)} observed values corresponding to {num_pdfs*27} forecasts in the correct order which can now be merged into 1 df.')

There are now 4162 observed values corresponding to 4995 forecasts in the correct order which can now be merged into 1 df.


<h4>Final Merge</h4>

In [182]:
complete_df = final_df.copy()

In [183]:
complete_df = complete_df.rename(columns={"Kp": "Forecast Kp"})
complete_df.head()

Unnamed: 0,Date,Radio
0,06 Jan 2020,72.0
1,07 Jan 2020,72.0
2,08 Jan 2020,72.0
3,09 Jan 2020,72.0
4,10 Jan 2020,72.0


In [184]:
complete_df['Observed F10'] = flat2
complete_df.head()

ValueError: Length of values (4162) does not match length of index (4995)

In [None]:
complete_df = complete_df.rename(columns={"Radio": "Forecast F10"})
complete_df.head()

In [None]:
#Rounding Observed Kp to nearest integer

complete_df['Observed F10'] = complete_df['Observed F10'].round()
complete_df.head()

<h3>Final Table for Analysis</h3>

In [None]:
complete_df.dropna(how="any", inplace=True)

In [None]:
complete_df['Forecast F10']= complete_df['Forecast F10'].astype('int')
complete_df['Observed F10']= complete_df['Observed F10'].astype('int')
complete_df.head()

<h3>Forecast Performance</h3>

In [None]:
complete_df["Forecast Error"] = complete_df["Forecast F10"] - complete_df["Observed F10"] 
complete_df.head()

# negative under forecast
# positive over forecast

In [None]:
complete_df['Abs Error'] = complete_df['Forecast Error'].abs()
complete_df.head()

In [None]:
last = final_df['Date'].iloc[-1]

print(f'For the {num_pdfs} forecasts, beginning on {begin} and ending on {last}, the average forecast error was {sum(complete_df["Abs Error"])/len(complete_df)} sfu.')

In [None]:
x = sum(complete_df["Forecast Error"])/len(complete_df)
y=abs(x)

if sum(complete_df["Forecast Error"])/len(complete_df) < 0:
    print(f'On average, F10 was underforecast by {x}.')
    
else:
    print(f'On average, F10 was overforecast by {y}.')
          

In [None]:
max_err = complete_df[['Abs Error']].max().tolist()
print(f' The max forecast error was {max_err} sfu.')

In [None]:
count5 = complete_df['Abs Error'][complete_df['Abs Error'] > 5].count()
count10 = complete_df['Abs Error'][complete_df['Abs Error'] > 10].count()
count20 = complete_df['Abs Error'][complete_df['Abs Error'] > 20].count()

In [None]:
print(f'The forecast was off by more the 5 sfu {count5} times, 10 sfu {count10} times and more than 20 sfu, {count20} times.')

In [None]:
perfect = complete_df['Forecast Error'].value_counts()[0]
print(f'The forecast was still perfect {perfect} times or {perfect/len(complete_df) * 100} percent of the time.')

In [None]:
complete_df.describe()

In [None]:
freq = complete_df['Abs Error'].value_counts()

In [None]:
plt.scatter(freq.index, freq.values)

plt.xlabel("Forecast Error", size=10)
plt.ylabel("Frequency", size=10)
plt.title("Frequency and Magnitude of Errors for F10", size=15)

plt.show()

<h4> Fitting a Regression </h4>

In [None]:
lr = Ridge()

In [None]:
X=complete_df['Date']
X_axis=np.arange(len(X))
lr.fit(X_axis.reshape(-1,1), complete_df['Abs Error'])

In [None]:
figure(figsize=(70, 40), dpi=100)

plt.bar(X_axis, complete_df['Abs Error'])
plt.plot(X_axis, lr.coef_*X_axis+lr.intercept_, color='red', linewidth=20)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)


plt.xlabel("Days", size=50)
plt.ylabel("F10 Error", size=50)
plt.title("Daily F10.7 Forecast Errors for [test_data]",fontsize=75)
plt.legend(['Trend'], fontsize=50)

plt.show()

In [None]:
fig, ax1 = plt.subplots(figsize=(12, 10))

color = 'tab:blue'
ax1.set_xlabel('Days')
ax1.set_ylabel('Observed F10.7', color=color)
ax1.plot(X_axis, complete_df['Observed F10'], color=color, linewidth=2)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()

color = 'tab:red'
ax2.set_ylabel('F10.7 Forecast Error', color=color)
ax2.bar(X_axis, complete_df['Abs Error'], color=color, alpha=0.25)
ax2.tick_params(axis='y', labelcolor=color)

plt.title("Comparison of Observed F10.7 with Forecast Errors for [test_data]",fontsize=20)

fig.tight_layout()
plt.show()

In [None]:
fig = plt.figure(figsize =(6, 5))
plt.boxplot(complete_df['Forecast Error'])


plt.ylabel(" Error", size=10)
plt.title("F10.7 Forecast Error for [test_data]",fontsize=15)

plt.show

<h4>Mean Absolute Percentage Error</h4>
Definition: The absolute value of the difference between the forecasted value and the actual value taken as a mean.

In [None]:
actual   = complete_df['Observed F10']
forecast = complete_df['Forecast F10']
  
APE = [] # percentage error between the forecast and observed value
  
for day in range(len(actual)):
    per_err = (actual[day] - forecast[day]) / actual[day]
    per_err = abs(per_err)
    APE.append(per_err)
  
MAPE = sum(APE)/len(APE)

print(f'''
MAPE   : { round(MAPE, 2) }
MAPE % : { round(MAPE*100, 2) } %
''')

<h4> Accuracy (POD) </h4>
Definition: How many times the forecast was correct.

In [None]:
ACC = perfect/len(actual)
ACC

<h4> False Alarm Ratio (FAR) </h4>
Definition: How many times the forecast was wrong.

In [None]:
FAR = (len(complete_df['Forecast F10'])-perfect) / len(complete_df['Forecast F10'])
FAR

*Note: POD and FAR are typically used for categorical data rather than discrete data such as integers. The definitions of both have been applied libearlly in order to compute these values.*

<h4> Heidke Skill Score </h4>
Compares the accuracy of the forecasts against the accuracy of some reference model (coefficient of determination), normalized by a perfect model score of 1 against the same coefficient of determination.

In order to create a "skill score" we need to create a reference model based on the observed data points and determine its accuracy (Aref). A linear regression model "trend line" through all observed values will serve as a first guess model for this demonstration. Future efforts should be made to fit a more accurate model in order to extract more meaningful value from a skill score.

In [None]:
#lr.fit(X_axis.reshape(-1,1), complete_df['Observed F10'])

In [None]:
model = LinearRegression().fit(X_axis.reshape(-1,1), complete_df['Observed F10'])

In [None]:
r_sq = model.score(X_axis.reshape(-1,1), complete_df['Abs Error'])
r_sq

In [None]:
Aperf = 1
Aref = r_sq

SS = (ACC - Aref)/(Aperf-Aref) # skill score formula
SS

In [None]:
if SS == 1:
    print("The forecast is perfect")
elif SS > 0:
    print("The forecast is skillful and better than some reference.")
else:
    print("The forecast is less skillful than some reference.")

Now let's define a good forecast as an error no greater than 5.

In [None]:
print(f' Recall there are {count5} forecasts with errors greater than 5.')

And we will define an accuracy below 70% as a bad forecast.

In [None]:
ACC5 = (len(actual)-count5)/len(actual)

if ACC5 == 1:
    print(f'The forecasts are perfect with an overall accuracy of {ACC5*100}%.')
elif ACC5 > .7:
    print(f'The forecasts are good with an overall accuracy of {ACC5*100}%.')
else:
    print(f'The forecasts are bad with an overall accuracy of {ACC5*100}%.')

<h3>Conclusions</h3>

This notebook successfully demonstrates proof of concept in assessing the accuracy of 27 Day F10.7 Forecasts using a test dataset. 