Forked and edited from https://www.kaggle.com/mertcaglar/sarimax-baseline-starter-prediction

# Read in Libraries

In [None]:
print("Read in libraries")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

from statsmodels.tsa.statespace.sarimax import SARIMAX
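#statsmodels.tsa.arima_model.ARIMA was deprecated and later removed; the replacement lives in statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA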
from random import random

# Read and Clean Data

In [None]:
print("read in train file")
df=pd.read_csv("/kaggle/input/covid19-global-forecasting-week-2/train.csv",
               usecols=['Province_State','Country_Region','Date','ConfirmedCases','Fatalities'])


In [None]:
print("fill blanks and add region for counting")
df.fillna(' ',inplace=True)
#'Lat' is used here as a region key (province + country), not a latitude
df['Lat']=df['Province_State']+df['Country_Region']
df.drop('Province_State',axis=1,inplace=True)
df.drop('Country_Region',axis=1,inplace=True)




In [None]:
countries_list=df.Lat.unique()
df1=[]
for i in countries_list:
    df1.append(df[df['Lat']==i])
print("we have "+ str(len(df1))+" regions in our dataset")

#read in test file 
test=pd.read_csv("/kaggle/input/covid19-global-forecasting-week-2/test.csv")
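
The per-region split above can also be written with `groupby`. The sketch below is a minimal illustration on a toy frame with made-up values; the region names and the `Region` column name are hypothetical (the notebook stores the same key in a column called `Lat`).

```python
import pandas as pd

# Toy frame with the same columns as train.csv; all values are made up.
toy = pd.DataFrame({
    "Province_State": [None, None, "ProvinceX", "ProvinceX"],
    "Country_Region": ["CountryA", "CountryA", "CountryB", "CountryB"],
    "Date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"],
    "ConfirmedCases": [10, 12, 100, 130],
    "Fatalities": [0, 0, 2, 3],
})

# Same key construction as the notebook: blank-filled province + country.
toy["Province_State"] = toy["Province_State"].fillna(" ")
toy["Region"] = toy["Province_State"] + toy["Country_Region"]

# sort=False keeps regions in order of first appearance,
# which matches the order produced by iterating over unique().
region_names = [name for name, _ in toy.groupby("Region", sort=False)]
frames = [group for _, group in toy.groupby("Region", sort=False)]

print("we have " + str(len(frames)) + " regions in the toy dataset")
```

Keeping the first-appearance order matters here because the submission rows are later numbered sequentially and matched to `ForecastId` by position.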

# Papers that Informed Parameters

Notes on how to determine the ARIMA / SARIMA model orders, quoted from the papers linked below.

https://www.sciencedirect.com/science/article/pii/S1201971218344618

A total of 1,341 specimens were positive for influenza A and 490 for influenza B. The majority of infected patients were 1–11 years old (87.7%). The ARIMA model could effectively predict the positive rate of influenza virus in a short time. ARIMA(0,0,11), SARIMA(1,0,0)(0,1,1)12, ARIMA(0,0,1) and SARIMA(0,0,1)(1,0,1)12 were suitable for B(Victoria), B(Yamagata), A(H1N1)pdm09, and A(H3N2), respectively.

https://journals.lww.com/md-journal/fulltext/2016/06280/time_series_analysis_of_influenza_incidence_in.15.aspx

 It is conceivable that SARIMA (0,1,1)(0,1,1)12 could simultaneously forecast the influenza incidence of the Hebei Province, Guizhou Province, Henan Province, and Shandong Province; SARIMA (1,0,0)(0,1,1)12 could forecast the influenza incidence in Gansu Province; SARIMA (3,1,1)(0,1,1)12 could forecast the influenza incidence in Tianjin City; and SARIMA (0,1,1)(0,0,1)12 could forecast the influenza incidence in Hunan Province. Time series analysis is a good tool for prediction of disease incidence.
 
https://www.researchgate.net/publication/337619595_Predicting_Seasonal_Influenza_Based_on_SARIMA_Model_in_Mainland_China_from_2005_to_2018

 The SARIMA (1, 0, 0) × (0, 1, 1) 12 model predicted that the influenza incidence in 2018 was similar to that of previous years, and it fitted the seasonal fluctuation. The relative errors between actual values and predicted values fluctuated from 0.0010 to 0.0137, which indicated that the predicted values matched the actual values well. This study demonstrated that the SARIMA model could effectively make short-term predictions of seasonal influenza.
 
https://www.mdpi.com/1660-4601/17/4/1381/htm

  For the SARIMA and ARIMA models, AICc-based model selection using the training data resulted in SARIMA(1,0,0)(1,1,0)[52] and ARIMA(5,1,0) with S=4 harmonics, respectively. The final number of parameters for each of these models is given in Table 1, and it ranges from 3 (SARIMA) to 20 (Beta(4)).

In [None]:
#create the estimates assuming measurement error 
submit_confirmed=[]
submit_fatal=[]
for i in df1:
    # observed confirmed-case series for this region
    data = i.ConfirmedCases.astype('int32').tolist()
    # fit model
    try:
        #model = SARIMAX(data, order=(2,1,0), seasonal_order=(1,1,0,12),measurement_error=True)
        model = SARIMAX(data, order=(1,1,0), seasonal_order=(1,1,0,12),measurement_error=True)
        #model = SARIMAX(data, order=(1,1,0), seasonal_order=(0,1,0,12),measurement_error=True)
        #model = ARIMA(data, order=(3,1,2))
        model_fit = model.fit(disp=False)
        # make prediction; the end index is inclusive, so this returns 35 values
        predicted = model_fit.predict(len(data), len(data)+34)
        new=np.concatenate((np.array(data),np.array([int(num) for num in predicted])),axis=0)
        # keep the last 8 observed values plus the 35 forecasts (43 rows per region)
        submit_confirmed.extend(list(new[-43:]))
    except Exception:
        # fallback when the fit fails: same 8 + 35 layout, forecasting double the last observed value
        submit_confirmed.extend(list(data[-8:]))
        for j in range(35):
            submit_confirmed.append(data[-1]*2)
    
    # observed fatality series for this region
    data = i.Fatalities.astype('int32').tolist()
    # fit model
    try:
        #model = SARIMAX(data, order=(1,0,0), seasonal_order=(0,1,1,12),measurement_error=True)
        model = SARIMAX(data, order=(1,1,0), seasonal_order=(1,1,0,12),measurement_error=True)
        #model = ARIMA(data, order=(3,1,2))
        model_fit = model.fit(disp=False)
        # make prediction; the end index is inclusive, so this returns 35 values
        predicted = model_fit.predict(len(data), len(data)+34)
        new=np.concatenate((np.array(data),np.array([int(num) for num in predicted])),axis=0)
        # keep the last 8 observed values plus the 35 forecasts (43 rows per region)
        submit_fatal.extend(list(new[-43:]))
    except Exception:
        # fallback when the fit fails: same 8 + 35 layout, forecasting double the last observed value
        submit_fatal.extend(list(data[-8:]))
        for j in range(35):
            submit_fatal.append(data[-1]*2)
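
The index arithmetic in the loop is easy to get wrong, so here is a small pure-Python sketch of it. `predict(start, end)` treats `end` as inclusive, so `len(data)` to `len(data)+34` yields 35 forecasts, and taking the last 43 values of observed-plus-forecast keeps 8 observed values followed by those 35. The helper names `assemble_rows` and `fallback_rows` are hypothetical, the data is synthetic, and the stand-in forecast is not a SARIMAX output.

```python
def assemble_rows(observed, predicted, n_rows=43):
    """Last n_rows of observed + predicted, truncating forecasts to int as the loop does."""
    combined = list(observed) + [int(p) for p in predicted]
    return combined[-n_rows:]

def fallback_rows(observed, horizon=35, n_rows=43):
    """Fallback with the same layout: trailing observed values, then double the last value."""
    n_observed = n_rows - horizon
    return list(observed[-n_observed:]) + [observed[-1] * 2] * horizon

observed = list(range(100, 170))            # 70 synthetic daily values
start, end = len(observed), len(observed) + 34
horizon = end - start + 1                   # inclusive bounds give 35 steps
predicted = [observed[-1] + 1.9 * (k + 1) for k in range(horizon)]  # stand-in forecast

rows = assemble_rows(observed, predicted)
print(len(rows), rows[:8] == observed[-8:])
```

Both helpers return the same number of rows per region with the observed values in the same positions, which is what keeps the sequential `ForecastId` numbering aligned across regions.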



In [None]:
#create an alternative fatality metric 
#submit_fatal = [i * .005 for i in submit_confirmed]
#print(submit_fatal)
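
The commented-out alternative above derives fatalities as a fixed fraction of confirmed cases. A minimal sketch of that idea, assuming the 0.005 ratio from the comment (a placeholder, not an estimated rate) and a hypothetical helper name, with rounding to whole counts:

```python
FATALITY_RATIO = 0.005  # taken from the commented-out line; an assumption, not a fitted value

def fatalities_from_confirmed(confirmed, ratio=FATALITY_RATIO):
    """Scale each confirmed count by a fixed ratio and round to a whole number."""
    return [int(round(c * ratio)) for c in confirmed]

example_confirmed = [0, 200, 1000]  # synthetic counts
example_fatal = fatalities_from_confirmed(example_confirmed)
print(example_fatal)
```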

In [None]:
#make the submission file 
df_submit=pd.concat([pd.Series(np.arange(1,1+len(submit_confirmed))),pd.Series(submit_confirmed),pd.Series(submit_fatal)],axis=1)
#forward-fill any gaps; fillna(method='pad') is deprecated in recent pandas, ffill() is equivalent
df_submit=df_submit.ffill().astype(int)

In [None]:
#view submission file 
df_submit.head()
#df_submit.dtypes

In [None]:
#examine the test file 
test.head()

In [None]:
#join the submission file info to the test data set 
#rename the columns 
df_submit.rename(columns={0: 'ForecastId', 1: 'ConfirmedCases', 2: 'Fatalities'}, inplace=True)

#join the two data items 
complete_test= pd.merge(test, df_submit, how="left", on="ForecastId")
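
Because the forecasts are matched to the test rows only through a sequentially generated `ForecastId`, it may be worth checking the join. The sketch below uses toy frames with made-up values and the `validate` and `indicator` options of `pd.merge`; the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for test.csv and the submission frame; values are made up.
test_toy = pd.DataFrame({
    "ForecastId": [1, 2, 3],
    "Date": ["2020-03-19", "2020-03-20", "2020-03-21"],
})
submit_toy = pd.DataFrame({
    "ForecastId": [1, 2, 3],
    "ConfirmedCases": [10, 12, 15],
    "Fatalities": [0, 0, 1],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column showing where each row came from.
merged = pd.merge(test_toy, submit_toy, how="left", on="ForecastId",
                  validate="one_to_one", indicator=True)

# Rows of the test frame that found no forecast would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame and an unchanged row count suggest every test row received exactly one forecast.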

# Submission

In [None]:
#df_submit.interpolate(method='pad', axis=0, inplace=True)
df_submit.to_csv('submission.csv',header=['ForecastId','ConfirmedCases','Fatalities'],index=False)
complete_test.to_csv('complete_test.csv',index=False)
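
Before uploading, a few cheap checks on the submission frame can catch misaligned or incomplete output. This is a sketch under assumptions: `check_submission` is a hypothetical helper, the expected column names are the ones written above, and the example frames are made up.

```python
import pandas as pd

EXPECTED_COLUMNS = ["ForecastId", "ConfirmedCases", "Fatalities"]

def check_submission(sub, expected_rows):
    """Return a list of problems found; an empty list means none were detected."""
    if list(sub.columns) != EXPECTED_COLUMNS:
        return ["unexpected columns: " + str(list(sub.columns))]
    problems = []
    if len(sub) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(sub)))
    if sub.isna().any().any():
        problems.append("missing values present")
    if not sub["ForecastId"].is_unique:
        problems.append("duplicate ForecastId values")
    if (sub[["ConfirmedCases", "Fatalities"]] < 0).any().any():
        problems.append("negative counts present")
    return problems

# Made-up frames: one clean, one with a short length, a duplicate id and a negative count.
good = pd.DataFrame({"ForecastId": [1, 2, 3], "ConfirmedCases": [10, 12, 15], "Fatalities": [0, 0, 1]})
bad = pd.DataFrame({"ForecastId": [1, 1], "ConfirmedCases": [10, -2], "Fatalities": [0, 0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

In the notebook, `expected_rows` would naturally be `len(test)`, so the check also confirms that every region contributed the same number of rows.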


# Visualisation of Predictions