# The Story of COVID-19 in the World and Time Series Forecasting in Turkey

**Coronaviruses are a large family of viruses with several varieties; the first human coronaviruses were detected in the 1960s. Coronaviruses are seen mostly in animals, and the virus behind the current outbreak was seen in humans for the first time. The current outbreak first appeared in Wuhan, China, in December 2019. The best way to prevent and slow down transmission is to be well informed about the COVID-19 virus, the disease it causes and how it spreads. Protect yourself and others from infection by washing your hands or using an alcohol-based rub frequently and not touching your face.**

**The COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes, so it is important that you also practice respiratory etiquette (for example, by coughing into a flexed elbow). At this time, there are no specific vaccines or treatments for COVID-19. However, there are many ongoing clinical trials evaluating potential treatments. WHO will continue to provide updated information as soon as clinical findings become available.**

<img src="https://pbs.twimg.com/media/EWjM_DBWAAESbWc.jpg">

# Table of Contents

* [World COVID-19 Cases](#section-one)
* [Global Deaths Heat Map](#section-two)
* [Active, Recovered, Deaths in Hotspot Countries](#section-three)
* [US Heatmap (Confirmed Cases)](#section-four)
* [Average Age Distribution of Cases in Countries](#section-four-1)
* [Turkey](#section-five)
    * [Top 5 Cities with the Highest Number of Cases](#section-five-one)
    * [Turkey Heatmap (Number of Cases)](#section-five-two)
    * [10 Cities with the Lowest Number of Cases](#section-five-three)
* [Turkey COVID-19 Forecasting](#section-six)
     * [Confirmed Case in Time Intervals](#section-six-one)
     * [Fatalities Case in Time Intervals](#section-six-two)
     * [Time Series Model](#section-six-four)
         * [Importing Libraries](#section-six-four-1)
         * [Preparing Data](#section-six-four-2)
         * [Splitting Data for Training and Validation](#section-six-four-3)
         * [Determine Rolling Stats](#section-six-four-4)
         * [Check for Stationarity](#section-six-four-5)
         * [Log Scale Transformation](#section-six-four-6)
         * [Exponential Decay Transformation](#section-six-four-7)
         * [ADF Test](#section-six-four-8)
         * [Time Shift Transformation](#section-six-four-9)
         * [Decomposition](#section-six-four-10)
         * [Building Model](#section-six-four-11)
         * [Prediction & Reverse Transformations](#section-six-four-12)
         * [Validation](#section-six-four-13)
         * [Test Forecasting](#section-six-four-14)
         * [ARIMA PDQ Param Tuning](#section-six-four-15) 
* [REFERENCES](#section-five)

In [None]:
import numpy as np 
import pandas as pd 
import datetime
import requests
import warnings
import random
import squarify
import matplotlib
import seaborn as sns
import matplotlib as mpl
import plotly.offline as py
import plotly.express as px
from sklearn.svm import SVR
import statsmodels.api as sm
from functools import partial
from fbprophet import Prophet
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from lightgbm import LGBMRegressor
from scipy.optimize import minimize
from sklearn.pipeline import Pipeline
from statsmodels.tsa.arima_model import ARIMA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.stattools import adfuller, acf, pacf,arma_order_select_ic

from IPython.display import Image
warnings.filterwarnings('ignore')
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="section-one"></a>
# World COVID-19 Cases

#### Multiple Data Sources:

* COVID19 Global Forecasting (Week 5)
* COVID-19 in Turkey
* COVID-19 useful features by country
* COVID19 Daily Updates
* Novel Corona Virus 2019 Dataset
* Number of Covid-19 cases in the cities of Turkey
* Python Folium Country Boundaries
* Turkey Geoplot

The "World COVID-19 Cases" data come from the GitHub repository linked below and are used to study the worldwide spread and effects of the disease. At the time of writing, the data contain all cases up to 23/9/2020.
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

In [None]:
## Weights -> In the previous weeks (wk 2/3/4) of the competition phase weight was assigned according to the population of the region specified

confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
latest_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-04-2020.csv')

# Sum the most recent date column to get the worldwide totals as scalars
world_confirmed = confirmed_df.iloc[:, -1].sum()
world_recovered = recovered_df.iloc[:, -1].sum()
world_deaths = deaths_df.iloc[:, -1].sum()
# Active cases = confirmed minus those who recovered or died
world_active = world_confirmed - world_recovered - world_deaths

labels = ['Active','Recovered','Deceased']
sizes = [world_active,world_recovered,world_deaths]
color= ['red','green','black']
# Pull every wedge slightly away from the centre
explode = [0.05] * len(labels)

plt.figure(figsize= (15,10))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=8, explode =explode,colors = color)
centre_circle = plt.Circle((0,0),0.70,fc='white')

fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('World COVID-19 Cases',fontsize = 20)
plt.axis('equal')  
plt.tight_layout()

<a id="section-two"></a>
## Global Deaths Heat Map

In [None]:
## DATA READING

df_deaths = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_covid19 = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv")
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')

## PRE-PROCESSING

df_confirmed = df_confirmed.rename(columns={"Province/State":"state","Country/Region": "country"})
df_covid19 = df_covid19.drop(["People_Tested","People_Hospitalized","UID","ISO3","Mortality_Rate"],axis =1)
df_deaths = df_deaths.rename(columns={"Province/State":"state","Country/Region": "country"})
df_covid19 = df_covid19.rename(columns={"Country_Region": "country"})
df_covid19["Active"] = df_covid19["Confirmed"]-df_covid19["Recovered"]-df_covid19["Deaths"]

# Changing the country names as required by the pycountry_convert library
df_deaths.loc[df_deaths['country'] == "US", "country"] = "USA"
df_deaths.loc[df_deaths['country'] == 'Korea, South', "country"] = 'South Korea'
df_deaths.loc[df_deaths['country'] == 'Taiwan*', "country"] = 'Taiwan'
df_deaths.loc[df_deaths['country'] == 'Congo (Kinshasa)', "country"] = 'Democratic Republic of the Congo'
df_deaths.loc[df_deaths['country'] == "Cote d'Ivoire", "country"] = "Côte d'Ivoire"
df_deaths.loc[df_deaths['country'] == "Reunion", "country"] = "Réunion"
df_deaths.loc[df_deaths['country'] == 'Congo (Brazzaville)', "country"] = 'Republic of the Congo'
df_deaths.loc[df_deaths['country'] == 'Bahamas, The', "country"] = 'Bahamas'
df_deaths.loc[df_deaths['country'] == 'Gambia, The', "country"] = 'Gambia'

countries = np.asarray(df_deaths["country"])  # use the names renamed above for pycountry_convert
countries1 = np.asarray(df_covid19["country"])
# Continent_code to Continent_names
continents = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU' : 'Europe',
    'na' : 'Others'}


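# Defining a function that returns the continent code for a country.
# pycountry_convert is not imported at the top of the notebook, so import it here.
import pycountry_convert as pc

def country_to_continent_code(country):
    try:
        code = pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country))
    except Exception:
        return 'na'  # names pycountry_convert cannot resolve fall into 'Others'
    # Codes missing from the `continents` mapping also fall into 'Others'
    return code if code in continents else 'na'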

#Collecting Continent Information
df_deaths.insert(2,"continent",  [continents[country_to_continent_code(country)] for country in countries[:]])
df_covid19.insert(1,"continent",  [continents[country_to_continent_code(country)] for country in countries1[:]])

df_deaths[df_deaths["continent" ]== 'Others']
df_deaths = df_deaths.replace(np.nan, '', regex=True)

df_countries_cases = df_covid19.copy().drop(['Lat','Long_','continent','Last_Update'],axis =1)
df_countries_cases.index = df_countries_cases["country"]
df_countries_cases = df_countries_cases.drop(['country'],axis=1)

df_countries_cases.fillna(0,inplace=True)

## VISUALIZATION

temp_df = pd.DataFrame(df_countries_cases['Deaths'])
temp_df = temp_df.reset_index()
fig = px.choropleth(temp_df, locations="country",
                    color=np.log10(temp_df["Deaths"]+1), 
                    hover_name="country", 
                    hover_data=["Deaths"],
                    color_continuous_scale=px.colors.sequential.Plasma,locationmode="country names")
fig.update_geos(fitbounds="locations", visible=False)
fig.update_coloraxes(colorbar_title="Deaths (Log Scale)",colorscale="Reds")

fig.show()

## Top 10 countries (Deaths)

In [None]:
f = plt.figure(figsize=(10,5))
f.add_subplot(111)

plt.axes(axisbelow=True)
plt.barh(df_countries_cases.sort_values('Deaths')["Deaths"].index[-10:],df_countries_cases.sort_values('Deaths')["Deaths"].values[-10:],color="crimson")
plt.tick_params(size=5,labelsize = 13)
plt.xlabel("Deaths",fontsize=18)
plt.title("Top 10 Countries (Deaths)",fontsize=20)
plt.grid(alpha=0.3,which='both')

<a id="section-three"></a>
## Active, Recovered, Deaths in Hotspot Countries

In [None]:
hotspots = ['China','Germany','Iran','Italy','Spain','US','Korea, South','France','Turkey','United Kingdom','India']
dates = list(confirmed_df.columns[4:])
dates = list(pd.to_datetime(dates))
dates_india = dates[8:]

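# Sum provinces/states into one row per country (numeric columns only)
df1 = confirmed_df.groupby('Country/Region').sum(numeric_only=True).reset_index()
df2 = deaths_df.groupby('Country/Region').sum(numeric_only=True).reset_index()
df3 = recovered_df.groupby('Country/Region').sum(numeric_only=True).reset_index()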

global_confirmed = {}
global_deaths = {}
global_recovered = {}
global_active= {}

for country in hotspots:
    k =df1[df1['Country/Region'] == country].loc[:,'1/30/20':]
    global_confirmed[country] = k.values.tolist()[0]

    k =df2[df2['Country/Region'] == country].loc[:,'1/30/20':]
    global_deaths[country] = k.values.tolist()[0]

    k =df3[df3['Country/Region'] == country].loc[:,'1/30/20':]
    global_recovered[country] = k.values.tolist()[0]
    
for country in hotspots:
    k = list(map(int.__sub__, global_confirmed[country], global_deaths[country]))
    global_active[country] = list(map(int.__sub__, k, global_recovered[country]))
    
fig = plt.figure(figsize=(15, 25))
plt.suptitle('Active, Recovered, Deaths in Hotspot Countries', fontsize=13, y=1.0)

# One subplot per hotspot country
for i, country in enumerate(hotspots, start=1):
    ax = fig.add_subplot(6, 2, i)
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%b'))
    ax.bar(dates_india, global_active[country], color='red', alpha=0.6, label='Active')
    ax.bar(dates_india, global_recovered[country], color='green', label='Recovered')
    ax.bar(dates_india, global_deaths[country], color='black', label='Death')
    ax.set_title(country)

# Add a single shared legend, built from the last subplot's handles
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, loc='upper left')

plt.tight_layout(pad=3.0)

<a id="section-four"></a>
## US Heatmap (Confirmed Cases)

In [None]:
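# Keep US rows and drop Admin2 without modifying a slice of latest_data in place
us = latest_data.loc[latest_data['Country_Region'] == 'US'].drop('Admin2', axis=1)
# Rows without a county FIPS code cannot be placed on the county map
us = us.dropna(subset=['FIPS'])
# The county GeoJSON is keyed by 5-digit, zero-padded FIPS strings
us['FIPS'] = us['FIPS'].astype(int).astype(str).str.zfill(5)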

from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

us_min = us["Confirmed"].min()
us_mean = us["Confirmed"].mean()
us_max = us["Confirmed"].max()
us_med = us["Confirmed"].median()

fig = px.choropleth_mapbox(us, geojson=counties, locations="FIPS", color='Confirmed',
                           hover_name="Province_State",
                           color_continuous_scale="OrRd",
                           range_color=(us_med,us_mean),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.4,
                           labels={'Confirmed':'Confirmed Case Number'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

<a id="section-four-1"></a>
## Average Age Distribution of Cases in Countries

In [None]:
cluster_data = pd.read_csv("../input/covid19-useful-features-by-country/Countries_usefulFeatures.csv")
age_df = cluster_data[["Country_Region","Mean_Age"]]
sns.distplot(a=age_df['Mean_Age'], kde=False)

<a id="section-five"></a>
# Turkey

In [None]:
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-5/train.csv")
test = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-5/test.csv")

train_copy = train.copy()
test_copy = test.copy()

train['day']=pd.to_datetime(train.Date,format='%Y-%m-%d').dt.day
train['month']=pd.to_datetime(train.Date,format='%Y-%m-%d').dt.month

test['day']=pd.to_datetime(test.Date,format='%Y-%m-%d').dt.day
test['month']=pd.to_datetime(test.Date,format='%Y-%m-%d').dt.month

train.columns = map(str.lower, train.columns)
train = train.rename(columns = {'county': 'country', 'province_state': 'state', 'country_region': 'region', 'target': 'case', 'targetvalue':'case_value'}, inplace = False)

tc_data = pd.read_csv("../input/number-of-cases-in-the-city-covid19-turkey/number_of_cases_in_the_city.csv")

<a id="section-five-one"></a>
### Top 5 Cities with the Highest Number of Cases

In [None]:
tc_list = list(range(1, 82))
tc_data.insert(0, "id", tc_list, True) 

import plotly.express as px

more_case = tc_data.sort_values(by='Number of Case', ascending=False)

fig = px.pie(
    more_case.head(5),
    values = "Number of Case",
    names = "Province",
    color_discrete_sequence = px.colors.sequential.RdBu)

fig.update_traces(textposition="inside", textinfo="percent+label")
fig.show()

<a id="section-five-two"></a>
### Turkey Heatmap (Number of Cases)

In [None]:
import plotly.express as px

# loading Turkey's geoplot json file
from urllib.request import urlopen
import json
with open("../input/geoplot/tr-cities-utf8.json") as f:
    cities = json.load(f)

mini = tc_data["Number of Case"].min()
average = tc_data["Number of Case"].mean()
#tc_data.drop('id', axis=1, inplace=True)
    
fig = px.choropleth_mapbox(tc_data, geojson=cities, locations=tc_data.id, color=(tc_data["Number of Case"]),
                           hover_name="Province",
                           range_color= (mini,average),
                           color_continuous_scale='amp',
                           mapbox_style="carto-positron",
                           zoom=4, opacity=0.7,center = {"lat": 38.963745, "lon": 35.243322},
                           labels={'color':'Number of Case'})

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

<a id="section-five-three"></a>
### 10 Cities with the Lowest Number of Cases

In [None]:
less_case = tc_data.sort_values(by='Number of Case', ascending=True)

fig = px.bar(
    less_case.head(10),
    x = "Province",
    y = "Number of Case")
fig.update_layout(barmode="group")
fig.update_traces(marker_color='rosybrown')
fig.show()

### Fatalities vs Confirmed Cases

In [None]:
# Keep only Turkey's rows and drop the columns that are not needed for the analysis
tc = train.loc[train.region == 'Turkey'].drop(columns=['country', 'state', 'region', 'population'])

tc_1=tc['case_value'].groupby(tc['case']).sum()

fatal_tc=tc[tc['case']=='Fatalities']
conf_tc=tc[tc['case']=='ConfirmedCases']

labels =[tc_1.index[0],tc_1.index[1]]
sizes = [tc_1[0],tc_1[1]]
explode = (0, 0.08)  
plt.figure(figsize = (8,8))

plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',textprops={'fontsize': 14},startangle=90)
plt.show()

<a id="section-six"></a>
# Turkey COVID-19 Forecasting

### Plotting the Features to see trends
* COVID-19 case counts show strong daily and monthly patterns.

#### Concepts:
* Trend: The long-term increase or decrease of the series as time passes. It is often non-linear. Sometimes we refer to a trend as "changing direction" when it goes from an increasing trend to a decreasing one.

* Level: The baseline value of the time series.

* Seasonal: A pattern that repeats over time at a fixed interval, such as a weekly or yearly cycle.

* Noise: Random variation in the data that the other components do not explain.
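
The four components above can be illustrated on a synthetic series. This is only a sketch: the level, slope, weekly amplitude and noise scale below are made-up values, not estimates from the Turkey data. It builds a series as level + trend + seasonal + noise and then uses a centered 7-day rolling mean to separate the slow part (level and trend) from the rest.

```python
import numpy as np
import pandas as pd

# Synthetic daily series built from the four components (all values are illustrative)
rng = np.random.default_rng(0)
n_days = 84                                   # 12 full weeks
t = np.arange(n_days)

level = 100.0                                 # baseline value
trend = 2.0 * t                               # steady increase of 2 per day
seasonal = 10.0 * np.sin(2 * np.pi * t / 7)   # weekly pattern (period of 7 days)
noise = rng.normal(0.0, 1.0, n_days)          # random fluctuation

index = pd.date_range("2020-03-11", periods=n_days, freq="D")
series = pd.Series(level + trend + seasonal + noise, index=index)

# A centered 7-day rolling mean averages out one full weekly cycle,
# leaving an estimate of level + trend
trend_estimate = series.rolling(window=7, center=True).mean()

# What remains after removing that estimate is the seasonal pattern plus noise
detrended = (series - trend_estimate).dropna()

print(trend_estimate.dropna().head(3))
print(detrended.head(7))
```

Because the window length equals the seasonal period, the weekly pattern cancels inside each window, which is why a 7-day window is a natural choice for daily data with a weekly cycle.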

<a id="section-six-one"></a>
### Confirmed Case in Time Intervals

In [None]:
px.line(data_frame=conf_tc, x="date", y='case_value',hover_name="case")

### Implications and Causation:
* The first confirmed COVID-19 case in Turkey was found on March 11.
* Before that, on February 3, Turkey announced that it had stopped all flights from China.
* On February 29, Turkey announced that flights with Italy, South Korea and Iraq were mutually suspended.
* Soon afterwards, the Iraqi border was also closed. The ministry also established field hospitals close to the Iraqi and Iranian borders.
* On March 11, Health Minister Fahrettin Koca announced that a Turkish man who caught the virus while traveling in Europe was the country's first coronavirus case. The patient was isolated in a hospital and his family members were placed under observation.
* During March, many activities were stopped, such as sports leagues, horse races, and barbecues in gardens, parks and recreation areas.

#### Why Did It Increase?
* Cases peaked on **April 11**.
* As a result of **incomplete reporting of case numbers**, people did not take the disease seriously.
* **The very high viral load** indicates that people stayed together in large numbers and for a long time in closed spaces, where there was a lot of person-to-person contact, and were therefore exposed to very large amounts of the virus. The amount of virus you come into contact with matters for how you will get through the disease.
* People with **chronic diseases** (obesity and inactivity are also counted as chronic conditions) could not protect themselves and got infected.
* **Effective drugs** were not in use.
* **The increased hospital load** could not be balanced.
* People did not listen to the **Stay At Home** call and continued to travel domestically. The necessary measures for workplaces and schools could not be taken early, so interaction continued for a long time.


<a id="section-six-two"></a>
### Fatalities Case in Time Intervals

In [None]:
# color_discrete_sequence sets the line colour; color_discrete_map has no effect without a `color` argument
fig = px.line(data_frame=fatal_tc, x="date", y='case_value',hover_name="case", color_discrete_sequence=['red'])
fig.show()

* The first COVID-19 fatality in Turkey was reported on 17 March.

### Changes in the Number of Cases by Months

In [None]:
with plt.style.context('fivethirtyeight'):
    # Monthly average of the numeric case_value column only
    dategroup = tc.groupby('month')['case_value'].mean()
    fig, ax = plt.subplots(figsize=(20,6))
    dategroup.plot(ax=ax)
    ax.xaxis.set(ticks=range(0,13)) # Manually set x-ticks

* In this graph, we can see that the cases peaked in April.

<a id="section-six-four"></a>
# TIME SERIES MODEL

1. How can the techniques we use to build the model deepen the understanding gained from EDA?
    - While the human eye tends to focus on what has already happened, a model can look ahead to the future objectively and systematically.
2. How valuable or useful is it to predict the number of infections or deaths?
    - It could help decision makers choose more efficient and timely measures to tackle new cases.
    - It could help individuals understand how seriously they need to take prevention efforts.
3. What else could be fundamental or necessary for anyone in the middle of the pandemic?
    - A general forecasting model, not only for Turkey but for all countries (especially those showing the early pattern of spreading).
    - Ways of allocating necessary medical resources (e.g. ventilators, test kits) nationwide or even worldwide, from places with a decreasing trend to those with an increasing one.

<a id="section-six-four-1"></a>
### Importing Libraries

In [None]:
from pandas import Series
from math import sqrt

# metrics
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm

# forecasting model
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt
from statsmodels.tsa.arima_model import ARIMA

# for analysis
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from shapely.geometry import LineString

import matplotlib.pyplot as plt
from matplotlib.pyplot import plot
import seaborn as sns
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 7

from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

<a id="section-six-four-2"></a>
### Preparing Data

In [None]:
train_original=pd.read_csv('../input/covid19-global-forecasting-week-5/train.csv')
test_original=pd.read_csv('../input/covid19-global-forecasting-week-5/test.csv')
train_original.sample(3)

#### Data Cleaning and Generating Date:

In [None]:
# Train Data Cleaning: drop the columns that are not used
train_original = train_original.drop(columns=["County", "Province_State", "Population", "Weight", "Id"])

# Keep only Turkey's confirmed cases
train_original = train_original[(train_original['Country_Region'] == 'Turkey') & (train_original['Target'] == 'ConfirmedCases')]
train_original = train_original.drop(columns=["Country_Region", "Target"])

# Test Data Cleaning
test_original = test_original.drop(columns=["County", "Province_State", "Population", "Weight"])

test_original = test_original[(test_original['Country_Region'] == 'Turkey') & (test_original['Target'] == 'ConfirmedCases')]
test_original = test_original.drop(columns=["Country_Region", "Target"])

test_original.dropna(inplace=True)
# Drop the last row of the test set
test_original.drop(test_original.tail(1).index, inplace=True)

train_df=train_original.copy()
test_df=test_original.copy()

# Dates in the CSV files use dashes (e.g. 2020-04-11), so the format must match
train_original['Date']=pd.to_datetime(train_original.Date, format='%Y-%m-%d')
test_original['Date']=pd.to_datetime(test_original.Date, format='%Y-%m-%d')
train_df['Date']=pd.to_datetime(train_df.Date, format='%Y-%m-%d')
test_df['Date']=pd.to_datetime(test_df.Date, format='%Y-%m-%d')

# generate day, month, year feature
for i in (train_original, test_original, train_df, test_df):
    i['year']=i.Date.dt.year
    i['month']=i.Date.dt.month
    i['day']=i.Date.dt.day
    i['hour']=i.Date.dt.hour
    
# sampling for daily basis
train_df.index=train_df.Date
test_df.index=test_df.Date

train_df=train_df.resample('D').mean()
test_df=test_df.resample('D').mean()

In [None]:
train_df.head(3) #Last Version

<a id="section-six-four-3"></a>
### Splitting Data for Training and Validation

In [None]:
train=train_df.loc['2020-04-11':'2020-05-16']
valid=train_df.loc['2020-05-17':'2020-07-10']
plt.figure(figsize=(20,7))

train.TargetValue.plot(label='Train data')
valid.TargetValue.plot(label='Valid data')
plt.legend(loc='best')

<a id="section-six-four-4"></a>
### Determine Rolling Stats

* A rolling analysis of a time series model is often used to assess the model's stability over time. Here we compute the mean and standard deviation over a sliding 7-day window and plot them against the training data.

In [None]:
rolmean=train.TargetValue.rolling(window=7).mean() #for 7 days -> rolling mean: slide a window over the series and average
rolstd=train.TargetValue.rolling(window=7).std()
rolmean.dropna(inplace=True)
rolstd.dropna(inplace=True)

plt.figure(figsize=(17,7))
rolmean.plot(label='Rolmean', color='green')
rolstd.plot(label='rolstd')
train.TargetValue.plot(label='Train')
plt.legend(loc='best')

<a id="section-six-four-5"></a>
### Check for Stationarity

#### What does it mean for data to be stationary?

- The mean of the series should not be a function of time. We will check whether this series is stationary by examining the trend component of the decomposition result.

#### Why is this important? 
- When running a linear regression, the assumption is that the observations are all independent of each other. In a time series, however, we know that observations are time dependent. It turns out that many useful results that hold for independent random variables (the law of large numbers and the central limit theorem, to name a couple) also hold for stationary random variables. So by making the data stationary, we can apply regression techniques to this time-dependent variable.

- There are two ways to check the stationarity of a time series. The first is to look at the data: a plot should make a changing mean or changing variance easy to spot. For a more rigorous assessment there is the Dickey-Fuller test. I won't go into the specifics of this test, but if the 'Test Statistic' is less than (more negative than) the 'Critical Value', the null hypothesis of non-stationarity is rejected and the time series is treated as stationary. Below is code that will help you visualize the time series and test for stationarity.

<img src="https://miro.medium.com/max/390/0*3XXCQed3bPHrD1lt.png">

In [None]:
dftest=adfuller(train.TargetValue, autolag='AIC')
dfout=pd.Series(dftest[0:4], index=['Test statistics', 'p-value', '#Lags used', 'Number of observation used'])
for key, val in dftest[4].items():
    dfout['Critical value (%s)'%key]=val

print(dfout)

* The smaller the p-value, the stronger the evidence that the series is stationary. Here our p-value is 0.603053, well above both the 5% and the 1% significance levels, so we cannot reject the null hypothesis of non-stationarity. This agrees with the trend we found visually, so the series needs to be transformed.
* There are many techniques for making data stationary, such as taking logs and differencing.

<a id="section-six-four-6"></a>
### Log Scale Transformation

* Plot the data against time. If the variation appears to increase with the level of the series, take logs. Otherwise, model the original data.
* If the series seems to be linear, then there is no need for logs.
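
The rule of thumb above can be checked numerically. The following sketch uses a made-up series with exponential growth and multiplicative noise, so its fluctuations grow with its level; the growth rate, noise scale and the helper `half_std_ratio` are assumptions chosen for illustration, not values from the Turkey data. It compares the spread around a rolling mean in the second half of the series with the spread in the first half, before and after taking logs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(120)

# Exponential growth with multiplicative noise: fluctuations scale with the level
series = pd.Series(50.0 * np.exp(0.03 * t) * rng.lognormal(0.0, 0.1, t.size))
log_series = np.log(series)

def half_std_ratio(s):
    """Ratio of the detrended spread in the second half to that in the first half."""
    resid = (s - s.rolling(window=7, center=True).mean()).dropna()
    half = len(resid) // 2
    return resid.iloc[half:].std() / resid.iloc[:half].std()

raw_ratio = half_std_ratio(series)
log_ratio = half_std_ratio(log_series)

print("raw ratio:", round(raw_ratio, 2))
print("log ratio:", round(log_ratio, 2))
```

A ratio well above 1 on the raw series means the variation grows with the level. A ratio close to 1 after taking logs means the log transform has made the variation roughly constant.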

In [None]:
# estimating trend
train_count_log=np.log(train.TargetValue)

# make TS to be stationary
moving_avg=train_count_log.rolling(window=7).mean()
moving_std=train_count_log.rolling(window=7).std()
plt.figure(figsize=(17,7))

train_count_log.plot(label='Log Scale')
moving_avg.plot(label='moving_avg')
moving_std.plot(label='moving_std')

plt.legend(loc='best')

In [None]:
# Deviation of the log series from its moving average (where the variation is high)
dif_log=train_count_log-moving_avg
dif_log.dropna(inplace=True)
dif_log.plot()

In [None]:
def test_stationary(timeseries):
    # determine rolling stats
    mov_avg=timeseries.rolling(window=7).mean()
    mov_std=timeseries.rolling(window=7).std()
    #plot rolling stats
    plt.figure(figsize=(12,7))
    timeseries.plot(label='Original')
    mov_avg.plot(label='Mov avg')
    mov_std.plot(label='Mov std')
    plt.legend(loc='best')
    plt.title('Rolling mean & standard deviation')
    
    # dickey-fuller test
    print('Result of Dickey-fuller test')
    dftest=adfuller(timeseries, autolag='AIC')
    dfout=pd.Series(dftest[:4], index=['Test stats', 'p-value', '#Lag used', 'Number of observation used'])
    for key, val in dftest[4].items():
        dfout['Critical value (%s)'%key]=val
    print(dfout)

In [None]:
test_stationary(dif_log)

After the log transformation and subtracting the moving average, our p-value is extremely small. Thus this series is very likely to be stationary.

<a id="section-six-four-7"></a>
### Exponential Decay Transformation

* The Exponential Decay Transformation builds on exponential smoothing, a time series forecasting method for univariate data that can be extended to support data with a systematic trend or seasonal component. Specifically, past observations are weighted with a geometrically decreasing ratio. Here the exponentially weighted moving average of the log series is used as the trend estimate.
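
To make the "geometrically decreasing" weighting concrete, the sketch below reproduces the last value of a pandas exponentially weighted mean by hand. The input numbers are made up; only the `ewm` settings (`halflife=7`, `min_periods=0`, `adjust=True`) mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 12.0, 9.0, 15.0, 20.0, 18.0, 25.0, 30.0])
halflife = 7

# pandas: exponentially weighted mean with the same settings as in this notebook
ewm_mean = values.ewm(halflife=halflife, min_periods=0, adjust=True).mean()

# Manual version: an observation that is k steps old gets weight 0.5 ** (k / halflife),
# so a value from 7 steps ago counts half as much as the most recent one
ages = np.arange(len(values))[::-1]          # age of each observation at the last time step
weights = 0.5 ** (ages / halflife)
manual_last = np.sum(weights * values.to_numpy()) / np.sum(weights)

print(ewm_mean.iloc[-1], manual_last)
```

The two printed numbers agree up to floating-point error. A shorter half-life makes the average follow recent values more closely, while a longer one gives a smoother trend estimate.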

In [None]:
plt.figure(figsize=(12,7))
edw_avg=train_count_log.ewm(halflife=7, min_periods=0, adjust=True).mean()
train_count_log.plot(label='Log scale')
edw_avg.plot(label='Exponential Decay Weight MA')
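
To make the "geometrically decreasing" weights concrete, here is a minimal sketch of what `ewm(halflife=7, adjust=True).mean()` computes. The helper name `ewm_mean_manual` and the toy series are made up for illustration and are not part of the notebook's data. A half-life of 7 means an observation 7 steps in the past gets half the weight of the current one.

```python
import numpy as np
import pandas as pd

def ewm_mean_manual(values, halflife):
    """Exponentially weighted mean with geometric weights (adjust=True form)."""
    # decay factor implied by the half-life: (1 - alpha) ** halflife == 0.5
    alpha = 1.0 - np.exp(-np.log(2.0) / halflife)
    values = np.asarray(values, dtype=float)
    out = np.empty(len(values))
    for t in range(len(values)):
        # weight (1 - alpha) ** i for the observation i steps in the past
        weights = (1.0 - alpha) ** np.arange(t + 1)
        # values[t::-1] lists the observations from newest to oldest
        out[t] = np.sum(weights * values[t::-1]) / np.sum(weights)
    return out

# toy series, made up for illustration
toy = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])
manual = ewm_mean_manual(toy.values, halflife=7)
builtin = toy.ewm(halflife=7, min_periods=0, adjust=True).mean().values
print(np.allclose(manual, builtin))
```

A shorter half-life makes the average follow the series more closely, while a longer one gives a smoother trend estimate.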

<a id="section-six-four-8"></a>
### ADCF Test

* The Augmented Dickey–Fuller (ADF) test is one of the most widely accepted tests for determining stationarity in the time series literature (Enders, 1995). Its null hypothesis is that the series has a unit root (is non-stationary), so a small p-value suggests that the series is stationary.

The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to the terms at shorter lags.

#### Autoregression Intuition
- Consider a time series that was generated by an autoregression (AR) process with a lag of k.

- We know that the ACF describes the autocorrelation between an observation and another observation at a prior time step that includes direct and indirect dependence information.

- This means we would expect the ACF for the AR(k) time series to be strong to a lag of k and the inertia of that relationship would carry on to subsequent lag values, trailing off at some point as the effect was weakened.

- We know that the PACF only describes the direct relationship between an observation and its lag. This would suggest that there would be no correlation for lag values beyond k.

- This is exactly the expectation of the ACF and PACF plots for an AR(k) process.
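
The intuition above can be checked on simulated data. The following is a rough sketch using only NumPy: it simulates an AR(2) process with arbitrarily chosen coefficients (0.6 and 0.3, an assumption for illustration), then computes a sample ACF and a regression-based PACF. The helper names `simulate_ar`, `sample_acf` and `sample_pacf` are hypothetical and not part of the notebook; `statsmodels` offers ready-made versions of these statistics.

```python
import numpy as np

def simulate_ar(coefs, n, seed=0):
    """Simulate x_t = coefs[0]*x_{t-1} + coefs[1]*x_{t-2} + ... + noise_t."""
    rng = np.random.default_rng(seed)
    k = len(coefs)
    burn = 200  # discard the first values so the start-up zeros do not matter
    x = np.zeros(n + burn)
    noise = rng.standard_normal(n + burn)
    for t in range(k, n + burn):
        past = x[t - k:t][::-1]  # x_{t-1}, x_{t-2}, ...
        x[t] = np.dot(coefs, past) + noise[t]
    return x[burn:]

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[lag:], x[:-lag]) / denom
                             for lag in range(1, nlags + 1)])

def sample_pacf(x, nlags):
    """PACF at lag k, taken as the last coefficient of a least-squares AR(k) fit."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    out = [1.0]
    for k in range(1, nlags + 1):
        # columns are the series lagged by 1..k
        X = np.column_stack([x[k - j:len(x) - j] for j in range(1, k + 1)])
        beta, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
        out.append(beta[-1])
    return np.array(out)

series = simulate_ar([0.6, 0.3], n=5000)
acf = sample_acf(series, nlags=6)
pacf = sample_pacf(series, nlags=6)
print("ACF :", np.round(acf, 2))
print("PACF:", np.round(pacf, 2))
```

For this simulated series the ACF should trail off gradually, while the PACF should be clearly non-zero only at lags 1 and 2, which is the pattern described for an AR(k) process with k = 2.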

In [None]:
dif_edw=train_count_log-edw_avg
dif_edw = dif_edw.replace([np.inf, -np.inf], np.nan)
dif_edw.dropna(inplace=True)
test_stationary(dif_edw)

<a id="section-six-four-9"></a>
### Time Shift Transformation

In [None]:
dif_shift=train_count_log-train_count_log.shift()
dif_shift = dif_shift.replace([np.inf, -np.inf], np.nan)
dif_shift.dropna(inplace=True)
test_stationary(dif_shift)

<a id="section-six-four-10"></a>
### Decomposition

- To start with, we want to decompose the data to separate the trend. Since we have about 3 months of confirmed case data, we would expect a weekly or monthly pattern. Let's use a function from statsmodels to help us find it.

In [None]:
decom=seasonal_decompose(dif_edw, freq=3)

trend=decom.trend
seasonal=decom.seasonal
residual=decom.resid

fig=plt.figure(figsize=(15,8))
plt.subplot(211)
train_count_log.plot(label='Original')
plt.title("Original")
plt.subplot(212)
trend.plot(label='Trend')
plt.title("Trend")

'''
plt.subplot(413)
seasonal.plot(label='Seasonal')
plt.title("Seasonal")
plt.subplot(414)
residual.plot(label='Residual')
plt.title("Residual")
fig.tight_layout()
'''

decom_log_data=residual
decom_log_data = decom_log_data.replace([np.inf, -np.inf], np.nan)
decom_log_data.dropna(inplace=True)
#test_stationary(decom_log_data)

<a id="section-six-four-11"></a>
### Building Model

#### How to determine p, d, q
- In our case, first-order differencing makes the time series stationary, so we use d = 1.

- An AR model can be investigated first, with the lag length selected from the PACF or via empirical investigation. In our case, the AR terms are clearly significant within 4 lags, which means we can use p = 4.

- To avoid misspecifying the MA order (for example, when an MA model is tried first and its order ends up set to 0), it often makes sense to extend the AR lag length beyond the last significant term in the PACF.

- When the AR model is appropriately specified, its residuals can be used to observe the uncorrelated error directly. These residuals can then be used to investigate alternative MA and ARMA specifications by regression.

- Assuming an AR(s) model has been estimated, a suggested next step in identification is to estimate an MA model with s-1 lags of the uncorrelated errors derived from that regression. The most parsimonious MA specification can then be compared with a more parsimonious AR specification, and after that ARIMA models can also be analysed.
-----
#### Autoregressive model
- In statistics, econometrics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes in nature, economics, etc.

#### Moving-average model
* In time series analysis, the moving-average (MA) model, also known as the moving-average process, is a common approach for modelling univariate time series. The moving-average model specifies that the output variable depends linearly on the current and various past values of a stochastic term. ARIMA combines the two, with differencing (the "I") applied to make the series stationary.
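
As a rough illustration of these two definitions, the sketch below simulates an AR(2) and an MA(1) series with NumPy, fits the AR coefficients by ordinary least squares, and checks that the residuals of a well-specified AR model are close to uncorrelated. The coefficients (0.5, 0.3, 0.7) and the helper names `fit_ar_ols` and `lag1_autocorr` are assumptions made for the example, not values or functions from this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3000
noise = rng.standard_normal(n)

# AR(2): the current value depends on its own two previous values plus noise
ar = np.zeros(n)
for t in range(2, n):
    ar[t] = 0.5 * ar[t - 1] + 0.3 * ar[t - 2] + noise[t]

# MA(1): the current value depends on the current and the previous noise term
ma = noise.copy()
ma[1:] += 0.7 * noise[:-1]

def fit_ar_ols(x, p):
    """Least-squares estimate of AR(p) coefficients; returns (coefs, residuals)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # columns are the series lagged by 1..p
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    y = x[p:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs, y - X @ coefs

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

coefs, resid = fit_ar_ols(ar, p=2)
print("estimated AR coefficients:", np.round(coefs, 2))
print("lag-1 autocorrelation of AR residuals:", round(float(lag1_autocorr(resid)), 3))
print("lag-1 autocorrelation of MA(1) series:", round(float(lag1_autocorr(ma)), 3))
```

If the residual autocorrelation were clearly non-zero, that would hint at a missing MA component or too few AR lags, which is the reasoning behind inspecting residuals before choosing q.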

In [None]:
## AR MODEL:

train_count_log = train_count_log.replace([np.inf, -np.inf], np.nan)
train_count_log.dropna(inplace=True)
model=ARIMA(train_count_log, order=(4,1,0))
results_AR=model.fit(disp=0)

'''
The residual sum of squares (RSS) is the quantity minimised by the least squares method
when estimating the regression coefficients:
RSS = sum((y - y_hat)**2)
For a given dataset, the smallest RSS corresponds to the largest R-squared.
Deviance is a generalization of the residual sum of squares.
'''

plt.figure(figsize=(18,6))
dif_edw.plot(label='Exponential Decay Differencing')
results_AR.fittedvalues.dropna(inplace=True)
results_AR.fittedvalues.plot(label='Results AR')
# align fitted values with dif_edw on the shared dates before computing the RSS
df=pd.concat([results_AR.fittedvalues, dif_edw], axis=1).dropna()
plt.legend(loc='best')
plt.title('AR MODEL /RSS: %.4f'%sum((df[0]-df['TargetValue'])**2))

In [None]:
## MA MODEL
# note: order=(2,1,1) also includes two AR terms; a pure MA model would be (0,1,q)
model=ARIMA(train_count_log, order=(2,1,1))
results_MA=model.fit(disp=0)

plt.figure(figsize=(18,6))
dif_edw.plot(label='Exponential Decay Differencing')
results_MA.fittedvalues.dropna(inplace=True)
results_MA.fittedvalues.plot(label='Results MA')
df=pd.concat([results_MA.fittedvalues, dif_edw], axis=1).dropna()
plt.legend(loc='best')
plt.title('MA MODEL /RSS: %.4f'%sum((df[0]-df['TargetValue'])**2))

In [None]:
## ARIMA MODEL
model=ARIMA(train_count_log, order=(4,1,2))
results_ARIMA=model.fit(disp=0)

plt.figure(figsize=(18,6))
dif_edw.plot(label='Exponential Decay Differencing')
results_ARIMA.fittedvalues.dropna(inplace=True)
results_ARIMA.fittedvalues.plot(label='Results ARIMA')
df=pd.concat([results_ARIMA.fittedvalues, dif_edw], axis=1).dropna()
plt.legend(loc='best')
plt.title('ARIMA MODEL /RSS: %.4f'%sum((df[0]-df['TargetValue'])**2))

#### ARIMA Pros:

* Interpretability: each coefficient has a specific meaning. ARIMA also gives a thorough grounding in key time series concepts such as lags and lagged error terms, so even when trying other regression models in the future, one can add lag terms and take the error terms into account.

#### ARIMA Cons:

* Inefficiency: ARIMA has to be fitted separately for each time series. For example, with 500 store/item combinations it would need to be run 500 times. In addition, every time we want a new forecast (say, on Jan 2, 2018, for the next 90 days), the model has to be re-run.

<a id="section-six-four-12"></a>
### Prediction & Reverse Transformations

In [None]:
# using AR model
pred_ar_dif=pd.Series(results_AR.fittedvalues, copy=True)
pred_ar_dif_cumsum=pred_ar_dif.cumsum()

pred_ar_log=pd.Series(train_count_log.iloc[0], index=train_count_log.index)
pred_ar_log=pred_ar_log.add(pred_ar_dif_cumsum, fill_value=0)
pred_ar_log.head()

# inverse of log is exp
pred_ar=np.exp(pred_ar_log)
plt.figure(figsize=(12,7))
train.TargetValue.plot(label='Train')
pred_ar.plot(label='Pred')
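
The reverse transformation used here (cumulative sum of the differences, add back the first log value, then exponentiate) can be verified on a small example. This is a minimal sketch on a made-up toy series, using the exact differences rather than model output, so the round trip recovers the original values exactly. With fitted values from a model, the reconstruction is only an approximation.

```python
import numpy as np
import pandas as pd

# toy counts, made up for illustration (not the notebook's data)
counts = pd.Series([5.0, 8.0, 13.0, 21.0, 34.0, 55.0],
                   index=pd.date_range("2020-03-11", periods=6, freq="D"))

# forward transformations: log, then first difference
log_counts = np.log(counts)
log_diff = log_counts.diff().dropna()

# reverse step 1: cumulative sum of the differences, offset by the first log value
restored_log = pd.Series(log_counts.iloc[0], index=log_counts.index)
restored_log = restored_log.add(log_diff.cumsum(), fill_value=0)

# reverse step 2: the inverse of log is exp
restored = np.exp(restored_log)
print(np.allclose(restored.values, counts.values))
```

The `fill_value=0` argument matters: the differenced series has no entry for the first date, so the first restored value is just the starting log value.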

<a id="section-six-four-13"></a>
### Validation 

In [None]:
def validation(order):
    # forecasting for validation
    valid_count_log=list(np.log(valid.TargetValue).values)
    history = list(train_count_log.values)
    model = ARIMA(history, order=order)
    model_fit = model.fit(disp=0)
    output = model_fit.forecast(steps=len(valid))
    mse = mean_squared_error(valid_count_log, output[0])
    rmse = np.sqrt(mse)
    print('Test MSE: %.3f' % mse)
    print('Test RMSE: %.3f' % rmse)
    
    fig=plt.figure(figsize=(12,7))
    # reverse transform
    pred=np.exp(output[0])
    pred=pd.Series(pred, index=valid.index)
    valid.TargetValue.plot(label='Valid')
    pred.plot(label='Pred')
    plt.legend(loc='best')
    
    fig=plt.figure(figsize=(18,7))
    train.TargetValue.plot(label='Train')
    valid.TargetValue.plot(label='Valid')
    pred.plot(label='Pred', color='black')

In [None]:
validation((2,1,2))

<a id="section-six-four-14"></a>
### Test Forecasting

In [None]:
'''def arima_predict_hourly(data, arima_order):
    # forecasting for testing (Daily based forecasting)
    
    history = data
    model = ARIMA(history, order=arima_order)
    model_fit = model.fit(disp=0)
    output = model_fit.forecast(steps=len(test_original))

    submit=test_original.copy()
    submit.index=submit.ID
    submit['Count']=np.exp(output[0])
    submit.drop(['Unnamed: 0','ID','Datetime','year','month','day','hour'], axis=1, inplace=True)
    
    # plot result
    plt.figure(figsize=(12,7))
    train_original.index=train_original.Datetime
    submit.index=test_original.Datetime

    train_original.TargetValue.plot(label='Train')
    submit.TargetValue.plot(label='Pred')
    return submit'''

from pandas import DataFrame
# forecasting for testing (daily based forecasting)

h = list(np.log(train_original.TargetValue).values)
history = DataFrame(h,columns=['values'])
history = history.replace([np.inf, -np.inf], np.nan)
history.fillna(0, inplace=True)

model = ARIMA(history, order=(2,0,1))
model_fit = model.fit(disp=0)
output = model_fit.forecast(steps=len(test_original))

submit=test_original.copy()
submit.index=submit.ForecastId
submit['TargetValue']=np.exp(output[0])
submit.drop(['ForecastId','Date','year','month','day','hour'], axis=1, inplace=True)

In [None]:
# plot result
plt.figure(figsize=(18,7))
train_original.index=train_original.Date
submit.index=test_original.Date

train_original.TargetValue.plot(label='Train')
submit.TargetValue.plot(label='Pred')

<a id="section-six-four-15"></a>
### ARIMA PDQ Param Tuning

In [None]:
# evaluate an ARIMA model for a given order (p,d,q)
def evaluate_arima_model(arima_order):
    # forecasting for validation
    valid_count_log=list(np.log(valid.TargetValue).values)
    history = list(train_count_log.values)
    model = ARIMA(history, order=arima_order)
    model_fit = model.fit(disp=0)
    output = model_fit.forecast(steps=len(valid))
    mse = mean_squared_error(valid_count_log, output[0])
    rmse = np.sqrt(mse)
#     print('Test MSE: %.3f' % mse)
#     print('Test RMSE: %.3f' % rmse)
    return mse


# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(p_values, d_values, q_values):
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    mse = evaluate_arima_model(order)
                    if mse < best_score:
                        best_score, best_cfg = mse, order
                    print('ARIMA%s MSE=%.3f' % (order,mse))
                except Exception:  # skip orders that fail to fit
                    continue
    print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score))

In [None]:
# evaluate parameters
p_values = [0, 1, 2, 4, 6, 8]
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
evaluate_models(p_values, d_values, q_values)

In [None]:
# ARIMA PDQ Param Tuning said that BEST ARIMA -> (4,1,0)
validation((4,1,0))

#### Further steps:
* Improve the current model by trying different techniques and evaluating it with different metrics.
* Make longer-term predictions as the underlying concepts become better understood.
* As new data arrives, make more detailed forecasts at the city/region level.

### TO SEE THE GLOBAL FORECASTING, PLEASE GO TO PART 2 -> https://www.kaggle.com/thepinokyo/esin-part-2-covid19

# STAY AT HOME AND DO KAGGLE
<img src="https://i.hizliresim.com/dOUeA3.jpg">

### REFERENCES
* https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
* https://www.who.int/health-topics/coronavirus#tab=tab_1
* https://www.stat.tamu.edu/~jnewton/stat626/topics/topics/topic5.pdf
* https://tr.wikipedia.org/wiki/Türkiye%27de_COVID-19_pandemisi_zaman_çizelgesi
* https://www.herkesebilimteknoloji.com/yazarlar/orhan-bursali/turkiyede-covid-19-olum-oranlari-neden-artti
* https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
* https://www.researchgate.net/post/How_does_one_determine_the_values_for_ARp_and_MAq
* https://stats.stackexchange.com/questions/281666/how-does-acf-pacf-identify-the-order-of-ma-and-ar-terms/281726#281726
* https://stats.stackexchange.com/questions/134487/analyse-acf-and-pacf-plots?rq=1