# 1. Introduction

In this report, we analyze the spread and evolution of COVID-19 across 312 regions around the world and forecast the confirmed cases and fatalities between April 15 and May 14 from earlier observations. Our submission is publicly available on [Kaggle](https://www.kaggle.com/hongshitan/nustarcat).

Section 2 explores the structure and features of the data set and describes our data cleaning method. Section 3 gives an overview of the features and how we prepare them. Section 4 details the model we selected for the growth of COVID-19 cases and our approach to estimating its parameters, and compares it with the naive linear regression method, which is widely used and has achieved good results. Because the training data is sampled globally and is therefore diverse, parameter estimation runs into corner cases and outliers that degrade the predictions; Section 5 describes the outlier handling strategies we adopt. In the final part, we discuss the shortcomings of our model and directions for future work.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import plotly.graph_objects as go
import plotly.express as px 

from sklearn import linear_model
from sklearn import preprocessing
from plotly.subplots import make_subplots

#import plotly.io as pio
#pio.renderers.default = "notebook+pdf" 

In [None]:
path = "../input/covid19-global-forecasting-week-4/"
train_df = pd.read_csv(path + "train.csv")
test_df = pd.read_csv(path + "test.csv")
submission = pd.read_csv(path + "submission.csv")


# 2. Dataset Description




Our goal is to forecast confirmed cases and fatalities between April 15 and May 14 by region.


- **a. What is the source of your data? What is it about?**  
The training dataset contains the confirmed cases and fatalities of COVID-19 by region between January 22 and May 15, 2020, as provided by JHU CSSE.
The test dataset covers the same regions as the training set, with dates between April 2 and May 14, 2020.
- **b. How many features and datapoints does it contain?**  
The `train_df` dataframe has a shape of (35995, 6) and contains the daily number of COVID-19 cases and deaths from 184 countries and 312 states. For each region, the data covers the 115 consecutive days from January 22 to May 15, 2020.
- **c. List a few (at most 10) features and describe them.**  
Sadly, there are only 6 columns, of which only 5 are meaningful features (`Id` is just a row index): `State`, `Country`, `Date`, `ConfirmedCases` and `Fatalities`. The rest of this section visualizes these features.

In [None]:
#train_df.shape

In [None]:
#test_df.shape

In [None]:
train_df.groupby(['Province_State','Country_Region']).ConfirmedCases.describe()

## 2.1 Data Set Structure


This task includes two data sets: training data set and test data set.
The training data set contains six columns, described as follows.

* Id: Unique index of the observation
* Province_State: Province or state within a specific country
* Country_Region: Name of the country
* Date: Timestamp for the corresponding data
* ConfirmedCases: The number of confirmed infected cases of COVID-19
* Fatalities: The number of deaths caused by COVID-19

As noted above, the training dataset covers January 22 to May 15, 2020, while the test dataset covers the same regions between April 2 and May 14, 2020.
The two datasets therefore overlap, and we remove the overlapping dates from the training dataset to avoid data leakage.
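
As a minimal sketch of this leakage removal, the toy example below keeps only the training rows dated before the first test date. The frames and the name `train_no_leak` are made up for illustration, and the cut-off rule is a simplified variant of the date filtering used later in this report.

```python
import pandas as pd

# Toy data; ISO-formatted date strings compare correctly as plain strings.
train = pd.DataFrame({
    "Date": ["2020-03-31", "2020-04-01", "2020-04-02", "2020-04-03"],
    "ConfirmedCases": [10, 12, 15, 19],
})
test = pd.DataFrame({"Date": ["2020-04-02", "2020-04-03", "2020-04-04"]})

# Keep only training rows dated strictly before the first test date.
first_test_date = test["Date"].min()
train_no_leak = train[train["Date"] < first_test_date]
print(train_no_leak)
```

Cutting at the first test date also discards any training rows that extend past the end of the test period.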

In [None]:
def count_missing_values(df):
    missing_values = 0
    for column in df:
        missing_values += df[column].isna().sum()
    return missing_values

def get_col_have_missing_values(df):
    col = []
    for column in df:
        
        missing_values = df[column].isna().sum()
        if not( missing_values == 0):
            col += [column]
    return col

def fill_missing_state(state, country):
    if pd.isna(state):
        return country
    else :
        return state

## 2.2 Fill Missing Values

Because countries have different policies on collecting and releasing COVID-19 data, province/state-level details are not publicly available for some countries, such as Germany, which leaves NaN values in the data set. Therefore, before the analysis, we fill each missing province/state with its country name.

- **a. Are there missing values in your dataset?**  


In [None]:
if  count_missing_values(train_df) == 0:
    print('No')
else:
    print('Yes, the following table shows the missing values intuitively.')
    

In [None]:
train_df.head()

- **b. If yes, please provide the statistics (such as how many, which features, etc.) of the missing values.**  

In [None]:
print(f"The total number of missing value is {count_missing_values(train_df)}.")

In [None]:
missing_cols = get_col_have_missing_values(train_df)
print('The incomplete feature is:')
for col in missing_cols:
    print('\t'+col)
  

In [None]:
print('All the missing values are State data from some countries (without state-specific data).')

- **c. Do you want to remove such data-points from the dataset? Why?**  
No. In large countries such as the US or China, population density, human traffic and transmission efficiency vary across provinces, which is why state-level data is more useful for analyzing and predicting COVID-19 there. For countries without the `State` feature, we treat the country's data as a whole to predict future cases and fatalities instead of removing these data points.  

- **d. Do you perform imputation to fill in missing values? What technique would you use?**  
We fill each missing value with the country name, because `State` carries the same kind of geographic information as `Country`, only more specific. When there is no `State` value, we simply use the `Country` value to represent the place.
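
The same imputation can also be written without a row-wise `apply`. The sketch below is an illustrative alternative on a made-up frame `toy`, assuming the columns are already named `State` and `Country`.

```python
import pandas as pd

# Toy frame; the column names mirror the renamed columns used in this report.
toy = pd.DataFrame({
    "State": ["Hubei", None, "New York", None],
    "Country": ["China", "Germany", "US", "Italy"],
})

# Wherever State is missing, take the value from Country on the same row.
toy["State"] = toy["State"].fillna(toy["Country"])
print(toy)
```

Passing a Series to `fillna` aligns on the index, so each missing state receives the country from its own row.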

In [None]:

# modify col name and fill in missing value in 'Province_State' with 'Country_Region'
test_df.rename(columns={'Province_State':'State','Country_Region':'Country'}, inplace=True)
test_df['State'] = test_df.apply(lambda df: fill_missing_state(df['State'],df['Country']),axis=1)

train_df.rename(columns={'Province_State':'State','Country_Region':'Country'}, inplace=True)
train_df['State'] = train_df.apply(lambda df: fill_missing_state(df['State'], df['Country']) ,axis=1)

## 2.3 Data Analytics

We conduct an exploratory data analysis, which includes:
- A top-10 ranking of confirmed cases/fatalities grouped by country.
- A visualization of global cases/fatalities grouped by country.

In [None]:
# constant
plot_days = 14
cases = 20
power = 1
alpha = 0.3
rmsle_cal_days = 28

In [None]:
def global_overview(data):
    plot = train_df.loc[(train_df['Country'].isin(countries))].groupby(['Date', 'Country', 'State']).max().groupby(['Date', 'Country']).sum().sort_values(by=data, ascending=False).reset_index()
    plot2 = train_df.groupby(['Date'])[data].sum().reset_index()
    fig = px.bar(plot, x="Date", y=data, color="Country", barmode="stack")
    fig.add_scatter(x=plot2['Date'], y=plot2[data],name='Global Trend') 
    fig.update_layout(title=data,width=1000,
    height=500,)
    
    fig.show()

In [None]:
# A sorted dataframe according to countries. 
display_df = train_df.drop(['Fatalities'], axis= 1)
df_confirmedcases = display_df.groupby(['Country','State']).max().groupby('Country').sum().sort_values(by='ConfirmedCases', ascending=False).reset_index().drop(columns='Id')
countries = df_confirmedcases[:10]['Country'].unique().tolist()
df_confirmedcases[:10].set_index('Country').transpose()


The table above shows that, as of May 15, 2020, the United States had the most confirmed COVID-19 infections, with Russia in second place.

In [None]:
global_overview('ConfirmedCases')

The figure above depicts the global trend of confirmed COVID-19 cases, whose growth after March 29 is almost linear. The number of confirmed cases exceeded 3 million by the end of April.


In [None]:
display_df = train_df.drop(['ConfirmedCases'], axis= 1)
df_confirmedcases = display_df.groupby(['Country','State']).max().groupby('Country').sum().sort_values(by='Fatalities', ascending=False).reset_index().drop(columns='Id')
countries = df_confirmedcases[:10]['Country'].unique().tolist()
df_confirmedcases[:10].set_index('Country').transpose()


This table lists the 10 countries with the most fatalities. With its large number of confirmed cases, the United States also has the most fatalities, followed by the United Kingdom.

In [None]:
global_overview('Fatalities')

The figure above depicts the global trend of COVID-19 deaths. The growth ratio of fatalities declined slightly. The top 10 countries account for the majority of fatalities.

# 3. Feature Engineering

In this section, we clean up the data set and prepare the data for the construction of our model.

We first remove the leakage in the training data set. It contains observations from 2020-04-02 to 2020-05-15, which overlap the period to be predicted; we drop these rows and, after concatenating the result with the test set, fill the unknown targets with zeros.

There are only five features in the original data set: State, Country, Date, ConfirmedCases and Fatalities. Because the `State` and `Country` columns (originally `Province_State` and `Country_Region`) are strings, which are inconvenient to process, we encode their values as unique integer indexes and build the corresponding dictionaries for mapping them back. We then split the `Date` column into `Day`, `Week`, `Month` and `DayOfWeek` for further exploration.
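
The encoding and date splitting described above can be sketched on a toy frame as follows. This is an illustrative pandas-only variant: `pd.factorize` stands in for the label encoder, and the names `toy` and `inv_country` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["US", "China", "US"],
    "Date": ["2020-01-22", "2020-01-23", "2020-03-01"],
})

# Integer codes for a string column, plus a dictionary to map codes back to names.
codes, uniques = pd.factorize(toy["Country"], sort=True)
toy["Country"] = codes
inv_country = dict(enumerate(uniques))

# Split the date into calendar components.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Day"] = toy["Date"].dt.day
toy["Month"] = toy["Date"].dt.month
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday = 0
print(toy)
```

With `sort=True` the codes follow the alphabetical order of the labels, so the mapping does not depend on row order.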





In [None]:
data_leak = pd.merge(train_df,test_df, how='inner', on='Date')['Date'].unique().tolist()
data_leak.append('2020-05-15')
data_leak.sort()


In [None]:
train_df_fix = train_df.loc[~train_df['Date'].isin(data_leak)]
df_all = pd.concat([train_df_fix, test_df], axis = 0, sort=False)


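# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_all['ConfirmedCases'] = df_all['ConfirmedCases'].fillna(0)
df_all['Fatalities'] = df_all['Fatalities'].fillna(0)
df_all['Id'] = df_all['Id'].fillna(-999)
df_all['ForecastId'] = df_all['ForecastId'].fillna(-999)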

In [None]:
def create_features(df_in):
    df = df_in.copy()
    df['Day_num'] = le.fit_transform(df['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    df['Day'] = df['Date'].dt.day
    df['Week'] = df['Date'].dt.isocalendar().week
    df['Month'] = df['Date'].dt.month
    #df['Year'] = df['Date'].dt.year
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    
    df['Country'] = le.fit_transform(df['Country'])
    country_dict = dict(zip(le.inverse_transform(df['Country']), df['Country'])) 
    
    df['State'] = le.fit_transform(df['State'])
    state_dict = dict(zip(le.inverse_transform(df['State']), df['State']))
    
    return df, country_dict, state_dict

In [None]:
le = preprocessing.LabelEncoder()

df_all, country_dict, state_dict = create_features(df_all)

golden_df,golden_country_dict, golden_state_dict = create_features(train_df)

inv_country_dict = {v: k for k, v in country_dict.items()}
inv_state_dict = {v: k for k, v in state_dict.items()}

# 4. Modeling

## 4.1 Logistic Curve Regression
A logistic function is a common S-shaped curve (also called a sigmoid curve) with the equation:

$$f(x) = \frac{L}{1+e^{-k(x-x_0)}}$$
where:
* $x_0$ is the $x$ value of the sigmoid's midpoint;  
* $L$, the curve's maximum value;  
* $k$, the logistic growth rate or steepness of the curve.  

A logistic function, or a related function (e.g. the Gompertz function), is usually used in a descriptive or phenomenological manner because it fits well not only the early exponential rise but also the eventual levelling off of the pandemic as the population develops herd immunity, so the curve can be used to model a pandemic [[1]](https://en.wikipedia.org/wiki/Logistic_function#In_medicine:_modeling_of_a_pandemic).
Therefore, we choose a logistic function to model the increase in the number of confirmed COVID-19 cases and deaths over time. Three coefficients need to be estimated: the growth ratio $k$, the maximum value $L$ and the midpoint $x_0$.
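
To make the roles of the three coefficients concrete, here is a minimal sketch that evaluates the curve. The helper name `logistic` and the parameter values are illustrative assumptions, not values fitted in this report.

```python
import numpy as np

def logistic(x, L, k, x0):
    """Logistic curve with maximum value L, growth rate k and midpoint x0."""
    x = np.asarray(x, dtype=float)
    return L / (1.0 + np.exp(-k * (x - x0)))

days = np.arange(0, 61)
curve = logistic(days, L=1000.0, k=0.2, x0=30.0)

# At the midpoint the curve reaches half of its maximum value.
print(curve[30], curve[-1])
```

The curve rises monotonically, passes through $L/2$ at $x = x_0$ and approaches $L$ without exceeding it.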

The growth ratio $k$ is influenced by many factors, such as government social distancing policy and population density, which can differ between countries and even between provinces of the same country. Therefore, we fit the coefficients of each place independently and do not consider propagation across places.

We fit the growth ratio $k$ and the maximum value $L$ as follows. First, we take the derivative of $f(x)$:

$$
\begin{aligned}
\frac{d}{dx}f(x) &= L \cdot k \cdot e^{-k (x - x_0)} \cdot \left(1 + e^{-k  (x - x_0)} \right)^{-2} \\
 &=  k \cdot \frac{L }{1 + e^{-k(x-x_0)}} \cdot  \left( \frac{ e^{-k (x-x_0)}}{1 + e^{-k (x-x_0)}} \right) \\
 &=  k \cdot f(x) \cdot \left( 1 - \frac{f(x)}{L} \right) \\
\frac{\frac{d}{dx}f(x)}{f(x)} &= k \cdot \left( 1 - \frac{f(x)}{L} \right) 
\end{aligned}
$$

The proportional growth ratio $\frac{\frac{d}{dx}f(x)}{f(x)}$ is therefore linear in $f(x)$, with intercept $k$ and slope $-k/L$. Hence, we apply linear regression on the training dataset to fit the above equation, and then compute $k$ and $L$ from the regression coefficients.
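
The linearization can be checked on synthetic data. The sketch below is illustrative only: it generates counts from a logistic curve with assumed parameters, estimates the proportional growth ratio with a central difference, and uses `np.polyfit` as a stand-in for the regressors used later.

```python
import numpy as np

# Synthetic cumulative counts from a known logistic curve (illustrative values).
L_true, k_true, x0_true = 1000.0, 0.2, 30.0
days = np.arange(0, 61)
counts = L_true / (1.0 + np.exp(-k_true * (days - x0_true)))

# Central-difference estimate of the daily growth, divided by the count itself.
growth = 0.5 * (counts[2:] - counts[:-2])
ratio = growth / counts[1:-1]

# Fit ratio = a * count + b; the intercept gives k and the zero crossing gives L.
a, b = np.polyfit(counts[1:-1], ratio, deg=1)
k_est, L_est = b, -b / a
print(k_est, L_est)
```

Because the derivative is approximated by a daily difference, the recovered values are close to, but not exactly, the parameters that generated the data.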

The midpoint $x_0$ can be treated as a translation transformation of the logistic curve on the time axis ($x$ axis). We construct an error function $h(x_0)$: 

$$h(x_0) = \sum_{i \in T}{\left( f(i,x_0) - g(i) \right) }$$
where:
* $T$ is the set of time points in training dataset;
* $g(i)$ is the ground truth of confirmed cases or fatalities at time $i$ (from the training dataset);
* $f(i,x_0)$ is the output of logistic function with midpoint $x_0$ at time $i$;


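The optimal $x_0$ satisfies the equation $h(x_0) = 0$, that is, the over- and under-estimates of the fitted curve cancel out. We solve this equation with Newton's method, approximating the derivative of $h$ by a central difference, to obtain the optimal $x_0$.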
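
A minimal sketch of this root search is given below, assuming $L$ and $k$ are already known. The helper `mean_signed_error`, the starting guess and the tolerances are illustrative choices, and the derivative is approximated by a central difference rather than computed analytically.

```python
import numpy as np

def mean_signed_error(x0, L, k, observed):
    """Mean of f(i, x0) - g(i) over the observed days."""
    days = np.arange(len(observed))
    fitted = L / (1.0 + np.exp(-k * (days - x0)))
    return float(np.mean(fitted - observed))

# Synthetic observations generated with a known midpoint (illustrative values).
L, k, x0_true = 1000.0, 0.2, 30.0
observed = L / (1.0 + np.exp(-k * (np.arange(61) - x0_true)))

x0 = 25.0  # initial guess
for _ in range(50):
    err = mean_signed_error(x0, L, k, observed)
    if abs(err) < 1e-6:
        break
    # Central-difference approximation of dh/dx0.
    slope = (mean_signed_error(x0 + 1, L, k, observed)
             - mean_signed_error(x0 - 1, L, k, observed)) / 2.0
    if abs(slope) < 1e-12:
        break
    x0 -= err / slope  # Newton update
print(x0)
```

Shifting the midpoint to the right lowers every fitted value, so the signed error decreases steadily in $x_0$ and the iteration settles quickly on this example.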



           
In this task, we use the Root Mean Squared Logarithmic Error (RMSLE) to evaluate the estimates.
When the range of predicted values is large, RMSE tends to be dominated by a few large values: even if a model predicts many small values accurately, its RMSE may be large because one very large value is off. Conversely, a poorer model that is more accurate on large values but worse on many small ones may obtain a smaller RMSE.
Taking the logarithm first and then computing the RMSE mitigates this problem. Note that RMSLE penalizes under-prediction more than over-prediction.
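
Both properties can be illustrated with a small numeric example. The numbers below are made up for the illustration, and the helpers `rmse` and `rmsle` are standalone sketches.

```python
import numpy as np

def rmse(pred, actual):
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle(pred, actual):
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

actual = np.array([10.0, 20.0, 100000.0])
pred_small_ok = np.array([10.0, 20.0, 110000.0])  # small regions exact, large one 10% off
pred_large_ok = np.array([40.0, 80.0, 100000.0])  # large region exact, small ones 4x off

print("RMSE :", rmse(pred_small_ok, actual), rmse(pred_large_ok, actual))
print("RMSLE:", rmsle(pred_small_ok, actual), rmsle(pred_large_ok, actual))

# Same absolute miss of 50 around a true value of 100: under- versus over-prediction.
under = rmsle([50.0], [100.0])
over = rmsle([150.0], [100.0])
print(under, over)
```

RMSE prefers the second set of predictions because of the single large region, whereas RMSLE prefers the first, and the under-prediction receives the larger RMSLE of the last two values.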

In [None]:
def RMSLE(pred,actual):
    return np.sqrt(np.mean(np.power((np.log(pred+1)-np.log(actual+1)),2)))

In [None]:
def filter_df_state(df ,country_dictionary,state_dictionary,country_name,state_name):
    df_country = df.copy()
    df_country = df_country.loc[df_country['Day_num'] >= 0]
    if isinstance(country_name, str):  # names are mapped to their encoded indexes
        df_country = df_country.loc[df_country['Country'] == country_dictionary[country_name]]
        df_country = df_country.loc[df_country['State'] == state_dictionary[state_name]]
    else:
        df_country = df_country.loc[df_country['Country'] == country_name]
        df_country = df_country.loc[df_country['State'] == state_name]
    features = ['Id', 'State', 'Country','ConfirmedCases', 'Fatalities', 'Day_num']
    df_country = df_country[features]
    return df_country

In [None]:
def error_function(L, k, groundtruth, tn):
    logis = L / (1. + np.exp(-k * (np.arange(len(groundtruth)) - tn)))
    max_value =  max([max(logis), max(groundtruth.values)])
    normed_pred = [i/max_value for i in logis]
    normed_real = [i/max_value for i in groundtruth.values]
    error = sum(np.array(normed_pred) - np.array(normed_real))/len(logis)
    return error

In [None]:
start = df_all[df_all['Id']==-999].Day_num.min()
end = df_all[df_all['Id']==-999].Day_num.max()
total_day = end + 1
print('prediction start day: {} end day: {}'.format(start, end))


Function description:
- Dataset preparation: function `filter_df_state`
- Error function: `error = mean(y_hat - y_truth)` on normalized values, see `error_function`
- Prediction with the logistic (sigmoid) curve regression: functions `ratios_calculation` and `fit_logistic_curve`

Parameter description:
- method 0: Ridge Regression
- method 1: Linear Regression
- method 2: Ransac                                           

One of these three methods is used to compute the coefficients of the logistic curve.  
D(t) is the number of confirmed cases/fatalities on day t, and Ratios = (D(t+1)-D(t-1))/(2*D(t)).  
The regression line y_hat = a * x + b reaches y_hat = 0 at x = -b / a, which is the maximum of ConfirmedCases/Fatalities for the region, that is, `L`. The intercept `b` approximates the growth rate `k`. So we can compute the coefficients of the logistic curve from the outputs `a` and `b`.  
In this way, we transform the logistic curve fitting problem into a linear regression problem, in which outliers are easier to detect with `RANSACRegressor`.  
For the last coefficient `t0`, we use Newton's method to iteratively update `tn` and then set `res_t0` = `tn`.

In [None]:
def ratios_calculation(country, all_case, method, t0=0, plot=0, min_case = 1, start_day = 0,cut_ratio = 0.3):
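    '''
    params:
    country: cumulative case counts (pd.Series) for a single region, used for fitting
    all_case: full ground-truth series for the region (not used in the fit)
    method: regression used on the ratios: 0 = Ridge, 1 = LinearRegression, otherwise RANSAC
    t0: initial guess for the day of the inflexion
    plot: verbosity level; >= 1 plots the regression, >= 2 the fitted curve, >= 4 prints debug output
    min_case: the minimum number of cases; smaller observations are dropped
    start_day: prediction start date (currently unused)
    cut_ratio: drop points whose count and ratio are both at most this fraction of their maxima
    return: L,k,res_t0, error, pd.Series(data=prediction)
    '''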
    ground_truth = country.copy()
    country  =  country[country >= min_case]
    if (len(country) >= 10):
        cut_ratio = 0.1
    slopes = 0.5 * ((country.diff(1) - country.diff(-1)))
    ratios = slopes / country
    x1 = country.values[1:-1]
    y1 = ratios.values[1:-1]
    x, y = [],[]
    
    for i in range(len(x1)):
        if not ((x1[i] <= x1.max()* cut_ratio )
                and ((y1[i] <= y1.max()* cut_ratio) )):
           # if (y1[i] < y1.max()  * 0.9):
            x += [x1[i]]
            y += [y1[i]]

    x, y = np.array(x), np.array(y)
    X = x.reshape(-1, 1) 
   
    try:
        if method == 0:
            reg = linear_model.Ridge(fit_intercept=True, normalize=True)
            reg.fit(X, y)
            a = reg.coef_[0]
            b = reg.intercept_
        elif method == 1:
            reg = linear_model.LinearRegression(fit_intercept=True, normalize=True)
            reg.fit(X, y)
            a = reg.coef_[0]
            b = reg.intercept_
        else: # RANSAC
            try :
                ransac = linear_model.RANSACRegressor()
                reg = ransac.fit(X, y)
                reg.fit(X, y)
                a = reg.estimator_.coef_[0]
                b = reg.estimator_.intercept_
            except:
                print("RANSAC failed")
                #reg = linear_model.LinearRegression(fit_intercept=True, normalize=True)
                reg = linear_model.Ridge(fit_intercept=True, normalize=True)
                reg.fit(X, y)
                a = reg.coef_[0]
                b = reg.intercept_

        #print(method)
    except:
        if plot >=4:
            print("[DEBUG] unexpected error, using the default value {}".format(len(country)))
        L,k, res_t0,error= 0,0,0,0
        prediction = pd.Series(
            data=[ ground_truth.values[i] if i < len(ground_truth) else ground_truth.max() for i in range(total_day +1)])
        return L,k,res_t0, error, prediction
   
    L = -b / a 
    if (a > 0) or (np.isnan(L)):
        if plot >=4:
            print("[DEBUG] unexpected regression results: a {} L {}".format(a, L))
        L,k, res_t0,error= 0,0,0,0
        prediction = pd.Series(
            data=[ ground_truth.values[i] if i < len(ground_truth) else ground_truth.max() for i in range(total_day +1)])
        return L,k,res_t0, error, prediction

    k = b
    L = -b / a 
    y_hat = a * x + b 
    
   
        
    tn = t0
    epsilon  = 0.01
    max_iteration = 100

    # Newton's method to calculate t0 
    for n in range(0, max_iteration):
        error  = error_function(L, k, ground_truth,tn)
        derror = (error_function(L, k, ground_truth,tn + 1) - error_function(L, k, ground_truth,tn -1))/2
        if plot >= 4:
            print('[DEBUG] error: {}, derror: {}'.format(error,derror))
        if (abs(error) <  epsilon):
            break
        if (abs(derror) < epsilon): # derivative too flat to continue: accept the current tn
            break
        tn = tn - error/derror

    if np.isnan(tn):
        res_t0 = t0
        #print('nan detected')
    else:
        res_t0 = tn

    #res_t0 += start_day
    prediction= [L / (1. + np.exp(-k * (t - res_t0))) for t in range(0 ,total_day + 1)]
    
    if plot >= 1:
        fig = go.Figure()
        fig = make_subplots()
        fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name='Proportional growth ratio'))
        fig.add_scatter(x=x, y=y_hat, 
                        name='Linear regression')

        fig.update_layout(title='Linear Regression on the Proportional Growth Ratio and Number of Growth',
                       xaxis_title='Number of Growth',
                       yaxis_title='Proportional Growth Ratio')
        fig.show()
        
    if plot >= 2:
        logis = L / (1. + np.exp(-k * (np.arange(len(ground_truth)) - res_t0)))
        basic_xaixs = list(range(len(logis)))
        fig = go.Figure()
        fig = make_subplots()
        fig.add_scatter(x=basic_xaixs, y=ground_truth.values, 
                        name='Ground Truth', 
                        line=dict(  color="MediumPurple",
                                    width=4,
                                    dash="dot",)
                       )
        fig.add_scatter(x=basic_xaixs, y=logis, 
                        name='Proposed Model')

        fig.update_layout(title='Our Model versus the Ground Truth ',
                       xaxis_title='Days',
                       yaxis_title='Number of Cases')
        fig.show()

    return L,k,res_t0, error, pd.Series(data=prediction)

The coefficients computed by `ratios_calculation` are then used to complete the logistic curve regression.

In [None]:
def fit_logistic_curve(country,state, t0, method = None, plot = 1, target = 'ConfirmedCases', res_df = None,cut_ratio = 0.3):
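    '''
    params:
    country: country name (or its encoded index)
    state: state name (or its encoded index)
    t0: initial guess for the day of the inflexion
    method: regression method passed to ratios_calculation (0, 1 or 2); None tries all three (debug only)
    plot: verbosity level; >= 1 prints the RMSLE, >= 3 plots the prediction, >= 4 prints debug output
    target: predict confirmed cases or fatalities, default value = 'ConfirmedCases'
    res_df: unused
    cut_ratio: passed through to ratios_calculation
    return: the selected prediction, res[selected_index]
    '''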
    df_country = filter_df_state(df_all,country_dict, state_dict, country,state)
    df_golden = filter_df_state(golden_df,golden_country_dict,golden_state_dict,country,state)
    start_day = df_country[df_country[target] >= 1]['Day_num'].min()
    #print(start_day)

    min_case = 1
    test_country = df_country[df_country['Day_num'] < start][target]
    all_case_country = df_golden[target]
    
    num_of_tests = 5
    rmsle = [None] * num_of_tests
    res = [None] * num_of_tests
    methods = [None] * num_of_tests
    
      
 
    if test_country.max() < 1:
        prediction_res = pd.Series(data=[0 for i in range(total_day +1)])
        #fake 
        selected_index = 0
        methods[selected_index] = method 
        res[selected_index] = prediction_res
        rmsle[selected_index] = RMSLE(prediction_res[start:start + rmsle_cal_days],
                                      all_case_country.values[start: start+ rmsle_cal_days])
        L,k = 0,0 
        
    else:
        if method is None: #debug only
            for tests in range(len(res)):
                L_array,k_array,t_array, error_array = [None]*3,[None]*3,[None]*3,[None]*3
                for i  in range(3):
                    L, k,t, error, prediction_res \
                        = ratios_calculation(test_country,
                                             all_case_country, 
                                             i, 
                                             t0=t0,
                                             plot=0,
                                             start_day=start_day,
                                             cut_ratio=cut_ratio)
                    error_array[i] = error
                #print(error)
                error_array = np.abs(error_array)
                index_min = np.argmin(error_array)
                res[tests] = prediction_res 
                rmsle[tests] = RMSLE(prediction_res[start:start + rmsle_cal_days],
                                     all_case_country.values[start: start+ rmsle_cal_days])
                methods[tests] = index_min
            selected_index = np.argmin(rmsle)
        else:
            L, k,t, error, prediction_res  \
                    = ratios_calculation(test_country,
                                         all_case_country, 
                                         method, 
                                         t0=t0,
                                         plot=plot,
                                         start_day=start_day,
                                         cut_ratio=cut_ratio)
            selected_index = 0
            methods[selected_index] = method
            res[selected_index] = prediction_res
            rmsle[selected_index] = RMSLE(prediction_res[start:start + rmsle_cal_days],
                                          all_case_country.values[start: start+ rmsle_cal_days])

   

    if plot >= 3:
        basic_xaixs = list(range(len(all_case_country)))
        train_xaixs = list(range(len(test_country)))
        fig = go.Figure()
        fig = make_subplots()
        fig.add_trace(go.Scatter(x=train_xaixs, y=test_country.values, mode="markers", name='Training data'))
        fig.add_trace(go.Scatter(x=basic_xaixs[len(train_xaixs):], 
                                 y=all_case_country.values[len(train_xaixs):],
                                 mode="markers", name='Ground Truth'))
        fig.add_scatter(x=basic_xaixs, y=res[selected_index], 
                        name='Prediction')
        max_y = max([max(res[selected_index]),max(test_country.values),max(all_case_country.values)])
        fig.add_annotation( x=start, y=-0.1 * max_y, xref="x", yref="y", text="Start of Prediction", showarrow=True, 
            font=dict(size=12),
            align="center",  arrowhead=0, arrowsize=1, arrowwidth=2, ax=-0, ay=-345, borderwidth=2, borderpad=4, opacity=1
            )
        fig.update_layout(title='Prediction',
                       xaxis_title='Days',
                       yaxis_title='Number of Cases')
        fig.show()

    if plot >=1:
        print("RMSLE: {} by method {}".format(rmsle[selected_index],methods[selected_index]))
    if plot >= 4:
        print("[DEBUG] L:{} k:{}".format(L, k))

    return res[selected_index]


In [None]:
res = fit_logistic_curve('Australia','New South Wales',15,plot=3 )

In [None]:
res = fit_logistic_curve('China','Hubei',15, plot=3)

## 4.2 Comparison with Linear Regression

In [None]:
def linear_regression_prepare_training(df,d,day,filtered_cols,target):
    '''
    params:
    df: data frame
    d: the higher bound of day_num for training df
    day: the lower bound of day_num
    filtered_cols: choose cases or fatalities
    return: tuple of training x, y and test_x
    '''
    df=df.loc[df['Day_num'] >= day]
    df_train = df.loc[df['Day_num'] < d]
    x = df_train
    y = df_train[target]
    x = x.drop(columns=filtered_cols).drop(columns=target)
    res = df.loc[df['Day_num'] == d]
    test_x = res
    test_x = test_x.drop(columns=filtered_cols).drop(columns=target) 
    x.drop('Id', inplace=True, errors='ignore', axis=1)
    x.drop('ForecastId', inplace=True, errors='ignore', axis=1)
    test_x.drop('Id', inplace=True, errors='ignore', axis=1)
    test_x.drop('ForecastId', inplace=True, errors='ignore', axis=1)
    return x,y,test_x

In [None]:
def linear_regression_calculation(df_all,country,date,day):
    def lin_reg(X_train, Y_train, x_test):
        regr = linear_model.Ridge()
        regr.fit(X_train, Y_train)
        pred = regr.predict(x_test)
        return regr, pred
    def lag_feature(df,target,lags):
        for lag in lags:
            lag_col = target + "_{}".format(lag)
            df[lag_col] = df.groupby(['Country','State'])[target].shift(lag, fill_value=0)
        return df
    targets = ['ConfirmedCases', 'Fatalities']
    lags = [40 , 20]
    df_country = df_all.copy()
    df_country = df_country.loc[df_country['Date'] >= date]
    df_country = df_country.loc[df_country['Country'] == country_dict[country]]
    features = ['Id', 'State', 'Country','ConfirmedCases', 'Fatalities', 'Day_num']
    df_country = df_country[features]
    
    for i in range(len(targets)):
        df_country = lag_feature(df_country, targets[i], range(1, lags[i]))

    filter_col_confirmed = [col for col in df_country if col.startswith('Confirmed')]
    filter_col_fatalities= [col for col in df_country if col.startswith('Fataliti')]
    filter_col = np.append(filter_col_confirmed, filter_col_fatalities)    
    filtered_cols = [filter_col_fatalities, filter_col_confirmed]

    df_country[filter_col] = df_country[filter_col].apply(lambda x: np.float_power(np.log1p(x), power))

    for d in range(start,start + 28+ 1):
        df_country.replace([np.inf, -np.inf], 0, inplace=True)
        df_country.fillna(0, inplace=True)
        for i in range(len(targets)):
            
            X_train_1,Y_train_1,x_test_1 = linear_regression_prepare_training(df_country,d,day,
                                                                              filtered_cols[i],targets[i])
            regr_1, pred_1 = lin_reg(X_train_1, Y_train_1, x_test_1)
            df_country.loc[(df_country['Day_num'] == d) & (df_country['Country'] == country_dict[country]), targets[i]] = pred_1[0]
            df_country = lag_feature(df_country, targets[i], range(1, lags[i]))   
    #print("Calculation done.")
    return df_country

In [None]:
def linear_regression_plot(df,test_con):
    groundtruth = train_df[(train_df['Country'] == test_con) & (train_df['Date'] >= '2020-03-10')].reset_index()
    df = df[df['Day_num'] >= 48]
    df = df[df['Day_num'] <= start + 14]
    df = df[:len(groundtruth)].reset_index()
    df['GroundTruthConfirmedcases'] = groundtruth['ConfirmedCases']
    df['GroundTruthFatalities'] = groundtruth['Fatalities']
    df['ConfirmedCases'] = df['ConfirmedCases'].apply(lambda x: np.float_power(np.expm1(x), 1/power))
    df['Fatalities'] = df['Fatalities'].apply(lambda x: np.float_power(np.expm1(x), 1/power))

    fig = go.Figure()
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.add_scatter(x=df['Day_num'], y=df['ConfirmedCases'], name='Confirmedcases - Prediction')
    fig.add_scatter(x=df['Day_num'], y=df['GroundTruthConfirmedcases'], name='Confirmedcases')
    fig.add_scatter(x=df['Day_num'], y=df['Fatalities'], name='Fatalities - Prediction', secondary_y=True)
    fig.add_scatter(x=df['Day_num'], y=df['GroundTruthFatalities'], name='Fatalities ', secondary_y=True)
    max_y = max([max(df['ConfirmedCases']),max(df['GroundTruthConfirmedcases']),max(df['Fatalities']), max(df['GroundTruthFatalities'])])
    fig.add_annotation( x=start, y=-0.1 * max_y, xref="x", yref="y", text="Start of Prediction", showarrow=True, 
            font=dict(size=12),
            align="center",  arrowhead=0, arrowsize=1, arrowwidth=2, ax=-0, ay=-345, borderwidth=2, borderpad=4, opacity=1
            )

    fig.update_layout(title='Prediction of '+ test_con,
                       xaxis_title='#Days ',
                       yaxis_title='Confirmed Cases')

    fig.update_yaxes(title_text="Confirmed Cases", secondary_y=False)
    fig.update_yaxes(title_text="Fatalities", secondary_y=True)
    
    return fig.show()

In [None]:
def linear_regression_plot_country(country):
    df_check = linear_regression_calculation(df_all, country, '2020-03-10', 48)
    linear_regression_plot(df_check,country)

In [None]:
linear_regression_plot_country('Germany')

We also make a comparison with the naive linear regression approaches, which assume exponential growth. The figure above shows the ground truth and the predictions of confirmed cases and fatalities in Germany. The prediction fits the training data well, but it tends to fail in long-term prediction. Some naive linear-regression-based submissions nevertheless achieve good scores because they do not handle the leakage of test data.
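To illustrate why an exponential-growth assumption breaks down over a long horizon, the following is a minimal sketch under stated assumptions, not the pipeline used in this notebook. It fits a straight line to `log1p` of a synthetic logistic curve during its early phase and then extrapolates. The curve parameters `K`, `r`, `t0` and the helper names `logistic`, `fit_line` and `predict` are hypothetical, and only the standard library is used so that the cell runs on its own.

```python
import math

def logistic(t, K=100_000.0, r=0.2, t0=40.0):
    """Synthetic logistic curve; K, r and t0 are illustrative values only."""
    return K / (1.0 + math.exp(-r * (t - t0)))

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Train only on the early phase, where growth still looks exponential.
train_days = list(range(10, 31))
a, b = fit_line(train_days, [math.log1p(logistic(t)) for t in train_days])

def predict(t):
    """Linear model in log space, mapped back to the original scale."""
    return math.expm1(a + b * t)

# Compare the true curve with the extrapolation inside and beyond the training window.
for t in (20, 45, 60):
    print(t, round(logistic(t)), round(predict(t)))
```

Inside the training window the two curves stay close, while beyond the inflection point at `t0` the log-linear model keeps growing exponentially and overshoots the saturating curve.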

# 5. Outliers
- **a. Have you found any outliers in your data? How have you found them?**   
Because the data are collected from all over the world and the situation differs from place to place, it is hard to determine ***whether an unusual value is an outlier or an incident-related/policy-related change***. We therefore plot the data to inspect the trends and look for anything unusual in the visualization. We made some observations, but we cannot confirm any of them as outliers without further information.  
- **b. Do you plan to remove them or keep them in your data? Why?**  
The details are given in the following subsections.  


## 5.1 Outliers in Data Set

In [None]:
df_exp = train_df.groupby(['State']).sum()
df_exp = df_exp.reset_index()
state_exp = df_exp.sort_values(by='ConfirmedCases', ascending=False)[:20]['State'].to_list()
df_exp = train_df.groupby(['State', 'Date']).sum().reset_index()

In [None]:
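import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns

# One panel per state: the 20 states with the largest summed ConfirmedCases, laid out on a 5x4 grid
fig, axes = plt.subplots(5,4, figsize=(25,20), sharex=True, sharey=True)
for i, state in enumerate(state_exp):
    a = i//4
    b = i % 4
    df_x = df_exp[df_exp.State==state]
    # seaborn >= 0.12 requires x and y to be passed as keyword arguments
    sns.scatterplot(x=df_x.Date, y=df_x.ConfirmedCases, ax=axes[a][b])
    axes[a][b].set_title(state)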


We can see from the figure that the following panels, indexed as (row, column), look unusual:
- (0, 3), (1, 0), (1, 1)

## 5.2 Outliers in Model Fitting

In [None]:
res = fit_logistic_curve('West Bank and Gaza', 'West Bank and Gaza',15,0,plot = 4,target = 'Fatalities')

A region that has only a few confirmed cases or fatalities before the start of prediction lacks enough valid data to fit the logistic curve. For example, West Bank and Gaza has only one valid point in terms of proportional growth ratio (the 63rd day in the figure above), so the regression fails. This is a typical pathological case in the modelling. Currently, we directly use the latest available observation and give a constant prediction for the future.
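The fallback rule can be sketched as follows. This is a simplified illustration rather than the code inside `fit_logistic_curve`: the helper names `growth_ratios` and `constant_fallback`, the threshold `min_valid_points`, and the definition of a valid growth ratio (a strictly positive increase over a non-zero previous value) are all assumptions made for this example.

```python
def growth_ratios(series):
    """Day-over-day proportional growth, keeping only valid points.

    Assumed definition: (current - previous) / previous, counted as valid
    only when the previous value is non-zero and the series increased.
    """
    ratios = []
    for prev, cur in zip(series, series[1:]):
        if prev > 0 and cur > prev:
            ratios.append((cur - prev) / prev)
    return ratios

def constant_fallback(series, horizon, min_valid_points=3):
    """Return a constant forecast if too few valid points exist, else None.

    None signals that there is enough data to attempt the curve fit instead.
    """
    if len(growth_ratios(series)) < min_valid_points:
        return [series[-1]] * horizon
    return None

# A region with very few fatalities: only one valid growth ratio (1 -> 2).
fatalities = [0, 0, 0, 1, 1, 2]
print(constant_fallback(fatalities, horizon=5))  # [2, 2, 2, 2, 2]
```

Holding the last observation is a conservative choice: it never predicts a decrease in a cumulative count, and it avoids extrapolating from a fit that has effectively no data behind it.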

In [None]:
res = fit_logistic_curve('China','Hubei',15,1, plot= 2)

In the ideal case, the change of the proportional growth ratio over time is constant. However, many real-world factors influence the growth ratio. Discontinuities, such as a change in the criteria for diagnosing infection or a new government regulation, introduce outliers and cause unexpected prediction results. For example, in Hubei, China, there is a jump on the 21st day, which is caused by a change in diagnostic criteria. This data point is an outlier in the regression, and we adopt the random sample consensus (RANSAC) algorithm to deal with such outliers. RANSAC is a learning technique that estimates the parameters of a model by random sampling of the observed data. Given a dataset that contains both inliers and outliers, RANSAC uses a voting scheme to find the optimal fitting result and remove the outliers [[7]](https://en.wikipedia.org/wiki/Random_sample_consensus). 
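As a minimal sketch of the voting idea, the cell below implements a basic RANSAC line fit using only the standard library. It is an illustration under assumptions, not the implementation used by `fit_logistic_curve`: the data are synthetic (a growth ratio that declines linearly, with one artificial jump on day 21), and the names `fit_line` and `ransac_line` as well as the values of `n_trials` and `threshold` are hypothetical.

```python
import random

def fit_line(points):
    """Least-squares line y = a + b * x through a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def ransac_line(points, n_trials=200, threshold=0.05, seed=0):
    """Return (a, b, inliers) for the candidate line with the most votes."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # 1. Draw a minimal random sample (two points define a line).
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # vertical line, cannot be written as y = a + b * x
        a, b = fit_line(sample)
        # 2. Every point close enough to the candidate line votes for it.
        inliers = [(x, y) for x, y in points if abs(y - (a + b * x)) <= threshold]
        # 3. Keep the candidate with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # 4. Refit on the consensus set only, which excludes the outliers.
    a, b = fit_line(best_inliers)
    return a, b, best_inliers

# Synthetic growth ratios: linear decline plus one jump on day 21.
points = [(day, 0.5 - 0.01 * day) for day in range(40)]
points[21] = (21, 2.0)

a_ls, b_ls = fit_line(points)              # ordinary least squares on all points
a_rs, b_rs, inliers = ransac_line(points)  # RANSAC ignores the jump
print("least squares:", round(a_ls, 4), round(b_ls, 5))
print("RANSAC:       ", round(a_rs, 4), round(b_rs, 5), len(inliers))
```

The single jump shifts the ordinary least-squares line away from the underlying trend, whereas the RANSAC fit is computed from the consensus set alone and is therefore unaffected by it.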

In [None]:
res = fit_logistic_curve('China','Hubei',15,2, plot= 2)

The figures above show the result of adopting RANSAC. In terms of RMSLE, RANSAC brings an order-of-magnitude improvement, from 0.02 to 0.003. After the 40th day, the RANSAC-enabled estimates fit the ground truth very well, while the results from the original method are clearly biased.
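For reference, RMSLE (root mean squared logarithmic error) is defined as $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+\hat{y}_i)-\log(1+y_i)\right)^2}$. The cell below is a minimal stand-alone sketch of this standard definition; the lowercase name `rmsle` is chosen for this example, and the notebook's own `RMSLE` helper is assumed to compute the same quantity.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# A perfect prediction has zero error.
print(rmsle([10, 100, 1000], [10, 100, 1000]))  # 0.0
```

Because the error is measured on a logarithmic scale, it penalizes relative rather than absolute deviations, so regions with small and large case counts contribute comparably.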

# 6. Conclusion

We choose logistic curve fitting as our baseline modelling approach. Combined with RANSAC, our approach achieves acceptable results in both short-range and long-range prediction. However, our model does not consider cross-region propagation or other factors that may influence the spread of the virus, such as temperature, population, and average age. We will further explore these aspects and pursue better results.

In [None]:
res_df = df_all.copy()



for country in res_df['Country'].unique():

    state_list = res_df[res_df['Country']==country]['State'].unique()

    for state in state_list:
        #print('country')
        #print(str(country) + ' , ' + str(state))
        targets = ['ConfirmedCases', 'Fatalities']
        
        for target in targets:
            res = fit_logistic_curve(country,state,15,0,plot = 0,target = target)
            
            for i in range(start, end + 1):
                res_df.loc[(res_df['Day_num'] == i) & (res_df['Country'] == country) & (res_df['State'] == state), \
                       target] = res[i]
        

print("done")

In [None]:
results_df_submit = res_df.copy()
submission_data = pd.DataFrame()
submission_data['ForecastId']= results_df_submit[results_df_submit['ForecastId'] >=0 ]['ForecastId'].astype(int)
submission_data['ConfirmedCases'] = results_df_submit[results_df_submit['ForecastId'] >=0 ]['ConfirmedCases'].replace([np.inf, -np.inf], 0)
submission_data['Fatalities'] = results_df_submit[results_df_submit['ForecastId'] >=0 ]['Fatalities'].replace([np.inf, -np.inf], 0)
submission_data['ConfirmedCases']  = submission_data['ConfirmedCases'].apply(lambda x: int(0) if (x < 0 or np.isnan(x)) else int(x))
submission_data['Fatalities']  = submission_data['Fatalities'].apply(lambda x: int(0) if (x < 0 or np.isnan(x)) else int(x))

submission_data.columns = ['ForecastId','ConfirmedCases','Fatalities']
submission_data.to_csv('submission.csv', index=False)

In [None]:
ground_truth = train_df.loc[train_df['Date'].isin(data_leak)]

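# Rows present in both train and test (same Date/State/Country) are the leaked dates;
# the inner merge yields their ground truth together with the matching ForecastId.
truth = pd.merge(train_df,test_df, how='inner', on=['Date','State','Country'] ) 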

prediction = submission_data.copy()
prediction.rename(columns={'ConfirmedCases':'ConfirmedCasesHat','Fatalities':'FatalitiesHat'}, inplace=True)

all_res = pd.merge(truth,prediction, how='inner', on=['ForecastId']) 

eva_date = '2020-04-15'
a = all_res[all_res['Date'] < eva_date]['ConfirmedCasesHat']
b = all_res[all_res['Date'] < eva_date]['ConfirmedCases']

print(RMSLE(a,b))

a = all_res[all_res['Date'] < eva_date]['FatalitiesHat']
b = all_res[all_res['Date'] < eva_date]['Fatalities']

print(RMSLE(a,b))


# Reference

[1] [*Logistic Function*](https://en.wikipedia.org/wiki/Logistic_function#In_medicine:_modeling_of_a_pandemic). (2021, February 24). Wikipedia. 

[2] [*Linear Regression Model*](https://www.kaggle.com/abhijithchandradas/linearregressionmodel). (2021, February 24). Kaggle Inc. 

[3] [*COVID-19 Logistic Curve Prediction*](https://www.kaggle.com/orianao/covid-19-logistic-curve-prediction). (2021, February 24). Kaggle Inc. 

[4] [*EDA and Forcast Polynomial & Linear Regression*](https://www.kaggle.com/abhijithchandradas/eda-and-forcast-polynomial-linear-regression). (2021, February 24). Kaggle Inc. 

[5] [*Linear Regression Is All you Need*](https://www.kaggle.com/c/covid19-global-forecasting-week-5/discussion/151461). (2021, February 24). Kaggle Inc. 

[6] [*Bayesian Model for COVID-19 Spread Prediction*](https://www.kaggle.com/bpavlyshenko/bayesian-model-for-covid-19-spread-prediction). Kaggle Inc. 

[7] [*Random Sample Consensus*](https://en.wikipedia.org/wiki/Random_sample_consensus). (2021, February 26). Wikipedia. 

[8] [*COVID-19 - Logistic Curve Fitting and Correlation*](https://www.kaggle.com/diamondsnake/covid-19-logistic-curve-fitting-and-correlation). (2021, February 26). Kaggle Inc. 
