# An Estimation for Asymptomatic Covid-19 Cases of Turkey
### A multivariate polynomial regression model

# Table of Contents

1. [Introduction](#Introduction)

    1.1 [Acknowledgments](#Acknowledgments)

    1.2 [The Problem](#The-Problem)
    
    1.3 [The Goal of the Project](#The-Goal-of-the-Project)
    
    1.4 [Timeline](#Timeline)
    
    1.5 [Methodology](#Methodology)
2. [Importing the Necessary Libraries](#Importing-the-Necessary-Libraries)
3. [Importing Data](#Importing-Data)

    3.1 [Preprocessed Data](#Preprocessed-Data)
    
    3.2 [Raw Data](#Raw-Data-from-the-Ministry-of-Health-in-Turkey)
4. [Data Cleaning and Wrangling](#Data-Cleaning-and-Wrangling)
5. [Regression](#Regression)

    5.1 [Creating Filters](#Creating-Filters)
    
    5.2 [Feature Selection](#Feature-Selection)
    
    5.3 [Prediction](#Prediction)
    
    5.4 [Adjustment](#Adjustment)
6. [Saving the Dataset](#Saving-the-Dataset)
    

# Introduction

### Acknowledgments

##### This project uses two datasets from two different sources. I prepared the first one in another project, using data from the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" (https://github.com/CSSEGISandData/COVID-19).

##### The second dataset is built from the data published by the Ministry of Health in Turkey.

##### Several studies inspired me to create a new feature. These studies are:

Jombart T, van Zandvoort K, Russell TW et al. Inferring the number of COVID-19 cases from recently reported deaths [version 1; peer review: 2 approved]. Wellcome Open Res 2020, 5:78 (https://doi.org/10.12688/wellcomeopenres.15786.1)

Pueyo T. Coronavirus: Why You Must Act Now (https://tomaspueyo.medium.com/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca)

Linton NM, Kobayashi T, Yang Y, Hayashi K et al. Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data. Journal of Clinical Medicine. 2020; 9(2):538. (https://doi.org/10.3390/jcm9020538)

### The Problem

The Republic of Turkey was one of the countries that announced only symptomatic Covid-19 cases. This policy lasted about four months, from the 29th of July 2020 to the 25th of November 2020.

The Ministry referred to these numbers as "patients" rather than "cases". When the policy ended, the Ministry resumed announcing all cases, including asymptomatic ones, and later revealed the total case numbers in the country.

Secondary repositories recorded the patient numbers as if they were case numbers, which made the records for the country inconsistent. The cumulative number of cases is now known, but the daily increases for this gap are still missing.

### The Goal of the Project

This project aims to estimate the daily increases in case numbers for the days in this gap. Several facts and the available data make it possible to build an accurate machine learning model for this problem.

These facts are:

    1. Total case numbers are known.
    
    2. The policy covered only part of the pandemic, and that part lies 
       in the middle. 
       The case numbers for the months before and after the gap are known. 
        

### Timeline
###### On the 29th of July 2020, the label "Case" was changed to "Patients" on the graphs that the Ministry of Health shares with the public
            Turkey started to announce only the symptomatic cases.
            Secondary repositories kept recording these numbers as cases.
            
###### On the 25th of November 2020, the Ministry started to announce the cases once again
            No adjustments were made for the previous cases.
            
###### On the 10th of December 2020, the Ministry revealed the previously missing cases as a single cumulative figure
            Secondary repositories recorded this number as if the cases were 
            discovered that day.

### Methodology

  Linear correlation, p-values, cross-validation and a comparison with the known cumulative case count for the gap are used to measure the strength of the relations and the accuracy of the model.

  Daily tests and recovered cases are used as features in the model.
  
  The most strongly related feature is the increase in deaths 15 days later. This feature was created to obtain a stronger relation, because it takes approximately 15 days from the onset of illness to death from Covid-19 (Linton et al., 2020). The studies that inspired this feature are listed in the acknowledgments section.
  
  The estimations are also used as weights to distribute the cumulative cases announced on the 10th of December 2020 across the days of the gap as an adjustment.
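
As a minimal sketch of the lagged death feature described above, the toy example below pairs each day with the deaths recorded a fixed number of days later. The data and the column name "DeathLag" are made up for illustration, and the lag is shortened to 2 so that the result is easy to read; the notebook itself uses 15.

```python
import pandas as pd

# Toy daily death counts; the values are made up for illustration.
toy = pd.DataFrame({"IncreaseDeath": [1, 2, 3, 4, 5, 6]})

LAG = 2  # the notebook uses 15; 2 keeps the toy example readable

# shift(-LAG) moves every value LAG rows upward, so row i receives the
# deaths recorded LAG days later. The last LAG rows have no future value
# and are filled with 0.
toy["DeathLag"] = toy["IncreaseDeath"].shift(-LAG).fillna(0).astype(int)

print(toy["DeathLag"].tolist())  # [3, 4, 5, 6, 0, 0]
```

Rows filled with 0 carry no real information, which is why they have to be excluded before any model is fitted.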

# Importing the Necessary Libraries

In [None]:
import pandas as pd
import requests, json, statistics
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_regression
from scipy import stats
from scipy.stats import pearsonr
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Importing Data

### Preprocessed Data

The file called "Covid-19 Dataset.csv" was created in another project. The making of this dataset is [here](https://github.com/ocaktans/Gathering-Covid-19-Data). The updated dataset is [here](https://www.kaggle.com/ocaktan/covid19pandemic-dataset).

In [None]:
df = pd.read_csv("../input/covid19pandemic-dataset/Covid-19 Dataset.csv", 
                 parse_dates = ["Date"])

df.head()

In [None]:
df.dtypes

### Raw Data from the Ministry of Health in Turkey

#### Web Scraping

In [None]:
url = "https://covid19.saglik.gov.tr/TR-66935/genel-koronavirus-tablosu.html#"
r = requests.get(url)
print(r.status_code)

In [None]:
soup = BeautifulSoup(r.content, "xml")

print(soup.title)

In [None]:
script = soup.find_all("script")[16].text[26:-4]
data = json.loads(script)
turkeyfirst = pd.DataFrame.from_records(data)
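
The cell above depends on the position of the script tag and on fixed character offsets, so it may stop working if the page layout changes. A possible alternative, sketched here on a hypothetical page fragment, is to search for the JavaScript variable that holds the table. The variable name `tableData`, the helper `extract_json_array` and the sample row are assumptions made for illustration; the real page would have to be inspected for the actual name.

```python
import json
import re

# Hypothetical page fragment: the variable name and the row are made up.
html = """
<html><body>
<script>var other = 1;</script>
<script>var tableData = [{"tarih": "10.12.2020", "gunluk_vaka": "30.424"}];</script>
</body></html>
"""

def extract_json_array(page: str, var_name: str):
    """Return the JSON array assigned to `var_name` in an inline script."""
    # Non-greedy match up to the first "];", which is enough as long as
    # the data itself never contains that character sequence.
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\]);", re.DOTALL
    )
    match = pattern.search(page)
    if match is None:
        raise ValueError(f"no array assigned to {var_name!r} found")
    return json.loads(match.group(1))

records = extract_json_array(html, "tableData")
print(records[0]["tarih"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame.from_records` in the same way as in the cell above.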

# Data Cleaning and Wrangling

#### Filtering Turkey and Necessary Columns from the First Set

In [None]:
df = df[df["Country"] == "Turkey"][["Date","Confirmed", "Deaths", "Death Rate(%)","Increase","day_xth"]]

df.set_index("day_xth",inplace = True)

df = df[df["Date"] > "2020-03-10"]

In [None]:
df.head()

#### Cleaning Columns from the Second Dataset and Translation

In [None]:
# Keep only the columns needed from the scraped table.
turkey = turkeyfirst[['tarih', 'gunluk_test', 'gunluk_vaka', 
                      'gunluk_hasta', 'gunluk_vefat', 'gunluk_iyilesen', 
                      'toplam_test', 'toplam_vefat', 'toplam_iyilesen', 
                      'hastalarda_zaturre_oran', 'agir_hasta_sayisi',
                      'toplam_hasta']]

# Translate the Turkish column names to English. rename returns a new frame,
# so the selection above is not modified in place.
turkey = turkey.rename(columns = {'tarih': 'Date', 'gunluk_test': 'Inc_Test', 'gunluk_vaka':'Inc_Case',
                                  'gunluk_hasta':'Inc_Patient', 'gunluk_vefat': 'Inc_Deaths', 
                                  'gunluk_iyilesen' : 'Inc_Recovered', 'toplam_test' : "Total_Tests", 
                                  'toplam_vefat':'Total_Deaths', 'toplam_iyilesen':'Total_Recovered', 
                                  'agir_hasta_sayisi':'Heavily_Ill','toplam_hasta':'Total_Patient', 
                                  'hastalarda_zaturre_oran':'Pneumonia_Ratio'})

turkey = turkey.reindex(columns = ['Date', 'Inc_Test', 'Inc_Case', 
                                   'Inc_Patient', 'Inc_Deaths',
                                    'Inc_Recovered', 'Total_Tests', 
                                   'Total_Patient', 'Total_Deaths',
                                   'Total_Recovered', 'Heavily_Ill', 
                                   'Pneumonia_Ratio'])

#### Changing Types

In [None]:
turkey.dtypes

Inc_ indicates the daily increase in the related field.

In [None]:
turkey["Date"] = pd.to_datetime(turkey.Date, dayfirst = True)
turkey.sort_values(by=["Date"], inplace = True)
turkey = turkey.reset_index(drop=True)

In [None]:
turkey.head()

##### Masking Integer and Float columns

In [None]:
intcolumns = turkey.columns[1:-1]
floatcolumns = turkey.columns[-1:]

##### Cleaning of the Problematic Characters

In [None]:
# Integer columns use "." as a thousands separator: remove it, then treat empty cells as 0.
turkey[intcolumns] = turkey[intcolumns].apply(lambda col: 
                                              col.str.replace(".", "", regex = False))
turkey[intcolumns] = turkey[intcolumns].replace("", 0)
# The float column uses "," as the decimal mark and may contain the stray text " saat".
turkey[floatcolumns] = turkey[floatcolumns].replace("", 0)
turkey[floatcolumns] = turkey[floatcolumns].replace(",", ".", regex = True)
turkey[floatcolumns] = turkey[floatcolumns].replace(" saat", "", 
                                                    regex = True)

In [None]:
turkey[intcolumns] = turkey[intcolumns].astype(int)
turkey[floatcolumns] = turkey[floatcolumns].astype(float)
turkey.dtypes
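
The conversion above relies on the number format of the source: "." separates thousands and "," marks the decimals. The same rules can be written as two small helper functions, sketched below. The names `parse_tr_int` and `parse_tr_float` are hypothetical and not part of the notebook.

```python
def parse_tr_int(text: str) -> int:
    """Parse an integer written with '.' as thousands separator; empty -> 0."""
    cleaned = text.strip().replace(".", "")
    return int(cleaned) if cleaned else 0

def parse_tr_float(text: str) -> float:
    """Parse a decimal written with ',' as decimal mark; empty -> 0.0."""
    cleaned = text.strip().replace(",", ".")
    return float(cleaned) if cleaned else 0.0

print(parse_tr_int("1.159.626"))  # 1159626
print(parse_tr_float("3,5"))      # 3.5
print(parse_tr_int(""))           # 0
```

Such helpers could be applied to a single column with `Series.map`, which makes the handling of empty cells explicit.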

In [None]:
turkey.shape

### Creating New Columns

The "turkey" dataset already contains the daily increase in deaths, yet some of its values are missing. The "IncreaseDeath" column is therefore computed from the first dataset for comparison.

In [None]:
df["IncreaseDeath"] = df["Deaths"].diff().fillna(0).astype(int)
df["Increase"] = df["Increase"].astype(int)

Merging before adding more columns.

In [None]:
frame = df.merge(turkey, left_on="Date", right_on="Date")
frame.head()

Creating a column to check the differences between the two increase-in-deaths columns.

In [None]:
frame["Check"] = frame["IncreaseDeath"] - frame["Inc_Deaths"]
frame[frame["Check"] !=0]

It can be seen that most of the differences are caused by empty cells in the data from the Ministry. One row (with the index 494) appears to be faulty because of the first source. Similar differences appear in the total death columns as well. We will continue with the "IncreaseDeath" and "Deaths" columns. The row with the index 494 will be dropped in the next sections.

#### Filling Empty Cells with One of the Preprocessed Columns

Empty cells in the "Inc_Case" column may affect our results. It is partially empty because the daily increase in cases was kept in the "Inc_Patient" column until the 29th of July 2020. Yet the "Increase" column is not fully correct either. Therefore, the two must be combined to obtain an accurate column.

In [None]:
# Where the Ministry's daily case count is missing (0), take the value from "Increase".
# Using a boolean mask with .loc avoids chained assignment.
missing = frame["Inc_Case"] == 0
frame.loc[missing, "Inc_Case"] = frame.loc[missing, "Increase"]

Columns "Increase" and "Inc_Case" must differ after the 25th of November.

In [None]:
frame[frame["Date"] > "2020-11-24"].head()

The difference between "Increase" and "Inc_Case" can be seen above. The correct one is "Inc_Case".

#### Increase of the deaths which is taken from 15 days later

This feature is explained in the methodology section.

In [None]:
# Row i receives the death increase recorded 15 days later.
# The last 15 rows have no future value and are set to 0.
frame["Death15"] = frame["IncreaseDeath"].shift(-15).fillna(0).astype(int)

Checking the code.

In [None]:
frame.tail(17)

Removing unnecessary columns.

In [None]:
frame = frame.drop(["Inc_Deaths","Total_Deaths","Total_Recovered","Check"], axis=1)

# Regression

In [None]:
frame.head()

### Creating Filters

Various cells are empty on certain dates, so we have to trade the number of features against the sample size. Not all attributes are available from day one.

"thegap" masks the days that we want to predict.

"masked0" contains every day outside of the gap. It will be used for merging in the next sections.

"masked1" has more rows than "masked2", yet some columns have to be excluded (the dropped columns are partially empty for these dates).

"masked2" has more features but fewer rows.

In [None]:
thegap = frame[(frame["Date"] > "2020-07-28") & 
               (frame["Date"] < "2020-11-25")].copy()

masked0 = frame[(frame["Date"] <= "2020-07-28") | 
                (frame["Date"] >= "2020-11-25")].copy()

masked1 = frame[(frame["Date"] <= "2020-07-28") | 
                ((frame["Date"] >= "2020-11-25") & (frame["Death15"] != 0))].copy()

masked1 = masked1.drop(["Pneumonia_Ratio", "Heavily_Ill", "Inc_Patient", "Total_Patient"], axis=1)

masked1 = masked1[masked1["Inc_Test"] != 0]

masked2 = frame[(frame["Date"] >= "2020-11-25") & (frame["Death15"] != 0)].copy()
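
The filters above combine boolean masks with `|` and `&`. In Python, `&` binds more tightly than `|`, so the placement of the parentheses decides which rows survive. The toy frame below is made up for illustration and sketches the difference.

```python
import pandas as pd

# Made-up rows: one before the gap, one inside it, two after it.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-01", "2020-09-01",
                            "2020-12-01", "2020-12-20"]),
    "Death15": [0, 50, 200, 0],
})

before = toy["Date"] <= "2020-07-28"
after = toy["Date"] >= "2020-11-25"
has_lag = toy["Death15"] != 0

# Without parentheses, "&" is evaluated first: before | (after & has_lag).
implicit = toy[before | after & has_lag]
explicit = toy[before | (after & has_lag)]

# Grouping the date conditions first applies the Death15 filter to every row.
other = toy[(before | after) & has_lag]

print(implicit.index.tolist(), explicit.index.tolist(), other.index.tolist())
```

Writing the parentheses out, even where they match the default precedence, makes the intended filter visible to the reader.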

In [None]:
masked1.head(5)

### Feature Selection

For this dataset, correlation turned out to be a better way to explore the relations than mutual information. Therefore, correlation scores are used as indicators of the relations.

In [None]:
plt.figure(figsize=(5, 8))
# numeric_only=True leaves out the "Date" column, which cannot be correlated.
sns.heatmap(masked1.corr(numeric_only = True)[["Inc_Case"]].sort_values(by = "Inc_Case", 
                                                                        ascending = False), annot = True)

#### Correlations with p-values

P.S. This could be done with SelectKBest as well.
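
As a rough sketch of the SelectKBest alternative mentioned above, the example below ranks features with `f_regression`, a univariate F-test that is derived from the Pearson correlation between each feature and the target. The data is synthetic and the feature names are made up, so this only illustrates the API and is not the selection used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X_demo = np.column_stack([signal, noise_a, noise_b])
y_demo = 3.0 * signal + rng.normal(scale=0.1, size=n)

# Keep the single feature with the highest F-score.
selector = SelectKBest(score_func=f_regression, k=1).fit(X_demo, y_demo)

names = np.array(["signal", "noise_a", "noise_b"])
print(names[selector.get_support()])  # name of the selected feature
print(selector.pvalues_)              # one p-value per feature
```

The p-values in `selector.pvalues_` play the same role as the p-value column in the tables below.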

In [None]:
corlist = []

X = masked1.loc[:,masked1.columns != "Inc_Case"]
X = X.loc[:,X.columns != "Date"]
y = masked1.loc[:,"Inc_Case"]

for col in X.columns:
    
    cor = pearsonr(X[col], y)
    corlist.append([col, cor[0], cor[1]])

cordf = pd.DataFrame(corlist, columns = ["Features", "Correlation", 
                                         "p-value"])

cordf.sort_values(by = "p-value", inplace = True)
cordf

In [None]:
corlist2 = []

X2 = masked2.loc[:,masked2.columns != "Inc_Case"]
X2 = X2.loc[:,X2.columns != "Date"]
y2 = masked2.loc[:,"Inc_Case"]

for col in X2.columns:
    
    cor2 = pearsonr(X2[col], y2)
    corlist2.append([col, cor2[0], cor2[1]])

cordf2 = pd.DataFrame(corlist2, columns = ["Features", "Correlation", 
                                           "p-value"])

cordf2.sort_values(by = "p-value", inplace = True)
cordf2

In [None]:
plt.figure(figsize=(8, 8))
ax = sns.regplot(x="Death15", y="Inc_Case", data=masked1)

##### Same feature as a quadratic equation

In [None]:
sns.lmplot(x = "Death15", y = "Inc_Case", data = masked1, order=2, height = 7)

In [None]:
sns.lmplot(x = "Inc_Test", y = "Inc_Case", data = masked1, order=2, height = 7)

In [None]:
plt.figure(figsize=(8, 8))
ax = sns.regplot(x="Inc_Recovered", y="Inc_Case", data=masked1)

#### Accuracy of the Model

The cross_val_score method with its default folds does not work well with this dataset (especially in the first folds) because the data is ordered by time. As an alternative, several random train/test splits are used and the average score is recorded.

The code below uses the best features and the best polynomial degree that I found (even though "Inc_Recovered" looks linear, it works better as a polynomial). Feel free to change the variables to test it yourself.

Note that the cumulative value of the predictions is also used as a measure of accuracy, since the total number of cases in the gap is known from official sources. For example, a 3rd-degree polynomial resulted in a slightly higher score, while the cumulative value of the cases was closer to the actual value with the 2nd degree.

In [None]:
X = masked1[["Death15", "Inc_Test", "Inc_Recovered"]]

y = masked1[["Inc_Case"]]

PolyReg = LinearRegression()

polynom = PolynomialFeatures(degree = 2) 

scores = []

for i in range(10):
    
    X_train, X_test, y_train, y_test = train_test_split(polynom.fit_transform(X), y, test_size=0.2, random_state=i)
    
    PolyReg.fit(X_train, y_train)

    scores.append(PolyReg.score(X_test, y_test))
    
statistics.mean(scores)
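
The averaging over several random splits above can also be expressed with scikit-learn's `ShuffleSplit`, which draws repeated random train/test partitions and can be passed to `cross_val_score`. The sketch below uses synthetic data and a pipeline and is an alternative formulation, not the code that produced the score above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relation; the values are made up.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(150, 3))
y_demo = (2.0 + X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(scale=1.0, size=150))

# The pipeline expands the features and fits the linear model in one step.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Ten random 80/20 splits, mirroring the loop over random_state values.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=splitter)

print(demo_scores.mean())  # mean R^2 over the ten splits
```

Because the rows are shuffled, every split mixes early and late days, which is the same property the manual loop relies on.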

### Prediction 

In [None]:
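PolyReg.fit(polynom.fit_transform(X), y)

# Reuse the fitted transformer; ravel() flattens the (n, 1) predictions into one column.
gap_features = polynom.transform(thegap[["Death15", "Inc_Test", "Inc_Recovered"]])
thegap["Estimated"] = PolyReg.predict(gap_features).ravel()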

The sum of the estimations is supposed to be equal to 1,159,626, the cumulative number of cases known for the gap.

In [None]:
thegap.Estimated.sum()

In percentage

In [None]:
(thegap.Estimated.sum()*100)/1159626

#### Visualization

In [None]:
plt.figure(figsize = [10,6])
plt.plot(thegap["Date"],thegap["Inc_Case"], label = "Patients")
plt.plot(thegap["Date"],thegap["Estimated"], label = "Estimated")
plt.xlabel("Date")
plt.legend()
plt.show()

#### Merging the Datasets 

In [None]:
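# NaN (rather than None) keeps the column numeric after the merge.
masked0["Estimated"] = float("nan")

In [None]:
# pd.concat stacks the two frames (DataFrame.append is not available in pandas 2.0 and later).
final = pd.concat([masked0, thegap], ignore_index = True)
final.sort_values(by=["Date"], inplace = True)
final = final.reset_index(drop=True)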

In [None]:
plt.figure(figsize = [10,6])
plt.plot(final["Date"],final["Inc_Case"], label = "Cases/Patients")
plt.plot(final["Date"],final["Estimated"], label = "Estimated")
plt.xlabel("Date")
plt.legend()
plt.show()

### Adjustment 

Since we have the total number of cases, we can distribute them to each day as if our estimations were coefficients.
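
The arithmetic behind this adjustment is a proportional rescaling: each estimate is multiplied by the known total and divided by the sum of all estimates. A minimal sketch with made-up numbers follows; the helper name `distribute_total` is hypothetical.

```python
def distribute_total(estimates, total):
    """Scale `estimates` so that they sum to `total`, keeping their proportions."""
    estimate_sum = sum(estimates)
    if estimate_sum == 0:
        raise ValueError("estimates must not sum to zero")
    return [total * value / estimate_sum for value in estimates]

estimates = [100.0, 300.0, 600.0]  # made-up daily estimates
adjusted = distribute_total(estimates, total=1500)

print(adjusted)       # [150.0, 450.0, 900.0]
print(sum(adjusted))  # 1500.0
```

Each day keeps its share of the total (10%, 30% and 60% here). Rounding the adjusted values to whole numbers afterwards can move their sum slightly away from the total.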

In [None]:
# Rows outside the gap have no estimate (NaN) and therefore stay NaN in "Adjusted".
estimated = pd.to_numeric(final["Estimated"])
final["Adjusted"] = (1159626*estimated)/(estimated.sum())

Sum of the adjusted column should be equal to 1,159,626 if the code above is correct.

In [None]:
final.Adjusted.sum()

These values should be integers.

In [None]:
# to_numeric guarantees a numeric dtype before rounding.
final["Estimated"] = pd.to_numeric(final["Estimated"]).round()
final["Adjusted"] = pd.to_numeric(final["Adjusted"]).round()

#### Filtering for a Closer Look with the Visualization 

These dates have no special meaning. They only widen the window around the gap so that the plot is easier to read.

In [None]:
thegap = final[(final["Date"] > "2020-07-14") & 
               (final["Date"] < "2020-12-09")].copy()

In [None]:
plt.figure(figsize = [10,6])
plt.plot(thegap["Date"],thegap["Inc_Case"], label = "Patients")
plt.plot(thegap["Date"],thegap["Estimated"], label = "Estimated")
plt.plot(thegap["Date"],thegap["Adjusted"], label = "Adjusted")
plt.xlabel("Date")
plt.legend()
plt.show()

# Saving the Dataset

In [None]:
final = final.drop(["Increase", "Inc_Patient", "Total_Patient", 
                    "Heavily_Ill", "Pneumonia_Ratio"], axis=1)

In [None]:
final.to_csv("TurkeyAdjusted.csv", index = False)

In [None]:
final.tail()