https://www.kaggle.com/bradysites/info-1998-project

# INFO 1998 Project: COVID-19
Names: Ashley Jiang (yj387), Brady Sites (bas339)

For this project, we chose to analyze COVID-19 statistics. The question we asked ourselves was, "What was the average recovery period for different age groups?" The motivation for this inquiry is the general trend of deteriorating health with age, so we wanted to see whether age correlates with recovery time.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
dataset1 = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv')
dataset1[dataset1.columns[0:32]].isnull().sum()/len(dataset1)*100


This shows us that this dataset alone cannot answer our question, since almost 100% of cases have no outcome listed. Fortunately, the data is spread across multiple CSV files, so we can look through a few others for this information.
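To make the percentage check concrete, here is a minimal sketch on a made-up table. The column names and values are invented for illustration and are not rows from the Kaggle files. It relies on the fact that the mean of a boolean mask is the fraction of `True` values, so it should give the same numbers as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [34, np.nan, 51, 62],
    "outcome": [np.nan, np.nan, np.nan, "recovered"],
})

# isnull() gives a boolean table; its column-wise mean is the missing fraction
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```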

In [None]:
dataset2 = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/COVID19_line_list_data.csv')
dataset2.isnull().sum()/len(dataset2)*100


This file has complete information on recovery status, with varying amounts of data on symptoms, exposure, and report dates.
Thus, we see it fit to use for analysis.

In [None]:
covid = dataset2.copy()
df = pd.DataFrame(covid['symptom_onset'])
# Rows with a missing onset date keep month = 0 and day = 0
df['month'] = 0
df['day'] = 0
for x in range(len(df)):
    onset = df['symptom_onset'][x]
    if isinstance(onset, str):
        # The parsing assumes month/day/year strings separated by '/'
        slash = onset.find('/')
        second_slash = onset[slash+1:].find('/')
        # .loc writes into df itself instead of a temporary copy
        df.loc[x, 'month'] = int(onset[:slash].strip())
        df.loc[x, 'day'] = int(onset[slash+1:slash+second_slash+1].strip())

df
    

Since the dates in the symptom_onset column are strings, we first cut out the month and day of each date and then convert them to int.
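The same split can also be written without an explicit loop. This is only a sketch on hypothetical date strings (assumed here to be in month/day/year form), not the notebook's actual column: `str.split` breaks each string at the slashes, and missing dates fall back to 0 as in the cell above.

```python
import pandas as pd

# Hypothetical date strings in month/day/year form; None marks a missing date
onset = pd.Series(["1/20/2020", None, "02/3/2020"])

# One column per piece of the date: month, day, year
parts = onset.str.split("/", expand=True)

# Missing or unparsable pieces become NaN, then 0
month = pd.to_numeric(parts[0], errors="coerce").fillna(0).astype(int)
day = pd.to_numeric(parts[1], errors="coerce").fillna(0).astype(int)
print(month.tolist(), day.tolist())
```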

NOTE: At this point, we realized that the dataset we originally chose wasn't fit to answer the question we came in with, so we found another, more recent one.

In [None]:
dataset2= pd.read_csv('/kaggle/input/coronavirusdataset/PatientInfo.csv')

In [None]:
mask = (
    (dataset2['symptom_onset_date'].notnull() | dataset2['confirmed_date'].notnull())
    & (dataset2['released_date'].notnull() | dataset2['deceased_date'].notnull())
    & (dataset2['birth_year'].notnull() | dataset2['age'].notnull())
)
dataset2.loc[mask, 'age'].notnull().sum()

In [None]:
column_names=['birth_year','age','symptom_onset_date','confirmed_date','released_date','deceased_date','sex','state']
newdf = dataset2.loc[mask,column_names].copy()
newdf.head()

We've narrowed down our samples to those with values we can use to potentially answer our question.

In [None]:
date_columns = ['symptom_onset_date','confirmed_date','released_date','deceased_date']
for column_name in date_columns:
    newdf[column_name] = pd.to_datetime(newdf[column_name], format="%Y-%m-%d")
print()
print(newdf.dtypes)

We converted all relevant dates to the datetime64 type so that we can perform calculations on them more easily later.
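As a small illustration of why the conversion helps, two datetime64 columns can be subtracted directly. The dates below are invented for the example; only the `format` string matches the cell above.

```python
import pandas as pd

# Invented dates, parsed with the same format string as above
confirmed = pd.to_datetime(pd.Series(["2020-02-01", "2020-02-10"]), format="%Y-%m-%d")
released = pd.to_datetime(pd.Series(["2020-02-15", "2020-03-01"]), format="%Y-%m-%d")

# Subtracting datetime64 columns gives a timedelta column
gap = released - confirmed
print(gap.dt.days.tolist())
```

February 2020 has 29 days, so the second gap comes out to 20 days rather than 19.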

In [None]:
newdf['is_male'] = pd.get_dummies(newdf.sex)['male']
newdf['is_female'] = pd.get_dummies(newdf.sex)['female']
print(newdf[['is_male','is_female']].head(10))
print(newdf.dtypes)
print(newdf.isnull().sum()/len(newdf)*100)


In our refined dataset newdf, we have a released or deceased date for every case, a confirmation date for every case, and a sex for every case. Essentially we want to consolidate the dates into one column that gives the length in time from confirmation to release/death, in days. We could create secondary visualization(s) using birth_year and symptom_onset_date, or using sex.

In [None]:
# Start from a zero-length span so the column has a timedelta type
newdf['recovery_time'] = pd.Timedelta(0)
deceased = newdf['state'] == 'deceased'
released = newdf['state'] == 'released'
# .loc writes into newdf itself; the right-hand side is aligned on the index
newdf.loc[deceased, 'recovery_time'] = newdf['deceased_date'] - newdf['confirmed_date']
newdf.loc[released, 'recovery_time'] = newdf['released_date'] - newdf['confirmed_date']

newdf

We just created a column with the value of main interest: recovery time! To do so, we split our dataset into two groups, the released cases and the deceased cases, and use the respective date to calculate the time span since confirmation.
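The two-group calculation can be sketched on a tiny made-up table. The column names follow the notebook, but the rows are invented, and `Series.where` is just one possible way to pick the end date per row.

```python
import pandas as pd

# Invented rows; column names mirror the notebook's newdf
toy = pd.DataFrame({
    "state": ["released", "deceased", "released"],
    "confirmed_date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-05"]),
    "released_date": pd.to_datetime(["2020-03-20", None, "2020-03-19"]),
    "deceased_date": pd.to_datetime([None, "2020-03-10", None]),
})

# Keep released_date where the case was released, else use deceased_date
is_released = toy["state"] == "released"
end_date = toy["released_date"].where(is_released, toy["deceased_date"])

toy["recovery_time"] = end_date - toy["confirmed_date"]
print(toy["recovery_time"].dt.days.tolist())
```

Note that this sketch treats every non-released row as deceased, which only holds if those are the only two states present.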

In [None]:
print(newdf.isnull().sum()/len(newdf)*100)

As seen here, our new column does not contain any null values. Now, we can turn this into a numeric type and work with that result directly.

In [None]:
# Normalize the column to a timedelta type, then express it in days
days = pd.to_timedelta(newdf['recovery_time'])
newdf['timelength'] = days / np.timedelta64(1, 'D')

We create another column that gives us the recovery time as a plain number of days. The next step is to separate the deceased from the recovered and then calculate the average timelength.
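For reference, there are two common ways to turn a timedelta column into a number of days. The spans below are invented; the sketch only shows how the two results differ when a span includes part of a day.

```python
import pandas as pd

# Invented spans, one of them with a partial day
spans = pd.Series(pd.to_timedelta(["14 days", "20 days 12:00:00"]))

as_float_days = spans / pd.Timedelta(days=1)  # keeps the fractional day
as_whole_days = spans.dt.days                 # drops the partial day
print(as_float_days.tolist(), as_whole_days.tolist())
```

Since our dates carry no time of day, both approaches should give the same values for this dataset.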

In [None]:
alive = newdf[newdf['state']!='deceased'].copy()  # copy so later column assignments are safe
alive.state.unique()


In [None]:
x=alive.age.unique()
x.sort()
y=[]
for age in x:
    # mean recovery time (in days) for this age group
    t_avg= alive[alive['age']==age]['timelength'].mean()
    y.append(t_avg)

import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(x,y)
plt.show()


We calculate the average recovery time for each age group and plot it as a bar chart of average time length versus age group.
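The per-group average can also be computed in one step with `groupby`. This is a sketch on invented values that reuses the notebook's column names.

```python
import pandas as pd

# Invented values; column names mirror the notebook's alive table
toy = pd.DataFrame({
    "age": ["20s", "20s", "50s", "50s", "80s"],
    "timelength": [10.0, 14.0, 18.0, 22.0, 30.0],
})

# One mean per age group, indexed by the group label
avg_by_age = toy.groupby("age")["timelength"].mean()
print(avg_by_age)
```

One thing to watch for: the age labels are strings, so they sort alphabetically, and a label such as "100s" would be placed before "20s" on the axis.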

In [None]:
import seaborn as sns
density= sns.kdeplot(alive['timelength'],bw=5)

plt.title ('recovery time density plot')
plt.xlabel('Recovery time')
plt.ylabel('density')

plt.show()

We also plot the density of the recovery time across all ages to get a sense of how long it generally takes for symptoms to go away.

In [None]:
alive = alive.copy()  # work on a copy so the new columns are written reliably
alive['acc_age'] = 2020 - alive['birth_year']
# age is a string ending in 's'; drop the 's' to get the age group as an int
alive['agegroup'] = alive['age'].str.rstrip('s').astype(int)
    

We created two columns, one holding the age in years (computed from the birth year) and one holding the age group as an integer, so that we can visualize the data better.
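Here is a minimal sketch of both conversions on invented values. The labels are assumed to be a number followed by "s", as the parsing in the cell above implies.

```python
import pandas as pd

# Invented values; labels assumed to be a number followed by "s"
toy = pd.DataFrame({
    "age": ["0s", "20s", "100s"],
    "birth_year": [2015, 1995, 1918],
})

toy["agegroup"] = toy["age"].str.rstrip("s").astype(int)
toy["acc_age"] = 2020 - toy["birth_year"]
print(toy)
```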

In [None]:
Y= alive['agegroup']
plt.scatter(alive['acc_age'], alive['timelength'], c=Y.values.ravel())

plt.title('scatter plot for age against recovery time')
plt.xlabel('age')
plt.ylabel('recovery time')

plt.show()



This is a scatter plot of recovery time versus age. Each age group is drawn in its own color so the distinctions between groups are clearer.

In [None]:
plt.scatter(alive['agegroup'], alive['timelength'])

plt.title('scatter plot for agegroup against recovery time')
plt.xlabel('age group')
plt.ylabel('recovery time')

plt.show()

I'm thinking that this plot is kind of redundant and not as nice as the one before, so we might as well just delete it. What do you think?

In [None]:
plt.violinplot([alive['timelength']], showextrema=True, showmedians=True)

plt.title('Recovery time')
plt.ylabel('time(days)')

plt.show()

This violin plot shows both the median and the probability density. It gives a clear view of the entire distribution of the data, and we can see that there are two peaks in this dataset.

# Closing inferences and remarks:
<p>The scatter plot makes it look like there is no correlation between age and recovery time. We can't conclude that outright, however, because inconsistencies in the data may account for much of the variation. The bar graph indicates that the oldest age group did take longer to recover, but our data does not give a strong enough indication for us to accept the hypothesis.</p>
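One way to put a number on "no correlation" would be a Pearson correlation coefficient between age and recovery time. We did not run this in the project. The sketch below uses invented values with the notebook's column names and only shows what the call could look like.

```python
import pandas as pd

# Invented values, not results from the project
toy = pd.DataFrame({
    "acc_age": [25, 35, 45, 55, 65, 75],
    "timelength": [20.0, 18.0, 23.0, 19.0, 22.0, 24.0],
})

# Series.corr uses the Pearson coefficient by default; the result lies in [-1, 1]
r = toy["acc_age"].corr(toy["timelength"])
print(round(r, 2))
```

A value near 0 would be in line with what the scatter plot suggests, though it would still not rule out a non-linear relationship.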