**Analysis on Published Medical Care on COVID-19**

**What has been published about medical care?
COVID-19 Open Research Dataset Challenge (CORD-19)**

**Task Details**
What has been published about medical care? What has been published concerning surge capacity and nursing homes? What has been published concerning efforts to inform allocation of scarce resources? What do we know about personal protective equipment? What has been published concerning alternative methods to advise on disease management? What has been published concerning processes of care? What do we know about the clinical characterization and management of the virus?

Specifically, we want to know what the literature reports about:

1. Resources to support skilled nursing facilities and long-term care facilities.
2. Mobilization of surge medical staff to address shortages in overwhelmed communities.
3. Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure, particularly for viral etiologies.
4. Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients.
5. Outcomes data for COVID-19 after mechanical ventilation adjusted for age.
6. Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.
7. Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.
8. Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.
9. Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.
10. Guidance on the simple things people can do at home to take care of sick people and manage disease.
11. Oral medications that might potentially work.
12. Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.
13. Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.
14. Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials.
15. Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials.
16. Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen).

The documentation and the code can be found in my GitHub repository [here](https://github.com/ednasawe/COVID-19-Visualization-of-Virus-Origin-Genetic-Evolution).

**Import the Libraries**

In [None]:
#Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import xgboost as xgb


plt.rcParams.update({'font.size': 14})


**Loading the Data**

In [None]:
# Load data

meta_dt = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")


In [None]:
print (meta_dt.shape)

The metadata.csv file contains 45774 rows (one per article) and 17 columns.

**Basic Exploration of the dataset**

In [None]:
meta_dt.head()

**Preprocess data**

~ Transform categories into discrete numerical values
~ Transform all words to lowercase
~ Remove all punctuation
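
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal sketch rather than the notebook's actual pipeline: `clean_text` and `encode_categories` are hypothetical helper names, and the sample titles are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase a title and strip ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in values]


# Made-up sample inputs (assumption, not taken from metadata.csv).
titles = ["COVID-19: Clinical Care, Revisited!", "Patent reports"]
print([clean_text(t) for t in titles])
print(encode_categories(["Virus", "Others", "Virus", "Vaccine"]))
```

On the notebook's DataFrame, the same helper could be applied with something like `meta_dt['title'].fillna('').apply(clean_text)`; for the category step, `pd.factorize` is a built-in alternative.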

**Splitting the dataset into train and test sets**

In [None]:
from sklearn.model_selection import train_test_split

# Features are the article titles; the label is the has_pdf_parse flag
X_train, X_test, y_train, y_test = train_test_split(
    meta_dt['title'],
    meta_dt['has_pdf_parse'],
    random_state=1
)

print("Training dataset: ", X_train.shape[0])
print("Test dataset: ", X_test.shape[0])

**Eliminating any duplicates**

In [None]:
meta_dt.duplicated().sum()


In [None]:
meta_dt = meta_dt.drop_duplicates().reset_index(drop=True)


**Visualizing the Data**

In [None]:
sns.countplot(y=meta_dt.has_pdf_parse);

In [None]:
meta_dt.has_pdf_parse.value_counts()


In [None]:
meta_dt.isnull().sum()

Based on the plot and the counts above, the full-text column has no null values, and most of the published articles have full text available.

Checking the full-text columns of the metadata.csv file for unique values

In [None]:
# Count the distinct values in each full-text column of metadata.csv

print(meta_dt.has_pmc_xml_parse.nunique())
print(meta_dt.has_pdf_parse.nunique())


In [None]:
# Most common keywords found

plt.figure(figsize=(9,6))
sns.countplot(y=meta_dt.has_pmc_xml_parse, order = meta_dt.has_pmc_xml_parse.value_counts().iloc[:15].index)
plt.title('Files containing the keywords')
plt.show()

# meta_dt.keyword.value_counts().head(10)

The graph above shows that the four files contain the texts we need to analyze the published articles. The custom_license file has a higher count than the biorxiv_medrxiv file.

In [None]:
# Check the number of unique journals

print(meta_dt.journal.nunique())

In [None]:
# Most common journals we have

plt.figure(figsize=(9,6))
sns.countplot(y=meta_dt.journal, order = meta_dt.journal.value_counts().iloc[:15].index)
plt.title('Top 15 journals')
plt.show()

The journals mostly cover virology, vaccines, and virus research, so they should be helpful in analysing medical care for the viral disease COVID-19.

In [None]:
# Most common titles of the articles we have

plt.figure(figsize=(9,6))
sns.countplot(y=meta_dt.title, order = meta_dt.title.value_counts().iloc[:15].index)
plt.title('Top 15 Titles')
plt.show()

The graph above shows that the most frequent title is "Infectious disease surveillance update", while "Patent reports" and "Abstracts cont." appear less often.
Using the top 15 titles above, we can narrow down which articles to look at and analyze:

In [None]:
raw_loc = meta_dt.title.value_counts()
top_loc = list(raw_loc[raw_loc>=10].index)
top_only = meta_dt[meta_dt.title.isin(top_loc)]

# Select the column before averaging so that non-numeric columns are not passed to mean()
top_l = top_only.groupby('title')['has_pdf_parse'].mean().sort_values(ascending=False)
plt.figure(figsize=(6,6))
sns.barplot(x=top_l.index, y=top_l)
plt.axhline(np.mean(meta_dt.has_pdf_parse))
plt.xticks(rotation=80)
plt.show()

The graph shows that only articles titled "Department of Error" have full texts.
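
The bar heights in the plot above are per-title means of `has_pdf_parse`. Assuming that column is a True/False flag, the mean of a group is simply the share of its articles with a parsed PDF, and the horizontal line is the same share over all rows. The sketch below reproduces that arithmetic in plain Python; the rows and the helper name `share_with_pdf` are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Made-up (title, has_pdf_parse) pairs, for illustration only.
rows = [
    ("Department of Error", True),
    ("Department of Error", True),
    ("Patent reports", False),
    ("Patent reports", True),
    ("Abstracts cont.", False),
]


def share_with_pdf(rows):
    """Return {title: fraction of rows whose flag is True}."""
    counts = defaultdict(lambda: [0, 0])  # title -> [rows with PDF, all rows]
    for title, has_pdf in rows:
        counts[title][0] += int(has_pdf)
        counts[title][1] += 1
    return {title: with_pdf / total for title, (with_pdf, total) in counts.items()}


shares = share_with_pdf(rows)
# Overall share, the analogue of the horizontal reference line in the plot.
overall = sum(flag for _, flag in rows) / len(rows)
print(shares)
print(overall)
```

A bar that reaches 1.0 means every article with that title has a parsed PDF; bars below the reference line mark titles that are less likely than average to have one.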

**Data Processing**

In [None]:
# Only the title needs a placeholder; has_pdf_parse stays a True/False flag
meta_dt['title'] = meta_dt['title'].fillna('None')

def clean_loc(x):
    """Collapse a title into a coarse topic bucket."""
    if x == 'None':
        return 'None'
    elif x in ('Corona', 'Corona Virus', 'Covid-19'):
        return 'Covid-19'
    # Test 'Viruses' before 'Virus', because 'Virus' is a substring of 'Viruses'
    elif 'Viruses' in x:
        return 'Viruses'
    elif 'Virus' in x or 'Viral' in x:
        return 'Virus'
    elif 'Virology' in x:
        return 'Virology'
    # Any one vaccine keyword is enough ('Vaccine' also matches 'Vaccines')
    elif 'Vaccine' in x or 'Vaccination' in x:
        return 'Vaccine'
    elif x in top_loc:
        return x
    else:
        return 'Others'

meta_dt['title_clean'] = meta_dt['title'].apply(lambda x: clean_loc(str(x)))
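
Keyword bucketing of this kind depends on the order of the substring checks: `'Virus' in x` is also true for any title containing "Viruses", so the longer keyword has to be tested first. One way to make that order explicit is a rule table. The sketch below is a hypothetical alternative, not the notebook's function: `bucket_title`, the `RULES` table, and the sample titles are assumptions, and the matching is case-sensitive.

```python
# Ordered (keywords, label) rules; earlier rules win, so longer keywords come first.
RULES = [
    (("Viruses",), "Viruses"),
    (("Virus", "Viral"), "Virus"),
    (("Virology",), "Virology"),
    (("Vaccin",), "Vaccine"),  # prefix of Vaccine, Vaccines, Vaccination
]


def bucket_title(title):
    """Return the label of the first rule with a keyword found in the title."""
    for keywords, label in RULES:
        if any(keyword in title for keyword in keywords):
            return label
    return "Others"


# Made-up sample titles, for illustration only.
for sample in ["Emerging Viruses in Bats", "Virus research update",
               "Vaccination strategies", "Patent reports"]:
    print(sample, "->", bucket_title(sample))
```

Keeping the rules in a list means a new topic can be added or reordered without touching the control flow.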

In [None]:
# Select the column before averaging so that non-numeric columns are not passed to mean()
top_l2 = meta_dt.groupby('title_clean')['has_pdf_parse'].mean().sort_values(ascending=False)
plt.figure(figsize=(14,10))
sns.barplot(x=top_l2.index, y=top_l2)
plt.axhline(np.mean(meta_dt.has_pdf_parse))
plt.xticks(rotation=80)
plt.show()

In [None]:
meta_dt.to_csv('submission.csv', index = False)