## **Project Purpose**
> The purpose of this project is to predict which patients are likely to need urgent admission to the ICU.

### **Data Dictionary:**
   1. **id:** The identification number of the patient.
   2. **sex:** The sex of the patient, 1 for female and 2 for male.
   3. **patient_type:** Type of patient, 1 for not hospitalized and 2 for hospitalized.
   4. **entry_date:** The date the patient went to the hospital.
   5. **date_symptoms:** The date the patient started to show symptoms.
   6. **date_died:** The date the patient died; “9999-99-99” stands for recovered.
   7. **intubed:** Intubation is a procedure used when a patient cannot breathe on their own. “1” denotes that the patient was put on a ventilator and “2” that the patient was not; “97”, “98”, “99” mean not specified.
   8. **pneumonia:** Indicates whether the patient already has inflammation of the air sacs: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   9. **age:** The age of the patient.
   10. **pregnancy:** Indicates whether the patient is pregnant: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   11. **diabetes:** Indicates whether the patient has diabetes: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   12. **copd:** Indicates whether the patient has chronic obstructive pulmonary disease (COPD): “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   13. **asthma:** Indicates whether the patient has asthma: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   14. **inmsupr:** Indicates whether the patient is immunosuppressed: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   15. **hypertension:** Indicates whether the patient has hypertension: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   16. **other_disease:** Indicates whether the patient has another disease: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   17. **cardiovascular:** Indicates whether the patient has a heart or blood vessel related disease: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   18. **obesity:** Indicates whether the patient is obese: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   19. **renal_chronic:** Indicates whether the patient has chronic renal disease: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   20. **tobacco:** Indicates whether the patient is a tobacco user: “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   21. **contact_other_covid:** Indicates whether the patient has been in contact with another COVID-19 patient.
   22. **icu:** Indicates whether the patient was admitted to an Intensive Care Unit (ICU): “1” for yes, “2” for no; “97”, “98”, “99” mean not specified.
   23. **covid_res:** 1 indicates the person is COVID positive, 2 indicates the person is COVID negative, 3 indicates the result is still awaited.
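
To make the coding scheme above concrete, here is a minimal sketch of how a yes/no column could be recoded. The toy values and the helper name `recode_yes_no` are made up for illustration; the notebook itself takes a different route further down (it replaces 2 with 0 and then filters rows).

```python
import pandas as pd

# Toy rows using the coding described in the data dictionary (hypothetical values).
toy = pd.DataFrame({
    "diabetes": [1, 2, 98, 2],
    "asthma":   [2, 2, 1, 97],
})

def recode_yes_no(column: pd.Series) -> pd.Series:
    """Map 1 -> 1.0 (yes) and 2 -> 0.0 (no); any other code (97/98/99) becomes NaN."""
    return column.map({1: 1.0, 2: 0.0})

recoded = toy.apply(recode_yes_no)
print(recoded)
```

Turning the "not specified" codes into NaN makes them visible to the usual pandas missing-value tools.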

### **Importing Models and Libraries**

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

### **Importing Dataset**

In [5]:
df = pd.read_csv('../input/covid19-patient-precondition-dataset/covid.csv', parse_dates=[3, 4])
df.head()

In [6]:
df.shape

Check whether any null values are present in the dataset.

In [7]:
df.isnull().sum().any()
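
Note that `isnull()` only sees true NaN values. According to the data dictionary, unspecified answers are stored as the codes 97, 98 and 99, so they would not show up in this check. A small sketch on made-up rows shows one way to count them per column:

```python
import pandas as pd

# Hypothetical rows; 97/98/99 are the "not specified" codes from the data dictionary.
toy = pd.DataFrame({"icu": [1, 2, 97, 97], "pneumonia": [2, 99, 1, 2]})

has_nan = toy.isnull().sum().any()              # looks for real NaN values only
unspecified_counts = toy.isin([97, 98, 99]).sum()  # counts the coded "not specified" answers

print(has_nan)
print(unspecified_counts)
```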

In [8]:
df.info()

In [9]:
df.describe()

In [10]:
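# numeric_only=True skips the date and string columns, which newer pandas versions refuse to correlate
corr = df.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask)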

We create a new feature from the difference between the date the patient went to a healthcare center and the date the patient started showing symptoms.

In [11]:
df["symp_duration"] = (df.entry_date - df.date_symptoms).dt.days

In [12]:
df.symp_duration.unique()

* The negative values are for those patients who started showing symptoms after being admitted to the health center.
* The positive values are for those patients who started showing symptoms before being admitted.
* Zero values are for those patients that started showing symptoms the same day they were admitted.
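
The three cases above can be reproduced on a tiny example. The dates below are invented; only the column names come from the dataset.

```python
import pandas as pd

# Three hypothetical patients who all entered on the same day.
toy = pd.DataFrame({
    "entry_date":    pd.to_datetime(["2020-05-04", "2020-05-04", "2020-05-04"]),
    "date_symptoms": pd.to_datetime(["2020-05-06", "2020-05-01", "2020-05-04"]),
})

# Same formula as the notebook: entry date minus symptom date, in whole days.
toy["symp_duration"] = (toy["entry_date"] - toy["date_symptoms"]).dt.days
print(toy["symp_duration"].tolist())  # [-2, 3, 0]
```

The first patient showed symptoms two days after entry (negative), the second three days before (positive), and the third on the same day (zero).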

<!-- Now let's normalize it! -->

In [13]:
# s = df["symp_duration"]; df["symp_duration"] = ((s.max() - s) / (s.max() - s.min())).round(2)

Now that we have extracted this useful information we no longer have a use for the `entry_date` and `date_symptoms` features.

In [14]:
df.drop(["date_symptoms", "entry_date"], axis=1, inplace=True)

Now we add a feature indicating whether patients died or not by using the `date_died` feature. The `date_died` feature will then be dropped.

In [15]:
df["dead"] = df["date_died"].apply(lambda x: 0 if x == '9999-99-99' else 1)
df.drop(["date_died"], axis=1, inplace=True)

Unfortunately, despite this effort, the `dead` feature is not useful for our purposes, so we drop it as well.

In [16]:
df.drop(["dead"], axis=1, inplace=True)

In [17]:
df.icu.value_counts()

The `icu` feature needs editing. It should be binarized, and rows whose status is not specified should be removed.

In [18]:
df.loc[df["icu"]==2, "icu"] = 0
df = df[df["icu"]<2]
df.icu.unique()

In [19]:
df.shape

The same procedure should be applied to the other features that contain unspecified values.

In [20]:
# Order matters: recode 1 (not hospitalized) to 0 first, then 2 (hospitalized) to 1
df.loc[df['patient_type']==1,'patient_type']=0
df.loc[df['patient_type']==2,'patient_type']=1
df.loc[df['sex']==2,'sex']=0
df.loc[df['inmsupr']==2,'inmsupr']=0
df.loc[df['pneumonia']==2,'pneumonia']=0
df.loc[df['diabetes']==2,'diabetes']=0
df.loc[df['asthma']==2,'asthma']=0
df.loc[df['copd']==2,'copd']=0
df.loc[df['hypertension']==2,'hypertension']=0
df.loc[df['cardiovascular']==2,'cardiovascular']=0
df.loc[df['renal_chronic']==2,'renal_chronic']=0
df.loc[df['obesity']==2,'obesity']=0
df.loc[df['tobacco']==2,'tobacco']=0
df.loc[df['intubed']==2,'intubed']=0
df.loc[df['icu']==2,'icu']=0
df.loc[df['covid_res']==2,'covid_res']=0
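
As a side note, the repeated `df.loc[...]` lines can be written more compactly. The sketch below assumes a hypothetical list `binary_cols` and made-up rows; it is one possible shorthand, not the notebook's own code.

```python
import pandas as pd

# Hypothetical rows; "age" is included to show that non-coded columns are left alone.
toy = pd.DataFrame({"sex": [1, 2, 2], "tobacco": [2, 1, 98], "age": [34, 2, 61]})

binary_cols = ["sex", "tobacco"]  # illustrative subset of the yes/no columns

# Replace the "no" code 2 with 0 in the selected columns only.
toy[binary_cols] = toy[binary_cols].replace(2, 0)
print(toy)
```

Restricting the replacement to `binary_cols` keeps a genuine value of 2 in a column such as `age` from being overwritten.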

Now we have to learn which features are useful for our prediction. Let's analyze the count values of the following:

In [21]:
feature_count = ["patient_type","intubed","hypertension","other_disease","cardiovascular","obesity","renal_chronic","tobacco","contact_other_covid","covid_res","inmsupr","asthma","copd","diabetes","pregnancy","pneumonia"]

In [22]:
for feature in feature_count:
    print(f"{feature}: {df[feature].value_counts()}\n")

In [23]:
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask)

In [24]:
df.drop(["patient_type","other_disease","contact_other_covid","pregnancy","hypertension"], axis=1, inplace=True)

In [25]:
df.shape

In [26]:
df = df[df["intubed"]<2]
df = df[df["cardiovascular"]<2]
df = df[df["obesity"]<2]
df = df[df["renal_chronic"]<2]
df = df[df["tobacco"]<2]
df = df[df["covid_res"]<2]
df = df[df["inmsupr"]<2]
df = df[df["asthma"]<2]
df = df[df["copd"]<2]
df = df[df["diabetes"]<2]
df = df[df["pneumonia"]<2]
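
The chain of filters above keeps only rows where every listed column is 0 or 1. A rough equivalent with a single boolean mask, shown on invented rows with a hypothetical `coded_cols` list, could look like this:

```python
import pandas as pd

# Hypothetical rows after binarization; 97 and 98 are leftover "not specified" codes.
toy = pd.DataFrame({
    "intubed": [0, 1, 97, 0],
    "asthma":  [0, 0, 1, 98],
    "age":     [40, 55, 63, 29],
})

coded_cols = ["intubed", "asthma"]  # illustrative subset of the coded columns

# A row is kept only if all of its coded columns are below 2.
keep = (toy[coded_cols] < 2).all(axis=1)
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```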

In [27]:
df.reset_index(drop=True, inplace=True)
df.shape

In [28]:
df.head()

### **Splitting Data to Train and Test**

In [29]:
X = df.drop(["id", "icu"],axis=1)
Y = df["icu"]

In [30]:
X.head()

In [31]:
Y.value_counts()

Clearly the data is imbalanced! We will take steps to balance it using the **Synthetic Minority Oversampling Technique (SMOTE)**.

In [32]:
smote = SMOTE(random_state=42)
x_bal, y_bal = smote.fit_resample(X, Y)

In [33]:
y_bal.value_counts()
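
For intuition, SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbors. The sketch below is a simplified illustration of that single step on made-up points, using only the one nearest neighbor; the function name `synthetic_point` is hypothetical and this is not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a two-feature space.
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

def synthetic_point(points, index, rng):
    """Return a point on the segment between points[index] and its nearest other point."""
    base = points[index]
    dists = np.linalg.norm(points - base, axis=1)
    dists[index] = np.inf            # exclude the point itself
    neighbor = points[np.argmin(dists)]
    gap = rng.random()               # uniform in [0, 1)
    return base + gap * (neighbor - base)

new_point = synthetic_point(minority, 0, rng)
print(new_point)
```

The new point always lies between two real minority samples, which is why the synthetic rows resemble the originals without being exact copies.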

Now that the data is balanced, let's move on to the train/test split, model fitting, and evaluation.

In [34]:
x_train, x_test, y_train, y_test = train_test_split(x_bal, y_bal, test_size=0.2, random_state=0)
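
One caveat worth considering: when oversampling happens before the split, synthetic rows derived from test-set patients can end up in the training set, which may make test scores look better than they are. A common alternative is to split first and balance only the training part. The sketch below uses invented data and plain duplication of minority rows as a stand-in for SMOTE; in this notebook, `smote.fit_resample` on the training part would take the place of that step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 negatives, 10 positives.
X_toy = pd.DataFrame({"age": np.arange(100), "diabetes": np.arange(100) % 2})
y_toy = pd.Series([0] * 90 + [1] * 10, name="icu")

# 1. Split first, stratifying so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# 2. Balance the training part only (duplication here, SMOTE in the notebook).
minority_idx = y_tr[y_tr == 1].index
n_extra = int((y_tr == 0).sum()) - len(minority_idx)
extra = np.random.default_rng(0).choice(minority_idx, size=n_extra)
X_tr_bal = pd.concat([X_tr, X_tr.loc[extra]])
y_tr_bal = pd.concat([y_tr, y_tr.loc[extra]])

print(y_tr_bal.value_counts())  # balanced training labels
print(y_te.value_counts())      # test labels keep the original imbalance
```

The test set stays untouched, so the reported metrics reflect the class ratio the model would face in practice.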

**Standardizing the data**

In [35]:
# scaler = StandardScaler()
# x_train_scaled = scaler.fit_transform(x_train)
# x_test_scaled = scaler.transform(x_test)
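
The scaling cell is disabled, but for reference, standardizing means subtracting each feature's mean and dividing by its standard deviation, using statistics computed on the training data only. A small NumPy sketch on made-up numbers:

```python
import numpy as np

# Hypothetical train and test matrices with two features each.
train = np.array([[20.0, 0.0], [40.0, 1.0], [60.0, 1.0]])
test = np.array([[50.0, 0.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the test set reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Distance-based models such as KNN and SVC tend to be sensitive to feature scale, so a step like this may matter more for them than for the tree-based models.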

### **Model Fitting and Evaluation**

In [36]:
def Model(model):
    
    model.fit(x_train, y_train)
    results = model.predict(x_test)
    
    print("train Accuracy = {}".format(accuracy_score(y_train, model.predict(x_train))))
    print("test Accuracy = {}".format(accuracy_score(y_test, results)))
    print("Confusion Matrix")
    print(confusion_matrix(y_test, results))
    print("Classification Report")
    print(classification_report(y_test, results))
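
To compare several models side by side, it can help to collect the scores instead of only printing them. The sketch below is a hypothetical variant named `evaluate`, run on synthetic data from `make_classification` rather than on the patient dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for demonstrating the helper.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

def evaluate(model):
    """Fit the model and return its train and test accuracy in a dict."""
    model.fit(xtr, ytr)
    return {
        "model": type(model).__name__,
        "train_accuracy": accuracy_score(ytr, model.predict(xtr)),
        "test_accuracy": accuracy_score(yte, model.predict(xte)),
    }

candidates = (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0))
results = [evaluate(m) for m in candidates]

# Pick the entry with the highest test accuracy.
best = max(results, key=lambda r: r["test_accuracy"])
print(results)
print(best["model"])
```

Because the target is imbalanced, recall or F1 on the ICU class from the classification report may be a more informative ranking criterion than accuracy.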

In [None]:
Model(SVC(C=0.5, kernel="linear"))

In [None]:
Model(LogisticRegression())

In [None]:
Model(KNeighborsClassifier(n_neighbors=5,weights='distance',p=1,metric='minkowski'))

In [None]:
Model(XGBClassifier())

In [None]:
Model(RandomForestClassifier())

In [None]:
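# 'sqrt' and 'log_loss' replace the older 'auto' and 'deviance' options, which recent scikit-learn releases no longer accept
Model(GradientBoostingClassifier(max_features='sqrt', loss='log_loss', learning_rate=0.3,
                                   max_depth=8, min_samples_leaf=3, min_samples_split=0.1, n_estimators=400, subsample=0.4))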

## ** model performed best!**