# Hackathon

The L7 AI Hackathon gives you an opportunity to work in a team of four or five to test your new data science and ML skills. You will deliver a 15-minute presentation (10 minutes of presentation and 5 minutes of Q&A) about your learning journey in the programme and the outcome of your work.

You will be provided with a sample of possible questions to investigate and tasks to undertake. However, you are free to design your own tasks as a team and to augment the initial datasets for your own analysis.

In this hackathon, we want you to build machine learning models that predict COVID-19 infection from symptoms. Such models have several applications: for example, triaging patients to be seen by a doctor or nurse, or recommending self-isolation through contact tracing apps.

Zoabi et al. ([link here](https://www.nature.com/articles/s41746-020-00372-6)) [1] build a gradient-boosting classifier with decision-tree base learners using the publicly available data reported by the Israeli Ministry of Health. The paper itself discusses the various challenges encountered in deploying such a model. We encourage you to read the paper to learn about these challenges and ways to overcome them. We will be using the dataset from this paper (GitHub link in the references).

## References

[1] Zoabi, Y., Deri-Rozov, S. & Shomron, N. Machine learning-based prediction of COVID-19 diagnosis based on symptoms. npj Digit. Med. 4, 3 (2021).

[2] [GitHub link](https://github.com/nshomron/covidpred/tree/master) referenced in the paper. The main page of the GitHub repository mentions that the data file, `corona_tested_individuals_ver_006.english.csv.zip`, was the version they used for the analysis. It is also the version we will use for this hackathon.

## Presentation

On Day 2, between 3 and 5 PM, your team will present its findings and experience for 10 minutes, followed by 5 minutes of Q&A from your peers.

The presentation should cover (non-exhaustively) the following components:

- Briefly define the problem
- Briefly describe the dataset
- What did you learn about various models/techniques/etc.?
- What’s the AUC score of your final model?

Below is a list of potential questions and directions that you can explore, but you are free and encouraged to be creative and explore other directions to apply your ML skills.

## Exploratory Data Analysis

- Think about possible biases and limitations of this dataset. What are the sources of uncertainty?
- What is the format of feature values?
- What are the summary statistics of these feature values? How many symptom values are reported, and how many are missing?
- Which symptoms have a reporting bias, i.e., are more likely to be reported when the patient is COVID positive?
- How will the symptoms with reporting bias affect the model’s performance?
- Visualization: Draw a bar graph of the features grouped by the target class.
- What does the bar graph of the symptoms with reporting bias look like?
- Determine whether the dataset has a class imbalance. If so, what downstream challenges do you expect in evaluating the model, and how will you overcome them?
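As a starting point for the class-balance and grouped-symptom questions above, here is a minimal sketch. It runs on a tiny hand-made frame (`toy`, an assumption for illustration) that only borrows column names from the hackathon file; the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy stand-in with a few of the dataset's column names; values are made up.
toy = pd.DataFrame({
    "cough":         [1, 0, 0, 1, 0, 0, 1, 0],
    "fever":         [1, 0, 0, 0, 0, 0, 1, 0],
    "head_ache":     [1, 0, 0, 0, 0, 0, 1, 0],
    "corona_result": ["positive", "negative", "negative", "negative",
                      "negative", "negative", "positive", "negative"],
})

# Class balance: share of each target value.
class_share = toy["corona_result"].value_counts(normalize=True)

# Symptom prevalence per class: the mean of a 0/1 column is the share reporting it.
symptom_rate = toy.groupby("corona_result")[["cough", "fever", "head_ache"]].mean()

print(class_share)
print(symptom_rate)

# With matplotlib installed, symptom_rate.T.plot.bar() draws the grouped bar chart.
```

A symptom whose rate is near zero in one class and high in the other is worth a second look: it may be genuinely predictive, or it may only be recorded once a patient is already suspected to be positive.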

## Feature Engineering

- How will you represent the features in a numerical format that the model can consume?
- Are there any redundancies in your feature representation?
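One possible way to explore both questions is one-hot encoding with `pandas.get_dummies`. The sketch below uses a small made-up frame (`toy`) and is only one option among several; `OneHotEncoder` inside a scikit-learn pipeline would be an equally reasonable choice.

```python
import pandas as pd

# Made-up rows using two of the dataset's categorical columns.
toy = pd.DataFrame({
    "gender": ["male", "female", None, "female"],
    "test_indication": ["Other", "Abroad", "Contact with confirmed", "Other"],
})

# Full one-hot: one column per category plus an explicit "missing" column.
full = pd.get_dummies(toy, dummy_na=True, dtype=int)

# drop_first removes one level per feature, because that level is implied
# by the remaining columns (they always sum to 1 in the full encoding).
compact = pd.get_dummies(toy, dummy_na=True, drop_first=True, dtype=int)

print(full.columns.tolist())
print(compact.columns.tolist())
```

Tree-based models tolerate the redundant column, while linear models without regularisation can become ill-conditioned, so whether to drop a level depends on the model you pick.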

## Models

- Check out various classifiers in sklearn or any other preferred library to build your models
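A quick way to compare several classifiers is to score each with cross-validated ROC AUC. The sketch below does this on synthetic, imbalanced data from `make_classification` (an assumption, not the hackathon dataset), and the three candidates are arbitrary examples rather than a recommended shortlist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% positives.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=137,
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=137),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=137),
}

# Mean ROC AUC over 3 stratified folds for each candidate.
cv_auc = {
    name: cross_val_score(clf, X_toy, y_toy, cv=3, scoring="roc_auc").mean()
    for name, clf in candidates.items()
}

for name, auc in sorted(cv_auc.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {auc:.3f}")
```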

## Evaluate

- Is accuracy the right metric to evaluate the model? Are inaccuracies correctly penalized in the accuracy metric?
- Which dataset should you choose to evaluate the model? Validation or Test?
- What other metric is relevant in our context?

To benchmark everyone’s results, we will use the ROC AUC score as the common metric.
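To see why accuracy can mislead on an imbalanced target, consider a "model" that always predicts negative. The sketch below uses made-up labels with 7% positives (an illustrative figure, close to but not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 93 negatives and 7 positives: an illustrative imbalance.
y_true = np.array([0] * 93 + [1] * 7)

# A trivial baseline: hard label 0 for everyone, and a constant score.
always_negative = np.zeros(100, dtype=int)
constant_score = np.zeros(100)

baseline_accuracy = accuracy_score(y_true, always_negative)
baseline_auc = roc_auc_score(y_true, constant_score)

print(f"accuracy: {baseline_accuracy:.2f}")  # 0.93
print(f"ROC AUC:  {baseline_auc:.2f}")       # 0.50
```

The baseline looks strong on accuracy while finding no positive case at all; ROC AUC exposes it as no better than chance because it measures how well the scores rank positives above negatives.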

## Report your result

- With the metric chosen, report your result on the test dataset.
- How will you select the threshold above which the model score is interpreted as a prediction of a positive diagnosis?
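One common heuristic for picking a threshold is Youden's J statistic, which maximises `tpr - fpr` along the ROC curve. The sketch below applies it to made-up validation labels and scores; it is only one option, and in a triage setting you might instead fix a minimum sensitivity and take the corresponding threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up validation labels and model scores.
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
val_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_val, val_scores)

# Youden's J: pick the ROC point farthest above the chance diagonal.
best = np.argmax(tpr - fpr)
threshold = thresholds[best]

# Scores at or above the threshold are read as a positive prediction.
y_pred = (val_scores >= threshold).astype(int)

print(f"threshold: {threshold:.2f}")
print(y_pred)
```

Choose the threshold on validation data, not on the test set, so that the reported test result stays an honest estimate.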

## Extras

- Dimensionality Reduction for fun: Can you reduce the dimension to just 2 dimensions and check if the inputs corresponding to different classes belong to different clusters? Try using t-SNE or UMAP for that purpose.
- Ensembling: Can you combine several models into an ensemble?
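Both extras can be prototyped in a few lines of scikit-learn. The sketch below runs on synthetic data (an assumption; with the real data you would pass your encoded feature matrix) and uses t-SNE for the 2D embedding and a soft-voting ensemble of two arbitrary models. UMAP would need the separate `umap-learn` package and is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the hackathon dataset.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=8, n_informative=4, random_state=137,
)

# Project to 2 dimensions; plot embedding[:, 0] vs embedding[:, 1] coloured by y_toy.
embedding = TSNE(
    n_components=2, perplexity=20, init="pca", random_state=137,
).fit_transform(X_toy)
print(embedding.shape)

# Soft voting averages the predicted probabilities of the member models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=137)),
    ],
    voting="soft",
)
ensemble.fit(X_toy, y_toy)
ensemble_proba = ensemble.predict_proba(X_toy)[:, 1]
print(ensemble_proba[:5])
```

t-SNE is slow on large inputs, so on the full dataset it is usually run on a random sample of rows.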

# Start

In [55]:
import pandas as pd

random_state=137

In [56]:
covid = pd.read_csv('data/corona_tested_individuals_ver_006.english.csv')


In [72]:
covid = covid.astype({
    'test_date': 'datetime64[s]',
    'cough': 'float16',
    'fever': 'float16',
    'sore_throat': 'float16',
    'shortness_of_breath': 'float16',
    'head_ache': 'float16',
    'corona_result': 'category',
    'age_60_and_above': 'category',
    'gender': 'category',
    'test_indication': 'category'
})
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278848 entries, 0 to 278847
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype        
---  ------               --------------   -----        
 0   test_date            278848 non-null  datetime64[s]
 1   cough                278596 non-null  float16      
 2   fever                278596 non-null  float16      
 3   sore_throat          278847 non-null  float16      
 4   shortness_of_breath  278847 non-null  float16      
 5   head_ache            278847 non-null  float16      
 6   corona_result        278848 non-null  category     
 7   age_60_and_above     151528 non-null  category     
 8   gender               259285 non-null  category     
 9   test_indication      278848 non-null  category     
dtypes: category(4), datetime64[s](1), float16(5)
memory usage: 5.9 MB


In [81]:
pd.DataFrame(covid.isna().sum(), columns=['NaN']).sort_values(by='NaN', ascending=False)

                        NaN
age_60_and_above     127320
gender                19563
cough                   252
fever                   252
sore_throat               1
shortness_of_breath       1
head_ache                 1
test_date                 0
corona_result             0
test_indication           0


In [95]:
covid.test_indication.value_counts(dropna=False)

test_indication
Other                     242741
Abroad                     25468
Contact with confirmed     10639
Name: count, dtype: int64

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [22]:
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278848 entries, 0 to 278847
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   test_date            278848 non-null  object 
 1   cough                278596 non-null  float64
 2   fever                278596 non-null  float64
 3   sore_throat          278847 non-null  float64
 4   shortness_of_breath  278847 non-null  float64
 5   head_ache            278847 non-null  float64
 6   corona_result        278848 non-null  object 
 7   age_60_and_above     151528 non-null  object 
 8   gender               259285 non-null  object 
 9   test_indication      278848 non-null  object 
dtypes: float64(5), object(5)
memory usage: 21.3+ MB


In [33]:
covid = covid.replace({
    'corona_result': {
        'negative': 0,
        'positive': 1,
        'other': 2
    }
})


In [34]:
X = covid.drop(columns=['corona_result'])
y = covid.corona_result

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=137)

In [36]:
y.value_counts(normalize=True)

corona_result
0    0.933222
1    0.052821
2    0.013957
Name: proportion, dtype: float64

In [37]:
y_train.value_counts(normalize=True)

corona_result
0    0.933221
1    0.052820
2    0.013959
Name: proportion, dtype: float64

In [38]:
y_test.value_counts(normalize=True)

corona_result
0    0.933226
1    0.052824
2    0.013950
Name: proportion, dtype: float64

In [41]:
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler

from sklearn.compose import ColumnTransformer

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Models
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

# Evaluation metrics (accuracy_score is used in the commented-out check at the end of this cell)
from sklearn.metrics import accuracy_score, roc_auc_score

num_features = []
cat_features = []

num_transformer = Pipeline([
    # ('imputer', SimpleImputer(strategy='median'))
])

cat_transformer = Pipeline([
    # ('encoder', OneHotEncoder(handle_unknown='ignore')),
    # ("selector", SelectPercentile(chi2, percentile=50))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', 'passthrough'),
    ('classifier', XGBClassifier(objective='multi:softmax'))
])

params = {
    'scaler': [StandardScaler(with_mean=False), MaxAbsScaler()],
    'classifier__learning_rate': [0.3, 0.5, 0.7],
    'classifier__n_estimators': [10, 50, 100],
    'classifier__max_depth': [2, 4, 6]
}

model = GridSearchCV(
    estimator=pipeline,
    param_grid=params,
    scoring='roc_auc_ovr',  # the target has three classes (0/1/2), so plain 'roc_auc' would raise; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    cv=3
)

# X_dev, X_val, y_dev, y_val = train_test_split(X_train, y_train, test_size=0.1, stratify=y_train, random_state=31)

# model.fit(X_dev, y_dev)

# print(model.best_params_)

# print(f'Model Accuracy Score: {accuracy_score(y_val, model.predict(X_val)):0.4f}')

# n_estimators=100, max_depth=6, objective='multi:softmax'