# Explainer notebook

This notebook contains all analysis and code used to make the website. It is structured as follows:

1. Motivation
1. Part 1 (Basic stats + data analysis + references)
1. Part 2 (Basic stats + data analysis + references)
1. Part 3 (Basic stats + data analysis + references)
1. Genres
1. Visualizations
1. Discussion
1. Contributions
1. References

Parts 1, 2 and 3 correspond to the three parts of the website and have been split into separate notebooks to aid readability.

# <span style="color:orange">1:</span> Motivation

In this part the following will be described:
1. What is the dataset
1. Why was this particular dataset chosen
1. What is the goal for the end user's experience

## <span style="color:orange">1.1:</span> What is the dataset? 

The dataset used is 'Data Science for COVID-19' (DS4C), which is provided by the 'Korea Centers for Disease Control & Prevention' (KCDC) but is updated and maintained by students and academic faculty throughout South Korea [1] [2].

The dataset is continually expanded as more data becomes available and is updated approximately bi-weekly.

The dataset consists of the following files, each of which is briefly explained below based on [3].
1. **Case data**
    * **Case:** Data of COVID-19 infection cases in South Korea
1. **Patient data**
    * **PatientInfo:** Epidemiological data of COVID-19 patients in South Korea
    * **PatientRoute:** Route data of COVID-19 patients in South Korea
1. **Time series data**
    * **Time:** Time series data of COVID-19 status in South Korea
    * **TimeAge:** Time series data of COVID-19 status in terms of the age in South Korea
    * **TimeGender:** Time series data of COVID-19 status in terms of gender in South Korea
    * **TimeProvince:** Time series data of COVID-19 status in terms of the Province in South Korea
1. **Additional data**
    * **Region:** Location and statistical data of the regions in South Korea
    * **Weather:** Data of the weather in the regions of South Korea
    * **SearchTrend:** Trend data of the keywords searched in NAVER which is one of the largest portals in South Korea
    * **SeoulFloating:** Data of floating population in Seoul, South Korea (from SK Telecom Big Data Hub)
    * **Policy:** Data of the government policy for COVID-19 in South Korea
    
The structure of the dataset can be seen below, where the same color indicates similar properties, lines indicate that values are partially shared, and dotted lines indicate weak relevance [3].

<img src="attachment:datastructure.png" width="700">

Additional information about the dataset can be found in [3].
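
Before any analysis, it can be useful to check that a local copy of the dataset contains every table listed above. The snippet below is a minimal sketch of such a check, not the code used for the website. It assumes each table is stored as a CSV file named after the table (for example `Case.csv`) in a single directory; the `.csv` extension, the directory layout, the `DS4C_TABLES` grouping keys and the helper `find_missing_tables` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Table names as listed in section 1.1, grouped by category.
# The grouping keys and the ".csv" extension are assumptions.
DS4C_TABLES: Dict[str, List[str]] = {
    "case": ["Case"],
    "patient": ["PatientInfo", "PatientRoute"],
    "time_series": ["Time", "TimeAge", "TimeGender", "TimeProvince"],
    "additional": ["Region", "Weather", "SearchTrend", "SeoulFloating", "Policy"],
}


def find_missing_tables(data_dir: str) -> List[str]:
    """Return the names of listed tables that have no CSV file in data_dir."""
    root = Path(data_dir)
    missing = []
    for names in DS4C_TABLES.values():
        for name in names:
            # A table counts as present only if "<name>.csv" is a regular file.
            if not (root / f"{name}.csv").is_file():
                missing.append(name)
    return missing
```

Calling `find_missing_tables("data")` (with `"data"` as a hypothetical folder name) returns an empty list when all twelve tables are present, and otherwise the names of the tables that still need to be downloaded.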

## <span style="color:orange">1.2:</span> Why was it chosen? 

When investigating COVID-19 datasets it quickly became apparent that little information was shared besides 'Number of cases', 'Survivors' and 'Deceased'. This is extremely high-level information and did not seem to allow for deep dives that could uncover new information about the disease.

That is, however, not the case for the DS4C dataset. The dataset provides extremely detailed information on all levels, from individual movements and patient information to regional information, policies and time series data. Additionally, South Korea has been a front-runner in successfully battling the COVID-19 pandemic. A detailed analytical deep dive into its data is therefore an obvious choice, as countries where the pandemic started later can still benefit from the lessons learned by the countries hit early.

## <span style="color:orange">1.3:</span> What is the goal?

The goal in terms of end user experience is "*to learn how one might reduce the risk of COVID-19 country-wide*", meaning both how to reduce the spread and how to reduce the number of fatal cases.

To learn how to reduce COVID-19 risks, the story starts from the very top and gradually zooms in and becomes more detailed. The story is outlined as follows:

1. Show the general state of South Korea and explain why it went as it did from a top-level perspective
1. Zoom in once and explain how and why the virus spread geographically across the country, to draw conclusions about risk reduction
1. Zoom in again and explain how and why the virus spread within a city, to draw conclusions about risk reduction
1. Zoom in again and explain how the virus impacts individuals, to draw conclusions about risk reduction
1. Zoom out and connect the dots into a collection of findings

This will take the user on a journey that gives a solid understanding of the virus in little time, from high-level geography to localized movements and individual factors, making it a tool suitable both for governments wishing to strengthen their grip on the virus and for individuals wanting to protect themselves.

# <span style="color:orange">2:</span> Part 1 - Global numbers and nation-wise spatio-temporal evolution

Here 'Basic stats' and 'Data analysis' related to *Part 1 - Global numbers and nation-wise spatio-temporal evolution* are described. The description has been placed in a separate notebook to aid readability.

[Access the notebook here](nbviewer-link)

**NOTE** that some interactive features might not be viewable in nbviewer. These can be viewed either on the website or by downloading the GitHub repository and running the notebook locally.

# <span style="color:orange">3:</span> Part 2 - TITLE

Here 'Basic stats' and 'Data analysis' related to *Part 2 - TITLE* are described. The description has been placed in a separate notebook to aid readability.

[Access the notebook here](nbviewer-link)

**NOTE** that some interactive features might not be viewable in nbviewer. These can be viewed either on the website or by downloading the GitHub repository and running the notebook locally.

# <span style="color:orange">4:</span> Part 3 - What affects risk of fatal attacks for individuals

Here 'Basic stats' and 'Data analysis' related to *Part 3 - What affects risk of fatal attacks for individuals* are described. The description has been placed in a separate notebook to aid readability.

[Access the notebook here](https://nbviewer.jupyter.org/github/s153048/KC19/blob/master/notebooks/Part3_Explained.ipynb)

**NOTE** that some interactive features might not be viewable in nbviewer. These can be viewed either on the website or by downloading the GitHub repository and running the notebook locally.

# <span style="color:orange">5:</span> Genres 

## <span style="color:orange">5.1:</span> Graphics genres used 

Several genres have been used throughout the website [4]. 
1. **Magazine style**
    * Is used generally to explain what can be seen from the graphs, and to explain insights when no graphs are present, for instance when investigative analysis has been done or when the machine learning model was explained instead of shown.
1. **Annotated chart**
    * Is used, for instance, to convey which actions the South Korean Government has put in place, by annotating a chart of cases with the actions taken, allowing the reader to investigate the connections herself.
1. **Slide show**
    * An adaptation of the slide show has been used in multiple places, letting the user pick from a drop-down menu which angle she wishes to view the data from. For instance, she can choose between cumulative and non-cumulative data in part 1, or switch between different features in part 3.
1. **Animation**
    * Is used in multiple places as well, for instance in part 1, where the geographical spread of the virus is shown using two movies, and in part 2, where the local movements in Seoul can be seen day by day.
    
The genres have all been chosen with a purpose. The magazine style guides the viewer and helps her understand the data even if she is not a scientist herself. The annotated charts and the slide show aim to engage the reader in looking closer at the data, by giving her the ability to investigate it and draw her own conclusions. Finally, animation conveys otherwise difficult messages in a 'just lean back and enjoy it' kind of way.

## <span style="color:orange">5.2:</span> Narrative structure used 

The narrative structure of the website has been designed to present complex analysis in an easy-to-read manner while being as engaging as possible, so that the reader feels she discovered the insights herself instead of just being told something. This was achieved by using the following structural tools, each of which is explained below [4].
1. **Order**
    1. Linear
1. **Interactivity**
    1. Hover highlighting / Details
    1. Filtering / Selection / Search
    1. Navigation buttons
    1. Explicit instruction
    1. Tacit tutorial 
1. **Messaging**
    1. Captions/Headlines
    1. Annotations
    1. Accompanying text
    1. Comment repetition
    1. Introductory text
    1. Summary synthesis
    
    
The linear order was chosen as 'Random access' and 'User directed paths' might lead to confusion for the reader as the data investigation is somewhat complicated and needs a certain amount of guidance to be fully understood. 

For interactivity, 'Hovering', 'Filtering/Selection' and 'Navigation buttons' were provided to let the user engage with the graphs at her own pace. Instructions were still provided to ease understanding, either through 'Explicit instruction' or through a 'Tacit tutorial', as in the risk-profiling part where the user was first shown how the model works and was then given free rein to play with it as she liked.

For messaging, the main goal was to drive the linear storyline and guide the reader through the different charts and interactions to ease understanding and increase engagement. 'Captions and headlines' were generally provided as a quick overview, with 'Annotations' being used in, for instance, the government actions chart. 'Accompanying text', 'Comment repetition', 'Introductory text' and 'Summary synthesis' were all used. The goal of these was to first introduce the data to ease understanding, then accompany it with text and comment repetition to make sure the reader got the right takeaways. Finally, summary synthesis was used to make sure the reader did not forget important takeaways after the intensive read!

# <span style="color:orange">6:</span> Visualizations 

Here, each visualization is explained. The explanations include small images of the figures for reference and are divided by part to ease readability.

## <span style="color:orange">6.1:</span> Part 1 - Global numbers and nation-wise spatio-temporal evolution

**Visualization 1: Line**

To show the evolution over time of tested, positive and negative individuals, a line plot was chosen, since it showcases the temporal evolution in an objective and minimalist manner. A drop-down menu lets the reader switch between accumulated and daily statistics.
<img src="attachment:pedro_f1.png" width="300">
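
The relationship between the two drop-down views can be sketched as follows. This is a minimal illustration, assuming the underlying series is stored as accumulated counts and the daily view is derived from it. The function name `cumulative_to_daily` and the sample numbers are hypothetical and are not taken from the dataset or from the website code.

```python
from typing import List


def cumulative_to_daily(cumulative: List[int]) -> List[int]:
    """Convert an accumulated series into per-day increments.

    The first day keeps its own value, since there is no earlier
    observation to subtract from it.
    """
    if not cumulative:
        return []
    daily = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        daily.append(current - previous)
    return daily


# Hypothetical accumulated counts, for illustration only.
example_cumulative = [1, 4, 9, 9, 15]
print(cumulative_to_daily(example_cumulative))
```

A day on which the accumulated count does not change shows up as a zero in the daily view, which is why the daily curve is much more jagged than the accumulated one.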

**Visualization 3: Line**

To show in detail the evolution over time of positive, released and deceased individuals, a line plot was chosen for the same reasons as for the previous item. A drop-down menu lets the reader switch between accumulated and daily statistics.
<img src="attachment:pedro_f3.png" width="300">

**Visualization 4: Line**

To show in detail the evolution over time of deceased individuals, a line plot was chosen for the same reasons as for the previous items. A drop-down menu lets the reader switch between statistics based on the total tested population and statistics based on positive individuals only.
<img src="attachment:pedro_f4.png" width="300">

**Visualization 5: Line**

To show possible correlations between government actions and the number of confirmed cases, a line plot with annotations was chosen, showing the evolution of the indicators with a clear distinction between before and after each action.
<img src="attachment:pedro_f5.png" width="300">

**Visualization 6: Geo-plot**

To show how the disease spread geographically over time, a time-animated heatmap was chosen in order to communicate several dimensions at once (time, coordinates and intensity) without overwhelming the reader.
<img src="attachment:pedro_f6.png" width="300">

**Visualization 7: Bar**

To show in more detail how the disease and all its indicators have spread throughout the provinces, a time-animated bar chart with hover annotations was chosen. It delivers a lot of information, but its interactivity and the option to show only selected provinces keep it from overwhelming the reader.
<img src="attachment:pedro_f7.png" width="300">

## <span style="color:orange">6.2:</span> Part 2 - TITLE

## <span style="color:orange">6.3:</span> Part 3 - What affects risk of fatal attacks for individuals

**Visualization 1: Model**

To explain the machine-learning model, it was chosen not to show any visualizations and simply explain it in as few words as possible. This is to avoid discouraging less tech-savvy readers while still explaining where the results to be presented come from and whether the reader should trust them.
<img src="images/p3_model.png" width="200">

**Visualization 2: Bar**

To explain the feature importances of the model, a bar chart with a simple hover feature was chosen. The bar chart was chosen to convey the information in the easiest way to comprehend, and the simple interactivity was added to engage the user in a brief discussion of machine-learning 'Bias'.
<img src="images/p3_bar.png" width="200">

**Visualization 3: Line**

To explain the partial dependence of each feature, a simple line chart was chosen. The line chart has a drop-down feature letting the user investigate the partial dependencies herself. The line chart was chosen to show how results change depending on the value of the feature, and the drop-down was chosen to give the user a sense of discovery rather than just showing all four charts at once.
<img src="images/p3_line.png" width="200">
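
For readers unfamiliar with partial dependence, the idea behind each line in the chart can be sketched in a few lines of code: fix one feature to a given value for every row in the data, average the model's predictions, and repeat for a range of values. The snippet below is a minimal sketch under stated assumptions, not the code behind the website. The helper `partial_dependence`, the stand-in `toy_model` and the sample rows are all hypothetical.

```python
from typing import Callable, List, Sequence


def partial_dependence(
    predict: Callable[[Sequence[float]], float],
    rows: List[List[float]],
    feature_index: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over all rows with one feature forced to each grid value."""
    if not rows:
        raise ValueError("rows must contain at least one observation")
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)  # copy so the input data is left untouched
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical stand-in for a trained model: the score grows with both features.
def toy_model(row: Sequence[float]) -> float:
    return 0.1 * row[0] + 0.5 * row[1]


# Hypothetical observations with two features each, for illustration only.
toy_rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
curve = partial_dependence(toy_model, toy_rows, feature_index=0, grid=[0.0, 5.0, 10.0])
print(curve)
```

Each entry of `curve` corresponds to one point on the line chart, so selecting another feature in the drop-down amounts to recomputing the curve with a different `feature_index`.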

**Visualization 4: Input output**

For the final 'Visualization', a fully interactive menu where the reader can enter her own information was constructed. This was chosen for two reasons. Firstly, to let the user experience how models can be used not just to investigate data, but also as a government tool to provide personalized recommendations to citizens. Secondly, to let the user see which features impact her personalized prediction, connecting it to the previously discovered information and letting her play around with the data herself to understand it better.
<img src="images/p3_choose.png" width="200">
<img src="images/p3_risk.png" width="200">
<img src="images/p3_pred.png" width="200">

# <span style="color:orange">7:</span> Discussion

Critically analyzing the achievements of the present document:
* What went well?
    * The storyline mimics a natural, inquisitive line of thought common to many people when exploring a problem: looking globally, visualizing geographically, then looking at population dynamics and finishing with individual consequences. This makes it easier for the reader to follow;
    * The visualizations are clean and light, but at the same time pack plenty of information;
    * Interactivity makes following the notebook/website much more engaging.
* What could be improved and why?
    * Further data and analysis relating government actions to their impact on how the disease spread. This is of great interest since the findings could be carried over to other nations in the form of recommendations.

# <span style="color:orange">8:</span> Contributions


# <span style="color:orange">9:</span> References

Note: This section contains only the references used in this notebook. The three separate notebooks contain the references related to their respective parts.

1. https://github.com/ThisIsIsaac/Data-Science-for-COVID-19
1. http://www.cdc.go.kr/
1. https://www.kaggle.com/kimjihoo/ds4c-what-is-this-dataset-detailed-description
1. http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf