# Data Science Workshop: Data Analysis with Python



## Introduction to this notebook

A Jupyter Notebook (or Colab Notebook in this instance) is a document format that bridges the gap between a traditional word processor and a code editor. **This notebook document allows executable code and nicely formatted text to live side-by-side.** Python code can be written and executed within a clear and detailed context describing the thought process and implementation of a data science project.

The cell below is a code cell. To run the code within it, first select it then click the play icon on the left, or **simply hit `[shift] + [enter]`**.

In [0]:
print("That is some fancy piece of code!")
print("1 + 1 makes", 1 + 1)

You have just executed some Python code! Below the code cell, you should have the following result:
> ```
> That is some fancy piece of code!
> 1 + 1 makes 2
> ```

Please note that the code in a cell is executed only when you run that cell, so you can run the cells of the notebook in any order you wish. The position of a cell in the notebook therefore does not determine when its code is run. Because the variables in your coding environment change every time a cell is run, the same cell may return different results depending on the order of execution.
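
To make the effect of execution order concrete, here is a minimal sketch in plain Python. The functions `cell_1` and `cell_2` and the `state` dictionary are hypothetical stand-ins for two notebook cells and the variables they share.

```python
# Each function stands in for one notebook cell; `state` plays the role
# of the variables shared by all cells in the notebook.
state = {}

def cell_1():
    state["total"] = 10

def cell_2():
    state["total"] = state["total"] * 2
    return state["total"]

cell_1()
first_run = cell_2()    # 20
second_run = cell_2()   # 40: same cell, different result
print(first_run, second_run)
```

Running `cell_1()` again would reset `total`, which is what restarting the runtime and running every cell from the top does for the whole notebook.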

> **Make sure to run each cell while you go through this notebook.**

If you are not getting the expected results and would like to reset your notebook, click `Runtime > Restart runtime...` in the toolbar. Then execute each cell in order of appearance in the document, from top to bottom.

And now...

## Let's get started!

The cell below imports Python libraries that we will be using in our project. They allow us to utilize useful functions to process and analyse our data.

Nearly every data science notebook starts with a cell like this one. Make sure you run it first!

> **Reminder:**  
> Select the cell and press `[shift] + [enter]` to run it.

In [0]:
import numpy as np                # Numerical computing library
import pandas as pd               # Data analysis library
# Notebook magic: display charts directly below the cell that creates them
%matplotlib inline
import matplotlib.pyplot as plt   # Library to plot our charts

### How to find a dataset?

There are many great sources of data on the web. For example, many datasets are available on [Kaggle](https://www.kaggle.com/datasets) and [GitHub](https://github.com/awesomedata/awesome-public-datasets).

Today, we will be working with a dataset from Johns Hopkins CSSE ([2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository](https://github.com/CSSEGISandData/COVID-19)), which contains daily updated numbers on the COVID-19 outbreak. This dataset is publicly available on GitHub.

### Importing data from the web

This is a good way to showcase how useful notebooks can be in comparison to a traditional spreadsheet: with a few lines of Python code, we can **automatically download our dataset from the web or APIs**. In this case, we are downloading CSV files from a GitHub repository. Even though the files are updated daily, we can simply re-run the cell to fetch the latest version of the data.

In [0]:
import os, requests

# The csv files we will download from GitHub
filenames = ["time_series_19-covid-Confirmed.csv", "time_series_19-covid-Deaths.csv", "time_series_19-covid-Recovered.csv"]
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"

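# Creating a data/ directory for our notebook (no error if it already exists)
os.makedirs('data', exist_ok=True)

# Downloading the csv files into data/
for filename in filenames:
    print(f"Downloading {filename}...")
    r = requests.get(url + filename, allow_redirects=True)
    r.raise_for_status()  # stop with an error if the download failed
    # The with block closes the file once the content is written
    with open('data/' + filename, 'wb') as f:
        f.write(r.content)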
    
print("Download complete.")

## Our first dataframe

A dataframe is a pandas object that we can manipulate using pandas functions. Think of it as an Excel spreadsheet represented in code: a table with named columns and numbered rows.
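
As a quick illustration of that table structure, here is a minimal sketch that builds a small dataframe by hand. The name `toy_df` and all the values in it are made up for this example; they are not taken from the real dataset.

```python
import pandas as pd

# Made-up toy values, only to show the row/column structure of a dataframe.
toy_df = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "1/22/20": [0, 1, 2],
    "1/23/20": [1, 3, 5],
})
print(toy_df)
```

Each dictionary key becomes a column name, and each list becomes the values of that column.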

Let's create a dataframe by loading our first CSV file into our notebook! You can do this with the `read_csv` function.

```python
filepath = "data/time_series_19-covid-Recovered.csv"
recovered_df = pd.read_csv(filepath)
```

In [0]:
### TODO ###
# Open "data/time_series_19-covid-Recovered.csv" using pd.read_csv(filepath)

filepath = "data/time_series_19-covid-Recovered.csv"
recovered_df = pd.read_csv(filepath)

Let's see what's in our newly created dataframe by calling the variable `recovered_df`:

In [0]:
recovered_df

### Get a quick sense of the data

In [0]:
recovered_df.shape

In [0]:
recovered_df.columns

In [0]:
recovered_df.dtypes
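
The three attributes above answer different questions: how big the table is, what the columns are called, and what type of values each column holds. Here is a minimal sketch on a made-up dataframe (`toy_df` and its values are illustrative only).

```python
import pandas as pd

# Made-up values: one text column, one decimal column, one integer column.
toy_df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Lat": [1.5, 2.5],
    "1/22/20": [0, 4],
})

n_rows, n_cols = toy_df.shape     # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(list(toy_df.columns))       # the column names
print(toy_df.dtypes)              # one data type per column
```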

### Take a quick look inside the data

Show the first rows using `head()`:
```python
recovered_df.head(10)
```

In [0]:
recovered_df.head(10)

In [0]:
recovered_df.tail(10)
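
Both `head()` and `tail()` accept the number of rows to return. The sketch below uses a made-up 12-row dataframe (`toy_df` is an illustrative name) to show what each call returns.

```python
import pandas as pd

# Made-up data: 12 rows numbered 1 to 12.
toy_df = pd.DataFrame({"day": range(1, 13), "cases": range(0, 120, 10)})

first_five = toy_df.head()     # without an argument, head() returns 5 rows
first_three = toy_df.head(3)   # the first 3 rows
last_two = toy_df.tail(2)      # the last 2 rows
print(last_two)
```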

## Exploring the data

### Getting one column

In [0]:
recovered_df["Country/Region"]

### Getting a list of rows

In [0]:
recovered_df[10:20]
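
Square brackets behave differently depending on what you put inside them. The sketch below, on a made-up dataframe (`toy_df` and the variable names are illustrative), contrasts selecting a column by name with slicing rows by position.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "1/22/20": [0, 1, 2, 3],
})

one_column = toy_df["Country/Region"]     # a column name gives a Series
as_frame = toy_df[["Country/Region"]]     # a list of names gives a DataFrame
some_rows = toy_df[1:3]                   # a slice gives rows at positions 1 and 2
print(type(one_column).__name__, type(as_frame).__name__)
print(some_rows)
```

As with Python lists, the end of the slice is excluded, so `recovered_df[10:20]` returns ten rows.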

### Getting a cell

```python
recovered_df.loc[row_indexer, column_indexer]
```

In [0]:
recovered_df.loc[3, "Country/Region"]
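
In pandas, `.loc` selects by label, while its sibling `.iloc` selects by integer position. With the default row numbering the two often coincide, so this minimal sketch uses a made-up dataframe with custom row labels (`toy_df` is an illustrative name) to make the difference visible.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"Country/Region": ["A", "B", "C"], "1/22/20": [0, 1, 2]},
    index=[10, 20, 30],   # custom row labels instead of 0, 1, 2
)

by_label = toy_df.loc[20, "Country/Region"]   # row labelled 20, column by name
by_position = toy_df.iloc[1, 0]               # second row, first column
print(by_label, by_position)
```

Both calls return the same cell here, reached in two different ways.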

### Filtering using boolean indexing

Selecting data based on the value in the column `Country/Region`.
```python
only_japan = recovered_df["Country/Region"] == "Japan"
recovered_df[only_japan]
```

In [0]:
only_japan = recovered_df["Country/Region"] == "Japan"
recovered_df[only_japan]
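
Boolean indexing works in two steps: the comparison produces one `True`/`False` value per row, and indexing with that result keeps only the `True` rows. Here is a minimal sketch on made-up data (`toy_df`, `mask` and `japan_with_cases` are illustrative names), which also shows how two conditions might be combined.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country/Region": ["Japan", "Italy", "Japan"],
    "1/22/20": [2, 0, 5],
})

mask = toy_df["Country/Region"] == "Japan"
print(mask.tolist())   # [True, False, True]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
japan_with_cases = toy_df[mask & (toy_df["1/22/20"] > 3)]
print(japan_with_cases)
```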

In [0]:
recovered_df[recovered_df["Country/Region"] == "Korea"]
# Doesn't return any row because no value matches "Korea" exactly.

Select rows whose `Country/Region` value contains a given string using `str.contains()`:

In [0]:
only_korea = recovered_df["Country/Region"].str.contains("Korea")
recovered_df[only_korea]
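
`str.contains()` takes a few optional parameters that can help with messy text columns. The sketch below uses made-up values (`toy_df` and `mask` are illustrative names) to show `case=False`, which ignores capitalisation, and `na=False`, which treats missing values as non-matching.

```python
import pandas as pd

# Made-up values, including a missing entry and a lowercase spelling.
toy_df = pd.DataFrame({
    "Country/Region": ["Korea, South", "Japan", None, "north korea"],
})

mask = toy_df["Country/Region"].str.contains("korea", case=False, na=False)
print(toy_df[mask])
```

Without `na=False`, a missing value would stay missing in the mask instead of becoming `False`, which can cause an error when the mask is used for filtering.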

## Clean up and process the data

Remove unused columns using `drop()`. The result is a new dataframe; `recovered_df` itself is not modified.

In [0]:
recovered_df.drop(["Province/State", "Lat", "Long"], axis=1)
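
Here is a minimal sketch, on a made-up one-row dataframe (`toy_df` and `trimmed` are illustrative names), showing that `drop()` returns a new dataframe and leaves the original as it was. To keep the result, assign it to a variable.

```python
import pandas as pd

toy_df = pd.DataFrame({"Country/Region": ["A"], "Lat": [1.0], "Long": [2.0]})

# axis=1 targets columns; drop(columns=["Lat", "Long"]) is equivalent.
trimmed = toy_df.drop(["Lat", "Long"], axis=1)

print(list(toy_df.columns))    # the original still has all three columns
print(list(trimmed.columns))   # only the new dataframe lost them
```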

In [0]:
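# Sum the numeric columns per country; text columns such as Province/State are skipped
recovered_df.groupby('Country/Region').sum(numeric_only=True)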

In [0]:
recovered_df.transpose()
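
To see what grouping and transposing do, here is a minimal sketch on made-up data in which country "A" is split over two provinces. The names `toy_df`, `per_country` and `by_date` are illustrative, and passing `numeric_only=True` is a choice made for this sketch so that text columns are left out of the sum.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Province/State": ["x", "y", None],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})

# One row per country, with the values of its provinces added together.
per_country = toy_df.groupby("Country/Region").sum(numeric_only=True)
print(per_country)

# transpose() swaps rows and columns: dates become rows, countries become columns.
by_date = per_country.transpose()
print(by_date)
```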

In [0]:
recovered_df

In [0]:
def preprocess(data):
    # START YOUR CODE
    # Step 1. Drop the 'Province/State', 'Lat' and 'Long' columns.
    # Hint: don't forget the axis=1 parameter to target columns.
    data = data.drop(["Province/State", "Lat", "Long"], axis=1)
    # Step 2. Set the 'Country/Region' column as index
    data = data.set_index('Country/Region')
    # Step 3. Sum over all the rows to get one worldwide total per date
    data = data.sum()
    # END
    return data

In [0]:
preprocess(recovered_df)
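
To check what `preprocess` returns, it can help to run it on a tiny table where the totals are easy to verify by hand. The sketch below repeats the function so that it runs on its own; `toy_df` and its values are made up for illustration.

```python
import pandas as pd

def preprocess(data):
    # Same three steps as in the cell above: drop, set index, sum.
    data = data.drop(["Province/State", "Lat", "Long"], axis=1)
    data = data.set_index("Country/Region")
    return data.sum()

toy_df = pd.DataFrame({
    "Province/State": ["x", None],
    "Country/Region": ["A", "B"],
    "Lat": [1.0, 2.0],
    "Long": [3.0, 4.0],
    "1/22/20": [1, 5],
    "1/23/20": [2, 7],
})

totals = preprocess(toy_df)
print(totals)
```

The result is a Series with one entry per date column, each holding the sum over all rows.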

## Plot charts

In [0]:
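deaths_df = pd.read_csv("data/time_series_19-covid-Deaths.csv")
# Plot the worldwide totals over time: recoveries in green, deaths in red
plt.plot(preprocess(recovered_df), color='g', label='Recovered')
plt.plot(preprocess(deaths_df), color='r', label='Deaths')
plt.legend()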
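
When the index consists of date strings, matplotlib treats each one as a separate category, so the x-axis can get crowded on a long time series. One possible refinement, sketched below on made-up totals, is to convert the index to real dates first and add axis labels. The series values, the variable names and the `%m/%d/%y` date format are assumptions made for this sketch.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up totals indexed by date strings in a month/day/year style.
days = ["1/22/20", "1/23/20", "1/24/20"]
recovered = pd.Series([0, 2, 5], index=days)
deaths = pd.Series([0, 1, 1], index=days)

fig, ax = plt.subplots()
for series, color, label in [(recovered, "g", "Recovered"), (deaths, "r", "Deaths")]:
    # Convert the string labels to real dates so the axis is spaced by time.
    dates = pd.to_datetime(series.index, format="%m/%d/%y")
    ax.plot(dates, series.values, color=color, label=label)

ax.set_xlabel("Date")
ax.set_ylabel("Number of cases")
ax.legend()
fig.autofmt_xdate()   # tilt the date labels so they do not overlap
```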