## Visualization and Modern Data Science

> Final: Visualization and Modern Data Science, NTU, Spring, 2021.

Kuo, Yao-Jen <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

## Instructions

- We've imported necessary modules/libraries at the beginning of each exercise.
- We've put necessary files (if any) in the working directory of each exercise.
- We've defined the names of functions/inputs/arguments for you.
- Write down your solution between the comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- Run the tests to check whether your solutions are right: Kernel -> Restart & Run All -> Restart and Run All Cells.
- You can run tests after each question or after finishing all questions.
- REMEMBER to upload your `.ipynb` file to [CEIBA](https://ceiba.ntu.edu.tw/) before **2021-06-18 20:59:59** when you are done running tests.

In [1]:
import json
import unittest
import numpy as np
import pandas as pd

## 00. Define a function named `calculate_players_bmi` that calculates the BMI of NBA players from `players.json` and returns a pandas DataFrame sorted by BMI in descending order.

\begin{equation}
BMI = \frac{weight_{kg}}{height^2_{m}}
\end{equation}

PS: You have to exclude the players who are not active (`isActive == False`).

- Expected inputs: a file `players.json`.
- Expected outputs: a (504, 5) DataFrame.

```
    firstName    lastName  weightKilograms  heightMeters        bmi
0        Zion  Williamson            128.8          2.01  31.880399
1       Jusuf      Nurkic            131.5          2.11  29.536623
2     Jarrell    Brantley            113.4          1.96  29.518950
3        Eric    Paschall            115.7          1.98  29.512295
4       Udoka    Azubuike            127.0          2.08  29.354660
..        ...         ...              ...           ...        ...
499      Josh        Hall             86.2          2.06  20.312942
500    Isaiah         Joe             74.8          1.93  20.081076
501     Isaac       Bonga             81.6          2.03  19.801500
502     Jaden   McDaniels             83.9          2.06  19.770949
503   Aleksej  Pokusevski             86.2          2.13  18.999758

[504 rows x 5 columns]
```
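As a quick sanity check on the formula, the first row of the expected output can be reproduced by hand. This is only an illustrative sketch: the helper name `bmi` is hypothetical and not part of the exercise, and the weight and height are copied from the expected output above.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

# Values for Zion Williamson, taken from the expected output table.
zion_bmi = bmi(128.8, 2.01)
print(round(zion_bmi, 6))
```

The printed value should agree with the `bmi` column of row 0 in the expected output.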

In [2]:
def calculate_players_bmi(players_json):
    """
    >>> players_bmi = calculate_players_bmi('players.json')
    >>> type(players_bmi)
    pandas.core.frame.DataFrame
    >>> players_bmi.shape
    (504, 5)
    >>> players_bmi['firstName'].values[0]
    'Zion'
    >>> players_bmi['lastName'].values[0]
    'Williamson'
    >>> players_bmi['firstName'].values[1]
    'Jusuf'
    >>> players_bmi['lastName'].values[1]
    'Nurkic'
    """
    ### BEGIN SOLUTION
    # Read from the path passed in rather than a hard-coded file name.
    with open(players_json) as file:
        player_dic = json.load(file)

    d = player_dic['league']['standard']
    fn = []
    ln = []
    hm = []
    wk = []
    
    for i in d:
        if i['isActive'] == True:
            fn.append(i['firstName'])
            ln.append(i['lastName'])
            hm.append(float(i['heightMeters']))
            wk.append(float(i['weightKilograms']))

    new = {}
    new['firstName'] = fn
    new['lastName'] = ln
    new['weightKilograms'] = wk
    new['heightMeters'] = hm

    df = pd.DataFrame(new)

    df['bmi'] = df['weightKilograms'] / df['heightMeters'] ** 2
    # Reset the index so the sorted rows are numbered from 0, as in the expected output.
    return df.sort_values(by='bmi', ascending=False).reset_index(drop=True)
    
    ### END SOLUTION

## 01. Define a function named `count_players_by_country` that counts the number of players for each country from `players.json` and returns a pandas Series sorted in descending order.

PS: You have to exclude the players who are not active (`isActive == False`).

- Expected inputs: a file `players.json`.
- Expected outputs: a Series of length 44.

```
country
USA                       387
Canada                     19
France                     12
Australia                   8
Serbia                      6
Germany                     6
Turkey                      4
Spain                       4
Croatia                     4
Slovenia                    3
Greece                      3
Nigeria                     3
Italy                       3
Brazil                      3
Lithuania                   3
Latvia                      2
Cameroon                    2
Japan                       2
Ukraine                     2
Bahamas                     2
Argentina                   2
Senegal                     2
Saint Lucia                 1
Sudan                       1
South Sudan                 1
Switzerland                 1
United Kingdom              1
Republic of the Congo       1
                            1
New Zealand                 1
Montenegro                  1
Jamaica                     1
Angola                      1
Israel                      1
Guinea                      1
Georgia                     1
Finland                     1
Egypt                       1
Dominican Republic          1
DRC                         1
Czech Republic              1
Bosnia and Herzegovina      1
Austria                     1
Uzbekistan                  1
dtype: int64
```
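One way to get counts that are already sorted in descending order is `Series.value_counts`, which sorts by frequency by default. The sketch below uses a small made-up list of countries rather than `players.json`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample; the real values come from the 'country' field of each player.
countries = pd.Series(["USA", "USA", "Canada", "USA", "France", "Canada"], name="country")

# value_counts() counts each distinct value and sorts the result by count, descending.
counts = countries.value_counts()
print(counts)
```

Because the result is indexed by country name, `counts.index[0]` gives the country with the most players.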

In [3]:
def count_players_by_country(players_json):
    """
    >>> players_by_country = count_players_by_country('players.json')
    >>> type(players_by_country)
    pandas.core.series.Series
    >>> players_by_country.size
    44
    >>> players_by_country.index[0]
    'USA'
    """
    ### BEGIN SOLUTION
    # Read from the path passed in rather than a hard-coded file name.
    with open(players_json) as file:
        player_dic = json.load(file)

    d = player_dic['league']['standard']
    c = []

    for i in d:
        if i['isActive'] == True:
            c.append(i['country'])

    new = {}
    new['country'] = c

    df = pd.DataFrame(new)
    return df['country'].value_counts()
    ### END SOLUTION

## 02. Define a function named `extract_nba_coaches` that extracts all coaches, both head coaches and assistant coaches, from `coaches.json` as a pandas DataFrame.

- Expected inputs: a file `coaches.json`.
- Expected outputs: a (236, 4) DataFrame.

```
    teamTricode firstName   lastName  isAssistant
0           PHI       Doc     Rivers        False
1           PHI     David    Joerger         True
2           PHI       Sam    Cassell         True
3           PHI       Dan      Burke         True
4           PHI    Popeye      Jones         True
..          ...       ...        ...          ...
231         WAS     David     Adkins         True
232         WAS    Jarell  Christian         True
233         WAS      Dean     Oliver         True
234         WAS     Corey     Gaines         True
235         WAS      Mike   Terpstra         True

[236 rows x 4 columns]
```
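Since the team code sits one level deeper than the other fields, flattening is the main step here. As an alternative to building the columns list by list, `pd.json_normalize` can flatten nested records in one call. This is a sketch under assumptions: the two sample records only mimic the shape of the entries in `coaches.json` (team code nested under `teamSitesOnly`), with names borrowed from the expected output.

```python
import pandas as pd

# Hypothetical sample shaped like the coach records in coaches.json.
records = [
    {"firstName": "Doc", "lastName": "Rivers", "isAssistant": False,
     "teamSitesOnly": {"teamTricode": "PHI"}},
    {"firstName": "David", "lastName": "Joerger", "isAssistant": True,
     "teamSitesOnly": {"teamTricode": "PHI"}},
]

# json_normalize flattens nested dicts into dotted column names.
flat = pd.json_normalize(records)
coaches = flat.rename(columns={"teamSitesOnly.teamTricode": "teamTricode"})

# Keep only the four expected columns, in the expected order.
coaches = coaches[["teamTricode", "firstName", "lastName", "isAssistant"]]
print(coaches)
```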

In [4]:
def extract_nba_coaches(coaches_json):
    """
    >>> nba_coaches = extract_nba_coaches('coaches.json')
    >>> type(nba_coaches)
    pandas.core.frame.DataFrame
    >>> nba_coaches.shape
    (236, 4)
    >>> nba_coaches['teamTricode'].nunique()
    30
    >>> nba_coaches['isAssistant'].sum()
    210
    """
    ### BEGIN SOLUTION
    # Read from the path passed in rather than a hard-coded file name.
    with open(coaches_json) as file:
        c_dic = json.load(file)
    
    d = c_dic['league']['standard']
    tc = []
    fn = []
    ln = []
    iA = []

    for i in d:
        tc.append(i['teamSitesOnly']['teamTricode'])
        fn.append(i['firstName'])
        ln.append(i['lastName'])
        iA.append(i['isAssistant'])

    new = {}
    new['teamTricode'] = tc
    new['firstName'] = fn
    new['lastName'] = ln
    new['isAssistant'] = iA

    return pd.DataFrame(new)
    ### END SOLUTION

## 03. Define a function named `summarize_time_series` that sums the confirmed cases and deaths by `Country/Region` and `Date` given `time_series_covid19_confirmed_global.csv` and `time_series_covid19_deaths_global.csv`.

- Expected inputs: None.
- Expected outputs: a (98044, 4) DataFrame.

```
      Country/Region       Date  Confirmed  Deaths
0        Afghanistan 2020-01-22          0       0
1        Afghanistan 2020-01-23          0       0
2        Afghanistan 2020-01-24          0       0
3        Afghanistan 2020-01-25          0       0
4        Afghanistan 2020-01-26          0       0
...              ...        ...        ...     ...
98039       Zimbabwe 2021-06-08      39321    1617
98040       Zimbabwe 2021-06-09      39432    1622
98041       Zimbabwe 2021-06-10      39496    1626
98042       Zimbabwe 2021-06-11      39688    1629
98043       Zimbabwe 2021-06-12      39852    1632

[98044 rows x 4 columns]
```
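The reshaping behind this exercise is wide to long: each date column becomes a row, and provinces are then summed per country and date. The sketch below shows that idea on a tiny made-up frame with the same column layout as the CSV files. It parses dates with `pd.to_datetime` and an explicit format, which is one possible alternative to splitting the date string manually.

```python
import pandas as pd

# Hypothetical wide-format sample: one column per date, one row per province.
wide = pd.DataFrame({
    "Province/State": ["P1", "P2", None],
    "Country/Region": ["A", "A", "B"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

id_vars = ["Province/State", "Country/Region", "Lat", "Long"]

# Wide to long: every date column turns into (Date, Confirmed) rows.
long = pd.melt(wide, id_vars=id_vars, var_name="Date", value_name="Confirmed")

# Parse month/day/two-digit-year strings into real datetimes.
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")

# Sum the provinces of each country for each date.
summary = long.groupby(["Country/Region", "Date"])[["Confirmed"]].sum().reset_index()
print(summary)
```

The deaths file can be handled the same way with `value_name="Deaths"` before the two long tables are merged.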

In [5]:
def convert_date(d):
    d = d.split('/')
    new = ['20' + d[2] , d[0], d[1]]
    if len(new[1]) < 2:
        new[1] = '0' + new[1]
    if len(new[2]) < 2:
        new[2] = '0' + new[2]
    return '-'.join(new)

def summarize_time_series():
    """
    >>> summarized_time_series = summarize_time_series()
    >>> type(summarized_time_series)
    pandas.core.frame.DataFrame
    >>> summarized_time_series.shape
    (98044, 4)
    """
    ### BEGIN SOLUTION
    df1 = pd.read_csv('time_series_covid19_confirmed_global.csv')
    df2 = pd.read_csv('time_series_covid19_deaths_global.csv')
    
    idVars = ['Province/State', 'Country/Region', 'Lat', 'Long']
    first = pd.melt(df1 , id_vars = idVars , var_name = 'Date' , value_name = 'Confirmed')
    sec = pd.melt(df2 , id_vars = idVars , var_name = 'Date' , value_name = 'Deaths')
    
    combined = pd.merge(first, sec)
    combined['Date'] = combined['Date'].apply(convert_date)
    return combined.groupby(['Country/Region' , 'Date'])[['Confirmed','Deaths']].sum().reset_index()
    ### END SOLUTION

## 04. Define a function named `calculate_daily_cases_of_taiwan` that calculates the daily confirmed cases and deaths of Taiwan and returns a DataFrame in the expected format, given `time_series_covid19_confirmed_global.csv` and `time_series_covid19_deaths_global.csv`.

- Expected inputs: None.
- Expected outputs: a (508, 5) DataFrame.

```
           Country/Region  Confirmed  Deaths  Daily_Confirmed  Daily_Deaths
Date                                                                       
2020-01-22        Taiwan*          1       0              NaN           NaN
2020-01-23        Taiwan*          1       0              0.0           0.0
2020-01-24        Taiwan*          3       0              2.0           0.0
2020-01-25        Taiwan*          3       0              0.0           0.0
2020-01-26        Taiwan*          4       0              1.0           0.0
...                   ...        ...     ...              ...           ...
2021-06-08        Taiwan*      11694     308            203.0          22.0
2021-06-09        Taiwan*      11968     333            274.0          25.0
2021-06-10        Taiwan*      12222     361            254.0          28.0
2021-06-11        Taiwan*      12500     385            278.0          24.0
2021-06-12        Taiwan*      12746     411            246.0          26.0

[508 rows x 5 columns]
```
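The daily columns are first differences of the cumulative columns, which is why the first row is `NaN`: there is no previous day to subtract. A minimal sketch using the first five cumulative confirmed values from the expected output above:

```python
import pandas as pd

# Cumulative confirmed cases for the first five dates in the expected output.
confirmed = pd.Series([1, 1, 3, 3, 4])

# diff() subtracts the previous element from each element; the first has no predecessor.
daily = confirmed.diff()
print(daily)
```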

In [6]:
def convert_date(d):
    d = d.split('/')
    new = ['20'+d[2], d[0], d[1]]
    if len(new[1]) < 2:
        new[1] = '0'+new[1]
    if len(new[2]) < 2:
        new[2] = '0'+new[2]
    return '-'.join(new)

def calculate_daily_cases_of_taiwan():
    """
    >>> daily_cases_of_taiwan = calculate_daily_cases_of_taiwan()
    >>> type(daily_cases_of_taiwan)
    pandas.core.frame.DataFrame
    >>> daily_cases_of_taiwan.shape
    (508, 5)
    """
    ### BEGIN SOLUTION
    df1 = pd.read_csv('time_series_covid19_confirmed_global.csv')
    df2 = pd.read_csv('time_series_covid19_deaths_global.csv')
    
    idVars = ['Province/State', 'Country/Region', 'Lat', 'Long']
    first = pd.melt(df1 , id_vars = idVars , var_name = 'Date' , value_name = 'Confirmed')
    sec = pd.melt(df2 , id_vars = idVars , var_name = 'Date' , value_name = 'Deaths')
    
    combined = pd.merge(first, sec)
    combined['Date'] = combined['Date'].apply(convert_date)
    
    an = combined.groupby(['Country/Region' , 'Date'])[['Confirmed','Deaths']].sum().reset_index()
    
    t = an[an['Country/Region'] == 'Taiwan*']
    t['Daily_Confirmed'] = t['Confirmed'].diff()
    t['Daily_Deaths'] = t['Deaths'].diff()
    
    return t.set_index('Date')
    ### END SOLUTION

## Run tests!

Kernel -> Restart & Run All -> Restart and Run All Cells.

In [7]:
class TestFinal(unittest.TestCase):
    def test_00_calculate_players_bmi(self):
        players_bmi = calculate_players_bmi('players.json')
        self.assertIsInstance(players_bmi, pd.core.frame.DataFrame)
        self.assertEqual(players_bmi.shape, (504, 5))
        self.assertEqual(players_bmi['firstName'].values[0], 'Zion')
        self.assertEqual(players_bmi['lastName'].values[0], 'Williamson')
        self.assertEqual(players_bmi['firstName'].values[1], 'Jusuf')
        self.assertEqual(players_bmi['lastName'].values[1], 'Nurkic')
    def test_01_count_players_by_country(self):
        players_by_country = count_players_by_country('players.json')
        self.assertIsInstance(players_by_country, pd.core.series.Series)
        self.assertEqual(players_by_country.size, 44)
        self.assertEqual(players_by_country.index[0], 'USA')
    def test_02_extract_nba_coaches(self):
        nba_coaches = extract_nba_coaches('coaches.json')
        self.assertIsInstance(nba_coaches, pd.core.frame.DataFrame)
        self.assertEqual(nba_coaches.shape, (236, 4))
        self.assertEqual(nba_coaches['teamTricode'].nunique(), 30)
        self.assertEqual(nba_coaches['isAssistant'].sum(), 210)
    def test_03_summarize_time_series(self):
        summarized_time_series = summarize_time_series()
        self.assertIsInstance(summarized_time_series, pd.core.frame.DataFrame)
        self.assertEqual(summarized_time_series.shape, (98044, 4))
    def test_04_calculate_daily_cases_of_taiwan(self):
        daily_cases_of_taiwan = calculate_daily_cases_of_taiwan()
        self.assertIsInstance(daily_cases_of_taiwan, pd.core.frame.DataFrame)
        self.assertEqual(daily_cases_of_taiwan.shape, (508, 5))
        self.assertEqual(daily_cases_of_taiwan['Country/Region'].unique()[0], 'Taiwan*')
        self.assertTrue(np.isnan(daily_cases_of_taiwan['Daily_Confirmed'].values[0]))
        self.assertTrue(np.isnan(daily_cases_of_taiwan['Daily_Deaths'].values[0]))
        self.assertAlmostEqual(daily_cases_of_taiwan['Daily_Confirmed'].values[-1], 246.0)
        self.assertAlmostEqual(daily_cases_of_taiwan['Daily_Deaths'].values[-1], 26.0)
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestFinal)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)

test_00_calculate_players_bmi (__main__.TestFinal) ... ok
test_01_count_players_by_country (__main__.TestFinal) ... ok
test_02_extract_nba_coaches (__main__.TestFinal) ... ok
test_03_summarize_time_series (__main__.TestFinal) ... ok
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  t['Daily_Confirmed'] = t['Confirmed'].diff()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  t['Daily_Deaths'] = t['Deaths'].diff()
ok

----------------------------------------------------------------------
Ran 5 tests in 1.154s

OK
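
The warnings in the output above are pandas' `SettingWithCopyWarning`, raised when new columns are assigned to a DataFrame that was produced by filtering another one. They do not fail the test, but a common way to avoid them is to call `.copy()` on the filtered result before assigning. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Confirmed": [1, 3, 5],
})

# Take an explicit copy of the filtered rows before adding columns to them.
subset = df[df["Country/Region"] == "A"].copy()
subset["Daily_Confirmed"] = subset["Confirmed"].diff()
print(subset)
```

The original `df` is left untouched, and the new column exists only on `subset`.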


In [8]:
print("You've got {} points among {} questions.".format(number_of_successes * 5, number_of_test_runs))

You've got 25 points among 5 questions.
