# Applied Machine Learning (2022), exercises


## General instructions for all exercises

Follow the instructions and fill in your solution under the line marked by the tag

> YOUR CODE HERE

Also remove the line

> raise NotImplementedError()

**Do not change other areas of the document**, since it may disturb the autograding of your results!
  
After writing your answer, execute the code cell by pressing the `Shift-Enter` key combination. The code runs and may print some output under the code cell. The focus automatically moves to the next cell, which you can execute by pressing `Shift-Enter` again, until you reach the code cell that tests your solution. Execute it and follow the feedback: usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the steps until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked by the tag:

> YOUR ANSWER HERE

Manually graded answers are written as text in Markdown format. They may contain text, pseudocode, or mathematical formulas. You can write formulas in $\LaTeX$ syntax by enclosing the formula in dollar signs (`$`); for example, `$f(x)=2 \pi / \alpha$` produces $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook and are ready to submit your solutions, download the whole notebook using the menu `File -> Download as -> Notebook (.ipynb)`. Save the file on your hard disk and submit it in [Moodle](https://moodle.uwasa.fi) or EUNICE Moodle under the corresponding exercise.

Your solution should be executable Python code. Use the existing code as an example of Python programming, and consult the many Python programming resources on the Internet if necessary.


## Reading, and visualizing data with pandas

This exercise contains the following tasks:

1. Read the CSV data to Pandas dataframe
1. Study the data statistics 
1. Slice and plot Finnish Covid cases
1. Parse timestamps from strings
1. Differentiate the data to get daily cases
1. Store the data in three different formats

In [None]:
#%load_ext autoreload
#%autoreload 2
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import seaborn as sns
# This gives the plots a white background and grids by default
#plt.style.use('seaborn-whitegrid')
sns.set()


### Task 1: Read and examine data
#### a) Read the CSV data to Pandas dataframe

Read the data file `time_series_covid19_confirmed.csv` to a pandas dataframe called `D`, and display the head and tail of the dataframe.

The data is taken from [GitHub](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data) and it is collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, USA.
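As a minimal sketch of the reading step, the example below loads a tiny, made-up CSV from memory instead of the course file. The column names and values are hypothetical and only imitate the idea of a few descriptive columns followed by one column per date.

```python
import io
import pandas as pd

# Hypothetical miniature CSV (not the course data): four descriptive
# columns followed by one column per date.
csv_text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Finland,61.9,25.7,0,0
,Sweden,60.1,18.6,0,1
"""

# pd.read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(1))  # first row of the dataframe
print(toy.tail(1))  # last row of the dataframe
print(toy.shape)    # (number of rows, number of columns)
```

With a real file, the file name is passed to `pd.read_csv()` in place of the `io.StringIO` object.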

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Some testing
assert(D.shape == (289,993)), "The shape of the data does not seem to be correct"


#### b) Missing values
What would be the best strategy to handle missing values if we are interested only in the Finnish situation: drop the rows with missing values, drop the columns with missing values, or impute the missing values with empty values?

Go ahead and apply the necessary cleaning operation to `D` and store the result as `D1`.
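The three options can be compared on a small, made-up dataframe before touching the real data. This is only a sketch; the column names and values are hypothetical, and which option suits the exercise depends on where the missing values are in `D`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    "Province": [np.nan, "X", np.nan],
    "Country": ["A", "B", "C"],
    "d1": [1.0, 2.0, 3.0],
    "d2": [4.0, np.nan, 6.0],
})

print(toy.isna().sum())                  # number of missing values per column

rows_dropped = toy.dropna(axis=0)        # keep only rows without any NaN
cols_dropped = toy.dropna(axis=1)        # keep only columns without any NaN
filled = toy.fillna({"Province": ""})    # fill NaN in one column with empty strings

print(rows_dropped.shape, cols_dropped.shape)
```

Checking `isna().sum()` first shows which columns contain the missing values, which helps in judging how much data each option would discard.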


In [None]:
# YOUR CODE HERE
raise NotImplementedError()



In [None]:
# Some testing
if 'D1' not in globals():
    print("store your modified dataframe using name D1, please")
    assert(False)


### Task 2: Slice and plot Finnish Covid cases

Select the data that represents the confirmed Covid cases in Finland by selecting the right row and only those columns that show the numbers of cases (all columns except the first four). A selection with only one row or one column is naturally a Pandas Series. It always behaves as a single column, so a transpose is not needed.

You can also select the data so that the result is a DataFrame with one row. In this case, transpose the selected slice of the DataFrame using the transpose operator `.T`. This turns rows into columns and columns into rows, just like the transpose of a matrix in mathematics. You can chain the `.loc`, `.iloc` and `.T` operators in one line to accomplish the task.

After selecting the slice of the original DataFrame, check whether you got a DataFrame or a Series, using `type(DF)`.

Save the resulting one-column dataframe (or Series) under the name `DF` in the workspace, and plot it using the `.plot()` function.
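The difference between the two kinds of selection can be sketched on a made-up dataframe. The names and values below are hypothetical, and the toy frame has only two descriptive columns instead of four.

```python
import pandas as pd

# Hypothetical toy data: two descriptive columns and two date columns
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Lat": [1.0, 2.0],
    "1/22/20": [0, 5],
    "1/23/20": [1, 7],
})

# A boolean mask keeps the matching row(s); iloc then drops the first two columns
row_frame = toy.loc[toy["Country"] == "B"].iloc[:, 2:]  # DataFrame with one row
col_frame = row_frame.T                                  # DataFrame with one column

# Selecting a single row by position returns a Series directly
series = toy.iloc[1, 2:]

print(type(row_frame), row_frame.shape)
print(type(col_frame), col_frame.shape)
print(type(series), series.shape)
```

Selecting with a boolean mask keeps the result two-dimensional, whereas selecting one row by its integer position reduces it to a Series.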


In [None]:

# YOUR CODE HERE
raise NotImplementedError()

type(DF)

In [None]:
# The result can be either a DataFrame with only one column or a Series, which always has only one column

if type(DF)==pd.core.frame.DataFrame:
    if(DF.shape[1]>1):
        print("Remove unnecessary countries, please")
    assert(DF.shape==(989,1))

if type(DF)==pd.core.series.Series:
    if(DF.shape[0]>989):
        print("Please remove also non-numerical columns from the beginning. Otherwise you cannot cleanly plot the data")

### Task 3: Parse timestamps to datetime objects
The data looks familiar and would already be useful for many purposes, but it still has a problem. This is obviously a time series, but the computer does not yet understand what the values on the time axis are; they are handled as plain strings without meaning.

To let the computer understand them, they need to be parsed to datetime objects. A date string can be parsed to a datetime object using the string parser function `strptime()`. It uses a template for matching the string to a two-digit year (`%y`), a month (`%m`) and a day (`%d`). See a more exact description in the [documentation](https://www.programiz.com/python-programming/datetime/strptime).
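As a small illustration of how the template works, the sketch below parses one date string in the month/day/two-digit-year layout and, for comparison, one in a four-digit-year layout using `%Y`. The example strings are made up.

```python
from datetime import datetime

# "%m/%d/%y" matches month/day/two-digit year
parsed = datetime.strptime("9/20/22", "%m/%d/%y")
print(parsed.year, parsed.month, parsed.day)

# "%Y" (upper case) matches a four-digit year instead
parsed_iso = datetime.strptime("2022-09-20", "%Y-%m-%d")
print(parsed == parsed_iso)
```

If the string does not match the template, `strptime()` raises a `ValueError`.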

The function for parsing the timestamp strings is provided below.


In [None]:
# Task 3
indexes=DF.index

from datetime import datetime

# Parse a timestamp
def parseTime(s):
    return datetime.strptime(s, "%m/%d/%y")


Replace the index of `DF` by parsing its values into a list of datetime objects and assigning that list as the new index.

You can read the values of the current index using the read/write property `.index` of the dataframe, and you can update the index by assigning a list of datetime objects to it.

You can apply the `parseTime()` function above to all values in a list of date strings by using the `map()` function in Python as follows:

`map(parseTime, listOfStringValues)`

In Python 3, `map()` returns an iterator, which can be converted to a list with `list()`. See more in the [documentation](https://www.programiz.com/python-programming/methods/built-in/map).

Note that if you try to run the cell again, it will give you an error, since the index values are no longer strings and cannot be parsed again.

Now plot the data again, and you will notice that the computer now understands the time axis and can show it differently.

Try to slice the data and plot only the values for August 2022 (`DF.loc['yyyy-mm']`). See how the time axis is scaled again.
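The index replacement and the month slice can be tried out on a made-up Series first. In this sketch the helper `parse_date` and the toy values are hypothetical stand-ins; they are not part of the notebook.

```python
from datetime import datetime
import pandas as pd

def parse_date(s):
    """Parse a month/day/two-digit-year string into a datetime object."""
    return datetime.strptime(s, "%m/%d/%y")

# Hypothetical toy series with date strings as index
toy = pd.Series([10, 12, 15], index=["7/31/22", "8/1/22", "8/2/22"])

# map() is lazy, so wrap it in list() before assigning the new index
toy.index = list(map(parse_date, toy.index))
print(type(toy.index))     # pandas builds a DatetimeIndex from datetime objects

# Partial string indexing: select every value in August 2022
august = toy.loc["2022-08"]
print(august)
```

Selecting by a `'yyyy-mm'` string works only after the index holds datetime values; with a plain string index the same lookup would search for a label literally named `'2022-08'`.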

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
if type(DF)==pd.core.frame.DataFrame:
    assert(DF.loc['2022-09-20'].values[0]==1277473)
else:
    assert(DF.loc['2022-09-20']==1277473)


### Task 4: Differentiate to get the daily cases

Now calculate the daily cases as the difference between consecutive values of the cumulative number of confirmed cases, using the `.diff()` function of the dataframe. Store the result under the name `daily` and plot it.

If you have time, you can also smooth the daily graph by using a rolling average (`.rolling()`), and plot the smoothed curve in a separate plot or in the same plot. See examples from:
- Pandas [diff function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html)
- Pandas [rolling function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html)

Tip: If you want to plot both graphs in one figure, let the first plot return an axis object `ax` and tell the second plot to use that same axis, as follows:

```python
ax = daily. ... .plot(...)
daily. ... .plot(ax=ax)
```
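To see what differencing and a rolling average do numerically, here is a minimal sketch on a made-up cumulative series. The values and the window length of 3 are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical cumulative counts for five consecutive days
cumulative = pd.Series(
    [0, 2, 5, 5, 9],
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# Difference between each value and the previous one;
# the first element has no predecessor and becomes NaN
daily_toy = cumulative.diff()

# Rolling mean over a 3-day window; windows containing NaN yield NaN
smooth = daily_toy.rolling(window=3).mean()

print(daily_toy)
print(smooth)
```

A longer window gives a smoother curve but leaves more undefined values at the beginning of the series.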

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
type(daily)
#daily.loc['2020-09-23']

In [None]:
if type(daily)==pd.core.frame.DataFrame:
    assert(daily.shape==(989,1)), "The shape of the differentiated data does not seem right"
    print(daily.loc['2020-09-23'])
    assert(daily.loc['2020-09-23'].values[0]==110)
else:
    assert(daily.shape[0]==989), "The shape of the differentiated data does not seem right"
    print(daily.loc['2020-09-23'])
    assert(daily.loc['2020-09-23']==110)


### Task 5: Save the parsed data to different formats

Save the parsed dataframe, `DF`, in different formats:
 - CSV, use function `.to_csv()`, read the documentation inline or [online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html), use filename `cases.csv`
 - Feather, use function `.to_feather()`, read the documentation inline or [online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html), use filename `cases.feather`
 - JSON, use function `.to_json()`, read the documentation inline or [online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html), use filename `cases.json`


Feather and JSON do not support datetime objects as index. To overcome this issue, a new index consisting only of integers is created, and the dates are moved to a separate column. This can be accomplished simply by calling `.reset_index()` on the dataframe.

In addition, Feather requires valid column names, so they need to be defined too, which is a good idea anyway. The column names can be set using the read/write property `.columns`.

Both requirements can be met as follows:

```python
DFR = DF.reset_index()
DFR.columns=('Date', 'Cases')
```
 
**See the code in the validation section for reading the files back.**

**Note:** If you do this at home, install `pyarrow` to support the Feather format.
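The round trip of writing and reading can be sketched with a made-up two-row Series. The file names, values and the use of a temporary directory are assumptions for illustration only; Feather is left out here so that the sketch runs without `pyarrow`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy series with a datetime index
toy = pd.Series([1, 3], index=pd.to_datetime(["2022-09-19", "2022-09-20"]))

# Move the dates from the index to an ordinary column and name the columns
flat = toy.reset_index()
flat.columns = ["Date", "Cases"]

# Write to a temporary directory so that no files are left behind
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    json_path = os.path.join(tmp, "toy.json")

    flat.to_csv(csv_path, index=False)  # index=False skips the integer index
    flat.to_json(json_path)

    back_csv = pd.read_csv(csv_path)
    back_json = pd.read_json(json_path)

print(back_csv.dtypes)
print(back_json.dtypes)
```

Without `index=False`, `.to_csv()` also writes the integer index as an extra unnamed column, which is why the number of columns read back from a CSV file can differ by one.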

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
data_csv=pd.read_csv('cases.csv')
data_fea=pd.read_feather('cases.feather')
data_json=pd.read_json('cases.json')

assert((data_csv.shape==(989,2)) or (data_csv.shape==(989,3)))
assert((data_fea.shape==(989,1)) or (data_fea.shape==(989,2)))
assert((data_json.shape==(989,1)) or (data_json.shape==(989,2)))

# Check the types of the datetime column values
assert((type(data_csv.iloc[1,0])==str) or (type(data_csv.iloc[1,1])==str))
assert((type(data_json.index[0])==pd.Timestamp) or (type(data_json.iloc[1,0])==pd.Timestamp))
assert(type(data_fea.iloc[1,0])==pd.Timestamp)

Note that, as the checks above show, the data read back from Feather and JSON contains the timestamps as timestamp objects, whereas after reading CSV the timestamps are plain strings and need to be parsed to timestamps again.
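Re-parsing dates after reading CSV can be done either while reading or afterwards. The sketch below uses a made-up in-memory CSV with hypothetical column names `Date` and `Cases`.

```python
import io
import pandas as pd

# Hypothetical CSV content with a date column stored as text
csv_text = "Date,Cases\n2022-09-19,1\n2022-09-20,3\n"

# Default behaviour: the dates come back as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# Option 1: ask read_csv to parse the column while reading
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Option 2: convert the string column afterwards
reparsed = pd.to_datetime(as_text["Date"])

print(type(as_text["Date"].iloc[0]))
print(type(as_dates["Date"].iloc[0]))
```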

To help decide which format to use in your own projects, take a look at the
[comparison of different file formats](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d). The optimal file format also depends on the application.