# Assignment 3

<br> 
    <center>
        <img src="src/A3_Cluster_illustration.png" width="350"/>
    </center>
</br>

In the third assignment, you can revise dimensionality reduction with **principal component analysis (PCA)** and **k-means clustering**. Here, we focus on two datasets of stock prices. In particular, you will see examples of 

* when principal components might be useful as new variables for a predictive model,
* how to recognise whether principal components carry sufficient information,
* and how to identify clusters of similar assets in the finance example.

For this exercise, you can switch back to the environment ```APML``` used during class.

Please provide solutions to all exercises below and send me all notebooks by **31st of May 2024**.

***

## Part I: Portfolio of four industries

For the first part, we consider a portfolio of companies belonging to four industries: 

* automotive (Mercedes-Benz: 'MBG.DE', BMW: 'BMW.DE')
* aviation (Lufthansa: 'LHA.DE', Air France-KLM: 'AF.PA')
* pharmaceuticals (Bayer: 'BAYN.DE', Novartis: 'NVS', Roche: 'RO.SW')
* technology (Apple: 'AAPL', Amazon: 'AMZN', Google: 'GOOG')

In [None]:
portfolio = {
    'MBG.DE': 'Mercedes-Benz Group AG',
    'BMW.DE': 'Bayerische Motoren Werke AG',
    'LHA.DE': 'Deutsche Lufthansa AG',
    'AF.PA': 'Air France-KLM',
    'BAYN.DE': 'Bayer AG',
    'NVS': 'Novartis AG',
    'RO.SW': 'Roche Holding AG',
    'AAPL': 'Apple Inc.',
    'AMZN': 'Amazon.com Inc.',
    'GOOG': 'Alphabet Inc.'
}

Let us first inspect the dataset. Load the data with the following cell:

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.precision', 2)

industries_df = pd.read_csv('data/finance_toy_data.csv', index_col='Date')
industries_df

### Exercise I.1

As you can see above, the stock prices vary quite a bit from one company to another.

To make the values comparable, we use **normalised returns**, i.e. the
fractional change of each price relative to the immediately previous row. For example, the stock price of
```MBG.DE``` changed from ```2016-01-05``` to the next trading day ```2016-01-06``` by

$\frac{49.32 - 50.60}{50.60} = \frac{-1.28}{50.60} \approx -0.025296$

i.e. it fell by about 2.53%.
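The same figure can be reproduced in pandas. The snippet below is a minimal sketch that uses only the two ```MBG.DE``` prices quoted above. The variable names ```prices``` and ```returns``` are illustrative choices, and ```pct_change``` is one possible way to obtain the fractional change from the previous row.

```python
import pandas as pd

# The two MBG.DE prices quoted in the text above.
prices = pd.Series([50.60, 49.32],
                   index=['2016-01-05', '2016-01-06'],
                   name='MBG.DE')

# Fractional change relative to the immediately previous row.
# The first row has no predecessor, so its value is NaN.
returns = prices.pct_change()
print(returns)
```

Note that the first entry is ```NaN``` because there is no earlier row to compare against.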

* Compute the normalised returns for ```industries_df``` (i.e. do a similar computation as above for all rows) and store the result in the DataFrame ```norm_returns```.
* Drop all rows in ```norm_returns``` where there is at least one ```NaN``` entry.
* Finally, transpose ```norm_returns``` such that rows represent different companies and columns different dates.

Fill in the cell below:

### Exercise I.2

Perform a principal component analysis on ```norm_returns```. Consider 
only the first two eigenvectors, i.e. use ```num_eigvectors = 2``` in your PCA function call.
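As a reminder of what such a function does, here is a minimal sketch of PCA via the eigendecomposition of the covariance matrix. It runs on random toy data rather than on the stock returns. The function name ```pca_sketch``` is hypothetical and is not the PCA function from class; only the parameter name ```num_eigvectors``` is taken from the text above.

```python
import numpy as np

def pca_sketch(data, num_eigvectors=2):
    """Project the rows of `data` onto the leading covariance eigenvectors."""
    X = np.asarray(data, dtype=float)
    # Centre each column (feature) around its mean.
    X_centred = X - X.mean(axis=0)
    # Covariance between columns; rowvar=False treats rows as observations.
    cov = np.cov(X_centred, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the largest eigenvalues, in descending order.
    order = np.argsort(eigvals)[::-1][:num_eigvectors]
    # Coordinates of every row in the basis of the selected eigenvectors.
    return X_centred @ eigvecs[:, order]

# Toy data: 6 hypothetical "companies" (rows) observed on 5 "dates" (columns).
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 5))

components = pca_sketch(toy, num_eigvectors=2)
print(components.shape)  # (6, 2)
```

Each row of ```components``` holds the coordinates of one observation along the first and second principal axis.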

### Exercise I.3

Can you think of a metric which might give us an idea of how expressive the first principal components are?

Fill in the cell below with your choice:

### Exercise I.4

Visualise the result of your PCA by plotting the first two principal components. 
Label the individual data points with the abbreviations given in 
the ```portfolio``` dictionary. Your result should look somewhat similar to the
title picture of this notebook above.
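One possible way to label points in a scatter plot is sketched below with matplotlib. The tickers and coordinates are made up for illustration and stand in for your own PCA result.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical tickers and 2-D coordinates standing in for a PCA result.
tickers = ['AAA', 'BBB', 'CCC', 'DDD']
points = np.array([[0.1, 0.2], [0.4, -0.1], [-0.3, 0.3], [-0.2, -0.4]])

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(points[:, 0], points[:, 1])

# Write each ticker next to its point, shifted by a few pixels
# so that the text does not cover the marker.
for ticker, (x, y) in zip(tickers, points):
    ax.annotate(ticker, (x, y), xytext=(5, 5), textcoords='offset points')

ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
```

In a notebook the figure is displayed automatically; in a plain script you would call ```plt.show()``` afterwards.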

### Exercise I.5

Perform k-means clustering on your PCA result. Use

In [None]:
k=4 
random_state=123
n_init=100

in your k-means clustering function.
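If you use scikit-learn, these settings could be passed as shown in the following sketch. This is an assumption about the tooling, since the function used in class may have a different interface. In scikit-learn's ```KMeans```, the number of clusters ```k``` is called ```n_clusters```. The toy points below are synthetic and only stand in for a PCA result.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for a PCA result: four well-separated groups
# of three 2-D points each.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
points = np.vstack([c + 0.1 * rng.normal(size=(3, 2)) for c in centres])

# n_init restarts the algorithm from different initial centroids and
# keeps the run with the lowest inertia; random_state makes it reproducible.
kmeans = KMeans(n_clusters=4, random_state=123, n_init=100)
labels = kmeans.fit_predict(points)
print(labels)
```

The cluster numbers themselves are arbitrary: only the grouping of the points is meaningful.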

### Exercise I.6

Visualise your result by plotting a figure similar to the one in Exercise I.4, 
but this time colour the data points by their cluster. Your result should 
resemble the title picture of this notebook above.

In addition, print the cluster number (0 to 3) and the stocks which belong to each cluster number.
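For the printing part, one possible approach is to group the tickers by their cluster label. The sketch below uses made-up tickers and labels; in the exercise, the labels would come from your own k-means result.

```python
import pandas as pd

# Hypothetical tickers and cluster labels, for illustration only.
tickers = ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF']
labels = [0, 1, 0, 2, 1, 3]

# One label per ticker, indexed by the ticker symbol.
assignment = pd.Series(labels, index=tickers, name='cluster')

# Collect the tickers that share a cluster label.
members = {cluster_id: list(group.index)
           for cluster_id, group in assignment.groupby(assignment)}

for cluster_id, names in members.items():
    print(f'Cluster {cluster_id}: {names}')
```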

Fill in the cell below:

### Exercise I.7

**You do not need to write or execute anything in this exercise.**

Reflect on the following questions:

* Were you able to identify clusters which correspond to the four industries?
* Whether or not you were, why is that the case? Take your result on the expressivity (Exercise I.3) into consideration.
* Finally, do you consider the simplification to two principal components useful for this dataset?

***

## Part II: Portfolio of DAX companies

In the first part, we focused on handpicked companies in four distinct industries. 
In the second part, we consider a broader scope but essentially perform the same steps. 

Here, we work with the German stock market index DAX, which includes the 40 biggest German companies. 
Below, we focus on the following 27 DAX companies:

In [None]:
portfolio = {
    'ADS.DE': 'Adidas AG',
    'ALV.DE': 'Allianz SE',
    'BAS.DE': 'BASF SE',
    'BAYN.DE': 'Bayer AG',
    'BEI.DE': 'Beiersdorf AG',
    'BMW.DE': 'Bayerische Motoren Werke AG',
    'CON.DE': 'Continental AG',
    'DB1.DE': 'Deutsche Boerse AG',
    'DBK.DE': 'Deutsche Bank AG',
    'DHER.DE': 'Delivery Hero SE',
    'DPW.DE': 'Deutsche Post AG',
    'DTE.DE': 'Deutsche Telekom AG',
    'DWNI.DE': 'Deutsche Wohnen SE',
    'EOAN.DE': 'E.ON SE',
    'FME.DE': 'Fresenius Medical Care AG & Co. KGaA',
    'FRE.DE': 'Fresenius SE & Co. KGaA',
    'HEI.DE': 'HeidelbergCement AG',
    'HEN3.DE': 'Henkel AG & Co. KGaA',
    'IFX.DE': 'Infineon Technologies AG',
    'LHA.DE': 'Deutsche Lufthansa AG',
    'MRK.DE': 'Merck KGaA',
    'MUV2.DE': 'Muenchener Rückversicherungs-Gesellschaft AG',
    'RWE.DE': 'RWE AG',
    'SAP.DE': 'SAP SE',
    'SIE.DE': 'Siemens AG',
    'TKA.DE': 'thyssenkrupp AG',
    'VOW3.DE': 'Volkswagen AG'
}

Load the dataset in the following cell:

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.precision', 2)

dax_df = pd.read_csv('data/finance_dax_data.csv', index_col='Date')
dax_df

### Exercise II.1

As before, perform the following preparation of the dataset:

* Compute the normalised returns for ```dax_df``` and store the result in the DataFrame ```norm_returns```.
* Drop all rows in ```norm_returns``` where there is at least one ```NaN``` entry.
* Finally, transpose ```norm_returns``` such that rows represent different companies and columns different dates.

Fill in the cell below:

### Exercise II.2

Perform a principal component analysis on ```norm_returns```. Consider 
only the first two eigenvectors, i.e. use ```num_eigvectors = 2``` in your PCA function call.

### Exercise II.3

In Exercise I.3, you chose a metric for the expressivity of the first principal components.

Fill in the cell below with your choice for the DAX example:

### Exercise II.4

Visualise the result of your PCA by plotting the first two principal components. 
Label the individual data points with the abbreviations given in 
the ```portfolio``` dictionary. 

### Exercise II.5

Let us make a simplifying assumption and consider 
that the DAX consists of the following seven industry sectors:

* commerce
* aviation and automotive
* bank and insurance
* chemicals and pharmaceuticals
* software and communication
* energy and construction
* technology and medicine
 
Perform k-means clustering on your PCA result. Under this assumption, use

In [None]:
k=7
random_state=123
n_init=100

in your k-means clustering function.

### Exercise II.6

Visualise your result by plotting a figure similar to the one in Exercise I.6. 
Colour the data points by their cluster.

As before, print the cluster number (0 to 6) and the stocks which belong to each cluster number.

Fill in the cell below:

### Exercise II.7

**You do not need to write or execute anything in this exercise.**

As before, reflect on the following questions:

* Were you able to identify useful clusters?
* Whether or not you were, why is that the case? Take your result on the expressivity (Exercise II.3) into consideration.
* Do you consider the simplification to two principal components useful for this second dataset, too?