# Unit 4 - Joins & Aggregation

## Aggregations
Count and average aggregations using the DMV dataset

1. Method 1 - Naive Method

    Increment the counts corresponding to `fuel_type` and `passengers`. The memory complexity is therefore O(num_unique_fuel_types).
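A minimal sketch of the naive method described above. It assumes each record exposes a `fuel_type` key and a numeric `passengers` value, and that the goal is a row count and an average passenger count per fuel type. The function name `naive_aggregate` and the sample rows are made up for illustration and are not taken from the DMV dataset. Only one counter and one running sum are kept per distinct fuel type, which is where the O(num_unique_fuel_types) memory bound comes from.

```python
from collections import defaultdict


def naive_aggregate(rows):
    """Return {fuel_type: (count, average passengers)} in a single pass."""
    counts = defaultdict(int)    # one counter per distinct fuel type
    totals = defaultdict(float)  # running sum of passengers per fuel type
    for row in rows:
        fuel = row["fuel_type"]
        counts[fuel] += 1
        totals[fuel] += row["passengers"]
    return {fuel: (counts[fuel], totals[fuel] / counts[fuel]) for fuel in counts}


# Hypothetical sample rows (assumption, not real DMV data).
sample_rows = [
    {"fuel_type": "GAS", "passengers": 4},
    {"fuel_type": "GAS", "passengers": 2},
    {"fuel_type": "ELECTRIC", "passengers": 5},
]
result = naive_aggregate(sample_rows)
print(result)
```

Because the averages are derived from a sum and a count, partial results from different chunks of the data could later be merged by adding the sums and counts.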

2. Method 2 - Naive Method, version 2

    It takes only 0.85 s to loop through the data and find the unique values.

3. Method 3 - Multi-Core Approach

## Summary

In [2]:
!python3 gather_results.py aggregation `ls -d group_*`

Evaluating: group_duckdb/aggregation.py
Result: 0.35705113410949707
Evaluating: group_suman_mike_naive/aggregation.py
Result: 13.597444772720337
  Rank  Group                     Time (seconds)
------  ----------------------  ----------------
     1  group_duckdb                    0.357051
     2  group_suman_mike_naive         13.5974
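To put the two timings in the table above into proportion, the following sketch computes how much faster the quickest entry is than the slowest one. The timings are copied from the output above; the variable names are illustrative.

```python
# Timings in seconds, copied from the gather_results.py output above.
timings = {
    "group_duckdb": 0.35705113410949707,
    "group_suman_mike_naive": 13.597444772720337,
}

fastest = min(timings, key=timings.get)
slowest = max(timings, key=timings.get)
speedup = timings[slowest] / timings[fastest]
print(f"{fastest} is about {speedup:.1f}x faster than {slowest}")
```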


- Understand and visualize it

- Present your results to your peers in the form of an _interactive_ presentation.

## Jupyter Slideshows
- Jupyter notebooks can be converted into slideshows
- Cells define the structure of the slides
- Markdown defines the content (text, images, tables, code); HTML can be used for more advanced layouts
- Code can be used to create plots or _live code execution_

- Two types of slide shows:
    - Static HTML slides with [reveal.js](https://revealjs.com/)
    - Dynamic with live code execution using the `rise` extension

## Slide types
- Activate the slide type selection drop-down by navigating to `View ► Cell Toolbar ► Slideshow`
![slide types enable](./figures/slide_types_enable.png)

$\rightarrow$ Each cell now has a "Slide Type" that can be changed
![slide types select](./figures/slide_types_select.png)

The following slide types exist:
- "Skip": Not displayed, e.g. for framework code
- "Slide": First cell of a new topic (animation goes to the right)
- "-": Same slide but new cell, useful for code cells
- "Sub Slide": First cell of a subtopic (animation goes to the bottom)
- "Fragment": Same slide, but new content (added at the bottom)
- "Notes": Shown only in the speaker view (if configured), not on the slide itself
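The toolbar selection is stored in each cell's metadata under `slideshow.slide_type`, using the lowercase values `slide`, `subslide`, `fragment`, `skip`, `notes`, and `-`. The sketch below builds a tiny notebook structure as plain dictionaries to show where that field lives. The helper name `make_cell` is hypothetical, and the structure is kept in memory instead of being written to an `.ipynb` file.

```python
import json


def make_cell(source, slide_type, cell_type="markdown"):
    """Build a minimal notebook cell dict tagged with a slide type."""
    cell = {
        "cell_type": cell_type,
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "source": source,
    }
    if cell_type == "code":
        # Code cells additionally carry an execution counter and outputs.
        cell["execution_count"] = None
        cell["outputs"] = []
    return cell


cells = [
    make_cell("## Example Slide\n- Some content", "slide"),
    make_cell("- same slide but new cell", "-"),
    make_cell("def do_nothing():\n    pass", "-", cell_type="code"),
    make_cell("### Subslide\n- more content", "subslide"),
    make_cell("- Fragment", "fragment"),
]

notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
slide_types = [c["metadata"]["slideshow"]["slide_type"] for c in notebook["cells"]]
print(slide_types)
print(json.dumps(notebook["cells"][0], indent=2))
```

Inspecting this field is a quick way to check which cells will start a new slide before converting the notebook.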

## Example Slide
- Some content

- same slide but new cell

In [None]:
# some code
def do_nothing():
    pass

### Subslide
- more content

- Fragment

## Convert Notebook to Static Slideshow with reveal.js
Directly in your browser via `File ► Download as ► Reveal.js slides`
![convert in browser](./figures/convert_in_browser.png)

### Speaker View
You can also generate a version with a speaker view. Run the following steps in the same folder as `presentation.ipynb`:
1. `git clone https://github.com/hakimel/reveal.js.git`
2. `jupyter nbconvert presentation.ipynb --to slides --reveal-prefix reveal.js`
3. Press `S` to start the speaker view
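If the conversion should be scripted, the command from step 2 can be assembled in Python. This is only a sketch: the helper name `build_slides_command` is hypothetical, and the block prints the command instead of running it, since running it requires Jupyter and the cloned `reveal.js` folder.

```python
import shlex


def build_slides_command(notebook="presentation.ipynb", reveal_prefix="reveal.js"):
    """Return the nbconvert invocation from step 2 as an argument list."""
    return [
        "jupyter", "nbconvert", notebook,
        "--to", "slides",
        "--reveal-prefix", reveal_prefix,
    ]


command = build_slides_command()
print(shlex.join(command))
# To actually run the conversion:
# import subprocess; subprocess.run(command, check=True)
```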

### Live Code Execution with RISE
Follow these steps to install the RISE extension for Jupyter:
1. Create a fresh virtual environment and activate it
2. Install nbclassic and RISE with `pip install nbclassic RISE`
3. Start Jupyter Notebook using `python -m nbclassic`

When you open the presentation notebook you should have a button to enter a slideshow.

![rise button](./figures/rise.png)

## Example
- Hidden cell below to import matplotlib and numpy:

In [None]:
# This cell is hidden in the presentation
import matplotlib.pyplot as plt
import numpy as np

- Example plot:

In [None]:
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x**2))
plt.show()

Things to remember before starting the presentation:
- Delete your cell outputs before you start the slideshow
- Run the hidden cells that contain imports or function definitions

## General Instructions
- Create your presentations with this Jupyter notebook template
- Create your plots with Python code in the notebook
  - You can import whatever library you like
  - Don't forget to add a `requirements.txt` with your dependencies!
- Commit and push your presentation notebook and all supplementary content to the linked Git repository (we will check contributions)
- Every group member needs to present a significant part of the 20-minute presentation
- We will grade both the presentation and the code used for plots etc. in your presentation
- Respect the guidelines on Canvas

## Presentation Topics
<dl>
<dt>Expectation maximization</dt>
<dd>4 People</dd>
<dt>Self-organizing maps</dt>
<dd>3 People</dd>
<dt>t-distributed stochastic neighbor embedding</dt>
<dd>3 People</dd>
</dl>

### [EM for Mixtures of Gaussians](https://gitos.rrze.fau.de/utn-machine-intelligence/teaching/ml-ws2324-lu6-student-presentations/em-for-mixtures-of-gaussians)

**You should explain**
- What are mixtures of Gaussians and what are they used for?
- What is the relationship between EM and k-Means?
- Why does EM converge?

### EM for Mixtures of Gaussians
**Hints**
- Provide a visual example for mixtures of Gaussians.
- Explain why standard MLE is not applicable and why EM is required. First present the general version of the EM algorithm with latent variables (Bis07, 9.3) and then show the specific case for mixtures of Gaussians. Explain why latent variables are helpful.
- Provide an intuition of the proof for EM. You do not need to cover all details (Bis07, 9.4).
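As a starting point for a visual example, here is a minimal sketch of EM for a one-dimensional mixture of Gaussians, using the standard E-step (responsibilities) and M-step (weighted re-estimation) updates. The function name `em_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the synthetic data are all assumptions made for illustration. A real implementation would also monitor the log-likelihood for convergence.

```python
import numpy as np


def em_gmm_1d(x, n_components=2, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    weights = np.full(n_components, 1.0 / n_components)
    # Assumed initialization: spread the means over the data quantiles.
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    variances = np.full(n_components, np.var(x))
    for _ in range(n_iter):
        # E-step: responsibility of component k for point n,
        # proportional to weight_k * N(x_n | mean_k, variance_k).
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances


# Synthetic data: two well-separated clusters (made up for illustration).
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(4.0, 1.0, 700)])
weights, means, variances = em_gmm_1d(x)
print(weights, means, variances)
```

Replacing the soft responsibilities with a hard 0/1 assignment to the nearest mean turns the loop into k-Means, which is one way to illustrate the relationship between the two algorithms.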

### EM for Mixtures of Gaussians
**Bibliography**
<div class="csl-bib-body" style="line-height: 1.35; ">
<div class="csl-entry" style="clear: left; ">
<div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[1]</div>
<div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">C. M. Bishop and N. M. Nasrabadi, <i>Pattern recognition and machine learning</i>, vol. 4. Springer, 2006. Accessed: Jan. 08, 2024. [Online]. Available: <a href="https://link.springer.com/book/9780387310732">https://link.springer.com/book/9780387310732</a></div>
<span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Pattern%20recognition%20and%20machine%20learning&amp;rft.publisher=Springer&amp;rft.aufirst=Christopher%20M.&amp;rft.aulast=Bishop&amp;rft.au=Christopher%20M.%20Bishop&amp;rft.au=Nasser%20M.%20Nasrabadi&amp;rft.date=2006"></span> (sections 9.2 - 9.4)
</div>
</div>

### [Self-Organizing Maps](https://gitos.rrze.fau.de/utn-machine-intelligence/teaching/ml-ws2324-lu6-student-presentations/self-organizing-maps)

**You should explain**
- What does the SoM algorithm do?
- What is it used for?
- What are its advantages and disadvantages?
- Why is the weight initialization critical? What are typical ways to initialize the weights of the map?

### Self-Organizing Maps
**Hints**
- Explain the high-level idea of this discrete dimension reduction method.
- Show what SoM structures look like.
- Explain both the training and inference algorithm in an intuitive way.
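The training and inference steps can be sketched in a few lines of NumPy. This is an illustrative sketch under several assumptions, not a reference implementation: a rectangular grid, a Gaussian neighbourhood, linearly decaying learning rate and radius, and random initialization within the data range. The function names and all hyperparameter values are made up.

```python
import numpy as np


def best_matching_unit(weights, sample):
    """Inference: return the (row, col) grid index of the node closest to `sample`."""
    distances = np.linalg.norm(weights - sample, axis=-1)
    return np.unravel_index(np.argmin(distances), distances.shape)


def train_som(data, grid_shape=(5, 5), n_epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM and return its weight grid of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Random initialization inside the data range (one common choice among several).
    weights = rng.uniform(data.min(axis=0), data.max(axis=0),
                          size=(rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for sample in rng.permutation(data):
            frac = 1.0 - step / n_steps  # decays linearly from 1 towards 0
            lr, sigma = lr0 * frac, max(sigma0 * frac, 0.5)
            bmu = np.array(best_matching_unit(weights, sample))
            # Neighbourhood strength decays with grid distance to the BMU.
            grid_dist2 = ((grid - bmu) ** 2).sum(axis=-1)
            h = np.exp(-grid_dist2 / (2 * sigma**2))
            # Pull the BMU and its grid neighbours towards the sample.
            weights += lr * h[..., None] * (sample - weights)
            step += 1
    return weights


rng = np.random.default_rng(1)
data = rng.random((200, 2))  # synthetic 2-D points, for illustration only
som = train_som(data, n_epochs=5)
print(best_matching_unit(som, data[0]))
```

Plotting the trained weight vectors together with their grid connections is one way to show what the resulting map looks like.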

### Self-Organizing Maps
**Basic Literature**
<div class="csl-bib-body" style="line-height: 1.35; ">
  <div class="csl-entry" style="clear: left; ">
    <div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[2]</div><div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">T. Kohonen, “Automatic formation of topological maps of patterns in a self-organizing system,” in <i>Proceedings of the 2nd Scandinavian Conference on Image Analysis</i>, 1981.</div>
  </div>
  <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=proceeding&amp;rft.atitle=Automatic%20formation%20of%20topological%20maps%20of%20patterns%20in%20a%20self-organizing%20system&amp;rft.btitle=Proceedings%20of%20the%202nd%20scandinavian%20Conference%20on%20Image%20Analysis&amp;rft.aufirst=Teuvo&amp;rft.aulast=Kohonen&amp;rft.au=Teuvo%20Kohonen&amp;rft.date=1981"></span>
  <div class="csl-entry" style="clear: left; ">
    <div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[3]</div><div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">T. Kohonen and T. Honkela, “Kohonen network,” <i>Scholarpedia</i>, vol. 2, no. 1, p. 1568, Jan. 2007, doi: <a href="https://doi.org/10.4249/scholarpedia.1568">10.4249/scholarpedia.1568</a>.</div>
  </div>
  <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_id=info%3Adoi%2F10.4249%2Fscholarpedia.1568&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=Kohonen%20network&amp;rft.jtitle=Scholarpedia&amp;rft.volume=2&amp;rft.issue=1&amp;rft.aufirst=Teuvo&amp;rft.aulast=Kohonen&amp;rft.au=Teuvo%20Kohonen&amp;rft.au=Timo%20Honkela&amp;rft.date=2007-01-18&amp;rft.pages=1568&amp;rft.issn=1941-6016&amp;rft.language=en"></span>
  <div class="csl-entry" style="clear: left; ">
    <div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[4]</div><div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">T. Kohonen, “The self-organizing map,” <i>Proceedings of the IEEE</i>, vol. 78, no. 9, pp. 1464–1480, 1990.</div>
  </div>
  <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=The%20self-organizing%20map&amp;rft.jtitle=Proceedings%20of%20the%20IEEE&amp;rft.volume=78&amp;rft.issue=9&amp;rft.aufirst=Teuvo&amp;rft.aulast=Kohonen&amp;rft.au=Teuvo%20Kohonen&amp;rft.date=1990&amp;rft.pages=1464%E2%80%931480&amp;rft.spage=1464&amp;rft.epage=1480"></span>
</div>

### [t-SNE](https://gitos.rrze.fau.de/utn-machine-intelligence/teaching/ml-ws2324-lu6-student-presentations/t-sne)
**You should explain**
- What is the KL-divergence?
- What are the two distributions whose KL-divergence we want to minimize?
- What are the differences between t-SNE and SNE? 

### t-SNE
**Hints**
- Make a visualization of the KL-divergence.
- Explain how t-SNE improves SNE.
- Provide an intuition for why the t-distribution is more appropriate than the Gaussian distribution it replaces.
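As a basis for such a visualization, the sketch below computes the KL-divergence $KL(P\|Q) = \sum_i p_i \log(p_i / q_i)$ for two small discrete distributions and shows that it is not symmetric. It then compares the Gaussian kernel $\exp(-d^2)$ with the Student-t kernel $(1 + d^2)^{-1}$ at a few distances. The example distributions and distances are made up, and the function assumes $q_i > 0$ wherever $p_i > 0$.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Two made-up distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
r = np.array([0.8, 0.1, 0.1])
kl_pp = kl_divergence(p, p)
kl_pr = kl_divergence(p, r)
kl_rp = kl_divergence(r, p)
print(kl_pp, kl_pr, kl_rp)

# Unnormalized similarity kernels as a function of the distance d.
d = np.array([0.0, 1.0, 3.0])
gaussian_kernel = np.exp(-d**2)
student_t_kernel = 1.0 / (1.0 + d**2)
print(gaussian_kernel, student_t_kernel)
```

At larger distances the Student-t kernel keeps noticeably more mass than the Gaussian kernel, which is the heavy-tail property the intuition should build on.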

### t-SNE
**Bibliography**
<div class="csl-bib-body" style="line-height: 1.35; ">
  <div class="csl-entry" style="clear: left; ">
    <div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[5]</div><div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">G. E. Hinton and S. Roweis, “Stochastic neighbor embedding,” <i>Advances in neural information processing systems</i>, vol. 15, 2002, Accessed: Jan. 08, 2024. [Online]. Available: <a href="https://proceedings.neurips.cc/paper_files/paper/2002/hash/6150ccc6069bea6b5716254057a194ef-Abstract.html">https://proceedings.neurips.cc/paper_files/paper/2002/hash/6150ccc6069bea6b5716254057a194ef-Abstract.html</a></div>
  </div>
  <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=Stochastic%20neighbor%20embedding&amp;rft.jtitle=Advances%20in%20neural%20information%20processing%20systems&amp;rft.volume=15&amp;rft.aufirst=Geoffrey%20E.&amp;rft.aulast=Hinton&amp;rft.au=Geoffrey%20E.%20Hinton&amp;rft.au=Sam%20Roweis&amp;rft.date=2002"></span>
  <div class="csl-entry" style="clear: left; ">
    <div class="csl-left-margin" style="float: left; padding-right: 0.5em;text-align: right; width: 1em;">[6]</div><div class="csl-right-inline" style="margin: 0 .4em 0 1.5em;">L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE.,” <i>Journal of machine learning research</i>, vol. 9, no. 11, 2008, Accessed: Jan. 08, 2024. [Online]. Available: <a href="https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf?fbcl">https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf?fbcl</a></div>
  </div>
  <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=Visualizing%20data%20using%20t-SNE.&amp;rft.jtitle=Journal%20of%20machine%20learning%20research&amp;rft.volume=9&amp;rft.issue=11&amp;rft.aufirst=Laurens&amp;rft.aulast=Van%20der%20Maaten&amp;rft.au=Laurens%20Van%20der%20Maaten&amp;rft.au=Geoffrey%20Hinton&amp;rft.date=2008"></span>
</div>