# Working with Large Datasets

As datasets grow, more and more problems appear in our workflow: computations take too much **time**, the data does not fit into the **memory** of our computer, or the results become harder to **comprehend**.

The same is true for **visualization**: plotting takes more time, and a figure made with the same techniques as for small data may fail to convey the necessary information or may even be misleading.

In the following notebooks we will address some of these issues following some articles and tutorials.

## Reduce data size if possible

See the [02-Pandas-reduce-data-size](02-Pandas-reduce-data-size.ipynb) notebook!

* Store data in the right format
* Lose some of the data (for example precision or unused columns) in order to increase the efficiency of your code
* ...
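
As a rough illustration of the two points above, the following sketch shrinks a DataFrame by choosing smaller dtypes. The data, the column names (`count`, `value`, `city`) and the helper `memory_mb` are made up for this example; the linked notebook may use a different approach.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: small integers, floats, and a string column
# with only a few distinct values.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "count": rng.integers(0, 100, size=n),   # stored as int64 by default
    "value": rng.random(n),                  # stored as float64 by default
    "city": rng.choice(["Budapest", "Vienna", "Prague"], size=n),
})


def memory_mb(frame):
    """Total memory footprint of a DataFrame in megabytes."""
    return frame.memory_usage(deep=True).sum() / 1024**2


before = memory_mb(df)

small = df.copy()
# Right format: pick the smallest unsigned integer type that fits the values.
small["count"] = pd.to_numeric(small["count"], downcast="unsigned")
# Lossy step: float32 keeps only about 7 significant digits.
small["value"] = small["value"].astype("float32")
# Repeated strings are stored once and referenced by integer codes.
small["city"] = small["city"].astype("category")

after = memory_mb(small)
print(f"before: {before:.1f} MB, after: {after:.1f} MB")
```

Whether the float32 step is acceptable depends on how much precision the later analysis needs.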

## Computation and Python packages

### Visualization of package relations

Funny site: https://anvaka.github.io/vs/?query=dask

<img src="https://cdn-images-1.medium.com/max/1024/0*P7Xv__4reO9WuF2I.png">

### Pandas together with Dask

* sample of the data

From https://www.machinelearningplus.com/python/dask-tutorial/:

> You may use Spark or Hadoop to solve this. But, these are not python environments. This stops you from using numpy, sklearn, pandas, tensorflow, and all the commonly used Python libraries for ML.

* Dask tutorial: https://medium.com/@gongster/dask-an-introduction-and-tutorial-b42f901bcff5
* Follow https://towardsdatascience.com/how-to-handle-large-datasets-in-python-with-pandas-and-dask-34f43a897d55

* https://www.youtube.com/watch?v=Alwgx_1qsj4

### Pandas alternatives

https://www.datacamp.com/tutorial/high-performance-data-manipulation-in-python-pandas2-vs-polars


* Koalas. A pandas API built on top of PySpark. If you use Spark, you should consider this tool.
* Vaex. A pandas API for out-of-memory computation, great for analyzing big tabular data at a billion rows per second.
* Modin. A pandas API for parallel programming, based on Dask or Ray frameworks for big data projects. If you use Dask or Ray, Modin is a great resource.
* cuDF. Part of the RAPIDS project, cuDF is a pandas-like API for GPU computation that relies on NVIDIA GPUs or other parts of RAPIDS to perform high-speed data manipulation.


## File formats

* HDF5
* NetCDF
* Parquet

# Visualization

The underlying problem is that we cannot distinguish between the data points because
* **the figure is too crowded** (overplotting),
* **alpha values lead to over- or undersaturation**,
* **saturation cannot be overcome by binning (heatmap)**,

or because the plotting interface is **too slow** to respond **to any interaction**.
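
The alpha-value problem can be made concrete with a little arithmetic. Assuming standard "over" compositing, `n` identical markers with opacity `alpha` drawn on top of each other give a combined opacity of `1 - (1 - alpha)**n`. The helper name `stacked_opacity` is made up for this sketch.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n_points identical markers drawn on the same spot."""
    return 1.0 - (1.0 - alpha) ** n_points


# With alpha = 0.1 a single point is faint, while dense regions quickly
# become fully opaque and can no longer be told apart.
for n in (1, 10, 100, 1000):
    print(n, round(stacked_opacity(0.1, n), 4))
```

After a few dozen overlaps the stack is practically opaque, so regions with 100 and 1000 points look the same (oversaturation). Lowering alpha to fix this makes isolated points nearly invisible (undersaturation).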

A good introduction:
* https://www.slideshare.net/continuumio/visualizing-billions-of-data-points-doing-it-right

## Datashader
<img src="https://github.com/holoviz/datashader/blob/main/examples/assets/images/nyc_races.jpg?raw=true" width=700>

With the [HoloViews](https://holoviews.org/) suite, interactive exploration of large datasets is possible.

### Datashading pipeline:
1. Select data and project it (e.g. scatter plot)
2. Aggregate points into a fixed set of bins --> results in one or more scalars per bin
3. Transform the aggregates using transfer functions --> yields a visible image
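
The three steps above can be sketched with plain NumPy. This is not Datashader's implementation, only a stand-in to show the idea; the random data, the grid size and the log scaling are assumptions chosen for this example.

```python
import numpy as np

# 1. Select and project: hypothetical 2D point cloud (x and y coordinates).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)
y = rng.normal(0.0, 1.0, 1_000_000)

# 2. Aggregate: count the points falling into each cell of a fixed grid.
#    The grid size plays the role of the output image resolution.
width, height = 300, 200
counts, xedges, yedges = np.histogram2d(
    x, y, bins=(width, height), range=[[-4, 4], [-4, 4]]
)

# 3. Transfer function: compress the dynamic range with a logarithm,
#    then map it to 8-bit grey levels.
scaled = np.log1p(counts)
image = (255 * scaled / scaled.max()).astype(np.uint8)

# histogram2d puts x on the first axis; transpose and flip so that rows
# correspond to y, with y increasing upwards as in a plot.
image = image.T[::-1]
print(image.shape, image.dtype)
```

In Datashader itself, the aggregation step corresponds to calls such as `Canvas.points` and the last step to `transfer_functions.shade`; see the example notebooks below for the real API.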

Example notebooks from the [datashader repository](https://github.com/holoviz/datashader):
* **How to Visualize a large dataset**: Datashader-01-Pipeline.ipynb
* **Aspects and concepts**: Datashader-02-Interactivity.ipynb
* **Networks**: Datashader-03-Networks.ipynb       
* **Timeseries example**: Datashader-04-Timeseries.ipynb



* Datashader video: https://www.youtube.com/watch?v=n4cFwPan59I

## Further reading

[Handling Large Dataset](https://medium.com/@keshavaggarwal1311/handling-large-data-sets-the-best-ways-to-read-store-and-analyze-655117d0d939)