In [1]:
%matplotlib qt

# Dealing with big data
This notebook collects short demos for the discussion on how to work with large data files.

## Dask package
Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms.

Simply put: distributed NumPy.

- Parallel: Uses all of the cores on your computer
- Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
- Blocked Algorithms: Perform large computations by performing many smaller computations
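
The blocked-algorithm idea in the list above can be sketched with plain NumPy. This is only an illustration of the principle, not how Dask is implemented: the helper name `blocked_mean` and the small 400x400 test array are assumptions made for this example. The mean is accumulated one block at a time, so only one block needs to be looked at in each step.

```python
import numpy as np


def blocked_mean(a, chunk):
    """Mean of a 2-D array, accumulated one chunk x chunk block at a time."""
    total = 0.0
    count = 0
    for i in range(0, a.shape[0], chunk):
        for j in range(0, a.shape[1], chunk):
            block = a[i:i + chunk, j:j + chunk]  # one small piece of the array
            total += block.sum()                 # small computation on that piece
            count += block.size
    return total / count


# Small hypothetical array so the sketch runs quickly
rng = np.random.default_rng(0)
a = rng.normal(10, 0.1, size=(400, 400))

blocked = blocked_mean(a, chunk=100)
full = a.mean()
print(blocked, full)
```

Both values should agree up to floating-point rounding, which is the point: many small computations can reproduce one large one.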

This demo was adapted from [dask tutorials](https://github.com/dask/dask-tutorial/blob/master/03_array.ipynb).

In [2]:
import dask.array as da
example = da.random.normal(10, 0.1, size=(10000, 10000), chunks=(1000, 1000))
example

Unnamed: 0,Array,Chunk
Bytes,800.00 MB,8.00 MB
Shape,"(10000, 10000)","(1000, 1000)"
Count,100 Tasks,100 Chunks
Type,float64,numpy.ndarray
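
The numbers in the summary above can also be read from the array's attributes. A short sketch, assuming a Dask version that provides the `numblocks`, `chunksize` and `nbytes` attributes; the variable names on the left are chosen here for illustration. Nothing is computed, because creating the array only builds a task graph.

```python
import dask.array as da

# Same call as in the cell above; no random numbers are generated yet
example = da.random.normal(10, 0.1, size=(10000, 10000), chunks=(1000, 1000))

blocks_per_axis = example.numblocks  # number of chunks along each axis
chunk_shape = example.chunksize      # shape of the largest chunk
total_bytes = example.nbytes         # size of the full array if it were in memory
chunk_bytes = chunk_shape[0] * chunk_shape[1] * example.dtype.itemsize

print(blocks_per_axis, chunk_shape, total_bytes, chunk_bytes)
```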


### Example:

- Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks
- Take the mean along one axis
- Take every 100th element

**NOTE: Show task manager memory profile**

**NOTE: Cathodoluminescence?**

In [3]:
import numpy as np

In [4]:
%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]

Wall time: 22.9 s


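**Note: Restart the Jupyter server** to free memory, OR **clear the variables with `%reset_selective`**

IPython has a set of [built-in magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), one of which, `%reset_selective`, removes variables whose names match a pattern.
However, deleting a variable does not guarantee that its memory is returned to the operating system right away, so restarting is the more reliable option.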

In [5]:
# The -f param will force resetting
%reset_selective -f x
%reset_selective -f y
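
Outside IPython, where magics are not available, a roughly comparable step is to delete the name and ask the garbage collector to run. The sketch below is a rough stand-in for the magic, not a description of what it does internally. It uses a weak reference only to check that the array object is gone, and it says nothing about when the operating system gets the memory back.

```python
import gc
import weakref

import numpy as np

x = np.random.normal(10, 0.1, size=(1000, 1000))
ref = weakref.ref(x)  # weak reference: does not keep the array alive

del x         # drop the only strong reference to the array
gc.collect()  # also clean up anything held in reference cycles

freed = ref() is None  # True once the array object has been destroyed
print(freed)
```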

In [6]:
import dask.array as da

In [7]:
%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y = y.compute()

Wall time: 5.13 s
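
Part of the difference comes from lazy evaluation: until `compute()` is called, `y` is only a recipe for the result. The sketch below uses a smaller, hypothetical 2000x2000 array so that it runs quickly; the names `y_lazy` and `is_lazy` are introduced here for illustration.

```python
import dask.array as da
import numpy as np

x = da.random.normal(10, 0.1, size=(2000, 2000), chunks=(500, 500))
y_lazy = x.mean(axis=0)[::100]  # still a Dask array, nothing computed yet

is_lazy = isinstance(y_lazy, da.Array)

y = y_lazy.compute()            # runs the task graph, returns a NumPy array
print(is_lazy, type(y), y.shape)
```

Only the 20 selected column means are held as a full result; the chunks are generated and reduced as the graph runs.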


Memory difference:

<img src="memory_diff.png" width=400 />

## Sampling with pandas

The pandas `sample()` method returns a random sample of rows (or of columns, with `axis=1`) from the data frame it is called on.

In [20]:
# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("dummy_data.csv")
df
#df.head(20)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


Or load only part of the file, using `skiprows`:

In [19]:
n = 10
# Keep the header (row 0) and every n-th row after it; skip everything else
dfn = pd.read_csv("dummy_data.csv", skiprows=lambda i: i % n != 0)
dfn

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Frances,Female,8/8/2002,6:51 AM,139852,7.524,True,Business Development
1,Donna,Female,7/22/2010,3:48 AM,81014,1.894,False,Product
2,Benjamin,Male,1/26/2005,10:06 PM,79529,7.008,True,Legal
3,,Male,1/29/2016,2:33 AM,122173,7.797,,Client Services
4,Chris,,1/24/1980,12:13 PM,113590,3.055,False,Sales
...,...,...,...,...,...,...,...,...
95,Albert,Male,9/19/1992,2:35 AM,45094,5.850,True,Business Development
96,Linda,Female,2/4/2010,8:49 PM,44486,17.308,True,Engineering
97,Ernest,Male,7/20/2013,6:41 AM,142935,13.198,True,Product
98,Justin,,2/10/1991,4:58 PM,38344,3.794,False,Legal
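
To see which rows the `skiprows` callable keeps, here is a small self-contained sketch. The in-memory CSV and its `value` column are hypothetical stand-ins for `dummy_data.csv`. The callable receives the 0-based line index of the file, header included, and returns `True` for lines to skip.

```python
import io

import pandas as pd

# Hypothetical CSV: a header line followed by the values 1..100,
# so the value on each line equals its line index in the file
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101))

n = 10
dfn = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % n != 0)

kept = dfn["value"].tolist()
print(kept)
```

Line 0 (the header) survives because `0 % n == 0`, and after it only every 10th line is read.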


Selecting n rows at random:

In [21]:
# Sample only 5 data points
df.sample(n = 5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
548,Janice,Female,1/2/1984,9:06 PM,41190,3.311,True,Sales
711,Karen,Female,1/9/2014,10:09 PM,46478,16.552,False,Engineering
470,Ryan,Male,7/20/1993,10:18 PM,139917,11.466,False,Distribution
579,Harold,Male,10/18/2010,8:45 PM,65673,1.187,True,Legal
286,Todd,Male,2/2/1984,10:13 AM,69989,10.985,True,Finance


In [22]:
# Sample 20% of the whole dataset
df.sample(frac=0.2)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
724,Andrea,Female,12/10/2001,6:40 AM,37888,13.470,False,Engineering
547,Evelyn,Female,9/22/1998,7:55 PM,51525,10.366,False,Finance
154,Rebecca,Female,11/15/1980,4:13 AM,85730,5.359,True,Product
968,Louise,Female,3/27/1995,10:27 PM,43050,11.671,False,Distribution
449,Beverly,Female,11/30/2005,2:57 AM,107163,3.665,True,Human Resources
...,...,...,...,...,...,...,...,...
950,Paula,Female,5/21/1983,11:42 AM,58423,10.833,False,Business Development
297,Daniel,Male,9/15/2007,10:16 PM,123811,7.664,True,Human Resources
989,Justin,,2/10/1991,4:58 PM,38344,3.794,False,Legal
870,Cynthia,,11/19/1996,10:40 PM,107816,18.751,False,Marketing
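
Each call to `sample()` returns different rows. When a result has to be reproducible, the `random_state` parameter fixes the seed. A minimal sketch on a hypothetical 100-row data frame (the `toy` frame and its values are made up for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data loaded from dummy_data.csv
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Salary": rng.integers(35000, 150000, size=100),
    "Bonus %": rng.uniform(1, 20, size=100),
})

# The same seed gives the same rows in the same order
first = toy.sample(frac=0.2, random_state=42)
second = toy.sample(frac=0.2, random_state=42)

print(len(first), first.equals(second))
```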


Comparing parameters estimated from a sample with the real parameters of the full dataset:

In [23]:
# Draw one sample and estimate both parameters from it
sample = df.sample(frac=0.2)
estimated_m = sample.mean(axis=0, skipna=True, numeric_only=True)
estimated_std = sample.std(axis=0, skipna=True, numeric_only=True)
print('Mean')
print(estimated_m)
print('--------')
print('Standard deviation')
print(estimated_std)

Mean
Salary     90426.55000
Bonus %       10.21772
dtype: float64
--------
Standard deviation
Salary     33195.646392
Bonus %        5.734029
dtype: float64


In [24]:
# Same statistics on the full dataset, for comparison
real_m = df.mean(axis=0, skipna=True, numeric_only=True)
real_std = df.std(axis=0, skipna=True, numeric_only=True)
print('Mean')
print(real_m)
print('--------')
print('Standard deviation')
print(real_std)

Mean
Salary     90662.181000
Bonus %       10.207555
dtype: float64
--------
Standard deviation
Salary     32923.693342
Bonus %        5.528481
dtype: float64
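
How close should the two means be? A rough yardstick is the standard error of the mean, the standard deviation divided by the square root of the sample size. The sketch below plugs in the Salary values printed above and assumes the file has 1000 rows, so that `frac=0.2` gives 200 rows. It ignores the finite-population correction.

```python
import math

# Salary values copied from the outputs above
full_mean = 90662.181
full_std = 32923.693342
sample_mean = 90426.55

n_sample = 200  # assumption: 1000 rows in the file, frac=0.2

standard_error = full_std / math.sqrt(n_sample)  # about 2328
difference = abs(full_mean - sample_mean)        # about 236

print(round(standard_error), round(difference))
```

The observed difference is well within one standard error, so the 20% sample estimates the mean salary about as well as can be expected.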


## Rebinning data

The HyperSpy `rebin()` method rebins the data to an arbitrary new shape, as long as the number of dimensions stays the same.
Internally it uses one of two algorithms:

- If the new shape dimensions are divisors of the old shape’s, the operation supports lazy evaluation and is usually faster.
- Otherwise, the operation requires linear interpolation and is generally slower.
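
The first, divisor case can be sketched in plain NumPy for a 1-D spectrum. This is an illustration under assumptions, not HyperSpy's implementation: the helper `rebin_1d` is hypothetical, binning is taken to mean summing neighbouring channels, and the test spectrum is a made-up 1024-channel `float32` array.

```python
import numpy as np


def rebin_1d(data, scale):
    """Sum every `scale` consecutive channels of a 1-D array."""
    if data.size % scale != 0:
        raise ValueError("array length must be divisible by scale")
    # Each row of the reshaped array holds the channels that form one new bin
    return data.reshape(-1, scale).sum(axis=1)


# Hypothetical spectrum: 1024 channels stored as float32
spectrum = np.arange(1024, dtype=np.float32)
binned = rebin_1d(spectrum, 4)

print(spectrum.nbytes, binned.nbytes)
```

The total signal is preserved while the number of channels, and with it the memory used, drops by the scale factor.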

In [25]:
import hyperspy.api as hs
s = hs.datasets.example_signals.EDS_SEM_Spectrum().as_lazy()
s.plot()

In [26]:
s_bin = s.rebin(scale=[4])
s_bin.plot()

In [27]:
# Check the number of bytes for each object
s.data.nbytes, s_bin.data.nbytes

(4096, 1024)