In [26]:
%matplotlib qt

# Dealing with big data
This notebook collects short demos to accompany the discussion on how to work with large data files.

## Dask package
Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms.

Simply put: distributed NumPy.

- Parallel: Uses all of the cores on your computer
- Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
- Blocked Algorithms: Perform large computations by performing many smaller computations
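To make the idea of a blocked algorithm concrete, here is a minimal sketch in plain NumPy. It is not how Dask is implemented internally, and the helper names `iter_row_blocks` and `blocked_column_mean` are hypothetical, introduced only for illustration. The column mean of an array is accumulated one block of rows at a time, so each step only touches a small piece of the data.

```python
import numpy as np

def iter_row_blocks(array, chunk_rows):
    """Yield consecutive blocks of at most `chunk_rows` rows (hypothetical helper)."""
    for start in range(0, array.shape[0], chunk_rows):
        yield array[start:start + chunk_rows]

def blocked_column_mean(blocks):
    """Column means accumulated block by block (hypothetical helper)."""
    total = None
    n_rows = 0
    for block in blocks:
        block_sum = block.sum(axis=0)  # small computation on one block
        total = block_sum if total is None else total + block_sum
        n_rows += block.shape[0]
    return total / n_rows

# Small in-memory array used only to check the result; in a real
# larger-than-memory case the blocks would be read from disk instead.
rng = np.random.default_rng(0)
full = rng.normal(10, 0.1, size=(1000, 50))

blocked = blocked_column_mean(iter_row_blocks(full, chunk_rows=128))
direct = full.mean(axis=0)
print(np.allclose(blocked, direct))
```

The blocked result matches the direct `mean` up to floating-point rounding, which is the property Dask relies on when it splits one large computation into many small ones.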

This demo was adapted from [dask tutorials](https://github.com/dask/dask-tutorial/blob/master/03_array.ipynb).

In [5]:
import dask.array as da
example = da.random.normal(10, 0.1, size=(10000, 10000), chunks=(1000, 1000))
example

Unnamed: 0,Array,Chunk
Bytes,800.00 MB,8.00 MB
Shape,"(10000, 10000)","(1000, 1000)"
Count,100 Tasks,100 Chunks
Type,float64,numpy.ndarray


### Example:

- Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks
- Take the mean along one axis
- Take every 100th element

**NOTE: Watch the memory profile in the task manager while the next cells run.**
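As a back-of-the-envelope check of what to expect in the memory profile, the sketch below computes the size of the full array and of a single chunk, assuming 8-byte `float64` elements. The helper `array_nbytes` is hypothetical and only does the arithmetic.

```python
import math

ITEMSIZE = 8  # bytes per float64 element (assumption: default dtype)

def array_nbytes(shape, itemsize=ITEMSIZE):
    """Number of bytes needed to hold a dense array of the given shape."""
    return math.prod(shape) * itemsize

full_bytes = array_nbytes((20000, 20000))
chunk_bytes = array_nbytes((1000, 1000))
n_chunks = full_bytes // chunk_bytes

print(f"full array: {full_bytes / 1e9:.1f} GB")
print(f"one chunk:  {chunk_bytes / 1e6:.1f} MB")
print(f"chunks:     {n_chunks}")
```

NumPy has to hold the whole array at once, whereas a chunked computation only needs a handful of chunks in memory at any moment.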

In [33]:
import numpy as np


In [34]:
%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]

Wall time: 26.2 s


**Note:** Before running the Dask version, free the memory used above: either **restart the Jupyter server** or **clear the variables with `%reset_selective`**.

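IPython has a set of [built-in magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), one of which, `%reset_selective`, deletes variables by name.
However, memory handling in Python is not that simple: deleting a name does not by itself guarantee that the memory is released, because an object is freed only once no other references to it remain.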
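One reason clearing a variable may not free memory can be shown with the standard library alone. This is a minimal sketch: the class `Big` is a hypothetical stand-in for a large array, and the behaviour described assumes CPython, where an object is reclaimed as soon as its last reference disappears.

```python
import gc
import weakref

class Big:
    """Hypothetical stand-in for a large array."""
    def __init__(self, n_bytes):
        self.payload = bytearray(n_bytes)

x = Big(10_000_000)
alias = x                 # a second reference to the same object
probe = weakref.ref(x)    # lets us observe the object without keeping it alive

del x                     # removes the name, but `alias` still holds the object
still_alive = probe() is not None

del alias                 # last reference gone
gc.collect()              # not strictly needed here; cleans up reference cycles
freed = probe() is None

print(still_alive, freed)
```

In a notebook, extra references of this kind can hide in places such as other variables pointing at the same data, which is why restarting the kernel is the more reliable option.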

In [35]:
# The -f flag forces the reset without asking for confirmation
%reset_selective -f x
%reset_selective -f y

In [36]:
import dask.array as da

In [37]:
%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y = y.compute()

Wall time: 5.81 s


## Sampling with pandas

pandas `DataFrame.sample()` returns a random sample of rows (or of columns, with `axis=1`) from the data frame it is called on.

In [12]:
# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("dummy_data.csv")
df

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


In [9]:
# Sample only 5 data points
df.sample(n=5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
880,Robert,,5/25/2007,3:17 AM,90998,8.382,False,Finance


In [10]:
# Sample 20% of the whole dataset
df.sample(frac=0.2)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
184,Jerry,Male,12/18/2003,6:46 AM,140810,9.177,True,Client Services
278,Betty,Female,6/28/2005,6:03 PM,51613,12.984,False,Distribution
402,Richard,,11/28/1992,5:05 PM,124655,14.272,True,Engineering
942,Lori,Female,11/20/2015,1:15 PM,75498,6.537,True,Marketing
711,Karen,Female,1/9/2014,10:09 PM,46478,16.552,False,Engineering
...,...,...,...,...,...,...,...,...
358,Scott,Male,12/17/2011,3:45 AM,90429,4.450,False,Product
813,Evelyn,Female,2/10/2002,4:44 AM,123621,19.767,True,Marketing
188,Charles,Male,10/14/2000,9:40 PM,71749,15.931,False,Legal
533,Earl,Male,2/11/2014,9:03 PM,52620,13.773,False,Product


Comparing parameters estimated from a sample with the same parameters computed from the whole dataset.

In [21]:
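# Draw a single 20% sample so that both statistics describe the same rows
sample = df.sample(frac=0.2)
# numeric_only=True restricts the statistics to the numeric columns
estimated_m = sample.mean(axis=0, skipna=True, numeric_only=True)
estimated_std = sample.std(axis=0, skipna=True, numeric_only=True)
print('Mean')
print(estimated_m)
print('--------')
print('Standard deviation')
print(estimated_std)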

Salary     90140.69500
Bonus %       10.63084
dtype: float64
--------
Salary     34158.886893
Bonus %        5.457243
dtype: float64


In [23]:
# Same statistics computed from the whole data frame, for comparison
true_m = df.mean(axis=0, skipna=True, numeric_only=True)
true_std = df.std(axis=0, skipna=True, numeric_only=True)
print('Mean')
print(true_m)
print('--------')
print('Standard deviation')
print(true_std)

Mean
Salary     90662.181000
Bonus %       10.207555
dtype: float64
--------
Standard deviation
Salary     32923.693342
Bonus %        5.528481
dtype: float64
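A rough way to judge whether the gap between the sampled and the full-data mean is plausible is the standard error of the mean, `std / sqrt(n)`. The sketch below plugs in the Salary values printed above. The sample size `n = 200` is an assumption (20% of roughly 1000 rows), and the finite-population correction is ignored, so treat the result as an order-of-magnitude check only.

```python
import math

full_mean = 90662.181000    # Salary mean over the whole data frame (output above)
full_std = 32923.693342     # Salary standard deviation over the whole data frame
sample_mean = 90140.69500   # Salary mean from the 20% sample (output above)
n = 200                     # ASSUMPTION: 20% of roughly 1000 rows

standard_error = full_std / math.sqrt(n)
gap = abs(full_mean - sample_mean)
gap_in_se = gap / standard_error

print(f"standard error of the mean: {standard_error:.0f}")
print(f"observed gap:               {gap:.0f}")
print(f"gap in standard errors:     {gap_in_se:.2f}")
```

Under these assumptions the observed gap is well below one standard error, so the sampled estimate is consistent with the full-data value.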


## Rebinning data

HyperSpy's `rebin()` method rebins the data to an arbitrary new shape, as long as the number of dimensions stays the same.
It uses one of two algorithms:

- If the new shape dimensions are divisors of the old shape’s, the operation supports lazy computation and is usually faster.
- Otherwise, the operation requires linear interpolation and is generally slower.
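The divisor case is cheap because neighbouring elements can simply be grouped. The following is a minimal NumPy sketch of that idea for a 1D signal, assuming the bins are combined by summation. It is not HyperSpy's implementation, and `rebin_1d_sum` is a hypothetical helper.

```python
import numpy as np

def rebin_1d_sum(data, factor):
    """Sum every `factor` consecutive elements of a 1D array (hypothetical helper)."""
    if data.size % factor != 0:
        raise ValueError("factor must divide the length of the data")
    # Reshape into rows of `factor` elements, then collapse each row
    return data.reshape(-1, factor).sum(axis=1)

spectrum = np.arange(16, dtype=float)
binned = rebin_1d_sum(spectrum, 4)

print(binned)                          # 4 values instead of 16
print(spectrum.nbytes, binned.nbytes)  # memory shrinks by the same factor
```

Because the grouping is a reshape followed by a reduction, it maps naturally onto chunked arrays, whereas a non-integer scale needs interpolation between neighbouring bins.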

In [27]:
import hyperspy.api as hs
s = hs.datasets.example_signals.EDS_SEM_Spectrum().as_lazy()
s.plot()

In [28]:
s_bin = s.rebin(scale=[4])
s_bin.plot()

In [32]:
# Check the number of bytes for each object
s.data.nbytes, s_bin.data.nbytes

(4096, 1024)