In [2]:
%matplotlib qt

# Dealing with big data
This notebook contains demos for the discussion on how to work with large data files.

## Dask package
Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms.

Simply put: distributed NumPy.

- Parallel: Uses all of the cores on your computer
- Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
- Blocked Algorithms: Perform large computations by performing many smaller computations

This demo was adapted from [dask tutorials](https://github.com/dask/dask-tutorial/blob/master/03_array.ipynb).
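
To make the idea of a blocked algorithm concrete, here is a minimal NumPy-only sketch (it is not how dask is implemented internally): the mean of an array is obtained from the partial sums of smaller chunks. The array size, chunk size and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative data: one million normally distributed values
rng = np.random.default_rng(0)
data = rng.normal(10, 0.1, size=1_000_000)

# Process the array in chunks of 100,000 values, one chunk at a time
chunk_size = 100_000
partial_sums = [
    data[start:start + chunk_size].sum()
    for start in range(0, data.size, chunk_size)
]

# Combine the small results into the final answer
blocked_mean = sum(partial_sums) / data.size
print(blocked_mean, data.mean())
```

Each chunk only needs to be in memory while its partial sum is computed, which is what allows this pattern to scale to arrays that do not fit in memory at once.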

In [4]:
import numpy as np
import dask.array as da
example = da.random.normal(10, 0.1, size=(10000, 10000), chunks=(500, 500))
example

Unnamed: 0,Array,Chunk
Bytes,800.00 MB,2.00 MB
Shape,"(10000, 10000)","(500, 500)"
Count,400 Tasks,400 Chunks
Type,float64,numpy.ndarray


### Example:

- Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks
- Take the mean along one axis
- Take every 100th element

**NOTE: Show task manager memory profile**
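
Before running the example, it can help to estimate how much memory the array needs. The short calculation below is a sketch using only the shape, chunk size and `float64` dtype stated in the example; the variable names are illustrative.

```python
import numpy as np

shape = (20000, 20000)
chunks = (1000, 1000)
itemsize = np.dtype("float64").itemsize  # 8 bytes per value

# Memory needed to hold the full array at once
total_gb = shape[0] * shape[1] * itemsize / 1e9

# Memory needed for a single chunk, and the number of chunks
chunk_mb = chunks[0] * chunks[1] * itemsize / 1e6
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])

print(f"Full array: {total_gb} GB")
print(f"One chunk: {chunk_mb} MB ({n_chunks} chunks)")
```

The full array takes 3.2 GB, whereas each of the 400 chunks takes only 8 MB.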

In [5]:
import numpy as np

In [6]:
%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]

Wall time: 31 s


**Note: Restart the Jupyter kernel** to free the memory, OR **use `%reset_selective` to delete the variables**

IPython has a set of [built-in magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), one of which deletes variables from the namespace.
However, memory handling in Python is not that simple: deleting a name does not guarantee that the memory is returned to the operating system straight away.

In [7]:
# The -f param will force resetting
%reset_selective -f x
%reset_selective -f y

In [None]:
# Alternatively (instead of the %reset_selective cell above),
# delete the names and run the garbage collector from the `gc` module
import gc

del x, y
gc.collect()

Now repeat with dask instead:

In [8]:
import dask.array as da

In [9]:
%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y = y.compute()

Wall time: 5.41 s


Memory difference:

<img src="memory_diff.png" width=400 />

## Sampling with pandas

The pandas `sample()` method returns a random sample of rows (or columns) from the data frame it is called on.

In [13]:
# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("dummy_data.csv")
df
#df.head(20)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


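Or load only part of the file, using `skiprows`. Here the callable skips every line whose index is not a multiple of `n`, so the header (line 0) and every 100th line are kept: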

In [12]:
n = 100
dfn = pd.read_csv("dummy_data.csv", skiprows=(lambda i: i % n !=0))
dfn

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Harold,Male,1/2/1985,9:40 PM,77544,12.447,False,Business Development
1,Jonathan,Male,7/17/2009,8:15 AM,130581,16.736,True,
2,Emily,Female,1/13/1988,6:42 AM,36711,19.028,True,Human Resources
3,Kathryn,Female,6/9/1988,9:29 AM,86439,7.799,False,Finance
4,Barbara,,12/10/1980,11:04 PM,90187,14.764,True,Distribution
5,,Female,10/11/1990,10:57 PM,98385,10.925,,Human Resources
6,Amy,,5/19/1984,11:47 AM,102839,10.385,True,Distribution
7,Raymond,Male,12/12/1986,12:18 PM,47529,2.712,True,Product
8,Walter,Male,5/21/1992,12:39 AM,144701,16.323,True,Marketing
9,Albert,Male,5/15/2012,6:24 PM,129949,10.169,True,Sales
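
To see which lines a `skiprows` callable keeps, the sketch below applies the same pattern to a small in-memory CSV. The column name `value`, the 20 rows and `n = 5` are made up for illustration.

```python
import io
import pandas as pd

# Toy CSV: a header line followed by the values 1..20,
# so the value on each line equals its line index in the file
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 21))

n = 5
# Line 0 (the header) and lines 5, 10, 15, 20 are kept; all others are skipped
subset = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % n != 0)
print(subset["value"].tolist())
```

Because skipped lines are never turned into rows of the data frame, this reduces the memory used compared with loading everything and slicing afterwards.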


Randomly selecting n rows:

In [16]:
# Sample only 5 data points
df.sample(n = 5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
818,Ann,Female,10/3/1980,1:08 AM,96941,10.048,True,Distribution
534,Gerald,,1/8/1992,3:50 PM,133366,12.292,False,Legal
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
424,Matthew,,6/9/2003,7:35 AM,79443,14.637,False,Human Resources
246,Fred,,12/2/1984,2:03 PM,59937,12.045,True,Human Resources


In [17]:
# Sample 10% of the whole dataset
df.sample(frac=0.1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
761,Jennifer,Female,3/31/2015,7:43 PM,132084,10.006,True,Engineering
42,Beverly,Female,9/9/1998,8:26 PM,121918,15.835,False,Legal
143,Teresa,,1/28/2016,10:55 AM,140013,8.689,True,Engineering
417,Sarah,,8/31/1981,2:51 PM,37748,9.047,False,Human Resources
445,Chris,Male,12/12/2006,1:57 AM,71642,1.496,False,
...,...,...,...,...,...,...,...,...
484,Joe,Male,4/14/2000,3:14 PM,50645,11.119,False,Marketing
717,Jason,,4/15/1988,8:16 PM,97480,11.518,False,Human Resources
593,Marie,Female,11/23/2000,5:06 AM,125574,4.644,False,Sales
818,Ann,Female,10/3/1980,1:08 AM,96941,10.048,True,Distribution
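
Each call to `sample()` draws different rows. If a reproducible sample is needed, the `random_state` parameter can be fixed, as in this small sketch. The toy data frame and the seed value 42 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data frame with 100 rows
toy = pd.DataFrame({"id": np.arange(100), "value": np.arange(100) * 2.0})

# The same seed gives the same rows every time
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)
print(first.equals(second))
```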


Comparing parameters estimated from a sample with the values computed on the full dataset:

In [18]:
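# Draw one 20% sample and estimate both statistics from it (numeric columns only)
sample = df.sample(frac=0.2)
estimated_m = sample.mean(axis=0, skipna=True, numeric_only=True)
estimated_std = sample.std(axis=0, skipna=True, numeric_only=True)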
print('Mean')
print(estimated_m)
print('--------')
print('Standard deviation')
print(estimated_std)

Mean
Salary     91224.2600
Bonus %       10.5919
dtype: float64
--------
Standard deviation
Salary     34425.705676
Bonus %        5.297754
dtype: float64


In [19]:
# Same statistics computed on the full dataset, for comparison
true_m = df.mean(axis=0, skipna=True, numeric_only=True)
true_std = df.std(axis=0, skipna=True, numeric_only=True)
print('Mean')
print(true_m)
print('--------')
print('Standard deviation')
print(true_std)

Mean
Salary     90662.181000
Bonus %       10.207555
dtype: float64
--------
Standard deviation
Salary     32923.693342
Bonus %        5.528481
dtype: float64
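
How far a sample mean is expected to lie from the full-data mean can be gauged with the standard error of the mean, the sample standard deviation divided by the square root of the sample size. The sketch below uses synthetic salaries (an assumption, not the CSV file above) to show the calculation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a salary column (values are made up)
rng = np.random.default_rng(0)
salaries = pd.Series(rng.normal(90_000, 30_000, size=1000), name="Salary")

# Estimate the mean from a 20% sample
sample = salaries.sample(frac=0.2, random_state=0)
estimate = sample.mean()

# Standard error of the mean: sample std / sqrt(sample size)
sem = sample.std() / np.sqrt(len(sample))

print(f"Estimated mean: {estimate:.0f} +/- {sem:.0f}")
print(f"Full-data mean: {salaries.mean():.0f}")
```

The standard error shrinks with the square root of the sample size, so quadrupling the sample only halves the uncertainty.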


## Rebinning data

HyperSpy's `rebin()` method supports rebinning the data to arbitrary new shapes as long as the number of dimensions stays the same.
It can use two different algorithms:

- If the new shape dimensions are divisors of the old shape's, the operation supports lazy evaluation and is usually faster.
- Otherwise, the operation requires linear interpolation and is generally slower.

In [10]:
import hyperspy.api as hs
s = hs.datasets.example_signals.EDS_SEM_Spectrum().as_lazy()
s.plot()



In [11]:
s_bin = s.rebin(scale=[4])
s_bin.plot()

In [12]:
# Check the number of bytes for each object
s.data.nbytes, s_bin.data.nbytes

(4096, 1024)
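
The divisor case can be sketched in plain NumPy, assuming that binning means summing groups of neighbouring channels. This is an illustration of the idea, not HyperSpy's implementation; the 16-channel toy spectrum is made up.

```python
import numpy as np

# Toy spectrum with 16 channels
spectrum = np.arange(16, dtype=float)
scale = 4  # the channel count must be divisible by the scale

# Group every `scale` neighbouring channels into a row and sum each row
rebinned = spectrum.reshape(-1, scale).sum(axis=1)

print(rebinned.tolist())
print(spectrum.nbytes, rebinned.nbytes)
```

The total signal is preserved while the array shrinks by the binning factor, here from 128 to 32 bytes.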

# Managing files in batch

This is mainly done with the `os` and `glob` modules.

In [11]:
import os, glob

data = []
base_root = r"C:\Users\jf631\Documents\GitHub\jordiferrero\nanoDTC\python_demo_notebooks\big_data_demos"
path = "folder_to_batch_process"

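# Absolute path to the folder; pass this to os.walk() to collect absolute file paths
directory = os.path.join(base_root, path)

# Walking the relative `path` yields paths relative to the current working directory
for root, dirs, files in os.walk(path):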
    for file in files:
        if file.endswith('.txt'):
            data.append(os.path.join(root, file))

data.sort()
data

['folder_to_batch_process\\scan_01.txt',
 'folder_to_batch_process\\scan_02.txt',
 'folder_to_batch_process\\scan_03.txt',
 'folder_to_batch_process\\scan_04.txt',
 'folder_to_batch_process\\scan_05.txt']

In [4]:
for fname in data[:1]:
    print(fname)
    print(os.path.basename(fname))
    print(os.path.dirname(fname))

C:\Users\jf631\Documents\GitHub\jordiferrero\nanoDTC\python_demo_notebooks\big_data_demos\folder_to_batch_process\scan_01.txt
scan_01.txt
C:\Users\jf631\Documents\GitHub\jordiferrero\nanoDTC\python_demo_notebooks\big_data_demos\folder_to_batch_process


You can also use the `glob` module in conjunction with `os`.

In [13]:
base_root = r"C:\Users\jf631\Documents\GitHub\jordiferrero\nanoDTC\python_demo_notebooks\big_data_demos"
folder = "folder_to_batch_process"
# Note we use an asterisk
endswith_str = '*.txt' # You can also use asterisk for folders e.g. `*/*.txt`

data = sorted(glob.glob(os.path.join(base_root, folder, endswith_str)))
data


['C:\\Users\\jf631\\Documents\\GitHub\\jordiferrero\\nanoDTC\\python_demo_notebooks\\big_data_demos\\folder_to_batch_process\\scan_01.txt',
 'C:\\Users\\jf631\\Documents\\GitHub\\jordiferrero\\nanoDTC\\python_demo_notebooks\\big_data_demos\\folder_to_batch_process\\scan_02.txt',
 'C:\\Users\\jf631\\Documents\\GitHub\\jordiferrero\\nanoDTC\\python_demo_notebooks\\big_data_demos\\folder_to_batch_process\\scan_03.txt',
 'C:\\Users\\jf631\\Documents\\GitHub\\jordiferrero\\nanoDTC\\python_demo_notebooks\\big_data_demos\\folder_to_batch_process\\scan_04.txt',
 'C:\\Users\\jf631\\Documents\\GitHub\\jordiferrero\\nanoDTC\\python_demo_notebooks\\big_data_demos\\folder_to_batch_process\\scan_05.txt']
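
Once the file list is built, a batch job is usually a loop over it. The sketch below shows the pattern with `pathlib`, an alternative to `os` and `glob` from the standard library. It creates its own temporary folder and toy `scan_XX.txt` files, so the file names and contents are illustrative assumptions.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)

    # Create three toy scan files and one file that should be ignored
    for i in range(1, 4):
        (folder / f"scan_{i:02d}.txt").write_text(f"{i}\n{i * 10}\n")
    (folder / "notes.md").write_text("ignored")

    # Collect the .txt files in a stable order
    files = sorted(folder.glob("*.txt"))
    names = [f.name for f in files]

    # Process each file in turn: here, sum the numbers it contains
    totals = {
        f.stem: sum(float(value) for value in f.read_text().split())
        for f in files
    }

print(names)
print(totals)
```

Keying the results by `f.stem` (the file name without its extension) keeps track of which result came from which file.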