# ASCAT data loading
#### Data reading support (research of Hans Lievens)

In [14]:
import os  # needed for os.SEEK_SET further down

import numpy as np

### Received information about dataset (email-conversation)

The files were provided in a folder (via FTP):

In [15]:
!ls ./ASCAT/

1322.dat  1322.idx  1323.dat  1323.idx	1324.dat  1324.idx  DGGv02.1_CPv02.nc


### First draft reading of the binary data

In the following section, I'll test the reading for the following dataset/file:

In [16]:
filename = "1323"

#### INDEX FILE

The data-type structure was provided, apparently already in numpy dtype notation:

In [17]:
struct1 = np.dtype([('gpi', np.int32)])

We have to skip the first 208 bytes to avoid reading the header, so I checked on Stack Overflow how to do this: http://stackoverflow.com/questions/14245094/how-to-read-part-of-binary-file-with-numpy

In [20]:
with open("".join(["./ASCAT/", filename, ".idx"]), "rb") as f:  # open the index file
    f.seek(208, os.SEEK_SET)  # skip the 208-byte header
    # astype added to make sure the result is converted to a plain int32 array
    idx_data = np.fromfile(f, dtype=struct1).astype('int32')
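
The header skip can be tried out without the real files. The following is a minimal sketch on a synthetic stand-in file (the file name `demo.idx` and its contents are made up for illustration). It shows the `seek` approach used above next to the `offset` keyword of `np.fromfile`, which is available in NumPy 1.17 and later.

```python
import os
import tempfile

import numpy as np

HEADER_BYTES = 208  # header length stated in the text
record = np.dtype([("gpi", np.int32)])

# Build a small stand-in file: a dummy header followed by int32 values.
expected = np.array([2301795, 2301795, 2301797], dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), "demo.idx")
with open(path, "wb") as f:
    f.write(b"\x00" * HEADER_BYTES)
    expected.tofile(f)

# Option 1: seek past the header on an open file handle.
with open(path, "rb") as f:
    f.seek(HEADER_BYTES, os.SEEK_SET)
    via_seek = np.fromfile(f, dtype=record)

# Option 2: let numpy skip the header itself (NumPy >= 1.17).
via_offset = np.fromfile(path, dtype=record, offset=HEADER_BYTES)

print(via_seek["gpi"])
print(via_offset["gpi"])
```

Both options should return the same three values; selecting the `gpi` field gives a plain `int32` array.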

Checking the data:

In [28]:
idx_data.shape # evaluate size before I print it...

(4488295,)

In [42]:
idx_data[:50]

array([2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795, 2301795, 2301795, 2301795, 2301795, 2301795, 2301795,
       2301795], dtype=int32)

This gives a list of numbers, as described in the email. The records of a grid cell are clustered together over a run of consecutive rows. Let's see what the unique values look like:

In [7]:
np.unique(idx_data)

array([2301795, 2301797, 2301801, ..., 2493005, 2493009, 2493013], dtype=int32)

In [8]:
np.unique(idx_data).size

1379

This data set contains 1379 different grid points that can be selected.
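
To see how many records each grid point has and where its run starts, `np.unique` can return the first index and the count along with the values. A small sketch on a made-up index array (`idx_demo` is a stand-in for `idx_data`):

```python
import numpy as np

# Stand-in for idx_data: grid point IDs stored in contiguous runs.
idx_demo = np.array([2301795] * 4 + [2301797] * 2 + [2301801] * 3, dtype=np.int32)

grid_points, first_row, counts = np.unique(
    idx_demo, return_index=True, return_counts=True
)
for gp, start, n in zip(grid_points, first_row, counts):
    print("grid point {}: {} records starting at row {}".format(gp, n, start))
```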

#### DAT-FILE itself

Reading the file, using the provided data type and skipping the 208-byte header:

In [30]:
struct2 = np.dtype([('jd', np.double),  ('sig', np.float32), ('sig_noise', np.float32), 
                   ('dir', np.dtype('S1')),    ('pdb',np.ubyte), ('azcorr_flag', 
                                                                  np.dtype([('f', np.ubyte),  
                                                                            ('m',np.ubyte),  
                                                                            ('a', np.ubyte)]))])

In [31]:
with open(''.join(["./ASCAT/", filename, ".dat"]), "rb") as f:  # open the data file
    f.seek(208, os.SEEK_SET)  # skip the 208-byte header
    data = np.fromfile(f, dtype=struct2)
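
A quick way to gain confidence in the record layout is to check that the file size equals the header plus a whole number of records. The sketch below assumes the dtype above is stored packed (no padding), which is numpy's default; `n_records` is a hypothetical helper, not part of the provided material.

```python
import numpy as np

HEADER_BYTES = 208
record = np.dtype([
    ("jd", np.double), ("sig", np.float32), ("sig_noise", np.float32),
    ("dir", "S1"), ("pdb", np.ubyte),
    ("azcorr_flag", np.dtype([("f", np.ubyte), ("m", np.ubyte), ("a", np.ubyte)])),
])

def n_records(file_size, dtype, header_bytes=HEADER_BYTES):
    """Return the number of whole records, or raise if bytes are left over."""
    payload = file_size - header_bytes
    count, remainder = divmod(payload, dtype.itemsize)
    if payload < 0 or remainder:
        raise ValueError("file size does not match header + whole records")
    return count

print(record.itemsize)                  # 21 bytes per record
print(n_records(208 + 21 * 3, record))  # 3
```

On the real data this could be called as `n_records(os.path.getsize("./ASCAT/1323.dat"), struct2)`. If every row of the index file corresponds to one row of the data file, the result should equal `idx_data.size`.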

In [32]:
data

array([ (2454102.8887586812, -9.16275691986084, 0.08637592941522598, 'D', 54, (2, 2, 2)),
       (2454103.9439236117, -9.11705493927002, 0.08362860232591629, 'D', 54, (2, 2, 2)),
       (2454104.3495876626, -9.267045021057129, 0.08337954431772232, 'A', 54, (2, 2, 2)),
       ...,
       (2457022.369726563, -8.836539268493652, 0.07853401452302933, 'A', 54, (2, 2, 2)),
       (2457022.8782986, -8.78021240234375, 0.08376690745353699, 'D', 54, (2, 2, 2)),
       (2457022.947786447, -8.758130073547363, 0.08410192281007767, 'D', 54, (2, 2, 2))], 
      dtype=[('jd', '<f8'), ('sig', '<f4'), ('sig_noise', '<f4'), ('dir', 'S1'), ('pdb', 'u1'), ('azcorr_flag', [('f', 'u1'), ('m', 'u1'), ('a', 'u1')])])

A single element:

In [33]:
data[1000]

(2455067.402756065, -9.999258995056152, 0.08587104082107544, 'A', 54, (2, 2, 2))

As suggested by Hans, I checked that the second value lies between -30 and 0, since it is a backscatter value. The first value (`jd`) looks like a date sequence... OK!
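
Both checks can be written down explicitly. The sketch below assumes that `jd` holds Julian dates (suggested by the field name, not confirmed in the email) on a UTC-like time scale, and uses the fact that Julian date 2440587.5 corresponds to 1970-01-01 00:00. The helper names are made up for illustration.

```python
import numpy as np

JD_UNIX_EPOCH = 2440587.5  # Julian date of 1970-01-01T00:00:00

def jd_to_datetime64(jd):
    """Convert an array of Julian dates to numpy datetime64 values (1 s resolution)."""
    days = np.asarray(jd, dtype=np.float64) - JD_UNIX_EPOCH
    seconds = np.round(days * 86400.0).astype(np.int64)
    return np.datetime64("1970-01-01T00:00:00") + seconds.astype("timedelta64[s]")

def backscatter_in_range(sig, low=-30.0, high=0.0):
    """True when every backscatter value lies within [low, high]."""
    sig = np.asarray(sig)
    return bool(np.all((sig >= low) & (sig <= high)))

# First and last records of the array printed above.
jd = np.array([2454102.8887586812, 2457022.947786447])
sig = np.array([-9.16275691986084, -8.758130073547363], dtype=np.float32)

print(jd_to_datetime64(jd))
print(backscatter_in_range(sig))  # True
```

Under these assumptions the first record falls on 2 January 2007.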


### Extract data from a gridpoint

Following the email, these are the steps to take:
* Decide about a number/grid point to select
* Check for which rows this number is in the file and save these indices
* Select the rows from the .dat file based on the saved indices

Let's try it out, based on the *idx_data* and the *data* arrays:

In [35]:
# Decide about a grid point
grid_point = 2301829

In [36]:
# select rows/position for the given grid point
indxs_point = np.where(idx_data == grid_point)

In [37]:
# Get these rows out of the data-array
my_selection = data[indxs_point]

In [38]:
my_selection

array([ (2454102.8887369796, -9.05182933807373, 0.0817079022526741, 'D', 54, (2, 2, 2)),
       (2454103.943836806, -9.10305404663086, 0.08119107037782669, 'D', 54, (2, 2, 2)),
       (2454104.34952257, -9.057757377624512, 0.07795611023902893, 'A', 54, (2, 2, 2)),
       ...,
       (2457021.893793403, -9.648776054382324, 0.07921148091554642, 'D', 54, (2, 2, 2)),
       (2457022.9488281254, -9.93057918548584, 0.08341851830482483, 'D', 54, (2, 2, 2)),
       (2457023.3545138896, -10.14028549194336, 0.07756844907999039, 'A', 54, (2, 2, 2))], 
      dtype=[('jd', '<f8'), ('sig', '<f4'), ('sig_noise', '<f4'), ('dir', 'S1'), ('pdb', 'u1'), ('azcorr_flag', [('f', 'u1'), ('m', 'u1'), ('a', 'u1')])])
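
The same three steps can be exercised on small synthetic arrays, which also makes the underlying assumption visible: row *i* of the index array must describe row *i* of the data array. The arrays below are invented for illustration and use a reduced dtype.

```python
import numpy as np

idx_demo = np.array([10, 10, 11, 11, 11, 12], dtype=np.int32)
data_demo = np.zeros(idx_demo.size, dtype=[("jd", "f8"), ("sig", "f4")])
data_demo["jd"] = 2454102.5 + np.arange(idx_demo.size)
data_demo["sig"] = [-9.0, -9.5, -10.0, -10.5, -11.0, -11.5]

# The selection only makes sense if both arrays describe the same rows.
assert idx_demo.shape == data_demo.shape

mask = idx_demo == 11            # boolean mask over the rows
selection = data_demo[mask]      # same rows as data_demo[np.where(mask)]

print(selection["sig"])          # a single field of the selected records
print(selection["sig"].mean())
```

Indexing with the boolean mask directly avoids the intermediate index tuple that `np.where` returns, and individual fields of the structured array are available by name.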

Hans works in Matlab, so I'd like to provide the necessary data as a .mat file. Exporting to .mat files is available through the `scipy.io` module:

In [17]:
import scipy.io as sio

Some testing on how to export it: storing different grid points in a single file is possible, but maybe not practical for >1000 grid points...

In [19]:
sio.savemat("data_hans", {''.join(['grid_point1_',str(grid_point)]):my_selection, ''.join(['grid_point2_',str(grid_point)]):my_selection})

This works and creates a Matlab file that can be *load*ed by Matlab as a structure. OK!

## Bring this together in a single functionality

The steps above are the building blocks I needed. However, I'd like to turn them into a reusable functionality that is available:
* within Python, in case I need it as a function within another workflow
* from the command line, so I can quickly run it from cmd when new data arrives

Having learnt all the things above, I wrote a function that performs this analysis, in a file called *convert_ascat.py*.

#### FUNCTION

In [43]:
from convert_ascat import convert_ascat_to_matlab

So, let's test the functionality as a function...

**Extracting a single grid point, by naming the ID**

In [40]:
my_data = convert_ascat_to_matlab("./ASCAT/1323", grid_point_id=2301795, byte2skip=208)

data loaded
working on grid point 2301795


**Extracting all grid points, all in separate files**

In [45]:
all_data = convert_ascat_to_matlab("./ASCAT/1323", 
                                   grid_point_id='all', 
                                   byte2skip=208)
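
The contents of *convert_ascat.py* are not shown here. As an illustration only, the sketch below shows how a reader with a similar call signature could be organised. The name `read_ascat_grid_points`, its return value (a dictionary of record arrays) and the sort-based grouping for `'all'` are assumptions of this sketch, not the actual implementation, and the export step is left out.

```python
import os
import tempfile

import numpy as np

IDX_DTYPE = np.dtype([("gpi", np.int32)])
DAT_DTYPE = np.dtype([
    ("jd", np.double), ("sig", np.float32), ("sig_noise", np.float32),
    ("dir", "S1"), ("pdb", np.ubyte),
    ("azcorr_flag", np.dtype([("f", np.ubyte), ("m", np.ubyte), ("a", np.ubyte)])),
])

def _read(path, dtype, byte2skip):
    """Read fixed-size records from a binary file, skipping the header."""
    with open(path, "rb") as f:
        f.seek(byte2skip, os.SEEK_SET)
        return np.fromfile(f, dtype=dtype)

def read_ascat_grid_points(basename, grid_point_id="all", byte2skip=208):
    """Hypothetical helper: return {grid point id: records} for one or all points."""
    gpi = _read(basename + ".idx", IDX_DTYPE, byte2skip)["gpi"]
    data = _read(basename + ".dat", DAT_DTYPE, byte2skip)
    if gpi.size != data.size:
        raise ValueError("index and data files have different record counts")
    if grid_point_id != "all":
        return {int(grid_point_id): data[gpi == int(grid_point_id)]}
    # Sort once and split at the run boundaries instead of scanning per point.
    order = np.argsort(gpi, kind="stable")
    ids, starts = np.unique(gpi[order], return_index=True)
    return {int(i): data[rows] for i, rows in zip(ids, np.split(order, starts[1:]))}

# Synthetic stand-in files so the sketch runs without the real data.
base = os.path.join(tempfile.mkdtemp(), "demo")
gpi_demo = np.array([7, 7, 9, 7], dtype=np.int32)
dat_demo = np.zeros(4, dtype=DAT_DTYPE)
dat_demo["sig"] = [-9.0, -9.5, -10.0, -10.5]
for ext, arr in ((".idx", gpi_demo), (".dat", dat_demo)):
    with open(base + ext, "wb") as f:
        f.write(b"\x00" * 208)
        arr.tofile(f)

result = read_ascat_grid_points(base)
print(sorted(result))    # [7, 9]
print(result[7]["sig"])
```

Each value in the returned dictionary could then be written to its own .mat file with `scipy.io.savemat`, as tested earlier. Sorting once keeps the `'all'` case from rescanning the full index array for each of the grid points.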

#### CMD-tool

The same file, *convert_ascat.py*, can also be run from the command line, with the file name and the grid point as arguments; just make sure the Python file is executable.

The dependencies are numpy and scipy. The easiest way to get them is to install Anaconda, https://www.continuum.io/downloads, which provides these libraries together with some other useful ones.

Another option is to install python/numpy/scipy manually, or to use Miniconda and compose your own set of libraries: install http://conda.pydata.org/miniconda.html and create an environment by running the following command on the command line:

        conda create -n myenvname python=2.7 numpy scipy

To use it from the command line:

            python convert_ascat.py ./ASCAT/1323 2301795

for a specific grid point, or

            python convert_ascat.py ./ASCAT/1322 all

to convert all of the grid points to .mat files (the latter can take a while ;-).
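
For reference, a command-line entry point with these two positional arguments could be sketched with `argparse` as below. This is an assumption about the structure, not the actual script: the `--byte2skip` option is hypothetical, and the call to the conversion function is replaced by a print statement so the sketch stays self-contained.

```python
import argparse

def parse_args(argv=None):
    """Parse 'basename grid_point' as used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Convert ASCAT binary data to .mat files."
    )
    parser.add_argument("basename", help="path without extension, e.g. ./ASCAT/1323")
    parser.add_argument("grid_point", help="grid point ID, or 'all'")
    # Hypothetical option, not taken from the original script.
    parser.add_argument("--byte2skip", type=int, default=208,
                        help="header length in bytes")
    args = parser.parse_args(argv)
    if args.grid_point != "all":
        args.grid_point = int(args.grid_point)
    return args

def main(argv=None):
    args = parse_args(argv)
    # A real script would call the conversion function here; this sketch only reports.
    print("would convert", args.basename, "grid point", args.grid_point)
    return args

# In a script, the last lines would be:
#     if __name__ == "__main__":
#         main()
main(["./ASCAT/1323", "2301795"])
```

Passing `argv` explicitly keeps the parser testable from within Python, while `main()` without arguments falls back to `sys.argv`.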

### Conclusion

With only a small extra effort on top of the analysis of *how to read the binary data*, I added a small **functionality** to my library of available tools. Most of the code just puts the pieces together, and...

* creating a **function** with defined input/outputs instead of just some code-lines in a script
* using the **main()** functionality that Python offers to make it available from the command line
* adding **documentation**

Good luck with it, and let me know if you get stuck anywhere,

Kind regards,

Stijn