# Import of external data into `numpy`-arrays
Beginners often find it difficult to import scientific data from external sources into `numpy`-arrays.

## Small data sets in tabulated `ascii`-files
We have already imported small data volumes from text files into `numpy`-arrays with the `loadtxt`-function. We repeat it here for completeness.

In [None]:
!cat data/Cobe.txt

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# text files can be imported with loadtxt.
# Pay attention to data-types if necessary
data = np.loadtxt('data/Cobe.txt', dtype=np.float32)

# give meaningful names to individual data columns
x = data[:,0]
y = data[:,1]
# note that data points and errors come in different units!
error = data[:,2] / 100.

# plot the data with error bars
plt.errorbar(x, y, error, fmt='o')

**Note:** It is similarly easy to read `ascii`-data in formats other than simple rows and columns, e.g. `comma-separated-value (csv)` files - see the options of `np.loadtxt`.
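
As a minimal sketch of the `csv` case: the `delimiter` option of `np.loadtxt` sets the column separator. The file content below is made up for illustration and is read from a string instead of a real file, so that the example is self-contained.

```python
import io
import numpy as np

# hypothetical csv content (not the Cobe data); lines starting
# with '#' are treated as comments by loadtxt
csv_text = """# x,y,error
1.0,2.5,10.0
2.0,3.5,20.0
3.0,4.5,30.0
"""

# delimiter=',' splits the columns at commas;
# unpack=True returns one array per column
x, y, error = np.loadtxt(io.StringIO(csv_text), delimiter=',', unpack=True)

print(x, y, error)
```

With a real file, you would pass the file name instead of the `io.StringIO` object.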
 
This approach is no longer optimal for *large* data sets with many columns. We will have a look at better options to represent data from text-files in the [astropy-tables notebook](02_astropy_intro_and_tables.ipynb).

## `ASCII` files and binary files
- Advantages of `ascii`-files:
  - Human readable
  - Can be inspected / edited with a text editor
  - Can be inspected / manipulated with *Unix*-tools (`awk` etc.)
  - Can be imported and exported easily into nearly every application working with data tables
  - The meaning of different columns is typically *clear* - no need to store sophisticated *meta-data*
  - You do not need to know the internal data structure (data-types etc.) with which the data were written
- Disadvantages of `ascii`-files:
  - One file can only represent *one* data-table if you want to preserve all the advantages
  - Difficulties with missing data (problems with Unix-tools, need special characters etc.)
  - *Much* larger (disk-space) than binary-data (each character is a byte!)
  - Very slow to read and write
  
The `ASCII`-format is optimal for small amounts of *homogeneous* data but not practical for very large data sets.  
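
The size difference can be checked with a small experiment. The following sketch writes the same (randomly generated, purely illustrative) array once as text and once in binary form into a temporary directory and compares the file sizes. The exact numbers depend on the text format that is used; here we take the default of `np.savetxt`.

```python
import os
import tempfile
import numpy as np

# illustrative test data: 10000 float64 values (8 bytes each in memory)
rng = np.random.default_rng(0)
values = rng.random(10000)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, 'values.txt')
    npy_path = os.path.join(tmp, 'values.npy')

    # same data, once as text and once as binary
    np.savetxt(txt_path, values)
    np.save(npy_path, values)

    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)

print('text file:  ', txt_size, 'bytes')
print('binary file:', npy_size, 'bytes')
```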

## The `numpy`-binary format
`Numpy` supports an easy-to-use binary format which is ideal for storing `numpy`-arrays on disk and reading them
later. You already saw this format in project 5.

In [None]:
import numpy as np

data = np.loadtxt('data/Cobe.txt', dtype=np.float32)

# numpy-arrays can be stored in their own binary-format
# (file-ending .npy)
np.save('data/Cobe.npy', data)

In [None]:
# You can no longer easily look at the data outside of Python!
!ls -ltr data
!cat data/Cobe.npy

In [None]:
import numpy as np

# The .npy-format can be read into an array without any effort.
# Note however that 'meta data' (comments) cannot be stored with the data.
Cobe_data = np.load('data/Cobe.npy')

print(Cobe_data)
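
Unlike a text file, the `.npy`-format records the data-type and the shape of the array. The following sketch, with a made-up array and a temporary file, illustrates that both survive a save/load round trip.

```python
import os
import tempfile
import numpy as np

# illustrative array with a non-default data-type and a 2d-shape
original = np.arange(12, dtype=np.int16).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.npy')
    np.save(path, original)
    restored = np.load(path)

# data-type and shape are restored without any extra arguments
print(restored.dtype, restored.shape)
```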

## Raw binary-data from `C` or `Fortran` programs
Sometimes, you would like to import raw binary data created by a `C` or `Fortran` program to postprocess them with Python (to visualise simulation data, for instance).

The following `C`-program writes a float and an int-array to two files. We want to read them into `numpy`-arrays later:

```c
#include <stdio.h>

int main(void) {

  FILE *file;
  const int nx = 10;
  /* In C, float is float32 */
  float float_array[nx];
  /* in C, short int is int16 */
  short int int_array[nx];
  int i;

  for (i = 0; i < nx; i++) {
    float_array[i] = i * 1.5;
    int_array[i] = i * 2;
  }

  file = fopen("data/c-data_float.bin", "wb");
  fwrite(float_array, sizeof(float_array), 1, file);
  fclose(file);

  file = fopen("data/c-data_int.bin", "wb");
  fwrite(int_array, sizeof(int_array), 1, file);
  fclose(file);

  return 0;
}

```

In [None]:
# compile the code and run it. It will create the files
# data/c-data_float.bin and data/c-data_int.bin
!gcc -o code/float_int.exe code/float_int.c -lm
!./code/float_int.exe
!ls -ltr data

It is easy to read such files into `numpy`-arrays **IF** you know how they were written. You need to know the following *meta-data*:
- The type of the data (`float32`, `int16` etc.)
- The structure of the data (multidimensional arrays, shapes)
- (big endian or little endian representation of the data)
- (Fortran or C-ordering of multidimensional arrays)

If you ask somebody for help, it is easiest to show the code with which the data were written (along with information on the machine on which they were created)!
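
A raw binary file carries no information about its content, so `numpy` cannot warn you about a wrong data-type. The following sketch illustrates this with made-up data in a temporary file: reading `float32`-data as `int16` raises no error but silently yields twice as many meaningless numbers.

```python
import os
import tempfile
import numpy as np

# illustrative data: ten float32 values (4 bytes each)
written = (np.arange(10) * 1.5).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'raw.bin')
    # tofile writes the raw bytes without any header
    written.tofile(path)

    # correct data-type: we get back what was written
    correct = np.fromfile(path, dtype=np.float32)
    # wrong data-type: no error, but the 40 bytes are
    # interpreted as 20 two-byte integers
    wrong = np.fromfile(path, dtype=np.int16)

print(correct)
print(wrong)
```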

In [None]:
# script to read the binary C-data into numpy-arrays
# The reading can be done with the function np.fromfile
import numpy as np

# read the float-data
f = open('data/c-data_float.bin', 'rb')  # binary mode!

# we know that the data is float 32.
data_float = np.fromfile(f, dtype=np.float32)
f.close()

print(data_float)

# read the int-data
f = open('data/c-data_int.bin', 'rb')  # binary mode!

# we know that the data is int16
data_int = np.fromfile(f, dtype=np.int16)
f.close()

print(data_int)
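
The byte order and the array ordering from the list above can be handled with the data-type string and the `order` argument of `reshape`. The following is only a sketch under assumptions: the 2x3 array and the temporary file are invented for illustration, and the writing program is emulated with `numpy` itself. Note that `Fortran` unformatted sequential files usually contain additional record markers, which this sketch does not handle.

```python
import os
import tempfile
import numpy as np

# hypothetical 2x3 table of float32 values
table = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fortran_big_endian.bin')
    # emulate a program that writes big endian ('>f4') values
    # in Fortran (column-major) order
    table.astype('>f4').flatten(order='F').tofile(path)

    # '>f4' means big endian float32 ('<f4' would be little endian)
    flat = np.fromfile(path, dtype='>f4')
    # fromfile always returns a 1d-array; shape and ordering
    # must be restored by hand
    restored = flat.reshape((2, 3), order='F')

print(flat)
print(restored)
```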


## Data from standardized binary formats

There are **many** standards for storing scientific data (together with their meta-data) in binary form. In astronomy, the `FITS` (Flexible Image Transport System) format is very common. It can store images, tables and spectra and is supported by the `astropy.io.fits`-module.

Another common format is [`HDF` (Hierarchical Data Format)](https://de.wikipedia.org/wiki/Hierarchical_Data_Format) - beware of the different versions. The `HDF` version 4 format was used for the [Aqua](http://en.wikipedia.org/wiki/Aqua_%28satellite%29) satellite data that you worked with in project 5 ([here are the original data](http://www.iup.uni-bremen.de/seaice/amsredata/asi_daygrid_swath/l1a/n6250/)).

There are Python-modules to read all relevant, standardized binary data-formats into `numpy`-arrays - at least I have never failed to find a suitable module for what I needed! Dr. Google will help you!

I used [this webpage](https://www.science-emergence.com/Articles/How-to-read-a-MODIS-HDF-file-using-python-/) for examples on how to read the `HDF`-files for project 5 with the `pyhdf`-module.

**Note:** The `pyhdf`-module is not contained in the standard anaconda installation. You probably need to install it from the Linux command line with:
```
user$ conda install -c conda-forge pyhdf

```

In [None]:
# script to transform Aqua Icemap data for Project 5.
# We transform the HDF format to simple numpy-format so that
# the students do not need to deal with the HDF data-format.
# We also bin the data by a factor of 2 to make the data-volume
# smaller.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pyhdf.SD as pS

def rebin(arr, new_shape):
    """
    The function rebins a numpy-array to a lower
    resolution. The old shape needs to be an integer
    multiple (in each dimension) of the new shape.
    
    """
    shape = (new_shape[0], arr.shape[0] // new_shape[0],
             new_shape[1], arr.shape[1] // new_shape[1])
    return arr.reshape(shape).mean(-1).mean(1)

# read the HDF data into a numpy-array.
# The following commands were extracted from the HDF-tutorial
# (see above)
file_name = 'data/asi-n6250-20110101-v5.hdf'
hdf_file = pS.SD(file_name, pS.SDC.READ)
sds_obj = hdf_file.select('ASI Ice Concentration')
# data contains a numpy-array of the Ice-concentration data
data = sds_obj.get()

# cut essential data, rebin and save to numpy-format
essential_data = rebin(data[300:1400,100:1100], (550, 500))
np.save('data/{0}.npy'.format(file_name[15:23]), essential_data)

# just make a plot to verify the data
plt.figure(figsize=(8,8))
plt.imshow(essential_data, origin='lower', cmap=plt.cm.jet)
plt.show()
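
The reshape trick in the `rebin`-function is easiest to understand on a tiny array. The following sketch repeats the function and applies it to a made-up 4x4 array, so that each output pixel is the mean of a 2x2 block of input pixels.

```python
import numpy as np

def rebin(arr, new_shape):
    # split each axis into (new size, block size) and
    # average over the two block axes
    shape = (new_shape[0], arr.shape[0] // new_shape[0],
             new_shape[1], arr.shape[1] // new_shape[1])
    return arr.reshape(shape).mean(-1).mean(1)

# illustrative test array with the values 0 ... 15
small = np.arange(16, dtype=np.float32).reshape(4, 4)
binned = rebin(small, (2, 2))

print(small)
# the upper left output pixel is the mean of 0, 1, 4 and 5
print(binned)
```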