# Reading ASCII files

### How files are read and interpreted
Reading ASCII files is a common problem because the data format is often not
designed very intuitively.
Often there is additional metadata before or after a matrix-like block or in the filename,
or information is given as column headers in the first line (e.g. csv data).

Jscatter uses a simple concept to classify lines:

* 2 numbers at the beginning of a line are data (matrix-like data block).
* A name followed by a number (and more) is an attribute with name and content.
* Everything else is a comment (but can later be converted to an attribute).
* The filename is **always** stored in attribute ``.name``.
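
The classification rules above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Jscatter's actual parser; the helper names `is_number` and `classify` are made up for this example.

```python
def is_number(word):
    """Return True if the word can be parsed as a float."""
    try:
        float(word)
        return True
    except ValueError:
        return False


def classify(line):
    """Classify one text line as 'data', 'attribute' or 'comment'."""
    words = line.split()
    if len(words) >= 2 and is_number(words[0]) and is_number(words[1]):
        return 'data'        # two leading numbers: part of the matrix block
    if len(words) >= 2 and not is_number(words[0]) and is_number(words[1]):
        return 'attribute'   # a name followed by a number
    return 'comment'         # everything else


lines = [
    'this is just a comment',
    'pressure 1013 14',
    'detectorsetting up',
    '0.854979E-01  0.178301E+03  0.383044E+02',
]
kinds = [classify(line) for line in lines]
print(kinds)
```

In this sketch 'detectorsetting up' ends up as a comment because no number follows the name, which is why such lines have to be converted to attributes explicitly later.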

A new dataArray is created if, while reading a file, a data block with a parameter block
(preceding or appended) is found or a keyword indicates a new dataset (see [dataArray](https://jscatter.readthedocs.io/en/latest/dataArray.html)).

Often it is just necessary to replace some characters to fit into this concept.
This can be done during reading using some simple options in *dataArray/dataList* creation:

* replace={'old':'new',',':'.'}     ==>  replace characters and strings
* skip complete bad lines by
   ``skiplines=lambda words: any(w in words for w in ['',' ','NAN','*'])``
* takeline='ATOM'   ==> select specific lines
* ignore ='#'       ==> skip lines starting with this character
* usecols =[1,2,5]  ==> select specific columns
* lines2parameter=[2,3,4]  ==> use these data lines as comment and not as data.

See [dataArray](https://jscatter.readthedocs.io/en/latest/dataArray.html) for all options and how to use them.
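
To make the effect of `takeline`, `ignore` and `usecols` more tangible, here is a minimal pure-Python sketch that mimics them on a list of lines. The function `select` is hypothetical and only imitates the documented behaviour; it is not how Jscatter implements these options.

```python
def select(lines, takeline=None, ignore='#', usecols=None):
    """Return rows of floats from the lines that pass the filters."""
    rows = []
    for line in lines:
        if ignore and line.startswith(ignore):
            continue                      # skip lines starting with the ignore character
        words = line.split()
        if not words:
            continue                      # skip empty lines
        if takeline is not None and words[0] != takeline:
            continue                      # keep only lines starting with the keyword
        if usecols is not None:
            words = [words[i] for i in usecols]   # keep only the selected columns
        rows.append([float(w) for w in words])
    return rows


lines = [
    '# header comment',
    'REMARK not wanted',
    'ATOM 1 O1 AIN A 1 1.731 0.062 -2.912',
    'ATOM 2 C7 AIN A 1 1.411 0.021 -1.604',
]
xyz = select(lines, takeline='ATOM', usecols=[6, 7, 8])
print(xyz)
```

Selecting the columns before converting to numbers is what allows lines with text in other columns to be read at all.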

If there is more information in the comments or the filename, it can be extracted from the comment lines or the name:
 * data.getfromcomment('nameatfirstcolumn') ==> extract a list of words in this line
 * data.name  ==> filename, see below examples.
 
#### Prerequisite

In [None]:
import sys
!{sys.executable} -m pip install jscatter
%matplotlib notebook

import jscatter as js
import numpy as np
js.usempl(True)   # force matplotlib, not needed in Linux

### dataArray (js.dA) reads **one** dataset from a file    
Read one dataset. If the file contains several datasets, the first is chosen or the one given by index.

In [None]:
dat=js.dA(js.examples.datapath+'/1hho.Dq')                  # take dataset in file 
dat=js.dA(js.examples.datapath+'/iqt_1hho.dat',index=3)     # take 4th dataset in file with multiple datasets

### dataList  (js.dL) reads **all** datasets from one/multiple files (may differ in shape) 
.append(...) shares the same options as dL(...)

In [None]:
data=js.dL(js.examples.datapath+'/iqt_1hho.dat')                          # read all datasets in file
data=js.dL(js.examples.datapath+'/iqt_1hho.dat',index=2)                  # select only the third dataset (index counts from 0)
data=js.dL(js.examples.datapath+'/iqt_1hho.dat',index=[2,6,-1])           # select the datasets given in the list
data=js.dL(js.examples.datapath+'/iqt_1hho.dat',index=slice(None,None,2)) # select by slice, here every second dataset
data.append(dat)                                                          # append single dataArray read from above
data.append(js.examples.datapath+'/iqt_1hho.dat',index=[2,6,-1])          # append with same possibilities as js.dL(...)

#### Read all, filter later
Create dataList from multiple files

Uses [glob](https://docs.python.org/2/library/glob.html#module-glob) patterns  

In [None]:
data=js.dL(js.examples.datapath+'/*.dat')              # reads all '.dat' files in the data folder.

Filter according to attributes or name

In [None]:
data2=data.filter(lambda a:a.name.find('/polymer')>-1)   # by name
data3=data2.filter(lambda a:a.detDist==1205)             # e.g. detector distance
data4=data[1:-1:3]                                       # drop first and last and use only every 3rd

The same with glob (using name patterns such as ?, *), treating each file individually

In [None]:
import glob
data=js.dL()  # empty dataList
for filename in glob.glob(js.examples.datapath+'/*.dat'):
   data.append(filename)                       # you may add options
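
How glob patterns select files, and how a name filter narrows the result, can be tried without any data files at hand. The following sketch creates a few empty example files in a temporary folder; the file names are invented for this illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # create some small example files (names are made up)
    for name in ['polymer_1.dat', 'polymer_2.dat', 'buffer_1.dat', 'notes.txt']:
        with open(os.path.join(folder, name), 'w') as f:
            f.write('0 1\n')

    # '*' matches any number of characters, so this finds all '.dat' files
    found = sorted(glob.glob(os.path.join(folder, '*.dat')))
    # filter by name afterwards, similar to data.filter(...) above
    polymer = [p for p in found if os.path.basename(p).startswith('polymer')]

    names = [os.path.basename(p) for p in found]
    polymer_names = [os.path.basename(p) for p in polymer]

print(names)
print(polymer_names)
```

glob returns files in arbitrary order, so sorting the list keeps the sequence of datasets reproducible.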

Add attributes, e.g. from the filename or comments.
The attribute pressure was detected together with the following unit and stored as a list.

### How to extract information from comment lines or names
Lines such as 'pressure 1013 14' are used automatically to set an attribute as .pressure

If data.name is a string like 'mydata_T_273_bla.dat' extract Temp as below.

In [None]:
data=js.dL(js.examples.datapath+'/mydata_T*.dat') 
for dat in data:
   dat.Temp=float(dat.name.split('_')[2])     # extract attribute from name (which is always the filename read from)
   dat.mw=dat.comment[0].split()[2]     # takes the third word of the first comment line (comment[0]), e.g. 'mbar' from 'pressure 1000 mbar'

data.showattr()

In [None]:
data=js.dA(js.examples.datapath+'/mydata_T_273_bla.dat')
# lines as 'pressure 1013 14' are used automatically to set an attribute as .pressure
# if data.name is a string as 'mydata_T_273_bla.dat' extract Temp like
temp=data.name.split('_')
data.Temp=float(temp[2])
#
# if the same string is in the first comment line use
data.Temp=float(data.comment[0].split('_')[2])
# or use data.getfromcomment(...)

data.showattr()
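
Splitting at '_' and counting positions breaks as soon as the name contains a folder with underscores. A slightly more robust variant, sketched here in plain Python, strips the folder and the extension first and then looks for the marker 'T'. The function name `temperature_from_name` and the example path are assumptions for this illustration.

```python
import os


def temperature_from_name(path):
    """Return the number after the 'T' marker, e.g. 273.0 for 'mydata_T_273_bla.dat'."""
    stem = os.path.splitext(os.path.basename(path))[0]   # 'mydata_T_273_bla'
    parts = stem.split('_')
    return float(parts[parts.index('T') + 1])            # the part right after 'T'


temp = temperature_from_name('/some/folder_with_underscores/mydata_T_273_bla.dat')
print(temp)
```

The result could then be stored as an attribute in the same way as above, e.g. `data.Temp=temperature_from_name(data.name)`.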

## Some explicit examples and how to read them

### File with matrix like data and name-value combinations 
filename: data1_273K_10mM.dat (e.g. Instrument JNSE@MLZ, Garching)

In [None]:
datacontent='''
this is just a comment or description of the data
temp     293
pressure 1013 14
detectorsetting up
name     temp1bsa
0.854979E-01  0.178301E+03  0.383044E+02
0.882382E-01  0.156139E+03  0.135279E+02
0.909785E-01  0.150313E+03  0.110681E+02
0.937188E-01  0.147430E+03  0.954762E+01
0.964591E-01  0.141615E+03  0.846613E+01
0.991995E-01  0.141024E+03  0.750891E+01
0.101940E+00  0.135792E+03  0.685011E+01
0.104680E+00  0.140996E+03  0.607993E+01
'''

Read by

In [None]:
data=js.dA(js.examples.datapath+'/data1_273K_10mM.dat')
data.getfromcomment('detectorsetting')           # creates attribute detectorsetting with string value 'up' found in comments
data.Temp=float(data.name.split('_')[1][:-1])    # extracts the temperature from filename
data.conc=float(data.name.split('_')[2][:2])    # same for concentration

data.showattr()
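
When the values in the filename carry units, a regular expression may be easier to read than index arithmetic on the split name. This is a sketch using only the standard `re` module on the filename from above; the variable names are chosen for this example.

```python
import re

name = 'data1_273K_10mM.dat'
# one number followed by 'K' and one number followed by 'mM', each after an underscore
match = re.search(r'_(\d+(?:\.\d+)?)K_(\d+(?:\.\d+)?)mM', name)
temp_K = float(match.group(1))
conc_mM = float(match.group(2))
print(temp_K, conc_mM)
```

If the pattern does not fit the name, `re.search` returns None, so real code should check `match` before using it.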

### NSE measurement from IN15 at ILL Grenoble 


In [None]:
datacontent="""
ftime         E_SUM       EERR_SUM    EQ_0.0596   EERRQ_0.0596  EQ_0.0662   EERRQ_0.0662   EQ_0.0728   EERRQ_0.0728   EQ_0.0793   EERRQ_0.0793   EQ_0.0859   EERRQ_0.0859
Amplitude    -1.0000e+00  0.0000e+00  3.3149e+00  1.9984e-03  3.4203e+00  2.0375e-03  3.2560e+00  1.9803e-03  2.7188e+00  1.8161e-03  1.8634e+00  1.5032e-03
Polarisation -1.0000e+00  0.0000e+00  2.3719e+00  4.4403e-03  2.3723e+00  4.6673e-03  2.1675e+00  4.6726e-03  1.7156e+00  4.4392e-03  1.1127e+00  3.7890e-03
0.0000e+00 1.0000e+00  1.0318e-03  1.0000e+00  1.9261e-03  1.0000e+00  2.0252e-03  1.0000e+00  2.2186e-03  1.0000e+00  2.6615e-03  1.0000e+00  3.4992e-03
2.2428e-01  9.7447e-01  3.4201e-03  9.7363e-01  6.3708e-03  9.7026e-01  6.6990e-03  9.8392e-01  7.3605e-03  9.8819e-01  8.8623e-03  9.5632e-01  1.1831e-02
2.9474e-01  9.8425e-01  3.3694e-03  9.9020e-01  6.1962e-03  9.7785e-01  6.5809e-03  9.9125e-01  7.2723e-03  9.8005e-01  8.8698e-03  9.9022e-01  1.1909e-02
3.6520e-01  9.7910e-01  3.3071e-03  9.8269e-01  6.0875e-03  9.8190e-01  6.4363e-03  9.7275e-01  7.1155e-03  9.8566e-01  8.7117e-03  9.7766e-01  1.1829e-02
5.0612e-01  9.7927e-01  3.2226e-03  9.7898e-01  5.9112e-03  9.7517e-01  6.2379e-03  9.8108e-01  6.9563e-03  9.8669e-01  8.5569e-03  9.8611e-01  1.1557e-02
"""

Read by 

In [None]:
# column 1,2 are averages over following columns. First line contains q values
data=js.dL()   # empty dataList
temp=js.dA(js.examples.datapath+'/017112345.txt')         # read all then sort later
for i in [3,5,7,9]:
    data.append(temp[[0,i,i+1]])
    data[-1].Amplitude=temp.Amplitude[i-1:i+1]
    data[-1].Polarisation=temp.Polarisation[i-1:i+1]
    data[-1].q=float(temp.comment[0].split()[i].split('_')[1])
    
data.showattr()
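
The q value of each dataset is hidden in the column labels of the first line. The following plain-Python sketch shows how the labels can be turned into a mapping from column index to q, using a shortened version of the header above; the name `q_of_column` is made up for this example.

```python
header = 'ftime E_SUM EERR_SUM EQ_0.0596 EERRQ_0.0596 EQ_0.0662 EERRQ_0.0662'
labels = header.split()

# map column index -> q for every 'EQ_<q>' column; the column after it holds the error
q_of_column = {i: float(label.split('_')[1])
               for i, label in enumerate(labels) if label.startswith('EQ_')}
print(q_of_column)
```

Building the column list from the labels avoids hard-coding the indices in the loop when the number of q values differs between files.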

### A complex file with atomic coordinates
aspirin.pdb: Atomic coordinates for aspirin [AIN](http://ligand-expo.rcsb.org/reports/A/AIN/AIN_ideal.pdb)
from [Protein Data Bank, PDB](http://www.rcsb.org/ligand/AIN)

In [None]:
content='''
Header
Remarks blabla
Remarks in pdb files are sometimes more than 100 lines
ATOM      1  O1  AIN A   1       1.731   0.062  -2.912  1.00 10.00           O
ATOM      2  C7  AIN A   1       1.411   0.021  -1.604  1.00 10.00           C
ATOM      3  O2  AIN A   1       2.289   0.006  -0.764  1.00 10.00           O
ATOM      4  C3  AIN A   1      -0.003  -0.006  -1.191  1.00 10.00           C
ATOM      5  C4  AIN A   1      -1.016   0.010  -2.153  1.00 10.00           C
ATOM      6  C5  AIN A   1      -2.337  -0.015  -1.761  1.00 10.00           C
ATOM      7  C6  AIN A   1      -2.666  -0.063  -0.416  1.00 10.00           C
ATOM      8  C1  AIN A   1      -1.675  -0.085   0.544  1.00 10.00           C
ATOM      9  C2  AIN A   1      -0.340  -0.060   0.168  1.00 10.00           C
ATOM     10  O3  AIN A   1       0.634  -0.083   1.111  1.00 10.00           O
ATOM     11  C8  AIN A   1       0.314   0.035   2.410  1.00 10.00           C
ATOM     12  O4  AIN A   1      -0.824   0.277   2.732  1.00 10.00           O
ATOM     13  C9  AIN A   1       1.376  -0.134   3.466  1.00 10.00           C
ATOM     14  HO1 AIN A   1       2.659   0.080  -3.183  1.00 10.00           H
ATOM     15  H4  AIN A   1      -0.765   0.047  -3.203  1.00 10.00           H
ATOM     16  H5  AIN A   1      -3.119   0.001  -2.505  1.00 10.00           H
ATOM     17  H6  AIN A   1      -3.704  -0.082  -0.117  1.00 10.00           H
ATOM     18  H1  AIN A   1      -1.939  -0.123   1.591  1.00 10.00           H
ATOM     19  H91 AIN A   1       0.931  -0.004   4.453  1.00 10.00           H
ATOM     20  H92 AIN A   1       1.807  -1.133   3.391  1.00 10.00           H
ATOM     21  H93 AIN A   1       2.158   0.610   3.318  1.00 10.00           H
CONECT    1    2   14 may appear at the end
HETATOM lines may appear at the end
END'''

#### Read by (several methods)

Take 'ATOM' lines, but only columns 6-8 as x,y,z coordinates.

In [None]:
js.dA(js.examples.datapath+'/AIN_ideal.pdb',takeline='ATOM',replace={'ATOM':'0'},usecols=[6,7,8])

Replace 'ATOM' string by number and set XYZ for convenience

In [None]:
js.dA(js.examples.datapath+'/AIN_ideal.pdb',replace={'ATOM':'0'},usecols=[6,7,8],XYeYeX=[0,1,None,None,2])

Only the Oxygen atoms

In [None]:
js.dA(js.examples.datapath+'/AIN_ideal.pdb',takeline=lambda w:(w[0]=='ATOM') & (w[2][0]=='O'),replace={'ATOM':'0'},usecols=[6,7,8])

Using regular expressions we can decode the atom specifier into a scattering length

In [None]:
import re    # regular expression 
rHO=re.compile(r'HO\d') # 14 is HO1
rH=re.compile(r'H\d+')  # represents something like 'H11' or 'H1' see regular expressions
rC=re.compile(r'C\d+')
rO=re.compile(r'O\d+')
# replace atom specifier by number and use it as last column
js.dA(js.examples.datapath+'/AIN_ideal.pdb',replace={'ATOM':'0',rC:1,rH:5,rO:2,rHO:5},usecols=[6,7,8,2],XYeYeX=[0,1,None,None,2])
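
The patterns can be checked on their own before they are used for reading. The sketch below maps atom specifiers to the same numeric codes with the standard `re` module. It uses `fullmatch` so that 'O\d+' cannot match the 'O1' inside 'HO1'. The function `atom_code` is hypothetical and the codes are just the illustrative numbers from the call above.

```python
import re

# (pattern, code) pairs; the codes are the same illustrative numbers as above
patterns = [
    (re.compile(r'HO\d'), 5),
    (re.compile(r'H\d+'), 5),
    (re.compile(r'C\d+'), 1),
    (re.compile(r'O\d+'), 2),
]


def atom_code(name):
    """Return the numeric code of the first pattern matching the whole name."""
    for pattern, code in patterns:
        if pattern.fullmatch(name):
            return code
    raise ValueError(f'unknown atom specifier: {name!r}')


codes = [atom_code(n) for n in ['O1', 'C7', 'HO1', 'H91']]
print(codes)
```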

Read only the atom names and use them to retrieve atom data from js.formel.Elements

In [None]:
atoms=js.dA(js.examples.datapath+'/AIN_ideal.pdb',replace={'ATOM':'0'},usecols=[2],XYeYeX=[0,1,None,None,2])[0].array
al=[js.formel.Elements[a[0].lower()] for a in atoms]
al

#### Data with lots of non-numeric content in the matrix
data2.txt

In [None]:
content='''
# this is just a comment or description of the data
# temp     ;    293
# pressure ; 1013 14  bar
# name     ; temp1bsa
&doit
0,854979E-01  0,178301E+03  0,383044E+02
0,882382E-01  0,156139E+03  0,135279E+02
0,909785E-01  *             0,110681E+02
0,937188E-01  0,147430E+03  0,954762E+01
0,964591E-01  0,141615E+03  0,846613E+01
nan           nan           0
'''

Read by
- ignore is by default '#', so switch it off
- skip lines with non numbers in data
- replace some char by others or remove by replacing with empty string ''.

In [None]:
js.dA(js.examples.datapath+'/data2.txt',replace={'#':'',';':'',',':'.'},skiplines=['*','nan'],ignore='' )
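
What these options do to the individual lines can be followed step by step in plain Python. This is a sketch of the idea on a shortened copy of the content above, not Jscatter's implementation; the variable names are chosen for this example.

```python
content = '''# temp     ;    293
&doit
0,854979E-01  0,178301E+03  0,383044E+02
0,909785E-01  *             0,110681E+02
0,937188E-01  0,147430E+03  0,954762E+01
nan           nan           0
'''

replace = {'#': '', ';': '', ',': '.'}   # remove '#' and ';', decimal comma -> point
bad_words = {'*', 'nan'}                 # lines containing these words are skipped

rows = []
for line in content.splitlines():
    for old, new in replace.items():
        line = line.replace(old, new)
    words = line.split()
    if not words or any(w in bad_words for w in words):
        continue                         # empty line or line with '*' or 'nan'
    try:
        rows.append([float(w) for w in words])
    except ValueError:
        pass                             # text lines ('temp 293', '&doit') are not data

print(rows)
```

Only the two complete numeric lines survive; in Jscatter the line 'temp 293' would additionally become an attribute instead of being dropped.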

### pdh format used in some SAXS instruments (first real data point is line 4)

In [None]:
content='''
SAXS BOX
      2057         0         0         0         0         0         0         0
  0.000000E+00   3.053389E+02   0.000000E+00   1.000000E+00   1.541800E-01
  0.000000E+00   1.332462E+00   0.000000E+00   0.000000E+00   0.000000E+00
-1.069281E-01   2.277691E+03   1.168599E+00
-1.037351E-01   2.239132E+03   1.275602E+00
-1.005422E-01   2.239534E+03   1.068182E+00
-9.734922E-02   2.219594E+03   1.102175E+00
'''

Read by
- Use only the first two columns
- Use lines 2,3,4 as comment
- As demo: skip the first 5 and last 50 lines and use every second line

In [None]:
# this saves the prepended lines in attribute line_2,...
empty=js.dA(js.examples.datapath+'/buffer_averaged_corrected_despiked.pdh',usecols=[0,1],lines2parameter=[2,3,4])
# next just ignores the first lines (and last 50) and uses every second line,
empty2=js.dA(js.examples.datapath+'/buffer_averaged_corrected_despiked.pdh',usecols=[0,1],block=[5,-50,2])

empty.showattr()
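
Judging from the comment above, `block=[5,-50,2]` selects lines in the same way a Python slice with start, stop and step would. Assuming that reading is correct, the selection can be previewed on a plain list; the line contents are placeholders.

```python
lines = [f'line {i}' for i in range(100)]   # stand-in for the lines of a file

block = [5, -50, 2]                         # start, stop, step as in block=[5,-50,2]
selected = lines[slice(*block)]             # same as lines[5:-50:2]
print(selected[0], selected[-1], len(selected))
```

With 100 lines this keeps every second line from 'line 5' up to 'line 49'.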

### Read csv data (comma separated values)
- replace ',' (or ';' if that is the separator) by ' ' (space)
- for files from Windows computers the encoding may need to be specified, e.g. encoding='cp1252' (Western European, includes German öäüß) or 'cp1251' (Cyrillic)

```
js.dA('data2.txt',replace={',':' '})
```
If tabs separate the columns
```
js.dA('data2.txt',replace={',':' ','\t':' '})
```
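
For csv files with column names in the first line, it can help to look at the header separately before deciding which columns to read. A minimal sketch with Python's standard `csv` module follows; the column names and numbers are invented for this example.

```python
import csv
import io

# small in-memory csv file with a header line (values are made up)
text = 'q,I,err\n0.1,10.5,0.3\n0.2,8.1,0.2\n'

reader = csv.reader(io.StringIO(text))
header = next(reader)                                   # first line: column names
rows = [[float(v) for v in row] for row in reader]      # remaining lines: numbers
print(header)
print(rows)
```

The position of a name in `header` gives the index that could be passed to `usecols` when reading the same file with Jscatter.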