# Data Reading and Writing {.unnumbered}

# How do you read the data? {.unnumbered}

Depending on the data format, you can use different libraries to read the data.

## Reading Plain Text Files {.unnumbered}

You can use the `pandas` or `numpy` library to read CSV files.







### Pandas {.unnumbered}
The pandas read functions are fast and can handle large files.
A further advantage is that columns with different data types can be read from the same file.
The reading functions return a `DataFrame` object.

Pandas has different functions to read different file formats:

- `pandas.read_csv()` reads CSV files.
- `pandas.read_table()` reads general delimited files (tab-separated by default).
- `pandas.read_fwf()` reads fixed-width files.
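
The three readers can be compared on small in-memory samples. This is only a minimal sketch: the sample texts are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data, kept in memory so the example is self-contained.
csv_text = "time,temperature\n1,303.07\n2,302.95\n"
tab_text = "time\ttemperature\n1\t303.07\n2\t302.95\n"
fwf_text = "time  temperature\n1     303.07\n2     302.95\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # comma-separated
df_tab = pd.read_table(io.StringIO(tab_text))  # tab-separated by default
df_fwf = pd.read_fwf(io.StringIO(fwf_text))    # column widths are inferred

for df in (df_csv, df_tab, df_fwf):
    print(list(df.columns), df.shape)
```

All three calls should produce a `DataFrame` with the same two columns, which shows that the functions differ mainly in how they split a line into fields.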

The most commonly used function is `pandas.read_csv()`, because you can specify the delimiter, the header, and many other options.




```python
import pandas as pd
df = pd.read_csv('file.csv')
```

In [47]:
import pandas as pd
data = pd.read_csv('../../../data/temperatures.csv')
data

|       | time;temperature    |
|-------|---------------------|
| 0     | 1;303.073024218     |
| 1     | 2;302.951624807     |
| 2     | 3;302.831229733     |
| 3     | 4;302.73615227      |
| 4     | 5;302.708880354     |
| ...   | ...                 |
| 44635 | 44636;296.102947663 |
| 44636 | 44637;296.173110138 |
| 44637 | 44638;296.140000813 |
| 44638 | 44639;296.169289777 |


The data uses a different delimiter than the default comma, so each row ends up in a single column. You can specify the delimiter with the `sep` parameter.

```python
df = pd.read_csv('file.csv', sep=';')
```

In [48]:
data = pd.read_csv('../../../data/temperatures.csv', sep=';')
data

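|       | time  | temperature |
|-------|-------|-------------|
| 0     | 1     | 303.073024  |
| 1     | 2     | 302.951625  |
| 2     | 3     | 302.831230  |
| 3     | 4     | 302.736152  |
| 4     | 5     | 302.708880  |
| ...   | ...   | ...         |
| 44635 | 44636 | 296.102948  |
| 44636 | 44637 | 296.173110  |
| 44637 | 44638 | 296.140001  |
| 44638 | 44639 | 296.169290  |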


Now the data is read correctly. By default, the header is taken from the first row. If the header is on a different row, or if the file has no header, you can use the `header` parameter.

```python
df = pd.read_csv('file.csv', header=None) # No header
df = pd.read_csv('file.csv', header=0) # Header is in the first row
df = pd.read_csv('file.csv', header=1) # Header is in the second row
```
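
The effect of the three settings can be checked on a small in-memory example. This is a minimal sketch; the sample text is made up for illustration and stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical file content: one header line followed by two data rows.
text = "time;temperature\n1;303.07\n2;302.95\n"

# header=None: columns are numbered 0, 1 and the header line becomes a data row.
no_header = pd.read_csv(io.StringIO(text), sep=';', header=None)

# header=0: the first line provides the column names.
first_row = pd.read_csv(io.StringIO(text), sep=';', header=0)

# header=1: the second line provides the column names, the first line is dropped.
second_row = pd.read_csv(io.StringIO(text), sep=';', header=1)

for df in (no_header, first_row, second_row):
    print(list(df.columns), len(df))
```

Note that every row above the chosen header row is discarded, so a wrong `header` value silently removes data.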


In [49]:
data = pd.read_csv('../../../data/temperatures.csv', sep=';', header=0)
data

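|       | time  | temperature |
|-------|-------|-------------|
| 0     | 1     | 303.073024  |
| 1     | 2     | 302.951625  |
| 2     | 3     | 302.831230  |
| 3     | 4     | 302.736152  |
| 4     | 5     | 302.708880  |
| ...   | ...   | ...         |
| 44635 | 44636 | 296.102948  |
| 44636 | 44637 | 296.173110  |
| 44637 | 44638 | 296.140001  |
| 44638 | 44639 | 296.169290  |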


If the values are separated by spaces, set `sep=' '`. If more than one space separates the values, also set `skipinitialspace=True` so that the spaces following the delimiter are skipped.

```python
df = pd.read_csv('file.csv', sep=' ', skipinitialspace=True)
```
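
A short sketch of what `skipinitialspace` changes, assuming a made-up sample in which the values are separated by two spaces:

```python
import io
import pandas as pd

# Hypothetical sample: values separated by two spaces instead of one.
text = "1  303.07\n2  302.95\n"

# Without skipinitialspace, the second space starts an extra, empty column.
wide = pd.read_csv(io.StringIO(text), sep=' ', header=None)

# With skipinitialspace=True, spaces after the delimiter are skipped.
clean = pd.read_csv(io.StringIO(text), sep=' ', header=None, skipinitialspace=True)

print(wide.shape)
print(clean.shape)
```

The first call yields three columns (the middle one is empty), the second the expected two.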

In [57]:
data = pd.read_csv('../../../data/temperatures.dat', skipinitialspace=True, sep=" ")
data

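|       | 1     | 303.073024218 |
|-------|-------|---------------|
| 0     | 2     | 302.951625    |
| 1     | 3     | 302.831230    |
| 2     | 4     | 302.736152    |
| 3     | 5     | 302.708880    |
| 4     | 6     | 302.647462    |
| ...   | ...   | ...           |
| 44634 | 44636 | 296.102948    |
| 44635 | 44637 | 296.173110    |
| 44636 | 44638 | 296.140001    |
| 44637 | 44639 | 296.169290    |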


This file has no header row, so pandas has used the first data row as column names. You can set the column names yourself with the `names` parameter.

```python
df = pd.read_csv('file.csv', names=['A', 'B', 'C', 'D'])
```

In [160]:
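data = pd.read_csv('../../../data/temperatures.dat', sep=' ', skipinitialspace=True, header=1, names=['t', 'T'])
# Note: header=1 treats the second line as the header, so the first two data rows are dropped (the output starts at t=3).
# For a file without a header row, it is important to set header=None, otherwise data rows are used as header.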
data

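|       | t     | T          |
|-------|-------|------------|
| 0     | 3     | 302.831230 |
| 1     | 4     | 302.736152 |
| 2     | 5     | 302.708880 |
| 3     | 6     | 302.647462 |
| 4     | 7     | 302.513749 |
| ...   | ...   | ...        |
| 44633 | 44636 | 296.102948 |
| 44634 | 44637 | 296.173110 |
| 44635 | 44638 | 296.140001 |
| 44636 | 44639 | 296.169290 |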
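
How `header` interacts with `names` can be checked on a small in-memory sample. This is a sketch with made-up values; it only illustrates which rows survive for a file without a header row.

```python
import io
import pandas as pd

# Hypothetical headerless file content with four data rows.
text = "1 303.07\n2 302.95\n3 302.83\n4 302.74\n"

# header=None keeps every line as data and uses the given names.
keep_all = pd.read_csv(io.StringIO(text), sep=' ', header=None, names=['t', 'T'])

# header=1 treats the second line as the header, so the first two lines are lost.
drop_two = pd.read_csv(io.StringIO(text), sep=' ', header=1, names=['t', 'T'])

print(keep_all['t'].tolist())
print(drop_two['t'].tolist())
```

With `header=None` all four rows are kept, while `header=1` leaves only the rows starting at `t=3`.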


As you can see, `read_csv()` can also read files that do not have a `.csv` extension.

The `read_csv()` function has many parameters.
The documentation lists all parameters you can set: [https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

For example:

- `delimiter` is an alias for `sep`; both specify the delimiter. Default is `,`.
- `header` specifies the row to use as the header. By default it is inferred from the file.
- `skipinitialspace` skips spaces that follow the delimiter. Default is `False`.
- `names` specifies the column names.
- `skiprows` skips rows at the beginning of the file.
- `skipfooter` skips rows at the end of the file. It is only supported by the Python parser (`engine='python'`).
- `nrows` reads only a specific number of rows.
- `usecols` reads only specific columns.
- `dtype` specifies the data type of the columns.
- `na_values` specifies additional values that should be recognized as missing.
- `keep_default_na` specifies whether the default missing-value markers (such as `NA` or `NaN`) are also recognized. Default is `True`.
- `na_filter` specifies whether missing values, including empty fields, are detected at all. Default is `True`.
- `true_values` specifies the values that should be recognized as `True`.
- `false_values` specifies the values that should be recognized as `False`.
- `parse_dates` parses date columns. Default is `False`.

 

Some examples are:

```python
df = pd.read_csv('file.csv', skiprows=2)
```


In [161]:
data = pd.read_csv('../../../data/temperatures.csv', sep=';', header=0, names=['t', 'T'], skiprows=1)
data # skiprows=1 skips the original header line; header=0 then uses the first data row as header (replaced by names), so only 44638 rows are read instead of 44639

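|       | t     | T          |
|-------|-------|------------|
| 0     | 2     | 302.951625 |
| 1     | 3     | 302.831230 |
| 2     | 4     | 302.736152 |
| 3     | 5     | 302.708880 |
| 4     | 6     | 302.647462 |
| ...   | ...   | ...        |
| 44634 | 44636 | 296.102948 |
| 44635 | 44637 | 296.173110 |
| 44636 | 44638 | 296.140001 |
| 44637 | 44639 | 296.169290 |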
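
Several of the other parameters can be combined in one call. The following sketch uses a made-up sample with a third column and a trailing summary line; the column name `sensor` and the footer text are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical file content with a trailing summary line.
text = (
    "time;temperature;sensor\n"
    "1;303.07;A\n"
    "2;302.95;A\n"
    "3;302.83;B\n"
    "4;302.74;B\n"
    "end of measurement\n"
)

# Read only two columns and the first two data rows, with explicit data types.
subset = pd.read_csv(
    io.StringIO(text), sep=';',
    usecols=['time', 'temperature'], nrows=2,
    dtype={'time': 'int64', 'temperature': 'float64'},
)

# Drop the trailing summary line; skipfooter needs the Python parser.
no_footer = pd.read_csv(io.StringIO(text), sep=';', skipfooter=1, engine='python')

print(subset)
print(no_footer)
```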


Missing value examples:


In [162]:
data = pd.read_csv('../../../data/temperatures_nan.dat', sep=' ', skipinitialspace=True,header=None,names=['t', 'T'],  keep_default_na=True,na_filter=False)
print(data.loc[20:26]) # print some rows to see the NaN values


     t              T
20  21  302.020507467
21  22               
22  23  301.845408096
23  24  301.833550446
24  25  301.785933229
25  26  301.846169501
26  27  301.779994697


If `na_filter` is set to `False`, missing values are not detected and the empty field is kept as it is. If it is set to `True` (the default), missing values are recognized and shown as `NaN`.

In [163]:
data = pd.read_csv('../../../data/temperatures_nan.dat', sep=' ', skipinitialspace=True,header=None,names=['t', 'T'],  keep_default_na=True,na_filter=True)
print(data.loc[20:26]) # print some rows to see the NaN values


     t           T
20  21  302.020507
21  22         NaN
22  23  301.845408
23  24  301.833550
24  25  301.785933
25  26  301.846170
26  27  301.779995
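
The parameters `na_values`, `keep_default_na` and `na_filter` can be compared on a small in-memory sample. This is a sketch with made-up data; the placeholder `-999` is an assumption chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical sample with an empty field, a placeholder (-999) and the marker NA.
text = "t;T\n1;302.02\n2;\n3;-999\n4;NA\n"

# Default: the empty field and 'NA' are recognized as missing.
default = pd.read_csv(io.StringIO(text), sep=';')

# na_values adds further markers on top of the defaults.
custom = pd.read_csv(io.StringIO(text), sep=';', na_values=['-999'])

# keep_default_na=False: only the markers listed in na_values count as missing.
strict = pd.read_csv(io.StringIO(text), sep=';', keep_default_na=False, na_values=[''])

# na_filter=False: no missing-value detection at all.
raw = pd.read_csv(io.StringIO(text), sep=';', na_filter=False)

for df in (default, custom, strict, raw):
    print(df['T'].isna().sum())
```

The printed numbers are the counts of missing values that each call detects in column `T`.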


### NumPy {.unnumbered}
NumPy has two main functions to read text files:

- `numpy.loadtxt()` reads text files in which every row has the same number of values.
- `numpy.genfromtxt()` reads text files and can handle missing values.

In comparison to the `pandas` library, the `numpy` readers are usually slower on large files, and by default they return a single array with one data type.
So you cannot read a file that mixes strings and numbers into a plain numeric array.

```python
import numpy as np
data = np.loadtxt('file.csv', delimiter=',')
```

If you try to read a file with a header row, you will get an error. 


In [164]:
import numpy as np
data = np.loadtxt('../../../data/temperatures.csv', delimiter=';')

ValueError: could not convert string 'time' to float64 at row 0, column 1.

You can skip the header row with the `skiprows` parameter.
To skip one row, set `skiprows=1`.

```python
data = np.loadtxt('file.csv', delimiter=';', skiprows=1)
```

In [165]:
data = np.loadtxt('../../../data/temperatures.csv', delimiter=';', skiprows=1)
data

array([[1.00000000e+00, 3.03073024e+02],
       [2.00000000e+00, 3.02951625e+02],
       [3.00000000e+00, 3.02831230e+02],
       ...,
       [4.46380000e+04, 2.96140001e+02],
       [4.46390000e+04, 2.96169290e+02],
       [4.46400000e+04, 2.96287904e+02]])

Now the data is read correctly. You can see that the data is read as a numpy array and not as a DataFrame. This can be a disadvantage if you want to use the data as a DataFrame but an advantage if you want to use numpy functions to process the data.
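
If you need both views, converting between the two is straightforward. A minimal sketch, assuming a small made-up array in place of the one returned by `np.loadtxt()` and the illustrative column names `time` and `temperature`:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the array returned by np.loadtxt().
arr = np.array([[1.0, 303.07], [2.0, 302.95], [3.0, 302.83]])

# NumPy array -> DataFrame with named columns.
df = pd.DataFrame(arr, columns=['time', 'temperature'])

# DataFrame -> NumPy array.
back = df.to_numpy()

print(arr[:, 1].mean())          # NumPy operation on the second column
print(df['temperature'].mean())  # the same operation on the DataFrame
```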

If the values are separated by whitespace, you do not need to specify the `delimiter` parameter because the default is whitespace.

```python
data = np.loadtxt('file.csv')
```

In [166]:
data = np.loadtxt('../../../data/temperatures.dat')
data

array([[1.00000000e+00, 3.03073024e+02],
       [2.00000000e+00, 3.02951625e+02],
       [3.00000000e+00, 3.02831230e+02],
       ...,
       [4.46380000e+04, 2.96140001e+02],
       [4.46390000e+04, 2.96169290e+02],
       [4.46400000e+04, 2.96287904e+02]])

`genfromtxt()` gives you more flexibility when reading files with missing values.

If you first try the `loadtxt()` function, you get an error because of the missing values.


In [167]:
data = np.loadtxt('../../../data/temperatures_nan.dat')
data

ValueError: the number of columns changed from 2 to 1 at row 22; use `usecols` to select a subset and avoid this error

If you try to read the file with the empty entry at row 22, you will still get an error with the `genfromtxt()` function.

In [168]:
data = np.genfromtxt('../../../data/temperatures_nan.dat')
data

ValueError: Some errors were detected !
    Line #22 (got 1 columns instead of 2)

But why?
What do you think is the reason for the error?

::: {.callout-caution collapse="true"}
The reason is that the `genfromtxt()` function expects the same number of columns in each row. The default delimiter is whitespace, so an empty field cannot be told apart from the surrounding whitespace and the missing value simply disappears.
An error is raised because at row 22 the function detects only one column due to the missing value.
:::

How can you solve this problem?

::: {.callout-caution collapse="true"}
You can **NOT** solve this problem with the `genfromtxt()` function if the file has missing values and the delimiter is whitespace.
Either you fill the missing entries with a placeholder value or you use the `pandas` library.
:::


If the delimiter is not whitespace, you can use the `genfromtxt()` function with missing values.

```python
data = np.genfromtxt('file.csv', delimiter=',')
```

In [186]:
data = np.genfromtxt('../../../data/temperatures_nan.csv',delimiter=';')
data[20:26] # print some rows to see the NaN values

array([[ 21.        , 302.02050747],
       [ 22.        ,          nan],
       [ 23.        , 301.8454081 ],
       [ 24.        , 301.83355045],
       [ 25.        , 301.78593323],
       [ 26.        , 301.8461695 ]])
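
Instead of `nan`, the missing entries can also be replaced by a fill value or returned as a masked array. A minimal sketch with a made-up in-memory sample; the fill value `-1.0` is an arbitrary choice for illustration.

```python
import io
import numpy as np

# Hypothetical sample with one empty field in the second row.
text = "1;302.02\n2;\n3;301.85\n"

# Default: the empty field becomes nan.
nan_data = np.genfromtxt(io.StringIO(text), delimiter=';')

# filling_values replaces missing entries with a chosen value.
filled = np.genfromtxt(io.StringIO(text), delimiter=';', filling_values=-1.0)

# usemask=True returns a masked array that marks the missing entries.
masked = np.genfromtxt(io.StringIO(text), delimiter=';', usemask=True)

print(nan_data)
print(filled)
print(masked.mask)
```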

The `loadtxt()` function accepts, among others, the following parameters
(see documentation [https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html)):

- `delimiter` specifies the delimiter. Default is whitespace.
- `skiprows` skips rows at the beginning of the file.
- `usecols` reads only specific columns.
- `dtype` specifies the data type of the columns.
- `comments` specifies the comment character. Default is `#`.
- `max_rows` reads only a specific number of rows after the skipped rows.
- `unpack` unpacks the columns, so each column is returned as a separate array.

The `genfromtxt()` function accepts, among others, the following parameters
(see documentation [https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt)):

- `delimiter` specifies the delimiter. Default is whitespace.
- `skip_header` skips rows at the beginning of the file.
- `skip_footer` skips rows at the end of the file.
- `usecols` reads only specific columns.
- `dtype` specifies the data type of the columns.
- `comments` specifies the comment character. Default is `#`.
- `max_rows` reads only a specific number of rows after the skipped rows.
- `unpack` unpacks the columns, so each column is returned as a separate array.
- `missing_values` specifies which values should be recognized as missing values.
- `filling_values` specifies the values used to fill in missing values.
- `usemask` returns a masked array in which the missing values are masked.
- `names` specifies the column names. If `names=True`, the column names are read from the first row.
- `replace_space` specifies the character that replaces spaces in the column names. Default is `_`.
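
A few of these parameters in action, as a minimal sketch on made-up in-memory samples (the values and the column names `time` and `temperature` are assumptions for illustration):

```python
import io
import numpy as np

# Hypothetical whitespace-separated sample with a comment line at the top.
text = "# time temperature\n1 303.07\n2 302.95\n3 302.83\n"

# unpack=True returns one array per column; the '#' line is ignored as a comment.
t, T = np.loadtxt(io.StringIO(text), unpack=True)

# Skip the first line, then read only the second column of the next two rows.
first_two = np.loadtxt(io.StringIO(text), skiprows=1, usecols=1, max_rows=2)

# Hypothetical semicolon-separated sample with a header row.
header_text = "time;temperature\n1;303.07\n2;302.95\n"

# names=True reads the field names from the first row (structured array).
rec = np.genfromtxt(io.StringIO(header_text), delimiter=';', names=True)

print(t, T)
print(first_two)
print(rec.dtype.names, rec['temperature'])
```

With `names=True` the result is a structured array, so the columns are accessed by name instead of by index.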



::: {.callout-important}
For reading text files, the `pandas` library is usually faster and more flexible than the `numpy` library.
Choose wisely which library you want to use.
It depends on the data format, the data type and what kind of processing you want to do.
:::

## Exercises {.unnumbered}