# Inspection of Data {.unnumbered}

After loading your data, inspect it to:

- check that the data is consistent and has no missing values
- check that the data is in the correct format
- check that the values lie in the expected range
- check that the data follows the expected distribution
- get first insights into the data
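
As a rough sketch, the checks listed above can be bundled into one small helper. The function name `inspect_frame`, the toy data and the range limits are illustrative assumptions and not part of pandas.

```python
import pandas as pd


def inspect_frame(df, ranges):
    """Collect basic sanity checks for a DataFrame (hypothetical helper).

    ranges maps a column name to its plausible (lower, upper) bounds.
    """
    report = {
        # number of missing values per column
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        # storage format of each column
        "dtypes": {col: str(t) for col, t in df.dtypes.items()},
        "out_of_range": {},
    }
    for column, (lower, upper) in ranges.items():
        # drop NaN first so that missing values are not counted as out of range
        values = df[column].dropna()
        report["out_of_range"][column] = int((~values.between(lower, upper)).sum())
    return report


# Toy data: one missing temperature and one implausible value (assumed limits)
toy = pd.DataFrame({"time": [1, 2, 3, 4], "temperature": [303.1, None, 302.8, 1000.0]})
report = inspect_frame(toy, {"temperature": (250.0, 350.0)})
print(report)
```

The distribution check is left out here on purpose; `describe()` or a histogram is usually the more informative tool for that.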

## Pandas {.unnumbered}

### Overview of the Data {.unnumbered}

You can use the `head()` function to get a quick overview of the first rows of the data.

```python
import pandas as pd

data = pd.read_csv('data.csv')
data.head()
```

In [54]:
import pandas as pd

data = pd.read_csv('../../../data/temperatures.csv',sep=';')
data.head()

|    | time | temperature |
|---:|-----:|------------:|
| 0 | 1 | 303.073024 |
| 1 | 2 | 302.951625 |
| 2 | 3 | 302.83123 |
| 3 | 4 | 302.736152 |
| 4 | 5 | 302.70888 |


The `describe()` function gives you a quick overview of the data distribution.

```python
data.describe()
```

In [55]:
data.describe()

|       | time | temperature |
|:------|-----:|------------:|
| count | 44640.0 | 44640.0 |
| mean | 22320.5 | 297.937566 |
| std | 12886.602345 | 6.080222 |
| min | 1.0 | 278.81467 |
| 25% | 11160.75 | 293.842755 |
| 50% | 22320.5 | 297.87593 |
| 75% | 33480.25 | 301.87956 |
| max | 44640.0 | 316.006103 |


The `describe()` method reports the count, mean, standard deviation, minimum, 25%, 50% and 75% quantiles, and maximum of each numeric column.

In this case, the data set has 44640 data points. The mean temperature is 298(6) K, where the number in parentheses is the standard deviation. The minimum temperature is 279 K and the maximum temperature is 316 K. Furthermore, the 25% quantile is 294 K, the 50% quantile (the median) is 298 K and the 75% quantile is 302 K. The measurement runs from 1 s to 44640 s, which corresponds to 12 hours and 24 minutes.
We assume that this is the expected time range.
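
The duration quoted above can be verified with a short calculation. This is only an arithmetic check; the variable names are chosen for illustration.

```python
n_seconds = 44640  # last time stamp reported by describe()

# split the total number of seconds into hours, minutes and seconds
hours, remainder = divmod(n_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours} h {minutes} min {seconds} s")  # 12 h 24 min 0 s
```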

#### Missing Data and Corrupted Data {.unnumbered}

To check whether the data set contains missing values, use the `isna()` method and sum the result per column.

```python
data.isna().sum()
```

In [56]:
data.isna().sum()

time           0
temperature    0
dtype: int64

No missing data is found in this case.

To check whether the data is in the correct format, inspect the data type of each column with the `dtypes` attribute.

```python
data.dtypes
```

In [57]:
data.dtypes

time             int64
temperature    float64
dtype: object

You see that `time` is an `int64` and `temperature` is a `float64`.
For the analysis, you might want to convert the `time` to a `float64` as well.

```python
data['time'] = data['time'].astype('float64')
```

In [58]:
data['time'] = data['time'].astype('float64')
data.dtypes

time           float64
temperature    float64
dtype: object
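
Corrupted entries often reveal themselves through an unexpected data type. A minimal sketch, assuming a hypothetical column in which one reading is a non-numeric string: `pd.to_numeric` with `errors="coerce"` turns unparsable entries into `NaN`, so they can be counted with `isna()` afterwards.

```python
import pandas as pd

# Hypothetical raw data in which one temperature reading was corrupted
raw = pd.DataFrame({"time": [1, 2, 3], "temperature": ["303.07", "err", "302.83"]})
print(raw.dtypes)  # temperature is not stored as float64 here

# Unparsable strings become NaN instead of raising an error
raw["temperature"] = pd.to_numeric(raw["temperature"], errors="coerce")
n_corrupted = int(raw["temperature"].isna().sum())
print(raw.dtypes)
print(n_corrupted)
```

Note that coercion cannot distinguish values that were missing from the start from values that were corrupted, so count the missing values before and after if the difference matters.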

If we have missing data we can use the `fillna()` function to fill the missing data with a specific value.

```python
data.fillna(0)
```

In [59]:
data_missing = pd.read_csv('../../../data/temperatures_nan.dat', header=None, skipinitialspace=True, sep=' ', names=['time', 'temperature'])
data_missing.head()

|    | time | temperature |
|---:|-----:|------------:|
| 0 | 1 | 303.073024 |
| 1 | 2 | 302.951625 |
| 2 | 3 | 302.83123 |
| 3 | 4 | 302.736152 |
| 4 | 5 | 302.70888 |


In [60]:
data_missing.isna().sum()

time           0
temperature    1
dtype: int64

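One value is missing in the `temperature` column. Filling it with 0 would create an unphysical outlier in a temperature series around 300 K, so a better replacement value is needed.

First, let us check where the data is missing.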

```python
data[data['temperature'].isna()]
```

In [61]:
data_missing[data_missing['temperature'].isna()]

|    | time | temperature |
|---:|-----:|------------:|
| 21 | 22 | NaN |


The temperature is missing at index 21, which corresponds to a time of 22 s.

:::{.callout .callout-info}
The handling of missing data is a complex topic.
First of all, you have to find out why the data is missing: is it a measurement error, a data processing error, or something else?

You have to decide whether to fill the missing data with a specific value, drop the row or column, or interpolate the missing data. The decision depends on the data and on the analysis you want to perform. Dropping data is always a delicate decision because you lose information, and in some cases it is not good scientific practice.
There is a lot of research on this topic; see for example [https://doi.org/10.1076/edre.7.4.353.8937](https://doi.org/10.1076/edre.7.4.353.8937).
:::
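
The options named in the callout can be compared side by side on a toy series. The values below are illustrative and not taken from the temperature file; which strategy is appropriate still depends on why the value is missing.

```python
import numpy as np
import pandas as pd

# Toy series with one gap (illustrative values)
s = pd.Series([302.0, np.nan, 301.8, 301.7])

dropped = s.dropna()            # removes the incomplete entry
constant = s.fillna(s.mean())   # replaces the gap with the overall mean
forward = s.ffill()             # repeats the last valid value
interpolated = s.interpolate()  # linear interpolation between the neighbours

print(pd.DataFrame({"constant": constant, "forward": forward, "interpolated": interpolated}))
```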

The time step can be obtained with `diff()`, which returns the difference between each data point and the previous one.

```python
data['time'].diff()
```

The distribution of the time steps can be summarized via:

```python
data['time'].diff().value_counts()
```

In [62]:
data_missing['time'].diff().value_counts()

time
1.0    44639
Name: count, dtype: int64
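
If the time steps were not all identical, the irregular ones could be located as follows. This is a sketch on a hypothetical time axis with one missing sample; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical time axis in seconds with one missing sample (t = 4)
time = pd.Series([1.0, 2.0, 3.0, 5.0, 6.0])

steps = time.diff().dropna()          # the first element has no predecessor
expected_step = steps.mode().iloc[0]  # most frequent step
gaps = steps[steps != expected_step]  # irregular steps, indexed by the row after the gap

print(expected_step)
print(gaps)
```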

In [63]:
# get difference between temperature values
data_missing[10:30].diff()


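|    | time | temperature |
|---:|-----:|------------:|
| 10 | NaN | NaN |
| 11 | 1.0 | 0.000728 |
| 12 | 1.0 | -0.062319 |
| 13 | 1.0 | -0.130198 |
| 14 | 1.0 | -0.060203 |
| 15 | 1.0 | 0.003234 |
| 16 | 1.0 | 0.048911 |
| 17 | 1.0 | -0.042025 |
| 18 | 1.0 | 0.021264 |
| 19 | 1.0 | 0.033401 |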


The time step is constant at 1 second.
The temperature difference between consecutive data points is on the order of $10^{-2}$ K.
We can therefore assume that the data varies smoothly enough to fill the missing value with the mean of the previous and the next data point.

```python
# assign the result back instead of using inplace=True on a selected column
data['temperature'] = data['temperature'].fillna((data['temperature'].shift() + data['temperature'].shift(-1)) / 2)
```

In [64]:
data_missing['temperature'] = data_missing['temperature'].fillna((data_missing['temperature'].shift() + data_missing['temperature'].shift(-1)) / 2)
data_missing[20:25]

Note: calling `fillna(..., inplace=True)` on a selected column, as in `df[col].method(value, inplace=True)`, triggers a pandas warning that the behavior will change in pandas 3.0, because the intermediate object always behaves as a copy. Use `df[col] = df[col].method(value)` or `df.method({col: value}, inplace=True)` instead.


|    | time | temperature |
|---:|-----:|------------:|
| 20 | 21 | 302.020507 |
| 21 | 22 | 301.932958 |
| 22 | 23 | 301.845408 |
| 23 | 24 | 301.83355 |
| 24 | 25 | 301.785933 |
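
For an isolated gap on an evenly spaced time axis, the neighbour mean used above gives the same result as linear interpolation. The following sketch checks this on three values modelled on rows 20 to 22; it does not generalize to consecutive gaps, where the neighbour mean would leave `NaN` values behind.

```python
import numpy as np
import pandas as pd

# Toy values modelled on rows 20 to 22 of the output above
temperature = pd.Series([302.020507, np.nan, 301.845408])

# mean of the previous and the next value
neighbour_mean = temperature.fillna((temperature.shift() + temperature.shift(-1)) / 2)
# built-in linear interpolation over the index positions
linear = temperature.interpolate(method="linear")

print(neighbour_mean[1], linear[1])
```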


Now you can analyze or plot the data.
