# Quality Check on QPESUMS Data

During the *autoencoder* process, we confirmed a data problem: the QPESUMS record for 20140614 is upside down, while records at other times are not. This prompted us to check the QPESUMS dataset in more detail, since we do not know how many other records are corrupted and have not found any systematic error pattern so far.

We want to sample and check records from the full set of **34,369**. Since we care most about heavy-rainfall cases, we want to include as many of them as possible.

The sizes of the dataset and its subsets are summarized in the following table:

|Dataset|Number of records|
|-------|-----------------|
|Full|34369|
|prec >= 1mm/hr|11260|
|prec >= 5mm/hr|4456|
|prec >= 10mm/hr|2150|
|prec >= 20mm/hr|858|
|prec >= 40mm/hr|236|
|Typhoon warning|1517|


Hence, our sampling scheme is:

- Sample 500 files and check them against the stored radar data
  - all cases with (prec >= 40mm/hr): 236
  - 200 with typhoon warning
  - 64 from others
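
The quota for the last group is simply what remains after the two fixed groups are counted. A minimal sketch of that arithmetic, using the numbers from the scheme above (the variable names are illustrative, not from the original notebook):

```python
# Planned sample size and the two groups fixed in advance (numbers from the scheme above)
TOTAL_SAMPLES = 500
n_heavy_rain = 236   # all cases with prec >= 40mm/hr
n_typhoon = 200      # sampled from typhoon-warning records

# Whatever is left is drawn from the remaining records
n_others = TOTAL_SAMPLES - n_heavy_rain - n_typhoon
print(n_others)  # 64
```

If the number of heavy-rainfall timestamps actually read from disk differs from the table, the last quota has to shrink or grow accordingly to keep the total at 500.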
    

## Read Timestamps

First, let's read in the timestamps and filter out those that appear in more than one list.

In [1]:
import numpy as np
import pandas as pd

# Read date list
tsp01 = pd.read_csv('data/dates_p01.csv')
tsp40 = pd.read_csv('data/dates_p40.csv')
tstyw = pd.read_csv('data/dates_typhoon.csv')

# Check dimension
print(tsp01.shape)
print(tsp40.shape)
print(tstyw.shape)

print(tstyw.head())

(11481, 1)
(239, 1)
(1584, 2)
    timestamp typhoon
0  2013071101  SOULIK
1  2013071102  SOULIK
2  2013071103  SOULIK
3  2013071104  SOULIK
4  2013071105  SOULIK


In [2]:
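# Drop >=1mm/hr timestamps that already appear in the >=40mm/hr list
tsp01 = tsp01.loc[~tsp01.date.isin(tsp40.date),:]
print(tsp01.shape)
# Drop >=1mm/hr timestamps that already appear in the typhoon-warning list
tsp01 = tsp01.loc[~tsp01.date.isin(tstyw.timestamp),:]
print(tsp01.shape)

# Drop typhoon-warning timestamps that already appear in the >=40mm/hr list
tstyw = tstyw.loc[~tstyw.timestamp.isin(tsp40.date),:]
print(tstyw.shape)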

(11242, 1)
(10340, 1)
(1543, 2)
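
The three filtering steps above follow the same pattern: keep the rows whose timestamp is absent from one or more other lists. As a rough sketch, this can be wrapped in a small helper. The function name `exclude_overlap` and the toy timestamps below are made up for illustration and are not part of the original notebook:

```python
import pandas as pd

def exclude_overlap(df, column, *others):
    """Return the rows of df whose `column` value appears in none of `others`."""
    mask = pd.Series(False, index=df.index)
    for other in others:
        mask |= df[column].isin(other)
    return df.loc[~mask]

# Toy data (made up), using the same column names as the real lists
p01 = pd.DataFrame({'date': [2013010101, 2013010102, 2013010103, 2013010104]})
p40 = pd.DataFrame({'date': [2013010102]})
tyw = pd.DataFrame({'timestamp': [2013010103, 2013010105],
                    'typhoon': ['A', 'A']})

p01_only = exclude_overlap(p01, 'date', p40['date'], tyw['timestamp'])
tyw_only = exclude_overlap(tyw, 'timestamp', p40['date'])

print(p01_only['date'].tolist())       # [2013010101, 2013010104]
print(tyw_only['timestamp'].tolist())  # [2013010103, 2013010105]
```

Note that this only removes overlap *between* lists; a timestamp repeated *within* a single list would survive the filter.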


## Sampling

A common tool for random sampling is [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-1.16.0/reference/generated/numpy.random.choice.html). We need to specify `replace=False` (sampling without replacement) to avoid picking the same timestamp twice.

In [3]:
# Keep all heavy-rainfall (prec >= 40mm/hr) cases
ts_selected = list(tsp40.date)
print(len(ts_selected))
# Add 200 samples from typhoon-warning cases
tmp = np.random.choice(tstyw.timestamp, 200, replace=False)
ts_selected += list(tmp)
print(len(ts_selected))
# Add 61 samples from prec >= 1mm/hr to bring the total to 500
ts_selected += list(np.random.choice(tsp01.date, 61, replace=False))
print(len(ts_selected))

239
439
500
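
The draws above are not seeded, so each run of the notebook selects a different set of timestamps. A minimal sketch of a reproducible variant, assuming NumPy 1.17 or later where `numpy.random.default_rng` is available; the seed value and the pool of timestamps are arbitrary and only for illustration:

```python
import numpy as np

# 20 made-up hourly timestamps standing in for a real list
pool = np.arange(2013071101, 2013071121)

# A seeded generator makes the draw repeatable (the seed itself is arbitrary)
rng = np.random.default_rng(12345)
picked = rng.choice(pool, size=5, replace=False)

# Re-creating the generator with the same seed reproduces the same sample
rng_again = np.random.default_rng(12345)
picked_again = rng_again.choice(pool, size=5, replace=False)

print(len(set(picked.tolist())))             # 5: no timestamp is drawn twice
print(np.array_equal(picked, picked_again))  # True
```

Fixing the seed would let the sampled list be regenerated later if the saved file is lost, at the cost of always checking the same records.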


In [4]:
# Check duplicates
print(len(set(ts_selected)))

499
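
The check reports 499 unique values out of 500, so one timestamp appears twice in `ts_selected`. Since overlaps between the lists were removed and each draw used `replace=False`, a row repeated inside one of the input lists is a plausible cause, though that is not confirmed here. A small sketch of how the repeated value could be located and dropped, using a made-up list in place of `ts_selected`:

```python
import pandas as pd

# Made-up selection containing one repeated timestamp
selected = [2013071101, 2013071102, 2013071102, 2013071103]

# Find every value that occurs more than once
s = pd.Series(selected)
repeated = s[s.duplicated(keep=False)].unique().tolist()
print(repeated)  # [2013071102]

# Keep one copy of each timestamp, in sorted order
unique_sorted = sorted(set(selected))
print(len(unique_sorted))  # 3
```

Dropping the repeat leaves one fewer record than planned, so an extra timestamp would need to be drawn to get back to 500.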


In [8]:
ts_selected.sort()
output = pd.DataFrame({'timestamp':ts_selected})
print(output.head())
output.to_csv('data/sampled_timestamp.csv', index=False)

    timestamp
0  2013010818
1  2013011009
2  2013011017
3  2013011108
4  2013021606
