# Data Preparation

This note documents the basic data loading steps.

## Preliminaries

### Basic Setup

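First, we import the required libraries and set the basic parameters and data paths. The column names of the data files are stored in the file that `header_file` points to.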

In [3]:
# Load libraries
import numpy as np
import pandas as pd
import os, sys, csv

# Define paths
header_file = '../../data/readme.csv'
ST_PATH = '../../data/ST/'
VS_PATH = '../../data/VS/'

# Test reading the column names
header = pd.read_csv(header_file).columns
header

Index(['Station', 'instrument', 'P', 'H', 'T', 'RH', 'U', 'V', 'Rad.'], dtype='object')
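The call above parses the whole file just to obtain its header row. As a minimal sketch, assuming `readme.csv` holds the column names in its first line (which the printed `Index` suggests), passing `nrows=0` to `pd.read_csv` parses only the header. The in-memory text below is a stand-in for the real file, and `colnames` is a name introduced here for illustration.

```python
import io
import pandas as pd

# Stand-in for ../../data/readme.csv; content assumed from the Index printed above
sample = io.StringIO("Station,instrument,P,H,T,RH,U,V,Rad.\n")

# nrows=0 parses the header line only and returns an empty DataFrame
header = pd.read_csv(sample, nrows=0).columns

# A plain list is convenient to pass on as `names=` when reading the data files
colnames = header.tolist()
print(colnames)
```

Keeping the names in a list read from the file avoids hard-coding them in more than one place.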

### Inventory and Check Files

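We first use `os.walk()` to find all files in the two data folders, then merge the VS and ST file lists on the time stamp extracted from each file name.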

In [5]:
# Holders for data files
ST_files = []
VS_files = []

# Scan for ST files
for root, dirs, files in os.walk(ST_PATH):
    for fn in files:
        if fn.endswith('.csv'):
            ST_files.append({'time': fn.replace('Banqiao_', '').replace('_ST.csv', ''), 'uri': os.path.join(root, fn)})
ST_files = pd.DataFrame(ST_files)

# Scan for VS files
for root, dirs, files in os.walk(VS_PATH):
    for fn in files:
        if fn.endswith('.csv'):
            VS_files.append({'time': fn.replace('Banqiao_', '').replace('_VS.csv', ''), 'uri': os.path.join(root, fn)})
VS_files = pd.DataFrame(VS_files)

# Merge by time stamp
datafiles = pd.merge(VS_files, ST_files, on='time', suffixes=('_vs', '_st'))

print("VS files: "+str(VS_files.shape))
print("ST files: "+str(ST_files.shape))
print("After merging: "+str(datafiles.shape))
print(datafiles.head())

VS files: (412, 2)
ST files: (412, 2)
After merging: (412, 3)
          time                                    uri_vs  \
0  20180625_06  ../../data/VS/Banqiao_20180625_06_VS.csv   
1  20180626_00  ../../data/VS/Banqiao_20180626_00_VS.csv   
2  20180626_03  ../../data/VS/Banqiao_20180626_03_VS.csv   
3  20180627_03  ../../data/VS/Banqiao_20180627_03_VS.csv   
4  20180627_06  ../../data/VS/Banqiao_20180627_06_VS.csv   

                                     uri_st  
0  ../../data/ST/Banqiao_20180625_06_ST.csv  
1  ../../data/ST/Banqiao_20180626_00_ST.csv  
2  ../../data/ST/Banqiao_20180626_03_ST.csv  
3  ../../data/ST/Banqiao_20180627_03_ST.csv  
4  ../../data/ST/Banqiao_20180627_06_ST.csv  
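`pd.merge` defaults to an inner join, so a time stamp present in only one folder would be dropped silently. Here both lists have 412 rows and the merged table also has 412, so nothing was lost. For data sets where the counts differ, one possible check is an outer merge with `indicator=True`. The sketch below uses small hypothetical file lists (`vs`, `st`, and their paths are made up for illustration).

```python
import pandas as pd

# Hypothetical file lists; one time stamp in each list has no counterpart
vs = pd.DataFrame({'time': ['20180625_06', '20180626_00'],
                   'uri': ['VS/a.csv', 'VS/b.csv']})
st = pd.DataFrame({'time': ['20180625_06', '20180626_03'],
                   'uri': ['ST/a.csv', 'ST/c.csv']})

# An outer merge keeps every time stamp; the _merge column records
# whether a row came from the left list, the right list, or both
check = pd.merge(vs, st, on='time', how='outer',
                 suffixes=('_vs', '_st'), indicator=True)

# Rows not marked 'both' are files without a matching partner
unmatched = check[check['_merge'] != 'both']
print(unmatched[['time', '_merge']])
```

An empty `unmatched` table would confirm that every VS file has an ST partner.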


In [7]:
# Function to read a single sounding file in CSV format
def read_sounding_csv(furi, colnames=('Station', 'instrument', 'P', 'H', 'T', 'RH', 'U', 'V', 'Rad.'), verbose=0):
    # Because names= is passed, pandas treats the file as having no header row
    data = pd.read_csv(furi, names=list(colnames))
    if verbose > 0:
        print('Read from file: ' + furi + ', data dimension: ' + str(data.shape))
    return data

data_vs = read_sounding_csv(datafiles['uri_vs'].iloc[0], verbose=1)
data_st = read_sounding_csv(datafiles['uri_st'].iloc[0], verbose=1)


Read from file: ../../data/VS/Banqiao_20180625_06_VS.csv, data dimension: (701, 9)
Read from file: ../../data/ST/Banqiao_20180625_06_ST.csv, data dimension: (701, 9)
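The cell above reads one file pair. A natural next step is to read every file in the list and stack the results. This is a sketch under assumptions, not the notebook's own method: it writes two tiny header-less CSV files with made-up values to a temporary folder so that it runs on its own, and the names `uris`, `frames`, and `all_st` are introduced here for illustration.

```python
import os
import tempfile
import pandas as pd

colnames = ['Station', 'instrument', 'P', 'H', 'T', 'RH', 'U', 'V', 'Rad.']

# Write two tiny header-less CSV files as stand-ins for real soundings
# (all values below are made up)
tmpdir = tempfile.mkdtemp()
rows = "Banqiao,ST,1000.0,10,28.5,70,1.0,-2.0,0\nBanqiao,ST,990.0,95,27.9,72,1.5,-2.5,0\n"
uris = {}
for t in ['20180625_06', '20180626_00']:
    uris[t] = os.path.join(tmpdir, 'Banqiao_' + t + '_ST.csv')
    with open(uris[t], 'w') as f:
        f.write(rows)

# Read each file, tag its rows with the time stamp, then stack everything
frames = []
for t, uri in uris.items():
    df = pd.read_csv(uri, names=colnames)
    df['time'] = t
    frames.append(df)
all_st = pd.concat(frames, ignore_index=True)
print(all_st.shape)
```

With the real data, the loop would iterate over the `time` and `uri_st` (or `uri_vs`) columns of `datafiles` instead of the temporary files.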
