## Exercise with Data Wrangling

This is an excerpt of what I've been trying to do with a large amount of weather data that is not stored in an intuitive format. <br> 
The main goals of sorting through the dataset are: <br>

1. Read in all the files, which cover several weather station sites and years.
2. Pick out hourly temperature and precipitation from raw data.
3. Store temperature and precipitation information in separate dataframes.
4. Identify gap size in temperature & precipitation data.
5. Select site-years that have both temperature and precipitation data with acceptably small gaps.
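
Steps 4 and 5 are not part of the excerpt below, so here is a minimal sketch of one way to approach them. It assumes that "gap size" means the longest run of consecutive missing hours in a site's record. The function name `longest_gap`, the site names, and the `max_gap` threshold are all made up for illustration.

```python
import numpy as np
import pandas as pd

def longest_gap(series):
    """Return the length of the longest run of consecutive NaNs in a Series."""
    is_missing = series.isna()
    # Every valid value bumps the counter, so all NaNs that follow the same
    # valid value share one group id and form one run.
    group_id = (~is_missing).cumsum()
    # Summing the boolean mask per group counts the NaNs in each run.
    run_lengths = is_missing.groupby(group_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

# Hypothetical hourly record with a 1-hour gap and a 3-hour gap
hours = pd.date_range("1961-01-01", periods=8, freq="60min")
temps = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, 4.0], index=hours)
print(longest_gap(temps))  # 3

# Hypothetical frame with one column per site, as in the notebook's df_temp
df_sites = pd.DataFrame({
    "site_a": [1.0, np.nan, 2.0, 3.0],
    "site_b": [np.nan, np.nan, np.nan, 1.0],
})
max_gap = 2                            # assumed threshold, in hours
gaps = df_sites.apply(longest_gap)     # one gap size per site column
acceptable = gaps[gaps <= max_gap].index.tolist()
print(acceptable)  # ['site_a']
```

Running the same selection on the temperature and precipitation frames and intersecting the two lists of sites would give the site-years that pass for both variables.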

Some more info on the weather dataset:
- The data is too big to put on GitHub! Please find a subset of the data [here](https://drive.google.com/open?id=1JwjLBBS8IqNSriNKN6-BiCb5HpTe3kss).
- NOAA maintains this dataset in the Integrated Surface Hourly (ISH) database.
- Raw data can be downloaded [here](ftp://ftp.ncdc.noaa.gov/pub/data/noaa/) if you are interested.
- The data is recorded hourly at weather stations across the US, and the information is stored as ASCII text in a special format.
- Each data file holds one year of records for one weather station, and each line of text packs in many fields recorded on an hourly timestep, so a file will include 24*365 = 8760 lines of data in a non-leap year. <br>
There are three parts of each record with relevant information to extract (refer to "ISH_Manual.pdf" included in the folder):

> 1. The control data section (p.4): <br>
In here you can find timestamps, and weather station info
> 2. The mandatory data section (p.7 & p.9): <br>
You will find temperature data here
> 3. The additional data section (p.12): <br>
You will find precipitation data here
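
To make the three sections concrete, here is a minimal sketch that pulls the timestamp, temperature, and precipitation out of a single record. It uses the same character positions as the code further down. The helper names `make_record` and `parse_record` are made up for illustration, and the test line is synthetic (zero-filled apart from the fields of interest), not a real ISH record.

```python
import numpy as np
import pandas as pd

def make_record(stamp, temp_field, precip_field=None):
    """Build a synthetic line with only the fields used here filled in."""
    line = "0" * 15 + stamp                    # chars 15-26: YYYYMMDDHHMM
    line = line.ljust(87, "0") + temp_field    # chars 87-91: signed temperature
    if precip_field is not None:
        # "AA1" tag, 2 filler chars, then the 4-char precipitation value
        line += "ADD" + "AA1" + "01" + precip_field
    return line

def parse_record(line):
    """Return (timestamp, temperature, precipitation) for one record line."""
    timestamp = pd.to_datetime(line[15:27], format="%Y%m%d%H%M")

    raw_temp = int(line[87:92])
    temp = np.nan if raw_temp == 9999 else raw_temp / 10   # 9999 = missing

    precip = np.nan
    pos = line.find("AA1")           # the AA1 tag may be absent from a line
    if pos != -1:
        raw_precip = int(line[pos + 5:pos + 9])
        if raw_precip != 9999:       # 9999 = missing
            precip = raw_precip
    return timestamp, temp, precip

full = parse_record(make_record("196101010300", "-0056", "0012"))
print(full)     # (Timestamp('1961-01-01 03:00:00'), -5.6, 12)

missing = parse_record(make_record("196101010400", "+9999"))
print(missing)  # (Timestamp('1961-01-01 04:00:00'), nan, nan)
```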

A couple things to think about:

- This is code I put together when I had just started transitioning into Python (and I still am!). I imagine it has lots of room for improvement. Feel free to work with it and see what other ideas you come up with!
- What are some good practices when wrangling data?
- The code is slow : / <br> In reality, I have data from ~200-273 sites to process for the years 1961-1990. Are there ways to make this process faster?
- Data management?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import glob
import time as ti
# from profilestats import profile

In [2]:
file_list = glob.glob("./data/1961/*")  # make sure to update this to wherever you put your data.
# Refer to the glob tutorial EL put together for more info on glob:
# "exercise_geo_plotting/TUTORIAL_glob.ipynb"

# for j in file_list[0:1]:
# .squeeze("columns") replaces the squeeze=True argument removed in pandas 2.0
df = pd.read_csv(file_list[0], sep='\t', header=None, encoding="latin1").squeeze("columns")

### 1. Read in individual weather files

In [3]:
# Speed the code up (about 10x in the run below) by collecting the raw timestamp strings in a list
# and converting them with a single pd.to_datetime call, instead of converting one timestamp at a time

t0 = ti.time()
timeStampList = list()
for k in range(5000):
    timeStamp = pd.to_datetime(df[k][15:27], format="%Y%m%d%H%M")
    timeStampList.append(timeStamp)
print('The timestamp way, code takes:      ', ti.time()-t0, ' seconds to run')
               
               
t0 = ti.time()
timeList = list()
for k in range(5000):
    timeList.append( df[k][15:27])
tVector = pd.to_datetime(timeList, format="%Y%m%d%H%M" )  
print('The list operation way, code takes: ', ti.time()-t0, ' seconds to run')
               
print('The original format: ')
print( timeStampList [0:3])
print('The new format: ')
print(tVector[0:3])


The timestamp way, code takes:       0.6104340553283691  seconds to run
The list operation way, code takes:  0.06109285354614258  seconds to run
The original format: 
[Timestamp('1961-01-01 00:00:00'), Timestamp('1961-01-01 01:00:00'), Timestamp('1961-01-01 02:00:00')]
The new format: 
DatetimeIndex(['1961-01-01 00:00:00', '1961-01-01 01:00:00',
               '1961-01-01 02:00:00'],
              dtype='datetime64[ns]', freq=None)


In [4]:
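# Same speed, but shorter code: preallocate a NaN array instead of appending to a list

for j in file_list[0:1]:  
    # .squeeze("columns") replaces the squeeze=True argument removed in pandas 2.0
    df = pd.read_csv(j, sep='\t', header=None, encoding="latin1").squeeze("columns")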
    lines = np.arange(df.shape[0])        
    timeList = list()
    temp = list()
    precip = list() 
    temp2 = np.full( df.shape[0], np.nan)
    precip2 = np.full( df.shape[0], np.nan)
    
    t0 = ti.time()
    for k in lines:   
        precip_pos = df[k].find("AA1")        
        if precip_pos == -1:
            precip.append("NaN")
        elif int(df[k][precip_pos+5:precip_pos+9]) == 9999:
            precip.append("NaN")
        else:
            precip.append(int(df[k][precip_pos+5:precip_pos+9]))
    print( 'original way: ',ti.time() - t0)       
    
    t0 = ti.time()
    for k in lines:  
        precip_pos = df[k].find("AA1")           
        if  not(precip_pos == -1)  and  not(int(df[k][precip_pos+5:precip_pos+9]) == 9999):
            precip2[k] =  int(df[k][precip_pos+5:precip_pos+9]) 
    print( 'New way: ',ti.time() - t0) 

original way:  0.0860598087310791
New way:  0.08506226539611816
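
Both versions above still run Python string operations once per line, which may be why they take about the same time. One idea to try is pandas' vectorized `.str` accessor, which drops the explicit loop. Here's a minimal sketch under that assumption. I have not timed it on the real files. The `fake_line` helper and the three synthetic lines are made up so the example runs on its own.

```python
import numpy as np
import pandas as pd

def fake_line(stamp, temp, precip=None):
    """Synthetic record: zero filler plus the fields parsed in this notebook."""
    base = ("0" * 15 + stamp).ljust(87, "0") + temp
    return base + ("ADDAA101" + precip if precip else "")

# Hypothetical stand-in for the Series of raw lines read from one file
raw = pd.Series([
    fake_line("196101010000", "+0012", "0005"),
    fake_line("196101010100", "+9999"),           # missing temp, no AA1 tag
    fake_line("196101010200", "-0034", "9999"),   # missing precip value
])

# Timestamps: slice every line at once, then convert in a single call
times = pd.to_datetime(raw.str[15:27], format="%Y%m%d%H%M")

# Temperature: slice, convert to int, mask the 9999 missing code, rescale
temp_raw = raw.str[87:92].astype(int)
temp = (temp_raw / 10).where(temp_raw != 9999)

# Precipitation: the regex captures the 4 digits that sit 2 characters after
# the "AA1" tag; lines without the tag come back as NaN
precip = raw.str.extract(r"AA1\d{2}(\d{4})", expand=False).astype(float)
precip = precip.where(precip != 9999)

print(temp.tolist())    # [1.2, nan, -3.4]
print(precip.tolist())  # [5.0, nan, nan]
```

One difference from the loop version: if the characters after "AA1" are not digits, the regex simply gives NaN, whereas `int(...)` in the loop would raise an error.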


In [5]:
# Note: this code will take a couple minutes to finish running

t0 = ti.time()
years = np.arange(1961,1966) # years of data 
# years = np.arange(1961,1962) # years of data 
df_temp_all = pd.DataFrame() # setting up an empty dataframe to store final info
df_precip_all = pd.DataFrame() # same for precip 
 
for i in years:
    file_list = glob.glob("./data/" + str(i) + "/*")  # make sure to update this to wherever you put your data.
    # Refer to the glob tutorial EL put together for more info on glob:
    # "exercise_geo_plotting/TUTORIAL_glob.ipynb"
    timepoints = pd.date_range(start = str(i) + "-01-01", 
                               end = str(i) + "-12-31 23:00:00", freq = "H")
    df_temp = pd.DataFrame(index=timepoints)
    df_precip = pd.DataFrame(index=timepoints)  
    
    for j in file_list:  
        # .squeeze("columns") replaces the squeeze=True argument removed in pandas 2.0
        df = pd.read_csv(j, sep='\t', header=None, encoding="latin1").squeeze("columns")
        timeList = list() 
        temp = np.full( df.shape[0], np.nan)
        precip  = np.full( df.shape[0], np.nan )
                          
        for k in np.arange(df.shape[0]):
            # time append 
            timeList.append(df[k][15:27])
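            # temp: 9999 marks a missing value. The original test `~(x == 9999)` was
            # always truthy, because ~True == -2 and ~False == -1 in Python.
            if int(df[k][87:92]) != 9999: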
                temp[k] = int(df[k][87:92])/10 
            # precip store
            p_pos = df[k].find("AA1")        
            if  not(p_pos == -1)  and  not(int(df[k][p_pos+5:p_pos+9]) == 9999):
                precip[k] =  int(df[k][p_pos+5:p_pos+9])   
                
        timeDate =  pd.to_datetime(timeList, format="%Y%m%d%H%M" ) 
        
        WBAN_id = j.split("/")[-1][7:12] # weather station site ID
        df_t = pd.DataFrame({WBAN_id: temp }, index=timeDate)  
        df_t = df_t[~df_t.index.duplicated()] # some sites have data records on sub-hourly time scale
                                              # but I only want hourly data
        # join_axes was removed in pandas 1.0; reindexing onto the hourly index is the equivalent
        df_temp = pd.concat([df_temp, df_t], axis=1, sort=False).reindex(df_temp.index)
        
        df_p = pd.DataFrame({WBAN_id:precip}, index=timeDate)
        df_p = df_p[~df_p.index.duplicated()]
        # join_axes was removed in pandas 1.0; reindexing onto the hourly index is the equivalent
        df_precip = pd.concat([df_precip, df_p], axis=1, sort=False).reindex(df_precip.index)
        
    # df_temp and df_precip now hold every site for year i. The original re-created
    # them as empty frames at this point, which discarded the data before the concat below.
                          
    frames_temp = [df_temp_all, df_temp]
    df_temp_all = pd.concat(frames_temp, sort=False)
    
    frames_precip = [df_precip_all, df_precip]
    df_precip_all = pd.concat(frames_precip, sort=False)
    
print('runtime new code: ', ti.time()-t0)
print('old runtime is about 200? Not sure because I had an error')

runtime new code:  58.46745038032532
old runtime is about 200? Not sure because I had an error
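
Another thing that might help with speed: the loop above calls `pd.concat` once per file, which copies the growing dataframe every time. A common alternative is to collect each site's Series in a dict and concatenate once per year. Here's a minimal sketch of that idea, assuming each Series has already had its duplicate timestamps dropped. The function name `combine_sites` and the site IDs are made up for illustration.

```python
import numpy as np
import pandas as pd

def combine_sites(site_series, year):
    """Align a dict of per-site Series on a full hourly index for one year."""
    timepoints = pd.date_range(start=f"{year}-01-01",
                               end=f"{year}-12-31 23:00:00", freq="60min")
    # One concat over all sites (dict keys become column names),
    # then one reindex onto the hourly grid.
    combined = pd.concat(site_series, axis=1)
    return combined.reindex(timepoints)

# Hypothetical sites with partly overlapping timestamps
idx_a = pd.to_datetime(["1961-01-01 00:00", "1961-01-01 01:00"])
idx_b = pd.to_datetime(["1961-01-01 01:00", "1961-01-01 02:00"])
site_series = {
    "00001": pd.Series([1.0, 2.0], index=idx_a),
    "00002": pd.Series([3.0, 4.0], index=idx_b),
}

df_year = combine_sites(site_series, 1961)
print(df_year.shape)  # (8760, 2)
print(df_year.head(3))
```

The same pattern applies across years: append each year's frame to a list and call `pd.concat` once on the list at the end, instead of growing `df_temp_all` inside the loop.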
