## Exercise with Data Wrangling
The entire code now runs in **11.8 seconds ± 370 ms**, down from **2 minutes 49 seconds ± 9.62 s** for the original code on my PC, which is roughly 14 times faster.

### Things I did to make the code faster
- I use `read_fwf` instead of `read_csv`, with the column breakpoints passed in as `colspecs`.
- I got rid of all but one `for` loop (`for` loops are slow in Python).
- I check the data for gaps *before* analyzing it and adding it to the master dataframe, so files that fail the check are never merged.
- If a command drops a lot of data (for example, removing times outside the growing season), I run it first so that later steps work on less data in memory.
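
As a minimal sketch of the first and last points, the snippet below reads a few fixed-width records with `read_fwf` and drops out-of-season rows before doing anything else. The records in `raw` and the two-column layout are made up for illustration; they are not the real ISH layout used in the cell below.

```python
import io
import pandas as pd

# Hypothetical fixed-width records: a 12-character timestamp (YYYYMMDDHHMM)
# followed by a 4-character temperature field.
raw = (
    "200004300000 120\n"
    "200005010000 135\n"
    "200005010100 140\n"
    "200011010000  90\n"
)

# Column boundaries as half-open (start, end) character offsets
colspecs = [(0, 12), (12, 16)]
df = pd.read_fwf(io.StringIO(raw), names=["time", "temp"],
                 colspecs=colspecs, header=None)

# Drop out-of-season rows first so every later step touches less data.
# Zero-padded "MM-DD" strings compare correctly as plain strings.
month_day = pd.to_datetime(df["time"].astype(str),
                           format="%Y%m%d%H%M").dt.strftime("%m-%d")
in_season = df[month_day.between("05-01", "10-31")]
print(in_season)
```

Only the two records from 1 May survive the filter, so any later step (deduplication, gap check, merge) runs on half the rows in this toy case.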

In [1]:
import numpy as np
import pandas as pd
import glob

In [2]:
%%timeit

fnames = glob.glob("./data/*/*")

# list of breakpoints and column names from ISH_Manual.PDF
colnames = ["time", "temp", "precip"]
colspecs = [(15,27), (87,91), (105, 8193)]

crit_rows = 3 # Maximum allowed missing hours
growseason = pd.date_range(start='2000-05-01', end='2000-10-31').strftime('%m-%d')

df_temp_all = pd.DataFrame(columns=["time"], dtype=int)
df_precip_all = pd.DataFrame(columns=["time"], dtype=int)

for name in fnames:
    # Read in data file
    df = pd.read_fwf(name, names=colnames, colspecs=colspecs, header=None, 
                     encoding="latin_1", dtype={'time':int, 'temp':int, 'precip':str})
    
    # Keep only rows where month and day are in growing season
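    # (parse the integer YYYYMMDDHHMM stamps explicitly; raw integers would be read as nanoseconds since 1970)
    month_day = pd.to_datetime(df["time"].astype(str), format="%Y%m%d%H%M").dt.strftime('%m-%d')
    df = df[month_day.isin(growseason)]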
    
    # Remove duplicate hours, keep only the first measurement per hour
    df = df[~df["time"].astype(str).str.slice(4, 10).duplicated(keep="first")]
    
    # Get precipitation data (or NaN if AA1 is not in extra data section)
    df["precip"] = df[df['precip'].str.find("AA1")!=-1]['precip'].str.split("AA1").str.get(1).str.slice(5, 8)
    
    # Replace placeholder 9999 with NaN values
    df["temp"] = df["temp"].replace(9999, np.nan)
    
    # If there are no gaps bigger than crit_rows, then process data
    if df.ffill(limit=crit_rows).iloc[crit_rows:].isnull().sum().sum() == 0:
        
        # Get the year and site name from the filename
        year_site = name.split("-")[-1]+"_"+name.split("-")[-2]    
        
        # Rename the precipitation and temperature data by year and ID
        temp = df.rename(columns={"temp":year_site}).drop("precip", axis=1)
        precip = df.rename(columns={"precip":year_site}).drop("temp", axis=1)
        
        # Merge the data onto the master dataframes
        df_temp_all = temp.merge(df_temp_all, how="outer", on="time", sort=False)
        df_precip_all = precip.merge(df_precip_all, how="outer", on="time", sort=False)

11.8 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
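
The gap check in the loop is the least obvious step, so here is a small standalone sketch of the idea: forward-fill at most `max_gap` values, and anything still missing afterwards must sit in a longer gap. The helper name `has_long_gap` and the example series are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd

def has_long_gap(series: pd.Series, max_gap: int) -> bool:
    """Return True if `series` has a run of more than `max_gap` consecutive NaNs."""
    # Forward-fill at most `max_gap` values per run of missing data
    filled = series.ffill(limit=max_gap)
    # The first `max_gap` rows may have no earlier value to fill from, so skip them
    return bool(filled.iloc[max_gap:].isnull().any())

# Three missing hours in a row: acceptable when max_gap is 3
short_gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
# Four missing hours in a row: one value stays NaN after the limited fill
long_gap = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])

print(has_long_gap(short_gap, max_gap=3))
print(has_long_gap(long_gap, max_gap=3))
```

Because the check is a single vectorized expression, it needs no inner loop over rows, and a file that fails it is skipped before any renaming or merging happens.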
