<a id='pd-ops'></a>  
# Pandas Operations  

## Key points  
- read_csv  
  - delim_whitespace=True   
- apply function to columns
  - df["TEMP_C"] = df["TEMP_F"].apply(tc.fahr_to_celsius)  
- apply lambda to columns
  - df["TEMP_C"] = df["TEMP_F"].apply(lambda x: (x - 32) / 1.8 )  
- integer dates  
  - astype(str).str.slice(start=, stop=)  
  - pd.to_datetime(df["TIME"].astype(str), format="%Y%m", exact=False)  
  - exact=False lets the format match anywhere in the string, so "%Y%m" reads only the year and month and ignores the trailing day, hour, and minute digits  
- index  
  - .set_index("YEAR_MONTH_DT",drop=True)  (drop=True removes the column once it becomes the index; set_index returns a new DataFrame, so assign the result)  
  - filt = df.index.to_series().dt.year == 1969  
- groupby  
  - group_agg = df.groupby("YEAR_MONTH")[cols].agg('mean')  
- filter  
  - filt = df["MONTH"] == "04"  
  - cols = ["STATION", "TEMP_F", "TEMP_C", "YEAR_MONTH"]  
  - sel = df[filt][cols]  
- glob  
  - file_list = glob.glob(os.path.join(data_dir,"0*txt"))  
- isna  
  - df.isna().sum()  
- join  
  - join1 = monthly_data.merge(reference_temps, on='month')  

[Import](#pd-ops-import)  
[Attributes](#pd-ops-attributes)   
[Aggregate](#pd-ops-aggregate)    
[Transform](#pd-ops-transform)  
[Group](#pd-ops-group)  
[Filter](#pd-ops-filter)  
[Exercise](#pd-ops-exercise) 

## Libraries

In [1]:
import os
import glob
import pandas as pd
import temp_converter as tc

## Parameters

In [2]:
# shows result of cell without needing print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr_or_assign"

In [3]:
pd.set_option('display.max_rows',10)
#defaults: pd.set_option('display.max_columns',20,'display.max_rows',60,'display.max_colwidth',50)

## Directories

In [4]:
home_dir = os.path.expanduser("~")
work_dir = os.path.join(home_dir, 'eda', 'gpy')
data_dir = os.path.join(work_dir,'data') 
os.chdir(work_dir)
os.getcwd()

'/Users/forest/eda/gpy'

<a id='pd-ops-import'></a>
## Import

[Return to Start of Notebook](#pd-ops)  

### file path

In [5]:
file_path = os.path.join(data_dir,"029440.txt")

'/Users/forest/eda/gpy/data/029440.txt'

### delim_whitespace=True

In [6]:
data = pd.read_csv(file_path,
                   delim_whitespace=True,
                   usecols=["USAF", "YR--MODAHRMN", "DIR", "SPD", "GUS", "TEMP", "MAX", "MIN"],
                   na_values=["*", "**", "***", "****", "*****", "******"])
data.head(3)

Unnamed: 0,USAF,YR--MODAHRMN,DIR,SPD,GUS,TEMP,MAX,MIN
0,29440,190601010600,90.0,7.0,,27.0,,
1,29440,190601011300,,0.0,,27.0,,
2,29440,190601012000,,0.0,,25.0,,
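
A minimal sketch of the same import on an in-memory sample, assuming a small whitespace-delimited table (the sample rows are made up for illustration). `sep=r"\s+"` is used here as the equivalent of `delim_whitespace=True`, which newer pandas releases deprecate.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the station file
sample = io.StringIO(
    "USAF  YR--MODAHRMN  DIR  SPD  GUS  TEMP\n"
    "29440 190601010600  090  7    ***  27\n"
    "29440 190601011300  ***  0    ***  27\n"
)

df = pd.read_csv(
    sample,
    sep=r"\s+",                      # any run of whitespace separates fields
    usecols=["USAF", "DIR", "SPD", "TEMP"],
    na_values=["*", "**", "***"],    # asterisk runs mark missing values
)
print(df)
```

Because one `DIR` value is missing, pandas stores that column as float64, which matches the dtypes listed for the real file.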


<a id='pd-ops-attributes'></a>
## Attributes

[Return to Start of Notebook](#pd-ops)  

Attributes  (minimally sufficient pandas)
- shape  
- columns  
- dtypes  
- index  

### shape

In [7]:
data.shape

(757983, 8)

In [8]:
row_count = data.shape[0]

757983

In [9]:
col_count = data.shape[1]

8

### columns

In [10]:
list(data.columns)

['USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN']

### dtypes

In [11]:
data.dtypes

USAF              int64
YR--MODAHRMN      int64
DIR             float64
SPD             float64
GUS             float64
TEMP            float64
MAX             float64
MIN             float64
dtype: object

### index

In [12]:
data.index

RangeIndex(start=0, stop=757983, step=1)

<a id='pd-ops-aggregate'></a>
## Aggregate

[Return to Start of Notebook](#pd-ops)  

Aggregation Methods  (minimally sufficient pandas)
- describe
- count, sum, max, min  
- idxmax, idxmin 
- all, any  
- mean, median, mode, std, var      
- nunique   

### describe

In [13]:
data.describe()

Unnamed: 0,USAF,YR--MODAHRMN,DIR,SPD,GUS,TEMP,MAX,MIN
count,757983.0,757983.0,699256.0,750143.0,19906.0,754862.0,23869.0,23268.0
mean,29440.0,199997400000.0,233.499846,6.742641,20.147996,40.409778,45.373539,35.783737
std,0.0,1629544000.0,209.707258,4.296191,7.415138,17.898715,18.242679,17.195427
min,29440.0,190601000000.0,10.0,0.0,11.0,-33.0,-26.0,-32.0
25%,29440.0,198908300000.0,130.0,3.0,14.0,29.0,32.0,26.0
50%,29440.0,200404200000.0,200.0,7.0,18.0,39.0,44.0,36.0
75%,29440.0,201205000000.0,270.0,9.0,26.0,54.0,60.0,49.0
max,29440.0,201910000000.0,990.0,61.0,108.0,91.0,91.0,81.0


<a id='pd-ops-transform'></a>
## Transform

[Return to Start of Notebook](#pd-ops)  

### rename

In [14]:
new_names = {"USAF": "STATION",
             "TEMP": "TEMP_F",
             "MAX": "MAX_F",
             "MIN": "MIN_F",             
             "YR--MODAHRMN": "TIME",
             "SPD": "SPEED",
             "GUS": "GUST"}
data = data.rename(columns=new_names)
data.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F
0,29440,190601010600,90.0,7.0,,27.0,,
1,29440,190601011300,,0.0,,27.0,,
2,29440,190601012000,,0.0,,25.0,,


### copy

In [15]:
dfc = data.copy();

### apply

#### function (to single column)

In [16]:
dfc["TEMP_C"] = dfc["TEMP_F"].apply(
    tc.fahr_to_celsius)
dfc.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778
1,29440,190601011300,,0.0,,27.0,,,-2.777778
2,29440,190601012000,,0.0,,25.0,,,-3.888889


#### function (to multiple columns)

In [17]:
dfc[["TEMP_C", "MIN_C", "MAX_C"]] = dfc[["TEMP_F", "MIN_F", "MAX_F"]].apply(
    tc.fahr_to_celsius)
dfc.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,


#### lambda to single column

In [18]:
dfc = data.copy();
dfc.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F
0,29440,190601010600,90.0,7.0,,27.0,,
1,29440,190601011300,,0.0,,27.0,,
2,29440,190601012000,,0.0,,25.0,,


In [19]:
dfc["TEMP_C"] = dfc["TEMP_F"].apply(
    lambda x: (x - 32) / 1.8 )
dfc.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778
1,29440,190601011300,,0.0,,27.0,,,-2.777778
2,29440,190601012000,,0.0,,25.0,,,-3.888889


#### lambda (to multiple columns)

In [20]:
dfc[["TEMP_C", "MIN_C", "MAX_C"]] = dfc[["TEMP_F", "MIN_F", "MAX_F"]].apply(
    lambda x: (x - 32) / 1.8 )
dfc.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,
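
The four `apply` variants above can be compared on a toy frame. This is a sketch only: `fahr_to_celsius` below is a hypothetical stand-in for `tc.fahr_to_celsius`, whose source is not shown in this notebook, and the data values are made up.

```python
import pandas as pd

def fahr_to_celsius(temp_f):
    """Hypothetical stand-in for tc.fahr_to_celsius."""
    return (temp_f - 32) / 1.8

df = pd.DataFrame({"TEMP_F": [27.0, 25.0, None], "MAX_F": [32.0, None, 50.0]})

# Series.apply calls the function once per value
by_apply = df["TEMP_F"].apply(fahr_to_celsius)

# DataFrame.apply passes each column (a Series) to the function
cols_apply = df[["TEMP_F", "MAX_F"]].apply(fahr_to_celsius)

# Plain arithmetic is vectorized and gives the same numbers
vectorized = (df[["TEMP_F", "MAX_F"]] - 32) / 1.8
```

For simple arithmetic like this, the vectorized form is usually the faster choice on large frames; `apply` is mainly useful when the function cannot be expressed as column arithmetic.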


### datetime

#### option 1: astype(str).str.slice(start=, stop=)

In [21]:
dfdt1 = dfc.copy()
dfdt1.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,


In [22]:
dfdt1["YEAR_MONTH"] = dfdt1["TIME"].astype(str).str.slice(start=0, stop=6)
dfdt1["YEAR"] = dfdt1["TIME"].astype(str).str.slice(start=0, stop=4)
dfdt1["MONTH"] = dfdt1["TIME"].astype(str).str.slice(start=4, stop=6)
dfdt1.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C,YEAR_MONTH,YEAR,MONTH
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,,190601,1906,01
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,,190601,1906,01
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,,190601,1906,01


In [23]:
print(dfdt1["YEAR"].nunique())
print(dfdt1["MONTH"].nunique())
print(dfdt1["YEAR_MONTH"].nunique())

51
12
601
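
A small sketch of the slicing step on two made-up `YYYYMMDDHHMM` integers. It shows that the pieces stay strings, which is why a later filter compares `MONTH` with `"04"` rather than `4`.

```python
import pandas as pd

times = pd.Series([190601010600, 201909302100])  # made-up YYYYMMDDHHMM integers
as_text = times.astype(str)

year = as_text.str.slice(start=0, stop=4)
month = as_text.str.slice(start=4, stop=6)   # stays a string, so "01" keeps its zero
year_month = as_text.str[:6]                 # bracket slicing is equivalent
```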


#### option 2: pd.to_datetime()

In [24]:
dfdt2 = dfc.copy()
dfdt2.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,


#### .astype(str), format=, exact=False  
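- exact=False allows the format to match anywhere in the string; with format="%Y%m" the day, hour, and minute digits are ignored, so every timestamp maps to the first day of its month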

In [25]:
dfdt2["YEAR_MONTH_DT"] = pd.to_datetime(dfdt2["TIME"].astype(str), format="%Y%m", exact=False)
dfdt2.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C,YEAR_MONTH_DT
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,,1906-01-01
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,,1906-01-01
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,,1906-01-01
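
A sketch of what `exact=False` does here, on two made-up timestamps. It compares the loose match with an explicit alternative (cutting the string to six characters first), which should give the same month-start dates.

```python
import pandas as pd

times = pd.Series([190601010600, 190602151300])  # made-up YYYYMMDDHHMM integers
text = times.astype(str)

# exact=False: "%Y%m" may match only part of each string
loose = pd.to_datetime(text, format="%Y%m", exact=False)

# Alternative: cut the string to six characters and match it exactly
strict = pd.to_datetime(text.str[:6], format="%Y%m")
```

The explicit slice is a little more verbose but makes it obvious which digits are being parsed.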


#### set_index

In [26]:
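# Note: set_index returns a new DataFrame; without assignment (dfdt2 = dfdt2.set_index(...)) dfdt2 is unchanged
dfdt2.set_index("YEAR_MONTH_DT",drop=True)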
dfdt2.head(3)

Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C,YEAR_MONTH_DT
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,,1906-01-01
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,,1906-01-01
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,,1906-01-01
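
The output above still shows `YEAR_MONTH_DT` as an ordinary column because `set_index` does not modify the frame in place. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR_MONTH_DT": pd.to_datetime(["1906-01-01", "1906-02-01"]),
    "TEMP_F": [27.0, 25.0],
})

df.set_index("YEAR_MONTH_DT")            # result discarded, df unchanged
unchanged = list(df.columns)

indexed = df.set_index("YEAR_MONTH_DT")  # keep the returned frame
```

With the default `drop=True`, the column is removed from `indexed` once it becomes the index.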


<a id='pd-ops-group'></a>
## Group

[Return to Start of Notebook](#pd-ops)  

The pd.concat function accepts a list of tables which it combines:  
- by row when using axis=0 (this is the default)  
- by column when using axis=1
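
A short sketch of both directions, using made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"TEMP_F": [27.0, 25.0]})
b = pd.DataFrame({"TEMP_F": [30.0]})
c = pd.DataFrame({"SPEED": [7.0, 0.0]})

rows = pd.concat([a, b], ignore_index=True)   # axis=0: stack rows
cols = pd.concat([a, c], axis=1)              # axis=1: align on index, add columns
```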

### .groupby

In [27]:
print(len(dfdt1))
print(dfdt1["YEAR_MONTH"].nunique())

757983
601


In [28]:
grouped = dfdt1.groupby("YEAR_MONTH")

print(type(grouped))
print(type(grouped.groups.keys()))
print(len(grouped))

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<class 'dict_keys'>
601


### .get_group()

In [29]:
key = "190601"
group1 = grouped.get_group(key)
print(type(group1))
group1.head(3)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,STATION,TIME,DIR,SPEED,GUST,TEMP_F,MAX_F,MIN_F,TEMP_C,MIN_C,MAX_C,YEAR_MONTH,YEAR,MONTH
0,29440,190601010600,90.0,7.0,,27.0,,,-2.777778,,,190601,1906,01
1,29440,190601011300,,0.0,,27.0,,,-2.777778,,,190601,1906,01
2,29440,190601012000,,0.0,,25.0,,,-3.888889,,,190601,1906,01


### aggregate

In [30]:
mean_cols = ["DIR", "SPEED", "GUST", "TEMP_F", "TEMP_C"]

['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C']

#### option 1

In [31]:
# Collect one row of aggregated values per group
rows = []

# Iterate over the groups
for key, group in grouped:

    # Calculate mean
    mean_values = group[mean_cols].mean()

    # Add the `key` (the YEAR_MONTH string) to the aggregated values
    mean_values["YEAR_MONTH"] = key

    # Collect the aggregated values
    rows.append(mean_values)

# DataFrame.append was removed in pandas 2.0; build the frame in one step instead
monthly_data = pd.DataFrame(rows).reset_index(drop=True)
monthly_data

Unnamed: 0,DIR,SPEED,GUST,TEMP_F,TEMP_C,YEAR_MONTH
0,218.181818,13.204301,,25.526882,-3.596177,190601
1,178.095238,13.142857,,25.797619,-3.445767,190602
2,232.043011,15.021505,,22.806452,-5.107527,190603
3,232.045455,13.811111,,38.822222,3.790123,190604
4,192.820513,10.333333,,55.526882,13.070490,190605
...,...,...,...,...,...,...
596,370.992008,8.138490,17.251852,61.743400,16.524111,201906
597,294.433641,5.785714,15.034722,61.569955,16.427753,201907
598,320.335766,6.769447,15.751678,60.598649,15.888138,201908
599,306.491058,6.363594,15.173285,49.958137,9.976743,201909


#### option 2

In [32]:
monthly_data = grouped[mean_cols].mean()

YEAR_MONTH,DIR,SPEED,GUST,TEMP_F,TEMP_C
190601,218.181818,13.204301,,25.526882,-3.596177
190602,178.095238,13.142857,,25.797619,-3.445767
190603,232.043011,15.021505,,22.806452,-5.107527
190604,232.045455,13.811111,,38.822222,3.790123
190605,192.820513,10.333333,,55.526882,13.070490
...,...,...,...,...,...
201906,370.992008,8.138490,17.251852,61.743400,16.524111
201907,294.433641,5.785714,15.034722,61.569955,16.427753
201908,320.335766,6.769447,15.751678,60.598649,15.888138
201909,306.491058,6.363594,15.173285,49.958137,9.976743


### groupby().agg()

In [33]:
group_agg = dfdt1.groupby("YEAR_MONTH")[mean_cols].agg('mean')

YEAR_MONTH,DIR,SPEED,GUST,TEMP_F,TEMP_C
190601,218.181818,13.204301,,25.526882,-3.596177
190602,178.095238,13.142857,,25.797619,-3.445767
190603,232.043011,15.021505,,22.806452,-5.107527
190604,232.045455,13.811111,,38.822222,3.790123
190605,192.820513,10.333333,,55.526882,13.070490
...,...,...,...,...,...
201906,370.992008,8.138490,17.251852,61.743400,16.524111
201907,294.433641,5.785714,15.034722,61.569955,16.427753
201908,320.335766,6.769447,15.751678,60.598649,15.888138
201909,306.491058,6.363594,15.173285,49.958137,9.976743
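
`agg` also accepts more than a single function name. A sketch on a made-up frame, showing a per-column dict and named aggregation (the output column names `temp_mean` and `temp_count` are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR_MONTH": ["190601", "190601", "190602"],
    "TEMP_F": [27.0, 25.0, 30.0],
    "SPEED": [7.0, 0.0, 5.0],
})

# One function per column, given as a dict
per_col = df.groupby("YEAR_MONTH").agg({"TEMP_F": "mean", "SPEED": "max"})

# Named aggregation: output column = (input column, function)
named = df.groupby("YEAR_MONTH").agg(
    temp_mean=("TEMP_F", "mean"),
    temp_count=("TEMP_F", "count"),
)
```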


<a id='pd-ops-filter'></a>
## Filter

[Return to Start of Notebook](#pd-ops)  

In [34]:
filt = dfdt1["MONTH"] == "04"
cols = ["STATION", "TEMP_F", "TEMP_C", "YEAR_MONTH"]
aprils = dfdt1[filt][cols]
aprils.head(3)

Unnamed: 0,STATION,TEMP_F,TEMP_C,YEAR_MONTH
270,29440,19.0,-7.222222,190604
271,29440,29.0,-1.666667,190604
272,29440,18.0,-7.777778,190604


In [35]:
monthly_mean = aprils.groupby("YEAR_MONTH").agg('mean')
monthly_mean.head(3)

YEAR_MONTH,STATION,TEMP_F,TEMP_C
190604,29440.0,38.822222,3.790123
190704,29440.0,36.111111,2.283951
190804,29440.0,36.811111,2.67284


#### sort

In [36]:
monthly_mean.sort_values(by="TEMP_C", ascending=False).head(10)

YEAR_MONTH,STATION,TEMP_F,TEMP_C
201904,29440.0,42.47203,5.817794
199004,29440.0,41.918084,5.510047
198904,29440.0,41.369647,5.20536
201104,29440.0,41.29073,5.161517
200404,29440.0,41.249676,5.138709
200204,29440.0,41.132353,5.073529
198304,29440.0,41.016183,5.008991
200804,29440.0,40.962343,4.979079
200004,29440.0,40.777778,4.876543
199904,29440.0,40.695291,4.830717
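
The filter and sort steps can also be written with `.loc` and `nlargest`. A sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "MONTH": ["03", "04", "04", "05"],
    "YEAR_MONTH": ["190603", "190604", "190704", "190605"],
    "TEMP_C": [-5.1, 3.8, 2.3, 13.1],
})

mask = df["MONTH"] == "04"
# .loc selects rows and columns in one step
aprils = df.loc[mask, ["YEAR_MONTH", "TEMP_C"]]

# nlargest is a shortcut for sort_values(ascending=False).head(n)
warmest = aprils.nlargest(1, "TEMP_C")
```

Selecting with `.loc[mask, cols]` avoids the chained `df[filt][cols]` form, which can trigger warnings if the result is later modified.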


## Workflow

In [37]:
file_list = glob.glob(os.path.join(data_dir,"0*txt"))

['/Users/forest/eda/gpy/data/029170.txt',
 '/Users/forest/eda/gpy/data/028690.txt',
 '/Users/forest/eda/gpy/data/029820.txt',
 '/Users/forest/eda/gpy/data/029700.txt',
 '/Users/forest/eda/gpy/data/028970.txt',
 '/Users/forest/eda/gpy/data/029070.txt',
 '/Users/forest/eda/gpy/data/029500.txt',
 '/Users/forest/eda/gpy/data/029110.txt',
 '/Users/forest/eda/gpy/data/028750.txt',
 '/Users/forest/eda/gpy/data/029720.txt',
 '/Users/forest/eda/gpy/data/029440.txt',
 '/Users/forest/eda/gpy/data/028360.txt',
 '/Users/forest/eda/gpy/data/029810.txt',
 '/Users/forest/eda/gpy/data/029740.txt',
 '/Users/forest/eda/gpy/data/029350.txt']
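
One way the file list might be used is to read every file and stack the results. This is a sketch under assumptions: the two station files are tiny made-up stand-ins written to a temporary directory, not the real data.

```python
import glob
import os
import tempfile

import pandas as pd

frames = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two tiny made-up station files to a temporary directory
    for station, temp in [("029440", 27), ("029170", 31)]:
        with open(os.path.join(tmp_dir, f"{station}.txt"), "w") as fh:
            fh.write(f"USAF TEMP\n{station} {temp}\n")

    # sorted() makes the order reproducible; glob gives no ordering guarantee
    for path in sorted(glob.glob(os.path.join(tmp_dir, "0*txt"))):
        frames.append(pd.read_csv(path, sep=r"\s+"))

combined = pd.concat(frames, ignore_index=True)
```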

<a id='pd-ops-exercise'></a>
## Exercise

[Return to Start of Notebook](#pd-ops)  

### Prob 1

In [38]:
file_path = os.path.join(data_dir,"1091402.txt")

'/Users/forest/eda/gpy/data/1091402.txt'

In [39]:
data = pd.read_csv(file_path,
                   delim_whitespace=True,
                   na_values=[-9999])
data.head(3)

Unnamed: 0,STATION,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TAVG,TMAX,TMIN
0,-----------------,----------,----------,----------,--------,--------,--------,--------,--------
1,GHCND:FIE00142080,51,60.3269,24.9603,19520101,0.31,37,39,34
2,GHCND:FIE00142080,51,60.3269,24.9603,19520102,,35,37,34


#### skip single row skiprows=[]

In [40]:
data = pd.read_csv(file_path,
                   delim_whitespace=True,
                   skiprows=[1],
                   parse_dates=['DATE'],
                   index_col='DATE',
                   na_values=[-9999])
data.head(3)

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
1952-01-01,GHCND:FIE00142080,51,60.3269,24.9603,0.31,37.0,39.0,34.0
1952-01-02,GHCND:FIE00142080,51,60.3269,24.9603,,35.0,37.0,34.0
1952-01-03,GHCND:FIE00142080,51,60.3269,24.9603,0.14,33.0,36.0,
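
A sketch of the same cleanup on an in-memory sample (station name and values are made up). Instead of `parse_dates`, it converts the date column explicitly, which is an alternative that spells out the expected format.

```python
import io
import pandas as pd

raw = (
    "STATION DATE     TAVG\n"
    "------- -------- ----\n"
    "FIE001  19520101 37\n"
    "FIE001  19520102 -9999\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    skiprows=[1],          # 0-based file line number: drops only the dashed row
    na_values=[-9999],
)

# Explicit conversion as an alternative to parse_dates
df["DATE"] = pd.to_datetime(df["DATE"].astype(str), format="%Y%m%d")
df = df.set_index("DATE")
```

Note that `skiprows=[1]` skips one specific line, while `skiprows=1` would skip the first line (the header) instead.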


In [41]:
data.dtypes

STATION       object
ELEVATION      int64
LATITUDE     float64
LONGITUDE    float64
PRCP         float64
TAVG         float64
TMAX         float64
TMIN         float64
dtype: object

In [42]:
data.tail(3)

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
2017-10-02,GHCND:FIE00142080,51,60.3269,24.9603,,47.0,49.0,46.0
2017-10-03,GHCND:FIE00142080,51,60.3269,24.9603,0.94,47.0,,44.0
2017-10-04,GHCND:FIE00142080,51,60.3269,24.9603,0.51,52.0,56.0,


In [43]:
data.shape

(23716, 8)

#### no data

In [44]:
data.isna().sum()

STATION         0
ELEVATION       0
LATITUDE        0
LONGITUDE       0
PRCP         1553
TAVG         3308
TMAX          260
TMIN          365
dtype: int64
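
A few related missing-data summaries, sketched on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TAVG": [37.0, np.nan, 33.0], "TMIN": [34.0, 34.0, np.nan]})

missing = df.isna().sum()          # count of NaN per column
share = df.isna().mean()           # fraction of NaN per column
complete_rows = len(df.dropna())   # rows with no NaN at all
```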

In [45]:
tavg_nodata_count = data['TAVG'].isna().sum()

3308

In [46]:
tmin_nodata_count = data['TMIN'].isna().sum()

365

In [47]:
day_count = len(data)

23716

#### first and last observation

##### option 1 .min()

In [48]:
filt = data.index == data.index.min()
first_obs = data[filt]

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
1952-01-01,GHCND:FIE00142080,51,60.3269,24.9603,0.31,37.0,39.0,34.0


In [49]:
filt = data.index == data.index.max()
last_obs = data[filt]

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
2017-10-04,GHCND:FIE00142080,51,60.3269,24.9603,0.51,52.0,56.0,


##### option 2 .loc[[index.min()]]

In [50]:
first_obs = data.loc[[data.index.min()]]

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
1952-01-01,GHCND:FIE00142080,51,60.3269,24.9603,0.31,37.0,39.0,34.0


In [51]:
last_obs = data.loc[[data.index.max()]]

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
2017-10-04,GHCND:FIE00142080,51,60.3269,24.9603,0.51,52.0,56.0,


#### avg_temp

In [52]:
avg_temp = data['TAVG'].mean()

41.32408859270874

In [53]:
filt1 = data.index.to_series().dt.year == 1969
filt2 = data.index.to_series().dt.month.isin([5,6,7,8])
filt = filt1 & filt2
temp_1969_may_aug = data[filt]
temp_1969_may_aug.head(3)

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN
1969-05-01,GHCND:FIE00142080,51,60.3269,24.9603,0.0,,41.0,33.0
1969-05-02,GHCND:FIE00142080,51,60.3269,24.9603,0.0,,48.0,31.0
1969-05-03,GHCND:FIE00142080,51,60.3269,24.9603,0.0,,44.0,27.0
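
With a `DatetimeIndex`, the same selection can be sketched without `to_series()`. The frame below is made up (five consecutive days around 1 May 1969):

```python
import pandas as pd

idx = pd.date_range("1969-04-29", periods=5, freq="D")
df = pd.DataFrame({"TMAX": [40.0, 42.0, 41.0, 48.0, 44.0]}, index=idx)

# DatetimeIndex exposes .year and .month directly (no .dt needed)
mask = (df.index.year == 1969) & df.index.month.isin([5, 6, 7, 8])
may_aug = df[mask]

# Partial-string slicing selects the same rows on a sorted index
by_slice = df.loc["1969-05":"1969-08"]
```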


In [54]:
avg_temp_1969 = round(temp_1969_may_aug['TMAX'].mean(),2)

67.82

### Prob 2

#### convert date

In [55]:
data['YEAR_MONTH'] =  data.index.astype(str).str.slice(start=0, stop=4) +\
                      data.index.astype(str).str.slice(start=5, stop=7)
data.head(3)

DATE,STATION,ELEVATION,LATITUDE,LONGITUDE,PRCP,TAVG,TMAX,TMIN,YEAR_MONTH
1952-01-01,GHCND:FIE00142080,51,60.3269,24.9603,0.31,37.0,39.0,34.0,195201
1952-01-02,GHCND:FIE00142080,51,60.3269,24.9603,,35.0,37.0,34.0,195201
1952-01-03,GHCND:FIE00142080,51,60.3269,24.9603,0.14,33.0,36.0,,195201


#### apply function

In [56]:
data['temp_celsius'] = data['TAVG'].apply(tc.fahr_to_celsius)

#### .groupby

In [57]:
cols = ['temp_celsius']
monthly_data = data.groupby("YEAR_MONTH")[cols].agg('mean')

YEAR_MONTH,temp_celsius
195201,-1.400966
195202,-4.000000
195203,-10.106838
195204,4.226190
195205,7.037037
...,...
201706,13.500000
201707,15.716846
201708,15.716846
201709,11.296296


### Prob 3

In [58]:
monthly_data['month'] =  monthly_data.index.to_series().str.slice(start=4, stop=6)
monthly_data.reset_index(inplace=True)
monthly_data

Unnamed: 0,YEAR_MONTH,temp_celsius,month
0,195201,-1.400966,01
1,195202,-4.000000,02
2,195203,-10.106838,03
3,195204,4.226190,04
4,195205,7.037037,05
...,...,...,...
785,201706,13.500000,06
786,201707,15.716846,07
787,201708,15.716846,08
788,201709,11.296296,09


#### part 1: reference_temps

In [59]:
cols = ['temp_celsius']
reference_temps = monthly_data.groupby("month")[cols].agg('mean')

month,temp_celsius
01,-5.350916
02,-5.941307
03,-2.440364
04,3.423785
05,10.179938
...,...
08,15.603642
09,10.596153
10,5.487785
11,0.645782


In [60]:
reference_temps.reset_index(inplace=True)

In [61]:
new_names = {"temp_celsius": "ref_temp"}
reference_temps = reference_temps.rename(columns=new_names);

In [62]:
reference_temps.head(3)

Unnamed: 0,month,ref_temp
0,01,-5.350916
1,02,-5.941307
2,03,-2.440364


#### part 2: temperature anomalies

In [63]:
monthly_data.head(3)

Unnamed: 0,YEAR_MONTH,temp_celsius,month
0,195201,-1.400966,01
1,195202,-4.0,02
2,195203,-10.106838,03


In [64]:
join1 = monthly_data.merge(reference_temps, on='month')

Unnamed: 0,YEAR_MONTH,temp_celsius,month,ref_temp
0,195201,-1.400966,01,-5.350916
1,195301,-5.396825,01,-5.350916
2,195401,-7.072650,01,-5.350916
3,195501,-5.473251,01,-5.350916
4,195601,-8.133333,01,-5.350916
...,...,...,...,...
785,201212,-6.630824,12,-3.211359
786,201312,1.362007,12,-3.211359
787,201412,-1.146953,12,-3.211359
788,201512,2.204301,12,-3.211359


In [65]:
monthly_diff = join1.copy()

Unnamed: 0,YEAR_MONTH,temp_celsius,month,ref_temp
0,195201,-1.400966,01,-5.350916
1,195301,-5.396825,01,-5.350916
2,195401,-7.072650,01,-5.350916
3,195501,-5.473251,01,-5.350916
4,195601,-8.133333,01,-5.350916
...,...,...,...,...
785,201212,-6.630824,12,-3.211359
786,201312,1.362007,12,-3.211359
787,201412,-1.146953,12,-3.211359
788,201512,2.204301,12,-3.211359


In [66]:
monthly_diff['diff'] = monthly_diff['temp_celsius'] - monthly_diff['ref_temp']

In [67]:
monthly_diff.head()

Unnamed: 0,YEAR_MONTH,temp_celsius,month,ref_temp,diff
0,195201,-1.400966,01,-5.350916,3.94995
1,195301,-5.396825,01,-5.350916,-0.045909
2,195401,-7.07265,01,-5.350916,-1.721733
3,195501,-5.473251,01,-5.350916,-0.122335
4,195601,-8.133333,01,-5.350916,-2.782417


In [68]:
filt = monthly_diff['diff'] == monthly_diff['diff'].min()
min_diff = monthly_diff[filt]

Unnamed: 0,YEAR_MONTH,temp_celsius,month,ref_temp,diff
35,198701,-17.97491,01,-5.350916,-12.623994


In [69]:
filt = monthly_diff['diff'] == monthly_diff['diff'].max()
max_diff = monthly_diff[filt]

Unnamed: 0,YEAR_MONTH,temp_celsius,month,ref_temp,diff
104,199002,1.170635,02,-5.941307,7.111942


In [70]:
monthly_diff[["temp_celsius", "ref_temp", "diff"]].describe()

Unnamed: 0,temp_celsius,ref_temp,diff
count,682.0,790.0,682.0
mean,5.097114,5.094564,-1.588824e-16
std,8.483949,8.102228,2.511899
min,-17.97491,-5.941307,-12.62399
25%,-1.685185,-2.440364,-1.520885
50%,4.726105,5.487785,0.1173686
75%,12.87037,13.649932,1.667551
max,22.329749,17.280519,7.111942
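
The rows with the smallest and largest anomaly, found earlier with a boolean filter, can also be located with `idxmin` and `idxmax`. A sketch on a small illustrative frame (values rounded, not the full data):

```python
import pandas as pd

diffs = pd.DataFrame({
    "YEAR_MONTH": ["198701", "199002", "195201"],
    "diff": [-12.6, 7.1, 3.9],
})

coldest = diffs.loc[diffs["diff"].idxmin()]   # row with the smallest anomaly
warmest = diffs.loc[diffs["diff"].idxmax()]   # row with the largest anomaly
```

Unlike the boolean filter, `idxmin` and `idxmax` return only the first matching row when there are ties.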


### Prob 4