# Temperature Analysis I

In [1]:
#import dependencies
import pandas as pd
from datetime import datetime as dt
from scipy import stats

In [2]:
#import csv
measurements_df = pd.read_csv('Resources/hawaii_measurements.csv')
measurements_df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
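#check that every date parses as YYYY-MM-DD
#note: assigning to `row` inside iterrows() only changes a copy, so a row-by-row loop never updates the DataFrame
#the parsed result is discarded on purpose: the column stays as strings because a later cell parses the index with strptime
_ = pd.to_datetime(measurements_df['date'], format='%Y-%m-%d')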

In [4]:
#set index to date column
measurements_df.set_index('date', inplace=True)
measurements_df.head()

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.0,63
2010-01-03,USC00519397,0.0,74
2010-01-04,USC00519397,0.0,76
2010-01-06,USC00519397,,73


# June and December Data Comparison

In [5]:
#dictionaries to hold rows
june_dict = {}
dec_dict = {}

#pulling needed rows
#note: the dictionaries are keyed by date, so when several stations share a date only the last row read is kept
for index, row in measurements_df.iterrows():
    month = dt.strptime(index, '%Y-%m-%d').month
    if month == 6:
        june_dict[index] = row
    elif month == 12:
        dec_dict[index] = row
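
The row-by-row loop above can also be written with vectorized pandas operations. The following is a minimal sketch on a hypothetical toy table (`toy_df`, the station names `A` and `B`, and the temperatures are made up for illustration, not taken from the Hawaii data). Unlike the dictionary approach, a boolean mask keeps every matching row even when several stations report on the same date, so on the real data the row counts would likely differ from the ones printed in this notebook.

```python
import pandas as pd

# hypothetical miniature of the measurements table: two stations share dates
toy_df = pd.DataFrame({
    'station': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2010-06-01', '2010-12-01', '2010-07-04',
             '2010-06-01', '2010-12-01', '2011-06-15'],
    'tobs': [78, 71, 80, 75, 69, 77],
})

# convert the whole column at once, then use it as the index
toy_df['date'] = pd.to_datetime(toy_df['date'], format='%Y-%m-%d')
toy_df = toy_df.set_index('date')

# boolean masks on the month keep every matching row, including repeated dates
june_toy = toy_df[toy_df.index.month == 6]
dec_toy = toy_df[toy_df.index.month == 12]
print(len(june_toy), len(dec_toy))
```

Because the index is a real `DatetimeIndex` here, the month is read directly from `.index.month` and no `strptime` call is needed.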

In [6]:
#june data frame
june_df = pd.DataFrame.from_dict(june_dict, orient='index')
len_june = len(june_df)
print(len_june)
june_df.head()

240


,station,prcp,tobs
2010-06-01,USC00516128,0.08,70
2010-06-02,USC00516128,0.0,78
2010-06-03,USC00516128,0.1,73
2010-06-04,USC00516128,0.01,68
2010-06-05,USC00516128,0.09,79


In [7]:
#december data frame
dec_df = pd.DataFrame.from_dict(dec_dict, orient='index')
len_dec = len(dec_df)
print(len_dec)
dec_df.head()

217


,station,prcp,tobs
2010-12-01,USC00516128,1.48,71
2010-12-03,USC00516128,1.59,67
2010-12-04,USC00516128,0.19,77
2010-12-06,USC00516128,0.0,61
2010-12-07,USC00516128,0.0,62


In [8]:
#finding june average temp and stdv
tavg_june = round(june_df['tobs'].mean(), 2)
stdev_june = stats.tstd(june_df['tobs'])
print(tavg_june)
print(stdev_june)

72.13
2.706721706951832


In [9]:
#finding december average temp and stdv
tavg_dec = round(dec_df['tobs'].mean(), 2)
stdev_dec = stats.tstd(dec_df['tobs'])
print(tavg_dec)
print(stdev_dec)

69.24
3.320488564639603


In [10]:
#run Welch's t-test
stats.ttest_ind_from_stats(tavg_june, stdev_june, len_june, tavg_dec, stdev_dec, len_dec, equal_var=False)

Ttest_indResult(statistic=10.133438702814022, pvalue=1.0094299999426102e-21)

In [11]:
#extra - modified june data frame to match December's length and ran ttest_rel
june_df_short = june_df.loc[:'2017-06-11']
print(len(june_df_short))
short_june_avg = june_df_short['tobs'].mean()
print(short_june_avg)

217
71.90783410138249


In [12]:
#extra - running paired t-test on modified data
stats.ttest_rel(june_df_short['tobs'], dec_df['tobs'])

Ttest_relResult(statistic=9.711579509054578, pvalue=9.878980958190058e-19)

# Analysis

The June and December data sets have different lengths (240 and 217 rows) because June has an extra month of data and the two months inherently have different day counts. I therefore ran Welch's t-test, also called the unequal variances t-test, which treats the two samples as independent and does not require equal sample sizes. Ideally, a paired t-test would have been run, since the temperatures were pulled from the same locations at two different points in time, but scipy's paired test does not run on samples of different lengths, so Welch's test was the best available option.
I also trimmed the June data set to December's length (217 rows) and ran a paired t-test to see whether the result changed. Note that `ttest_rel` pairs rows by position, so each June reading is matched with whichever December reading sits in the same row rather than with a carefully chosen counterpart.
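
The trade-off described above can be sketched on synthetic data. The samples below are random stand-ins drawn with NumPy (the means, spread, and seed are assumptions chosen only to resemble the notebook's sample sizes), so the printed numbers are not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic stand-ins for the June and December temperature samples
june_sample = rng.normal(loc=72, scale=3, size=240)
dec_sample = rng.normal(loc=69, scale=3, size=217)

# a paired test needs exactly one December value for every June value
try:
    stats.ttest_rel(june_sample, dec_sample)
    paired_failed = False
except ValueError as err:
    paired_failed = True
    print('paired test rejected the input:', err)

# Welch's test treats the samples as independent and accepts unequal sizes
welch = stats.ttest_ind(june_sample, dec_sample, equal_var=False)
print(welch.statistic, welch.pvalue)
```

Passing `equal_var=False` is what selects Welch's test; with the default `equal_var=True`, `ttest_ind` runs the standard pooled-variance t-test instead.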

For Welch's test: The p-value is 1.0094299999426102e-21 (about 1.0e-21), far below 0.05, so the difference between the June and December mean temperatures is statistically significant.
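
As a sanity check, the Welch statistic from cell 10 can be reproduced by hand from the summary statistics printed in cells 8 and 9. This sketch uses only the standard library; the variable names are my own, and the degrees of freedom come from the Welch-Satterthwaite approximation, which the notebook does not print.

```python
import math

# summary statistics printed in cells 8 and 9
mean_june, sd_june, n_june = 72.13, 2.706721706951832, 240
mean_dec, sd_dec, n_dec = 69.24, 3.320488564639603, 217

# squared standard error of each sample mean
se_june_sq = sd_june ** 2 / n_june
se_dec_sq = sd_dec ** 2 / n_dec

# Welch's t statistic: difference in means over the combined standard error
t_stat = (mean_june - mean_dec) / math.sqrt(se_june_sq + se_dec_sq)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_df = (se_june_sq + se_dec_sq) ** 2 / (
    se_june_sq ** 2 / (n_june - 1) + se_dec_sq ** 2 / (n_dec - 1)
)
print(t_stat, welch_df)
```

The computed `t_stat` should agree with the statistic of roughly 10.13 reported by `ttest_ind_from_stats`. Turning it into a p-value additionally requires the t-distribution, which scipy handles internally.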

For the modified paired test: The p-value is higher for this test at 9.878980958190058e-19 (about 9.9e-19), but it is still far below 0.05, so the difference remains statistically significant.