In [1]:
import pickle
import numpy as np

There are three things that I want to check out here. The previous EDA files have more or less been playing around 
without a clear purpose. 

1.) How does this data look? After contacting the forest services, I've been told that each row should be a detected 
fire, and that there should be four rows for every lat/long pair, corresponding to the 4 pictures taken on a given day. 
I need to verify that and, if it is true, see how far back it holds. 

2.) How does the confidence level range across fires in general, and then how about across the 4 pictures per day? Is there a good range of fire confidence level from 10-90%?

3.) How am I going to group these fires from day to day - do I go out 0.01 degrees in each lat/long direction, 0.10 
degrees, etc.? To look into this what I'm going to do is just cycle through a bunch of different values and see how many fires pop up within that many degrees. 
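For question 2, one possible way to look at confidence across the rows of a single lat/long/day group is to compute the per-group minimum, maximum, and range of CONF. The sketch below runs on a small made-up frame. `toy` and `conf_spread_per_group` are hypothetical names that are not part of the analysis; only the column names are taken from the data.

```python
import pandas as pd

# Made-up detections; only the column names follow the MODIS frames used here.
toy = pd.DataFrame({
    'LAT':   [25.170, 25.170, 25.170, 25.170, 40.621, 40.621],
    'LONG':  [-80.969, -80.969, -80.969, -80.969, -120.152, -120.152],
    'year':  [2015] * 6,
    'month': [4] * 6,
    'day':   [18] * 6,
    'CONF':  [30, 47, 66, 90, 52, 52],
})


def conf_spread_per_group(df):
    """Return min, max, row count, and range of CONF per lat/long/day group."""
    keys = ['LAT', 'LONG', 'year', 'month', 'day']
    stats = df.groupby(keys)['CONF'].agg(['min', 'max', 'count'])
    stats['range'] = stats['max'] - stats['min']
    return stats


spread = conf_spread_per_group(toy)
print(spread)
```

Calling `describe()` on the `range` column of the result would then summarize how much confidence varies within a day, as opposed to across all fires.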

In [2]:
def row_examination(df):
    counts = df.groupby(['LAT', 'LONG', 'year', 'month', 'day']).count()
    print 'Max. number of rows per lat/long coordinate: ', counts.max()[0]
    print 'Min. number of rows per lat/long coordinate: ', counts.min()[0]
    print 'Mean number of rows per lat/long coordinate: ', counts.mean()[0]

In [3]:
def conf_levels_examination(df): 
    print 'Confidence level info: ', df['CONF'].describe()

In [4]:
for year in xrange(2015, 2002, -1): 
    with open('../../../data/pickled_data/MODIS/df_' + str(year) + '.pkl') as f: 
        df = pickle.load(f)
        print 'Year: ', str(year)
        print '-' * 50, '\n'
        row_examination(df)
        conf_levels_examination(df)

Year:  2015
-------------------------------------------------- 

Max. number of rows per lat/long coordinate:  10
Min. number of rows per lat/long coordinate:  1
Mean number of rows per lat/long coordinate:  2.05243429241
Confidence level info:  count    140171.000000
mean         66.610961
std          22.428298
min           0.000000
25%          52.000000
50%          67.000000
75%          84.000000
max         100.000000
Name: CONF, dtype: float64
Year:  2014
-------------------------------------------------- 

Max. number of rows per lat/long coordinate:  7
Min. number of rows per lat/long coordinate:  1
Mean number of rows per lat/long coordinate:  2.07766551185
Confidence level info:  count    229688.000000
mean         68.196545
std          21.343116
min           0.000000
25%          55.000000
50%          70.000000
75%          84.000000
max         100.000000
Name: CONF, dtype: float64
Year:  2013
-------------------------------------------------- 

Max. number of rows pe

From the above, I can see that starting in 2009 it looks like the 4 pictures per lat/long coordinates might be true, but before that it doesn't look to be true. So now I need to just pull out some rows and see exactly what is going on. 
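The max, min, and mean alone don't show how many groups actually have exactly 4 rows. A minimal sketch of one way to see the full distribution of rows per lat/long/day group is below. It uses a made-up frame, and `rows_per_group_distribution` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up detections: one group of 4 rows, two groups of 2 rows, one single row.
toy = pd.DataFrame({
    'LAT':   [25.170] * 4 + [25.172] * 2 + [25.177] * 2 + [25.185],
    'LONG':  [-80.969] * 4 + [-80.958] * 2 + [-80.659] * 2 + [-80.914],
    'year':  [2015] * 9,
    'month': [4] * 9,
    'day':   [18] * 9,
})


def rows_per_group_distribution(df):
    """Map each group size to the number of lat/long/day groups of that size."""
    keys = ['LAT', 'LONG', 'year', 'month', 'day']
    group_sizes = df.groupby(keys).size()
    return group_sizes.value_counts().sort_index()


dist = rows_per_group_distribution(toy)
print(dist)
```

Run per year, the share of groups with size 4 would give a more direct read on whether the "4 pictures per day" description holds.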

In [5]:
def examine_index(df, index): 
    print df.query('LAT == @index[0] & LONG == @index[1] & year == @index[2] & month == @index[3] & day == @index[4]')

In [6]:
def examine_lown_rows(df, count_num, output = False): 
    fires_counts = df.groupby(['LAT', 'LONG', 'year', 'month', 'day']).count()['AREA'] == count_num
    if output: 
        for index in fires_counts[fires_counts == True].index[0:10]:
            print '-' * 50
            examine_index(df, index)
    else:  
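        # Filter to the matching groups first; indexing the unfiltered Series
        # would return the first 10 groups regardless of their row count.
        return fires_counts[fires_counts].index[0:10]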

In [7]:
for year in xrange(2015, 2014, -1): 
    with open('../../../data/pickled_data/MODIS/df_' + str(year) + '.pkl') as f: 
        df = pickle.load(f)
        examine_lown_rows(df, 2, True)     

--------------------------------------------------
                        AREA             PERIMETER   FIRE_  FIRE_ID  \
140169  0.0000000000000E+000  0.0000000000000E+000  140170   133875   
140170  0.0000000000000E+000  0.0000000000000E+000  140171   564951   

                  LAT           LONG  JULIAN   GMT   TEMP  SPIX  TPIX SAT_SRC  \
140169  2.517000E+001  -8.09690E+001     108  1835  312.9   1.1   1.1       A   
140170  2.517000E+001  -8.09690E+001     108  1835  312.9   1.1   1.1       A   

        CONF  FRP  year  month  day  
140169    47  6.3  2015      4   18  
140170    47  6.3  2015      4   18  
--------------------------------------------------
                        AREA             PERIMETER   FIRE_  FIRE_ID  \
140167  0.0000000000000E+000  0.0000000000000E+000  140168   133876   
140168  0.0000000000000E+000  0.0000000000000E+000  140169   564952   

                  LAT           LONG  JULIAN   GMT   TEMP  SPIX  TPIX SAT_SRC  \
140167  2.517200E+001  -8.09580

My hunch is that where there are only 1 or 2 obs. for a given lat/long/date combination, it's because 
the fire moved quickly and so its lat/long coordinates changed quickly. The way to check this would be to see if there are a larger number of fires (rows) within a given lat/long distance of that lat/long location. 
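A separate check that might be worth running alongside this one: in the first two-row group printed above, both rows share the same GMT, TEMP, CONF, and FRP and differ only in FIRE_ and FIRE_ID. If rows in a group were separate pictures, I'd expect distinct GMT values. The sketch below counts rows and distinct GMT values per group on made-up data; `distinct_times_per_group` is a hypothetical helper, and the GMT values in `toy` are invented.

```python
import pandas as pd

# Made-up detections: the first group repeats one GMT, the second has four.
toy = pd.DataFrame({
    'LAT':   [25.170, 25.170, 25.560, 25.560, 25.560, 25.560],
    'LONG':  [-80.969, -80.969, -97.771, -97.771, -97.771, -97.771],
    'year':  [2015] * 6,
    'month': [4] * 6,
    'day':   [18] * 6,
    'GMT':   [1835, 1835, 430, 815, 1650, 1835],
})


def distinct_times_per_group(df):
    """Return row count and number of distinct GMT values per lat/long/day group."""
    keys = ['LAT', 'LONG', 'year', 'month', 'day']
    grouped = df.groupby(keys)['GMT']
    return pd.DataFrame({'rows': grouped.size(),
                         'distinct_gmt': grouped.nunique()})


summary = distinct_times_per_group(toy)
print(summary)
```

Groups where `rows` exceeds `distinct_gmt` would be candidates for repeated records rather than separate pictures.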

In [8]:
for year in xrange(2015, 2014, -1): 
    with open('../../../data/pickled_data/MODIS/df_' + str(year) + '.pkl') as f: 
        df = pickle.load(f)
        for row_number in [2, 4, 10]: 
            print 'Row Number', str(row_number)
            print '-' * 50
            rows_less = examine_lown_rows(df, row_number, False)
            df['LAT'] = df['LAT'].astype(float)
            df['LONG'] = df['LONG'].astype(float)
            for dist_out in [0.001, 0.01, 0.05, 0.1]:
                print 'dist', str(dist_out)
                print '-' * 50
                for index in rows_less: 
                    lat_1, lat_2 = float(index[0]) - dist_out, float(index[0]) + dist_out
                    long_1, long_2 = float(index[1]) - dist_out, float(index[1]) + dist_out
                    result = df.query('LAT > @lat_1 & LAT < @lat_2 & LONG > @long_1 & LONG < @long_2')
                    print result.shape[0], index[0], index[1]

Row Number 2
--------------------------------------------------
dist 0.001
--------------------------------------------------
2 2.517000E+001 -8.09690E+001
2 2.517200E+001 -8.09580E+001
2 2.517700E+001 -8.06590E+001
2 2.518500E+001 -8.09140E+001
2 2.520000E+001 -8.07360E+001
2 2.520600E+001 -8.07720E+001
2 2.548600E+001 -8.03730E+001
2 2.551200E+001 -9.75740E+001
2 2.552100E+001 -8.05430E+001
2 2.556000E+001 -9.77710E+001
dist 0.01
--------------------------------------------------
2 2.517000E+001 -8.09690E+001
2 2.517200E+001 -8.09580E+001
2 2.517700E+001 -8.06590E+001
2 2.518500E+001 -8.09140E+001
2 2.520000E+001 -8.07360E+001
2 2.520600E+001 -8.07720E+001
2 2.548600E+001 -8.03730E+001
2 2.551200E+001 -9.75740E+001
2 2.552100E+001 -8.05430E+001
4 2.556000E+001 -9.77710E+001
dist 0.05
--------------------------------------------------
4 2.517000E+001 -8.09690E+001
6 2.517200E+001 -8.09580E+001
2 2.517700E+001 -8.06590E+001
4 2.518500E+001 -8.09140E+001
4 2.520000E+001 -8.07360E+001
4 

The above doesn't seem to support my hypothesis (but I did only look at 10 obs.). Maybe it's just that those fires aren't actually present at those locations later in the day, and so there aren't pictures for them. Ultimately, 0.01 degrees is on the order of a kilometer, so +/- 0.01 degrees is the natural way to represent a roughly kilometer-scale square, and I need to stick with that. What I need to do now is find some large fire in a given year and try to track it, to see how likely it is that I can track a particular fire (i.e. are there 4 pics. a day for that fire, are there 4 pics. a day for multiple days for that fire, etc.). What does this data look like for a location where I know there was a large fire that burned for a very long period of time?
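As a rough sanity check on the scale of these boxes, degrees can be converted to kilometers. The sketch below assumes the common approximation of about 111.32 km per degree of latitude and scales longitude by the cosine of the latitude; `KM_PER_DEG_LAT` and `box_size_km` are hypothetical names introduced only for this illustration.

```python
import math

# Approximate length of one degree of latitude in km (an assumption; the true
# value varies slightly with latitude).
KM_PER_DEG_LAT = 111.32


def box_size_km(lat_deg, half_width_deg):
    """Approximate (north-south, east-west) extent in km of a +/- box."""
    ns_km = 2 * half_width_deg * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    ew_km = ns_km * math.cos(math.radians(lat_deg))
    return ns_km, ew_km


# Evaluate the candidate distances at a latitude of 42.392 degrees.
for half_width in [0.001, 0.01, 0.05, 0.1]:
    ns_km, ew_km = box_size_km(42.392, half_width)
    print('%.3f deg: %.2f km N-S, %.2f km E-W' % (half_width, ns_km, ew_km))
```

Under this approximation a +/- 0.01 degree box is a bit over 2 km north to south and somewhat narrower east to west at these latitudes, so it is a kilometer-scale box rather than exactly one square kilometer.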

If I go to the following website (https://www.nifc.gov/fireInfo/fireInfo_stats_lgFires.html), I can pick some large fires from 2009 onward (since earlier I decided to only look at data from that point on). I'll start off by looking at some 2012 fires and figure out, if I go out a certain distance from a fire's origin, how many rows there are in the data (i.e. how many detected fires lie that far out from the large fire's origin). I'll look at the Long Draw fire (42.392, -117.894), the Holloway fire (41.973, -118.366), the Mustang Complex (45.425, -114.59), Rush (40.621, -120.152), and Ash Creek MT (45.669, -106.469). We'll check these out first.  

In [12]:
fires = [('Long Draw', 42.392, -117.894), ('Holloway', 41.973, -118.366), ('Mustang Complex', 45.425, -114.59), 
        ('Rush', 40.621, -120.152), ('Ash Creek MT', 45.669, -106.469)]

In [13]:
with open('../../../data/pickled_data/MODIS/df_' + str(2012) + '.pkl') as f: 
    df = pickle.load(f)
    df['LAT'] = df['LAT'].astype(float)
    df['LONG'] = df['LONG'].astype(float)
    for fire in fires: 
        print fire[0]
        print '-' * 50
        lat_orig, long_orig = fire[1], fire[2]
        for dist_out in [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5]: 
            lat_1, lat_2 = lat_orig - dist_out, lat_orig + dist_out 
            long_1, long_2 = long_orig - dist_out, long_orig + dist_out
            result = df.query('LAT > @lat_1 & LAT < @lat_2 & LONG > @long_1 & LONG < @long_2')
            print 'Dist_out: %s' %str(dist_out), result.shape[0], '\n'

Long Draw
--------------------------------------------------
Dist_out: 0.001 0 

Dist_out: 0.01 10 

Dist_out: 0.1 610 

Dist_out: 0.2 1827 

Dist_out: 0.3 3106 

Dist_out: 0.4 4431 

Dist_out: 0.5 6049 

Holloway
--------------------------------------------------
Dist_out: 0.001 0 

Dist_out: 0.01 15 

Dist_out: 0.1 1152 

Dist_out: 0.2 3288 

Dist_out: 0.3 4541 

Dist_out: 0.4 5657 

Dist_out: 0.5 6159 

Mustang Complex
--------------------------------------------------
Dist_out: 0.001 0 

Dist_out: 0.01 30 

Dist_out: 0.1 2292 

Dist_out: 0.2 5718 

Dist_out: 0.3 8174 

Dist_out: 0.4 11735 

Dist_out: 0.5 14212 

Rush
--------------------------------------------------
Dist_out: 0.001 0 

Dist_out: 0.01 3 

Dist_out: 0.1 424 

Dist_out: 0.2 1375 

Dist_out: 0.3 2425 

Dist_out: 0.4 2465 

Dist_out: 0.5 2525 

Ash Creek MT
--------------------------------------------------
Dist_out: 0.001 0 

Dist_out: 0.01 38 

Dist_out: 0.1 602 

Dist_out: 0.2 1377 

Dist_out: 0.3 2337 

Dist_out: 0

This looks fairly legit, but what I want to do is take 100 unique random LAT/LONG pairs from the 2012 data and 
look at the average number of obs. we get at each of the dist_out values above (0.001, 0.01, 0.1, 0.2, 0.3, 0.4, and 0.5). If the number of observations that far out is on average the same as above, that would suggest that we weren't actually able to locate the Long Draw fire. 

In [19]:
with open('../../../data/pickled_data/MODIS/df_' + str(2012) + '.pkl') as f: 
    df = pickle.load(f)
    df['LAT'] = df['LAT'].astype(float)
    df['LONG'] = df['LONG'].astype(float)
    df = df.set_index(['LAT', 'LONG'])
    indices = df.index
    unique_indices = np.unique(indices)
    num_indices = len(unique_indices)
    obs_array = []
    # Sample without replacement so that the 100 LAT/LONG pairs are unique.
    rand_indices = np.random.choice(num_indices, size=100, replace=False)
    for index in rand_indices: 
        lat_orig, long_orig = unique_indices[index]
        num_obs = []
        for dist_out in [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5]: 
            lat_1, lat_2 = lat_orig - dist_out, lat_orig + dist_out 
            long_1, long_2 = long_orig - dist_out, long_orig + dist_out
            result = df.query('LAT > @lat_1 & LAT < @lat_2 & LONG > @long_1 & LONG < @long_2')
            num_obs.append(result.shape[0])
        obs_array.append(num_obs)

Done with Rand


In [11]:
np.array(obs_array).mean(axis = 0)

array([  2.20000000e+00,   1.72300000e+01,   6.67580000e+02,
         1.41051000e+03,   1.89278000e+03,   2.28991000e+03,
         2.68952000e+03])

The above seems to suggest that our attempt to look at the Long Draw fire was fairly successful, at least if we look out far enough. Up to 0.1 degrees, the numbers we were seeing for Long Draw are roughly the same as the random-location average, whereas at 0.2 degrees and beyond we end up with a much larger number of obs. for the Long Draw fire. Now we'll try to look at the Railbelt Complex fire. 
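To put numbers on that comparison, the sketch below divides the Long Draw counts printed earlier by the random-location averages at each distance. The values are copied from the outputs above; `count_ratios` is a hypothetical helper used only for this illustration.

```python
# Distances (degrees) used in both experiments above.
dists = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5]
# Row counts around the Long Draw origin, copied from the output above.
long_draw = [0, 10, 610, 1827, 3106, 4431, 6049]
# Mean row counts around 100 random LAT/LONG pairs, copied from the output above.
baseline = [2.2, 17.23, 667.58, 1410.51, 1892.78, 2289.91, 2689.52]


def count_ratios(fire_counts, baseline_counts):
    """Return fire count divided by baseline count at each distance."""
    return [fire / base for fire, base in zip(fire_counts, baseline_counts)]


ratios = count_ratios(long_draw, baseline)
for dist, ratio in zip(dists, ratios):
    print('dist %.3f: ratio %.2f' % (dist, ratio))
```

A ratio below 1 means fewer rows around the Long Draw origin than around an average detection, and a ratio above 1 means more.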