It turns out that some of the perimeter boundary files for fires contain duplicate entries when I group by fire name and date. I need to figure out what is going on with those: if I merge the detected fires onto the perimeter boundary files by date and geometry, some detected fires end up merging to two different boundaries, which creates what appear to be duplicate entries. Before dropping what appear to be duplicate perimeter boundaries for the same fire, I need to figure out why there are two different rows in the data for these fires. Is it because two different sources reported the same fire boundary, or something else?
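To make the problem concrete, here is a minimal sketch of how one detected fire can come out of the merge twice. It uses an in-memory SQLite database rather than the real Postgres tables, and it stands in for the geometry match with a simple bounding-box test. The `detected` and `perimeters` tables, their columns, and every row in them are made up for illustration.

```python
import sqlite3

# Toy in-memory stand-in for the real tables; all rows are hypothetical.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE detected (id INTEGER, date_ TEXT, x REAL, y REAL)')
demo.execute('''CREATE TABLE perimeters (fire_name TEXT, date_ TEXT,
                xmin REAL, xmax REAL, ymin REAL, ymax REAL)''')

# One detected fire, and two overlapping boundaries for the same fire and date.
demo.execute("INSERT INTO detected VALUES (1, '2013-08-04', 0.5, 0.5)")
demo.executemany('INSERT INTO perimeters VALUES (?, ?, ?, ?, ?, ?)', [
    ('Douglas Complex', '2013-08-04', 0.0, 1.0, 0.0, 1.0),
    ('Douglas Complex', '2013-08-04', 0.4, 2.0, 0.4, 2.0),
])

# Merge on date plus "point falls inside the box" (a crude proxy for geometry).
merged = demo.execute('''SELECT d.id, p.fire_name, p.date_
                         FROM detected d
                         JOIN perimeters p
                           ON d.date_ = p.date_
                          AND d.x BETWEEN p.xmin AND p.xmax
                          AND d.y BETWEEN p.ymin AND p.ymax;''').fetchall()
print(len(merged))
```

The single detection matches both boundaries, so it comes back as two rows that look identical once only the fire name and date are kept.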

In [14]:
import psycopg2
import numpy as np

In [15]:
conn = psycopg2.connect(dbname='forest_fires')
c = conn.cursor()

In [16]:
# Grab the fire names for 2013 that have the highest number of obs per fire_name and date.
c.execute('''SELECT COUNT(fire_name) as total, fire_name, date_
            FROM daily_fire_shapefiles_2013 
            GROUP BY fire_name, date_
            ORDER BY total DESC
            LIMIT 20; ''')
c.fetchall()

[(16L, 'Douglas Complex', datetime.date(2013, 8, 4)),
 (16L, 'Douglas Complex', datetime.date(2013, 8, 13)),
 (16L, 'Douglas Complex', datetime.date(2013, 8, 5)),
 (15L, 'Douglas Complex', datetime.date(2013, 8, 2)),
 (12L, 'Douglas Complex', datetime.date(2013, 8, 9)),
 (12L, 'Douglas Complex', datetime.date(2013, 8, 11)),
 (11L, 'Douglas Complex', datetime.date(2013, 7, 30)),
 (10L, 'Whiskey Complex', datetime.date(2013, 8, 7)),
 (10L, 'Government Flats Complex', datetime.date(2013, 8, 19)),
 (10L, 'Corral Complex', datetime.date(2013, 8, 12)),
 (9L, 'West Fork Complex', datetime.date(2013, 7, 7)),
 (9L, 'Douglas Complex', datetime.date(2013, 8, 19)),
 (9L, 'Big Windy Complex', datetime.date(2013, 8, 3)),
 (9L, 'Douglas Complex', datetime.date(2013, 8, 18)),
 (9L, 'Government Flats Complex', datetime.date(2013, 8, 21)),
 (9L, 'Douglas Complex', datetime.date(2013, 7, 31)),
 (9L, 'Big Windy Complex', datetime.date(2013, 8, 5)),
 (8L, 'Corral Complex', datetime.date(2013, 8, 15)),
 (8L

In [17]:
# Now let's look at a couple of these and see what's different. SELECT * 
# won't work below because there is a field that returns all blanks and 
# causes an error. 
columns = ['acres', 'agency', 'time_', 'comments', 'year_', 'active', 
          'unit_id', 'fire_num', 'fire', 'load_date', 'inciweb_id', 
          'st_area_sh', 'st_length_', 'st_area__1', 'st_length1', 
          'st_area__2', 'st_lengt_1']
for column in columns: 
    c.execute('''SELECT ''' + column + '''
                FROM daily_fire_shapefiles_2013 
                WHERE fire_name = 'Douglas Complex' and date_ = '2013-8-4'; ''')
    print column + ':', np.unique(c.fetchall())

acres: [Decimal('5.19000000') Decimal('7.82000000') Decimal('9.36000000')
 Decimal('14.43000000') Decimal('244.75000000') Decimal('2200.27000000')
 Decimal('15768.88000000') Decimal('16356.72000000')
 Decimal('19608.00000000') Decimal('20059.75000000')]
agency: ['State Agency']
time_: ['0040' '1130' '1330']
comments: [None]
year_: ['2013']
active: ['N']
unit_id: ['OR-73S']
fire_num: ['HSG9']
fire: ['Brimstone' 'Dads Creek' 'Farmers' 'Malone' 'Malone Creek' 'McNab' 'Milo'
 'Rabbit Mountain' 'Tom East']
load_date: [datetime.date(2013, 8, 5) datetime.date(2013, 8, 21)]
inciweb_id: ['3559']
st_area_sh: [Decimal('0.00000231') Decimal('0.00000347') Decimal('0.00000416')
 Decimal('0.00000644') Decimal('0.00010880') Decimal('0.00097647')
 Decimal('0.00702357') Decimal('0.00728532') Decimal('0.00871746')
 Decimal('0.00891827')]
st_length_: [Decimal('0.00673966') Decimal('0.00725281') Decimal('0.00790847')
 Decimal('0.01173040') Decimal('0.04025363') Decimal('0.18993042')
 Decimal('0.77855830') 

At first glance, it looks like there are multiple entries per fire name because the rows cover different parts of the fire. If we look at the 'fire' variable above, we see a number of different names: ['Brimstone' 'Dads Creek' 'Farmers' 'Malone' 'Malone Creek' 'McNab' 'Milo' 'Rabbit Mountain' 'Tom East']. With some googling we can tell that these are different parts/areas of the same fire. Let's look at 2014 and check one observation there to (a) just check it out, and (b) confirm that what I think is true also holds in another year.
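One follow-up worth considering: the 2013-08-04 Douglas Complex group has 16 rows but only nine distinct 'fire' values, and both time_ and load_date take more than one value, so sub-fire names may not account for every row. A rough way to check is to add 'fire' to the GROUP BY and see which groups still have more than one row. The sketch below runs that query against an in-memory SQLite table with made-up rows, not the real daily_fire_shapefiles_2013 table.

```python
import sqlite3

# Toy stand-in for the perimeter table; the rows are hypothetical.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE perimeters (fire_name TEXT, date_ TEXT, fire TEXT, time_ TEXT)')
demo.executemany('INSERT INTO perimeters VALUES (?, ?, ?, ?)', [
    ('Douglas Complex', '2013-08-04', 'Brimstone', '0040'),
    ('Douglas Complex', '2013-08-04', 'Brimstone', '1330'),
    ('Douglas Complex', '2013-08-04', 'Dads Creek', '1130'),
    ('Douglas Complex', '2013-08-04', 'Milo', '1130'),
])

# Rows per (fire_name, date_, fire). Any group with more than one row is
# not explained by sub-fire names alone.
leftover = demo.execute('''SELECT fire_name, date_, fire, COUNT(*) AS total
                           FROM perimeters
                           GROUP BY fire_name, date_, fire
                           HAVING COUNT(*) > 1
                           ORDER BY total DESC;''').fetchall()
print(leftover)
```

If the same query returned nothing against the real table, the sub-fire names would fully explain the repeated rows. Anything it does return would point to some other cause, such as the same sub-fire being reported at different times of day.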

In [18]:
# Grab the fire names for 2014 that have the highest number of obs per fire_name and date.
c.execute('''SELECT COUNT(fire_name) as total, fire_name, date_
            FROM daily_fire_shapefiles_2014 
            GROUP BY fire_name, date_
            ORDER BY total DESC
            LIMIT 20; ''')
c.fetchall()

[(29L, 'July Complex', datetime.date(2014, 8, 27)),
 (23L, 'July Complex', datetime.date(2014, 9, 2)),
 (23L, 'Happy Camp Complex', datetime.date(2014, 9, 5)),
 (20L, 'Happy Camp Complex', datetime.date(2014, 8, 17)),
 (19L, 'July Complex', datetime.date(2014, 8, 17)),
 (18L, 'Deception Creek Complex', datetime.date(2014, 9, 3)),
 (18L, 'July Complex', datetime.date(2014, 8, 18)),
 (17L, 'Happy Camp Complex', datetime.date(2014, 9, 9)),
 (17L, 'July Complex', datetime.date(2014, 9, 8)),
 (17L, 'July Complex', datetime.date(2014, 8, 28)),
 (16L, 'Happy Camp Complex', datetime.date(2014, 9, 2)),
 (16L, 'July Complex', datetime.date(2014, 8, 19)),
 (15L, 'Deception Creek Complex', datetime.date(2014, 8, 26)),
 (14L, 'July Complex', datetime.date(2014, 8, 29)),
 (14L, 'Chiwaukum Complex', datetime.date(2014, 8, 15)),
 (14L, 'July Complex', datetime.date(2014, 8, 22)),
 (13L, 'Happy Camp Complex', datetime.date(2014, 9, 8)),
 (13L, 'Happy Camp Complex', datetime.date(2014, 8, 23)),
 (13L, '

In [19]:
# Now let's look at a couple of these and see what's different. SELECT * 
# won't work below because there is a field that returns all blanks and 
# causes an error. The columns also aren't the same in 2014 as they are in 2013. 
columns = ['acres', 'agency', 'time_', 'comments', 'year_', 'active', 
          'unit_id', 'fire_num', 'fire', 'load_date', 'inciweb_id', 
          'st_area_sh', 'st_length_']
for column in columns: 
    c.execute('''SELECT ''' + column + '''
                FROM daily_fire_shapefiles_2014 
                WHERE fire_name = 'July Complex' and date_ = '2014-8-27'; ''')
    print column + ':', np.unique(c.fetchall())

acres: [Decimal('0.03000000') Decimal('0.07000000') Decimal('1.24000000')
 Decimal('1.81000000') Decimal('1.90000000') Decimal('2.43000000')
 Decimal('3.03000000') Decimal('3.13000000') Decimal('3.48000000')
 Decimal('7.18000000') Decimal('16.63000000') Decimal('20.82000000')
 Decimal('2209.04000000') Decimal('3633.95000000')
 Decimal('33752.02000000') Decimal('39300.11000000')
 Decimal('39417.94000000')]
agency: ['USFS']
time_: ['0037' '1553' '2249']
comments: [None 'IR heat perimeter']
year_: ['2014']
active: ['N']
unit_id: ['CA-KNF']
fire_num: ['H91E']
fire: [None 'Crapo' 'Devil' 'F2_1' 'F2_2' 'F3' 'Gem' 'Jewel' 'Leef' 'Log' 'Man'
 'Rays Peak' 'Shelly' 'Summit' 'Trail' 'Whites']
load_date: [datetime.date(2014, 8, 27) datetime.date(2014, 8, 28)]
inciweb_id: ['4035']
st_area_sh: [Decimal('1.287E-8') Decimal('2.922E-8') Decimal('5.4275E-7')
 Decimal('7.9211E-7') Decimal('8.2960E-7') Decimal('0.00000106254')
 Decimal('0.00000132234') Decimal('0.00000136810') Decimal('0.00000151818')
 De

Cool - the 2014 data seems to tell the same story. 
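Since the same per-column loop was run for both years, it could be wrapped in a small helper to make checking other fire/date pairs easier. This is only a sketch: `distinct_values` is a hypothetical name, and the demo runs on an in-memory SQLite table with made-up rows. SQLite uses `?` placeholders; with psycopg2 the placeholders would be `%s`.

```python
import sqlite3

def distinct_values(cursor, table, columns, fire_name, date_):
    """Return {column: sorted distinct non-null values} for one fire_name and date_."""
    result = {}
    for column in columns:
        # Identifiers cannot be bound as query parameters, so only accept
        # plain names made of letters, digits, and underscores.
        for name in (table, column):
            if not name.replace('_', '').isalnum():
                raise ValueError('unexpected identifier: ' + name)
        cursor.execute('SELECT DISTINCT ' + column + ' FROM ' + table +
                       ' WHERE fire_name = ? AND date_ = ?;', (fire_name, date_))
        # NULLs are dropped here so the remaining values can be sorted.
        result[column] = sorted(row[0] for row in cursor.fetchall()
                                if row[0] is not None)
    return result

# Hypothetical rows standing in for daily_fire_shapefiles_2014.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE perimeters (fire_name TEXT, date_ TEXT, fire TEXT, time_ TEXT)')
demo.executemany('INSERT INTO perimeters VALUES (?, ?, ?, ?)', [
    ('July Complex', '2014-08-27', 'Log', '0037'),
    ('July Complex', '2014-08-27', 'Whites', '1553'),
    ('July Complex', '2014-08-27', None, '1553'),
    ('Other Fire', '2014-08-27', 'Elsewhere', '2249'),
])
summary = distinct_values(demo.cursor(), 'perimeters', ['fire', 'time_'],
                          'July Complex', '2014-08-27')
print(summary)
```

Binding the fire name and date as parameters avoids hand-quoting the values. The column and table names still have to be spliced into the SQL string, which is why the helper checks them first.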