# Introduction

We will go through the Brazil Forest Fire Historical Data and analyze how forest fires in **Brazil** evolve throughout the year. The goal is to find which months are the hotspots. The data can also be combined with other data, such as weather conditions, to form hypotheses about how the fires started and how to prevent them in the future.
<br>
The "inspiration" for this data processing can be found [here](https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil/data#). However, I use the data provided [here](https://storage.googleapis.com/kaggle-forum-message-attachments/675863/14453/rf_incendiosflorestais_focoscalor_estados_1998-2017.csv). **The data I used is a bit different, so be careful when comparing this kernel to others.**

## About This File

This dataset reports the number of forest fires in Brazil, divided by state. The series covers a period of approximately 20 years (1998 to 2017).

### Column

- year: the year when the forest fires happened
- state: Brazilian state
- month: the month when the forest fires happened
- number: number of forest fires reported
- date: date when the forest fires were reported

## Limitation

After going through the dataset, I suspect that there are outliers in the data. This is indicated by how high the number of reported forest fires is in several months and states. However, we do not know how severe a single reported forest fire is. It may represent some area or some duration, but the data does not say.
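
One common way to make an outlier suspicion like this concrete is the 1.5 × IQR rule. The sketch below is a minimal illustration on made-up numbers, not on the real dataset; `flag_iqr_outliers` and the toy frame are hypothetical names introduced here, and the rule itself is only one of several reasonable choices.

```python
import pandas as pd

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data, made up for illustration; not taken from the real dataset
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "number": [10, 12, 11, 9, 13, 500],
})
toy["is_outlier"] = flag_iqr_outliers(toy["number"])
print(toy)
```

On the real data, the same mask could be computed per state or per month, since a count that is extreme for one state may be ordinary for another.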

# Data Preview and Data Cleaning

We will first look at our data to understand its structure and basic statistics. Also, because the month names in the data are in Portuguese, we will convert them to numbers to prevent confusion and to make the data easier to work with.

In [None]:
# import the modules needed to analyze the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')
%matplotlib inline

In [None]:
# load the dataset into a DataFrame;
# thousands = "." makes pandas treat the dot as a thousands separator

firepd = pd.read_csv('../input/amazon2/amazon2.csv', encoding = 'latin1', thousands = ".")

print("First 5 rows of the dataset: ")
print(firepd.head())
print("\nData Description")
print(firepd.describe())
print("\nData details")
firepd.info()  # info() prints directly and returns None, so no print() is needed

I suspect that the **date** column does not give any new information, since we already have the year column. It also seems that the date column just stores the first date of each year. But before dropping the column completely, we need to check it first.

In [None]:
print('The proof of our suspicion: \n','\n',firepd.date.unique())
firepd = firepd.drop(['date'], axis = 1)
firepd.head()
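
The printout of unique values above relies on reading the dates by eye. A more explicit check is sketched below on a toy frame with made-up values. It assumes the dates are ISO-formatted strings (`YYYY-MM-DD`), which may not match the real file, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the `year` and `date` columns; values are made up
toy = pd.DataFrame({
    "year": [1998, 1998, 1999, 2000],
    "date": ["1998-01-01", "1998-01-01", "1999-01-01", "2000-01-01"],
})

parsed = pd.to_datetime(toy["date"])

# True when every date falls in the same year as the `year` column
year_matches = bool((parsed.dt.year == toy["year"]).all())
# True when every date is January 1st, i.e. it carries no month or day information
always_jan_first = bool(((parsed.dt.month == 1) & (parsed.dt.day == 1)).all())

date_is_redundant = year_matches and always_jan_first
print(date_is_redundant)
```

If both conditions hold on the real data, the column can be dropped without losing information.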

In [None]:
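# change the month format into number

# the names are in Portuguese; this relies on the rows being ordered
# chronologically, so the first unique name is January and the last is December
mo_in_pt = list(firepd.month.unique())

mo_num_dict = {name: i for i, name in enumerate(mo_in_pt, start = 1)}

firepd['month'] = firepd['month'].map(mo_num_dict)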

print("First 5 rows of the dataset: ")
firepd.head()
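
Building the mapping from the order of the unique values works only if the file lists the months chronologically. A more explicit alternative is sketched below. The Portuguese spellings in `PT_MONTH_TO_NUM` are an assumption and should be compared against `firepd.month.unique()` before use; `month_to_number` is a hypothetical helper, not part of the original kernel.

```python
import pandas as pd

# Assumed spellings of the Portuguese month names; verify against the real column
PT_MONTH_TO_NUM = {
    "Janeiro": 1, "Fevereiro": 2, "Março": 3, "Abril": 4,
    "Maio": 5, "Junho": 6, "Julho": 7, "Agosto": 8,
    "Setembro": 9, "Outubro": 10, "Novembro": 11, "Dezembro": 12,
}

def month_to_number(months: pd.Series) -> pd.Series:
    """Map month names to 1..12, failing loudly on any name that is not in the table."""
    unknown = set(months.unique()) - set(PT_MONTH_TO_NUM)
    if unknown:
        raise ValueError(f"Unmapped month names: {sorted(unknown)}")
    return months.map(PT_MONTH_TO_NUM)

# Toy input, deliberately out of chronological order
toy = pd.Series(["Setembro", "Janeiro", "Março"])
print(month_to_number(toy).tolist())
```

Raising on unknown names avoids silently producing missing values when a spelling or encoding differs from what the table expects.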

# Forest Fire Occurrence, Time Wise

## Analyzing The General Pattern

From here, we are going to dig into the data, starting from the general information and then diving deeper into more specific areas and periods.
<br>
First, we will point out which year has the highest number of forest fires. Then, we will try to figure out which month has the most frequent forest fires.

In [None]:
# this dataframe stores the record with the MAXIMUM number of forest fire in the respective year;
# idxmax matches year and number as a pair (ties resolve to the first occurrence)
max_year_pd = firepd.loc[firepd.groupby('year')['number'].idxmax()].sort_values(by = 'year')

# this dataframe stores the TOTAL number of forest fire in the respective year
data = firepd[['year','number']].groupby(['year']).sum().reset_index() 
    
# Visualization

fig, ax1 = plt.subplots(figsize=(14,6))

color = 'tab:blue'
ax1.set_title('Forest Fire Record', fontsize = 20)
ax1.set_xlabel('year')
ax1.set_ylabel('Maximum number reported', color=color)
ax1.bar(list(max_year_pd['year']), list(max_year_pd['number']), color = color)
ax1.grid(axis = 'x')
ax1.set_xticks(list(data['year']))

ax2 = ax1.twinx() # Create a twin Axes sharing the x axis

color = 'tab:red'
ax2.set_ylabel('Total reported', color=color)
ax2.plot(list(data['year']), list(data['number']), color = color)
ax2.grid(None)
ax2.tick_params(axis='y', labelcolor=color)
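
When selecting the record that holds each year's maximum, the year and the number have to match as a pair. The sketch below uses made-up toy data to show why filtering the two columns independently can let extra rows through, and how `groupby(...).idxmax()` avoids that. The frame and variable names are hypothetical.

```python
import pandas as pd

# Toy data, made up: the maximum for 1998 is 50, the maximum for 1999 is 30
toy = pd.DataFrame({
    "year":   [1998, 1998, 1999, 1999],
    "month":  [8, 9, 8, 9],
    "number": [30, 50, 30, 20],
})

yearly_max = toy.groupby("year")["number"].max().reset_index()

# Two independent isin filters: the (1998, 30) row slips through because
# 30 is the maximum of a different year (1999)
loose = toy[toy["year"].isin(yearly_max["year"])
            & toy["number"].isin(yearly_max["number"])]

# idxmax returns the index label of each year's maximum row, so year and
# number are matched as a pair (ties resolve to the first occurrence)
strict = toy.loc[toy.groupby("year")["number"].idxmax()]

print(len(loose), len(strict))
```

Any extra row in the loose selection would also be counted in a later month-frequency table, which is why the paired selection matters.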

## Max Forest Fire on Each Year

As seen in the bar graph above, the maximum number of forest fires reported varies from year to year, with several ups and downs. The lowest was in 2013 with 5576 fires reported, and the highest was in 2007 with 25936. There is no steady trend: some years show a sharp increase and others a sharp decrease.
<br><br>
Let's see whether there is a pattern in the month when this maximum report occurred.

In [None]:
# Create a frequency table of month vs max reported fire each year;
# reindex fills the months that never held a yearly maximum with 0
# (DataFrame.append was removed in pandas 2.0, so it is avoided here)

month_freq = (max_year_pd['month'].value_counts()
              .reindex(range(1, 13), fill_value = 0)
              .rename_axis('month')
              .reset_index(name = 'frequency'))

# Visualize the table

month_freq_graph = month_freq.sort_values('month').plot(x = 'month', y = 'frequency',
               kind = 'bar',
               rot = 0, figsize = (10,6))

box_month = firepd.boxplot(column = ['number'], by = ['month'], fontsize = 12, figsize = (15,8), showfliers = False)

## Yearly Maximum Forest Fire Pattern

The graph above depicts the months in which the yearly maximum occurred from 1998 to 2017. The 9<sup>th</sup> month (September) has the highest frequency. This indicates that, over the last 20 years, September is the month in which forest fires are most likely to peak.
<br><br>
We will go through the data once again to see the monthly distribution of forest fire reported throughout the year.

In [None]:
# count the records (one per state, year and month) with at least one reported fire
suffixes = {1: 'st', 2: 'nd', 3: 'rd'}

for i in range(1, 13):
    n_records = len(firepd[(firepd.month == i) & (firepd.number > 0)])
    print(f"There are {n_records} records with forest fire reported on {i}{suffixes.get(i, 'th')} month")

It turns out that the count is distributed fairly evenly throughout the months. Let's add more information to our condition, such as the average number of forest fires reported over the last 20 years.

In [None]:
mean = firepd['number'].mean()

above_avg_month_dict = {}

for i in range(1,13):
    above_avg_month_dict[i] = len(firepd[(firepd.month == i) & (firepd.number >= mean)])

above_avg_month_pd = pd.DataFrame(list(above_avg_month_dict.items()), columns = ['month', 'count above avg'])
above_avg_month_pd['rank'] = above_avg_month_pd['count above avg'].rank(method='first', ascending = False).astype('int64')
above_avg_month_pd = above_avg_month_pd[['rank', 'count above avg', 'month']] # rearrange the column order
above_avg_month_pd = above_avg_month_pd.sort_values('rank')
above_avg_month_pd

## Some Insight

I think this information is more insightful than before. There are significant differences that are consistent with our earlier frequency graph. Although forest fires in Brazil have occurred consistently every month since 1998, the first 5 rows of the data frame above differ significantly from the rest. Each of the top 5 months has more than 100 records with an above-average number of forest fires in the last 20 years.
<br><br>
We can also compare this information with the frequency graph. You may notice that **September (9<sup>th</sup> month) is the month in which the yearly maximum was reported most often, and it ranks 2nd for the number of above-average reports**. This should be a lead for investigating what triggers the forest fires. By knowing which months have the most severe and most frequent forest fires, we narrow the time window for further investigation of natural conditions and human behavior during that time. This will be useful for detection, mitigation, and prevention in the future.
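
The comparison between the two per-month summaries can also be done in a single table. The sketch below is one possible way, not the method used in this kernel: it merges two hypothetical summaries on `month` and adds a rank for each signal. All numbers are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical per-month summaries, made up for illustration
max_freq = pd.DataFrame({"month": [8, 9, 10], "frequency": [5, 9, 3]})
above_avg = pd.DataFrame({"month": [8, 9, 10], "count above avg": [120, 110, 90]})

combined = max_freq.merge(above_avg, on="month")

# Rank 1 is the highest value for each signal
combined["freq_rank"] = (
    combined["frequency"].rank(method="first", ascending=False).astype(int)
)
combined["above_avg_rank"] = (
    combined["count above avg"].rank(method="first", ascending=False).astype(int)
)

print(combined.sort_values("freq_rank"))
```

Months that rank near the top on both signals are the natural candidates for the narrower investigation window described above.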