**DEPENDENCIES**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Data Loading & Data Exploring**

In [None]:
forestdata=pd.read_csv('../input/forest-fires-data-set/forestfires.csv')
forestdata

**Summary Statistics**

In [None]:
forestdata.describe(include='all')  # basic summary statistics for every column
# include='all' also covers the categorical columns (month, day), not only the numerical ones.

In [None]:
forestdata.head(6)  # first 6 rows of the dataset

In [None]:
forestdata.tail(6)  # last 6 rows of the dataset

In [None]:
forestdata.info()  # column names, dtypes, non-null counts and memory usage

**Data Analysis :- FWI Code Relation with Temporal Conditions**

In [None]:
df1=pd.pivot_table(data=forestdata,values=['rain','temp','wind','RH','area','FFMC','DMC','DC','ISI'],index='month',aggfunc=['mean'])
df1

In [None]:
df1[('mean','rain')].sort_values(ascending=False).head(4)

In [None]:
df1[('mean','temp')].sort_values(ascending=False).head(4)

In [None]:
df1[('mean','wind')].sort_values(ascending=True).head(4)

In [None]:
df1[('mean','RH')].sort_values(ascending=True).head(4)

In [None]:
df1[('mean','DC')].sort_values(ascending=False).head(4)

In [None]:
df1[('mean','DMC')].sort_values(ascending=False).head(4)

In [None]:
df1[('mean','FFMC')].sort_values(ascending=False).head(4)

In [None]:
df1[('mean','ISI')].sort_values(ascending=False).head(4)

Weather Observations:
1. Rain
Rain falls only in aug, july and mar, and even then in very small amounts.

From a rainfall perspective, the months that receive no rain at all are the more dangerous ones.

2. Temp
june, july, aug, sep and oct have high temperatures.

3. Wind
Wind is low in jan, feb, july, sep and oct.

4. Relative Humidity
Humidity is also low in sep, oct, nov and dec.

FWI Code Observations:
1. DC is high in july, aug, sep and oct, i.e. conditions are drier in those months.
2. DMC is high in july, aug and sep, but not in oct.
3. FFMC is high (above 90) in july, aug, sep and oct.
4. ISI is high in june, july, aug and sep.

Forest Fire Prediction
1. Months with no rainfall are more prone to forest fires.
2. As temperature rises, the moisture content of all three fuel types (FFMC, DMC, DC) drops in the same month, so from a temperature perspective the most dangerous conditions occur in july, aug, sep and oct.
3. The DMC and DC values show no significant relation with the wind and relative humidity columns.
4. Months with low humidity are more prone to forest fires.
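
The statement about DMC and DC can be checked numerically instead of by reading the monthly table. The following is a minimal sketch, assuming the column names used in `forestdata`; the helper name `fwi_weather_corr` and the small demo frame are made up for illustration and are not values from the dataset.

```python
import pandas as pd

def fwi_weather_corr(df, codes=("DMC", "DC"), weather=("wind", "RH")):
    """Pearson correlation between FWI codes (rows) and weather columns (columns)."""
    cols = list(codes) + list(weather)
    corr = df[cols].corr(method="pearson")
    return corr.loc[list(codes), list(weather)]

# Hypothetical demo values, NOT taken from forestfires.csv
demo = pd.DataFrame({
    "DMC":  [26.2, 35.4, 43.7, 33.3, 51.3],
    "DC":   [94.3, 669.1, 686.9, 77.5, 102.2],
    "wind": [6.7, 0.9, 1.3, 4.0, 1.8],
    "RH":   [51, 33, 33, 97, 99],
})
corr_table = fwi_weather_corr(demo)
print(corr_table.round(2))
```

On the real data, `fwi_weather_corr(forestdata)` would return a 2x2 table; coefficients close to 0 would support the observation, while larger absolute values would call it into question.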

**Data Visualization**

In [None]:
# analysis on burned area
plt.figure(figsize=(16,5))
print("Skew: {}".format(forestdata['area'].skew()))
print("Kurtosis: {}".format(forestdata['area'].kurtosis()))
ax = sns.kdeplot(forestdata['area'],fill=True,color='g')  # fill= replaces the deprecated shade= argument
plt.xlabel('Area in hectare',color='red',fontsize=15)
plt.ylabel('probability density of forest fire',color='red',fontsize=15)
plt.title('Forest Fire Probability Density  Vs Amount of Area Burnt',color='blue',fontsize=18)
plt.xticks([i for i in range(0,1200,50)])
plt.show()

Observations:
1. The burned area is highly right-skewed, with a skewness of about +12.84 and a very large kurtosis of about 194. Both statistics are dimensionless, so they carry no hectare unit.
2. This tells us that the majority of forest fires do not cover a large area: most of the burned areas are under 50 hectares.
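
The two observations above can be backed by numbers. This is a small sketch under stated assumptions: the helper `area_summary`, the 50 ha default threshold and the demo values are illustrative only, and the `log1p` transform is a common way to reduce right skew, not something the notebook itself applies.

```python
import numpy as np
import pandas as pd

def area_summary(area, threshold=50.0):
    """Skewness, kurtosis and the share of fires below `threshold` hectares."""
    area = pd.Series(area, dtype="float64")
    return {
        "skew": area.skew(),                    # dimensionless
        "kurtosis": area.kurtosis(),            # dimensionless (excess kurtosis)
        "share_below": (area < threshold).mean(),
        "skew_log1p": np.log1p(area).skew(),    # skewness after log(1 + area)
    }

# Hypothetical burned areas in hectares, NOT taken from forestfires.csv
demo_area = [0.0, 0.0, 0.0, 0.5, 1.2, 3.4, 8.0, 15.0, 40.0, 900.0]
summary = area_summary(demo_area)
print(summary)
```

With the real data, `area_summary(forestdata['area'])` gives the fraction of fires under 50 ha directly, which makes the "most fires are small" statement quantitative.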

In [None]:
dfa = forestdata.drop(columns='area')
cat_columns = dfa.select_dtypes(include='object').columns.tolist()  # separate the categorical columns (month, day)
num_columns = dfa.select_dtypes(exclude='object').columns.tolist()  # separate the numerical columns

**Analyzing Categorical Columns**

In [None]:
# Analysis of forest fire based on different months and days.
plt.figure(figsize=(16,10))
for i,col in enumerate(cat_columns,1):
    plt.subplot(2,2,i)             # top row: horizontal count plot
    sns.countplot(data=dfa,y=col)  # number of records for each month/day
    plt.subplot(2,2,i+2)           # bottom row: the same counts as a bar chart
    forestdata[col].value_counts().plot.bar()  # frequency of each month/day, most frequent first
    plt.xlabel(col)                # categories are on the x-axis
    plt.ylabel('number of records')  # raw counts, not percentages
plt.show()

Observations:

1. An unusually high number of forest fires occur in August and September, and the fewest in November.

2. For the day of the week, Friday through Monday have a higher proportion of cases. (However, this is not a strong indicator.)
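
To compare categories as percentages instead of raw counts, the counts can be normalised. The snippet below is a minimal sketch; `percent_distribution` is a hypothetical helper name and the month list is made-up demo data.

```python
import pandas as pd

def percent_distribution(values):
    """Percentage of records per category, largest share first."""
    return pd.Series(values).value_counts(normalize=True).mul(100).round(1)

# Hypothetical month labels, NOT taken from forestfires.csv
demo_months = ["aug", "aug", "aug", "sep", "sep", "mar", "nov", "aug"]
pct = percent_distribution(demo_months)
print(pct)
```

On the notebook's data this would be called as `percent_distribution(forestdata['month'])` or `percent_distribution(forestdata['day'])`.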

In [None]:
# Analysis of forest fire damage based on different months and days.
# Adding categorical variable  based on forest fire area as No damage, low, moderate, high, very high
def area_cat(area):            # grouping damage category based on amount of area burned.
    if area == 0.0:
        return "No damage"
    elif area <= 1:
        return "low"
    elif area <= 25:
        return "moderate"
    elif area <= 100:
        return "high"
    else:
        return "very high"

forestdata['damage_category'] = forestdata['area'].apply(area_cat)




# Row-normalised crosstab: share of each month/day within every damage category
for col in cat_columns:
    cross = pd.crosstab(index=forestdata['damage_category'],columns=forestdata[col],normalize='index')
    cross.plot.barh(stacked=True,rot=40,cmap='plasma')
    plt.xlabel('proportion of fires per damage category')  # axis runs from 0 to 1
    plt.xticks(np.arange(0,1.1,0.1))
    plt.title("Forest fire damage by {}".format(col))
plt.show()

Observations:

1. Previously we observed that August and September had the highest number of forest fires. From the above plot by month, we can understand a few more things:

#. Most of the fires in August were low (< 1 hectare).

#. The very high damages (> 100 hectares) happened in only 3 months: august, july and september.

2. Regarding fire damage per day, not much can be observed, except that there were no very high damage fires on Friday.
#. Very high damage fires were reported most often on Saturdays.
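
The damage categories defined in `area_cat` can also be produced without a Python-level `apply`. The following sketch uses `pd.cut` with the same thresholds (0, 1, 25 and 100 hectares); the names `BINS`, `LABELS` and `damage_category` are illustrative, and it assumes burned areas are never negative, since a negative value would fall into the first bin here.

```python
import numpy as np
import pandas as pd

# Right-closed intervals: (-inf, 0], (0, 1], (1, 25], (25, 100], (100, inf)
BINS = [-np.inf, 0, 1, 25, 100, np.inf]
LABELS = ["No damage", "low", "moderate", "high", "very high"]

def damage_category(area):
    """Vectorised equivalent of area_cat for non-negative areas."""
    return pd.cut(pd.Series(area, dtype="float64"), bins=BINS, labels=LABELS, right=True)

# Hypothetical areas chosen to hit every boundary
demo_area = [0.0, 0.4, 1.0, 1.1, 25.0, 99.9, 100.0, 250.0]
cats = damage_category(demo_area)
print(cats.tolist())
```

A side benefit of `pd.cut` is that the result is an ordered categorical, so plots and crosstabs can keep the categories in severity order instead of alphabetical order.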

**Analyzing Numerical Columns**

In [None]:
# Analysis of burnt area based on spatial coordinates (X,Y); marker size is proportional to burned area
forestdata.plot(kind='scatter', x='X', y='Y', alpha=0.2, s=20*forestdata['area'],figsize=(10,6))
plt.xlabel('X coordinates of park',color='red',fontsize=15)
plt.ylabel('Y coordinates of park',color='red',fontsize=15)
plt.title('Burnt area in different regions of the park',color='blue',fontsize=18)
plt.show()

Observations:

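1. The scatter plot represents the park as a 9x9 grid, and it shows multiple hotspots of burnt area.

2. The coordinates (6,5) show intense burnt area.

3. Applying the column-wise minimum and maximum functions to the data gives (1,2) and (9,9).

#. These are the smallest and largest values of the X and Y columns taken independently, i.e. the corners of the coordinate range, and not necessarily the cells where the smallest or largest fire occurred.

#. To find where the largest burned area actually lies, look up the row returned by idxmax on the area column and read its X and Y values.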
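
The difference between column-wise extremes and the location of the largest fire can be illustrated with a short example. This is a sketch on made-up rows, not on the real dataset; `largest_fire_cell` and `area_by_cell` are hypothetical helper names.

```python
import pandas as pd

def largest_fire_cell(df):
    """Coordinates (X, Y) of the row with the largest burned area."""
    row = df.loc[df["area"].idxmax()]
    return int(row["X"]), int(row["Y"])

def area_by_cell(df):
    """Total burned area per (X, Y) grid cell, largest first."""
    return df.groupby(["X", "Y"])["area"].sum().sort_values(ascending=False)

# Hypothetical rows, NOT taken from forestfires.csv
demo = pd.DataFrame({
    "X":    [1, 9, 6, 6, 3],
    "Y":    [2, 9, 5, 5, 4],
    "area": [0.0, 2.5, 120.0, 30.0, 7.1],
})

naive = (int(demo["X"].max()), int(demo["Y"].max()))  # column-wise maxima
actual = largest_fire_cell(demo)                      # row-wise lookup
totals = area_by_cell(demo)
print(naive, actual)
print(totals)
```

In this demo the column-wise maxima point at a corner cell with a small fire, while the row-wise lookup and the per-cell totals both point at the cell that actually burned the most. The same two helpers can be applied to `forestdata` to confirm the hotspot read off the scatter plot.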

In [None]:
# monthly analysis of burnt area, where the condition is: area>0
areaburnt=forestdata[forestdata['area']>0]
areaburnt

In [None]:
areaburnt.groupby('month')['area'].agg('count').plot(kind='pie',title='Monthly analysis of burnt area',figsize=(9,9),explode=[0,0.1,0,0,0,0,0,0,0,0.1],autopct='%0.1f%%')
plt.show()

Observations:

1. As the pie chart shows, aug and sep account for the highest share of fires with a non-zero burned area, i.e. 36.8% and 36.1% respectively.

2. The month with the lowest share of such fires is may, with 0.4%.
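
The pie chart above hard-codes its `explode` list, which only works while the number and order of months stay the same. A possible alternative, sketched here with a hypothetical helper `explode_top` and made-up counts, derives the offsets from the data so that the largest slices are always the ones pulled out.

```python
import pandas as pd

def explode_top(counts, n=2, offset=0.1):
    """Pie-chart offsets: `offset` for the n largest slices, 0 for the rest."""
    top = set(counts.nlargest(n).index)
    return [offset if label in top else 0 for label in counts.index]

# Hypothetical fire counts per month, NOT taken from forestfires.csv
demo_counts = pd.Series({"apr": 4, "aug": 50, "dec": 9, "may": 1, "sep": 48})
explode = explode_top(demo_counts)
print(explode)
```

In the notebook, the counts would come from `areaburnt.groupby('month')['area'].agg('count')`, and the resulting list would be passed as `explode=` to the same `.plot(kind='pie', ...)` call.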

Conclusion:

1. From the above analysis we can conclude that months with little rainfall also show higher temperatures, and this affects all the FWI codes.

Also, wind causes no significant change in the DMC and DC codes, as they describe the moisture of the deeper fuel layers.

2. The majority of the forest fires do not cover a large area: most of the burned areas are under 50 hectares.

3. A high number of the forest fires occur in August and September.

4. The coordinates (6,5) show intense burned area.