Version 1.0.2

# Pandas basics 

Hi! In this programming assignment you will refresh your `pandas` knowledge. You will need several [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)s and `join`s to solve the task.

In [1]:
import pandas as pd
import numpy as np
#import googletrans
import os
import matplotlib.pyplot as plt
plt.rcdefaults()  # reset matplotlib settings to their defaults
%matplotlib inline 

In [2]:
DATA_FOLDER = '../readonly/final_project_data/'

I got hold of a dataset (from Kaggle) of forest fires in Brazil, which houses the largest rainforest on Earth, the Amazon. I didn’t want to be picky, so this dataset was a completely random choice.
> About the data:
* year is the year when the forest fire happened;
* state is the Brazilian state;
* month is the month when the forest fire happened;
* number is the number of forest fires reported;
* date is the date when the forest fire was reported.

<ol start="1">
<li>Going through the csv file (amazon.csv), you notice that some values in the number column contain a decimal point. 2.588 forest fires doesn't make sense. That's because the period is used as the thousands separator, so 2.588 means 2588 forest fires. This is easily accounted for when reading the csv file.
</li>
</ol>

<ol start="2">
<li>You’ll also notice that the month column is in Portuguese. There's an upcoming fix for that too.
</li>
</ol>
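One possible way to handle the Portuguese month names is sketched here, assuming a hand-written lookup table is acceptable in place of googletrans. The names `PT_TO_EN` and `translate_months` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical lookup table: Portuguese month name -> English month name.
PT_TO_EN = {
    "Janeiro": "January",
    "Fevereiro": "February",
    "Março": "March",
    "Abril": "April",
    "Maio": "May",
    "Junho": "June",
    "Julho": "July",
    "Agosto": "August",
    "Setembro": "September",
    "Outubro": "October",
    "Novembro": "November",
    "Dezembro": "December",
}


def translate_months(months: pd.Series) -> pd.Series:
    """Map Portuguese month names to English; unknown names become NaN."""
    return months.map(PT_TO_EN)


# Toy data standing in for the month column of the dataset.
sample = pd.Series(["Janeiro", "Março", "Dezembro"])
translated = translate_months(sample)
print(translated.tolist())
```

A fixed dictionary works offline and gives the same result on every run, which is why it is used in this sketch. A translation service would be the alternative if the set of values were not known in advance.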

<ol start="3">
<li>When I imported the file for the first time after downloading it, I got an error: UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte. To fix it, I opened the csv in Sublime Text and chose Save with Encoding -> UTF-8. However, this caused errors in the date column. So, I simply opened the original csv and exported it as csv again. Weird, but it worked.
</li>
</ol>
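An alternative to re-saving the file by hand is to tell pandas which encoding to use. The sketch below assumes the file is Latin-1 encoded, which the text above does not confirm. It uses a small in-memory csv instead of the real file.

```python
import io

import pandas as pd

# Toy csv encoded as Latin-1 (an assumption about the original file's encoding).
raw = "month,number\nMarço,35\n".encode("latin-1")

# Decoding these bytes as UTF-8 fails, much like the error described above.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the encoding explicitly lets pandas read the bytes unchanged.
df_latin = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(utf8_ok, df_latin["month"].tolist())
```

If the real file turns out to use a different encoding, the same `encoding` parameter applies with that encoding's name.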

Imports:
For this project, I set up a virtual environment using virtualenv. Check out this post for all the steps. We’re using three major libraries: pandas, matplotlib and googletrans.
Install these packages with `!pip3 install` (if you haven’t already) before importing them.

> Read the data:
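Make sure the csv file (read as amazon_update.csv from DATA_FOLDER in the cell below) is in place. The `thousands="."` parameter tells pandas to treat the period as a thousands separator, which makes up for the decimal formatting.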

In [3]:
forests = pd.read_csv(os.path.join(DATA_FOLDER, 'amazon_update.csv'), thousands='.')
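To see what the `thousands` parameter changes, here is a minimal sketch on a made-up in-memory csv. The values and the variable names `csv_text`, `plain` and `fixed` are illustrative only.

```python
import io

import pandas as pd

# Made-up rows: "2.588" is meant to be two thousand five hundred eighty-eight.
csv_text = "state,number\nAcre,2.588\nBahia,0\nPara,12\n"

# Without the parameter, the period is read as a decimal point.
plain = pd.read_csv(io.StringIO(csv_text))

# With thousands=".", the period is dropped before the number is parsed.
fixed = pd.read_csv(io.StringIO(csv_text), thousands=".")

print(plain["number"].tolist())
print(fixed["number"].tolist())
```

Note that with this setting every period in a numeric column is treated as a separator, so it only suits files that contain no genuine decimals.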

# View the data:

Let's start with a simple task. 

`describe(include="all")` gives a nice summary of the data, such as the count for every column and, where applicable, the number of unique values, the most frequent value (`top`) and its frequency (`freq`).

In [4]:
# YOUR CODE GOES HERE
print(forests.shape)
print(forests.head())
print(forests.describe(include="all"))
#print(forests.isna().sum()) # Check for any missing values

(6454, 5)
   year state    month  number        date
0  1998  Acre  Janeiro       0  1998-01-01
1  1999  Acre  Janeiro       0  1999-01-01
2  2000  Acre  Janeiro       0  2000-01-01
3  2001  Acre  Janeiro       0  2001-01-01
4  2002  Acre  Janeiro       0  2002-01-01
               year state    month        number        date
count   6454.000000  6454     6454   6454.000000        6454
unique          NaN    23       12           NaN          20
top             NaN   Rio  Janeiro           NaN  2010-01-01
freq            NaN   717      541           NaN         324
mean    2007.461729   NaN      NaN    522.696312         NaN
std        5.746654   NaN      NaN   1554.846486         NaN
min     1998.000000   NaN      NaN      0.000000         NaN
25%     2002.000000   NaN      NaN      9.000000         NaN
50%     2007.000000   NaN      NaN     54.000000         NaN
75%     2012.000000   NaN      NaN    269.000000         NaN
max     2017.000000   NaN      NaN  25963.000000         NaN
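To make the `unique`, `top` and `freq` rows of the summary concrete, here is a small sketch on made-up data (the frame `toy` is hypothetical). It compares them with what `value_counts` reports for a text column.

```python
import pandas as pd

# Made-up data with one text column and one numeric column.
toy = pd.DataFrame({"state": ["Acre", "Acre", "Bahia"], "number": [0, 5, 7]})

summary = toy.describe(include="all")

# For a text column, top is the most frequent value and freq is its count.
counts = toy["state"].value_counts()

print(summary.loc[["unique", "top", "freq"], "state"])
print(counts.idxmax(), counts.max())
```

Numeric statistics such as `mean` are NaN for text columns, and `unique`, `top` and `freq` are NaN for numeric columns. That is why the summary above contains so many NaN entries.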


# Break the dataset into smaller subsets:

The first thought I had was to visualise the number of forest fires over the years and over the months. To do that, you need to be able to pick out smaller pieces of the bigger picture.

Let’s drop the rows that don’t contribute to the number of forest fires, i.e. any row whose number value is 0. We first convert the 0s to NaN and then drop the rows with NaN in the number column.

In [5]:
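# YOUR CODE GOES HERE
# Convert the 0s to NaN (this turns the number column into floats)
forests = forests.replace(0, np.nan)
# Drop the rows whose number value is NaN
forests_update = forests.dropna(subset=['number'])
print(forests_update.describe(include="all"))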

               year state    month        number        date
count   5837.000000  5837     5837   5837.000000        5837
unique          NaN    23       12           NaN          20
top             NaN   Rio  Outubro           NaN  2016-01-01
freq            NaN   661      534           NaN         317
mean    2007.834847   NaN      NaN    577.947918         NaN
std        5.649076   NaN      NaN   1625.176973         NaN
min     1998.000000   NaN      NaN      1.000000         NaN
25%     2003.000000   NaN      NaN     16.000000         NaN
50%     2008.000000   NaN      NaN     72.000000         NaN
75%     2013.000000   NaN      NaN    334.000000         NaN
max     2017.000000   NaN      NaN  25963.000000         NaN
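The same filtering can also be written with a boolean mask. The sketch below, on made-up data, compares the two approaches. It is offered as an alternative, not as the method the notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up data: one row reports zero fires.
toy = pd.DataFrame({"state": ["Acre", "Acre", "Bahia"], "number": [0, 5, 7]})

# Approach used above: turn 0 into NaN, then drop rows with NaN in number.
via_nan = toy.replace(0, np.nan).dropna(subset=["number"])

# Alternative: keep only the rows where number is not 0.
via_mask = toy[toy["number"] != 0]

print(via_nan["number"].tolist())
print(via_mask["number"].tolist())
```

Both keep the same rows. The mask version leaves the number column as integers, while the NaN route converts it to floats, which is why the sums later in the notebook are printed with a trailing `.0`.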


# Creating subset of data:

In [6]:
# YOUR CODE GOES HERE
# Group the data by month and sum the numbers. The output is a pandas Series.
forest_fire_per_month = forests_update.groupby('month')['number'].sum()
print(forest_fire_per_month)

# Notice the result is in alphabetical order. To get it back into calendar order, we use the reindex method of the Series.
# unique() returns the months in order of first appearance, which in this dataset is calendar order.
months_unique = list(forests_update.month.unique())
print(months_unique)
forest_fire_per_month = forest_fire_per_month.reindex(months_unique) 
print(forest_fire_per_month)

# Next we convert the Series into a DataFrame
forest_fire_per_month = forest_fire_per_month.to_frame()
print(forest_fire_per_month.head())

# This doesn’t look right. That’s because month is being used as the index of the DataFrame.
# reset_index turns it back into a regular column.
forest_fire_per_month.reset_index(level=0, inplace=True)
print(forest_fire_per_month)

month
Abril          28364.0
Agosto        740841.0
Dezembro      152596.0
Fevereiro      30952.0
Janeiro        52587.0
Julho         217620.0
Junho         111405.0
Maio           46083.0
Março          35118.0
Novembro      312326.0
Outubro       629665.0
Setembro     1015925.0
Name: number, dtype: float64
['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro']
month
Janeiro        52587.0
Fevereiro      30952.0
Março          35118.0
Abril          28364.0
Maio           46083.0
Junho         111405.0
Julho         217620.0
Agosto        740841.0
Setembro     1015925.0
Outubro       629665.0
Novembro      312326.0
Dezembro      152596.0
Name: number, dtype: float64
            number
month             
Janeiro    52587.0
Fevereiro  30952.0
Março      35118.0
Abril      28364.0
Maio       46083.0
        month     number
0     Janeiro    52587.0
1   Fevereiro    30952.0
2       Março    35118.0
3       Abril    28
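The group, reorder and reset steps above can be condensed as in the following sketch on made-up data. The explicit `month_order` list is an assumption introduced here; the notebook derives the order from `unique()` instead.

```python
import pandas as pd

# Made-up data with three months, deliberately not in alphabetical order.
toy = pd.DataFrame({
    "month": ["Janeiro", "Fevereiro", "Abril", "Janeiro", "Abril"],
    "number": [10, 20, 30, 1, 2],
})

# Calendar order written out by hand (hypothetical helper list).
month_order = ["Janeiro", "Fevereiro", "Abril"]

# groupby sorts the group keys, so the index comes back alphabetically.
totals = toy.groupby("month")["number"].sum()

# reindex restores calendar order; reset_index makes month a regular column.
ordered = totals.reindex(month_order).reset_index()
print(ordered)
```

Writing the order out explicitly does not depend on how the rows happen to be arranged in the file, whereas `unique()` gives calendar order only because the dataset lists the months that way.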

Well done! :)