# First Steps - Understanding the Data
In this project I will look at deforestation and how it interacts with economic and social factors in the Gran Chaco region of South America.

The Gran Chaco region covers a large swath of South America, encompassing the majority of northern Argentina, three quarters of Paraguay, and eastern Bolivia. The Chaco region varies from semi-arid forests in its western side to humid forests in its northern and eastern edges. Because of its good soils and warm climate, it has become an epicenter of deforestation, as forests are cleared mostly for soy cultivation and, to a lesser extent, cattle ranching. As the agricultural frontier expands, so do the environmental and social costs. Loss of forests means less habitat for the thousands of bird, mammal, reptile and plant species that call it home. Big agro's other dark side often includes the displacement of the native people who have inhabited the land for generations but lack legal title to it.

As time allows, I will be adding new installments of this analysis to my repository, so don't forget to check back for new material. 

In this first project I will be using data from [Guyra Paraguay](http://www.guyra.org.py), an environmental non-profit that has been monitoring deforestation of the Chaco region through satellite images since 2012.

In [2]:
# Importing modules I'll be using in this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

#### IPython Magic Commands
For those new to IPython, `%matplotlib` is a [magic function](http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained) of IPython.
With the `inline` argument, the output of plotting commands is displayed directly below the code cell that produced it, like other output in a Jupyter Notebook. The benefit is that the plots get stored in the notebook document. You can read more [here](http://ipython.readthedocs.io/en/stable/interactive/plotting.html).

In [3]:
%matplotlib inline

## Importing and Formatting the Data
One of the first things we need to do is make sure the data is formatted correctly and, if it is not, tidy it up: for example, removing and renaming columns, or figuring out whether we have missing values and what to do with them.

### Reading Data Into a Dataframe
Let's read data into a Pandas dataframe:

In [8]:
d = pd.read_csv("C:/Users/user/Dropbox/Data Analysis/Portfolio/Data Sets/Deforestation/Monitoring_Data_Unprocessed.csv", 
                encoding = "UTF-8")

In [9]:
# .head() gives you the first five rows of the DF. 
d.head()

,Year,Month,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation (ha),Unnamed: 6,Unnamed: 7
0,2012,March,Argentina,Catamarca,La Paz,105.0,,
1,2012,March,Argentina,Catamarca,Santa Rosa,290.3,,
2,2012,March,Argentina,Chaco,12 de Octubre,9.6,,
3,2012,March,Argentina,Chaco,Almirante Brown,2004.7,,
4,2012,March,Argentina,Chaco,General Güemes,478.8,,
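
Before changing anything, it can help to count the missing values in each column. The following is a minimal sketch on a small hand-made `sample` dataframe that mimics the structure of the rows above; the sample and its values are assumptions for illustration, not the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the structure of the raw file
sample = pd.DataFrame({
    "Year": [2012, 2012, 2012],
    "Month": ["March", "March", "Abril"],
    "Deforestation (ha)": [105.0, 290.3, 9.6],
    "Unnamed: 6": [np.nan, np.nan, np.nan],
    "Unnamed: 7": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Names of the columns in which every value is missing
all_empty = [col for col in sample.columns if sample[col].isna().all()]
print(all_empty)
```

Columns that appear in `all_empty` carry no information and are natural candidates for removal.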


### Removing Columns
The last two columns ("Unnamed: 6" and "Unnamed: 7") are empty, so we can delete them. Pandas makes this easy with `drop`.

In [16]:
# inplace=True modifies d directly instead of returning a new dataframe.
d.drop(d.columns[[6, 7]], axis = 1, inplace = True)
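
Dropping by position works here, but it would remove the wrong columns if the file layout ever changed. As a possible alternative, `dropna` can remove every column that is entirely empty. The sketch below uses a made-up `sample` dataframe rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one partly missing column and two fully empty ones
sample = pd.DataFrame({
    "Year": [2012, 2012],
    "Deforestation (ha)": [105.0, np.nan],   # partly missing: kept
    "Unnamed: 6": [np.nan, np.nan],          # fully missing: dropped
    "Unnamed: 7": [np.nan, np.nan],          # fully missing: dropped
})

# how="all" drops a column only when all of its values are missing
cleaned = sample.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Because `how="all"` is used, a column with only some missing values is left alone.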

### Renaming Columns
Let's rename "Deforestation (ha)" to get rid of the parentheses and the space.

In [17]:
# Only the column name needs to change, so the index is left untouched
d.rename(columns={"Deforestation (ha)": "Deforestation_ha"}, inplace=True)

# Let's see what we have so far:
d.head()

,Year,Month,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha
0,2012,March,Argentina,Catamarca,La Paz,105.0
1,2012,March,Argentina,Catamarca,Santa Rosa,290.3
2,2012,March,Argentina,Chaco,12 de Octubre,9.6
3,2012,March,Argentina,Chaco,Almirante Brown,2004.7
4,2012,March,Argentina,Chaco,General Güemes,478.8
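
Column names read from a spreadsheet export sometimes carry stray whitespace that is invisible in the rendered table. The column list printed later in this notebook shows `'Detpo_Distr_Mun '` with a trailing space. A minimal sketch, on a hypothetical `sample` dataframe, of stripping whitespace from every column name before renaming:

```python
import pandas as pd

# Hypothetical sample; note the trailing space in the first column name
sample = pd.DataFrame({
    "Detpo_Distr_Mun ": ["La Paz", "Santa Rosa"],
    "Deforestation (ha)": [105.0, 290.3],
})

# Strip leading and trailing whitespace from all column names
sample.columns = sample.columns.str.strip()

# Then rename the one column whose name contains parentheses and a space
sample = sample.rename(columns={"Deforestation (ha)": "Deforestation_ha"})
print(sample.columns.tolist())
```

This step is not part of the notebook's actual workflow, so the later outputs still show the original name.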


### Changing Values in the DataFrame 
Let's make sure all months are named correctly:

In [14]:
# One method we can use is with unique()
d.Month.unique()

array(['March', 'Abril', 'May', 'June', 'July', 'August', 'September',
       'October', 'November', 'December', 'January', 'February', 'April'],
      dtype=object)

In [18]:
# Another option is using groupby()
d.groupby('Month').count()

Month,Year,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha
Abril,38,38,38,38,38
April,254,254,254,254,254
August,391,391,391,391,391
December,305,305,305,305,305
February,289,289,289,289,289
January,339,339,339,339,339
July,398,398,398,398,398
June,291,291,291,291,291
March,325,325,325,325,325
May,278,278,278,278,278


It looks like some rows have "April" entered in Spanish as "Abril". Let's change them all to "April":

In [19]:
d.replace({'Month': {'Abril': 'April'}}, inplace=True)

We can now confirm that our data looks right:

In [20]:
d.groupby('Month').count()

Month,Year,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha
April,292,292,292,292,292
August,391,391,391,391,391
December,305,305,305,305,305
February,289,289,289,289,289
January,339,339,339,339,339
July,398,398,398,398,398
June,291,291,291,291,291
March,325,325,325,325,325
May,278,278,278,278,278
November,255,255,255,255,255
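
Reading through the group counts works for twelve months, but the same check can be automated. The sketch below compares a hypothetical `months` series against the English month names from the standard library's `calendar` module. It assumes an English (or default "C") locale, since `calendar.month_name` is locale-dependent.

```python
import calendar
import pandas as pd

# Hypothetical sample of month labels, including the Spanish "Abril"
months = pd.Series(["March", "Abril", "April", "May", "Abril"])

# calendar.month_name[0] is an empty string, so skip it
valid_months = set(calendar.month_name[1:])

# Labels that are not valid English month names
unexpected = months[~months.isin(valid_months)].unique().tolist()
print(unexpected)

# After the replacement, every label should be valid
cleaned = months.replace({"Abril": "April"})
print(cleaned.isin(valid_months).all())
```

An empty `unexpected` list would mean there is nothing left to fix.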


## Creating a Date Column
To work with dates and time series, we need the dates in a format that Python (and in this case Matplotlib) can understand, so we'll create a "Date" column of Python `datetime` objects. This will help us plot the data in `matplotlib`. First, I'll write a function that translates a month's name into a month-day string for the last day of that month, like '10-31'. Using the `.apply` method, we can then create a new "month_day" column. Note that the lookup always uses '02-28' for February, so leap years are not handled.

In [22]:
def month_to_number(month):
    # Map a month name to the 'MM-DD' string for the last day of that month
    name = {
        "January": '01-31',
        "February": '02-28',
        "March": '03-31',
        "April": '04-30',
        "May": '05-31',
        "June": '06-30',
        "July": '07-31',
        "August": '08-31',
        "September": '09-30',
        "October": '10-31',
        "November": '11-30',
        "December": '12-31'
    }
    return name[month]

# For example, let's run this:
month_to_number("January")

'01-31'

In [23]:
# Creating a new 'month_day' column
d['month_day'] = d['Month'].apply(month_to_number)

In [32]:
d.month_day.unique()

array(['03-31', '04-30', '05-31', '06-30', '07-31', '08-31', '09-30',
       '10-31', '11-30', '12-31', '01-31', '02-28'], dtype=object)
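
A dictionary lookup inside a function raises a `KeyError` as soon as it meets a name that is not in the dictionary, such as a leftover "Abril". As an alternative sketch, `Series.map` accepts a dictionary directly and returns `NaN` for unmapped names, which makes them easy to list. The `month_day_lookup` dictionary and `months` series below are abbreviated, hypothetical examples.

```python
import pandas as pd

# Abbreviated, hypothetical lookup table (the real one has twelve entries)
month_day_lookup = {"March": "03-31", "April": "04-30", "May": "05-31"}

# Hypothetical month labels, one of which is not in the lookup table
months = pd.Series(["March", "Abril", "May"])

# Names missing from the dictionary become NaN instead of raising an error
month_day = months.map(month_day_lookup)

# List the labels that could not be translated
unmapped = months[month_day.isna()].tolist()
print(unmapped)
```

Whether a hard failure or a `NaN` is preferable depends on how much the input data is trusted.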

Let's now create the "Date" column by joining the "Year" column (converted to a string) with the "month_day" column:

In [34]:
d['Date'] = d['Year'].map(str) + "-" + d['month_day']

In [35]:
d.head()

,Year,Month,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha,month_day,Date
0,2012,March,Argentina,Catamarca,La Paz,105.0,03-31,2012-03-31
1,2012,March,Argentina,Catamarca,Santa Rosa,290.3,03-31,2012-03-31
2,2012,March,Argentina,Chaco,12 de Octubre,9.6,03-31,2012-03-31
3,2012,March,Argentina,Chaco,Almirante Brown,2004.7,03-31,2012-03-31
4,2012,March,Argentina,Chaco,General Güemes,478.8,03-31,2012-03-31
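
The month-day lookup hard-codes February as '02-28', which is one day short in leap years. One possible alternative, sketched here on a hypothetical `sample` dataframe, is to parse the year and month name with `pd.to_datetime` and roll each date forward to the end of its month with the `MonthEnd` offset. The `%B` format code assumes English month names and an English (or default "C") locale.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample covering a leap-year February (2016)
sample = pd.DataFrame({
    "Year": [2012, 2015, 2016],
    "Month": ["March", "February", "February"],
})

# Parse "2012-March" style strings; the day defaults to the 1st of the month
first_of_month = pd.to_datetime(
    sample["Year"].astype(str) + "-" + sample["Month"], format="%Y-%B"
)

# MonthEnd(0) rolls each date forward to the last day of its own month
sample["Date"] = first_of_month + MonthEnd(0)
print(sample)
```

This also yields real `datetime64` values straight away, so no separate string-to-date conversion is needed.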


### Rearranging Columns
Just for fun, let's rearrange the column order and put the "Date" column at the beginning.

In [40]:
cols = d.columns.tolist()

Here is the current column order; we want `"Date"` to come first.

In [41]:
cols

['Year',
 'Month',
 'Country',
 'Prov_Depto',
 'Detpo_Distr_Mun ',
 'Deforestation_ha',
 'month_day',
 'Date']

In [42]:
cols = [cols[7]] + [cols[0]] + [cols[1]] + [cols[6]] + cols[2:6]
cols

['Date',
 'Year',
 'Month',
 'month_day',
 'Country',
 'Prov_Depto',
 'Detpo_Distr_Mun ',
 'Deforestation_ha']

In [43]:
# Create the new dataframe with the columns rearranged in the preferred order
d = d[cols]

In [44]:
d.head()

,Date,Year,Month,month_day,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha
0,2012-03-31,2012,March,03-31,Argentina,Catamarca,La Paz,105.0
1,2012-03-31,2012,March,03-31,Argentina,Catamarca,Santa Rosa,290.3
2,2012-03-31,2012,March,03-31,Argentina,Chaco,12 de Octubre,9.6
3,2012-03-31,2012,March,03-31,Argentina,Chaco,Almirante Brown,2004.7
4,2012-03-31,2012,March,03-31,Argentina,Chaco,General Güemes,478.8
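
Reordering by position depends on the exact number and order of columns. A sketch of a name-based alternative, using a hypothetical `sample` dataframe with fewer columns than the real one:

```python
import pandas as pd

# Hypothetical sample with the "Date" column at the end
sample = pd.DataFrame({
    "Year": [2012],
    "Month": ["March"],
    "Deforestation_ha": [105.0],
    "Date": ["2012-03-31"],
})

# Columns to move to the front, in the desired order
front = ["Date"]

# Keep every remaining column in its existing relative order
ordered = front + [col for col in sample.columns if col not in front]
sample = sample[ordered]
print(sample.columns.tolist())
```

The same pattern still works if columns are later added to or removed from the dataframe.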


Finally, let's convert the "Date" column to `datetime` objects and add a "date_num" column holding the corresponding `Matplotlib` date numbers:

In [45]:
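from datetime import datetime
import matplotlib.dates as mdates

# Parse the 'YYYY-MM-DD' strings into datetime objects
d['Date'] = d['Date'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d"))
# Convert the dates to Matplotlib's floating-point date numbers
d['date_num'] = mdates.date2num(d['Date'])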

In [46]:
d.head()

,Date,Year,Month,month_day,Country,Prov_Depto,Detpo_Distr_Mun,Deforestation_ha,date_num
0,2012-03-31,2012,March,03-31,Argentina,Catamarca,La Paz,105.0,734593.0
1,2012-03-31,2012,March,03-31,Argentina,Catamarca,Santa Rosa,290.3,734593.0
2,2012-03-31,2012,March,03-31,Argentina,Chaco,12 de Octubre,9.6,734593.0
3,2012-03-31,2012,March,03-31,Argentina,Chaco,Almirante Brown,2004.7,734593.0
4,2012-03-31,2012,March,03-31,Argentina,Chaco,General Güemes,478.8,734593.0
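
The absolute value of a Matplotlib date number depends on the library's epoch. To my understanding, releases before 3.3 count days from year 0001, which gives values like the 734593.0 above, while 3.3 and later default to days since 1970-01-01. Differences between date numbers are in days either way. A small sketch with two hypothetical dates:

```python
from datetime import date, datetime
import matplotlib.dates as mdates

# Two hypothetical month-end dates
march_end = datetime(2012, 3, 31)
april_end = datetime(2012, 4, 30)

march_num = mdates.date2num(march_end)
april_num = mdates.date2num(april_end)

# The difference between two date numbers is a number of days
days_between = april_num - march_num
print(days_between)

# num2date reverses the conversion (it returns a timezone-aware datetime)
round_trip = mdates.num2date(march_num).date()
print(round_trip)
```

For that reason it is safer to rely on differences or on round trips through `num2date` than on the raw numbers themselves.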


### Saving a Dataframe to Disk

Let's save this dataframe to disk using the Pandas `to_csv` method.

In [48]:
d.to_csv("C:/Users/user/Dropbox/Data Analysis/Portfolio/Data Sets/Deforestation/Monitoring_Data_First_Step.csv")
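
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back. The sketch below shows one way to avoid that with `index=False` and to restore the dates with `parse_dates`. It writes to an in-memory buffer instead of a file, and the `sample` dataframe is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample with a real datetime column
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2012-03-31", "2012-04-30"]),
    "Deforestation_ha": [105.0, 290.3],
})

# Write without the row index; a file path could be used instead of the buffer
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read it back, asking pandas to parse the "Date" column as datetimes
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```

Without `parse_dates`, the "Date" column would come back as plain strings.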

The next installment will cover how to merge two dataframes.