# QCTO - Workplace Module

### Project Title: Time series forecasting model to predict future forest areas
#### Done By: Sarah Mahlangu

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

This project builds a time series forecasting model to predict future trends in the world’s forest area from historical data. The primary goal is to forecast changes in both the total forest area (in square kilometers) and the percentage of land area covered by forests. Such forecasts can help policymakers, environmental organizations, and researchers anticipate patterns of deforestation or afforestation and take proactive measures to protect the world’s forests. The project's significance lies in the insight it can provide into global forest trends, which matter for environmental conservation, policy and regulation, sustainability, and research.

Forests play a critical role in maintaining ecological balance, supporting biodiversity, and combating climate change through carbon sequestration. Over the years, the global forest area has been shaped by a variety of factors, including urbanization, agriculture, industrialization, and government policies on deforestation and afforestation. Understanding these trends and predicting future changes in forest area is key to addressing environmental challenges such as climate change, habitat loss, and sustainability. The forest area dataset, which records forest area in square kilometers for different countries, offers an opportunity to analyze these trends over time and forecast future developments.

This project will seek to answer the following questions:
1. What is the future trajectory of the global forest area (in both km² and %) over the coming years and decades?
2. How have different regions or countries contributed to the changes in forest area, and can we predict future patterns?

By accurately forecasting the future of the world's forest areas, this project will contribute to more informed decision-making around sustainability and conservation.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [1]:
# Import packages

import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

World forest area data is typically sourced from reputable organizations involved in global environmental monitoring, such as the World Bank. The exact dataset used in this project is available on Kaggle at the following link:
https://www.kaggle.com/datasets/webdevbadger/world-forest-area

Dataset Overview:

The dataset covers 1990 to 2021, with one data point per country per year. It also includes all countries in the world.
It contains both numerical and categorical data.

Categorical data: Country name and country code.

Numerical Data: Total Forest Area in km² (The actual area of forested land within a given country) for each year from 1990 to 2021.

Challenges in the Data:

Missing Data: Some countries have no values for certain years, so these gaps must be imputed or otherwise handled before fitting a time series forecasting model.

Geographical Discrepancies: Some regions may have highly variable data quality due to different levels of monitoring or reporting capabilities.
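
The layout described above is "wide": one row per country and one column per year. Many time series tools work more easily with a "long" layout, one row per country and year. The following is a minimal sketch of that reshape with `pandas.melt`, assuming a made-up two-country table; the values and the names `wide`, `long_df`, and `missing_per_country` are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame in the same wide layout as the dataset (values are made up)
wide = pd.DataFrame({
    "Country Name": ["Aland", "Bland"],
    "Country Code": ["ALD", "BLD"],
    "1990": [100.0, np.nan],
    "1991": [98.0, 50.0],
    "1992": [96.0, 52.0],
})

# Reshape to one row per (country, year) observation
long_df = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Forest Area (km2)",
)
long_df["Year"] = long_df["Year"].astype(int)

# Count missing observations per country
missing_per_country = (
    long_df["Forest Area (km2)"].isna().groupby(long_df["Country Code"]).sum()
)

print(long_df)
print(missing_per_country)
```

The per-country missing count is one quick way to see which series have gaps before any imputation is chosen.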

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [8]:
# loading dataset

df = pd.read_csv('forest_area_km.csv')

In [9]:
# View sample of the dataset

df.head()

Unnamed: 0,Country Name,Country Code,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Afghanistan,AFG,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4
1,Albania,ALB,7888.0,7868.5,7849.0,7829.5,7810.0,7790.5,7771.0,7751.5,7732.0,7712.5,7693.0,7705.77,7718.54,7731.31,7744.08,7756.85,7769.62,7782.39,7795.16,7807.93,7820.7,7834.935,7849.17,7863.405,7877.64,7891.875,7891.8,7889.025,7889.0,7889.0,7889.0,7889.0
2,Algeria,DZA,16670.0,16582.0,16494.0,16406.0,16318.0,16230.0,16142.0,16054.0,15966.0,15878.0,15790.0,16129.0,16468.0,16807.0,17146.0,17485.0,17824.0,18163.0,18502.0,18841.0,19180.0,19256.0,19332.0,19408.0,19484.0,19560.0,19560.0,19430.0,19300.0,19390.0,19490.0,19583.333
3,American Samoa,ASM,180.7,180.36,180.02,179.68,179.34,179.0,178.66,178.32,177.98,177.64,177.3,177.0,176.7,176.4,176.1,175.8,175.5,175.2,174.9,174.6,174.3,174.0,173.7,173.4,173.1,172.8,172.5,172.2,171.9,171.6,171.3,171.0
4,Andorra,AND,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [10]:
# Display number of rows and columns of the dataframe

df.shape

(259, 34)

There are 259 rows of data and 34 columns.

In [11]:
# Check data types and null counts

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 34 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  259 non-null    object 
 1   Country Code  259 non-null    object 
 2   1990          215 non-null    float64
 3   1991          219 non-null    float64
 4   1992          248 non-null    float64
 5   1993          251 non-null    float64
 6   1994          251 non-null    float64
 7   1995          251 non-null    float64
 8   1996          251 non-null    float64
 9   1997          251 non-null    float64
 10  1998          251 non-null    float64
 11  1999          251 non-null    float64
 12  2000          253 non-null    float64
 13  2001          253 non-null    float64
 14  2002          253 non-null    float64
 15  2003          253 non-null    float64
 16  2004          253 non-null    float64
 17  2005          253 non-null    float64
 18  2006          255 non-null    

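The dataset contains both categorical columns (the two `object` identifier columns) and numerical columns (the `float64` year columns). Several year columns have fewer than 259 non-null values, so there are some missing data points.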

In [18]:
# Count null values per column

df.isna().sum()

Country Name     0
Country Code     0
1990            44
1991            40
1992            11
1993             8
1994             8
1995             8
1996             8
1997             8
1998             8
1999             8
2000             6
2001             6
2002             6
2003             6
2004             6
2005             6
2006             4
2007             4
2008             4
2009             4
2010             4
2011             1
2012             0
2013             0
2014             0
2015             0
2016             0
2017             0
2018             0
2019             0
2020             0
2021             0
dtype: int64

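Forest area values are missing for some years, mostly early in the series: 44 rows lack a value for 1990 and 40 for 1991, while every year from 2012 onward is complete. These gaps need to be imputed.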

In [21]:
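# Fill missing values by linear interpolation along each country's own time series

df_interpolated = df.copy()
# The year columns are the numeric columns; the two identifier columns are left untouched
year_cols = df_interpolated.select_dtypes(include='number').columns
# axis=1 interpolates across years within a row; limit_direction='both' also fills leading gaps
df_interpolated[year_cols] = df_interpolated[year_cols].interpolate(method='linear', axis=1, limit_direction='both')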
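
On a wide table like this one, the `axis` argument of `DataFrame.interpolate` decides which neighbours are used to fill a gap. The sketch below contrasts the two directions on a made-up table with two countries; `toy`, `by_year`, and `by_country` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table: rows are countries, columns are years (values are made up)
toy = pd.DataFrame(
    {"1990": [10.0, 100.0], "1991": [np.nan, 110.0], "1992": [30.0, 120.0]},
    index=["A", "B"],
)

# axis=1 interpolates along each row, i.e. between a country's own years
by_year = toy.interpolate(method="linear", axis=1, limit_direction="both")

# axis=0 interpolates down each column, i.e. between neighbouring countries
by_country = toy.interpolate(method="linear", axis=0, limit_direction="both")

print(by_year.loc["A", "1991"])
print(by_country.loc["A", "1991"])
```

Row-wise, country A's missing 1991 value becomes 20.0, halfway between its own 1990 and 1992 values. Column-wise, the same gap is filled from country B's 1991 value instead, which has no bearing on country A's forest area.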

In [23]:
# Display the dataset after interpolation

df_interpolated.head()

Unnamed: 0,Country Name,Country Code,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Afghanistan,AFG,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4
1,Albania,ALB,7888.0,7868.5,7849.0,7829.5,7810.0,7790.5,7771.0,7751.5,7732.0,7712.5,7693.0,7705.77,7718.54,7731.31,7744.08,7756.85,7769.62,7782.39,7795.16,7807.93,7820.7,7834.935,7849.17,7863.405,7877.64,7891.875,7891.8,7889.025,7889.0,7889.0,7889.0,7889.0
2,Algeria,DZA,16670.0,16582.0,16494.0,16406.0,16318.0,16230.0,16142.0,16054.0,15966.0,15878.0,15790.0,16129.0,16468.0,16807.0,17146.0,17485.0,17824.0,18163.0,18502.0,18841.0,19180.0,19256.0,19332.0,19408.0,19484.0,19560.0,19560.0,19430.0,19300.0,19390.0,19490.0,19583.333
3,American Samoa,ASM,180.7,180.36,180.02,179.68,179.34,179.0,178.66,178.32,177.98,177.64,177.3,177.0,176.7,176.4,176.1,175.8,175.5,175.2,174.9,174.6,174.3,174.0,173.7,173.4,173.1,172.8,172.5,172.2,171.9,171.6,171.3,171.0
4,Andorra,AND,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0


In [24]:
# Check for missing values after interpolation

df_interpolated.isnull().sum()

Country Name    0
Country Code    0
1990            0
1991            0
1992            0
1993            0
1994            0
1995            0
1996            0
1997            0
1998            0
1999            0
2000            0
2001            0
2002            0
2003            0
2004            0
2005            0
2006            0
2007            0
2008            0
2009            0
2010            0
2011            0
2012            0
2013            0
2014            0
2015            0
2016            0
2017            0
2018            0
2019            0
2020            0
2021            0
dtype: int64

In [25]:
# Count number of duplicate rows

num_duplicates = df.duplicated().sum()
num_duplicates

0

There are no duplicated rows of data.

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
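
One possible starting point for the exploration described above is to total the forest area per year and to compare each country's first and last year. This is a hedged sketch on made-up numbers, not the project's analysis; `toy`, `totals`, and `change` are hypothetical names, and with the real data the cleaned DataFrame would take the place of `toy`.

```python
import pandas as pd

# Toy wide table in the dataset's layout (values are made up)
toy = pd.DataFrame({
    "Country Name": ["Aland", "Bland", "Cland"],
    "Country Code": ["ALD", "BLD", "CLD"],
    "1990": [100.0, 50.0, 10.0],
    "2000": [90.0, 55.0, 10.0],
    "2010": [80.0, 60.0, 10.0],
})

# Year columns are the ones whose labels are all digits
year_cols = [c for c in toy.columns if c.isdigit()]

# Total forest area across all rows for each year
totals = toy[year_cols].sum()

# Absolute and relative change per country between the first and last year
change = toy[["Country Name"]].copy()
change["abs_change"] = toy[year_cols[-1]] - toy[year_cols[0]]
change["pct_change"] = 100 * change["abs_change"] / toy[year_cols[0]]

print(totals)
print(change.sort_values("abs_change"))
```

Sorting by `abs_change` puts the largest losses first, which gives a quick view of which rows drive the overall trend.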



---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
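
The notebook does not yet name a model. As a possible baseline only, a straight-line trend can be fitted to one series and extended into future years. The sketch below assumes a made-up yearly series and uses `numpy.polyfit`; the names `years`, `area`, and `forecast` are illustrative, and this is not presented as the project's chosen method.

```python
import numpy as np

# Made-up yearly forest area for one country (km2), declining by 2 per year
years = np.arange(2010, 2020)
area = 500.0 - 2.0 * (years - 2010)

# Fit a straight line: area is approximated by slope * year + intercept
slope, intercept = np.polyfit(years, area, deg=1)

# Extrapolate the fitted line to future years
future_years = np.arange(2020, 2023)
forecast = slope * future_years + intercept

print(forecast)
```

A linear trend is easy to interpret and gives a reference point against which more flexible time series models can later be compared.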



---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---
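
Forecasts of forest area are continuous values, so error measures such as mean absolute error (MAE) and root mean squared error (RMSE) are commonly used to score them. The sketch below assumes a made-up series, holds out the last three years in time order, and scores a naive "repeat the last value" forecast; all names and numbers are illustrative.

```python
import numpy as np

# Made-up series: 12 yearly values with a steady decline
years = np.arange(2010, 2022)
area = 500.0 - 2.0 * (years - 2010)

# Hold out the last 3 years; a time series is split in order, not shuffled
train_years, test_years = years[:-3], years[-3:]
train_area, test_area = area[:-3], area[-3:]

# Naive baseline: repeat the last observed training value
naive_forecast = np.full(test_area.shape, train_area[-1])

# Error metrics on the held-out years
errors = test_area - naive_forecast
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)
```

Any candidate model can be scored on the same held-out years, which makes the naive baseline a useful yardstick.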


---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---



---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---



---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---


## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
