# Initial Data Exploration
This notebook is for getting familiar with the dataset and with using Jupyter notebooks.

**Other useful links:**
- <a href="https://github.com/schmitzmelissa/DSCT-Capstone01">Full Github Repository</a>
- <a href="https://www.kaggle.com/footprintnetwork/ecological-footprint">Kaggle Dataset</a>
- <a href="https://www.footprintnetwork.org/">Data Source Website</a>
- <a href="https://www.footprintnetwork.org/resources/glossary/">Glossary of Relevant Terms</a>

## Dataset Discoveries
(found by running the code below)
- 188 countries, which covers most countries in the world
- Each incomplete column is missing 15 or 16 values (about 8% of countries); the number of countries with at least one NaN still needs to be confirmed

## Proposed Next Steps
- Convert region names to shorter abbreviations
- Consider grouping data by factors such as:
 - Weighted area
 - Population size
 - Separate analysis of outliers
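
The next steps listed above could be sketched roughly as follows. The region names come from the notebook output, but the abbreviations, the population bin edges, and the `sample` DataFrame are assumptions made up for illustration, not decisions taken in this project.

```python
import pandas as pd

# Small stand-in for the real dataset (values copied from the head() output).
sample = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Australia', 'Austria'],
    'Region': ['Middle East/Central Asia', 'Northern/Eastern Europe', 'Africa',
               'Asia-Pacific', 'European Union'],
    'Population (millions)': [29.82, 3.16, 38.48, 23.05, 8.46],
})

# Hypothetical abbreviations; any consistent scheme would do.
region_abbrev = {
    'Middle East/Central Asia': 'ME/CA',
    'Northern/Eastern Europe': 'N/E Eur',
    'Africa': 'AF',
    'Asia-Pacific': 'APAC',
    'European Union': 'EU',
    'Latin America': 'LatAm',
    'North America': 'NA',
}

# map() leaves NaN for any region missing from the dictionary,
# which makes unmapped names easy to spot afterwards.
sample['Region (abbr)'] = sample['Region'].map(region_abbrev)

# Group countries by population size; the bin edges are arbitrary.
sample['Pop group'] = pd.cut(
    sample['Population (millions)'],
    bins=[0, 5, 25, float('inf')],
    labels=['small', 'medium', 'large'],
)

print(sample[['Country', 'Region (abbr)', 'Pop group']])
```

Keeping the original `Region` column next to the abbreviated one makes it easy to check the mapping before dropping anything.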

In [14]:
import pandas as pd
import numpy as np

# Allows better display of DataFrames
from IPython.display import display

# Create DataFrame from CSV file
countries = pd.read_csv('countries.csv')

# Confirm it is a DataFrame
type(countries)

pandas.core.frame.DataFrame

In [2]:
# Determine how many countries are in the list
len(countries)

188

**Discoveries:**
- This dataset is from 2016 and contains 188 countries total.
- As of 2018, there are 196-241 countries and territories (the count depends on how politically disputed territories are treated).
- 5 countries with name changes since 2016 are either recorded with the old name or not recorded (~2%)

**Conclusion:** Most countries of the world are represented in this data.

In [3]:
# Get a look at the composition of the DataFrame
countries.head(5)

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
0,Afghanistan,Middle East/Central Asia,29.82,0.46,$614.66,0.3,0.2,0.08,0.18,0.0,...,0.24,0.2,0.02,0.0,0.04,0.5,-0.3,0.46,1.6,6
1,Albania,Northern/Eastern Europe,3.16,0.73,"$4,534.37",0.78,0.22,0.25,0.87,0.02,...,0.55,0.21,0.29,0.07,0.06,1.18,-1.03,1.27,1.87,6
2,Algeria,Africa,38.48,0.73,"$5,430.57",0.6,0.16,0.17,1.14,0.01,...,0.24,0.27,0.03,0.01,0.03,0.59,-1.53,1.22,3.61,5
3,Angola,Africa,20.82,0.52,"$4,665.91",0.33,0.15,0.12,0.2,0.09,...,0.2,1.42,0.64,0.26,0.04,2.55,1.61,0.54,0.37,6
4,Antigua and Barbuda,Latin America,0.09,0.78,"$13,205.10",,,,,,...,,,,,,0.94,-4.44,3.11,5.7,2


In [4]:
# Too many columns to print neatly in one shot, so splitting into chunks for better visibility
country_IDs = countries.iloc[:,:5]
country_footprints = countries.iloc[:,5:10]
country_resources = countries.iloc[:,10:15]
country_totals = countries.iloc[:,15:]

In [5]:
# Print out heads to ensure correct slicing
country_IDs.head(10)

# print(country_footprints.head(10))
# print(country_resources.head(10))
# print(country_totals.head(10))

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita
0,Afghanistan,Middle East/Central Asia,29.82,0.46,$614.66
1,Albania,Northern/Eastern Europe,3.16,0.73,"$4,534.37"
2,Algeria,Africa,38.48,0.73,"$5,430.57"
3,Angola,Africa,20.82,0.52,"$4,665.91"
4,Antigua and Barbuda,Latin America,0.09,0.78,"$13,205.10"
5,Argentina,Latin America,41.09,0.83,"$13,540.00"
6,Armenia,Middle East/Central Asia,2.97,0.73,"$3,426.39"
7,Aruba,Latin America,0.1,,
8,Australia,Asia-Pacific,23.05,0.93,"$66,604.20"
9,Austria,European Union,8.46,0.88,"$51,274.10"


In [6]:
# Rename columns for better readability
country_IDs = country_IDs.rename(columns = {'Population (millions)':'Pop (M)','GDP per Capita':'GDP per cap'})
country_IDs.head(10)

Unnamed: 0,Country,Region,Pop (M),HDI,GDP per cap
0,Afghanistan,Middle East/Central Asia,29.82,0.46,$614.66
1,Albania,Northern/Eastern Europe,3.16,0.73,"$4,534.37"
2,Algeria,Africa,38.48,0.73,"$5,430.57"
3,Angola,Africa,20.82,0.52,"$4,665.91"
4,Antigua and Barbuda,Latin America,0.09,0.78,"$13,205.10"
5,Argentina,Latin America,41.09,0.83,"$13,540.00"
6,Armenia,Middle East/Central Asia,2.97,0.73,"$3,426.39"
7,Aruba,Latin America,0.1,,
8,Australia,Asia-Pacific,23.05,0.93,"$66,604.20"
9,Austria,European Union,8.46,0.88,"$51,274.10"


In [7]:
# Look for columns with NaN values
print(len(countries.loc[:, countries.isnull().any()]))

188


In [8]:
print('Shape:',countries.shape,'\n\n')
print('Columns:',countries.columns,'\n\n')
print('Info:\n')
countries.info()

Shape: (188, 21) 


Columns: Index(['Country', 'Region', 'Population (millions)', 'HDI', 'GDP per Capita',
       'Cropland Footprint', 'Grazing Footprint', 'Forest Footprint',
       'Carbon Footprint', 'Fish Footprint', 'Total Ecological Footprint',
       'Cropland', 'Grazing Land', 'Forest Land', 'Fishing Water',
       'Urban Land', 'Total Biocapacity', 'Biocapacity Deficit or Reserve',
       'Earths Required', 'Countries Required', 'Data Quality'],
      dtype='object') 


Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 21 columns):
Country                           188 non-null object
Region                            188 non-null object
Population (millions)             188 non-null float64
HDI                               172 non-null float64
GDP per Capita                    173 non-null object
Cropland Footprint                173 non-null float64
Grazing Footprint                 173 non-null float64
Forest Footprint     

- The data has 188 rows and 21 columns.
- The column names are fairly consistent as-is, but not all of them explicitly include units (units were inferred in the proposal from other data on the source website).
- The NaN check printed 188, which at first reads as "188 columns have NaN values" or "every single row contains at least one NaN".
 - Neither can be right: there are only 21 columns, and I'm pretty sure at least one country has all the columns filled out.
 - I was using the wrong command: `len()` of a DataFrame returns the number of rows, so the cell only reported the row count of the sub-DataFrame made up of the columns that contain NaN.

**Don't forget:** I split up the columns into separate DataFrames, so the analysis in the above two cells was on the original DataFrame.
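
To separate the three quantities that got mixed up in the NaN check above, here is a minimal sketch on a toy DataFrame. The `toy` data is invented for illustration; the same calls should work on `countries`, but the results for the real dataset are not shown here.

```python
import numpy as np
import pandas as pd

# Invented toy data: rows B and D are incomplete.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'HDI': [0.46, np.nan, 0.73, np.nan],
    'GDP per Capita': ['$614.66', np.nan, '$5,430.57', '$4,665.91'],
    'Carbon Footprint': [0.18, np.nan, 1.14, 0.20],
    'Total Biocapacity': [0.50, 0.57, 0.59, 2.55],
})

# What the original cell measured: the row count of the sub-DataFrame
# holding the NaN-containing columns (always equal to len(toy)).
rows_in_subframe = len(toy.loc[:, toy.isnull().any()])

# Number of columns that contain at least one NaN.
cols_with_nan = toy.isnull().any().sum()

# Number of rows (countries) that contain at least one NaN.
rows_with_nan = toy.isnull().any(axis=1).sum()

# Missing values per column, useful for spotting which columns are affected.
nan_per_column = toy.isnull().sum()

print(rows_in_subframe, cols_with_nan, rows_with_nan)
print(nan_per_column)
```

`any()` reduces over rows by default (one result per column), while `any(axis=1)` gives one result per row, which is the number needed for "countries with incomplete observations".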

In [9]:
countries[countries.HDI.isnull()]

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
7,Aruba,Latin America,0.1,,,,,,,,...,,,,,,0.57,-11.31,6.86,20.69,2
18,Bermuda,North America,0.06,,"$70,626.30",,,,,,...,,,,,,0.13,-5.64,3.33,44.05,3T
24,British Virgin Islands,Latin America,0.03,,,,,,,,...,,,,,,2.05,-0.81,1.65,1.4,3T
33,Cayman Islands,Latin America,0.06,,,0.36,0.45,1.23,3.56,0.05,...,0.0,0.02,0.19,0.11,0.0,0.32,-5.33,3.26,17.91,3L
43,Côte d'Ivoire,Africa,19.84,,"$1,016.83",0.51,0.06,0.22,0.26,0.14,...,0.88,0.3,0.48,0.04,0.08,1.78,0.51,0.74,0.72,6
62,French Guiana,Latin America,0.24,,,0.07,0.06,0.46,1.58,0.17,...,0.07,0.06,95.16,16.07,0.0,111.35,109.01,1.35,0.02,3L
63,French Polynesia,Asia-Pacific,0.27,,,0.75,0.65,0.12,2.39,0.82,...,0.19,0.03,0.73,0.42,0.0,1.37,-3.36,2.73,3.45,5
71,Guadeloupe,Latin America,0.46,,,0.11,0.03,0.16,2.61,0.31,...,0.11,0.03,0.13,0.18,0.0,0.45,-2.77,1.86,7.14,5
91,"Korea, Democratic People's Republic of",Asia-Pacific,24.76,,,0.28,0.0,0.14,0.68,0.02,...,0.24,0.0,0.21,0.09,0.06,0.6,-0.57,0.67,1.94,6
108,Martinique,Latin America,0.4,,,0.13,0.02,0.12,1.73,0.04,...,0.13,0.02,0.1,0.09,0.04,0.39,-1.7,1.2,5.41,3B


In [15]:
countries['HDI'][7] == np.nan

False

In [16]:
countries['HDI'][7] == 'NaN'

False

In [17]:
countries['HDI'][7]

nan

In [18]:
countries['HDI'][7].isnull()

AttributeError: 'numpy.float64' object has no attribute 'isnull'

In [22]:
np.isnan(countries['HDI'][7])

True
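
The results in the cells above follow from how NaN works: it is a floating-point value that never compares equal to anything, including itself, and a scalar pulled out of a Series is a plain number with no pandas methods. A small sketch on an invented Series (not the real HDI column):

```python
import numpy as np
import pandas as pd

# Invented values; position 1 plays the role of the missing HDI for Aruba.
hdi = pd.Series([0.46, np.nan, 0.73], name='HDI')
value = hdi.iloc[1]

# Equality checks against NaN are always False, even NaN == NaN.
equal_to_nan = bool(value == np.nan)
equal_to_itself = bool(value == value)

# Scalar checks that do work.
isnan_result = bool(np.isnan(value))    # floats only
isnull_result = bool(pd.isnull(value))  # also handles None and strings

# .isnull() exists on Series and DataFrames, not on scalars.
mask = hdi.isnull()

print(equal_to_nan, equal_to_itself, isnan_result, isnull_result)
print(mask.tolist())
```

`pd.isnull` is usually the safer scalar check in a mixed-type DataFrame, since `np.isnan` raises a `TypeError` on non-numeric values such as strings.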

In [31]:
# Show only the rows where HDI is not NaN (this returns a filtered copy; `countries` itself is unchanged)
countries[countries['HDI'].notnull()]

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
0,Afghanistan,Middle East/Central Asia,29.820,0.460000,$614.66,0.30,0.20,0.08,0.18,0.00,...,0.24,0.20,0.02,0.00,0.04,0.50,-0.30,0.46,1.600000,6
1,Albania,Northern/Eastern Europe,3.160,0.730000,"$4,534.37",0.78,0.22,0.25,0.87,0.02,...,0.55,0.21,0.29,0.07,0.06,1.18,-1.03,1.27,1.870000,6
2,Algeria,Africa,38.480,0.730000,"$5,430.57",0.60,0.16,0.17,1.14,0.01,...,0.24,0.27,0.03,0.01,0.03,0.59,-1.53,1.22,3.610000,5
3,Angola,Africa,20.820,0.520000,"$4,665.91",0.33,0.15,0.12,0.20,0.09,...,0.20,1.42,0.64,0.26,0.04,2.55,1.61,0.54,0.370000,6
4,Antigua and Barbuda,Latin America,0.090,0.780000,"$13,205.10",,,,,,...,,,,,,0.94,-4.44,3.11,5.700000,2
5,Argentina,Latin America,41.090,0.830000,"$13,540.00",0.78,0.79,0.29,1.08,0.10,...,2.64,1.86,0.66,1.67,0.10,6.92,3.78,1.82,0.450000,6
6,Armenia,Middle East/Central Asia,2.970,0.730000,"$3,426.39",0.74,0.18,0.34,0.89,0.01,...,0.44,0.26,0.10,0.02,0.07,0.89,-1.35,1.29,2.520000,3B
8,Australia,Asia-Pacific,23.050,0.930000,"$66,604.20",2.68,0.63,0.89,4.85,0.11,...,5.42,5.81,2.01,3.19,0.14,16.57,7.26,5.37,0.560000,5
9,Austria,European Union,8.460,0.880000,"$51,274.10",0.82,0.27,0.63,4.14,0.06,...,0.71,0.16,2.04,0.00,0.15,3.07,-3.00,3.50,1.980000,5
10,Azerbaijan,Middle East/Central Asia,9.310,0.750000,"$7,106.04",0.66,0.22,0.11,1.25,0.01,...,0.46,0.20,0.11,0.02,0.06,0.85,-1.46,1.33,2.720000,6


In [19]:
type(countries['HDI'][7])

numpy.float64

In [27]:
type(countries['GDP per Capita'][7])

float

In [28]:
pd.isnull(countries['GDP per Capita'][7])

True
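
`GDP per Capita` is listed as `object` in the `info()` output because its values are strings such as "$4,534.37", and a missing entry in an object column is stored as a plain Python float NaN, which matches the `float` type reported above. One possible way to turn the column into numbers is sketched below on an invented Series; this conversion is a suggestion, not something already done in this notebook.

```python
import numpy as np
import pandas as pd

# Invented sample in the same format as the real column.
gdp = pd.Series(['$614.66', '$4,534.37', np.nan, '$66,604.20'],
                name='GDP per Capita')

# Strip the dollar sign and thousands separators, then convert to float.
# Missing values pass through the string methods untouched and stay NaN.
gdp_numeric = gdp.str.replace(r'[$,]', '', regex=True).astype(float)

print(gdp_numeric)
print(gdp_numeric.dtype)
```

After the conversion the column would appear in `describe()` alongside the other numeric columns.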

In [None]:
type(countries)

In [10]:
# Noticed that some Data Quality values carry a letter suffix (e.g. 3B, 3L, 3T)
countries['Data Quality'].value_counts()

5     66
6     60
3B    29
3L    18
3T     7
2      6
4      2
Name: Data Quality, dtype: int64

Data Quality score information can be found <a href="https://www.footprintnetwork.org/data-quality-scores/">here</a>.

These scores rate the quality of data that has been produced and maintained from various data sources since 1961, flagging data that is missing, unavailable, incomplete, or erroneous. The quality score is therefore a measure of confidence in the data.

- 1-3 is the time series score
- A-D is the latest year score

(Not entirely sure what that means; needs to be researched further.)

**Summary of DQ values in dataset:**
- **3B:** No component of BC or EF is unreliable or unlikely for the latest data year. Some individual components of the EF or BC are unlikely in the latest data year. The total EF and BC time series results are not significantly affected by unlikely data.
- **3L:** ???
- **3T:** ???

Created a discussion on Kaggle about the scores <a href="https://www.kaggle.com/footprintnetwork/ecological-footprint/discussion/74703">here</a>.
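
Until the meaning of the letter suffixes is confirmed, the scores can at least be split into a numeric part and a letter part so each can be analyzed separately. This is a sketch on an invented Series containing the seven distinct values seen above; the column names `dq_number` and `dq_letter` are hypothetical.

```python
import pandas as pd

# The distinct Data Quality values from value_counts(), stored as strings.
dq = pd.Series(['5', '6', '3B', '3L', '3T', '2', '4'], name='Data Quality')

# One digit followed by an optional capital letter.
parts = dq.str.extract(r'^(?P<dq_number>\d)(?P<dq_letter>[A-Z]?)$')
parts['dq_number'] = parts['dq_number'].astype(int)

print(pd.concat([dq, parts], axis=1))
```

Values that did not match the pattern would come back as NaN in both columns, which would make the `astype(int)` call fail and so flag unexpected score formats early.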

In [11]:
# Another starting place to look for values that don't fit
countries.describe()

Unnamed: 0,Population (millions),HDI,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required
count,188.0,172.0,173.0,173.0,173.0,173.0,173.0,188.0,173.0,173.0,173.0,173.0,173.0,188.0,188.0,188.0,188.0
mean,37.342372,0.68636,0.578208,0.263179,0.373815,1.804913,0.122486,3.317606,0.53185,0.45659,2.459191,0.595145,0.06711,4.019681,0.702074,1.915745,4.037397
std,140.756836,0.15604,0.355691,0.352067,0.359349,1.898283,0.158427,2.370931,0.672567,1.014738,10.593956,1.661872,0.054844,11.689075,11.771339,1.369624,12.444616
min,0.0,0.34,0.07,0.0,0.01,0.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.05,-14.14,0.24,0.02
25%,2.0375,0.5575,0.35,0.08,0.17,0.42,0.02,1.4825,0.18,0.03,0.06,0.03,0.03,0.675,-1.935,0.855,0.9425
50%,7.97,0.72,0.52,0.18,0.26,1.14,0.07,2.74,0.35,0.12,0.34,0.11,0.05,1.31,-0.73,1.58,1.705
75%,24.87,0.8025,0.7,0.32,0.46,2.6,0.15,4.64,0.59,0.34,1.17,0.37,0.09,2.815,0.2125,2.6775,2.8475
max,1408.04,0.94,2.68,3.47,3.03,12.65,0.82,15.82,5.42,8.23,95.16,16.07,0.27,111.35,109.01,9.14,159.47


Max much higher than the mean (indicates a possible outlier):
- Population
- Cropland
- Grazing Footprint
- Forest Land

Max reasonably close to the mean (no obvious outlier):
- HDI
- Cropland Footprint
- ...
**Thought process:** *For each possible outlier: Should I include it? Is it a true observation? Should I transform it?*
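
One common heuristic for making the "possible outlier" judgment less subjective is the 1.5 × IQR rule, and a log transform is one option for a heavily skewed column. Both are swapped in here as illustrations, not as the method chosen for this project. The Series below reuses a few population values from the outputs above, plus the column maximum, and is not the full dataset.

```python
import numpy as np
import pandas as pd

# A handful of Population (millions) values from head(), plus the max from describe().
pop = pd.Series([29.82, 3.16, 38.48, 20.82, 0.09, 41.09, 2.97, 23.05, 8.46, 1408.04],
                name='Population (millions)')

# 1.5 * IQR rule: flag values far above the third quartile.
q1, q3 = pop.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
is_outlier = pop > upper_fence

# Log transform compresses the long right tail.
# Caution: a value of exactly 0 would become -inf.
log_pop = np.log10(pop)

print('Upper fence:', upper_fence)
print(pop[is_outlier])
print(log_pop.round(2).tolist())
```

A flagged value is only a candidate for a closer look: a very large population can still be a true observation that belongs in the analysis.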