# Dataset and Preprocessing
In this section we describe the datasets used to create the visualizations for our data story and show the first five entries of each.

## Dataset 1 : Greenhouse gas emissions in the Netherlands

This dataset contains emissions of the gases carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O) in the Netherlands from 1990 through 2017. The data is organized by emission source, such as the pharmaceutical industry, households, agriculture, and fishery. Emissions are measured in million kilograms and are ratio values. The dataset can be used to calculate the increase in greenhouse gas emissions and to compare against emission datasets of other countries. It was taken from <a href='https://www.kaggle.com/datasets/janheindejong/greenhouse-gas-emissions-in-the-netherlands?select=IPCC_emissions.csv'> Kaggle </a>; the file is semicolon-delimited, so it has to be read with a semicolon separator to be usable.



In [10]:
import pandas as pd

df_greenhouse = pd.read_csv("../datasets/IPCC_emissions.csv", sep=';')
df_greenhouse.head()

Unnamed: 0,ID,Bronnen,Perioden,CO2_1,CH4_2,N2O_3
0,0,T001176,1990JJ00,163120,1278.17,59.49
1,1,T001176,1995JJ00,173520,1192.41,59.84
2,2,T001176,2000JJ00,172290,975.64,53.01
3,3,T001176,2001JJ00,177390,949.16,49.71
4,4,T001176,2002JJ00,176670,904.27,47.01
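
To illustrate how this dataset can be used to calculate an increase in emissions, the following is a minimal sketch rather than the code used for the data story. It rebuilds three of the rows shown above as an in-memory, semicolon-delimited sample (the empty first header field is an assumption about the raw file layout), derives a numeric year from `Perioden`, and computes the CO2 change relative to the first year. The column names `year` and `co2_change_pct` are introduced here for illustration only.

```python
import io

import pandas as pd

# Three rows copied from the output above, written in semicolon-delimited form.
# Assumption: the raw file stores the row index in an unnamed first column.
sample = io.StringIO(
    ";ID;Bronnen;Perioden;CO2_1;CH4_2;N2O_3\n"
    "0;0;T001176;1990JJ00;163120;1278.17;59.49\n"
    "1;1;T001176;1995JJ00;173520;1192.41;59.84\n"
    "2;2;T001176;2000JJ00;172290;975.64;53.01\n"
)

df = pd.read_csv(sample, sep=";", index_col=0)

# 'Perioden' holds the year followed by the suffix 'JJ00'; keep the four digits.
df["year"] = df["Perioden"].str[:4].astype(int)

# Percentage change in CO2 emissions relative to the first row of the sample.
baseline = df["CO2_1"].iloc[0]
df["co2_change_pct"] = (df["CO2_1"] - baseline) / baseline * 100

print(df[["year", "CO2_1", "co2_change_pct"]])
```

On the full file the same calculation would normally be done per emission source, for example by grouping on `Bronnen` first.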


## Dataset 2 : ---



Unnamed: 0,country_or_area,year,value,category
0,Australia,2014,393126.946994,carbon_dioxide_co2_emissions_without_land_use_...
1,Australia,2013,396913.93653,carbon_dioxide_co2_emissions_without_land_use_...
2,Australia,2012,406462.847704,carbon_dioxide_co2_emissions_without_land_use_...
3,Australia,2011,403705.528314,carbon_dioxide_co2_emissions_without_land_use_...
4,Australia,2010,406200.993184,carbon_dioxide_co2_emissions_without_land_use_...


## Dataset 3 : Statistics per Country

This dataset contains a range of statistics for nearly every country in the United Nations in 2017. The 50 statistics include population, GDP, agricultural economy, internet use, and emission estimates. The unit of each variable is given in its name, such as number, percent, or square kilometer, and the variables are nominal, ordinal, or ratio values. The dataset was preprocessed to include only a few relevant European countries before it was used and can be found on <a href='https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles?select=kiva_country_profile_variables.csv'>Kaggle</a>.

In [12]:
eu_country_profile = pd.read_csv("../datasets/EUonly_country_profile_variables.csv")
eu_country_profile.head()

Unnamed: 0,location,Region,Surface area (km2),Population in thousands (2017),"Population density (per km2, 2017)","Sex ratio (m per 100 f, 2017)",GDP: Gross domestic product (million current US$),"GDP growth rate (annual %, const. 2005 prices)",GDP per capita (current US$),Economy: Agriculture (% of GVA),...,Mobile-cellular subscriptions (per 100 inhabitants).1,Individuals using the Internet (per 100 inhabitants),Threatened species (number),Forested area (% of land area),CO2 emission estimates (million tons/tons per capita),"Energy production, primary (Petajoules)",Energy supply per capita (Gigajoules),"Pop. using improved drinking water (urban/rural, %)","Pop. using improved sanitation facilities (urban/rural, %)",Net Official Development Assist. received (% of GNI)
0,Albania,SouthernEurope,28748,2930,106.9,101.9,11541,2.6,3984.2,22.4,...,63.3,130,28.2,5.7/2.0,84,36,94.9/95.2,95.5/90.2,2.96,-99
1,Belgium,WesternEurope,30528,11429,377.5,97.3,455107,1.5,40277.8,0.7,...,85.1,37,22.6,93.4/8.3,520,196,100.0/100.0,99.5/99.4,-99,-99
2,Croatia,SouthernEurope,56594,4189,74.9,93.1,48676,1.6,11479.4,4.1,...,69.8,176,34.3,16.8/4.0,182,79,99.6/99.7,97.8/95.8,...,-99
3,Denmark,NorthernEurope,42921,5734,135.1,99.0,301308,1.6,53149.3,1.2,...,96.3,47,14.4,33.5/5.9,666,119,100.0/100.0,99.6/99.6,-99,-99
4,Estonia,NorthernEurope,45227,1310,30.9,88.2,22460,1.4,17112.0,3.4,...,88.4,23,52.7,19.5/14.8,242,193,100.0/99.0,97.5/96.6,-99,-99
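
The preprocessing step of keeping only European countries could look roughly like the sketch below. It is not the original preprocessing script. It assumes that European rows can be recognised by the word "Europe" in the `Region` column, and it treats the `-99` entries seen in the output as placeholders for missing values, which is also an assumption. The row for "Exampleland" and the column `example_indicator` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the full country profile table.
# Albania and Belgium reuse values from the output above;
# "Exampleland" and 'example_indicator' are hypothetical.
profiles = pd.DataFrame(
    {
        "location": ["Albania", "Belgium", "Exampleland"],
        "Region": ["SouthernEurope", "WesternEurope", "SouthAmerica"],
        "Population in thousands (2017)": [2930, 11429, 1000],
        "example_indicator": [2.96, -99.0, -99.0],
    }
)

# Keep only rows whose region name mentions Europe (assumed selection rule).
europe = profiles[profiles["Region"].str.contains("Europe")].copy()

# Assumption: -99 marks a missing value, so convert it to NaN before analysis.
europe = europe.replace(-99, np.nan)

print(europe)
```

Converting the sentinel to `NaN` keeps it from distorting means or plots; if `-99` turns out to carry another meaning in the source data, this step should be dropped.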


## Dataset 4 : Global Air Pollution

This dataset contains geolocated data on greenhouse gas emissions and the air quality index (AQI) of large cities in nearly all countries. Its attributes include the overall AQI value and category, as well as the AQI value and category for NO2, CO, ozone, and PM2.5. The dataset can be used to compare air pollution between countries. The dataset shown here is unprocessed; before use it was cleaned by grouping the data by country and aggregating the temperature values. The dataset can be found on <a href='https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset'>Kaggle</a>.

In [13]:
df_pollution = pd.read_csv("../datasets/global_air_pollution_dataset.csv")
df_pollution.head()

Unnamed: 0,Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category
0,Russian Federation,Praskoveya,51,Moderate,1,Good,36,Good,0,Good,51,Moderate
1,Brazil,Presidente Dutra,41,Good,1,Good,5,Good,1,Good,41,Good
2,Italy,Priolo Gargallo,66,Moderate,1,Good,39,Good,2,Good,66,Moderate
3,Poland,Przasnysz,34,Good,1,Good,34,Good,0,Good,20,Good
4,France,Punaauia,22,Good,0,Good,22,Good,0,Good,6,Good
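
Grouping the city-level rows by country can be sketched as follows. The text does not state which aggregation function was used or exactly which columns were aggregated, so taking the mean of every column ending in "AQI Value" is an assumption. "Example City" and its values are hypothetical and only added so that one country has more than one row.

```python
import pandas as pd

# Two rows from the output above plus one hypothetical row ("Example City").
cities = pd.DataFrame(
    {
        "Country": ["Italy", "Italy", "Poland"],
        "City": ["Priolo Gargallo", "Example City", "Przasnysz"],
        "AQI Value": [66, 50, 34],
        "PM2.5 AQI Value": [66, 40, 20],
    }
)

# Select the numeric AQI columns by name; category columns are left out.
value_cols = [col for col in cities.columns if col.endswith("AQI Value")]

# One row per country, holding the mean of each AQI column (assumed aggregation).
per_country = cities.groupby("Country")[value_cols].mean()

print(per_country)
```

A median or maximum may be a better fit when a few heavily polluted cities dominate a country's average.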


## Dataset 5 : Agricultural companies in the Netherlands
This dataset contains data on agricultural companies in the Netherlands over the period 1851-2022. Variables include the number of companies, the number of companies with livestock, and meat and wheat production. The dataset was preprocessed by appending 'JJ00' to the years in the 'Perioden' column, which simplifies the comparison between this dataset and dataset 1. The dataset can be found on the website of the <a href='https://www.cbs.nl/nl-nl/cijfers/detail/71904ned#'>Centraal Bureau voor de Statistiek</a>.

In [14]:
df_agr_comp = pd.read_csv('../datasets/table__71904ned.csv')
df_agr_comp.head()

Unnamed: 0,Perioden,Bedrijven Alle bedrijven (x 1 000),Bedrijven met akkerbouw (x 1 000),Bedrijven met tuinbouw open grond (x 1 000),Bedrijven met tuinbouw onder glas (x 1 000),Bedrijven met grasland (x 1 000),Bedrijven met rundvee,Bedrijven met varkens,"Arbeidskrachten Totaal arbeidskrachten\nArbeidskrachten, totaal (x 1 000)","Gezinsarbeidskrachten Totaal gezinsarbeidskrachten\nGezinsarbeidskrachten, totaal (x 1 000)",...,"Granen Tarwe, winter (mln kg)","Granen Tarwe, zomer (mln kg)",Handelsgewassen Koolzaad (mln kg),Peulvruchten Bruine bonen (mln kg),Opbrengst gewassen Suikerbieten (mln kg),Opbrengst gewassen Zaaiuien (mln kg),Dierlijke productie Melk\nMelk afgeleverd aan fabrieken (mln kg),Vlees Rundvlees (mln kg),Vlees Varkensvlees (mln kg),Zuivelproducten Gecondenseerde melk (mln kg)
0,1851JJ00,.,.,.,.,.,.,.,.,.,...,.,.,60.0,5,.,.,.,.,.,.
1,1900JJ00,.,.,.,.,.,.,.,.,.,...,.,.,2.0,10,1.509,.,851,.,.,.
2,1950JJ00,410,.,.,.,242,216,271,.,.,...,273,21,45.0,15,2.718,208,4.766,140,236,172
3,1960JJ00,301,193,.,.,215,200,146,.,.,...,434,156,8.0,9,4.676,155,6.068,236,413,386
4,1970JJ00,185,96,53,20,141,131,76,.,.,...,509,131,22.0,11,4.711,331,7.748,250,700,495
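
The 'JJ00' preprocessing step can be sketched as below, assuming the original `Perioden` column held plain years. Once the suffix is appended, the column has the same format as `Perioden` in dataset 1, so the two tables can be joined on it. The 1990 value for `Bedrijven met rundvee` is a made-up placeholder; the 1970 value and the CO2 figure are taken from the outputs above.

```python
import pandas as pd

# Agricultural data with plain years (assumed raw format).
# The 1990 value is a hypothetical placeholder for illustration.
agr = pd.DataFrame(
    {
        "Perioden": [1970, 1990],
        "Bedrijven met rundvee": [131, 100],
    }
)

# Append 'JJ00' so the keys match the format used in dataset 1.
agr["Perioden"] = agr["Perioden"].astype(str) + "JJ00"

# One row from dataset 1, as shown earlier in this section.
emissions = pd.DataFrame({"Perioden": ["1990JJ00"], "CO2_1": [163120]})

# Inner join keeps only the years present in both tables.
merged = agr.merge(emissions, on="Perioden", how="inner")

print(merged)
```

The '.' entries in the output above appear to mark missing values; if so, passing `na_values='.'` to `pd.read_csv` would load them as `NaN`, though that reading of the file is an assumption.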


## Dataset 6 : Import and Export of the Netherlands
This dataset describes the import and export values of goods traded between the Netherlands and countries inside and outside Europe. It is broken down by category, such as meat and meat products, fruit and vegetables, etc., and covers 2021 and 2022. The dataset was not preprocessed and can be found on the website of the <a href='https://www.cbs.nl/nl-nl/cijfers/detail/83926NED?dl=6C56B#'>Centraal Bureau voor de Statistiek</a>.

In [15]:
df_trade = pd.read_csv('../datasets/table__83926NED.csv')
df_trade.head()

Unnamed: 0,SITC,Landen,Perioden,Invoerwaarde (mln euro),Uitvoerwaarde (mln euro),Handelsbalans (mln euro),Jaarmutatie invoerwaarde (%),Jaarmutatie uitvoerwaarde (%)
0,Totaal goederen,Totaal landen,2021 mei,40.738,46.097,5.359,336,358
1,Totaal goederen,Totaal landen,2021 juni,44.033,50.799,6.766,249,295
2,Totaal goederen,Totaal landen,2021 juli,43.261,48.89,5.629,243,216
3,Totaal goederen,Totaal landen,2021 augustus,42.955,47.161,4.206,346,319
4,Totaal goederen,Totaal landen,2021 september,46.421,52.181,5.76,263,240
