# Introduction

The industrial revolution was exactly that: a revolution. It was a time in which our way of living changed drastically through the automated bulk production of many goods, the advancement of science and medicine, and the resulting rise in quality of life for many people all over the world. The rise of industry brought a great many benefits and propelled the world into a global society that connects nearly everyone on Earth.

As we know today, however, the effects of industry are not all positive. In recent years, more and more concerns have been raised about global warming, which has been increasing ever since the advent of modern industry. Apart from the sharp increase in atmospheric greenhouse gases such as CO2 and methane caused by human activities such as industry, there are plenty of other pollutants that the environment has to cope with. By contaminating soil and water, pollutants threaten to destroy entire ecosystems and drive many species of plants and animals to extinction.

Regardless of public opinion, there are also those who hold more conservative views on industry. On a basic level, people take one of two views: some are convinced that industry is something to be wary of, while others believe there is no problem with industry at all. These two groups clash often in the real world, but what are the facts? What does the data say about all this? What truly are the effects of industry on our world? By looking at the data, it might be possible to obtain an answer to these questions.


# Dataset and Preprocessing

To conduct this research, it was necessary to acquire several datasets and look for connections between them. Five separate datasets were selected: one on water quality, one on industry, one on infrastructure, one on innovation, and one on nature. Every dataset contains a column with three-letter country tags, which was used to join the datasets together during the research process. This column was therefore renamed to 'ISO3' in every dataset. Each dataset was also pivoted so that all of them follow the same column pattern and order. With the country tag column as the join key, all of the data has been ordered by year. The following is a short rundown of each dataset, with a link to its source webpage.
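The rename, pivot, and join steps described above can be sketched roughly as follows. This is a minimal illustration on invented toy data, not the notebook's actual code: the raw column names (`LOCATION`, `Indicator`, `Value`), the indicator names, and all values are assumptions, as is the choice to join on `Year` in addition to `ISO3` and to use an outer join.

```python
import pandas as pd

# Hypothetical long-format export: one row per country, indicator, and year.
infra_long = pd.DataFrame({
    "LOCATION": ["AFG", "AFG", "NLD", "NLD"],
    "Indicator": ["airports", "roads", "airports", "roads"],
    "Year": [2000, 2000, 2000, 2000],
    "Value": [0.5, 3.0, 1.6, 330.0],
})

# Pivot so that every indicator becomes its own column,
# then rename the country-tag column to the shared name 'ISO3'.
infra = (
    infra_long.pivot_table(index=["LOCATION", "Year"], columns="Indicator", values="Value")
    .reset_index()
    .rename(columns={"LOCATION": "ISO3"})
)
infra.columns.name = None  # drop the leftover 'Indicator' axis label

# Hypothetical second dataset that already uses the 'ISO3' column name.
water = pd.DataFrame({
    "ISO3": ["AFG", "AFG", "NLD"],
    "Year": [2000, 2001, 2000],
    "NATIONAL-Basic": [28.2, 28.2, 99.0],
})

# Join on country tag and year (joining on Year is an assumption),
# keep rows that appear in either dataset, and order by country and year.
merged = (
    water.merge(infra, on=["ISO3", "Year"], how="outer")
    .sort_values(["ISO3", "Year"])
    .reset_index(drop=True)
)
print(merged)
```

An outer join keeps country-year pairs that appear in only one dataset and leaves the missing indicators empty; an inner join would silently drop those rows instead.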

#### Dataset 1: water_set_done.csv

Original link: https://washdata.org/data/country/WLD/household/download

Short description: This dataset contains 46 different attributes and 4914 different entries. The attributes include total population, rural surface water percentage, and the extent to which water quality and access have improved in different regions, for almost every major country on Earth from 2000 up to 2020. The data is split into three main categories:
1. National (Proportion)
2. Urban (Proportion)
3. Rural (Proportion)

Preprocessing: During preprocessing, the attribute names were changed to make them more readable in pandas. The categories National/Urban/Rural were merged into the attribute names, so the subcategory Surface water in the category Rural became the attribute 'RURAL-Surface water'. To make up for missing data, gaps were filled in per country with the pandas methods ffill() and bfill(). The countries were separated for this step so that their data would not interfere with one another. Strings like '>99' were also converted to the float 99.0, which is now the maximum value for these attributes.
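As a rough illustration of the filling and conversion steps, the sketch below works on a small invented frame. The helper name `clean_water` and the toy values are hypothetical and are not taken from the notebook; only the use of ffill(), bfill(), per-country grouping, and the '>99' conversion comes from the description above.

```python
import pandas as pd

def clean_water(df, value_cols):
    """Convert strings such as '>99' to floats and fill gaps per country."""
    out = df.copy()
    for col in value_cols:
        # Drop the '>' sign so '>99' becomes '99', then parse as float.
        # Anything that cannot be parsed (including missing values) becomes NaN.
        stripped = out[col].astype(str).str.replace(">", "", regex=False)
        out[col] = pd.to_numeric(stripped, errors="coerce")
    # Forward-fill, then backward-fill, within each country only,
    # so one country's values never leak into another's.
    out[value_cols] = out.groupby("ISO3")[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    return out

# Invented toy data, ordered by country and year.
raw = pd.DataFrame({
    "ISO3": ["AFG", "AFG", "AFG", "NLD", "NLD"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "NATIONAL-Basic": ["28.2", None, "30.2", None, ">99"],
})

cleaned = clean_water(raw, ["NATIONAL-Basic"])
print(cleaned)
```

Grouping by 'ISO3' before filling matters here: without it, the missing NLD value for 2000 would be forward-filled from the last AFG row rather than backward-filled from NLD's own 2001 value.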

In [1]:
import pandas as pd

pd.read_csv('new_water_done5.csv').head(n=3)

Unnamed: 0.1,Unnamed: 0,Country,ISO3,Year,Population (thousands),% urban,NATIONAL-Basic,NATIONAL-Limited,NATIONAL-Unimproved,NATIONAL-Surface water,...,URBAN-Proportion-Available when needed,URBAN-Proportion-Free from contamination,URBAN-Proportion-Annual rate of change in safely managed,URBAN-Proportion-Piped,URBAN-Proportion-Non-piped,Sl,SDG region,WHO region,UNICEF Programming region,UNICEF Reporting region
0,0,Afghanistan,AFG,2000.0,20779.957031,22.077999,28.171415,3.660638,43.178306,24.989641,...,,20.570155,0.790563,17.408464,39.131826,1.0,Central and Southern Asia,Eastern Mediterranean,South Asia,South Asia
1,1,Afghanistan,AFG,2001.0,21606.992188,22.169001,28.199366,3.661542,43.167542,24.97155,...,,20.570155,0.790563,17.408464,39.131826,2.0,Central and Southern Asia,Eastern Mediterranean,South Asia,South Asia
2,2,Afghanistan,AFG,2002.0,22600.773438,22.261,30.236385,3.949472,41.68962,24.124524,...,,21.416514,0.790563,18.695357,40.171284,3.0,Central and Southern Asia,Eastern Mediterranean,South Asia,South Asia


#### Dataset 2: iso_infra.csv

Original link: https://stats.oecd.org/Index.aspx?QueryId=73638

Short description: This dataset contains 23 attributes and 1340 entries, which provide information on nationwide infrastructure for a multitude of major countries from 2000 up to 2020. The information includes the number of airports per million inhabitants, the total density of roads, and inland waterway infrastructure spending.

In [None]:
pd.read_csv('iso_infra.csv').head(n=3)

#### Dataset 3: iso_green.csv

Original link: https://stats.oecd.org/Index.aspx?DataSetCode=GREEN_GROWTH

Short description: This dataset contains 165 different attributes and 5366 total entries detailing green growth in a large number of countries. Green growth is measured in a number of ways, such as CO2 productivity, energy productivity, and freshwater and forest resources. These data points provide insight into the progress countries have been making towards a greener and more environmentally friendly way of living.

In [None]:
pd.read_csv('iso_green.csv').head(n=3)

#### Dataset 4: iso_inno.csv

Original link: https://stats.oecd.org/Index.aspx?DataSetCode=REGION_INNOVATION

Short description: This dataset has 82 different attributes and 1005 total entries, which provide information on regional innovation for most countries on Earth. The attributes include different levels of student enrolment, the distribution of the workforce across sectors, and progress and innovation in fields such as nanotech and medicine. This dataset could provide insight into how well developed countries are and how they have progressed over the years.

In [None]:
pd.read_csv('iso_inno.csv').head(n=3)

#### Dataset 5: iso_mei.csv

Original link: https://stats.oecd.org/Index.aspx?DataSetCode=MEI_REAL

Short description: This dataset contains 14 attributes and 987 entries covering a multitude of countries from 2000 up to 2020 on the production and sales of the goods a country produces. The attributes include the production of different kinds of goods, such as intermediate goods and service goods, and the production of energy and electricity. With this dataset it is possible to gain insight into the kind of goods and services a country provides, which helps determine its place on the world stage.

In [None]:
pd.read_csv('iso_mei.csv').head(n=3)