# D7041E Mini Project

## Air Pollution Awareness and 2030 Predictions

### Importance and Purpose:
##### Air pollution is a critical environmental issue with severe implications for public health and the well-being of our planet. 
##### In this mini project by Umuthan Ercan for the D7041E Applied Artificial Intelligence course, the focus is on raising awareness about air pollution 
##### and making predictions for the year 2030 using the World Health Organization (WHO) dataset. 
##### By employing data preprocessing techniques and a Random Forest Regressor model, the script aims to analyze historical 
##### air pollution data, identify at-risk countries projected to exceed a defined threshold in 2030, and contribute to the understanding 
##### of potential future challenges. This initiative underscores the significance of proactive measures and awareness in mitigating 
##### the adverse effects of air pollution. The predictions generated by the model serve as a tool to inform and guide efforts 
##### towards cleaner air and sustainable environmental practices.

##### The dataset can be accessed from https://www.who.int/data/gho/data/themes/air-pollution/who-air-quality-database.

In [77]:
# Import necessary libraries and Air Pollution dataset by WHO.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np

# Load the Air Pollution dataset by WHO
df = pd.read_csv('data.csv', sep=';') #separator parameter was used as ';' in this dataset

### Data pre-processing.

In [78]:
# The 'Value' column contains more than the number itself (a parenthesized remainder), so only the first decimal number is extracted.

df['Value'] = df['Value'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)

# Many columns are redundant, so only the relevant ones are kept: 'Location', 'Period', and 'Value'
# Location is the country name
# Period is the year, from 2010 to 2019, which is used to predict the year 2030
# Value is the PM2.5 concentration (fine particulate matter)
df = df[['Location', 'Period', 'Value']].copy()

# There were many missing values; they are imputed with the mean of the whole column
df['Value'] = df['Value'].fillna(df['Value'].mean())

# Print the DataFrame to check that the three columns (Location, Period and Value) are in order
print(df)


     Location  Period  Value
0      Angola    2019  27.16
1      Angola    2018  25.85
2      Angola    2017  25.44
3      Angola    2016  26.08
4      Angola    2015  25.03
...       ...     ...    ...
1945    Samoa    2014   7.55
1946    Samoa    2013   7.64
1947    Samoa    2012   7.56
1948    Samoa    2011   7.48
1949    Samoa    2010   7.40

[1950 rows x 3 columns]
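
The two pre-processing steps above can be tried on a small toy frame. This is a minimal sketch: the sample strings, such as `27.16 [15.22-44.10]`, are an assumption about how the raw `Value` column looks (a number followed by a bracketed range), and the per-country mean at the end is shown only as a possible alternative to the global mean used in this notebook, not as the notebook's method.

```python
import pandas as pd

# Hypothetical sample rows; the raw format of 'Value' is assumed, not taken from the dataset.
toy = pd.DataFrame({
    'Location': ['Angola', 'Angola', 'Samoa', 'Samoa'],
    'Period': [2019, 2018, 2019, 2018],
    'Value': ['27.16 [15.22-44.10]', None, '7.40 [5.10-9.80]', '7.60 [5.30-10.00]'],
})

# Keep only the first decimal number of each string; rows without a match become NaN.
toy['Value'] = toy['Value'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)

# Global-mean imputation, as done in the notebook.
global_filled = toy['Value'].fillna(toy['Value'].mean())

# Alternative (assumption): fill each gap with the mean of the same country.
country_filled = toy['Value'].fillna(
    toy.groupby('Location')['Value'].transform('mean')
)

print(global_filled.tolist())
print(country_filled.tolist())
```

With the global mean, the missing Angola value is filled from a mix of Angola and Samoa readings; the per-country variant keeps it at Angola's own level.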


In [79]:
# Initialize an empty DataFrame called 'predictions' to store the model's predictions for air pollution levels in 2030
predictions = pd.DataFrame()

# Set a threshold for the 2030 predictions: every country with a predicted value above 50 is considered critical
threshold = 50


### Iterate over unique countries
##### The script enters a loop to iterate over unique countries present in the dataset. 
##### For each country, a subset of the data is created to focus on its specific air pollution trends over time. 
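##### The model, a Random Forest Regressor, is then trained on the historical data for that country, allowing it to capture non-linear relationships 
##### between time and air pollution levels. Because tree-based models cannot extrapolate beyond the training years (2010 to 2019), the 2030 prediction stays within the range of observed values.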
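
The per-country fit described above can be sketched on toy data as follows. This is a minimal sketch with made-up values for a single hypothetical country; `random_state=0` is added only to make the run repeatable and is not set in the notebook. The sketch also prints the prediction for the last training year next to the one for 2030, which makes it easy to see how the forest treats a year outside its training range.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up yearly values for one hypothetical country (not taken from the WHO data).
country_df = pd.DataFrame({
    'Period': list(range(2010, 2020)),
    'Value': [20.0, 21.0, 22.5, 23.0, 24.5, 25.0, 26.0, 27.5, 28.0, 29.0],
})

X = country_df[['Period']]
y = country_df['Value']

# random_state is an addition for reproducibility; the notebook leaves it unset.
model = RandomForestRegressor(random_state=0)
model.fit(X, y)

# Predict for the last observed year and for the target year.
pred_2019 = model.predict(pd.DataFrame({'Period': [2019]}))[0]
pred_2030 = model.predict(pd.DataFrame({'Period': [2030]}))[0]

print(f"2019: {pred_2019:.3f}, 2030: {pred_2030:.3f}")
```

Both numbers come out the same, because every split threshold in the trees lies below the last training year, so 2019 and 2030 end up in the same leaves.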

In [80]:
print("Risky country predictions (with above 50 (PM2.5) concentrations of fine particulate matter) by 2030 are:")
for country in df['Location'].unique():
    # Filter the data for the current country to focus on its own trend over time
    country_df = df[df['Location'] == country]

    # Features (X): the 'Period' column (year); target (y): the 'Value' column (air pollution level)
    X = country_df[['Period']]
    y = country_df['Value']

    # Initialize a Random Forest Regressor, chosen for its ability to model non-linear relationships
    model = RandomForestRegressor()

    # Train the model on this country's historical data
    model.fit(X, y)

    # Create a DataFrame holding the future period (2030) to predict
    future_periods = pd.DataFrame({'Period': [2030]})

    # Predict the air pollution level for 2030
    future_predictions = model.predict(future_periods[['Period']])

    # Report the country if its predicted 2030 level exceeds the threshold of 50
    if future_predictions[0] > threshold:
        print(f"{country}:  = {future_predictions[0]:.3f}")

    # Build a one-row DataFrame for the current country: location, future period (2030), and predicted level
    country_predictions = pd.DataFrame({
        'Location': [country],
        'Period': future_periods['Period'],
        'Predicted Value for 2030': future_predictions
    })

    # Append the current country's row to the overall 'predictions' DataFrame
    predictions = pd.concat([predictions, country_predictions], ignore_index=True)


Risky country predictions (with above 50 (PM2.5) concentrations of fine particulate matter) by 2030 are:
Cameroon:  = 57.048
Niger:  = 50.481
Nigeria:  = 56.067
Afghanistan:  = 64.100
Bahrain:  = 52.667
Egypt:  = 62.806
Kuwait:  = 64.716
Pakistan:  = 51.771
Qatar:  = 60.495
Saudi Arabia:  = 57.519
Tajikistan:  = 55.433
India:  = 51.376
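
Because every prediction is also stored in `predictions`, the same list of countries can be recovered afterwards without re-running the loop. A minimal sketch, using a small hand-made stand-in for `predictions` whose four rows reuse values printed in this notebook:

```python
import pandas as pd

threshold = 50

# Stand-in for the 'predictions' DataFrame; the rows reuse values printed in this notebook.
predictions = pd.DataFrame({
    'Location': ['Angola', 'Burkina Faso', 'Cameroon', 'Kuwait'],
    'Period': [2030, 2030, 2030, 2030],
    'Predicted Value for 2030': [26.6374, 41.1734, 57.048, 64.716],
})

# A boolean mask selects the rows whose predicted value exceeds the threshold.
risky = predictions[predictions['Predicted Value for 2030'] > threshold]

print(risky['Location'].tolist())
```

Keeping the filter separate from the training loop also makes it easy to try a different threshold later.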


### Display the predictions DataFrame
##### The final 'predictions' DataFrame is printed, giving an overview of the model's predictions for air pollution levels in 2030. 
##### It holds one row per country with the location, the future period (2030), and the predicted air pollution value, 
##### so countries can be compared and potential risks assessed when planning air pollution mitigation and awareness efforts.

In [81]:
print(predictions)

         Location  Period  Predicted Value for 2030
0          Angola    2030                   26.6374
1         Burundi    2030                   16.7825
2           Benin    2030                   31.7849
3    Burkina Faso    2030                   41.1734
4        Botswana    2030                   12.8709
..            ...     ...                       ...
190         Tonga    2030                    7.4952
191        Tuvalu    2030                    6.8406
192      Viet Nam    2030                   20.6507
193       Vanuatu    2030                    8.5282
194         Samoa    2030                    7.7672

[195 rows x 3 columns]
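
The printed table is truncated, so ranking it is often more useful than scrolling through it. A minimal sketch, again on a small stand-in frame built from the first five rows shown above; `nlargest` is one way to do this, and `sort_values` would work equally well:

```python
import pandas as pd

# Stand-in for 'predictions', limited to the first five rows printed above.
predictions = pd.DataFrame({
    'Location': ['Angola', 'Burundi', 'Benin', 'Burkina Faso', 'Botswana'],
    'Period': [2030] * 5,
    'Predicted Value for 2030': [26.6374, 16.7825, 31.7849, 41.1734, 12.8709],
})

# Three highest predicted values, highest first.
top3 = predictions.nlargest(3, 'Predicted Value for 2030')

# to_string() prints every row and column without the '...' truncation.
print(top3.to_string(index=False))
```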
