## Project 2

For Project 2, I will explore the relationship between GDP per capita and CO2 emissions per capita. I will be looking at the top 10 countries with the highest GDP. The data is fetched from Our World in Data through its API.

Let's start by importing the pandas package and examining our CO2 emissions data.

In [8]:
import pandas as pd
co2_emissions = pd.read_csv(
    "https://ourworldindata.org/grapher/annual-co2-emissions-per-country.csv?v=1&csvType=full&useColumnShortNames=true"
)
co2_emissions

Unnamed: 0,Entity,Code,Year,emissions_total
0,Afghanistan,AFG,1949,14656.0
1,Afghanistan,AFG,1950,84272.0
2,Afghanistan,AFG,1951,91600.0
3,Afghanistan,AFG,1952,91600.0
4,Afghanistan,AFG,1953,106256.0
...,...,...,...,...
29379,Zimbabwe,ZWE,2020,8490839.0
29380,Zimbabwe,ZWE,2021,10222778.0
29381,Zimbabwe,ZWE,2022,12231845.0
29382,Zimbabwe,ZWE,2023,13443295.0


That's too much data. Let's drop unnecessary columns (such as Code) and filter for the most recent years, from 2020 to 2025.

In [9]:
#First, let's double check if the year column is an integer
co2_emissions.dtypes

Entity              object
Code                object
Year                 int64
emissions_total    float64
dtype: object

In [10]:
co2_emissions_clean = co2_emissions[["Entity", "Year", "emissions_total"]]
co2_emissions_clean

Unnamed: 0,Entity,Year,emissions_total
0,Afghanistan,1949,14656.0
1,Afghanistan,1950,84272.0
2,Afghanistan,1951,91600.0
3,Afghanistan,1952,91600.0
4,Afghanistan,1953,106256.0
...,...,...,...
29379,Zimbabwe,2020,8490839.0
29380,Zimbabwe,2021,10222778.0
29381,Zimbabwe,2022,12231845.0
29382,Zimbabwe,2023,13443295.0


In [30]:
#Filtering for years 2020 - 2025
co2_emissions_clean = co2_emissions_clean[
    co2_emissions_clean["Year"].between(2020, 2025)
]
co2_emissions_clean

Unnamed: 0,Entity,Year,emissions_total
71,Afghanistan,2020,11118626.0
72,Afghanistan,2021,9868841.0
73,Afghanistan,2022,10169889.0
74,Afghanistan,2023,10516319.0
75,Afghanistan,2024,10825998.0
...,...,...,...
29379,Zimbabwe,2020,8490839.0
29380,Zimbabwe,2021,10222778.0
29381,Zimbabwe,2022,12231845.0
29382,Zimbabwe,2023,13443295.0
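
It is worth checking how many years this filter actually covers. `Series.between` includes both endpoints by default, so 2020 to 2025 spans six calendar years, and only the years that exist in the data survive the filter. A minimal sketch with a made-up `years` Series (a stand-in for the real Year column, not the actual dataset):

```python
import pandas as pd

# Hypothetical year column, standing in for co2_emissions_clean["Year"]
years = pd.Series([2018, 2019, 2020, 2021, 2022, 2023])

# between() is inclusive on both ends by default
mask = years.between(2020, 2025)
kept_years = years[mask].tolist()

print(kept_years)
```

In this toy example only four of the six requested years are present, which is similar to what the Zimbabwe rows above show.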


In [31]:
#Since we are using gdp per capita after this, let's import population and create emissions per capita too.
pop = pd.read_csv(
    "https://ourworldindata.org/grapher/population-with-un-projections.csv?v=1&csvType=full&useColumnShortNames=true"
)

#Inspect
pop

Unnamed: 0,Entity,Code,Year,population__sex_all__age_all__variant_estimates,population__sex_all__age_all__variant_medium
0,Afghanistan,AFG,1950,7776180.0,
1,Afghanistan,AFG,1951,7879343.0,
2,Afghanistan,AFG,1952,7987784.0,
3,Afghanistan,AFG,1953,8096703.0,
4,Afghanistan,AFG,1954,8207954.0,
...,...,...,...,...,...
38651,Zimbabwe,ZWE,2096,,36840484.0
38652,Zimbabwe,ZWE,2097,,36932280.0
38653,Zimbabwe,ZWE,2098,,37019885.0
38654,Zimbabwe,ZWE,2099,,37096555.0


In [32]:
#Keeping only what's necessary
pop_clean = pop[
    ["Entity", "Year", "population__sex_all__age_all__variant_estimates"]
]

pop_clean

Unnamed: 0,Entity,Year,population__sex_all__age_all__variant_estimates
0,Afghanistan,1950,7776180.0
1,Afghanistan,1951,7879343.0
2,Afghanistan,1952,7987784.0
3,Afghanistan,1953,8096703.0
4,Afghanistan,1954,8207954.0
...,...,...,...
38651,Zimbabwe,2096,
38652,Zimbabwe,2097,
38653,Zimbabwe,2098,
38654,Zimbabwe,2099,


In [33]:
# Filtering for years 2020 - 2025 (.copy() gives us an independent DataFrame that is safe to modify)
pop_clean = pop_clean[
    pop_clean["Year"].between(2020, 2025)
].copy()

#Changing column name to population
pop_clean = pop_clean.rename(
    columns={"population__sex_all__age_all__variant_estimates": "population"}
)

#Inspect
pop_clean



Unnamed: 0,Entity,Year,population
70,Afghanistan,2020,39068977.0
71,Afghanistan,2021,40000410.0
72,Afghanistan,2022,40578847.0
73,Afghanistan,2023,41454762.0
74,Afghanistan,2024,
...,...,...,...
38576,Zimbabwe,2021,15797220.0
38577,Zimbabwe,2022,16069061.0
38578,Zimbabwe,2023,16340829.0
38579,Zimbabwe,2024,
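
The 2024 rows are empty because the estimates column stops earlier, while the second column in the raw data (the medium variant) appears to hold the projected values for later years. If projected populations are acceptable for the analysis, one option is to fall back to the projection wherever the estimate is missing. The sketch below uses a tiny made-up table with shortened, hypothetical column names (`estimates`, `medium`) and an invented projection value:

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the two population columns (names shortened for illustration)
pop_demo = pd.DataFrame(
    {
        "Entity": ["Afghanistan", "Afghanistan"],
        "Year": [2023, 2024],
        "estimates": [41454762.0, np.nan],
        "medium": [np.nan, 42000000.0],  # made-up projection value
    }
)

# Use the estimate where it exists, otherwise fall back to the projection
pop_demo["population"] = pop_demo["estimates"].combine_first(pop_demo["medium"])

print(pop_demo[["Entity", "Year", "population"]])
```

Whether mixing estimates and projections is appropriate depends on the question being asked, so this is a design choice rather than a required step.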


Next, let's combine co2_emissions_clean and pop_clean so that we can calculate CO2 emissions per capita.

In [35]:
co2_pop_df = pd.merge(
    co2_emissions_clean, pop_clean, on=["Entity", "Year"]
)

#Inspect
co2_pop_df

Unnamed: 0,Entity,Year,emissions_total,population
0,Afghanistan,2020,11118626.0,39068977.0
1,Afghanistan,2021,9868841.0,40000410.0
2,Afghanistan,2022,10169889.0,40578847.0
3,Afghanistan,2023,10516319.0,41454762.0
4,Afghanistan,2024,10825998.0,
...,...,...,...,...
1090,Zimbabwe,2020,8490839.0,15526887.0
1091,Zimbabwe,2021,10222778.0,15797220.0
1092,Zimbabwe,2022,12231845.0,16069061.0
1093,Zimbabwe,2023,13443295.0,16340829.0
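
By default `pd.merge` performs an inner join, so any Entity and Year pair that exists in only one table is dropped without notice. A quick way to see what falls out is an outer join with `indicator=True`. The sketch below uses small made-up tables (`co2_demo`, `pop_demo`) rather than the real data:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
co2_demo = pd.DataFrame(
    {"Entity": ["A", "A", "B"], "Year": [2020, 2021, 2020], "emissions_total": [10.0, 12.0, 5.0]}
)
pop_demo = pd.DataFrame(
    {"Entity": ["A", "A", "C"], "Year": [2020, 2021, 2020], "population": [100.0, 110.0, 50.0]}
)

# how="outer" keeps unmatched rows; indicator=True adds a _merge column
# that records whether each row came from the left table, the right table, or both
check = pd.merge(co2_demo, pop_demo, on=["Entity", "Year"], how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(unmatched[["Entity", "Year", "_merge"]])
```

Here "B" appears only in the emissions table and "C" only in the population table, so both would vanish in an inner join.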


In [36]:
#Next, let's calculate CO2 emissions per capita.

co2_pop_df["emissions_per_capita"] = (
    co2_pop_df["emissions_total"] / co2_pop_df["population"]
)
co2_pop_df


Unnamed: 0,Entity,Year,emissions_total,population,emissions_per_capita
0,Afghanistan,2020,11118626.0,39068977.0,0.284590
1,Afghanistan,2021,9868841.0,40000410.0,0.246718
2,Afghanistan,2022,10169889.0,40578847.0,0.250620
3,Afghanistan,2023,10516319.0,41454762.0,0.253682
4,Afghanistan,2024,10825998.0,,
...,...,...,...,...,...
1090,Zimbabwe,2020,8490839.0,15526887.0,0.546847
1091,Zimbabwe,2021,10222778.0,15797220.0,0.647125
1092,Zimbabwe,2022,12231845.0,16069061.0,0.761205
1093,Zimbabwe,2023,13443295.0,16340829.0,0.822681
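
As a sanity check, the first row can be reproduced by hand. The sketch assumes emissions_total is reported in tonnes (consistent with the axis label used in the plots later), which would make the result tonnes of CO2 per person:

```python
# Values copied from the Afghanistan 2020 row above
emissions_total = 11118626.0  # assumed to be tonnes of CO2
population = 39068977.0       # people

emissions_per_capita = emissions_total / population

print(round(emissions_per_capita, 6))
```

Rows with a missing population, such as Afghanistan 2024, produce a missing per-capita value, because dividing by NaN gives NaN.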


In [37]:
#Let's keep only emissions_per_capita
co2_per_capita_df = co2_pop_df[["Entity", "Year", "emissions_per_capita"]]

co2_per_capita_df

Unnamed: 0,Entity,Year,emissions_per_capita
0,Afghanistan,2020,0.284590
1,Afghanistan,2021,0.246718
2,Afghanistan,2022,0.250620
3,Afghanistan,2023,0.253682
4,Afghanistan,2024,
...,...,...,...
1090,Zimbabwe,2020,0.546847
1091,Zimbabwe,2021,0.647125
1092,Zimbabwe,2022,0.761205
1093,Zimbabwe,2023,0.822681


Next, let's import our dataset on GDP per capita.

In [38]:
gdp_per_cap = pd.read_csv(
    "https://ourworldindata.org/grapher/gdp-per-capita-worldbank.csv?v=1&csvType=full&useColumnShortNames=true"
)
gdp_per_cap


Unnamed: 0,Entity,Code,Year,ny_gdp_pcap_pp_kd,owid_region
0,Afghanistan,AFG,2000,1617.8264,
1,Afghanistan,AFG,2001,1454.1108,
2,Afghanistan,AFG,2002,1774.3087,
3,Afghanistan,AFG,2003,1815.9282,
4,Afghanistan,AFG,2004,1776.9182,
...,...,...,...,...,...
7306,Zimbabwe,ZWE,2020,2987.2683,
7307,Zimbabwe,ZWE,2021,3184.7847,
7308,Zimbabwe,ZWE,2022,3323.1184,
7309,Zimbabwe,ZWE,2023,3442.2488,Africa


In [40]:
#Similarly, let's drop unnecessary columns (e.g. Code) and filter for years 2020 - 2025.
gdp_clean = gdp_per_cap[["Entity", "Year", "ny_gdp_pcap_pp_kd"]]
gdp_clean = gdp_clean[gdp_clean["Year"].between(2020, 2025)]
gdp_clean

Unnamed: 0,Entity,Year,ny_gdp_pcap_pp_kd
20,Afghanistan,2020,2769.6858
21,Afghanistan,2021,2144.1665
22,Afghanistan,2022,1981.7102
23,Afghanistan,2023,1983.8126
24,Aland Islands,2023,
...,...,...,...
7306,Zimbabwe,2020,2987.2683
7307,Zimbabwe,2021,3184.7847
7308,Zimbabwe,2022,3323.1184
7309,Zimbabwe,2023,3442.2488


In [41]:
# The column name for GDP per capita is too long and cryptic. Let's change it to "GDP per capita".

gdp_clean = gdp_clean.rename(columns={"ny_gdp_pcap_pp_kd": "GDP per capita"})
gdp_clean

Unnamed: 0,Entity,Year,GDP per capita
20,Afghanistan,2020,2769.6858
21,Afghanistan,2021,2144.1665
22,Afghanistan,2022,1981.7102
23,Afghanistan,2023,1983.8126
24,Aland Islands,2023,
...,...,...,...
7306,Zimbabwe,2020,2987.2683
7307,Zimbabwe,2021,3184.7847
7308,Zimbabwe,2022,3323.1184
7309,Zimbabwe,2023,3442.2488


Great. Now we're ready to merge these two datasets: co2_per_capita_df and gdp_clean.

In [42]:
combined_df = pd.merge(gdp_clean, co2_per_capita_df, on=["Entity", "Year"])
combined_df

Unnamed: 0,Entity,Year,GDP per capita,emissions_per_capita
0,Afghanistan,2020,2769.6858,0.284590
1,Afghanistan,2021,2144.1665,0.246718
2,Afghanistan,2022,1981.7102,0.250620
3,Afghanistan,2023,1983.8126,0.253682
4,Albania,2020,14662.7960,1.693982
...,...,...,...,...
1002,Zimbabwe,2020,2987.2683,0.546847
1003,Zimbabwe,2021,3184.7847,0.647125
1004,Zimbabwe,2022,3323.1184,0.761205
1005,Zimbabwe,2023,3442.2488,0.822681


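Super! Since we are comparing two numeric variables, let's create a scatter plot. First, we'll average each country's values across the years so that each country becomes a single point.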

In [43]:
#Averaging each country's values over the available years by using mean()
avg_df = (
    combined_df.groupby("Entity")[["GDP per capita", "emissions_per_capita"]]
    .mean()
    .reset_index()
)

#Let's rename the columns for clarity
avg_df.rename(
    columns={
        "GDP per capita": "avg_GDP_per_cap",
        "emissions_per_capita": "avg_emissions_per_cap",
    },
    inplace=True,
)

avg_df.head()

Unnamed: 0,Entity,avg_GDP_per_cap,avg_emissions_per_cap
0,Afghanistan,2219.843775,0.258903
1,Albania,16962.4698,1.664647
2,Algeria,14814.9336,4.146411
3,Andorra,61858.736,5.159903
4,Angola,7391.42586,0.558217
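
Note that `mean()` skips missing values, so some averages may rest on fewer years than others. Counting the non-missing years alongside the mean makes this visible. A minimal sketch on made-up data (the `n_years` column name is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: entity B is missing its 2020 value
demo = pd.DataFrame(
    {
        "Entity": ["A", "A", "A", "B", "B"],
        "Year": [2020, 2021, 2022, 2020, 2021],
        "GDP per capita": [100.0, 110.0, 120.0, np.nan, 50.0],
    }
)

# Named aggregation: the mean plus the number of non-missing years behind it
summary = (
    demo.groupby("Entity")
    .agg(
        avg_GDP_per_cap=("GDP per capita", "mean"),
        n_years=("GDP per capita", "count"),
    )
    .reset_index()
)

print(summary)
```

In this toy example the average for B is based on a single year, which would be easy to miss without the count.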


From a quick inspection, we have non-countries such as "World". Let's go back and check what's in our Entity column so that we can remove them.

In [44]:
#Inspect our Entity
avg_df["Entity"].unique()


array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire Sint Eustatius and Saba',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czechia',
       'Democratic Republic of Congo', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Faroe Islands'

In [45]:
#This was a bit confusing, so I'm using ChatGPT to help me write code to remove these aggregates.

non_countries = [
    "World",
    "High-income countries",
    "Upper-middle-income countries",
    "Lower-middle-income countries",
    "Low-income countries",
    "European Union (27)",
]

avg_df_clean = avg_df[~avg_df["Entity"].isin(non_countries)]
avg_df_clean

Unnamed: 0,Entity,avg_GDP_per_cap,avg_emissions_per_cap
0,Afghanistan,2219.843775,0.258903
1,Albania,16962.469800,1.664647
2,Algeria,14814.933600,4.146411
3,Andorra,61858.736000,5.159903
4,Angola,7391.425860,0.558217
...,...,...,...
213,Vietnam,12968.242600,3.392863
214,Wallis and Futuna,,2.560470
216,Yemen,,0.257450
217,Zambia,3573.855160,0.528698
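
A hand-written list of aggregates is easy to leave incomplete. An alternative is to use the Code column that was dropped earlier. The sketch below assumes that aggregates either have no Code or a Code starting with "OWID_"; that is an assumption about the source data and should be checked against the real file. The rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the Code values for aggregates are an assumption about the source data
demo = pd.DataFrame(
    {
        "Entity": ["Albania", "World", "High-income countries", "Zambia"],
        "Code": ["ALB", "OWID_WRL", np.nan, "ZMB"],
    }
)

# Keep rows whose Code is present and does not start with "OWID_"
has_code = demo["Code"].notna()
is_owid_code = demo["Code"].str.startswith("OWID_", na=False)
countries_only = demo[has_code & ~is_owid_code]

print(countries_only["Entity"].tolist())
```

A rule like this could also drop real territories that the source gives a custom code, so it is worth printing the removed names before relying on it.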


A quick inspection shows that we have NAs. Let's see how many there are and remove them.

In [46]:
avg_df_clean.isna().sum()

Entity                    0
avg_GDP_per_cap          19
avg_emissions_per_cap     0
dtype: int64

In [47]:
#There we go. Let's remove the NAs, starting from avg_df_clean so that the non-country filter is kept.

avg_df_clean = avg_df_clean.dropna(subset=["avg_GDP_per_cap"])

#Let's verify.
avg_df_clean.isna().sum()

Entity                   0
avg_GDP_per_cap          0
avg_emissions_per_cap    0
dtype: int64
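
When several cleaning steps run one after another, each step has to start from the result of the previous one, otherwise an earlier filter is silently undone. Chaining the steps in a single expression is one way to make that harder to get wrong. A minimal sketch on made-up rows (the "World" values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical averages; the "World" numbers are made up
demo = pd.DataFrame(
    {
        "Entity": ["Albania", "World", "Yemen"],
        "avg_GDP_per_cap": [16962.4698, 20000.0, np.nan],
        "avg_emissions_per_cap": [1.664647, 4.5, 0.257450],
    }
)
non_countries = ["World"]

# Filter out aggregates, then drop missing GDP, in one chain
cleaned = (
    demo[~demo["Entity"].isin(non_countries)]
    .dropna(subset=["avg_GDP_per_cap"])
    .reset_index(drop=True)
)

print(cleaned)
```

Only "Albania" remains here: "World" is removed by the first step and "Yemen" by the second.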

In [48]:
#Great, now let's create our scatter plot.
import plotly.express as px

#Create scatter plot
fig = px.scatter(
    avg_df_clean,
    x="avg_GDP_per_cap",
    y="avg_emissions_per_cap",
    color="Entity",
    title="Average GDP per Capita vs Average CO₂ Emissions per Capita (2020–2025)",
    labels={
        "avg_GDP_per_cap": "Average GDP per Capita (PPP, constant international-$)",
        "avg_emissions_per_cap": "Average CO₂ Emissions per Capita (tonnes)",
        "Entity": "Country",
    },
)

fig.show()

In [49]:
#To make sure the non-countries are really gone, I'm using ChatGPT to help me rebuild the filter, starting again from avg_df.
#Strip whitespace from Entity names
avg_df["Entity"] = avg_df["Entity"].str.strip()

# Define non-countries
non_countries = [
    "World",
    "High-income countries",
    "Upper-middle-income countries",
    "Lower-middle-income countries",
    "Low-income countries",
    "European Union (27)",
]

# Filter them out
avg_df_clean = avg_df[~avg_df["Entity"].isin(non_countries)]

# Check the result
avg_df_clean["Entity"].unique()


array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire Sint Eustatius and Saba',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czechia',
       'Democratic Republic of Congo', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Faroe Islands'

In [50]:
# Let's try again.
fig = px.scatter(
    avg_df_clean,
    x="avg_GDP_per_cap",
    y="avg_emissions_per_cap",
    color="Entity",
    title="Average GDP per Capita vs Average CO₂ Emissions per Capita (2020–2025)",
    labels={
        "avg_GDP_per_cap": "Average GDP per Capita (PPP, constant international-$)",
        "avg_emissions_per_cap": "Average CO₂ Emissions per Capita (tonnes)",
        "Entity": "Country",
    },
)

fig.show()
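
The introduction mentions focusing on the top 10 countries by GDP. One way to narrow the table down is `DataFrame.nlargest`. The sketch below assumes "highest GDP" means highest average GDP per capita, and it uses only the five rows shown in the earlier `avg_df.head()` output, so it picks the top 3 rather than the top 10:

```python
import pandas as pd

# The five rows shown in the avg_df.head() output earlier
demo = pd.DataFrame(
    {
        "Entity": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
        "avg_GDP_per_cap": [2219.843775, 16962.4698, 14814.9336, 61858.736, 7391.42586],
        "avg_emissions_per_cap": [0.258903, 1.664647, 4.146411, 5.159903, 0.558217],
    }
)

# Keep the rows with the largest average GDP per capita (use 10 on the full table)
top = demo.nlargest(3, "avg_GDP_per_cap")

print(top)
```

On the full cleaned table, replacing 3 with 10 and passing the result to the scatter plot would show only those countries.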


Great! We've now created a scatter plot showing the relationship between average GDP per capita and average CO2 emissions per capita over the 2020 to 2025 window. If we hover around the bottom left, we see many developing countries clustered together, with both low GDP per capita and low emissions per capita. This can be the basis for further research, especially in examining equity issues in climate change.

Note: This data visualization may be incomplete, since we had to drop entities with missing (NA) values from our data.
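
To put a number on the relationship seen in the scatter plot, one option is a correlation on log-transformed values, since both variables span several orders of magnitude. This is only a sketch: it uses the five rows from the earlier `avg_df.head()` output, so the result says nothing about the full dataset, and the choice of a log scale is an assumption rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# The five rows shown in the avg_df.head() output earlier
demo = pd.DataFrame(
    {
        "Entity": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
        "avg_GDP_per_cap": [2219.843775, 16962.4698, 14814.9336, 61858.736, 7391.42586],
        "avg_emissions_per_cap": [0.258903, 1.664647, 4.146411, 5.159903, 0.558217],
    }
)

# Pearson correlation between the log10 of each variable
log_gdp = np.log10(demo["avg_GDP_per_cap"])
log_co2 = np.log10(demo["avg_emissions_per_cap"])
r = log_gdp.corr(log_co2)

print(round(r, 2))
```

Running the same two lines on `avg_df_clean` would give a figure for all countries in the plot.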