# What this notebook teaches

1. Common pandas operations: **.unique**, **.value_counts**, **.where**
2. Common numerical operations with pandas: **.mean**, **.sum**, **.max**, **.idxmax**, **.min**, **.idxmin**, **.diff**
3. Masking in pandas: the basics
4. Plotting with pandas: **bar**, **line** and **histogram** plots


# Options

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Imports

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt

# Read the data

In [None]:
local_path = os.path.join('data', 'global-data-on-sustainable-energy.csv')
url = 'https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Data%20Wrangling/Data%20Wrangling%20-%20Pandas-Advanced/Data%20Analysis%20Basics/data/global-data-on-sustainable-energy.csv'

df = pd.read_csv(url)
df.head(3)
df.shape

# Documentation

About the Dataset: 

The data is collected from multiple sources including the World Bank, the International Energy Agency, and ourworldindata.org. [source](https://www.kaggle.com/datasets/anshtanwar/global-data-on-sustainable-energy)

- **Entity**: The name of the country or region for which the data is reported.
- **Year**: The year for which the data is reported, ranging from 2000 to 2020.
- **Access to electricity (% of population)**: The percentage of population with access to electricity.
- **Access to clean fuels for cooking (% of population)**: The percentage of the population with primary reliance on clean fuels.
- **Renewable-electricity-generating-capacity-per-capita**: Installed Renewable energy capacity per person
- **Financial flows to developing countries (US \$):** Aid and assistance from developed countries for clean energy projects.
- **Renewable energy share in total final energy consumption (\%):** Percentage of renewable energy in final energy consumption.
- **Electricity from fossil fuels (TWh):** Electricity generated from fossil fuels (coal, oil, gas) in terawatt-hours.
- **Electricity from nuclear (TWh):** Electricity generated from nuclear power in terawatt-hours.
- **Electricity from renewables (TWh):** Electricity generated from renewable sources (hydro, solar, wind, etc.) in terawatt-hours.
- **Low-carbon electricity (\% electricity):** Percentage of electricity from low-carbon sources (nuclear and renewables).
- **Primary energy consumption per capita (kWh/person):** Energy consumption per person in kilowatt-hours.
- **Energy intensity level of primary energy (MJ/\$2011 PPP GDP):** Energy use per unit of GDP at purchasing power parity.
- **Value_co2_emissions (metric tons per capita):** Carbon dioxide emissions per person in metric tons.
- **Renewables (\% equivalent primary energy):** Equivalent primary energy that is derived from renewable sources.
- **GDP growth (annual \%):** Annual GDP growth rate based on constant local currency.
- **GDP per capita:** Gross domestic product per person.
- **Density (P/Km2):** Population density in persons per square kilometer.
- **Land Area (Km2):** Total land area in square kilometers.
- **Latitude:** Latitude of the country's centroid in decimal degrees.
- **Longitude:** Longitude of the country's centroid in decimal degrees.

# Let's begin

This seems to be data about countries. Which countries are we talking about?

In [None]:
df['Entity'].unique()

A lot of them, apparently. How many exactly?

In [None]:
df['Entity'].nunique()
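To see how `.unique` and `.nunique` relate, here is a minimal sketch on a made-up Series (the names `Country A`, `Country B` and `Country C` are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for df['Entity']
entities = pd.Series(['Country A', 'Country A', 'Country B',
                      'Country C', 'Country B', 'Country A'])

# .unique returns the distinct values, in order of first appearance
distinct_names = entities.unique()

# .nunique returns how many distinct values there are
n_distinct = entities.nunique()

print(distinct_names)
print(n_distinct)
```

In this toy case, `n_distinct` is simply the length of `distinct_names`.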

Also, what timeframe does the data cover?

In [None]:
df['Year'].unique()

How many data points do we have for each country?

In [None]:
df['Entity'].value_counts()

It seems most countries have 21 data points, probably one for each year. Some countries don't have all 21 data points, though.

In [None]:
def check_if_less_than_21(value):
    return value < 21

In [None]:
df['Entity'].value_counts().where(lambda value: value < 21).dropna()
#df['Entity'].value_counts().where(check_if_less_than_21).dropna() # same thing

Countries that don't have the full data: Serbia, Montenegro, South Sudan, French Guiana
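If the `.where(...).dropna()` combination feels opaque, the sketch below breaks it into steps on made-up counts (the entity names and numbers are hypothetical). It also shows plain boolean indexing, which selects the same rows:

```python
import pandas as pd

# Hypothetical row counts per entity, standing in for df['Entity'].value_counts()
counts = pd.Series({'Country A': 21, 'Country B': 21, 'Country C': 14,
                    'Country D': 21, 'Country E': 7})

# .where keeps a value where the condition is True and puts NaN elsewhere.
# Because NaN is introduced, the integer counts are converted to floats.
kept = counts.where(counts < 21)

# .dropna then removes the NaN entries, leaving only the incomplete entities
incomplete = kept.dropna()

# Boolean indexing gives the same entities and keeps the integer dtype
same_result = counts[counts < 21]

print(kept)
print(incomplete)
print(same_result)
```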

There is another way to count how many countries have the full 21 datapoints and how many have other values

In [None]:
df['Entity'].value_counts().value_counts()
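Chaining `.value_counts` twice can be confusing at first. A small sketch on made-up data (entity names are placeholders) shows what each step produces:

```python
import pandas as pd

# Hypothetical toy data: A and B have 3 rows each, C has only 1
entities = pd.Series(['A'] * 3 + ['B'] * 3 + ['C'])

# First call: how many rows does each entity have?
rows_per_entity = entities.value_counts()

# Second call: how many entities share each row count?
entities_per_row_count = rows_per_entity.value_counts()

print(rows_per_entity)
print(entities_per_row_count)
```

Here two entities have 3 rows and one entity has 1 row, so the second result maps 3 to 2 and 1 to 1.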

# Math operations

## Mean

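On average, what percentage of the population has access to electricity? (Note that this is a simple average over all country-year rows, not weighted by population.)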

In [None]:
df.loc[:,'Access to electricity (% of population)'].mean()

## Sum

In total, how much money, in US dollars, flowed to developing countries for clean energy projects?

In [None]:
df.loc[:,'Financial flows to developing countries (US $)'].sum()

Cool! Let's check this value in billions

In [None]:
df.loc[:,'Financial flows to developing countries (US $)'].sum() / 1_000_000_000

$147 billion throughout 2000-2020!

----

## Max, idxmax, Min, idxmin

Which country-year had the most electricity generated from nuclear power?

In [None]:
df['Electricity from nuclear (TWh)'].max()

OK, this just returned the maximum value, but I actually want to know the country and the year where it happened

In [None]:
index = df['Electricity from nuclear (TWh)'].idxmax()
index

In [None]:
df.loc[index,['Entity','Year','Electricity from nuclear (TWh)']]

Wanna do all that in a single step?

In [None]:
df.loc[df['Electricity from nuclear (TWh)'].idxmax(),
       ['Entity','Year','Electricity from nuclear (TWh)']]

----

Country with the least amount of data in this dataset?

In [None]:
df['Entity'].value_counts().min()

Oops, we ran into the same problem again. The minimum is 1, but which country does it refer to?

In [None]:
df['Entity'].value_counts().idxmin()

Great!
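To make the difference between `.max` and `.idxmax` concrete, here is a minimal sketch on a made-up DataFrame (all names and values are hypothetical). The index deliberately starts at 10 to show that `.idxmax` returns an index *label*, which is why it is used with `.loc`:

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {'Entity': ['A', 'A', 'B', 'B'],
     'Year': [2000, 2001, 2000, 2001],
     'Value': [1.0, 5.0, 9.0, 3.0]},
    index=[10, 11, 12, 13],
)

max_value = toy['Value'].max()      # the largest value itself
max_label = toy['Value'].idxmax()   # the index label of the row holding it
min_label = toy['Value'].idxmin()   # the index label of the smallest value

# Use the label with .loc to recover the other columns of that row
max_row = toy.loc[max_label, ['Entity', 'Year']]

print(max_value, max_label, min_label)
print(max_row)
```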

## A lot more operations


You can find other math operations available in pandas [over here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics)

# Masking

This is cool and all, but so far we have only computed these stats for the whole dataset

![](media/will.jpeg)

In [None]:
df['Entity'].unique()[:5]

In [None]:
angola_mask = df['Entity'] == 'Angola'

Let's see what we have here

In [None]:
angola_mask

In [None]:
angola_mask.value_counts()

- A series with a bunch of Falses and some Trues
- The **True** values are for the rows where the condition above is ... well ... True!

![media/what.jpeg](https://i.imgflip.com/98w3rl.jpg)

In [None]:
df.loc[angola_mask,:]
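The same masking idea, sketched on a tiny made-up DataFrame so every row is visible (entity names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Entity': ['A', 'B', 'A', 'C'],
                    'Access': [10.0, 20.0, 30.0, 40.0]})

# Comparing a column with a value yields one True/False per row
mask_a = toy['Entity'] == 'A'

# True counts as 1, so summing the mask counts the matching rows
n_rows_a = mask_a.sum()

# Passing the mask to .loc keeps only the rows where it is True
mean_access_a = toy.loc[mask_a, 'Access'].mean()

print(mask_a.tolist())
print(n_rows_a, mean_access_a)
```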

Now we can ask questions for specific subgroups of our data!

Let's see what we can learn about Angola here:

How did access to electricity evolve throughout the years for Angola?

In [None]:
df.loc[angola_mask,'Access to electricity (% of population)'].diff()

`.diff` will:
- compute, for each row, the difference between its value and the value in the previous row
- return NaN for the first row, because there is no previous row to compare it to
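A minimal sketch of this behaviour on made-up numbers (the values and years below are hypothetical, not Angola's):

```python
import pandas as pd

# Hypothetical yearly values, indexed by year
access = pd.Series([24.0, 20.0, 26.0, 27.0], index=[2000, 2001, 2002, 2003])

# Each entry is the current value minus the previous one
changes = access.diff()

print(changes)
```

The first entry is NaN, and the remaining entries are -4.0, 6.0 and 1.0: a drop in 2001 followed by two increases.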

It would be nice to use the year as the index of the Angola series, to make this analysis easier to read

In [None]:
df.loc[angola_mask,:].set_index('Year')['Access to electricity (% of population)'].diff()

Most of the years, we have a positive difference which means more and more people in Angola have been getting access to electricity

In [None]:
(df
 .loc[angola_mask,'Access to electricity (% of population)']
 .diff()
 .sum()
)

From 2000 till 2020, Angola enabled access to an additional 22% of its population!

In [None]:
# another way of getting the same result: last value minus first value

angola_access = df.loc[angola_mask, 'Access to electricity (% of population)']
angola_access.iloc[-1] - angola_access.iloc[0]
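Why do the two approaches agree? When you add up consecutive differences, the intermediate values cancel out and only the last and first values remain. A small sketch on made-up numbers, assuming the rows are in chronological order and there are no missing values in between:

```python
import pandas as pd

# Hypothetical values in chronological order, with no gaps
access = pd.Series([24.0, 20.0, 26.0, 27.0])

# (20 - 24) + (26 - 20) + (27 - 26): the middle values cancel out.
# .sum skips the NaN that .diff puts in the first row.
total_change = access.diff().sum()

end_minus_start = access.iloc[-1] - access.iloc[0]

print(total_change, end_minus_start)
```

If a value in the middle were missing, the differences around it would be NaN and skipped by `.sum`, so the two results could differ.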

# Plotting

Now, this type of analysis is begging for a visualization

In [None]:
angola_evolution_electricity_access = df.loc[angola_mask].set_index('Year').loc[:,'Access to electricity (% of population)'].diff()
angola_evolution_electricity_access.head(5)

## Bar plots

In [None]:
angola_evolution_electricity_access.plot.bar(
    title="Analysis of Angola's population access to electricity",
    xlabel='Year',
    ylabel='difference in % points from previous year',
);

## Line plots

Line plots are particularly useful to visualize something across time

Let's visualize the growth of this `electricity_access` in another way

In [None]:
(df
 .loc[angola_mask]
 .set_index('Year')
 .loc[:,'Access to electricity (% of population)']
 .plot(title='(Angola) Growth in % population with access to electricity', ylabel='% population')
)

The x axis isn't ideal yet. There are many ways to fix this, and in practice you will often look them up on Google, Stack Overflow or ChatGPT, but here is one simple solution:

In [None]:
(df
 .loc[angola_mask]
 .astype({'Year':'str'})
 .set_index('Year')
 .loc[:,'Access to electricity (% of population)']
 .plot(title='(Angola) Growth in % population with access to electricity', ylabel='% population')
)

## Histogram plots

Histograms are very useful because they give you a lot more information about a distribution than just its average

What is the distribution of CO2 emissions in the year 2010?

In [None]:
mask_2010 = df['Year'] == 2010

Before plotting, `.describe` is a quick way to get a feel for a distribution

In [None]:
df.loc[mask_2010,'Value_co2_emissions_kt_by_country'].describe().astype(int)

But we can also plot to get more information about a distribution. Here we switch to the electricity generated from renewables in 2010:

In [None]:
df.loc[mask_2010,'Electricity from renewables (TWh)'].plot.hist()

You can control the bins by passing explicit bin edges

In [None]:
df.loc[mask_2010,'Electricity from renewables (TWh)'].plot.hist(bins = range(0,800+1,20))

Or, if you don't want to focus on the outliers, restrict the bins to a narrower range

In [None]:
df.loc[mask_2010,'Electricity from renewables (TWh)'].plot.hist(bins = range(0,200+1,10))
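To see what passing a `range` as `bins` does, here is a rough sketch using `numpy.histogram` on made-up values. This assumes the plot bins values the same way `numpy.histogram` does, which is my understanding of how matplotlib histograms work; the numbers are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical values, standing in for one column of the dataset
values = pd.Series([5, 15, 25, 45, 700])

# Bin edges at 0, 20, 40 and 60, built the same way as in the cells above
edges = list(range(0, 60 + 1, 20))

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, used_edges = np.histogram(values, bins=edges)

print(counts)
print(used_edges)
```

The value 700 lies beyond the last edge, so it is not counted in any bin. That is how narrowing the range keeps the outliers out of the picture.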


![](media/dog.jpeg)

Now proceed to the exercises notebook