# Converting numeric data to sequence data: Example based on the Gapminder data

*Author: Yuqi Liang*
*Date: 26 Feb 2025*

In this tutorial, we will explore the Gapminder data and demonstrate how to convert numeric data into sequence data. Here is some important information:

**Data Source:** The data used in this tutorial is sourced from [Gapminder](https://www.gapminder.org/data/). Gapminder provides comprehensive datasets on various global indicators, including CO₂ emissions and GDP per capita, which are used to track and visualize development trends over time.  

**Data Analysis and Cleaning:** We will perform data analysis and cleaning to transform the numeric dataset into a sequence dataset. This involves:

* Loading the dataset.
* Converting the dataset from wide to long format.
* Handling missing values and converting data types.
* Computing quintile thresholds to categorize the data.
* Converting the dataset back to wide format with sequence states.

Let's get started!

In [1]:
# Import necessary packages
import numpy as np  # np.nan is used in the local decile step
import pandas as pd

## CO2 emissions data

In [2]:
# Load the dataset
df_co2_emissions = pd.read_csv('data_sources/gapminder/co2_pcap_cons.csv')

df_co2_emissions

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,...,0.28,0.253,0.262,0.245,0.247,0.254,0.261,0.261,0.279,0.284
1,Angola,0.009,0.009,0.009,0.009,0.009,0.009,0.010,0.010,0.010,...,1.28,1.640,1.220,1.180,1.150,1.120,1.150,1.120,1.200,1.230
2,Albania,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,...,2.27,2.250,2.040,2.010,2.130,2.080,2.050,2.000,2.120,2.100
3,Andorra,0.333,0.335,0.337,0.340,0.342,0.345,0.347,0.350,0.352,...,5.9,5.830,5.970,6.070,6.270,6.120,6.060,5.630,5.970,5.910
4,UAE,0.063,0.063,0.064,0.064,0.064,0.064,0.065,0.065,0.065,...,27,26.800,27.000,26.700,23.900,23.500,21.200,19.700,20.700,21.100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Samoa,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,...,1.04,1.090,1.210,1.260,1.290,1.320,1.370,1.310,1.400,1.430
190,Yemen,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,...,0.994,0.937,0.480,0.377,0.363,0.356,0.365,0.362,0.387,0.395
191,South Africa,0.003,0.003,0.004,0.004,0.004,0.004,0.004,0.004,0.004,...,6.2,6.100,5.760,5.680,5.550,5.420,5.680,5.110,5.080,5.180
192,Zambia,0.255,0.256,0.257,0.258,0.259,0.260,0.261,0.262,0.263,...,0.511,0.560,0.519,0.471,0.474,0.467,0.487,0.388,0.416,0.424


In [3]:
# Convert the dataset from wide to long format
df_long = df_co2_emissions.melt(id_vars=["country"], var_name="year", value_name="co2_per_capita")

# Convert year to integer
df_long["year"] = df_long["year"].astype(int)

# Convert CO₂ per capita values to numeric, handling errors as NaN
df_long["co2_per_capita"] = pd.to_numeric(df_long["co2_per_capita"], errors="coerce")

# Drop missing values
df_long = df_long.dropna(subset=["co2_per_capita"])

df_long

Unnamed: 0,country,year,co2_per_capita
0,Afghanistan,1800,0.001
1,Angola,1800,0.009
2,Albania,1800,0.001
3,Andorra,1800,0.333
4,UAE,1800,0.063
...,...,...,...
43257,Samoa,2022,1.430
43258,Yemen,2022,0.395
43259,South Africa,2022,5.180
43260,Zambia,2022,0.424


## Long vs. Wide Format: Important distinctions in statistics and data analytics

**In wide format:**

Each row represents a country, and each column is a different year (e.g., `1800`, `1801`, `1802`, etc.). It is called wide format because the table has many columns and therefore looks wide. This is also the format Sequenzo expects by default for your original dataframe.

**In long format:**

Each row represents one observation, that is, one country in one year, with three columns: `country`, `year`, and `co2_per_capita`. It is called long format because each country is spread over many rows, one per year, so the table looks long.

For instance, this dataset spans 1800 to 2022, so a country with a complete record has 223 rows in long format. By contrast, a country always has exactly one row in wide format.

**Why convert to long format?**  
- It's easier to manipulate time-based or grouped data as a data preprocessing step.
- It works better with most Python tools (`groupby`, `qcut`, plotting, etc.).
- It's a convenient intermediate structure: once each value has been categorized, we pivot back to the wide format used for sequence analysis.
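
To make the two formats concrete, here is a minimal sketch of a wide → long → wide round trip. The toy dataframe and its names (`toy_wide`, `toy_long`, `toy_back`, `value`) are made up for illustration and are not part of the Gapminder data.

```python
import pandas as pd

# Hypothetical toy data in wide format: one row per country, one column per year
toy_wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [1.0, 4.0],
    "2001": [1.5, 3.5],
    "2002": [2.0, 3.0],
})

# Wide -> long: one row per (country, year) observation
toy_long = toy_wide.melt(id_vars="country", var_name="year", value_name="value")
toy_long["year"] = toy_long["year"].astype(int)

# Long -> wide: back to one row per country
toy_back = toy_long.pivot(index="country", columns="year", values="value").reset_index()
toy_back.columns.name = None

print(toy_long)
print(toy_back)
```

The long table has six rows (two countries times three years), while the wide table has two rows and one column per year.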

---

## Why Convert Numeric Values into Categories?

In **social sequence analysis**, we focus on **changes in status or category over time**, not the exact numerical values.

Instead of analyzing raw CO₂ per capita values, we group them into **categories** such as:
- Very Low
- Low
- Middle
- High
- Very High

This:
- Highlights meaningful shifts between categories.
- Makes country comparisons easier.
- Reduces sensitivity to small numerical fluctuations.

We create these categories using **quintiles** — dividing all numeric values into five equal-sized groups.

> **Note:** This is a common practice in social sequence analysis for turning continuous variables into **categorical states** suitable for pattern analysis over time.
> Of course, you are welcome to use any other categorization that is plausible for your research question.
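
As a small illustration of what "five equal-sized groups" means, the sketch below applies `pd.qcut` to ten made-up numbers. The series `values` is a hypothetical example, not taken from the dataset.

```python
import pandas as pd

labels = ["Very Low", "Low", "Middle", "High", "Very High"]

# Hypothetical values: the integers 1 to 10
values = pd.Series(range(1, 11))

# qcut places the cut points at the 20th, 40th, 60th and 80th percentiles
states = pd.qcut(values, q=5, labels=labels)

# Count how many values fall into each category, in category order
counts = states.value_counts(sort=False)
print(counts)
```

Each category receives two of the ten values, because the thresholds are chosen from the data rather than fixed in advance.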

## Time to distinguish global and local quintiles

### 1. **Global Quintiles**

**What it means**: 
- All CO₂ per capita values across all years and countries are considered together to compute **one set** of quintile thresholds.
- This allows **global comparisons across time** using the same benchmark.

**How to do it**:
```python
df_long["quintile_global"] = pd.qcut(df_long["co2_per_capita"], q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"])
df_global = df_long.pivot(index="country", columns="year", values="quintile_global")
```
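
If you want to see the thresholds themselves, `pd.qcut` can return them through its `retbins=True` argument. The sketch below uses a made-up series (`toy`) in place of `df_long["co2_per_capita"]`, so the numbers are illustrative only.

```python
import pandas as pd

labels = ["Very Low", "Low", "Middle", "High", "Very High"]

# Hypothetical values standing in for the pooled CO2 per capita column
toy = pd.Series([0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2])

# retbins=True returns the bin edges in addition to the categories
states, edges = pd.qcut(toy, q=5, labels=labels, retbins=True)

# Five bins are delimited by six edges, from the minimum to the maximum value
print(edges)
```

Inspecting the edges is a quick way to check whether the category boundaries make substantive sense before using the states in a sequence analysis.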

---

### 2. **Local (Yearly) Quintiles**

**What it means**: 
- Quintile thresholds are calculated **independently for each year**.
- A country's CO₂ level is evaluated relative to other countries in **that specific year only**.
- Useful for identifying changes in **relative rank**, even if the global scale has shifted.

**How to compute it**:
```python
# Create a function to assign quintiles per year
def assign_local_quintiles(sub_df):
    sub_df = sub_df.copy()  # work on a copy instead of modifying the group in place
    sub_df["quintile_local"] = pd.qcut(sub_df["co2_per_capita"], q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"])
    return sub_df

# Apply the function by year (group_keys=False keeps the original row index)
df_local = df_long.groupby("year", group_keys=False).apply(assign_local_quintiles)

# Pivot for wide format
df_local = df_local.pivot(index="country", columns="year", values="quintile_local")
```

---

### 3. Why Do We Have Both?

As you gain experience with data analysis, it's easy to fall into a rhythm of focusing on *how* to do things rather than *why* you are doing them. But in both research and industry, understanding the **conceptual meaning** behind each data operation is crucial. Always take a step back and ask yourself: **What does this step represent? Is it really helpful, and why? What story does it help me tell?**

For example:
- If you're interested in **absolute progress** — whether a country’s CO₂ emissions are rising or falling over decades — then **global quintiles** help you benchmark everything on the same scale.
- If you're more focused on **relative standing** — how a country ranks among its peers in a given year — then **local (yearly) quintiles** provide that perspective.

There’s no one-size-fits-all answer. It's about understanding your **research or business question** and choosing the approach that best helps you explore or communicate it.

It is often helpful to explore **both** approaches, since they provide different lenses through which to understand change over time.

| Aspect              | Global Quintiles                                  | Local (Yearly) Quintiles                                |
|---------------------|---------------------------------------------------|---------------------------------------------------------|
| **Benchmark**        | Fixed across time                                | Dynamic per year                                        |
| **Tracks**           | Absolute change over time                        | Relative position change year-to-year                   |
| **Useful For**       | Long-term global comparisons                     | Annual rankings and relative shifts                     |
| **Limitations**      | Can hide relative progress/regress per year      | Can't compare across years directly                     |
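
The difference between the two benchmarks can be seen on a tiny made-up example. In the sketch below, five hypothetical countries (`A` to `E`) all emit ten times more in 2000 than in 1900, but their ranking never changes. The data and variable names are assumptions for illustration only.

```python
import pandas as pd

labels = ["Very Low", "Low", "Middle", "High", "Very High"]

# Hypothetical data: every country's value grows tenfold, the ranking stays the same
toy = pd.DataFrame({
    "country": list("ABCDE") * 2,
    "year": [1900] * 5 + [2000] * 5,
    "value": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Global quintiles: one set of thresholds for all years pooled together
toy["global"] = pd.qcut(toy["value"], q=5, labels=labels).astype(str)

# Local quintiles: thresholds recomputed within each year
parts = []
for year, group in toy.groupby("year"):
    group = group.copy()
    group["local"] = pd.qcut(group["value"], q=5, labels=labels).astype(str)
    parts.append(group)
toy = pd.concat(parts)

wide_global = toy.pivot(index="country", columns="year", values="global")
wide_local = toy.pivot(index="country", columns="year", values="local")

print(wide_global)
print(wide_local)
```

With global quintiles, country `A` moves from "Very Low" to "Middle" because its absolute level rose. With local quintiles it stays "Very Low" in both years because it remains the lowest emitter among its peers.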

---

### Outputs

The following code cells produce two dataframes:
- `df_global`: CO₂ per capita quintiles by country and year (based on **global** distribution)
- `df_local`: CO₂ per capita quintiles by country and year (based on **yearly** or **local** distribution)


In [4]:
# Global quintiles: Compute quintile thresholds based on the entire dataset (all years and countries)
df_long["quintile_global"] = pd.qcut(df_long["co2_per_capita"], q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"])

# Convert back to wide format (country as index, years as columns with quintile states)
df_global = df_long.pivot(index="country", columns="year", values="quintile_global")

# Reset index to make 'country' a regular column
df_global = df_global.reset_index()

# Remove the columns.name metadata for cleaner display
df_global.columns.name = None

df_global

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,High,High,High,High,High,High,High,High,High,High
1,Albania,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
2,Algeria,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
3,Andorra,High,High,High,High,High,High,High,High,High,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
4,Angola,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,High,High,High,High,High,High,High,High,High,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Venezuela,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,High,Middle,High,High,High
190,Vietnam,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,High,High,High,High,High,Very High,Very High,Very High,Very High,Very High
191,Yemen,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,High,High,High,High,High,High,High,High,High,High
192,Zambia,High,High,High,High,High,High,High,High,High,...,High,High,High,High,High,High,High,High,High,High


In [5]:
# Local (yearly) quintiles
# Create a function to assign quintiles per year
def assign_local_quintiles(sub_df):
    sub_df = sub_df.copy()  # work on a copy instead of modifying the group in place
    sub_df["quintile_local"] = pd.qcut(sub_df["co2_per_capita"], q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"])
    return sub_df

# Apply the function by year (group_keys=False keeps the original row index)
df_local = df_long.groupby("year", group_keys=False).apply(assign_local_quintiles)

# Pivot for wide format
df_local = df_local.pivot(index="country", columns="year", values="quintile_local")

# Reset index to make 'country' a regular column
df_local = df_local.reset_index()

# Remove the columns.name metadata for cleaner display
df_local.columns.name = None

df_local


Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
1,Albania,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
2,Algeria,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
3,Andorra,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,...,High,High,High,High,High,High,High,High,High,High
4,Angola,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,...,Low,Low,Low,Low,Low,Low,Low,Low,Low,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Venezuela,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Middle,High,Middle,Middle,Middle,Low,Very Low,Low,Very Low,Very Low
190,Vietnam,High,High,High,High,High,High,High,High,High,...,Low,Low,Low,Low,Low,Low,Middle,Middle,Middle,Middle
191,Yemen,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,Low,Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
192,Zambia,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,...,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low


## Deciles

In [6]:
# Convert your long-form data as needed:
# df_long should have 'country', 'year', 'co2_per_capita' columns

# Compute global deciles
df_long["decile_global"] = pd.qcut(
    df_long["co2_per_capita"], 
    q=10, 
    labels=[
        "D1 (Very Low)", "D2", "D3", "D4", "D5",
        "D6", "D7", "D8", "D9", "D10 (Very High)"
    ]
)

# Convert to wide format: rows = country, columns = year, values = decile
df_global_deciles = df_long.pivot(index="country", columns="year", values="decile_global")

# Optional: reset index and clean column names
df_global_deciles = df_global_deciles.reset_index()
df_global_deciles.columns.name = None

df_global_deciles

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D7,D7,D7,D7,D7,D7,D7,D7,D7,D7
1,Albania,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D9,D9,D9,D9,D9,D9,D9,D9,D9,D9
2,Algeria,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D9,D9,D9,D9,D9,D9,D9,D9,D9,D9
3,Andorra,D7,D7,D7,D7,D7,D7,D7,D7,D7,...,D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High),D10 (Very High)
4,Angola,D3,D3,D3,D3,D3,D3,D3,D3,D3,...,D8,D8,D8,D8,D8,D8,D8,D8,D8,D8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Venezuela,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D9,D9,D9,D9,D9,D8,D6,D8,D8,D8
190,Vietnam,D3,D3,D3,D3,D3,D3,D3,D3,D3,...,D8,D8,D8,D8,D8,D9,D9,D9,D9,D9
191,Yemen,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D8,D8,D7,D7,D7,D7,D7,D7,D7,D7
192,Zambia,D7,D7,D7,D7,D7,D7,D7,D7,D7,...,D7,D7,D7,D7,D7,D7,D7,D7,D7,D7


In [7]:
# Local (Yearly) Deciles

# Function to assign deciles within each year
def assign_local_deciles(sub_df):
    sub_df = sub_df.copy()
    
    # Use qcut with duplicates='drop' to handle repeated bin edges
    try:
        # Try creating bins without setting labels
        bins = pd.qcut(sub_df["co2_per_capita"], q=10, duplicates='drop', retbins=True)[1]
        num_bins = len(bins) - 1
        
        # Generate matching number of labels dynamically
        all_labels = [
            "D1 (Very Low)", "D2", "D3", "D4", "D5",
            "D6", "D7", "D8", "D9", "D10 (Very High)"
        ]
        labels = all_labels[:num_bins]

        # Now apply qcut with correct number of labels
        sub_df["decile_local"] = pd.qcut(
            sub_df["co2_per_capita"],
            q=10,
            labels=labels,
            duplicates='drop'
        )
    except ValueError:
        # If qcut completely fails (e.g. all values are the same), assign NaN
        sub_df["decile_local"] = np.nan

    return sub_df

# Group by year and apply decile assignment
df_local_deciles = df_long.groupby("year", group_keys=False).apply(assign_local_deciles)

# Pivot to wide format: rows = country, columns = year, values = decile
df_local_deciles = df_local_deciles.pivot(index="country", columns="year", values="decile_local")

df_local_deciles



country,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Afghanistan,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D2,D2,D2,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low)
Albania,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D5,D5,D5,D5,D5,D5,D5,D5,D5,D5
Algeria,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D6,D6,D6,D6,D6,D6,D6,D6,D6,D6
Andorra,D9,D9,D9,D9,D9,D9,D9,D9,D9,D9,...,D7,D7,D8,D8,D8,D8,D8,D8,D8,D8
Angola,D5,D5,D5,D5,D5,D5,D5,D5,D5,D5,...,D4,D4,D4,D4,D4,D3,D3,D3,D3,D3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),D1 (Very Low),...,D6,D7,D6,D5,D5,D3,D1 (Very Low),D3,D2,D2
Vietnam,D6,D6,D6,D6,D6,D6,D6,D6,D6,D6,...,D4,D4,D4,D4,D4,D4,D5,D5,D5,D5
Yemen,D2,D2,D2,D2,D2,D2,D2,D2,D2,D2,...,D3,D3,D2,D2,D2,D2,D2,D2,D2,D2
Zambia,D9,D9,D9,D9,D9,D9,D9,D9,D9,D9,...,D2,D2,D2,D2,D2,D2,D2,D2,D2,D2
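
The `duplicates='drop'` argument matters when many values in a year are tied, as happens with the early years where many countries share the same very small value. The sketch below uses a made-up series (`tied`) to show the failure and the workaround. It is an illustration under that assumption, not a reproduction of any particular year in the dataset.

```python
import pandas as pd

# Hypothetical year in which eight of ten countries share the same tiny value
tied = pd.Series([0.001] * 8 + [0.5, 1.0])

# Without duplicates="drop", several quantile edges coincide and qcut raises an error
try:
    pd.qcut(tied, q=5)
    failed = False
except ValueError:
    failed = True

# With duplicates="drop", the repeated edges are merged, leaving fewer bins than requested
codes, edges = pd.qcut(tied, q=5, duplicates="drop", retbins=True)
n_bins = len(edges) - 1

print(failed, n_bins)
```

Because fewer bins come back than were requested, the number of labels has to be trimmed to `n_bins`, which is what the `labels = all_labels[:num_bins]` line in the cell above does. Keep in mind that after trimming, a label such as "D2" no longer corresponds to the second tenth of the distribution in that year.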


In [9]:
# Save the four dataframes as CSV files on your computer (comment these lines out if you don't need the files).

df_global.to_csv('country_co2_emissions_global_quintiles.csv', index=False)

df_local.to_csv('country_co2_emissions_local_quintiles.csv', index=False)

df_global_deciles.to_csv('country_co2_emissions_global_deciles.csv', index=False)

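# df_local_deciles still has 'country' as its index, so reset it first;
# otherwise index=False would drop the country names from the file
df_local_deciles.reset_index().to_csv('country_co2_emissions_local_deciles.csv', index=False)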

## GDP per capita

The data source for each country's GDP per capita can be found [here](https://docs.google.com/spreadsheets/d/1mbfE9vSQmshpSOsBbicmiaL0Oio7hIJAg0vQKzzl8v0/edit?gid=730262387#gid=730262387). Although this dataset primarily focuses on carbon emissions for each country, it also includes valuable information on GDP per capita. This additional data opens up opportunities for multidomain sequence analysis, where we analyze multiple sequences, such as economic and environmental trajectories, simultaneously for the same country. This approach provides richer insights and will be explored in more detail in later tutorials.


In [5]:
import pandas as pd

# Load the dataset
file_path_gdp = "data_sources/gapminder/Output_CO2 Long Series 1800 - 2022 - Output.csv"
df_gdp = pd.read_csv(file_path_gdp)

df_gdp

Unnamed: 0,country,name,time,MtCo2 (Million Tons of CO2),tCO2 per cap (Tonnes of CO2 per Cap),GDP per Cap
0,afg,Afghanistan,1800,0.002452,0.000748,476.991347
1,afg,Afghanistan,1801,0.002460,0.000750,476.991347
2,afg,Afghanistan,1802,0.002468,0.000752,476.991347
3,afg,Afghanistan,1803,0.002476,0.000755,476.991347
4,afg,Afghanistan,1804,0.002484,0.000757,476.991347
...,...,...,...,...,...,...
43257,zwe,Zimbabwe,2018,12.497979,0.830310,2399.621551
43258,zwe,Zimbabwe,2019,12.026190,0.783230,2203.396810
43259,zwe,Zimbabwe,2020,11.550268,0.737110,1990.319419
43260,zwe,Zimbabwe,2021,12.614216,0.788708,2115.144555


In [6]:
# Select relevant columns for GDP per capita analysis
df_gdp_selected = df_gdp[["name", "time", "GDP per Cap"]].rename(columns={"name": "country", "time": "year", "GDP per Cap": "gdp_per_capita"})

df_gdp_selected

Unnamed: 0,country,year,gdp_per_capita
0,Afghanistan,1800,476.991347
1,Afghanistan,1801,476.991347
2,Afghanistan,1802,476.991347
3,Afghanistan,1803,476.991347
4,Afghanistan,1804,476.991347
...,...,...,...
43257,Zimbabwe,2018,2399.621551
43258,Zimbabwe,2019,2203.396810
43259,Zimbabwe,2020,1990.319419
43260,Zimbabwe,2021,2115.144555


In [7]:
# Convert year to integer and GDP per capita to numeric
df_gdp_selected["year"] = df_gdp_selected["year"].astype(int)
df_gdp_selected["gdp_per_capita"] = pd.to_numeric(df_gdp_selected["gdp_per_capita"], errors="coerce")

# Drop missing values for accurate quintile computation
# (.copy() avoids pandas' SettingWithCopyWarning when a new column is added below)
df_gdp_selected = df_gdp_selected.dropna(subset=["gdp_per_capita"]).copy()

# Alternative (global quintiles): compute thresholds from the entire dataset (all years and countries)
# df_gdp_selected["gdp_quintile"] = pd.qcut(df_gdp_selected["gdp_per_capita"], q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"])

# Local quintiles: compute thresholds separately per year
df_gdp_selected["gdp_quintile"] = df_gdp_selected.groupby("year")["gdp_per_capita"]\
    .transform(lambda x: pd.qcut(x, q=5, labels=["Very Low", "Low", "Middle", "High", "Very High"]))

# Convert back to wide format (country as index, years as columns with GDP quintile states)
df_sequence_gdp = df_gdp_selected.pivot(index="country", columns="year", values="gdp_quintile")

df_sequence_gdp


country,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Afghanistan,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
Albania,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
Algeria,Low,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
Andorra,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
Angola,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Low,Low,Low,Low,Low,Low,Low,Low,Low,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,High,High,High,High,High,High,High,High,High,High,...,High,High,High,Middle,Middle,Middle,Low,Low,Low,Low
Vietnam,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,...,Low,Low,Low,Low,Low,Middle,Middle,Middle,Middle,Middle
Yemen,High,High,High,High,High,High,High,High,High,High,...,Low,Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
Zambia,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low


In [10]:
# Save the GDP sequence dataframe as a CSV file on your computer.

# Move 'country' from the index back to a regular column
df_sequence_gdp = df_sequence_gdp.reset_index()

df_sequence_gdp.to_csv('country_co2_emissions.csv', index=False)

In [9]:
print("Thank you for learning sequence analysis with Sequenzo! ")
print("We hope you found this tutorial insightful.")
print("\n💡 Stay Curious, keep coding, and discover new insights.")
print("✉️ If you have any questions, please feel free to reach out :)")

Thank you for learning sequence analysis with Sequenzo! 
We hope you found this tutorial insightful.

💡 Stay Curious, keep coding, and discover new insights.
✉️ If you have any questions, please feel free to reach out :)
