# Measuring Income Equality with the Gini Coefficient


As we discussed in our numpy exercises, one frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some property is maximally unequal across a set of entities, and a value of 0 when it is evenly distributed.

In this exercise, we will calculate the Gini Coefficient for income inequality across the countries of the world to get a sense of income inequality *across* countries. 


## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

**Starting with this assignment, submissions that have not been formatted with `black` will be automatically rejected.**

For this assignment, please name your file `exercise_series.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```python
assert set(results.keys()) == {
    "ex2_mean",
    "ex2_median",
    "ex3_highest_gdp_percap",
    "ex3_lowest_gdp_percap",
    "ex4_lessthan20_000",
    "ex5_switzerland",
    "ex6_gini_loop",
    "ex7_gini_vectorized",
    "ex8_gini_2025",
}
```

### Submission Limits

Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.


In [1]:
results = {}

### Exercise 1

To get accustomed to Series, let's explore some data on the wealth of 10 randomly selected countries. The data below give the GDP per capita for these countries in 2008.

Use the code below to get started: 

```python
gdppercap = pd.Series(
    [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],
    index=[
        "Bahrain",
        "Belgium",
        "Bulgaria",
        "Ireland",
        "Macedonia",
        "Norway",
        "Paraguay",
        "Singapore",
        "South Africa",
        "Switzerland",
    ],
)
```



In [2]:
import pandas as pd

gdppercap = pd.Series(
    [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],
    index=[
        "Bahrain",
        "Belgium",
        "Bulgaria",
        "Ireland",
        "Macedonia",
        "Norway",
        "Paraguay",
        "Singapore",
        "South Africa",
        "Switzerland",
    ],
)

gdppercap

Bahrain         34605
Belgium         34493
Bulgaria        12393
Ireland         44200
Macedonia       10041
Norway          58138
Paraguay         4709
Singapore       49284
South Africa    10109
Switzerland     42536
dtype: int64

### Exercise 2

Find the mean, median, minimum and maximum values of GDP per capita in this data. 

In [3]:
ex2_mean = gdppercap.mean()
print(f"the mean of GDP per capita in this dataset is {ex2_mean}")
ex2_median = gdppercap.median()
print(f"the median of GDP per capita in this dataset is {ex2_median}")
minimum = gdppercap.min()
print(f"the minimum of GDP per capita in this dataset is {minimum}")
maximum = gdppercap.max()
print(f"the maximum of GDP per capita in this dataset is {maximum}")

results["ex2_mean"] = ex2_mean
results["ex2_median"] = ex2_median
print(results)

the mean of GDP per capita in this dataset is 30050.8
the median of GDP per capita in this dataset is 34549.0
the minimum of GDP per capita in this dataset is 4709
the maximum of GDP per capita in this dataset is 58138
{'ex2_mean': 30050.8, 'ex2_median': 34549.0}


### Exercise 3

Programmatically, determine which country in our data has the highest income per capita, and which has the lowest income per capita.

(Obviously, this is easier to do by just looking at the data, but that's only because this dataset is very small. With a real dataset, you would need to do it with code, so please write code to accomplish this task.)

Hint: Country names form the index for this Series, so to get country names you'll need to access the index. 

Store the country names *as strings* with the keys `"ex3_highest_gdp_percap"` and `"ex3_lowest_gdp_percap"`
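As a minimal sketch of the hint above, pandas Series provide `idxmax()` and `idxmin()`, which return the index label (here, the country name) of the largest and smallest value. The three-country `sample` Series below is just a subset of the exercise data, used to keep the example short.

```python
import pandas as pd

# A small subset of the exercise data, for illustration only
sample = pd.Series(
    [34493, 58138, 4709],
    index=["Belgium", "Norway", "Paraguay"],
)

# idxmax / idxmin return the index label of the max / min value
richest = sample.idxmax()
poorest = sample.idxmin()

print(richest, poorest)
```

Because the index holds plain strings, both results are already Python `str` objects and can be stored directly.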

In [4]:
# Find the row with the highest income per capita
# gdppercap.idxmax()
max_row = gdppercap.loc[gdppercap == maximum]
# print(max_row)
ex3_highest_gdp_percap = max_row.index[0]
print(
    f"the country with the highest income per capita in our dataset is {ex3_highest_gdp_percap}"
)
min_row = gdppercap.loc[gdppercap == minimum]
# print(min_row)
ex3_lowest_gdp_percap = min_row.index[0]
print(
    f"the country with the minimum income per capita in our dataset is {ex3_lowest_gdp_percap}"
)

the country with the highest income per capita in our dataset is Norway
the country with the minimum income per capita in our dataset is Paraguay


In [5]:
results["ex3_highest_gdp_percap"] = ex3_highest_gdp_percap
results["ex3_lowest_gdp_percap"] = ex3_lowest_gdp_percap
print(results)

{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay'}


### Exercise 4

Get Python to print out the names of all the countries that have GDP per capita of less than \$20,000.

Put these country names in a list, sorted alphabetically, and store it in `results` under the key `"ex4_lessthan20_000"`
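One possible approach, sketched here under the assumption that `gdppercap` is defined as in Exercise 1, is to build a boolean mask, use it to subset the Series, and then sort the surviving index labels. The names `mask` and `below_20k` are illustrative only.

```python
import pandas as pd

gdppercap = pd.Series(
    [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],
    index=[
        "Bahrain",
        "Belgium",
        "Bulgaria",
        "Ireland",
        "Macedonia",
        "Norway",
        "Paraguay",
        "Singapore",
        "South Africa",
        "Switzerland",
    ],
)

# Boolean Series: True where GDP per capita is below $20,000
mask = gdppercap < 20_000

# Subset with the mask, then sort the country names alphabetically
below_20k = sorted(gdppercap[mask].index)

print(below_20k)
```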

In [6]:
# Keep countries below $20,000, then sort alphabetically by country name
gdp_less_20000 = gdppercap[gdppercap < 20000].sort_index()
# The index holds the country names; convert it to a plain list
ex4_lessthan20_000 = gdp_less_20000.index.tolist()
print(
    f"all the countries that have GDP per capita of less than $20,000, sorted alphabetically, is {ex4_lessthan20_000}"
)

results["ex4_lessthan20_000"] = ex4_lessthan20_000
print(results)

all the countries that have GDP per capita of less than $20,000, sorted alphabetically, is ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa']
{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa']}


### Exercise 5

Get Python to print out the GDP per capita of Switzerland. Store the result as `ex5_switzerland`:

In [7]:
ex5_switzerland = gdppercap["Switzerland"]
results["ex5_switzerland"] = ex5_switzerland
print(results)

{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'], 'ex5_switzerland': 42536}


## Exercise 6

One frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some variable is maximally unequal across a population, and a value of 0 when it is evenly distributed. We will calculate the Gini Coefficient for income inequality in our data.

To visualize the Gini Coefficient, we plot the cumulative share of the population (ordered from poorest to richest) on the x-axis, and cumulative share of income earned by that group on the y-axis. The Gini Coefficient is then defined as $$\frac{A}{A + B}$$, where the areas A and B are labeled below: 

![gini_coefficient](https://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Economics_Gini_coefficient2.svg/800px-Economics_Gini_coefficient2.svg.png)

If income is evenly distributed, then the poorest 20% of a population will also have 20% of the wealth; the poorest 40% will have 40% of the wealth, and so forth, resulting in a perfect 45-degree line. In this situation, there is no area between the 45-degree line and the actual income distribution, so $A=0$, and the Gini Coefficient is 0.

If, by contrast, the top 10% of people hold all the wealth in a country, then the poorest 90% of people hold none, and the curve stays flat before jumping up at the far right side of the graph. This generates a very large gap between the 45-degree line and actual income for most of the graph, producing a large value for the area $A$ and hence a very high Gini Coefficient.
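The area-based definition can be sketched directly in code. The hypothetical helper `gini_from_lorenz` below builds the curve of cumulative population share against cumulative income share, measures the area $B$ under it with the trapezoid rule, and uses the fact that $A + B = 0.5$ (the triangle under the 45-degree line). The input incomes are made up for illustration.

```python
import numpy as np


def gini_from_lorenz(incomes):
    """Gini coefficient as A / (A + B), using trapezoids under the curve."""
    y = np.sort(np.asarray(incomes, dtype=float))  # poorest to richest
    n = len(y)

    # Curve points, starting at the origin (0% of people, 0% of income)
    pop_share = np.arange(n + 1) / n
    income_share = np.concatenate(([0.0], np.cumsum(y) / y.sum()))

    # Area B under the curve via the trapezoid rule
    heights = (income_share[1:] + income_share[:-1]) / 2
    area_b = np.sum(heights * np.diff(pop_share))

    # The 45-degree line bounds a triangle of area 0.5, so A = 0.5 - B
    area_a = 0.5 - area_b
    return area_a / 0.5


equal = gini_from_lorenz([10, 10, 10, 10])  # everyone earns the same
skewed = gini_from_lorenz([0, 0, 0, 40])  # one entity earns everything

print(equal, skewed)
```

With only four entities, the fully concentrated case gives 0.75 rather than 1: for a finite sample of size $n$, the largest value this calculation can produce is $(n-1)/n$.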

To illustrate, here are a few different Gini plots. (These come from someone studying inequality of participation, so to adapt them to our study of income, just imagine the y-axis plots share of income.)

![gini_distributions](https://miro.medium.com/max/595/0*3DTcZnzDwS6A6AtP)

For discrete data, the Gini Coefficient can be calculated with the following formula: 

$$\frac{2 \sum_{i=1}^n i y_i}{n \sum_{i=1}^n y_i} -\frac{n+1}{n}$$

where $n$ is the number of countries, $i$ is each country's rank ordering from poorest to richest, and $y_i$ is the income of country $i$.
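As a quick arithmetic check of the formula, consider an illustrative example with made-up incomes (not data from this exercise): three entities with incomes $y = (1, 2, 3)$, already sorted from poorest to richest, so $n = 3$:

$$\frac{2(1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3)}{3(1 + 2 + 3)} - \frac{3 + 1}{3} = \frac{28}{18} - \frac{24}{18} = \frac{2}{9} \approx 0.22$$

If instead all three incomes were equal to 2, the first term would be $\frac{2(2 + 4 + 6)}{3 \cdot 6} = \frac{4}{3}$, which exactly cancels the second term and gives a Gini Coefficient of 0.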



### Exercise 6

Using this formula, calculate the Gini coefficient for our income data. 

Begin by writing a function to calculate the Gini Coefficient for our data *by looping over the entries in our Series*. In other words, try and embrace the spirit of how you might normally think about interpreting the summation notation written above.

Store the gini coefficient you calculate in `results` under the key `"ex6_gini_loop"`.

**HINT**: Be careful with 0-indexing! Python counts from 0, but mathematical formulas (like $\sum$) start from 1!

**HINT 2**: I'll probably ask you to use this more than once, so please put it in a function.
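One way to sidestep the off-by-one issue from the first hint is `enumerate(..., start=1)`, which pairs each value with a 1-based rank. The sketch below uses a hypothetical function `gini_loop` on a plain list of made-up incomes, and assumes the input is already sorted from poorest to richest.

```python
def gini_loop(sorted_incomes):
    """Gini coefficient for incomes already sorted from poorest to richest."""
    n = len(sorted_incomes)
    total = sum(sorted_incomes)
    weighted_sum = 0

    # enumerate(..., start=1) yields the 1-based rank i alongside y_i
    for rank, income in enumerate(sorted_incomes, start=1):
        weighted_sum += rank * income

    return 2 * weighted_sum / (n * total) - (n + 1) / n


example = gini_loop([10, 20, 30, 40])
print(example)
```

Iterating over a pandas Series yields its values in order, so the same loop should also work on a sorted Series.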

In [8]:
# Sort 'gdppercap' in ascending order based on GDP values
gdppercap_sorted = gdppercap.sort_values(ascending=True)


def calculate_gini_by_loop_sorted(gdppercap_sorted):
    n = len(gdppercap_sorted)
    sum_y = gdppercap_sorted.sum()
    gini_sum = 0

    for i in range(1, n + 1):
        y_i = gdppercap_sorted.iloc[i - 1]  # 0-based index, so we use (i - 1)
        gini_sum += i * y_i

    gini_coefficient = (2 * gini_sum) / (n * sum_y) - (n + 1) / n
    return gini_coefficient


ex6_gini_loop = calculate_gini_by_loop_sorted(gdppercap_sorted)

print(f"the Gini Coefficient for our data gdppercap is {ex6_gini_loop}")

results["ex6_gini_loop"] = ex6_gini_loop
print(results)

the Gini Coefficient for our data gdppercap is 0.3382798461272245
{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'], 'ex5_switzerland': 42536, 'ex6_gini_loop': 0.3382798461272245}


### Exercise 7

Excellent! But as we've seen in [our readings](https://nickeubank.github.io/practicaldatascience_book/notebooks/class_2/week_4/11_vectorization.html), in data science we generally strive to *not* loop over the entries in our arrays; instead, we aspire to write *vectorized code* that naturally applies a simple operation to each observation.

So now write a new function to calculate the Gini Coefficient that *doesn't* use loops, and instead relies on vectorized code.

Store the result in `results` under the key `"ex7_gini_vectorized"`.

**HINT:** you will probably have to create some new series/vectors/arrays.
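A minimal sketch of the vectorized idea: build an array of ranks with `np.arange` and multiply it elementwise with the sorted incomes. The hypothetical function `gini_vectorized` below also sorts internally, which is a design choice (not something the exercise requires) so that callers do not need to remember to pre-sort.

```python
import numpy as np
import pandas as pd


def gini_vectorized(incomes):
    """Vectorized Gini coefficient; sorts internally, so input order is free."""
    # Work on a sorted copy of the raw values (index labels are dropped)
    y = np.sort(np.asarray(incomes, dtype=float))
    n = len(y)

    # Ranks 1..n, matching the i in the summation
    ranks = np.arange(1, n + 1)

    return 2 * (ranks * y).sum() / (n * y.sum()) - (n + 1) / n


# GDP per capita values from Exercise 1, deliberately left unsorted
gdp_values = pd.Series(
    [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536]
)

gini_unsorted_input = gini_vectorized(gdp_values)
print(gini_unsorted_input)
```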

In [9]:
import numpy as np


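def calculate_gini_vectorized(series):
    # Assumes `series` is already sorted from poorest to richest
    n = len(series)
    sum_y = series.sum()
    # Ranks 1..n; a NumPy array multiplies by position, ignoring index labels
    i = np.arange(1, n + 1)

    gini_coefficient = (2 * (i * series).sum()) / (n * sum_y) - (n + 1) / n
    return gini_coefficient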


ex7_gini_vectorized = calculate_gini_vectorized(gdppercap_sorted)
print(f"the Gini Coefficient for our data gdppercap is {ex7_gini_vectorized}")

results["ex7_gini_vectorized"] = ex7_gini_vectorized
print(results)

the Gini Coefficient for our data gdppercap is 0.3382798461272245
{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'], 'ex5_switzerland': 42536, 'ex6_gini_loop': 0.3382798461272245, 'ex7_gini_vectorized': 0.3382798461272245}


### Exercise 8

The result we just generated offers a snapshot of inequality for this subset of countries. But what are the dynamics of inequality for these countries?

There is an idea in economics called the "convergence hypothesis", which argues that poorer countries are likely to grow faster, and as a result global inequality is likely to decline. Economists advocating for this hypothesis pointed out that while rich countries had to invent new technologies in order to grow, many poor countries simply had to take advantage of innovations already developed by rich countries. 

To test this hypothesis, let's do a small analysis of the dynamics of income inequality in our sample. Create the following Series in your Python session, which provides the average growth rate of GDP per capita for all the countries in our sample from 2000 to 2018. 

```python
avg_growth = pd.Series(
    [
        -0.29768835,
        0.980299584,
        4.52991925,
        3.686556736,
        2.621416804,
        0.775132075,
        2.015489468,
        3.345793635,
        1.349993318,
        0.982775018,
    ],
    index=[
        "Bahrain",
        "Belgium",
        "Bulgaria",
        "Ireland",
        "Macedonia",
        "Norway",
        "Paraguay",
        "Singapore",
        "South Africa",
        "Switzerland",
    ],
)
```

In [10]:
avg_growth = pd.Series(
    [
        -0.29768835,
        0.980299584,
        4.52991925,
        3.686556736,
        2.621416804,
        0.775132075,
        2.015489468,
        3.345793635,
        1.349993318,
        0.982775018,
    ],
    index=[
        "Bahrain",
        "Belgium",
        "Bulgaria",
        "Ireland",
        "Macedonia",
        "Norway",
        "Paraguay",
        "Singapore",
        "South Africa",
        "Switzerland",
    ],
)

Using this data on average growth rates in GDP per capita, and assuming growth rates from 2000 to 2018 continue into the future, estimate what our Gini Coefficient may look like in 2025 (remembering that income in our data is from 2008, so we're extrapolating ahead 17 years).

**Hint:** the formula for compound growth (i.e. value of something growing at a rate of `x` percent for $t$ periods) is:

$$future\_value = current\_value * (1 + \frac{percentage\_growth\_rate}{100})^t$$

Store the answer in `results` under the key `"ex8_gini_2025"`
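A small sketch of the compound growth formula applied to pandas Series. The labels `"A"` and `"B"` and all numbers are made up for illustration. The two Series are deliberately built in different orders to show that arithmetic between Series matches entries by index label, not by position.

```python
import pandas as pd

# Hypothetical current incomes and annual growth rates (in percent)
income = pd.Series([100.0, 200.0], index=["A", "B"])
growth_pct = pd.Series([10.0, 0.0], index=["B", "A"])  # different order on purpose

years = 2

# Series arithmetic aligns on index labels before computing:
# "A" grows at 0% per year, "B" grows at 10% per year
future = income * (1 + growth_pct / 100) ** years

print(future)
```

Here `"A"` stays at roughly 100 while `"B"` grows to roughly 242 (200 times 1.1 squared), even though the growth rates were listed in the opposite order.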

In [11]:
t = 2025 - 2008

future_gdppercap = gdppercap * (1 + avg_growth / 100) ** t
# print(future_gdppercap)

# Sort 'future_gdppercap' in ascending order based on GDP values
future_gdppercap_sorted = future_gdppercap.sort_values(ascending=True)
# print(future_gdppercap_sorted)

ex8_gini_2025 = calculate_gini_vectorized(future_gdppercap_sorted)
print(f"the Gini Coefficient for our data future_gdppercap is {ex8_gini_2025}")

results["ex8_gini_2025"] = ex8_gini_2025
print(results)

the Gini Coefficient for our data future_gdppercap is 0.3656264991306193
{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'], 'ex5_switzerland': 42536, 'ex6_gini_loop': 0.3382798461272245, 'ex7_gini_vectorized': 0.3382798461272245, 'ex8_gini_2025': 0.3656264991306193}


### Exercise 9

Interpret your result -- does it seem to imply that we are seeing convergence or not?

[After you're done, you can see a more systematic version of this analysis here!](https://www.cgdev.org/blog/everything-you-know-about-cross-country-convergence-now-wrong)

> + From the results above, the 2008 Gini coefficient for `gdppercap`, which holds the GDP per capita of our 10 countries, is about 0.34. The projected 2025 Gini coefficient for `future_gdppercap`, which holds the projected GDP per capita of the same 10 countries, is about 0.37.
> + The Gini coefficient is a measure of inequality, with 0 representing perfect equality and 1 representing maximal inequality. Since 0.37 is greater than 0.34, inequality is projected to increase, which means incomes across the countries in our sample are becoming more unevenly distributed. Therefore, in this case, we are not seeing convergence toward more equality; rather, we are seeing divergence toward more inequality in income distribution.
> + Contrary to the convergence hypothesis, inequality in this sample does not decline but increases.

In [12]:
assert set(results.keys()) == {
    "ex2_mean",
    "ex2_median",
    "ex3_highest_gdp_percap",
    "ex3_lowest_gdp_percap",
    "ex4_lessthan20_000",
    "ex5_switzerland",
    "ex6_gini_loop",
    "ex7_gini_vectorized",
    "ex8_gini_2025",
}

print(results)

{'ex2_mean': 30050.8, 'ex2_median': 34549.0, 'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay', 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'], 'ex5_switzerland': 42536, 'ex6_gini_loop': 0.3382798461272245, 'ex7_gini_vectorized': 0.3382798461272245, 'ex8_gini_2025': 0.3656264991306193}
