# World's Richest People 2022 as Listed by Forbes
The objective of this project is to analyze and visualize the data on the world's richest people (in US Dollars) in 2022 as listed by Forbes (The Billionaires List). The Billionaires List is an annual ranking of the world's wealthiest individuals, based on their net worth, as determined by Forbes.
The project will also explore the source of wealth for the world's richest people in 2022, identifying the industries and sectors that are driving wealth creation at the highest levels. This analysis will provide insight into the global economy and the trends that are shaping it.
The dataset for this project was extracted from [Forbes](https://www.forbes.com/billionaires/page-data/index/page-data.json).

In [8]:
# import libraries
import pandas as pd
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import numpy as np


pd.options.display.max_rows = 50
pd.options.display.max_columns = 22

## 1. Read Data

In [31]:
# Load the data
df = pd.read_csv("../Datasets/forbes_2022_billionaires.csv")
df.head(3)

Unnamed: 0,Rank,Name,Age,Country,Month,Year,Networth,Source,Industries,CountryOfCitizenship,Selfmade,Title,City,Gender
0,1,Elon Musk,50,United States,4,2022,$219 B,"Tesla, SpaceX",Automotive,United States,True,CEO,Austin,M
1,2,Jeff Bezos,58,United States,4,2022,$171 B,Amazon,Technology,United States,True,Chairman and Founder,Seattle,M
2,3,Bernard Arnault & family,73,France,4,2022,$158 B,LVMH,Fashion & Retail,France,False,Chairman and CEO,Paris,M


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2668 entries, 0 to 2667
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rank                  2668 non-null   int64 
 1   Name                  2668 non-null   object
 2   Age                   2668 non-null   int64 
 3   Country               2655 non-null   object
 4   Month                 2668 non-null   int64 
 5   Year                  2668 non-null   int64 
 6   Networth              2668 non-null   object
 7   Source                2668 non-null   object
 8   Industries            2668 non-null   object
 9   CountryOfCitizenship  2668 non-null   object
 10  Selfmade              2668 non-null   bool  
 11  Title                 390 non-null    object
 12  City                  2624 non-null   object
 13  Gender                2652 non-null   object
dtypes: bool(1), int64(4), object(9)
memory usage: 273.7+ KB


In [33]:
#Shape
rows, columns = df.shape
print(f"There are {rows} rows and {columns} columns in the data")

There are 2668 rows and 14 columns in the data


In [34]:
# Missing Values
df.isnull().sum()

Rank                       0
Name                       0
Age                        0
Country                   13
Month                      0
Year                       0
Networth                   0
Source                     0
Industries                 0
CountryOfCitizenship       0
Selfmade                   0
Title                   2278
City                      44
Gender                    16
dtype: int64

In [35]:
print("We have some missing data for COUNTRY, TITLE, CITY and GENDER columns.\n\
We'll deal with them on a case by case basis when the need arises")

We have some missing data for COUNTRY, TITLE, CITY and GENDER columns.
We'll deal with them on a case by case basis when the need arises
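
Raw counts of missing values are easier to judge as a share of all rows. For example, `Title` is missing in 2278 of 2668 rows, roughly 85%. The sketch below shows one way to compute that share per column; it uses a small made-up frame (`toy`) rather than the Forbes file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Forbes frame.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Title": ["CEO", np.nan, np.nan, np.nan],
    "City": ["Paris", "Austin", np.nan, "Seattle"],
})

# isnull().mean() is the fraction of missing values in each column.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```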


## 2. Data Cleaning 
Let's clean up columns whose data doesn't make sense.

In [36]:
# Explore the numerical variables.
for col in df.columns:
    if df[col].dtype not in ["object", "bool"]:
        print(f"{col} || max --> {df[col].max()} | min --> {df[col].min()}")
        print("\n ---------------- \n")

print("The AGE column has a minimum value of 0, which doesn't make sense.\n\
We'll assume that the zero values belong to billionaires whose age is not known.")

Rank || max --> 2578 | min --> 1

 ---------------- 

Age || max --> 100 | min --> 0

 ---------------- 

Month || max --> 4 | min --> 4

 ---------------- 

Year || max --> 2022 | min --> 2022

 ---------------- 

The AGE column has a minimum value of 0, which doesn't make sense.
We'll assume that the zero values belong to billionaires whose age is not known.


In [37]:
# Replace zeros in the Age column with NaN so they are treated as missing
df["Age"] = df["Age"].replace(0, np.nan)
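
Marking unknown ages as NaN rather than leaving them at 0 matters because pandas skips NaN when it computes statistics. A minimal sketch with a made-up series of ages (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 0 stands for "age unknown".
ages = pd.Series([50, 0, 58, 73, 0])
cleaned = ages.replace(0, np.nan)

print(ages.mean())     # the zeros pull the mean down
print(cleaned.mean())  # NaN values are ignored
```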

In [38]:
# Convert the NETWORTH column to numeric; values are in billions of USD, e.g. "$219 B" becomes 219.0
df["Networth"] = df["Networth"].apply(lambda x: float(x.split(" ")[0].replace("$", "").strip()))
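
The same conversion can also be written with pandas string methods instead of a Python lambda. This is only a sketch: it assumes every value has the form `$<number> B`, and the sample strings are illustrative.

```python
import pandas as pd

# Sample strings in the same shape as the Networth column.
raw = pd.Series(["$219 B", "$91.4 B", "$1 B"])

# Capture the numeric part after the dollar sign; expand=False returns a Series.
networth = raw.str.extract(r"\$([\d.]+)", expand=False).astype(float)
print(networth.tolist())
```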

In [39]:
# Check for duplicates
df.duplicated().sum()

0

## 3. Age Distribution

In [40]:
trace = go.Histogram(
    x = df["Age"],
    marker = dict(color="#007F8E"),
    xbins = dict(start=10, end=100, size=10 )
)
layout = dict(
    title = dict(
        text = "Billionaire Age Distribution", 
        font = dict(color = "#9C9C9C")
    ),
    xaxis = dict(
        title="Age",
        color = "#9C9C9C"
    ),
    yaxis = dict(
        title="Count",
        color = "#9C9C9C"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)
fig = go.Figure(data = [trace], layout = layout)
fig.show()

* **MOST billionaires are aged between 50 and 69 years**
* **The number of billionaires increases with each age group up to that range**
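
Reading bar heights off a histogram is approximate, so it can help to tabulate the same bins numerically. The sketch below uses `pd.cut` on a made-up list of ages; the bin edges mirror the `xbins` setting (width 10, starting at 10), extended to 110 so that an age of exactly 100 still lands in a bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of ages, including one unknown value.
ages = pd.Series([19, 35, 48, 50, 58, 64, 66, 66, 73, 91, np.nan])

# right=False gives half-open bins [10, 20), [20, 30), ...
bins = list(range(10, 111, 10))
labels = [f"{lo}-{lo + 9}" for lo in bins[:-1]]
age_groups = pd.cut(ages, bins=bins, right=False, labels=labels)

# NaN ages are left out of the counts.
counts = age_groups.value_counts().sort_index()
print(counts)
```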

## 4. Top 10 Richest people in 2022

In [41]:
top_10 = df.loc[:9, ["Rank", "Name", "Age", "Networth", "Country"]]
top_10

Unnamed: 0,Rank,Name,Age,Networth,Country
0,1,Elon Musk,50.0,219.0,United States
1,2,Jeff Bezos,58.0,171.0,United States
2,3,Bernard Arnault & family,73.0,158.0,France
3,4,Bill Gates,66.0,129.0,United States
4,5,Warren Buffett,91.0,118.0,United States
5,6,Larry Page,49.0,111.0,United States
6,7,Sergey Brin,48.0,107.0,United States
7,8,Larry Ellison,77.0,106.0,United States
8,9,Steve Ballmer,66.0,91.4,United States
9,10,Mukesh Ambani,64.0,90.7,India


In [42]:
top_10_table = ff.create_table(top_10)
top_10_table

**8 of the top 10 richest people in the world came from the United States.**
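
Selecting the top 10 with `df.loc[:9]` works here because the file is already sorted by rank. A sketch that does not depend on row order uses `nlargest`; the frame below is a shuffled handful of rows from the table above, and `top_3` is just an illustrative name.

```python
import pandas as pd

# A few rows from the top-10 table, deliberately out of order.
people = pd.DataFrame({
    "Name": ["Bill Gates", "Elon Musk", "Mukesh Ambani", "Jeff Bezos"],
    "Networth": [129.0, 219.0, 90.7, 171.0],
    "Country": ["United States", "United States", "India", "United States"],
})

# Pick the rows with the largest net worth regardless of their position.
top_3 = people.nlargest(3, "Networth").reset_index(drop=True)
print(top_3)

# Count how many of the selected people are listed under the United States.
us_count = (top_3["Country"] == "United States").sum()
print(us_count)
```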

## 5. Industry with the highest number of Billionaires

In [43]:
# GROUP the data by category then get the size for each Industry.
categories = df.groupby("Industries").size().reset_index(name="count").sort_values(by="count", ascending=False)
categories.reset_index(drop=True, inplace=True)

In [44]:
top_10_categories = categories.head(10)
top_10_categories_table = ff.create_table(top_10_categories)
top_10_categories_table

In [45]:
trace = go.Bar(
    x = top_10_categories["Industries"],
    y = top_10_categories["count"],
    marker = dict(
        color = top_10_categories["count"],
        colorscale = "sunset",
    )
)
layout = dict(
    title = dict(
        text = "Top 10 Sectors with the highest number of Billionaires", 
        font = dict(color="#9C9C9C")
    ),
    xaxis = dict(
        color = "#9C9C9C",
        title = "Industry",
        tickfont = dict(size=10)
    ),
    yaxis = dict(
        color = "#9C9C9C",
        title = "Number of Billionaires"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [trace], layout = layout)
fig.show()

**FINANCE Industry has the highest number of Billionaires.**
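
The group-size-sort pattern used for the industry counts can also be expressed with `value_counts`, which already returns counts in descending order. A minimal sketch on made-up industry labels:

```python
import pandas as pd

# Hypothetical industry labels, not taken from the dataset.
industries = pd.Series(
    ["Finance", "Technology", "Finance", "Retail", "Finance", "Technology"],
    name="Industries",
)

# value_counts sorts by frequency; rename_axis/reset_index turn it into a two-column frame.
counts = industries.value_counts().rename_axis("Industries").reset_index(name="count")
print(counts)
```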

## 6. Male vs Female Billionaires

In [46]:
# GROUP by gender then get the number of billionaires under each gender.
by_gender = df.groupby("Gender").size().reset_index(name="count")
by_gender["Gender"] = by_gender["Gender"].map({"F": "Female", "M": "Male"})
by_gender

Unnamed: 0,Gender,count
0,Female,311
1,Male,2341


In [47]:
by_gender_table = ff.create_table(by_gender)
by_gender_table

In [48]:
gender = go.Pie(
    labels = by_gender["Gender"],
    values = by_gender["count"],
    marker = dict(colors = ["#E05F19", "#007F8E"]),
    hole=0.5
)
layout = dict(
    title = dict(
        text="Male vs Female Billionaires", 
        font = dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [gender], layout = layout)
fig.show()

**There are significantly more MALE (88.3%) than FEMALE (11.7%) billionaires**
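
The percentages quoted above follow directly from the counts in the `by_gender` table. Note that they are shares of the 2652 rows with a known gender, not of all 2668 rows. A short check, with `share_pct` as an illustrative column name:

```python
import pandas as pd

# Counts copied from the by_gender table.
by_gender = pd.DataFrame({"Gender": ["Female", "Male"], "count": [311, 2341]})

# Share of each gender among rows where Gender is known.
by_gender["share_pct"] = (by_gender["count"] / by_gender["count"].sum() * 100).round(1)
print(by_gender)
```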

## 7. Which Industries do most FEMALE billionaires belong to?

In [49]:
# Select rows with gender female
female = df[df["Gender"] == "F"]
female.head(3)

Unnamed: 0,Rank,Name,Age,Country,Month,Year,Networth,Source,Industries,CountryOfCitizenship,Selfmade,Title,City,Gender
13,14,Francoise Bettencourt Meyers & family,68.0,France,4,2022,74.8,L'Oréal,Fashion & Retail,France,False,,Paris,F
17,18,Alice Walton,72.0,United States,4,2022,65.3,Walmart,Fashion & Retail,United States,False,Philanthropist,Fort Worth,F
21,21,Julia Koch & family,59.0,United States,4,2022,60.0,Koch Industries,Diversified,United States,False,,New York,F


In [50]:
# Group the Data for female billionaires by category
category_female = female.groupby("Industries").size().reset_index(name="count").sort_values(by="count", ascending=False)
category_female.reset_index(drop=True, inplace=True)

In [51]:
category_female_table = ff.create_table(category_female)
category_female_table

**Most Female Billionaires are in Manufacturing, Food & Beverages**

In [52]:
trace = go.Bar(
    x = category_female["Industries"],
    y = category_female["count"],
    marker = dict(
        color = category_female["count"],
        colorscale = "teal",
    )
)
layout = dict(
    title = dict(
        text="Distribution of Female Billionaires by Sector", 
        font = dict(color="#9C9C9C")
    ),
    xaxis = dict(
        color = "#9C9C9C",
        title = "Sector",
        tickfont = dict(size=10)
    ),
    yaxis = dict(
        color = "#9C9C9C",
        title = "Number of Female Billionaires"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [trace], layout = layout)
fig.show()

## 8. Age Distribution of Female Billionaires

In [53]:
trace = go.Histogram(
    x = female["Age"],
    marker = dict(color="#007F8E"),
    xbins = dict(start=10, end=100, size=10 )
)
layout = dict(
    title = dict(
        text = "Female Billionaire Age Distribution", 
        font = dict(color = "#9C9C9C")
    ),
    xaxis = dict(
        title="Age",
        color = "#9C9C9C"
    ),
    yaxis = dict(
        title="Count",
        color = "#9C9C9C"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)
fig = go.Figure(data = [trace], layout = layout)
fig.show()

**The majority of female billionaires are over 50 years old**

## 8. Distribution of Billionaires by Country

In [54]:
# GROUP the data by country then count the number of billionaires for each.
countries = df.groupby("Country").size().reset_index(name="Number of Billionaires").sort_values(by="Number of Billionaires", ascending=False)
countries.reset_index(drop=True, inplace=True)

In [55]:
countries.head(10)

Unnamed: 0,Country,Number of Billionaires
0,United States,748
1,China,571
2,India,158
3,Germany,112
4,United Kingdom,87
5,Switzerland,74
6,Hong Kong,68
7,Russia,64
8,Brazil,54
9,Italy,49


**USA and CHINA alone had almost half of all billionaires in 2022**
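
The "almost half" figure can be checked with the counts from the table above. The result depends slightly on the denominator: 2655 rows have a known country, while the full list has 2668 rows.

```python
us, china = 748, 571

known_country_rows = 2655  # non-null Country count reported by df.info()
all_rows = 2668            # total rows in the dataset

share_known = (us + china) / known_country_rows * 100
share_all = (us + china) / all_rows * 100

print(f"{share_known:.1f}%")  # 49.7%
print(f"{share_all:.1f}%")    # 49.4%
```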

In [56]:
def aggregate_countries(row):
    # Countries with fewer billionaires than Germany (112) are lumped into "Other".
    if row["Number of Billionaires"] < 112:
        return "Other"
    else:
        return row["Country"]

countries_agg = countries.copy()
countries_agg["Country"] = countries_agg.apply(aggregate_countries, axis=1)
countries_agg = countries_agg.groupby("Country")["Number of Billionaires"].sum().reset_index()
# groupby sorts alphabetically (China, Germany, India, Other, United States);
# reorder so the named countries come first in descending order and "Other" comes last.
countries_agg = countries_agg.iloc[[4, 0, 2, 1, 3], :].reset_index(drop=True)
countries_agg["Percentage"] = round((countries_agg["Number of Billionaires"] /  countries_agg["Number of Billionaires"].sum()) * 100, 1)
countries_agg["Percentage"] = countries_agg["Percentage"].apply(lambda x: f"{x}%")
countries_agg

Unnamed: 0,Country,Number of Billionaires,Percentage
0,United States,748,28.2%
1,China,571,21.5%
2,India,158,6.0%
3,Germany,112,4.2%
4,Other,1066,40.2%


In [57]:
country_table = ff.create_table(countries_agg)
country_table
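
The aggregation into "Other" relies on a hard-coded cutoff (112) and a hand-written row order. One possible alternative, sketched below, keeps the top N countries by count and collapses the rest. The `top_n` parameter is hypothetical, and the input holds only the ten rows shown earlier, so the "Other" total here is smaller than in the full data.

```python
import pandas as pd

# Top-10 rows copied from the countries table; the real frame has many more countries.
countries = pd.DataFrame({
    "Country": ["United States", "China", "India", "Germany", "United Kingdom",
                "Switzerland", "Hong Kong", "Russia", "Brazil", "Italy"],
    "Number of Billionaires": [748, 571, 158, 112, 87, 74, 68, 64, 54, 49],
})

top_n = 4  # hypothetical parameter replacing the hard-coded cutoff
keep = countries.nlargest(top_n, "Number of Billionaires")["Country"]

# Keep the country name for the top N, label everything else "Other".
label = countries["Country"].where(countries["Country"].isin(keep), "Other")

# sort=False preserves first-appearance order, so "Other" ends up last.
agg = (
    countries.groupby(label, sort=False)["Number of Billionaires"]
    .sum()
    .reset_index()
)
print(agg)
```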

In [58]:
country = go.Pie(
    labels = countries_agg["Country"],
    values = countries_agg["Number of Billionaires"],
    marker = dict(colors = ["#007F8E", "#FF834C", "#458F54", "#E6AD20", "#6514FF"]),
    hole = 0.4,
    direction = "clockwise",
    rotation = 0
)
layout = dict(
    title = dict(
        text="Top countries by number of billionaires".title(), 
        font = dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [country], layout = layout)
fig.show()

**Four countries (USA, CHINA, INDIA and GERMANY) had about 60% of all billionaires in the world in 2022. WOW!**

## 9. Distribution of FEMALE billionaires by Country

In [59]:
# GROUP the female data we filtered earlier by country, then count the number of female billionaires for each.
female_by_country = female.groupby("Country").size().reset_index(name="Female Billionaires").sort_values(by="Female Billionaires", ascending=False)
female_by_country.reset_index(drop=True, inplace=True)
female_by_country["Percentage"] = round((female_by_country["Female Billionaires"]/female_by_country["Female Billionaires"].sum()) *100, 1)
female_by_country["Percentage"] = female_by_country["Percentage"].apply(lambda x: f"{x}%")

In [60]:
top_10_female_by_country = ff.create_table(female_by_country.head(10))
top_10_female_by_country

**USA and CHINA still had the highest number of FEMALE billionaires in 2022**

In [61]:
# Merge the countries and female_by_country dataframes.
# We are using a left join because there are some countries with no FEMALE billionaires.
combined = countries.merge(female_by_country, on="Country", how="left")
combined.isnull().sum()

Country                    0
Number of Billionaires     0
Female Billionaires       37
Percentage                37
dtype: int64

In [62]:
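# The missing values in the combined dataframe represent the countries with zero female billionaires,
# so let's replace them with zero (and with a matching "0.0%" string in the Percentage column).
combined.fillna(value={"Female Billionaires": 0, "Percentage": "0.0%"}, inplace=True)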
combined.isnull().sum()

Country                   0
Number of Billionaires    0
Female Billionaires       0
Percentage                0
dtype: int64
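
A left join keeps every row of the left frame and leaves NaN where the right frame has no match. The sketch below shows this on made-up data ("Atlantis" and the female counts are invented) and uses the `indicator` option of `merge` to flag the unmatched rows.

```python
import pandas as pd

countries = pd.DataFrame({
    "Country": ["United States", "China", "Atlantis"],  # "Atlantis" is a made-up country
    "Number of Billionaires": [748, 571, 3],
})
female_by_country = pd.DataFrame({
    "Country": ["United States", "China"],
    "Female Billionaires": [90, 60],  # made-up counts
})

# indicator=True adds a "_merge" column showing where each row came from.
combined = countries.merge(female_by_country, on="Country", how="left", indicator=True)

# Unmatched countries have no female billionaires, so fill with 0 and restore the integer type.
combined["Female Billionaires"] = combined["Female Billionaires"].fillna(0).astype(int)
print(combined)
```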

In [63]:
# Use a choropleth map to show billionaire distribution across the world
fig = px.choropleth(
    combined,
    locations='Country',
    locationmode="country names",
    color='Number of Billionaires',
    hover_name = 'Country',
    projection='natural earth1',
    hover_data=["Number of Billionaires", "Female Billionaires"],
    color_continuous_scale="viridis",
    title='Distribution of Billionaires by Country',
)
fig.update_layout(
    title = dict(
        font=dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
)

fig.show()

## 10. Youngest Billionaires in 2022

In [66]:
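ages = df.sort_values(by="Age")
ages = ages[["Name", "Age", "Networth", "Country", "Selfmade"]].reset_index(drop=True)
# Copy the slice so the fill below does not trigger a SettingWithCopyWarning.
youngest_10 = ages.iloc[:10].copy()
# The missing Country in this slice is set to "Germany"; restrict the fill to that column.
youngest_10["Country"] = youngest_10["Country"].fillna("Germany")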

In [None]:
youngest_10_table = ff.create_table(youngest_10)
youngest_10_table

**Some of the youngest billionaires have inherited wealth**
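
The `Selfmade` flag makes it possible to count how many of the youngest billionaires inherited their wealth instead of reading it off the table. A minimal sketch on made-up rows (the names, ages and flags are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical people; one has an unknown age.
people = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4", "P5"],
    "Age": [19, np.nan, 25, 30, 64],
    "Selfmade": [False, True, True, False, True],
})

# Drop unknown ages, then take the three youngest.
youngest_3 = people.dropna(subset=["Age"]).sort_values(by="Age").head(3)

# Selfmade == False is read here as inherited wealth.
inherited = (~youngest_3["Selfmade"]).sum()
print(youngest_3)
print(inherited)
```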

### Conclusion
In conclusion, this EDA project on the World's Richest People 2022 as listed by Forbes analyzes how billionaire wealth was distributed in 2022. We have seen the countries with the highest number of billionaires and the economic sectors they come from. We've also seen that female representation in the billionaires club is low, at about 12%.
Overall, the project offers insight into the state of global wealth among the world's billionaires.