# World's Richest People 2022 as Listed by Forbes
The objective of this project is to analyze and visualize the data on the world's richest people (in US Dollars) in 2022 as listed by Forbes (The Billionaires List). The Billionaires List is an annual ranking of the world's wealthiest individuals, based on their net worth, as determined by Forbes.
The project will also explore the source of wealth for the world's richest people in 2022, identifying the industries and sectors that are driving wealth creation at the highest levels. This analysis will provide insight into the global economy and the trends that are shaping it.
The dataset for this project was extracted from [Forbes](https://www.forbes.com/billionaires/page-data/index/page-data.json).

In [199]:
# import libraries
import pandas as pd
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import json
import requests
import numpy as np


pd.options.display.max_rows = 50
pd.options.display.max_columns = 22

## 1. Extract Data from Forbes

In [3]:
url = "https://www.forbes.com/billionaires/page-data/index/page-data.json"
content = requests.get(url)
data = content.json()

In [13]:
raw_data = data["result"]["pageContext"]["tableData"]
raw_data[0]

{'name': 'Billionaires',
 'year': 2022,
 'month': 4,
 'uri': 'elon-musk',
 'rank': 1,
 'listUri': 'billionaires',
 'finalWorth': 219000,
 'category': 'Automotive',
 'otherCompensation': 0,
 'person': {'name': 'Elon Musk',
  'uri': 'elon-musk',
  'imageExists': True,
  'squareImage': 'https://specials-images.forbesimg.com/imageserve/62d700cd6094d2c180f269b9/416x416.jpg?background=000000&cropX1=0&cropX2=959&cropY1=0&cropY2=959'},
 'visible': True,
 'personName': 'Elon Musk',
 'age': 50,
 'country': 'United States',
 'state': 'Texas',
 'city': 'Austin',
 'source': 'Tesla, SpaceX',
 'industries': 'Automotive',
 'countryOfCitizenship': 'United States',
 'organization': 'Tesla',
 'timestamp': 1664284092360,
 'version': 1,
 'naturalId': 'faris/5/2022/14117',
 'position': 1,
 'imageExists': True,
 'selfMade': True,
 'status': 'U',
 'gender': 'M',
 'birthDate': 46915200000,
 'lastName': 'Musk',
 'firstName': 'Elon',
 'listDescription': "The World's Billionaires",
 'title': 'CEO',
 'employment':
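
The record above stores `birthDate` and `timestamp` as large integers and `finalWorth` as a plain number. As a minimal sketch, assuming the dates are milliseconds since the Unix epoch and that `finalWorth` is expressed in millions of US dollars (the notebook does not state either unit, and the helper names below are hypothetical), the values can be decoded like this:

```python
from datetime import datetime, timezone


def ms_to_date(ms):
    """Convert milliseconds since the Unix epoch to a UTC date (assumed unit)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()


def millions_to_billions(value):
    """Convert a figure in millions of USD to billions (assumed unit)."""
    return value / 1000


# Values copied from the first record shown above
birth_date = ms_to_date(46915200000)
net_worth_b = millions_to_billions(219000)
print(birth_date, net_worth_b)
```

Under these assumptions the decoded net worth lines up with the `$219 B` figure that appears for the same person later in the notebook.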

In [250]:
def get_data(raw_data):
    """
    Build a pandas DataFrame from the raw Forbes records.
    Returns the DataFrame.
    """
    # Output column name -> key in each raw record
    columns = {
        "Rank": "rank",
        "Name": "personName",
        "Age": "age",
        "Gender": "gender",
        "Title": "title",
        "Month": "month",
        "Year": "year",
        "Networth": "netWorth",
        "Source": "source",
        "Industries": "industries",
        "Country": "country",
        "Country_of_citizenship": "countryOfCitizenship",
        "City": "city",
        "Self_made": "selfMade",
    }
    # These keys are missing from some records; store None instead of raising KeyError
    optional = {"gender", "title", "country", "city"}
    rows = [
        {
            column: person.get(key) if key in optional else person[key]
            for column, key in columns.items()
        }
        for person in raw_data
    ]
    return pd.DataFrame(rows, columns=list(columns))


dataframe = get_data(raw_data)
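
As an alternative to collecting each field by hand, `pd.json_normalize` can flatten a list of records in one call. The sketch below uses two made-up records that only mimic the shape of the Forbes data (the names and values are placeholders, not real entries); it shows that a key missing from a record becomes `NaN` and that nested dictionaries become dotted column names.

```python
import pandas as pd

# Two made-up records that mimic the shape of the Forbes data;
# the second one has no "city" key.
sample_records = [
    {"personName": "Person A", "rank": 1, "city": "Austin", "person": {"name": "Person A"}},
    {"personName": "Person B", "rank": 2, "person": {"name": "Person B"}},
]

# json_normalize flattens nested dicts into dotted column names and
# fills keys that are absent from a record with NaN.
flat = pd.json_normalize(sample_records)
print(flat)
```

The trade-off is that every key ends up as a column, so the columns of interest still have to be selected and renamed afterwards.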

In [251]:
# Save the DataFrame to a CSV file (without the index column)
dataframe.to_csv("forbes_billionaires_2022.csv", index=False)

In [258]:
# Load the data
df = pd.read_csv("forbes_billionaires_2022.csv")
df.head(3)

Unnamed: 0,Rank,Name,Age,Gender,Title,Month,Year,Networth,Source,Industries,Country,Country_of_citizenship,City,Self_made
0,1,Elon Musk,50,M,CEO,4,2022,$219 B,"Tesla, SpaceX",Automotive,United States,United States,Austin,True
1,2,Jeff Bezos,58,M,Chairman and Founder,4,2022,$171 B,Amazon,Technology,United States,United States,Seattle,True
2,3,Bernard Arnault & family,73,M,Chairman and CEO,4,2022,$158 B,LVMH,Fashion & Retail,France,France,Paris,False


In [259]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2668 entries, 0 to 2667
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Rank                    2668 non-null   int64 
 1   Name                    2668 non-null   object
 2   Age                     2668 non-null   int64 
 3   Gender                  2652 non-null   object
 4   Title                   390 non-null    object
 5   Month                   2668 non-null   int64 
 6   Year                    2668 non-null   int64 
 7   Networth                2668 non-null   object
 8   Source                  2668 non-null   object
 9   Industries              2668 non-null   object
 10  Country                 2655 non-null   object
 11  Country_of_citizenship  2668 non-null   object
 12  City                    2624 non-null   object
 13  Self_made               2668 non-null   bool  
dtypes: bool(1), int64(4), object(9)
memory usage: 273.7+ KB


In [260]:
df.shape

(2668, 14)

## 2. Data Cleaning
Let's clean up columns whose data doesn't make sense.

In [261]:
# There are some records with zero in the Age column.
df["Age"].value_counts()

0      85
59     82
57     82
58     81
54     78
       ..
27      2
26      2
29      2
19      1
100     1
Name: Age, Length: 77, dtype: int64

In [262]:
# Let's replace zeros in the Age column with NaN
df["Age"] = df["Age"].apply(lambda x: np.nan if x == 0 else x)
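
The same substitution can be written without a lambda. A minimal sketch on a made-up `Age` series (the values are placeholders), using `Series.replace`:

```python
import numpy as np
import pandas as pd

# Made-up sample: zeros stand in for unknown ages
ages = pd.Series([50, 0, 73, 0, 58], name="Age")

# Replace every 0 with NaN; the column becomes float because NaN is a float
cleaned = ages.replace(0, np.nan)
print(cleaned.isna().sum())
```

This is also why `Age` shows up as `float64` in later `df.info()` output.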

In [263]:
# Clean up the Networth column and convert it to a numeric data type
df["Networth"] = df["Networth"].apply(lambda x: float(x.split(" ")[0].replace("$", "").strip()))
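
A vectorised variant of the same parsing step is sketched below. It assumes every value has the form `$<number> B`, as in the rows shown earlier; the regular expression is an illustration, not part of the original notebook.

```python
import pandas as pd

# Sample values in the same format as the Networth column
networth = pd.Series(["$219 B", "$171 B", "$91.4 B"])

# Capture the numeric part between "$" and "B", then convert to float.
# Values that do not match the pattern would become NaN instead of raising.
parsed = networth.str.extract(r"\$([\d.]+)\s*B", expand=False).astype(float)
print(parsed.tolist())
```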

In [226]:
df.isnull().sum()

Rank                         0
Name                         0
Age                         85
Gender                      16
Title                     2278
Month                        0
Year                         0
Networth                     0
Source                       0
Industries                   0
Country                     13
Country_of_citizenship       0
City                        44
Self_made                    0
dtype: int64

In [221]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2668 entries, 0 to 2667
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Rank                    2668 non-null   int64  
 1   Name                    2668 non-null   object 
 2   Age                     2583 non-null   float64
 3   Gender                  2652 non-null   object 
 4   Title                   390 non-null    object 
 5   Month                   2668 non-null   int64  
 6   Year                    2668 non-null   int64  
 7   Networth                2668 non-null   float64
 8   Source                  2668 non-null   object 
 9   Industries              2668 non-null   object 
 10  Country                 2655 non-null   object 
 11  Country_of_citizenship  2668 non-null   object 
 12  City                    2624 non-null   object 
 13  Self_made               2668 non-null   bool   
dtypes: bool(1), float64(2), int64(3), object(8)

In [248]:
# Check for duplicates
df.duplicated().sum()

0

## 3. Age Distribution

In [249]:
trace = go.Histogram(
    x = df["Age"],
    marker = dict(color="#007F8E"),
    xbins = dict(start=10, end=100, size=10 )
)
layout = dict(
    title = dict(
        text = "Billionaire Age Distribution", 
        font = dict(color = "#9C9C9C")
    ),
    xaxis = dict(
        title="Age",
        color = "#9C9C9C"
    ),
    yaxis = dict(
        title="Count",
        color = "#9C9C9C"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)
fig = go.Figure(data = [trace], layout = layout)
fig.show()

* **MOST billionaires are aged between 50 and 69 years**
* **The number of billionaires increases with each age group**
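
Observations like these can be checked numerically by counting records per age bracket. The sketch below uses `pd.cut` on a made-up list of ages (not the Forbes data) with 10-year brackets similar to the histogram bins:

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([19, 34, 45, 52, 58, 61, 67, 73, 88, 100])

# Left-closed 10-year brackets: 10-19, 20-29, ..., 100-109
edges = list(range(10, 111, 10))
labels = [f"{low}-{low + 9}" for low in edges[:-1]]
brackets = pd.cut(ages, bins=edges, labels=labels, right=False)

# sort=False keeps the brackets in age order instead of sorting by count
counts_by_bracket = brackets.value_counts(sort=False)
print(counts_by_bracket)
```

Running the same steps on `df["Age"].dropna()` would give the exact counts behind the histogram.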

## 4. Top 10 Richest people in 2022

In [77]:
top_10 = df.loc[:9, ["Rank", "Name", "Networth", "Country"]]
top_10

Unnamed: 0,Rank,Name,Networth,Country
0,1,Elon Musk,219.0,United States
1,2,Jeff Bezos,171.0,United States
2,3,Bernard Arnault & family,158.0,France
3,4,Bill Gates,129.0,United States
4,5,Warren Buffett,118.0,United States
5,6,Larry Page,111.0,United States
6,7,Sergey Brin,107.0,United States
7,8,Larry Ellison,106.0,United States
8,9,Steve Ballmer,91.4,United States
9,10,Mukesh Ambani,90.7,India


In [78]:
top_10_table = ff.create_table(top_10)
top_10_table

**8 of the top 10 richest people in the world came from the United States.**
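
The count can be reproduced from the `Country` column of the table above. A small sketch using only the ten values shown:

```python
from collections import Counter

# Country column of the top-10 table, in rank order
top_10_countries = [
    "United States", "United States", "France", "United States",
    "United States", "United States", "United States", "United States",
    "United States", "India",
]

country_counts = Counter(top_10_countries)
print(country_counts.most_common())
```

In the notebook itself, `top_10["Country"].value_counts()` gives the same tally.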

## 5. Industry with the highest number of Billionaires

In [79]:
# GROUP the data by category then get the size for each Industry.
categories = df.groupby("Industries").size().reset_index(name="count").sort_values(by="count", ascending=False)
categories.reset_index(drop=True, inplace=True)

In [80]:
top_10_categories = categories.head(10)
top_10_categories_table = ff.create_table(top_10_categories)
top_10_categories_table

In [81]:
trace = go.Bar(
    x = top_10_categories["Industries"],
    y = top_10_categories["count"],
    marker = dict(
        color = top_10_categories["count"],
        colorscale = "sunset",
    )
)
layout = dict(
    title = dict(
        text = "Top 10 Sectors with the highest number of Billionaires", 
        font = dict(color="#9C9C9C")
    ),
    xaxis = dict(
        color = "#9C9C9C",
        title = "Industry",
        tickfont = dict(size=10)
    ),
    yaxis = dict(
        color = "#9C9C9C",
        title = "Number of Billionaires"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [trace], layout = layout)
fig.show()

**The FINANCE industry has the highest number of billionaires.**

## 6. Male vs Female Billionaires

In [82]:
# GROUP by gender then get the number of billionaires under each gender.
by_gender = df.groupby("Gender").size().reset_index(name="count")
by_gender["Gender"] = by_gender["Gender"].map({"F": "Female", "M": "Male"})
by_gender

Unnamed: 0,Gender,count
0,Female,311
1,Male,2341


In [83]:
by_gender_table = ff.create_table(by_gender)
by_gender_table

In [84]:
gender = go.Pie(
    labels = by_gender["Gender"],
    values = by_gender["count"],
    marker = dict(colors = ["#E05F19", "#007F8E"]),
    hole=0.5
)
layout = dict(
    title = dict(
        text="Male vs Female Billionaires", 
        font = dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [gender], layout = layout)
fig.show()

**There are significantly more MALE (88.3%) than FEMALE (11.7%) billionaires**
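
These percentages follow directly from the counts in the gender table. Note that they are shares of the 2,652 records with a known gender; the 16 records with a missing `Gender` value are left out by the `groupby`.

```python
# Counts taken from the by_gender table above
gender_counts = {"Female": 311, "Male": 2341}

total_known = sum(gender_counts.values())
shares = {
    gender: round(count / total_known * 100, 1)
    for gender, count in gender_counts.items()
}
print(total_known, shares)
```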

## 7. Which Industries do most FEMALE billionaires belong to?

In [86]:
# Select rows with gender female
female = df[df["Gender"] == "F"]
female.head(3)

Unnamed: 0,Rank,Name,Age,Gender,Title,Month,Year,Networth,Source,Industries,Country,Country_of_citizenship,City,Self_made
13,14,Francoise Bettencourt Meyers & family,68,F,,4,2022,74.8,L'Oréal,Fashion & Retail,France,France,Paris,False
17,18,Alice Walton,72,F,Philanthropist,4,2022,65.3,Walmart,Fashion & Retail,United States,United States,Fort Worth,False
21,21,Julia Koch & family,59,F,,4,2022,60.0,Koch Industries,Diversified,United States,United States,New York,False


In [90]:
# Group the Data for female billionaires by category
category_female = female.groupby("Industries").size().reset_index(name="count").sort_values(by="count", ascending=False)
category_female.reset_index(drop=True, inplace=True)

In [91]:
category_female_table = ff.create_table(category_female)
category_female_table

**Most Female Billionaires are in Manufacturing, Food & Beverages**

In [92]:
trace = go.Bar(
    x = category_female["Industries"],
    y = category_female["count"],
    marker = dict(
        color = category_female["count"],
        colorscale = "teal",
    )
)
layout = dict(
    title = dict(
        text="Distribution of Female Billionaires by Sector", 
        font = dict(color="#9C9C9C")
    ),
    xaxis = dict(
        color = "#9C9C9C",
        title = "Sector",
        tickfont = dict(size=10)
    ),
    yaxis = dict(
        color = "#9C9C9C",
        title = "Number of Female Billionaires"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [trace], layout = layout)
fig.show()

## 8. Age Distribution of Female Billionaires

In [270]:
trace = go.Histogram(
    x = female["Age"],
    marker = dict(color="#007F8E"),
    xbins = dict(start=10, end=100, size=10 )
)
layout = dict(
    title = dict(
        text = "Female Billionaire Age Distribution", 
        font = dict(color = "#9C9C9C")
    ),
    xaxis = dict(
        title="Age",
        color = "#9C9C9C"
    ),
    yaxis = dict(
        title="Count",
        color = "#9C9C9C"
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)
fig = go.Figure(data = [trace], layout = layout)
fig.show()

**The majority of female billionaires are over 50 years old**

## 9. Distribution of Billionaires by Country

In [184]:
# GROUP the data by country then count the number of billionaires for each.
countries = df.groupby("Country").size().reset_index(name="Number of Billionaires").sort_values(by="Number of Billionaires", ascending=False)
countries.reset_index(drop=True, inplace=True)

In [185]:
countries.head(10)

Unnamed: 0,Country,Number of Billionaires
0,United States,748
1,China,571
2,India,158
3,Germany,112
4,United Kingdom,87
5,Switzerland,74
6,Hong Kong,68
7,Russia,64
8,Brazil,54
9,Italy,49


**USA and CHINA alone had almost half of all billionaires in 2022**

In [186]:
def aggregate_countries(row):
    if row["Number of Billionaires"] < 112:
        return "Other"
    else:
        return row["Country"]

countries_agg = countries.copy()  
countries_agg["Country"] = countries_agg.apply(aggregate_countries, axis=1)
countries_agg = countries_agg.groupby("Country")["Number of Billionaires"].sum().reset_index()
countries_agg = countries_agg.iloc[[4, 0, 2, 1, 3], :].reset_index(drop=True)
countries_agg["Percentage"] = round((countries_agg["Number of Billionaires"] /  countries_agg["Number of Billionaires"].sum()) * 100, 1)
countries_agg["Percentage"] = countries_agg["Percentage"].apply(lambda x: f"{x}%")
countries_agg

Unnamed: 0,Country,Number of Billionaires,Percentage
0,United States,748,28.2%
1,China,571,21.5%
2,India,158,6.0%
3,Germany,112,4.2%
4,Other,1066,40.2%
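
The threshold of 112 and the positional reordering with `iloc` are tied to this particular dataset. A more general sketch is shown below, with a hypothetical helper `lump_small_groups` that keeps the `top_n` largest rows and sums the rest into "Other". The input here is only the top-10 table from earlier in this section, so the "Other" total and percentages differ from the full-data figures above.

```python
import pandas as pd


def lump_small_groups(counts, label_col, value_col, top_n):
    """Keep the top_n largest rows and sum the remainder into one 'Other' row."""
    ranked = counts.sort_values(value_col, ascending=False).reset_index(drop=True)
    top = ranked.iloc[:top_n]
    other = pd.DataFrame(
        {label_col: ["Other"], value_col: [ranked.iloc[top_n:][value_col].sum()]}
    )
    result = pd.concat([top, other], ignore_index=True)
    result["Percentage"] = (result[value_col] / result[value_col].sum() * 100).round(1)
    return result


# Ten largest countries from the table earlier in this section
top_countries = pd.DataFrame(
    {
        "Country": [
            "United States", "China", "India", "Germany", "United Kingdom",
            "Switzerland", "Hong Kong", "Russia", "Brazil", "Italy",
        ],
        "Number of Billionaires": [748, 571, 158, 112, 87, 74, 68, 64, 54, 49],
    }
)
lumped = lump_small_groups(top_countries, "Country", "Number of Billionaires", top_n=4)
print(lumped)
```

Because the rows are ranked before lumping, "Other" always ends up last and no manual reordering is needed.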


In [187]:
country_table = ff.create_table(countries_agg)
country_table

In [188]:
gender = go.Pie(
    labels = countries_agg["Country"],
    values = countries_agg["Number of Billionaires"],
    marker = dict(colors = ["#007F8E", "#FF834C", "#458F54", "#E6AD20", "#6514FF"]),
    hole = 0.4,
    direction = "clockwise",
    text = countries_agg["Country"],
    rotation = 0
)
layout = dict(
    title = dict(
        text="Top countries by number of billionaires".title(), 
        font = dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
    plot_bgcolor = "#F5F5F5"
)

fig = go.Figure(data = [gender], layout = layout)
fig.show()

**Four countries (USA, CHINA, INDIA and GERMANY) had almost 60% of all billionaires in the world in 2022. WOW!**

## 10. Distribution of FEMALE billionaires by Country

In [189]:
# GROUP the female data we filtered earlier by country, then count the number of female billionaires for each.
female_by_country = female.groupby("Country").size().reset_index(name="Female Billionaires").sort_values(by="Female Billionaires", ascending=False)
female_by_country.reset_index(drop=True, inplace=True)
female_by_country["Percentage"] = round((female_by_country["Female Billionaires"]/female_by_country["Female Billionaires"].sum()) *100, 1)
female_by_country["Percentage"] = female_by_country["Percentage"].apply(lambda x: f"{x}%")

In [190]:
top_10_female_by_country = ff.create_table(female_by_country.head(10))
top_10_female_by_country

**USA and CHINA still had the highest number of FEMALE billionaires in 2022**

In [191]:
# Merge the countries and female_by_country dataframes.
# We are using a left join because there are some countries with no FEMALE billionaires.
combined = countries.merge(female_by_country, on="Country", how="left")
combined.isnull().sum()

Country                    0
Number of Billionaires     0
Female Billionaires       37
Percentage                37
dtype: int64

In [192]:
# The missing values in the combined dataframe represent the countries with zero female billionaires.
# So let's replace them with zero.
combined.fillna(value=0, inplace=True)
combined.isnull().sum()

Country                   0
Number of Billionaires    0
Female Billionaires       0
Percentage                0
dtype: int64
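
The left join and fill can be seen on a toy example. The sketch below uses made-up countries and counts; it fills only the count column and casts it back to integers, since the merge turns it into floats once `NaN` appears.

```python
import pandas as pd

# Made-up totals and female counts; country "B" has no female billionaires
countries = pd.DataFrame(
    {"Country": ["A", "B", "C"], "Number of Billionaires": [10, 5, 2]}
)
female_by_country = pd.DataFrame(
    {"Country": ["A", "C"], "Female Billionaires": [3, 1]}
)

# A left join keeps every row of `countries`; unmatched rows get NaN
combined = countries.merge(female_by_country, on="Country", how="left")

# Fill the gaps in the count column only, then restore the integer dtype
combined["Female Billionaires"] = combined["Female Billionaires"].fillna(0).astype(int)
print(combined)
```

Filling column by column avoids writing a numeric 0 into text columns such as the formatted `Percentage` strings.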

In [196]:
# Use a choropleth map to show billionaire distribution across the world
fig = px.choropleth(
    combined,
    locations='Country',
    locationmode="country names",
    color='Number of Billionaires',
    hover_name = 'Country',
    projection='natural earth1',
    hover_data=["Number of Billionaires", "Female Billionaires"],
    color_continuous_scale="viridis",
    title='Distribution of Billionaires by Country',
)
fig.update_layout(
    title = dict(
        font=dict(color="#9C9C9C")
    ),
    paper_bgcolor = "#F5F5F5",
)

fig.show()

## 11. Youngest Billionaires in 2022

In [224]:
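ages = df.sort_values(by="Age")
ages = ages[["Name", "Age", "Networth", "Country", "Self_made"]].reset_index(drop=True)
# Copy the slice so that the fill below does not trigger a SettingWithCopyWarning
youngest_10 = ages.iloc[:10].copy()
# Country is the only selected column with missing values; fill the gaps with "Germany"
youngest_10["Country"] = youngest_10["Country"].fillna("Germany")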
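
Selecting the youngest records can also be done in one step with `DataFrame.nsmallest`, which skips missing ages. A minimal sketch on made-up rows (names and values are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up sample; NaN marks an unknown age
people = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E", "F"],
        "Age": [50, np.nan, 19, 27, np.nan, 26],
        "Self_made": [True, False, False, True, True, False],
    }
)

# The three youngest people with a known age, youngest first
youngest = people.nsmallest(3, "Age").reset_index(drop=True)
print(youngest)
```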

In [225]:
youngest_10_table = ff.create_table(youngest_10)
youngest_10_table

**Some of the youngest billionaires have inherited wealth**

### Conclusion
In conclusion, this EDA project on the World's Richest People 2022 as listed by Forbes gives an overview of how billionaire wealth was distributed in 2022. We have seen the countries with the highest number of billionaires and the economic sectors they come from. We have also seen that female representation in the billionaires club is low, at about 12%.
Overall, the project offers valuable insights into the state of global wealth among the world's wealthiest individuals.