In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
import plotly.express as px
import plotly.graph_objects as go

In [None]:
df = pd.read_csv("../input/forbes-billionaires-of-2021-20/forbes_billionaires.csv")

df.head(10)

# DATA CLEANING

**Checking the data types of the variables in the dataset**

In [None]:
df.dtypes

The variables in the dataset have the correct data types.

**Checking for missing values**

In [None]:
df.count()

In [None]:
df.isnull().sum()

As shown above, the dataset has a considerable amount of missing data, especially in "Children", "Status" and "Education".

These gaps probably reflect how little public information is available for many of the billionaires in the dataset, particularly those who are not well known, stay out of the public eye, or have a lower net worth than the rest. Since all of the variables are important to the analysis, it would not make sense to remove rows with missing data, so we will continue with the missing values in place.
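
To put the gaps in perspective, it can help to look at the share of missing values per column rather than only the raw counts. The sketch below is a minimal illustration: the helper name `missing_summary` and the small `sample` frame are made up for the example and stand in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    summary = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": (frame.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Hypothetical stand-in for the billionaires data
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Children": [2.0, np.nan, np.nan, 1.0],
    "Education": ["Bachelor of Arts", None, "Drop Out", "Master of Science"],
})

print(missing_summary(sample))
```

In the notebook itself, `missing_summary(df)` would give the same kind of table for the real columns.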

**Changing column names and rearranging columns**

In [None]:
headers = ["Name", "Net Worth", "Country", "Source", "Rank", "Age", "Residence", "Citizenship", "Status", "Children", "Education",
          "Self Made"]

df.columns = headers

df2 = df.reindex(labels=["Name", "Rank", "Net Worth", "Country", "Age", "Source", "Education", "Self Made", "Residence", "Citizenship",
                  "Status", "Children"], axis=1)

df2.head()

**Removing unnecessary variables**

It would be good to see whether the values in "Country" and "Citizenship" differ much. If they rarely differ, we can remove "Citizenship" as a variable.

In [None]:
df2[df2["Country"]!=df2["Citizenship"]]

Since there are only 17 instances where the value in "Country" does not match the value in "Citizenship", we can remove "Citizenship" from our dataset.
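
One caveat with a plain `!=` comparison is that pandas treats a missing value as different from everything, including another missing value, so rows with a missing "Country" or "Citizenship" are counted as mismatches. The sketch below shows one way to count only rows where both values are present and differ. The helper name `true_mismatches` and the `sample` frame are hypothetical and stand in for `df2`.

```python
import numpy as np
import pandas as pd


def true_mismatches(frame: pd.DataFrame, left: str, right: str) -> pd.DataFrame:
    """Rows where both columns are present and their values differ."""
    both_present = frame[left].notna() & frame[right].notna()
    return frame[both_present & (frame[left] != frame[right])]


# Hypothetical stand-in for df2
sample = pd.DataFrame({
    "Country": ["United States", "France", "India", np.nan],
    "Citizenship": ["United States", "Switzerland", np.nan, np.nan],
})

naive = sample[sample["Country"] != sample["Citizenship"]]    # includes NaN rows
strict = true_mismatches(sample, "Country", "Citizenship")    # excludes NaN rows
print(len(naive), len(strict))
```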

**Removing "Citizenship" from the dataset**

In [None]:
df2.drop(labels=["Citizenship"], axis=1, inplace=True)

df2.head()

# DATA ANALYSIS & VISUALISATION

**Analysing the countries with the most billionaires**

In [None]:
top_countries = df2.groupby("Country", as_index=False)["Name"].count()

top_countries.sort_values(by="Name", axis=0, ascending=False, kind='quicksort', ignore_index=True, inplace=True)

top_countries2 = top_countries.head(25)

fig1 = px.bar(top_countries2, x="Country", y="Name", color="Country", text="Name")

fig1.update_layout(title_text="Top 25 Countries with most Billionaires", title_y=0.97, title_x=0.50, title_font_size=22,
                   height=600, width=1000, yaxis_title="Number of Billionaires")

fig2 = px.choropleth(top_countries, locations="Country", locationmode='country names', color="Name",
                    color_continuous_scale=px.colors.sequential.Redor)

fig2.update_layout(title_text="Geo-Heat Map with most Billionaires", title_y=0.95, title_x=0.50, title_font_size=22,
                   height=600, width=1000)

fig1.show()
fig2.show()

As illustrated in the graphs above, most billionaires are situated in North America, South America, Europe and South East Asia. On the other hand, Africa, Central Asia and the Middle East do not seem to have many billionaires.

With regard to individual countries, there is a huge gap: the US and China have 724 and 626 billionaires respectively, while the next closest country, India, has 140. This gap between the US and China and the rest of the world could be because these two countries have the highest GDPs in the world and have been the two most dominant economies over the last 10 to 15 years. Strong economies with consistent growth bring massive business and trade opportunities, creating a platform on which individuals and businesses can flourish, and in turn producing billionaires and businesses with huge market shares and net worth.

Now that we know which countries and regions most billionaires are situated in, we can break the analysis down further to find out which cities most billionaires reside in.

**Finding out what cities most billionaires reside in**

First, we will split the "Residence" column to keep only the city name.

In [None]:
cities_split = df2["Residence"].str.split(',', expand=True)

cities_split.drop(labels=[1,2], axis=1, errors='raise', inplace=True)

header=["City"]

cities_split.columns = header

df3 = pd.concat([df2,cities_split], axis=1, join='outer')

df3.head(5)
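
The split above relies on every "Residence" value expanding into exactly three parts, since it drops columns 1 and 2 by label. A more forgiving alternative, sketched below, takes whatever precedes the first comma regardless of how many parts there are. The helper name `extract_city` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd


def extract_city(residence: pd.Series) -> pd.Series:
    """Take the text before the first comma and trim surrounding whitespace."""
    return residence.str.split(",").str[0].str.strip()


# Hypothetical residence strings with three, two, one and zero parts
sample = pd.Series(
    ["Seattle, Washington, United States", "Paris, France", "Monaco", np.nan],
    name="Residence",
)

cities = extract_city(sample).rename("City")
print(cities.tolist())
```

Missing residences stay missing, so they are still excluded from a later `groupby("City")`.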

In [None]:
cities = df3.groupby("City", as_index=False)["Name"].count()

cities.sort_values("Name", ascending=False, inplace=True)

cities.reset_index(drop=True, inplace=True)

cities.rename(columns={"Name":"Billionaires"}, inplace=True)

top_cities = cities.head(15)

fig = px.bar(top_cities, x="Billionaires", y="City", color="City", text="Billionaires",
             color_discrete_sequence=px.colors.qualitative.Alphabet)

fig.update_layout(title_text="Top 15 Cities with most Billionaires", title_y=0.97, title_x=0.45, title_font_size=22,
                  height=600, width=1000, yaxis_title="City", xaxis_title="Number of Billionaires")

These results are not surprising given the previous analysis and graphs. Since China and the US have the highest number of billionaires, it can be expected that cities in these countries also have the highest number of billionaires.

Interestingly, out of the top 15 cities with the most billionaires, China has 6 cities whereas the US has only 2. This shows that billionaires are spread out more across the country in China as opposed to the US.

**Analysis of education level and frequency of billionaires**

In [None]:
edu = df2.copy()

edu.dropna(axis=0, how='any', subset=["Education"], inplace=True)

edu.reset_index(drop=True, inplace=True)

# Helper: True where the education string contains the keyword
def has(keyword):
    return edu["Education"].str.contains(keyword)

# True where no doctorate-level keyword appears
no_doctorate = ~has("Doctorate") & ~has("Doctor") & ~has("Ph.D")

drop_out = edu[has("Drop Out")]
diploma = edu[(has("Diploma") | has("High School")) & ~has("Bachelor") & ~has("Master") & no_doctorate & ~has("Drop Out")]
associate = edu[has("Associate") & ~has("Bachelor") & ~has("Master") & ~has("Doctorate")]
bachelors = edu[has("Bachelor") & ~has("Master") & no_doctorate & ~has("Drop Out")]
masters = edu[has("Master") & no_doctorate]
doctorate = edu[has("Doctorate") | has("Doctor") | has("Ph.D")]

education = [{"Level of Education":"Drop Out", "Number of Billionaires":drop_out["Name"].count()},
             {"Level of Education":"Diploma", "Number of Billionaires":diploma["Name"].count()},
             {"Level of Education":"Associate", "Number of Billionaires":associate["Name"].count()},
             {"Level of Education":"Bachelors", "Number of Billionaires":bachelors["Name"].count()},
             {"Level of Education":"Masters", "Number of Billionaires":masters["Name"].count()},
             {"Level of Education":"Ph.D", "Number of Billionaires":doctorate["Name"].count()}]
              
education_new = pd.DataFrame(data=education)

print("Original =", edu["Education"].count())
print("Categorized =", education_new["Number of Billionaires"].sum())

After categorizing education level into "Drop Out", "Diploma", "Associate", "Bachelors", "Masters" and "Ph.D", we find that 59 data points were left out. We therefore need to find out what these data points are and see whether they can be included in the categorization.

In [None]:
abc = pd.concat([drop_out,diploma,associate,bachelors,masters,doctorate], axis=0, ignore_index=True)

xyz = abc["Education"]

edu[~edu["Education"].isin(xyz)]

Since a lot of the un-categorized education level data points do not specify the type of degree, it would be hard to categorize them all. But there are some that can still be included in the categories. Therefore, the categorization process will be repeated with a few additions.

In [None]:
# Helper: True where the education string contains the keyword
def has(keyword):
    return edu["Education"].str.contains(keyword)

# True where no doctorate-level keyword appears
no_doctorate = ~has("Doctorate") & ~has("Doctor") & ~has("Ph.D")

drop_out = edu[has("Drop Out")]
diploma = edu[(has("Diploma") | has("High School")) & ~has("Bachelor") & ~has("Master") & no_doctorate & ~has("Drop Out")]
associate = edu[has("Associate") & ~has("Bachelor") & ~has("Master") & ~has("Doctorate")]
bachelors = edu[(has("Bachelor") | has("LLB")) & ~has("Master") & no_doctorate & ~has("Drop Out")]
masters = edu[(has("Master") | has("EMBA") | has("LLM")) & no_doctorate]
doctorate = edu[has("Doctorate") | has("Doctor") | has("Ph.D")]

education = [{"Level of Education":"Drop Out", "Number of Billionaires":drop_out["Name"].count()},
             {"Level of Education":"Diploma", "Number of Billionaires":diploma["Name"].count()},
             {"Level of Education":"Associate", "Number of Billionaires":associate["Name"].count()},
             {"Level of Education":"Bachelors", "Number of Billionaires":bachelors["Name"].count()},
             {"Level of Education":"Masters", "Number of Billionaires":masters["Name"].count()},
             {"Level of Education":"Ph.D", "Number of Billionaires":doctorate["Name"].count()}]
              
education_new = pd.DataFrame(data=education)

print("Original =", edu["Education"].count())
print("Categorized =", education_new["Number of Billionaires"].sum())

A further 32 data points were successfully categorized.
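
The keyword rules used in the two passes can also be written as a single ordered lookup in which each record lands in exactly one category. The sketch below is one possible arrangement, not the notebook's method: the precedence order (drop-outs first, then the highest degree mentioned), the `RULES` table, the `categorize` helper and the sample strings are all assumptions made for illustration. It also passes `regex=False` so that the "." in "Ph.D" is matched literally rather than as a regex wildcard.

```python
import pandas as pd

# Ordered rules: the first category whose keywords match wins.
# The precedence (Drop Out first, then highest degree) is an assumption of this sketch.
RULES = [
    ("Drop Out", ["Drop Out"]),
    ("Ph.D", ["Doctor", "Ph.D"]),
    ("Masters", ["Master", "EMBA", "LLM"]),
    ("Bachelors", ["Bachelor", "LLB"]),
    ("Associate", ["Associate"]),
    ("Diploma", ["Diploma", "High School"]),
]


def categorize(education: pd.Series) -> pd.Series:
    """Assign each education string to exactly one level, or 'Other'."""
    result = pd.Series("Other", index=education.index, dtype="object")
    unassigned = pd.Series(True, index=education.index)
    for label, keywords in RULES:
        match = pd.Series(False, index=education.index)
        for keyword in keywords:
            # regex=False treats "." literally; na=False maps missing values to False
            match |= education.str.contains(keyword, regex=False, na=False)
        result[match & unassigned] = label
        unassigned &= ~match
    return result


# Hypothetical education strings
sample = pd.Series([
    "Bachelor of Arts/Science, Example University",
    "Drop Out, Example University",
    "Master of Business Administration, Example School; Bachelor of Science, Example Institute",
    "Ph.D, Example University",
    "High School",
    "LLB, Example University",
    "Example University",
])

levels = categorize(sample)
print(levels.value_counts())
```

Because every record receives exactly one label, the category counts always sum to the number of records, and anything left as "Other" is the uncategorized remainder.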

In [None]:
fig = px.pie(education_new, values="Number of Billionaires", names="Level of Education", color="Level of Education",
             color_discrete_sequence=px.colors.qualitative.T10, hole=0.5)

fig.update_layout(title_text="Level of Education & Billionaire Frequency", title_font_size=22,
                  height=700, width=980, title_y=0.97, title_x=0.49)

fig.show()
education_new

The results above are very interesting: approximately 55% of billionaires have a bachelor's degree or less. This goes against the prevailing sentiment in many countries (especially South Asian countries) that one needs a Master's degree and/or a Ph.D to be financially successful. Furthermore, 8% of billionaires either dropped out or have only a high school diploma. Even more fascinating, 4 of the top 10 ranked billionaires were drop-outs.
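
Percentages like these can be read off the pie chart, but they can also be computed directly from the counts table. The sketch below shows the arithmetic on placeholder counts. The numbers are made up for illustration and are not the notebook's results; in the notebook the same steps would be applied to `education_new`.

```python
import pandas as pd

# Placeholder counts for illustration only, not the notebook's results
education_counts = pd.DataFrame({
    "Level of Education": ["Drop Out", "Diploma", "Associate", "Bachelors", "Masters", "Ph.D"],
    "Number of Billionaires": [10, 10, 5, 75, 80, 20],
})

total = education_counts["Number of Billionaires"].sum()

# Share of each level as a percentage of all categorized billionaires
education_counts["Share (%)"] = (
    education_counts["Number of Billionaires"] / total * 100
).round(1)

# Combined share of everyone with a bachelor's degree or less
at_most_bachelors = ["Drop Out", "Diploma", "Associate", "Bachelors"]
mask = education_counts["Level of Education"].isin(at_most_bachelors)
share_at_most_bachelors = (
    education_counts.loc[mask, "Number of Billionaires"].sum() / total * 100
)

print(education_counts)
print(round(share_at_most_bachelors, 1))
```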

**Analysis of the relationship between Net Worth and Relationship Status**

In [None]:
status = df2.groupby("Status", as_index=False)["Net Worth"].mean()

status["Net Worth"] = status["Net Worth"].round(decimals=2)

status.sort_values(by="Net Worth", axis=0, ascending=False, kind='quicksort', inplace=True)

fig = px.bar(status, x="Status", y="Net Worth", color="Status", text="Net Worth",
             color_discrete_sequence=px.colors.qualitative.Prism)

fig.update_layout(title_text="Average Net Worth ($Bn) per Relationship Status", title_y=0.97, title_x=0.45, title_font_size=22,
                   height=800, width=1100, yaxis_title="Net Worth", xaxis_title="Relationship Status")

The results above show that billionaires who are in a relationship have the highest average net worth. Moreover, the average net worth of billionaires who are with someone (in a relationship, engaged, married or remarried) is higher than that of those who are not.
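
The comparison between billionaires who are with someone and those who are not can be checked directly by collapsing the statuses into two groups. The sketch below is a rough illustration: the `compare_partnered` helper, the status spellings and the sample values are assumptions and may not match the labels used in the dataset. It reports the median and group size next to the mean, since a mean over a small group can be dominated by a single very large fortune.

```python
import pandas as pd


def compare_partnered(frame: pd.DataFrame, partnered_labels: set) -> pd.DataFrame:
    """Mean, median and count of net worth for partnered vs. other statuses."""
    known = frame.dropna(subset=["Status"])
    flagged = known.assign(Partnered=known["Status"].isin(partnered_labels))
    return flagged.groupby("Partnered")["Net Worth"].agg(["mean", "median", "count"])


# Hypothetical status labels and net worths (in $Bn)
sample = pd.DataFrame({
    "Status": ["Married", "Single", "Engaged", "Divorced", "Married"],
    "Net Worth": [4.0, 2.0, 6.0, 3.0, 5.0],
})

# Assumed spellings of the "with someone" statuses
partnered = {"Married", "Engaged", "In Relationship", "Remarried"}

summary = compare_partnered(sample, partnered)
print(summary)
```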

**Analysis of the relationship between Net Worth and Number of Children**

In [None]:
df2["Children"].unique()

The number of children per billionaire in the dataset ranges from 1 to 23. The range is this wide because the dataset includes billionaire families as well as individuals. Therefore, I will set the cut-off at 6 children and see whether a relationship emerges.

In [None]:
# Average net worth and number of billionaires for each child count
children = df2.groupby("Children", as_index=False).agg(
    **{"Net Worth": ("Net Worth", "mean"), "Number of Billionaires": ("Name", "count")}
)

# Keep 1 to 6 children only; .copy() makes children2 an independent frame,
# so the assignment below does not trigger SettingWithCopyWarning
children2 = children[children["Children"].between(1, 6)].copy()

children2["Net Worth"] = children2["Net Worth"].round(decimals=2)

fig = px.bar(children2, y="Net Worth", x="Children", text="Net Worth", color="Net Worth",
            color_continuous_scale=px.colors.qualitative.Prism)

fig.update_layout(title_text="Average Net Worth ($Bn) per Number of Children", title_y=0.97, title_x=0.5, title_font_size=22,
                   height=800, width=1100, xaxis_title="Number of Children", yaxis_title="Average Net Worth ($bn)")

fig.show()
children2

The graph above reveals an interesting result: average net worth rises with each additional pair of children. The average net worth of billionaires with 3 to 4 children is approximately 1.4 billion dollars higher than that of billionaires with 1 to 2 children. Furthermore, the average net worth of billionaires with 5 to 6 children is approximately 0.5 billion dollars higher than that of billionaires with 3 to 4 children.
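
The two-children bands described above can be computed explicitly with `pd.cut`. The sketch below is a minimal illustration on made-up values; the helper name `net_worth_by_child_band` and the `sample` frame are hypothetical. Note that it averages over all billionaires within a band, which can differ slightly from averaging the two per-count means.

```python
import pandas as pd


def net_worth_by_child_band(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean net worth and group size for 1-2, 3-4 and 5-6 children."""
    # Bins are right-inclusive: (0, 2], (2, 4], (4, 6]; larger counts fall outside
    bands = pd.cut(frame["Children"], bins=[0, 2, 4, 6], labels=["1-2", "3-4", "5-6"])
    return (
        frame.assign(Band=bands)
        .groupby("Band", observed=True)["Net Worth"]
        .agg(["mean", "count"])
    )


# Hypothetical child counts and net worths (in $Bn)
sample = pd.DataFrame({
    "Children": [1, 2, 3, 4, 5, 6, 10],
    "Net Worth": [1.0, 3.0, 4.0, 6.0, 5.0, 7.0, 100.0],
})

banded = net_worth_by_child_band(sample)
print(banded)
```

Rows with more than 6 children receive no band and are dropped by the `groupby`, which matches the cut-off chosen above.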

**Analysis of the most common sources of income**

In [None]:
income = df.groupby("Source", as_index=False)["Name"].count()

income.sort_values(by="Name", ascending=False, axis=0, kind='quicksort', inplace=True)

income.reset_index(drop=True, inplace=True)

top_income = income.head(10)

fig = px.bar(top_income, y="Source", x="Name", text="Name", color="Source",
            color_discrete_sequence=px.colors.qualitative.T10)

fig.update_layout(title_text="Top Sources of Income for Billionaires", title_y=0.97, title_x=0.5, title_font_size=22,
                   height=800, width=1100, xaxis_title="Number of Billionaires", yaxis_title="Source")

As shown in the graph above, the most common source of income for billionaires is Real Estate.