# Data Cleaning For World Wealth 2019 Dataset

I downloaded the Makeover Monday dataset from [data.world](https://data.world/makeovermonday/2020w7-world-wealth).

I used the cleaned dataset for a visualisation, which I published on [Tableau Public](https://public.tableau.com/profile/steffen.zou.weilun#!/vizhome/TotalWealthByRegionCountryIn2019/TotalWealthByRegionCountryIn2019).

In [1]:
import pandas as pd

world_wealth = pd.read_excel('WorldWealth.xlsx', sheet_name='WorldWealth')

print(world_wealth.info())
print(world_wealth.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 3 columns):
Country        214 non-null object
Region         206 non-null object
Wealth ($B)    205 non-null object
dtypes: object(3)
memory usage: 5.1+ KB
None
          Country         Region Wealth ($B)
0   United States  North America    $105,990
1           China          China     $63,827
2           Japan   Asia-Pacific     $24,992
3         Germany         Europe     $14,660
4  United Kingdom         Europe     $14,341


# Remove Rows With Missing Wealth

In [2]:
missing_wealth = world_wealth['Wealth ($B)'].isnull()
world_wealth[missing_wealth]

                            Country         Region  Wealth ($B)
97           Bosnia and Herzegovina            NaN          NaN
98                           Europe            NaN          NaN
154        Northern Mariana Islands            NaN          NaN
155                    Asia-Pacific            NaN          NaN
195        Central African Republic            NaN          NaN
196                          Africa            NaN          NaN
203  St. Vincent and the Grenadines            NaN          NaN
204                   Latin America            NaN          NaN
213                       Venezuela  Latin America          NaN


In [3]:
world_wealth.dropna(subset=['Wealth ($B)'], inplace=True)
print(world_wealth.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 212
Data columns (total 3 columns):
Country        205 non-null object
Region         205 non-null object
Wealth ($B)    205 non-null object
dtypes: object(3)
memory usage: 6.4+ KB
None
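
The index in the output above still runs from 0 to 212 even though only 205 rows remain, because `dropna` keeps the original row labels. The following is a minimal sketch of that behaviour on a toy frame; `sample` and `kept` are illustrative names, and the four rows are copied from the outputs shown earlier, not the full dataset.

```python
import pandas as pd

# Toy frame built from rows shown in the outputs above; None marks a missing cell.
sample = pd.DataFrame({
    'Country': ['United States', 'Bosnia and Herzegovina', 'Venezuela', 'China'],
    'Region': ['North America', None, 'Latin America', 'China'],
    'Wealth ($B)': ['$105,990', None, None, '$63,827'],
})

# Only 'Wealth ($B)' is checked, so Venezuela is dropped even though it has a region.
kept = sample.dropna(subset=['Wealth ($B)'])

print(kept.index.tolist())                         # [0, 3] - original labels, with gaps
print(kept.reset_index(drop=True).index.tolist())  # [0, 1] - optional renumbering
```

Renumbering with `reset_index(drop=True)` is optional here, since the index is not written to the output file.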


# Remove Dollar Sign And Comma From Wealth Column And Convert To Integer

In [4]:
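# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
world_wealth['Wealth ($B)'] = world_wealth['Wealth ($B)'].str.replace(r'\W', '', regex=True).astype('int')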
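
To see what the `\W` pattern in cell 4 removes, here is a small sketch using Python's standard `re` module on the first three wealth strings from the `head()` output. The narrower `[$,]` pattern and the decimal value `'$1,234.5'` are my own illustrations and do not appear in the dataset.

```python
import re

# Wealth strings copied from the head() output above.
raw_values = ['$105,990', '$63,827', '$24,992']

# \W matches any character that is not a letter, digit, or underscore,
# so it removes both the dollar sign and the comma.
cleaned = [int(re.sub(r'\W', '', value)) for value in raw_values]
print(cleaned)  # [105990, 63827, 24992]

# A narrower pattern that names only the two characters to remove
# gives the same result on this data.
narrow = [int(re.sub(r'[$,]', '', value)) for value in raw_values]
print(narrow == cleaned)  # True

# Hypothetical value with a decimal point: \W strips the point as well,
# which would silently change the magnitude.
with_w = re.sub(r'\W', '', '$1,234.5')
with_class = re.sub(r'[$,]', '', '$1,234.5')
print(with_w, with_class)  # 12345 1234.5
```

Because every value in this column is a whole number of billions, `\W` is safe here; the narrower pattern would be the more defensive choice if decimals or negative values could appear.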

# Update Region 'China' and 'India' To 'Asia-Pacific'

In [5]:
world_wealth['Region'].unique()

array(['North America', 'China', 'Asia-Pacific', 'Europe', 'India',
       'Latin America', 'Africa'], dtype=object)

In [6]:
region_china_india = world_wealth['Region'].isin(['China', 'India'])
world_wealth.loc[region_china_india]

   Country  Region  Wealth ($B)
1    China   China        63827
6    India   India        12614


I verified that China and India appear in the list of Asia-Pacific countries at https://apcss.org/about/ap-countries/.

In [7]:
world_wealth.loc[region_china_india, 'Region'] = 'Asia-Pacific'
world_wealth.loc[region_china_india]

   Country        Region  Wealth ($B)
1    China  Asia-Pacific        63827
6    India  Asia-Pacific        12614
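
The reason for merging the two single-country regions is that any per-region total would otherwise list China and India separately from the rest of Asia-Pacific. The sketch below shows the effect on six rows taken from the outputs above; `sample`, `mask`, and `totals` are illustrative names, and the sums cover only these six rows, not the full dataset.

```python
import pandas as pd

# Six rows copied from the outputs above, with wealth already converted to integers.
sample = pd.DataFrame({
    'Country': ['United States', 'China', 'Japan', 'Germany', 'United Kingdom', 'India'],
    'Region': ['North America', 'China', 'Asia-Pacific', 'Europe', 'Europe', 'India'],
    'Wealth ($B)': [105990, 63827, 24992, 14660, 14341, 12614],
})

# Same boolean-mask assignment as in cell 7.
mask = sample['Region'].isin(['China', 'India'])
sample.loc[mask, 'Region'] = 'Asia-Pacific'

# Total wealth per region for the sample rows only.
totals = sample.groupby('Region')['Wealth ($B)'].sum()
print(totals['Asia-Pacific'])   # 101433
print(totals['Europe'])         # 29001
print(totals['North America'])  # 105990
```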


# Save Cleaned Dataset As A New File

In [8]:
world_wealth.to_excel('WorldWealth_cleaned.xlsx', index=False)