<a href="https://colab.research.google.com/github/wcj365/python-stats-dataviz/blob/master/plotly_express_world_dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plotly Express
Explore the World Development Indicators

In [37]:
import pandas as pd
import plotly.express as px

## 1 - Data Prep

In [38]:
df = pd.read_csv("wdi_data.csv")
df.shape

(1661, 10)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD,Country Code,Country Name,Region,Income Group,Lending Type
0,0,2010,29185507.0,61.028,1710.575645,AFG,Afghanistan,South Asia,Low income,IDA
1,1,2011,30117413.0,61.553,1699.487997,AFG,Afghanistan,South Asia,Low income,IDA
2,2,2012,31161376.0,62.054,1914.774351,AFG,Afghanistan,South Asia,Low income,IDA
3,3,2013,32269589.0,62.525,2015.514962,AFG,Afghanistan,South Asia,Low income,IDA
4,4,2014,33370794.0,62.966,2069.424642,AFG,Afghanistan,South Asia,Low income,IDA


In [39]:
df.drop(columns=["Unnamed: 0"], inplace=True)
df.sample()

Unnamed: 0,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD,Country Code,Country Name,Region,Income Group,Lending Type
1156,2017,207896686.0,66.947,4571.414491,PAK,Pakistan,South Asia,Lower middle income,Blend
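
An `Unnamed: 0` column typically appears when a DataFrame is saved with its index and then read back without declaring that index. A minimal sketch, assuming `wdi_data.csv` was written that way (the inline CSV below is a hypothetical two-row stand-in, not the real file), shows how `index_col=0` avoids the extra column at load time:

```python
import io
import pandas as pd

# Hypothetical stand-in for wdi_data.csv, saved with its index as the first column
csv_text = """,Year,Country Code
0,2010,AFG
1,2011,AFG
"""

# index_col=0 treats the first column as the index instead of as data
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy.columns.tolist())
```

With this option there is nothing left to drop afterwards; whether it suits the real file depends on how that file was produced.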


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1661 entries, 0 to 1660
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               1661 non-null   int64  
 1   SP.POP.TOTL        1661 non-null   float64
 2   SP.DYN.LE00.IN     1661 non-null   float64
 3   NY.GDP.PCAP.PP.CD  1661 non-null   float64
 4   Country Code       1661 non-null   object 
 5   Country Name       1661 non-null   object 
 6   Region             1661 non-null   object 
 7   Income Group       1661 non-null   object 
 8   Lending Type       1661 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 116.9+ KB


In [40]:
df_2018 = df[df["Year"]==2018]
df_2018.shape

(181, 9)

In [41]:
df_2018.head()

Unnamed: 0,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD,Country Code,Country Name,Region,Income Group,Lending Type
8,2018,37172386.0,64.486,2083.321897,AFG,Afghanistan,South Asia,Low income,IDA
17,2018,2866376.0,78.458,13974.011607,ALB,Albania,Europe & Central Asia,Upper middle income,IBRD
26,2018,42228429.0,76.693,11925.798564,DZA,Algeria,Middle East & North Africa,Lower middle income,IBRD
35,2018,30809762.0,60.782,7102.405887,AGO,Angola,Sub-Saharan Africa,Lower middle income,IBRD
44,2018,96286.0,76.885,21630.179515,ATG,Antigua and Barbuda,Latin America & Caribbean,High income,IBRD


## 2 - Histogram
A histogram shows the **frequency distribution** of a numerical variable: values are grouped into bins, and each bar's height is the number of observations in that bin.

In [15]:
fig = px.histogram(data_frame=df_2018, x="SP.POP.TOTL", nbins=10)
fig.show()
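
To make the binning concrete, here is a minimal sketch of the counting a histogram performs, using `numpy.histogram` on a few hypothetical population values (in millions, not taken from `wdi_data.csv`). Plotly chooses its own bin edges, so the figure will not necessarily match an equal-width split like this one.

```python
import numpy as np

# Hypothetical populations in millions (illustrative values only)
populations = np.array([0.1, 2.9, 30.8, 37.2, 42.2, 96.5, 207.9, 1352.6])

# Split the range [min, max] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(populations, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.1f} to {right:8.1f}: {count} countries")
```

Nearly all values land in the first bin, which is what a strongly right-skewed variable looks like in a histogram.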

## 3 - Bar Chart

### 3.1 Univariate Categorical Variable
We use a bar chart to show the frequency distribution of a categorical variable, i.e. how many observations fall into each category (also known as a **frequency table**).

In [49]:
# Aggregate the count for each category
df_region = df_2018.groupby("Region").count()
df_region

Unnamed: 0_level_0,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD,Country Code,Country Name,Income Group,Lending Type
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
East Asia & Pacific,27,27,27,27,27,27,27,27
Europe & Central Asia,49,49,49,49,49,49,49,49
Latin America & Caribbean,30,30,30,30,30,30,30,30
Middle East & North Africa,19,19,19,19,19,19,19,19
North America,3,3,3,3,3,3,3,3
South Asia,8,8,8,8,8,8,8,8
Sub-Saharan Africa,45,45,45,45,45,45,45,45


In [50]:
# Turn the "Region" index back into a regular column so it can be plotted
df_region = df_region.reset_index()
df_region

Unnamed: 0,Region,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD,Country Code,Country Name,Income Group,Lending Type
0,East Asia & Pacific,27,27,27,27,27,27,27,27
1,Europe & Central Asia,49,49,49,49,49,49,49,49
2,Latin America & Caribbean,30,30,30,30,30,30,30,30
3,Middle East & North Africa,19,19,19,19,19,19,19,19
4,North America,3,3,3,3,3,3,3,3
5,South Asia,8,8,8,8,8,8,8,8
6,Sub-Saharan Africa,45,45,45,45,45,45,45,45


In [51]:
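# After count(), every column holds the number of countries per region,
# so "Year" serves here as the count column.
fig = px.bar(data_frame=df_region, 
             x="Region", y="Year",
             color="Region",
             labels={"Year": "Number of countries"},
             )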
fig.show()
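
Because `count()` repeats the same number in every column, the frequency table carries a lot of redundant data. As an alternative sketch, `value_counts()` builds a table with a single, clearly named count column. The small DataFrame and the column name `Countries` are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical miniature of df_2018 (illustrative rows only)
toy = pd.DataFrame({
    "Country Name": ["Afghanistan", "Pakistan", "Albania", "Angola"],
    "Region": ["South Asia", "South Asia",
               "Europe & Central Asia", "Sub-Saharan Africa"],
})

# One count column instead of the same count repeated in every column
region_counts = (
    toy["Region"]
    .value_counts()
    .rename_axis("Region")
    .reset_index(name="Countries")
)
print(region_counts)
```

A table like this could be passed to `px.bar` with `x="Region"` and `y="Countries"`, which makes the axis label self-explanatory.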

### 3.2 - Bivariate
- One categorical variable
- One numerical variable

In [29]:
# This bar chart shows how populous each region is: each country is one
# stacked segment, so the bar height is the region's total population.

fig = px.bar(data_frame=df_2018, 
             x="Region", y="SP.POP.TOTL", 
             hover_name="Country Name", 
             color="Region",
             height=800
             )
fig.show()
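
The height of each stacked bar is the sum of its segments. A minimal sketch with hypothetical population values (not taken from `wdi_data.csv`) computes the same totals directly with a grouped sum:

```python
import pandas as pd

# Hypothetical populations (illustrative values only)
toy = pd.DataFrame({
    "Region": ["South Asia", "South Asia", "North America"],
    "SP.POP.TOTL": [30_000_000.0, 200_000_000.0, 35_000_000.0],
})

# The per-region sum equals the height of the corresponding stacked bar
region_totals = toy.groupby("Region")["SP.POP.TOTL"].sum()
print(region_totals)
```

For population, a sum is a meaningful quantity, which is why the stacked chart works here.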

How about life expectancy?

In [57]:
# This bar chart does not make much sense: stacking adds up the life expectancies
# of all countries in a region, so it cannot be used to compare longevity across regions.

fig = px.bar(data_frame=df_2018, 
             x="Region", y="SP.DYN.LE00.IN", 
             hover_name="Country Name", 
             color="Region",
             height=600
             )
fig.show()
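
The problem is that a stacked bar shows a sum, and a sum of life expectancies grows with the number of countries in a region. A small sketch with hypothetical regions `A` and `B` (made-up values) contrasts the sum with the mean:

```python
import pandas as pd

# Hypothetical life expectancies (illustrative values only)
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B"],
    "SP.DYN.LE00.IN": [60.0, 62.0, 64.0, 80.0],
})

# Compare the per-region sum (what stacking shows) with the per-region mean
summary = toy.groupby("Region")["SP.DYN.LE00.IN"].agg(["sum", "mean"])
print(summary)
```

Region `A` has the larger sum only because it contains more countries; its mean life expectancy is lower than that of region `B`.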

In [54]:
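# Average the numeric columns per region; numeric_only=True skips the text
# columns, which recent pandas versions refuse to average.
df_life = df_2018.groupby("Region").mean(numeric_only=True).reset_index()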
df_life

Unnamed: 0,Region,Year,SP.POP.TOTL,SP.DYN.LE00.IN,NY.GDP.PCAP.PP.CD
0,East Asia & Pacific,2018,84373100.0,74.24206,26244.057088
1,Europe & Central Asia,2018,18721270.0,77.728808,36017.483152
2,Latin America & Caribbean,2018,19985090.0,74.90764,16698.310455
3,Middle East & North Africa,2018,21237260.0,75.96038,28751.646553
4,North America,2018,121269700.0,80.713171,65780.046126
5,South Asia,2018,226798600.0,71.339125,8191.477152
6,Sub-Saharan Africa,2018,23308210.0,62.707018,5490.209762


In [56]:
# This bar chart makes sense.
# It helps us compare the average life expectancy of each region.

fig = px.bar(data_frame=df_life, 
             x="Region", y="SP.DYN.LE00.IN", 
             color="Region"
             )
fig.show()