### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Exploratory Data Analysis

##### Import Dataset

In [2]:
gapminder_data = pd.read_csv("../input/gapminder-dataset/gapminder_data_graphs.csv")

In [3]:
gapminder_data.shape

##### Our dataset has 3675 rows and 8 columns!

##### Let's have a look at the head of the data!

In [4]:
gapminder_data.head()

##### Examining the columns, let's have a look at the types of data!

In [5]:
gapminder_data.info()

#### We see that we have 3 categorical variables (country, continent and year), and the rest are all numeric. We can then examine the distribution of the numeric variables.

In [6]:
gapminder_data.describe()

## Correlation Plots

### Pearson Correlation Coefficient
1. Ranges from -1 to 1
2. 1 means a perfect positive linear correlation
3. 0 means no linear correlation
4. -1 means a perfect negative linear correlation
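
As a rough illustration of the definition above, the coefficient can be computed by hand and compared with NumPy's built-in. The arrays `x` and `y` and the helper `pearson_r` are made up for this sketch and are not part of the notebook or the gapminder data.

```python
import numpy as np

# Made-up sample values (assumption: purely illustrative, not from the dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(x, y)                     # hand-rolled formula
r_numpy = np.corrcoef(x, y)[0, 1]              # NumPy's version of the same quantity
r_perfect_negative = pearson_r(x, -2 * x + 1)  # an exact decreasing line gives -1

print(r_manual, r_numpy, r_perfect_negative)
```

The hand-rolled value and `np.corrcoef` should agree up to floating-point error, and the last value shows what a coefficient of -1 looks like in practice.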

In [7]:
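#numeric_only=True skips the text columns (country, continent), which newer pandas versions refuse to correlate
cr = gapminder_data.corr(method='pearson', numeric_only=True)
print(cr)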

#### These numbers seem confusing, let's plot them out using a heatmap to visualise!

In [8]:
#heatmap using px
fig = px.imshow(
        img=cr,
        zmin=-1, 
        zmax=1,
        color_continuous_scale='rdylgn')
fig.show()

We can see that none of the features are negatively correlated with each other, and that some pairs are strongly positively correlated: for example, HDI index with life expectancy, and HDI index with services.
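
Reading the strongest pairs off a heatmap by eye can be error-prone, so one possible complement is to rank the pairs numerically. The sketch below uses an invented `toy` DataFrame (only the column names mirror the dataset), so the printed values say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: invented values, only the column names mirror the dataset)
toy = pd.DataFrame({
    "hdi_index": [0.45, 0.55, 0.70, 0.80, 0.92],
    "life_exp": [55.0, 62.0, 71.0, 76.0, 82.0],
    "services": [30.0, 52.0, 48.0, 66.0, 75.0],
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs, drop the masked cells, strongest correlation first
ranked_pairs = upper.stack().dropna().sort_values(ascending=False)
print(ranked_pairs)
```

Applied to the real correlation matrix `cr`, the same masking and sorting steps would list every feature pair once, ordered by correlation.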

## Questions to ask

#### Now that we have a general sense of how our data is laid out, we can begin by asking ourselves some interesting questions!

##### 1. What is the distribution of HDI Indexes across countries?
##### 2. What is the distribution of services across countries?
##### 3. Which continent has the highest HDI Index? Is it the same for services?
##### 4. What about hdi indexes and service values for countries in each continent?
##### 5. Are there any notable outliers or dips from the observed trends?



### To answer the first two questions, we have to start with Univariate Analysis

## Univariate Analysis

Univariate plots display one variable
1. Histogram
2. Box Plot
3. Density Plots

##### Distribution of HDI Index

In [9]:
fig = px.histogram(gapminder_data,
                   x="hdi_index")
fig.show()

##### The higher the HDI index the better. The histogram appears to show two separate distributions of HDI indexes, possibly one for lower-income countries and one for higher-income countries

#### Distribution of Services

In [10]:
fig = px.histogram(gapminder_data,
                   x="services", 
                   marginal="box", color_discrete_sequence = ['skyblue']) #color_discrete_sequence argument accepts explicitly-constructed color sequences 
fig.show()

##### The median services percentage is 52.9%, and the values seem relatively evenly distributed
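
A quick numerical check can back up a visual impression like this one. The sketch below compares the mean with the median and computes the sample skewness with pandas. The `services` values are invented for illustration, so the numbers are not those of the real column.

```python
import pandas as pd

# Toy values (assumption: invented, not taken from the gapminder data)
services = pd.Series([20.0, 35.0, 45.0, 50.0, 55.0, 60.0, 65.0, 80.0], name="services")

median = services.median()
mean = services.mean()

# Sample skewness: values near 0 suggest a roughly symmetric distribution
skewness = services.skew()

print(f"median={median}, mean={mean}, skewness={skewness:.3f}")
```

A mean close to the median and a skewness near zero would be consistent with a roughly symmetric distribution. The same three calls work on `gapminder_data['services']`.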

### Now that we know the distribution of HDI and services, let's move on to a more in-depth analysis of the different continents!

## Bivariate Analysis
1. Line Charts
2. Scatter Plots
3. Correlation plots

In [11]:
gapminder_data

In [12]:
hdi_cont_data = gapminder_data.groupby(['continent', 'year'], as_index= False).agg({'hdi_index': 'mean'})
hdi_cont_data

#### Although Europe has the highest HDI index and Africa the lowest, Africa seems to have the highest rate of increase in HDI index!

In [16]:
fig = px.line(data_frame= hdi_cont_data, x='year', y='hdi_index', color='continent', labels = {'year': 'Year', 'hdi_index': 'HDI Index'}, title='HDI Index Per Continent Across Time')

#customise legend
fig.update_layout({
    'showlegend': True,
    'legend': {
      'title': 'Continents',
      'x': 1.03, 'y': 0.9,
      'bgcolor': 'rgb(246,228,129)'}
})

fig.show()
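
The rate-of-increase observation can also be checked numerically rather than read off the line chart. The sketch below uses an invented `toy_hdi` frame with the same columns as `hdi_cont_data`. The numbers are assumptions chosen for illustration, not results from the dataset.

```python
import pandas as pd

# Toy continent-level means (assumption: invented numbers shaped like hdi_cont_data)
toy_hdi = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Europe", "Europe", "Europe"],
    "year": [2000, 2009, 2018, 2000, 2009, 2018],
    "hdi_index": [0.42, 0.48, 0.54, 0.80, 0.84, 0.87],
})

# Sort so that "first" and "last" really are the earliest and latest year per continent
ordered = toy_hdi.sort_values(["continent", "year"])
summary = ordered.groupby("continent").agg(
    first_year=("year", "first"),
    last_year=("year", "last"),
    first_hdi=("hdi_index", "first"),
    last_hdi=("hdi_index", "last"),
)

# Average absolute change in HDI per year over the observed period
summary["hdi_change_per_year"] = (
    (summary["last_hdi"] - summary["first_hdi"])
    / (summary["last_year"] - summary["first_year"])
)

fastest = summary["hdi_change_per_year"].idxmax()
print(summary)
print("Fastest improving continent in the toy data:", fastest)
```

Swapping `toy_hdi` for `hdi_cont_data` should give the per-continent slopes for the real data. Note that this measures absolute change per year, and a relative (percentage) change would be a different choice.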

### Let's now take a look at our Services Plots

In [17]:
services_cont_data = gapminder_data.groupby(['continent', 'year'], as_index= False).agg({'services': 'mean'})
services_cont_data

In [19]:
fig = px.line(data_frame=services_cont_data, x='year', y='services', color='continent', labels = {'year': 'Year', 'services': 'Services'}, title = '% Employed in the Service Sector Across Time')

#add annotations
fig.add_annotation(x=2009, y=64.68421,
            text="Europe overtakes North America!",
            showarrow=True,
            arrowhead=1)

fig.show()

#### Europe ends up with the highest percentage employed in the service sector, having overtaken North America in 2009, while Africa has the lowest %
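
The year in which one continent overtakes another can be located in code instead of by hovering over the chart. This is a minimal sketch on an invented `toy_services` frame shaped like `services_cont_data`. The values are assumptions picked so that the crossover falls in 2009.

```python
import pandas as pd

# Toy data (assumption: invented values shaped like services_cont_data)
toy_services = pd.DataFrame({
    "continent": ["Europe"] * 4 + ["North America"] * 4,
    "year": [2007, 2008, 2009, 2010] * 2,
    "services": [63.0, 63.9, 64.7, 65.2, 64.0, 64.2, 64.4, 64.6],
})

# One row per year, one column per continent
wide = toy_services.pivot(index="year", columns="continent", values="services")

# True in the years where Europe is ahead of North America
europe_ahead = wide["Europe"] > wide["North America"]

# A crossover year is one where Europe is ahead but was not ahead the year before.
# The first year has no predecessor, so it is treated as "not ahead before".
was_ahead_before = europe_ahead.shift(1, fill_value=False)
crossover_years = wide.index[europe_ahead & ~was_ahead_before].tolist()
print(crossover_years)
```

The resulting year, together with the Europe value for that year, could then be passed to `fig.add_annotation` instead of hard-coding the coordinates.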

#### Now we are able to answer our initial question: Which continent has the highest HDI Index? Is it the same for services?

#### Europe has the highest HDI Index and Services value across time, and Africa has the lowest for both

## What about hdi indexes and service values for countries in each continent?

#### Let's start by grouping and aggregating our data!
#### We will examine HDI Index first

In [20]:
hdi_ctry_cont_data = gapminder_data.groupby(['country','continent', 'year'], as_index= False).agg({'hdi_index': 'mean'})
hdi_ctry_cont_data

#### Let's examine North America's HDI Index

In [21]:
na_hdi = hdi_ctry_cont_data[hdi_ctry_cont_data.continent=='North America']
na_hdi

In [22]:
#remove null data: keep only the years 2000-2018 (the end of range() is exclusive)
na_hdi_2000_2018 = na_hdi[na_hdi.year.isin(list(range(2000, 2019)))]
px.line(data_frame = na_hdi_2000_2018,x='country', y='hdi_index', animation_frame='year', title='HDI Index in North America', range_y=[0.4,1])

#### HDI Index for United States is the highest in North America

#### Examining Services data by Country

In [26]:
services_ctry_cont_data = gapminder_data.groupby(['country','continent', 'year'], as_index= False).agg({'services': 'mean'})
services_ctry_cont_data

#### Taking a look at North America Again

In [32]:
na_services = services_ctry_cont_data[services_ctry_cont_data.continent=='North America']
na_services
px.line(data_frame = na_services,x='country', y='services', animation_frame='year', title='Services in North America', range_y=[35,90])

#### Guatemala, Haiti & Jamaica's data seem interesting: Haiti's services value is higher than Guatemala's & Honduras's, yet Haiti has a lower HDI index than both
#### Looking at the rest of the countries, there seems to be a positive correlation between HDI index and services
#### Let's plot HDI Index against services to have a better understanding!

In [24]:
fig = px.scatter(gapminder_data,x = "hdi_index",  y = "services",  title = 'HDI Index & Services Plot Across Time', 
                 color = "continent", hover_data = ["country"], animation_frame="year", range_x=[0,1], range_y= [0, 100])
fig.show()

#### Stepping through the years in the HDI index and services plot shows that both HDI index and services are on an increasing trend
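
One possible way to make an observation like the Haiti one more concrete is to fit a straight line of services against HDI index and look at the residuals. The sketch below does this on an invented single-year snapshot. The country labels `A` to `F` and all values are placeholders, not real data.

```python
import numpy as np
import pandas as pd

# Toy single-year snapshot (assumption: invented values; country names are placeholders)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "hdi_index": [0.50, 0.60, 0.70, 0.80, 0.90, 0.52],
    "services": [40.0, 48.0, 56.0, 64.0, 72.0, 60.0],
})

# Least-squares line: services ~ slope * hdi_index + intercept
slope, intercept = np.polyfit(toy["hdi_index"], toy["services"], deg=1)

# Residual = observed services minus what the line predicts for that HDI
toy["residual"] = toy["services"] - (slope * toy["hdi_index"] + intercept)

# The largest positive residual marks the country with the most "excess" services for its HDI
most_unusual = toy.loc[toy["residual"].idxmax(), "country"]
print(toy.sort_values("residual", ascending=False))
print("Largest positive residual:", most_unusual)
```

A straight line is only one modelling assumption. With the real data, the same steps could be run on one year at a time, since the relationship shifts over the animation frames.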

## **Let's Summarise what we have discovered from the data!**

### 1. What is the distribution of HDI Indexes across countries? 
##### Seems to have two separate distributions of data

### 2. What is the distribution of services across countries?
##### Relatively evenly distributed 

### 3. Which continent has the highest HDI Index? Is it the same for services?
##### Europe has the highest services and HDI Index, Africa has the lowest

### 4. What about hdi indexes and service values for countries in each continent? 
##### In North America, US has the highest HDI Index but Bahamas has the highest service value

### 5. Are there any notable outliers or dips from the observed trends?
##### Higher services value in Haiti does not seem to translate to a higher HDI index

## **Other Plotting Methods**

### What if we want to have more than one plot?

### **Subplots**

In [33]:
from plotly.subplots import make_subplots

#fig = make_subplots (rows=2, cols=1)
fig = make_subplots (rows=2, cols=1, subplot_titles =['GDP Per Capita', 'HDI Index'])
fig.add_trace(go.Histogram(x=gapminder_data['gdp'], name='GDP Per Capita'), row=1,col=1)
fig.add_trace(go.Histogram(x=gapminder_data['hdi_index'], name = 'HDI Index'),row=2,col=1)

#add title to graph
fig.update_layout({'title': {'text':
    '<b>GDP Per Capita & HDI Plots</b>',
    'x': 0.5, 'y': 0.9}})
fig.show()

### **Additional Resources**
1. Plotly Library: https://plotly.com/python/
2. Plotly Express documentation: https://plotly.com/python-api-reference/plotly.express.html
3. Graph_objects documentation: https://plotly.com/python-api-reference/plotly.graph_objects.html