# MEI Introduction to Data Science
# Lesson 4 - Activity 1

This activity uses Python code to create charts for data: box plots, histograms, density plots, scatter diagrams and hexagonal bin plots. The activity uses the data from the MEI large data set 4, which gives information about different countries.

The most common library for plotting in Python is *matplotlib*. It is imported using
`import matplotlib.pyplot as plt`, which makes it available under the short name `plt`. Pandas can call some of the plots from *matplotlib* using simplified commands. A full description of the visualisation tools available through Pandas can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

## Problem
> *Which factors affect the life expectancy for different countries?*

To answer this question you could produce 1-dimensional charts to explore which other quantities have similar distributions to life expectancy. You could also produce 2-dimensional charts to explore whether there is a link between one of the other quantities and life expectancy.

## Getting the data

In [None]:
# import pandas
import pandas as pd

# import matplotlib
import matplotlib.pyplot as plt

# import the data and check it by viewing the top rows
country_data = pd.read_csv('../input/meilds4/mei-lds-4.csv')
country_data.head()

## Exploring the data
In the box below run some code to explore the field names, data types and shape of the data set.

In [None]:
# explore the data
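
As a rough sketch of the kind of commands that help here, the block below builds a tiny made-up table and inspects it. The name `sample` and its three rows are hypothetical stand-ins, not the MEI data; the same attributes can be used on `country_data`.

```python
import pandas as pd

# Hypothetical stand-in for country_data: three made-up rows, not the MEI data
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["North", "North", "South"],
    "Life expectancy": [71.2, 80.5, 64.9],
})

print(sample.columns.tolist())  # field names
print(sample.dtypes)            # data type of each field
print(sample.shape)             # (number of rows, number of columns)
```

Text fields usually appear with the `object` type and decimal fields as `float64`, which is a quick way to spot the fields that can be plotted numerically.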

It will be useful to draw some boxplots to explore the fields in the data set. Pandas contains a basic `boxplot()` command.
* Run the code below to create the boxplot for Life expectancy at birth 2010

In [None]:
# generate the box plot for the Life expectancy at birth column
country_data.boxplot(column=['Life expectancy at birth 2010'])
# plt.show() suppresses the additional output text generated by matplotlib - try removing it to see the difference
plt.show()

You should notice that the boxplot is vertical and is quite small on the screen. The next block of code splits the boxplots up by *Region*, draws them horizontally and creates a larger figure on the screen.
* Run the code below 
* Experiment with changing *vert* (can be `True` or `False`) and *figsize* (two values separated by a comma for width and height) 

In [None]:
# draw a boxplot for life expectancy split up by region
country_data.boxplot(column=['Life expectancy at birth 2010'], by='Region', vert=False, figsize=(12, 8))
plt.show()
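
Points drawn individually beyond the whiskers of a boxplot are the values lying more than 1.5 times the interquartile range outside the quartiles (the matplotlib default). The sketch below shows one possible way to list the rows behind such points. The table `sample`, its field names and its values are all hypothetical, not taken from the MEI data.

```python
import pandas as pd

# Hypothetical data: made-up values, not taken from the MEI data set
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Region": ["North"] * 6,
    "Life expectancy": [78.0, 79.0, 80.0, 81.0, 82.0, 55.0],
})

# Quartiles calculated separately within each region
grouped = sample.groupby("Region")["Life expectancy"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values more than 1.5 x IQR below Q1 or above Q3
values = sample["Life expectancy"]
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
outliers = sample[is_outlier]
print(outliers[["Country", "Region", "Life expectancy"]])
```

With these made-up values only country F is flagged, because 55.0 falls well below the lower limit for its region.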

* In the box below write some code that generates the boxplots for GDP by sub region (you can copy the field names from the `dtypes` output above)
* Draw a boxplot for any other fields that you think will be useful to explore

In [None]:
# draw a boxplot for GDP split up by sub region

# draw boxplots for any other fields that you think are relevant

**Checkpoint**
> * Comment on at least two regions that have countries that appear as outliers for life expectancy. How would you find which countries these were?
> * Which regions have a large amount of variability in GDP and which show little variability? Is this variability similar to the variability of life expectancy?
> * Which other fields could have an impact on life expectancy?

## Analysing the data
### Drawing histograms and density plots
Pandas contains a basic `hist()` command. Two useful parameters that can be set when drawing a histogram are `bins=` to set either the boundaries or the number of bins, and `density=1` so that the total area of the histogram is 1.
* Run the code below to create the histogram for life expectancy at birth
* Try changing the bins - you could try different boundaries such as `bins=[0,50,100]`, setting the number of bins such as `bins=3`, `bins=10`, or removing the bins parameter completely 
* Try removing the `density=1` parameter - this will give a histogram with frequency on the vertical axis

*Note: The definition of a histogram is that the area of the bars is proportional to the frequency. The version of a histogram with frequency density on the vertical axis is not supported in Pandas.*
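
The effect of `density=1` can be checked numerically. The sketch below uses `numpy.histogram`, on the assumption that it applies the same normalisation as the plotted histogram. The values and bin edges are made up and deliberately use unequal bin widths.

```python
import numpy as np

# Hypothetical values, not taken from the MEI data set
values = np.array([45, 52, 58, 61, 66, 68, 71, 73, 74, 76, 79, 81])
edges = [40, 60, 70, 80, 90]  # unequal bin widths on purpose

counts, _ = np.histogram(values, bins=edges)
heights, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)

print(counts)                    # frequency in each bin
print(heights)                   # bar heights when density is switched on
print((heights * widths).sum())  # total area of the bars
```

Each height equals the frequency divided by (total frequency × bin width), so the heights are proportional to frequency density and the areas add up to 1, apart from floating-point rounding.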

In [None]:
# draw a histogram for country_data['Life expectancy at birth 2010']
country_data['Life expectancy at birth 2010'].plot.hist(bins=[40,50,60,70,80,90,100],density=1)
plt.show()

* In the box below write some code that generates the histogram for GDP (you should check that your bins will contain all the data - using `describe()` is useful to get the min and max values)

In [None]:
# draw a histogram for GDP
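
One possible way to choose bins that cover all the data is sketched below. The series `gdp`, its values and the step of 10 000 are assumptions made for illustration only, not taken from the MEI data.

```python
import numpy as np
import pandas as pd

# Hypothetical GDP-like values, not taken from the MEI data set
gdp = pd.Series([850, 1200, 3400, 7800, 15200, 41000, 56300])

summary = gdp.describe()
print(summary[["min", "max"]])

# Round the minimum down and the maximum up to a multiple of the step,
# so that the first and last bin edges enclose every value
step = 10_000
lower = np.floor(summary["min"] / step) * step
upper = np.ceil(summary["max"] / step) * step
edges = np.arange(lower, upper + step, step)
print(edges)
```

The resulting `edges` could then be passed as `bins=` when drawing the histogram. A different step may suit the real data better.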

Pandas also contains a command to produce a *density* plot.
* Run the code below to generate density plots for life expectancy and GDP

In [None]:
country_data['Life expectancy at birth 2010'].plot.density()
plt.show()

country_data['GDP per capita (US$)'].plot.density()
plt.show()
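
A density plot is a smoothed estimate, so the curve can spread beyond the range of the data. The sketch below illustrates this with SciPy's `gaussian_kde`, which is assumed here to be the estimator behind the Pandas density plot. The values are made up and all positive.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical, strongly right-skewed positive values (not the MEI data)
values = np.array([0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0, 45.0])

kde = gaussian_kde(values)        # Gaussian kernel density estimate
print(values.min())               # smallest observed value is positive
print(kde(np.array([-5.0]))[0])   # yet the smoothed curve is above zero at x = -5
```

The estimate at a negative value is small but not zero, even though no negative value appears in the data. This is worth keeping in mind when reading density plots of quantities that cannot be negative.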

**Checkpoint**
> * What shape are the distributions for life expectancy and GDP (unimodal/bimodal and skewness)? 
> * Why do you think these quantities have distributions of this shape?
> * A density plot can appear to show negative values that are not possible for the data, because the curve is smoothed. Do any of these density plots extend to negative values, and are such values possible for the data?

### Drawing scatter diagrams and hexagonal bin plots
Pandas can plot a scatter diagram of paired values from two fields, which are set using `x=` and `y=`.
* Run the code below to generate the scatter diagram for life expectancy v GDP

In [None]:
# draw a scatter diagram for life expectancy v GDP
country_data.plot.scatter(x='GDP per capita (US$)', y='Life expectancy at birth 2010')
plt.show()

* In the box below write and run some code that will generate a scatter diagram for birth rate v GDP 

In [None]:
# draw a scatter diagram for birth rate per 1000 v GDP

Pandas can also generate *hexagonal bin plots*. These are similar to scatter plots but group the data into hexagons representing frequency.
* Run the code below to generate the hexagonal bin plot for life expectancy v GDP
* Explore changing the `gridsize`

In [None]:
# draw a hexagonal bin plot for life expectancy v GDP
country_data.plot.hexbin(x='GDP per capita (US$)', y='Life expectancy at birth 2010', gridsize=10, sharex=False)
plt.show()
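
To see what `gridsize` controls, the sketch below bins the same set of random points twice using matplotlib's `hexbin` directly. The points are randomly generated for illustration and are not the MEI data; the two grid sizes are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical random points, not taken from the MEI data set
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + rng.normal(size=500)

fig, (left, right) = plt.subplots(1, 2, figsize=(12, 5))
coarse = left.hexbin(x, y, gridsize=5)    # few, large hexagons
fine = right.hexbin(x, y, gridsize=30)    # many, small hexagons

# get_array() returns one count per hexagon drawn
print(len(coarse.get_array()), coarse.get_array().max())
print(len(fine.get_array()), fine.get_array().max())
```

A small `gridsize` gives a coarse picture with high counts per hexagon, while a large one gives more detail but many hexagons that hold only a few points. Add `plt.show()` at the end to display the two plots side by side.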

* In the box below write and run some code that will generate a hexagonal bin plot for birth rate v GDP 

In [None]:
# draw a hexagonal bin plot for birth rate per 1000 v GDP

**Checkpoint**
> * Do these diagrams suggest a link between life expectancy and GDP or birth rate and GDP?


## Communicating the results
**Checkpoint**
> * Use the charts created to answer the initial problem: *Which factors affect the life expectancy for different countries?*
> * Are there any strengths or weaknesses in using histograms or density plots to display the shape of a distribution?
> * Was the scatter diagram or the hexagonal bin plot more useful in being able to display a link?