# JOUR7280/COMM7780 Big Data Analytics for Media and Communication
# Tutorial: Data Visualization
## 1. Introduction

In this lab, we will continue exploring the Matplotlib library and learn how to create additional plots, namely histograms, bar charts, and scatter plots.

## 2. Data Preparation
**The Dataset: Immigration to Canada from 1980 to 2013**

Dataset Source: [International migration flows to and from selected countries - The 2015 revision](http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml).

The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. The current version presents data pertaining to 45 countries.

In this lab, we will focus on the Canadian immigration data.

<img src = "../figs/Dataset.png" align="center" width=900>

**1. Load data into dataframe**

In [None]:
import numpy as np  
import pandas as pd


# load data
df = pd.read_csv('../data/Canada.csv')
df.head(10)

To view the dimensions of the dataframe, we use the `.shape` attribute.

In [None]:
df.shape

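**2. Set the country name as index**

By default, the rows are labelled with integers (0, 1, 2, ...), which makes it awkward to look up a country by name. This can be fixed very easily by setting the `Country` column as the index using the `set_index()` method. Afterwards, we can quickly look up countries using the `.loc` method.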

In [None]:
df.set_index('Country', inplace=True)
# tip: The opposite of set is reset. So to reset the index, we can use df.reset_index()

In [None]:
df.head()
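
To see what the index change buys us, here is a minimal sketch on a tiny made-up table (the country names and numbers are invented for illustration and are not part of the dataset). It compares a lookup by boolean filter with a lookup by label through `.loc`, and then undoes the change with `reset_index()`.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the UN dataset.
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Carpania"],
    "1980": [10, 20, 30],
    "1981": [15, 25, 35],
})

# Without a meaningful index, a lookup needs a boolean filter.
by_filter = toy[toy["Country"] == "Borduria"]["1981"].iloc[0]

# With the country name as the index, .loc addresses the row by label.
toy_indexed = toy.set_index("Country")
by_label = toy_indexed.loc["Borduria", "1981"]

print(by_filter, by_label)

# reset_index() undoes set_index() and restores 'Country' as a column.
restored = toy_indexed.reset_index()
print(list(restored.columns))
```

Both lookups return the same value; the label-based one is simply shorter and easier to read.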

**3. Add a Total column**

Add a `Total` column that sums up the total immigrants by country over the entire period 1980 - 2013.

In [None]:
df['Total'] = df.loc[:,'1980':'2013'].sum(axis=1)
df.head()

The `df.sum()` method returns the sum of the values along the requested axis.
- `axis=1`: sum across the columns, giving one total per row (here, one total per country)
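
To make the axis convention concrete, here is a minimal sketch on a made-up two-country table (the names and numbers are invented for illustration). It also shows that label-based slices with `.loc` include both endpoints, which is why a slice such as `'1980':'2013'` covers the year 2013 as well.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame(
    {"1980": [1, 2], "1981": [10, 20], "1982": [100, 200]},
    index=["Aland", "Borduria"],
)

row_totals = toy.sum(axis=1)  # one value per row (per country)
col_totals = toy.sum(axis=0)  # one value per column (per year)

# Label slices are inclusive on both ends: this covers 1980 AND 1981.
partial = toy.loc[:, "1980":"1981"].sum(axis=1)

print(row_totals.to_dict())
print(col_totals.to_dict())
print(partial.to_dict())
```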

## 2. Visualizing Data using Matplotlib

In [None]:
# we are using the inline backend
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

## 2.1 Histograms

A histogram is a way of representing the <ins>frequency</ins> distribution of a numeric dataset. It partitions the x-axis into *bins*, assigns each data point in the dataset to a bin, and then counts the number of data points assigned to each bin. So the y-axis is the frequency, that is, the number of data points in each bin. Note that we can change the number of bins (and therefore the bin width), and usually one needs to tweak it so that the distribution is displayed nicely.
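
The binning idea can be sketched by hand on a few made-up values (the numbers are invented for illustration). The loop below counts the points in each bin manually and compares the result with `np.histogram`. One detail that is easy to miss: every bin is half-open, `[low, high)`, except the last one, which also includes its right edge.

```python
import numpy as np

# Made-up values for illustration only.
values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Ask NumPy for 3 equal-width bins between min and max.
counts, edges = np.histogram(values, bins=3)

# Re-do the counting by hand to show what happens inside.
manual = []
last = len(edges) - 2
for i in range(len(edges) - 1):
    low, high = edges[i], edges[i + 1]
    if i == last:
        in_bin = (values >= low) & (values <= high)  # last bin is closed
    else:
        in_bin = (values >= low) & (values < high)   # other bins are half-open
    manual.append(int(in_bin.sum()))

print(edges)    # bin edges: 1, 4, 7, 10
print(counts)   # frequency per bin: 4, 1, 3
print(manual)   # same counts, computed by hand
```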

**Question:** What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?

Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will use `NumPy`'s `histogram` function to get the bin ranges and frequency counts as follows:

In [None]:
# let's quickly view the 2013 data
df['2013'].head()

In [None]:
# np.histogram returns 2 values
count, bin_edges = np.histogram(df['2013'])

print(count) # frequency count
print(bin_edges) # bin ranges, default = 10 bins

By default, the `histogram` function breaks up the dataset into 10 bins. The figure below summarizes the bin ranges and the frequency distribution of immigration in 2013. We can see that in 2013:
* 178 countries contributed between 0 and 3412.9 immigrants
* 11 countries contributed between 3412.9 and 6825.8 immigrants
* 1 country contributed between 6825.8 and 10238.7 immigrants, and so on.

<img src="../figs/Histogram.jpeg" align="center" width=650>

We can easily graph this distribution by passing `kind='hist'` to `plot()`.

In [None]:
df['2013'].plot(kind='hist')

plt.title('Histogram of Immigration from 195 Countries in 2013') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label

plt.show()

In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population. 

Notice that the x-axis tick labels do not line up with the bin edges. This can be fixed by passing in an `xticks` keyword that contains the list of bin edges, as follows:

In [None]:
# 'bin_edges' is an array of bin edges (one more edge than there are bins)
count, bin_edges = np.histogram(df['2013'])

df['2013'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)

plt.title('Histogram of Immigration from 195 countries in 2013') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label

plt.show()

*Side Note:* We could use `df['2013'].plot.hist()` instead. In fact, throughout this lesson, using `some_data.plot(kind='type_plot', ...)` is equivalent to `some_data.plot.type_plot(...)`. That is, specifying the plot type as the `kind` argument or as a method name behaves the same.

See the `pandas` documentation for more info [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html).
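
The equivalence of the two call styles can be checked with a minimal sketch on made-up values (invented for illustration). It draws the same histogram both ways and compares the bar heights. The `Agg` backend line is an assumption that lets the sketch run without a display; inside this notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
s = pd.Series([1, 2, 2, 3, 5, 8, 9, 10])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
s.plot(kind="hist", bins=3, ax=ax1)  # plot type passed as an argument
s.plot.hist(bins=3, ax=ax2)          # plot type used as a method

# Each bar of a histogram is a rectangle patch on the axes.
heights_kind = [float(p.get_height()) for p in ax1.patches]
heights_method = [float(p.get_height()) for p in ax2.patches]
print(heights_kind, heights_method)

plt.close(fig)
```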

We can also plot multiple histograms on the same plot. For example, let's try to answer the following question using a histogram.

**Question**: What is the immigration distribution for Denmark, Norway, and Sweden for years 1980 - 2013?

In [None]:
# let's quickly view the dataset 
df.loc[['Denmark', 'Norway', 'Sweden'], '1980':'2013']

In [None]:
# generate histogram
df.loc[['Denmark', 'Norway', 'Sweden'], '1980':'2013'].plot.hist()

That does not look right! 

Don't worry, you'll often come across situations like this when creating plots. The solution often lies in how the underlying dataset is structured.

Instead of plotting the frequency distribution of the population for the 3 countries, `pandas` plotted the frequency distribution for the `years`. This happens because `pandas` draws one histogram per column, and the columns of our selection are the years.

This can be easily fixed by first transposing the dataset, and then plotting as shown below.

In [None]:
# transpose dataframe
df_t = df.loc[['Denmark', 'Norway', 'Sweden'], '1980':'2013'].transpose()
df_t.head()
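
To see why the transpose helps, here is a minimal sketch on a made-up table (the numbers are invented; only the three country names come from the question above). Before transposing, the columns are years; afterwards, the columns are countries, which is what we want one histogram for.

```python
import pandas as pd

# Made-up numbers for illustration only.
wide = pd.DataFrame(
    {"1980": [5, 7, 9], "1981": [6, 8, 10], "1982": [4, 9, 11], "1983": [3, 6, 12]},
    index=["Denmark", "Norway", "Sweden"],
)

print(list(wide.columns))  # years: a histogram here would be drawn per year

tall = wide.transpose()    # wide.T is an equivalent shorthand
print(list(tall.columns))  # countries: a histogram here is drawn per country
print(tall.shape)          # (4, 3): 4 years as rows, 3 countries as columns
```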

In [None]:
# generate histogram
df_t.plot(kind='hist', figsize=(10, 6))

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

Let's make a few modifications to improve the impact and aesthetics of the previous plot:
* increase the number of bins to 15 by passing in the `bins` parameter
* set transparency to 60% by passing in the `alpha` parameter
* change the colors of the plots by passing in the `color` parameter

In [None]:
# let's get the x-tick values
count, bin_edges = np.histogram(df_t, 15)

# un-stacked histogram
df_t.plot(kind ='hist', 
          figsize=(10, 6),
          bins=15,
          alpha=0.6,
          xticks=bin_edges,
          color=['coral', 'darkslateblue', 'mediumseagreen']
         )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()
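
A note on the x-tick values used above: `np.histogram` flattens its input, so calling it on the whole dataframe yields one set of edges spanning all three countries at once. Here is a minimal sketch with made-up numbers (invented for illustration) showing that 15 bins give 16 edges, running from the overall minimum to the overall maximum.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only.
df_toy = pd.DataFrame({
    "Denmark": [10, 40, 25, 30],
    "Norway": [5, 20, 15, 35],
    "Sweden": [50, 60, 45, 80],
})

# All 12 values are pooled before the 15 bins are computed.
count, bin_edges = np.histogram(df_toy, 15)

print(len(bin_edges))                # 16 edges for 15 bins
print(bin_edges[0], bin_edges[-1])   # overall min and max: 5.0 and 80.0
print(int(count.sum()))              # every value is counted once: 12
```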

Tip:
For a full listing of colors available in Matplotlib, run the following code:

In [None]:
import matplotlib.colors as mcolors

# print every named color together with its hex code
for name, hex_code in mcolors.cnames.items():
    print(name, hex_code)

**Question**: What is the immigration distribution for Greece and Albania for years 1980 - 2013? Use an overlapping plot with 15 bins and a transparency value of 0.35.

In [None]:
### type your answer here


Double-click __here__ for the solution.
<!-- The sample solution is:
# create a dataframe of the countries of interest (cof)
df_cof = df.loc[['Greece', 'Albania'], '1980':'2013']
\\# transpose the dataframe
df_cof = df_cof.transpose() 
\\# let's get the x-tick values
count, bin_edges = np.histogram(df_cof, 15)
\\# Un-stacked Histogram
df_cof.plot(kind ='hist',
            figsize=(10, 6),
            bins=15,
            alpha=0.35,
            xticks=bin_edges,
            color=['coral', 'darkslateblue']
            )
plt.title('Histogram of Immigration from Greece and Albania from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()
-->

## 2.2 Bar Charts

A bar plot is a way of representing data where the *length* of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals. 

To create a bar plot, we can pass one of two arguments via `kind` parameter in `plot()`:

* `kind='bar'` creates a *vertical* bar plot
* `kind='barh'` creates a *horizontal* bar plot
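
The difference between the two kinds can be sketched on a tiny made-up series (labels and values are invented for illustration). In the vertical plot the values become bar heights; in the horizontal plot the same values become bar widths. The `Agg` backend line is an assumption that lets the sketch run without a display; inside this notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
s = pd.Series([3, 7, 5], index=["A", "B", "C"])

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))
s.plot(kind="bar", ax=ax_v)   # vertical bars: magnitude along the y-axis
s.plot(kind="barh", ax=ax_h)  # horizontal bars: magnitude along the x-axis

bar_heights = [float(p.get_height()) for p in ax_v.patches]
bar_widths = [float(p.get_width()) for p in ax_h.patches]
print(bar_heights, bar_widths)

plt.close(fig)
```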

### 2.2.1 Vertical bar plot

In vertical bar graphs, the x-axis is used for labelling, and the length of bars on the y-axis corresponds to the magnitude of the variable being measured. One disadvantage is that they lack space for text labelling at the foot of each bar. 

**Let's start off by analyzing the effect of Iceland's Financial Crisis:**

The 2008 - 2011 Icelandic Financial Crisis was a major economic and political event in Iceland. Relative to the size of its economy, Iceland's systemic banking collapse was the largest experienced by any country in economic history. The crisis led to a severe economic depression in 2008 - 2011 and significant political unrest.

**Question:** Let's compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013. 

In [None]:
# get the data
df_iceland = df.loc['Iceland', '1980':'2013']
df_iceland.head()

In [None]:
# plot data
df_iceland.plot(kind='bar', figsize=(10, 6))

plt.xlabel('Year') # add x-label to the plot
plt.ylabel('Number of immigrants') # add y-label to the plot
plt.title('Icelandic immigrants to Canada from 1980 to 2013') # add title to the plot

plt.show()

The bar plot above shows the total number of immigrants broken down by each year. We can clearly see the impact of the financial crisis: the number of immigrants to Canada started increasing rapidly after 2008. 

### 2.2.2 Horizontal Bar Plot

Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.


**Question:** Using the `df` dataset, create a *horizontal* bar plot showing the *total* number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. 

Step 1: Get the data pertaining to the top 15 countries.

In [None]:
# sort dataframe on 'Total' column (descending)
df.sort_values(by='Total', ascending=False, inplace=True)
# get top 15 countries
df_top15 = df['Total'].head(15)
df_top15
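
As an aside, the same top-N selection can be sketched with `nlargest()`, which does not reorder the original data. The example uses made-up totals (names and numbers are invented for illustration). It also shows one optional design choice: a horizontal bar plot draws the first row at the bottom, so sorting in ascending order before plotting puts the largest bar at the top.

```python
import pandas as pd

# Made-up totals for illustration only.
totals = pd.Series({"Aland": 120, "Borduria": 900, "Carpania": 450, "Dalmora": 30})

# Two equivalent ways of getting the top 3 values.
top_sorted = totals.sort_values(ascending=False).head(3)
top_nlargest = totals.nlargest(3)
print(list(top_sorted.index))

# Optional: reverse the order so that barh shows the largest bar on top.
for_plot = top_sorted.sort_values(ascending=True)
print(list(for_plot.index))
```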

Step 2: Plot data:
   1. Use `kind='barh'` to generate a bar chart with horizontal bars.
   2. Make sure to choose a good size for the plot and to label your axes and to give the plot a title.

In [None]:
### type your answer here


Double-click __here__ for the solution.
<!-- The solution is:
\\ # generate plot
df_top15.plot(kind='barh', figsize=(12, 12), color='steelblue')
plt.xlabel('Number of Immigrants')
plt.title('Top 15 Countries Contributing to the Immigration to Canada between 1980 - 2013')
-->

<!--
plt.show()
-->

## 2.3 Scatter Plots

A `scatter plot` (2D) is a useful method of comparing variables against each other. `Scatter` plots look similar to `line plots` in that they both map independent and dependent variables on a 2D graph. While the datapoints are connected together by a line in a line plot, they are not connected in a scatter plot. The data in a scatter plot is considered to express a trend. With further analysis using tools like regression, we can mathematically calculate this relationship and use it to predict trends outside the dataset.
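
As a small illustration of the regression idea mentioned above, here is a minimal sketch that fits a straight line with `np.polyfit` and uses it to predict a value outside the data. The years and totals are made up and exactly linear (an assumption chosen so the result is easy to check); real data would scatter around the fitted line.

```python
import numpy as np

# Made-up, exactly linear data for illustration only: 500, 600, ..., 900.
years = np.array([2000, 2001, 2002, 2003, 2004])
totals = 100 * (years - 2000) + 500

# Fit a degree-1 polynomial (a straight line): total = slope * year + intercept.
slope, intercept = np.polyfit(years, totals, deg=1)

# Use the fitted line to predict a year that is not in the data.
predicted_2006 = slope * 2006 + intercept

print(round(float(slope), 3))           # about 100 additional immigrants per year
print(round(float(predicted_2006), 1))  # about 1100
```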

Let's start by exploring the following:

Using a `scatter plot`, let's visualize the trend of total immigration to Canada (all countries combined) for the years 1980 - 2013.

**Step 1**: Get the dataset. Since we are expecting to use the relationship between `years` and `total population`, we will convert `years` to `int` type.

We can use the sum() method to get the total population per year.

`axis`: Axis for the function to be applied on. axis=0 applies sum to each column.

In [None]:
# we can use the sum() method to get the total population per year
df_tot = pd.DataFrame(df.loc[:, '1980':'2013'].sum(axis=0))
df_tot

In [None]:
# change the years to type int 
df_tot.index = df_tot.index.astype(int)

# reset the index 
df_tot.reset_index(inplace = True)

# rename columns
df_tot.columns = ['year', 'total']

# view the final dataframe
df_tot.head()
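
The same preparation can also be written as a single chain. This is only an alternative sketch, shown on a made-up table (names and numbers are invented for illustration): `rename_axis()` names the index and `reset_index(name=...)` names the value column, so no separate renaming step is needed.

```python
import pandas as pd

# Made-up numbers for illustration only.
wide = pd.DataFrame(
    {"1980": [1, 2], "1981": [10, 20]},
    index=["Aland", "Borduria"],
)

# Sum each year column, then turn the result into a two-column dataframe.
tot = (
    wide.sum(axis=0)
        .rename_axis("year")
        .reset_index(name="total")
)

# The year labels are still strings; a scatter plot needs numbers on the x-axis.
tot["year"] = tot["year"].astype(int)

print(list(tot.columns))
print(tot["year"].tolist(), tot["total"].tolist())
```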

**Step 2**: Plot the data. With the `pandas` plotting interface to `Matplotlib`, we can create a `scatter` plot by passing in `kind='scatter'` as a plot argument. We will also need to pass in the `x` and `y` keywords to specify the columns that go on the x- and the y-axis.

In [None]:
df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6))

plt.title('Total Immigration to Canada from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()

Use the `ggplot` style, which adjusts the plot style to emulate ggplot (a popular plotting package for R). Learn more about the `ggplot` style [here](https://matplotlib.org/3.1.1/gallery/style_sheets/ggplot.html).

In [None]:
mpl.style.use('ggplot') # optional: for ggplot-like style

df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6))

plt.title('Total Immigration to Canada from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()
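
Note that `mpl.style.use('ggplot')` changes the style for every plot that follows. If the style is wanted for a single plot only, one option is the `plt.style.context()` context manager, sketched below. The `Agg` backend line is an assumption that lets the sketch run without a display; inside this notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt

# The names of the installed style sheets are listed in plt.style.available.
has_ggplot = "ggplot" in plt.style.available
print(has_ggplot)

grid_before = plt.rcParams["axes.grid"]

# Inside the 'with' block the ggplot settings apply (for example, a grid is shown).
with plt.style.context("ggplot"):
    grid_inside = plt.rcParams["axes.grid"]

# After the block the previous settings are restored.
grid_after = plt.rcParams["axes.grid"]
print(grid_before, grid_inside, grid_after)
```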

**Question**: Create a scatter plot of the total immigration from Denmark, Norway, and Sweden to Canada from 1980 to 2013.

**Step 1**: Get the data:
   1. Create a dataframe that consists of the numbers associated with Denmark, Norway, and Sweden only. Name it `df_countries`.
   2. Sum the immigration numbers across all three countries for each year and turn the result into a dataframe. Name this new dataframe `df_total`.
   3. Reset the index in place.
   4. Rename the columns to `year` and `total`.
   5. Display the resulting dataframe.

In [None]:
### type your answer here



Double-click __here__ for the solution.
<!-- The correct answer is:
\\ # create df_countries dataframe
df_countries = df.loc[['Denmark', 'Norway', 'Sweden'], '1980':'2013'].transpose()
\\ # create df_total by summing across three countries for each year
df_total = pd.DataFrame(df_countries.sum(axis=1))
\\ # reset index in place
df_total.reset_index(inplace=True)
\\ # rename columns
df_total.columns = ['year', 'total']
\\ # change column year from string to int to create scatter plot
df_total['year'] = df_total['year'].astype(int)
\\ # show resulting dataframe
df_total.head()
-->

**Step 2**: Generate the scatter plot by plotting the total versus year in **df_total**.

In [None]:
### type your answer here


Double-click __here__ for the solution.
<!-- The correct answer is:
\\ # generate scatter plot
df_total.plot(kind='scatter', x='year', y='total', figsize=(10, 6))
-->

<!--
\\ # add title and label to axes
plt.title('Immigration from Denmark, Norway, and Sweden to Canada from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
-->

<!--
\\ # show plot
plt.show()
-->

- The code in this notebook is adapted from various sources. All code is for educational purposes only and released under the CC1.0.