<a name="home"></a>
# Data Visualisation with Python
## Table of Contents
1. [Introduction to Data Visualisation](#intro)
2. [Basic Visualisation Tools](#tool)
3. [Specialised Visualisation Tools](#spec)
4. [Advanced Visualisation](#adv)
5. [Visualising Geospatial Data](#geo)


<a name="intro"></a>
## Introduction to Data Visualisation
Data visualization is a way to show complex data in a graphical form that is easy to understand. This is especially useful when you are exploring a dataset and getting acquainted with it.  

Best practice: less is more. A decluttered chart is
* more effective, 
* more attractive, 
* more impactful.

For more examples, check out the [Darkhorse Analytics](https://www.darkhorseanalytics.com/portfolio-all) website.

### Introduction to Matplotlib
Matplotlib is one of the most widely used, if not the most popular, data visualization libraries in Python. It was created by John Hunter, a neurobiologist who was part of a research team working on analyzing electrocorticography (ECoG) signals.
Matplotlib is equipped with a scripting interface for quick and easy generation of graphics, represented by pyplot.

Matplotlib's architecture is composed of three main layers: 
1. The <b>Back-End layer</b> has three built-in abstract interface classes: 
    * FigureCanvas, which defines and encompasses the area on which the figure is drawn. 
    * Renderer, which knows how to draw on the figure canvas. 
    * Event, which handles user inputs such as keyboard strokes and mouse clicks.
2. The <b>Artist layer</b> is where much of the heavy lifting happens. It is usually the appropriate programming paradigm when writing a web application server, a UI application, or a script to be shared with other developers. There are two types of Artist objects: 
    * The <b>primitive type</b>, such as a line, a rectangle, a circle, or text. 
    * The <b>composite type</b>, such as the figure or the axes.  
    The top-level Matplotlib object that contains and manages all of the elements in a given graphic is the <b>figure artist</b>, and the most important composite artist is the <b>axes</b>, because it is where most of the Matplotlib API plotting methods are defined, including methods to create and manipulate the ticks, the axis lines, the grid, and the plot background. 
3. The <b>Scripting layer</b> is the appropriate layer for everyday purposes. It is a lighter interface that simplifies common tasks and allows quick and easy generation of graphics and plots. Matplotlib's scripting layer is essentially the matplotlib.pyplot interface, which automates the process of defining a canvas, defining a figure artist instance, and connecting them.
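
To make the three layers concrete, here is a minimal sketch that builds a histogram through the Artist layer without pyplot. The data is synthetic and the Agg canvas is just one possible back end; both are assumptions chosen for illustration.

```python
import io

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Synthetic data, used only for illustration.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

# Artist layer: create the figure artist and attach a canvas explicitly.
fig = Figure()
canvas = FigureCanvasAgg(fig)    # back-end layer: the Agg canvas renders raster images
ax = fig.add_subplot(1, 1, 1)    # composite artist: the axes
ax.hist(samples, bins=50)        # adds primitive artists (rectangles)
ax.set_title("Normal distribution, 1000 samples")

# Render to an in-memory PNG rather than to a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

With the scripting layer the same plot would need only `plt.hist(samples, 50)` and `plt.title(...)`, because pyplot creates and connects the figure and canvas behind the scenes.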

[Further reading on matplotlib](https://www.aosabook.org/en/matplotlib.html)

### Basic Plotting with Matplotlib
We will be working mostly with the scripting interface. In other words, we will learn how to create almost all of the visualization tools using the scripting interface.

Limitations:  
With the <b>%matplotlib inline</b> backend, once a figure has been rendered there is no way to modify it, for example to add a figure title or label its axes. With the <b>%matplotlib notebook</b> backend in place, whenever a plt function is called it checks whether an active figure exists, and any functions you call are applied to this active figure. If a figure does not exist, it renders a new figure.  

    import matplotlib as mpl
    import matplotlib.pyplot as plt  # the scripting interface

#### Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Pandas also has a built-in plotting interface based on Matplotlib, so plotting is as simple as calling the .plot() method on a series or dataframe.

Documentation:

* Plotting with Series
* Plotting with Dataframes

<b>Indexing and Selection (slicing)</b>

Select Column:  
There are two ways to filter on a column name:
* Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

    df.column_name   (returns series)
* Method 2: More robust, and can filter on multiple columns.

    df['column']  (returns series)  
    df[['column 1', 'column 2']]   (returns dataframe)

Select Row:  
There are two main ways to select rows:

1.    df.loc[label]        
        # filters by the labels of the index/column
2.    df.iloc[index]       
        # filters by the positions of the index/column

<b>Filtering</b>
1. Create the boolean condition series:  
    condition = df_can['Continent'] == 'Asia'  
    df_can[condition]
2. Combine multiple conditions with & and parentheses:  
    df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]
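
The selection and filtering idioms above can be tried on a tiny table. This is a sketch on a hypothetical three-row dataframe whose values are invented; only the column names mirror the exercise data.

```python
import pandas as pd

# Hypothetical miniature table; the numbers are made up for illustration.
df = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Europe"],
        "Region": ["Southern Asia", "Eastern Asia", "Northern Europe"],
        "2013": [100, 200, 50],
    },
    index=["India", "China", "Iceland"],
)

by_attribute = df.Continent               # Series; needs a valid identifier as name
by_key = df["2013"]                       # Series; works for any column name
two_columns = df[["Continent", "2013"]]   # DataFrame

row_by_label = df.loc["India"]            # select by index label
row_by_position = df.iloc[0]              # select by integer position

# Each condition is wrapped in parentheses and combined with &.
condition = (df["Continent"] == "Asia") & (df["Region"] == "Southern Asia")
southern_asia = df[condition]
```

Note that `df.2013` is not valid Python, which is why the bracket form is the more robust of the two column selectors.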


### Line Plots
A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields and is best suited for trend-based visualizations. The best use case for a line plot is when 
* you have a continuous dataset
* you're interested in visualizing the data over a period of time  

### Other Plots
There are many plotting styles available besides the default line plot, all of which can be accessed by passing the kind keyword to plot(). The full list of available plots is as follows:

* bar, for vertical bar plots
* barh, for horizontal bar plots
* hist, for histogram
* box, for boxplot
* kde or density, for density plots
* area, for area plots
* pie, for pie plots
* scatter, for scatter plots
* hexbin, for hexbin plot

Exercise Data preparation:
1. Import the libraries:  
    import numpy as np  # useful for scientific computing in Python  
    import pandas as pd # primary data structure library
2. Read the data:  
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',  
    sheet_name='Canada by Citizenship',  
    skiprows=range(20),  
    skipfooter=2)
3. Review the first five rows and the tail: df_can.head() and / or df_can.tail()
4. Analyse the data set:  
    df_can.info()  
    df_can.columns.values  
    df_can.index.values  
    df_can.shape  
    df_can.isnull()  
    df_can.describe()
5. Clean the data set of unnecessary columns and review the head:  
    df_can.drop(['Clmn1','Clmn...','ClmnN'], axis=1, inplace=True)
6. Rename headers to more readable "headings":  
    df_can.rename(columns={'OldName1':'NewName1', '...':'...', 'OldNameN':'NewNameN'}, inplace=True)
7. Add missing columns, e.g. a row total (numeric_only=True skips the text columns):  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)
8. Check if it worked:  
    df_can.isnull().sum()
9. Select the desired column as index:  
    df_can.set_index('Country', inplace=True)
10. Convert the column names to strings to avoid confusion with integer data:  
    df_can.columns = list(map(str, df_can.columns))
11. Assign a range of column names, e.g. 1980 to 2013, to one variable:  
    years = list(map(str, range(1980,2014)))  
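
The preparation steps above can be rehearsed without downloading the spreadsheet. The following sketch applies them to a hypothetical two-row stand-in for the raw sheet; the column names follow the exercise, the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the raw sheet; the numbers are invented.
raw = pd.DataFrame(
    {
        "Type": ["Immigrants", "Immigrants"],
        "OdName": ["Haiti", "Iceland"],
        "AreaName": ["Latin America and the Caribbean", "Europe"],
        1980: [1666, 17],
        1981: [3692, 33],
    }
)

df = raw.drop(["Type"], axis=1)                       # drop unneeded columns
df = df.rename(columns={"OdName": "Country", "AreaName": "Continent"})
df = df.set_index("Country")                          # country names become the index
df.columns = list(map(str, df.columns))               # 1980 -> '1980'
years = list(map(str, range(1980, 1982)))             # ['1980', '1981']
df["Total"] = df[years].sum(axis=1)                   # sum only the year columns
```

Summing over `df[years]` is one way to keep text columns such as `Continent` out of the total; passing `numeric_only=True` to `sum` is another.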

Exercise Data visualisation:
1. Import Matplotlib and matplotlib.pyplot:  
    %matplotlib inline  
    import matplotlib as mpl  
    import matplotlib.pyplot as plt
2. Extract the data you want to plot:  
    haiti = df_can.loc['Haiti', years]
3. Plot the data:  
    haiti.plot()
4. Change the index values to integers for plotting:  
    haiti.index = haiti.index.map(int)
5. Label the axes, add a title, and annotate special events:  
    haiti.plot(kind='line')  
    plt.title('Immigration from Haiti')  
    plt.ylabel('Number of immigrants')  
    plt.xlabel('Years')  
    plt.text(2000, 6000, '2010 Earthquake')  
    plt.show()


[Home](#home)

<a name="tool"></a>
## Basic Visualisation Tools
### Area plots
An area plot, also known as an area chart or area graph, is a type of plot that depicts accumulated totals using numbers or percentages over time. It is based on the line plot and is commonly used when trying to compare two or more quantities.

An <b>area chart</b> is very similar to a line chart, except that the area between the x axis and the line is filled in with color or shading. It represents the evolution of one numerical variable as a function of another. If you want to represent this evolution for several groups at the same time, you are probably interested in a <b>stacked area chart</b>, where the groups are displayed on top of each other. 

### Histograms
A histogram is a way of representing the frequency distribution of a numeric dataset. The way it works is it partitions the spread of the numeric data into bins, assigns each datapoint in the dataset to a bin, and then counts the number of datapoints that have been assigned to each bin. So the vertical axis is actually the <b>frequency or the number of datapoints in each bin</b>.

### Bar Charts
A bar chart is a very popular visualization tool. Unlike a histogram, a bar chart, also known as a bar graph, is a type of plot where the length of each bar is proportional to the value of the item that it represents. It is commonly used to compare the values of a variable at a given point in time.

#### Two types of plotting
There are two styles of plotting with Matplotlib: plotting using the <b>Artist layer</b> and plotting using the <b>scripting layer</b>.

* Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt':  
    You can use plt, i.e. matplotlib.pyplot, and add more elements by calling different methods procedurally; for example, plt.title(...) to add a title or plt.xlabel(...) to add a label to the x-axis.

* Option 2: Artist layer (object-oriented method) - using an Axes instance from Matplotlib:  
    You can store the Axes instance of your current plot in a variable (e.g. ax). You can add more elements by calling methods with a small change in syntax (adding "set_" to the previous method names). For example, use ax.set_title() instead of plt.title() to add a title, or ax.set_xlabel() instead of plt.xlabel() to add a label to the x-axis.
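
A side-by-side sketch of the two options, using made-up values. The Agg backend line is an assumption so the snippet also runs without a display; in a notebook it can be dropped and `plt.show()` added at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

x = [1980, 1990, 2000, 2010]
y = [10, 20, 15, 30]   # made-up values

# Option 1: scripting layer. pyplot keeps track of the current figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("Scripting layer")
plt.xlabel("Year")
scripted_ax = plt.gca()   # the axes pyplot was drawing on

# Option 2: artist layer. Call set_* methods on an explicit Axes object.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Artist layer")
ax.set_xlabel("Year")
```

Both routes end up configuring the same kind of Axes object; the artist-layer form is usually easier to follow once a figure contains several subplots.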

### Lab Exercise
#### Downloading and Prepping Data
1. Import libraries:  
    import numpy as np  
    import pandas as pd
2. Download the dataset into a dataframe:
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',  
    sheet_name='Canada by Citizenship',  
    skiprows=range(20),  
    skipfooter=2)
3. Check and clean up the data:  
    df_can.head()  
    print(df_can.shape)  
    df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)  
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)  
4. Ensure all column labels are strings:  
    all(isinstance(column, str) for column in df_can.columns)  # check: returns False  
    df_can.columns = list(map(str, df_can.columns))  # convert the labels to strings  
    all(isinstance(column, str) for column in df_can.columns)  # check again: returns True  
5. Set the Country name as index:  
    df_can.set_index('Country', inplace=True)
6. Add a Total column, e.g. sum (numeric_only=True skips the remaining text columns):  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)
7. Define the years range as a variable:  
    years = list(map(str, range(1980, 2014)))

#### Visualizing Data using Matplotlib
Import the library:  
    %matplotlib inline  
    import matplotlib as mpl  
    import matplotlib.pyplot as plt  
    mpl.style.use('ggplot') # optional: for ggplot-like style  
    print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

<b>Area plot</b>  
1. Load the subdata from the dataframe:  
    df_can.sort_values(['Total'], ascending=False, axis=0, inplace=True)  
    df_top5 = df_can.head(5)  # the five countries with the highest totals  
    df_top5 = df_top5[years].transpose()  
    df_top5.head()
2. Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values. To produce an unstacked plot, pass <b>stacked=False</b>:  
    df_top5.index = df_top5.index.map(int)  
    df_top5.plot(kind='area', stacked=False, figsize=(20, 10))  
    
    plt.title('Immigration Trend of Top 5 Countries')  
    plt.ylabel('Number of Immigrants')  
    plt.xlabel('Years')  
    plt.show()  
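
The effect of the stacked keyword can be checked on a small example. This sketch uses an invented two-column dataframe; the column names and the Agg backend line are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import pandas as pd

# Invented yearly counts for two hypothetical countries.
df_demo = pd.DataFrame(
    {"Country A": [10, 20, 30], "Country B": [30, 30, 30]},
    index=[1980, 1981, 1982],
)

ax_stacked = df_demo.plot(kind="area")                    # default: series are summed
ax_unstacked = df_demo.plot(kind="area", stacked=False)   # series overlap instead
```

In the stacked version the top edge reaches the sum of both columns (60 in the last year), whereas the unstacked version tops out at the larger single column (30).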
    
<b>Histograms</b>  
A histogram is a way of representing the frequency distribution of a numeric dataset. It partitions the x-axis into bins, assigns each data point in the dataset to a bin, and then counts the number of data points that have been assigned to each bin.

Using NumPy:  
1. Create and check the bins:  
    df_can['2013'].head()  
    count, bin_edges = np.histogram(df_can['2013'])  
    print(count) # frequency count  
    print(bin_edges) # bin ranges, default = 10 bins  
2. Check plot:  
    df_can['2013'].plot(kind='hist', figsize=(8, 5))  
    plt.title('Histogram of Immigration from 195 Countries in 2013')  
    plt.ylabel('Number of Countries')  
    plt.xlabel('Number of Immigrants')  
    plt.show()
3. The x-axis tick labels do not line up with the bin edges. This can be fixed by passing in the xticks keyword:  
    df_can['2013'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)  
    plt.title('Histogram of Immigration from 195 Countries in 2013')  
    plt.ylabel('Number of Countries')  
    plt.xlabel('Number of Immigrants')  
    plt.show()
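
To see what np.histogram returns, here is a small sketch on ten made-up values with three bins instead of the default ten.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 8, 9, 10, 10])   # made-up data

# Three equal-width bins between the minimum (1) and the maximum (10).
count, bin_edges = np.histogram(values, bins=3)

print(count)       # number of data points per bin
print(bin_edges)   # bin boundaries; always one more entry than there are bins
```

The edges come out as 1, 4, 7 and 10. Six values fall into the first bin, none into the second and four into the third, so the counts add up to the size of the dataset.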


<b>Bar Chart:</b>  
A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.

To create a bar plot, we can pass one of two arguments via the kind parameter in plot():
* kind='bar' creates a vertical bar plot
* kind='barh' creates a horizontal bar plot

1. Get the data for Iceland:  
    df_iceland = df_can.loc['Iceland', years]
2. Plot the data:  
    df_iceland.plot(kind='bar', figsize=(10, 6))  
    plt.xlabel('Year') # add x-label to the plot  
    plt.ylabel('Number of immigrants') # add y-label to the plot  
    plt.title('Icelandic immigrants to Canada from 1980 to 2013') # add title to the plot  
    plt.show()

To annotate a plot, for example with an arrow or a text label, use plt.annotate(), which takes the following parameters:
* s: str, the text of the annotation (named text in Matplotlib 3.3 and later).
* xy: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).
* xytext: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).
* xycoords: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).
* arrowprops: Takes a dictionary of properties to draw the arrow:
    * arrowstyle: Specifies the arrow style, '->' is a standard arrow.
    * connectionstyle: Specifies the connection type. arc3 is a straight line.
    * color: Specifies the color of the arrow.
    * lw: Specifies the line width.
* rotation: rotation angle of the text in degrees (counter clockwise)
* va: vertical alignment of the text ['center' | 'top' | 'bottom' | 'baseline']
* ha: horizontal alignment of the text ['center' | 'right' | 'left']
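
The parameters listed above can be combined as in the following sketch, which draws an arrow and a rotated label over a bar chart. The bar heights, coordinates and label text are invented, and the annotation text is passed positionally so the call does not depend on the name of that parameter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(range(5), [1, 2, 4, 7, 11])   # made-up bar heights

# Arrow only: empty text, tail at xytext, head at xy.
arrow = ax.annotate(
    "",
    xy=(4, 11),
    xytext=(0, 2),
    xycoords="data",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3", color="blue", lw=2),
)

# Text only: rotated label with explicit alignment.
label = ax.annotate("Upward trend", xy=(1, 6), rotation=45, va="bottom", ha="left")
```

Splitting the arrow and the label into two calls keeps the text rotation independent of the arrow geometry.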

<b>Tip:</b>  
For a full listing of colors available in Matplotlib, run the following code in your Python shell:

    import matplotlib.colors

    for name, hex_code in matplotlib.colors.cnames.items():
        print(name, hex_code)

[Home](#home)

<a name= "spec"></a>
## Specialised Visualisation Tools
### Pie Chart
A pie chart is a circular graphic that displays numeric proportions by dividing a circle (or pie) into proportional slices. You are most likely already familiar with pie charts, as they are widely used in business and media. We can create pie charts in Matplotlib by passing in the <b>kind='pie'</b> keyword.

Plot the data by passing in the kind='pie' keyword, along with the following additional parameters:
* <b>autopct</b> - is a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be fmt%pct.
* <b>startangle</b> - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
* <b>shadow</b> - Draws a shadow beneath the pie (to give a 3D feel).

To improve the visuals:
* Remove the text labels on the pie chart by passing in labels=None, and add them back as a separate legend using plt.legend().
* Push out the percentages to sit just outside the pie chart by passing in the pctdistance parameter.
* Pass in a custom set of colors for continents by passing in the colors parameter.
* Explode the pie chart to emphasize the lowest three continents (Africa, North America, and Latin America and Caribbean) by passing in the explode parameter.

### Box Plots
A box plot is a way of statistically representing the distribution of given data through five main dimensions. 
1. The first dimension is the minimum, which is the smallest number in the sorted data. 
2. The second dimension is the first quartile, which is the point 25% of the way through the sorted data. In other words, a quarter of the data points are less than this value. 
3. The third dimension is the median, which is the middle value of the sorted data. 
4. The fourth dimension is the third quartile, which is the point 75% of the way through the sorted data. In other words, three-quarters of the data points are less than this value. 
5. The final dimension is the maximum, which is the highest number in the sorted data. 

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Images/boxplot_complete.png" alt="Box plot example" width="440" align="center">
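
The five dimensions can be computed directly with NumPy. A minimal sketch on seven made-up values, using NumPy's default linear interpolation between data points:

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13])   # made-up, already sorted

# 0th, 25th, 50th, 75th and 100th percentiles give the five-number summary.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```

For this data the summary is 1, 4, 7, 10 and 13. Note that Matplotlib's box plot by default draws its whiskers only out to 1.5 times the interquartile range and marks points beyond that as outliers, so the whisker ends can differ from the raw minimum and maximum.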

### Scatter Plots
A scatter plot is a type of plot that displays values pertaining to typically two variables against each other. Usually it is a dependent variable to be plotted against an independent variable in order to determine if any correlation between the two variables exists.

A scatter plot (2D) is a useful method of comparing variables against each other. Scatter plots look similar to line plots in that they both map independent and dependent variables on a 2D graph. While the datapoints are connected together by a line in a line plot, they are not connected in a scatter plot. The data in a scatter plot is considered to express a trend. With further analysis using tools like regression, we can mathematically calculate this relationship and use it to predict trends outside the dataset.

### Bubble Plots
A bubble plot is a variation of the scatter plot that displays three dimensions of data (x, y, z). The data points are replaced with bubbles, and the size of each bubble is determined by the third variable 'z', also known as the weight. In Matplotlib, we can pass an array or scalar containing the weight of each point to the keyword s of plot().

### Subplots
Often times we might want to plot multiple plots within the same figure. For example, we might want to perform a side by side comparison of the box plot with the line plot of China and India's immigration.

To visualize multiple plots together, we can create a figure (overall canvas) and divide it into subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the scripting layer.

Typical syntax is:

    fig = plt.figure() # create figure
    ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots
    
<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Images/Mod3Fig5Subplots_V2.png" alt="Subplots" width="500" align="center">

### Lab Exercise
#### Downloading and Prepping Data
1. Import libraries:  
    import numpy as np  
    import pandas as pd
2. Download the dataset into a dataframe:
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',  
    sheet_name='Canada by Citizenship',  
    skiprows=range(20),  
    skipfooter=2)
3. Check and clean up the data:  
    df_can.head()  
    print(df_can.shape)  
    df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)  
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)  
4. Ensure all column labels are strings:  
    all(isinstance(column, str) for column in df_can.columns)  # check: returns False  
    df_can.columns = list(map(str, df_can.columns))  # convert the labels to strings  
    all(isinstance(column, str) for column in df_can.columns)  # check again: returns True  
5. Set the Country name as index:  
    df_can.set_index('Country', inplace=True)
6. Add a Total column, e.g. sum (numeric_only=True skips the remaining text columns):  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)
7. Define the years range as a variable:  
    years = list(map(str, range(1980, 2014)))

#### Visualizing Data using Matplotlib
Import the library:  
    %matplotlib inline  
    import matplotlib as mpl  
    import matplotlib.pyplot as plt  
    mpl.style.use('ggplot') # optional: for ggplot-like style  
    print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

<b>Pie Charts:</b>  
1. Gather the data with the pandas "groupby" function, which works in three steps:  
    1.1 Split: Splitting the data into groups based on some criteria.  
    1.2 Apply: Applying a function to each group independently, e.g.:
        * .sum()
        * .count()
        * .mean() 
        * .std() 
        * .aggregate()
        * .apply()
        * etc.

    1.3 Combine: Combining the results into a data structure, e.g.  
    df_continents = df_can.groupby('Continent').sum(numeric_only=True)  
    print(type(df_can.groupby('Continent')))  
    df_continents.head()
2. Print the Pie chart:  
    df_continents['Total'].plot(kind='pie',   
        figsize=(5, 6),  
        autopct='%1.1f%%', # add in percentages  
        startangle=90,     # start angle 90° (Africa)  
        shadow=True)      # add shadow       
    plt.title('Immigration to Canada by Continent [1980 - 2013]')  
    plt.axis('equal') # Sets the pie chart to look like a circle.  
    plt.show()  
3. Improve the visualisation:  
    colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']  
    explode_list = [0.1, 0, 0, 0, 0.1, 0.1] # ratio for each continent with which to offset each wedge.  
    df_continents['Total'].plot(kind='pie',
        figsize=(15, 6),
        autopct='%1.1f%%', 
        startangle=90,    
        shadow=True,     
        labels=None,         # turn off labels on pie chart
        pctdistance=1.12,    # the ratio between the center of each pie slice and the start of the text generated by autopct 
        colors=colors_list,  # add custom colors
        explode=explode_list) # 'explode' lowest 3 continents
    plt.title('Immigration to Canada by Continent [1980 - 2013]', y=1.12) 
    plt.axis('equal') 
    plt.legend(labels=df_continents.index, loc='upper left') 
    plt.show()
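
The split, apply and combine steps behind the pie chart can be followed on a tiny table. This sketch uses a hypothetical three-row dataframe with invented totals; the percentages it computes are what autopct would print on the wedges.

```python
import pandas as pd

# Hypothetical data; the totals are invented.
df = pd.DataFrame(
    {"Continent": ["Asia", "Asia", "Europe"], "Total": [10, 30, 60]},
    index=["India", "China", "Iceland"],
)

grouped = df.groupby("Continent")      # split: one group per continent
totals = grouped["Total"].sum()        # apply sum() per group, combine into a Series
shares = totals / totals.sum() * 100   # percentage of the whole per continent

print(totals)
print(shares)
```

Here Asia collects 40 and Europe 60, so the pie would show wedges of 40% and 60%.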

<b>Box Plot:</b>  
1. Get the data set:  
    df_japan = df_can.loc[['Japan'], years].transpose()  
    df_japan.head()
2. Plot the box plot:  
    df_japan.plot(kind='box', figsize=(8, 6))  
    plt.title('Box plot of Japanese Immigrants from 1980 - 2013')  
    plt.ylabel('Number of Immigrants')  
    plt.show()
3. Check the graph against the summary statistics:  
    df_japan.describe()
4. Horizontal plots, here for China and India:  
    df_CI = df_can.loc[['China', 'India'], years].transpose()  
    df_CI.plot(kind='box', figsize=(10, 7), color='blue', vert=False)  
    plt.title('Box plots of Immigrants from China and India (1980 - 2013)')  
    plt.xlabel('Number of Immigrants')  
    plt.show()

<b>Scatter plots:</b>  
1. Get the data set and convert the years to int:  
    df_tot = pd.DataFrame(df_can[years].sum(axis=0))  
    df_tot.index = df_tot.index.map(int)  
    df_tot.reset_index(inplace = True)  
    df_tot.columns = ['year', 'total']  
2. View the final dataframe  
    df_tot.head()
3. Plot the data  
    df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue')  
    plt.title('Total Immigration to Canada from 1980 - 2013')  
    plt.xlabel('Year')  
    plt.ylabel('Number of Immigrants')  
    plt.show()  
4. Line of best fit  
    x = df_tot['year']      # year on x-axis  
    y = df_tot['total']     # total on y-axis  
    fit = np.polyfit(x, y, deg=1)  
    fit
5. Plot the regression line  
    df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue') 
    plt.title('Total Immigration to Canada from 1980 - 2013')  
    plt.xlabel('Year')  
    plt.ylabel('Number of Immigrants')  
    plt.plot(x, fit[0] * x + fit[1], color='red') # recall that x is the Years  
    plt.annotate('y={0:.0f} x + {1:.0f}'.format(fit[0], fit[1]), xy=(2000, 150000))  
    plt.show()
6. Print out the line of best fit  
    'No. Immigrants = {0:.0f} * Year + {1:.0f}'.format(fit[0], fit[1]) 
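
To check how np.polyfit behaves, it helps to fit data whose answer is known in advance. The sketch below uses made-up points that lie exactly on a line, so the fit should recover the slope and intercept that generated them.

```python
import numpy as np

x = np.arange(1980, 1990)    # ten consecutive years
y = 3.0 * x + 5.0            # made-up values lying exactly on a line

# deg=1 fits a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to extrapolate one year beyond the data.
predicted_1990 = slope * 1990 + intercept

print('y = {0:.0f} * x + {1:.0f}'.format(slope, intercept))
```

With real data the points scatter around the line, and the same two coefficients describe the least-squares best fit rather than an exact relationship.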

<b>Bubble Plots:</b>  
1. Get the data  
    df_can_t = df_can[years].transpose() # transposed dataframe  
    df_can_t.index = df_can_t.index.map(int)  
    df_can_t.index.name = 'Year'  
    df_can_t.reset_index(inplace=True)
2. Create the normalised weights  
    norm_brazil = (df_can_t['Brazil'] - df_can_t['Brazil'].min()) / (df_can_t['Brazil'].max() - df_can_t['Brazil'].min())  
    norm_argentina = (df_can_t['Argentina'] - df_can_t['Argentina'].min()) / (df_can_t['Argentina'].max() - df_can_t['Argentina'].min())  
3. Plot the bubble chart  
    ax0 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Brazil',
                    figsize=(14, 8),
                    alpha=0.5,                  # transparency
                    color='green',
                    s=norm_brazil * 2000 + 10,  # pass in weights 
                    xlim=(1975, 2015)
                   )
    ax1 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Argentina',
                    alpha=0.5,
                    color="blue",
                    s=norm_argentina * 2000 + 10,
                    ax = ax0
                   )
    ax0.set_ylabel('Number of Immigrants')  
    ax0.set_title('Immigration from Brazil and Argentina from 1980 - 2013')  
    ax0.legend(['Brazil', 'Argentina'], loc='upper left', fontsize='x-large')
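
The normalised weights in step 2 are a min-max scaling: each value is mapped onto the range 0 to 1, then stretched into a marker size. A sketch on three made-up values:

```python
import pandas as pd

counts = pd.Series([10, 20, 40])   # made-up yearly counts

# Min-max scaling: smallest value -> 0, largest value -> 1.
norm = (counts - counts.min()) / (counts.max() - counts.min())

# Same transformation as in the plot call: scale up and add a floor of 10
# so that the smallest bubble is still visible.
sizes = norm * 2000 + 10

print(norm.tolist())
print(sizes.tolist())
```

The smallest count ends up with marker size 10 and the largest with 2010, regardless of the absolute scale of the original numbers, which makes bubbles from different countries comparable within one chart.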



<b>Subplots:</b>  
1. Create the figure (df_CI holds the China and India data, i.e. df_can.loc[['China', 'India'], years].transpose()):  
    fig = plt.figure()  
    ax0 = fig.add_subplot(1, 2, 1) # add subplot 1 (1 row, 2 columns, first plot)  
    ax1 = fig.add_subplot(1, 2, 2) # add subplot 2 (1 row, 2 columns, second plot)  
2. Subplot 1: Box plot  
    df_CI.plot(kind='box', color='blue', vert=False, figsize=(20, 6), ax=ax0) # add to subplot 1  
    ax0.set_title('Box Plots of Immigrants from China and India (1980 - 2013)')  
    ax0.set_xlabel('Number of Immigrants')  
    ax0.set_ylabel('Countries')
3. Subplot 2: Line plot  
    df_CI.plot(kind='line', figsize=(20, 6), ax=ax1) # add to subplot 2  
    ax1.set_title('Line Plots of Immigrants from China and India (1980 - 2013)')  
    ax1.set_ylabel('Number of Immigrants')  
    ax1.set_xlabel('Years')
4. Render  
    plt.show()



[Home](#home)

<a name='adv'></a>
## Advanced Visualisation
### Waffle Chart
A waffle chart is a great way to visualize data in relation to a whole or to highlight progress against a given threshold. For a waffle chart with a defined height and width, the contribution of each country is transformed into a number of tiles proportional to that country's share of the total: the larger the contribution, the more tiles it gets. Combined, the tiles resemble a waffle.

A waffle chart is normally created to display progress toward goals. It is often an effective option when you want to add an interesting feature to a visual that consists mainly of cells, such as an Excel dashboard.

### Word Clouds
A word cloud (also known as a text cloud or tag cloud) is a depiction of the importance of different words in a body of text. It works in a simple way: the more often a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud. Even if we know nothing about the content of a set of documents, a word cloud can be very useful for assigning a topic to the unknown text.

Luckily, a Python package for generating word clouds already exists. The package, called word_cloud, was developed by **Andreas Mueller**. You can learn more about the package by following this [link](https://github.com/amueller/word_cloud/).
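
The counting idea behind a word cloud can be sketched with the standard library alone. This is not how the word_cloud package is implemented; the sample sentence, the stop-word set and the font-size formula are all assumptions made up for illustration.

```python
import re
from collections import Counter

text = "The plot shows the data, and the data drives the plot of the data."
stopwords = {"the", "and", "of"}   # hypothetical stop-word list

# Lower-case the text, split it into words and drop the stop words.
words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
frequencies = Counter(words)

# Hypothetical sizing rule: more frequent words get a larger font size.
max_count = max(frequencies.values())
font_sizes = {word: 10 + 30 * count / max_count for word, count in frequencies.items()}

print(frequencies.most_common(2))
```

In this sample "data" occurs three times and "plot" twice, so they would be drawn largest. Removing stop words first matters, because otherwise "the" would dominate the cloud.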

### Seaborn and Regression
Seaborn is another data visualization library, built on top of Matplotlib. It was designed primarily to provide a high-level interface for drawing attractive statistical graphics, such as regression plots, box plots, and so on. Seaborn makes creating such plots very efficient: you can often generate them with roughly a fifth of the code that plain Matplotlib would need.

### Lab Exercise
#### Downloading and Prepping Data
1. Import libraries:  
    import numpy as np  
    import pandas as pd  
    from PIL import Image # converting images into arrays
2. Download the dataset into a dataframe:
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',  
    sheet_name='Canada by Citizenship',  
    skiprows=range(20),  
    skipfooter=2)
3. Check and clean up the data:  
    df_can.head()  
    print(df_can.shape)  
    df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)  
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)  
4. Ensure all column labels are strings:  
    all(isinstance(column, str) for column in df_can.columns)  # check: returns False  
    df_can.columns = list(map(str, df_can.columns))  # convert the labels to strings  
    all(isinstance(column, str) for column in df_can.columns)  # check again: returns True  
5. Set the Country name as index:  
    df_can.set_index('Country', inplace=True)
6. Add a Total column, e.g. sum (numeric_only=True skips the remaining text columns):  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)
7. Define the years range as a variable:  
    years = list(map(str, range(1980, 2014)))

#### Visualizing Data using Matplotlib
Import the library:  
    %matplotlib inline  
    import matplotlib as mpl  
    import matplotlib.pyplot as plt  
    import matplotlib.patches as mpatches # needed for waffle Charts  
    mpl.style.use('ggplot') # optional: for ggplot-like style  
    print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

<b>Waffle Charts:</b>  
1. Get the data  
    df_dsn = df_can.loc[['Denmark', 'Norway', 'Sweden'], :]  
    df_dsn
2. Compute the proportion of each category with respect to the total  
    total_values = sum(df_dsn['Total'])  
    category_proportions = [(float(value) / total_values) for value in df_dsn['Total']]  
    for i, proportion in enumerate(category_proportions):  
        print(df_dsn.index.values[i] + ': ' + str(proportion))  
3. Define the overall size of the Waffle:  
    width = 40 # width of chart  
    height = 10 # height of chart  
    total_num_tiles = width * height # total number of tiles  
    print('Total number of tiles is ', total_num_tiles)
4. Use the proportion of each category to determine its number of tiles  
    tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]  
    for i, tiles in enumerate(tiles_per_category):  
        print(df_dsn.index.values[i] + ': ' + str(tiles))
5. Create the matrix that resembles the Waffle chart  
    waffle_chart = np.zeros((height, width))  
    category_index = 0  
    tile_index = 0  
    for col in range(width):  
        for row in range(height):  
            tile_index += 1  
            if tile_index > sum(tiles_per_category[0:category_index]):
                category_index += 1       
            waffle_chart[row, col] = category_index   
    print ('Waffle chart populated!')
6. Map the Waffle Chart into a visual  
    fig = plt.figure()  
    colormap = plt.cm.coolwarm  
    plt.matshow(waffle_chart, cmap=colormap)  
    plt.colorbar()
7. Prettify the chart  
    fig = plt.figure()  
    colormap = plt.cm.coolwarm  
    plt.matshow(waffle_chart, cmap=colormap)  
    plt.colorbar()  
    ax = plt.gca()  
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)  
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)  
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)  
    plt.xticks([])  
    plt.yticks([])
8. Create a legend  
    fig = plt.figure()  
    colormap = plt.cm.coolwarm  
    plt.matshow(waffle_chart, cmap=colormap)  
    plt.colorbar()  
    
    ax = plt.gca()  
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)  
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)  
    
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)  
    plt.xticks([])  
    plt.yticks([])  
    
    values_cumsum = np.cumsum(df_dsn['Total'])  
    total_values = values_cumsum.iloc[-1] # the last cumulative value is the grand total  
    
    legend_handles = []  
    for i, category in enumerate(df_dsn.index.values):  
        label_str = category + ' (' + str(df_dsn['Total'].iloc[i]) + ')' # .iloc: select by position  
        color_val = colormap(float(values_cumsum.iloc[i])/total_values)  
        legend_handles.append(mpatches.Patch(color=color_val, label=label_str))  
    
    plt.legend(handles=legend_handles,  
        loc='lower center',  
        ncol=len(df_dsn.index.values),  
        bbox_to_anchor=(0., -0.2, 0.95, .1))
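
The tile allocation in steps 2 to 5 can be sketched without pandas or Matplotlib. This is only a minimal illustration: the totals below are made up (they are not the real immigration figures), and the `boundaries` list is a helper introduced here, not part of the lab code.

```python
# Hypothetical totals, for illustration only
totals = {'Denmark': 300, 'Norway': 200, 'Sweden': 500}

width, height = 40, 10
total_tiles = width * height

# Proportion of each category, converted to a whole number of tiles
grand_total = sum(totals.values())
tiles_per_category = [round(v / grand_total * total_tiles) for v in totals.values()]

# Cumulative tile counts: the last tile index that belongs to each category
boundaries = []
running = 0
for n in tiles_per_category:
    running += n
    boundaries.append(running)

# Fill the grid column by column, as the nested loop in step 5 does
waffle = [[0] * width for _ in range(height)]
tile_index = 0
category_index = 0
for col in range(width):
    for row in range(height):
        tile_index += 1
        # move to the next category once its boundary is passed;
        # never move past the last category, even if rounding left spare tiles
        while category_index < len(boundaries) - 1 and tile_index > boundaries[category_index]:
            category_index += 1
        waffle[row][col] = category_index + 1  # labels 1, 2, 3

print(tiles_per_category)
```

Because of rounding, the tile counts do not always add up to exactly `width * height`; the guard in the `while` loop gives any spare tiles to the last category.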
        
<b>Word Clouds:</b>  
1. Install the package  
    !conda install -c conda-forge wordcloud==1.4.1 --yes  
    
    from wordcloud import WordCloud, STOPWORDS
    print ('Wordcloud is installed and imported!')  
2. Get the data  
    !wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/alice_novel.txt  
    alice_novel = open('alice_novel.txt', 'r').read()  
    print ('File downloaded and saved!')  
3. Remove redundant "stopwords"  
    stopwords = set(STOPWORDS)
4. Create the wordcloud with Max 2000 words  
    alice_wc = WordCloud(  
        background_color='white',  
        max_words=2000,  
        stopwords=stopwords)  
    alice_wc.generate(alice_novel)
5. Display the word cloud  
    plt.imshow(alice_wc, interpolation='bilinear')  
    plt.axis('off')  
    plt.show()
6. Then you can prettify it, e.g. drop extra words and enlarge the figure (the cloud can even be rendered onto a picture)  
    stopwords.add('said') # add the word 'said' to the stopwords  
    alice_wc.generate(alice_novel)  
    fig = plt.figure()  
    fig.set_figwidth(14) # set width  
    fig.set_figheight(18) # set height  
    plt.imshow(alice_wc, interpolation='bilinear')  
    plt.axis('off')  
    plt.show()
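
A word cloud sizes each word by how often it occurs once the stopwords are removed. The counting idea can be sketched with the standard library alone; the sample sentence and the three-word stopword set are made up for illustration and are not the `STOPWORDS` shipped with the package.

```python
from collections import Counter
import re

# Made-up sample text and a tiny, hypothetical stopword list
text = "Alice said hello. The Rabbit said nothing, and Alice followed the Rabbit."
stopwords = {"the", "and", "said"}

# Lower-case the text, split it into words, then drop the stopwords
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(w for w in words if w not in stopwords)

# The most frequent words would be drawn largest in the cloud
top = counts.most_common(2)
print(top)
```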

<b>Regression plots:</b>  
1. Install the package  
    !conda install -c anaconda seaborn --yes  
    
    import seaborn as sns  
    print('Seaborn installed and imported!')
2. Create a dataframe  
    df_tot = pd.DataFrame(df_can[years].sum(axis=0))  
    df_tot.index = list(map(float, df_tot.index)) # years as floats, needed for the regression  
    df_tot.reset_index(inplace=True)  
    df_tot.columns = ['year', 'total']  
    df_tot.head()
3. Generate the regression with Seaborn  
    ax = sns.regplot(x='year', y='total', data=df_tot)  
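
Assuming `regplot` is used with its defaults, the line it draws is a straight-line least-squares fit. The sketch below computes such a fit by hand on made-up numbers (not the real totals) to show what the slope and intercept mean; `predict` is a helper introduced here for illustration.

```python
# Made-up (year, total) pairs, for illustration only
years_f = [1980.0, 1981.0, 1982.0, 1983.0]
totals = [100.0, 120.0, 140.0, 160.0]

n = len(years_f)
mean_x = sum(years_f) / n
mean_y = sum(totals) / n

# Ordinary least squares for total = slope * year + intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years_f, totals))
sxx = sum((x - mean_x) ** 2 for x in years_f)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(year):
    """Value of the fitted line at the given year."""
    return slope * year + intercept

print(slope, predict(1984.0))
```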


[Home](#home)

<a name="geo"></a>
## Visualising Geospatial Data
### Introduction to Folium
Folium is a Python data visualization library built primarily to help people visualize geospatial data. With Folium, you can create a map of any location in the world as long as you know its latitude and longitude values. 
You can also superimpose markers, as well as clusters of markers, on top of the map. Maps can be created in different styles, such as a street-level map or the Stamen styles described below. 

**Styles**  
You can create different map styles using the `tiles` parameter, e.g.:  
* **Stamen Toner**: this style is great for visualizing and exploring river meanders and coastal zones. 
* **Stamen Terrain**: this style is great for visualizing hill shading and natural vegetation colors.

### Maps with Markers
Markers can be superimposed on top of a Folium map for interesting visualizations. To do that, we need to create what is called a feature group. A newly created feature group is empty, so the next step is to create what are called children (the individual markers) and add them to the feature group.

### Choropleth Maps
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per capita income.

In order to create a choropleth map of a region of interest, Folium requires a GeoJSON file that includes geospatial data of the region. For a choropleth map of the world, we would need a GeoJSON file that lists each country along with the geospatial data that defines its borders and boundaries. Each country is one entry in the file, holding its name and the geometry of its borders.
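
A rough idea of what one entry in such a file looks like, as a heavily simplified sketch. The country `Exampleland`, its id and its rectangular border are invented, and `resolve` is a hypothetical helper that only illustrates how a dotted path such as `feature.properties.name` points into a feature.

```python
import json

# Invented, heavily simplified GeoJSON with a single rectangular "country".
# GeoJSON coordinates are written as [longitude, latitude].
geojson_text = '''
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": "XYZ",
      "properties": {"name": "Exampleland"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0], [0.0, 0.0]]]
      }
    }
  ]
}
'''

world = json.loads(geojson_text)
feature = world["features"][0]

def resolve(feature, dotted_key):
    """Follow a dotted path like 'feature.properties.name' into a feature dict."""
    value = feature
    for part in dotted_key.split(".")[1:]:  # skip the leading 'feature'
        value = value[part]
    return value

name = resolve(feature, "feature.properties.name")
print(name, feature["geometry"]["type"])
```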

### Lab Exercise
1. Download and prepare the data  
    import numpy as np  
    import pandas as pd  
2. Install Folium  
    !conda install -c conda-forge folium=0.5.0 --yes  
    import folium
    print('Folium installed and imported!')
3. Print a "world map"  
    world_map = folium.Map()  
    world_map
4. Set focus on Canada  
    world_map = folium.Map(location=[56.130, -106.35], zoom_start=4)  
    world_map
5. Stamen Toner  
    world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Toner')  
    world_map
6. Stamen Terrain Maps  
    world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Terrain')  
    world_map
7. Mapbox Bright Style  
    world_map = folium.Map(tiles='Mapbox Bright')  
    world_map


**Maps with Markers**
1. Get the dataset  
    df_incidents = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-_Previous_Year__2016_.csv')  
    print('Dataset downloaded and read into a pandas dataframe!')  
2. Check the data  
    df_incidents.head()
    df_incidents.shape
3. Limit the data  
    limit = 100  
    df_incidents = df_incidents.iloc[0:limit, :]  
    df_incidents.shape
4. Define the Geolocation of San Francisco  
    latitude = 37.77  
    longitude = -122.42
5. Create the San Francisco map  
    sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)  
    sanfran_map
6. Superimpose markers  
    incidents = folium.map.FeatureGroup()  
    
    for lat, lng in zip(df_incidents.Y, df_incidents.X):  
        incidents.add_child(  
            folium.features.CircleMarker(  
                [lat, lng],  
                radius=5, # define how big you want the circle markers to be  
                color='yellow',  
                fill=True,  
                fill_color='blue',  
                fill_opacity=0.6))  
    
    # add a pop-up label (the incident category) to each location  
    latitudes = list(df_incidents.Y)  
    longitudes = list(df_incidents.X)  
    labels = list(df_incidents.Category)  
    
    for lat, lng, label in zip(latitudes, longitudes, labels):  
        folium.Marker([lat, lng], popup=label).add_to(sanfran_map)  
    
    sanfran_map.add_child(incidents)
7. Group the markers with a marker cluster  
    from folium import plugins  
    sanfran_map = folium.Map(location = [latitude, longitude], zoom_start = 12)  
    
    incidents = plugins.MarkerCluster().add_to(sanfran_map)  
    for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):  
        folium.Marker(  
            location=[lat, lng],  
            icon=None,  
            popup=label,  
            ).add_to(incidents)   
    sanfran_map

**Choropleth Maps**  
1. Get the data set  
    df_can = pd.read_excel(  
        'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',  
        sheet_name='Canada by Citizenship',  
        skiprows=range(20),  
        skipfooter=2)  
    print('Data downloaded and read into a dataframe!')
2. Check the data  
    df_can.head()  
    print(df_can.shape)
3. Clean up the data  
    df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)  
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)  
    df_can.columns = list(map(str, df_can.columns))  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True) # numeric_only skips the text columns  
    years = list(map(str, range(1980, 2014)))  
    print ('data dimensions:', df_can.shape)
4. Get the GeoJSON file that defines the areas/boundaries of the state, county, or country  
    !wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/world_countries.json -O world_countries.json  
    print('GeoJSON file downloaded!')
5. Map the world  
    world_geo = r'world_countries.json' # geojson file  
    world_map = folium.Map(location=[0, 0], zoom_start=2, tiles='Mapbox Bright')  
6. Generate the choropleth map from the GeoJSON file and the dataframe  
    world_map.choropleth(  
        geo_data=world_geo,  
        data=df_can,  
        columns=['Country', 'Total'],  
        key_on='feature.properties.name',  
        fill_color='YlOrRd',  
        fill_opacity=0.7,   
        line_opacity=0.2,  
        legend_name='Immigration to Canada'  
        ) 
    world_map
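
The `key_on` path has to point at a GeoJSON property whose values match the first entry of `columns` (here `Country`) exactly. A quick sanity check can be sketched as below; the country names and the stripped-down feature dicts are made up for illustration and are not taken from the real files.

```python
# Made-up country names, for illustration only
data_countries = ['Canada', 'Russian Federation', 'Viet Nam']
geo_features = [
    {'properties': {'name': 'Canada'}},
    {'properties': {'name': 'Russia'}},
    {'properties': {'name': 'Vietnam'}},
]

# Names available under feature.properties.name
geo_names = {feature['properties']['name'] for feature in geo_features}

# Country values that have no identically named feature in the GeoJSON
unmatched = sorted(name for name in data_countries if name not in geo_names)
print(unmatched)
```

Any name listed in `unmatched` would need to be renamed on one side or the other before the two sources can be joined.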

[Home](#home)