# Data Visualisation Overview
In this notebook, we will practise data visualisation in Python using [Kaggle survey data](https://www.kaggle.com/kaggle/kaggle-survey-2018), in which people were asked about their use of various machine learning tools in the workplace. For general tips about making visualisations, [Google](https://material.io/design/communication/data-visualization.html) has its own in-house resource with lots of useful information. 

This notebook will be focusing on the following two packages: 


1. [**Seaborn**](https://seaborn.pydata.org/index.html) for making easy, visually appealing graphics 
    * Better default graphics, and a larger variety of graphs to enhance data communication 
    * More customisable and visually appealing (e.g., [colour palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) & [figure aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html))



2. [**Plotly Express**](https://plot.ly/python/plotly-express/) for making interactive, publication-quality graphics 
    * Can make your visuals [interactive and animated](https://plot.ly/python/animations/)
    * *Optional*: Plotly Dash to make dashboards for your plotly graphics



*Optional:* These other packages could be useful additional tools. Take a look at them, or try them out in the online practice!

3. [**Bokeh**](https://bokeh.pydata.org/en/latest/index.html), another package for making interactive plots
4. [**ggplot**](https://github.com/hadley/ggplot) (a graphics package from R, made usable in Python)


Throughout this practice, keep in mind which graphs work best with certain types of data (see the below list). 
A good visualisation resource is [data-to-viz](http://data-to-viz.com), a website that helps you choose the appropriate graphs based on the data you have.

# Load the libraries
Start by loading the libraries that are needed for all the visualisation tools we will be using.

In [None]:
import pandas as pd                        # basic data manipulation
import numpy as np                         # basic data manipulation

import matplotlib.pyplot as plt            # for basic graphical settings
import seaborn as sns                      # for seaborn visualisations
# render figures inside your notebook (sometimes needed in some versions of jupyter notebook)
%matplotlib inline

import plotly as py                        # for exporting animations as html, and using other plotly-based tools
import plotly.express as px                # for plotly express visualisations 

# Import & Clean the Data
A smaller, cleaner version of the data has already been made available on [Google Sheets](https://decd.co/vis_data). Make sure to save it to the same folder as this notebook. Once loaded, take a look at the data. 

In [None]:
# Import the data (replace this path with the location where you saved the file)
df = pd.read_csv("/Users/maurissa/Documents/datasets/kaggle_survey_data.csv") 

In [None]:
# Check your data (print is needed here: a cell only displays its last expression automatically)
print(df.sample(15))

# drop every row that contains a missing value (the majority of the data is dropped, but that's alright)
df.dropna(inplace=True)

# get info on the data
df.info()
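Because the cleaning step removes most of the rows, it can help to check how much data is lost. The following is a minimal sketch on a small made-up table; the column names `highest_edu` and `salary` and all values are assumptions for illustration, not the real survey columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data (names and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", None, "PhD", "MSc/MA"],
    "salary": [50.0, np.nan, 42.0, 80.0, 61.0],
})

missing_per_column = demo.isna().sum()   # number of missing values in each column
cleaned = demo.dropna()                  # keep only rows with no missing values
rows_dropped = len(demo) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(demo)} rows")
```

Unlike `dropna(inplace=True)`, this version keeps the original table so the two can be compared.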

# Seaborn
Seaborn is a powerful and easy-to-use graphing package for making great visuals for data exploration, analysis, and communication. Take a look at the following websites for inspiration, code, and tips & tricks: 
* [Seaborn Website](https://seaborn.pydata.org/examples/index.html)
* [Python Graph Gallery: Seaborn](https://python-graph-gallery.com/seaborn)

## 1. Colour 
In this section, we are going to use different graphs to practise various ways you can use colour: for instance, how to use built-in single colour names or colour palettes, how to make your own palettes and emphasise one part to highlight its importance, and how to choose appropriate colour combinations for your audience. 

A list of all the default colour names can be found [here](https://matplotlib.org/3.1.0/gallery/color/named_colors.html). You can also use hex colour codes; just remember to include the `#` before the 6-character code. [Here](https://htmlcolorcodes.com/) is a good website to generate hex codes. 
* [Python Graph Gallery](https://python-graph-gallery.com/33-control-colors-of-boxplot-seaborn/)
* [Making Seaborn colour palettes](https://python-graph-gallery.com/101-make-a-color-palette-with-seaborn/)

In [None]:
# set the graph theme
sns.set_style("") 

# try using "color=" for a single colour or "palette=" for a colour palette
sns.countplot(data=, x="")

# try adding "hue=" to make a grouped chart - how would you make it stacked instead of grouped?
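As one possible way to fill in the blanks above, the sketch below draws a grouped count plot and then a stacked version of the same counts. The table and its column names (`highest_edu`, `gender`) are made up for illustration. The stacked chart is built with `pd.crosstab` and pandas plotting, which is just one of several routes to a stacked bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical survey-like data (column names and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
    "gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
})

sns.set_style("whitegrid")  # other themes: "darkgrid", "white", "dark", "ticks"
fig, (ax_grouped, ax_stacked) = plt.subplots(1, 2, figsize=(12, 4))

# Grouped bars: one bar per gender within each education level
sns.countplot(data=demo, x="highest_edu", hue="gender", palette="Set2", ax=ax_grouped)

# Stacked bars: count each combination first, then let pandas stack them
counts = pd.crosstab(demo["highest_edu"], demo["gender"])
counts.plot(kind="bar", stacked=True, ax=ax_stacked)
```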

<br>***Making custom palettes (general)*** <br>
Create a list of colours, making sure the number of colours matches the number of feature levels (e.g., if a feature has 5 groups, select 5 colours). The colours are used in the order you type them. Then pass the name of the list as the palette. When you use this method, do not add `""` around the palette name.

In [None]:
# option 1 - using default colour names:
my_pal1 = [" ", " ", " ", " "]

# option 2 - using hex colour codes:
my_pal2=["#  ", "#  ", "#  ", "# "]

sns.countplot(data=, x=, palette=)
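A minimal sketch of a list palette, assuming a made-up `highest_edu` column with three levels. Recent seaborn releases may warn when `palette=` is used without `hue=`, so this version assigns the same column to both; `order` and `hue_order` fix which colour goes with which level.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column name and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
})

levels = ["BSc/BA", "MSc/MA", "PhD"]           # the order the bars should appear in
my_pal = ["#1b9e77", "#d95f02", "#7570b3"]     # one hex code per level, in the same order

ax = sns.countplot(data=demo, x="highest_edu", hue="highest_edu",
                   order=levels, hue_order=levels,
                   palette=my_pal, dodge=False)
```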

<br>***Assigning a custom colour for emphasis*** <br>
It's fairly simple to highlight one particular category or level of data in your graphs, particularly in bar charts. Create a dictionary that maps each value of your categorical variable to the colour you want, then pass that dictionary to the plotting function using `palette=___`.

In [None]:
# option 1 - single value as one colour, all other values are a different colour
my_pal1={highest_edu: "red" if highest_edu == "MSc/MA" else "black" for highest_edu in df.highest_edu.unique()}

# option 2 - defining different values as different colours
my_pal2={"MSc/MA":"red", "BSc/BA":"gold", "PhD":"black", "Some univ.":"black", "Profess. degree":"black"}

sns.countplot(data=, x=, palette=)
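The dictionary itself can be built and inspected without drawing anything. This small sketch uses a hard-coded list of levels as a stand-in for `df.highest_edu.unique()`; the level names and the choice of `"lightgrey"` as the muted colour are assumptions.

```python
# Hypothetical category levels; in the notebook these would come from
# df.highest_edu.unique()
levels = ["BSc/BA", "MSc/MA", "PhD", "Some univ."]
highlight = "MSc/MA"

# Every level gets the muted colour except the one being emphasised
emphasis_pal = {
    level: ("red" if level == highlight else "lightgrey")
    for level in levels
}
print(emphasis_pal)

# The dictionary can then be passed to a plot, for example:
# sns.countplot(data=df, x="highest_edu", palette=emphasis_pal)
```

Printing the dictionary before plotting is a quick way to confirm that every level has a colour and that the highlighted value is spelled exactly as it appears in the data.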

## 2. Shapes

### lineplot( ) vs. scatterplot( )
Here we are going to see how line plots and scatter plots differ, and if/when each is useful. For both `lineplot()` and `scatterplot()`, the x-axis does not technically need to be numeric. However, they work best with variables that are either numeric, or categorical with many categories. 

In [None]:
# Basic lineplot
sns.lineplot(data=, x=, y=)

# What happens when you define a grouping variable using "hue"? 

# What happens if different numeric or categorical variables are used? Is the graph still useful?

Compare the lineplot example above to the scatterplot below. Consider how the data types affect how the graphs look, and whether they make sense. Which ones work better as scatterplots? Which work better as lineplots? 


In [None]:
# Basic scatterplot
sns.scatterplot(data=, x=, y=)

# What happens when you define a grouping variable using "hue"? 

# What happens if different numeric or categorical variables are used? How does it compare to the lineplots?

### Options with catplot( )
Other plots, like bar plots, are best for looking at a particular continuous output across the different values of a categorical variable (e.g., height of men vs. women). Using `catplot()` gives you the freedom to choose different "shapes" for your graph. This is done by setting the parameter `kind=" "` to one of these values: `point`, `bar`, `strip`, `swarm`, `box`, `violin`, or `boxen`.

In [None]:
# change the x-axis and y-axis to orient the graph to be horizontal or vertical for kind=box
sns.catplot(data=, x=, y=,
            height=7,      # to make the chart bigger 
            kind=)         # use kind= to pick the kind of catplot you want to use

In [None]:
# use "col=" to make separate plots for each value of a categorical variable assigned to col=" "
sns.catplot(data=, x=, y=,  
            kind=,
            col=)
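One way the two `catplot()` cells could be filled in is sketched here on made-up data; `role`, `gender`, and `salary` are hypothetical column names. The first call puts the categorical variable on the y-axis to get a horizontal box plot, and the second uses `col=` to draw one panel per gender.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column names and values are assumptions)
demo = pd.DataFrame({
    "role": ["Analyst", "Engineer", "Analyst", "Engineer"] * 3,
    "gender": ["Female", "Female", "Male", "Male"] * 3,
    "salary": [30, 45, 32, 47, 35, 50, 31, 44, 38, 52, 36, 49],
})

# Horizontal box plot: numeric variable on x, categorical variable on y
sns.catplot(data=demo, x="salary", y="role", kind="box", height=4)

# One strip-plot panel per gender, placed side by side
grid = sns.catplot(data=demo, x="role", y="salary", kind="strip",
                   col="gender", height=4)
```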

## 3. Relation

Sometimes it can be hard to see how different types of data relate to each other. This is where the use of shape and colour can help. Moreover, organising data in an order that makes sense can also enhance our understanding of the story being told.

### Reordering values in categorical variables
The examples below demonstrate how to reorder the values of a categorical variable on the x or y axis, and the values of your grouping variable.

In [None]:
# reordering values of your categorical variable using "order=" 
# for order, list the values of the variable you want to reorder in the order you want inside the []

sns.catplot(data=, x=, y=, kind=, 
            order=[])

In [None]:
# Reorder values for a grouping variable hue="__" by using hue_order=[] 
sns.catplot(data=, x=, y=, kind=, 
            hue=,
            hue_order=[])
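A short sketch of both reordering parameters together, again on hypothetical data: `order` controls the sequence of categories along the axis and `hue_order` controls the sequence of the groups within each category.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column names and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["PhD", "BSc/BA", "MSc/MA", "PhD", "BSc/BA", "MSc/MA"],
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "salary": [80, 40, 55, 78, 43, 57],
})

edu_order = ["BSc/BA", "MSc/MA", "PhD"]   # left-to-right order on the x-axis

grid = sns.catplot(data=demo, x="highest_edu", y="salary", kind="bar",
                   order=edu_order,
                   hue="gender", hue_order=["Male", "Female"])
```

Without `order=`, the categories would appear in the order they are first encountered in the data (here `PhD` first).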

### Bubble charts using scatterplot( )
Bubble charts show relationships between many values in a simplified way. A bubble chart is essentially a scatterplot with an added parameter `size`, which scales the dots (or bubbles) based on another continuous (or even categorical) variable; the range of dot sizes can be set with `sizes`. 

In [None]:
# scatter plot: use the "size=" parameter to scale data point size based on a third variable of your choice
# to make all points the same size, pass a single number (an integer or float) to "s=" instead
sns.scatterplot(data=, x=, y=, 
                size=, 
                size_order=[])
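The sketch below shows a bubble chart next to a plain scatter plot with one fixed point size. The data and column names are made up. `sizes=(40, 400)` sets the smallest and largest bubble area, and the fixed size is set here through matplotlib's `s=` argument, which seaborn passes on to the underlying scatter call.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data (column names and values are assumptions)
demo = pd.DataFrame({
    "years_experience": [1, 3, 5, 8],
    "salary": [30, 42, 58, 75],
    "team_size": [2, 5, 12, 30],
})

fig, (ax_bubble, ax_uniform) = plt.subplots(1, 2, figsize=(12, 4))

# Bubble chart: point area scales with team_size, between 40 and 400
sns.scatterplot(data=demo, x="years_experience", y="salary",
                size="team_size", sizes=(40, 400), ax=ax_bubble)

# Same data with every point drawn at one fixed size
sns.scatterplot(data=demo, x="years_experience", y="salary",
                s=120, ax=ax_uniform)
```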

## 4. Simplicity

Seaborn is built on top of matplotlib. Therefore, both seaborn and matplotlib code can be used to make small fixes to your visualisations. This ranges from making your graph bigger so it is easier to read, to changing the look and position of labels, legends, and tick marks. 

### Resizing Figures and Removing Figure Frames
To resize a figure, type `plt.figure(figsize=(x,y))` at the beginning of your cell block. It will automatically resize any graph within the same code block to your specifications. The values `x` and `y` are floats, which represent the width (x) and height (y) of the figure in inches. 

*Note:* In some graphs like `catplot()`, you specify the `height` and `aspect` ratio within catplot() rather than using `figsize=` .

In [None]:
# Using figsize=() to change the size of the figure
plt.figure(figsize=(18,7)) 

# basic plot
sns.countplot(data=, x=)

# Use this to remove the box around the graph
plt.box(on=None)
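A minimal sketch of both sizing approaches on made-up data: `plt.figure(figsize=...)` for an ordinary plot, and `height`/`aspect` for `catplot()`, where the width of each panel is `height * aspect`. The column name is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data (column name and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
})

# Ordinary plot: set the figure size first, then draw into it
fig = plt.figure(figsize=(10, 4))        # 10 inches wide, 4 inches tall
ax = sns.countplot(data=demo, x="highest_edu")
plt.box(False)                           # hide the frame around the current plot

# catplot creates its own figure, so size it with height and aspect instead
grid = sns.catplot(data=demo, x="highest_edu", kind="count",
                   height=4, aspect=2.5)  # 4 inches tall, 4 * 2.5 = 10 inches wide
```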

### Titles, Axis Labels, and Tick Marks
These are useful parameters for changing the titles and axis labels, as seaborn uses the feature name as the default label. However, not all graphs need labels or titles, so sometimes it is best to remove them. 

In [None]:
# basic plot
graph = sns.countplot(data=, hue=, x=)

# change the title and axis names and settings: fontsize and fontdict also work with set_ylabel and set_xlabel (which use labelpad= instead of pad=)
graph.set_title("title goes here", 
                pad=10,
                fontsize=14, 
                fontdict={"weight": "bold", "color": "maroon", "family": "sans-serif"}) #examples you can change

graph.set_ylabel("your y-axis label")

graph.set_xlabel("your x-axis label")


# if you want to remove the y-axis and x-axis labels, use this code below instead of the above
graph.set_ylabel('')    

graph.set_xlabel('')
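Here is a small filled-in sketch of the title and label calls, using a hypothetical `highest_edu` column. It sets a styled title, removes the x-axis label, and replaces the y-axis label.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column name and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
})

ax = sns.countplot(data=demo, x="highest_edu")

ax.set_title("Highest level of education",
             pad=10,                      # space between title and plot, in points
             fontsize=14,
             fontdict={"weight": "bold", "color": "maroon"})
ax.set_xlabel("")                         # an empty string hides the label
ax.set_ylabel("Respondents", labelpad=8)  # axis labels use labelpad, not pad
```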


In [None]:
# setting font size and type, and other graphical characteristics, for all graphs in the cell
# default font scale is 1; it can be any positive number (values above 1 enlarge the text) - see what happens when you change the value
sns.set(font_scale = , 
        font="serif",   # font family - applies to the whole graph
        style=)        # options are the same styles as in sns.set_style for the graph background

# basic plot
graph = sns.countplot(data=, hue=, x=)


# set font aesthetics for x-axis tick labels (STRINGS only); can use with y-axis with "set_yticklabels"
# for labels=[], type the values/categories of the categorical variable you want to change as a list inside []

graph.set_xticklabels(labels=[], 
                      size=,      # font size (integer)
                      rotation=)  # angle of rotation of the axis labels (integer)
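The following sketch fills in the font and tick-label settings on made-up data. Fixing the bar order with `order=` and the tick positions with `set_xticks` first is a cautious choice made here so that each new label is sure to line up with the intended bar; the replacement label text is an assumption.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column name and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
})

# Global settings: larger text, serif font, white grid background
sns.set(font_scale=1.3, font="serif", style="whitegrid")

levels = ["BSc/BA", "MSc/MA", "PhD"]
new_labels = ["Bachelor's", "Master's", "Doctorate"]   # same order as levels

ax = sns.countplot(data=demo, x="highest_edu", order=levels)

# Bars sit at positions 0, 1, 2, ..., so pin the ticks there before relabelling
ax.set_xticks(range(len(levels)))
ax.set_xticklabels(labels=new_labels, size=12, rotation=45)
```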

### Legend Positions

Seaborn defaults to placing legends in what it deems the "best" location. However, that spot might not be the best for the reader (for example, it may cover up data points), or the legend may need to sit outside the graph completely. These are some useful tips on how to move the legend to your desired location.

In [None]:
# basic graph
graph = sns.countplot(data=, hue=, x=)


# Change the legend location inside the graph using .legend(loc=__), value can range from 0-10
graph.legend(loc=1,    
            frameon=False,  # removes the box around the legend
            fontsize=)    # font size of the legend   

# you can use .legend(bbox_to_anchor=(x,y)) parameter to place the legend outside of the graph
# bbox_to_anchor=(x,y) is used together with loc=__, the x and y coordinates can be positive or negative values

graph.legend(bbox_to_anchor=(1, 0.6), loc=2)    
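As a sketch of placing a legend outside the plot: `loc` names the corner of the legend box that gets pinned, and `bbox_to_anchor` gives the point (in axes coordinates, where 1 is the right or top edge) that corner is pinned to. The data, column names, and legend title are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (column names and values are assumptions)
demo = pd.DataFrame({
    "highest_edu": ["MSc/MA", "BSc/BA", "PhD", "MSc/MA", "BSc/BA", "MSc/MA"],
    "gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
})

ax = sns.countplot(data=demo, x="highest_edu", hue="gender")

# Pin the legend's upper-left corner just past the right edge of the plot
ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1),
          frameon=False, title="Gender")
```

String locations such as `"upper left"` are equivalent to the numeric codes (here `loc=2`) and tend to be easier to read.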

# Plotly Express
[Plotly Express](https://plot.ly/python/plotly-express/) is another package that can take your visualisations to a whole other level. Besides offering much of the same functionality as seaborn, it has additional graph types and lets you make your graphs [animated](https://plot.ly/python/animations/) and save them offline as html files. 

#### Making graphs interactive and animated
You can show more information when hovering over a data point by using the `hover_name="__"` parameter.

Two main animation parameters you can add to your graphs:
* Use `animation_frame="__"` to move between "frames" (e.g., each frame might be a different year to show how the graph changes over time)
* Use `animation_group="__"` to identify which data points are the same object from one frame to the next (e.g., the same country followed across years)

In [None]:
# Make a scatter plot and add animations. 
px.scatter(df, x=, y=,         
           hover_name="variable_name1",             # shows additional information from a categorical variable on hover
           animation_frame="variable_name2",
           category_orders={"variable_name2":[] })   # same as "order" and "hue_order" in seaborn if labels are out of order

# What happens if you use a different kind of plot for animations? Try using px.bar(), or px.line().

# Try using other parameters. What does "animation_group=" do?

#### Making 3-D graphs
Another powerful tool in plotly express is the ability to make 3-D graphs. Under normal circumstances, these can be very confusing to understand, and ***should be used with caution***.

In [None]:
# making a 3D scatter plot
graph = px.scatter_3d(df, x=, y=, z=, 
              color=,     # categorical grouping variable (same as "hue=" in seaborn)
              symbol= )   # changes the marker shape for each value of the categorical variable assigned to "symbol="

In [None]:
# making a 3D line chart - how does this differ from your 3D scatterplot? 
px.line_3d(df, x=, y=, z=)


# Is this more or less useful than a 3D scatterplot? 

#### Saving your graphs 
**seaborn graphs as .png files**
<br> To save your seaborn graph, use the following code, typing the name of your graph in place of `___` :

`___.get_figure().savefig("desired_filename_here.png")` 




<br>**plotly express graphs as .html files**
<br>
To save your interactive and animated graphs offline as html files, use the following code, typing the name of your graph in place of `___` :

`py.offline.plot(____, filename="desired_filename_here.html")`  