# Data Visualization (2017/18)

## Solutions for Assignment 3 - Visualizing multivariate data 

Presented by Group 60: 
- Udit Dokania
- Swapna Patil

Date: 04.12.2018

## Setup

In [1]:
import pandas as pd
import numpy as np

# import bokeh 
from bokeh.plotting import figure, show, Figure
from bokeh.models import ColumnDataSource, Label
from bokeh.models.glyphs import Text
from bokeh.palettes import Spectral3
from bokeh.layouts import row, column, gridplot

# tell bokeh to show the figures in the notebook
from bokeh.io import output_notebook
output_notebook()

Load data stored in bokeh:

In [2]:
from bokeh.sampledata.autompg import autompg
from bokeh.sampledata.iris import flowers
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Helpful functions

Group a dataframe according to a variable (species) and compute some statistics for a second variable (petal_width).

In [3]:
flowers.groupby(['species']).petal_width.describe()

| species | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| setosa | 50.0 | 0.246 | 0.105386 | 0.1 | 0.2 | 0.2 | 0.3 | 0.6 |
| versicolor | 50.0 | 1.326 | 0.197753 | 1.0 | 1.2 | 1.3 | 1.5 | 1.8 |
| virginica | 50.0 | 2.026 | 0.27465 | 1.4 | 1.8 | 2.0 | 2.3 | 2.5 |


Find the unique values of a categorical variable and count how often each occurs.

In [4]:
flowers.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [5]:
flowers.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
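
For the histogram in Exercise 1 c), relative frequencies are needed rather than raw counts. A minimal sketch of how `value_counts` can return them directly, assuming a small hand-made DataFrame (`toy` is illustrative and not part of the assignment data):

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column
toy = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "versicolor"]})

# normalize=True returns relative frequencies that sum to 1
rel_freq = toy["species"].value_counts(normalize=True)
print(rel_freq)
```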

Use NumPy to compute a histogram for quantitative data. See the NumPy documentation for further information on how to work with the output.

In [6]:
np.histogram(flowers.petal_width)

(array([41,  8,  1,  7,  8, 33,  6, 23,  9, 14], dtype=int64),
 array([0.1 , 0.34, 0.58, 0.82, 1.06, 1.3 , 1.54, 1.78, 2.02, 2.26, 2.5 ]))
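
The output is a tuple of bin counts and bin edges, and there is always one more edge than there are counts. A short sketch of how the two arrays might be turned into bar positions and probabilities, assuming a made-up sample array (`values` is illustrative only):

```python
import numpy as np

# Hypothetical sample values; any 1-D numeric array works
values = np.array([0.1, 0.2, 0.2, 0.4, 1.0, 1.4, 1.5, 2.0, 2.3, 2.5])

# np.histogram returns the counts per bin and the bin edges
counts, edges = np.histogram(values, bins=4)

# Bin centers and the common bin width, useful for drawing bars
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]

# Relative frequencies that sum to 1
probs = counts / counts.sum()
print(centers, width, probs)
```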

## Exercise 1 a): Customize a scatterplot chart

The following code skeleton renders a scatterplot. Customize the chart to your liking. Consider, for example, how it should handle a large number of data points.

This is meant to be a very quick exercise to demonstrate the concept for the following two charts.

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame or Bokeh ColumnDataSource that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Calling the scatterplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.scatter( data, x, y )
```
This is already set up in the code skeleton below.

**<font color="deeppink">Update code</font>**

In [7]:
def scatter( self, source, x, y, **kwargs ):
    # access the figure using the self variable
    self.circle( source=source, x=x, y=y, **kwargs)
    
    label = Label( x=50, y=50, x_units='screen', y_units='screen',
                  render_mode='css' )
    self.add_layout(label)

# add the function as class method to Figure    
Figure.scatter = scatter

**<font color="deeppink">Check</font>** that your code is working:

In [8]:
p = figure( plot_width=300, plot_height=300 )
p.scatter( source=flowers, x='petal_width', y='petal_length')
show(p)

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required). Think of scenarios where your code may fail.
- Test case 1: Non-numerical data in either column produces a blank graph.
- Test case 2: An incorrectly specified column name raises an error.
- Test case 3: The kwargs are passed on to the glyph only, so the user cannot use them to change figure-level settings such as the axis scale.
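
The first two scenarios could be caught before any drawing happens. A minimal sketch of such a check, assuming a pandas DataFrame as input; `check_numeric_columns` and `toy` are hypothetical names and not part of the code skeleton:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_numeric_columns(df, *columns):
    """Raise early if a column is missing or not numeric (hypothetical helper)."""
    for col in columns:
        if col not in df.columns:
            raise KeyError("column '{}' not found in the data".format(col))
        if not is_numeric_dtype(df[col]):
            raise TypeError("column '{}' is not numeric".format(col))

# Hypothetical toy data with two numeric columns and one string column
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3, 4], "label": ["x", "y"]})
check_numeric_columns(toy, "a", "b")  # passes silently
```

Calling the helper with `"label"` raises a `TypeError`, and calling it with a column name that does not exist raises a `KeyError`.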

## Exercise 1 b): Implement a boxplot chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Orientation**: Provide boxplots with horizontal and vertical orientation (call them hboxplot and vboxplot).
- **Calling the boxplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.vboxplot( data, x, y )
```
This is already set up in the code skeleton below.

Hints:
- A Bokeh sample implementation can be found here: [Boxplot](https://bokeh.pydata.org/en/latest/docs/gallery/boxplot.html)
- Adapt this implementation to work on the target variable only. See code below to get started.
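
The statistics behind a boxplot can be computed with pandas alone, independent of any drawing code. A rough sketch under the assumption of one categorical and one numeric column (`toy`, `group` and `value` are made-up names); like the gallery example, it clips the whisker limits to the observed data range:

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 20.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

groups = toy.groupby("group")["value"]
q1 = groups.quantile(0.25)
q3 = groups.quantile(0.75)
iqr = q3 - q1

# Whisker limits, clipped so they never extend beyond the data range
upper = (q3 + 1.5 * iqr).clip(upper=groups.max())
lower = (q1 - 1.5 * iqr).clip(lower=groups.min())

# Points outside the whisker limits of their own group are outliers
is_outlier = (toy["value"] > toy["group"].map(upper)) | (
    toy["value"] < toy["group"].map(lower)
)
outliers = toy[is_outlier]
print(outliers)
```

In this toy data only the value 20.0 in group `a` lies outside the limits.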

**<font color="deeppink">Implement</font>**

In [9]:
def vboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")

    groups = source.groupby([x])[y]
    
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    
    cat = q1.keys()
    def outliers(group):
        cat = group.name
        return group[(group > upper.loc[cat]) | (group < lower.loc[cat])]
    out = groups.apply(outliers).dropna()

    # clip the whisker limits to the observed data range
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper = [min(data_max, limit) for (data_max, limit) in zip(qmax, upper)]
    lower = [max(data_min, limit) for (data_min, limit) in zip(qmin, lower)]

    # stems
    self.segment(cat, upper, cat, q3, line_color="black")
    self.segment(cat, lower, cat, q1, line_color="black")
    # boxes
    self.vbar(cat, 0.7, q2, q3, fill_color="blue", line_color="black")
    self.vbar(cat, 0.7, q1, q2, fill_color="blue", line_color="black")
    # whiskers (almost-0 height rects simpler than segments)
    self.rect(cat, lower, 0.2, 0.01, line_color="black")
    self.rect(cat, upper, 0.2, 0.01, line_color="black")
    
    # outliers
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outx.append(keys[0])
            outy.append(out.loc[keys[0]].loc[keys[1]])
        self.circle(outx, outy, size=6, color="#F38630", fill_alpha=0.6)
    
    # your code goes here
    
    label = Label( x=50, y=50, x_units='screen', y_units='screen',
                    render_mode='css' )
    self.add_layout(label)

Figure.vboxplot = vboxplot

In [10]:
def hboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")

    groups = source.groupby([y])[x]
    
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    
    cat = q1.keys()
    
    def outliers(group):
        cat = group.name
        return group[(group > upper.loc[cat]) | (group < lower.loc[cat])]
    out = groups.apply(outliers).dropna()

    # clip the whisker limits to the observed data range
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper = [min(data_max, limit) for (data_max, limit) in zip(qmax, upper)]
    lower = [max(data_min, limit) for (data_min, limit) in zip(qmin, lower)]


    # stems
    self.segment(upper, cat, q3, cat, line_color="black")
    self.segment(lower, cat, q1, cat, line_color="black")
    # boxes
    self.hbar(cat, 0.7, q2, q3, fill_color="blue", line_color="black")
    self.hbar(cat, 0.7, q1, q2, fill_color="blue", line_color="black")
    # whiskers (almost-0 width rects simpler than segments)
    self.rect(lower, cat, 0.01, 0.2, line_color="black")
    self.rect(upper, cat, 0.01, 0.2, line_color="black")
    
    # outliers
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outy.append(keys[0])
            outx.append(out.loc[keys[0]].loc[keys[1]])
        self.circle(outx, outy, size=6, color="#F38630", fill_alpha=0.6)

    # your code goes here

    label = Label( x=50, y=50, x_units='screen', y_units='screen',
                    render_mode='css' )
    self.add_layout(label)

Figure.hboxplot = hboxplot

**<font color="deeppink">Check</font>** your boxplot

In [11]:
p1 = figure( plot_width=400, plot_height=300, y_range=['setosa', 'versicolor', 'virginica'] )
p1.hboxplot( flowers, 'petal_width', 'species' )
p1.xaxis.axis_label = 'petal_width'
p1.yaxis.axis_label = 'species'

p2 = figure( plot_width=400, plot_height=300, x_range=['setosa', 'versicolor', 'virginica'] )
p2.vboxplot( flowers, 'species', 'petal_width' )
p2.yaxis.axis_label = 'petal_width'
p2.xaxis.axis_label = 'species'

show( row(p1,p2))

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).
- Test case 1: An error is raised if arg1 of hboxplot( data, arg1, arg2 ) or arg2 of vboxplot( data, arg1, arg2 ) is not numeric.
- Test case 2: The kwargs are accepted but not used, so the user cannot customize the visualization through them, for example to change the axis scale.
- Test case 3: An incorrectly specified column name raises an error.

## Exercise 1 c): Implement a histogram chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **nbins**: number of bins (optional argument). If not provided set a meaningful default.
- **Data type**: Provide histograms for categorical and quantitative data.
- **Scaling**: The y-axis shall give probabilities (0,1). Scale the axis to show the full range, e.g., (-0.05,1.05).
- **Calling the histogram**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.histogram( data, x )
```

Hints:
- Assume that all categorical data has type string. Respective columns in the data can be converted using:
```
df.var = df.var.astype('str')
```
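
Note that `var` in the hint is a placeholder for a column name. Since `DataFrame.var` is itself a method, bracket indexing is the safer way to write the conversion. A small sketch, assuming a made-up frame with an integer-coded categorical column (`toy` and its columns are illustrative):

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical toy frame with an integer-coded categorical column
toy = pd.DataFrame({"cyl": [4, 6, 8, 4], "mpg": [30.1, 21.5, 15.0, 28.3]})

# Bracket indexing avoids clashes with DataFrame attributes and methods
toy["cyl"] = toy["cyl"].astype("str")

# Columns holding strings are treated as categorical
categorical = [c for c in toy.columns if is_string_dtype(toy[c])]
print(categorical)
```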

**<font color="deeppink">Implement</font>**

In [12]:
from bokeh.models import Range1d, FactorRange

def histogram( self, source, x, nbins=0, *args, **kwargs ):
    if(nbins==0):
        nbins = 9
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas.DataFrame. Received {}".format(type(source)))

    if(source[x].dtypes == object):
        data = source[x].astype('str')
        data =np.array(data)
        temp={}
        for value in data:
            if value in temp:
                temp[value] +=1
            else:
                temp[value] = 1
        hist_data = [list(temp.values()),list(temp.keys())]
        bar_width = 0.5
    else:
        # np.histogram returns (counts, bin_edges) with one more edge than counts
        counts, edges = np.histogram(source[x], nbins)
        # draw one bar per bin, centered between its two edges
        hist_data = [counts, (edges[:-1] + edges[1:]) / 2]
        bar_width = edges[1] - edges[0]
    
    x_axis = hist_data[1]
    y_axis = np.asarray(hist_data[0]) / np.sum(hist_data[0])
    self.vbar(x=x_axis, top=y_axis, width=bar_width, line_color='black', fill_color='red')
    self.y_range = Range1d(start=-0.05, end=1.05)
    # your code goes here

    label = Label( x=50, y=50, x_units='screen', y_units='screen',
                    render_mode='css' )
    self.add_layout(label)

Figure.histogram = histogram

**<font color="deeppink">Check</font>** your histogram

In [13]:
var1 = 'sepal_length'
var2 = 'species'
var3 = 'name'

p1 = figure( plot_width=200, plot_height=200 )
p1.histogram( flowers, var1 )
p1.yaxis.axis_label = 'probability'
p1.xaxis.axis_label = var1

labels = np.sort(flowers[var2].unique())
p2 = figure( plot_width=200, plot_height=200, x_range=labels )
p2.histogram( flowers, var2)
p2.xaxis.axis_label = var2

labels = np.sort(autompg[var3].unique())
p3 = figure( plot_width=200, plot_height=200, x_range=labels)
p3.histogram( autompg, var3 )
p3.xaxis.axis_label = var3

show( row( p1, p2, p3 ) )



**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).
- Test case 1: The function calculates the probabilities by dividing the count of each bin by the total count. If the total is 0, it will fail, for example when the only data point in species is NaN.
- Test case 2: The y-axis range of the graph is fixed and cannot be changed by the user.
- Test case 3: For categorical data, the number of bins is fixed to the number of categories and does not change even if the user passes nbins to the function. For example, if the user wants to visualize only setosa and versicolor, this is not possible with this function.
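
The normalization step from test case 1 could be guarded explicitly. A minimal sketch, where `to_probabilities` is a hypothetical helper and not part of the solution above:

```python
import numpy as np

def to_probabilities(counts):
    """Convert bin counts to relative frequencies (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # without this check, the division would silently produce NaN values
        raise ValueError("cannot normalize: all bin counts are zero")
    return counts / total

probs = to_probabilities([1, 3])
print(probs)
```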

## Exercise 2: Working with SPLOMs

The source code for the generalized scatterplot matrix (SPLOM) is stored in file splom.py. 

Usage:
```
p = splom( df, cols=['var1', 'var2', 'var3'], splom_width=1000 )
show(p)
```

Accepted parameters:
- **source** (req): pandas DataFrame
- **splom_width** (opt): total width/height of the plot.
- **cols** (opt): Array of column names to be used in the plot.
- **x_padding** (opt): additional space for the x-axis labels.
- **y_padding** (opt): additional space for the y-axis labels.

Hint:
- The SPLOM supports some interaction. Select points in the scatterplots and look at the results in the other scatterplots.

In [14]:
%run SPLOM.py

### Exercise 2a): Baseball data

In [15]:
baseball = pd.read_csv( 'Ex2_explAna/baseball_data.csv')

In [16]:
show( splom( df=baseball, cols=['handedness', 'height', 'weight', 'avg', 'HR'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

categorical attributes ['handedness']




### Exercise 2b): Passengers on the Titanic

Remarks:
- For some passengers, age information is missing; the `fillna` command replaces those entries with -1. Feel free to change this treatment of missing values.
- All data is given as quantitative values. To make distinction of categorical data easier, we turn them into strings.
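
A few common alternatives for treating the missing ages, sketched on a made-up frame (`toy` stands in for the Titanic table; which option suits the analysis is a judgment call):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0], "fare": [7.25, 8.05, 53.1]})

# Option 1: sentinel value, as in the notebook (keeps all rows)
sentinel = toy.fillna(-1)

# Option 2: drop the rows where age is missing
dropped = toy.dropna(subset=["age"])

# Option 3: impute the missing ages with the median age
imputed = toy.assign(age=toy["age"].fillna(toy["age"].median()))
```

The sentinel keeps every passenger in the plot but places them at an artificial age of -1, while dropping or imputing changes which points appear in the age panels.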

In [17]:
titanic = pd.read_csv( 'Ex2_explAna/titanic3.csv')

titanic.pclass = titanic.pclass.astype('str')
titanic.survived = titanic.survived.astype('str')

titanic = titanic.fillna(-1)

p = splom( df=titanic, cols=['pclass', 'survived', 'sex', 'age', 'fare'], splom_width=1000, 
           x_padding=40, y_padding=80 )
show( p )

categorical attributes ['pclass', 'survived', 'sex']




## Exercise 3: 

### Option 1: Auto MPG

In [18]:
from bokeh.sampledata.autompg import autompg

autompg.cyl = autompg.cyl.astype('str')
autompg.origin = autompg.origin.astype('str')

show( splom( df=autompg, cols=['mpg', 'cyl', 'displ', 'hp', 'weight', 'accel', 'yr', 'origin'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

categorical attributes ['cyl', 'origin']




### Option 2: Iris flowers

In [19]:
from bokeh.sampledata.iris import flowers
p = splom( df=flowers, splom_width=1000, x_padding=40, y_padding=80 )
show( p )

categorical attributes ['species']


