# Python for Data Visualization - Part 1

**Goal:** The goal of this project is to construct a basic bar/column plot in Python using Bokeh.

**Description:** You are the manager of three schools: Elementary, Middle, and High. Together they cover grades 1-12, and each grade level has recorded Male, Female, and Total enrollment. **You want to build a bar/column chart to compare enrollment by grade.** We will use the following tools:
 - *Pandas:* Pandas allows us to pull data into our application, manipulate data within our application, and export it out of our application
 - *Bokeh:* Bokeh allows us to create interactive graphs and visualizations in Python

## 1A: Data Basics

### Importing and Understanding Data

Our data is stored in *CSV* (*Comma Separated Values*) files. Take a look at `ClassData.csv`. Each row sits on its own line, and the *cells* within a row are separated by commas. We use Pandas to import the data into Python.

In [1]:
import pandas as pd                 # Tell Python we will be using the Pandas set of tools, and nickname it to pd so we can type it quicker
df = pd.read_csv("ClassData.csv")   # Create a DataFrame, call it df, and set its value to the content of our CSV data
df.index += 1                       # Tells the DataFrame to start index labels at 1

Pandas also allows us to work with data in *DataFrames*. A DataFrame is simply a 2-D table, with *columns* and *rows*.
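
To make the link between CSV text and a DataFrame concrete, here is a minimal sketch that reads a few rows from an in-memory string instead of a file. The names `csv_text` and `sample` are made up for illustration, and `io.StringIO` simply lets `read_csv` treat a string like a file.

```python
import io
import pandas as pd

# Raw CSV text: commas separate cells, line breaks separate rows.
# These rows are copied from the first lines of the class data used in this tutorial.
csv_text = """School,Grade,Male,Female,Total
Elementary,1,16,15,31
Elementary,2,12,15,27
Elementary,3,10,18,28
"""

sample = pd.read_csv(io.StringIO(csv_text))  # StringIO makes the string behave like a file
sample.index += 1                            # Start index labels at 1, as in the main example

print(sample)
```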

In [2]:
print(df)                           # Display our dataframe

        School  Grade  Male  Female  Total
1   Elementary      1    16      15     31
2   Elementary      2    12      15     27
3   Elementary      3    10      18     28
4   Elementary      4    17      13     30
5   Elementary      5    15      15     30
6       Middle      6    11      12     23
7       Middle      7    14      12     26
8       Middle      8    15      11     26
9         High      9    13      14     27
10        High     10    12      16     28
11        High     11    16      14     30
12        High     12    14      14     28


In our DataFrame, we have 5 columns: School, Grade, Male, Female, and Total. We have 12 rows, each given a label 1-12. These numbers on the leftmost side are known as the *index* and are not considered a column. Each index label allows us to uniquely identify each row in our data.
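
As a quick check of the difference between columns and the index, the sketch below builds a small stand-in DataFrame (the name `small` is hypothetical) and inspects its shape, column names, and index labels.

```python
import pandas as pd

# A small stand-in for the class data, using the first three grades only
small = pd.DataFrame({"Grade": [1, 2, 3], "Total": [31, 27, 28]})
small.index += 1            # Index labels 1, 2, 3

print(small.shape)          # (number of rows, number of columns)
print(list(small.columns))  # Column names only; the index is not among them
print(list(small.index))    # Index labels
```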

### Types of Data

Pandas DataFrames can store different types of data, but the basic types we will focus on here are:
 - *Strings* (`object`): Text
 - *Integers* (`int64`): Numbers without decimals
 - *Floats* (`float64`): Numbers with decimals
 - *Booleans* (`bool`): Values that are either True or False
 - *DateTimes* (`datetime64`): Values that store a specific date and time
 
 
To determine the data type of a column, we can read the `dtype` attribute of a valid DataFrame column, for example `df['Grade'].dtype`.

In [3]:
for column in df.columns:                                       # Go through each column in our list of columns
    column_type = str(df[column].dtype)                         # Get the column type, and store as a string
    print("The '" + column + "' column has type " + column_type)# Display the name of each column and its type

The 'School' column has type object
The 'Grade' column has type int64
The 'Male' column has type int64
The 'Female' column has type int64
The 'Total' column has type int64
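
If you only need a quick overview, the `dtypes` attribute of a DataFrame lists the type of every column at once. The sketch below uses a small made-up DataFrame; the `Average` column is hypothetical and is included only to show a float column.

```python
import pandas as pd

# Made-up data: one text column, one integer column, one float column
small = pd.DataFrame({
    "School": ["Elementary", "Middle"],
    "Grade": [1, 6],
    "Average": [15.5, 11.5],   # Hypothetical column, not part of ClassData.csv
})

all_types = small.dtypes       # A Series mapping each column name to its type
print(all_types)
```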


### Selecting Data (Optional but Useful)

Our focus is on visualization and not analysis, so we will not deeply explore data manipulation. However, it helps to know a few useful tricks to select data from DataFrames. We use `df` here, but you should replace this with whatever you choose to call your DataFrame variable.

#### Column Select

To grab data from one or more columns, we can use `df[column_name]` or `df[column_names_list]`, for example `df['Grade']` or `df[['Grade', 'Total']]`.
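
One detail worth knowing: selecting with a single name returns a one-dimensional *Series*, while selecting with a list of names returns a DataFrame. A minimal sketch on a made-up DataFrame called `small`:

```python
import pandas as pd

small = pd.DataFrame({"Grade": [1, 2, 3], "Male": [16, 12, 10], "Total": [31, 27, 28]})

one_column = small["Grade"]               # Single name: returns a Series
two_columns = small[["Grade", "Total"]]   # List of names: returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```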

#### Row Select

To grab data from one or more rows, there are three methods we can use.

1. *`iloc`*: The `iloc` method allows us to access a row or subset of rows by their *index position* (starting at 0) with `df.iloc[index_positions, column_positions]`. Index positions are always integers. Specifying columns is optional. For example, `df.iloc[2]` gets the third row, `df.iloc[0:4]` gets the first four rows, and `df.iloc[[0,5,9]]` gets the first, sixth, and tenth rows (note the inner list; `df.iloc[0,5,9]` raises an error). If we choose to select specific columns, we must use the position of each column (starting at 0). For example, if we wanted the 'Grade' and 'Total' columns only for the first four rows, we could use `df.iloc[0:4, [1,4]]`, because 'Grade' is at position 1 and 'Total' at position 4.

In [4]:
ex1 = df.iloc[0:4, [1,4]] # Get the first four rows (positions 0,1,2,3) and only columns at position 1 and 4 ('Grade' and 'Total')
print(ex1)

   Grade  Total
1      1     31
2      2     27
3      3     28
4      4     30


2. *`loc`*: The `loc` method allows us to access a row or subset of rows by their *index label*. This is not the same as the index position. For example, the first row has position 0, but has an index label of 1. In this example, our index labels are integers, but that is not always the case. We use `loc` in the same way as `iloc`, except with labels instead of positions. For example, `df.loc[3]` gives us the row with label 3 (the third row here), and `df.loc[[1,2,3,4], ['Grade','Total']]` gives us grades and totals for the first four rows.

In [5]:
ex2 = df.loc[[1,2,3,4], ['Grade','Total']] # Get the first four rows (labels 1,2,3,4) and only columns 'Grade' and 'Total'
print(ex2)

   Grade  Total
1      1     31
2      2     27
3      3     28
4      4     30


3. *Conditionals*: Conditionals allow us to grab data from rows based on a condition. For example, if we wanted data where the school was 'Middle' and the total enrollment was 26, we could do `df[(df.School == 'Middle') & (df.Total == 26)]`.

In [6]:
ex3 = df[(df.School == 'Middle') & (df.Total == 26)] # Specifies the condition on which a row should be selected
print(ex3)

   School  Grade  Male  Female  Total
7  Middle      7    14      12     26
8  Middle      8    15      11     26
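
The three row-selection methods above can be compared side by side. The sketch below rebuilds the School, Grade, and Total columns from the table in this tutorial under the made-up name `small`, then shows two details that are easy to miss: a list of positions or labels needs its own brackets, and `loc` slices include the end label while `iloc` slices exclude the end position.

```python
import pandas as pd

small = pd.DataFrame({
    "School": ["Elementary"] * 5 + ["Middle"] * 3 + ["High"] * 4,
    "Grade": list(range(1, 13)),
    "Total": [31, 27, 28, 30, 30, 23, 26, 26, 27, 28, 30, 28],
})
small.index += 1                       # Labels run 1-12, positions run 0-11

by_position = small.iloc[[0, 5, 9]]    # List of positions: first, sixth, and tenth rows
by_label = small.loc[[1, 6, 10]]       # The same rows, selected by label

label_slice = small.loc[1:4]           # loc slices include the end label: 4 rows
position_slice = small.iloc[1:4]       # iloc slices exclude the end position: 3 rows

# Conditions combine with & (and) or | (or); each condition needs its own parentheses
middle_or_high = small[(small.School == "Middle") | (small.School == "High")]

print(by_position)
print(len(label_slice), len(position_slice), len(middle_or_high))
```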


## 1B: Basic Bar/Column Plot

**Our task is to build a bar/column chart to compare enrollment by grade**

### Step 1: Bokeh Setup

Bokeh lets us transform data into beautiful visualizations. We need to tell Python we are using it (`from` and `import`), and where the plots should be displayed (`output_notebook`).

In [7]:
from bokeh.plotting import figure, show    # Tells Python we will use figure and show from Bokeh
from bokeh.io import output_notebook       # Tells Python we will need the output_notebook function
from bokeh.models import ColumnDataSource  # We will need this when preparing our data for a bar/column plot

output_notebook()                          # Tells Python to present Bokeh plots in the notebook

### Step 2: Select Data

We need to determine x-axis (horizontal) and y-axis (vertical) values. We want a bar chart to compare enrollment by grade. Each grade will have its own bar, and the height of that bar will represent the total enrollment. Therefore, our x-axis is Grade, and our y-axis is Total (enrollment).

In [8]:
grades = df['Grade'].apply(str)                                    # X-axis is the Grade column; we convert each grade to a string so Bokeh treats the grades as categories
totals = df['Total']                                               # Y-axis is the Total column
source = ColumnDataSource(data=dict(grades=grades, totals=totals)) # We combine our grade and total columns in a ColumnDataSource, a structure Bokeh understands
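
The keys we pass to `ColumnDataSource` become column names inside the source, and the plotting step refers to the data by those names. A minimal sketch, assuming Bokeh is installed and using three hard-coded values in place of the full DataFrame:

```python
from bokeh.models import ColumnDataSource

# Three hard-coded values standing in for the full Grade and Total columns
grades = ["1", "2", "3"]
totals = [31, 27, 28]

source = ColumnDataSource(data=dict(grades=grades, totals=totals))

print(source.column_names)           # The keys of the dictionary we passed in
print(list(source.data["totals"]))   # The values stored under the 'totals' key
```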

### Step 3: Plot Data

A visual graphic in Bokeh is known as a *figure*. We call `figure` to create one and set its key properties through keyword arguments (`name=value`). The code below sets the following properties:
 - `title`: The title of our visualization
 - `x_range`: The labels for each column on the x-axis
 - `y_range`: The upper and lower bounds for the y-axis
 - `x_axis_label`: The title of the x-axis
 - `y_axis_label`: The title of the y-axis
 - `plot_height`: Height of the visualization in pixels (named `height` in Bokeh 3.0 and later)
 - `plot_width`: Width of the visualization in pixels (named `width` in Bokeh 3.0 and later)

In [9]:
visual = figure(title="Total Enrollment by Grade", x_range=grades, y_range=(0,40), 
                x_axis_label = "Grade", y_axis_label = "Total Enrollment", 
                plot_height=300, plot_width=800)
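
The upper bound of 40 was picked by eye for this data set. If your data changes, one option is to derive the bound from the data instead. This is only a sketch: `headroom` is a hypothetical name, and the factor 1.25 is an arbitrary choice that leaves some space above the tallest bar.

```python
import pandas as pd

# The Total column from the table in this tutorial
totals = pd.Series([31, 27, 28, 30, 30, 23, 26, 26, 27, 28, 30, 28])

headroom = 1.25                          # Hypothetical factor: 25% of space above the tallest bar
upper = float(totals.max()) * headroom   # Tallest bar is 31, so the bound is 38.75
y_range = (0, upper)                     # Could be passed to figure as y_range=y_range

print(y_range)
```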

Now, we can add our columns to the empty figure using `vbar`. We must specify the following properties:
 - `x`: Specifies the x coordinates of the centers of the bars (x-axis data)
 - `top`: Specifies the top point of each bar (y-axis data)
 - `width`: Specifies the width of each bar
 - `legend_field`: Specifies the values that should be used for the legend
 - `source`: Specifies the source of the data (the `ColumnDataSource` created earlier)

In [10]:
visual.vbar(x='grades', top='totals', width=0.7, legend_field='grades', source=source)

Finally, we can add customizations to clean up our visualization.

In [11]:
visual.xgrid.grid_line_color = None            # Sets vertical gridlines to transparent, useful for column plot
visual.legend.orientation = "horizontal"       # Tells Python to use a horizontal legend
visual.legend.location = "top_center"          # Tells Python to put the legend at the top and center of the plot

**Importantly, we must call `show(name_of_figure)` for the figure to display.**

In [12]:
show(visual)

Bokeh visualizations are interactive! Experiment with the tools at the side to explore your visualization in greater detail.

### Step 4 (Optional): Colorize the Visualization

To give each bar a different color, we need a list of colors with the same length as the number of bars (12). You can generate such a list automatically with tools such as http://vrl.cs.brown.edu/color.

In [13]:
from bokeh.transform import factor_cmap                # We need to import the factor_cmap tool because we will use it to color the bars

colors = ['#256676', '#5cdac5', '#277a35', '#70de63',
          '#333a9e', '#e057e1', '#d0bcfe', '#6a7fd2',
          '#99ceeb', '#6a10a6', '#991c64', '#573f56']

We then use the exact same code as before, except with a minor addition on line 6: `fill_color = factor_cmap('grades', palette=colors, factors=grades)`. This tells Bokeh to assign a different color from `colors` to each bar in the plot.

In [14]:
visual_colored = figure(title="Total Enrollment by Grade", x_range=grades, y_range=(0,40), 
                       x_axis_label = "Grade", y_axis_label = "Total Enrollment", 
                       plot_height=300, plot_width=800)

visual_colored.vbar(x='grades', top='totals', width=0.7, legend_field='grades', source=source,
                   fill_color = factor_cmap('grades', palette=colors, factors=grades)) # Note the difference here, we have filled each grade with a different color from our colors array

visual_colored.xgrid.grid_line_color = None
visual_colored.legend.orientation = "horizontal"
visual_colored.legend.location = "top_center"

Finally, we call `show` to display our plot as before.

In [15]:
show(visual_colored) # Make sure to call show() on your visualization for it to display!
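
Conceptually, `factor_cmap` pairs each factor (here, each grade) with the palette color at the same position. The plain-Python sketch below illustrates that pairing; it is not Bokeh's actual implementation, and `color_for_grade` is a made-up name. It also shows why the list of colors should be as long as the list of grades.

```python
# Grades as strings, matching the categories on the x-axis
grades = [str(g) for g in range(1, 13)]

colors = ['#256676', '#5cdac5', '#277a35', '#70de63',
          '#333a9e', '#e057e1', '#d0bcfe', '#6a7fd2',
          '#99ceeb', '#6a10a6', '#991c64', '#573f56']

# One color per grade: the lists must have the same length
same_length = len(colors) == len(grades)

# Pair each grade with the color at the same position
color_for_grade = dict(zip(grades, colors))

print(same_length)
print(color_for_grade["1"], color_for_grade["12"])
```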

## Exercise

To test your understanding, try creating a similar graph to the one above, except plotting Total Female Enrollment by Grade instead. Try using different colors, and experiment with the size, bounds, and column width.

For more information, check out the documentation at: https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html.