# **BDM 2**

**Please "Save a Copy" to your Google Drive and work with your own copy.**

# Python Visualization

2.5 quintillion bytes of data are created every day. Naturally, it is impossible for a human to process all this data unaided.

In order to use all this data to make decisions in a timely fashion, we need to aggregate data into charts and visualizations. 

Good visualizations are often the determining factor in whether we make the right (or better) decisions, and they help reduce confusion, sharpen messaging and save time.

In this notebook we will learn how to visualize data clearly using Python.

To do so, we will be mainly utilising the software library known as **Plotly Express**.

## Install packages, Mount Drive & Import relevant libraries



In [12]:
# pip install new version of plotly & swifter package
# even though plotly is already available in colab, 
# we want to use the newer version because stacked bars will be easier to implement
# swifter is a package that will speed up the apply() function in pandas
!pip3 install plotly==4.8
!pip3 install swifter

# you will be asked to restart runtime after installation





In [13]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)


In [None]:
# import the packages and check their versions
import numpy as np
import scipy

# import data reading
import pandas as pd

# import plotly express for graphing
import plotly.express as px



In [None]:
# check python version
import sys
print(f'The python version is {sys.version}')

In [None]:
# to check versions of plotly package
import plotly
print(f'plotly version is {plotly.__version__}')

## Retrieve Data

The data is stored in the `3BDM` folder inside our data folder.

**The data used can be found here:**
https://www.kaggle.com/unsdsn/world-happiness

In [None]:
#First, let's get the data we need
folder_path = '/content/drive/MyDrive/pcml_data/3BDM'
file_name = '2018.csv'

happy18_df = pd.read_csv(folder_path + '/' + file_name)

happy18_df.columns #Use this to see the column names. It's important to get the column name correct or the code won't be able to run!

In [None]:
happy18_df.head()

## Data Cleaning

The first step in data cleaning is to observe how the data is arranged. This means checking the columns for consistency, sanity-checking the data points, and checking whether the dataset is detailed enough to answer the questions in our analysis.

If we look at our data, we will see that there are some problems:

#### **Problem 1:**

Column labels differ between years: the 2016 & 2017 data use ***Social Support***, while the 2018 & 2019 data use ***Social support***.

#### **Problem 2:**

Column labels contain spaces, which limits how we can refer to the columns later on. A good habit is to replace all the spaces with underscores.

#### **Problem 3:**

There are 4 separate dataframes separated by the year the survey was done. We would need to join them up together into a single coherent dataframe for analysis.

#### **Problem 4:**

Once all our dataframes are joined into a single dataframe, we lose the most important component - **the year in which these results were recorded**. Our analysis of the happiness index would be incomplete if we do not know the year in which these records were created.

#### **Problem 5:**

If we check the data type of each column, we will realise that the year column is not a datetime object. This can cause Plotly to misread the values when it draws the time axis.

### **Problem 1:**

Column labels differ between years: the 2016 & 2017 data use ***Social Support***, while the 2018 & 2019 data use ***Social support***.

#### **Solution:**

Rename the columns by setting all to lowercase characters.

In [None]:
# Check the dataset
years = ['2016', '2017', '2018', '2019']
for a_year in years:
    cols = list(pd.read_csv(f'{folder_path}/{a_year}.csv').columns)
    print(f'{a_year}: {cols}')
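Reading four printed lists by eye can miss a one-letter difference. As a small sketch of how the mismatch could be found programmatically, the example below compares two hypothetical header sets (the labels are illustrative and not read from the real files) and confirms that lowercasing removes the difference.

```python
import pandas as pd

# Hypothetical headers in the style of two of the yearly files
cols_2017 = pd.Index(["Country or region", "Score", "Social Support"])
cols_2018 = pd.Index(["Country or region", "Score", "Social support"])

# Labels present in one year but not in the other
only_2017 = cols_2017.difference(cols_2018)
only_2018 = cols_2018.difference(cols_2017)
print("Only in 2017:", list(only_2017))
print("Only in 2018:", list(only_2018))

# After lowercasing, the two sets of labels agree
same_after_lower = set(cols_2017.str.lower()) == set(cols_2018.str.lower())
print("Same after lowercasing:", same_after_lower)
```

`Index.difference` returns the labels that appear in the first index but not in the second, so an empty result in both directions means the headers already match.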

In [None]:
years = ['2016', '2017', '2018', '2019']
for a_year in years:
    # Load dataframe
    temp_df = pd.read_csv(f'{folder_path}/{a_year}.csv')
    
    # Get list of column headers
    cols = temp_df.columns

    # Lowercase column headers
    # using python list comprehension
    lowercase_cols = [a_col.lower() for a_col in cols]

    # Reassign lowercased column headers to dataframe's columns
    temp_df.columns = lowercase_cols

    # Check to see if column headers are lowercased
    print(f'{a_year}: {list(temp_df.columns)}')

### **Problem 2:**

Column labels contain spaces, which limits how we can refer to the columns later on. A good habit is to replace all the spaces with underscores.

#### **Solution:**

Replace the columns headers' spaces with underscores

In [None]:
years = ['2016', '2017', '2018', '2019']
for a_year in years:
    # Load dataframe
    temp_df = pd.read_csv(f'{folder_path}/{a_year}.csv')
    
    # Get list of column headers
    cols = temp_df.columns

    # Replace column headers' spaces with underscore
    # split() without any parameter will split words by space
    # '_'.join will join all the elements in a list with _
    lowercase_underscored_cols = ['_'.join(a_col.lower().split()) for a_col in cols]

    # Reassign lowercased & underscored column headers to dataframe's columns
    temp_df.columns = lowercase_underscored_cols

    # Check to see if column headers are lowercased & underscored
    print(f'{a_year}: {list(temp_df.columns)}')
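The list comprehension above can also be written with the pandas `.str` accessor on the column index. This is only an alternative sketch on a hypothetical miniature dataframe; note that it replaces each single space, so it would behave differently from `split()`/`join()` if a header contained repeated spaces.

```python
import pandas as pd

# Hypothetical miniature frame with headers in the style of the raw files
toy_df = pd.DataFrame({
    "Country or region": ["Finland", "Norway"],
    "Social support": [1.59, 1.58],
    "GDP per capita": [1.31, 1.46],
})

# Strip stray whitespace, lowercase, then swap each space for an underscore
toy_df.columns = (
    toy_df.columns.str.strip()
                  .str.lower()
                  .str.replace(" ", "_", regex=False)
)

print(list(toy_df.columns))
```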

### **Problem 3:**

There are 4 separate dataframes separated by the year the survey was done. We would need to join them up together into a single dataframe for analysis.

#### **Solution:**

Store the 4 cleaned dataframes in a list and use the concat method to join them into a single dataframe.

In [None]:
years = ['2016', '2017', '2018', '2019']

dataframes_list = []

for a_year in years:
    # Load dataframe
    temp_df = pd.read_csv(f'{folder_path}/{a_year}.csv')
    
    # Get list of column headers
    cols = temp_df.columns

    # Replace column headers' spaces with underscore
    lowercase_underscored_cols = ['_'.join(a_col.lower().split()) for a_col in cols]

    # Reassign lowercased & underscored column headers to dataframe's columns
    temp_df.columns = lowercase_underscored_cols

    # Check to see if column headers are lowercased & underscored
    print(f'{a_year}: {list(temp_df.columns)}')

    # Append cleaned dataframes into a list
    dataframes_list.append(temp_df)

# Print an empty line
print()

# Check that all dataframes are inside by checking the number of dataframes
print("Number of dataframes inside list:", len(dataframes_list))

In [None]:
# Store into a single dataframe
happiness_idx_df = pd.concat(dataframes_list, ignore_index=True) # We need to set ignore_index as True so that our resulting axis will be labeled 0, …, n - 1

# Check the new dataframe
happiness_idx_df

### **Problem 4:**

Once all our dataframes are joined into a single dataframe, we lose the most important component - **the year in which these results were recorded**. Our analysis of the happiness index would be incomplete if we do not know the year in which these records were created.

#### **Solution:**

Before joining, we will add a column called year at the end of each dataframe to record the year in which its records were taken.

In [None]:
years = ['2016', '2017', '2018', '2019']

dataframes_list = []

for a_year in years:
    # Load dataframe
    temp_df = pd.read_csv(f'{folder_path}/{a_year}.csv')
    
    # Get list of column headers
    cols = temp_df.columns

    # Replace column headers' spaces with underscore
    lowercase_underscored_cols = ['_'.join(a_col.lower().split()) for a_col in cols]

    # Reassign lowercased & underscored column headers to dataframe's columns
    temp_df.columns = lowercase_underscored_cols

    # Create a new column called year
    temp_df['year'] = int(a_year)

    # Check to see if column headers are lowercased & underscored
    print(f'{a_year}: {list(temp_df.columns)}')

    # Append cleaned dataframes into a list
    dataframes_list.append(temp_df)

# Print an empty line
print()

# Check that all dataframes are inside by checking the number of dataframes
print("Number of dataframes inside list:", len(dataframes_list))

In [None]:
# Store into a single dataframe
# We use concat because we have no specific keys to merge on; we simply want to stack the dataframes below each other
happiness_idx_df = pd.concat(dataframes_list, ignore_index=True) # We need to set ignore_index as True so that our resulting axis will be labeled 0, …, n - 1

# Check the new dataframe
happiness_idx_df
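To see what `ignore_index=True` changes, here is a minimal sketch with two hypothetical yearly frames (the values are made up for illustration). Without the flag, each frame keeps its own row labels, so the combined index contains duplicates.

```python
import pandas as pd

# Two hypothetical yearly frames, each with its own 0..n-1 index
df_2018 = pd.DataFrame({"country_or_region": ["Finland", "Norway"], "score": [7.6, 7.5]})
df_2019 = pd.DataFrame({"country_or_region": ["Finland", "Denmark"], "score": [7.7, 7.6]})

# Tag each frame with its year before stacking them
df_2018["year"] = 2018
df_2019["year"] = 2019

# Original row labels are kept, so 0 and 1 each appear twice
kept = pd.concat([df_2018, df_2019])

# Row labels are renumbered from 0 to n - 1
renumbered = pd.concat([df_2018, df_2019], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Duplicate row labels are legal in pandas, but they make label-based selection such as `.loc[0]` return several rows, which is usually not what we want after stacking.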

### **Problem 5:**

If we check the data type of each column, we will realise that the year column is not a datetime object. This can cause Plotly to misread the values when it draws the time axis.

#### **Solution**

Set the year column to be a datetime object using pandas

In [None]:
# Check our datatypes
happiness_idx_df.dtypes

In [None]:
# Convert the year column to a datetime object
# our year values look like 2018, so the matching format is '%Y' (four-digit year)
# for comparison, a date written as 07/08/2021 would use format = '%d/%m/%Y'
# reference for format: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

happiness_idx_df['year'] = pd.to_datetime(happiness_idx_df['year'], format='%Y')
happiness_idx_df

In [None]:
# Set the year column to show only the year - Extracting only the year
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.year.html
# further reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

# this will convert the datatype back to int
# dont have to run
#happiness_idx_df['year'] = happiness_idx_df['year'].dt.year

# Check the dataframe
# happiness_idx_df

In [None]:
# Check our datatypes again
happiness_idx_df.dtypes
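The conversion can be tried in isolation on a small series. This sketch uses a standalone series of years rather than the notebook's dataframe; it shows that each year becomes 1 January of that year, that `.dt.year` recovers the integer, and that a datetime column can be filtered with a date string.

```python
import pandas as pd

years = pd.Series([2016, 2017, 2018, 2019])

# format='%Y' tells pandas each value is a four-digit year;
# the missing month and day default to 1 January
as_dates = pd.to_datetime(years, format="%Y")
print(as_dates)

# .dt.year extracts the year back out as an integer
print(as_dates.dt.year.tolist())

# A datetime series can be compared against a date string
mask_2018 = as_dates == "2018-01-01"
print(mask_2018.tolist())
```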

## Section 4 - Plotly
First, let's take a look at Plotly, specifically Plotly's express package.
We typically import it with `import plotly.express as px`.

Plotly Express offers nice, interactive visualizations using very few lines of code. It assumes some default settings.

To customize the chart beyond what plotly express offers, we may have to go into the main plotly package. 


## Line Plot

Line plots are used mostly for continuous data (e.g. time-series data, an example of which will be shown later).

A simple line plot can be created by using the `px.line` function.

The two arguments we will always supply to `px.line` are `x` and `y`, the data for the x-axis and the y-axis. When a dataframe is passed as the first argument, `x` and `y` are the names of columns in that dataframe.

The x and y arguments should contain datapoints. These can be given as lists, dataframe columns or numpy arrays.

An example is shown below.

**NOTE:** There are many options to use with the arguments! For brevity's sake we will not expand on too many of these. Feel free to search online for more options. 



In [None]:
# basic line chart with plotly express
import plotly.express as px

# put in the dataframe name: happiness_idx_df
# put in column name for x-axis: year
# put in column name for y-axis: healthy_life_expectancy
# call for the figure to show: fig.show()

fig = px.line(happiness_idx_df,
              x='year',
              y='healthy_life_expectancy',
              )

fig.show()

In [None]:
# Get country of interest - Singapore
coi = 'Singapore'

# Plot the graph
fig = px.line(happiness_idx_df[happiness_idx_df['country_or_region'] == coi], 
              x="year", 
              y="healthy_life_expectancy", 
              color='country_or_region')

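# Set the x-axis ticks so that only the year is shown
# year is a datetime column, so the tick positions are given as dates
# reference for tick labels: https://plotly.com/python/time-series/
fig.update_xaxes(tickmode="array",
                 tickvals = ['2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01'],
                 ticktext = ['2016', '2017', '2018', '2019'])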

# Display the graph
fig.show()

In [None]:
# Declare list of countries
countries_of_interest = ['United States', 'Singapore', 'South Korea']

# Plot the graph
# isin() will check for the countries within the countries_of_interest list
fig = px.line(happiness_idx_df[happiness_idx_df['country_or_region'].isin(countries_of_interest)], 
              x="year", 
              y="healthy_life_expectancy", 
              color='country_or_region'
              )

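# Set the x-axis ticks so that only the year is shown
# year is a datetime column, so the tick positions are given as dates
fig.update_xaxes(tickmode="array",
                 tickvals = ['2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01'],
                 ticktext = ['2016', '2017', '2018', '2019'])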

# Display the graph
fig.show()

## Bar Charts

Bar charts allow for comparison between different groups.

A simple example of a vertical bar chart is shown below.


### **Scenario #1: Single Bar Chart**

Select the **top 5 countries** based on happiness `score` in **2019** and plot their `gdp_per_capita` as a **bar chart**.

In [None]:
# Look at a little of our dataset
happiness_idx_df.head(2)

In [None]:
# Select the dataset with the year of interest
happiness_idx_df_2019 = happiness_idx_df[happiness_idx_df['year'] == '2019-01-01']

# Select the top 5 countries based on score
top_5_2019 = happiness_idx_df_2019.nlargest(5, 'score')

# Plot the graph based on gdp_per_capita
fig = px.bar(top_5_2019, 
             x="country_or_region", 
             y="gdp_per_capita", 
             color='country_or_region')

# Display the graph
fig.show()

### **Scenario #2: Single Bar Chart**

**Your task:**


Select the **top 5 countries** based on happiness `score` in **2018** and plot their `gdp_per_capita` as a **bar chart**.

In [None]:
# Select the dataset with the year of interest
happiness_idx_df_2018 = happiness_idx_df[happiness_idx_df['year'] == '2018-01-01']

# Select the top 5 countries based on score
top_5_2018 = happiness_idx_df_2018.nlargest(5, 'score')

# Plot the graph based on gdp_per_capita
fig = px.bar(top_5_2018, 
             x="country_or_region", 
             y="gdp_per_capita", 
             color='country_or_region')

# Display the graph
fig.show()

### **Scenario #3: Grouped Bar Chart**

We can also compare **categorical data** by having multiple columns beside one another.

For example, if we wanted to compare the `healthy_life_expectancy` of 3 countries: 

In [None]:
# Declare list of countries
countries_of_interest = ['United States', 'Singapore', 'South Korea']

# Plot the graph
# Using the isin() method: Whether each element in the DataFrame is contained in values
fig = px.bar(happiness_idx_df[happiness_idx_df['country_or_region'].isin(countries_of_interest)], 
              x="year", 
              y="healthy_life_expectancy", 
              color='country_or_region',
             barmode = 'group')


# Display the graph
fig.show()

### **Scenario #4: Grouped Bar Chart**

**Your task:**
Compare the `gdp_per_capita` of 3 countries, via a grouped bar chart.

In [None]:
# Declare list of countries
countries_of_interest = ['United States', 'Singapore', 'South Korea']

# Plot the graph
# Using the isin() method: Whether each element in the DataFrame is contained in values
fig = px.bar(happiness_idx_df[happiness_idx_df['country_or_region'].isin(countries_of_interest)], 
              x="year", 
              y="gdp_per_capita", 
              color='country_or_region',
             barmode = 'group')


# Display the graph
fig.show()

## Histogram

Let's take a look at the histogram. 

The histogram lets us summarise the distribution of a variable.

From this we can visualise the data and see if it is skewed, observe its spread, look for the (multiple) modes etc. 

For an application of the histogram, let's look at the distribution of life expectancies in different countries.



### **Scenario #5:**

We will plot a histogram of the life expectancy for 2018 in our happiness index dataset. 

In [None]:
happiness_idx_df.dtypes

In [None]:
# Let's look at our dataset in 2018
happiness_idx_df_18 = happiness_idx_df[happiness_idx_df['year'] == '2018-01-01'].reset_index(drop = True)
happiness_idx_df_18

In [None]:
# Plot the graph
fig = px.histogram(happiness_idx_df_18,
                x='healthy_life_expectancy',
                nbins = 10)

# Adjust plot size
# update_layout can control more aspects of the chart
fig.update_layout(width = 600, 
                  height = 400)

# Display the graph
fig.show()

Are we done? Wait! We haven't talked about bins yet. 

Bins are an essential part of histograms. 

**By default,** the number of bins is chosen so that this number is comparable to the typical number of samples in a bin. This number can be customized, as well as the range of values.

The number of bins determines how many intervals there are in the histogram.

Our earlier histogram has 10 intervals.

However this might oversimplify things too much. 

To put this into perspective, think of grouping people by age: a single bin covering ages 50 to 60 is too broad, and we won't be able to pinpoint details clearly.

To define the number of bins, we pass the `nbins` = ***n*** parameter to the `px.histogram` function.

In [None]:
# Plot the graph
fig = px.histogram(happiness_idx_df_18,
                x='healthy_life_expectancy',
                nbins = 20)

# Adjust plot size
fig.update_layout(width = 600, 
                  height = 400)

# Display the graph
fig.show()
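The effect of the bin count can also be checked numerically. The sketch below uses NumPy rather than Plotly and a hypothetical sample, so it only illustrates the idea: more bins over the same range means narrower intervals and fewer points per interval. As far as we understand, Plotly treats `nbins` as an upper target and may choose a nearby round bin width, so its bars need not match these counts exactly.

```python
import numpy as np

# Hypothetical sample standing in for a numeric column
values = np.array([0.12, 0.25, 0.31, 0.48, 0.52, 0.67, 0.71, 0.88, 0.93, 0.99])

# Same data and range, two different bin counts
counts_5, edges_5 = np.histogram(values, bins=5, range=(0.0, 1.0))
counts_10, edges_10 = np.histogram(values, bins=10, range=(0.0, 1.0))

# Width of one interval in each case
width_5 = edges_5[1] - edges_5[0]
width_10 = edges_10[1] - edges_10[0]

print("5 bins :", width_5, counts_5)
print("10 bins:", width_10, counts_10)
```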

## Scatterplot
Scatter plots allow us to assess if there's a correlation between two variables.

A scatter plot shows data on the x and y axis simultaneously to show the relationship. 

This is useful to show spread or clustering, as well as to show non-linear relationships. 

Let's see how we can plot a scatter plot below using 2018 data for GDP per capita and life expectancy for different countries. 

In [None]:
happiness_idx_df_18

In [None]:
# Let's check the pairwise relationship between the two variables
# gdp_per_capita and healthy_life_expectancy
# hover_name will give the hover textbox the name, using the column name
# hover_data will add in additional columns to display
# inserting trendlines: https://plotly.com/python/linear-fits/
fig = px.scatter(happiness_idx_df_18,
                 x = 'gdp_per_capita',
                 y = 'healthy_life_expectancy',
                 hover_name='country_or_region',
                 hover_data=['score'],
                 trendline="ols"
                 )

fig.update_layout(height = 400,
                  width = 400)

fig.show()
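The `trendline="ols"` option draws an ordinary least squares line through the points (it relies on the `statsmodels` package being installed). As a rough sketch of what such a line is, the example below fits a straight line with NumPy on hypothetical data and computes the Pearson correlation coefficient; it is not the code Plotly runs internally.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([0.2, 0.5, 0.8, 1.1, 1.4])
y = np.array([0.30, 0.45, 0.62, 0.74, 0.91])

# Degree-1 least-squares fit: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A positive slope with `r` close to 1 indicates a strong positive linear relationship; a value of `r` near 0 would mean the straight line explains little of the spread.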

In [None]:
# what if we want to plot the scatter plot for all variables?
# only query all the numerical variables
happiness_idx_df_18.iloc[:,2:-1]

In [None]:
# use scatter_matrix
fig = px.scatter_matrix(happiness_idx_df_18.iloc[:,2:-1])
fig.show()

## Bubble Plot

A bubble chart is a scatter plot in which a third dimension of the data is shown through the size of markers.

Bubble charts can facilitate the understanding of social, economical, medical, and other scientific relationships. Bubble charts can be considered a variation of the scatter plot, in which the data points are replaced with bubbles.

In [None]:
# Randomly sample 20 data points: sample()
sample_20 = happiness_idx_df_18.sample(20).reset_index(drop=True)
sample_20

In [None]:
# Plot the graph
fig = px.scatter(sample_20, 
                 x = 'gdp_per_capita',
                 y = 'healthy_life_expectancy',
                 size="score", 
                 color="country_or_region",
                 hover_name="country_or_region", 
                 log_x=True
                 )

# Adjust the layout
fig.update_layout(height = 400,
                  width = 600)

# Display the graph
fig.show()

## Heatmap

Given a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.

The scatter plots above show one pair of variables at a time; a heatmap of the correlation matrix summarises every pair in a single view.

**Reference link:** https://plotly.com/python/builtin-colorscales/#builtin-diverging-color-scales

**Available colors for heatmap**
```
'aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance',
'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg',
'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl',
'darkmint', 'deep', 'delta', 'dense', 'earth', 'edge', 'electric',
'emrld', 'fall', 'geyser', 'gnbu', 'gray', 'greens', 'greys',
'haline', 'hot', 'hsv', 'ice', 'icefire', 'inferno', 'jet',
'magenta', 'magma', 'matter', 'mint', 'mrybm', 'mygbm', 'oranges',
'orrd', 'oryel', 'peach', 'phase', 'picnic', 'pinkyl', 'piyg',
'plasma', 'plotly3', 'portland', 'prgn', 'pubu', 'pubugn', 'puor',
'purd', 'purp', 'purples', 'purpor', 'rainbow', 'rdbu', 'rdgy',
'rdpu', 'rdylbu', 'rdylgn', 'redor', 'reds', 'solar', 'spectral',
'speed', 'sunset', 'sunsetdark', 'teal', 'tealgrn', 'tealrose',
'tempo', 'temps', 'thermal', 'tropic', 'turbid', 'twilight',
'viridis', 'ylgn', 'ylgnbu', 'ylorbr', 'ylorrd'
 ```

### **Correlation Visualisation**

Visualise the correlation between variables for the happiness index dataset in 2018 to understand their correlations

In [None]:
happiness_idx_df_18.head()

In [None]:
happiness_idx_df_18.columns

In [None]:
# Plot a correlation plot using a heatmap
fig = px.imshow(happiness_idx_df_18[['score', 'gdp_per_capita',
                                     'social_support', 'healthy_life_expectancy',
                                     'freedom_to_make_life_choices', 'generosity',
                                     'perceptions_of_corruption']].corr(),
                x=['score', 'gdp_per_capita',
                   'social_support', 'healthy_life_expectancy',
                   'freedom_to_make_life_choices', 'generosity',
                   'perceptions_of_corruption'],
                y=['score', 'gdp_per_capita',
                   'social_support', 'healthy_life_expectancy',
                   'freedom_to_make_life_choices', 'generosity',
                   'perceptions_of_corruption'],
                color_continuous_scale = 'rdylgn'
                )

fig.update_xaxes(side="top")
fig.show()
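The matrix passed to `px.imshow` above comes from the pandas `.corr()` method. A minimal sketch on hypothetical columns shows what that matrix contains: values between -1 and 1, with 1 on the diagonal because every column is perfectly correlated with itself.

```python
import pandas as pd

# Hypothetical numeric columns
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # moves exactly with a
    "c": [4.0, 3.0, 2.0, 1.0],   # moves exactly against a
})

# Pearson correlation is the default method
corr = toy.corr()
print(corr.round(2))
```

Because correlations can be negative or positive, a diverging colour scale such as `rdylgn` suits this kind of heatmap: the two ends of the scale mark strong negative and strong positive correlation.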

If you want to annotate the cells with the values, using seaborn could be easier.

**TAKE NOTE**

The next cell uses different plotting packages: Matplotlib and Seaborn.

In [None]:
# matplotlib is the most common and basic visualization package in python
import matplotlib.pyplot as plt

# seaborn is a visualization package built on top of matplotlib
import seaborn as sns

# plot correlation heatmap
_, ax = plt.subplots(figsize=(14, 12))

# This is to define the color palette for seaborn
# https://seaborn.pydata.org/generated/seaborn.diverging_palette.html
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# plot a heatmap of the correlation of the variables
# cmap: the colormap which we created in the previous line
# square: set the shape of each cell to be square shaped
# cbar_kws: define the scale of the legend
# ax: draw onto the axes based on what we defined in matplotlib subplots
# annot=True: annotation, write the correlation value in each cell
# annot_kws: set the fontsize for the annotation
sns.heatmap(happiness_idx_df_18[['score', 'gdp_per_capita',
                                     'social_support', 'healthy_life_expectancy',
                                     'freedom_to_make_life_choices', 'generosity',
                                     'perceptions_of_corruption']].corr(), 
            cmap=cmap,
            square=True,
            cbar_kws={'shrink': .9},
            ax=ax,
            annot=True,
            annot_kws={'fontsize': 12}
            )



---
**TAKE NOTE**

We will be using a different dataset from here on. 

## Stacked Bar Chart

Bars can also be stacked. 

For example, we want to compare the number of people who turn up over different `time` periods of the day (dinner vs lunch) and separate them based on `sex`.

For this exercise we will need to use a new dataset - **tips dataset**. This dataset is built into the Plotly library so we will call for the dataset using the library and process our data from there.

In [None]:
# Retrieve data
tips_df = px.data.tips()
tips_df

In [None]:
# Plot the graph
fig = px.bar(tips_df, 
             x="time",
             y='size',
             color="sex",
             barmode='relative'
            )

# Display the graph
fig.show()

In [None]:
# Get the total number of people in each group
grouped_df = tips_df.groupby(['time','sex'])[['size']].sum().reset_index()
grouped_df

In [None]:
# Plot the graph
fig = px.bar(grouped_df, 
             x="time",
             y='size',
             color="sex",
             barmode='stack', # use 'relative' or 'stack' for barmode for stacked barchart
             category_orders = {'time':["Lunch", "Dinner"]}, # rearrange the order on the x-axis,
             color_discrete_map = {'Male':'green', "Female":'purple'}, # map the correct color
             labels = {'size': 'Number of people',
                       'time':'Time of meal'} # change the axis labels
            )

# Display the graph
fig.show()

### **Scenario #6: Stacked Bar Chart**

This time, we want to compare the number of people who turn up over different `time` periods of the day (dinner vs lunch) and separate them based on whether they are a `smoker` or not.


In [None]:
# Get the total number of people in each group
grouped_df = tips_df.groupby(['time','smoker'])[['size']].sum().reset_index()
grouped_df

In [None]:
# Plot the graph
fig = px.bar(grouped_df, 
             x="time",
             y='size',
             color="smoker",
             barmode='relative',
             category_orders = {'time':["Dinner", "Lunch"]}, # rearrange the order on the x-axis,
             color_discrete_map = {'Yes':'lightsalmon', "No":'lightblue'}, # map the correct color
             labels = {'size': 'Number of people'} # change the axis labels
            )

# Display the graph
fig.show()

### **Scenario #7: Horizontal 100% Stacked Bar Chart**

We can use a horizontal stacked bar to compare totals and see subcomponents. However, only the first subcomponent starts from a common baseline, so it is the only one that is easy to compare across bars.

Using a 100% horizontal stacked bar gives the last subcomponent a common baseline too (the 100% end), which allows us to compare both the first and the last subcomponents.

For example, we want to compare the number of people who turn up over different `time` periods of the day (dinner vs lunch) and separate them based on whether they are a `smoker` or not.


**How we should approach this scenario?**

1. Check our dataframe to see what information we have.
2. Having checked, we realise that we need to convert the numbers in `size` to a percentage.
3. To convert to a percentage, get the sum of the number of people who turned up during `dinner` and during `lunch`, then divide the numbers in the `size` column by the matching sum.
4. Create a new column to store this value - we use **`swifter.apply`** to help us. (Read below for what swifter is.)

**Note:**

Swifter is a 3rd party package that tries to pick the fastest available way to run pandas `.apply` calls automatically.

We will learn about multiprocessing and distributed processing in later sessions. For now, we shall put our faith into swifter.
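The steps above can also be carried out without typing the totals in by hand. The sketch below is an alternative using plain pandas `groupby(...).transform('sum')` on hypothetical counts shaped like `grouped_df`; it is not the swifter-based approach this notebook follows, but it avoids hard-coded numbers that would go stale if the data changed.

```python
import pandas as pd

# Hypothetical counts in the same shape as grouped_df
toy_grouped = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "size": [30, 10, 15, 5],
})

# transform('sum') returns one group total per row, aligned with the original rows
toy_grouped["total"] = toy_grouped.groupby("time")["size"].transform("sum")

# Percentage of each row within its time period
toy_grouped["percentage"] = toy_grouped["size"] / toy_grouped["total"] * 100

print(toy_grouped)
```

Within each `time` value the percentages add up to 100, which is exactly what a 100% stacked bar needs.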


In [None]:
# Check out our dataframe - grouped_df again
grouped_df

In [None]:
# To achieve the Horizontal 100% Stacked Bar Chart we will need to convert the numbers to a percentage
total_size = grouped_df.groupby(['time'])[['size']].sum()
total_size

In [None]:
# swifter makes pandas apply() much faster in large datasets
import swifter

# reset index so that swifter can work correctly.
time_df = grouped_df['time'].reset_index(drop=True)

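# We will use swifter to create the new column for us
# 463 and 164 are the Dinner and Lunch totals from total_size above
# time_df is a Series, so apply() is called without an axis argument
grouped_df['total'] = time_df.swifter.apply(lambda a_time: 463 if a_time == 'Dinner' else 164)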

grouped_df

In [None]:
# If we can create a column called total we can get the percentage easily by taking size/total*100
grouped_df['percentage'] = grouped_df['size'] / grouped_df['total'] * 100
grouped_df

In [None]:
# Plot the graph
fig = px.bar(grouped_df, 
             x="percentage", # Flip the order of our x and y axis
             y='time',
             color="smoker",
             orientation='h', # make the bars horizontal
             barmode='relative',
             category_orders = {'time':["Dinner", "Lunch"]}, # rearrange the order on the y-axis
             color_discrete_map = {'Yes':'lightsalmon', "No":'lightblue'}, # map the correct color
             labels = {'percentage': 'Percentage of people'} # change the axis labels
            )

# Display the graph
fig.show()

### **Scenario #8: Horizontal 100% Stacked Bar Chart**


Here we want to compare the number of people who turn up on different days (`day`) and separate them by period of the day (dinner vs lunch). We only want the data for Thur & Fri because only these 2 days have records of both dinner and lunch.


In [None]:
tips_df.head(2)

In [None]:
# Get the total number of people in each group
grouped_df_day = tips_df.groupby(['time','day'])[['size']].sum().reset_index()
grouped_df_day

In [None]:
# Extract the days we want only
days = ['Thur', 'Fri']
grouped_df_day = grouped_df_day[grouped_df_day['day'].isin(days)].reset_index(drop=True) # reset_index so that swifter can work correctly


# To achieve the Horizontal 100% Stacked Bar Chart we will need to convert the numbers to a percentage
total_size = grouped_df_day.groupby(['day'])[['size']].sum()

# reset index so that swifter can work correctly.
time_df = grouped_df_day['day'].reset_index(drop=True)
time_df

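# We will use swifter to create the new column for us
# 152 and 40 are the Thur and Fri totals from total_size above
# time_df is a Series, so apply() is called without an axis argument
grouped_df_day['total'] = time_df.swifter.apply(lambda a_day: 152 if a_day == 'Thur' else 40)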
grouped_df_day

In [None]:
# If we can create a column called total we can get the percentage easily by taking size/total*100
grouped_df_day['percentage'] = grouped_df_day['size'] / grouped_df_day['total'] * 100
grouped_df_day

In [None]:
# Plot the graph
fig = px.bar(grouped_df_day, 
             x="percentage", # Flip the order our x and y axis
             y='day',
             color="time",
             orientation='h',
             barmode='relative'
            )

# Display the graph
fig.show()

**Your Task:**

Create a Horizontal 100% Stacked Bar Chart split by `sex`. This means that we want to see the percentage split in the number of people by their sex instead of by whether they are smokers or not.

To make this even more challenging, create the percentage column directly using the **`swifter.apply`** function. This means that we do not want to see a separate column called `total` in the final dataframe, unlike the previous example.

Check your solution against the suggested answer!

In [None]:
# Get the total number of people in each group
grouped_df_sex = tips_df.groupby(['time','sex'])[['size']].sum().reset_index()
grouped_df_sex

In [None]:
total_size_by_sex = grouped_df_sex.groupby(['time'])[['size']].sum().reset_index()
# # Check dataframe
total_size_by_sex

In [None]:
### === Suggested Answer == ###

# reset index so that swifter can work correctly.
time_df = grouped_df_sex[['time','size']].reset_index(drop=True)
# # Check dataframe
# time_df

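# We will use swifter to create the new column for us
# time_df is a DataFrame, so axis = 1 applies the function row by row; columns are read by name
grouped_df_sex['percentage'] = time_df.swifter.apply(lambda row: row['size']/463*100 if row['time'] == 'Dinner' else row['size']/164*100, axis = 1)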
grouped_df_sex


In [None]:
# Plot the graph
fig = px.bar(grouped_df_sex, 
             x="percentage", # Flip the order of our x and y axis
             y='time',
             color="sex",
             orientation='h', # make the bars horizontal
             barmode='relative',
             category_orders = {'time':["Dinner", "Lunch"]}, # rearrange the order on the y-axis
             color_discrete_map = {'Male':'lightsalmon', "Female":'lightblue'}, # map the correct color
             labels = {'percentage': 'Percentage of people'} # change the axis labels
            )

# Display the graph
fig.show()

## Box Plot

Box plots give a statistical summary of the features plotted. 

The top and bottom lines (the whiskers) represent the maximum and minimum values, excluding any outliers.

The top line of the box represents the third quartile value, the middle line within the box represents the median and the bottom line of the box represents the first quartile value.

The height of the box is known as the interquartile range. 

Any individual dots beyond the whiskers represent outlier values.

We can call it easily with the `px.box()` method. 
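The quantities described above can be computed by hand to see where the box, the whiskers and the dots come from. This sketch uses NumPy on hypothetical values; Plotly's own quartile calculation may use a slightly different interpolation rule, so treat the numbers as illustrative rather than identical to what the chart displays.

```python
import numpy as np

# Hypothetical tips; the last value is deliberately extreme
tips = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 10.0])

# First quartile, median and third quartile
q1, median, q3 = np.percentile(tips, [25, 50, 75])

# Interquartile range: the height of the box
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = tips[(tips < lower_fence) | (tips > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```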

### **Scenario #9:**

Plot the box plot to visualize the distribution of `tip` for different time periods

In [None]:
tips_df

In [None]:
# Box plot
# dots are outliers
# outliers lie more than 1.5x the interquartile range (IQR) below the 1st quartile or above the 3rd quartile
# IQR is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile)
fig = px.box(tips_df, 
             x="time", 
             y="tip",
             color="time")
fig.show()

### **Scenario #10:**

Plot the box plot to visualize the distribution of `total_bill` for different time periods

In [None]:
# Box plot
fig = px.box(tips_df, 
             x="time", 
             y="total_bill",
             color="time")
fig.show()

## Color Selection

Beyond the default colour palettes, we have a large variety of color palettes at our disposal. 

We can create a range of custom sequential or diverging palettes. 

Sequential Palettes have colors that move from lighter to darker, or from one to another.

Diverging Palettes use two contrasting hues that meet at a neutral midpoint, so they attract attention at both ends of the spectrum.

The link to Plotly's color palettes is [here](https://plotly.com/python/builtin-colorscales/) and discrete colors is [here](https://plotly.com/python/discrete-color/).

