# Plotly - Unit 01 - Introduction

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

**Plotly Lesson consists of four units.**
* By the end of this lesson, you should be able to:
  * Work with Plotly datasets
  * Understand and use interactive buttons in Plotly Figures
  * Code and interpret a wide range of plot types from Plotly

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Interact with Plotly modules
* Get familiar with and load Plotly datasets
* Understand interactivity in Plotly
* Create a Plotly Figure


---

Plotly is a data visualisation library built on the JavaScript library Plotly.js, which makes its graphs inherently interactive.
* Plotly offers more than thirty unique chart types and wide customisation for statistical, geographical, financial, and 3-D plots.
* At its core, Plotly lets you hover over values to reveal more detail and zoom in and out of your graphs. The interactivity and animation of the plots allow a much deeper and more extensive investigation of your data while maintaining a substantial level of customisation.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> **Why do we study Plotly?**
  * Because there will be use cases where adding interactivity and animation to your data visualisation is essential.
  * This offers a more detailed visualisation when exploring, understanding and communicating your data insights.


## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add additional **code cells and try out** other possibilities; play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your own comments** in the cells. It can help you to consolidate your learning. 


* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Plotly in Python, the link is [here](https://plotly.com/python/)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Introduction: Part 01

The Plotly Python library aims to **create, manipulate and render graphical figures interactively while maintaining substantial customisation**.


The **rendering process** uses the Plotly.js **JavaScript** library in the back end.
* However, we will not need to interact with the JavaScript library itself.

---

**Is Plotly the only library for interactive data visualisation?**


No, other libraries offer similar capabilities, like [Bokeh](https://bokeh.org/), [Altair](https://altair-viz.github.io/) and [Folium](https://python-visualization.github.io/folium/quickstart.html).
  * In general, there is no **"best plotting library".** Instead, the choice depends on the use cases in your project, how familiar you and your team are with the syntax of each library, and your personal/professional preferences on the aesthetics each library offers, for example.
    * We will study **Plotly Express** in this course, a module of Plotly, since its syntax is simple and intuitive. Typically, it will cover most of the use cases you might face in the workplace.
    * At the same time, we encourage you to study the other libraries mentioned above by yourself, in your spare time, as part of your professional learning.


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Modules

Plotly has multiple modules which we can use in a Data Science project. The recommended entry point is **Plotly Express**, usually imported under the alias `px`.
  * We will start studying Plotly Express in this unit. This high-level data visualisation API produces fully-populated graph object figures in a single function call.
  * It has a set of functions that each create an entire figure at once, and it works in an integrated fashion with Pandas DataFrames.
    * The documentation for Plotly Express in Python can be found [here](https://plotly.com/python/plotly-express/).

Plotly Express is typically imported with the alias `px` 

import plotly.express as px

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Datasets API

Plotly offers a substantial number of datasets to help you better understand its capabilities.
  * Naturally, these datasets can be used for ML tasks and data visualisation purposes.

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Express Datasets

You can find the dataset API for Plotly Express [here](https://plotly.com/python-api-reference/generated/plotly.data.html)
 * Loading a dataset in Plotly Express is different from Seaborn.
  * You type ``px.data.`` followed by the name of one of the datasets.
  * The names run alphabetically from `carshare` to `wind`.
  * Don't forget to add `()` at the end of the dataset name.

Below we load the `carshare()` dataset.
* It has records for the availability of car-sharing services near the centroid of a zone in Montreal over a month-long period.

df = px.data.carshare()
df = df.head(50)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: Try and load a few datasets yourself.

df = ......  # load plotly express datasets using px.data.
df = df.head(50)
df.head()

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly GitHub Datasets

* In addition, you can load sample datasets from a GitHub repository using Pandas. Visit this [link](https://github.com/Code-Institute-Solutions/sample-datasets) to check the dataset list.
  * We picked `earthquake.csv` as the example below:

dataset_name = 'earthquake.csv'
df = pd.read_csv(f'https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/{dataset_name}')
df = df.head(50)
df.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: Try yourself
* Go to that link, pick a few datasets and load them here
* Due to the size of the datasets, we are limiting the number of rows retrieved using the `nrows` parameter of `pd.read_csv()`.

dataset_name = '.........'  # pick a dataset, don't forget to add .csv 
df = pd.read_csv(f'https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/{dataset_name}', nrows=100)
df = df.head(50)
df.head(3)

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Upper Right Buttons in a Plotly Plot

When displaying a Plotly plot, you will notice a few buttons to interact with the plot in the upper right section.
  * We will render a Plotly graph in the cell below. The focus is not the code or the plot itself, but the buttons in the upper right section.

* The dataset here holds records for waiter tips based on the day of the week, time of day, total bill, gender, whether it is a table of smokers or not, and how many people were at the table.

df = px.data.tips()
df = df.head(50)
fig = px.scatter(data_frame=df, x="total_bill", y='tip')
fig.show()

Hover over the top right corner of the plot, and you will notice these buttons. Explore the options so that you can become familiar with them: 
* Download the plot as PNG, zoom, pan, select the plot using box or lasso mode, autoscale and reset the axes.


An alternative way to **zoom** in is to click and hold the left mouse button, drag over the area you want and release.
* To reset the zoom, just double-click anywhere in the plot
  * Try it out on the previous plot

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Figure


At a basic level, a **Figure in Plotly** can be represented as a dictionary and displayed using functions from the ``plotly.io`` module. 
  * The **dictionary below describes** a Figure. It contains a scatter plot and a title.

fig = dict({
    "data": [{"type": "scatter",
              "x": ["a", "b", "c"],
              "y": [1,3,2]}],
    "layout": {"title": {"text": "sample figure"}}
})

fig

We import the `plotly.io` module with the alias `pio` and use its `show()` function to plot the dictionary above:

import plotly.io as pio
pio.show(fig)

Naturally, in practical terms, we do not plot our data using the method above. 
  * The **purpose** is to have a glimpse under the hood at least once in your Plotly experience.


The code example below uses **Plotly Express** to create the same plot.
  * We will not get into the specifics of the code, but you may feel **the syntax is clear and intuitive**
  * To display a plot, you will use the method `.show()`

fig = px.line(x=["a","b","c"], y=[1,3,2], title="sample figure")
fig.show()

You can `print()` the `fig` to inspect the content:

print(fig)

---

# Plotly - Unit 02 - Plots: Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and deliver line plots, area plots, histogram and boxplots in Plotly



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Plots: Part 01

In this unit we will explore and learn multiple use cases in Plotly to deliver:
* Line plot
* Area Plot
* Histogram
* Boxplot

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Line Plot


Consider the ``gapminder`` dataset; it has records on population, Gross Domestic Product, and life expectancy for more than 140 countries from 1952 to 2007. Each row represents a country in a given year

df = px.data.gapminder().query("country=='Canada'")
df.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are already familiar with line plots. We can draw an interactive line plot using `px.line()`. The function documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.line). The arguments used here are `data_frame`, `x`, `y` and `title`. Other arguments can be found in the documentation and will be presented progressively.
* We will plot life expectancy in Canada, so the DataFrame passed to the ``data_frame`` argument was already filtered to that country with `.query()`.
* As a recap, we render a Plotly plot with ``fig.show()``.

fig = px.line(data_frame=df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()

Now we want to plot multiple lines, each related to a country. First, let's subset some data; we will subset data from Oceania

df_oceania = px.data.gapminder().query("continent=='Oceania'")
df_oceania.head(3)

There are two countries

df_oceania['country'].unique()

We add the `color` argument to `px.line()` and set it to `'country'`, so it plots a line for each country.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the legend, you can click on a given country to activate/deactivate it in the plot. The plot is automatically updated when you do so; this is a practical benefit of an interactive graph.
* Note that both countries have similar life expectancy levels and change at similar rates.

fig = px.line(data_frame= df_oceania,
              x="year", y="lifeExp", color='country')
fig.show()

When dealing with multiple categorical variables, you can facet (or split) the plot by column and row
* First, let's subset a part of the data where the continent is Oceania or the Americas

df_facet = px.data.gapminder().query("continent in ['Oceania','Americas']")
df_facet.head(3)

We add the parameter `facet_col='continent'` to create faceted column subplots.
* Note that in the Americas there is a disparity in life expectancy: a few countries have levels below 65 or above 75, and the majority lie between these levels. Check also whether all countries changed at the same rate across the years. For example, El Salvador has a different rate than the others.

fig = px.line(data_frame=df_facet,
              x="year", y="lifeExp", color='country', facet_col ='continent')
fig.show()

We add the parameter `facet_row='continent'` to create faceted row subplots.
* The interpretation is similar to the previous plot. You may ask yourself which option visualises the data best. The answer typically is: try both and see which works best.

fig = px.line(data_frame=df_facet, x="year", y="lifeExp", color='country', facet_row ='continent')
fig.show()

You can add more information to the tooltip that appears when you hover over a given data point. Add the `hover_name` and `hover_data` arguments:
* With `hover_name`, the value of the column `iso_alpha` is displayed in bold at the top of the hover tooltip.
* With `hover_data`, the values of the columns `'pop'` and `'gdpPercap'` appear at the bottom of the hover tooltip as extra data.

fig = px.line(data_frame=df_facet, x="year", y="lifeExp", color='country', facet_col='continent',
              hover_name='iso_alpha',
              hover_data=['pop','gdpPercap'])
fig.show()         

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You can add a range slider when visualising time series in a line plot

Consider the dataset below, loaded from the sample datasets repository on GitHub

df = pd.read_csv('https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/finance-charts-apple.csv')
print(df.shape)
df.head(3)

You will create the line plot as usual and update the x-axis with `rangeslider_visible=True`

fig = px.line(df, x='Date', y='AAPL.High', title='Time Series with Rangeslider')
fig.update_xaxes(rangeslider_visible=True)  # add range slider for time series data
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note you can drag and move the "handles" at either end of the range slider, the miniature copy of the plot shown beneath the x-axis.
* Dragging them narrows or widens the time window displayed in the main line plot


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the 'flights' dataset.

df = sns.load_dataset('flights')
df['year'] = df['year'].astype('str')
df['month'] = df['month'].astype('str')
df['Date'] = pd.to_datetime(df['month'] + '-' + df['year'] )
df.set_index('Date',inplace=True)
print(df.shape)
df.head()

You are interested in using a Plotly line plot to visualise passenger demand over time

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Area Plot


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are already familiar with Area Plot. You can visualise time series data with an area plot, which can be considered a type of line plot, but filled with a colour to facilitate visualisation

Consider the ``stocks`` dataset. It has records for stock prices from multiple technology companies from the beginning of 2018 until the end of 2019. 

df = px.data.stocks().head(50)
df.head(3)

Note the data is in a wide format. We will convert it to a long format using `pd.melt()`; in case you want a refresher from the Pandas lesson, check the [pd.melt documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)

df = (pd.melt(frame=df, var_name='Company', value_name='Price', 
             value_vars=['GOOG', 'AAPL', 'AMZN', 'FB', 'NFLX', 'MSFT'],
             id_vars='date')
)

df.head(3)
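
To make the wide-to-long conversion concrete, here is a toy sketch with two dates and two companies. The numbers are made up for illustration and are not taken from the `stocks` dataset; the names `wide` and `long` are ours.

```python
import pandas as pd

# Wide format: one column per company (illustrative values only)
wide = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-08"],
    "GOOG": [1.00, 1.02],
    "AAPL": [1.00, 1.01],
})

# Long format: one row per (date, company) pair
long = pd.melt(frame=wide, id_vars="date",
               var_name="Company", value_name="Price")
print(long)
```

The two company columns become a single `Company` column, which is the shape that the `color` and `facet_col` arguments expect.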

We plot an area chart with `px.area()`; its documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.area.html).
* Note the plot is stacked, meaning that for each value of x, the y values of all companies are summed on top of one another. We are not interested in that here since we want to see the variation of each company separately.

fig = px.area(df,x='date', y='Price', color='Company')
fig.show()

We will facet the plot to reach the effect we want: unstack the price and see each company's evolution separately. We also set the argument `facet_col_wrap`, which defines the maximum number of facet columns before wrapping to a new row.

fig = px.area(df,x='date',y='Price', color='Company',
              facet_col="Company", facet_col_wrap=2)
fig.show()


---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's see an example where it would be interesting to have stacked data

Consider the ``gapminder`` dataset; it has the population, Gross Domestic Product, and life expectancy records for more than 140 countries. Each row represents a country in a given year
* We will subset data for Oceania in this exercise

df = px.data.gapminder().query("continent == 'Oceania'")
df.head(3)

We are interested in plotting the population evolution across the years per country. In this context, we are interested in having a stacked aspect in the plot since we can check the population proportion across countries.
* For example, we see that Australia has a much larger population than New Zealand

fig = px.area(df,x='year',y='pop', color='country')
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We have to be careful when using a stacked area plot. The example above had two countries. When we have more countries, two concerns arise:
* colours start to repeat
* lines become cluttered


We will subset the countries from the Americas continent

df = px.data.gapminder().query("continent == 'Americas'")
df.head(3)

And we will use the same code from the previous example. You will notice that:
* The colours have started to repeat - note that Brazil, the United States and Haiti have the same green colour
* Lines are cluttered since there could be a slight difference from one country to another

fig = px.area(df,x='year',y='pop', color='country')
fig.show()

* The colour aspect can be managed with the `color_discrete_sequence` argument. You can check the available colour sequences with the command `px.colors.qualitative.swatches()`

px.colors.qualitative.swatches()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You append the name of the desired sequence to `px.colors.qualitative`; for example, if you want Alphabet, it will be `px.colors.qualitative.Alphabet`. We picked Alphabet since there are many countries and Alphabet has a wide range of colours

* The graph will improve since there will be no more repeated colours; however, the "cluttered" effect may remain
* The insight here is to grasp which countries were more representative over the years

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Naturally, you can set `color_discrete_sequence` argument for other plot types where you are colouring the plot with a categorical variable

fig = px.area(df,x='year',y='pop', color='country',
              color_discrete_sequence=px.colors.qualitative.Alphabet,
              )
fig.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Histogram


Consider the tips dataset. Yes, the one from Seaborn. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, if it is a table of smokers or not, and how many people were at the table.

df = px.data.tips().sample(n=150, random_state=1)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You can create a histogram using `px.histogram()`; its documentation is [here](https://plotly.github.io/plotly.py-docs/generated/plotly.express.histogram.html). The arguments are:
* ``data_frame``
* ``x`` for the variable you want to see the distribution of 

Use cases for additional parameters will be discussed soon.

fig = px.histogram(data_frame=df, x="tip")
fig.show()

You can create a histogram with Marginal Boxplot by adding `marginal='box'`
* It helps you to see outliers, median, Q1 and Q3 quickly

fig = px.histogram(df, x="tip", marginal="box")
fig.show()

You can add more information to your histogram; for example, the `color` argument uses a variable, typically categorical, to assign a colour to each group.

fig = px.histogram(df, x="tip", color="sex", marginal="box")
fig.show()

We can use `facet_col` and `facet_row`, which we learned in a previous section, to divide the plot further and look for more insights
* Note, for example, that tips greater than 6 occur at dinner on weekends

fig = px.histogram(df, x="tip", color="sex", facet_col='day', facet_row='time')
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the 'penguins' dataset.

df_practice = sns.load_dataset('penguins').sample(n=150)
print(df_practice.shape)
df_practice.head()

Feel free to try out your ideas to make different histograms on this dataset

# write your code here


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Boxplot


We are already familiar with Boxplot. We will use the same tips dataset to plot the boxplot using `px.box()`. Its documentation is [here](https://plotly.github.io/plotly.py-docs/generated/plotly.express.box.html)
* Similarly to other libraries, we pass the data, `x` for the groups or categories and `y` for the levels you are interested in plotting

fig = px.box(df, x="day", y="tip")
fig.show()

You can add the `color` argument to divide your plot further. In this case, you see the tip distribution per day per smoker status

fig = px.box(df, x="day", y="tip", color="smoker")
fig.show()

A box plot helps with descriptive statistics figures, but a box plot itself doesn't give the full information on how much data there is in a group and its distribution shape.
  * You can add `points='all'` to give you a better sense of the distribution
  * Note you have fewer datapoints on Friday and Thursday

fig = px.box(df, x="day", y="total_bill", points='all')
fig.show()

Again, you can facet your plot with `facet_col` or `facet_row`. We selected `facet_col='sex'`

fig = px.box(df, x="day", y="total_bill", points='all', facet_col='sex')
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the iris dataset.

df_practice = sns.load_dataset('iris').sample(n=100)
print(df_practice.shape)
df_practice.head()

Feel free to try out your ideas to make different box plots on this dataset

# write your code here

---

# Plotly - Unit 03 - Plots: Part 02

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and deliver bar plots, scatter plots, scatter plot 3D and parallel plots in Plotly



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Plots: Part 02

In this unit, we will explore and learn multiple use cases in Plotly to deliver:
* Bar plot
* Scatter plot
* Scatter plot 3D
* Scatter Matrix 
* Parallel plots

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Bar Plot


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are already familiar with what information a bar plot provides.

* Consider the ``gapminder`` dataset; it has the population, Gross Domestic Product, and life expectancy records for more than 140 countries. Each row represents a country for a given year
* We are querying data for Ireland

df = px.data.gapminder().query("country == 'Ireland'")
df.head()

We want to plot population levels over the years. We use `px.bar()`, its documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.bar). The arguments are: 
* ``data_frame``
* ``x`` the groups you are interested in
* ``y`` the counts you want to display

Can you see any trend (upwards or downwards)?



fig = px.bar(df, x='year', y='pop')
fig.show()

Consider another dataset. It shows medal counts by medal type for a few nations

df = px.data.medals_long()
df

We want to check each nation's performance in terms of different types of medals.
* We use ``px.bar()``, where the axis ``x`` holds the nation, ``y`` is the count of medals, and the medal type colours the bar.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note that the data considered here is like a "summary" since it counts medals per medal type per nation. You don't have repeated rows for the same combination of variables.

fig = px.bar(df, x="nation", y="count", color="medal", title="Long-Form Input")
fig.show()

---

Now consider the tips dataset. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, whether it is a table of smokers or not, and how many people were at the table.

df = px.data.tips()
df = df.sample(n=50, random_state=1)
df.head(3)

We want to know how many Females and Males we have in the dataset. We use `px.bar()` and set x='sex'. 
* Note that several rows share the same value of x (in this case, `sex` variable: female or male). The rectangles are stacked on top of one another by default.

fig = px.bar(df, x="sex")
fig.show()

If you prefer, you could process the data before plotting, in this case, with `.value_counts()` since you are using only one variable. Once you do a value count, you convert the result to a DataFrame with `.rename_axis()` and `.reset_index()`, which gives explicitly named `sex` and `count` columns regardless of the Pandas version.

df_processed = df['sex'].value_counts().rename_axis('sex').reset_index(name='count')
df_processed

We plot the DataFrame above with `px.bar()`

fig = px.bar(data_frame=df_processed, x="sex", y="count")
fig.show()

---

In both previous cases, you count how many females and males you have in the dataset.
* However, you may be interested in summing the values of `tip` (numerical variable) per `sex`. You can add the argument `y` and set it to `tip`

fig = px.bar(df, x="sex", y='tip')
fig.show()

If you prefer, you could process the data before plotting, in this case, with `groupby()` and aggregate by `.sum()` and `.reset_index()` since you have two variables

df_processed = df[['sex','tip']].groupby('sex').sum().reset_index()
df_processed

You may plot using the processed data frame
* Note that the bars are no longer made up of many small stacked rectangles, since there is now a single row per group

fig = px.bar(data_frame=df_processed, x="sex", y='tip')
fig.show()

---

You may be interested in colouring your bar plot with another categorical variable. You can do that with the ``color`` argument
* The ``barmode`` argument allows you to stack the bars (`stack`) or place them side by side (`group`)

fig = px.bar(df, x="sex",  color="smoker", barmode="stack")
fig.show()

The example below has the same code from the previous cell but uses `barmode='group'`

fig = px.bar(df, x="sex", color="smoker", barmode="group")
fig.show()

---

As usual, you can facet your plot using `facet_row` and `facet_col`.
* In this plot, you want to see multiple bar plots displaying the count of Females and Males, per day, per time and smoker status

fig = px.bar(data_frame=df, x="sex", color="smoker", barmode="group",
             facet_row="time", facet_col="day")
fig.show()

You will notice the days of the week are not ordered conventionally.
* You can manually arrange the order for your categorical variables in your plot with ``category_orders``; just pass a dictionary, where the ``key`` is the variable name, and the ``values`` are in the order you want.

fig = px.bar(data_frame=df, x="sex", color="smoker", barmode="group",
             facet_row="time", facet_col="day",
             category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]})

fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the 'penguins' dataset.

df_practice = sns.load_dataset('penguins').sample(n=150, random_state=1)
df_practice.head()

Feel free to try out your ideas or use the following suggestion.

You are interested in using a bar plot. You decide to learn the species count coloured by island in a stacked bar plot

# write your code here


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scatter Plot

We are familiar with the Scatter plot. Consider the ``iris`` dataset. It contains records of 3 classes of iris plants, with their petal and sepal measurements.

df = px.data.iris()
df = df.sample(n=50, random_state=1)
df.head()

We want to build a scatter plot where the x-axis has sepal width, the y-axis sepal length and is coloured by species. We use `px.scatter()`, its documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.scatter). 

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()

You can add marginal plots to your scatter plot with the `marginal_x` and `marginal_y` arguments; here we use box plots by setting `marginal_y="box"` and `marginal_x="box"`
* In this case, it gives quick insights into the median, Q1, Q3 and outliers for each species on sepal length and sepal width

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 marginal_y="box", marginal_x="box")
fig.show()

You can add a trend line, even if you are not predicting a number (like in this dataset, where we want to predict the flower species). The trend line's benefit is seeing overall behaviour, like variance, range, min and max, and change rate.
* You just need to add `trendline='ols'` for a linear trend line (this option requires the `statsmodels` package to be installed). Other trend line options are found in the documentation

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 trendline="ols")
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the penguins' dataset.

df_practice = sns.load_dataset('penguins').sample(n=50, random_state=1)
df_practice.head()

Feel free to try out your ideas on a scatter plot with Plotly using what you just learned

# write your code here

---

Consider the ``gapminder`` dataset. It has records on population, Gross Domestic Product, and life expectancy for more than 140 countries. Each row represents a country in a given year

df = px.data.gapminder().head(500)
print(df.shape)
df.head()

You can transform your scatter plot to a bubble chart and add  Animation
* When you set the argument `size`, you create a bubble chart, where each data point has its size related to a variable. In this case, we will use population
* Use the argument `animation_frame` to assign marks to animation frames. In this case, we animate over the years
* The x and y axes will be GDP per capita and life expectancy
* The data points will be coloured by continent, and we will see the related country when we hover over them.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Look how much information we can display in a single plot!
* In this case, we have three numerical variables (for x, y and size) and three categorical variables (year, continent and country) that allow for that

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note there is a play button; click to run the animation. There is also a stop button

fig = px.scatter(df, x="gdpPercap", y="lifeExp", 
                 animation_frame="year", 
                 size="pop",
                 color="continent", hover_name="country",
            )

fig.show()

You might feel it is nice to see the bubbles moving around, but you can't make much sense of it. We need to set additional parameters
* The issues lie in the bubble sizes and their positions on the plot.
* We set `size_max=55`, so we get bigger circles (there is no rule for 55, it is trial and error)
* The ``x-axis`` has a wide range; we can compare levels better over a large range of values when it is in log scale, so we add `log_x=True`
* We set the x and y range limits with `range_x` and `range_y`, each given as a list `[min, max]`. You will find suitable values for min and max using trial and error


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Now it is clear to see how continents changed their GDP, life expectancy, and population levels over time.

fig = px.scatter(df, x="gdpPercap", y="lifeExp", 
                 animation_frame="year", 
                 size="pop",size_max=55, 
                 color="continent", hover_name="country",
                 log_x=True, range_x=[100,100000], range_y=[25,90]
            )

fig.show()

Again, you can facet your plot if you wish to view individualised animated scatter plots.
* This visualisation is, in fact, very powerful. Reflect for a moment on the meaning of it when you consider the GDP level, life expectancy and population across different countries. How similar and different are the patterns among them?

fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year",
           size="pop", color="continent", hover_name="country", facet_col="continent",
           log_x=True, size_max=30, range_x=[100,100000], range_y=[25,90])
fig.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Animation Side Note

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  You can animate almost all the plots we are presenting in this lesson: histograms, boxplots, bar plots, scatter plots, maps etc. However, it requires having your data in good shape when adding animation. To validate, look for the ``animation_frame`` argument in the studied functions.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Typically you will be interested in animating your data when:
  * There is a time/date component, like the year, month, day, weekday etc.
  * Or, when you have a categorical variable and are interested in checking the behaviour across its levels dynamically.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scatter Plot 3D


Consider the dataset
* It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.head()

We are interested in creating a 3D scatter plot.


Notice that the colour variable is categorical in this example. That will create a discrete colour sequence
* As we studied in a previous notebook, the colour aspect can be managed with the `color_discrete_sequence` argument. You can check the possible arguments with the command `px.colors.qualitative.swatches()`

px.colors.qualitative.swatches()

We create a 3D scatter plot with `px.scatter_3d()`. The documentation is found [here](https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d.html). The arguments are `x`, `y` and `z` for the axis coordinates and `color` to set the colour of the data points; `color_discrete_sequence` defines the colour palette.

fig = px.scatter_3d(df, x="bill_length_mm", y="bill_depth_mm", z="flipper_length_mm",
                    color='species',
                    color_discrete_sequence=px.colors.qualitative.Vivid)
fig.show()

---

Consider another dataset now
* It contains records of 3 classes of iris plants, with petal and sepal measurements

df = px.data.iris()
df = df.sample(n=50, random_state=1)
df.head()

The use case now is different; we will use a continuous variable to colour the plot: ``petal_length``


That will create a scale with a range of colours. You can set the colours with the `color_continuous_scale` argument 
* The options for this argument can be found with the command: `px.colors.sequential.swatches() `

px.colors.sequential.swatches()

We picked 'ice'


fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                    color='petal_length',color_continuous_scale='ice')
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**:  Do a 3D Scatter plot using the iris dataset using three numerical variables and colour by species
* Task: Can you see patterns among the variables to see regions that are more typical from a given class? Also, is there a region where the classes are mixed/mingled?

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scatter Matrix


 Similar to Pairplot from Seaborn, we can use a Scatter matrix in Plotly with `px.scatter_matrix()`; its documentation is [here](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_matrix.html). For a basic plot, we have to provide the following arguments: `data_frame` for the data, `dimensions` for the columns you want to plot, and `color` so you can distinguish the dots visually in the plot.
 * One downside is that this figure shows repeated information since the upper triangle has the same information as the lower triangle
 * An upside is that we can interactively select data points in a given plot (for example, select a few data points on sepal_width x sepal_length plot), and the same data points will be highlighted on the other plots.

df = sns.load_dataset('iris')
fig = px.scatter_matrix(data_frame=df,
                        dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
                        color="species",
                        )
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the penguins' dataset.

df_practice = sns.load_dataset('penguins')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head()

You are interested in using a scatter matrix (pair plot). You decide to make multiple plots, each coloured by one of the categorical variables (species, island, sex)

# write your code here

# write your code here

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Parallel Plots


* In a Parallel Coordinates plot, each observation of the DataFrame is a polyline mark which goes through a set of parallel axes, where each axis is a **numerical variable**
* It is useful when you are interested in revealing patterns of the relationships between numerical variables.

Consider the dataset
* It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.head(3)

We convert categorical to numerical with `.replace()`
* You should know in advance the category levels you want to replace; for example, sex has Female and Male, species has Adelie, Chinstrap and Gentoo etc

df['species'] = df['species'].replace({'Adelie':0, 'Chinstrap':1, 'Gentoo':2})
df['sex'] = df['sex'].replace({'Male':0, 'Female':1})
df['island']= df['island'].replace({'Torgersen':0, 'Biscoe':1, 'Dream':2})
df.head(3)
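
`.map()` is an alternative worth knowing for this conversion. A sketch on a small made-up frame (`df_demo` is hypothetical and only mimics the penguins columns); note that, unlike `.replace()`, values not found in the dictionary become `NaN`:

```python
import pandas as pd

# Small made-up frame standing in for the penguins sample
df_demo = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "sex": ["Male", "Female", None, "Female"],
})

species_codes = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
sex_codes = {"Male": 0, "Female": 1}

# .map() looks each value up in the dictionary; missing values stay missing
df_demo["species"] = df_demo["species"].map(species_codes)
df_demo["sex"] = df_demo["sex"].map(sex_codes)

print(df_demo)
```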

We use `px.parallel_coordinates()`; its documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.parallel_coordinates.html). We pass the dataset, the columns we want to see in the plot and the colour.
* Note we passed a list of the columns we want to see in the plot. We don't want to see 'species', since that will be the colour.
* Note the plot is coloured by `species`, so we are interested to see patterns for species across the variables.
* Note for `island` 0 (Torgersen), there is only `species` 0 (Adelie)
* Note higher levels for `flipper_length_mm` appear in `species` 2 (Gentoo), and lower levels for `bill_length_mm` happen more often in species 0 (Adelie)

fig = px.parallel_coordinates(df, color="species",
                              dimensions = ['island','bill_length_mm','bill_depth_mm',
                                            'flipper_length_mm', 'body_mass_g', 'sex'])
fig.show()

We coloured the plot by `species`, which is now a numerical variable. Check the available continuous colour scales to use in this case

px.colors.sequential.swatches() 

We picked `'viridis'` for the colour

fig = px.parallel_coordinates(df, color="species", color_continuous_scale='viridis')
fig.show()

---

If your data has more categorical variables or you are only interested in conducting an analysis with categorical variables, it may be more effective to use `px.parallel_categories()`, so you don't have to add the effort of converting from categorical to numerical. Its documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.parallel_categories.html)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> The only caveat is that your **`color` variable should be numerical**. If this variable is already a number, no additional steps are needed. If it is a category, you will need to convert it
* In our exercise, ``survived`` is already a number (even though it represents a category)

df = sns.load_dataset('titanic').drop(['alive'],axis=1).sample(n=200, random_state=1).reset_index(drop=True)
print(df.shape)
df.head(3)

You will notice the plot shows the categorical variables; columns with many distinct values, such as continuous numerical variables, are disregarded
* This plot is useful since you can see the proportions on each variable's level (for example, there were more males than females, but more females survived than males)
* Note we didn't pass `dimensions`, so we allowed the plot to include all possible variables, including `survived` (which is the colour variable). This is also fine. When you are not sure which variables to plot at first, you plot them all, then refine.

fig = px.parallel_categories(df, color="survived", color_continuous_scale='viridis')
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the gapminder dataset.

df_practice = px.data.gapminder()
df_practice = df_practice.sample(n=100, random_state=1).reset_index(drop=True)
df_practice.head()

We map the continents to numbers

continent_map = {'Asia':0, 'Europe':1, 'Africa':2,'Americas':3, 'Oceania':4}
df_practice['continent'] = df_practice['continent'].replace(continent_map)
df_practice.head()

 First, just a quick recap on the continent_map

continent_map

Make a parallel coordinates plot with dimensions as `['continent', 'lifeExp', 'pop', 'gdpPercap']`, coloured by continent.
* What are the patterns that you see? Where typically do we have a larger population, life expectancy or GDP per cap?

# write code here

---

# Plotly - Unit 04 - Plots: Part 03

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

In this unit, we will explore and learn multiple use cases in Plotly to deliver:

* Map box
* Sun burst
* Tree Map
* Waterfall chart
* Additional considerations when using Plotly



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import plotly.express as px

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Plotly Plots: Part 03

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In this section, we will present the last set of Plotly charts and will deliver
* Map box
* Sun burst
* Tree Map
* Waterfall chart

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Map box


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are datasets that contain geolocation information, like longitude and latitude. An effective way to understand it is to plot it on a map, where each dot represents the latitude and longitude.
* Typically you will be interested in showing a continuous value as the size of the data point. You could also colour the data point either using a categorical or numerical value. On top of that, you can animate the plot if your data consider time series information. These are considerations, but all will boil down to figuring out which behaviour you want to understand

Consider the following dataset
* It has records for the availability of car-sharing services near the centroid of a zone in Montreal over a month-long period

df = px.data.carshare()
df = df.sample(n=50, random_state=1)
df.head()

We can use `px.scatter_mapbox()` to plot a map considering the latitude and longitude. The function documentation is [here](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html). The arguments are:
* `data_frame` as the data, `lat` and `lon` as latitude and longitude, `color` and `size` as colour and size of the dots.
* The colour options can be found with the command `px.colors.sequential.swatches()` and you set with the argument color_continuous_scale.
* `size_max` and `zoom` set the max dot size and the map zoom. We set them as 15 and 10, but this is more a trial-and-error exercise
* `mapbox_style` is a critical parameter, and it sets the style of the plot. In case you don't set it, the plot may not be rendered. You may find other options in the documentation.


Once you render the figure, you can naturally zoom in and out, move around the map, and hover over the data points.


fig = px.scatter_mapbox(data_frame=df, lat="centroid_lat", lon="centroid_lon", color="peak_hour", size="car_hours",
                        mapbox_style='open-street-map',  # don't forget this argument :)
                        color_continuous_scale="plasma",
                        size_max=15, zoom=10)
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the US cities dataset.

df_practice = pd.read_csv("https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/2014_us_cities.csv")
df_practice = df_practice.sample(n=70, random_state=1)

print(df_practice.shape)
df_practice.head()

Create a scatter mapbox where color and size should be ``pop``. When you hover the mouse, the name should appear. Try size_max=30 and zoom=2

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Sunburst


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to Plotly [documentation](https://plotly.com/python/sunburst-charts/), Sunburst plots visualise hierarchical data spanning outwards radially from root to leaves. The root starts from the centre, and children are added to the outer rings.
* It shows the relation of Part to Whole. Here you are interested in understanding how they interact and what their proportions are to each other

* It is interesting when you have multiple categorical variables and want to visualise proportions or counts for numerical variables. You can colour either with a numerical or categorical variable

Consider the dataset
* It has records on population, Gross Domestic Product, and life expectancy for more than 140 countries. Each row represents a country in a given year. We consider 2007, for this exercise

df = px.data.gapminder().query("year == 2007")
print(df.shape)
df.head()

We use `px.sunburst()` to create a sunburst chart. The documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.sunburst.html). The arguments are:
* ``data_frame`` for the data. ``path`` is the hierarchy of sectors. ``values`` define each sector size. ``color`` gives colour to each sector. ``hover_name`` and ``hover_data`` are arguments we are already familiar with and show additional information when we hover over the sectors
* Click on a given continent to drill down. Click the same continent again to zoom back out
* The values shown when you hover over a continent are the sum of its countries. For example, for Asia, you see 3.81 billion



fig = px.sunburst(data_frame=df, path=['continent', 'country'], values='pop',
                  color='gdpPercap', hover_name='iso_alpha', hover_data=['lifeExp'])
fig.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Tree Map


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A Tree map also shows parts of the whole, but in a different manner compared to the Sunburst. It shows hierarchical data using nested rectangles instead.

Consider the dataset
* It has records on population, Gross Domestic Product, and life expectancy for more than 140 countries. Each row represents a country in a given year. Again, we consider data from 2007

df = px.data.gapminder().query("year == 2007")
print(df.shape)
df.head()

We will use `px.treemap()` to plot a tree map. The documentation is [here](https://plotly.com/python-api-reference/generated/plotly.express.treemap.html). The arguments we use here are the same as from sunburst.

fig = px.treemap(data_frame=df, path=['continent', 'country'], values='pop',
                  color='gdpPercap', hover_name='iso_alpha', hover_data=['lifeExp'])
fig.show()

In this plot, we make one change to the `path` argument: we add a constant to the list (in this case, the string **World**, created with `px.Constant()`) to represent the largest rectangle, which encloses everything.
* Note it is positioned as the first item in the list 

fig = px.treemap(data_frame=df, path=[px.Constant('World'), 'continent', 'country'], values='pop',
                  color='gdpPercap', hover_name='iso_alpha', hover_data=['lifeExp'])
fig.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the sales success dataset. It shows records for salesperson performance (calls and sales) and their territory (region, county)

df_practice = pd.read_csv("https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/sales_success.csv")
df_practice = df_practice.sample(n=50, random_state=1)
print(df_practice.shape)
df_practice.head(3)

Create a tree map where the path is region, county and salesperson. Both `values` and `color` should be sales, and when you hover over a rectangle, the calls should appear
* Visually speaking, in which region do you tend to see more sales? And less? And in which counties?

# write your code for tree map

Now create a sunburst chart using the same dataset. Which one do you find more effective in understanding the data?

# write your code for sunburst

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Waterfall


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A Waterfall chart aims to demonstrate the cumulative effect of sequentially added values. It is commonly used in finance.





Imagine your company is interested in creating a waterfall chart displaying the annual Profit and Losses.
* Your company has revenue from 4 products. It has costs and expenses and pays tax.

Consider the fictitious data below (in the workplace, this dataset likely would come from a processed report)
* In this exercise, someone from your company prepared the data already in the format below, where the DataFrame index has the report item (like Revenue, Cost, Profit etc.), and the column Value holds the amount for a particular item.
* `.T` transposes the rows and columns of your DataFrame

df = pd.DataFrame({"Revenue Product 1": [500],
                   "Revenue Product 2": [100],
                   "Revenue Product 3": [600],
                   "Revenue Product 4": [250],
                   "Net revenue": [1450], 
                   "Fixed Cost": [-200],
                   "Variable Cost": [-400],
                   "Trips Expenses": [-300],
                   "Other expenses": [-10],
                   "Operating profit": [540],
                   "Income tax": [-200],
                   "Net Profit": [340]},
              index=['Value'])

df = df.T  # .T transposes the rows and columns
df


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In this particular chart, we are using another module from Plotly, called Plotly Graph Objects, or the common alias, `go`
  * Currently, Plotly Express doesn't have this capability for a Waterfall chart


The documentation for Waterfall Charts in Plotly is found [here](https://plotly.com/python/waterfall-charts/), and the documentation for Plotly GraphObjects is found [here](https://plotly.com/python/graph-objects/). The arguments we consider are:
* `orientation='h'`, for horizontal
* `measure` is a list with one entry per value (or step) in your chart. For each step, you set whether it is `relative` or `total`. A relative step is added to or subtracted from the running flow; for example, "Revenue Product 2" adds to the product revenues. A total step simply shows the accumulated value at that point; for example, Net revenue shows the accumulated revenue for a set of products

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You will read the plot bottom-up, starting from Revenue Product 1 until Net Profit

import plotly.graph_objects as go
fig = go.Figure(go.Waterfall(
    orientation = "h",
    measure = ["relative", "relative", "relative", "relative","total",
               "relative", "relative", "relative", "relative","total",
                "relative","total"],
    y = df.index.to_list(),
    x = df['Value']
))


fig.update_layout(title = "Waterfall Chart - Profit & Losses in 2020 in M$",
                  width=800, height=500)
# go.Waterfall() doesn't have an argument for setting width, height and title
# we set them with fig.update_layout()
fig.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Additional arguments consideration

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  There are additional arguments present in almost all of the functions we covered that we often will adjust depending on our use case and projects. We will study:
* Title
* Set plot size
* Set template

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Title

Consider the dataset
* It has records on population, Gross Domestic Product, and life expectancy for more than 140 countries. Each row represents a country in a given year

df = px.data.gapminder()
df = df.sample(n=50, random_state=1)
df.head(3)

You can set the title with the argument `title`

fig = px.scatter(df, x="gdpPercap", y="lifeExp", 
                 animation_frame="year", 
                 size="pop",size_max=55, 
                 color="continent", hover_name="country",
                 log_x=True, range_x=[100,100000], range_y=[25,90],
                 title='Title for this Figure!!!!'
            )

fig.show()

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Set Plot Size

You can set the plot size using width and height; the values are in pixels

fig = px.scatter(df, x="gdpPercap", y="lifeExp", 
                 animation_frame="year", 
                 size="pop",size_max=55, 
                 color="continent", hover_name="country",
                 log_x=True, range_x=[100,100000], range_y=[25,90],
                 width=600, height=350
            )

fig.show()

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Set template

You can check available templates or themes

import plotly.io as pio
pio.templates

And use the available options using the `template` argument

fig = px.scatter(df, x="gdpPercap", y="lifeExp", 
                 animation_frame="year", 
                 size="pop",size_max=55, 
                 color="continent", hover_name="country",
                 log_x=True, range_x=[100,100000], range_y=[25,90],
                 width=600, height=350,
                 template='simple_white'  # set template
            )

fig.show()


---