Created by: Andrew Edward

# Introduction

When I started my journey with data science, I had plenty of ideas for visualizations, but every time I had the data in hand, I struggled to code them with the libraries I had first been introduced to. Then I found Plotly, and it made my life easier.

This tutorial introduces some basic methods for visualizing data using 'Plotly Express', a library that provides interactive, responsive and easily readable plots. Many great visualization libraries exist with a lot of capabilities, but few make interactivity as effortless as Plotly Express does.

Any figure created in a single function call with Plotly Express could be created using graph objects alone, but with between 5 and 100 times more lines of code. Visualizing data is great, but being able to interact with any chart and apply filters instantly increases the value we get from every line of code. It allows us to analyse every aspect of a chart further and explore new theories.

# Tutorial Content

In this tutorial, we will show how to create simple yet powerful interactive visualizations in Python, specifically using [Plotly Express](https://plotly.com/python/plotly-express/).

We will be using data collected from [IMDb](https://www.imdb.com/), one of the most popular movie websites. IMDb stores information on more than 6 million titles (of which almost 500,000 are feature films). The dataset can be found here: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset. The data was scraped from the website on 1/1/2020.

We will cover the following topics in this tutorial:
1. [Installing the libraries](#1)
2. [Loading data and plotting](#2)
3. [Bar Charts](#3)
4. [Plotly's Responsivity](#4)
5. [Line Plots](#5)
6. [Scatter Plots](#6)
7. [Pie Charts](#7)
8. [Histograms](#8)
9. [Box Plots](#9)
10. [Libraries comparison](#10)
11. [Summary & References](#11)


<a id="1"></a> <br>

# 1. Installing the libraries

Before getting started, you'll need to install the various libraries that we will use. You can install 'Plotly' using pip:

    $ pip install plotly==4.11.0

or using conda:

    $ conda install -c plotly plotly=4.11.0

If this install does not work for you, please consult the library's documentation. Once everything is installed, make sure the following imports work for you.

In [None]:
import plotly as py
import plotly.express as px
import numpy as np
import pandas as pd

<a id="2"></a> <br>
# 2. Loading Data and plotting

Now that we have installed and imported all the relevant libraries, let's load our data into DataFrames.

If you are following this tutorial on your own machine, download the files from the dataset link on Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset. Unzip the `archive.zip` file to create an `archive` folder containing several CSV files, rename that folder to `IMDb`, and copy it into the same folder as this notebook. Note that the paths in the cell below are the ones used when the notebook runs on Kaggle; on your own machine, point them at the `IMDb` folder instead (for example `IMDb/IMDb movies.csv`). Then you can load the data using the following commands:

In [None]:
# Load the data we will use
movies = pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")
ratings = pd.read_csv("../input/imdb-extensive-dataset/IMDb ratings.csv")

All of the new objects we just loaded are Pandas DataFrames.

We will merge movies and ratings so we can use the ratings in our charts, then call .info() to take a closer look at the columns of the merged dataset.

In [None]:
movies_ratings = pd.merge(movies, ratings, on='imdb_title_id')
movies_ratings.info()
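
To make the merge step concrete, here is a minimal sketch on two toy DataFrames. The titles and numbers are made up for illustration; only the `imdb_title_id` key mirrors the real dataset. The `validate` argument is an optional safeguard that is not part of the cell above: it asks pandas to raise an error if an id appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the movies and ratings tables (all values are invented)
movies_toy = pd.DataFrame({
    "imdb_title_id": ["tt001", "tt002", "tt003"],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings_toy = pd.DataFrame({
    "imdb_title_id": ["tt001", "tt002", "tt003"],
    "total_votes": [120, 45, 300],
})

# Inner join on the shared key; validate checks each id appears once per table
merged_toy = pd.merge(movies_toy, ratings_toy, on="imdb_title_id", validate="one_to_one")
print(merged_toy)
```

Each row of the result carries the columns of both tables for the same title, which is what lets us use rating columns and movie columns together in one chart.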

The movies dataset includes 85,855 movies with attributes such as description, average rating, number of votes, genre, etc.

The ratings dataset includes rating details for the same 85,855 movies from a demographic perspective: for example, how the votes split between men and women, across age ranges, and between voters inside and outside the U.S.

We shall be exploring both as we go. As you can see, some columns are currently stored as objects, and it will help our analysis if we convert them to the right type and clean the dataset a bit.
For example, let's turn the 'year' column into an integer.

In [None]:
movies_ratings['year'] = movies_ratings.year.astype('str')

#We will first use the strip() function to make sure no whitespaces exist
movies_ratings['year'] = movies_ratings.year.str.strip()

In [None]:
# Then we will drop the one row that has text in it
movies_ratings['year'] = movies_ratings.year.drop(index = 83917, axis = 0)
movies_ratings = movies_ratings.dropna(subset=['year'], axis = 0)

In [None]:
#Check if any nulls exist
movies_ratings[movies_ratings['year'].isnull()]

In [None]:
#Convert to int
movies_ratings['year'] = movies_ratings.year.astype('int')
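
The cells above rely on knowing the index of the one bad row. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into NaN so it can be dropped without hard-coding an index. The toy values, including the text entry, are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column: clean years, one padded with spaces, one containing text (invented values)
toy = pd.DataFrame({"year": ["1999", " 2005 ", "some text 2019", "2010"]})

# Strip whitespace, then turn anything that is not a plain number into NaN
toy["year"] = pd.to_numeric(toy["year"].str.strip(), errors="coerce")

# Drop the rows that could not be parsed and convert the rest to integers
toy = toy.dropna(subset=["year"]).astype({"year": int})
print(toy["year"].tolist())
```

The trade-off is that every unparseable value is silently discarded, so it is worth checking how many rows were dropped before moving on.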

Now that we have the dataset ready for what we need, we can start exploring the capabilities of Plotly and how we can visualize the data.

Plotly Express offers more than 30 functions for creating charts and figures. We will explore some of the basic ones and see how much value you can get from each.

<a id="3"></a> <br>
# 3. Bar Charts

Let's start by creating a simple bar chart using Plotly Express. We will look at the 10 genres with the most movies produced of all time.

First let's group the movies in the dataset by genre in a new DataFrame.

In [None]:
genres_df = movies_ratings[['genre','title']].groupby(['genre']).count().reset_index().rename(columns={'title':'number_of_movies'})

#Sort them in descending order so we can extract the 10 genres with the most movies made
genres_df = genres_df.sort_values(by='number_of_movies', ascending=False)
genres_df

Then let's save the top 10 genres to a new DataFrame.

In [None]:
genres_top10_df = genres_df.iloc[:10]
genres_top10_df
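
For reference, the group, sort and slice steps above can also be sketched more compactly with `value_counts()`. This is an alternative on toy data, not the code used in the rest of the tutorial. One assumption to keep in mind: `value_counts()` counts rows per genre, while the groupby above counts non-null titles, so the two only agree when no titles are missing.

```python
import pandas as pd

# Toy data: one row per movie (invented values)
toy = pd.DataFrame({"genre": ["Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy"]})

top2 = (
    toy["genre"]
    .value_counts()                        # rows per genre, sorted in descending order
    .head(2)                               # keep the two most common genres
    .rename_axis("genre")                  # name the index so it becomes a 'genre' column
    .reset_index(name="number_of_movies")  # back to a regular DataFrame
)
print(top2)
```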

Now let's create our first bar chart! Each chart type in Plotly Express has its own function; for bar charts, we call px.bar() and set our x and y columns, the title and the axis labels.

In [None]:
genres_bar = px.bar(genres_top10_df, 
                    x = 'genre', 
                    y = 'number_of_movies', 
                    title = 'Top 10 Genres in terms of number of movies', 
                    labels = dict(genre = 'Genre', number_of_movies = 'Number of movies'))
                    
genres_bar.show()

As simple as that, we have a bar chart showing that drama is by far the most produced genre over the years, followed by comedy.

Simplicity is not the only thing that makes Plotly Express special; it also offers a lot of customizability and interactivity, which we will look at next.


<a id="4"></a> <br>
# 4. Plotly's responsivity

Let's first explore the responsive options and buttons that Plotly's interface offers:

As you hover over the chart you should see an options bar at the top that looks like this:

![x](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/bar.png)

Behaviour like this is controlled through a config dictionary passed to the show function. For example, setting `responsive` to False stops the figure from resizing along with the browser window (the options bar itself is governed by a separate setting, `displayModeBar`):

```config = {'responsive': False}```


In [None]:
config = {'responsive': False}

genres_bar = px.bar(genres_top10_df, 
                    x = 'genre', 
                    y = 'number_of_movies', 
                    title = 'Top 10 Genres in terms of number of movies', 
                    labels = dict(genre = 'Genre', number_of_movies = 'Number of movies'))
                    
genres_bar.show(config = config)

Some useful options:
#### - Download plot button: The camera icon on the modebar downloads a static version of the figure through the browser. By default this is a PNG of 700 by 450 pixels, but you can change that to whatever works best for your slides, posters, etc.

You can change the file type, the export filename, the dimensions and the scale, which helps you present your charts in a more professional way. With a vector format like SVG available, you can even adjust the chart's colors and more in an external editor without changing your code.

```
config_Image = {
  'toImageButtonOptions': {
    'format': 'svg', # one of png, svg, jpeg, webp
    'filename': 'custom_image',
    'height': 500,
    'width': 700,
    'scale': 1 # Multiply title/legend/axis/canvas sizes by this factor
  }
}
```

#### - Zooming in and out of a chart: Helps you look at specific data when a chart is busy

#### - Box and lasso select: Allows you to select specific parts of a chart you think might be useful or important to highlight

#### - Show closest data on hover: Tells you exactly which piece of data each point on the chart refers to. For example, if we hover over the bar for drama in our earlier bar chart, it shows the exact number of movies the bar represents (12,543 movies).

#### - Compare data on hover: A very useful option, especially when you have subplots or more than one factor you are looking at, helping you to compare between values really easily.

Another amazing part about Plotly is the customization it allows for every chart.

For example, let's say that instead of hovering over every bar to find the number of movies in that genre, we want to see it directly on the graph. We can do that through the *text* parameter.

In [None]:
genres_bar = px.bar(genres_top10_df, 
                    x = 'genre', 
                    y = 'number_of_movies', 
                    title = 'Top 10 Genres in terms of number of movies',
                    text = 'number_of_movies', 
                    labels = dict(genre = 'Genre', number_of_movies = 'Number of movies'))
                    
genres_bar.show()

What if we want each bar to have a different color?

In [None]:
genres_bar = px.bar(genres_top10_df, 
                    x = 'genre', 
                    y = 'number_of_movies', 
                    title = 'Top 10 Genres in terms of number of movies',
                    text = 'number_of_movies', 
                    labels = dict(genre = 'Genre', number_of_movies = 'Number of movies'),
                    color = 'genre')
                    
genres_bar.show()

Plotly also offers a great way to look at details within each bar, beyond the total number of movies it represents.

Let's look at the next example. This time we will look at the movies produced in the 2000s: we will count the movies created every year and, with the help of Plotly, find out which genres were the most popular.

First we will create a new DataFrame using the groupby() function, grouping the rows by year and genre, and then use the count() function to get the number of movies for each year and genre.

In [None]:
years_df = movies_ratings[['year','title', 'genre']].groupby(['year','genre']).count().reset_index().rename(columns={'title':'number_of_movies'})
years_df

Now we will just sort it by the number of movies in a descending order so we see the most popular

In [None]:
years_df = years_df.sort_values(by=['number_of_movies'], ascending=False)
years_df

The final step before visualization is to keep only the movies created from 2000 onwards.

We will also keep only the genres with at least 50 movies in a year, so that we see just the most important ones.

In [None]:
# Combine both conditions in a single boolean mask
years_df = years_df[(years_df['year'] >= 2000) & (years_df['number_of_movies'] >= 50)]
years_df

Finally, we can create a Plotly bar chart in one call, the same way as in the previous example. The only difference is that this time the color is based on genre, so each year's bar is split into one segment per genre.

In [None]:
detailed_genres_bar = px.bar(years_df, 
                             x ='year', 
                             y = 'number_of_movies', 
                             color = 'genre', 
                             title='Movies produced in the 2000s classified by genres', 
                             text = 'number_of_movies',
                             labels = dict(year = 'Year', number_of_movies = 'Number of movies')
                             )
detailed_genres_bar.show()

Another great interactive feature that Plotly offers is that you can look at each segment of the bars separately, giving you more insight into the data and helping you spot trends.
For example, if we want to see how the number of *Horror* movies changed over time in the 2000s, all we need to do is double click on "Horror" in the legend on the right side, and the graph will show only the "Horror" numbers, like the image below:
![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/horror_filter.png)

<a id="5"></a> <br>
# 5. Line Plots
Now let's look at another type of chart in Plotly: line charts! Let's see whether the number of votes cast by male and female voters for movies on IMDb has changed over the years. And what is better than a cool line chart to do that!

We will create a new DataFrame holding the total number of male and female votes for movies by year.

In [None]:
male_female_df = movies_ratings[movies_ratings.year >= 2000]
male_female_df = male_female_df[male_female_df.year < 2020] # Exclude 2020 since its data is incomplete
# Keep only numeric columns before summing ('title' is text and cannot be meaningfully summed)
male_female_df = male_female_df[['year', 'males_allages_votes', 'females_allages_votes']].groupby(['year']).sum().reset_index()
male_female_df

The amazing thing is that Plotly Express lets us plot several columns as separate lines simply by passing a list to the x or y argument.

In [None]:
male_female_scatter = px.line(male_female_df, 
                            x='year', 
                            y=['males_allages_votes','females_allages_votes'],
                            title='Number of voters for movie ratings on IMDb per year split by gender', 
                            labels = dict(year = 'Year', value = 'Voters', variable = 'Gender'),
                            )

male_female_scatter.show()

From this we can see that the number of voters is actually decreasing, even though we noticed the number of movies is not.
To make it easier to analyse and compare, we can select the "Compare data on hover" option from the top bar:

![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/compare_button.png)

Then every time we hover over the male votes for a specific year, for example, Plotly Express's interface will show us the corresponding value for females in the same year, like this image:
![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/compare_example.png)

<a id="6"></a> <br>
# 6. Scatter Plots
Scatter plots are used to observe relationships between variables using dots to represent the values. In Plotly Express, it is very easy to create one!

In this section, we will use a scatter plot to examine the relation between the scores given to movies by critics (Metascores) and the budgets spent by production companies, for all movies produced in 2018.


In [None]:
# Take an explicit copy so the column edits below do not trigger SettingWithCopyWarning
reviews2018_df = movies_ratings[movies_ratings.year == 2018].copy()

Then, since the budget column is not currently numeric, we will fix that first.

In [None]:
reviews2018_df['budget'] = reviews2018_df['budget'].str.split(' ', expand = True)[1] # The format is '$ xxxx', so keep only the number
reviews2018_df.dropna(inplace=True) # Drop rows with a NaN in any column
reviews2018_df['budget'] = reviews2018_df['budget'].astype(float) # The numbers are huge, so store them as floats
reviews2018_df
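
Splitting on the space keeps the number but throws away whatever comes before it. The sketch below is a hedged variation on toy data that keeps the prefix in its own column, assuming some budgets could be quoted in a currency other than dollars using the same layout. The `EUR` row and all amounts are invented for illustration, and the `currency` and `amount` column names are hypothetical.

```python
import pandas as pd

# Toy budgets in the '<currency> <amount>' layout (invented values)
toy = pd.DataFrame({"budget": ["$ 1500000", "EUR 2000000", None, "$ 300000"]})

# Split once on the first space: column 0 is the currency, column 1 the amount
parts = toy["budget"].str.split(" ", n=1, expand=True)
toy["currency"] = parts[0]
toy["amount"] = pd.to_numeric(parts[1], errors="coerce")

# Keep only dollar budgets so all amounts are in the same unit
usd = toy[toy["currency"] == "$"]
print(usd[["currency", "amount"]])
```

Whether this extra filtering is needed depends on what the budget column actually contains, so it is worth checking the distinct prefixes before deciding.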

Now we have all the movies from 2018, but that is still a lot, so we will look only at movies with a budget of at least 1M.

In [None]:
reviews2018_df = reviews2018_df[reviews2018_df.budget >= 1000000]

#Also exclude the one movie listed with a budget of around 20B
reviews2018_df = reviews2018_df[reviews2018_df.budget < 19000000000] 

#Lastly, we will sort the movies by budget descending order
reviews2018_df.sort_values(by = 'budget', ascending=False, inplace = True)

#Lets see the final result
reviews2018_df[['title','metascore','budget']]

Now all we need to do is call px.scatter() and pass our x, y and color (the column Plotly should use to color the points).

This time we will also add two new arguments.

1- `hover_name:` The column whose value is shown as the bold title of the tooltip when you hover over a point. Here we will use the movie title.

2- `color_discrete_sequence:` Lets you choose the color scheme, either by listing colors yourself or by picking one of the built-in sequences that Plotly offers. For more info you can check this page: https://plotly.com/python/discrete-color/
Continuous color scales also exist, but since our color column is qualitative, we will use a discrete sequence.

In [None]:
reviews2018_scatter = px.scatter(reviews2018_df, 
                                 x='budget', 
                                 y='metascore',
                                 title='Metascore vs Budget, by Genre', 
                                 color = 'genre',
                                 hover_name = 'title',
                                 labels = dict(budget = 'Budget', metascore = 'Score by critics', genre = 'Genre'),
                                 color_discrete_sequence= px.colors.qualitative.Light24
                                )

reviews2018_scatter.show()

As before, if we double click on any genre in the legend, we see only the movies under it. For example, if we double click on 'Action, Adventure, Sci-Fi', we can see that, to an extent, movies with a budget under 120M did not score more than 50/100 with critics, as in this screenshot (colors might be different):
![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/genre_filter.png)
We can then compare specific genres to each other without creating a new plot. With 'Action, Adventure, Sci-Fi' selected, click once on 'Action, Adventure, Comedy'; your screen should look like this (colors might be different):
![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/genre_filter2.png)
Now we can compare these two categories without the noise around them, all thanks to Plotly's responsivity.

<a id="7"></a> <br>
# 7. Pie charts
Pie charts are very simple, but with Plotly Express we can easily get a lot of extra insight from them. Let's try to find the top 10 countries where the most movies are produced. First we will create a new DataFrame holding the number of movies per country of production.

In [None]:
movies_countries_df = movies_ratings[['country','title']].groupby(['country']).count().reset_index().rename(columns={'title':'number_of_movies'})
movies_countries_df = movies_countries_df.sort_values(by='number_of_movies', ascending=False)

Then we will only save the top 10 rows

In [None]:
movies_countries_df = movies_countries_df.iloc[:10]
movies_countries_df

Pie charts mainly need two parameters: 'values', which in this case is the number of movies, and 'names', which is the category we want to analyze.

In [None]:
fig = px.pie(movies_countries_df, values='number_of_movies', names='country', title='Top 10 countries where movies are produced')
fig.show()

As expected, we can see that USA, India and the UK have the largest shares.

What if we want to exclude USA and look at India's share of movie production compared to the rest of the world (the other 8 countries in this case)?

Using other libraries, you would typically create a new DataFrame that excludes USA and plot again, or calculate the share by hand.

With Plotly's responsive layout, all we need to do is click once on 'USA' in the legend and automatically Plotly will exclude it from the plot like this:
![](https://raw.githubusercontent.com/AndrewEdward37/tutorial_images/master/pie_chart_exclude.png)
So now we can easily conclude with one-click that out of all the countries outside of USA, India produces 24.7% of all movies.
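
For comparison, this is roughly what the manual route looks like in pandas: filter out one country and recompute each remaining share. The countries match the ones discussed above, but the counts are invented toy numbers, so the resulting percentages are not the ones from the real chart.

```python
import pandas as pd

# Toy production counts (invented values)
toy = pd.DataFrame({
    "country": ["USA", "India", "UK", "France"],
    "number_of_movies": [500, 150, 100, 50],
})

# Exclude USA, then express each remaining country as a percentage of the new total
without_usa = toy[toy["country"] != "USA"].copy()
without_usa["share_pct"] = (
    100 * without_usa["number_of_movies"] / without_usa["number_of_movies"].sum()
)
print(without_usa)
```

It is only a few lines, but the legend click gives the same answer without touching the code at all.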

<a id="8"></a> <br>
# 8. Histograms

Plotly Express also lets us create histograms, which show how numerical data is distributed across a set of ranges (bins). Let's create a histogram to see how the number of votes for movies has changed over the years.

First we will create a DataFrame from our main dataset to group all data by year and get the sum of votes for movies produced in every year.

In [None]:
# 'title' is text, so only the numeric 'total_votes' column is summed per year
voters_df = movies_ratings[['year', 'total_votes']].groupby(['year']).sum().reset_index()
voters_df

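Now using px.histogram(), we will map out the distribution. Because we pass a y column, each bar shows the sum of total_votes for the years that fall in its bin.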

In [None]:
fig = px.histogram( voters_df,
                    x = 'year',
                    y = 'total_votes',
                    title='IMDb movie voters distribution over time', 
                    labels = dict( total_votes = 'total votes', year = 'Years'),
                    )
fig.show()

As expected, if we hover over the bars to analyze the distribution, we can see that the number of votes almost doubled between the 90s and the early 2000s, as computers became more accessible to everyone, and that the growth slowed later on.

<a id="9"></a> <br>

# 9. Box Plots
The last plot type we will explore in this introductory tutorial is the box plot; we will also see how subplots can be used to compare data.

Let's say we want to see the typical duration of movies and whether it has changed drastically from the 1980s to 2019. We will explore that using a series of box plots created in a very simple way.

First we will create a new DataFrame restricted to the years we want and sorted by year in ascending order.

In [None]:
movies_duration_df = movies_ratings[movies_ratings.year >= 1980]
movies_duration_df = movies_duration_df[movies_duration_df.year < 2020]
# sort_values returns a new DataFrame, which avoids modifying a slice in place
movies_duration_df = movies_duration_df.sort_values(by = 'year')

Before we create our box plot, we have to modify how years are represented so that Plotly knows how we are grouping.

To do that, we will replace every year in a decade with a single value that represents the whole decade.

For example, all years from 1980 to 1989 will be replaced by '1980', and so on.

In [None]:
movies_duration_df['year'] = movies_duration_df['year'].astype('str')
movies_duration_df['year'] = movies_duration_df['year'].replace(to_replace = '^199', value = 1990, regex = True)
movies_duration_df['year'] = movies_duration_df['year'].replace(to_replace = '^198', value = 1980, regex = True)
movies_duration_df['year'] = movies_duration_df['year'].replace(to_replace = '^200', value = 2000, regex = True)
movies_duration_df['year'] = movies_duration_df['year'].replace(to_replace = '^201', value = 2010, regex = True)
movies_duration_df
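
The regex replacements above work on the year as text. As an alternative sketch, the same decade label can be computed arithmetically while the year is still an integer: integer division by 10 drops the last digit, and multiplying by 10 puts a zero back. The `decade` column name and the toy years are assumptions made for this example.

```python
import pandas as pd

# Toy years covering the range used in this section
toy = pd.DataFrame({"year": [1980, 1989, 1994, 2007, 2019]})

# 1989 // 10 = 198, and 198 * 10 = 1980, so every year maps to the start of its decade
toy["decade"] = (toy["year"] // 10) * 10
print(toy)
```

This version needs no changes if the year range is extended later, whereas the regex approach needs one extra line per decade.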

There we go! Our DataFrame is ready, so now we will create a box plot using px.box(), just like the other chart functions. This time we want a separate box plot for each decade so we can compare them side by side, and we do that with the 'facet_col' parameter, which creates one subplot per decade.

Another new parameter, specific to box (and violin) plots, is 'points'. It controls which individual data points are drawn alongside the box-and-whisker diagram.

Some of the useful values you can use for it are:

'all': shows all points next to the box

False: hides the points from the plot

For this example we will set points to False because there are so many points that drawing them all might slow down or crash your browser, depending on your computer, but feel free to change it and see how it looks.

In [None]:
fig = px.box(movies_duration_df, 
             y="duration", 
             title= 'Duration of movies change by year',
             facet_col = 'year', 
             points = False,
             color = 'year')
fig.show()

So with just 2-3 lines of Plotly Express each, we created a lot of simple plots that hold a lot of value.

<a id="10"></a> <br>

# 10. Libraries Comparison

As we went through this tutorial, we looked at some of the capabilities that Plotly Express offers. In this section, we will compare Plotly Express with another well-known visualization library used in data science, Seaborn.

This section needs two more imports, so run both:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

For this task, we will look at the top 10 production companies by number of movies produced and see how we can visualize them with both libraries.

In [None]:
production_companies_df = movies_ratings[['production_company','title']].groupby(['production_company']).count().reset_index().rename(columns={'title':'number_of_movies'})
production_companies_df = production_companies_df.sort_values(by='number_of_movies', ascending=False)
production_companies_df = production_companies_df.iloc[:10]
production_companies_df

### A. Seaborn

In [None]:
fig, ax = plt.subplots(figsize=(30, 6))

sns.barplot(x = 'production_company',
            y = 'number_of_movies',
            data = production_companies_df)
plt.show()

### B. Plotly Express

In [None]:
detailed_genres_bar = px.bar(production_companies_df, 
                             x ='production_company', 
                             y = 'number_of_movies', 
                             color = 'production_company', 
                             title='Top 10 production companies', 
                             text = 'number_of_movies',
                             labels = dict(production_company = 'Production Company', number_of_movies = 'Number of movies')
                             )
detailed_genres_bar.show()

We can see that both libraries produce readable, useful charts that look very similar, with comparable customization and code length. What gives Plotly Express an edge is its interactive capabilities, which make the charts come to life.

<a id="11"></a> <br>
# 11. Summary & References
This tutorial highlighted just a few of the things that can be achieved with Plotly Express. More details about the library and some of the great things you can do with it are available at the following links:

1. Plotly Express: https://plotly.com/python/plotly-express/
2. Configuration options: https://plotly.com/python/configuration-options/
3. Colors: https://plotly.com/python/discrete-color/
4. https://medium.com/plotly/introducing-plotly-express-808df010143d
5. https://towardsdatascience.com/step-by-step-bar-charts-using-plotly-express-bb13a1264a8b
6. IMDb Dataset: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset