# Practical Data Science Tutorial

#### By Reshmi Ghosh (reshmig@andrew.cmu.edu) 

## Introduction
This tutorial shows how to use Plotly Express and Plotly effectively to create interactive visualizations. Plotly is a graphing library available for several programming languages, including Python, R, MATLAB, and JavaScript. It can be used to create line plots, scatter plots, area plots, interactive maps, bar charts, error bars, box plots, polar charts, and bubble charts.

Plotly Express was previously a separate wrapper around Plotly graph objects, but it is now part of Plotly itself. For detailed tutorials and documentation, you can access the official [Plotly express](http://127.0.0.1:8888/notebooks/Tutorial/Tutorial%20Plotly.ipynb) documentation.

This tutorial is an attempt to help you learn Plotly in less than 30 minutes. Two different datasets are used to create various visualizations. To make good visuals with Plotly, it is also important to understand color schemes and how they affect your audience.

## Exploratory Data Analysis
The first step to any data science project should be understanding the data. A clear understanding of the underlying distributions of the datapoints helps the Data Scientist to apply appropriate models to solve the problem at hand. Through this tutorial we show various graphics that will help you to do Exploratory Data Analysis in your data science project.

Let us begin by importing certain libraries in the environment that you will be working in.

In [2]:
import plotly
import plotly.figure_factory as pff
import plotly.express as px
import plotly.graph_objects as go
import chart_studio.plotly as py
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Note
If any of the above mentioned libraries are not installed in your system, run "pip3 install <library_name>" to install the modules.

#### Data
The first dataset that we will be using is S&P 500 stock market data. For each of the roughly 500 publicly traded stocks in the index, it lists the share price, Price-to-Earnings ratio (P/E ratio), Dividend Yield, Earnings per Share, 52-week low price, 52-week high price, Market Cap, EBITDA, Price-to-Sales ratio (P/S ratio), and Price-to-Book value ratio (P/B ratio).

The second dataset is a detailed record of all AirBnB listings in the NYC area. It has more than 48,000 records with the price per night, availability in 2019, number of reviews, last review date, reviews per month, number of host listings (other AirBnBs the host is responsible for), and the latitude and longitude of each listing. Please download the datasets provided in the zipped file before running this notebook.

#### Loading the data
This tutorial uses two .csv files. In the next steps we will load them into our notebook using the pandas "read_csv" method. Other file types can also be loaded into a Jupyter notebook as pandas dataframes. For detailed instructions check out this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). After loading a file it is also important to understand its schema. For this we use the .head() method from pandas.

In [3]:
## import first data set and see the schema
data1 = pd.read_csv("constituents-financials_csv.csv")
data1.head()

Unnamed: 0,Symbol,Name,Sector,Price,Price/Earnings,Dividend Yield,Earnings/Share,52 Week Low,52 Week High,Market Cap,EBITDA,Price/Sales,Price/Book,SEC Filings
0,MMM,3M Company,Industrials,222.89,24.31,2.332862,7.92,259.77,175.49,138721055226,9048000000.0,4.390271,11.34,http://www.sec.gov/cgi-bin/browse-edgar?action...
1,AOS,A.O. Smith Corp,Industrials,60.24,27.76,1.147959,1.7,68.39,48.925,10783419933,601000000.0,3.575483,6.35,http://www.sec.gov/cgi-bin/browse-edgar?action...
2,ABT,Abbott Laboratories,Health Care,56.27,22.51,1.908982,0.26,64.6,42.28,102121042306,5744000000.0,3.74048,3.19,http://www.sec.gov/cgi-bin/browse-edgar?action...
3,ABBV,AbbVie Inc.,Health Care,108.48,19.41,2.49956,3.29,125.86,60.05,181386347059,10310000000.0,6.291571,26.14,http://www.sec.gov/cgi-bin/browse-edgar?action...
4,ACN,Accenture plc,Information Technology,150.51,25.47,1.71447,5.44,162.6,114.82,98765855553,5643228000.0,2.604117,10.62,http://www.sec.gov/cgi-bin/browse-edgar?action...


In [4]:
##import the second data set and see the schema
data2 = pd.read_csv("AB_NYC_2019.csv")
data2.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
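
The `Unnamed: 0` header in both outputs suggests that the CSV files may carry a saved index column. Assuming that is the case, the sketch below uses a small in-memory CSV (three listings copied from the output above, with only a few of the columns) to show how `index_col=0` avoids the extra column, and how `dtypes` and `describe()` complement `head()` when inspecting the schema.

```python
import io
import pandas as pd

# Toy CSV mimicking the exported files: the first header field is empty,
# as happens when a dataframe index is written out together with the data.
csv_text = (
    ",id,name,neighbourhood_group,price\n"
    "0,2539,Clean & quiet apt home by the park,Brooklyn,149\n"
    "1,2595,Skylit Midtown Castle,Manhattan,225\n"
    "2,3831,Cozy Entire Floor of Brownstone,Brooklyn,89\n"
)

raw = pd.read_csv(io.StringIO(csv_text))                    # keeps "Unnamed: 0"
listings = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses it as the index

print(raw.columns.tolist())
print(listings.dtypes)      # data type of every column
print(listings.describe())  # summary statistics of the numeric columns
```

With the real files, the same `index_col=0` argument could be passed to the `read_csv` calls above, if the first column does turn out to be a saved index.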


It is always a good thing to check the count of records (rows) before doing any data cleaning. It helps us understand how big the dataset is and also helps to understand how many records were eliminated during the data cleaning and pre-processing step.

In [5]:
print(data1.shape[0])
print(data2.shape[0])

505
48895


An important step in Exploratory Data Analysis is to check whether the data contains NaN values, blank spaces, infinities, or zeros, and then to find a reasonable way to deal with them. In our case, the first dataset had infinity and NaN values. Since only 10 of the 505 records were affected, we simply removed those rows.

In [6]:
#drop Nan and Infs from the dataframe
data1 = data1.replace([-np.inf, np.inf], np.nan)
data1.dropna(inplace = True) 
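
Before dropping rows, it can help to count how many values are actually affected. The following is a minimal sketch on a toy dataframe (the values and the name `toy` are illustrative, not the real dataset) that counts NaNs and infinities per column and then reports how many rows the cleaning step removes.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and one infinite value (illustrative values only).
toy = pd.DataFrame({
    "Price": [222.89, 60.24, 56.27, 108.48],
    "Price/Earnings": [24.31, np.inf, np.nan, 19.41],
})

rows_before = toy.shape[0]

# Count problem values per column before touching the data.
nan_counts = toy.isna().sum()
inf_counts = np.isinf(toy.select_dtypes("number")).sum()

# Same cleaning as above: map infinities to NaN, then drop incomplete rows.
cleaned = toy.replace([-np.inf, np.inf], np.nan).dropna()
rows_dropped = rows_before - cleaned.shape[0]

print(nan_counts)
print(inf_counts)
print("rows dropped:", rows_dropped)
```

Comparing `rows_dropped` with the total row count is a quick way to judge whether dropping is acceptable or whether imputation would be the safer choice.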

Next we will pull individual variables out of our dataframe as series to create different plots. These series are useful for analyzing a single distribution or, combined in a list, for comparing several underlying distributions.

In [7]:
dividends = data1["Dividend Yield"]
price = data1['Price']
price_earn = data1["Price/Earnings"]
marketcap = data1["Market Cap"]
price_sales = data1["Price/Sales"]
price_book = data1["Price/Book"]

We begin by creating a histogram of the share prices of all the stocks to visualize their underlying distribution. Plotly offers many options for customizing your plots, including title and axis fonts, colors, and sizes. Here we use the 'Times New Roman' font, with monospace as the fallback, and the color code #a12d2d. You can check out other fonts [here](https://plot.ly/python/text-and-annotations/).

In [8]:
fig = px.histogram(data1, x="Price")
fig.update_layout(title=go.layout.Title(text="Histogram of share prices",xref="paper",x=0),
xaxis=go.layout.XAxis(title=go.layout.xaxis.Title(text="Price of share in $",font=dict(family="Times New Roman, monospace",size=18, 
                                                                                        color="#a12d2d"))),
yaxis=go.layout.YAxis(title=go.layout.yaxis.Title(text="count",font=dict(family="Times New Roman, monospace",
                                                                                                  size=18,color="#a12d2d"))))
fig.show()

We can clearly see that the distribution is highly skewed and will need an appropriate transformation before you model it in any data science project. The logarithmic transformation is one of the most commonly used transformations for skewed data, but other transformations can also be used.
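
As a rough illustration of what a logarithmic transformation does to skewed data, the sketch below uses synthetic log-normal "prices" (generated with a fixed seed, not the S&P 500 data) and compares the sample skewness before and after taking the natural log.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); not the S&P 500 data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=500), name="Price")

# Natural log; valid here only because every price is strictly positive.
log_prices = np.log(prices)

raw_skew = prices.skew()
log_skew = log_prices.skew()
print(f"skewness before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

The transformed series can be passed to `px.histogram` in the same way as the original column. Note that the log is undefined for zero or negative values, so those would need separate handling.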

To make a better distribution plot, we will use the Plotly Figure Factory module and create a distplot. It is important to note that each series must be passed inside a list to be plotted with Figure Factory. Although the label is not an important feature in the plot shown below, since we are using a single series, it becomes useful when we are analyzing multiple columns/series of a dataframe.

Note how we change the font and color here. To choose a color you can use a [color picker](https://www.google.com/search?q=color+picker), which generates the corresponding hex color code.

In [9]:
label = ['Stock Prices']
data = [price]
figx = pff.create_distplot(data, label, bin_size=.2)
figx.update_layout(title=go.layout.Title(text="Distribution plot of stock prices",xref="paper",x=0),font=dict(family="Helvetica, monospace",size=18, 
                                                                                        color="#44c253"))
figx.show()

The distplot method gives you a better sense of what the distribution and the corresponding rug plot look like, which helps you understand where your data is concentrated.

Next we will compare two different series to see what their respective distributions look like.

In [10]:
group_labels = ["Price to Earnings ratio", "Price to Book value ratio" ]
data = [price_earn, price_book]
figy = pff.create_distplot(data, group_labels, bin_size=.2)
figy.update_layout(title=go.layout.Title(text="Distribution plot of P/E ratio and P/B ratio",xref="paper",x=0))
figy.show()

From the distplot above it can be seen that the Price to Earnings and Price to Book value ratios have similar distributions. This means either that the price per share is very large compared to both Earnings and Book Value, or that Earnings and Book Value themselves have similar distributions. It is hard to draw a firm conclusion from our dataset, as we do not have absolute values of either Earnings or Book Value.

Now, let's create a scatter plot. Scatter plots are very useful for visualizing how two variables relate to each other. We will start by creating a simple scatter plot between P/E ratio and Earnings per share.

In [11]:
#scatter plot
fig2 = px.scatter(data1, x = "Price/Earnings", y = "Earnings/Share")
fig2.update_layout(title=go.layout.Title(text="Scatter plot of P/E ratio and Earnings/Share",xref="paper",x=0),
xaxis=go.layout.XAxis(title=go.layout.xaxis.Title(text="P/E ratio",font=dict(family="Times New Roman, monospace",size=18, 
                                                                                        color="#4287f5"))),
yaxis=go.layout.YAxis(title=go.layout.yaxis.Title(text="Earnings per share",font=dict(family="Times New Roman, monospace",
                                                                                                  size=18,color="#4287f5"))))
fig2.show()

Scatter plots can be made more informative by adding colors. Here we segregate the points by coloring them by Sector. But our dataset has too many sectors, which makes the scatter plot difficult to interpret.

In [12]:
## Price/Sales ratio vs Market Cap, colored by sector
fig3 = px.scatter(data1, x = "Price/Sales", y = "Market Cap", color = "Sector")
fig3.update_layout(title=go.layout.Title(text="Scatter plot of P/S ratio vs Market Cap",xref="paper",x=0),
xaxis=go.layout.XAxis(title=go.layout.xaxis.Title(text="P/S ratio",font=dict(family="Times New Roman, monospace",size=18, 
                                                                                        color="#4287f5"))),
yaxis=go.layout.YAxis(title=go.layout.yaxis.Title(text="Market Cap",font=dict(family="Times New Roman, monospace",
                                                                                                  size=18,color="#4287f5"))))
fig3.show()
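
A quick way to see how many colors such a plot needs is to count the distinct sectors before plotting. This is a small sketch on the five companies from the `data1.head()` output (the name `sample` is introduced here for illustration); on the full dataframe the same two calls would be applied to `data1["Sector"]`.

```python
import pandas as pd

# Sector column for the five companies shown in data1.head() above.
sample = pd.DataFrame({
    "Symbol": ["MMM", "AOS", "ABT", "ABBV", "ACN"],
    "Sector": ["Industrials", "Industrials", "Health Care",
               "Health Care", "Information Technology"],
})

n_sectors = sample["Sector"].nunique()           # number of distinct colors needed
sector_counts = sample["Sector"].value_counts()  # companies per sector

print(n_sectors)
print(sector_counts)
```

The counts also help in choosing which sectors to keep, since sectors with very few companies contribute little to a scatter plot.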

To appropriately visualize the scatter plot, we will subset the data to three different sectors, and then visualize their histogram and rug plot along with the scatter plot.

In [13]:
# Subset data to visualize only certain sectors
Sector_sub = ["Information Technology", "Consumer Staples", "Industrials"]
data_mod = data1.loc[data1['Sector'].isin(Sector_sub)]
fig7 = px.scatter(data_mod, x = "Price/Sales", y = "Market Cap", color = "Sector", marginal_y = "rug", marginal_x = "histogram", color_continuous_scale=px.colors.colorbrewer.RdYlBu)
fig7.show()

Like the histogram and rug, you can also add box plots or violin plots to check the quantile ranges and kernel density estimates, respectively. From the above plot we again see that all three series are highly skewed, because we did not apply any transformation to them.

In [14]:
# Subset data to visualize only certain sectors
Sector_sub1 = ["Health Care", "Financials", "Materials", "Real Estate"]
data_mod1 = data1.loc[data1['Sector'].isin(Sector_sub1)]
fig8 = px.scatter(data_mod1, x = "Price/Sales", y = "Market Cap", color = "Sector", marginal_y = "violin", marginal_x = "box", color_continuous_scale=px.colors.colorbrewer.RdYlBu)
fig8.show()

Through the scatter plots, it is difficult to conclude any relationship between the Price-to-Sales ratio and Market Cap, but it can be seen that for Health Care, Financials, and Real Estate the Price-to-Sales ratio relates to Market Cap in a similar fashion.
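
When a relationship is hard to judge by eye, a per-sector correlation coefficient can put a number on it. The sketch below is only an illustration: the figures are synthetic (not taken from the dataset), Market Cap is expressed in made-up units, and Pearson correlation is just one of several possible measures.

```python
import pandas as pd

# Synthetic values for two sectors; NOT taken from the S&P 500 dataset.
toy = pd.DataFrame({
    "Sector": ["Health Care"] * 4 + ["Real Estate"] * 4,
    "Price/Sales": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "Market Cap": [10.0, 20.0, 30.0, 40.0, 40.0, 10.0, 30.0, 20.0],
})

# Pearson correlation between the two columns, computed sector by sector.
corr_by_sector = {
    sector: group["Price/Sales"].corr(group["Market Cap"])
    for sector, group in toy.groupby("Sector")
}

for sector, corr in corr_by_sector.items():
    print(f"{sector}: {corr:.2f}")
```

On skewed data such as Market Cap, a few very large companies can dominate the coefficient, so it is best read alongside the plots rather than instead of them.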

#### Maps
Often, when a dataset is location specific and contains latitude and longitude columns, it is wise to visualize it as a map. Using our second data set, we will create a map of the AirBnB listings around the NYC area. We will represent the listings as circles, and the size of each circle will be scaled by the price of the AirBnB.

#### Note
The access token provided below is to access 'mapbox' publicly. You can create your own private access token using the following [link](https://docs.mapbox.com/help/how-mapbox-works/access-tokens/). [Mapbox](https://www.mapbox.com/) is a website for using custom online maps and is used by Plotly for creating location specific visuals. Mapbox is used commonly to create interactive maps.

In [15]:
#showing all AirBnBs in NYC area, hover over the map to see the details, size of each circle corresponds to the price
access_token = "pk.eyJ1IjoicmVzaG1pNjQ5NCIsImEiOiJjazFwMnN1dHowNDVqM2NvNmpzOWE4MTNwIn0.Jf2k0ulVlJGtBnBL5MGDBw"
fig4 = px.scatter_mapbox(data2, lat="latitude", lon="longitude", size="price", size_max=15, zoom=10)
fig4.update_layout(mapbox_accesstoken = access_token)
fig4.show()

#### Note
If at any point you have zoomed in too much, just double-click on the map to return to the original state.

The map gives a high-level view of all AirBnB listings and shows us where they are concentrated. Since there are more than 48,000 records, zoom in to specific locations in the NYC area to view the coordinates and the price of each listing.

#### Maps with colors
To better represent your data with maps, color scales can be used to convey a clear story to your audience. We will plot the availability (in number of days for the year 2019) of each AirBnB listing. Plotting all AirBnBs with a color scheme applied to every record would still result in a messy graphic. This is why we subset our dataset to visualize only those AirBnBs whose per-night accommodation rate is greater than $1500. The color scheme represents the number of days the listing is available for booking.

In [16]:
## subsetting the dataset to only view AirBnBs priced at more than $1500 per night
price1500 = data2["price"]>1500
data2_price1500 = data2[price1500]
fig5 = px.scatter_mapbox(data2_price1500, lat="latitude", lon="longitude", color = "availability_365",size="price",
                  color_continuous_scale=px.colors.colorbrewer.RdYlBu, size_max=15, zoom=10)
fig5.update_layout(mapbox_accesstoken = access_token)
fig5.show()

There are a few AirBnb listings that are extremely high priced (~$10000) and are also not available for booking. The color scheme chosen was Red-Yellow-Blue to clearly present the diverging effect of the listing availability for booking. You can choose a different color scheme from the list presented [here](https://plot.ly/ipython-notebooks/color-scales/).

You can also take advantage of Mapbox's dark-colored maps to present your data. To enable the dark mode, just update your map layout with mapbox_style = "dark".

Here we again subset the dataset to see only the AirBnB listings with a per-night rate greater than $5000.

In [17]:
## subsetting the dataset to only view AirBnBs priced at more than $5000 per night
price5000 = data2[data2.price>5000]
#data2_price1000 = data2[price1000]
fig6 = px.scatter_mapbox(price5000, lat="latitude", lon="longitude", color = "availability_365",size="price",
                  color_continuous_scale=px.colors.sequential.YlOrRd, size_max=15, zoom=10)
fig6.update_layout(mapbox_style = "dark" ,mapbox_accesstoken = access_token)
fig6.show()

Another way of visualizing the data is by creating 2D density plots. 2D density plots look like contour maps but are just an extension of the classic histogram. They show the joint distribution of two quantitative variables and their spread relative to each other. They are also useful for avoiding overplotting on a scatter plot. A 2D density plot counts the number of observations, here the number of listings, that fall within each combination of the two variables (price and availability of the listing).

In [18]:
#density plot between price and availability
pricerange = data2[(data2.price>=800) & (data2.price<=1500)]
#data2_range = data2[pricerange]
fig9 = px.density_contour(pricerange, x="price", y="availability_365")
fig9.update_layout(title=go.layout.Title(text="Density plot between price and availability",xref="paper",x=0),
xaxis=go.layout.XAxis(title=go.layout.xaxis.Title(text="Price per night in $",font=dict(family="Times New Roman, monospace",size=18, 
                                                                                        color="#a12d2d"))),
yaxis=go.layout.YAxis(title=go.layout.yaxis.Title(text="Availability of Airbnb in 2019",font=dict(family="Times New Roman, monospace",
                                                                                                  size=18,color="#a12d2d"))))
fig9.show()
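
The counting behind a 2D density plot can be sketched with NumPy's `histogram2d`. The six listings below are hypothetical, and the two-by-two grid is far coarser than what Plotly would choose, but the idea is the same: split both axes into bins and count the observations in each cell.

```python
import numpy as np

# Hypothetical listings: nightly price in $ and days available per year.
price = np.array([800, 850, 900, 1200, 1500, 1450])
availability = np.array([10, 20, 300, 350, 365, 0])

# Split each axis into two equal-width bins and count listings per cell.
counts, price_edges, avail_edges = np.histogram2d(
    price, availability, bins=[2, 2], range=[[800, 1500], [0, 365]]
)

print(price_edges)  # bin boundaries along the price axis
print(avail_edges)  # bin boundaries along the availability axis
print(counts)       # rows follow price bins, columns follow availability bins
```

Contour lines are then drawn over this grid of counts, which is why the plot resembles a contour map of a histogram.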

2D density plots can be further enhanced by adding colors and then adding the rug plot and histogram of the different categories represented by those colors.

In [19]:
#density plot between price and availability
fig10 = px.density_contour(data2_price1500, x="price", y="availability_365", color="neighbourhood_group", marginal_x="rug", marginal_y="histogram")
fig10.show()

Lastly, we will see how to create a heatmap. A heatmap represents records in a color-coded matrix format. The color-coded grid shows the intensity of interaction between two variables. For example, we will plot the interaction between Price-to-Sales ratio and Earnings per Share from our first dataset in a heatmap.

In [20]:
## Create a heatmap
figz = px.density_heatmap(data1, x="Price/Sales", y="Earnings/Share", marginal_x="rug", marginal_y="histogram")
figz.show()

The heatmap shows that there are 52 stocks with both Price-to-Sales ratio and Earnings per Share between 2 and 3.99, represented by yellow. The color scale shows the count of stocks in each block. The histogram and rug plot show the individual distributions of Earnings per Share and of Price-to-Sales ratio, respectively.
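
A count read off a heatmap cell can be cross-checked with a boolean mask. The sketch below uses five rows copied from the `data1.head()` output and assumed cell boundaries; Plotly picks its own bin edges, so on the real figure the ranges should be taken from the hover labels.

```python
import pandas as pd

# Five rows copied from the data1.head() output above.
sample = pd.DataFrame({
    "Price/Sales": [4.390271, 3.575483, 3.740480, 6.291571, 2.604117],
    "Earnings/Share": [7.92, 1.70, 0.26, 3.29, 5.44],
})

# Assumed cell boundaries (inclusive on both ends), chosen for illustration.
in_ps_bin = sample["Price/Sales"].between(2, 3.99)
in_eps_bin = sample["Earnings/Share"].between(0, 1.99)

# A row belongs to the cell only if it falls inside both ranges.
cell_count = int((in_ps_bin & in_eps_bin).sum())
print(cell_count)
```

Applied to the full dataframe with the ranges shown on hover, the same mask should reproduce the count that the color scale reports for that block.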


We show another example of the heatmap using our second data set. Here we have created a subset of the original AirBnB listing data and have also shown how to use violin plot instead of rug plot. 

In [21]:
## Create a heatmap
datamodz1 = data2[(data2.availability_365 > 100) & (data2.number_of_reviews > 100)]
figz1 = px.density_heatmap(datamodz1, x="number_of_reviews", y="availability_365", marginal_x="violin", marginal_y="histogram")
figz1.show()