<a href="https://colab.research.google.com/github/sabinagio/IronSabina/blob/main/data_viz_returns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Once again... data visualization!

ADD GIF OR MEME

## Why do we need data visualization?

> Graphics reveal data features that statistics and models may miss: unusual distributions of data, local patterns, clusterings, gaps, missing values, evidence of rounding or heaping, implicit boundaries, outliers, and so on. Graphics raise questions that stimulate research and suggest ideas. It sounds easy. In fact, interpreting graphics needs experience to identify potentially interesting features and statistical nous to guard against the dangers of overinterpretation. (nous ~ intelligence)

## EDA vs EDA - a quick reminder

**Exploratory Data Viz** - visualizations meant for identifying patterns or anomalies in data and starting to hypothesize

Exploratory graphs would typically:
- be rough sketches
- be used (and understood) by analysts
- look at many variables at once
- have a simple, descriptive title

**Explanatory Data Viz** - visualizations meant to convey a specific insight

Explanatory graphs would typically:
- be refined charts
- be created by analysts, understood by anyone
- focus on one or two variables of interest
- have a takeaway point as a title

In [13]:
# I promise I will get you to use plotly by the end of this bootcamp...
import plotly.express as px
import pandas as pd

## Common mistakes!

### 1. Plotting without understanding what you're plotting (e.g. `customer_id`)

### 2. Doing barplots for numerical continuous data

### 3. Doing histograms or boxplots for numerical discrete data

### 4. Plotting noisy charts

### 5. 

## Visualizing one feature (univariate analysis)

In [19]:
gapminder = px.data.gapminder() 
gapminder.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


What data do we have?
- numerical continuous: lifeExp, pop, gdpPercap
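- numerical discrete: iso_num (stored as an integer, but it is a country code, so treat it as an identifier rather than a quantity)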
- categorical: country, continent, iso_alpha
- time: year

### 1. Numerical continuous

#### ✅Histograms

In [20]:
px.histogram(gapminder, x='lifeExp')

What do we see?
1. 

- ✅boxplots
- ❌barplots
- ❌lineplots
- ❌scatterplots

## Questions!

### 1. To what extent is it possible to evaluate correlation between variables using plots? 

*Which plots to use? 2 numericals: pairplot? 2 categoricals? 1 categorical and 1 numerical?*

### 2. How to interpret the plots with the result of ML models?

### 3. How to plot a time variable and how to interpret the plot?

It... depends.  

If you are interested in plotting a time variable **alone**, you will typically want to see:
- the time period covered by the dataset
- whether there are patterns in when the data was recorded, e.g. more records in December than in November

You can do this using a simple barplot which counts the number of records per time frame:



In [14]:
time_series = px.data.stocks()
time_series['date'] = pd.to_datetime(time_series['date'])  # needed for one of the scatterplots :)
time_series

| | date | GOOG | AAPL | AMZN | FB | NFLX | MSFT |
|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 1 | 2018-01-08 | 1.018172 | 1.011943 | 1.061881 | 0.959968 | 1.053526 | 1.015988 |
| 2 | 2018-01-15 | 1.032008 | 1.019771 | 1.053240 | 0.970243 | 1.049860 | 1.020524 |
| 3 | 2018-01-22 | 1.066783 | 0.980057 | 1.140676 | 1.016858 | 1.307681 | 1.066561 |
| 4 | 2018-01-29 | 1.008773 | 0.917143 | 1.163374 | 1.018357 | 1.273537 | 1.040708 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 100 | 2019-12-02 | 1.216280 | 1.546914 | 1.425061 | 1.075997 | 1.463641 | 1.720717 |
| 101 | 2019-12-09 | 1.222821 | 1.572286 | 1.432660 | 1.038855 | 1.421496 | 1.752239 |
| 102 | 2019-12-16 | 1.224418 | 1.596800 | 1.453455 | 1.104094 | 1.604362 | 1.784896 |
| 103 | 2019-12-23 | 1.226504 | 1.656000 | 1.521226 | 1.113728 | 1.567170 | 1.802472 |


In [5]:
# Date range
px.bar(time_series, x='date')

We can already see that the dataset has one record per week and covers two years. What is usually more interesting, though, is how a certain feature varies over time! The right chart depends on what kind of feature it is:
- numerical continuous data (e.g. stock prices during a day):
  - scatterplots
  - boxplots / candlestick plots
  - line charts (using an aggregation, e.g. count/average)
  - histograms (to examine the distribution within a period, e.g. whether a stock sat on the low or high side during a specific day/week/month)

Our stock data is provided on a weekly basis (and there isn't much of it), so the easiest plot we can use is a line chart to examine any weekly patterns in stock price:

In [17]:
px.line(time_series, x='date', y='GOOG')

When analyzing just the Google stock, we can see that:
- the highest prices occurred in August 2018, April 2019, December 2019
- the lowest prices occurred in April 2018, December 2018, June 2019

This suggests that in 2018 Google stock rose and then fell roughly once per quarter, which could be related to quarterly financial statements being released.

In [16]:
px.line(time_series, x='date', y=['GOOG', 'AAPL'])

In [15]:
# trendline='ols' needs the statsmodels package installed
px.scatter(time_series, x='date', y=['GOOG', 'AAPL'], trendline='ols')

Because the stock data is weekly (and there isn't much of it), boxplots and histograms only make sense after binning. In our case, we could do that by looking at monthly patterns instead of weekly ones.

- numerical discrete data (e.g. movie ratings):
  - scatterplots
  - barplots - these will show either
  - lineplots

My guess is this question is coming from the `effective_to_date` feature in your dataset. 

## Customize your charts!


In [None]:
fig = px.bar()