# <center> <h1>Welcome to Data Visualization in Python</h1> </center>

![DataViz](images/beautiful.png)  

**Source:** https://informationisbeautiful.net/beautifulnews/

> “There is no such thing as information overload. There is only bad design.” ~ Edward Tufte

## Outline 4 Today


1. Introduction to Data Visualization
    - What is DataViz?
    - Quantitative vs Qualitative
    - Schema for Creating Visualizations
    - Static vs Interactive DataViz
    - Do's
    - Don'ts
2. Jupyter Lab/Notebook and Python
3. Our Project for Tonight
4. The Data We'll Use
5. Interrogating the Data one Visualization at a Time
6. Next steps

# 1. Introduction to Data Visualization

Data visualisation, more than being part art and part science, is one of the key components of the data analytics cycle. People have different learning styles, and information is sometimes more accessible when it is conveyed through visualisations rather than tables and written text.

## 1.1 What is DataViz?

> "Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data." ~ [Tableau](https://www.tableau.com/learn/articles/data-visualization)

Data visualization as a field of study has been on the rise for many decades now --if not centuries-- and it is an exciting area to be a part of. Organisations such as the [Data Visualization Society](https://www.datavisualizationsociety.com/), FreeCodeCamp, and others have extensive information on how to go beyond simple data visualisation, should that be something that interests you. If you would like to read more about data visualisation and what you can do with it, check out this [Medium publication](https://medium.com/nightingale).

Python has a wide variety of visualisation tools available for static and interactive, quantitative and qualitative, time series and geographic data visualisation, and some of the most-widely used libraries for these purposes, to date, are the following ones:

- [matplotlib](https://matplotlib.org/) --> highly customisable and long-term contender in the dataviz arena
- [seaborn]() --> beautiful data visualisation library that is easy to use and fast
- [bokeh]() --> great (and beautiful) tool for interactive data visualisation
- [plotly]() --> bokeh's top contender
- [altair]() --> beautiful data visualisation library based on the grammar of graphics philosophy
- [plotnine]() --> data visualisation library based on R's ggplot2

## 1.2 Quantitative vs Qualitative

When we get to the data visualisation stage of the data analytics cycle, we should always keep in mind the nature of the data we would like to visualise. If we want to see relationships (correlations between variables) we might only choose quantitative variables for our visualisations. If we want to show a specific theme in our dataset, e.g. gender differences, customer type, or potential customer, we might just opt for visualising frequencies in qualitative data. In contrast, if we want to show the relationship of variables given a specific group in our dataset (e.g. income differences by gender), we would choose a combination of qualitative and quantitative variables.
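As a rough illustration of the three cases above, here is a minimal pandas sketch. The `people` table and all of its values are invented for this example; they are not part of the workshop data.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
people = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "customer_type": ["new", "returning", "returning", "new"],
    "income": [52000.0, 48000.0, 61000.0, 45000.0],
    "age": [34, 29, 41, 25],
})

# Split the columns by the nature of the data they hold
quantitative = people.select_dtypes(include="number").columns.tolist()
qualitative = people.select_dtypes(exclude="number").columns.tolist()

# Quantitative vs quantitative: a relationship between two numeric variables
income_age_corr = people["income"].corr(people["age"])

# Qualitative only: frequencies of each category
gender_counts = people["gender"].value_counts()

# Qualitative + quantitative: a numeric variable summarised per group
income_by_gender = people.groupby("gender")["income"].mean()

print(quantitative, qualitative)
print(income_by_gender)
```

Each of the three results maps naturally to a chart type: a scatter plot for the correlation, a bar chart for the frequencies, and grouped bars or box plots for income by gender.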

To give you a more concrete example, I have borrowed the following table from a book that I highly recommend if you really want to get started with data visualization: _"Fundamentals of data visualization: A primer on making informative and compelling figures"_ by Claus O. Wilke.

![data_var_types](images/var_types.png)

**Source:** Wilke, C. O. (2019). _Fundamentals of data visualization: A primer on making informative and compelling figures._ Sebastopol, CA: O'Reilly Media.

Now that we are aware of the subtle differences in data visualization, how do we know which visualisation to create with the data we have? The answer is that it will depend on the context of your task, and on how much information you would like to convey in your visualisation. As you decompose a task to choose the best course of action for your visualisation, keep the following diagram from ActiveWizards in mind.

## 1.3 Static vs Interactive Visualizations

An important aspect to keep in mind when creating visualizations is whether we should represent our data in a static or interactive format. As analysts, we should always ask ourselves: will our message reach our audience better if they are able to interact with the visualisation? The reason behind this can be captured in a very famous quote often attributed to Benjamin Franklin.

> "Tell me and I forget. Teach me and I remember. Involve me and I learn." ~ Benjamin Franklin

If the goal of our visualisations is to teach something to our audience, chances are that allowing them to interact with our visualisation will do just that. Let's talk a bit more about static and interactive visualisations.

### 1.3.1 Static DataViz

Static data visualisations are those meant to show one or several facts about the data in a specific way. They help us convey a message and are often used closely with other narratives. For example, the New York Times is one of the most famous news agencies in the world not just for the top content they manage to create and provide to the masses, but also for the beautiful visualisations one can find among their large amounts of information.

Static visualisations are also often embedded in infographics to carry a message even further. Think about the graphs displayed in brochures that tell us to buy a fragrance or a particular type of cutlery alongside a statistic. Some often say "75% of those who purchased these products have experienced...blah blah blah". Watch out for those :)

### 1.3.2 Interactive DataViz

Interactive data visualisations tell different stories while letting the users pick which one they would like to see or understand better as they evaluate the piece of work. These kinds of visualisations can be very powerful tools not only to convey messages to many people but also to provide top-notch educational content for others.

Involving your audience through interactive visualisations can be a much more involved process though. A static visualisation can be saved and shared with many in a matter of minutes. Interactive visualisations, on the other hand, might require a web application to work and be displayed, making it more difficult to show them to people on the go. Dashboards and other tools, while requiring a bit more work to put together, can have a lot of useful interactivity in them.

## 1.4 Do's

When creating data visualisations, it is important to keep in mind the following Do's.

- Label your axes where appropriate
- Add a title
- Use color appropriately. Showcase what you need, not every data point
- Use full axis and maintain consistency with different graphs shown in parallel
- Ask others for their opinion
- Pass the squint test (blurry viz)

## 1.5 Don'ts

Just as there are many **Do's** in data visualisation, there are also many **Don'ts**. Let's go over a few of them together.

1. Don't use too much color  
<img src="https://clauswilke.com/dataviz/pitfalls_of_color_use_files/figure-html/popgrowth-US-rainbow-1.png" alt="bad pie" width="400"/>   

2. Don't use unmatching percentages  
<img src="http://livingqlikview.com/wp-content/uploads/2017/04/Worst-Data-Visualizations-02.jpg" alt="bad percentages" width="400"/>  

3. Don't try to put everything in one graph  
<img src="http://livingqlikview.com/wp-content/uploads/2017/04/Worst-Data-Visualizations-07.jpg" alt="bad pie" width="400"/>  

4. Trend lines need time not categories  
<img src="http://livingqlikview.com/wp-content/uploads/2017/04/Worst-Data-Visualizations-03.jpg" alt="bad lines" width="400"/>  

5. Don't make your chart data unreadable  
<img src="http://livingqlikview.com/wp-content/uploads/2017/04/Worst-Data-Visualizations-04.jpg" alt="bad text" width="400"/>  

6. Don't make no sense  
<img src="https://i.insider.com/51cb1c3e69bedd713300000e?width=1200" alt="bad sense" width="400"/>  

7. Don't deceive your audience with different intervals and axes  
<img src="https://i.insider.com/51cb25fa69beddcd4f000005?width=1200" alt="bad intervals" width="400"/>  

8. Axes betrayal  
<img src="https://i.insider.com/51cb2721eab8ea1d33000004?width=1200" alt="bad percentages" width="400"/>  


**Source 1:** taken from Fundamentals of Data Visualization by Claus O. Wilke. Data source is US Census Bureau  
**Source 2:** Figures 2, 3, 4, and 5 were taken from [QlikView](http://livingqlikview.com/the-9-worst-data-visualizations-ever-created/)  
**Source 3:** Figures 6, 7, and 8 were taken from [Business Insider](https://www.businessinsider.com.au/the-27-worst-charts-of-all-time-2013-6?r=US&IR=T#did-anyone-learn-anything-by-looking-at-this-pseudo-pie-chart-what-do-these-colors-even-mean-why-is-it-divided-into-quadrants-well-never-know-1)

# 2. Jupyter Lab/Notebook and Python

## Jupyter

JupyterLab is an [Integrated Development Environment](https://en.wikipedia.org/wiki/Integrated_development_environment) created by the [Jupyter Project](https://jupyter.org/). It allows you to combine different tools that are paramount for a good coding workflow. For example, you can have the terminal, a Jupyter notebook, and a markdown file for note-taking/documenting your work, as well as others, opened at the same time to improve your workflow as you write code (see image below).

![jupyterlab](https://jupyterlab.readthedocs.io/en/latest/_images/interface_jupyterlab.png)
**Source** - [https://jupyterlab.readthedocs.io](https://jupyterlab.readthedocs.io)

To run code you will use the following two commands:

> # Shift + Enter

and

> # Alt + Enter  

The first will run the cell and take you to the next one. If there is no cell underneath the one you just ran, it will insert a new one for you. The second one will run the cell and insert a new one below automatically. Alternatively, you can also run the cells using the play (▶︎) button at the top or with the _Run menu_ on the top left-hand corner.

Anything that follows a hash `#` sign is a comment and will not be evaluated by Python. They are useful for documenting your code and letting others know what is happening with every line of code or with every cell.

To check the information of a package, function, method, etc., use `?` or `??` at the beginning or end of its name, and Jupyter will show you its documentation (`??` also shows the source code when it is available).

## Python

Python is a general-purpose programming language that allows us to create programs, analyse data, create websites, create applications, and many other cool things. It is free and open-source, which means that anyone can contribute to its development and help make this language an even better one. This latter fact, along with the great readability of the language, are (to your host) two of the major contributing factors of Python's popularity.

Python can be thought of as a person. We are very cool the way we are, but to interact more efficiently we make use of "add-ons", and so does Python. For us, these add-ons may be clothes, shoes, accessories, slang words, and other physical objects such as cars, houses, boats, etc. In Python, the add-ons are additional programs (packages) other people have created in order to make a specific workflow easier.

### Some Key Concepts in Python We'll Need 4 Today

**Data Types**

1. Strings --> Text or written data. e.g. "this is a string"
2. Integers --> numbers without decimal places. e.g. 1, 2, 3, 4, 5, 6
3. Floats --> numbers with decimal places. e.g. 3.5, 2.06, 7.9, 4.1
4. Dates --> time-related object. e.g. 19-March-2020 18:00, 20-March-2020 12:30
5. Boolean --> logical value that can be `True` or `False`, which behave like 1 and 0, respectively

**Data Structures**
1. Dataframe --> spreadsheet-like object (literally), with rows and columns
2. Series or Array --> a row or column in a spreadsheet, or a combination of the two

# Let's Work Through Some Examples Together
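Here is a small sketch of the data types and data structures listed above. The values are taken from one of the workouts we will see later in the session; the variable names (`title`, `laps`, `workouts`, and so on) are only chosen for this illustration.

```python
from datetime import datetime

import pandas as pd

# The five data types, using values from a single workout
title = "Sydney Running"                       # string
laps = 6                                       # integer
distance = 5.01                                # float
started = datetime(2020, 10, 17, 17, 26, 12)   # date (and time)
is_weekend = True                              # boolean

# Booleans behave like 1 and 0 when used as numbers
weekend_as_number = int(is_weekend)

# A dataframe is a table with rows and columns
workouts = pd.DataFrame({
    "title": ["Treadmill Running", "Sydney Running"],
    "distance": [1.12, 5.01],
})

# A single column of a dataframe is a series
distances = workouts["distance"]

print(type(title), type(laps), type(distance), type(started), type(is_weekend))
print(workouts)
```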

# 3. Our Project for Tonight

> Evaluate and Improve our Workout Habits

Say you have always exercised at least 5 days per week, or that, since last year, your New Year's resolution was to exercise at least 5 days a week throughout the entire year. This being a completely novel endeavor for you, something no one else picks during New Year's (😎), you decide it would be good to also track your progress to understand what your workout routines and habits look like throughout the year. You also want to come up with a few hypotheses to test using your own data; we will do that another time though.

Here is a picture of me on day 1

![day1](https://media.giphy.com/media/13Lwn87rxZSUVi/giphy.gif)

also me on day 2

![day2](https://media.giphy.com/media/ewelN8qzxQqNG/giphy.gif)

# 4. The Data We'll Use

Now that we have our task defined, we can move on to gathering the data we will need for our project. In my case, I have a Garmin watch and I went through the following steps to extract a comma separated values file from it.

- Go to [__Garmin Connect__](https://connect.garmin.com/signin)
- Sign in with your username and password
- Go to the activities section and click on All Activities
- Exporting the data gets a bit tricky here because Garmin will only export the items that have already loaded, so to get all of your data, scroll all the way down to the very end of your activities.
- Once you reach the last one, click on the Export CSV button at the top right-hand corner. You will see an Activities.csv file in your downloads folder.

**NOTE:** Garmin tracks much more data than what you will download from Garmin Connect but the format is not as user-friendly as what you would get using these steps.

Depending on which smartwatch or smartphone you have, there will be different ways to access your data.

# 5. Interrogating the Data One Visualization at a Time

The first thing we want to do is import into our session some external packages available in the Python ecosystem.

In [3]:
import pandas as pd
import altair as alt
pd.set_option('display.max_columns', None)

We will then read in the data using the pandas package we imported above and assign it to a variable. You can think of a variable as a column in Excel, a container, or a bucket that will hold, and give a name to, whatever piece of information you are working with.

In [2]:
df = pd.read_csv('data/clean_data/workouts_data.csv')

Now that we have loaded our data into memory, we can examine it by viewing a few rows of it using what is called a method. You can think of methods as the behavior of an object in Python. For example, the behavior of a stove is that it gets hot if we turn it on and it is cold if it is off. The behavior of our variable `df` when we apply the method `.head()` is that it returns a small view of our data for us.

In [4]:
df.head()

Unnamed: 0,activity_type,date,title,distance,calories,time,avg_hr,max_hr,aerobic_te,avg_run_cadence,max_run_cadence,avg_pace,best_pace,avg_stride_length,climb_time,min_temp,number_of_laps,number_of_runs,month,year,week,weekday,quarter,time_exercise,date_exercise,day_of_week,time_day,week_or_end
0,Treadmill Running,2020-10-18 16:33:42,Treadmill Running,1.12,160.0,00:10:40,140,153,2.2,146.0,167.0,9:33,7:42,1.14,10:40,75.2,2.0,2.0,10,2020,42,6,4,16:33:42,2020-10-18,Sunday,afternoon,weekend
1,Running,2020-10-17 17:26:12,Sydney Running,5.01,757.0,00:45:40,158,180,4.2,159.0,168.0,9:07,7:53,1.11,45:40,69.8,6.0,6.0,10,2020,42,5,4,17:26:12,2020-10-17,Saturday,afternoon,weekend
2,Elliptical,2020-10-16 18:02:39,Elliptical,0.0,213.0,00:19:43,117,149,2.0,95.0,150.0,,,0.0,19:43,78.8,1.0,1.0,10,2020,42,4,4,18:02:39,2020-10-16,Friday,night,week_day
3,Treadmill Running,2020-10-15 18:20:50,Treadmill Running,2.0,249.0,00:15:39,148,162,2.7,150.0,183.0,7:50,7:03,1.15,15:39,75.2,2.0,2.0,10,2020,42,3,4,18:20:50,2020-10-15,Thursday,night,week_day
4,Treadmill Running,2020-10-14 16:32:58,Treadmill Running,0.79,98.0,00:06:43.7,136,146,1.7,169.0,179.0,8:28,7:14,1.12,6:43.7,77.0,1.0,1.0,10,2020,42,2,4,16:32:58,2020-10-14,Wednesday,afternoon,week_day
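If you do not have the CSV file at hand, the following sketch reproduces the reading and inspecting steps on a tiny in-memory sample. It assumes only four of the columns shown above, copies three of the rows, and uses the hypothetical names `sample_csv` and `sample`.

```python
import io

import pandas as pd

# Three rows copied from the output above, with only four of the columns
sample_csv = """activity_type,date,distance,calories
Treadmill Running,2020-10-18 16:33:42,1.12,160.0
Running,2020-10-17 17:26:12,5.01,757.0
Elliptical,2020-10-16 18:02:39,0.0,213.0
"""

# read_csv accepts a file path or a file-like object; parse_dates turns
# the text in the date column into proper datetime values
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# Methods are behaviours of the object: .head(2) returns the first two rows
first_two = sample.head(2)

# Attributes describe the object: .shape is (number of rows, number of columns)
n_rows, n_cols = sample.shape

print(first_two)
print(sample.dtypes)
```

Printing `sample.dtypes` is a quick way to confirm which columns pandas treated as numbers, text, or dates.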


## Exercise 1

Try using the method `.tail()` with our variable `df` in the cell below.

### Question 1

Regardless of the year, what is the average amount of calories I burn per month?

In [12]:
# Assign the chart to a variable so the last line of the cell can display it
line = alt.Chart(df).mark_line().encode(
    x='month',
    y='mean(calories)'
)
line
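To check the numbers behind a chart like this one, the same aggregation can be sketched in pandas. The October rows below reuse the calories from the five workouts displayed earlier; the two September rows are invented so that there is more than one month to compare.

```python
import pandas as pd

sample = pd.DataFrame({
    # The last two rows (month 9) are hypothetical
    "month": [10, 10, 10, 10, 10, 9, 9],
    "calories": [160.0, 757.0, 213.0, 249.0, 98.0, 300.0, 500.0],
})

# Equivalent of y='mean(calories)' grouped by x='month'
monthly_mean = sample.groupby("month")["calories"].mean()

print(monthly_mean)
```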

### Question 2

Is the distance I run correlated with the number of calories I burn?

In [17]:
alt.Chart(df).mark_point().encode(
    x='distance',
    y='calories'
)

In [18]:
# .interactive() adds panning and zooming to the same scatter plot
alt.Chart(df).mark_point().encode(
    x='distance',
    y='calories'
).interactive()
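A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on it. Here is a minimal sketch using only the five workouts displayed earlier, so the result says nothing reliable about the full dataset.

```python
import pandas as pd

# Distance and calories of the five workouts shown in df.head()
sample = pd.DataFrame({
    "distance": [1.12, 5.01, 0.0, 2.0, 0.79],
    "calories": [160.0, 757.0, 213.0, 249.0, 98.0],
})

# Pearson correlation: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, and 1 is a perfect positive one
r = sample["distance"].corr(sample["calories"])

print(round(r, 2))
```

With the real data you would run `df['distance'].corr(df['calories'])`, keeping in mind that activities such as the elliptical record a distance of 0.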

In [21]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('distance', bin=True),
    y='count()'
)

In [22]:
# Binned heatmap: bin both variables and color each cell by the number of workouts in it
alt.Chart(df).mark_bar().encode(
    x=alt.X('distance', bin=True),
    y=alt.Y('calories', bin=True),
    color='count()'
)

In [25]:
alt.Chart(df).mark_point().encode(
    x='distance',
    y='calories',
    color='week_or_end'
)

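In [32]:
# Brush along the x-axis; points outside the selection turn light gray
interval = alt.selection_interval(encodings=['x'], zoom=True)

chart = alt.Chart(df).mark_point().encode(
    x='distance',
    y='calories',
    color=alt.condition(interval, 'week_or_end', alt.value('lightgray')),
).properties(
    selection=interval  # newer Altair releases (5+) attach selections with .add_params(interval)
)

chart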

In [33]:
chart | chart.encode(x='avg_hr')

In [36]:
interval = alt.selection_interval(zoom=True)

chart1 = alt.Chart(df).mark_point().encode(
    x='distance',
    y='calories',
    color=alt.condition(interval, 'week_or_end', alt.value('lightgray')),
    tooltip='activity_type'
).properties(
    selection=interval
)

chart1 | chart1.encode(x='min_temp')

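In [39]:
# Normalised stacked bars: each day's bar shows the share of calories per activity type
alt.Chart(df).mark_bar().encode(
    x=alt.X('sum(calories)', stack="normalize"),
    y='day_of_week',
    color='activity_type'
)  # append .save('mychart.html') to write the chart to an HTML file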
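What `stack="normalize"` computes can be sketched by hand with pandas. All values below are invented for illustration; only the column names match our dataset.

```python
import pandas as pd

# Hypothetical workouts, invented purely for illustration
sample = pd.DataFrame({
    "day_of_week": ["Saturday", "Saturday", "Sunday", "Sunday"],
    "activity_type": ["Running", "Elliptical", "Running", "Running"],
    "calories": [750.0, 250.0, 400.0, 100.0],
})

# Total calories per day (rows) and activity type (columns)
totals = sample.pivot_table(
    index="day_of_week",
    columns="activity_type",
    values="calories",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its own total so that every day adds up to 1
shares = totals.div(totals.sum(axis=1), axis=0)

print(shares)
```

Normalising makes days comparable even when the total calories differ a lot, at the cost of hiding those totals.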

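In [None]:
# alt.vconcat(a, b) stacks charts vertically; it is the function form of the `a & b` operator used below
alt.vconcat()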

In [42]:
hist = alt.Chart(df).mark_bar().encode(
    x='count()',
    y='week_or_end',
    color='week_or_end'
)
chart1 & hist

In [43]:
# transform_filter(interval) makes the bar chart count only the points selected in chart1
hist = alt.Chart(df).mark_bar().encode(
    x='count()',
    y='week_or_end',
    color='week_or_end'
).transform_filter(
    interval
)
chart1 & hist
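To make the linked filtering less magical, here is a sketch of the same idea in plain pandas: keep the rows that fall inside a rectangle, then count them by group. The rows are the five workouts displayed earlier, and the brush limits (`x_min`, `x_max`, `y_min`, `y_max`) are hypothetical values standing in for a selection dragged with the mouse.

```python
import pandas as pd

# The five workouts shown in df.head()
sample = pd.DataFrame({
    "distance": [1.12, 5.01, 0.0, 2.0, 0.79],
    "calories": [160.0, 757.0, 213.0, 249.0, 98.0],
    "week_or_end": ["weekend", "weekend", "week_day", "week_day", "week_day"],
})

# Hypothetical brush limits
x_min, x_max = 0.5, 2.5
y_min, y_max = 50, 300

# Keep only the points inside the brushed rectangle
inside = (
    sample["distance"].between(x_min, x_max)
    & sample["calories"].between(y_min, y_max)
)
selected = sample[inside]

# Equivalent of the bar chart: count the selected points per group
counts = selected["week_or_end"].value_counts()

print(counts)
```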

## Grammar of Graphics

- Data
- Transformation
- Marks
- Encoding: mapping from fields to mark properties
- Scale: functions that map data to visual scales
- Guides: visualization of scales (axes, legends, etc.)

## Altair is Declarative

We describe what the chart should show, and Altair works out how to draw it.

1. We start with the chart.
2. We assign the type of representation (i.e. the mark) we want our data to take.
3. The next step is the encoding, which allows us to map columns of our data to visual elements of the chart. Thinking in two dimensions, we need an X and a Y coordinate.