# Task #1 - A Quick Introduction to Visualization for Data Exploration in Python

In data science, it is important to know your data.

The **first step** that you should **always** take is to **look at your data**.

#### Group Work
Programming in a group?! How is it done? 

Data science is a collaborative task, so there are many ways to work together. For example:
- One of you shares their screen and, in a 2-minute rotation, each person dictates what to code. After 2 minutes, the next person must continue where the previous one stopped.
- One person shares their screen, and you all discuss together how to approach the task.
- Each codes their own solution, but after every step and sub-task, you compare your approaches and results.
- Or... come up with the method that fits you best! 

Remember that we are here to learn from each other.

#### First look at the data
We can look at the data by calculating its statistics, examining and comparing random samples, and of course by visualizing the data.  
Data Visualization can reveal patterns that wouldn't be discovered otherwise.
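
As a minimal sketch of the first two approaches (statistics and random samples), the cell below uses a small made-up DataFrame. The name `toy`, its columns, and its values are assumptions for illustration only and are not the dataset used in this task.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the room-occupancy dataset)
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame({
    "Temperature": rng.normal(loc=21.0, scale=1.0, size=200),
    "CO2": rng.normal(loc=600.0, scale=150.0, size=200),
    "Occupancy": rng.integers(0, 2, size=200),
})

# 1. Summary statistics for every numeric column
summary = toy.describe()
print(summary)

# 2. A few rows drawn at random (random_state makes the draw repeatable)
random_rows = toy.sample(n=5, random_state=0)
print(random_rows)

# 3. Compare the groups defined by the target column
group_means = toy.groupby("Occupancy").mean()
print(group_means)
```

Fixing `random_state` is a convenience for group work: everyone sees the same rows and can compare notes.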

In [0]:
# installing pandas, matplotlib and seaborn - if you don't have them already
%pip install pandas matplotlib seaborn -q

In [0]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Here we load a dataset, `room_occupancy`, into the variable `df_room_occupancy`. For every point in time, it contains several properties that were measured in a room, together with a target variable, `Occupancy`, that indicates whether someone was in the room.

In [0]:
df_room_occupancy = pd.read_csv('../../Data/room_occupancy.csv')

# show the columns and their types
df_room_occupancy.dtypes

Here is a short explanation of the dataset variables:

- ``date``: the specific day and time when the values were recorded 
- ``Temperature``: measured in Celsius
- ``Humidity``: relative humidity, i.e. the current absolute humidity relative to the maximum possible humidity at the same temperature, expressed as a percentage
- ``Light``: measured in Lux
- ``CO2``: in ppm (parts per million)
- ``HumidityRatio``: derived quantity from temperature and relative humidity, expressed in kilograms of water vapor per kilogram of dry air  
- ``Occupancy``: the presence of a person in the room (1 if a person is present, 0 otherwise). 
The occupancy of the room was obtained from pictures that were taken every minute over a period of 8 days.
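
Because `date` holds a day and time, it can help to have pandas parse it as a datetime rather than leave it as plain text. The sketch below illustrates this on a tiny in-memory CSV. The rows and the timestamp format are hypothetical, and the real file may use a different format.

```python
import io
import pandas as pd

# Hypothetical CSV excerpt; the real file's values and date format may differ
csv_text = """date,Temperature,Occupancy
2024-01-01 09:00:00,21.5,1
2024-01-01 09:01:00,21.4,1
2024-01-01 09:02:00,21.4,0
"""

# parse_dates asks read_csv to convert the listed columns to datetime64
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(toy.dtypes)

# Share of rows in each class of the target variable
class_share = toy["Occupancy"].value_counts(normalize=True)
print(class_share)
```

With a datetime column, time-based plots such as `DataFrame.plot.line` can place the measurements on a proper time axis.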

In [0]:
########## YOUR TURN ##########
# Explore the dataset:
# - Take a look at the first few rows of the dataset
# - Display the summary statistics of the dataset
# - Perform other exploratory tasks (checking for missing values, duplicates, etc.)











################################

#### Visualizations and Plots in Pandas

[Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) offers several basic plots.  

While there are several ways to invoke these plots with pandas, we recommend using one of the *`<object>`*`.`**`plot`**`.`*`<plot_type>`* methods:  
  - [`DataFrame.plot.line`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.line.html)
  - [`DataFrame.plot.hist`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hist.html)
  - [`DataFrame.plot.box`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)
  - [`DataFrame.plot.bar`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html)
  - [`DataFrame.plot.barh`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.barh.html)
  - [`DataFrame.plot.scatter`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html)
  - [`DataFrame.plot.pie`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.pie.html)
  - [`DataFrame.plot.kde`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html)
  - [`DataFrame.plot.density`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.density.html)
  - [`DataFrame.plot.area`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.area.html)
  - [`DataFrame.plot.hexbin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html)

Additionally, two more plotting methods create a histogram (`hist`) or a boxplot (`box`) directly from the DataFrame itself, without the `.plot.*` prefix: 
- [`DataFrame.hist()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) and  
- [`DataFrame.boxplot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html)
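
A minimal sketch of these two methods side by side is shown here. It uses made-up data in which the toy `CO2` values are deliberately shifted upward when `Occupancy` is 1. That shift is an assumption for illustration, not a finding about the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data, made up for illustration (not the room-occupancy dataset)
rng = np.random.default_rng(seed=1)
occupancy = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "CO2": 500 + 300 * occupancy + rng.normal(scale=80, size=300),
    "Occupancy": occupancy,
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# DataFrame.hist(): histogram of a single column, drawn on the left axes
toy.hist(column="CO2", bins=20, ax=ax_hist)

# DataFrame.boxplot(): one box per value of the `by` column, on the right axes
toy.boxplot(column="CO2", by="Occupancy", ax=ax_box);
```

Passing `ax=` keeps both plots in one figure, which makes them easier to compare.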

Lastly, `scatter_matrix` creates a matrix of scatter plots, with a histogram or KDE of each variable on the diagonal, to quickly explore the dataset:

```python
from pandas.plotting import scatter_matrix
scatter_matrix(df_room_occupancy, alpha=0.2, figsize=(6, 6), diagonal="hist");
```

Let's see them all in action:

#####  *`<object>`*`.`**`plot`**`.`*`<plot_type>`*

In [0]:
# Plot a histogram at the DataFrame level: all the columns are drawn in one plot
df_room_occupancy.plot.hist();

The *`<object>`*.`plot.`*`<plot_type>`* method can operate either on the whole DataFrame object or on a single column (also called a **Series**):

In [0]:
# Directly plot the histogram of a specific column:
df_room_occupancy["CO2"].plot.hist();

##### DataFrame.hist()

In [0]:
# When calling hist() directly on the DataFrame (without .plot), each column is drawn in a separate subplot:
df_room_occupancy.hist();

In [0]:
field = "Temperature"
df_room_occupancy[field].plot.hist();

In [0]:
# Jupyter tip: you can use the double question mark to see the source code of a function, and learn about its parameters
df_room_occupancy.plot.hist??
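
The `??` syntax only works inside Jupyter/IPython. As a hedged alternative for a plain Python session, the standard-library `inspect` module can show the same kind of information. The sketch below looks the method up on the `pd.DataFrame` class, so no data is needed.

```python
import inspect
import pandas as pd

# The parameters accepted by DataFrame.plot.hist
signature = inspect.signature(pd.DataFrame.plot.hist)
print(signature)

# The docstring; help(pd.DataFrame.plot.hist) would print it in full
doc = inspect.getdoc(pd.DataFrame.plot.hist)
print(doc.splitlines()[0])
```

Extra keyword arguments that are not listed in the signature are forwarded to the underlying plotting call, so the docstring is worth reading as well.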

In [0]:
########## YOUR TURN ##########
# Plot the histogram of the "Light" column in the dataset, using 50 bins, and a figure size of 10x10
# Bonus for advanced: plot the histogram of the "Light" column, grouped by the "Occupancy" column



################################

The [Scatter Matrix](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html) gives us a quick view of the numerical columns of the dataset:

In [0]:
from pandas.plotting import scatter_matrix

scatter_matrix(
    df_room_occupancy,                                      # The dataframe to explore 
    alpha=0.2,                                              # The transparency of the scatter points
    figsize=(10, 10),                                       # The size of the figure
    diagonal="hist",                                        # The type of plot for the diagonal subplots
    hist_kwds={"bins": 50, "color": "green", "alpha": 0.5}  # Additional parameters for the histogram plots (in a dictionary format)
    );

#### Seaborn

[Seaborn](https://seaborn.pydata.org/) builds on Matplotlib and makes it easier to reach a good visualization.

Seaborn supports a larger variety of charts for different purposes, such as [visualizing data distribution](https://seaborn.pydata.org/tutorial/distributions.html), [exploring the categorical data](https://seaborn.pydata.org/tutorial/categorical.html), and also [relations between the different variables and columns](https://seaborn.pydata.org/tutorial/relational.html).

Seaborn plots can also be combined with other charts. Here is a simple histogram drawn with [Matplotlib's hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) function, to which we add, on the same chart, a marginal distribution plot: Seaborn's [rugplot](https://seaborn.pydata.org/generated/seaborn.rugplot.html).

In [0]:
field = "Temperature"
plt.hist(data=df_room_occupancy, x=field, alpha=0.3)
sns.rugplot(data=df_room_occupancy, x=field, hue="Occupancy");

Here's another example of the same histogram, this time combined with a [Kernel Density Estimate (KDE)](https://towardsdatascience.com/kernel-density-estimation-explained-step-by-step-7cc5b5bc4517) plot.  
In this example we also split the data by the target, `Occupancy`, to compare the temperature distributions of the two groups:

In [0]:
sns.histplot(data=df_room_occupancy, x=field, kde=True, hue='Occupancy');

Seaborn also offers a [`violin plot`](https://en.wikipedia.org/wiki/Violin_plot) to visualize the density of a variable.  
This is done using the method [`sns.violinplot()`](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

In [0]:
sns.violinplot(data=df_room_occupancy, x=field, hue='Occupancy');

Here's a broader view of the charts Seaborn offers:

![image.png](https://seaborn.pydata.org/_images/function_overview_8_0.png)

Moving from univariate (exploring a single variable) to bivariate (exploring two variables) analysis, Seaborn offers two excellent plotting functions: `pairplot` and `jointplot`:

```python
sns.pairplot(data=df_room_occupancy, hue="Occupancy")
```
and
```python
sns.jointplot(data=df_room_occupancy, x="Temperature", y="CO2", hue="Occupancy")
```


![image.png](https://seaborn.pydata.org/_images/function_overview_36_0.png)
![image.png](https://seaborn.pydata.org/_images/function_overview_38_0.png)

In [0]:
######### Your Turn #########
# Explore the dataset in various ways by plotting its data.
#
# Discuss together as a group:
# - Which visualizations are the most informative?
# - Consider a task of Occupancy Prediction: What do you see in the data? What can you infer from the data? 









################################

For additional reading and inspiration, check out [this resource](https://towardsdatascience.com/how-to-perform-exploratory-data-analysis-with-seaborn-97e3413e841d).