# Introduction

In this lab we will be using some Python packages that are great tools for visualizing data: ```matplotlib``` and ```seaborn```, as well as built-in methods for ```pandas``` DataFrames.

In case you do not have these libraries already installed in your environment, run:

```pip install seaborn```

```pip install matplotlib```

## Creating the DataFrame

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# you can change the default style of plots - print(plt.style.available) lists the choices
plt.style.use("ggplot")

In [None]:
# read in data from a local CSV file (pd.read_csv also accepts a URL)
irisdf = pd.read_csv("iris.csv", header = None)

# remove NA values
irisdf = irisdf.dropna()

# add column names
irisdf.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# fix some values in the dataframe
irisdf.species = irisdf.species.replace("Iris-virginicas", "Iris-virginica")

# add a column to the dataset
irisdf["flower"] = irisdf.apply(lambda row: "small" if row["sepal_length"] < 5 else "big", axis=1)

# have a look at the first 6 rows
irisdf.head(6)
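The row-wise `apply` used for the `flower` column can also be written as a single vectorized comparison. The sketch below is only an illustration of that alternative: `demo` and its values are made up and stand in for `irisdf`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for irisdf; the values are made up for illustration
demo = pd.DataFrame({"sepal_length": [4.6, 5.0, 5.8, 4.9]})

# Vectorized equivalent of the row-wise apply:
# one comparison over the whole column instead of one lambda call per row
demo["flower"] = np.where(demo["sepal_length"] < 5, "small", "big")

print(demo)
```

On a dataset as small as iris the difference is negligible; on large DataFrames the vectorized form is usually much faster.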

# pandas DataFrame built-in methods for plotting

## Frequency histogram

In [None]:
# finds the number of bins required to plot each value as a separate bin
# if unchanged, the histogram shows the frequency of each individual value
# when the number of bins is changed, sepal length values will be grouped together
minlength = irisdf.sepal_length.min()
maxlength = irisdf.sepal_length.max()
# round before truncating so that floating-point error cannot drop a bin
bins = int(round((maxlength - minlength) * 10)) + 1

# shows the distribution of sepal lengths 
irisdf["sepal_length"].hist(bins = bins); 

In [None]:
# run code below to see the actual counts of each sepal length
irisdf.sepal_length.value_counts().sort_index()
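To see how the number of bins groups values together, it can help to look at the counts directly instead of the picture. The sketch below uses `np.histogram` on a made-up array (not the iris data) and compares two bin counts.

```python
import numpy as np

# Made-up values for illustration
values = np.array([1, 1, 2, 3, 3, 3])

# Three bins: each distinct value ends up in its own bin
counts3, edges3 = np.histogram(values, bins=3)

# Two bins: the values 2 and 3 are grouped into the same bin
counts2, edges2 = np.histogram(values, bins=2)

print(counts3, edges3)
print(counts2, edges2)
```

Fewer bins give a smoother but coarser picture; more bins show more detail but can look noisy.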

## Barplot for categorical values

In [None]:
# barplot for occurrences of species
irisdf.species.value_counts().plot(kind="bar");

# alternatively, as a histogram
#irisdf["species"].hist(bins=5);

## Grouped barplot

In [None]:
irisdf.groupby(["species", "flower"])["sepal_length"].count().unstack().plot(kind="bar");
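The grouped barplot is easier to follow once you see the table that `groupby(...).count().unstack()` produces. The sketch below runs the same chain on a small made-up DataFrame (`demo` is hypothetical, not the iris data).

```python
import pandas as pd

# Made-up data with the same column names as irisdf
demo = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b"],
    "flower": ["small", "big", "big", "big", "big"],
    "sepal_length": [4.8, 5.1, 5.4, 6.0, 6.3],
})

# count() gives one number per (species, flower) pair;
# unstack() moves the inner level ("flower") into the columns
counts = demo.groupby(["species", "flower"])["sepal_length"].count().unstack()

print(counts)
```

Each row of `counts` becomes one group on the x axis and each column becomes one bar colour. A combination that never occurs shows up as `NaN` (no bar); passing `fill_value=0` to `unstack` would store a zero instead.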

## Scatter plot

In [None]:
irisdf.plot(x="sepal_length", y="sepal_width", kind="scatter");

## Generating data according to some function

We generate values on the x axis, then apply a function ($f(x) = x^2$ and $g(x) = e^x$) to those values (in the code, the expression after the ```lambda``` keyword).

In real life, data never follows a mathematical function exactly; it contains noise, so we add some noise to the data.
```np.random.normal(0, 1, 1)``` draws values from a normal distribution with mean 0 and standard deviation 1. The last 1 is the number of values to sample. It returns an array, so ```[0]``` is added to pick out the single value.

In [None]:
# generate data 

cont_data = pd.DataFrame({
    # creates an array of values from -4 to 4 with step 0.1
    'x': np.arange(-4, 4.1, 0.1),
})
# applies quadratic function to x and adds some white noise
cont_data['quadratic'] = cont_data.apply(lambda df: df.x**2 + np.random.normal(0, 1, 1)[0], axis=1)

# applies exponential function to x and adds some white noise
cont_data['exponential'] = cont_data.apply(lambda df: np.e**df.x + np.random.normal(0, 1, 1)[0], axis=1)

cont_data.head()
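The same noisy data can be generated without `apply` by drawing all noise values in one call. This is a minimal sketch of that alternative, not a replacement for the cell above: `demo`, `rng` and the seed value are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the noise reproducible; the seed value is arbitrary
rng = np.random.default_rng(seed=0)

x = np.arange(-4, 4.1, 0.1)
demo = pd.DataFrame({"x": x})

# One noise value per row, drawn in a single call (mean 0, standard deviation 1)
demo["quadratic"] = demo["x"] ** 2 + rng.normal(0, 1, size=len(demo))
demo["exponential"] = np.exp(demo["x"]) + rng.normal(0, 1, size=len(demo))

print(demo.head())
```

Passing `size=len(demo)` returns an array with one sample per row, so no `[0]` indexing is needed.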

## Line chart

In [None]:
cont_data.plot(marker='o', linestyle='dashed', title = "Data", x='x', alpha=0.7);

More info https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

```matplotlib.pyplot.plot()``` and ```pandas.DataFrame.plot()``` are different functions. By default, however, the built-in plotting methods in ```pandas``` use ```matplotlib``` as their backend.
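One practical consequence is that a plot made through pandas can be customized further with matplotlib. The sketch below assumes the default matplotlib backend and uses a small made-up DataFrame.

```python
import pandas as pd
from matplotlib.axes import Axes

# Made-up data for illustration
demo = pd.DataFrame({"x": [0, 1, 2, 3], "y": [0, 1, 4, 9]})

# With the default backend, DataFrame.plot returns a matplotlib Axes object
ax = demo.plot(x="x", y="y", kind="line")

# ...so the usual matplotlib methods work on the result
ax.set_title("Customized through matplotlib")
ax.set_ylabel("y")

print(isinstance(ax, Axes))
```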

# matplotlib 

Conventionally, from ```matplotlib``` we import ```pyplot``` as ```plt```.

With ```rcParams``` you can configure your figure's aesthetic attributes, e.g. font family, size, line color, line width, etc.

The default values are loaded automatically when matplotlib starts and apply to every plot you create. The matplotlib documentation expands rc as **r**untime **c**onfiguration.

In [None]:
# This will set figsize for all other plots as well
plt.rcParams['figure.figsize'] = [6, 6] # width x height (inch)
plt.rcParams['figure.figsize'] = 6, 6  # or this way

In [None]:
# Update font size
plt.rcParams.update({'font.size': 26})

In [None]:
# Plot one figure with (x,y) coordinates.
plt.plot(cont_data['x'], cont_data['quadratic'], marker='+',  color='black' , linestyle='');

In [None]:
# Update font size again
plt.rcParams.update({'font.size': 14})
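If a setting should apply to one plot only, changing it globally and resetting it afterwards is easy to forget. A possible alternative is `plt.rc_context`, sketched below with made-up data; the variable names are only for illustration.

```python
from matplotlib import pyplot as plt

size_before = plt.rcParams["font.size"]

# Settings passed to rc_context apply only inside the with-block
with plt.rc_context({"font.size": 26}):
    size_inside = plt.rcParams["font.size"]
    plt.plot([0, 1, 2], [0, 1, 4], "k+")
    plt.title("Large font, this plot only")

# On exit the previous value is restored automatically
size_after = plt.rcParams["font.size"]

print(size_before, size_inside, size_after)
```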

In [None]:
# Points connected with a line
plt.plot(cont_data['x'], cont_data['quadratic'], marker='+',  color='black' , linestyle='--');

In [None]:
# add labels
# 'k+' is short for color='black', marker='+', linestyle=''
# see https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
plt.plot(cont_data['x'], cont_data['quadratic'], 'k+')
plt.title('Data')
plt.ylabel('y label')
plt.xlabel('x label');

## Using subplots and figures

In [None]:
# Figure can contain multiple plots and you can also set params
fig = plt.figure(figsize=[16, 6])

# Multiple plots on one figure - add_subplot(nrows, ncols, index)
fig.add_subplot(1, 3, 1).plot(cont_data['x'], cont_data['quadratic'], 'k+')
fig.add_subplot(132).plot(cont_data['x'], cont_data['quadratic'], 'ro')
fig.add_subplot(133).plot(cont_data['x'], cont_data['quadratic'], 'g-');

In [None]:
# Figure can contain multiple plots and you can also set params
fig = plt.figure(figsize=[8, 8])

# Multiple plots on one figure - add_subplot(nrows, ncols, index)
fig.add_subplot(2, 2, 1).plot(cont_data['x'], cont_data['quadratic'], 'k-')
fig.add_subplot(222).plot(cont_data['x'], cont_data['exponential'], 'ro', alpha=0.4)
fig.add_subplot(223).plot(cont_data['x'], cont_data['quadratic'], 'g+')
fig.add_subplot(224).plot(cont_data['x'], cont_data['exponential'], 'b--');

### What on earth are axes?

When creating a subplot, it is possible to assign it to a variable as a ```matplotlib.axes.Axes``` instance. This is useful for manipulating the subplots independently.

In [None]:
plt.rcParams['figure.figsize'] = [8, 5]

# ax: matplotlib axes object. 
# The Axes contains most of the figure elements: Axis, Tick, Line2D, Text, Polygon, etc., and sets the coordinate system.
ax = plt.subplot(121)  # ax is the name of the plot
ax.plot(cont_data['x'], cont_data['exponential'], 'go--', alpha=0.4)

ax2 = plt.subplot(122)  # ax2 is the name of the plot
ax2.plot(cont_data['x'], cont_data['exponential'], 'b+')

ax2.set_title('demo - only ax2')
ax.set_xlabel('x for ax');

More info https://matplotlib.org/api/axes_api.html
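A common alternative to calling `plt.subplot` once per plot is `plt.subplots`, which creates the figure and all axes in one call. The sketch below generates its own data, so it does not depend on `cont_data`.

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.linspace(-4, 4, 81)

# One call returns the figure and an array with one Axes per subplot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))

axes[0].plot(x, np.exp(x), "go--", alpha=0.4)
axes[0].set_xlabel("x for first axes")

axes[1].plot(x, np.exp(x), "b+")
axes[1].set_title("demo - only second axes")
```

Each element of `axes` is an independent `Axes` instance, so a title or label set on one does not affect the other.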

## (Relative) frequency histograms

A good article about plotting histograms in matplotlib: https://www.datacamp.com/community/tutorials/histograms-matplotlib


In [None]:
# Figure can contain multiple plots and you can also set params
fig = plt.figure(figsize=[16, 8])

# Multiple plots on one figure - plt.subplot(nrows, ncols, index)
ax = plt.subplot(1, 3, 1)
hist, bin_edges = np.histogram(irisdf["sepal_length"])
ax.bar(bin_edges[:-1], hist, width = 0.3, color='#0203aa')
ax.set_title('Frequency histogram1')

ax2 = plt.subplot(132)  
ax2.hist(irisdf["sepal_length"], 
         width = 0.3,
         weights = np.ones(len(irisdf["sepal_length"])) / len(irisdf["sepal_length"]) 
        )
ax2.set_title('Relative frequency histogram')

ax3 = plt.subplot(133)  
ax3.hist(irisdf["sepal_length"], 
         width = 0.3,
          color='#fb1dbf'
        )
ax3.set_title('Frequency histogram2');
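The `weights` argument is what turns counts into relative frequencies: each observation contributes `1 / n` instead of `1`, so the bar heights sum to 1. The sketch below checks this with `np.histogram` on made-up values.

```python
import numpy as np

# Made-up values for illustration
values = np.array([4.9, 5.0, 5.0, 5.4, 6.1, 6.3, 7.0, 7.2])

# Every observation gets weight 1/n
weights = np.ones(len(values)) / len(values)

counts, bin_edges = np.histogram(values, bins=4)
rel_freq, _ = np.histogram(values, bins=4, weights=weights)

print(counts)          # absolute frequencies
print(rel_freq)        # relative frequencies
print(rel_freq.sum())  # should be 1, up to floating-point rounding
```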

# seaborn

In [None]:
# Seaborn is a plotting module for Python.
# Works well with pandas DataFrames
import seaborn as sns
plt.rcParams['figure.figsize'] = [6, 6]
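# Plot style. The matplotlib style name 'seaborn-whitegrid' was renamed to
# 'seaborn-v0_8-whitegrid' in matplotlib 3.6, so set the style through seaborn instead
sns.set_style("whitegrid")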
ax = sns.scatterplot(x='x', y="exponential", data=cont_data)

In [None]:
# hue - Grouping variable that will produce points with different colors
# using a built-in dataset from sns
tips = sns.load_dataset("tips")
# Might give a ValueError, see:
# https://stackoverflow.com/questions/63443583/seaborn-valueerror-zero-size-array-to-reduction-operation-minimum-which-has-no
ax = sns.scatterplot(x="total_bill", y="tip", hue="day", style="time", data=tips)

In [None]:
# workaround if the cell above raises a ValueError: pass hue and style as plain lists
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip",  hue=tips.day.tolist(), style=tips.time.tolist(), data=tips)


More about scatterplot https://seaborn.pydata.org/generated/seaborn.scatterplot.html

Seaborn Example gallery with code https://seaborn.pydata.org/examples/index.html

## Adding a point, a line and a legend

In [None]:
plt.rcParams['figure.figsize'] = [6, 6]
plt.plot(cont_data['x'],cont_data['exponential'], 'ro', label='data', alpha=0.2) 

# add a point with x and y coordinates
plt.plot(4,10, color='black', marker='+', markersize=15, label='outlier')
plt.plot(*[4,11], color='black', marker='+', markersize=15, label='unknown')

# add a line
# [[x1,x2], [y1,y2]]
plt.plot([8,0], [6,0] ,'g--')
plt.plot((8,0), (6,0) ,'g--') # Both work

# a line given as two points [x,y], [x,y]: here (1, 2) and (10, 11)
# zip([1, 2], [10,11]) -> [(1, 10), (2, 11)], i.e. the x values and the y values
# *[(1, 10), (2, 11)] -> (1, 10) (2, 11) ; unpacked into two separate arguments
plt.plot(*list(zip([1, 2], [10,11])) ,'k-')

# add a legend
plt.legend(markerscale=1, frameon=True, loc='upper left')
plt.show()
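The `zip` and `*` combination used above converts between "a list of points" and "a list of x values plus a list of y values", which is the form `plt.plot` expects. A small sketch without any plotting:

```python
# Two points given as (x, y) pairs
points = [(1, 2), (10, 11)]

# zip(*points) regroups the pairs: all x values together, all y values together
xs, ys = zip(*points)

print(xs)  # (1, 10)
print(ys)  # (2, 11)
```

With these values, `plt.plot(xs, ys, 'k-')` would draw the line from (1, 2) to (10, 11).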

## Add annotations to a plot

In [None]:
plotdata = irisdf.head(10)
plt.plot(plotdata['petal_length'], plotdata['petal_width'], 'ro', label='species')
for i, label in enumerate(plotdata['species']):
    # use positional indexing: after dropna() the index labels may have gaps
    x = plotdata['petal_length'].iloc[i] + 0.01 # move the label on x axis
    y = plotdata['petal_width'].iloc[i]
    plt.annotate(label,(x,y))
plt.show()

# Exercises


## EX1

Create a pandas dataframe and insert into it values sampled from normal distribution (or generate them in some other way randomly). Make it have 3 columns: x, y, and z, each sampled separately. Choose different means and standard deviations for each of the columns. 

Hint: ```np.random.normal(loc, scale, size = 200)``` (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.normal.html)

* loc: mean ($\mu$)
* scale: standard deviation ($\sigma$)



## EX2

Create a figure with 6 histograms in it, arranged in 2 rows and 3 columns. All the histograms in the upper row should be relative frequency histograms and the ones in the lower row should be regular frequency histograms.

Make two plots for each column of the dataframe created in the previous exercise: one relative frequency histogram and one regular frequency histogram.