# Data Visualization using Pandas and Seaborn

## Table of Contents

1. [**Introduction**](#Intro)
2. [**Line Plots**](#lnplt)
3. [**Bar Plots**](#brplt)
4. [**Histogram Plots**](#hstplt)
5. [**Scatter Plots**](#sctrplt)
6. [**Box Plots**](#bxplt)



## 1. Introduction <a name="Intro"></a>

__matplotlib__ can be a fairly low-level tool. You assemble a plot from its base components: the data display, legend, title, tick labels, and other annotations. For examples of the many plot types _matplotlib_ supports, see the gallery: https://matplotlib.org/stable/gallery/index.html.

However, when dealing with data sets, we may have multiple columns of data, along with row indices and column labels. __pandas__ itself has built-in methods that simplify creating visualizations from DataFrames. Another library is __seaborn__, a statistical graphics library. Seaborn simplifies creating many common visualization types.

## 2. Line Plots <a name="lnplt"></a>

DataFrames in _pandas_ each have a plot attribute for making some basic plot types. By default, _plot()_ makes line plots.

In [None]:
import pandas as pd                      # importing pandas library 
import numpy as np                       # importing numpy library
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(10, 4),           
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))               
df

In [None]:
fig=df.plot( figsize=(10,6), lw=3, style='--' )
plt.grid()
plt.title('plot')
plt.xlabel('x values')
plt.ylabel('y values')
plt.legend(['A-data','B-data','C-data','D-data'],loc='upper right')

`DataFrame.plot` has a number of options that control how the columns are handled, for example whether to draw them all on one plot or to create separate subplots.

![alt text](https://docs.google.com/uc?export=download&id=1QWgO3FXC-Rba5zk0HZLjnG7O_Yyi-vuC)



In [None]:
fig=df.plot(figsize=(7,7), subplots=True, grid=True)               # plotting each column on a separate subplot

In [None]:
fig=df.plot(figsize=(10,5), title='data set plots', lw=4)
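
A detail worth noting about the cells above: `DataFrame.plot` returns a Matplotlib `Axes` object (or an array of them when `subplots=True`), not a `Figure`, even though the cells store the result in a variable called `fig`. The following is a minimal sketch of how the returned object can be used to label and save a plot. The data are random illustrative values, and the file name `line_plot.png` is a hypothetical choice.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data only (not one of the data sets used in this notebook)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((10, 4)), columns=["A", "B", "C", "D"])

# A single call draws all four columns on one Axes
ax = demo.plot(figsize=(10, 6), lw=3, style="--", grid=True, title="plot")
ax.set_xlabel("x values")
ax.set_ylabel("y values")

# The Figure that owns the Axes is what gets written to disk
out_path = os.path.join(tempfile.gettempdir(), "line_plot.png")  # hypothetical file name
ax.get_figure().savefig(out_path, dpi=150, bbox_inches="tight")

# With subplots=True the return value is an array holding one Axes per column
axes = demo.plot(subplots=True, figsize=(7, 7))
print(len(axes), os.path.exists(out_path))
```

Working through the returned `Axes` (`ax.set_xlabel`, `ax.set_ylabel`) has the same effect as the `plt.xlabel` and `plt.ylabel` calls used earlier, but it stays unambiguous when several figures are open at once.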

<font color='red'>__Question (1)__</font>: Create the following data frame
```python
dfq = pd.DataFrame(np.random.randn(30, 4),           
                  columns=['Col1', 'Col2', 'Col3', 'Col4'],
                  index=np.arange(0, 30, 1))
```
and then plot all the columns in one plot. Make sure to use the <font color='blue'>figsize</font>, <font color='blue'>title</font>, and <font color='blue'>lw</font> options for your plot. Also, put grid lines on your plot. Once you have done that, save and download your plot.

In [None]:
# In-Class Assignment



## 3. Bar Plots <a name="brplt"></a>

The <font color='blue'>plot.bar()</font> and <font color='blue'>plot.barh()</font> methods make vertical and horizontal bar plots, respectively. In this case, the DataFrame index is used as the x (bar) or y (barh) ticks.

For DataFrames, bar plots group the values in each row together, drawing one bar per column side by side.

In [None]:
df = pd.DataFrame(np.random.rand(6, 4),
                  index=['one', 'two', 'three', 'four', 'five', 'six'],
                  columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df

In [None]:
fig=df.plot.bar( figsize=(10,5) )

We can create stacked bar plots from a DataFrame by passing _stacked=_<font color='blue'>True</font>, which stacks the values in each row into a single bar.

In [None]:
plt.style.use('dark_background')                  # set the background as dark
fig=df.plot.barh(figsize=(10,5),stacked=True)
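
To make the grouping and stacking behaviour concrete, here is a small sketch on made-up data (a 3x2 DataFrame called `demo`, not the one used above). It inspects the bar containers that Matplotlib records on the `Axes`, assuming pandas draws one container per column, which is how current pandas versions behave.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import numpy as np
import pandas as pd

# Illustrative data only: three rows, two columns, all values non-negative
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((3, 2)), index=["one", "two", "three"], columns=["A", "B"])

# Grouped bars: one container per column, one bar per row inside each container
ax_grouped = demo.plot.bar()
n_containers = len(ax_grouped.containers)
bars_per_container = len(ax_grouped.containers[0])

# Stacked bars: column B sits on top of column A,
# so the top edge of each B bar equals the sum of that row
ax_stacked = demo.plot.bar(stacked=True)
tops = [bar.get_y() + bar.get_height() for bar in ax_stacked.containers[-1]]

print(n_containers, bars_per_container)      # 2 3
print(np.allclose(tops, demo.sum(axis=1)))   # True
```

For a horizontal stacked plot (`plot.barh`), the same check would use `get_x()` and `get_width()` instead.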

An alternative way of plotting these graphs is to use the _kind_ option of the pandas <font color='blue'>plot</font> function.

In [None]:
plt.style.use('seaborn-v0_8-darkgrid')                       # set the background (on Matplotlib older than 3.6 the name is 'seaborn-darkgrid')
fig=df.plot(figsize=(10,5),kind='bar');
fig=df.plot(figsize=(10,5),kind='barh',stacked=True);

We can also use the __seaborn__ library to make the plots. Let's plot column B against column A.

In [None]:
import seaborn as sns

fig=sns.barplot(data=df, x='A',y='B', orient='h')    # for a uniform color, add an option such as color='b'

What if we just want to plot the values in one column?

In [None]:
fig=sns.barplot(data=df, x=df.index, y='C', orient='v', color='k')

<font color='blue'>seaborn.barplot</font> has a _hue_ option that enables us to split by an additional categorical value. Each distinct value of the _hue_ column is drawn in its own color and listed in the legend.

To show the application, let's add a new column to our DataFrame with categorical values of either Yes or No.

In [None]:
df['cat']=['Yes','No','Yes','Yes','Yes','No']
df

In [None]:
plt.style.use('default')
fig=sns.barplot(data=df, x=df.index, y='C', hue='cat', orient='v')

<font color='orange'>Note</font>: To change the size of this figure, you can simply add the following line to your code before plotting the figure 
```python
sns.set(rc={'figure.figsize':(width,height)})
```
where _width_ and _height_ are numbers in inches. For more information about plotting bar charts with the seaborn library, see https://seaborn.pydata.org/generated/seaborn.barplot.html
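
An alternative that sets the size for one figure only, rather than changing seaborn's global settings, is to create the Matplotlib figure first and hand its `Axes` to seaborn through the `ax` argument. This is a minimal sketch on made-up data; the DataFrame `demo` and its column names are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data only
demo = pd.DataFrame({"label": ["one", "two", "three"], "value": [0.2, 0.7, 0.5]})

# figsize is (width, height) in inches and applies to this figure only
fig, ax = plt.subplots(figsize=(8, 3))
sns.barplot(data=demo, x="label", y="value", color="k", ax=ax)

print(fig.get_size_inches())
```

Later plots keep the default size, which can be convenient when only one chart in a notebook needs different dimensions.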

<font color='red'>__Question (2)__</font>: Write a piece of code to plot a vertical bar chart using the _titanic_ data set from seaborn.
```python
df = sns.load_dataset('titanic')
```
Here is a quick look at the description for each column in this data set:

![alt text](https://docs.google.com/uc?export=download&id=10d4mqmS70Qfb9sw6MTwQgA7Q8eQb9US6)

For this assignment, provide a bar chart that shows the sex of the passengers against whether they survived. Also, add the passenger class as the third piece of information to your plot (using the _hue_ option).

In [None]:
# In-Class Assignment



## 4. Histogram Plots <a name="hstplt"></a>

A histogram is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted. To plot a histogram, we can use either <font color='blue'>plot.hist()</font> or the option _kind='hist'_ for pandas' plot.
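
The binning step can be illustrated without any plotting. The sketch below uses a small made-up array and NumPy's `np.histogram`, which returns the count in each bin together with the bin edges; pandas and Matplotlib histograms are built on the same kind of computation.

```python
import numpy as np

# Ten made-up values between 1 and 5
values = np.array([1.0, 1.5, 2.0, 2.2, 2.9, 3.1, 3.3, 3.4, 4.8, 5.0])

# Split the range [min, max] into 4 evenly spaced bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print(edges)   # [1. 2. 3. 4. 5.]
print(counts)  # [2 3 3 2]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which also includes the maximum value. The heights of the bars in a histogram plot are exactly these counts.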

In this part, I am going to import the Airfoil Noise dataset and use that. If you remember from before, the columns in this dataset are:

1. Frequency, in hertz. 
2. Angle of attack, in degrees. 
3. Chord length, in meters. 
4. Free-stream velocity, in meters per second. 
5. Suction side displacement thickness, in meters. 
6. Scaled sound pressure level, in decibels. 

In [None]:
url = 'https://raw.githubusercontent.com/MasoudMiM/ME_364/main/Airfoil_noise/Airfoil_Noise.csv'   # Link to the Airfoil Noise data set
df1 = pd.read_csv(url ,header=None, names=['Frequency (Hz)','Attack_Angle (deg)','Chord (m)','FS_Velocity (m/s)','SSD_Thickness (m)','Sound_Pressure_Level (dB)'])

# The dataset is now stored in a pandas DataFrame
df1.head()

We are going to plot a histogram of the sound level column.

In [None]:
fig=plt.figure(figsize=(15,5))               # defining the matrix of plots

fig.add_subplot(1,2,1)
# plotting using plot.hist()
df1['Sound_Pressure_Level (dB)'].plot.hist(bins=20,color='blue')
plt.xlabel('Sound (dB)')
plt.ylabel('Frequency of occurrence')
plt.title('Plot using plot.hist()')

fig.add_subplot(1,2,2)
# plot using pandas plot option 'hist'
df1['Sound_Pressure_Level (dB)'].plot(kind='hist',bins=20,color='blue')
plt.xlabel('Sound (dB)')
plt.ylabel('Frequency of occurrence')
plt.title("Plot using plot(kind='hist')")

A related plot type is a __density__ plot, which is formed by computing an estimate of a continuous probability distribution that might have generated the observed data. 
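
One common way to compute such an estimate is a Gaussian kernel density estimate (KDE): place a small bell curve on every data point and average them. The sketch below implements this by hand with NumPy on made-up values. The function name `gaussian_kde_1d` and the hand-picked bandwidth of 2.5 are illustrative assumptions; pandas itself relies on SciPy and chooses the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Made-up values (not the airfoil data)
rng = np.random.default_rng(2)
samples = rng.normal(loc=125.0, scale=7.0, size=200)

# Evaluate the estimate on a grid that extends well past the data
grid = np.linspace(samples.min() - 30.0, samples.max() + 30.0, 500)
density = gaussian_kde_1d(samples, grid, bandwidth=2.5)

# A probability density integrates to 1; check with a simple Riemann sum
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

The printed area should be very close to 1. Note that the vertical axis of a density plot shows probability density rather than probability, so individual values can exceed 1 when the data are tightly concentrated.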

In [None]:
fig=plt.figure(figsize=(15,5))               # defining the matrix of plots

fig.add_subplot(1,2,1)
# plotting using plot.density()
df1['Sound_Pressure_Level (dB)'].plot.density(color='blue')
plt.xlabel('Sound (dB)')
plt.ylabel('Probability density')
plt.title('Plot using plot.density()')

fig.add_subplot(1,2,2)
# plot using pandas plot option 'density'
df1['Sound_Pressure_Level (dB)'].plot(kind='density',color='blue')
plt.xlabel('Sound (dB)')
plt.ylabel('Probability density')
plt.title("Plot using plot(kind='density')");

__Seaborn__ makes histograms and density plots even easier through its <font color='blue'>distplot</font> function, which can plot both a histogram and a continuous density estimate simultaneously. 

__Note__: This function is deprecated in newer versions of seaborn, so it may be removed in a future version. Instead, you can use <font color='blue'>histplot</font> with the option `kde=True`, or <font color='blue'>displot</font> with the options `kde=True` or `kind="kde"`.

In [None]:
import seaborn as sns # make sure seaborn is imported

plt.figure(figsize=(7,5))
sns.distplot(df1['Sound_Pressure_Level (dB)'], bins=20, color='blue')
plt.xlabel('Sound (dB)')
plt.ylabel('Probability density')

There is a lot more you can do with `displot` in terms of data visualization. Here is the link to the command documentation page with various examples: https://seaborn.pydata.org/generated/seaborn.displot.html 

<font color='red'>__Question (3)__</font>: Use the titanic data set from the previous question and plot a histogram+density plot for passenger fare to show the distribution of the fares.

In [None]:
# In-Class Assignment


## 5. Scatter Plots <a name="sctrplt"></a>

Scatter plots can be a useful way of examining the relationship between two one-dimensional data series. Let's use a new data set for this section, and then use seaborn’s <font color='blue'>regplot</font> method, which makes a scatter plot and fits a linear regression line.

This dataset comes from research by the TR/Selcuk University Mechanical Engineering department. 

The aim of the study was to determine how much the adjustment parameters in 3D printers affect the print quality, accuracy, and strength. There are nine setting parameters and three measured output parameters.

Setting Parameters:

- Layer Height (mm)
- Wall Thickness (mm)
- Infill Density (%)
- Infill Pattern
- Nozzle Temperature (°C)
- Bed Temperature (°C)
- Print Speed (mm/s)
- Material
- Fan Speed (%)

Output Parameters (measured):

- Roughness (µm)
- Tension (ultimate) Strength (MPa)
- Elongation (%)

This work is based on the Ultimaker S5 3-D printer settings and filaments.

In [None]:
url = 'https://raw.githubusercontent.com/MasoudMiM/ME_364/main/3D_Printer_Data/3DPrinterDataset.csv'   # Link to the 3D printer data set
df2 = pd.read_csv(url)

# The dataset is now stored in a pandas DataFrame
df2.head()


In [None]:
import seaborn as sns

sns.regplot(data=df2, x='layer_height', y='roughness', color='b') # use option fit_reg=False if you don't want to see the line fit
plt.xlabel('Layer Height (mm)')
plt.ylabel('Roughness (micro-m)')
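
The straight line drawn by `regplot` is an ordinary least-squares fit. If the numbers behind the line are needed, they can be computed separately, for example with NumPy as sketched below. The data here are made up (a known slope of 1000 and intercept of 20 plus random noise), not the 3D printer measurements, and the sketch does not reproduce the bootstrapped confidence band that `regplot` also draws by default.

```python
import numpy as np

# Made-up data: five layer heights, ten noisy roughness readings for each
rng = np.random.default_rng(4)
x = np.repeat([0.02, 0.06, 0.10, 0.15, 0.20], 10)
y = 1000.0 * x + 20.0 + rng.normal(0.0, 5.0, size=x.size)

# Degree-1 polynomial fit = ordinary least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient as a measure of how linear the relationship is
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, and `r` should be close to 1 because the noise is small compared with the trend.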

Seaborn also has a <font color='blue'>scatterplot</font> function that can be used directly:

In [None]:
sns.scatterplot(data=df2, x='layer_height',y='roughness', color='g')
plt.xlabel('Layer Height (mm)')
plt.ylabel('Roughness (micro-m)')

If you don't want to fit a straight line and only need a scatter plot of the data points, you can use the pandas plot function directly.

In [None]:
plt.style.use('default')
df2.plot.scatter(x='layer_height', y='roughness', figsize=(5,3), s=150, c='k')  # s here is the marker size
plt.xlabel('Layer Height (mm)')
plt.ylabel('Roughness')
# OR
#df2.plot(kind='scatter',x='layer_height',y='roughness',s=70,c='g',figsize=(5,3))  # s here is the marker size

In exploratory data analysis it’s helpful to be able to look at all the scatter plots among a group of variables; this is known as a pair plot or scatter plot matrix. Making such a plot from scratch is a bit of work, so seaborn has a convenient pairplot function, which supports placing histograms or density estimates of each variable along the diagonal.

In [None]:
sns.pairplot(df2, plot_kws={'alpha': 0.3})

If you don't want to see the scatter plot for all the available variables, you can choose specific columns from your data. Pay attention to how we set the dimensions of this plot, since it is different from the way we usually do it for our plots.

In [None]:
sns.pairplot(df2,
             x_vars=["layer_height", "wall_thickness", "bed_temperature"],
             y_vars=["roughness", "tension_strenght", "elongation"],
             height=3, # height of each subplot, in inches
             aspect=2) # width of each subplot is aspect * height

A good reference to seaborn _pairplot_ can be found here: https://seaborn.pydata.org/generated/seaborn.pairplot.html
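
Because `height` and `aspect` apply to each subplot rather than to the whole figure, the overall figure size follows from the number of rows and columns in the grid. The sketch below checks this on made-up data; the column names `in_1`, `in_2`, `out_1`, and `out_2` are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data with two "input" and two "output" columns
rng = np.random.default_rng(5)
demo = pd.DataFrame(rng.standard_normal((30, 4)), columns=["in_1", "in_2", "out_1", "out_2"])

grid = sns.pairplot(demo,
                    x_vars=["in_1", "in_2"],
                    y_vars=["out_1", "out_2"],
                    height=2,     # each subplot is 2 inches high
                    aspect=1.5)   # each subplot is 1.5 * 2 = 3 inches wide

# 2 rows x 2 columns of subplots
print(grid.axes.shape)

# Overall size: 2 columns * 3 in wide, 2 rows * 2 in high
fig_size = grid.axes[0, 0].figure.get_size_inches()
print(fig_size)
```

With these settings the figure should come out 6 inches wide and 4 inches high, which is why no separate `figsize` is passed to `pairplot`.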

<font color='red'>__Question (4)__</font>: Use the titanic data set from before and draw a scatter plot showing the passenger _fare_ against their _age_. Manually override the labels of the axes for the plot.

In [None]:
# In-Class Assignment



## 6. Box Plots <a name="bxplt"></a>

Box plots are used to visualize distributions. They show the median, quartiles, and outliers and are very useful when you want to compare data between two groups. We can use the <font color='blue'>boxplot</font> function from the _seaborn_ library to draw box plots.

![alt text](https://docs.google.com/uc?export=download&id=1-yFABVPm_jtrYidyU4X4yhvOei28dKNb)
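
The quantities a box plot displays can be computed directly. The sketch below does so with NumPy on a small made-up array, assuming the common convention that points more than 1.5 times the interquartile range (IQR) beyond the quartiles are flagged as outliers, which is the default in Matplotlib and seaborn.

```python
import numpy as np

# Nine made-up values with one obviously extreme point
values = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# The box spans the first to third quartile; the line inside it is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)   # 4.0 6.0 8.0
print(outliers)         # [30.]
```

The whiskers then extend to the most extreme data points that still lie inside the fences, here 2.0 and 9.0.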


Let's import the data available for energy generated using various renewable resources in the UK for different years in different regions. The data is stored at: https://raw.githubusercontent.com/MasoudMiM/ME_364/main/UK_Renewable_Energy/UKEnergy.csv

In [None]:
url = 'https://raw.githubusercontent.com/MasoudMiM/ME_364/main/UK_Renewable_Energy/UKEnergy.csv'
dfUK = pd.read_csv(url)

# The dataset is now stored in a pandas DataFrame
dfUK.head(10)

In [None]:
plt.figure(figsize=(7,4))
sns.boxplot(x='Year', y='Wind2', data=dfUK)
plt.ylabel('Generated Wind Energy [GWh]')

You can also make box plots using __pandas__, which we will not cover here. If you are interested in how it can be done, see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html

You can accompany your box plot with a swarm plot to also show all the individual data points on top of the boxes.

In [None]:
plt.figure( figsize=(15,8) )
sns.boxplot(x='Year', y='Wind2', data=dfUK, linewidth=2.5)

sns.swarmplot(x='Year', y='Wind2', data=dfUK, color=".25")
plt.grid()
plt.ylabel('Generated Wind Energy [GWh]')

You can find more details about <font color='blue'>boxplot</font> in seaborn and its different options here: https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot

<font color='red'>__Question (5)__</font>: For the airfoil noise data set, provide a box plot+swarmplot representing attack angle versus sound pressure level.

In [None]:
# In-Class Assignment

