# Lecture - Data Visualization 
Data visualization is an essential step in data analysis. It helps in exploring the data as well as in presenting and interpreting the results.
For example, we can use plots to check the distributions of variables or to spot outliers.
Two of the most widely used plotting libraries in Python are Matplotlib and Seaborn. I personally prefer Seaborn because, in my opinion, its default plots look nicer.

In this lecture we will explore different Python modules for data visualization.

## Matplotlib 
You've already used it in previous lectures. Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
Check out the website https://matplotlib.org.

To use the plotting functionality from Matplotlib:
- `import matplotlib.pyplot as plt`

To plot Y as a function of X:
- `plt.plot(X,Y)`
- `plt.show()`
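
As a minimal sketch of these two calls, the following plots a sine curve. The data (`X`, `Y`) is made up for illustration and is not part of the lecture material.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 50 evenly spaced points and their sine
X = np.linspace(0, 2 * np.pi, 50)
Y = np.sin(X)

lines = plt.plot(X, Y)   # returns a list containing one Line2D object
plt.show()               # displays the figure
```

In a notebook that uses `%matplotlib inline`, the figure is also displayed at the end of the cell even without `plt.show()`.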

### How to customize a graph?
- Change figure size : 
 - `plt.figure(figsize=(20,10))`

- Change graph style:
 - `import matplotlib.style as style` 
 - `style.available` : find the available styles
 - `style.use('fivethirtyeight')` : Use a specific style

- You can also customize the graph inside the plot function:
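 - `plt.plot(X, Y, color="red", label='redLine', linewidth=2.0, linestyle="-")`
 - Different line styles are possible : `'-'` (solid), `'--'` (dashed), `'-.'` (dash-dot), `':'` (dotted). For a step-shaped line, use the separate `drawstyle='steps'` argument.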

- To set the legend:
 - `plt.legend()`


- Set x and y limits
 - `plt.xlim(left_limit, right_limit)`
 - `plt.ylim(bottom_limit, top_limit)`

- Set x and y ticks
 - Set y ticks : `plt.yticks(ticks, labels, **kwargs)`
   - `ticks` : array_like. A list of positions at which ticks should be placed. You can pass an empty list to disable yticks.
   - `labels` : array_like, optional. A list of explicit labels to place at the given tick positions.
   - `**kwargs` : Text properties can be used to control the appearance of the labels.
   - Check this for some demo examples. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.yticks.html

 - Set x ticks (the tick marks along the x-axis), with the same parameters:
   - `plt.xticks(ticks, labels, **kwargs)`

- Set labels and titles
 - `plt.title('My curve')`
 - `plt.xlabel('x-axis')`
 - `plt.ylabel('y-axis')`
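
The customization calls above can be combined in a single cell. The sketch below is only an illustration: the two curves, the `'ggplot'` style, and the tick labels are arbitrary choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style

style.use('ggplot')                       # any name from style.available

# Illustrative data (not from the lecture): two simple curves
X = np.linspace(0, 10, 50)
Y1 = X ** 2
Y2 = 10 * X

plt.figure(figsize=(10, 5))               # width and height in inches
plt.plot(X, Y1, color='red', label='quadratic', linewidth=2.0, linestyle='-')
plt.plot(X, Y2, color='blue', label='linear', linewidth=1.0, linestyle='--')
plt.legend()                              # uses the label of each curve

plt.xlim(0, 10)
plt.ylim(0, 100)
plt.xticks([0, 5, 10], ['low', 'mid', 'high'])   # positions, then labels
plt.yticks([0, 50, 100])
plt.title('My curve')
plt.xlabel('x-axis')
plt.ylabel('y-axis')

ax = plt.gca()                            # handle on the current axes, for inspection
plt.show()
```

Note that `plt.figure(...)` has to be called before the `plt.plot(...)` calls, otherwise the curves are drawn on a different figure.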
 
### Creating subfigures and customizing them

The `figsize` and `dpi` parameters described below set properties of the whole figure, and therefore of all the charts in it.

To create subfigures in the same figure:

`fig, (ax0, ax1, ...) = plt.subplots(nrows, ncols, figsize=(width, height), dpi=resolution)`
- `nrows`, `ncols` : number of rows and columns of subfigures
- `figsize` : figure size in inches (width, height)
- `dpi` : resolution in dots per inch

To plot on the first subfigure for example:
- `df.plot(x,y,ax=ax0)`

To set the title, x and y labels of the first subfigure:

- `ax0.set(title=' ', xlabel='x axis', ylabel='y axis')`

To set the title of the whole figure:

- `fig.suptitle(' ', fontsize=, fontweight='bold')`
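
Putting these steps together, a minimal sketch could look as follows. The DataFrame, its column names, and the titles are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: two columns indexed by x
x = np.linspace(0, 10, 50)
df = pd.DataFrame({'linear': 2 * x, 'quadratic': x ** 2}, index=x)

# One row and two columns of subfigures
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(12, 4), dpi=80)

df['linear'].plot(ax=ax0)        # draw on the first subfigure
df['quadratic'].plot(ax=ax1)     # draw on the second subfigure

ax0.set(title='Linear', xlabel='x axis', ylabel='y axis')
ax1.set(title='Quadratic', xlabel='x axis', ylabel='y axis')
fig.suptitle('Two subfigures', fontsize=14, fontweight='bold')
plt.show()
```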



### Different examples of graphs
Check out this link for a summary of plot functionalities in matplotlib. http://matplotlib.org/api/pyplot_summary.html

#### Scatter plot
A scatter plot, also known as a scatter graph or a scatter chart, is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables - one plotted along the x-axis and the other plotted along the y-axis.
![seaborn_scatterplot](img/seaborn-scatterplot-2.png)

- `plt.scatter(X, Y, alpha=.5, s=20, marker='o')`
  - `alpha` sets the transparency (0 is fully transparent, 1 is fully opaque)
  - `s` is the size of the points
  - `marker` sets the marker shape; other markers include `'+'`, `','`, `'.'`, `'1'`, `'2'`, `'3'`, `'4'`
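
A small sketch of such a call, using randomly generated data (the seed and the relationship between `X` and `Y` are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # fixed seed so the figure is reproducible
X = rng.normal(0, 1, 200)
Y = 2 * X + rng.normal(0, 1, 200)         # Y depends on X, plus some noise

points = plt.scatter(X, Y, alpha=0.5, s=20, marker='o')
plt.show()
```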

#### Bar chart
A bar plot is a plot that presents categorical data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.

- `plt.bar(X, Y, **kwargs)`
  - `**kwargs` : additional keyword arguments that control the appearance of the bars
  - Example keyword arguments: `facecolor='skyblue'` changes the fill color of the bars, `edgecolor='blue'` changes the color of their outline
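
For instance, with made-up categories and values (both are assumptions for this sketch):

```python
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']         # illustrative categories
values = [3, 7, 5, 2]                     # one measured value per category

bars = plt.bar(categories, values, facecolor='skyblue', edgecolor='blue')
plt.show()
```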
  
#### Histogram

- `plt.hist(X, bins, histtype='step', orientation='horizontal')`
  - `bins` : number of bins, or a sequence of bin edges
  - `orientation` : `'horizontal'` or `'vertical'`
  - `histtype` : type of histogram, one of `'bar'`, `'barstacked'`, `'step'`, `'stepfilled'`
  - You can find all the parameters here : https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html
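
A short sketch on randomly generated data (the sample and the number of bins are arbitrary). `plt.hist` returns the counts and the bin edges, which can be handy for checking the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # fixed seed for reproducibility
X = rng.normal(0, 1, 1000)                # illustrative sample

counts, bin_edges, patches = plt.hist(X, bins=20, histtype='step',
                                      orientation='horizontal')
plt.show()
```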

#### Area line
Plots the area under the curves. With a pandas DataFrame `data`: `data.plot.area()`
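
As an illustration, assuming a small DataFrame with two numeric columns (the names and values are invented). By default the areas of the columns are stacked on top of each other:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: two series observed over four years
data = pd.DataFrame({'a': [1, 3, 2, 4], 'b': [2, 2, 3, 1]},
                    index=[2015, 2016, 2017, 2018])

ax = data.plot.area()                     # stacked areas, one per column
plt.show()
```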

#### Pie chart

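`plt.pie(value, labels=labels, colors=colors, shadow=True, explode=explode)`

- value : array of numerical values
- labels : array of strings
- colors : array of colors
- shadow : True or False for a shadow effect
- explode : array giving, for each wedge, the fraction of the radius by which to offset it

Pass everything except `value` as keyword arguments, because the positional order of `plt.pie` differs from the order shown here.

Example: `explode=(0, 0.1, 0, 0)` pulls the second wedge out of the pie.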

![PieChart](img/PieChart.png)
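
A minimal sketch with invented values, labels, and colors:

```python
import matplotlib.pyplot as plt

value = [40, 30, 20, 10]                  # illustrative values
labels = ['A', 'B', 'C', 'D']
colors = ['tomato', 'gold', 'skyblue', 'lightgreen']

# Without autopct, plt.pie returns the wedges and the label texts
wedges, texts = plt.pie(value, labels=labels, colors=colors,
                        shadow=True, explode=(0, 0.1, 0, 0))
plt.axis('equal')                         # equal aspect ratio so the pie is a circle
plt.show()
```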




There are other types of graphs that are useful and worth checking out. Follow the links in this lecture, and do not hesitate to search online as well.


## Seaborn

### Some out-of-the-box plots you can find in the Seaborn library
First, import the Seaborn library.
`import seaborn as sns`

#### Scatter Plot
To draw a scatter plot together with the fitted linear relationship between the two variables (`fit_reg=True`):
- `sns.lmplot(x='column_name', y='column_name', data=df, fit_reg=True, scatter_kws={'keyword1': value})`
 - Example: `scatter_kws={'alpha': 0.5}` to change the transparency

![lmplot](img/lmplot.png)

To color the points according to the categories of a column, use `hue='column_name'`:
- `sns.lmplot(x='column_name', y='column_name', data=df, fit_reg=False, hue='column_name', scatter_kws={'keyword1': value})`

![lmplot_nofit](img/lmplot_notfit.png)
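
Both variants can be tried on a small synthetic dataset. The DataFrame below and its column names (`x`, `y`, `group`) are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # fixed seed for reproducibility
x = rng.uniform(0, 10, 60)
df = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(0, 1, 60),    # roughly linear relationship
    'group': np.repeat(['a', 'b', 'c'], 20),
})

# Scatter plot with the fitted regression line
g1 = sns.lmplot(x='x', y='y', data=df, fit_reg=True, scatter_kws={'alpha': 0.5})

# Scatter plot only, colored by the categories of 'group'
g2 = sns.lmplot(x='x', y='y', data=df, fit_reg=False, hue='group',
                scatter_kws={'alpha': 0.5})
plt.show()
```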

#### Box plot
- `bp=sns.boxplot(data=videogames, x='Genre', y='Year') `
![boxplot](img/boxplot.png)
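
Since the `videogames` data is only loaded later in the exercises, here is a sketch on a tiny invented DataFrame. The value 30 is deliberately far from the rest so that it shows up as an outlier:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data: two categories, the second one with an extreme value
df = pd.DataFrame({
    'category': ['a'] * 5 + ['b'] * 5,
    'value': [1, 2, 3, 4, 5, 2, 4, 6, 8, 30],
})

bp = sns.boxplot(data=df, x='category', y='value')   # returns a Matplotlib axes
plt.show()
```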

#### Violin plot
Combines the box plot with the distribution of the data.
- `vp=sns.violinplot(data=videogames, x='Genre', y='Year', inner=None)`
 - `inner` : what to draw inside the violin: `None`, `'box'`, `'quartile'`, `'stick'` or `'point'`
![violinplot](img/violinplot.png)

Source: https://www.youtube.com/watch?v=cLHwwRgny5g

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()

In [None]:
ax = sns.violinplot(x=tips["total_bill"], inner =None)

In [None]:
ax = sns.violinplot(x=tips["total_bill"], inner ='box')

In [None]:
ax = sns.violinplot(x=tips["total_bill"], inner ='quartile')

See each observation with a stick inside the violin.

In [None]:
ax = sns.violinplot(x=tips["total_bill"], inner ='stick')

In [None]:
ax = sns.violinplot(x=tips["total_bill"], inner ='point')

#### Swarm plot
A swarm plot draws every observation as a point, which gives a good picture of how the values are distributed within each category.
- `vp=sns.swarmplot(data=df, x='column_name_with_categories', y='column_name_with_values', hue='column_name_with_categories', alpha=value)`


![swarmplot](img/swarmplot.png)
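
A sketch on synthetic data; the column names `category`, `kind`, and `value` are invented for this example:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # fixed seed for reproducibility
df = pd.DataFrame({
    'category': np.repeat(['a', 'b'], 30),
    'kind': np.tile(['x', 'y', 'z'], 20),
    'value': rng.normal(0, 1, 60),
})

# One point per observation, grouped by 'category' and colored by 'kind'
ax = sns.swarmplot(data=df, x='category', y='value', hue='kind', alpha=0.7)
plt.show()
```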


#### Factor plot
A factor plot is informative when we have multiple groups to compare. It has many parameters; here we focus on the basic ones. Note that `sns.factorplot` was renamed to `sns.catplot` in Seaborn 0.9, and the old name has been removed from recent releases.
- `g = sns.catplot(x='column_name_with_values', y='column_name_with_values', data=df, hue='column_name_with_categories', col='column_name_with_categories', kind='swarm')`
  - `hue` : color by a column with categories
  - `col` : draw a separate subfigure for each category of a column
  - `kind` : the kind of plot to draw, for example `'swarm'` for a swarm plot
                   
In the following example, the data has a feature called assortment with three different assortment types, a feature called promo that indicates whether a promotion happened or not, and a feature called sales that gives the total sales of the stores.

x is the promo column, y is the sales column, and hue is the assortment column. The plot therefore shows the sales for each of the 3 assortment types, with and without a promotion.
![factorplot_basic](img/FactorPlot_basic.png)
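
The figure above comes from a dataset that is not included in this lecture. As a rough sketch of the same idea, the following uses a small invented DataFrame with the same column names and `sns.catplot`, the name under which the factor plot is available in Seaborn 0.9 and later. The choice `kind='bar'` is an assumption for this sketch.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented data: 2 promo states x 3 assortment types, 2 stores each
df = pd.DataFrame({
    'promo': [0, 0, 0, 1, 1, 1] * 2,
    'assortment': ['a', 'b', 'c'] * 4,
    'sales': [5, 6, 7, 8, 9, 11, 4, 6, 8, 9, 10, 12],
})

# With kind='bar', each bar shows the mean sales of its group by default
g = sns.catplot(x='promo', y='sales', hue='assortment', data=df, kind='bar')
plt.show()
```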

Another example, setting the other parameters:


 ![factorplot](img/factorplot.png)
 
#### Count plot
- `c=sns.countplot(x='column_name_with_values', data=df)`
![countplot](img/countplot.png)
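
For example, counting how often each year appears in an invented column:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data: 2010 appears twice, 2011 once, 2012 three times
df = pd.DataFrame({'year': [2010, 2010, 2011, 2012, 2012, 2012]})

c = sns.countplot(x='year', data=df)      # one bar per distinct year
plt.show()
```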

#### Distribution plot
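- `sns.distplot(df['column_name_with_values'])`
  - `distplot` expects the data itself (for example a DataFrame column), not just the column name. In Seaborn 0.11 and later, `distplot` is deprecated in favor of `sns.histplot` and `sns.displot`.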
![distplot](img/distplot.png)



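### Customize the graph
Seaborn plotting functions such as `sns.boxplot` return a Matplotlib axes object (`bp` in the box plot above), so the graph can be customized with Matplotlib methods:
- `axes = bp.axes`
- `axes.set_ylim(ylim1, ylim2)`
- `axes.set_xticklabels(data.column_name_with_categories.unique(), rotation=90)`
- `axes.legend_.remove()` : removes the legend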
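
A sketch of these calls on an invented DataFrame. The `set_xticks` call is an addition that is not in the list above: it fixes the tick positions before relabeling them, which avoids a warning in recent Matplotlib versions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data: two categories, each split into two groups
data = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b'] * 2,
    'group': ['g1'] * 6 + ['g2'] * 6,
    'value': [1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7],
})

bp = sns.boxplot(data=data, x='category', y='value', hue='group')
axes = bp.axes
axes.set_ylim(0, 10)                       # y-axis limits

labels = data.category.unique()
axes.set_xticks(range(len(labels)))        # fix tick positions first
axes.set_xticklabels(labels, rotation=90)  # then relabel and rotate

axes.legend_.remove()                      # remove the legend created by hue
plt.show()
```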


### Excellent resources
I advise you to look at the following documentation and examples. You will be amazed by what you can do with Seaborn.


http://seaborn.pydata.org/examples/index.html

https://elitedatascience.com/python-seaborn-tutorial


## Bokeh
https://bokeh.pydata.org/en/latest/


## Altair
https://github.com/altair-viz/altair_widgets







# In-class exercises

## Matplotlib

### Graph customization : titles , axes, legends, colors , font, ....


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### How to plot a simple line graph of Y against X?
Use `linspace` from NumPy to create the x-axis points: `numpy.linspace(start, stop, num, endpoint=True, retstep=False, dtype=None, axis=0)`.
- `start` : number to start with
- `stop` : last number 
- `num` : number of points 
- `endpoint` : True or False , If True, stop is the last sample. Otherwise, it is not included. Default is True.
- `retstep`: True or False, If True, return (samples, step), where step is the spacing between samples.
- `dtype` : The type of the output array. If dtype is not given, infer the data type from the other input arguments.
- `axis` : The axis in the result to store the samples. Relevant only if start or stop are array-like. By default (0), the samples will be along a new axis inserted at the beginning. Use -1 to get an axis at the end.


In [None]:
# np.pi : originally defined as the ratio of a circle's circumference to its diameter.
# It is approximately equal to 3.14159.
X = np.linspace(-np.pi, np.pi, 100)  #Return evenly spaced numbers over a specified interval.

#np.cos(X) : cosine function
#np.sin(X) : sine function

C,S = np.cos(X), np.sin(X)



#### Plot the two curves C and S on the same graph

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Now, choose a general style for our charts. 


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Plot again C and S to see the style change

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


Do not hesitate to test with other colours, labels, width of the line and style of the line

### Modify the properties of each curve
- Change the figure size to 20,10. 
- Plot cosine using red color with a continuous line of width 2 (points)
- Plot sine using purple color with a dotted line of width 1 (points)
- set the legend of the curves

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Change the scale of the graph



Try several limits, ticks and titles.

- Set the x and y limits to -4, 4 for x-axis and -1.1, 1.1 for y-axis


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Set the x ticks to [-np.pi, -np.pi/2, 0, np.pi/2, np.pi]
- Set the x tick labels to [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$']


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Set 4 evenly spaced y ticks between -1 and 1. *Hint: Use the linspace function*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Set labels and titles

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Copy all the above steps in one cell and execute it.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Creating subplots


In [None]:
df=pd.DataFrame({'Cos':C, 'Sin':S}, index=X )


- Create subplots to plot the sine and cosine figures. Use the variable `fig` to hold the figure and `ax0`, `ax1` to hold the subplots. Set the figure size and the resolution as you wish
- Plot the cosine function in `ax0` and the sine function in `ax1`
- Set the titles as well as the x and y labels of the subfigures and the figure



In [None]:
# YOUR CODE HERE
raise NotImplementedError()

How to save a plot ? 

In [None]:
fig.savefig('kick.png', transparent=False, dpi=80, bbox_inches="tight")

## Different examples of graphs


### Scatter plot 

In [None]:
n = 2017
X = np.random.normal(0,1,n)
Y = np.random.normal(0,1,n)

#### Plot the scatter plot. Set alpha, points size and marker

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Bar chart 

In [None]:
n = 15
X = np.arange(n)
Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)

#### Plot the bar plot of Y1 and Y2. Set Y2 to negative. 
Set different `facecolor` values and `edgecolor='blue'`  for Y1, `edgecolor=['red']*len(X)` for Y2

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Histogram

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Plot the histogram. set the bins; histtype to 'step' and orientation to horizontal

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Area line

In [None]:
df = pd.DataFrame({'sales': [3, 2, 3, 9, 10, 6],
                   'signups': [5, 5, 6, 12, 14, 13],
                   'visits': [20, 42, 28, 62, 81, 50]}, index=pd.date_range(start='2018/01/01', end='2018/07/01',freq='M'))


#### Plot the area of the data in the dataframe

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Check this as well.

In [None]:
x=np.linspace(0,10, 5)
var1=np.random.randint(0,10,5)
var2=np.random.randint(0,10,5)
var3=np.random.randint(0,10,5)

plt.figure(figsize=(20,10))
# The empty line plots only create the legend entries for the stacked areas
plt.plot([],[],color='blue',label='var1')
plt.plot([],[],color='black',label='var2')
plt.plot([],[],color='gold',label='var3')
plt.stackplot(x,var1,var2,var3,colors=['blue','black','gold'])
plt.legend()

In [None]:
var1

### Pie chart

In [None]:
value=[28,15,15,42]
element=['tomato sauce','beef','cheese','pasta']
cols=['red','brown','yellow','wheat']

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Seaborn

To practice with Seaborn, we will work with the database `videogames_data`.
This database contains data about video games, such as the rank, the release year, the type (genre), the publisher, and the sales (in Europe, Japan, North America, globally, etc.).


The database is stored on the Emlyon server.
The following code loads the database into a dataframe `videogames`. Don't worry about the code that accesses the database; you will learn this in the next lectures. Just execute the following code and let's start experimenting with Seaborn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import datetime as dt



In [None]:
import pymysql
server = '10.126.8.140'
username = "student1"
password = "student1"


In [None]:
connection = pymysql.connect(host=server,
                             user=username,
                             password=password,
                             db='videogames_data',
                             charset='utf8')
SQL = "SELECT * FROM videogames_data.sales"
videogames = pd.read_sql(SQL, connection)
videogames.Global_Sales=videogames.Global_Sales.replace('','0')
for i in [6,7,8,9,10]:
    videogames.iloc[:,i]=videogames.iloc[:,i].map(lambda x: float(x))



### Import seaborn

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
videogames.head()

### Plot the relationship between Sales in Europe and Sales Globally.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Plot European sales as a function of Global sales. Color the points according to the 'Genre' column

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Check if there are some data where the Year is equal to 0

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### When the Year is 0, replace the 'Genre' by NaN (use `np.nan`). Then remove these rows using `dropna()`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Draw the boxplot of 'Year' for each 'Genre', for the years between 1987 and 2018. Set the xtick labels to the genres. Use `videogames.Genre.unique()`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Plot the violin plot of 'Year' for each 'Genre'. Set the xtick labels as above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Create a new dataframe called df which contains the data before the year 1995.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Draw the swarm plot of 'Year' for each 'Genre'. Set the year limits to 1975 and 2020.
Remove the legend. Set the xtick labels as above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Repeat the same thing without setting the year limits. Notice the difference.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Create a new dataframe that satisfies the following conditions:
- Genre is 'Sports', 'Racing' or 'Action'
- Year greater than 2010


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

###  Choose 100 samples

In [None]:
df = df.sample(100)
df

### Draw the factor plot of Global Sales as a function of the Year. Color by 'Genre', separate by 'Genre', and use a swarm plot.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Draw the countplot of each year in the dataframe

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Plot the distribution of the Global Sales for the video games whose rank is less than 100. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()