
# JOUR7280/COMM7780 Big Data Analytics for Media and Communication
# Tutorial: More on data visualization

## 1. Seaborn
[Seaborn](https://seaborn.pydata.org/) is a Python data visualization library based on `matplotlib`. It provides a high-level interface for drawing attractive and informative statistical graphics.

Library Reference: [LINK](https://seaborn.pydata.org/api.html)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Set the aesthetic style of the plots
sns.set_style('whitegrid')

## Data Exploration and Analysis

We will explore the Titanic dataset. The variables are as follows:
<img src="../figs/titanic.png" alt="drawing" width="550"/>

In [None]:
# Load the 'titanic' dataset from Seaborn's online repository
titanic = sns.load_dataset('titanic')

In [None]:
# Seaborn loads the dataset as a pandas DataFrame
print(type(titanic))

In [None]:
titanic.head()

In [None]:
titanic.describe()

In [None]:
# Count the missing (NA) values in each column
print(titanic.isna().sum())

**Example 1** 

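Plot the univariate distribution of the 'fare' column in `titanic`:
* color the bars red
* set the number of bins to 30

The number of bins determines how the values are grouped: the range of the data is split into that many equal-width intervals, and each bar shows how many values fall into one interval.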
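
To make the grouping concrete, here is a minimal sketch that uses `np.histogram` on a handful of made-up fares (the values are hypothetical and not taken from the dataset). It prints the interval boundaries and the number of values in each interval, which is what the bar heights of a histogram represent.

```python
import numpy as np

# Hypothetical fares, used only for illustration (not taken from the dataset)
fares = np.array([7.25, 8.05, 13.0, 26.0, 26.55, 30.0, 71.28, 80.0])

# Split the range [min, max] into 4 equal-width bins and count the values per bin
counts, edges = np.histogram(fares, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n} value(s)")
```

With more bins the intervals become narrower, so the plot shows more detail but also more noise.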

In [None]:
sns.histplot?

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(titanic['fare'], color="red", bins=30)

**Example 2**

Plot the number of male and female passengers in the dataset.

In [None]:
sns.countplot(x='sex', data=titanic);

**Example 3**

Plot the number of passengers in each passenger class.

In [None]:
sns.countplot(x='pclass', data=titanic);

**Example 4** 

Draw a box plot that shows the distribution of 'age' for each 'class' category.

In [None]:
sns.boxplot(x="class", y="age", data=titanic, palette='rainbow');
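
As a rough sketch of what each box summarizes, the quartiles can be computed per group with pandas. The small frame below is hypothetical and only illustrates the idea: the box spans the 25% to 75% quantiles and the line inside the box marks the median.

```python
import pandas as pd

# Hypothetical ages for two passenger classes (illustrative values only)
toy = pd.DataFrame({
    "class": ["First"] * 5 + ["Third"] * 5,
    "age": [30, 40, 50, 60, 70, 10, 20, 30, 40, 50],
})

# One row per class, one column per quantile (0.25, 0.5, 0.75)
quartiles = toy.groupby("class")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```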

**Example 5**

A more complex chart: `jointplot` combines a bivariate plot with the univariate distribution of each variable. Here the `fare` column of `titanic` is on the x-axis and the `age` column is on the y-axis.

In [None]:
sns.jointplot(x='fare', y='age', data=titanic);

**Example 6**

Draw a scatter plot with non-overlapping points (a swarm plot) of the 'age' column, grouped by the 'class' field and colored with the 'Set2' palette.

Rows with a missing 'age' cannot be drawn as points, so we should first decide on a data-cleaning strategy.

In [None]:
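# Strategy: delete the rows with a missing 'age'
# (a bare dropna() would also delete rows that miss any other column, such as 'deck')
titanicCopy = titanic.copy()
titanicCopy.dropna(subset=['age'], inplace=True)
sns.swarmplot(x="class", y="age", data=titanicCopy, palette='Set2');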

In [None]:
# Strategy: when missing, set the 'age' to the median value
titanicCopy = titanic.copy()
medianAge = titanicCopy['age'].median()
titanicCopy.fillna({'age': medianAge}, inplace=True)
sns.swarmplot(x="class", y="age", data=titanicCopy, palette='Set2');
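
The difference between deleting and filling can be checked on a tiny example. The sketch below uses a hypothetical four-row frame (not the real data) to show how many rows survive each approach. Note that `dropna()` without arguments removes a row if any column is missing, while `subset=` restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with missing values in two columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "deck": ["C", np.nan, np.nan, "E"],
})

dropped_any = toy.dropna()                          # rows missing ANY column are removed
dropped_age = toy.dropna(subset=["age"])            # only rows missing 'age' are removed
filled = toy.fillna({"age": toy["age"].median()})   # all rows kept, 'age' gaps filled

print(len(dropped_any), len(dropped_age), len(filled))
```

Deleting rows keeps only observed values but shrinks the sample. Filling with the median keeps every row but piles up points at a single age.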

**Example 7**

Visualize the pairwise correlation of the numeric columns in the dataset as a heatmap:
* set title 'titanic.corr()'
* set color map to 'YlGnBu'

Possible color map values: `Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, twilight, twilight_r, twilight_shifted, twilight_shifted_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r`

In [None]:
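# numeric_only=True skips text and categorical columns, which recent pandas versions refuse to correlate
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='YlGnBu')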
plt.title('titanic.corr()');
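
For intuition about the numbers shown in the heatmap, the following sketch builds a small hypothetical frame and compares the pandas correlation matrix with a manual Pearson coefficient from NumPy. It selects the numeric columns with `select_dtypes`, which is one way to leave out text columns before calling `corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the computation
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0],
    "pclass": [3, 3, 2, 1],
    "sex": ["male", "female", "female", "male"],  # non-numeric, excluded below
})

# Pairwise Pearson correlation of the numeric columns only
corr = toy.select_dtypes(include="number").corr()

# The same coefficient computed directly with NumPy
manual = np.corrcoef(toy["fare"], toy["pclass"])[0, 1]

print(corr)
print(manual)
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one.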

**Example 8** 

Use a multi-plot grid of histograms to show the conditional relationships between:
* 'sex' and 'age'
* 'sex' and 'fare'
* 'class' and 'age'
* 'class' and 'fare'

In [None]:
sns.FacetGrid?

In [None]:
g = sns.FacetGrid(titanic, col='sex')
g = g.map(plt.hist, 'age') # Apply a plotting function to each facet's subset of the data

In [None]:
g = sns.FacetGrid(titanic, col='sex')
g = g.map(plt.hist, 'fare')

In [None]:
g = sns.FacetGrid(titanic, col='class')
g = g.map(plt.hist, 'age')

In [None]:
g = sns.FacetGrid(titanic, col='class')
g = g.map(plt.hist, 'fare')
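
Conceptually, a facet grid splits the data into one subset per category and applies the same plotting function to each subset. The sketch below imitates that idea without drawing anything: it groups a hypothetical frame by 'sex' and computes the histogram counts that each facet would display. The bin edges are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers (illustrative values only)
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "age": [22, 38, 26, 35, 54],
})

bin_edges = [0, 30, 60]  # two bins: 0 to 30 and 30 to 60
facet_counts = {}
for level, subset in toy.groupby("sex"):
    # Each facet only sees the rows that belong to its category
    counts, _ = np.histogram(subset["age"], bins=bin_edges)
    facet_counts[level] = counts.tolist()

print(facet_counts)
```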

## 2. Word Cloud Visualization
You have probably seen a cloud of words in different sizes, where the size of each word represents its frequency or importance. This is called a `Tag Cloud` or `WordCloud`.
We will use the Google Job Skill Analysis dataset again.
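
The basic idea behind the word sizes is a frequency count. Here is a minimal sketch with the standard library on a made-up sentence. The `wordcloud` library does more than this (for example, it removes common stopwords), so treat the snippet only as an illustration of the principle.

```python
from collections import Counter

# Hypothetical text, used only for illustration
text = "analyze data build dashboards analyze trends present data analyze"

# Count how often each word occurs; frequent words would be drawn larger
frequencies = Counter(text.split())
print(frequencies.most_common(2))
```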

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('../data/job_skills.csv')
df.head(1)

For this tutorial we need to install the `wordcloud` library. 

In [None]:
# Install wordcloud with pip in the current Jupyter kernel
# import sys
# !{sys.executable} -m pip install wordcloud

In [None]:
from wordcloud import WordCloud, ImageColorGenerator

### Example 1

- Create a new dataframe called 'dfAnalyst' by selecting the rows whose 'Title' field contains the keyword 'Analyst'.
- All the following visualizations are performed on 'dfAnalyst'.

In [None]:
# Create a new dataframe; na=False treats a missing title as "no match"
dfAnalyst = df.loc[df.Title.str.contains('Analyst', na=False)]
dfAnalyst.head(1)
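
The behavior of `str.contains` is easy to check on a small series. The titles below are hypothetical. The sketch shows that the match is case-sensitive by default, and that `case=False` and `na=False` control case handling and missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical job titles, including a lowercase variant and a missing value
titles = pd.Series(["Data Analyst", "business analyst", "Software Engineer", np.nan])

default_match = titles.str.contains("Analyst", na=False)            # case-sensitive
any_case = titles.str.contains("analyst", case=False, na=False)     # ignores case

print(default_match.tolist())
print(any_case.tolist())
```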

### Example 2 

Create a word cloud for the 'Responsibilities'. You could refer to the following steps:
1. Create a collection of responsibilities joining all rows
2. Use the 'wordcloud' library to generate the word cloud
3. Configure the plot and show the result

The string method `join()` concatenates all the elements of an iterable into a single string, inserting the separator (here a space) between them.
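
A short sketch of `join()` on a hypothetical column follows. Because `join()` only accepts strings, a missing value (`NaN`, which is a float) would raise a `TypeError`, so the example drops missing entries first. Whether the real column contains missing values is an assumption to verify on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing entry
col = pd.Series(["Build reports.", np.nan, "Analyze data."])

# Drop missing entries, then join the remaining strings with a space
combined = " ".join(col.dropna().astype(str))
print(combined)
```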

In [None]:
# Join all non-missing responsibilities into one string
ResAN = ' '.join(dfAnalyst['Responsibilities'].dropna())

# Create and generate a word cloud image:
wordcloud = WordCloud(background_color="white").generate(ResAN)
# Display the generated image:
plt.figure(figsize=(8,4))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title('Responsibilities',size=24)
plt.show()

`background_color`: the background color of the word cloud image (default: "black").

Learn more about the `WordCloud` class [here](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html).

You've probably noticed the argument `interpolation="bilinear"` in `plt.imshow()`. It makes the displayed image appear smoother. The other available interpolation methods are shown [here](https://matplotlib.org/gallery/images_contours_and_fields/interpolation_methods.html).

### Example 3

In the same way, create separate word clouds for 'Minimum_Qualifications' and 'Preferred_Qualifications', and add a mask of your choice.

Hint: Not every image is suitable as a mask. Find out the requirements for a mask image first.

In [None]:
# library to load the image
from PIL import Image

cloud = np.array(Image.open('../figs/cloud_mask.png'))

QuaAN = ' '.join(dfAnalyst['Minimum_Qualifications'].dropna())

wordcloud = WordCloud(mask=cloud, background_color="white").generate(QuaAN)
# plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title('Minimum_Qualifications',size=24)
plt.show()

Keep in mind that the background of the mask image must be white; otherwise the background is treated as part of the shape. In addition, the background must not be transparent, because transparent pixels are treated as black.
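
If a mask image has a transparent background, one possible fix is to composite it onto white before passing it to `WordCloud`. The sketch below does this with NumPy on a tiny hypothetical RGBA array. It assumes the image was loaded with an alpha channel (four values per pixel), which you should check with `cloud.shape` on a real file.

```python
import numpy as np

# Hypothetical 2x2 RGBA image: left column fully transparent, right column opaque black
rgba = np.array([
    [[0, 0, 0, 0], [0, 0, 0, 255]],
    [[0, 0, 0, 0], [0, 0, 0, 255]],
], dtype=np.uint8)

rgb = rgba[..., :3].astype(float)
alpha = rgba[..., 3:4].astype(float) / 255.0

# Alpha-composite onto white: transparent pixels turn white, opaque pixels keep their color
mask = (rgb * alpha + 255.0 * (1.0 - alpha)).round().astype(np.uint8)

print(mask[0, 0], mask[0, 1])
```

The resulting array has three values per pixel and a white background, so only the opaque shape remains as the drawing area.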

In [None]:
twitter = np.array(Image.open('../figs/twitter_mask.png'))

PreQuaAN = ' '.join(dfAnalyst['Preferred_Qualifications'].dropna())

wordcloud = WordCloud(mask=twitter, background_color="white").generate(PreQuaAN)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title('Preferred_Qualifications',size=24)
plt.show()