# Agenda


- Part-I: Data visualization

  - Best practices and bad practices

  - Examples and resources

- Part-II: Matplotlib

  - Anatomy of a plot

  - Subplots

  - Bar charts and histograms

- Hands-on experiments

# Part-I

[Reddit - Data is beautiful](https://www.reddit.com/r/dataisbeautiful/)

[Reddit - Data is Ugly](https://www.reddit.com/r/dataisugly/)


https://faculty.ucmerced.edu/jvevea/classes/Spark/readings/Cairo2015_Chapter_GraphicsLiesMisleadingVisuals.pdf

## Discussion


- [Data Storytelling Tips](https://visme.co/blog/data-storytelling-tips/) : Section: Examples of How to Improve Data Storytelling

# Part-II

## Matplotlib



In [None]:
# import matplotlib
import matplotlib.pyplot as plt



- Matplotlib offers two interfaces: the state-based 'pyplot' interface and the 'object-oriented' interface. For now we will work with 'pyplot'.

- The object-oriented API lets you do more, but it requires more knowledge of how the package is structured.

- The pyplot API, on the other hand, is very handy for quick plots, but it is not as flexible as the object-oriented API.

- Creating a visualization is as simple as:

### Basic Plots

In [None]:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()

[Pyplot tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html)
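
For comparison, the same basic plot can be written with the object-oriented interface mentioned above. This is a minimal sketch: the names `fig` and `ax` are just the conventional choices, not something the rest of this notebook relies on.

```python
import matplotlib.pyplot as plt

# Create a figure and a single axes object explicitly
fig, ax = plt.subplots()

# The plotting and labelling calls are now methods of the axes
ax.plot([1, 2, 3, 4])
ax.set_ylabel('some numbers')
```

Note that `plt.ylabel(...)` becomes `ax.set_ylabel(...)`. Most pyplot functions have an axes method with a similar name.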

Changing styles

In [None]:
# plot red o's
# try squares and dashes also
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')

# plt.axis([xmin, xmax, ymin, ymax])
plt.axis([0, 6, 0, 20])

plt.show()

- In addition to lists, matplotlib can work with numpy arrays (in fact, this is more practical and more common).

- We can also plot multiple lines in the same figure.

In [None]:
import numpy as np

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
print(t)

In [None]:
# red dashes, blue squares and green triangles
plt.plot(t, 2*t, 'r--', t, t**2, 'bs', t, t**3, 'g^')

# plt.plot(t, 2*t, 'r--')
# plt.plot(t, t**2, 'bs')
# plt.plot(t, t**3, 'g^')

plt.show()

### Subplots

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

In [None]:
t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

# t1, t2

In [None]:
plt.figure()
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')
plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()
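
The shorthand `plt.subplot(211)` means "a grid of 2 rows and 1 column, select panel 1". As a rough sketch, the same layout can be built in one call with `plt.subplots(2, 1)`, which returns the figure and both axes. The names `ax_top` and `ax_bottom` are chosen here only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(t):
    # damped cosine, same function as in the cell above
    return np.exp(-t) * np.cos(2 * np.pi * t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

# 2 rows, 1 column: one axes object per panel
fig, (ax_top, ax_bottom) = plt.subplots(2, 1)
ax_top.plot(t1, f(t1), 'bo', t2, f(t2), 'k')
ax_bottom.plot(t2, np.cos(2 * np.pi * t2), 'r--')
```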


In [None]:
# Examples



### Working with text

In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)


plt.xlabel('Smarts')
plt.ylabel('Density')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')

plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

### Bar plots and Histograms

In [None]:
names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]


plt.bar(names, values)
plt.draw()

[Source](https://matplotlib.org/stable/tutorials/introductory/pyplot.html)

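Other parameters to try:

- `color`: the bar color, e.g. `color='red'`
- `log`: set `log=True` for a logarithmic y-axis
- `align` and `width`: compare `align='edge'` with `width=0.2` vs `width=0.5`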
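
A small sketch of these parameters side by side, reusing the same three groups. The specific color and the two-panel layout are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]

fig, axs = plt.subplots(1, 2, figsize=(8, 3), tight_layout=True)

# Left: custom color and a logarithmic y-axis, so the small bars stay visible
axs[0].bar(names, values, color='tab:orange', log=True)
axs[0].set_title("color and log=True")

# Right: narrow bars whose left edge (not center) sits on the tick
axs[1].bar(names, values, width=0.2, align='edge')
axs[1].set_title("align='edge', width=0.2")
```

With values spanning 1 to 100, the logarithmic axis is what makes `group_a` readable at all.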


### Histograms


A histogram is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but not required to be) of equal size.
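
The binning step described above can be sketched by hand and checked against `np.histogram`. The sample values and the five equal-width bins are made up for illustration. The convention that every bin is half-open except the last one follows NumPy's behavior.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample, small enough to count by eye
data = np.array([1, 2, 2, 3, 5, 7, 8, 8, 9, 10])

# 6 edges define 5 adjacent, non-overlapping, equal-width bins on [0, 10]
edges = np.linspace(0, 10, 6)

manual_counts = []
for i in range(len(edges) - 1):
    left, right = edges[i], edges[i + 1]
    if i == len(edges) - 2:
        # the last bin also includes its right edge
        in_bin = (data >= left) & (data <= right)
    else:
        in_bin = (data >= left) & (data < right)
    manual_counts.append(int(in_bin.sum()))

# NumPy performs the same counting in one call
counts, _ = np.histogram(data, bins=edges)

print(manual_counts)
print(counts.tolist())

# plt.hist does the binning and the drawing together
plt.hist(data, bins=edges, edgecolor='black')
```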

In [None]:
N_points = 100000
n_bins = 20

# Generate two normal distributions, centered at x=0 and y=5
x = np.random.randn(N_points)
y = .4 * x + np.random.randn(N_points) + 5

fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)

# We can set the number of bins with the `bins` kwarg
axs[0].hist(x, bins=n_bins)
axs[1].hist(y, bins=n_bins)

__Histogram - Extra__

In [None]:
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
fig, axs = plt.subplots(1, 2, tight_layout=True)

# N is the count in each bin, bins holds the bin edges (n_bins + 1 values)
N, bins, patches = axs[0].hist(x, bins=n_bins)
# print(patches)

# We'll color code by height, but you could use any scalar
fracs = N / N.max()

# we need to normalize the data to 0..1 for the full range of the colormap
norm = colors.Normalize(fracs.min(), fracs.max())

# Now, we'll loop through our objects and set the color of each accordingly
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.viridis(norm(thisfrac))
    thispatch.set_facecolor(color)

# We can also normalize our inputs by the total number of counts
axs[1].hist(x, bins=n_bins, density=True)

# Now we format the y-axis to display percentage
axs[1].yaxis.set_major_formatter(PercentFormatter(xmax=1))

[Source](https://matplotlib.org/stable/gallery/statistics/hist.html)

### Box Plots

In [None]:
plt.boxplot(y)

Things to do:

- Explain Q1, Q3, IQR and outlier computations

- Mention drawbacks of boxplots: the shape of the distribution is not very clear (mention violin plots?)
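
As a starting point for the first item, here is a minimal sketch of the quartile and outlier computations behind a box plot. The sample data is made up. The factor 1.5 is the conventional choice and matches the default `whis=1.5` of `plt.boxplot`.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Q1, median and Q3 are the 25th, 50th and 75th percentiles
q1, median, q3 = np.percentile(data, [25, 50, 75])

# The interquartile range is the height of the box
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

The whiskers of the box plot then extend to the most extreme data points that still lie inside the fences.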

# Part-III

[Kaggle - NBA Players Data](https://www.kaggle.com/justinas/nba-players-data)



In [None]:
import pandas as pd

nba_players = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS601_Fall21/main/Week02/data/all_seasons.csv', index_col = 0)


In [None]:
nba_players.columns

## Scatter Plot 

In [None]:
plt.scatter(nba_players.player_height, nba_players.ast)
# change x-label


# change y-label


# change title

plt.draw()

In [None]:
plt.hist(nba_players.pts)

In [None]:
plt.boxplot(nba_players.pts)

In [None]:
# let's check the outliers
nba_players[nba_players.pts > 25].player_name.unique()

In [None]:
nba_players.reb.value_counts()

# Part-IV

## Lab Part1

Please choose an appropriate graph and visualize the following with correct/appropriate labels:

* Height only
* Points only
* Height vs Points

In [None]:
import pandas as pd

nba_players = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS601_Fall21/main/Week02/data/all_seasons.csv', index_col = 0)

height = nba_players['player_height']
points = nba_players['pts']

## Lab Part2
Please visualize year vs score with correct/appropriate labels

In [None]:
bowling_data = pd.read_csv("https://raw.githubusercontent.com/msaricaumbc/DS601_Fall21/main/Week02/data/bowling_stats.csv",
                            header=None,
                            names=['year', 'city', 'state', 'count1', 'count2'])


def merge_columns(row):
    # count2 is empty when the whole value is already in count1;
    # otherwise count1 holds the thousands and count2 the remainder
    if pd.isna(row['count2']):
        return row['count1']
    return row['count1'] * 1000 + row['count2']


bowling_data['total'] = bowling_data.apply(merge_columns, axis=1)

year = bowling_data['year']
score = bowling_data['total']