<a href="https://colab.research.google.com/github/zackives/upenn-cis5450-hw/blob/main/11_Module_3_Part_1_Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture Notebook: Information Visualization


This notebook covers several things:

1. The basics of plotting Pandas dataframes using matplotlib.
2. Some rules of thumb about bar vs line charts, axes, normalization, and whether to interpolate.
3. The basics of ggplot in Python.
4. Seaborn and visualization of statistical data.

## Autograder setup!


In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
%set_env HW_ID=cis5450_25f_HW9

In [None]:
!pip3 install penngrader-client

In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

## Initialize Visualization Libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Loading familiar data into Pandas

We'll use the CEOs dataset from Wikipedia as an example to compare two different sub-populations: those CEOs who are actually **founders**, and those who are simply "**regular CEOs**".

In [None]:
import requests
from io import StringIO

def import_html(url: str):
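  # Fetch the page's raw HTML. We send a browser-like User-Agent header because
  # some sites reject requests that arrive with a default client header.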
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  }

  return requests.get(url, headers=headers).text


In [None]:
# Read the Wikipedia HTML table containing information about CEOs!

url = 'https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs'
company_ceos_df = pd.read_html(StringIO(import_html(url)))[1]



In [None]:
company_ceos_df

In [None]:

company_ceos_df.dropna(subset = ['Since'], inplace=True)
# Clean the references out of the Since field and the Title field...
company_ceos_df['Since'] = company_ceos_df['Since'].apply(lambda x: int(x.split(' ')[-1]) if not pd.isna(x) and ' ' in x else x)
company_ceos_df['Since'] = company_ceos_df['Since'].apply(lambda x: int(x.split('[')[0].strip()) if (not pd.isna(x)) and isinstance(x, str) and '[' in x else int(x))
company_ceos_df['Title'] = company_ceos_df['Title'].apply(lambda x: x.split('[')[0].strip() if '[' in x else x)

for i in range(0,len(company_ceos_df)):
  print(company_ceos_df.iloc[i]['Executive'] + ': ' + str(company_ceos_df.iloc[i]['Since']))
# Show the output
company_ceos_df.info()
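
The two `apply` calls above peel off month names and bracketed footnote markers in turn. As an alternative sketch, assuming every `Since` entry contains exactly one four-digit year, a single regular expression can pull the year out directly. The sample values below are made up for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical raw values resembling what a scraped "Since" column might hold
raw_since = pd.Series(["2011", "August 2014", "2018[12]", 2020])

# Convert everything to text, grab the first run of four digits, and cast to int
years = raw_since.astype(str).str.extract(r"(\d{4})", expand=False).astype(int)

print(years.tolist())
```

If an entry had no four-digit year, `str.extract` would return a missing value and the final `astype(int)` would raise, which is a reasonable signal that the cleaning rule needs another look.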

Now that we have the data, let's split into two dataframes.

In [None]:
# Split the founders
founders_df = company_ceos_df[company_ceos_df['Title'].apply(lambda s: True if 'founder' in s.lower() else False)]

# This is a set difference: after concatenating, every founder row appears twice,
# and keep=False drops all duplicated rows, leaving only the non-founders
regular_ceos_df = pd.concat([company_ceos_df, founders_df]).drop_duplicates(keep=False)
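
To see why concatenating and then dropping duplicates acts as a set difference, here is a minimal sketch on a toy dataframe. The names and titles are hypothetical, and the boolean-mask version is shown only as a comparison, not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data, not the Wikipedia table
all_df = pd.DataFrame({
    "Executive": ["Ann", "Bob", "Cy", "Dee"],
    "Title": ["CEO", "Founder and CEO", "President and CEO", "Co-founder"],
})

is_founder = all_df["Title"].str.contains("founder", case=False)
subset_df = all_df[is_founder]

# Every subset row appears twice in the concatenation, so keep=False drops both copies
rest_df = pd.concat([all_df, subset_df]).drop_duplicates(keep=False)

# The same split using the negated boolean mask
rest_by_mask = all_df[~is_founder]

print(rest_df["Executive"].tolist())
```

One caveat: `keep=False` also discards rows of the full table that are exact duplicates of each other, which the mask version does not.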

In [None]:
# For inspection: who are non-founders?
regular_ceos_df.info()

In [None]:
# For inspection: who are the founders?

founders_df

## Plotting our first graph

OK, so we'll do our first plot.  We want to see company vs CEO start year, for CEOs who are also founders.  This is a *bar chart* since companies are categorical rather than continuous-valued.

In [None]:
founders_df.plot(kind='bar', x='Company', y='Since', color='gray')


This looks pretty ridiculous: the bar chart assumes the y-axis starts at 0, yet we are plotting calendar years, so every bar is roughly 2,000 units tall and the differences between them are nearly invisible.

Could we change the graphed value to one that conceptually makes more sense? For example, maybe we should look at **how long** people have been CEOs.

In [None]:
import datetime
now = datetime.datetime.now()

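# Use assign() to build a new dataframe rather than writing into a slice of
# company_ceos_df, which can trigger pandas' SettingWithCopyWarning
founders_df = founders_df.assign(Years=now.year - founders_df['Since'])

# DataFrame.plot returns a matplotlib Axes object
ax = founders_df.plot(kind='bar', x='Company', y='Years')


# Based on "domain expertise", we will assume no one should be CEO for more
# than ~70 years -- if they started at 20, they would be 90...
ax.set_ylim([0, 70])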

## Plotting for comparison

Let's look at how many CEOs started in each year, comparing founding CEOs vs "regular" CEOs...

Here, year can be considered a continuous-valued parameter (although note that we are actually quantizing it to integer values, so fractional years aren't really useful here).

In [None]:

# gca stands for 'get current axis'
ax = plt.gca()

# This will create counts for how many founders started in each year
founders_by_year = founders_df.groupby(['Since']).count()

founder = founders_by_year.plot(kind='line',y='Company',ax=ax, label='Founding CEOs')
regular_ceos_by_year = regular_ceos_df.groupby(['Since']).count()

regular = regular_ceos_by_year.plot(kind='line',y='Company', color='red', ax=ax,
                                    label='Other CEOs')




If we look closely at this graph, we'll note there seems to be one founding CEO every year. Could that be?  Maybe we should look more closely!!!

We'll re-plot, putting a marker at each point.  And perhaps we can even remove the line from the "founding CEO" plot, just looking at the markers...

In [None]:

# gca stands for 'get current axis' and if we share the x-axis,
# we will be able to plot multiple items against it
ax = plt.gca()

# This will create counts for how many founders started in each year
founders_by_year = founders_df.groupby(['Since']).count()

founder = founders_by_year.plot(kind='line',y='Company',ax=ax, label='Founding CEOs',
                                marker='x', linewidth=0)
regular_ceos_by_year = regular_ceos_df.groupby(['Since']).count()

regular = regular_ceos_by_year.plot(kind='line',y='Company', color='red', ax=ax,
                                    label='Other CEOs', marker='+')

plt.xlabel('Year CEO started', fontsize = 16)


Much clearer -- in fact the blue x's show that founding CEOs are not that common!
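
Part of the reason the first line plot misled us is that `groupby` only produces rows for years that actually occur, so the line is interpolated straight across years in which nobody started. A minimal sketch of making those gaps explicit by reindexing over every year and filling with zero follows; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: four CEOs and the year each one started
toy_df = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Since": [2001, 2001, 2004, 2006],
})

# Count CEOs per start year; years with no CEO are simply absent from the index
starts_by_year = toy_df.groupby("Since").size()

# Reindex over every year in the range, filling the absent years with zero
all_years = list(range(starts_by_year.index.min(), starts_by_year.index.max() + 1))
filled = starts_by_year.reindex(all_years, fill_value=0)

print(filled.tolist())
```

Plotting `filled` as a line would dip to zero in the empty years instead of drawing a segment that suggests CEOs started in every year.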

## Plotting and Thinking about Scale

Let's try another plot, here comparing three items...

In [None]:
graph_df = pd.DataFrame([{'scale': 100, 'value': 800}, {'scale': 200, 'value': 1200}, {'scale': 500, 'value': 2400}])

graph_df.plot(kind='bar', x='scale', y='value', label='Execution time')


This plot is perfectly fine, but note that the x-axis actually contains **numeric** items, which might be continuous-valued.  Moreover, the bar chart spaces the values 100, 200, and 500 evenly, which is neither a linear-scale nor a log-scale progression along the axis. So while our eyes see something that looks non-linear, we can plot this as a line chart and see what's really happening...

In [None]:
graph_df.plot(kind='line', x='scale', y='value', label='Execution time', marker='o')
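
We can also check the shape of this relationship numerically rather than by eye. The sketch below recomputes the slope between consecutive points; the variable name `slopes` is introduced here for illustration.

```python
import pandas as pd

graph_df = pd.DataFrame({"scale": [100, 200, 500], "value": [800, 1200, 2400]})

# Change in value divided by change in scale between consecutive rows
slopes = graph_df["value"].diff() / graph_df["scale"].diff()

print(slopes.dropna().tolist())
```

Both segments have the same slope, so the three points lie on a single straight line even though the evenly spaced bars suggested a curve.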


## Plotting and Normalization

Now let's look at data and scaling, where perhaps we are looking at phenomena that are quite different.  A common situation is to measure the running time of three computations, using some baseline computation and comparing it with some alternate computation.  We can plot this using bar charts as we see below.

In [None]:
# Suppose we are counting, for three computations, two different components, the
# baseline computation and the alternative.

# These are the "baseline" numbers for some computation
baseline_df = pd.DataFrame([{'comp': 1, 'value': 800}, {'comp': 2, 'value': 5}, {'comp': 3, 'value': 2400}])
# These are alternative computations
alternative_df = pd.DataFrame([{'comp': 1, 'value': 720}, {'comp': 2, 'value': 3}, {'comp': 3, 'value': 2100}])

# We want to plot side-by-side
combined_df = baseline_df.rename(columns={'value': 'baseline'})
combined_df['alternative'] = alternative_df['value']

fig = combined_df.plot.bar(x='comp')



Wow, we can't see computation #2 at all!  Given that the three computations differ so much in magnitude, we may want to normalize each one against its own baseline...

In [None]:
rescaled_df = combined_df.copy()
rescaled_df['alternative'] = combined_df.apply(lambda r: r['alternative'] / r['baseline'], axis=1)
rescaled_df['baseline'] = combined_df.apply(lambda r: 1.0, axis=1)

fig = rescaled_df.plot(kind='bar', x='comp')
fig.set_title('Normalized Performance')


Note that an "honest" presentation of the data will emphasize that these are normalized, and that the relative running times are quite different.  In fact, sometimes people will put a caption above each bar showing the actual timings.
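
As a minimal sketch of that captioning idea, the cell below rebuilds the normalized chart and writes the raw timing above each bar. It assumes matplotlib 3.4 or later (for `Axes.bar_label`) and selects the non-interactive `Agg` backend so it also runs without a display; the names `combined` and `normalized` are introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

combined = pd.DataFrame({
    "comp": [1, 2, 3],
    "baseline": [800, 5, 2400],
    "alternative": [720, 3, 2100],
})

# Divide each computation's alternative time by its own baseline time
normalized = pd.DataFrame({
    "comp": combined["comp"],
    "baseline": 1.0,
    "alternative": combined["alternative"] / combined["baseline"],
})

ax = normalized.plot(kind="bar", x="comp")
ax.set_title("Normalized Performance")
ax.set_ylabel("Running time relative to baseline")

# pandas draws one bar container per plotted column, in column order;
# label each bar with the raw (un-normalized) timing
for container, column in zip(ax.containers, ["baseline", "alternative"]):
    ax.bar_label(container, labels=[str(v) for v in combined[column]])
```

The bar heights show the relative change, while the labels keep the very different absolute timings in view.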

# Visualization of Statistical Data with Seaborn

In [None]:
!pip install seaborn

import seaborn as sb

In [None]:
# Some simple data, random points around a line
points = 150
slope = 0.3

x = np.array(range(points))
# We'll plot these
y = np.random.randn(points) * 5 + x * slope
# Draw a standard-normal sample for each point; z is True where it is non-negative
z = np.random.randn(points) >= 0

sample_df = pd.DataFrame({'x': x, 'y': y, 'z': z})

sample_df

In [None]:
# Do a scatter plot, with height 4, shading the points based on whether z is True
sb.lmplot(data=sample_df, x='x', y='y', height=4, aspect=1.5, fit_reg=False, hue="z")


In [None]:
# Same scatter plot, but with fit_reg=True to overlay a regression line for each value of z
sb.lmplot(data=sample_df, x='x', y='y', height=4, aspect=1.5, fit_reg=True, hue="z")
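
With `fit_reg=True`, each group gets a fitted straight line drawn through its points. To get a feel for what such a fit recovers, here is a rough sketch that fits a least-squares line to the same kind of synthetic data using `np.polyfit`. The seeded generator is an assumption added for reproducibility, and this is not a claim about how seaborn computes its fit internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
points = 150
slope = 0.3

x = np.arange(points)
# Noisy points scattered around a line through the origin with the given slope
y = rng.standard_normal(points) * 5 + x * slope

# Fit a degree-1 polynomial: the coefficients come back as (slope, intercept)
fitted_slope, fitted_intercept = np.polyfit(x, y, deg=1)

print(f"fitted slope: {fitted_slope:.3f}, fitted intercept: {fitted_intercept:.3f}")
```

The fitted slope should land close to the 0.3 used to generate the data, with the gap shrinking as `points` grows.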

In [None]:
# Sample dataset with people + tips
tips_dataset = sb.load_dataset('tips')

tips_dataset

In [None]:
# Sample dataset with people + tips
tips_dataset = sb.load_dataset('tips')

# The question: how do people tip on different days of the week?
tips_dataset['tip_pct'] = tips_dataset.apply(lambda r: r['tip'] / r['total_bill'], axis=1)

# We will create a different graph for each value of 'time' (lunch vs dinner)
g = sb.FacetGrid(tips_dataset, col='time', hue='day',
                 height=4, aspect=1)

# Within each graph, plot total bill vs tip percentage
g.map(plt.scatter, 'total_bill', 'tip_pct')
g.add_legend()

In [None]:
# We will create a different graph for each value of 'time' (lunch vs dinner)
g = sb.FacetGrid(tips_dataset, col='time', hue='day',
                 height=4, aspect=1)

# Within each graph, plot total bill vs party size
g.map(plt.scatter, 'total_bill', 'size')
g.add_legend()

In [None]:
# Let's look at how different factors are influenced by the size
# of the party
sb.pairplot(data=tips_dataset,kind='scatter', hue='size')

In [None]:
# Create histogram bins of size 5
bins = np.arange(tips_dataset.total_bill.min(), tips_dataset.total_bill.max(), 5)

# Cut the bins, and group by size
by_bill_binned = tips_dataset.groupby([pd.cut(tips_dataset.total_bill, bins, precision=0),
                                       'size']).size().unstack().fillna(0)

by_bill_binned
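
One detail worth knowing about this binning recipe: `np.arange` stops before its upper limit, and `pd.cut` builds intervals that are open on the left by default, so the smallest value and any values beyond the last edge are assigned to no bin. The sketch below shows this on made-up bill amounts, along with one possible adjustment; the names `bills`, `edges`, and `edges_full` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical bill amounts
bills = pd.Series([3.0, 7.5, 8.0, 12.0, 14.9])

# Same recipe as above: edges every 5 units from the minimum up to (not including) the maximum
edges = np.arange(bills.min(), bills.max(), 5)
binned = pd.cut(bills, edges)
counts = binned.value_counts(sort=False)
print(binned.isna().tolist())

# One possible adjustment: extend the edges past the maximum and include the lowest edge
edges_full = np.arange(bills.min(), bills.max() + 5, 5)
binned_full = pd.cut(bills, edges_full, include_lowest=True)
print(binned_full.isna().tolist())
```

Whether the dropped values matter depends on the analysis; for a heatmap of the bulk of the data they may be harmless, but it is worth checking how many rows fall outside the bins.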

In [None]:
sb.set(font_scale=1.0)

sb.heatmap(by_bill_binned[by_bill_binned.sum(axis=1) > 3])

In [None]:
sb.boxplot(x=tips_dataset.time, y=tips_dataset.total_bill)

In [None]:
tips_dataset.total_bill.sort_values()

## Exercise

### Are we confident lunch and dinner have different price distributions?

We talked about the *t-test* as a way of testing whether two distributions have different means.  Let us compare lunch vs. dinner `total_bill` and see what the *p-value* is, with regard to rejecting the null hypothesis that the two means are equal (and so concluding that the distributions differ).  Use the standard value of *alpha* (the p-value threshold) for scientific results.



In [None]:
from scipy import stats

# Separate the total bills for lunch and dinner
lunch_bills = tips_dataset[tips_dataset['time'] == 'Lunch']['total_bill']
dinner_bills = tips_dataset[tips_dataset['time'] == 'Dinner']['total_bill']

# Perform the independent samples t-test, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
# TODO
t_statistic, p_value = # Something

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Interpret the results
alpha = # TODO
if p_value < alpha:
  grader.grade(test_case_id='lunch', answer=True)
else:
  grader.grade(test_case_id='lunch', answer=False)
