<a href="https://colab.research.google.com/github/tmckim/materials-fa23-colab/blob/main/lectures/lec04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (as you hopefully did for this one), you cannot save changes. It's best to store the Colab notebook in your personal drive with `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like "**Copy of lec04.ipynb**". You can rename this to just the title of the assignment, "**lec04.ipynb**". Make sure you keep an informative name (like the name and number of the lecture) so that you know which files go with which course sessions. More instruction on file formats is at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them.

In [None]:
# Just run this cell
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

#!mkdir -p '/content/gdrive/My Drive/colab-materials-NS499DataSci-notebooks/'
!rm -r materials-fa23-colab
!git clone https://github.com/tmckim/materials-fa23-colab '/content/gdrive/My Drive/colab-materials-NS499DataSci-notebooks/materials-fa23-colab/'

%cd /content/gdrive/MyDrive/colab-materials-NS499DataSci-notebooks/materials-fa23-colab/lectures/

In [None]:
# Just run this cell. Import packages and visualization options
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

## Lecture 4 ##

Topic: Visualizing the distribution of data

## Categorical Distribution ##

How often does each possible value occur? There is a finite set of values, so we can visualize those counts as a bar chart.

In [None]:
# Using the top movies dataset
top_movies = Table.read_table('top_movies_2017.csv')
top_movies

### Exercise:
Compute how many times each studio appears in the list. (Here we use the `group` function, which we will go over in more detail later.)<br>
Info can be found [here](https://www.data8.org/sp23/reference/)

In [None]:
# Select the Studio column, then use the group function to count the number of movies per Studio
studios = top_movies.select('Studio')
studio_distribution = studios.group('Studio')
studio_distribution
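
Conceptually, `group` tallies how many times each distinct value occurs. The sketch below imitates that idea with Python's built-in `collections.Counter`. The studio names in `studios_sample` are invented for illustration and are not taken from `top_movies_2017.csv`, and this is not how the `datascience` library is implemented.

```python
from collections import Counter

# Hypothetical list of studio labels, one per movie (made up for this example)
studios_sample = ['Fox', 'Disney', 'Fox', 'MGM', 'Disney', 'Fox']

# Counter tallies how often each distinct value appears
counts = Counter(studios_sample)

# Print one line per studio, sorted by name for readability
for studio in sorted(counts):
    print(studio, counts[studio])
```

Each printed line plays the role of one row in the grouped table: a category and its count.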

## Bar Charts ##

### Exercise:
Construct a bar chart depicting the number of movies from each studio (the "`count`")

In [None]:
# Use barh
studio_distribution.barh('Studio')

In [None]:
# Also using sort to make it prettier
studio_distribution.sort('count', descending=True).barh('Studio')

In [None]:
# What's wrong with this? It still produces a plot
studio_distribution.take(np.arange(10)).sort('count', descending=True).barh('Studio')

### Exercise:
Construct a bar chart containing the percentage of the movies from each studio.

In [None]:
# Array of counts
count_col = studio_distribution.column('count')
count_col

In [None]:
# compute percentage
count_col / count_col.sum()

In [None]:
# Put this column into the table
studio_distribution = studio_distribution.with_column('percent', 100 * count_col / count_col.sum())
studio_distribution

In [None]:
# Now make the visualization
studio_distribution.sort('percent', descending=True).barh('Studio', 'percent')

## Numerical Distribution ##

The most basic tool for visualizing the distribution of numerical data is the histogram.

In [None]:
# Examine the age of the top 200 films
# Show just a preview of the table
top_movies.take(np.arange(5))

In [None]:
# Add a column containing the age of each movie in the top_movies table
this_year = 2023
ages = this_year - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)

In [None]:
# Show the table
top_movies

In [None]:
# Find the min and max using built in functions we've learned before
min(ages), max(ages)

### Exercise:
Split the "`Age`" column into the following bins

In [None]:
# Create the bins
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 100)

In [None]:
# Bin the data
binned_data = top_movies.bin('Age', bins = my_bins)
binned_data
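
To make the binning rule concrete, here is a minimal sketch in plain Python that counts how many values land in each bin. It assumes the usual NumPy histogram convention: every bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. The helper `count_in_bins` and the values in `ages_sample` are hypothetical and exist only for this illustration.

```python
def count_in_bins(values, edges):
    """Count how many values fall into each [left, right) bin.

    The last bin is treated as closed on the right, following
    the NumPy histogram convention (an assumption of this sketch).
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            left, right = edges[i], edges[i + 1]
            is_last = (i == len(counts) - 1)
            if left <= v < right or (is_last and v == right):
                counts[i] += 1
                break
    return counts

# Made-up ages chosen to land on and around the bin edges
ages_sample = [2, 5, 7, 14, 15, 39, 40, 64, 100]
edges = [0, 5, 10, 15, 25, 40, 65, 100]

bin_counts = count_in_bins(ages_sample, edges)
print(bin_counts)
```

Note that an age of exactly 5 goes into the bin that starts at 5, not the one that ends at 5.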

You can also use `np.arange` to create regular bins of a fixed size, or even just specify the number of bins.

In [None]:
# Demonstrate other ways to bin the data
my_bins_another_way = top_movies.bin('Age', bins = np.arange(0,126,25))
my_bins_another_way

In [None]:
# Demonstrate other ways to bin the data
my_bins_another_way = top_movies.bin('Age', bins = 15)
my_bins_another_way

## Histograms ##

We can construct histograms of numerical variables by calling the `tbl.hist(...)` method with our `bins`.

### Exercise:
Make a histogram of `Age` using `my_bins`

In [None]:
# What are my_bins from before?
my_bins

In [None]:
# Reminder of the data with original bins
binned_data

In [None]:
# Show table with new column
binned_data

In [None]:
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')

In [None]:
# Let's try equally spaced bins instead.
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')

In [None]:
# Let's try not specifying any bins
top_movies.hist('Age', unit='Year')

## Height ##

### Question: What is the height of the [40, 65) bin?

In [None]:
# Review the plot to remind ourselves what it looks like based on bins
top_movies.hist('Age', bins=my_bins, unit='Year')
binned_data

In [None]:
# Add a column containing what percent of movies are in each bin
binned_data = binned_data.with_column(
    'percent', 100 * binned_data.column('Age count') / binned_data.column('Age count').sum())
binned_data

In [None]:
# Step 1: Calculate % of movies in the [40, 65) bin
percent = binned_data.where('bin',40).column('percent').item(0)
percent

In [None]:
# Step 2: Calculate the width of the 40-65 bin
width = 65 - 40

In [None]:
# Step 3: Area of rectangle = height * width
#         --> height = percent / width
height = percent / width
height

### What are the heights of the rest of the bins?

In [None]:
# Get the bin lefts
bin_lefts = binned_data.take(np.arange(binned_data.num_rows - 1))

In [None]:
# Get the bin widths
bin_widths = np.diff(binned_data.column('bin'))
bin_lefts = bin_lefts.with_column('width', bin_widths)

In [None]:
# Get the bin heights
bin_heights = bin_lefts.column('percent') / bin_widths
bin_lefts = bin_lefts.with_column('height', bin_heights)

In [None]:
bin_lefts

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')
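
One way to sanity-check the rule height = percent / width is that the areas of all the bars should add up to 100%. The sketch below checks this in plain Python. The bin edges match `my_bins`, but the `counts` are invented for illustration and are not the real `top_movies` counts, and `density_heights` is a hypothetical helper.

```python
# Bin edges as in my_bins; counts are made up for this example
edges = [0, 5, 10, 15, 25, 40, 65, 100]
counts = [10, 20, 30, 40, 50, 30, 20]

def density_heights(counts, edges):
    """Return the height (percent per unit) of each histogram bar."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]      # width of bin i
        percent = 100 * count / total        # area of bar i
        heights.append(percent / width)      # height = area / width
    return heights

heights = density_heights(counts, edges)
widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Area of each bar is height * width; together they cover 100% of the data
total_area = sum(h * w for h, w in zip(heights, widths))
print(heights)
print(total_area)
```

Because wide bins spread their percentage over more units, a wide bin can hold more movies than a narrow one and still have a shorter bar.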

## Defining Functions ##  

Example: Create a function that takes a numerical input and triples it: $\textsf{triple}(x)=3\,x$

In [None]:
# Define our first function
def triple(x):
    return 3 * x

In [None]:
# Call and run this function
triple(3)

We can also assign a value to a name, and call the function on the name:

In [None]:
num = 4

In [None]:
# Function with input num as argument
triple(num)

In [None]:
# More interesting expressions
triple(num * 5)

## The Anatomy of a Function ##  
    
```python
def functionname(Arguments_Parameters_Expressions_or_Values):     
      return return_expression
```
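
As a concrete instance of this template, here is a small sketch with each part labeled in comments. The function `double` is made up for this example.

```python
def double(x):                      # def, then the function name, then the parameter
    """Return twice the input."""   # optional docstring describing the function
    result = 2 * x                  # body: any statements, indented
    return result                   # return expression: the value handed back

# Calling the function evaluates the body with x bound to the argument
print(double(5))
```

Names defined inside the body, like `result`, exist only while the function is running.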

## Functions are Type-Agnostic  ##

In [None]:
# Input string as an argument to our function
triple('ha')

In [None]:
# Create an array and view the numbers
np.arange(4)

Feed the array above into our function `triple` to see what is produced:

In [None]:
# Input an array as an argument to our function
triple(np.arange(4))
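
The function itself never checks types; the result depends on what `3 * x` means for the argument. The sketch below assumes `triple` is defined as `return 3 * x` and uses only built-in types. A NumPy array multiplies element by element, whereas a plain Python list is repeated, which is an easy difference to miss.

```python
def triple(x):
    """Return three times the input, whatever 'times' means for its type."""
    return 3 * x

print(triple(3))        # number: ordinary multiplication
print(triple('ha'))     # string: repeated three times
print(triple([1, 2]))   # list: repeated three times, not multiplied elementwise
```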

### Discussion ###

- What does the following function do?
- What type of input does it take?
- What type of output does it produce?
- What's a good name for the function?

```python
def f(s):     
      return np.round(s / sum(s) * 100, 2)
```

In [None]:
# Define our function
def f(s):
    """
    Return each element of s as a percentage of the total of s, rounded to 2 decimal places.
    """
    return np.round(s / sum(s) * 100, 2)

In [None]:
# Make an array
first_four=make_array(1,2,3,4)
first_four

In [None]:
# Input name as an argument to our function
f(first_four)

In [None]:
# Input a different array as an argument to our function
f(make_array(1, 213, 38))

### Functions Can Take Multiple Arguments ###

Example: Calculate the Hypotenuse Length of a Right Triangle


Pythagoras's Theorem: If $x$ and $y$ denote the lengths of the right-angle sides, then the hypotenuse length $h$ satisfies:

$$ h^2 = x^2 + y^2 \qquad \text{which implies}\qquad \hspace{20 pt} h = \sqrt{ x^2 + y^2 } $$

In [None]:
# Define our function with two arguments
def hypotenuse(x, y):
    hypot_squared = (x ** 2 + y ** 2)
    hypot = hypot_squared ** 0.5
    return hypot

In [None]:
# Test out our function with values
hypotenuse(1, 2)

In [None]:
# Run our function with different values
hypotenuse(3, 4)

We could've typed the body all in one line. Do you find this more readable or less readable than the original version?

In [None]:
# Another way to write our function
def hypotenuse(x,y):
    return (x ** 2 + y ** 2) ** 0.5

In [None]:
# Test it out
hypotenuse(9, 12)

### Example: A function that takes the year of birth of a person and produces their age in years. ###

In [None]:
# Define our age function
def age(year):
    age = 2023 - year
    return age

In [None]:
# Run our function
age(1942)

Now add some bells and whistles: take a person's name and year of birth (two arguments), and produce a sentence that states how old they are.

In [None]:
# A more refined function with two arguments
def name_and_age(name, year):
    return name + ' is ' + str(age(year)) + ' years old.'

In [None]:
# Run our function to see what it returns
name_and_age('Joe', 1942)

## Apply ##

In [None]:
# A table with office characters
ages = Table().with_columns(
    'Person', make_array('Jim', 'Pam', 'Michael', 'Creed'),
    'Birth Year', make_array(1985, 1988, 1967, 1904)
)
ages

In [None]:
# Use our age function on each entry of the Birth Year column, one at a time
make_array(age(ages.column('Birth Year').item(0)),
           age(ages.column('Birth Year').item(1)),
           age(ages.column('Birth Year').item(2)),
           age(ages.column('Birth Year').item(3)))

In [None]:
# Use the apply function to run it on every entry of the Birth Year column
ages.apply(age, 'Birth Year')

In [None]:
# Run our other function using apply
ages.apply(name_and_age, 'Person', 'Birth Year')
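
Conceptually, `apply` calls the function once per row, passing in the values from the named columns, and collects the results. The sketch below reproduces that idea with plain Python lists and list comprehensions. It is an illustration of the concept under that assumption, not the `datascience` library's actual implementation, and it returns lists rather than arrays.

```python
def age(year):
    """Age in years, using 2023 as the current year (as in the lecture)."""
    return 2023 - year

def name_and_age(name, year):
    return name + ' is ' + str(age(year)) + ' years old.'

# The two columns of the table, as plain lists
people = ['Jim', 'Pam', 'Michael', 'Creed']
birth_years = [1985, 1988, 1967, 1904]

# One argument: call age on each birth year
ages_list = [age(y) for y in birth_years]

# Two arguments: pair up the columns row by row, then call the function on each pair
sentences = [name_and_age(n, y) for n, y in zip(people, birth_years)]

print(ages_list)
print(sentences)
```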

### Saving
Remember to save your notebook before closing.
Choose **Save** from the **File** menu (and make sure you've already saved a copy in your drive).