# Lecture 7

## Functions and Apply, Group, Join, Conditionals, Iteration

## Announcements

- Gradescope HW Submissions
    - Must tag pages or points will be deducted.
- iClicker scores
    - You will have a zero for the other section of lecture. This is fine!
- Programming Basics
    - Focus on functions.

In [None]:
# imports!
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings; warnings.simplefilter('ignore') # just for the slides!
import numpy as np
from datascience import *

# Functions and Apply

### Example: counts to percent

Create a function that converts an array of counts to percentages of the total
* include docstring

In [None]:
def percents(counts):
    ''' Converts counts to percents out of the total'''
    counts = np.array(counts)  # make sure counts is an array
    total = counts.sum()
    return counts/total*100  # extra: how would you round it to two decimal places?

percents([1,2,3]) #--> should output 16.666666, 33.33333333, 50

In [None]:
help(percents)

### Example: counts to percent
* Add an argument for decimal places
* Give the second argument a default value

In [None]:
def percents(counts, decimal_places):
    ''' Converts counts to percents out of the total'''
    counts = np.array(counts)  # make sure the container is an array
    total = sum(counts)
    return np.round(counts / total * 100, decimal_places)

percents([1,2,3], 0)

In [None]:
def percents(counts, decimal_places=2):
    ''' Converts counts to percents out of the total'''
    counts = np.array(counts)  # make sure the container is an array
    total = sum(counts)
    return np.round(counts / total * 100, decimal_places)

percents([1,2,3])

In [None]:
percents([1,2,3], decimal_places=3)

# Applying Functions to Tables

## Apply

The `apply` method creates an array by calling a function on every element of the input column(s).
* First argument: Function to apply
* Other argument(s): The input column(s)

```
table_name.apply(function_name, 'column_label')
```
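
Conceptually, `apply` behaves like a loop that calls the function once per entry and collects the results. The following is a minimal plain-Python sketch of that idea, not the `datascience` implementation; `apply_sketch`, `double`, and the sample ages are made up for illustration.

```python
def apply_sketch(function, column_values):
    """Call `function` on each value and collect the results in a list."""
    return [function(value) for value in column_values]

def double(x):
    return 2 * x

ages = [27, 68, 106]  # hypothetical column values
doubled = apply_sketch(double, ages)
print(doubled)  # [54, 136, 212]
```

Note that the function is passed by name, without parentheses, so that it can be called later on each value.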

### Use apply to do custom data cleaning
* Very few people are older than 99, so replace all 3-digit ages with 100

In [None]:
mini_census = Table().with_columns(
    ['Name', ['Ashley', 'Ben', 'Charlie', 'David', 'Ella', 'Frank', 'Greta', 'Henry'], 
     'Age', [27, 68, 106, 51, 102, 27, 115, 2]]
)

mini_census

### Use `apply` to clean data
* Create a function that returns the smaller of an age and 100.
* Apply it to all people in census
* Create a new column with cleaned data

In [None]:
def cut_off_at_100(age):
    '''The smaller of age and 100'''
    return min(age, 100)

mini_census.apply(cut_off_at_100, 'Age')

In [None]:
mini_census = mini_census.with_column('Age Cleaned', mini_census.apply(cut_off_at_100, 'Age'))
mini_census

### Discussion Question

If the name of the table is `top_movies` and the name of a column is "Title", how do we find the length of each movie title?

Option|Answer
---|---
A|`top_movies.apply(len(s), 'Title')`
B|`top_movies.apply(len(), 'Title')`
C|`top_movies.apply(len, 'Title')`
D|`Title.apply(len, 'top_movies')`
E|`Title.apply(len(), 'Title')`

In [None]:
top_movies = Table.read_table('top_movies.csv')
top_movies

In [None]:
top_movies.apply(len, 'Title')
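
The options in the discussion question differ in whether `len` is passed as a function or called on the spot. A small plain-Python sketch of that difference, using made-up titles rather than `top_movies.csv`:

```python
titles = ['Avatar', 'Titanic', 'Jaws']  # hypothetical titles

# Passing the function object `len` lets the caller invoke it once per element.
title_lengths = list(map(len, titles))
print(title_lengths)  # [6, 7, 4]

# Writing len() calls the function immediately with no argument,
# which raises an error before anything else can happen.
try:
    len()
    call_failed = False
except TypeError:
    call_failed = True
print(call_failed)  # True
```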

## Example: Prediction

### Sir Francis Galton

* 1822 - 1911 (knighted in 1909)
* A pioneer in making predictions
* Particular interest in heredity
* Charles Darwin's half-cousin




### Can we predict the height of a child, given the height of their parents?
* read in heights from `galton.csv`
* relabel columns for convenience

In [None]:
galton = (
    Table
    .read_table('galton.csv')
    .relabeled('midparentHeight', 'midparent')
    .relabeled('childHeight', 'child')
)

heights = galton.select('midparent', 'mother', 'father', 'child')
heights

## Can we predict the height of a child, given the height of their parents?

### scatterplot of child vs. mother/father
* Children's height influenced by a combination of the height of the mother and father

In [None]:
heights.select('child', 'mother', 'father').scatter('child')

### scatterplot of midparent vs child height
* Galton calculated the variable `midparent` that is a weighted average of the parents' height

In [None]:
heights.scatter('midparent', 'child')

### Can we predict the height of a child, given the midparent height?
* Use the current dataset to inform a prediction of new, unseen parents/children.

In [None]:
heights.scatter('midparent', 'child')
_ = plt.plot([67.5, 67.5], [50, 85], color='red', lw=2)
_ = plt.plot([68.5, 68.5], [50, 85], color='red', lw=2)
_ = plt.scatter(68, 66.24, color='gold', s=40)

### Can we predict the height of a child, given the midparent height?

* Given the midparent height, restrict to nearby examples in the dataset (within 0.5 in).
* Take the average child height within these nearby examples.
* This average is our guess!

In [None]:
def predict_child(mp):
    '''Returns a child's predicted height, given the midparent height, mp.'''
    nearby = heights.where('midparent', are.between_or_equal_to(mp - 0.5, mp + 0.5))
    return nearby.column('child').mean()

predict_child(68)
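
The same "average the nearby examples" idea can be sketched without a table. This is only an illustration under stated assumptions: the `pairs` data is invented (it is not the Galton data), and `predict_child_sketch` and its `width` parameter are hypothetical names.

```python
# Hypothetical (midparent, child) pairs in inches, not the Galton data.
pairs = [(67.0, 65.0), (67.6, 66.0), (68.0, 68.0), (68.4, 67.0), (69.2, 71.0)]

def predict_child_sketch(mp, width=0.5):
    """Average child height among pairs whose midparent is within `width` of mp."""
    nearby = [child for midparent, child in pairs if abs(midparent - mp) <= width]
    return sum(nearby) / len(nearby)

prediction = predict_child_sketch(68)
print(prediction)  # 67.0
```

If no pair falls inside the window, this sketch fails with a `ZeroDivisionError`, so the window has to be wide enough to contain at least one example.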

### Can we predict the height of a child, given the midparent height?
* Apply our function to all our examples
* Create a new column called `prediction` and plot the output 

In [None]:
heights = heights.with_column('prediction', heights.apply(predict_child, 'midparent'))
heights

In [None]:
heights.select('midparent', 'child', 'prediction').scatter('midparent')

# `apply` using Multiple Arguments

### `apply` using Multiple Arguments

The apply method creates an array by calling a function on every element in one or more input columns.
* First argument: Function to apply
* Other argument(s): The input column(s)
* If no columns are supplied, the function is called on each entire row

```
table_name.apply(function_name, 'column_label1',...,'column_labelN')
```

### Recreate midparent using a function
* Weighted average of father (1.0) and mother (1.08)
* Compare runtime of applying a custom function vs. numpy column operations

In [None]:
def calc_mp(mother, father):
    return (father + 1.08 * mother) / 2

In [None]:
%timeit heights.apply(calc_mp, 'mother', 'father')

In [None]:
%timeit (heights.column('father') + heights.column('mother') * 1.08)/2
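
Both timed cells compute the same kind of result; they differ in whether the arithmetic happens one row at a time or on whole columns. A minimal sketch of that contrast, assuming made-up heights and an arbitrary illustrative `weight` (the names `first`, `second`, and `weighted_mid` are hypothetical):

```python
import numpy as np

# Hypothetical heights in inches; `weight` is an arbitrary illustrative value.
first = np.array([64.0, 66.5, 62.0])
second = np.array([70.0, 68.5, 72.0])
weight = 1.1

def weighted_mid(a, b):
    return (a + weight * b) / 2

# Element by element, the way apply calls the function once per row.
one_at_a_time = np.array([weighted_mid(a, b) for a, b in zip(first, second)])

# Whole columns at once, using NumPy array arithmetic.
all_at_once = (first + weight * second) / 2

print(np.allclose(one_at_a_time, all_at_once))  # True
```

The values agree; the array version is usually faster because the loop over rows runs inside NumPy rather than in Python.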

# Grouping

Classifying variables

## Our familiar NBA data...

In [None]:
#: read from csv and relabel
nba = Table.read_table('nba_salaries.csv').relabeled("'15-'16 SALARY", 'SALARY')
nba

## How big is each team?

- We know how to do this: `.group()`.
- Can visualize distribution of team sizes with `.hist()`.

In [None]:
nba.group('TEAM')#.hist('count')

## How much does each team pay in payroll?

- Instead of counting, we want to sum the `SALARY` column.

- `sum` is applied to all columns (besides `TEAM`)
- Notice how columns get renamed automatically.
- But we can't sum all columns. E.g., `PLAYER`.
- In those cases: empty column.
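
By analogy with the `nba.group('POSITION', max)` call used later in this lecture, the cell for this slide would presumably be `nba.group('TEAM', sum)`. What grouping with `sum` computes can be sketched in plain Python; the rows below are invented and are not taken from `nba_salaries.csv`.

```python
# Hypothetical rows of (team, salary in millions); not the real data.
rows = [('Hawks', 10.0), ('Celtics', 4.0), ('Hawks', 2.5), ('Celtics', 6.0), ('Nets', 1.0)]

payroll = {}
for team, salary in rows:
    # Add this salary to the running total for the row's team.
    payroll[team] = payroll.get(team, 0) + salary

print(payroll)  # {'Hawks': 12.5, 'Celtics': 10.0, 'Nets': 1.0}
```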

## Which position has the highest average salary?

- We need to group by position.
- Within each group, find the average.
- Then sort by average salary.

In [None]:
nba.group('POSITION', np.mean)#.sort('SALARY mean', descending=True)

## What is the max salary of each position?

- Group by position.
- Within each group, use `max`.

In [None]:
nba.group('POSITION', max)

## Discussion question

Does Zaza Pachulia play for the Washington Wizards?

A. Yes  
B. No  
C. I cannot tell from this table.
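
A point worth keeping in mind for this question: the group function is applied to each column separately, so the values in one output row need not come from the same input row. A plain-Python sketch with invented players and teams:

```python
# Hypothetical rows of (position, player, team); not the real data.
rows = [
    ('C', 'Alex', 'Wizards'),
    ('C', 'Zed', 'Bucks'),
    ('PG', 'Yuri', 'Hawks'),
]

maxes = {}
for position, player, team in rows:
    best_player, best_team = maxes.get(position, (player, team))
    # Each column keeps its own maximum, independent of the other column.
    maxes[position] = (max(best_player, player), max(best_team, team))

print(maxes['C'])  # ('Zed', 'Wizards')
```

In this made-up data Zed plays for the Bucks, yet the grouped row pairs him with the Wizards, because the alphabetically last player and the alphabetically last team were found independently.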

## For each position, which team has the most players at that position?

- We want to count...
- but sizes of groups within groups.
- i.e., sizes of position groups within teams.

In [None]:
nba.group(['TEAM', 'POSITION'])#.sort('count', descending=True).sort('POSITION', distinct=True)

## What is the number of players at each position on *every* team?

In [None]:
nba.group(['TEAM', 'POSITION'])

## A better approach: `.pivot()` to create a two-way table

In [None]:
nba.pivot('POSITION', 'TEAM')
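
A two-way table of counts has one row per team and one column per position. The idea can be sketched in plain Python with `collections.Counter`; the rows are invented, and the dictionary layout is just one convenient way to display the result.

```python
from collections import Counter

# Hypothetical rows of (team, position); not the real data.
rows = [('Hawks', 'C'), ('Hawks', 'PG'), ('Hawks', 'PG'), ('Nets', 'C'), ('Nets', 'SG')]

counts = Counter(rows)
teams = sorted({team for team, _ in rows})
positions = sorted({position for _, position in rows})

# One row per team, one column per position; missing combinations count as 0.
two_way = {team: [counts[(team, position)] for position in positions] for team in teams}

print(positions)  # ['C', 'PG', 'SG']
print(two_way)    # {'Hawks': [1, 2, 0], 'Nets': [1, 0, 1]}
```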

## `.pivot()` can do more than count...

- What is the *average* salary of each position on every team?

In [None]:
nba.pivot('POSITION', 'TEAM', 'SALARY', np.mean)

# Join

Combining columns from two different tables

## Example: Drinks

In [None]:
#: table of products
products = Table(['Location', 'Product', 'Price']).with_rows([
    ['Cups', 'Green Tea', 1.25],
    ['Cups', 'Latte', 2.50],
    ['Cups', 'Drip Coffee', 1.00],
    ['Art of Espresso', 'Espresso', 2.00],
    ['Art of Espresso', 'Latte', 3.00],
    ['Perks', 'Drip Coffee', 1.25],
    ['Perks', 'Green Tea', 1.50]
])
products

## Example: Drinks

In [None]:
#: table of coupons
#: discounts are fractions off the price (.10 means 10% off)

coupons = Table(['Location', 'Discount']).with_rows([
    ['Cups', .10],
    ['Art of Espresso', .25]
])
coupons

## How do we calculate discounted price of each product?

- Idea: "cross-reference" tables.
- I.e., for each row in `products`, find discount in `coupons` for that row's `Location`.
- This is what `.join()` does:

In [None]:
discounted = products.join('Location', coupons)
discounted

In [None]:
discounted.with_column(
    'Discounted Price',
    np.round(discounted.column('Price') * (1 - discounted.column('Discount')), 2)
)

## The `.join()` method:

- `this_table.join(common_column, that_table)`
- Only contains rows with values of `common_column` which appear in *both* tables.
    - For example, Perks was omitted.
- What if the "common columns" have different names?
- `this_table.join(this_column, that_table, that_column)`
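
The matching rule described above can be sketched in plain Python with a dictionary lookup. This only illustrates which rows survive a join; it is not how the `datascience` library implements `.join()`, and it makes no claim about the order of the resulting rows.

```python
products = [
    ('Cups', 'Green Tea', 1.25),
    ('Cups', 'Latte', 2.50),
    ('Cups', 'Drip Coffee', 1.00),
    ('Art of Espresso', 'Espresso', 2.00),
    ('Art of Espresso', 'Latte', 3.00),
    ('Perks', 'Drip Coffee', 1.25),
    ('Perks', 'Green Tea', 1.50),
]
coupons = {'Cups': 0.10, 'Art of Espresso': 0.25}

# Keep a product only if its location has a coupon, and attach the discount.
joined = [
    (location, product, price, coupons[location])
    for location, product, price in products
    if location in coupons
]

locations = sorted({row[0] for row in joined})
print(len(joined))  # 5
print(locations)    # ['Art of Espresso', 'Cups']
```

The two Perks rows are dropped because Perks has no entry in `coupons`.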

## Common Columns with Different Names

In [None]:
cafes = coupons.relabeled('Location', 'Cafe')
cafes

In [None]:
products

In [None]:
products.join('Location', cafes, 'Cafe')

# Conditionals

## Conditionals

- Do something if an expression is `True`.
- Syntax (don't forget the colon):


    if <condition>:
        <body>
            
- Indentation matters!

In [None]:
#: in San Diego
is_sunny = True

if is_sunny:
    print('Wear sunglasses!')

## Conditionals

- `else`: do something else if condition is `False`

In [None]:
#: in San Diego
is_sunny = False

if is_sunny:
    print('Wear sunglasses')
else:
    print('Stay inside')

## Conditionals

- `elif`: If original condition is `False`, check another condition.
    - stands for "else, if"
- Checks conditions one by one until first `True` condition is found, then stops.
- "Catch" everything that remains with `else`.

In [None]:
#: in San Diego
is_raining = False
is_warm = True
is_sunny = True

if is_raining:
    print('Get an umbrella')
elif is_warm:
    print('Wear shorts')
elif is_sunny:
    print('Wear sunglasses')
else:
    print('All conditions false!')

## Example: sign function

Write a function that takes a single number and prints
- "positive" if it is a positive number
- "negative" if it is a negative number
- "neither" if it is zero

In [None]:
def sign(x):
    if x > 0:
        print('positive')
    elif x < 0:
        print('negative')
    else:
        print('neither')

In [None]:
sign(7)

In [None]:
sign(-2)

In [None]:
sign(0)
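
Note that `sign` prints its answer rather than returning it. A short sketch of the difference, using two hypothetical variants named `sign_printed` and `sign_returned`:

```python
def sign_printed(x):
    if x > 0:
        print('positive')
    elif x < 0:
        print('negative')
    else:
        print('neither')

def sign_returned(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'neither'

printed_result = sign_printed(7)    # displays "positive", but the value is None
returned_result = sign_returned(7)  # the string itself
print(printed_result, returned_result)  # None positive
```

A printed value is only displayed; a returned value can be stored in a name or used with `apply`.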

## Discussion question

```
def func(a, b):
    if (a + b > 4 and b > 0):
        return 'foo'
    elif (a*b >= 4 or b < 0):
        return 'bar'
    else:
        return 'baz'
```

What is returned when `func(2, 2)` is called?

- A) foo
- B) bar
- C) baz
- D) more than one of the above
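
One way to check an answer is to evaluate each condition separately, in the order Python tests them. The names `first_condition` and `second_condition` are introduced here only for illustration.

```python
def func(a, b):
    if (a + b > 4 and b > 0):
        return 'foo'
    elif (a*b >= 4 or b < 0):
        return 'bar'
    else:
        return 'baz'

a, b = 2, 2
first_condition = a + b > 4 and b > 0    # 4 > 4 is False
second_condition = a * b >= 4 or b < 0   # 4 >= 4 is True
result = func(a, b)
print(first_condition, second_condition, result)  # False True bar
```

A function returns at most once per call, so only the first branch whose condition is `True` determines the result.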

## Using parentheses...

Instead of:

    if (a + b > 4 and b > 0):
        ...

You might prefer: 

    if (a + b > 4) and (b > 0):
        ...
        
They do the same thing, because comparison operators such as `>` bind more tightly than `and`, so each comparison is evaluated before the results are combined.
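
As a quick sanity check, the two spellings can be compared over a small grid of integers (the range is an arbitrary choice for illustration):

```python
values = range(-3, 6)

# Compare the unparenthesized and parenthesized forms for every (a, b) pair.
same = all(
    (a + b > 4 and b > 0) == ((a + b > 4) and (b > 0))
    for a in values
    for b in values
)
print(same)  # True
```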

## Example: the other one

- Develop a function which takes a 2-element array and a value.
- If the value is:
    - the first element, return the second.
    - the second element, return the first.
    
    
    >>> choices = make_array('moon', 'sun')
    >>> other_one(choices, 'moon')
    'sun'
    >>> other_one(choices, 'sun')
    'moon'

In [None]:
def other_one(a, value):
    if value == a.item(0):
        return a.item(1)
    elif value == a.item(1):
        return a.item(0)
    else:
        print('Invalid input!')

In [None]:
choices = make_array('moon', 'sun')
other_one(choices, 'moon')

# Iteration

We can use Python to help automate our job at NASA:

In [None]:
#: counting down...
import time

print("Launching in...")
print("t-minus", 10)
time.sleep(1)
print("t-minus", 9)
time.sleep(1)
print("t-minus", 8)
time.sleep(1)
print("t-minus", 7)
time.sleep(1)
print("t-minus", 6)
time.sleep(1)
print("t-minus", 5)
time.sleep(1)
print("t-minus", 4)
time.sleep(1)
print("t-minus", 3)
time.sleep(1)
print("t-minus", 2)
time.sleep(1)
print("t-minus", 1)
time.sleep(1)
print("Blast off!")

## Better approach: use a `for`-loop.

In [None]:
print("Launching in...")

for t in [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]:
    print("t-minus", t)
    time.sleep(1)
    
print("Blast off!")

## `for`-loops

- Do something for every value in a sequence
- Syntax (don't forget the colon):

```
for <loop variable> in <sequence>:
    <body>
```

- Indentation matters!


In [None]:
#: loop variable can be anything
for x in [1, 2, 3, 4]:
    print(x ** 2)

## Ranges

- We can use `np.arange` to create sequences to iterate over:

In [None]:
#: count to 9, starting from 0
for x in np.arange(10):
    print(x)

In [None]:
#: countdown
for x in np.arange(10, 0, -1):
    print(x)
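
With `np.arange(start, stop, step)`, the start value is included and the stop value is not. A short sketch showing the two ranges used above as plain lists:

```python
import numpy as np

count_up = np.arange(10)          # 0, 1, ..., 9
countdown = np.arange(10, 0, -1)  # 10, 9, ..., 1

print(count_up.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(countdown.tolist())  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

The countdown range holds the same values as the hand-written list in the launch example.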

## Iterating over array by indexing

In [None]:
#: use np.arange(size)

flavors = make_array('Chocolate', 'Vanilla', 'Strawberry')

for index in np.arange(flavors.size):
    print('Flavor at index', index, 'is', flavors.item(index))

## Building an array by iterating

- How many letters are in each name?
- We want to save our results!
- Use `np.append`: returns a new array with the element added at the end, so we reassign the result.

In [None]:
#: names
names = ['Whitney', 'Xiang', 'Yekaterina', 'Zahara']

name_lengths = make_array()

for name in names:
    name_lengths = np.append(name_lengths, len(name))
    
name_lengths
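
One detail to expect in the output: the lengths come back as floats. The sketch below uses plain NumPy and assumes that `np.array([])` stands in for `make_array()` called with no arguments; an empty NumPy array defaults to a floating-point type, and appending integers to it keeps that type.

```python
import numpy as np

names = ['Whitney', 'Xiang', 'Yekaterina', 'Zahara']

# Assumption: np.array([]) stands in for make_array() with no arguments.
name_lengths = np.array([])
for name in names:
    # np.append returns a new array; the original is left unchanged.
    name_lengths = np.append(name_lengths, len(name))

print(name_lengths.tolist())  # [7.0, 5.0, 10.0, 6.0]
print(name_lengths.dtype)     # float64
```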