In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 12

In this lecture we introduce more complex boolean expressions, conditionals, and for loops.

---

## Boolean expressions

We have already seen basic boolean expressions:

In [2]:
3 > 1

True

In [3]:
type(3 > 1)

bool

In [4]:
type(True)

bool

Recall that single `=` is **assignment**.  Thus the following is an error:

```python
3 = 3.0
```

Equality:

In [5]:
3 == 3.0

True

Inequality: 

In [6]:
10 != 2

True

Using variables in boolean expressions:

In [7]:
x = 14
y = 3 # assigning values to variables, with one equal sign

In [8]:
x > 15 # comparison operators

False

In [9]:
12 < x

True

In [10]:
x < 20

True

Compound boolean expressions:

In [11]:
12 < x < 20

True

(The comparison `12 < x < 20` is equivalent to `12 < x and x < 20`.)

In [12]:
12 < x and x < 20

True

Note: `or` is non-exclusive

In [13]:
False or True # an 'or' expression is True when at least one of its operands is True

True

In [14]:
True or True

True

In [15]:
False and True # an 'and' expression is True only when every operand is True

False

In [16]:
12 < x or x < 12 # 12 < x is True and x < 12 is False
# with 'or', only one operand needs to be True for the entire expression to be True

True

In [17]:
12 < x and x < 12 # 12 < x is True and x < 12 is False
# with 'and', every operand needs to be True for the entire expression to be True

False
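
As a compact summary of the cells above, a truth table lists every combination of two booleans together with the result of `and` and `or`. This is a minimal sketch in plain Python; the helper name `truth_table` is made up for illustration.

```python
from itertools import product

def truth_table():
    """Return rows (a, b, a and b, a or b) for every pair of booleans."""
    rows = []
    for a, b in product([False, True], repeat=2):
        rows.append((a, b, a and b, a or b))
    return rows

for a, b, both, either in truth_table():
    print(a, b, '| and:', both, '| or:', either)
```

Note the last row: `True or True` is `True`, which is what "non-exclusive" means.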

---
<center> return to slides </center>

---

## Boolean Expressions with Arrays

Just as arrays can be used in mathematical expressions we can also apply boolean operations to arrays.  They are applied element-wise.

In [18]:
pets = make_array('cat', 'cat', 'dog', 'cat', 'dog', 'rabbit')
pets

array(['cat', 'cat', 'dog', 'cat', 'dog', 'rabbit'],
      dtype='<U6')

In [19]:
pets == 'cat'

array([ True,  True, False,  True, False, False], dtype=bool)

How many cats?

In [20]:
sum(pets == 'cat')
# first, pets == 'cat' is evaluated element-wise, giving an array of boolean values:
# array([ True,  True, False,  True, False, False], dtype=bool)
# when the array is summed, each True counts as 1 and each False counts as 0:
# 1 + 1 + 0 + 1 + 0 + 0 = 3
# so summing an array of boolean values (or of 1s and 0s) effectively
# counts the number of Trues (or 1s)

3

Math with booleans:
- What is the average of the cat indicator, that is, the fraction of pets that are cats?
- Let's create an array `is_cat` holding `True` values for cats and `False` for not cats
- Let's create an array where we put `1` for cat and `-1` for not cat

In [21]:
is_cat = (pets == 'cat') # True where the pet is a cat, False otherwise
is_cat

array([ True,  True, False,  True, False, False], dtype=bool)

In [27]:
is_cat_fixed = is_cat * 2 - 1

In [28]:
# is_cat_fixed is array([ 1,  1, -1,  1, -1, -1]): 1 for cat, -1 for not cat

np.mean(is_cat_fixed)

0.0
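
The `0.0` above follows from the data: there are three cats and three non-cats, so the `1` and `-1` values cancel. The sketch below reproduces both averages with plain NumPy. It uses `np.array` in place of `make_array`, which is an assumption made only so the example runs without the `datascience` package.

```python
import numpy as np

pets = np.array(['cat', 'cat', 'dog', 'cat', 'dog', 'rabbit'])

is_cat = pets == 'cat'            # boolean array, True where the pet is a cat
fraction_cats = np.mean(is_cat)   # True counts as 1, False as 0

signed = is_cat * 2 - 1           # True -> 1, False -> -1
mean_signed = np.mean(signed)

print(fraction_cats)    # 0.5
print(signed.tolist())  # [1, 1, -1, 1, -1, -1]
print(mean_signed)      # 0.0
```

The mean of a boolean array is the proportion of `True` values, while the mean of the `1`/`-1` version measures how far the cats outnumber the non-cats.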

---

<center> return to slides </center>

---

## Rows & Apply

Just as we can access individual columns in a table we can also access individual rows. 
- pull out the first row of the table and assign it to a variable `r` (using the `.row()` method)
- check the type of `r`
- get a value of the row (using the `item()` method). Get e.g. the 'Year' (0th element) and the 'Extraversion' (1st element)

In [29]:
survey = Table.read_table('data/welcome_survey_w24_cleaned.csv')
survey.show(3)

Year,Extraversion,Number of texts,Handedness,Sleep position,Hours of Sleep,Siblings,Pets,Random Number,Tattoo,Commute
Second,3,5,Right,Back,6,1,cat,4635,No,walk
Second,3,30,Right,Left side,8,1,1,1025,Yes,walk
First,2,5,Right,Back,8,2,Jack Mackerel,7682,No,walk


In [33]:
r = survey.row(0)
type(r) # r is a datascience Row, not an array; only the last expression in the cell (r) is displayed
r

Row(Year='Second', Extraversion=3.0, Number of texts=5.0, Handedness='Right', Sleep position='Back', Hours of Sleep=6.0, Siblings='1', Pets='cat', Random Number=4635, Tattoo='No', Commute='walk')

Getting a field from a row:

In [34]:
r.item('Year')

'Second'

### Math On Rows

Suppose we get a row that contains only numbers:

In [35]:
r2 = survey.select("Extraversion", "Number of texts", "Hours of Sleep").row(2)
r2

Row(Extraversion=2.0, Number of texts=5.0, Hours of Sleep=8.0)

We can apply aggregation functions to that row. Try e.g. `sum()`

In [36]:
sum(r2) # adds the Extraversion rating, Number of texts, and Hours of Sleep together
# what does the sum effectively tell you?
# is there insight gained from the output?

15.0

What if the row does NOT contain only numbers?

In [37]:
r3 = survey.select("Year", "Extraversion", "Number of texts", "Hours of Sleep").row(2)
r3

Row(Year='First', Extraversion=2.0, Number of texts=5.0, Hours of Sleep=8.0)

In [38]:
sum(r3)

TypeError: unsupported operand type(s) for +: 'int' and 'numpy.str_'
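
The error arises because the built-in `sum` starts from the integer `0` and adds each item in turn, and `0 + 'First'` is not defined. The sketch below shows the same failure on a plain list and one possible workaround, keeping only the numeric values. The list `row_values` is a stand-in for `r3`, not an actual `Row`.

```python
# Stand-in for r3: one string followed by three numbers.
row_values = ['First', 2.0, 5.0, 8.0]

try:
    sum(row_values)
except TypeError as err:
    message = str(err)   # sum() begins with 0 + 'First', which fails
    print('sum failed:', message)

# Keep only the numeric entries before summing.
numeric_only = [v for v in row_values if isinstance(v, (int, float))]
total = sum(numeric_only)
print(total)  # 15.0
```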

Recall that to **apply** a function to every row of a table we use `apply`:

In [39]:
(
    survey
    .select("Extraversion", "Number of texts", "Hours of Sleep")
    .apply(sum)
)

array([  14.,   41.,   15.,   14.,   64.,  172.,   13.,   23.,   66.,
         34.,   66.,  523.,   31.,   54.,   35.,   62.,   24.,   50.,
         20.,   64.,   29.,   98.,   21.,   44.,   44.,   91.,   22.,
         40.,   16.,   21.,   39.,   19.,   26.,   26.,   65.,   34.,
         15.,   12.,   47.,   34.,   11.,   48.,   29.,   12.,  116.,
         14.,   17.,   18.,   44.,   19.,   19.])

In [40]:
survey.num_rows

51

In [42]:
len([  14.,   41.,   15.,   14.,   64.,  172.,   13.,   23.,   66.,
         34.,   66.,  523.,   31.,   54.,   35.,   62.,   24.,   50.,
         20.,   64.,   29.,   98.,   21.,   44.,   44.,   91.,   22.,
         40.,   16.,   21.,   39.,   19.,   26.,   26.,   65.,   34.,
         15.,   12.,   47.,   34.,   11.,   48.,   29.,   12.,  116.,
         14.,   17.,   18.,   44.,   19.,   19.])

51

Let's use this insight to improve our pivot table:

In [43]:
p = survey.pivot("Sleep position", "Hours of Sleep")
p.show()

Hours of Sleep,Back,Left side,Right side,Stomach
5,0,0,1,0
6,2,4,0,1
7,1,5,14,0
8,3,4,6,1
9,2,2,0,3
10,0,0,1,0
13,0,0,0,1


**Exercise:** Add the row totals to the table:

In [47]:
totals = p.drop('Hours of Sleep')
totals = totals.apply(sum)
p.with_column('Total', totals)

Hours of Sleep,Back,Left side,Right side,Stomach,Total
5,0,0,1,0,1
6,2,4,0,1,7
7,1,5,14,0,20
8,3,4,6,1,14
9,2,2,0,3,7
10,0,0,1,0,1
13,0,0,0,1,1
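
As a cross-check on the `Total` column, the same row sums can be computed with plain NumPy. This is only a sketch: the counts are copied by hand from the pivot table above rather than read from `p`.

```python
import numpy as np

# One row per Hours of Sleep value (5, 6, 7, 8, 9, 10, 13);
# columns are Back, Left side, Right side, Stomach.
counts = np.array([
    [0, 0, 1, 0],
    [2, 4, 0, 1],
    [1, 5, 14, 0],
    [3, 4, 6, 1],
    [2, 2, 0, 3],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

row_totals = counts.sum(axis=1)   # sum across each row
grand_total = row_totals.sum()

print(row_totals.tolist())  # [1, 7, 20, 14, 7, 1, 1]
print(grand_total)          # 51
```

The grand total of 51 matches `survey.num_rows`, which is what we expect if every respondent lands in exactly one cell of the pivot table.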


**Exercise:** Do the same thing with a `group` and a `join`:

In [51]:
totals_group = survey.group('Hours of Sleep')
p.join('Hours of Sleep', totals_group, 'Hours of Sleep')

Hours of Sleep,Back,Left side,Right side,Stomach,count
5,0,0,1,0,1
6,2,4,0,1,7
7,1,5,14,0,20
8,3,4,6,1,14
9,2,2,0,3,7
10,0,0,1,0,1
13,0,0,0,1,1


In [52]:
Table.apply?

Signature: Table.apply(self, fn, *column_or_columns)
Docstring:
Apply ``fn`` to each element or elements of ``column_or_columns``.
If no ``column_or_columns`` provided, ``fn`` is applied to each row.

Args:
    ``fn`` (function) -- The function to apply to each element
        of ``column_or_columns``.
    ``column_or_columns`` -- Columns containing the arguments to ``fn``
        as either column labels (``str``) or column indices (``int``).
        The number of columns must match the number of arguments
        that ``fn`` expects.

Raises:
    ``ValueError`` -- if  ``column_label`` is not an existing
        column in the table.
    ``TypeError`` -- if insufficient number of ``column_label`` passed
        to ``fn``.

Returns:
    An array consisting of results of applying ``fn`` to elements
    specified by ``column_label`` in eac

---

<center> return to slides </center>

---

## Conditional Statements

Conditional statements in Python let us run different code depending on the values in our data.

In [53]:
x = 20 # assignment statement
# we are assigning value 20 to the variable name x

If the value of x is greater than or equal to 18 then print 'You can legally vote.'

In [54]:
x >= 18

True

In [57]:
if x >= 18: # control statement: if the condition is True, execute the indented code below
    print('You can legally vote.') # <---- this line runs only if x >= 18 is True

You can legally vote.


Conditionals consist of two main parts:

```python

if boolean expression here :
    # body of the if statement goes here and must be indented
```

Notice that if the boolean expression is `False`, then the body of the if statement is not executed:

In [59]:
x >= 21

False

In [58]:
print("Can you drink?")

if x >= 21: # remember x = 20, x >= 21 will be False, so we do not enter the body of this conditional statement
    print('You can legally drink.')
    print("This line of code is never run...")    

print("This is run")
print("The value of x is", x)

Can you drink?
This is run
The value of x is 20


Sometimes you want to do something else if the first statement wasn't true:

In [60]:
if x >= 21: # at this point x = 20, so x >= 21 is False and we skip to the elif
    print('You can legally vote and drink.')
elif x >= 18: # x = 20, so x >= 18 is True and we enter this branch
    print('You can legally vote.') # run this line, then exit the whole if/elif/else block
else:
    print('You can legally drink milk.')

You can legally vote.


Implementing a function with conditionals and multiple return values:

In [61]:
def age(x):
    if x >= 21:
        return 'You can legally vote and drink.'
    elif x >= 18:
        return 'You can legally vote.'
    else:
        return 'You can legally drink milk.'

In [62]:
age(3)

'You can legally drink milk.'

In [63]:
age(20)

'You can legally vote.'

In [64]:
age(23)

'You can legally vote and drink.'
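
Branches are tested from top to bottom and only the first one whose condition is `True` runs, so the order of the checks matters. The sketch below contrasts `age` with a hypothetical variant, `age_wrong_order`, that tests the weaker condition first.

```python
def age(x):
    if x >= 21:
        return 'You can legally vote and drink.'
    elif x >= 18:
        return 'You can legally vote.'
    else:
        return 'You can legally drink milk.'

def age_wrong_order(x):
    # Hypothetical variant: the x >= 18 check comes first,
    # so the x >= 21 branch can never be reached.
    if x >= 18:
        return 'You can legally vote.'
    elif x >= 21:
        return 'You can legally vote and drink.'
    else:
        return 'You can legally drink milk.'

print(age(23))              # You can legally vote and drink.
print(age_wrong_order(23))  # You can legally vote.
```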

### Putting the pieces together

Here we will build a function that returns whether a trip was one way or a round trip:

In [66]:
trip = Table.read_table('data/trip.csv')
trip.show(20)

Trip ID,Duration,Start Date,Start Station,Start Terminal,End Date,End Station,End Terminal,Bike #,Subscriber Type,Zip Code
913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139
913459,1036,8/31/2015 23:11,San Antonio Shopping Center,31,8/31/2015 23:28,Mountain View City Hall,27,35,Subscriber,95032
913455,307,8/31/2015 23:13,Post at Kearny,47,8/31/2015 23:18,2nd at South Park,64,468,Subscriber,94107
913454,409,8/31/2015 23:10,San Jose City Hall,10,8/31/2015 23:17,San Salvador at 1st,8,68,Subscriber,95113
913453,789,8/31/2015 23:09,Embarcadero at Folsom,51,8/31/2015 23:22,Embarcadero at Sansome,60,487,Customer,9069
913452,293,8/31/2015 23:07,Yerba Buena Center of the Arts (3rd @ Howard),68,8/31/2015 23:12,San Francisco Caltrain (Townsend at 4th),70,538,Subscriber,94118
913451,896,8/31/2015 23:07,Embarcadero at Folsom,51,8/31/2015 23:22,Embarcadero at Sansome,60,363,Customer,92562
913450,255,8/31/2015 22:16,Embarcadero at Sansome,60,8/31/2015 22:20,Steuart at Market,74,470,Subscriber,94111
913449,126,8/31/2015 22:12,Beale at Market,56,8/31/2015 22:15,Temporary Transbay Terminal (Howard at Beale),55,439,Subscriber,94130
913448,932,8/31/2015 21:57,Post at Kearny,47,8/31/2015 22:12,South Van Ness at Market,66,472,Subscriber,94702


In [67]:
# are there any round trips in our bike-share table?
# if so, how many?
def trip_kind(start, end):
    if start == end:
        return 'round trip'
    else: 
        return 'one way'

In [72]:
kinds_of_trips = trip.apply(trip_kind, 'Start Station', 'End Station')
sum(kinds_of_trips == 'round trip')
# add kinds_of_trips back to trip table
kinds_table = trip.with_column('Trip Kind', kinds_of_trips)
#kinds_table
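
Conceptually, `trip.apply(trip_kind, 'Start Station', 'End Station')` calls `trip_kind` once per row, passing that row's two column values. The sketch below imitates this with plain Python lists and `zip`. The first and third station pairs come from the table above; the middle pair is made up so that the example contains a round trip.

```python
def trip_kind(start, end):
    if start == end:
        return 'round trip'
    else:
        return 'one way'

# Stand-ins for the 'Start Station' and 'End Station' columns.
start_stations = ['Harry Bridges Plaza (Ferry Building)',
                  'Post at Kearny',            # made-up round trip
                  'Embarcadero at Sansome']
end_stations = ['San Francisco Caltrain (Townsend at 4th)',
                'Post at Kearny',
                'Steuart at Market']

# One call per row, with the two values from that row.
kinds = [trip_kind(s, e) for s, e in zip(start_stations, end_stations)]
num_round_trips = sum(k == 'round trip' for k in kinds)

print(kinds)            # ['one way', 'round trip', 'one way']
print(num_round_trips)  # 1
```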

Pivoting on Trip Kind:

In [77]:
kinds_pivot = (
    kinds_table
    .where('Duration', are.below(600)) # keep only the shorter trips (Duration below 600)
    .pivot('Trip Kind', 'Start Station') # count one-way and round trips for each start station
    .sort("round trip", descending=True) # stations with the most round trips at the top
    .take(np.arange(10)) # keep the 10 start stations with the most round trips
)
kinds_pivot

Start Station,one way,round trip
Embarcadero at Sansome,6938,120
Harry Bridges Plaza (Ferry Building),8643,105
San Francisco Caltrain 2 (330 Townsend),12021,104
2nd at South Park,6484,98
San Francisco Caltrain (Townsend at 4th),11181,95
2nd at Townsend,9513,83
Powell Street BART,7156,81
Market at 10th,6599,80
Civic Center BART (7th at Market),5179,73
Townsend at 7th,8073,68


---

<center> return to slides </center>

---

## Simulation

We will use simulation heavily in this class.  A key element of simulation is leveraging randomness. The NumPy library has many functions for generating random events. Today we will use the `np.random.choice` function:

In [79]:
mornings = make_array('wake up', 'sleep in')
mornings

array(['wake up', 'sleep in'],
      dtype='<U8')

In [89]:
np.random.choice(mornings)

'sleep in'

In [81]:
np.random.choice(mornings)

'sleep in'

In [82]:
np.random.choice(mornings)

'wake up'

We can also pass an argument that specifies how many times to make a random choice:

In [90]:
np.random.choice(mornings, 7)

array(['wake up', 'wake up', 'wake up', 'sleep in', 'wake up', 'wake up',
       'sleep in'],
      dtype='<U8')

In [91]:
np.random.choice(mornings, 7)

array(['wake up', 'wake up', 'wake up', 'wake up', 'wake up', 'wake up',
       'sleep in'],
      dtype='<U8')

In [92]:
morning_week = np.random.choice(mornings, 7)
morning_week

array(['wake up', 'wake up', 'wake up', 'sleep in', 'sleep in', 'sleep in',
       'sleep in'],
      dtype='<U8')

In [93]:
morning_week

array(['wake up', 'wake up', 'wake up', 'sleep in', 'sleep in', 'sleep in',
       'sleep in'],
      dtype='<U8')

In [94]:
morning_week == 'wake up'

array([ True,  True,  True, False, False, False, False], dtype=bool)

In [96]:
sum(morning_week == 'wake up')

3

In [95]:
sum(morning_week == 'sleep in')

4

In [97]:
np.mean(morning_week == 'sleep in')

0.5714285714285714

In [98]:
4/7

0.5714285714285714
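
Because `np.random.choice` draws at random, the outputs above change from run to run. The sketch below fixes a seed with `np.random.seed` so that a run is repeatable, then checks the relationship used above: the mean of a boolean array equals the count of `True` values divided by the array length. The seed value `12` is arbitrary, and `np.array` stands in for `make_array`.

```python
import numpy as np

np.random.seed(12)  # any fixed seed makes the draws repeatable

mornings = np.array(['wake up', 'sleep in'])
morning_week = np.random.choice(mornings, 7)

num_wake = np.sum(morning_week == 'wake up')
num_sleep = np.sum(morning_week == 'sleep in')
fraction_sleep = np.mean(morning_week == 'sleep in')

print(morning_week)
print(num_wake, num_sleep)              # the two counts always add up to 7
print(fraction_sleep, num_sleep / 7)    # the same number computed two ways
```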

### Playing a Game of Chance

Steps:
1. Find a way to simulate two dice rolls.
2. Compute how much money we win/lose based on the result.
3. Do steps 1 and 2 10,000 times.

### Simulating the roll of a die

In [99]:
die_faces = np.arange(1, 7)
die_faces

array([1, 2, 3, 4, 5, 6])

In [106]:
np.random.choice(die_faces)

4

**Exercise:** Implement a function that simulates a single round of play and returns the result.

In [124]:
def simulate_one_round():
    my_roll = np.random.choice(die_faces)
    your_roll = np.random.choice(die_faces)
    #print('my_roll: ', my_roll, 'your_roll: ', your_roll)
    
    if my_roll > your_roll:
        return 1 # I win: you owe me $1
    elif my_roll < your_roll:
        return -1 # I lose: I owe you $1
    else: 
        return 0 # tie: no money changes hands

In [117]:
simulate_one_round()

my_roll:  2 your_roll:  2


0

---

<center> return to slides </center>

---

## `For` Statements

The for statement is another way to apply code to each element in a list or an array.

In [118]:
print('I love my cat')
print('I love my dog')
print('I love my rabbit')

I love my cat
I love my dog
I love my rabbit


In [119]:
count = 0
for pet in make_array('cat', 'dog', 'rabbit'):
    count = count + 1
    print('for loop #:', count)
    print(pet)
    print('I love my ' + pet)

for loop #: 1
cat
I love my cat
for loop #: 2
dog
I love my dog
for loop #: 3
rabbit
I love my rabbit


**Exercise:** What is the output of this for loop?

In [121]:
for i in np.arange(1,4): # (1,2,3)
    print(i)

1
2
3


In [122]:
x = 0
for i in np.arange(1, 4): # np.arange(1,4): (1,2,3)
    x = x + i
    print(x)

print("The final value of x is:", x)

1
3
6
The final value of x is: 6


**Exercise:** Use a for loop to simulate the total outcome of 10,000 plays of our game of chance:

In [125]:
N = 10_000
winnings = 0

for i in np.arange(N):
    winnings = winnings + simulate_one_round() # add winnings (+1, -1, 0) each time

print('I win', winnings, 'dollars.')

I win -9 dollars.


**Bonus Exercise:** Use table functions to simulate 10,000 rounds of play:

In [None]:
print("My total winnings:", rolls.column("outcome").sum())
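
The cell above refers to a table named `rolls` that is not defined anywhere in this notebook, so it will not run as shown. As a hedged starting point, the sketch below simulates all the rounds at once without a loop. It uses plain NumPy arrays rather than a `Table`, so it is a stand-in for the table-based answer and not the intended solution itself; the three arrays could serve as the columns of such a table.

```python
import numpy as np

N = 10_000
die_faces = np.arange(1, 7)

# Draw every roll in one call instead of looping.
my_rolls = np.random.choice(die_faces, N)
your_rolls = np.random.choice(die_faces, N)

# np.sign gives 1 when I roll higher, -1 when you do, and 0 on a tie,
# matching the return values of simulate_one_round.
outcomes = np.sign(my_rolls - your_rolls)
total_winnings = outcomes.sum()

print('My total winnings:', total_winnings)
```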

---

<center> return to slides </center>

---

## Appending Arrays

Sometimes we will want to collect the outcomes of our simulations into a single array.  We can do this by appending each experiment to the end of an array using the numpy `np.append` function.

In [None]:
first = np.arange(4)
second = np.arange(10, 17)

In [None]:
np.append(first, 6)

In [None]:
first

In [None]:
np.append(first, second)

In [None]:
first

In [None]:
second
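
The cells above have no recorded output, so here is a small sketch of what they demonstrate: `np.append` returns a new array and leaves its arguments unchanged. The names `with_six` and `combined` are introduced only for this illustration.

```python
import numpy as np

first = np.arange(4)
second = np.arange(10, 17)

with_six = np.append(first, 6)       # a new array with 6 at the end
combined = np.append(first, second)  # a new array: first followed by second

print(with_six.tolist())   # [0, 1, 2, 3, 6]
print(combined.tolist())   # [0, 1, 2, 3, 10, 11, 12, 13, 14, 15, 16]
print(first.tolist())      # [0, 1, 2, 3]
```

To keep the appended result, assign it to a name, as in `first = np.append(first, 6)`.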

**Exercise:** Use append to record the outcomes of all the games rather than just the total.
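
One possible approach, sketched under a few assumptions: `simulate_one_round` is redefined here so the block is self-contained, `np.array([])` stands in for `make_array()`, and `N` is reduced to 1,000 to keep the example quick.

```python
import numpy as np

die_faces = np.arange(1, 7)

def simulate_one_round():
    my_roll = np.random.choice(die_faces)
    your_roll = np.random.choice(die_faces)
    if my_roll > your_roll:
        return 1
    elif my_roll < your_roll:
        return -1
    else:
        return 0

N = 1_000
outcomes = np.array([])  # start empty, then grow by one result per round

for i in np.arange(N):
    outcomes = np.append(outcomes, simulate_one_round())

print('Rounds recorded:', len(outcomes))
print('Total winnings:', outcomes.sum())
```

With every outcome stored, we can ask more than the total, for example `np.mean(outcomes == 0)` for the fraction of ties.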

### Another example: simulating heads in 100 coin tosses

Suppose we simulate 100 coin tosses.  What fraction will be heads?  What if we repeat the 100 coin tosses thousands of times?  How will the number of heads vary across repetitions?

In [None]:
coin = make_array('heads', 'tails')

In [None]:
sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Simulate one outcome

def num_heads():
    return sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Decide how many times you want to repeat the experiment

repetitions = 10000

In [None]:
# Simulate that many outcomes

outcomes = make_array()

for i in np.arange(repetitions):
    outcomes = np.append(outcomes, num_heads())

In [None]:
heads = Table().with_column('Heads', outcomes)
heads.hist(bins = np.arange(29.5, 70.6))

--- 
## Optional: Advanced `where`

Sometimes the `are.above_or_equal_to` style syntax will be painful to use.  We can instead construct an array of booleans to select rows from our table.  This will allow us to select rows based on complex boolean expressions spanning multiple columns. 

In [None]:
ages = make_array(16, 22, 18, 15, 19, 39, 27, 21)
patients = Table().with_columns("Patient Id", np.arange(len(ages)) + 1000, "Age", ages)
patients

**Exercise:** Find all the patients that are older than 21 or have a Patient Id that is even:

To compute the even patient ids, we can use the `%` modulus operator:
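
A minimal sketch of the idea, using plain NumPy arrays in place of the table columns (an assumption made so the block runs without the `datascience` package). The names `is_even_id`, `is_older_than_21`, and `keep` are illustrative.

```python
import numpy as np

ages = np.array([16, 22, 18, 15, 19, 39, 27, 21])
patient_ids = np.arange(len(ages)) + 1000

is_even_id = patient_ids % 2 == 0   # remainder 0 after dividing by 2
is_older_than_21 = ages > 21

# | is the element-wise "or" for boolean arrays.
keep = is_older_than_21 | is_even_id

selected_ids = patient_ids[keep]
print(selected_ids.tolist())  # [1000, 1001, 1002, 1004, 1005, 1006]
```

The boolean array `keep` plays the role described above: it has one entry per row and marks which rows to select. Note that Python's `or` keyword does not work element-wise on arrays, which is why `|` is used here.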