In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 12

In this lecture we introduce more complex boolean expressions, conditionals, and for loops.

---

## Boolean expressions

We have already seen basic boolean expressions:

In [2]:
3 > 1

True

In [3]:
type(3 > 1) # a comparison evaluates to a boolean
# True and False are the only two boolean values

bool

In [4]:
type(True)


bool

Recall that single `=` is **assignment**.  Thus the following is an error:

```python
3 = 3.0
```

Equality:

In [5]:
3 == 3.0

True

Inequality: 

In [6]:
10 != 2

True

Using variables in boolean expressions:

In [7]:
x = 14
y = 3

In [8]:
x > 15

False

In [9]:
12 < x

True

In [10]:
x < 20

True

Compound boolean expressions:

In [11]:
12 < x < 20

True

(The comparison `12 < x < 20` is equivalent to `12 < x and x < 20`.)

In [12]:
12 < x and x < 20 # same as 12 < x < 20
# Python evaluates this line as:
# True and True
# which is True
# an "and" expression is True only when both sides are True

True

In [13]:
True and True

True

In [18]:
False and True

False

In [19]:
True and False

False

In [20]:
False and False

False

In [None]:
# an "or" expression is True when at least one side is True

Note: `or` is non-exclusive
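
"Non-exclusive" means that `or` is also `True` when both sides are `True`. As a minimal sketch, the loop below tabulates `or` next to an exclusive or, which for two booleans can be written as `a != b`. The names `a`, `b`, and `rows` are illustrative only.

```python
# Inclusive "or": True when at least one side is True.
# Exclusive or (a != b on two booleans): True when exactly one side is True.
rows = []
for a in (True, False):
    for b in (True, False):
        rows.append((a, b, a or b, a != b))

for row in rows:
    print(row)
```

The two columns differ only in the first row, where both inputs are `True`.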

In [21]:
x

14

In [22]:
12 < x or x > 20
# True or False

True

In [23]:
12 < x or x < 12
# True or False

True

In [14]:
False or True

True

In [15]:
True or True

True

In [16]:
True or False

True

In [17]:
False or False

False

---
<center> return to slides </center>

---

## Boolean Expressions with Arrays

Just as arrays can be used in mathematical expressions, we can also apply boolean operations to arrays. They are applied element-wise.

In [24]:
pets = make_array('cat', 'cat', 'dog', 'cat', 'dog', 'rabbit')
pets

array(['cat', 'cat', 'dog', 'cat', 'dog', 'rabbit'],
      dtype='<U6')

In [25]:
pets == 'cat'


array([ True,  True, False,  True, False, False], dtype=bool)

How many cats?

In [26]:
sum(pets == 'cat')
# first Python evaluates the comparison inside the parentheses
# sum(array([ True,  True, False,  True, False, False]))
# then each boolean is treated as an int (True is 1, False is 0)
# sum(array([1,       1,   0,         1, 0,       0]))
# 3

3

Math with booleans:
- What fraction of the pets are cats?
- Let's create an array `is_cat` holding `True` values for cats and `False` for not cats
- Let's create an array where we put `1` for cat and `-1` for not cat

In [27]:
# fraction of pets that are cats
np.average(pets == 'cat')

0.5

In [29]:
is_cat = (pets == 'cat')
is_cat

array([ True,  True, False,  True, False, False], dtype=bool)

In [35]:
pets

array(['cat', 'cat', 'dog', 'cat', 'dog', 'rabbit'],
      dtype='<U6')

In [34]:
# how many non-cats do we have?
len(is_cat) - sum(is_cat)
# 6 pets    -   3 cats    
# 3 non-cats (dogs and rabbits)

3

In [31]:
(is_cat * 2) - 1
# the booleans are converted to ints before the multiplication:
# array([ True,  True, False,  True, False, False], dtype=bool)
# becomes
# array([1,       1,   0,         1, 0,       0])


array([ 1,  1, -1,  1, -1, -1])

In [32]:
np.mean(is_cat)

0.5
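
The cells above rely on `True` counting as 1 and `False` as 0. A minimal sketch of the same three computations follows, written with plain NumPy so that it runs without the `datascience` package. `np.count_nonzero` and `np.where` are alternatives to the `sum` and `(is_cat * 2) - 1` forms used above, not what the lecture itself uses.

```python
import numpy as np

pets = np.array(['cat', 'cat', 'dog', 'cat', 'dog', 'rabbit'])
is_cat = pets == 'cat'                 # element-wise comparison

num_cats = np.count_nonzero(is_cat)    # number of True entries
frac_cats = np.mean(is_cat)            # True counts as 1, False as 0
signs = np.where(is_cat, 1, -1)        # 1 for cat, -1 for not cat

print(num_cats, frac_cats, signs)
```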

---

<center> return to slides </center>

---

## Rows & Apply

Just as we can access individual columns in a table we can also access individual rows. 
- Pull out the first row of the table and assign it to a variable `r` (using the `.row()` method).
- Check the type of `r`.
- Get a value from the row (using the `.item()` method), e.g. the 'Year' (index 1) and the 'Extroversion' (index 2).

In [39]:
survey = Table.read_table('data/classdatasurvey_s24_relabeled.csv') 
survey.show(3)

Timestamp,Year,Extroversion,Texts,Handedness,Sleep Position,Hours of Sleep,Siblings,Pets,Random Number,Tattoos,Commute
4/23/2024 9:31:24,First,1,12,Both,Stomach,4,0,0,1122,No,Car
4/23/2024 9:31:52,First,2,200,Right,Right side,7,1,0,1417,No,Bus
4/23/2024 9:32:06,Second,5,30,Right,Left side,6,4,0,1234,No,walk


In [43]:
r = survey.row(0)
r
# tables.Row data type

Row(Timestamp='4/23/2024 9:31:24', Year='First', Extroversion=1, Texts=12.0, Handedness='Both', Sleep Position='Stomach', Hours of Sleep=4.0, Siblings=0, Pets='0', Random Number=1122, Tattoos='No', Commute='Car')

In [44]:
type(r)

datascience.tables.Row

In [45]:
r.item(1)
#or
r.item('Year')

'First'

In [47]:
r.item(2)
#or
r.item('Extroversion')

1

Getting a field from a row

### Math On Rows

Suppose we get a row that contains only numbers:

In [48]:
r2 = survey.select("Extroversion", "Texts", "Hours of Sleep").row(2) # student in position 2, or the third row

r2 # represents the third student in the survey table

Row(Extroversion=5, Texts=30.0, Hours of Sleep=6.0)

We can apply aggregation functions to that row. Try e.g. `sum()`

In [49]:
sum(r2)
# this sum is not meaningful: although we are adding three numbers together
# 5 + 30.0 + 6.0
# they measure different things:
# extroversion rating, number of texts, hours of sleep

41.0

What if the row does NOT contain only numbers?

In [50]:
r3 = survey.select("Year", "Extroversion", "Texts", "Hours of Sleep").row(2)
r3

Row(Year='Second', Extroversion=5, Texts=30.0, Hours of Sleep=6.0)

In [51]:
sum(r3)
# we get an error because a string cannot be added to a number
# sum starts from 0, so we are trying to do the following
# 0 + 'Second' + 5 + 30.0 + 6.0

TypeError: unsupported operand type(s) for +: 'int' and 'numpy.str_'

Recall that if we wanted to **apply** a function to all the rows of a table we use `apply`

In [53]:
len(
    survey
    .select("Extroversion", "Texts", "Hours of Sleep")
    .apply(sum)
)
# we selected three columns from our survey table
# and applied the function sum to every row
# above, we took one student and summed their Extroversion + Texts + Hours of Sleep
# here the same operation is done for all students, giving one sum per row


80

In [54]:
survey.num_rows

80

Let's use this insight to improve our pivot table:

In [55]:
survey

Timestamp,Year,Extroversion,Texts,Handedness,Sleep Position,Hours of Sleep,Siblings,Pets,Random Number,Tattoos,Commute
4/23/2024 9:31:24,First,1,12,Both,Stomach,4,0,0,1122,No,Car
4/23/2024 9:31:52,First,2,200,Right,Right side,7,1,0,1417,No,Bus
4/23/2024 9:32:06,Second,5,30,Right,Left side,6,4,0,1234,No,walk
4/23/2024 9:33:01,Second,5,30,Right,Left side,6,4,0,1234,No,walk
4/23/2024 9:33:02,First,7,15,Right,Left side,9,2,,1475,No,Skateboard
4/23/2024 9:37:21,Second,3,15,Right,Right side,7,2,"cats, dogs, fish",8457,No,Car
4/23/2024 9:38:27,First,7,100,Right,Right side,8,7,4 black cats,6969,Yes,walk
4/23/2024 9:38:53,First,1,16,Right,Stomach,9,1,1,3246,No,walk
4/23/2024 9:39:14,First,7,100,Right,Right side,8,7,4 black cats,6902,Yes,walk
4/23/2024 9:39:48,First,5,100,Right,Right side,8,2,2 cats,3497,No,Bike


In [57]:
p = survey.pivot("Sleep Position", "Hours of Sleep")
p.show()

Hours of Sleep,Back,Left side,Right side,Stomach
4.0,1,0,0,1
5.0,1,1,3,0
6.0,6,3,9,0
7.0,11,4,6,6
7.5,0,0,1,0
8.0,2,6,10,2
9.0,0,5,0,2


**Exercise:** Add the row totals to the table:

In [73]:
totals = p.drop('Hours of Sleep')
totals = totals.apply(sum)
# when .apply is called without a column label,
# it applies the function to each row of the table
totals
# the totals array has 7 elements,
# one for each of the 7 rows in our pivot table

array([ 2,  5, 18, 27,  1, 20,  7])
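
As a cross-check on these row totals, here is a minimal sketch that recomputes them with plain NumPy. The counts are copied by hand from the pivot table output above, and the name `counts` is illustrative.

```python
import numpy as np

# Counts copied from the pivot table above
# (columns: Back, Left side, Right side, Stomach).
counts = np.array([
    [1, 0, 0, 1],
    [1, 1, 3, 0],
    [6, 3, 9, 0],
    [11, 4, 6, 6],
    [0, 0, 1, 0],
    [2, 6, 10, 2],
    [0, 5, 0, 2],
])

row_totals = counts.sum(axis=1)   # axis=1 sums across the columns of each row
print(row_totals)
print(row_totals.sum())           # grand total over all rows
```

The grand total equals `survey.num_rows`, so every student is counted exactly once. In the notebook itself, the totals could then be attached to the pivot table with something like `p.with_column('Total', totals)`.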

In [72]:
survey.apply(np.min, 'Hours of Sleep')
# we previously introduced .apply this way:
# the function is applied to each value in a column of the table
# and the results are returned as an array

# here the function receives one number at a time, so the built-in
# sum and min (which expect a collection) raise an error;
# np.min and np.sum accept a single number

array([ 4. ,  7. ,  6. ,  6. ,  9. ,  7. ,  8. ,  9. ,  8. ,  8. ,  6. ,
        8. ,  7. ,  8. ,  9. ,  8. ,  8. ,  5. ,  5. ,  9. ,  7. ,  8. ,
        5. ,  8. ,  8. ,  5. ,  7. ,  7. ,  8. ,  7.5,  6. ,  9. ,  7. ,
        7. ,  7. ,  7. ,  7. ,  6. ,  7. ,  7. ,  6. ,  7. ,  8. ,  9. ,
        7. ,  7. ,  6. ,  8. ,  7. ,  8. ,  6. ,  7. ,  7. ,  6. ,  7. ,
        6. ,  6. ,  8. ,  6. ,  7. ,  7. ,  6. ,  6. ,  7. ,  7. ,  5. ,
        6. ,  7. ,  8. ,  8. ,  6. ,  6. ,  9. ,  8. ,  8. ,  8. ,  7. ,
        4. ,  7. ,  6. ])

**Exercise:** Do the same thing with a `group` and a `join`:

In [None]:
# TODO at home



---

<center> return to slides </center>

---

## Conditional Statements

Conditional statements in Python allow us to do different things based on the values in our data.

In [83]:
age = 20
# update age to 17
age = 17

If the value of `age` is greater than or equal to 18 then print 'You can legally vote.'

In [81]:
if age >= 18: # if the boolean expression age >= 18 evaluates to True,
    # Python runs the code in this block
    # a block comes after a colon
    # and is indicated by an indent
    print('You can legally vote.')

Conditionals consist of two main parts:

```python

if boolean expression here :
    # body of the if statement goes here and must be indented
```

Notice that if the boolean expression is False then the body of the if statement is not executed:

In [86]:
x = 21

In [87]:
print("Can you drink?")

if x >= 21:
    print('You can legally drink.')
    print("This line of code does actually run...")    

print("This is run")
print("The value of x is", x)

Can you drink?
You can legally drink.
This line of code does actually run...
This is run
The value of x is 21


Sometimes you want to do something else if the first statement wasn't true:

In [92]:
x = 22

In [93]:
if x >= 21:
    print('You can legally vote and drink.')
    print(x)
elif x >= 18:
    print('You can legally vote.')
    print(x)
else:
    print('You can legally drink milk.')
    print(x)

You can legally vote and drink.
22


Implementing a function with conditionals and multiple return statements:

In [94]:
def age(x):
    if x >= 21:
        return 'You can legally vote and drink.'
    elif x >= 18:
        return 'You can legally vote.'
    else:
        return 'You can legally drink milk.'

In [95]:
age(3)

'You can legally drink milk.'

In [96]:
age(20)

'You can legally vote.'

In [97]:
age(23)

'You can legally vote and drink.'

### Putting the pieces together

Here we will build a function that returns whether a trip was one way or a round trip:

In [None]:
trip = Table.read_table('data/trip.csv')
trip.show(3)
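
The function itself is not shown in this section. A minimal sketch of what it might look like is below. It assumes a trip counts as a round trip when it ends at the station where it started. The name `trip_kind` and the label `'one way'` are hypothetical, while `'round trip'` matches the label used in the pivot further down.

```python
def trip_kind(start, end):
    """Label a trip by whether it ends at the station where it started."""
    if start == end:
        return 'round trip'
    else:
        return 'one way'

print(trip_kind('Station A', 'Station A'))
print(trip_kind('Station A', 'Station B'))
```

Assuming the trip table has `'Start Station'` and `'End Station'` columns (the second name is a guess), the `kinds` table used below could then be built with something like `kinds = trip.with_column('Trip Kind', trip.apply(trip_kind, 'Start Station', 'End Station'))`.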

Pivoting to Trip Kind

In [None]:
kinds_pivot = (
    kinds
    .where('Duration', are.below(600))
    .pivot('Trip Kind', 'Start Station')
    .sort("round trip", descending=True)
    .take(np.arange(10))
)
kinds_pivot

---

<center> return to slides </center>

---

## Simulation

We will use simulation heavily in this class. A key element of simulation is leveraging randomness. The NumPy library has many functions for generating random events. Today we will use the `np.random.choice` function:

In [None]:
mornings = make_array('wake up', 'sleep in')

In [None]:
np.random.choice(mornings)

In [None]:
np.random.choice(mornings)

In [None]:
np.random.choice(mornings)

We can also pass an argument that specifies how many times to make a random choice:

In [None]:
np.random.choice(mornings, 7)

In [None]:
np.random.choice(mornings, 7)

In [None]:
morning_week = np.random.choice(mornings, 7)
morning_week

In [None]:
sum(morning_week == 'wake up')

In [None]:
sum(morning_week == 'sleep in')

In [None]:
np.mean(morning_week == 'sleep in')

### Playing a Game of Chance

Steps:
1. Find a way to simulate two dice rolls.
2. Compute how much money we win/lose based on the result.
3. Do steps 1 and 2 10,000 times.

### Simulating the roll of a die

In [None]:
die_faces = np.arange(1, 7)
die_faces

In [None]:
np.random.choice(die_faces)

**Exercise:** Implement a function that simulates a single round of play and returns the result.
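
One possible shape for such a function is sketched below. The payoff rule of the game is not stated in this section, so the rule used here is made up purely for illustration: win $1 on a total of 10 or more, lose $1 on a total of 4 or less, and break even otherwise.

```python
import numpy as np

die_faces = np.arange(1, 7)

def simulate_one_round():
    """Roll two dice and return the winnings in dollars (hypothetical rule)."""
    rolls = np.random.choice(die_faces, 2)   # two independent rolls
    total = sum(rolls)
    if total >= 10:      # hypothetical: high totals win $1
        return 1
    elif total <= 4:     # hypothetical: low totals lose $1
        return -1
    else:                # everything else breaks even
        return 0
```

Each call returns a different random result, so running the cell repeatedly gives a mix of `1`, `0`, and `-1`.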

In [None]:
simulate_one_round()

---

<center> return to slides </center>

---

## `For` Statements

The for statement is another way to apply code to each element in a list or an array.

In [None]:
for pet in make_array('cat', 'dog', 'rabbit'):
    print('I love my ' + pet)

**Exercise:** What is the output of this for loop?

In [None]:
x = 0
for i in np.arange(1, 4):
    x = x + i
    print(x)

print("The final value of x is:", x)

**Exercise:** Use a for loop to simulate the total outcome of 10,000 plays of our game of chance:
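
A minimal sketch of the accumulation pattern follows. The one-round function uses a made-up payoff rule, since the real rule is not stated in this section. It is defined inside the block so that the example runs on its own.

```python
import numpy as np

die_faces = np.arange(1, 7)

def simulate_one_round():
    # Hypothetical payoff rule, for illustration only.
    total = sum(np.random.choice(die_faces, 2))
    if total >= 10:
        return 1
    elif total <= 4:
        return -1
    return 0

num_rounds = 10000
winnings = 0                                  # running total, starts at zero
for i in np.arange(num_rounds):
    winnings = winnings + simulate_one_round()

print("My total winnings:", winnings)
```

The loop variable `i` is never used in the body. It only controls how many times the body runs.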

**Bonus Exercise:** Use table functions to simulate 10,000 rounds of play:

In [None]:
print("My total winnings:", rolls.column("outcome").sum())

---

<center> return to slides </center>

---

## Appending Arrays

Sometimes we will want to collect the outcomes of our simulations into a single array.  We can do this by appending each experiment to the end of an array using the numpy `np.append` function.

In [None]:
first = np.arange(4)
second = np.arange(10, 17)

In [None]:
np.append(first, 6)

In [None]:
first

In [None]:
np.append(first, second)

In [None]:
first

In [None]:
second
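
The cells above show that `np.append` returns a new array and leaves its inputs unchanged. A minimal sketch that makes this explicit (the name `extended` is illustrative):

```python
import numpy as np

first = np.arange(4)
extended = np.append(first, 6)   # a new array; first is not modified

print(first)
print(extended)
```

To keep the appended result, assign it to a name, as in `outcomes = np.append(outcomes, value)`.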

**Exercise:** Use append to record the outcomes of all the games rather than just the total.

### Another example: simulating heads in 100 coin tosses

Suppose we simulate 100 coin tosses. What fraction will be heads? What if we simulate 100 coin tosses thousands of times? What fraction will be heads?

In [None]:
coin = make_array('heads', 'tails')

In [None]:
sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Simulate one outcome

def num_heads():
    return sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Decide how many times you want to repeat the experiment

repetitions = 10000

In [None]:
# Simulate that many outcomes

outcomes = make_array()

for i in np.arange(repetitions):
    outcomes = np.append(outcomes, num_heads())

In [None]:
heads = Table().with_column('Heads', outcomes)
heads.hist(bins = np.arange(29.5, 70.6))

--- 
## Optional: Advanced `where`

Sometimes the `are.above_or_equal_to` style syntax will be painful to use.  We can instead construct an array of booleans to select rows from our table.  This will allow us to select rows based on complex boolean expressions spanning multiple columns. 

In [None]:
ages = make_array(16, 22, 18, 15, 19, 39, 27, 21)
patients = Table().with_columns("Patient Id", np.arange(len(ages)) + 1000, 'Age', ages)
patients

**Exercise:** Find all the patients that are older than 21 or have a Patient Id that is even:

To compute the even patient ids, we can use the `%` modulus operator:
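
A minimal sketch of one way to build the boolean array, using plain NumPy arrays in place of the table columns. The names `patient_ids`, `is_even_id`, `is_older`, and `keep` are illustrative. For arrays, the element-wise "or" is written `|` rather than `or`.

```python
import numpy as np

ages = np.array([16, 22, 18, 15, 19, 39, 27, 21])
patient_ids = np.arange(len(ages)) + 1000

is_even_id = patient_ids % 2 == 0   # remainder 0 after dividing by 2
is_older = ages > 21
keep = is_older | is_even_id        # element-wise "or" for arrays

print(patient_ids[keep])
print(ages[keep])
```

If `where` accepts an array of booleans as described above, the matching rows could then be selected with something like `patients.where(keep)`.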