In [22]:
# Click into this cell and press shift-enter before using this notebook.
# This line loads the ability to use %%ai in your file
%load_ext jupyter_ai_magics
# These lines import the Python modules we commonly use in CMPSC 5A
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots

# Lecture 4, CMPSC 5A, S25

## Note to self: check font size

From ic04, where I asked for feedback on the course:
* There was a request to check the font size.
* So this is here to remind me to check in with the back row to make sure the font size is comfortable, and
* To invite you *at any time* to interrupt the lecture if you cannot see or hear what's going on.

It's important to me that you be able to see and hear; otherwise, what's the point?  
* *So definitely interrupt* if there's a problem with that.
* I'll do what I can to address it.

### The pace

As for the pace, there were:
* Many people who said it was going too fast
* Many people who said it was just right
* Many people who said it was going too slow

That's a *subjective* impression and not a rigorous analysis; 
* we'll talk at some point about "qualitative" vs. "quantitative" data and the difficulty of rigorously analyzing qualitative data (e.g. free form text)
* for now, it seems to be divided enough that I'm going to just stay on the current pace.

But do feel free to continue to offer feedback on that.  I'll keep asking occasionally.  Maybe next time, I'll ask about pace with a quantitative scale to make it easier to analyze.

## Review from last time

The "first cell" that we start by running each time now has five things in it (not counting the comments):

| Item | Explanation |
|------|------------|
|`%load_ext jupyter_ai_magics` | Loads the software needed to use `%%ai ` commands in code cells |
|`from datascience import *` | Loads the UC Berkeley Data 8 software (`datascience` module) |
|`import numpy as np` | Imports a module `numpy` for data science, and gives it the nickname `np` |
|`%matplotlib inline` | Allows us to see `matplotlib` output in our Jupyter Notebook  (graphs)|
|`import matplotlib.pyplot as plots` | Imports a module for plotting graphs as `plots` |

Remember to shift-enter on this cell before you try any other code cells!

## Where are we in the reading?

By today, you should have finished reading:
* [Chapter 1: What is Data Science](https://inferentialthinking.com/chapters/01/what-is-data-science.html)
* [Chapter 3: Programming in Python](https://inferentialthinking.com/chapters/03/programming-in-python.html)
* [Chapter 4: Data Types](https://inferentialthinking.com/chapters/04/Data_Types.html)
* [Chapter 5 (Sequences)](https://inferentialthinking.com/chapters/05/Sequences.html) 
   

For next Tuesday, 04/15, please also read: 
* [Chapter 6 (Intro)](https://inferentialthinking.com/chapters/06/Tables.html)
* [Section 6.1 (Sorting Rows)](https://inferentialthinking.com/chapters/06/1/Sorting_Rows.html#)
* [Section 6.2 (Selecting Rows)](https://inferentialthinking.com/chapters/06/2/Selecting_Rows.html)


## Programming in Python (Chapter 3)

We've already covered most of this when we went through Chapter 1!

So this should go fast.



## Section 3.1: Expressions


In your text, [Section 3.1: Expressions](https://inferentialthinking.com/chapters/03/1/Expressions.html) covers expressions.

We'll see lots of contexts for expressions, but one of the most common is the right-hand side of an assignment statement, such as `y = m * x + b` in this block of code:

In [7]:
m = 2
b = 3
x = 4
y = m * x + b
y

11

But in truth, *every* right hand side of an assignment is an expression.  In fact, all of these are expressions:

| Expression | Comments |
|------------|----------|
| `2`        | Every number is an expression |
| `"Data Science"` | Every string is an expression | 
| `y`        | Every variable or name is an expression |
| `m * x`    | Expressions connected by *operators* are also expressions; here `*` is the multiplication operator |
|`m * x + b` |  Another example of connecting two expressions with an operator (`+`) |


In complicated expressions, we use "order of operations" to decide what to do first; Python uses the usual PEMDAS rules you may remember from high school algebra:

* Parentheses
* Exponentiation
* Multiplication/Division
* Addition/Subtraction

Almost all Python operators *associate left to right* if they are at the same level of precedence.  That is, you apply the operators from left to right.

There are only a handful of operators that associate right to left, and the only one of those we are likely to 
see is the exponentiation operator (i.e. $x^y$ written as `x**y`).

Here are some examples:

| Expression without parentheses | Equivalent Expression to show the order of evaluation |
|--------------------------------|-------------------------------------------------------|
| `2 + 3 * 5 + 1`                |  `((2 + (3 * 5)) + 1)`                                |
| `2 + 3 + 5`                    |  `((2 + 3) + 5)`                                      |
| `2 ** 3 ** 4`                  |  `(2 ** (3 ** 4))`                                    |
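
A quick way to check the rows of this table is to ask Python whether each expression equals its fully parenthesized version. This is just a sketch for experimenting; the variable names `right_to_left` and `left_to_right` are made up here to show that grouping exponentiation the other way gives a different number.

```python
# Each line compares an expression with its fully parenthesized equivalent.
print(2 + 3 * 5 + 1 == ((2 + (3 * 5)) + 1))   # multiplication before addition
print(2 + 3 + 5 == ((2 + 3) + 5))             # same precedence: left to right
print(2 ** 3 ** 4 == (2 ** (3 ** 4)))         # exponentiation: right to left

# Grouping exponentiation the other way changes the result.
right_to_left = 2 ** (3 ** 2)   # 2 ** 9, which is what 2 ** 3 ** 2 means
left_to_right = (2 ** 3) ** 2   # 8 ** 2
print(right_to_left, left_to_right)
```

All three comparisons print `True`, and the last line prints two different values, so the parentheses matter whenever you want the left-to-right grouping for `**`.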

<details>
<summary>
Optional: Click the triangle to learn about the other operators that associate right to left. 
</summary>

<h5>Right to Left constructs in Python</h5>

Here are the right-to-left associative operators and operator-like constructs in Python, along with a few constructs that are left-to-right or use special chaining, for comparison.  Note that you probably *will not need* this information for CMPSC 5A, but it's here in case you are curious.

| Operator / Construct   | Associativity    | Type           | Notes |
|------------------------|------------------|----------------|-------|
| `=` (assignment)       | Right-to-left    | Statement      | Does not return a value; can't be used in expressions |
| `:=` (walrus)          | Right-to-left    | Expression     | Introduced in Python 3.8; returns a value |
| `**` (exponentiation)  | Right-to-left    | Expression     | `2 ** 3 ** 2` is `2 ** (3 ** 2)` = 512 |
| `lambda`               | Right-to-left    | Expression     | Rightmost `lambda` binds first |
| `x if cond else y`     | Right-associative-ish | Expression | Conditional expressions nest from the right |
| Function calls (`f()`) | N/A (nested eval) | Expression     | In `f(g(x))`, the inner call `g(x)` is evaluated first and its result is passed to `f` |
| Attribute access (`obj.a.b`) | Left-to-right | Expression | Syntactically left-associative, but can chain deeply |
| Chained comparisons (`1 < x < 5`) | Custom chaining | Expression | Equivalent to `1 < x and x < 5` |

    
</details>


## Table of Common Operators (from [Section 3.1: Expressions](https://inferentialthinking.com/chapters/03/1/Expressions.html))

The following table is *not* a complete list of the operators in Python, but it does contain the ones you'll need most often.   We'll introduce others as and when needed.

<table class="table">
<thead>
<tr class="row-odd"><th class="head"><p>Expression Type</p></th>
<th class="head"><p>Operator</p></th>
<th class="head"><p>Example</p></th>
<th class="head"><p>Value</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>Addition</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">+</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">2</span> <span class="pre">+</span> <span class="pre">3</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">5</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>Subtraction</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">-</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">2</span> <span class="pre">-</span> <span class="pre">3</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">-1</span></code></p></td>
</tr>
<tr class="row-even"><td><p>Multiplication</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">*</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">2</span> <span class="pre">*</span> <span class="pre">3</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">6</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>Division</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">/</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">7</span> <span class="pre">/</span> <span class="pre">3</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">2.33333</span></code></p></td>
</tr>
<tr class="row-even"><td><p>Remainder</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">%</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">7</span> <span class="pre">%</span> <span class="pre">3</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">1</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>Exponentiation</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">**</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">2</span> <span class="pre">**</span> <span class="pre">0.5</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">1.41421</span></code></p></td>
</tr>
</tbody>
</table>

A couple of others that are worth mentioning:

| Expression Type | Operator | Example | Value |
|-|-|-|-|
| String Concatenation | `+` | `'UC' + 'SB'` | `'UCSB'` |
| Integer Division (divide and throw away remainder) | `//` | `7 // 2` | `3` (*not* `3.5`) |
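
If you'd like to try these operators yourself, here is a small sketch. The variable names are just for illustration. Notice that `/` gives a number with a decimal point, while `//` and `%` on two integers give integers.

```python
quotient = 7 / 3          # division, about 2.33333
whole_part = 7 // 2       # integer division: the remainder is thrown away
remainder = 7 % 3         # remainder after dividing 7 by 3
root_two = 2 ** 0.5       # exponentiation, about 1.41421
campus = 'UC' + 'SB'      # string concatenation

print(round(quotient, 5), whole_part, remainder, round(root_two, 5), campus)
```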

## ic06!

Let's do ic06 from Gradescope.

## Section 3.2: Names (Variables)


In your text, [Section 3.2: Names](https://inferentialthinking.com/chapters/03/2/Names.html) covers *names* which are also called *variables*.

We use names in Python to store *values*.  These values can include any of the following (this is *not* a complete list!):
* numbers
* strings
* boolean values (values that are `True` or `False`)
* lists
* Arrays
* Tables
* functions
* modules

There are rules for names: 
* only letters, digits, underscore (no spaces)
* must not start with a digit
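
One way to test whether a name follows these rules is the built-in string method `isidentifier()`. This is only a sketch, and the candidate names below are made up for illustration. Two caveats: `isidentifier()` also accepts letters from other alphabets, and it answers `True` for reserved words such as `for`, which still can't be used as names.

```python
candidates = ["ticket_price", "ticket price", "2nd_movie", "movie2", "_total"]

for name in candidates:
    # True when the string uses only letters, digits, and underscores
    # and does not start with a digit
    print(name, name.isidentifier())

# Keep only the candidates that Python would accept as names
valid_names = [name for name in candidates if name.isidentifier()]
print(valid_names)
```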

It turns out that what name you choose:
* Doesn't matter to the computer
* Is *very* significant to the humans trying to read your code (including you!)

Good names help you understand your code better.  They help your "short term memory" (STM).

Bad names get in the way.

In fact, Felienne Hermans, in her book *The Programmer's Brain*, has an *entire chapter* called *How to Get Better at Naming Things*.

* If you log in here first with your UCSBNetId: <https://bit.ly/ucsb-or>
* You can read that chapter [here](https://learning.oreilly.com/library/view/the-programmers-brain/9781617298677/Text/ch08.htm)
* That site also has dozens and dozens of good Data Science and Programming books if you want to continue studying these topics after the course is over.



## Section 3.3: Call Expressions 


In your text, [Section 3.3: Call Expressions](https://inferentialthinking.com/chapters/03/3/Calls.html) covers *call expressions* which are also called *function calls* (or sometimes *method calls*, depending on the context.)

Some examples:

| Example | Result | What it does |
|-|-|-|
| `abs(-10)` | `10` | Takes the absolute value |
| `round(1.51)` | `2` | Rounds the number to the nearest integer value and converts the type to int |
| `max(2, 2 + 3, 4)` | `5` | Returns the largest of its *arguments* (the things separated by commas) |
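
One detail about `round` that surprises people: when a number is exactly halfway between two integers, Python 3 rounds to the nearest *even* integer rather than always rounding up. Here's a small sketch to try; the name `halfway_results` is just for illustration.

```python
print(round(1.51))         # the example from the table
print(type(round(1.51)))   # the result is an int

# Halfway cases go to the even neighbor
halfway_results = [round(0.5), round(1.5), round(2.5), round(3.5)]
print(halfway_results)
```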

### Arguments

The call expression `abs(-10)` has one argument, namely `-10`.  It's the thing in the parentheses `()`.

The call expression `max(2, 2+3, 4)` has three arguments:
* `2`
* `2+3`
* `4`

Each argument is an expression; in this case, we are only seeing numbers, and simple expressions made up of numbers, but the arguments to a call expression can be arbitrarily complex expressions, even other call expressions.

In [18]:

x = 5
y = 7
z = 9
answer = round(abs(max(x,y,z) - min(x-1,y-2,z-3)))
answer

5
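
To see how Python arrives at `5`, it can help to evaluate the nested call expression from the inside out, one call at a time. The intermediate names below (`largest`, `smallest`, `difference`, `distance`) are not in the original cell; they're only there to label each step.

```python
x = 5
y = 7
z = 9

largest = max(x, y, z)                 # max(5, 7, 9)
smallest = min(x - 1, y - 2, z - 3)    # min(4, 5, 6)
difference = largest - smallest        # 9 - 4
distance = abs(difference)             # already positive, so unchanged
answer = round(distance)               # already a whole number, so unchanged
print(answer)
```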

### A short digression: the *Terminal* in Jupyter Hub

If you just want to play around with Python to figure out how things work in the language, you *could* create code cells in a notebook, and press shift-enter each time.

But this can be tedious. *There's a more efficient way*.

In Jupyter Hub, on the `File` menu, select `New` then `Terminal`.

You'll get a terminal session. If you then type `python3`, you'll get a *Python prompt*.  Now everything you type in will be interpreted *just like a code cell* but it goes a lot faster.

Here's what that looks like; then I'll demonstrate:

![new-terminal.jpg](attachment:51462550-0026-456c-b2b3-4693f07da290.jpg)

```
jovyan@jupyter-phtcon:~$ python3
Python 3.12.9 | packaged by conda-forge | (main, Mar  4 2025, 22:48:41) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> abs(-10)
10
>>> round(5.01)
5
>>> type(round(5.01))
<class 'int'>
>>>
```

### A special call expression: `print()`

A particular call expression that is often misunderstood is `print()`.   Here's where the misunderstanding arises:

Consider these code cells:


In [1]:
x = 3 + 4 * 5

In [2]:
x

23

In [3]:
print(x)

23


It may appear that using `print` to show the value of a variable or expression has no particular effect.  Consider also:

In [4]:
3 + 4 * 5

23

In [5]:
print(3 + 4 * 5)

23


However, there is a difference, and it's easiest to show with string data.  Consider these examples:

In [6]:
school = "UC Santa Barbara"

In [7]:
school

'UC Santa Barbara'

In [8]:
print(school)

UC Santa Barbara


You can see that `print()` takes away the quotation marks.  It's even more dramatic with multi-line strings, which we *delimit* (mark the beginning and ending of) with triple quotes like this (as you saw in lab01):

In [20]:
declaration_excerpt = """
We hold these truths to be self-evident, 
that all men are created equal, 
that they are endowed by their Creator 
with certain unalienable Rights, 
that among these are Life, Liberty and the pursuit of Happiness.
"""

In [11]:
declaration_excerpt

'\nWe hold these truths to be self-evident, \nthat all men are created equal, \nthat they are endowed by their Creator \nwith certain unalienable Rights, \nthat among these are Life, Liberty and the pursuit of Happiness.\n'

In [12]:
print(declaration_excerpt)


We hold these truths to be self-evident, 
that all men are created equal, 
that they are endowed by their Creator 
with certain unalienable Rights, 
that among these are Life, Liberty and the pursuit of Happiness.



The `\n` that you see when you just show the value of the variable indicates a *newline character*.  The `\` is the backslash (it leans *back* if you consider the direction of English text, which is left to right).  We can put `\n` inside a string to show where a new line goes.  Compare:

In [13]:
print("UC Santa Barbara")

UC Santa Barbara


In [14]:
print("UC\nSanta\nBarbara")

UC
Santa
Barbara
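
For a string, what the notebook shows when a name is the last line of a cell matches the text produced by the built-in `repr()` function, while `print()` shows the string's characters themselves. Here's a sketch that puts the two side by side; the names `shown_by_notebook` and `result` are just for illustration. It also shows that `print()` displays its argument but itself evaluates to `None`.

```python
school = "UC\nSanta\nBarbara"

shown_by_notebook = repr(school)   # quotes and \n are visible in this text
print(shown_by_notebook)
print(school)                      # three separate lines, no quotes

# print() displays its argument, but the value of the call is None
result = print("hello")
print(result)
```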


We'll talk more about the use of `print()` as the course continues, and especially when we get to writing our own functions.

In that context `print` vs. `return` is a *whole thing*! 

But that's for later.

## Section 3.4: Introduction To Tables 


In your text, [Section 3.4: Introduction to Tables](https://inferentialthinking.com/chapters/03/4/Introduction_to_Tables.html) covers *Tables* which are fundamental to *this* course in Data Science.

### `Table()` is specific to `datascience`

Note that in this course we use the Python `datascience` library that was developed at UC Berkeley for the course *Data 8*, and goes along with the textbook we are using in this course.

In that library, the `Table()` data type is the main *abstraction* that is used for a table of data: it's something like a spreadsheet with rows and columns.

Other data science libraries exist, and they use different abstractions.  I'm not going to derail things too much by talking about those, but I'll briefly mention the other libraries and what they use instead of `Table()`:

| Library        | Table Abstraction |
|----------------|-------------------|
| `datascience`  | `Table`           |
| `pandas`       | `DataFrame`       |
| `numpy`        | `ndarray`         |
| `polars`       | `DataFrame`       |
| `dask`         | `DataFrame`       |
| `pyspark`      | `DataFrame`       |

So when we talk about a `Table()`, that's a specific concept in the `datascience` library, but it's also a pretty generic *idea*.

Let's look at how to create an empty Table. An empty table is useful because it can be extended to contain new rows and columns.

In [23]:
# create an empty Table
Table()

If you get a `NameError` on `Table()`, remember that you have to shift-enter the
top cell in your notebook at the start of the session so that you run
the `from datascience import *` command.

Without that, `Table()` is undefined.



#### with_columns
The `with_columns` method on a table constructs a new table with additional labeled columns. 

In [25]:
# add one column
Table().with_columns('Number of petals', make_array(8, 34, 5))

Number of petals
8
34
5


In [26]:
# add two or more columns
Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

Number of petals,Name
8,lotus
34,sunflower
5,rose


We can give it a name and further extend it!

In [28]:
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

flowers.with_columns(
    'Color', make_array('pink', 'yellow', 'red')
)

Number of petals,Name,Color
8,lotus,pink
34,sunflower,yellow
5,rose,red


#### read_table
You may already have a table of data that you want to read! We will use the Table method `read_table` to read a CSV (comma-separated values) file below. Open up this file on your device and see what it looks like! You should get very familiar with looking at data in this course!

In [30]:
movies = Table.read_table("data/movies_by_year_with_ticket_price.csv")
movies

Year,Average Ticket Price,Total Gross,Number of Movies,#1 Movie
2019,9.16,11237.9,788,Avengers: Endgame
2018,9.11,11611.3,878,Black Panther
2017,8.97,11125.4,740,Star Wars: Episode VIII - The Last Jedi
2016,8.65,11373.7,737,Rogue One: A Star Wars Story
2015,8.43,11098.8,706,Star Wars: Episode VII - The Force Awakens
2014,8.17,10436.8,708,American Sniper
2013,8.13,10955.9,689,The Hunger Games: Catching Fire
2012,7.96,10959.6,669,The Avengers
2011,7.93,10114.1,601,Harry Potter and the Deathly Hallows: Part 2
2010,7.89,10212.7,538,Toy Story 3
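
If you can't open the file right now, here's a sketch of what CSV text looks like and how Python's built-in `csv` module splits it into rows. The text below is typed in by hand from the first three rows of the table above; it is not read from the actual file, whose exact formatting may differ. The names `csv_text`, `header`, and `data_rows` are just for illustration.

```python
import csv
import io

csv_text = """Year,Average Ticket Price,Total Gross,Number of Movies,#1 Movie
2019,9.16,11237.9,788,Avengers: Endgame
2018,9.11,11611.3,878,Black Panther
2017,8.97,11125.4,740,Star Wars: Episode VIII - The Last Jedi
"""

# csv.reader gives each line back as a list of strings, split at the commas
rows = list(csv.reader(io.StringIO(csv_text)))
header = rows[0]        # the first line holds the column labels
data_rows = rows[1:]    # every other line is one row of data

print(header)
print(len(data_rows), "data rows")
print(data_rows[0])
```

Notice that every value comes back as a string here; one of the things `read_table` does for us is turn columns like `Year` into numbers.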


#### num_columns/num_rows
The attribute `num_columns` gives the number of columns in the table, and `num_rows` gives the number of rows. Because these are attributes rather than methods, we write them without parentheses.

In [31]:
movies.num_columns

5

#### Accessing data in a column
We can use a column’s label to access the array of data in the column as shown below. 

In [32]:
movies.column("#1 Movie")

array(['Avengers: Endgame', 'Black Panther',
       'Star Wars: Episode VIII - The Last Jedi',
       'Rogue One: A Star Wars Story',
       'Star Wars: Episode VII - The Force Awakens', 'American Sniper',
       'The Hunger Games: Catching Fire', 'The Avengers',
       'Harry Potter and the Deathly Hallows: Part 2', 'Toy Story 3',
       'Avatar', 'The Dark Knight', 'Spider-Man 3',
       "Pirates of the Caribbean: Dead Man's Chest",
       'Star Wars: Episode III - Revenge of the Sith', 'Shrek 2',
       'The Lord of the Rings: The Return of the King', 'Spider-Man',
       "Harry Potter and the Sorcerer's Stone",
       'How the Grinch Stole Christmas',
       'Star Wars: Episode I - The Phantom Menace', 'Saving Private Ryan',
       'Titanic', 'Independence Day', 'Toy Story', 'Forrest Gump',
       'Jurassic Park', 'Aladdin', 'Terminator 2: Judgment Day',
       'Home Alone', 'Batman', 'Rain Man', 'Three Men and a Baby',
       'Top Gun', 'Back to the Future', 'Beverly Hills Cop',
 

There are a variety of *methods* we can use on tables; these include 
* `show`
* `select`
* `drop`
* `sort`
* `where`
and many others.

Let's ask ChatGPT for help:

In [34]:
%%ai openai-chat:gpt-3.5-turbo
Can you give me a brief summary of some of the most useful methods that
we can use on a Table() from the Python datascience library?

### Summary of Useful Methods for Table() in Python Datascience Library

1. **show()**: Display the table with customizable formatting options.
2. **select()**: Select specific columns from the table.
3. **sort()**: Sort the table based on specified columns.
4. **where()**: Filter rows based on specified conditions.
5. **group()**: Group rows based on specified columns.
6. **join()**: Join two tables based on common columns.
7. **pivot()**: Reshape the table using specified columns as row and column headers.
8. **barh()**: Create a horizontal bar chart from the table data.
9. **scatter()**: Create a scatter plot from the table data.
10. **save()**: Save the table to a specified file format.

In [35]:
%%ai openai-chat:gpt-3.5-turbo
Can you give an example of using each of those methods, using the movies table that is defined in this notebook?

### Examples of Using Useful Methods for Table() in Python Datascience Library with Movies Table

1. **show()**:
```python
movies.show()
```

2. **select()**:
```python
selected_columns = movies.select('Title', 'Genre')
selected_columns.show()
```

3. **sort()**:
```python
sorted_table = movies.sort('Year')
sorted_table.show()
```

4. **where()**:
```python
filtered_rows = movies.where('Director', are.equal_to('Christopher Nolan'))
filtered_rows.show()
```

5. **group()**:
```python
grouped_table = movies.group('Director')
grouped_table.show()
```

6. **join()**:
```python
actors = Table().with_columns('Actor', ['Tom Hanks', 'Leonardo DiCaprio'], 'Movie', ['Forrest Gump', 'Inception'])
joined_table = movies.join('Title', actors, 'Movie')
joined_table.show()
```

7. **pivot()**:
```python
pivoted_table = movies.pivot('Genre', 'Director', 'Title', collect=sum)
pivoted_table.show()
```

8. **barh()**:
```python
movies.barh('Director', 'Title')
```

9. **scatter()**:
```python
movies.scatter('Year', 'Revenue (Millions)')
```

10. **save()**:
```python
movies.save('movies_data.csv')
```

In [36]:
%%ai openai-chat:gpt-3.5-turbo
Can you repeat that, but add a brief explanation 
along with each example that would help a beginner understand
what the methods does, and anything else they should know?

### Examples of Using Useful Methods for Table() in Python Datascience Library with Movies Table

1. **show()**:
   - This method displays the contents of the table in a format that is easily readable for the user.
   ```python
   movies.show()
   ```

2. **select()**:
   - Select method allows you to specify and display only the columns of interest from the table.
   ```python
   selected_columns = movies.select('Title', 'Genre')
   selected_columns.show()
   ```

3. **sort()**:
   - Sort method arranges the rows of the table in ascending order based on the specified column.
   ```python
   sorted_table = movies.sort('Year')
   sorted_table.show()
   ```

4. **where()**:
   - Where method filters the rows of the table based on a specified condition.
   ```python
   filtered_rows = movies.where('Director', are.equal_to('Christopher Nolan'))
   filtered_rows.show()
   ```

5. **group()**:
   - Group method groups the rows of the table based on a specified column, aggregating the data within each group.
   ```python
   grouped_table = movies.group('Director')
   grouped_table.show()
   ```

6. **join()**:
   - Join method merges two tables based on a common column, allowing you to combine related data.
   ```python
   actors = Table().with_columns('Actor', ['Tom Hanks', 'Leonardo DiCaprio'], 'Movie', ['Forrest Gump', 'Inception'])
   joined_table = movies.join('Title', actors, 'Movie')
   joined_table.show()
   ```

7. **pivot()**:
   - Pivot method reshapes the table, using specified columns as row and column headers, and aggregating the data based on a function.
   ```python
   pivoted_table = movies.pivot('Genre', 'Director', 'Title', collect=sum)
   pivoted_table.show()
   ```

8. **barh()**:
   - Barh method creates a horizontal bar chart using the data from the specified columns in the table.
   ```python
   movies.barh('Director', 'Title')
   ```

9. **scatter()**:
   - Scatter method creates a scatter plot using the data from the specified columns in the table.
   ```python
   movies.scatter('Year', 'Revenue (Millions)')
   ```

10. **save()**:
    - Save method allows you to save the table data to a specified file format for future use or sharing.
    ```python
    movies.save('movies_data.csv')
    ```
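
Before running AI-generated code like this, it's worth checking it against your actual data. The sketch below compares the column labels used in the examples above with the labels of our `movies` table as displayed earlier. Both lists are typed in by hand here, and the variable names are made up for illustration; in a notebook, `movies.labels` would give the real labels.

```python
# Column labels of our movies table, copied from the output shown earlier
actual_labels = ['Year', 'Average Ticket Price', 'Total Gross',
                 'Number of Movies', '#1 Movie']

# Column labels that the AI-generated examples apply to the movies table
labels_in_ai_examples = ['Title', 'Genre', 'Year', 'Director',
                         'Revenue (Millions)']

# Which of the AI's labels do not exist in our table?
not_in_our_table = [label for label in labels_in_ai_examples
                    if label not in actual_labels]
print(not_in_our_table)
```

Only `'Year'` matches, so an example such as `movies.sort('Year')` fits our table, while the ones that use `'Title'`, `'Genre'`, `'Director'`, or `'Revenue (Millions)'` would need to be rewritten with labels our table actually has.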