In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

In [1]:
from datascience import *
import numpy as np

# Lab 5 – Tables

## Data 94, Spring 2021

In this lab we will be talking all about *Tables*. We use tables to store all sorts of data, from sports statistics to population information. If there's data you have ever been curious about, it is very likely that the Internet has a table somewhere with that data!

Tables are integral to the foundation of Data Science, and we will go over how to **query** a table. **Querying** a table means asking a question about the data in the table. Some examples of common queries (in English, not code):

- How many data points are there?
- Which data points have a specific characteristic?
- What is the attribute of a specific data point?
- And many more!

There are so many ways we can use tables to get information we need, and there are several existing libraries in Python that we can use to do this! In this course, we will be using the `datascience` library, and if you take Data Science classes beyond this one, you may learn many more!

## Table Creation

Let's take a look at a table in action. Python does not have any tables by default, so we can either make a new one ourselves or we can import a table from a file. First, let's see how we can make our own table from scratch:

In [2]:
# We start out with an empty Table
# Note that 'Table' is capitalized, and there is nothing in the parentheses
our_table = Table()
our_table

In [3]:
# Now, let's put some data in our table!
# We put the name of a column (a label)
# then a comma ','
# and then a NumPy array of that column's values
# Alternating labels and column values, we fill our table!
our_table = our_table.with_columns(
    "Department", np.array(["Data Science", "Economics", "Political Science", "Sociology"]),
    "Course Number", np.array([94, 1, 2, 121])
)
our_table

## Table attributes: `num_rows` and `num_columns`

We can ask for all sorts of information about the table itself:

In [4]:
our_table.num_rows

In [5]:
our_table.num_columns

## Getting columns: `column()` and `with_columns()`

We can also ask about the data in the table using the `column()` method. As mentioned in lecture, we can pass in a `label` or an `index` to this method. We index into the columns of a table much like we index into the items of a list: 0 corresponds to the first column, 1 corresponds to the second, and so on.

In [6]:
our_table.column("Department")

In [7]:
our_table.column(0)

Say we have a table and we want to add more data. We can use the `with_columns()` method to do this (just like we did above)! The `with_columns()` method takes its inputs the same way as before: a column label, followed by an array of that column's values. Each array we add must have the same number of items as the table has rows; otherwise we get an error.

In [8]:
# Our table has 4 rows, so our new column needs an array with 4 items, 1 for each row
our_table_new_column = our_table.with_columns("Number of Students", np.array([21, 905, 209, 63]))
our_table_new_column

In [9]:
our_table_bad_column = our_table_new_column.with_columns("Too Few Rows", np.array([1, 2, 3]))
our_table_bad_column

Note the error message here. This may be a common mistake at first, so if you see this error message, check the number of items in your column arrays.

In [10]:
# This is our final table!
# You may use this cell to explore the table and see what you can do with it so far!
our_table_new_column

## Loading a Table

Although creating our own tables by hand can be useful, more often than not the data we want to work with is far too large to be able to type out by hand. More commonly, we load datasets in from other sources using the `Table.read_table()` method. We can pass in a *file path* to this method and it will load that data into a table we can use in Python!

Let's see how this works using the file `"data/football.csv"`:

*To understand why we need the `data/` in front of the filename, consult Question 1c on Homework 4 or review Lecture 14*

In [11]:
file_table = Table.read_table("data/football.csv")
file_table.show(5)

## Seeing a table: `show()`

The `show()` method **displays** the first `n` rows of a table, where `n` is the number we pass in. Like `print()`, it does not return a value; it only displays the rows, so there is nothing to assign to a variable.

## Excluding columns: `drop()`

We now have information about Cal Football's seasons for as long as statistics have been kept. Because this file was pulled from the internet, it may have some data in it that we are not interested in, like the columns with a bunch of `nan` values (`nan` means "Not a Number", and it is commonly used to indicate that there is no value there).

*You should not automatically remove every column that has many `nan` values. There may be a reason why the values are missing, but for a lab exercise we don't need to worry about it.*
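
As a small illustration of how `nan` behaves, here is a sketch on a made-up numeric array (not data from `football.csv`). `np.isnan` only works on numeric arrays, so this assumes the values are numbers.

```python
import numpy as np

# Made-up numeric column with two missing entries
values = np.array([9.0, np.nan, 4.5, np.nan])

# np.isnan marks each missing entry with True
missing = np.isnan(values)
num_missing = np.count_nonzero(missing)
print(num_missing)        # 2

# nan is not equal to anything, including itself, so == cannot find it
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)  # False
```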

We can use the `drop()` method to remove columns like this from the table. Let's drop the `Notes` column:

In [12]:
file_table_no_notes = file_table.drop("Notes")
file_table_no_notes.show(5)

Let's also drop the `AP Pre`, `AP High`, `AP Post`, `SRS` and `SOS` columns from the table. These are statistics specific to college football, and they are not important for what we're doing. `drop()` can take in as many columns as you need, and it will drop them all from the table.

In [13]:
file_table_improved_columns = file_table_no_notes.drop("AP Pre", "AP High", "AP Post", "SRS", "SOS")
file_table_improved_columns.show(5)

## Number of Years

We can see how many years this table covers by asking how many rows the table has. Assign the variable `file_table_rows` to the number of rows in `file_table_improved_columns`. You should not type in an integer; instead, use one of the table attributes we have talked about so far to **calculate** the number of rows.

<!--
BEGIN QUESTION
name: q1
points: 0
-->

In [14]:
file_table_rows = ...

In [None]:
grader.check("q1")

Using this value, we can calculate the first year in `file_table_improved_columns` without looking at it!

In [16]:
# The Table covers up until 2020, so subtracting the number of rows gives us the first year NOT in the table
# We add one to the result to get the first year in the table
print("The first year in this table is:", 2020 - file_table_rows + 1)
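
To see why the `+ 1` is needed, here is a small worked check on a made-up five-season span (the years are illustrative, not taken from `football.csv`). It assumes one row per year with no gaps, which is the same assumption the calculation above makes.

```python
# Hypothetical example: one row per year from 2016 through 2020
years = list(range(2016, 2021))   # [2016, 2017, 2018, 2019, 2020]
num_rows = len(years)             # 5 rows

# Subtracting the row count alone lands one year too early
first_year_not_in_table = 2020 - num_rows
# Adding one back gives the first year that is actually in the table
first_year = 2020 - num_rows + 1

print(first_year_not_in_table, first_year)   # 2015 2016
```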

## Querying

Let's try querying our new table. Let's see what conferences Cal has played in during its history using the `Conf` column:

In [17]:
conference_list = file_table_improved_columns.column("Conf")
conference_list

As you can see, this array is long and repetitive, but we can use the `np.unique` function to list each conference only once, in sorted order:

In [18]:
np.unique(conference_list)
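
A minimal sketch of what `np.unique` does, using made-up labels rather than the values in `football.csv`:

```python
import numpy as np

# Hypothetical labels, with repeats and in no particular order
sample_conferences = np.array(["West", "East", "West", "North", "East"])

# Each distinct label appears once, sorted alphabetically
# (not in the order the labels first appeared)
unique_conferences = np.unique(sample_conferences)
print(unique_conferences)        # ['East' 'North' 'West']

# len() then counts how many distinct values there are
print(len(unique_conferences))   # 3
```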

## Picking columns: `select()`

It appears that there are also several other columns that we are not very interested in. Instead of dropping several columns, we can use the `select()` method to grab only the columns we want. In this case, we only want to keep the `Year`, `W`, `L`, `T`, and `Pct` columns:

In [19]:
football_table = file_table_improved_columns.select("Year", "W", "L", "T", "Pct")
football_table

In [20]:
# Note that our file_table_improved_columns table is still intact after this:
file_table_improved_columns

## Changing column labels: `relabeled()`

Some of these columns have labels that may not be best for what they store. Let's change the column labels to the following:

- Year: stay the same
- W -> Wins
- L -> Losses
- T -> Ties
- Pct -> Winning Percentage

We can rename column labels using the `relabeled()` method. You can relabel a single column or several at once. To change the names of multiple columns, we pass in an array of the old names and an array of the new names as the 2 inputs to `relabeled()`:

*There is another method, `relabel()`, which changes the original table in place, with no assignment (`=`) needed. **Be careful** using this method, as it can change your data when you may not want it to.*

In [21]:
old_names = np.array(["W", "L", "T", "Pct"])
new_names = np.array(["Wins", "Losses", "Ties", "Winning Percentage"])

football_table_relabeled = football_table.relabeled(old_names, new_names)
football_table_relabeled.show(5)

## Asking Questions

Now that we have the table we want, let's try to write some code that tells us some information about Cal Football's wins. Let's write some queries that can help us answer these 3 questions. The first question has been given to you, but let's write the other 2!

- What is the most wins Cal has ever had in one season?
- How many games has Cal lost in total?
- What is the average number of games Cal wins per season?

*Remember, you do not need to calculate the answers to these questions by hand; you should be writing queries to have Python do all the calculation for you.*

In [22]:
# Question 1 (done for you)
most_wins_ever = np.max(football_table_relabeled.column("Wins"))
most_wins_ever

Let's break down this query and see what it does. First, we ask for the `Wins` column of `football_table_relabeled`, which gives us an array of the win totals from every season. We then use the `np.max` function to find the maximum value in this array, which tells us the most wins Cal Football has ever had in any one season.
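
The same kind of query can be written in two steps, which makes each part easier to inspect. This sketch uses a small made-up array in place of the real `Wins` column:

```python
import numpy as np

# Step 1: get the column as an array
# (stand-in for football_table_relabeled.column("Wins"); values are made up)
sample_wins = np.array([3, 10, 7, 10, 5])

# Step 2: np.max scans the array and returns its largest value
most_wins = np.max(sample_wins)
print(most_wins)   # 10
```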

Let's use similar queries to answer the other 2 questions:

<!--
BEGIN QUESTION
name: q2
points: 0
-->

In [23]:
# Question 2
games_lost_alltime = ...

In [None]:
grader.check("q2")

In [25]:
# Question 3
average_wins = ...

In [None]:
grader.check("q3")

This means you can (roughly) expect Cal to win 5-6 games a year. While this is not a perfect statistic (some seasons are longer than others, football is a completely different game than it was a long time ago, etc.), in a 12-13 game season, do you think this is a good number of wins? The answer to this question is not concrete, and even with data to back up either side, neither answer seems more right than the other.

Much of Data Science is not only being able to compute the answers to questions, but forming good questions and ensuring your answer is not misleading in any way.

Let's bring back our `football_table_relabeled`:

In [27]:
football_table_relabeled

## Sorting a column: `sort()`

We will now introduce a new table method: `sort()`. `sort()` shows a table's rows ordered by the values in one column, from either **biggest-to-smallest** (`descending=True`) or **smallest-to-biggest** (`descending=False`, the default).

Let's say we want to ask the question: **What is Cal's best season ever?** There are many ways to answer the question, but you may argue that a season with the most wins or the fewest losses could be considered the best:

In [28]:
# We can sort in descending order:
football_table_relabeled.sort("Wins", descending=True)

In [29]:
# Or we can sort in ascending order:
football_table_relabeled.sort("Losses", descending=False)

As you can see, queries about the most wins and the fewest losses can both answer the question **What is Cal's best season ever?** in different ways. Note that the same seasons do not necessarily show up at the top of each sorted table.

## Question 4

Yet another way to answer this question about Cal's best seasons ever is to sort by winning percentage. Assign the variable `seasons_sorted_by_win_percentage` to the result of a table query sorting the table based on winning percentage:

<!--
BEGIN QUESTION
name: q4
points: 0
-->

In [30]:
seasons_sorted_by_win_percentage = ...
seasons_sorted_by_win_percentage

In [None]:
grader.check("q4")

As you can see, many of Cal Football's best seasons are quite far in the past. Only a few modern seasons even show up in any of these queries 😢

## Row selection: `where()` and the `are` Predicates

The last table method we will talk about is the `where()` method. The `where()` method keeps all rows that satisfy a particular boolean condition. It takes in a column label and an `are` predicate, which is built by calling one of the methods of `are`. These are the most important `are` methods, but there are many more if you would like to investigate: [Explore the `are` predicates here.](http://data8.org/datascience/predicates.html)

| Method | Input Type | Method Description |
| --- | --- | --- |
| `are.equal_to(n)` | number | Is the value from the column equal to `n`? |
| `are.above(n)` | number | Is the value from the column above `n`? |
| `are.above_or_equal_to(n)` | number | Is the value from the column above or equal to `n`? |
| `are.below(n)` | number | Is the value from the column below `n`? |
| `are.below_or_equal_to(n)` | number | Is the value from the column below or equal to `n`? |
| `are.containing(s)` | string | Is `s` contained in the string value from the given column? |
| `are.contained_in(s)` | string | Is the string value from the given column contained in `s`? |

Adding `not_` in front of any of these method names gives a predicate that does the opposite (ex: `are.not_equal_to(n)`).

For example, if we only wanted to see the Cal Football seasons where Cal had a tie, we could use the `where()` method combined with an `are` method:

In [32]:
football_table_relabeled.where("Ties", are.above(0))

Or if we wanted to see Cal's worst seasons, where the winning percentage was below .500, we can use a similar query:

In [33]:
football_table_relabeled.where("Winning Percentage", are.below(.5))

Again you can see that Cal Football (especially recently) has had some rough seasons 😢

For reference, here are links to the Lecture 16 and Lecture 18 slides so you can see all the methods we covered in this lab notebook!

[Lecture 16 slides](https://docs.google.com/presentation/d/1Oy9PYPbow8OVJBFHuB4yd10ZlYCt6paqtLp6H4EKPxE)

[Lecture 18 slides](https://docs.google.com/presentation/d/1Eh8WQan8sshT2eDMSFHB3_pdQLOmmNoBl0oL5WnOh6k)

## Done! 😇

That's it! There's nowhere for you to submit this, as labs are not assignments. However, please ask any questions you have about this notebook in lab or on Ed.

There are no extra problems this week. Good luck with the homework!

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)