# Discussion 01: Python Basics, Arrays & Tables


Welcome to Discussion 01! This week, we will go over Python Basics and take a detailed look at arrays & tables. You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).

Additionally, [here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a potentially useful reference sheet that contains several data wrangling tips.

I also highly recommend checking out [this](https://nationalzoo.si.edu/webcams/panda-cam) baby pandas resource.

<img src="data/panda.jpeg" width="600">

In [1]:
# please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np 
import math
import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update(
    "livereveal", {
        "width": "90%",
        "height": "90%",
        "scroll": True,
})

## Jupyter Notebook Shortcuts

shift+enter: run cell and move focus to cell below <br>
ctrl+enter: run cell and keep focus on cell <br>

Command Mode (cell is blue):<br>
x: cut the cell, also quick way to delete<br>
c: copy the cell<br>
v: paste the cell<br>
d+d: delete cell<br>
a: make new cell above<br>
b: make new cell below<br>
y: change cell to code<br>
m: change cell to markdown<br>
enter: start editing cell<br>

Editing Mode (cell is green):<br>
esc: enter command mode<br>
shift+tab: info about a function<br>

# What we'll cover:
---

- What is Python?
- Data Types
- Variables
- Functions
- Creating a Table

# What is Python?
---

Python is a **high-level**, **interpreted** programming language created by Guido van Rossum and first released in 1991. It is powerful yet **dynamically typed** and easily **readable**, and it makes heavy use of **whitespace**.

- Interpreted:
  - A file or cell can run instantly; does not need to compile to another file

- Dynamically Typed:
  - Python infers what type you want a variable to be; you don't tell it explicitly

- Readable:
  - Simply reading code aloud should largely reveal what's going on

- Whitespace:
  - You can *and should* use multiple lines to fit the `Python a e s t h e t i c`
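
As a small illustration of dynamic typing, the sketch below assigns two different kinds of values to the same name. The variable name `count` is made up for this example.

```python
# Python works out the type from the value being assigned.
count = 3                # an int; no type declaration is needed
print(type(count))

count = "three"          # the same name can later refer to a str
print(type(count))
```

Nothing needs to be compiled first: running the cell executes each line in order.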

# Data Types in Python
---

Everything in Python has a type.

Some things are really simple—you could call them *"primitive"*.  
These things have a specific value.

There are four types of primitives:
- Integers (ex. 1, 2, -12)
- Floats (ex. 1.0, 3.5, -0.34)
- Strings ("this is a string", "a", "b")
- Booleans (True, False)

Other things are a bit more complex.  
These things act more like containers for values (or more containers).

Some examples include:
- Lists
- Arrays
- Tables
- Dictionaries
- Sets
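
Lists, arrays, and tables are covered in detail below. For the other two containers, here is a minimal sketch; the names and values (`candy_prices`, `flavors`) are hypothetical.

```python
# A dictionary maps keys to values.
candy_prices = {"Twix": 1.25, "Skittles": 0.99}   # hypothetical data
print(type(candy_prices))
print(candy_prices["Twix"])      # look up a value by its key

# A set keeps only the unique values it is given.
flavors = {"sour", "sweet", "sour"}
print(type(flavors))
print(len(flavors))              # "sour" is stored only once
```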

### Primitive Types: integers, floats, strings, booleans

In [2]:
# Integers
type(65)

In [3]:
# Floats
type(1.0)

In [4]:
# Strings
type("Hello")

In [5]:
# Booleans (True or False)
type(False) 

What are some things we can do with these primitive types?

In [6]:
# Let's do some testing together... here's a couple to start with:

3 + 5 # Can we do this?

In [7]:
3 + 5.9876 # What about this?

In [8]:
# How about this?
# 3 + "string"

In [9]:
# or this?
"string" + "another string"

In [10]:
# Feel free to play around with different types and see what else is possible!

### Some Others: lists, arrays, and tables

In [11]:
# Lists
type([[5,2,'hello'], '1', 2, '3'])

In [12]:
# Lists can contain any type of data
['hi', ['how', 'are', 'you']]

### NumPy Arrays

NumPy arrays convert all of their elements to **the same type**.

In [13]:
import numpy as np

np.array([1, 2, 3])

In [14]:
print("overall type:", type(np.array([1, 2, 3])))
print("type of individual elements:", np.array([1, 2, 3]).dtype)

Recap:

All objects in Python have a type, some of which are primitive, some of which act more like containers.

If we ever forget what type something is, we can use `type()` to find out!

# Variables
---

In Python when you assign a variable like this:

`x = 4 + 3`

You're essentially telling Python this:

`From now on, please let the value of 'x' contain the value of 7.`

If you then re-assign the same variable name to a different value, the old value will be lost forever.

In [15]:
x = 4
y = "Why"
z = [4.0, "That's the dream..."]
class_choice = np.array(["Cake", 3.14])

print(x)
print(y)
print(z)
print(class_choice) 

What happens if we assign x again?

In [16]:
print(x)

In [17]:
x = "string"
print(x)

Recall that a variable takes on the type of whatever value you assign to it.

In [18]:
print(type(x))
print(type(y))
print(type(z))
print(type(class_choice))

We can even assign variables to other variables, but this can get a bit tricky.

In [19]:
x = 1
y = x
x = 2

print("y == x?       ", y == x)

Wait but I thought we just set `y = x`!

Recall that we're telling Python to assign y **to the value of** x, not directly to x!

What is the value of something then?
It's whatever is returned if you run it at the end of a cell.

In [20]:
# The value of x is
x

We can also perform operations on NumPy arrays.

In [21]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

In [22]:
x + y

In [23]:
x * 2

## Arrays vs. Lists

---
Arrays and lists are helpful when we want to store and manipulate **sequences** of data

##### Lists
- Built into Python
- Friendly with different data types
- Comparatively slow for numerical work on long sequences

##### Arrays
- Not built into Python directly (that's why we have NumPy!)
- Elements must be the same data type
- Much faster for element-wise numerical operations
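
To get a rough feel for the speed difference, the sketch below doubles one million numbers both ways and times each approach. The exact timings depend on your machine, so treat this as an informal comparison rather than a benchmark; the variable names are made up for this example.

```python
import time
import numpy as np

n = 1_000_000
numbers_list = list(range(n))
numbers_array = np.arange(n)

# Double every element of the list with a list comprehension.
start = time.perf_counter()
doubled_list = [value * 2 for value in numbers_list]
list_seconds = time.perf_counter() - start

# Double every element of the array with one element-wise operation.
start = time.perf_counter()
doubled_array = numbers_array * 2
array_seconds = time.perf_counter() - start

print("list: ", list_seconds, "seconds")
print("array:", array_seconds, "seconds")
```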


# Functions
---

Functions, like `print()`, allow us to easily run something with different <b>arguments</b>.

We can also define our own functions to allow us to run our own code multiple times with different arguments.

### Definitions:
<b>Parameter</b>: Variable in a function definition. Ex: `def print(string_to_print):`

<b>Argument</b>: Actual value used in function calls. Ex: `print("hello")`

This distinction is a bit pedantic: the two terms are often used interchangeably, and people will know what you mean either way.

Many functions take values as inputs.  
All functions will return a value (but that value may be `None`).

Just like all values in Python, these have a type!

So, it's important that we know what a function takes and what it returns.

This helps a lot when it comes to fixing bad code!
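
Here is a minimal sketch of defining our own functions. Both `rectangle_area` and `announce` are hypothetical names invented for this example; the second one has no `return` statement, so calling it gives back `None`.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle; `width` and `height` are parameters."""
    return width * height


def announce(message):
    """Print a message; with no return statement, the function returns None."""
    print("Announcement:", message)


area = rectangle_area(3, 6)      # 3 and 6 are the arguments
print(area, type(area))

result = announce("hello")       # prints, but returns None
print(result, type(result))
```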

A Python function is called with the following format:

`function_name(arg_1, arg_2, ...)`

For example, `sum` takes a list (or array-like object) as an argument.  
The function `len` can take a list too.

In [24]:
sum(np.array([1, 2, 3]))

In [25]:
len(np.array([1, 2, 3]))

And other functions, like `pow`, take more than one argument.

In [26]:
help(pow)

In [27]:
pow(1.618, 2)

Some objects have their own functions!  
To call one, use "dot notation", which looks like this:


`some_object.func_name(some_arg_1, some_arg_2, ...)`

In [28]:
"hello world".title()

In [29]:
"hello world".replace('world', 'DSC 10')

We can assign the result of a function call to a variable the same way we assign any variable!

In [30]:
x = pow(8, 2)
x

<b>Bonus Question</b>: What is the return type of `print("hello")`?

In [31]:
x = print("hello")
type(x)

<b>Bonus Bonus Question</b>: What will be printed out?

`
f = print
x = f("hello")
f(type(x))
`

In [32]:
# f = print
x = print("hello")
print(type(x))

## Array Problems (part 1)
---

Try out these problems to get a bit more familiar with NumPy arrays.

**Question 1.1**

We have 5 triangles.
`base` measures the base of each triangle, `height` measures the height.

What is the average area of a triangle in the data set?

In [33]:
base = np.array([3, 1, 3, 5, 2])
height = np.array([6, 2, 7, 7, 1])

In [34]:
average_area = ...
average_area

In [None]:
grader.check("q11")

## Recall: Ranges
We can use this to easily generate sequential NumPy arrays.

In [36]:
np.arange(20)

In [37]:
np.arange(3, 16, 4)
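
One detail that is easy to miss: `np.arange(start, stop, step)` never includes the `stop` value itself. The sketch below uses made-up numbers to show the effect of moving `stop` past the last value you want.

```python
import numpy as np

# stop=10 is excluded, so the array ends at 8.
evens = np.arange(2, 10, 2)
print(evens)

# Moving stop just past 10 makes 10 the last element.
evens_through_ten = np.arange(2, 11, 2)
print(evens_through_ten)
```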

**Question 1.2**

Create an array that runs from 0 to 50 (included), with steps of 5 as below:

0, 5, 10, ..., 45, 50

In [38]:
zero_to_fifty_array = ...
zero_to_fifty_array

In [None]:
grader.check("q12")

**Question 1.3**

Using code, create the array:

[4, 8, 16, 32, 64]

In [40]:
four_to_sixty_four_array = ...
four_to_sixty_four_array

In [None]:
grader.check("q13")

## FYI: Other array creation functions

In [42]:
ones_array = np.ones(5)
ones_array

In [43]:
zeros_array = np.zeros(5)
zeros_array

In [44]:
some_array = np.array([6, 1, 9, 5, 2, 3, 4, 3, 2, 4])

**Question 1.4** : How many elements are in ```some_array```?

<!--
BEGIN QUESTION
name: q14
-->

In [45]:
num_elems = ...
num_elems

In [None]:
grader.check("q14")

**Question 1.5** : How do we access the *first* element of ```some_array```?

<!--
BEGIN QUESTION
name: q15
-->

In [47]:
first_elem = ...
first_elem

In [None]:
grader.check("q15")

**Question 1.6** : How do we access the *last* element of ```some_array```?

<!--
BEGIN QUESTION
name: q16
-->

In [49]:
last_elem = ...
last_elem

In [None]:
grader.check("q16")

**Question 1.7** : What happens when we do ```some_array[-2]```?

In [52]:
some_array

In [53]:
some_array[-2]
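
As a quick reference for indexing, here is a sketch on a separate, hypothetical array called `letters`. Positions start at 0, negative positions count backward from the end, and a slice `[start:stop]` leaves out the `stop` position.

```python
import numpy as np

letters = np.array(["a", "b", "c", "d", "e"])  # hypothetical example array

print(letters[1])      # position 1 is the second element
print(letters[-1])     # -1 counts from the end: the last element
print(letters[-2])     # the second-to-last element
print(letters[1:4])    # positions 1, 2, and 3 (position 4 is excluded)
```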

**Question 1.8** : How do we make a new array that contains only the first 5 elements from ```some_array```? 

<!--
BEGIN QUESTION
name: q18
-->

In [55]:
first_five = ...
first_five

In [None]:
grader.check("q18")

In [57]:
array_1 = np.array([1,2,3,4,5,6,7,8])
array_2 = np.array([6,7,8,9,10,11,12,13])

**Question 1.9** : How do we get the element-wise sum of ```array_1``` and ```array_2```?

<!--
BEGIN QUESTION
name: q19
-->

In [58]:
elem_wise_sum = ...
elem_wise_sum

In [None]:
grader.check("q19")

**Question 1.10** : How do we get the max element from ```some_array```?

<!--
BEGIN QUESTION
name: q110
-->

In [60]:
max_elem = ...
max_elem

In [None]:
grader.check("q110")

**Remember!** NumPy Arrays will fit all data to the same type:

In [63]:
random_array = np.array([45,"hello", True, 987, 34.5, "Yes"])
random_array

In [64]:
#But Lists will not:
random_list = ['hello', 'there', 'buddy', 43, True, 38.9, 'DSC 10']
random_list

In [65]:
print(type(random_array))
print(type(random_list))

## Tables

---
Tables are a slightly more complex data structure that is helpful when we want to store and manipulate larger datasets with lots of information. We often refer to tables as "data frames", and these terms are interchangeable in this class.

**Rows** are labeled arrays that correspond to different entries or samples in the table. (The labels are typically *unique* and easily identifiable names.)

**Columns** are labeled arrays that correspond to the different pieces of information we care about. (The labels are the titles of each column.)

The **Index** of a table is an array completely separate from the columns and contains all of the row labels.

A **Series** is a labeled array that corresponds to a single column from the table.

##### Important note
To get a particular element from a table, in this class, we'll always ```.get()``` the column label, then ```.loc[]``` the row label.
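
The get-then-loc pattern can be sketched on a tiny made-up table. The table `pets` and its column names are hypothetical, and the `try`/`except` import is only there so the example also runs where `babypandas` is not installed (it falls back to `pandas`, which `babypandas` is built on).

```python
try:
    import babypandas as bpd
except ImportError:
    import pandas as bpd  # fallback for this sketch only

# Build a small table and use the pet names as row labels (the index).
pets = bpd.DataFrame().assign(
    pet=["Rex", "Milo", "Luna"],
    species=["dog", "cat", "cat"],
    age=[4, 2, 7],
).set_index("pet")

ages = pets.get("age")         # step 1: get the column (a Series)
milo_age = ages.loc["Milo"]    # step 2: look up the row label
print(milo_age)
```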



## Table Problems (part 2)

Try out these problems to get a bit more familiar with Data Frames.

# Ultimate Halloween Candy Showdown
---
269,000 user-submitted winners of head-to-head candy matchups

## Read from CSV

In [67]:
candy = bpd.read_csv("data/candy.csv")
candy

Right now, the rows are indexed by numbers 0 to 84.
This isn't very informative, so we can change the index to one of the existing columns.

What would be a good index to set?

## Setting the index

In [68]:
candy = candy.set_index('competitorname')
candy

This looks much better!

In [69]:
candy.get("sugarpercent")

**Question 2.1**

What is the `sugarpercent` of a bag of "Pop Rocks"?

In [70]:
sugar_percent_pop_rocks = ...
sugar_percent_pop_rocks

In [None]:
grader.check("q21")

**Question 2.2**

What is the highest `winpercent` out of any candy?

In [72]:
highest_win_percent = ...
highest_win_percent

In [None]:
grader.check("q22")

**Question 2.3**

Which candy has the highest `sugarpercent`?

In [74]:
candy_highest_sugar_percent = ...
candy_highest_sugar_percent

In [None]:
grader.check("q23")

**Question 2.4**

What is the `winpercent` of the candy with the least amount of sugar?

In [76]:
winpercent_least_sugar = ...
winpercent_least_sugar

In [None]:
grader.check("q24")

**Question 2.5**

What is the average `winpercent` of chocolate candies?

**Bonus**: try to do it with two different approaches.

In [78]:
winpercent_chocolate_average = ...
winpercent_chocolate_average

In [None]:
grader.check("q25")

**Question 2.6**

Store the competitor names of the top 5 most popular candies (by `winpercent`) in an array.

<!--
BEGIN QUESTION
name: q26
-->

In [81]:
top_five = ...
top_five

In [None]:
grader.check("q26")

**Question 2.7**

How many candies have caramel but not chocolate?

<!--
BEGIN QUESTION
name: q27
-->

In [83]:
candies_with_caramel_not_choco = ...
candies_with_caramel_not_choco

In [None]:
grader.check("q27")

**Question 2.8**

What is the name of the one candy that has the word "candy" as part of its name? 

<!--
BEGIN QUESTION
name: q28
-->

In [85]:
candy_name = ...
candy_name

In [None]:
grader.check("q28")

**Question 2.9**

Do chocolate candies or non-chocolate candies have a higher win percentage?
Your answer should be 1 for chocolate candies or 0 for non-chocolate candies.

<!--
BEGIN QUESTION
name: q29

-->

In [87]:
higher_winpercent_choc = ...
higher_winpercent_choc

In [None]:
grader.check("q29")

**Question 2.10**

Among chocolate candies, are there more hard candies or non-hard candies?
Your answer should be 1 for hard candies or 0 for non-hard candies.

<!--
BEGIN QUESTION
name: q210
-->

In [89]:
more_hard_or_soft = ...
more_hard_or_soft

In [None]:
grader.check("q210")

In [91]:
# recall df
candy

**Question 2.11**

Which candy has the most 1s across all binary variables? (Hint: Use ```.apply()```.)

<!--
BEGIN QUESTION
name: q211
-->

In [92]:
competitor_name = ...
competitor_name

In [None]:
grader.check("q211")

In [94]:
grader.check_all()

## More fun with Tables

In [95]:
# recall the full data frame we've been working with
candy

# Building and Organizing DataFrames

add/replace a column : ```df.assign(Name_of_Column=column_data)```

drops a single column :  ```df.drop(columns=column_name)```

drops every column in a list of columns : ```df.drop(columns=[col_1_name, ..., col_k_name])```

move the index to a column : ```df.reset_index()```

move the column to the index : ```df.set_index(column_name)```

sort entire DataFrame by values in a column : ```df.sort_values(by=column_name)```


In [96]:
# assign

new_col_data = np.arange(candy.shape[0])
candy.assign(new_col=new_col_data)

In [97]:
# drop (single)

candy.drop(columns='chocolate')

In [98]:
# drop (multiple)

candy.drop(columns=['chocolate','fruity','caramel','peanutyalmondy','nougat'])

In [99]:
# reset index

candy = candy.reset_index()
candy

In [100]:
# set index

candy = candy.set_index('competitorname')
candy

In [101]:
# sort

candy.sort_values(by="winpercent")

# Retrieving Information

## DataFrame methods

**Returns a Series**

retrieve column : ```df.get(column_name)```

**Returns a DataFrame**

retrieve several columns : ```df.get([col_1_name, ..., col_k_name])```

select row(s) by index position : ```df.take([pos_1, ..., pos_k])```

select row(s) using Boolean array : ```df[bool_array]```

**Returns other type**

number of rows : ```df.shape[0]``` **number**

number of columns : ```df.shape[1]``` **number**

retrieve element in the index by its position : ```df.index[position]``` **index name**

## Series methods

**Returns an element**

retrieve an element by the *row label* : ```ser.loc[label]```

retrieve an element by its *index position* : ```ser.iloc[position]```

In [102]:
# rows 

candy.shape[0]

In [103]:
# columns

candy.shape[1]

In [104]:
# get 

candy.get("chocolate") # Series

In [105]:
# get 

candy.get(["chocolate", "caramel"]) # DataFrame

In [106]:
# take

candy.take([0])

In [107]:
# take

candy.take(np.arange(10))

In [108]:
# bool access

all_true = np.ones(candy.shape[0]).astype(bool) # must match number of rows in df!
print(all_true)

some_true = np.random.randint(2, size=candy.shape[0]).astype(bool) # must match number of rows in df!
print(some_true)

candy[some_true]
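
In practice, the Boolean array usually comes from comparing a column to a value rather than from random numbers. Here is a minimal sketch on a made-up table; `snacks` and its columns are hypothetical, and the fallback import is only there so the example runs without `babypandas`.

```python
try:
    import babypandas as bpd
except ImportError:
    import pandas as bpd  # fallback for this sketch only

snacks = bpd.DataFrame().assign(
    snack=["Gummy Worms", "Fudge Bar", "Mint Drops"],
    chocolate=[0, 1, 0],
    price=[1.5, 2.0, 0.75],
).set_index("snack")

# Comparing a column to a value gives one True/False per row.
is_chocolate = snacks.get("chocolate") == 1
print(is_chocolate)

# Keep only the rows where the condition is True.
chocolate_snacks = snacks[is_chocolate]
print(chocolate_snacks)
print(chocolate_snacks.shape[0])   # number of rows that matched
```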

In [109]:
# series by label

win_series = candy.get("winpercent")

win_series.loc["Twizzlers"]

In [110]:
# series by position

win_series.iloc[80]

# More data : Fires in California

In [111]:
calfire = bpd.read_csv('data/calfire-full.csv')
calfire
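
Counting rows per category is one common way to summarize a table like this. The sketch below assumes `groupby` has been introduced in lecture and uses a made-up table, `sightings`, rather than the fire data; the fallback import is only there so the example runs without `babypandas`.

```python
try:
    import babypandas as bpd
except ImportError:
    import pandas as bpd  # fallback for this sketch only

sightings = bpd.DataFrame().assign(
    park=["North", "South", "North", "East", "North"],
    animal=["deer", "fox", "owl", "deer", "fox"],
)

# One row per park; each remaining column holds the number of rows in that group.
counts = sightings.groupby("park").count()
print(counts)

north_count = counts.get("animal").loc["North"]
print(north_count)
```

After `.count()`, the grouping column becomes the index, so `.get()` followed by `.loc[]` works on the result just like on any other table.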

**Question 3.1**

Create a new table with one column named `count` containing the number of fires in each county.

In [112]:
calfire_by_county = ...
calfire_by_county

**Question 3.2**

What was the county with the largest number of fires?

In [113]:
county_most_fires = ...
county_most_fires

In [None]:
grader.check("q32")

**Question 3.3**

How many fires were due to Arson?

**Bonus**: try this in two different ways.

In [115]:
arson_fires = ...
arson_fires

In [None]:
grader.check("q33")

In [118]:
grader.check_all()