# Discussion 01: Python Basics


Welcome to Discussion 01! This week, we will go over Python Basics. You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).

Additionally, [here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a potentially useful reference sheet that contains several data wrangling tips.

I also highly recommend checking out [this](https://nationalzoo.si.edu/webcams/panda-cam) baby pandas resource.

<img src="data/panda.jpeg" width="600">

In [1]:
# please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np 
import math
import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update(
    "livereveal", {
        "width": "90%",
        "height": "90%",
        "scroll": True,
})

{'width': '90%', 'height': '90%', 'scroll': True}

## Jupyter Notebook Shortcuts

shift+enter: run cell and move focus to cell below <br>
ctrl+enter: run cell and keep focus on cell <br>

Command Mode (cell is blue):<br>
x: cut the cell, also quick way to delete<br>
c: copy the cell<br>
v: paste the cell<br>
d+d: delete cell<br>
a: make new cell above<br>
b: make new cell below<br>
y: change cell to code<br>
m: change cell to markdown<br>
enter: start editing cell<br>

Editing Mode (cell is green):<br>
esc: enter command mode<br>
shift+tab: info about a function<br>

# What we'll cover:
---

- What is Python?
- Data Types
- Variables
- Functions
- Creating a Table

# What is Python?
---

Python is a **high-level**, **interpreted** programming language created by Guido van Rossum and first released in 1991. It is a powerful language that is also **dynamically typed** and easily **readable**, and it makes good use of **whitespace**.

- Interpreted:
  - A file or cell can run instantly; does not need to compile to another file

- Dynamically Typed:
  - Python infers what type you want a variable to be; you don't tell it explicitly

- Readable:
  - Simply reading code aloud should largely reveal what's going on

- Whitespace:
  - You can *and should* use multiple lines to fit the `Python a e s t h e t i c`
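
As a small illustration of dynamic typing, the sketch below reuses one name for values of two different types. The variable names are made up for this example.

```python
# Dynamic typing: we never declare a type; Python tracks the type of the
# value that a name currently refers to.
value = 42
first_type = type(value).__name__     # 'int'

value = "forty-two"                   # same name, new value of another type
second_type = type(value).__name__    # 'str'

print(first_type, second_type)
```

Nothing needed to be compiled or declared ahead of time; running the cell is enough.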

# Data Types in Python
---

Everything in Python has a type.

Some things are really simple—you could call them *"primitive"*.  
These things have a specific value.

There are four types of primitives:
- Integers (ex. 1, 2, -12)
- Floats (ex. 1.0, 3.5, -0.34)
- Strings ("this is a string", "a", "b")
- Booleans (True, False)

Other things are a bit more complex.  
These things act more like containers for values (or more containers).

Some examples include:
- Lists
- Arrays
- Tables
- Dictionaries
- Sets
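
Here is a quick sketch of what a few of these containers look like in code. The values are invented for illustration, and tables are left out because they come up later in this notebook.

```python
import numpy as np

# Each container holds several values at once.
containers = {
    "list": [1, "two", 3.0],                      # ordered, any mix of types
    "array": np.array([1, 2, 3]),                 # ordered, one shared type
    "dictionary": {"name": "panda", "age": 3},    # key -> value pairs
    "set": {1, 2, 2, 3},                          # unordered, duplicates dropped
}

for label, container in containers.items():
    print(label, type(container).__name__, len(container))
```

Note that the set ends up with three elements, since the repeated `2` is stored only once.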

### Primitive Types: integers, floats, strings, booleans

In [2]:
# Integers
type(65)

int

In [3]:
# Floats
type(1.0)

float

In [4]:
# Strings
type("Hello")

str

In [5]:
# Booleans (True or False)
type(False) 

bool

What are some things we can do with these primitive types?

In [6]:
# Let's do some testing together... here's a couple to start with:

3 + 5 # Can we do this?

8

In [7]:
3 + 5.9876 # What about this?

8.9876

In [8]:
# How about this?
# 3 + "string"
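
If you uncomment the line above, Python raises a `TypeError`, because it will not guess how to combine an integer with a string. The sketch below catches that error so the cell keeps running, then shows one way to convert explicitly. The variable names are just for illustration.

```python
# Adding an int and a str is not allowed, so Python raises a TypeError.
try:
    result = 3 + "string"
except TypeError as error:
    result = None
    print("TypeError:", error)

# Converting explicitly tells Python which kind of "+" we mean.
as_text = str(3) + "string"    # string concatenation -> '3string'
as_number = 3 + int("5")       # integer addition     -> 8

print(as_text, as_number)
```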

In [9]:
# or this?
"string" + "another string"

'stringanother string'

In [10]:
# Feel free to play around with different types and see what else is possible!

### Some Others: lists, arrays, and tables

In [11]:
# Lists
type([[5,2,'hello'], '1', 2, '3'])

list

In [12]:
# Lists can contain any type of data
['hi', ['how', 'are', 'you']]

['hi', ['how', 'are', 'you']]

### NumPy Arrays

NumPy arrays store all of their elements as **the same type**, converting values if needed.

In [13]:
import numpy as np

np.array([1, 2, 3])

array([1, 2, 3])

In [14]:
print("overall type:", type(np.array([1, 2, 3])))
print("type of individual elements:", np.array([1, 2, 3]).dtype)

overall type: <class 'numpy.ndarray'>
type of individual elements: int64
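
To see what "the same type" means in practice, the sketch below builds arrays from mixed inputs and inspects the resulting `dtype`. The array names are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])     # ints and a float together
mixed_text = np.array(["Cake", 3.14])     # a string and a float together

print(ints.dtype)             # an integer dtype (the exact width can vary by platform)
print(mixed_numbers.dtype)    # float64: the ints were converted to floats
print(mixed_text.dtype.kind)  # 'U' means text: the float was converted to a string
print(mixed_text[1])          # 3.14, but stored as the string '3.14'
```

This is why mixing numbers and text in one array is usually a mistake: the numbers silently become strings.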


Recap:

All objects in Python have a type, some of which are primitive, some of which act more like containers.

If we ever forget what type something is, we can use `type()` to find out!

# Variables
---

In Python when you assign a variable like this:

`x = 4 + 3`

You're essentially telling Python this:

`From now on, please let the name 'x' refer to the value 7.`

If you then re-assign the same variable name to a different value, the old value will be lost forever.

In [15]:
x = 4
y = "Why"
z = [4.0, "That's the dream..."]
class_choice = np.array(["Cake", 3.14])

print(x)
print(y)
print(z)
print(class_choice) 

4
Why
[4.0, "That's the dream..."]
['Cake' '3.14']


What happens if we assign x again?

In [16]:
print(x)

4


In [17]:
x = "string"
print(x)

string


Recall that a variable takes on the type of whatever value you assign to it.

In [18]:
print(type(x))
print(type(y))
print(type(z))
print(type(class_choice))

<class 'str'>
<class 'str'>
<class 'list'>
<class 'numpy.ndarray'>


We can even assign variables to other variables, but this can get a bit tricky.

In [19]:
x = 1
y = x
x = 2

print("y == x?       ", y == x)

y == x?        False


Wait, but I thought we just set `y = x`!

Recall that we're telling Python to assign y **to the value of** x, not directly to x!

What is the value of something, then?
It's whatever gets displayed when you put it on the last line of a cell and run that cell.

In [20]:
# The value of x is
x

2

We can also perform operations on NumPy arrays.

In [21]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

In [22]:
x + y

array([5, 7, 9])

In [23]:
x * 2

array([2, 4, 6])
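
Array arithmetic works element by element, which is different from how plain lists behave. The sketch below repeats the two arrays from above and adds a list for contrast; `doubled_list` is a name made up for this example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

product = x * y          # element-wise: array([ 4, 10, 18])
squares = x ** 2         # element-wise: array([1, 4, 9])
total = (x + y).sum()    # add element-wise, then sum everything: 21

# For a plain list, * means "repeat", not "multiply each element".
doubled_list = [1, 2, 3] * 2   # [1, 2, 3, 1, 2, 3]

print(product, squares, total, doubled_list)
```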

# Functions
---

Functions, like `print()`, allow us to easily run something with different <b>arguments</b>.

We can also define our own functions to allow us to run our own code multiple times with different arguments.

### Definitions:
<b>Parameter</b>: Variable in a function definition. Ex: `def print(string_to_print):`

<b>Argument</b>: Actual value used in function calls. Ex: `print("hello")`

The distinction is a bit pedantic: the two terms are often used interchangeably, and people will know what you mean either way.
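
Here is a minimal sketch of the parameter/argument distinction. The function `triangle_area` is a hypothetical example written for this illustration.

```python
def triangle_area(base, height):
    """Return the area of a triangle; `base` and `height` are parameters."""
    return 0.5 * base * height

# 3 and 6 are the arguments passed in this particular call.
area = triangle_area(3, 6)

# Arguments can also be passed by parameter name, in any order.
same_area = triangle_area(height=6, base=3)

print(area, same_area)   # 9.0 9.0
```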

Many functions take values as inputs.  
All functions will return a value (but that value may be `None`).

Just like all values in Python, a function's inputs and its return value each have a type!

So, it's important that we know what a function takes and what it returns.

This helps a lot when it comes to fixing bad code!

A Python function is called with the following format:

`function_name(arg_1, arg_2, ...)`

For example, `sum` takes a list (or array-like object) as an argument.  
The function `len` can take a list too.

In [24]:
sum(np.array([1, 2, 3]))

6

In [25]:
len(np.array([1, 2, 3]))

3

Other functions, like `pow`, take more than one argument.

In [26]:
help(pow)

Help on built-in function pow in module builtins:

pow(x, y, z=None, /)
    Equivalent to x**y (with two arguments) or x**y % z (with three arguments)
    
    Some types, such as ints, are able to use a more efficient algorithm when
    invoked using the three argument form.



In [27]:
pow(1.618, 2)

2.6179240000000004
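
The help text above also mentions an optional third argument. As a quick sketch with integers (the variable names are made up for this example):

```python
two_arg = pow(2, 10)            # same as 2 ** 10        -> 1024
three_arg = pow(2, 10, 1000)    # same as 2 ** 10 % 1000 -> 24

print(two_arg, three_arg)
```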

Some objects have their own functions, which are called methods!  
To call one, you need to use "dot notation", which looks like this:


`some_object.func_name(some_arg_1, some_arg_2, ...)`

In [28]:
"hello world".title()

'Hello World'

In [29]:
"hello world".replace('world', 'DSC 10')

'hello DSC 10'
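
A few more string methods, as a sketch. Each call returns a new string (or list) and leaves the original string unchanged; the variable names are made up for this example.

```python
greeting = "hello world"

shouted = greeting.upper()        # 'HELLO WORLD'
words = greeting.split(" ")       # ['hello', 'world']

# Method calls can be chained: each one runs on the previous result.
chained = greeting.replace("world", "there").upper()   # 'HELLO THERE'

print(shouted, words, chained)
print(greeting)                   # still 'hello world'
```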

We can assign the result of a function call to a variable, the same way we assign any other value!

In [30]:
x = pow(8, 2)
x

64

<b>Bonus Question</b>: What is the return type of `print("hello")`?

In [31]:
x = print("hello")
type(x)

hello


NoneType

<b>Bonus Bonus Question</b>: What will be printed out?

`
f = print
x = f("hello")
f(type(x))
`

In [32]:
f = print
x = f("hello")
f(type(x))

hello
<class 'NoneType'>


# Practice questions part 1
---

Try out these problems to get a bit more familiar with NumPy arrays.

## Question 1.1

We have 5 triangles.
`base` measures the base of each triangle, `height` measures the height.

What is the average area of a triangle in the data set?

In [33]:
base = np.array([3, 1, 3, 5, 2])
height = np.array([6, 2, 7, 7, 1])

In [34]:
average_area = np.mean(0.5 * base * height) # SOLUTION
average_area

7.8

In [None]:
grader.check("q11")

## Recall: Ranges
We can use `np.arange` to easily generate sequential NumPy arrays.

In [36]:
np.arange(20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [37]:
np.arange(3, 16, 4)

array([ 3,  7, 11, 15])
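
With three arguments, `np.arange(start, stop, step)` begins at `start`, moves in increments of `step`, and stops before reaching `stop`. A short sketch, with names made up for this example:

```python
import numpy as np

evens = np.arange(0, 10, 2)         # array([0, 2, 4, 6, 8]); 10 itself is excluded
countdown = np.arange(10, 0, -2)    # array([10,  8,  6,  4,  2]); a negative step counts down

print(evens)
print(countdown)
```

Because `stop` is excluded, it has to be set just past the last value you want to include.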

## Question 1.2

Create an array that runs from 0 to 50 (included), with steps of 5 as below:

0, 5, 10, ..., 45, 50

In [38]:
zero_to_fifty_array = np.arange(0,51,5) # SOLUTION
zero_to_fifty_array

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50])

In [None]:
grader.check("q12")

## Question 1.3

Using code, create the array:

[4, 8, 16, 32, 64]

In [40]:
four_to_sixty_four_array = pow(2,np.arange(2,7)) # SOLUTION
four_to_sixty_four_array

array([ 4,  8, 16, 32, 64])

In [None]:
grader.check("q13")

## FYI: Other array creation functions

In [42]:
ones_array = np.ones(5)
ones_array

array([1., 1., 1., 1., 1.])

In [43]:
zeros_array = np.zeros(5)
zeros_array

array([0., 0., 0., 0., 0.])
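
Two more creation functions that may come in handy, shown as a sketch with names made up for this example: `np.full` fills an array with one repeated value, and `np.linspace` produces evenly spaced values that include both endpoints.

```python
import numpy as np

sevens = np.full(3, 7)                  # array([7, 7, 7])
evenly_spaced = np.linspace(0, 1, 5)    # 5 values from 0 to 1, both included

print(sevens)
print(evenly_spaced)
```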

# Practice questions part 2
---

Try out these problems to get a bit more familiar with DataFrames.

# Ultimate Halloween Candy Showdown
---
269,000 user-submitted winners of head-to-head candy matchups

## Read from CSV

In [44]:
candy = bpd.read_csv("data/candy.csv")
candy

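| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | Twizzlers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.220 | 0.116 | 45.466282 |
| 81 | Warheads | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.093 | 0.116 | 39.011898 |
| 82 | WelchÕs Fruit Snacks | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.313 | 0.313 | 44.375519 |
| 83 | WertherÕs Original Caramel | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.186 | 0.267 | 41.904308 |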


Right now, the rows are indexed by numbers 0 to 84.
This isn't very informative, so we can change the index to one of the existing columns.

What would be a good index to set?

## Setting the index

In [45]:
candy = candy.set_index('competitorname')
candy

| competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 | 66.971725 |
| 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Twizzlers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.220 | 0.116 | 45.466282 |
| Warheads | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.093 | 0.116 | 39.011898 |
| WelchÕs Fruit Snacks | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.313 | 0.313 | 44.375519 |
| WertherÕs Original Caramel | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.186 | 0.267 | 41.904308 |


This looks much better!
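
To see why a meaningful index helps, here is a minimal sketch on a toy table. It uses plain `pandas` as a stand-in for `babypandas` (an assumption made so the example is self-contained), and the `snacks` data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy table, not the real candy data.
snacks = pd.DataFrame({
    "name": ["Apple", "Pretzel", "Cookie"],
    "sweet": [1, 0, 1],
    "price": [0.50, 1.25, 2.00],
})

# set_index returns a new table whose row labels come from the "name" column.
by_name = snacks.set_index("name")

# With names as the index, we can look up a row by its label.
pretzel_price = by_name.get("price").loc["Pretzel"]

# The original table is unchanged and still has "name" as a regular column.
still_has_name_column = "name" in snacks.columns

print(pretzel_price, still_has_name_column)
```

This is also why the cell above reassigns the result back to `candy`: without the reassignment, the new index would be lost.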

## Question 2.1

What is the `sugarpercent` of a bag of "Pop Rocks"?

In [46]:
sugar_percent_pop_rocks = candy.get("sugarpercent").loc["Pop Rocks"] # SOLUTION
sugar_percent_pop_rocks

0.60399997

In [None]:
grader.check("q21")

## Question 2.2

What is the highest `winpercent` out of any candy?

In [48]:
highest_win_percent = candy.get("winpercent").max() # SOLUTION
highest_win_percent

84.18029

In [None]:
grader.check("q22")

## Question 2.3

Which candy has the highest `sugarpercent`?

In [50]:
candy_highest_sugar_percent = candy.sort_values(by="sugarpercent", ascending=False).index[0] # SOLUTION
candy_highest_sugar_percent

'ReeseÕs stuffed with pieces'

In [None]:
grader.check("q23")

## Question 2.4

What is the `winpercent` of the candy with the least amount of sugar?

In [52]:
winpercent_least_sugar = candy.get("winpercent").loc[candy.sort_values(by="sugarpercent").index[0]] # SOLUTION
winpercent_least_sugar

32.261086

In [None]:
grader.check("q24")

## Question 2.5

What is the average `winpercent` of chocolate candies?

**Bonus**: try to do it with two different approaches.

In [54]:
winpercent_chocolate_average = candy[candy.get("chocolate") == 1].get("winpercent").mean() # SOLUTION
winpercent_chocolate_average

60.9215294054054

In [None]:
grader.check("q25")

In [56]:
def has_chocolate(value): # SOLUTION NO PROMPT
    return value == 1 # SOLUTION NO PROMPT

winpercent_chocolate_average = candy[candy.get("chocolate").apply(has_chocolate)].get("winpercent").mean() # SOLUTION NO PROMPT
winpercent_chocolate_average # SOLUTION NO PROMPT

60.9215294054054

# More data: Fires in California

In [57]:
calfire = bpd.read_csv('data/calfire-full.csv')
calfire

| | year | month | unit | name | cause | acres | county | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1898 | 9 | Ventura County | LOS PADRES | 14 - Unknown / Unidentified | 20539.949219 | Ventura County | -119.367830 | 34.446830 |
| 1 | 1898 | 4 | Ventura County | MATILIJA | 14 - Unknown / Unidentified | 2641.123047 | Ventura County | -119.299625 | 34.488614 |
| 2 | 1898 | 9 | Ventura County | COZY DELL | 14 - Unknown / Unidentified | 2974.585205 | Ventura County | -119.265380 | 34.482316 |
| 3 | 1902 | 8 | Ventura County | FEROUD | 14 - Unknown / Unidentified | 731.481567 | Ventura County | -119.320979 | 34.417515 |
| 4 | 1903 | 10 | Ventura County | SAN ANTONIO | 14 - Unknown / Unidentified | 380.260590 | Ventura County | -119.253422 | 34.430616 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13532 | 2019 | 9 | Monterey - San Benito CAL FIRE | STAGE | 7 - Arson | 13.019149 | Monterey County | -121.599207 | 36.764065 |
| 13533 | 2019 | 10 | Monterey - San Benito CAL FIRE | CROSS | 14 - Unknown / Unidentified | 289.151428 | Monterey County | -120.726245 | 35.793698 |
| 13534 | 2019 | 9 | Monterey - San Benito CAL FIRE | FRUDDEN | 2 - Equipment Use | 11.789393 | Monterey County | -120.908061 | 35.908627 |
| 13535 | 2019 | 9 | Monterey - San Benito CAL FIRE | JOLON | 11 - Powerline | 61.592369 | Monterey County | -121.010025 | 35.910750 |


## Question 3.1

Create a new table with one column named `count` containing the number of fires in each county.

In [58]:
calfire_by_county = bpd.DataFrame().assign(count=calfire.groupby('county').count().get("year")) #SOLUTION
calfire_by_county

| county | count |
|---|---|
| Alameda County | 53 |
| Alpine County | 21 |
| Amador County | 72 |
| Butte County | 173 |
| Calaveras County | 217 |
| ... | ... |
| Tulare County | 498 |
| Tuolumne County | 381 |
| Ventura County | 458 |
| Yolo County | 65 |
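
To unpack what the group-and-count step is doing, here is a sketch on a tiny invented table. It uses plain `pandas` as a stand-in for `babypandas` (an assumption made so the example is self-contained); the `fires` data and county labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per fire.
fires = pd.DataFrame({
    "county": ["A", "B", "A", "C", "A", "B"],
    "year": [2001, 2001, 2005, 2010, 2012, 2019],
    "acres": [10.0, 250.5, 3.2, 47.0, 880.1, 12.4],
})

# groupby + count gives, for each county, the number of rows in every column.
counts = fires.groupby("county").count()

# Keep just one of those (identical) columns and call it "count".
by_county = pd.DataFrame().assign(count=counts.get("year"))

# Sorting by count puts the county with the most fires first.
most_fires = by_county.sort_values(by="count", ascending=False).index[0]

print(by_county)
print(most_fires)
```

Any column with no missing values works in place of `"year"`, since `count()` reports the number of non-missing entries per group.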


## Question 3.2

What was the county with the largest number of fires?

In [59]:
county_most_fires = calfire_by_county.sort_values(by="count", ascending=False).index[0] # SOLUTION
county_most_fires

'Los Angeles County'

In [None]:
grader.check("q32")

## Question 3.3

How many fires were due to arson?

**Bonus**: try this in two different ways.

In [61]:
arson_fires = calfire[calfire.get("cause") == "7 - Arson"].shape[0] # SOLUTION
arson_fires

763

In [None]:
grader.check("q33")

In [63]:
def caused_by_arson(value): # SOLUTION NO PROMPT
    return value == "7 - Arson" # SOLUTION NO PROMPT

arson_fires = calfire[calfire.get("cause").apply(caused_by_arson)].shape[0] # SOLUTION NO PROMPT
arson_fires # SOLUTION NO PROMPT

763

In [64]:
grader.check_all()