# Discussion 02: Arrays and Tables

Welcome to Discussion 02! At the end of last week's discussion, we got a sneak peek at arrays and tables.

This week, we will go back over these data structures and work with the same datasets we saw last time. 

You can find additional help on these topics in the 'Arrays' and 'DataFrames' sections of the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).

[Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a link to the reference sheet we saw last time.

<img src="data/panda_basketball.jpg" width="600">

In [2]:
# please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

## Part 1 : Arrays vs. Lists

---
Arrays and lists are helpful when we want to store and manipulate **sequences** of data.

##### Lists
- Built into Python
- Friendly with different data types
- EXTREMELY SLOW for numerical work on long sequences (compared to arrays)

##### Arrays
- Not built into Python directly (that's why we have NumPy!)
- Elements must be the same data type
- MUCH FASTER
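
To make the contrast above concrete, here is a minimal sketch on made-up values (the names `example_list`, `example_array`, and `mixed_list` are hypothetical and not part of this worksheet). It shows that `+` joins two lists end to end but adds two arrays element by element, and that a list keeps each element's original type.

```python
import numpy as np

# Hypothetical example data, not used elsewhere in this notebook
example_list = [1, 2, 3]
example_array = np.array([1, 2, 3])

# For lists, + concatenates the two sequences
list_sum = example_list + example_list

# For arrays, + adds the elements position by position
array_sum = example_array + example_array

# A list can hold values of different types side by side
mixed_list = [1, "two", 3.0]
mixed_types = [type(x).__name__ for x in mixed_list]

print(list_sum)
print(array_sum)
print(mixed_types)
```

Element-wise arithmetic like this is a large part of why arrays are convenient for numerical data.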


## Array Problems (part 1)

In [3]:
some_array = np.array([6, 1, 9, 5, 2, 3, 4, 3, 2, 4])

**Question 1** : How many elements are in ```some_array```?

<!--
BEGIN QUESTION
name: q11
-->

In [4]:
num_elems = ...
num_elems

In [None]:
grader.check("q11")

**Question 2** : How do we access the *first* element of ```some_array```?

<!--
BEGIN QUESTION
name: q12
-->

In [6]:
first_elem = ...
first_elem

In [None]:
grader.check("q12")

**Question 3** : How do we access the *last* element of ```some_array```?

<!--
BEGIN QUESTION
name: q13
-->

In [8]:
last_elem = ...
last_elem

In [None]:
grader.check("q13")

**Question 4** : What happens when we do ```some_array[-2]```?

In [11]:
some_array

In [12]:
some_array[-2]

**Question 5 - BONUS** : How do we make a new array that contains only the first 5 elements from ```some_array```? 

<!--
BEGIN QUESTION
name: q15
-->

In [14]:
first_five = ...
first_five

In [None]:
grader.check("q15")

In [26]:
array_1 = np.array([1,2,3,4,5,6,7,8])
array_2 = np.array([6,7,8,9,10,11,12,13])

**Question 6** : How do we get the element-wise sum of ```array_1``` and ```array_2```?

<!--
BEGIN QUESTION
name: q16
-->

In [27]:
elem_wise_sum = ...
elem_wise_sum

In [None]:
grader.check("q16")

**Question 7** : How do we get the max element from ```some_array```?

<!--
BEGIN QUESTION
name: q17
-->

In [29]:
max_elem = ...
max_elem

In [None]:
grader.check("q17")

**Question 8 - BONUS** : How do we get the average of the first 6 elements from ```some_array```?

<!--
BEGIN QUESTION
name: q18
-->

In [32]:
first_six_average = ...
first_six_average

In [None]:
grader.check("q18")

**Question 9** : How do we make an array with every non-negative integer under 13?

<!--
BEGIN QUESTION
name: q19
-->

In [34]:
under_13 = ...
under_13

In [None]:
grader.check("q19")

**Question 10** : How do we make an array of [6,9,12,15,18,21]?

<!--
BEGIN QUESTION
name: q110
-->

In [36]:
threes_array = ...
threes_array

In [None]:
grader.check("q110")

**Question 11** : How do we make an array of 2 raised to each power from 2 to 6, i.e. [4,8,16,32,64]?

<!--
BEGIN QUESTION
name: q111
-->

In [38]:
powers_of_two = ...
powers_of_two

In [None]:
grader.check("q111")

**Remember!** NumPy arrays convert all of their elements to a single data type:

In [40]:
random_array = np.array([45,"hello", True, 987, 34.5, "Yes"])
random_array

In [41]:
# But lists will not:
random_list = ['hello', 'there', 'buddy', 43, True, 38.9, 'DSC 10']
random_list

In [42]:
print(type(random_array))
print(type(random_list))
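
The `type()` calls above only report the container type. As a small sketch on made-up values (the variable names here are hypothetical), the `.dtype` attribute shows which single element type NumPy settled on: integers mixed with a float all become floats, and adding a string turns every element into a string.

```python
import numpy as np

# Hypothetical example arrays, not used elsewhere in this notebook
ints_only = np.array([1, 2, 3])
ints_and_floats = np.array([1, 2.5, 3])
with_a_string = np.array([1, 2.5, "three"])

print(ints_only.dtype)        # an integer dtype (exact width depends on the platform)
print(ints_and_floats.dtype)  # a float dtype: the integers were converted to floats
print(with_a_string.dtype)    # a string dtype: the numbers were converted to strings

# The first element is now the string '1', not the number 1
first_of_mixed = with_a_string[0]
print(repr(first_of_mixed))
```

This is why arithmetic on an array that accidentally contains a string will not behave like arithmetic on numbers.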

## Part 2 : Tables

---
Tables are a slightly more complex data structure that is helpful when we want to store and manipulate larger datasets with lots of information. We often refer to tables as "data frames", and these terms are interchangeable in this class.

**Rows** are labeled arrays that correspond to different entries or samples in the table. (The labels are typically *unique* and easily identifiable names.)

**Columns** are labeled arrays that correspond to the different pieces of information we care about. (The labels are the titles of each column.)

The **Index** of a table is an array completely separate from the columns and contains all of the row labels.

A **Series** is a labeled array that corresponds to a single column from the table.

##### Important note
To get a particular element from a table, in this class, we'll always ```.get()``` the column label, then ```.loc[]``` the row label.
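
Here is a minimal sketch of that two-step pattern on a made-up table. It assumes plain `pandas` (imported as `pd`) rather than `babypandas`, on the assumption that `.set_index()`, `.get()`, and `.loc[]` behave the same way in both; the table `toy` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical three-row table; "name" becomes the index (the row labels)
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "score": [10.0, 20.0, 30.0],
}).set_index("name")

# Step 1: .get() the column label, which gives a Series
score_series = toy.get("score")

# Step 2: .loc[] the row label, which gives a single element
beta_score = score_series.loc["Beta"]

# The two steps are usually chained on one line
same_value = toy.get("score").loc["Beta"]

print(beta_score)
print(same_value)
```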


## Table Problems (Part 2)

# Ultimate Halloween Candy Showdown (remember this?)
---
269,000 user-submitted winners of head-to-head candy matchups

### Read from CSV and Set the index

In [43]:
candy = bpd.read_csv("data/candy.csv")
candy = candy.set_index('competitorname')
candy

In [44]:
candy.get("sugarpercent")

## Q2.1

Store the competitor names of the 5 most popular candies (those with the highest winpercent) in an array.

<!--
BEGIN QUESTION
name: q21
-->

In [78]:
top_five = ...
top_five

In [None]:
grader.check("q21")

## Q2.2

How many candies have caramel but not chocolate?

<!--
BEGIN QUESTION
name: q22
-->

In [80]:
candies_with_caramel_not_choco = ...
candies_with_caramel_not_choco

In [None]:
grader.check("q22")

## Q2.3

What is the name of the one candy that has the word "candy" as part of its name? 

<!--
BEGIN QUESTION
name: q23
-->

In [49]:
candy_name = ...
candy_name

In [None]:
grader.check("q23")

## Q2.4
Do chocolate candies or non-chocolate candies have a higher win percentage?
Your answer should be 1 for chocolate candies or 0 for non-chocolate candies.

<!--
BEGIN QUESTION
name: q24

-->

In [51]:
higher_winpercent_choc = ...
higher_winpercent_choc

In [None]:
grader.check("q24")

## Q2.5

Among chocolate candies, are there more hard candies or non-hard candies?
Your answer should be 1 for hard candies or 0 for non-hard candies.

<!--
BEGIN QUESTION
name: q25
-->

In [53]:
more_hard_or_soft = ...
more_hard_or_soft

In [None]:
grader.check("q25")

In [81]:
# recall df
candy

## Q2.6 - BONUS

Which candy has the most 1s across all binary variables? (Hint: Use ```.apply()```)

<!--
BEGIN QUESTION
name: q26
-->

In [57]:
competitor_name = ...
competitor_name

In [None]:
grader.check("q26")

In [59]:
grader.check_all()

# Supplemental Reference Sheet Examples

**See remote discussion recording for details/explanation**

## More fun with Tables

In [60]:
# recall the full data frame we've been working with
candy

# Building and Organizing DataFrames

add/replace a column : ```df.assign(Name_of_Column=column_data)```

drops a single column :  ```df.drop(columns=column_name)```

drops every column in a list of columns : ```df.drop(columns=[col_1_name, ..., col_k_name])```

move the index to a column : ```df.reset_index()```

move the column to the index : ```df.set_index(column_name)```

sort entire DataFrame by values in a column : ```df.sort_values(by=column_name)```


In [61]:
# assign

new_col_data = np.arange(candy.shape[0])
candy.assign(new_col=new_col_data)

In [62]:
# drop (single)

candy.drop(columns='chocolate')

In [63]:
# drop (multiple)

candy.drop(columns=['chocolate','fruity','caramel','peanutyalmondy','nougat'])

In [64]:
# reset index

candy = candy.reset_index()
candy

In [65]:
# set index

candy = candy.set_index('competitorname')
candy

In [66]:
# sort

candy.sort_values(by="winpercent")

# Retrieving Information

## DataFrame methods

**Returns a Series**

retrieve column : ```df.get(column_name)```

**Returns a DataFrame**

retrieve several columns : ```df.get([col_1_name, ..., col_k_name])```

select row(s) by index position : ```df.take([pos_1, ..., pos_k])```

select row(s) using Boolean array : ```df[bool_array]```

**Returns other type**

number of rows : ```df.shape[0]``` **number**

number of columns : ```df.shape[1]``` **number**

retrieve element in the index by its position : ```df.index[position]``` **index name**

## Series methods

**Returns an element**

retrieve an element by the *row label* : ```ser.loc[label]```

retrieve an element by its *index position* : ```ser.iloc[position]```

In [67]:
# rows 

candy.shape[0]

In [70]:
# columns

candy.shape[1]

In [71]:
# get 

candy.get("chocolate") # Series

In [72]:
# get 

candy.get(["chocolate", "caramel"]) # DataFrame

In [73]:
# take

candy.take([0])

In [74]:
# take

candy.take(np.arange(10))

In [75]:
# bool access

all_true = np.ones(candy.shape[0]).astype(bool) # must match number of rows in df!
print(all_true)

some_true = np.random.randint(2, size=candy.shape[0]).astype(bool) # must match number of rows in df!
print(some_true)

candy[some_true]
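
The Boolean arrays in the cell above are all-true or random, which is handy for showing the mechanics. In practice the Boolean sequence usually comes from a comparison on a column. The sketch below shows that on a made-up table; it assumes plain `pandas` rather than `babypandas`, and the table `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical table, not the candy data
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "score": [10, 25, 40],
    "flag": [1, 0, 1],
}).set_index("name")

# A comparison on a column gives one True/False value per row
high_score = toy.get("score") > 20

# Keep only the rows where the value is True
selected = toy[high_score]

# Two conditions can be combined with & (each wrapped in parentheses)
both = toy[(toy.get("score") > 20) & (toy.get("flag") == 1)]

print(selected)
print(both)
```

Because the comparison is computed from the table itself, it automatically has one entry per row, so the length always matches.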

In [76]:
# series by label

win_series = candy.get("winpercent")

win_series.loc["Twizzlers"]

In [77]:
# series by position

win_series.iloc[80]