# Imports

In [1]:
import pandas as pd
import numpy as np


## display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Simulate data

In [2]:
## setting seed for reproducibility
np.random.seed(1129)
shop_names = ['Compass', 'Starbucks', 'Baked and Wired', 'Peets', 
              'Blue Bottle', 'Saxbys']
coffee_df = pd.concat([pd.DataFrame({'shop_name': shop_names,
                         'opening_time': np.random.choice(["8:00 AM", "9:00 AM", "10:00 AM"],
                                                       len(shop_names),
                                                       replace = True),
                         'closing_time': np.random.choice(["5:00 PM", "6:00 PM", 
                                                          "7:00 PM"],
                                                         len(shop_names),
                                                          replace = True),
                         'hourly_wage': np.random.uniform(14, 20,
                                                          len(shop_names)),
                        'year': 2019}),
                      pd.DataFrame({'shop_name': shop_names,
                         'opening_time': np.random.choice(["8:00 AM", "9:00 AM", "10:00 AM"],
                                                       len(shop_names),
                                                       replace = True),
                         'closing_time': np.random.choice(["3:00 PM", "4:00 PM",
                                                           "6:00 PM", 
                                                          "7:00 PM"],
                                                         len(shop_names),
                                                          replace = True),
                         'hourly_wage': np.random.uniform(14, 20,
                                                          len(shop_names)),
                        'year': 2021})]).sort_values(by = 'shop_name')
                      

coffee_df

Unnamed: 0,shop_name,opening_time,closing_time,hourly_wage,year
2,Baked and Wired,10:00 AM,5:00 PM,16.308214,2019
2,Baked and Wired,10:00 AM,3:00 PM,19.222221,2021
4,Blue Bottle,8:00 AM,7:00 PM,19.231655,2019
4,Blue Bottle,9:00 AM,7:00 PM,19.951942,2021
0,Compass,10:00 AM,6:00 PM,16.169551,2019
0,Compass,10:00 AM,7:00 PM,15.316723,2021
3,Peets,10:00 AM,7:00 PM,16.240656,2019
3,Peets,8:00 AM,7:00 PM,19.992475,2021
5,Saxbys,9:00 AM,6:00 PM,14.598841,2019
5,Saxbys,8:00 AM,3:00 PM,17.238755,2021


# 1. Methods versus attributes

In [3]:
## good explanation here: https://medium.com/@shawnnkoski/pandas-attributes-867a169e6d9b 
## especially in pandas:
## attributes: stored information about an object (e.g., a pandas DataFrame or Series); accessed without parentheses
## methods: functions attached to an object that compute or transform something; called with parentheses

## 1.1 Attributes of or methods that operate on dataframes
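
A minimal sketch of the distinction for dataframes. To keep the cell self-contained it uses a small stand-in, `demo_df` (a name assumed here for illustration), built from a few of the `coffee_df` rows shown above; the same calls should work on `coffee_df` itself.

```python
import pandas as pd

# Small stand-in for coffee_df (values copied from the output above)
demo_df = pd.DataFrame({
    'shop_name': ['Compass', 'Compass', 'Peets', 'Peets'],
    'hourly_wage': [16.169551, 15.316723, 16.240656, 19.992475],
    'year': [2019, 2021, 2019, 2021]})

# Attributes: no parentheses; they describe the dataframe
n_rows, n_cols = demo_df.shape       # (number of rows, number of columns)
col_names = list(demo_df.columns)    # column labels
col_types = demo_df.dtypes           # dtype of each column

# Methods: called with parentheses; they return a new result
first_rows = demo_df.head(2)                     # first two rows
by_wage = demo_df.sort_values(by='hourly_wage')  # sorted copy, original unchanged

print(n_rows, n_cols)
print(col_names)
```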

## 1.2 Attributes of or methods that operate on pandas series
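
The same idea applies to a single column. The sketch below uses a stand-in Series, `wages` (an assumed name), holding a few `hourly_wage` values from the output above; `coffee_df['hourly_wage']` should behave the same way.

```python
import pandas as pd

# Stand-in for coffee_df['hourly_wage']
wages = pd.Series([16.169551, 15.316723, 16.240656, 19.992475],
                  name='hourly_wage')

# Attributes of a Series
wage_dtype = wages.dtype   # type of the stored values
wage_name = wages.name     # label of the Series
wage_size = wages.size     # number of elements

# Methods that operate on a Series
wage_mean = wages.mean()       # a single summary number
wage_rounded = wages.round(2)  # a new Series, rounded to 2 decimals

print(wage_dtype, wage_name, wage_size)
print(wage_mean)
```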

## 1.3 Applying methods to base python objects versus pandas dataframes/series

Different types of objects have different methods.

Suppose we want to convert a string to all lowercase. The string might be stored in different ways:

- As a string (`str`) in base Python. Method for this object: `lower()` - https://www.w3schools.com/python/ref_string_lower.asp 

- As a pandas Series of strings. Method for this object: `str.lower()` - https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html 

Each accomplishes the same task, but the syntax differs slightly depending on the type of object we call the method on.
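
A short sketch of the two cases side by side. The variable names (`one_name`, `name_series`) are assumptions for illustration; the shop names come from `shop_names` above.

```python
import pandas as pd

# Base Python string: call lower() directly on the string
one_name = 'Baked and Wired'
one_lower = one_name.lower()

# pandas Series of strings: go through the .str accessor
name_series = pd.Series(['Compass', 'Blue Bottle', 'Saxbys'])
series_lower = name_series.str.lower()

print(one_lower)
print(series_lower.tolist())
```

Calling `name_series.lower()` without `.str` raises an `AttributeError`, because `lower` is a method of strings, not of Series.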

## 1.4 Practice for you 

- Use the `quantile` method (a method that operates on pandas Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html) to get three percentiles of the `hourly_wage` column and assign the result to an object called `wage_summary`

    - 10th percentile (0.1)
    - Median (0.5)
    - 90th percentile (0.9)

- Check the type of `wage_summary`, convert it to a numpy array, and calculate the gap between the 90th percentile hourly wage and the 10th percentile (hint: you may want to use `np.min` and `np.max`)

# 2. pivot() or pivot_table() to reshape

We currently have a "long" format dataframe where each coffee shop appears twice: once for its 2019 information and once for its 2021 information.

To perform some calculations (e.g., changes in total hours), we may want to reshape to wide format, where each row is a single coffee shop with one value for its 2019 wages/hours and another for its 2021 wages/hours.

Good discussion here of pivot() versus pivot_table(): https://www.roelpeters.be/pandas-pivot_table-vs-pivot/#:~:text=Basically%2C%20the%20pivot_table()%20function,Here's%20an%20example

**Task**: 

- We first want to calculate the number of hours each shop is open per day. This is trickier because the times have no dates attached, so we attach the same arbitrary date to each time (today's date, from `.today()`, would also work)

- We then want to pivot to wide format to create separate columns for the 2019 versus 2021 hours open and hourly wage

## 2.1 Adding a new column with difference in times

How to get hours: https://stackoverflow.com/questions/52093199/pandas-extract-hour-from-timedelta
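
One possible way to do this, sketched on a two-row stand-in (`demo_df`, with the Compass rows from the output above). The date string `'2021-01-01'` and the variable names are assumptions; any single date gives the same difference.

```python
import pandas as pd

# Stand-in with the Compass rows from coffee_df
demo_df = pd.DataFrame({
    'shop_name': ['Compass', 'Compass'],
    'opening_time': ['10:00 AM', '10:00 AM'],
    'closing_time': ['6:00 PM', '7:00 PM'],
    'year': [2019, 2021]})

arbitrary_date = '2021-01-01'        # assumption: same date for every row
time_format = '%Y-%m-%d %I:%M %p'    # e.g. "2021-01-01 10:00 AM"

# Paste the date in front of each time string, then parse to datetimes
open_dt = pd.to_datetime(arbitrary_date + ' ' + demo_df['opening_time'],
                         format=time_format)
close_dt = pd.to_datetime(arbitrary_date + ' ' + demo_df['closing_time'],
                          format=time_format)

# Subtracting datetimes gives timedeltas; convert seconds to hours
demo_df['hours_open'] = (close_dt - open_dt).dt.total_seconds() / 3600

print(demo_df[['shop_name', 'year', 'hours_open']])
```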

## 2.2 Reshaping

We want to subset to the following columns and reshape from long to wide, so that each row represents a single shop:

- shop_name
- year
- hours_open
- hourly_wage
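
A sketch of the reshape with `pivot()`, on a stand-in `long_df` (an assumed name) that already has an `hours_open` column. The hours are worked out by hand from the Compass and Peets opening and closing times shown above.

```python
import pandas as pd

# Stand-in long data: one row per shop per year
long_df = pd.DataFrame({
    'shop_name': ['Compass', 'Compass', 'Peets', 'Peets'],
    'year': [2019, 2021, 2019, 2021],
    'hours_open': [8.0, 9.0, 9.0, 11.0],
    'hourly_wage': [16.169551, 15.316723, 16.240656, 19.992475]})

# One row per shop; one column per (value, year) pair
wide_df = long_df.pivot(index='shop_name', columns='year',
                        values=['hours_open', 'hourly_wage'])

# Flatten the two-level column labels into names like hours_open_2019
wide_df.columns = [f'{value}_{year}' for value, year in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df)
```

`pivot()` works here because each shop and year pair appears exactly once. With duplicate pairs it raises an error, and `pivot_table()`, which aggregates duplicates (by default with the mean), would be the alternative.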

## 2.3 Using that reshaped data to simplify operations

Create a new column that is True if `hours_open_2021` is less than `hours_open_2019`, and False otherwise. In wide format this is a single comparison between two columns.
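
A sketch of that comparison on a stand-in for the reshaped data. The new column name `fewer_hours_2021` is an assumption; the hours are worked out by hand from the times shown in `coffee_df` above.

```python
import pandas as pd

# Stand-in for the wide data: one row per shop
wide_df = pd.DataFrame({
    'shop_name': ['Baked and Wired', 'Compass', 'Peets'],
    'hours_open_2019': [7.0, 8.0, 9.0],
    'hours_open_2021': [5.0, 9.0, 11.0]})

# Comparing two columns row by row returns a boolean Series
wide_df['fewer_hours_2021'] = (wide_df['hours_open_2021'] <
                               wide_df['hours_open_2019'])

print(wide_df)
```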

## 2.4 Practice for you

- Below is a fake dataframe with metro stops and delays. 

- Reshape so that each row is a metro stop and find whether there were more days with delays in 2021 than in 2020 for that stop


In [4]:
## long dataframe
np.random.seed(1129)
metro_stops = pd.DataFrame({'stop': ['dupont circle', 'dupont circle', 
                           'foggy bottom', 'foggy bottom'],
                           'days_delayed': np.random.choice([50, 40, 100, 200],
                                                           4, replace = False),
                           'year': [2020, 2021, 2020, 2021]})
metro_stops

Unnamed: 0,stop,days_delayed,year
0,dupont circle,200,2020
1,dupont circle,50,2021
2,foggy bottom,40,2020
3,foggy bottom,100,2021


In [5]:
## your code


# 3. User defined functions and if/elif/control flow

## 3.1 if/elif/else outside a function

**Task**: with the original `coffee_df` data, pick an arbitrary shop_name and check whether it has more than one word; if it does, print "shop name has >1 word"; if not, print "shop name has one word"
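
One way to sketch this, assuming words are separated by whitespace. The variable names are illustrative, and `message` is stored only so it can be inspected after printing.

```python
# An arbitrary shop name from coffee_df
one_shop = 'Blue Bottle'

# split() breaks the string on whitespace; len() counts the pieces
n_words = len(one_shop.split())

if n_words > 1:
    message = 'shop name has >1 word'
else:
    message = 'shop name has one word'

print(message)
```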

## 3.2 if/elif/else inside a function 

**Task**: move that logic inside a function that takes a single shop name as an argument. Apply the function to one arbitrary shop name
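
A minimal sketch of such a function. The name `check_name_length` is an assumption; the function only prints, so calling it returns `None`.

```python
def check_name_length(shop_name):
    """Print whether shop_name has more than one word."""
    if len(shop_name.split()) > 1:
        print('shop name has >1 word')
    else:
        print('shop name has one word')


# Apply to one arbitrary shop name
result = check_name_length('Compass')
```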

## 3.3 different ways of executing over all shop names

Note: because the function prints its message and has no `return` statement, each call returns `None`, whichever way we execute it

### 3.3.1 List comprehension
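
A sketch using a list comprehension over the `shop_names` list. The function is redefined here (under the assumed name `check_name_length`) so the cell runs on its own.

```python
shop_names = ['Compass', 'Starbucks', 'Baked and Wired', 'Peets',
              'Blue Bottle', 'Saxbys']


def check_name_length(shop_name):
    """Print whether shop_name has more than one word."""
    if len(shop_name.split()) > 1:
        print('shop name has >1 word')
    else:
        print('shop name has one word')


# Call the function once per name; the list collects each return value
results = [check_name_length(name) for name in shop_names]
print(results)
```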

### 3.3.2 Apply 
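
A sketch using `apply` on a pandas Series of names, which stands in for `coffee_df['shop_name']`. The function is redefined here (under the assumed name `check_name_length`) so the cell runs on its own.

```python
import pandas as pd

names = pd.Series(['Compass', 'Starbucks', 'Baked and Wired', 'Peets',
                   'Blue Bottle', 'Saxbys'], name='shop_name')


def check_name_length(shop_name):
    """Print whether shop_name has more than one word."""
    if len(shop_name.split()) > 1:
        print('shop name has >1 word')
    else:
        print('shop name has one word')


# apply() calls the function on each element and collects the return values
results = names.apply(check_name_length)
print(results)
```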

## 3.4 Practice for you

**Task**: modify the previous function to do the following instead of printing:
        
- If the shop name is longer than 1 word, return just the first word of the name
- Otherwise return the full shop name

After making sure the function works with one shop name, iterate over the shop names using one of the above methods (list comprehension or apply) and create a new column in `coffee_df` with the single-word name

# 4. Plotting practice

**Task**: create a plot where the x axis is a coffee shop and the y axis is the number of hours open, creating separate bars (or shading separately) for 2019 versus 2021