# Lab 5 (2/10): Working with data

### Web pages
Course page: https://ambujtewari.github.io/teaching/STATS306-Winter2020/

Lab page: https://rogerfan.github.io/stats306_w20/

### Office Hours
    Mondays: 2-4pm, USB 2165
    
### Contact
    Questions on problems: Use the Slack discussions
    If you need to email me, include in the subject line: [STATS 306]
    Email: rogerfan@umich.edu

Today, we will look at what people order at Chipotle. Some example questions we are interested in are:
- How much do people spend on average at Chipotle?
- Do people prefer bowls or burritos?
- What percentage of people order drinks?


Recall the following commands from `dplyr`:

1. `group_by`
2. `summarize`:  `df = df %>% group_by(groupvar) %>% summarize(newvar = mean(oldvar))`
3. `mutate`: `df = df %>% mutate(newvar = oldvar + sqrt(oldvar2))`
4. `filter`: `df = df %>% filter(compvar == 'something')`
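As a quick refresher, here is a minimal sketch of these verbs on a small made-up data frame. The names `toy`, `store`, and `sales` are hypothetical and are not part of the Chipotle data.

```r
library(dplyr)

# Hypothetical toy data, not the Chipotle data
toy = data.frame(
    store = c("A", "A", "B", "B", "B"),
    sales = c(10, 20, 5, 15, 40),
    stringsAsFactors = FALSE
)

# filter keeps only the rows that satisfy a condition
big_sales = toy %>% filter(sales >= 10)

# mutate adds a column computed from existing columns
toy_scaled = toy %>% mutate(sales_sqrt = sqrt(sales))

# group_by + summarize collapses each group to a single row
by_store = toy %>%
    group_by(store) %>%
    summarize(mean_sales = mean(sales), n_rows = n())
by_store
```

Note that `n()` counts rows in each group, which is not always the same as counting units.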

Some other functions that may come in handy are:

1. `top_n` from `dplyr`
2. `sum`, `max`, and `min` from base R
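A small sketch of how these helpers behave, again on made-up data (`scores` is a hypothetical data frame). `top_n` keeps the rows with the largest values of a column but does not sort them, and it may return extra rows when there are ties.

```r
library(dplyr)

# Hypothetical toy data
scores = data.frame(
    name = c("a", "b", "c", "d"),
    score = c(3, 9, 1, 7),
    stringsAsFactors = FALSE
)

# Keep the 2 rows with the largest score, then sort them explicitly
top2 = scores %>% top_n(2, score) %>% arrange(desc(score))
top2

# Base R summaries of a single column
total = sum(scores$score)
largest = max(scores$score)
smallest = min(scores$score)
```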


In [None]:
# Setup and read in data
library(dplyr)
library(ggplot2)
library(stringr)
df = read.csv("https://raw.githubusercontent.com/rogerfan/stats306_w20/master/labs/chipotle.csv", stringsAsFactors=FALSE)
# Drop the X column, which is not used in this lab
df$X = NULL

In [2]:
head(df)

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 |
| 2 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 3 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 4 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 |
| 5 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] | $16.98 |
| 6 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]] | $10.98 |


### Q1: What are the five most popular items?

### Q2: `item_price` is currently a string. Remove the dollar sign and convert it to a numerical variable

Save this new dataset as `df_clean`.

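Hint: Consider the functions `str_replace` and `as.numeric`. Because `str_replace` treats its pattern as a regular expression and `$` is a special character there (it matches the end of the string), you will need to escape it as `'\\$'` to match a literal dollar sign.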
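To see the escaping at work, here is a minimal sketch on a hypothetical character vector `prices_chr` (not the full dataset). The `fixed()` variant is an alternative way to match the literal character.

```r
library(stringr)

# Hypothetical example strings in the same format as item_price
prices_chr = c("$2.39", "$16.98", "$10.98")

# "\\$" matches a literal dollar sign; a bare "$" would match the end of the string
prices_num = as.numeric(str_replace(prices_chr, "\\$", ""))
prices_num

# Alternative: fixed() turns off regex interpretation of the pattern
prices_num2 = as.numeric(str_replace(prices_chr, fixed("$"), ""))
```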

### Q3: Construct a summary table by item type

For each item type, the table should contain the total revenue, the number of items sold, and the max, mean, and minimum prices. Sort the table by items sold in decreasing order.

I have done this for you, and my solution is:

In [None]:
pricetable = df_clean %>% group_by(item_name) %>% 
    summarize(revenue = sum(item_price), 
              itemsold = n(),
              meanprice = mean(item_price),
              maxprice = max(item_price),                       
              minprice = min(item_price)) %>% 
    arrange(desc(itemsold))
head(pricetable)

Does anything look strange about this summary? Which number seems out-of-place?

### Q3 (continued): What is the issue? Can you figure out the mistake in the code above?

Hint: Think about how you might find some of the problematic rows and look at them closely.

### How would you fix the code in Q3?

### Q4: Calculate the total price for each order. Plot a histogram of order prices.

Save this new dataframe in the variable `totalprice`.

Note how extreme outliers can make it difficult to interpret plots.

### Q5: Change the data/plot in Q4 so that it only contains orders with prices below 40. Try different binwidths to see if your interpretations change.

How do your spending habits at Chipotle compare to those who are in the dataset?
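The effect of outliers and binwidth can be explored on simulated data first. The sketch below uses made-up values (`sim` and `value` are hypothetical names, and the numbers are random draws, not Chipotle orders) with a single extreme point added.

```r
library(dplyr)
library(ggplot2)

set.seed(306)
# Simulated values plus one extreme outlier (hypothetical data)
sim = data.frame(value = c(rnorm(200, mean = 15, sd = 4), 500))

# With the outlier included, nearly all of the data is squeezed into a few bins
p_all = ggplot(sim, aes(x = value)) + geom_histogram(binwidth = 5)

# Drop the outlier, then compare a wide and a narrow binwidth
sim_small = sim %>% filter(value < 40)
p_wide = ggplot(sim_small, aes(x = value)) + geom_histogram(binwidth = 5)
p_narrow = ggplot(sim_small, aes(x = value)) + geom_histogram(binwidth = 1)
p_narrow
```

Printing `p_all`, `p_wide`, and `p_narrow` in turn shows how much the apparent shape depends on both choices.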

### Q6: Are bowls, burritos, or tacos more popular? Create a bar plot of the quantity sold of each.

To extract whether an item is a bowl, burrito, or taco, you can use the following code:

In [None]:
df_withtype = df_clean %>% 
    mutate(type = case_when(str_detect(item_name, "Bowl") ~ "Bowl",
                            str_detect(item_name, "Burrito") ~ "Burrito",
                            str_detect(item_name, "Tacos") ~ "Tacos",
                            TRUE ~ "neither"))
bowburtacotable = df_withtype %>% filter(type != "neither") 
head(bowburtacotable)

Note the use of the function `case_when` from `dplyr` (part of the tidyverse). Look up the documentation and make sure you understand how this function works.
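One detail worth checking is that `case_when` evaluates its conditions from top to bottom and uses the first one that is `TRUE`. A minimal sketch on a hypothetical data frame `items` (the name "Burrito Bowl" is made up to trigger two matches):

```r
library(dplyr)
library(stringr)

# Hypothetical item names; "Burrito Bowl" matches both patterns below
items = data.frame(
    item_name = c("Chicken Bowl", "Steak Burrito", "Burrito Bowl", "Chips"),
    stringsAsFactors = FALSE
)

# Conditions are checked in order, so "Burrito Bowl" is labeled "Bowl";
# TRUE acts as the catch-all for rows that matched nothing above
labeled = items %>%
    mutate(type = case_when(str_detect(item_name, "Bowl") ~ "Bowl",
                            str_detect(item_name, "Burrito") ~ "Burrito",
                            TRUE ~ "other"))
labeled
```

Swapping the first two conditions would label "Burrito Bowl" as "Burrito" instead, so the order of the conditions matters.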

In [None]:
ggplot(bowburtacotable, aes(x=type)) + geom_bar(aes(weight=quantity))

### What about the different fillings (Chicken, Steak, Barbacoa, Carnitas, Veggie)? Color the above bar chart by type of filling.

Note that you will have to do some additional data transformation using the intermediate dataset `df_withtype`.

### Is there a modification you can make to the default bar chart that makes it easier to compare the proportion of fillings across order types?

### Q7: How many orders contain at least one drink of any kind? How many don't have a drink?

To match any one of several patterns with `str_detect`, separate the alternatives with a pipe `|`, which means "or" in a regular expression. For example:

In [None]:
str_detect('I play football.', 'football|soccer')
str_detect('I play soccer.', 'football|soccer')
str_detect('I play baseball.', 'football|soccer')
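Since `str_detect` is vectorized, it can be combined with `any` inside `summarize` to ask whether a group contains at least one match. The sketch below shows one possible pattern on hypothetical data (`toy`, `person`, and `sport` are made-up names).

```r
library(dplyr)
library(stringr)

# str_detect returns one TRUE/FALSE per input string
flags = str_detect(c("I play football.", "I play soccer.", "I play baseball."),
                   "football|soccer")
flags

# Hypothetical toy data: one row per (person, sport)
toy = data.frame(
    person = c(1, 1, 2, 3, 3),
    sport = c("football", "chess", "baseball", "soccer", "football"),
    stringsAsFactors = FALSE
)

# any() is TRUE if at least one row in the group matches
per_person = toy %>%
    group_by(person) %>%
    summarize(plays_either = any(str_detect(sport, "football|soccer")))
per_person
```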

In [5]:
unique(df$item_name)

### Bonus Question: Suppose non-drink items have a 20% profit margin and drinks have a 50% profit margin. What percentage of the total profit comes from drinks?