<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Chipotle Data

_Author: Joseph Nelson (DC)_

---

For Project 2, you will complete a series of exercises exploring [order data from Chipotle](https://github.com/TheUpshot/chipotle), courtesy of _The New York Times'_ "The Upshot."

For these exercises, you will conduct basic exploratory data analysis (Pandas not required) to understand the essentials of Chipotle's order data: how many orders are being made, the average price per order, how many different ingredients are used, etc. These allow you to practice business analysis skills while also becoming comfortable with Python.

---

## Basic Level

### Part 1: Read in the file with `csv.reader()` and store it in an object called `file_nested_list`.

Hint: This is a TSV (tab-separated value) file, and `csv.reader()` needs to be told [how to handle it](https://docs.python.org/2/library/csv.html).
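
As a minimal sketch of the hint above, the snippet below parses a small in-memory tab-separated sample with `csv.reader` and `delimiter='\t'`. The sample text and the name `sample_tsv` are made up for illustration; the real file has more columns and rows.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real TSV file
sample_tsv = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# delimiter='\t' tells csv.reader to split fields on tabs instead of commas
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter='\t'))

print(rows[0])   # header row
print(rows[1:])  # data rows
```

Every field comes back as a string, so numeric columns still need to be converted before doing arithmetic on them.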

In [1]:
import csv
from collections import namedtuple   # Convenient to store the data rows
import pandas as pd
import numpy as np

DATA_FILE = './data/chipotle.tsv'

In [2]:

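# Open the file in a context manager so the handle is closed after reading
with open(DATA_FILE, newline='') as f:
    file_nested_list = list(csv.reader(f, delimiter='\t'))
file_nested_list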

FileNotFoundError: [Errno 2] No such file or directory: './data/chipotle.tsv'

### Part 2: Separate `file_nested_list` into the `header` and the `data`.


In [49]:
# The first row is the header; every remaining row is data
header = file_nested_list[0]
data = file_nested_list[1:]

---

## Intermediate Level

### Part 3: Calculate the average price of an order.

Hint: Examine the data to see if the `quantity` column is relevant to this calculation.

Hint: Think carefully about the simplest way to do this!

In [188]:
# Convert nested list to dataframe

df = pd.DataFrame(data, columns=header)
df['order_id'] = df['order_id'].astype(float)
df['quantity'] = df['quantity'].astype(float)
# Raw string avoids an invalid-escape warning; strips '$' and ',' before converting
df['item_price'] = df['item_price'].replace(r'[\$,]', '', regex=True).astype(float)

df.head()

|   | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Chips and Fresh Tomato Salsa |  | 2.39 |
| 1 | 1.0 | 1.0 | Izze | [Clementine] | 3.39 |
| 2 | 1.0 | 1.0 | Nantucket Nectar | [Apple] | 3.39 |
| 3 | 1.0 | 1.0 | Chips and Tomatillo-Green Chili Salsa |  | 2.39 |
| 4 | 2.0 | 2.0 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |


In [90]:
# Compare the price of one item at quantities of 1 and 2 to determine whether quantity is relevant.

df[(df['item_name'] == 'Chips and Fresh Tomato Salsa') & (df['quantity'] == 2)]

# Based on this test, item_price already scales with quantity: it roughly doubles
# when quantity goes from 1 to 2. Quantity therefore does not need to be applied
# again when computing the average price of an order.

|   | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 1882 | 759.0 | 2.0 | Chips and Fresh Tomato Salsa |  | 5.9 |
| 2267 | 912.0 | 2.0 | Chips and Fresh Tomato Salsa |  | 5.9 |
| 2729 | 1083.0 | 2.0 | Chips and Fresh Tomato Salsa |  | 5.9 |


In [97]:
# Group by order_id and sum item_price to get each order's total, then average those totals.
# Selecting item_price before summing avoids summing the text columns as well.

df_order = df.groupby('order_id')[['item_price']].sum()
Avg_Order_Price = df_order['item_price'].mean()
Avg_Order_Price

# Average price of an order is $18.81

18.81142857142869
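
Since Pandas is not required for these exercises, the same idea can be sketched in plain Python: sum `item_price` per `order_id`, then average the per-order totals. The rows below are a small made-up sample, and the names `sample_rows`, `order_totals`, and `avg_order_price` are illustrative only. The sketch follows the observation above that `item_price` already reflects `quantity`.

```python
from collections import defaultdict

# Hypothetical sample rows: (order_id, quantity, item_price as text)
sample_rows = [
    ('1', '1', '$2.39'),
    ('1', '1', '$3.39'),
    ('2', '2', '$16.98'),
]

order_totals = defaultdict(float)
for order_id, _quantity, price in sample_rows:
    # Strip the dollar sign and whitespace; quantity is not multiplied in again
    order_totals[order_id] += float(price.replace('$', '').strip())

# Average of the per-order totals, not of the individual rows
avg_order_price = sum(order_totals.values()) / len(order_totals)
print(round(avg_order_price, 2))
```

Averaging the rows directly would give the mean line-item price instead, which is why the totals are built per order first.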

### Part 4: Create a list (or set) named `unique_sodas` containing all of the unique sodas and soft drinks that Chipotle sells.

Note: Just look for `'Canned Soda'` and `'Canned Soft Drink'`, and ignore other drinks like `'Izze'`.

In [112]:
df_soda = df[(df['item_name'] == 'Canned Soda') | (df['item_name'] == 'Canned Soft Drink')]
unique_sodas = df_soda['choice_description'].unique().tolist()
unique_sodas

['[Sprite]',
 '[Dr. Pepper]',
 '[Mountain Dew]',
 '[Diet Dr. Pepper]',
 '[Coca Cola]',
 '[Diet Coke]',
 '[Coke]',
 '[Lemonade]',
 '[Nestea]']
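
For comparison, here is a hedged plain-Python sketch of the same filter using a set comprehension. The sample rows and the names `soda_names` and `unique_sodas_sketch` are assumptions made for illustration, not part of the notebook's data.

```python
# Hypothetical sample rows: (item_name, choice_description)
sample_rows = [
    ('Canned Soda', '[Sprite]'),
    ('Izze', '[Clementine]'),
    ('Canned Soft Drink', '[Coke]'),
    ('Canned Soda', '[Sprite]'),
]

soda_names = {'Canned Soda', 'Canned Soft Drink'}

# A set keeps each choice_description once, however often it appears
unique_sodas_sketch = {desc for item, desc in sample_rows if item in soda_names}
print(sorted(unique_sodas_sketch))
```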

---

## Advanced Level


### Part 5: Calculate the average number of toppings per burrito.

Note: Let's ignore the `quantity` column to simplify this task.

Hint: Think carefully about the easiest way to count the number of toppings!


#### My Approach:
##### 1. Filter for burritos by checking whether `item_name` contains "Burrito" (there are several variants, such as Burrito, Carnitas Burrito, etc.).
##### 2. Convert the `choice_description` values to a list and count the number of ingredients. Since the items are not separated into columns, I did this by counting commas.
##### 3. Take the average of these counts, computed as the sum of the counts divided by the number of counts.
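
The comma-counting step can be illustrated on a single string. This is a minimal sketch: `sample_description` is a made-up value in the same bracketed style as the data, and it assumes that no topping name contains a comma and that the description is not empty.

```python
# Hypothetical choice_description string in the same bracketed style as the data
sample_description = "[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Cheese]]"

# n commas separate n + 1 items, so the topping count is commas + 1
topping_count = sample_description.count(',') + 1
print(topping_count)  # 4
```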


In [154]:
df_burrito = df[df['item_name'].str.contains('Burrito')]
burrito_topping = df_burrito['choice_description'].tolist()

In [155]:
# Toppings are comma-separated within each description, so count = commas + 1
topping_cnt = [order.count(',') + 1 for order in burrito_topping]


def average(lst):
    return sum(lst) / len(lst)

average(topping_cnt)

# The average number of toppings per burrito is about 5.4

5.395051194539249

### Part 6: Create a dictionary. Let the keys represent chip orders and the values represent the total number of orders.

Expected output: `{'Chips and Roasted Chili-Corn Salsa': 18, ... }`

Note: Please take the `quantity` column into account!

Optional: Learn how to use `collections.defaultdict` to simplify your code.

In [186]:
df_chips = df[df['item_name'].str.contains('Chips')]
df_chip_grp = df_chips.groupby('item_name')['quantity'].sum()
chip_dict = df_chip_grp.to_dict()
chip_dict

{'Chips': 230.0,
 'Chips and Fresh Tomato Salsa': 130.0,
 'Chips and Guacamole': 506.0,
 'Chips and Mild Fresh Tomato Salsa': 1.0,
 'Chips and Roasted Chili Corn Salsa': 23.0,
 'Chips and Roasted Chili-Corn Salsa': 18.0,
 'Chips and Tomatillo Green Chili Salsa': 45.0,
 'Chips and Tomatillo Red Chili Salsa': 50.0,
 'Chips and Tomatillo-Green Chili Salsa': 33.0,
 'Chips and Tomatillo-Red Chili Salsa': 25.0,
 'Side of Chips': 110.0}
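
The optional note in this part mentions `defaultdict`. A minimal sketch of that route is shown below on made-up rows; `sample_rows` and `chip_orders` are illustrative names, and the quantities are summed as integers rather than the floats used in the dataframe above.

```python
from collections import defaultdict

# Hypothetical sample rows: (item_name, quantity as text)
sample_rows = [
    ('Chips and Guacamole', '1'),
    ('Side of Chips', '2'),
    ('Chicken Bowl', '1'),
    ('Chips and Guacamole', '3'),
]

# defaultdict(int) starts every missing key at 0, so no existence check is needed
chip_orders = defaultdict(int)
for item_name, quantity in sample_rows:
    if 'Chips' in item_name:
        chip_orders[item_name] += int(quantity)

print(dict(chip_orders))
```

With a plain `dict`, the loop would need an `if item_name not in chip_orders` branch or a call to `dict.get` before each addition.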

---

## Bonus: Craft a problem statement about this data that interests you, and then answer it!


## When are chips most likely to be ordered: when no entrée (bowl, burrito, taco, salad) is ordered, when one entrée is ordered, or when more than one entrée is ordered?

#### Approach: Add flags for whether an order has an entrée and whether it has chips. Then calculate the percentage of orders with chips, segmented by whether 0, 1, or more than 1 entrée was ordered.
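
The approach can be sketched without Pandas on a handful of made-up rows. Everything here (`sample_rows`, `ENTREE_WORDS`, `bucket`, `chip_share`) is a hypothetical name for illustration, and like the flags described above it counts entrée rows per order and ignores `quantity`.

```python
from collections import defaultdict

ENTREE_WORDS = ('Salad', 'Taco', 'Burrito', 'Bowl')

# Hypothetical sample rows: (order_id, item_name)
sample_rows = [
    (1, 'Chips and Guacamole'),
    (2, 'Chicken Bowl'),
    (2, 'Side of Chips'),
    (3, 'Steak Burrito'),
    (4, 'Chicken Bowl'),
    (4, 'Steak Soft Tacos'),
]

# Per order: number of entrée rows and whether any chip item appears
entree_rows = defaultdict(int)
has_chips = defaultdict(bool)
for order_id, item_name in sample_rows:
    entree_rows[order_id] += any(word in item_name for word in ENTREE_WORDS)
    has_chips[order_id] |= 'Chips' in item_name


def bucket(n):
    """Label an order by its entrée row count: '0', '1', or '2+'."""
    return str(n) if n < 2 else '2+'


# Collect each order's chip flag into its bucket, then take the share per bucket
flags_by_bucket = defaultdict(list)
for order_id, n in entree_rows.items():
    flags_by_bucket[bucket(n)].append(has_chips[order_id])

chip_share = {b: sum(flags) / len(flags) for b, flags in flags_by_bucket.items()}
print(chip_share)
```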

In [249]:
# Add flags for whether each row is an entrée or a chip item

df2 = df.copy()  # copy so the flag columns are not added to df itself
df2['entre'] = np.where(df2['item_name'].str.contains('Salad|Taco|Burrito|Bowl', regex=True), 1, 0)
df2['chips'] = np.where(df2['item_name'].str.contains('Chips', regex=True), 1, 0)
df2.head(10)

|   | order_id | quantity | item_name | choice_description | item_price | Entre | Chips | entre | chips |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Chips and Fresh Tomato Salsa |  | 2.39 | 0 | 1 | 0 | 1 |
| 1 | 1.0 | 1.0 | Izze | [Clementine] | 3.39 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 1.0 | Nantucket Nectar | [Apple] | 3.39 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1.0 | Chips and Tomatillo-Green Chili Salsa |  | 2.39 | 0 | 1 | 0 | 1 |
| 4 | 2.0 | 2.0 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 | 1 | 0 | 1 | 0 |
| 5 | 3.0 | 1.0 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | 10.98 | 1 | 0 | 1 | 0 |
| 6 | 3.0 | 1.0 | Side of Chips |  | 1.69 | 0 | 1 | 0 | 1 |
| 7 | 4.0 | 1.0 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | 11.75 | 1 | 0 | 1 | 0 |
| 8 | 4.0 | 1.0 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | 9.25 | 1 | 0 | 1 | 0 |
| 9 | 5.0 | 1.0 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | 9.25 | 1 | 0 | 1 | 0 |


In [242]:
# Group data by order_id, and segment by the number of entrées ordered

# Select multiple columns with a list; the bare tuple form fails in recent pandas
df3 = df2.groupby('order_id')[['entre', 'chips']].sum()

df_entre_0 = df3[df3['entre'] == 0]
df_entre_1 = df3[df3['entre'] == 1]
df_entre_2up = df3[df3['entre'] > 1]

# Create a function to calculate the share of orders that include chips
def chip_chances(dataframe):
    chip_cnt = dataframe[dataframe['chips'] > 0]['chips'].count()
    tot_cnt = dataframe['chips'].count()
    pct = chip_cnt / tot_cnt

    return pct

In [244]:
# Test No Entre
chip_chances(df_entre_0)   

1.0

In [245]:
# Test 1 Entre
chip_chances(df_entre_1)   

0.6731481481481482

In [246]:
# Test 2+ Entre
chip_chances(df_entre_2up)   

0.37632978723404253

## Result: Chips were most likely to be ordered when no entrée was ordered (100% of the time); however, that sample (2 orders) is too small to draw a conclusion. Single-entrée orders (67%) were more likely than multi-entrée orders (38%) to include chips.