<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Chipotle Data

_Author: Joseph Nelson (DC)_

---

For Project 2, you will complete a series of exercises exploring [order data from Chipotle](https://github.com/TheUpshot/chipotle), compliments of _The New York Times'_ "The Upshot."

For these exercises, you will conduct basic exploratory data analysis (Pandas not required) to understand the essentials of Chipotle's order data: how many orders are being made, the average price per order, how many different ingredients are used, etc. These allow you to practice business analysis skills while also becoming comfortable with Python.

---

## Basic Level

### Part 1: Read in the file with `csv.reader()` and store it in an object called `file_nested_list`.

Hint: This is a TSV (tab-separated value) file, and `csv.reader()` needs to be told [how to handle it](https://docs.python.org/2/library/csv.html).
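As a minimal sketch of the hint above, `csv.reader()` accepts a `delimiter` argument, and passing `'\t'` makes it split fields on tabs. The two-line sample below is typed in by hand to mimic the file's layout and is read from an in-memory string so that it runs without `chipotle.tsv`; the names `sample_tsv` and `rows` are illustrative only.

```python
import csv
import io

# Hand-typed sample mimicking the first lines of the TSV (an assumption for illustration)
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
)

# delimiter='\t' tells csv.reader to split fields on tabs instead of commas
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter='\t'))

print(rows[0])  # header row
print(rows[1])  # first data row
```

The same call works on a real file handle in place of `io.StringIO(...)`.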

In [1]:
# Import packages
import pandas as pd
import numpy as np

In [2]:
import csv
DATA_FILE = './chipotle.tsv'

In [3]:
with open(DATA_FILE, 'r', newline='') as f:
    # delimiter='\t' because the file is tab-separated
    reader = csv.reader(f, delimiter='\t')
    file_nested_list = list(reader)
file_nested_list

[['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'],
 ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
 ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
 ['2',
  '2',
  'Chicken Bowl',
  '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
  '$16.98 '],
 ['3',
  '1',
  'Chicken Bowl',
  '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$10.98 '],
 ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
 ['4',
  '1',
  'Steak Burrito',
  '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$11.75 '],
 ['4',
  '1',
  'Steak Soft Tacos',
  '[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 '],
 ['5',
  '1',
  'Steak Burrito',
  '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Ch

### Part 2: Separate `file_nested_list` into the `header` and the `data`.


In [4]:
# extract header
header = file_nested_list[0]
header

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [5]:
# extract the data
data = file_nested_list[1:]
data

[['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
 ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
 ['2',
  '2',
  'Chicken Bowl',
  '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
  '$16.98 '],
 ['3',
  '1',
  'Chicken Bowl',
  '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$10.98 '],
 ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
 ['4',
  '1',
  'Steak Burrito',
  '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$11.75 '],
 ['4',
  '1',
  'Steak Soft Tacos',
  '[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 '],
 ['5',
  '1',
  'Steak Burrito',
  '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 '],
 ['5', '1', 'Chips and Guacamole'

In [6]:
# Create a dictionary with the data
dict_for_df = {}
for index, column_name in enumerate(header):
    dict_for_df[column_name] = []
    for row in data:
        dict_for_df[column_name].append(row[index])
dict_for_df

{'order_id': ['1',
  '1',
  '1',
  '1',
  '2',
  '3',
  '3',
  '4',
  '4',
  '5',
  '5',
  '6',
  '6',
  '7',
  '7',
  '8',
  '8',
  '9',
  '9',
  '10',
  '10',
  '11',
  '11',
  '12',
  '12',
  '13',
  '13',
  '14',
  '14',
  '15',
  '15',
  '16',
  '16',
  '17',
  '17',
  '18',
  '18',
  '18',
  '18',
  '19',
  '19',
  '20',
  '20',
  '20',
  '20',
  '21',
  '21',
  '21',
  '22',
  '22',
  '23',
  '23',
  '24',
  '24',
  '25',
  '25',
  '26',
  '26',
  '27',
  '27',
  '28',
  '28',
  '28',
  '28',
  '29',
  '29',
  '30',
  '30',
  '30',
  '31',
  '31',
  '32',
  '32',
  '33',
  '33',
  '34',
  '34',
  '34',
  '34',
  '35',
  '35',
  '36',
  '36',
  '37',
  '37',
  '38',
  '38',
  '38',
  '39',
  '39',
  '40',
  '40',
  '40',
  '41',
  '41',
  '42',
  '42',
  '43',
  '43',
  '44',
  '44',
  '45',
  '45',
  '45',
  '46',
  '46',
  '47',
  '47',
  '48',
  '48',
  '49',
  '49',
  '49',
  '50',
  '50',
  '51',
  '51',
  '51',
  '52',
  '52',
  '53',
  '53',
  '53',
  '54',
  '54',
  '55',

In [7]:
df = pd.DataFrame(dict_for_df)
df.head(1)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39


---

## Intermediate Level

### Part 3: Calculate the average price of an order.

Hint: Examine the data to see if the `quantity` column is relevant to this calculation.

Hint: Think carefully about the simplest way to do this!
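Since Pandas is not required, the calculation can also be sketched in plain Python on the nested list. The version below rests on one assumption that should be checked against the data: that `item_price` is already the total for the line (the order 2 row, two Chicken Bowls at $16.98, is suggestive but not proof). If `item_price` turns out to be a per-unit price, each price would need to be multiplied by `int(row[1])`. The name `average_order_price` is hypothetical.

```python
# Sample rows copied from the printed data above:
# order_id, quantity, item_name, choice_description, item_price
sample_rows = [
    ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
    ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
    ['2', '2', 'Chicken Bowl',
     '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
     '$16.98 '],
]


def average_order_price(rows):
    """Average revenue per order, ASSUMING item_price is already the line total."""
    # Strip the dollar sign and surrounding spaces before converting to float
    total = sum(float(row[4].strip('$ ')) for row in rows)
    # A set keeps each order_id once, giving the number of distinct orders
    order_ids = {row[0] for row in rows}
    return total / len(order_ids)


sample_average = average_order_price(sample_rows)
print(round(sample_average, 2))
```

On the full data set the same function could be called as `average_order_price(data)`.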

In [8]:
# Check the shape 
df.shape

(4622, 5)

In [9]:
# Check for null values
df.isnull().sum()

order_id              0
quantity              0
item_name             0
choice_description    0
item_price            0
dtype: int64

In [10]:
# check datatypes
df.dtypes

order_id              object
quantity              object
item_name             object
choice_description    object
item_price            object
dtype: object

In [11]:
df.quantity.value_counts()

1     4355
2      224
3       28
4       10
7        1
15       1
10       1
5        1
8        1
Name: quantity, dtype: int64

In [12]:
df.item_price.value_counts().sort_values()

$32.94       1
$22.20       1
$6.45        1
$13.35       1
$10.50       1
          ... 
$8.49      311
$4.45      349
$9.25      398
$11.25     521
$8.75      730
Name: item_price, Length: 78, dtype: int64

In [13]:
# strip the $ sign from the item_price and create a new column
df['item_price_stripped'] = df['item_price'].map(lambda x: x.lstrip('$ '))
df.head(1)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,item_price_stripped
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,2.39


In [14]:
# Convert the stripped price to a float. astype() returns a new Series
# rather than modifying the column in place, so the result is assigned to a new column
df['item_price_stripped_float'] = df['item_price_stripped'].astype(float)
# Convert quantity to a float as well
df['quantity_float'] = df['quantity'].astype(float)
df.dtypes

order_id                      object
quantity                      object
item_name                     object
choice_description            object
item_price                    object
item_price_stripped           object
item_price_stripped_float    float64
quantity_float               float64
dtype: object

In [15]:
# Create new column for quantity_x_price
df['quantity_x_price'] = df['quantity_float']*df['item_price_stripped_float']
df.head(5)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,item_price_stripped,item_price_stripped_float,quantity_float,quantity_x_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,2.39,2.39,1.0,2.39
1,1,1,Izze,[Clementine],$3.39,3.39,3.39,1.0,3.39
2,1,1,Nantucket Nectar,[Apple],$3.39,3.39,3.39,1.0,3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,2.39,2.39,1.0,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,16.98,16.98,2.0,33.96


In [16]:
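# Average price of an order: the sum of quantity x item_price divided by the number of unique orders
# Caveat: this treats item_price as a per-unit price. Row 4 above (quantity 2, $16.98)
# suggests item_price may already be the line total, in which case this overstates the average
average_price_of_order = df.quantity_x_price.sum() / len(df.order_id.unique())
average_price_of_order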

21.39423118865867

In [17]:
# format the answer
print('The average price of an order is ${:.2f}'.format(average_price_of_order))

The average price of an order is $21.39


### Part 4: Create a list (or set) named `unique_sodas` containing all of the unique sodas and soft drinks that Chipotle sells.

Note: Just look for `'Canned Soda'` and `'Canned Soft Drink'`, and ignore other drinks like `'Izze'`.
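A plain-Python sketch of this filter, using a set comprehension over rows in the same five-column layout as the file. The first two rows are copied from the notebook output; the last two are hypothetical and included only to show filtering and de-duplication. `SODA_ITEMS` is an illustrative name.

```python
# Rows in the layout: order_id, quantity, item_name, choice_description, item_price
sample_rows = [
    ['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['100', '1', 'Canned Soft Drink', '[Coke]', '$1.25 '],  # hypothetical row
    ['101', '1', 'Canned Soda', '[Sprite]', '$1.09 '],      # hypothetical row
]

# Only these two item names count as sodas for this exercise
SODA_ITEMS = {'Canned Soda', 'Canned Soft Drink'}

# strip('[]') removes the surrounding brackets; the set drops duplicates
unique_sodas = {
    row[3].strip('[]')
    for row in sample_rows
    if row[2] in SODA_ITEMS
}

print(sorted(unique_sodas))
```

Testing membership in a set of item names avoids chaining several `==` comparisons and leaves `'Izze'` out automatically.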

In [18]:
# Count how often each item_name appears
df.item_name.value_counts()

Chicken Bowl                             726
Chicken Burrito                          553
Chips and Guacamole                      479
Steak Burrito                            368
Canned Soft Drink                        301
Steak Bowl                               211
Chips                                    211
Bottled Water                            162
Chicken Soft Tacos                       115
Chicken Salad Bowl                       110
Chips and Fresh Tomato Salsa             110
Canned Soda                              104
Side of Chips                            101
Veggie Burrito                            95
Barbacoa Burrito                          91
Veggie Bowl                               85
Carnitas Bowl                             68
Barbacoa Bowl                             66
Carnitas Burrito                          59
Steak Soft Tacos                          55
6 Pack Soft Drink                         54
Chips and Tomatillo Red Chili Salsa       48
Chicken Cr

In [19]:
# check the choice description
df.choice_description.value_counts()

NULL                                                                                                                                                                                                         1246
[Diet Coke]                                                                                                                                                                                                   134
[Coke]                                                                                                                                                                                                        123
[Sprite]                                                                                                                                                                                                       77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                                                                                          

In [20]:
# Create a new dataframe with only canned soda and canned soft drink 
df_drinks = df[(df['item_name'] == 'Canned Soda') | (df['item_name'] == 'Canned Soft Drink')]
df_drinks.head(1)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,item_price_stripped,item_price_stripped_float,quantity_float,quantity_x_price
18,9,2,Canned Soda,[Sprite],$2.18,2.18,2.18,2.0,4.36


In [21]:
# Convert the choice_description column to a list from the filtered dataframe
drinks_list = df_drinks['choice_description'].tolist()
drinks_list

['[Sprite]',
 '[Dr. Pepper]',
 '[Mountain Dew]',
 '[Sprite]',
 '[Dr. Pepper]',
 '[Diet Dr. Pepper]',
 '[Coca Cola]',
 '[Diet Coke]',
 '[Diet Dr. Pepper]',
 '[Coca Cola]',
 '[Dr. Pepper]',
 '[Coca Cola]',
 '[Diet Coke]',
 '[Mountain Dew]',
 '[Mountain Dew]',
 '[Dr. Pepper]',
 '[Mountain Dew]',
 '[Diet Dr. Pepper]',
 '[Mountain Dew]',
 '[Coke]',
 '[Coca Cola]',
 '[Sprite]',
 '[Diet Coke]',
 '[Coke]',
 '[Coke]',
 '[Lemonade]',
 '[Sprite]',
 '[Diet Coke]',
 '[Coca Cola]',
 '[Diet Coke]',
 '[Diet Coke]',
 '[Mountain Dew]',
 '[Coke]',
 '[Coke]',
 '[Coke]',
 '[Sprite]',
 '[Coke]',
 '[Coke]',
 '[Dr. Pepper]',
 '[Coca Cola]',
 '[Dr. Pepper]',
 '[Coke]',
 '[Diet Dr. Pepper]',
 '[Diet Coke]',
 '[Lemonade]',
 '[Coke]',
 '[Diet Coke]',
 '[Diet Coke]',
 '[Diet Coke]',
 '[Diet Coke]',
 '[Nestea]',
 '[Diet Coke]',
 '[Dr. Pepper]',
 '[Coke]',
 '[Sprite]',
 '[Coke]',
 '[Lemonade]',
 '[Coke]',
 '[Coke]',
 '[Diet Coke]',
 '[Diet Coke]',
 '[Coca Cola]',
 '[Coca Cola]',
 '[Lemonade]',
 '[Lemonade]',
 '[Diet

In [22]:
# Remove the '[' and ']' characters and store the unique values in a set
cleaned_list = [drink.strip('[]') for drink in drinks_list]
unique_sodas = set(cleaned_list)
unique_sodas

{'Coca Cola',
 'Coke',
 'Diet Coke',
 'Diet Dr. Pepper',
 'Dr. Pepper',
 'Lemonade',
 'Mountain Dew',
 'Nestea',
 'Sprite'}

---

## Advanced Level


### Part 5: Calculate the average number of toppings per burrito.

Note: Let's ignore the `quantity` column to simplify this task.

Hint: Think carefully about the easiest way to count the number of toppings!
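One way to read this hint is to count commas in each `choice_description`. The sketch below makes two assumptions: that no individual topping name contains a comma, and that the salsa counts as a topping. A description listing n items contains n - 1 commas, so the sketch adds 1 per burrito. The two sample strings are copied from the burrito descriptions in this data set; `count_toppings` is a hypothetical helper name.

```python
# Two choice_description strings copied from the burrito rows
descriptions = [
    '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, '
    'Cheese, Sour Cream, Guacamole, Lettuce]]',
    '[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Cheese, Sour Cream]]',
]


def count_toppings(description):
    """Count listed items, assuming items are separated by commas only."""
    # n items are separated by n - 1 commas, hence the + 1
    return description.count(',') + 1


counts = [count_toppings(d) for d in descriptions]
average_toppings = sum(counts) / len(counts)

print(counts)
print(average_toppings)
```

Counting per row rather than on one joined string keeps the off-by-one adjustment explicit.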


In [23]:
# Collect the distinct item names that contain the string 'Burrito'
# (a set comprehension keeps each name only once)
{name for name in df['item_name'] if 'Burrito' in name}

{'Barbacoa Burrito',
 'Burrito',
 'Carnitas Burrito',
 'Chicken Burrito',
 'Steak Burrito',
 'Veggie Burrito'}

In [24]:
# Create a new dataframe with only the item_names that contain 'Burrito'
# (this matches the six names found above)
df_burrito = df[df['item_name'].str.contains('Burrito')]
df_burrito.shape

(1172, 9)

In [25]:
# Convert the 'choice_description' column into a list
burrito_choice_list = df_burrito['choice_description'].tolist()
burrito_choice_list

['[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
 '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
 '[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Cheese, Sour Cream]]',
 '[Fresh Tomato Salsa (Mild), [Black Beans, Rice, Cheese, Sour Cream, Lettuce]]',
 '[[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Rice, Cheese, Sour Cream, Lettuce]]',
 '[[Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Pinto Beans, Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
 '[[Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Sour Cream, Lettuce]]',
 '[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Rice, Cheese, Sour Cream]]',
 '[[Roasted Chili Corn Salsa (Medium), Fresh Tomato Salsa (Mild)], [Rice, Black Beans, Sour Cream]]',
 '[Fresh Tomato Salsa, [Rice, Pinto Beans, C

In [26]:
# Turn the burrito choice description list into a single string
burrito_choice_string = ''.join(burrito_choice_list)
burrito_choice_string

'[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]][Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]][Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Cheese, Sour Cream]][Fresh Tomato Salsa (Mild), [Black Beans, Rice, Cheese, Sour Cream, Lettuce]][[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Rice, Cheese, Sour Cream, Lettuce]][[Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Pinto Beans, Rice, Cheese, Sour Cream, Guacamole, Lettuce]][[Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Sour Cream, Lettuce]][Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Rice, Cheese, Sour Cream]][[Roasted Chili Corn Salsa (Medium), Fresh Tomato Salsa (Mild)], [Rice, Black Beans, Sour Cream]][Fresh Tomato Salsa, [Rice, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]][Tomati

In [27]:
# Approximate the total number of toppings by counting commas.
# Note: a description listing n items contains n - 1 commas,
# so this count is one short per burrito.
count = burrito_choice_string.count(',')
print(count)

5151


In [28]:
print('The average number of toppings per burrito order (excluding quantity) is ' + format(round(count/df_burrito['item_name'].count(),2)))

The average number of toppings per burrito order (excluding quantity) is 4.4


### Part 6: Create a dictionary. Let the keys represent chip orders and the values represent the total number of orders.

Expected output: `{'Chips and Roasted Chili-Corn Salsa': 18, ... }`

Note: Please take the `quantity` column into account!

Optional: Learn how to use `defaultdict` (from the `collections` module) to simplify your code.
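A minimal sketch of the `defaultdict` approach that takes `quantity` into account. `defaultdict(int)` returns 0 for a missing key, so totals can be accumulated without first checking whether the key exists. The first three rows are copied from the printed data; the last row is hypothetical and added only to show a quantity greater than 1. `chip_totals` is an illustrative name.

```python
from collections import defaultdict

# Rows in the layout: order_id, quantity, item_name, choice_description, item_price
sample_rows = [
    ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
    ['99', '2', 'Side of Chips', 'NULL', '$3.38 '],  # hypothetical row
]

# Missing keys start at int() == 0, so += works on first sight of a key
chip_totals = defaultdict(int)

for order_id, quantity, item_name, choice_description, item_price in sample_rows:
    if 'Chips' in item_name:
        # Add the quantity ordered, not just 1 per row
        chip_totals[item_name] += int(quantity)

print(dict(chip_totals))
```

With a plain `dict`, the loop body would need an `if item_name not in chip_totals` check or a call to `dict.get()` with a default.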

In [29]:
# Filter to only those rows whose item_name contains the string 'Chips'
df_chips = df[df['item_name'].str.contains('Chips', case=False)]

In [30]:
# Get the value counts of the remaining item_names
df_chips['item_name'].value_counts()

Chips and Guacamole                      479
Chips                                    211
Chips and Fresh Tomato Salsa             110
Side of Chips                            101
Chips and Tomatillo Red Chili Salsa       48
Chips and Tomatillo Green Chili Salsa     43
Chips and Tomatillo-Green Chili Salsa     31
Chips and Roasted Chili Corn Salsa        22
Chips and Tomatillo-Red Chili Salsa       20
Chips and Roasted Chili-Corn Salsa        18
Chips and Mild Fresh Tomato Salsa          1
Name: item_name, dtype: int64

In [31]:
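# Convert the value counts (a Series) into a dictionary
# Note: value_counts() counts rows, so these totals do not yet weight by quantity
df_chips['item_name'].value_counts().to_dict()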

{'Chips and Guacamole': 479,
 'Chips': 211,
 'Chips and Fresh Tomato Salsa': 110,
 'Side of Chips': 101,
 'Chips and Tomatillo Red Chili Salsa': 48,
 'Chips and Tomatillo Green Chili Salsa': 43,
 'Chips and Tomatillo-Green Chili Salsa': 31,
 'Chips and Roasted Chili Corn Salsa': 22,
 'Chips and Tomatillo-Red Chili Salsa': 20,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Mild Fresh Tomato Salsa': 1}

---

## Bonus: Craft a problem statement about this data that interests you, and then answer it!


In [59]:
# Counter tallies each distinct choice_description in a single pass
from collections import Counter
burrito_choice_dict = Counter(burrito_choice_list)

In [60]:
max_value = max(burrito_choice_dict.values())  # maximum value
max_keys = [k for k, v in burrito_choice_dict.items() if v == max_value] # getting all keys containing the `maximum`
print(max_value,max_keys)

24 ['[Fresh Tomato Salsa (Mild), [Pinto Beans, Rice, Cheese, Sour Cream]]']


In [61]:
# The most popular choice_description on burritos is
# Fresh Tomato Salsa (Mild), Pinto Beans, Rice, Cheese, Sour Cream.
# This combination was served 24 times.
# This answer ignores the quantity column.