# Project 2: Cleaning and Transforming Data for Analysis
## For 3 data sets chosen from the week 5 discussion forum:
1. Create a .csv file
2. Use pandas to read the file and tidy it
3. Perform the analysis suggested in the discussion post

## Data Set 2: Starbucks caffeine content
This is the data set that I posted in the discussion forum. I want to look at caffeine content by drink category.

First, I will import the file and look at the first few rows.

In [1]:
import pandas as pd
import numpy as np

url = 'https://github.com/sarahbill33/dataacq/blob/main/starbucks_caffeine%20-%20Sheet1.csv?raw=true'
caffeine = pd.read_csv(url, skipinitialspace=True, index_col=0)

print(caffeine.head(5))

                         Short (8 fl oz) Tall (12 fl oz) Grande (16 fl oz)  \
Beverage                                                                     
Pike Place Brewed Coffee          155 mg          235 mg            310 mg   
Blonde Roast                      180 mg          270 mg            360 mg   
Featured Dark Roast               130 mg          195 mg            260 mg   
Clover® Brewed Coffees                 –               –                 –   
-Reserve roasts                   190 mg          280 mg            380 mg   

                         Venti (20 fl oz)  
Beverage                                   
Pike Place Brewed Coffee           410 mg  
Blonde Roast                       475 mg  
Featured Dark Roast                340 mg  
Clover® Brewed Coffees                  –  
-Reserve roasts                    470 mg  


I notice a couple of things right away, based on what I learned working with the first dataset:
- the dataset is wide and needs to be transformed into a long dataset
- some cells are missing or hold only a dash ('–'), and the values are strings (such as '155 mg') instead of numbers

I will start by converting the values. My approach is to look at the values that occur first, and then decide how to replace them with numbers.

In [2]:
caffeine['Short (8 fl oz)'].unique()

array(['155 mg', '180 mg', '130 mg', '–', '190 mg', '170 mg', '15 mg',
       '75 mg', '85 mg', '135 mg', '150 mg', '90 mg', '95 mg', nan,
       '25 mg', '50 mg', '40 mg', '0 mg', '0-15 mg', '15-25 mg'],
      dtype=object)
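The unique values include ranges such as '0-15 mg' and '15-25 mg', which a digits-only cleanup would merge into a single number ('015', '1525'). Below is a minimal sketch of one way to handle them, assuming the upper bound of a range is the value of interest. That choice, the helper name `parse_mg`, and the sample series are my own assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

def parse_mg(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn '155 mg' or '15-25 mg' into a float.

    Ranges are reduced to their upper bound (an assumption);
    cells without any digits ('–', missing) become NaN.
    """
    # Group 0 = first number, group 1 = optional second number of a range
    parts = series.str.extract(r"(\d+)(?:\s*[-–]\s*(\d+))?")
    low = pd.to_numeric(parts[0], errors="coerce")
    high = pd.to_numeric(parts[1], errors="coerce")
    # Use the upper bound where there is one, otherwise the single value
    return high.fillna(low).astype(float)

# Sample strings taken from the unique() output above
sample = pd.Series(["155 mg", "–", None, "0-15 mg", "15-25 mg"])
parsed = parse_mg(sample)
print(parsed.tolist())
```

Using the midpoint or the lower bound instead would only change the last line of the helper.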

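In [2]:
# Strip every non-digit character (' mg', '–'). regex=True is needed so that
# '\D' is treated as a pattern; pandas 2.0 and later default to a literal match.
# Caution: ranges such as '15-25 mg' collapse to '1525' here.
caffeine['Short (8 fl oz)'] = caffeine['Short (8 fl oz)'].str.replace(r'\D', '', regex=True)
caffeine['Tall (12 fl oz)'] = caffeine['Tall (12 fl oz)'].str.replace(r'\D', '', regex=True)
caffeine['Grande (16 fl oz)'] = caffeine['Grande (16 fl oz)'].str.replace(r'\D', '', regex=True)
caffeine['Venti (20 fl oz)'] = caffeine['Venti (20 fl oz)'].str.replace(r'\D', '', regex=True)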

print(caffeine.head(5))

                         Short (8 fl oz) Tall (12 fl oz) Grande (16 fl oz)  \
Beverage                                                                     
Pike Place Brewed Coffee             155             235               310   
Blonde Roast                         180             270               360   
Featured Dark Roast                  130             195               260   
Clover® Brewed Coffees                                                       
-Reserve roasts                      190             280               380   

                         Venti (20 fl oz)  
Beverage                                   
Pike Place Brewed Coffee              410  
Blonde Roast                          475  
Featured Dark Roast                   340  
Clover® Brewed Coffees                     
-Reserve roasts                       470  
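The same cleanup is applied to all four size columns, so it could also be written as a loop. This is a sketch on a two-column toy frame (the frame and the name `size_cols` are illustrative assumptions), passing `regex=True` explicitly so the pattern is not treated as a literal string.

```python
import pandas as pd

# Toy frame with the same kind of values as the real columns
sample = pd.DataFrame(
    {"Short (8 fl oz)": ["155 mg", "–"], "Tall (12 fl oz)": ["235 mg", "–"]}
)

# Every column here is a size column; in the real frame this would be the four sizes
size_cols = list(sample.columns)

for col in size_cols:
    # Remove all non-digit characters; the dash placeholder becomes ''
    sample[col] = sample[col].str.replace(r"\D", "", regex=True)

print(sample)
```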




In [3]:
caffeine['Short (8 fl oz)'] = caffeine['Short (8 fl oz)'].replace('', 0)
caffeine['Tall (12 fl oz)'] = caffeine['Tall (12 fl oz)'].replace('', 0)
caffeine['Grande (16 fl oz)'] = caffeine['Grande (16 fl oz)'].replace('', 0)
caffeine['Venti (20 fl oz)'] = caffeine['Venti (20 fl oz)'].replace('', 0)

print(caffeine.head(5))

                         Short (8 fl oz) Tall (12 fl oz) Grande (16 fl oz)  \
Beverage                                                                     
Pike Place Brewed Coffee             155             235               310   
Blonde Roast                         180             270               360   
Featured Dark Roast                  130             195               260   
Clover® Brewed Coffees                 0               0                 0   
-Reserve roasts                      190             280               380   

                         Venti (20 fl oz)  
Beverage                                   
Pike Place Brewed Coffee              410  
Blonde Roast                          475  
Featured Dark Roast                   340  
Clover® Brewed Coffees                  0  
-Reserve roasts                       470  
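Replacing blanks with 0 makes a size that is not offered look like a caffeine-free drink. That does not change a maximum, but it would pull down an average. Here is a sketch of an alternative that keeps those cells as missing values, using made-up values rather than the real file:

```python
import pandas as pd

# Illustrative values only: '' marks a size that is not offered
cleaned = pd.DataFrame(
    {"Short (8 fl oz)": ["155", "", "0"], "Tall (12 fl oz)": ["235", "", "20"]},
    index=["Drink A", "Drink B", "Drink C"],
)

# errors='coerce' turns anything that cannot be parsed (here '') into NaN
numeric = cleaned.apply(pd.to_numeric, errors="coerce")

# NaN is skipped by default, so the mean only covers sizes that exist
print(numeric.mean())
```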


In [4]:
caffeine['Beverage_col'] = caffeine.index

print(caffeine.head(5))

                         Short (8 fl oz) Tall (12 fl oz) Grande (16 fl oz)  \
Beverage                                                                     
Pike Place Brewed Coffee             155             235               310   
Blonde Roast                         180             270               360   
Featured Dark Roast                  130             195               260   
Clover® Brewed Coffees                 0               0                 0   
-Reserve roasts                      190             280               380   

                         Venti (20 fl oz)              Beverage_col  
Beverage                                                             
Pike Place Brewed Coffee              410  Pike Place Brewed Coffee  
Blonde Roast                          475              Blonde Roast  
Featured Dark Roast                   340       Featured Dark Roast  
Clover® Brewed Coffees                  0    Clover® Brewed Coffees  
-Reserve roasts                       470           -Reserve roasts  

In [5]:
caffeine2 = pd.melt(caffeine, id_vars='Beverage_col', value_vars=['Short (8 fl oz)', 'Tall (12 fl oz)', 'Grande (16 fl oz)', 'Venti (20 fl oz)'])

print(caffeine2.head(5))

               Beverage_col         variable value
0  Pike Place Brewed Coffee  Short (8 fl oz)   155
1              Blonde Roast  Short (8 fl oz)   180
2       Featured Dark Roast  Short (8 fl oz)   130
3    Clover® Brewed Coffees  Short (8 fl oz)     0
4           -Reserve roasts  Short (8 fl oz)   190
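The wide-to-long step can also be done without a helper column, by resetting the index and naming the melted columns directly. This is a sketch on two rows from the table above; the column names `size` and `caffeine_mg` are my own choices.

```python
import pandas as pd

# Two rows and two sizes from the table above, already numeric
wide = pd.DataFrame(
    {"Short (8 fl oz)": [155, 180], "Tall (12 fl oz)": [235, 270]},
    index=pd.Index(["Pike Place Brewed Coffee", "Blonde Roast"], name="Beverage"),
)

# reset_index() turns the 'Beverage' index into a regular column,
# then melt() stacks every remaining column into (size, caffeine_mg) pairs
long = wide.reset_index().melt(
    id_vars="Beverage", var_name="size", value_name="caffeine_mg"
)
print(long)
```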


In [6]:
caffeine2.value.unique()

array(['155', '180', '130', 0, '190', '170', '15', '75', '85', '135',
       '150', '90', '95', nan, '25', '50', '40', '0', '015', '1525',
       '235', '270', '195', '280', '255', '20', '100', '115', '70', '55',
       '4045', '310', '360', '260', '380', '340', '375', '225', '315',
       '175', '185', '80', '5060', '410', '475', '470', '425', '445',
       '30', '300', '320', '265', '65', '120', '110'], dtype=object)

In [7]:
caffeine2['value'] = caffeine2.value.fillna(0)

caffeine2.value.unique()

array(['155', '180', '130', 0, '190', '170', '15', '75', '85', '135',
       '150', '90', '95', '25', '50', '40', '0', '015', '1525', '235',
       '270', '195', '280', '255', '20', '100', '115', '70', '55', '4045',
       '310', '360', '260', '380', '340', '375', '225', '315', '175',
       '185', '80', '5060', '410', '475', '470', '425', '445', '30',
       '300', '320', '265', '65', '120', '110'], dtype=object)

In [8]:
caffeine2['value'] = caffeine2['value'].astype(str).astype(int)

caffeine2.value.unique()

array([ 155,  180,  130,    0,  190,  170,   15,   75,   85,  135,  150,
         90,   95,   25,   50,   40, 1525,  235,  270,  195,  280,  255,
         20,  100,  115,   70,   55, 4045,  310,  360,  260,  380,  340,
        375,  225,  315,  175,  185,   80, 5060,  410,  475,  470,  425,
        445,   30,  300,  320,  265,   65,  120,  110])

In [9]:
caffeine2.groupby('variable').value.max()

variable
Grande (16 fl oz)    5060
Short (8 fl oz)      1525
Tall (12 fl oz)      4045
Venti (20 fl oz)     1525
Name: value, dtype: int32

In [27]:
max_grande = caffeine2.query("`value` == 5060 and `variable` == 'Grande (16 fl oz)'")['Beverage_col']
max_short = caffeine2.query("`value` == 1525 and `variable` == 'Short (8 fl oz)'")['Beverage_col']
max_tall = caffeine2.query("`value` == 4045 and `variable` == 'Tall (12 fl oz)'")['Beverage_col']
max_venti = caffeine2.query("`value` == 1525 and `variable` == 'Venti (20 fl oz)'")['Beverage_col']

print('The drink with the most caffeine in a short is ') 
print(max_short.to_list())

print('The drink with the most caffeine in a tall is ') 
print(max_tall.to_list())

print('The drink with the most caffeine in a grande is ') 
print(max_grande.to_list())

print('The drink with the most caffeine in a venti is ') 
print(max_venti.to_list())

The drink with the most caffeine in a short is 
['Teavana Jade Citrus Mint', 'Teavana Oprah Chai Brewed Tea']
The drink with the most caffeine in a tall is 
['Starbucks Verismo']
The drink with the most caffeine in a grande is 
['Starbucks Verismo']
The drink with the most caffeine in a venti is 
['Teavana Emperor’s Cloud & Mist', 'Teavana Youthberry']
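The queries above hard-code the maxima printed by the previous cell. The sketch below is a variant that derives them instead and keeps ties, shown on toy values (the drinks and numbers are illustrative, not from the file):

```python
import pandas as pd

# Toy long-format frame with the same column names as caffeine2
long = pd.DataFrame(
    {
        "Beverage_col": ["Drink A", "Drink B", "Drink C"] * 2,
        "variable": ["Short"] * 3 + ["Tall"] * 3,
        "value": [155, 180, 180, 235, 270, 20],
    }
)

# Broadcast each size's maximum back onto its rows
group_max = long.groupby("variable")["value"].transform("max")

# Keep every row that reaches its size's maximum (ties included)
top = long[long["value"] == group_max]
top_by_size = top.groupby("variable")["Beverage_col"].apply(list)
print(top_by_size)
```

Because nothing is hard-coded, the result updates automatically if the cleaning step changes the values.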


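## What the output shows:
- Jade Citrus Mint and Oprah Chai teas are returned for a size short
- Starbucks Verismo is returned for a tall
- Starbucks Verismo is returned for a grande
- Emperor's Cloud and Youthberry are returned for a venti

These maxima (1525, 4045 and 5060) are not real caffeine amounts. They come from range entries such as '15-25 mg', which lost their hyphen when the non-digit characters were stripped; 4045 and 5060 presumably started as '40-45 mg' and '50-60 mg'. The drinks listed are therefore the ones whose caffeine content is given as a range, not necessarily the ones with the most caffeine. The ranges need to be parsed separately before the maximum per size is meaningful.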

It would be interesting to do further analysis where a column is added for 'caffeine per ounce' and then compared. Maybe another time!
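A rough sketch of that follow-up, assuming the ounces can be read from the size label (for example '8' from 'Short (8 fl oz)'). The column names `ounces` and `mg_per_oz` are hypothetical, and the two rows are the Pike Place values shown earlier.

```python
import pandas as pd

# Two long-format rows using the Pike Place values from the first table
long = pd.DataFrame(
    {
        "Beverage_col": ["Pike Place Brewed Coffee", "Pike Place Brewed Coffee"],
        "variable": ["Short (8 fl oz)", "Venti (20 fl oz)"],
        "value": [155, 410],
    }
)

# Pull the number of fluid ounces out of the size label
long["ounces"] = (
    long["variable"].str.extract(r"\((\d+) fl oz\)", expand=False).astype(int)
)

# Caffeine concentration, comparable across sizes
long["mg_per_oz"] = long["value"] / long["ounces"]
print(long[["variable", "mg_per_oz"]])
```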