# Drink Caffeine Content Analysis Project

## Project Introduction

In this analysis we will delve into a topic that is very near and dear to my heart - caffeine! 

I have a caffeine addiction. More often than not I opt for energy drinks such as Red Bull, Monster, or Rockstar. However, when the weather gets cold I start to really enjoy coffee. I often worry that the amount of caffeine I ingest will one day come back to bite me in terms of health. Despite how much caffeine I consume on a weekly basis, I know shockingly little about the amount of caffeine in the drinks I enjoy so much.

This data set covers roughly 600 different drinks and records the caffeine content, volume, and calories of each.

## Project Goal

We will conduct some exploratory data analysis (EDA) and then attempt to find a subset of drinks which may be "less unhealthy" and could serve as alternatives to sugar-filled energy drinks and coffee.


## Import Libraries and Load the Dataset

In [51]:
# Import libraries used in the project and turn off warnings

import pandas as pd
import plotly.express as px
import plotly.io as pio
import warnings

# plotly.offline needs to be imported and the following code run in order to display visualizations offline

import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

# hiding warnings

warnings.filterwarnings('ignore')

In [52]:
# import the caffeine dataset

caffeine = pd.read_csv(r'C:\Users\benwh\Documents\Projects\Caffeine\caffeine.csv')

## Data Familiarization

In [53]:
caffeine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   drink          610 non-null    object 
 1   Volume (ml)    610 non-null    float64
 2   Calories       610 non-null    int64  
 3   Caffeine (mg)  610 non-null    int64  
 4   type           610 non-null    object 
dtypes: float64(1), int64(2), object(2)
memory usage: 24.0+ KB


In [54]:
caffeine.head()

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
0,Costa Coffee,256.993715,0,277,Coffee
1,Coffee Friend Brewed Coffee,250.19181,0,145,Coffee
2,Hell Energy Coffee,250.19181,150,100,Coffee
3,Killer Coffee (AU),250.19181,0,430,Coffee
4,Nescafe Gold,250.19181,0,66,Coffee


We can see after running .info() and .head() that our data set is pretty straightforward: 610 observations with five fields covering drink name, volume, calories, caffeine content, and drink type.

In [55]:
# Check for duplicate drink names

duplicatedrinks = caffeine[caffeine.duplicated(['drink'])]
duplicatedrinks

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type


We have no duplicate drink names in our dataset.

In [56]:
# Distribution of drink types in the dataset.

caffeine['type'].value_counts().sort_values(ascending=False)

Energy Drinks    219
Coffee           173
Soft Drinks       90
Tea               66
Energy Shots      36
Water             26
Name: type, dtype: int64

Running value_counts() on the "type" field reveals there are six unique drink types.

In [57]:
# Distribution of volumes in the dataset.

caffeine['Volume (ml)'].value_counts().sort_values(ascending=False)

354.882000    159
473.176000    125
236.588000     96
250.191810     35
59.147000      19
             ... 
14.786750       1
650.617000      1
256.993715      1
751.166900      1
329.744525      1
Name: Volume (ml), Length: 70, dtype: int64

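The volume of drinks varies quite a bit (70 distinct values), but the three most common sizes (354.882, 473.176, and 236.588 ml, i.e. 12, 16, and 8 US fl oz) account for 380 of the 610 drinks, so well over half of all volumes are higher than 200 ml.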
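Since `value_counts()` only shows the most and least common volumes, a share like "more than half above 200 ml" is easier to confirm with a boolean mask. The following is a minimal sketch on a small hypothetical `sample` frame (a handful of volumes copied from the output above, not the full dataset):

```python
import pandas as pd

# Hypothetical sample: a few volumes copied from the value_counts() output above.
sample = pd.DataFrame(
    {"Volume (ml)": [354.882, 473.176, 236.588, 250.19181, 59.147, 14.78675]}
)

# The mean of a boolean mask is the fraction of rows where the condition holds.
share_over_200 = (sample["Volume (ml)"] > 200).mean()
print(f"{share_over_200:.0%} of the sample is larger than 200 ml")
```

Running the same expression against `caffeine['Volume (ml)']` should give the exact share for the full dataset.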

In [58]:
# Distribution of calorie content in the dataset.
caffeine['Calories'].value_counts().sort_values(ascending=False)

0      200
5       32
10      30
160     22
140     21
      ... 
12       1
163      1
139      1
74       1
299      1
Name: Calories, Length: 98, dtype: int64

Calories per drink vary quite a bit (98 distinct values). Notably, just under half of the drinks are very low in calories: 200 of the 610 have zero calories, and another 62 have 5 or 10.
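Raw value counts are hard to read when there are this many distinct values, so one option is to group calories into a few bands first. This is only a sketch: the `calories` series holds a few values copied from outputs in this notebook, and the bin edges and labels are illustrative assumptions, not something defined by the dataset.

```python
import pandas as pd

# Hypothetical sample of calorie values (copied from outputs in this notebook).
calories = pd.Series([0, 0, 0, 5, 10, 150, 160, 140, 299, 830], name="Calories")

# Illustrative bin edges; each bin includes its right edge, e.g. (-1, 20].
bins = [-1, 20, 100, 400, float("inf")]
labels = ["very low (<= 20)", "low (21-100)", "moderate (101-400)", "high (> 400)"]

binned = pd.cut(calories, bins=bins, labels=labels)

# Reindex so the bands appear in their natural order rather than by frequency.
counts = binned.value_counts().reindex(labels)
print(counts)
```

Applied to `caffeine['Calories']`, the same binning would show directly how many drinks fall into each band.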

In [59]:
# Distribution of caffeine content in the dataset.
caffeine['Caffeine (mg)'].value_counts().sort_values(ascending=False)

80     37
100    34
160    33
0      28
120    23
       ..
68      1
178     1
157     1
36      1
99      1
Name: Caffeine (mg), Length: 162, dtype: int64

While there are six types of drinks included in the dataset, there is a lot more variation among volume, calories, and caffeine content.

## Univariate Analysis 

### Summary Statistics for Volume (ml), Calories, and Caffeine (mg)

In [60]:
# Summary statistics for numerical data types in the dataset as a whole

caffeine.describe()

Unnamed: 0,Volume (ml),Calories,Caffeine (mg)
count,610.0,610.0,610.0
mean,346.54363,75.527869,134.693443
std,143.747738,94.799919,155.362861
min,7.393375,0.0,0.0
25%,236.588,0.0,50.0
50%,354.882,25.0,100.0
75%,473.176,140.0,160.0
max,1419.528,830.0,1555.0


In [61]:
# Summary statistics for the numerical fields (Volume (ml), Calories, Caffeine (mg)), grouped by drink type

caffeine.groupby(['type']).describe()

type,Volume (ml) count,Volume (ml) mean,Volume (ml) std,Volume (ml) min,Volume (ml) 25%,Volume (ml) 50%,Volume (ml) 75%,Volume (ml) max,Calories count,Calories mean,...,Calories 75%,Calories max,Caffeine (mg) count,Caffeine (mg) mean,Caffeine (mg) std,Caffeine (mg) min,Caffeine (mg) 25%,Caffeine (mg) 50%,Caffeine (mg) 75%,Caffeine (mg) max
Coffee,173.0,335.870855,159.716392,12.716605,236.588,325.3085,473.176,1419.528,173.0,73.49711,...,120.0,830.0,173.0,200.589595,248.222165,2.0,100.0,145.0,200.0,1555.0
Energy Drinks,219.0,388.971198,106.409997,236.588,250.19181,354.882,473.176,751.1669,219.0,86.671233,...,149.5,320.0,219.0,147.86758,76.734535,0.0,80.0,135.0,182.5,400.0
Energy Shots,36.0,57.742259,22.094888,7.393375,57.076855,59.147,59.147,125.98311,36.0,16.5,...,25.0,100.0,36.0,193.416667,79.535931,75.0,120.0,192.5,241.25,350.0
Soft Drinks,90.0,355.243454,41.509635,236.588,354.882,354.882,354.882,591.47,90.0,111.111111,...,160.0,320.0,90.0,33.677778,24.915961,0.0,9.25,37.0,47.75,102.0
Tea,66.0,360.47408,167.002318,177.441,236.588,236.588,473.176,946.352,66.0,52.757576,...,107.5,299.0,66.0,55.863636,39.333637,0.0,29.25,45.0,68.0,165.0
Water,26.0,394.590111,99.70274,236.588,343.791938,354.882,473.176,591.47,26.0,11.538462,...,3.75,110.0,26.0,53.730769,34.060602,0.0,35.0,60.0,73.75,120.0


### Distribution of Drink Types

In [62]:
caffeine_gb = caffeine.groupby(["type"]).count()
fig = px.bar(caffeine_gb, 
             x=caffeine_gb.index, 
             y="drink", 
             title='Count of Drink Types')
fig.update_layout(xaxis={'categoryorder':'total descending'})

fig.show()

### Histogram - Volume (ml)

In [63]:
fig = px.histogram(caffeine, 
             x = 'Volume (ml)', 
             title='Distribution of Volume')

fig.show()

The right tail of the histogram for Volume (ml) is pretty long. Let's zoom in and list the drinks with volumes > 600 ml.

In [64]:
large_volume = caffeine[(caffeine['Volume (ml)'] > 600)]
large_volume

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
29,Starbucks Bottled Iced Coffee,1419.528,240,640,Coffee
30,Baskin Robbins Cappuccino Blast,709.764,470,234,Coffee
31,Dunkin' Cold Brew,709.764,5,260,Coffee
32,Dunkin' Donuts Iced Coffee,709.764,20,297,Coffee
33,Dunkin' Donuts Iced Latte,709.764,100,166,Coffee
214,Monster Hydro,751.1669,150,188,Energy Drinks
242,Mega Monster Energy Drink,709.764,320,240,Energy Drinks
243,Amino Force Energy Drink,650.617,0,200,Energy Drinks
520,Xingtea Iced Green Tea,694.97725,50,110,Tea
530,McDonalds Sweet Tea,946.352,160,100,Tea


Our large right-sided tail is made up of 13 drinks. Some of these have very large volume sizes, such as the Starbucks Bottled Iced Coffee - it's nearly 1.5 liters!

### Histogram - Calories

In [65]:
fig = px.histogram(caffeine, 
             x = 'Calories', 
             title='Distribution of Calories')

fig.show()

Similar to Volume (ml), we have a long right-sided tail for Calories. Let's take a closer look and filter for drinks with >= 400 calories.

Also worth noting is that there are a large number of drinks with less than 20 calories.

In [66]:
high_calories = caffeine[(caffeine['Calories'] >= 400)]
high_calories

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
13,Dare Iced Coffee,500.087885,429,160,Coffee
30,Baskin Robbins Cappuccino Blast,709.764,470,234,Coffee
37,Arby's Jamocha Shake,473.176,830,12,Coffee
81,Big Train Java Chip Ice Coffee,354.882,410,49,Coffee


The long right-sided tail for calories is made up of four drinks that all look like they are a liquid version of ice cream.

### Histogram - Caffeine (mg)

In [67]:
fig = px.histogram(caffeine, 
             x = 'Caffeine (mg)', 
             title='Distribution of Caffeine', 
             template='plotly_white')

fig.show()

Again we have a long, right-sided tail, this time for Caffeine. Let's zoom in on the tail by looking at any drink with > 400 mg of caffeine.

In [68]:
high_caffeine = caffeine[(caffeine['Caffeine (mg)'] > 400)]

fig = px.histogram(high_caffeine, 
             x = 'Caffeine (mg)', 
             nbins=10, 
             title='Distribution of Drinks With Caffeine Content Higher Than 400 mg', 
             template='plotly_white')

fig.show()

### Univariate Analysis - Takeaways

- Roughly 64% of the dataset (392 of 610 drinks) is made up of the Energy Drinks and Coffee drink types
- Most volumes fall between 200 and 475 ml
- More than half of the drinks in the dataset appear to be low calorie (< 100 calories)
- Most drinks have less than 200 mg caffeine
- ~75% of drinks have a caffeine concentration (Caffeine (mg) / Volume (ml), computed in the next section) of about 0.5 or less
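The first takeaway can be checked with simple arithmetic on the drink-type counts. A minimal sketch, with the counts copied by hand from the `value_counts()` output earlier in the notebook (the variable names are my own):

```python
import pandas as pd

# Drink-type counts copied from the value_counts() output earlier in the notebook.
type_counts = pd.Series(
    {
        "Energy Drinks": 219,
        "Coffee": 173,
        "Soft Drinks": 90,
        "Tea": 66,
        "Energy Shots": 36,
        "Water": 26,
    }
)

# Convert counts to shares of the whole dataset.
shares = type_counts / type_counts.sum()
top_two_share = shares[["Energy Drinks", "Coffee"]].sum()

print(shares.round(3))
print(f"Energy Drinks + Coffee: {top_two_share:.1%}")
```

On the full DataFrame, `caffeine['type'].value_counts(normalize=True)` should produce the same shares without copying counts by hand.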

## Bivariate Analysis

### Box Plot for Caffeine Across all Drink Types

In [69]:
fig = px.box(caffeine, 
            x='type', 
            y='Caffeine (mg)',
            title='Caffeine Content By Drink Type',
            labels={'type': 'Type'},
            template='plotly_white',
            hover_data=['drink', 'Volume (ml)']
            )
fig.show()

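Coffee has numerous outlying drinks where caffeine content is very high. Additionally, Coffee drinks appear to be a bit higher in terms of caffeine content than Energy Drinks (median 145 mg vs. 135 mg). However, Energy Shots have a higher median (192.5 mg) than both Coffee and Energy Drinks, even though Coffee's mean (about 201 mg) is pulled slightly above that of Energy Shots (about 193 mg) by its outliers.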
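Whether one drink type is "higher on average" than another can depend on whether the mean or the median is used, because a few extreme values pull the mean upward. A minimal sketch with hypothetical toy numbers (not the real dataset) shows the two measures disagreeing:

```python
import pandas as pd

# Hypothetical toy data: one extreme coffee pulls the Coffee mean upward.
toy = pd.DataFrame(
    {
        "type": ["Coffee", "Coffee", "Coffee",
                 "Energy Shots", "Energy Shots", "Energy Shots"],
        "Caffeine (mg)": [100, 145, 1555, 120, 190, 240],
    }
)

# Compute both measures of "average" per drink type.
summary = toy.groupby("type")["Caffeine (mg)"].agg(["mean", "median"])
print(summary)
```

In this toy example Coffee has the higher mean but the lower median. The same `agg(["mean", "median"])` call on the `caffeine` DataFrame would put both figures side by side for every drink type.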

On average, Coffee, Energy Drinks, and Energy Shots have higher caffeine content than Soft Drinks, Tea, and Water.

### Volume x Caffeine

In [70]:
fig = px.scatter(caffeine,
            x='Volume (ml)', 
            y='Caffeine (mg)',
            color='type',  
            title='Volume x Caffeine',
            height = 800,
            template='plotly_white',
            hover_data=['drink']
            )

fig.show()

The coffee outliers can be seen here as well, and it looks like they are mostly among drinks of the same volume. Energy shots are clustered around 100 ml or lower but have similar caffeine content to many energy drinks and coffees. Tea, Water, and Soft Drinks appear to have lower caffeine content.

### Box Plot for Caffeine Concentration Across all Drink Types

We're going to take another look at caffeine distribution across all drink types, but this time we'll use caffeine concentration, which is equal to Caffeine (mg) / Volume (ml). This will give us a better idea of which drink types pack a caffeine punch relative to the volume of the drink. We'll also take a look at some summary statistics for the new caffeine concentration field.

In [71]:
# create Caffeine Concentration field which = Caffeine (mg) / Volume (ml)

caffeine['Caffeine Concentration'] = caffeine['Caffeine (mg)'] / caffeine['Volume (ml)']

# get summary statistics for Caffeine Concentration

caffeine['Caffeine Concentration'].describe()

count    610.000000
mean       0.594908
std        1.088245
min        0.000000
25%        0.154981
50%        0.319755
75%        0.507211
max       10.482358
Name: Caffeine Concentration, dtype: float64

In [72]:
# Caffeine Concentration boxplot for all drink types

fig = px.box(caffeine,
        x='type',
        y='Caffeine Concentration',
        title='Caffeine Concentration By Drink Type',
        labels={'type': 'Type'},
        height=800,
        template='plotly_white',
        hover_data=['drink']
        )

fig.show()

As we saw a moment ago when we ran caffeine['Caffeine Concentration'].describe(), the average caffeine concentration is .594 mg/ml, which means that on average we get 1 mg of caffeine for roughly every 1.7 ml of volume, which is almost a 2:1 ratio. There are also some serious outliers, with the max value being 10.48.
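One caveat worth spelling out: .594 is the mean of the per-drink concentrations, so small but highly concentrated drinks count just as much as large ones. Dividing the average caffeine by the average volume gives a noticeably lower "pooled" figure. The sketch below simply redoes that arithmetic with numbers copied from the `describe()` outputs in this notebook; the variable names are my own.

```python
# Figures copied from the describe() outputs in this notebook.
mean_concentration = 0.594908  # mean of the per-drink mg/ml values
mean_caffeine_mg = 134.693443  # mean caffeine per drink
mean_volume_ml = 346.54363     # mean volume per drink

# Invert the mean concentration to get ml of drink per mg of caffeine.
ml_per_mg = 1 / mean_concentration

# "Pooled" concentration: average caffeine divided by average volume.
pooled_concentration = mean_caffeine_mg / mean_volume_ml

print(f"{ml_per_mg:.2f} ml per mg of caffeine")
print(f"pooled concentration: {pooled_concentration:.3f} mg/ml")
```

The gap between the two figures is another sign of how much the high-concentration outliers influence the mean.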

The caffeine concentration of energy shots is clearly much higher than any other drink type. Between Coffee and Energy drinks, Coffee appears to have a higher caffeine concentration. Additionally, just like with the previous box plot showing caffeine content for all drink types, there are numerous outliers for coffee.


## Healthier Options

Now that we have a better understanding of the data set and have created a caffeine concentration value for each drink, let's see if we can identify some potentially healthier options. In this data set, calories are the only data point that could be used to judge whether something is healthy or not, so we will run with the assumption that lower calories = a healthier alternative.

Let's filter for a subset of drinks that have 100 calories or fewer and are not energy shots, since they aren't really a "drink" per se.

In [73]:
healthy_caffeine = caffeine[(caffeine['Calories'] <= 100) & (caffeine['type'] != 'Energy Shots')].sort_values(by=['Caffeine Concentration'], ascending=False).head(20)

healthy_caffeine

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type,Caffeine Concentration
85,Black Label Brewed Coffee,354.882,0,1555,Coffee,4.381738
102,Very Strong Coffee,354.882,0,1350,Coffee,3.804081
92,Devils Brew Extreme Caffeine Coffee,354.882,0,1325,Coffee,3.733635
101,Taft Coffee (EU),354.882,0,1182,Coffee,3.330685
95,High Voltage Coffee (AU),354.882,0,1150,Coffee,3.240514
28,Stok Coffee Shots,12.716605,10,40,Coffee,3.145494
84,Black Insomnia Coffee,354.882,0,1105,Coffee,3.113711
89,Cannonball Coffee Maximum Charge (UK),354.882,0,1101,Coffee,3.10244
82,Biohazard Coffee,354.882,3,928,Coffee,2.614954
91,Death Wish Coffee,354.882,0,728,Coffee,2.051386


Looking at the first 20 results, virtually everything is a type of coffee. Also, it looks like we have some drinks with very low volumes which, when combined with the drink name in some instances, appear to be some type of espresso drink. All of the values listed above are well above our mean of .594 in terms of caffeine concentration.

We could end our analysis here and walk away with a list of drinks to consider as alternatives, but let's do two more things before we wrap up:

1. Get a list of energy drinks with the same rules applied (100 calories or less).
2. Get a combined list of tea and water drinks using the same criteria. These could be good options if we're trying to avoid energy drinks and coffee altogether.
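Both follow-up lists reuse the same filter with a different set of drink types, so the pattern could be wrapped in a small helper. This is a minimal sketch rather than the code used in the cells below: `top_low_calorie` and the `demo` frame are hypothetical names, and the helper assumes the `Caffeine Concentration` column has already been created as shown earlier.

```python
import pandas as pd


def top_low_calorie(df, types, max_calories=100, n=20):
    """Return the n most concentrated drinks of the given types
    that have max_calories or fewer."""
    mask = (df["Calories"] <= max_calories) & (df["type"].isin(types))
    return (
        df[mask]
        .sort_values(by="Caffeine Concentration", ascending=False)
        .head(n)
    )


# Hypothetical demo rows (values taken from tables in this notebook).
demo = pd.DataFrame(
    {
        "drink": ["Bang Energy", "Mega Monster Energy Drink",
                  "Zest Highly Caffeinated Tea", "Arti Sparkling Water"],
        "Calories": [0, 320, 0, 0],
        "type": ["Energy Drinks", "Energy Drinks", "Tea", "Water"],
        "Caffeine Concentration": [0.634014, 0.338141, 0.634014, 0.338141],
    }
)

energy_picks = top_low_calorie(demo, ["Energy Drinks"])
tea_water_picks = top_low_calorie(demo, ["Tea", "Water"])
print(energy_picks["drink"].tolist())
print(tea_water_picks["drink"].tolist())
```

Using `isin` keeps the tea-and-water case as short as the single-type case, instead of chaining several `==` comparisons with `|`.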

### 1. Which Energy Drinks Have the Highest Caffeine Content and Low Calories?

In [74]:
healthy_energy_drinks = caffeine[(caffeine['Calories'] <= 100) & (caffeine['type'] == 'Energy Drinks')].sort_values(by=['Caffeine Concentration'], ascending=False).head(20)

healthy_energy_drinks

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type,Caffeine Concentration
388,Redline Xtreme Energy Drink,236.588,0,316,Energy Drinks,1.335655
387,Redline Princess,236.588,0,300,Energy Drinks,1.268027
348,Hyde Xtreme,354.882,0,400,Energy Drinks,1.127135
385,PerformElite Pre-Workout,236.588,0,225,Energy Drinks,0.95102
364,Spike Shooter,354.882,10,300,Energy Drinks,0.845351
341,Cocaine Energy Drink,354.882,90,280,Energy Drinks,0.788995
381,EBOOST Workout Crusher Mix,236.588,40,175,Energy Drinks,0.739682
314,Spike Hardcore Energy,473.176,0,350,Energy Drinks,0.739682
275,Loud Energy Drink,473.176,10,320,Energy Drinks,0.676281
252,Bang Energy,473.176,0,300,Energy Drinks,0.634014


There are some heavy hitters in this group, some packing a caffeine concentration of > 1. Nearly all of the others shown contain more than 200 mg of caffeine. For reference, a 12 oz Red Bull has 111 mg of caffeine.

### 2. Which Tea and Water Drinks Have the Highest Caffeine Content and Low Calories?

In [75]:
healthy_tea_water = caffeine[(caffeine['Calories'] <= 100) & ((caffeine['type'] == 'Tea') | (caffeine['type'] == 'Water'))].sort_values(by=['Caffeine Concentration'], ascending=False).head(20)

healthy_tea_water

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type,Caffeine Concentration
581,Zest Highly Caffeinated Tea,236.588,0,150,Tea,0.634014
556,Fast Lane Black Tea,236.588,0,110,Tea,0.464943
558,HICAF Tea,236.588,0,110,Tea,0.464943
584,Perrier Energize,250.19181,35,99,Water,0.395696
528,Inko's White Tea Energy,458.38925,100,165,Tea,0.359956
580,YMateina Yerba Mate,236.588,0,80,Tea,0.338141
598,Arti Sparkling Water,354.882,0,120,Water,0.338141
561,Lipton Natural Energy Tea,236.588,0,75,Tea,0.317007
545,Brew Dr Kombucha Uplift,414.029,80,130,Tea,0.313988
554,Cold Brew Tea,236.588,0,70,Tea,0.295873


It looks like there are more than a few good options for tea and water type drinks as well. I'm definitely going to do some googling to see where I can find some of these. While most of the associated caffeine concentrations are lower than our mean of .594, these drinks tend to have zero calories, and more than a few have about as much caffeine as a 12 oz Red Bull energy drink (111 mg), or more.
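The comparison against the Red Bull reference can be made explicit with a simple filter. A minimal sketch, using the ten rows shown in the table above and the 111 mg figure quoted in the text (the constant and variable names are hypothetical):

```python
import pandas as pd

RED_BULL_12OZ_MG = 111  # reference figure quoted in the text

# Rows copied from the tea and water table above.
tea_water = pd.DataFrame(
    {
        "drink": [
            "Zest Highly Caffeinated Tea", "Fast Lane Black Tea", "HICAF Tea",
            "Perrier Energize", "Inko's White Tea Energy", "YMateina Yerba Mate",
            "Arti Sparkling Water", "Lipton Natural Energy Tea",
            "Brew Dr Kombucha Uplift", "Cold Brew Tea",
        ],
        "Caffeine (mg)": [150, 110, 110, 99, 165, 80, 120, 75, 130, 70],
    }
)

# Keep only the drinks that match or exceed the reference caffeine amount.
at_or_above = tea_water[tea_water["Caffeine (mg)"] >= RED_BULL_12OZ_MG]
print(at_or_above)
```

Of the ten rows shown, four meet or exceed the reference, and two more (at 110 mg) fall just short of it.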

## Project Conclusion

In our analysis of the Caffeine dataset, we were able to find a subset (and a subset of that subset) of healthier caffeinated drink options based on calories, caffeine amount, and caffeine concentration. In general, excluding energy shots, coffee and energy drinks tended to have the highest caffeine amounts. However, there were plenty of tea and water type drinks with caffeine amounts and caffeine concentrations that rival those of standard energy drinks such as Red Bull.

