# Vegetarian trends at food.com
The interest in vegetarian and vegan food has grown steadily over the past two decades.

The number of searches for 'veganism' in the UK has increased 900% from 2009 to 2019.[1]

And although in 2015, just 3.4% of all Americans said they were vegetarian, fully a quarter of 25- to 34-year-olds identified as such.[2]

Below, I will explore whether this vegetarian trend extends to the users of [food.com](https://www.food.com/), one of the leading online recipe websites.

[1] [Veganism: Why are vegan diets on the rise?](https://www.bbc.com/news/business-44488051)

[2] [The year of the vegan, The Economist](https://worldin2019.economist.com/theyearofthevegan)

## Questions
1. Is there a positive trend in the number of vegetarian recipes posted on food.com between 2008 and 2017?
2. Is there a positive trend in the number of interactions with vegetarian recipes on Food.com between 2008 and 2017?
3. Is there a difference in the ratings vegetarian recipes received compared to non-vegetarian recipes between 2008 and 2017?

## Dataset
I will be using the [food.com dataset](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions) by Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley.

The dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen).

For the purpose of this exploration study, I will look at data between 2008 and 2017.

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
import seaborn as sns
from ast import literal_eval

%matplotlib inline

## Palettes

In [None]:
veg_meat = ["#454d66", "#b7e778", "#1fab89"]
sns.set_palette(veg_meat)

## Data import & summary

In [None]:
recipes = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/RAW_recipes.csv')
interactions = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/RAW_interactions.csv')

In [None]:
recipes.head()

In [None]:
print(recipes.info())
recipes.describe()

2,147,484,000 minutes sounds like a very long time to cook a recipe. Let's take a closer look at the `minutes` column later on.

In [None]:
recipes[['minutes', 'n_steps', 'n_ingredients']].hist()

In [None]:
interactions.head()

In [None]:
print(interactions.info())
interactions.describe()

In [None]:
interactions['rating'].hist()

Users tend to rate recipes very highly.

This is unlikely to mean that they are universally satisfied with recipes on food.com. More likely, users who don't like the results of a recipe either don't bother to rate it, or blame their own cooking skills and give the recipe author the benefit of the doubt.

## Data wrangling

Let's start by filtering the dataset down to our chosen 10-year interval.

In [None]:
from_year, to_year = '2008-01-01', '2017-12-31'

recipes['submitted'] = pd.to_datetime(recipes['submitted'])
recipes['submitted'] = recipes['submitted'].apply(lambda x: x.tz_localize(None))
# between() includes both endpoints by default, so the first and last day of the interval are kept
# .copy() avoids SettingWithCopyWarning when columns are added later
recipes_l0y = recipes.loc[recipes['submitted'].between(from_year, to_year)].copy()

interactions['date'] = pd.to_datetime(interactions['date'])
interactions['date'] = interactions['date'].apply(lambda x: x.tz_localize(None))
interactions_l0y = interactions.loc[interactions['date'].between(from_year, to_year)].copy()

print(recipes_l0y.shape)
print(interactions_l0y.shape)

### Remove outliers

In [None]:
sns.boxplot(x=recipes_l0y["minutes"])

There is clearly at least one extreme outlier in the data set. 730 days (1,051,200 minutes) is too long a preparation time for even the tastiest of recipes!

In [None]:
# calculate the first quartile, third quartile and the interquartile range
Q1 = recipes_l0y['minutes'].quantile(0.25)
Q3 = recipes_l0y['minutes'].quantile(0.75)
IQR = Q3 - Q1

# calculate the maximum value and minimum values according to the Tukey rule
max_value = Q3 + 1.5 * IQR
min_value = Q1 - 1.5 * IQR

# filter the data for values that are greater than max_value or less than min_value
minutes_outliers = recipes_l0y[(recipes_l0y['minutes'] > max_value) | (recipes_l0y['minutes'] < min_value)]
minutes_outliers.sort_values('minutes')

As we can see above, the Tukey method filters out many reasonable recipes as outliers. Some recipes, such as pickles, extracts and liqueurs can take many days to prepare, and should not be excluded.

The one extreme outlier at 1051200 minutes is the [How to Preserve a Husband](https://www.food.com/recipe/how-to-preserve-a-husband-447963) recipe. Although it is no doubt very valuable, I will exclude it from the rest of this exploration.

### Exclude How to Preserve a Husband recipe

In [None]:
# drop recipes that take 730 days (1,051,200 minutes) or longer as outliers
recipes_l0y = recipes_l0y.query('minutes < 1051200').copy()

### Rating count and average by recipe and year

In [None]:
recipes_l0y['year'] = recipes_l0y['submitted'].dt.year
interactions_l0y['year'] = interactions_l0y['date'].dt.year

In [None]:
ratings_by_recipe = interactions_l0y.groupby(['recipe_id', 'year']).agg(
    rating_cnt = ('rating', 'count'),
    rating_avg = ('rating', 'mean'),
)
ratings_by_recipe.head()

### Merge recipes and ratings

In [None]:
recipes_and_ratings = recipes_l0y.merge(ratings_by_recipe, left_on='id', right_on='recipe_id')
recipes_and_ratings.head(2)
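Because `ratings_by_recipe` holds one row per recipe and year, a recipe that was rated in several years can appear more than once after a merge like this one. The following is a minimal sketch of how to check for that, using made-up toy data rather than the real dataset. The names `toy_recipes`, `toy_ratings` and `toy_merged` are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins for the recipe and rating tables (all values are made up).
toy_recipes = pd.DataFrame({"id": [1, 2], "submitted_year": [2008, 2009]})
toy_ratings = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "year": [2008, 2009, 2009],
    "rating_cnt": [3, 1, 2],
})

toy_merged = toy_recipes.merge(toy_ratings, left_on="id", right_on="recipe_id")

# Recipe 1 was rated in two different years, so it shows up twice.
rows_per_recipe = toy_merged.groupby("id").size()
print(rows_per_recipe)

# Counting distinct ids counts each recipe once, regardless of repeats.
distinct_recipes = toy_merged["id"].nunique()
print(distinct_recipes)
```

If a check like this shows repeated ids, counting with `nunique` (or dropping duplicate ids first) is one way to keep per-recipe counts from being inflated.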

### Tags to lists

In [None]:
# convert the tags column to list format
recipes_and_ratings['tags'] = recipes_and_ratings['tags'].apply(lambda x: literal_eval(str(x)))

### Add vegan and vegetarian columns, check overlap

In [None]:
# add vegetarian and vegan boolean columns
recipes_and_ratings['vegetarian'] = ['vegetarian' in tag for tag in recipes_and_ratings['tags']]
recipes_and_ratings['vegan'] = ['vegan' in tag for tag in recipes_and_ratings['tags']]
recipes_and_ratings = recipes_and_ratings.drop(columns=['name', 'tags', 'nutrition', 'steps', 'description', 'ingredients'])
recipes_and_ratings.head(2)

In [None]:
# plot a venn diagram of vegetarian and vegan recipe counts
vegetarian_cnt = len(recipes_and_ratings.query('vegetarian == True'))
vegan_cnt = len(recipes_and_ratings.query('vegan == True'))
intersect_cnt = len(recipes_and_ratings.query('vegetarian == True and vegan == True'))

# venn2 expects (vegetarian only, vegan only, both), so the overlap is subtracted from both totals
venn2(subsets = (vegetarian_cnt-intersect_cnt, vegan_cnt-intersect_cnt, intersect_cnt), set_labels = ('Vegetarian', 'Vegan'), set_colors=('#b7e778', '#031c16', '#031c16'), alpha = 1)

As expected, we can see that the `vegetarian` tag is a superset of the `vegan` tag, so we don't need to preprocess the tags any further.

Users would very likely forget to tag at least some vegan recipes as vegetarian, so the complete overlap suggests that the tags were generated automatically by the system.
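The superset relationship can also be checked without a plot. The snippet below is a small sketch on made-up boolean columns: "vegan implies vegetarian" holds when no row is flagged vegan without also being flagged vegetarian. The frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up flags; in the notebook these would be the vegetarian and vegan columns.
toy = pd.DataFrame({
    "vegetarian": [True, True, False, True],
    "vegan": [True, False, False, False],
})

# Rows that are vegan but not vegetarian would break the superset relationship.
vegan_only_cnt = int((toy["vegan"] & ~toy["vegetarian"]).sum())
is_superset = vegan_only_cnt == 0
print(vegan_only_cnt, is_superset)
```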

## Exploration

### New recipes by year

In [None]:
df = recipes_and_ratings.groupby(['year', 'vegetarian']).agg(
    recipe_cnt = ('id', 'count')
).reset_index()

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=df, x='year', y='recipe_cnt', hue='vegetarian', linewidth=2.5)
ax.set(ylim=(0, None))
ax.set_title('Number of new recipes by year')
ax

We can see that there has been a rapid decline in the number of new recipes posted on food.com over the past decade.

Assuming that the data set is not missing information for some of the more recent years, this decline is somewhat perplexing, as food.com is the second-largest recipe website, and internet usage overall has [increased by 10 percentage points](https://www.pewresearch.org/fact-tank/2018/09/28/internet-social-media-use-and-device-ownership-in-u-s-have-plateaued-after-years-of-growth/) in the last decade.

Nevertheless, there could be a number of explanations, which we unfortunately won't have the opportunity to investigate in this study:
* There is a saturation of recipes on the website. Everything there is to cook is already covered.
* Related to the preceding point, Food.com might have stopped investing in attracting new contributors.
* Users have shifted their recipe consumption to mobile apps, made by other providers.
* Most prolific recipe authors launched their own recipe blogs, or moved to social media.

Whatever the reason for the decline, let's see whether it has had an equal impact on vegetarian and non-vegetarian recipes.

In [None]:
df = recipes_and_ratings.groupby(['year']).agg(
    total_cnt = ('id', 'count'),
    vegetarian_cnt = ('vegetarian', 'sum'),
    vegan_cnt = ('vegan', 'sum'),
).reset_index()

df['vegetarian_pct'] = df['vegetarian_cnt'] / df['total_cnt'] * 100
df['vegan_pct'] = df['vegan_cnt'] / df['total_cnt'] * 100

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=pd.melt(df[['year', 'vegetarian_pct', 'vegan_pct']], ['year']), x='year', y='value', palette=veg_meat[1:], hue='variable', linewidth=2.5)
ax.set(ylim=(0, 100))
ax.set_title('Percent of vegetarian recipes by year')
ax

The number of new vegetarian recipes declined at the same rate as the number of new non-vegetarian recipes until 2013, and at an even faster rate between 2014 and 2017.
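One hedged way to put a number on such a trend is to fit a straight line through the yearly shares and read off the slope. The sketch below uses made-up percentages, not values from the dataset, and a plain least-squares fit with `np.polyfit`. It is an illustration only, not a significance test.

```python
import numpy as np

# Made-up yearly shares of vegetarian recipes, in percent.
years = np.array([2008, 2009, 2010, 2011, 2012])
share_pct = np.array([32.0, 31.0, 30.0, 29.0, 28.0])

# Fit share = slope * (year - first year) + intercept.
# Shifting the years keeps the numbers small and the fit well conditioned.
slope, intercept = np.polyfit(years - years[0], share_pct, deg=1)
print(f"change per year: {slope:.2f} percentage points")
```

A negative slope means the share shrinks over time; a slope near zero means the share is flat.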

### Ratings by year

In [None]:
ratings_by_recipe = interactions_l0y.groupby(['recipe_id', 'year']).agg(
    rating_cnt = ('rating', 'count'),
    rating_avg = ('rating', 'mean'),
).reset_index()
ratings_by_recipe = ratings_by_recipe.merge(recipes_and_ratings[['id', 'vegetarian', 'vegan']], left_on='recipe_id', right_on='id')

df = ratings_by_recipe.groupby(['year', 'vegetarian']).agg(
    rating_cnt = ('rating_cnt', 'sum'),
    rating_avg = ('rating_avg', 'mean'),
).reset_index()

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=df, x='year', y='rating_cnt', hue='vegetarian', linewidth=2.5)
ax.set_title('Recipe ratings by year')
ax

We can see that there has been a similar decline in the number of interactions (reviews) on the recipes.

The decline started one year later, which could probably be explained by a lag between new recipe postings and ratings. That is, the spike in new recipe postings in 2008 would only convert into interactions in the following year.

Again, let's see whether the decline has had an equal impact on vegetarian and non-vegetarian recipes.

In [None]:
interactions_by_recipe_and_year = interactions_l0y.reset_index().groupby(['recipe_id', 'year']).agg(
    rating_cnt = ('index', 'count'),
    rating_avg = ('rating', 'mean'),
).reset_index()

interactions_and_recipes = interactions_by_recipe_and_year[['recipe_id', 'year', 'rating_cnt', 'rating_avg']].merge(recipes_and_ratings[['id', 'vegetarian', 'vegan']], left_on='recipe_id', right_on='id')

interactions_and_recipes['vegetarian_rating_cnt'] = np.where(interactions_and_recipes['vegetarian'] == True, interactions_and_recipes['rating_cnt'], 0)
interactions_and_recipes['vegan_rating_cnt'] = np.where(interactions_and_recipes['vegan'] == True, interactions_and_recipes['rating_cnt'], 0)

df = interactions_and_recipes.groupby(['year']).agg(
    total_cnt = ('rating_cnt', 'sum'),
    vegetarian_cnt = ('vegetarian_rating_cnt', 'sum'),
    vegan_cnt = ('vegan_rating_cnt', 'sum'),
).reset_index()

df['vegetarian_pct'] = df['vegetarian_cnt'] / df['total_cnt'] * 100
df['vegan_pct'] = df['vegan_cnt'] / df['total_cnt'] * 100

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=pd.melt(df[['year', 'vegetarian_pct', 'vegan_pct']], ['year']), x='year', y='value', palette=veg_meat[1:], hue='variable', linewidth=2.5)
ax.set(ylim=(0, 100))
ax.set_title('Percent of votes on vegetarian recipes by year')
ax

The share of ratings posted on vegetarian and vegan recipes has remained flat through the 10-year period. This time, we don't even see a decline in the period between 2014 and 2017.

This may suggest that although the number of vegetarian contributors has declined at a faster rate than that of non-vegetarian authors, the reader composition remained roughly the same.

In [None]:
df = ratings_by_recipe.groupby(['year', 'vegetarian']).agg(
    rating_avg = ('rating_avg', 'mean')
).reset_index()

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=df, x='year', y='rating_avg', hue='vegetarian', linewidth=2.5)
ax.set(ylim=(0, 5))
ax.set_title('Average recipe rating by year')
ax

The average rating for vegetarian recipes was roughly the same until 2013, but has since grown to ~0.2 points above that of non-vegetarian recipes.

## Cohort analysis
I will next conduct a cohort retention analysis to confirm that the number of vegetarian contributors has indeed declined at a faster rate than that of non-vegetarian authors.

Some of the code below was taken from: http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/
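To make the idea concrete before applying it to the real data, here is a minimal sketch of a cohort retention table on made-up submissions. It is not the exact implementation used below: here the cohort period is derived from calendar years, and all names (`toy`, `active`, `retention`) are hypothetical.

```python
import pandas as pd

# Made-up submissions: one row per recipe.
toy = pd.DataFrame({
    "contributor_id": [1, 1, 2, 2, 3, 4],
    "submitted_year": [2008, 2009, 2008, 2008, 2008, 2009],
})

# Cohort = year of each contributor's first submission.
toy["cohort"] = toy.groupby("contributor_id")["submitted_year"].transform("min")

# Period 1 is the cohort year itself, period 2 the year after, and so on.
toy["period"] = toy["submitted_year"] - toy["cohort"] + 1

# Number of distinct active contributors per cohort and period.
active = (
    toy.groupby(["cohort", "period"])["contributor_id"]
    .nunique()
    .unstack("period")
)

# Divide each row by the cohort's size in period 1 to get the retained share.
retention = active.div(active[1], axis=0)
print(retention)
```

In this toy example, three contributors start in 2008 and only one of them posts again in 2009, so the 2008 cohort retains one third of its contributors in period 2.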

### Add submission year column

In [None]:
recipes_and_cohorts = recipes_and_ratings.copy()
recipes_and_cohorts['submitted_year'] = recipes_and_cohorts['submitted'].apply(lambda x: x.strftime('%Y'))

### Divide users into cohorts

In [None]:
# add cohort column — the year of the user's first recipe submission
recipes_and_cohorts.set_index('contributor_id', inplace=True)
recipes_and_cohorts['contributor_cohort'] = recipes_and_cohorts.groupby(level=0)['submitted'].min().apply(lambda x: x.strftime('%Y'))
recipes_and_cohorts.reset_index(inplace=True)
recipes_and_cohorts.head()

In [None]:
def add_cohort_periods(df):
    """
    Creates a `cohort_period` column, which is the Nth period based on the contributor's first recipe.
    """
    df['cohort_period'] = np.arange(len(df)) + 1
    return df

def group_into_cohorts(df):
    """
    Aggregates contributor count, recipe count and cohort period by contributor cohort and year of submission.
    """
    df = df.groupby(['contributor_cohort', 'submitted_year']).agg(
        contributor_cnt = ('contributor_id', 'nunique'),
        recipe_cnt = ('id', 'nunique'),
    )
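    # group_keys=False keeps the original index, so the cohort level is not added a second time
    df = df.groupby('contributor_cohort', group_keys=False).apply(add_cohort_periods)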
    return df

# non-vegetarian cohorts
cohorts_nonveg = group_into_cohorts(recipes_and_cohorts[recipes_and_cohorts['vegetarian'] == False])

# vegetarian cohorts
cohorts_veg = group_into_cohorts(recipes_and_cohorts[recipes_and_cohorts['vegetarian'] == True])
cohorts_veg.head()

### User retention by cohort group

In [None]:
def calculate_cohort_sizes(df):
    """
    Calculates cohort sizes.
    """
    df.reset_index(inplace=True)
    df.set_index(['contributor_cohort', 'cohort_period'], inplace=True)
    return df['contributor_cnt'].groupby('contributor_cohort').first()

# calculate cohort sizes
cohort_sizes_nonveg = calculate_cohort_sizes(cohorts_nonveg)
cohort_sizes_veg = calculate_cohort_sizes(cohorts_veg)
cohort_sizes_veg.head()

In [None]:
def convert_cohort_counts_to_pct(df, cohort_sizes):
    """
    Converts cohort period contributor counts to percentages.
    """
    df = df.unstack(0).divide(cohort_sizes, axis=1)
    df.reset_index(inplace=True)
    return df

# convert cohort period contributor counts to percentages
contributor_retention_nonveg = convert_cohort_counts_to_pct(cohorts_nonveg['contributor_cnt'], cohort_sizes_nonveg)
contributor_retention_veg = convert_cohort_counts_to_pct(cohorts_veg['contributor_cnt'], cohort_sizes_veg)
contributor_retention_veg

In [None]:
def plot_retention_curves(df, cohorts, title, position):
    """
    Plots retention curves for the given cohorts on the subplot at `position`.
    """
    plot = sns.lineplot(
        # use the data frame passed in, not the global non-vegetarian one
        data=pd.melt(df[['cohort_period'] + cohorts], ['cohort_period']),
        x='cohort_period',
        y='value',
        palette='rocket_r',
        hue='contributor_cohort',
        linewidth=2.5,
        ax=ax[position])
    plot.set(xlim=(0, 8))
    plot.set(ylim=(0, 1))
    plot.set(xlabel='Cohort period')
    plot.set(ylabel='Active contributors')
    plot.set_title('Contributor retention by cohort: ' + title)
    return

# plot contributor retention curves
fig, ax = plt.subplots(1, 2, figsize=(12,6))

cohorts_to_display = ['2008', '2009', '2010', '2011']

plot_retention_curves(contributor_retention_nonveg, cohorts_to_display, 'Non-vegetarian', 0)
plot_retention_curves(contributor_retention_veg, cohorts_to_display, 'Vegetarian', 1)

fig.show()

We can see that contributor retention at food.com has deteriorated significantly over the years.

The difference between the churn of vegetarian and non-vegetarian contributors appears very mild, however, suggesting that a higher attrition rate is not the reason for the decline in the proportion of new vegetarian recipes on the site.

## Contributor acquisition
The alternative explanation for the drop in the share of new vegetarian recipes is that food.com acquires non-vegetarian contributors at a faster rate than vegetarian authors.

In [None]:
# per contributor: share of vegetarian recipes and cohort (year of first submission)
df = recipes_and_cohorts.groupby('contributor_id').agg(
    vegetarian = ('vegetarian', 'mean'),
    contributor_cohort = ('contributor_cohort', 'min'),
)
df.reset_index(inplace=True)
# count contributors with more than 50% vegetarian contributions as vegetarians
df = df.round(0)

# count new contributors by cohort and vegetarian flag
df = df.groupby(['contributor_cohort', 'vegetarian']).agg(
    contributor_cnt = ('contributor_id', 'count'),
)
df.reset_index(inplace=True)
df['vegetarian'] = df['vegetarian'].astype(bool)

plt.figure(figsize=(12,6))

ax = sns.lineplot(data=df, x='contributor_cohort', y='contributor_cnt', palette=veg_meat[:2], hue='vegetarian', linewidth=2.5)
ax.set(xlabel='Year')
ax.set(ylabel='New contributors')
ax.set_title('New contributors by year')

Let's try the same with a logarithmic y-axis.

In [None]:
plt.figure(figsize=(12,6))

ax = sns.lineplot(data=df, x='contributor_cohort', y='contributor_cnt', palette=veg_meat[:2], hue='vegetarian', linewidth=2.5)
ax.set(yscale="log")
ax.set(xlabel='Year')
ax.set(ylabel='New contributors')
ax.set_title('New contributors by year (log)')

Indeed, we can see that from 2014 to 2017, the number of vegetarian contributors has dropped at a faster rate than that of non-vegetarian contributors.
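On a logarithmic axis, a constant percentage decline shows up as a straight line, and a steeper line means a faster relative decline. The same comparison can be made numerically with year-over-year percentage changes. The sketch below uses made-up contributor counts, not figures from the dataset.

```python
import pandas as pd

# Made-up counts of new contributors per year.
toy = pd.DataFrame({
    "year": [2014, 2015, 2016],
    "vegetarian": [100, 50, 25],
    "non_vegetarian": [1000, 800, 640],
}).set_index("year")

# Fractional change relative to the previous year; the first row has no predecessor.
yoy_change = toy.pct_change()
print(yoy_change)
```

In this toy example the vegetarian group shrinks by half each year while the non-vegetarian group shrinks by a fifth, even though the non-vegetarian group loses more contributors in absolute terms.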

## Conclusion
**Is there a positive trend in the number of vegetarian recipes posted on food.com between 2008 and 2017?**

There is no positive trend. Between 2008 and 2013, the number of new vegetarian recipes fell year-over-year at the same rate as the number of new non-vegetarian recipes. Between 2014 and 2017, it fell at an even faster rate than the number of non-vegetarian recipes.

**Is there a positive trend in the number of interactions with vegetarian recipes on Food.com between 2008 and 2017?**

There is no positive trend. The share of reviews posted on vegetarian and non-vegetarian recipes has remained unchanged over the 10-year period.

**Is there a difference in the ratings vegetarian recipes received compared to non-vegetarian recipes between 2008 and 2017?**

Yes, there is a slight positive trend. Vegetarian recipes were rated roughly the same as non-vegetarian recipes between 2008 and 2013, but started to attract higher ratings from then on. As of 2017, vegetarian recipe average ratings are roughly 0.2 points higher than those of non-vegetarian recipes.

### Business implications
We can conclude that the growing vegetarian trend has had no positive impact on food.com, and might have even had a negative impact on our contributor acquisition.