## Exercise 01

Build a valid regular expression for the date and time format _(`YYYY-MM-DDTHH:MM:SS`)_

*NOTE: Treat this purely as regex practice. Validating datetimes with a regex is complex, confusing and unnecessary; Python libraries such as pandas can do it for us.*
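To illustrate the note above, here is a minimal sketch of validating the same format by parsing instead of pattern matching. It uses the standard library's `datetime.strptime` to stay dependency-free; the helper name `is_valid_datetime` is an assumption made for this example. In pandas, `pd.to_datetime(series, format="%Y-%m-%dT%H:%M:%S", errors="coerce")` plays a similar role and returns `NaT` for values that cannot be parsed.

```python
from datetime import datetime

FORMAT = "%Y-%m-%dT%H:%M:%S"  # strptime spelling of YYYY-MM-DDTHH:MM:SS


def is_valid_datetime(text):
    """Return True if `text` parses as a real calendar date and time."""
    try:
        datetime.strptime(text, FORMAT)
    except ValueError:
        return False
    return True


for candidate in ["2025-12-09T12:12:00", "2025-02-30T00:12:00", "2025-02-29T00:12:00"]:
    print(candidate, is_valid_datetime(candidate))
```

Unlike a regex, the parser knows month lengths and leap years, so February 29 is only accepted in leap years.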

In [1]:
import re
import pandas as pd

In [2]:
# In this exercise we don't validate every aspect of a date (leap years are ignored).

# Simple regex: ignores leap years and month lengths (any month may have 31 days)
dateregex = r"\d{4}-(0[1-9]|(1[0-2]))-((0[1-9])|([1-2]\d)|(3[01]))"

# More complete regex: checks month lengths but still ignores leap years
# (February always accepts 29 days). This assignment overrides the simple one above.
dateregex = r"\d{4}-(((01|03|05|07|08|10|12)-((0[1-9])|([1-2]\d)|(3[01])))|((04|06|09|11)-((0[1-9])|([1-2]\d)|(30)))|(02-((0[1-9])|([1-2]\d))))"
timeregex = r"(([01]\d)|(2[0-3])):([0-5]\d):([0-5]\d)"

# build the regex expression by concatenating the date and time regex
expr = dateregex+"T"+timeregex

In [3]:
regex = re.compile(expr)

In [4]:
# test the regex with a list of dates
test = pd.Series([
    "2025-12-09T12:12:00", # OK
    "2025-00-09T12:12:00", # KO
    "2025-13-09T00:12:00", # KO
    "2025-12-32T00:12:00", # KO
    "2025-12-09T24:12:00", # KO
    "2025-12-09T00:61:00", # KO
    "2025-12-09T00:12:63", # KO
    "1025-02-30T00:12:00"  # KO: February never has 30 days (only the month-aware regex catches this)
], name="date")

In [5]:
match = test.str.match(expr)
match.name = "match"

Let's build a DataFrame to show the results

In [6]:
results = pd.concat([test, match], axis=1)
results

|   | date | match |
|---|------|-------|
| 0 | 2025-12-09T12:12:00 | True |
| 1 | 2025-00-09T12:12:00 | False |
| 2 | 2025-13-09T00:12:00 | False |
| 3 | 2025-12-32T00:12:00 | False |
| 4 | 2025-12-09T24:12:00 | False |
| 5 | 2025-12-09T00:61:00 | False |
| 6 | 2025-12-09T00:12:63 | False |
| 7 | 1025-02-30T00:12:00 | False |
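One caveat about the check above: `str.match` (like `re.match`) only anchors the pattern at the start of the string, so trailing characters after a valid datetime still count as a match. The sketch below shows the difference with the standard `re` module, using a simplified date pattern written for this example; in pandas, `Series.str.fullmatch` is the counterpart of `re.fullmatch`.

```python
import re

# Simplified patterns for illustration (month lengths are not checked here)
dateregex = r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
timeregex = r"([01]\d|2[0-3]):[0-5]\d:[0-5]\d"
pattern = re.compile(dateregex + "T" + timeregex)

trailing = "2025-12-09T12:12:00 and some junk"

# match() only requires the pattern to match at the beginning of the string
print(bool(pattern.match(trailing)))      # True

# fullmatch() requires the whole string to match the pattern
print(bool(pattern.fullmatch(trailing)))  # False
```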


## Exercise 02

Load the JSON file `recipeitems-latest.json`, which contains several recipes. Store the data in a pandas DataFrame and work with the `ingredients` column.

Answer the following questions:
 1. How many recipes are in the table?
 2. Get all "spicy" recipes. (Just look for the word "spicy")
 3. Get all recipes containing at least a cup of flour (any kind of flour)
 4. Get all recipes containing more than 300ml of whole milk

In [7]:
import re
import pandas as pd

In [8]:
recipes = pd.read_json('data/recipeitems-latest.zip')
recipes.head(1)

|   | _id | name | ingredients | url | image | ts | cookTime | source | recipeYield | datePublished | prepTime | description | totalTime | creator | recipeCategory | dateModified | recipeInstructions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '5160756b96cc62079cc2db15'} | Drop Biscuits and Sausage Gravy | Biscuits\n3 cups All-purpose Flour\n2 Tablespo... | http://thepioneerwoman.com/cooking/2013/03/dro... | http://static.thepioneerwoman.com/cooking/file... | {'$date': 1365276011104} | PT30M | thepioneerwoman | 12 | 2013-03-11 | PT10M | Late Saturday afternoon, after Marlboro Man ha... |  |  |  |  |  |


In [9]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173278 entries, 0 to 173277
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   _id                 173278 non-null  object
 1   name                173278 non-null  object
 2   ingredients         173278 non-null  object
 3   url                 173278 non-null  object
 4   image               158278 non-null  object
 5   ts                  173278 non-null  object
 6   cookTime            117936 non-null  object
 7   source              173278 non-null  object
 8   recipeYield         165628 non-null  object
 9   datePublished       78110 non-null   object
 10  prepTime            130186 non-null  object
 11  description         158068 non-null  object
 12  totalTime           1570 non-null    object
 13  creator             395 non-null     object
 14  recipeCategory      388 non-null     object
 15  dateModified        161 non-null     object
 16  re

#### How many recipes are in the table?

In [10]:
len(recipes)

173278

#### Get all "spicy" recipes. (Just look for the word "spicy")

In [11]:
regex = r"[Ss]picy"

In [12]:
recipes.ingredients.str.contains(regex)

0         False
1          True
2         False
3         False
4         False
          ...  
173273    False
173274    False
173275    False
173276    False
173277    False
Name: ingredients, Length: 173278, dtype: bool

In [13]:
recipes[recipes.ingredients.str.contains(regex)].shape

(602, 17)
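The character class `[Ss]` only covers two spellings of the first letter, so a fully upper-case "SPICY" is missed. As a hedged alternative, the sketch below compares that pattern with a case-insensitive, whole-word one on a few made-up ingredient strings (the `samples` list is illustrative, not taken from the dataset). In pandas, the case-insensitive search can be written as `str.contains("spicy", case=False)`.

```python
import re

# Made-up ingredient lines, only for illustration
samples = [
    "1 tsp Spicy brown mustard",
    "SPICY ITALIAN SAUSAGE",
    "2 cups flour",
]

strict = re.compile(r"[Ss]picy")                     # pattern used in the cells above
relaxed = re.compile(r"\bspicy\b", re.IGNORECASE)    # any casing, whole word only

for text in samples:
    print(text, bool(strict.search(text)), bool(relaxed.search(text)))
```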

#### Get all recipes containing at least a cup of flour (any kind of flour)

In [14]:
recipes[recipes.ingredients.str.contains(r"\d cups?.*[Ff]lour")].shape

(16624, 17)

In [15]:
recipes[recipes.ingredients.str.contains(r"\d cups?.*[Ff]lour")].ingredients

0         Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
3         Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
12        Tart Dough:\n1 cup all-purpose flour\n1/8 tsp ...
13        2 cups / 475 ml whole milk\n2 tablespoons unsa...
20        6 tbsp unsalted butter, at room temperature\n1...
                                ...                        
173206    1 cup FOR THE MUFFINS:\n1 cup White Whole Whea...
173219    2 cups FOR THE CAKE:\n2 cups All-purpose Flour...
173238    1 Tablespoon Coconut Oil\n1 cup Almond Flour\n...
173241    1 cup Plus 1 Tablespoon All-purpose Flour\n½ t...
173276    Two 16 ounce cans Old El Paso Refried Beans\n4...
Name: ingredients, Length: 16624, dtype: object
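Note that `\d cups?` also matches the tail of a fraction such as `1/2 cup`, which is less than a cup. The following is a rough sketch of a stricter pattern, under the assumption that the quantity is a whole number (optionally followed by a fraction) at the start of an ingredient line; quantities written mid-line or with characters like `½` are not covered. The names `loose`, `strict` and `samples` are introduced only for this example.

```python
import re

# Pattern from the cells above: one digit, "cup(s)", then "flour" later on the same line
loose = re.compile(r"\d cups?.*[Ff]lour")

# Assumed stricter variant: a whole number >= 1 at the start of a line,
# an optional fraction ("1 1/2"), then "cup(s)" and "flour" on the same line
strict = re.compile(
    r"^[1-9]\d*(?:\s+\d+/\d+)?\s+cups?\b.*flour",
    re.IGNORECASE | re.MULTILINE,
)

# Made-up ingredient texts, only for illustration
samples = [
    "Biscuits\n3 cups All-purpose Flour",
    "1/2 cup flour\n1 egg",
    "1 1/2 cups bread flour",
    "2 cups milk\n1 tsp flour",
]

for text in samples:
    print(repr(text), bool(loose.search(text)), bool(strict.search(text)))
```

The second sample is the interesting one: the loose pattern accepts it, while the stricter one rejects it.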

#### Get all recipes containing more than 300ml of whole milk

In [21]:
ml = recipes.ingredients.str.extract(r"(\d+)ml.*[Ww]hole [Mm]ilk", expand=False).dropna().astype(int)
ml = ml[ml>300]

In [22]:
print(recipes.loc[ml.index].ingredients.iloc[9])

300g/10½oz plain flour
200g/7oz salted butter
150ml/5fl oz ice cold water 
squeeze lemon juice
1 free-range egg
110g/4oz butter
157g/5½oz plain flour
4 large free-range eggs
1 tbsp flavourless oil
2 tsp sugar
4 large free-range eggs
100g/3½oz caster sugar
1 vanilla pod
50g/1¾oz cornflour
500ml/17fl oz whole milk
25g/1oz unsalted butter
12 passion fruit
150g/5¼oz golden caster sugar
2 large free-range eggs
2 large free-range egg yolks
100g/3½oz unsalted butter
500ml/17fl oz double cream
1 vanilla pod
6 tbsp icing sugar
200g/7oz caster sugar
100g/3½oz milk chocolate
50g/1¾oz double cream
400g/14oz strawberries
150g/5½oz caster sugar
2 sheets of gold leaf
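Two limitations of the pattern used above are worth noting: `(\d+)ml` does not match quantities written with a space (such as `475 ml`), and `str.extract` keeps only the first match in each recipe. Below is a minimal sketch that handles both with the standard `re` module, assuming that "more than 300ml" refers to the largest single quantity in a recipe. The helper name `max_whole_milk_ml` is hypothetical.

```python
import re

# Digits, optional whitespace, "ml", then "whole milk" later on the same line
MILK_ML = re.compile(r"(\d+)\s*ml\b.*whole milk", re.IGNORECASE)


def max_whole_milk_ml(ingredients):
    """Return the largest whole-milk quantity (in ml) found, or 0 if there is none."""
    amounts = [int(amount) for amount in MILK_ML.findall(ingredients)]
    return max(amounts, default=0)


print(max_whole_milk_ml("2 cups / 475 ml whole milk\n2 tablespoons butter"))   # 475
print(max_whole_milk_ml("100ml whole milk\n500ml/17fl oz whole milk"))         # 500
print(max_whole_milk_ml("150ml/5fl oz ice cold water"))                        # 0
```

With pandas, a similar result could be obtained with `str.extractall` followed by a per-recipe maximum.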


## Exercise 03: Recipe recommender
Note: Exercise extracted from https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html#A-simple-recipe-recommender

Let's define the following list of spices:
```python
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
``` 

**Build a function that, given a list of spices (e.g. `["pepper", "paprika"]`), returns the recipes containing all of them.**

In [1]:
spices = ["pepper", "paprika"]

In [30]:
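# Recommender: build one case-insensitive, whole-word regex mask per spice
def recipe_recommender(spices):
    """Return the recipes whose ingredients mention every spice in `spices`."""
    masks = []
    for spice in spices:
        # \b word boundaries keep "pepper" from matching e.g. "peppermint"
        mask = recipes.ingredients.str.contains(fr"\b{spice}\b", case=False)
        masks.append(mask)

    # Keep only the rows where all the per-spice masks are True
    masks = pd.concat(masks, axis=1)
    rec_recipes = recipes[masks.all(axis=1)]

    return rec_recipes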

In [31]:
%%time

rec_f = recipe_recommender(["pepper", "paprika"])

CPU times: user 1.19 s, sys: 57 µs, total: 1.19 s
Wall time: 1.18 s
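The recommender interpolates each spice name straight into the regex, which works for plain words but would misbehave if a name contained regex metacharacters. Here is a small standalone sketch of the same whole-word, all-spices logic using `re.escape`; the function name `contains_all_spices` and the sample strings are assumptions made for this example, not part of the original notebook.

```python
import re


def contains_all_spices(ingredients, spices):
    """Return True if every spice appears as a whole word in `ingredients`."""
    return all(
        # re.escape makes the spice name match literally
        re.search(rf"\b{re.escape(spice)}\b", ingredients, flags=re.IGNORECASE)
        for spice in spices
    )


wanted = ["pepper", "paprika"]
print(contains_all_spices("1 tsp black pepper\n2 tsp smoked Paprika", wanted))  # True
print(contains_all_spices("peppermint extract\n1 tsp paprika", wanted))         # False
```

The word boundaries also mean that plurals such as "peppers" are not matched, which may or may not be what a recommender wants.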


In [43]:
rec_f.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3104 entries, 44 to 173275
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   _id                 3104 non-null   object
 1   name                3104 non-null   object
 2   ingredients         3104 non-null   object
 3   url                 3104 non-null   object
 4   image               2656 non-null   object
 5   ts                  3104 non-null   object
 6   cookTime            1938 non-null   object
 7   source              3104 non-null   object
 8   recipeYield         2865 non-null   object
 9   datePublished       789 non-null    object
 10  prepTime            2116 non-null   object
 11  description         2641 non-null   object
 12  totalTime           38 non-null     object
 13  creator             20 non-null     object
 14  recipeCategory      20 non-null     object
 15  dateModified        11 non-null     object
 16  recipeInstructions  0