# Cleaning the dataset from Kaggle

In [1]:
import pandas as pd
import numpy as np
import re
import random
import requests
import json
from nltk.probability import FreqDist

In [259]:
data = pd.read_csv('kagglerecipes/RAW_recipes.csv')

In [260]:
data = data[['name', 'id', 'ingredients']]

In [261]:
data

Unnamed: 0,name,id,ingredients
0,arriba baked winter squash mexican style,137739,"['winter squash', 'mexican seasoning', 'mixed ..."
1,a bit different breakfast pizza,31490,"['prepared pizza crust', 'sausage patty', 'egg..."
2,all in the kitchen chili,112140,"['ground beef', 'yellow onions', 'diced tomato..."
3,alouette potatoes,59389,"['spreadable cheese with garlic and herbs', 'n..."
4,amish tomato ketchup for canning,44061,"['tomato juice', 'apple cider vinegar', 'sugar..."
...,...,...,...
231632,zydeco soup,486161,"['celery', 'onion', 'green sweet pepper', 'gar..."
231633,zydeco spice mix,493372,"['paprika', 'salt', 'garlic powder', 'onion po..."
231634,zydeco ya ya deviled eggs,308080,"['hard-cooked eggs', 'mayonnaise', 'dijon must..."
231635,cookies by design cookies on a stick,298512,"['butter', 'eagle brand condensed milk', 'ligh..."


In [7]:
# I will extract a list of ingredients (just to see how many I end up with)

In [8]:
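ingredients = []
for row in data['ingredients']:
    # Each row is the string form of a Python list, e.g. "['salt', 'butter']".
    row = row.replace("['", "'")  # drop the opening bracket
    row = row.replace(']', '')    # drop the closing bracket
    items = row.split(', ')
    for item in items:
        ingredients.append(item.replace('\'', ''))  # drop the quotes around each name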
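
As an alternative to stripping brackets and quotes by hand, the stored string could be parsed as a real Python list. This is only a minimal sketch, assuming every row is a valid list literal; `parse_ingredients` is a hypothetical helper name and the sample row is made up. Unlike splitting on ', ', it keeps an item that itself contains a comma in one piece.

```python
import ast

def parse_ingredients(row):
    """Parse the string form of a list, e.g. "['salt', 'butter']", into a list."""
    return ast.literal_eval(row)

# Made-up row: one item contains a comma, one contains an apostrophe.
row = "['winter squash', 'salt, to taste', \"confectioners' sugar\"]"
items = parse_ingredients(row)
print(items)
# ['winter squash', 'salt, to taste', "confectioners' sugar"]
```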

In [9]:
len(ingredients)

2103719

In [10]:
# Converting to set to get unique values
ingr_set = set(ingredients)

In [11]:
len(ingr_set)

14968

In [12]:
#print(ingr_set)

I need to pare this down a bit. I'll try to cut off at least those ingredients that only occur once.
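
Dropping the one-off ingredients could look roughly like this. The sketch uses `collections.Counter` from the standard library (NLTK's `FreqDist` is a `Counter` subclass, so the same filtering should carry over); the sample list is made up.

```python
from collections import Counter

sample = ['salt', 'butter', 'salt', 'dragonfruit', 'butter', 'salt']
counts = Counter(sample)

# Keep only the ingredients that occur more than once.
frequent = {name: n for name, n in counts.items() if n > 1}
print(frequent)               # {'salt': 3, 'butter': 2}
print(counts.most_common(1))  # [('salt', 3)]
```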

In [13]:
frequencies = FreqDist(ingredients)

In [14]:
frequencies

FreqDist({'salt': 85746, 'butter': 54975, 'sugar': 44535, 'onion': 39065, 'water': 34914, 'eggs': 33761, 'olive oil': 32822, 'flour': 26266, 'milk': 25786, 'garlic cloves': 25748, ...})

In [15]:
len(frequencies)

14968

In [16]:
frequencies = dict(frequencies)

In [17]:
frequencies = sorted(frequencies.items(), key=lambda item: item[1], reverse = True)

In [1]:
#frequencies

The BIG problem here is that you get different descriptors for the same thing. I want to treat 'chunky peanut butter' as identical to 'peanut butter', and 'cauliflower florets' as identical to 'cauliflower'.

But this is hard to automate. The obvious option would be too aggressive:
- Reduce every longer entry that contains another entry as a substring to that shorter entry. E.g. 'peanut butter' is in the list, as is 'chunky peanut butter', so 'chunky peanut butter' becomes 'peanut butter'. BUT this would also reduce 'olive oil' to 'oil', 'coconut milk' to 'milk', and 'cream cheese' to 'cheese'.
- It might help to remove only *one* word at a time, rather than reducing everything all the way down right away.
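
The "one word at a time" idea can be sketched as follows. `strip_one_word` is a hypothetical helper and the `known` set is a made-up sample, not the real ingredient list. It shows both what the idea buys and where it still falls short.

```python
def strip_one_word(name, known):
    """Drop the first word of `name` if what remains is itself a known entry."""
    head, _, rest = name.partition(' ')
    return rest if rest in known else name

known = {'peanut butter', 'butter', 'oil', 'olive oil', 'cauliflower'}
print(strip_one_word('chunky peanut butter', known))  # peanut butter
print(strip_one_word('olive oil', known))             # oil
print(strip_one_word('cauliflower florets', known))   # cauliflower florets
```

A single step stops at 'peanut butter' instead of going all the way to 'butter', but two-word names such as 'olive oil' still collapse, and trailing descriptors ('florets') are not handled at all.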

# Simplify list of ingredients

## First attempt: using the API from theMealDB

theMealDB (www.themealdb.com) has a nicely organized list of 'basic' ingredients (a total of 574). However, this list is based on 'only' 284 recipes, so there's a good chance that I will miss some (or many) of the ingredients in my bigger dataset. That is the main thing that I need to test here.

### First extract the list of ingredients

In [3]:
url = 'https://www.themealdb.com/api/json/v1/1/list.php?i=list'

In [4]:
page = requests.get(url)

In [5]:
content = page.content

In [6]:
jsoncontent = json.loads(content)

In [7]:
ingredientlist = []
for item in jsoncontent['meals']:
    ingredientlist.append(item['strIngredient'].lower())
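
The extraction step can be condensed into a comprehension. The sketch below runs on a hand-made payload that only mimics the two keys used above ('meals' and 'strIngredient'); the sample values are invented, and `extract_names` is a hypothetical helper. With `requests`, `page.json()` should give the same structure as `json.loads(page.content)`.

```python
import json

def extract_names(payload):
    """Return the lower-cased ingredient names from a payload shaped like the one above."""
    return [entry['strIngredient'].lower() for entry in payload['meals']]

# Invented sample standing in for the real API response.
sample = json.loads('{"meals": [{"strIngredient": "Chicken"}, {"strIngredient": "Bay Leaf"}]}')
print(extract_names(sample))  # ['chicken', 'bay leaf']
```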

In [2]:
#ingredientlist

This list is a little messy (e.g. 'basil' / 'basil leaves', 'bay leaf' / 'bay leaves'), but much more manageable.
Before I start pruning it by hand, though, I'll check how much I would lose if I were to just use this version of the list as a criterion.

In [9]:
%%time

newlist = []
notincluded = []
for ingredient in ingredientlist:
    for item in ingredients:
        if ingredient in item:
            newlist.append(ingredient)
        else:  
            notincluded.append(item)

NameError: name 'ingredients' is not defined

In [88]:
(len(ingredientlist) * len(ingredients)) 

1207534706

In [77]:
len(notincluded)

1204924488

In [75]:
len(newlist)

2610218

In [87]:
len(ingredients)

2103719

In [76]:
len(set(notincluded))

14968

In [67]:
len(set(ingredients))

14968

Uhhhhhh...
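
A likely explanation for these numbers: the loop appends an item to `notincluded` every time a *single* list entry fails to match it, so almost every pair ends up there (1,204,924,488 of the 1,207,534,706 pairs). No ingredient contains all 574 list entries, so each of the 14,968 unique ingredients lands in `notincluded` at least once. A coverage check that asks whether *any* list entry matches could be sketched like this; `coverage` is a hypothetical helper and the sample data is made up.

```python
def coverage(unique_items, reference):
    """Split items into those containing at least one reference entry and the rest."""
    matched, unmatched = set(), set()
    for item in unique_items:
        if any(ref in item for ref in reference):
            matched.add(item)
        else:
            unmatched.add(item)
    return matched, unmatched

sample_items = {'chunky peanut butter', 'olive oil', 'dragonfruit'}
sample_reference = ['peanut butter', 'oil']
matched, unmatched = coverage(sample_items, sample_reference)
print(sorted(matched))    # ['chunky peanut butter', 'olive oil']
print(sorted(unmatched))  # ['dragonfruit']
```

Running it over the set of unique ingredients rather than the full list of 2,103,719 entries would also cut the number of comparisons considerably.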

In [None]:
# I'll just move on and see what this would do to the dataframe...

In [89]:
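def simplify(row):
    """Return every entry of ingredientlist found as a substring of any item in the row."""
    rowresults = []
    row = row.replace("['", "'")  # drop the opening bracket
    row = row.replace(']', '')    # drop the closing bracket
    items = row.split(', ')
    for item in items:
        for ingredient in ingredientlist:
            if ingredient in item:
                rowresults.append(ingredient)
    return rowresults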

In [90]:
data['newcolumn'] = data['ingredients']

In [94]:
data['newcolumn'] = data['newcolumn'].apply(simplify)

In [95]:
data

Unnamed: 0,name,id,ingredients,newcolumn
0,arriba baked winter squash mexican style,137739,"['winter squash', 'mexican seasoning', 'mixed ...","[squash, mixed spice, honey, butter, oil, oliv..."
1,a bit different breakfast pizza,31490,"['prepared pizza crust', 'sausage patty', 'egg...","[sage, eggs, egg, milk, pepper, salt, cheese]"
2,all in the kitchen chili,112140,"['ground beef', 'yellow onions', 'diced tomato...","[beef, ground beef, onions, onion, yellow onio..."
3,alouette potatoes,59389,"['spreadable cheese with garlic and herbs', 'n...","[cheese, garlic, potatoes, shallots, parsley, ..."
4,amish tomato ketchup for canning,44061,"['tomato juice', 'apple cider vinegar', 'sugar...","[tomato, apple cider vinegar, vinegar, cider, ..."
...,...,...,...,...
231632,zydeco soup,486161,"['celery', 'onion', 'green sweet pepper', 'gar...","[celery, onion, pepper, cloves, garlic, garlic..."
231633,zydeco spice mix,493372,"['paprika', 'salt', 'garlic powder', 'onion po...","[paprika, salt, garlic, garlic powder, onion, ..."
231634,zydeco ya ya deviled eggs,308080,"['hard-cooked eggs', 'mayonnaise', 'dijon must...","[eggs, egg, mayonnaise, mustard, dijon mustard..."
231635,cookies by design cookies on a stick,298512,"['butter', 'eagle brand condensed milk', 'ligh...","[butter, condensed milk, milk, brown sugar, su..."


I need to do this a different way: order the ingredientlist by number of words (from many to few), and then loop through and stop once it's found a match. That way it will not reduce 'tomato juice' to 'tomato', and will not have 'apple cider vinegar' return both 'apple' and 'cider' and 'vinegar'. 

In [None]:
for row in data['ingredients']:
    row = row.replace("['", "'")  # drop the opening bracket
    row = row.replace(']', '')    # drop the closing bracket
    items = row.split(', ')

In [3]:
#ingredientlist

In [104]:
countlist = []
for item in ingredientlist:
    itemlist = item.split(' ')
    countlist.append(len(itemlist))

In [105]:
len(countlist)

574

In [112]:
zipper = dict(zip(ingredientlist, countlist))

In [4]:
#zipper

In [123]:
ingredientlistdict = dict(sorted(zipper.items(), key=lambda item: item[1], reverse = True))

In [126]:
ingredientlist = list(ingredientlistdict.keys())
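
The same long-to-short ordering can be had in one step with `sorted` and a key function. This is a sketch on a made-up list, not a drop-in replacement: unlike the `dict(zip(...))` route above, it does not remove duplicate entries.

```python
names = ['vinegar', 'apple cider vinegar', 'cider', 'tomato', 'apple']

# Sort by number of words, longest first; sorted() is stable,
# so entries with the same word count keep their original order.
by_length = sorted(names, key=lambda name: len(name.split(' ')), reverse=True)
print(by_length)
# ['apple cider vinegar', 'vinegar', 'cider', 'tomato', 'apple']

# With this ordering, the first matching entry is the most specific one.
first = next((name for name in by_length if name in 'apple cider vinegar'), None)
print(first)  # apple cider vinegar
```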

In [134]:
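def simplify2(row):
    """Return, for each item in the row, the first entry of ingredientlist it contains."""
    rowresults = []
    row = row.replace("['", "'")  # drop the opening bracket
    row = row.replace(']', '')    # drop the closing bracket
    items = row.split(', ')
    for item in items:
        for ingredient in ingredientlist:
            if ingredient in item:
                rowresults.append(ingredient)
                break  # the list is sorted long-to-short, so stop at the most specific match
    return rowresults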

In [135]:
data['newcolumn'] = data['ingredients'].apply(simplify2)

In [136]:
data['newcolumn']

0         [squash, mixed spice, honey, butter, olive oil...
1                        [sage, eggs, milk, pepper, cheese]
2         [ground beef, yellow onion, diced tomatoes, to...
3         [cheese, potatoes, shallots, parsley, olive oi...
4         [tomato, apple cider vinegar, sugar, salt, pep...
                                ...                        
231632    [celery, onion, pepper, garlic clove, olive oi...
231633    [paprika, salt, garlic powder, onion, basil, d...
231634    [eggs, mayonnaise, dijon mustard, cajun, tabas...
231635    [butter, condensed milk, brown sugar, sour cre...
231636    [granulated sugar, shortening, eggs, flour, cr...
Name: newcolumn, Length: 231637, dtype: object

This is good enough for me. I might still miss a number of ingredients, but I can add those as I go along, if I make a pipeline for cleaning and filtering. 

One task that is left: see if any of the items in the list are substrings of other items (I noticed that 'sauSAGE' returns 'sage')
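
One possible way around the 'sausage' / 'sage' problem is to match on word boundaries instead of raw substrings. The following is a sketch with the standard `re` module; `build_matchers` and `first_match` are hypothetical helpers, and the two-entry list is a made-up sample.

```python
import re

def build_matchers(names):
    """Compile one pattern per name that only matches whole words."""
    return [(name, re.compile(r'\b' + re.escape(name) + r'\b')) for name in names]

def first_match(item, matchers):
    """Return the first name whose pattern occurs in item, or None."""
    for name, pattern in matchers:
        if pattern.search(item):
            return name
    return None

matchers = build_matchers(['sage', 'egg'])
print(first_match('sausage patty', matchers))  # None
print(first_match('fresh sage', matchers))     # sage
print(first_match('eggs', matchers))           # None
```

The last line shows the trade-off: whole-word matching no longer lets 'egg' catch 'eggs', so the recipe items would have to be lemmatized as well before matching.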

I actually think it will be worth my while to manually go through the list, add some things that I notice are not in there (like sausage and tomato juice), and weed out some hidden duplicates (like 'bay leaf' / 'bay leaves')

But first I'll use NLTK to lemmatize everything

## Cleaning 2: NLTK

In [10]:
from nltk.stem import WordNetLemmatizer
  
wnl = WordNetLemmatizer()

In [11]:
ingredientlist_split = []
for ingredient in ingredientlist:
    ingredientlist_split.append(ingredient.split(' '))

In [5]:
#ingredientlist_split

In [13]:
newingredientlist = []
for entry in ingredientlist_split: 
    dummy = []
    for item in entry:         
        dummy.append(wnl.lemmatize(item))
    newingredientlist.append(' '.join(dummy))
    

## Be advised: here comes a long list of line-by-line cleaning.

In [6]:
#newingredientlist

In [15]:
print(len(newingredientlist))
print(len(set(newingredientlist)))

574
567


In [20]:
def listreplacer(to_replace, replacement):
    i = newingredientlist.index(to_replace)
    newingredientlist[i] = replacement

In [21]:
listreplacer('free-range egg, beaten', 'egg')

In [22]:
listreplacer('free-range eggs, beaten', 'egg')

In [23]:
listreplacer('bramley apple','apple')

In [24]:
listreplacer( 'chopped onion','onion')

In [25]:
listreplacer( 'chopped parsley','parsley')

In [26]:
listreplacer( 'chopped tomato','tomato')

In [27]:
newingredientlist.remove( 'cold water')

In [28]:
newingredientlist.remove('coriander leaf')

In [29]:
listreplacer( 'flaked almond','almond')

In [30]:
listreplacer( 'floury potato','potato')

In [31]:
listreplacer( 'fresh basil','basil')

In [32]:
listreplacer('fresh thyme','thyme')

In [33]:
listreplacer( 'garlic clove','garlic')

In [34]:
listreplacer( 'gouda cheese','gouda')

In [35]:
listreplacer('miniature marshmallow', 'marshmallow')

In [36]:
listreplacer('mozzarella ball','mozzarella')

In [37]:
newingredientlist.remove( 'parmesan cheese')

In [38]:
listreplacer( 'plain chocolate','chocolate')

In [39]:
listreplacer('plain flour', 'flour')

In [40]:
newingredientlist.remove( 'smoky paprika')

In [41]:
newingredientlist.remove('tamarind ball')

In [42]:
listreplacer( 'tamarind paste','tamarind')

In [43]:
listreplacer( 'tomato ketchup','ketchup')

In [44]:
listreplacer( 'vermicelli pasta','vermicelli')

In [45]:
listreplacer( 'pappardelle pasta','pappardelle')

In [46]:
listreplacer( 'paccheri pasta','paccheri')

In [47]:
listreplacer( 'linguine pasta','linguin')

In [48]:
listreplacer( 'stilton cheese','stilton')

In [49]:
listreplacer( 'shiitake mushroom','shiitake')

In [50]:
listreplacer( 'braeburn apple','apple')

In [51]:
listreplacer('white flour', 'flour')

In [52]:
listreplacer( 'potatoe bun','potato bun')

In [53]:
listreplacer( 'bulgur wheat','bulgur')

In [54]:
newingredientlist.remove( 'cheese slice')

In [55]:
newingredientlist.remove('warm water')

In [56]:
newingredientlist.remove('dark soft brown sugar')

In [57]:
newingredientlist.remove('light brown soft sugar')

In [58]:
newingredientlist.remove('dark brown soft sugar')

In [59]:
newingredientlist.index('cubed feta cheese')

89

In [61]:
newingredientlist[89] = 'feta cheese'

In [62]:
listreplacer('dark soy sauce','soy sauce')

In [63]:
listreplacer( 'freshly chopped parsley','parsley')

In [64]:
listreplacer('little gem lettuce','gem lettuce')

In [65]:
listreplacer('pitted black olive','black olive')

In [66]:
listreplacer('raw king prawn', 'prawn')

In [67]:
listreplacer('vegetable stock cube', 'vegetable stock')

In [68]:
listreplacer('chicken stock cube', 'chicken stock')

In [69]:
listreplacer('ra el hanout','ras el hanout')

In [70]:
listreplacer('beef stock concentrate', 'beef stock')

In [71]:
listreplacer('chicken stock concentrate', 'chicken stock')

In [72]:
listreplacer('sugar snap pea','sugar snap')

In [73]:
listreplacer('shredded monterey jack cheese','monterey jack cheese')
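
The line-by-line calls above could also be collected into two lookup structures and applied in one pass. This is a sketch only: `apply_cleanup` is a hypothetical helper, and the dictionary and set hold just a few of the entries from above.

```python
replacements = {'bramley apple': 'apple', 'chopped onion': 'onion'}
removals = {'cold water'}

def apply_cleanup(names, replacements, removals):
    """Return a new list with removals dropped and replacements applied."""
    return [replacements.get(name, name) for name in names if name not in removals]

sample = ['bramley apple', 'cold water', 'chopped onion', 'salt']
print(apply_cleanup(sample, replacements, removals))
# ['apple', 'onion', 'salt']
```

One difference to keep in mind: `listreplacer` and `list.remove` only touch the first occurrence of an entry, whereas this version handles every occurrence.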

In [74]:
len(set(newingredientlist))

532

In [75]:
ingr = list(set(newingredientlist))

In [7]:
#ingr

...And convert this back to the long-to-short sorting.

In [77]:
countlist = []
for item in ingr:
    itemlist = item.split(' ')
    countlist.append(len(itemlist))

In [78]:
len(countlist)

532

In [79]:
zipper = dict(zip(ingr, countlist))

In [8]:
#zipper

In [81]:
ingredientlistdict = dict(sorted(zipper.items(), key=lambda item: item[1], reverse = True))

In [82]:
ingredientlist = list(ingredientlistdict.keys())

In [9]:
#ingredientlist

I will save this to a file, as it will be the cornerstone of all the data cleaning (and one of the obvious points to work on if I want to improve the system later on). This one is far from perfect, but it's good enough to allow me to build a first prototype. 

In [84]:
with open('masterlist.txt', 'w') as f:
    for line in ingredientlist:
        f.write(f"{line}\n")

Now use this new list to clean the whole shebang

In [262]:
data['cleaned'] = data['ingredients'].apply(simplify2)

In [10]:
data.head()

NameError: name 'data' is not defined

Finally, I want to get the original URLs for these recipes (if I can), so that I can recommend the recipes to the user.

In [266]:
data[data['id']==19135]

Unnamed: 0,name,id,ingredients,cleaned


In [267]:
data['url'] = f"https://www.food.com/recipe/{data['id']}"

In [269]:
data.drop(columns=['url'], inplace = True)

In [271]:
urllist = []
for item in data['id']:
    urllist.append(f"https://www.food.com/recipe/{item}")

In [274]:
data['url'] = urllist
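
The f-string attempt above formats the whole `id` Series into a single string, which is presumably why that column was dropped again. String concatenation works element-wise, so the loop could also be replaced by a vectorized expression. A sketch on a two-row stand-in frame (`recipes` is a made-up name):

```python
import pandas as pd

recipes = pd.DataFrame({'id': [137739, 31490]})

# Convert the ids to strings, then prepend the base URL to each one.
recipes['url'] = 'https://www.food.com/recipe/' + recipes['id'].astype(str)
print(recipes['url'].tolist())
# ['https://www.food.com/recipe/137739', 'https://www.food.com/recipe/31490']
```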

In [276]:
data = data[['name', 'url', 'ingredients', 'cleaned']]

In [277]:
data

Unnamed: 0,name,url,ingredients,cleaned
0,arriba baked winter squash mexican style,https://www.food.com/recipe/137739,"['winter squash', 'mexican seasoning', 'mixed ...","[squash, mixed spice, honey, butter, olive oil..."
1,a bit different breakfast pizza,https://www.food.com/recipe/31490,"['prepared pizza crust', 'sausage patty', 'egg...","[sage, egg, milk, salt, cheese]"
2,all in the kitchen chili,https://www.food.com/recipe/112140,"['ground beef', 'yellow onions', 'diced tomato...","[ground beef, yellow onion, diced tomato, toma..."
3,alouette potatoes,https://www.food.com/recipe/59389,"['spreadable cheese with garlic and herbs', 'n...","[cheese, potato, shallot, parsley, olive oil, ..."
4,amish tomato ketchup for canning,https://www.food.com/recipe/44061,"['tomato juice', 'apple cider vinegar', 'sugar...","[tomato, apple cider vinegar, sugar, salt, pep..."
...,...,...,...,...
231632,zydeco soup,https://www.food.com/recipe/486161,"['celery', 'onion', 'green sweet pepper', 'gar...","[celery, onion, pepper, clove, olive oil, ham,..."
231633,zydeco spice mix,https://www.food.com/recipe/493372,"['paprika', 'salt', 'garlic powder', 'onion po...","[paprika, salt, garlic powder, onion, basil, d..."
231634,zydeco ya ya deviled eggs,https://www.food.com/recipe/308080,"['hard-cooked eggs', 'mayonnaise', 'dijon must...","[egg, mayonnaise, dijon mustard, salt, tabasco..."
231635,cookies by design cookies on a stick,https://www.food.com/recipe/298512,"['butter', 'eagle brand condensed milk', 'ligh...","[butter, condensed milk, brown sugar, sour cre..."


# Cleaning, NYT

Cleaning the data I got from the New York Times website.

In [1]:
import pandas as pd
import numpy as np
import re
import random
import requests
import json
from nltk.probability import FreqDist

In [3]:
nyt = pd.read_csv('nyt/nyt.csv', sep = '|')

In [4]:
nyt

Unnamed: 0,name,url,ingredients
0,Harissa-Roasted Sweet Potatoes and Red Onion,https://cooking.nytimes.com/recipes/1023541-ha...,"['3 medium sweet potatoes, washed and trimmed,..."
1,Tofu and Mushroom Jorim (Soy-Braised Tofu),https://cooking.nytimes.com/recipes/1023476-to...,"['1/3 cup low-sodium soy sauce', '5 garlic clo..."
2,Roasted Chicken With Crispy Mushrooms,https://cooking.nytimes.com/recipes/1023551-ro...,"['2 to 2 1/4 pounds boneless, skinless chicken..."
3,Chocolate Pumpkin Swirl Muffins,https://cooking.nytimes.com/recipes/1023565-ch...,"['2 cups/256 grams all-purpose flour', '1 tabl..."
4,Pasta e Patate (Pasta and Potato Soup),https://cooking.nytimes.com/recipes/1023564-pa...,"['1/3 cup extra-virgin olive oil', '1 large ye..."
...,...,...,...
22070,Mushrooms in Marsala Wine (Funghi Alla Marsala),https://cooking.nytimes.com/recipes/31-mushroo...,"['1 ounce dried mushrooms, preferably imported..."
22071,Veal Scaloppine With Mushrooms Bordelaise,https://cooking.nytimes.com/recipes/30-veal-sc...,"['12 slices veal scaloppine, about 1 1/4 pound..."
22072,Mushroom and Meat Loaf,https://cooking.nytimes.com/recipes/28-mushroo...,"['1/2 pound mushrooms', '1 tablespoon butter',..."
22073,Mushroom and Pepper Salad,https://cooking.nytimes.com/recipes/29-mushroo...,"['1 large sweet red pepper, about 1/2 pound', ..."


In [8]:
file = open('masterlist.txt', 'r')

In [9]:
ingredientlist = [line.rstrip('\n') for line in file]
file.close()  # release the file handle now that the list is in memory

In [13]:
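def simplify(row):
    """Return, for each item in the row, the first entry of ingredientlist it contains."""
    rowresults = []
    row = row.replace("['", "'")  # drop the opening bracket
    row = row.replace(']', '')    # drop the closing bracket
    items = row.split(', ')
    for item in items:
        for ingredient in ingredientlist:
            if ingredient in item:
                rowresults.append(ingredient)
                break  # the masterlist is sorted long-to-short, so stop at the most specific match
    return rowresults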

In [14]:
nyt['cleaned'] = nyt['ingredients'].apply(simplify)

In [15]:
nyt

Unnamed: 0,name,url,ingredients,cleaned
0,Harissa-Roasted Sweet Potatoes and Red Onion,https://cooking.nytimes.com/recipes/1023541-ha...,"['3 medium sweet potatoes, washed and trimmed,...","[sweet potato, red onion, olive oil, ground cu..."
1,Tofu and Mushroom Jorim (Soy-Braised Tofu),https://cooking.nytimes.com/recipes/1023476-to...,"['1/3 cup low-sodium soy sauce', '5 garlic clo...","[soy sauce, clove, ginger, scallion, scallion,..."
2,Roasted Chicken With Crispy Mushrooms,https://cooking.nytimes.com/recipes/1023551-ro...,"['2 to 2 1/4 pounds boneless, skinless chicken...","[chicken thigh, black pepper, clove, thyme, th..."
3,Chocolate Pumpkin Swirl Muffins,https://cooking.nytimes.com/recipes/1023565-ch...,"['2 cups/256 grams all-purpose flour', '1 tabl...","[flour, cinnamon, baking powder, kosher salt, ..."
4,Pasta e Patate (Pasta and Potato Soup),https://cooking.nytimes.com/recipes/1023564-pa...,"['1/3 cup extra-virgin olive oil', '1 large ye...","[olive oil, yellow onion, carrot, celery, clov..."
...,...,...,...,...
22070,Mushrooms in Marsala Wine (Funghi Alla Marsala),https://cooking.nytimes.com/recipes/31-mushroo...,"['1 ounce dried mushrooms, preferably imported...","[mushroom, mushroom, water, olive oil, minced ..."
22071,Veal Scaloppine With Mushrooms Bordelaise,https://cooking.nytimes.com/recipes/30-veal-sc...,"['12 slices veal scaloppine, about 1 1/4 pound...","[veal, mushroom, olive oil, pea, oil, flour, p..."
22072,Mushroom and Meat Loaf,https://cooking.nytimes.com/recipes/28-mushroo...,"['1/2 pound mushrooms', '1 tablespoon butter',...","[mushroom, butter, onion, pork, veal, nutmeg, ..."
22073,Mushroom and Pepper Salad,https://cooking.nytimes.com/recipes/29-mushroo...,"['1 large sweet red pepper, about 1/2 pound', ...","[red pepper, green pepper, celery, mushroom, s..."


Whew, that was fast. This is enough to work with for now.
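
The output above contains repeats within a row (e.g. 'scallion', 'scallion'), probably because splitting on ', ' also cuts up NYT ingredient lines that contain commas. If the repeats turn out to be unwanted, an order-preserving de-duplication is a one-liner; `dedupe` is a hypothetical helper name.

```python
def dedupe(items):
    """Remove repeats while keeping the first occurrence of each item in place."""
    return list(dict.fromkeys(items))

print(dedupe(['soy sauce', 'clove', 'ginger', 'scallion', 'scallion']))
# ['soy sauce', 'clove', 'ginger', 'scallion']
```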

In [17]:
nyt.to_csv('nyt_cleaned.csv', sep ='|', index = False)

In [278]:
nyt = pd.read_csv('nyt_cleaned.csv', sep = '|')

In [279]:
nyt

Unnamed: 0,name,url,ingredients,cleaned
0,Harissa-Roasted Sweet Potatoes and Red Onion,https://cooking.nytimes.com/recipes/1023541-ha...,"['3 medium sweet potatoes, washed and trimmed,...","['sweet potato', 'red onion', 'olive oil', 'gr..."
1,Tofu and Mushroom Jorim (Soy-Braised Tofu),https://cooking.nytimes.com/recipes/1023476-to...,"['1/3 cup low-sodium soy sauce', '5 garlic clo...","['soy sauce', 'clove', 'ginger', 'scallion', '..."
2,Roasted Chicken With Crispy Mushrooms,https://cooking.nytimes.com/recipes/1023551-ro...,"['2 to 2 1/4 pounds boneless, skinless chicken...","['chicken thigh', 'black pepper', 'clove', 'th..."
3,Chocolate Pumpkin Swirl Muffins,https://cooking.nytimes.com/recipes/1023565-ch...,"['2 cups/256 grams all-purpose flour', '1 tabl...","['flour', 'cinnamon', 'baking powder', 'kosher..."
4,Pasta e Patate (Pasta and Potato Soup),https://cooking.nytimes.com/recipes/1023564-pa...,"['1/3 cup extra-virgin olive oil', '1 large ye...","['olive oil', 'yellow onion', 'carrot', 'celer..."
...,...,...,...,...
22070,Mushrooms in Marsala Wine (Funghi Alla Marsala),https://cooking.nytimes.com/recipes/31-mushroo...,"['1 ounce dried mushrooms, preferably imported...","['mushroom', 'mushroom', 'water', 'olive oil',..."
22071,Veal Scaloppine With Mushrooms Bordelaise,https://cooking.nytimes.com/recipes/30-veal-sc...,"['12 slices veal scaloppine, about 1 1/4 pound...","['veal', 'mushroom', 'olive oil', 'pea', 'oil'..."
22072,Mushroom and Meat Loaf,https://cooking.nytimes.com/recipes/28-mushroo...,"['1/2 pound mushrooms', '1 tablespoon butter',...","['mushroom', 'butter', 'onion', 'pork', 'veal'..."
22073,Mushroom and Pepper Salad,https://cooking.nytimes.com/recipes/29-mushroo...,"['1 large sweet red pepper, about 1/2 pound', ...","['red pepper', 'green pepper', 'celery', 'mush..."


In [280]:
finaldf = pd.concat([data, nyt], axis = 0)
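
One thing to watch here: the two frames may not hold the same type in `cleaned`. A column computed in memory holds Python lists, while a column read back from CSV holds their string form (the quoted values in the reloaded NYT table above point that way). A small normalizer could bring both to lists; this is a sketch assuming every string cell is a valid list literal, and `as_list` is a hypothetical helper.

```python
import ast

def as_list(value):
    """Return value unchanged if it is already a list, else parse its string form."""
    if isinstance(value, list):
        return value
    return ast.literal_eval(value)

mixed = [['lemon', 'honey'], "['sweet potato', 'red onion']"]
print([as_list(v) for v in mixed])
# [['lemon', 'honey'], ['sweet potato', 'red onion']]
```

On the dataframe this would be applied with something like `finaldf['cleaned'].apply(as_list)`.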

In [282]:
finaldf.isna().sum()

name           1
url            0
ingredients    0
cleaned        0
dtype: int64

In [283]:
finaldf[finaldf['name'].isna()]

Unnamed: 0,name,url,ingredients,cleaned
721,,https://www.food.com/recipe/368257,"['lemon', 'honey', 'horseradish mustard', 'gar...","[lemon, honey, horseradish, clove, parsley, ba..."


In [287]:
finaldf.reset_index(drop=True, inplace = True)

In [289]:
finaldf.loc[721, 'name'] = 'honey-mustard dressing'  # .loc avoids chained assignment

The user who uploaded that particular recipe did not include a title...

In [291]:
finaldf['name'][720:725]

720              hawaiian sunrise           mimosa
721                         honey-mustard dressing
722                            4 cheese baked ziti
723    baked potato   baked  microwaved or grilled
724                             light   berry loaf
Name: name, dtype: object

In [292]:
finaldf.to_csv('dataframe.csv', index = False, sep = '|')

That is it for the first round of cleaning. Here is the work that remains to be done:

- Fine-tune the masterlist of ingredients. Since it is the backbone of the cleaning process, the better it is, the better my system will be.
    - Remove items that never occur in any recipe (this will be easier down the line, once I have a matrix with ones and zeroes)
    - Add new items
    - Prune existing items even more
- Also, since the final user interface will use the masterlist of ingredients, the better it is, the wider the range of inputs it can accept.
- Put all the above into an easily reusable pipeline. I may decide to scrape even more recipes, and then it would be great to be able to run a single function to clean them. But this is a case of YAGNI: it is very probable that any new dataset would follow a different format, making the whole process entirely different. So I will cross that bridge when I get there.