In [56]:
import pandas as pd

We first read the CSV file RAW_recipes.csv into a DataFrame.

In [57]:
RAW_recipes=pd.read_csv('RAW_recipes.csv')

We want to see what an entry of the "tags" column of this file looks like.

In [58]:
RAW_recipes['tags'][0]

"['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'side-dishes', 'vegetables', 'mexican', 'easy', 'fall', 'holiday-event', 'vegetarian', 'winter', 'dietary', 'christmas', 'seasonal', 'squash']"

The entry is a single string that only looks like a list. We will write a function that turns such a string into an actual list, and use it to convert string entries of the columns of RAW_recipes into lists.

In [59]:
import ast

def convert_to_list(tags_str):
    try:
        return ast.literal_eval(tags_str)
    except (ValueError, SyntaxError):
        return []

In [60]:
RAW_recipes['tags']=RAW_recipes['tags'].apply(convert_to_list)
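As a quick check of how convert_to_list behaves, the sketch below applies it to a shortened tag string and to a malformed one. Both sample strings are made up for illustration and are not taken from the data set.

```python
import ast

def convert_to_list(tags_str):
    # Parse a string such as "['a', 'b']" into a Python list;
    # fall back to an empty list when the string is not a valid literal.
    try:
        return ast.literal_eval(tags_str)
    except (ValueError, SyntaxError):
        return []

sample = "['60-minutes-or-less', 'mexican', 'easy']"  # hypothetical shortened entry
parsed = convert_to_list(sample)
broken = convert_to_list("['mexican', 'easy'")         # missing closing bracket

print(type(parsed), parsed)
print(broken)
```

Returning an empty list for malformed input keeps the column uniformly list-valued, at the cost of silently hiding entries that failed to parse.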

Next, we want to read the "tags" column of RAW_recipes and look for cuisine-related keywords. The following list contains keywords for the 100 best cuisines in the world as given by Taste Atlas.

In [61]:
world_cuisines = [
    'italian',
    'japanese',
    'greek',
    'portuguese',
    'chinese',
    'indonesian',
    'mexican',
    'french',
    'spanish',
    'peruvian',
    'indian',
    'brazilian',
    'polish',
    'argentinian',
    'turkish',
    'states',
    'thai',
    'korean',
    'croatian',
    'serbian',
    'hungarian',
    'vietnamese',
    'iranian',
    'chilean',
    'lebanese',
    'georgian',
    'bosnian',
    'colombian',
    'romanian',
    'bulgarian',
    'malaysian',
    'german',
    'filipino',
    'belgian',
    'czech',
    'austrian',
    'swiss',
    'lithuanian',
    'english',
    'algerian',
    'russian',
    'slovak',
    'canadian',
    'swedish',
    'dutch',
    'moroccan',
    'scottish',
    'ecuadorian',
    'danish',
    'australian',
    'egyptian',
    'south-african',
    'ukrainian',
    'syrian',
    'irish',
    'singaporean',
    'pakistani',
    'puerto',
    'norwegian',
    'bolivian',
    'macedonian',
    'israeli',
    'palestinian',
    'slovenian',
    'finnish',
    'tunisian',
    'haitian',
    'jamaican',
    'armenian',
    'venezuelan',
    'belarusian',
    'moldovan',
    'lankan',
    'jordanian',
    'cuban',
    'uzbekistani',
    'azerbaijani',
    'taiwanese',
    'uruguayan',
    'montenegrin',
    'ethiopian',
    'iraqi',
    'qatari',
    'trinidadian',
    'libyan',
    'lao',
    'barbadian',
    'cypriot',
    'bengali',
    'kazakhstani',
    'albanian',
    'kyrgyzstani',
    'burmese',
    'zealand',
    'saudi',
    'irish',
    'bahamian',
    'dominican',
    'welsh',
    'ghanaian'
]

We will write a function that searches the entries of the "tags" column of RAW_recipes for the keywords above. Whenever a tag contains one of the keywords as a substring, we return that tag.

In [62]:
def extract_cuisine_tags(tags):
    cuisine_related_tags = []
    for tag in tags:
        for cuisine in world_cuisines:
            if cuisine in tag:
                cuisine_related_tags.append(tag)
    return cuisine_related_tags

# Create a new column in the RAW_recipes with the extracted cuisine-related tags
RAW_recipes['Cuisine_Tags'] = RAW_recipes['tags'].apply(extract_cuisine_tags)
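Two properties of this matching are worth seeing on a small example: the test is a substring test, so 'states' matches 'southern-united-states' and 'chinese' matches 'chinese-new-year', and a tag is appended once per matching keyword, so a keyword that appears twice in the list yields the tag twice. The sketch below uses a shortened, made-up keyword list; extract_unique_matches is a hypothetical variant and not part of this notebook.

```python
def extract_all_matches(tags, cuisines):
    # Same logic as extract_cuisine_tags: one append per (tag, keyword) match.
    matched = []
    for tag in tags:
        for cuisine in cuisines:
            if cuisine in tag:
                matched.append(tag)
    return matched

def extract_unique_matches(tags, cuisines):
    # Hypothetical variant: keep a tag once if any keyword occurs in it.
    return [tag for tag in tags if any(cuisine in tag for cuisine in cuisines)]

keywords = ['chinese', 'states', 'irish', 'irish']  # shortened list with a repeated keyword
tags = ['easy', 'irish', 'southern-united-states', 'chinese-new-year']

all_matches = extract_all_matches(tags, keywords)
unique_matches = extract_unique_matches(tags, keywords)

print(all_matches)
print(unique_matches)
```

Switching to the variant would probably change the counts reported in this notebook, so it is shown only for comparison.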

Now we will count how many recipes there are for each value of the Cuisine_Tags column we created.

In [69]:
# Count the occurrences of each category in the Cuisine_Tags column
cuisine_category_counts = RAW_recipes['Cuisine_Tags'].value_counts()

# Print the counts of all categories for reference
print(cuisine_category_counts)

Cuisine_Tags
[]                                                                            176104
[italian]                                                                       6911
[southern-united-states]                                                        6277
[mexican]                                                                       5642
[canadian]                                                                      3923
                                                                               ...  
[canadian, french, southern-united-states, russian]                                1
[brazilian, mexican, peruvian, venezuelan]                                         1
[southern-united-states, irish, irish, polish, northeastern-united-states]         1
[southwestern-united-states, moroccan]                                             1
[greek, moroccan]                                                                  1
Name: count, Length: 667, dtype: int64
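A complementary view is how many recipes have zero, one, or several cuisine tags, since that determines how many rows survive a filter on single-tag recipes. The sketch below computes this on a small made-up DataFrame; the name toy and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for RAW_recipes with a list-valued Cuisine_Tags column.
toy = pd.DataFrame({
    'Cuisine_Tags': [['italian'], [], ['italian'], ['greek', 'moroccan']],
})

# Number of matched tags per recipe, then how often each number occurs.
tags_per_recipe = toy['Cuisine_Tags'].apply(len)
n_tags_counts = tags_per_recipe.value_counts().sort_index()

print(n_tags_counts)
```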


For training purposes, we will drop the recipes without a cuisine tag and the recipes with more than one cuisine tag.

In [64]:
# Get the indices of rows that do not have exactly one cuisine-related tag
indices_no_cuisine_tags = RAW_recipes[RAW_recipes['Cuisine_Tags'].apply(lambda x: len(x) != 1)].index

# Drop them
RAW_recipes_cuisine_cleaned = RAW_recipes.drop(indices_no_cuisine_tags).reset_index(drop=True)
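The same selection can also be written as a boolean mask that keeps the rows with exactly one tag, instead of collecting the indices to drop. This is a minimal sketch on a made-up DataFrame; toy, mask and cleaned are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Hypothetical miniature of RAW_recipes.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'Cuisine_Tags': [['italian'], [], ['greek', 'moroccan'], ['mexican']],
})

# .str.len() also works on list-valued entries and returns each list's length.
mask = toy['Cuisine_Tags'].str.len() == 1

# Keep only single-tag rows and renumber the index from 0.
cleaned = toy[mask].reset_index(drop=True)

print(cleaned)
```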

In [65]:
# Count the occurrences of each category in the Cuisine_Tags
cusine_category_counts = RAW_recipes_cuisine_cleaned['Cuisine_Tags'].value_counts()

# Print the counts of all categories for reference
print(cusine_category_counts)

Cuisine_Tags
[italian]                       6911
[southern-united-states]        6277
[mexican]                       5642
[canadian]                      3923
[indian]                        2564
[australian]                    2355
[southwestern-united-states]    2194
[greek]                         2165
[chinese]                       1841
[french]                        1813
[northeastern-united-states]    1724
[german]                        1186
[english]                       1173
[thai]                          1116
[spanish]                        871
[moroccan]                       839
[japanese]                       768
[new-zealand]                    405
[scottish]                       381
[swedish]                        340
[russian]                        322
[vietnamese]                     314
[south-african]                  313
[swiss]                          302
[polish]                         299
[portuguese]                     292
[hungarian]              

There are 6 recipes with the tag "chinese-new-year". We will change this tag to just "chinese".

In [66]:
# Define a function to replace 'chinese-new-year' with 'chinese'
def replace_chinese_new_year(tags):
    return ['chinese' if tag == 'chinese-new-year' else tag for tag in tags]

# Apply the function to the 'Cuisine_Tags' column
RAW_recipes_cuisine_cleaned['Cuisine_Tags'] = RAW_recipes_cuisine_cleaned['Cuisine_Tags'].apply(replace_chinese_new_year)
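If more tags of this kind turned up, the replacement could be driven by a lookup table rather than a hard-coded comparison. The sketch below is one possible generalization; TAG_ALIASES and normalize_tags are hypothetical names, and the table holds only the single alias used in this notebook.

```python
# Hypothetical alias table: maps a matched tag to the cuisine label it should become.
TAG_ALIASES = {'chinese-new-year': 'chinese'}

def normalize_tags(tags, aliases=TAG_ALIASES):
    # Replace each tag by its alias if one is defined; leave other tags unchanged.
    return [aliases.get(tag, tag) for tag in tags]

renamed = normalize_tags(['chinese-new-year'])
untouched = normalize_tags(['italian'])

print(renamed)
print(untouched)
```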

Now we will check if everything went as we wanted.

In [67]:
# Count the occurrences of each category in the Cuisine_Tags column
cusine_category_counts = RAW_recipes_cuisine_cleaned['Cuisine_Tags'].value_counts()

# Print the counts of all categories for reference
print(cusine_category_counts)

Cuisine_Tags
[italian]                       6911
[southern-united-states]        6277
[mexican]                       5642
[canadian]                      3923
[indian]                        2564
[australian]                    2355
[southwestern-united-states]    2194
[greek]                         2165
[chinese]                       1847
[french]                        1813
[northeastern-united-states]    1724
[german]                        1186
[english]                       1173
[thai]                          1116
[spanish]                        871
[moroccan]                       839
[japanese]                       768
[new-zealand]                    405
[scottish]                       381
[swedish]                        340
[russian]                        322
[vietnamese]                     314
[south-african]                  313
[swiss]                          302
[polish]                         299
[portuguese]                     292
[hungarian]              

In [72]:
len(RAW_recipes_cuisine_cleaned)

50399

In [73]:
len(cusine_category_counts)

58

So, we have categorized 50399 recipes into 58 cuisine categories. This data set can serve as a training set for a model that predicts the cuisine from a given list of ingredients or a recipe.
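Since every remaining row holds a one-element list, a model would typically need the label as a plain string. The sketch below shows one way to extract it on a made-up DataFrame; the column name cuisine and the toy data are assumptions, not part of this notebook.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data set.
toy = pd.DataFrame({
    'ingredients': [['pasta', 'tomato'], ['tortilla', 'beans'], ['pasta', 'basil']],
    'Cuisine_Tags': [['italian'], ['mexican'], ['italian']],
})

# .str[0] takes the first element of each list, giving a scalar label per row.
toy['cuisine'] = toy['Cuisine_Tags'].str[0]
label_counts = toy['cuisine'].value_counts()

print(toy[['ingredients', 'cuisine']])
print(label_counts)
```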