This repository has been archived by the owner. It is now read-only.
Visualization and analysis of food habits in Switzerland vs France.
0-mining
1-preprocessing
2-elasticsearch
3-webserver
4-matching
5-analysis
images
.gitignore
LICENSE
README.md

Food habits

This project is part of EPFL's Applied Data Analysis course and promotes Data Science in Switzerland. The concept is not spatially restricted and can easily be generalized to other regions.

Abstract

Switzerland is well known for its rich heritage: incredible landscapes, watches, cheese, chocolate and diverse influences from its five neighboring countries. This project investigates how this heritage is reflected in food habits. We picked 2 Swiss and 3 French cities with high restaurant density to get insights about dietetics. We mapped restaurant meals to recipes, and the ingredients of those recipes to products, in order to analyze the corresponding nutriments.

Is there any area-based nutrition bias?

Our infrastructure and datasets also allow us to explore other topics such as:

  • food trends according to clichés (e.g. Rösti, Malakoff)
  • food/nutriment variety per location (e.g. meals with more salt, lipids, etc.)

Data description

  • 11k restaurants (e.g. LaFourchette)
  • 35k meals (extracted from the restaurants' menus)
  • 170k recipes (various websites, e.g. CuisineAZ)
  • 1.3M ingredients (derived from the recipes)
  • 5k products (e.g. FDDB, OpenFood)
  • 40k nutriments (extracted from the products)

Assumptions

We assumed that:

  • the restaurants listed in LaFourchette were representative enough of the local food habits.
  • we could associate recipes with meals and products with recipes well enough to derive the nutrition facts of a meal, without the estimates suffering too much from variance and central-limit-theorem effects.
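
One way to read the second assumption is that a meal matched to several candidate recipes gets an averaged nutrition estimate, whose standard error shrinks roughly as 1/sqrt(n). The sketch below illustrates that idea only; the function name `estimate_with_stderr` and the sample values are hypothetical and are not taken from the project's code or data.

```python
import math
import statistics

def estimate_with_stderr(values):
    """Return (mean, standard error of the mean) for per-recipe values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two matched recipes")
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean, stderr

# Hypothetical energy values (kcal) of recipes matched to a single meal.
matched_energy = [210.0, 250.0, 190.0, 230.0]
mean, stderr = estimate_with_stderr(matched_energy)
print(f"energy ~ {mean:.1f} +/- {stderr:.1f} kcal")
```

With few or very dissimilar matches the standard error stays large, which is the situation the assumption hopes to avoid.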

Journey

Each folder in this repository is a step in the development of this project. Each folder contains a README describing what we did and why. We recommend reading them to get a better idea of the work we have done.

Data pipeline

We implemented the following data pipeline:

Visualization of the data pipeline

Matching

We used this process to find matches:

Visualization of the matching system

Types of matching

Disadvantages (each with its pair of example names):

  • Rare events, misspelled, grouped: "Pavé de boeuf aux morilles" / "Pavé de boeuf aux morilles simplissimes"
  • Wide, personal meaning: "café gourmand" / "café gourmand à ma façon"
  • Principal component: "Rognons de lapins à la moutarde de Meaux" / "Fricassée de champignons à la moutarde de Meaux"
  • Unknown, language: "Tartare de boeuf minute, salade et potatoes" / "Twice baked potatoes au bacon"

Advantages (each with its pair of example names):

  • Order tolerance: "Tiramisu caramel speculos beurre salé" / "Tiramisu au caramel au beurre salé et spéculoos"
  • Exact match: "Salade d'orange au miel et à la cannelle" / "Salade d'orange au miel et à la cannelle"
  • Limited difference: "Terrine de foie gras et confiture de pruneaux" / "Terrine de foie gras aux pruneaux et raisins secs"
  • Complex: "Cassolette de Saint-Jacques et crevettes" / "Ravioles, noix de Saint-Jacques et crevettes en cassolettes raffinées"
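
To make the order-tolerance and exact-match cases above concrete, here is a minimal sketch of a token-set comparison. It is not the project's matching system: the Jaccard similarity, the stopword list and the function names (`tokens`, `jaccard`, `best_match`) are assumptions chosen for illustration.

```python
import unicodedata

# Small, hypothetical list of French function words to ignore.
STOPWORDS = {"de", "du", "des", "la", "le", "les", "au", "aux",
             "a", "et", "en", "d", "l"}

def tokens(name):
    """Lowercase, strip accents, split on non-alphanumerics, drop stopwords."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents)
    return {t for t in cleaned.split() if t not in STOPWORDS}

def jaccard(a, b):
    """Share of tokens common to both names; word order is ignored."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(meal, recipes):
    """Return the candidate recipe name most similar to the meal name."""
    return max(recipes, key=lambda recipe: jaccard(meal, recipe))

meal = "Tiramisu caramel speculos beurre salé"
candidates = [
    "Tiramisu au caramel au beurre salé et spéculoos",
    "Salade d'orange au miel et à la cannelle",
]
print(best_match(meal, candidates))
```

A set-based score like this tolerates reordering, but it also shows the "principal component" weakness: two dishes sharing only "moutarde de Meaux" still receive a non-zero score even though their main ingredient differs.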

Food trends

Here are a few examples of food facts we can extract from the datasets with our infrastructure.

Trend charts, one per country and one per city for each quantity:

  • Energy (kcal)
  • Protein
  • Carbohydrates
  • Salt
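
A per-country or per-city trend of this kind boils down to grouping meals by location and averaging a nutrition field. The following is a minimal sketch under assumptions: the record layout, the field names (`country`, `city`, `energy_kcal`), the placeholder city labels and the values are all hypothetical and do not come from the project's datasets.

```python
from collections import defaultdict
from statistics import mean

def mean_by(records, key, field):
    """Average `field` over records grouped by `key`, skipping missing values."""
    groups = defaultdict(list)
    for record in records:
        value = record.get(field)
        if value is not None:
            groups[record[key]].append(value)
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical meals with an estimated energy value.
meals = [
    {"country": "CH", "city": "swiss-city-1", "energy_kcal": 200.0},
    {"country": "CH", "city": "swiss-city-2", "energy_kcal": 240.0},
    {"country": "FR", "city": "french-city-1", "energy_kcal": 260.0},
    {"country": "FR", "city": "french-city-1", "energy_kcal": None},
]

print(mean_by(meals, "country", "energy_kcal"))
print(mean_by(meals, "city", "energy_kcal"))
```

Meals without an estimate are skipped rather than counted as zero, so unmatched meals do not pull the averages down.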

Visualization

Here are a few visualization examples for cliché-meal searches.

  • Speciality (first visualization example): Choucroute (red), Malakoff (blue)
  • Different kind (second visualization example): Fondue Savoyarde (red), Fondue au fromage (blue)

Results

The expected food trends were present, in line with well-known clichés. Looking more closely at the estimated nutrition facts, the high variance and noisiness of the datasets, coupled with the matching process, greatly increase the difficulty of the analysis. We found no relevant area-based nutrition bias among the insights. Nonetheless, the matching process and the pipeline could be used as tools for further in-depth investigation.

Expected and encountered challenges

Before starting the project, we expected the following points to be the most challenging:

  • dataset collection: menu data can be difficult to gather
  • sparsity and spatial homogeneity: depending on dataset quality, some regions might need to be ignored due to lack of data
  • content languages: textual information (including menus) can use different names depending on the area; standardization and translation might be needed
  • data completeness: non-food data might need to be extracted from different sources to make the analysis meaningful

After finishing the project, the actual challenges turned out to be the following:

  • data mining and normalization (high variance, different sources, captchas)
  • data organization (complex queries, centralized storage with ElasticSearch)
  • French NLP (weird characters, hard modeling)
  • matching (many candidates, heterogeneous units)
  • heavy computation (vectorization, visualization)
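
The heterogeneous units mentioned in the matching challenge can be illustrated with a small normalization sketch. It is an assumption about what such a step could look like, not the project's code: the conversion table, the regular expression and the function name `normalize_quantity` are hypothetical.

```python
import re

# Conversion factors to a base unit: grams for mass, millilitres for volume.
UNIT_FACTORS = {
    "mg": ("g", 0.001), "g": ("g", 1.0), "kg": ("g", 1000.0),
    "ml": ("ml", 1.0), "cl": ("ml", 10.0), "dl": ("ml", 100.0),
    "l": ("ml", 1000.0),
}

# A number (dot or comma as decimal separator) followed by a unit.
QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)\s*$")

def normalize_quantity(text):
    """Return (value, base_unit), or None if the quantity is not understood."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in UNIT_FACTORS:
        return None
    base_unit, factor = UNIT_FACTORS[unit]
    return float(number.replace(",", ".")) * factor, base_unit

print(normalize_quantity("1,5 kg"))
print(normalize_quantity("25 cl"))
print(normalize_quantity("2 pincées"))
```

Quantities that cannot be parsed return `None` instead of a guess, so they can be reviewed or discarded explicitly.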

Regarding content languages, no data was available on LaFourchette for the German- and Italian-speaking parts of Switzerland. Hence we focused our work on France and the French-speaking part of Switzerland.

Improvements

  • formal statistical evaluation: because time was limited, the project contains few insights. More rigorous modelling and evaluation would definitely improve this.
  • deep recurrent neural network for matching: one should evaluate the efficiency of a neural network for matching meals to recipes.
  • computational efficiency: matching currently takes 20 seconds per restaurant (centralized server); this could be improved by batching, parallelisation and a local server.
  • expand visualization: more interactive and more diverse kinds of visualization.
  • more and enhanced data for Switzerland: data precision is still an issue. This could be improved by using, for example, the restaurants' own websites.
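
As an illustration of the parallelisation idea above, the sketch below runs a per-restaurant matching function over a thread pool. It assumes the matching is dominated by waiting on a server, which is where threads help; `match_restaurant` is a hypothetical placeholder for the real matching step, and the record layout is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def match_restaurant(restaurant):
    """Hypothetical placeholder for the real per-restaurant matching step.

    Here it only counts the meals; the real step would query the server.
    """
    return restaurant["id"], len(restaurant["meals"])

def match_all(restaurants, max_workers=8):
    """Match several restaurants concurrently and collect results by id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(match_restaurant, restaurants))

# Hypothetical input records.
restaurants = [
    {"id": 1, "meals": ["café gourmand", "Malakoff"]},
    {"id": 2, "meals": []},
]
print(match_all(restaurants))
```

If the matching were CPU-bound instead, a process pool would be the more suitable choice.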

License

The project is available under the Apache 2.0 license. The data belong to their respective owners under the appropriate licenses.