# Part 2 (Exploratory Data Analysis)

## Created by Konstantin Georgiev

### Email: dragonflareful@gmail.com

In the first part, I obtained and cleaned the three food datasets and filtered them based on the questions I laid out in the beginning. Now I'll put these datasets to good use by exploring the data with grouping, visualization and hypothesis testing.<br> First we need to import the packages required for the exploration.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.stats import ttest_ind

from nose.tools import *

from mpl_toolkits.basemap import Basemap

## Step 1 - Load the previously cleaned datasets

Of course, the first thing we need to do is load the datasets that were cleaned in the first step, print them and confirm their shapes.

In [None]:
world_food_data=pd.read_csv("../input/openfoodfactsclean/world_food_scrubbed.csv")
starbucks_data=pd.read_csv("../input/openfoodfactsclean/star_menu_scrubbed.csv")
mcd_menu_data=pd.read_csv("../input/openfoodfactsclean/mcd_menu_scrubbed.csv")

In [None]:
assert_is_not_none(world_food_data)
assert_is_not_none(starbucks_data)
assert_is_not_none(mcd_menu_data)

In [None]:
world_food_data.head()

In [None]:
starbucks_data.head()

In [None]:
mcd_menu_data.head()

In [None]:
print("Dataset shapes for exploration: MCD({},{}), Starbucks({},{}), World food({},{})".format(mcd_menu_data.shape[0],mcd_menu_data.shape[1],
                                                                              starbucks_data.shape[0],starbucks_data.shape[1],
                                                                              world_food_data.shape[0],world_food_data.shape[1]))

The dataset shapes should be the same as the cleaned ones in part 1.

In [None]:
assert_equal(world_food_data.shape,(71091,13))
assert_equal(starbucks_data.shape,(242,7))
assert_equal(mcd_menu_data.shape,(260,6))

The data looks ready for exploring now.

## Step 2 - Exploration of the French products dataset

### Analyzing food packaging

Let's begin our exploration by taking a look at the food packaging. I'm going to apply some grouping and take a look at the most common types of packaging.

In [None]:
world_food_data["packaging"].groupby(world_food_data["packaging"]).count().sort_values(ascending=False)

The three most common types of packaging seem to be __plastic__, __cardboard__ and __can__.<br><br>Since one product can have multiple types of packaging, I'm going to select every product whose `packaging` value contains the string `plastique`, `carton` or `conserve`, save each selection in a separate dataframe and count the occurrences of each type. Each of these counts should be __> 0__.

In [None]:
products_plastic=world_food_data.loc[world_food_data["packaging"].str.contains("plastique")]
products_cardboard=world_food_data.loc[world_food_data["packaging"].str.contains("carton")]
products_canned=world_food_data.loc[world_food_data["packaging"].str.contains("conserve")]

num_canned,num_cardboard,num_plastic=(products_canned["packaging"].count(),
                                      products_cardboard["packaging"].count(),
                                      products_plastic["packaging"].count())
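One caveat with `str.contains`: if the `packaging` column still held missing values, the resulting mask would contain `NaN` and could not be used for selection. A minimal sketch on made-up data (the toy frame and its values are illustrative, not taken from the dataset), assuming it is acceptable to treat a missing packaging as "no match" via `na=False`:

```python
import pandas as pd

# Toy data (hypothetical); one product can list several packaging types.
toy = pd.DataFrame({
    "product_name": ["yaourt", "soupe", "lait", "biscuits"],
    "packaging": ["plastique,carton", "conserve", None, "carton"],
})

# na=False makes a missing packaging count as "keyword not found".
keywords = {"plastic": "plastique", "cardboard": "carton", "canned": "conserve"}
counts = {
    label: int(toy["packaging"].str.contains(word, na=False).sum())
    for label, word in keywords.items()
}
print(counts)
```

Because the first toy product matches two keywords, the three counts can add up to more than the number of products, which is also true for the real counts above.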

In [None]:
assert_is_not_none(products_plastic)
assert_is_not_none(products_cardboard)
assert_is_not_none(products_canned)
assert_greater(num_canned,0)
assert_greater(num_cardboard,0)
assert_greater(num_plastic,0)

Let's plot a bar chart to see what the counts look like.

In [None]:
plt.title("Distribution of different packaging")
plt.bar(range(3), [num_canned,num_cardboard,num_plastic])
plt.xticks(range(3), ["Canned", "Cardboard", "Plastic"])
plt.ylabel("Packaging Count")
plt.show()

We can see that __plastic__ is definitely the predominant type of packaging in France - around double the amount of the other two categories combined.<br><br>
But that doesn't tell us much. What we can do, however, is see whether these different types of packaging affect the nutrition score. Let's plot a histogram of the distributions of the nutrition scores across different packaging types.<br><br> Also, according to the French Nutri-Score system, lower scores indicate healthier products.

In [None]:
plt.title("Distribution of nutrition scores by different packaging")
plt.ylabel("Packaging count")
plt.xlabel("Nutrition score")
plt.hist(products_plastic["nutrition_score"],bins=20,alpha=0.7)
plt.hist(products_cardboard["nutrition_score"],bins=20,alpha=0.7)
plt.hist(products_canned["nutrition_score"],bins=20,alpha=0.7)
plt.legend(["Plastic", "Cardboard", "Canned"])
plt.show()

Well, it seems that the products with __plastic__ packaging have a different peak than the other two types of products. However, the three distributions follow more or less the same pattern, so there doesn't seem to be anything interesting going on here.<br><br>Let's take a look at the additive count now.

### Analyzing the additive count

First, I'll plot a simple histogram to see the additive count distribution in the french products.

In [None]:
plt.title("Distribution of additive count in products")
plt.xlabel("Additive count")
plt.ylabel("Additive count distribution")
plt.hist(world_food_data["additives_n"])
plt.show()

We can see that the distribution here is asymmetrical and most products have additive counts in the range __(0,5)__, which is a good indicator.<br><br> Let's explore further.

In [None]:
world_food_data["additives_n"].unique()

Let's take a look at what part of the french products actually contain additives. I'm going to apply grouping on the `contains_additives` column and change the index names for clarity. There should be only two classes here.

In [None]:
products_with_additives=world_food_data["contains_additives"].groupby(world_food_data["contains_additives"]).count()
products_with_additives

In [None]:
products_with_additives.index=["don't contain additives","contain additives"]

In [None]:
assert_equal(len(products_with_additives),2)
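As a side note, grouping a column by itself and counting is equivalent to `value_counts`. The sketch below uses made-up flags and assumes that `contains_additives` is encoded as 0/1; sorting by index keeps 0 first, which is what the manual relabelling relies on.

```python
import pandas as pd

# Hypothetical 0/1 flags standing in for the contains_additives column.
flags = pd.Series([0, 1, 1, 0, 1], name="contains_additives")

# The grouping used above ...
by_group = flags.groupby(flags).count()
# ... and the equivalent value_counts call, sorted so that 0 comes before 1.
by_value = flags.value_counts().sort_index()

by_value.index = ["don't contain additives", "contain additives"]
print(by_value)
```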

Afterwards, I'll make a function for fast pie chart plotting with three parameters - __the grouped dataframe__, __the title__ and __the offset of each slice from the centre of the pie__ (`explode`).

In [None]:
def plot_pie_on_grouped_data(grouped_data,title,explode):
    plt.gca().set_aspect("equal")
    plt.pie(grouped_data,labels=grouped_data.index, autopct = "%.2f%%",explode=explode,radius=1)
    plt.title(title)
    plt.show()

Now we can plot the pie chart and see how many products actually contain additives.

In [None]:
plot_pie_on_grouped_data(products_with_additives,"Percentage of french products containing additives",(0,0.1))

It seems that there are actually more french products that contain additives. That's interesting but does that really affect the nutrition value?<br><br>
Let's group the additive column by nutrition grade and find out!<br><br>
The grouped dataframe here should be of size __5__, one group for each grade `('A','B','C','D','E')`.

In [None]:
additives_by_grade=world_food_data["additives_n"].groupby(world_food_data["nutrition_grade_fr"])

In [None]:
assert_equal(len(additives_by_grade),5)

Now let's plot a histogram of the distributions.

In [None]:
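# Iterating over a groupby yields (group key, group values): here (grade, additive counts)
for grade, additive_counts in additives_by_grade:
    plt.hist(additive_counts, label = "Grade {}".format(grade), alpha = 0.5)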
plt.title("Distribution of additive count by nutrition grade")
plt.xlabel("Additive count")
plt.ylabel("Additive count distribution")
plt.legend()
plt.show()

It doesn't seem like the additive count in products affects the grades as the distributions are more or less the same.<br><br>
But we can see that the predominant grade here is __'D'__, which means that a lot of products have a grade that is below average.<br><br>
We'll take a look at the palm oil ingredients next.

### Analyzing the palm oil ingredient count

I'm going to observe the `ingredients_from_palm_oil_n` column now.<br><br>
Similar to the previous exploration, I'm just going to group the data by count and this time there should be only two classes. For clarity, I'm going to label them as `palm_oil_absent` and `palm_oil_present`.

In [None]:
palm_oil_group=world_food_data["ingredients_from_palm_oil_n"].groupby(world_food_data
                                                                         ["ingredients_from_palm_oil_n"]).count()
palm_oil_group

In [None]:
palm_oil_group.index=["palm oil absent","palm oil present"]

In [None]:
assert_equal(palm_oil_group.values.tolist(),[67055,4036])
assert_equal(palm_oil_group.index.tolist(),["palm oil absent","palm oil present"])

In [None]:
palm_oil_group

Now we can plot our pie chart using the previous helper function that I made.

In [None]:
plot_pie_on_grouped_data(palm_oil_group,"French products with and without palm oil ingredients",(0,0.1))

We can see that most of the french products don't contain any ingredients from palm oil, which is a good indicator.<br><br>
This means that the chance that the palm oil impacts the nutrition grade is really low so I won't explore this any further.

### Analyzing the product categories

We can also look at the categories with the largest amount of products in the dataframe.<br><br>
Let's take the __10__ categories with the largest product count and see what they look like.

In [None]:
num_products_by_category=world_food_data.main_category.groupby(world_food_data.main_category).count().sort_values(ascending=False).nlargest(10)

In [None]:
assert_equal(len(num_products_by_category),10)

Also I'll make a quick helper function to plot bar charts.<br> It will take four parameters - __a grouped dataset__, __a title__, __the label across the y axis__ and __the figure size__.

In [None]:
def plot_barh_on_grouped_data(grouped_data,title,y_label,fig_size):
    plt.figure(figsize = fig_size)
    plt.title(title)
    plt.ylabel(y_label)
    plt.barh(range(len(grouped_data)), grouped_data)
    plt.yticks(list(range(len(grouped_data))), grouped_data.index)
    plt.show()

Now we can plot our categories.

In [None]:
plot_barh_on_grouped_data(num_products_by_category,"French product categories with the highest count","",(10,6))

It looks like the majority of this dataframe includes plant-based products. In most cases, these products should actually be the ones that have the best nutrition score. This should normally mean that most products have higher grades. But is that really the case? Let's find out.<br><br>I'm going to group the categories by grade and test this theory.

In [None]:
grades_by_category=world_food_data.main_category.groupby(world_food_data.nutrition_grade_fr).count()

In [None]:
assert_equal(len(grades_by_category),5)

In [None]:
grades_by_category

Let's plot the grouped dataset again using our helper function for bar charts.

In [None]:
plot_barh_on_grouped_data(grades_by_category,"Nutrition grade distributions for the french products","Nutrition grade",(10,6))

This totally contradicts the previous theory.<br><br>
From what we can see the three lowest grades are the ones with the highest count! Maybe the french products aren't as healthy as it seems after all. But to be sure, we need to try to correlate our data with other datasets.

As I have the McDonalds and Starbucks datasets at my disposal, what I can do is single out the french meat products and beverages and see how they compare to fast food in terms of nutrition. Maybe this will take me one step closer to figuring out why the nutrition grades are so low.

## Step 3 - Comparing French beverages to Starbucks beverages

I'll start by taking the products in the french dataset, which are labeled as `beverages` along with their nutrients and also take some similar data from the Starbucks dataset.

In [None]:
french_beverages=world_food_data[["product_name","fat_g","carbohydrates_g","proteins_g"]].loc[world_food_data["main_category"]=="beverages"]
starbucks_beverages=starbucks_data[["product_name","beverage_prep","fat_g","carbohydrates_g","proteins_g"]]

In [None]:
assert_equal(french_beverages.shape[1],4)
assert_equal(starbucks_beverages.shape[1],5)

The first thing I'm going to do is display the correlations between the separate nutrients.

In [None]:
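# Restrict to numeric columns; recent pandas versions raise an error on the text columns
starbucks_beverages.select_dtypes("number").corr()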

In [None]:
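# Restrict to numeric columns; recent pandas versions raise an error on the text columns
french_beverages.select_dtypes("number").corr()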

We can see that there is some correlation between the data, but nothing too interesting is going on.

In [None]:
print("Number of french beverages:{} , Number of Starbucks beverages:{}".format(french_beverages.shape[0],
                                                                               starbucks_beverages.shape[0]))

In [None]:
print("Number of unique french beverages:{}, Number of unique Starbucks beverages:{}".format(len(french_beverages.product_name.unique()),
                                                                                             len(starbucks_beverages.product_name.unique())))

When we print the shapes and the number of unique products in the two dataframes, we see that the French products are far more numerous and more varied. So the difference in correlations between the two dataframes isn't surprising.<br><br>
So what can we do in order to make an accurate comparison between the number of nutrients in the two dataframes?
What I've done is write a function which does the following:
 1. Takes three parameters - __the two dataframes and number of iterations (experiments to apply sampling)__
 2. Switches the two dataframes if `larger_df` turns out to be the smaller one, so that the sampling always runs on the larger dataframe
 3. Samples nutrient values __equal to the size of the smaller dataframe__ for the __larger__ dataframe
 4. Sums over __the sampled__ values and acquires the total sum of nutrients for a sample of the __larger__ dataframe
 5. Sums over __all__ nutrients in the __smaller__ dataframe
 6. Prints these sums
 7. Prints the percentage of nutrients contained in a single sample of the __larger__ dataframe
 8. Prints the __mean total percentage__ of nutrients in the __larger__ dataframe across all samples
 
In our case the __larger__ dataframe will always be the french products dataframe and I expect that a sampled sum of these nutrients will always be less than the total sum of fast food nutrients, so I'm going to create this function based on that assumption.

The following will allow us to see exactly what is the difference in nutrition between the two dataframes.

In [None]:
def extract_mean_total_nutrients(larger_df,smaller_df,num_iterations):
    total_sum=0
    larger_df_copy=larger_df.copy()
    smaller_df_copy=smaller_df.copy()
    
    #Check if the larger dataframe is actually given as the second parameter and switch the dataframes
    if larger_df.shape[0] < smaller_df.shape[0]:
        larger_df=smaller_df_copy
        smaller_df=larger_df_copy
        
    for i in range(num_iterations):
        total_nutrients_larger = round(larger_df.carbohydrates_g.sample(len(smaller_df)).sum() + larger_df.proteins_g.sample(len(smaller_df)).sum() + larger_df.fat_g.sample(len(smaller_df)).sum())
        total_nutrients_smaller = round(smaller_df.carbohydrates_g.sum() + smaller_df.proteins_g.sum() + smaller_df.fat_g.sum())
        print("Sample ",i+1)
        print("Total sampled nutrients (Larger dataframe):{} , Total nutrients (Smaller dataframe):{}".format(total_nutrients_larger,
                                                                                        total_nutrients_smaller))
        sample_per=total_nutrients_larger/total_nutrients_smaller*100
        print("Total % of nutrients in iteration for the larger dataframe:{:.2f}".format(sample_per))
        total_sum+=sample_per
    
    total_mean=total_sum/num_iterations
    print("\nMean total % of nutrients for the larger dataframe across all iterations:{:.2f}".format(total_mean))
    return total_mean

In [None]:
mean_result=extract_mean_total_nutrients(french_beverages,starbucks_beverages,10)

For 10 samples, we can see that the mean total percentage of nutrients in the French beverages ranges from around __10%__ to __12%__.<br><br> Now we need to find out whether the French beverages are too poor or the Starbucks beverages are too rich in nutrients.

In [None]:
assert_greater(mean_result,5)
assert_less(mean_result,20)
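Note that the function above draws a separate sample for each nutrient column, so the three sums in one iteration come from different products. A possible variant, sketched here under my own assumptions (the name `sampled_nutrient_share` and the `seed` parameter are hypothetical, not part of the notebook), samples whole rows once per iteration and fixes the seed so the result is reproducible:

```python
import pandas as pd

NUTRIENTS = ["fat_g", "carbohydrates_g", "proteins_g"]

def sampled_nutrient_share(larger_df, smaller_df, num_iterations, seed=0):
    """Mean percentage of the smaller dataframe's nutrient total that a
    row sample of the larger dataframe (same number of rows) reaches."""
    # Swap the arguments if they were passed in the wrong order.
    if len(larger_df) < len(smaller_df):
        larger_df, smaller_df = smaller_df, larger_df
    smaller_total = smaller_df[NUTRIENTS].sum().sum()
    shares = []
    for i in range(num_iterations):
        # One sample of whole rows, so all three nutrients belong to the same products.
        sample = larger_df.sample(len(smaller_df), random_state=seed + i)
        shares.append(sample[NUTRIENTS].sum().sum() / smaller_total * 100)
    return sum(shares) / len(shares)

# Toy data (hypothetical values).
big = pd.DataFrame({"fat_g": [1.0] * 6, "carbohydrates_g": [2.0] * 6, "proteins_g": [1.0] * 6})
small = pd.DataFrame({"fat_g": [10.0, 10.0], "carbohydrates_g": [20.0, 20.0], "proteins_g": [10.0, 10.0]})
share = sampled_nutrient_share(big, small, num_iterations=5)
print(round(share, 2))
```

On average both approaches estimate the same quantity; the row-wise version only makes each individual sample correspond to an actual set of products.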

I'm going to make a helper function which accepts a __dataframe__ and a __product category__ (in our case a nutrient) and returns a grouped dataframe that contains the __10__ products that are the richest in that category.

In [None]:
def get_max_product_values_by_category(dataframe,category):
    group_result=category.groupby(dataframe.product_name).max().sort_values(ascending=False).nlargest(10)
    return group_result
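A small remark on the helper: `nlargest` already returns its result in descending order, so the preceding `sort_values` call is harmless but not required. A quick check on toy data (product names and values are made up):

```python
import pandas as pd

# Toy data (hypothetical); "cola" appears twice with different values.
toy = pd.DataFrame({
    "product_name": ["cola", "cola", "juice", "water", "tea"],
    "carbohydrates_g": [10.6, 11.0, 9.5, 0.0, 2.0],
})

per_product_max = toy["carbohydrates_g"].groupby(toy["product_name"]).max()
with_sort = per_product_max.sort_values(ascending=False).nlargest(3)
without_sort = per_product_max.nlargest(3)
print(without_sort)
```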

Next I'll apply that function for the french and Starbucks beverages and use __carbohydrates__ as the testing category.

In [None]:
carb_heavy_french_beverages=get_max_product_values_by_category(french_beverages,french_beverages.carbohydrates_g)
carb_heavy_starbucks_beverages=get_max_product_values_by_category(starbucks_beverages,starbucks_beverages.carbohydrates_g)

In [None]:
carb_heavy_french_beverages

We can see that there is some incorrect data amongst the beverages so let's correct that. I'm going to filter out all of the syrups, desserts and medical products as best as I can.

In [None]:
filter_out_list = ['Sirop', 'SIROP', 'sirop', 'Agaven', 'agaven', 'AGAVEN',
                                                        'Dessert', 'DESSERT', 'dessert', 'Bonbons']
pattern='|'.join(filter_out_list)
french_beverages = french_beverages[~french_beverages.product_name.str.contains(pattern)]
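Instead of listing every capitalisation by hand, `str.contains` can match case-insensitively. The sketch below runs on made-up product names and assumes a broader filter is acceptable: unlike the list above, it removes a keyword in any capitalisation, so on the real data it may drop a few more rows.

```python
import pandas as pd

# Hypothetical product names.
names = pd.Series(
    ["Sirop de menthe", "SIROP d'agave", "Jus d'orange", "Ginger drink", "Bonbons au sirop"],
    name="product_name",
)

pattern = "|".join(["sirop", "agaven", "dessert", "bonbons"])
# case=False ignores capitalisation; na=False treats missing names as "no match".
is_filtered = names.str.contains(pattern, case=False, na=False)
kept = names[~is_filtered]
print(kept.tolist())
```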

In [None]:
carb_heavy_french_beverages=get_max_product_values_by_category(french_beverages,french_beverages.carbohydrates_g)

In [None]:
carb_heavy_french_beverages

In [None]:
carb_heavy_starbucks_beverages

There should be __10__ values in each grouped dataframe.

In [None]:
assert_equal(len(carb_heavy_french_beverages),10)
assert_equal(len(carb_heavy_starbucks_beverages),10)
assert_equal(carb_heavy_french_beverages.values[0],99)
assert_equal(carb_heavy_starbucks_beverages.values[0],340)

Let's plot our grouped data to see how the two dataframes compare.

In [None]:
plot_barh_on_grouped_data(carb_heavy_french_beverages,"French beverages that contain the highest amount of carbohydrates","",(10,6))
plot_barh_on_grouped_data(carb_heavy_starbucks_beverages,"Starbucks beverages that contain the highest amount of carbohydrates","",(10,6))

We can see that there is a lot less variance in the French beverages and that most of the top ones are medicines and different syrups. The first product that is really comparable to the Starbucks drinks is the Ginger drink.<br><br>When we compare that to the first beverage in the Starbucks dataframe, the __Java Chip__, we can see that __1 Java Chip__ amounts to around __3 Ginger drinks__, which seems plausible.<br><br> According to my research, the recommended daily carbohydrate intake for a person of normal weight is around __288__. About half of the Starbucks drinks here exceed that amount, so I think we can assume that the Starbucks beverages are simply unhealthily rich in these nutrients.

In [None]:
# Both series are indexed by product name, so select by position with .iloc
print(round(carb_heavy_starbucks_beverages.iloc[0]/carb_heavy_french_beverages.iloc[4]))

## Step 4 - Comparing French meat to McDonalds meat

We saw how the french beverages compare to fast food beverages. Now let's check out the french meat.<br><br>
Again I'm going to extract the products from a specific category along with the needed features, in this case the `meats` category and save them into `french_meat_data`.<br><br>To obtain the McDonalds meat products, first we'll take the categories `Beef & Pork` and `Chicken & Fish`. Afterwards we'll add the products from other categories, which contain the words `["Sausage","Bacon","Chicken","Steak"]`. Finally we'll filter out the ones labeled as `Fish`.

In [None]:
french_meat_data=world_food_data[["product_name","fat_g","carbohydrates_g","proteins_g"]].loc[world_food_data["main_category"]=="meats"]
words_to_search=["Sausage","Bacon","Chicken","Steak"]
pattern='|'.join(words_to_search)
mcd_meat_data=mcd_menu_data[["product_name","fat_g","carbohydrates_g","proteins_g"]].loc[(mcd_menu_data["category"]=='Beef & Pork') | 
                                                                                         (mcd_menu_data["category"]=='Chicken & Fish')]
# DataFrame.append was removed in pandas 2.0; pd.concat produces the same rows
mcd_meat_data=pd.concat([mcd_meat_data,
                         mcd_menu_data[["product_name","fat_g","carbohydrates_g","proteins_g"]].
                         loc[mcd_menu_data.product_name.str.contains(pattern)]])
mcd_meat_data=mcd_meat_data[~mcd_meat_data["product_name"].isin(["Fish"])]
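One thing to keep in mind with this construction: a product that sits in one of the two meat categories and also contains one of the keywords is added twice. Whether that is intended is not something I can tell from the data alone, so the following is only a sketch on a hypothetical miniature menu, showing how a single combined mask keeps each row once. Applied to the real data, it would change the row count asserted below.

```python
import pandas as pd

# Hypothetical miniature menu, not the real McDonalds data.
menu = pd.DataFrame({
    "category": ["Beef & Pork", "Breakfast", "Chicken & Fish", "Breakfast", "Desserts"],
    "product_name": ["Bacon Clubhouse Burger", "Sausage McMuffin", "Filet-O-Fish",
                     "Hotcakes", "Apple Pie"],
})

in_meat_category = menu["category"].isin(["Beef & Pork", "Chicken & Fish"])
has_meat_keyword = menu["product_name"].str.contains("Sausage|Bacon|Chicken|Steak")

# Concatenating the two selections repeats rows that satisfy both conditions.
appended = pd.concat([menu[in_meat_category], menu[has_meat_keyword]])
# A single boolean mask keeps every matching row exactly once.
combined = menu[in_meat_category | has_meat_keyword]
print(len(appended), len(combined))
```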

In [None]:
mcd_meat_data.shape

In [None]:
print("Number of french meat products:{} , Number of McDonalds meat products:{}".format(french_meat_data.shape[0],
                                                                               mcd_meat_data.shape[0]))

In [None]:
assert_equal(french_meat_data.shape,(3790,4))
assert_equal(mcd_meat_data.shape,(111,4))

We should have 3790 observations for the French meat products and 111 for the McDonalds products.<br><br>
Let's also take a look at the correlations.

In [None]:
mcd_meat_data.corr()

In [None]:
french_meat_data.corr()

Again, the correlations between the nutrients are much stronger in the McDonalds dataframe, but that's likely due to the small number of observations compared to the French dataframe.<br><br> Next, let's use the sampling helper function from before to sample some nutrient sums and see how the dataframes compare.

In [None]:
mean_result = extract_mean_total_nutrients(french_meat_data,mcd_meat_data,10)

If we execute the cell a few times, we can see that the French mean total ranges from around __38%__ to __41%__ this time. The French samples are a lot closer to the McDonalds data than the beverages were to the Starbucks data. This is very interesting and worth looking into.

In [None]:
assert_greater(mean_result,30)
assert_less(mean_result,45)

Let's single out the __fat__ nutrient this time and see which products have the largest amount of that.

In [None]:
fat_heavy_french_meat_products=get_max_product_values_by_category(french_meat_data,french_meat_data.fat_g)
fat_heavy_mcd_meat_products=get_max_product_values_by_category(mcd_meat_data,mcd_meat_data.fat_g)

In [None]:
fat_heavy_french_meat_products

In [None]:
fat_heavy_mcd_meat_products

Again, there should be only 10 products in each grouped category.

In [None]:
assert_equal(len(fat_heavy_french_meat_products),10)
assert_equal(len(fat_heavy_mcd_meat_products),10)
assert_equal(fat_heavy_french_meat_products.values[0],73)
assert_equal(fat_heavy_mcd_meat_products.values[0],118)

Let's plot our grouped data to see how the dataframes compare.

In [None]:
plot_barh_on_grouped_data(fat_heavy_french_meat_products,"French meat products with the highest amount of fat","",(10,6))
plot_barh_on_grouped_data(fat_heavy_mcd_meat_products,"McDonalds products with the highest amount of fat","",(10,6))

Interesting. Apart from the __Chicken McNuggets__, which are sold in buckets and are a very large portion in general, we can see that most french meat products surpass the rest of the McDonalds menu in terms of __fat__!<br><br>Does this mean that the __fat__ nutrient is the main reason for the bad nutrition grades? We'll try to find out with the help of some __hypothesis testing__.

## Step 5 - Hypothesis testing

To find out whether the nutrients affect the nutrition grade, first I'm going to add the grade as a new column to the meat products.<br><br> Again, the grade column should have only __5__ possible values - one for each grade.

In [None]:
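# assign returns a new dataframe (rows aligned on the index), which avoids SettingWithCopyWarning
french_meat_data=french_meat_data.assign(grade=world_food_data["nutrition_grade_fr"])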

In [None]:
assert_equal(french_meat_data.shape[1],5)

In [None]:
french_meat_data.head()

Because I'll try out a few different hypotheses, I'll make a helper function to automate the process.<br><br>
It is going to accept a __dataframe__ and a __category__(in our case a nutrient) and the algorithm will be as follows:
 1. Group the selected category by nutrition grade.
 2. Print the mean values across every grade to get a feel of the difference in the grade values.
 3. Run three __t-tests__ for the category with three different grade combinations.<br>In this case, we'll use the grades __A__, __B__ and __D__ (__D__ is the most common grade and I want to see how much the best grades differ from it).<br>I'll pass `equal_var=False`, because the grade groups have different sizes and there is no reason to assume that their variances are equal.<br>__Student's t-test__ - assumes equal variances (`equal_var=True`).<br>__Welch's t-test__ - allows unequal variances and unequal sample sizes (`equal_var=False`).<br>In both cases the null hypothesis $H_0$ is that the two populations have equal means.<br> We'll also use a threshold value of __1%__ for these cases. This means that:<br>- If the `pvalue` of all three tests is <=__1%__, the $H_0$ hypothesis will be rejected - the differences between the grades are significant.<br>- Otherwise we fail to reject $H_0$ - there isn't enough evidence that the differences are significant (which doesn't prove that the means are equal).
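To make the Welch variant less of a black box, here is a minimal sketch that computes its test statistic and degrees of freedom by hand with the standard library. The function name `welch_t` and the sample values are made up for illustration; `ttest_ind(..., equal_var=False)` additionally converts these two numbers into a p-value.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof

# Hypothetical fat values for two grades with different sizes and spreads.
grade_a_fat = [2.0, 3.0, 4.0, 3.0]
grade_d_fat = [18.0, 25.0, 22.0, 30.0, 20.0, 17.0]
t_stat, dof = welch_t(grade_a_fat, grade_d_fat)
print(t_stat, dof)
```

The variances are never pooled, which is why the test stays valid when one group is much larger or more spread out than the other.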

In [None]:
def group_by_grade_and_make_hypotheses(dataframe,category):
    group_result=category.groupby(dataframe.grade)
    print("Category mean by grade:\n{}".format(group_result.mean()))
    category_grade_a=group_result.get_group("a")
    category_grade_b=group_result.get_group("b")
    category_grade_d=group_result.get_group("d")
    hyp_ab = ttest_ind(category_grade_a,category_grade_b,equal_var=False)
    hyp_bd = ttest_ind(category_grade_b,category_grade_d,equal_var=False)
    hyp_ad = ttest_ind(category_grade_a,category_grade_d,equal_var=False)
    print("A->B:{}".format(hyp_ab.pvalue))
    print("B->D:{}".format(hyp_bd.pvalue))
    print("A->D:{}".format(hyp_ad.pvalue))
    if hyp_ab.pvalue <= 0.01 and hyp_bd.pvalue <= 0.01 and hyp_ad.pvalue <= 0.01:
        print("The differences in grades are significant. Reject H0.")
    else:
        print("There's not enough evidence to reject H0. Don't accept or reject anything else.")
    return (hyp_ab,hyp_bd,hyp_ad)

We've made our helper function. Now let's take a look at the results.<br>
First, we'll see how the __fat__ nutrient affects the grades.

In [None]:
(test_fat_result_ab,test_fat_result_bd,test_fat_result_ad)=group_by_grade_and_make_hypotheses(french_meat_data,french_meat_data.fat_g)

In [None]:
assert_is_not_none((test_fat_result_ab,test_fat_result_bd,test_fat_result_ad))

The differences in the grades in this case are astronomical! Could __fat__ be the main grade influencer?<br>Let's take a look at how the other nutrients have done in the tests.

In [None]:
(test_carb_result_ab,test_carb_result_bd,test_carb_result_ad)=group_by_grade_and_make_hypotheses(french_meat_data,french_meat_data.carbohydrates_g)

In [None]:
assert_is_not_none((test_carb_result_ab,test_carb_result_bd,test_carb_result_ad))

With the __carbohydrates__ there is some indication of influence between grades __B__ and __D__, but the other two tests have failed so we'll reject this as a factor.

In [None]:
(test_prot_result_ab,test_prot_result_bd,test_prot_result_ad)=group_by_grade_and_make_hypotheses(french_meat_data,french_meat_data.proteins_g)

In [None]:
assert_is_not_none((test_prot_result_ab,test_prot_result_bd,test_prot_result_ad))

With the __protein__ nutrient there seems to be almost no influence whatsoever so we'll also reject this as a factor.<br><br> With this information I think we can assume that, out of the three nutrients tested, __fat__ is the one that affects the nutrition grades the most!

## Step 6 - Draw a map of the first packaging locations of the French products

Just to be safe on our previous assumptions, I'll also draw a map, which will mark the places where the french products were packaged.<br> I'm going to do that using a helper drawing function and with the help of `Basemap`. The points to mark will be the values in the `[df_latitude,df_longitude]` columns. We'll also set some styles for the boundaries, coastlines and countries.

In [None]:
def draw_map_of_french_products(df_latitude,df_longitude,lat_lower_left,lon_lower_left,lat_upper_right,lon_upper_right,title):
    plt.figure(figsize = (12, 10))
    m = Basemap(projection = "merc", llcrnrlat = lat_lower_left, llcrnrlon = lon_lower_left, urcrnrlat = lat_upper_right, urcrnrlon = lon_upper_right)
    x, y = m(df_longitude.tolist(),df_latitude.tolist())
    m.plot(x,y,'o',markersize=1,color='red')
    m.drawcoastlines()
    m.drawcountries()
    m.fillcontinents(color = "lightgreen", lake_color = "aqua")
    m.drawmapboundary(fill_color = "aqua")
    plt.title(title)
    plt.show()

In [None]:
draw_map_of_french_products(world_food_data.fp_lat,world_food_data.fp_lon,-73,-180,80,180,"First packaging of the French products")

It seems that all the points are located in a single country. Let's zoom in a bit.

In [None]:
draw_map_of_french_products(world_food_data.fp_lat,world_food_data.fp_lon,20,-20,52,20,"First packaging of the French products zoomed")

This is indeed the territory of France and it would seem that the packaging coordinates are correct!
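The visual impression can be backed up with a simple numeric check. This is only a sketch: the bounding box for metropolitan France is a rough approximation that I'm assuming here, and the coordinates are made up rather than taken from `fp_lat` and `fp_lon`.

```python
import pandas as pd

# Approximate bounding box of metropolitan France (assumed values).
LAT_RANGE = (41.0, 51.5)
LON_RANGE = (-5.5, 10.0)

# Hypothetical packaging coordinates; the last point (Madrid) lies outside the box.
coords = pd.DataFrame({
    "fp_lat": [48.85, 43.30, 45.76, 40.42],
    "fp_lon": [2.35, 5.37, 4.84, -3.70],
})

inside = coords["fp_lat"].between(*LAT_RANGE) & coords["fp_lon"].between(*LON_RANGE)
print("Share of points inside the box: {:.0%}".format(inside.mean()))
```

A bounding box also covers parts of neighbouring countries, so a high share is a sanity check rather than proof.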

This concludes our Exploratory Data Analysis. We successfully managed to look into the nutrition values of the french products and compare our findings to fast food products and most importantly answer the questions that we laid out in the beginning. <br> I will present these answers in the conclusion, but first we'll move on to a bit of modelling before we end the project.