## ML for Prediction

All cross-validation (CV) code is in the appendix, and helper functions are in helper.py. Any code not shown here can be found in one of the two.

### Dividing the Data to Decisions

We focus on 3 decisions. Each decision gets its own portion of the train data, split so that the class balance is preserved:<br>
Feature Engineering - 70% of train data<br>
Feature Selection - 15% of train data<br>
Model Tuning (with CatBoost) - 15% of train data
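A minimal sketch of such a class-preserving split, assuming a pandas DataFrame with a `category` column. The function name `split_by_decision`, the seed and the toy data are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

def split_by_decision(data, fractions, label_col="category", seed=0):
    """Split row labels into named parts, keeping class proportions in each part."""
    rng = np.random.default_rng(seed)
    names = list(fractions)
    parts = {name: [] for name in names}
    for _, group in data.groupby(label_col):
        # Shuffle the rows of one class, then slice them by the requested fractions.
        idx = rng.permutation(group.index.to_numpy())
        start = 0
        for pos, name in enumerate(names):
            if pos == len(names) - 1:
                end = len(idx)  # the last part takes the remainder, so no row is dropped
            else:
                end = start + int(round(fractions[name] * len(idx)))
            parts[name].extend(idx[start:end].tolist())
            start = end
    return parts

# Hypothetical toy data: 20 candy rows and 40 chocolate rows.
toy = pd.DataFrame({"category": ["candy"] * 20 + ["chocolate"] * 40})
parts = split_by_decision(
    toy,
    {"feature_engineering": 0.70, "feature_selection": 0.15, "model_tuning": 0.15},
)
print({name: len(rows) for name, rows in parts.items()})
```

Because the slicing is done inside each class, every part keeps roughly the same class ratio as the full train set.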

## Feature Engineering

In this part we build a feature matrix from 4 data sets: food_train, food_nutrients, nutrients, and snack images.<br>
We use cross validation for features extracted from the following fields:<br>
- household fulltext<br>
- description<br>
- ingredients<br>
- brand<br>

We split the feature engineering data into 4 folds, one per field, and examine which meaningful features can be extracted from each.<br>

For other sources of data (e.g. images, nutrients), we settle for the work done in the exploratory phase, or for a single "plain vanilla" option.<br>

After these stages, we merge all resulting features and apply feature selection (see below).



### Food Nutrients Data Set

We add all nutrients as features.<br>
Because each decision is made additively, we start with the nutrients and only then check other features.<br>
We don't need to scale the nutrients because we will use a tree-based model.

#### # nutrients in snack

Let's take a look at the # nutrients per category.

In [1103]:
food_train['# nutrients'] = food_train[[col for col in food_train.columns if 'nutrient_' in col]].count(axis=1)

In [186]:
food_train_fe.groupby('category')['# nutrients'].mean()

category
cakes_cupcakes_snack_cakes              14.428518
candy                                   11.848182
chips_pretzels_snacks                   15.248447
chocolate                               13.709958
cookies_biscuits                        14.591782
popcorn_peanuts_seeds_related_snacks    14.788303
Name: # nutrients, dtype: float64

We will add # nutrients as a feature because it seems to carry information about the category separation: its average value differs between categories.

In [200]:
def get_nutrient_amount(data):
    data['# nutrients'] = data[[col for col in data.columns if 'nutrient_' in col]].count(axis=1)
    return data

#### # of nutrients per unit for each snack - Appendix

### Household full text

In this part we use the data division allocated to the household column. First we focus on cleaning this column.

In [1451]:
household_decision_df = food_train_fe[fold_indices[0]]

In [1454]:
print("#Unique values of household_serving_fulltext:",len(household_decision_df.household_serving_fulltext.unique()))

#Unique values of household_serving_fulltext: 971


In [1417]:
household_decision_df = clean_column(household_decision_df,'household_serving_fulltext') # clean_column on helper.py

Let's take a look at the values we get after cleaning the column.

In [741]:
print('# Unique values of household_serving_fulltext after cleaning: ',len(household_decision_df.household_serving_fulltext.unique()))

# Unique values of household_serving_fulltext after cleaning:  227


We see values such as 'a cake', so we will remove English stop words ('the', 'a', etc.).<br>
We also see both pop and pops, pack and packs, cake and cakes, etc.<br>
Therefore we will use stemming, which keeps only the root of each word.

In [606]:
# remove_stop_words and stemming functions are in helper.py
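Since helper.py is not shown here, the following is only a stand-in sketch of what stop word removal and stemming do to the values. The names `remove_stop_words_sketch` and `stemming_sketch`, the tiny stop word set and the naive plural-stripping rule are all assumptions; the real helpers may rely on a proper NLP library.

```python
# Tiny illustrative subset of English stop words (assumption, not the real list).
STOP_WORDS = {"a", "an", "the", "of", "and"}

def remove_stop_words_sketch(text):
    """Drop stop words from a whitespace-separated string."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def stemming_sketch(text):
    """Naive stand-in for a stemmer: strip one trailing 's' from longer words."""
    stemmed = []
    for word in text.split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        stemmed.append(word)
    return " ".join(stemmed)

values = ["a cake", "2 cakes", "pops", "pop", "the pack", "packs"]
cleaned = [stemming_sketch(remove_stop_words_sketch(v)) for v in values]
print(cleaned)            # ['cake', '2 cake', 'pop', 'pop', 'pack', 'pack']
print(len(set(cleaned)))  # 4 unique values instead of 6
```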

In [1420]:
household_decision_df['household_serving_fulltext'] = household_decision_df.household_serving_fulltext.map(lambda a: remove_stop_words(a))

In [1424]:
household_decision_df['household_serving_fulltext'] = household_decision_df.household_serving_fulltext.str.strip()

In [1426]:
print('# Unique values of household_serving_fulltext remove stop words: ',len(household_decision_df.household_serving_fulltext.unique()))

# Unique values of household_serving_fulltext remove stop words:  204


In [1427]:
household_decision_df['household_serving_fulltext'] = household_decision_df.household_serving_fulltext.map(lambda a: stemming(a))


In [1428]:
household_decision_df = clean_column(household_decision_df,'household_serving_fulltext')

In [1429]:
print('# Unique values of household_serving_fulltext after cleaning: ',len(household_decision_df.household_serving_fulltext.unique()))

# Unique values of household_serving_fulltext after cleaning:  171


We will settle for this level of cleaning for now: from 971 unique values down to 171.

After cleaning the household column, we want to decrease the number of levels (without cleaning each string by hand).<br>

We will check whether the following option is better:<br>
For each category we take the n top values of the household column, and those become our keywords.<br>
If a household value contains one of the keywords, we replace the value with that keyword.<br>
Different values of n can be checked; we choose 10, 20, 50, 70, 90.<br>
We take the top n words from each category so that class imbalance does not cause any class to be neglected.
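The keyword replacement described above can be sketched as follows. This is an illustration under assumptions: keywords are taken here as whole cleaned values (the helper's name suggests it may count words instead), and `top_n_values_by_category` and `replace_with_keyword` are hypothetical names, not the helpers from helper.py.

```python
import pandas as pd

def top_n_values_by_category(data, column, n, label_col="category"):
    """Union of the n most frequent values of `column` within each category."""
    keywords = set()
    for _, group in data.groupby(label_col):
        keywords.update(group[column].value_counts().index[:n])
    return keywords

def replace_with_keyword(value, keywords):
    """Replace a value by the longest keyword it contains, or leave it unchanged."""
    for keyword in sorted(keywords, key=len, reverse=True):
        if keyword in value:
            return keyword
    return value

# Hypothetical toy data, already cleaned and stemmed.
toy = pd.DataFrame({
    "category": ["candy", "candy", "candy", "chips", "chips", "chips"],
    "household_serving_fulltext": ["piec", "piec", "3 piec", "chip", "chip", "about 12 chip"],
})
keywords = top_n_values_by_category(toy, "household_serving_fulltext", n=1)
reduced = toy["household_serving_fulltext"].map(lambda v: replace_with_keyword(v, keywords)).tolist()
print(sorted(keywords), reduced)
```

Taking the top n per category, rather than the top n overall, keeps keywords from small classes in the list.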


In [1114]:
#we will do the same for n = [10,20,50,70,90]
houshold_n_list = [10,20,50,70,90]

In [1115]:
houshold_keywords_dict_top_n = {n: set(sum({category: get_column_top_n_word_count_by_category('household_serving_fulltext',category,household_decision_df,n) for category in categories}.values(), [])) for n in houshold_n_list}

In [1116]:
#adding those columns to the df
for n in houshold_n_list:
    household_decision_df[f'houshold_manual_top_{n}'] = household_decision_df.household_serving_fulltext.map(lambda a: replace_to_keyword(houshold_keywords_dict_top_n[n],a))

#### Checking the best n with CV - Code in the appendix

In [1122]:
fe_houshold_mean_kf_score = {key: np.array(kf_fe_dict_houshold[key]).mean() for key in kf_fe_dict_houshold.keys()}

In [1123]:
fe_houshold_mean_kf_score

{10: 0.6246121194853298,
 20: 0.6296513284421966,
 50: 0.619214331505093,
 70: 0.6183144033827843,
 90: 0.6183144033827843,
 'household_serving_fulltext': 0.6300113644457971}

-> Decided to leave household_serving_fulltext as it is.
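The comparison itself lives in the appendix. As a rough sketch of the idea, each candidate encoding can be scored with stratified k-fold accuracy and the best mean kept. The assumptions here are loud ones: scikit-learn's `DecisionTreeClassifier` stands in for the actual model, the one-hot encoding and the function `mean_cv_accuracy` are illustrative, and the toy data is invented.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def mean_cv_accuracy(data, feature_col, label_col="category", n_splits=4, seed=0):
    """Mean stratified k-fold accuracy of a tree trained on one categorical column."""
    X = pd.get_dummies(data[[feature_col]])  # one-hot encode the candidate column
    y = data[label_col]
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed)  # stand-in model (assumption)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

toy = pd.DataFrame({
    "category": ["candy"] * 8 + ["chips"] * 8,
    "raw": ["piec", "3 piec", "piec", "2 piec", "piec", "piec", "5 piec", "piec",
            "chip", "12 chip", "chip", "chip", "10 chip", "chip", "chip", "15 chip"],
})
# Candidate encoding: collapse every value onto the keyword it contains.
toy["top_1"] = toy["raw"].map(lambda v: "piec" if "piec" in v else "chip")

scores = {col: mean_cv_accuracy(toy, col) for col in ["raw", "top_1"]}
print(scores)
```

The option with the highest mean score is kept, which mirrors how the dictionary of mean fold scores above is read.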

#### Household unit

As shown in the exploratory analysis, the presence of a known measurement unit in the household column might be a good feature.

In [692]:
def add_is_houshold_unit(data):
    units_keywords = ['onz','cup', 'grm','tbsp','pouch','ounce','tsp']
    is_houshold_unit = data.household_serving_fulltext.map(lambda a: is_keywords_in(units_keywords,a))
    data['is_houshold_unit'] = is_houshold_unit
    return data
    

In [1433]:
household_decision_df = add_is_houshold_unit(household_decision_df)
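`is_keywords_in` comes from helper.py and is not shown. A plausible sketch, assuming it performs a plain substring check (the real helper may match whole tokens instead), is given below under the hypothetical name `is_keywords_in_sketch`.

```python
def is_keywords_in_sketch(keywords, text):
    """Return True if any keyword occurs as a substring of text."""
    return any(keyword in text for keyword in keywords)

units_keywords = ['onz', 'cup', 'grm', 'tbsp', 'pouch', 'ounce', 'tsp']
examples = ["1 cup", "3 piec", "2 tbsp"]
flags = [is_keywords_in_sketch(units_keywords, value) for value in examples]
print(flags)  # [True, False, True]
```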

### Description

In this part we use the data division allocated to the description column. First we focus on cleaning this column.

In [1128]:
description_decision_df = food_train_fe[fold_indices[1]]

In [1129]:
description_decision_df = process(description_decision_df,'description')

Let's check whether the number of words differs between the categories.

In [1130]:
description_decision_df['# description words'] = description_decision_df.description.str.count(' ') + 1

In [1131]:
description_decision_df.groupby('category')['# description words'].mean()

category
cakes_cupcakes_snack_cakes              3.789474
candy                                   4.019417
chips_pretzels_snacks                   4.460815
chocolate                               4.634494
cookies_biscuits                        4.289700
popcorn_peanuts_seeds_related_snacks    4.033309
Name: # description words, dtype: float64

It's not a big difference but it could be a feature...

In [794]:
full_description_word_list = get_full_description_word_list(description_decision_df) #helper.py
print('# Words in descriptions: ',len(full_description_word_list))
print('# Unique Words in descriptions: ',len(set(full_description_word_list)))

# Words in descriptions:  23124
# Unique Words in descriptions:  2725


We will check 2 options for extracting features from the description:
1. For each category we count the top n most used words in the description column, and those become our keywords.
    Each feature indicates whether the description contains a given keyword.
    We will check different values of n.
2. Using word2vec vectors as features. We will check different vocabulary sizes by filtering on the number of appearances.

#### Manual word count by category

Now we need to create, for each keyword, a boolean feature indicating whether the description contains it.

In [1]:
# functions on appendix
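The actual functions are in the appendix. As a hedged illustration of the idea, the sketch below collects the top n words per category and adds one boolean column per keyword. The names `top_words_by_category` and `add_keyword_flags`, the column naming scheme and the toy data are assumptions and may differ from `add_features_from_column`.

```python
from collections import Counter
import pandas as pd

def top_words_by_category(data, column, n, label_col="category"):
    """Union of the n most frequent words of `column` within each category."""
    keywords = set()
    for _, group in data.groupby(label_col):
        counts = Counter(word for text in group[column] for word in text.split())
        keywords.update(word for word, _ in counts.most_common(n))
    return sorted(keywords)

def add_keyword_flags(data, column, n):
    """Add one boolean column per keyword: does the text contain that word?"""
    data = data.copy()
    for word in top_words_by_category(data, column, n):
        data[f"{column}_{n}_{word}"] = data[column].map(lambda text, w=word: w in text.split())
    return data

# Hypothetical toy data, already cleaned and stemmed.
toy = pd.DataFrame({
    "category": ["candy", "candy", "chocolate", "chocolate"],
    "description": ["gummi bear", "sour gummi worm", "dark chocol bar", "milk chocol"],
})
flagged = add_keyword_flags(toy, "description", n=1)
print(flagged.filter(like="description_1_"))
```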

In [1132]:
description_n_list = [10,20,50,70,90,110]
for n in description_n_list:
    description_decision_df = add_features_from_column(description_decision_df,'description',n)

n =  10
#description_keywords 46
n =  20
#description_keywords 87
n =  50
#description_keywords 192
n =  70
#description_keywords 249
n =  90
#description_keywords 311
n =  110
#description_keywords 369


#### Word2Vec

We will use Word2Vec and check different vocabulary sizes.

In [None]:
n_list = [10,20,25,30,35,40,45]
# Getting the filtered list for each minimum number of appearances in n_list
full_description_filterd_dict = {n: get_filtered_list(full_description_word_list,n) for n in n_list}
# Remove the words in the df that don't appear on that list.
description_for_vec_dict = {n: description_decision_df.description.map(lambda a: remove_keywords_from_df_column(a,full_description_filterd_dict[n]," ")) for n in n_list}
# Getting Word2Vec model for each n
model_description_word2vec_dict = {n:  Word2Vec(description_for_vec_dict[n].to_list(), vector_size=40, window=15, min_count=5, workers=4,epochs=100) for n in n_list}
# Getting words of Word2Vec model for each n
words_description_word2vec_dict = {n: list(model_description_word2vec_dict[n].wv.index_to_key) for n in n_list}
# Getting vectors of Word2Vec model for each n
vectors_description_word2vec_dict = {n: np.asarray(model_description_word2vec_dict[n].wv.vectors) for n in n_list}
# Create a dictionary of mapping word to vector for each n  
description_word2vec_mapping_dict = {n: {words_description_word2vec_dict[n][i]: vectors_description_word2vec_dict[n][i] for i in range(len(words_description_word2vec_dict[n]))} for n in n_list}
# Map the Word2Vec vectors to each description word and average each coordinate over the words of each description.
description_vectors_dict= {n: description_for_vec_dict[n].map(lambda a: calc_mean_word2vec(a,description_word2vec_mapping_dict[n])) for n in n_list}
# Eventually add each averaged coordinate as a feature.
for n in n_list:
    df = pd.DataFrame(description_vectors_dict[n])
    description_decision_df[[f'description_{n}_filtered_vec_{i}' for i in range(0,40)]]  = pd.DataFrame(df.description.tolist(), columns=['description_'+str(i) for i in range(0,40)])

#### Check word2vec VS. manual word count with different size of n with CV - Code in the Appendix

In [1445]:
kf_fe_dict_description_mean

{'manual10': 0.8855522602619974,
 'manual20': 0.8889709834292783,
 'manual50': 0.8878920086253231,
 'manual70': 0.8862715228357368,
 'manual90': 0.8868117387278296,
 'manual110': 0.8880717028537386,
 'description': 0.8527991827959773,
 'word2vec10': 0.853878967033394,
 'word2vec20': 0.8547788951557026,
 'word2vec25': 0.855498157729442,
 'word2vec30': 0.8549587512708106,
 'word2vec35': 0.8547785713823182,
 'word2vec40': 0.8553186253877186,
 'word2vec45': 0.8544185353787178}

The description as it is got the lowest score.

In [1447]:
max(kf_fe_dict_description_mean.items(), key=operator.itemgetter(1))[0]

'manual20'

In [None]:
# manual 20 won!

### Ingredients

In this part we use the data division allocated to the ingredients column. First we focus on cleaning this column.

In [1144]:
ingredients_decision_df = food_train_fe[fold_indices[2]]

After going back and forth with the ingredients cleaning, we decided to:<br>
    - remove all text inside all brackets<br>
    - take the string that comes after "of:", "following:", "icing:", "including:", "less:", "ingredients:"<br>
    - group flour values that seem to be wheat flour<br>
    - more technical decisions

We ran word2vec on the ingredients, clustered the vectors with k-means, and got these flour values as a similar group (without 'unbleach flour', 'bleach flour').<br>
This strengthened our intuition from the exploratory phase to combine those words as 'flour' (see the group in the appendix).

In [875]:
#clean_ingredients and process_ingredients are at the helper.py

In [1146]:
ingredients_decision_df = procces_ingredients(ingredients_decision_df)

In [878]:
full_ingredients = get_full_ingredients_list(ingredients_decision_df)
print('#Ingredients in all snacks: ',len(full_ingredients))
print('# Unique Ingredients in all snacks: ',len(set(full_ingredients)))

#Ingredients in all snacks:  58087
# Unique Ingredients in all snacks:  5715


For ingredients we will extract:<br>
1. All ingredients embedded with word2vec. We will check different vocabulary sizes by filtering on the number of appearances.<br>
2. For each category we count the top n most used values in the ingredients column, and those become our keywords.<br>
    Each feature indicates whether the ingredients column contains a given keyword.<br>
    We will check different values of n.<br>
3. The first ingredient as a feature.<br>
We will check by CV which of these is the best feature to extract.<br>
In addition we will add the number of ingredients as a feature, as seen in the exploratory phase.

#### Manual count

In [None]:
# Do the same as for the description, with n = 10,20,50,70,90
n_list = [10,20,50,70,90]
for n in n_list:
    ingredients_decision_df = add_features_from_column(ingredients_decision_df,'ingredients',n)

#### First Ingredient

In [881]:
ingredients_decision_df.ingredients

2        corn starch,eggs,vegetable oil,leavening,natur...
12       sugar,flour,soybean oil,eggs,water,milk,modifi...
16       flour,whole wheat graham flour,cane sugar,vege...
20                               ginger,sugar,citric acid,
24       blend root vegetable,expeller pressed canoloil...
                               ...                        
30312    corn syrup,high oleic canoloil,soybean oil,bar...
30317                         potatoes,vegetable oil,salt,
30364                         potatoes,vegetable oil,salt,
30401               roasted chickpea,wasabi soy seasoning,
30439               native andean potatoe,palm oil,sesalt,
Name: ingredients, Length: 5557, dtype: object

In [1148]:
ingredients_decision_df['first_ingredient'] = ingredients_decision_df.ingredients.str.split(',',n=1,expand=True)[0]

In [884]:
print('unique first ingredient values in the dataset: ',len(set(ingredients_decision_df['first_ingredient'])))

unique first ingredient values in the dataset:  943


#### Top First Ingredient

It could be a good feature (see appendix) but it probably has too many levels. We will examine a few values of n for the number of top values to keep.

In [None]:
n_first_ingredient_list = [50,100,150,200,250]
for n in n_first_ingredient_list:
    top_values = get_top_values_by_categories(n,'first_ingredient',ingredients_decision_df)
    ingredients_decision_df[f'first_ingredient_{n}'] = ingredients_decision_df.first_ingredient.map(lambda a: 'other' if a not in top_values else a) 

#### Word2Vec on all ingredients

First we filter all ingredients by a minimum number of appearances.<br>
We will run Word2Vec for minimum appearances of 5, 10, 20, 25, 30, 35, 40, 45 and 50, and check by cross validation which gives the best result.
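The filtering step can be sketched as below. It assumes that "minimum appearances of n" means a count of at least n, and the names `filter_by_min_count` and `keep_known_tokens` are hypothetical stand-ins for `get_filtered_list` and `remove_keywords_from_df_column`, whose real behavior is defined in helper.py.

```python
from collections import Counter

def filter_by_min_count(all_tokens, min_count):
    """Vocabulary of tokens that appear at least `min_count` times."""
    counts = Counter(all_tokens)
    return {token for token, count in counts.items() if count >= min_count}

def keep_known_tokens(text, vocabulary, sep=","):
    """Split a row on `sep` and keep only tokens found in the vocabulary."""
    return [token for token in text.split(sep) if token in vocabulary]

# Hypothetical toy rows; note the trailing comma, as in the ingredients column.
rows = ["sugar,flour,salt", "sugar,cocoa butter", "flour,sugar,"]
all_tokens = [token for row in rows for token in row.split(",") if token]
vocabulary = filter_by_min_count(all_tokens, 2)
filtered = [keep_known_tokens(row, vocabulary) for row in rows]
print(sorted(vocabulary), filtered)
```

Each filtered row is a list of tokens, which is the sentence format that Word2Vec training expects.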

In [859]:
def calc_mean_word2vec(column,word2vec_mapping_dict):  
    if column != []:
        result = [word2vec_mapping_dict[i].astype('float') for i in column]
        result =  np.mean(result,axis = 0 ).astype('float')
        return result
    return [np.nan for i in range(0,40)]

In [None]:
n_list = [5,10,20,25,30,35,40,45,50]
# Getting the filtered list for each minimum number of appearances in n_list
full_ingredients_filterd_dict = {n: get_filtered_list(full_ingredients,n) for n in n_list}
# Remove the words in the df that don't appear on that list.
ingredients_for_vec_dict = {n: ingredients_decision_df.ingredients.map(lambda a: remove_keywords_from_df_column(a,full_ingredients_filterd_dict[n],",")) for n in n_list}
# Getting Word2Vec model for each n
model_ingredients_word2vec_dict = {n:  Word2Vec(ingredients_for_vec_dict[n].to_list(), vector_size=40, window=15, min_count=5, workers=4,epochs=100) for n in n_list}
# Getting words of Word2Vec model for each n
words_ingredients_word2vec_dict = {n: list(model_ingredients_word2vec_dict[n].wv.index_to_key) for n in n_list}
# Getting vectors of Word2Vec model for each n
vectors_ingredients_word2vec_dict = {n: np.asarray(model_ingredients_word2vec_dict[n].wv.vectors) for n in n_list}
# Create a dictionary of mapping word to vector for each n  
ingredients_word2vec_mapping_dict = {n: {words_ingredients_word2vec_dict[n][i]: vectors_ingredients_word2vec_dict[n][i] for i in range(len(words_ingredients_word2vec_dict[n]))} for n in n_list}
#Map the Word2Vec vectors to each ingredient and calculate the average of each coordinate in the vectors of each sentence. 
ingredients_vectors_dict= {n: ingredients_for_vec_dict[n].map(lambda a: calc_mean_word2vec(a,ingredients_word2vec_mapping_dict[n])) for n in n_list}
# Eventually add each averaged coordinate as a feature.
for n in n_list:
    df = pd.DataFrame(ingredients_vectors_dict[n])
    ingredients_decision_df[[f'ingredients_{n}_filtered_vec_{i}' for i in range(0,40)]]  = pd.DataFrame(df.ingredients.tolist(), columns=['ingredients_'+str(i) for i in range(0,40)])

#### CV between ingredients options - code in the appendix

In [1161]:
fe_ingredients_mean_kf_score = {key: np.array(kf_fe_dict_ingredients[key]).mean() for key in kf_fe_dict_ingredients.keys()}

In [1162]:
fe_ingredients_mean_kf_score

{'manual10': 0.8599913228732946,
 'manual20': 0.8594509450945095,
 'manual50': 0.8711480680442145,
 'manual70': 0.8702484636952905,
 'manual90': 0.871148391817599,
 'word2vec5': 0.857832078171846,
 'word2vec10': 0.8571116823912608,
 'word2vec20': 0.8581921141754464,
 'word2vec25': 0.8581922760621387,
 'word2vec30': 0.858191952288754,
 'word2vec35': 0.8571128155981066,
 'word2vec40': 0.8576518982833535,
 'word2vec45': 0.8571126537114143,
 'word2vec50': 0.8587320062941546,
 'first_ingredient_50': 0.8718681400514153,
 'first_ingredient_100': 0.8729472767420626,
 'first_ingredient_150': 0.8718674925046461,
 'first_ingredient_200': 0.8738480142978327,
 'first_ingredient_250': 0.8720483199399076}

In [1163]:
max(fe_ingredients_mean_kf_score.items(), key=operator.itemgetter(1))[0]

'first_ingredient_200'

First Ingredient with 200 top levels is chosen!

#### Number of ingredients

In [895]:
ingredients_decision_df['# ingredients'] = ingredients_decision_df.ingredients.str.count(',') + 1

In [896]:
ingredients_decision_df.groupby('category')['# ingredients'].mean()

category
cakes_cupcakes_snack_cakes              22.611360
candy                                   11.420659
chips_pretzels_snacks                    9.390578
chocolate                                9.889236
cookies_biscuits                        13.334812
popcorn_peanuts_seeds_related_snacks     6.448557
Name: # ingredients, dtype: float64

It could be a good feature: we can clearly see a difference in # ingredients between the categories.

### Brand

In this part we use the data division allocated to the brand column. First we focus on cleaning this column.

In [None]:
brand_decision_df = food_train_fe[fold_indices[3]]

In [1179]:
brand_decision_df = clean_brand(brand_decision_df)# helper.py

In [948]:
print("# Unique Brands:", len(brand_decision_df.brand.unique()))

# Unique Brands: 1900


We will examine 2 options for extracting features from the brand:<br>

1. We keep the brands that are in the top n of each category and map all other brands to an 'other' level, to decrease the number of levels.<br>
We will check different values of n.<br>
2. We produce keywords from the top n most frequent words in each category.<br>
The motivation is recurring words such as chocolate, bakery and candy inside the brand names.<br>
We will check different values of n.<br>

#### Top Values

In [949]:
def get_top_values_by_categories(n,column,data):
    print("n = ", n)
    result = []
    for category in categories:
        result.extend(data[column][data.category == category].value_counts().index.get_level_values(0).to_list()[0:n])
    result = list(set(result))
    print("# values", len(result))
    return result

In [950]:
n_brand_list = [50,100,150,200,250]

In [951]:
n_brand_list

[50, 100, 150, 200, 250]

In [None]:
n_brand_list = [50,100,150,200,250]
for n in n_brand_list:
    top_values = get_top_values_by_categories(n,'brand',brand_decision_df)
    brand_decision_df[f'brand_top_{n}'] = brand_decision_df.brand.map(lambda a: 'other' if a not in top_values else a) 

#### Manual Word Count

We see words such as candy and chocolate in the brand names that might imply the snack category.
To do a word count we have to clean the brand column further, with stemming and stop word removal.
We will use our 'process' function.

In [1181]:
brand_decision_df = process(brand_decision_df,'brand')

In [None]:
brand_top_n_list = [10,20,50,70,90,110]
for n in brand_top_n_list:
    brand_decision_df = add_features_from_column(brand_decision_df,'brand',n)

#### Check top n brands VS. top n word count in CV - code in the appendix

In [1187]:
fe_brand_mean_kf_score = {key: np.array(kf_fe_dict_brand[key]).mean() for key in kf_fe_dict_brand.keys()}

In [1188]:
fe_brand_mean_kf_score

{'brand_top_50': 0.8673701183067948,
 'brand_top_100': 0.8715095610280452,
 'brand_top_150': 0.8709691832492602,
 'brand_top_200': 0.8697090572366589,
 'brand_top_250': 0.8718678162780307,
 'word_count_top_10': 0.8652108736053462,
 'word_count_top_20': 0.8675504600819794,
 'word_count_top_50': 0.8659311074992392,
 'word_count_top_70': 0.8657501181772854,
 'word_count_top_90': 0.8675501363085949,
 'word_count_top_110': 0.8668303880747785}

In [1189]:
max(fe_brand_mean_kf_score.items(), key=operator.itemgetter(1))[0]

'brand_top_250'

Top 250 values from each category of brand is chosen.

### grm Serving Size 

We create a serving_size column in grm: where the serving size unit is ml, we multiply the serving size by 1000.<br>
There is no need to scale the data because we use a tree-based model.

### Image feature extraction

To extract features from the given snack images (per category), we train an independent CNN model on the portion of the train set dedicated to feature engineering (this portion was selected to preserve the class proportions of the entire set).<br> The penultimate layer of the stand-alone model outputs a 128-dimensional vector, which is the input to the final softmax layer.<br> This vector represents information in the image that is useful for the required classification.<br> We will use this representation as additional features in the top-level model.

#### CNN Implementation

We use Keras's ImageDataGenerator to load the training data.<br> We use the default network architecture shown below, with a learning rate of 0.001, a batch size of 128 and the RMSprop optimizer.<br>
All of these are hyperparameters that could be optimized; however, we restrict ourselves to one option.

In [None]:
batch_size = 128
pic_dim = 140
train_datagen = ImageDataGenerator(rescale=1/255)
train_generator = train_datagen.flow_from_directory(
        fe_train_dir,  
        target_size=(pic_dim, pic_dim),  
        batch_size=batch_size,        
        classes = categories,
        class_mode='categorical')

In [None]:
model = tf.keras.models.Sequential([

    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(pic_dim, pic_dim, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The third convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The fourth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The fifth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # Flatten the results to feed into a dense layer
    tf.keras.layers.Flatten(),
    # 128 neuron in the fully-connected layer
    tf.keras.layers.Dense(128, activation='relu'),
    # 6 output neurons for 6 classes with the softmax activation
    tf.keras.layers.Dense(6, activation='softmax')
])

In [None]:
import scipy
from tensorflow.keras.optimizers import RMSprop

print(scipy.__version__)  # sanity check that SciPy is installed
total_sample = train_generator.n
n_epochs = 30

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),
              metrics=['acc'])
history = model.fit(
        train_generator, 
        steps_per_epoch=int(total_sample/batch_size),  
        epochs=n_epochs,
        verbose=1)


The resulting model is then cropped using Keras's pop method, leaving a model whose output is a 128-dimensional vector.<br> We then use this new model to extract features on the test set.

In [None]:
from keras.models import Model
from keras.models import load_model
from keras import layers

model.pop()
model.summary()
# Model(inputs=model.inputs, outputs=model.layers[-1].output).summary()

Some of the images are black and white, so we use Keras's augmentation functions to standardize the input as well as possible (making sure all images are converted to tensors of valid dimensions).


In [None]:
#is_good_image function on appendix

## Feature Selection

At this point we have 270 features and we would like to examine their importance.<br> We will perform 5-fold CV on a new training set dedicated to this decision, with the features we created in the feature engineering part.<br> We will use CatBoost's feature importance.

In [1210]:
food_train_fs =  food_train.loc[food_train['idx'].isin(feature_selection_index)]

In [1211]:
food_train_fs.shape # contain category and idx

(4764, 272)

The average accuracy over the 5 folds of the feature selection data (code in the appendix):

In [1215]:
np.mean([report_dict[i]['accuracy'] for i in range(0,5)])

0.5619503205269515

In [1216]:
[report_dict[i]['accuracy'] for i in range(0,5)]

[0.18782791185729275,
 0.8048268625393494,
 0.7565582371458552,
 0.3672612801678909,
 0.6932773109243697]

In [1217]:
print("# Features:", len(columns_list))

# Features: 270


The accuracy is pretty bad; we probably overfitted in the feature engineering decisions.<br>
It may also be related to the order of our additive decisions, which is in a sense canceled here: in this part we do not assume any feature in advance and simply choose the best features, whereas in the feature engineering part each decision assumed the features added before it.<br>

We will choose n features out of all the features by counting how many times each feature appeared in the top 100 features across the folds.<br> We will choose features that appeared in at least 4 folds.
We know that 100 is chosen without validation, but we want to avoid making several decisions on the same data set, so we take this guess.

In [1218]:
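# NOTE: sorted() is ascending, so the first keys of each dict are the lowest-importance features;
# passing reverse=True would rank the most important features first.
sorted_fe_dicts = [dict(sorted(report_dict[i]['feature_importance'].items(), key=lambda item: item[1])) for i in range(0,5)]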

In [1219]:
lists_of_most_important_features_per_fold = [list(sorted_fe_dicts[i].keys())[0:100] for i in range(0,5)]

In [1220]:
merged = [l for f in lists_of_most_important_features_per_fold  for l in f]

In [1221]:
from collections import Counter
fe_count = Counter(merged)

In [1230]:
top_features = [i[0] for i in fe_count.most_common(100) if i[1] > 3] #no more than 100 features that appear in at least 4 folds
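A toy run of the same vote counting may make the thresholds concrete: `> 3` keeps features that appear in at least 4 of the 5 per-fold lists, and `> 2` keeps those that appear in at least 3. The feature names and per-fold lists below are invented for illustration.

```python
from collections import Counter

# Hypothetical per-fold lists of selected features (5 folds).
top_lists = [
    ["brand", "n_ingredients", "first_ingredient"],
    ["brand", "n_ingredients", "serving_size"],
    ["brand", "first_ingredient", "serving_size"],
    ["brand", "n_ingredients", "description_len"],
    ["brand", "n_ingredients", "first_ingredient"],
]

# Count in how many folds each feature was selected.
votes = Counter(name for fold_list in top_lists for name in fold_list)

at_least_4_folds = [name for name, count in votes.most_common() if count > 3]
at_least_3_folds = [name for name, count in votes.most_common() if count > 2]
print(at_least_4_folds)  # ['brand', 'n_ingredients']
print(at_least_3_folds)  # ['brand', 'n_ingredients', 'first_ingredient']
```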

In [1343]:
# top_features_beyoned_2 = [i[0] for i in fe_count.most_common(100) if i[1] > 2]
# pd.DataFrame(top_features_beyoned_2).to_csv('top_features_beyoned_2.csv')

In [1231]:
len(top_features)

79

This feature list seems OK; let's see the result in the model tuning part.

## Model Tuning

In this part we apply grid search over the CatBoost parameters. We use GridSearchCV and choose the best parameters on a new training set, with the features selected in the feature selection part.
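The real search is in the appendix. As a hedged sketch of the mechanics only, the example below runs `GridSearchCV` over the same three parameter names on synthetic data. scikit-learn's `GradientBoostingClassifier` is used as a stand-in for CatBoost, and the grid values are deliberately small placeholders, not the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the model tuning portion of the train set.
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Placeholder grid (assumption): same parameter names, small values to keep it fast.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [20, 40],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for the CatBoost model
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Every parameter combination is scored by cross validation, and the combination with the best mean score is reported in `best_params_`.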

In [1234]:
mt_columns = top_features + ['category','idx']
food_train_mt = food_train_mt[[col for col in food_train_mt.columns if col in mt_columns ]]

'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 400 are chosen! (code in the appendix)

#### Preparing the Full Model

We need to apply the features we created and chose to the test set, and train our model on the entire train set (the image model trained on the entire train set is loaded in the appendix).
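One point worth illustrating is that levels learned on the train set should be reused as-is on the test set, with unseen values mapped to 'other'. The sketch below shows this for a top-n brand encoding; the function names `fit_top_levels` and `apply_top_levels` and the toy data are assumptions, not the notebook's code.

```python
import pandas as pd

def fit_top_levels(train, column, n, label_col="category"):
    """Learn the union of the n most frequent values per category on the train set."""
    levels = set()
    for _, group in train.groupby(label_col):
        levels.update(group[column].value_counts().index[:n])
    return levels

def apply_top_levels(data, column, levels):
    """Map values outside the learned levels to 'other' (works for train or test)."""
    return data[column].map(lambda value: value if value in levels else "other")

# Hypothetical toy train and test sets.
train = pd.DataFrame({
    "category": ["candy", "candy", "candy", "chocolate", "chocolate"],
    "brand": ["acme", "acme", "zed", "choco co", "choco co"],
})
test = pd.DataFrame({"brand": ["acme", "new brand", "choco co"]})

levels = fit_top_levels(train, "brand", n=1)
test_brand = apply_top_levels(test, "brand", levels).tolist()
print(test_brand)  # ['acme', 'other', 'choco co']
```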

#### Top Features Model

The selected features are those that appeared in the top 100 features in 4 or more folds in the feature selection part.

In [1513]:
len(top_features)

79

#### Top features model on the training set

In [1492]:
preds_class_train_set = model.predict(X)

In [1495]:
accuracy_score(food_train['category'], preds_class_train_set)

0.4269786778369185

Not the best result, let's see what happens if we are using all the features.

#### All features model

In [1512]:
len(columns_list)

270

#### All features model  predictions on the training set

In [1510]:
accuracy_score(food_train['category'], preds_class_train_set_all_features)

0.9474032313942868

The result on the training set is much better, but it might be because of overfitting...
We can't know which of the models will perform better on the test set, so the predictions of both models will be our submissions.
We will try one more model, using the top features with a lower threshold.

#### Top features with lower threshold model

The selected features are those that appeared in the top 100 features in 3 or more folds in the feature selection part.

In [1511]:
len(top_features_beyoned_2)

97

In [1515]:
accuracy_score(food_train['category'], preds_class_train_set_top_features_beyoned_2)

0.5415262511416963

The result is better than with the top features; again, we can't know which model will perform best on the test set.