### Hey, everyone, I practiced some feature engineering, and I hope that my thoughts and explanations might be helpful for others.
When I began to study machine learning, the mathematical models I was taught seemed pretty mighty and robust, but this changed after I took my first steps in real-world data analytics. I began to understand that even the best and most complex models perform poorly on weak datasets.

Not only do we have to encode the input data for our machine learning models, we also have to wrap our minds around the data and uncover further valuable information in it. This allows us to build a strong foundation of data for our machine learning models.

**Note:** The following feature engineering methods have to be applied to both the train and the test data. One **robust and error-resistant way of applying feature engineering to both data sets** is using sklearn Pipelines. Feel free to take a look at my [Pipelining Notebook on kaggle](https://www.kaggle.com/milankalkenings/no-pipelines-you-are-probably-doing-it-wrong).
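
To make the note above a bit more concrete, here is a minimal sketch of such a pipeline. The toy train and test frames are invented for illustration and are not part of the ramen dataset; the two steps (most-frequent imputation, then one-hot encoding) are just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy data, invented for illustration only
train = pd.DataFrame({'Style': ['Pack', 'Cup', np.nan, 'Pack']})
test = pd.DataFrame({'Style': [np.nan, 'Bowl']})

style_pipeline = Pipeline(steps=[
    # fill missing styles with the most frequent style seen during fit
    ('impute', SimpleImputer(strategy='most_frequent')),
    # one indicator column per style seen during fit;
    # styles that only occur in test are encoded as all zeros
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# everything is learned on train only ...
train_encoded = style_pipeline.fit_transform(train).toarray()
# ... and merely applied to test
test_encoded = style_pipeline.transform(test).toarray()
```

The point of the design is that `fit_transform` is only ever called on the train data, so the test data cannot leak into the imputation value or the set of encoded categories.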


Take a look at my [Comprehensive Tutorial: Feature Engineering](https://www.kaggle.com/milankalkenings/comprehensive-tutorial-feature-engineering), if this first glance at the topic arouses your interest in feature engineering.

imports..

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

! pip install -q country_converter
import plotly.express as px
import country_converter as co

from nltk.stem import WordNetLemmatizer

df = pd.read_csv('../input/ramen-ratings/ramen-ratings.csv')

In [None]:
df.head()

***
## How can we prepare this data for machine learning algorithms?
After obtaining an overview, we should inspect each column in our dataframe and decide the following:
* [How do we handle missing values in the column (if there are any) ?](#sec2)
* [Can we simplify the column without losing too much information ?](#sec2)
* [How do we handle outliers/errors in the column (if there are any) ?](#sec3)
* [Can we uncover any further valuable information from this column and save it in other columns ?](#sec4)
* [How do we encode this column for our machine learning algorithms?](#sec5)

I will give you an example for each of these techniques.
***

<a id="sec1"></a>
# 1. Overview
We should always get some first impressions of our dataset before treating critical columns individually.

In [None]:
# the rate of null values per column
nulls_per_col = df.isna().sum(axis=0) / len(df.index)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 5))
    
nulls_per_col.plot(kind='bar', color='steelblue', x=nulls_per_col.values, y=nulls_per_col.index, ax=ax, 
                       width=1, linewidth=1, align='edge', edgecolor='steelblue', label='Null value rate')
    
    
# centered labels
labels=df.columns
ticks = np.arange(0.5, len(labels))
ax.xaxis.set(ticks=ticks, ticklabels=labels)

# workaround to visualize very small amounts of null values per col
na_ticks = ticks[(nulls_per_col > 0) & (nulls_per_col < 0.05)]
if (len(na_ticks) > 0):
    ax.plot(na_ticks, [0,]*len(na_ticks), 's', c='steelblue', markersize=10, 
            label='Very few missing values')
    

ax.set_ylim((0,1))
ax.legend()
fig.suptitle('Null Value Rate per Column', fontsize=30, y=1.05)
fig.tight_layout()

As we can see, the 'Style' column contains very few missing values and the 'Top Ten' column consists mostly of missing values. Let's get some further insights.

In [None]:
df.describe(include='all')

Interestingly, the value '\n' is the most common value in the column 'Top Ten'. We will have to deal with this later on. Moreover, since 'Chicken' occurs merely 7 times but it is still the most common variety, we can expect a large number of different values in this column.
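
The reasoning about the most common value can also be checked directly. This is a small sketch on an invented toy frame (not the ramen data): a column whose most common value covers only a small share of the rows must contain many different values.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    'Variety': ['Chicken', 'Chicken', 'Beef', 'Miso', 'Curry'],
    'Style': ['Pack', 'Pack', 'Cup', 'Pack', 'Pack'],
})

summary = pd.DataFrame({
    # number of distinct values per column
    'unique': toy.nunique(),
    # share of rows covered by the most common value of each column
    'top_share': toy.apply(lambda col: col.value_counts(normalize=True).iloc[0]),
})
```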

<a id="sec2"></a>
# 2. Handling missing values and Simplifying columns
We have already identified which columns contain missing values. Let's get some insights and handle the missing values accordingly.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,12))
styles_oc = df['Style'].value_counts()
styles_oc['others'] = styles_oc[-4:].sum()
styles_oc = styles_oc.drop(labels=styles_oc.index[-5:-1])
styles_oc.plot(kind='pie', ax=ax, colormap='cividis', rotatelabels = 270)
plt.show()

As we can see, the most common value is 'Pack'. I guess it will not be too harmful if we replace the missing values within this column with 'Pack', since this column doesn't contain a high number of missing values and 'Pack' is by far **the most common value**. Replacing missing categorical values with the most common value (the mode) is the counterpart of replacing missing numerical data with the column mean or column median.
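
Instead of typing the most common value by hand, we could also let pandas look it up. A minimal sketch on an invented toy column (not the ramen data):

```python
import numpy as np
import pandas as pd

# toy column, invented for illustration only
styles = pd.Series(['Pack', 'Cup', 'Pack', np.nan, 'Bowl', 'Pack'], name='Style')

# mode() ignores missing values by default and returns the most common value(s)
most_common = styles.mode().iloc[0]
styles_filled = styles.fillna(most_common)
```

In a real project, the mode should be computed on the train data only and then reused for the test data.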

In [None]:
df['Style'] = df['Style'].fillna(value='Pack')

### Let's take a look at the column 'Top Ten' to decide how to treat the missing values in that column.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,12))
# the '\n' entries seem to be errors from data collection, so we treat them as missing.
# np.nan is passed explicitly: with value=None, some pandas versions forward-fill instead.
df['Top Ten'] = df['Top Ten'].replace(to_replace='\n', value=np.nan)
styles_oc = df['Top Ten'].value_counts()
styles_oc.plot(kind='pie', cmap='twilight', rotatelabels = 270, labeldistance= 1.01, ax=ax)
plt.show()

Okay, as we can see, after dropping the invalid values, there are **solely the top 10 soups of the covered years** stored inside this column. It wouldn't make any sense to replace the missing values with e.g. '2012 #10'.

**Usually, I would simply remove this column, because the ranking depends on the target and thus will not be available on test. Moreover, its null value rate is incredibly high. I will perform some further steps on that column for educational reasons, and pretend that this feature might be available on test. In that case, this column would be a very strong indicator for high star ratings, and thus I wouldn't necessarily discard it if further analysis revealed a significant importance of this feature.**


I think we should just replace the Null values with '0', and convert all values within the column to integers.
We will lose the year in which they achieved this rank, but according to the dataset description, the column 'Review #' already contains time data, and thus we don't lose much information. 

I apply this method because we lose almost no information; keeping the year would require another column, which would make the data more complex.
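
To illustrate the conversion described above, here is a small sketch on invented toy values that follow the 'year #rank' pattern of the column. It uses a regular expression instead of a fixed slice position, which is my own variation and not what the next cell does; it also shows the year we decide to discard.

```python
import numpy as np
import pandas as pd

# toy values, invented for illustration only
top_ten = pd.Series(['2012 #10', np.nan, '2016 #1', np.nan, '2013 #6'])

# the named groups become the column names of the result
parts = top_ten.str.extract(r'(?P<year>\d{4}) #(?P<rank>\d+)')

# missing ranks become 0, i.e. 'not in any top ten'
rank = parts['rank'].fillna('0').astype(np.int8)
```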

In [None]:
# convert to ranks
df['Top Ten'] = df['Top Ten'].str.slice(start=6)
df['Top Ten'] = df['Top Ten'].fillna('0')

# convert rank strings to integers
df['Top Ten'] = df['Top Ten'].astype(np.int8)
df

For further information and explanations on **handling missing data**, I recommend taking a look at this notebook:


https://www.kaggle.com/milankalkenings/wine-reviews-data-cleaning

Let's take a look at the other columns of the dataset.

<a id="sec3"></a>
# 3. Handling errors
The column 'Country' contains information about the location of the Ramen shops. **I guess there might be some differences in taste around the world**, and thus I think this column is pretty important for predicting the star rating of a soup. Let us get some information about the distribution of reviews across the globe. To plot it, we have to convert the country names to their ISO3 representation.

In [None]:
country_raw = df['Country']
# this might take a while
country_converted = country_raw.apply(lambda x: co.convert(x, to='ISO3'))

Let's find out whether all strings have been converted automatically.

In [None]:
not_found = country_raw[country_converted=='not found'].unique()
print(f'{not_found} haven\'t been converted automatically')

### We found the following errors:
* Dubai is a city in the United Arab Emirates (ARE)
* Holland is a region in the Netherlands and is often confused with the Netherlands (NLD)
* Sarawak is a Malaysian state (MYS)
* UK is the abbreviation for United Kingdom (GBR)

We have to **fix this manually**, which shouldn't be a big deal, since we only have to replace 4 values:

In [None]:
# manual fixes for the names that country_converter could not resolve
manual_iso3 = {'UK': 'GBR', 'Dubai': 'ARE', 'Holland': 'NLD', 'Sarawak': 'MYS'}

def convert_man(country):
    '''
    country: a string which should be a country name
    
    returns the ISO3 code for the names that weren't converted
    automatically, and the unchanged name otherwise.
    '''
    return manual_iso3.get(country, country)


raw_fixed = country_raw.apply(convert_man)
# ISO3 codes pass through unchanged, all other names are converted as before
converted_fixed = raw_fixed.apply(lambda x: co.convert(x, to='ISO3'))
df['Country'] = converted_fixed

In [None]:
# rev := reviews, co := countries
# count the reviews per country
rev_co_df = (df.groupby('Country')['Review #'].count()
               .rename_axis('country')
               .reset_index(name='reviewers'))

We can finally plot the reviewers per country.

In [None]:
fig = px.choropleth(rev_co_df, locations="country",
                    color="reviewers",
                    color_continuous_scale='oranges',
                    title="Reviewers per Country",
                   )
fig.show()

This plot is inspired by 

https://www.kaggle.com/heyytanay/beginner-s-eda-notebook

Most reviews were made either in North America or in East Asia.

<a id="sec4"></a>
# 4. Generating further valuable information from one column
I wonder which words are used to describe Ramen and whether this information is related to the reviewer's country or the rating of the soup. Let's plot the most frequent words in the 'Variety' column in a wordcloud.

In [None]:
df['Variety'] = df['Variety'].str.lower()
descriptions = ' '.join(df['Variety'])
# add some custom stopwords:
custom_stop_words = ['noodle', 'soup', 'instant', 'flavor', 'flavour'] 
stop_words = list(STOPWORDS) + custom_stop_words

#plot the wordcloud
plt.figure(figsize=(15,10))
wordcloud_ramen = WordCloud(stopwords=stop_words, 
                            max_font_size=80, max_words=160, 
                            width=600, height=400, 
                            colormap='inferno', background_color='white'
                           ).generate(descriptions)
plt.imshow(wordcloud_ramen, interpolation='bilinear')
plt.axis('off')
plt.savefig('wordcloud.png')
plt.show()

This wordcloud allows us to gain some feeling of the most used ingredients and styles of Ramen just by looking at it. 

Besides giving some insights (and making me hungry), this plot inspires me to create some binary features from the 'Variety' column. Each of them should indicate whether a variety contains one of the most frequent words occurring in the variety descriptions.



### Let's use some Lexical Processing to obtain meaningful features
Obviously, it wouldn't make any sense to create a new column containing the information whether the variety value contains the words 'the', 'a' or symbols. Likewise, we should merge redundant values ('noodles' and 'noodle') and remove values that obviously occur, like 'flavour'. Hence, we should use a **stemmer/lemmatizer** and a customized list of stopwords. 


Moreover, we should not create too many new columns, in order to evade the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). 

In [None]:
wnl = WordNetLemmatizer()
descriptions_splitted = pd.Series(descriptions.split())
descriptions_lemmatized = descriptions_splitted.apply(wnl.lemmatize)

# for simplicity, let's hardcode the initial number of new features as 20
common = descriptions_lemmatized.value_counts()[:20]


# we don't want to have any stopwords as features
custom_stops = ['flavour', 'flavor', 'cup', 'soup', '&']
stopwords = custom_stops + list(STOPWORDS)
stop_labels = common[np.isin(common.index, stopwords)].index
common = common.drop(labels=stop_labels)
common.index

These keywords seem to be helpful. Let's take these keywords as features.

In [None]:
new_cols = common.index
temp = np.empty([len(df.index), len(new_cols)], dtype=np.int8)

def fill(row):
    '''
    row: a row of the dataframe
    
    stores whether the 'Variety' of the row contains the (lemmatized) keywords.
    '''
    # we have to lemmatize each row, since we lemmatized the names of the new columns
    row_lemmatized = pd.Series(row['Variety'].split()).apply(wnl.lemmatize)
    temp[row.name] = np.isin(new_cols, row_lemmatized)

    

# row-wise:
df.apply(fill, axis=1)

new_cols = pd.DataFrame(temp, columns=common.index)
df_augmented = pd.concat([df, new_cols], axis=1)
df_augmented.head()

We generated a number of features from the most common keywords in our 'Variety' column. This technique is often called **Feature Splitting**. 
Note that I chose myself how many of those 'most common' keywords I used in this section. In real-world applications, we should vary this number of keywords and treat it like a **hyperparameter**. 
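
One way to treat the number of keywords like a hyperparameter is to wrap the feature generation in a function. The following is only a sketch: `keyword_features` is a hypothetical helper that does not appear elsewhere in this notebook, the toy varieties are invented, it skips the lemmatization step, and it removes the stopwords before picking the most common words (the cells above do it afterwards).

```python
from collections import Counter

import pandas as pd

def keyword_features(texts, n_keywords, stop_words=()):
    """Return 0/1 indicator columns for the n_keywords most frequent words."""
    tokens = texts.str.lower().str.split()
    # count every word that is not a stopword
    counts = Counter(tok for row in tokens for tok in row if tok not in stop_words)
    keywords = [word for word, _ in counts.most_common(n_keywords)]
    # one column per keyword: 1 if the text contains the word, else 0
    data = {word: tokens.apply(lambda row, w=word: int(w in row)) for word in keywords}
    return pd.DataFrame(data, index=texts.index)

# toy varieties, invented for illustration only
varieties = pd.Series(['Spicy Chicken Noodle', 'Chicken Curry',
                       'Spicy Beef Noodle', 'Miso Ramen'])
features = keyword_features(varieties, n_keywords=2, stop_words={'noodle'})
```

With such a function, different values of `n_keywords` can be compared during model selection instead of being hardcoded.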


<a id="sec5"></a>
# 5. Column encoding
We have to convert string data to numerical data. There are different methods for this; these are the most common ones: 

* We can create **indicator variables**, as we did for the new columns we generated from 'Variety'. For columns with many distinct values, this method adds many new columns and can noticeably increase the runtime of the algorithm. One famous way of creating indicator variables is **One-Hot Encoding**.

* We can **factorize** the column and replace each distinct value with a unique integer. The drawback is that the resulting label encoding suggests an order between the different values, even if there is none, since they are solely encoded as integers within the same column.


I will apply the second method to our dataset in order to keep the data less complex, since the number of distinct values in these columns is pretty high.
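
To see the difference between the two methods, here is a small sketch on an invented toy column (not the ramen data). It also shows that `pd.factorize` returns the unique values, which is all we need to decode the integers later.

```python
import pandas as pd

# toy column, invented for illustration only
styles = pd.Series(['Pack', 'Cup', 'Pack', 'Bowl'], name='Style')

# indicator variables: one 0/1 column per distinct value
one_hot = pd.get_dummies(styles, prefix='style', dtype=int)

# factorization: one integer per distinct value, in order of first appearance
codes, uniques = pd.factorize(styles)

# indexing the unique values with the codes restores the original strings
decoded = uniques[codes]
```

The one-hot version needs three columns for three distinct values, while the factorized version always needs a single column.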


[>Some more encoding approaches<](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)

In [None]:
# we don't have to factorize all columns
df_cat = df_augmented[['Brand', 'Variety', 'Style', 'Country']]
df_done = df_augmented.drop(columns=df_cat.columns)


# factorize the columns of one df and reunite both dataframes
df_factorized = df_cat.copy().apply(lambda x: pd.factorize(x)[0])
df_cat.columns = df_cat.columns + '_cat'
df_augmented = pd.concat([df_factorized, df_done], axis=1)
# for better interpretability, we should save the encoding.
# This allows us to decode the columns after applying our 
# machine learning models.
df_encoding = pd.concat([df_cat, df_factorized], axis=1)

There is one last thing that might not seem to be important, but it is quite important:

We should change the names of our columns in order to obtain **uniformity**. The column names within the dataset were capitalised, and our new features aren't.
Let's change this.

In [None]:
df_augmented.columns = df_augmented.columns.str.lower()

df_augmented

Depending on the model you will apply on the data, some further steps like **Standardization** might be necessary.
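
As a rough sketch of what standardization means, assuming purely numerical features and invented toy arrays: each column is shifted to mean 0 and scaled to standard deviation 1, using statistics computed on the train data only.

```python
import numpy as np

# toy data, invented for illustration only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# statistics are computed on train only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
# test is scaled with the train statistics
test_scaled = (test - mean) / std
```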

That's it. We handled missing values and errors and obtained some new valuable features from a column that was not very useful in its raw form.
Feel free to comment any recommendations or improvements and help me and other people to learn more about this topic.

# Thank you for reading this notebook. Please upvote the notebook if it helped you out in any way. =) 
