In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))        
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Beer Efficiency
I usually like to do EDA with these kernels, but there really isn't much in this data set! As we can see below, we only have a few hundred samples and a few features, so I thought it would be interesting to see whether we can wrangle more features out of this data for a machine learning problem. The data set has the following features about each beer: name, calories, ABV, and efficiency. Efficiency is the interesting one: it tells us how effective a beer is at delivering ABV while minimizing calories. We'll use efficiency as the output and everything else as input.

In [None]:
# named constants instead of magic numbers scattered through the code
pal1 = '#ac4b1c'
seed = 121285

df = pd.read_csv('/kaggle/input/beer-efficiency/beer_efficiency.csv')
df.info()

# EDA
One of the first steps when working on an ML problem is to examine the feature distributions. Roughly normal inputs are easier for many models to learn from, and if a feature is distributed in an odd way we can use scalers or transforms to fix this. A roughly normal output also tends to make modeling easier. As we can see below, all of the features are relatively normal with a small right skew. We can try to normalize these later, but so far nothing is concerning.

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18,6))

# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df['abv'], ax=ax[0], color=pal1)
sns.histplot(df['calories'], ax=ax[1], color=pal1)
sns.histplot(df['efficiency'], ax=ax[2], color=pal1)

ax[0].set_title('ABV')
ax[1].set_title('Calories')
ax[2].set_title('Efficiency')
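
To put a number on the "small right skew" read off the histograms, one option is the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for a column such as calories (the data, the seed, and the choice of a log transform are all assumptions for illustration, not something taken from the beer data set).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a real column (assumption)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=500))

# Sample skewness: near 0 is symmetric, positive means a longer right tail
skew_before = values.skew()

# A log transform compresses the right tail and pulls the skew toward 0
skew_after = np.log1p(values).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

On the real data the same check would be df['calories'].skew(), and a mild skew like the one above may not need any transform at all.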

Now that we've determined the features are more or less healthy, we can begin working out whether any of them are useful. I like to begin this process with a correlation matrix. A correlation matrix tells us how closely each pair of features moves together, that is, whether they rise or fall at a similar rate. If an input correlates with the output then it will likely be a good predictor, so keep an eye out for values close to 1 or -1.

We can use the same plot to check whether any of our inputs are collinear, meaning they carry largely the same information. Some models tolerate this, while others become unstable or fail outright. Below we can see that calories and abv are collinear, so it is possible the model could be harmed by including them both! We can also see that calories correlates more strongly with the output than abv does, so if we need to drop one we have an easy choice.

In [None]:
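# numeric_only=True skips the text column 'name', which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True)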
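
Reading collinear pairs off a heatmap works for a handful of columns, but the same check can be done in code. This is a minimal sketch on a made-up frame: the column values are synthetic, and the 0.8 cutoff is an arbitrary assumption rather than a standard threshold.

```python
import numpy as np
import pandas as pd

# Toy frame: 'calories' is built to track 'abv' closely, 'noise' is unrelated
rng = np.random.default_rng(0)
abv = rng.normal(5.0, 1.0, size=200)
toy = pd.DataFrame({
    "abv": abv,
    "calories": 30 * abv + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Flag pairs above the (assumed) threshold
collinear = pairs[pairs > 0.8]
print(collinear)
```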

# Feature Engineering
With so few features, it would be nice if we could extract something from the name. We have sklearn tools to do some of this, but first I wanted to show an approach using Python on its own. What we need to do is see whether there are any commonly repeated terms that correlate with the output, so first let's see if any terms repeat at all.

In [None]:
from itertools import chain
from collections import Counter

# split all strings into lists, for example we now have [['Blue', 'Moon', 'Belgian', 'White'], ... ]
words = [x.split() for x in df.name.to_list()]
# now flatten all these sub lists into a single list
words = list(chain.from_iterable(words))

# have counter do the heavy lifting and throw it into a series for easy manipulation
count = pd.Series(Counter(words))
count = count[count > 5]

fig, ax = plt.subplots(figsize=(5,10))
# recent seaborn versions require x and y to be passed as keywords
sns.barplot(x=count.values, y=count.index, color=pal1)

So it looks like we have lots of repeated terms that could be useful and related to our output feature: amber, light, porter, IPA, etc. However, we also get all the brand information: Blue, Moon, Brooklyn, Bud, etc. If we had another feature or data set we could filter those out, or we could manually build a list, but for now we're just going to leave them in and see what happens.

Now that we've looked at this by hand, let's let sklearn handle the extraction with the CountVectorizer. This tool splits each name into tokens (words) and creates a matrix with the number of occurrences of each token in each sample. There are a few different ways to use it, but I like specifying the number of features (max_features), which keeps only the most frequent tokens. This way I can fine tune the complexity of our new inputs as well as control which words appear. The output is a sparse matrix that we can convert to a dense array and concatenate to our original dataframe. The vocabulary holds the tokens that were kept (basically what we did above manually).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=10)
mat = cv.fit_transform(df.name).toarray()
print(cv.vocabulary_)

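# vocabulary_ is a token -> column index dict whose key order need not match the
# matrix columns, so take the names in column order from get_feature_names_out()
newdf = pd.DataFrame(mat, columns=cv.get_feature_names_out())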
df = pd.concat([df, newdf], axis=1)
df = df.drop('name', axis=1)

So, the moment of truth: let's draw another correlation plot and see if any of these inputs have potential. Below you can see that we actually did pretty well. At least one of the new inputs is as good as calories, and the rest are (mostly) as good as abv!

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(df.corr(), annot=True)

We should also try an alternate scoring method to sanity check the usefulness of our features. I like to use SelectKBest, which here scores each input with a univariate F-test (f_regression) and also reports the matching p-values. This more or less confirms what we discovered with the correlation matrix: two of our features are highly useful while the rest are OK.

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# separate into input/output features
X = df.drop('efficiency', axis=1)
y = df['efficiency']

ksel = SelectKBest(k='all', score_func=f_regression)
ksel.fit(X, y)

sns.barplot(x=ksel.scores_, y=X.columns, color=pal1)
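
The bar plot shows the F-scores only. As a rough sketch of what SelectKBest exposes after fitting, the toy example below prints both scores_ and pvalues_ for synthetic data in which the target depends strongly on the first column, weakly on the second, and not at all on the third (the data and coefficients are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic inputs and target with known structure (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.3 * X_toy[:, 1] + rng.normal(size=200)

sel = SelectKBest(score_func=f_regression, k="all").fit(X_toy, y_toy)

# Higher F-score and lower p-value both point to a stronger linear relationship
for i, (f_score, p_value) in enumerate(zip(sel.scores_, sel.pvalues_)):
    print(f"feature {i}: F = {f_score:.1f}, p = {p_value:.3g}")
```

Keep in mind that f_regression looks at one input at a time, so it cannot tell that two collinear inputs are redundant with each other.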

# Training / Validation
Well, we have some features, but I'm worried about overfitting because of the large number of features compared to the number of samples. First let's train and validate to see how a model responds before we start trimming anything. Also note that we would typically use cross validation here to get a more robust idea of model performance and to help guard against data leakage, but I am going to keep this problem simplified.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

model = LinearRegression()

model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
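
For reference, the cross validation mentioned above could look roughly like this. It is a sketch on synthetic data, not the beer data set, and the five shuffled folds are an assumed choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for X and y (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

# Each of the 5 folds takes a turn as the held-out set
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds)  # R^2 per fold

print(scores.round(3), "mean:", round(scores.mean(), 3))
```

The spread of the per-fold scores gives a feel for how much a single train/test split could mislead on a data set this small.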

Just what I was afraid of: overfitting! There is no strict threshold for this, but I typically look for a gap of at least 10 percentage points between training and testing scores (0.10 in R², which is what score returns for a regressor). A gap like this means our model has memorized the training data and is not doing as well on the test data. Let's check a residual plot and see if we can get any extra information about this.

In [None]:
# recent seaborn versions require x and y to be passed as keywords
sns.residplot(x=model.predict(X_test), y=y_test)
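
As far as I understand it, residplot fits its own simple regression of y on x and plots the residuals of that fit, which is close to, but not exactly, the model's own errors. A sketch of computing the residuals directly, on synthetic data with hypothetical names (X_toy, y_toy, fit), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the beer features (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 4.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
fit = LinearRegression().fit(X_tr, y_tr)

# Residual = actual minus predicted, one value per test sample
pred = fit.predict(X_te)
residuals = y_te - pred

print("mean residual:", round(residuals.mean(), 3))
print("largest miss:", round(np.abs(residuals).max(), 3))
```

Plotting pred against residuals with plt.scatter then shows the model's actual errors.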

It looks like our model actually does pretty well in most cases, but when we get to higher efficiency beers it really falls apart. There is a slight 'shape' to the graph, so a polynomial model might help. There is also the possibility that the 'outlier' beer around (100, -20) is throwing our model off. We could examine that sample and decide whether it is important to keep.

In addition to training / testing scores and the residual plot, it is also good to look at the error a model generates, and a good start is MAE and MSE. Mean Absolute Error (MAE) is how much error the model generates on average. As you can see below, our model is typically only 2.38 efficiency off when making predictions. This should make sense: if we look at our residual plot above, most of the residuals are within the range 5 to -5. But what's up with our Mean Squared Error (MSE)? It's way higher at nearly 21! MSE averages the squared errors, so larger errors contribute disproportionately to the result, and it is also in squared units, so it is not directly comparable to MAE. So we can say that our model does well on average but in some cases really fails. Again, we can look at the residual plot to confirm these findings.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# sklearn metrics take (y_true, y_pred) in that order
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
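
To see why one bad prediction moves MSE so much more than MAE, here is a small worked example on made-up residuals (the numbers are invented for illustration and are not from the model above).

```python
import numpy as np

# Made-up residuals: five small misses and one large one
errors = np.array([1.0, -2.0, 1.5, -1.0, 2.0, -20.0])

mae = np.mean(np.abs(errors))   # average size of a miss
mse = np.mean(errors ** 2)      # average squared miss, in squared units
rmse = np.sqrt(mse)             # back in the original units

# Same metrics with the large miss removed
small = errors[:-1]
mae_small = np.mean(np.abs(small))
mse_small = np.mean(small ** 2)

print(f"with outlier:    MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
print(f"without outlier: MAE = {mae_small:.2f}, MSE = {mse_small:.2f}")
```

Dropping the single large miss cuts MAE by about a factor of 3 but MSE by a factor of almost 30, which is the same pattern as a low MAE next to a much higher MSE.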

Be sure to look at these values with respect to the output feature. You might initially be excited to get an MAE of 2.38, but if the range of your output is 0.1 to 0.5 then you are in some serious trouble. Another metric we can use, which ignores units, is Mean Absolute Percentage Error (MAPE). The cell below reports 100 minus MAPE, which reads as a simple measure of prediction accuracy. We have to be careful when using this metric since we divide by the test output: if any of these values are 0 then we obviously have a problem.

In [None]:
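# MAPE: mean of |error / actual|, expressed as a percentage
mape = np.mean(np.abs((y_test - model.predict(X_test)) / y_test)) * 100
# report 100 - MAPE as a rough "accuracy" percentage
print(100 - mape)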
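
The division-by-zero concern can be made explicit with a small helper. The function name mape_percent and the sample numbers are hypothetical, written only to illustrate the formula.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, as a percentage (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MAPE divides by the actual values, so a zero makes it undefined
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is 0")
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Made-up values: misses of 10%, 10% and 5%
actual = [100.0, 50.0, 200.0]
predicted = [110.0, 45.0, 190.0]

score = mape_percent(actual, predicted)
print(f"MAPE = {score:.2f}%, accuracy = {100 - score:.2f}%")
```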

Next I would like to trim features that are not contributing, using SelectKBest. It can be difficult to pick a good value for 'k' (how many features to keep), so we can just try every possible value and graph the results. Here we can see that at around 7 features the model stops getting any better.

In [None]:
train_scores = []
test_scores = []

# iterate through all possible feature counts, from 1 up to keeping all features
# (the + 1 is needed because range() excludes its upper bound)
ks = range(1, X.shape[1] + 1)
for i in ks:
    ksel = SelectKBest(k=i, score_func=f_regression)
    X_ = ksel.fit_transform(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=seed)

    model = LinearRegression()
    model.fit(X_train, y_train)

    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

plt.plot(ks, train_scores)
plt.plot(ks, test_scores)
plt.xlabel('# Features Kept')
plt.ylabel('Score')
plt.legend(['train', 'test'])
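
One caveat with the loop above is that the features are selected using all of the samples, including those that later end up in the test split. A possible alternative, sketched here on synthetic data, puts SelectKBest inside a Pipeline and lets GridSearchCV pick k, so the selection is refit on the training part of each fold. The step names "select" and "model" and the toy data are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data: only the first two of six columns drive the target (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + rng.normal(size=200)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

# Try every k from 1 to all features, scoring each with 5-fold cross validation
grid = {"select__k": list(range(1, X_toy.shape[1] + 1))}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_toy, y_toy)

best_k = search.best_params_["select__k"]
print("best k:", best_k, "mean CV R^2:", round(search.best_score_, 3))
```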

One of the issues with this data set is the small number of samples, so we can use a learning curve to see how much better the model could potentially be. What we're looking for is a trend as the number of training samples increases. As you can see below, it does appear that the testing score could be improved with more samples, which should not be surprising given the very small number we started with.

In [None]:
from sklearn.model_selection import learning_curve

sizes, t_scores, v_scores = learning_curve(LinearRegression(), X, y, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])

t_scores_mean = np.mean(t_scores, axis=1)
v_scores_mean = np.mean(v_scores, axis=1)

fig, ax = plt.subplots()
ax.plot(sizes, t_scores_mean, 'o-', color="r", label="Training score")
ax.plot(sizes, v_scores_mean, 'o-', color="g", label="Test score")
ax.set(xlabel='Training Samples', ylabel='Score')
ax.legend()

# Moving On
There is a lot more we could do here: different models, tuning hyperparameters, validation curves, cluster analysis, polynomial models, etc. But I think this kernel has already gotten long enough for what I initially set out to do, which was exploring feature extraction with some basic validation. I've learned a lot from this community, so I hope this brings something to the table for somebody else. If anybody has questions, concerns, or suggestions I would love to hear them!