# Case Study: Expanding Our Candy Brand

From an entrepreneurial point of view, let's try to answer the following questions with this notebook.

* Which ingredients influence the selection of a candy, and how strongly? / How strongly are the features correlated?
* What are the most important ingredients? / What features can describe the whole feature space sufficiently?
* What are the ingredients for a recipe that maximizes the selection? / Which feature values maximize the target variable?

###### Table of contents

1. [Exploratory Data Analysis](#eda)
    1. [Interesting Questions](#questions)
    2. [Feature Engineering](#feat_eng)
    3. [Distributions](#distributions)
    4. [Feature Correlation](#corr)
    5. [Principal Component Analysis (PCA)](#pca)
2. [Predictive Analysis - Linear Regression](#pred_ana)
    1. [Training + Single Metric Evaluation](#train_eval)
    2. [Learning Curve + Cross Validation](#curve_cross)
3. [Recommendation](#recomm)
    1. [Discussion and Outlook](#disc_out)
    2. [Conclusion and Final Recommendation](#conclusion)

## Exploratory Data Analysis <a name="eda"/>

In [None]:
# general
import os

# visualization and plots
import seaborn as sns
import matplotlib.pyplot as plt

# for the data
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# predictive models
from sklearn.linear_model import LinearRegression

# evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve

# warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# globals
BASE_DIR = '../input/the-ultimate-halloween-candy-power-ranking'

In [None]:
# read the data set
candy_df = pd.read_csv(os.path.join(BASE_DIR, 'candy-data.csv'))

# take a quick look at the data
candy_df.describe()

In [None]:
# scale winpercent between 0 and 1
candy_df.winpercent = candy_df.winpercent/100

###### Observations
Here you can see a rough overview of how the features are distributed.

* *sugarpercent*, *pricepercent* and *winpercent* are the only continuous features. The rest are boolean features.
* The distributions of the continuous features do not seem skewed, because the 25% and 75% quantiles lie at more or less the same distance from the 50% quantile.
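
The quantile argument above can be made explicit with a few lines of pandas. This is only a minimal sketch: `toy_df` and its values are made up to stand in for `candy_df`, and `quartile_gaps` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for candy_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "sugarpercent": [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9],
    "pricepercent": [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9],
})

def quartile_gaps(series):
    """Return (median - Q1, Q3 - median); similar gaps hint at a symmetric middle."""
    q25, q50, q75 = series.quantile([0.25, 0.5, 0.75])
    return q50 - q25, q75 - q50

for col in toy_df.columns:
    lower_gap, upper_gap = quartile_gaps(toy_df[col])
    # Series.skew() is a second, moment-based check; values near 0 suggest little skew
    print(col, round(lower_gap, 3), round(upper_gap, 3), round(toy_df[col].skew(), 3))
```

The quartile gaps only describe the middle half of the data, so the sample skewness is printed next to them as a cross-check.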

In [None]:
# Overview with information of the dataset
candy_df.info()

###### Observations
* No missing values, which is the ideal situation. Otherwise, missing values could be substituted, e.g. by the column mean, or the affected rows/columns could be dropped.
* Feature *competitorname* is a categorical feature that needs to be transformed if used in a predictive model (OneHotEncoding, LabelEncoding).
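
Since the dataset has no missing values, the two fallback options are never needed here. For reference, a minimal sketch of both on a made-up frame (`toy_df` is hypothetical and only illustrates the calls):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only.
toy_df = pd.DataFrame({
    "sugarpercent": [0.2, np.nan, 0.8],
    "pricepercent": [0.5, 0.1, 0.9],
})

# Option 1: substitute missing values by the column mean
imputed_df = toy_df.fillna(toy_df.mean(numeric_only=True))

# Option 2: drop every row that contains a missing value
dropped_df = toy_df.dropna(axis=0)

print(imputed_df)
print(dropped_df)
```

With only 85 rows, dropping rows would cost a noticeable share of the data, so imputation would likely be the gentler choice.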

In [None]:
# How many different competitors are there?
len(candy_df.competitorname.unique())

There are 85 unique competitors, one per row. One-hot encoding would therefore add 85 columns, and there is no ranking or order in the names that label encoding could capture. Thus, let's drop the column.
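
To see why one-hot encoding a name column inflates the feature space, here is a small sketch on a made-up frame (the candy names are placeholders, not taken from the dataset):

```python
import pandas as pd

# Made-up frame: every row has its own unique name, as in the candy data.
toy_df = pd.DataFrame({
    "competitorname": ["Candy A", "Candy B", "Candy C"],
    "chocolate": [1, 0, 1],
})

# One dummy column is created per unique name
encoded_df = pd.get_dummies(toy_df, columns=["competitorname"])
print(toy_df.shape, "->", encoded_df.shape)
```

Each unique name becomes its own column that is 1 for exactly one row, so the encoded columns would only identify individual rows rather than describe a property shared between candies.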

### Interesting Questions <a name="questions"/>

In [None]:
# sort candies by winpercent
candy_sorted_win = candy_df.sort_values(by=['winpercent'], ascending=False)

In [None]:
# What are the top 10 candies with the highest winpercent?
candy_sorted_win.head(10)

In [None]:
# What are the bottom 10 candies with the lowest winpercent?
candy_sorted_win.tail(10)

###### Observations
* In the top 10
    * All instances have *chocolate* but are not *fruity*
    * The second place has almost no sugar
    * All are soft
* In the bottom 10
    * There is no *chocolate*, *nougat* or *crispedricewafer*
    * None of them are bar candies

Side note: *competitorname* contains mis-encoded characters (Õ) that you might want to substitute. However, the column will be dropped here anyway.

In [None]:
# sort candies by sugar
candy_sorted_sugar = candy_df.sort_values(by=['sugarpercent'], ascending=False)

In [None]:
# How did candies with the highest sugar perform?
candy_sorted_sugar.head(10)

In [None]:
# How did candies with the lowest sugar perform?
candy_sorted_sugar.tail(10)

###### Observations
* Both groups seem to perform rather mediocrely.
* Highest sugar
    * No *caramel*, *nougat* or *crispedricewafer*
    * No bar candies
* Lowest sugar
    * No *caramel*, *nougat* or *crispedricewafer*
    * No bar candies

In [None]:
# sort candies by price
candy_sorted_price = candy_df.sort_values(by=['pricepercent'], ascending=False)

In [None]:
# How did candies with the highest price perform?
candy_sorted_price.head(10)

In [None]:
# How did candies with the lowest price perform?
candy_sorted_price.tail(10)

###### Observations
* Pricey candies appear to perform better.
* There is no nougat in either group.
* In both groups the candies are either chocolate or fruity.
* Highest 10 *pricepercent*
    * Almost no *peanutyalmondy*
    * A lot of chocolate
* Lowest 10 *pricepercent*
    * Less sugar compared to the higher priced
    * Only soft candies
    * Pixie Sticks and Root Beer Barrels have none of the mentioned ingredients

### Feature Engineering <a name="feat_eng"/>
Let's see if we can combine features and highlight other properties. Note that the feature you want to predict is excluded; otherwise, it would cause data leakage. Also, be cautious about doing something like this in a real application.

#### Sweet but economic
Let this be candies that are sweet but have a low price. If *sugarpercent* is high and *pricepercent* is low then the value will be high.

In [None]:
# divide sugarpercent by pricepercent
candy_df["sweetbyprice"] = candy_df.sugarpercent / candy_df.pricepercent

sweetbyprice_mean = np.mean(candy_df.sweetbyprice)
sweetbyprice_std = np.std(candy_df.sweetbyprice)

print(sweetbyprice_mean, sweetbyprice_std)

# normalize
candy_df.sweetbyprice = (candy_df.sweetbyprice - sweetbyprice_mean) / sweetbyprice_std

#### Chocolate and Fruity
Let this be candies that are both chocolate and fruity.


In [None]:
# multiply for AND function
candy_df["chocolateAndFruity"] = candy_df.chocolate * candy_df.fruity

### Distributions <a name="distributions"/>

In [None]:
def plot_dists(df, start, end, grid_x, grid_y):
    """
    Plot the distributions of a range of DataFrame columns as histograms.
    
    @params:
    - df: pandas DataFrame
    - start: Integer, index of the first column to plot
    - end: Integer, index after the last column to plot (exclusive)
    - grid_x: Integer, number of subplot rows
    - grid_y: Integer, number of subplot columns
    """
    # figsize is (width, height): width scales with the columns, height with the rows
    f, axes = plt.subplots(grid_x, grid_y, figsize=(grid_y*4, grid_x*5), sharex=False, sharey=False)
    
    for r in range(0, grid_x):
        for c in range(0, grid_y):
            if start >= end:
                # hide subplots that are not needed
                axes[r,c].set_visible(False)
            else:
                # histplot replaces the deprecated distplot (requires seaborn >= 0.11)
                sns.histplot(df.iloc[:,start], ax=axes[r, c], bins=10, kde=False)
                start += 1

In [None]:
n_cols = candy_df.shape[1]

# leave competitorname out
plot_dists(candy_df, 1, n_cols, 4, 4)

###### Observations

* The features *caramel*, *peanutyalmondy*, *nougat*, *crispedricewafer*, *hard* and *bar* have roughly 7-8 negative occurrences for every 1-2 positive ones.  
* For the *chocolate*, *fruity* and *pluribus* features the split is close to 50:50.  
* The continuous feature *winpercent* seems to be more or less normally distributed.
* There are not many cost efficient candies.
* There are not many candies that are sweet but cheap.

### Feature Correlation <a name="corr"/>
Let's see how the features are correlated and influence each other. In other words, which properties influence the selection of a candy?

In [None]:
# Drop competitorname
candy_data = candy_df.drop(columns=['competitorname'])

In [None]:
# set figure size
plt.figure(figsize=(15, 15))
corr_heat = sns.heatmap(candy_data.corr(), vmin=-1, vmax=1, center=0, annot=True, fmt=".1g", cmap="coolwarm")

###### What about correlated features?
Here you can see how the features correlate with each other. It is especially interesting to see how the features relate to *winpercent*, the target variable; the row to look at is the third from last. Ideally you want features that are independent. Features that are (highly) correlated result in high variance or, in other words, unstable predictions on new data.  
How can correlated features be handled?

* keep one and delete the others
* combine them and map their dependency
* reduce the dimension/number of features
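
Before picking one of these options, it helps to list which feature pairs are strongly correlated in the first place. The following is a minimal sketch on made-up binary columns; `strongly_correlated_pairs` and the threshold of 0.6 are assumptions for illustration, not values derived from the candy data.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.6):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # keep only the upper triangle so each pair shows up once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs[pairs.abs() >= threshold]

# Made-up binary features, for illustration only.
toy_df = pd.DataFrame({
    "chocolate": [1, 0, 1, 0, 1, 0],
    "fruity":    [0, 1, 0, 1, 0, 1],
    "caramel":   [1, 1, 0, 0, 1, 0],
})

strong = strongly_correlated_pairs(toy_df)
print(strong)
```

In the notebook, the same helper could be applied to `candy_data.drop(columns=["winpercent"])` so that only feature-to-feature correlations are reported.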

#### Observations

* *chocolate* has a positive effect on the target.
* *fruity* has a negative effect on the target.
* *chocolate* and *fruity* are strongly negatively correlated.
* There is a small positive correlation between *sugarpercent* and *pricepercent*.
* *cost_efficiency* has almost no influence on the target.

Note that the data does not explain if *pricepercent* is influenced e.g by the quality of the ingredients.  
According to the heatmap, a first recommendation could be a soft, non-fruity, one-bar candy with chocolate, caramel, nougat, nuts (peanuts, almonds, ...) and some crunch (cookies, wafers, ...).  



### Principal Component Analysis <a name="pca"/>
Let's do a short PCA to double-check which features are more important and how much of the data they explain. In this case, using PCA to reduce the dimension (number of features) is not trivial. The analysis is originally meant for continuous features, not (binary) categorical ones, and the dataset has binary categorical features like *chocolate*. However, it is still useful for gaining further insights into the contribution of the features in the dataset.

In [None]:
# split data into X and y
# note: candy_data is without the competitorname feature
X = candy_data.drop(columns=["winpercent"])
y = candy_data.winpercent

In [None]:
# values need to be scaled before the analysis
stand_scal = StandardScaler()
X_scaled = stand_scal.fit_transform(X)

# pca
pca = PCA(random_state=123)
X_pca = pca.fit_transform(X_scaled)

In [None]:
# init plot
fig, ax = plt.subplots(1, 2, sharey=True)
fig.set_size_inches(15, 5)


# explained variance ratio as bar chart
component_idxs = np.arange(1, len(pca.explained_variance_ratio_) + 1)
ax[0].bar(component_idxs, height=pca.explained_variance_ratio_)
ax[0].set_xlabel('PCA component')
ax[0].set_ylabel('Explained Variance Ratio')

# cumulative explained variance ratio as bar chart
cumulative_exp_var = np.cumsum(pca.explained_variance_ratio_)
ax[1].bar(component_idxs, height=cumulative_exp_var)
ax[1].set_xlabel('Number of PCA components')
ax[1].set_ylabel('Cumulative Explained Variance Ratio')
plt.show()

In [None]:
# which features contribute the most or are more important for each component?
pca_components = pca.components_

# transpose for easier use:
# now each row represents one feature
pca_comp_t = np.transpose(pca_components)

features = X.columns
n_features = len(features)
pca_idxs = np.arange(1, len(features)+1)
x_labels = ["PCA {}".format(i) for i in range(1, n_features + 1)]
bar_width = 0.8

# define subplots dimension
n_rows = 4
n_cols = 4

fig, ax = plt.subplots(n_rows, n_cols, sharex=False, sharey=True)
fig.set_size_inches(40, 40)

rects = []
index = 0

# iterate through subplots
for r in range(n_rows):
    for c in range(n_cols):
        if index >= n_features:
            ax[r,c].set_visible(False)
        else:
            # draw the bar chart
            rect = ax[r,c].bar(pca_idxs, pca_comp_t[index], bar_width)
            ax[r,c].set_xticks(pca_idxs)
            ax[r,c].set_xticklabels(x_labels)
            ax[r,c].set_ylabel('Contribution')
            ax[r,c].set_xlabel(features[index])
            index += 1

plt.show()

###### Observations
* At least 9 PCA components are required to explain 95% of the variance. Thus, reducing the dimension of the feature space does not bring a substantial benefit.
* *bar* has the highest positive and *fruity* the highest negative contribution in PCA1
* *cost_efficiency* has the highest positive and *pricepercent* the highest negative contribution in PCA2
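
The number of components needed for a given share of the variance does not have to be read off the bar chart. A minimal sketch, assuming a hypothetical helper `n_components_for` and made-up variance ratios (in the notebook you would pass `pca.explained_variance_ratio_`):

```python
import numpy as np

def n_components_for(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # index of the first cumulative value >= target, converted to a count
    n = int(np.searchsorted(cumulative, target)) + 1
    # guard against rounding: never report more components than exist
    return min(n, len(cumulative))

# Made-up ratios, for illustration only.
toy_ratios = np.array([0.6, 0.25, 0.1, 0.05])
print(n_components_for(toy_ratios, target=0.9))
```

As far as I know, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps just enough components to reach that share of explained variance, which would be an alternative to counting by hand.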

## Predictive Analysis - Linear Regression <a name="pred_ana"/>
A predictive analysis is important if you want an idea of how well a new candy will be accepted. It is not perfect, but it reduces the risk of going through development, marketing and production of a new candy.  
You can choose from multiple models, but let's start with a Linear Regression. You can then decide whether it models the data sufficiently. Depending on that, you can choose a more complex model.

### Training + Single Metric Evaluation <a name="train_eval"/>
A Linear Regression model is trained and evaluated with a single metric, the Root Mean Squared Error (RMSE). There are other metrics to choose from, but RMSE is fine here.
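
For reference, this is the standard definition of the RMSE over $n$ test instances, where $y_i$ is the observed *winpercent* and $\hat{y}_i$ the prediction (the symbols are introduced here and are not used elsewhere in the notebook):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

The RMSE is expressed in the units of the target. Because *winpercent* was scaled to the range 0 to 1 earlier, an RMSE of 0.1 corresponds to an error of about 10 percentage points.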

In [None]:
# split data further into training and testing/validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)

In [None]:
# check if the distributions are similar after the split
training_data = pd.concat([X_train, y_train], axis=1)
testing_data = pd.concat([X_test, y_test], axis=1)

Is the data distributed more or less the same after the split?

In [None]:
plot_dists(training_data, 0, training_data.shape[1], 4, 4)

In [None]:
plot_dists(testing_data, 0, testing_data.shape[1], 4, 4)

With 85 instances the dataset is really small, so it is difficult to split the data such that models are trained and tested on representative data. According to the distribution plots, a 20% split looks fine. A common Machine Learning convention is to use 10% - 30% of the data for the test set.

In [None]:
# train a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

predictions = lin_reg.predict(X_test)

# the root mean squared error of the test set
# (argument order follows the scikit-learn convention: y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: ", rmse)

The coefficients of the Linear Regression model show how each feature contributes to the output.

In [None]:
# coefficients of the linear regression
lin_reg.coef_

The coefficients are ordered like the features, thus *chocolate* has a coefficient of 0.21, *fruity* 0.1, etc.  
The coefficient of 0.21 is the highest, which is expected because *chocolate* has the biggest influence on the target as seen in the heatmap.  
Maximizing the target means maximizing the equation of the Linear Regression. So, for a binary feature you include it (1) if the coefficient is positive and exclude it (0) if it is negative. For a continuous feature you want to go as near as possible to 0 for a negative coefficient and to 1 for a positive coefficient. Derived features have to stay consistent with the features they are built from: *fruity* has a positive coefficient, but combined with *chocolate* it also switches on *chocolateAndFruity*, whose negative coefficient (-0.18) outweighs the gain (0.1), so *fruity* is set to 0.  

Finally, you can come up with a constellation that has the highest winpercent.

In [None]:
# to maximize the equation let's choose
# ... the nearest possible values for 0 or 1
choc = 1
fruit = 0
caramel = 1
nuts = 1
nougat = 0
crispy = 1
hard = 0
bar = 0
plural = 0
sugar_percent_val = 0.99
price_percent_val = 0.2
sweet_by_price_val = ((sugar_percent_val / price_percent_val) - sweetbyprice_mean) / sweetbyprice_std
chocFruit = 0

# feature values that maximizes the outcome of the Linear Regression
best_feat_values = np.array([choc, fruit, caramel, nuts, nougat, crispy, hard, bar, plural, sugar_percent_val, price_percent_val, sweet_by_price_val, chocFruit])
best_winpercent = np.dot(lin_reg.coef_, best_feat_values) + lin_reg.intercept_
best_winpercent

With this constellation the Linear Regression predicts a winpercent of 84.64%, higher than the first place of the data, 84.18%.
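
Because *chocolateAndFruity* depends on two other features, picking each binary value by the sign of its coefficient alone is not guaranteed to be optimal. A brute-force check over all 512 binary combinations is cheap. The sketch below uses the coefficients reported for the trained model in the Recommendation table; the intercept and the continuous features are left out because they add the same amount to every combination, and `binary_contribution` is a hypothetical helper.

```python
from itertools import product

import numpy as np

binary_features = ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
                   "crispedricewafer", "hard", "bar", "pluribus"]
# Coefficients as reported for the trained model in this notebook
binary_coefs = np.array([0.21148295, 0.09599222, 0.00292705, 0.069123, -0.00937832,
                         0.08260203, -0.07228368, -0.01970207, -0.05246864])
choc_and_fruity_coef = -0.18310419

def binary_contribution(values):
    """Sum of the binary terms of the regression, including the interaction term."""
    values = np.asarray(values)
    interaction = values[0] * values[1]  # chocolate AND fruity
    return float(binary_coefs @ values + choc_and_fruity_coef * interaction)

# Enumerate all 2^9 combinations and keep the one with the largest contribution
best_combo = max(product([0, 1], repeat=len(binary_features)), key=binary_contribution)
best_recipe = dict(zip(binary_features, best_combo))
print(best_recipe)
```

The enumeration arrives at the same binary values as chosen by hand above, which confirms that setting *fruity* to 0 is the right call once the interaction term is taken into account.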

### Learning Curve + Cross Validation <a name="curve_cross"/>
A learning curve is plotted to see how the model behaves. It gives insights into, e.g., whether the model is too complex or whether more training data is needed.

In [None]:
#calculate learning curve
def calc_learning_curve(X, y, train_sizes = np.linspace(0.1, 1., 5), cv=10, shuffle=False):
    """
    calculates the learning curve
    
    @params:
    - X: pandas DataFrame with shape [n_samples, n_features]
    - y: pandas Series with shape [n_samples, 1]
    - train_sizes: numpy array with values that contain the different training sizes
    - cv: int, number of k-fold cross validation
    - shuffle: boolean if training data is shuffled before split
    
    @ return:
    - train_sizes: numpy array with values that contain the different training sizes
    - train_scores: numpy array with cross validation scores
    - test_scores: numpy array with cross validation scores
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator=LinearRegression(),
        X = X,
        y = y,
        train_sizes = train_sizes,
        cv = cv,
        scoring = "neg_mean_squared_error",
        return_times = False,
        shuffle = shuffle,
        random_state = 123
    )

    train_scores = np.sqrt(-1 * train_scores)
    test_scores = np.sqrt(-1 * test_scores)
    
    return train_sizes, train_scores, test_scores

In [None]:
# plot learning curve
def plot_learning_curve(train_sizes, train_scores, test_scores):
    """
    plots a learning curve
    
    @params:
    - train_sizes: numpy array with values that contain the different training sizes
    - train_scores: numpy array with cross validation scores on the training folds
    - test_scores: numpy array with cross validation scores on the validation folds
    """
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.figure(figsize=(15,10))
    plt.plot(train_sizes, train_scores_mean, label='Training Error', color='b')
    plt.plot(train_sizes, test_scores_mean, label='Testing Error', color='r')

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="b")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="r")


    plt.title('Linear Regression - Learning Curve')
    plt.xlabel('Training Set Size')
    plt.ylabel('RMSE')
    plt.legend(loc="best")

    plt.show()

#### Shuffle before training

In [None]:
train_sizes, train_scores, test_scores = calc_learning_curve(X, y, shuffle=True)
plot_learning_curve(train_sizes, train_scores, test_scores)

##### Observations
* Both training and testing curves flatten at more or less the same speed and seem to be about to intersect at around 75 training examples (the 80/20 split above trains on 68).
* Many more training examples should not result in a much better score, because the curves are about to intersect.
* Looking at the curves, there is no strong sign of high bias (fast flattening test error) or high variance (large remaining gap between the curves).
* The learning curve is expected to show a higher deviation at the beginning, because the fit is poor with few training examples. It is also expected to drop with more training examples.
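
The "remaining gap" between the curves can also be expressed as a number instead of being judged by eye. A minimal sketch, assuming score arrays of shape (number of training sizes, number of folds) as returned by `calc_learning_curve`; `learning_curve_gap` is a hypothetical helper and the values below are made up:

```python
import numpy as np

def learning_curve_gap(train_scores, test_scores):
    """Gap between mean validation and mean training RMSE at the largest training size."""
    train_final = np.mean(train_scores[-1])
    test_final = np.mean(test_scores[-1])
    return float(test_final - train_final)

# Made-up RMSE values with shape (n_train_sizes, n_folds), for illustration only.
toy_train_scores = np.array([[0.02, 0.03], [0.08, 0.09], [0.10, 0.10]])
toy_test_scores = np.array([[0.30, 0.28], [0.15, 0.16], [0.12, 0.12]])

gap = learning_curve_gap(toy_train_scores, toy_test_scores)
print(round(gap, 3))
```

A gap that is small relative to the RMSE itself points away from high variance; what counts as "small" remains a judgment call for such a small dataset.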

## Recommendation <a name="recomm"/>

To maximize the chance that a candy is selected, you need to derive the feature values that maximize the output of the Linear Regression model.  
Here is a table that maps the features to the coefficients of the trained Linear Regression model, together with the best value for each feature:

| Feature | Coefficient | Best Feature Value |
|----------|:-------------:|---:|
| chocolate | 0.21148295 | 1 |
| fruity | 0.09599222 | 0 |
| caramel | 0.00292705 | 1 |
| peanutyalmondy | 0.069123 | 1 |
| nougat | -0.00937832 | 0 |
| crispedricewafer | 0.08260203 | 1 |
| hard | -0.07228368 | 0 |
| bar | -0.01970207 | 0 |
| pluribus | -0.05246864 | 0 |
| sugarpercent | 0.1301698 | 0.99 |
| pricepercent | -0.14166532 | 0.2 |
| sweetbyprice | -0.02571 | 0.9797313623732542 |
| chocolateAndFruity | -0.18310419 | 0 |

### Conclusion <a name="disc_out"/>

With the best feature values, the model predicts a *winpercent* of roughly 85%. You should interpret this as an indicator that a sweet and cost-efficient candy is preferred. The data is really simplistic; it tells nothing about the quality of the ingredients. Features like *sugarpercent_squared* and *pricepercent_squared* could be included, and a more complex model could be used to capture non-linear relations. However, the current data does not show clear evidence of a more complex relation, so a Linear Regression suffices. It could be beneficial to include more information about the ingredients in future datasets to potentially obtain a more detailed optimal recipe.
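
If you want to try the squared features mentioned above, one way is to compare the cross-validated RMSE with and without them. The following is only a sketch on synthetic data: `toy_X` and `toy_y` are randomly generated stand-ins for the candy data, and `add_squared_terms` and `cv_rmse` are hypothetical helpers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the candy data (85 rows, linear relation plus noise).
rng = np.random.default_rng(123)
toy_X = pd.DataFrame({
    "sugarpercent": rng.uniform(0, 1, 85),
    "pricepercent": rng.uniform(0, 1, 85),
})
toy_y = 0.5 + 0.1 * toy_X.sugarpercent - 0.1 * toy_X.pricepercent + rng.normal(0, 0.1, 85)

def add_squared_terms(X, columns):
    """Return a copy of X with an extra '<name>_squared' column per listed column."""
    X_ext = X.copy()
    for col in columns:
        X_ext[f"{col}_squared"] = X_ext[col] ** 2
    return X_ext

def cv_rmse(X, y, cv=10):
    """Mean RMSE of a Linear Regression over cv folds."""
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores).mean())

toy_X_sq = add_squared_terms(toy_X, ["sugarpercent", "pricepercent"])
print("linear only:  ", cv_rmse(toy_X, toy_y))
print("with squares: ", cv_rmse(toy_X_sq, toy_y))
```

If the RMSE with squared terms is not clearly lower on the real data, that would support staying with the plain Linear Regression.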