# Chocolate Quality Analysis Using Machine Learning
---
<p align="center">
    <img src="images/choco.jpg"/>
</p>

This objective of this notebook is to explain how to build Machine Learning models for simple dataset such as **Sabroso Chocolate Quality Assurance** dataset.

For this analysis we are going to use `Scikit-Learn`, `Pandas`, and `Numpy`.

First let's import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn

print('pandas version: {}'.format(pd.__version__))
print('numpy version: {}'.format(np.__version__))
print('pandas version: {}'.format(sklearn.__version__))

pandas version: 0.17.1
numpy version: 1.11.3
pandas version: 0.18.1


Next, we import dataset into a `Pandas` `DataFrame` and visualize that data frame.

In [2]:
#loading data using pandas
training = pd.read_csv('./SYNERGEN Exercise_Train.csv')

In [3]:
# Next, we visualize data by just looking at first few lines
training.tail(5)

Unnamed: 0,UNIQUE_ID,LIST_OF_INGREDIENTS,PREPARATION_METHOD,MANUFACTURED_DATE,MANUFACTURED_LOCATION,QUANTITY,APPETIZING_COLOR,ATTRACTIVE_PACKAGING,SUBMISSION_DATE,QUALITY_ASSURANCE_ENTITY,RESPONSE
180,zSmsge37tMxCUAN,"X1, X2, X3, X4, X5","D2, D1, D3",2017-02-01,Y2,400,0,1,2017-02-02,Z1,0
181,Cd16nq5rkNVhS95,X6,D10,2017-02-02,Y1,100,0,1,2017-02-03,Z1,0
182,NopCApkw4nCCwQw,X6,D10,2017-02-02,Y1,50,0,1,2017-02-03,Z1,0
183,hntvWJULUiJSybI,"X1, X2, X3, X4, X5","D2, D1, D3",2017-02-05,Y3,150,0,1,2017-02-06,Z1,0
184,w7WNrNenF4h9C5,"X1, X2, X3, X4, X5","D1, D2, D3",2017-12-31,Y1,50,1,1,2018-01-01,Z1,0


In [4]:
# And print columns
training.columns

Index(['UNIQUE_ID', 'LIST_OF_INGREDIENTS', 'PREPARATION_METHOD',
       'MANUFACTURED_DATE', 'MANUFACTURED_LOCATION', 'QUANTITY',
       'APPETIZING_COLOR', 'ATTRACTIVE_PACKAGING', 'SUBMISSION_DATE',
       'QUALITY_ASSURANCE_ENTITY', 'RESPONSE'],
      dtype='object')

In Machine Learning we work with tabular data (usually, load data as CSV files.) However, if we take an individual cell of **`LIST_OF_INGREDIENTS``** and **`PREPARATION_METHOD`**, each cell represent a list of properties. Hence, we need to process  and create colors to represent those properties. 

Firstly, we identify unique properties contains in those two columns using following simple method.

In [5]:
# Following method iterates the given pandas series
# and find the unique properties which reside in 
# that series.
def find_unique_items(series):
    unique = set()
    for elements in series.iteritems():
        ingre = elements[1]
        for element in ingre.split(','):
            unique.add(element.strip())
    return unique

In [6]:
print('Unique values in LIST_OF_INGREDIENTS: {}'.format(find_unique_items(training['LIST_OF_INGREDIENTS'])))
print('Unique values in PREPARATION_METHOD: {}'.format(find_unique_items(training['PREPARATION_METHOD'])))

Unique values in LIST_OF_INGREDIENTS: {'X6', 'X10', 'X5', 'X3', 'X9', 'X1', 'X4', 'X2', 'X8', 'X7'}
Unique values in PREPARATION_METHOD: {'D10', 'D2', 'D4', 'D1', 'D11', 'D3', 'D15', 'D5', 'D8'}


In [7]:
# Next, those list of properties in LIST_OF_INGREDIENTS and PREPARATION_METHOD
# replaces with a set of columns
def ingregient_extractor(x, ingredient):
    for element in x.split(','):
        y = element.strip()
        if y == ingredient:
            return 1
    return 0

In [8]:
training['X_1'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X1',))
training['X_2'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X2',))
training['X_3'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X3',))
training['X_4'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X4',))
training['X_5'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X5',))
training['X_6'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X6',))
training['X_7'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X7',))
training['X_8'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X8',))
training['X_9'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X9',))
training['X_10'] = training['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X10',))

training['D_1'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D1',)) 
training['D_2'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D2',)) 
training['D_3'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D3',)) 
training['D_4'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D4',)) 
training['D_5'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D5',)) 

training['D_8'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D8',)) 
training['D_10'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D10',)) 
training['D_11'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D11',)) 
training['D_15'] = training['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D15',)) 

In [9]:
# finally, we drop LIST_OF_INGREDIENTS and PREPARATION_METHOD
del training['LIST_OF_INGREDIENTS']
del training['PREPARATION_METHOD']

In [10]:
training.groupby('RESPONSE').count()['UNIQUE_ID']

RESPONSE
0    90
1    95
Name: UNIQUE_ID, dtype: int64

Now the data cleaning part is done, so next, we are going to encode categorical data create our training features.

In [11]:
training_features = pd.concat([training[['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 
                                         'X_6', 'X_7', 'X_8', 'X_9', 'X_10', 
                                         'D_1', 'D_2', 'D_3', 'D_4', 'D_5', 
                                         'D_8', 'D_10', 'D_11', 'D_15',
                                         'QUANTITY', 'ATTRACTIVE_PACKAGING']], 
                               pd.get_dummies(training[['MANUFACTURED_LOCATION']]),
                               pd.get_dummies(training[['QUALITY_ASSURANCE_ENTITY']])], 
                               axis=1)

Now the model building part comes. As shown below, we trained two main Machine Learning algorithms namely: **`Logistic Regression`** and **`Random Forst`** algorithms. Also, we used 3-Fold cross validation in order to tune hyperparameters. 

In [12]:
%%time 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale

X_train = training_features.values
y_train = training['RESPONSE'].values

folds = KFold(n_splits=3, shuffle=True)
cv_accuracies = []
for trining_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[trining_idx]
    y_train_cv = y_train[trining_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    logistic_regression = LogisticRegression()
    logistic_regression.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = logistic_regression.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print('cross validation accuracy: {}'.format(current_accuracy))
    
print( '---------------------------------------')
print( 'average corss validation accuracy: %f' %(sum(cv_accuracies)/len(cv_accuracies)))

cross validation accuracy: 0.8709677419354839
cross validation accuracy: 0.7903225806451613
cross validation accuracy: 0.8032786885245902
---------------------------------------
average corss validation accuracy: 0.821523
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.3 ms


In [13]:
%%time 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X_train = training_features.values
y_train = training['RESPONSE'].values

folds = KFold(n_splits=3, shuffle=True)
cv_accuracies = []
for trining_idx, testing_idx in folds.split(X_train):
    X_train_cv = X_train[trining_idx]
    y_train_cv = y_train[trining_idx]
    
    X_test_cv = X_train[testing_idx]
    y_test_cv = y_train[testing_idx]
    
    random_forest = RandomForestClassifier(n_estimators = 100)
    random_forest.fit(scale(X_train_cv), y_train_cv)
    y_predict_cv = random_forest.predict(scale(X_test_cv))
    current_accuracy = accuracy_score(y_test_cv, y_predict_cv)
    cv_accuracies.append(current_accuracy)
    print( 'cross validation accuracy: %f' %(current_accuracy))

    
print('---------------------------------------')
print('average corss validation accuracy: %f' %(sum(cv_accuracies)/len(cv_accuracies))) 
print( '---------------------------------------\n')

cross validation accuracy: 0.838710
cross validation accuracy: 0.806452
cross validation accuracy: 0.868852
---------------------------------------
average corss validation accuracy: 0.838005
---------------------------------------

CPU times: user 340 ms, sys: 0 ns, total: 340 ms
Wall time: 340 ms


So it is clear that for this dataset, Random Forest works better than Logistic Regression. Hence, we take Random Forest model as our final model and going to predict response values for the testing dataset using our Random Forst model.

In [14]:
X_train = training_features.values
y_train = training['RESPONSE'].values
random_forest = RandomForestClassifier(n_estimators = 100)
random_forest.fit(scale(X_train_cv), y_train_cv)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [15]:
# Next, we clean the testing dataset
testing = pd.read_csv('./SYNERGEN Exercise_Prediction.csv')

testing['X_1'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X1',))
testing['X_2'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X2',))
testing['X_3'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X3',))
testing['X_4'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X4',))
testing['X_5'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X5',))
testing['X_6'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X6',))
testing['X_7'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X7',))
testing['X_8'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X8',))
testing['X_9'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X9',))
testing['X_10'] = testing['LIST_OF_INGREDIENTS'].apply(ingregient_extractor, args=('X10',))


testing['D_1'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D1',)) 
testing['D_2'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D2',)) 
testing['D_3'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D3',)) 
testing['D_4'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D4',)) 
testing['D_5'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D5',)) 

testing['D_8'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D8',)) 
testing['D_10'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D10',)) 
testing['D_11'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D11',)) 
testing['D_15'] = testing['PREPARATION_METHOD'].apply(ingregient_extractor, args=('D15',)) 

del testing['LIST_OF_INGREDIENTS']
del testing['PREPARATION_METHOD']

testing_features = pd.concat([testing[['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6', 'X_7', 'X_8', 'X_9', 'X_10', 
                                         'D_1', 'D_2', 'D_3', 'D_4', 'D_5', 'D_8', 'D_10', 'D_11', 'D_15',
                                         'QUANTITY', 'ATTRACTIVE_PACKAGING']], 
                                         pd.get_dummies(testing[['MANUFACTURED_LOCATION']]),
                                         pd.get_dummies(testing[['QUALITY_ASSURANCE_ENTITY']])], axis=1)

In [16]:
X_test = testing_features.values
output = random_forest.predict(X_test)
unique_indices = training['UNIQUE_ID'].values
for i, j in zip(output, unique_indices):
    print('index: {} prediction: {}'.format(j, i))

index: monUvMmr95OP05e prediction: 0
index: 1xbRcd2JbUuZ0IK prediction: 1
index: 8FMJ6YMJYbTC4yp prediction: 0
index: fuovowqPpCHv3W9 prediction: 0
index: monUvMmr95OP05e prediction: 0
index: monUvMmr95OP05e prediction: 1
index: monUvMmr95OP05e prediction: 1
index: monUvMmr95OP05e prediction: 0
index: monUvMmr95OP05e prediction: 1
index: monUvMmr95OP05e prediction: 0
index: winhsDL92bKQS4x prediction: 1
index: xWc5x0sOguKgkJa prediction: 0
index: 8jE26hwkyWzOOpV prediction: 1
index: j1KqeiH7KrzuW9N prediction: 1
index: lBLYpZi7P5Fbs1N prediction: 0
index: tossSzrrzX43iqu prediction: 0
index: ZIIvtO5hcTg8Tg4 prediction: 1
index: w7WNrNenF4h9C5 prediction: 0
index: z8lmAhwtP3ehr63 prediction: 1
index: QibP7kHXVqO8Ve7 prediction: 0
index: SuNat7oTPLjWsXD prediction: 1
index: 9KXp0XrXblr7bxy prediction: 0
index: w7WNrNenF4h9C5 prediction: 0
index: b8HQmdEN4W8VMfj prediction: 0
index: Cb1ZysE3Vb0BmRc prediction: 1
index: jbsQ6vrRg4FK2ea prediction: 1
index: RcECLtfRYf0pIvi prediction: 0
ind

## Discussion

I would like to mention followings regarding the dataset, models and final outcome of this exercise.

1. Machine Learning is all about learning from previous data. The amount of data unavailable for building the model really help to generalize the model. So more data really helps to better predictions. Unfortunately, this dataset contains less than 200 data points. Hence, building a generalize model is impossible with such a small dataset. 

2. Even this such a small dataset and relatively simple algorithm, we managed to achieve just above 80 percent accuracy (according to 3-Fold cross-validation) and this is much better than random guessing. (However, don't expect similar level of accuracy when it runs against testing set to the inability of the model due to small number of training example available for model building.)

3. When we are building complex machine learning models one of the main concerns is managing bias-variance tradeoff. Since the simplicity of the dataset (both number of feature and number of data points),  we really restricted to simple (I would say less powerful models) models such as Logistic Regression and Random Forest. 

4. Finally, it was mention in the question that Java language is prepared. However, it in Machin Learning world people NEVER use Java programming language. (Yes, it is true that we have Apache Spark. But it is mainly using for large production systems.). The usual practice in Machine Learning is: First, we build quick models such as one shown above in Python or R.Next, when it comes to production we use C/C++.

