In [197]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# check what data we have
import os
print(os.listdir("../input"))

# Helpers
I'm just writing a couple of helper functions here. They print the head / tail of pandas dataframes a little more nicely. The original code, which I copied, is [here](https://gist.github.com/dmyersturnbull/035876942070ced4c565e4e96161be3e).

In [198]:
# from https://gist.github.com/dmyersturnbull/035876942070ced4c565e4e96161be3e

from IPython.display import display, Markdown
import pandas as pd

def head(df: pd.DataFrame, n_rows:int=1) -> None:
    """Pretty-print the head of a Pandas table in a Jupyter notebook and show its dimensions."""
    display(Markdown("**whole table (below):** {} rows × {} columns".format(len(df), len(df.columns))))
    display(df.head(n_rows))
    
def tail(df: pd.DataFrame, n_rows:int=1) -> None:
    """Pretty-print the tail of a Pandas table in a Jupyter notebook and show its dimensions."""
    display(Markdown("**whole table (below):** {} rows × {} columns".format(len(df), len(df.columns))))
    display(df.tail(n_rows))

# Preprocessing
Let's go ahead and read in the data and split it into features and labels. I do two things to the features: first I scale every column to the [0, 1] range, and then I run PCA on the scaled data. When I was initially running random forest on the data, I found that only four of the features were really contributing to the predictions (found by looking at `your_rfc.feature_importances_`). Therefore, I chose to reduce the dimensionality of the data to four principal components. Note that the principal components are not those four features: each one is a linear combination of all the input features, and they are ordered by how much of the data's variance they explain, so the first component carries the most and the fourth the least.
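
To see how much each principal component actually contributes, PCA's `explained_variance_ratio_` attribute can be inspected after fitting. The sketch below uses random stand-in data rather than the wine table, so `demo_features` and its shape are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table (assumption: 200 rows, 11 columns).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 11))

# Same two steps as in the notebook: scale to [0, 1], then fit a 4-component PCA.
demo_scaled = MinMaxScaler().fit_transform(demo_features)
demo_pca = PCA(n_components=4).fit(demo_scaled)

# Fraction of the total variance captured by each component, largest first.
ratios = demo_pca.explained_variance_ratio_
print('per component:', ratios)
print('total variance kept:', ratios.sum())
```

If the total comes out low on the real data, four components may be discarding a lot of information.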

In [199]:
# named `wine` rather than `input` so the built-in input() is not shadowed
wine = pd.read_csv('../input/winequality-red.csv')

# get X and y slices, do preprocessing
# columns 0-10 are the eleven features, column 11 is the quality label
X = wine.iloc[:, :11]

# https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame
from sklearn import preprocessing

to_scale = X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(to_scale)
X = pd.DataFrame(scaled)

from sklearn.decomposition import PCA

pca = PCA(n_components=4)
pca.fit(X)

X = pd.DataFrame(pca.transform(X), columns=['PCA%i' % i for i in range(4)], index=X.index)

y = wine.iloc[:, 11]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
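
One caveat with the cell above: the scaler and PCA are fitted on all rows before the split, so the test rows influence the preprocessing. A possible alternative, sketched here on synthetic data (`make_classification` is a stand-in, not the wine data, and `n_estimators=100` is an arbitrary choice), is to split first and wrap the steps in an sklearn pipeline so they are fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption); swap in the wine features and labels.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=42)

# Split first, so nothing about the test rows leaks into the preprocessing.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.20, stratify=demo_y, random_state=42)

# Scaler and PCA are fitted inside the pipeline, on the training rows only.
leak_free = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=4),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
leak_free.fit(demo_X_train, demo_y_train)

demo_accuracy = leak_free.score(demo_X_test, demo_y_test)
print('test accuracy:', demo_accuracy)
```

The pipeline also behaves like a single estimator, so it could be passed to cross validation directly.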

# Predictions
We'll use `check_stats` to show some statistics for each model, namely loss (the absolute difference between the predicted and actual quality, summed and averaged over the test set) and accuracy. I copied the models from [another Kaggle kernel](https://www.kaggle.com/pranavcoder/random-forests-and-keras). We're focusing on random forest, but we have a single decision tree as a reference, as well as a gradient boosting classifier, which I included so we can do some ensembling a bit later as well. Note that I use a heck-ton of estimators in the random forest classifier.

In [200]:
def check_stats(model, model_name):
    """Print loss and accuracy of `model` on the global X_test / y_test split."""
    predictions = model.predict(X_test)
    actual = y_test.values
    total = len(actual)

    # total absolute difference between actual and predicted quality
    loss = np.abs(actual - predictions).sum()
    # number of exact matches (named `correct` to avoid shadowing the built-in sum)
    correct = (predictions == actual).sum()

    accuracy = correct / total
    avg_loss = loss / total

    print('MODEL STATS: ' + model_name)
    print('loss: ', loss)
    print('avg loss: ', avg_loss)
    print('accuracy: ', round(accuracy * 100, 2), '%\n')
    
# https://www.kaggle.com/pranavcoder/random-forests-and-keras

from sklearn import ensemble, tree

cart = tree.DecisionTreeClassifier(criterion='entropy', max_depth=None)
forest = ensemble.RandomForestClassifier(criterion='entropy', n_estimators=1000, max_features=None, max_depth=None)
gboost = ensemble.GradientBoostingClassifier(max_depth=None)

cart.fit(X_train, y_train)
forest.fit(X_train, y_train)
gboost.fit(X_train, y_train)

check_stats(cart, 'Decision Tree')
check_stats(forest, 'Random Forest')
check_stats(gboost, 'GBoost')
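
As a cross-check, the numbers `check_stats` prints correspond to two functions in `sklearn.metrics`: accuracy is `accuracy_score` and the average loss is `mean_absolute_error`. The labels below are made up for illustration and are not results from the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical quality labels and predictions, for illustration only.
demo_actual = np.array([5, 6, 5, 7, 4, 6])
demo_predicted = np.array([5, 5, 5, 6, 6, 6])

# Fraction of predictions that match the label exactly.
demo_accuracy = accuracy_score(demo_actual, demo_predicted)
# Mean of |actual - prediction|, i.e. the "avg loss" from check_stats.
demo_avg_loss = mean_absolute_error(demo_actual, demo_predicted)
# Sum of |actual - prediction|, i.e. the "loss" from check_stats.
demo_loss = np.abs(demo_actual - demo_predicted).sum()

print('loss:', demo_loss)
print('avg loss:', demo_avg_loss)
print('accuracy:', demo_accuracy)
```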

# Voting Classifier
Now, I included two models besides random forest so I could ensemble the results. Let's go ahead and do so, using sklearn's built-in voting classifier. I tried both soft and hard voting, and it didn't really make a difference. I think soft voting is fine in this situation, so I set `voting='soft'` explicitly (sklearn's default is `'hard'`). Notice the voting classifier has the same `fit` / `predict` interface as the other three models, which is nice since we can reuse our `check_stats` function.

In [201]:
# let's try a voting classifier
from sklearn.ensemble import VotingClassifier

cart = tree.DecisionTreeClassifier(criterion='entropy', max_depth=None)
forest = ensemble.RandomForestClassifier(criterion='entropy', n_estimators=1000, max_features=None, max_depth=None)
gboost = ensemble.GradientBoostingClassifier(max_depth=None)

vc = VotingClassifier(estimators=[('cart', cart), ('forest', forest), ('gboost', gboost)], voting='soft')

vc = vc.fit(X_train, y_train)

check_stats(vc, 'Voting Classifier')
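
For anyone who wants to repeat the hard vs. soft comparison, here is a minimal sketch on synthetic data. The dataset, the smaller `n_estimators`, and the `make_voter` helper are all assumptions made to keep the example quick, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (assumption).
demo_X, demo_y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    demo_X, demo_y, test_size=0.20, stratify=demo_y, random_state=0)

def make_voter(voting):
    """Hypothetical helper: build a fresh ensemble for the given voting mode."""
    return VotingClassifier(
        estimators=[
            ('cart', DecisionTreeClassifier(random_state=0)),
            ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
            ('gboost', GradientBoostingClassifier(random_state=0)),
        ],
        voting=voting,
    )

# 'hard' takes a majority vote on predicted labels,
# 'soft' averages the predicted class probabilities.
demo_scores = {}
for voting in ('hard', 'soft'):
    voter = make_voter(voting).fit(train_X, train_y)
    demo_scores[voting] = voter.score(test_X, test_y)
    print(voting, 'voting accuracy:', demo_scores[voting])
```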
    

# Voting Classifier Results 
So voting among the classifiers doesn't yield better results. That's fine, we just wanted to play around with ensembling in this situation. There are other problems that we can attribute our current low accuracy to. One of these is the size of the dataset relative to the number of classes we're trying to predict. The quality scores in this dataset run from 3 to 8, so we have six classes and only about 1,600 samples. If we did something with two classes, like good wine / bad wine, we would probably get better results. But where's the fun in that? In the future I might try synthesizing additional data to see if I can get better results across all the classes, but that would of course be artificial, so the current results are fine.
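
The good wine / bad wine idea could be sketched roughly like this. The scores are made up, and the cut-off `GOOD_THRESHOLD = 7` is a hypothetical choice, not something taken from the data or from another kernel.

```python
import pandas as pd

# Made-up quality scores for illustration only.
demo_quality = pd.Series([3, 4, 5, 5, 6, 6, 7, 8], name='quality')

# How many samples fall into each class; useful for spotting rare classes.
demo_counts = demo_quality.value_counts().sort_index()
print(demo_counts)

# Hypothetical cut-off: scores at or above it count as "good" (1), the rest as "bad" (0).
GOOD_THRESHOLD = 7
demo_is_good = (demo_quality >= GOOD_THRESHOLD).astype(int)
print(demo_is_good.tolist())
```

The resulting 0/1 series could then be used as `y` in place of the raw quality scores.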

# Cross Validation
Let's go ahead and try one more thing, based on [another kernel using this dataset](https://www.kaggle.com/vishalyo990/prediction-of-quality-of-wine) which does use good/bad classes. They used cross validation and reported that their 'precision' increased by an appreciable 4%.

In [202]:
from sklearn.model_selection import cross_val_score

# Now let's evaluate the random forest model using cross validation.
rfc_eval = cross_val_score(estimator = forest, X = X_train, y = y_train, cv = 8)
rfc_eval.mean()


# Cross Validation Results
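So in our case, again likely due to the number of classes we have, cross validation didn't improve the numbers for our random forest model. The mean cross-validated score is in fact lower than the accuracy on the test split. The two are comparable: for a classifier, `cross_val_score` uses the estimator's `score` method by default, which is accuracy, so `rfc_eval.mean()` is the mean accuracy over the eight folds. Keep in mind that cross validation only changes how the model is evaluated, not the model itself, so a lower number here is a more conservative estimate rather than a worse model.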
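
Whether `rfc_eval.mean()` is an accuracy can be checked by passing `scoring='accuracy'` explicitly and comparing against the default. This is a sketch on synthetic data; the dataset, the fixed seeds, and the explicit `StratifiedKFold` splitter are assumptions added so that both runs see identical folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class stand-in data (assumption).
demo_X, demo_y = make_classification(
    n_samples=240, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=50, random_state=0)
# A seeded splitter yields the same eight folds on every call.
demo_folds = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)

# Default scoring: the estimator's own score method.
default_scores = cross_val_score(demo_forest, demo_X, demo_y, cv=demo_folds)
# Explicit accuracy scoring on the same folds.
accuracy_scores = cross_val_score(
    demo_forest, demo_X, demo_y, cv=demo_folds, scoring='accuracy')

print('default :', default_scores.mean())
print('accuracy:', accuracy_scores.mean())
print('identical:', np.allclose(default_scores, accuracy_scores))
```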

### Thanks for taking a look at this kernel. It is of course just a very quick exploration into random forest classification and a bit of ensembling. Let me know if I've made any grave mistakes etc.