# Simple solution (Naive Bayes)

Simple ideas are often good ones! In this kernel you will discover how we can
produce a good model with only a few lines of code.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, accuracy_score, log_loss

%matplotlib inline
matplotlib.style.use('ggplot')

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

authors = df_train['author'].unique()
print('Authors are:', authors)

We just loaded our datasets. Now let's split the training data into two buckets: a train set (80% of the original data) and a held-out test set (20% of the data).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_train['text'].values,
    df_train['author'].values,
    test_size=0.2,
    random_state=42
)

Now that we are all set, we can train our model.

We are going to use a technique named Multinomial Naive Bayes.

It counts how often each word appears in each author's texts, then uses those word frequencies to estimate which author most likely wrote a new example.
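To make the idea concrete, here is a minimal from-scratch sketch of the same technique in plain Python. The toy sentences, the labels `author_a` and `author_b`, and the helper `predict` are made up for illustration; this is not how scikit-learn implements it internally, only the same counting idea with additive smoothing (`alpha=1.0`, which is also the default of `MultinomialNB`).

```python
import math
from collections import Counter, defaultdict

# Toy corpus: sentences and labels are invented for illustration only.
docs = [
    ("the raven spoke nevermore", "author_a"),
    ("the raven sat upon the door", "author_a"),
    ("the monster fled from the creator", "author_b"),
    ("my creator abandoned the monster", "author_b"),
]

word_counts = defaultdict(Counter)  # word frequencies per author
doc_counts = Counter()              # number of documents per author
vocab = set()                       # every word seen during training

for text, author in docs:
    tokens = text.lower().split()
    word_counts[author].update(tokens)
    doc_counts[author] += 1
    vocab.update(tokens)


def predict(text, alpha=1.0):
    """Return the author with the highest log posterior for `text`."""
    scores = {}
    for author in doc_counts:
        # Log prior: share of training documents written by this author.
        score = math.log(doc_counts[author] / len(docs))
        total = sum(word_counts[author].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Log likelihood of the word, with additive (Laplace) smoothing.
            score += math.log(
                (word_counts[author][token] + alpha)
                / (total + alpha * len(vocab))
            )
        scores[author] = score
    return max(scores, key=scores.get)


print(predict("the monster spoke to its creator"))
print(predict("the raven"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that an author never used from driving that author's score to minus infinity.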

Let's see how it goes.

In [None]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(X_train)

classifier = MultinomialNB()
classifier.fit(counts, y_train)

Done. That was simple, wasn't it?

We now have a working solution, but how does it perform?

In [None]:
examples = ['How peculiar!', "That is a monster!", "the old man hadn't much time"]
example_counts = vectorizer.transform(examples)
pred = classifier.predict(example_counts)
pred

In [None]:
test_counts = vectorizer.transform(X_test)
y_pred = classifier.predict(test_counts)

accu = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (100. * accu))

In [None]:
y_pred_proba = classifier.predict_proba(test_counts)
y_label = LabelBinarizer().fit_transform(y_test)
loss = log_loss(y_label, y_pred_proba)

print("Log-loss: %.4f" % loss)
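For intuition about what this number measures, the multiclass log-loss is the average negative log of the probability the model gave to the true label. A minimal sketch in plain Python follows; the class names, labels, probabilities, and the helper `multiclass_log_loss` are hypothetical, and unlike a library implementation it does no clipping, so a probability of exactly 0 for the true class would raise an error.

```python
import math

# Hypothetical classes, true labels, and predicted probabilities.
# Each row of `proba` follows the order of `classes`.
classes = ["a", "b", "c"]
y_true = ["a", "c", "b"]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]


def multiclass_log_loss(y_true, proba, classes):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        total -= math.log(row[classes.index(label)])
    return total / len(y_true)


loss = multiclass_log_loss(y_true, proba, classes)
print("Log-loss: %.4f" % loss)
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong one is penalized heavily, which is why log-loss can be poor even when accuracy looks good.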

In [None]:
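# Use classifier.classes_ so the row/column labels match the matrix order
conf = confusion_matrix(y_test, y_pred, labels=classifier.classes_)
conf = pd.DataFrame(
    # np.float was removed in NumPy 1.24; keepdims=True divides each row by its own total
    conf.astype(float) / conf.sum(axis=1, keepdims=True),
    index=classifier.classes_,
    columns=classifier.classes_
)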

plt.figure()
plt.title('Confusion matrix of the predictions')
cmap = sns.cubehelix_palette(as_cmap=True)
sns.heatmap(conf, annot=True, cmap=cmap)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
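To read each row of the heatmap as "share of this author's texts assigned to each label", every row has to be divided by its own total. A minimal sketch with made-up counts, showing why NumPy needs `keepdims=True` for that division to act row-wise:

```python
import numpy as np

# Made-up confusion counts: rows are true labels, columns are predicted labels.
conf = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [5, 5, 10],
])

# keepdims=True keeps the row totals as a (3, 1) column, so broadcasting
# divides each row by its own total.
row_normalized = conf / conf.sum(axis=1, keepdims=True)

# Without keepdims the totals have shape (3,), and broadcasting divides
# column j by the total of row j instead.
column_scaled = conf / conf.sum(axis=1)

print(row_normalized.sum(axis=1))
print(column_scaled.sum(axis=1))
```

Only the first version produces rows that each sum to 1; the two coincide only when every row happens to have the same total.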

Finally, we build a new DataFrame with Pandas that holds the predicted probabilities for the Kaggle test set, ready to submit.
Have fun!

In [None]:
final_counts = vectorizer.transform(df_test['text'])
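# predict_proba columns follow classifier.classes_ (sorted labels), not the order returned by unique()
result = pd.DataFrame(classifier.predict_proba(final_counts), columns=classifier.classes_)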
result.insert(0, 'id', df_test['id'])
result.to_csv('kaggle_solution.csv', index=False, float_format='%.15f')