In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import make_pipeline


In [None]:
pd.concat([pd.read_csv("../input/sample_submission.csv")['id'],pd.DataFrame(make_pipeline(CountVectorizer(), TfidfTransformer(), SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=10, random_state=42)).fit(*np.split(pd.read_csv("../input/train.csv")[['text','author']].T.values.flatten(), 2)).predict_proba(pd.read_csv("../input/test.csv")['text']), columns=['EAP','HPL','MWS'])], axis=1).to_csv('submission.csv', sep=',',index=False)

**\*record scratch\***

**\*freeze frame\***

*Yup, that's my code. You're probably wondering how I ended up in this situation.*


### So, let's break it down:

*(OK, so I lied a bit in the title: if you also count the imports, it's not really a one-line solution. Sorry about that!)*

First, we read in the train data (we are only interested in the **text** and **author** columns):

In [None]:
train = pd.read_csv("../input/train.csv")
train = train[['text','author']]

To classify the given texts by author, we use the **sklearn** library.
We define a *pipeline* that does three things:
* preprocess and tokenize the texts, transforming them into feature vectors of token counts (**CountVectorizer**; stop-word filtering is available but off by default)
* reweight the counts by *Term Frequency times Inverse Document Frequency*, or tf-idf (**TfidfTransformer**)
* train a linear classifier with stochastic gradient descent (SGD) learning (**SGDClassifier**; with the logistic loss used here it is a logistic regression, which is what makes *predict_proba* available)
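
To make the first two steps concrete, here is a minimal pure-Python sketch of counting followed by tf-idf weighting on a toy corpus. It assumes the smoothed idf and L2 row normalization that TfidfTransformer applies by default. The toy documents, the whitespace tokenizer and the helper name `tfidf_rows` are made up for illustration; the real CountVectorizer tokenizer is more elaborate.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Toy tf-idf: whitespace tokens, smoothed idf, L2-normalised rows."""
    # Step 1 (CountVectorizer-like): count tokens per document
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Step 2 (TfidfTransformer-like): smoothed idf = ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        weights = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up documents, not taken from the competition data
docs = ["the raven cried", "the house stood silent", "the raven left the house"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print(rows[0])
```

In the first document, "the" ends up with a lower weight than "cried" because it occurs in every document, which is the whole point of the idf factor.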

In [None]:
classifier_pipeline = make_pipeline(
    CountVectorizer(), 
    TfidfTransformer(), 
    # loss='log' selects logistic regression; newer scikit-learn releases spell it 'log_loss'
    SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=10, random_state=42)
)

Our pipeline needs to be trained on pairs of example inputs and outputs: the *text* and *author* columns of the *train* data frame. So we do a little trick to pass the two pandas columns as the two arguments of our *fit* function.
Transposing the values gives one row of texts and one row of authors, so flattening puts all the *text* samples in the first half of the resulting array and all the *author* labels in the second. Splitting that array in two yields exactly the two arguments we need.


In [None]:
flattened = train.T.values.flatten()
x,y = np.split(flattened, 2) # x is text, y is authors
classifier_pipeline.fit(x, y)
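
The flatten-and-split trick is easier to follow on a tiny example. The sketch below uses a made-up three-row array as a stand-in for the values of the *train* data frame and checks that the two halves really are the two original columns.

```python
import numpy as np

# Made-up stand-in for train[['text', 'author']].values (one row per sample)
table = np.array([
    ["text one", "EAP"],
    ["text two", "HPL"],
    ["text three", "MWS"],
], dtype=object)

# Transposing gives a 2 x n array: row 0 holds the texts, row 1 the authors.
# flatten() reads it row by row, so all texts come first, then all authors.
flattened = table.T.flatten()

# Splitting into two equal halves recovers the two columns.
x, y = np.split(flattened, 2)

print(x)  # the three texts
print(y)  # the three author labels
```

Outside a one-liner, `classifier_pipeline.fit(train['text'], train['author'])` is the more conventional way to write the same call.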

Once our classifier is trained, we read the test data and make our predictions:

In [None]:
test = pd.read_csv("../input/test.csv")
prediction = classifier_pipeline.predict_proba(test['text'])

For each text given as input, the prediction contains an array with the probabilities of that text belonging to each of the three authors.
We then take the *id* column from the sample submission and place our predicted probabilities next to it as the three author columns:
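
The hard-coded column names only line up with the probabilities if the classifier's classes come in that same order. scikit-learn classifiers keep their labels sorted in the `classes_` attribute, and the columns of `predict_proba` follow that order. The sketch below mimics the sorting with `np.unique` on made-up labels; with the fitted pipeline you could instead read the order from the final step's `classes_`.

```python
import numpy as np

# Made-up author labels in arbitrary order, standing in for train['author']
labels = np.array(["MWS", "EAP", "HPL", "EAP", "MWS"])

# Sorted unique labels: the same ordering scikit-learn exposes as classes_
class_order = np.unique(labels)
print(class_order)

# The column names used for the submission must match this order
submission_columns = ["EAP", "HPL", "MWS"]
assert list(class_order) == submission_columns
```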

In [None]:
sample_submission = pd.read_csv("../input/sample_submission.csv")
id_column = sample_submission['id']

authors = pd.DataFrame(prediction, columns=['EAP','HPL','MWS'])

submission = pd.concat([id_column, authors], axis=1)

Note the *axis=1* above. This means we add columns instead of trying to append rows to our data frame. 
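
A small sketch of the difference between the two axes, using made-up ids and probabilities in place of the real submission data:

```python
import pandas as pd

# Made-up ids and probabilities standing in for the real submission data
ids = pd.Series(["id1", "id2"], name="id")
probs = pd.DataFrame([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
                     columns=["EAP", "HPL", "MWS"])

# axis=1: the objects are placed side by side, matched on the row index
side_by_side = pd.concat([ids, probs], axis=1)

# axis=0: rows are appended instead, leaving NaN where a column is missing
stacked = pd.concat([ids.to_frame(), probs], axis=0)

print(side_by_side.shape)
print(stacked.shape)
```

With *axis=1*, rows are matched on the index, so this works because both objects carry the default 0..n-1 index.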
We are now ready to save our data to a .csv file and submit it.

In [None]:
submission.to_csv('submission_long.csv', sep=',', index=False)

The two generated files (*submission.csv* and *submission_long.csv*) should be identical, and they score 0.89 on the leaderboard. Not an impressive score, but a good start nonetheless.
And given that you now have a pipeline set up, you can start experimenting with hyper-parameter tuning, cross-validation, and all sorts of other pre-processors and classifiers.
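
As one possible starting point for such experiments, here is a hedged sketch of cross-validating a text pipeline. It is not the pipeline from this notebook: the corpus is made up, TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair, LogisticRegression stands in for the SGD-trained logistic model, and plain accuracy is used as the score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: four short "texts" per author label
texts = [
    "the raven tapped at my chamber door", "a beating heart beneath the floor",
    "the pendulum swung ever lower", "the cask lay deep in the vault",
    "a nameless horror rose from the sea", "the cult chanted to sleeping gods",
    "strange geometry filled the dead city", "the mountains hid an elder race",
    "the creature fled across the ice", "the student built life from death",
    "a monster begged for a companion", "the sailor wrote of northern seas",
]
labels = ["EAP"] * 4 + ["HPL"] * 4 + ["MWS"] * 4

# TfidfVectorizer combines counting and tf-idf weighting in one step;
# LogisticRegression is a stand-in for the SGD-trained logistic model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 2-fold stratified cross-validation, scored by plain accuracy
scores = cross_val_score(model, texts, labels, cv=2)
print(scores.mean())
```

On the real data you would pass the *text* and *author* columns instead of the toy lists and use more folds.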

Happy coding!