First of all, I took ideas from this website: 

https://gerardnico.com/data_mining/baseline

The first model that we are going to build is a baseline model. A baseline model uses a naive classification rule such as: 

* Always classify to the largest class. 

According to this kernel: 

[Univariate exploratory analysis](https://www.kaggle.com/ricardorios/univariate-exploratory-analysis)

The most frequent class is 1.0, so we are going to build a baseline model that always predicts 1.0. 
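
As a minimal sketch of the "largest class" rule, the snippet below finds the majority label in a small made-up list and uses it as a constant prediction. The values in `toy_labels` are hypothetical and are not taken from the competition data.

```python
from collections import Counter

# Hypothetical toy labels, for illustration only (not the competition data)
toy_labels = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]

# The baseline rule: find the class that appears most often
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]

# Predict that class for every observation
toy_predictions = [majority_class] * len(toy_labels)

# The accuracy of this rule equals the share of the majority class
toy_accuracy = majority_count / len(toy_labels)
print(majority_class, toy_accuracy)
```

Any model that we build later should do better than this rule, otherwise it has not learned anything useful from the features.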


In [None]:
# Loading the packages
import numpy as np
import pandas as pd 
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
# Loading the training dataset
df_train = pd.read_csv("../input/train.csv")


In [None]:
# Separate the target from the features
y = df_train.pop("target")
# Drop the id column, which is an identifier and not a feature
df_train.pop("id")
X = df_train
del df_train

In machine learning, we often split the available data into training, validation, and test datasets. I suggest reading the following web page to understand why we need to do this: 

https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7

According to the website above we have the following definitions: 

* Training Dataset: The sample of data used to fit the model. Fitting a model is the process of adjusting it so that it gives good predictions. 

* Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model [hyperparameters](https://www.quora.com/What-are-hyperparameters-in-machine-learning). The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

* Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
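
The three definitions above can be sketched with two successive calls to `train_test_split`. This is only an illustration for a model that does have hyperparameters: the toy arrays `X_toy` and `y_toy` and the 60/20/20 proportions are assumptions, not values used in this kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, binary labels
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = rng.randint(0, 2, size=100)

# First split: hold out 20% of the rows as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Second split: take 25% of the remainder as validation (20% of the total)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
```

The validation rows would be used to compare hyperparameter settings, and the test rows would be touched only once, at the end.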

Our baseline model does not have hyperparameters, which is why we do not need a validation dataset.

In [None]:
# Split data into training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

Next, we are going to create a [dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) that always predicts the most frequent class. This model will be trained with the data contained in X_train and y_train. To train a model means to use the training data to find a mathematical expression that we can use to predict new values. Fit and train are synonyms. 

In [None]:
# Create dummy classifier
dummy = DummyClassifier(strategy='most_frequent', random_state=0)

# "Train" model
dummy.fit(X_train, y_train)
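
To make "training" concrete for this kind of classifier, here is a small sketch on made-up data. The arrays `X_toy`, `y_toy`, and `X_new` are hypothetical. With the `most_frequent` strategy, fitting only records the class labels and their frequencies, and the features are ignored when predicting.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the feature values do not matter to the dummy classifier
X_toy = np.zeros((5, 2))
y_toy = np.array([1, 1, 0, 1, 0])

toy_dummy = DummyClassifier(strategy="most_frequent")
toy_dummy.fit(X_toy, y_toy)

# What "fit" stored: the class labels and the share of each label in y_toy
print(toy_dummy.classes_)
print(toy_dummy.class_prior_)

# Predictions are the same whatever the input features are
X_new = np.array([[10.0, -3.0], [0.5, 99.0]])
toy_pred = toy_dummy.predict(X_new)
toy_proba = toy_dummy.predict_proba(X_new)
print(toy_pred)
print(toy_proba)
```

Here class 1 is the majority in `y_toy`, so every row of `X_new` is assigned to class 1 with probability 1.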


In the following cell, we are going to generate the predicted probabilities of class 1 over the training and test datasets. Note that the model always predicts 1.

In [None]:
# Probabilities for class 1 in the training dataset
y_train_predict = dummy.predict_proba(X_train)
y_train_predict = y_train_predict[:, 1]
# Probabilities for class 1 in the test dataset
y_test_predict = dummy.predict_proba(X_test)
y_test_predict = y_test_predict[:, 1]
# Both arrays contain a single unique value
print(np.unique(y_train_predict))
print(np.unique(y_test_predict))
print("The model always predicts 1!!")

In the next cell, we are going to calculate the AUC metric on the training and test datasets. I suggest reading the following article to understand this metric:

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

In [None]:
auc_train = roc_auc_score(y_train, y_train_predict)
auc_test = roc_auc_score(y_test, y_test_predict)

In [None]:
print("The AUC in the training dataset is {}".format(auc_train))
print("The AUC in the test dataset is {}".format(auc_test))

We have obtained an AUC of 0.5 in both the training and test datasets. This is the value that a constant prediction gives, because it cannot rank the positive cases above the negative ones. This model is too simple ([underfitting](https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)). 
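
To see why a constant prediction gives an uninformative AUC, the sketch below computes the AUC as the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as one half. The function `pairwise_auc` and the toy labels and scores are hypothetical helpers written for this illustration. This is not how `roc_auc_score` is implemented, although the two should agree on these inputs.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positives) * len(negatives))


# Hypothetical toy labels and scores, for illustration only
toy_labels = [1, 0, 1, 1, 0]
constant_scores = [1.0] * 5
informative_scores = [0.9, 0.2, 0.8, 0.4, 0.6]

auc_constant = pairwise_auc(toy_labels, constant_scores)
auc_informative = pairwise_auc(toy_labels, informative_scores)
print(auc_constant, auc_informative)
```

With constant scores every pair is a tie, so the result is exactly 0.5, while scores that carry some information about the label move the AUC towards 1.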

Finally, we are going to generate predictions for the file test.csv. Kaggle provides this file in order to rank the competitors.

In [None]:
df_test = pd.read_csv("../input/test.csv")
df_test.pop("id");
X = df_test 
del df_test
y_pred = dummy.predict(X)

In [None]:
# Load the sample submission file
smpsb_df = pd.read_csv("../input/sample_submission.csv")


In [None]:
smpsb_df.head()

In [None]:
# Write the predictions to the submission file, without the row index
smpsb_df["target"] = y_pred
smpsb_df.to_csv("base_line_model.csv", index=False)