## Baseline Model Pipeline   

This is the baseline kernel (automatically generated by my bot, Maggle). It implements an end-to-end classification pipeline.

### Contents 

1. Prepare Environment  
2. Dataset Preparation and Exploration   
&nbsp;&nbsp;&nbsp;&nbsp; 2.1 Dataset Snapshot and Summary    
&nbsp;&nbsp;&nbsp;&nbsp; 2.2 Target Variable Distribution    
&nbsp;&nbsp;&nbsp;&nbsp; 2.3 Missing Values    
&nbsp;&nbsp;&nbsp;&nbsp; 2.4 Variable Correlations
3. Preprocessing  
&nbsp;&nbsp;&nbsp;&nbsp; 3.1 Label Encoding    
&nbsp;&nbsp;&nbsp;&nbsp; 3.2 Missing Values Treatment     
&nbsp;&nbsp;&nbsp;&nbsp; 3.3 Feature Engineering   
&nbsp;&nbsp;&nbsp;&nbsp; 3.4 Train Test Split    
4. Modelling   
&nbsp;&nbsp;&nbsp;&nbsp; 4.1 Logistic Regression  
&nbsp;&nbsp;&nbsp;&nbsp; 4.2 Random Forest  
&nbsp;&nbsp;&nbsp;&nbsp; 4.3 Extreme Gradient Boosting  
5. Feature Importance   
6. Creating Submission

## Step 1: Prepare Environment
Let's load the required libraries.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from xgboost import plot_importance
from collections import Counter
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns 
import xgboost as xgb 
import pandas as pd
import numpy as np 
import itertools

## Step 2: Dataset Preparation and Exploration
Load the train and test datasets into memory.

In [None]:
## read dataset
train_df = pd.read_csv('../input/train 2.csv')
test_df = pd.read_csv("../input/test 2.csv")

## get predictor and target variables
## (the template defaults "Survived" / "PassengerId" were immediately
## overwritten, so only the values used for this dataset are kept)
_target = "author"
_id = "id" 
tag = "text"

Y = train_df[_target]
distinct_Y = Y.value_counts().index
test_id = test_df[_id]

## drop the target and id columns
train_df = train_df.drop([_target, _id], axis=1)
test_df = test_df.drop([_id], axis=1)

textcol = "text"

### 2.1 Dataset snapshot and summary

In [None]:
## snapshot of train and test
train_df.head()

In [None]:
## summary of train and test
# if tag != "text":
train_df.describe()

### 2.2 Target variable distribution

In [None]:
tar_dist = dict(Counter(Y.values))

xx = list(tar_dist.keys())
yy = list(tar_dist.values())

plt.figure(figsize=(6,5))
sns.set(style="whitegrid")
ax = sns.barplot(x=xx, y=yy)
ax.set_title('Distribution of Target')
ax.set_ylabel('count');
ax.set_xlabel(_target);
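
The bar plot shows raw counts; class balance is often easier to judge as proportions. The following is a minimal sketch on a hypothetical toy series (`toy_Y` and its labels `a`, `b`, `c` are made-up placeholders, not values from the dataset). In the notebook, `Y` would take its place.

```python
import pandas as pd

# Hypothetical toy target; the labels are placeholders, not real classes.
toy_Y = pd.Series(["a", "b", "a", "c", "a", "b"], name="author")

# Absolute counts per class and the same counts as fractions of the total.
counts = toy_Y.value_counts()
shares = toy_Y.value_counts(normalize=True)

# Side-by-side summary; both Series share the class labels as index.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

A strongly skewed `share` column would suggest checking metrics beyond plain accuracy when the models are evaluated later.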

### 2.3 Missing Value Counts

In [None]:
mcount = train_df.isna().sum()
xx = mcount.index 
yy = mcount.values

plt.figure(figsize=(6,5))
sns.set(style="whitegrid")
ax = sns.barplot(x=xx, y=yy)
ax.set_title('Number of Missing Values')
ax.set_xlabel('Column');
ax.set_ylabel('Number of Missing Values');
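
Raw counts can be hard to compare when columns differ in how much they matter, so the share of missing rows per column is a useful companion. The sketch below uses a hypothetical toy frame (`toy_df` and its column names are assumptions for illustration only). In the notebook, `train_df` would be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy_df = pd.DataFrame({
    "text": ["first", None, "third", "fourth"],
    "length": [5.0, np.nan, np.nan, 6.0],
})

# isna() marks missing cells; sum() counts them, mean() gives the fraction.
missing_count = toy_df.isna().sum()
missing_share = toy_df.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```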

### 2.4 Variable Correlations 

Let's plot the correlations among the variables.

In [None]:
## correlations are only defined for numeric columns
corr = train_df.select_dtypes(include="number").corr()

## mask the upper triangle (the np.bool alias was removed from NumPy; use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, 
            square=True, linewidths=.5, cbar_kws={"shrink": .5});
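
The mask hides the upper triangle and the diagonal because a correlation matrix is symmetric, so those cells only repeat information. Here is a small sketch of how the mask is built, using a made-up 3x3 matrix (the values in `toy_corr` are illustrative, not computed from the dataset):

```python
import numpy as np

# Hypothetical symmetric correlation matrix for three variables.
toy_corr = np.array([
    [1.0, 0.2, -0.1],
    [0.2, 1.0, 0.4],
    [-0.1, 0.4, 1.0],
])

# Start with nothing masked, then mark the upper triangle and the diagonal.
toy_mask = np.zeros_like(toy_corr, dtype=bool)
toy_mask[np.triu_indices_from(toy_mask)] = True

# Cells where the mask is False are the ones a heatmap would still draw.
visible = toy_corr[~toy_mask]
print(toy_mask)
print(visible)
```

Only the three entries below the diagonal remain visible, one per distinct pair of variables.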