## Stacking Different Feature Models

In this notebook, I train on two different feature spaces:

1. Combined TF-IDF features from a word analyzer (``analyzer='word'``) and a character analyzer (``analyzer='char'``). Most of the code is taken from @nicapotato's notebook [tf-idf-xgboost](https://www.kaggle.com/nicapotato/tf-idf-xgboost)
2. TensorFlow Universal Sentence Encoder embeddings. The features are modelled on [nlp-disaster-tweets-1](https://www.kaggle.com/akazuko/nlp-disaster-tweets-1)

An XGBoost model is trained separately on each feature set.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder

from scipy.sparse import hstack
import xgboost as xgb
from xgboost.sklearn import XGBClassifier # <3
from sklearn.model_selection import train_test_split
import gc
import re
from nltk.corpus import stopwords

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv').fillna(' ')#.sample(1000)
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv').fillna(' ')#.sample(1000)

keep_train_index=train['text'].drop_duplicates().index
train=train.iloc[keep_train_index,:]
train_text = train["text"]


test_text=test['text']
all_text = pd.concat([train_text, test_text])

## Text PreProcessing
The text is pre-processed to:
1. Replace URLs with a placeholder token (removed later as a stopword)
2. Replace Twitter handles with a common token
3. Split on punctuation
4. Remove stopwords
5. Remove numbers
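
The five steps can be sketched on a single tweet using only the standard library. This is a simplified illustration: the regular expressions and the `TOY_STOPWORDS` set below are stand-ins I made up for the example, not the patterns or the NLTK stopword list used in this notebook.

```python
import re

# Simplified stand-ins for the notebook's patterns and stopword list (assumptions).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@[A-Za-z0-9_]+")
SPLIT_RE = re.compile(r"[,_ <>!?.:\n\"=*/]+")
TOY_STOPWORDS = {"the", "a", "is", "in", "url"}


def toy_preprocess(text):
    text = URL_RE.sub("url", text)                # 1. URLs -> "url" (dropped in step 4)
    text = HANDLE_RE.sub("@twitterhandle", text)  # 2. handles -> one common token
    tokens = SPLIT_RE.split(text)                 # 3. split on punctuation and spaces
    tokens = [t for t in tokens if t and t not in TOY_STOPWORDS]  # 4. drop stopwords
    text = " ".join(tokens)
    return re.sub(r"\s\d+", " ", text)            # 5. drop numbers that follow whitespace


print(toy_preprocess("Fire in the hills 2 km away! @user1 http://t.co/abc"))
```

Replacing every handle with the same token keeps the signal that a tweet mentions someone without creating one feature per user name.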

In [None]:
# Contractions
#Reference: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
def replace_contractions(tweet):
    tweet = re.sub(r"he's", "he is", tweet)
    tweet = re.sub(r"there's", "there is", tweet)
    tweet = re.sub(r"We're", "We are", tweet)
    tweet = re.sub(r"That's", "That is", tweet)
    tweet = re.sub(r"won't", "will not", tweet)
    tweet = re.sub(r"they're", "they are", tweet)
    tweet = re.sub(r"Can't", "Cannot", tweet)
    tweet = re.sub(r"wasn't", "was not", tweet)
    tweet = re.sub(r"don\x89Ûªt", "do not", tweet)
    tweet = re.sub(r"aren't", "are not", tweet)
    tweet = re.sub(r"isn't", "is not", tweet)
    tweet = re.sub(r"What's", "What is", tweet)
    tweet = re.sub(r"haven't", "have not", tweet)
    tweet = re.sub(r"hasn't", "has not", tweet)
    tweet = re.sub(r"There's", "There is", tweet)
    tweet = re.sub(r"He's", "He is", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"You're", "You are", tweet)
    tweet = re.sub(r"I'M", "I am", tweet)
    tweet = re.sub(r"shouldn't", "should not", tweet)
    tweet = re.sub(r"wouldn't", "would not", tweet)
    tweet = re.sub(r"i'm", "I am", tweet)
    tweet = re.sub(r"I\x89Ûªm", "I am", tweet)
    tweet = re.sub(r"I'm", "I am", tweet)
    tweet = re.sub(r"Isn't", "is not", tweet)
    tweet = re.sub(r"Here's", "Here is", tweet)
    tweet = re.sub(r"you've", "you have", tweet)
    tweet = re.sub(r"you\x89Ûªve", "you have", tweet)
    tweet = re.sub(r"we're", "we are", tweet)
    tweet = re.sub(r"what's", "what is", tweet)
    tweet = re.sub(r"couldn't", "could not", tweet)
    tweet = re.sub(r"we've", "we have", tweet)
    tweet = re.sub(r"it\x89Ûªs", "it is", tweet)
    tweet = re.sub(r"doesn\x89Ûªt", "does not", tweet)
    tweet = re.sub(r"It\x89Ûªs", "It is", tweet)
    tweet = re.sub(r"Here\x89Ûªs", "Here is", tweet)
    tweet = re.sub(r"who's", "who is", tweet)
    tweet = re.sub(r"I\x89Ûªve", "I have", tweet)
    tweet = re.sub(r"y'all", "you all", tweet)
    tweet = re.sub(r"can\x89Ûªt", "cannot", tweet)
    tweet = re.sub(r"would've", "would have", tweet)
    tweet = re.sub(r"it'll", "it will", tweet)
    tweet = re.sub(r"we'll", "we will", tweet)
    tweet = re.sub(r"wouldn\x89Ûªt", "would not", tweet)
    tweet = re.sub(r"We've", "We have", tweet)
    tweet = re.sub(r"he'll", "he will", tweet)
    tweet = re.sub(r"Y'all", "You all", tweet)
    tweet = re.sub(r"Weren't", "Were not", tweet)
    tweet = re.sub(r"Didn't", "Did not", tweet)
    tweet = re.sub(r"they'll", "they will", tweet)
    tweet = re.sub(r"they'd", "they would", tweet)
    tweet = re.sub(r"DON'T", "DO NOT", tweet)
    tweet = re.sub(r"That\x89Ûªs", "That is", tweet)
    tweet = re.sub(r"they've", "they have", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"should've", "should have", tweet)
    tweet = re.sub(r"You\x89Ûªre", "You are", tweet)
    tweet = re.sub(r"where's", "where is", tweet)
    tweet = re.sub(r"Don\x89Ûªt", "Do not", tweet)
    tweet = re.sub(r"we'd", "we would", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"weren't", "were not", tweet)
    tweet = re.sub(r"They're", "They are", tweet)
    tweet = re.sub(r"Can\x89Ûªt", "Cannot", tweet)
    tweet = re.sub(r"you\x89Ûªll", "you will", tweet)
    tweet = re.sub(r"I\x89Ûªd", "I would", tweet)
    tweet = re.sub(r"let's", "let us", tweet)
    tweet = re.sub(r"it's", "it is", tweet)
    tweet = re.sub(r"can't", "cannot", tweet)
    tweet = re.sub(r"don't", "do not", tweet)
    tweet = re.sub(r"you're", "you are", tweet)
    tweet = re.sub(r"i've", "I have", tweet)
    tweet = re.sub(r"that's", "that is", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"doesn't", "does not", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"didn't", "did not", tweet)
    tweet = re.sub(r"ain't", "am not", tweet)
    tweet = re.sub(r"you'll", "you will", tweet)
    tweet = re.sub(r"I've", "I have", tweet)
    tweet = re.sub(r"Don't", "do not", tweet)
    tweet = re.sub(r"I'll", "I will", tweet)
    tweet = re.sub(r"I'd", "I would", tweet)
    tweet = re.sub(r"Let's", "Let us", tweet)
    tweet = re.sub(r"you'd", "You would", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"Ain't", "am not", tweet)
    tweet = re.sub(r"Haven't", "Have not", tweet)
    tweet = re.sub(r"Could've", "Could have", tweet)
    tweet = re.sub(r"youve", "you have", tweet)  
    tweet = re.sub(r"donå«t", "do not", tweet)
    return tweet

In [None]:
custom_stop_list = []
#stop_files=['slang.txt']
#for stopfile in stop_files:
#    with open("../data/"+stopfile) as f:
#        for line in f:
#            custom_stop_list.extend(line.split())

# Use a real set so the membership test in preProcess is O(1)
stopword_set = set(stopwords.words('english') + custom_stop_list + ['url'])

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
#https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


remove_emoji("Omg another Earthquake 😔😔")
def preProcess(text):
    # Reference: https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove

    # Collapse runs of whitespace into a single space
    ret = re.sub(r"\s+", " ", text)

    # Decode common HTML entities
    ret = ret.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")

    # Replace slang words
    # for key in abbreviations.keys():
    #     ret = ret.replace(key, abbreviations[key])

    # Replace URLs with the token "url" (removed later as a stopword)
    regexp = r"(https?:\/\/(?:www\.|(?!www)|(?:xmlns\.))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})"
    ret = re.sub(regexp, "url", ret)

    # Replace @handles with a common token
    # ([A-Za-z] instead of [A-z], which also matches "[", "\", "]", "^" and "`")
    ret = re.sub(r"@[A-Za-z0-9_]+", "@twitterhandle", ret)

    ret = remove_emoji(ret)
    # ret = replace_contractions(ret)

    # Split on punctuation
    ret1 = re.split(r"[,_, \<>!\?\.:\n\"=*/]+", ret)

    # Remove stopwords
    ret2 = [word for word in ret1 if word not in stopword_set]
    ret2 = " ".join(ret2)

    # Remove numbers that follow whitespace
    ret2 = re.sub(r"(\s\d+)", " ", ret2)

    # Stem text
    # ret3 = stem_text(strip_punctuation(ret2))

    return ret2
te=preProcess("I'll make the day a fantastic 1!!")
print(te)

## TF-IDF
The ``word-vectorizer`` is fitted on the combined train and test text.
The ``char-vectorizer`` extracts character n-grams of length 2 to 6, which lets it capture sub-word overlap. For example, the word vectorizer counts "fire" and "fires" as separate features, whereas the character n-grams of "fires" include "fire", so the two words share features and the common stem is weighted more heavily.
I found that character vectorizers resulted in more accurate models than stemming or spell-correction.
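
The "fire" / "fires" overlap can be checked with a few lines of plain Python. This is a minimal sketch of character n-gram extraction only; `char_ngrams` is a hypothetical helper, and it skips the whitespace handling and TF-IDF weighting that `TfidfVectorizer` applies.

```python
def char_ngrams(text, n_min=2, n_max=6):
    """Return the set of character n-grams of length n_min..n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


fire = char_ngrams("fire")
fires = char_ngrams("fires")

# Every n-gram of "fire" also occurs in "fires", so the two words share features.
print(sorted(fire & fires))
print(fire <= fires)
```

A word-level vectorizer would give these two words no features in common, which is why the character features act as a cheap substitute for stemming.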

In [None]:
print("TFIDF")
#Reference: https://www.kaggle.com/nicapotato/tf-idf-xgboost
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    norm='l2',
    min_df=1,  # an integer min_df must be >= 1 in recent scikit-learn; same effect as 0
    smooth_idf=False,
    preprocessor=preProcess,
    max_features=15000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)


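# stop_words only applies to analyzer='word', so it is omitted here
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(2, 6),
    norm='l2',
    min_df=1,  # an integer min_df must be >= 1 in recent scikit-learn; same effect as 0
    smooth_idf=False,
    preprocessor=preProcess,
    max_features=30000)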
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

enc = OneHotEncoder(handle_unknown='ignore')
train_keyword_features = enc.fit_transform(train[['keyword']].to_numpy().reshape(-1,1))
test_keyword_features = enc.transform(test[['keyword']].to_numpy().reshape(-1,1))


train_features = hstack([train_char_features, train_word_features]).tocsr()
del train_char_features,train_word_features
test_features = hstack([test_char_features, test_word_features]).tocsr()
del test_char_features,test_word_features

print(train_features.shape)
print(test_features.shape)

## F1 Score

``f1_metric`` returns __1 - f1_score__ because XGBoost minimizes the evaluation metric during early stopping.
It is passed to XGBoost as a custom metric through the ``feval`` parameter.
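
To make the quantity concrete, here is a small pure-Python sketch that computes 1 - macro F1 from predicted probabilities. The helpers `macro_f1` and `one_minus_f1` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def one_minus_f1(probs, y_true):
    """Round probabilities to class labels, then return 1 - macro F1."""
    preds = [int(round(p)) for p in probs]
    return 1.0 - macro_f1(y_true, preds)


y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.6]
print(one_minus_f1(probs, y_true))
```

A perfect classifier scores 0 under this metric, so a lower value means a better model.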

In [None]:
from sklearn import  metrics
def f1_metric(preds,dtrain):
    ytrue=dtrain.get_label()
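    # np.int was removed in NumPy 1.24; the builtin int is equivalent here.
    # f1_score expects (y_true, y_pred) in that order.
    return 'f1_score', 1-metrics.f1_score(ytrue, preds.round().astype(int), average='macro')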

## Model
XGBoost (via ``xgb.train``) is used as the common model for both feature sets.
Stratified K-fold splits the training data into training and validation folds. The test-set predictions from all folds are averaged to produce the final prediction.
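
The two ideas, class-balanced folds and averaging the per-fold test predictions, can be sketched without any libraries. This is a simplified illustration: `stratified_folds` deals indices round-robin per class and is not the algorithm scikit-learn's `StratifiedKFold` implements, and the probabilities are made-up toy values.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits validation folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


def average_fold_predictions(fold_preds):
    """Average test-set probabilities element-wise across folds."""
    n_folds = len(fold_preds)
    return [sum(column) / n_folds for column in zip(*fold_preds)]


labels = [0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(labels, n_splits=4)
print(folds)  # each validation fold holds one sample of each class

# Toy probabilities for two test tweets from four fold models.
fold_preds = [[0.2, 0.9], [0.4, 0.7], [0.3, 0.8], [0.1, 0.6]]
avg = average_fold_predictions(fold_preds)
print(avg)
```

Averaging probabilities rather than rounded labels keeps each fold model's confidence in the final blend.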



In [None]:


def train_model(train_features, train_target,test_features,n_estimator):
    d_test = xgb.DMatrix(test_features)
    #del test_features
    gc.collect()
    
    print("Modeling")
    cv_scores = []
    xgb_preds = []
    submission = pd.DataFrame.from_dict({'id': test['id']})
   
    
    xgb_params = {'eta': 0.1, 
                  'max_depth': 7, 
                  'subsample': 0.7,
                  'colsample_bytree': 0.75, 
                  'n_estimators':n_estimator,
                  'objective': 'binary:logistic', 
                  'metric': 'f1_score', 
                  'eval_metric':'error',
                  'gamma':0,
                  'seed': 314159265,
                  'verbose':20,
                  'min_child_weight':1
                  
                 }
    
    cv=StratifiedKFold(n_splits=4)

    for i,(train_index, valid_index) in enumerate(cv.split(train_features, train_target)):
        X_train, X_valid=train_features[train_index],train_features[valid_index]
        y_train, y_valid = train_target.iloc[train_index],train_target.iloc[valid_index]
        
                
        d_train = xgb.DMatrix(X_train, y_train)
        d_valid = xgb.DMatrix(X_valid, y_valid)

        watchlist = [(d_valid, 'valid')]
        model = xgb.train(xgb_params, d_train, 200, watchlist, verbose_eval=False,early_stopping_rounds=100, feval=f1_metric)
        print(model.attributes())
        print(model.attributes()['best_msg'])
        cv_scores.append(float(model.attributes()['best_score']))
        submission[i] = model.predict(d_test)#.round().astype(np.int)
        #del X_train, X_valid, y_train, y_valid
        gc.collect()
    print('Mean CV score is {}'.format(np.mean(cv_scores)))

    # Average the per-fold test predictions (every column except 'id')
    # instead of returning only the last fold's output.
    return submission.drop(columns='id').mean(axis=1).to_numpy()


In [None]:
train_target = train['target']
print(train_target.shape)
train_target.value_counts()

In [None]:

submission_tfidf=train_model(train_features, train_target,test_features,n_estimator=97)
#res=train_model(train_features, train_target,test_features,n_estimator=4)
#print(res)

## TensorFlow Universal Sentence Encoder

TF-IDF is based only on the frequency of words in a tweet.
A sentence embedding captures the semantic meaning of the tweet and represents it as a vector.
To make the ``train_model`` function reusable, the embedding tensor is converted to a CSR matrix.


In [None]:
#!pip3 install --quiet tensorflow-hub
#Reference: https://www.kaggle.com/akazuko/nlp-disaster-tweets-1
import tensorflow_hub as hub
from scipy import sparse
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
train_text=train_text.apply(remove_emoji)
test_text=test_text.apply(remove_emoji)
X_train_embeddings = embed(train_text)
X_test_embeddings = embed(test_text)
X_feat=sparse.csr_matrix(X_train_embeddings.numpy())
X_test=sparse.csr_matrix(X_test_embeddings.numpy())
X_feat.shape
submission_hub=train_model(X_feat,train_target,X_test,n_estimator=100)#


In [None]:
print(submission_hub)

In [None]:
#submission_hub=np.mean(submission_hub.iloc[:,1:],axis=0)
#submission_tfidf=np.mean(submission_tfidf.iloc[:,1:],axis=0)



## Model Evaluation

**Note:** The f1-score reported above is __1-f1__, so lower is better.
The Universal Sentence Encoder features give a better f1_score than TFIDF.
However, combining the two models gives an even better score.
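
The combination is a weighted average of the two models' probabilities followed by rounding. A minimal sketch, using the 0.35 / 0.65 weights from the cell that follows and made-up toy probabilities (`blend` is a hypothetical helper):

```python
def blend(preds_tfidf, preds_hub, w_tfidf=0.35, w_hub=0.65):
    """Weighted average of two probability lists, rounded to 0/1 labels."""
    return [
        int(round(w_tfidf * a + w_hub * b))
        for a, b in zip(preds_tfidf, preds_hub)
    ]


tfidf_probs = [0.9, 0.2, 0.7]
hub_probs = [0.3, 0.1, 0.8]
print(blend(tfidf_probs, hub_probs))
```

In the first toy tweet the two models disagree, and the confident TF-IDF prediction is enough to tip the blend towards class 1 despite its lower weight.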

In [None]:
# np.int was removed in NumPy 1.24; the builtin int is equivalent here
submission=(0.35*submission_tfidf+0.65*submission_hub).round().astype(int)
final_submit = pd.DataFrame.from_dict({'id': test['id']})
final_submit['target']=submission

print(final_submit)
final_submit.to_csv('submission.csv', index=False)
print(final_submit['target'].value_counts())

## Further Enhancements:
1. Hyperparameter Tuning
2. Stacking with different classification models