## NLP Model Building: Traditional Classifiers

Authored by [abhilash1910](https://www.kaggle.com/abhilash1910)

### Movie Reviews !!

This is an extension of the original [Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop), provided separately for a piecewise analysis of NLP model building with Transformers and sophisticated neural architectures. For more details, other kernels are also provided:

- [Kernel](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop-2)
- [Kernel](https://www.kaggle.com/abhilash1910/nlp-workshop-ml-india)

The second and most interesting part of the curriculum is building different models, both statistical and deep learning, to arrive at a proper classification model for our use case. We will first create a baseline with statistical classification algorithms, using both non-semantic and semantic vectors. Later we will try to improve on these baselines with standard neural models such as Bidirectional LSTMs, Convolutional Networks and Gated Recurrent Units. The improvements of these traditional neural models over the baselines will be investigated further when we explore advanced architectures, particularly the encoder-decoder. Further advances come from attention-based encoder-decoder modules like Transformers, using different flavours from BERT to GPT.

The following is the workflow of this notebook:

- Traditional NN models
  - With Static Embeddings

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Recap from Workshop-1

In the previous [Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop), we learned about cleaning, EDA, vectorization and embeddings. We saw how cleaning plays a significant role in deriving knowledge embeddings from the corpus, and we also measured the similarity between words and sentences.

We will be following a few steps from the elaborate Notebook:

- Use the cleaning methods (regex) for redundancy removal
- Lemmatization
- Vectorization
- Embeddings (Static, Dynamic, Transformer)

Since we have already implemented the Regex cleaning method, we can apply the same here. In the first section of this notebook, we will run statistical classifiers on three differently vectorized versions of the data:

- Non semantic TFIDF Vectorized Data
- Static Embedding Vectorized Data
- Dynamic Embedding Vectorized Data

For the first use case of statistical models, we will be relying on TFIDF Baseline with Statistical classifiers.


<img src="https://i.pinimg.com/originals/b0/ec/e4/b0ece436f4244f1f97bab3facf4d6b8a.gif">


In [None]:
#Load the set
train_df=pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
train_df.head()

In [None]:
# Standard library
import collections
import string
from collections import Counter, defaultdict

# Numerics and data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from wordcloud import WordCloud,STOPWORDS
from plotly import tools
import plotly.offline as py
import plotly.graph_objs as go
%matplotlib inline
py.init_notebook_mode(connected=True)

# NLP utilities
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
stop_words = stopwords.words('english')

# scikit-learn: preprocessing, features, splitting and metrics
from sklearn import preprocessing,metrics,manifold
from sklearn.manifold import TSNE
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA,TruncatedSVD,SparsePCA
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split,cross_val_score,cross_val_predict,GridSearchCV
from sklearn.model_selection import StratifiedKFold,KFold,StratifiedShuffleSplit
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import roc_auc_score,roc_curve,r2_score,recall_score,precision_recall_curve

# scikit-learn: classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestRegressor,GradientBoostingClassifier,RandomForestClassifier,AdaBoostClassifier,BaggingClassifier,ExtraTreesClassifier

# Class balancing (imbalanced-learn)
from imblearn.over_sampling import ADASYN,SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced

# Gradient boosting libraries
import xgboost
from xgboost import XGBClassifier as xg
from lightgbm import LGBMClassifier as lg

In [None]:
%%time
#Convert the labels into integers (numerics) for reference: positive -> 1, negative -> 0.

train_df['Binary']=(train_df['sentiment']=='positive').astype(int)
train_df.head()

In [None]:
%%time
#Running the Preprocessing and cleaning phase as well as the TFIDF Vectorization

import re
#Removes Punctuations
def remove_punctuations(data):
    punct_tag=re.compile(r'[^\w\s]')
    data=punct_tag.sub(r'',data)
    return data

#Removes HTML syntaxes (replaced by a space so that words on either side of a tag do not fuse)
def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r' ',data)
    return data

#Removes URL data (http, https and bare www links)
def remove_url(data):
    url_clean= re.compile(r"https?://\S+|www\.\S+")
    data=url_clean.sub(r'',data)
    return data

#Removes Emojis
def remove_emoji(data):
    emoji_clean= re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    data=emoji_clean.sub(r'',data)
    return data

#Lemmatize the corpus word by word
def lemma_traincorpus(data):
    lemmatizer=WordNetLemmatizer()
    #Iterating over the raw string would visit single characters, so split into words first
    words=data.split()
    return " ".join(lemmatizer.lemmatize(word) for word in words)

def tfidf(data):
    tfidfv = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), lowercase=True, max_features=150000)
    fit_data_tfidf=tfidfv.fit_transform(data)
    return fit_data_tfidf


#Strip HTML, URLs and emojis before punctuation; otherwise tags such as <br /> leave "br" tokens behind
train_df['review']=train_df['review'].apply(remove_html)
train_df['review']=train_df['review'].apply(remove_url)
train_df['review']=train_df['review'].apply(remove_emoji)
train_df['review']=train_df['review'].apply(remove_punctuations)
count_good=train_df[train_df['sentiment']=='positive']
count_bad=train_df[train_df['sentiment']=='negative']
train_df['review']=train_df['review'].apply(lemma_traincorpus)


## Revisiting TFIDF for our Baseline

TFIDF vectorization is a non-semantic, frequency-based technique. It weights the frequency of each term within a document by the logarithm of its inverse document frequency, so that words occurring in many documents are down-weighted and the resulting vectors reflect the normalized frequency of occurrence of words in the corpus. A descriptive formulation is provided here:

<img src=https://plumbr.io/app/uploads/2016/06/tf-idf.png>


The rationale for using TFIDF vectorization over other vectorization strategies to embed documents in a high-dimensional space is that it emphasises words that are rare across the corpus. These vectorized embeddings can then be used to train a statistical model.
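To make the weighting concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences and the variable names `toy_corpus`, `toy_matrix` and `idf_by_term` are illustrative assumptions, not part of the IMDB data). It uses sklearn's default `TfidfVectorizer` settings, which differ from the `tfidf` helper defined earlier in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" and "was" occur in every document,
# "terrible" and "fun" in only one each.
toy_corpus = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great fun",
]

vectorizer = TfidfVectorizer()
toy_matrix = vectorizer.fit_transform(toy_corpus)

# Map every vocabulary term to its learned inverse document frequency.
idf_by_term = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

print("matrix shape (documents, terms):", toy_matrix.shape)
for term, idf in sorted(idf_by_term.items(), key=lambda item: item[1]):
    print(f"{term:10s} idf={idf:.3f}")
```

Terms that appear in every document receive the lowest IDF, while terms seen in a single document receive the highest, which is the "rare words" effect described above.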

<img src="https://cdn-images-1.medium.com/max/876/1*_OsV8gO2cjy9qcFhrtCdiw.jpeg">

In [None]:
%%time
#TFIDF Vectorize the Data

train_set=tfidf(train_df['review'])

## Statistical Training Without Balancing

We came across balancing techniques in the previous [Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop), where approaches like [SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) and [Adasyn](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html) are used to balance the classes. In the first case, we will not balance the set, and will evaluate the preliminary training of the statistical models. This initial benchmark can later be improved upon by balancing the dataset.




## Splitting the set


We start from the transformed/vectorized data that we saw in the previous notebook, whose TSNE projection looked like this:

<img src="https://i.imgur.com/uuhL17b.png">


In generic supervised learning, the model is trained after splitting the tfidf vectorized data into training and test sets. The split does not transform the features; it simply partitions the rows according to the ratio/test size that we provide. For this strategy, we will be using the sklearn split module. There are several ways of splitting, depending on the data, and some of them include:

- [Test Train Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Stratified Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Stratified Shuffle Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)
- [Shuffle Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)
- [Stratified K Fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)

Jason's [blog](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/) provides a good idea about splitting techniques for generic classification problems.

In [None]:
%%time
#Normal Train Test Split

train_y=train_df['sentiment']
train_x,test_x,train_y,test_y=train_test_split(train_set,train_y,test_size=0.2,random_state=42)
train_x.shape,train_y.shape,test_x.shape,test_y.shape

## Splitting in Imbalanced Classes

There are several ways to split imbalanced classes. One of them is the "stratify" (Stratified Split) option of train_test_split. This splits each class proportionally, so that the class proportions of the full dataset are maintained in both the training and testing datasets.
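The effect of stratification is easiest to see on an imbalanced set. The following is a small sketch on hypothetical labels with a 90/10 class ratio (`toy_X`, `toy_y` and the other names are assumptions made for illustration; the IMDB labels themselves are balanced).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(100).reshape(-1, 1)

# Plain split: the share of positives in the test set depends on the shuffle.
_, _, _, plain_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# Stratified split: the test set keeps the 10% share of positives.
_, _, _, strat_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

print("positives in plain test split:     ", plain_test_y.sum(), "of", len(plain_test_y))
print("positives in stratified test split:", strat_test_y.sum(), "of", len(strat_test_y))
```

With stratification the 20-sample test set always contains 2 positives, matching the 10% share in the full set.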

In [None]:
%%time
#Stratified Train Test Split

train_stratify_y=train_df['sentiment']
train_stratified_x,test_stratified_x,train_stratified_y,test_stratified_y=train_test_split(train_set,train_stratify_y,test_size=0.2,random_state=42,stratify=train_stratify_y)
train_stratified_x.shape,train_stratified_y.shape,test_stratified_x.shape,test_stratified_y.shape

## Analysing TFIDF-LR Baseline with simple split 

In this case, we want to evaluate the performance of a Logistic Regression classifier on the tfidf vectorized data split with the normal train_test_split. Logistic Regression is one of the standard models under generalized linear models for supervised learning: it passes a linear combination of the features through the sigmoid function to obtain a class probability, and fits the weights by minimising a convex cost function. The sigmoid function is given by:

<img src="https://www.gstatic.com/education/formulas2/-1/en/sigmoid_function.svg">


Because the sigmoid is differentiable everywhere and saturates at 0 as x tends to -infinity and at 1 as x tends to +infinity (with a value of 0.5 at x=0), its output can be read as a probability and thresholded to obtain binary labels. In supervised logistic regression, the cost function is minimised with a gradient-based optimizer such as (stochastic) gradient descent: at each step the gradient of the cost with respect to the weights and bias is computed, and the parameters are moved along the direction of steepest descent. The loss being minimised is the log loss (binary cross-entropy), E = -[y_actual*log(y_predicted) + (1 - y_actual)*log(1 - y_predicted)] averaged over the samples, rather than the squared error |y_predicted - y_actual|^2, which is not convex when combined with the sigmoid. Note that sklearn's LogisticRegression uses the L-BFGS solver by default rather than plain gradient descent. This [blog](https://machinelearningmastery.com/gradient-descent-for-machine-learning/) provides an idea of gradient descent.
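As a rough illustration of the idea, the following numpy sketch fits a logistic regression with plain batch gradient descent on the log loss. The toy data, the learning rate and the number of steps are arbitrary assumptions, and this is not how sklearn's `LogisticRegression` is implemented internally.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical, linearly separable toy data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5
losses = []

for step in range(200):
    probs = sigmoid(X_toy @ weights + bias)
    losses.append(log_loss(y_toy, probs))
    # Gradient of the log loss with respect to the linear output is (probs - y).
    error = probs - y_toy
    weights -= learning_rate * (X_toy.T @ error) / len(y_toy)
    bias -= learning_rate * error.mean()

print(f"log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With all-zero weights every prediction is 0.5, so the initial loss is log(2); it then decreases as the weights align with the decision boundary.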



<img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xoKf0Xfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AXisqJi764DTxragiRFexSQ.png">

Some resources:

- [Blog](https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/)
- [Sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [KDNuggets](https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html)
- [Blog](https://towardsdatascience.com/machine-learning-101-an-intuitive-introduction-to-gradient-descent-366b77b52645)



In [None]:
#Applying Logistic Regression on split tfidf baseline
model=LogisticRegression()
model.fit(train_x,train_y)
pred=model.predict(test_x)
print("Evaluate confusion matrix for LR")
print(confusion_matrix(test_y,pred))
print(f"Accuracy Score for LR with C=1.0  ={accuracy_score(test_y,pred)}")

## Multinomial Naive Bayes on TFIDF Baseline

[Multinomial NB](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.naive_bayes.MultinomialNB.html) is a probabilistic classification model which uses conditional probability to classify samples. It works well with discrete integer valued features (such as count vectorization), but in practice can also be used with fractional TFIDF vectors. In particular, it relies on Bayes' Theorem, which expresses the posterior probability of a class given the features in terms of the class prior and the likelihood of the features, as shown in the figure:

<img src="https://storage.googleapis.com/coderzcolumn/static/tutorials/machine_learning/article_image/Scikit-Learn%20-%20Naive%20Bayes.jpg">

The underlying concept is [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html), which assumes that the features are conditionally independent given the class. There are many variants:

- Gaussian NB which relies on Gaussian distribution of the input features

<img src="https://www.researchgate.net/profile/Yune_Lee/publication/255695722/figure/fig1/AS:297967207632900@1448052327024/Illustration-of-how-a-Gaussian-Naive-Bayes-GNB-classifier-works-For-each-data-point.png">


- Complement NB which is suited for unbalanced classes and relies on the statistics of the complement of each class to generate the weights. It often outperforms Multinomial NB on text classification, as it has a normalization factor (and a smoothing hyperparameter alpha) which keeps longer documents from dominating the parameter estimates.


<img src="https://www.researchgate.net/profile/Motaz_Saad/publication/231521157/figure/fig7/AS:667829850345476@1536234452166/Figure-31-Complement-Naive-Bayes-Algorithm-72.png">

- Bernoulli NB which relies on multivariate Bernoulli distributions of the input features and expects the data to be binary valued.

<img src="https://www.astroml.org/_images/fig_simple_naivebayes_1.png">

Other variants include:

- Categorical NB
- Partial Fit of NB models
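The variants share the same sklearn interface, so they can be swapped in and out easily. Below is a minimal sketch on a hypothetical six-review corpus (the reviews, labels and variable names are made up for illustration); scores on such a tiny training set say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB

# Hypothetical toy reviews with 1 = positive and 0 = negative.
toy_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "i loved every minute",
    "a dull and boring film",
    "terrible acting and a weak story",
    "i hated every minute",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_features = TfidfVectorizer().fit_transform(toy_reviews)

nb_scores = {}
for nb_class in (MultinomialNB, ComplementNB, BernoulliNB):
    # alpha is the additive smoothing hyperparameter shared by these variants.
    nb_model = nb_class(alpha=1.0).fit(toy_features, toy_labels)
    nb_scores[nb_class.__name__] = nb_model.score(toy_features, toy_labels)

for name, score in nb_scores.items():
    print(f"{name:14s} training accuracy = {score:.2f}")
```

Note that BernoulliNB binarizes the TFIDF values (its `binarize` threshold defaults to 0.0), so it only sees whether a term is present or absent.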

Resources:

- [Blog](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/)
- [Kernel](https://www.kaggle.com/abhilash1910/nlp-workshop-ml-india#Vectorization-and-Benchmarking)
- [Jason's Blog](https://machinelearningmastery.com/classification-as-conditional-probability-and-the-naive-bayes-algorithm/)




In [None]:
#Applying MultiNomial Naive Bayes on split tfidf baseline
model=MultinomialNB()
model.fit(train_x,train_y)
pred=model.predict(test_x)
print("Evaluate confusion matrix for NB")
print(confusion_matrix(test_y,pred))
print(f"Accuracy Score for NB ={accuracy_score(test_y,pred)}")

## Multiple Baseline computation using KFold and Cross Validation


Here, we will be training multiple statistical models using the KFold and Cross Validation technique.
KFold cross validators provide train/test split indices for splitting the dataset into 'k' consecutive folds, without shuffling by default (pass shuffle=True to shuffle first). The general methodology for using KFold and Cross Validation is provided below:

- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
   - Take the group as a hold out or test data set
   - Take the remaining groups as a training data set
   - Fit a model on the training set and evaluate it on the test set
   - Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores

The fold sizes follow this rule: the first ``` n_samples % n_splits``` folds have size ``` n_samples // n_splits + 1``` and the other folds have size ``` n_samples // n_splits```, where n_samples is the number of samples.
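The fold size rule can be checked quickly on a hypothetical dataset of 10 samples split into 3 folds (the array `X_small` is only a placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 10 samples, 3 folds, so 10 % 3 = 1 fold gets 10 // 3 + 1 = 4 samples.
X_small = np.arange(10).reshape(-1, 1)
kfold_small = KFold(n_splits=3)

fold_sizes = [len(test_idx) for _, test_idx in kfold_small.split(X_small)]
print(fold_sizes)  # [4, 3, 3]
```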

A typical flowchart of cross validation is provided below:

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png">

This allows for better hyperparameter search using GridSearch CV algorithms, which will be covered later.
The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;

- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
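The loop described above can be written out by hand and compared with `cross_val_score`. This is a sketch on synthetic data from `make_classification`; the dataset, the classifier and the fold count are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic dataset standing in for the vectorized reviews.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=7)

manual_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # Fit a fresh model on k-1 folds, score it on the held-out fold, then discard it.
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# cross_val_score performs the same loop internally.
library_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kfold_demo)

print("manual mean accuracy :", np.mean(manual_scores))
print("library mean accuracy:", library_scores.mean())
```

Because the splitter has a fixed integer `random_state`, both approaches see the same folds and report the same per-fold accuracies.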

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png">

Some resources:

- [Blog](https://machinelearningmastery.com/k-fold-cross-validation/)
- [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
- [Documentation](https://scikit-learn.org/stable/modules/cross_validation.html)



## Brief Introduction of Statistical Models


This is a brief introduction to the different statistical models which will be used together with k-fold cross validation to examine the accuracy of the models. In this case, we will be focusing on accuracy as the major KPI, and later we will look at other metrics such as F1, ROC etc.


### Decision Trees

[Decision Trees](https://scikit-learn.org/stable/modules/tree.html) are a supervised model for classification/regression. They work by creating decision branches, each of which tests a criterion, and are often regarded as a simple (white box) classification model, since the stages of a decision can be easily derived. A regression tree appears as follows:

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_tree_regression_0011.png">

The [algorithms](https://scikit-learn.org/stable/modules/tree.html) include ID3, C4.5/C5.0 and CART, which can be summarised as follows:

- ID3(Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.

- C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.

- C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.

- CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.


In general, the main logic of Decision Trees involves computing the Entropy or the Gini Index of each candidate split, which is as follows:

<img src="https://qph.fs.quoracdn.net/main-qimg-690a5cee77c5927cade25f26d1e53e77">

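A related but distinct quantity is the Gini Coefficient from economics, which is evaluated from the area between the ```y=x``` line and the Lorenz curve (decision trees use the Gini Index above, not this coefficient):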

<img src="https://i.stack.imgur.com/iawuF.jpg">


Misclassification error is another criterion:

<img src="https://miro.medium.com/max/2180/1*O5eXoV-SePhZ30AbCikXHw.png">
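The three impurity measures can be compared directly on a node's class distribution. This is a small sketch using the standard textbook definitions; the function names and the example distributions are assumptions for illustration.

```python
import numpy as np

def entropy(class_probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    probs = np.asarray(class_probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

def gini_index(class_probs):
    """Gini impurity: chance of mislabelling a sample drawn from the node."""
    probs = np.asarray(class_probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

def misclassification_error(class_probs):
    """Error rate when the node always predicts its majority class."""
    probs = np.asarray(class_probs, dtype=float)
    return float(1.0 - np.max(probs))

# Hypothetical node distributions: a perfectly mixed node and a fairly pure one.
for node in ([0.5, 0.5], [0.9, 0.1]):
    print(node,
          "entropy=%.3f" % entropy(node),
          "gini=%.3f" % gini_index(node),
          "misclassification=%.3f" % misclassification_error(node))
```

All three measures are largest for the evenly mixed node and shrink as the node becomes purer, which is why a split that lowers them is preferred.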

Typically a decision tree appears as follows:

<img src="https://scikit-learn.org/stable/_images/iris.png">

Some resources:

- [Blog](https://towardsdatascience.com/scikit-learn-decision-trees-explained-803f3812290d)
- [Blog](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/)
- [Blog](https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/)



### Random Forests 


[Random Forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the ```max_samples``` parameter ```if bootstrap=True (default)```, otherwise the whole dataset is used to build each tree. When splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size ```max_features```. (See the parameter tuning guidelines for more details).

The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.
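The two sources of randomness map directly onto constructor arguments. The sketch below fits a single tree and a forest on synthetic data; the dataset and the chosen values for `max_samples` and `max_features` are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset.
X_rf, y_rf = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.25, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_rf_train, y_rf_train)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # each tree sees a bootstrap sample of the rows...
    max_samples=0.8,      # ...drawn with 80% of the training set size
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
).fit(X_rf_train, y_rf_train)

tree_accuracy = single_tree.score(X_rf_test, y_rf_test)
forest_accuracy = forest.score(X_rf_test, y_rf_test)
print("single tree test accuracy:", tree_accuracy)
print("forest test accuracy     :", forest_accuracy)
```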

<img src="https://www.researchgate.net/profile/Hung_Cao12/publication/333438248/figure/fig6/AS:763710377299970@1559094151459/Random-Forest-model-with-majority-voting.ppm">


### Gradient Boosting Forests and Trees

[Gradient Boosting](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/) is a central part of ensemble modelling in sklearn.

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

  Examples: Bagging methods, Forests of randomized trees

- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

  Examples: AdaBoost, Gradient Tree Boosting


Pictorially these can be represented as :

<img src="https://miro.medium.com/max/3908/1*FoOt85zXNCaNFzpEj7ucuA.png">


Several boosting models fall under this category:

#### [AdaBoosting](https://blog.paperspace.com/adaboost-optimizer/#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.): 
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights w_1, w_2, ..., w_N to each of the training samples. Initially, those weights are all set to (1/N), so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
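A single boosting iteration can be sketched by hand to show the reweighting. This follows the discrete two-class AdaBoost update with a decision stump as the weak learner; the synthetic data and variable names are assumptions, and sklearn's `AdaBoostClassifier` handles all of this internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset.
X_ada, y_ada = make_classification(n_samples=200, n_features=5, random_state=0)
n_samples = len(y_ada)

# Step 1: all sample weights start at 1/N.
sample_weights = np.full(n_samples, 1.0 / n_samples)

# Step 2: fit a weak learner (a depth-1 tree) on the weighted data.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_ada, y_ada, sample_weight=sample_weights)
missed = stump.predict(X_ada) != y_ada

# Step 3: weighted error and the learner's vote weight (alpha).
weighted_error = float(np.clip(sample_weights[missed].sum(), 1e-10, 1 - 1e-10))
alpha = np.log((1.0 - weighted_error) / weighted_error)

# Step 4: increase the weights of the missed samples, then renormalise.
new_weights = sample_weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print("weighted error of the stump:", weighted_error)
print("weight of a missed sample  :", new_weights[missed].max())
print("weight of a correct sample :", new_weights[~missed].max())
```

After renormalisation the missed samples carry more weight than 1/N and the correctly classified ones carry less, so the next weak learner concentrates on the hard examples.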

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_adaboost_hastie_10_2_0011.png">


### [LightGBM](https://lightgbm.readthedocs.io/en/latest/)

LightGBM is another gradient boosting framework which relies on trees. According to its documentation, it has the following advantages:

- Faster training speed and higher efficiency.

- Lower memory usage.

- Better accuracy.

- Support of parallel and GPU learning.

- Capable of handling large-scale data.

LightGBM grows trees leaf-wise (best-first): at each step it chooses the leaf with the maximum delta loss to grow.

<img src="https://lightgbm.readthedocs.io/en/latest/_images/leaf-wise.png">


Some resources:

- [XGB](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
- [Blogs](https://www.analyticsvidhya.com/blog/tag/gradient-boosting/)

In [None]:
%%time
#KFold and cross validation on tfidf baseline
models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
# models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#The loss is left at its default (log loss); the old name 'deviance' was removed in recent sklearn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
#The algorithm is left at its default; 'SAMME.R' is deprecated in recent sklearn releases
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
#models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
# models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
model_result=[]
scoring='accuracy'
print("Statistical Model TFIDF- Baseline Evaluation")
#random_state is only accepted by KFold when shuffle=True
kfold=KFold(n_splits=10,shuffle=True,random_state=7)
for name,model in models:
    results=cross_val_score(model,train_x,train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a mean cross-validation accuracy of", round(results.mean() * 100, 2), "%")
    model_result.append(results.mean())

### Applying SMOTE balancing on TFIDF Baseline

Here, we will be applying SMOTE on the TFIDF vectorized data to create a different baseline. Since the data in our case is already balanced, SMOTE has very little to resample, so it is not expected to improve the models and may slightly reduce their efficiency.
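For intuition, the core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step only, on hypothetical 2-D points; it is not the imbalanced-learn implementation, and all names in it are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority class: 10 points in two dimensions.
minority = rng.normal(loc=2.0, size=(10, 2))

# For every minority point, find itself plus its 2 nearest minority neighbours.
neighbour_model = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbour_idx = neighbour_model.kneighbors(minority)

synthetic = []
for _ in range(20):
    i = rng.integers(len(minority))       # pick a minority sample
    j = rng.choice(neighbour_idx[i, 1:])  # pick one of its neighbours (column 0 is the point itself)
    gap = rng.random()                    # random position along the connecting segment
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)

print("original minority samples:", minority.shape[0])
print("synthetic samples created:", synthetic.shape[0])
```

Every synthetic point lies on a line segment between two existing minority points, so oversampling fills in the minority region rather than duplicating rows.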



In [None]:
#Balancing the sample for TFIDF Baseline
#SMOTE oversampling
smote=SMOTE(random_state=42,k_neighbors=2)
#fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases
smote_train_x,smote_train_y=smote.fit_resample(train_x,train_y)
smote_train_x.shape,smote_train_y.shape

In [None]:
%%time
#Applying SMOTE TFIDF Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#The loss is left at its default (log loss); the old name 'deviance' was removed in recent sklearn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
#The algorithm is left at its default; 'SAMME.R' is deprecated in recent sklearn releases
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
#models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
# models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
model_training_result,model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model SMOTE TFIDF- Baseline Evaluation")
#random_state is only accepted by KFold when shuffle=True
kfold=KFold(n_splits=10,shuffle=True,random_state=7)
for name,model in models:
    results=cross_val_score(model,smote_train_x,smote_train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a mean cross-validation accuracy of", round(results.mean() * 100, 2), "%")
    #Validate on the untouched test split with a model fitted on the SMOTE balanced training data
    model.fit(smote_train_x,smote_train_y)
    predictions=model.predict(test_x)
    accuracy = accuracy_score(test_y,predictions)
    model_training_result.append(results.mean())
    model_validation_result.append(accuracy)

#Store the model names (not the (name, estimator) tuples) alongside the scores
final_outcomes=pd.DataFrame({'Model':[name for name,_ in models],
                             'Training Acc':model_training_result,
                             'Validation Acc':model_validation_result})
final_outcomes.to_csv('TFIDF-SMOTE-Baseline.csv',index=False)
final_outcomes

## Concluding Non Semantic Baseline Techniques


We have seen how statistical classifiers perform on non-semantic TF-IDF vectors and compared the accuracy of the different algorithms side by side. These models provide an initial benchmark that later models should improve on. This gives a quick overview of how a traditional classifier can be used for non-semantic classification; next, we will use semantic embeddings (vectors) with the same traditional classifiers.


<img src="https://media2.giphy.com/media/118u58QrLaLnDG/giphy.gif">

## Static Semantic Embedding Baseline


In this section, we apply the traditional ML classifiers to Word2Vec, GloVe and fastText vectors to build a classification model. These embeddings were introduced earlier, and some of the plots obtained by applying them to the corpus are reproduced below from the Notebook [here](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop):

### Word2Vec From [Gensim](https://www.analyticsvidhya.com/blog/tag/word2vec/)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___63_1.png">

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___59_1.png">

### Google News Vectors Variant of [Word2Vec](https://code.google.com/archive/p/word2vec/)


<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___66_0.png">

### [Glove](https://www.analyticsvidhya.com/blog/tag/glove/) [Embeddings](https://nlp.stanford.edu/projects/glove/)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___70_0.png">

### [Fasttext Embeddings](https://fasttext.cc/docs/en/supervised-tutorial.html)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___73_0.png">


In [None]:
%%time
## Load word2vec algorithm from gensim and vectorize the words
from gensim.models import Word2Vec,KeyedVectors
check_df=list(train_df['review'].str.split())
model=Word2Vec(check_df,min_count=1,iter=20)

In [None]:
#Label Encode the labels
from sklearn.preprocessing import LabelEncoder
label_y= LabelEncoder()
labels=label_y.fit_transform(train_df['sentiment'])
labels

## Creating Sentence Vectors from Word2Vec

Since Word2Vec produces a vector embedding for each individual word in the corpus, we need a way to build document/sentence vectors from these word vectors. The concept of [pooling](https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8#:~:text=A%20pooling%20layer%20is%20another,in%20pooling%20is%20max%20pooling) comes from neural networks, particularly [Convolutional Neural Network architectures](https://analyticsindiamag.com/max-pooling-in-convolutional-neural-network-and-its-features/), where max pooling takes the maximum over a range (typically a kernel/filter window over the input features). A typical max pooling diagram is as follows:


<img src="https://i.redd.it/61tcfy2xy2u41.png">



For document embeddings, however, the usual approach is [average pooling](https://i.redd.it/61tcfy2xy2u41.png): the document vector is the mean of all the word vectors in the document. A schematic diagram is shown below:



<img src="https://yashuseth.files.wordpress.com/2018/08/5.jpg?w=834">


As we move towards more complex embeddings, we will keep using mean pooling to create sentence/paragraph vectors from the individual word vectors. There are also other strategies, such as applying max pooling and then mean pooling over the word vectors to create the final vectors.

<img src="https://www.researchgate.net/profile/Xingsheng_Yuan/publication/332810604/figure/fig2/AS:754128875683841@1556809743129/Simple-word-embedding-based-model-with-modified-hierarchical-pooling-strategy.png">
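To make the pooling strategies described above concrete, here is a minimal sketch on toy data. The words, the 4-dimensional vectors and the helper name `pool_sentence` are all made up for illustration and are not part of the notebook's pipeline; the zero-vector fallback for sentences without any known token is likewise an assumption, not something the original code does.

```python
import numpy as np

# Toy 4-dimensional word vectors; the words and values are invented for illustration.
word_vectors = {
    "great": np.array([1.0, 0.0, 2.0, -1.0]),
    "movie": np.array([0.0, 4.0, 0.0, 1.0]),
    "plot": np.array([2.0, 2.0, -2.0, 0.0]),
}

def pool_sentence(tokens, vectors, dim=4):
    """Return (mean_pooled, max_pooled) vectors over the in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to zero vectors rather than NaN (an assumption).
        return np.zeros(dim), np.zeros(dim)
    stacked = np.vstack(known)  # shape: (n_known_tokens, dim)
    return stacked.mean(axis=0), stacked.max(axis=0)

# "a" is out of vocabulary and is simply skipped.
mean_vec, max_vec = pool_sentence("a great movie plot".split(), word_vectors)
print(mean_vec)  # element-wise average of the three known word vectors
print(max_vec)   # element-wise maximum of the three known word vectors
```

Mean pooling keeps a contribution from every word, while max pooling keeps only the strongest value per dimension, which is one reason mean pooling is the more common default for sentence vectors.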


Some resources

- [Research](https://www.researchgate.net/figure/Simple-word-embedding-based-model-with-modified-hierarchical-pooling-strategy_fig2_332810604)
- [Some paper](https://www.cs.tau.ac.il/~wolf/papers/qagg.pdf)
- [Huggingface](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)

In [None]:
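#Convert word vectors to sentence vectors by applying mean pooling

def convert_sentence(data):
    # Tokenize first: iterating over the raw string would yield characters, not words
    vocab=[w for w in data.split() if w in model.wv.vocab]
    if not vocab:
        # No in-vocabulary token: return a zero vector instead of NaN
        return np.zeros(model.wv.vector_size)
    avg_pool=np.mean(model.wv[vocab],axis=0)
#     sum_pool=np.sum(model.wv[vocab],axis=0)
#     min_pool=np.min(model.wv[vocab],axis=0)
#     max_pool=np.max(model.wv[vocab],axis=0)
    return avg_pool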

train_df['Vectorized_Reviews']=train_df['review'].apply(convert_sentence)

#Split the dataset into training and testing sets
train_y=train_df['sentiment']
train_x,test_x,train_y,test_y=train_test_split(train_df['Vectorized_Reviews'],train_y,test_size=0.2,random_state=42)
train_x.shape,train_y.shape,test_x.shape,test_y.shape
    

### Convert the sentence vectors to List

This is done so that the sentence vectors are stored as a plain list of fixed-length arrays rather than a pandas Series of array objects. A list in this form can be fed directly into any of the statistical classifiers for our use case.
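As a small illustration of why the conversion matters, the sketch below uses a hypothetical three-row Series of 4-dimensional vectors (not the notebook's data). The Series itself is one-dimensional, and only after conversion does the data have the two-dimensional `(n_samples, n_features)` shape that scikit-learn estimators expect.

```python
import numpy as np
import pandas as pd

# Hypothetical column of per-review sentence vectors (3 reviews, 4 dimensions).
vectors = pd.Series([np.arange(4.0), np.ones(4), np.zeros(4)])

print(vectors.shape)       # (3,): a 1-D Series whose elements are arrays

as_list = list(vectors)    # the conversion used in this notebook
as_matrix = np.vstack(as_list)
print(as_matrix.shape)     # (3, 4): one row per review, one column per dimension
```

Passing the list is enough because scikit-learn converts it to an array internally; `np.vstack` is shown only to make the resulting shape explicit.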




In [None]:
test_x=list(test_x)
train_x=list(train_x)

## Apply Statistical Models on Static Embeddings

Now we apply the statistical models to the pooled sentence vectors. This lets us run traditional and gradient boosting classifiers on the static embeddings computed as the mean of the word vectors. The train and test sets have already been converted to lists for this purpose.

Steps:

- Apply Word2Vec on the Corpus
- Create Sentence Vectors by Mean Pooling
- Run the input sentence vectors with Kfold Cross Validation on Traditional and gradient boosting classifiers.
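For readers unfamiliar with K-fold cross-validation, the last step can be pictured with the hand-rolled sketch below. It is only an illustration of the idea (the function name `kfold_indices` and the sample counts are made up); the notebook itself relies on scikit-learn's `KFold` and `cross_val_score`.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=7):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold splitting."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle the sample indices once
    folds = np.array_split(order, n_splits)  # n_splits roughly equal chunks
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

for train_idx, val_idx in kfold_indices(n_samples=10, n_splits=5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```

Each sample lands in the validation fold exactly once, so the mean score over the folds uses all of the training data for evaluation without ever scoring a model on samples it was fitted on.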


<img src="https://media1.tenor.com/images/0e438477bb88b5683690bfe101cf1181/tenor.gif?itemid=10724659">

In [None]:
%%time
#Applying W2V Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2, loss='deviance',n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,algorithm='SAMME.R',n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
w2v_model_training_result,w2v_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Word2Vec- Baseline Evaluation")
for name,model in models:
    # KFold only accepts random_state when shuffle=True
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_x,train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean(), 2) * 100, "% accuracy score")
    w2v_model_training_result.append(results.mean())
    # Out-of-fold predictions on the held-out split (default cross-validation folds)
    predictions=cross_val_predict(model,test_x,test_y)
    accuracy = accuracy_score(test_y,predictions)
    w2v_model_validation_result.append(accuracy)

final_w2v_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
# Store the classifier names rather than the (name, estimator) tuples
final_w2v_outcomes['Model']=[name for name,_ in models]
final_w2v_outcomes['Training Acc']=w2v_model_training_result
final_w2v_outcomes['Validation Acc']=w2v_model_validation_result
final_w2v_outcomes.to_csv('W2V-Baseline.csv',index=False)
final_w2v_outcomes

## Applying XGBoost and LightGBM

This lets us compare the results of XGBoost against those of LightGBM on the same sentence vectors.

<img src="https://miro.medium.com/max/1554/1*FLshv-wVDfu-i54OqvZdHg.png">

In [None]:
#Evaluating XGBoost & Light GBM on the dataset
from xgboost import XGBClassifier as xg
from lightgbm import LGBMClassifier as lg
model_xgb= xg(n_estimators=100,random_state=42)
model_xgb.fit(train_x,train_y)
y_pred_xgb=model_xgb.predict(test_x)
print(accuracy_score(test_y,y_pred_xgb.round()))
# print("Confusion matrix")
model_lgbm= lg(n_estimators=100,random_state=42)
model_lgbm.fit(train_x,train_y)
y_pred_lgbm=model_lgbm.predict(test_x)
# print("Confusion matrix")
# print(confusion_matrix(test_y,y_pred_lgbm))
print(accuracy_score(test_y,y_pred_lgbm.round()))

## Converting Other Vectors (Glove,Fasttext) to Word2Vec 

In this case, we load the GloVe, fastText and Google News vectors in Word2Vec format and then apply mean pooling to them. The method of conversion is taken from the [previous Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop). The steps involved are as follows:

- Convert Glove,Fasttext,Google News to Word2Vec by using Gensim
- Apply Mean Pooling on the Word Vectors to create Sentence Vectors
- Apply Statistical Classifiers with Kfold Cross Validation

There are alternative strategies, but this is by far the simplest one, with minimal code.
Some resources:

- [Good Alternate Script](https://www.kaggle.com/eswarbabu88/toxic-comment-glove-logistic-regression)
- [Blog](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)
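The conversion step is less mysterious than it may sound: the GloVe text format and the word2vec text format store the same `token value value ...` rows, and the word2vec format additionally starts with a header line giving the vocabulary size and the vector dimension. The sketch below illustrates that idea on two made-up rows; the helper name `glove_to_word2vec_text` is hypothetical and this is not Gensim's implementation.

```python
import io

def glove_to_word2vec_text(glove_lines):
    """Prepend the '<vocab_size> <dim>' header that the word2vec text format expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    dim = len(rows[0].split(" ")) - 1  # first field is the token, the rest are floats
    return "\n".join([f"{len(rows)} {dim}"] + rows) + "\n"

# Two made-up 3-dimensional GloVe-style rows.
glove_text = io.StringIO("the 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")
converted = glove_to_word2vec_text(glove_text)
print(converted.splitlines()[0])  # 2 3
```

Once the header is present, the file can be read with `KeyedVectors.load_word2vec_format(..., binary=False)`, which is how the GloVe vectors are loaded in this notebook.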



In [None]:
from gensim.models import Word2Vec,KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

##Google News Vectors to word2vec format for mean pooling
google_news_embed='../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin'
google_loaded_model=KeyedVectors.load_word2vec_format(google_news_embed,binary=True)
print(google_loaded_model)
##Glove Vectors to word2vec format for mean pooling
glove_file='../input/glove-global-vectors-for-word-representation/glove.6B.50d.txt'
# Output name matches the 50-dimensional input file
word2vec_output_file = 'glove.6B.50d.txt.word2vec'
# glove2word2vec writes the converted file to disk; the vectors are loaded from it below
glove2word2vec(glove_file, word2vec_output_file)
glove_loaded = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_loaded)
##Fasttext to word2vec format for mean pooling
fasttext_file="../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(fasttext_file, binary=False)
print(fasttext_model)

In [None]:
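def convert_sentence_embeddings(data,model):
    # model is a gensim KeyedVectors instance
    # Tokenize first: iterating over the raw string would yield characters, not words
    vocab=[w for w in data.split() if w in model.vocab]
    if not vocab:
        # No in-vocabulary token: return a zero vector instead of NaN
        return np.zeros(model.vector_size)
    avg_pool=np.mean(model[vocab],axis=0)
#     sum_pool=np.sum(model[vocab],axis=0)
#     min_pool=np.min(model[vocab],axis=0)
#     max_pool=np.max(model[vocab],axis=0)
    return avg_pool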

#Google vectors
print('Google Vectors')
train_google_df=train_df
train_google_df['Google_News_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,google_loaded_model) )
#Split the dataset into training and testing sets
train_google_y=train_df['sentiment']
train_google_x,test_google_x,train_google_y,test_google_y=train_test_split(train_google_df['Google_News_Vectorized_Reviews'],train_google_y,test_size=0.2,random_state=42)
train_google_x=list(train_google_x)
test_google_x=list(test_google_x)
# train_google_x.shape,train_google_y.shape,test_google_x.shape,test_google_y.shape


#Glove Vectors (needed by the Glove baseline cell further down)
print('Glove Vectors')
train_glove_df=train_df
train_glove_df['Glove_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,glove_loaded) )
#Split the dataset into training and testing sets
train_glove_y=train_df['sentiment']
train_glove_x,test_glove_x,train_glove_y,test_glove_y=train_test_split(train_glove_df['Glove_Vectorized_Reviews'],train_glove_y,test_size=0.2,random_state=42)
train_glove_x=list(train_glove_x)
test_glove_x=list(test_glove_x)
# train_glove_x.shape,train_glove_y.shape,test_glove_x.shape,test_glove_y.shape




In [None]:
#FastText Vectors
print('Fasttext Vectors')
train_fasttext_df=train_df
train_fasttext_df['Fasttext_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,fasttext_model) )
#Split the dataset into training and testing sets
train_fasttext_y=train_df['sentiment']
train_fasttext_x,test_fasttext_x,train_fasttext_y,test_fasttext_y=train_test_split(train_fasttext_df['Fasttext_Vectorized_Reviews'],train_fasttext_y,test_size=0.2,random_state=42)
train_fasttext_x=list(train_fasttext_x)
test_fasttext_x=list(test_fasttext_x)
# train_fasttext_x.shape,train_fasttext_y.shape,test_fasttext_x.shape,test_fasttext_y.shape


## Applying the Classifiers

In the following cells, we apply the classifiers to each set of pooled vectors in turn.

In [None]:
%%time
#Applying Google Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2, loss='deviance',n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,algorithm='SAMME.R',n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
google_model_training_result,google_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Google Vectors- Baseline Evaluation")
for name,model in models:
    # KFold only accepts random_state when shuffle=True
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_google_x,train_google_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean(), 2) * 100, "% accuracy score")
    # Append to the Google News result lists (not the Word2Vec ones)
    google_model_training_result.append(results.mean())
    predictions=cross_val_predict(model,test_google_x,test_google_y)
    accuracy = accuracy_score(test_google_y,predictions)
    google_model_validation_result.append(accuracy)

final_google_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
# Store the classifier names rather than the (name, estimator) tuples
final_google_outcomes['Model']=[name for name,_ in models]
final_google_outcomes['Training Acc']=google_model_training_result
final_google_outcomes['Validation Acc']=google_model_validation_result
final_google_outcomes.to_csv('GoogleNewsVectors-Baseline.csv',index=False)
final_google_outcomes

In [None]:
%%time
#Applying Glove Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2, loss='deviance',n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,algorithm='SAMME.R',n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
glove_model_training_result,glove_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Glove Vectors- Baseline Evaluation")
for name,model in models:
    # KFold only accepts random_state when shuffle=True
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_glove_x,train_glove_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean(), 2) * 100, "% accuracy score")
    # Append to the Glove result lists (not the Word2Vec ones)
    glove_model_training_result.append(results.mean())
    predictions=cross_val_predict(model,test_glove_x,test_glove_y)
    accuracy = accuracy_score(test_glove_y,predictions)
    glove_model_validation_result.append(accuracy)

final_glove_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
# Store the classifier names rather than the (name, estimator) tuples
final_glove_outcomes['Model']=[name for name,_ in models]
final_glove_outcomes['Training Acc']=glove_model_training_result
final_glove_outcomes['Validation Acc']=glove_model_validation_result
# Separate file name so the Google News results are not overwritten
final_glove_outcomes.to_csv('Glove-Baseline.csv',index=False)
final_glove_outcomes

In [None]:
%%time
#Applying Fasttext Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2, loss='deviance',n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,algorithm='SAMME.R',n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
fasttext_model_training_result,fasttext_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Fasttext Vectors- Baseline Evaluation")
for name,model in models:
    # KFold only accepts random_state when shuffle=True
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_fasttext_x,train_fasttext_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean(), 2) * 100, "% accuracy score")
    # Append to the Fasttext result lists (not the Word2Vec ones)
    fasttext_model_training_result.append(results.mean())
    predictions=cross_val_predict(model,test_fasttext_x,test_fasttext_y)
    accuracy = accuracy_score(test_fasttext_y,predictions)
    fasttext_model_validation_result.append(accuracy)

final_fasttext_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
# Store the classifier names rather than the (name, estimator) tuples
final_fasttext_outcomes['Model']=[name for name,_ in models]
final_fasttext_outcomes['Training Acc']=fasttext_model_training_result
final_fasttext_outcomes['Validation Acc']=fasttext_model_validation_result
# Separate file name so the Google News results are not overwritten
final_fasttext_outcomes.to_csv('Fasttext-Baseline.csv',index=False)
final_fasttext_outcomes

## Conclusion

This concludes the non-semantic baseline classification and the baselines built on static embeddings. The next part can be found in this [Notebook](https://www.kaggle.com/colearninglounge/nlp-model-building-transformers-attention-more).


<img src="https://64.media.tumblr.com/38d4b3b4455c6bf339f26cc3ab49e653/tumblr_podm01SXsM1sc0ffqo2_540.gifv">