# Introduction: Word Embeddings with SVM 
Hello people, welcome to this kernel. In this kernel I'm going to show you how to use word embeddings with traditional machine learning models like a Support Vector Machine. We'll use word2vec (through gensim) to train the word vectors and scikit-learn to train the SVM.

Before starting, let's take a look at some background topics.

### Why Word Vectors/Embeddings
In natural language processing, we have to make words understandable to a computer. There are several ways to do this, and to understand why we use word embeddings, let's first examine a more primitive way: one hot encoding.

In one hot encoding, each word gets a one hot vector (a vector with a single 1 and 0 everywhere else) whose length equals the number of words in our vocab.

Let's make an example. We have a vocab like this => {"My","Mine","Turkey","Nice"}, so the vector for each word will be 4D.

Let's check the vectors of the words:
My => [1,0,0,0]
Mine => [0,1,0,0]
Turkey => [0,0,1,0]
Nice => [0,0,0,1]

But if we compute the distance between my & mine and between Turkey & mine, we see that the distances are the same, which means **we could not preserve the real relationships between the words**.
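
To see this concretely, here is a small sketch in plain Python (not part of the original notebook) that builds the one hot vectors for the example vocab and measures the Euclidean distance between two pairs of words. The helper name `one_hot` is just for illustration.

```python
import math

vocab = ["My", "Mine", "Turkey", "Nice"]

def one_hot(word, vocab):
    # A vector of zeros with a single 1 at the word's position in the vocab.
    return [1 if w == word else 0 for w in vocab]

vectors = {word: one_hot(word, vocab) for word in vocab}

# Euclidean distance between two pairs of words.
dist_my_mine = math.dist(vectors["My"], vectors["Mine"])
dist_turkey_mine = math.dist(vectors["Turkey"], vectors["Mine"])

print(dist_my_mine, dist_turkey_mine)  # both are sqrt(2)
```

Any two different one hot vectors differ in exactly two positions, so every pair ends up the same distance apart.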

So let's briefly explain the disadvantages of one hot encoding

**Disadvantages of One Hot Encoding**:
* With a big vocab, the word vectors become insanely big, and this causes memory problems.
* Most of the elements of a vector don't carry any meaning.
* It's impossible to preserve the relations between words because every pair of different vectors is the same distance apart.

So we need a better way to do this and we have this: **word embeddings!**

### Word Embeddings
Before using them with a Support Vector Machine, let's understand what they are. In word embeddings each element of the vector is a real number. I said that in **one hot encoding** most of the elements of a vector don't have a meaning, but in word embeddings they do.

Let's make an example.

We have a vocab like = {"My","Mine","Your","Turkey"}

The vectors for three of these words might look like this:

My = [1.12,1.42,1.45,1.52]
Mine = [1.14,1.40,1.47,1.53]
Turkey = [3.132,4.12312,6.123,7.123]


As you can see, the distance between *my* and *mine* is smaller than the distance between *mine* and *turkey*.
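
A quick check of that claim, as a sketch: reading the toy vectors above as four-dimensional (My = [1.12, 1.42, 1.45, 1.52], Mine = [1.14, 1.40, 1.47, 1.53], Turkey = [3.132, 4.12312, 6.123, 7.123]), we can compare the Euclidean distances. The numbers are made up for illustration; they are not trained embeddings.

```python
import math

# Toy embeddings from the example; the values are illustrative, not trained.
embeddings = {
    "my": [1.12, 1.42, 1.45, 1.52],
    "mine": [1.14, 1.40, 1.47, 1.53],
    "turkey": [3.132, 4.12312, 6.123, 7.123],
}

dist_my_mine = math.dist(embeddings["my"], embeddings["mine"])
dist_mine_turkey = math.dist(embeddings["mine"], embeddings["turkey"])

print("my - mine:", round(dist_my_mine, 3))
print("mine - turkey:", round(dist_mine_turkey, 3))
```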

I can hear your question: "Of course, but how can we create these vectors?"
We'll create these vectors by learning from how likely words are to appear side by side in the text.
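
To give a feeling for what "side by side" means, here is a rough sketch of how (center word, neighbor word) pairs can be collected from a tokenized sentence with a sliding window. The skip-gram variant of word2vec learns from pairs like these. The function name `context_pairs`, the sample sentence, and the window size of 2 are assumptions for illustration, and the real training (done by gensim later in this kernel) involves much more than this.

```python
def context_pairs(tokens, window=2):
    # For every position, pair the center word with each word
    # that lies at most `window` positions to its left or right.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["free", "entry", "in", "a", "weekly", "competition"]
pairs = context_pairs(sentence, window=2)
print(pairs[:4])
```

Words that keep showing up in each other's windows end up with similar vectors, which is the property one hot encoding could not give us.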

This information is enough, let's start.

# Preparing Environment

In [None]:
import numpy as np
import pandas as pd
import gensim
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv('../input/spam-text-message-classification/SPAM text message 20170820 - Data.csv')
data.head()

# Data Preprocessing

In [None]:
# We'll write a function which will clean the text and prepare it.
def cleanText(text):
    cleaned = re.sub("[^a-zA-Z0-9']"," ",text)
    lowered = cleaned.lower()
    return lowered.strip()

cleanText("Let's test our function, by writing this string!")

In [None]:
x,y = np.asarray(data["Message"]),np.asarray(data["Category"])

x_cleaned = [cleanText(t) for t in x]
x_cleaned[:4]

In [None]:
# Also we should convert our categories to the integer labels
label_map = {cat:index for index,cat in enumerate(np.unique(y))}
y_prep = np.asarray([label_map[l] for l in y])

label_map

# Training Word Embeddings
Up to here I haven't explained the code much in markdown, but the main part starts here, so let's explain what we'll do in this section.

In this section we're going to train word embeddings on our data, and in order to train those word vectors we first need to tokenize the data.

Tokenizing basically means splitting sentences into words (e.g. You are nice => ["You","are","nice"]).
Then we'll train our model using gensim. I won't get into the details of word2vec, but you should learn them, because using it without knowing how it works is not very meaningful.


In [None]:
x_tokenized = [[w for w in sentence.split(" ") if w != ""] for sentence in x_cleaned]
x_tokenized[0]

In [None]:
# Now we'll create our model 
import time

start = time.time()

model = gensim.models.Word2Vec(x_tokenized,
                 vector_size=100
                 # vector_size is the length of each word vector.
                )

end = round(time.time()-start,2)
print("This process took",end,"seconds.")


In [None]:
model.wv.most_similar("free")

# Writing A Class To Create Sequences
Our model is ready, but we need a class that converts texts into sequences of word embeddings.

In [None]:
from collections import Counter

class Sequencer():
    
    def __init__(self,
                 all_words,
                 max_words,
                 seq_len,
                 embedding_matrix
                ):
        
        self.seq_len = seq_len
        self.embed_matrix = embedding_matrix
        # Count how often each word occurs, in a single pass over the tokens.
        # Counter is a dict subclass, so word_cnts still maps word -> count.
        self.word_cnts = Counter(all_words)
        # Our final vocab keeps only the max_words most used words,
        # ordered from most to least frequent.
        self.vocab = [word for word, _ in self.word_cnts.most_common(max_words)]
                    
    def textToVector(self,text):
        # Split the text into tokens and keep at most seq_len of them,
        # trimming from the end if the text is longer.
        # If the text is shorter, we pad the end with all-zero vectors
        # that have the same size as the word vectors.
        tokens = text.split()[:self.seq_len]
        vec = []
        for tok in tokens:
            try:
                vec.append(self.embed_matrix[tok])
            except KeyError:
                # Skip words that have no trained vector.
                pass
        
        dim = self.embed_matrix.vector_size
        for i in range(self.seq_len - len(vec)):
            vec.append(np.zeros(dim,))
        
        return np.asarray(vec).flatten()

* Our class is ready, let's take a last look at it.
    1. The constructor takes 4 parameters: all_words, max_words, seq_len, embedding_matrix
        * All Words = All the tokens of your dataset in one flat list (not a list of lists; concatenate all the sentences).
        * Max Words = If your dataset has a lot of unique words you might want to limit the vocab. This parameter is used to find the N (max_words) most used words.
        * Sequence Length = In machine learning every sample must have the same number of features, but in real life each sentence might have a different length. To solve this problem we pick a fixed length and pad or trim every sentence to that length.
        * Embedding Matrix = The trained word vectors (here `model.wv`) that we use to look up the vector of each token.
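
The padding and trimming idea behind Sequence Length can be sketched in isolation. This is a simplified stand-in, not the class itself: `pad_or_trim` is a hypothetical helper, and it works on plain lists with tiny 2D vectors so the result is easy to read.

```python
def pad_or_trim(vectors, seq_len, dim):
    # Keep at most seq_len vectors (trim from the end).
    kept = [list(v) for v in vectors[:seq_len]]
    # Fill the remaining positions with all-zero vectors.
    padding = [[0.0] * dim for _ in range(seq_len - len(kept))]
    # Flatten into one fixed-length feature vector of seq_len * dim values.
    return [value for vec in kept + padding for value in vec]

short_seq = [[1.0, 2.0], [3.0, 4.0]]
long_seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

print(pad_or_trim(short_seq, seq_len=3, dim=2))  # [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
print(pad_or_trim(long_seq, seq_len=3, dim=2))   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

With a sequence length of 15 and 100D word vectors, the same idea gives 15 x 100 = 1500 values per message.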

In [None]:
sequencer = Sequencer(all_words = [token for seq in x_tokenized for token in seq],
              max_words = 1200,
              seq_len = 15,
              embedding_matrix = model.wv
             )

In [None]:
test_vec = sequencer.textToVector("i am in love with you")
test_vec

In [None]:
test_vec.shape

# PCA (Principal Component Analysis)

Everything looks fine, but as you see the vector for each sentence has 1500 elements (15 tokens x 100 dimensions), and it would take a lot of time to train a Support Vector Machine Classifier on this.

To deal with this problem, we'll use the power of statistics: Principal Component Analysis.
Principal Component Analysis is a way to reduce the dimension of vectors. It finds the N directions (components) along which the data varies the most and projects the data onto them.
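
To make the idea less abstract, here is a minimal sketch of PCA written with NumPy's SVD on a small random dataset. It is only an illustration under assumptions (random toy data, a hypothetical helper `pca_fit_transform`); in the kernel itself we rely on scikit-learn's PCA.

```python
import numpy as np

def pca_fit_transform(data, n_components):
    # Center each feature so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(data) - 1)
    ratios = explained_variance / explained_variance.sum()
    # Project the centered data onto the first n_components directions.
    projected = centered @ vt[:n_components].T
    return projected, ratios[:n_components]

rng = np.random.default_rng(0)
# Toy data: 200 samples, 5 features, where feature 0 varies much more than the rest.
toy = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])

reduced, ratios = pca_fit_transform(toy, n_components=2)
print(reduced.shape)
print("Sum of variance ratios:", ratios.sum())
```

The sum of the variance ratios tells us how much of the data's total variance survives the reduction, which is the number we'll print for our own data.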


In [None]:
# But before creating a PCA model using scikit-learn let's create
# a vector for each of our messages
x_vecs = np.asarray([sequencer.textToVector(" ".join(seq)) for seq in x_tokenized])
print(x_vecs.shape)


In [None]:
from sklearn.decomposition import PCA
pca_model = PCA(n_components=50)
pca_model.fit(x_vecs)
print("Sum of variance ratios: ",sum(pca_model.explained_variance_ratio_))

* We asked for 50 components, and with those 50 components we keep about 99% of our data's variance.
* The stats look nice, so let's use the transform function to reduce the dimension.


In [None]:
x_comps = pca_model.transform(x_vecs)
x_comps.shape

* Everything looks good with the data, so let's split it into train and test sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x_comps,y_prep,test_size=0.2,random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)


# Support Vector Machine Classifier
In this section we're going to create and train our Support Vector Machine classifier. We'll use the scikit-learn library, which makes our job a lot easier.
 


In [None]:
start = time.time() 

svm_classifier = SVC()
svm_classifier.fit(x_train,y_train)

end = time.time()
process = round(end-start,2)
print("Support Vector Machine Classifier has been fitted, this process took {} seconds".format(process))


* Hey, it looks like our principal component analysis solution worked, let's check the accuracy.
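
As a reminder of what we're about to measure: for classifiers, scikit-learn's `score` method returns the mean accuracy, i.e. the share of test samples whose predicted label equals the true label. A tiny sketch with made-up labels (the `accuracy` helper and both label lists are illustrative only):

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for two classes, encoded as 0 and 1.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]

print(accuracy(y_true, y_pred))  # 0.8
```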

In [None]:
svm_classifier.score(x_test,y_test)

* Our accuracy is about 94%. Looks nice, but before finishing this kernel, let's try a few more machine learning algorithms.

In [None]:
# More algorithms!!!!
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB,BernoulliNB

rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
print("Score of RFC",rfc.score(x_test,y_test))

logreg = LogisticRegression()
logreg.fit(x_train,y_train)
print("Score of LogReg",logreg.score(x_test,y_test))

gnb = GaussianNB()
gnb.fit(x_train,y_train)
print("Score of GaussianNB",gnb.score(x_test,y_test))

bnb = BernoulliNB()
bnb.fit(x_train,y_train)
print("Score of BernoulliNB",bnb.score(x_test,y_test))

* We got the best results with the Random Forest Classifier. Logistic Regression and Bernoulli Naive Bayes also gave nice results.


# Conclusion 
Hey, we finished one more kernel and I feel really excited. If you liked this kernel, please upvote. It motivates me a lot :)

Also, if you have any questions, please ask me in the comment section by mentioning me. I'll reply ASAP.

Have a great day/night.
