# Data Representation in Natural Language Processing

<img src="NLP-image.jpg">

# Vectorization

* We all know that computer understand binary language in the form of 0's and 1's. 
* It is impossible to make them understand words naturally. 
* But encoding such words into numeric form can solve our problem.
* The process of converting textual information into numbers is called Vectorization. 
* It is also termed as feature extraction.

# (1) Bag of Words 

* simplest approach to convert text into numbers
* Why this name: This model is only concerned with with the occurrence of the word and not where it is placed (i.e. order) in bag.
* The intuition behind such approach is that similar documents contain similar words.

<img src="bag1.jpeg">

<img src="bag2.jpeg">

# CountVectorizer

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
def count_vec(text):
    vectorizer = CountVectorizer()
    vocabulary=vectorizer.fit(text)
    doc_term_matrix= vectorizer.transform(text)
    final=doc_term_matrix.toarray()
    return final
    

In [3]:
count_vec(["The quick brown fox jumped over the lazy dog."])

array([[1, 1, 1, 1, 1, 1, 1, 2]])

 If value error occurs: ValueError: Iterable over raw text documents expected, string object received.
* This means The solution to this problem is because input is just a String, 
  but what is needed is a list (or an iterable) containing a single element (which is nothing but the String itself).

Input is a list

# (2.) Tf-idf Vectorization

<img src="tf1.jpeg">

<img src="idf.jpeg">

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
def tf_idf_vec(text):
    vectorizer = TfidfVectorizer()
    vocabulary=vectorizer.fit(text)
    doc_term_matrix= vectorizer.transform(text)
    final=doc_term_matrix.toarray()
    return final

In [6]:
tf_idf_vec(["The car is driven on the road.","the truck is driven on the highway."])

array([[0.42471719, 0.30218978, 0.        , 0.30218978, 0.30218978,
        0.42471719, 0.60437955, 0.        ],
       [0.        , 0.30218978, 0.42471719, 0.30218978, 0.30218978,
        0.        , 0.60437955, 0.42471719]])

Size of the matrix: no. of documents* unique words

In [7]:
# https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
# https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer-and-a-tfidf-vectorizer

# (3)Hashing Vectorization

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer

In [9]:
def hash_vec(text):
    vectorizer = HashingVectorizer()
    vocabulary=vectorizer.fit(text)
    doc_term_matrix= vectorizer.transform(text)
    final=doc_term_matrix.toarray()
    return final

In [10]:
hash_vec(["The car is driven on the road.","The truck is driven on the highway."])

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# Word Embeddings

In [12]:
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]

In [13]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

lines_without_stopwords=[]
#stop words contain the set of stop words
for line in lines:
 temp_line=[]
 for word in lines:
  if word not in stop_words:
   temp_line.append (word)
 string=' '
 lines_without_stopwords.append(string.join(temp_line))

lines=lines_without_stopwords

In [14]:
#import WordNet Lemmatizer from nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

lines_with_lemmas=[]
#stop words contain the set of stop words
for line in lines:
 temp_line=[]
 for word in lines:
  temp_line.append (wordnet_lemmatizer.lemmatize(word))
 string=' '
 lines_with_lemmas.append(string.join(temp_line))
lines=lines_with_lemmas

In [17]:
new_lines=[]
for line in lines:
 new_lines=line.split(' ')
#new lines has the new format
lines=new_lines

In [35]:
from gensim.models import Word2Vec

In [31]:
#import the gensim package
import gensim
model = gensim.models.Word2Vec(lines)

In [36]:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20)

In [38]:
w2v_model.build_vocab(lines, progress_per=10000)

In [39]:
w2v_model.train(lines, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(85, 8640)

In [46]:
w2v_model.wv.most_similar(positive=["office"])

KeyError: "word 'office' not in vocabulary"

In [44]:
gensim.models.word2vec.Word2Vec(sentences=None, size=100, 
alpha=0.025, window=5, min_count=5, max_vocab_size=None, 
sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, 
hs=0, negative=5, cbow_mean=1, 
iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
batch_words=10000, compute_loss=False)

<gensim.models.word2vec.Word2Vec at 0x7fa900d81780>

In [30]:
 model = gensim.models.Word2Vec(
        size=150,
        window=10,
        min_count=2,
        workers=10,
        iter=10)

In [28]:
#saving the model persistence
model.save('model.bin')

#loading the model
model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True) 

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte