# Data Representation in Natural Language Processing

<img src="NLP-image.jpg">

# Vectorization

* Computers operate on binary data, in the form of 0s and 1s.
* They cannot interpret words directly.
* Encoding words in numeric form solves this problem.
* The process of converting textual information into numbers is called vectorization.
* It is also known as feature extraction.

# (1) Bag of Words 

* The simplest approach to converting text into numbers.
* Why this name: the model is concerned only with how often each word occurs, not with where it appears in the document (i.e. its order), as if the words had been tossed into a bag.
* The intuition behind this approach is that similar documents contain similar words.
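
The idea in the bullets above can be sketched without any library. This is a minimal illustration, not the scikit-learn implementation: `bag_of_words` is a hypothetical helper name, and the lowercasing and the "two or more word characters" token pattern are assumptions chosen to mirror the CountVectorizer defaults.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    # Lowercase each document and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is the sorted set of all tokens; word order is discarded.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, holding its count in this document.
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


vocab, matrix = bag_of_words(["The quick brown fox jumped over the lazy dog."])
print(vocab)   # ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
print(matrix)  # [[1, 1, 1, 1, 1, 1, 1, 2]]
```

Only "the" occurs twice, so the last column is 2 and every other column is 1.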

<img src="bag1.jpeg">

<img src="bag2.jpeg">

# CountVectorizer

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
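def count_vec(text):
    # `text` must be an iterable of documents (e.g. a list of strings)
    vectorizer = CountVectorizer()
    # learn the vocabulary and build the document-term matrix in one step
    doc_term_matrix = vectorizer.fit_transform(text)
    # convert the sparse matrix to a dense NumPy array for display
    return doc_term_matrix.toarray()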
    

In [3]:
count_vec(["The quick brown fox jumped over the lazy dog."])

array([[1, 1, 1, 1, 1, 1, 1, 2]])

If the following error occurs: ValueError: Iterable over raw text documents expected, string object received.
* The cause is that the input is a bare string.
* What is needed is a list (or another iterable) of documents; for a single document, that is a list containing one element, the string itself.

In short: the input must be a list.
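
One way to guard against this error is to normalize the input before it reaches the vectorizer. The sketch below is an assumption about how one might do that; `as_documents` is a hypothetical helper, not part of scikit-learn.

```python
def as_documents(text):
    """Return `text` as a list of documents.

    A bare string is wrapped in a one-element list; any other iterable
    of strings is converted to a list unchanged.
    """
    if isinstance(text, str):
        return [text]
    return list(text)


single = as_documents("The quick brown fox jumped over the lazy dog.")
several = as_documents(("first document", "second document"))
print(len(single))   # 1
print(len(several))  # 2
```

The result of `as_documents(...)` can then be passed to a function such as `count_vec` without triggering the ValueError.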

# (2) Tf-idf Vectorization

<img src="tf1.jpeg">

<img src="idf.jpeg">

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
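def tf_idf_vec(text):
    # `text` must be an iterable of documents (e.g. a list of strings)
    vectorizer = TfidfVectorizer()
    # learn the vocabulary and idf weights, then build the tf-idf matrix
    doc_term_matrix = vectorizer.fit_transform(text)
    # convert the sparse matrix to a dense NumPy array for display
    return doc_term_matrix.toarray()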

In [6]:
tf_idf_vec(["The car is driven on the road.","the truck is driven on the highway."])

array([[0.42471719, 0.30218978, 0.        , 0.30218978, 0.30218978,
        0.42471719, 0.60437955, 0.        ],
       [0.        , 0.30218978, 0.42471719, 0.30218978, 0.30218978,
        0.        , 0.60437955, 0.42471719]])

Size of the matrix: (number of documents) x (number of unique words); here 2 x 8.
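
The numbers in the matrix can be reproduced by hand. The sketch below assumes the TfidfVectorizer defaults rather than the formulas shown in the images: a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. `tf_idf` is a hypothetical helper name.

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, tf-idf matrix) using smoothed idf and L2-normalized rows."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency.
    idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1 for token in vocab}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[token] * idf[token] for token in vocab]
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights))
        matrix.append([w / norm if norm else 0.0 for w in weights])
    return vocab, matrix


docs = ["The car is driven on the road.", "the truck is driven on the highway."]
vocab, matrix = tf_idf(docs)
print(vocab)
print([round(value, 8) for value in matrix[0]])
```

Words that occur in both documents ("driven", "is", "on", "the") get an idf of exactly 1, while words unique to one document ("car", "road", "truck", "highway") get a higher weight, which is why 0.4247 exceeds 0.3022 in the output above even though both words occur once.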

In [7]:
# https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
# https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer-and-a-tfidf-vectorizer

# (3) Hashing Vectorization

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer

In [9]:
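def hash_vec(text):
    # HashingVectorizer is stateless: it hashes tokens to column indices
    # instead of learning a vocabulary, so no fit step is required
    vectorizer = HashingVectorizer()
    doc_term_matrix = vectorizer.transform(text)
    # note: the default is 2**20 columns, so the dense array is large
    return doc_term_matrix.toarray()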

In [10]:
hash_vec(["The car is driven on the road.","The truck is driven on the highway."])

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
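
The output above looks empty because the default HashingVectorizer maps tokens into 2**20 columns, so the handful of non-zero entries fall outside the corners that NumPy prints. The hashing trick itself can be sketched in a few lines. This is an illustration under stated assumptions, not the scikit-learn implementation: it uses MD5 as a stand-in hash, a small hypothetical `n_features=16`, and raw counts without the sign flipping or normalization that HashingVectorizer applies.

```python
import hashlib
import re


def hashed_counts(doc, n_features=16):
    """Map each token to a column by hashing it; no vocabulary is stored."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first four bytes of the digest as an integer column index.
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


row = hashed_counts("The car is driven on the road.")
print(len(row), sum(row))  # 16 7
```

Because the column index depends only on the token, the same word always lands in the same column, and two different words may collide in one column when `n_features` is small.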

# Word Embeddings

# (1) Implementation of Word2Vec via Gensim

In [11]:
import nltk
nltk.download('abc')

[nltk_data] Downloading package abc to /home/shivangi/nltk_data...
[nltk_data]   Package abc is already up-to-date!


True

In [12]:
from nltk.corpus import abc

In [13]:
import gensim

In [14]:
# Split the corpus into tokenized sentences. The Word2Vec model takes a list of lists
# as input, where each sublist contains the tokens of one sentence.
abc.sents()

[['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', '.'], ['Letters', 'from', 'John', 'Howard', 'and', 'Deputy', 'Prime', 'Minister', 'Mark', 'Vaile', 'to', 'AWB', 'have', 'been', 'released', 'by', 'the', 'Cole', 'inquiry', 'into', 'the', 'oil', 'for', 'food', 'program', '.'], ...]
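
When the text does not come from an NLTK corpus reader, the same list-of-lists shape has to be built by hand. The sketch below is an assumption about one simple way to do that with a regular expression; it is not how `abc.sents()` tokenizes internally, `tokenize_sentences` is a hypothetical helper, and the two input sentences are shortened for illustration.

```python
import re


def tokenize_sentences(raw_sentences):
    """Turn a list of sentence strings into a list of token lists."""
    # Keep runs of word characters, and treat each punctuation mark as its own token.
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in raw_sentences]


sentences = tokenize_sentences([
    "PM denies knowledge of AWB kickbacks.",
    "Letters have been released by the Cole inquiry.",
])
print(sentences[0])  # ['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', '.']
```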

In [15]:
# List the individual word tokens in the 'abc' corpus
nltk.corpus.abc.words()

['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', ...]

In [16]:
# Train a Word2Vec model on the tokenized sentences
model= gensim.models.Word2Vec(abc.sents())

In [17]:
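# vocabulary learned by the model (gensim < 4.0; on gensim >= 4.0 use list(model.wv.key_to_index))
X = list(model.wv.vocab)
X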

['PM',
 'denies',
 'knowledge',
 'of',
 'AWB',
 'kickbacks',
 'The',
 'Prime',
 'Minister',
 'has',
 'denied',
 'he',
 'knew',
 'was',
 'paying',
 'to',
 'Iraq',
 'despite',
 'writing',
 'the',
 'wheat',
 'exporter',
 'asking',
 'be',
 'kept',
 'fully',
 'informed',
 'on',
 'sales',
 '.',
 'Letters',
 'from',
 'John',
 'Howard',
 'and',
 'Deputy',
 'Mark',
 'Vaile',
 'have',
 'been',
 'released',
 'by',
 'Cole',
 'inquiry',
 'into',
 'oil',
 'for',
 'food',
 'program',
 'In',
 'one',
 'letters',
 'Mr',
 'asks',
 'managing',
 'director',
 'Andrew',
 'Lindberg',
 'remain',
 'in',
 'close',
 'contact',
 'with',
 'Government',
 'Opposition',
 "'",
 's',
 'Gavan',
 'O',
 'Connor',
 'says',
 'letter',
 'sent',
 '2002',
 ',',
 'same',
 'time',
 'though',
 'a',
 'trucking',
 'company',
 'He',
 'can',
 'longer',
 'wipe',
 'its',
 'hands',
 'illicit',
 'payments',
 'which',
 '$',
 '290',
 'million',
 '"',
 'responsibility',
 'this',
 'must',
 'lay',
 'may',
 'at',
 'feet',
 'Coalition',
 ...

In [18]:
# query through model.wv; calling most_similar on the model itself is deprecated
data = model.wv.most_similar('science')
data

[('law', 0.9468981027603149),
 ('agriculture', 0.9328317046165466),
 ('general', 0.9301407337188721),
 ('policy', 0.9299634099006653),
 ('media', 0.9234398603439331),
 ('practice', 0.918952226638794),
 ('Crean', 0.9177979230880737),
 ('discussion', 0.9150807857513428),
 ('tight', 0.9143961668014526),
 ('Hooke', 0.9128788709640503)]

In [19]:
# query through model.wv; calling most_similar on the model itself is deprecated
data = model.wv.most_similar('AWB')
data

[('Federal', 0.8229790925979614),
 ('inquiry', 0.8202506899833679),
 ('government', 0.8166568279266357),
 ('Court', 0.811873197555542),
 ('company', 0.8014136552810669),
 ('Government', 0.787051796913147),
 ('exporter', 0.7665734887123108),
 ('Labor', 0.7493975162506104),
 ('veto', 0.7421144247055054),
 ('party', 0.7347246408462524)]

In [20]:
# doesnt_match returns the word that is least similar to the average of the given words
dissimilar_word = model.wv.doesnt_match('See you later, thanks for visiting'.split())
print(dissimilar_word)

you


In [21]:
similarity_two_words = model.wv.similarity('science', 'AWB')
print("Similarity between the two words:")
print(similarity_two_words)

Similarity between the two words:
0.54881704


# (2) Pre-trained Word Embeddings

### Various pre-trained models are available, such as Google Word2Vec, Godin, FastText, and GloVe

## Showcasing a pre-trained Word2Vec model

In [22]:
from gensim.models import KeyedVectors

In [23]:
path= '/media/shivangi/DATA/GoogleNews-vectors-negative300.bin' 

In [24]:
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format(path, binary=True)



In [25]:
# Access vectors for specific words with a keyed lookup:
vector = model['easy']

In [26]:
vector

array([ 3.06640625e-01,  6.83593750e-02, -1.60156250e-01,  1.19628906e-01,
       -6.56127930e-03,  4.39453125e-03,  1.44531250e-01,  6.20117188e-02,
        7.17773438e-02,  2.67333984e-02,  9.91210938e-02, -2.30712891e-02,
        5.66406250e-02, -1.74804688e-01, -5.32226562e-02,  8.98437500e-02,
        2.94921875e-01, -6.59179688e-02,  1.35742188e-01, -1.73828125e-01,
        7.32421875e-02,  2.08007812e-01,  7.27539062e-02,  2.19726562e-01,
       -5.02929688e-02, -1.15234375e-01, -1.80664062e-01, -4.29153442e-06,
       -1.69921875e-01, -7.61718750e-02, -4.30297852e-03,  1.71875000e-01,
        2.57812500e-01, -1.33789062e-01,  3.95507812e-02,  4.24194336e-03,
       -2.80761719e-02, -1.54296875e-01,  1.76757812e-01,  6.68945312e-02,
        2.71484375e-01, -1.43554688e-01,  4.02343750e-01, -1.19140625e-01,
       -2.58789062e-02, -5.63964844e-02,  3.78417969e-02,  4.29687500e-02,
        2.92968750e-02, -2.11181641e-02, -4.15039062e-02,  6.29882812e-02,
       -1.90429688e-02, ...

In [27]:
# see the shape of the vector (300,)
vector.shape

(300,)

In [28]:
# Processing a sentence takes more work than with spaCy: each token's vector is looked up individually
vectors = [model[x] for x in "This is some text I am processing with Spacy".split(' ')]

In [30]:
import numpy as np
vectors=np.array(vectors)

In [31]:
embedding_matrix=np.vstack(vectors)

In [32]:
type(embedding_matrix)

numpy.ndarray

In [33]:
embedding_matrix

array([[-0.2890625 ,  0.19921875,  0.16015625, ...,  0.12792969,
         0.12109375, -0.22949219],
       [ 0.00704956, -0.07324219,  0.171875  , ...,  0.01123047,
         0.1640625 ,  0.10693359],
       [ 0.17871094,  0.09130859, -0.00165558, ...,  0.125     ,
         0.08056641,  0.01672363],
       ...,
       [-0.09033203,  0.04394531,  0.11621094, ..., -0.3359375 ,
        -0.15234375,  0.00254822],
       [-0.02490234,  0.02197266, -0.03540039, ...,  0.01080322,
        -0.01879883, -0.06884766],
       [ 0.06054688,  0.09326172, -0.07373047, ..., -0.07177734,
        -0.02893066, -0.02185059]], dtype=float32)

In [34]:
embedding_matrix.shape

(9, 300)
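
The keyed lookup used above raises a KeyError for any token that is not in the model's vocabulary, so a single unknown word aborts the whole list comprehension. A minimal sketch of a more forgiving lookup follows. It is an assumption about how one might handle this, not part of gensim: `embed_tokens` is a hypothetical helper, and `toy_vectors` is a plain dictionary standing in for the loaded model.

```python
def embed_tokens(tokens, lookup):
    """Look up each token, collecting vectors and the tokens that were not found."""
    vectors, missing = [], []
    for token in tokens:
        try:
            vectors.append(lookup[token])
        except KeyError:
            # Out-of-vocabulary token: record it instead of failing.
            missing.append(token)
    return vectors, missing


# Stand-in for the pre-trained model: a tiny made-up 2-dimensional vocabulary.
toy_vectors = {"this": [0.1, 0.2], "is": [0.0, 0.3], "text": [0.5, 0.1]}
vectors, missing = embed_tokens("this is zzyzx text".split(), toy_vectors)
print(len(vectors), missing)  # 3 ['zzyzx']
```

Any object that supports `lookup[token]` and raises KeyError for unknown keys can be passed as `lookup`.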

### Refer to source[8] of the Medium Blog: 

In [35]:
model.similarity('straightforward', 'easy')

0.5717044
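
The value returned by `similarity` is the cosine similarity of the two word vectors. A minimal sketch of that computation on made-up vectors (the inputs below are illustrative, not taken from the model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))            # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
print(round(cosine_similarity([3.0, 4.0], [4.0, 3.0]), 2))  # 0.96
```

Identical directions score 1, orthogonal directions score 0, and the length of the vectors does not affect the result.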

In [41]:
# model.similar_by_word('kind')

In [38]:
similarity= model.similarity('please','see')
similarity

0.2444769

### Most-similar words
* Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of 
the given words and the vectors for each word in the model. 
The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
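
The description above can be sketched on a toy vocabulary. Everything in this block is illustrative: the 3-dimensional vectors are invented, `most_similar` here is a standalone function rather than the gensim method, and scaling each input vector to unit length before averaging is an assumption of this sketch.

```python
import math


def _unit(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(embeddings, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the mean of the signed input vectors."""
    dim = len(next(iter(embeddings.values())))
    query = [0.0] * dim
    # Positive words are added, negative words are subtracted.
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, x in enumerate(_unit(embeddings[word])):
                query[i] += sign * x
    count = len(positive) + len(negative)
    query = _unit([x / count for x in query])
    scores = []
    for word, vector in embeddings.items():
        if word in positive or word in negative:
            continue  # do not return the query words themselves
        cosine = sum(q * x for q, x in zip(query, _unit(vector)))
        scores.append((word, cosine))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors with axes loosely meaning (royal, male, female).
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.5, 0.5],
}
result = most_similar(toy, positive=["king", "woman"], negative=["man"])
print(result[0][0])  # queen
```

With these toy vectors, "king" plus "woman" minus "man" points closest to "queen", which is the word-analogy behaviour the original word2vec scripts demonstrate.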

In [42]:
# similar_words = model.most_similar('thanks')
# print(similar_words)

In [None]:
## Custom Word Embeddings