### Feature Extraction  - Convert text into vectors 

- Corpus - The collection of all words in the data, duplicates included.
- Vocabulary - The set of unique words in the entire data.
- Document - A single unit of text in the data. For example, in sentiment analysis of Twitter data, each individual tweet (or review) is a document.
- Word - A single word in a sentence.


### To Convert Text into Vectors 
    
1. One-Hot Encoding
2. Bag of Words
3. N-Grams
4. TF-IDF
5. Custom Features
6. Word2Vec - Embedding
        

### 1.  One-Hot Encoding 

- Example: there are 5 sentences; compute the corpus and the vocabulary.

- Vocabulary = people, watch, campusx, write, comment.

- Let's compute the vector for document 1.
    
    
    
    d1 = people watch campusx 
    d1 = [[1,0,0,0,0],[0,1,0,0,0],[0,0,1,0,0]]
    
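Disadvantages

- Sparsity - with a vocabulary of n words, every word becomes an n-dimensional vector containing a single 1, so the technique is not feasible for large vocabularies. The resulting sparse arrays can also lead to overfitting.
- No fixed size - documents with different numbers of words produce inputs of different shapes.
- Out of vocabulary (OOV) - a word that is not in the vocabulary cannot be encoded.
- No semantic meaning is captured - when the vectors are plotted, every pair of words is equally far apart, so similar words do not end up closer together.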
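
The encoding of document 1 can be reproduced with a few lines of plain Python. This is a minimal sketch, not a library implementation: the helper names `one_hot` and `encode` are made up for illustration, and the vocabulary order is assumed to be the one listed above.

```python
# Vocabulary in the order used in the notes above (assumed order).
vocabulary = ['people', 'watch', 'campusx', 'write', 'comment']
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary position."""
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1  # raises KeyError for an out-of-vocabulary word
    return vec

def encode(document):
    """One-hot encode every word of a whitespace-separated document."""
    return [one_hot(word) for word in document.split()]

d1 = encode('people watch campusx')
print(d1)

# A shorter document gives a matrix with fewer rows: no fixed size.
print(len(encode('people write')), 'rows vs', len(d1), 'rows')
```

The `KeyError` on an unseen word and the varying number of rows correspond to the out-of-vocabulary and no-fixed-size disadvantages listed above.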
    
    

### 2. Bag of Words 

In [None]:
# Create the vocabulary (the unique words of the data); bag of words then counts how often each vocabulary word occurs in every sentence.

# Advantages
# 1. The "no fixed size" problem is resolved: every document maps to a vector of vocabulary length.
# 2. Semantic meaning is handled to some extent: documents that share words get similar vectors.

# Disadvantages
# 1. Sparsity
# 2. OOV - new words that are not in the vocabulary are ignored.
# 3. Ordering - word order is discarded, so the meaning of the sentence is not fully captured.
# 4. A small change can flip the meaning without changing the vector much.
#    e.g. "this is a very good movie" vs "this is not a good movie":
#    if 'not' is not in the vocabulary it is ignored, so the two vectors come out almost the same although the meanings are opposite.
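
The last disadvantage can be checked by hand. The sketch below is plain Python rather than `CountVectorizer`; the helper `bow_vector` is a hypothetical name, and the vocabulary is assumed to come from the first sentence only, so that 'not' is out of vocabulary.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Assumed vocabulary: built from the first sentence only, so 'not' is missing.
vocabulary = ['this', 'is', 'a', 'very', 'good', 'movie']

v_positive = bow_vector('this is a very good movie', vocabulary)
v_negative = bow_vector('this is not a good movie', vocabulary)
v_plain = bow_vector('this is a good movie', vocabulary)

print(v_positive)
print(v_negative)
# 'not' is dropped, so the negative review is indistinguishable
# from the same sentence without 'not'.
print(v_negative == v_plain)
```

The two reviews differ only in the count for 'very', which says nothing about the negation.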


In [1]:
import pandas as pd 
import numpy as np 

In [7]:
df=pd.DataFrame({'text':['people watch campusx','campusx watch campusx',
                       'people write comment','campusx write comment'],
               'output':[1,1,0,0]})
df.head()

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [8]:
# Build the vocabulary and the bag-of-words matrix for the dataframe
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
bow = cv.fit_transform(df['text'])

In [9]:
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [12]:
# Bag-of-words vector for the first sentence (columns follow the indices in cv.vocabulary_)
print(bow[0].toarray())

[[1 0 1 1 0]]


In [13]:
print(bow[1].toarray())

[[2 0 0 1 0]]


### 3.  N-Grams

1. Unigram - the vocabulary is built from single words
2. Bi-gram - the vocabulary is built from sequences of 2 consecutive words
3. Tri-gram - the vocabulary is built from sequences of 3 consecutive words
4. n-gram - the vocabulary is built from sequences of n consecutive words
       
    cv = CountVectorizer(ngram_range=(1, 2))
    bow = cv.fit_transform(df['text'])
    # the vocabulary contains single words and 2-word sequences

    cv = CountVectorizer(ngram_range=(2, 2))
    # the vocabulary contains only 2-word sequences

    cv = CountVectorizer(ngram_range=(1, 3))
    # the vocabulary contains 1-, 2- and 3-word sequences
 
 
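Advantages

- Captures some of the meaning of the sentence, because short word sequences preserve local word order.
- Easy to implement.

Disadvantages

- The vocabulary grows with n (number of unigrams < number of bi-grams), which increases the dimensionality and slows down the algorithm.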
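
The growth of the vocabulary can be seen on the four example sentences. The following is a plain-Python sketch of the idea rather than scikit-learn's implementation; `ngrams` and `build_vocabulary` are hypothetical helper names, and the text is simply split on whitespace.

```python
docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

def ngrams(tokens, n):
    """All sequences of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(documents, min_n, max_n):
    """Unique n-grams for every n in [min_n, max_n], sorted alphabetically."""
    vocab = set()
    for doc in documents:
        tokens = doc.split()
        for n in range(min_n, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)

# Same ranges as the ngram_range examples above.
for ngram_range in [(1, 1), (2, 2), (1, 2), (1, 3)]:
    vocab = build_vocabulary(docs, *ngram_range)
    print(ngram_range, len(vocab))
```

Even on this tiny corpus the vocabulary triples when going from unigrams only to the range (1, 3).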

### 4. TF-IDF (Term Frequency - Inverse Document Frequency)

In [16]:
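# TF-IDF gives different weights to different words: a word that is frequent
# in a document but rare across the corpus gets a higher weight.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text'])
print(tfidf.vocabulary_)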

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}
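
To see where the weights come from, the calculation can be sketched by hand. The code below assumes scikit-learn's default settings as documented (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and plain whitespace tokenisation; the names `idf` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times IDF for each vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for word in vocab:
    print(word, round(idf[word], 4))
print([round(v, 4) for v in tfidf_vector(tokenized[1])])
```

'campusx' appears in three of the four documents, so it receives the lowest IDF; the other words appear in two documents each and share a higher value.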


### 5.   Word2Vec

In [None]:
Word embeddings - words are converted into vectors such that words that are close together in vector space are expected to be similar in meaning.
    
    Types of Word Embeddings
    1. Frequency based
    - Bag of Words (BOW), TF-IDF, GloVe (Global Vectors)
    
    2. Prediction based
    - Word2Vec
    

    - Word2Vec
    - It was introduced by engineers at Google in 2013.
    Advantages
    - Semantic meaning is captured.
    - Dense vectors are created - most values are non-zero, which helps avoid overfitting.
    - Low-dimensional vectors are created.

    
    - Word2Vec creates features based on the vocabulary.
    
    vocabulary - king, queen, man, woman, monkey.
    features created could be, for example - gender, wealth, weight, power, speak.
    
    So each word's vector holds one value per feature. In practice the learned features are not labelled or directly interpretable like these.
    

    - The assumption behind Word2Vec is that two words that share similar contexts also share similar meaning and, consequently, similar vector representations.

    - Types of Word2Vec
    - CBOW - Continuous Bag of Words
    - Skip-gram
    
    
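    CBOW - Continuous Bag of Words
   
    e.g. "watch campusx for data science"
    - Each word is first converted to a vector using one-hot encoding.
    - With a context window of one word on each side, the context words (x) are used to predict the target word (y).
    
    x (context)       y (target)
    watch, for     -  campusx
    campusx, data  -  for
    for, science   -  data
    
    A neural network is trained on these pairs; its output is compared with the expected target and the weights are adjusted to minimize the cost function.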
    
    
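    Skip-gram
    
    Here the roles are reversed: the target word (x) is used to predict its context words (y).
    
    x (target)   y (context)
    campusx   -  watch, for
    for       -  campusx, data
    data      -  for, science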
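
The training pairs in both tables can be generated mechanically. This is a minimal sketch that assumes a window of one word on each side and, like the tables, skips edge words that lack a full context; `cbow_pairs` and `skipgram_pairs` are hypothetical helper names, and no model is trained here.

```python
def cbow_pairs(tokens, window=1):
    """(context words, target word) pairs for words with a full context."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Same pairs with the roles swapped: (target word, context words)."""
    return [(target, context) for context, target in cbow_pairs(tokens, window)]

tokens = 'watch campusx for data science'.split()

print('CBOW')
for context, target in cbow_pairs(tokens):
    print(context, '->', target)

print('Skip-gram')
for target, context in skipgram_pairs(tokens):
    print(target, '->', context)
```

The skip-gram pairs are kept in the grouped form used in the notes; an actual implementation may instead emit one pair per context word.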
      

### THE END 