### NLP  

The three main NLP approaches are:
1. Rule-based: context-free grammars and regular expressions
2. Probabilistic modelling and ML: likelihood maximization and linear classifiers
3. Deep learning: RCNN, CNN

Text classification example: sentiment analysis

Input: text of a review

Output: class of the review (positive, negative, etc.)
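
To make the input/output contract concrete, here is a minimal rule-based sketch. The word lists and the function name `classify_review` are hypothetical and chosen only for illustration; a real system would learn from labelled reviews rather than rely on a hand-written lexicon.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def classify_review(text):
    """Map the text of a review to a class label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_review("The staff were great and the food was excellent"))
print(classify_review("Terrible food and bad service"))
print(classify_review("not a good movie"))
```

The last review is labelled positive because the rule only sees the word "good" and ignores the negation, which is one reason to look for better text features.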


### Text Preprocessing

1. Tokenization - Process of splitting an input text into tokens (e.g. a whitespace tokenizer splits the input sentence on spaces, so we get a list of words).

2. Token normalization - Mapping different forms of a word to the same token (e.g. wolf, wolves -> wolf).
-> Ways of normalizing tokens - 
a. Stemming - Process of removing or replacing suffixes to get to the root form of the word. The result may not be a valid word (e.g. Porter's stemmer: wolves -> wolv).
b. Lemmatization - Returns the dictionary form of the word (e.g. wolves -> wolf).
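
The tokenization step can be sketched with the standard library alone. The sample sentence and both helper names are assumptions made for this illustration; the second tokenizer is just one possible alternative to splitting on whitespace.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays attached to the words.
    return text.split()

def punctuation_tokenize(text):
    # Assumed alternative: runs of word characters become tokens,
    # and every punctuation mark becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "This is Andrew's text, isn't it?"
whitespace_tokens = whitespace_tokenize(sentence)
punctuation_tokens = punctuation_tokenize(sentence)
print(whitespace_tokens)
print(punctuation_tokens)
```

With whitespace splitting, "text," and "it?" keep their punctuation, so they would be counted as different tokens from "text" and "it".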



![Screenshot%20from%202019-08-08%2014-30-59.png](attachment:Screenshot%20from%202019-08-08%2014-30-59.png)
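
The difference between stemming and lemmatization can be illustrated with a toy sketch. The suffix rules and the lookup table below are assumptions made up for this example; this is not Porter's algorithm and not a real lemmatizer, both of which are considerably more involved.

```python
# Toy suffix rules, tried in order. NOT Porter's algorithm.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("es", ""), ("s", "")]

# Toy dictionary of lemmas; a real lemmatizer uses a full lexicon.
LEMMA_TABLE = {"wolves": "wolf", "feet": "foot", "ponies": "pony"}

def toy_stem(word):
    # Apply the first matching suffix rule; the result may not be a real word.
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    # Look up the dictionary form; fall back to the word itself.
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["wolves", "feet", "ponies"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

The stemmer turns "wolves" into the non-word "wolv" and leaves the irregular form "feet" untouched, while the dictionary lookup returns "wolf" and "foot".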

### Feature Extraction from text

1. <b>Bag of words</b> - ![Screenshot%20from%202019-08-08%2017-12-17.png](attachment:Screenshot%20from%202019-08-08%2017-12-17.png)

Building a bag-of-words representation is also called text vectorization.
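
A minimal sketch of this vectorization, using only the standard library. The three sample texts and the whitespace tokenization are assumptions made for the illustration.

```python
from collections import Counter

texts = ["good movie", "not a good movie", "did not like"]

# Tokenize on whitespace and count each word per text.
bags = [Counter(text.split()) for text in texts]

# The vocabulary fixes one column per distinct word, in sorted order.
vocabulary = sorted(set(word for bag in bags for word in bag))

# Each text becomes a vector of word counts over the vocabulary.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for text, vector in zip(texts, vectors):
    print(vector, text)
```

Each column corresponds to one word of the vocabulary, and each row records how often that word occurs in the text.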

Problem with the bag-of-words approach - 
<br>Word order is lost, and with it the context of the sentence.

To overcome this problem, we need to preserve some word order, which can be done using <b>n-grams</b>.
![Screenshot%20from%202019-08-08%2017-22-19.png](attachment:Screenshot%20from%202019-08-08%2017-22-19.png)
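
A short sketch of n-gram extraction. The helper names and the two sample sentences are assumptions for this illustration. It also shows that the two sentences have identical single-word counts but different 2-grams.

```python
def ngrams(tokens, n):
    # All runs of n consecutive tokens, joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_ngrams(text, max_n=2):
    # Collect 1-grams up to max_n-grams from a whitespace-tokenized text.
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

print(extract_ngrams("not a good movie"))

a = "good movie not bad".split()
b = "bad movie not good".split()
same_unigrams = sorted(ngrams(a, 1)) == sorted(ngrams(b, 1))
same_bigrams = sorted(ngrams(a, 2)) == sorted(ngrams(b, 2))
print(same_unigrams, same_bigrams)
```

The 2-grams "not bad" and "not good" keep the negation next to the word it modifies, which single words cannot do.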

Problem - Too many features.
Solution - Remove high-frequency and low-frequency n-grams. Still, many medium-frequency n-grams can remain.
A low-frequency n-gram can also be very useful. For example, in "WiFi breaks often", the 2-gram "wifi breaks" might not occur frequently in our document corpus, but it still carries very useful information.
To overcome the above problem, we use TF-IDF.
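
One common textbook formulation of TF-IDF is given here as a reference. The notation is an assumption of this note: $f_{t,d}$ is the count of term $t$ in document $d$, and $N$ is the number of documents in the corpus $D$. Libraries often use variants with smoothing or normalization, so the exact numbers can differ.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{ d \in D : t \in d \}\right|}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
$$

A term gets a high weight when it is frequent in a given document but occurs in few documents of the corpus.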


2. <b>TF-IDF</b>
    ![Screenshot%20from%202019-08-09%2011-03-05.png](attachment:Screenshot%20from%202019-08-09%2011-03-05.png)
    ![Screenshot%20from%202019-08-09%2011-07-14.png](attachment:Screenshot%20from%202019-08-09%2011-07-14.png)
    ![Screenshot%20from%202019-08-09%2011-13-35.png](attachment:Screenshot%20from%202019-08-09%2011-13-35.png)
    "did not" has a higher value because it is a less frequent 2-gram, so it can highlight some important characteristic.
    

### Python TF-IDF example


In [4]:
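from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "good movie", "not a good movie", "did not like",
    "i like it", "good one"
]
# Keep 1-grams and 2-grams that occur in at least 2 texts (min_df=2)
# and in at most 50% of the texts (max_df=0.5).
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
print(features)            # sparse matrix: (row, column) and TF-IDF value
print(features.todense())  # dense matrix: one row per text
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(tfidf.get_feature_names_out().tolist())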


  (0, 0)	0.7071067811865476
  (0, 2)	0.7071067811865476
  (1, 3)	0.5773502691896257
  (1, 0)	0.5773502691896257
  (1, 2)	0.5773502691896257
  (2, 1)	0.7071067811865476
  (2, 3)	0.7071067811865476
  (3, 1)	1.0
[[0.70710678 0.         0.70710678 0.        ]
 [0.57735027 0.         0.57735027 0.57735027]
 [0.         0.70710678 0.         0.70710678]
 [0.         1.         0.         0.        ]
 [0.         0.         0.         0.        ]]
['good movie', 'like', 'movie', 'not']


In [7]:
import pandas as pd

# One row per text, one column per retained n-gram.
pd.DataFrame(
    features.todense(), columns=tfidf.get_feature_names_out()
)

|   | good movie | like | movie | not |
|---|---|---|---|---|
| 0 | 0.707107 | 0.0 | 0.707107 | 0.0 |
| 1 | 0.57735 | 0.0 | 0.57735 | 0.57735 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.707107 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
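
The values in this table can be reproduced by hand with the standard library. This is a sketch that assumes the scikit-learn defaults: tokens of two or more word characters, raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are made up for this illustration.

```python
import math
import re
from collections import Counter

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
MIN_DF, MAX_DF_RATIO = 2, 0.5

def extract_terms(text):
    # Tokens of two or more word characters, then 1-grams plus 2-grams.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

docs = [Counter(extract_terms(t)) for t in texts]
n_docs = len(docs)

# Document frequency: number of texts that contain each term.
df = Counter(term for doc in docs for term in doc)

# Keep terms that are neither too rare nor too frequent.
vocab = sorted(t for t, c in df.items() if MIN_DF <= c <= MAX_DF_RATIO * n_docs)

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # Raw count times idf, then scale the row to unit Euclidean length.
    weights = [doc[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

All four retained n-grams occur in exactly two texts, so they share the same idf and each row depends only on how many of them the text contains. "good" is dropped because it occurs in three of the five texts, which exceeds `max_df=0.5`, and that is why the last text, "good one", is all zeros.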
