## NLP Libraries
* **Scikit-learn**: A general-purpose machine learning library for Python. Its text feature extractors, such as `CountVectorizer` and `TfidfVectorizer`, are used later in this notebook.

* **Natural Language Toolkit (NLTK)**: A broad toolkit covering common NLP tasks such as tokenization, stemming, lemmatization, and tagging.

* **Pattern**: A web mining module that also includes tools for NLP and machine learning.

* **TextBlob**: Provides a simple interface for basic NLP tasks such as sentiment analysis, noun phrase extraction, and part-of-speech tagging.

* **Quepy**: Transforms natural language questions into queries in a database query language.

* **spaCy**: An open-source NLP library commonly used for information extraction, text analysis, sentiment analysis, and text summarization.

* **Gensim**: A library for topic modeling and document similarity, designed to process large corpora as data streams.

## STEPS
* Sentence Tokenization
* Word Tokenization
* Text Lemmatization and Stemming
* Stop Words
* Regex
* Bag-of-words
* TF-IDF
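
The Regex step in the list above usually means cleaning raw text with regular expressions before further processing. A minimal sketch using Python's built-in `re` module; the helper name `clean_text` and the choice to lowercase, drop punctuation, and collapse whitespace are assumptions for illustration, not a fixed recipe:

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = raw.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "No one rejects, dislikes, or avoids pleasure itself!"
cleaned = clean_text(sample)
print(cleaned)  # no one rejects dislikes or avoids pleasure itself
```

Whether to strip punctuation at all depends on the task: sentence tokenizers rely on it, so this kind of cleanup usually comes after sentence splitting.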

In [2]:
text = "But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful."

In [3]:
text

'But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful.'

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\RAHUL
[nltk_data]     SUTHAR\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### SENTENCE TOKENIZATION 

In [5]:
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()


But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness.

No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful.



### WORD TOKENIZATION

In [6]:
# Tokenize each sentence into words and print the resulting token list
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()

### STOP WORDS

In [9]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\RAHUL
[nltk_data]     SUTHAR\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [12]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
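
A list like this is typically used to drop stop words from a tokenized sentence. A minimal sketch, assuming the tokens have already been produced (for example by `nltk.word_tokenize`); the helper `remove_stop_words` is hypothetical, and the small `stop_words` set is a hand-picked subset of the list above standing in for `set(stopwords.words("english"))`:

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Assumption: a tiny stand-in for set(stopwords.words("english")).
stop_words = {"i", "you", "to", "how", "all", "this", "of", "and", "was", "but"}

tokens = ["But", "I", "must", "explain", "to", "you", "how", "all", "this",
          "mistaken", "idea"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['must', 'explain', 'mistaken', 'idea']
```

Converting the stop word list to a `set` keeps each membership check fast, which matters when filtering long documents.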

### BAG OF WORDS

In [16]:
document = ["I like this movie ,it is funny","I hate this movie","nice one "]

In [17]:
# Step 1: import the vectorizer (the text to read is the `document` list above)
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [18]:
# Step 2: design the vocabulary
count_vectorizer = CountVectorizer()

In [20]:
# Step 3: create the bag-of-words model
bag_of_words = count_vectorizer.fit_transform(document)

In [21]:
# Show the bag-of-words model as a pandas DataFrame
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns=feature_names)



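|   | funny | hate | is | it | like | movie | nice | one | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |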
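
The counts in the table above can be reproduced by hand, which helps show what the vectorizer is doing. A minimal sketch in plain Python, assuming `CountVectorizer`'s default behaviour of lowercasing the text and keeping only tokens of two or more word characters (which is why the single-letter "I" has no column); the function name `bag_of_words_counts` is hypothetical:

```python
import re
from collections import Counter

def bag_of_words_counts(docs):
    """Return (sorted vocabulary, count matrix) for a list of strings."""
    # Lowercase and keep tokens of 2+ word characters (assumed default tokenization).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is every distinct token, in alphabetical order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

document = ["I like this movie ,it is funny", "I hate this movie", "nice one "]
vocabulary, matrix = bag_of_words_counts(document)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, so word order within a document is discarded; that loss of order is what the name "bag of words" refers to.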


### TF-IDF
* TF = Term Frequency: how often a term occurs in a document
* IDF = Inverse Document Frequency: down-weights terms that occur in many documents

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
tfidf_vectorizer = TfidfVectorizer()

In [24]:
values = tfidf_vectorizer.fit_transform(document)

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(values.toarray(), columns=tfidf_vectorizer.get_feature_names_out())



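|   | funny | hate | is | it | like | movie | nice | one | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.440362 | 0.0 | 0.440362 | 0.440362 | 0.440362 | 0.334907 | 0.0 | 0.0 | 0.334907 |
| 1 | 0.0 | 0.680919 | 0.0 | 0.0 | 0.0 | 0.517856 | 0.0 | 0.0 | 0.517856 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.707107 | 0.0 |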
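
The weights above can be reproduced by hand. A minimal sketch in plain Python, assuming `TfidfVectorizer`'s defaults: raw term counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1 where n is the number of documents and df the number of documents containing the term, and L2 normalization of each row. The function name `tfidf_rows` is hypothetical:

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (sorted vocabulary, L2-normalized TF-IDF rows) for a list of strings."""
    # Lowercase and keep tokens of 2+ word characters (assumed default tokenization).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in toks for toks in tokenized) for term in vocabulary}
    # Smoothed inverse document frequency (assumed default smoothing).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[term] * idf[term] for term in vocabulary]
        # Scale each row to unit Euclidean length; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

document = ["I like this movie ,it is funny", "I hate this movie", "nice one "]
vocabulary, rows = tfidf_rows(document)
for row in rows:
    print([round(w, 6) for w in row])
```

In the second document, "hate" appears in only one document and so gets a higher weight than "movie" and "this", which appear in two.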
