#Text Mining




#NLP
Natural language processing (NLP) is a field at the intersection of data science and artificial intelligence (AI). At its core, it is about teaching machines to understand human languages and extract meaning from text.



##Applications:
1. Document Classification
2. Review Analysis - Sentiment Analysis
3. Search Engines
4. Machine Translation
5. Chatbots
6. Spell Correction
7. Summarization
8. Machine Conversation
9. Spam Detection
10. Named Entity Recognition


##8 best Python Natural Language Processing (NLP) libraries:
1. **Natural Language Toolkit (NLTK):**\
https://www.nltk.org/ \
NLTK is an essential library that supports tasks such as classification, stemming, tagging, parsing, semantic reasoning, and tokenization in Python. (Book: https://www.nltk.org/book/)

2. **TextBlob:** \
https://textblob.readthedocs.io/en/dev/ \
TextBlob is built on top of NLTK and offers a simpler interface, which makes it a good starting point for developers who are beginning their journey with NLP in Python.

3. **CoreNLP:** \
https://stanfordnlp.github.io/CoreNLP/ \
This library was developed at Stanford University and it’s written in Java. Still, it’s equipped with wrappers for many different languages, including Python.

4. **Gensim:** \
https://github.com/RaRe-Technologies/gensim \
Gensim is a Python library that specializes in identifying semantic similarity between documents through vector space modeling and topic modeling.

5. **spaCy:** \
https://spacy.io/ \
spaCy is a relatively young library designed for production use. Its compact, consistent API is often considered more accessible than that of older Python NLP libraries such as NLTK.

6. **polyglot:** \
https://polyglot.readthedocs.io/en/latest/index.html \
This slightly lesser-known library is one of our favorites because it offers a broad range of analysis and impressive language coverage. Thanks to NumPy, it also works really fast.

7. **scikit-learn:** \
https://scikit-learn.org/ \
scikit-learn is a general-purpose machine learning library that provides a wide range of algorithms for building models. For NLP, it offers many functions for creating bag-of-words features to tackle text classification problems.

8. **Pattern:** \
https://www.clips.uantwerpen.be/clips.bak/pages/pattern \
Pattern is another library Python developers use to handle natural languages. It supports part-of-speech tagging, sentiment analysis, vector space modeling, SVM, clustering, n-gram search, and WordNet access.



#Text Word-level Representation (Word Embedding)

[Watch YouTube Videos for details](https://www.youtube.com/channel/UC3d1uzFtJxqPsirAc48zPEA)

1. **One-hot Encoding:** \
A one-hot encoding represents categorical variables as binary vectors. Each category (here, each word) is first mapped to an integer index and is then represented as a binary vector that is all zeros except at that index, which is marked with a 1.
2. **Bag-Of-Words:** \
In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

3. **Word-Embedding:** \
A word embedding represents a word as a real-valued vector that encodes its meaning, such that words that are closer in the vector space are expected to be similar in meaning.
4. **TF-IDF:** \
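TF-IDF stands for term frequency-inverse document frequency. It weights a term by how often it occurs in a document, discounted by how many documents in the collection contain it. This weighting is widely used in search technologies.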
5. **Word2Vec:**\
The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
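
The count-based representations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-sentence corpus, the whitespace tokenization, and the helper names (`one_hot`, `bag_of_words`, `tf_idf`) are made up for this example, and the TF-IDF variant shown is the plain unsmoothed form `tf * log(N / df)`. Libraries such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized by simple whitespace splitting.
docs = [
    "i learn nlp",
    "i learn python",
    "python is user friendly",
]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """All zeros except a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def bag_of_words(tokens):
    """Count of each vocabulary word in the text; word order is ignored."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tf_idf(tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


print(vocab)
print(one_hot("nlp"))
print(bag_of_words(tokenized[0]))
print([round(x, 3) for x in tf_idf(tokenized[0])])
```

In the TF-IDF vector for the first document, "nlp" receives a higher weight than "i" or "learn" because it appears in only one of the three documents.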

##NLTK

[https://www.nltk.org/](https://www.nltk.org/)

*   NLTK is a leading platform for building Python programs to work with human language data.
*   It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
*   It also includes text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.



In [97]:
# !pip install nltk
import nltk

In [98]:
print(nltk.__version__)

3.2.5


In [99]:
# Punkt is a pre-trained, unsupervised ML model for sentence tokenization
# Download the Punkt models
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [100]:
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
test_text = 'I learn NLP. I learn Python. Its user friendly. I am ready.'
sent_tokenize(test_text)

['I learn NLP.', 'I learn Python.', 'Its user friendly.', 'I am ready.']

In [101]:
# Persian text: "Hello! My name is Reza. How are you?"
test2 = 'سلام! اسم من رضا هست. حالتون چطوره؟'
sent_tokenize(test2)

['سلام!', 'اسم من رضا هست.', 'حالتون چطوره؟']
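
To see why a trained sentence tokenizer is useful, it can help to compare it with a naive rule. The sketch below is not how Punkt works. It is a deliberately simple baseline, with a hypothetical helper name `naive_sent_split`, that splits after `.`, `!` or `?` whenever whitespace follows. The second example sentence is made up for illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Works on simple text such as the example above.
print(naive_sent_split('I learn NLP. I learn Python. Its user friendly. I am ready.'))

# Breaks on abbreviations: every period followed by a space ends a "sentence".
print(naive_sent_split('Dr. Smith arrived at 5 p.m. yesterday.'))
```

The naive rule cuts the second text after "Dr." and "p.m.". Punkt is designed to reduce this kind of error by learning which tokens are likely abbreviations from text.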

In [102]:
!gdown --id 1oVyJvIIXM7eHBEMC_N-fxH9aaAUGiTL5

Downloading...
From: https://drive.google.com/uc?id=1oVyJvIIXM7eHBEMC_N-fxH9aaAUGiTL5
To: /content/smaple_text.txt
100% 840/840 [00:00<00:00, 1.33MB/s]


In [103]:
#open a text file
test_file = open("smaple_text.txt", mode='r')

###mode
'r' : Open a text file for reading \
'w' : Open a text file for writing (an existing file is truncated) \
'a' : Open a text file for appending
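
The three modes can be tried safely on a scratch file. This is a minimal sketch: the temporary directory and the file name `demo.txt` are assumptions made for the example, not part of the notebook. It uses a `with` block so that each file handle is closed automatically.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w' creates the file (or truncates an existing one) and writes to it.
with open(path, mode="w", encoding="utf-8") as f:
    f.write("Beautiful is better than ugly.\n")

# 'a' keeps the existing content and adds to the end.
with open(path, mode="a", encoding="utf-8") as f:
    f.write("Explicit is better than implicit.\n")

# 'r' opens the file for reading only.
with open(path, mode="r", encoding="utf-8") as f:
    content = f.read()

print(content)
print(f.closed)  # the with block has already closed the file
```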

In [104]:
text_read = test_file.read()
print(text_read)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [105]:
len(text_read)  # the number of characters

822

In [106]:
import nltk.data
# Load the pre-trained English Punkt sentence tokenizer directly
Punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
Punkt_tok.tokenize(text_read)

['Beautiful is better than ugly.',
 'Explicit is better than implicit.',
 'Simple is better than complex.',
 'Complex is better than complicated.',
 'Flat is better than nested.',
 'Sparse is better than dense.',
 'Readability counts.',
 "Special cases aren't special enough to break the rules.",
 'Although practicality beats purity.',
 'Errors should never pass silently.',
 'Unless explicitly silenced.',
 'In the face of ambiguity, refuse the temptation to guess.',
 'There should be one-- and preferably only one --obvious way to do it.',
 "Although that way may not be obvious at first unless you're Dutch.",
 'Now is better than never.',
 'Although never is often better than *right* now.',
 "If the implementation is hard to explain, it's a bad idea.",
 'If the implementation is easy to explain, it may be a good idea.',
 "Namespaces are one honking great idea -- let's do more of those!"]

In [107]:
len(Punkt_tok.tokenize(text_read))

19

### We can train our tokenizer based on our text

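The NLTK `webtext` corpus used below is a small collection of informal web text, including the overheard conversations in `overheard.txt`. It should not be confused with [WebText](https://paperswithcode.com/dataset/webtext), an internal OpenAI corpus created by scraping web pages with emphasis on document quality.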

In [108]:
import nltk
nltk.download('webtext')

[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


True

In [109]:
from nltk.corpus import webtext
text_parameter = webtext.raw('overheard.txt')
print(text_parameter)  # a collection of overheard conversations

White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, that sounds good.

Guy #1: So this Jack guy is basically the luckiest man in the world.
Guy #2: Why, because he's survived like 5 attempts on his life and it's not even noon?
Guy #1: No; he could totally nail those two chicks.

Dad: Could you tell me where the auditorium is?
Security guy: It's on the second floor.
Dad: Wait, you mean it's actually in the building?

Girl: But, I mean, it's not like I ever plan on giving birth.
Guy: Well, if your mother gave birth, it's like your chances are good that you'll give birth too.
Girl: ...Uh, dude, mother gave birth.
Guy: Absolutely.
Guy #1: I don't mind getting old; I love getting old.
Guy #2: Yeah, just as long as you don't get pregnant.

Hobo: Can you spare any change?
Man: Sorry, no.
Hobo: Who the hell you saying no to? I wasn't asking you anyway, asshole!

Hobo: Excuse me, this is a picture of my daughter Sofiya, she was in a fire recently

In [110]:
# Train a Punkt sentence tokenizer on our own text
from nltk.tokenize import PunktSentenceTokenizer
My_tokenizer = PunktSentenceTokenizer(text_parameter)

In [111]:
type(My_tokenizer)

nltk.tokenize.punkt.PunktSentenceTokenizer

In [112]:
from nltk.tokenize import sent_tokenize    # to compare two methods
pre_token = sent_tokenize(text_parameter)
our_token = My_tokenizer.tokenize(text_parameter)

In [113]:
pre_token[0]

'White guy: So, do you have any plans for this evening?'

In [114]:
our_token[0]

'White guy: So, do you have any plans for this evening?'

##Word Tokenization

In [115]:
from nltk.tokenize import word_tokenize
word_tokenize(test_text)

['I',
 'learn',
 'NLP',
 '.',
 'I',
 'learn',
 'Python',
 '.',
 'Its',
 'user',
 'friendly',
 '.',
 'I',
 'am',
 'ready',
 '.']

In [116]:
word_tokenize("don't")

['do', "n't"]

###TreebankWordTokenizer

In [117]:
from nltk.tokenize import TreebankWordTokenizer
tree_tokenizer = TreebankWordTokenizer()  # create a tokenizer instance
# Same behavior as word_tokenize: the contraction "can't" is split into 'ca' and "n't"
tree_tokenizer.tokenize("Hello! Mr reza. How are you today? I can't stand")

['Hello',
 '!',
 'Mr',
 'reza.',
 'How',
 'are',
 'you',
 'today',
 '?',
 'I',
 'ca',
 "n't",
 'stand']
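
The split of "can't" into 'ca' and "n't" follows the Penn Treebank convention of treating the negation "n't" as its own token. The sketch below imitates only that one rule with a regular expression. The pattern and the helper name `split_negation` are simplifications written for this example and are not NLTK's actual implementation, which handles many more cases.

```python
import re

# Simplified, hypothetical rule: peel a trailing "n't" off a single word.
NT_SUFFIX = re.compile(r"(?i)^(\w+)(n't)$")


def split_negation(token):
    """Return [stem, "n't"] for words ending in n't, else the token unchanged."""
    match = NT_SUFFIX.match(token)
    return list(match.groups()) if match else [token]


for word in ["don't", "can't", "stand"]:
    print(word, "->", split_negation(word))
```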

###WordPunctTokenizer

In [118]:
from nltk.tokenize import WordPunctTokenizer
punct_tokenizer = WordPunctTokenizer()  # splits on every run of punctuation
punct_tokenizer.tokenize("can't")

['can', "'", 't']
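
The output above can be reproduced with a single regular expression: NLTK documents `WordPunctTokenizer` as splitting text into alphabetic and non-alphabetic sequences using the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module. The helper name `word_punct_tokenize` is made up for this example.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Return word runs and punctuation runs as separate tokens."""
    return WORD_OR_PUNCT.findall(text)


print(word_punct_tokenize("can't"))
print(word_punct_tokenize("Hello! Mr reza."))
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token. That is why this tokenizer yields 'can', "'" and 't' rather than 'ca' and "n't".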