**Introduction to Natural Language Processing with Python 29th July 2017**

Agenda:

1.	Introduction to Natural Language Processing 
2.	Textblob: Simplified Text Processing
3.	Scikit Learn (text_analytics)
4.	NLTK - Natural Language Toolkit
5.	SpaCy    (advanced NLP techniques) 
6.	Gensim   (topic modelling)
7.	Future work –  Project.  Development of a predictive web application. 


**Installation:**

The recommended installation is the Anaconda distribution with Python 3.
https://www.continuum.io/downloads

You can install relevant Python NLP Libraries with conda.
- conda install -c anaconda nltk
- conda install -c conda-forge textblob 
- conda install -c conda-forge spacy
- conda install -c anaconda gensim

**Data**
http://www.nltk.org/data.html 

Kaggle links with text data competitions:

https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data
https://www.kaggle.com/c/word2vec-nlp-tutorial/data


**1 Introduction to Natural Language Processing**

Natural language processing is a complex field at the intersection of computer science, artificial intelligence and computational linguistics. https://en.wikipedia.org/wiki/Natural_language_processing

Natural Language Processing (NLP) is the ability of machines to understand and interpret human language the way it is written or spoken.

We will start with written language, that is, text data. It is often estimated that roughly 80% of the world's data is unstructured, and the majority of that data is in text form.

NLP is computationally challenging because human language is ambiguous, and understanding it requires context and the ability to link concepts.

Types of ambiguity present in natural language:
- Lexical ambiguity: a word has multiple meanings.
- Syntactic ambiguity: a sentence has multiple possible parse trees.
- Semantic ambiguity: a sentence has multiple meanings.
- Anaphoric ambiguity: a pronoun or other referring expression could point back to more than one previously mentioned word or phrase.






Common natural language processing tasks:
-	Text classification - Spam detection    
-	Sentiment analysis   
-	Machine translation 
-	Summarising blocks of text 
-	Named entity recognition 
-	Automatic speech recognition (Siri, Alexa or Google Now)
-	Chatbots   

**Natural language processing terms:**
    
A **corpus** is a collection of text in digital form (digital documents) assembled for text processing.

When a corpus is used to fit a model, it is also called a training corpus. The latent structure inferred from it can later be used to assign topics to new documents that did not appear in the training corpus.

A **token** is a single element chopped from a sentence (or line): a word, a number, a punctuation sign or a mix of characters. The list of tokens becomes the input for further processing such as parsing or text mining.

The process of chopping text documents into such pieces is called **tokenization**.
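As a minimal sketch of tokenization, the snippet below uses only the standard library `re` module. The `tokenize` helper and its regular expression are illustrative assumptions, not the tokenizer of NLTK, TextBlob or SpaCy, which handle contractions and other special cases more carefully.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "NLP is fun, isn't it?"
tokens = tokenize(sentence)
print(tokens)
```

Note that this simple rule splits "isn't" into three tokens, which is one reason library tokenizers are usually preferred.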

An **n-gram** is a contiguous sequence of n items from a given sequence of text or speech.

Unigram, bigram, n-gram: a sequence of 1, 2 or n words taken as the basic element.
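The definition of n-grams can be sketched in a few lines of plain Python. The `ngrams` helper below is a hypothetical name chosen for this example and works on an already tokenized sentence.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # There are len(tokens) - n + 1 starting positions; none if n is too large.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```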

**Stopwords** are very common words that are removed because they carry little meaning on their own. For example: a, an, the, is, which.
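Stopword removal is a simple filter over the token list. The sketch below assumes a tiny hand-written stopword set taken from the examples above; real projects would normally use the much longer lists shipped with an NLP library.

```python
# Illustrative stopword set; real lists contain a hundred or more words.
STOPWORDS = {"a", "an", "the", "is", "which"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lower-cased form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The cat is on a mat".split()
content_tokens = remove_stopwords(tokens)
print(content_tokens)
```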

**Bag-of-words** is a simple model that discards sentence structure: a document is represented as an unordered collection of its words, usually together with their counts.
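A bag-of-words can be sketched with `collections.Counter` from the standard library. The `bag_of_words` helper is a hypothetical name, and splitting on whitespace is a simplifying assumption. The example shows that two sentences with different meanings get the same representation once word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lower-cased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

bow_a = bag_of_words("the dog bit the man")
bow_b = bag_of_words("the man bit the dog")

print(bow_a)
# Word order is lost, so both sentences map to the same bag.
print(bow_a == bow_b)
```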

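An alternative to **Bag-of-words** is **feature hashing**, also known as the **hashing trick**, which maps each token to a column index with a hash function instead of storing a vocabulary.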
https://en.wikipedia.org/wiki/Feature_hashing
https://www.quora.com/Can-you-explain-feature-hashing-in-an-easily-understandable-way
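The idea behind the hashing trick can be sketched as follows. This is an illustration only: the `hashed_features` helper, the choice of MD5 and the bucket count of 8 are assumptions made for the example. Library implementations such as scikit-learn's `HashingVectorizer` use a faster hash function and further refinements.

```python
import hashlib

def hashed_features(tokens, n_buckets=8):
    """Map tokens to a fixed-length count vector without a vocabulary."""
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash is used here because Python's built-in hash()
        # of strings changes between interpreter runs.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_buckets
        vector[index] += 1  # different tokens may collide in one bucket
    return vector

tokens = "the cat sat on the mat".split()
features = hashed_features(tokens)
print(features)
```

The vector length is fixed in advance, so previously unseen words never change the feature dimension; the price is that distinct words can collide in the same bucket.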

**Rescaling** or **normalising** the data with tf-idf.
**Tf–idf**, short for “Term Frequency times Inverse Document Frequency”, is a way to represent documents as feature vectors.

The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus. When the resulting vectors are also length-normalised, it is not a linear transformation of the raw counts.
https://en.wikipedia.org/wiki/Tf–idf
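A minimal sketch of the weighting, assuming one of the simplest variants: tf is the raw count of a word in a document and idf is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries such as scikit-learn and gensim use smoothed and normalised variants, so their numbers will differ. The `tf_idf` helper is a hypothetical name.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append(
            {word: count * math.log(n_docs / df[word])
             for word, count in counts.items()}
        )
    return weighted

docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

A word that occurs in every document ("the") gets weight zero, while a word found in a single document ("dog") gets the highest weight.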
    
**Stemming** is the process of reducing a word to its stem. It is a direct approach that chops off the ending of the word to limit variation. For example, ideally "go", "goes" and "going" all map to "go", although a rule-based stemmer may produce stems that are not dictionary words. NLTK provides several stemmers, such as Porter and Snowball, as well as a lemmatizer, WordNetLemmatizer.


**Lemmatization** is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, "sing", "sang" and "singing" are all forms of the lemma "sing". Unlike stemming, lemmatization takes the context of the word, such as its part of speech, into account.
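The difference between the two approaches can be illustrated with a toy example. Both helpers below are hypothetical and deliberately naive: `naive_stem` strips a few hard-coded suffixes (it is not the Porter or Snowball algorithm), and `naive_lemmatize` looks words up in a tiny hand-written table instead of a real dictionary such as WordNet.

```python
# Suffixes are tried in this order; longer ones first.
SUFFIXES = ("ing", "es", "s")

def naive_stem(word):
    """Chop off the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

# Hand-written lookup table standing in for a real lemma dictionary.
LEMMAS = {"sang": "sing", "singing": "sing", "goes": "go", "going": "go"}

def naive_lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for word in ["go", "goes", "going", "sing", "sang", "singing"]:
    print(word, naive_stem(word), naive_lemmatize(word))
```

Suffix stripping cannot relate the irregular form "sang" to "sing", whereas the dictionary lookup can.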


We still need to remove the non-words like punctuation marks or special characters from the documents.
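One possible way to strip such characters is a regular expression, sketched below with the standard library. The `strip_non_words` helper is a hypothetical name, and what counts as a "non-word" is an assumption: here it is anything that is not a letter, digit, underscore or whitespace.

```python
import re

def strip_non_words(text):
    """Replace punctuation and special characters with spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())

print(strip_non_words("Hello, world!!! (NLP) #2017"))
```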




A **POS tagger**, or part-of-speech tagger, processes a sequence of words and attaches a part-of-speech tag to each word. For example, given a text, it assigns a role to each word: noun, verb, adjective, pronoun, adverb, article, conjunction, preposition or interjection.

**Syntactic dependencies** describe how the words in a sentence relate to each other.


**Sentiment Analysis** – The use of Natural Language Processing techniques to extract subjective information from a piece of text, for example whether an author is being subjective or objective, or positive or negative.
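The simplest form of sentiment analysis counts words from a polarity lexicon. The sketch below is only an illustration: the word lists and the `polarity` helper are made up for this example, it covers polarity only (not subjectivity), and it ignores negation. Real tools rely on much larger lexicons or on trained classifiers.

```python
# Tiny illustrative lexicons; not taken from any library.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "boring", "hate"}

def polarity(text):
    """Return a score in [-1, 1]: above 0 is positive, below 0 is negative."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # +1 for each positive word, -1 for each negative word.
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens)

print(polarity("a great film with an excellent cast"))
print(polarity("a boring and terrible plot"))
```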

**Named Entity Recognition (NER)** – The process of locating and classifying elements in text into predefined categories such as the names of people, organizations, places, monetary values, percentages, etc.

**Latent Semantic Analysis (LSA)** – The process of analyzing relationships between a set of documents and the terms they contain. Accomplished by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.

**Latent Dirichlet Allocation (LDA)** – A common topic modeling technique, LDA is based on the premise that each document or piece of text is a mixture of a small number of topics and that each word in a document is attributable to one of the topics. http://www.datasciencecentral.com/profiles/blogs/10-common-nlp-terms-explained-for-the-text-analysis-novice




Latent Semantic Indexing, LSI (or sometimes LSA), transforms documents from either bag-of-words or (preferably) tf-idf-weighted space into a latent space of lower dimensionality.

Latent Dirichlet Allocation, LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).





Vector space model
https://en.wikipedia.org/wiki/Vector_space_model
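In the vector space model each document is a vector of term weights, and documents are typically compared with cosine similarity. The sketch below assumes raw word counts as weights and uses hypothetical helper and variable names. Tf-idf weights could be substituted without changing the similarity function.

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the cat lay on the rug".split())
doc_c = Counter("stock prices fell sharply".split())

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents that share many words score close to 1, and documents with no words in common score 0.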


Word embeddings represent words as dense numeric vectors, so that words used in similar contexts get similar vectors.




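**Word2vec** is a group of two model architectures, CBOW (Continuous Bag-of-Words) and Skip-gram, that are used to produce word embeddings. CBOW predicts a word from its surrounding context words, while Skip-gram predicts the context words from the word itself.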
https://en.wikipedia.org/wiki/Word2vec
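The two architectures differ in how training examples are built from a sliding window over the text. The sketch below only generates those examples; it does not train a model, and the helper names and window size are assumptions for illustration. In practice a library such as Gensim handles both steps.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context_word) pair per neighbour in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
print(pairs)
print(examples)
```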

Word2vec paper by Tomas Mikolov et al.:
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf



**2 TextBlob: Simplified Text Processing**

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common Natural Language Processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. TextBlob stands on the giant shoulders of NLTK and pattern, and it is good for initial prototyping. Language translation and detection are powered by the Google Translate API. https://cloud.google.com/translate/docs/translating-text#language-params

http://textblob.readthedocs.io/en/dev/quickstart.html

http://textblob.readthedocs.io/en/dev/classifiers.html#classifiers


Very good tutorial by Allison Parrish http://rwet.decontextualize.com/book/textblob/

**3 Scikit Learn text_analytics** 

 http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


**4 NLTK** is the Natural Language Toolkit for Python. It works with human language data and provides over 50 datasets (corpora and lexical resources), along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

NLTK is one of the most popular libraries for doing NLP in Python. Its main drawback is that it is not optimised for speed. http://www.nltk.org/

Book **Natural Language Processing with Python** – Analyzing Text with the Natural Language Toolkit 

by Steven Bird, Ewan Klein, and Edward Loper  http://www.nltk.org/book/  http://www.nltk.org/book_1ed/




**5 SpaCy** is a Python library for advanced Natural Language Processing, written in Python and Cython. It claims to offer the fastest syntactic parser in the world. https://spacy.io/

It currently supports English and German, as well as tokenization for Chinese and several other languages.

https://www.quora.com/What-are-the-advantages-of-Spacy-vs-NLTK


Installation https://spacy.io/docs/usage/


**6 Gensim** is a Python library designed to process raw, unstructured texts and extract semantic topics from documents. It is well optimised. The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents. These algorithms are unsupervised.

https://github.com/RaRe-Technologies/gensim

https://radimrehurek.com/gensim/intro.html

https://radimrehurek.com/gensim/tutorial.html




Experiments on the English Wikipedia. It can take some time.
https://radimrehurek.com/gensim/wiki.html


https://dumps.wikimedia.org/enwiki/20170401/

https://radimrehurek.com/gensim/distributed.html





**Practical exercises:**

Exercise 1: Explore TextBlob http://textblob.readthedocs.io/en/dev/quickstart.html
http://textblob.readthedocs.io/en/dev/classifiers.html#classifiers

Exercise 2: Scikit Learn text_analytics of the 20 Newsgroups data from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Exercise 3: Sentiment Analysis on movie reviews using 

- Scikit Learn 

Write a text classification pipeline to classify movie reviews as either positive or negative.
Find a good set of parameters using grid search.
Evaluate the performance on a held out test set.

- NLTK

- Try to do Sentiment Analysis on movie reviews with data from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

- Advanced: use SpaCy and Gensim https://www.kaggle.com/c/word2vec-nlp-tutorial/data   (the part-2-word-vectors tutorial has errors that need to be fixed)

- Present results


