<small><i>This tutorial was put together by [Alexander Fridman](http://www.rocketscience.ai) and [Volha Hedranovich](http://www.rocketscience.ai) for the Lecture Course. Source and license info is on [GitHub](https://github.com/volhahedranovich/jupyter_lectures).</i></small>

# Lecture 5. Natural Language Processing

### Lecture outline

1. [Key Python packages and projects for NLP](NLP_Tasks.ipynb#Key-Python-packages-&-projects-for-NLP)
  * <span style="color:#757575">scikit-learn</span>
  * <span style="color:#757575">NLTK</span>
  * <span style="color:#757575">spaCy</span>
  * <span style="color:#757575">WordNet</span>
<p></p>
2. [NLP Basic Tasks](NLP_Tasks.ipynb#NLP-Basic-Tasks)
  * <span style="color:#757575">POS Tagging</span>
  * <span style="color:#757575">WSD</span>
  * <span style="color:#757575">NER</span>
  * <span style="color:#757575">Language Identification</span>
  * <span style="color:#757575">Text Summarisation</span>
  * <span style="color:#757575">Sentiment Analysis</span>
  * <span style="color:#757575">Semantic Text Similarity</span>
  * <span style="color:#757575">Topic Modeling</span>
<p></p>
3. [Key Text Preprocessing Techniques](NLP_Tasks.ipynb#Key-text-preprocessing-techniques)
  * <span style="color:#757575">Text Cleaning (RegExp)</span>
  * <span style="color:#757575">Tokenizing</span>
  * <span style="color:#757575">Stopwords removal</span>
  * <span style="color:#757575">Spelling Correction</span>
  * <span style="color:#757575">Synonyms Replacement</span>
  * <span style="color:#757575">Negation Replacement</span>
  * <span style="color:#757575">Stemming and Lemmatization</span>
  * <span style="color:#757575">Adding N-grams</span>
<p></p>
4. [Key Text Vectorization Techniques](NLP_Tasks.ipynb#Key-text-vectorization-techniques)
  * <span style="color:#757575">BOW</span>
  * <span style="color:#757575">CountVec</span>
  * <span style="color:#757575">TF-IDF</span>
  * <span style="color:#757575">Hashing trick</span>
  * <span style="color:#757575">...2vec</span>
<p></p>
5. [Case study. Text classification](NLP_Case_Study.ipynb)
  * Text preprocessing techniques
    * <span style="color:#757575">Replacing words matching regular expressions</span>
    * <span style="color:#757575">Basic cleaning with regexps</span>
    * <span style="color:#757575">Tokenization</span>
    * <span style="color:#757575">Removing repeated characters</span>
    * <span style="color:#757575">Stopwords removal</span>
    * <span style="color:#757575">Adding n-grams</span>
    * <span style="color:#757575">Spelling correction</span>
    * <span style="color:#757575">Lemmatization</span>
    * <span style="color:#757575">Stemming</span>
    * <span style="color:#757575">Adding synonyms</span>
  * Vectorization
    * <span style="color:#757575">Dummy CountVectorizer</span>
    * <span style="color:#757575">TF-IDF</span>
    * <span style="color:#757575">Hashing trick</span>
  * Classification with any algorithm
<p></p><p></p>
6. [Sequence Data Approach. Using RNNs for text classification](NLP_NN_Text_Classifictioin[optional].ipynb)

------

### Homework

Take part in the challenge **Predict the Happiness**.

The problem statement, data, and submission form are available [here](https://www.hackerearth.com/problem/machine-learning/predict-the-happiness/).

Share your best scores in the Slack channel <span style="color:#33742c">`#nlp`</span>, in the thread tagged <span style="color:#33742c">*#nlp-homework1*</span>.

------

### List of abbreviations

NLP &#8212; Natural Language Processing

NLTK &#8212; Natural Language Toolkit

POS &#8212; Part-Of-Speech

NER &#8212; Named Entity Recognition

RegExp &#8212; Regular Expressions

BOW &#8212; Bag of Words

TF-IDF &#8212; Term Frequency–Inverse Document Frequency (also written tf–idf or TFIDF)

WSD &#8212; Word-Sense Disambiguation

------

### References

1. Jacob Perkins. [Python 3 Text Processing with NLTK 3 Cookbook](https://www.packtpub.com/application-development/python-3-text-processing-nltk-3-cookbook)
2. The Stanford Natural Language Processing Group. [Software](https://nlp.stanford.edu/software/)
3. [NLTK](http://www.nltk.org/), a platform for building Python programs to work with human language data
4. [spaCy](https://spacy.io/usage/), a Python library for NLP
5. Analytics Vidhya. NSS. [The Essential NLP Guide for data scientists (with codes for top 10 common NLP tasks)](https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/)

------

# Imports and downloads

In [None]:
import en_core_web_sm
import itertools
import matplotlib.pyplot as plt
import nltk
import pandas as pd

from collections import Counter
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0.
from gensim.summarization import summarize
from IPython.display import display, HTML
from langdetect import detect
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from pywsd.lesk import simple_lesk
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from spacy import displacy
from tabulate import tabulate
from wordcloud import WordCloud, STOPWORDS

%matplotlib inline

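# Load the small English spaCy model.
nlp = en_core_web_sm.load()

# Download the NLTK resources used in this lecture.
nltk.download('punkt')  # tokenizer models
nltk.download('wordnet')  # WordNet lexical database
nltk.download('stopwords')  # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger model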
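
The imports cell above depends on several third-party packages, so it can help to check which ones are installed before running it. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the list `REQUIRED_MODULES` are hypothetical and introduced here for illustration, with the module names copied from the imports above.

```python
from importlib.util import find_spec

# Top-level import names taken from the imports cell above.
REQUIRED_MODULES = [
    "en_core_web_sm", "gensim", "IPython", "langdetect", "matplotlib",
    "nltk", "pandas", "pywsd", "sklearn", "spacy", "tabulate", "wordcloud",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the Python path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

This only checks that each module can be located; it does not verify package versions, so a version-specific import such as `gensim.summarization` may still fail.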