 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Natural Language Processing 


## About us 

#### Boris Delovski

* data science trainer and consultant at Edlitera
* before Edlitera applied his skills in several industries, including neuroimaging and metallurgy
* LinkedIn: https://www.linkedin.com/in/boris-delovski/
* email: boris@edlitera.com

#### Ciprian Stratulat

* software and data engineer at Edlitera (https://www.edlitera.com)
* before Edlitera worked as a Software Engineer in finance, biotech, genomics and consumer web

* LinkedIn: https://www.linkedin.com/in/ciprian-stratulat/
* email: ciprian@edlitera.com

## About Edlitera

- Hands-on, in-depth courses on the latest programming, data science, machine learning and data engineering topics.
- Other Edlitera courses at Qualcomm:
    - Classic ML using Python
    - Introduction to Python
    - Intermediate Python
    - A Whirlwind Tour of Apache Spark
    - Building Data Pipelines with AWS and Python
- Get in touch: contact@edlitera.com 

## Logistics


* 8 two-hour sessions (16 hours total)
* we will upload recordings to the Edlitera dashboard
* please do the in-class exercises - they will help you learn the material!
* at the end of the class we will share our solutions to all in-class exercises
* stay in touch and ask questions: boris@edlitera.com

## Goals

* learn the basics of NLP in Python
* learn the basics of classic ML and deep learning as they apply to NLP
* practice the topics learned
* work on an end-to-end NLP project

**NOTE: Python programming experience is required for this class!**

## Non-goals and caveats

* because of time constraints we won't be able to get into a lot of detail on some topics
    * **we strongly recommend our dedicated Classic ML with Python and Deep Learning with Python courses if you want to get deeper into the ML part of the material**
* this is primarily an engineering course, not a theoretical course
    * we will use images and code instead of equations and theorems
* becoming an expert will take time and individual practice
* we're here to help, please stay in touch!

## Agenda

* **NLP basics in Python**
    * introduction to NLP
    * NLP Python libraries
    * morphological analysis
    * lexical representations
    * document representations
    * distributed representations


* **Machine learning basics**
    * introduction to machine learning
    * implementing classification models in Python
    * training and evaluating classification models
    * introduction to popular classification models


* **Deep learning basics**
    * introduction to deep learning
    * training and evaluating neural networks
    * deep learning with Keras
    * modern neural networks
    * NLP in Keras
    * popular modern neural networks for NLP


* **Final project**

# Introduction to `Natural Language Processing (NLP)`

* often interchangeably used with the term **`computational linguistics`**, but there are slight differences between them
    <br>
    
    * **`computational linguistics`** - applying statistical methods to better understand human language
        <br>
    
    * **`NLP`** - extracting information from human language using computational methods

* computational linguistics, aside from **computer science** and **linguistics**, also encompasses fields such as:
    
    <br>
    
    * **psychology**
    <br>
    
    * **neuroscience**
    <br>
    
    * **philosophy**
    <br>
    
    * **etc.**

* using **`NLP`** we can extract information from written text, and also from spoken language

    <br>
    
    * **`text analytics`** - analyzing text
    <br>
    
    * **`speech analytics`** - analyzing spoken language (we won't deal with this part in this course)

# Natural Language Processing in Python

* there are many libraries designed for processing natural language in Python

* try to use popular libraries

* advantages of popular libraries:
  <br>
    
    * their code has already been tested on countless problems
      <br>
    
    * there is a lot of support for them
      <br>
    
    * they are often simpler than some more obscure libraries

* strictly processing text is usually only the first step of projects involving human language

* usually, we want to also train **`Machine Learning (ML)`** models to automatically classify, summarize, tag, or otherwise extract meaning from processed human text

**We can divide libraries that we use for NLP into:**

   * libraries specifically designed for language processing (e.g. `nltk`, `TextBlob`, etc.)


   * libraries used for creating ML models (the models don't necessarily need to be used for **`NLP`**)

* first, we will focus on NLP libraries


* then, we will cover ML libraries

# Most popular NLP libraries

* for now, we will focus on **`NLTK`**
  <br>
    
    * **`TextBlob`** is a higher-level library built on top of **`NLTK`**, but it is somewhat limited, so we are not going to cover it
      <br>  
      <br>
* we will cover libraries such as **`Scikit-Learn`**, **`Keras`**, etc. in subsequent chapters 

* we will also talk about **`Gensim`**, and **`spaCy`**  later on
  <br>
    
    * some libraries are much easier to understand and work with once you know the basics of **`Deep Learning`**
      <br>
    
    * also, libraries such as **`Gensim`** are not necessary for performing **`NLP`** using classic **`Machine Learning`** algorithms

# `NLTK`

* developed by **Team NLTK** and first released in 2001

* **documentation:** https://www.nltk.org/

* **GitHub repository:** https://github.com/nltk/nltk

* one of the leading platforms for building Python programs that analyze, process and work with data that represents human language

* it can serve as a practical introduction to many **`NLP`** concepts
  <br>
    
* it is accompanied by a book that explains the basic concepts of the different operations we can perform using **`NLTK`**
  <br>

**It has everything we need to completely preprocess our text data and prepare it for models!**

* **`NLTK`** includes more than 50 corpora
    <br>
    
    * a **corpus** is a collection of machine-readable texts, finite in size, that provides a quality representation of some language
      <br>
    
    * nowadays a corpus can contain millions of words (e.g. **The Bank of English Corpus** contains 650 million words)

* as a platform, **NLTK** comes with a number of different libraries that we can use for:
    <br>
    
    * **parsing text data**
        <br>
    
    * **chunking text data**
        <br>
    
    * **sentence detection**
        <br>
    
    * **stemming**
        <br>
    
    * **lemmatization**
        <br>
    
    * **tokenization**
        <br>
    
    * **etc.**

* we will focus on:
    <br>
    
    * **stemming**
    <br>
    
    * **lemmatization**
    <br>
    
    * **POS tagging**
    <br>
    
    * **tokenization**
    <br>
    
    * **stopword removal**

* as we explain the most important concepts of natural language, we will also demonstrate how we perform the aforementioned operations in Python using NLTK
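As a first taste of these operations, here is a minimal sketch of tokenization, stopword removal, and stemming. It assumes `nltk` is installed and deliberately uses only components that need no extra data downloads (`RegexpTokenizer` and `PorterStemmer`). The example sentence and the tiny stopword set are made up for illustration; NLTK's own stopword list lives in `nltk.corpus.stopwords` and needs `nltk.download("stopwords")` first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical example sentence, not taken from any corpus.
text = "The cats are running and the dogs jumped."

# Tokenization: keep runs of word characters, which drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Stopword removal: a tiny hand-written set, used here only for illustration.
stopwords = {"the", "are", "and"}
content_tokens = [token for token in tokens if token not in stopwords]

# Stemming: strip common suffixes to get a crude root form of each word.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in content_tokens]

print(tokens)          # all lowercase word tokens
print(content_tokens)  # ['cats', 'running', 'dogs', 'jumped']
print(stems)           # ['cat', 'run', 'dog', 'jump']
```

Lemmatization and POS tagging follow the same pattern, but they rely on NLTK data packages that must be downloaded first, so they are left out of this sketch.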

# Natural Language

* includes both spoken and signed languages

* the goal is to map language to representations

**The biggest problem with human language is that it is ambiguous by nature!**

* there are currently over 7000 human languages
    <br>
    
    * however, just 10 of them account for what almost 50% of the world's population uses
        <br>
        
        * **English**
        * **Mandarin**
        * **Russian**
        * **Spanish**
        * **Hindi**
        * **Arabic**
        * **Portuguese**
        * **Bengali**
        * **Japanese**
        * **Punjabi**

**Two types of language models:**
    <br>
    
   * **`Synchronic model`** - modeling a language as a hierarchical collection of language characteristics
      <br>
    
   * **`Diachronic model`** - modern, adaptable models that follow a language as it changes over time

* we usually use synchronic models because diachronic models are often too complex to handle in code

# Language characteristics

* there are several ways to look at some text:
    * you can just analyze the shape and structure of words alone (**morphological analysis**)
    * you can break sentences into words (**lexical analysis**)
    * you can analyze how words fit together in a sentence (**syntactical analysis**)
    * etc.

* we tend to group these characteristics of a language into **categories**
    * using these **categories** we can extract a set of representations useful for ML model training

**NOTE: extracting language representations is crucial for NLP. It is the equivalent of the data cleaning, standardization, and similar steps that you might perform when training ML models.**
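To make the difference between these views concrete, here is a rough plain-Python sketch, not how a real NLP library does it. The example sentence, the `SUFFIXES` tuple, and the `split_suffix` helper are all hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical example sentence.
sentence = "The children were playing happily."

# Lexical analysis: break the sentence into words.
words = re.findall(r"[A-Za-z]+", sentence.lower())

# Morphological analysis (very crude): look at the shape of each word alone,
# here by checking for a few common English suffixes.
SUFFIXES = ("ing", "ly", "ed")  # illustrative, far from complete


def split_suffix(word):
    """Return (base, suffix); suffix is '' when none of SUFFIXES matches."""
    for suffix in SUFFIXES:
        # Require a base of at least 3 characters to avoid splitting short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""


shapes = [split_suffix(word) for word in words]

print(words)   # ['the', 'children', 'were', 'playing', 'happily']
print(shapes)  # 'playing' -> ('play', 'ing'), 'happily' -> ('happi', 'ly')
```

Each `(base, suffix)` pair is a simple representation of a word that a model could consume. Real morphological analysis handles far more cases than a fixed suffix list can.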

<img src="https://edlitera-images.s3.amazonaws.com/language_characteristics.png" width="400">

**The most important text characteristics are:**
    <br>
    
    
   * `syntactic characteristics`
      <br>
      
   * `morphological characteristics`
      <br>
    
   * `lexical characteristics`
      <br>
    
   * `semantic characteristics`
   <br>
    
   * `discourse characteristics`

In this course, we will go over **`syntactic characteristics`**, **`morphological characteristics`** and **`lexical characteristics`** one by one and learn how to use Python code to process human language according to these characteristics.

**A note about `semantic characteristics` and `discourse characteristics`**    

* tasks that take advantage of semantic relationships between words are very specific
    * they usually involve very custom code


* **`Deep Learning`** models have their own way of learning relationships between words
    * we will talk about this later in the course 

# `Introduction to NLP and NLTK Cheat Sheet`

* **`NLP`** - extracting information from human language (both written and spoken) using computational methods

* human language is very complex and ambiguous


* to model it we use **`synchronic models`** and **`diachronic models`**

**The most important text characteristics are:**
    <br>
    
    
   * `syntactic characteristics`
      <br>
      
   * `morphological characteristics`
      <br>
    
   * `lexical characteristics`
      <br>
    
   * `semantic characteristics`
   <br>
    
   * `discourse characteristics`

* there are Python libraries that focus mostly on **`NLP`**, and those that focus on **`Machine Learning`** (but we can use them for the purposes of **`NLP`**)

* popular **`NLP`** libraries:
  <br>
    
    * **`NLTK`**
      <br>
    
    * **`TextBlob`**
      <br>  
      
    * **`Gensim`**  
      <br>  
      
    * **`spaCy`**  
      <br>  
      
* popular **`Machine Learning`** libraries:
<br>
    
    *  **`Keras`** 
<br>
    
    * **`TensorFlow`** 
    <br>
    
    * **`Scikit-Learn`** 

### `NLTK`

* **documentation:** https://www.nltk.org/

* one of the leading platforms for building Python programs that analyze, process and work with data that represents human language

* we will focus on using **`NLTK`** for:
    <br>
    
    * **stemming**
    <br>
    
    * **lemmatization**
    <br>
    
    * **POS tagging**
    <br>
    
    * **tokenization**
    <br>
    
    * **stopword removal**
