Welcome to the project on Text Summarization using NLP. In this project, we will design and deploy an NLP model that summarizes text.

*The broad timeline of the project:*

**Week 1:** Introduction to NLP
**Week 2:** Working with text data
**Week 3:** Text summarization methods
**Week 4:** Building our model
**Week 5:** Improving our model
**Week 6:** Deployment 

This is the notebook for Week 1. It gives a high-level introduction to NLP that will be useful in the upcoming weeks.

Let's get started!

# What is Natural Language Processing?

### Wikipedia defines NLP as:

**Natural language processing (NLP)** is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. 

The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

In short, NLP is about enabling computers to work with the natural language that humans use.

Before we dive deeper into NLP, let us first understand what language is. This will help us uncover the topics in NLP and understand them well.

### What is a language?

Language is a structured system of communication that involves complex combinations of its constituent components, such as
characters, words, sentences, etc. Linguistics is the systematic study of language. 

In order to study NLP, it is important to understand some concepts from linguistics about how language is structured. In this
section, we’ll introduce them and cover how they relate to common NLP tasks.

**Human language is composed of four major building blocks:** 

<ul> 1. Phonemes </ul>
<ul> 2. Morphemes and lexemes </ul>
<ul> 3. Syntax </ul>
<ul> 4. Context </ul>

Let us understand them one by one and look at their applications in the real world.

### 1. Phonemes

Phonemes are the smallest units of sound in a language. They may not have any meaning by themselves but can induce meanings when uttered in combination with other phonemes. Phonemes are particularly important in applications involving
speech understanding, such as speech recognition, speech-to-text transcription, and text-to-speech conversion.

Below is a list of phonemes in English:

![The-Representation-of-English-Phonemes-by-Three-Phonological-Dimensions-D1-D3.png](attachment:The-Representation-of-English-Phonemes-by-Three-Phonological-Dimensions-D1-D3.png)

### 2. Morphemes and Lexemes

A **morpheme** is the smallest unit of language that has a meaning. It is formed by a combination of phonemes. Not all morphemes are words, but all prefixes and suffixes are morphemes. 

For example, in the word “Metadata,” “Meta-” is not a word but a prefix that changes the meaning when put together with “data.” “Meta-” is a morpheme.

**Lexemes** are the structural variations of morphemes related to one another by meaning. For example, “run” and “running” belong to the same lexeme form. 

Morphological analysis, which analyzes the structure of words by studying their morphemes and lexemes, is a foundational block for many NLP tasks, such as tokenization, stemming, learning word embeddings, and part-of-speech tagging.

### 3. Syntax 

Syntax is a set of rules to construct grammatically correct sentences out of words and phrases in a language. Syntactic structure in linguistics is represented in many different ways. A common approach to representing sentences is a parse tree.


![ch08-tree-4.png](attachment:ch08-tree-4.png)

Here is the image representing different syntactic labels:

![syntax%20labels.PNG](attachment:syntax%20labels.PNG)

### 4. Context 

Context is how various parts in a language come together to convey a particular meaning. Context includes long-term references, world knowledge, and common sense along with the literal meaning of words and phrases. The meaning of a sentence can change based on the context, as words and phrases can sometimes have multiple meanings.

Generally, context is composed of semantics and pragmatics.

Semantics is the direct meaning of the words and sentences without external context. Pragmatics adds world knowledge and the external context of the conversation to enable us to infer implied meaning. Complex NLP tasks such as sarcasm detection, summarization, and topic modeling are some of the tasks that use context heavily.

# Application of these four pillars in NLP

![fourpillars.PNG](attachment:fourpillars.PNG)

# Why is NLP so hard?

Consider these examples from the Winograd Schema Challenge, named after Professor Terry Winograd of Stanford University. The challenge consists of pairs of sentences that differ by only a few words, but the meaning of the sentences is often flipped because of this minor change. These examples are easily disambiguated by a human but are not solvable using most NLP techniques.

![stanford.PNG](attachment:stanford.PNG)

Language is not just rule driven; there is also a creative aspect to it. Various styles, dialects, genres, and variations are used in any language. 
Poems are a great example of creativity in language. Making machines understand creativity is a hard problem not just in NLP, but in AI in general.

For most pairs of languages, there is no direct one-to-one mapping between their vocabularies. This makes porting an NLP solution from one language to another hard. A solution that works for one language might not work at all for another language. This means that one either builds a language-agnostic solution or builds separate solutions for each language.

# Approaches to NLP

The different approaches used to solve NLP problems commonly fall
into three categories: 

<ul> 1. Heuristics </ul>
<ul> 2. Machine learning </ul>
<ul> 3. Deep learning </ul>

## 1. Heuristics 

Similar to other early AI systems, early attempts at designing NLP systems were based on building rules for the task at hand. 

This required that the developers had some expertise in the domain to formulate rules that could be incorporated into a program. Such systems also required resources like dictionaries and thesauruses, typically compiled and digitized over a period of time.

### a) Lexicon-based sentiment analysis

An example of designing rules to solve an NLP problem using such resources is
lexicon-based sentiment analysis. It uses counts of positive and negative words in the text to deduce the sentiment of the text. 
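As a minimal sketch of this counting idea, the snippet below labels a text by comparing how many positive and negative words it contains. The two word lists and the function name `lexicon_sentiment` are made up for this illustration; real systems rely on much larger curated lexicons.

```python
import re

# Toy lexicons invented for this illustration; real systems use curated word lists.
POSITIVE_WORDS = {"good", "great", "excellent", "love", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "sad"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


print(lexicon_sentiment("The plot was great and I love the cast."))
print(lexicon_sentiment("Terrible pacing and a bad ending."))
```

Note that plain counting ignores negation ("not good"), which is one reason such rules are usually only a starting point.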

Besides dictionaries and thesauruses, more elaborate knowledge bases have been built to aid NLP in general and rule-based NLP in particular. One example is WordNet, which is a database of words and the semantic relationships between them.

Some examples of such relationships are synonyms, hyponyms, and meronyms. 

**Synonyms** refer to words with similar meanings. 

**Hyponyms** capture is-type-of relationships. For example, baseball, sumo wrestling, and tennis are all hyponyms of sports. 

**Meronyms** capture is-part-of relationships. For example, hands and legs are meronyms of the body. 

All this information becomes useful when building rule-based systems around
language.

### b) Regex

Regular expressions (regex) are a great tool for text analysis and building rule-based systems. A regex is a sequence of characters that defines a pattern used to match and find substrings in text. For example, a regex like `^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$` matches a string that consists of a simple email ID. Without the `^` and `$` anchors, the same pattern can be used to find email IDs inside a longer piece of text.

Regexes are a very popular paradigm for building rule-based systems. NLP software like Stanford CoreNLP includes TokensRegex, which is a framework for defining regular expressions over sequences of tokens.
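Here is a small sketch of regex-based extraction using Python's built-in `re` module. The pattern is a simplified email matcher written for this example (it will not cover every valid address), and the helper names `find_emails` and `is_email` are hypothetical.

```python
import re

# Simplified pattern for illustration; it does not cover every valid email address.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.\-]+@[a-zA-Z0-9_.\-]+\.[a-zA-Z]{2,5}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


def is_email(candidate):
    """Check whether the whole string is a single email address."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


text = "Write to alice@example.com or bob.smith@mail.example.org for access."
print(find_emails(text))
print(is_email("alice@example.com"))
```

`findall` searches anywhere in the text, while `fullmatch` plays the role of the `^` and `$` anchors by requiring the entire string to match.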

### c) Context-free grammar (CFG) 

CFG is a type of formal grammar that is used to model natural languages. CFGs were formalized by Professor Noam Chomsky, a renowned linguist and scientist. They can capture more complex and hierarchical information than a regex can.
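To make the hierarchical idea concrete, here is a minimal sketch of a toy CFG written as a Python dictionary, together with a function that lists every sentence the grammar can produce. The grammar rules and the `expand` helper are invented for this example and only work for small, non-recursive grammars.

```python
from itertools import product

# Toy grammar invented for illustration: each nonterminal maps to its productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["ball"]],
    "V": [["chased"]],
}


def expand(symbol):
    """Return every word sequence derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:  # a terminal, i.e. an actual word
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side, then combine the alternatives.
        parts = [expand(child) for child in production]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[0])
```

The nesting of `NP` inside `VP` inside `S` is the kind of structure a parse tree records and a flat regex cannot express.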

## 2. Machine Learning based NLP

Machine learning techniques are applied to textual data just as they’re used on other forms of data, such as images, speech, and structured data.

Supervised machine learning techniques such as classification and regression methods are heavily used for various NLP tasks. As an example, an NLP classification task would be to classify news articles into a set of news topics like sports or politics. 

On the other hand, regression techniques, which give a numeric prediction, can be used to estimate the price of a stock based on processing the social media discussion about that stock. 

Similarly, unsupervised clustering algorithms can be used to group similar text documents together.

Some algorithms to be used:

<ul> 1. Naive Bayes Classifier </ul>
<ul> 2. Support Vector Machine </ul>
<ul> 3. Hidden Markov model </ul>
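As a rough sketch of the first algorithm in this list, the code below implements a tiny multinomial Naive Bayes text classifier with add-one smoothing using only the standard library. The training sentences, labels, and function names are made up for illustration; in practice a library implementation and far more data would be used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set, for illustration only.
training_data = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the minister won the election", "politics"),
]
model = train_naive_bayes(training_data)
print(predict("the team scored a goal", model))
print(predict("the election bill", model))
```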

## 3. Deep Learning for NLP

In the last few years, we have seen a huge surge in using neural networks to deal with complex, unstructured data. Language is inherently complex and  unstructured.

Therefore, we need models with better representation and learning
capability to understand and solve language tasks. 

A few popular deep neural network architectures that can be used:

<ul> 1. RNN </ul>
<ul> 2. LSTM </ul>
<ul> 3. CNN </ul>

## Transformers: Talk of the town

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The Transformer was proposed in the paper Attention Is All You Need. It is recommended reading for anyone interested in NLP.

Transformers model the textual context, but not in a sequential manner. Given a word in the input, a Transformer looks at all the words around it (known as self-attention) and represents each word with respect to its context. For example, the word “bank” can have different meanings depending on the context in which it appears. If the context talks about finance, then “bank” probably denotes a financial institution.

On the other hand, if the context mentions a river, then it probably indicates a bank of the river. Transformers can model such context and hence have been used heavily in NLP tasks due to this higher representation capacity as compared to other deep networks.
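The idea of representing each word with respect to its context can be sketched as scaled dot-product self-attention. This is a simplified illustration, not the full Transformer: it assumes the queries, keys, and values are the word vectors themselves, whereas real Transformers apply learned projections and use several attention heads. The 2-dimensional word vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the sequence, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all word vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up vectors standing in for the words "river", "bank", "money".
word_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
contextual, attention = self_attention(word_vectors)
print([round(weight, 3) for weight in attention[1]])
```

Each row of `attention` says how much one word draws on every other word, and all positions are processed without stepping through the sequence in order.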


# Designing an NLP pipeline

![pipeline.PNG](attachment:pipeline.PNG)

We will discuss all these steps one by one in detail.

# Step 1. Data Acquisition

There are a number of means by which you can acquire data for your NLP project:

### a) Use a public dataset

We could see if there are any public datasets available that we can leverage. Take a look at the compilation by Nicolas Iderhoff or search Google’s specialized search engine for datasets.

### b) Scrape data

We could find a source of relevant data on the internet—for example, a consumer or discussion forum where people have posted queries (sales or support). Scrape the data from there and get it labeled by human annotators.

### c) Product intervention

In most industrial settings, AI models seldom exist by themselves. They’re developed mostly to serve users via a feature or product. In all such cases, the AI team should work with the product team to collect more and richer data by developing better instrumentation in the product. 

Product intervention is often the best way to collect data for building intelligent applications in industrial settings. Tech giants like Google, Facebook, Microsoft, Netflix, etc., have known this for a long time and have tried to collect as much data as possible from as many users as possible.


### d) Data augmentation

While instrumenting products is a great way to collect data, it takes
time. 

So, can we do something in the meantime?

NLP has a bunch of techniques through which we can take a small dataset and use some tricks to create more data. These tricks are also called data augmentation, and they try to exploit language properties to create text that is syntactically similar to source text data. 

They may appear as hacks, but they work very well in practice. Let’s look at
some of them:

#### Synonym replacement

Randomly choose “k” words in a sentence that are not stop words.
Replace these words with their synonyms. For synonyms, we can
use synsets in WordNet.
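A minimal sketch of synonym replacement follows. To keep it self-contained, it uses a tiny hand-written synonym dictionary and stop-word set instead of WordNet; both, along with the function name, are assumptions made for this example.

```python
import random

# Toy resources invented for illustration; WordNet synsets would replace SYNONYMS.
STOP_WORDS = {"i", "am", "the", "to", "a", "is", "was"}
SYNONYMS = {"happy": ["glad", "cheerful"], "movie": ["film"], "big": ["large", "huge"]}


def synonym_replacement(sentence, k, seed=None):
    """Replace up to k non-stop words that have a known synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [
        i for i, word in enumerate(words)
        if word.lower() not in STOP_WORDS and word.lower() in SYNONYMS
    ]
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


print(synonym_replacement("I was happy with the big movie", k=2, seed=0))
```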

#### Back translation

Say we have a sentence, S1, in English. We use a machine translation library like Google Translate to translate it into some other language—say, German. 

Let the corresponding sentence in German be S2. Now, we’ll use the machine-translation library again to translate back to English. 

Let the output sentence be S3. We’ll find that S1 and S3 are very similar in meaning but are slight variations of each other. 

#### Bigram flipping

Divide the sentence into bigrams. Take one bigram at random and flip it. For example: “I am going to the supermarket.” Here, we take the bigram “going to” and replace it with the flipped one: “to going.”
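Bigram flipping can be sketched in a few lines. This version assumes bigrams are overlapping pairs of adjacent words and splits on whitespace; the function name is hypothetical.

```python
import random


def flip_random_bigram(sentence, seed=None):
    """Swap the two words of one randomly chosen adjacent word pair."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to flip
    i = rng.randrange(len(words) - 1)  # index of the bigram's first word
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


print(flip_random_bigram("I am going to the supermarket", seed=1))
```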


#### Replacing entities 

Replace entities like person name, location, organization, etc., with other entities in the same category. That is, replace person name with another person name, city with another city, etc. 

For example, in “I live in California,” replace “California” with “London.”
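A rough sketch of entity replacement is shown below. It assumes entities can be found by looking words up in small hand-made lists, which are hypothetical; a real pipeline would first locate entities with a named-entity recognizer.

```python
import random

# Hypothetical entity lists; a real pipeline would detect entities with an NER model.
ENTITIES = {
    "LOCATION": ["California", "London", "Mumbai"],
    "PERSON": ["Alice", "Ravi", "Maria"],
}


def replace_entities(sentence, seed=None):
    """Swap each known entity for a different entity of the same category."""
    rng = random.Random(seed)
    result = []
    for word in sentence.split():
        core = word.strip(".,!?")  # ignore surrounding punctuation when matching
        for names in ENTITIES.values():
            if core in names:
                alternatives = [name for name in names if name != core]
                word = word.replace(core, rng.choice(alternatives))
                break
        result.append(word)
    return " ".join(result)


print(replace_entities("I live in California.", seed=0))
```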

#### Adding noise to data 

In many NLP applications, the incoming data contains spelling mistakes. This is primarily due to characteristics of the platform where the data is being generated (for example, Twitter). 

In such cases, we can add a bit of noise to data to train robust models. For
example, randomly choose a word in a sentence and replace it with another word that’s closer in spelling to the first word. Another source of noise is the “fat finger” problem on mobile keyboards. Simulate a QWERTY keyboard error by replacing a few characters with their neighboring characters on the QWERTY
keyboard.
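The keyboard-error simulation can be sketched as follows. The neighbor map covers only a handful of keys and the `rate` parameter is an assumption introduced for this example; a fuller version would map every key on the layout.

```python
import random

# Partial neighbor map for a QWERTY layout (only a few keys, for illustration).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "t": "rfgy", "n": "bhjm", "i": "ujko",
}


def add_keyboard_noise(text, rate=0.1, seed=None):
    """Replace some characters with a neighboring key to mimic typing errors."""
    rng = random.Random(seed)
    noisy = []
    for char in text:
        neighbors = QWERTY_NEIGHBORS.get(char.lower())
        if neighbors and rng.random() < rate:
            noisy.append(rng.choice(neighbors))  # replacement is always lowercase
        else:
            noisy.append(char)
    return "".join(noisy)


print(add_keyboard_noise("this is a noisy sentence", rate=0.2, seed=3))
```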

# Step 2. Text extraction and cleaning

Text extraction and cleanup refers to the process of extracting raw text from the input data by removing all the other non-textual information, such as markup, metadata, etc., and converting the text to the required encoding format.

Various steps here include stripping HTML tags, correcting spelling errors, and extracting text from scanned documents using OCR libraries in Python.
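As one example of such a cleanup step, here is a minimal sketch that strips HTML markup with Python's built-in `html.parser` module. The class and function names are made up for this example, and messy real-world pages often call for a dedicated HTML library instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the plain text of an HTML string, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><head><style>p {color: red;}</style></head><body><h1>NLP</h1><p>Text &amp; data.</p></body></html>"
print(html_to_text(page))
```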

# Step 3. Data preprocessing

The text-extraction step removed all the usual deformities in our data and gave us the plain text of the article we need.

However, most NLP software works at the sentence level and expects the text to be separated into words at the minimum. So, we need some way to split a text into words and sentences before proceeding further in a processing pipeline.

Some requirements are as follows:

<li> We need to remove special characters and digits </li>
<li> We would like to convert every word to lowercase </li>
<li> We might want to remove frequently used words, called stop words </li>
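These requirements can be sketched with the standard library alone. The stop-word list and both function names are assumptions made for this example, and the sentence splitter is deliberately naive (it would break on abbreviations such as "Dr.").

```python
import re

# Small stop-word list for illustration; NLP libraries ship much fuller lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "of", "to", "and"}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(sentence):
    """Lowercase, drop digits and special characters, tokenize, remove stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", sentence.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]


text = "NLP is fun! In 2017, the Transformer was introduced."
for sentence in split_sentences(text):
    print(preprocess(sentence))
```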

## Pre-processing steps:

### 1. Preliminaries
Sentence segmentation and word tokenization.

### 2. Frequent steps
Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.

### 3. Other steps
Normalization, language detection, code mixing, transliteration, etc.

### 4. Advanced processing
POS tagging, parsing, coreference resolution, etc.

How do we carry out data preprocessing in practice? We will do that with the help of the spaCy library.

## What is spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. 

The architecture of spaCy is shown below:

![spacy.PNG](attachment:spacy.PNG)

We will also use a bit of NLTK along the way, as the need arises. It is another Python library that helps with NLP tasks.

We will learn how to work with text data and process it with spaCy next week!