# Basic NLP Concepts
***
## Table of Contents
1. Introduction to NLP
2. Text Preprocessing

***

## 1. Introduction to NLP

Natural Language Processing (NLP) is a multidisciplinary field that combines computer science, linguistics, and artificial intelligence to enable computers to interpret, process, and generate human language. NLP bridges the gap between human communication and computer understanding, allowing machines to analyse, interpret, and respond to text and speech.


### Types of NLP
There are several types and approaches within NLP, each with its own focus and methodology:

- **Symbolic NLP**: Relies on hand-crafted rules and linguistic knowledge to process language. This traditional approach uses grammar rules and dictionaries to interpret text.

- **Statistical NLP**: Uses statistical methods and machine learning to analyse large volumes of language data, identifying patterns and making predictions based on probabilities.

- **Neural NLP**: Employs deep learning and neural networks to model and understand language, enabling advanced applications like language generation and large language models.
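
To make the contrast between the symbolic and statistical approaches concrete, here is a minimal sketch using only the Python standard library. The word lists, the single negation rule, the toy training sentences, and all function names are assumptions made up for illustration; they do not come from any particular library. The rule-based function encodes knowledge by hand, while the Naive Bayes functions estimate word probabilities from labelled examples.

```python
import math
from collections import Counter, defaultdict

# --- Symbolic approach: hand-written lexicon plus one negation rule ---
POSITIVE = {"good", "great", "excellent", "love"}   # hypothetical word list
NEGATIVE = {"bad", "poor", "terrible", "hate"}      # hypothetical word list


def rule_based_sentiment(text):
    """Score text with fixed word lists; 'not' flips the next word's polarity."""
    score, negate = 0, False
    for token in text.lower().split():
        if token == "not":
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# --- Statistical approach: Naive Bayes learned from labelled examples ---
TRAINING_DATA = [  # toy data, invented for this sketch
    ("the film was great and moving", "positive"),
    ("i love this excellent phone", "positive"),
    ("what a good friendly service", "positive"),
    ("the plot was terrible and slow", "negative"),
    ("i hate this poor battery", "negative"),
    ("bad food and slow service", "negative"),
]


def train_naive_bayes(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict_naive_bayes(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label, doc_count in label_counts.items():
        log_prob = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            log_prob += math.log(smoothed)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


model = train_naive_bayes(TRAINING_DATA)
print(rule_based_sentiment("not good"))                         # negative
print(predict_naive_bayes(model, "slow and terrible service"))  # negative
```

Note that the statistical model picks up "slow" as a negative cue purely from the training data, whereas the rule-based version would need someone to add that word to the lexicon by hand. A neural approach would replace the word counts with learned vector representations, which is beyond the scope of this sketch.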


### NLP Tasks
NLP can be divided into several overlapping subfields and tasks, including:

- **Natural Language Understanding (NLU)**: Focuses on interpreting and extracting meaning from human language (semantics and syntax).
- **Natural Language Generation (NLG)**: Produces human-like text or speech from structured data or other input.
- **Speech Recognition**: Converts spoken language into text.
- **Text Classification**: Assigns categories or labels to text data (e.g., spam detection, topic classification).
- **Named Entity Recognition (NER)**: Identifies and classifies entities such as names, locations, and organisations in text.
- **Sentiment Analysis**: Determines the emotional tone behind a body of text.
- **Machine Translation**: Automatically translates text or speech from one language to another.
- **Part-of-Speech Tagging**: Labels words with their grammatical roles (noun, verb, etc.).
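
One useful way to tell these tasks apart is by the shape of their output: text classification returns a single label per document, part-of-speech tagging returns one label per token, and named entity recognition returns spans of text. The sketch below illustrates this with deliberately naive, standard-library-only stand-ins. The keyword list, the tiny tag lexicon, the capitalisation heuristic, and the function names are all hypothetical; real systems use trained models rather than these shortcuts.

```python
import re

SPAM_KEYWORDS = {"free", "winner", "prize", "urgent"}  # hypothetical keyword list

TAG_LEXICON = {  # hypothetical mini-lexicon of word -> part of speech
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN",
    "sees": "VERB", "chases": "VERB",
}


def classify_text(text):
    """Text classification: one label for the whole input."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(token in SPAM_KEYWORDS for token in tokens)
    return "spam" if hits >= 2 else "not spam"


def tag_parts_of_speech(text):
    """POS tagging: one (token, tag) pair per token; unknown words get 'X'."""
    return [(token, TAG_LEXICON.get(token.lower(), "X")) for token in text.split()]


def find_capitalised_spans(text):
    """Rough NER stand-in: runs of capitalised words become candidate entities."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


print(classify_text("URGENT: claim your free prize now"))
# spam
print(tag_parts_of_speech("The cat chases a dog"))
# [('The', 'DET'), ('cat', 'NOUN'), ('chases', 'VERB'), ('a', 'DET'), ('dog', 'NOUN')]
print(find_capitalised_spans("we met Ada Lovelace in London"))
# ['Ada Lovelace', 'London']
```

The capitalisation heuristic finds candidate spans but does not classify them as people, places, or organisations, and it would wrongly pick up any capitalised sentence-initial word. That gap is one reason practical NER relies on statistical or neural models.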

### Popular NLP Libraries
The main Python libraries used in NLP are:
- **NLTK (Natural Language Toolkit)**: One of the oldest and most comprehensive libraries for NLP tasks such as tokenisation, stemming, tagging, parsing, and semantic reasoning. Widely used for teaching, research, and foundational NLP projects, though it can be slow for large-scale production workloads.
- **spaCy**: Designed for fast, efficient, production-ready NLP applications. It offers pre-trained models for multiple languages, supports tokenisation, part-of-speech tagging, named entity recognition, and dependency parsing, and integrates well with deep learning frameworks.
- **Gensim**: Specialises in topic modelling (e.g. LDA), document similarity analysis, and word embeddings (e.g. Word2Vec, FastText). It is optimised for processing large text corpora efficiently and is popular for unsupervised NLP tasks.
- **TextBlob**: Built on top of NLTK and Pattern, TextBlob provides a simple API for common NLP tasks like sentiment analysis, part-of-speech tagging, and noun phrase extraction. It is user-friendly and well suited to beginners and rapid prototyping.
- **Pattern**: Offers tools for text processing, web mining, machine learning, and network analysis. Known for its ease of use, it is suitable for tasks like sentiment analysis, part-of-speech tagging, and web scraping.
- **PyNLPl (pronounced "pineapple")**: A versatile library for both basic and advanced NLP tasks, including n-gram analysis, frequency lists, and linguistic annotation. It supports various file formats and is useful for more specialised NLP workflows.
- **Stanza**: Developed by Stanford, Stanza provides deep learning-based models for tasks such as named entity recognition and part-of-speech tagging, supporting over 70 languages and integrating well with other libraries (e.g., spaCy, Hugging Face Transformers).
- **Polyglot**: Known for its extensive multilingual support. Polyglot offers tokenisation, sentiment analysis, named entity recognition, and word embeddings, with language coverage that varies by task and reaches 130+ languages for some of them.
- **CoreNLP**: A robust Java-based library from Stanford, accessible in Python via wrappers, used for tasks such as named entity recognition and coreference resolution. Often integrated with other Python NLP libraries.
- **Hugging Face Transformers**: Although built primarily around transformer-based models, including large language models, this library is widely used in modern NLP for tasks such as text classification, question answering, and text generation.

## 2. Text Preprocessing