# Basic NLP Concepts
***
## Table of Contents
1. Introduction to NLP
    - Types of NLP
    - NLP Tasks
    - Popular NLP Libraries
2. Text Preprocessing
    - Lowercasing
    - Regular Expressions
    - Removing Punctuation and Special Characters
    - Tokenisation
***

In [42]:
import pandas as pd

## 1. Introduction to NLP

Natural Language Processing (NLP) is a multidisciplinary field that combines computer science, linguistics, and artificial intelligence to enable computers to interpret, process, and generate human language naturally and efficiently. NLP bridges the gap between human communication and computer understanding, allowing machines to analyse, understand, and even respond to text and speech just as humans do.


### Types of NLP
There are several types and approaches within NLP, each with its own focus and methodology:

- **Symbolic NLP**: Relies on hand-crafted rules and linguistic knowledge to process language. This traditional approach uses grammar rules and dictionaries to interpret text.

- **Statistical NLP**: Uses statistical methods and machine learning to analyse large volumes of language data, identifying patterns and making predictions based on probabilities.

- **Neural NLP**: Employs deep learning and neural networks to model and understand language, enabling advanced applications like language generation and large language models.


### NLP Tasks
NLP can be divided into several overlapping subfields and tasks, including:

- **Natural Language Understanding (NLU)**: Focuses on interpreting and extracting meaning from human language (semantics and syntax).
- **Natural Language Generation (NLG)**: Generates human-like text or speech from structured data or other input.
- **Speech Recognition**: Converts spoken language into text.
- **Text Classification**: Assigns categories or labels to text data (e.g., spam detection, topic classification).
- **Named Entity Recognition (NER)**: Identifies and classifies entities such as names, locations, and organisations in text.
- **Sentiment Analysis**: Determines the emotional tone behind a body of text.
- **Machine Translation**: Automatically translates text or speech from one language to another.
- **Part-of-Speech Tagging**: Labels words with their grammatical roles (noun, verb, etc.).

### Popular NLP Libraries
The main Python libraries used in NLP are:
- **NLTK (Natural Language Toolkit)**: One of the oldest and most comprehensive libraries for NLP tasks such as tokenisation, stemming, tagging, parsing, and semantic reasoning. Widely used for teaching, research, and foundational NLP projects, though it may be slower for large-scale production.
- **spaCy**: Designed for fast, efficient, and production-ready NLP applications. It offers pre-trained models for multiple languages, supports tokenisation, part-of-speech tagging, named entity recognition, and dependency parsing, and integrates well with deep learning frameworks.
- **Gensim**: Specialised in topic modelling, document similarity analysis, and word embeddings (e.g. Word2Vec, FastText, LDA). It's optimised for processing large text corpora efficiently and is popular for unsupervised NLP tasks.
- **TextBlob**: Built on top of NLTK and Pattern, TextBlob provides a simple API for common NLP tasks like sentiment analysis, part-of-speech tagging, and noun phrase extraction. It's user-friendly and great for beginners or rapid prototyping.
- **Pattern**: Offers tools for text processing, web mining, machine learning, and network analysis. It is known for its ease of use and is suitable for tasks like sentiment analysis, part-of-speech tagging, and web scraping.
- **PyNLPl (Pineapple)**: A versatile library for both basic and advanced NLP tasks, including n-gram analysis, frequency lists, and linguistic annotation. It supports various file formats and is useful for more specialised NLP workflows.
- **Stanza**: Developed by Stanford, Stanza provides deep learning-based models for tasks such as named entity recognition and part-of-speech tagging, supporting over 70 languages and integrating well with other libraries (e.g., spaCy, Hugging Face Transformers).
- **Polyglot**: Known for its extensive multilingual support. Polyglot offers tokenisation, sentiment analysis, named entity recognition, and word embeddings across 130+ languages.
- **CoreNLP**: A robust Java-based library from Stanford, accessible in Python via wrappers, used for tasks such as named entity recognition and coreference resolution. Often integrated with other Python NLP libraries.
- **Hugging Face Transformers**: Although built primarily around transformer-based models, including large language models, this library is widely used in modern NLP for tasks such as text classification, question answering, and text generation.

## 2. Text Preprocessing
Text preprocessing is the foundation of any NLP project. It involves cleaning and transforming raw text into a structured format suitable for analysis.

A typical text preprocessing pipeline is:
1. **Lowercasing**: Standardises text for comparison.

2. **Removing punctuation/special characters**: Cleans up noise.

3. **Tokenisation**: Splits text into words, sentences, or subwords.

4. **Stopword removal, stemming, lemmatisation**: Further normalises text for analysis.
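
The steps above can be sketched in plain Python using only the standard library. This is a minimal illustration, not a production pipeline: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical names introduced here, and stemming and lemmatisation are left out because they normally rely on a library such as NLTK or spaCy.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# real projects usually take one from a library such as NLTK or spaCy.
STOPWORDS = {"a", "an", "the", "is", "are", "it", "for", "to", "and"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()                                # 1. lowercasing
    text = text.translate(PUNCT_TABLE)                 # 2. remove punctuation
    tokens = text.split()                              # 3. tokenisation
    return [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal


print(preprocess("Sisters ARE doin' it for themselves!"))
# ['sisters', 'doin', 'themselves']
```

Note that `string.punctuation` only covers ASCII punctuation, so emoji and other Unicode symbols pass through this sketch unchanged.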

Dataset retrieved from [Tweets Dataset](https://www.kaggle.com/datasets/mmmarchetti/tweets-dataset?select=tweets.csv)

In [43]:
df = pd.read_csv('_datasets/tweets.csv')
df = df.drop(columns=['author', 'country', 'date_time', 'id', 'language',
             'latitude', 'longitude', 'number_of_likes', 'number_of_shares'])
df.head()

Unnamed: 0,content
0,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...


### Lowercasing
Convert all text to lowercase to ensure uniformity.

In [44]:
df['clean_text'] = df['content'].str.lower()
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,is history repeating itself...?#dontnormalizeh...
1,@barackobama Thank you for your incredible gra...,@barackobama thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl,life goals. https://t.co/xin1qkmkql
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,me right now 🙏🏻 https://t.co/gw55c1wrwd
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,sisters are doin' it for themselves! 🙌🏻💪🏻❤️ ht...


### Regular Expressions
Regular expressions (also called 'regex' or 'regexp') are patterns used to match, search, and manipulate text based on specific sequences of characters. They are extremely useful for extracting information, validating input, finding specific text, and replacing or splitting strings in tasks such as data cleaning, web scraping, and natural language processing.

A regular expression is essentially a sequence of characters that defines a search pattern. This pattern can be made up of literal characters or special symbols (metacharacters) that represent sets, repetitions, or positions in the text.

For example:
- `/cat/` matches the exact sequence 'cat'.
- `/c.t/` matches 'cat', 'cot', 'cut', etc. (the dot `.` matches any single character).
- `/\d+/` matches one or more digits (`\d` means any digit and `+` means 'one or more').


#### Common Regex Elements
- **Literal Characters**: Match themselves (e.g., `a`, `1`, `@`).
- **Metacharacters**:
    - `.` (dot): Any character except newline.
    - `\d`: Any digit (0-9).
    - `\w`: Any word character (letters, digits, underscore).
    - `\s`: Any whitespace character (space, tab, newline).
    - `*`: Zero or more of the preceding element.
    - `+`: One or more of the preceding element.
    - `?`: Zero or one of the preceding element.
    - `[]`: A set or range of characters (e.g., `[a-z]`).
    - `^`: Start of a string.
    - `$`: End of a string.
    - `|`: OR operator (e.g., `cat|dog` matches 'cat' or 'dog').
    - `()`: Grouping for subpatterns.
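
A few of these elements in action, as a small sketch with Python's `re` module. The input strings are invented for illustration.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
starts = bool(re.search(r"^life", "life goals"))
ends = bool(re.search(r"goals$", "life goals"))
print(starts, ends)  # True True

# [] defines a set or range; here, runs of lowercase ASCII letters.
letters = re.findall(r"[a-z]+", "2017 resolution: to embody")
print(letters)       # ['resolution', 'to', 'embody']

# ? makes the preceding element optional.
optional = re.findall(r"colou?r", "color colour")
print(optional)      # ['color', 'colour']

# | is alternation and () groups a subpattern; findall returns the group text.
grouped = re.findall(r"(cat|dog)s", "cats and dogs")
print(grouped)       # ['cat', 'dog']

# \s+ matches a run of whitespace (spaces, tabs, newlines).
pieces = re.split(r"\s+", "me  right\tnow")
print(pieces)        # ['me', 'right', 'now']
```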

#### Example Use Cases
- **Remove punctuation**: `r'[^\w\s]'` matches anything that is not a word character or whitespace.
- **Find email addresses**: `r'\b[\w.-]+@[\w.-]+\.\w+\b'`
- **Validate phone numbers**: Patterns like `r'^\d{3}-\d{3}-\d{4}$'`
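
The three use cases above might look like this in Python. The sample strings are invented, and the email pattern is a loose matcher for illustration rather than a complete address validator.

```python
import re

# Remove punctuation: drop everything that is neither a word character nor whitespace.
no_punct = re.sub(r"[^\w\s]", "", "happy holidays! sending love, and light.")
print(no_punct)  # happy holidays sending love and light

# Find email addresses in free text.
emails = re.findall(
    r"\b[\w.-]+@[\w.-]+\.\w+\b",
    "Contact jane.doe@example.com or bob@test.org.",
)
print(emails)    # ['jane.doe@example.com', 'bob@test.org']

# Validate phone numbers in the NNN-NNN-NNNN format.
phone = re.compile(r"^\d{3}-\d{3}-\d{4}$")
valid = bool(phone.match("555-123-4567"))
invalid = bool(phone.match("5551234567"))
print(valid, invalid)  # True False
```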

### Removing Punctuation and Special Characters
Strip out punctuation, symbols, and special characters to reduce noise. Using `string.punctuation` makes the task easy and efficient.

In [45]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`'[{}]'.format(string.punctuation)` inserts all these punctuation characters inside square brackets, resulting in a string like ``'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'``. Setting `regex=True` tells pandas to interpret the pattern as a regular expression. Because `string.punctuation` itself contains characters that have a special meaning inside a regex character class (such as `[`, `]`, `\`, and `^`), it's safer to escape them using `re.escape`.

In [46]:
import re
df['clean_text'] = df['clean_text'].str.replace(
    '[{}]'.format(re.escape(string.punctuation)), '', regex=True)
df.iloc[5:11]

Unnamed: 0,content,clean_text
5,happy 96th gma #fourmoreyears! 🎈 @ LACMA Los A...,happy 96th gma fourmoreyears 🎈 lacma los ange...
6,"Kyoto, Japan \r\n1. 5. 17. https://t.co/o28M0v...",kyoto japan \r\n1 5 17 httpstcoo28m0vw9lr
7,🇯🇵 @ Sanrio Puroland https://t.co/eXVev5UMBx,🇯🇵 sanrio puroland httpstcoexvev5umbx
8,2017 resolution: to embody authenticity!,2017 resolution to embody authenticity
9,sisters. https://t.co/5ZE21x2aNk,sisters httpstco5ze21x2ank
10,Happy Holidays! Sending love and light to ever...,happy holidays sending love and light to every...
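
To see why escaping the punctuation string matters, here is a small standalone sketch using `re.sub` on a made-up string instead of the DataFrame. Without `re.escape`, the `\]` inside `string.punctuation` is read as an escaped `]`, so the backslash itself never becomes part of the character class.

```python
import re
import string

sample = r"a\b [c]"

# Unescaped: brackets are removed, but the backslash survives.
unescaped = "[{}]".format(string.punctuation)
without_escape = re.sub(unescaped, "", sample)
print(without_escape)  # a\b c

# Escaped: every punctuation character is matched literally.
escaped = "[{}]".format(re.escape(string.punctuation))
with_escape = re.sub(escaped, "", sample)
print(with_escape)     # ab c
```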


### Tokenisation
Tokenisation is the process of splitting text into smaller units called tokens. In NLP, tokens are typically words, subwords, or sentences.
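
As a rough sketch, word and sentence tokens can be produced with regular expressions alone. This is a naive approach shown for illustration: the sentence splitter breaks on abbreviations such as "e.g.", and subword tokenisation is not covered. Libraries such as NLTK and spaCy ship more robust tokenisers.

```python
import re

text = "Life goals. Sending love and light to everyone!"

# Word tokens: runs of word characters, so punctuation is dropped.
words = re.findall(r"\w+", text.lower())
print(words)
# ['life', 'goals', 'sending', 'love', 'and', 'light', 'to', 'everyone']

# Sentence tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['Life goals.', 'Sending love and light to everyone!']
```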