# Module 1: Text Analysis with Statistical NLP

## Module Overview

Welcome to **Module 1: Text Analysis with Statistical NLP**! This module teaches you how to work with text data using traditional statistical methods—the foundation that modern deep learning approaches build upon.

**By the end of this module, you will be able to:**
- Build, evaluate, and improve supervised ML text classification pipelines
- Apply keyword-based information retrieval algorithms to find specific data in unstructured text
- Apply unsupervised ML to cluster documents and discover hidden topics in unorganized, unlabeled data

### The Learning Journey

This module follows a logical progression where each session answers a question that naturally leads to the next:

1. **What is NLP?** (Session 1)
2. **How do we understand our data?** (Session 2: Corpora & EDA)
3. **How do we prepare text for ML?** (Session 3: Preprocessing)
4. **How do we convert text to numbers?** (Session 4: Vectorization)
5. **How do we build classifiers?** (Sessions 5-6: Text Classification)
6. **How do we search documents?** (Session 7: Information Retrieval)
7. **How do we discover topics?** (Session 8: Topic Modeling)

Each session builds on the previous one, creating a complete pipeline from raw text to actionable insights.

---

# Introduction to Natural Language Processing

## Overview

This notebook provides a comprehensive introduction to Natural Language Processing (NLP), the field that bridges the gap between computers and human language. We explore what NLP is, its real-world applications, the main categories of NLP tasks (Natural Language Understanding and Natural Language Generation), and the fundamental challenges that make NLP a complex and fascinating domain. We also introduce the two main approaches to NLP: statistical methods and deep learning-based methods, setting the foundation for the rest of the course.

**What you'll learn in this session:** By the end of this hour, you'll understand what NLP is, why it matters, the types of problems it solves, and the fundamental challenges that make it complex. You'll also see how this session connects to the rest of the module.

## Objectives

- Understand what NLP is and its role in text analysis
- Recognize the three key skills this module focuses on: classification, information retrieval, and topic modeling
- Understand the difference between Statistical NLP and Deep Learning approaches
- Recognize the NLP pipeline: from raw text to models

## Outline

1. **What is NLP?** - Introduction to the field and its interdisciplinary nature
2. **Module 1 Focus** - The three key skills: classification, information retrieval, and topic modeling
3. **Approaches to NLP** - Statistical vs. Neural (Deep Learning) methods
4. **The NLP Pipeline** - Overview of the typical workflow from raw text to models
5. **Python NLP Ecosystem** - Overview of key libraries and frameworks

## What is NLP?

NLP bridges the gap between computers and human language. It combines elements of computer science, linguistics, and artificial intelligence to enable machines to understand, process, and generate human language.

## Module 1: Three Key Skills

This module focuses on three essential skills for working with text data:

1. **Build, evaluate, and improve supervised ML text classification pipelines**
   - Classify text into categories (e.g., sentiment: positive/negative)
   - Preprocess text, vectorize it, train classifiers, and evaluate performance

2. **Apply keyword-based information retrieval algorithms**
   - Search and find specific data in unstructured text
   - Build search engines using TF-IDF and similarity measures

3. **Apply unsupervised ML to cluster documents and terms**
   - Discover hidden topics in unorganized, unlabeled data
   - Organize documents by automatically finding patterns and groups

# NLP with Deep Learning

**Deep Learning `+` Big Data**: together, they open the door to new possibilities.

### Natural Language Understanding and Generation

NLP has a wide range of applications, categorized into two main areas:

**Natural Language Understanding (NLU):** Extract meaning, intention, emotion, importance, and correlation between words/texts/speech. Examples:

* **Intent recognition**: Identify the underlying purpose or goal expressed in a sentence or conversation.
* **Named entity recognition**: Identify named entities in text, such as people, organizations, locations, dates, etc.
* **Question answering**: Extract answers to questions posed in natural language.

**Natural Language Generation (NLG):** Generate human-like text/speech. Examples:

* **Text summarization**: Create an abstractive summary from longer (and dispersed) pieces of text (or the internet!).
* **Machine translation**: Translate text from one language to another.
* **Chatbots** and **Dialogue Systems**: Chat agents.

## NLP Tasks

### Sentiment Analysis

The process of classifying the emotional intent of text. Generally, the input to a sentiment classification model is a piece of text, and the output is the probability that the sentiment expressed is positive, negative, or neutral.
- **Toxicity classification** models can be used to moderate and improve online conversations by:
    - silencing offensive comments
    - detecting hate speech
    - scanning documents for defamation
- **Customer reviews classification** on various online platforms
- **Identifying signs of mental illness** in online comments
- **Spam detection** in emails and messages
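
To make the "text in, class probabilities out" idea concrete, here is a minimal sketch of a multinomial Naive Bayes sentiment classifier written with the standard library only. The training sentences, the three labels, and the function names are illustrative assumptions, not a dataset or API from this course; a real project would typically use a library implementation and far more data.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training data (hypothetical, for illustration only).
TRAIN = [
    ("I love this phone, it is great", "positive"),
    ("What a wonderful and great experience", "positive"),
    ("I hate this phone, it is terrible", "negative"),
    ("What an awful and terrible experience", "negative"),
    ("The phone arrived on Monday", "neutral"),
    ("The package contains a phone and a charger", "neutral"),
]


def tokenize(text):
    """Lowercase the text and extract word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_proba(text, model):
    """Return P(class | text) for every class, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn log scores into probabilities that sum to 1.
    max_score = max(log_scores.values())
    exp_scores = {c: math.exp(s - max_score) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


model = train_naive_bayes(TRAIN)
probs = predict_proba("this is a great phone", model)
print(probs)
```

The output is a dictionary with one probability per class, which mirrors the description above: the model does not just emit a label, it reports how confident it is in each sentiment.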

### Machine Translation

Automatically converting text from one language into another. This is a challenging task because languages have different grammatical structures, vocabularies, and idioms.
- **Translating subtitles** for movies and TV shows
- **Translating websites** into multiple languages
- **Translating legal documents** for international business
- **Translating medical records** for healthcare providers
- **Translating scientific papers** for researchers
- **Translating user manuals** for consumers
- **Translating social media posts** for marketing

### Named-entity Recognition (NER)

**Named entity recognition (NER)** aims to locate entities in a piece of text and classify them into predefined categories such as **personal names**, **organizations**, **locations**, and **quantities**.


<img src="https://mohammedkhalilia.com/project/wojood/featured_hu3c39113436d8415af0ad7d87aac657cd_252396_720x2500_fit_q75_h2_lanczos.webp">

### Information Extraction and Retrieval

Search engines like Google, Bing, and Yahoo use NLP to understand user queries and return relevant results, after having crawled the web.
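
As a rough illustration of keyword-based retrieval, the sketch below ranks a tiny corpus against a query using TF-IDF weights and cosine similarity. The three documents, the smoothed IDF formula (one common variant), and all function names are assumptions made for this example; production search engines add crawling, inverted indexes, and many more ranking signals.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for crawled web pages.
DOCS = [
    "The cat sat on the mat",
    "Dogs and cats are popular pets",
    "Stock markets fell sharply on Monday",
]


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Compute a TF-IDF vector (term -> weight) for each document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    # Smoothed IDF, so terms that occur in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, docs, vectors, idf):
    """Rank documents by cosine similarity to the query, best first."""
    tf = Counter(t for t in tokenize(query) if t in idf)
    total = sum(tf.values())
    q_vec = {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}
    scored = [(cosine(q_vec, vec), doc) for vec, doc in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


vectors, idf = build_index(DOCS)
results = search("stock markets", DOCS, vectors, idf)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Note that this sketch matches exact tokens only, so a query for "cat" would not match "cats"; closing that gap is one motivation for the preprocessing steps covered later in the module.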

### Text summarization

The process of distilling the most important information from a piece of text into a shorter version. There are two main types of text summarization: **extractive** and **abstractive**.

- **Extractive summarization** involves **selecting** the most important sentences or phrases from the original text and combining them to create a summary.
  - think: "Google search".
- **Abstractive summarization** involves **generating** new sentences that capture the essence of the original text.
  - think: "ChatGPT"
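
The extractive approach can be sketched in a few lines: score each sentence by how frequent its words are in the whole text, then keep the top-scoring sentences unchanged. The stop-word list, the naive sentence splitter, and the averaging score below are simplifying assumptions for illustration, not the method of any particular tool.

```python
import re
from collections import Counter

# A small, assumed stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "that", "on"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, n_sentences=1):
    """Select the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


TEXT = (
    "NLP helps computers process human language. "
    "Search engines use NLP to rank documents. "
    "My neighbor owns a very old bicycle."
)
summary = summarize(TEXT, n_sentences=1)
print(summary)
```

Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive system would instead generate new wording, which requires a language generation model.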


### Question Answering (QA)

Generally, question-answering tasks come in two flavors:

- **Multiple choice**: the learning task is to pick the correct answer.
- **Open domain**: the model provides answers to questions in natural language without any options provided, often by querying a large number of texts.

### Speech Recognition

Converting spoken language to text:
- **Voice assistants** like Siri, Alexa, and Google Assistant
- **Transcription services** like Otter.ai and Rev
- **Voice-controlled devices** like smart speakers and smart TVs
- **Voice search** in search engines like Google and Bing
- **Voice typing** in word processors like Google Docs and Microsoft Word
- **Voice commands** in video games and virtual reality applications
- **Voice biometrics** for security and authentication
- **Closed captioning** for the hearing impaired
- **Language translation** in real-time
- **Transcription of phone calls** for customer service and sales
- **Transcription of meetings** for note-taking and documentation
- **Transcription of lectures** for students and researchers
- **Transcription of podcasts** for SEO and accessibility
- **Transcription of interviews** for journalists and researchers

## Challenges of NLP

Language is complex and nuanced. NLP faces several challenges:

#### Ambiguity

**Lexical ambiguity** is a subtype of semantic ambiguity where a word or morpheme is ambiguous.
- "The fisherman went to the bank"
  - "bank" could either mean "river bank" or the "bank building"
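- Arabic example, the word "عين" (ʿayn):
  - "شربت من العين" ("I drank from the spring"): a water spring.
  - "أبصرت بـ العين" ("I saw with the eye"): the organ of sight.
  - "أرسل القائد عيناً لاستطلاع العدو" ("The commander sent a spy to scout the enemy"): a spy.
  - "اشتريت هذا بعينه" ("I bought this very one"): the thing itself.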
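
One classic way to attack lexical ambiguity is to compare the words around the ambiguous term with a short definition (gloss) of each candidate sense, in the spirit of the simplified Lesk algorithm. The sketch below assumes a hand-written, hypothetical sense inventory for "bank"; real systems would draw glosses from a lexical resource and handle word forms more carefully.

```python
import re

# Hypothetical mini sense inventory; a real system would use a lexical
# resource instead of hand-written glosses.
SENSES = {
    "bank": {
        "river bank": "sloping land beside a river or lake where people fish",
        "financial bank": "institution that keeps money and offers loans and accounts",
    }
}

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "or", "that", "in", "on", "with", "went"}


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed, as a set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence.

    Returns (sense, overlaps); sense is None when the best senses are tied,
    meaning the context gives no evidence either way.
    """
    context = content_words(sentence) - {word}
    overlaps = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    ranked = sorted(overlaps.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return None, overlaps  # the context does not resolve the ambiguity
    return max(overlaps, key=overlaps.get), overlaps


print(disambiguate("bank", "The fisherman went to the bank"))
print(disambiguate("bank", "The fisherman went to the bank to fish in the river"))
print(disambiguate("bank", "She went to the bank to open an account and deposit money"))
```

The first sentence stays unresolved because nothing in it overlaps with either gloss, which is exactly the ambiguity described above. Only additional context ("fish", "river", "money") tips the decision.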

**Homonymy** happens when words sound alike or are spelled alike but have unrelated meanings.
- "this is a row of raw materials"
- "I see the sea"
- "Without my glasses I can't see the glass"
- Arabic example, the word "ذهب":
  - "خاتم من ذَهَب" ("a ring of gold"): the precious metal.
  - "ذَهَبَ محمد إلى المدرسة" ("Mohammed went to school"): a past-tense verb meaning "went" or "left".

**Semantic ambiguity** results from expressions having multiple meanings.
- "It’s on the house." -- could mean something is free, or an object could literally be found on the house.
- "that went over my head" -- could express confusion or that something literally passed over the speaker's head
- "not your business" -- could mean the matter does not concern you, or that a company is not yours
- Arabic example: "طارت الطيور بأرزاقها" ("The birds flew off with their provisions")
  - Literal meaning: the birds took the food and flew away.
  - Intended meaning: it is too late and the opportunity is gone.

**Syntactic Ambiguity** happens when the sentence has multiple parse trees.

- "I saw the man on the beach with my binoculars." -- could mean that I saw a man through my binoculars or the man had my binoculars with him.

Reference: https://cs.nyu.edu/~davise/ai/ambiguity.html

#### Irony and sarcasm

Any time you say the opposite of the truth intentionally for effect, that’s verbal irony.

Sarcasm:

- "I love it when my computer crashes."
- "Oh, fantastic! I just love waiting in line for hours."
- When someone makes a serious mistake and you tell them: "ما شاء الله، عبقري!" ("Mashallah, what a genius!")

Irony:

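- Verbal: "جيتك يا عبد المعين تعين لقيتك يا عبد المعين تنعان" (Arabic proverb: "I came to you for help, Abd al-Mu'in, and found that you need help yourself")
- Situational:
  - A doctor who smokes cigarettes.
  - "باب النجار مخلّع" (Arabic proverb: "The carpenter's door is falling apart")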

Irony and sarcasm present problems for machine learning models because they generally use words and phrases that, strictly by definition, may be positive or negative, but actually connote the opposite.
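
The failure mode is easy to reproduce with a word-level sentiment scorer. The sketch below uses a tiny, hypothetical polarity lexicon (the words and scores are assumptions for illustration) and simply sums the polarity of known words, which is roughly how the simplest lexicon-based approaches work.

```python
import re

# Tiny hypothetical sentiment lexicon: word -> polarity score.
LEXICON = {
    "love": 2, "fantastic": 2, "great": 1, "good": 1,
    "hate": -2, "terrible": -2, "awful": -2, "bad": -1,
}


def lexicon_score(text):
    """Sum the polarity of every known word in the text."""
    return sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z]+", text.lower()))


def predict(text):
    """Map the summed score to a sentiment label by its sign."""
    score = lexicon_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


for sentence in [
    "I love this phone.",
    "I love it when my computer crashes.",
    "Oh, fantastic! I just love waiting in line for hours.",
]:
    print(f"{predict(sentence):8s} <- {sentence}")
```

All three sentences are labeled positive, although a human reader immediately hears the last two as complaints. The scorer sees "love" and "fantastic" but has no way to notice that the surrounding situation contradicts them.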

#### Dialects

Dialects are variations of a language spoken in different regions or by different social groups. For example, American English and British English have different spellings, pronunciations, and vocabulary.

- Pronunciation: "water" vs. "wadder"
- "Ain't no thang" vs. "It's no problem": the first phrase is common in some dialects of American English but may be unfamiliar to other speakers.
- "Y'all" vs. "you guys": This is a regional difference in addressing a group of people, with "y'all" being more common in the Southern United States.

Vocabulary differences in Arabic (example: "How are you?"):
- Modern Standard Arabic: كيف حالك؟
- Gulf: شلونك؟ / شعلومك؟
- Egyptian: إزيك؟ / عامل إيه؟
- Levantine: كيفك؟
- Moroccan: كيداير؟ / لباس؟

Negation markers in Arabic (example: "I did not eat"):
- Egyptian/Levantine: ما اكلتش / ما اكلت.
- Gulf: ما كليت / مو ماكل.
- Moroccan: ما كليتش.

#### Slang and jargon

Slang is informal language used by particular social groups, while jargon is specialized vocabulary used within a profession or field. Both can be difficult for machines to understand because they evolve constantly and can have multiple meanings. Examples of slang:
- "That movie was lit!" (meaning: the movie was excellent)
- "I'm feeling blue." (meaning: I'm feeling sad)
- "Google it!"

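- "كبّر دماغك" (literally "enlarge your brain"): means "let it go" or "ignore it", not a medical suggestion to enlarge one's head.
- "سحَب عليه" (Saudi/Gulf, literally "pulled on him"): means he ignored him or did not show up for the appointment.
- "قصف جبهته" (literally "bombed his forehead"): means he embarrassed him with a strong, silencing comeback.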

Jargon:
- "ROI" (return on investment)
- "KPI" (key performance indicator)
- "ML" (machine learning)

## Python NLP Ecosystem (Libraries & Frameworks)

- **Natural Language Toolkit (NLTK)**: one of the first NLP libraries written in Python.
- **spaCy**: used for building production-ready systems for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking, and so on.
- **Gensim**: an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning.
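- **PyTorch**: an open-source deep learning framework for building and training neural networks.
- **HuggingFace**: an ecosystem of libraries (such as Transformers) and a model hub for sharing and using pretrained models.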

## Approaches to NLP

There are two main approaches to NLP:

1. **Statistical**
2. **Neural** (Deep Learning)

## The Pipeline

This notebook covers the foundational steps in Natural Language Processing (NLP) that transform raw text into a format suitable for machine learning models. Understanding these concepts is crucial for building effective NLP systems.

1. **Data Ingestion**: Gathering and organizing text documents
2. **Preprocessing**: Cleaning, normalizing, and transforming raw text data
3. **Tokenization**: Determining the processable units of text (tokens)
4. **Vectorization**: Converting text to numerical vectors (e.g., Bag-of-Words, TF-IDF)
5. **Modeling** (examples):
   - **Retrieval & Analysis**:
     -  Search
     -  Topic Discovery
   - **Predictive**:
     - Text Classification
     - Next-word prediction
6. **Evaluation**
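
The six stages above can be sketched end to end in plain Python. Everything in this example is an assumption chosen for brevity: the toy documents and labels, the stop-word list, and the very simple "class profile" model that sums Bag-of-Words counts per class. Later sessions replace each stage with more capable tools.

```python
import re
from collections import Counter

# 1. Data ingestion: hypothetical labeled documents (train) and held-out ones (test).
TRAIN = [
    ("The team won the football match", "sports"),
    ("The striker scored a late goal", "sports"),
    ("The bank raised interest rates", "finance"),
    ("Investors sold shares as markets fell", "finance"),
]
TEST = [
    ("The goalkeeper saved the match", "sports"),
    ("Interest rates worry investors", "finance"),
]

STOP_WORDS = {"the", "a", "as", "and", "of"}


def preprocess(text):
    """2. Preprocessing: lowercase and replace non-letter characters with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """3. Tokenization: split on whitespace and drop stop words."""
    return [t for t in text.split() if t not in STOP_WORDS]


def vectorize(tokens):
    """4. Vectorization: Bag-of-Words as a sparse term -> count mapping."""
    return Counter(tokens)


def to_vector(text):
    """Run stages 2-4 on one raw document."""
    return vectorize(tokenize(preprocess(text)))


def train(examples):
    """5. Modeling: sum the BoW vectors of each class into one class profile."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(to_vector(text))
    return profiles


def predict(text, profiles):
    """Assign the class whose profile has the largest dot product with the text."""
    vec = to_vector(text)
    scores = {label: sum(vec[t] * p[t] for t in vec) for label, p in profiles.items()}
    return max(scores, key=scores.get)


def evaluate(examples, profiles):
    """6. Evaluation: accuracy = correct predictions / total predictions."""
    correct = sum(predict(text, profiles) == label for text, label in examples)
    return correct / len(examples)


profiles = train(TRAIN)
accuracy = evaluate(TEST, profiles)
print(f"accuracy = {accuracy:.2f}")
```

Keeping each stage in its own function makes it easy to swap one piece at a time, for example replacing `vectorize` with TF-IDF weighting or `train` with a stronger classifier, while the rest of the pipeline stays unchanged.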

## Key Takeaways

- **NLP** bridges the gap between computers and human language by combining computer science, linguistics, and artificial intelligence.

- NLP tasks can be categorized into two main areas:
  - **Natural Language Understanding (NLU)**: Extracting meaning, intention, emotion, and relationships from text
  - **Natural Language Generation (NLG)**: Generating human-like text and speech

- **Key NLP applications** include sentiment analysis, machine translation, named entity recognition, information retrieval, text summarization, question answering, and speech recognition.

- NLP faces significant **challenges** including:
  - **Ambiguity**: Lexical, semantic, and syntactic ambiguities that make language interpretation difficult
  - **Irony and sarcasm**: Expressions that mean the opposite of their literal meaning
  - **Dialects**: Regional and social variations in language
  - **Slang and jargon**: Informal and context-specific language that evolves rapidly

- There are two main **approaches to NLP**:
  - **Statistical methods**: Traditional techniques based on word counts and probabilities (e.g., Bag-of-Words, TF-IDF)
  - **Neural/Deep Learning methods**: Modern approaches using neural networks and transformers

- The **NLP pipeline** typically involves: Data Ingestion → Preprocessing → Tokenization → Vectorization → Modeling → Evaluation

- The Python NLP ecosystem includes powerful libraries like **NLTK**, **spaCy**, **Gensim**, **PyTorch**, and **HuggingFace** that enable building production-ready NLP systems.