# 1. File & Directory Navigation

[BBC Full Text Document Classification](https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification)

## 1-1. pathlib Module
The `pathlib` module offers classes representing filesystem paths with semantics appropriate for different operating systems. Path classes are divided between pure paths (`PurePosixPath` & `PureWindowsPath`), which provide purely computational operations without I/O, and concrete paths (`PosixPath` & `WindowsPath`), which inherit from pure paths but also provide I/O operations. Note that methods that list directory contents, such as `iterdir()`, never include the special entries `.` and `..` in their results.
1. `pathlib.Path`
   - `stat(*, follow_symlinks=True)`
   - `lstat()`: Returns the symbolic link's information rather than its target's if the path points to a symbolic link.
   - `exists()`
   - `is_file()`
   - `is_dir()`
   - `is_symlink()`
2. `pathlib.Path.iterdir()`: Yields `Path` (`PosixPath` or `WindowsPath`) objects for the directory contents.
3. `pathlib.Path.glob(pattern, *, case_sensitive=None)`: Globs the given relative `pattern` in the directory represented by this path, yielding all matching files.
   - Adding `**` to the pattern means "this directory and all subdirectories, recursively". In other words, it enables recursive globbing.
4. `pathlib.Path.rglob(pattern, *, case_sensitive=None)`: Globs the given relative pattern recursively, like calling `pathlib.Path.glob()` with `**/` added in front of the pattern.
5. `pathlib.Path.walk(top_down=True, on_error=None, follow_symlinks=False)`: Generates the filenames in a directory tree by walking the tree either top-down or bottom-up. Yields a 3-tuple of `(dirpath, dirnames, filenames)`. Added in Python 3.12.
6. `pathlib.Path.resolve(strict=False)`: Makes the path absolute, resolving any symbolic links. Returns a new `Path` object.

- [Correspondence to Tools in the os Module](https://docs.python.org/3/library/pathlib.html#correspondence-to-tools-in-the-os-module)
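
The query methods listed under `pathlib.Path` are not exercised in the cells below, so here is a minimal sketch of them. It runs against a temporary directory rather than the dataset; the file names `sample.txt` and `missing.txt` are made up for illustration.

```python
import pathlib
import tempfile

# Build a throwaway directory so the example does not depend on the dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    sample = root / "sample.txt"  # hypothetical file name
    sample.write_text("hello", encoding="utf-8")

    checks = {
        "root_is_dir": root.is_dir(),
        "sample_exists": sample.exists(),
        "sample_is_file": sample.is_file(),
        "sample_is_symlink": sample.is_symlink(),
        "missing_exists": (root / "missing.txt").exists(),
    }
    # `stat()` returns an `os.stat_result`; `st_size` is the size in bytes.
    size_in_bytes = sample.stat().st_size

print(checks)
print(size_in_bytes)
```

`exists()`, `is_file()`, and `is_dir()` return `False` for a missing path instead of raising, which makes them convenient guards before opening files.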

In [84]:
import pathlib

dataset_path = "./datasets/archive/bbc-fulltext (document classification)/bbc"

In [122]:
# `iterdir()`
p = pathlib.Path(dataset_path)
print(type(p))
print(type(p.iterdir()))
print()

for child in p.iterdir():
    print(child)

<class 'pathlib.PosixPath'>
<class 'generator'>

datasets/archive/bbc-fulltext (document classification)/bbc/business
datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
datasets/archive/bbc-fulltext (document classification)/bbc/politics
datasets/archive/bbc-fulltext (document classification)/bbc/sport
datasets/archive/bbc-fulltext (document classification)/bbc/tech
datasets/archive/bbc-fulltext (document classification)/bbc/entertainment


In [114]:
# `glob()` matches entries against a pattern
print(type(p.glob('*')))
print()

for child in p.glob('*'):
    print(child)

<class 'generator'>

datasets/archive/bbc-fulltext (document classification)/bbc/business
datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
datasets/archive/bbc-fulltext (document classification)/bbc/politics
datasets/archive/bbc-fulltext (document classification)/bbc/sport
datasets/archive/bbc-fulltext (document classification)/bbc/tech
datasets/archive/bbc-fulltext (document classification)/bbc/entertainment


In [115]:
# Enable recursive globbing with the addition of `**`
import itertools

for child in list(itertools.islice(p.glob('**/*'), 10)):
    print(child)

datasets/archive/bbc-fulltext (document classification)/bbc/business
datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
datasets/archive/bbc-fulltext (document classification)/bbc/politics
datasets/archive/bbc-fulltext (document classification)/bbc/sport
datasets/archive/bbc-fulltext (document classification)/bbc/tech
datasets/archive/bbc-fulltext (document classification)/bbc/entertainment
datasets/archive/bbc-fulltext (document classification)/bbc/business/503.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/412.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/027.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/324.txt


In [117]:
# `rglob()` globs the given relative pattern recursively
for child in list(itertools.islice(p.rglob('*'), 10)):
    print(child)

datasets/archive/bbc-fulltext (document classification)/bbc/business
datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
datasets/archive/bbc-fulltext (document classification)/bbc/politics
datasets/archive/bbc-fulltext (document classification)/bbc/sport
datasets/archive/bbc-fulltext (document classification)/bbc/tech
datasets/archive/bbc-fulltext (document classification)/bbc/entertainment
datasets/archive/bbc-fulltext (document classification)/bbc/business/503.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/412.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/027.txt
datasets/archive/bbc-fulltext (document classification)/bbc/business/324.txt


In [124]:
def walk_through_dir(dir_path):
    # Walk through `dir_path`, reporting the contents of each directory
    for dirpath, dirnames, filenames in pathlib.Path(dir_path).walk():
        print(f"There are {len(dirnames)} directories and {len(filenames)} files in '{dirpath}'")

print(type(p.walk(on_error=print)))
print()

walk_through_dir(dataset_path)

<class 'generator'>

There are 5 directories and 1 files in 'datasets/archive/bbc-fulltext (document classification)/bbc'
There are 0 directories and 510 files in 'datasets/archive/bbc-fulltext (document classification)/bbc/business'
There are 0 directories and 417 files in 'datasets/archive/bbc-fulltext (document classification)/bbc/politics'
There are 0 directories and 511 files in 'datasets/archive/bbc-fulltext (document classification)/bbc/sport'
There are 0 directories and 401 files in 'datasets/archive/bbc-fulltext (document classification)/bbc/tech'
There are 0 directories and 386 files in 'datasets/archive/bbc-fulltext (document classification)/bbc/entertainment'


In [123]:
# Navigate inside a directory by `/`
q = p / 'business'
print(q)

datasets/archive/bbc-fulltext (document classification)/bbc/business


In [126]:
# `resolve()` returns the absolute path
q.resolve()

PosixPath('/home/yungshun317/workspace/py/torch-nlp/datasets/archive/bbc-fulltext (document classification)/bbc/business')

## 1-2. os Module
1. `os.listdir(path='.')`: Returns a list of bare filenames.
2. `os.scandir(path='.')`: Yields `DirEntry` objects that include file type & stat information along with the name. Since Python 3.5, `os.walk()` is built on `scandir()`, which makes it 2-20 times faster by avoiding unnecessary calls to `os.stat()` in most cases.
    - `name`: Returns the entry's filename, relative to the `path` argument.
    - `path`: Returns the entry's full path name.
    - `is_dir(*, follow_symlinks=True)` 
    - `is_file(*, follow_symlinks=True)`
    - `is_symlink()`
    - `stat(*, follow_symlinks=True)`
    - `inode()`: Returns the inode number of the entry.
3. `os.walk(top, topdown=True, onerror=None, followlinks=False)`: Walks the directory tree rooted at `top`, either top-down or bottom-up. Before Python 3.5 it called `os.listdir()` on each directory and then `os.stat()` on each entry to determine whether it is a directory; it now uses `os.scandir()` instead. Yields a 3-tuple `(dirpath, dirnames, filenames)`.

The `os.path` module implements common pathname manipulations:

1. `os.path.abspath(path)`: Returns a normalized absolutized version of the pathname `path`.
2. `os.path.exists(path)`
3. `os.path.join(path, *paths)`: Joins one or more path segments intelligently.
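
To make "intelligently" a little more concrete, the sketch below shows a few joining and normalization edge cases. It uses `posixpath`, the POSIX implementation behind `os.path`, only so that the results are the same on every operating system; the directory names are placeholders.

```python
import posixpath  # POSIX flavour of `os.path`, chosen so results do not depend on the OS

# Segments are joined with exactly one separator between them.
joined = posixpath.join("datasets", "bbc", "business")

# An absolute segment discards everything that came before it.
restarted = posixpath.join("datasets", "/tmp", "bbc")

# An empty last segment leaves a trailing separator.
trailing = posixpath.join("datasets", "")

# `normpath()` collapses redundant separators and up-level references;
# `abspath()` applies the same normalization after making the path absolute.
normalized = posixpath.normpath("./datasets//bbc/../bbc")

print(joined, restarted, trailing, normalized)
```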

In [7]:
import os

dataset_path = "./datasets/archive/bbc-fulltext (document classification)/bbc"

In [94]:
# `listdir()` just returns a list of strings: ['business', 'README.TXT', 'politics', 'sport', 'tech', 'entertainment']
print(type(os.listdir(dataset_path)))
print()

for entry in os.listdir(dataset_path):
    print(entry)

<class 'list'>

business
README.TXT
politics
sport
tech
entertainment


In [92]:
# `scandir()` returns directory entries with file attribute information
print(type(os.scandir(dataset_path)))
print()

# Materialize the iterator once so the number of entries is known up front
entries = list(os.scandir(dataset_path))

for i, entry in enumerate(entries):
    print(f"Entry Name: {entry.name}")
    print(f"Path: {entry.path}")
    print(f"Is Directory: {entry.is_dir()}")
    print(f"Is File: {entry.is_file()}")
    print(f"Is Symlink: {entry.is_symlink()}")
    print(f"Stat: {entry.stat()}")
    print(f"inode: {entry.inode()}")
    # Print an empty line unless this is the last entry
    if i != len(entries) - 1:
        print()

<class 'posix.ScandirIterator'>

Entry Name: business
Path: ./datasets/archive/bbc-fulltext (document classification)/bbc/business
Is Directory: True
Is File: False
Is Symlink: False
Stat: os.stat_result(st_mode=16877, st_ino=39201886, st_dev=66307, st_nlink=2, st_uid=1000, st_gid=1000, st_size=16384, st_atime=1719779416, st_mtime=1712750395, st_ctime=1719776837)
inode: 39201886

Entry Name: README.TXT
Path: ./datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
Is Directory: False
Is File: True
Is Symlink: False
Stat: os.stat_result(st_mode=33261, st_ino=39201885, st_dev=66307, st_nlink=1, st_uid=1000, st_gid=1000, st_size=598, st_atime=1713748603, st_mtime=1571338472, st_ctime=1719776828)
inode: 39201885

Entry Name: politics
Path: ./datasets/archive/bbc-fulltext (document classification)/bbc/politics
Is Directory: True
Is File: False
Is Symlink: False
Stat: os.stat_result(st_mode=16877, st_ino=39202784, st_dev=66307, st_nlink=2, st_uid=1000, st_gid=1000, st_size=12

In [125]:
def walk_through_dir(dir_path):
    # Walk through `dir_path` returning its contents
    for dirpath, dirnames, filenames in os.walk(dir_path):
        print(f"There are {len(dirnames)} directories and {len(filenames)} files in '{dirpath}'")

print(type(os.walk(dataset_path)))
print()

walk_through_dir(dataset_path)

<class 'generator'>

There are 5 directories and 1 files in './datasets/archive/bbc-fulltext (document classification)/bbc'
There are 0 directories and 510 files in './datasets/archive/bbc-fulltext (document classification)/bbc/business'
There are 0 directories and 417 files in './datasets/archive/bbc-fulltext (document classification)/bbc/politics'
There are 0 directories and 511 files in './datasets/archive/bbc-fulltext (document classification)/bbc/sport'
There are 0 directories and 401 files in './datasets/archive/bbc-fulltext (document classification)/bbc/tech'
There are 0 directories and 386 files in './datasets/archive/bbc-fulltext (document classification)/bbc/entertainment'


In [127]:
# `abspath()` returns the absolute path
os.path.abspath(dataset_path)

'/home/yungshun317/workspace/py/torch-nlp/datasets/archive/bbc-fulltext (document classification)/bbc'

In [133]:
# `join()`
os.path.join(dataset_path, 'business')

'./datasets/archive/bbc-fulltext (document classification)/bbc/business'

## 1-3. glob Module
**Globbing** is the process of finding filenames that match a wildcard pattern. Glob patterns are deliberately simpler than **regular expressions (regex)**, which are a more complex, general-purpose tool for pattern matching within strings. The term **glob** is short for "global": globbing was originally part of the "global command" functionality used for expanding shell patterns.
1. `glob.glob(pathname, *, root_dir=None, dir_fd=None, recursive=False, include_hidden=False)`
    - `*`: Matches any sequence of characters.
    - `?`: Matches a single character.
    - `[abc]`: Matches any one of the characters a, b, or c.
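
The three wildcards can be tried without touching the filesystem by using the standard library's `fnmatch` module, which `glob` uses for filename matching. This is a small sketch on a made-up list of names; note that `fnmatch` treats `/` like any other character, whereas `glob` matches one path component at a time.

```python
import fnmatch
import re

names = ["001.txt", "002.txt", "010.txt", "README.TXT", "notes.md"]  # made-up names


def match(pattern):
    """Return the names matching `pattern`, compared case-sensitively on every OS."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


star = match("*.txt")          # `*` matches any sequence of characters
question = match("00?.txt")    # `?` matches exactly one character
brackets = match("00[13].txt") # `[13]` matches one character out of the set

# A glob pattern can be translated into an equivalent regular expression.
regex = re.compile(fnmatch.translate("00?.txt"))
regex_hits = [name for name in names if regex.match(name)]

print(star, question, brackets, regex_hits)
```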

In [4]:
import glob

dataset_path = "./datasets/archive/bbc-fulltext (document classification)/bbc"

In [141]:
print(type(glob.glob(dataset_path)))
print(type(glob.glob(dataset_path)[0]))
print()

for pathname in glob.glob(os.path.join(dataset_path, '*')):
    print(pathname)

<class 'list'>
<class 'str'>

./datasets/archive/bbc-fulltext (document classification)/bbc/business
./datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
./datasets/archive/bbc-fulltext (document classification)/bbc/politics
./datasets/archive/bbc-fulltext (document classification)/bbc/sport
./datasets/archive/bbc-fulltext (document classification)/bbc/tech
./datasets/archive/bbc-fulltext (document classification)/bbc/entertainment


In [143]:
for pathname in glob.glob(os.path.join(dataset_path, '*.TXT')):
    print(pathname)

./datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT


In [155]:
# Set `recursive=True`
for pathname in list(itertools.islice(glob.glob(os.path.join(dataset_path, '**', '*'), recursive=True), 10)):
    print(pathname)

./datasets/archive/bbc-fulltext (document classification)/bbc/business
./datasets/archive/bbc-fulltext (document classification)/bbc/README.TXT
./datasets/archive/bbc-fulltext (document classification)/bbc/politics
./datasets/archive/bbc-fulltext (document classification)/bbc/sport
./datasets/archive/bbc-fulltext (document classification)/bbc/tech
./datasets/archive/bbc-fulltext (document classification)/bbc/entertainment
./datasets/archive/bbc-fulltext (document classification)/bbc/business/503.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/412.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/027.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/324.txt


In [157]:
# `?` matches a single character
for pathname in glob.glob(os.path.join(dataset_path, 'business', '00?.txt')):
    print(pathname)

./datasets/archive/bbc-fulltext (document classification)/bbc/business/005.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/001.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/007.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/009.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/003.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/008.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/004.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/006.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/002.txt


In [158]:
# Square brackets allow for specific character matches
for pathname in glob.glob(os.path.join(dataset_path, 'business', '00[123].txt')):
    print(pathname)

./datasets/archive/bbc-fulltext (document classification)/bbc/business/001.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/003.txt
./datasets/archive/bbc-fulltext (document classification)/bbc/business/002.txt


Retrieve the label names.

In [8]:
# Set up the path for the target directory
target_directory = dataset_path
print(f"target directory: {target_directory}")

# Get the class names from the target directory
class_names_found = sorted([entry.name for entry in list(os.scandir(target_directory)) if entry.is_dir()])
class_names_found

target directory: ./datasets/archive/bbc-fulltext (document classification)/bbc


['business', 'entertainment', 'politics', 'sport', 'tech']
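
The same label lookup can be written with `pathlib` instead of `os.scandir()`. The sketch below builds a small temporary directory that mimics the dataset layout (the folder names and the stray `README.TXT` are stand-ins), so it runs without the dataset.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    # Hypothetical layout mirroring the dataset: class folders plus one stray file.
    for name in ["tech", "business", "sport"]:
        (root / name).mkdir()
    (root / "README.TXT").write_text("not a class", encoding="utf-8")

    # Keep only directories; their names serve as the class labels.
    class_names = sorted(child.name for child in root.iterdir() if child.is_dir())

print(class_names)
```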

# 2. DataFrame
## 2-1. Read & Write to Text Files
1. `open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)`
   - `mode`
       - `r`: Reads.
       - `r+`: Reads & writes.
       - `w`: Writes.
       - `w+`: Writes & reads.
       - `a`: Appends.
       - `a+`: Appends & reads.
   - `newline`: Determines how to parse newline characters from the stream.
       - `None`: When reading input from the stream, universal newlines mode is enabled. Lines in the input can end in `\n`, `\r`, `\r\n`, and these are translated into `\n`. When writing output to the stream, any `\n` characters written are translated to the system default line separator, `os.linesep`.
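       - `''`: Universal newlines mode is enabled, but line endings are returned to the caller untranslated.
       - `'\n'`, `'\r'`, `'\r\n'`: Input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.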
   - [Standard Encodings](https://docs.python.org/3/library/codecs.html#standard-encodings)
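
The effect of `newline` is easiest to see on a file with mixed line endings. The following sketch writes such a file in binary mode (so nothing is translated on the way in) and reads it back three ways; the file name `mixed.txt` is arbitrary.

```python
import os
import tempfile

raw = b"one\r\ntwo\rthree\n"  # mixed line endings, written as bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Default: every line ending is translated to "\n".
    with open(path, "r", encoding="utf-8", newline=None) as f:
        translated = f.read()

    # Empty string: line endings are recognized but returned untranslated.
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

    # Explicit "\n": only "\n" terminates a line, so the lone "\r" stays inside one.
    with open(path, "r", encoding="utf-8", newline="\n") as f:
        only_lf = f.readlines()

print(repr(translated))
print(repr(untranslated))
print(only_lf)
```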


The `io` module provides Python's main facilities for dealing with various types of I/O.
1. `write(s)`: Writes the string `s` to the stream and returns the number of characters written.
2. `writelines(lines)`: Writes a list of lines to the stream. Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
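3. `read(size=-1)`: Reads and returns at most `size` characters from a text stream (bytes from a binary stream). If `size` is negative or `None`, reads until EOF.
4. `readline(size=-1)`: Reads and returns one line from the stream. If `size` is specified, at most `size` characters (or bytes, for a binary stream) will be read.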
5. `readlines(hint=-1)`: Reads and returns a list of lines from the stream. No more lines will be read if the total size of all lines so far exceeds `hint`.
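
These five methods can be tried on an in-memory text stream. The sketch below uses `io.StringIO` in place of a real file, which is an assumption made purely to keep the example self-contained; file objects returned by `open()` in text mode behave the same way.

```python
import io

stream = io.StringIO()

# `write()` returns the number of characters written.
written = stream.write("first line\n")
# `writelines()` adds no separators, so each item carries its own "\n".
stream.writelines(["second line\n", "third line\n"])

stream.seek(0)             # rewind before reading
first = stream.readline()  # one line, including its line ending
chunk = stream.read(6)     # at most 6 characters
rest = stream.readlines()  # everything that is left, as a list of lines

print(written, repr(first), repr(chunk), rest)
```

Reads continue from the current position, which is why `readlines()` starts in the middle of the second line here.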

In [1]:
import pandas as pd

In [243]:
# Construct a list of lists (lists of lines returned from `readlines()`)
dataset = []

doc_lst = glob.glob("./datasets/archive/bbc-fulltext (document classification)/bbc/business/*.txt")

for filename in doc_lst:
    # print(filename)
    with open(filename, "r") as f:
        lines = f.readlines()
        dataset.append(lines)

print(dataset[0])

["India's Maruti sees profits jump\n", '\n', "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.\n", '\n', "Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.\n", '\n', 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.\n', '\n', "Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans an

In [244]:
# Construct a list of dictionaries
dataset = []

doc_lst = glob.glob("./datasets/archive/bbc-fulltext (document classification)/bbc/business/*.txt")

for filename in doc_lst:
    with open(filename, "r") as f:
        lines = f.readlines()
        dataset.append({"text": lines, "label": "business"})

print(dataset[0])

{'text': ["India's Maruti sees profits jump\n", '\n', "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.\n", '\n', "Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.\n", '\n', 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.\n', '\n', "Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, includin

In [223]:
import pandas as pd

# Convert the list of dictionaries to pandas DataFrame
df = pd.DataFrame(dataset)
df

|     | text | label |
| --- | --- | --- |
| 0 | [India's Maruti sees profits jump\n, \n, India... | business |
| 1 | [Iran budget seeks state sell-offs\n, \n, Iran... | business |
| 2 | [Steel firm 'to cut' 45,000 jobs\n, \n, Mittal... | business |
| 3 | [Yukos seeks court action on sale\n, \n, Yukos... | business |
| 4 | [US crude prices surge above $53\n, \n, US cru... | business |
| ... | ... | ... |
| 505 | [Rescue hope for Borussia Dortmund\n, \n, Shar... | business |
| 506 | [US prepares for hybrid onslaught\n, \n, Sales... | business |
| 507 | [Dutch bank to lay off 2,850 staff\n, \n, ABN ... | business |
| 508 | [Rank 'set to sell off film unit'\n, \n, Leisu... | business |


In [224]:
# Construct a pandas DataFrame using `readlines()`
dataset = []

for class_name in class_names_found:
    doc_lst = glob.glob(f"./datasets/archive/bbc-fulltext (document classification)/bbc/{class_name}/*.txt", recursive=True)
    for filename in doc_lst:
        # Prevent `UnicodeDecodeError`
        with open(filename, 'r', errors='replace') as f:
            lines = f.readlines()
            dataset.append({"text": lines, "label": class_name})

df = pd.DataFrame(dataset)
df

|     | text | label |
| --- | --- | --- |
| 0 | [India's Maruti sees profits jump\n, \n, India... | business |
| 1 | [Iran budget seeks state sell-offs\n, \n, Iran... | business |
| 2 | [Steel firm 'to cut' 45,000 jobs\n, \n, Mittal... | business |
| 3 | [Yukos seeks court action on sale\n, \n, Yukos... | business |
| 4 | [US crude prices surge above $53\n, \n, US cru... | business |
| ... | ... | ... |
| 2220 | [New delay hits EU software laws\n, \n, A fres... | tech |
| 2221 | [Fast moving phone viruses appear\n, \n, Secur... | tech |
| 2222 | [Disney backs Sony DVD technology\n, \n, A nex... | tech |
| 2223 | [Apple unveils low-cost 'Mac mini'\n, \n, Appl... | tech |


In [225]:
df.dtypes

text     object
label    object
dtype: object
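
The `errors='replace'` argument used with `open()` above trades a `UnicodeDecodeError` for replacement characters. The sketch below reproduces the situation on a tiny hand-made file; it assumes, purely for illustration, a file saved as Latin-1 and read as UTF-8, which may or may not be what happens with the actual dataset files.

```python
import os
import tempfile

# "£" encoded as Latin-1 is the single byte 0xA3, which is not valid UTF-8 on its own.
raw = "Price: £100\n".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Strict decoding (the default) raises on the invalid byte.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # `errors="replace"` substitutes U+FFFD for each undecodable byte.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        replaced = f.read()

    # Decoding with the encoding the file was actually written in loses nothing.
    with open(path, "r", encoding="latin-1") as f:
        decoded = f.read()

print(strict_failed, repr(replaced), repr(decoded))
```

If the true encoding of the files is known, passing it via `encoding=` preserves characters that `errors='replace'` would otherwise turn into `\ufffd`.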

Construct the desired DataFrame.

In [9]:
# Construct a pandas DataFrame using `read()`
dataset = []

for class_name in class_names_found:
    doc_lst = glob.glob(f"./datasets/archive/bbc-fulltext (document classification)/bbc/{class_name}/*.txt", recursive=True)
    for filename in doc_lst:
        with open(filename, 'r', errors='replace') as f:
            # Use `read()` instead of `readlines()`
            doc = f.read()
            dataset.append({"text": doc, "label": class_name})

df = pd.DataFrame(dataset)
df

|     | text | label |
| --- | --- | --- |
| 0 | India's Maruti sees profits jump\n\nIndia's bi... | business |
| 1 | Iran budget seeks state sell-offs\n\nIran's pr... | business |
| 2 | Steel firm 'to cut' 45,000 jobs\n\nMittal Stee... | business |
| 3 | Yukos seeks court action on sale\n\nYukos will... | business |
| 4 | US crude prices surge above $53\n\nUS crude pr... | business |
| ... | ... | ... |
| 2220 | New delay hits EU software laws\n\nA fresh del... | tech |
| 2221 | Fast moving phone viruses appear\n\nSecurity f... | tech |
| 2222 | Disney backs Sony DVD technology\n\nA next gen... | tech |
| 2223 | Apple unveils low-cost 'Mac mini'\n\nApple has... | tech |


# 3. Lower Casing
Lowercasing the text often improves overall accuracy, because many language models otherwise treat the capitalized and lowercase forms of the same word (for example, `The` and `the`) as different tokens.
- This matters less for very large language models, which are typically trained on cased text.
- The decision is language dependent (but for English, consider lowercasing your corpus).
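
A quick way to see what lowercasing does to a vocabulary is to count distinct tokens before and after. The sentence below is invented for the purpose, and the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter

text = "Apple unveils the Mac mini. The apple logo is on the Mac."  # made-up sentence

# Naive tokenization: drop full stops, then split on whitespace.
cased = Counter(text.replace(".", "").split())
lowered = Counter(text.lower().replace(".", "").split())

print(len(cased), len(lowered))
print(cased["the"], cased["The"], lowered["the"])
```

The vocabulary shrinks because `The`/`the` and `Apple`/`apple` collapse into single entries. The second pair also illustrates the cost: after lowercasing, the company name and the fruit can no longer be told apart by casing.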

In [11]:
import itertools

for i, doc in list(itertools.islice(enumerate(df.text.str.lower()), 1)):
    print("Lowercase:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

Lowercase:

india's maruti sees profits jump

india's biggest carmaker maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.

net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. total sales were 30.1bn rupees, up 27% from the same 2004 period. maruti accounts for half of india's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.

demand in india also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.

figures show that only eight people per thousand are car owners. maruti beat market expectations despite an increase in raw materials costs. the company, majority-owned by japan's suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. sales in the fiscal third quarter, including vans and utility vehicles, rose by 17

# 4. Segmentation
## 4-1. Document Segmentation
The task goes by several names: **paragraph detection**, **paragraph identification**, **paragraph segmentation**, **section segmentation**, **text segmentation**, **topic segmentation**, or **document zoning**. It is a fundamental NLP problem in its own right. One of the best-known unsupervised algorithms is **TextTiling**, which is implemented in `nltk`.

1. `nltk.tokenize.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)`: `TextTilingTokenizer` tokenizes text into coherent subtopic chunks using Hearst's TextTiling algorithm, which detects subtopic shifts based upon the analysis of lexical co-occurrence patterns.
    - The implementation requires at least 2 paragraphs (i.e. your text needs to contain `\n\n`), or you will encounter `ValueError: No paragraph breaks were found(text too short perhaps?)`.
    - [Text Tiling: Segmenting Text into Multi-Paragraph Subtopic Passages](https://aclanthology.org/J97-1003.pdf)
    - [TextTilingTokenizer](https://tedboy.github.io/nlps/generated/generated/nltk.tokenize.TextTilingTokenizer.html#nltk-tokenize-texttilingtokenizer)
2. [Text Segmentation](https://github.com/hyunbool/Text-Segmentation): A list of papers published in 2020 & before.
3. More Recent Research:
    - [Text Segmentation by Cross Segment Attention](https://aclanthology.org/2020.emnlp-main.380/) by Google
    - [Topic Segmentation Model Focusing on Local Context](https://arxiv.org/abs/2301.01935)
    - [Auxiliary Loss for BERT-Based Paragraph Segmentation](https://search.ieice.org/bin/summary.php?id=e106-d_1_58)

Main Issues: 
- Papers often focus on the **WikiSection Dataset**, whose sections are too long to serve as paragraphs.
- Papers for the task often don't release their code.
- Supervised algorithms tend to be specialized to the domain of the training set.
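
To give a feel for what "detecting subtopic shifts from lexical co-occurrence" means, here is a deliberately simplified sketch. It is not Hearst's TextTiling algorithm and not what `nltk` implements (there are no token-sequence blocks, depth scores, or smoothing). It only compares the word counts of adjacent paragraphs and starts a new segment when their cosine similarity falls below a threshold. The function names, the `threshold` value, and the toy text are all made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(paragraph):
    """Lowercased word counts for one paragraph."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(text, threshold=0.1):
    """Group paragraphs, starting a new segment where adjacent similarity drops below `threshold`."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            segments.append([cur])      # vocabulary changed: open a new segment
        else:
            segments[-1].append(cur)    # vocabulary overlaps: same segment
    return ["\n\n".join(s) for s in segments]


toy = (
    "car sales rose as car makers cut prices\n\n"
    "cheaper car prices lifted sales again\n\n"
    "the football team won the cup final\n\n"
    "fans cheered the team after the final"
)
result = segment(toy)
print(len(result))
```

On this toy input the two car paragraphs share vocabulary, as do the two football paragraphs, while the middle pair shares none, so the text splits into two segments. Real text needs stopword removal and smoothing to behave this cleanly, which is what the full algorithm provides.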

In [382]:
# Handle exceptions
import re

from nltk.tokenize import TextTilingTokenizer

# Fall back to a single-element list holding the whole document when TextTiling
# cannot segment it (for example, when the document contains no `\n\n`)
def augment_texttiling_tokenize(doc):
    try:
        return tt.tokenize(doc)
    except ValueError:
        return [doc]

tt = TextTilingTokenizer(demo_mode=False)
tiles = df.text.apply(augment_texttiling_tokenize)
for i, doc in list(itertools.islice(enumerate(tiles), 1)):
    print("Text Tiled:")
    print()
    print(doc)
    print()
    print("Text Tiled to String:")
    print()
    print('\n\n'.join(list(map(lambda d: re.sub('\n\n', '', d), doc))))
    print("Original:")
    print()
    print(df.text[i])

Text Tiled:

["India's Maruti sees profits jump\n\nIndia's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.\n\nNet profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.", "\n\nDemand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.\n\nFigures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehic

## 4-2. Sentence Segmentation
Python Built-in Function:
1. `str.splitlines(keepends=False)`: Returns a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless `keepends` is given and true.

NLTK:

2. `nltk.tokenize.sent_tokenize(text, language='english')`: Uses the pre-trained tokenizer equivalent to `PunktSentenceTokenizer`.
3. `nltk.tokenize.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)`: Uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. It's said to work well for many European languages.
    - [Unsupervised Multilingual Sentence Boundary Detection](https://direct.mit.edu/coli/article/32/4/485/1923/Unsupervised-Multilingual-Sentence-Boundary) by Kiss & Strunk
    - This is a statistical algorithm, so it does not treat every `\n` as a sentence boundary. If you need that, split on `\n` yourself with `str.split('\n')` before or after calling the tokenizer.
  
spaCy:

4. spaCy: Uses the `Doc.sents` attribute.
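
Before the dataset cells, a tiny comparison shows how `str.splitlines()` differs from a manual `str.split('\n')`. The three-line string is made up and mixes `\n` and `\r\n` endings on purpose.

```python
text = "First line.\nSecond line.\r\nThird line.\n"

# Recognizes "\n", "\r\n", and other line boundaries, and drops them.
lines = text.splitlines()

# `keepends=True` keeps each line's original ending.
kept = text.splitlines(keepends=True)

# Splitting on "\n" alone leaves a stray "\r" and a trailing empty string.
naive = text.split("\n")

print(lines)
print(kept)
print(naive)
```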

In [87]:
# `splitlines()`
import re

sentencized_docs = df.text.apply(lambda doc: doc.splitlines())
# Filter out empty strings
filtered_sentencized_docs = [list(filter(None, sentencized_doc)) for sentencized_doc in sentencized_docs]
for i, doc in list(itertools.islice(enumerate(filtered_sentencized_docs), 1)):
    print("Text Sentencized:")
    print()
    print(doc)
    print()
    print("Text Sentencized to String:")
    print()
    print('\n\n'.join(list(map(lambda d: re.sub('\n\n', '', d), doc))))
    print("Original:")
    print()
    print(df.text[i])

Text Sentencized:

["India's Maruti sees profits jump", "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.", "Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.", 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.', "Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans an

In [72]:
# `nltk.tokenize.sent_tokenize()`
import re
from itertools import chain

from nltk.tokenize import sent_tokenize

sentencized_docs = df.text.apply(sent_tokenize)
# `split('\n\n')` or `split('\n')` for rule-based process
splitted_sentencized_docs = [[s.split('\n\n') for s in sentencized_doc] for sentencized_doc in sentencized_docs]
# Extract nested lists using `itertools.chain()`
flattened_sentencized_docs = list(list(chain.from_iterable(s)) if isinstance(s, list) else s for s in splitted_sentencized_docs)
for i, doc in list(itertools.islice(enumerate(flattened_sentencized_docs), 1)):
    print("Text Sentencized:")
    print()
    print(doc)
    print()
    print("Text Sentencized to String:")
    print()
    print('\n\n'.join(list(map(lambda d: re.sub('\n\n', '', d), doc))))
    print()
    print("Original:")
    print()
    print(df.text[i])

Text Sentencized:

["India's Maruti sees profits jump", "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.", 'Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier.', 'Total sales were 30.1bn rupees, up 27% from the same 2004 period.', "Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.", 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.', 'Figures show that only eight people per thousand are car owners.', 'Maruti beat market expectations despite an increase in raw materials costs.', "The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting.", 'Sales in the fiscal third quarter, including vans 

In [73]:
# `PunktSentenceTokenizer()`
from nltk.tokenize import PunktSentenceTokenizer
from itertools import chain

sentencizer = PunktSentenceTokenizer()
sentencized_docs = df.text.apply(sentencizer.tokenize)
# `split('\n\n')` or `split('\n')` for rule-based process
splitted_sentencized_docs = [[s.split('\n\n') for s in sentencized_doc] for sentencized_doc in sentencized_docs]
# Extract nested lists using `itertools.chain()`
flattened_sentencized_docs = list(list(chain.from_iterable(s)) if isinstance(s, list) else s for s in splitted_sentencized_docs)
for i, doc in list(itertools.islice(enumerate(flattened_sentencized_docs), 1)):
    print("Text Sentencized:")
    print()
    print(doc)
    print()
    print("Text Sentencized to String:")
    print()
    print('\n\n'.join(list(map(lambda d: re.sub('\n\n', '', d), doc))))
    print()
    print("Original:")
    print()
    print(df.text[i])

Text Sentencized:

["India's Maruti sees profits jump", "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.", 'Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier.', 'Total sales were 30.1bn rupees, up 27% from the same 2004 period.', "Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.", 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.', 'Figures show that only eight people per thousand are car owners.', 'Maruti beat market expectations despite an increase in raw materials costs.', "The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting.", 'Sales in the fiscal third quarter, including vans 

In [14]:
# `spacy`
import spacy

# Run `python3 -m spacy download en_core_web_sm` on the command line first
nlp = spacy.load('en_core_web_sm')

In [76]:
from itertools import chain

sentencized_docs = df.text.apply(lambda doc: [sent.text for sent in nlp(doc).sents])
# `split('\n\n')` or `split('\n')` for rule-based process
splitted_sentencized_docs = [[s.split('\n\n') for s in sentencized_doc] for sentencized_doc in sentencized_docs]
# Extract nested lists using `itertools.chain()`
flattened_sentencized_docs = list(list(chain.from_iterable(s)) if isinstance(s, list) else s for s in splitted_sentencized_docs)
# Filter out empty strings
filtered_sentencized_docs = [list(filter(None, flattened_sentencized_doc)) for flattened_sentencized_doc in flattened_sentencized_docs]
for i, doc in list(itertools.islice(enumerate(filtered_sentencized_docs), 1)):
    print("Text Sentencized:")
    print()
    print(doc)
    print()
    print("Text Sentencized to String:")
    print()
    print('\n\n'.join(list(map(lambda d: re.sub('\n\n', '', d), doc))))
    print("Original:")
    print()
    print(df.text[i])

Text Sentencized:

["India's Maruti sees profits jump", "India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.", 'Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier.', 'Total sales were 30.1bn rupees, up 27% from the same 2004 period.', "Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.", 'Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.', 'Figures show that only eight people per thousand are car owners.', 'Maruti beat market expectations despite an increase in raw materials costs.', "The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting.", 'Sales in the fiscal third quarter, including vans 

# 5. Tokenization
## 5-1. Word Tokenization
Python Built-in Method:
1. `str.split(sep=None, maxsplit=-1)`: Splits the string using `sep` as the delimiter. If `sep` is omitted or `None`:
   - Runs of consecutive whitespace are regarded as a single separator.
   - The result contains no empty strings at the start or end, even if the string has leading or trailing whitespace.
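
As a quick check of the whitespace behavior described above, the following sketch contrasts `split()` called without arguments against `split(' ')`. The sample string is made up for illustration.

```python
# Made-up sample with leading/trailing whitespace and a run of spaces
sample = "  India's Maruti   sees\nprofits jump  "

# sep=None: runs of whitespace act as one separator, no empty strings
default_split = sample.split()

# sep=' ': every single space is a separator, so empty strings appear
space_split = sample.split(' ')

print(default_split)
print(space_split)
```

With an explicit `sep`, empty strings show up wherever two separators are adjacent, and the newline is no longer treated as a separator.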

NLTK:

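2. `nltk.tokenize.word_tokenize(text, language='english')`: Splits the text into sentences with the Punkt sentence tokenizer first, then tokenizes each sentence with a Treebank-style word tokenizer. This is why sentence-final periods become separate tokens here, whereas applying `TreebankWordTokenizer` to a whole document leaves most of them attached (for example, 'demand.').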
3. `nltk.tokenize.TreebankWordTokenizer()`
   - Splits standard contractions, like "don't" to "do n't" and "they'll" to "they 'll".
   - Treats most punctuation characters as separate tokens.
   - Splits off commas and single quotes when followed by whitespace.
   - Converts opening double quotes to "``" and closing double quotes to "''".
   - Separates periods that appear at the end of a line.
4. `nltk.tokenize.WordPunctTokenizer()`: Tokenizes a text into a sequence of alphabetic and non-alphabetic characters, using the regular expression: `\w+|[^\w\s]+`.
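5. `nltk.tokenize.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=56)`: Splits a string into substrings using a regular expression. With `gaps=False` (the default), `pattern` describes the tokens themselves; with `gaps=True`, it describes the separators between tokens. The default `flags=56` corresponds to `re.UNICODE | re.MULTILINE | re.DOTALL`.
6. More tokenizer [classes](https://tedboy.github.io/nlps/generated/nltk.tokenize.html#classes) provided by NLTK.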

spaCy:

7. `spacy`: Iterate over a `Doc` and read the `Token.text` attribute of each token. A `Doc` is a sequence of `Token` objects.
   - [Algorithm Details: How spaCy's Tokenizer Works](https://spacy.io/usage/linguistic-features#how-tokenizer-works)
   - [Add Special Case Tokenization Rules](https://spacy.io/usage/linguistic-features#special-cases)
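
To see what the `\w+|[^\w\s]+` pattern quoted for `WordPunctTokenizer` does without installing NLTK, the sketch below applies it with the standard-library `re.findall`. This only emulates the pattern and is not NLTK's implementation; the sample sentence and the name `WORD_PUNCT` are made up for illustration.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation and symbols)
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sample = "India's fuel-efficient cars cost $54.98m."

regex_tokens = WORD_PUNCT.findall(sample)
whitespace_tokens = sample.split()

print(regex_tokens)
# ['India', "'", 's', 'fuel', '-', 'efficient', 'cars', 'cost', '$', '54', '.', '98m', '.']
print(whitespace_tokens)
# ["India's", 'fuel-efficient', 'cars', 'cost', '$54.98m.']
```

The pattern is aggressive: possessives, hyphenated words, and decimal numbers are all broken apart, which is the price for never leaving punctuation glued to a word.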

In [84]:
# `split()`
tokenized_docs = df.text.apply(lambda doc: doc.split())
for i, doc in list(itertools.islice(enumerate(tokenized_docs), 1)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

["India's", 'Maruti', 'sees', 'profits', 'jump', "India's", 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand.', 'Net', 'profit', 'surged', '70%', 'to', '2.39bn', 'rupees', '($54.98m;', '£29.32m)', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1.41bn', 'rupees', 'a', 'year', 'earlier.', 'Total', 'sales', 'were', '30.1bn', 'rupees,', 'up', '27%', 'from', 'the', 'same', '2004', 'period.', 'Maruti', 'accounts', 'for', 'half', 'of', "India's", 'domestic', 'car', 'sales,', 'luring', 'consumers', 'with', 'cheap,', 'fuel-efficient', 'vehicles.', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very', 'low', 'level', 'of', 'car', 'ownership,', 'analysts', 'said.', 'Figures', 'show', 'that', 'only', 'eight', 'people', 'per'

In [77]:
# `nltk.tokenize.word_tokenize()`
from nltk.tokenize import word_tokenize

tokenized_docs = df.text.apply(word_tokenize)
for i, doc in list(itertools.islice(enumerate(tokenized_docs), 2)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

['India', "'s", 'Maruti', 'sees', 'profits', 'jump', 'India', "'s", 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand', '.', 'Net', 'profit', 'surged', '70', '%', 'to', '2.39bn', 'rupees', '(', '$', '54.98m', ';', '£29.32m', ')', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1.41bn', 'rupees', 'a', 'year', 'earlier', '.', 'Total', 'sales', 'were', '30.1bn', 'rupees', ',', 'up', '27', '%', 'from', 'the', 'same', '2004', 'period', '.', 'Maruti', 'accounts', 'for', 'half', 'of', 'India', "'s", 'domestic', 'car', 'sales', ',', 'luring', 'consumers', 'with', 'cheap', ',', 'fuel-efficient', 'vehicles', '.', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very', 'low', 'level', 'of', 'car', 'ownership', ',', 'analysts', 

In [81]:
# `TreebankWordTokenizer()`
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenized_docs = df.text.apply(tokenizer.tokenize)
for i, doc in list(itertools.islice(enumerate(tokenized_docs), 1)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

['India', "'s", 'Maruti', 'sees', 'profits', 'jump', 'India', "'s", 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand.', 'Net', 'profit', 'surged', '70', '%', 'to', '2.39bn', 'rupees', '(', '$', '54.98m', ';', '£29.32m', ')', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1.41bn', 'rupees', 'a', 'year', 'earlier.', 'Total', 'sales', 'were', '30.1bn', 'rupees', ',', 'up', '27', '%', 'from', 'the', 'same', '2004', 'period.', 'Maruti', 'accounts', 'for', 'half', 'of', 'India', "'s", 'domestic', 'car', 'sales', ',', 'luring', 'consumers', 'with', 'cheap', ',', 'fuel-efficient', 'vehicles.', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very', 'low', 'level', 'of', 'car', 'ownership', ',', 'analysts', 'said.', 'Figure

In [82]:
# `WordPunctTokenizer()`
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenized_docs = df.text.apply(tokenizer.tokenize)
for i, doc in list(itertools.islice(enumerate(tokenized_docs), 1)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

['India', "'", 's', 'Maruti', 'sees', 'profits', 'jump', 'India', "'", 's', 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand', '.', 'Net', 'profit', 'surged', '70', '%', 'to', '2', '.', '39bn', 'rupees', '($', '54', '.', '98m', ';', '£', '29', '.', '32m', ')', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1', '.', '41bn', 'rupees', 'a', 'year', 'earlier', '.', 'Total', 'sales', 'were', '30', '.', '1bn', 'rupees', ',', 'up', '27', '%', 'from', 'the', 'same', '2004', 'period', '.', 'Maruti', 'accounts', 'for', 'half', 'of', 'India', "'", 's', 'domestic', 'car', 'sales', ',', 'luring', 'consumers', 'with', 'cheap', ',', 'fuel', '-', 'efficient', 'vehicles', '.', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very',

In [54]:
# `RegexpTokenizer()`
from nltk.tokenize import RegexpTokenizer

# `|` means "either or"
tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")
tokenized_doc = df.text.apply(tokenizer.tokenize)
for i, doc in list(itertools.islice(enumerate(tokenized_doc), 1)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

['India', "'s", 'Maruti', 'sees', 'profits', 'jump', 'India', "'s", 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand', '.', 'Net', 'profit', 'surged', '70', '%', 'to', '2', '.39bn', 'rupees', '($54.98m;', '£29.32m)', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1', '.41bn', 'rupees', 'a', 'year', 'earlier', '.', 'Total', 'sales', 'were', '30', '.1bn', 'rupees', ',', 'up', '27', '%', 'from', 'the', 'same', '2004', 'period', '.', 'Maruti', 'accounts', 'for', 'half', 'of', 'India', "'s", 'domestic', 'car', 'sales', ',', 'luring', 'consumers', 'with', 'cheap', ',', 'fuel', '-efficient', 'vehicles', '.', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very', 'low', 'level', 'of', 'car', 'ownership', ',', 'analysts', 

In [56]:
# `spacy`
import spacy

# Run `python3 -m spacy download en_core_web_sm` on the command line first
nlp = spacy.load('en_core_web_sm')

In [60]:
# `spacy`
tokenized_docs = df.text.apply(lambda doc: [token.text for token in nlp(doc)])
for i, doc in list(itertools.islice(enumerate(tokenized_docs), 1)):
    print("Word Tokenized:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Word Tokenized:

['India', "'s", 'Maruti', 'sees', 'profits', 'jump', '\n\n', 'India', "'s", 'biggest', 'carmaker', 'Maruti', 'has', 'reported', 'a', 'sharp', 'increase', 'in', 'quarterly', 'profit', 'after', 'a', 'booming', 'economy', 'and', 'low', 'interest', 'rates', 'boosted', 'demand', '.', '\n\n', 'Net', 'profit', 'surged', '70', '%', 'to', '2.39bn', 'rupees', '(', '$', '54.98', 'm', ';', '£', '29.32', 'm', ')', 'in', 'the', 'last', 'three', 'months', 'of', '2004', 'compared', 'with', '1.41bn', 'rupees', 'a', 'year', 'earlier', '.', 'Total', 'sales', 'were', '30.1bn', 'rupees', ',', 'up', '27', '%', 'from', 'the', 'same', '2004', 'period', '.', 'Maruti', 'accounts', 'for', 'half', 'of', 'India', "'s", 'domestic', 'car', 'sales', ',', 'luring', 'consumers', 'with', 'cheap', ',', 'fuel', '-', 'efficient', 'vehicles', '.', '\n\n', 'Demand', 'in', 'India', 'also', 'has', 'been', 'driven', 'by', 'the', 'poor', 'state', 'of', 'public', 'transport', 'and', 'the', 'very', 'low', 'level',

## 5-2. Character Tokenization
## 5-3. Sentence Tokenization
## 5-4. Punctuation Tokenization
## 5-5. Whitespace Tokenization
## 5-6. Subword Tokenization

# 6. Expand Contractions
**Contractions** are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.
- Expanding contractions contributes to text standardization & dimensionality reduction.
- `contractions.fix(text)`
  - [contractions](https://github.com/kootenpv/contractions)
        
**Apostrophes** are used for two main jobs, showing possession and showing omission.
- Apostrophes for **possession** show that a thing belongs to someone or something. For example, "Anna’s book" or "the school’s logo".
- Apostrophes for **omission** show where something, usually a letter, has been missed out to create a contraction. For example, "haven't" rather than "have not". These are contractions.
- Apostrophes are occasionally also used to form plurals of letters, numbers, and symbols. For example, "2 A's", "Six 5's", or "Many &’s".
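
The difference between omission and possession matters when expanding contractions: only the former should be rewritten. The sketch below illustrates the idea with a deliberately tiny, hypothetical lookup table (`CONTRACTION_MAP`) and the standard-library `re` module. It is not how the `contractions` package is implemented, and it only handles the straight apostrophe `'`.

```python
import re

# Hypothetical mini mapping for illustration; the `contractions`
# package ships a far larger list.
CONTRACTION_MAP = {
    "i'll": "I will",
    "shouldn't": "should not",
    "it's": "it is",
    "we've": "we have",
}

# One alternation of all known contractions, matched as whole words
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text: str) -> str:
    def replace(match: re.Match) -> str:
        expanded = CONTRACTION_MAP[match.group(0).lower()]
        # Keep a leading capital letter, e.g. "It's" -> "It is"
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_RE.sub(replace, text)

expanded_sample = expand_contractions_sketch(
    "It's late, and we've waited. Anna's book shouldn't be here."
)
print(expanded_sample)
# It is late, and we have waited. Anna's book should not be here.
```

Because the possessive "Anna's" is not in the table, it is left untouched, while the omission cases are expanded.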

In [20]:
# !pip3 install contractions
import contractions, itertools

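def expand_contractions(text: str) -> str:
    # Expand word by word. Note that `split()` followed by `' '.join()` also
    # collapses all whitespace, so paragraph breaks ('\n\n') are lost.
    expanded_words = [contractions.fix(word) for word in text.split()]
    return ' '.join(expanded_words)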

"""
text_to_expanded = '''I'll be there within 5 min. Shouldn't you be there too? 
                      I'd love to see u there my dear. It's awesome to meet new friends.
                      We've been waiting for this day for so long.'''
print(expand_contractions(text_to_expanded))
# I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.
"""

expanded_doc = df.text.apply(expand_contractions)
for i, doc in list(itertools.islice(enumerate(expanded_doc), 1)):
    print("Contractions Expanded:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Contractions Expanded:

India's Maruti sees profits jump India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand. Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles. Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said. Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehicles, ro

# 7. Noisy Entity Removal
Text often contains unwanted noise, such as extra whitespace, non-alphanumeric characters, numbers, and HTML formatting, that needs to be removed. What counts as noise, though, depends heavily on the domain and the use case at hand.
## 7-1. Punctuation Removal
Why remove punctuation?
- Removing punctuation reduces the number of distinct tokens to consider. For example, "demand." and "demand" collapse into a single word.
- For many word-level analyses, punctuation carries little information and can be dropped.
    - However, be careful when applying transfer learning with large language models, since their tokenizers and pretraining data typically expect punctuation to be present.
- [regex101](https://regex101.com/)

1. Regular Expression to find all punctuation: `[^\w\s]+`
    - `+`: Matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
    - `\w`: Matches any word character (equivalent to `[a-zA-Z0-9_]` for ASCII text; in Python 3, `\w` also matches Unicode letters and digits by default).
    - `\s`: Matches any whitespace character (equivalent to `[\r\n\t\f\v ]`).
2. `spacy`: Uses the `Token.is_punct` attribute.
    - More [Attributes](https://spacy.io/api/token#attributes) of `Token`.
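
A minimal sketch of what `[^\w\s]+` keeps and removes, using only the standard library. The helper name `strip_punctuation` is made up for this example; the first sample is taken from the article shown in the outputs of this notebook.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]+")

def strip_punctuation(text: str) -> str:
    """Remove every run of characters that is neither word nor whitespace."""
    return PUNCTUATION.sub("", text)

figures = strip_punctuation(
    "Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m)."
)
identifiers = strip_punctuation("snake_case, café!")

print(figures)      # Net profit surged 70 to 239bn rupees 5498m 2932m
print(identifiers)  # snake_case café
```

Two side effects are worth noticing: decimal points and currency symbols disappear (so "2.39bn" becomes "239bn"), while underscores and accented letters survive because they count as word characters.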

In [314]:
import re

def remove_punctuation(text: str) -> str:
    # r'' indicates to treat as a raw string
    return re.sub(r'[^\w\s]+', '', text) 
    
no_punctuation = df.text.apply(remove_punctuation)
for i, doc in list(itertools.islice(enumerate(no_punctuation), 1)):
    print("Punctuation Removed:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

Punctuation Removed:

Indias Maruti sees profits jump

Indias biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand

Net profit surged 70 to 239bn rupees 5498m 2932m in the last three months of 2004 compared with 141bn rupees a year earlier Total sales were 301bn rupees up 27 from the same 2004 period Maruti accounts for half of Indias domestic car sales luring consumers with cheap fuelefficient vehicles

Demand in India also has been driven by the poor state of public transport and the very low level of car ownership analysts said

Figures show that only eight people per thousand are car owners Maruti beat market expectations despite an increase in raw materials costs The company majorityowned by Japans Suzuki said an increase in steel and other raw material prices was partially offset by cost cutting Sales in the fiscal third quarter including vans and utility vehicles rose by 178 to 136069 units Maruti

In [35]:
# `spacy`
import spacy

# Run `python3 -m spacy download en_core_web_sm` on the command line first
nlp = spacy.load('en_core_web_sm')

In [38]:
# Note that spaCy keeps the possessive "'s" as a token of its own instead of treating its apostrophe as punctuation
no_punctuation = df.text.apply(lambda doc: ' '.join([token.text for token in nlp(doc) if not token.is_punct and not token.is_space]))
for i, doc in list(itertools.islice(enumerate(no_punctuation), 1)):
    print("Punctuation Removed:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

Punctuation Removed:

India 's Maruti sees profits jump India 's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand Net profit surged 70 to 2.39bn rupees $ 54.98 m £ 29.32 m in the last three months of 2004 compared with 1.41bn rupees a year earlier Total sales were 30.1bn rupees up 27 from the same 2004 period Maruti accounts for half of India 's domestic car sales luring consumers with cheap fuel efficient vehicles Demand in India also has been driven by the poor state of public transport and the very low level of car ownership analysts said Figures show that only eight people per thousand are car owners Maruti beat market expectations despite an increase in raw materials costs The company majority owned by Japan 's Suzuki said an increase in steel and other raw material prices was partially offset by cost cutting Sales in the fiscal third quarter including vans and utility vehicles rose by 17.8 to 1

## 7-2. URL Removal
- URLs rarely carry useful information when the analysis is based on the words of a text.
- However, URLs can still be worth keeping in some cases, for example when building a graph-oriented database (such as Neo4j).
- Regular expression to find all URLs: `http\S+|www\.\S+`
    - `\S`: Matches any non-whitespace character. It is the negation of `\s`, which matches whitespace (spaces, tabs, and newlines).
    - `\.`: Matches a literal dot. An unescaped `.` would match any character.
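
The following sketch applies a URL pattern with an escaped dot to a made-up sentence and shows why escaping the dot matters. The sample strings and variable names are assumptions for illustration only.

```python
import re

URL = re.compile(r"http\S+|www\.\S+")

text = "Full story at https://www.bbc.co.uk/news or www.example.com today."

without_urls = URL.sub("", text)
# Removing a URL leaves its surrounding spaces behind, so tidy them up
tidy = " ".join(without_urls.split())

# With an unescaped dot, "www." matches "www" plus ANY character
loose_matches = re.findall(r"www.\S+", "the wwwx-files archive")
strict_matches = re.findall(r"www\.\S+", "the wwwx-files archive")

print(repr(without_urls))  # 'Full story at  or  today.'
print(tidy)                # Full story at or today.
print(loose_matches)       # ['wwwx-files']
print(strict_matches)      # []
```

Because `\S+` runs up to the next whitespace, punctuation that directly follows a URL (such as a closing period) is removed along with it.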

In [315]:
import re

def remove_url(text: str) -> str:
    # Escape the dot so that `www\.` only matches a literal "www."
    return re.sub(r'http\S+|www\.\S+', '', text)

no_url = df.text.apply(remove_url)
for i, doc in list(itertools.islice(enumerate(no_url), 1)):
    print("URL Removed:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

URL Removed:

India's Maruti sees profits jump

India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.

Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.

Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.

Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehicles, rose by 

## 7-3. HTML Tag Removal
Text data may contain HTML tags that need to be removed before analysis. Regular expression to find all HTML tags: `<.*?>`
- `.*`: Matches zero or more of any character (except a newline, unless `re.DOTALL` is set).
- `?`: Changes the repetition behavior into non-greedy (reluctant or lazy). Matches as few repetitions as possible.
    - For example, the greedy `h.+l` matches "hell" in "hello" but the lazy `h.+?l` matches "hel".
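
The effect of the lazy quantifier is easiest to see side by side. This sketch runs the greedy and lazy variants on a made-up HTML snippet and on the "hello" example from above.

```python
import re

html = "<p>Net profit <b>surged</b> 70%</p>"

# Lazy: each match stops at the first '>' it can reach
lazy_result = re.sub(r"<.*?>", "", html)

# Greedy: one match runs from the first '<' to the last '>'
greedy_result = re.sub(r"<.*>", "", html)

greedy_hello = re.search(r"h.+l", "hello").group()
lazy_hello = re.search(r"h.+?l", "hello").group()

print(repr(lazy_result))    # 'Net profit surged 70%'
print(repr(greedy_result))  # ''
print(greedy_hello, lazy_hello)  # hell hel
```

The greedy version swallows the whole line, text included, which is why the non-greedy `?` is essential here.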

In [334]:
import re

def remove_html_tag(text: str) -> str:
    return re.sub(r'<.*?>', '', text)

no_html_tag = df.text.apply(remove_html_tag)
for i, doc in list(itertools.islice(enumerate(no_html_tag), 1)):
    print("HTML Tags Removed:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

HTML Tags Removed:

India's Maruti sees profits jump

India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.

Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.

Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.

Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehicles, ro

## 7-4. Whitespace Removal
**Stripping** removes leading and trailing whitespace (spaces, tabs, and newlines).
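
Stripping leaves whitespace inside the text untouched. If runs of internal whitespace should be collapsed as well, one common approach (shown here as an optional extra, on a made-up string) is to split on whitespace and rejoin with single spaces.

```python
raw = "  \n India's Maruti   sees\tprofits jump \n\n"

# Only the leading and trailing whitespace is removed
stripped = raw.strip()

# Every run of whitespace, including tabs and newlines, becomes one space
collapsed = " ".join(raw.split())

print(repr(stripped))   # "India's Maruti   sees\tprofits jump"
print(repr(collapsed))  # "India's Maruti sees profits jump"
```

Note that collapsing also removes paragraph breaks, so it should only be applied when that structure is no longer needed.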

In [333]:
for i, doc in list(itertools.islice(enumerate(df.text.str.strip()), 1)):
    print("Stripped:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

Stripped:

India's Maruti sees profits jump

India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.

Net profit surged 70% to 2.39bn rupees ($54.98m; £29.32m) in the last three months of 2004 compared with 1.41bn rupees a year earlier. Total sales were 30.1bn rupees, up 27% from the same 2004 period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.

Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.

Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehicles, rose by 17.

## 7-5. Number Removal
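Usually combined with punctuation removal. Regular expression to find all digits: `[\d]` (equivalent to plain `\d`; the brackets are optional here)
- `\d`: Matches any single digit from 0 to 9. Because digits are removed one at a time, separators such as the decimal point in "2.39" are left behind.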
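
The sketch below compares the digit-by-digit pattern with a variant that treats a decimal number as one unit. The second pattern, `\d+(?:\.\d+)?`, is an assumption added for illustration and is not used elsewhere in this notebook.

```python
import re

text = "Net profit surged 70% to 2.39bn rupees in 2004."

# Removes each digit individually; the decimal point stays
digits_removed = re.sub(r"[\d]", "", text)

# Removes whole numbers, including an optional decimal part
numbers_removed = re.sub(r"\d+(?:\.\d+)?", "", text)

print(digits_removed)   # Net profit surged % to .bn rupees in .
print(numbers_removed)  # Net profit surged % to bn rupees in .
```

Either way, leftovers such as "%" and "bn" remain, which is why number removal is usually combined with punctuation removal.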

In [336]:
import re

def remove_number(text: str) -> str:
    return re.sub(r'[\d]', '', text)

no_number = df.text.apply(remove_number)
for i, doc in list(itertools.islice(enumerate(no_number), 1)):
    print("Numbers Removed:")
    print()
    print(doc)
    print("Original:")
    print()
    print(df.text[i])

Numbers Removed:

India's Maruti sees profits jump

India's biggest carmaker Maruti has reported a sharp increase in quarterly profit after a booming economy and low interest rates boosted demand.

Net profit surged % to .bn rupees ($.m; £.m) in the last three months of  compared with .bn rupees a year earlier. Total sales were .bn rupees, up % from the same  period. Maruti accounts for half of India's domestic car sales, luring consumers with cheap, fuel-efficient vehicles.

Demand in India also has been driven by the poor state of public transport and the very low level of car ownership, analysts said.

Figures show that only eight people per thousand are car owners. Maruti beat market expectations despite an increase in raw materials costs. The company, majority-owned by Japan's Suzuki, said an increase in steel and other raw material prices was partially offset by cost cutting. Sales in the fiscal third quarter, including vans and utility vehicles, rose by .% to . units. Maruti is 

## 7-6. Stop Word Removal
- Stop words are words that occur very frequently in a language but add little to no meaning to a sentence. Typical examples are articles, determiners, and prepositions.
- The list of stop words varies from package to package.

1. `nltk.corpus.stopwords`
2. `spacy`: Uses the `Token.is_stop` attribute.
3. `gensim.parsing.preprocessing.remove_stopwords()`

Types of Stop Words:
1. **Common Stop Words:** These are the most frequently occurring words in a language and are often removed during text preprocessing. Examples include “the,” “is,” “in,” “for,” “where,” “when,” “to,” “at,” etc.
2. **Custom Stop Words:** Depending on the specific task or domain, additional words may be considered as stopwords. These could be domain-specific terms that don’t contribute much to the overall meaning. For example, in a medical context, words like “patient” or “treatment” might be considered as custom stopwords.
3. **Numerical Stop Words:** Numbers and numeric characters may be treated as stopwords in certain cases, especially when the analysis is focused on the meaning of the text rather than specific numerical values. See the Number Removal section (7-5).
4. **Single-Character Stop Words:** Single characters, such as “a,” “I,” “s,” or “x,” may be considered stopwords, particularly in cases where they don’t convey much meaning on their own. Regular expression to find all single-character words: `\b[a-zA-Z]\s`. Note that this also removes the article "a".
5. **Contextual Stop Words:** Words that are stopwords in one context but meaningful in another may be considered as contextual stopwords. For instance, the word “will” might be a stopword in the context of general language processing but could be important in predicting future events.
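
To make the categories above concrete, here is a minimal standard-library sketch. The `STOP_WORDS` set is a hypothetical mini list (real lists from NLTK, spaCy, or gensim are much longer), and `remove_stop_words` is a made-up helper name.

```python
import re

# Hypothetical mini stop word list for illustration only
STOP_WORDS = {"the", "is", "in", "for", "to", "at", "a", "of"}

def remove_stop_words(text: str, lowercase: bool = True) -> str:
    """Drop whitespace-separated words that appear in STOP_WORDS."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase else word
        if key not in STOP_WORDS:
            kept.append(word)
    return " ".join(kept)

sentence = "The company is in the top tier of a market"
case_insensitive = remove_stop_words(sentence)
case_sensitive = remove_stop_words(sentence, lowercase=False)

# Single-character pattern from the list above
SINGLE_CHAR = re.compile(r"\b[a-zA-Z]\s")
plain = SINGLE_CHAR.sub("", "plan x is a risk")
possessive = SINGLE_CHAR.sub("", "India's Maruti")

print(case_insensitive)  # company top tier market
print(case_sensitive)    # The company top tier market
print(plain)             # plan is risk
print(possessive)        # India'Maruti
```

Two pitfalls show up: a case-sensitive comparison keeps capitalized stop words such as "The", and the single-character pattern also hits the "s" of a possessive, gluing "India'" to the next word.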

In [245]:
# `nltk`
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yungshun317/nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora/stopwords.zip.


In [305]:
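# Use a set for fast membership tests
stop_words = set(stopwords.words('english'))

# The comparison is case-sensitive, so capitalized stop words such as "The" are kept
no_stop_words = df.text.apply(lambda doc: ' '.join([word for word in doc.split() if word not in stop_words]))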
for i, doc in list(itertools.islice(enumerate(no_stop_words), 1)):
    print("No Stop Words:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

No Stop Words:

India's Maruti sees profits jump India's biggest carmaker Maruti reported sharp increase quarterly profit booming economy low interest rates boosted demand. Net profit surged 70% 2.39bn rupees ($54.98m; £29.32m) last three months 2004 compared 1.41bn rupees year earlier. Total sales 30.1bn rupees, 27% 2004 period. Maruti accounts half India's domestic car sales, luring consumers cheap, fuel-efficient vehicles. Demand India also driven poor state public transport low level car ownership, analysts said. Figures show eight people per thousand car owners. Maruti beat market expectations despite increase raw materials costs. The company, majority-owned Japan's Suzuki, said increase steel raw material prices partially offset cost cutting. Sales fiscal third quarter, including vans utility vehicles, rose 17.8% 136.069 units. Maruti company benefiting Indian's economic growth gives consumer greater spending power. Utility vehicle tractor maker Mahindra reported 52% rise net pro

In [32]:
# `spacy`
import spacy

# Run `python3 -m spacy download en_core_web_sm` on the command line first
nlp = spacy.load('en_core_web_sm')

In [39]:
# Note that spaCy also tokenizes punctuation marks as separate tokens
no_stop_words = df.text.apply(lambda doc: ' '.join([token.text for token in nlp(doc) if not token.is_stop and not token.is_space]))
# no_stop_words = df.text.apply(lambda doc: ' '.join([token.text for token in nlp(doc) if not token.is_stop and not token.is_punct and not token.is_space]))
for i, doc in list(itertools.islice(enumerate(no_stop_words), 1)):
    print("No Stop Words:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

No Stop Words:

India Maruti sees profits jump India biggest carmaker Maruti reported sharp increase quarterly profit booming economy low interest rates boosted demand . Net profit surged 70 % 2.39bn rupees ( $ 54.98 m ; £ 29.32 m ) months 2004 compared 1.41bn rupees year earlier . Total sales 30.1bn rupees , 27 % 2004 period . Maruti accounts half India domestic car sales , luring consumers cheap , fuel - efficient vehicles . Demand India driven poor state public transport low level car ownership , analysts said . Figures people thousand car owners . Maruti beat market expectations despite increase raw materials costs . company , majority - owned Japan Suzuki , said increase steel raw material prices partially offset cost cutting . Sales fiscal quarter , including vans utility vehicles , rose 17.8 % 136.069 units . Maruti company benefiting Indian economic growth gives consumer greater spending power . Utility vehicle tractor maker Mahindra reported 52 % rise net profit months 2004 . 

In [332]:
# `gensim`
from gensim.parsing.preprocessing import remove_stopwords

no_stop_words = df.text.apply(remove_stopwords)
for i, doc in list(itertools.islice(enumerate(no_stop_words), 1)):
    print("No Stop Words:")
    print()
    print(doc)
    print()
    print("Original:")
    print()
    print(df.text[i])

No Stop Words:

India's Maruti sees profits jump India's biggest carmaker Maruti reported sharp increase quarterly profit booming economy low rates boosted demand. Net profit surged 70% 2.39bn rupees ($54.98m; £29.32m) months 2004 compared 1.41bn rupees year earlier. Total sales 30.1bn rupees, 27% 2004 period. Maruti accounts half India's domestic car sales, luring consumers cheap, fuel-efficient vehicles. Demand India driven poor state public transport low level car ownership, analysts said. Figures people thousand car owners. Maruti beat market expectations despite increase raw materials costs. The company, majority-owned Japan's Suzuki, said increase steel raw material prices partially offset cost cutting. Sales fiscal quarter, including vans utility vehicles, rose 17.8% 136.069 units. Maruti company benefiting Indian's economic growth gives consumer greater spending power. Utility vehicle tractor maker Mahindra reported 52% rise net profit months 2004. Profit 1.33bn rupees compared
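The three libraries ship different stop word lists, which is why the outputs above differ (for example, the NLTK output keeps "show eight people per" where the gensim output keeps only "people"). One way to inspect such differences is to diff two filtered strings. The sketch below is only an illustration: `removed_only_by` is a hypothetical helper, and the two inputs are short excerpts copied from the outputs above.

```python
def removed_only_by(kept_a, kept_b):
    """Return words present in output A but missing from output B, in order of first appearance."""
    b_words = set(kept_b.split())
    seen = set()
    diff = []
    for word in kept_a.split():
        if word not in b_words and word not in seen:
            seen.add(word)
            diff.append(word)
    return diff

# Excerpts of the same sentence after NLTK-based and gensim-based filtering
nltk_out = "Figures show eight people per thousand car owners."
gensim_out = "Figures people thousand car owners."

# Words the gensim list removed but the NLTK list kept
print(removed_only_by(nltk_out, gensim_out))
```

Running the comparison in both directions shows whether one list is simply a superset of the other for a given document.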

# 8. Text Normalization
## 8-1. Stemming 
## 8-2. Lemmatization

# 9. N-Gram

# 10. POS Tagging

# 11. NER

# 12. Datasets & DataLoaders