$$\text{Anatomy of Language}$$
- Language 
    - Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary.
        - Early Languages
        - Modern Linguistics

- Language as a phenomenon
    - Language is considered a social phenomenon because people communicate with their speech communities through the languages they speak.
        - Spoken Language
        - Written Language
- Semantics
    - Language exists to be meaningful; the study of meaning, both in general theoretical terms and in reference to a specific language, is known as semantics.
    - Semantics covers the meaningful functions of phonological features (such as intonation) and of grammatical structures, as well as the meanings of individual words.

- Language Variants
    - Language refers to both the universal human ability to communicate and its specific forms, like English, French, or Swahili.

- Physiological and physical basis
    - Language originated as a spoken system, evolving naturally with human communication, while writing emerged only 4,000–5,000 years ago as a way to represent speech. For most of human history, language was passed down orally.

- Speech Production
    - Speaking arises from exhaling air during respiration, modified by vocal tract movements to produce various sounds.

- Language Acquisition
    - All humans are physiologically alike in speech production. Language is learned from upbringing, not inherited; adopted children acquire their adoptive parents' language.


- Meaning and Style in Language
    - Language exists to convey meaning, shaped by the diverse needs of human communication, making the study of meaning highly complex.

- Structural, or grammatical, meaning
    - Sentence meaning combines word meanings and grammatical structure, as seen in sentences like *"The dog chased the cat"* and *"The boy chased the cat"*.

- Lexical meaning
    - Word meaning, or lexical meaning, refers to the individual definitions of words, as outlined in dictionaries. Answering "What does this word mean?" is often harder than it seems.

- Language and culture
    - Language is deeply tied to community life and culture, shaping and reflecting the details of daily living universally across all languages.

- Transmission of Language and culture
    - Language is primarily learned through cultural exposure, with limited direct teaching, as children construct grammar from the speech they hear. Language is an integral part of culture, influencing and reflecting societal membership.

- Symbolic Systems
    - Symbolic systems represent the world using meaningful symbols, including human and computer languages.
- Types of Languages 
    - Artificial Language 
        - Artificial languages arise from simulations, interactions, or experiments, evolving naturally rather than being consciously designed. They are used in cultural evolution studies and psycholinguistic research.
    - Constructed Language
        - Constructed languages, or conlangs, are intentionally designed for purposes like communication, fiction, experimentation, or art. Examples include international auxiliary languages and fictional languages.
    - Logical Language
        - Logical languages, such as Loglan and Lojban, use formal logic to eliminate ambiguity, aiming for precision in expression.
    - Programming Language
        - Programming languages, text-based or graphical, are used to write computer programs. They are defined by their syntax and semantics; some are defined by a specification (e.g., C) and others by a dominant implementation (e.g., Perl).
    - Natural Language
        - Natural languages evolve naturally through human use and differ from artificial or constructed languages. They are systematic, conventional, redundant, and subject to change over time.
        - Natural Imprecision
            - Natural language reflects human cognition but includes vague terms like "tall" or "hot," challenging precise translation into computational reasoning.
- Linguistics
    - Linguistics
        - Linguistics, the science of language, studies human communication, language families, and specific languages, involving subfields like phonetics, syntax, semantics, and sociolinguistics.
    - Applied Linguistics
        - Applied linguistics applies linguistic research to areas like education, translation, lexicography, language policy, and natural language processing.



$$\text{Language Analysis and Computational Linguistics}$$
 
- Language Analysis
    - **Purpose**: To understand how meaning is conveyed using language techniques (e.g., tone, word choice).
        - Identify techniques and their effects.

- Techniques
    - **Persuasive Techniques**: Analyze how these influence audience perception and response.

- Levels of Analysis
    - **Phonology**: Sound system of a language.
    - **Grammar (Morphology/Syntax)**: Structure of words and sentences.
    - **Discourse and Pragmatics**: Contextual and functional use of language.

- Genre and Audience
    - **Genre**: Groups texts by style and theme (e.g., fantasy, poetry).
    - **Audience**: Tailors content to engage target readers effectively.

- Foregrounding
    - **Attention-Getting Techniques**: Uses repetition (parallelism) or breaks patterns (deviation).

- Literariness
    - **Value in Texts**: Aesthetic and moral qualities elevate texts to literary works.

- Paradigm and Syntagm
    - **Paradigm**: Substitution relationships (e.g., noun for noun).
    - **Syntagm**: Positional relationships in sentence structure.

- Form and Function
    - **Form**: Identifies parts of speech and structures.
    - **Function**: Explains roles (nominal, adjectival, adverbial) in context.

- Linguistic Analysis
    - **Focus Areas**:
        - **Phonetics**: Studies sound production and perception.
        - **Phonology**: Explores sound systems.
        - **Morphology**: Investigates word structures.
        - **Syntax**: Examines sentence formation rules.
        - **Semantics**: Analyzes meaning.
        - **Pragmatics**: Studies social use of language.

- Lexicology
    - **Word Analysis**: Examines formation, usage, and relationships between words.

- Artificial Intelligence
    - **Definition**: Simulates human intelligence to mimic cognitive tasks.
        - **Strong AI**: Abstract reasoning and human-level thinking (future goal).
        - **Weak AI**: Pattern-based automation (e.g., driving, translation).

- AI Techniques
    - **Logic and Rules-Based**: Top-down rules for reasoning.
    - **Machine Learning**: Pattern detection for self-learning systems.
 
- Branches of Artificial Intelligence
    - Artificial Intelligence (AI) encompasses several key techniques and applications to solve real-world problems, including:  
        - Machine Learning  
        - Deep Learning  
        - Natural Language Processing  
        - Robotics  
        - Expert Systems  
        - Fuzzy Logic  


- Machine Learning (ML)
    - Machine learning, a subset of AI, enables systems to learn and improve from experience without explicit programming. ML systems analyze data, identify patterns, and make decisions with minimal human intervention. Its core goal is to allow computers to learn autonomously and refine their actions accordingly.
    - ML focuses on developing algorithms that transform data into intelligent action. These algorithms find applications in various domains, such as predictive modeling and decision-making.


- Deep Learning (DL)
    - Deep learning is a specialized subset of ML that uses neural networks with multiple layers. These networks are loosely inspired by the human brain and are suited to processing large datasets. Additional hidden layers let a model learn more abstract patterns, which can improve prediction accuracy.


- Robotics
    - Robotics integrates engineering and AI to design, manufacture, and operate robots. These intelligent machines can assist humans in tasks ranging from industrial automation to healthcare services. Forms of robotics include humanoid robots and software-based robotic process automation (RPA).


- Expert Systems
    - Expert systems simulate human expertise using AI technologies. They combine a knowledge base of facts and rules with inference engines to solve domain-specific problems. While they complement human experts, they are not designed to replace them.


- Fuzzy Logic
    - Fuzzy logic introduces degrees of truth to computing, as opposed to binary true/false logic. It allows systems to handle uncertainty and approximate reasoning effectively, particularly in control systems and decision-making applications.
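
The idea of "degrees of truth" can be made concrete with a membership function. The sketch below is only an illustration: the function names and the 160 cm and 190 cm thresholds are assumptions, and the min/max/complement operators are one common choice among several.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Return the degree (0.0 to 1.0) to which a height counts as 'tall'.

    The thresholds are illustrative assumptions, not standard values.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two thresholds
    return (height_cm - low) / (high - low)


def fuzzy_and(a, b):
    """A common fuzzy AND: the minimum of the two degrees."""
    return min(a, b)


def fuzzy_or(a, b):
    """A common fuzzy OR: the maximum of the two degrees."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy NOT: the complement of the degree."""
    return 1.0 - a


for height in (150, 175, 195):
    print(height, tall_membership(height))
```

A binary rule would have to pick one cut-off for "tall", whereas here a height of 175 cm is tall to degree 0.5.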


- Natural Language Processing (NLP)
    - NLP enables computers to understand, interpret, and respond to human language. With applications in medical research, search engines, and business intelligence, NLP encompasses:  
    - **Natural Language Understanding (NLU):** Focused on interpreting input (text or speech) and identifying intents and entities.  
    - **Natural Language Generation (NLG):** NLG uses AI to generate written or spoken language from structured data. It includes processes like content analysis, data understanding, and grammatical structuring to produce human-like text. NLG is widely used for news reporting, customer messaging, and business content creation. 
    - **Applications:**  
        - **Interactive Voice Response (IVR):** Enhances customer service through voice-enabled systems.  
        - **Chatbots:** Automate customer support using predefined scripts.  
        - **Machine Translation:** Automates text translation using AI models.  
        - **Conversational Interfaces:** Powers devices like Amazon Alexa and Google Home.  




- Computational Linguistics
    - Computational Linguistics (CL) combines linguistics and computer science to analyze language. Applications include machine translation, speech recognition, text summarization, and building conversational agents. Approaches in CL include:  
    - **Corpus-based and Structural Approaches:** Analyze large language datasets.  
    - **Interactive Approaches:** Use text or speech inputs to generate responses.  
    - **Developmental Approaches:** Mimic language acquisition processes for learning over time.  
 

$$\text{Deep Parsing and Tools for NLP}$$

- Syntactic Parsing 
  - Syntax refers to the arrangement of words in sentences.
  - Syntax structures define parts of speech and sentence trees.
  - Syntax governs how sentences are structured in terms of noun, verb, and prepositional phrases.

- Syntactic Structure 
  - Sentence structure: Subject (NP) + Verb Phrase (VP) + Prepositional Phrase (PP).
  - Noun Phrase (NP): Determiner + Noun.
  - Verb Phrase (VP): Verb + combinations.
  - Prepositional Phrase (PP): Preposition + Noun Phrase.

- Examples 
  - *"The boy ate the pancakes"*:
    - The boy: NP, ate: Verb, the pancakes: NP.
  - *"The boy ate the pancakes under the door"*:
    - Syntactically well-formed, but semantically odd: the grammar accepts it even though the meaning is implausible.
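
The phrase rules above can be sketched as a tiny hand-written parser. This is a toy under stated assumptions: the `LEXICON` dictionary, the function names, and the rule set (S -> NP VP, NP -> Det N, VP -> V NP [PP], PP -> P NP) are made up for these two example sentences and are not how a production parser works.

```python
# Hypothetical mini-lexicon covering only the example sentences
LEXICON = {
    "the": "Det", "boy": "N", "pancakes": "N", "door": "N",
    "ate": "V", "under": "P",
}


def parse_np(tags, i):
    """NP -> Det N. Return (tree, next_index) or None."""
    if i + 1 < len(tags) and tags[i][1] == "Det" and tags[i + 1][1] == "N":
        return ("NP", tags[i][0], tags[i + 1][0]), i + 2
    return None


def parse_pp(tags, i):
    """PP -> P NP. Return (tree, next_index) or None."""
    if i < len(tags) and tags[i][1] == "P":
        np = parse_np(tags, i + 1)
        if np:
            return ("PP", tags[i][0], np[0]), np[1]
    return None


def parse_vp(tags, i):
    """VP -> V NP, optionally followed by a PP."""
    if i < len(tags) and tags[i][1] == "V":
        np = parse_np(tags, i + 1)
        if np:
            tree, j = ("VP", tags[i][0], np[0]), np[1]
            pp = parse_pp(tags, j)
            if pp:
                tree, j = tree + (pp[0],), pp[1]
            return tree, j
    return None


def parse_sentence(sentence):
    """S -> NP VP. Return a nested tuple tree, or None if the rules do not fit."""
    words = sentence.lower().split()
    if any(word not in LEXICON for word in words):
        return None
    tags = [(word, LEXICON[word]) for word in words]
    np = parse_np(tags, 0)
    if not np:
        return None
    vp = parse_vp(tags, np[1])
    if not vp or vp[1] != len(tags):
        return None
    return ("S", np[0], vp[0])


print(parse_sentence("The boy ate the pancakes"))
print(parse_sentence("The boy ate the pancakes under the door"))
```

Both sentences parse, which illustrates the point of the second example: a purely syntactic check has no way to notice that pancakes are rarely eaten under a door.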

- Text Syntax Components 
  - POS tags specify word functions (noun, verb, etc.).
  - Dependency grammar captures word relationships in sentences.

- Role of a Parser 
  - A parser checks syntax and builds a structure (e.g., parse tree).
  - It splits sentences into subjects and related phrases.


- Semantic Parsing 
  - Converts natural language into machine-understandable meaning.
  - Used in machine translation, QA, and code generation.

- Example 
  - *"The price of bananas increased by 5%"* — words like "increased" are predicates, and "the price of bananas" is an argument.
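
As a rough illustration of turning such a sentence into a machine-readable predicate and arguments, the sketch below uses a single regular expression. The `ChangeEvent` class, its field names, and the pattern are assumptions for this one sentence shape; real semantic parsers are learned or grammar-driven rather than one regex.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    predicate: str   # the verb acting as predicate, e.g. "increased"
    theme: str       # the argument that changed, e.g. "The price of bananas"
    amount: float    # size of the change
    unit: str        # "%" or a generic placeholder


# Hypothetical pattern: "<theme> increased|decreased by <number>[%]"
PATTERN = re.compile(
    r"^(?P<theme>.+?)\s+(?P<predicate>increased|decreased)\s+by\s+"
    r"(?P<amount>\d+(?:\.\d+)?)(?P<unit>%?)\.?$",
    re.IGNORECASE,
)


def parse_change(sentence: str) -> Optional[ChangeEvent]:
    """Return a structured event, or None if the sentence does not match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return ChangeEvent(
        predicate=match["predicate"].lower(),
        theme=match["theme"],
        amount=float(match["amount"]),
        unit=match["unit"] or "units",
    )


print(parse_change("The price of bananas increased by 5%"))
```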


- Information Extraction 
  - Extracts relevant info from unstructured data.
  - Saves time and reduces human error.
  - Uses NLP algorithms for tasks like summarizing, extracting data from websites, etc.

- Web Scraping 
  - Collects raw data from websites using Python tools (e.g., urllib).
  - Be mindful of website terms and avoid overloading servers.
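
One way to respect a site's crawling rules is to consult its `robots.txt` before fetching pages. The sketch below parses an inline, made-up `robots.txt` so it runs offline; the rules, the `my-crawler` user agent, and the `example.com` URLs are placeholders. With a live site you would typically call `set_url()` and `read()` on the parser instead, and pause between requests.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used here so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

for url in ("https://example.com/index.html",
            "https://example.com/private/data.html"):
    print(url, "allowed:", rules.can_fetch("my-crawler", url))

# Requested pause between requests, in seconds (None if not specified)
print("crawl delay:", rules.crawl_delay("my-crawler"))
```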

- Text Summarization 
  - Summarizes long texts into shorter, informative versions.
  - Helps save reading time and improve indexing.

  - Summarization Types 
    - **Input Type**: Single or multi-document.
    - **Purpose**: Generic, domain-specific, or query-based.
    - **Output**: Extractive or abstractive (generating new sentences).

  - TextRank Algorithm 
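    - Graph-based extractive summarization: sentences are nodes, edges are weighted by sentence similarity (e.g., word overlap), and a PageRank-style score selects the top sentences.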

  - LexRank Algorithm 
    - Ranks sentences by their centrality in a graph of pairwise sentence similarities (typically TF-IDF cosine similarity).

  - Latent Semantic Analysis (LSA)
    - Uses singular value decomposition (SVD) for extractive summarization.

  - GPT Transformers 
    - Abstractive summarization using GPT-2 for generating human-like summaries.
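
The graph-based extractive idea behind TextRank and LexRank can be sketched in a few lines. This is a deliberately simplified version: it scores each sentence by degree centrality (the sum of its similarities to the other sentences) using Jaccard word overlap, whereas the published algorithms use PageRank-style iteration and, for LexRank, TF-IDF cosine similarity. All function names are hypothetical.

```python
import re


def tokenize(sentence):
    """Lower-case a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sets of words."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_sentences(sentences):
    """Score each sentence by its total similarity to all other sentences."""
    bags = [tokenize(s) for s in sentences]
    scores = []
    for i, bag in enumerate(bags):
        score = sum(jaccard(bag, other)
                    for j, other in enumerate(bags) if j != i)
        scores.append(score)
    return scores


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


docs = [
    "Cats chase mice in the garden.",
    "Mice hide from cats in the garden.",
    "The stock market fell sharply today.",
]
print(rank_sentences(docs))
print(summarize(docs, k=1))
```

The off-topic third sentence receives the lowest score because it shares almost no words with the others. A fuller version would usually remove stopwords before comparing sentences.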


- Anaphora Resolution 
  - Resolves pronouns or noun phrases to the entities they refer to (e.g., "He" refers to "John").

- Discourse Integration 
  - Examines how previous sentences affect the current sentence.

- Pragmatic Analysis 
  - Interprets meaning from context and from conversational norms such as the cooperative principle.


- Ontology in NLP 
  - Formal representation of domain knowledge (concepts, relationships).
  - Enhances understanding of words and sentences.
  - Ontology Types 
    - **Domain-Specific**: Healthcare, finance, etc.
    - **General-Purpose**: Common concepts.
    - **Upper Ontologies**: Frameworks for specific ontologies.

- Benefits of Ontologies 
  - Improves accuracy and disambiguation.
  - Facilitates information sharing and scalability.
  - Web Ontology Language (OWL)
    - Represents knowledge in machine-readable format (e.g., for e-commerce, healthcare, and research).

<div style="
    background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    font-size: 20px;
    font-weight: bold;
    text-align: center;">
  Basic Python
</div>


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import re
sent = 'They told that their ages are 25 26 and 31 respectively.'

# Find the average of the ages mentioned in the sentence
ar = [int(i) for i in re.findall(r'\d{2,3}', sent)]
print(f'Average age : {sum(ar)/len(ar) : .2f}')

ar=[int(i) for i in sent.split(" ") if i.isdigit()]
print(f'Average age : {sum(ar)/len(ar) : .2f}')

Average age :  27.33
Average age :  27.33


In [4]:
names = ['अजय', 'विजय', 'प्रिया', 'अतुल' ] 
for name in names:
    if name.startswith('अ'):
        print(name)

print(' '.join(names))

print('विजय'.replace('वि', 'अ'))

print(len(names[2]))

# Iterate over the code points of the third name, 'प्रिया'
for ch in names[2]:
    print(ch, end=' ')

अजय
अतुल
अजय विजय प्रिया अतुल
अजय
6
प ् र ि य ा 

In [5]:
# The context manager closes the file even if a write fails
with open('names.txt', 'w', encoding='utf8') as f:
    for _ in range(10):
        f.write('प्रिया\n')

In [6]:
ord('r'),ord('र'), ord('1'), chr(2352), ord('य')

(114, 2352, 49, 'र', 2351)

In [7]:
x = '\u0939\u092A'
x

'हप'

In [8]:
x = '😊❤️😍👍😎'
ord(x[2])  # x[1] is the heart (U+2764); x[2] is the invisible variation selector U+FE0F that follows it

65039

In [1]:
for x in 'पकोडा':
    print(ord(x), end=' ')

2346 2325 2379 2337 2366 
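
The code points printed above are easier to read alongside their Unicode names and categories. The sketch below uses the standard `unicodedata` module; the helper names are made up, and counting non-mark code points is only a rough stand-in for counting user-perceived characters, since conjuncts such as प्र form a single visual cluster.

```python
import unicodedata


def describe(text):
    """Return (code point, category, name) for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def count_base_characters(text):
    """Count code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))


for row in describe("पकोडा"):
    print(row)

print(len("पकोडा"), count_base_characters("पकोडा"))
```

Vowel signs and the virama are stored as separate code points, which is why `len()` on Devanagari strings is larger than the number of letters a reader sees.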

<div style="
    background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    font-size: 24px;
    font-weight: bold;
    text-align: center;">
  Tokenization
</div>


In [1]:
# !pip uninstall numpy scipy -y
# !pip install numpy==1.26.4 scipy==1.11.3
## !pip install --force-reinstall numpy==1.26.4 scipy


In [2]:
import nltk

In [3]:
nltk.download('punkt')                                # tokenizer
nltk.download('stopwords')                            # collection of stopwords
nltk.download('wordnet')                              # wordnet -> database of english words
nltk.download('omw-1.4')                              # open multilingual wordnet
nltk.download('averaged_perceptron_tagger')           # POS Tagger
nltk.download('indian')                               # indian language POS tagger
nltk.download('maxent_ne_chunker')                    # maxent chunking
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package indian to
[nltk_data]     C:\Users\DAI.STUDENTSDC\AppData\Roaming\nltk_data...
[nltk_d

True

<div style="
    background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    font-size: 20px;
    font-weight: bold;
    text-align: center;">
  Word Tokenization
</div>


In [4]:
from nltk.tokenize import word_tokenize

In [5]:
sent = 'They told that their ages of mr. are 25, 26.2 and 31 respectively.'
print(sent.split())
print(word_tokenize(sent))

['They', 'told', 'that', 'their', 'ages', 'of', 'mr.', 'are', '25,', '26.2', 'and', '31', 'respectively.']
['They', 'told', 'that', 'their', 'ages', 'of', 'mr.', 'are', '25', ',', '26.2', 'and', '31', 'respectively', '.']


In [6]:
from string import punctuation

sent = '''Hello friends!
How are you? Welcome to the world of python programming.'''

print(sent.split())
print(word_tokenize(sent))

# Share of tokens that are punctuation marks
tokens = word_tokenize(sent)
punct_ratio = sum(tok in punctuation for tok in tokens) / len(tokens)

print(f"Percentage of punctuation {punct_ratio*100}%")

['Hello', 'friends!', 'How', 'are', 'you?', 'Welcome', 'to', 'the', 'world', 'of', 'python', 'programming.']
['Hello', 'friends', '!', 'How', 'are', 'you', '?', 'Welcome', 'to', 'the', 'world', 'of', 'python', 'programming', '.']
Percentage of punctuation 20.0%


<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
  Other Tokenizers
</div>


In [9]:
sent = '''
छत्रपति शिवाजी महाराज वस्तु संग्रहालय मुंबई, (पूर्व नाम 'The Prince of Wales Museum of Western India') मुम्बई का मुख्य संग्रहालय है। 
इसका निर्माण वेल्स के राजकुमार के भारत यात्रा के समय मुम्बई के प्रतिष्ठित उद्योगपतियो और नागरिकों से प्राप्त सहायता मुम्बई के सरकार द्वारा स्मारक के रूप में निर्मित किया गया था। 
यह भव्य भवन दक्षिण मुम्बई के फोर्ट विलियम मे, एल्फिंस्टन कालेज के सामने है। 
इसके सामने रीगल सिनिमा और पुलिस आयुक्त कार्यालय स्थित है।
बगल मे 'डॅविड ससून पुस्तक संग्रहालय', 'व्याट्स्न हॉटेल्', भी है। 
आगे समुद्र के तरफ जाने पर 'गेटवे ऑफ़ इन्डिया' आता है।
'''

word_tokenize(sent)

['छत्रपति',
 'शिवाजी',
 'महाराज',
 'वस्तु',
 'संग्रहालय',
 'मुंबई',
 ',',
 '(',
 'पूर्व',
 'नाम',
 "'The",
 'Prince',
 'of',
 'Wales',
 'Museum',
 'of',
 'Western',
 'India',
 "'",
 ')',
 'मुम्बई',
 'का',
 'मुख्य',
 'संग्रहालय',
 'है।',
 'इसका',
 'निर्माण',
 'वेल्स',
 'के',
 'राजकुमार',
 'के',
 'भारत',
 'यात्रा',
 'के',
 'समय',
 'मुम्बई',
 'के',
 'प्रतिष्ठित',
 'उद्योगपतियो',
 'और',
 'नागरिकों',
 'से',
 'प्राप्त',
 'सहायता',
 'मुम्बई',
 'के',
 'सरकार',
 'द्वारा',
 'स्मारक',
 'के',
 'रूप',
 'में',
 'निर्मित',
 'किया',
 'गया',
 'था।',
 'यह',
 'भव्य',
 'भवन',
 'दक्षिण',
 'मुम्बई',
 'के',
 'फोर्ट',
 'विलियम',
 'मे',
 ',',
 'एल्फिंस्टन',
 'कालेज',
 'के',
 'सामने',
 'है।',
 'इसके',
 'सामने',
 'रीगल',
 'सिनिमा',
 'और',
 'पुलिस',
 'आयुक्त',
 'कार्यालय',
 'स्थित',
 'है।',
 'बगल',
 'मे',
 "'",
 'डॅविड',
 'ससून',
 'पुस्तक',
 'संग्रहालय',
 "'",
 ',',
 "'",
 'व्याट्स्न',
 'हॉटेल्',
 "'",
 ',',
 'भी',
 'है।',
 'आगे',
 'समुद्र',
 'के',
 'तरफ',
 'जाने',
 'पर',
 "'",
 'गेटवे',
 'ऑफ़',
 'इन्डिया',
 "'",

In [10]:
sent = """
ムンバイのチャトラパティ シヴァージー マハラジ博物館 (以前は「西インドのプリンス オブ ウェールズ博物館」として知られていました) は、ムンバイの主要な博物館です。
この記念碑は、ウェールズ皇太子のインド訪問中にムンバイ政府が著名な実業家やムンバイ市民の協力を得て記念碑として建てたものです。
この壮大な建物は、南ムンバイのフォート ウィリアム、エルフィンストーン大学の向かいにあります。
リーガルシネマと警察庁舎が目の前にあります。
近くには「デヴィッド・サスーン・ブック・ミュージアム」、「ヴャトン・ホテル」もあります。
さらに海のほうへ行くと「インド門」があります。"""

word_tokenize(sent)

['ムンバイのチャトラパティ',
 'シヴァージー',
 'マハラジ博物館',
 '(',
 '以前は「西インドのプリンス',
 'オブ',
 'ウェールズ博物館」として知られていました',
 ')',
 'は、ムンバイの主要な博物館です。',
 'この記念碑は、ウェールズ皇太子のインド訪問中にムンバイ政府が著名な実業家やムンバイ市民の協力を得て記念碑として建てたものです。',
 'この壮大な建物は、南ムンバイのフォート',
 'ウィリアム、エルフィンストーン大学の向かいにあります。',
 'リーガルシネマと警察庁舎が目の前にあります。',
 '近くには「デヴィッド・サスーン・ブック・ミュージアム」、「ヴャトン・ホテル」もあります。',
 'さらに海のほうへ行くと「インド門」があります。']
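
The output above shows `word_tokenize` returning whole clauses, because Japanese does not put spaces between words. As a crude illustration only, the sketch below splits text wherever the script changes (katakana, hiragana, kanji, Latin). The pattern and function name are assumptions, and the result is not real word segmentation: a hiragana run may contain several words, and proper segmentation needs a morphological analyzer such as MeCab.

```python
import re

# Runs of one script at a time; anything else becomes a single-character token
SCRIPT_RUNS = re.compile(
    r"[\u30A0-\u30FF]+"     # katakana (the block also contains the long-vowel mark and middle dot)
    r"|[\u3040-\u309F]+"    # hiragana
    r"|[\u4E00-\u9FFF]+"    # kanji (CJK unified ideographs)
    r"|[A-Za-z0-9]+"        # Latin letters and digits
    r"|\S"                  # any other non-space character, one at a time
)


def script_run_tokenize(text):
    """Split text into runs of the same script (a rough heuristic)."""
    return SCRIPT_RUNS.findall(text)


print(script_run_tokenize("ムンバイの博物館です。"))
```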

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
  Sentence Tokenizers
</div>


In [63]:
from nltk.tokenize import sent_tokenize
from functools import reduce

In [13]:
sent = '''Hello friends!
How are you? Welcome to the world of python programming.'''

sent_tokenize(sent)

['Hello friends!',
 'How are you?',
 'Welcome to the world of python programming.']
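
Hindi prose normally ends sentences with the danda (।) rather than a full stop, as in the museum passage tokenized earlier, where tokens such as 'है।' kept the danda attached. Assuming the default English sentence model does not treat the danda as a terminator, one workaround is a small regex splitter. The function name and pattern below are illustrative, not part of NLTK.

```python
import re

# Split on whitespace that follows a danda, double danda, '?' or '!'
DANDA_SPLIT = re.compile(r"(?<=[\u0964\u0965?!])\s+")


def split_hindi_sentences(text):
    """Split text into sentences at the danda and similar terminators."""
    return [s for s in DANDA_SPLIT.split(text.strip()) if s]


sample = "यह भव्य भवन है। इसके सामने सिनेमा है।\nआगे समुद्र है।"
for sentence in split_hindi_sentences(sample):
    print(sentence)
```

The terminator stays attached to its sentence because the pattern uses a lookbehind instead of consuming the danda.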

In [18]:
sent = '''
छत्रपति शिवाजी महाराज वस्तु संग्रहालय मुंबई, (पूर्व नाम 'The Prince of Wales Museum of Western India') मुम्बई का मुख्य संग्रहालय है.
इसका निर्माण वेल्स के राजकुमार के भारत यात्रा के समय मुम्बई के प्रतिष्ठित उद्योगपतियो और नागरिकों से प्राप्त सहायता मुम्बई के सरकार द्वारा स्मारक के रूप में निर्मित किया गया था.
यह भव्य भवन दक्षिण मुम्बई के फोर्ट विलियम मे, एल्फिंस्टन कालेज के सामने है.
इसके सामने रीगल सिनिमा और पुलिस आयुक्त कार्यालय स्थित है.
बगल मे 'डॅविड ससून पुस्तक संग्रहालय', 'व्याट्स्न हॉटेल्', भी है. 
आगे समुद्र के तरफ जाने पर 'गेटवे ऑफ़ इन्डिया' आता है.
'''

sent_tokenize(sent)

["\nछत्रपति शिवाजी महाराज वस्तु संग्रहालय मुंबई, (पूर्व नाम 'The Prince of Wales Museum of Western India') मुम्बई का मुख्य संग्रहालय है.",
 'इसका निर्माण वेल्स के राजकुमार के भारत यात्रा के समय मुम्बई के प्रतिष्ठित उद्योगपतियो और नागरिकों से प्राप्त सहायता मुम्बई के सरकार द्वारा स्मारक के रूप में निर्मित किया गया था.',
 'यह भव्य भवन दक्षिण मुम्बई के फोर्ट विलियम मे, एल्फिंस्टन कालेज के सामने है.',
 'इसके सामने रीगल सिनिमा और पुलिस आयुक्त कार्यालय स्थित है.',
 "बगल मे 'डॅविड ससून पुस्तक संग्रहालय', 'व्याट्स्न हॉटेल्', भी है.",
 "आगे समुद्र के तरफ जाने पर 'गेटवे ऑफ़ इन्डिया' आता है."]

In [59]:
# Find the sentence with the most occurrences of 'the' (substring match, so words like 'southeast' also count)
sent = """India, officially the Republic of India,[j][20] is a country in South Asia. It is the seventh-largest country in the world by area and the most populous country. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[22][23][24] Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.[25] Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE.[26] By at least 1200 BCE, an archaic form of Sanskrit, an Indo-European language, had diffused into India from the northwest.[27][28] Its evidence today is found in the hymns of the Rigveda. Preserved by an oral tradition that was resolutely vigilant, the Rigveda records the dawning of Hinduism in India.[29] The Dravidian languages of India were supplanted in the northern and western regions.[30] By 400 BCE, stratification and exclusion by caste had emerged within Hinduism,[31] and Buddhism and Jainism had arisen, proclaiming social orders unlinked to heredity.[32] Early political consolidations gave rise to the loose-knit Maurya and Gupta Empires based in the Ganges Basin.[33] Their collective era was suffused with wide-ranging creativity,[34] but also marked by the declining status of women,[35] and the incorporation of untouchability into an organised system of belief.[l][36] The Middle kingdoms exported Sanskrit language, south Indian scripts and religions of Hinduism and Buddhism to the Southeast Asia.
"""
sents = sent_tokenize(sent)

In [60]:
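# Map each sentence index to its substring count of 'the'
res = {i: s.lower().count('the') for i, s in enumerate(sents)}

# Sort by count, highest first, then show the top count and its sentence
ranked = sorted(res.items(), key=lambda x: x[1], reverse=True)
display(ranked[0][1], sents[ranked[0][0]])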

10

'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.'

In [64]:
reduce(
    lambda x, y : x if (x.lower().count('the') > y.lower().count('the')) else y,
    sents
)

'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.'
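
Note that `str.count('the')` counts substrings, so "southeast" and "northern" each contribute a match. If whole-word counts are wanted, a word-boundary regex is one option; the helper names below are made up for illustration.

```python
import re


def count_word(sentence, word="the"):
    """Count whole-word, case-insensitive occurrences of word."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, sentence, flags=re.IGNORECASE))


def sentence_with_most(sentences, word="the"):
    """Return the sentence containing the most whole-word matches."""
    return max(sentences, key=lambda s: count_word(s, word))


sample = ("Bounded by the Indian Ocean on the south, the Arabian Sea on the "
          "southwest, and the Bay of Bengal on the southeast.")

# Substring count versus whole-word count
print(sample.lower().count("the"), count_word(sample))
```

On this shortened sample the substring count is 7 and the whole-word count is 6; the extra match comes from "southeast".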

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
  Whitespace Tokenizers
</div>


In [71]:
from nltk.tokenize import WhitespaceTokenizer

In [99]:
sent = '''Hello friends!
How are you? Welcome to the world of python programming.'''
tk = WhitespaceTokenizer()


display(tk.tokenize(sent))

['Hello',
 'friends!',
 'How',
 'are',
 'you?',
 'Welcome',
 'to',
 'the',
 'world',
 'of',
 'python',
 'programming.']

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Space Tokenizers
</div>


In [None]:
from nltk.tokenize import SpaceTokenizer

In [96]:
sent = '''Hello friends!
How are you? Welcome to the world of python programming.'''


# Splits on the space character only; newlines stay inside the tokens
tk = SpaceTokenizer()

display(sent.split(' '))
display(tk.tokenize(sent)) 

['Hello',
 'friends!\nHow',
 'are',
 'you?',
 'Welcome',
 'to',
 'the',
 'world',
 'of',
 'python',
 'programming.']

['Hello',
 'friends!\nHow',
 'are',
 'you?',
 'Welcome',
 'to',
 'the',
 'world',
 'of',
 'python',
 'programming.']

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Line Tokenizers
</div>


In [84]:
from nltk.tokenize import LineTokenizer

In [88]:
sent = '''Hello friends!
How are you? Welcome to the world of python programming.'''


# Splits on newlines only
tk = LineTokenizer()

display(sent.split('\n'))
display(tk.tokenize(sent)) 

['Hello friends!', 'How are you? Welcome to the world of python programming.']

['Hello friends!', 'How are you? Welcome to the world of python programming.']

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Tab Tokenizers
</div>


In [90]:
from nltk.tokenize import TabTokenizer

In [101]:
sent = '''Hello friends!
How are you? Welcome to\tthe world of python\tprogramming.'''


# Splits on tab characters only
tk = TabTokenizer()

display(sent.split('\t'))
display(tk.tokenize(sent)) 

['Hello friends!\nHow are you? Welcome to',
 'the world of python',
 'programming.']

['Hello friends!\nHow are you? Welcome to',
 'the world of python',
 'programming.']

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Tweet Tokenizers
</div>


In [102]:
from nltk.tokenize import TweetTokenizer

In [113]:
sent = '''Hello 🫣 friends!:)💀
How are you? Welcome :$ to\tthe world of python🐍\tprogramming.<3'''

tk = TweetTokenizer()
# Keeps emojis and common text emoticons such as :) and <3 together as single tokens
 
display(word_tokenize(sent))
display(tk.tokenize(sent)) 

['Hello',
 '🫣',
 'friends',
 '!',
 ':',
 ')',
 '💀',
 'How',
 'are',
 'you',
 '?',
 'Welcome',
 ':',
 '$',
 'to',
 'the',
 'world',
 'of',
 'python🐍',
 'programming.',
 '<',
 '3']

['Hello',
 '🫣',
 'friends',
 '!',
 ':)',
 '💀',
 'How',
 'are',
 'you',
 '?',
 'Welcome',
 ':',
 '$',
 'to',
 'the',
 'world',
 'of',
 'python',
 '🐍',
 'programming',
 '.',
 '<3']

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Multi-Word Expression Tokenizers
</div>


In [114]:
from nltk.tokenize import MWETokenizer

In [128]:
sent = 'Van Rossom is in Pune. We welcomed Van Rossom here. Van Nayak is also in pune doing Majdoori in CDAC'

# A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
tk = MWETokenizer(separator=' ') 

tk.add_mwe(('Van', 'Rossom'))
tk.tokenize(word_tokenize(sent)) 

['Van Rossom',
 'is',
 'in',
 'Pune',
 '.',
 'We',
 'welcomed',
 'Van Rossom',
 'here',
 '.',
 'Van',
 'Nayak',
 'is',
 'also',
 'in',
 'pune',
 'doing',
 'Majdoori',
 'in',
 'CDAC']
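
To make the merging step less of a black box, here is a plain-Python sketch of the same idea: scan the token list and greedily join any registered expression, trying longer expressions first. It is an illustration with a hypothetical function name, not NLTK's implementation.

```python
def merge_mwes(tokens, expressions, separator=" "):
    """Merge registered multi-word expressions into single tokens."""
    # Try longer expressions first so the longest match wins; skip empty ones
    by_length = sorted((tuple(e) for e in expressions if e),
                       key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for expr in by_length:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position: keep the token as it is
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["Van", "Rossom", "is", "in", "Pune", ".", "Van", "Nayak"]
print(merge_mwes(tokens, [("Van", "Rossom")]))
```

As in the output above, "Van Nayak" stays as two tokens because only the registered expression is merged.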

<div style="
  background: linear-gradient(90deg,rgb(251, 255, 10), #ff758c, #ff4d6d);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  font-size: 20px;
  font-weight: bold;
  text-align: center;"
>
Custom Tokenizers
</div>


In [131]:
import re

In [137]:
sent = 'Van Rossom is in Pune. We welcomed Van Rossom here!. Van Nayak in > pune & doing Majdoori in CDAC'

def custom_tokenizer(text):
    return re.split(r"[.,;?!\s]+", text)

tokens = custom_tokenizer(sent)
display(tokens)

['Van',
 'Rossom',
 'is',
 'in',
 'Pune',
 'We',
 'welcomed',
 'Van',
 'Rossom',
 'here',
 'Van',
 'Nayak',
 'in',
 '>',
 'pune',
 '&',
 'doing',
 'Majdoori',
 'in',
 'CDAC']
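
Two caveats with the `re.split` approach: punctuation is discarded, and text that ends with a delimiter yields a trailing empty string. The sketch below shows one alternative that keeps punctuation as separate tokens, plus a variant of the splitter that drops empty strings. The function names are illustrative.

```python
import re

# A run of word characters, or any single character that is neither word nor space
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_keep_punct(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


def tokenize_drop_empty(text):
    """Split on punctuation and whitespace, discarding empty strings."""
    return [tok for tok in re.split(r"[.,;?!\s]+", text) if tok]


print(re.split(r"[.,;?!\s]+", "Van Rossom is in Pune."))
print(tokenize_drop_empty("Van Rossom is in Pune."))
print(tokenize_keep_punct("We welcomed Van Rossom here!. Bye"))
```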