<a href="https://colab.research.google.com/github/sundaybest3/Spring2024/blob/main/Corpus/TTR-and-lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌿 Topics:

## 1. **Type vs. Token**
## 2. Lexical Diversity measures (10 types)

In [1]:
myword="sample"

In [3]:
len(myword)

6

In [5]:
len(myword)
print(len(myword))

6

In [7]:
myword[-3:]

'ple'

In [8]:
myword[3:]

'ple'

In [10]:
myword=["sample", "6","apple"]
# Square brackets create a list, so len() now counts list items (words) rather than characters.

In [11]:
myword[0]

'sample'

In [12]:
len(myword)

3


# Part 1. Type vs. Token

Example: A cat is chasing a mouse.

+ Tokens: A token is each individual occurrence of a word in the text. Tokens are often words, but depending on the analysis they can also include punctuation, numbers, and other characters. Simply put, the token count is the total number of words in a given text.

  + 6 tokens in the given example

+ Types: A type is the unique form of a token, disregarding its frequency of occurrence.

  + 5 types in the given example ("A" and "a" count as one type when case is ignored).
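
The counts for the example sentence can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and it assumes that case is ignored and surrounding punctuation is stripped, which is what yields 5 types.

```python
import string

sentence = "A cat is chasing a mouse."

# Assumption: lowercase each word and strip surrounding punctuation,
# so that "A" and "a" count as the same type.
example_tokens = [w.strip(string.punctuation).lower() for w in sentence.split()]
example_types = set(example_tokens)

print(len(example_tokens))  # 6
print(len(example_types))   # 5
```

Without the lowercasing step, "A" and "a" would be counted as two different types.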

[text samples from Aesop fables](https://aesopsfables.org/)

In [None]:
text = "An ant went to a Mistic-fountain."
len(text)
# len() counts spaces too.

33

> **text.split()** splits a string on whitespace.
> Here `text` is the variable that holds the string defined above.

In [16]:
step1 = text.split()
# With no argument (or with a space as the argument), split() breaks the string into words.
print(step1)
len(step1)

['An', 'ant', 'went', 'to', 'a', 'Mistic-fountain.']

6

In [None]:
step2 = text.split(".") # delimiter here is '.'
# To split into sentences, use a period as the delimiter; the string is then divided sentence by sentence.
print(step2)

['An ant went to a Mistic-fountain', '']
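
The empty string at the end of the list appears because `split(".")` also returns whatever follows the last period, which here is nothing. Below is a minimal sketch of one way to drop such empty pieces before counting sentences. The names `sample`, `pieces`, and `sentences` are illustrative and chosen so that `text` is not overwritten.

```python
sample = "An ant went to a Mistic-fountain."

# split(".") returns the piece after the final period too (an empty string here)
pieces = sample.split(".")

# Keep only the non-empty pieces, with surrounding whitespace removed
sentences = [p.strip() for p in pieces if p.strip()]

print(pieces)
print(sentences)
print(len(sentences))
```

Splitting on a period is only a rough approximation: abbreviations and decimal numbers also contain periods.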


In [17]:
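text=input()
# len() of a string counts characters, including any quotation marks typed at the prompt.
print("Number of characters: ",len(text))
words = text.split()
print("Number of words: ",len(words))
# split(".") also returns the piece after the last period, so this count is one too high.
sents=text.split(".")
print("Number of sentences: ",len(sents))

"An ant went to a Mistic-fountain."
Number of characters:  35
Number of words:  6
Number of sentences:  2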


Using a longer text

In [18]:
# This includes all characters: letters, numbers, spaces, punctuation marks, and special characters.

text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""
# Triple single quotes (''' ''') work too; triple-quoted strings can span multiple lines.
print("Number of characters: ", len(text))

Number of characters:  554


In [None]:
tokens = text.split()
len(tokens)
# Splitting on whitespace gives the words (tokens).

107

In [None]:
types = set(tokens)
# set() keeps only the unique strings, so its length is the number of types.
len(types)


76

In [20]:
a1="banana, apple, apple, banana, grapes"
words=a1.split(", ")
set(words)

{'apple', 'banana', 'grapes'}

Define a function

In [21]:
def count_types_and_tokens(text):
    tokens = text.split()
    types = set(tokens)
    return len(types), len(tokens)
# def defines a function: the def line ends with a colon, and the body must be indented consistently.

In [22]:
count_types_and_tokens(text)

(76, 107)

In [28]:
#@markdown Hiding code
# Count types and tokens in the text defined above

num_types, num_tokens = count_types_and_tokens(text)
print("Number of types:", num_types)
print("Number of tokens:", num_tokens)
# These two counts are what the Type-Token Ratio (TTR) is computed from.

Number of types: 76
Number of tokens: 107
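
Because `text.split()` only splits on whitespace, punctuation stays attached to words, so "and," and "and" count as different types, as do "The" and "the". The sketch below is one possible variant that normalizes tokens before counting. The function name `count_types_and_tokens_normalized` and the sentence in `demo` are made up for illustration, and whether to normalize at all is an analysis decision rather than a fixed rule.

```python
import string

def count_types_and_tokens_normalized(text):
    """Count types and tokens after lowercasing and stripping punctuation."""
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    tokens = [t for t in tokens if t]  # drop items that were only punctuation
    return len(set(tokens)), len(tokens)

# Made-up demo sentence (not from the fable)
demo = "The ant saw the dove, and the dove saw the ant."
print(count_types_and_tokens_normalized(demo))  # (5, 11)
```

Run on the fable text, this variant would likely report fewer types than the 76 counted above, since capitalized and punctuated forms collapse into one type.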


# Lemmatization

+ lemma: a dictionary form or base form of a set of words.
+ example: 'run, runs, running, ran' => 'run'
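
To make the idea concrete before turning to NLTK, here is a toy sketch that lemmatizes with a hand-written lookup table. The table `toy_lemmas` and the function `toy_lemmatize` are hypothetical and only cover the example above; this is not how WordNet works internally.

```python
# Toy lemma lookup: a hand-written table covering only the example words.
toy_lemmas = {"run": "run", "runs": "run", "running": "run", "ran": "run"}

def toy_lemmatize(word):
    """Return the base form from the toy table, or the word itself if unknown."""
    return toy_lemmas.get(word.lower(), word)

print([toy_lemmatize(w) for w in ["run", "runs", "running", "ran"]])
# ['run', 'run', 'run', 'run']
```

A real lemmatizer replaces the hand-written table with a large dictionary plus rules for regular endings.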

We will use modules from the NLTK library.

In [29]:
!pip install nltk



In [30]:
# import is like opening a delivered package; WordNetLemmatizer is the tool inside.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download the WordNet, POS tagger, and tokenizer resources (if not already downloaded)
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True


The function below, get_wordnet_pos, maps the part-of-speech (POS) tags returned by NLTK's pos_tag function to the constants that the WordNet lemmatizer accepts. This mapping matters for accuracy: the lemmatizer needs the grammatical category of each word, and when none is given it treats the word as a noun.

In [31]:
# Define get_wordnet_pos(word)
# pos means part of speech

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
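
To see the mapping in isolation, the sketch below applies the same first-letter rule to a few Penn Treebank tags without calling NLTK. The single letters 'a', 'n', 'v', and 'r' are the values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB, and wordnet.ADV. The names `WORDNET_POS` and `penn_to_wordnet` are hypothetical.

```python
# Same first-letter rule as get_wordnet_pos, but taking a tag directly.
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter; default to noun."""
    return WORDNET_POS.get(tag[:1].upper(), "n")

for tag in ["JJ", "NNS", "VBG", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Note that get_wordnet_pos tags each word on its own, so the tagger sees no sentence context. Tagging the whole token list once and mapping each tag is an alternative worth considering.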

In [33]:
sentence = "The cats are running faster than the dogs"

In [None]:
# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Lemmatization using POS tags
lemmatized_output = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

print('Original Sentence:', sentence)
print('Lemmatized Sentence:', ' '.join(lemmatized_output))

Original Sentence: The cats are running faster than the dogs
Lemmatized Sentence: The cat be run faster than the dog


### Lemmatization practice with our text

In [34]:
text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""

In [41]:
# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Lemmatization using POS tags
lemmatized_output = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

print('Original Sentence:', text)
print('Lemmatized Sentence:', ' '.join(lemmatized_output))

Original Sentence: 
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.

Lemmatized Sentence: An ant go to a fountain to quench his thirst and , tumble in , be almost drown . But a dove that happen to be sit on a neighbor tree saw the ant 's danger and , pluck off a leaf , let it drop into the water before him . The ant mount upon it , be presently waft safely ashore . Just at that time , a fowler be spread his net and be in the act of ensnare the dove , when the ant , perceive his object , bit

In [42]:
len(lemmatized_output)

124

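Let's compare the tokens, types, and lemmatized output of the text. The lemmatized list is longer (124 items vs. 107) because nltk.word_tokenize separates punctuation marks into their own tokens, while text.split() leaves them attached to the words.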

In [43]:
text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""

In [44]:
tokens = text.split()
print(len(tokens))
print(len(lemmatized_output))

107
124


Types in order of first appearance in the text

In [45]:
# Assuming 'tokens' is already defined

types_in_order = []
seen = set()

for token in tokens:
    if token not in seen:
        seen.add(token)
        types_in_order.append(token)

# Now 'types_in_order' contains unique elements from 'tokens' in the order they appear in the text
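
An equivalent, shorter way to get unique items in first-appearance order relies on the fact that dictionaries preserve insertion order (guaranteed since Python 3.7). The sketch below uses a small made-up list; `sample_tokens` and `sample_types_in_order` are illustrative names.

```python
sample_tokens = ["to", "a", "fountain", "to", "quench", "a"]

# dict.fromkeys keeps the first occurrence of each key, in order
sample_types_in_order = list(dict.fromkeys(sample_tokens))

print(sample_types_in_order)  # ['to', 'a', 'fountain', 'quench']
```

The explicit loop with a `seen` set does the same thing and is easier to adapt, for example to lowercase each token before checking it.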


In [46]:
# Creating a dataframe with tokens, types_in_order, lemmatized_output

print(len(tokens))
print(len(types_in_order))
print(len(lemmatized_output))

107
76
124


In [39]:
!pip install pandas
# pandas is used for working with tabular (spreadsheet-like) data.



In [48]:
import pandas as pd

# Assuming tokens, types_in_order, and lemmatized_output are already defined
# and their lengths are 107, 76, 124 respectively
# pandas is imported here to build the DataFrame.
# Extend types_in_order and tokens with 'None' to match the length of lemmatized_output
types_in_order.extend([None] * (len(lemmatized_output) - len(types_in_order)))
tokens.extend([None] * (len(lemmatized_output) - len(tokens)))

# Create the DataFrame
df = pd.DataFrame({
    'Tokens': tokens,
    'Types': types_in_order,
    'Lemmatized Output': lemmatized_output
})

df[1:21]
# Slicing from 1 skips row 0, the first data row; the column names are not counted as a row.

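|    | Tokens   | Types    | Lemmatized Output |
|----|----------|----------|-------------------|
| 1  | ant      | ant      | ant               |
| 2  | went     | went     | go                |
| 3  | to       | to       | to                |
| 4  | a        | a        | a                 |
| 5  | fountain | fountain | fountain          |
| 6  | to       | quench   | to                |
| 7  | quench   | his      | quench            |
| 8  | his      | thirst   | his               |
| 9  | thirst   | and,     | thirst            |
| 10 | and,     | tumbling | and               |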
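
Padding with None lets columns of different lengths share one table, but the rows are not aligned by word: row 6 pairs the token "to" with the type "quench". The padding also changes the original lists in place. As a sketch of a non-mutating alternative, `itertools.zip_longest` from the standard library pads on the fly. The short lists `col_tokens` and `col_types` are made up for illustration.

```python
from itertools import zip_longest

col_tokens = ["An", "ant", "went", "to", "a", "fountain", "to"]
col_types = ["An", "ant", "went", "to", "a", "fountain"]

# zip_longest fills the shorter column with None without changing the lists
rows = list(zip_longest(col_tokens, col_types, fillvalue=None))

for row in rows:
    print(row)

print(len(col_types))  # still 6: the original list is untouched
```

The resulting list of row tuples can be passed to a table library if a DataFrame is still wanted.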


## TTR (Type-to-Token Ratio)

In [50]:
# Recompute tokens and types from the text: 'tokens' was padded with None
# for the DataFrame above, so its length is no longer the token count.
tokens = text.split()
types = set(tokens)
number_of_types = len(types)  # Number of unique words
number_of_tokens = len(tokens)  # Total number of words

# Calculate TTR
TTR = number_of_types / number_of_tokens

print("Type-Token Ratio (TTR):", TTR)

Type-Token Ratio (TTR): 0.7102803738317757
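
TTR is the number of types divided by the number of tokens, so it lies between 0 and 1 and reaches 1 only when every word in the text is unique. Below is a minimal self-contained sketch with a hypothetical helper named `type_token_ratio`. The last line reuses the counts reported earlier for the fable text (76 types, 107 tokens).

```python
def type_token_ratio(tokens):
    """Return types / tokens for a list of tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 5 types / 6 tokens in the lowercased example sentence
print(round(type_token_ratio("a cat is chasing a mouse".split()), 3))  # 0.833

# Counts from the fable text above
print(round(76 / 107, 3))  # 0.71
```

TTR tends to fall as a text gets longer, because common words repeat, so it is best compared across texts of similar length.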