<a href="https://colab.research.google.com/github/zion6570/NLP_2023/blob/main/9_InstallPackages_ImportModlues_CallFunctions_chatGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎁 Python Library
  - Organizes Python modules in a hierarchical directory structure
  - **!pip**: Package manager
  - **!pip install <font color = 'red'>NameLibrary</font>**
> A <font color = 'red'> **Python library**</font> refers to a collection of modules or functions that provide specific functionality, often focused on a particular domain or purpose. Libraries can be used to extend the capabilities of Python by providing pre-written code that can be imported and used in your own programs. Examples of popular Python libraries include NumPy for numerical computing, pandas for data manipulation and analysis, and requests for making HTTP requests. On the other hand, a <font color = 'blue'> **Python package**</font> is a way of organizing related modules into a directory hierarchy. A package is essentially a directory that contains one or more Python module files, along with an optional `__init__.py` file that signifies it as a package. Packages help to organize and structure large codebases by grouping related functionality together. They can also contain sub-packages, creating a nested structure.
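
The difference between a package and a plain module can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper name `is_package` is made up for illustration, and it relies on the fact that package objects carry a `__path__` attribute.

```python
import json   # a package: a directory with __init__.py and submodules
import math   # a single module, not a package


def is_package(module):
    """Return True if the imported module object is a package (hypothetical helper)."""
    # Packages expose __path__, the list of locations searched for their submodules.
    return hasattr(module, "__path__")


print(json.__name__, "is a package:", is_package(json))
print(math.__name__, "is a package:", is_package(math))
```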

# 🎒🎒 Python Modules
  - Composed of Python functions
  - **from package_name import module_name**
  - **import package.module_name** <font color='green'>**[for an external package]**</font>
  - **import module_name** <font color='purple'>**[for a Python built-in package]**</font>
  - **import module_name as Abbreviation**
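
The import forms listed above can be tried without installing anything. This sketch uses standard-library packages and modules (`urllib`, `math`, `statistics`) as stand-ins; the same syntax applies to external packages once they are installed with `!pip install`.

```python
# from package_name import module_name
from urllib import parse

# import package.module_name
import urllib.parse

# import module_name
import math

# import module_name as Abbreviation
import statistics as st

# Both import styles give access to the same module object.
print(parse is urllib.parse)
print(parse.urlparse("https://docs.python.org/3/py-modindex.html").netloc)
print(math.pi)
print(st.mean([1, 2, 3]))
```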


# 🏀 ⚽ ⚾ 🎾 Python functions
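  - Functions that live in an external module
  - import module_name, then call module_name.function_name()
  - from module_name import function_name, then call function_name()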
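
A short sketch of the two ways to call a function from a module, using the standard-library `math` module as the example. Note that a function is never imported with parentheses; the parentheses appear only when the function is called.

```python
# Style 1: import the module, then call the function through the module name.
import math
root_a = math.sqrt(16)

# Style 2: import the function itself, then call it directly.
from math import sqrt
root_b = sqrt(25)

print(root_a, root_b)
```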

### For your information, check out the **Python Module Index**
* [Visit the Python Module Index](https://docs.python.org/3/py-modindex.html)



##<font color = 'purple'> **👀 Install Python Libraries** ⤵️

In [1]:
#@markdown 🐹 👀 🐾 **Pandas** is a popular open-source library for data manipulation and analysis. It offers data structures such as **Series and DataFrame for handling structured data**. It supports tasks such as <font color = 'red'>**indexing, filtering, grouping, and merging data**</font>. Pandas reads and writes various file formats and integrates well with other libraries like NumPy and Matplotlib. It provides an intuitive and efficient way to work with large datasets in Python.

!pip install pandas



In [2]:
!pip install scikit-learn



In [None]:
#@markdown 🐹 👀 🐾 **scikit-learn**, often referred to as sklearn, is a widely used **machine learning library**. It provides a comprehensive collection of tools and algorithms for machine learning tasks such as <font color = 'red'>**classification, regression, clustering, and dimensionality reduction**</font>. With a consistent and user-friendly API, scikit-learn simplifies the process of building machine learning models. It supports data preprocessing, feature selection, model evaluation, and model tuning. The library also offers helpful utilities for handling datasets and implementing machine learning workflows. Overall, scikit-learn is a valuable resource for both beginners and experienced practitioners in the field of machine learning.

!pip install scikit-learn  # Note: the scikit-learn package name contains a hyphen

In [None]:
#@markdown 🐹 👀 🐾 **Matplotlib** is a popular plotting library for Python. It provides a flexible and comprehensive set of tools for creating various types of plots and visualizations. Matplotlib allows customization of plot appearance, axes, labels, and styles. It supports <font color = 'red'>**line plots, scatter plots, bar charts, histograms**</font>, and more. Matplotlib *integrates well with NumPy and Pandas for data manipulation and analysis*. It is widely used for data exploration, presentation, and publication-quality visualizations in scientific computing and data analysis.

!pip install matplotlib

In [4]:
#@markdown 🐹 👀 🐾 **NLTK** (Natural Language Toolkit) is a powerful library for natural language processing (NLP) tasks. It provides tools and resources for tasks like <font color = 'red'>**tokenization, stemming, tagging, parsing, and sentiment analysis**</font>. NLTK offers a wide range of corpora and lexical resources for linguistic analysis. It supports **text classification, text generation, and language modeling**. NLTK includes pre-trained models and algorithms for various NLP tasks, making it suitable for both beginners and advanced users in the field of NLP. Overall, NLTK is a valuable resource for working with *human language data and performing NLP tasks in Python*.

!pip install nltk



In [6]:
#@markdown 🐹 👀 🐾 The **corpus-toolkit** package grew out of courses in corpus linguistics and learner corpus research. The toolkit attempts to balance simplicity of use, broad application, and scalability. Common corpus analyses such as <font color = 'red'>**the calculation of word and n-gram frequency and range, keyness, and collocation**</font> are included. In addition, more advanced analyses such as the **identification of dependency bigrams (e.g., verb-direct object combinations) and their frequency, range, and strength of association** are also included.

!pip install corpus-toolkit  # Note: the corpus-toolkit package name contains a hyphen



In [7]:
import nltk
from nltk.tokenize import word_tokenize , sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## **<font color = 'brown'> Student's activity** ⤵️

### Exercise for <font color = 'red'> installing nltk package, importing its module, and calling its functions

**NLTK: Python library**
* When using **Google Colab**, certain libraries, including **NLTK**, are **pre-installed and available for immediate use**. Colab provides a pre-configured environment with several popular libraries and modules already installed, so you can import the NLTK module directly without explicitly installing the NLTK library.
>
* To use NLTK's features fully, you also need to install additional datasets known as NLTK Data.
>
* To do so, run `nltk.download()` after `import nltk` in your Python code.
>
* **Reference**
  * [wikidocs](https://wikidocs.net/22488)


In [None]:
!pip install nltk #This step can be skipped since it is pre-installed on Google Colab.

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("punkt")

## **<font color = 'brown'> Student's activity** ⤵️

### Exercise for <font color = 'red'> *importing os module and calling its functions*

- The **os module is a built-in module in Python**, meaning it is available by default in any Python installation. You don't need to install it separately or use any package manager. The os module provides functions for interacting with the operating system, such as accessing files and directories, managing processes, and other system-related tasks. You can import and use the os module in your Python programs without any additional installation steps.
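
As a minimal sketch of calling `os` functions, the example below creates a folder, confirms that it exists, and lists its parent directory. It works inside a temporary directory (via the standard-library `tempfile` module, an addition for this illustration) so that it leaves nothing behind, and it uses `os.makedirs(..., exist_ok=True)` so that running it twice does not raise an error.

```python
import os
import tempfile

# Work inside a throwaway directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "txtfolder")
    os.makedirs(folder, exist_ok=True)   # create the folder
    os.makedirs(folder, exist_ok=True)   # repeating the call is safe
    created = os.path.isdir(folder)      # True if the folder now exists
    listing = os.listdir(base)           # names found in the parent directory

print("Created:", created)
print("Contents:", listing)
```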

In [8]:
import os                 # Load Python's built-in os module
os.mkdir("txtfolder")     # Note the period between the os module and the mkdir function. Check for "txtfolder" under Files in Colab.

In [9]:
!pip install nltk
text = "Python programming is a high-level, interpreted programming language known for its simplicity and readability. It emphasizex code readability with its clean syntax, making it easier to write and understand, Python supports multiple programming paradigmsm, including procedural, object-oriented, and functional pogramming. It has a vast standard libaryand a thriving ecosystem of third-party libraries and framewoksm making it suiable for various damains such as web development, data analysis, machine learning, and automation. Python's versatility, ease of use, and extensive community spport have contributed to its popularity among developers of all skill levels."
from nltk.tokenize import sent_tokenize
sentence = sent_tokenize(text)
print('Sentence tokenization: %s' % sentence)

Sentence tokenization: ['Python programming is a high-level, interpreted programming language known for its simplicity and readability.', 'It emphasizex code readability with its clean syntax, making it easier to write and understand, Python supports multiple programming paradigmsm, including procedural, object-oriented, and functional pogramming.', 'It has a vast standard libaryand a thriving ecosystem of third-party libraries and framewoksm making it suiable for various damains such as web development, data analysis, machine learning, and automation.', "Python's versatility, ease of use, and extensive community spport have contributed to its popularity among developers of all skill levels."]


In [None]:
!pip install nltk
text = "Python programming is a high-level, interpreted programming language known for its simplicity and readability. It emphasizes code readability with its clean syntax, making it easier to write and understand. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming. It has a vast standard library and a thriving ecosystem of third-party libraries and frameworks, making it suitable for various domains such as web development, data analysis, machine learning, and automation. Python's versatility, ease of use, and extensive community support have contributed to its popularity among developers of all skill levels."
from nltk.tokenize import sent_tokenize
sentence = sent_tokenize(text)
print('Sentence tokenization: %s' % sentence)


In [None]:
#@markdown 🐹 👀 🐾 The **lexical-diversity** library is a Python package that provides tools and functions for analyzing the lexical diversity of text.  i) <font color = 'red'>**Type-Token Ratio (TTR)**</font> measures the proportion of unique words (types) in a text relative to the total number of words (tokens). It provides insight into vocabulary richness. ii)  <font color = 'red'>**Mean Segmental Type-Token Ratio (MSTTR)**</font> divides the text into consecutive, non-overlapping segments of equal length, calculates the TTR of each segment, and averages those values, so that diversity is assessed over smaller sections of text. iii)  <font color = 'red'>**Moving Average Type-Token Ratio (MATTR)**</font> calculates the TTR within a sliding window as it moves through the text one token at a time, and then averages these TTR values over the entire text. The formula for MATTR: MATTR = (1 / N) * ∑(TTR_i), where TTR_i is the TTR of the i-th window and N is the number of windows.

#@markdown 🐹 👀 🐾 TTR = Total number of types / Total number of words (tokens)

!pip install lexical-diversity
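
The three measures described above can be sketched in plain Python. This is an illustrative implementation written for this notebook, not the code used inside the lexical-diversity package; the function names, the `window` parameter, and the fallback to plain TTR for texts shorter than one window are all assumptions of the sketch.

```python
def ttr(tokens):
    """Type-Token Ratio: number of unique tokens divided by number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def msttr(tokens, window=50):
    """Mean segmental TTR: average TTR of consecutive, non-overlapping segments."""
    # Leftover tokens that do not fill a whole segment are ignored.
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:                      # text shorter than one segment
        return ttr(tokens)
    return sum(ttr(s) for s in segments) / len(segments)


def mattr(tokens, window=50):
    """Moving-average TTR: average TTR of a window sliding one token at a time."""
    if len(tokens) < window:              # text shorter than one window
        return ttr(tokens)
    windows = [tokens[i:i + window]
               for i in range(len(tokens) - window + 1)]
    return sum(ttr(w) for w in windows) / len(windows)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
print("TTR:  ", ttr(tokens))
print("MSTTR:", msttr(tokens, window=4))
print("MATTR:", mattr(tokens, window=4))
```

A tiny window of 4 is used here only because the sample sentence is short; real texts usually call for a much larger window.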

### ⛔ **Shall we write a script?**

##**[Sign up for ChatGPT](https://chat.openai.com)**

After you have created a ChatGPT account, visit **[minjung's github page](https://raw.githubusercontent.com/ms624atyale/Temp_Data/main/TheAesop4Children_1stEpisode.txt)** and copy the content of the _TheAesop4Children_1stEpisode.txt_ file.

1. Ask ChatGPT to write a Python script that calculates lexical diversity (TTR) using the following text: "COPY & PASTE YOUR TEXT."


2. Ask ChatGPT again to include the text, the list of tokens, and the list of types (i.e., unique words).

3. Ask again to include converting tokens to lowercase.

4. Ask again: does a higher value of TTR indicate greater or lower variability of lexical use?

### 1. Ask ChatGPT to write a Python script that calculates lexical diversity (TTR) using the following text: "COPY & PASTE YOUR TEXT."

In [None]:
# Define the text
text = """There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself. So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass. A little later when he lifted his head, the flock was gone. He was all alone. The sun was sinking. Long shadows came creeping over the ground. A chilly little wind came creeping with them making scary noises in the grass. The Kid shivered as he thought of the terrible Wolf. Then he started wildly over the field, bleating for his mother. But not half-way, near a clump of trees, there was the Wolf! The Kid knew there was little hope for him. “Please, Mr. Wolf,” he said trembling, “I know you are going to eat me. But first please pipe me a tune, for I want to dance and be merry as long as I can.“The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily. Meanwhile, the flock was moving slowly homeward. In the still evening air the Wolf's piping carried far. The Shepherd Dogs pricked up their ears. They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture. The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade. Do not let anything turn you from your purpose."""

# Tokenize the text into words (split on whitespace only; punctuation stays attached to the tokens)
words = text.split()

# Calculate TTR (Type-Token Ratio)
unique_words = set(words)
ttr = len(unique_words) / len(words)

# Print the TTR
print("Type-Token Ratio (TTR):", ttr)
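
Because `text.split()` separates tokens on whitespace only, forms such as `Wolf!` and `Wolf.` are counted as different types. The sketch below compares that behavior with a regular-expression tokenizer that lowercases the text and drops punctuation. The pattern is an assumption chosen for this illustration (it keeps internal hyphens and straight apostrophes, as in `grown-up`), not the tokenizer used elsewhere in this notebook.

```python
import re

sample = "There was the Wolf! The Kid shivered as he thought of the Wolf."

# Whitespace splitting keeps punctuation attached: "Wolf!" and "Wolf." differ.
whitespace_tokens = sample.split()

# Regex tokenization on lowercased text keeps only letter sequences,
# allowing internal hyphens and straight apostrophes.
word_tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", sample.lower())

print("Whitespace types:", len(set(whitespace_tokens)), "of", len(whitespace_tokens))
print("Regex types:     ", len(set(word_tokens)), "of", len(word_tokens))
print("TTR (whitespace):", len(set(whitespace_tokens)) / len(whitespace_tokens))
print("TTR (regex):     ", len(set(word_tokens)) / len(word_tokens))
```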

### 2. Ask ChatGPT again to include the text, the list of tokens, and the list of types (i.e., unique words).

In [None]:
# Define the text
text = """There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself. So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass. A little later when he lifted his head, the flock was gone. He was all alone. The sun was sinking. Long shadows came creeping over the ground. A chilly little wind came creeping with them making scary noises in the grass. The Kid shivered as he thought of the terrible Wolf. Then he started wildly over the field, bleating for his mother. But not half-way, near a clump of trees, there was the Wolf! The Kid knew there was little hope for him. “Please, Mr. Wolf,” he said trembling, “I know you are going to eat me. But first please pipe me a tune, for I want to dance and be merry as long as I can.“The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily. Meanwhile, the flock was moving slowly homeward. In the still evening air the Wolf's piping carried far. The Shepherd Dogs pricked up their ears. They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture. The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade. Do not let anything turn you from your purpose."""

# Tokenize the text into words (split on whitespace only; punctuation stays attached to the tokens)
words = text.split()

# Calculate TTR (Type-Token Ratio)
unique_words = set(words)
ttr = len(unique_words) / len(words)

# Print the TTR, list of tokens, and list of unique words (types)
print("Type-Token Ratio (TTR):", ttr)
print("Tokens:", words)
print("Unique words (Types):", sorted(unique_words))

### 3. Ask again to include converting tokens to lowercase.

In [None]:
# Define a function to tokenize and convert to lowercase
def tokenize_and_lower(text):
    # Tokenize the text into words (split on whitespace only; punctuation stays attached to the tokens)
    words = text.split()
    # Convert words to lowercase
    words_lower = [word.lower() for word in words]
    return words_lower

# Define the text
text = """There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself. So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass. A little later when he lifted his head, the flock was gone. He was all alone. The sun was sinking. Long shadows came creeping over the ground. A chilly little wind came creeping with them making scary noises in the grass. The Kid shivered as he thought of the terrible Wolf. Then he started wildly over the field, bleating for his mother. But not half-way, near a clump of trees, there was the Wolf! The Kid knew there was little hope for him. “Please, Mr. Wolf,” he said trembling, “I know you are going to eat me. But first please pipe me a tune, for I want to dance and be merry as long as I can.“ The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily. Meanwhile, the flock was moving slowly homeward. In the still evening air the Wolf's piping carried far. The Shepherd Dogs pricked up their ears. They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture. The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade. Do not let anything turn you from your purpose."""

# Get the tokens in lowercase
tokens_lower = tokenize_and_lower(text)

# Calculate TTR (Type-Token Ratio)
unique_words = set(tokens_lower)
ttr = len(unique_words) / len(tokens_lower)

# Print the TTR, list of tokens, and list of unique words (types)
print("Type-Token Ratio (TTR):", ttr)
print("Tokens in Lowercase:", tokens_lower)
print("Unique words (Types):", sorted(unique_words))

### 4. Ask again: does a higher value of TTR indicate greater or lower variability of lexical use?

Answer from ChatGPT: <font color = 'purple'> A higher value of TTR (Type-Token Ratio) typically indicates greater variability of lexical use. TTR measures the diversity of words in a given text. When TTR is higher, it means that there are more unique words (types) relative to the total number of words (tokens), suggesting a wider range of vocabulary and more diverse lexical usage in the text. In contrast, a lower TTR indicates that a smaller set of words is repeated more frequently, suggesting less diversity and greater repetition of words. So, a higher TTR is associated with a more varied lexical repertoire.</font>
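
The relationship described in this answer can be checked with two short made-up sentences, one repetitive and one varied. This is a minimal sketch; the helper name `type_token_ratio` and both sample sentences are invented for the illustration.

```python
def type_token_ratio(text):
    """Lowercase the text, split on whitespace, and return types / tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)


repetitive = "the wolf ran and the wolf ran and the wolf ran"
varied = "a chilly little wind came creeping over the quiet ground"

print("Repetitive text TTR:", type_token_ratio(repetitive))
print("Varied text TTR:    ", type_token_ratio(varied))
```

The repetitive sentence reuses the same four words, so its TTR is lower than that of the varied sentence, in which every word appears once.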

# **👀 <font color = 'red'> Do it yourself first!**  
- ✅ Now **load the text from a URL** in the Temp_Data repository on your GitHub account **into** your current Jupyter notebook on Colab, and calculate TTR again.
>
- ✅ Calculate TTR with the tokens converted to lowercase!

In [None]:
#@markdown ##**🎯🎯 In the following, you can find one possible answer.**

def tokenize_and_lower(text):
    # Tokenize the text into words (split on whitespace only; punctuation stays attached to the tokens)
    words = text.split()
    # Convert words to lowercase
    words_lower = [word.lower() for word in words]
    return words_lower

import urllib.request
url = "https://raw.githubusercontent.com/ms624atyale/Temp_Data/main/TheAesop4Children_1stEpisode.txt"
response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')
print(content)

# Get the tokens in lowercase
tokens_lower = tokenize_and_lower(content)  # use the text downloaded from the URL

# Calculate TTR (Type-Token Ratio)
unique_words = set(tokens_lower)
ttr = len(unique_words) / len(tokens_lower)

# Print the TTR, list of tokens, and list of unique words (types)
print("Type-Token Ratio (TTR):", ttr)
print("Tokens in Lowercase:", tokens_lower)
print("Unique words (Types):", sorted(unique_words))
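
The download and the TTR calculation can also be wrapped in one reusable function. The sketch below is one possible arrangement; the function name `ttr_from_url` is hypothetical. To keep the example runnable offline, it is demonstrated on a temporary local file addressed through a `file://` URL; passing the raw GitHub URL used in this notebook works the same way when a network connection is available.

```python
import pathlib
import tempfile
import urllib.request


def ttr_from_url(url):
    """Fetch UTF-8 text from url and return (ttr, tokens, types). Hypothetical helper."""
    with urllib.request.urlopen(url) as response:
        content = response.read().decode("utf-8")
    # Split on whitespace and lowercase every token.
    tokens = [word.lower() for word in content.split()]
    types = set(tokens)
    return len(types) / len(tokens), tokens, types


# Demonstration with a local file:// URL so that no network access is needed.
with tempfile.TemporaryDirectory() as folder:
    sample = pathlib.Path(folder) / "sample.txt"
    sample.write_text("The Wolf piped and the Kid danced", encoding="utf-8")
    ratio, tokens, types = ttr_from_url(sample.as_uri())

print("Type-Token Ratio (TTR):", ratio)
print("Tokens in Lowercase:", tokens)
print("Unique words (Types):", sorted(types))
```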


# **👀 <font color = 'red'> Now you are going to do...**  
- Find an article on a web page of your choice.
- This time, visit **[www.npr.org](https://www.npr.org/2023/11/08/1211483883/olympics-russia-israel-gaza-sanctions-ioc)** for class activity.

In [None]:
#@markdown ### 🍎 Write a script for TTR, copying and pasting text from the web page into the `text` variable.


# Define a function to tokenize and convert to lowercase
def tokenize_and_lower(text):
    # Tokenize the text into words (split on whitespace only; punctuation stays attached to the tokens)
    words = text.split()
    # Convert words to lowercase
    words_lower = [word.lower() for word in words]
    return words_lower

# Define the text by copying and pasting it from a website.
text = """
When violence in Israel and Gaza escalated after last month's terror attack by Hamas, the International Olympic Committee (IOC) issued a statement warning against "discriminatory behavior" against Israeli athletes competing around the world.\n
"[A]thletes cannot be held responsible for the actions of their governments," an IOC spokesperson told the German Press Agency, promising "swift action" if incidents occur.
"""

# Get the tokens in lowercase
tokens_lower = tokenize_and_lower(text)

# Calculate TTR (Type-Token Ratio)
unique_words = set(tokens_lower)
ttr = len(unique_words) / len(tokens_lower)

# Print the TTR, list of tokens, and list of unique words (types)
print("Type-Token Ratio (TTR):", ttr)
print("Tokens in Lowercase:", tokens_lower)
print("Unique words (Types):", sorted(unique_words))

In [None]:
#@markdown ### 🍎🍎 Write a script for TTR using the URL of your GitHub repository, where you've saved part of the web page as a UTF-8 text file using MS Word.

import urllib.request
url = "https://raw.githubusercontent.com/ms624atyale/Temp_Data/main/NPR_article_sample.txt"
response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')

print(content)

words = content.split()

# Calculate TTR (Type-Token Ratio)
unique_words = set(words)
ttr = len(unique_words) / len(words)

# Print the TTR, list of tokens, and list of unique words (types)
print("Type-Token Ratio (TTR):", ttr)
print("Tokens:", words)
print("Unique words (Types):", sorted(unique_words))
