<a href="https://colab.research.google.com/github/yanghyensik3955/NLP_2025/blob/main/9_textscrapingwithouthtmltags_saving_analysis_inprogress.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = 'red'> 🐹 👀 🐾 **Text/Content/Web Scraping without HTML tags**

## **API-based Data Collection**

### <font color = 'blue'> **cf., Crawling (a.k.a. HTML Scraping) or Text Mining**

In [None]:
!pip install requests



In [None]:
import requests  # HTTP client used to call the MediaWiki API

def get_wikipedia_page(title):
    """Return the plain-text extract of an English Wikipedia page, or None on error."""
    URL = "https://en.wikipedia.org/w/api.php"  # MediaWiki API (application programming interface) endpoint

    PARAMS = {               # query parameters for the API request
        "action": "query",   # ask the API to run a query
        "format": "json",    # request a JSON response
        "prop": "extracts",  # ask for the page extract (article text without markup)
        "titles": title,     # specify which page to fetch (by title)
        "explaintext": 1,    # return plain text (no HTML/markup)
    }

    headers = {
        # Browser-like User-Agent header (helps avoid blocks)
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
    }

    response = requests.get(URL, params=PARAMS, headers=headers, timeout=30)  # send the GET request

    if response.status_code != 200:  # anything other than 200 OK is treated as a failure
        print("HTTP error:", response.status_code)
        return None

    try:
        data = response.json()  # parse the response body as JSON
    except ValueError:          # raised when the body is not valid JSON
        print("JSON decode error")
        print("Raw response:", response.text[:500])  # show the start of the raw body for debugging
        return None

    # data["query"]["pages"] is a dictionary keyed by numeric page id
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})  # take the single page object; {} if no page came back
    return page.get("extract", "")         # article text, or "" if the page has no extract
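To make the JSON navigation at the end of `get_wikipedia_page` concrete, here is a minimal offline sketch. The `sample` and `missing` dictionaries are hand-written stand-ins for API responses (the page id `12345` and the helper name `extract_from_payload` are invented for illustration); a real payload carries more fields, and a non-existent title typically comes back under a negative page id without an `extract` key.

```python
def extract_from_payload(data):
    """Pull the plain-text extract out of a parsed API response (hypothetical helper)."""
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})   # first (and only) page object, or {} if none
    return page.get("extract", "")

# Hand-written stand-in for a successful response; "12345" is an invented page id.
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "Bathymetry",
                "extract": "Bathymetry is the study of underwater depth.",
            }
        }
    }
}

# Stand-in for a title that does not exist: no "extract" key is present.
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_from_payload(sample))         # the article text
print(repr(extract_from_payload(missing)))  # empty string
print(repr(extract_from_payload({})))       # empty string, even with no "query" key
```

Because an empty string is falsy, the `if txt:` checks in the later cells treat a missing page the same way as a failed request.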

## 🐹🐾 **Install NLTK and Download necessary models**

In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### 🐹🐾 **1️⃣ Pandas Library**

In [None]:
!pip install pandas
!pip install lexical_diversity
import pandas as pd #Import Pandas Package
from lexical_diversity import lex_div as ld  # lex_div provides flemmatize() and the LD index functions



# 🅰️ **Group1**

## ✅ **Text scraping for Group1**

In [None]:
titles = [
    "Marine life",
    "Marine geology",
    "Bathymetry",
    "Marine ecosystem",
]

corpus = {}

for t in titles:
    txt = get_wikipedia_page(t)
    if txt:
        corpus[t] = txt
    else:
        print("Failed:", t)

# Show first 200 chars for each
for title, text in corpus.items():
    print("\n====", title, "====")
    print(text[:200])


==== Marine life ====
Marine life, sea life or ocean life is the collective ecological communities that encompass all aquatic animals, plants, algae, fungi, protists, single-celled microorganisms and associated viruses liv

==== Marine geology ====
Marine geology or geological oceanography is the study of the history and structure of the ocean floor. It involves geophysical, geochemical, sedimentological and paleontological investigations of the

==== Bathymetry ====
Bathymetry [bəˈθɪmətɹi] is the study of underwater depth of ocean floors (seabed topography), river floors, or lake floors. In other words, bathymetry is the underwater equivalent to hypsometry or top

==== Marine ecosystem ====
Marine ecosystems are the largest of Earth's aquatic ecosystems and exist in waters that have a high salt content. These systems contrast with freshwater ecosystems, which have a lower salt content. M


## 🐹 🐾 📌 **Use this!!!** 📌
### ⭕ <font color = 'green'> **Script for [Group1]: Create one txt file with records separated by @@@@@**

In [None]:
output = []

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        txt = ""   # store empty if missing
    block = f"@@@@@\nTITLE: {title}\n{txt}\n"
    output.append(block)

final = "\n".join(output)

with open("wiki_corpus_delimited_group1.txt", "w", encoding="utf-8") as f:
    f.write(final)

print("Saved: wiki_corpus_delimited_group1.txt")

Saved: wiki_corpus_delimited_group1.txt
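Because each record starts with the `@@@@@` marker followed by a `TITLE:` line, the saved file can be split back into title/text pairs. The following is a minimal sketch, assuming the layout written by the script above; `parse_delimited` is a hypothetical helper name, and the short `sample` string stands in for the file contents.

```python
def parse_delimited(raw, marker="@@@@@"):
    """Split a marker-delimited corpus into {title: text} (hypothetical helper)."""
    records = {}
    for chunk in raw.split(marker):
        chunk = chunk.strip()
        if not chunk:
            continue                           # skip the empty piece before the first marker
        first_line, _, body = chunk.partition("\n")
        if first_line.startswith("TITLE:"):
            first_line = first_line[len("TITLE:"):]
        records[first_line.strip()] = body.strip()
    return records

# Stand-in for the file contents, in the same layout the script writes.
sample = (
    "@@@@@\nTITLE: Marine life\nMarine life is the life found in the sea.\n"
    "\n"
    "@@@@@\nTITLE: Bathymetry\nBathymetry is the study of underwater depth.\n"
)

records = parse_delimited(sample)
for title, text in records.items():
    print(title, "->", text)
```

This only works as long as the article text itself never contains the marker, which is the reason for choosing an unusual string such as `@@@@@`.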


#### ✅ **Alternative Script for [Group1]: Create one CSV with two columns (title + text)**

In [None]:
import csv

rows = []

for title in titles:
    txt = get_wikipedia_page(title)
    rows.append([title, txt])

with open("wiki_corpus_delimited_group1.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "text"])
    writer.writerows(rows)

print("Saved: wiki_corpus_delimited_group1.csv")

Saved: wiki_corpus_delimited_group1.csv
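Article text contains commas, quotation marks, and line breaks, which is why the script relies on `csv.writer` and `newline=""` instead of joining strings by hand. The sketch below round-trips one such row through an in-memory buffer to show that the text comes back unchanged. `io.StringIO` stands in for the file, and the sample row is invented.

```python
import csv
import io

# One invented row whose text has a comma, double quotes, and a line break.
rows = [["Sample title", 'Line one, with a comma and a "quote".\nLine two.']]

buffer = io.StringIO(newline="")      # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(["title", "text"])
writer.writerows(rows)

buffer.seek(0)                        # rewind before reading back
records = list(csv.DictReader(buffer))

print(records[0]["title"])
print(records[0]["text"] == rows[0][1])
```

The writer quotes the field and doubles the embedded quotation marks, so the reader can recover the original multi-line text as a single cell.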


# 🐹🐾 **Read the txt file**
### 🐣 **Open and read the text file for Group1**

In [None]:
# ▶️ Step 1: Change this path to point at your own file 🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group1.txt", 'rt', encoding="utf-8") as file:
    txt = file.read()  # the with-block closes the file automatically, so no explicit close() is needed

print(txt)

@@@@@
TITLE: Marine life
Marine life, sea life or ocean life is the collective ecological communities that encompass all aquatic animals, plants, algae, fungi, protists, single-celled microorganisms and associated viruses living in the saline water of marine habitats, either the sea water of marginal seas and oceans, or the brackish water of coastal wetlands, lagoons, estuaries and inland seas. As of 2023, more than 242,000 marine species have been documented, and perhaps two million marine species are yet to be documented. An average of 2,332 new species per year are being described. Marine life is studied scientifically in both marine biology and in biological oceanography.
By volume, oceans provide about 90% of the living space on Earth, and served as the cradle of life and vital biotic sanctuaries throughout Earth's geological history. The earliest known life forms evolved as anaerobic prokaryotes (archaea and bacteria) in the Archean oceans around the deep sea hydrothermal vents, 

## 🐹🐾 ❄️ **Basic Cleaning**
### **📍 Apply a series of functions for replacement in Group1**

In [None]:
# STEP 1: Read the file (change the path as needed) 🍎🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group1.txt", 'rt', encoding="utf-8") as fl:
    raw_text = fl.read()

# STEP 2: Clean the text
clean_text = (
    raw_text
    .replace("\n", " ")
    .replace("“", "")
    .replace("”", "")
    .replace("\"", "")
    .replace("/", "")
    .replace("_", "")
    .replace("===", "")
    .replace("==", "")
    .replace("=", "")
    .replace("*", "")
    .replace("?", "")
    .replace("!", "")
    .replace("--", " ")
    .replace("(", "")
    .replace(")", "")
)

# STEP 3: Save the cleaned content to a NEW file (set the output path you want) 🍏🍏🍏🍏🍏🍏
output_path = "/content/wiki_corpus_delimited_group1_CLEANED.txt"
with open(output_path, 'w', encoding="utf-8") as cf:
    cf.write(clean_text)  # write clean_text into the new file

# Optional: Print to verify
print("✅ Cleaned text saved to:", output_path)

✅ Cleaned text saved to: /content/wiki_corpus_delimited_group1_CLEANED.txt
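The chain of `.replace()` calls scans the text once per call. As an optional alternative, the sketch below does the same work with `str.translate`. It is an illustration rather than a required change; `basic_clean` and `chained_clean` are hypothetical helper names, and the chain is repeated only so the two versions can be compared on a sample string.

```python
# Characters deleted before the "--" step, and the two deleted after it.
PRE = str.maketrans({"\n": " ", **{ch: None for ch in "\u201c\u201d\"/_=*?!"}})
POST = str.maketrans("", "", "()")

def basic_clean(text):
    """Translate-based version of the cleaning chain (hypothetical helper)."""
    return text.translate(PRE).replace("--", " ").translate(POST)

def chained_clean(text):
    """The chain from the cell above, repeated only to check that both versions agree."""
    return (text.replace("\n", " ").replace("\u201c", "").replace("\u201d", "")
            .replace("\"", "").replace("/", "").replace("_", "")
            .replace("===", "").replace("==", "").replace("=", "")
            .replace("*", "").replace("?", "").replace("!", "")
            .replace("--", " ").replace("(", "").replace(")", ""))

sample = "== History ==\nIs it \u201ctrue\u201d? Yes -- mostly (see a/b_c)!"
print(basic_clean(sample))
print(basic_clean(sample) == chained_clean(sample))
```

Parentheses are removed in a separate pass because the chain removes them after the `--` step; keeping that order makes the two versions agree even on inputs such as `-(-`.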


## ✅ ✅ **Text scraping for Group2**

In [None]:
titles = [
    "2024 Nobel Prize in Literature",
    "Han Kang",
    "Bong Joon Ho",
    "Pachinko"
]

corpus = {}

for t in titles:
    txt = get_wikipedia_page(t)
    if txt:
        corpus[t] = txt
    else:
        print("Failed:", t)

# Show first 200 chars for each
for title, text in corpus.items():
    print("\n====", title, "====")
    print(text[:200])


==== 2024 Nobel Prize in Literature ====
The 2024 Nobel Prize in Literature was awarded to the South Korean author Han Kang (born 1970) "for her intense poetic prose that confronts historical traumas and exposes the fragility of human life".

==== Han Kang ====
Han Kang (Korean: 한강; born 27 November 1970) is a South Korean writer. From 2007 to 2018, she taught creative writing at the Seoul Institute of the Arts. Han rose to international prominence for her n

==== Bong Joon Ho ====
Bong Joon Ho (Korean: 봉준호; pronounced [poːŋ tɕuːnho]; born September 14, 1969) is a South Korean filmmaker. His work is characterized by emphasis on social and class themes, genre-mixing, dark comedy,

==== Pachinko ====
Pachinko (Japanese: パチンコ; pronounced [patɕiŋko]) is a mechanical game originating in Japan that is used as an arcade game and, much more frequently, for gambling. Pachinko fills a niche in Japanese ga


## ⭕⭕ <font color = 'blue'> **Script for [Group2]: Create one txt file with records separated by @@@@@**

In [None]:
output = []

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        txt = ""   # store empty if missing
    block = f"@@@@@\nTITLE: {title}\n{txt}\n"
    output.append(block)

final = "\n".join(output)

with open("wiki_corpus_delimited_group2.txt", "w", encoding="utf-8") as f:
    f.write(final)

print("Saved: wiki_corpus_delimited_group2.txt")

Saved: wiki_corpus_delimited_group2.txt


### 🐣🐣 **Open and read the text file for Group2**

In [None]:
# ▶️ Step 1: Change this path to point at your own file 🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group2.txt", 'rt', encoding="utf-8") as file:
    txt = file.read()  # the with-block closes the file automatically, so no explicit close() is needed

print(txt)

@@@@@
TITLE: 2024 Nobel Prize in Literature
The 2024 Nobel Prize in Literature was awarded to the South Korean author Han Kang (born 1970) "for her intense poetic prose that confronts historical traumas and exposes the fragility of human life". It was announced by the Swedish Academy in Stockholm, Sweden, on 10 October 2024 and was awarded on 10 December 2024. 
She is the first South Korean and first Asian woman to win the Nobel Prize in literature, making her the 18th woman to win the Nobel Prize in that category.


== Laureate ==

Han Kang grew up with a literary background in Seoul, her father, Han Seung-won, being a reputed novelist. Alongside her passion for writing, she spends time exploring arts and music, which is reflected throughout her literary production. She started her career in 1993 with the publication of her poems in the literary magazine Literature and Society, and had her debut prose publication in 1995 with the short story collection Love of Yeosu [여수의 사랑]

## **📍📍 Apply a series of functions for replacement in Group2**

In [None]:
# STEP 1: Read the file (change the path as needed) 🍎🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group2.txt", 'rt', encoding="utf-8") as fl:
    raw_text = fl.read()

# STEP 2: Clean the text
clean_text = (
    raw_text
    .replace("\n", " ")
    .replace("“", "")
    .replace("”", "")
    .replace("\"", "")
    .replace("/", "")
    .replace("_", "")
    .replace("===", "")
    .replace("==", "")
    .replace("=", "")
    .replace("*", "")
    .replace("?", "")
    .replace("!", "")
    .replace("--", " ")
    .replace("(", "")
    .replace(")", "")
)

# STEP 3: Save the cleaned content to a NEW file (set the output path you want) 🍏🍏🍏🍏🍏🍏
output_path = "/content/wiki_corpus_delimited_group2_CLEANED.txt"
with open(output_path, 'w', encoding="utf-8") as cf:
    cf.write(clean_text)  # write clean_text into the new file

# Optional: Print to verify
print("✅ Cleaned text saved to:", output_path)

✅ Cleaned text saved to: /content/wiki_corpus_delimited_group2_CLEANED.txt


### 🐹🐣🐣🐣 **Download the two txt files, edit them so that each has a header row delimited by @, and upload them to the [Wikipedia] folder of your GitHub account!**
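For reference, a file that uses `@` as the column delimiter has one header line followed by one line per article. The sketch below uses invented two-row content; the column names `Text`, `title`, and `group1` are taken from the DataFrame output shown later in this notebook, and `csv.DictReader` is used only to show how such a file splits into columns.

```python
import csv
import io

# Invented sample in the expected layout: header row, then one article per line.
sample = (
    "Text@title@group1\n"
    "K-pop is a form of popular music originating in South Korea.@KPop@1\n"
    "The Korean Wave refers to the global popularity of South Korean culture.@Korean Wave@1\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="@"))
for row in rows:
    print(row["title"], "|", row["group1"], "|", row["Text"][:30])
```

Each article has to sit on a single line and must not contain `@` inside the text, otherwise the columns shift.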

### 🐹🐣 **Clone the GitHub repository you want to work with**

In [None]:
!git clone https://github.com/ms624atyale/NLP_2025

Cloning into 'NLP_2025'...
remote: Enumerating objects: 340, done.
remote: Counting objects: 100% (153/153), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 340 (delta 104), reused 40 (delta 40), pack-reused 187 (from 2)
Receiving objects: 100% (340/340), 1.41 MiB | 4.93 MiB/s, done.
Resolving deltas: 100% (169/169), done.


# 💊 3️⃣ <font color = 'green'> **Avoid overlapping indices**

In [None]:
import pandas as pd
import glob

# Change directory 🍎🍎🍎🍎🍎
%cd /content/NLP_2025/Wikipedia

# Load all .txt files
fns = glob.glob('*.txt')

# List to hold each temporary DataFrame
df_list = []

# Load each file and append to list
for fn in fns:
    dftmp = pd.read_csv(fn, sep='@')
    df_list.append(dftmp)

# Concatenate all and reset index
df = pd.concat(df_list, ignore_index=True)

# Go back to main directory
%cd /content

# Save as CSV
df.to_csv("./wiki_group1_group1Copied.csv", index=False)

# Display the DataFrame
print(df)

/content/NLP_2025/Wikipedia
/content
                                                Text               title  \
0  K-pop K-pop Korean: RR: Keipap; an abbreviatio...                KPop   
1  The Korean Wave, or hallyu Korean; , refers to...         Korean Wave   
2  KPop Demon Hunters is a 2025 American animated...  KPop Demon Hunters   
3  BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...                 BTS   
4  K-pop K-pop Korean: RR: Keipap; an abbreviatio...                KPop   
5  The Korean Wave, or hallyu Korean; , refers to...         Korean Wave   
6  KPop Demon Hunters is a 2025 American animated...  KPop Demon Hunters   
7  BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...                 BTS   

   group1  
0      11  
1      11  
2      11  
3      11  
4       1  
5       1  
6       1  
7       1  


# <font color = 'red'> 🐹🐾 **Final Script to prepare input text for further analysis (e.g., Sentiment analysis, Flesch reading ease, etc.)**

  - # <font color = 'blue'> 🐹🐾 **Important & Useful!**
  - ### **This script will be based on plain text for 10 volumes above.**

In [None]:
%cd /content/NLP_2025/Wikipedia

/content/NLP_2025/Wikipedia


In [None]:
# Change file path 🍎🍎🍎🍎🍎
file_path1 = '/content/wiki_group1_group1Copied.csv'

df = pd.read_csv(file_path1)
df

Unnamed: 0,Text,title,group1
0,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,11
1,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,11
2,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,11
3,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,11
4,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,1
5,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,1
6,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,1
7,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,1


#### 🐥 **Adding a column with length info**

In [None]:
df1 = df.copy()  # work on a copy so the columns added below do not also change df

In [None]:
# Added column: String length
length = []

for i in range(0, len(df1['Text'])):
  LEN = len(df1['Text'][i])
  length.append(LEN)

df1['Data size'] = length
df1

Unnamed: 0,Text,title,group1,Data size
0,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,11,59079
1,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,11,40377
2,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,11,46750
3,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,11,58185
4,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,1,59079
5,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,1,40377
6,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,1,46750
7,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,1,58185


In [None]:
# Added columns: whitespace-split words, number of split words
tsplit = []
splen = []

for i in range(0, len(df1['Text'])):
  TSP = df1['Text'][i].split()
  SPLEN = len(TSP)
  tsplit.append(TSP)
  splen.append(SPLEN)
  # print(TSP)

df1['Splits'] = tsplit
df1['N_Splits'] = splen
df1

Unnamed: 0,Text,title,group1,Data size,Splits,N_Splits
0,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,11,59079,"[K-pop, K-pop, Korean:, RR:, Keipap;, an, abbr...",9447
1,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,11,40377,"[The, Korean, Wave,, or, hallyu, Korean;, ,, r...",6213
2,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,11,46750,"[KPop, Demon, Hunters, is, a, 2025, American, ...",7647
3,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,11,58185,"[BTS, Korean:, 방탄소년단;, RR:, Bangtan, Sonyeonda...",9648
4,K-pop K-pop Korean: RR: Keipap; an abbreviatio...,KPop,1,59079,"[K-pop, K-pop, Korean:, RR:, Keipap;, an, abbr...",9447
5,"The Korean Wave, or hallyu Korean; , refers to...",Korean Wave,1,40377,"[The, Korean, Wave,, or, hallyu, Korean;, ,, r...",6213
6,KPop Demon Hunters is a 2025 American animated...,KPop Demon Hunters,1,46750,"[KPop, Demon, Hunters, is, a, 2025, American, ...",7647
7,BTS Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit...,BTS,1,58185,"[BTS, Korean:, 방탄소년단;, RR:, Bangtan, Sonyeonda...",9648


In [None]:
df1.describe()

Unnamed: 0,group1,Data size,N_Splits
count,8.0,8.0,8.0
mean,6.0,51097.75,8238.75
std,5.345225,8413.711169,1502.349089
min,1.0,40377.0,6213.0
25%,1.0,45156.75,7288.5
50%,6.0,52467.5,8547.0
75%,11.0,58408.5,9497.25
max,11.0,59079.0,9648.0
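To see where the `describe()` figures come from, the sketch below recomputes the `N_Splits` column summary with Python's standard `statistics` module. The eight values are copied from the DataFrame output above. `stdev` is the sample standard deviation (dividing by n - 1), and the `"inclusive"` quantile method uses linear interpolation between data points.

```python
import statistics

# N_Splits values copied from the DataFrame output above (rows 0-7).
n_splits = [9447, 6213, 7647, 9648, 9447, 6213, 7647, 9648]

mean = statistics.mean(n_splits)
std = statistics.stdev(n_splits)   # sample standard deviation (n - 1 in the denominator)
q1, median, q3 = statistics.quantiles(n_splits, n=4, method="inclusive")

print("count:", len(n_splits))
print("mean: ", mean)
print("std:  ", round(std, 6))
print("min:  ", min(n_splits))
print("25%:  ", q1)
print("50%:  ", median)
print("75%:  ", q3)
print("max:  ", max(n_splits))
```

The mean (8238.75) and quartiles (7288.5, 8547.0, 9497.25) match the table. Note that every article appears twice in this DataFrame, so these statistics describe four distinct texts, not eight.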


Lexical Diversity Indices (10 types)

#Result file

In [None]:
df1.describe()

Unnamed: 0,group1,Data size,N_Splits,N_Sents
count,8.0,8.0,8.0,8.0
mean,6.0,51097.75,8238.75,350.0
std,5.345225,8413.711169,1502.349089,85.567016
min,1.0,40377.0,6213.0,267.0
25%,1.0,45156.75,7288.5,271.5
50%,6.0,52467.5,8547.0,350.5
75%,11.0,58408.5,9497.25,429.0
max,11.0,59079.0,9648.0,432.0


In [None]:
#Getting LD indices
!pip install lexical-diversity
from lexical_diversity import lex_div as ld

Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl.metadata (4.1 kB)
Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
Installing collected packages: lexical-diversity
Successfully installed lexical-diversity-0.1.1


In [None]:
# Added column: lemmatized tokens
lem = []

for i in range(0, len(df1['Text'])):
  LEM = ld.flemmatize(df1['Text'][i])
  print(LEM)
  lem.append(LEM)

df1['Lemma'] = lem

['kpop', 'kpop', 'korean', 'rr', 'keipap', 'a', 'abbreviation', 'of', 'korean', 'popular', 'music', 'be', 'a', 'form', 'of', 'popular', 'music', 'originate', 'in', 'south', 'korea', 'the', 'music', 'genre', 'that', 'the', 'term', 'be', 'use', 'to', 'refer', 'to', 'colloquially', 'emerge', 'in', 'the', '1990s', 'as', 'a', 'form', 'of', 'youth', 'subculture', 'with', 'korean', 'musician', 'take', 'influence', 'from', 'western', 'dance', 'music', 'hiphop', 'r&b', 'and', 'rock', 'today', 'kpop', 'commonly', 'refer', 'to', 'the', 'musical', 'output', 'of', 'teen', 'idol', 'act', 'chiefly', 'girl', 'group', 'and', 'boy', 'band', 'who', 'emphasize', 'visual', 'appeal', 'and', 'performance', 'as', 'a', 'pop', 'genre', 'kpop', 'be', 'characterize', 'by', 'its', 'melodic', 'quality', 'and', 'cultural', 'hybridity', 'kpop', 'can', 'trace', 'its', 'origin', 'to', 'rap', 'dance', 'a', 'fusion', 'of', 'hiphop', 'techno', 'and', 'rock', 'popularize', 'by', 'the', 'group', 'seo', 'taiji', 'and', 'boy'

In [None]:
# ADD LD indices

#1. Create empty lists.
TTR = []
RTTR = []
LogTTR = []
MassTTR = []
MSTTR = []
MATTR = []
HDD = []
MTLD = []
MTLD_wrap = []
MTLD_bid = []
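The empty lists above are meant to hold one value per text for each index. As a hedged illustration of what the four simplest indices measure, the sketch below computes them from their textbook definitions in plain Python. It is not the `lexical_diversity` implementation, whose results may differ in detail, and the window-based and probabilistic indices (MSTTR, MATTR, HDD, MTLD and its variants) are left out. `simple_ld_indices` is a hypothetical helper name, and the token list is a toy example.

```python
import math

def simple_ld_indices(tokens):
    """Return four length-based LD indices for a token list (hypothetical helper).

    Needs at least two tokens, otherwise log10(n_tokens) is zero.
    """
    n_tokens = len(tokens)
    n_types = len(set(tokens))          # number of distinct tokens
    log_tokens = math.log10(n_tokens)
    log_types = math.log10(n_types)
    return {
        "TTR": n_types / n_tokens,                        # type-token ratio
        "RTTR": n_types / math.sqrt(n_tokens),            # root TTR
        "LogTTR": log_types / log_tokens,                 # log TTR
        "MaasTTR": (log_tokens - log_types) / log_tokens ** 2,
    }

# Toy example: 11 tokens, 8 distinct types.
tokens = "the cat sat on the mat and the dog sat too".split()
indices = simple_ld_indices(tokens)
for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```

Plain TTR falls as texts get longer, which is why the other indices rescale by text length; with articles of 6,000 to 9,600 tokens, as here, the length-corrected indices are the more comparable ones.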

# **Number of sentences**

In [None]:
!pip install textstat
import textstat

Collecting textstat
  Downloading textstat-0.7.11-py3-none-any.whl.metadata (15 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Downloading textstat-0.7.11-py3-none-any.whl (176 kB)
Downloading pyphen-0.17.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.17.2 textstat-0.7.11


In [None]:
df1['N_Sents'] = df1['Text'].apply(textstat.sentence_count)
df1.to_csv('LD_result_with_Nsents.csv')
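What counts as a sentence depends on the tool. As a rough, hedged illustration (this is not how `textstat.sentence_count` works internally), the sketch below counts sentences by splitting after `.`, `!`, or `?` followed by whitespace. It also shows a side effect of the Basic Cleaning step: because `?` and `!` are removed there, sentences that ended with them merge into the next one. `naive_sentence_count` is a hypothetical helper name, and the sample text is invented.

```python
import re

def naive_sentence_count(text):
    """Count sentences by splitting after ., ! or ? followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])

original = "Is bathymetry hard? Not really! It maps depth."
cleaned = original.replace("?", "").replace("!", "")   # same removals as Basic Cleaning

count_original = naive_sentence_count(original)
count_cleaned = naive_sentence_count(cleaned)
print(count_original)   # three sentences before cleaning
print(count_cleaned)    # one sentence after the ? and ! are stripped
```

If sentence counts matter for the analysis, it may be worth counting sentences on the raw text and applying the character removals afterwards.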

#Plotting

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Normalize category labels (critical step).
# group1 is read from the CSV as integers (1 and 11), so cast to str first. Calling .str on the
# integer column raises "AttributeError: Can only use .str accessor with string values!"
df1['group1'] = df1['group1'].astype(str).str.strip().str.lower()

# Check the unique values (optional debug)
print(df1['group1'].value_counts())

# Set theme and size in one call (a later sns.set() would reset the style chosen by set_theme)
sns.set_theme(style='white', rc={'figure.figsize':(8,6),"font.size":14,"axes.titlesize":18,"axes.labelsize":14})

# Prepare the data
dd = df1[['group1', 'N_Splits']].copy()

# Draw the boxplot (1 box per category)
ax = sns.boxplot(x='group1', y='N_Splits', data=dd, palette="Accent")
# No fixed y-limits: N_Splits ranges from 6213 to 9648 here, so set_ylim([500, 1500]) would hide every box
ax.set(xlabel='group1', ylabel='Number of Words (Tokens)')

# Save the figure
plt.tight_layout()
plt.savefig('boxplot_Nsplit.png')
plt.show()

# <font color = 'red'> 🐹🐾 **Final Script to prepare input text for further analysis (e.g., Common Core Words, Wordcloud, Lexical Diversity, etc.)**

  - # <font color = 'blue'> 🐹🐾 **Important & Useful!**
  - ### **This script will be based on plain text for 10 volumes above.**

### 🐹🐾 **2️⃣ Clone a repository on your GitHub (beware that the following code uses your instructor's GitHub repository)**

In [None]:
!git clone 'https://github.com/ms624atyale/NLP_PictureBook_2025'