## Data Visualization
### NLP Basics and Visualizing Text Data
- Xianli Zeng
- SOE, Xiamen University

Natural Language Processing (NLP)  
- *Making Machines Understand Human Language*
- *"Teaching computers to read, write, and communicate"*


### Key Applications
| Application          | Examples                         |
|----------------------|----------------------------------|
| **Text Generation**  | ChatGPT, email autocomplete      |
| **Translation**      | Google Translate, DeepL          |
| **Sentiment Analysis** | Product review classification   |
| **Speech Recognition** | Siri, Alexa, transcription tools |


## Visualizing Text Data: Word Frequency & Word Cloud



### Word Frequency Bar Chart


- A bar chart showing the most frequent words in a document.
- Highlights dominant vocabulary.
- Useful for a quick overview of content themes.

Steps 
1. Tokenize the text
2. Remove stopwords & short words
3. Count word frequencies
4. Plot the top N words
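
The four steps above can be sketched with the standard library alone. This is a minimal illustration on a short English string, assuming a regex tokenizer and a tiny hand-picked stopword set (both are hypothetical stand-ins for nltk or jieba and a real stopword list). Plotting is reduced to a text bar chart so the sketch has no dependencies.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "is", "of", "to", "in", "a"}

def top_words(text, n=3, min_len=2):
    """Return the n most frequent words after stopword and length filtering."""
    # 1. Tokenize: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stopwords and words shorter than min_len.
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]
    # 3. Count word frequencies and keep the top n.
    return Counter(tokens).most_common(n)

sample = "The cat and the dog. The cat sleeps; the dog barks at the cat."
top = top_words(sample, n=2)

# 4. "Plot" the result as a text bar chart, one '#' per occurrence.
for word, count in top:
    print(f"{word:<6} {'#' * count} {count}")
```

In a notebook, the final loop would be replaced by a `plt.bar` call on the same `(word, count)` pairs.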


### Word Cloud

- A visual representation of text data.
- Each word’s **font size** corresponds to its **frequency**.
- No axes; the layout is often **random** or shaped (e.g. heart, brain).
- Widely used for exploratory text analysis and visual summaries.




Steps 
1. Tokenize the text
2. Remove stopwords & short words
3. Count word frequencies
4. Plot the word cloud
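
To make "font size corresponds to frequency" concrete, here is a minimal sketch that maps word counts to font sizes by linear interpolation. The function name `font_sizes`, the size range, and the counts are all hypothetical; this only illustrates the idea, and the `wordcloud` library applies its own scaling and layout logic.

```python
def font_sizes(counts, min_size=10, max_size=100):
    """Map each word's count to a font size between min_size and max_size."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All words are equally frequent, so give them the same size.
        return {word: max_size for word in counts}
    scale = (max_size - min_size) / (hi - lo)
    return {word: round(min_size + (c - lo) * scale) for word, c in counts.items()}

counts = {"data": 10, "text": 5, "cloud": 1}  # made-up frequencies
sizes = font_sizes(counts)
print(sizes)
```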


### Tokenization

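Tokenization splits raw text into words (tokens). For example, the sentence below ("I love natural language processing") is segmented into four tokens:

text = "我爱自然语言处理"

我//爱//自然语言//处理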

### Tools:
- English text: nltk
- Chinese text: jieba

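Compared with English, Chinese tokenization is much more complicated: words are not separated by spaces, so the same string can be segmented in more than one way.

**乒乓球拍卖完了**
- 乒乓球//拍卖//完了 ("the ping-pong balls have been auctioned off")
- 乒乓球拍//卖完了 ("the ping-pong paddles are sold out")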
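
The ambiguity can be reproduced with a small sketch that enumerates every way to split a string into dictionary words. The vocabulary below is a toy set assumed for illustration, and the brute-force search is not how jieba works internally; it only shows that both readings are valid segmentations.

```python
def segmentations(text, vocab):
    """Return every way to split text into words that all appear in vocab."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results

# Toy vocabulary, assumed for illustration only.
vocab = {"乒乓球", "乒乓球拍", "拍卖", "卖完了", "完了"}
for seg in segmentations("乒乓球拍卖完了", vocab):
    print("//".join(seg))
```

A practical tokenizer has to pick one of these candidates, which is why a dedicated tool such as jieba is used for Chinese text.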


In [None]:
import jieba
# text = "我爱自然语言处理"

text = "乒乓球拍卖完了"
word_list = jieba.lcut(text)
word_list

In [None]:
import jieba
import matplotlib.pyplot as plt
from collections import Counter


with open('changyi.txt', 'r', encoding='utf-8') as file:
    text = file.read()
print(text)

In [None]:

def chinese_word_cut(text):
    # Segment the text with jieba
    word_list = jieba.lcut(text)
    # Drop single-character tokens and whitespace-only tokens
    word_list = [word for word in word_list if len(word) > 1 and word.strip() != '']
    return ' '.join(word_list)

cut_text = chinese_word_cut(text)
print(cut_text)
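
The cell above drops short tokens but does not remove stopwords, which the steps listed earlier also call for. A minimal sketch of that extra filter follows; the helper name `remove_stopwords`, the stopword set, and the token list are all hypothetical, and a real project would load a published stopword list instead.

```python
def remove_stopwords(tokens, stopwords, min_len=2):
    """Keep tokens that are long enough and not in the stopword set."""
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

# Hypothetical stopword set and tokens, for illustration only.
stopwords = {"我们", "以及", "一个"}
tokens = ["我们", "研究", "数据", "以及", "可视化", "的", "数据"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

The filtered list can be joined with spaces in the same way `chinese_word_cut` does before counting or plotting.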

In [None]:
import jieba
import matplotlib
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm


matplotlib.rcParams['font.family'] = 'SimHei'     # use a font that can render Chinese
matplotlib.rcParams['axes.unicode_minus'] = False # render minus signs correctly
words = cut_text.split()

# Count word frequencies
word_counts = Counter(words)

# Take the 10 most frequent words
top_words = word_counts.most_common(10)
labels, values = zip(*top_words)

# Draw the bar chart
plt.figure(figsize=(12, 6))
plt.bar(labels, values)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Top 10 Word Frequencies")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
from wordcloud import WordCloud

font_path = 'simhei.ttf'  # or any other font file that supports Chinese

wordcloud = WordCloud(
    font_path=font_path,
    width=800,
    height=600,
    background_color='white',
    max_words=100,
    max_font_size=400,
    collocations=False  # do not add two-word phrases, which avoids repeated words
).generate(cut_text)

# Display the word cloud
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()



In [None]:
from PyPDF2 import PdfReader
# Read the text of every page and concatenate it
reader = PdfReader("NSFC.pdf")
all_text = ""
for page in reader.pages:
    all_text += page.extract_text() or ""  # guard against pages with no extractable text
all_text

In [None]:

matplotlib.rcParams['font.family'] = 'SimHei'     # use a font that can render Chinese
matplotlib.rcParams['axes.unicode_minus'] = False # render minus signs correctly

# Tokenize the PDF text

words = jieba.lcut(all_text)
words

In [None]:
words = [w for w in words if len(w) > 1]

word_counts = Counter(words)
# Take the 15 most frequent words
top_words = word_counts.most_common(15)
labels, values = zip(*top_words)

# Draw the bar chart
plt.figure(figsize=(12, 6))
plt.bar(labels, values)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Top 15 Word Frequencies")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:

words = jieba.lcut(all_text)
words = [w for w in words if len(w) > 1]

word_counts = Counter(words)

wc = WordCloud(
    font_path="simhei.ttf", 
    background_color="white",
    width=800,
    height=400,
    max_words=100,
    max_font_size=200,
).generate_from_frequencies(word_counts)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("PDF Word Cloud")
plt.show()

#### Mask: word cloud in a specific shape

In [None]:
from PIL import Image
import numpy as np
mask = np.array(Image.open("heart_mask.png"))
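# With a mask, the canvas takes the shape of the mask image, so width and
# height are not needed; pure white areas of the mask are left empty.
wc = WordCloud(
    font_path="simhei.ttf",
    background_color="white",
    mask=mask
).generate_from_frequencies(word_counts)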

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("PDF Word Cloud")
plt.show()

- Use **Word Frequency Bar Chart** when you want:
  - Precise counts
  - Side-by-side comparison of top terms

- Use **Word Cloud** when you want:
  - Quick, informal visualization
  - Attention-catching visuals for slides or reports
