<a href="https://colab.research.google.com/github/alisonyang/data-science-blog/blob/main/NLP_Text_Embedding_Chinese.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Binary Classifier Using Chinese-Language Movie Reviews

You will build a simple binary classification model that distinguishes positive from negative movie reviews, trained on the [Douban (豆瓣) movie short reviews](https://www.kaggle.com/datasets/utmhikari/doubanmovieshortcomments) dataset. The main goal of the project is to visualize the embeddings the model learns for Chinese-language text using the TensorFlow Embedding Projector.

## Download the Dataset

First, download the dataset. It is hosted on Kaggle; for instructions on fetching Kaggle data from Colab, see ['Easiest way to download kaggle data in Google Colab'](https://www.kaggle.com/general/74235).

In [1]:
# ! pip install -q kaggle

In [None]:
# from google.colab import files

# files.upload()

In [None]:
# ! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# ! kaggle datasets list
!kaggle datasets download -d utmhikari/doubanmovieshortcomments --force

In [None]:
# ! mkdir comments

In [None]:
! unzip -o doubanmovieshortcomments.zip -d comments

## Check the dataset

In [None]:
import pandas as pd

comments = pd.read_csv('/content/comments/DMSC.csv')
comments.head()

In [None]:
# the dataset information
comments.info(verbose=True)

In [None]:
comments['Star'].describe()

As shown in the output above, the dataset contains a total of **2,125,056** reviews. For this project, we will use only the **Star** and **Comment** columns to train our model.

## Extracting Labels from the 'Star' Column

In this project, we will simply transform the star rating into two labels:

*   1-3 stars: negative label (**0**)
*   4-5 stars: positive label (**1**)
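
The rule above can be sketched in plain Python. The helper name `star_to_label` is hypothetical and is not part of the notebook, which applies the same threshold to the whole column with `np.where`.

```python
def star_to_label(star):
    """Return 1 for 4-5 stars (positive) and 0 for 1-3 stars (negative)."""
    return 1 if star > 3 else 0

# Map every possible rating to its label
mapped = {star: star_to_label(star) for star in range(1, 6)}
print(mapped)
```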


In [None]:
comments["Star"].head()

In [None]:
import numpy as np
comments["Star"] = np.where(comments["Star"] > 3, 1, 0)
comments["Star"].head()

In [None]:
labels = comments['Star']
labels.head()

## Split the dataset

We will split the dataset into a training set and a testing set in a ratio of roughly 7:3.
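
The ratio can be checked with simple arithmetic, using the total row count reported above and the 1,500,000-row cutoff that this section uses for the split. Note that the split is a positional slice rather than a shuffled one, so it implicitly assumes that the row order of the CSV is not strongly correlated with the labels; that assumption is not verified here.

```python
# Figures taken from this notebook: the total number of reviews
# and the slice boundary used for the split.
total_reviews = 2_125_056
training_size = 1_500_000

testing_size = total_reviews - training_size
train_fraction = training_size / total_reviews

print(f"training: {training_size} ({train_fraction:.1%})")
print(f"testing:  {testing_size} ({1 - train_fraction:.1%})")
```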

In [None]:
reviews = comments['Comment']
reviews.head()

In [None]:
training_size = 1500000

# Split the sentences
training_reviews = reviews[0:training_size].tolist()
testing_reviews = reviews[training_size:].tolist()

# Split the labels
training_labels = labels[0:training_size].tolist()
testing_labels = labels[training_size:].tolist()

In [None]:
print("training set length:", len(training_reviews))
print("testing set length:", len(testing_reviews))
print(type(training_labels))

## Preprocessing the train and test sets

Now you can preprocess the text and labels so that they can be consumed by the model:

1.   We will use Jieba to segment the Chinese text data into words.
2.   We will create the vocabulary using the `Tokenizer` class and generate padded token sequences using the `pad_sequences` function.
3.   We will convert the labels into NumPy arrays to ensure a valid data type for `model.fit()`.
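
To make step 2 more concrete, here is a simplified, plain-Python illustration of what building a vocabulary and padding sequences amounts to. It is a sketch, not the Keras implementation: the helper names `build_word_index` and `encode_and_pad` are hypothetical, the toy sentences are made up, and the sketch skips the vocabulary-size cap as well as the punctuation filtering and lowercasing that `Tokenizer` applies by default. It assumes the text has already been segmented into space-separated words.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank space-separated tokens by frequency; index 0 is left for padding."""
    counts = Counter(token for text in texts for token in text.split())
    word_index = {oov_token: 1}
    for rank, (token, _) in enumerate(counts.most_common(), start=2):
        word_index[token] = rank
    return word_index

def encode_and_pad(text, word_index, max_length, oov_token="<OOV>"):
    """Map tokens to indices, truncate at the end, then pad with zeros at the end."""
    oov_index = word_index[oov_token]
    ids = [word_index.get(token, oov_index) for token in text.split()]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

# Made-up, already segmented sentences
toy_texts = ["电影 好看 好看", "电影 无聊"]
toy_index = build_word_index(toy_texts)

# "精彩" was never seen, so it maps to the OOV index
encoded = encode_and_pad("电影 好看 精彩", toy_index, max_length=5)
print(encoded)
```

Every encoded review ends up with the same length, which is what lets the model take a fixed-size input.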

In [None]:
!pip install jieba

In [None]:
import jieba

# Load the stop word list once and keep it in a set for fast lookups.
# For the stop word list, please refer to this link: https://github.com/goto456/stopwords/blob/master/cn_stopwords.txt.
with open("cn_stopwords.txt", "r", encoding="UTF-8") as f:
  stopword_set = set(f.read().split("\n"))

def jieba_cut(text):
  # Segment the text in accurate mode (cut_all=False)
  result = jieba.lcut(text, cut_all=False)

  # Drop stop words and join the remaining tokens with spaces
  seg_result = [word for word in result if word not in stopword_set]

  return " ".join(seg_result)

In [None]:
# use Jieba to segment the training and testing sets

traning_set = []
for review  in training_reviews:
  traning_set.append(jieba_cut(review))

testing_set = []
for review  in testing_reviews:
  testing_set.append(jieba_cut(review))


In [None]:
traning_set[:3]

In [None]:
testing_set[:3]

In [None]:
# set parameters
vocab_size = 10000
max_length = 64
embedding_dim = 16

# for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(traning_set)
word_index = tokenizer.word_index

# pad the training sequences
training_sequences = tokenizer.texts_to_sequences(traning_set)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# pad the testing sequences
testing_sequences = tokenizer.texts_to_sequences(testing_set)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# convert the labels into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

## Build and Compile the Model

Now that the data has been preprocessed, we can move on to building our binary classification model.
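
For the architecture defined in this section (an Embedding layer, a Flatten layer, a 6-unit Dense layer, and a 1-unit sigmoid output), the trainable parameter counts can be worked out by hand from the parameters set earlier. The sketch below is plain arithmetic, assuming the standard weights-plus-biases layout of dense layers; the variable names are illustrative.

```python
vocab_size = 10000
embedding_dim = 16
max_length = 64
hidden_units = 6  # width of the first Dense layer

# Embedding: one vector of size embedding_dim per token index
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it only reshapes (max_length, embedding_dim) to one vector
flattened_size = max_length * embedding_dim

# Dense layers: weights plus one bias per unit
hidden_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding table, which is the part of the model this notebook is interested in visualizing.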

In [None]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

## Train the Model

The next step is to train the model. Because the primary objective of this notebook is to visualize the text embeddings with the TensorFlow Embedding Projector, we won't spend much effort on tuning the training.

In [None]:
num_epochs = 10

# Train the model
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

In [None]:
import matplotlib.pyplot as plt

# Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
# Plot the accuracy and loss
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

## Visualize Word Embeddings

After the model is trained, we can visualize the weights of the Embedding layer to observe how similar words cluster together. To do this, we will use the [TensorFlow Embedding Projector](https://projector.tensorflow.org/) to reduce the 16-dimensional vectors we defined earlier to fewer components that can be plotted. To obtain these weights, run the cell below.
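
One common way to quantify how "similar" two words are in an embedding space is cosine similarity. The following is a minimal sketch of a nearest-neighbor lookup under that measure. The helper names and the 3-dimensional vectors are made up for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(query, vectors, top_k=2):
    """Rank all other words by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "好看": [0.9, 0.1, 0.0],
    "精彩": [0.8, 0.2, 0.1],
    "无聊": [-0.7, 0.3, 0.1],
}
neighbors = nearest_words("好看", toy_vectors)
print(neighbors)
```

With the trained model, the same lookup could be run on a dictionary built from the tokenizer's words and the rows of the embedding weight matrix.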

In [None]:
# use reverse_word_index to lookup a word 
reverse_word_index = tokenizer.index_word

# Get the embedding layer from the model
embedding_layer = model.layers[0]

# Get the weights 
embedding_weights = embedding_layer.get_weights()[0]
print(embedding_weights.shape) 


In [None]:
import io

# Open writeable files; the with block closes them automatically
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(1, vocab_size):
    word_name = reverse_word_index[word_num]
    word_embedding = embedding_weights[word_num]

    # Skip blank tokens so that neither file gets an empty entry
    if len(word_name.strip()) != 0:
      out_m.write(word_name + "\n")
      out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")

In [None]:
# Download the files (only possible when running in Google Colab)
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

We can now go to the [TensorFlow Embedding Projector](https://projector.tensorflow.org/) and load the two files:
* `vecs.tsv` - contains the embedding vector of each word in the vocabulary
* `meta.tsv` - contains the corresponding words in the vocabulary
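
The two files are aligned row by row: the word on line *i* of `meta.tsv` labels the vector on line *i* of `vecs.tsv`. As a small sketch of that layout, the snippet below writes two made-up words and vectors to in-memory buffers instead of real files. All values are illustrative only.

```python
import io

# Made-up words and 2-dimensional vectors, for illustration only
toy_words = ["好看", "无聊"]
toy_vectors = [[0.12, -0.5], [-0.33, 0.8]]

vecs_file = io.StringIO()
meta_file = io.StringIO()
for word, vector in zip(toy_words, toy_vectors):
    # One word per line in the metadata, one tab-separated vector per line
    meta_file.write(word + "\n")
    vecs_file.write("\t".join(str(x) for x in vector) + "\n")

vec_lines = vecs_file.getvalue().splitlines()
meta_lines = meta_file.getvalue().splitlines()
print(vec_lines)
print(meta_lines)
```

If the two files ever differ in line count, the labels no longer match their vectors, which is why a skipped word must be skipped in both files.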

You can try searching for words like `棒呆` ("awesome") and `大失所望` ("greatly disappointed") to see which other words lie close to them. This can be a fun and engaging way to expand your vocabulary and explore the relationships between different words.