<a href="https://colab.research.google.com/github/wcliao1962/2025_DL/blob/master/Keras_Imdb_Introduce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

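# Download IMDb

* Before running the download, **create a data folder** on the Colab virtual machine (click the Files icon (the rightmost one), right-click in the Files pane, and choose New folder)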
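
The folder can also be created from code rather than through the Files pane. The following is a minimal sketch, assuming the notebook's current working directory is where `data` should live; `data_dir` is a name introduced here for illustration only.

```python
import os

# Hypothetical variable name; "data" matches the folder the download cell writes to.
data_dir = "data"

# exist_ok=True makes the call safe to re-run when the folder already exists.
os.makedirs(data_dir, exist_ok=True)
print(os.path.isdir(data_dir))
```

Running this once before the download cell should have the same effect as creating the folder by hand.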


In [None]:
import urllib.request
import os
import tarfile

In [None]:
url="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath="data/aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):
    result=urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)

downloaded: ('data/aclImdb_v1.tar.gz', <http.client.HTTPMessage object at 0x7fda7c20f010>)


# Extract IMDb into data/aclImdb

In [None]:
if not os.path.exists("data/aclImdb"):
    # The with-block closes the archive once extraction finishes
    with tarfile.open("data/aclImdb_v1.tar.gz", 'r:gz') as tfile:
        tfile.extractall('data/')

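# Folder structure inside data/aclImdb (after extraction):
* **test** (25,000 reviews, for testing)
  * **neg**: 12,500 negative review text files
  * **pos**: 12,500 positive review text files
* **train** (25,000 reviews, for training)
  * **neg**: 12,500 negative review text files
  * **pos**: 12,500 positive review text files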
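
One way to check the extracted structure is to count the files in each folder. This is only a sketch: `count_reviews` is a helper name made up for this example, and the default `root` assumes the archive was extracted under `data/` as in the cell above.

```python
import os

def count_reviews(root="data/aclImdb"):
    """Return {(split, label): number of files} for the folders that exist."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            # Skip folders that are missing, e.g. before extraction has run.
            if os.path.isdir(folder):
                counts[(split, label)] = len(os.listdir(folder))
    return counts

print(count_reviews())
```

If extraction succeeded, each of the four entries should match the counts listed above.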


# Import the text preprocessing modules

In [None]:
# from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

# Define the rm_tags(text) function
Removes the HTML tags from text

In [None]:
import re
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)
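
A quick check of what `rm_tags` does to a string. The function is repeated so the snippet runs on its own, and the sample string is made up for illustration rather than taken from the dataset.

```python
import re

def rm_tags(text):
    # Same pattern as the cell above: "<", then any non-">" characters, then ">".
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)

# Made-up sample string, not taken from the dataset.
sample = "Great film.<br /><br />Loved it."
print(rm_tags(sample))  # Great film.Loved it.
```

Note that the tags are deleted without inserting a space, so words on either side of a tag end up adjacent.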

# Define the read_files(filetype) function
* filetype can be train or test
* With train, it **reads the training data** from the train folder under data/aclImdb
* With test, it **reads the test data** from the test folder


In [None]:
import os
def read_files(filetype):
    path = "data/aclImdb/"
    file_list=[]

    positive_path=path + filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list+=[positive_path+f]

    negative_path=path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list+=[negative_path+f]

    print('read',filetype, 'files:',len(file_list))

    # Label by source folder (1 = pos, 0 = neg) instead of assuming 12500 files each
    all_labels = [1 if '/pos/' in f else 0 for f in file_list]

    all_texts  = []
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]

    return all_labels,all_texts

# Build the training and test data

In [None]:
y_train,train_text=read_files("train")

read train files: 25000


In [None]:
y_test,test_text=read_files("test")

read test files: 25000


In [None]:
# View a positive review

In [None]:
train_text[0]

'Wow. What a wonderful film. The script is nearly perfect it appears this is the only film written by Minglun Wei,I hope he has more stories in him.The acting is sublime. Renying Zhou as Doggie was amazing -- very natural talent, and Xu Zhu was a delight - very believable as the jaded old traditionalist. The soundtrack was very effective, guiding without being overwhelming. If only more movies like this were made whether in Hollywood or Hong Kong- a family friendly, well acted, well written, well directed, near perfect gem.'

In [None]:
y_train[0]

1

In [None]:
# View a negative review

In [None]:
train_text[12500]

"I had to do a search on the actresses to find the board of this film because the title is now An Unexpected Love. It's not really worth looking for but I was unfamiliar with both leads and wondered why they were headlining a lesbian flick on Lifetime. Everything's pretty restrained and you don't really get an idea of who these characters are so, as a viewer, I wasn't able to become emotionally invested in the storyline. I guess I'm not the target audience for this but I'm not sure who is. Everything's muted and soft focus and earth tones...nothing's very interesting. I had a prurient interest in seeing two women make out but it's handled so discreetly that I was disappointed. Rent Personal Best instead."

In [None]:
y_train[12500]

0

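# Build the dictionary with the tokenizer:
* The dictionary size is limited to 2000: num_words=2000
* The tokenizer reads every review in the training data (train_text), ranks each **English word** by how often it appears across all reviews, and assigns each word a rank number (index); the 2000 limit keeps only the most frequent words when reviews are converted to numbers. Note that tokenizer.word_index itself stores every word seen, and only indices below num_words are used
* The dictionary looks like this: {'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5, 'is': 6, 'in': 7, 'it': 8, ... }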
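
The frequency ranking described above can be sketched in plain Python. This is a simplified imitation, not Keras's implementation: `build_word_index` and `toy_texts` are names invented for this example, and the tokenization here is only lowercasing plus whitespace splitting, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Made-up toy corpus for illustration.
toy_texts = ["the movie was good", "the film was bad", "the movie"]
print(build_word_index(toy_texts))
# {'the': 1, 'movie': 2, 'was': 3, 'good': 4, 'film': 5, 'bad': 6}
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value later on.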

In [None]:
tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(train_text)

In [None]:
print(tokenizer.document_count)

25000


In [None]:
# Print the whole dictionary (the output is very long); fine to run, but remember to clear the outputs before uploading to GitHub
print(tokenizer.word_index)

In [None]:
# Print the first 100 dictionary items; the dictionary is converted to a list first so the first 100 can be sliced

# Convert the dictionary to a list of (word, index) pairs
word_index_items = list(tokenizer.word_index.items())

# Print the first 100 items
print(word_index_items[:100])

[('the', 1), ('and', 2), ('a', 3), ('of', 4), ('to', 5), ('is', 6), ('in', 7), ('it', 8), ('i', 9), ('this', 10), ('that', 11), ('was', 12), ('as', 13), ('for', 14), ('with', 15), ('movie', 16), ('but', 17), ('film', 18), ('on', 19), ('not', 20), ('you', 21), ('are', 22), ('his', 23), ('have', 24), ('be', 25), ('he', 26), ('one', 27), ('all', 28), ('at', 29), ('by', 30), ('an', 31), ('they', 32), ('who', 33), ('so', 34), ('from', 35), ('like', 36), ('her', 37), ('or', 38), ('just', 39), ('about', 40), ("it's", 41), ('out', 42), ('has', 43), ('if', 44), ('some', 45), ('there', 46), ('what', 47), ('good', 48), ('more', 49), ('when', 50), ('very', 51), ('up', 52), ('no', 53), ('time', 54), ('she', 55), ('even', 56), ('my', 57), ('would', 58), ('which', 59), ('only', 60), ('story', 61), ('really', 62), ('see', 63), ('their', 64), ('had', 65), ('can', 66), ('were', 67), ('me', 68), ('well', 69), ('than', 70), ('we', 71), ('much', 72), ('been', 73), ('get', 74), ('bad', 75), ('will', 76), ('

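# Convert each review (text string) into a sequence of numbers
* Every **word** in a review is converted to its number (index)
* **Note: only words in the dictionary are converted to numbers; any other word is skipped**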
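
The skipping behaviour can be illustrated with a small stand-alone sketch. `text_to_sequence` and `toy_index` are hypothetical names, and the `idx < num_words` cutoff reflects how Keras appears to apply `num_words`; treat that detail as an assumption rather than a statement of the library's internals.

```python
def text_to_sequence(text, word_index, num_words):
    """Convert one text to a list of indices, dropping unknown or out-of-limit words."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Keep the word only if it is in the dictionary and within the limit.
        if idx is not None and idx < num_words:
            seq.append(idx)
    return seq

# Made-up dictionary for illustration.
toy_index = {"the": 1, "movie": 2, "was": 3, "good": 4, "film": 5, "bad": 6}
print(text_to_sequence("the film was really good", toy_index, num_words=4))  # [1, 3]
```

Here "really" is not in the dictionary, while "film" and "good" fall outside the limit, so only "the" and "was" survive.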

In [None]:
x_train_seq = tokenizer.texts_to_sequences(train_text)
x_test_seq  = tokenizer.texts_to_sequences(test_text)

In [None]:
print(train_text[0])

Wow. What a wonderful film. The script is nearly perfect it appears this is the only film written by Minglun Wei,I hope he has more stories in him.The acting is sublime. Renying Zhou as Doggie was amazing -- very natural talent, and Xu Zhu was a delight - very believable as the jaded old traditionalist. The soundtrack was very effective, guiding without being overwhelming. If only more movies like this were made whether in Hollywood or Hong Kong- a family friendly, well acted, well written, well directed, near perfect gem.


In [None]:
print(x_train_seq[0])

[1315, 47, 3, 385, 18, 1, 225, 6, 750, 399, 8, 734, 10, 6, 1, 60, 18, 394, 30, 9, 436, 26, 43, 49, 533, 7, 86, 1, 112, 6, 13, 12, 476, 51, 1244, 671, 2, 12, 3, 51, 860, 13, 1, 150, 1, 809, 12, 51, 1126, 205, 108, 44, 60, 49, 98, 36, 10, 67, 89, 722, 7, 358, 38, 1980, 3, 219, 69, 914, 69, 394, 69, 522, 746, 399, 1524]


In [None]:
# sequences_to_texts() converts numbers back to text
tokenizer.sequences_to_texts([[10, 6, 3, 16, 11, 210, 124, 253, 54, 9, 63, 8, 46, 22, 34]])

['this is a movie that gets better each time i see it there are so']

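# Finish building x_train and x_test: make every review's number sequence the same length

* The number sequences differ in length because the reviews contain different numbers of words
* For neural network training, every review's number sequence must have the same length
* Sequences that are too long are cut; sequences that are too short are padded with 0
* In the code, maxlen=100 makes every review's sequence exactly 100 numbers long
* If a review's sequence is longer than 100, pad_sequences truncates numbers from the **front**
* If a review's sequence is shorter than 100, pad_sequences adds 0s at the **front**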
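
The front-truncate and front-pad rules above can be sketched for a single list in plain Python. `pad_pre` is a name made up for this example; it imitates the default behaviour of `pad_sequences` on one sequence and assumes `maxlen` is greater than 0.

```python
def pad_pre(seq, maxlen):
    """Keep the last maxlen items, then left-pad with 0 up to maxlen (maxlen > 0)."""
    kept = seq[-maxlen:]                         # drops items from the front if too long
    return [0] * (maxlen - len(kept)) + kept     # adds 0s at the front if too short

print(pad_pre([5, 8, 2], 5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

`pad_sequences` also accepts `padding='post'` and `truncating='post'` if the end of each review should be padded or cut instead.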

In [None]:
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=100)

In [None]:
# If a review's sequence is shorter than 100, pad_sequences adds 0s at the front

In [None]:
print('before pad_sequences length=',len(x_train_seq[0]))
print(x_train_seq[0])

before pad_sequences length= 75
[1315, 47, 3, 385, 18, 1, 225, 6, 750, 399, 8, 734, 10, 6, 1, 60, 18, 394, 30, 9, 436, 26, 43, 49, 533, 7, 86, 1, 112, 6, 13, 12, 476, 51, 1244, 671, 2, 12, 3, 51, 860, 13, 1, 150, 1, 809, 12, 51, 1126, 205, 108, 44, 60, 49, 98, 36, 10, 67, 89, 722, 7, 358, 38, 1980, 3, 219, 69, 914, 69, 394, 69, 522, 746, 399, 1524]


In [None]:
print('after pad_sequences length=',len(x_train[0]))
print(x_train[0])

after pad_sequences length= 100
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0 1315   47    3
  385   18    1  225    6  750  399    8  734   10    6    1   60   18
  394   30    9  436   26   43   49  533    7   86    1  112    6   13
   12  476   51 1244  671    2   12    3   51  860   13    1  150    1
  809   12   51 1126  205  108   44   60   49   98   36   10   67   89
  722    7  358   38 1980    3  219   69  914   69  394   69  522  746
  399 1524]


In [None]:
# If a review's sequence is longer than 100, pad_sequences truncates numbers from the front

In [None]:
print('before pad_sequences length=',len(x_train_seq[1]))
print(x_train_seq[1])

before pad_sequences length= 413
[9, 442, 10, 18, 20, 108, 3, 1660, 1787, 333, 9, 5, 1140, 23, 2, 23, 23, 98, 22, 249, 17, 23, 959, 61, 6, 27, 15, 1455, 9, 2, 12, 5, 165, 11, 10, 18, 58, 25, 40, 1891, 23, 109, 243, 70, 327, 86, 15, 35, 450, 5, 126, 10, 18, 35, 3, 658, 79, 3, 1195, 1, 173, 2, 1, 525, 5, 253, 81, 2, 5, 1660, 1787, 47, 71, 74, 22, 80, 33, 693, 1660, 1, 114, 659, 40, 86, 35, 28, 1118, 2, 1557, 2, 795, 174, 64, 105, 13, 72, 13, 71, 74, 79, 959, 2, 1617, 109, 35, 23, 5, 23, 5, 23, 5, 23, 365, 5, 23, 2, 56, 5, 23, 18, 227, 1, 18, 406, 6, 320, 7, 3, 360, 348, 92, 11, 182, 220, 42, 4, 1157, 14, 1660, 13, 44, 767, 1, 4, 87, 619, 98, 35, 58, 25, 428, 4, 36, 1649, 3, 1728, 70, 1, 1, 266, 1073, 3, 277, 4, 481, 2, 77, 3, 83, 851, 4, 50, 253, 173, 1672, 15, 64, 974, 447, 76, 25, 5, 838, 959, 1660, 2, 157, 76, 25, 11, 10, 128, 12, 143, 1, 284, 6, 364, 4, 1163, 5, 165, 4, 2, 17, 1, 143, 4, 10, 18, 6, 1, 1233, 1723, 2, 61, 1660, 65, 3, 4, 80, 139, 64, 15, 86, 14, 124, 38, 429, 5, 1, 209

In [None]:
print('after pad_sequences length=',len(x_train[1]))
print(x_train[1])

after pad_sequences length= 100
[   7   23  109  454  783    2    1  525    6  313   15   83  277    4
   56    1  421   29    1  499 1025  185  509   17   30    1  126   71
   63   11   56   10    6    3   47  773   13  533   30 1518   80  624
   15    1 1172   11   28    4    1  173   76   25 1435   79   31   11
  475   76  717   79   15 1788    7  107  765    3  576 1660 1787   96
   20   25    5    3  680   10  776    6    1    5 1269   86   11   71
   66  122   74  108  406    3    4 1319    8    6   29  276  249  612
  239    2]


## Convert x_train, x_test, y_train, y_test to numpy arrays

In [None]:
import numpy as np

x_train = np.array(x_train)
x_test = np.array(x_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)

# Main code for preprocessing the review sentiment analysis data

```python
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(train_text)
```
```python
x_train_seq = tokenizer.texts_to_sequences(train_text)
x_test_seq  = tokenizer.texts_to_sequences(test_text)
```
```python
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=100)
```




# Save the tokenizer
* ## Before saving, make sure the SaveModel folder exists

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd "/content/drive/MyDrive/Colab Notebooks/"

/content/drive/MyDrive/Colab Notebooks


In [None]:
# Before saving, make sure the SaveModel folder exists
import pickle
# The with-block closes the file even if pickling raises an error
with open('SaveModel/imdb_tokenizer_2000.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

To load the fitted tokenizer back later with `pickle`:

```python
import pickle
with open('SaveModel/imdb_tokenizer_2000.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
```
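
The save and load steps can be tried end to end without Drive or TensorFlow. The sketch below rests on a few assumptions: a temporary folder stands in for the Drive working directory, a plain dict (`stand_in`) takes the place of the fitted tokenizer, and `os.makedirs` creates the `SaveModel` folder in case it is missing.

```python
import os
import pickle
import tempfile

# A temporary folder stands in for the Drive working directory.
base_dir = tempfile.mkdtemp()
save_dir = os.path.join(base_dir, "SaveModel")
os.makedirs(save_dir, exist_ok=True)  # create the folder if it is missing

# Stand-in object; in the notebook this would be the fitted tokenizer.
stand_in = {"the": 1, "and": 2, "a": 3}
path = os.path.join(save_dir, "imdb_tokenizer_2000.pkl")

with open(path, "wb") as f:
    pickle.dump(stand_in, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in)  # True
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.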