<a href="https://colab.research.google.com/github/wcliao1962/2025_DL/blob/master/Keras_Imdb_Introduce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

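# Download IMDb

* Before running the download, first **create a data folder** on the Colab virtual machine (click the Files icon (the rightmost one), right-click in the Files pane, and choose New folder)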
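
As an alternative to creating the folder by hand, the folder can also be created from code. This is a minimal sketch that assumes the notebook's working directory is where `data` should live; `os.makedirs` with `exist_ok=True` does nothing if the folder already exists.

```python
import os

# Create the data folder next to the notebook; no error if it is already there.
os.makedirs("data", exist_ok=True)
print(os.path.isdir("data"))
```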


In [1]:
import urllib.request
import os
import tarfile

In [3]:
url="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath="data/aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):
    result=urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)

downloaded: ('data/aclImdb_v1.tar.gz', <http.client.HTTPMessage object at 0x7d53345a0bd0>)


# Extract IMDb into data/aclImdb

In [4]:
if not os.path.exists("data/aclImdb"):
    # The context manager closes the archive once extraction finishes.
    with tarfile.open("data/aclImdb_v1.tar.gz", 'r:gz') as tfile:
        tfile.extractall('data/')

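# The folder structure inside data/aclImdb is as follows:
* **test** (25000 reviews, for testing)
  * **neg**: 12500 text files of negative reviews
  * **pos**: 12500 text files of positive reviews
* **train** (25000 reviews, for training)
  * **neg**: 12500 text files of negative reviews
  * **pos**: 12500 text files of positive reviews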
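
The structure above can be checked after extraction with a short helper. This is only a sketch: `count_reviews` is a hypothetical name, not part of the notebook, and it assumes the archive was extracted to `data/aclImdb`. Folders that do not exist are skipped, so it prints an empty dict if nothing has been extracted yet.

```python
import os

def count_reviews(root="data/aclImdb"):
    """Count the files in each train/test, pos/neg folder under root."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            if os.path.isdir(folder):
                counts[split + "/" + label] = len(os.listdir(folder))
    return counts

print(count_reviews())
```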


# Import modules

In [5]:
# from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

# Data preparation

# Define the rm_tags(text) function
Removes the HTML tags from text

In [6]:
import re
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)
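
To see what the pattern does, here is a small standalone sketch. The sample string is hypothetical: it assumes the raw review file separated paragraphs with `<br /><br />`, which would explain run-together text such as `role.The two` in the cleaned reviews printed later. `rm_tags_spaced` is an illustrative variant, not part of the notebook, that substitutes a space instead of an empty string.

```python
import re

def rm_tags(text):
    # Same pattern as the notebook: remove anything between '<' and '>'.
    return re.sub(r'<[^>]+>', '', text)

def rm_tags_spaced(text):
    # Illustrative variant: replace each tag with a space so words stay apart.
    return re.sub(r'<[^>]+>', ' ', text)

sample = "every facet of his role.<br /><br />The two leads are equally accomplished."
cleaned = rm_tags(sample)
spaced = rm_tags_spaced(sample)
print(cleaned)
print(spaced)
```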

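# Define the read_files(filetype) function
* filetype can be train or test
* If it is train, the **training data** is read from the train folder under data/aclImdb
* If it is test, the **test data** is read from the test folder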


In [7]:
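import os
def read_files(filetype):
    """Read every review under data/aclImdb/<filetype>; return (labels, texts)."""
    path = "data/aclImdb/"
    file_list = []
    all_labels = []

    # Positive reviews come first (label 1), followed by negative reviews (label 0).
    # Labels are counted from the files found rather than hard-coded as 12500 each.
    for label, subdir in ((1, "pos"), (0, "neg")):
        review_path = path + filetype + "/" + subdir + "/"
        files = os.listdir(review_path)
        file_list += [review_path + f for f in files]
        all_labels += [label] * len(files)

    print('read', filetype, 'files:', len(file_list))

    all_texts = []
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            # Join the lines into one string, then strip the HTML tags.
            all_texts += [rm_tags(" ".join(file_input.readlines()))]

    return all_labels, all_texts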

In [8]:
y_train,train_text=read_files("train")

read train files: 25000


In [9]:
y_test,test_text=read_files("test")

read test files: 25000


In [None]:
# View a positive review

In [10]:
train_text[0]

'This is a movie that gets better each time I see it. There are so many nuanced performances in this. William Tracey, as Pepi, is a delight, bringing sharp comic relief. Joseph Schildkraut as Vadas, is the only "villian" in the movie, and his oily charms are well used here. Frank Morgan, is delightful as the owner of the title shop, Mr. Matuschek, and his familiar manner is well used here. I especially liked the performance of Felix Bressart, as Pirovitch. Very believable in every facet of his role.The two leads are equally accomplished, with Margaret Sullivan doing an outstanding job of portraying a slightly desperate, neurotic, yet charming and attractive woman.This movie belongs to Jimmy Stewart though. The movie is presented from his point of view, with the action rotating around him. Mr. Stewart is more then up to the task of carrying the movie, with an amazing performance that uses a wide range of emotions. Just watch Stewart, when he is fired from his job, because of a misunders

In [11]:
y_train[0]

1

In [13]:
# View a negative review

In [14]:
train_text[12500]

"David Lynch's crude and crudely drawn take on South Park presents us with a nightmare of disturbing clichés about suburban middle class families. The father is a hideous monster with three teeth and a disproportionately large circular mouth-hole from which are uttered the most horrendous guttural noises, the son and mother are permanently horrified, incoherent creatures for whom terror is a way of life. A number of equally absurd characters are introduced throughout the series.Lynch is not famous for his comedies (i.e. On the Air, aspects of Wild at Heart), and I am not particularly fond of comedies in general. However, there were a couple of scenes in Dumbland which made me laugh out loud. There are some clever bits of animated cinematography - where Lynch conveys wide ranges of reaction in his characters through a syntactical arrangement of shots as opposed to facial expressions (which never really vary in Dumbland).I believe Lynch was really trying to give his audience a straight-f

In [15]:
y_train[12500]

0

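# Build the dictionary from all training reviews (train_text)
* The dictionary size is limited to 2000: num_words=2000
* All training reviews are read, every English word is ranked by how often it occurs in the reviews, and only the 2000 most frequent words are used. `token.word_index` still lists every word; the limit is applied when text is converted to numbers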

In [16]:
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)
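
The frequency ranking behind `fit_on_texts` can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: `build_word_index` is a hypothetical helper, the regular expression is an assumed stand-in for the Tokenizer's own filtering, and the three toy sentences are made up.

```python
from collections import Counter
import re

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    ranked = counts.most_common()  # sorted by descending count
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ["the movie was good", "the film was bad", "the end"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```

Index 0 is left unused, which is what lets 0 serve as the padding value later on.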

In [17]:
print(token.document_count)

25000


In [18]:
# Convert the dictionary to a list of (word, index) pairs
word_index_items = list(token.word_index.items())

# Print the first 100 items
print(word_index_items[:100])

[('the', 1), ('and', 2), ('a', 3), ('of', 4), ('to', 5), ('is', 6), ('in', 7), ('it', 8), ('i', 9), ('this', 10), ('that', 11), ('was', 12), ('as', 13), ('for', 14), ('with', 15), ('movie', 16), ('but', 17), ('film', 18), ('on', 19), ('not', 20), ('you', 21), ('are', 22), ('his', 23), ('have', 24), ('be', 25), ('he', 26), ('one', 27), ('all', 28), ('at', 29), ('by', 30), ('an', 31), ('they', 32), ('who', 33), ('so', 34), ('from', 35), ('like', 36), ('her', 37), ('or', 38), ('just', 39), ('about', 40), ("it's", 41), ('out', 42), ('has', 43), ('if', 44), ('some', 45), ('there', 46), ('what', 47), ('good', 48), ('more', 49), ('when', 50), ('very', 51), ('up', 52), ('no', 53), ('time', 54), ('she', 55), ('even', 56), ('my', 57), ('would', 58), ('which', 59), ('only', 60), ('story', 61), ('really', 62), ('see', 63), ('their', 64), ('had', 65), ('can', 66), ('were', 67), ('me', 68), ('well', 69), ('than', 70), ('we', 71), ('much', 72), ('been', 73), ('get', 74), ('bad', 75), ('will', 76), ('

# Convert the text of each review into a sequence of numbers
* **Only words that are in the dictionary are converted to numbers**

In [19]:
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq  = token.texts_to_sequences(test_text)
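
The lookup itself can be sketched as follows. `to_sequence` is a hypothetical helper and its regular-expression tokenizer is an assumption, so it only approximates `texts_to_sequences`. The small `word_index` reuses a few (word, index) pairs from the dictionary printed earlier, and `zzyzx` is a made-up word that is not in it.

```python
import re

# A few (word, index) pairs copied from the word_index output printed earlier.
word_index = {"the": 1, "a": 3, "is": 6, "this": 10, "that": 11, "movie": 16}

def to_sequence(text, word_index, num_words=2000):
    """Map each known word to its index; drop unknown or too-rare words."""
    seq = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        i = word_index.get(word)
        if i is not None and i < num_words:
            seq.append(i)
    return seq

seq = to_sequence("This is a movie that zzyzx", word_index)
print(seq)  # → [10, 6, 3, 16, 11]
```

The unknown word simply disappears from the sequence, so a converted review can be shorter than the original text.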

In [20]:
print(train_text[0])

This is a movie that gets better each time I see it. There are so many nuanced performances in this. William Tracey, as Pepi, is a delight, bringing sharp comic relief. Joseph Schildkraut as Vadas, is the only "villian" in the movie, and his oily charms are well used here. Frank Morgan, is delightful as the owner of the title shop, Mr. Matuschek, and his familiar manner is well used here. I especially liked the performance of Felix Bressart, as Pirovitch. Very believable in every facet of his role.The two leads are equally accomplished, with Margaret Sullivan doing an outstanding job of portraying a slightly desperate, neurotic, yet charming and attractive woman.This movie belongs to Jimmy Stewart though. The movie is presented from his point of view, with the action rotating around him. Mr. Stewart is more then up to the task of carrying the movie, with an amazing performance that uses a wide range of emotions. Just watch Stewart, when he is fired from his job, because of a misunderst

In [21]:
print(x_train_seq[0])

[10, 6, 3, 16, 11, 210, 124, 253, 54, 9, 63, 8, 46, 22, 34, 107, 350, 7, 10, 1015, 13, 6, 3, 693, 13, 6, 1, 60, 7, 1, 16, 2, 23, 22, 69, 338, 129, 1262, 6, 1902, 13, 1, 4, 1, 421, 439, 2, 23, 1074, 1372, 6, 69, 338, 129, 9, 257, 419, 1, 235, 4, 13, 51, 860, 7, 171, 4, 23, 213, 1, 103, 828, 22, 1299, 15, 395, 31, 1336, 288, 4, 3, 1070, 1676, 242, 1216, 2, 1563, 251, 10, 16, 5, 1338, 147, 1, 16, 6, 1346, 35, 23, 209, 4, 647, 15, 1, 202, 183, 86, 439, 1338, 6, 49, 91, 52, 5, 1, 4, 1, 16, 15, 31, 476, 235, 11, 1073, 3, 1862, 4, 1432, 39, 102, 1338, 50, 26, 6, 35, 23, 288, 83, 4, 3, 26, 6, 498, 5, 1, 1460, 1090, 2, 11, 34, 31, 1487, 34, 945, 7, 57, 587, 1338, 6, 205, 883, 1, 829, 18, 280, 7, 1, 475, 4, 1, 46, 6, 53, 27, 331, 11, 43, 122, 73, 1989, 19, 18, 11, 6, 498, 5, 34, 336, 47, 26, 6, 543, 5, 31, 307, 29, 1, 54, 26, 89, 10, 16, 26, 127, 65, 87, 4, 23, 607, 1405, 4, 86, 242, 26, 6, 336, 1, 1297, 4, 23, 10, 6, 27, 4, 114, 98, 2, 77, 27, 4, 1, 87, 733, 728, 1286, 21, 76, 165, 9, 382, 10, 

In [22]:
# sequences_to_texts() converts numbers back to text
token.sequences_to_texts([[10, 6, 3, 16, 11, 210, 124, 253, 54, 9, 63, 8, 46, 22, 34]])

['this is a movie that gets better each time i see it there are so']

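# Make the converted sequences the same length
* Every review has a different number of words, so after conversion every review also yields a sequence of a different length
* To train a neural network, every review's sequence must have the same length
* In the code, maxlen=100 sets the sequence length of every review to 100
* If a review's sequence is longer than 100, pad_sequences truncates the numbers at the front
* If a review's sequence is shorter than 100, pad_sequences adds 0s at the front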

In [23]:
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=100)
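
The two rules can be sketched in plain Python for a single sequence. `pad_pre` is a hypothetical helper that mirrors the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences`; it assumes `maxlen` is a positive integer and returns a list rather than a NumPy array.

```python
def pad_pre(seq, maxlen=100):
    """Keep the last maxlen items, then left-pad with 0 up to maxlen."""
    seq = list(seq)[-maxlen:]           # too long: drop items from the front
    return [0] * (maxlen - len(seq)) + seq  # too short: add zeros at the front

short = pad_pre([27, 4, 1], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # → [0, 0, 27, 4, 1]
print(long_)  # → [3, 4, 5, 6, 7]
```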

In [24]:
# If a review's sequence is longer than 100, pad_sequences truncates the numbers at the front

In [25]:
print('before pad_sequences length=',len(x_train_seq[0]))
print(x_train_seq[0])

before pad_sequences length= 250
[10, 6, 3, 16, 11, 210, 124, 253, 54, 9, 63, 8, 46, 22, 34, 107, 350, 7, 10, 1015, 13, 6, 3, 693, 13, 6, 1, 60, 7, 1, 16, 2, 23, 22, 69, 338, 129, 1262, 6, 1902, 13, 1, 4, 1, 421, 439, 2, 23, 1074, 1372, 6, 69, 338, 129, 9, 257, 419, 1, 235, 4, 13, 51, 860, 7, 171, 4, 23, 213, 1, 103, 828, 22, 1299, 15, 395, 31, 1336, 288, 4, 3, 1070, 1676, 242, 1216, 2, 1563, 251, 10, 16, 5, 1338, 147, 1, 16, 6, 1346, 35, 23, 209, 4, 647, 15, 1, 202, 183, 86, 439, 1338, 6, 49, 91, 52, 5, 1, 4, 1, 16, 15, 31, 476, 235, 11, 1073, 3, 1862, 4, 1432, 39, 102, 1338, 50, 26, 6, 35, 23, 288, 83, 4, 3, 26, 6, 498, 5, 1, 1460, 1090, 2, 11, 34, 31, 1487, 34, 945, 7, 57, 587, 1338, 6, 205, 883, 1, 829, 18, 280, 7, 1, 475, 4, 1, 46, 6, 53, 27, 331, 11, 43, 122, 73, 1989, 19, 18, 11, 6, 498, 5, 34, 336, 47, 26, 6, 543, 5, 31, 307, 29, 1, 54, 26, 89, 10, 16, 26, 127, 65, 87, 4, 23, 607, 1405, 4, 86, 242, 26, 6, 336, 1, 1297, 4, 23, 10, 6, 27, 4, 114, 98, 2, 77, 27, 4, 1, 87, 733, 728

In [26]:
print('after pad_sequences length=',len(x_train[0]))
print(x_train[0])

after pad_sequences length= 100
[1487   34  945    7   57  587 1338    6  205  883    1  829   18  280
    7    1  475    4    1   46    6   53   27  331   11   43  122   73
 1989   19   18   11    6  498    5   34  336   47   26    6  543    5
   31  307   29    1   54   26   89   10   16   26  127   65   87    4
   23  607 1405    4   86  242   26    6  336    1 1297    4   23   10
    6   27    4  114   98    2   77   27    4    1   87  733  728 1286
   21   76  165    9  382   10   16  257   14  144   11 1140    1  153
    4 1338]


In [27]:
# If a review's sequence is shorter than 100, pad_sequences adds 0s at the front

In [28]:
print('before pad_sequences length=',len(x_train_seq[1]))
print(x_train_seq[1])

before pad_sequences length= 74
[27, 4, 1, 114, 360, 348, 98, 35, 6, 10, 6, 1, 460, 495, 4, 157, 554, 44, 21, 260, 7, 588, 88, 102, 10, 16, 8, 28, 125, 1118, 241, 10, 16, 283, 11, 720, 1, 4, 1350, 1832, 3, 178, 4, 1568, 2, 9, 436, 46, 76, 25, 3, 1034, 511, 82, 18, 6, 39, 1, 450, 4, 3, 157, 1604, 227, 98, 15, 3, 167, 142, 1061, 14, 1, 329, 565]


In [29]:
print('after pad_sequences length=',len(x_train[1]))
print(x_train[1])

after pad_sequences length= 100
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0   27    4
    1  114  360  348   98   35    6   10    6    1  460  495    4  157
  554   44   21  260    7  588   88  102   10   16    8   28  125 1118
  241   10   16  283   11  720    1    4 1350 1832    3  178    4 1568
    2    9  436   46   76   25    3 1034  511   82   18    6   39    1
  450    4    3  157 1604  227   98   15    3  167  142 1061   14    1
  329  565]


# Data preprocessing

```
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)
```
```
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq  = token.texts_to_sequences(test_text)
```
```
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=100)
```


