Version: 02.14.2023

# Lab 4.1: Implementing Sentiment Analysis
 


In this lab, you will develop a solution that performs sentiment analysis on a dataset from the Internet Movie Database (IMDB).

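## Learning objectives

- Evaluate machine learning (ML) algorithms for natural language processing (NLP) that are used for sentiment analysis.
- Create a solution to a sentiment analysis business problem.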

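## Introduction to the business scenario

In this lab, you play the role of a data scientist on a small development team. The organization you work for maintains a movie review website. A key customer feature has been identified: giving a movie an overall positive (smiley face) or negative (sad face) rating based on the number of positive and negative reviews. You will develop an ML solution that developers can use to run inference on movie reviews. You need to analyze each review and indicate whether it is positive or negative.

To help with this task, you have access to a dataset that contains the raw text of 50,000 movie reviews. The reviews have been labeled as positive or negative.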
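
As a rough illustration of the customer feature, the following sketch turns per-review labels into an overall rating by majority count. The function name and the tie rule (ties count as negative) are assumptions made for this example; the lab does not specify them.

```python
from collections import Counter

def overall_rating(labels):
    """Return 'positive' or 'negative' from per-review labels (1 = positive, 0 = negative).

    Assumption: ties count as negative, because the lab does not define a tie rule.
    """
    counts = Counter(labels)
    return "positive" if counts[1] > counts[0] else "negative"

# Hypothetical predictions for five reviews of one movie
print(overall_rating([1, 1, 0, 1, 0]))  # positive
```

The classifier built in the rest of the lab would supply the labels; this function only aggregates them.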

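About this dataset

The Large Movie Review Dataset is a collection of highly polar movie reviews. This data supports the work in the following paper:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. "Learning Word Vectors for Sentiment Analysis." Presented at the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA, June 2011. http://ai.stanford.edu/~amaas/data/sentiment/

The dataset contains a single text field that holds the review. Each review is labeled positive (1) or negative (0).

The dataset contains the following features:

- text: The text of the review
- label: Whether the review is positive or negative (1 or 0)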

## Lab steps

To complete this lab, you will follow these steps:

1. [Installing packages](#1.-Installing-packages)
2. [Reading the dataset](#2.-Reading-the-dataset)
3. [Performing exploratory data analysis](#3.-Performing-exploratory-data-analysis)
4. [Running the first pass: Minimal processing](#4.-Running-the-first-pass:-Minimal-processing)
5. [Running the second pass: Normalizing the text](#5.-Running-the-second-pass:-Normalizing-the-text)
6. [Tuning hyperparameters](#6.-Tuning-hyperparameters)
7. [Using BlazingText](#7.-Using-BlazingText)
8. [Using Amazon Comprehend](#8.-Using-Amazon-Comprehend)

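## Submitting your work

1. In the lab console, choose **Submit** to record your progress, and when prompted, choose **Yes**.

1. If the results don't display after a few minutes, return to the top of the lab instructions and choose **Grades**.

**Tip**: You can submit your work multiple times. After you change your work, choose **Submit** again. Your last submission is the one recorded for this lab.

1. To find detailed feedback about your work, choose **Details**, and then choose **View Submission Report**.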

## 1. Installing packages
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

First, update and install the packages that you will use in the notebook.


In [1]:
#Install/Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade boto3
!pip install --upgrade scikit-learn
!pip install --upgrade sagemaker
!pip install --upgrade nltk
!pip install --upgrade seaborn



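__Note:__ Before you work through this lab for the first time, we recommend that you restart the kernel by choosing __Kernel__ > __Restart Kernel__.

Import the packages that you will use in the notebook.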

In [2]:
import boto3
import os, io, struct
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The following code cell contains helper functions that plot a confusion matrix and calculate other key metrics.

In [3]:

def plot_confusion_matrix(test_labels, target_predicted):
    matrix = confusion_matrix(test_labels, target_predicted)
    df_confusion = pd.DataFrame(matrix)
    colormap = sns.color_palette("BrBG", 10)
    sns.heatmap(df_confusion, annot=True, fmt='.2f', cbar=None, cmap=colormap)
    plt.title("Confusion Matrix")
    plt.tight_layout()
    plt.ylabel("True Class")
    plt.xlabel("Predicted Class")
    plt.show()
    
def print_metrics(test_labels, target_predicted_binary):
    TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted_binary).ravel()
    # Sensitivity, hit rate, recall, or true positive rate
    Sensitivity  = float(TP)/(TP+FN)*100
    # Specificity or true negative rate
    Specificity  = float(TN)/(TN+FP)*100
    # Precision or positive predictive value
    Precision = float(TP)/(TP+FP)*100
    # Negative predictive value
    NPV = float(TN)/(TN+FN)*100
    # Fall out or false positive rate
    FPR = float(FP)/(FP+TN)*100
    # False negative rate
    FNR = float(FN)/(TP+FN)*100
    # False discovery rate
    FDR = float(FP)/(TP+FP)*100
    # Overall accuracy
    ACC = float(TP+TN)/(TP+FP+FN+TN)*100

    print(f"Sensitivity or TPR: {Sensitivity}%")    
    print(f"Specificity or TNR: {Specificity}%") 
    print(f"Precision: {Precision}%")   
    print(f"Negative Predictive Value: {NPV}%")  
    print( f"False Positive Rate: {FPR}%") 
    print(f"False Negative Rate: {FNR}%")  
    print(f"False Discovery Rate: {FDR}%" )
    print(f"Accuracy: {ACC}%") 
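
To see what the metric formulas compute, the following sketch counts the four confusion-matrix cells by hand and derives a few of the same metrics. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper rather than part of the lab code.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = positive, 0 = negative)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn) * 100        # 3 of 4 actual positives found
specificity = tn / (tn + fp) * 100        # 5 of 6 actual negatives found
precision = tp / (tp + fp) * 100          # 3 of 4 positive predictions correct
accuracy = (tp + tn) / len(y_true) * 100  # 8 of 10 predictions correct
print(tn, fp, fn, tp)  # 5 1 1 3
```

Each formula divides by a count that can be zero (for example, when a model never predicts the positive class), so the helper functions assume that every denominator is nonzero.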



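## 2. Reading the dataset

([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will load the dataset. The dataset has already been downloaded to Amazon SageMaker Studio. Use the __pandas__ library to read the dataset.

#### __Load the training data:__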

In [4]:
df = pd.read_csv('../data/imdb.csv', header=0)

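## 3. Performing exploratory data analysis
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will examine the dataset.

Implement the following functions. The first one is provided so that you can learn the format.


### Challenge: List the first eight rows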

In [5]:
def show_eight_rows(df):
    # Implement this function
    return df.head(8)    

In [6]:
print(show_eight_rows(df))

                                                text  label
0  What I hoped for (or even expected) was the we...      0
1  Garden State must rate amongst the most contri...      0
2  There is a lot wrong with this film. I will no...      1
3  To qualify my use of "realistic" in the summar...      1
4  Dirty War is absolutely one of the best politi...      1
5  Many other viewers are saying that this is not...      1
6  I understand that Roger Corman loves to do thi...      0
7  I love this show. I watched every episode last...      0


### Challenge: What is the shape of the data?

In [7]:
def show_data_shape(df):
    # Implement this function
    ### BEGIN_SOLUTION
    return df.shape
    ### END_SOLUTION

In [8]:
print(show_data_shape(df))

(50000, 2)


### Challenge: How many positive and negative instances are in the data?

In [9]:
def show_data_instances(df):
    # Implement this function
    ### BEGIN_SOLUTION
    return df['label'].value_counts()
    ### END_SOLUTION

In [10]:
print(show_data_instances(df))

0    25000
1    25000
Name: label, dtype: int64


### Challenge: Does the data have any missing values?

In [11]:
def show_missing_values(df):
    # Implement this function
    ### BEGIN_SOLUTION
    return df.isna().sum()
    ### END_SOLUTION
    

In [12]:
print(show_missing_values(df))

text     0
label    0
dtype: int64


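## 4. Running the first pass: Minimal processing
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will perform the minimum steps needed to train a classification model. You will use this model as a baseline to see how text processing affects the results.

First, import the Natural Language Toolkit (NLTK) package and the regular expression (re) package.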

In [13]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

### Challenge: Split the data into training, validation, and test datasets

In this task, you will split the dataset so that 80% is used for training, 10% for validation, and 10% for testing.

To split the dataset, use the `train_test_split` function from __scikit-learn__. For more information about this function, see the [scikit-learn train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Specify __df__ as the dataset. Split it into a __train__ set and a __test_and_validate__ set. Then, split the __test_and_validate__ set into a __test__ set and a __validate__ set.

(*Optional*) To get reproducible results, set `shuffle` and **random_state**.

In [14]:
from sklearn.model_selection import train_test_split
# uncomment the following lines and implement your solution
def split_data(df):
    # train, test_and_validate = train_test_split(....)
    # test, validate = train_test_split(....)
    ### BEGIN_SOLUTION
    train, test_and_validate = train_test_split(df,
                                            test_size=0.2,
                                            shuffle=True,
                                            random_state=324
                                            )
    test, validate = train_test_split(test_and_validate,
                                                test_size=0.5,
                                                shuffle=True,
                                                random_state=324)
    ### END_SOLUTION
    return train, validate, test

Check that the dataset was split correctly by running the following code cell.

In [15]:
train, validate, test = split_data(df)
print(train.shape)
print(test.shape)
print(validate.shape)

(40000, 2)
(5000, 2)
(5000, 2)
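
The 80/10/10 proportions come from splitting twice: 20% is held out first, and that holdout is then split in half. The following standard-library sketch mirrors that arithmetic. It is not how `train_test_split` is implemented, and the seed will not reproduce the same rows as scikit-learn.

```python
import random

def split_80_10_10(rows, seed=324):
    """Shuffle rows, then split them into 80% train, 10% validate, and 10% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = len(shuffled) * 20 // 100      # first split: hold out 20%
    holdout, train_part = shuffled[:n_holdout], shuffled[n_holdout:]
    n_validate = len(holdout) // 2             # second split: half of the holdout
    validate_part, test_part = holdout[:n_validate], holdout[n_validate:]
    return train_part, validate_part, test_part

train_rows, validate_rows, test_rows = split_80_10_10(range(50000))
print(len(train_rows), len(validate_rows), len(test_rows))  # 40000 5000 5000
```

Fixing the seed plays the same role as `random_state`: the same rows land in the same split on every run, so results from different passes stay comparable.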


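### Assembling the processing pipeline

In this cell, you assemble a basic processing pipeline for the text data. You will modify this implementation later to add more functionality.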

In [16]:
%%time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

text_features = ['text']
model_target = 'label'

text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(max_features=500))
])

data_preprocessor = ColumnTransformer([
    ('text_pre_0', text_processor_0, text_features[0])
])

print('Datasets shapes before processing: ', train.shape, validate.shape, test.shape)
train_matrix = data_preprocessor.fit_transform(train)
test_matrix = data_preprocessor.transform(test)
validate_matrix = data_preprocessor.transform(validate)
print('Datasets shapes after processing: ', train_matrix.shape, validate_matrix.shape, test_matrix.shape)

Datasets shapes before processing:  (40000, 2) (5000, 2) (5000, 2)
Datasets shapes after processing:  (40000, 500) (5000, 500) (5000, 500)
CPU times: user 7.54 s, sys: 114 ms, total: 7.65 s
Wall time: 7.65 s
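
The pipeline turns each review into a vector of 500 word counts. The following sketch shows the same bag-of-words idea on a tiny made-up corpus with a vocabulary of three words. It assumes plain whitespace tokenization, which is simpler than the tokenizer that `CountVectorizer` uses, and the function names are hypothetical.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    """Pick the max_features most frequent tokens across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(token for token, _ in ranked[:max_features])

def transform(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary token."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts[token] for token in vocabulary])
    return rows

# Hypothetical mini-corpus
docs = ["good movie good cast", "bad movie", "good plot bad ending"]
vocab = fit_vocabulary(docs, max_features=3)
print(vocab)                   # ['bad', 'good', 'movie']
print(transform(docs, vocab))  # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

As in the lab pipeline, the vocabulary is learned from the training documents only and then reused unchanged for the validation and test documents.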


To train the model, you must upload the data to Amazon Simple Storage Service (Amazon S3) in the correct format. XGBoost uses comma-separated values (CSV) files.

In [17]:
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, X_train, y_train, is_test=False):
    csv_buffer = io.StringIO()
    features = [t.toarray().astype('float32').flatten().tolist() for t in X_train]
    if is_test:
        temp_list = features
    else:
        temp_list = np.insert(features, 0, y_train['label'], axis=1)
    np.savetxt(csv_buffer, temp_list, delimiter=',' )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())
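
The upload function writes training rows with the label in the first column and no header row, and it leaves the label out for inference data. The following sketch produces the same layout in memory with the standard `csv` module. The function names are hypothetical, and nothing is uploaded to Amazon S3.

```python
import csv
import io

def to_training_csv(labels, feature_rows):
    """Write one row per example: label first, then features, with no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in zip(labels, feature_rows):
        writer.writerow([label, *features])
    return buffer.getvalue()

def to_inference_csv(feature_rows):
    """Write features only; the label column is omitted for batch inference."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(feature_rows)
    return buffer.getvalue()

# Hypothetical count vectors for two reviews
print(to_training_csv([1, 0], [[0, 2, 1], [1, 0, 1]]))
```

Keeping the two layouts in separate functions makes it harder to send labeled rows to a batch transform job by mistake.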

In [18]:
bucket = 'c133864a3391494l8261467t1w637423426529-labbucket-hcjcbnnncwhe'

Set the file names for this pass.

In [19]:
prefix='lab41'
train_file='train-pass1.csv'
validate_file='validate-pass1.csv'
test_file='test-pass1.csv'

Upload the training, validation, and test datasets to Amazon S3.

In [20]:
upload_s3_csv(train_file, 'train', train_matrix, train)
upload_s3_csv(validate_file, 'validate', validate_matrix, validate)
upload_s3_csv(test_file, 'test', test_matrix, test, True)

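### Challenge: Train an XGBoost model

Uncomment and complete the following SageMaker code to create an `Estimator`. Use the following parameters:

- **Role**: Use the current SageMaker role (__Hint:__ Use `sagemaker.get_execution_role()`)
- **Instance count**: `1`
- **Instance type**: `ml.m5.xlarge`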


In [21]:
import sagemaker
from sagemaker.image_uris import retrieve
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')
s3_output_location=f's3://{bucket}/{prefix}/output/'

hyperparams={"num_round":"42",
             "eval_metric": "error",
             "objective": "binary:logistic",
             "silent" : 1}

# xgb_model=sagemaker.estimator.Estimator(container,
#                                         role=<INSERT_ROLE_HERE>,
#                                         instance_count=<INSERT_COUNT_HERE>,
#                                         instance_type=<INSERT_INSTANCE_TYPE_HERE>,
#                                         output_path=s3_output_location,
#                                         hyperparameters=hyperparams,
#                                         sagemaker_session=sagemaker.Session())
### BEGIN_SOLUTION
xgb_model=sagemaker.estimator.Estimator(container,
                                        role=sagemaker.get_execution_role(),
                                        instance_count=1,
                                        instance_type='ml.m5.2xlarge',
                                        output_path=s3_output_location,
                                        hyperparameters=hyperparams,
                                        sagemaker_session=sagemaker.Session())
### END_SOLUTION

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


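Set up two data channels. One channel carries the training data that is used to train the model. The other channel carries the validation data that is used to generate performance metrics.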

In [22]:
train_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/train/{train_file}',
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/validate/{validate_file}',
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

Train the model. (This step might take a few minutes.)

In [23]:
%%time

xgb_model.fit(inputs=data_channels, logs=False, job_name='xgb-pass1-'+datetime.now().strftime("%m-%d-%Y-%H-%M-%S"))

INFO:sagemaker:Creating training-job with name: xgb-pass1-11-05-2024-18-35-04



2024-11-05 18:35:04 Starting - Starting the training job.....
2024-11-05 18:35:35 Starting - Preparing the instances for training...
2024-11-05 18:36:00 Downloading - Downloading input data...
2024-11-05 18:36:19 Downloading - Downloading the training image.....
2024-11-05 18:36:50 Training - Training image download completed. Training in progress.......
2024-11-05 18:37:26 Uploading - Uploading generated training model.
2024-11-05 18:37:34 Completed - Training job completed
CPU times: user 136 ms, sys: 11.8 ms, total: 148 ms
Wall time: 2min 31s


Display the metrics from the current XGBoost job.

In [24]:
sagemaker.analytics.TrainingJobAnalytics(xgb_model._current_job_name, 
                                         metric_names = ['train:error','validation:error']
                                        ).dataframe()

   timestamp       metric_name    value
0        0.0       train:error  0.20855
1        0.0  validation:error   0.2355


The initial results don't tell you much on their own. Use the __test__ dataset to calculate more metrics. (This step might take a few minutes.)

In [25]:
%%time

upload_s3_csv('batch-in.csv', 'batch-in', test_matrix, test, True)
batch_X_file='batch-in.csv'
batch_output = f's3://{bucket}/{prefix}/batch-out/'
batch_input = f's3://{bucket}/{prefix}/batch-in/{batch_X_file}'

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m5.2xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line',
                         # A timestamp keeps the job name unique across reruns
                         job_name='xgboost-pass1-'+datetime.now().strftime("%m-%d-%Y-%H-%M-%S"))
xgb_transformer.wait(logs=False)
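
SageMaker job names must be unique within an AWS account and Region, so a fixed name such as `xgboost-pass1` can be used only once; a second run fails with a `ResourceInUse` error. One way to avoid this, sketched here with a hypothetical helper, is to append a timestamp in the same format that the training job name uses.

```python
from datetime import datetime

def unique_job_name(base, now=None):
    """Append a timestamp so that repeated runs produce different job names."""
    now = now or datetime.now()
    return f"{base}-{now.strftime('%m-%d-%Y-%H-%M-%S')}"

# Fixed timestamp so that the example is reproducible
print(unique_job_name("xgboost-pass1", datetime(2024, 11, 5, 18, 35, 4)))
# xgboost-pass1-11-05-2024-18-35-04
```

The timestamp has one-second resolution, so two jobs started within the same second would still collide.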

In [26]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=f'{prefix}/batch-out/batch-in.csv.out')
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])

def binary_convert(x):
    threshold = 0.5
    if x > threshold:
        return 1
    else:
        return 0

target_predicted_binary = target_predicted['class'].apply(binary_convert)



In [None]:
plot_confusion_matrix(test['label'], target_predicted_binary)

In [None]:
print_metrics(test['label'], target_predicted_binary)

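## 5. Running the second pass: Normalizing the text
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will perform some standard preprocessing tasks on the text before you retrain the model.

### Challenge: Remove stopwords that might affect sentiment

You could remove every stopword from the text, but you might want to keep stopwords that can affect sentiment, such as __not__ or __don't__.

Some words to exclude from the stopword list are already provided. Update the function so that it also excludes other words that might affect sentiment.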

In [None]:
# Get a list of stopwords from the NLTK library
stop = stopwords.words('english')

def remove_stopwords(stopwords):
    # Implement this function
    excluding = ['against', 'not', 'don', 'don\'t','ain', 'are', 'aren\'t']
    ### BEGIN_SOLUTION
    excluding = ['against', 'not', 'don', 'don\'t','ain', 'are', 'aren\'t', 'could', 'couldn\'t',
             'did', 'didn\'t', 'does', 'doesn\'t', 'had', 'hadn\'t', 'has', 'hasn\'t', 
             'have', 'haven\'t', 'is', 'isn\'t', 'might', 'mightn\'t', 'must', 'mustn\'t',
             'need', 'needn\'t','should', 'shouldn\'t', 'was', 'wasn\'t', 'were', 
             'weren\'t', 'won\'t', 'would', 'wouldn\'t']
    ### END_SOLUTION
    # Filter the list that was passed in instead of relying on the global `stop`
    return [word for word in stopwords if word not in excluding]

# New stopword list
stopwords = remove_stopwords(stop)
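
The effect of keeping negation words is easiest to see on a single sentence. The following sketch uses a short made-up stopword list instead of the NLTK list, and `drop_stopwords` is a hypothetical helper.

```python
# Hypothetical short stopword list; the NLTK English list is much longer
base_stopwords = ["the", "is", "not", "a", "this", "don't"]
keep_for_sentiment = {"not", "don't"}

# Words taken out of the stopword list are no longer filtered out of the text
filtered_stopwords = [w for w in base_stopwords if w not in keep_for_sentiment]

def drop_stopwords(text, stopword_list):
    """Remove every word that appears in stopword_list."""
    return " ".join(w for w in text.lower().split() if w not in stopword_list)

review = "This is not a good movie"
print(drop_stopwords(review, base_stopwords))      # good movie
print(drop_stopwords(review, filtered_stopwords))  # not good movie
```

With the full stopword list, the negative review reads as "good movie", which is the kind of signal loss that this challenge is meant to prevent.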


### Challenge: Add cleaning steps

Update the following `clean` function to do the following tasks:

- Remove leading and trailing whitespace
- Remove any HTML tags


In [None]:
snow = SnowballStemmer('english')
def clean(sent):
    # Implement this function
    sent = sent.lower()
    sent = re.sub(r'\s+', ' ', sent)  # raw string avoids an invalid-escape warning
    ### BEGIN_SOLUTION
    sent = sent.strip()
    sent = re.compile('<.*?>').sub('',sent)
    ### END_SOLUTION
    filtered_sentence = []
    
    for w in word_tokenize(sent):
        # You are applying custom filtering here. Feel free to try different things.
        # Check if it is not numeric, its length > 2, and it is not in stopwords
        if(not w.isnumeric()) and (len(w)>2) and (w not in stopwords):  
            # Stem and add to filtered list
            filtered_sentence.append(snow.stem(w))
    final_string = " ".join(filtered_sentence) #final string of cleaned words
    return final_string
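
The regular-expression steps in `clean` can be tried on their own, without the tokenizer or the stemmer. The following sketch applies them in the same order to a made-up string; `basic_clean` is a hypothetical name.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy, so each tag is matched separately

def basic_clean(text):
    """Lowercase, collapse whitespace, trim, and remove HTML tags."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return TAG_PATTERN.sub("", text)

print(basic_clean("  A <b>GREAT</b>   film.<br /><br />Loved it!  "))
# a great film.loved it!
```

Removing a tag that sits directly between two words joins them, as in `film.loved`. Replacing tags with a space instead of an empty string is one variation that might be worth trying.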

Use the function that you created earlier to create new training, validation, and test DataFrames.

In [None]:
# Uncomment the next line and implement the function call to split_data
#train, validate, test = 

### BEGIN_SOLUTION
train, validate, test = split_data(df)
### END_SOLUTION

print(train.shape)
print(test.shape)
print(validate.shape)

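The pipeline has been updated so that `CountVectorizer` calls the `clean` function that you defined earlier. This cell takes longer to run than the first pass did.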

In [None]:
%%time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

text_features = ['text']
model_target = 'label'

text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(preprocessor=clean, max_features=500))
])

data_preprocessor = ColumnTransformer([
    ('text_pre_0', text_processor_0, text_features[0])
])

print('Datasets shapes before processing: ', train.shape, validate.shape, test.shape)
train_matrix = data_preprocessor.fit_transform(train)
test_matrix = data_preprocessor.transform(test)
validate_matrix = data_preprocessor.transform(validate)
print('Datasets shapes after processing: ', train_matrix.shape, validate_matrix.shape, test_matrix.shape)

Set the file names for this pass.

In [None]:
prefix='lab41'
train_file='train_pass2.csv'
validate_file='validate_pass2.csv'
test_file='test_pass2.csv'

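### Challenge: Upload the files to Amazon S3

Use the earlier code to upload the new files to Amazon S3.

__Hint:__ Copy the upload code from the first pass and paste it into the following code cell.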

In [None]:
### BEGIN_SOLUTION
upload_s3_csv(train_file, 'train', train_matrix, train)
upload_s3_csv(validate_file, 'validate', validate_matrix, validate)
upload_s3_csv(test_file, 'test', test_matrix, test, True)
### END_SOLUTION

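### Challenge: Create the estimator and set up the data channels

Use the earlier code to set up the estimator and the data channels.

__Hint:__ Copy the code from the earlier cells and paste it into the following cell.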

In [None]:
%%time

container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

hyperparams={"num_round":"42",
             "eval_metric": "error",
             "objective": "binary:logistic",
             "silent" : 1}

### BEGIN_SOLUTION
xgb_model=sagemaker.estimator.Estimator(container,
                                        sagemaker.get_execution_role(),
                                        instance_count=1,
                                        instance_type='ml.m5.2xlarge',
                                        output_path=s3_output_location,
                                        hyperparameters = hyperparams,
                                        sagemaker_session=sagemaker.Session())

train_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/train/{train_file}',
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/validate/{validate_file}',
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

### END_SOLUTION

xgb_model.fit(inputs=data_channels, logs=False, job_name='xgb-pass2-'+datetime.now().strftime("%m-%d-%Y-%H-%M-%S"))

In [None]:
sagemaker.analytics.TrainingJobAnalytics(xgb_model._current_job_name, 
                                         metric_names = ['train:error','validation:error']
                                        ).dataframe()

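### Challenge: Create a batch transformer job

Use the earlier code to create a transformer job. (This step might take a few minutes.)

__Hint:__ Copy the code from the previous example and paste it into the following cell.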

In [None]:
%%time

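### BEGIN_SOLUTION
# Re-upload the batch input so that it holds the features from this pass,
# not the features that were uploaded during the first pass
upload_s3_csv('batch-in.csv', 'batch-in', test_matrix, test, True)

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m5.2xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
### END_SOLUTION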

xgb_transformer.wait(logs=False)

In [None]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])

def binary_convert(x):
    threshold = 0.5
    if x > threshold:
        return 1
    else:
        return 0

target_predicted_binary = target_predicted['class'].apply(binary_convert)


In [None]:
plot_confusion_matrix(test['label'], target_predicted_binary)

In [None]:
print_metrics(test['label'], target_predicted_binary)

Is the new model better or worse than the first model?

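## 6. Tuning hyperparameters
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will create a hyperparameter tuning job to tune the model.

__Note__: Hyperparameter tuning takes about an hour. If you don't have enough time, go to section 7. You can also start the tuning job, skip to section 7, and then return to view the results.

### Challenge: Create an estimator for tuning

The first step is to create an estimator to tune. Uncomment and complete the following estimator code: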

In [None]:
# xgb = sagemaker.estimator.Estimator(....)
### BEGIN_SOLUTION
xgb = sagemaker.estimator.Estimator(container,
                                    role=sagemaker.get_execution_role(), 
                                    instance_count= 1, # make sure you have limit set for these instances
                                    instance_type='ml.m5.2xlarge', 
                                    output_path=f's3://{bucket}/{prefix}/output',
                                    sagemaker_session=sagemaker.Session())
### END_SOLUTION

In [None]:
xgb.set_hyperparameters(eval_metric='error',
                        objective='binary:logistic',
                        num_round=42,
                        silent=1)

### Challenge: Create the hyperparameter ranges

Using the [XGBoost tuning documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html), add hyperparameter ranges to the following cell.



In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'alpha': ContinuousParameter(0,1000)}

### BEGIN_SOLUTION
hyperparameter_ranges = {'alpha': ContinuousParameter(0, 1000),
                         'min_child_weight': ContinuousParameter(0, 120),
                         'subsample': ContinuousParameter(0.5, 1),
                         'eta': ContinuousParameter(0.1, 0.5),  
                         'num_round': IntegerParameter(1,4000)
                         }
### END_SOLUTION

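### Challenge: Specify the objective metric

Update `objective_metric_name` and `objective_type` to values that are appropriate for a binary classification problem. For more information, see the [XGBoost tuning documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html).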

In [None]:
objective_metric_name = '<INSERT_VALUE_HERE>'
objective_type = '<INSERT_VALUE_HERE>'

### BEGIN_SOLUTION
objective_metric_name = 'validation:error'
objective_type = 'Minimize'
### END_SOLUTION

Create the hyperparameter tuning job.

In [None]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=10, # Set this to 10 or above depending upon budget & available time.
                            max_parallel_jobs=1,
                            objective_type=objective_type,
                            early_stopping_type='Auto',
                           )

Run the tuning job. Note that this job might take about 60 minutes.

In [None]:
%%time
tuner.fit(inputs=data_channels, include_cls_metadata=False, wait=False)

If you want to try section 7 while you wait, don't run the next cell. Instead, go to section 7.

In [None]:
tuner.wait()

After the tuning job finishes, you can view its metrics.

In [None]:
from pprint import pprint
from sagemaker.analytics import HyperparameterTuningJobAnalytics

tuner_analytics = HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.name, sagemaker_session=sagemaker.Session())

df_tuning_job_analytics = tuner_analytics.dataframe()

# Sort the tuning job analytics by the final metrics value
df_tuning_job_analytics.sort_values(
    by=['FinalObjectiveValue'],
    inplace=True,
    ascending=False if tuner.objective_type == "Maximize" else True)

# Show detailed analytics for the top 20 models
df_tuning_job_analytics.head(20)

## Using the best hyperparameter job

After the tuning job finishes, you can find the best training job from the **HyperparameterTuner** object.


In [None]:
attached_tuner = HyperparameterTuner.attach(tuner.latest_tuning_job.name, sagemaker_session=sagemaker.Session())
best_training_job = attached_tuner.best_training_job()

In [None]:
from sagemaker.estimator import Estimator
algo_estimator = Estimator.attach(best_training_job)

best_algo_model = algo_estimator.create_model(env={'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT':"text/csv"})

To test the model, run the test data through the data processing pipeline.

In [None]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

text_features = ['text']
model_target = 'label'

text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(preprocessor=clean, max_features=500))
])

data_preprocessor = ColumnTransformer([
    ('text_pre_0', text_processor_0, text_features[0])
])

print('Datasets shapes before processing: ', train.shape, validate.shape, test.shape)
train_matrix = data_preprocessor.fit_transform(train)
test_matrix = data_preprocessor.transform(test)
validate_matrix = data_preprocessor.transform(validate)
print('Datasets shapes after processing: ', train_matrix.shape, validate_matrix.shape, test_matrix.shape)

Use the best model from the hyperparameter tuning job to run a batch transform on the test data.

In [None]:
%%time
upload_s3_csv('batch-in.csv', 'batch-in', test_matrix, test, True)

batch_output = f's3://{bucket}/{prefix}/batch-out/'
batch_input = f's3://{bucket}/{prefix}/batch-in/{batch_X_file}'

xgb_transformer = best_algo_model.transformer(instance_count=1,
                                       instance_type='ml.m5.2xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)
xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgb_transformer.wait(logs=False)

Process the results to compute the predicted classes.

In [None]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=f'{prefix}/batch-out/batch-in.csv.out')
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])

def binary_convert(x):
    threshold = 0.5
    if x > threshold:
        return 1
    else:
        return 0

target_predicted_binary = target_predicted['class'].apply(binary_convert)


Plot the confusion matrix and print the metrics.

In [None]:
plot_confusion_matrix(test['label'], target_predicted_binary)

In [None]:
print_metrics(test['label'], target_predicted_binary)

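## 7. Using BlazingText
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

In this section, you will use BlazingText, which is a built-in Amazon SageMaker algorithm. BlazingText can perform classification without modification. You will reformat the data for BlazingText. Then, you will train the algorithm on the data and compare the results with the previous models.


First, get the algorithm container.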

In [None]:
import sagemaker
from sagemaker.image_uris import retrieve

container = retrieve('blazingtext',boto3.Session().region_name,"latest")

Configure the prefix and file names for the training, validation, and test data.

In [None]:
import io
    
prefix='lab41'
train_file='blazing_train.txt'
validate_file='blazing_validate.txt'
test_file='blazing_test.txt'

Remind yourself what the data looks like.

In [None]:
train.head()


BlazingText requires data in the following format:

\__label__1 Caught this movie on the tube on a Sunday...

The following two cells convert the DataFrames to the correct format and then upload them to Amazon S3.

In [None]:
blazing_text_buffer = io.StringIO()
train.to_string(buf=blazing_text_buffer, columns=['label','text'], header=False, index=False, formatters=
                         {'label': '__label__{}'.format})
s3r = boto3.resource('s3')
s3r.Bucket(bucket).Object(os.path.join(prefix, 'blazing', train_file)).put(Body=blazing_text_buffer.getvalue())

In [None]:
blazing_text_buffer = io.StringIO()
validate.to_string(buf=blazing_text_buffer, columns=['label','text'], header=False, index=False, formatters=
                         {'label': '__label__{}'.format})
s3r.Bucket(bucket).Object(os.path.join(prefix, 'blazing', validate_file)).put(Body=blazing_text_buffer.getvalue())
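
The BlazingText layout can also be built one line at a time. The following sketch is an alternative illustration of the `__label__<n> <text>` format, not the lab's method. `to_blazingtext_line` is a hypothetical helper, and it assumes that each review must fit on a single line.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {single_line}"

# Hypothetical reviews
examples = [(1, "Caught this movie on the tube\non a Sunday..."), (0, "Not  worth it.")]
body = "\n".join(to_blazingtext_line(label, text) for label, text in examples)
print(body)
```

The resulting string could be passed as the `Body` of the same `put` call that the preceding cells use.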

### Challenge: Train the BlazingText estimator

In the next cell, uncomment and complete the estimator code by supplying the missing values.



In [None]:
# bt_model = sagemaker.estimator.Estimator(container,
#                                         sagemaker.get_execution_role(), 
#                                         instance_count=, 
#                                         instance_type=,
#                                         volume_size = 30,
#                                         max_run = 360000,
#                                         input_mode= 'File',
#                                         output_path=,
#                                         sagemaker_session=

### BEGIN_SOLUTION
bt_model = sagemaker.estimator.Estimator(container,
                                         sagemaker.get_execution_role(), 
                                         instance_count=1, 
                                         instance_type='ml.c4.4xlarge',
                                         volume_size = 30,
                                         max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sagemaker.Session())

### END_SOLUTION

Use the following hyperparameters:

In [None]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

Set up the training and validation channels.

In [None]:
train_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/blazing/{train_file}',
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    f's3://{bucket}/{prefix}/blazing/{validate_file}',
    content_type='text/csv')

data_channels_3 = {'train': train_channel, 'validation': validate_channel}

### Challenge: Start the training job

In the following cell, enter the code that starts the training job. (This step might take a few minutes.)

In [None]:
%%time

### BEGIN_SOLUTION
bt_model.fit(inputs=data_channels_3, logs=False)
### END_SOLUTION

After the training job finishes, review the training metrics.

In [None]:
sagemaker.analytics.TrainingJobAnalytics(bt_model._current_job_name, 
                                         metric_names = ['train:accuracy','validation:accuracy']
                                        ).dataframe()

In [None]:
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)

Copy the test data so that it can be formatted for use with the model.

In [None]:
bt_test = test.copy()
bt_test.head()

Format the dataset the way that BlazingText requires.

In [None]:
# bt_test['text'].str.strip()
bt_test.replace(r'\\n','', regex=True, inplace = True)
bt_test.rename(columns={'text':'source'}, inplace=True)
bt_test.drop(columns='label', inplace=True)

In [None]:
print(bt_test.head().to_json(orient="records", lines=True))

Upload the file to Amazon S3.

In [None]:
bt_file = 'bt_input.json'
blazing_text_buffer = io.StringIO()
bt_test.to_json(path_or_buf=blazing_text_buffer, orient="records", lines=True)

In [None]:
s3r.Bucket(bucket).Object(os.path.join(prefix, 'blazing', bt_file)).put(Body=blazing_text_buffer.getvalue())


In [None]:
batch_output = f's3://{bucket}/{prefix}/blazing/'
batch_input = f's3://{bucket}/{prefix}/blazing/{bt_file}'

Use a batch transformer on the test data. (This step might take a few minutes.)

In [None]:
%%time
bt_transformer = bt_model.transformer(instance_count=1,
                                       instance_type='ml.m5.2xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

bt_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='application/jsonlines',
                         split_type='Line')

bt_transformer.wait(logs=True)

Retrieve the results from Amazon S3.

In [None]:
obj = s3.get_object(Bucket=bucket, Key=f'{prefix}/blazing/bt_input.json.out')

In [None]:
target_predicted = pd.read_json(io.BytesIO(obj['Body'].read()),lines=True)

In [None]:
target_predicted.head()

Reformat the results so that you can compute the confusion matrix and metrics.

In [None]:
def binary_convert(label):
    label = label[0].replace('__label__','')
    return int(label)

target_predicted_binary = target_predicted['label'].apply(binary_convert)
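
The following sketch shows the conversion on two hypothetical output lines. It assumes that each line of the batch transform output is a JSON object with a `label` list such as `["__label__1"]` and a `prob` list; the sample values and the `label_to_int` helper are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical batch transform output, one JSON object per line
sample_output = '\n'.join([
    '{"label": ["__label__1"], "prob": [0.97]}',
    '{"label": ["__label__0"], "prob": [0.88]}',
])
sample_predicted = pd.read_json(io.StringIO(sample_output), lines=True)

def label_to_int(label_list):
    # Take the first (top) label and strip the BlazingText prefix
    return int(label_list[0].replace('__label__', ''))

sample_binary = sample_predicted['label'].apply(label_to_int)
print(sample_binary.tolist())
```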

In [None]:
plot_confusion_matrix(test['label'], target_predicted_binary)

In [None]:
print_metrics(test['label'], target_predicted_binary)

How does BlazingText perform compared with the previous models?

## 8. Using Amazon Comprehend
([Go to top](#Lab-4.1:-Implementing-Sentiment-Analysis))

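In this section, you will use Amazon Comprehend to determine sentiment. In addition to positive and negative results, Amazon Comprehend also returns neutral and mixed results. Because Amazon Comprehend is a managed service, less text processing is needed before you use it. You don't need to process any text in this section.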

Review what the data in the `test` DataFrame looks like.

In [None]:
test.head()

Using Amazon Comprehend can be as straightforward as making an API call.

The following cell prints the first five results from Amazon Comprehend.

In [None]:
import boto3
import json

comprehend = boto3.client(service_name='comprehend')
for n in range(5):
    text = test.iloc[n]['text']
    response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    sentiment = response['Sentiment']
    print(f'{sentiment} - {text}')
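
Besides the `Sentiment` label, a `detect_sentiment` response also carries a `SentimentScore` dictionary with one confidence value per class. The sketch below reads both from a hypothetical, hand-written response (no AWS call is made); the `summarize_sentiment` helper and the sample numbers are illustrative assumptions.

```python
def summarize_sentiment(response):
    """Return the reported label, the top-scoring class, and its score."""
    scores = response['SentimentScore']
    top_class = max(scores, key=scores.get)
    return response['Sentiment'], top_class.upper(), scores[top_class]

# Hypothetical response with made-up scores
sample_response = {
    'Sentiment': 'POSITIVE',
    'SentimentScore': {
        'Positive': 0.95,
        'Negative': 0.02,
        'Neutral': 0.02,
        'Mixed': 0.01,
    },
}

label, top_class, top_score = summarize_sentiment(sample_response)
print(label, top_class, top_score)
```

Inspecting the scores can help you judge how confident the service is before you collapse a result to a single label.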


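You can start a prediction job to process multiple items. The input must be formatted with a single input per line and then uploaded to Amazon S3. The maximum text size is 5120, so the `str.slice(0,5000)` function is used to trim long text to its first 5,000 characters.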

In [None]:
# Upload test file minus label to S3
def upload_comprehend_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

comprehend_file = 'comprehend_input.csv'
upload_comprehend_s3_csv(comprehend_file, 'comprehend', test['text'].str.slice(0,5000))
test_url = f's3://{bucket}/{prefix}/comprehend/{comprehend_file}'
print(f'Uploaded input to {test_url}')
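
The trimming step can be checked on a small example. The following sketch uses a hypothetical two-item Series (`toy_reviews`) and writes it the same way as the upload function, without touching Amazon S3. Note that `str.slice` counts characters, not bytes, so the byte-length check is included as an extra precaution for non-ASCII text.

```python
import io

import pandas as pd

# Hypothetical reviews: one short, one longer than the limit
toy_reviews = pd.Series(['short review', 'x' * 6000], name='text')

# Keep at most the first 5,000 characters of each review
trimmed = toy_reviews.str.slice(0, 5000)

# Write one review per line, without header or index
csv_buffer = io.StringIO()
trimmed.to_csv(csv_buffer, header=False, index=False)
lines = csv_buffer.getvalue().splitlines()

max_chars = int(trimmed.str.len().max())
max_bytes = max(len(t.encode('utf-8')) for t in trimmed)
print(len(lines), max_chars, max_bytes)
```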

After the data is uploaded to Amazon S3, you can start the job by using the `start_sentiment_detection_job` function.



### Challenge: Configure the Amazon Comprehend job parameters

In the next cell, configure the Amazon Comprehend job parameters.

- In __input_data_config__:
  - **S3Uri**: Replace *`S3_INPUT_GOES_HERE`* with the `test_url` variable that you defined earlier
  - **InputFormat**: Replace *`INPUT_FORMAT_GOES_HERE`* with `ONE_DOC_PER_LINE`
- In __output_data_config__:
  - **S3Uri**: Replace *`S3_OUTPUT_GOES_HERE`* with `s3_output_location`
- **data_access_role_arn**: Replace *`arn:aws:iam::637423426529:role/service-role/c133864a3391494l8261467t1w-ComprehendDataAccessRole-qUxYBBIu9EvW`* with the Amazon Resource Name (ARN) from the *Lab details* file

In [None]:
input_data_config={
    'S3Uri': 'S3_INPUT_GOES_HERE',
    'InputFormat': 'INPUT_FORMAT_GOES_HERE'
}

output_data_config={
    'S3Uri': 'S3_OUTPUT_GOES_HERE'
}
data_access_role_arn = 'arn:aws:iam::637423426529:role/service-role/c133864a3391494l8261467t1w-ComprehendDataAccessRole-qUxYBBIu9EvW'

### BEGIN_SOLUTION
input_data_config={
    'S3Uri': test_url,
    'InputFormat': 'ONE_DOC_PER_LINE'
}
output_data_config={
    'S3Uri': s3_output_location
}
data_access_role_arn = 'arn:aws:iam::637423426529:role/service-role/c133864a3391494l8261467t1w-ComprehendDataAccessRole-qUxYBBIu9EvW'
### END_SOLUTION

Now that you have defined the job parameters, you can start the sentiment detection job.

In [None]:
response = comprehend.start_sentiment_detection_job(
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
    JobName='movie_sentiment',
    LanguageCode='en'
)

print(response['JobStatus'])

The following cell loops until the job finishes. (This step might take a few minutes.)

In [None]:
%%time
import time

job_id = response['JobId']
# Poll every 15 seconds until the job reaches a terminal state
while True:
    job_status = comprehend.describe_sentiment_detection_job(JobId=job_id)
    status = job_status['SentimentDetectionJobProperties']['JobStatus']
    if status in ['COMPLETED', 'FAILED', 'STOPPED']:
        break
    print('.', end='')
    time.sleep(15)
print(status)
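
The polling pattern can also be written as a reusable helper with an upper bound on the number of polls. The following is a sketch only: `wait_for_job`, its parameters, and the fake status sequence are hypothetical and are not part of the lab or of boto3.

```python
import time

def wait_for_job(get_status, terminal=('COMPLETED', 'FAILED', 'STOPPED'),
                 delay=15, max_polls=240):
    """Call get_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(delay)
    raise TimeoutError('Job did not reach a terminal status in time')

# Simulate a job with a fixed sequence of statuses instead of calling AWS
fake_statuses = iter(['SUBMITTED', 'IN_PROGRESS', 'IN_PROGRESS', 'COMPLETED'])
final_status = wait_for_job(lambda: next(fake_statuses), delay=0)
print(final_status)
```

In the lab, `get_status` could wrap the `describe_sentiment_detection_job` call and return its `JobStatus` value.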

After the job is complete, you can call the `describe_sentiment_detection_job` function to return the job details.

In [None]:
output=(comprehend.describe_sentiment_detection_job(JobId=job_id))
print(output)

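In the **OutputDataConfig** section, you should see `S3Uri`. That URI identifies the file that you must download from Amazon S3. You can use the results to calculate metrics in the same way that you did with the batch transform results from the algorithms.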

In [None]:
comprehend_output_file = output['SentimentDetectionJobProperties']['OutputDataConfig']['S3Uri']
comprehend_bucket, comprehend_key = comprehend_output_file.replace("s3://", "").split("/", 1)

s3r = boto3.resource('s3')
s3r.meta.client.download_file(comprehend_bucket, comprehend_key, 'output.tar.gz')

# Extract the tar file
import tarfile
tf = tarfile.open('output.tar.gz')
tf.extractall()
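
Splitting an S3 URI into a bucket and a key can be checked in isolation. The sketch below uses a made-up URI (`example-bucket` and the path are hypothetical) and compares the string-based split used in the lab with `urllib.parse.urlparse` from the standard library.

```python
from urllib.parse import urlparse

# Hypothetical output location; a real job returns its own URI
example_uri = 's3://example-bucket/lab41/comprehend/output/output.tar.gz'

# String-based split, as in the lab cell
bucket_name, object_key = example_uri.replace('s3://', '').split('/', 1)

# Equivalent split with the standard library URL parser
parsed = urlparse(example_uri)
parsed_bucket = parsed.netloc
parsed_key = parsed.path.lstrip('/')

print(bucket_name, object_key)
```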

The extracted file should be named __output__. Read the lines in this file.

In [None]:
import json
data = ''
with open ('output', "r") as myfile:
    data = myfile.readlines()

Add the lines to an array.

In [None]:
results = []
for line in data:
    json_data = json.loads(line)
    results.append([json_data['Line'],json_data['Sentiment']])

Convert the array to a pandas DataFrame.

In [None]:
c = pd.DataFrame.from_records(results, index='index', columns=['index','sentiment'])
c.head()

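The results contain **NEGATIVE**, **POSITIVE**, **NEUTRAL**, and **MIXED** instead of numeric values. To compare these results with the test data, map them to numeric values, as shown in the following cell. The indexes in the returned results are also out of order. The `sort_index` function should fix this issue.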

In [None]:
class_mapper = {'NEGATIVE':0, 'POSITIVE':1, 'NEUTRAL':2, 'MIXED':3}
c['sentiment']=c['sentiment'].replace(class_mapper)
c = c.sort_index()
c.head()
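
The parse, map, and sort steps can be traced end to end on a few hypothetical output lines. This sketch assumes that each line is a JSON object with at least `Line` and `Sentiment` keys; the sample lines are made up, and `Series.map` is used here as an alternative to `replace`.

```python
import json

import pandas as pd

# Hypothetical job output, deliberately out of order
sample_lines = [
    '{"File": "comprehend_input.csv", "Line": 2, "Sentiment": "NEUTRAL"}',
    '{"File": "comprehend_input.csv", "Line": 0, "Sentiment": "POSITIVE"}',
    '{"File": "comprehend_input.csv", "Line": 1, "Sentiment": "NEGATIVE"}',
]

rows = []
for line in sample_lines:
    record = json.loads(line)
    rows.append((record['Line'], record['Sentiment']))

demo = pd.DataFrame.from_records(rows, index='index', columns=['index', 'sentiment'])

# Map class names to numbers, then restore the original line order
demo_mapper = {'NEGATIVE': 0, 'POSITIVE': 1, 'NEUTRAL': 2, 'MIXED': 3}
demo['sentiment'] = demo['sentiment'].map(demo_mapper)
demo = demo.sort_index()
print(demo)
```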

In [None]:
# Build list to compare for Amazon Comprehend
test_2 = test.reset_index()
test_3 = test_2.sort_index()
test_labels = test_3.iloc[:,2]

You can use the `plot_confusion_matrix` function to display the confusion matrix. Because the Amazon Comprehend results also contain __mixed__ and __neutral__, the chart will look different.

In [None]:
plot_confusion_matrix(test_labels, c['sentiment'])

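The existing function for printing metrics won't work here, because the results now contain more than two classes. The following code cell calculates the same values.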

In [None]:
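cm = confusion_matrix(test_labels, c['sentiment'])

# Use only the negative (0) and positive (1) cells of the matrix;
# NEUTRAL (2) and MIXED (3) predictions are left out of the metrics below
TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
TP = cm[1,1]

# Sensitivity, recall, or true positive rate
Sensitivity  = float(TP)/(TP+FN)*100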
# Specificity or true negative rate
Specificity  = float(TN)/(TN+FP)*100
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
# Negative predictive value
NPV = float(TN)/(TN+FN)*100
# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
# False negative rate
FNR = float(FN)/(TP+FN)*100
# False discovery rate
FDR = float(FP)/(TP+FP)*100
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100

print(f"Sensitivity or TPR: {Sensitivity}%")    
print(f"Specificity or TNR: {Specificity}%") 
print(f"Precision: {Precision}%")   
print(f"Negative Predictive Value: {NPV}%")  
print( f"False Positive Rate: {FPR}%") 
print(f"False Negative Rate: {FNR}%")  
print(f"False Discovery Rate: {FDR}%" )
print(f"Accuracy: {ACC}%") 
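
Because these metrics read only the negative and positive cells of the matrix, reviews that Amazon Comprehend labels as neutral or mixed are not counted. The sketch below uses six hypothetical labels to show how accuracy over that 2x2 block can differ from accuracy over all predictions; the toy data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels (0/1) and predictions (0-3, where 2=NEUTRAL, 3=MIXED)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 2, 1, 1, 3, 1]

# Fix the label order so that the matrix is always 4x4
cm_demo = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])

TN, FP = cm_demo[0, 0], cm_demo[0, 1]
FN, TP = cm_demo[1, 0], cm_demo[1, 1]

# Accuracy over the negative/positive block only
block_accuracy = (TP + TN) / (TP + TN + FP + FN) * 100

# Accuracy over every prediction, counting neutral and mixed as misses
overall_accuracy = np.trace(cm_demo) / cm_demo.sum() * 100

print(f"Block accuracy: {block_accuracy}%")
print(f"Overall accuracy: {overall_accuracy}%")
```

Reporting both numbers makes it clearer how many reviews fell outside the positive and negative classes.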

# Congratulations!

You have completed this lab, and you can now end the lab by following the instructions in the lab guide.

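*©2023 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*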
