# MLOps workshop with Amazon SageMaker

## Module 01: Transform the data and train a model inside a Jupyter notebook.

In this workshop we demonstrate a journey to cloud-native machine learning in three stages: a traditional approach in which models are developed and trained directly in Jupyter notebooks; remote, managed data transformation and training with Amazon SageMaker; and fully automated pipelines with SageMaker Pipelines.

In this first notebook we will predict house prices based on the well-known California housing dataset with a simple regression model in TensorFlow 2.

To begin, we'll import some necessary packages and set up directories for training and test data. In this notebook, the only usage of SageMaker is to manage the compute of the notebook. There is no usage of SageMaker APIs.

In [None]:
!pip install seaborn

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler

In [None]:
import os

data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(test_dir, exist_ok=True)

raw_dir = os.path.join(os.getcwd(), 'data/raw')
os.makedirs(raw_dir, exist_ok=True)

batch_dir = os.path.join(os.getcwd(), 'data/batch')
os.makedirs(batch_dir, exist_ok=True)

## Exploratory Data Analysis (EDA)

According to The State of Data Science 2020 survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist's time.

Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. EDA assists data science professionals in several ways:

- Getting a better understanding of the data.
- Identifying various data patterns.
- Getting a better understanding of the problem statement.

Numerical EDA gives you some very important information, such as the names and data types of the columns and the dimensions of the DataFrame. Visual EDA, on the other hand, gives you insight into the distributions of the features and the target and the relationships between them.

First we'll load the California Housing dataset and explore the data.

## Download California Housing dataset

We use the California housing dataset.

More info on the dataset:

This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

We will use the AWS CLI to download the dataset from S3. You don't need to specify AWS credentials; they are taken from the notebook's IAM role. If you get an error in this step, check that the notebook was created with a proper IAM role.

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .

In [None]:
!tar -zxf cal_housing.tgz 2>/dev/null

In [None]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)
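
The raw file has no header row, which is why `names=columns` and `header=None` are passed to `read_csv`. As a minimal sketch of that behavior, the snippet below reads two made-up rows from an in-memory string. The values and the three-column layout are illustrative assumptions, not taken from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in a headerless, comma-separated layout (illustrative values only)
sample = io.StringIO("-122.23,37.88,41.0\n-122.22,37.86,21.0\n")
sample_columns = ["longitude", "latitude", "housingMedianAge"]

# header=None tells pandas the first line is data; names supplies the column labels
sample_df = pd.read_csv(sample, names=sample_columns, header=None)
print(sample_df)
```

Without `header=None`, pandas would still use `names` here, but stating both makes the intent explicit.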

In [None]:
df.head()

## Numerical EDA
Check how big the dataset is, how many features it has and of what type, and what the target is.

In [None]:
df.info()

Each row of the dataset has 9 attributes:

- longitude: block group longitude
- latitude: block group latitude
- housingMedianAge: median house age in block group
- totalRooms: total number of rooms in block group
- totalBedrooms: total number of bedrooms in block group
- population: block group population
- households: number of households in block group
- medianIncome: median income in block group
- medianHouseValue: median value of owner-occupied homes

Note that all data is numeric and there are no NULL values.
Now, let's summarize the data to see how the values are distributed.

In [None]:
df.describe()

In [None]:
df.value_counts("housingMedianAge", sort=True)

Looking at the mean, we can see that houses are rather old: the median house age averages around 28 years.

## Visual EDA

Let's begin exploring the data with visualization. We will plot a histogram of each feature.

In [None]:
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20, 15))
plt.show()

We see that the data is skewed and not normalized for most of the columns. We will not touch the latitude and longitude for now. Let's apply the logarithmic function log(1 + x) to the rest of the columns and check the result.

In [None]:
columns_to_normalize = [
    'medianIncome', 'housingMedianAge', 'totalRooms', 
    'totalBedrooms', 'population', 'households', 'medianHouseValue'
]

for column in columns_to_normalize:
    df[column] = np.log1p(df[column])
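
To make the effect of `np.log1p` concrete, here is a minimal sketch on a small, made-up, right-skewed sample (the values are illustrative, not from the dataset). It also shows that `np.expm1` undoes the transform.

```python
import numpy as np

# A small right-skewed sample (made-up values, not from the dataset)
values = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

logged = np.log1p(values)      # log(1 + x); defined at x = 0, unlike log(x)
restored = np.expm1(logged)    # exp(y) - 1, the inverse of log1p

print(logged)
print(np.allclose(restored, values))
```

Because "medianHouseValue" is transformed as well, a model trained on this data predicts log-scale values; `np.expm1` would be needed to convert its predictions back to the original units.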

In [None]:
df.hist(figsize=(12, 10), bins=50, edgecolor="black", grid=False)
plt.subplots_adjust(hspace=0.7, wspace=0.4)

The data looks much better. Now we will check the coordinates. First, we will plot the coordinates and color each point by the "medianHouseValue" column.

In [None]:
from matplotlib.colors import LinearSegmentedColormap

cmap = LinearSegmentedColormap.from_list(name='Pacific Ocean shore', colors=['green','yellow','red'])

# create the 10x10-inch figure and its axis in one call, so no empty extra figure is left behind
f, ax = plt.subplots(figsize=(10, 10))
points = ax.scatter(df['longitude'], df['latitude'], c=df['medianHouseValue'], s=10, cmap=cmap)
f.colorbar(points)

The code above uses the matplotlib library to create a scatter plot with a custom color map. Here's a breakdown of the code:

```python
from matplotlib.colors import LinearSegmentedColormap
```
This line imports the LinearSegmentedColormap class from the matplotlib.colors module, which is used to create custom color maps for visualizations.

```python
cmap = LinearSegmentedColormap.from_list(name='Pacific Ocean shore', colors=['green','yellow','red'])
```
Here, a custom colormap named 'Pacific Ocean shore' is created using LinearSegmentedColormap. This colormap transitions from green to yellow to red, represented as a list of colors.

```python
f, ax = plt.subplots(figsize=(10, 10))
```
This line creates a new 10x10-inch figure and an axis within that figure.

```python
points = ax.scatter(df['longitude'], df['latitude'], c=df['medianHouseValue'], s=10, cmap=cmap)
```
A scatter plot is created using the longitude and latitude values from the DataFrame 'df'. The 'c' parameter sets the color data using 'medianHouseValue' column, 's' parameter sets the size of the points to 10, and 'cmap' parameter assigns the custom colormap created earlier.

```python
f.colorbar(points)
```
Finally, a colorbar is added to the figure 'f', based on the scatter plot object 'points', providing a reference for the colors and values represented in the plot.

In summary, this code snippet generates a scatter plot visualizing geographical data from a DataFrame, where the color of each point is determined by the 'medianHouseValue' and the custom colormap 'Pacific Ocean shore' represents the range of values on the plot.
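
To see how the custom colormap turns numbers into colors, here is a minimal sketch that calls the colormap directly. A matplotlib colormap is callable and maps a value in [0, 1] to an RGBA tuple; `scatter` rescales the `c` values to that range by default before applying it. The variable names `low`, `mid`, and `high` are illustrative.

```python
from matplotlib.colors import LinearSegmentedColormap

cmap = LinearSegmentedColormap.from_list(name='Pacific Ocean shore', colors=['green', 'yellow', 'red'])

# A colormap is callable: it maps a number in [0, 1] to an RGBA tuple
low = cmap(0.0)    # start of the range: green
mid = cmap(0.5)    # middle of the range: close to yellow
high = cmap(1.0)   # end of the range: red

print(low, mid, high)
```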

Our dataset is about California, and the outline we see in the plot is the Pacific Ocean shore. The color indicator shows that houses located near the ocean are more expensive. Using domain knowledge, we also notice that the most expensive houses are located near San Francisco (37.7749° N, 122.4194° W) and Los Angeles (34.0522° N, 118.2437° W). Another observation is the relationship between house prices and the distance to those locations. We will engineer the data to produce closer-to-linear dependencies between the house price and the location, which is a good fit for linear regression problems. We remove the "longitude" and "latitude" columns and replace them with Euclidean distances to San Francisco and Los Angeles, measured in degrees of longitude and latitude.

In [None]:
sf_coord=[-122.4194, 37.7749]
la_coord=[-118.2437, 34.0522]

df['DistanceToSF']=np.sqrt((df['longitude']-sf_coord[0])**2+(df['latitude']-sf_coord[1])**2)
df['DistanceToLA']=np.sqrt((df['longitude']-la_coord[0])**2+(df['latitude']-la_coord[1])**2)
df.drop(columns=['longitude', 'latitude'],inplace=True)
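
As a quick sanity check of the distance feature, the sketch below applies the same formula to a single point: the distance from San Francisco to Los Angeles. The helper `degree_distance` is a hypothetical name introduced for illustration. Note that this is a distance in degrees, not kilometers; since a degree of longitude is shorter than a degree of latitude at California's latitudes, it is only an approximation of geographic distance.

```python
import numpy as np

sf_coord = [-122.4194, 37.7749]   # [longitude, latitude]
la_coord = [-118.2437, 34.0522]

def degree_distance(lon, lat, ref):
    """Straight-line distance in degrees between (lon, lat) and a reference point."""
    return np.sqrt((lon - ref[0]) ** 2 + (lat - ref[1]) ** 2)

# Distance from San Francisco to Los Angeles under this measure
sf_to_la = degree_distance(sf_coord[0], sf_coord[1], la_coord)
print(round(float(sf_to_la), 3))
```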

Separate the features from the target, then split the data into training and test datasets.

In [None]:
X = df.drop("medianHouseValue", axis=1)
Y = df["medianHouseValue"].copy()

The code above manipulates the DataFrame 'df'. Let's break it down:

```python
X = df.drop("medianHouseValue", axis=1)
```
Here, a new variable 'X' is created by removing the "medianHouseValue" column from the DataFrame 'df'. The `drop()` function is used to remove the specified column along the specified axis (1 refers to columns, 0 refers to rows). The resulting DataFrame 'X' will contain all the columns from 'df' except for "medianHouseValue".

```python
Y = df["medianHouseValue"].copy()
```
In this line, a new variable 'Y' is created by making a copy of the "medianHouseValue" column from the DataFrame 'df'. The `copy()` method is used to create a deep copy of the column, ensuring that any changes made to 'Y' do not affect the original DataFrame.

In summary, these lines of code separate the features (in variable 'X') from the target variable (in variable 'Y') in preparation for a machine learning task. 'X' will contain the input features, while 'Y' will hold the target variable for the predictive model. These operations are commonly performed in data preprocessing and model preparation stages for machine learning workflows.  

In [None]:
print("Features:", list(X.columns))
print("Dataset shape:", X.shape)
print("Dataset Type:", type(X))
print("Label set shape:", Y.shape)
print("Label set Type:", type(Y))

# We partition the dataset into 2/3 training and 1/3 test set.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33)

np.save(os.path.join(raw_dir, 'x_train.npy'), x_train)
np.save(os.path.join(raw_dir, 'x_test.npy'), x_test)
np.save(os.path.join(raw_dir, 'y_train.npy'), y_train)
np.save(os.path.join(raw_dir, 'y_test.npy'), y_test)
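
The split above is shuffled randomly, so each run produces different training and test sets. A minimal sketch of one way to make it repeatable is shown below on toy data. Passing `random_state` is an addition for illustration and is not used in the notebook; the toy arrays are made up.

```python
import numpy as np
import sklearn.model_selection

# Toy data: 100 rows, 2 feature columns (made-up values); row i is [2i, 2i + 1]
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# random_state (an addition here, not used in the notebook) fixes the shuffle
split_a = sklearn.model_selection.train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
split_b = sklearn.model_selection.train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)
print(np.array_equal(split_a[0], split_b[0]))
```

Features and labels are shuffled together, so each row of `x_tr` still lines up with the matching entry of `y_tr`.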

In [None]:
scaler = StandardScaler()
x_train = np.load(os.path.join(raw_dir, 'x_train.npy'))
scaler.fit(x_train)

The code above uses StandardScaler to normalize the data. Let's go through it:

```python
scaler = StandardScaler()
```

Here we create a StandardScaler and assign it to the variable scaler. StandardScaler normalizes data so that each feature has a mean of 0 and a standard deviation of 1.

```python
x_train = np.load(os.path.join(raw_dir, 'x_train.npy'))
```

This line loads the 'x_train.npy' file from 'raw_dir' and assigns it to the variable x_train. The 'np.load' function loads an array saved by NumPy.

```python
scaler.fit(x_train)
```

Here we call the scaler's fit function on the x_train data. This stores the mean and standard deviation of each x_train column in the scaler, so that they can later be used to normalize data.

In summary, this code fits a StandardScaler to the x_train data in preparation for normalizing it. Normalization puts the features on a consistent scale, which helps improve training performance.
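
To illustrate what `fit` stores and what `transform` later does with it, here is a minimal sketch on made-up data. It compares StandardScaler with the same calculation written out in NumPy. The names `toy_scaler`, `train`, and `test` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features (2 columns on very different scales)
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 200.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(train)  # learns per-column mean and standard deviation from the training data only

# transform() applies (x - mean) / std with the statistics learned in fit()
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(toy_scaler.transform(test), manual)
```

The test row is scaled with the training statistics, which mirrors how the notebook fits the scaler on x_train only and then applies it to both splits.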

We apply the fitted scaler to the feature files and save the training and test data to the file system.

In [None]:
input_files = glob.glob('{}/raw/*.npy'.format(data_dir))
print('\nINPUT FILE LIST: \n{}\n'.format(input_files))
for file in input_files:
    # decide by file name only, so directory names in the path cannot affect the checks
    name = os.path.basename(file)
    is_label = name.startswith('y_')
    is_train = 'train' in name
    raw = np.load(file)
    # only transform feature columns; labels are saved unchanged
    data = raw if is_label else scaler.transform(raw)
    output_path = os.path.join(train_dir if is_train else test_dir, name)
    np.save(output_path, data)
    print('SAVED {} {} DATA FILE\n'.format('LABEL' if is_label else 'TRANSFORMED',
                                           'TRAINING' if is_train else 'TEST'))

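In machine learning training, we normalize only the feature columns, for the following reasons:

- Scaling effect: when features have different value ranges, they are on different scales, which can affect model training. Normalization puts the features on a common scale so that each can contribute equally to training.
- Faster convergence: with some machine learning algorithms, wide feature ranges can slow training or make convergence difficult. Bringing the feature ranges in line can speed up convergence.
- Better interpretability: when features are on different scales, the model becomes harder to interpret. A feature with large values may be perceived as having a large influence on the prediction, which makes the model difficult to interpret.

Therefore, normalizing only the feature columns to put them on a common scale can improve training performance and generally gives better results.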
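
One way to make the convergence point above concrete, at least for a plain linear least-squares model, is the condition number of the matrix X^T X / n: the larger it is, the more slowly plain gradient descent tends to converge. The sketch below uses synthetic data (not the housing data) and a hypothetical helper, `gram_condition_number`, to compare that number before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative, not the housing data)
X_raw = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])

# Standardize each column: zero mean, unit standard deviation
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def gram_condition_number(X):
    """Ratio of largest to smallest eigenvalue of X^T X / n."""
    return np.linalg.cond(X.T @ X / len(X))

cond_raw = gram_condition_number(X_raw)
cond_scaled = gram_condition_number(X_scaled)
print(cond_raw, cond_scaled)
```

The neural network in this notebook is not a linear model, so this is an intuition rather than a guarantee for its training behavior.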


In [None]:
import numpy as np
import os
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

def get_train_data(train_dir):
    x_train = np.load(os.path.join(train_dir, 'x_train.npy'))
    y_train = np.load(os.path.join(train_dir, 'y_train.npy'))
    print('x train', x_train.shape,'y train', y_train.shape)

    return x_train, y_train


def get_test_data(test_dir):
    x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
    y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
    print('x test', x_test.shape,'y test', y_test.shape)

    return x_test, y_test

def get_model():
    inputs = tf.keras.Input(shape=(8,))
    hidden_1 = tf.keras.layers.Dense(8, activation='tanh')(inputs)
    hidden_2 = tf.keras.layers.Dense(4, activation='sigmoid')(hidden_1)
    outputs = tf.keras.layers.Dense(1)(hidden_2)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

This code defines a simple neural network using the Keras API, which is part of TensorFlow.

Let's break down the code step by step:
```python
inputs = tf.keras.Input(shape=(8,))
```
Here, an input layer is defined with shape (8,). This means that the neural network expects input data with 8 features: the six remaining census columns plus the two distance columns.

```python
hidden_1 = tf.keras.layers.Dense(8, activation='tanh')(inputs)
```
A hidden layer named `hidden_1` is defined, which is connected to the input layer. This layer consists of 8 neurons, and the activation function used is "tanh" (hyperbolic tangent). The output of the input layer is passed as input to this layer.

```python
hidden_2 = tf.keras.layers.Dense(4, activation='sigmoid')(hidden_1)
```
Another hidden layer named `hidden_2` is defined, which is connected to `hidden_1`. This layer consists of 4 neurons, and the activation function used is "sigmoid". The output of `hidden_1` is passed as input to this layer.

```python
outputs = tf.keras.layers.Dense(1)(hidden_2)
```
Finally, an output layer named `outputs` is defined, which is connected to `hidden_2`. This layer consists of 1 neuron. No activation function is specified here, which means it will be a linear activation by default.

```python
return tf.keras.Model(inputs=inputs, outputs=outputs)
```
In the last line, a Keras Model is created and returned, specifying the input and output layers.

In summary, the code defines a simple feedforward neural network with one input layer, two hidden layers, and one output layer, specifying the number of neurons and the activation function for each layer. Because the output layer is a single neuron with linear activation, this architecture is suited to regression, which is how it is used here.
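
To get a feel for the size of this network, the sketch below counts its trainable parameters and runs a forward pass in plain NumPy. The weights are random and untrained, so the output values are meaningless; the point is only the layer shapes. This is an illustration of the architecture, not how Keras implements it.

```python
import numpy as np

layer_sizes = [8, 8, 4, 1]  # input features, hidden_1, hidden_2, output

# Each Dense layer has (inputs x units) weights plus one bias per unit
pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
param_counts = [n_in * n_out + n_out for n_in, n_out in pairs]
print(param_counts, sum(param_counts))

# Forward pass with random (untrained) weights, to show the shapes only
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in pairs]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=(5, 8))                                   # batch of 5 rows, 8 features
h1 = np.tanh(x @ weights[0] + biases[0])                      # tanh hidden layer
h2 = 1.0 / (1.0 + np.exp(-(h1 @ weights[1] + biases[1])))     # sigmoid hidden layer
out = h2 @ weights[2] + biases[2]                             # linear output layer
print(out.shape)
```

Calling `model.summary()` on the Keras model should report the same per-layer parameter counts.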

Now we will do the actual training. Feel free to change the hyperparameter values (epochs, batch_size, etc.) to see how they affect the training metric.

In [None]:
x_train, y_train = get_train_data(train_dir)
x_test, y_test = get_test_data(test_dir)

device = '/cpu:0'
print(device)
batch_size = 128
epochs = 25
learning_rate = 0.01
print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate))

with tf.device(device):
    model = get_model()
    optimizer = tf.keras.optimizers.SGD(learning_rate)
    model.compile(optimizer=optimizer, loss='mse')
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_data=(x_test, y_test))

    # evaluate on test set
    scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
    print("\nTest MSE :", scores)

The code above trains and evaluates the model using TensorFlow. Let's break it down step by step:

1. Loading Training and Test Data:
```python
x_train, y_train = get_train_data(train_dir)
x_test, y_test = get_test_data(test_dir)
```
Here, the training data (x_train, y_train) and the test data (x_test, y_test) are loaded from the directories train_dir and test_dir using the get_train_data and get_test_data functions defined earlier.

2. Setting Device, Batch Size, Epochs, and Learning Rate:
```python
device = '/cpu:0'
print(device)
batch_size = 128
epochs = 25
learning_rate = 0.01
print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate))
```
The code sets the device to '/cpu:0' indicating that the operations will be executed on the CPU. The batch size is set to 128, the number of epochs is set to 25, and the learning rate is set to 0.01. These values are printed for visibility.

3. Creating and Compiling the Model:
```python
with tf.device(device):
    model = get_model()
    optimizer = tf.keras.optimizers.SGD(learning_rate)
    model.compile(optimizer=optimizer, loss='mse')
```
Within a context where the device is set to '/cpu:0', a model is created using the get_model() function. Subsequently, a stochastic gradient descent (SGD) optimizer is specified with the given learning rate, and the model is compiled with mean squared error (MSE) as the loss function.

4. Training the Model:
```python
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))
```
The model is trained using the training data (x_train, y_train) with the specified batch size and number of epochs. The validation data (x_test, y_test) is used to monitor the model's performance during training.

5. Evaluating the Model:
```python
scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
print("\nTest MSE :", scores)
```
The trained model is evaluated using the test data, and the mean squared error (MSE) is calculated and printed.

In summary, the provided code loads the training and test data, sets training parameters and device, creates and compiles a model, trains the model, and evaluates its performance using the MSE on the test dataset.
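
To relate batch_size and epochs to the amount of work done, the sketch below estimates how many gradient updates the training run performs. It assumes the full dataset has 20,640 rows, which is the commonly cited size of the California housing dataset; check `df.shape` for the actual number.

```python
import math

n_rows = 20640          # assumed size of the full dataset; verify with df.shape
test_size = 0.33
batch_size = 128
epochs = 25

n_test = math.ceil(test_size * n_rows)              # train_test_split rounds the test share up
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller

total_updates = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_updates)
```

Doubling batch_size roughly halves the number of updates per epoch, which is one reason changing it affects the training metric.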

Mean Squared Error (MSE) is a commonly used metric in machine learning for evaluating the performance of regression models. It measures the average squared difference between the predicted and actual values. MSE penalizes larger errors more heavily due to the squaring operation. By calculating the mean of these squared differences, MSE provides a single numerical value to assess the model's accuracy. A lower MSE indicates better model performance, with zero being the ideal value.
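
As a small worked example of the definition above, the sketch below computes MSE by hand on three made-up values. It also computes the root mean squared error (RMSE), which is in the same units as the target.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

squared_errors = (y_true - y_hat) ** 2   # 0.25, 0.0, 1.0
mse = squared_errors.mean()              # average of the squared errors
rmse = np.sqrt(mse)                      # same units as the target

print(mse, rmse)
```

In this notebook "medianHouseValue" was log-transformed with `np.log1p`, so the reported MSE measures squared error on the log scale rather than in the original units.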

In [None]:
model.save('model' + '/1')

In [None]:
!ls -R model

Our model is now trained, and the metric looks good. We will use the test dataset to see how close our predictions are to the actual values.

In [None]:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model/1')

x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
scores = model.evaluate(x_test, y_test, verbose=2)
print("\nTest MSE :", scores)

In [None]:
y_pred = model.predict(x_test)
# round predictions and labels to one decimal place
flat_list_pred = [float('%.1f'%(item)) for sublist in y_pred for item in sublist]
flat_list_test = [float('%.1f'%(item)) for item in y_test]
test_result = pd.DataFrame({'Predicted':flat_list_pred,'Actual':flat_list_test})
fig = plt.figure(figsize=(16,8))
# plot the first 50 rows; lines are drawn in column order, so the legend must follow it
plt.plot(test_result[:50])
plt.legend(['Predicted','Actual'])

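The MSE metric suggested that our model would perform well, and indeed, we see in the visualization above a good correlation between actual and predicted values. Note that both are on the log scale, because "medianHouseValue" was transformed with log1p earlier.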
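
One way to put a number on that visual impression is to compute the Pearson correlation and the coefficient of determination (R²) between actual and predicted values. The sketch below does this on made-up log-scale values standing in for the 'Actual' and 'Predicted' columns; it does not use the notebook's real results.

```python
import numpy as np

# Made-up log-scale values standing in for the 'Actual' and 'Predicted' columns
actual = np.array([12.1, 11.8, 12.5, 13.0, 11.2])
predicted = np.array([12.0, 11.9, 12.3, 12.8, 11.5])

# Pearson correlation between predictions and actual values
pearson_r = np.corrcoef(actual, predicted)[0, 1]

# Coefficient of determination: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(pearson_r, r_squared)
```

With the real data, the same calculation could be applied to `test_result['Actual']` and `test_result['Predicted']`.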