 <img src="https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-horiz-500x200-2c50-d@2x.png" width=300>
 
# Korean Society for Computational Sciences and Engineering, AI Winter School 2022
# [KSCSE](http://www.cse.or.kr/) 2022  GPU Tutorial  @ High1
# Day1 - Introduction to AI
by Hyungon Ryu | NVAITC(NVIDIA AI Tech. Center)  Korea 


![](http://www.cse.or.kr/assets/img/logo_cse.png)

# Part I - Linear Regression

 

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_ols_001.png)

Scikit-learn (sklearn) provides simple and efficient tools for predictive data analysis, covering statistical modeling and machine learning tasks such as classification, regression, clustering, and dimensionality reduction. It is written primarily in Python and builds on NumPy, SciPy, and Matplotlib.

`from sklearn.linear_model import LinearRegression`

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

## Diabetes dataset
Original dataset: [diabetes.tab.txt](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt)
- Number of observations: 442
- X, 10 variables
  - age, age  in years
  - sex
  - bmi,  body mass index
  - bp, average blood pressure
  - s1 tc,  total serum cholesterol
  - s2 ldl, low-density lipoproteins
  - s3 hdl, high-density lipoproteins
  - s4 tch, total cholesterol / HDL
  - s5 ltg, possibly log of serum triglycerides level
  - s6 glu, blood sugar level

- Y, quantitative measure of disease progression one year after baseline


### understand dataset
scikit-learn bundles a preprocessed (normalized) copy of the diabetes dataset.
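
In the bundled copy, each feature column is mean-centered and divided by its standard deviation times the square root of the sample count, so every column has a sum of squares of 1. A minimal sketch of that scaling on synthetic data follows; the `raw` array and its values are made up for illustration and are not the diabetes data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw feature matrix: 442 rows, 3 columns on different scales.
raw = rng.normal(loc=[50.0, 25.0, 120.0], scale=[10.0, 4.0, 15.0], size=(442, 3))

n_samples = raw.shape[0]
centered = raw - raw.mean(axis=0)
# Divide by (population std * sqrt(n)) so each column's squared values sum to 1.
scaled = centered / (raw.std(axis=0) * np.sqrt(n_samples))

print(scaled.mean(axis=0))          # close to 0 for every column
print((scaled ** 2).sum(axis=0))    # close to 1 for every column
```

This is why the values printed from `diabetes_X[0]` are small numbers around zero rather than ages or blood pressure readings.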

In [None]:
# Load the diabetes dataset
from sklearn import datasets
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

In [None]:
print(diabetes_X.shape)
print(diabetes_y.shape)
print(diabetes_X[0],diabetes_y[0] )

### select variables and split train/test datasets

In [None]:
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2] # 2 for BMI 

In [None]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-40]
diabetes_X_test = diabetes_X[-40:]

In [None]:
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-40]
diabetes_y_test = diabetes_y[-40:]

In [None]:
# Plot outputs
plt.scatter(diabetes_X_train, diabetes_y_train, color="black")

## regression with ols
`LinearRegression` fits a linear model with coefficients `w = (w1, …, wp)` that minimizes the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Documentation: [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
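
To make "minimize the residual sum of squares" concrete, here is a small sketch that solves the same kind of least-squares problem directly with NumPy. The toy data is made up and lies exactly on a line, so the recovered slope and intercept are easy to check; this is an illustration of the idea, not a description of scikit-learn's internals.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1 (illustrative, not the diabetes data).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the feature, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution minimizing ||A @ w - y||^2.
w, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w

print(slope, intercept)   # approximately 2.0 and 1.0
```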

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

In [None]:
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

In [None]:
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

### draw regression line

In [None]:
# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
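
The coefficient of determination reported by `r2_score` can be written out by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch with made-up numbers (not the diabetes predictions):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred = np.array([2.5, 5.5, 7.0, 9.0])   # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
mse = ss_res / len(y_true)                       # mean squared error
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(mse, r2)
```

A model that always predicts the mean of `y_true` would get a score of 0, and a perfect model would get 1.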

##  model with Keras MLP

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### diabetes dataset

From Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499, we have

"Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."

In the tab delimited file above, the variables are named

`AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y`
whereas, in the R file, they are named

`age sex bmi map tc ldl hdl tch ltg glu y`

[link](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)

### load data with pandas dataframe library

In [None]:
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', sep='\t')
print(df.head())

### visualize multivariates with sns pairplot

In [None]:
sns.pairplot(df[["AGE", 'SEX' , "BMI", "BP", "S1", "S6", "Y"]] , hue='SEX')

In [None]:
sns.pairplot(df[["AGE",  "BMI", "BP", "S1", "S6", "Y"]] , hue='AGE')

In [None]:
df_train = df.sample(frac=0.8,random_state=0)
df_test = df.drop(df_train.index)

In [None]:
train_stats = df_train.describe()
train_stats.pop("Y")
train_stats = train_stats.transpose()
train_stats

In [None]:
train_labels = df_train.pop('Y')
test_labels = df_test.pop('Y')

In [None]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

In [None]:
df_train_normalized = norm(df_train)
df_test_normalized = norm(df_test)
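
Note that `norm` standardizes both splits with the mean and standard deviation of the training set only, so no information from the test rows leaks into preprocessing. A minimal NumPy sketch of the same idea, using made-up numbers:

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # made-up training rows
test = np.array([[4.0, 40.0]])                              # made-up test row

# ddof=1 matches the sample standard deviation reported by pandas describe().
train_mean = train.mean(axis=0)
train_std = train.std(axis=0, ddof=1)

train_normalized = (train - train_mean) / train_std
test_normalized = (test - train_mean) / train_std   # reuse the training statistics

print(test_normalized)
```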

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## simple MLP model

# Building a Neuron

Neurons are the fundamental building blocks of a neural network. Just as biological neurons send an electrical impulse under specific stimuli, artificial neurons produce a numerical output for a given numerical input.

We can break down building a neuron into 3 steps:

 - Defining the architecture
 - Initiating training (compile)
 - Evaluating the model

![](https://camo.githubusercontent.com/b1cabba25cf7982d07a2a8ad60f344a0a69b463a75896d03c0e05ee02253a3bc/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f312f31302f426c617573656e5f303635375f4d756c7469706f6c61724e6575726f6e2e706e672f35313270782d426c617573656e5f303635375f4d756c7469706f6c61724e6575726f6e2e706e67)

Image courtesy of Wikimedia Commons

Biological neurons transmit information with a mechanism similar to Morse code. A neuron receives electrical signals through its dendrites and, under the right conditions, sends an electrical impulse down the axon and out through the terminals.

It is theorized that the sequence and timing of these impulses play a large part in how information travels through the brain. Most artificial neural networks have yet to capture this timing aspect of biological neurons, and instead emulate the phenomenon with simpler mathematical formulas.

# The Math
Computers are built with discrete 0s and 1s whereas humans and animals are built on more continuous building blocks. Because of this, some of the first neurons attempted to mimic biological neurons with a linear regression function: $y = mx + b$. The $x$ is like information coming in through the dendrites and the $y$ is like the output through the terminals. As the computer guesses more and more answers to the questions we present it, it will update its variables ($m$ and $b$) to better fit the line to the data it has seen.
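
The idea of repeatedly updating $m$ and $b$ can be sketched with plain gradient descent on the mean squared error. The data, the learning rate, and the number of steps below are made-up choices for illustration; Keras optimizers such as Adam follow the same general pattern with more refined update rules.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0          # made-up data on the line y = 3x + 2

m, b = 0.0, 0.0            # initial guess
learning_rate = 0.05       # hypothetical step size

for _ in range(2000):
    error = (m * x + b) - y
    # Gradients of mean((m*x + b - y)^2) with respect to m and b.
    grad_m = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(round(m, 3), round(b, 3))   # approaches 3.0 and 2.0
```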

Neurons are often exposed to multivariate data. We're going to build a neuron that takes each value (a float) and assigns it a weight, which is equivalent to our $m$. Data scientists often express this weight as $w$. For example, the first variable will have a weight of `w0`, the second a weight of `w1`, and so on. Our full equation becomes `y = w0x0 + w1x1 + w2x2 + ... + b`.

Each observation has 10 variables, so we will have a total of 10 weights. All variable values are normalized, and each one is assigned a weight.
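
The weighted sum `y = w0x0 + w1x1 + w2x2 + ... + b` is a dot product plus a bias. A tiny sketch with three made-up inputs and hypothetical weights:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # three made-up input values
w = np.array([0.1, 0.2, 0.3])    # one hypothetical weight per input
b = 0.05                         # bias term

# 0.1*0.5 + 0.2*(-1.0) + 0.3*2.0 + 0.05
y = np.dot(w, x) + b
print(y)
```

With 10 variables, `x` and `w` would each have 10 entries and the computation stays the same.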

#### Defining our model

Our model has three layers:

 - 10 input features (10 variables)
 - 512 nodes in the hidden layer (feel free to experiment with this value) 
 - 1 output node for the predicted value

We assume the input is a 1D array. The network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected, or fully connected, neural layers. The first Dense layer has 512 nodes (or neurons). The second (and last) layer is a 1-node dense layer that returns a single float for regression.
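
As a sanity check on the architecture, the number of trainable parameters in a dense layer is `inputs * outputs + outputs` (weights plus biases). The helper `dense_params` below is a hypothetical name introduced only for this sketch; the totals should agree with what `model.summary()` reports for a 10-512-1 network.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output node."""
    return n_in * n_out + n_out

layer_sizes = [10, 512, 1]   # input features, hidden nodes, output node
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)

print(counts, total)   # [5632, 513] 6145
```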



In [None]:
model = keras.Sequential([
    layers.Input(shape=[len(df_train.keys())]),
    layers.Dense(512, activation='relu'),
    layers.Dense(1)  # regression output
])


In [None]:
model.summary()

### compile the model


Before the model is ready for training, it needs a few more settings. These are added during the model's compile step:

 - <B>Loss function</B>: This measures how far the model's predictions are from the targets during training. You want to minimize this function to "steer" the model in the right direction. See Keras's [loss functions](https://keras.io/api/losses/) section
 - <B>Optimizer</B>: This is how the model is updated based on the data it sees and its loss function. See Keras's [Optimizer](https://keras.io/api/optimizers/) section
 - <B>Metrics</B>: Used to monitor the training and testing steps. The following example uses mean absolute error (`mae`) and mean squared error (`mse`). See Keras's [Metrics](https://keras.io/api/metrics/) section
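
To see what the `mse` loss and the `mae` metric measure, here is a small NumPy sketch with made-up targets and predictions. Squaring makes MSE much more sensitive to the single large error than MAE is.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0])   # made-up targets
y_pred = np.array([110.0, 140.0, 260.0])   # made-up predictions

errors = y_pred - y_true            # 10, -10, 60
mse = np.mean(errors ** 2)          # mean squared error
mae = np.mean(np.abs(errors))       # mean absolute error

print(mse, mae)
```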

In [None]:
model.compile(loss='mse',
                optimizer='adam',
                metrics=['mae', 'mse'])

### prepare callback function for log

In [None]:
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 50 == 0: print(epoch, logs)
    print('.', end='')


### train the model 
Training the neural network model requires the following steps:

 1. Feed the training data to the model. In this example, the training data is in the `df_train_normalized`  and `train_labels` arrays.
 2. The model learns to associate the input variables with the output.
 3. You ask the model to make predictions about a test set, in this example the `df_test_normalized` array. Verify that the predictions match the labels from the `test_labels` array.

To start training, call the `model.fit` method, so called because it "fits" the model to the training data:

In [None]:
hist = model.fit(
  df_train_normalized, train_labels,
  epochs=500, validation_split = 0.2, verbose=0  ,
  callbacks=[PrintDot()] )
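
The `validation_split=0.2` argument holds out the last 20% of the rows passed to `fit` (taken before any shuffling) and reports `val_` metrics on them. The sketch below mimics that slicing with NumPy; the sample count of 354 is an assumption based on an 80% sample of 442 rows, and the exact rounding inside Keras may differ slightly.

```python
import numpy as np

n_samples = 354            # assumed size of the training frame
validation_split = 0.2

# Rows before split_at are used for training, the rest for validation.
split_at = int(n_samples * (1.0 - validation_split))
indices = np.arange(n_samples)
train_idx, val_idx = indices[:split_at], indices[split_at:]

print(len(train_idx), len(val_idx))
```

Because the held-out rows are simply the tail of the array, shuffling the data beforehand matters when the rows are ordered; here `df.sample` already returns the rows in random order.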

In [None]:
plt.plot(hist.history['loss']) 

In [None]:
plt.plot(hist.history['mae'])
plt.plot(hist.history['val_mae'])
plt.show()

In [None]:
df_test_predictions = model.predict(df_test_normalized).flatten()

In [None]:
plt.scatter(test_labels, df_test_predictions)
plt.xlabel('True Values [Y]')
plt.ylabel('Predictions [Y]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-400, 400], [-400, 400], color='black')

In [None]:
error = df_test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [Y]")
_ = plt.ylabel("Count")

### result

This model does not predict well.

## one hot encoding for sex 
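
One-hot encoding replaces a categorical column with one indicator column per category, so the model does not treat the codes 1 and 2 as ordered quantities. A minimal NumPy sketch with made-up values (the cells below use `pd.get_dummies` for the same purpose):

```python
import numpy as np

sex = np.array([2, 1, 2, 2, 1])      # made-up values coded 1/2
categories = np.unique(sex)          # sorted distinct codes: [1, 2]

# Compare every row against every category to build the indicator columns.
one_hot = (sex[:, None] == categories[None, :]).astype(float)

print(one_hot)
```

Each row contains exactly one 1.0, in the column of its category.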

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', sep='\t')
print(df.head())

In [None]:
df_train = df.sample(frac=0.8,random_state=0)
df_test = df.drop(df_train.index)

In [None]:
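# dtype=float keeps the dummy columns numeric so describe() and norm() include them
df_train = pd.get_dummies(df_train, columns=['SEX'], prefix='SEX', dtype=float)
df_test = pd.get_dummies(df_test, columns=['SEX'], prefix='SEX', dtype=float)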

In [None]:
train_stats = df_train.describe()
train_stats.pop("Y")
train_stats = train_stats.transpose()
train_stats

In [None]:
train_labels = df_train.pop('Y')
test_labels = df_test.pop('Y')

In [None]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

In [None]:
df_train_normalized = norm(df_train)
df_test_normalized = norm(df_test)

In [None]:
model = keras.Sequential([
    layers.Input(shape=[len(df_train.keys())]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1)  # regression output
])


In [None]:
model.compile(loss='mse',
                optimizer='adam',
                metrics=['mae', 'mse'])

In [None]:
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 50 == 0: print(epoch, logs)
    print('.', end='')

In [None]:
hist = model.fit(
  df_train_normalized, train_labels,
  epochs=500, validation_split = 0.2, verbose=0  ,
  callbacks=[PrintDot()] )

In [None]:
plt.plot(hist.history['loss']) 

In [None]:
plt.plot(hist.history['mae'])
plt.plot(hist.history['val_mae'])
plt.show()

In [None]:
df_test_predictions = model.predict(df_test_normalized).flatten()

In [None]:
plt.scatter(test_labels, df_test_predictions)
plt.xlabel('True Values [Y]')
plt.ylabel('Predictions [Y]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-400, 400], [-400, 400], color='black')

In [None]:
error = df_test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [Y]")
_ = plt.ylabel("Count")


In [None]:
print(df_train_normalized.shape, df_test_normalized.shape)

In [None]:
num_samples = df_train_normalized.shape[0]
num_variables = df_train_normalized.shape[1]
num_dim = 1 


# XGBoost (optional)

In [None]:
import xgboost as xgb

In [None]:
xg_reg = xgb.XGBRegressor()

In [None]:
xg_reg.fit(df_train_normalized, train_labels)

In [None]:
preds = xg_reg.predict(df_test_normalized)

In [None]:
plt.scatter(test_labels, preds)
plt.xlabel('True Values [Y]')
plt.ylabel('Predictions [Y]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-350, 350], [-350, 350], color='black')

In [None]:
# Use the XGBoost predictions computed above
error = np.array(preds) - np.array(test_labels)
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [Y]")
_ = plt.ylabel("Count")

# reshape for conv 1d 
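
`Conv1D` expects input shaped `(samples, steps, channels)`, so the 2D table gains a trailing channel axis of size 1. With the default `valid` padding and stride 1, each convolution with `kernel_size=3` shortens the sequence by 2. The sketch below traces the shapes with a made-up table of 4 rows and 11 columns (10 original variables with `SEX` expanded into two columns); `valid_length` is a hypothetical helper for this illustration.

```python
import numpy as np

n_samples, n_features = 4, 11   # made-up row count; 11 columns after one-hot encoding
table = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# Add a trailing channel axis: (samples, steps, channels).
as_sequence = table.reshape(n_samples, n_features, 1)

def valid_length(steps, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return steps - kernel_size + 1

after_conv1 = valid_length(n_features, 3)    # 11 -> 9
after_conv2 = valid_length(after_conv1, 3)   # 9 -> 7
flattened = after_conv2 * 128                # 7 steps * 128 filters

print(as_sequence.shape, after_conv1, after_conv2, flattened)
```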

In [None]:
num_samples = df_train_normalized.shape[0]
num_variables = df_train_normalized.shape[1]
num_dim = 1 
df_train_normalized_reshaped = np.array(df_train_normalized).reshape(num_samples,num_variables,num_dim)

In [None]:
num_samples = df_test_normalized.shape[0]
num_variables = df_test_normalized.shape[1]
num_dim = 1 

df_test_normalized_reshaped = np.array(df_test_normalized).reshape(num_samples,num_variables,num_dim)

In [None]:
print(df_train_normalized_reshaped.shape, df_test_normalized_reshaped.shape)

In [None]:
model = keras.Sequential([
    layers.Input(shape=(num_variables, num_dim)),
    layers.Conv1D(filters=128, kernel_size=3, activation='relu', name="Conv1D_1"),
    # layers.MaxPooling1D(pool_size=2, name="MaxPooling1D_1"),
    layers.Dropout(0.2),
    layers.Conv1D(filters=128, kernel_size=3, activation='relu', name="Conv1D_2"),
    # layers.MaxPooling1D(pool_size=2, name="MaxPooling1D_2"),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dense(1)  # regression output
])


In [None]:
model.summary()

In [None]:
model.compile(loss='mse',
                optimizer='adam',
                metrics=['mae', 'mse'])

In [None]:
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 50 == 0: print(epoch, logs)
    print('.', end='')

In [None]:
hist = model.fit(
  df_train_normalized_reshaped, train_labels,
  epochs=500, validation_split = 0.2, verbose=0  ,
  callbacks=[PrintDot()] )

In [None]:
plt.plot(hist.history['loss']) 

In [None]:
plt.plot(hist.history['mae'])
plt.plot(hist.history['val_mae'])
plt.show()

In [None]:
df_test_normalized_reshaped_pred = model.predict(df_test_normalized_reshaped).flatten() 


In [None]:
plt.scatter(test_labels , df_test_normalized_reshaped_pred )
plt.xlabel('True Values [Y]')
plt.ylabel('Predictions [Y]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-400, 400], [-400, 400], color='black')

In [None]:
error = np.array(df_test_normalized_reshaped_pred) - np.array(test_labels)
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [Y]")
_ = plt.ylabel("Count")

# end of jupyter part1
### Navigation  |[Part1-reg](part1_LR.ipynb) |  [Part2-MLP](part2_MLP.ipynb) |  [Part3-CNN](part3_CNN.ipynb) |  [Part4-ResNet](part4_resnet.ipynb) | [Part5-MLP Mixer](part5_Mixer.ipynb) |

