**Regression**

Regression models are typically used to predict one value (or a set of values) based on input data. For example, we could predict the price of a car from its year, fuel consumption, type (sports, compact, SUV), and motor power. Or we could predict the number of sales of a specific product from the month of the year, the product price, and the state of the local economy.

Regression is a supervised learning technique: it models how the independent variables (the inputs) influence the dependent variables (the outputs) by fitting a mathematical function to the training data.
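To make "fitting a mathematical function" concrete, here is a minimal sketch that fits a straight line with NumPy's least-squares solver. The car ages and prices are made-up toy values (an assumption for illustration, not part of the wine dataset), chosen so that the fitted line is easy to check by hand.

```python
import numpy as np

# Hypothetical toy data: car age in years (input) and price in thousands (output).
# The prices follow price = 30 - 2 * age exactly, so the fit is easy to verify.
age = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = 30.0 - 2.0 * age

# Design matrix with a column of ones so the model can learn an intercept.
X = np.column_stack([age, np.ones_like(age)])

# Least-squares fit: finds the slope and intercept that minimize the squared error.
(slope, intercept), *_ = np.linalg.lstsq(X, price, rcond=None)

# Use the fitted function to predict the price of a 6-year-old car.
predicted = slope * 6.0 + intercept
print("slope={:.2f}, intercept={:.2f}, prediction={:.2f}".format(slope, intercept, predicted))
```

A neural network does the same kind of fitting, only with a more flexible function and an iterative optimizer instead of a closed-form solver.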

In [2]:
import numpy as np

from keras.layers import Dense, Input
from keras.models import Model, load_model
from keras.optimizers import Adam

Using TensorFlow backend.


**Introducing new Libraries**

* **pandas**: Library offering data structures and operations for data manipulation and analysis.

* **sklearn**: Library providing machine learning algorithms and tools.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.utils import shuffle

**The Data**

We are going to use a dataset about wine quality. The input variables give us information about the wine, such as its pH and alcohol percentage. The output variable is the wine quality score, which the model should learn to predict from these inputs.

This time, the data is not neatly prepared for us: it is stored in a CSV file. We are going to read this file and prepare the data for training.

In [4]:
# Loading the dataset file
import os.path
path = "./Datasets/winequalityN.csv"
if os.path.isfile(path):
    dataset = pd.read_csv(path)
else:
    dataset = pd.read_csv("Neural-networks/" + path)

# Let's take a look at the columns we have in this dataset.
print("Dataset Columns: {}\n".format(dataset.columns.values))

Dataset Columns: ['type' 'fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol' 'quality']



**Removing invalid rows**

Sometimes datasets are incomplete. That is the case here, so we will remove the rows with missing values.

In [3]:
# Removes rows with invalid values
n_rows_b4 = dataset.shape[0]
dataset = dataset.dropna(how='any',axis=0)
n_rows_c = dataset.shape[0]
print("{} rows containing invalid data removed.".format((n_rows_b4-n_rows_c)))

34 rows containing invalid data removed.
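The effect of `dropna(how='any', axis=0)` is easier to see on a tiny table. The sketch below uses a made-up four-row frame (hypothetical values, not taken from the wine file) and also counts the missing values per column first, which can help decide whether dropping rows is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with two missing values (NaN).
toy = pd.DataFrame({
    "pH": [3.1, np.nan, 3.3, 3.0],
    "alcohol": [9.5, 10.1, np.nan, 11.2],
    "quality": [5, 6, 6, 7],
})

# Count the missing values per column before removing anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# how="any" drops a row as soon as one of its columns is missing.
cleaned = toy.dropna(how="any", axis=0)
print("{} rows removed.".format(len(toy) - len(cleaned)))
```

Here the second and third rows each have one missing value, so both are removed and two rows remain.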


**Shuffling the dataset**

One way to improve training performance is to shuffle the dataset before training.

When the model learns, it observes patterns and tries to correlate them with the output. If these patterns are not evenly distributed throughout the dataset, the network might focus only on the patterns that occur most often. Also, if the dataset is stored in a completely sequential order, some input or output values might be missing from the train or test data after we split it.

In [6]:
# Shuffles the dataset
dataset = shuffle(dataset)
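The risk of splitting an ordered dataset can be shown with a small sketch. It assumes a hypothetical set of quality scores stored in sorted order (the real file may be ordered differently) and compares which scores land in each half with and without shuffling.

```python
import numpy as np

# Hypothetical quality scores stored in sorted order, as in an unshuffled file.
sorted_quality = np.repeat([3, 4, 5, 6, 7, 8], 10)  # 60 samples in total

def split_in_half(values):
    """Return the first and second half of an array."""
    middle = len(values) // 2
    return values[:middle], values[middle:]

# Without shuffling, each half only sees part of the score range.
train, test = split_in_half(sorted_quality)
print("Unshuffled train scores:", np.unique(train))  # [3 4 5]
print("Unshuffled test scores: ", np.unique(test))   # [6 7 8]

# With shuffling (fixed seed so the run is repeatable), both halves are mixed.
rng = np.random.default_rng(0)
shuffled = rng.permutation(sorted_quality)
train_s, test_s = split_in_half(shuffled)
print("Shuffled train scores:", np.unique(train_s))
print("Shuffled test scores: ", np.unique(test_s))
```

In the unshuffled case the model would be tested only on scores it never saw during training. If repeatable results matter, `sklearn.utils.shuffle` also accepts a `random_state` argument.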

**Separating into input data and output data**

Right now, we have the entire dataset inside the same array. We need to separate it so we can tell our model which columns are the inputs and which one is the output.

In [7]:
# Separate into input and outputs of the network
predictors = dataset.iloc[:,0:12].values
wine_quality = dataset.iloc[:,12].values.astype(np.float32)

**Categorical Encoding**

The first column of our dataset is "type". It can be either "white" or "red". The network doesn't understand text. It only understands numbers.

We could assign each type a number. Let's say red is 0 and white is 1. With only two types that can work, but in general integer codes suggest an order and a distance between categories that do not exist. The network needs a clearer distinction of what is what.

Instead, we are going to give each wine type its own input neuron. If it is "red", only the first input neuron will be 1. If it is "white", only the second input neuron will be 1. This is called one-hot encoding.
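The idea of one neuron per category can be sketched by hand with NumPy. This is only an illustration on a made-up sample of the "type" column; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
import numpy as np

# Hypothetical sample of the "type" column.
wine_types = np.array(["white", "red", "white", "red"])

# np.unique returns the sorted categories and, for each sample,
# the index of its category: here "red" -> 0 and "white" -> 1.
categories, indices = np.unique(wine_types, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[indices]

print(categories)
print(one_hot)
```

Because the categories are sorted alphabetically, "red" becomes `[1, 0]` and "white" becomes `[0, 1]`.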

**TODO: The teacher needs to explain the importance of categorical encoding.**

In [8]:
# Encodes categorized values
print("Category before encoding (10 first): {}".format(predictors[0:10,0]))

ct = ColumnTransformer([("type", OneHotEncoder(),[0])], remainder="passthrough")
predictors = ct.fit_transform(predictors)
print("Category after encoding (10 first):\n{}".format(predictors[0:10,0:2]))

Category before encoding (10 first): ['white' 'white' 'white' 'red' 'white' 'white' 'red' 'white' 'red' 'white']
Category after encoding (10 first):
[[0.0 1.0]
 [0.0 1.0]
 [0.0 1.0]
 [1.0 0.0]
 [0.0 1.0]
 [0.0 1.0]
 [1.0 0.0]
 [0.0 1.0]
 [1.0 0.0]
 [0.0 1.0]]


**Separating into train and test datasets**

_We are almost there!_

We removed invalid data, shuffled the dataset, and encoded it.
Now we can finally split our data into training and test sets.

In [9]:
train_ratio = 0.5
train_index = int(train_ratio*predictors.shape[0])
print("Total: {0}. Train: {1}. Test: {2}".format(predictors.shape[0], train_index, predictors.shape[0]-train_index))

x_train = predictors[0:train_index]
y_train = wine_quality[0:train_index]

x_test = predictors[train_index:predictors.shape[0]]
y_test = wine_quality[train_index:predictors.shape[0]]

Total: 6463. Train: 3231. Test: 3232
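The same slicing can be wrapped in a small helper so that other ratios are easy to try. The function name `split_train_test` and the placeholder arrays are assumptions made for this sketch; only the row and column counts mirror the wine data.

```python
import numpy as np

def split_train_test(inputs, outputs, train_ratio=0.5):
    """Split already-shuffled arrays into train and test parts."""
    train_index = int(train_ratio * inputs.shape[0])
    return (inputs[:train_index], outputs[:train_index],
            inputs[train_index:], outputs[train_index:])

# Placeholder arrays with the same number of rows and columns as the wine data.
dummy_inputs = np.zeros((6463, 13))
dummy_outputs = np.zeros(6463)

# Try a larger training share than the 0.5 used in the notebook.
x_tr, y_tr, x_te, y_te = split_train_test(dummy_inputs, dummy_outputs, train_ratio=0.8)
print(x_tr.shape, x_te.shape)
```

With `train_ratio=0.8` this gives 5170 training rows and 1293 test rows. Giving the model more data to learn from is a common choice, at the cost of a smaller test set.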


In [10]:
print("Input training data shape:", x_train.shape)
print("Example input training data:",x_train[0])
print("Output training data shape:", y_train.shape)
print("Example output training data:",y_train[0])

print("\nInput test data shape:", x_test.shape)
print("Output test data shape:", y_test.shape)

Input training data shape: (3231, 13)
Example input training data: [0.0 1.0 5.7 0.31 0.28 4.1 0.03 22.0 86.0 0.9906200000000001 3.31 0.38
 11.7]
Output training data shape: (3231,)
Example output training data: 7.0

Input test data shape: (3232, 13)
Output test data shape: (3232,)


**Building a regression model**

Now it's your turn to build a model, train it, test it, and save it.

In [11]:
def build_model():
    print("TODO: Implement the network.")
    model = None
    return model

net = build_model()

TODO: Implement the network.
