## Day 1 Notebook
This notebook introduces common Python libraries (NumPy, pandas, Matplotlib) and the basics of Keras, culminating in building a linear regression model in Keras.

### Part 1: Python Libraries

#### Numpy
A Python library for handling arrays (matrices). Exercises adapted from this [lecture](https://compphysics.github.io/MachineLearningMSU/doc/pub/Introduction/html/Introduction.html). For full documentation see [Numpy website](https://numpy.org/).

In [None]:
#start by importing
import numpy as np

Let's initialize an array (vector) of 10 elements.
These elements are determined by random numbers drawn from a normal distribution.

In [None]:
n=10
x=np.random.normal(size=n)
print(x)

You can also initialize an array with specific values

In [None]:
import numpy as np
x = np.array([1, 2, 3])
print(x)

Note that Python starts numbering elements from 0.

In [None]:
#get the first element of x
print(x[0])

In [None]:
#get the last element of x
print(x[-1])

You can also apply functions like log to an entire array

In [None]:
x=np.log(np.array([4, 7, 8]))
print(x)

*Note:* It's typically better to use the built-in NumPy functions because they are vectorized: they operate on the whole array at once instead of looping element by element in Python.
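
To make the idea of vectorization concrete, the sketch below computes the same logarithms twice: once with a Python loop over `math.log`, and once with a single `np.log` call. The array values and variable names are only illustrative.

```python
import math
import numpy as np

values = np.array([4.0, 7.0, 8.0])

# Loop version: call math.log once per element
loop_result = np.array([math.log(v) for v in values])

# Vectorized version: one call handles the whole array
vectorized_result = np.log(values)

print(loop_result)
print(vectorized_result)
print(np.allclose(loop_result, vectorized_result))
```

Both versions give the same numbers; on large arrays the vectorized call is typically much faster because the loop over elements happens in compiled code.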

**Exercise:**
Write a NumPy program to convert temperatures from degrees Centigrade to degrees Fahrenheit. The Centigrade values are stored in a NumPy array.
Sample array: [0, 12, 45.21, 34, 99.91].

Hint: C/5=(F-32)/9

In [None]:
### your solution here
# the sample values are in Centigrade, so name them accordingly
cvalues = [0, 12, 45.21, 34, 99.91]
C = np.array(cvalues)


We can also make matrices in numpy (and tensors of higher dimension)

In [None]:
A = np.log(np.array([ [4.0, 7.0, 8.0], [3.0, 10.0, 11.0], [4.0, 5.0, 7.0] ]))
print(A)

You can get information about the matrix and easily slice it (select specific values)

In [None]:
# get the matrix size
print("A size:", A.shape)
# make a new matrix B holding the element-wise log of the same values as A
B = np.log(np.array([ [4.0, 7.0, 8.0], [3.0, 10.0, 11.0], [4.0, 5.0, 7.0] ]))
# print the first column (arrays are stored in row-major order and indices start at 0)
print("first column of B:", B[:,0])
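
A few more slicing patterns, shown on a small illustrative matrix `M` (not part of the original exercise). The general form is `M[rows, columns]`, where `:` means "everything along this axis" and negative indices count from the end.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

first_row = M[0, :]        # row 0, all columns
last_column = M[:, -1]     # all rows, last column
top_left = M[:2, :2]       # rows 0-1 and columns 0-1
single = M[1, 2]           # the element in row 1, column 2

print(first_row)
print(last_column)
print(top_left)
print(single)
```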

There are also functions to create matrices with certain values (0 or 1) or random values

In [None]:
# define a matrix of dimension 10 x 10 and set all elements to zero
A = np.zeros( (n, n) )
print("A:",A)
# define a matrix of dimension 10 x 10 and set all elements to one
B = np.ones( (n, n) )
print("B",B)
# define a matrix of dimension 10 x 10 and set all elements to uniform random numbers in [0, 1)
C = np.random.rand(n, n)
print("C",C) 

**Exercise:**
Define two 2x2 matrices, one randomly initialized and one with values you choose, and multiply them. Is the answer what you expect?

Hint: check the documentation for multiply and dot

In [None]:
### your solution here

There is MUCH more functionality in numpy, but it can be easiest to learn by looking at the documentation as you try exercises. Additional Numpy exercises can be found [here](https://www.w3resource.com/python-exercises/numpy/index.php)!

#### Pandas
A Python library for data structures and analysis tools. Exercises adapted from this [lecture](https://compphysics.github.io/MachineLearningMSU/doc/pub/Introduction/html/Introduction.html). For full documentation see [Pandas website](https://pandas.pydata.org/).

In [None]:
import pandas as pd

Pandas lets us make DataFrames (labelled 2-D tables) and Series (labelled 1-D vectors). Let's initialize a table of LoTR characters:

In [None]:
data = {'First Name': ["Frodo", "Bilbo", "Aragorn II", "Samwise"],
        'Last Name': ["Baggins", "Baggins","Elessar","Gamgee"],
        'Place of birth': ["Shire", "Shire", "Eriador", "Shire"],
        'Date of Birth T.A.': [2968, 2890, 2931, 2980]
        }
data_pandas = pd.DataFrame(data)
data_pandas

You can easily change the DataFrame to be indexed by a different value. Let's index it by character first name:

In [None]:
data_pandas_name = pd.DataFrame(data,index=['Frodo','Bilbo','Aragorn','Sam'])
data_pandas_name

And you can find info about a specific row. Let's get info about Aragorn, who is at position 2 (`iloc` selects rows by integer position, counting from 0).

In [None]:
data_pandas_name.iloc[2]
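
Pandas has two row accessors: `iloc` selects by integer position and `loc` selects by index label. The sketch below uses a small hypothetical table (not the one above) to show that both can reach the same row.

```python
import pandas as pd

# Hypothetical toy table, indexed by first name
people = pd.DataFrame(
    {'Last Name': ['Baggins', 'Elessar', 'Gamgee'],
     'Place of birth': ['Shire', 'Eriador', 'Shire']},
    index=['Frodo', 'Aragorn', 'Sam'])

by_position = people.iloc[1]       # second row, whatever its label is
by_label = people.loc['Aragorn']   # row whose index label is 'Aragorn'

print(by_label)
print(by_position.equals(by_label))
```

Selecting by label is usually more robust, since it keeps working if the rows are reordered.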

You can also create data frames of purely numerical data (here our index could be samples and the columns could be different variables)

In [None]:
np.random.seed(100)
# setting up a 10 x 5 matrix
rows = 10
cols = 5
a = np.random.randn(rows,cols)
df = pd.DataFrame(a)
df.columns = ['vara', 'varb', 'varc', 'vard', 'vare']
df.index = np.arange(10)
df

And get basic info about the dataframe

In [None]:
df.describe()

In [None]:
print("mean:", df.mean())
print("standard deviation:", df.std())

**Exercise:** Select the second column of our dataframe and display the mean. Select the last column of our dataframe and describe it.

In [None]:
#your solution here

Pandas is an extremely powerful library for reading (as we'll see in our linear regression model) and manipulating all kinds of data. Additional Pandas exercises can be found [here](https://www.w3resource.com/pandas/index.php)!

#### Matplotlib

[Matplotlib](https://matplotlib.org/) is a library for easily creating and customizing plots in Python. There are other similar/useful libraries like [seaborn](https://seaborn.pydata.org/) and [plotly](https://plotly.com/).

In [None]:
import matplotlib 

You can easily create plots directly from numpy arrays

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
#generate data points of the sine function
x = np.linspace(-10,10,100)
y = np.sin(x)
## plot it 
plt.plot(x,y,marker='x')
plt.show()

You can also easily create plots from pandas dataframes. In fact, there is a pandas wrapper for matplotlib that lets you call plotting directly from the dataframe

In [None]:
#create a randomized data frame
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2

In [None]:
#plot it as a bar graph
df2.plot(kind='bar')

**Exercise:** Create a green histogram of the third column of our dataframe `df2`.

In [None]:
# your solution here

Matplotlib offers MANY different kinds of plots. Additional exercises can be found [here](https://www.w3resource.com/graphics/matplotlib/)

### Part 2: Keras and Linear Regression
[Keras](https://keras.io/) is an intuitive machine learning API built on top of the TensorFlow library. In this exercise we will build a linear regression model using the Auto MPG dataset. This exercise is adapted from the [Keras Tutorial](https://www.tensorflow.org/tutorials/keras/regression).

In [None]:
#import keras and seaborn (for plotting)
import tensorflow as tf
from tensorflow import keras
import seaborn as sns
print(tf.__version__)

The Auto MPG Dataset is taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/) and has the data to build a model to predict fuel efficiency of late 1970s and early 1980s automobiles. It includes information like cylinders, displacement, horsepower, and weight. 

Let's read it in as a pandas dataframe

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

In [None]:
#make a copy for manipulation
dataset = raw_dataset.copy()
# print the last values of the dataframe
dataset.tail()

It's always important to clean the data before building a model. Let's check if there are any missing values in our dataframe

In [None]:
dataset.isna().sum()

There are 6 NaN values in the horsepower column. Because we have ~400 values, we can just drop these rows to keep this exercise simple (note, this is often not best practice for dealing with missing data). 

In [None]:
#drop rows with nan values
dataset = dataset.dropna()
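
Dropping rows is the simplest option. One common alternative is to fill the gaps with a summary value such as the column median. The sketch below contrasts the two on a small hypothetical table (the numbers are made up for illustration and are not the Auto MPG data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy = pd.DataFrame({'Horsepower': [130.0, np.nan, 150.0, 95.0],
                    'Weight': [3504.0, 3693.0, 3436.0, 2372.0]})

# Option 1: drop every row that contains a NaN
dropped = toy.dropna()

# Option 2: keep every row and fill the gap with the column median
median_hp = toy['Horsepower'].median()
imputed = toy.fillna({'Horsepower': median_hp})

print(len(dropped), len(imputed))
print(imputed)
```

Imputation keeps all the rows but introduces values that were never measured, so which option is better depends on the data and the model.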

The origin column actually describes the country of origin of the automobile with the following mapping: 1=USA, 2=Europe, 3=Japan. Let's turn this into a one-hot encoded column so we can use it in our model. 

In [None]:
#define the mapping and replace the origin values
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
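#use get_dummies (another great pandas function) to create one-hot columns
#dtype=float keeps the new columns numeric (recent pandas versions default to booleans)
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='', dtype=float)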
#get the first rows of df
dataset.head()
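
To see what one-hot encoding does in isolation, here is a minimal sketch on a three-row hypothetical table: each category becomes its own column, and each row has a 1 in exactly one of them.

```python
import pandas as pd

# Hypothetical toy table with one categorical column
toy = pd.DataFrame({'MPG': [18.0, 26.0, 31.0],
                    'Origin': ['USA', 'Europe', 'Japan']})

# One new column per category; empty prefix so the columns are named after the categories
encoded = pd.get_dummies(toy, columns=['Origin'],
                         prefix='', prefix_sep='', dtype=float)

print(encoded)
```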

**Question:** before building our model what else should we do to our dataset?

**Answer:** while there are many things we could look at, we should definitely split the data into training and test sets. Because we're not iterating over model designs here, we won't create a separate validation set up front.

In [None]:
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
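
The `sample` / `drop` pattern can be sanity-checked on a small hypothetical dataframe: the two pieces should not share any rows, and together they should cover the whole dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with 10 rows
toy = pd.DataFrame({'value': np.arange(10)})

train = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, chosen at random
test = toy.drop(train.index)                  # every row that is not in train

print(len(train), len(test))
print(train.index.intersection(test.index).empty)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the training set every time the cell is run.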

Let's take a quick look at the joint distribution of some column pairs from the dataset

In [None]:
sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde')

We can also use `describe` to get more info about the columns

In [None]:
train_dataset.describe().transpose()

**Question:** what kinds of patterns do you see here that you expect will be relevant to the model?

Because we're building a supervised model to predict the MPG of each car, we need to create a separate vector of the training and test labels

In [None]:
train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('MPG')
test_labels = test_features.pop('MPG')

It's often important to normalize the variables before building a model so that one variable doesn't wash out the information from others. 

In [None]:
# get the mean and std of each column
train_dataset.describe().transpose()[['mean', 'std']]

Keras has a built-in layer that lets you include normalization preprocessing in your model

In [None]:
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
#create the normalization layer
normalizer = preprocessing.Normalization()
# get the specific values for our data
normalizer.adapt(np.array(train_features))
# we can look at these values and compare them to what we saw above
print(normalizer.mean.numpy())
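
As a rough picture of what the normalization layer does, the sketch below standardizes a small hypothetical feature matrix by hand in NumPy, assuming the layer subtracts each column's mean and divides by the square root of its variance. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
features = np.array([[130.0, 3504.0],
                     [165.0, 3693.0],
                     [150.0, 3436.0],
                     [ 95.0, 2372.0]])

mean = features.mean(axis=0)   # per-column mean
std = features.std(axis=0)     # per-column (population) standard deviation

standardized = (features - mean) / std

print(standardized.mean(axis=0))  # close to 0 for each column
print(standardized.std(axis=0))   # close to 1 for each column
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default while NumPy computes the population one (`ddof=0`), so values from `describe()` can differ slightly from a standardization done this way.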

Let's start with a simple single-variable regression to predict MPG from only Horsepower. First we prepare the normalized input, then we define the model architecture:

In [None]:
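## get just the horsepower, as a column of shape (n_samples, 1)
horsepower = np.array(train_features['Horsepower']).reshape(-1, 1)

## fit the normalizer to the training values (adapt computes their mean and variance), then apply it
horsepower_normalizer = preprocessing.Normalization()
horsepower_normalizer.adapt(horsepower)
horsepower = horsepower_normalizer(horsepower)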

In [None]:
# build a sequential model with a single dense layer that has one output unit (y = w*x + b)
horsepower_model = tf.keras.Sequential([
    keras.Input(shape=(1,)),
    layers.Dense(1)
])

horsepower_model.summary()
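
With one input and one output unit, the dense layer has just two trainable parameters: a weight `w` and a bias `b`. The NumPy sketch below shows the computation such a layer performs, using hypothetical parameter values (a real layer starts from randomly initialized weights).

```python
import numpy as np

x = np.array([[-1.0], [0.0], [2.0]])  # 3 samples, 1 feature each

# Hypothetical parameter values chosen for illustration
w = np.array([[0.5]])   # weight matrix of shape (inputs, outputs) = (1, 1)
b = np.array([1.0])     # bias vector of shape (outputs,)

# The layer's output is a matrix multiplication plus the bias: y = x @ w + b
y = x @ w + b
print(y)
```

Training adjusts `w` and `b` so that the predicted `y` values match the labels as closely as possible.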

We can check that the model gives us the expected dimensionality (though the values will be terrible since we haven't trained yet)

In [None]:
# run model on first 10 rows
horsepower_model.predict(horsepower[:10])

Next we configure training with the `Model.compile()` method; the training itself is run afterwards with `Model.fit()`. We must specify the loss (we'll use mean absolute error) and the optimizer (we'll use Adam, a type of gradient descent algorithm that we'll learn more about tomorrow).

In [None]:
#configure training 
horsepower_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
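
Mean absolute error is the average of the absolute differences between predictions and labels, so it is expressed in the same units as the label (here MPG). A minimal NumPy sketch with hypothetical numbers:

```python
import numpy as np

y_true = np.array([18.0, 15.0, 24.0, 31.0])   # hypothetical MPG labels
y_pred = np.array([20.0, 15.5, 21.0, 30.0])   # hypothetical model predictions

# Average size of the errors, ignoring their sign
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```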

In [None]:
%%time
#run the training on the normalized horsepower (the same preprocessing we apply to the test set)
history = horsepower_model.fit(
    np.array(horsepower), train_labels,
    epochs=100,
    # suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)

We can look at the model's training process using the stats stored in the `history` object

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [None]:
#plot the training history
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0, 10])
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

In [None]:
plot_loss(history)

In [None]:
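# check performance on the test set (normalize the test inputs the same way as the training inputs)

test_horsepower = np.array(test_features['Horsepower']).reshape(-1, 1)
score = horsepower_model.evaluate(
    horsepower_normalizer(test_horsepower),
    test_labels, verbose=0)

# no extra metrics were compiled, so evaluate returns the loss as a single number
print('Test Loss:', score)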

**Exercise:** repeat linear regression training with multiple variables!

In [None]:
## your solution