In [1]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.pooling import MaxPooling2D
from keras.optimizers import SGD

Using TensorFlow backend.


In answering each of the following questions please include a) the question as a markdown header in your Jupyter notebook, b)  the raw code that you used to generate any results, tables, or figures, and c) the top ten or fewer rows of the dataframe (do not include more than ten rows for any table in your report).

# 1.  From the perspective of a social scientist, which models did we learn this semester that are useful for ruling out alternative explanations through control variables AND that allow us to observe substantively meaningful information from model coefficients?



---


* (Multiple) Linear Regression/OLS Regression - Continuous outcome variable
* Logistic Regression - Categorical outcome variable


# 2. Describe the main differences between supervised and unsupervised learning.


---

Supervised Learning | Unsupervised Learning
---|---
Clearly defined outcome variable|No outcome variable
Interested in predicting the outcome variable|Interested in discovering information about data without reference to an outcome variable
Clearly defined, quantifiable evaluation metrics like accuracy|Usually subjective evaluation of success
May be supported by unsupervised techniques|May support supervised techniques
Not conventionally used for data exploration | Often used for data exploration through visualisation of PCA dimensions or clustering





# 3. Is supervised or unsupervised learning the primary approach that is used by machine learning practitioners?  For whatever approach you think is secondary, why would you use this approach (what's a good reason to use these kinds of models?)


---
Supervised learning is the primary approach used by machine learning practitioners.

Unsupervised learning can be used for data exploration to discover patterns in the data. For instance,
* Visualising data with more than two dimensions on a 2D surface after a PCA transformation
* Discovering and understanding groups in a dataset through clustering

Unsupervised learning can also be used to support supervised learning.
* By helping the practitioner understand the data better
* Through dimensionality reduction with unsupervised techniques like PCA or manifold learning algorithms, which shrinks the feature space before a supervised model is trained


# 4. Which unsupervised learning modeling approaches did we cover this semester?  What are the major differences between these techniques?



---
### Differences between Clustering and Dimensionality Reduction Techniques
Dimensionality reduction | Clustering
--- | ---
Captures patterns to produce a lower dimensional representation of the data | Groups data points into "clusters", often based on proximity
May involve transforming the data to be represented on new axes|Does not involve transforming data onto a new set of axes
Does not assign observations to groups as output|Assigns observations to groups as output

### Differences between Clustering Techniques
K-means clustering|Hierarchical/Agglomerative Clustering
---|---
Initial clusters are formed by partitioning observations into a pre-specified number of clusters|Initial clusters are defined by having each observation as its own cluster
Points are reassigned to clusters depending on proximity to the cluster centroid| Clusters are expanded by joining the two closest clusters together at each step
Cluster assignment relies on point-to-cluster centroid distance|Cluster agglomeration depends on inter-cluster distance, which can be computed from pairwise dissimilarities between cluster points rather than from point-to-cluster centroid distance
Clustering process cannot be visualised in a dendrogram|Cluster formation can be visualised using a dendrogram
Number of clusters must be pre-defined|Number of clusters can (at least theoretically) be determined post-clustering with the aid of a dendrogram by defining a cut-point
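
The k-means column above can be illustrated with a minimal sketch of the assign-then-update loop. It is not the course's implementation: the function name `kmeans`, the toy points, and the explicitly supplied starting centroids are all assumptions made here for reproducibility (the table describes starting from an initial partition instead).

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Assign-then-update loop on 2-D points, starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is fixed by the length of the starting centroid list, which mirrors the "must be pre-defined" row of the table.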

### Differences between Dimensionality Reduction Techniques
PCA|Manifold Learning
---|---
Based on preserving the maximal variance between points in the data|Based on preserving pairwise distances between points in the data
Captures only linear patterns in the data|Can capture non-linear patterns in the data
Intrinsically filters noise from important components (assuming more signal than noise in the pattern of variation in the data)|Susceptible to noise in the data causing drastic changes in the output
Proportion of explained variance is a useful metric for optimal number of output dimensions from dimensionality reduction|Difficult to determine optimal number of output dimensions
Works well in theory and practice|Impressive with some toy datasets, but often struggles with real world data


# 5.  What are the main benefits of using Principal Components Analysis?

---
### PCA for dimensionality reduction
Models trained on high-dimensional data often take longer to train and require more observations to achieve acceptable accuracy. PCA can be used to reduce the number of dimensions in the data while preserving an acceptable proportion of the variation in the original data. Using data that has been transformed from a higher-dimensional space into a lower-dimensional space using PCA may yield superior results on sparser datasets and decrease model training times.


### PCA for data visualisation

PCA can be used to visualise data that exist in a high-dimensional space. Visualisation tools are usually constrained to 2D, 3D, or at most 4D representations, whereas data often exist in a space with many more dimensions. PCA preserves as much of the variation in the data as possible in its first two principal components. Plotting the data along these principal components allows the viewer to understand most of the distribution of the data across all its dimensions in a 2-dimensional presentation format.

### PCA over Manifold Learning Techniques
* PCA often works better in practice
* PCA naturally filters noise from important components (assuming more signal than noise in the pattern of variation in the data), whereas manifold learning algorithms are susceptible to noise in the data causing drastic changes in the output
* With PCA you can use the proportion of explained variance you want to preserve to select the number of components to keep. Determining the number of output dimensions for manifold learning is not as easily defined.
* PCA has straightforward approaches to dealing with missing data, but manifold learning algorithms do not
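
The explained-variance criterion mentioned above can be sketched with NumPy. This is an illustrative sketch rather than the method used in the course: the helper names `pca_explained_variance` and `n_components_for`, the 95% target, and the synthetic data are all assumptions.

```python
import numpy as np

def pca_explained_variance(X):
    """Return principal axes and the proportion of variance each explains."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return Vt, variances / variances.sum()

def n_components_for(ratios, target=0.95):
    """Smallest number of components whose cumulative ratio reaches target."""
    k = int(np.searchsorted(np.cumsum(ratios), target)) + 1
    return min(k, len(ratios))

# Hypothetical data: three features that are nearly multiples of one signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
X = np.outer(signal, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(200, 3))

axes, ratios = pca_explained_variance(X)
k = n_components_for(ratios, target=0.95)
X_reduced = (X - X.mean(axis=0)) @ axes[:k].T  # project onto the first k axes
print(k, X_reduced.shape)
```

Because almost all of the variation lies along a single direction in this toy data, one component is enough to pass the 95% threshold.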

# 6. Thinking about neural networks, what are three major differences between a deep multilayer perceptron network and a convolutional neural network model?

Be sure to define any key terms in your explanation.

---
### Presence of Convolutional Layers

* Convolutional neural networks use convolutional layers, whereas multilayer perceptron networks do not.
* Convolutional layers preserve the spatial layout of the input data. For instance, an image stays a 28\*28 matrix instead of being flattened into a column vector with 784 elements, as is done for fully connected layers.
  - Fully connected layers are used in both convolutional neural networks and multilayer perceptron networks, whereas convolutional layers are used only in convolutional neural networks.
* Convolutional layers apply a filter that is smaller than the input matrix. The filter is overlaid on a corner of the input matrix, and the element-wise products of the filter values and the overlapping input values are summed to yield a single output value. This is repeated as the filter is moved (convolved) across the entire input matrix, and the resulting values form the output of the convolutional layer.
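
The filter operation described above can be sketched in plain Python. This is a minimal illustration, assuming a single-channel input, no padding, and a hypothetical helper name `conv2d_valid`; like deep learning libraries, it slides the filter without flipping it.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of lists) with no padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Element-wise products of the filter and the patch under it, summed.
            total = sum(
                kernel[a][b] * image[i * stride + a][j * stride + b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output

# Hypothetical 3*3 input and 2*2 filter.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4], [-4, -4]]
```

A 2\*2 filter on a 3\*3 input fits in 2\*2 positions, so the output is smaller than the input unless padding is added.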

### Use of Pooling

* Convolutional neural networks use pooling layers, whereas multilayer perceptron networks do not.
* Pooling layers serve to aggregate the output from convolutional layers to reduce the output dimensions. As convolutional layers are not present in multilayer perceptron networks, these layers are not used with multilayer perceptron networks.
* A pooling window is defined that is smaller than the input matrix. Like a filter, it is moved across the input matrix as described above.
  - The resultant output may either be the maximum value of the input matrix that is overlaid by the pooling layer (max pooling) or the average value of the input matrix that is overlaid by the pooling layer (average pooling)
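
Max and average pooling can be sketched in the same style. The helper name `pool2d` and the toy matrix are assumptions; the stride defaults to the window size so that windows do not overlap, which is a common default but not the only option.

```python
def pool2d(matrix, size=2, stride=None, mode="max"):
    """Aggregate each size*size window of `matrix` into a single value."""
    stride = stride or size  # non-overlapping windows unless a stride is given
    reduce = max if mode == "max" else (lambda values: sum(values) / len(values))
    out_rows = (len(matrix) - size) // stride + 1
    out_cols = (len(matrix[0]) - size) // stride + 1
    return [
        [
            reduce([
                matrix[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ])
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

# Hypothetical 4*4 output of a convolutional layer.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 1.75], [4.0, 7.0]]
```

Either way, the 4\*4 input is reduced to a 2\*2 output.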

### Padding

* Convolutional neural networks may add to the dimensions of the input via padding. This is not done with multilayer perceptron networks.
* Padding refers to increasing the dimensionality of an input matrix by adding elements with a value of 0 around the edges of the original input matrix.
  - For instance, a 3\*3 input matrix can be turned into a 5\*5 input matrix by adding a single ring of zeroes around the original 3\*3 matrix.
* Padding is usually used together with convolutional layers
  - To control the output matrix dimensions from a convolutional layer
  - To enable flexibility in choosing the filter size and stride length in a convolutional layer
  - To enable each filter value to be overlaid across all the input matrix values at least once
* There is little need for padding in fully connected layers, which are the only type of layer in a multilayer perceptron network, as fully connected layers do not use a convolving filter. Additional 0 values are not traditionally added as input into a fully connected layer.
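
The 3\*3 to 5\*5 example above, and the way padding controls the output size of a convolutional layer, can be sketched as follows. The helper names `zero_pad` and `conv_output_size` are assumptions; the size formula is the standard one for a square input with symmetric padding.

```python
def zero_pad(matrix, pad=1):
    """Surround `matrix` (a list of lists) with `pad` rings of zeroes."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top rows of zeroes
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))      # bottom rows of zeroes
    return padded

def conv_output_size(n, kernel, stride=1, pad=0):
    """Output width (or height) of a convolution over an input of size n."""
    return (n + 2 * pad - kernel) // stride + 1

matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(matrix, pad=1)
for row in padded:
    print(row)

print(conv_output_size(3, kernel=3))         # 1: no padding shrinks the output
print(conv_output_size(3, kernel=3, pad=1))  # 3: one ring of zeroes keeps 3*3
```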

# 7. Write the keras code for a multilayer perceptron neural network with the following structure:

* Three hidden layers.
* 50 hidden units in the first hidden layer
* 100 in the second, and
* 150 in the third.
* Activate all hidden layers with relu.
* The output layer should be built to classify to five categories.
* Further, your optimization technique should be stochastic gradient descent.

(This code should simply build the architecture of the model.  You will not run it on real data.)

In [2]:
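# Stochastic gradient descent; the learning rate is an arbitrary choice
sgd = SGD(lr=0.01)

model_7 = Sequential()
# input_dim is not given in the question; 100 is a placeholder
model_7.add(Dense(50, activation = 'relu', input_dim = 100))  # hidden layer 1
model_7.add(Dense(100, activation = 'relu'))                  # hidden layer 2
model_7.add(Dense(150, activation = 'relu'))                  # hidden layer 3
# Five softmax units, one per output category
model_7.add(Dense(5, activation = 'softmax'))
model_7.compile(optimizer = sgd,
                loss = 'categorical_crossentropy',
                metrics = ['accuracy'])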

model_7.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_2 (Dense)              (None, 100)               5100      
_________________________________________________________________
dense_3 (Dense)              (None, 150)               15150     
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 755       
Total params: 26,055
Trainable params: 26,055
Non-trainable params: 0
_________________________________________________________________
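
The `Param #` column above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit. The sketch below reproduces that arithmetic; the helper name `dense_param_counts` is an assumption, and the input size of 100 is taken from `input_dim` in the code above.

```python
def dense_param_counts(input_dim, layer_units):
    """Parameters per Dense layer: n_in * n_out weights plus n_out biases."""
    counts = []
    n_in = input_dim
    for n_out in layer_units:
        counts.append(n_in * n_out + n_out)
        n_in = n_out  # this layer's output feeds the next layer
    return counts

counts = dense_param_counts(100, [50, 100, 150, 5])
print(counts)       # [5050, 5100, 15150, 755]
print(sum(counts))  # 26055
```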


# 8. Write the keras code for a multilayer perceptron neural network with the following structure:

* Two hidden layers.
* 75 hidden units in the first hidden layer and 
* 150 in the second. 
* Activate all hidden layers with relu. 
* The output layer should be built to classify a binary dependent variable.  
* Further, your optimization technique should be stochastic gradient descent.  

(This code should simply build the architecture of the model.  You will not run it on real data.)


In [3]:
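# Stochastic gradient descent; the learning rate is an arbitrary choice
sgd = SGD(lr=0.01)

model_8 = Sequential()
# input_dim is not given in the question; 100 is a placeholder
model_8.add(Dense(75, activation = 'relu', input_dim = 100))  # hidden layer 1
model_8.add(Dense(150, activation = 'relu'))                  # hidden layer 2
# A single sigmoid unit outputs the probability of the positive class
model_8.add(Dense(1, activation = 'sigmoid'))
model_8.compile(optimizer = sgd,
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])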

model_8.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 75)                7575      
_________________________________________________________________
dense_6 (Dense)              (None, 150)               11400     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 151       
Total params: 19,126
Trainable params: 19,126
Non-trainable params: 0
_________________________________________________________________


# 9.  Write the keras code for a convolutional neural network with the following structure: 

* Two convolutional layers.
* 16 filters in the first layer and
* 28 in the second. 
* Activate all convolutional layers with relu. 
* Use max pooling after each convolutional layer with a 2 by 2 filter. 
* The output layer should be built to classify to ten categories. 
* Further, your optimization technique should be stochastic gradient descent.  

(This code should simply build the architecture of the model.  You will not run it on real data.)

In [4]:
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)

model_9 = Sequential()
# Conv Layer 1
model_9.add(Conv2D(filters = 16, kernel_size = (2, 2), padding='valid', 
                 data_format="channels_last", input_shape = (128, 128, 3)))
model_9.add(Activation('relu'))
model_9.add(MaxPooling2D(pool_size=(2, 2)))
# Conv Layer 2
model_9.add(Conv2D(filters = 28, kernel_size = (2, 2), padding='valid')) 
model_9.add(Activation('relu'))
model_9.add(MaxPooling2D(pool_size=(2, 2)))
# Fully Connected Output
model_9.add(Flatten())
model_9.add(Dense(10))
model_9.add(Activation('softmax'))
model_9.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model_9.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 127, 127, 16)      208       
_________________________________________________________________
activation_1 (Activation)    (None, 127, 127, 16)      0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 63, 63, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 62, 62, 28)        1820      
_________________________________________________________________
activation_2 (Activation)    (None, 62, 62, 28)        0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 31, 31, 28)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 26908)            
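
The shapes and parameter counts in the summary above, up to the flatten layer, follow from simple arithmetic. The sketch below assumes what the code above specifies (2\*2 filters, `'valid'` padding, stride 1, 2\*2 max pooling, a 128\*128\*3 input); the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights for every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def valid_conv_size(n, kernel):
    """Output size of a stride-1 convolution with no padding."""
    return n - kernel + 1

def pooled_size(n, pool):
    """Output size of non-overlapping pooling (any remainder is dropped)."""
    return n // pool

size, channels = 128, 3
for filters in (16, 28):
    params = conv_params(2, 2, channels, filters)
    size = valid_conv_size(size, 2)
    print("conv", (size, size, filters), params)
    size = pooled_size(size, 2)
    print("pool", (size, size, filters))
    channels = filters  # the next layer sees one channel per filter

flattened = size * size * channels
print("flatten", flattened)  # flatten 26908
```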

# 10.  Write the keras code for a convolutional neural network with the following structure: 

* Two convolutional layers. 
* 32 filters in the first layer and 
* 32 in the second. 
* Activate all convolutional layers with relu. 
* Use max pooling after each convolutional layer with a 2 by 2 filter. 
* Add two fully connected layers with 128 hidden units in each layer and relu activations. 
* The output layer should be built to classify to six categories. 
* Further, your optimization technique should be stochastic gradient descent.  

(This code should simply build the architecture of the model.  You will not run it on real data.)

In [5]:
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)

model_10 = Sequential()
# Conv Layer 1
model_10.add(Conv2D(filters = 32, kernel_size = (2, 2), padding='valid', 
                 data_format="channels_last", input_shape = (128, 128, 3)))
model_10.add(Activation('relu'))
model_10.add(MaxPooling2D(pool_size=(2, 2)))
# Conv Layer 2
model_10.add(Conv2D(filters = 32, kernel_size = (2, 2), padding='valid')) 
model_10.add(Activation('relu'))
model_10.add(MaxPooling2D(pool_size=(2, 2)))
# Fully Connected 1
model_10.add(Flatten())
model_10.add(Dense(128))
model_10.add(Activation('relu'))
# Fully Connected 2
model_10.add(Dense(128))
model_10.add(Activation('relu'))
# Fully Connected Output
model_10.add(Dense(6))
model_10.add(Activation('softmax'))
model_10.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model_10.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 127, 127, 32)      416       
_________________________________________________________________
activation_4 (Activation)    (None, 127, 127, 32)      0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 63, 63, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 62, 62, 32)        4128      
_________________________________________________________________
activation_5 (Activation)    (None, 62, 62, 32)        0         
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 32)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 30752)            