### Build a DNN using Keras with `ReLU` activations and the `Adam` optimizer

#### Load TensorFlow

In [1]:
# Import Basic Libraries
import numpy as np
import pandas as pd

# Import Data Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Set Theme for Data Visualization
%matplotlib inline
sns.set_style('whitegrid')

# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')

# Import Libraries for Statistical Analysis
import scipy.stats as stats

# Import Libraries for Train-Test split, Scaling and Metric calculation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

# Import TensorFlow and Keras
import tensorflow as tf
import keras
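tf.reset_default_graph()  # TensorFlow 1.x API: clear any previously built graph
tf.set_random_seed(40)    # TensorFlow 1.x API: fix the graph-level random seed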

Using TensorFlow backend.


#### Collect Fashion-MNIST data from `tf.keras.datasets`

In [0]:
(trainX, trainY), (testX, testY) = tf.keras.datasets.fashion_mnist.load_data()

In [3]:
print("Train X dimension:",trainX.shape)

print("Train Y dimension:",trainY.shape)

print("Test X dimension:",testX.shape)

print("Test Y dimension:",testY.shape)

Train X dimension: (60000, 28, 28)
Train Y dimension: (60000,)
Test X dimension: (10000, 28, 28)
Test Y dimension: (10000,)


#### Change train and test labels into one-hot vectors

In [4]:
# One hot encoding for output label

trainY = tf.keras.utils.to_categorical(trainY, num_classes=10)
testY = tf.keras.utils.to_categorical(testY, num_classes=10)

print("Train Y:",trainY.shape)

print("Test Y:",testY.shape)

Train Y: (60000, 10)
Test Y: (10000, 10)
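
To make the shape change from `(60000,)` to `(60000, 10)` concrete, here is a minimal NumPy sketch of what one-hot encoding does. The `one_hot` helper and the sample labels are hypothetical and only illustrate the idea; they are not how `to_categorical` is implemented.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # For row i, set the column given by labels[i] to 1
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Made-up labels for illustration, not read from the dataset
sample_labels = np.array([9, 0, 3])
sample_one_hot = one_hot(sample_labels, num_classes=10)

print(sample_one_hot.shape)
print(sample_one_hot[0])
```

Each row contains exactly one `1`, at the index of the class label, so taking `argmax` along axis 1 recovers the original integer labels.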


#### Build the Graph

#### Initialize model, reshape & normalize data

In [5]:
#Initialize Sequential model
model = tf.keras.models.Sequential()

#Reshape data from 2D to 1D -> 28x28 to 784
model.add(tf.keras.layers.Reshape((784,),input_shape=(28,28,)))

#Normalize the data
model.add(tf.keras.layers.BatchNormalization())

Instructions for updating:
Colocations handled automatically by placer.
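
The two layers above can be sketched in plain NumPy: flatten each 28x28 image into 784 values, then standardize every feature over the batch. This is a simplified illustration under stated assumptions. The random batch stands in for real images, the `eps` value is an assumption, and the learned scale and offset and the moving averages that `BatchNormalization` keeps for inference are left out.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up stand-in for a batch of 32 grayscale images with pixel values 0..255
batch = rng.randint(0, 256, size=(32, 28, 28)).astype(np.float32)

# Reshape: (32, 28, 28) -> (32, 784)
flat = batch.reshape(batch.shape[0], -1)

# Batch normalization at training time, without the learned scale and offset:
# subtract the per-feature batch mean and divide by the per-feature batch std
eps = 1e-3  # small constant for numerical stability (assumed value)
mean = flat.mean(axis=0)
var = flat.var(axis=0)
normalized = (flat - mean) / np.sqrt(var + eps)

print(flat.shape)
print(normalized.mean(axis=0)[:3], normalized.std(axis=0)[:3])
```

After this step every feature has a mean close to 0 and a standard deviation close to 1 within the batch, which is why the notebook can feed raw pixel values into the network.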


#### Add two fully connected layers with 200 and 100 neurons respectively with `relu` activations. Add a dropout layer with `p=0.25`

In [6]:
#Add 1st hidden layer
model.add(tf.keras.layers.Dense(200, activation='relu'))

#Add 2nd hidden layer
model.add(tf.keras.layers.Dense(100, activation='relu'))

#Dropout Layer
model.add(tf.keras.layers.Dropout(0.25))

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


#### Add the output layer: a fully connected layer with 10 neurons and `softmax` activation. Use the `categorical_crossentropy` loss and the `adam` optimizer, train the network, and report the final validation accuracy.

In [0]:
#Add OUTPUT layer
model.add(tf.keras.layers.Dense(10, activation='softmax'))

#Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
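
For intuition about the output layer and the loss, the following is a minimal NumPy sketch of softmax and categorical cross-entropy. The logits and targets are made-up values, and the helper functions are illustrative rather than the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)  # guard against log(0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Made-up scores for 2 samples and 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
uniform_loss = categorical_crossentropy(y_true, softmax(np.zeros((2, 3))))

print(probs.sum(axis=1))
print(loss, uniform_loss)
```

A model that assigns equal probability to every class has a loss of `log(num_classes)`, so for the 10 Fashion-MNIST classes an untrained network should start near `log(10)`, roughly 2.3.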

In [8]:
#Model Execution with epochs = 10 and batch_size = 30
model.fit(trainX, trainY, 
          validation_data=(testX, testY), 
          epochs=10,
          batch_size=30)

Train on 60000 samples, validate on 10000 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f8fe3bb77f0>

In [9]:
score_train = model.evaluate(trainX, trainY)
score_test = model.evaluate(testX, testY)
print()
print('Training accuracy: ', score_train[1])
print('Test accuracy: ', score_test[1])


Training accuracy:  0.92471665
Test accuracy:  0.8886
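
`model.evaluate` returns the loss followed by the metrics passed to `compile`, which is why index `1` holds the accuracy. As a rough sketch of how that accuracy relates to the one-hot labels, the hypothetical helper below compares the `argmax` of the predicted probabilities with the `argmax` of the targets. The arrays are made-up values.

```python
import numpy as np

def accuracy_from_one_hot(y_true, y_prob):
    """Fraction of rows where the most probable class equals the true class."""
    return float((y_true.argmax(axis=1) == y_prob.argmax(axis=1)).mean())

# Made-up targets and predictions for 4 samples and 3 classes
toy_true = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]], dtype=np.float32)
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4],
                     [0.6, 0.3, 0.1]], dtype=np.float32)

toy_accuracy = accuracy_from_one_hot(toy_true, toy_prob)
print(toy_accuracy)  # 0.75: three of the four rows are classified correctly
```

The gap between the training accuracy (about 0.925) and the test accuracy (about 0.889) reported above may indicate mild overfitting.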


## Word Embeddings in Python with Gensim

In this exercise, you will practice training and loading word embedding models for natural language processing applications in Python using Gensim. It covers:


1. How to train your own word2vec word embedding model on text data.
2. How to visualize a trained word embedding model using Principal Component Analysis.
3. How to load pre-trained word2vec word embedding models.

### Run the two commands below to install gensim and the `wikipedia` package

In [10]:
!pip3 install --upgrade gensim --user

Requirement already up-to-date: gensim in /root/.local/lib/python3.6/site-packages (3.7.1)


In [11]:
!pip3 install wikipedia --user



### Import gensim

In [0]:
import gensim

In [0]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

In [14]:
!ls

sample_data


### Obtain Text

Import the `search` and `page` functions from the `wikipedia` module.

`search(<keyword>)`: takes a keyword as its argument and returns the top 10 article titles matching that keyword.

`page(<article title>)`: takes a page title as its argument and returns a page object whose `content` attribute holds the article text.

In [15]:
## Usage: 

from wikipedia import search, page
titles = search("<Key word goes here>")
wikipage = page(titles[0])
print()
print("Wikipage Content: \n\n", wikipage.content)


Wikipage Content: 

 "Here We Go Round the Mulberry Bush" (also titled "Mulberry Bush" or "This Is the Way") is an English nursery rhyme and singing game. It has a Roud Folk Song Index number of 7882. The same tune is also used for "Lazy Mary, Will You Get Up" and "Nuts in May". A variant is used for "The Wheels on the Bus".


== Lyrics ==
The most common modern version of the rhyme is:


== Score ==


== Origins and meaning ==
The rhyme was first recorded by James Orchard Halliwell as an English children's game in the mid-19th century. He noted that there was a similar game with the lyrics "Here we go round the bramble bush". The bramble bush may be an earlier version, possibly changed because of the difficulty of the alliteration, since mulberries do not grow on bushes.Halliwell said subsequent verses included: "This is the way we wash our clothes", "This is the way we dry our clothes", "This is the way we mend our shoes", "This is the way the gentlemen walk" and "This is the way th

### Print the top 10 titles for the keyword `Machine Learning`

In [16]:
# Top 10 Titles:

titles = search("Machine Learning")
titles

['Machine learning',
 'Active learning (machine learning)',
 'Deep learning',
 'Boosting (machine learning)',
 'List of datasets for machine learning research',
 'Support-vector machine',
 'Adversarial machine learning',
 'Learning',
 'Outline of machine learning',
 'Weka (machine learning)']

### Get the content from the first title from the above obtained 10 titles.

In [17]:
text = page(titles[0])
text.content

'Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, 

### Create a list with name `documents` and append all the words in the 10 pages' content using the above 10 titles.

In [0]:
# Function to Clean Up Text

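import re

def clean_str(text):
  """
  Clean a string before vectorization: drop lines that start with a URL,
  replace every non-letter character with a space, and lowercase the result.
  """
  try:
    # Remove lines that begin with an http(s) URL
    text = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Keep only ASCII letters; everything else becomes a space
    text = re.sub(r"[^A-Za-z]", " ", text)
    # split() already discards empty tokens, so no length filter is needed
    words = text.strip().lower().split()
    return " ".join(words)
  except TypeError:
    # Non-string input (e.g. None) yields an empty document
    return ""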

In [0]:
documents = []
for title in titles:
  documents.append(clean_str(page(title).content))

In [20]:
#List to hold the tokens of each document
doc_corp = []

#Iterate over each document
for doc in documents:
    doc_corp.append(doc.split(' '))

print(len(doc_corp))
print(doc_corp[0])

10
['machine', 'learning', 'ml', 'is', 'the', 'scientific', 'study', 'of', 'algorithms', 'and', 'statistical', 'models', 'that', 'computer', 'systems', 'use', 'to', 'effectively', 'perform', 'a', 'specific', 'task', 'without', 'using', 'explicit', 'instructions', 'relying', 'on', 'patterns', 'and', 'inference', 'instead', 'it', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence', 'machine', 'learning', 'algorithms', 'build', 'a', 'mathematical', 'model', 'of', 'sample', 'data', 'known', 'as', 'training', 'data', 'in', 'order', 'to', 'make', 'predictions', 'or', 'decisions', 'without', 'being', 'explicitly', 'programmed', 'to', 'perform', 'the', 'task', 'machine', 'learning', 'algorithms', 'are', 'used', 'in', 'the', 'applications', 'of', 'email', 'filtering', 'detection', 'of', 'network', 'intruders', 'and', 'computer', 'vision', 'where', 'it', 'is', 'infeasible', 'to', 'develop', 'an', 'algorithm', 'of', 'specific', 'instructions', 'for', 'performing', 'the', 'task',

### Build the gensim word2vec model, keeping all words with frequency >= 10 and using an embedding size of 50

In [21]:
#Build the model
model = gensim.models.Word2Vec(doc_corp, #Tokenized documents (one list of words per document)
                               min_count=10, #Ignore all words with total frequency lower than this
                               workers=4, #Number of worker threads
                               size=50,  #Embedding size
                               window=5, #Maximum distance between current and predicted word
                               iter=10   #Number of iterations over the text corpus
                              )

2019-02-24 16:36:43,750 : INFO : collecting all words and their counts
2019-02-24 16:36:43,752 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-24 16:36:43,762 : INFO : collected 4995 word types from a corpus of 33068 raw words and 10 sentences
2019-02-24 16:36:43,763 : INFO : Loading a fresh vocabulary
2019-02-24 16:36:43,767 : INFO : effective_min_count=10 retains 516 unique words (10% of original 4995, drops 4479)
2019-02-24 16:36:43,768 : INFO : effective_min_count=10 leaves 23139 word corpus (69% of original 33068, drops 9929)
2019-02-24 16:36:43,772 : INFO : deleting the raw counts dictionary of 4995 items
2019-02-24 16:36:43,774 : INFO : sample=0.001 downsamples 66 most-common words
2019-02-24 16:36:43,775 : INFO : downsampling leaves estimated 14895 word corpus (64.4% of prior 23139)
2019-02-24 16:36:43,777 : INFO : estimated required memory for 516 words and 50 dimensions: 464400 bytes
2019-02-24 16:36:43,778 : INFO : resetting layer weights
2
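
The log lines above report how `min_count=10` shrinks the vocabulary from 4995 word types to 516. The frequency filter itself can be sketched with `collections.Counter`. The `build_vocab` helper and the toy documents are hypothetical and only illustrate the idea; gensim's own vocabulary building does more than this (for example, downsampling frequent words).

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count):
    """Count every token and keep only those seen at least min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Made-up tokenized documents for illustration
toy_docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "machine", "learning"],
]

toy_vocab = build_vocab(toy_docs, min_count=2)
print(sorted(toy_vocab))  # ['is', 'learning', 'machine']
```

Words that occur only once (`fun` and `deep` here) are dropped and never receive an embedding, which is why looking up a rare word in the trained model can fail.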

### Exploring the model

In [22]:
#Embedding matrix shape: (vocabulary size, embedding size)
model.wv.vectors.shape

(516, 50)

#### Check which words are in the model

In [23]:
# Vocabulary of the model
model.wv.vocab

{'a': <gensim.models.keyedvectors.Vocab at 0x7f8fd86b16a0>,
 'ability': <gensim.models.keyedvectors.Vocab at 0x7f8fd86bc4a8>,
 'about': <gensim.models.keyedvectors.Vocab at 0x7f8fd86bcc88>,
 'above': <gensim.models.keyedvectors.Vocab at 0x7f8fd8073748>,
 'access': <gensim.models.keyedvectors.Vocab at 0x7f8fd80ea5f8>,
 'accuracy': <gensim.models.keyedvectors.Vocab at 0x7f8fd86bc6a0>,
 'active': <gensim.models.keyedvectors.Vocab at 0x7f8fd80ea5c0>,
 'activities': <gensim.models.keyedvectors.Vocab at 0x7f8fd80ef0f0>,
 'adaboost': <gensim.models.keyedvectors.Vocab at 0x7f8fd8073c50>,
 'adaptive': <gensim.models.keyedvectors.Vocab at 0x7f8fd80efda0>,
 'adversarial': <gensim.models.keyedvectors.Vocab at 0x7f8fd8073b38>,
 'after': <gensim.models.keyedvectors.Vocab at 0x7f8fd86bcb38>,
 'against': <gensim.models.keyedvectors.Vocab at 0x7f8fd80ea8d0>,
 'ai': <gensim.models.keyedvectors.Vocab at 0x7f8fd80ead68>,
 'al': <gensim.models.keyedvectors.Vocab at 0x7f8fd80efe48>,
 'algorithm': <gensim.mo

### Get the embedding for the word `svm`

In [24]:
model.wv['svm']

array([-0.2417694 , -0.24420615, -0.3270728 ,  0.31636244, -0.38657758,
        0.10294402,  0.50202876, -0.17660767, -0.2200225 ,  0.06568649,
       -0.31555796, -0.29176122, -0.05975629, -0.07367154, -0.30786332,
        0.30076197,  0.4878046 ,  0.01022672,  0.2025485 ,  0.31482974,
        0.26381558, -0.57566977, -0.00356158,  0.15234452, -0.08887644,
        0.12094612,  0.31385854,  0.04325441,  0.03915186, -0.4410611 ,
        0.24488017,  0.14665775, -0.11505769, -0.88124794,  0.37574026,
       -0.26656982, -0.08072456,  0.04736617,  0.02083592,  0.03923621,
        0.23763624, -0.16777651,  0.09120493, -0.24866505,  0.2879205 ,
       -0.3272936 ,  0.04693003,  0.39004165, -0.00408765, -0.13937509],
      dtype=float32)

### Find the most similar words for the word `learning`

In [25]:
model.wv.most_similar('learning')

2019-02-24 16:36:44,191 : INFO : precomputing L2-norms of word weight vectors


[('machine', 0.9827715158462524),
 ('deep', 0.9719343781471252),
 ('of', 0.9567088484764099),
 ('algorithms', 0.9424633979797363),
 ('a', 0.9417915344238281),
 ('in', 0.9352954626083374),
 ('artificial', 0.9335716962814331),
 ('networks', 0.9332436919212341),
 ('neural', 0.9324018955230713),
 ('has', 0.9307993054389954)]
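
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal NumPy sketch of that ranking. The two-dimensional toy vectors and both helper functions are made up for illustration and are not the gensim implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional embeddings for illustration
toy_vectors = {
    "learning": np.array([1.0, 0.0]),
    "machine": np.array([0.9, 0.1]),
    "ball": np.array([0.0, 1.0]),
}

ranked = most_similar("learning", toy_vectors)
print(ranked)
```

In the real output above, function words such as `of`, `a` and `in` score above 0.93, which suggests that a corpus of ten articles is too small to separate the vectors well.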

### Find the word that is unlike the others among `machine, svm, ball, learning`

In [26]:
model.wv.doesnt_match("machine svm ball learning".split())



'svm'
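
One common way to pick the odd word out is to average the unit-length vectors of the given words and return the word least similar to that average. The sketch below follows that idea with made-up toy vectors. The helper is hypothetical, and skipping words that have no vector is an assumption of this sketch.

```python
import numpy as np

def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    # Assumption: words without a vector are skipped instead of raising an error
    present = [w for w in words if w in vectors]
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in present])
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    scores = unit @ mean  # cosine similarity of each word to the mean direction
    return present[int(np.argmin(scores))]

# Made-up 2-dimensional embeddings for illustration
toy_vectors = {
    "machine": np.array([1.0, 0.1]),
    "learning": np.array([0.9, 0.2]),
    "svm": np.array([0.8, 0.3]),
    "ball": np.array([-0.2, 1.0]),
}

odd_one = doesnt_match(["machine", "svm", "ball", "learning"], toy_vectors)
print(odd_one)  # ball
```

Note that a word filtered out by `min_count` has no vector, so it cannot be chosen as the odd one out; with a small corpus the answer is then picked from the remaining words only.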

### Save the model with name `word2vec-wiki-10`

In [27]:
model.save('word2vec-wiki-10')

2019-02-24 16:36:44,219 : INFO : saving Word2Vec object under word2vec-wiki-10, separately None
2019-02-24 16:36:44,224 : INFO : not storing attribute vectors_norm
2019-02-24 16:36:44,226 : INFO : not storing attribute cum_table
2019-02-24 16:36:44,236 : INFO : saved word2vec-wiki-10


### Load the model `word2vec-wiki-10`

In [28]:
#Load model from disk
model = gensim.models.Word2Vec.load('word2vec-wiki-10')

2019-02-24 16:36:44,242 : INFO : loading Word2Vec object from word2vec-wiki-10
2019-02-24 16:36:44,251 : INFO : loading wv recursively from word2vec-wiki-10.wv.* with mmap=None
2019-02-24 16:36:44,252 : INFO : setting ignored attribute vectors_norm to None
2019-02-24 16:36:44,254 : INFO : loading vocabulary recursively from word2vec-wiki-10.vocabulary.* with mmap=None
2019-02-24 16:36:44,257 : INFO : loading trainables recursively from word2vec-wiki-10.trainables.* with mmap=None
2019-02-24 16:36:44,258 : INFO : setting ignored attribute cum_table to None
2019-02-24 16:36:44,259 : INFO : loaded word2vec-wiki-10
