# Data Design HW 3 <a id='back to top'></a>
## Scott Virshup

## Contents:
1. [PART 1](#part1)
    1. [Bag of Words](#part1.1)
    1. [TF-IDF](#part1.2)
    1. [Feature Hashing](#part1.3)
1. [PART 2](#part2)


### PART 1 Assignment Description: <a id='part1'></a>
1. Use Amazon book reviews (text documents) dataset (small dataset with 10 reviews).
1. Use Bag of Words (BoW) and TF-IDF (CountVectorizer and TfidfVectorizer in scikit-learn).
1. Write Python program to create and print vocabulary and document-term matrix (vectorized representation).
1. Try unigram and bigram parameters and observe their effect on number of features.
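
Item 4 of the list above can be sketched on a tiny corpus before running it on the reviews. This is a minimal example, not the assignment solution: `toy_reviews` is a hypothetical three-document stand-in for the Amazon data, and the only parameter varied is `ngram_range`, which `CountVectorizer` and `TfidfVectorizer` both accept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the Amazon reviews
toy_reviews = ["the book is good", "the book is long", "good long read"]

feature_counts = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(toy_reviews)
    # The vocabulary size is the number of columns in the document-term matrix
    feature_counts[ngram_range] = len(vectorizer.vocabulary_)
    print(ngram_range, feature_counts[ngram_range])
```

With `(1, 2)` the vocabulary holds every unigram plus every bigram, so the feature count is the sum of the two separate runs.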

#### Bag of Words <a id='part1.1'></a>

In [2]:
##########################
### Bag of Words (BoW) ###
##########################

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

amzn = pd.read_csv("https://github.com/svirshup/Data-Design/raw/master/Data/Small-Book%20Reviews%20from%20Amazon.csv", header=None)
#amzn.head()

# Select the column of book reviews (column 1) as a pandas Series
reviews = amzn[1]

# BoW representation is implemented in CountVectorizer.
count = CountVectorizer()

# fit_transform does the following:
## 1) tokenizes the reviews,
## 2) builds the vocabulary (accessible through the vocabulary_ attribute), and
## 3) returns the sparse document-term matrix.
bag = count.fit_transform(reviews)

# Print the results
print("The number of words in this bag: ")
print(len(count.vocabulary_))
print("\n")
print("Contents of this bag:")
print(count.vocabulary_)
print("\n")
print("BoW document term matrix:")
print(bag.toarray())

The number of words in this bag: 
904


Contents of this bag:
{'ok': 552, 'but': 116, 'think': 792, 'the': 782, 'keirsey': 436, 'temperment': 774, 'test': 777, 'is': 417, 'more': 508, 'accurate': 16, 'and': 47, 'cheaper': 135, 'this': 794, 'book': 108, 'has': 352, 'its': 425, 'good': 335, 'points': 593, 'if': 386, 'anything': 54, 'it': 424, 'helps': 361, 'you': 899, 'put': 623, 'into': 410, 'words': 884, 'what': 862, 'want': 847, 'from': 318, 'supervisor': 753, 'not': 540, 'very': 840, 'online': 556, 'does': 217, 'account': 15, 'for': 303, 'difference': 205, 'between': 98, 'when': 863, 'of': 549, 'their': 783, 'options': 560, 'are': 58, 'both': 110, 'exactly': 257, 'like': 462, 'or': 561, 'they': 789, 'don': 223, 'describe': 195, 'at': 65, 'all': 33, 'messes': 496, 'up': 833, 'results': 667, 'did': 203, 'me': 490, 'well': 860, 'am': 40, 'just': 433, 'in': 392, 'denial': 192, 'have': 355, 'taken': 765, 'lot': 472, 'personality': 586, 'type': 821, 'tests': 778, 'sorter': 716, 'pretty': 6

#### TF-IDF <a id='part1.2'></a>
[Back to Top](#back to top)

In [4]:
##############
### TF-IDF ###
##############

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

amzn = pd.read_csv("https://github.com/svirshup/Data-Design/raw/master/Data/Small-Book%20Reviews%20from%20Amazon.csv", header=None)
#amzn.head()

# Select the column of book reviews (column 1) as a pandas Series
reviews = amzn[1]

# TF-IDF representation is implemented in TfidfVectorizer.
count = TfidfVectorizer()
count.fit(reviews)

# Fitting the TfidfVectorizer does the following:
## 1) tokenizing,
## 2) building the vocabulary, and
## 3) learning the IDF weight of each term.
# transform() then produces the TF-IDF weighted document-term matrix.
tfidf = count.transform(reviews)

# Print the results
print("Create a document term matrix: ")
print(tfidf.toarray())
print("\n")
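# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count.get_feature_names_out().tolist()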
print(feature_names)
print("\n")
number_of_features = len(feature_names)
print(number_of_features)

Create a document term matrix: 
[[0.06143829 0.         0.         ... 0.15668449 0.         0.        ]
 [0.         0.02325564 0.02325564 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.08745341]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.05937969 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


['10', '1000', '14th', '18th', '1953', '1955', '1960', '1970', '60', 'abandon', 'able', 'about', 'academic', 'accomplished', 'according', 'account', 'accurate', 'act', 'actively', 'actually', 'ade', 'admit', 'adolescent', 'adulation', 'adventure', 'advocate', 'advocating', 'after', 'against', 'age', 'agitation', 'al', 'alex', 'all', 'alone', 'also', 'alternative', 'although', 'altogether', 'always', 'am', 'america', 'american', 'among', 'an', 'analysis', 'analytic', 'and', 'angels', 'anna', 'annoying', 'another', 'an
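
To see where the fractional values in the matrix come from, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes the vectorizer's default settings (smoothed IDF, raw term counts, L2 row normalization) and uses a hypothetical toy corpus (`toy_reviews`), not the Amazon data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the Amazon reviews
toy_reviews = ["good book", "bad book", "good good read"]

# Step 1: raw term counts (the BoW matrix)
counts = CountVectorizer().fit_transform(toy_reviews).toarray().astype(float)

# Step 2: smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts by idf, then scale each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's result with default parameters
auto = TfidfVectorizer().fit_transform(toy_reviews).toarray()
print(np.allclose(manual, auto))
```

Words that occur in fewer documents get a larger IDF, so they carry more weight within a row than words shared by most documents.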



#### Feature Hashing <a id='part1.3'></a>
[Back to Top](#back to top)

In [3]:
#######################
### Feature Hashing ###
#######################

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

amzn = pd.read_csv("https://github.com/svirshup/Data-Design/raw/master/Data/Small-Book%20Reviews%20from%20Amazon.csv", header=None)
#amzn.head()

# Select the column of book reviews (column 1) as a pandas Series
reviews = amzn[1]

# Choose some different values for n_features in order to see what difference it makes to the output
#vectorizer = HashingVectorizer(n_features=1)
#vectorizer = HashingVectorizer(n_features=2)
#vectorizer = HashingVectorizer(n_features=3)
vectorizer = HashingVectorizer(n_features=4)

vect = vectorizer.transform(reviews)

print(vect.toarray())

[[ 0.73910445  0.36955223  0.53753051 -0.16797829]
 [ 0.3805832   0.53281647 -0.66982642  0.35013654]
 [ 0.93961848  0.20134682 -0.26846242 -0.06711561]
 [-0.08137885  0.56965192  0.81378846 -0.08137885]
 [ 0.34299717  0.34299717  0.85749293 -0.17149859]
 [ 0.         -0.09149914  0.640494    0.76249285]
 [ 0.50470424  0.06308803 -0.75705636  0.41007219]
 [-0.12216944  0.48867778 -0.12216944  0.85518611]
 [ 0.6947088   0.62158156  0.25594535  0.25594535]
 [-0.96152395 -0.27472113  0.          0.        ]]


#### Feature Hashing interpretation
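`n_features` sets the number of columns (hash buckets) in the output matrix. It does not set the n-gram length: all of the runs above use the default `ngram_range=(1, 1)`, so only unigrams (single words) are hashed.

__n_features = 1:__ every word hashes into the same single column. The output is a single column of 10 rows where all values are 1, except for the last value, which is -1. The sign comes from the signed hashing that `HashingVectorizer` applies by default, and the magnitude is 1 because each row is L2-normalized.


__n_features = 2:__ words are spread over two hash buckets. The output array has two columns and 10 rows.


__n_features >= 3:__ words are spread over n hash buckets. The output matrix has n columns.

__Trends:__
* The value you select for `n_features` determines how many columns are in the results matrix.
* The results matrix has one row per block of text (review) in the original corpus.
* With `n_features=1`, every value is 1 or -1.
* With so few columns, many different words collide in the same bucket. The default is 2**20 columns, which keeps collisions rare.
* To hash bigrams or longer sequences, set `ngram_range` explicitly. It is independent of `n_features`.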
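
A minimal sketch, assuming a hypothetical three-review toy corpus (`toy_reviews`) instead of the Amazon data, that checks how `n_features` and `ngram_range` each affect the output shape of `HashingVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy corpus standing in for the Amazon reviews
toy_reviews = ["good book", "bad book", "good good read"]

# n_features sets the number of hash buckets, i.e. output columns
shapes = {}
for n in (1, 2, 4):
    hashed = HashingVectorizer(n_features=n).transform(toy_reviews)
    shapes[n] = hashed.shape
    print(f"n_features={n}: shape={hashed.shape}")

# Adding bigrams changes which tokens get hashed, not the number of columns
with_bigrams = HashingVectorizer(n_features=4, ngram_range=(1, 2)).transform(toy_reviews)
print("with bigrams:", with_bigrams.shape)

# With a single column, L2 normalization leaves each row at -1, 0, or 1
one_col = HashingVectorizer(n_features=1).transform(toy_reviews).toarray()
print(one_col.ravel())
```

Because the vectorizer stores no vocabulary, `transform` works without a prior `fit`, but the columns cannot be mapped back to words.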

### PART 2 Assignment Description: <a id='part2'></a>
Read further about the speech-to-text services from the four vendors (Microsoft, Amazon, Google, IBM) and list 3-4 key features of each service.

[Back to Top](#back to top)

__Microsoft Azure__
* Common use cases:
    * Recognize a brief utterance, such as a command, without interim results.
    * Transcribe a long, previously-recorded utterance, such as a voicemail message.
    * Transcribe streaming speech in real-time, with partial results, for dictation.
    * Determine what users want to do based on a spoken natural-language request.
* Able to create custom language/acoustic models
* The Bing Speech API was the previous version. The Azure Speech service is currently in preview

__Amazon Transcribe__
* Common use cases:
    * Analyze customer call data
    * Automate subtitle creation
    * Target advertising based on content
    * Enable rich search capabilities on archives of audio and video content
* The API has three intuitive primary actions:
    * StartTranscriptionJob
    * GetTranscriptionJob
    * ListTranscriptionJobs

__Google Cloud Speech to Text__
* Recognizes over 120 languages
* Offers pre-built models based on the source of your data. These models are:
    * command_and_search - Best for short queries such as voice commands or voice search.
    * phone_call - Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate)
    * video - Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate.
    * default - Best for audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.
* Pricing:
    * Speech Recognition (all models except video): free (between 0 - 60 minutes), then 0.006 USD / 15 seconds (up to 1 million minutes)
    * Video Speech Recognition: 0.006 (between 0 - 60 minutes), then 0.012 USD / 15 seconds (up to 1 million minutes)
* High latency compared to other services
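
The pricing bullets for the non-video models can be turned into a rough cost estimate. This is a sketch under stated assumptions: the function name `estimate_standard_cost` is hypothetical, the rates are the ones quoted above and may be out of date, and rounding usage up to whole 15-second blocks is an assumption, not something the notes confirm.

```python
import math

def estimate_standard_cost(minutes, free_minutes=60, usd_per_block=0.006, block_seconds=15):
    """Estimate the cost in USD for the non-video models.

    Assumes the first `free_minutes` are free and that the remaining
    audio is billed in whole 15-second blocks (rounded up).
    """
    billable_seconds = max(0, (minutes - free_minutes) * 60)
    blocks = math.ceil(billable_seconds / block_seconds)
    return blocks * usd_per_block

for minutes in (30, 60, 61, 1000):
    print(minutes, round(estimate_standard_cost(minutes), 3))
```

Under these assumptions, one minute beyond the free tier is four billable blocks.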

__IBM Watson__
* Generally lower transcription accuracy than the other services, but very low latency, so results are returned faster
* Compatible with several languages, including English, Japanese, Spanish, Brazilian Portuguese, Modern Standard Arabic, and Mandarin
* Includes language model customization


__References__
* https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/
* https://aws.amazon.com/blogs/aws/amazon-transcribe-scalable-and-accurate-automatic-speech-recognition/
* https://cloud.google.com/speech-to-text/
* https://www.ibm.com/watson/services/speech-to-text/
    * http://fredrikstenbeck.com/what-languages-does-ibm-watson-support/
* https://blog.craftworkz.co/speech-recognition-a-comparison-of-popular-services-in-en-and-nl-67a3e1b0cee6
* https://recast.ai/blog/benchmarking-speech-recognition-api/
* https://medium.com/@tanyathakur6/comparing-machine-learning-ml-services-from-various-cloud-ml-service-providers-63c8a2626cb6