#### Sentiment Analysis Using Neural Networks
----
Neural networks only process numeric representations, so text has to be encoded as numbers first.

A sentence or a document can be modeled in numeric form: each word is given a numeric code such as `X1`, so a word can be represented as a vector and a document is taken in as a `tensor` (an n-dimensional array).
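The idea of turning words into numeric codes can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the helper names `tokenize` and `build_vocab` and the two sample sentences are assumptions made up for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

# Hypothetical sample sentences, used only for illustration.
docs = ["The movie was great", "The plot was thin"]
vocab = build_vocab(docs)

# Each document becomes a list of integer ids, one per token.
encoded = [[vocab[token] for token in tokenize(doc)] for doc in docs]
print(vocab)
print(encoded)
```

The nested list `encoded` is the kind of structure that would be padded and stacked into a tensor before being fed to a network.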

#### Word Embedding:
----


 
###### _one hot_: one feature for every word in a large corpus

* A large vocabulary produces an enormous number of features
* Unordered, so much of the context and the relationships between words is lost
* Binary: a word is either present or absent, so there is no frequency information
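A small sketch may make these limitations concrete. The four-word vocabulary and the helper names `one_hot` and `binary_document_vector` are assumptions invented for this example.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def binary_document_vector(tokens, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the document."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical toy vocabulary, sorted so that positions are reproducible.
vocabulary = sorted({"good", "bad", "movie", "plot"})

movie_vector = one_hot("movie", vocabulary)
review_vector = binary_document_vector("good good movie".split(), vocabulary)

print(vocabulary)
print(movie_vector)
print(review_vector)
```

The vector length equals the vocabulary size, and the repeated "good" in the sample review still yields a single 1, which is the loss of frequency information noted above.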



###### _frequency based_
* Count vectors: raw word counts
* Term Frequency-Inverse Document Frequency (TF-IDF): how important a word is to a document and to the corpus as a whole
* Tokenize each word, take the counts, and express the review as a vector of counts
* With a large vocabulary the vectors can be sparse:
    * you can keep only the _Top-N_ most common words
    * you can also hash individual words into buckets, with enough buckets that collisions (two different words landing in the same bucket) are rare
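The _Top-N_ option can be sketched with the standard library alone. The three sample reviews, the value of `top_n`, and the helper `count_vector` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical mini corpus of reviews.
reviews = [
    "great movie great cast",
    "bad movie",
    "great plot",
]

# Count every token across the whole corpus.
corpus_counts = Counter(token for review in reviews for token in review.split())

# Keep only the N most frequent words as features.
top_n = 2
vocabulary = [word for word, _ in corpus_counts.most_common(top_n)]

def count_vector(review):
    """Express a review as counts of the retained vocabulary words only."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
print([count_vector(review) for review in reviews])
```

Words outside the retained vocabulary are simply dropped, which keeps the feature vector short at the cost of ignoring rare words.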


##### TF-IDF Algo
-----
* Measures how often a word appears in a document and in the corpus

* Tokenize each word as an integer; the document is represented as a tensor
* Every word is given a `score = TermFrequency(wi, dj) * InverseDocumentFrequency(wi, D)`
* A high frequency across the corpus lowers the score, so common words (the, an, uhh) are down-weighted
* A high frequency within the document raises the score, on the assumption that the word is important to that document
* Every word in the document is encoded and scored
    * Advantage: the feature vector is more tractable in size
    * Advantage: scores capture both frequency and relevance
    * Disadvantage: context is not captured
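The score formula above can be sketched directly. This uses the plain textbook form `tf * log(N / df)`; libraries often apply smoothing and normalisation on top, so their numbers will differ. The sample documents and function names are assumptions, and the sketch assumes the word occurs in at least one document (otherwise `df` would be zero).

```python
import math
from collections import Counter

# Hypothetical tokenized corpus.
documents = [
    "the movie was great".split(),
    "the plot was thin".split(),
    "the cast was great great".split(),
]

def term_frequency(word, document):
    """Share of the document's tokens that are `word`."""
    return Counter(document)[word] / len(document)

def inverse_document_frequency(word, corpus):
    """log(N / df): zero for a word that appears in every document."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / containing)

def tf_idf(word, document, corpus):
    """Textbook TF-IDF score of `word` in `document`."""
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)

print(tf_idf("the", documents[0], documents))    # appears in every document
print(tf_idf("movie", documents[0], documents))  # appears in one document only
```

A word such as "the" that occurs in every document gets an inverse document frequency of `log(1) = 0`, so its score vanishes regardless of how often it appears.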



######  _prediction based_ 
* Used by deep learning networks to capture the meaning and semantics of words

### Representing Text in Numerical Form
----
* `CountVectorizer`
* `TfidfVectorizer`
* `HashingVectorizer`

Each implements `fit()`, which takes a corpus so the vectorizer can learn its vocabulary.

`transform()` then uses the ids learned for each word to convert documents into numeric feature vectors.

For smaller datasets both steps can usually be done in a single call with `fit_transform()`.
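The split between the two steps matters when new documents arrive after fitting. The following is a small sketch with `CountVectorizer`; the sample strings are made up, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and new documents.
train_docs = ["red apple", "green apple"]
new_docs = ["red red banana"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                # learn the vocabulary from the training corpus
counts = vectorizer.transform(new_docs)   # encode new text with that fixed vocabulary

print(vectorizer.vocabulary_)
print(counts.toarray())
```

The word "banana" was not seen during `fit()`, so it has no column and is ignored by `transform()`, while "red" is counted twice.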

# Extracting Features from Text
##### Using Bag of Words, TF-IDF Transformation

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

#### Define a corpus of 4 documents with some repeated values

In [2]:
corpus = ['This is the first document.',
          'This is the second document.', 
          'Third document. Document number three', 
          'Number four. To repeat, number four']

#### Use CountVectorizer to convert a collection of text documents to a "bag of words"

In [3]:
vectorizer = CountVectorizer()  # instantiate the vectorizer
bag_of_words = vectorizer.fit_transform(corpus)

bag_of_words

<4x12 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

#### Viewing the bag of words created
----
Each entry reads `(document index, word index)  count`: every word is indexed and identified by a number, together with the document it comes from.
The same word always has the same number.

#### View what the "bag" looks like

In [4]:
print(bag_of_words)

  (0, 0)	1
  (0, 1)	1
  (0, 7)	1
  (0, 3)	1
  (0, 9)	1
  (1, 6)	1
  (1, 0)	1
  (1, 7)	1
  (1, 3)	1
  (1, 9)	1
  (2, 10)	1
  (2, 4)	1
  (2, 8)	1
  (2, 0)	2
  (3, 5)	1
  (3, 11)	1
  (3, 2)	2
  (3, 4)	2


#### Get the value to which a word is mapped

In [5]:
vectorizer.vocabulary_.get('document')

0

#### Access all the words inside 
----

In [6]:
vectorizer.vocabulary_

{'this': 9,
 'is': 3,
 'the': 7,
 'first': 1,
 'document': 0,
 'second': 6,
 'third': 8,
 'number': 4,
 'three': 10,
 'four': 2,
 'to': 11,
 'repeat': 5}
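The sparse printout can be read back into words by inverting this mapping. A minimal sketch, with the vocabulary and the entries for document 2 copied by hand from the outputs above; the variable names are assumptions.

```python
# Vocabulary as printed above (word -> column index).
vocabulary = {'this': 9, 'is': 3, 'the': 7, 'first': 1, 'document': 0,
              'second': 6, 'third': 8, 'number': 4, 'three': 10,
              'four': 2, 'to': 11, 'repeat': 5}

# Invert it to look words up by column index.
index_to_word = {index: word for word, index in vocabulary.items()}

# Entries printed for document 2, as (column index, count) pairs.
document_2 = [(10, 1), (4, 1), (8, 1), (0, 2)]

decoded = {index_to_word[index]: count for index, count in document_2}
print(decoded)
```

The decoded counts match the third document, 'Third document. Document number three', where "document" appears twice once case is ignored.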

In [7]:
import pandas as pd

print(pd.__version__)

0.24.1


#### Much easier to view it all as a pandas dataframe
----

In [8]:
pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names())  # scikit-learn 1.0+: use get_feature_names_out() (get_feature_names() was removed in 1.2)

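| | document | first | four | is | number | repeat | second | the | third | this | three | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |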


#### Extend bag of words with TF-IDF weights

##### Every word has a unique id and a score

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
bag_of_words = vectorizer.fit_transform(corpus)

print(bag_of_words)

  (0, 9)	0.43584673254990375
  (0, 3)	0.43584673254990375
  (0, 7)	0.43584673254990375
  (0, 1)	0.5528163151092931
  (0, 0)	0.3528554929793508
  (1, 9)	0.43584673254990375
  (1, 3)	0.43584673254990375
  (1, 7)	0.43584673254990375
  (1, 0)	0.3528554929793508
  (1, 6)	0.5528163151092931
  (2, 0)	0.6191395067937654
  (2, 8)	0.4850008395708102
  (2, 4)	0.3823802326982809
  (2, 10)	0.4850008395708102
  (3, 4)	0.5412799489419371
  (3, 2)	0.6865449812276998
  (3, 11)	0.3432724906138499
  (3, 5)	0.3432724906138499


In [10]:
vectorizer.vocabulary_.get('document')

0

#### Represent as a dataframe
---- 
The cells now contain TF-IDF scores, not raw frequencies.

In [11]:
pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names())  # scikit-learn 1.0+: use get_feature_names_out() (get_feature_names() was removed in 1.2)

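| | document | first | four | is | number | repeat | second | the | third | this | three | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352855 | 0.552816 | 0.0 | 0.435847 | 0.0 | 0.0 | 0.0 | 0.435847 | 0.0 | 0.435847 | 0.0 | 0.0 |
| 1 | 0.352855 | 0.0 | 0.0 | 0.435847 | 0.0 | 0.0 | 0.552816 | 0.435847 | 0.0 | 0.435847 | 0.0 | 0.0 |
| 2 | 0.61914 | 0.0 | 0.0 | 0.0 | 0.38238 | 0.0 | 0.0 | 0.0 | 0.485001 | 0.0 | 0.485001 | 0.0 |
| 3 | 0.0 | 0.0 | 0.686545 | 0.0 | 0.54128 | 0.343272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.343272 |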
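The scores in the first row can be reproduced by hand. This sketch assumes `TfidfVectorizer` was run with its default settings, which smooth the inverse document frequency as `ln((1 + n) / (1 + df)) + 1` and then scale each document vector to unit L2 length. The document frequencies are read off the count table above, and the variable names are assumptions.

```python
import math

# Number of documents in which each word of document 0 occurs (4 documents in total).
n_documents = 4
document_frequency = {"this": 2, "is": 2, "the": 2, "first": 1, "document": 3}

def smoothed_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Document 0 contains each of its words exactly once, so tf * idf = 1 * idf.
raw = {word: smoothed_idf(df, n_documents) for word, df in document_frequency.items()}

# L2-normalise so the document vector has unit length.
norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
scores = {word: weight / norm for word, weight in raw.items()}

for word, score in scores.items():
    print(word, round(score, 6))
```

"document" occurs in three of the four documents, so it ends up with the lowest score in the row, while "first" occurs only in this document and gets the highest.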


#### View all the words and their corresponding values

In [12]:
vectorizer.vocabulary_

{'this': 9,
 'is': 3,
 'the': 7,
 'first': 1,
 'document': 0,
 'second': 6,
 'third': 8,
 'number': 4,
 'three': 10,
 'four': 2,
 'to': 11,
 'repeat': 5}

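#### For Really Big Things
----
For very large vocabularies, hash the words into a fixed number of buckets instead of storing the vocabulary.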

### Hashing Vectorizer
* One issue with CountVectorizer and TF-IDF Vectorizer is that the number of features can get very large if the vocabulary is very large
* The whole vocabulary is stored in memory, and this may end up taking a lot of space
* With Hashing Vectorizer, one can limit the number of features to a chosen number <b>n</b>
* Each word is hashed to one of the n values
* There will be collisions, where different words are hashed to the same value
* In many instances, performance does not really suffer in spite of the collisions
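The bucket idea can be sketched without any library support. This is an illustration only: it uses MD5 as a stand-in hash and plain unsigned counts, whereas scikit-learn's `HashingVectorizer` uses a signed MurmurHash3 and normalises each row by default, so the numbers here will not match its output. The helper names are assumptions.

```python
import hashlib

def bucket_for(word, n_buckets):
    """Map a word to a bucket index in [0, n_buckets) using a stable hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_counts(text, n_buckets):
    """Count tokens per bucket; no vocabulary is ever stored."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        vector[bucket_for(token, n_buckets)] += 1
    return vector

# The 12 vocabulary words of the corpus, squeezed into 8 buckets.
words = ["this", "is", "the", "first", "document", "second", "third",
         "number", "three", "four", "to", "repeat"]
buckets = {}
for word in words:
    buckets.setdefault(bucket_for(word, 8), []).append(word)

print(buckets)
print(hashed_counts("number four to repeat number four", 8))
```

With 12 distinct words and only 8 buckets, at least one bucket must hold more than one word. Because the mapping is many-to-one, a bucket index alone cannot be turned back into the word that produced it.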

In [4]:
from sklearn.feature_extraction.text import HashingVectorizer
# set the number of buckets=8
vectorizer = HashingVectorizer(n_features=8)
feature_vector = vectorizer.fit_transform(corpus)
print(feature_vector)

  (0, 0)	-0.8944271909999159
  (0, 5)	0.4472135954999579
  (0, 6)	0.0
  (1, 0)	-0.5773502691896258
  (1, 3)	0.5773502691896258
  (1, 5)	0.5773502691896258
  (1, 6)	0.0
  (2, 0)	-0.7559289460184544
  (2, 3)	0.3779644730092272
  (2, 5)	0.3779644730092272
  (2, 7)	0.3779644730092272
  (3, 0)	0.31622776601683794
  (3, 3)	0.31622776601683794
  (3, 5)	0.6324555320336759
  (3, 7)	0.6324555320336759


#### There is no way to compute the inverse transform to get the words from the hashed value