# Data Preprocessing using sklearn
We are going to perform tokenization (splitting text into words) and feature extraction (vectorization). The sklearn library offers tools for both operations.

## Bag of Words model
Machine learning algorithms take vectors of numbers as input, so raw text has to be converted into fixed-length vectors of numbers. One way to do this is the bag-of-words model. Each distinct word in the training text is assigned a unique integer index, and together these words form the vocabulary. Any document can then be encoded as a fixed-length vector whose length equals the size of the vocabulary, with each position holding the count of the corresponding word in that document.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
text=["Beijing is in china. China is in Asia"]
vectorizer=CountVectorizer()
vectorizer.fit(text)
print(vectorizer.vocabulary_)

{'beijing': 1, 'in': 3, 'asia': 0, 'is': 4, 'china': 2}


In [26]:
text2=["China is India's neighbour and also in Asia. China is the major exporter of zirconium"]
vector2=vectorizer.transform(text2)
print(vector2.toarray())

[[1 0 2 1 2]]


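So, we can see from the above example that the vectorizer returns an array containing the count of each vocabulary word, which gives us an idea of how frequently those words occur in a given piece of text. Words that were not in the vocabulary learned during `fit` (such as "zirconium") are ignored.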
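
To make the counting step concrete, here is a minimal pure-Python sketch that reproduces the two outputs above without sklearn. The helper names (`tokenize`, `build_vocabulary`, `count_vector`) are hypothetical, and the regular expression is only an approximation of CountVectorizer's default tokenizer (lowercased words of two or more characters), so treat it as an illustration rather than the library's implementation.

```python
import re
from collections import Counter

# Assumption: words of 2+ characters, lowercased, roughly like CountVectorizer's default.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(document):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(document.lower())

def build_vocabulary(documents):
    """Assign each distinct word an index, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(tokenize(document))
    ordered_words = sorted(vocabulary, key=vocabulary.get)
    return [counts[word] for word in ordered_words]

vocabulary = build_vocabulary(["Beijing is in china. China is in Asia"])
vector = count_vector(
    "China is India's neighbour and also in Asia. "
    "China is the major exporter of zirconium",
    vocabulary,
)
print(vocabulary)
print(vector)
```

The printed vector lines up with the vocabulary indices: position 0 is "asia", position 1 is "beijing", and so on.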

## Word Frequencies with TfidfVectorizer
### Term Frequency-Inverse Document Frequency

The CountVectorizer model only gives us the count of each word in the document. Raw counts are not very meaningful on their own, because common words like "the" appear many times in almost every document.
So, we use the TF-IDF model, which has two main components:

1. Term Frequency: This summarizes how often a given word appears in the document
2. Inverse Document Frequency: This downscales words that appear a lot across documents

TF-IDF scores are word frequency scores that try to highlight the more informative words: those that are frequent in one document but not across all documents.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
text=['The quick brown fox jumped over the lazy dog',"The dog","The fox"]
vectorizer=TfidfVectorizer()

vectorizer.fit(text)
#Summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

#Encode document
vector=vectorizer.transform([text[0]])
print(vector.shape)
print(vector.toarray())

{'the': 7, 'dog': 1, 'fox': 2, 'lazy': 4, 'over': 5, 'quick': 6, 'jumped': 3, 'brown': 0}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


Now, a vocabulary of 8 words is learned from the three documents, and each word is assigned a unique integer index in the output vector. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: "the" at index 7. The encoded document is then normalized to unit length, which is why the final scores all lie between 0 and 1.
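
The numbers above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word, followed by L2 normalization of each document vector. The tokenizer and variable names are illustrative assumptions, not part of sklearn.

```python
import math
import re
from collections import Counter

documents = ["The quick brown fox jumped over the lazy dog", "The dog", "The fox"]

def tokenize(document):
    # Assumption: lowercased words of 2+ characters, similar to the default tokenizer.
    return re.findall(r"\b\w\w+\b", document.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_documents = len(documents)

# Document frequency: in how many documents does each word appear?
document_frequency = {
    word: sum(word in tokens for tokens in tokenized) for word in vocabulary
}

# Smoothed IDF (assumed default): rare words get a higher weight.
idf = {
    word: math.log((1 + n_documents) / (1 + document_frequency[word])) + 1
    for word in vocabulary
}

# TF-IDF for the first document: raw count times IDF, then scale to unit length.
counts = Counter(tokenized[0])
raw = [counts[word] * idf[word] for word in vocabulary]
norm = math.sqrt(sum(value * value for value in raw))
tfidf = [value / norm for value in raw]

for word, weight, score in zip(vocabulary, idf.values(), tfidf):
    print(f"{word:>7}  idf={weight:.8f}  tfidf={score:.8f}")
```

Note that "the" has the lowest IDF but still the highest final score in the first document, because it occurs twice there while every other word occurs once.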

## Hashing with HashingVectorizer

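In this method, each word is passed through a one-way hash function that maps it directly to an integer column index, so no vocabulary has to be learned or stored and no `fit` step is needed. The length of the output vector is fixed in advance with `n_features`. The trade-off is that the mapping cannot be reversed to recover the words, and different words can collide in the same column.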

In [28]:
from sklearn.feature_extraction.text import HashingVectorizer
text=["The quick brown fox jumped over the lazy dog"]
vectorizer=HashingVectorizer(n_features=20)

vector=vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
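
The idea behind this output can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not sklearn's implementation: it swaps in `hashlib.md5` for the hash function sklearn actually uses, so the column indices and signs will differ from the output above. What it does mirror is the general scheme: the hash picks a column, one bit of the hash picks a sign (which is where negative values come from), and the vector is scaled to unit length. The function name `hashed_vector` is hypothetical.

```python
import hashlib
import math
import re

def hashed_vector(document, n_features=20):
    """Encode a document with the hashing trick (illustrative sketch)."""
    vector = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", document.lower()):
        # Stand-in hash (assumption): first 8 bytes of MD5 as an integer.
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        index = digest % n_features                 # column for this word
        sign = 1.0 if (digest >> 63) == 0 else -1.0  # top bit decides the sign
        vector[index] += sign
    # Scale to unit length (L2 normalization).
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector

vector = hashed_vector("The quick brown fox jumped over the lazy dog")
print(len(vector))
print([round(value, 4) for value in vector])
```

Because the column is computed from the word itself, the same word always lands in the same position, on any machine and without any stored vocabulary. With only 20 columns, collisions between different words are likely, so `n_features` is usually chosen much larger in practice.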
