# Text Feature Extraction

### 1. CountVectorizer

Converts a collection of text documents to a matrix of token counts.

### 2. TfidfVectorizer

Transforms text to feature vectors that reflect term importance using Term Frequency-Inverse Document Frequency (TF-IDF).

### 3. HashingVectorizer

Converts text documents to a matrix of token occurrences using a hashing trick, useful for large datasets.



## What is HashingVectorizer?

HashingVectorizer is a text feature extraction method in scikit-learn that transforms a collection of text documents into a matrix of token occurrences using the hashing trick. Unlike CountVectorizer and TfidfVectorizer, it does not store a vocabulary dictionary, which makes it memory-efficient and suitable for large-scale or streaming text data. The trade-off is that columns cannot be mapped back to the tokens that produced them.
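The statelessness can be sketched as follows. This is an illustration only: the batch contents and the choice of `n_features=2**10` are assumptions made for the example. Since nothing is learned from the data, `transform` can be called directly on each incoming batch, and the output width is always `n_features`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# n_features chosen arbitrarily for this sketch
hv = HashingVectorizer(n_features=2**10)

# No fit step is needed: each batch is hashed independently
batch_1 = hv.transform(["I am happy."])
batch_2 = hv.transform(["Dog is happy.", "A new sentence with unseen words."])

print(batch_1.shape)  # (1, 1024)
print(batch_2.shape)  # (2, 1024)

# There is no stored vocabulary to grow with the data
has_vocabulary = hasattr(hv, "vocabulary_")
print(has_vocabulary)  # False
```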

Video Link: https://youtu.be/fCmaaFcibqI?si=egS4AFPBKONF37t- 

![image-2.png](attachment:image-2.png)


### Sample Documents:

"Data science is fascinating."

"Machine learning drives innovation."

![image.png](attachment:image.png)

In [2]:
from sklearn.feature_extraction.text import HashingVectorizer
import pandas as pd

# Sample documents
docs = [
    "I am happy.",
    "Dog is happy."
]

docs

['I am happy.', 'Dog is happy.']

`n_features=10`: Sets the number of features (columns) in the output matrix to 10. The default is 2**20; a smaller number increases the chance of hash collisions, where different tokens map to the same feature index.

`norm=None`: Disables normalization of the output vectors. By default, HashingVectorizer applies L2 normalization to each row; setting this to None retains the raw token counts.

`alternate_sign=False`: Disables the alternating sign mechanism, so all hashed feature values are non-negative. When enabled (the default), a sign is applied to each hashed value so that inner products are approximately preserved in the hashed space even when collisions occur.
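To see what these parameters change, the sketch below compares the raw configuration with the defaults on the two notebook sentences. The variable names are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["I am happy.", "Dog is happy."]

# Raw counts: no normalization, no sign flipping
raw = HashingVectorizer(
    n_features=10, norm=None, alternate_sign=False
).transform(docs).toarray()

# Defaults: norm="l2" and alternate_sign=True
default = HashingVectorizer(n_features=10).transform(docs).toarray()

# With raw counts, each row sums to the number of tokens in the document,
# because colliding tokens simply add up in the same column
raw_row_sums = raw.sum(axis=1)
print(raw_row_sums)  # [2. 3.]

# With the defaults, each non-empty row is scaled to unit L2 length,
# and individual entries may be negative
default_row_norms = np.linalg.norm(default, axis=1)
print(default_row_norms)
```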



In [8]:
# Initialize HashingVectorizer
vectorizer = HashingVectorizer(n_features=10, norm=None, alternate_sign=False)

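# Transform documents into feature vectors.
# HashingVectorizer is stateless, so fit_transform learns nothing
# and gives the same result as transform.
X = vectorizer.fit_transform(docs)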

X.toarray()

array([[0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 1., 1., 0., 0.]])
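The first row has only two non-zero entries even though "I am happy." has three words. A short sketch of why, assuming the default analyzer settings: the default token pattern keeps only tokens of two or more characters, so "I" is discarded before hashing. The helper `column_of` is a hypothetical name introduced for this example.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=10, norm=None, alternate_sign=False)

# The analyzer lowercases the text and extracts the tokens that get hashed
analyze = vectorizer.build_analyzer()
tokens_per_doc = [analyze(doc) for doc in ["I am happy.", "Dog is happy."]]
print(tokens_per_doc)  # [['am', 'happy'], ['dog', 'is', 'happy']]


def column_of(token):
    """Return the column a single token is hashed to (hypothetical helper)."""
    return int(vectorizer.transform([token]).nonzero()[1][0])


# The same token always lands in the same column, which is why
# "happy" contributes to one shared column in both rows
columns = {tok: column_of(tok) for tok in ["am", "happy", "dog", "is"]}
print(columns)
```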

In [9]:
# Convert to dense array and create DataFrame
df = pd.DataFrame(X.toarray(), index=["Doc 1", "Doc 2"])

# Display the DataFrame
df

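| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| Doc 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |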
