## TF-IDF Vectorizer in Scikit-Learn

In [None]:
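from sklearn.feature_extraction.text import TfidfVectorizer
# corpus: an iterable of raw text documents (strings)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, shape (n_documents, n_terms)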

## Hashing Vectorizer (No need for explicit vocabulary)

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)
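# Stateless: no fit step is needed, so transform is called directly on the documents
X_hashed = hv.transform(corpus)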

HashingVectorizer and TfidfVectorizer both turn raw text into numeric feature matrices, but they work in fundamentally different ways and have distinct characteristics:

## Feature Generation:

HashingVectorizer: It applies a hash function to each token and uses the hash value, reduced modulo the number of features, as the column index, so no vocabulary is ever built. This results in a fixed-size feature space regardless of the size of the vocabulary or the dataset, at the cost that distinct tokens can collide in the same column. By default the values are normalized term counts, with no IDF weighting.

TfidfVectorizer: It constructs a vocabulary of unique words (terms) from the text data and counts the occurrence of each term in each document. Then, it computes the Term Frequency-Inverse Document Frequency (TF-IDF) value for each term-document pair. TF-IDF is a measure that reflects both the importance of a term in a document and its rarity in the entire corpus. The result is a sparse matrix where each row corresponds to a document, and each column corresponds to a unique term in the vocabulary.
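The TF-IDF computation described above can be reproduced by hand as a rough check. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the toy corpus and the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative toy corpus (not from the original notebook)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: build the vocabulary and count each term in each document
counts = CountVectorizer().fit_transform(corpus).toarray()

# Step 2: document frequency and smoothed inverse document frequency
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 3: weight the counts by IDF, then L2-normalize each document row
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# The manual result should agree with TfidfVectorizer under default settings
tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Changing options such as `smooth_idf`, `sublinear_tf`, or `norm` changes the formula, so the manual steps would need to be adjusted to match.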

## Memory Efficiency:

HashingVectorizer: It is memory-efficient, especially for large datasets and when the vocabulary size is large. Since it doesn't store a vocabulary explicitly, the vectorizer itself has a small, fixed memory footprint.

TfidfVectorizer: It can be memory-intensive because it must keep the full vocabulary (a term-to-index mapping) and the per-term IDF weights in memory, in addition to the sparse term-document matrix. The vocabulary grows with the number of unique terms in the corpus, and the matrix grows with the number of non-zero term-document entries.

## Interpretability:

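HashingVectorizer: It is less interpretable because the mapping from tokens to features is a one-way hash. A given token always maps to the same feature, but several different tokens may collide in one feature, and there is no stored vocabulary to map a feature index back to the token(s) that produced it.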

TfidfVectorizer: It is more interpretable because it explicitly maintains a vocabulary of terms, and each feature corresponds to a specific term. This makes it easier to understand which terms are contributing to the feature vectors.

## Scalability:

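HashingVectorizer: It is highly scalable and suitable for large datasets because the feature space size is fixed and the vectorizer is stateless, so documents can be transformed in batches (or in parallel) without a fitting pass over the whole corpus.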

TfidfVectorizer: It can be less scalable when dealing with very large vocabularies because it needs to store and process the entire vocabulary.
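One way the scalability difference shows up in practice is out-of-core learning: because HashingVectorizer needs no fitting pass, batches can be vectorized as they arrive and fed to an incremental learner. The following is a minimal sketch; the `stream_batches` generator, its toy data, and the choice of `SGDClassifier` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def stream_batches():
    """Hypothetical stand-in for reading (texts, labels) chunks from disk."""
    yield ["good movie", "great film"], [1, 1]
    yield ["bad movie", "terrible film"], [0, 0]
    yield ["great acting", "bad plot"], [1, 0]


hv = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

for texts, labels in stream_batches():
    X_batch = hv.transform(texts)  # no fit step, so each batch is independent
    clf.partial_fit(X_batch, labels, classes=all_classes)

prediction = clf.predict(hv.transform(["great movie"]))
print(prediction.shape)
```

A TfidfVectorizer could not be used this way directly, since its vocabulary and IDF weights depend on seeing the entire corpus first.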

## Feature Dimension:

HashingVectorizer: The feature dimension is fixed and determined by the number of features specified when creating the vectorizer.

TfidfVectorizer: The feature dimension equals the size of the learned vocabulary, so it varies with the number of unique terms in the text data (unless it is capped, for example with `max_features`).
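The difference in feature dimension is easy to check on two toy corpora of different sizes. The corpora below are illustrative only.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

small = ["red green", "green blue"]
large = small + ["yellow purple orange", "black white grey"]

# HashingVectorizer: the number of columns is whatever n_features says.
hv = HashingVectorizer(n_features=16)
hash_dims = (hv.transform(small).shape[1], hv.transform(large).shape[1])

# TfidfVectorizer: the number of columns follows the vocabulary size.
tfidf_dims = (
    TfidfVectorizer().fit_transform(small).shape[1],
    TfidfVectorizer().fit_transform(large).shape[1],
)

print(hash_dims)   # (16, 16)
print(tfidf_dims)  # (3, 9)
```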

## XGBoost

In [None]:
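from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# In recent xgboost releases, early_stopping_rounds and eval_metric are
# constructor arguments rather than fit() arguments
model = XGBClassifier(early_stopping_rounds=10,
                      eval_metric="logloss")

# Training stops when logloss on the evaluation set has not improved for 10 rounds
model.fit(x_train, y_train,
          eval_set=[(x_eval, y_eval)])

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)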

## Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Between two vectors (each must be 2D, with shape (1, n_features))
sim = cosine_similarity(x, y)[0, 0]

# Between all rows of a matrix
sim = cosine_similarity(X)

## K Means

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
assigned_clusters = kmeans.labels_

## PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)

## LDA

In [None]:
from gensim.models.ldamodel import LdaModel
# Train LDA with 10 topics
lda = LdaModel(doc_term_matrix, num_topics=10,
               id2word=dictionary, passes=3)
lda.show_topics(formatted=False)

# To get the topic proportions for a document, pass the corresponding row of the document-term matrix
lda[doc_term_matrix[1]]