# Steps to extract features from your dataset
---

Here are a few basic techniques for extracting features from a text dataset. The example dataset used is the `IMDB Dataset of 50K Movie Reviews`.


---

## 📑 Contents

1. One Hot Encoding
2. Bag of Words (BoW)
3. N-grams
4. TF-IDF (Term Frequency Inverse Document Frequency)
5. Word2Vec

# 1. One Hot Encoding

One-Hot Encoding is a method to convert categorical (textual or class) data into a binary (0 or 1) format that is more suitable for machine learning models.

Each category is represented as a binary vector with a 1 indicating the presence of a specific class and 0s elsewhere.

Many machine learning algorithms cannot handle categorical values directly (like ['red', 'green', 'blue']) because they expect numerical input. One-hot encoding converts these into binary vectors so the model can process them.

For example:

### Original data:

['red', 'green', 'blue']

### After One-Hot Encoding:

red   → [1, 0, 0]  
green → [0, 1, 0]  
blue  → [0, 0, 1]
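
The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the helper name `one_hot_encode` is hypothetical, and it assumes categories are numbered in order of first appearance.

```python
def one_hot_encode(values):
    """Return (categories, vectors) with one binary vector per input value."""
    # Keep categories in order of first appearance, so 'red' gets column 0.
    categories = list(dict.fromkeys(values))
    column = {category: i for i, category in enumerate(categories)}

    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1  # a single 1 marks the value's category
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_encode(['red', 'green', 'blue'])
for category, vector in zip(categories, vectors):
    print(category, vector)
```

Each vector has as many entries as there are distinct categories, which is why the representation grows quickly for large vocabularies.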


### Pros:

| Advantage                                | Description                                                                             |
| ---------------------------------------- | --------------------------------------------------------------------------------------- |
|  **Simple to implement**               | Very easy and straightforward to apply using libraries like pandas or sklearn.          |
|  **No ordinal relationship**           | Prevents the model from assuming a natural ordering, as integer codes such as Red = 1, Green = 2 would imply. |
|  **Works well with tree-based models** | Such as Decision Trees, Random Forest, and XGBoost.                                     |


### Cons:

| Disadvantage                         | Description                                                                           |
| ------------------------------------ | ------------------------------------------------------------------------------------- |
|  **Curse of dimensionality**       | If there are many categories, the feature space becomes very large and sparse.        |
|  **Memory inefficiency**           | Each category creates a new column, consuming memory and slowing training.            |
|  **Sparsity**                      | Sparse vectors can be inefficient for deep learning (embeddings are often preferred). |
|  **Out of vocabulary**             | Cannot represent words that were not in the vocabulary when the encoder was built.    |
|  **No fixed size**                 | Texts with different numbers of words produce inputs of different sizes, while most models expect a fixed input size. |


In [None]:
# Example in Python using pandas:
import pandas as pd

# Sample data
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})

# One-hot encoding
encoded = pd.get_dummies(df, columns=['Color'])

print(encoded)


   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False        False       True


# 2. Bag of Words (BoW)

Bag of Words (BoW) is a text representation technique in Natural Language Processing (NLP) where a text (such as a sentence or document) is represented as an unordered collection of its words, disregarding grammar and word order, but keeping multiplicity (frequency).

BoW builds a vocabulary of all the unique words from the entire corpus (dataset) and represents each document by counting how many times each word appears in it.

- It ignores the position of the words.
- The result is a sparse vector where each element represents the count of a word from the vocabulary.


`Doc 1`: "I love NLP"

`Doc 2`: "I love machine learning"

`Vocabulary`: ['I', 'love', 'NLP', 'machine', 'learning']

`BoW representation`:

| Document | I | love | NLP | machine | learning |
| -------- | - | ---- | --- | ------- | -------- |
| Doc 1    | 1 | 1    | 1   | 0       | 0        |
| Doc 2    | 1 | 1    | 0   | 1       | 1        |

Each row is a vector representation of a document.
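
The same representation can be built by hand. This is a minimal sketch that assumes whitespace tokenization and a vocabulary ordered by first appearance; a library vectorizer may tokenize and order terms differently.

```python
from collections import Counter

docs = ["I love NLP", "I love machine learning"]

# Split on whitespace; a real tokenizer would also handle case and punctuation.
tokenized = [doc.split() for doc in docs]

# Vocabulary: unique words in order of first appearance across the corpus.
vocabulary = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# One count vector per document, aligned with the vocabulary order.
bow = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```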

| Advantage                                  | Description                                         |
| ------------------------------------------ | --------------------------------------------------- |
|  **Simple and intuitive**                 | Easy to implement and understand.                   |
|  **Captures word frequency**             | Gives information about how often words appear.     |
|  **Works well with classical ML models** | Such as Naive Bayes, SVM, Logistic Regression, etc. |


| Disadvantage                            | Description                                                        |
| --------------------------------------- | ------------------------------------------------------------------ |
|  **Ignores semantics and word order** | Cannot detect meaning, synonyms, or context.                       |
|  **Large and sparse vectors**         | High-dimensional space if vocabulary is large.                     |
|  **No understanding of meaning**      | "Dog bites man" and "Man bites dog" are treated as identical in BoW. |
|  **Weights all words equally**        | Common words like "the" may dominate unless stopwords are removed. |
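
Two of these limitations are easy to see on a pair of made-up sentences. The sketch below uses a hypothetical helper, `bag_of_words`, with lowercasing and whitespace splitting as simplifying assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Word order is discarded, so the two bags compare equal.
print(a == b)

# The most frequent word is a stopword that carries little meaning.
print(a.most_common(1))
```

Both sentences collapse to the same counts even though they describe opposite events, and "the" outweighs every content word.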


In [None]:
# Example in Python using CountVectorizer from scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP",
    "I love machine learning"
]

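# By default, CountVectorizer lowercases the text and drops
# single-character tokens, so "I" is not part of the vocabulary.
vectorizer = CountVectorizer()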
X = vectorizer.fit_transform(corpus)

print('Vocabulary:', vectorizer.get_feature_names_out())
print(X.toarray())


Vocabulary: ['learning' 'love' 'machine' 'nlp']
[[0 1 0 1]
 [1 1 1 0]]


# 3. N-grams

N-grams are contiguous sequences of n items (usually words or characters) from a given text.
In NLP, N-grams are most often used to capture context and word order in token sequences.

- Unigram = 1 word
- Bigram = 2 consecutive words
- Trigram = 3 consecutive words
- … and so on.

The N-gram model is a simple and widely used method for text representation and language modeling.

Unlike Bag of Words (which ignores word order), N-grams help capture phrase structure and local context by looking at adjacent words.

For example, the sentence "I love machine learning" yields:

- Unigrams (n=1): `['I', 'love', 'machine', 'learning']`
- Bigrams (n=2): `['I love', 'love machine', 'machine learning']`
- Trigrams (n=3): `['I love machine', 'love machine learning']`

These are often used to build statistical language models or as features in classification tasks.
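
Extracting N-grams amounts to sliding a window of length n over the token list. A minimal sketch, assuming whitespace tokenization (the function name `ngrams` is illustrative):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love machine learning".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence of k tokens yields k - n + 1 N-grams, and none at all when n exceeds k.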


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love machine learning"]

# Extract bigrams (2-grams). The default tokenizer drops the
# single-character token "I", so "I love" is not produced.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['love machine' 'machine learning']
[[1 1]]


| Advantage                           | Description                                                               |
| ----------------------------------- | ------------------------------------------------------------------------- |
|  **Captures context and order**    | Retains the local structure of language (e.g., “not good” vs “good”).     |
|  **Improves text classification** | Especially effective for tasks like sentiment analysis or spam detection. |
|  **Useful in language modeling**  | Helps in predicting the next word in a sequence.                          |


| Disadvantage                | Description                                                          |
| --------------------------- | -------------------------------------------------------------------- |
|  **Data sparsity**        | As `n` increases, combinations become sparse and hard to learn from. |
|  **High dimensionality**  | Large N-grams result in huge feature spaces, requiring more memory.  |
|  **Limited context**      | Still fails to capture long-term dependencies and meaning.           |
|  **Vocabulary explosion** | Many similar phrases are treated as completely separate features.    |


# 4. TF-IDF (Term Frequency Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection (corpus). It is widely used in information retrieval and text mining.

It balances two ideas:

   - TF (Term Frequency): How often a term appears in a document.

   - IDF (Inverse Document Frequency): How rare the term is across all documents.

#### Formula:

TF-IDF = TF(t, d) × IDF(t)


Where:

   - t = term (word)

   - d = document

   - TF(t, d) = (Number of times term t appears in document d) / (Total terms in d)

   - IDF(t) = log(N / df(t)), where:

       - N = total number of documents

       - df(t) = number of documents containing term t

#### Explanation

The idea is to:

  -  Emphasize important words in a document (high TF).

  -  Downweight common words like “the”, “is”, “and” (low IDF).

So, words that are frequent within a document but rare across the corpus get the highest scores.
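
To make the formula concrete, here is a minimal sketch that applies it directly to a small toy corpus. The corpus, the whitespace tokenization, and the helper names are assumptions for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math

# Toy corpus, already lowercased and split on whitespace.
docs = [
    "i love machine learning".split(),
    "i love deep learning".split(),
    "machine learning is amazing".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df): zero when the term occurs in every document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score three terms of the second document.
for term in ["learning", "love", "deep"]:
    print(term, round(tf_idf(term, docs[1], docs), 4))
```

"learning" appears in every document, so its IDF and therefore its score is zero, while "deep" appears in only one document and scores highest.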

In [None]:
docs = [
    "I love machine learning",
    "I love deep learning",
    "machine learning is amazing"
]


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['amazing' 'deep' 'is' 'learning' 'love' 'machine']
[[0.         0.         0.         0.48133417 0.61980538 0.61980538]
 [0.         0.72033345 0.         0.42544054 0.54783215 0.        ]
 [0.5844829  0.         0.5844829  0.34520502 0.         0.44451431]]
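
These values differ from the plain TF × IDF formula because, by default, scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row. The sketch below reimplements those defaults by hand, as an approximation for illustration rather than the library's actual code, and should reproduce the matrix up to rounding.

```python
import math
import re

docs = [
    "I love machine learning",
    "I love deep learning",
    "machine learning is amazing",
]

# Assumed defaults: lowercase, keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

def smoothed_idf(term):
    """ln((1 + N) / (1 + df)) + 1, so terms found in every document keep weight 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then scale the row to unit Euclidean length.
    weights = [tokens.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print(vocabulary)
for tokens in tokenized:
    print([round(w, 4) for w in tfidf_row(tokens)])
```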


| Pros                             | Description                                          |
| -------------------------------- | ---------------------------------------------------- |
|  **Captures term importance**   | Highlights keywords unique to a document.            |
|  **Reduces common word impact** | Common across documents = less weight.               |
|  **Efficient & Scalable**       | Simple and fast to compute even on large corpora.    |
|  **Better than BoW**            | Avoids overvaluing frequent but uninformative words. |


| Cons                        | Description                                  |
| --------------------------- | -------------------------------------------- |
|  **No semantics**          | “good” and “excellent” treated as unrelated. |
|  **Sparse representation** | Large vocabularies → many zero entries.      |
|  **Fixed vocabulary**      | Can't handle unseen words during inference.  |
|  **Ignores word order**    | Loses sequence information unless N-grams are used as terms. |


# 5. Word2Vec

A word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning so that words that are close together in the vector space are expected to be similar in meaning.

Word2Vec is a word embedding technique that transforms words into dense vector representations based on their context in a corpus. It captures semantic meaning, so similar words have similar vectors.

  -  Developed by Tomas Mikolov et al. at Google in 2013.
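
"Closer in the vector space" is usually measured with cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors; the numbers are invented purely for illustration and are not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; real embeddings are learned
# and usually have hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these toy values, "cat" and "dog" score much closer to 1 than "cat" and "car".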

### Explanation:

Word2Vec learns vector representations of words using one of two main architectures:

  1.  CBOW (Continuous Bag of Words)

      -  Predicts the target word from the surrounding context words.

      -  Example: "I ___ NLP" → predict "love".

  2.  Skip-Gram

      -  Predicts context words from the target word.

      -  Example: "NLP" → predict "I", "love", etc.

Both use a shallow neural network (one hidden layer) to learn these relationships.
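
The difference between the two architectures shows up in the training examples they draw from a sentence. The sketch below only generates those examples and does not train a network; the function names and the window size of 1 are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: the target word is used to predict each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context, target) pairs: all neighbors together are used to predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = "I love NLP".split()
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-Gram produces one example per (target, neighbor) pair, while CBOW produces one example per position with the whole context as input.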

| Pros                             | Description                                                              |
| -------------------------------- | ------------------------------------------------------------------------ |
|  **Semantic meaning**           | Similar words have similar vectors.                                      |
|  **Efficient & scalable**       | Trains fast on large corpora.                                            |
|  **Handles large vocabularies** | Learns compact, meaningful representations.                              |
|  **Works well with analogies**  | “king - man + woman ≈ queen”.                                            |
|  **Improves downstream tasks**  | Better input features for NLP tasks (e.g., classification, translation). |


| Cons                         | Description                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------ |
|  **Requires large corpora** | Needs a lot of data to perform well.                                                 |
|  **Context-independent**    | Same vector for a word regardless of sentence (e.g., "bank" in "river" vs. "money"). |
|  **No OOV handling**        | Cannot represent words not seen during training.                                     |
|  **Ignores morphology**     | Uses no subword information, so “run”, “running”, and “ran” each get a separately learned vector. |
|  **Static embeddings**      | One word = one meaning.                                                              |


In [1]:
import gensim

In [11]:
# Download and load the pretrained model
# (a large download, roughly 1.6 GB, on first use)
import gensim.downloader as api

model = api.load("word2vec-google-news-300")




In [14]:
model['cricket'].shape

(300,)

In [15]:
model.most_similar('man')

[('woman', 0.7664012908935547),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586930155754089),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('suspected_purse_snatcher', 0.571636438369751),
 ('robber', 0.5585119128227234),
 ('Robbery_suspect', 0.5584409832954407),
 ('teen_ager', 0.5549196600914001),
 ('men', 0.5489763021469116)]

In [16]:
model.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [18]:
model.most_similar('facebook')

[('Facebook', 0.7563533186912537),
 ('FaceBook', 0.7076998949050903),
 ('twitter', 0.6988552212715149),
 ('myspace', 0.6941817998886108),
 ('Twitter', 0.664244532585144),
 ('twitter_facebook', 0.6572229862213135),
 ('Facebook.com', 0.6529868245124817),
 ('myspace_facebook', 0.6370643973350525),
 ('facebook_twitter', 0.6367618441581726),
 ('linkedin', 0.6356592774391174)]

In [19]:
model.similarity('man','woman')

0.76640123

In [20]:
model.similarity('man','PHP')

-0.032995153

In [21]:
model.doesnt_match(['PHP','java','monkey'])

'monkey'

In [22]:
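# Analogy: king - man + woman. Passing a raw vector does not exclude
# the query words, so 'king' itself ranks first, followed by 'queen'.
vec = model['king'] - model['man'] + model['woman']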
model.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
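
Analogy queries usually drop the query words from the ranking, which is why "queen" is normally reported as the answer even though "king" scores highest here. In gensim, passing words through the `positive` and `negative` arguments of `most_similar` does this automatically. The arithmetic can be sketched by hand on made-up 2-dimensional vectors; the numbers are invented for illustration and are not real embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); not real embeddings.
toy = {
    "king":     [0.9,  0.8],
    "queen":    [0.9, -0.8],
    "prince":   [0.7,  0.8],
    "princess": [0.7, -0.8],
    "man":      [0.1,  0.8],
    "woman":    [0.1, -0.8],
}

def analogy(a, b, c, vectors):
    """Word most similar to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", toy))
```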

In [31]:
vec = model['coin'] - model['crypto'] + model['forex']
model.most_similar([vec])

[('forex', 0.7094285488128662),
 ('coin', 0.5928510427474976),
 ('Forex', 0.518968939781189),
 ('coins', 0.5181783437728882),
 ('Currency', 0.5116606950759888),
 ('curency', 0.4965347647666931),
 ('currency', 0.4918244779109955),
 ('FOREX', 0.48118162155151367),
 ('FXCM_Micro', 0.4436749219894409),
 ('interbank_forex', 0.43405434489250183)]

In [32]:
vec = model['coin'] - model['forex'] + model['crypto']
model.most_similar([vec])

[('crypto', 0.6437585353851318),
 ('coin', 0.49621814489364624),
 ('proto', 0.3795802891254425),
 ('Crypto', 0.3647187054157257),
 ('swastika_emblazoned', 0.34711766242980957),
 ('coinage', 0.34510746598243713),
 ('Stakes_Albarado', 0.33207038044929504),
 ('medallion', 0.33010321855545044),
 ('eponym', 0.3290451467037201),
 ('coins', 0.32545778155326843)]