# 03. Feature Extraction & Text Representation
Feature extraction and text representation are crucial steps in Natural Language Processing (NLP) that involve converting text data into numerical formats that machine learning algorithms can understand and process effectively.

### What You'll Learn:
- Bag of Words (BoW)
- TF-IDF
- N-grams
- When to use each method

## Why Convert Text to Numbers?

Machine learning models operate on numbers, not raw strings.
Each document therefore has to be converted into a numerical vector before a model can use it.

## Method 1: Bag of Words (BoW)

Count how many times each word appears.
Ignore word order.

Example (vocabulary: I, like, apples, oranges, are, good):
- Doc 1: 'I like apples' -> [1, 1, 1, 0, 0, 0]
- Doc 2: 'I like oranges' -> [1, 1, 0, 1, 0, 0]
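
The counting step can be sketched without any library. This is only a minimal illustration, not how scikit-learn implements it: the helper `bag_of_words` and the fixed `vocabulary` list are hypothetical names introduced here, and tokenization is assumed to be a plain lowercase whitespace split.

```python
from collections import Counter

# Hypothetical fixed vocabulary; its order defines the vector positions.
vocabulary = ['i', 'like', 'apples', 'oranges', 'are', 'good']

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, ignoring word order."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]

print(bag_of_words('I like apples', vocabulary))   # [1, 1, 1, 0, 0, 0]
print(bag_of_words('I like oranges', vocabulary))  # [1, 1, 0, 1, 0, 0]
```

Because only counts are kept, 'apples like I' would map to the same vector as 'I like apples'.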

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = [
    'I like apples',
    'I like oranges',
    'Apples are good'
]

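# Note: the default tokenizer keeps only tokens of 2+ characters,
# so the single-letter word 'I' does not appear in the vocabulary.
vectorizer = CountVectorizer(lowercase=True)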
bow_matrix = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)

print('Bag of Words Matrix:')
print(bow_df)

Bag of Words Matrix:
   apples  are  good  like  oranges
0       1    0     0     1        0
1       0    0     0     1        1
2       1    1     1     0        0


## Method 2: TF-IDF

Gives higher weight to:
- Words appearing frequently in a document (TF)
- Words rare across documents (IDF)

Formula: TF-IDF = TF × IDF
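
As a worked illustration of the formula, the sketch below computes TF-IDF by hand for a small corpus. It assumes the smoothed variant that scikit-learn's `TfidfVectorizer` applies with its default settings, IDF = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector; other references define IDF slightly differently. The function name `tfidf` is hypothetical, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

docs = [
    'machine learning is amazing',
    'deep learning and machine learning',
    'python programming language',
]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed to match scikit-learn's default settings).
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

for row in tfidf(docs):
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

Under these assumptions, 'amazing' (found in one document) gets a weight of about 0.563 in the first document, while 'machine' (found in two) gets about 0.428, which shows how rarer terms are weighted up.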

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    'machine learning is amazing',
    'deep learning and machine learning',
    'python programming language'
]

tfidf_vec = TfidfVectorizer(lowercase=True)
tfidf_matrix = tfidf_vec.fit_transform(documents)

feature_names = tfidf_vec.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

print('TF-IDF Matrix:')
print(tfidf_df.round(3))

TF-IDF Matrix:
   amazing    and   deep     is  language  learning  machine  programming  \
0    0.563  0.000  0.000  0.563     0.000     0.428    0.428        0.000   
1    0.000  0.452  0.452  0.000     0.000     0.688    0.344        0.000   
2    0.000  0.000  0.000  0.000     0.577     0.000    0.000        0.577   

   python  
0   0.000  
1   0.000  
2   0.577  


## Method 3: N-grams

Groups of N consecutive words.
- Unigrams: Single words
- Bigrams: Two-word pairs
- Trigrams: Three-word groups
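
The definition above can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical, and the input is assumed to be already tokenized into a list of words.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # A sequence of length L has L - n + 1 windows of size n (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'i like apples'.split()
print(ngrams(tokens, 1))  # [('i',), ('like',), ('apples',)]
print(ngrams(tokens, 2))  # [('i', 'like'), ('like', 'apples')]
print(ngrams(tokens, 3))  # [('i', 'like', 'apples')]
```

Unlike single-word counts, bigrams and trigrams keep some local word order, at the cost of a larger vocabulary.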

In [6]:
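# Punctuation is dropped during tokenization, so n-grams can span
# the sentence boundary (e.g. 'apples apples').
text = 'I like apples. Apples are good.'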

# Unigrams
ug_vec = CountVectorizer(ngram_range=(1,1))
ug_matrix = ug_vec.fit_transform([text])
print('Unigrams:', dict(zip(ug_vec.get_feature_names_out(), ug_matrix.toarray()[0])))

# Bigrams
bg_vec = CountVectorizer(ngram_range=(2,2))
bg_matrix = bg_vec.fit_transform([text])
print('\nBigrams:', dict(zip(bg_vec.get_feature_names_out(), bg_matrix.toarray()[0])))

# Trigrams
tg_vec = CountVectorizer(ngram_range=(3,3))
tg_matrix = tg_vec.fit_transform([text])
print('\nTrigrams:', dict(zip(tg_vec.get_feature_names_out(), tg_matrix.toarray()[0])))

Unigrams: {'apples': np.int64(2), 'are': np.int64(1), 'good': np.int64(1), 'like': np.int64(1)}

Bigrams: {'apples apples': np.int64(1), 'apples are': np.int64(1), 'are good': np.int64(1), 'like apples': np.int64(1)}

Trigrams: {'apples apples are': np.int64(1), 'apples are good': np.int64(1), 'like apples apples': np.int64(1)}
