- Real-world data is rarely tidy. To use it, we convert all of its information into numerical values so that we can build a feature matrix.
- Common feature engineering tasks include creating:
  1. Features to represent categorical data
  2. Features to represent text
  3. Features to represent images
  4. Derived features to increase model complexity, plus imputation of missing values

- Feature engineering is also called **vectorization**, as it converts arbitrary data into well-behaved vectors.
   

## 1. Categorical Features
- We use **one-hot encoding**, which adds one binary (0/1) column per category.


In [4]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]


In [5]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}  # Integer codes imply an ordering (Queen Anne < Fremont < Wallingford), so this mapping is not useful for scikit-learn models

{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

In [6]:
data

[{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
 {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
 {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
 {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]

- When the data is a list of dictionaries, we can use DictVectorizer directly.

In [8]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec

In [10]:
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

- The neighborhood column is expanded into three columns, one per category; a 1 marks the neighborhood of each row.

In [14]:
vec.get_feature_names_out()  # Inspect the names of the encoded feature columns

array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)

**Problem**

- With a large number of categories, one-hot encoding creates that many columns, most of them filled with zeros, which makes the dataset unnecessarily large.
- Because the encoded data is mostly zeros, a sparse output is an efficient way to store it.

In [15]:
vec = DictVectorizer(sparse=True, dtype=int)

In [16]:
vec.fit_transform(data)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 12 stored elements and shape (4, 5)>

In [19]:
vec.get_feature_names_out()   # Feature names are unchanged; only the storage format is sparse

array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)

- Many (though not all) scikit-learn estimators accept such sparse inputs when fitting.
- scikit-learn provides two additional tools for encoding this kind of data:
  - sklearn.preprocessing.OneHotEncoder
  - sklearn.feature_extraction.FeatureHasher
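As a rough sketch of how these two encoders might be applied to the neighborhood column from the housing example: the array `neighborhoods`, the token format `neighborhood=...`, and the choice of `n_features=8` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

# One categorical column, shaped (n_samples, 1) as OneHotEncoder expects.
neighborhoods = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])

# OneHotEncoder returns a SciPy sparse matrix by default; categories are sorted.
encoder = OneHotEncoder(dtype=int)
onehot = encoder.fit_transform(neighborhoods)
onehot_dense = onehot.toarray()
print(encoder.categories_[0])
print(onehot_dense)

# FeatureHasher hashes feature names into a fixed number of columns,
# so the output width does not grow with the number of categories.
# n_features=8 is an arbitrary illustrative choice.
hasher = FeatureHasher(n_features=8, input_type='string')
tokens = [['neighborhood=' + row[0]] for row in neighborhoods]
hashed = hasher.transform(tokens)
print(hashed.shape)
```

The hashed representation trades interpretability for a fixed width: column names are lost, and distinct categories can collide in the same column when `n_features` is small.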

## 2. Text features

- Text data must also be converted into numerical values.
- A common approach is **word count vectorization**: represent each text snippet by counting the occurrences of each word in it.


In [20]:
sample = ['problem of evil', 'evil queen', 'horizon problem']


We can use CountVectorizer from scikit-learn to transform these phrases into a numerical format:

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
x = vec.fit_transform(sample)

In [22]:
x

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 7 stored elements and shape (3, 5)>

- The output is a sparse matrix where each row represents a phrase and each column represents a word from the vocabulary.
- To inspect the sparse matrix, convert it to a pandas DataFrame with labeled columns:


In [24]:
import pandas as pd
pd.DataFrame(x.toarray(), columns=vec.get_feature_names_out())

|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 |


**Drawbacks of Word Counts:**

- Frequent words are overweighted: raw counts give the most weight to words that appear often across documents, so common words may dominate the feature space even when they carry little information.
- Solution: Use Term Frequency–Inverse Document Frequency (TF-IDF).

**TF-IDF (Term Frequency-Inverse Document Frequency):**

Purpose: weight each word count by how often the word appears across documents. Words that occur in few documents get higher weights; words that occur in many documents get lower weights.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())


|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 0.517856 | 0.0 | 0.680919 | 0.517856 | 0.0 |
| 1 | 0.605349 | 0.0 | 0.0 | 0.0 | 0.795961 |
| 2 | 0.0 | 0.795961 | 0.0 | 0.605349 | 0.0 |


- TF-IDF is commonly used in text classification tasks to balance word importance.
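To make the weighting concrete, here is a minimal sketch that recomputes the TF-IDF values by hand, assuming TfidfVectorizer's default settings (smoothed IDF, L2 row normalization). The variable names `doc_freq`, `idf`, `weighted`, `manual_tfidf`, and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ['problem of evil', 'evil queen', 'horizon problem']

# Raw word counts: one row per phrase, one column per vocabulary word.
counts = CountVectorizer().fit_transform(sample).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many phrases each word appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own result.
reference = TfidfVectorizer().fit_transform(sample).toarray()
print(np.allclose(manual_tfidf, reference))
```

In the first phrase, "of" appears in only one document and so ends up with a larger weight than "problem" and "evil", which each appear in two.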

## 3. Image features
Basic approach:

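- Use the pixel values directly, so that each pixel becomes a feature.
- For more complex problems, standard image feature extraction techniques are available, mostly in the scikit-image project; scikit-learn itself offers basic utilities such as patch extraction in sklearn.feature_extraction.image.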
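A minimal sketch of the pixel-value approach, assuming the small handwritten digits dataset bundled with scikit-learn as example data (the original notes do not name a dataset):

```python
from sklearn.datasets import load_digits

# 8x8 grayscale images of handwritten digits, shipped with scikit-learn.
digits = load_digits()
images = digits.images            # shape: (n_samples, 8, 8)

# Flatten each image so that every pixel becomes one feature column.
X_pixels = images.reshape(len(images), -1)
print(images.shape, X_pixels.shape)
```

Flattening discards the spatial arrangement of the pixels, which is one reason dedicated feature extraction is often preferred for harder problems.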

## 4. Mathematically derived features
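As a minimal sketch of the derived features and imputation mentioned in the introduction, the example below uses PolynomialFeatures and SimpleImputer from scikit-learn. The arrays `x` and `X_missing` are made-up illustrative data, and polynomial expansion is only one possible kind of derived feature.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer

# Derived features: expand a single input column into x, x^2, x^3,
# which lets a linear model fit a nonlinear relationship.
x = np.array([1, 2, 3, 4, 5])
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x[:, np.newaxis])
print(X_poly)

# Imputation: replace missing values (NaN) with the mean of their column.
X_missing = np.array([[np.nan, 0, 3],
                      [3, 7, 9],
                      [3, 5, 2],
                      [4, np.nan, 6],
                      [8, 8, 1]])
X_filled = SimpleImputer(strategy='mean').fit_transform(X_missing)
print(X_filled)
```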

