## One Hot Encoding

##### One-hot encoding converts categorical features into a numerical form that machine learning algorithms can process. Each category of a feature is represented as a binary vector whose length equals the number of possible categories: the position assigned to that category holds a 1 and every other position holds a 0.
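
The definition above can be sketched by hand with NumPy before reaching for a library. The `colors` feature and the `index_of` mapping below are hypothetical names chosen for illustration, not part of the notebook; this is a minimal sketch, assuming each category is assigned a column by its sorted position.

```python
import numpy as np

# Hypothetical categorical feature with three distinct categories.
colors = ["red", "green", "blue", "green"]

# Sort the unique categories so each one gets a stable column index.
categories = sorted(set(colors))  # ['blue', 'green', 'red']
index_of = {category: i for i, category in enumerate(categories)}

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the identity matrix by category index encodes the feature.
one_hot = np.eye(len(categories))[[index_of[c] for c in colors]]
print(one_hot)
```

Each row contains a single 1, and the number of columns equals the number of distinct categories.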

In [1]:
import numpy as np

In [21]:
sentences = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers."
]

In [22]:
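# Tokenize: lowercase each sentence, strip periods, and split on whitespace.
tokenized_sentences = [sentence.lower().replace('.', '').split() for sentence in sentences]

# Vocabulary: the sorted set of unique words across all sentences.
vocabulary = sorted(set(word for sentence in tokenized_sentences for word in sentence))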

In [24]:
vocabulary

['a',
 'brown',
 'by',
 'dog',
 'fox',
 'jumped',
 'lazy',
 'of',
 'over',
 'peck',
 'peppers',
 'peter',
 'picked',
 'pickled',
 'piper',
 'quick',
 'seashells',
 'seashore',
 'sells',
 'she',
 'the']
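
Because the vocabulary is sorted, a word's position in this list is the column that will hold its 1 in the encoded vector. The short sketch below rebuilds the vocabulary from the same sentences and looks up a few positions; `word_to_index` is a hypothetical helper name introduced only for this illustration.

```python
sentences = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
]

# Same tokenization as the notebook: lowercase, strip periods, split on whitespace.
tokenized = [s.lower().replace(".", "").split() for s in sentences]
vocabulary = sorted(set(word for sentence in tokenized for word in sentence))

# Map each word to its column index in the one-hot vector.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(len(vocabulary))         # 21
print(word_to_index["the"])    # 20
print(word_to_index["quick"])  # 15
```

So "the" should light up the last of 21 columns, and "quick" the column at index 15.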

In [6]:
pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl (11.5 MB)

In [10]:
pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.5.0-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.13.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.0-cp312-cp312-win_amd64.whl (10.9 MB)

In [25]:
from sklearn.preprocessing import OneHotEncoder

In [26]:
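# sparse_output=False returns a dense NumPy array instead of a SciPy sparse matrix.
one_hot_encoder = OneHotEncoder(sparse_output=False)
# The encoder expects a 2D array, so reshape the vocabulary into a single column.
one_hot_encoder.fit(np.array(vocabulary).reshape(-1, 1))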

In [28]:
# Encode each sentence as a matrix: one row per word, one column per vocabulary entry.
encoded_sentences = []
for sentence in tokenized_sentences:
    encoded_sentence = one_hot_encoder.transform(np.array(sentence).reshape(-1, 1))
    encoded_sentences.append(encoded_sentence)

In [29]:
for i, sentence in enumerate(sentences):
    print(f"Sentence: '{sentence}'")
    print(f"One-hot encoded vectors:\n{encoded_sentences[i]}\n")

Sentence: 'The quick brown fox jumped over the lazy dog.'
One-hot encoded vectors:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Sentence: 'She sells seashells by the seashore.'
One-hot encoded vectors:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.