# <center> <font size = 24 color = 'steelblue'> <b>One Hot Encoding

# <a id= 'h0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Introduction](#h1)<br>
[2. Data acquisition and cleaning](#h2)<br>
[3. Vocabulary generation](#h3)<br>
[4. Creation of the one hot encoded matrix](#h4)<br>
[5. Display one hot encoded matrix](#h5)<br>

## <a id = 'h1'>
    
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">    
<font size = 4> 

**One-hot encoding is the most common and most basic way to convert a token into a vector.**<br>

<b>The process involves:</b>
  - Assigning a unique integer index to each word.
  - Converting this integer index, denoted `i`, into a binary vector of size N, where N is the vocabulary size.
  - This vector is all zeros except for the i-th entry, which is set to 1.
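
The three steps above can be sketched in plain Python. This is a minimal illustration, not the code used later in this notebook: `toy_vocab`, `word_to_index` and `one_hot` are hypothetical names introduced only for this example.

```python
# Toy vocabulary (hypothetical, for illustration only); N = 3
toy_vocab = ["breeze", "clear", "skies"]

# Step 1: assign a unique integer index to each word
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    # Steps 2 and 3: build a vector of N zeros, then set the i-th entry to 1
    vector = [0] * len(toy_vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("clear"))  # [0, 1, 0]
```

Each vector contains exactly one 1, so its length always equals the vocabulary size regardless of the word.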

In [None]:
import re
import pandas as pd
from string import punctuation

## <a id = 'h2'>
    
<font size = 10 color = 'midnightblue'> <b> Data acquisition and cleaning

<font size = 5 color = pwdrblue> <b>  Define the set of statements

In [None]:
dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding."
]

<font size = 5 color = pwdrblue> <b>  Remove punctuation

In [None]:
# Match any single punctuation character; re.escape keeps ], \ and ^ literal inside the class
pat = re.compile('[{}]'.format(re.escape(punctuation)))
# Lowercase each sentence and delete every punctuation character
new_dataset = [pat.sub('', s.lower()) for s in dataset]
new_dataset

[top](#h0)

## <a id = 'h3'>  
<font size = 10 color = 'midnightblue'> <b>  Create a set of unique words as vocabulary from documents.

In [None]:
# Collect the unique words and sort them so the column order is reproducible
vocab = sorted(set(' '.join(new_dataset).split()))
print(vocab)

In [None]:
len(vocab)

[top](#h0)

## <a id = 'h4'>    
<font size = 10 color = 'midnightblue'> <b> Creating one hot encoded matrix for each sentence.

In [None]:
d = {}
for i, sentence in enumerate(new_dataset):

    # getting the words of the sentence in dataset
    s = sentence.split()

    # creating a zero-filled integer df: one row per word, one column per vocabulary entry
    df = pd.DataFrame(0, index=s, columns=vocab)

    # assign 1 to the cells where the word in the statement (index of df) matches the column name
    for word in s:
        df.loc[word, word] = 1

    # creating a dictionary of these matrices
    d[f'sent{i}'] = df
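
As a cross-check on the loop above, the same kind of matrix can be built with `pd.get_dummies`. This is only a sketch of an alternative, not the method this notebook uses: `words`, `sample_vocab` and `encoded` are hypothetical names, and the extra vocabulary word `"breeze"` is added just to show how columns for words absent from the sentence are filled with 0.

```python
import pandas as pd

# Hypothetical sample sentence, already lowercased and stripped of punctuation
words = "the weather today is fantastic".split()

# Hypothetical vocabulary: the sentence words plus one word that does not occur in it
sample_vocab = sorted(set(words) | {"breeze"})

# One indicator column per distinct word, one row per word in the sentence
encoded = pd.get_dummies(pd.Series(words)).astype(int)

# Label the rows with the words and align the columns with the full vocabulary
encoded.index = words
encoded = encoded.reindex(columns=sample_vocab, fill_value=0)

print(encoded)
```

Reindexing against the full vocabulary keeps every sentence matrix the same width, which is what the explicit loop achieves by passing `columns = vocab`.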

[top](#h0)

## <a id = 'h5'> 
<font size = 10 color = 'midnightblue'> <b>  Displaying results for one of the sentences

In [None]:
print(f"\nThe one hot encoding for the sentence : \n \"{dataset[0]}\" is :\n")
d['sent0']