## Text Representation/Vectorization


Converting textual data to numbers is known as Text Vectorization (Feature Extraction). It is needed because models understand only numbers, not language.

### Techniques to Achieve Text Vectorization

1. One Hot Encoding

2. Bag Of Words

3. N-grams

4. TF-IDF

5. Custom Features

### Common Terms

1. Corpus (c): The collection of all the words in the dataset (all documents combined).

2. Vocabulary (v): The unique words in the corpus.

3. Document (d): An individual (single) record in the dataset.

4. Word (w): A single word used in a document.



### 1. One Hot Encoding

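* Convert every word of a document into a V-dimensional vector, where V is the vocabulary size. The vector has a 1 at the word's position in the vocabulary and 0 everywhere else.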

#### Example


D1:People watch campusx

D2:campusx watch campusx

D3:people write comment

D4:campusx write comment

Corpus: People watch campusx campusx watch campusx people write comment campusx write comment

Vocabulary: People watch campusx write comment (V=5)

OHE for D1 = [[1,0,0,0,0],[0,1,0,0,0],[0,0,1,0,0]]
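The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `one_hot` is made up for illustration, the text is lowercased so that "People" and "people" count as the same word, and words are split on whitespace.

```python
# Documents from the example above.
docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary in order of first appearance: people, watch, campusx, write, comment.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)


def one_hot(tokens, vocab):
    """Return one V-dimensional 0/1 vector per word of the document (hypothetical helper)."""
    return [[1 if token == term else 0 for term in vocab] for token in tokens]


d1 = one_hot(tokenized[0], vocab)
print(vocab)  # ['people', 'watch', 'campusx', 'write', 'comment']
print(d1)     # [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
```

Note that every document becomes a list of V-dimensional vectors, so documents with different numbers of words produce outputs of different sizes.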




### 2. Bag of Words

#### Example

```
          text                 output
D1:People watch campusx           1

D2:campusx watch campusx          1

D3:people write comment           0

D4:campusx write comment          0

Corpus: People watch campusx campusx watch campusx people write comment campusx write comment

Vocabulary:  People watch campusx  write  comment (V=5)

  D1           1      1     1        0       0

  D2           0      1     2        0       0

  D3           1      0     0        1       1
  
  D4           0      0     1        1       1

  ```

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'text': ['People watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output': [1, 1, 0, 0]})

In [3]:
df

index,text,output
0,People watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [5]:
bow = cv.fit_transform(df['text'])

In [6]:
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [7]:
print(bow[0].toarray())

[[1 0 1 1 0]]


In [8]:
cv.transform(["campusx watch and write comment of campusx"]).toarray()

array([[2, 1, 0, 1, 1]], dtype=int64)
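To see what the counts mean, the same vectors can be built by hand. The sketch below is an approximation of `CountVectorizer`, not its actual implementation: it lowercases and splits on whitespace, while the real default tokenizer also strips punctuation and drops single-character tokens. The helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Alphabetical vocabulary, which matches the column order CountVectorizer reported.
vocab = sorted({token for doc in docs for token in doc.lower().split()})


def bag_of_words(text, vocab):
    """Count each vocabulary word in the text; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


bow_rows = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)        # ['campusx', 'comment', 'people', 'watch', 'write']
print(bow_rows[0])  # [1, 0, 1, 1, 0]
print(bag_of_words("campusx watch and write comment of campusx", vocab))  # [2, 1, 0, 1, 1]
```

The last call shows why "and" and "of" vanish in cell 8: they are not in the vocabulary learned during `fit`, so they are simply skipped.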

### 3. N-grams

#### Example (Bag of Bi-grams)

```
          text                 output
D1:People watch campusx           1

D2:campusx watch campusx          1

D3:people write comment           0

D4:campusx write comment          0

Corpus: People watch campusx campusx watch campusx people write comment campusx write comment

Vocabulary:  People-watch watch-campusx  campusx-watch people-write  write-comment campusx-write (V=6)

  D1              1             1              0            0              0             0

  D2              0             1              1            0              0             0

  D3              0             0              0            1              1             0
  
  D4              0             0              0            0              1             1

  ```

In [9]:
cv = CountVectorizer(ngram_range=(2,2))

In [10]:
bow = cv.fit_transform(df['text'])

In [11]:
print(cv.vocabulary_)

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}


In [12]:
print(bow[0].toarray())

[[0 0 1 0 1 0]]
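The bi-gram vocabulary can also be derived by hand, which makes it clear where each column comes from. This is a minimal sketch assuming whitespace tokenization; the helper names `ngrams` and `bag_of_ngrams` are made up for illustration.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the list of n consecutive-word sequences in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Alphabetical bi-gram vocabulary, matching the indices in cv.vocabulary_.
vocab = sorted({gram for doc in docs for gram in ngrams(doc)})


def bag_of_ngrams(text, vocab, n=2):
    """Count each vocabulary n-gram in the text."""
    counts = Counter(ngrams(text, n))
    return [counts[gram] for gram in vocab]


print(vocab)
# ['campusx watch', 'campusx write', 'people watch', 'people write', 'watch campusx', 'write comment']
print(bag_of_ngrams(docs[0], vocab))  # [0, 0, 1, 0, 1, 0]
```

Because "campusx watch" and "watch campusx" are different bi-grams, this representation keeps some word-order information that plain Bag of Words loses.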


### 4. Tf-Idf

#### Term Frequency (TF)

TF tells how likely a particular term is to occur in a particular document.

```
         (Number of occurrences of term t in document d)
TF(t,d)= _______________________________________________
            (Total number of terms in the document d)
```

#### Example

```
          text                 output
D1:People watch campusx           1

D2:campusx watch campusx          1

D3:people write comment           0

D4:campusx write comment          0


Tf(people,D1)=1/3

```
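The TF formula translates directly into code. In this sketch the function name `term_frequency` is made up for illustration, the text is lowercased and split on whitespace, and `Fraction` is used only so the result prints as 1/3 instead of a rounded decimal.

```python
from fractions import Fraction


def term_frequency(term, document):
    """Occurrences of the term in the document divided by the document's total word count."""
    tokens = document.lower().split()
    return Fraction(tokens.count(term.lower()), len(tokens))


print(term_frequency("people", "People watch campusx"))    # 1/3
print(term_frequency("campusx", "campusx watch campusx"))  # 2/3
```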

#### Inverse Document Frequency (IDF)

```
                (Total number of documents in the corpus)
IDF(t) = log_e __________________________________________
                (Number of documents with term t in them)
```

#### Example

```
          text                 output
D1:People watch campusx           1

D2:campusx watch campusx          1

D3:people write comment           0

D4:campusx write comment          0


IDF(people)=log(4/2)

IDF(watch)=log(4/2)

IDF(campusx)=log(4/3)

IDF(write)=log(4/2)

IDF(comment)=log(4/2)

```
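The IDF values can be evaluated numerically with the natural logarithm used in the formula above. This is a minimal sketch; the function name `inverse_document_frequency` is made up for illustration, and it assumes the term occurs in at least one document (otherwise the division fails).

```python
import math

docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]


def inverse_document_frequency(term, docs):
    """Natural log of (number of documents / number of documents containing the term)."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    containing = sum(1 for tokens in tokenized if term.lower() in tokens)
    # Raises ZeroDivisionError if the term appears in no document.
    return math.log(len(docs) / containing)


for term in ["people", "watch", "campusx", "write", "comment"]:
    print(term, round(inverse_document_frequency(term, docs), 3))
# people 0.693
# watch 0.693
# campusx 0.288
# write 0.693
# comment 0.693
```

The rarer a word is across documents, the higher its IDF: "campusx" appears in three of the four documents, so it gets the lowest value.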

#### TF-IDF

#### Example

TF(people,D1)=1/3

TF(watch,D1)=1/3

TF(campusx,D1)=1/3

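IDF(people)=log(4/2)=0.693

IDF(watch)=log(4/2)=0.693

IDF(campusx)=log(4/3)=0.288

Here log is the natural logarithm, as in the IDF formula. TF-IDF(t,d) = TF(t,d) * IDF(t), so each cell below is the term's TF in that document multiplied by its IDF:

```
Vocabulary:    People         watch         campusx          write        comment (V=5)

  D1        1/3 * 0.693    1/3 * 0.693    1/3 * 0.288          0             0
  D2             0         1/3 * 0.693    2/3 * 0.288          0             0
  D3        1/3 * 0.693         0              0          1/3 * 0.693   1/3 * 0.693
  D4             0              0         1/3 * 0.288     1/3 * 0.693   1/3 * 0.693
```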
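The full TF-IDF table can be computed by multiplying TF and IDF for every (document, term) pair. The sketch below follows the plain textbook formulas from this section with the natural logarithm; the helper names `tf` and `idf` are made up for illustration, and the results are rounded to three decimals.

```python
import math

docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
vocab = ["people", "watch", "campusx", "write", "comment"]
tokenized = [doc.lower().split() for doc in docs]


def tf(term, tokens):
    """Share of the document's words that are this term."""
    return tokens.count(term) / len(tokens)


def idf(term, tokenized):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / containing)


table = [
    [round(tf(term, tokens) * idf(term, tokenized), 3) for term in vocab]
    for tokens in tokenized
]
for name, row in zip(["D1", "D2", "D3", "D4"], table):
    print(name, row)
# D1 [0.231, 0.231, 0.096, 0.0, 0.0]
# D2 [0.0, 0.231, 0.192, 0.0, 0.0]
# D3 [0.231, 0.0, 0.0, 0.231, 0.231]
# D4 [0.0, 0.0, 0.096, 0.231, 0.231]
```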

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [14]:
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [15]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
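The numbers printed by `TfidfVectorizer` differ from the hand calculation because, with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit length. The sketch below reproduces `idf_` and the first row of the matrix using only the `math` module; the function name `smoothed_idf` is made up for illustration.

```python
import math

n_docs = 4
# Number of documents containing each term, in the order of get_feature_names_out().
doc_freq = {"campusx": 3, "comment": 2, "people": 2, "watch": 2, "write": 2}


def smoothed_idf(df, n):
    """Smoothed IDF as used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


idf_values = {term: smoothed_idf(df, n_docs) for term, df in doc_freq.items()}
print([round(value, 8) for value in idf_values.values()])
# [1.22314355, 1.51082562, 1.51082562, 1.51082562, 1.51082562]

# D1 = "People watch campusx": raw count * idf for each term, then L2-normalize the row.
d1_counts = {"campusx": 1, "comment": 0, "people": 1, "watch": 1, "write": 0}
weights = [d1_counts[term] * idf_values[term] for term in doc_freq]
norm = math.sqrt(sum(w * w for w in weights))
d1_row = [w / norm for w in weights]
print([round(value, 4) for value in d1_row])
# [0.4968, 0.0, 0.6137, 0.6137, 0.0]
```

The `+ 1` terms ensure that a word appearing in every document still gets a non-zero weight, and the normalization keeps long documents from dominating short ones.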


### 5. Custom Features

You define features according to the project and dataset, and combine these custom features with the techniques above to get the required output.
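As an illustration, the sketch below appends three hand-crafted features to a Bag of Words vector. The features themselves (word count, character count, starts-with-capital flag) are hypothetical examples chosen only to show the mechanics; which features are useful depends entirely on your dataset. The helper names `bow_vector` and `custom_features` are also made up.

```python
from collections import Counter

docs = [
    "People watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
vocab = sorted({token for doc in docs for token in doc.lower().split()})


def bow_vector(text):
    """Bag of Words counts over the alphabetical vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


def custom_features(text):
    """Hypothetical hand-crafted features; replace with ones that suit your data."""
    tokens = text.split()
    return [
        len(tokens),               # number of words
        len(text),                 # number of characters
        int(text[:1].isupper()),   # 1 if the text starts with a capital letter
    ]


# Final feature vector = Bag of Words columns followed by the custom columns.
features = [bow_vector(doc) + custom_features(doc) for doc in docs]
print(features[0])  # [1, 0, 1, 1, 0, 3, 20, 1]
```

When working with the sparse matrices returned by `CountVectorizer` or `TfidfVectorizer`, the same idea is usually expressed by stacking the extra columns with `scipy.sparse.hstack`.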