<a href="https://colab.research.google.com/github/sudeep-sp/GenAI/blob/main/GenAI_Data_Representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Representation

# Word Embeddings

1. Count- or frequency-based methods:

  * One-hot encoding
  * Bag of Words (BoW)
  * TF-IDF (Term Frequency - Inverse Document Frequency)
2. Deep-learning trained models:
  * Word2Vec
  * AvgWord2Vec
  * Transformers

## Key terms
1. **Corpus**: A corpus is a large and structured set of texts. It is a collection of written or spoken material used for linguistic analysis and the development of language models. Essentially, it's the body of text that you use to train and evaluate your NLP models.

    * Example: A collection of news articles, books, or social media posts can be considered a corpus.

2. **Vocabulary**: Vocabulary is the set of unique words present in a corpus. It represents all the distinct words used in your text data.

    * Example: If your corpus consists of the sentences "The cat sat on the mat" and "The dog chased the ball", your vocabulary would be {"the", "cat", "sat", "on", "mat", "dog", "chased", "ball"}.
3. **Document**: A document refers to a single unit of text within a corpus. It could be an article, a book chapter, a tweet, or any other self-contained piece of text.

    * Example: In a corpus of news articles, each individual article is considered a document.

4. **Word**: A word is the basic unit of language, representing a single meaningful element within a document or corpus. It is typically a sequence of characters separated by spaces or punctuation.

    * Example: In the sentence "The cat sat on the mat", the words are "The", "cat", "sat", "on", "the", and "mat".
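
To make the four terms concrete, here is a minimal sketch that builds the vocabulary from the two example sentences above. Lowercasing and splitting on whitespace are simplifying assumptions. A real tokenizer would also handle punctuation.

```python
# Toy corpus: the two sentences from the Vocabulary example.
corpus = ["The cat sat on the mat", "The dog chased the ball"]

# Each sentence is one document; split it into lowercase words.
documents = [doc.lower().split() for doc in corpus]

# The vocabulary is the set of unique words across all documents.
vocabulary = sorted({word for doc in documents for word in doc})

print(len(documents))  # number of documents in the corpus
print(vocabulary)      # unique words, sorted alphabetically
```

Sorting the vocabulary only makes the printed order reproducible. The set itself carries no order.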

## One Hot Encoding

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize


df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0]})


# Join all sentences into one string and tokenize it with NLTK's word_tokenize
words = word_tokenize(df['text'].str.cat(sep=' '))

# Get unique vocabulary
vocabulary = sorted(set(words))

# One-hot encode using sklearn's OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded_vectors = encoder.fit_transform(np.array(vocabulary).reshape(-1, 1))

# Print the one-hot encoded vectors
for word, encoding in zip(vocabulary, encoded_vectors):
    print(f"{word}: {encoding}")





campusx: [1. 0. 0. 0. 0.]
comment: [0. 1. 0. 0. 0.]
people: [0. 0. 1. 0. 0.]
watch: [0. 0. 0. 1. 0.]
write: [0. 0. 0. 0. 1.]


In [None]:
# Create a mapping of words to their one-hot encoding
word_to_onehot = {word: encoded_vectors[i].astype(int) for i, word in enumerate(vocabulary)}

# Assign one-hot encodings to each sentence in df['text']
sentence_encodings = []
for sentence in df['text']:
    # Tokenize the sentence
    tokenized_sentence = word_tokenize(sentence)
    # Get the one-hot encoding for each word in the sentence
    sentence_encoding = [word_to_onehot[word].tolist() for word in tokenized_sentence]
    sentence_encodings.append(sentence_encoding)

sentence_encodings = np.array(sentence_encodings)

print(sentence_encodings)


[[[0 0 1 0 0]
  [0 0 0 1 0]
  [1 0 0 0 0]]

 [[1 0 0 0 0]
  [0 0 0 1 0]
  [1 0 0 0 0]]

 [[0 0 1 0 0]
  [0 0 0 0 1]
  [0 1 0 0 0]]

 [[1 0 0 0 0]
  [0 0 0 0 1]
  [0 1 0 0 0]]]
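
The same encoding can be sketched without scikit-learn, which may help show what the encoder does: each word gets a vector of zeros with a single 1 at its index in the sorted vocabulary. The helper name `one_hot` is hypothetical and used only for illustration.

```python
# Sorted vocabulary, as produced in the cell above.
vocabulary = ["campusx", "comment", "people", "watch", "write"]

# Map each word to its position in the vocabulary.
word_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list of zeros with a 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_index[word]] = 1
    return vector

# Encode one sentence word by word (whitespace split is an assumption).
encoded = [one_hot(word) for word in "people watch campusx".split()]
print(encoded)
```

Each sentence becomes a matrix with one row per word, so the representation grows with both sentence length and vocabulary size.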


## Bag of Words

In [None]:
import numpy as np
import pandas as pd


In [None]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0]})

In [None]:
df

                    text  output
0   people watch campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [None]:
bow = cv.fit_transform(df['text'])

In [None]:
#vocabulary
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [None]:
bow.toarray()

array([[1, 0, 1, 1, 0],
       [2, 0, 0, 1, 0],
       [0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1]])

In [None]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]


In [None]:
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 1]])
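
The result above shows that words outside the fitted vocabulary ("and", "of") are silently ignored. A minimal sketch of the counting step, assuming plain whitespace tokenization rather than `CountVectorizer`'s own tokenizer, and using a hypothetical helper `bag_of_words`:

```python
from collections import Counter

# Word-to-column mapping copied from cv.vocabulary_ above.
vocabulary = {'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, column in vocabulary.items():
        vector[column] = counts[word]  # words not in the text count as 0
    return vector

print(bag_of_words('campusx watch and write comment of campusx'))
```

Word order is lost: any reordering of the same words gives the same vector.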

In [None]:
X = bow.toarray()

## N-grams

In [None]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0]})
df

                    text  output
0   people watch campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0


In [None]:
# Bigrams: every feature is a pair of adjacent words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))  # ngram_range=(3,3) would give trigrams

In [None]:
bow = cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}


In [None]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())

[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]
[[0 1 0 0 0 1]]
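
An n-gram is a run of n adjacent tokens. A minimal sketch of how they can be generated, assuming whitespace tokenization (the helper `ngrams` is hypothetical, not part of scikit-learn):

```python
def ngrams(text, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams('people watch campusx', 2))   # bigrams
print(ngrams('people watch campusx', 3))   # trigrams
```

A sentence with k tokens yields k - n + 1 n-grams, which is why each three-word sentence above has exactly two bigrams.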


## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
arr = tfidf.fit_transform(df['text']).toarray()

In [None]:
arr

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [None]:
print(tfidf.idf_)

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
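
The numbers above can be reproduced by hand. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each row to unit length. The sketch below follows that recipe in plain Python, with whitespace tokenization as a simplifying assumption.

```python
import math

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']
vocabulary = ['campusx', 'comment', 'people', 'watch', 'write']
n_docs = len(docs)

def smooth_idf(term):
    """Smoothed inverse document frequency for one term."""
    df = sum(term in doc.split() for doc in docs)  # documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = [smooth_idf(term) for term in vocabulary]

def tfidf_row(doc):
    """Term count times idf, scaled to unit L2 norm."""
    tokens = doc.split()
    raw = [tokens.count(term) * weight for term, weight in zip(vocabulary, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

print([round(value, 8) for value in idf])
print([round(value, 8) for value in tfidf_row(docs[0])])
```

"campusx" appears in three of the four documents, so it gets the lowest idf. The other words appear in two documents each and share the same higher weight.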


# Deep learning trained models for data representation

## Word2Vec

In [None]:
!pip install --upgrade gensim

In [None]:
import numpy as np
import pandas as pd
import gensim
import os



In [None]:
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
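story = []
for filename in os.listdir('data'):
  # Skip Jupyter's checkpoint directory; it is not a text file
  if filename == '.ipynb_checkpoints':
    continue
  # Read the whole file; the context manager closes it afterwards
  with open(os.path.join('data', filename), encoding='ISO-8859-1') as f:
    corpus = f.read()
  # Split into sentences, then lowercase and tokenize each sentence
  raw_sent = sent_tokenize(corpus)
  for sent in raw_sent:
    story.append(simple_preprocess(sent))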

In [55]:
story[0]

['george',
 'martin',
 'dance',
 'with',
 'dragons',
 'book',
 'five',
 'of',
 'song',
 'of',
 'ice',
 'and',
 'fire',
 'dedication',
 'this',
 'one',
 'is',
 'for',
 'my',
 'fans',
 'for',
 'lodey',
 'trebla',
 'stego',
 'pod',
 'caress',
 'yags',
 'ray',
 'and',
 'mr',
 'kate',
 'chataya',
 'mormont',
 'mich',
 'jamie',
 'vanessa',
 'ro',
 'for',
 'stubby',
 'louise',
 'agravaine',
 'wert',
 'malt',
 'jo',
 'mouse',
 'telisiane',
 'blackfyre',
 'bronn',
 'stone',
 'coyote',
 'daughter',
 'and',
 'the',
 'rest',
 'of',
 'the',
 'madmen',
 'and',
 'wild',
 'women',
 'of',
 'the',
 'brotherhood',
 'without',
 'banners',
 'for',
 'my',
 'website',
 'wizards',
 'elio',
 'and',
 'linda',
 'lords',
 'of',
 'westeros',
 'winter',
 'and',
 'fabio',
 'of',
 'wic',
 'and',
 'gibbs',
 'of',
 'dragonstone',
 'who',
 'started',
 'it',
 'all',
 'for',
 'men',
 'and',
 'women',
 'of',
 'asshai',
 'in',
 'spain',
 'who',
 'sang',
 'to',
 'us',
 'of',
 'bear',
 'and',
 'maiden',
 'fair',
 'and',
 'the

In [None]:
len(story)

145020

In [None]:
model = gensim.models.Word2Vec(
    window=30,
    min_count=2
)

In [None]:
model.build_vocab(story)

In [None]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(6569416, 8628190)

In [None]:
model.wv.most_similar('daenerys')

[('stormborn', 0.8439348340034485),
 ('queen', 0.7940853834152222),
 ('unburnt', 0.7900612950325012),
 ('targaryen', 0.7470002770423889),
 ('princess', 0.7194494605064392),
 ('myrcella', 0.7077713012695312),
 ('dragons', 0.6856666803359985),
 ('dorne', 0.6648232936859131),
 ('westeros', 0.6640669107437134),
 ('regent', 0.6609808802604675)]

In [None]:
model.wv.similarity('daenerys','khal')

0.56881285
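
The score returned by `similarity` is the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. A minimal sketch with hypothetical 3-dimensional vectors (the real model vectors have many more dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```

Because only the direction matters, a frequent word with a long vector and a rare word with a short one can still score as highly similar.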

In [None]:
vec = model.wv.get_normed_vectors()

In [None]:
vec

array([[-0.13393894,  0.04445564, -0.00023106, ..., -0.09327397,
        -0.04756232,  0.10555492],
       [-0.15555042, -0.00642802,  0.03565891, ..., -0.02302233,
        -0.04957606,  0.08374401],
       [ 0.06989704, -0.05860437, -0.08746112, ..., -0.09666179,
         0.00231893, -0.19625413],
       ...,
       [-0.02730977,  0.17165625,  0.10804832, ..., -0.09796965,
         0.08154844, -0.02875738],
       [-0.00957488,  0.1383937 ,  0.05561015, ..., -0.01604915,
        -0.08118843, -0.03888327],
       [-0.10223771,  0.10671241,  0.0742107 , ..., -0.00090365,
        -0.08661205,  0.00378238]], dtype=float32)
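
The outline at the top lists AvgWord2Vec, which this notebook does not demonstrate. The usual idea is to represent a whole document by the mean of its word vectors. The sketch below uses hypothetical 3-dimensional vectors. With the trained model, each vector would come from `model.wv[word]` instead. The helper `average_vector` is an illustrative name, not a gensim function.

```python
# Hypothetical word vectors, for illustration only.
word_vectors = {
    "people": [1.0, 0.0, 2.0],
    "watch": [0.0, 2.0, 2.0],
    "campusx": [2.0, 1.0, 2.0],
}

def average_vector(tokens, vectors):
    """Mean of the vectors of all tokens that have a vector."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None  # no token of the document is in the vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

doc_vector = average_vector("people watch campusx".split(), word_vectors)
print(doc_vector)
```

Every document maps to a vector of the same size regardless of its length, at the cost of discarding word order.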

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)

In [None]:
X = pca.fit_transform(model.wv.get_normed_vectors())

In [None]:
X

array([[ 0.08254063, -0.57584894,  0.14200252],
       [ 0.1127544 , -0.41166162,  0.00518483],
       [-0.37512743, -0.43235663, -0.19935152],
       ...,
       [ 0.07314593,  0.1188072 ,  0.07979396],
       [ 0.11143699,  0.05154723, -0.17087981],
       [-0.13657475,  0.3040172 ,  0.17077014]], dtype=float32)

In [None]:
y = model.wv.index_to_key

In [57]:
len(y)

17453

In [58]:
X.shape

(17453, 3)

In [59]:
import plotly.express as px
fig = px.scatter_3d(X[200:300], x=0, y=1,z=2, color=y[200:300])
fig.show()