
In [None]:
### Word Embedding

# TF and IDF
# TF - Term Frequency
# IDF - Inverse Document Frequency

# It is a statistical measure that evaluates how important a word is to a document in a collection or corpus

In [None]:
# Term Frequency (TF) - Measures how frequently a term appears in a document
# TF(t, d) = Number of times term t appears in document d / Total number of terms in document d

# Inverse Document Frequency (IDF) - Measures how rare, and therefore how informative, a term is across the entire corpus (all documents)
# IDF(t) = log(N / df(t))
# N = Total number of documents in the corpus
# df(t) = Number of documents that contain term t
# Some implementations add 1 to the denominator, log(N / (1 + df(t))), to avoid division by zero

# TF-IDF Score
# TF-IDF(t, d) = TF(t, d) * IDF(t) = TF(t, d) * log(N / df(t))

# Why is this used?
#   - Down-weights common words like "the", "and", "is"
#   - Up-weights rare but informative words like "machine", "neural", "quantum"

# Example:
# D1 = "the cat sat on the mat"
# D2 = "the dog sat on the log"
# D3 = "dogs and cart are great"

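# The word "the" appears 4 times (twice each in D1 and D2)
# - high TF, but it occurs in 2 of the 3 documents, so low IDF == lower TF-IDF than its raw count suggests
# The word "great" appears 1 time (only in D3)
# - low TF, but high IDF == relatively high TF-IDF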
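
The formulas above can be checked by hand on D1 to D3. The following is a minimal sketch in plain Python, assuming whitespace tokenization and the unsmoothed IDF(t) = log(N / df(t)) with the natural logarithm; the helper names `tf`, `idf` and `tf_idf` are illustrative and not part of any library.

```python
import math
from collections import Counter

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
    "D3": "dogs and cart are great",
}
# Assumption: split on whitespace; no stemming or stop-word removal
tokens = {name: text.split() for name, text in docs.items()}
N = len(docs)  # total number of documents


def tf(term, name):
    """Term count in the document divided by the document length."""
    words = tokens[name]
    return Counter(words)[term] / len(words)


def idf(term):
    """log(N / df(t)); assumes the term occurs in at least one document."""
    df = sum(term in words for words in tokens.values())
    return math.log(N / df)


def tf_idf(term, name):
    return tf(term, name) * idf(term)


for term, name in [("the", "D1"), ("cat", "D1"), ("great", "D3")]:
    print(f"{term!r} in {name}: tf={tf(term, name):.3f} "
          f"idf={idf(term):.3f} tf-idf={tf_idf(term, name):.3f}")
```

Under these assumptions "great" scores highest of the three (about 0.220), "cat" comes next (about 0.183) and "the" is lowest (about 0.135), even though "the" has the largest term frequency.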


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cart are great"
]

In [None]:
vectorizer = TfidfVectorizer()
tfidfmatrix = vectorizer.fit_transform(docs)
print(tfidfmatrix.toarray())

[[0.         0.         0.         0.42755362 0.         0.
  0.         0.         0.42755362 0.32516555 0.32516555 0.6503311 ]
 [0.         0.         0.         0.         0.42755362 0.
  0.         0.42755362 0.         0.32516555 0.32516555 0.6503311 ]
 [0.4472136  0.4472136  0.4472136  0.         0.         0.4472136
  0.4472136  0.         0.         0.         0.         0.        ]]
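
The numbers in this matrix do not come from the plain log(N / df(t)) formula. With its default settings, `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + df(t))) + 1, and then scales each row to unit L2 length. The sketch below reproduces the matrix rows in plain Python under those assumptions; it also assumes whitespace splitting matches the vectorizer's tokenizer for these three sentences (all words are lowercase and longer than one character). The helper names are illustrative.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cart are great",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
# Columns are the vocabulary in alphabetical order
vocab = sorted({word for doc in tokenized for word in doc})


def smooth_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[term] * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


row0 = dict(zip(vocab, tfidf_row(tokenized[0])))
print({word: round(score, 4) for word, score in row0.items() if score > 0})
```

Because the smoothed IDF never drops below 1, a word that occurs twice in a short document, such as "the" in the first sentence, can still receive the largest weight in its row.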


In [None]:
df = pd.DataFrame(tfidfmatrix.toarray())  # pass columns=vectorizer.get_feature_names_out() to label the columns with words
print(df)

         0         1         2         3         4         5         6   \
0  0.000000  0.000000  0.000000  0.427554  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.427554  0.000000  0.000000   
2  0.447214  0.447214  0.447214  0.000000  0.000000  0.447214  0.447214   

         7         8         9         10        11  
0  0.000000  0.427554  0.325166  0.325166  0.650331  
1  0.427554  0.000000  0.325166  0.325166  0.650331  
2  0.000000  0.000000  0.000000  0.000000  0.000000  


In [None]:
# Rows = Documents
# Cols = Words
# Values = TFIDF Score
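
To read a row of the unlabeled matrix, each column index has to be mapped back to its word. The sketch below does this in plain Python for the first document, assuming the columns follow the alphabetically sorted vocabulary (in the notebook, `vectorizer.get_feature_names_out()` returns that list). The scores are copied from the printed matrix, and the variable names are illustrative.

```python
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cart are great",
]
# Assumption: column order is the sorted vocabulary of the corpus
feature_names = sorted(set(" ".join(docs).split()))

# First row of the printed TF-IDF matrix (document 0)
row0 = [0.0, 0.0, 0.0, 0.42755362, 0.0, 0.0,
        0.0, 0.0, 0.42755362, 0.32516555, 0.32516555, 0.6503311]

# Keep only the words that actually occur in document 0
nonzero = {word: score for word, score in zip(feature_names, row0) if score > 0}
top_word = max(nonzero, key=nonzero.get)

for word, score in sorted(nonzero.items(), key=lambda item: -item[1]):
    print(f"{word:<5} {score:.4f}")
print("Highest-weighted word:", top_word)
```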

In [None]:
# Another example

data = {'id': [1,2,3], 'text': ['I love Machine learning',
                                'Natural Language processing is fascinating ',
                                'words embedding represents words in vector']}
df = pd.DataFrame(data)
df

   id                                        text
0   1                     I love Machine learning
1   2  Natural Language processing is fascinating
2   3  words embedding represents words in vector


In [None]:
vectorizer = TfidfVectorizer()
X_tfidf_matrix = vectorizer.fit_transform(df['text'])
print(X_tfidf_matrix.toarray())

[[0.         0.         0.         0.         0.         0.57735027
  0.57735027 0.57735027 0.         0.         0.         0.
  0.        ]
 [0.         0.4472136  0.         0.4472136  0.4472136  0.
  0.         0.         0.4472136  0.4472136  0.         0.
  0.        ]
 [0.35355339 0.         0.35355339 0.         0.         0.
  0.         0.         0.         0.         0.35355339 0.35355339
  0.70710678]]


In [None]:
print(pd.DataFrame(X_tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out()))


   embedding  fascinating        in        is  language  learning     love  \
0   0.000000     0.000000  0.000000  0.000000  0.000000   0.57735  0.57735   
1   0.000000     0.447214  0.000000  0.447214  0.447214   0.00000  0.00000   
2   0.353553     0.000000  0.353553  0.000000  0.000000   0.00000  0.00000   

   machine   natural  processing  represents    vector     words  
0  0.57735  0.000000    0.000000    0.000000  0.000000  0.000000  
1  0.00000  0.447214    0.447214    0.000000  0.000000  0.000000  
2  0.00000  0.000000    0.000000    0.353553  0.353553  0.707107  


In [None]:
### Hugging Face Transformers - a widely used transformer library

## !pip install transformers datasets torch huggingface_hub

In [None]:
## Try to find the sentiment of a sentence  - positive or negative

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
results = classifier(['I love transformer', 'This is terrible'])
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998635053634644}, {'label': 'NEGATIVE', 'score': 0.9996459484100342}]
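
The warning in the output recommends naming the model and revision explicitly. A minimal sketch of that, reusing the model id and revision reported in the log, is shown below. `build_sentiment_classifier` and `summarize` are hypothetical helper names, and the import is kept inside the function so the formatting part runs even without `transformers` installed. The sample results are copied from the output above.

```python
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"  # revision reported in the pipeline's log message


def build_sentiment_classifier():
    """Create the sentiment pipeline with a pinned model and revision."""
    from transformers import pipeline  # lazy import; downloads the model on first use
    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


def summarize(texts, results):
    """Pair each input sentence with its predicted label and score."""
    return [f"{text!r}: {res['label']} ({res['score']:.2%})"
            for text, res in zip(texts, results)]


texts = ['I love transformer', 'This is terrible']
# Results copied from the notebook output; call build_sentiment_classifier()(texts) for live ones
sample_results = [{'label': 'POSITIVE', 'score': 0.9998635053634644},
                  {'label': 'NEGATIVE', 'score': 0.9996459484100342}]
lines = summarize(texts, sample_results)
print("\n".join(lines))
```

Pinning both values keeps the results reproducible if the default model for the task changes in a later library release.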


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Collect both inputs
sentence1 = input("Enter the first sentence: ")
sentence2 = input("Enter the second sentence: ")

# Pass them together in a single list
results = classifier([sentence1, sentence2])

print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Enter the first sentence: This is fun outside the house
Enter the second sentence: i am scared inside the dark room
[{'label': 'POSITIVE', 'score': 0.9997490048408508}, {'label': 'NEGATIVE', 'score': 0.9891675710678101}]


In [None]:
### NAMED ENTITY RECOGNITION

ner = pipeline('ner', aggregation_strategy='simple')  # replaces the deprecated grouped_entities=True
text = "Elon Musk founded SpaceX in 2024 and later bought Twitter in 2025."
result = ner(text)
print(result)
for i in result:
  print(i)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'entity_group': 'PER', 'score': np.float32(0.9985181), 'word': 'Elon Musk', 'start': 0, 'end': 9}, {'entity_group': 'ORG', 'score': np.float32(0.9992113), 'word': 'SpaceX', 'start': 18, 'end': 24}, {'entity_group': 'ORG', 'score': np.float32(0.99866354), 'word': 'Twitter', 'start': 50, 'end': 57}]
{'entity_group': 'PER', 'score': np.float32(0.9985181), 'word': 'Elon Musk', 'start': 0, 'end': 9}
{'entity_group': 'ORG', 'score': np.float32(0.9992113), 'word': 'SpaceX', 'start': 18, 'end': 24}
{'entity_group': 'ORG', 'score': np.float32(0.99866354), 'word': 'Twitter', 'start': 50, 'end': 57}
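
The pipeline returns one dictionary per entity, so a common next step is to group the words by entity type. The sketch below does that in plain Python on the results printed above (scores written as plain floats). `group_by_type` and its `min_score` cutoff are illustrative assumptions, not part of the library. The years in the sentence are not tagged because the CoNLL-03 label set used by this model covers persons, organizations, locations and miscellaneous names, but not dates.

```python
from collections import defaultdict

# Entities copied from the notebook output above
entities = [
    {'entity_group': 'PER', 'score': 0.9985181, 'word': 'Elon Musk', 'start': 0, 'end': 9},
    {'entity_group': 'ORG', 'score': 0.9992113, 'word': 'SpaceX', 'start': 18, 'end': 24},
    {'entity_group': 'ORG', 'score': 0.99866354, 'word': 'Twitter', 'start': 50, 'end': 57},
]


def group_by_type(entities, min_score=0.9):
    """Collect entity words per type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity['score'] >= min_score:
            grouped[entity['entity_group']].append(entity['word'])
    return dict(grouped)


grouped = group_by_type(entities)
for entity_type, words in grouped.items():
    print(f"{entity_type}: {', '.join(words)}")
```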


In [None]:
text = input("Enter the sentence: ")
result = ner(text)
#print(result)
for i in result:
  print(i)

In [None]:
## Fill in the blanks
## predict the missing word in the sequence

filled_data = pipeline('fill-mask', model='bert-base-uncased')
sent = "Transformer are a very ([MASK]) technology"
result = filled_data(sent)
print(result)
for i in result:
  print(i)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'score': 0.13967962563037872, 'token': 2047, 'token_str': 'new', 'sequence': 'transformer are a very ( new ) technology'}, {'score': 0.13906064629554749, 'token': 3935, 'token_str': 'advanced', 'sequence': 'transformer are a very ( advanced ) technology'}, {'score': 0.08328744769096375, 'token': 2715, 'token_str': 'modern', 'sequence': 'transformer are a very ( modern ) technology'}, {'score': 0.0361173115670681, 'token': 6450, 'token_str': 'expensive', 'sequence': 'transformer are a very ( expensive ) technology'}, {'score': 0.03445785865187645, 'token': 3522, 'token_str': 'recent', 'sequence': 'transformer are a very ( recent ) technology'}]
{'score': 0.13967962563037872, 'token': 2047, 'token_str': 'new', 'sequence': 'transformer are a very ( new ) technology'}
{'score': 0.13906064629554749, 'token': 3935, 'token_str': 'advanced', 'sequence': 'transformer are a very ( advanced ) technology'}
{'score': 0.08328744769096375, 'token': 2715, 'token_str': 'modern', 'sequence': 'transfor

In [None]:
sent = input("Enter the sentence: ")
result = filled_data(sent)
print(result)
for i in result:
  print(i)


Enter the sentence: India is a ([MASK]) Country.
[{'score': 0.05882960185408592, 'token': 2235, 'token_str': 'small', 'sequence': 'india is a ( small ) country.'}, {'score': 0.05203427001833916, 'token': 2489, 'token_str': 'free', 'sequence': 'india is a ( free ) country.'}, {'score': 0.04263685643672943, 'token': 5152, 'token_str': 'muslim', 'sequence': 'india is a ( muslim ) country.'}, {'score': 0.029311861842870712, 'token': 11074, 'token_str': 'sovereign', 'sequence': 'india is a ( sovereign ) country.'}, {'score': 0.026897192001342773, 'token': 4975, 'token_str': 'developing', 'sequence': 'india is a ( developing ) country.'}]
{'score': 0.05882960185408592, 'token': 2235, 'token_str': 'small', 'sequence': 'india is a ( small ) country.'}
{'score': 0.05203427001833916, 'token': 2489, 'token_str': 'free', 'sequence': 'india is a ( free ) country.'}
{'score': 0.04263685643672943, 'token': 5152, 'token_str': 'muslim', 'sequence': 'india is a ( muslim ) country.'}
{'score': 0.02931186
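
Each fill-mask score is the model's probability for that token over its whole vocabulary, so the five predictions shown need not add up to 1, and a low top score signals an uncertain guess. The sketch below summarizes the predictions for the "India" sentence in plain Python; `describe` and the `confident_at` threshold are illustrative assumptions.

```python
# Token strings and scores copied from the notebook output above
predictions = [
    {'score': 0.05882960185408592, 'token_str': 'small'},
    {'score': 0.05203427001833916, 'token_str': 'free'},
    {'score': 0.04263685643672943, 'token_str': 'muslim'},
    {'score': 0.029311861842870712, 'token_str': 'sovereign'},
    {'score': 0.026897192001342773, 'token_str': 'developing'},
]


def describe(predictions, confident_at=0.5):
    """Summarize the best guess and how much probability the list covers."""
    top = max(predictions, key=lambda p: p['score'])
    return {
        'best': top['token_str'],
        'best_score': top['score'],
        'confident': top['score'] >= confident_at,
        'mass_covered': sum(p['score'] for p in predictions),
    }


summary = describe(predictions)
print(summary)
```

Here the best guess carries under 6% probability and the five candidates together cover about 21%, so the model has no strong preference for this blank.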

In [None]:
def fill_in_the_blank_Chatbot():
  print("Fill in the blanks ChatBot")
  print("Type a sentence with ([MASK]) in it or type EXIT to quit")

  while True:
    user_input = input("Enter the sentence: ")
    if user_input.strip().upper() == "EXIT":
      print("Thanks for using this ChatBot! Goodbye!")
      break

    # Without a [MASK] token the pipeline raises an error, so ask again
    if "[MASK]" not in user_input:
      print("No [MASK] found in the sentence")
      continue

    try:
      result = filled_data(user_input)
      print("Predictions:")
      for i in result[:5]:
        print(i)
    except Exception as err:
      # Report the actual failure instead of hiding it behind a bare except
      print(f"Something went wrong: {err}")


In [None]:
fill_in_the_blank_Chatbot()

In [None]:
## ZERO SHOT classification

## Classify text into user defined Categories

clf = pipeline('zero-shot-classification')
text = "The stock market crashed due to infation and war"
labels = ['Finance', 'Technology', 'Sports', 'Health', 'Politics']
res = clf(text, labels)
print(res)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'sequence': 'The stock market crashed due to infation and war', 'labels': ['Finance', 'Health', 'Politics', 'Technology', 'Sports'], 'scores': [0.7598903179168701, 0.06476442515850067, 0.06253916025161743, 0.06204197555780411, 0.05076424032449722]}


In [None]:
text = input("Enter the sentence: ")
res = clf(text, labels)

In [None]:
# Print the sequence
print("Sequence:", res['sequence'])

# Print each label and its score line by line
print("Scores:")
for label, score in zip(res['labels'], res['scores']):
  print(f"- {label}: {score:.4f}")

Sequence: The stock market crashed due to infation and war
Scores:
- Finance: 0.7599
- Health: 0.0648
- Politics: 0.0625
- Technology: 0.0620
- Sports: 0.0508
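
By default the zero-shot scores are normalized across the candidate labels, so they sum to 1 and the labels compete with each other. (Passing `multi_label=True` in the pipeline call scores each label independently instead.) The sketch below checks this on the result above and adds a simple cutoff for accepting the top label; `pick_label` and `min_score` are illustrative assumptions.

```python
# Result copied from the notebook output above
res = {
    'sequence': 'The stock market crashed due to infation and war',
    'labels': ['Finance', 'Health', 'Politics', 'Technology', 'Sports'],
    'scores': [0.7598903179168701, 0.06476442515850067, 0.06253916025161743,
               0.06204197555780411, 0.05076424032449722],
}


def pick_label(res, min_score=0.5):
    """Return the best label, or None if the model is not confident enough."""
    label, score = max(zip(res['labels'], res['scores']), key=lambda pair: pair[1])
    return label if score >= min_score else None


print("Sum of scores:", round(sum(res['scores']), 4))
print("Accepted label:", pick_label(res))
print("With a stricter cutoff:", pick_label(res, min_score=0.9))
```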


In [None]:
## TRANSLATION MODELS FROM ENGLISH TO HINDI

mod = pipeline('text2text-generation', model = 'barghavani/English_to_Hindi')

Device set to use cpu


In [None]:
output = mod("The weather is very good today. I think i should make a tea with some pakodas")
print(output)

[{'generated_text': 'आज मौसम बहुत अच्छा है। मुझे लगता है कि मैं कुछ पैसे के साथ चाय बनाना चाहिए'}]


In [None]:
output = mod(input("Enter the sentence: "))
print(output)

Enter the sentence: I am playing cricket outside
[{'generated_text': 'मैं बाहर क्रिकेट बजा रहा हूँ'}]
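
The pipeline wraps each translation in a dictionary with a `generated_text` key, so getting plain strings back takes one more step. The sketch below is a small helper for that, assuming the translator is any callable that accepts a list of sentences. `translate_all` is a hypothetical name, and `fake_translator` is a stand-in used only so the example runs without downloading the model; in the notebook you would pass `mod` instead.

```python
def translate_all(translator, sentences):
    """Run the translator on a list of sentences and return plain strings."""
    outputs = translator(sentences)
    texts = []
    for item in outputs:
        # Some pipelines return a list of candidates per input; take the first
        if isinstance(item, list):
            item = item[0]
        texts.append(item['generated_text'])
    return texts


def fake_translator(sentences):
    """Hypothetical stand-in that mimics the pipeline's output shape."""
    return [{'generated_text': f"<translated> {s}"} for s in sentences]


translated = translate_all(fake_translator, ["I am playing cricket outside",
                                             "The weather is very good today"])
for line in translated:
    print(line)
```

Model output is worth reviewing before use: in the run above, "playing cricket" was rendered with the Hindi verb used for playing a musical instrument.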
