### What is TF-IDF?

- TF stands for **Term Frequency**: the number of times a particular word appears in a document divided by the total number of words in that document.

         Term Frequency(TF) = [number of times the word appears in the document / total number of words in the document]

- Term frequency ranges between 0 and 1. The more often a word occurs in the document, the closer its value gets to 1.


- IDF stands for **Inverse Document Frequency**: the logarithm of the ratio of the total number of documents (data points) in the whole dataset to the number of documents that contain the particular word.

         Inverse Document Frequency(IDF) = [log(total number of documents / number of documents that contain the word)]

- If a word occurs in many documents, its IDF is low. As the word becomes common across all documents, the ratio approaches 1 and the IDF approaches log(1) = 0.


- Finally:
         
         TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)
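
The three formulas above can be tried out by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenization, the natural logarithm, and a made-up three-document corpus. The names `docs`, `tf`, `idf`, and `tf_idf` are illustrative and do not come from any library.

```python
import math

# Hypothetical toy corpus, already lowercased; whitespace tokenization is an assumption.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sang",
]
tokenized = [d.split() for d in docs]


def tf(word, tokens):
    """Term frequency: occurrences of the word / total words in the document."""
    return tokens.count(word) / len(tokens)


def idf(word, corpus):
    """Inverse document frequency: log(total docs / docs containing the word).

    Assumes the word occurs in at least one document of the corpus.
    """
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(word, tokens, corpus):
    """TF-IDF is the product of the two quantities above."""
    return tf(word, tokens) * idf(word, corpus)


# Score three words of the first document against the whole corpus.
for word in ["the", "sat", "cat"]:
    print(
        f"{word}: tf={tf(word, tokenized[0]):.3f} "
        f"idf={idf(word, tokenized):.3f} "
        f"tf-idf={tf_idf(word, tokenized[0], tokenized):.3f}"
    )
```

In this toy corpus "the" occurs in every document, so its IDF and therefore its TF-IDF is 0, while "cat" occurs only in the first document and gets the highest score of the three.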

### TF-IDF in plain terms

Simply put, TF-IDF is a method for measuring how important a particular word is in a given text relative to the other words in a collection of texts.

Let's simplify it further:

- **TF**, or "term frequency", is the ratio of the number of times a given word appears in one text to the total number of words in that text. For example, if a word appears 5 times in a text of 100 words, the ratio is 5/100.

- **IDF**, or "inverse document frequency", tells us whether the word is common across all the texts or not. If a word appears in every text, it is not very important, because everyone uses it. If it appears in only a few texts, it is more important.

- Finally, **TF-IDF** combines the two. We multiply **TF** by **IDF** to get a value that shows how important the word is in our text.

So if a word is repeated often in a given text but does not appear in all the texts, it is important. If the word shows up in all the texts, its value is lower.

In short, it is a way to find out which words matter in this text compared with the rest of the texts.


| **Condition**                                                        | **TF** (Word Frequency in the Document) | **IDF** (Word Rarity Across Documents) | **TF-IDF** (TF × IDF) | **Word Importance** |
|----------------------------------------------------------------------|-----------------------------------------|----------------------------------------|-----------------------|---------------------|
| The word appears frequently in one document and rarely in others     | High                                    | High                                   | High                  | Very important      |
| The word appears infrequently in one document and rarely in others   | Low                                     | High                                   | Medium                | Somewhat important  |
| The word appears frequently in one document and in many others       | High                                    | Low                                    | Low                   | Not important       |
| The word appears infrequently in one document and in many others     | Low                                     | Low                                    | Very low              | Not important       |


In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [2]:
# Learn the vocabulary and IDF weights, then build the TF-IDF document-term matrix
v = TfidfVectorizer()
transformed = v.fit_transform(corpus)

In [19]:
transformed.toarray()

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.2698

In [3]:
v.get_feature_names_out()

array(['already', 'am', 'amazon', 'and', 'announcing', 'apple', 'are',
       'ate', 'biryani', 'dot', 'eating', 'eco', 'google', 'grapes',
       'iphone', 'ironman', 'is', 'loki', 'microsoft', 'model', 'new',
       'pixel', 'pizza', 'surface', 'tesla', 'thor', 'tomorrow', 'you'],
      dtype=object)

In [5]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [32]:
# Print the learned IDF weight of every word in the vocabulary
for word, index in v.vocabulary_.items():
    idf_score = v.idf_[index]
    print(f'{word} : {idf_score}')

thor : 2.386294361119891
eating : 1.9808292530117262
pizza : 2.386294361119891
loki : 2.386294361119891
is : 1.1335313926245225
ironman : 2.386294361119891
ate : 2.386294361119891
already : 2.386294361119891
apple : 2.386294361119891
announcing : 1.2876820724517808
new : 1.2876820724517808
iphone : 2.386294361119891
tomorrow : 1.2876820724517808
tesla : 2.386294361119891
model : 2.386294361119891
google : 2.386294361119891
pixel : 2.386294361119891
microsoft : 2.386294361119891
surface : 2.386294361119891
amazon : 2.386294361119891
eco : 2.386294361119891
dot : 2.386294361119891
am : 2.386294361119891
biryani : 2.386294361119891
and : 2.386294361119891
you : 2.386294361119891
are : 2.386294361119891
grapes : 2.386294361119891
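
These values are not what the plain log(N / df) formula would give: a word found in only one of the 7 documents would score ln(7) ≈ 1.95, not 2.386. They match the smoothed variant that scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`), namely idf = ln((1 + N) / (1 + df)) + 1. The sketch below recomputes a few of them in plain Python. The regex tokenizer is an approximation of the vectorizer's default token pattern (lowercased tokens of two or more word characters), and the helper names `tokenize` and `smoothed_idf` are illustrative.

```python
import math
import re

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]


def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


n_docs = len(corpus)
# One set of distinct tokens per document, used to count document frequency
doc_sets = [set(tokenize(doc)) for doc in corpus]


def smoothed_idf(word):
    # df = number of documents that contain the word
    df = sum(1 for tokens in doc_sets if word in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


for word in ["thor", "eating", "is", "announcing"]:
    print(f"{word} : {smoothed_idf(word):.6f}")
```

For these four words the sketch gives 2.386294, 1.980829, 1.133531 and 1.287682, in line with the output above. The added 1 means that even a word present in every document keeps a non-zero weight.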


In [35]:
print(transformed.toarray()[0])

[0.24266547 0.         0.         0.         0.         0.
 0.         0.24266547 0.         0.         0.40286636 0.
 0.         0.         0.         0.24266547 0.11527033 0.24266547
 0.         0.         0.         0.         0.72799642 0.
 0.         0.24266547 0.         0.        ]
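
This first row can be reproduced by hand, which shows what the vectorizer does beyond TF × IDF. With its default settings, `TfidfVectorizer` uses the raw count of a term in the document as TF (not the ratio), multiplies it by the smoothed IDF ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length (`norm='l2'`). The following is a sketch in plain Python under those assumptions; the regex tokenizer approximates the default token pattern, and `tokenize` and `tfidf_row` are illustrative names.

```python
import math
import re
from collections import Counter

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]


def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf_row(tokens):
    counts = Counter(tokens)  # raw term counts serve as TF here
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))  # L2 norm of the row
    return {word: value / norm for word, value in raw.items()}


row0 = tfidf_row(tokenized[0])
for word in sorted(row0):
    print(f"{word} : {row0[word]:.8f}")
```

"pizza", which occurs three times in the first document and nowhere else, ends up with the largest weight (about 0.728), while "is", which occurs in six of the seven documents, gets the smallest (about 0.115).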
