
# **TF-IDF for text representation**
TF-IDF stands for term frequency-inverse document frequency. It is a measure used in information retrieval (IR) and machine learning to quantify the importance or relevance of a string representation (a word, phrase, lemma, etc.) in a document relative to a collection of documents (also known as a corpus).

**Term Frequency:** The TF of a term is the number of times the term appears in a document divided by the total number of words in that document.

Term frequency values range between 0 and 1. The more often a word occurs in a document, the closer its value is to 1.
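As a minimal sketch of this definition, the snippet below computes term frequencies for one sentence from the corpus used later in this notebook. The simple `\w+` tokenizer and the variable names are assumptions made for illustration; they are not what `TfidfVectorizer` does internally.

```python
import re
from collections import Counter

doc = "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already"

# Assumption: lowercase the text and split on runs of word characters
tokens = re.findall(r"\w+", doc.lower())
counts = Counter(tokens)

# TF = occurrences of the term / total number of tokens in the document
tf = {term: n / len(tokens) for term, n in counts.items()}

for term in ("pizza", "eating", "thor"):
    print(f"{term}: {counts[term]}/{len(tokens)} = {tf[term]:.4f}")
```

Every value lies between 0 and 1, and the values for one document sum to 1.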

**Inverse Document Frequency:** The IDF of a term reflects how rare the term is across the corpus: it is based on the inverse of the proportion of documents that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

The more documents a word occurs in, the lower its IDF. For a word that is common across all documents, the logarithmic IDF approaches 0.
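One common textbook form of IDF is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below assumes that form; the function name `classic_idf` is hypothetical, and libraries often use smoothed variants, so their numbers can differ.

```python
import math

def classic_idf(n_docs: int, doc_freq: int) -> float:
    """Textbook IDF: log of (total documents / documents containing the term)."""
    return math.log(n_docs / doc_freq)

n_docs = 7  # size of the toy corpus used in this notebook
for doc_freq in (1, 2, 5, 7):
    print(f"term in {doc_freq} of {n_docs} docs -> idf = {classic_idf(n_docs, doc_freq):.4f}")
```

A term found in a single document gets the largest value, and a term found in every document gets exactly 0.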

The **TF-IDF** of a term is calculated by multiplying its TF and IDF scores.

The importance of a term is high when it occurs often in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity between documents, measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.

In [1]:
corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the corpus, then vectorize it
v = TfidfVectorizer()
v.fit(corpus)
transform_output = v.transform(corpus)

In [3]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [12]:
all_features = v.get_feature_names_out()
all_features

array(['already', 'am', 'amazon', 'and', 'announcing', 'apple', 'are',
       'ate', 'biryani', 'dot', 'eating', 'eco', 'google', 'grapes',
       'iphone', 'ironman', 'is', 'loki', 'microsoft', 'model', 'new',
       'pixel', 'pizza', 'surface', 'tesla', 'thor', 'tomorrow', 'you'],
      dtype=object)

In [11]:
for word in all_features:
  indx = v.vocabulary_.get(word)
  print(f"{word} {v.idf_[indx]}")

already 2.386294361119891
am 2.386294361119891
amazon 2.386294361119891
and 2.386294361119891
announcing 1.2876820724517808
apple 2.386294361119891
are 2.386294361119891
ate 2.386294361119891
biryani 2.386294361119891
dot 2.386294361119891
eating 1.9808292530117262
eco 2.386294361119891
google 2.386294361119891
grapes 2.386294361119891
iphone 2.386294361119891
ironman 2.386294361119891
is 1.1335313926245225
loki 2.386294361119891
microsoft 2.386294361119891
model 2.386294361119891
new 1.2876820724517808
pixel 2.386294361119891
pizza 2.386294361119891
surface 2.386294361119891
tesla 2.386294361119891
thor 2.386294361119891
tomorrow 1.2876820724517808
you 2.386294361119891
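These values are larger than a plain `log(N / df)` would give. They are consistent with the smoothed formula that scikit-learn documents for its default `smooth_idf=True` setting, `ln((1 + N) / (1 + df)) + 1`. The sketch below checks that formula against a few of the printed numbers; the function name `smoothed_idf` and the hand-counted document frequencies are assumptions for illustration.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7
# Document frequencies counted by hand from the 7-sentence corpus
doc_freq = {"is": 6, "announcing": 5, "eating": 2, "pizza": 1}

for term, freq in doc_freq.items():
    print(term, smoothed_idf(n_docs, freq))
```

With this smoothing, a term that appears in every document gets an IDF of 1 rather than 0, so it is down-weighted but never removed.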


In [13]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [17]:
transform_output.toarray()[:2]

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
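The first row can be reproduced by hand, which helps explain why its values do not match the simple TF ratio described earlier. The sketch below assumes scikit-learn's documented defaults: tokens of two or more word characters, raw counts as TF, smoothed IDF, and L2 normalization of each row. It uses only the standard library, and all helper names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

# Assumption: same token pattern as TfidfVectorizer's default (drops 1-character tokens)
docs = [re.findall(r"(?u)\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})

# Number of documents that contain each term
doc_freq = Counter(token for doc in docs for token in set(doc))
n_docs = len(docs)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count * smoothed IDF for each vocabulary term, then L2-normalize."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

row0 = tfidf_row(docs[0])
for term in ("already", "eating", "is", "pizza"):
    print(term, round(row0[vocab.index(term)], 8))
```

"pizza" gets the largest weight in the first sentence because it occurs three times there and in no other document.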

# **Problem Statement: Given a description of a product sold on an e-commerce website, classify it into one of 4 categories**
Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

This data consists of two columns.

**Text**: Description of an item sold on an e-commerce website

**Label**: Category of that item. There are 4 categories: "Electronics", "Household", "Books" and "Clothing & Accessories", which cover almost 80% of any e-commerce website.

In [92]:
import pandas as pd
import numpy as np

# The file has no header row, so the columns are initially named 0 and 1
df = pd.read_csv("ecommerceDataset.csv", engine="python", on_bad_lines="skip", header=None)
print(df.shape)
df.head()


(50425, 2)


|   | 0 | 1 |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


In [93]:
df.columns = ['label','text']

In [95]:
df.head()

|   | label | text |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-46b813618a66> in <cell line: 1>()
----> 1 df.label.value_counts()

/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5900         ):
   5901             return self[name]
-> 5902         return object.__getattribute__(self, name)
   5903 
   5904     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'label'

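The error above was raised when `df.label.value_counts()` was run before the column names were assigned. It was resolved by setting `df.columns = ['label','text']`.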

In [96]:
df.label.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: label, dtype: int64

In [98]:
df['label_num'] = df.label.map({
    'Household': 0,
    'Books ': 1,  # note the trailing space: a label stored as 'Books' will not match this key and maps to NaN
    'Electronics': 2,
    'Clothing & Accessories': 3
})
df.tail()

|   | label | text | label_num |
|---|---|---|---|
| 50420 | Electronics | Strontium MicroSD Class 10 8GB Memory Card (Bl... | 2.0 |
| 50421 | Electronics | CrossBeats Wave Waterproof Bluetooth Wireless ... | 2.0 |
| 50422 | Electronics | Karbonn Titanium Wind W4 (White) Karbonn Titan... | 2.0 |
| 50423 | Electronics | Samsung Guru FM Plus (SM-B110E/D, Black) Colou... | 2.0 |
| 50424 | Electronics | Micromax Canvas Win W121 (White) | 2.0 |


In [99]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df.text,
    df.label_num,
    test_size=0.2,
    random_state=2022
)

In [100]:
print("Shape of X_train: ", x_train.shape)
print("Shape of X_test: ", x_test.shape)


Shape of X_train:  (40340,)
Shape of X_test:  (10085,)


In [101]:
x_train.head()

48283    Set 40pc Car Boat Instrument Switch Panel Deca...
34456    Nayak Men's Kurta Nayak is a men's ethnic clot...
37810    ZEYO Women's Cotton Navy Blue & Pink Feeding N...
41866    MikroTik Wireless Access Point RB951Ui-2HnD mi...
21260    Home Gardeners’ Guide Indian Garden Flowers Ab...
Name: text, dtype: object

In [102]:
y_train.value_counts()

0.0    15493
2.0     8503
3.0     6969
Name: label_num, dtype: int64

In [103]:
y_test.value_counts()

0.0    3820
2.0    2118
3.0    1702
Name: label_num, dtype: int64
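Class 1 is missing from both counts, and the labels are floats rather than integers. A likely cause is the mapping key `'Books '`, which has a trailing space: if the stored label is `'Books'`, `Series.map` finds no match and returns NaN, which also forces the column to float. The train counts sum to 30,965 of 40,340 rows and the test counts to 7,640 of 10,085, so 11,820 rows are unaccounted for, exactly the number of Books rows. The sketch below reproduces the effect on a small made-up Series; the toy data and variable names are assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy labels (hypothetical), standing in for df.label
labels = pd.Series(["Household", "Books", "Electronics", "Clothing & Accessories", "Books"])

# Mapping with the trailing space in 'Books ', as in the cell above
buggy = labels.map({"Household": 0, "Books ": 1, "Electronics": 2, "Clothing & Accessories": 3})

# Mapping with the key spelled exactly like the label
fixed = labels.map({"Household": 0, "Books": 1, "Electronics": 2, "Clothing & Accessories": 3})

print("unmapped rows with 'Books ':", buggy.isna().sum(), "dtype:", buggy.dtype)
print("unmapped rows with 'Books': ", fixed.isna().sum(), "dtype:", fixed.dtype)
```

Checking `df.label_num.isna().sum()` right after the mapping is a cheap way to catch a mismatch like this before splitting the data.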

**Attempt 1:**

Using the sklearn pipeline module, create a classification pipeline to classify the e-commerce data.

Note:

- Use TF-IDF for pre-processing the text.
- Use KNN as the classifier.
- Print the classification report.

In [104]:
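# Replace NaN labels with 0.0. Caution: this relabels every unmapped row as
# class 0 (Household) and leaves y_test unchanged.
y_train = np.nan_to_num(y_train)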

In [105]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Create a pipeline object
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

# 2. Fit with x_train and y_train
clf.fit(x_train, y_train)

# 3. Get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

# 4. Print the classification report
print(classification_report(y_test, y_pred))

ValueError: ignored
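The error message is truncated here, so its cause cannot be confirmed from the output alone. One plausible explanation is that `y_test` still contains NaN labels, since only `y_train` was passed through `np.nan_to_num`. The sketch below shows the same TF-IDF + KNN pipeline on a tiny made-up dataset, with a label check before fitting. The helper `check_labels`, the toy texts, and `n_neighbors=1` (chosen only because the toy set is so small) are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

def check_labels(y, name):
    """Hypothetical helper: fail early with a readable message if any label is missing."""
    n_missing = int(pd.Series(y).isna().sum())
    if n_missing:
        raise ValueError(f"{name} contains {n_missing} missing label(s)")

# Toy data (made up): 2 = Electronics, 3 = Clothing & Accessories
toy_x_train = [
    "wireless bluetooth headphones with mic",
    "usb memory card 8gb",
    "android smartphone with dual camera",
    "cotton kurta for men",
    "navy blue cotton dress for women",
    "woollen winter scarf and gloves",
]
toy_y_train = [2, 2, 2, 3, 3, 3]
toy_x_test = ["bluetooth speaker", "cotton shirt"]
toy_y_test = [2, 3]

check_labels(toy_y_train, "y_train")
check_labels(toy_y_test, "y_test")

toy_clf = Pipeline([
    ("vectorizer_tfidf", TfidfVectorizer()),
    ("KNN", KNeighborsClassifier(n_neighbors=1)),
])
toy_clf.fit(toy_x_train, toy_y_train)
toy_pred = toy_clf.predict(toy_x_test)

toy_report = classification_report(toy_y_test, toy_pred)
print(toy_report)
```

Running the same check on the real `y_train` and `y_test` before calling `clf.fit` would show whether missing labels are present in either split.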