### What is TF-IDF?
TF stands for **Term Frequency**: the ratio of the number of times a particular word appears in a document to the total number of words in that document.

   Term Frequency(TF) = [number of times word appeared / total no of words in a document]
   
Term Frequency values range between 0 and 1. The more often a word occurs relative to the length of the document, the closer its value is to 1.
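
As a minimal sketch of the ratio above, the snippet below computes TF for a single sentence. The `term_frequency` helper and the simple regex tokenizer are assumptions made for illustration; they are not part of scikit-learn.

```python
import re
from collections import Counter

def term_frequency(document):
    """Return {word: count / total number of words} for one document."""
    # Illustrative tokenizer: lowercase, then keep runs of word characters.
    tokens = re.findall(r"\w+", document.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

doc = "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already"
tf = term_frequency(doc)
print(round(tf["pizza"], 3))   # "pizza" is 3 of the 11 words
print(round(tf["eating"], 3))  # "eating" is 2 of the 11 words
```

The values of all words in one document sum to 1, which is why a single TF value can never exceed 1.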

IDF stands for **Inverse Document Frequency**: the log of the ratio of the total number of documents (data points) in the dataset to the number of documents that contain the particular word.

   Inverse Document Frequency(IDF) = [log(Total number of documents / number of documents that contains the word)]
   
If a word occurs in many documents and is common across the whole dataset, the ratio approaches 1 and its IDF value approaches 0.
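
The IDF definition can be sketched in a few lines of plain Python. The three-document toy corpus and the `inverse_document_frequency` helper are hypothetical, and the natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math
import re

def inverse_document_frequency(word, documents):
    """log(total documents / documents that contain the word)."""
    containing = sum(
        1 for doc in documents
        if word in re.findall(r"\w+", doc.lower())
    )
    # Assumes the word occurs in at least one document.
    return math.log(len(documents) / containing)

toy_corpus = [
    "apple is announcing a new phone",
    "tesla is announcing a new car",
    "thor is eating pizza",
]
idf_is = inverse_document_frequency("is", toy_corpus)
idf_pizza = inverse_document_frequency("pizza", toy_corpus)
print(idf_is)     # word in every document: log(3/3)
print(idf_pizza)  # word in one document: log(3/1)
```

A word that appears in every document gets an IDF of 0, while a word that appears in only one document gets the largest value in the corpus.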

Finally:

   TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Corpus- Collection of documents
corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [2]:
v = TfidfVectorizer()


In [4]:
transformed_output = v.fit_transform(corpus)
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [5]:
dir(v)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_params',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_tfidf',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 

In [7]:
v.get_feature_names_out()

array(['already', 'am', 'amazon', 'and', 'announcing', 'apple', 'are',
       'ate', 'biryani', 'dot', 'eating', 'eco', 'google', 'grapes',
       'iphone', 'ironman', 'is', 'loki', 'microsoft', 'model', 'new',
       'pixel', 'pizza', 'surface', 'tesla', 'thor', 'tomorrow', 'you'],
      dtype=object)

In [13]:
all_feature_names = v.get_feature_names_out()

for word in all_feature_names:
    index = v.vocabulary_.get(word)
    score = v.idf_[index]
    print(f"The idf score for {word} is {score}")
    
    

The idf score for already is 2.386294361119891
The idf score for am is 2.386294361119891
The idf score for amazon is 2.386294361119891
The idf score for and is 2.386294361119891
The idf score for announcing is 1.2876820724517808
The idf score for apple is 2.386294361119891
The idf score for are is 2.386294361119891
The idf score for ate is 2.386294361119891
The idf score for biryani is 2.386294361119891
The idf score for dot is 2.386294361119891
The idf score for eating is 1.9808292530117262
The idf score for eco is 2.386294361119891
The idf score for google is 2.386294361119891
The idf score for grapes is 2.386294361119891
The idf score for iphone is 2.386294361119891
The idf score for ironman is 2.386294361119891
The idf score for is is 1.1335313926245225
The idf score for loki is 2.386294361119891
The idf score for microsoft is 2.386294361119891
The idf score for model is 2.386294361119891
The idf score for new is 1.2876820724517808
The idf score for pixel is 2.386294361119891
The i

In [9]:
v.vocabulary_.get('surface')

23

In [11]:
v.idf_[v.vocabulary_.get('surface')]

2.386294361119891
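
The value 2.386 is larger than the plain formula would give (log(7/1) is about 1.946). With its default settings (`smooth_idf=True`), scikit-learn's `TfidfVectorizer` computes IDF as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents that contain the term. The sketch below re-derives a few of the scores printed above in plain Python; the regex mirrors the vectorizer's default token pattern (words of two or more characters), and the helper name `smoothed_idf` is an assumption for illustration.

```python
import math
import re

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

# Lowercase words of 2+ characters, one set of distinct terms per document.
tokenized = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in corpus]

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed IDF variant."""
    n_docs = len(tokenized)
    doc_freq = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

for term in ["surface", "eating", "announcing", "is"]:
    print(term, smoothed_idf(term))
```

Because of the added 1, even a term that occurs in every document keeps a non-zero weight under this variant.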

In [14]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [15]:
transformed_output.toarray()[:2]

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
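
These rows can be reproduced by hand. With default settings, `TfidfVectorizer` uses the raw count of a term in the document as TF (not the ratio defined at the top), multiplies it by the smoothed IDF ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean (L2) length. The following plain-Python sketch applies those steps to the first document; the helper names are assumptions for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

def tokenize(text):
    # Lowercase words of 2+ characters, like the vectorizer's default pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(corpus)
# Number of documents in which each term appears at least once.
doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))

def tfidf_row(text):
    """Raw count * smoothed IDF for each term, then L2-normalize the row."""
    counts = Counter(tokenize(text))
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(corpus[0])
for term in ["already", "eating", "is", "pizza"]:
    print(term, round(row[term], 8))
```

The normalization explains why "pizza", which occurs three times in the first document, dominates that row, while the very common "is" receives the smallest non-zero weight.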

# Using ecommerce data 
Problem statement: given the description of a product sold on an e-commerce website, classify it into one of four categories.

In [28]:
import pandas as pd
df = pd.read_csv('Ecommerce_data.csv')
print(df.shape)

df

(24000, 2)


Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories
...,...,...
23995,Marvel Physics MCQ's for MHT - CET,Books
23996,Internet Download Manager | Lifetime License |...,Books
23997,Sadhubela's Handcrafted Iron Degchi Handi Pot ...,Household
23998,Audio-Technica AT-LP60 Automatic Belt Driven D...,Electronics


In [29]:
df.describe()

Unnamed: 0,Text,label
count,24000,23999
unique,13834,4
top,Diverse Men's Formal Shirt Diverse is a wester...,Household
freq,23,6000


In [33]:
df[df.label.isnull()]

Unnamed: 0,Text,label
4262,The Global War on Christians: Dispatches from ...,


In [34]:
# Drop the single row whose label is missing
df = df.dropna(subset=['label'])

In [35]:
df

Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories
...,...,...
23995,Marvel Physics MCQ's for MHT - CET,Books
23996,Internet Download Manager | Lifetime License |...,Books
23997,Sadhubela's Handcrafted Iron Degchi Handi Pot ...,Household
23998,Audio-Technica AT-LP60 Automatic Belt Driven D...,Electronics


In [36]:
df.describe()

Unnamed: 0,Text,label
count,23999,23999
unique,13833,4
top,Diverse Men's Formal Shirt Diverse is a wester...,Household
freq,23,6000


In [37]:
df.label.unique()

array(['Household', 'Electronics', 'Clothing & Accessories', 'Books'],
      dtype=object)

In [38]:
df.label.value_counts()

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     5999
Name: label, dtype: int64

In [39]:
# Map each label category to an integer code for the classifiers

df['label_num'] = df.label.map({
                               'Household':0,
                               'Electronics':1,
                               'Clothing & Accessories':2,
                               'Books':3
                               })

df

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,1
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,2
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,2
...,...,...,...
23995,Marvel Physics MCQ's for MHT - CET,Books,3
23996,Internet Download Manager | Lifetime License |...,Books,3
23997,Sadhubela's Handcrafted Iron Degchi Handi Pot ...,Household,0
23998,Audio-Technica AT-LP60 Automatic Belt Driven D...,Electronics,1


In [43]:
df.drop('label',axis=1,inplace=True)

In [44]:
df.head()

Unnamed: 0,Text,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,0
1,"Contrast living Wooden Decorative Box,Painted ...",0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,1
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,2
4,Indira Designer Women's Art Mysore Silk Saree ...,2


In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.Text,
    df.label_num,
    test_size=0.2,          # hold out 20% of the rows for testing
    random_state=1,
    stratify=df.label_num,  # keep the class proportions equal in both splits
)

In [46]:
X_train.shape

(19199,)

In [48]:
X_test.shape

(4800,)

In [50]:
y_test.value_counts()

0    1200
2    1200
1    1200
3    1200
Name: label_num, dtype: int64

In [51]:
y_train.value_counts()

1    4800
2    4800
0    4800
3    4799
Name: label_num, dtype: int64
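
The class counts above show the effect of `stratify=df.label_num`: each class contributes about 20% of its rows to the test set. As a rough illustration of the idea (a simplified stand-in, not scikit-learn's actual algorithm), the sketch below splits a list of labels class by class; `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Same class sizes as the dataset above: three classes of 6000 and one of 5999.
labels = [0] * 6000 + [1] * 6000 + [2] * 6000 + [3] * 5999
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Without stratification, a random split could leave one category under-represented in the test set, which would make the per-class scores harder to compare.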

In [52]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf =  Pipeline([
                ('tf_idf_Vectorizer',TfidfVectorizer()),
                ('Knn',KNeighborsClassifier())
                ])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1200
           1       0.97      0.97      0.97      1200
           2       0.98      0.99      0.98      1200
           3       0.98      0.95      0.96      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800
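
Each row of the report is derived from simple counts: precision is the share of predictions for a class that were correct, recall is the share of that class's true samples that were found, and the F1 score is their harmonic mean. A small sketch with made-up labels (not the e-commerce data) shows the arithmetic; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy data for illustration only
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 2, 0]
precision, recall, f1 = per_class_metrics(toy_true, toy_pred, label=0)
print(precision, recall, f1)
```

The `support` column is the number of true samples of each class in the test set, here 1200 per class.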



In [53]:
X_test[:5]

2892     Parasnath Stainless Steel Perforated Dustbin 7...
7914     Fila Men's Track Pants Fila 100 percent stretc...
9927     Innovier Professional Powerpoint Presenter Rem...
1222     Reliable Trends 300 TC Plain Stripe Cotton Kin...
23867    EP7 Star Wars Boba Fett 3D Wall DÃcor See your...
Name: Text, dtype: object

In [57]:
X_test[2892]

"Parasnath Stainless Steel Perforated Dustbin 7''X11''-6L+8''X13''+10L+10''X15''-18L (all in 1) Size:all in 1   Parasnath Perforated Bins Are Used As Waste Basket For Dry Waste Like Papers And Office Waste & Also Used As Laundry Hamper For Soiled Clothes The Perforations All Around Ensure The Clothes Shall Not Smell Bad. Available In Various Sizes, Can Be Used With Or Without Standard Garbage Bags. Common Rooms, Bathrooms Bedrooms, Soho, Under The Desk Etc Are Common Usage Areas Now A Day'S Steel Dustbins Are In Used In Every Home Because Of Its Long Lasting Life, Modern Looks.It Is Commonly Used In Offices, Living Areas And Inside The Rooms. It Is Made Of High Quality Stainless Steel And Beautifully Finished, Available In Various Sizes, Can Be Used With Or Without Standard Garbage Bags. Common Rooms, Bathrooms, Bedrooms, Soho, Under The Desk Etc Are Common Usage Areas."

In [54]:
y_test[:5]

2892     0
7914     2
9927     1
1222     0
23867    0
Name: label_num, dtype: int64

In [55]:
y_pred[:5]

array([0, 2, 1, 0, 0], dtype=int64)
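
To read such predictions as category names again, the integer mapping used earlier can be inverted. This is a small sketch: `label_map` repeats the dictionary passed to `df.label.map` above, while `label_names`, `predicted` and `decoded` are names introduced here for illustration.

```python
label_map = {
    'Household': 0,
    'Electronics': 1,
    'Clothing & Accessories': 2,
    'Books': 3,
}
# Invert the mapping: integer code -> category name
label_names = {code: name for name, code in label_map.items()}

predicted = [0, 2, 1, 0, 0]  # the first five predictions shown above
decoded = [label_names[code] for code in predicted]
print(decoded)
```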

In [58]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf =  Pipeline([
                ('tf_idf_Vectorizer',TfidfVectorizer()),
                ('naive bayes',MultinomialNB())
                ])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1200
           1       0.97      0.97      0.97      1200
           2       0.97      0.99      0.98      1200
           3       0.98      0.92      0.95      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [59]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf =  Pipeline([
                ('tf_idf_Vectorizer',TfidfVectorizer()),
                ('Random Forest',RandomForestClassifier())
                ])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.97      0.96      1200
           1       0.98      0.97      0.98      1200
           2       0.98      0.99      0.98      1200
           3       0.98      0.97      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



In [60]:
# Now train the model on the preprocessed text

In [69]:
# Preprocess the text column first
import spacy

nlp = spacy.load('en_core_web_sm')

def preprocess(text):
    """Drop stop words and punctuation, then join the lemmas of the remaining tokens."""
    filtered_text = []
    doc = nlp(text)
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_text.append(token.lemma_)
    return " ".join(filtered_text)

In [None]:
df['Processed_text']= df.Text.apply(preprocess)

In [65]:
df.head()

Unnamed: 0,Text,label_num,Processed_text
0,Urban Ladder Eisner Low Back Study-Office Comp...,0,Urban
1,"Contrast living Wooden Decorative Box,Painted ...",0,contrast
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,1,IO
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,2,ISAKAA
4,Indira Designer Women's Art Mysore Silk Saree ...,2,Indira


In [68]:
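# Note: each value below is a single token, whereas preprocess() as defined above
# returns all kept lemmas joined by spaces. This output appears to come from an
# earlier run, so the scores further down may not reflect the function shown.
df.Processed_text[:5]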

0       Urban
1    contrast
2          IO
3      ISAKAA
4      Indira
Name: Processed_text, dtype: object

In [66]:
X_train, X_test, y_train, y_test = train_test_split(
    df.Processed_text,
    df.label_num,
    test_size=0.2,
    random_state=1,
    stratify=df.label_num,
)

In [67]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf =  Pipeline([
                ('tf_idf_Vectorizer',TfidfVectorizer()),
                ('Random Forest',RandomForestClassifier())
                ])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.89      0.81      0.85      1200
           1       0.94      0.85      0.89      1200
           2       0.75      0.93      0.83      1200
           3       0.93      0.87      0.89      1200

    accuracy                           0.86      4800
   macro avg       0.87      0.86      0.87      4800
weighted avg       0.87      0.86      0.87      4800

