<h2 align='center'>NLP Tutorial - Text Representation: TF-IDF</h2>

### What is TF-IDF?

- TF stands for **Term Frequency**: the ratio of the number of times a particular word appears in a document to the total number of words in that document.

         Term Frequency(TF) = [number of times the word appears in a document / total number of words in the document]

- Term frequency ranges between 0 and 1. The larger the share of a document a word takes up, the closer its value is to 1.


- IDF stands for **Inverse Document Frequency**: the log of the ratio of the total number of documents (data points) in the whole dataset to the number of documents that contain the particular word.

         Inverse Document Frequency(IDF) = [log(Total number of documents / number of documents that contain the word)]

- The more documents a word occurs in, the smaller its IDF. For a word that occurs in every document, the ratio equals 1 and the IDF is log(1) = 0.


- Finally:
         
         TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)
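
As a minimal sketch of the three formulas above, the following pure-Python snippet computes TF, IDF and TF-IDF by hand. The two-document toy corpus and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are assumptions made for illustration; they are not part of any library.

```python
import math

# Hypothetical toy corpus, already tokenized (an assumption for illustration).
docs = [
    ["google", "is", "announcing", "new", "pixel"],
    ["thor", "is", "eating", "pizza"],
]

def term_frequency(word, doc):
    # number of times the word appears / total number of words in the document
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # log(total number of documents / number of documents containing the word)
    # Assumes the word occurs in at least one document.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

# "is" occurs in every document, so its IDF (and therefore its TF-IDF) is 0.
score_is = tf_idf("is", docs[0], docs)
# "pixel" occurs in only one document, so it receives a positive weight.
score_pixel = tf_idf("pixel", docs[0], docs)
print(score_is, score_pixel)
```

A word shared by every document contributes nothing to the representation, while a word specific to one document stands out.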

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone Google",
    "Tesla is announcing new model-3 Google",
    "Google is announcing new pixel-6 Google",
    "Microsoft is announcing new surface Google",
    "Amazon is announcing new eco-dot Tesla",
    "I am eating biryani and you are eating grapes"
]

In [25]:
# learn the vocabulary and IDF weights, then convert the corpus to a TF-IDF matrix
vector = TfidfVectorizer()
transform_output = vector.fit_transform(corpus)

In [26]:
dir(vector)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_params',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_tfidf',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 

In [27]:
print(vector.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'google': 12, 'tesla': 24, 'model': 19, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 26, 'are': 6, 'grapes': 13}
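
The vocabulary has no entries for `3`, `6` or `i`, and `model-3` shows up only as `model`. This follows from the vectorizer's default settings: text is lowercased and tokens are extracted with the regular expression `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. A rough sketch of that step with the standard `re` module follows. The helper name `simple_tokenize` is hypothetical, and the snippet imitates the default behaviour rather than calling scikit-learn.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of at least two word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    # Lowercase first, as the vectorizer does by default, then extract tokens.
    return TOKEN_PATTERN.findall(text.lower())

tokens_tesla = simple_tokenize("Tesla is announcing new model-3 Google")
tokens_biryani = simple_tokenize("I am eating biryani and you are eating grapes")
print(tokens_tesla)
print(tokens_biryani)
```

The indices in `vocabulary_` are the positions of the unique tokens in alphabetical order, which is why `already` maps to 0 and `you` to 26.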


In [28]:
# print the IDF score of every term in the vocabulary
all_features_name = vector.get_feature_names_out()

for word in all_features_name:
    index = vector.vocabulary_.get(word) # column index of the term in the vocabulary
    idf_score = vector.idf_[index] # learned IDF weight for that column
    print(f"{word} : {idf_score}")

already : 2.386294361119891
am : 2.386294361119891
amazon : 2.386294361119891
and : 2.386294361119891
announcing : 1.2876820724517808
apple : 2.386294361119891
are : 2.386294361119891
ate : 2.386294361119891
biryani : 2.386294361119891
dot : 2.386294361119891
eating : 1.9808292530117262
eco : 2.386294361119891
google : 1.4700036292457357
grapes : 2.386294361119891
iphone : 2.386294361119891
ironman : 2.386294361119891
is : 1.1335313926245225
loki : 2.386294361119891
microsoft : 2.386294361119891
model : 2.386294361119891
new : 1.2876820724517808
pixel : 2.386294361119891
pizza : 2.386294361119891
surface : 2.386294361119891
tesla : 1.9808292530117262
thor : 2.386294361119891
you : 2.386294361119891
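
These values do not match the plain `log(N / df)` formula from the introduction: a word that occurs in one of the seven documents would get log(7) ≈ 1.95, not 2.386. With its default setting `smooth_idf=True`, `TfidfVectorizer` computes `ln((1 + N) / (1 + df)) + 1` instead, so no term ends up with an IDF of zero. The sketch below reproduces a few of the printed scores by hand. The function name `smoothed_idf` is hypothetical, and the document counts are read off the corpus above.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # scikit-learn's default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

N_DOCS = 7  # number of documents in the corpus

idf_pizza = smoothed_idf(N_DOCS, 1)   # "pizza" appears in 1 document
idf_eating = smoothed_idf(N_DOCS, 2)  # "eating" appears in 2 documents
idf_google = smoothed_idf(N_DOCS, 4)  # "google" appears in 4 documents
idf_is = smoothed_idf(N_DOCS, 6)      # "is" appears in 6 documents

print(idf_pizza, idf_eating, idf_google, idf_is)
```

The ordering matches the intuition from the definition: the rarer the word, the higher its IDF.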


In [30]:
print(transform_output.toarray())

[[0.24266547 0.         0.         0.         0.         0.
  0.         0.24266547 0.         0.         0.40286636 0.
  0.         0.         0.         0.24266547 0.11527033 0.24266547
  0.         0.         0.         0.         0.72799642 0.
  0.         0.24266547 0.        ]
 [0.         0.         0.         0.         0.30224568 0.56011275
  0.         0.         0.         0.         0.         0.
  0.34504032 0.         0.56011275 0.         0.26606332 0.
  0.         0.         0.30224568 0.         0.         0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.31816313 0.
  0.         0.         0.         0.         0.         0.
  0.36321151 0.         0.         0.         0.28007526 0.
  0.         0.58961051 0.31816313 0.         0.         0.
  0.48942736 0.         0.        ]
 [0.         0.         0.         0.         0.29588843 0.
  0.         0.         0.         0.         0.         0.
  0.67556592 0.         0.         0
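
Two further defaults explain the numbers in this matrix. `TfidfVectorizer` uses the raw count of a term as its TF (not the ratio from the introduction), and it scales every row to unit Euclidean length (`norm='l2'`). The following sketch rebuilds some entries of the first row under those assumptions. The helper `smoothed_idf` and the variable names are hypothetical, and the document frequencies are read off the corpus by hand.

```python
import math
from collections import Counter

def smoothed_idf(n_docs, doc_freq):
    # scikit-learn's default IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Tokens of the first document after lowercasing and default tokenization.
tokens = ["thor", "eating", "pizza", "loki", "is", "eating", "pizza",
          "ironman", "ate", "pizza", "already"]

# Document frequencies in the 7-document corpus;
# every word not listed here occurs in exactly 1 document.
doc_freq = {"eating": 2, "is": 6}

counts = Counter(tokens)  # raw term counts serve as TF
raw = {w: c * smoothed_idf(7, doc_freq.get(w, 1)) for w, c in counts.items()}

# Divide by the Euclidean (L2) norm so the row has unit length.
norm = math.sqrt(sum(v * v for v in raw.values()))
row = {w: v / norm for w, v in raw.items()}

print(round(row["pizza"], 8), round(row["eating"], 8), round(row["is"], 8))
```

`pizza` gets the largest weight in the row because it is both frequent in this document (three occurrences) and absent from all the others.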

### Problem Statement: Given the description of a product sold on an e-commerce website, classify it into one of 4 categories

Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification


- This data consists of two columns.

| Text | Label | 
| --- | --- |
| Indira Designer Women's Art Mysore Silk Saree With Blouse Piece (Star-Red) This Saree Is Of Art Mysore Silk & Comes With Blouse Piece. | Clothing & Accessories | 
|IO Crest SY-PCI40010 PCI RAID Host Controller Card Brings new life to any old desktop PC. Connects up to 4 SATA II high speed SATA hard disk drives. Supports Windows 8 and Server 2012|Electronics|
|Operating Systems in Depth About the Author Professor Doeppner is an associate professor of computer science at Brown University. His research interests include mobile computing in education, mobile and ubiquitous computing, operating systems and distribution systems, parallel computing, and security.|Books|

- ***Text***: Description of an item sold on an e-commerce website.
- ***Label***: Category of that item. There are 4 categories in total: "Electronics", "Household", "Books" and "Clothing & Accessories", which together cover almost 80% of any e-commerce website.


In [31]:
import pandas as pd

df = pd.read_csv("../../data/Ecommerce_data.csv")
print(df.shape)
df.head(5)

(24000, 2)


|  | Text | label |
| --- | --- | --- |
| 0 | Urban Ladder Eisner Low Back Study-Office Comp... | Household |
| 1 | Contrast living Wooden Decorative Box,Painted ... | Household |
| 2 | IO Crest SY-PCI40010 PCI RAID Host Controller ... | Electronics |
| 3 | ISAKAA Baby Socks from Just Born to 8 Years- P... | Clothing & Accessories |
| 4 | Indira Designer Women's Art Mysore Silk Saree ... | Clothing & Accessories |


In [32]:
df['label'].value_counts() # label distribution

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: label, dtype: int64

- From the above, we can see that every label (class) occurs exactly 6000 times, so the dataset is perfectly balanced. There is no class imbalance and hence no need to apply balancing techniques such as undersampling or oversampling.

In [35]:
df['label_num'] = df['label'].map({
    'Household' : 0,
    'Books' : 1,
    'Electronics' : 2,
    'Clothing & Accessories' : 3
})

df.head(25)

|  | Text | label | label_num |
| --- | --- | --- | --- |
| 0 | Urban Ladder Eisner Low Back Study-Office Comp... | Household | 0 |
| 1 | Contrast living Wooden Decorative Box,Painted ... | Household | 0 |
| 2 | IO Crest SY-PCI40010 PCI RAID Host Controller ... | Electronics | 2 |
| 3 | ISAKAA Baby Socks from Just Born to 8 Years- P... | Clothing & Accessories | 3 |
| 4 | Indira Designer Women's Art Mysore Silk Saree ... | Clothing & Accessories | 3 |
| 5 | Selfie: How We Became So Self-Obsessed and Wha... | Books | 1 |
| 6 | Quantum QHM8810 Keyboard with Mouse (Black) Ul... | Electronics | 2 |
| 7 | Y&S Uv Protected Non Polarized Wayfarer Boy's ... | Clothing & Accessories | 3 |
| 8 | HP external USB DVD Drive DVDRW DVD-ROM A2U56A... | Electronics | 2 |
| 9 | Fujifilm Instax Mini Monochrome Film (10 Sheet... | Books | 1 |

<h3>Train test split</h3>

- Build a model on the original text (no preprocessing).

In [42]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.Text,
    df.label_num,
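    test_size=0.27, # hold out 27% of the rows for testing
    random_state=2022, # fixed seed so the split is reproducible
    stratify=df.label_num # keep the class proportions equal in both splits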
)

In [43]:
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)

X_train:  (17520,)
X_test:  (6480,)


In [44]:
X_train.head()

22572                                            Ramayana 
8012     COOFIT 12 Pairs Kid's Socks Warm Anti-slip Cre...
14526    ARUBA Women's Lace Bra and Panty ARUBA equisit...
15334    Varshine Happy Home Laurel Fan Heater || Heat ...
15802    KROSSSTITCH Solid Denim Men Jacket Featuring b...
Name: Text, dtype: object

In [45]:
y_train.value_counts()

1    4380
3    4380
0    4380
2    4380
Name: label_num, dtype: int64

In [46]:
y_test.value_counts()

0    1620
1    1620
2    1620
3    1620
Name: label_num, dtype: int64

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the e-commerce data.

**Note:**
- Use TF-IDF to preprocess the text.
- Use **KNN** as the classifier.
- Print the classification report.

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1620
           1       0.97      0.96      0.96      1620
           2       0.97      0.97      0.97      1620
           3       0.97      0.98      0.98      1620

    accuracy                           0.97      6480
   macro avg       0.97      0.97      0.97      6480
weighted avg       0.97      0.97      0.97      6480
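
To make the columns of the report concrete, here is a small sketch of how precision, recall and F1 are obtained for a single class. The labels are made up for illustration and have nothing to do with the e-commerce data, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    # Count true positives, false positives and false negatives for one class.
    # Assumes the class occurs at least once in both y_true and y_pred.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical labels, made up for illustration only.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2]

p, r, f = precision_recall_f1(y_true_toy, y_pred_toy, positive=1)
print(p, r, f)
```

In the report, `support` is the number of test samples that truly belong to each class, which is 1620 per class here because of the stratified split.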



In [54]:
X_test[:5]

12337    Creative Farmer Himalayan Cypress Elegant and ...
11246    Nilkamal Leo Computer Table (Beech) Nilkamal L...
10691    Harappa - Curse of the Blood River Review “Har...
7438     Plextone Wired Gaming Earphone with Detachable...
2413     Wonderland Gardening Mat / Mats With Artificia...
Name: Text, dtype: object

In [55]:
y_test[:5]

12337    0
11246    0
10691    1
7438     2
2413     0
Name: label_num, dtype: int64

In [56]:
y_pred[:5]

array([0, 0, 1, 2, 0])
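
The predictions are the numeric codes assigned in the `label_num` mapping. A minimal sketch for turning them back into category names by inverting that mapping (the names `label_to_num`, `num_to_label` and `y_pred_sample` are illustrative):

```python
# Mapping used earlier to encode the labels.
label_to_num = {
    "Household": 0,
    "Books": 1,
    "Electronics": 2,
    "Clothing & Accessories": 3,
}

# Invert it so numeric predictions can be read as category names.
num_to_label = {num: label for label, num in label_to_num.items()}

y_pred_sample = [0, 0, 1, 2, 0]  # the first five predictions shown above
decoded = [num_to_label[n] for n in y_pred_sample]
print(decoded)
```

Read against `X_test[:5]`, the decoded names are plausible: the computer table and gardening mat come out as Household, the novel as Books and the gaming earphone as Electronics.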

**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the e-commerce data.

**Note:**
- Use TF-IDF to preprocess the text.
- Use **MultinomialNB** as the classifier.
- Print the classification report.


In [57]:
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('multi', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1620
           1       0.98      0.93      0.96      1620
           2       0.97      0.97      0.97      1620
           3       0.98      0.99      0.98      1620

    accuracy                           0.96      6480
   macro avg       0.96      0.96      0.96      6480
weighted avg       0.96      0.96      0.96      6480



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the e-commerce data.

**Note:**
- Use TF-IDF to preprocess the text.
- Use **Random Forest** as the classifier.
- Print the classification report.
