## TF-IDF to improve the count vectorizer

TF-IDF is popular for document retrieval and text mining.

#### What was wrong with the count vectorizer?
- We don't want stopwords, as they're not useful for NLP
- But how do we know our stopword list is correct? We don't: stopwords can be document-specific

### TF-IDF scales down words based on how many documents they appear in:
Term Frequency - Inverse Document Frequency ≈ Term Frequency / Document Frequency (this is the intuition, not the exact formula)

I worked through the formula in detail in my iPad notes.
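
For reference, here is one common way to write the weighting. It matches the `np.log(N / document_freq)` computation used in the from-scratch section of this notebook; other variants exist. The symbols are assumptions of this sketch: $N$ is the number of documents, $N(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{N(t)}
$$

A term that appears in every document has $N(t) = N$, so its IDF is $\log 1 = 0$ and it is scaled away entirely.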

In [1]:
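from sklearn.feature_extraction.text import TfidfVectorizer

# input_data and test_data are placeholders for collections of raw text documents;
# they are not defined in this notebook, hence the NameError below
tfidf = TfidfVectorizer()
Xtrain = tfidf.fit_transform(input_data)  # learn vocabulary and IDF weights, then transform
Xtest = tfidf.transform(test_data)        # reuse the fitted vocabulary and IDF weights

# note: arguments exist for stop_words, tokenizer, strip_accents, etc.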

NameError: name 'input_data' is not defined
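
The cell above fails only because `input_data` and `test_data` were never defined. As a minimal sketch, the same calls run on a tiny made-up corpus; `train_docs` and `test_docs` are hypothetical stand-ins, not data from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for input_data / test_data
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
test_docs = ["the cat chased the stock markets"]

tfidf = TfidfVectorizer()
Xtrain = tfidf.fit_transform(train_docs)  # learns the vocabulary and IDF weights
Xtest = tfidf.transform(test_docs)        # reuses them; words unseen in training are ignored

# Both results are sparse matrices with one row per document
# and one column per vocabulary word
print(Xtrain.shape, Xtest.shape)
```

Note that `fit_transform` is called only on the training documents, so the test set never influences the vocabulary or the IDF weights.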

### Term Frequency variations
- Binary (1 if word appears in doc, 0 if not)
- Normalized count (sometimes this is the default): divide each count by the total number of terms in the document
- Log-scaled: take log(1 + count(t, d)) to reduce the effect of extreme values
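
The three variations can be sketched in NumPy on a small made-up count matrix. The array `counts` and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical raw counts: 2 documents x 4 vocabulary terms
counts = np.array([
    [3, 0, 1, 0],
    [0, 10, 0, 2],
], dtype=float)

# Binary: 1 if the term appears in the document, 0 otherwise
tf_binary = (counts > 0).astype(float)

# Normalized: each count divided by the total number of terms in its document
tf_normalized = counts / counts.sum(axis=1, keepdims=True)

# Log-scaled: log(1 + count), which dampens very large counts
tf_log = np.log1p(counts)

print(tf_binary)
print(tf_normalized)
print(tf_log)
```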

### Inverse Document Frequency variations
- Smooth IDF: log(N / (N(t) + 1)) + 1, which avoids zero values (both in the denominator and in the resulting weight)
- IDF max: instead of N, use the maximum term count from the same document; that is, the numerator is the count of the most frequent word in that document rather than the total number of documents
- Probabilistic IDF: log((N - N(t)) / N(t)), which is the log odds, also known as the logit
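
As a rough sketch, the plain, smooth, and probabilistic forms can be compared on made-up document frequencies. `N` and `doc_freq` are hypothetical values; IDF max is left out because it depends on per-document information.

```python
import numpy as np

N = 4                           # hypothetical number of documents
doc_freq = np.array([1, 2, 3])  # hypothetical N(t) for three terms

# Plain IDF, as used later in the from-scratch section
idf_plain = np.log(N / doc_freq)

# Smooth IDF, following the formula in the list above
idf_smooth = np.log(N / (doc_freq + 1)) + 1

# Probabilistic IDF: log odds of a document NOT containing the term
idf_prob = np.log((N - doc_freq) / doc_freq)

print(idf_plain)
print(idf_smooth)
print(idf_prob)
```

The probabilistic form goes negative once a term appears in more than half of the documents, and it is undefined when N(t) = N, so the example avoids that case.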

### Another variation: normalizing the entire TF-IDF vector
Take the vector and divide it by its L2 norm (its length); this makes the length of every TF-IDF vector 1.
- Advantage: ranking by Euclidean distance and by cosine distance yields the same result
- Unlike CountVectorizer, TfidfVectorizer in sklearn supports this type of normalization

In [None]:
TfidfVectorizer(norm="l2")  # L2 normalization
# "l2" is the default, so if you don't pass anything, each TF-IDF vector is normalized to unit length
# alternatively, you can pass norm="l1"
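
To see why unit-length vectors make the two rankings agree, here is a small sketch with random vectors standing in for TF-IDF rows. `X`, `q`, and `l2_normalize` are hypothetical names. For unit vectors, the squared Euclidean distance equals 2 × (1 - cosine similarity), so sorting by either gives the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 4))  # hypothetical document vectors (5 docs, 4 terms)
q = rng.random(4)       # hypothetical query vector

def l2_normalize(a):
    """Divide each vector by its L2 norm so it has unit length."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

Xn = l2_normalize(X)
qn = l2_normalize(q)

euclid = np.linalg.norm(Xn - qn, axis=1)  # Euclidean distance to the query
cosine_dist = 1.0 - Xn @ qn               # cosine distance (vectors are unit length)

rank_euclid = np.argsort(euclid)
rank_cosine = np.argsort(cosine_dist)
print(rank_euclid, rank_cosine)
```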

## Build TF-IDF from scratch

Here we'll build TF-IDF from scratch,

then take random docs from the dataset and check their TF-IDF values.

The terms with a high TF-IDF score should have two traits:
- they should appear a lot in the doc
- they should be rare across the other documents

In [2]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File 'bbc_text_cls.csv' already there; not retrieving.



In [3]:
import pandas as pd
import numpy as np
import nltk

from nltk import word_tokenize

In [4]:
df = pd.read_csv('bbc_text_cls.csv')
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [45]:
#word2idx mapping
#convert docs into sequences of indices

idx = 0
word2idx = {}
tokenized_docs = []

for doc in df["text"]:
    words = word_tokenize(doc.lower())
    doc_as_int = []
    for word in words:
        if word not in word2idx:
            word2idx[word] = idx #add a key-value pair to the dictionary
            idx += 1
            
        doc_as_int.append(word2idx[word])
    # finished this document; move on to the next one
    tokenized_docs.append(doc_as_int)


In [6]:
# create a reverse mapping as well
# it iterates through our word2idx dictionary and simply swaps each key and value
idx2word = {v:k for k, v in word2idx.items()}

# yet this is somewhat inefficient because we're using a dictionary structure, even though the indices are
# just the integers from zero up to the vocab size

# you may find a smarter way and store it in a list instead

In [7]:
idx2word_smarter = list(word2idx)  # dicts preserve insertion order, so position j holds the word with index j
idx2word_smarter

# I found this one myself (4:44)

['ad',
 'sales',
 'boost',
 'time',
 'warner',
 'profit',
 'quarterly',
 'profits',
 'at',
 'us',
 'media',
 'giant',
 'timewarner',
 'jumped',
 '76',
 '%',
 'to',
 '$',
 '1.13bn',
 '(',
 '£600m',
 ')',
 'for',
 'the',
 'three',
 'months',
 'december',
 ',',
 'from',
 '639m',
 'year-earlier',
 '.',
 'firm',
 'which',
 'is',
 'now',
 'one',
 'of',
 'biggest',
 'investors',
 'in',
 'google',
 'benefited',
 'high-speed',
 'internet',
 'connections',
 'and',
 'higher',
 'advert',
 'said',
 'fourth',
 'quarter',
 'rose',
 '2',
 '11.1bn',
 '10.9bn',
 'its',
 'were',
 'buoyed',
 'by',
 'one-off',
 'gains',
 'offset',
 'a',
 'dip',
 'bros',
 'less',
 'users',
 'aol',
 'on',
 'friday',
 'that',
 'it',
 'owns',
 '8',
 'search-engine',
 'but',
 'own',
 'business',
 'had',
 'has',
 'mixed',
 'fortunes',
 'lost',
 '464,000',
 'subscribers',
 'lower',
 'than',
 'preceding',
 'quarters',
 'however',
 'company',
 "'s",
 'underlying',
 'before',
 'exceptional',
 'items',
 'back',
 'stronger',
 'adverti

In [8]:
#number of documents
N = len(df["text"])

#number of words 
V = len(word2idx)

In [9]:
# instantiate the term-frequency matrix
# note: same as using CountVectorizer

# although a sparse matrix would be more efficient, this example uses a dense matrix
tf = np.zeros((N,V))

In [10]:
enumerate(tokenized_docs)  # returns a lazy iterator that yields (index, document) pairs

<enumerate at 0x7f9e2195b980>

In [11]:
# populate term-frequency counts:
# tokenized docs are not vectors; each is just a document translated into a list of index numbers, one per word
for i, doc_as_int in enumerate(tokenized_docs):
    for j in doc_as_int:  # j is the index of a word; e.g. if the document is (3, 4, 9), the first word is 3
        # so first the cell for word 3 in document i goes up by 1; if word 3 appears again in the same
        # doc it goes up again, and in this way row i (doc i) ends up holding the frequency of every word
        tf[i, j] += 1
        # then we repeat this for every document


In [12]:
# compute IDF: we'll look at how many documents each word appears in

# remember we'll have an IDF value for each word, so V IDF values: a vector of size V
document_freq = np.sum(tf > 0, axis=0)  # this is an array of size V

# tf > 0 is an elementwise boolean comparison: for each word (column) it is True in the documents where
# tf is nonzero, so summing the Trues gives the number of documents containing that word.
# The sum runs down the rows because of axis=0: it collapses the document axis, leaving one value per column.
idf = np.log(N / document_freq)
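
To check the `axis=0` behaviour concretely, here is a sketch on a tiny made-up matrix. `tf_small` and the other names are hypothetical; rows are documents and columns are words, as in `tf`.

```python
import numpy as np

# Hypothetical term-frequency matrix: 4 documents (rows) x 3 words (columns)
tf_small = np.array([
    [2, 0, 1],
    [0, 0, 3],
    [1, 0, 0],
    [4, 1, 0],
])
N_small = tf_small.shape[0]

# axis=0 collapses the rows: one value per column (word) = number of docs containing it
df_small = np.sum(tf_small > 0, axis=0)

# axis=1 collapses the columns instead: one value per row (doc) = number of distinct words in it
distinct_words_per_doc = np.sum(tf_small > 0, axis=1)

idf_small = np.log(N_small / df_small)
print(df_small, distinct_words_per_doc, idf_small)
```

Every column here has at least one nonzero entry. A word that appeared in no document would give a document frequency of 0 and a division-by-zero warning in the IDF step.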

In [13]:
# compute TF-IDF
tf_idf = tf * idf  # broadcasting: each column j of tf is scaled by idf[j]

## these warnings do not appear in the instructor's run

In [44]:
# pick a random document and show its top 5 terms by TF-IDF score
np.random.seed(123)
i = np.random.choice(N)
row = df.iloc[i]

print("Label:", row['labels'])
print("Text:", row['text'].split("\n", 1)[0])  # print only the first line of the text, not the entire text
# recall each news article has multiple lines
# for simplicity we split the text on the newline character: "\n"
print("Top 5 terms:")

scores = tf_idf[i]  # the TF-IDF values of every word in the row for document i
indices = (-scores).argsort()  # word indices sorted from the highest TF-IDF value down to the lowest

for j in indices[:5]:  # indices of the 5 words with the highest TF-IDF
    print(idx2word[j])  # look up the word (value) stored under that index (key)

Label: sport
Text: Athens memories soar above lows
Top 5 terms:
paula
athens
1500m
her
kelly


In [15]:
## let's see whether it also works with my idx2word_smarter
for j in indices[:5]: 
    print(idx2word_smarter[j]) 
    
##YESSS!!!

paula
athens
1500m
her
kelly


## Exercise: use CountVectorizer to form the counts instead

## Exercise (hard): use Scipy's csr_matrix instead
## You cannot use X[i, j] += 1 here
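
One possible approach to the sparse exercise, offered as a sketch rather than the intended solution: collect (row, column) pairs and let `csr_matrix` sum the duplicates, which replaces the `X[i, j] += 1` loop. `docs_as_int` and `V_small` are hypothetical stand-ins for `tokenized_docs` and `V`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical tokenized documents, each a list of word indices
docs_as_int = [[0, 1, 1, 2], [2, 3], [0, 0, 0, 3]]
V_small = 4
N_small = len(docs_as_int)

# One (row, col) pair per token occurrence
rows, cols = [], []
for i, doc in enumerate(docs_as_int):
    for j in doc:
        rows.append(i)
        cols.append(j)
data = np.ones(len(rows))

# Duplicate (row, col) entries are summed, which yields the term counts
tf_sparse = csr_matrix((data, (rows, cols)), shape=(N_small, V_small))

# Document frequency: number of documents with a nonzero count per word
doc_freq = np.asarray((tf_sparse > 0).sum(axis=0)).ravel()
idf = np.log(N_small / doc_freq)

# Scale each column by its IDF while staying sparse
tf_idf_sparse = tf_sparse.multiply(idf).tocsr()
print(tf_idf_sparse.toarray())
```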

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
vectorizer = CountVectorizer()


In [36]:
type(tokenized_docs)

list

In [46]:
tokenized_docs = pd.Series(tokenized_docs)
tokenized_docs = pd.DataFrame(tokenized_docs)
tokenized_docs.head()

Unnamed: 0,0
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
1,"[235, 61, 69, 236, 237, 23, 235, 80, 238, 56, ..."
2,"[400, 401, 402, 403, 404, 405, 23, 406, 37, 40..."
3,"[510, 511, 327, 238, 512, 92, 7, 513, 514, 80,..."
4,"[662, 663, 664, 665, 666, 657, 40, 667, 668, 4..."


In [53]:
tokenized_docs[0].apply(lambda x: x.as_type("str"))

AttributeError: 'list' object has no attribute 'as_type'

In [42]:
tf_cv = vectorizer.fit_transform(tokenized_docs)
tf_cv

TypeError: expected string or bytes-like object

In [None]:
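## I couldn't get this to work: CountVectorizer expects raw text strings, not lists of integer indices, hence the TypeError above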
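
A sketch of one way the CountVectorizer exercise could go, assuming the raw text is passed in instead of the index lists. The `texts` list is made up for illustration; in the notebook it would be `df["text"]`. CountVectorizer uses its own tokenizer, so the vocabulary will differ from the `word_tokenize` one built earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical raw documents standing in for df["text"]
texts = [
    "profits rose as sales rose",
    "the match ended in a draw",
    "sales fell after the match",
]

vectorizer = CountVectorizer()
tf_cv = vectorizer.fit_transform(texts)  # sparse (N, V) matrix of counts

# Same steps as the from-scratch version, on a dense copy for simplicity
tf_dense = tf_cv.toarray()
N_docs = tf_dense.shape[0]
document_freq = np.sum(tf_dense > 0, axis=0)
idf = np.log(N_docs / document_freq)
tf_idf = tf_dense * idf

# vocabulary_ maps word -> column index; invert it to look words up by index
idx2word = {j: w for w, j in vectorizer.vocabulary_.items()}
top_term = idx2word[int(np.argmax(tf_idf[0]))]
print(top_term)
```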