# Assignment

<font face='georgia'>
    
   <h4><strong>What does tf-idf mean?</strong></h4>

   <p>    
Tf-idf stands for <em>term frequency-inverse document frequency</em>, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
</p>
    
   <p>
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
</p>
    
   <p>
Tf-idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
</p>
    
</font>
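
The simple ranking function mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `tf`, `idf`, and `rank` and the toy documents are hypothetical, tokens are obtained by splitting on whitespace, and the textbook idf `ln(N / df)` is used.

```python
import math

def tf(term, doc_tokens):
    # Normalized term frequency: occurrences of `term` divided by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # Textbook idf ln(N / df); a term that occurs in no document gets weight 0.
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / df) if df else 0.0

def rank(query, docs):
    # Score each document by the sum of tf-idf over the query terms, best first.
    tokenized = [doc.split() for doc in docs]
    scores = [
        sum(tf(term, tokens) * idf(term, tokenized) for term in query.split())
        for tokens in tokenized
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

toy_docs = ["the cat sat on the mat", "the dog chased the cat", "birds sing at dawn"]
ranking = rank("dog cat", toy_docs)
print(ranking)  # indices of toy_docs, most relevant first
```

The second document ranks first because it contains both query terms, and "dog" is rarer in the collection than "cat".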

<font face='georgia'>
    <h4><strong>How to Compute:</strong></h4>

Typically, the tf-idf weight is composed of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

 <ul>
    <li>
<strong>TF:</strong> Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization: <br>

$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$
</li>
<li>
<strong>IDF:</strong> Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: <br>

$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}}.$
For numerical stability, we will change this formula slightly by adding 1 to the denominator, so that it can never be zero:
$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}+1}.$
</li>
</ul>

<br>
<h4><strong>Example</strong></h4>
<p>

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, using a base-10 logarithm for readability, the inverse document frequency (i.e., idf) is log<sub>10</sub>(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the formulas above, the idf would be ln(10,000) ≈ 9.21 and the weight ≈ 0.28.)
</p>
</font>
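
The arithmetic of the example can be checked directly. The sketch below uses only the numbers from the example; the variable names are illustrative. It computes the idf with both the base-10 logarithm (which yields the value 4) and the natural logarithm that appears in the IDF formula.

```python
import math

tf_cat = 3 / 100                    # term frequency of "cat": 0.03
ratio = 10_000_000 / 1_000          # documents in corpus / documents containing "cat"

idf_log10 = math.log10(ratio)       # base-10 logarithm
idf_ln = math.log(ratio)            # natural logarithm, as in the IDF formula

weight_log10 = tf_cat * idf_log10   # tf-idf weight with the base-10 idf
weight_ln = tf_cat * idf_ln         # tf-idf weight with the natural-log idf

print(idf_log10, weight_log10)
print(round(idf_ln, 2), round(weight_ln, 2))
```

The choice of base only rescales every idf by the same constant factor, so it does not change the relative ordering of terms.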

## Task-1

<font face='georgia'>
    <h4><strong>1. Build a TFIDF Vectorizer & compare its results with Sklearn:</strong></h4>

<ul>
    <li> As a part of this task you will be implementing TFIDF vectorizer on a collection of text documents.</li>
    <br>
    <li> You should compare the results of your own implementation of the TFIDF vectorizer with those of sklearn's implementation of the TFIDF vectorizer.</li>
    <br>
    <li> Sklearn applies a few more tweaks in its version of the TFIDF vectorizer, so to replicate its exact results you need to add the following to your custom implementation of the tfidf vectorizer:
       <ol>
        <li> Sklearn's vocabulary (and therefore its list of idf values) is sorted in alphabetical order.</li>
        <li> Sklearn's idf formula differs from the standard textbook formula. Here the constant <strong>"1"</strong> is added to the numerator and denominator of the idf, as if an extra document containing every term in the collection exactly once had been seen, which prevents zero divisions. A further 1 is added to the logarithm, so that terms occurring in every document are not ignored entirely.
            
 $IDF(t) = 1+\log_{e}\frac{1\text{ }+\text{ Total  number of documents in collection}} {1+\text{Number of documents with term t in it}}.$
        </li>
        <li> Sklearn applies L2-normalization on its output matrix.</li>
        <li> The final output of sklearn tfidf vectorizer is a sparse matrix.</li>
    </ol>
    <br>
    <li>Steps to approach this task:
    <ol>
        <li> You would have to write both fit and transform methods for your custom implementation of tfidf vectorizer.</li>
        <li> Print out the alphabetically sorted vocab after you fit your data and check whether it is the same as the feature names from sklearn's tfidf vectorizer. </li>
        <li> Print out the idf values from your implementation and check whether they are the same as the idf values of sklearn's tfidf vectorizer. </li>
        <li> Once your vocab and idf values match those of sklearn's implementation of the tfidf vectorizer, proceed to the steps below. </li>
        <li> Make sure the output of your implementation is a sparse matrix. Before generating the final output, you need to normalize your sparse matrix using L2 normalization. You can refer to this link https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html </li>
        <li> After completing the above steps, print the output of your custom implementation and compare it with sklearn's implementation of the tfidf vectorizer.</li>
        <li> To check the output of a single document in your collection of documents,  you can convert the sparse matrix related only to that document into dense matrix and print it.</li>
        </ol>
    </li>
    <br>
   </ul>

  <p> <font color="#e60000"><strong>Note-1: </strong></font> All the necessary outputs of sklearn's tfidf vectorizer have been provided as reference in this notebook; you can compare your outputs with them as described in the steps above.<br>
   <font color="#e60000"><strong>Note-2: </strong></font> The output of your custom implementation will match that of sklearn's implementation only for the collection of document strings provided as reference in this notebook. It will not match for strings that contain capital letters, punctuation, etc., because sklearn's tfidf vectorizer handles such strings differently. For further details on how sklearn's tfidf vectorizer handles such strings, refer to its official documentation.<br>
   <font color="#e60000"><strong>Note-3: </strong></font> During this task, it is helpful to debug your code with print statements wherever necessary. When you finally submit the assignment, however, make sure your code is readable and avoid printing anything that is not part of this task.
    </p>

### Corpus

In [217]:
# Collection of string documents

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

### SkLearn Implementation

In [218]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)

In [219]:
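# sklearn feature names; they are sorted in alphabetical order by default.
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().

print(vectorizer.get_feature_names())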

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
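
The same vocabulary can be built without sklearn. The sketch below is a minimal illustration that assumes the documents are already lowercase and free of punctuation, so that splitting on whitespace is enough; `build_vocab` is a hypothetical helper name. Tokens shorter than two characters are dropped, mirroring sklearn's default token pattern, which only keeps tokens of two or more alphanumeric characters.

```python
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def build_vocab(docs):
    # Collect the unique tokens of at least two characters and sort them alphabetically.
    tokens = {word for doc in docs for word in doc.split() if len(word) >= 2}
    return sorted(tokens)

vocab = build_vocab(corpus)
print(vocab)
```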


In [220]:
# Here we will print the sklearn tfidf vectorizer idf values after applying the fit method
# After using the fit function on the corpus the vocab has 9 words in it, and each has its idf value.

print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
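
These idf values follow from the smoothed formula given in the task description. As a hedged sketch (the function name `smoothed_idf` is hypothetical, and whitespace tokenization is assumed), the document frequency of a term is the number of documents that contain it as a whole token:

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(docs):
    # One set of tokens per document, so each document counts a term at most once.
    doc_tokens = [set(doc.split()) for doc in docs]
    vocab = sorted(set().union(*doc_tokens))
    n_docs = len(docs)
    # idf(t) = 1 + ln((1 + N) / (1 + df(t)))
    return {
        term: 1 + math.log((1 + n_docs) / (1 + sum(term in tokens for tokens in doc_tokens)))
        for term in vocab
    }

idf = smoothed_idf(corpus)
for term, value in idf.items():
    print(term, round(value, 8))
```

Terms that occur in all four documents ("is", "the", "this") get an idf of exactly 1, because the logarithm of (1 + 4) / (1 + 4) is zero.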


In [221]:
# shape of sklearn tfidf vectorizer output after applying transform method.

skl_output.shape

(4, 9)

In [222]:
# sklearn tfidf values for the first line of the above corpus.
# Here the output is a sparse matrix, printed as (row, column) value entries.

print(skl_output[0])

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045


In [223]:
# sklearn tfidf values for the second line of the above corpus (index 1).
# To understand the output better, here we are converting the sparse output matrix to dense matrix and printing it.
# Notice that this output is normalized using L2 normalization. sklearn does this by default.

print(skl_output[1].toarray())

[[0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]]
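
The row above can be reproduced by hand, which makes the role of L2 normalization explicit. The sketch below takes the vocabulary and the (rounded) idf values printed earlier as given; the variable names are illustrative. Note that sklearn uses the raw count of a term in the document as its term frequency, so "document", which occurs twice, gets the weight 2 × idf before normalization.

```python
import math

vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
idf = [1.91629073, 1.22314355, 1.51082562, 1.0, 1.91629073, 1.91629073, 1.0, 1.91629073, 1.0]
doc = 'this document is the second document'

# Raw tf-idf: (count of the term in the document) * idf, one value per vocabulary column.
tokens = doc.split()
raw = [tokens.count(term) * weight for term, weight in zip(vocab, idf)]

# L2 normalization: divide by the Euclidean length so that the row has unit norm.
norm = math.sqrt(sum(value * value for value in raw))
row = [value / norm for value in raw]

print([round(value, 8) for value in row])
```

Dividing the counts by the document length first would not change the result, since any constant factor per row cancels out in the normalization.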


### Your custom implementation

In [224]:
# Write your code here.
# Make sure it's well documented and readable, with appropriate comments.
# Compare your results with the above sklearn tfidf vectorizer.
# You are not supposed to use any library other than the ones given below.

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy as np

In [225]:
final_word=[]   # vocabulary, sorted alphabetically
final_tfidf=[]  # idf value of each vocabulary word (same order as final_word)

def fit_tfidf(data):
    """Build the sorted vocabulary and the sklearn-style idf of every word."""
    # check that the input data is a list of strings
    if not isinstance(data,(list,)):
        print('Input data is not of type list')
        return
    # start from empty lists so that calling fit again does not duplicate entries
    final_word.clear()
    final_tfidf.clear()
    row_count=len(data)
    unique_words=set()
    for row in data:
        for word in row.split(' '):
            # skip tokens shorter than 2 characters, as sklearn's default tokenizer does
            if len(word)<2:
                continue
            unique_words.add(word)
    unique_words=sorted(unique_words)
    for word in unique_words:
        # number of documents that contain the word as a whole token
        # (a plain `word in row` would also match substrings, e.g. 'is' inside 'this')
        freq_word_row=sum(1 for row in data if word in row.split(' '))
        # smoothed idf used by sklearn: 1 + ln((1 + N) / (1 + df))
        idf=1+(np.log((1+row_count)/(1+freq_word_row)))
        final_word.append(word)
        print(idf)
        final_tfidf.append(idf)
    print(unique_words)

                
                    


In [226]:
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
fit_tfidf(corpus)
        

1.916290731874155
1.2231435513142097
1.5108256237659907
1.0
1.916290731874155
1.916290731874155
1.0
1.916290731874155
1.0
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [227]:
print(*final_word)
print(*final_tfidf)

and document first is one second the third this
1.916290731874155 1.2231435513142097 1.5108256237659907 1.0 1.916290731874155 1.916290731874155 1.0 1.916290731874155 1.0


In [228]:
word_val=dict(zip(final_word,final_tfidf))

In [229]:
normalized_value=[]
l1=[1,2,3,4]
normalized_value.append(l1)
print(normalized_value)


[[1, 2, 3, 4]]


In [230]:
for row in corpus:
    for word in row.split(' '):
        print(word)

this
is
the
first
document
this
document
is
the
second
document
and
this
is
the
third
one
is
this
the
first
document


In [231]:
# Buffers filled by tfidf_transform (kept at module level so that later cells can inspect them).
normalized_value=[]          # idf values of the words of the document being processed
final_normalized_input=[]    # one list of idf values per document
rows = []
columns = []
values = []
final_output=[]              # one L2-normalized array of shape (1, n_words) per document
output=[]

def tfidf_transform(data,vocab):
    """Look up the idf of each word of each document and L2-normalize every document.

    The values of a document are kept in the order in which its words occur, so a
    repeated word contributes its idf once per occurrence.
    """
    if not isinstance(data,(list,)):
        print("input a list")
        return
    for row in tqdm(data):
        normalized_value.clear()
        for word in row.split(' '):
            # words that are missing from the vocabulary get the value 0
            normalized_value.append(vocab.get(word,0))
        final_normalized_input.append(normalized_value[:])

    for row_values in final_normalized_input:
        # normalize expects a 2D input, hence the extra pair of brackets (default norm is L2)
        final_output.append(normalize([row_values]))

    j=0
    for i in range(len(final_output)):
        # k runs over the number of documents, so only the first len(final_output)
        # values of each document are printed
        for k in range(len(final_output)):
            print(i,j,k,final_output[i][j][k])
            
        

In [232]:
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
tfidf_transform(corpus,word_val)
# print(*final_normalized_input)



100%|██████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4004.11it/s]


0 0 0 0.38408524091481483
0 0 1 0.38408524091481483
0 0 2 0.38408524091481483
0 0 3 0.5802858236844359
1 0 0 0.3216726331114366
1 0 1 0.3934518068245154
1 0 2 0.3216726331114366
1 0 3 0.3216726331114366
2 0 0 0.511848512707169
2 0 1 0.267103787642168
2 0 2 0.267103787642168
2 0 3 0.267103787642168
3 0 0 0.38408524091481483
3 0 1 0.38408524091481483
3 0 2 0.38408524091481483
3 0 3 0.5802858236844359
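
For comparison with the sklearn output, the steps listed in the task (alphabetical vocabulary, smoothed idf, count × idf, L2 normalization, sparse output) can be sketched as below. This is a minimal illustration rather than a reference solution: the function names `fit` and `transform` are hypothetical, whitespace tokenization of lowercase text is assumed, and the normalization is written out by hand (`sklearn.preprocessing.normalize(matrix, norm='l2')` applied to the unnormalized matrix is an alternative).

```python
import math
from collections import Counter
from scipy.sparse import csr_matrix

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def fit(docs):
    """Return (vocab, idf): term -> column index and term -> smoothed idf."""
    doc_tokens = [set(doc.split()) for doc in docs]
    terms = sorted(set().union(*doc_tokens))
    vocab = {term: col for col, term in enumerate(terms)}
    n_docs = len(docs)
    idf = {
        term: 1 + math.log((1 + n_docs) / (1 + sum(term in tokens for tokens in doc_tokens)))
        for term in terms
    }
    return vocab, idf

def transform(docs, vocab, idf):
    """Return the L2-normalized tf-idf matrix in CSR format, shape (n_docs, n_terms)."""
    rows, cols, vals = [], [], []
    for row, doc in enumerate(docs):
        # Raw counts of the in-vocabulary words of this document.
        counts = Counter(word for word in doc.split() if word in vocab)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term, weight in weights.items():
            rows.append(row)
            cols.append(vocab[term])
            vals.append(weight / norm)
    return csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))

vocab, idf = fit(corpus)
custom_output = transform(corpus, vocab, idf)
print(custom_output.shape)
print(custom_output[1].toarray())
```

Each value is stored at the column of its vocabulary term, so the rows line up with the sklearn output column by column, and a word that occurs twice in a document is weighted by its count rather than listed twice.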


## Task-2

<font face='georgia'>
    <h4><strong>2. Implement max features functionality:</strong></h4>

<ul>
    <li> As a part of this task you have to modify your fit and transform functions so that your vocab contains only the 50 terms with the highest idf scores.</li>
    <br>
    <li>This task is similar to the previous one, except that here your vocabulary is limited to the top 50 feature names based on their idf values. Your output will therefore have exactly 50 columns, and the number of rows will depend on the number of documents in your corpus.</li>
    <br>
    <li>Here you will be given a pickle file named <strong>cleaned_strings</strong>. You have to load the corpus from this file and use it as input to your tfidf vectorizer.</li>
    <br>
    <li>Steps to approach this task:
    <ol>
        <li> You would have to write both fit and transform methods for your custom implementation of tfidf vectorizer, just like in the previous task. Additionally, here you have to limit the number of features generated to 50 as described above.</li>
        <li> Now sort your vocab in descending order of idf values and print out the words in the sorted vocab after you fit your data. Here you should get only 50 terms in your vocab. Make sure to print the idf value of each term in your vocab. </li>
        <li> Make sure the output of your implementation is a sparse matrix. Before generating the final output, you need to normalize your sparse matrix using L2 normalization. You can refer to this link https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html </li>
        <li> Now check the output of a single document in your collection of documents,  you can convert the sparse matrix related only to that document into dense matrix and print it. And this dense matrix should contain 1 row and 50 columns. </li>
        </ol>
    </li>
    <br>
   </ul>
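
The feature selection described above can be sketched as follows. The helper name `top_k_by_idf` and the toy idf values are hypothetical, and breaking ties alphabetically is an assumption, since the task does not say how terms with equal idf should be ordered. Note also that sklearn's own `max_features` option ranks terms by their frequency across the corpus, not by idf, so this task is not expected to match sklearn's output.

```python
def top_k_by_idf(idf, k=50):
    # Rank by descending idf; equal idf values are ordered alphabetically.
    ranked = sorted(idf.items(), key=lambda item: (-item[1], item[0]))
    # Keep the first k terms and assign column indices in alphabetical order.
    kept = sorted(term for term, _ in ranked[:k])
    return {term: col for col, term in enumerate(kept)}

toy_idf = {"movie": 1.2, "aimless": 6.9, "slow": 5.1, "drifting": 6.9, "man": 2.3}
top3 = top_k_by_idf(toy_idf, k=3)
print(top3)
```

The resulting dictionary maps each kept term to its column, so a transform step only needs to skip the words that are not among its keys.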

In [233]:
# Below is the code to load the cleaned_strings pickle file provided
# Here corpus is of list type

import pickle
with open('cleaned_strings', 'rb') as f:
    corpus = pickle.load(f)
    
# printing the length of the corpus loaded
print("Number of documents in corpus = ",len(corpus))

Number of documents in corpus =  746


In [234]:
# we will be using this document as our test document .
corpus[0]

'slow moving aimless movie distressed drifting young man'

In [235]:
# Write your code here.
# Try not to hardcode any values.
# Make sure it's well documented and readable, with appropriate comments.

In [236]:
# Fit function: collects the unique words of the corpus and their idf values
final_word=[]   # vocabulary, sorted alphabetically
final_tfidf=[]  # idf value of each vocabulary word (same order as final_word)

def fit_tfidf(data):
    """Build the sorted vocabulary and the sklearn-style idf of every word."""
    # check that the input data is a list of strings
    if not isinstance(data,(list,)):
        print('Input data is not of type list')
        return
    # start from empty lists so that calling fit again does not duplicate entries
    final_word.clear()
    final_tfidf.clear()
    row_count=len(data)
    unique_words=set()
    for row in data:
        for word in row.split(' '):
            # skip tokens shorter than 2 characters
            if len(word)<2:
                continue
            unique_words.add(word)
    unique_words=sorted(unique_words)
    for word in unique_words:
        # number of documents in which the word occurs
        # NOTE: `word in row` is a substring test, so a short word such as 'act' is
        # also counted in documents that only contain 'acting'; compare against
        # row.split(' ') to count whole words only
        freq_word_row=sum(1 for row in data if word in row)
        # smoothed idf used by sklearn: 1 + ln((1 + N) / (1 + df))
        idf=1+(np.log((1+row_count)/(1+freq_word_row)))
        final_word.append(word)
        print(word,idf)
        final_tfidf.append(idf)
    print(unique_words)

                
                    


In [237]:
# Applying the fit function to corpus dataset
fit_tfidf(corpus)

aailiyah 6.922918004572872
abandoned 6.922918004572872
ability 6.229770824012927
abroad 6.922918004572872
absolutely 5.3134800921387715
abstruse 6.922918004572872
abysmal 6.517452896464707
academy 6.922918004572872
accents 6.922918004572872
accessible 6.922918004572872
acclaimed 6.922918004572872
accolades 6.922918004572872
accurate 6.517452896464707
accurately 6.922918004572872
accused 6.517452896464707
achievement 6.517452896464707
achille 6.922918004572872
ackerman 6.922918004572872
act 2.6960842593046923
acted 6.229770824012927
acting 3.927185731018881
action 5.218169912334447
actions 6.229770824012927
actor 4.283860674957613
actors 4.671626205966376
actress 5.670155036077504
actresses 6.229770824012927
actually 5.218169912334447
adams 6.922918004572872
adaptation 6.517452896464707
add 5.824305715904762
added 6.922918004572872
addition 6.229770824012927
admins 6.922918004572872
admiration 6.922918004572872
admitted 6.922918004572872
adorable 6.006627272698717
adrift 6.9229180045728

cameo 6.922918004572872
camera 5.131158535344817
camerawork 6.922918004572872
camp 6.229770824012927
campy 6.922918004572872
canada 6.517452896464707
cancan 6.922918004572872
candace 6.922918004572872
candle 6.922918004572872
cannot 6.517452896464707
cant 6.229770824012927
captain 6.922918004572872
captured 6.922918004572872
captures 6.922918004572872
car 4.214867803470662
card 6.006627272698717
cardboard 6.517452896464707
cardellini 6.922918004572872
care 5.218169912334447
carol 6.922918004572872
carrell 6.922918004572872
carries 6.922918004572872
carry 6.922918004572872
cars 6.922918004572872
cartoon 5.824305715904762
cartoons 6.517452896464707
case 6.229770824012927
cases 6.922918004572872
cast 4.397189360264616
casted 6.922918004572872
casting 5.670155036077504
cat 5.3134800921387715
catchy 6.922918004572872
caught 6.922918004572872
cause 6.517452896464707
ceases 6.517452896464707
celebration 6.922918004572872
celebrity 6.922918004572872
celluloid 6.922918004572872
centers 6.922918

depth 6.006627272698717
derivative 6.922918004572872
describe 6.229770824012927
describes 6.922918004572872
desert 6.922918004572872
deserved 6.517452896464707
deserves 6.517452896464707
deserving 6.517452896464707
design 6.229770824012927
designed 6.922918004572872
designer 6.922918004572872
desperately 6.922918004572872
desperation 6.922918004572872
despised 6.922918004572872
despite 6.517452896464707
destroy 6.922918004572872
detailing 6.922918004572872
details 6.922918004572872
develop 6.517452896464707
development 6.517452896464707
developments 6.922918004572872
di 2.681591252002126
diabetic 6.922918004572872
dialog 4.908014984030608
dialogs 6.922918004572872
dialogue 5.131158535344817
diaper 6.922918004572872
dickens 6.922918004572872
difference 6.922918004572872
different 5.824305715904762
dignity 6.922918004572872
dimensional 6.922918004572872
direct 4.438011354784871
directed 6.229770824012927
directing 5.418840607796598
direction 6.006627272698717
director 5.131158535344817
d

filmography 6.922918004572872
films 4.671626205966376
final 5.536623643452981
finale 6.922918004572872
finally 6.229770824012927
financial 6.922918004572872
find 5.131158535344817
finds 6.922918004572872
fine 5.536623643452981
finest 6.922918004572872
fingernails 6.922918004572872
finished 6.922918004572872
fire 6.922918004572872
first 5.218169912334447
fish 6.517452896464707
fishnet 6.922918004572872
fisted 6.922918004572872
fit 6.922918004572872
five 6.517452896464707
flag 6.922918004572872
flakes 6.922918004572872
flaming 6.922918004572872
flashbacks 6.922918004572872
flat 6.517452896464707
flaw 5.824305715904762
flawed 6.517452896464707
flaws 6.517452896464707
fleshed 6.922918004572872
flick 5.824305715904762
flicks 6.922918004572872
florida 6.922918004572872
flowed 6.922918004572872
flying 6.922918004572872
flynn 6.517452896464707
focus 6.517452896464707
fodder 6.922918004572872
follow 5.670155036077504
following 6.517452896464707
follows 6.922918004572872
foolish 6.92291800457287

improvement 6.922918004572872
improvisation 6.922918004572872
impulse 6.922918004572872
inappropriate 6.922918004572872
incendiary 6.922918004572872
includes 6.922918004572872
including 6.229770824012927
incomprehensible 6.922918004572872
inconsistencies 6.922918004572872
incorrectness 6.922918004572872
incredible 6.006627272698717
incredibly 6.229770824012927
indeed 6.006627272698717
indescribably 6.922918004572872
indication 6.922918004572872
indictment 6.922918004572872
indie 6.922918004572872
individual 6.922918004572872
indoor 6.922918004572872
indulgent 6.229770824012927
industry 6.517452896464707
ineptly 6.922918004572872
inexperience 6.922918004572872
inexplicable 6.922918004572872
initially 6.922918004572872
innocence 6.922918004572872
insane 6.922918004572872
inside 6.922918004572872
insincere 6.922918004572872
insipid 6.922918004572872
insomniacs 6.922918004572872
inspiration 6.922918004572872
inspiring 6.922918004572872
instant 6.922918004572872
instead 6.517452896464707
in

memories 6.229770824012927
memorized 6.922918004572872
menace 6.922918004572872
menacing 6.922918004572872
mention 5.824305715904762
mercy 6.517452896464707
meredith 6.922918004572872
merit 6.922918004572872
mesmerising 6.922918004572872
mess 5.418840607796598
messages 6.922918004572872
meteorite 6.922918004572872
mexican 6.922918004572872
michael 6.922918004572872
mickey 6.006627272698717
microsoft 6.922918004572872
middle 6.922918004572872
might 6.229770824012927
mighty 6.922918004572872
mind 5.536623643452981
mindblowing 6.922918004572872
miner 6.922918004572872
mini 6.517452896464707
minor 6.922918004572872
minute 5.131158535344817
minutes 5.418840607796598
mirrormask 6.922918004572872
miserable 6.922918004572872
miserably 6.922918004572872
mishima 6.517452896464707
misplace 6.922918004572872
miss 6.229770824012927
missed 6.517452896464707
mistakes 6.922918004572872
miyazaki 6.229770824012927
modern 6.517452896464707
modest 6.922918004572872
mollusk 6.922918004572872
moment 5.67015

predictable 5.218169912334447
predictably 6.517452896464707
prejudice 6.922918004572872
prelude 6.922918004572872
premise 6.006627272698717
prepared 6.922918004572872
presence 6.517452896464707
presents 6.006627272698717
preservation 6.922918004572872
president 6.922918004572872
pretentious 6.229770824012927
pretext 6.922918004572872
pretty 4.9770078555175585
previous 6.517452896464707
primal 6.922918004572872
primary 6.922918004572872
probably 5.536623643452981
problem 5.670155036077504
problems 6.006627272698717
proceedings 6.517452896464707
process 6.517452896464707
produce 5.824305715904762
produced 6.229770824012927
producer 6.517452896464707
producers 6.922918004572872
product 5.670155036077504
production 5.824305715904762
professionals 6.922918004572872
professor 6.922918004572872
progresses 6.922918004572872
promote 6.922918004572872
prompted 6.922918004572872
prone 6.922918004572872
propaganda 6.922918004572872
properly 6.922918004572872
proud 6.229770824012927
proudly 6.92291

sharing 6.922918004572872
sharply 6.922918004572872
shatner 6.922918004572872
shattered 6.922918004572872
shed 5.536623643452981
sheer 6.922918004572872
shelf 6.922918004572872
shell 6.922918004572872
shelves 6.922918004572872
shenanigans 6.922918004572872
shepard 6.922918004572872
shined 6.922918004572872
shirley 6.922918004572872
shocking 6.922918004572872
shooting 6.922918004572872
short 5.05111582767128
shortlist 6.922918004572872
shot 5.418840607796598
shots 6.229770824012927
show 4.397189360264616
showcasing 6.922918004572872
showed 6.229770824012927
shows 6.229770824012927
shut 6.922918004572872
sibling 6.922918004572872
sick 6.517452896464707
side 4.725693427236653
sidelined 6.922918004572872
sign 5.670155036077504
significant 6.517452896464707
silent 6.229770824012927
silly 6.922918004572872
simmering 6.922918004572872
simplifying 6.922918004572872
simply 5.418840607796598
since 5.536623643452981
sincere 6.517452896464707
sing 4.357968647111335
singing 6.229770824012927
single

thumper 6.922918004572872
thunderbirds 6.922918004572872
thus 6.922918004572872
ticker 6.922918004572872
tickets 6.922918004572872
tightly 6.922918004572872
time 3.6457732715806954
timeless 6.922918004572872
timely 6.922918004572872
timers 6.922918004572872
times 5.418840607796598
timing 6.922918004572872
tiny 6.922918004572872
tired 6.922918004572872
title 6.517452896464707
titta 6.922918004572872
today 6.006627272698717
together 5.536623643452981
told 6.229770824012927
tolerable 6.922918004572872
tolerate 6.922918004572872
tom 5.536623643452981
tomorrow 6.922918004572872
tone 6.517452896464707
tongue 6.922918004572872
tonight 6.922918004572872
tons 6.922918004572872
tony 6.922918004572872
took 6.517452896464707
toons 6.517452896464707
top 5.3134800921387715
tops 6.922918004572872
torture 5.824305715904762
tortured 6.922918004572872
total 5.05111582767128
totally 5.05111582767128
touch 5.824305715904762
touches 6.922918004572872
touching 6.229770824012927
tough 6.922918004572872
towar

wow 6.922918004572872
wrap 6.922918004572872
write 5.536623643452981
writer 5.536623643452981
writers 6.922918004572872
writing 4.9770078555175585
written 5.536623643452981
wrong 6.229770824012927
wrote 6.922918004572872
yardley 6.922918004572872
yawn 6.922918004572872
yeah 6.517452896464707
year 4.843476462893037
years 5.05111582767128
yelps 6.922918004572872
yes 5.536623643452981
yet 5.824305715904762
young 5.824305715904762
younger 6.922918004572872
youthful 6.922918004572872
youtube 6.922918004572872
yun 6.922918004572872
zillion 6.922918004572872
zombie 6.229770824012927
zombiez 6.922918004572872




In [238]:
# Keeping all the unique words and their idf values in two lists and zipping them into a dictionary
vocab_dict=dict(zip(final_word,final_tfidf))

In [239]:
#Sorting by idf value in descending order. sorted() is stable, so words with the same idf keep their alphabetical order.
#Ref Link: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
sorted_vocab= (sorted(vocab_dict.items(), key = lambda x : x[1],reverse=True))

In [240]:
#Putting the top 50 features , i.e words and their values into a dictionary.
words1=[]
values=[]
i=0
j=0
while (i<=51):
    words1.append(sorted_vocab[i][0])
    i+=1
while(j<51):
    values.append(sorted_vocab[j][1])
    j+=1
    

In [241]:
#printing the words and their values respectively
print(*words1)
print(*values)

aailiyah abandoned abroad abstruse academy accents accessible acclaimed accolades accurately achille ackerman adams added admins admiration admitted adrift adventure aesthetically affected affleck afternoon agreed aimless aired akasha alert alike allison allowing alongside amateurish amazed amazingly amusing amust anatomist angela angelina angry anguish angus animals animated anita anniversary anthony antithesis anyway apart appears
6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922918004572872 6.922

In [242]:
# Creating a dictionary with top 51 features
dict_50=dict(zip(words1,values))

In [243]:
# I took the top 51 values 
len(dict_50)

51

In [244]:
# the words and their idf values
print(dict_50)

{'aailiyah': 6.922918004572872, 'abandoned': 6.922918004572872, 'abroad': 6.922918004572872, 'abstruse': 6.922918004572872, 'academy': 6.922918004572872, 'accents': 6.922918004572872, 'accessible': 6.922918004572872, 'acclaimed': 6.922918004572872, 'accolades': 6.922918004572872, 'accurately': 6.922918004572872, 'achille': 6.922918004572872, 'ackerman': 6.922918004572872, 'adams': 6.922918004572872, 'added': 6.922918004572872, 'admins': 6.922918004572872, 'admiration': 6.922918004572872, 'admitted': 6.922918004572872, 'adrift': 6.922918004572872, 'adventure': 6.922918004572872, 'aesthetically': 6.922918004572872, 'affected': 6.922918004572872, 'affleck': 6.922918004572872, 'afternoon': 6.922918004572872, 'agreed': 6.922918004572872, 'aimless': 6.922918004572872, 'aired': 6.922918004572872, 'akasha': 6.922918004572872, 'alert': 6.922918004572872, 'alike': 6.922918004572872, 'allison': 6.922918004572872, 'allowing': 6.922918004572872, 'alongside': 6.922918004572872, 'amateurish': 6.92291

In [245]:
#Function for printing the dense matrix after transformation. This will be called inside the transformation function.
def dense_matrix(data,vocab):
    # data: the test document; vocab: dictionary mapping each kept word to its idf
    global features
    global features_values
    features=words1
    features_values=[]
    for word in tqdm(features):
        # idf of the word if it occurs in the test document, else 0
        if word in data:
            features_values.append(vocab[word])
        else:
            features_values.append(0)
    # pair every feature name with its value, one row per feature
    a=np.column_stack((features, features_values))
    print('dense_matrix',a)
    

In [246]:
# Function for transforming the fit data.
global normalized_value
normalized_value=[]
global final_normalized_input
final_normalized_input=[]
rows = []
columns = []
values = []
final_output=[]
output=[]

def tfidf_transform(data,vocab):
    prev_idx=0
    if isinstance(data,(list,)):
        for idx,row in enumerate(tqdm(data)):
            word_column=0
            for word in row.split(' '):
                word_column+=1
                key_value=word_val.get(word,0)
                
#                 print(normalized_value)
#                 print(idx)
#                 print(prev_idx)
                if(idx!=prev_idx):
#                     print("normal",normalized_value)
                    final_normalized_input.append(normalized_value[:])
#                     print(final_normalized_input)
#                     print("prev",prev_idx)
#                     print("current",idx)
            
                    prev_idx=idx
                    normalized_value.clear()
                normalized_value.append(key_value)
#                 print(normalized_value)
#                 print("outside_normal",normalized_value)
        final_normalized_input.append(normalized_value[:])
                    
                    
# #                 rows.append(idx)
# #                 columns.append(word_column)
#                 normalized_value.append(key_value)
# #                 print(word)
#                 if(row!=(prev_row)):
        
        for i in range(len(final_normalized_input)):
            
            output=normalize([final_normalized_input[i]])
            final_output.append(output[:])
            
        j=0
#         for i in range(len(final_output)):
#             for k in range(len(final_output)):
#                 print(i,j,k,final_output[i][j][k])
                
            
        print("sparse matrix",final_output)
        dense_matrix(corpus_input,dict_50)

                       
        
        
#                 prev_row=row
                
# #                 print(idx,word_column,normalized_value)
#                 print(idx)   
# #             final_normalized_input.append(normalized_value)
# #             print(normalized_value)
# # #             output=normalize(final_normalized_input)
# #             final_output.append(output)      
        
        
    else:
        print("input a list")
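For comparison, the TF and IDF definitions given at the top of this assignment can be combined into a transform that returns one normalized row per document instead of appending to global lists. This is a minimal sketch, assuming L2 normalization and a hypothetical helper `tfidf_row` with a precomputed `idf_by_word` mapping; it is not the notebook's implementation.

```python
import math
from collections import Counter


def tfidf_row(document, idf_by_word, vocabulary):
    """Return the L2-normalized tf-idf values of one document over a fixed vocabulary."""
    counts = Counter(document.split())
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    # TF = occurrences of the word / total words in the document; words without an IDF get 0.
    raw = [(counts[w] / total) * idf_by_word.get(w, 0.0) for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    # An all-zero row is returned unchanged to avoid dividing by zero.
    return [v / norm for v in raw] if norm > 0 else raw


# Hypothetical vocabulary and IDF values for illustration only.
vocabulary_demo = ['aimless', 'movie', 'zebra']
idf_demo = {'aimless': 2.0, 'movie': 1.0}
row_demo = tfidf_row('slow aimless movie', idf_demo, vocabulary_demo)
print(row_demo)
```

Each row has one entry per vocabulary word, so rows from different documents line up column by column and can be stacked into a matrix.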
            
        

In [247]:
# One document on which we will test our tf-idf features.
type(corpus[0])
# As corpus[0] is a str, we create a new variable called corpus_input, a list that holds this first document.
corpus_input=['slow moving aimless movie distressed drifting young man']
print(corpus_input)


# The type of corpus_input is list, so we can now pass it to the transform function.
type(corpus_input)

['slow moving aimless movie distressed drifting young man']


list

In [248]:
#Calling the transform function which will print the sparse and dense matrix of the test document.
tfidf_transform(corpus_input,dict_50)


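# The dense matrix pairs each top feature word with its value for the test document.
# Every value printed is zero, even for 'aimless', which occurs in the document: dense_matrix
# tests each feature word against the whole document strings in corpus_input, not against individual words.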

100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]


sparse matrix [array([[0., 0., 0., 0., 0., 0., 0., 0.]])]


100%|██████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:00<?, ?it/s]


dense_matrix [['aailiyah' '0']
 ['abandoned' '0']
 ['abroad' '0']
 ['abstruse' '0']
 ['academy' '0']
 ['accents' '0']
 ['accessible' '0']
 ['acclaimed' '0']
 ['accolades' '0']
 ['accurately' '0']
 ['achille' '0']
 ['ackerman' '0']
 ['adams' '0']
 ['added' '0']
 ['admins' '0']
 ['admiration' '0']
 ['admitted' '0']
 ['adrift' '0']
 ['adventure' '0']
 ['aesthetically' '0']
 ['affected' '0']
 ['affleck' '0']
 ['afternoon' '0']
 ['agreed' '0']
 ['aimless' '0']
 ['aired' '0']
 ['akasha' '0']
 ['alert' '0']
 ['alike' '0']
 ['allison' '0']
 ['allowing' '0']
 ['alongside' '0']
 ['amateurish' '0']
 ['amazed' '0']
 ['amazingly' '0']
 ['amusing' '0']
 ['amust' '0']
 ['anatomist' '0']
 ['angela' '0']
 ['angelina' '0']
 ['angry' '0']
 ['anguish' '0']
 ['angus' '0']
 ['animals' '0']
 ['animated' '0']
 ['anita' '0']
 ['anniversary' '0']
 ['anthony' '0']
 ['antithesis' '0']
 ['anyway' '0']
 ['apart' '0']
 ['appears' '0']]
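
The listing above shows '0' for 'aimless' although the word occurs in the test document; testing membership against the document's individual words avoids that. The following is a minimal sketch, assuming a hypothetical helper `dense_row` and a three-word demo feature list. It uses the raw dictionary value rather than a normalized tf-idf weight and is not the notebook's function.

```python
def dense_row(document, value_by_word, features):
    """Return one value per feature: the word's value if it occurs in the document, else 0."""
    words_in_document = set(document.split())
    return [value_by_word[w] if w in words_in_document else 0 for w in features]


# Hypothetical three-word feature list; the value mirrors the one printed above.
features_demo = ['abroad', 'aimless', 'aired']
values_demo = {w: 6.922918004572872 for w in features_demo}
row = dense_row('slow moving aimless movie distressed drifting young man',
                values_demo, features_demo)
print(row)  # [0, 6.922918004572872, 0]
```

Splitting the document once into a set keeps each membership test cheap when the feature list is long.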
