# TfidfVectorizer Explanation
Convert a collection of raw documents to a matrix of TF-IDF features.

TF-IDF combines two quantities: TF is the term frequency, and IDF is the inverse document frequency.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['mydak adflkfaab keafabh']

In [2]:
vect = TfidfVectorizer()

In [3]:
vect.fit(text)

TfidfVectorizer()

In [4]:
# TF counts how often each word occurs in a document; IDF down-weights words that occur in many documents.
print(vect.idf_)

[1. 1. 1.]


In [5]:
print(vect.vocabulary_)

{'mydak': 2, 'adflkfaab': 0, 'keafabh': 1}


### A word that is present in every document has a low IDF value. Rarer, more distinctive words therefore stand out with the highest IDF values.
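
The effect is easier to see with more than one document. The sketch below uses a made-up three-document corpus (the sentences and the names `docs`, `toy_vect`, and `idf_by_term` are illustrative assumptions, not part of the dataset used later). The word that occurs in every document should end up with the smallest IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'bank' occurs in every document, every other word in only one.
docs = ['bank loan rate', 'bank fraud alert', 'bank blockchain ledger']

toy_vect = TfidfVectorizer()
toy_vect.fit(docs)

# Pair each vocabulary term with its learned IDF weight.
idf_by_term = {term: toy_vect.idf_[index] for term, index in toy_vect.vocabulary_.items()}

# Print terms from lowest to highest IDF.
for term, idf in sorted(idf_by_term.items(), key=lambda item: item[1]):
    print(f'{term}: {idf:.3f}')
```

With the default settings, 'bank' gets an IDF of 1 and the words that occur in a single document get a higher value.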

In [6]:
example = text[0]
example

'mydak adflkfaab keafabh'

In [7]:
example = vect.transform([example])
print(example.toarray())

[[0.57735027 0.57735027 0.57735027]]
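
The value 0.57735027 can be reproduced by hand. The following is a minimal sketch, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The helper names are illustrative only.

```python
import math

n_docs = 1
tokens = 'mydak adflkfaab keafabh'.split()
vocab = sorted(set(tokens))  # same alphabetical order as vect.vocabulary_

# Every term occurs in the single document, so its document frequency is 1.
doc_freq = {term: 1 for term in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, which is exactly 1 here.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

# Raw TF-IDF = term count * IDF, then divide by the Euclidean (L2) norm of the row.
raw = [tokens.count(term) * idf[term] for term in vocab]
norm = math.sqrt(sum(value * value for value in raw))
tfidf = [value / norm for value in raw]

print(tfidf)
```

Each raw weight is 1, the row norm is the square root of 3, and 1 divided by that is roughly 0.577, which matches the output above.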


### Here, a 0 would appear at the index of any vocabulary word that is not present in the given sentence. This sentence contains all three vocabulary words, so no entry is 0.
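
To see a zero entry, a sentence that omits one vocabulary word can be transformed. This is a small sketch that refits a vectorizer on the same single document; the shortened sentence 'mydak keafabh' and the names `demo_vect` and `row` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

demo_vect = TfidfVectorizer()
demo_vect.fit(['mydak adflkfaab keafabh'])

# 'adflkfaab' is in the vocabulary but missing from this sentence.
row = demo_vect.transform(['mydak keafabh']).toarray()[0]

for term, index in sorted(demo_vect.vocabulary_.items(), key=lambda item: item[1]):
    print(index, term, row[index])
```

The entry at the index of 'adflkfaab' is 0, and the two remaining weights are renormalised so the row still has unit length.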

## PassiveAggressiveClassifier

### Passive: if the classification is correct, keep the model unchanged. Aggressive: if the classification is incorrect, update the model to correct for the misclassified example.
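
The passive and aggressive cases can be sketched for a binary problem with labels -1 and +1. This is a simplified illustration of the PA-I update rule with hinge loss, not scikit-learn's actual implementation; the function name `pa_update` and the toy inputs are assumptions.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss

    if loss == 0.0:
        return weights  # passive: margin is large enough, keep the model

    squared_norm = sum(xi * xi for xi in x)
    if squared_norm == 0.0:
        return weights  # an all-zero example carries no information

    # Aggressive: take the smallest step that fixes this example, capped by C.
    tau = min(C, loss / squared_norm)
    return [w + tau * y * xi for w, xi in zip(weights, x)]


start = [0.0, 0.0]
after_mistake = pa_update(start, [1.0, 0.0], +1)          # loss 1, so the weights move
after_correct = pa_update(after_mistake, [2.0, 0.0], +1)  # margin 2, so nothing changes

print(after_mistake, after_correct)
```

The cap `C` limits how far a single noisy example can move the weights.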

Passive-Aggressive algorithms are generally used for large-scale learning. They belong to the family of online-learning algorithms. In online learning, the input data arrives in sequential order and the model is updated step by step, as opposed to batch learning, where the entire training dataset is used at once. This is useful when the dataset is so large that training on all of it at once is computationally infeasible. Put simply, an online-learning algorithm receives a training example, updates the classifier, and then discards the example.
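
In scikit-learn, this step-by-step style of training is available through `partial_fit`. The sketch below feeds synthetic data to the classifier in small batches. The random data, the batch size of 20, and the variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Synthetic stand-in for a data stream: 200 samples, 5 features.
rng = np.random.default_rng(0)
X_stream = rng.normal(size=(200, 5))
y_stream = (X_stream[:, 0] > 0).astype(int)

online_clf = PassiveAggressiveClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set on the first call

# Update the model one mini-batch at a time; each batch can be discarded afterwards.
for start in range(0, len(X_stream), 20):
    batch = slice(start, start + 20)
    online_clf.partial_fit(X_stream[batch], y_stream[batch], classes=all_classes)

predictions = online_clf.predict(X_stream)
print(predictions[:10])
```

For streaming text, the features would also have to be computed incrementally. A stateless vectorizer such as `HashingVectorizer` is one option, because a fitted `TfidfVectorizer` has a fixed vocabulary.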

## Let's start the work

In [8]:
import pandas as pd
import numpy as np

In [9]:
dataframe = pd.read_csv('data.csv')
dataframe.head()

| Unnamed: 0 | Text | Target |
|---|---|---|
| 0 | reserve bank forming expert committee based in... | Blockchain |
| 1 | director could play role financial system | Blockchain |
| 2 | preliminary discuss secure transaction study r... | Blockchain |
| 3 | security indeed prove essential transforming f... | Blockchain |
| 4 | bank settlement normally take three days based... | Blockchain |


In [10]:
x = dataframe['Text']
y = dataframe['Target']

In [11]:
y.value_counts()

FinTech             8551
Cyber Security      2640
Bigdata             2267
Reg Tech            2206
credit reporting    1748
Blockchain          1375
Neobanks            1069
Microservices        977
Stock Trading        787
Robo Advising        737
Data Security        347
Name: Target, dtype: int64

In [12]:
x

0        reserve bank forming expert committee based in...
1                director could play role financial system
2        preliminary discuss secure transaction study r...
3        security indeed prove essential transforming f...
4        bank settlement normally take three days based...
                               ...                        
22699    fourth study discusses blockchain technology e...
22700    book finishes stating biggest issue emerging F...
22701                                  people culture cess
22702    author challenges execu tive lead change stop ...
22703    change data driven culture come bottom must start
Name: Text, Length: 22704, dtype: object

In [13]:
y

0        Blockchain
1        Blockchain
2        Blockchain
3        Blockchain
4        Blockchain
            ...    
22699      Reg Tech
22700      Reg Tech
22701      Reg Tech
22702      Reg Tech
22703      Reg Tech
Name: Target, Length: 22704, dtype: object

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

In [15]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train

7222     Cyber Security
7849     Cyber Security
8            Blockchain
20007     Stock Trading
15699           FinTech
              ...      
13123           FinTech
19648     Stock Trading
9845            FinTech
10799           FinTech
2732            Bigdata
Name: Target, Length: 18163, dtype: object

In [16]:
x_train

7222     exit strategy acceptable exit transition strat...
7849     intent implementation readiness lack thereof r...
8        technology need transaction intermediary clear...
20007    basis selection period this duration faced boo...
15699                                         feel urgency
                               ...                        
13123    transfer burden even sharper context according...
19648                            study faster earn trading
9845     furthermore industry must maintain quality tak...
10799                                           made legal
2732                                             whiterose
Name: Text, Length: 18163, dtype: object

In [17]:
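# Drop English stop words and any term that appears in more than 70% of the training documents.
tfvect = TfidfVectorizer(stop_words='english', max_df=0.7)
# Learn the vocabulary and IDF weights from the training text only.
tfid_x_train = tfvect.fit_transform(x_train)
# Reuse the fitted vocabulary for the test text; cast to str to guard against non-string entries.
tfid_x_test = tfvect.transform(x_test.astype(str))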

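* max_df = 0.50 (a float) means "ignore terms that appear in more than 50% of the documents".
* max_df = 25 (an integer) means "ignore terms that appear in more than 25 documents".
* The cell above uses max_df = 0.7, so terms that appear in more than 70% of the training documents are ignored.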
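
The difference between a float and an integer `max_df` can be checked on a small corpus. The four sentences below and the variable names are made up for illustration: 'bank' occurs in all four documents, 'loan' and 'fraud' in two each, and every other word in one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'bank loan rate',
    'bank fraud alert',
    'bank blockchain ledger',
    'bank loan fraud',
]

# Float: drop terms found in more than 50% of the documents (here, more than 2 of 4).
vocab_fraction = TfidfVectorizer(max_df=0.5).fit(docs).vocabulary_

# Integer: drop terms found in more than 1 document.
vocab_count = TfidfVectorizer(max_df=1).fit(docs).vocabulary_

print(sorted(vocab_fraction))
print(sorted(vocab_count))
```

With the float setting only 'bank' is removed. With the integer setting 'bank', 'loan', and 'fraud' are all removed.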

In [22]:
classifier = PassiveAggressiveClassifier(max_iter=5)
classifier.fit(tfid_x_train,y_train)

PassiveAggressiveClassifier(max_iter=5)

In [23]:
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 64.7%
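
To put the accuracy in context, one rough reference point is the score of a classifier that always predicts the most frequent class. The sketch below computes that share from the `y.value_counts()` output shown earlier. It uses the counts for the full dataset rather than the test split, so it is only an approximation of the baseline on `y_test`.

```python
# Class counts copied from the y.value_counts() output above (full dataset, not the test split).
class_counts = {
    'FinTech': 8551,
    'Cyber Security': 2640,
    'Bigdata': 2267,
    'Reg Tech': 2206,
    'credit reporting': 1748,
    'Blockchain': 1375,
    'Neobanks': 1069,
    'Microservices': 977,
    'Stock Trading': 787,
    'Robo Advising': 737,
    'Data Security': 347,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda item: item[1])
baseline = majority_count / total

print(f'Always predicting {majority_label}: {round(baseline * 100, 2)}%')
```

Because the classes are imbalanced, overall accuracy can hide weak performance on the smaller classes, so per-class metrics are worth checking as well.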
