# Exercise - Classification of Text Documents using Perceptron

The `fetch_20newsgroups` function in `sklearn.datasets` provides a corpus of text documents grouped by category. The code in this notebook loads the datasets in the required format. The task is to classify the documents into their appropriate categories using a Perceptron.

The `fetch_20newsgroups` function accepts categories such as the following:  
alt.atheism, talk.religion.misc, comp.graphics, sci.space  
The subset can be `train` or `test`.  

### Perform the following operations:
1. Fetch two datasets, with 2 categories. One is the training dataset and the other is the test dataset. Each dataset contains the data and its labels. The random state should be 50.
2. Use the TF-IDF vectorizer to compute the TF-IDF features of the training and testing datasets.
3. Obtain the TF-IDF weight of every word in the vocabulary of the training dataset. From these weights, find the following:
   - The total number of words
   - The words having the lowest weight and the highest weight
   - All the words having a weight between 0.00045 and 0.006
4. Train your Perceptron model on the transformed training dataset and predict the output for the testing dataset. Obtain the confusion matrix and classification report for these predictions.
5. Perform K-fold cross-validation scoring on the entire dataset (training + testing).

In [16]:
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn import metrics

In [24]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# Strip message metadata so the classifier sees only the body text.
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=50,
                                remove=remove)

In [25]:
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

In [26]:
y_train, y_test = data_train.target, data_test.target

print(y_train, y_test)

[2 3 2 ... 1 0 2] [2 1 1 ... 3 1 1]


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
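
Step 3 of the exercise (vocabulary weights) is not worked out in the cells here, so the following is a minimal sketch of one way to approach it. It makes two assumptions: a small toy corpus (`docs`) stands in for `data_train.data` so the cell runs offline, and a word's "weight" is taken to be its mean TF-IDF score over all documents. Other definitions, such as the vectorizer's `idf_` values, are equally plausible readings of the task.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for data_train.data (an assumption, for a quick offline run).
docs = [
    "the rocket reached orbit around the moon",
    "the shuttle launch was delayed by weather",
    "rendering a sphere with ray tracing",
    "graphics cards accelerate image rendering",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Terms ordered by column index, so terms[i] describes column i of X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Assumed definition of a word's weight: its mean TF-IDF score over all documents.
weights = np.asarray(X.mean(axis=0)).ravel()

total_words = len(terms)
lowest_word = terms[int(weights.argmin())]
highest_word = terms[int(weights.argmax())]

# Bounds taken from the exercise; on a toy corpus the selection may be empty.
low, high = 0.00045, 0.006
words_in_range = [t for t, w in zip(terms, weights) if low <= w <= high]

print("total words:", total_words)
print("lowest weight:", lowest_word, "| highest weight:", highest_word)
print("words in range:", words_in_range)
```

To apply this to the notebook's data, replace `docs` with `data_train.data`; the rest of the sketch does not depend on the corpus.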

In [28]:
classifier = Perceptron(max_iter=20)
model = classifier.fit(X_train, y_train)

In [29]:
obtained_y = model.predict(X_test)

In [30]:
metrics.confusion_matrix(y_test, obtained_y)

array([[193,  14,  24,  88],
       [ 10, 340,  23,  16],
       [ 23,  28, 314,  29],
       [ 64,  14,  17, 156]], dtype=int64)
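
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. As a quick check on how to read it, the sketch below derives per-class precision, recall, and F1 directly from the matrix printed above, using only NumPy. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[193,  14,  24,  88],
               [ 10, 340,  23,  16],
               [ 23,  28, 314,  29],
               [ 64,  14,  17, 156]])

true_positives = np.diag(cm)

# Recall: correct predictions divided by the number of true members of each class (row sums).
recall = true_positives / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions of each class (column sums).
precision = true_positives / cm.sum(axis=0)

f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row sums give the support of each class, so they should match the `support` column of a classification report computed from the same predictions.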

In [31]:
print(metrics.classification_report(y_test, obtained_y))

             precision    recall  f1-score   support

          0       0.67      0.61      0.63       319
          1       0.86      0.87      0.87       389
          2       0.83      0.80      0.81       394
          3       0.54      0.62      0.58       251

avg / total       0.75      0.74      0.74      1353
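
Step 5 (K-fold cross-validation) is not shown in the cells above. The following is a hedged sketch of one possible approach, not the notebook's own solution. It assumes a toy labelled corpus (`docs`, `labels`) in place of the combined training and testing data, and it wraps the vectorizer and the Perceptron in a pipeline so that TF-IDF statistics are refitted on each training fold. The fold count of 4 and the `random_state` values are illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (an assumption). With the notebook's data this would be
#   docs = data_train.data + data_test.data
#   labels = np.concatenate([data_train.target, data_test.target])
docs = [
    "the rocket reached orbit around the moon",
    "the shuttle launch was delayed by weather",
    "astronauts repaired the space telescope",
    "the probe sent images of the planet mars",
    "rendering a sphere with ray tracing",
    "graphics cards accelerate image rendering",
    "the polygon mesh was textured and shaded",
    "anti-aliasing smooths jagged pixel edges",
]
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The pipeline refits the vectorizer inside every fold, which keeps
# information from the held-out documents out of the TF-IDF statistics.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    Perceptron(max_iter=20, random_state=50),
)

cv = KFold(n_splits=4, shuffle=True, random_state=50)
scores = cross_val_score(pipeline, docs, labels, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Cross-validating the pipeline rather than a pre-transformed matrix is a design choice: fitting `TfidfVectorizer` once on the full dataset would let every fold's vocabulary and IDF values depend on its own held-out documents.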

