# Text Classification ( Naive Bayes )

### Loading the dataset
20newsgroup dataset will be downloaded and divided in training(60%) and test(40%) dataset. [source code for fetching the dataset](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/twenty_newsgroups.py) (see the fetch_20newgroup function)

In [1]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(subset='all',shuffle=True,random_state=42,data_home='scikit_learn_data/')
twenty_train = fetch_20newsgroups(subset='train',shuffle=True,random_state=42,data_home='scikit_learn_data/')
twenty_test = fetch_20newsgroups(subset='test',shuffle=True,random_state=42,data_home='scikit_learn_data/')

print("no of examples:", len(dataset.data))
print("training examples:",len(twenty_train.data))
print("test examples:", len(twenty_test.data))

no of examples: 18846
training examples: 11314
test examples: 7532


### Extracting the Features from text data
Two different type of features will be used. [source code for feature extraction](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py) (see the CounVectorizer and TfidfTransformer class)  
For each example
1. count vectorizer = word_id: no of words present
2. tfidf transformer = word_id: tfidf value

Note:  
1. fit_transform: fit acording to the data provided and than tranform text into vector as mentioned above
2. transform: transorm the data into vector as already fitted

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# tokenizing the text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_test_counts = count_vect.transform(twenty_test.data)

# calculating tfidf (Term_frequency * inverse_document_frequency)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

### Generating the Naive Bayes Model
[Naive Bayes model](http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) with 
1. smoothing prior alpha =1(in each document add one instance of every word) and 
2. fit_prior = false ( initial probability of classes will be taken equal)  
[source code for naive bayes](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py) (see the MultinomialNB class)

Two models are generated for different typr of features.


In [3]:
from sklearn.naive_bayes import MultinomialNB
clf_naive_count = MultinomialNB(alpha=1,fit_prior='false').fit(X_train_counts, twenty_train.target)
clf_naive_tfidf = MultinomialNB(alpha=1,fit_prior='false').fit(X_train_tfidf, twenty_train.target)

### Save the trained model
saving the trained model on the disk

In [4]:
#ckeck for directory and make them
import os
naive_count_dir = "trained_models/naive_count"
naive_tfidf_dir = "trained_models/naive_tfidf"
naive_count_path = os.path.join(naive_count_dir,'Naive_count.pkl')
naive_tfidf_path = os.path.join(naive_tfidf_dir,'Naive_tfidf.pkl')
if not os.path.exists(naive_count_dir):
    os.makedirs(naive_count_dir)
if not os.path.exists(naive_tfidf_dir):
    os.makedirs(naive_tfidf_dir)
    
# save the model
import joblib
joblib.dump(clf_naive_count,naive_count_path)
joblib.dump(clf_naive_tfidf,naive_tfidf_path)


['trained_models/naive_tfidf/Naive_tfidf.pkl',
 'trained_models/naive_tfidf/Naive_tfidf.pkl_01.npy',
 'trained_models/naive_tfidf/Naive_tfidf.pkl_02.npy',
 'trained_models/naive_tfidf/Naive_tfidf.pkl_03.npy',
 'trained_models/naive_tfidf/Naive_tfidf.pkl_04.npy',
 'trained_models/naive_tfidf/Naive_tfidf.pkl_05.npy']

### Predict the class of an example

In [5]:
test_example = ["my laptop is fantastic.buy it today and get free offer.best cheap quality discount"]
test_example_count = count_vect.transform(test_example)
probs = clf_naive_count.predict_proba(test_example_count)[0] # alternate: predict_log_proba

for i in range(len(twenty_train.target_names)):
    print(twenty_train.target_names[i],":      ",probs[i])
    
print("\ncategory: ",twenty_train.target_names[clf_naive_count.predict(test_example_count)[0]])

alt.atheism :       9.69249730763e-07
comp.graphics :       0.00157018218978
comp.os.ms-windows.misc :       2.72851955035e-08
comp.sys.ibm.pc.hardware :       0.0248204115856
comp.sys.mac.hardware :       0.221292944274
comp.windows.x :       2.60102082051e-06
misc.forsale :       0.695596318096
rec.autos :       0.0031841460353
rec.motorcycles :       0.0216113398158
rec.sport.baseball :       9.36946876277e-06
rec.sport.hockey :       8.56827655908e-06
sci.crypt :       0.0185170680563
sci.electronics :       0.0107097204999
sci.med :       0.000185816821492
sci.space :       4.83882839364e-05
soc.religion.christian :       0.000108325618836
talk.politics.guns :       0.000326284499462
talk.politics.mideast :       1.59245425214e-05
talk.politics.misc :       0.00195445763946
talk.religion.misc :       3.71367412366e-05

category:  misc.forsale


### Load a txt file and predict its class
Classifing the (About the Research Topic) section on "[computational neuroscience of deep brain simulation](http://journal.frontiersin.org/researchtopic/5705/computational-neuroscience-of-deep-brain-stimulation)"  
As expected both the models classify it in science (medician) category

In [6]:
file = open('neuroscience.txt')
file_data = file.read()
file_data_count = count_vect.transform([file_data])
file_data_tfidf = tfidf_transformer.transform(file_data_count)
print("\ncount_category: ",twenty_train.target_names[clf_naive_count.predict(file_data_count)[0]])
print("\ntfidf_category: ",twenty_train.target_names[clf_naive_tfidf.predict(file_data_tfidf)[0]])


count_category:  sci.med

tfidf_category:  sci.med


### Evaluation of the performance on the test dataset
As expected tfidf features gives some better accuracy

In [7]:
import numpy as np
naive_count_predicted = clf_naive_count.predict(X_test_counts)
naive_tfidf_predicted = clf_naive_tfidf.predict(X_test_tfidf)

print("\ncount_accuracy:",np.mean(naive_count_predicted == twenty_test.target))
print("\ntfidf_accuracy:",np.mean(naive_tfidf_predicted == twenty_test.target))


count_accuracy: 0.772835900159

tfidf_accuracy: 0.77389803505
