# Text Analysis

Hi guys, welcome to [Tirendaz Academy](https://youtube.com/c/tirendazacademy) 😀
<br>
In this notebook, I'm going to walk through text analysis: classifying posts from four categories of the 20 Newsgroups dataset with scikit-learn.
<br>
Happy learning 🐱‍🏍

In [9]:
categories=['rec.motorcycles','rec.sport.baseball','comp.graphics','rec.sport.hockey']

In [10]:
from sklearn.datasets import load_files

In [32]:
twenty_train=load_files('Data/20newsbydate/20news-bydate-train/',
                       categories=categories,
                       shuffle=True,
                       random_state=42,
                       encoding='utf-8',
                       decode_error='ignore')

In [33]:
type(twenty_train)

sklearn.utils.Bunch

In [34]:
twenty_train.target_names

['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']

In [14]:
len(twenty_train.data)

2379

In [15]:
twenty_train.target[:10]

array([1, 1, 1, 3, 2, 2, 3, 1, 0, 0])

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()

In [17]:
X_train_counts=count_vect.fit_transform(twenty_train.data)

In [18]:
X_train_counts.shape

(2379, 32550)
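
Each of the 2379 rows above is one document and each of the 32550 columns is one vocabulary term. As a rough illustration of how such counts are produced, here is a minimal pure-Python sketch that uses the same default token pattern that appears in the `CountVectorizer` repr later in this notebook. The helper names `tokenize` and `count_matrix` are made up for this example and are not part of scikit-learn; the real vectorizer returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def count_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

docs = ['brake-lamp is good', 'this computer is fast']
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Note that the hyphen in `brake-lamp` is not a word character, so the pattern splits it into two tokens, and single-character tokens are dropped.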

## The tf–idf technique

In [19]:
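from sklearn.feature_extraction.text import TfidfTransformer
# use_idf=False keeps only the L2-normalized term frequencies (tf); the pipeline further down uses full tf-idf
tf_transformer=TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf=tf_transformer.transform(X_train_counts)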

In [20]:
X_train_tf.shape

(2379, 32550)
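
The transformer keeps the shape and only rescales the values. To make the weighting concrete, here is a minimal pure-Python sketch on a made-up count matrix (the three-term vocabulary and all helper names are assumptions for illustration). It L2-normalizes the raw counts, which is what `TfidfTransformer(use_idf=False)` does with its default `norm='l2'`, and then applies the smoothed idf weight `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents for `smooth_idf=True`.

```python
import math

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

def tf_rows(counts):
    """Term frequencies only: L2-normalized counts."""
    return [l2_normalize(row) for row in counts]

def smooth_idf(counts):
    """Smoothed inverse document frequency for each column (term)."""
    n_docs = len(counts)
    idf = []
    for j in range(len(counts[0])):
        df = sum(1 for row in counts if row[j] > 0)  # documents containing term j
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)
    return idf

def tfidf_rows(counts):
    """Multiply counts by idf, then L2-normalize each row."""
    idf = smooth_idf(counts)
    return [l2_normalize([v * w for v, w in zip(row, idf)]) for row in counts]

# Toy counts for the assumed vocabulary ['brake', 'computer', 'is'].
counts = [[2, 0, 1],
          [0, 1, 1]]

idf = smooth_idf(counts)
tf = tf_rows(counts)
tfidf = tfidf_rows(counts)
print('idf   :', idf)
print('tf    :', tf)
print('tf-idf:', tfidf)
```

In this toy example, `is` appears in every document, so it gets the smallest idf and its weight shrinks relative to the plain tf value.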

## Building the model

In [21]:
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB().fit(X_train_tf, twenty_train.target)

In [22]:
docs_new=['brake-lamp is good','this computer is fast']

In [23]:
X_new_count=count_vect.transform(docs_new)
X_new_tf=tf_transformer.transform(X_new_count)

## Predicting new documents

In [24]:
predicted=clf.predict(X_new_tf)

In [27]:
for doc, category in zip(docs_new, predicted):
    print(f'{doc!r}=>{twenty_train.target_names[category]}')

'brake-lamp is good'=>rec.motorcycles
'this computer is fast'=>comp.graphics


## Pipeline

In [28]:
from sklearn.pipeline import Pipeline

In [29]:
text_clf=Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf',MultinomialNB())])

In [30]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [35]:
twenty_test=load_files('Data/20newsbydate/20news-bydate-test/',
                       categories=categories,
                       shuffle=True,
                       random_state=42,
                       encoding='utf-8',
                       decode_error='ignore')

In [36]:
docs_test=twenty_test.data

## Evaluating the model on the test set

In [37]:
predicted=text_clf.predict(docs_test)

In [39]:
import numpy as np
# Accuracy: the fraction of test documents whose predicted label matches the true label
np.mean(predicted==twenty_test.target)

0.9576753000631711

## Linear SVM

In [40]:
from sklearn.linear_model import SGDClassifier

In [41]:
text_clf=Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf',SGDClassifier(loss='hinge',
                                      penalty='l2',
                                      alpha=1e-3,
                                      random_state=42,
                                      max_iter=5,
                                      tol=None))])

In [42]:
text_clf.fit(twenty_train.data, twenty_train.target)
predicted=text_clf.predict(docs_test)
np.mean(predicted==twenty_test.target)

0.9696778269109286

In [43]:
from sklearn import metrics
metrics.confusion_matrix(twenty_test.target, predicted)

array([[382,   2,   5,   0],
       [  3, 393,   1,   1],
       [  6,   3, 369,  19],
       [  1,   1,   6, 391]], dtype=int64)
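
In this matrix, rows are the true classes and columns are the predicted classes, in the order of `twenty_train.target_names`. As a small sketch of how to read it, the snippet below recomputes the overall accuracy and the per-class precision and recall directly from the numbers printed above; the variable names are chosen for this example only.

```python
# Class order as printed by twenty_train.target_names.
labels = ['comp.graphics', 'rec.motorcycles',
          'rec.sport.baseball', 'rec.sport.hockey']

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [[382,   2,   5,   0],
      [  3, 393,   1,   1],
      [  6,   3, 369,  19],
      [  1,   1,   6, 391]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    # Recall: correct predictions divided by all true members of the class.
    recall[name] = cm[i][i] / sum(cm[i])
    # Precision: correct predictions divided by all predictions of the class.
    precision[name] = cm[i][i] / sum(row[i] for row in cm)
    print(f'{name:20s} precision={precision[name]:.3f} recall={recall[name]:.3f}')

print(f'accuracy={accuracy:.4f}')
```

The largest off-diagonal entry is the 19 baseball posts predicted as hockey, which is why `rec.sport.baseball` has the lowest recall of the four classes.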

Don't forget to follow us on [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [GitHub](http://github.com/tirendazacademy) | [LinkedIn](https://www.linkedin.com/in/tirendaz-academy) | [Kaggle](https://www.kaggle.com/tirendazacademy) 😎