## 1. Text-classification using Count-vectorized Logistic Regression

#### (1) Prepare data (20 Newsgroups)

In [1]:
import warnings
# Silences every warning, including any ConvergenceWarning raised by LogisticRegression.
warnings.filterwarnings('ignore')

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


train_news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=156)
X_train = train_news.data
y_train = train_news.target

test_news = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), random_state=156)
X_test = test_news.data
y_test = test_news.target

In [2]:
print(f'>>> size of train : {len(X_train)}  |  size of test : {len(X_test)}\n\n')
print(train_news.DESCR[:1080])
print('\n>>> names of Classes : \n', train_news.target_names)

>>> size of train : 11314  |  size of test : 7532


.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
  

In [3]:
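# Note: target_names[0] is simply the first class name, not the label of data[0].
# The actual topic of the first document is train_news.target_names[train_news.target[0]].
print(f'>>> Topic : {train_news.target_names[0]}')
print(train_news.data[0])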

>>> Topic : alt.atheism


What I did NOT get with my drive (CD300i) is the System Install CD you
listed as #1.  Any ideas about how I can get one?  I bought my IIvx 8/120
from Direct Express in Chicago (no complaints at all -- good price & good
service).

BTW, I've heard that the System Install CD can be used to boot the mac;
however, my drive will NOT accept a CD caddy is the machine is off.  How can
you boot with it then?

--Dave



#### (2) Vectorize data (using CountVectorizer)

In [4]:
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)  # the vectorizer learns its vocabulary from the text only; labels are not used

# Vectorizing X_train & X_test
X_train_cnt_vect = cnt_vect.transform(X_train)
X_test_cnt_vect = cnt_vect.transform(X_test)

print(X_train_cnt_vect.shape)
print(X_test_cnt_vect.shape)

(11314, 101631)
(7532, 101631)
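
Both matrices have 101631 columns because the vocabulary is learned once from the training text and then reused for the test text. The sketch below imitates that fit/transform split in plain Python on a hypothetical two-document corpus. It is only an illustration of the idea, not scikit-learn's implementation; the helper names `tokenize` and `count_vector` are made up for this example, and the regular expression mirrors `CountVectorizer`'s default `token_pattern`.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not taken from 20 Newsgroups).
docs = ["the cat sat", "the cat saw the dog"]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": the sorted set of training tokens becomes the column layout.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    # "transform": one count per vocabulary term; unseen tokens are dropped.
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocab]

matrix = [count_vector(doc) for doc in docs]
print(vocab)                     # column order
print(matrix)                    # one row per document
print(count_vector("the bird"))  # "bird" is not in the vocabulary
```

A token that never appeared during fitting, such as `bird` here, contributes nothing to the vector, which is why test documents can never add columns.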


#### (3) Generate model (using Logistic Regression) & Evaluate (using accuracy_score)

In [5]:
lr_clf = LogisticRegression()
lr_clf.fit( X_train_cnt_vect, y_train )
pred = lr_clf.predict(X_test_cnt_vect)

print(pred)
print()
print(f'accuracy_score of Countvectorized Logistic Regression : {accuracy_score(y_test, pred):.3f}')

[ 4 11  2 ...  6 17  9]

accuracy_score of Countvectorized Logistic Regression : 0.616


## 2. Text-classification using TF-IDF-vectorized Logistic Regression

#### (1) Prepare data (20 Newsgroups) - same data

#### (2) Vectorize data (using TfidfVectorizer)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

print(X_train_tfidf_vect.shape, X_test_tfidf_vect.shape)

(11314, 101631) (7532, 101631)
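
The shapes match the count-vectorized matrices because the vocabulary is the same; only the cell values change. As a rough sketch of what changes, the plain-Python example below applies the weighting that `TfidfVectorizer` documents for its default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The corpus and the helper names are hypothetical, and the code is an illustration rather than the library's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (not taken from 20 Newsgroups).
docs = ["the cat sat", "the cat saw the dog"]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in toks for toks in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed to match smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    # Term count times IDF, then scale the row to unit L2 length.
    counts = Counter(tokens)
    raw = [counts.get(term, 0) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_vector(toks) for toks in tokenized]
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>4} : {weight:.3f}")
```

In the first document, `the` and `sat` each occur once, yet `sat` receives the larger weight because it appears in fewer documents. Down-weighting ubiquitous terms in this way is the usual motivation for preferring TF-IDF over raw counts.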


#### (3) Generate model (using Logistic Regression) & Evaluate (using accuracy_score)

In [7]:
lr_clf = LogisticRegression()
lr_clf.fit( X_train_tfidf_vect, y_train )
pred = lr_clf.predict(X_test_tfidf_vect)

print(pred)
print()
print(f'accuracy_score of TF-IDF-vectorized Logistic Regression : {accuracy_score(y_test, pred):.3f}')

[ 5  1  1 ...  6 17  9]

accuracy_score of TF-IDF-vectorized Logistic Regression : 0.678


## 3. Text-classification using TF-IDF-vectorized Logistic Regression
#### ( + Parameter tuning - stopwords, n-gram )

#### (1) stop_words='english', ngram_range=(1,3), max_df=300

In [8]:
tfidf_vect = TfidfVectorizer( stop_words='english', ngram_range=(1,3), max_df=300 )
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

print('shape of vectorized train & test :', X_train_tfidf_vect.shape, X_test_tfidf_vect.shape)
print()

lr_clf = LogisticRegression()
lr_clf.fit( X_train_tfidf_vect, y_train )
pred = lr_clf.predict(X_test_tfidf_vect)

print('predicted labels :', pred)
print()
print(f'accuracy_score of TF-IDF-vectorized Logistic Regression : {accuracy_score(y_test, pred):.3f}')

shape of vectorized train & test : (11314, 1971091) (7532, 1971091)

predicted labels : [ 4 11  1 ...  6 17  7]

accuracy_score of TF-IDF-vectorized Logistic Regression : 0.687


#### (2) stop_words='english', ngram_range=(1,2), max_df=300

In [9]:
tfidf_vect = TfidfVectorizer( stop_words='english', ngram_range=(1,2), max_df=300 )
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

print('shape of vectorized train & test :', X_train_tfidf_vect.shape, X_test_tfidf_vect.shape)
print()

lr_clf = LogisticRegression()
lr_clf.fit( X_train_tfidf_vect, y_train )
pred = lr_clf.predict(X_test_tfidf_vect)

print('predicted labels :', pred)
print()
print(f'accuracy_score of TF-IDF-vectorized Logistic Regression : {accuracy_score(y_test, pred):.3f}')

shape of vectorized train & test : (11314, 943453) (7532, 943453)

predicted labels : [ 4 11  1 ...  6 17  7]

accuracy_score of TF-IDF-vectorized Logistic Regression : 0.690
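
The column count grows from 101631 with the default unigrams to 943453 with `ngram_range=(1,2)` and 1971091 with `ngram_range=(1,3)`, because every contiguous run of up to n tokens becomes its own feature. An integer `max_df=300` additionally drops terms that occur in more than 300 training documents. The plain-Python sketch below shows how quickly the number of n-grams grows for a single short token list. The helper `word_ngrams` and the token list are hypothetical and only illustrate the idea.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical tokens, as they might look after stop-word removal.
tokens = ["system", "install", "cd", "boot"]

for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    grams = word_ngrams(tokens, *ngram_range)
    print(ngram_range, len(grams), grams)
```

Most higher-order n-grams occur in very few documents, which may explain why moving from bigrams to trigrams doubled the feature count here without improving accuracy (0.690 versus 0.687).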


## 4. Parameter tuning using GridSearchCV

In [10]:
from sklearn.model_selection import GridSearchCV

param = { 'C':[0.01, 0.1, 1, 5, 10, 15, 20] }
grid_cv_lr = GridSearchCV(estimator=lr_clf , param_grid=param, cv=3, scoring='accuracy', verbose=1)
grid_cv_lr.fit( X_train_tfidf_vect, y_train )

print('best C parameter :', grid_cv_lr.best_params_)

Fitting 3 folds for each of 7 candidates, totalling 21 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:  5.3min finished


best C parameter : {'C': 15}


In [11]:
pred = grid_cv_lr.predict(X_test_tfidf_vect)

print('predicted labels :', pred)
print()
print(f'accuracy_score of TF-IDF-vectorized Logistic Regression (with local best C) : {accuracy_score(y_test, pred):.3f}')

predicted labels : [ 3 11  1 ...  7 17  7]

accuracy_score of TF-IDF-vectorized Logistic Regression (with local best C) : 0.705
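
The grid search above tunes only `C`, and it runs on a matrix that was vectorized from the whole training set before the folds were split, so the vectorizer has already seen each validation fold. One possible alternative, sketched below, wraps the vectorizer and the classifier in a `Pipeline` so that `GridSearchCV` refits both on every fold and can search vectorizer settings at the same time. The toy corpus, the step names `tfidf` and `clf`, the grid values, and `max_iter=1000` are assumptions made for this illustration. On 20 Newsgroups the same pattern would be fitted on `X_train` and `y_train`, and it would take far longer than the 5.3 minutes reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train.
texts = [
    "the graphics card renders images fast",
    "new graphics driver improves image rendering",
    "image rendering needs a fast graphics card",
    "the driver update fixed graphics rendering",
    "the team won the hockey game last night",
    "hockey players scored in the final game",
    "the game ended after the team scored",
    "last night the hockey team played well",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are refit together inside each CV fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)

print("best parameters :", grid.best_params_)
print("prediction :", grid.predict(["the hockey game was fast"]))
```

Because the fitted pipeline accepts raw strings, `grid.predict` can be called directly on unvectorized text such as `X_test`.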
