### 20 뉴스그룹 분류

In [71]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import fetch_20newsgroups

In [2]:

news = fetch_20newsgroups(subset='all', random_state=2023)

- 데이터 탐색

In [3]:
print(news.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [8]:
print(news.data[10])

Organization: Penn State University
From: <RSM2@psuvm.psu.edu>
Subject: US-Made M-B SUV
Lines: 10

Mercedes-Benz announced yesterday its plans to begin building sport-utility
vehicles in the US by 1997.  They are targeted at the Jeep Grand Cherokee
et al. and will reportedly sell for less than $30,000.

Did anyone see a picture?   Is it the G-wagon (Gelaendewagen) currently
available in Europe (and in the US by grey-market) or is it an entirely new
vehicle?  Any details would be appreciated.

Dick Meyer
Applied Research Laboratory, Penn State



In [11]:
from pprint import pprint
pprint(news.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [12]:
np.unique(news.target, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([799, 973, 985, 982, 963, 988, 975, 990, 996, 994, 999, 991, 984,
        990, 987, 997, 910, 940, 775, 628], dtype=int64))

- 데이터셋 추출

In [66]:
train_news = fetch_20newsgroups(
    subset='train', random_state=2023, remove=('headers','quotes','footers')
) 
X_train = train_news.data
y_train = train_news.target

In [67]:
test_news = fetch_20newsgroups(
    subset='test', random_state=2000, remove=('headers', 'quotes', 'footers')
)
X_test = test_news.data
y_test = test_news.data

In [49]:
import re

X_train = list(map(lambda x: re.sub('[^A-Za-z]', ' ', x), X_train))

In [51]:
X_test = list(map(lambda x: re.sub('[^A-Za-z]', ' ', x), X_test))

In [68]:
len(X_train), len(X_test)

(11314, 7532)

##### 피쳐 벡터화 + 머신러닝 모델 학습/평가

- Case 1. CountVectorizer + LogisiticRegression

In [69]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((11314, 101322), (7532, 101322))

In [70]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=2023, max_iter=500)
%time lr.fit(X_train_cv, y_train)
lr.score(X_test_cv, y_test)


CPU times: total: 1min 21s
Wall time: 1min 13s


0.0

In [55]:
lr.score(X_test_cv, y_test)

0.0

- Case 2.TifidfVectorizer + LogisticRegression

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvect = TfidfVectorizer(stop_words='english')
tvect.fit(X_train)
X_train_tv = tvect.transform(X_train)
X_test_tv = tvect.transform(X_test)
X_train_tv.shape, X_test_tv.shape

((11314, 79696), (7532, 79696))

In [57]:
%time lr.fit(X_train_tv, y_train)
lr.score(X_test_tv, y_test)

CPU times: total: 36.9 s
Wall time: 34.3 s


0.0

- Case 3. TfidfVectorizer(N-gram) + LogisticRegression

In [58]:
tvect2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tvect2.fit(X_train)
X_train_tv2 = tvect2.transform(X_train)
X_test_tv2 = tvect2.transform(X_test)
X_train_tv2.shape, X_test_tv2.shape


((11314, 899236), (7532, 899236))

In [25]:
lr = LogisticRegression(random_state=2023, max_iter=500)
%time lr.fit(X_train_tv2, y_train)
lr.score(X_test_tv2, y_test)

CPU times: total: 7min 10s
Wall time: 6min 16s


0.0

- Pipeline / GridSearchCV

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('tvect', TfidfVectorizer(stop_words='english')), 
    ('LR', LogisticRegression(random_state=2023, max_iter=500))
    ])
params = {
    'tvect__max_df' : [300, 700],
    'LR__C' : [1, 10]
}

grid_pipe = GridSearchCV(pipeline, params, scoring='accuracy', cv=3, n_jobs=-1)

In [31]:
%time grid_pipe.fit(X_train, y_train)

CPU times: total: 1min 47s
Wall time: 6min 42s


In [33]:
grid_pipe.best_params_

{'LR__C': 10, 'tvect__max_df': 700}

In [34]:
grid_pipe.best_estimator_.score(X_test, y_test)

0.0