# 20 Newsgroups Classification

In [3]:
import numpy as np 
import pandas as pd 

In [4]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', random_state=2021)

In [5]:
print(news.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

# Target class values and distribution

In [8]:
pd.Series(news.target).value_counts().sort_index()

0     799
1     973
2     985
3     982
4     963
5     988
6     975
7     990
8     996
9     994
10    999
11    991
12    984
13    990
14    987
15    997
16    910
17    940
18    775
19    628
dtype: int64
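
The integer labels above are indices into the dataset's `target_names` list, so the counts are easier to read with names attached. The following is a minimal sketch of that mapping. The helper `class_distribution` and the toy labels and names are hypothetical, chosen so the snippet runs without downloading the data; with the notebook's objects one would presumably pass `news.target` and `news.target_names`.

```python
from collections import Counter

def class_distribution(targets, target_names):
    """Return (class name, count) pairs ordered by class index."""
    counts = Counter(int(t) for t in targets)
    return [(target_names[i], counts.get(i, 0)) for i in range(len(target_names))]

# Hypothetical toy labels and names, standing in for news.target / news.target_names.
toy_targets = [0, 2, 1, 2, 2, 0]
toy_names = ["group_a", "group_b", "group_c"]

distribution = class_distribution(toy_targets, toy_names)
for name, count in distribution:
    print(f"{name:<12}{count}")
```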

In [9]:
print(news.data[0])

From: dagibbs@quantum.qnx.com (David Gibbs)
Subject: Re: Countersteering sans Hands
Organization: QNX Software Systems, Ltd.
Lines: 22

In article <1993Apr20.203344.8417@cs.cornell.edu> karr@cs.cornell.edu (David Karr) writes:
>In article <Clarke.6.735328328@bdrc.bd.com> Clarke@bdrc.bd.com (Richard Clarke) writes:
>>So how do I steer when my hands aren't on the bars? (Open Budweiser in left 
>>hand, Camel cigarette in the right, no feet allowed.) 
>
>>If I lean, and the 
>>bike turns, am I countersteering?
>
>No, the bars would turn only *toward* the direction of turn in
>no-hands steering.

Just in case the original poster was looking for a serious answer,
I'll supply one.

Yes, even when steering no hands you do something quite similar
to countersteering.  Basically to turn left, you to a quick wiggle
of the bike to the right first, causing a counteracting lean to
occur to the left.  It is a lot more difficult to do on a motorcycle
than a bicycle though, because of the extra weight. 

In [10]:
train_news = fetch_20newsgroups(subset='train', random_state=2021, remove=('headers', 'footers', 'quotes'))
X_train = train_news.data
y_train = train_news.target

In [13]:
print(X_train[1], y_train[1])

]Is it possible to do a "wheelie" on a motorcycle with shaft-drive?

yes.
 8


In [14]:
test_news = fetch_20newsgroups(subset='test', random_state=2021, remove=('headers', 'footers', 'quotes'))
X_test = test_news.data
y_test = test_news.target

In [16]:
len(X_train), len(X_test)

(11314, 7532)

### Feature vectorization
- Case 1) Count Vectorizer

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
cvet = CountVectorizer()
cvet.fit(X_train)
X_train_count = cvet.transform(X_train)
X_test_count = cvet.transform(X_test)

In [25]:
X_train_count.shape

(11314, 101631)

In [26]:
X_test_count.shape

(7532, 101631)
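
Both matrices have 101631 columns because the vocabulary is learned from `X_train` only, and `transform` maps any document onto that fixed vocabulary, ignoring tokens it has not seen. The plain-Python sketch below illustrates the idea under simplifying assumptions: the toy documents and the helper names are hypothetical, the output is a dense list of lists rather than a sparse matrix, and tokenization only mirrors `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters).

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical toy corpus.
train_docs = ["the bike leans left", "the bike turns right"]
test_docs = ["a car turns left and left again"]

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)

print(sorted(vocab))   # columns come from the training documents only
print(test_counts)     # "car", "and", "again" have no column and are ignored
```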

- Case 2) TFIDF Vectorizer

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
tvet = TfidfVectorizer()

In [31]:
tvet.fit(X_train)
X_train_tfidf = tvet.transform(X_train)
X_test_tfidf = tvet.transform(X_test)

In [32]:
X_train_tfidf.shape

(11314, 101631)

In [33]:
X_test_tfidf.shape

(7532, 101631)
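
The TF-IDF matrices have the same shape as the count matrices because the vocabulary is identical; only the cell values change. As a rough illustration, the sketch below reweights raw counts using what I understand to be `TfidfVectorizer`'s default scheme: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper `tfidf_rows` and the toy counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Reweight a dense count matrix with smoothed IDF and L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in count_rows:
        scores = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(s * s for s in scores))
        weighted.append([s / norm if norm else 0.0 for s in scores])
    return weighted

# Hypothetical counts: term 0 occurs in both documents, terms 1 and 2 in one each.
counts = [
    [1, 1, 0],
    [1, 0, 2],
]
tfidf = tfidf_rows(counts)
for row in tfidf:
    print([round(v, 3) for v in row])
```

In the first row, the term that occurs in only one document ends up with a larger weight than the term shared by both, even though their raw counts are equal.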

### Classification with Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression

In [36]:
lr = LogisticRegression()

In [37]:
lr.fit(X_train_count, y_train)
pred_count = lr.predict(X_test_count)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
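
The warning above indicates that the solver hit its iteration limit (`max_iter`, which defaults to 100) before converging, so the resulting model may not be fully optimized. One of the remedies the message itself suggests is raising `max_iter`. The sketch below shows that option on a tiny hypothetical corpus; the value 1000 is an illustrative assumption, not a setting tuned for this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for X_train / y_train.
docs = [
    "the bike leans into the turn",
    "countersteering a motorcycle",
    "the gpu renders the image",
    "rendering graphics on a gpu",
]
labels = [0, 0, 1, 1]

X = CountVectorizer().fit_transform(docs)

# max_iter=1000 is an illustrative value, not a tuned one.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# n_iter_ reports how many iterations the solver actually used;
# a value below max_iter means it stopped before reaching the limit.
print(clf.n_iter_)
```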


In [38]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred_count)

0.6068773234200744

- Data vectorized with TfidfVectorizer

In [39]:
lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)
pred_tfidf = lr.predict(X_test_tfidf)
accuracy_score(y_test, pred_tfidf)

0.6736590546999469
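
The vectorize-then-classify sequence used in the cells above can also be bundled into a single scikit-learn `Pipeline`, which keeps the vectorizer fitted on training text only and applies the same transformation at prediction time. This is a minimal sketch on a hypothetical toy corpus, not the configuration used in this notebook; the step names `"tfidf"` and `"lr"` and the `max_iter` value are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train, y_train, X_test, y_test.
train_docs = [
    "the bike leans into the turn",
    "countersteering a motorcycle",
    "the gpu renders the image",
    "rendering graphics on a gpu",
]
train_labels = [0, 0, 1, 1]
test_docs = ["a motorcycle turn", "gpu graphics"]
test_labels = [0, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                # fitted on the raw training text
    ("lr", LogisticRegression(max_iter=1000)),   # illustrative max_iter
])
pipe.fit(train_docs, train_labels)

pred = pipe.predict(test_docs)                   # raw text in, labels out
acc = accuracy_score(test_labels, pred)
print(acc)
```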

### Decision Tree

In [42]:
from sklearn.tree import DecisionTreeClassifier

In [43]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train_tfidf, y_train)
pred_dtc = dtc.predict(X_test_tfidf)
accuracy_score(y_test, pred_dtc)

0.3993627190653213

### Classification with Naive Bayes

In [47]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
pred_nb = nb.predict(X_test_tfidf)
accuracy_score(y_test, pred_nb)

0.6062134891131173

- Case 3) TFIDF with ngram_range=(1,2), stop_words='english', max_df=300

In [49]:
tvet3 = TfidfVectorizer(ngram_range=(1,2), stop_words='english', max_df=300)
tvet3.fit(X_train)
X_train_tfidf3 = tvet3.transform(X_train)
X_test_tfidf3 = tvet3.transform(X_test)

In [50]:
lr = LogisticRegression()
lr.fit(X_train_tfidf3, y_train)
pred3 = lr.predict(X_test_tfidf3)
accuracy_score(y_test, pred3)

0.6922464152947424

In [51]:
X_train_tfidf.shape, X_train_tfidf3.shape

((11314, 101631), (11314, 943453))
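
The jump from 101631 to 943453 columns comes from `ngram_range=(1,2)`, which adds every adjacent word pair as a feature, while `stop_words='english'` and `max_df=300` remove terms. Passed as an integer, `max_df=300` is an absolute count: terms that appear in more than 300 documents are dropped. The plain-Python sketch below illustrates both effects on hypothetical, already tokenized documents; the helper names are made up for this example and stop-word removal is left out for brevity.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with lengths in the inclusive ngram_range."""
    n_min, n_max = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(tokenized_docs, ngram_range=(1, 1), max_df=None):
    """Collect n-grams, dropping those found in more than max_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(ngrams(tokens, ngram_range)))
    return sorted(t for t, df in doc_freq.items() if max_df is None or df <= max_df)

# Hypothetical tokenized documents.
docs = [
    ["bike", "turns", "left"],
    ["bike", "leans", "left"],
    ["bike", "stops"],
]

unigrams = build_vocabulary(docs, ngram_range=(1, 1))
with_bigrams = build_vocabulary(docs, ngram_range=(1, 2))
filtered = build_vocabulary(docs, ngram_range=(1, 2), max_df=2)

print(len(unigrams), len(with_bigrams), len(filtered))
```

Adding bigrams doubles the toy vocabulary, while the `max_df` filter only removes the single term that occurs in all three documents. The same imbalance, on a much larger scale, would explain why the Case 3 matrix is wider than the plain TF-IDF one.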

- Case 4) Same as Case 3, with the Logistic Regression parameter C set to 10

In [52]:
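lr4 = LogisticRegression(C=10)
lr4.fit(X_train_tfidf3, y_train)
pred4 = lr4.predict(X_test_tfidf3)
# A bare %time on its own line times an empty statement, hence "Wall time: 0 ns" below.
# Putting %%time on the first line of the cell would time the fit and predict instead.
%time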

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Wall time: 0 ns


In [53]:
accuracy_score(y_test, pred4)


0.7010090281465746