In [1]:
import pandas as pd

## The 20 newsgroups dataset

The [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The data is organized into 20 different newsgroups, each corresponding to a different topic:

- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc

We will work on a subset of the data, using only 6 of the 20 available categories.

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
# choose, for example, 6 categories
categories = [
    'alt.atheism',
    'comp.windows.x',
    'rec.autos',
    'rec.sport.baseball',
    'sci.electronics',
    'sci.space',
]

# remove headers, footers and quoted replies so that only the message body is kept
train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
)

test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
)

In [4]:
train_data = pd.DataFrame({'text' : train['data'], 
                           'category' : train['target']})
train_data.head()

| | text | category |
|---|---|---|
| 0 | Benedikt Rosenau writes, with great authority:... | 0 |
| 1 | \nI don't understand this last statement about... | 2 |
| 2 | I'd like to compile X11r5 on a Sony NWS-1750 r... | 1 |
| 3 | \n\n\nHow do you know it's based on ignorance,... | 0 |
| 4 | \nmuch crap deleted\n\n\nDEAD WRONG! Last tim... | 3 |


In [5]:
test_data = pd.DataFrame({'text' : test['data'], 
                           'category' : test['target']})
test_data.head()

| | text | category |
|---|---|---|
| 0 | I can see it now emblazened across the evening... | 5 |
| 1 | The color of the board shows the composition o... | 4 |
| 2 | Regarding the feasability of retrieving the HS... | 5 |
| 3 | \nI believe Acker got a ring from his wife whe... | 3 |
| 4 | \n\nThe new Cruisers DO NOT have independent s... | 2 |


In [6]:
train['target_names']

['alt.atheism',
 'comp.windows.x',
 'rec.autos',
 'rec.sport.baseball',
 'sci.electronics',
 'sci.space']

In [7]:
# number of training documents per category
train_data.category.value_counts()

3    597
2    594
1    593
5    593
4    591
0    480
Name: category, dtype: int64

In [8]:
# number of test documents per category
test_data.category.value_counts()

3    397
2    396
1    395
5    394
4    393
0    319
Name: category, dtype: int64
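The classes are close to balanced, with label 0 somewhat smaller than the rest. As a rough reference point (an addition for context, not part of the original analysis), the sketch below computes the accuracy a classifier would get by always predicting the most frequent test label. The counts are copied from the output above; the variable names are illustrative.

```python
# Test-set class counts copied from the value_counts() output above
test_counts = {3: 397, 2: 396, 1: 395, 5: 394, 4: 393, 0: 319}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy of a trivial model that always predicts the majority label
baseline_accuracy = test_counts[majority_label] / total

print(total)                        # 2294
print(majority_label)               # 3
print(round(baseline_accuracy, 3))  # 0.173
```

Any trained model should score well above this value to be considered useful.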

In [9]:
# print one random training document from the sci.space category (label 5)
print(train_data[train_data.category==5].sample().iloc[0,0])




	They did the rollout already??!?  I am going to have to pay more
attention to the news.  Are any of the gifs headed for wuarchive??
 

Patrick




**Goal**: classify documents from the dataset by their topic.

## Training a Naive Bayes model

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [11]:
X_train = train_data.text
y_train = train_data.category

In [12]:
pipe = Pipeline(steps=[
    ('vect', TfidfVectorizer()), 
    ('clf', MultinomialNB()) 
])
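To see what this pipeline does end to end, here is a minimal sketch on a tiny hand-made corpus. The documents, labels, and the names `toy_docs`, `toy_labels`, and `toy_pipe` are hypothetical and only serve the illustration; the pipeline steps are the same as above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = space, label 1 = autos
toy_docs = [
    "rocket orbit launch",
    "orbit satellite rocket",
    "engine car wheels",
    "car brakes engine",
]
toy_labels = [0, 0, 1, 1]

toy_pipe = Pipeline(steps=[
    ('vect', TfidfVectorizer()),  # raw text -> TF-IDF weighted term matrix
    ('clf', MultinomialNB()),     # Naive Bayes classifier on those weights
])

# fit() runs the vectorizer and the classifier in sequence on raw strings
toy_pipe.fit(toy_docs, toy_labels)

# predict() applies the fitted vectorizer before calling the classifier
toy_pred = toy_pipe.predict(["rocket launch", "car engine"])
print(toy_pred)
```

Because the vectorizer lives inside the pipeline, the same vocabulary learned during training is reused at prediction time.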

In [13]:
params_dic =  {'vect__max_features' : [1000,2000,5000,10000],
               'vect__stop_words' : ['english', None],
               'vect__min_df' : [5,10,20,50],
               'vect__ngram_range' : [(1,1), (1,2),(1,3)],
               'vect__use_idf' : [True,False]}
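Of the parameters in this grid, `ngram_range` is perhaps the least obvious. A small sketch of its effect on the vocabulary, using `CountVectorizer` on a single made-up sentence (the sentence and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["new york city"]

# (1, 1): single words only; (1, 2): single words plus pairs of adjacent words
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(sorted(unigrams.vocabulary_))  # ['city', 'new', 'york']
print(sorted(uni_bi.vocabulary_))    # ['city', 'new', 'new york', 'york', 'york city']
```

Wider n-gram ranges grow the vocabulary quickly, which is one reason the grid also limits it with `max_features` and `min_df`.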

In [14]:
grid = GridSearchCV(pipe,params_dic,scoring='accuracy',cv=5, n_jobs=-1, verbose=True)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 192 candidates, totalling 960 fits


KeyboardInterrupt: 
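The candidate count in the log above follows directly from the grid: it is the product of the number of values tried for each parameter, and every candidate is fitted once per cross-validation fold. A short check of that arithmetic (the dictionary `grid_sizes` is an illustrative helper, not part of the notebook):

```python
import math

# Number of values listed for each parameter in params_dic
grid_sizes = {
    'vect__max_features': 4,
    'vect__stop_words': 2,
    'vect__min_df': 4,
    'vect__ngram_range': 3,
    'vect__use_idf': 2,
}
cv_folds = 5

n_candidates = math.prod(grid_sizes.values())
n_fits = n_candidates * cv_folds

print(n_candidates)  # 192
print(n_fits)        # 960
```

With 960 pipeline fits, an exhaustive search like this can take a long time, so trimming the grid is one way to shorten it.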

In [None]:
grid.best_score_

In [None]:
grid.best_params_

In [None]:
best_clf = grid.best_estimator_

In [None]:
# evaluate the model
X_test = test_data.text
y_test = test_data.category
y_test_pred = best_clf.predict(X_test)

In [None]:
confusion_matrix(y_test,y_test_pred)
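To help read the confusion matrix, here is a minimal sketch on made-up labels (the values of `y_true` and `y_pred` are hypothetical). In scikit-learn, rows correspond to true labels and columns to predicted labels, so correct predictions sit on the diagonal and accuracy equals the diagonal sum divided by the total.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

acc = accuracy_score(y_true, y_pred)

# Accuracy is the share of predictions on the diagonal: 4 of 6 here
diag_acc = cm.trace() / cm.sum()
print(round(acc, 3), round(diag_acc, 3))  # 0.667 0.667
```

The same call, `accuracy_score(y_test, y_test_pred)`, would give the test accuracy for the fitted model.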

In [None]:
train['target_names']

In [None]:
best_clf.predict(['I always wanted to be an astronaut','I hate Windows 10'])
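`predict` returns integer labels, which index into `target_names`. A small sketch of the lookup, with the names copied from the output shown earlier and a hypothetical prediction array (the values in `predicted` are an assumption, not actual model output):

```python
# Category names in label order, as returned by train['target_names']
target_names = [
    'alt.atheism',
    'comp.windows.x',
    'rec.autos',
    'rec.sport.baseball',
    'sci.electronics',
    'sci.space',
]

# Hypothetical predictions for two documents
predicted = [5, 1]

# Each integer label is simply a position in target_names
predicted_names = [target_names[i] for i in predicted]
print(predicted_names)  # ['sci.space', 'comp.windows.x']
```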

## Logistic regression

In [15]:
pipe = Pipeline(steps=[
    ('vect', TfidfVectorizer()), 
    ('clf', LogisticRegression()) 
])

In [18]:
pipe.fit(X_train,y_train)

Pipeline(steps=[('vect', TfidfVectorizer()), ('clf', LogisticRegression())])

In [21]:
pipe['clf'].coef_.shape

(6, 33187)
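The shape reads as (number of classes, number of vocabulary terms): one row of weights per category and one column per term found by the vectorizer. The sketch below reproduces this on a hypothetical three-class toy corpus and looks up the highest-weighted term in each row; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 0 = space, 1 = autos, 2 = baseball
toy_docs = [
    "rocket orbit launch", "orbit satellite rocket",
    "engine car wheels", "car brakes engine",
    "pitcher inning baseball", "baseball batter inning",
]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_pipe = Pipeline(steps=[
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
toy_pipe.fit(toy_docs, toy_labels)

vect = toy_pipe['vect']
coef = toy_pipe['clf'].coef_

# Terms ordered by column index, so terms[j] labels column j of coef
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

print(coef.shape == (3, len(terms)))  # True: one row per class

# Term with the largest weight in each class row
top_terms = terms[coef.argmax(axis=1)]
print(top_terms)
```

Applied to the fitted pipeline above, the same lookup would show which terms weigh most heavily toward each of the six newsgroups.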