# Topic in Text  
This notebook uses the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  
The code is adapted from the [scikit-learn documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) and a [blog post on text classification](https://krakensystems.co/blog/2018/text-classification).

In [None]:
import os
import pickle
import numpy as np

In [2]:
# load dataset
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset="all")
list(data.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
data.data[0]

"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [4]:
len(data.data)

18846

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)
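
The split above is random, so the scores in this notebook change from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming a fixed `random_state` and `stratify` are acceptable here; the toy `docs` and `labels` are hypothetical stand-ins for `data.data` and `data.target`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 20 documents, two balanced classes.
docs = [f"document {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

def split(seed):
    # stratify keeps the class proportions equal in both parts;
    # random_state makes the shuffle repeatable.
    return train_test_split(
        docs, labels, test_size=0.2, random_state=seed, stratify=labels
    )

X_tr, X_te, y_tr, y_te = split(0)
print(len(X_tr), len(X_te))
print(Counter(y_te))
```

Calling `split(0)` again returns exactly the same partition, which makes later score comparisons between models easier to trust.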

### Try Simple Model

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

clf1 = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

clf1.fit(X_train, y_train)
clf1.score(X_test, y_test)

0.8527851458885941
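
To see what the two pipeline steps do, here is a small sketch on a hypothetical four-document corpus (the documents and the `hockey`/`space` labels are made up for illustration). The vectorizer turns each document into one row of TF-IDF weights, and `MultinomialNB` learns per-class word weights from those rows.

```python
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not part of 20 Newsgroups.
toy_docs = [
    "the goalie stopped the puck",
    "the team won the hockey game",
    "the rocket reached orbit",
    "the satellite left earth orbit",
]
toy_labels = ["hockey", "hockey", "space", "space"]

toy_clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_labels)

# One row per document, one column per distinct word in the corpus.
matrix = toy_clf.named_steps["vectorizer"].transform(toy_docs)
print(matrix.shape)

# Words unseen during fitting ("hit") are simply ignored at predict time.
prediction = toy_clf.predict(["the puck hit the goalie"])
print(prediction)
```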

In [7]:
# try to add stop words
from nltk.corpus import stopwords

clf2 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', MultinomialNB())
])

clf2.fit(X_train, y_train)
clf2.score(X_test, y_test)

0.8816976127320955
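
Removing stop words shrinks the vocabulary, so very common words no longer take part in the class scores. The sketch below illustrates this on a hypothetical toy corpus. Note the swap: it uses scikit-learn's built-in `stop_words="english"` list instead of the NLTK list used above, only to avoid downloading the NLTK corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
toy_docs = [
    "the goalie stopped the puck",
    "the rocket reached orbit",
]

plain = TfidfVectorizer().fit(toy_docs)
filtered = TfidfVectorizer(stop_words="english").fit(toy_docs)

# vocabulary_ maps each kept word to its column index.
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```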

### Find Best Params

In [8]:
from sklearn.model_selection import GridSearchCV

newclf = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', MultinomialNB())
])
params = {
    'classifier__alpha': [1, 0.1, 0.01, 0.001]
}
searcher = GridSearchCV(newclf, param_grid=params)
searcher.fit(X_train, y_train)
searcher.best_score_

0.9095248778213266

In [9]:
searcher.best_params_

{'classifier__alpha': 0.01}
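
`best_score_` is the mean cross-validated accuracy on the training folds, so it is not directly comparable to the held-out test scores above. To see how every candidate `alpha` performed, `cv_results_` can be inspected. This is a sketch on a hypothetical toy corpus, with `cv=2` chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four documents per class.
toy_docs = [
    "hockey puck goalie ice", "hockey team playoff game",
    "goalie saved the puck", "playoff game on ice",
    "rocket launch into orbit", "satellite in earth orbit",
    "shuttle launch to the moon", "moon rocket and satellite",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
grid = {"classifier__alpha": [1, 0.1, 0.01, 0.001]}
toy_search = GridSearchCV(pipe, param_grid=grid, cv=2)
toy_search.fit(toy_docs, toy_labels)

# One entry per candidate: mean and spread of the fold accuracies.
results = toy_search.cv_results_
for alpha, mean, std in zip(
    results["param_classifier__alpha"],
    results["mean_test_score"],
    results["std_test_score"],
):
    print(f"alpha={alpha}: {mean:.3f} +/- {std:.3f}")
```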

### Further Preprocessing

In [11]:
import string
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

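import warnings
# Suppress all warnings (this also hides any scikit-learn notices about the custom tokenizer).
warnings.filterwarnings('ignore')

# Build the stemmer once instead of on every call.
stemmer = SnowballStemmer('english')

def tokenizer(text):
    """Tokenize with NLTK, then reduce each token to its Snowball stem."""
    return [stemmer.stem(x) for x in word_tokenize(text)]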

clf = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation), tokenizer=tokenizer)),
    ('classifier', MultinomialNB(alpha=0.01))
])

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9212201591511936
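
Stemming maps inflected forms of a word to a common stem, so they share one feature column. The sketch below shows the effect in isolation. As an assumption made to keep it free of NLTK data downloads, it replaces `word_tokenize` with a simple regular expression; `simple_stem_tokenizer` is a hypothetical helper, not the `tokenizer` used above.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def simple_stem_tokenizer(text):
    # Lowercase, keep alphabetic runs only, then stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens]

stems = simple_stem_tokenizer("Fans watching playoffs")
print(stems)
```

Because the tokens are stemmed but the stop-word list is not, some stop words may no longer match their stemmed forms. Passing the stop words through the same tokenizer first is one way to keep the two consistent.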

### Try Other Models

In [12]:
from sklearn.linear_model import SGDClassifier

clf_sgd = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation), tokenizer=tokenizer)),
    ('classifier', SGDClassifier())
])

clf_sgd.fit(X_train, y_train)
clf_sgd.score(X_test, y_test)

0.9238726790450928

In [13]:
from sklearn.svm import LinearSVC

clf_svc = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation), tokenizer=tokenizer)),
    ('classifier', LinearSVC())
])

clf_svc.fit(X_train, y_train)
clf_svc.score(X_test, y_test)

0.9352785145888595
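
The three scores above come from a single random split, and the differences between them are small. Cross-validation gives a steadier comparison. This is a sketch on a hypothetical toy corpus, with `cv=2` and `random_state=0` chosen only for the illustration and default tokenization in place of the stemming tokenizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four documents per class.
toy_docs = [
    "hockey puck goalie ice", "hockey team playoff game",
    "goalie saved the puck", "playoff game on ice",
    "rocket launch into orbit", "satellite in earth orbit",
    "shuttle launch to the moon", "moon rocket and satellite",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

candidates = {
    "nb": MultinomialNB(alpha=0.01),
    "sgd": SGDClassifier(random_state=0),
    "svc": LinearSVC(),
}

mean_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", model),
    ])
    # The whole pipeline is refit on each training fold,
    # so the vectorizer never sees the validation fold.
    scores = cross_val_score(pipe, toy_docs, toy_labels, cv=2)
    mean_scores[name] = scores.mean()

print(mean_scores)
```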

### More Tests

In [14]:
test_tweets = [
    "Humans drive using 2 cameras on a slow gimbal &amp; are often distracted. A Tesla with 8 cameras, radar, sonar &amp; always being alert can definitely be superhuman.",
    "Climate change is one of the toughest challenges the world has ever taken on. But I believe we can avoid a climate catastrophe if we take steps now to reduce emissions and find ways to adapt to a warmer world. Heres how were approaching the challenge:",
    "Malnutrition is the single greatest health inequity in the world. In its 2019 State of the Worlds Children report,  takes a critical look at how we can help everyone reach their full potential by improving access to nutrition.",
    "Knew this storm would produce a tornado at some point! Take cover! A possible tornado is on the ground! In far northern Georgia! #gawx #tornado",
    "What if we really are all in this together? In this fascinating book,  explains why and how humans evolved to work together:"
]

In [15]:
for cla in [clf, clf_sgd, clf_svc]:
    print([data.target_names[x] for x in cla.predict(test_tweets)])

['sci.electronics', 'sci.space', 'sci.med', 'talk.politics.misc', 'alt.atheism']
['misc.forsale', 'comp.os.ms-windows.misc', 'sci.med', 'soc.religion.christian', 'soc.religion.christian']
['misc.forsale', 'comp.os.ms-windows.misc', 'sci.med', 'soc.religion.christian', 'soc.religion.christian']
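
A classifier always returns one of its training labels, even for text that fits none of them, which explains some of the odd predictions above. With `MultinomialNB`, `predict_proba` can expose how unsure the model is. This is a sketch on a hypothetical toy corpus; when no word of the input is in the vocabulary, the probabilities fall back to the class priors.

```python
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two balanced classes.
toy_docs = [
    "the goalie stopped the puck",
    "the team won the hockey game",
    "the rocket reached orbit",
    "the satellite left earth orbit",
]
toy_labels = ["hockey", "hockey", "space", "space"]

toy_clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_labels)

# None of these words were seen during fitting.
off_topic = ["tornado warning georgia"]
proba = toy_clf.predict_proba(off_topic)[0]
label = toy_clf.predict(off_topic)[0]

# Columns of predict_proba follow the order of classes_.
print(dict(zip(toy_clf.classes_, proba)), label)
```

`LinearSVC` has no `predict_proba`; its `decision_function` margins play a similar role but are not probabilities.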


### Going to Use MultinomialNB

In [17]:
import pickle
# Save the label names together with the fitted MultinomialNB pipeline.
with open("multinomial.pkl", "wb") as out_file:
    pickle.dump([data.target_names, clf], out_file)
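
Loading the model back is the mirror image of the dump. The sketch below shows the round trip with a hypothetical toy pipeline and a temporary file. One caveat for the real `clf`: pickle stores the custom `tokenizer` function by reference, so a function with the same name must be defined or importable in the process that loads the file.

```python
import os
import pickle
import tempfile

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for data.target_names and clf.
toy_names = ["hockey", "space"]
toy_clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_clf.fit(
    ["goalie stopped the puck", "rocket reached orbit"],
    [0, 1],  # integer labels index into toy_names
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as out_file:
        pickle.dump([toy_names, toy_clf], out_file)
    with open(path, "rb") as in_file:
        loaded_names, loaded_clf = pickle.load(in_file)

# Map the predicted integer back to its label name.
predicted = loaded_names[loaded_clf.predict(["puck and goalie"])[0]]
print(predicted)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.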