<a href="https://colab.research.google.com/github/yala/introML_chem/blob/master/lab1/sample_news_groups_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pickle
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import warnings
warnings.filterwarnings("ignore")


# The Task: News Group Classification

Given documents from different news groups (i.e., topics):
```
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```
 
Train a classifier to predict the topic of a given document. 


## Step 1: Preprocessing the data
To start off, we're going to load the data from sklearn and do some simple preprocessing. We'll remove the headers, footers and quotes from the articles. We'll also apply the same text normalization as before: replace every run of non-word characters with a single space, lowercase the text, and strip surrounding whitespace.


To sanity check the data, we'll look at a few examples.
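
As a quick illustration of what that normalization does to a single string, here is a minimal sketch. The helper name `normalize_text` and the sample sentence are made up for this example; the regular expression is the one used in `preprocess_data` below.

```python
import re

def normalize_text(text):
    # Replace every run of non-word characters with one space,
    # then lowercase and trim leading/trailing whitespace.
    return re.sub(r'\W+', ' ', text).lower().strip()

sample = "Hello, World!!  It's 1993."
normalized = normalize_text(sample)
print(normalized)
```

Note that punctuation inside words is also split, so "It's" becomes two tokens, "it" and "s".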


In [2]:

categories = ['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

full_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)

def preprocess_data(data):
    processed_data = []
    for indx, sample in enumerate(data['data']):
        text, label = sample, data['target'][indx]
        label_name = data['target_names'][label]
        # Raw string avoids the invalid-escape warning for \W in recent Python versions
        text = re.sub(r'\W+', ' ', text).lower().strip()
        processed_data.append( (text, label, label_name) )
    return processed_data


train_set = preprocess_data(full_train)
full_test_set = preprocess_data(test)
half_indx = len(full_test_set) // 2
dev_set = full_test_set[:half_indx]
test_set = full_test_set[half_indx:]



Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
print("Num Train: {}".format(len(train_set)))
print("Num Dev: {}".format(len(dev_set)))
print("Num Test: {}".format(len(test_set)))
print("Example Documents:")
print(train_set[0])
print()
print(train_set[1])


Num Train: 11314
Num Dev: 3766
Num Test: 3766
Example Documents:
('i was wondering if anyone out there could enlighten me on this car i saw the other day it was a 2 door sports car looked to be from the late 60s early 70s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail', 7, 'rec.autos')

('a fair number of brave souls who upgraded their si clock oscillator have shared their experiences for this poll please send a brief message detailing your experiences with the procedure top speed attained cpu rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with 800 and 1 4 m floppies are especially requested i will be summarizing in the next two days so please add to the network knowledge base if you 

## Step 2: Feature Engineering 

How do we represent a document? This is up to you!
Remember, you can vary the vocabulary size or choose to include n-grams!

We can do this very easily with `sklearn.feature_extraction.text.CountVectorizer`.
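
To see how `ngram_range` and `min_df` shape the vocabulary, here is a small sketch on three made-up sentences (the `toy_*` names are only for illustration). With `ngram_range=(1, 2)` the vectorizer counts single words and adjacent word pairs, and `min_df=2` keeps only terms that appear in at least two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the car is fast", "the car is red", "space is vast"]

# Unigrams and bigrams; drop any term seen in fewer than 2 documents.
toy_vec = CountVectorizer(ngram_range=(1, 2), min_df=2)
toy_X = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index.
toy_vocab = sorted(toy_vec.vocabulary_)
print(toy_vocab)
print(toy_X.shape)
```

Terms such as "fast" or "is red" occur in only one document, so they are filtered out. Raising `min_df` or lowering `max_features` shrinks the feature matrix in the same way on the real data.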

<img src="https://github.com/yala/MLCodeLab/blob/master/lab1/vectorizer.png?raw=true">


In [0]:
# Extract documents and labels into 2 lists
trainText = [t[0] for t in train_set]
trainY = [t[1] for t in train_set]

devText = [t[0] for t in dev_set]
devY = [t[1] for t in dev_set]


testText = [t[0] for t in test_set]
testY = [t[1] for t in test_set]

min_df = 10
max_df = 0.90
ngram_range = (1,3)
max_features = 10000
countVec = CountVectorizer(min_df=min_df, 
                           max_df=max_df,
                           max_features=max_features, 
                           ngram_range=ngram_range,
                           stop_words='english')



# Learn vocabulary from train set
countVec.fit(trainText)

# Transform each list of documents into a matrix of bag-of-words vectors
trainX = countVec.transform(trainText)
devX = countVec.transform(devText)
testX = countVec.transform(testText)

In [0]:
print("Shape of Train X {}\n".format(trainX.shape))
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print("Sample of the vocab:\n {}".format(np.random.choice(countVec.get_feature_names_out(), 20)))


Shape of Train X (11314, 10000)

Sample of the vocab:
 ['sumgait' 'torque' 'dsl pitt edu' 'family' 'incident' 'ye' 'graphic'
 'publication' 'staff' 'laserjet' 'interface' 'dont' 'pile' 'ott' 'relate'
 'minority' 'atheists' 'tony' 'tricky' 'censorship']


## Step 3: Pick a model and experiment

Explore various models.

I recommend exploring:
1) [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
2) [SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

And look around the [library](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm) for other options!
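
One way to organize this exploration is to fit each candidate on the training set and record its dev accuracy. The sketch below assumes a hypothetical helper, `compare_models`, and runs on a small synthetic dataset so it is self-contained; the hyperparameter values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def compare_models(models, train_X, train_y, dev_X, dev_y):
    """Fit each named model on train and return its accuracy on dev."""
    scores = {}
    for name, candidate in models.items():
        candidate.fit(train_X, train_y)
        scores[name] = candidate.score(dev_X, dev_y)
    return scores

# Synthetic stand-in for the bag-of-words matrices (illustration only).
toy_X, toy_y = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
candidates = {
    "logreg": LogisticRegression(C=0.01, max_iter=1000),
    "linear_svm": LinearSVC(C=0.1, max_iter=5000),
}
demo_scores = compare_models(candidates,
                             toy_X[:200], toy_y[:200],
                             toy_X[200:], toy_y[200:])
for name, acc in demo_scores.items():
    print("{}: dev accuracy {:.3f}".format(name, acc))
```

In this notebook the same helper could be called with `trainX`, `trainY`, `devX` and `devY` in place of the synthetic arrays.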





In [0]:
# Initialize your model
model = LogisticRegression(C=.01, class_weight='balanced',
                          penalty='l2')



In [0]:
model.fit(trainX, trainY)


print("Train Accuracy:", model.score(trainX, trainY))
print("Dev Accuracy:", model.score(devX, devY))
print("--")


Train Accuracy: 0.7965352660420718
Dev Accuracy: 0.6147105682421667
--


## Step 4: Analysis, Debugging the Model
To understand how to make the model better, it's important to understand what the model is learning and what it's getting wrong.

Recall how we did this for logistic regression.

It can be helpful to inspect the highest-weighted features of the model and to look at some development-set examples the model got wrong.

From what you learn, you can go back and change the feature extraction or the regularization.
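
Besides reading individual errors, it can help to see which pairs of classes are confused most often. The following is a minimal sketch using `sklearn.metrics.confusion_matrix`; the helper `most_confused_pairs` and the toy labels are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pairs(y_true, y_pred, label_names, top_k=3):
    """Return up to top_k (true label, predicted label, count) error pairs."""
    cm = confusion_matrix(y_true, y_pred,
                          labels=list(range(len(label_names))))
    np.fill_diagonal(cm, 0)  # ignore correct predictions
    # Flattened cell indices, largest error count first.
    order = np.argsort(cm, axis=None)[::-1][:top_k]
    pairs = []
    for flat_index in order:
        true_idx, pred_idx = np.unravel_index(flat_index, cm.shape)
        count = int(cm[true_idx, pred_idx])
        if count > 0:
            pairs.append((label_names[true_idx], label_names[pred_idx], count))
    return pairs

toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 1, 2, 0, 2]
toy_pairs = most_confused_pairs(toy_true, toy_pred, ['a', 'b', 'c'])
print(toy_pairs)
```

On the real data this could be called as `most_confused_pairs(devY, model.predict(devX), test['target_names'])` once the model is fitted.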


In [0]:
print("Interpreting the model")
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
vocab = np.array(countVec.get_feature_names_out())
num_features = 5
for label in range(20):
    coefs = model.coef_[label]

    # Indices of the num_features largest coefficients, in arbitrary order
    top = np.argpartition(coefs, -num_features)[-num_features:]
    # Sort those indices by ascending coefficient
    top = top[np.argsort(coefs[top])]
    s_coef = coefs[top]
    scored_vocab = list(zip(vocab[top], s_coef))
    print("Top weighted features for label {}:\n \n {}\n -- \n".format(test['target_names'][label], scored_vocab))

Interpreting the model
Top weighted features for label alt.atheism:
 
 [('atheists', 0.2508178944890633), ('bobby', 0.2519097588926118), ('islam', 0.2614950752178715), ('religion', 0.29312586490879056), ('atheism', 0.3371150581374824)]
 -- 

Top weighted features for label comp.graphics:
 
 [('points', 0.23292519676922613), ('format', 0.24627895446899792), ('3d', 0.33904912664630843), ('image', 0.38818763318135424), ('graphics', 0.6634203740777268)]
 -- 

Top weighted features for label comp.os.ms-windows.misc:
 
 [('cica', 0.232913294482514), ('files', 0.23393875842103476), ('driver', 0.2368866207976759), ('file', 0.2507592059973859), ('windows', 0.8374738275416796)]
 -- 

Top weighted features for label comp.sys.ibm.pc.hardware:
 
 [('ide', 0.26686129560057065), ('bus', 0.26744485480038316), ('486', 0.26872496346764074), ('monitor', 0.26874886085025357), ('controller', 0.28921621100848527)]
 -- 

Top weighted features for label comp.sys.mac.hardware:
 
 [('simms', 0.25469387667761223)

In [0]:
# Collect the misclassified dev examples
devPred = model.predict(devX)
errors = []
for indx in range(len(devText)):
    if devPred[indx] != devY[indx]:
        error = "Document: \n {} \n Predicted: {} \n Correct: {} \n ---".format(
            devText[indx],
            test['target_names'][devPred[indx]],
            test['target_names'][devY[indx]])
        errors.append(error)

np.random.seed(1)
print("Random dev error: \n {} \n \n {} \n \n{}".format(
        np.random.choice(errors,1),
        np.random.choice(errors,1),
        np.random.choice(errors,1))
     )

Random dev error: 
 ['Document: \n unfortunately roger is now over at r s baseball spewing his expertise i e being a dickhead i guess he is afraid of posting anything here because he knows what to expect \n Predicted: rec.sport.baseball \n Correct: rec.sport.hockey \n ---'] 
 
 ['Document: \n date sun 25 apr 1993 10 13 30 gmt from fred rice darice yoyo cc monash edu au the qur an talks about those who take their lusts and worldly desires for their god i think this probably encompasses most atheists fred rice darice yoyo cc monash edu au as well as all the muslim men screwing fourteen year old prostitutes in thailand got a better quote \n Predicted: sci.space \n Correct: alt.atheism \n ---'] 
 
['Document: \n precisely one wonders what unusual strain the boy might be under that could be causing difficulty with his behavior standard practice would be to get a second opinion from a child psychiatrist one would want to rule out the possibility that the bad behavior is not psychiatric illne

## Step 5: Take the best model and report results on the test set
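
The idea is to choose among candidates using dev accuracy only, and to touch the test set once at the very end. Here is a minimal sketch of that workflow for a sweep over the regularization strength `C`. The helper `select_on_dev_then_test`, the candidate values and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def select_on_dev_then_test(c_values, train_split, dev_split, test_split):
    """Fit one model per C, keep the best on dev, score it once on test."""
    (train_X, train_y), (dev_X, dev_y), (test_X, test_y) = (
        train_split, dev_split, test_split)
    best_c, best_model, best_dev = None, None, -1.0
    for c in c_values:
        candidate = LogisticRegression(C=c, max_iter=1000)
        candidate.fit(train_X, train_y)
        dev_acc = candidate.score(dev_X, dev_y)
        if dev_acc > best_dev:
            best_c, best_model, best_dev = c, candidate, dev_acc
    # The test set is used exactly once, after selection is finished.
    return best_c, best_dev, best_model.score(test_X, test_y)

# Synthetic stand-in for the real train/dev/test matrices.
toy_X, toy_y = make_classification(n_samples=400, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
best_c, dev_acc, test_acc = select_on_dev_then_test(
    [0.01, 0.1, 1.0],
    (toy_X[:200], toy_y[:200]),
    (toy_X[200:300], toy_y[200:300]),
    (toy_X[300:], toy_y[300:]),
)
print("Best C: {}, dev: {:.3f}, test: {:.3f}".format(best_c, dev_acc, test_acc))
```

Selecting on dev and reporting on test keeps the reported number an honest estimate, since the test data played no part in any modeling decision.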

In [0]:
print("Test Accuracy:", model.score(testX, testY))


Test Accuracy: 0.6250663834306956
