# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. Whereas with digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero, here we'll represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse -- only a few words of the total vocabulary are active in any given document. The bag-of-words assumption here is that the label depends only on the words; their order is not important.

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [53]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
# DEPRECATED from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# Added just for neater outputs
import pandas

Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset = 'train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories = categories)
newsgroups_test = fetch_20newsgroups(subset = 'test',
                                     remove = ('headers', 'footers', 'quotes'),
                                     categories = categories)

In [3]:
## ADDED code (and packages) to save data locally (took some time to import so want to be speedy to reload each time)
import sys
import pickle

pickle.dump(newsgroups_train, open("newsgroups_train.sav", "wb"))
newsgroups_train_reload = pickle.load(open("newsgroups_train.sav", "rb"))

pickle.dump(newsgroups_test, open("newsgroups_test.sav", "wb"))
newsgroups_test_reload = pickle.load(open("newsgroups_test.sav", "rb"))

# TEST
print(type(newsgroups_train))
print(sys.getsizeof(newsgroups_train))
print(type(newsgroups_train_reload))
print(sys.getsizeof(newsgroups_train_reload))

<class 'sklearn.utils.Bunch'>
256
<class 'sklearn.utils.Bunch'>
256


In [4]:
num_test = len(newsgroups_test.target)
test_data, test_labels = newsgroups_test.data[int(num_test/2):], newsgroups_test.target[int(num_test/2):]
dev_data, dev_labels = newsgroups_test.data[:int(num_test/2)], newsgroups_test.target[:int(num_test/2)]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training data shape:', len(train_data))
print('training label shape:', train_labels.shape)
print('test data shape:', len(test_data))
print('test label shape:', test_labels.shape)
print('dev data shape:', len(dev_data))
print('dev label shape:', dev_labels.shape)
print('labels names:', newsgroups_train.target_names)

training data shape: 2034
training label shape: (2034,)
test data shape: 677
test label shape: (677,)
dev data shape: 676
dev label shape: (676,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


## (1) For each of the first 5 training examples, print the text of the message along with the label.

In [5]:
for ex in range(len(train_data)):
    print("Example: " + str(ex) + " / Label: " + str(newsgroups_train.target_names[train_labels[ex]]))
    print(train_data[ex])
    print()    

Example: 0 / Label: comp.graphics
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

Example: 1 / Label: talk.religion.misc


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
fo


Example: 578 / Label: comp.graphics
WHATS THIS  680x1024 256 color mode? Asking a lot of your hardware ?

Example: 579 / Label: sci.space
Another fish to check out is Richard Rast -- he works
for Lockheed Missiles, but is on-site at NASA Johnson.

Nick Johnson at Kaman Sciences in Colo. Spgs and his
friend, Darren McKnight at Kaman in Alexandria, VA.

Good luck.

R. Landis

Example: 580 / Label: comp.graphics

Okay, I've received a whole lot of requests for the movie, so for
simplicity's sake I can't mail out any more than I've already received (as
of 16:30 EDT, Tuesday).  Maybe it'll pop up on a site sooner or later.

Example: 581 / Label: comp.graphics
[Most info regarding dangers of reading from Floppy disks omitted]

In all fairness, how many people do you know personally who read images
from Floppy drives?  I haven't tried it with JPEGs, but I do realize how
agonizingly slow it is with GIF files.  

Example: 582 / Label: alt.atheism

 


I love it, I love it, I love it!! Wish I c

Example: 1075 / Label: comp.graphics
:     Help!! I need code/package/whatever to take 3-D data and turn it into
: a wireframe surface with hidden lines removed. I'm using a DOS machine, and
: the code can be in ANSI C or C++, ANSI Fortran or Basic. The data I'm using
: forms a rectangular grid.
:    Please post your replies to the net so that others may benefit. IMHO, this
: is a general interest question.
:    Thank you!!!!!!


Example: 1076 / Label: talk.religion.misc


Example: 1077 / Label: talk.religion.misc
}>}(a) out of context;
}>Must have missed when you said this about these other "promises of god" that we keep
}>getting subjected to.  Could you please explain why I am wrong and they are OK?
}>Or an acknowledgement of public hypocrisy. Both or neither.
}
}So, according to you, Jim, the only way to criticize one person for
}taking a quote out of context, without being a hypocrite, is to post a
}response to *every* person on t.r.m who takes a quote out of context?

Did I eithe

  Does anyone know how to convert a targa or similar 24 bit picture into a list
 of R G B values and then convert back to targa after doing operations on the p
ixels R G B codes.
ex.  Targa ---->000100255pixel 1
001200201pixel 2etc....
If no one can help me with this could someone explain how the 24 bit data is st
ored in the targa file and also how its stored in the 8 bit targas.   Thanks


Example: 1578 / Label: comp.graphics
I have posted disp135.zip to alt.binaries.pictures.utilities


******   You may distribute this program freely for non-commercial use
         if no fee is gained.
******   There is no warranty. The author is not responsible for any
         damage caused by this program.


Important changes since version 1.30:
    Fix bugs in file management system (file displaying).
    Improve file management system (more user-friendly).
    Fix bug in XPM version 3 reading.
    Fix bugs in TARGA reading/writng.
    Fix bug in GEM/IMG reading.
    Add support for PCX and GEM/

Example: 1977 / Label: talk.religion.misc
After tons of mail, could we move this discussion to alt.religion?
--There are many here among us who feel that life is but a joke. (Bob Dylan)
--"If you were happy every day of your life you wouldn't be a human
being, you'd be a game show host." (taken from the movie "Heathers.")
--Lecture (LEK chur) - process by which the notes of the professor
become the notes of the student without passing through the minds of
either.


Example: 1978 / Label: talk.religion.misc


It is true that Mormons believe that all spirits (including Jesus,
Lucifer, Robert Weiss) are in the same family.  It does not mean
that Jesus was created, but rather that Lucifer and Robert Weiss
were not.  I agree that this is a "heresy".  So what?  
The sweating of blood in Gethsemene is
not a basic Mormon doctrine.  Jesus did not perform the atonement
in Getheseme alone, as some anti-Mormons are trying to teach.  
As far as the "unpardonable sin" whatever that is, it is Biblica

## (2) Use CountVectorizer to turn the raw training text into feature vectors.  
You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

In [7]:
### TO DO ### - change "fit" to "vectorized" or something throughout,. fit is misleading/confusing

vectorizer = CountVectorizer()
fit_train_data = vectorizer.fit_transform(train_data)

**a. The oa. The output of the transform (also of fit_transform) is a sparse matrix:** http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.  
- What is the size of the vocabulary?  
- What is the average number of non-zero features per example? 
- What fraction of the entries in the matrix are non-zero?  
_Hint:_ use "nnz" and "shape" attributes.

In [22]:
print("The vocabulary consists of " + str(fit_train_data.shape[1]) + " words")
print("Of the " + str((fit_train_data.shape[0] * fit_train_data.shape[1])) + " entries in the matrix, " + 
      str(round(fit_train_data.nnz / (fit_train_data.shape[0] * fit_train_data.shape[1])*100,3)) + "% (" + 
      str(fit_train_data.nnz) + ") of them are non-zero.")
print("The avergae number of non-zero features per example is " + 
      str(round(fit_train_data.nnz/fit_train_data.shape[0],2)))

The vocabulary consists of 26879 words
Of the 54671886 entries in the matrix, 0.36% (196700) of them are non-zero.
The avergae number of non-zero features per example is 96.71


**b. What are the 0th and last feature strings (in alphabetical order)?**  
_Hint:_ use the vectorizer's `get_feature_names` function.

In [9]:
full_feature_list = vectorizer.get_feature_names()
print("0th feature string: " + full_feature_list[0])
print("Last feature string: " + full_feature_list[len(full_feature_list)-1])

0th feature string: 00
Last feature string: zyxel


**c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"].**  
Confirm the training vectors are appropriately shaped.  
Now what's the average number of non-zero features per example?

In [16]:
my_vocab = ["atheism", "graphics", "space", "religion"]
vectorizer_myvocab = CountVectorizer(vocabulary = my_vocab)
fit_train_data_mv = vectorizer_myvocab.fit_transform(train_data)
print("Confirm shape is appropriate: " + 
      str(fit_train_data_mv.shape))
print("The avergae number of non-zero features per example (with 4-word vocabulary) is " + 
      str(round(fit_train_data_mv.nnz/fit_train_data_mv.shape[0],2)))

Confirm shape is appropriate: (2034, 4)
The avergae number of non-zero features per example (with 4-word vocabulary) is 0.27


**d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features.**  
What size vocabulary does this yield?

In [20]:
# Bigram
vectorizer_bigram = CountVectorizer(ngram_range = (1,2), analyzer = "char_wb")
fit_train_data_bg = vectorizer_bigram.fit_transform(train_data)
print("The bigram vocabulary consists of " + str(fit_train_data_bg.shape[1]) + " words")

# Trigram
vectorizer_bigram = CountVectorizer(ngram_range = (1,3), analyzer = "char_wb")
fit_train_data_tg = vectorizer_bigram.fit_transform(train_data)
print("The trigram vocabulary consists of " + str(fit_train_data_tg.shape[1]) + " words")

The bigram vocabulary consists of 3167 words
The trigram vocabulary consists of 29031 words


**e. Use the "min_df" argument to prune words that appear in fewer than 10 documents.**  
What size vocabulary does this yield?

In [21]:
# Prune
vectorizer_prune = CountVectorizer(min_df = 10)
fit_train_data_prune = vectorizer_prune.fit_transform(train_data)
print("The pruned vocabulary consists of " + str(fit_train_data_prune.shape[1]) + " words")

The pruned vocabulary consists of 3064 words


**f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary?**  
_Hint:_ build a vocabulary for both train and dev and look at the size of the difference.  

In [34]:
vectorizer_dev = CountVectorizer()
fit_dev_data = vectorizer_dev.fit_transform(dev_data)
full_feature_list_dev = vectorizer_dev.get_feature_names()
print("Feature list length (train): " + str(len(full_feature_list)))
print("Feature list length (dev): " + str(len(full_feature_list_dev)))
features_dev_not_train = np.setdiff1d(full_feature_list_dev, full_feature_list)
print("Features in dev, not in train: " + str(len(features_dev_not_train)) +
     " (" + str(round(len(features_dev_not_train) / len(full_feature_list_dev)*100, 2)) + "%)")

Feature list length (train): 26879
Feature list length (dev): 16246
Features in dev, not in train: 4027(24.79%)


## (3) Use the default CountVectorizer options and: 
1. Report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier. Find the optimal value for k.
2. Fit a Multinomial Naive Bayes model and find the optimal value for alpha. 
3. Fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization.  

A few questions:
- Why doesn't nearest neighbors work well for this problem?
- Any ideas why logistic regression doesn't work as well as Naive Bayes?
- Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute.  
Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

In [82]:
vectorizer_dev = CountVectorizer(vocabulary = full_feature_list)
fit_dev_data_align = vectorizer_dev.fit_transform(dev_data)
print(fit_train_data.shape)
print(fit_dev_data_align.shape)

(2034, 26879)
(676, 26879)


### KNN

In [103]:
k_values = {"n_neighbors": list(range(1, 100))}
k_search = GridSearchCV(KNeighborsClassifier(), k_values, cv = 3, return_train_score = True)
k_search.fit(fit_train_data, train_labels)
pandas.DataFrame(k_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.001994,2.820242e-03,0.076644,7.307805e-03,1,{'n_neighbors': 1},0.430044,0.384389,0.412722,0.409046,0.018832,77,0.977122,0.979336,0.976436,0.977631,0.001237
1,0.005201,7.355018e-03,0.078107,3.394152e-06,2,{'n_neighbors': 2},0.425626,0.377025,0.406805,0.403147,0.020023,90,0.673063,0.764576,0.733432,0.723690,0.037990
2,0.005209,7.366707e-03,0.075599,6.567811e-03,3,{'n_neighbors': 3},0.406480,0.382916,0.397929,0.395772,0.009747,98,0.639852,0.662731,0.621502,0.641362,0.016865
3,0.000000,0.000000e+00,0.078111,6.676666e-06,4,{'n_neighbors': 4},0.430044,0.375552,0.393491,0.399705,0.022690,95,0.647970,0.670111,0.631811,0.649964,0.015699
4,0.005554,7.855049e-03,0.079818,2.422264e-03,5,{'n_neighbors': 5},0.444772,0.391753,0.390533,0.409046,0.025295,77,0.638376,0.642066,0.617820,0.632754,0.010667
5,0.001330,1.880873e-03,0.084378,6.757058e-03,6,{'n_neighbors': 6},0.450663,0.368189,0.392012,0.403638,0.034679,88,0.611070,0.627306,0.606038,0.614805,0.009075
6,0.002660,2.049241e-03,0.092745,3.732608e-03,7,{'n_neighbors': 7},0.437408,0.407953,0.378698,0.408063,0.023959,79,0.592620,0.595572,0.586156,0.591449,0.003932
7,0.011421,5.943043e-03,0.084049,8.409308e-03,8,{'n_neighbors': 8},0.432990,0.379971,0.371302,0.394789,0.027273,99,0.586716,0.605904,0.578792,0.590471,0.011382
8,0.008873,4.789163e-03,0.087059,1.254812e-02,9,{'n_neighbors': 9},0.431517,0.403535,0.396450,0.410521,0.015141,74,0.594096,0.592620,0.581001,0.589239,0.005856
9,0.002327,1.694890e-03,0.089209,4.856233e-03,10,{'n_neighbors': 10},0.440353,0.405007,0.394970,0.413471,0.019466,70,0.576384,0.580812,0.574374,0.577190,0.002689


In [90]:
def knn_custom(k_value, in_train_data, in_train_labels, in_test_data, in_test_labels):
    clf_model = KNeighborsClassifier(n_neighbors = k_value, weights = "distance")
    clf_model.fit(in_train_data, in_train_labels)
    result = clf_model.predict(in_test_data)
    f1 = metrics.f1_score(y_true = in_test_labels, y_pred = result, average = "weighted")
    print("K = %3.3f ; F1-Score: %3.3f" %(k_value, f1))
    print(classification_report(in_test_labels, result))
    
    return result

In [95]:
dev_pred_k97 = knn_custom(k_value = 97, 
                         in_train_data = fit_train_data,
                         in_train_labels = train_labels,
                         in_test_data = fit_dev_data_align,
                         in_test_labels = dev_labels)

K = 97.000 ; F1-Score: 0.463
              precision    recall  f1-score   support

           0       0.48      0.32      0.39       165
           1       0.51      0.60      0.55       185
           2       0.49      0.58      0.53       199
           3       0.35      0.31      0.33       127

   micro avg       0.47      0.47      0.47       676
   macro avg       0.46      0.45      0.45       676
weighted avg       0.47      0.47      0.46       676



### Naive Bayes

In [101]:
alphas = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
alpha_search = GridSearchCV(MultinomialNB(), alphas, cv = 3, return_train_score = True)
alpha_search.fit(fit_train_data, train_labels)
pandas.DataFrame(alpha_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.00798,0.001412,0.001331,0.0004668749,0.0001,{'alpha': 0.0001},0.824742,0.818851,0.83284,0.825467,0.005732,3,0.963838,0.961624,0.966127,0.963863,0.001838
1,0.007981,1e-06,0.001662,0.000469797,0.001,{'alpha': 0.001},0.824742,0.821797,0.831361,0.825959,0.003997,2,0.963838,0.960886,0.96539,0.963371,0.001868
2,0.008643,0.000469,0.001995,4.052337e-07,0.01,{'alpha': 0.01},0.826215,0.821797,0.837278,0.828417,0.006507,1,0.962362,0.960148,0.963918,0.962142,0.001547
3,0.010309,0.001243,0.00266,0.0004717133,0.1,{'alpha': 0.1},0.820324,0.805596,0.840237,0.822026,0.014188,4,0.957196,0.956458,0.958027,0.957227,0.000641
4,0.006318,0.0053,0.000664,0.0004697972,0.5,{'alpha': 0.5},0.810015,0.790869,0.825444,0.808751,0.014138,5,0.940959,0.935055,0.939617,0.938544,0.002527
5,0.0111,0.007849,0.0,0.0,1.0,{'alpha': 1.0},0.792342,0.774669,0.821006,0.795969,0.019084,6,0.927675,0.920295,0.932253,0.926741,0.004926
6,0.005562,0.007865,0.005209,0.007366707,2.0,{'alpha': 2.0},0.768778,0.75405,0.798817,0.773845,0.018618,7,0.898155,0.899631,0.912371,0.903386,0.006382
7,0.011101,0.00785,0.0,0.0,10.0,{'alpha': 10.0},0.703976,0.692194,0.656805,0.684366,0.020032,8,0.75572,0.783026,0.733432,0.757392,0.020281


In [102]:
mnb = MultinomialNB(alpha = 0.01)
mnb.fit(fit_train_data, train_labels)
dev_pred_mnb = mnb.predict(fit_dev_data_align)
print(classification_report(dev_labels, dev_pred_mnb))

              precision    recall  f1-score   support

           0       0.67      0.72      0.69       165
           1       0.92      0.90      0.91       185
           2       0.81      0.89      0.85       199
           3       0.65      0.50      0.57       127

   micro avg       0.78      0.78      0.78       676
   macro avg       0.76      0.75      0.76       676
weighted avg       0.78      0.78      0.78       676



### Logistic

In [93]:
cs = {'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
logitreg = LogisticRegression(penalty = "l2", solver = "lbfgs", max_iter = 1000, multi_class = "auto")

c_search = GridSearchCV(logitreg, cs, cv = 3, return_train_score = True)
c_search.fit(fit_train_data, train_labels)
pandas.DataFrame(c_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.949242,0.025965,0.000997,0.001411,0.0001,{'C': 0.0001},0.460972,0.440353,0.436391,0.445919,0.010778,8,0.473801,0.481919,0.472754,0.476158,0.004096
1,1.251474,0.080011,0.00133,0.00094,0.001,{'C': 0.001},0.597938,0.594993,0.599112,0.597345,0.001733,7,0.674539,0.656827,0.659057,0.663474,0.007877
2,1.58212,0.080095,0.001006,0.001422,0.01,{'C': 0.01},0.727541,0.717231,0.733728,0.726155,0.006803,6,0.870849,0.872325,0.879234,0.874136,0.003655
3,2.069733,0.132739,0.000332,0.00047,0.1,{'C': 0.1},0.759941,0.751105,0.760355,0.757129,0.004268,1,0.971956,0.968266,0.968336,0.969519,0.001723
4,3.463314,0.396128,0.002657,0.000474,0.5,{'C': 0.5},0.758468,0.743741,0.752959,0.751721,0.00608,2,0.978598,0.979336,0.977172,0.978369,0.000898
5,3.098137,0.214129,0.000336,0.000475,1.0,{'C': 1.0},0.752577,0.740795,0.745562,0.746313,0.004843,3,0.979336,0.979336,0.977909,0.97886,0.000673
6,3.862051,0.543362,0.000998,0.001411,2.0,{'C': 2.0},0.745214,0.740795,0.744083,0.743363,0.001875,4,0.979336,0.979336,0.977909,0.97886,0.000673
7,5.317724,1.479385,0.005874,0.006944,10.0,{'C': 10.0},0.739323,0.742268,0.738166,0.739921,0.001727,5,0.980074,0.979336,0.977909,0.979106,0.000899


In [94]:
logitreg = LogisticRegression(penalty = "l2", C = 0.1, solver = "lbfgs", max_iter = 1000, multi_class = "auto")
logitreg.fit(fit_train_data, train_labels)
dev_pred_logitreg = logitreg.predict(fit_dev_data_align)
print(classification_report(dev_labels, dev_pred_logitreg))

              precision    recall  f1-score   support

           0       0.61      0.56      0.58       165
           1       0.79      0.86      0.83       185
           2       0.74      0.82      0.78       199
           3       0.59      0.48      0.53       127

   micro avg       0.70      0.70      0.70       676
   macro avg       0.68      0.68      0.68       676
weighted avg       0.69      0.70      0.70       676



Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C

In [105]:
print()
print(logitreg.coef_[0,])
print(sum(logitreg.coef_[0,]**2))
print(logitreg.coef_[:,0])
# Loop through C settings
    # Loop through features
        # Store each feature


[-3.10346413e-02  2.36333138e-02 -2.43862521e-05 ... -1.39322274e-04
 -2.78644549e-04 -1.31478141e-04]
14.088172553138689
[-0.03103464  0.0685337  -0.01132777 -0.02617128]


In [9]:
#def P3():
### STUDENT START ###
# k = 97, f1 = 04763 (?)
# MNB -> Logistic > KNN
# Draw graphs
### STUDENT END ###
#P3()
# Probability Calibration curves 
# https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

ANSWER: 
KNN gets better higher, need to re-run with 1000 range (100 seems too small)


## (4) Train a logistic regression model.  
Find the 5 features with the largest weights for each label -- 20 features in total.  
Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels.  
Create the table again with bigram features. Any surprising features in this table?

In [10]:
#def P4():
### STUDENT START ###
# ngram_range=(2,2)
### STUDENT END ###
#P4()

ANSWER:

## (5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer.  
The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization.  
For example, try: 
- Lowercasing everything
- Replacing sequences of numbers with a single token
- Removing various other non-letter characters
- Shortening long words  

If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and `re.sub()` in particular.  

With your new preprocessor, how much did you reduce the size of the dictionary?  
For reference, I was able to improve dev F1 by 2 points.

In [11]:
#def empty_preprocessor(s):
#    return s

#def better_preprocessor(s):
### STUDENT START ###

### STUDENT END ###
# alpha = 0.005 // f1 = 0.77
#def P5():
### STUDENT START ###

### STUDENT END ###
#P5()

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.

## (6) Train a logistic regression model using a "l1" penalty.  
Output the number of learned weights that are not equal to zero.  
- How does this compare to the number of non-zero weights you get with "l2"?  
Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting `tol=.01` (the default is .0001).

In [12]:
#def P6():
    # Keep this random seed here to make comparison easier.
    #np.random.seed(0)

    ### STUDENT START ###
    
    ### STUDENT END ###
#P6()

## (7) Use the TfidfVectorizer (_NOTE: how is this different from the CountVectorizer?_) Train a logistic regression model with C=100.

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

In [13]:
#def P7():
### STUDENT START ###

## STUDENT END ###
#P7()

ANSWER:

## (8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.