# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. Whereas with digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero, here we'll represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse -- only a few words of the total vocabulary are active in any given document. The bag-of-words assumption here is that the label depends only on the words; their order is not important.

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [4]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  
# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *



Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [5]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = len(newsgroups_test.target)

test_data, test_labels = newsgroups_test.data[int(num_test/2):], newsgroups_test.target[int(num_test/2):]

dev_data, dev_labels = newsgroups_test.data[:int(num_test/2)], newsgroups_test.target[:int(num_test/2)]

train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print ('training label shape:', train_labels.shape)
print ('test label shape:', test_labels.shape)
print ('dev label shape:', dev_labels.shape)
print ('labels names:', newsgroups_train.target_names)

training label shape: (2034,)
test label shape: (677,)
dev label shape: (676,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


(1) For each of the first 5 training examples, print the text of the message along with the label.

In [6]:
def P1(num_examples=5):
    
### STUDENT START ###

# For each train label position, print the label name correspondent to that position,
#and then the text.
    #First, we check how many examples we have in total#
    print("Train data length (data points):",len(train_data),"\n") 
    for i in range(num_examples):
        print("\n")
        print("Label for example number ",i+1," :",newsgroups_train.target_names[train_labels[i]])
        print("\n")
        print(train_data[i])
        print("\n")       
    
### STUDENT END ###

P1()

Train data length (data points): 2034 



Label for example number  1  : comp.graphics


Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych




Label for example number  2  : talk.religion.misc




Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope -

(2) Use CountVectorizer to turn the raw training text into feature vectors. You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

a. The output of the transform (also of fit_transform) is a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. What is the size of the vocabulary? What is the average number of non-zero features per example? What fraction of the entries in the matrix are non-zero? Hint: use "nnz" and "shape" attributes.

b. What are the 0th and last feature strings (in alphabetical order)? Hint: use the vectorizer's get_feature_names function.

c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"]. Confirm the training vectors are appropriately shaped. Now what's the average number of non-zero features per example?

d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features. What size vocabulary does this yield?

e. Use the "min_df" argument to prune words that appear in fewer than 10 documents. What size vocabulary does this yield?

f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? Hint: build a vocabulary for both train and dev and look at the size of the difference.

In [84]:
def P2():

### STUDENT START ###
    vect=CountVectorizer()
    vect.fit(train_data)
    print("a.\nVocabulary size:{} ".format(len(vect.vocabulary_)),"\n")
    #print("Vocabulary content:\n{}".format(vect.vocabulary_))

    words=vect.transform(train_data)
    print("Words sparse matrix:{} ".format(repr(words)),"\n")
    print("Non zero words in total ",words.nnz)
    print("Text shape would be(Examples,Vocabulary): ",words.shape,"\n")
    print("Like in P1, we find ",words.shape[0]," data points or examples.")
    print("Then, the average non zero features per example is (Non Zero Values/Total Examples): ",round(words.nnz/words.shape[0],2))

    print("Percentage of non zero values: ",round(100*(words.nnz/(words.shape[0]*words.shape[1])),2),"%\n")
        
    #b#
    
    print("b.")
    feature_names=vect.get_feature_names()
    print("0th, first feature:{}\n ".format(feature_names[:1]))
    print("Last feature:{}\n ".format(feature_names[-1:]))
    
    #c#
    print("\nc.")
    vectc=CountVectorizer(vocabulary=["atheism", "graphics", "space", "religion"])
    vectc.fit(train_data)
    wordsc=vectc.transform(train_data)
    print("Then, the average non zero features per example is (Non Zero Values/Total Examples): ",round(wordsc.nnz/wordsc.shape[0],2))
    print("Percentage of non zero values: ",round(100*(wordsc.nnz/(wordsc.shape[0]*wordsc.shape[1])),2),"%\n")
        
    #d#
    print("d.")
    #vectd=CountVectorizer(analyzer="char",ngram_range=(2,3))
    #vectd=vectd.fit(train_data)    
    #print("Vocabulary size:{}".format(len(vectd.vocabulary_)),"\n")
    
    vectd=CountVectorizer(analyzer="char",ngram_range=(2,3))
    wordsd=vectd.fit_transform(train_data)
    wordsd_feature_names=vectd.get_feature_names()
    print("Vocabulary size:{} ".format(len(wordsd_feature_names)),"\n")
           
    #e#
    print("e.")
    vecte=CountVectorizer(min_df=10)
    vecte.fit(train_data)
    print("Vocabulary size for words that appear in fewer than 10 documents :{} ".format(len(vecte.vocabulary_)),"\n")
    
    #f#
    print("f.")
    vectf1=CountVectorizer()
    vectf1.fit(train_data)
    print("Vocabulary size for train data:{}".format(len(vectf1.vocabulary_)),"\n")
    vectf2=CountVectorizer()
    vectf2.fit(dev_data)
    print("Vocabulary size for dev data:{}".format(len(vectf2.vocabulary_)),"\n")
    
    count=0
    
    for i in vectf2.vocabulary_:
    
        for j in vectf1.vocabulary_:
          
            if i==j:
                count=count+1
    print("Words in dev vocabulary found in train vocabulary as well: ",count)         
    
    print("A ",round(100*(1-(count/len(vectf2.vocabulary_))),1),"% of dev vocabulary is missing from the train vocabulary")
    
    
### STUDENT END ###

P2()

a.
Vocabulary size:26731  

Words sparse matrix:<2034x26731 sparse matrix of type '<class 'numpy.int64'>'
	with 196350 stored elements in Compressed Sparse Row format>  

Non zero words in total  196350
Text shape would be(Examples,Vocabulary):  (2034, 26731) 

Like in P1, we find  2034  data points or examples.
Then, the average non zero features per example is (Non Zero Values/Total Examples):  96.53
Percentage of non zero values:  0.36 %

b.
0th, first feature:['01']
 
Last feature:['zyxel']
 

c.
Then, the average non zero features per example is (Non Zero Values/Total Examples):  0.27
Percentage of non zero values:  6.71 %

d.
Vocabulary size:35529  

e.
Vocabulary size for words that appear in fewer than 10 documents :3052  

f.
Vocabulary size for train data:26731 

Vocabulary size for dev data:16246 

Words in dev vocabulary found in train vocabulary as well:  12134
A  25.3 % of dev vocabulary is missing from the train vocabulary


(3) Use the default CountVectorizer options and report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier; find the optimal value for k. Also fit a Multinomial Naive Bayes model and find the optimal value for alpha. Finally, fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization. A few questions:

a. Why doesn't nearest neighbors work well for this problem?

b. Any ideas why logistic regression doesn't work as well as Naive Bayes?

c. Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

In [104]:
def P3():

### STUDENT START ###
  
    vect=CountVectorizer()
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)
    #Only transform and not fit test_data, given we need the same vocabulary for both datasets.
    
        
    #K Nearest Neighbors
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(transformed_training_data,train_labels)#learning fron training data
    prediction=clf.predict(transformed_test_data)#making prediction
    #print("Prediction Score",clf.score(transformed_test_data,test_labels))#checking on testing data
    print("F1 score for a K Nearest neighbors classifier",round(metrics.f1_score(test_labels,prediction, average="weighted"),2)) 
    
    clf_range = np.arange(30) + 1
    kParameters = {'n_neighbors': clf_range} 
    
    grid_search=GridSearchCV(clf,kParameters,cv=10)
    grid_search.fit(transformed_training_data, train_labels)
    print ("Optimal value of k for K Nearest Neighbors: ",grid_search.best_params_) 
    print("\n")    
    
    
    # Multinomial Naive Bayes    
    mnb = MultinomialNB()
    mnb.fit(transformed_training_data, train_labels)
    prediction2=mnb.predict(transformed_test_data)
    print("F1 score for a Multinomial Naive Bayes model",round(metrics.f1_score(test_labels,prediction2, average="weighted"),2)) 
    
    mnb_range = np.linspace(0.001, 0.1, num=100)
    mnbParameters = {'alpha': mnb_range} 
    
    grid_searchmnb=GridSearchCV(mnb,mnbParameters,cv=10)
    grid_searchmnb.fit(transformed_training_data, train_labels)    
    print ("Optimal value of alpha for Multinomial Naive Bayes: ",grid_searchmnb.best_params_,"\n") 
   

    # Logistic Regression    
    logreg = LogisticRegression()
    logreg.fit(transformed_training_data, train_labels)
    prediction3=logreg.predict(transformed_test_data)
    print("F1 score for Logistic Regression: ",round(metrics.f1_score(test_labels,prediction3, average="weighted"),2)) 
    
    c_range = np.linspace(0.1, 1, num=10)
    logreg_Parameters = {'C': c_range}
    
    grid_searchlogreg=GridSearchCV(logreg,logreg_Parameters,cv=10)
    grid_searchlogreg.fit(transformed_training_data, train_labels)
    
    print ("Optimal value of C for Logistic Regression: ",grid_searchlogreg.best_params_)
    print("\n")
       
    print("Sum of the squared weight values for each class for each setting of the C parameter.\n")
    
    for value in c_range:
        logreg = LogisticRegression(penalty='l2', C = value)
        classifierlogreg = logreg.fit(transformed_training_data, train_labels)
        predictionslogreg = logreg.predict(transformed_test_data)
        weights = logreg.coef_
        squareWeights = weights**2
        print("Sum of Squares for Each Class for C=%2.1f:" %(value))
        print(np.sum(squareWeights, axis = 1))

        ### STUDENT END ###

P3()

F1 score for a K Nearest neighbors classifier 0.41
Optimal value of k for K Nearest Neighbors:  {'n_neighbors': 10}


F1 score for a Multinomial Naive Bayes model 0.77
Optimal value of alpha for Multinomial Naive Bayes:  {'alpha': 0.0070000000000000001} 

F1 score for Logistic Regression:  0.74
Optimal value of C for Logistic Regression:  {'C': 0.40000000000000002}


Sum of the squared weight values for each class for each setting of the C parameter.

Sum of Squares for Each Class for C=0.1:
[ 27.12875584  24.66522466  27.46182625  23.01153619]
Sum of Squares for Each Class for C=0.2:
[ 49.76503952  42.75876873  49.34332998  42.64741765]
Sum of Squares for Each Class for C=0.3:
[ 69.31148532  57.88039331  67.92821492  59.75569865]
Sum of Squares for Each Class for C=0.4:
[ 86.76115495  71.15648179  84.31174957  75.05185327]
Sum of Squares for Each Class for C=0.5:
[ 102.64270504   83.06447313   99.08796749   88.9955416 ]
Sum of Squares for Each Class for C=0.6:
[ 117.28369979   94.0459

ANSWER: KNN can suffer from skewed class distributions. For example, if a certain class is very frequent in the training set, it will tend to dominate the majority results of the new example. 

The Naive Bayes is linear classifier using Bayes Theorem and strong independence condition among features.Naive Bayes model reaches its asymptotic faster (O(log n)) than Logistic Regression(O (n)), which means it reaches the asymptotic solution for fewer training sets than Logistic Regression. Also, Naive Bayes is a learning algorithm with greater bias, but lower variance, than Logistic Regression. If this bias is appropriate given the actual data, Naive Bayes will be preferred. Otherwise, Logistic Regression will be preferred.

Higher values of C correspond to less regularization, meaning that when we use a higher value of C, Logistic Regression tries to fit the training set as best as possible, while with low values of C the model puts more emphasis finding weights that are closer to 0.

(4) Train a logistic regression model. Find the 5 features with the largest weights for each label -- 20 features in total. Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels. Create the table again with bigram features. Any surprising features in this table?

In [184]:
def P4():

### STUDENT START ###
    
    # Logistic Regression
    
    vect=CountVectorizer()
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)#Only transform and not fit, given we need the vocabulary for both datasets.
    print("Vocabulary size:{}".format(len(vect.vocabulary_)))
        
    vocabulary=[]  
  
    for i in vect.vocabulary_.keys():
        vocabulary.append(i)
         
    #Checking Vocabulary list length
    vocab=pd.DataFrame(vocabulary)
       
    lr = LogisticRegression()
    lr.fit(transformed_training_data, train_labels)
    pred=lr.predict(transformed_test_data)
    print("\n F1 score for unigram logistic regression",round(metrics.f1_score(test_labels,pred, average="weighted"),2),"\n")
    coef=[]
    coef=abs(lr.coef_)#Coefficients absolute value
    #categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
    
    
    #Create a Data Frame and transpose rows x columns
    df=pd.DataFrame(coef.T)
    df = df.abs()
    
    #Add the features list
    df['Words']=vocab 
    
          
    df0=df.sort_values(by=[0], ascending=False)
    table0=df0.head(5)
    
    
    df1=df.sort_values(by=[1], ascending=False)
    table1=df1.head(5)
   
    table0=table0.append(table1)
    
    df2=df.sort_values(by=[2], ascending=False)
    table2=df2.head(5)
    
    table0=table0.append(table2)
    
    df3=df.sort_values(by=[3], ascending=False)
    table3=df3.head(5)
    
    table0=table0.append(table3)
    
    categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space','Words']
    table0.columns=[categories]
    print("20 x 4 Table\n",table0,"\n")
    
    
    #Bigram Features Logistic Regression    
    vect=CountVectorizer(ngram_range=(2,2))
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)#Only transform and not fit, given we need the vocabulary for both datasets.
    print("Vocabulary size:{}".format(len(vect.vocabulary_)))
    
    #print("Vocabulary content:",(vect.vocabulary_))
    
    vocabulary=[]  
  
    for i in vect.vocabulary_.keys():
        vocabulary.append(i)
    #print(vocabulary)
     
    #Checking Vocabulary list length
    vocab=pd.DataFrame(vocabulary)
    #print('Vocabulary shape:',len(vocab))
    #print('Vocab Head',vocab.head)
    
    #Checking shapes
    #print("Training data shape",transformed_training_data.shape)
    #print("Test Data shape",transformed_test_data.shape)    
    
    lr = LogisticRegression()
    lr.fit(transformed_training_data, train_labels)
    pred=lr.predict(transformed_test_data)
    print("\n F1 score bigram logistic regression: ",round(metrics.f1_score(test_labels,pred, average="weighted"),2),"\n")
    
    coef=[]
    coef=lr.coef_    
    
       
    #Create a Data Frame and transpose rows x columns
    df=pd.DataFrame(coef.T)
    df = df.abs()
    
    #Add the features list
    df['Words']=vocab 
    #print(df)
    #print("Table shape",df.shape)
        
    df0=df.sort_values(by=[0], ascending=False)
    table0=df0.head(5)
    #print(table0.shape)
    #print(table0)
    
    df1=df.sort_values(by=[1], ascending=False)
    table1=df1.head(5)
    #print(table1)
    table0=table0.append(table1)
    
    df2=df.sort_values(by=[2], ascending=False)
    table2=df2.head(5)
    #print(table2)
    table0=table0.append(table2)
    
    df3=df.sort_values(by=[3], ascending=False)
    table3=df3.head(5)
    #print(table3)
    table0=table0.append(table3)
    
    categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space','Words']
    table0.columns=[categories]
    print("20 x 4 Table for Bigram Logistic Regression\n",table0,"\n")
    
### STUDENT END ###

P4()

Vocabulary size:26731

 F1 score for unigram logistic regression 0.74 

20 x 4 Table
        alt.atheism  talk.religion.misc  comp.graphics  sci.space  \
22419     1.263144            1.316950       2.165660   1.167601   
7694      1.124741            0.397716       0.421006   0.396218   
3723      1.031097            0.097157       0.320096   0.836538   
4637      0.986734            0.218642       0.347002   0.457267   
20282     0.954418            0.614837       0.795453   0.064417   
11405     0.758986            1.936187       1.335361   0.762962   
12622     0.583081            1.345737       0.827148   0.470798   
22419     1.263144            1.316950       2.165660   1.167601   
10229     0.335021            1.267078       0.808328   0.628464   
1040      0.358668            1.127628       0.704470   0.379415   
22419     1.263144            1.316950       2.165660   1.167601   
11405     0.758986            1.936187       1.335361   0.762962   
17450     0.413400            

ANSWER: N-grams are combinations of n sequential words, starting with bigrams (two words). Adding bigrams to feature sets improves the accuracy of text classification model.

This case is different, because with bigram features the accuracy has not improved, and the vocabulary is larger, which leads me to think that I might be overfitting the model. Coefficients have diminished.

(5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer. The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization. For example, try lowercasing everything, replacing sequences of numbers with a single token, removing various other non-letter characters, and shortening long words. If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular. With your new preprocessor, how much did you reduce the size of the dictionary?

For reference, I was able to improve dev F1 by 2 points.

In [186]:
def empty_preprocessor(s):
    return s

def better_preprocessor(s):

    string0=s.lower()
    
    p=re.compile(r'[0-9]')
    string1=p.sub('',string0)
    
    s=re.compile(r'[_\/\'":;?><,!#]')
    string2=s.sub("",string1)
                 
    l=re.compile(r'[a-z]{11,}')
    string3=l.sub("",string2)
                 
    l=re.compile(r'\b[a-z]{1,3}\b')
    string4=l.sub("",string3)
    
    return string4


def P5():

### STUDENT START ###

    # Logistic Regression with default preprocessor 
    
    print("Logistic Regression with default preprocessor")                 
    vect=CountVectorizer()
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)
    print("\nVocabulary size with default preprocessor:{}".format(len(vect.vocabulary_)))
    #print("Vocabulary content:",(vect.vocabulary_))
    
    
    lr = LogisticRegression()
    lr.fit(transformed_training_data, train_labels)
    pred=lr.predict(transformed_test_data)
    print("F1 score for Logistic Regression with empty preprocessor :",round(metrics.f1_score(test_labels,pred, average="weighted"),2),"\n") 
    print("\n")
 
    # Logistic Regression w/o preprocessor 
    
    print("Logistic Regression with empty preprocessor")                 
    vect=CountVectorizer(preprocessor=empty_preprocessor)
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)
    print("\nVocabulary size with empty preprocessor:{}".format(len(vect.vocabulary_)))
    #print("Vocabulary content:",(vect.vocabulary_))
      
    lr = LogisticRegression()
    lr.fit(transformed_training_data, train_labels)
    pred=lr.predict(transformed_test_data)
    print("F1 score for Logistic Regression with empty preprocessor :",round(metrics.f1_score(test_labels,pred, average="weighted"),2),"\n") 
    print("\n")
                
    # Logistic Regression w/preprocessor 
    
    print("Logistic Regression with better preprocessor")                 
    vect2=CountVectorizer(preprocessor=better_preprocessor)
    transformed_training_data=vect2.fit_transform(train_data)
    transformed_test_data=vect2.transform(test_data)
    print("\nVocabulary size:{}".format(len(vect2.vocabulary_)))
    #print("Vocabulary content:",(vect.vocabulary_))

    lr = LogisticRegression()
    lr.fit(transformed_training_data, train_labels)
    pred=lr.predict(transformed_test_data)
    print("F1 score for Logistic Regression with better preprocessor :",round(metrics.f1_score(test_labels,pred, average="weighted"),2),"\n")
    print("\n")
    print("Dictionary Size Reduction:",len(vect.vocabulary_)-len(vect2.vocabulary_)," words.")
    print("\n")
### STUDENT END ###

P5()

Logistic Regression with default preprocessor

Vocabulary size with default preprocessor:26731
F1 score for Logistic Regression with empty preprocessor : 0.74 



Logistic Regression with empty preprocessor

Vocabulary size with empty preprocessor:33143
F1 score for Logistic Regression with empty preprocessor : 0.71 



Logistic Regression with better preprocessor

Vocabulary size:19859
F1 score for Logistic Regression with better preprocessor : 0.75 



Dictionary Size Reduction: 13284  words.




(6) The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.

Train a logistic regression model using a "l1" penalty. Output the number of learned weights that are not equal to zero. How does this compare to the number of non-zero weights you get with "l2"? Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.01 (the default is .0001).

In [196]:
def P6():

    #Keep this random seed here to make comparison easier.
    np.random.seed(0)

    ### STUDENT START ###
      
    
    # Logistic Regression l1 
    
    #General CountVectorizer
    vect=CountVectorizer()
    transformed_training_data=vect.fit_transform(train_data)
    transformed_test_data=vect.transform(test_data)
    #Only transform and not fit test_data, given we need the same vocabulary for both datasets.
   
    print("Logistic Regression using l1 penalty, tol=0.01 and C=0.4")
    print("Vocabulary size:{}".format(len(vect.vocabulary_)))
        
    lr1 = LogisticRegression(penalty='l1',tol=0.01,C=0.4)
    lr1.fit(transformed_training_data, train_labels)
    pred=lr1.predict(transformed_test_data)
    print("F1 score:",round(metrics.f1_score(test_labels,pred, average="weighted"),2)) 
    
    coef=[]
    coef=lr1.coef_    
       
    df=pd.DataFrame(coef.T)
    
    vocabularyl1=[]  
  
    for i in vect.vocabulary_.keys():
        vocabularyl1.append(i)     
    
     
    #Checking Vocabulary list length
    vocab=pd.DataFrame(vocabularyl1)
    df['Words']=vocab 
    categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space','Words']
    df.columns=[categories]
    #print(df)
    
    Words1= df.loc[df['alt.atheism'] != 0]['Words']
    Words2= df.loc[df['talk.religion.misc'] != 0]['Words']
    Words3= df.loc[df['comp.graphics'] != 0]['Words']
    Words4= df.loc[df['sci.space'] != 0]['Words']
    
    result = pd.concat([Words1,Words2,Words3,Words4],ignore_index=True) 
    vocabl1_result=result.drop_duplicates()
    print("Non Zero Values shape using l1 penalty:",vocabl1_result.shape[0])
    print("Reduced size vocabulary with only features that have at least one non-zero weight",vocabl1_result.shape[0],"\n")   

        
    ##Logistic Regression with l2
    
    print("Logistic Regression l2")
    lr2 = LogisticRegression(penalty='l2')
    lr2.fit(transformed_training_data, train_labels)
    pred=lr2.predict(transformed_test_data)
    print("F1 score:",round(metrics.f1_score(test_labels,pred, average="weighted"),2)) 
    
    coef=[]
    coef=lr2.coef_    
        
    df=pd.DataFrame(coef.T)
    
    vocabularyl2=[]  
  
    for i in vect.vocabulary_.keys():
        vocabularyl2.append(i)
         
    #Checking Vocabulary list length
    vocabl2=pd.DataFrame(vocabularyl2)
    df['Words']=vocabl2
    categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space','Words']
    df.columns=[categories]
  
    Words1= df.loc[df['alt.atheism'] != 0]['Words']
    Words2= df.loc[df['talk.religion.misc'] != 0]['Words']
    Words3= df.loc[df['comp.graphics'] != 0]['Words']
    Words4= df.loc[df['sci.space'] != 0]['Words']
    
    resultl2 = pd.concat([Words1,Words2,Words3,Words4],ignore_index=True) 
    vocabl2_result=resultl2.drop_duplicates()
    print("Non Zero Values shape using l2 penalty:",vocabl2_result.shape[0])
    print("Reduced size vocabulary with only features that have at least one non-zero weight",vocabl2_result.shape[0],"\n")   
        
    ##Retrain a model Logistic Regression using l2
    print("Retrain a model Logistic Regression using l2")
    vect2=CountVectorizer(vocabulary=vocabl1_result)
    transformed_training_data=vect2.fit_transform(train_data)
    transformed_test_data=vect2.transform(test_data)#Only transform and not fit, given we need the vocabulary for both datasets.
    print("Vocabulary size for retrained model:{}".format(len(vect2.vocabulary_)))
    
    lr2new = LogisticRegression(penalty='l2')
    lr2new.fit(transformed_training_data, train_labels)
    predlr2new=lr2new.predict(transformed_test_data)
    print("F1 score for retrained model:",round(metrics.f1_score(test_labels,pred, average="weighted"),2))
    
        
    #Plot                 
    #Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when
    #pruning unused features by adjusting the C parameter.          

    
    ### STUDENT END ###
    
P6()

Logistic Regression using l1 penalty, tol=0.01 and C=0.4
Vocabulary size:26731
F1 score: 0.73
Non Zero Values shape using l1 penalty: 637
Reduced size vocabulary with only features that have at least one non-zero weight 637 

Logistic Regression l2
F1 score: 0.74
Non Zero Values shape using l2 penalty: 26731
Reduced size vocabulary with only features that have at least one non-zero weight 26731 

Retrain a model Logistic Regression using l2
Vocabulary size for retrained model:637
F1 score for retrained model: 0.74


(7) Use the TfidfVectorizer -- how is this different from the CountVectorizer? Train a logistic regression model with C=100.

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

In [216]:
def P7():

### STUDENT START ###

    vect=TfidfVectorizer()
    transformed_training_data=vect.fit_transform(train_data)
    transformed_dev_data=vect.transform(dev_data)
    vocab=vect.vocabulary_
    print("Vocabulary Size:{}".format(len(vect.vocabulary_)),"\n")
    
    logreg = LogisticRegression(penalty='l2')
    logreg.fit(transformed_training_data, train_labels)
    prediction=logreg.predict(transformed_dev_data)
    
    prediction_prob=logreg.predict_proba(transformed_dev_data)
    
    maxProb = np.amax(prediction_prob, axis=1)
    
    #create empty array to store calculated R values
    rValues = []

    #calculate R value and add to array
    for i in range(len(maxProb)):
        r = maxProb[i]/float(prediction_prob[i,dev_labels[i]])
        rValues.extend([r])

    #sort the R values in ascending order
    sortedIndexes = np.argsort(rValues)

    #return the Rvalue, document text, actual label, and predicted label for documents with the top 3 document labels.
    for index in sortedIndexes[-3:]:
        print("R score =",(rValues[index]),"\n")        
        print('Original label:', newsgroups_train.target_names[dev_labels[index]])
        print('Predicted label:', newsgroups_train.target_names[prediction[index]],'\n')
        print(dev_data[index])

## STUDENT END ###

P7()

Vocabulary Size:26731 

R score = 4.2619625514 

label: alt.atheism
Predicted label: talk.religion.misc 

With the Southern Baptist Convention convening this June to consider
the charges that Freemasonry is incompatible with christianity, I thought
the following quotes by Mr. James Holly, the Anti-Masonic Flag Carrier,
would amuse you all...

     The following passages are exact quotes from "The Southern 
Baptist Convention and Freemasonry" by James L. Holly, M.D., President
of Mission and Ministry To Men, Inc., 550 N 10th St., Beaumont, TX 
77706. 
 
     The inside cover of the book states: "Mission & Ministry to Men, 
Inc. hereby grants permission for the reproduction of part or all of 
this booklet with two provisions: one, the material is not changed and
two, the source is identified." I have followed these provisions. 
  
     "Freemasonry is one of the allies of the Devil" Page iv. 
 
     "The issue here is not moderate or conservative, the issue is God
and the Devil" Page vi.

ANSWER:CountVectorizer just counts the word frequencies. With the TFIDFVectorizer the value increases proportionally to count, but is offset by the frequency of the word in the corpus.

Given TFIDFVectorizer considers the presence of each word in the entire corpus,seems like the model is missclassifiyng religion texts because of other words with high presence and related to other subjects, like graphics, which result to be misleading. 

A way to fix this is by using bigrams, which usually tend to improve the accuracy of text classification models, given using a set of words in the vocabulary instead of one can help the model. 


(8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.

In [221]:
    vect=TfidfVectorizer(ngram_range=(1,2))
    transformed_training_data=vect.fit_transform(train_data)
    transformed_dev_data=vect.transform(dev_data)
    vocab=vect.vocabulary_
    print("Vocabulary Size:{}".format(len(vect.vocabulary_)),"\n")
    
    logreg = LogisticRegression(penalty='l2')
    logreg.fit(transformed_training_data, train_labels)
    prediction=logreg.predict(transformed_dev_data)
    
    prediction_prob=logreg.predict_proba(transformed_dev_data)
    
    maxProb = np.amax(prediction_prob, axis=1)
    
    #create empty array to store calculated R values
    rValues = []

    #calculate R value and add to array
    for i in range(len(maxProb)):
        r = maxProb[i]/float(prediction_prob[i,dev_labels[i]])
        rValues.extend([r])

    #sort the R values in ascending order
    sortedIndexes = np.argsort(rValues)

    #return the Rvalue, document text, actual label, and predicted label for documents with the top 3 document labels.
    for index in sortedIndexes[-3:]:
        print("\n R score =",(rValues[index]),"\n")        
        print('Orginal label:', newsgroups_train.target_names[dev_labels[index]])
        print('Predicted label:', newsgroups_train.target_names[prediction[index]],'\n')
        print(dev_data[index])

Vocabulary Size:221155 


 R score = 3.79606387065 

Orginal label: talk.religion.misc
Predicted label: comp.graphics 

Newsgroups: talk.religion.misc
Subject: On-line copy of Book of Mormon
Summary: 
Distribution: usa
Organization: Advanced Decision Systems, Mtn. View, CA (415) 960-7300
Keywords: BOM, Book of Mormon, Mormon

Can anyone provide me a ftp site where I can obtain a online version
of the Book of Mormon. Please email the internet address if possible.

 R score = 3.80779845851 

Orginal label: talk.religion.misc
Predicted label: alt.atheism 

Why is the NT tossed out as info on Jesus.  I realize it is normally tossed
out because it contains miracles, but what are the other reasons?

MAC
--
****************************************************************
                                                    Michael A. Cobb
 "...and I won't raise taxes on the middle     University of Illinois
    class to pay for my programs."                 Champaign-Urbana
          -Bill Cli