<a href="https://colab.research.google.com/github/rgoding2004/w207/blob/main/Ryan_Goding_project3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: Topic Classification using Naive Bayes

**Solution**

# Intro
---
In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topics inferred from the text. Whereas with digit classification, where each input is relatively **dense** (represented as a 28x28 matrix of pixels, many of which are non-zero), here each document is relatively **sparse** (represented as a **bag-of-words**). Only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the count of words, not their order.

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

## Grading
---
- Make sure to answer every part in every question.
 - There are 7 questions and one extra credit question. 
 - Read carefully what is asked including the notes.
 - Additional points may be deducted if:
   - the code is not clean and well commented, 
   - and if the functions or answers are too long.

 ## Requirements:
---
1. Comment your code.
1. All graphs should have titles, label for each axis, and if needed a legend. It should be understandable on its own.
1. All code must run on colab.research.google.com
1. You should not import any additional libraries.
1. Try and minimize the use of the global namespace (meaning keep things in functions).



In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         newsgroups_train.target_names)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


training label shape: (2034,)
dev label shape: (676,)
test label shape: (677,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


### Question 1: Examining your data
---

 1. For each of the first 5 training examples, print the text of the message along with the label (checkout newsgroups_train.target_names).

In [None]:
def Q1(num_examples=5):
    ### STUDENT START ###
    print('Below shows the first five training examples text along with the label:\n')
    for i in range(num_examples):
      print(train_data[i])
      print("\nlabel is: " + str(categories[train_labels[i]]) + '\n')
    ### STUDENT END ###

Q1(5)

Below shows the first five training examples text along with the label:

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

label is: talk.religion.misc



Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonst

### Question 2: Text representation
---

1. Transform the training data into a matrix of **word** unigram feature vectors.
  1. What is the size of the vocabulary? 
  1. What is the average number of non-zero features per example?  
  1. What is the fraction of the non-zero entries in the matrix?  
  1. What are the 0th and last feature strings (in alphabetical order)?
  - _Use `CountVectorization` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method._
1. Now transform the training data into a matrix of **word** unigram feature vectors restricting to the vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. 
  1. What is the average number of non-zero features per example?
  - _Use `CountVectorization(vocabulary=...)` and its `.transform` method._
1. Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  
  1. What is the size of the vocabulary?
  - _Use `CountVectorization(analyzer=..., ngram_range=...)` and its `.fit_transform` method._
1. Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  
  1. What is the size of the vocabulary?<br/>
  - _Use `CountVectorization(min_df=...)` and its `.fit_transform` method._
1. Now again transform the training data into a matrix of **word** unigram feature vectors. 
 1. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?
 - _Hint: Build vocabularies for both train and dev and look at the size of the difference._

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  See about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. 

In [None]:
def Q2():
    ### STUDENT START ###
    #create vectorizer to covert collection of text documents to a matrix
  vectorizer = CountVectorizer()
  X = vectorizer.fit_transform(train_data)
  #create vocab dict object
  vocab = vectorizer.get_feature_names()

  #Count total amount of words including repeats in dict object
  vocab_total = len(vocab)
  #display answers to q2.1
  print('Answers for Question 2.1:')
  print('The size of vocabulary is:', + vocab_total)
  print('The average number of non-zero features per example is: ' + str(X.nnz))
  #compute fraction of non-zero entities in matrix
  fraction = X.nnz/(2034*26879)
  #create vocab list to find 0th and last string features
  vocab_list = vectorizer.get_feature_names()
  print('The fraction of non-zero entities in the matrix is: ' + '%.4f' % fraction)
  print('The 0th feature string is: ' + vocab_list[0] )
  print('The last feature string is: ' + vocab_list[len(vocab_list)-1])

  #create new vectorizer for question 2.2 with restricted vocabulary
  vectorizer2_2 = CountVectorizer(vocabulary= ["atheism", "graphics", "space", "religion"])
  X2_2 = vectorizer2_2.transform(train_data)
  print('\nAnswer for Question 2.2:')
  print('The average number of non-zero features per example for new vectorizer is: ' + str(X2_2.nnz))

  #create new vectorizer with character bigram and trigram feature vectors
  vectorizer2_3 = CountVectorizer(analyzer='char', ngram_range=(2,3))
  X2_3 = vectorizer2_3.fit_transform(train_data)
  vocab_2_3 = vectorizer2_3.vocabulary_
  vocab_2_3_values = vocab_2_3.values()
  vocab_2_3_total = sum(vocab_2_3_values)
  print('\nAnswer for Question 2.3:')
  print("The size of vocabulary for CountVectorizer for bigram and trigram characters is: " + str(vocab_2_3_total))

  #create new vectorizer with word unigram feature vectors and prune words that appear in fewer than 10 documents
  vectorizer2_4 = CountVectorizer(analyzer='word', min_df=10)
  X2_4 = vectorizer2_4.fit_transform(train_data)
  vocab_2_4 = vectorizer2_4.vocabulary_
  vocab_2_4_values = vocab_2_4.values()
  vocab_2_4_total = sum(vocab_2_4_values)
  print('\nAnswer for Question 2.4:')
  print("The size of vocabulary for a CountVectorizer for word unigram feature vectors and \npruning words that appear in fewer than 10 documents: " + str(vocab_2_4_total))

  #question 2.5
  vectorizer2_5 = CountVectorizer(analyzer='word')
  X2_5 = vectorizer2_5.fit_transform(dev_data)
  vocab_dev = vectorizer2_5.vocabulary_

  vocab_dev_total = len(vocab_dev)
  print('\nAnswer for Question 2.5:')
  vocab_diff = (vocab_total - vocab_dev_total)/vocab_total
  print('Fraction of words in dev vocab that is missing from training vocab: ' + '%4f' % vocab_diff)
      ### STUDENT END ###

Q2()

Answers for Question 2.1:
The size of vocabulary is: 26879
The average number of non-zero features per example is: 196700
The fraction of non-zero entities in the matrix is: 0.0036
The 0th feature string is: 00
The last feature string is: zyxel

Answer for Question 2.2:
The average number of non-zero features per example for new vectorizer is: 546

Answer for Question 2.3:
The size of vocabulary for CountVectorizer for bigram and trigram characters is: 629326503

Answer for Question 2.4:
The size of vocabulary for a CountVectorizer for word unigram feature vectors and 
pruning words that appear in fewer than 10 documents: 4692516

Answer for Question 2.5:
Fraction of words in dev vocab that is missing from training vocab: 0.395588


### Question 3: Initial model evaluation
---

1. Transform the training and development data to matrices of word unigram feature vectors.
1. Produce several k-Nearest Neigbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score. 
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.
1. Why doesn't k-Nearest Neighbors work well for this problem?
1. Why doesn't Logistic Regression work as well as Naive Bayes does?
1. What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` method to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.

In [16]:
def Q3():
    ### STUDENT START ###
  #Transform both train and dev data to matrices of word unigram feature vectors
  vectorizer = CountVectorizer()
  #Fit, learn words in train data and then transform train data into numbers
  X_train = vectorizer.fit_transform(train_data).toarray()
  X_dev = vectorizer.transform(dev_data).toarray()

  #Produce Several K-nearest neighbors models
  k_values = [1,3,5,7,9]
  for k in k_values:
    neigh = KNeighborsClassifier(k, metric = 'euclidean')
    neigh.fit(X_train, train_labels)
    predict_list = []
    predict_list = neigh.predict(X_dev)
    f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
    print(str(k) + '-Nearest Neighbor model\'s f1 score is: ' + '%4f' % f1_score)

    #Produce Several NB with varying alpha values
  alphas = [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1]
  for alpha in alphas:
    clf = MultinomialNB(alpha = alpha)
    clf.fit(X_train, train_labels)
    predict_list = []
    predict_list = clf.predict(X_dev)
    f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
    print('Naive Bayes model with alpha value of ' + str(alpha) + ' has a f1 score of ' + '%.4f' % f1_score)

  #Produce several Logistic Regression models with varying L2 Regularization Strength
  Cs = [.001, .01, .1, 1, 2]
  for C in Cs:
    logreg = LogisticRegression(C=C, solver="liblinear", multi_class="auto").fit(X_train, train_labels)
    #show its f1 score
    predict_list = []
    predict_list = logreg.predict(X_dev)
    f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
    print('Logistic Regression with L2 score of ' + str(C) + ' has a f1 score of ' + '%.4f' % f1_score)
    ### STUDENT END ###

Q3()

1-Nearest Neighbor model's f1 score is: 0.394710
3-Nearest Neighbor model's f1 score is: 0.423038
5-Nearest Neighbor model's f1 score is: 0.442839
7-Nearest Neighbor model's f1 score is: 0.466039
9-Nearest Neighbor model's f1 score is: 0.455744


  'setting alpha = %.1e' % _ALPHA_MIN)


Naive Bayes model with alpha value of 0 has a f1 score of 0.7472
Naive Bayes model with alpha value of 0.1 has a f1 score of 0.7903
Naive Bayes model with alpha value of 0.3 has a f1 score of 0.7876
Naive Bayes model with alpha value of 0.5 has a f1 score of 0.7863
Naive Bayes model with alpha value of 0.7 has a f1 score of 0.7847
Naive Bayes model with alpha value of 0.9 has a f1 score of 0.7811
Naive Bayes model with alpha value of 1 has a f1 score of 0.7777
Logistic Regression with L2 score of 0.001 has a f1 score of 0.6193
Logistic Regression with L2 score of 0.01 has a f1 score of 0.6647
Logistic Regression with L2 score of 0.1 has a f1 score of 0.6966
Logistic Regression with L2 score of 1 has a f1 score of 0.6944
Logistic Regression with L2 score of 2 has a f1 score of 0.6925



ANSWER: 
K-nearest neighbor does not work well with this problem because the classifier is assuming that the classification of an instance is most similiar to another instance that is nearby in vector space, and this is not necessarily true.

Logistic Regression doesn't work as well as Naive Bayes in this case specifically because of the data sets we are working with in this problem.  Generally it is not the case that NB works better than logistic regression, and is dependent on the dataset being used. 

The relationship between logistic regression's sum of squared weights and C value is that the scaled sum of the square of the weights penalizes the log likehood function, and the C value is the inverse or the regularization weight.

### Question 4: Feature exploration
---

1. Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.
1. For each topic, find the 5 features with the largest weights (not absolute value). If there are no overlaps, you can expect 20 features in total.
1. Show a 20 row (features) x 4 column (topics) table of the weights. So, for each of the features (words) found, we show their weight for all topics.
1. Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` method to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 


In [74]:
def Q4():
    ### STUDENT START ###
    vectorizer = CountVectorizer(analyzer='word', ngram_range=(2,2))
    X = vectorizer.fit_transform(train_data).toarray()
    #X_dev = vectorizer.transform(dev_data).toarray()
    #produce logistic regression model
    clf = LogisticRegression(C=.5, solver = 'liblinear', multi_class='auto').fit(X, train_labels)
    #create 4 vectors of coef weights for each topic
    topic0 = clf.coef_[0]
    topic1 = clf.coef_[1]
    topic2 = clf.coef_[2]
    topic3 = clf.coef_[3]
    #vocab list 
    vocab_dict = vectorizer.vocabulary_
    #retrieving key if given value function
    def get_key(val):
      for key, value in vocab_dict.items():
        if val == value:
          return key
    #sort and flip function
    def sort_flip(array):
      temp_sort = np.argsort(array, axis = 0)
      temp_sort_flip = np.flip(temp_sort)
      return temp_sort_flip
    #run function to create sorted arrays
    topic0_sorted = sort_flip(topic0)
    topic1_sorted = sort_flip(topic1)
    topic2_sorted = sort_flip(topic2)
    topic3_sorted = sort_flip(topic3)
    def vectors(topic_array, weights_array):
      index =[]
      weights = []
      words = []
      #create vector of index
      for i in range(5):
        index.append(topic_array[i])
      #weights vector
      for i in index:
        weights.append(weights_array[i])
      #words vector
      for i in index:
        words.append(get_key(i))
      return (weights, words, index)
     #run created function to generate weights and words for each topic
    topic0_weights, topic0_words, topic0_index = vectors(topic0_sorted, topic0) 
    topic1_weights, topic1_words, topic1_index = vectors(topic1_sorted, topic1) 
    topic2_weights, topic2_words, topic2_index = vectors(topic2_sorted, topic2) 
    topic3_weights, topic3_words, topic3_index = vectors(topic3_sorted, topic3) 
    total_words = topic0_words + topic1_words + topic2_words + topic3_words
    total_words
    #remove duplicate item
    total_words.remove('cheers kent')
    #create total index to find weights for each topic
    total_index = topic0_index + topic1_index + topic2_index + topic3_index
    total_index

    #weights function
    def weights_function(index_array, weights_array):
      weights = []
      for i in range(19):
        weights.append(weights_array[index_array[i]])
      return weights

    topic0_weights_final = weights_function(total_index, topic0)
    topic1_weights_final = weights_function(total_index, topic1)
    topic2_weights_final = weights_function(total_index, topic2)
    topic3_weights_final = weights_function(total_index, topic3)
    #create pandas table
    df = pd.DataFrame(list(zip(topic0_weights_final, topic1_weights_final, topic2_weights_final, topic3_weights_final)), index =total_words, columns =['topic1', 'topic2','topic3', 'topic4'])
    return df
    ### STUDENT END ###

Q4()

Unnamed: 0,topic1,topic2,topic3,topic4
claim that,0.605549,-0.199067,-0.274345,-0.140364
was just,0.55572,-0.697918,-0.663766,0.534808
you are,0.48205,-0.131418,-0.128882,-0.227469
are you,0.47274,-0.279894,-0.481305,0.028373
looking for,0.446953,-0.248257,-0.097135,-0.305625
in advance,-0.630341,1.108375,-0.50005,-0.571869
comp graphics,-0.459351,0.832567,-0.438501,-0.418453
out there,-0.292166,0.801208,-0.370885,-0.285186
is there,-0.274803,0.758658,-0.479057,-0.277089
the space,-0.340882,0.754998,-0.468249,-0.257079


ANSWER: It's surprising how drastically different the weights are for the other coefficients for the other topic's top 5 weighted features.

### Question 5: Pre-processing for text
---

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

1. Produce a Logistic Regression model (with no preprocessing of text). **Note that you may need to override the "default" preprocessing with an identity function**. Evaluate and show its f1 score and size of the dictionary.
1. Produce an improved Logistic Regression model by preprocessing the text. Evaluate and show its f1 score and size of the vocabulary.  Aim for an improvement in f1 score of 0.02. **Note: this is actually very hard**.
1. How much did the improved model reduce the vocabulary size?

Notes:
* Things you can try: ** ???: Anything else we can suggest** 
 - Look at default pre-processing done.
 - Removing stop words.
 - Experiment with different ways of getting rid of apostrophe's such as replacing them with spaces or with empty strings.
  - Lower casing.
  - Including both lowercase and original case versions of a word.
  - nltk functions such as stemming.
* Train on the "transformed" training data, the data after you applied pre-processing.
* Evaluate on the transformed development data. Note that you never want to "learn" anything from the dev data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.
* The order you apply pre-processing may produce different results.


In [None]:
def Q5():
    ### STUDENT START ###
    #produce logistic regression with no preprocessing of text
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_data).toarray()
    X_dev = vectorizer.transform(dev_data).toarray()
    #produce logistic regression model
    clf = LogisticRegression(C=.5, solver = 'liblinear', multi_class='auto').fit(X, train_labels)
    #show its f1 score and vocab size
    predict_list = []
    predict_list = clf.predict(X_dev)
    f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
    print('Logistic Regression without preprocessing of text has a f1 score of ' + '%.4f' % f1_score)
    #calculate vocab size
    vocab_list = []
    vocab_list = vectorizer.get_feature_names()
    vocab_size = len(vocab_list)
    print('The Vocab size for logistic regression without preprocessing is ' + str(vocab_size))
    print('\nPreprocessing of text using stop_words, a built in stop word list for english, results in the following:\n')
    #new logistic regression with preprocessing of text
    #produce logistic regression with no preprocessing of text
    vectorizer_improved = CountVectorizer(stop_words='english')
    X = vectorizer_improved.fit_transform(train_data).toarray()
    X_dev = vectorizer_improved.transform(dev_data).toarray()
    #produce logistic regression model
    clf = LogisticRegression(C=.5, solver = 'liblinear', multi_class='auto').fit(X, train_labels)
    #show its f1 score and vocab size
    predict_list = []
    predict_list = clf.predict(X_dev)
    f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
    print('Logistic Regression with preprocessing of text has an improved f1 score of ' + '%.4f' % f1_score)
    #calculate vocab size
    vocab_list = []
    vocab_list = vectorizer_improved.get_feature_names()
    vocab_size_new = len(vocab_list)
    print('The Vocab size for logistic regression with preprocessing is ' + str(vocab_size))
    print('The improved model reduced the vocabulary size by ' + str(vocab_size - vocab_size_new))
    ### STUDENT END ###

Q5()

Logistic Regression without preprocessing of text has a f1 score of 0.7085
The Vocab size for logistic regression without preprocessing is 26879

Preprocessing of text using stop_words, a built in stop word list for english results in the following:

Logistic Regression with preprocessing of text has an improved f1 score of 0.7237
The Vocab size for logistic regression with preprocessing is 26879
The improved model reduced the vocabulary size by 303


### Question 6: L1 and L2 regularization
---

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, **L1 regularization drives many of the weights to 0, effectively removing unimportant features**. For this reason, we can use it as a way to do "feature selection".

1. For several L1 regularization strengths ...
  1. Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.
  1. Produce a new Logistic Regression model using the reduced vocabulary . For this new model, use an **L2** regularization strength of 0.5.  
  1. Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.
1. Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.
1. How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* No need to apply pre-processing from question 5.
* Train on the transformed (i.e. CountVectorizer) training data.
* Evaluate on the transformed development data (using the CountVectorizer instance you trained on the training data).
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).
* (solver="liblinear" might be needed for it not to crash)

In [None]:
def Q6():
    # Keep this random seed here to make comparison easier.
    np.random.seed(0)

    ### STUDENT START ###

    ### STUDENT END ###

Q6()

In [18]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_data).toarray()
X_dev = vectorizer.transform(dev_data).toarray()
#Produce a logistic regularization using several L1 regularization strengths
regularization_strength = [.001, .01, .1, .5 , 1 ]
for strength in regularization_strength:
  clf = LogisticRegression(C = strength, penalty='l1', solver = 'liblinear').fit(X, train_labels)
  #show its f1 score and vocab size
  predict_list = []
  predict_list = clf.predict(X_dev)
  f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
  print('Logistic Regression model using L1 regularization with a stength of ' + '%.4f' % (1/strength) + ' has a f1 score of ' + '%.4f' % f1_score)
#Reduce vocab to ones that have at least one nnz weight among four categories

#Produce new logistic regression using the reduced vocab



Logistic Regression model using L1 regularization with a stength of 1000.0000 has a f1 score of 0.2449
Logistic Regression model using L1 regularization with a stength of 100.0000 has a f1 score of 0.4370
Logistic Regression model using L1 regularization with a stength of 10.0000 has a f1 score of 0.6305




Logistic Regression model using L1 regularization with a stength of 2.0000 has a f1 score of 0.6889
Logistic Regression model using L1 regularization with a stength of 1.0000 has a f1 score of 0.6832




ANSWER: 

### Question 7: TfIdf
---
As you may recall [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) stands for *term frequency inverse document frequency* and is a way to assign a weight to each word or token signifying their importance for a document in a corpus (a collection of documents).

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.  How is `TfidfVectorizer` different than `CountVectorizer`?

1. How is `TfidfVectorizer` different than `CountVectorizer`?
1. Show the 3 documents with highest R ratio, where ...
  - $R\,ratio = maximum\,predicted\,probability \div predicted\,probability\,of\,correct\,label$
1. Explain what the R ratio describes.
1. What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

Note:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.

In [13]:
def Q7():
    ### STUDENT START ###
  #creat Tfidfvectorizer
  tfid_vectorizer = TfidfVectorizer()
  #fit and transform to training data
  X_tfid = tfid_vectorizer.fit_transform(train_data).toarray()
  X_dev = tfid_vectorizer.transform(dev_data)
  #Produce Logistic Regressions based on data in tf-idf form
  logreg_model = LogisticRegression(C=100, solver='liblinear', multi_class='auto').fit(X_tfid, train_labels)
  #show its f1 score
  predict_list = []
  predict_list = logreg_model.predict(X_dev)
  f1_score = metrics.f1_score(dev_labels, predict_list, average = 'weighted')
  print('Logistic Regression model using L2 regularization score of 100 has a f1 score of ' + '%.4f\n' % f1_score)

  #Access predicted probabilites
  predic_prob = logreg_model.predict_proba(X_dev)
  #find max of each row 
  #compute probability of correct lable
  predic_prob_label = logreg_model.score(X_dev, dev_labels)
  #Compute R Ratio
  r_ratio = predic_prob/predic_prob_label
  #sort r_ratio to show 3 documents with highest r_ratio
  r_ratio_sorted = np.argsort(np.max(r_ratio, axis = 1))
  #flip array so it is in descending order
  r_ratio_sorted_flipped = np.flip(r_ratio_sorted)
  #Display index for top three R-ratio documents
  print('The three documents index in dev_data that have the highest R ratio are: ')
  for i in range(3):
    print(r_ratio_sorted_flipped[i])

    ### STUDENT END ###

Q7()

Logistic Regression model using L2 regularization score of 100 has a f1 score of 0.7598

The three documents index in dev_data that have the highest R ratio are: 
475
12
505


ANSWER: Both the TfidfTransfomer and CountVectorizer produce frequencies of the terms in the data, but TfidTransformer normalizes the count.  The R ratio is taking the max of the four labels and dividing it by the overall models label prediction.  This is showing the difference in the dev data prediction and the training prediction.  It appears the model is more favorable to the larger documents, this can be fixed by splitting portions of the document. 

### Question 8 EXTRA CREDIT:
---
Produce a Logistic Regression model to implement your suggestion from Part 7.