# CS 412 Machine Learning 2020 

# Assignment 3

100 pts

## Goal 

The goals of this assignment are:

*  Introduction to working with text data
*  Gain experience with the Scikit-Learn library
*  Gain experience with Naive Bayes and Logistic Regression

## Dataset

The **20 Newsgroups dataset** is a collection of 18,846 documents spread across 20 different topics.


## Task
Build Naive Bayes and Logistic Regression classifiers with the scikit-learn library to **classify** the documents by topic.

## Submission

Follow the instructions at the end.

# 1) Initialize

First, make a copy of this notebook in your Drive.

# 2) Load Dataset

The 20 Newsgroups dataset is available in the scikit-learn library.

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
train_batch = fetch_20newsgroups(subset='train')
test_batch = fetch_20newsgroups(subset='test')

In [3]:
# target groups you will be dealing with
target_groups = train_batch.target_names
target_groups

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
# creating training and test sets
train_x =  train_batch["data"]
train_y =  train_batch["target"]
test_x  =  test_batch["data"]
test_y  =  test_batch["target"]

In [5]:
print(train_x[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [6]:
print(target_groups[train_y[0]])

rec.autos


# Preprocess

In [7]:
import re

In [8]:
%%capture
import nltk
nltk.download("stopwords")

In [9]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

In [10]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [11]:
# You will use this function to preprocess your data. If you add another preprocessing step
# to the function, please mention it in your report.
def preprocess(text):
  # Remove e-mail addresses, tokens containing a dot or a hyphen, digits, and punctuation
  text = re.sub(r"[\w\d._]+@[^\s]+|[^\s]+\.[^\s]+|[^\s]+-[^\s]+|\d+|[^\w\s]","",text.lower().strip())
  # Drop stop words and stem the remaining tokens
  text = ' '.join([stemmer.stem(word) for word in re.findall(r"\w+",text) if word not in stop_words])
  return text

In [12]:
# Apply the preprocess function to the training and test sets
preprocessed_train_x = [preprocess(sample) for sample in train_x]
preprocessed_test_x = [preprocess(sample) for sample in test_x]
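
To get a feel for what the first substitution in `preprocess` removes, here is a minimal sketch that applies only the regular expression and the `\w+` tokenization, leaving out stop-word removal and stemming so that it runs without NLTK. The helper name `strip_noise` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as in preprocess: e-mail addresses, tokens containing a dot,
# tokens containing a hyphen, digit runs, and any remaining punctuation.
NOISE = re.compile(r"[\w\d._]+@[^\s]+|[^\s]+\.[^\s]+|[^\s]+-[^\s]+|\d+|[^\w\s]")

def strip_noise(text):
    """Hypothetical helper: regex cleanup only, no stop words or stemming."""
    cleaned = NOISE.sub("", text.lower().strip())
    return re.findall(r"\w+", cleaned)

sample = "Email me at jane@example.com about the 2-door car, built in 1975!"
print(strip_noise(sample))
```

Note that a hyphenated token such as `2-door` is dropped as a whole rather than split, so words like `e-mail` in the raw posts never reach the vectorizer.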

# Models

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score

import numpy as np
import pandas as pd

## Tune Naive Bayes

In [14]:
# Create a CountVectorizer for NB with:
#     min_df = 50
#     max_df = 3000
#     stop_words = stop_words
vectorizer = CountVectorizer(min_df=50, max_df=3000, stop_words=stop_words)

In [15]:
# Vectorize your training and test set
NB_train_x = vectorizer.fit_transform(preprocessed_train_x)
NB_test_x = vectorizer.transform(preprocessed_test_x)
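
The `min_df` and `max_df` arguments above are integers, so they act as absolute document counts: a term is kept only if it appears in at least `min_df` and at most `max_df` documents. A rough pure-Python sketch of that filtering follows; the function names and toy documents are illustrative, and whitespace splitting stands in for `CountVectorizer`'s own tokenizer.

```python
from collections import Counter

def build_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df)

def count_matrix(docs, vocabulary):
    """Return one row of term counts per document, columns ordered as vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocabulary])
    return rows

toy_docs = ["car engine car", "car bike", "bike ride", "car space"]
vocab = build_vocabulary(toy_docs, min_df=2, max_df=3)
print(vocab)
print(count_matrix(toy_docs, vocab))
```

Rare terms (here `engine`, `ride`, `space`) fall below `min_df` and are discarded, which is what keeps the feature matrix for the real dataset manageable.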

In [16]:
#https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

#Initiate the NB model with required components.
NB_model = MultinomialNB()

#Set the hyperparameter space that will be scanned:
#   alpha = (0.1,0.5,1.0,5.0)
hyperparameters = dict(
    alpha = (0.1,0.5,1.0,5.0)
)

#Let the GridSearchCV scan the hyperparameter and find the best hyperparameter set that will maximize the scoring option.
#   cv = 3
#   scoring = "accuracy"
NB_grid_search = GridSearchCV(NB_model,hyperparameters,cv=3,scoring="accuracy")
NB_grid_search.fit(NB_train_x,train_y)

GridSearchCV(cv=3, error_score=nan,
             estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                     fit_prior=True),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': (0.1, 0.5, 1.0, 5.0)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

In [17]:
# show the best score
NB_grid_search.best_score_

0.8212835269890522

In [18]:
# show the best parameter
NB_grid_search.best_params_

{'alpha': 0.5}
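
The `alpha` being tuned is the additive (Laplace/Lidstone) smoothing parameter of `MultinomialNB`. As a small sketch of its effect, the per-class word probabilities can be written as (count + alpha) / (total + alpha * vocabulary size); the function name and toy counts below are illustrative only.

```python
import math

def smoothed_word_probs(word_counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * n_words)."""
    total = sum(word_counts) + alpha * len(word_counts)
    return [(c + alpha) / total for c in word_counts]

toy_counts = [3, 0, 1]  # counts of three words within one class
for alpha in (0.1, 0.5, 1.0, 5.0):
    probs = smoothed_word_probs(toy_counts, alpha)
    # Show the smoothed probabilities and the log-probability of the unseen word
    print(alpha, [round(p, 3) for p in probs], round(math.log(probs[1]), 3))
```

Larger values of `alpha` pull the estimates toward a uniform distribution, while small values stay close to the raw frequencies but still avoid zero probabilities for unseen words.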

### Evaluate The Best Model for NB

In [19]:
#Create your NB model with the best parameter set.
best_NB_model = MultinomialNB(alpha=0.5)
#Fit your model on training set.
best_NB_model = best_NB_model.fit(NB_train_x, train_y)

In [20]:
# Make predictions on test set
NB_predictions = best_NB_model.predict(NB_test_x)

In [21]:
# Show your accuracy on test set
accuracy_score(test_y, NB_predictions)

0.7555762081784386

## Tune Logistic Regression

In [22]:
# Create a CountVectorizer for LR with:
#     min_df = 50
#     max_df = 3000
#     stop_words = stop_words
vectorizer = CountVectorizer(min_df=50, max_df=3000, stop_words=stop_words)

In [23]:
# Vectorize your training and test sets
LR_train_x = vectorizer.fit_transform(preprocessed_train_x)
LR_test_x = vectorizer.transform(preprocessed_test_x)

In [24]:
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#Initiate the LR model:
#   max_iter=2000
LR_model = LogisticRegression(max_iter=2000)

# Set the hyperparameter space that will be scanned:
#     C = (0.001,0.01,0.1,1)     
hyperparameters = dict(
    C = (0.001,0.01,0.1,1)
)

#Let the GridSearchCV scan the hyperparameter and find the best hyperparameter set that will maximize the scoring option.
#   cv = 3
#   scoring = "accuracy"
LR_grid_search = GridSearchCV(LR_model,hyperparameters,cv=3,scoring="accuracy")
LR_grid_search.fit(LR_train_x,train_y)

GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=2000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': (0.001, 0.01, 0.1, 1)}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring='accuracy',
             verbose=0)

In [25]:
# show the best score
LR_grid_search.best_score_

0.8338344038554356

In [26]:
# show the best parameter
LR_grid_search.best_params_

{'C': 0.1}
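
`C` is the inverse of the regularization strength: with the default L2 penalty, smaller values weight the penalty on the coefficients more heavily relative to the data-fit term. One common way to write the binary objective is 0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i (w·x_i + b)))) with labels in {-1, +1}; the sketch below evaluates it on toy data. This is an illustration of the trade-off, not scikit-learn's actual multiclass solver, and all names are hypothetical.

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of logistic losses, labels y in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        data_loss += math.log1p(math.exp(-margin))
    return penalty + C * data_loss

X_toy = [[1.0, 0.0], [-1.0, 0.0]]
y_toy = [1, -1]
for C in (0.001, 0.01, 0.1, 1):
    print(C, l2_logistic_objective([1.0, 0.0], 0.0, X_toy, y_toy, C))
```

For the same weights, the penalty term stays fixed at 0.5 while the data-fit term shrinks with `C`, so at small `C` the optimizer has little incentive to grow the coefficients.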

### Evaluate The Best Model for Logistic Regression

In [27]:
#Create your LR model with the best parameter set.
best_LR_model = LogisticRegression(C=0.1, max_iter=2000)
#Fit your model on training set.
best_LR_model = best_LR_model.fit(LR_train_x, train_y)

In [28]:
# Make predictions on test set
LR_predictions = best_LR_model.predict(LR_test_x)

In [29]:
# Show your accuracy on test set
accuracy_score(test_y,LR_predictions)

0.7484067976633032

# Feature Importances

In [30]:
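# Find each category's top 3 most important features (words) for the LR model and show them in a dataframe
LR_dict = {}
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
words = vectorizer.get_feature_names()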

for i in range(len(target_groups)):
  coefs = best_LR_model.coef_[i].flatten().tolist()
  LR_dict[target_groups[i]] = [item[0] for item in sorted(list(zip(words,coefs)),reverse=True,key=lambda x: x[1])[:3]]

lr_df = pd.DataFrame(LR_dict)
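
The ranking step inside the loop can also be factored into a small helper. The sketch below uses `heapq.nlargest` in place of a full sort, which returns the same result as sorting in descending order and slicing; `top_k_words` and the toy values are hypothetical names for illustration.

```python
import heapq

def top_k_words(words, coefs, k=3):
    """Return the k words with the largest coefficients, largest first."""
    best = heapq.nlargest(k, zip(words, coefs), key=lambda pair: pair[1])
    return [word for word, _ in best]

toy_words = ["car", "engin", "god", "space", "auto"]
toy_coefs = [1.8, 0.4, -0.9, -0.2, 1.1]
print(top_k_words(toy_words, toy_coefs))
```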

In [31]:
# My Output:
lr_df

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
0,atheist,graphic,window,monitor,mac,xr,sale,car,dod,basebal,hockey,clipper,circuit,doctor,space,christian,gun,israel,theodor,christian
1,keith,imag,win,gateway,appl,widget,offer,automot,bike,philli,team,encrypt,electron,diseas,orbit,god,waco,isra,clinton,order
2,islam,anim,driver,motherboard,powerbook,motif,forsal,auto,motorcycl,yanke,playoff,tap,tv,medic,pat,church,firearm,serdar,libertarian,koresh


In [None]:
# It should look like this:
lr_df

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
0,atheist,graphic,window,monitor,mac,xr,sale,car,dod,basebal,hockey,clipper,circuit,doctor,space,christian,gun,israel,theodor,christian
1,keith,imag,win,gateway,appl,widget,offer,automot,bike,philli,team,encrypt,electron,diseas,orbit,god,waco,isra,clinton,order
2,islam,anim,driver,motherboard,powerbook,motif,forsal,auto,motorcycl,yanke,playoff,tap,tv,medic,pat,church,firearm,serdar,libertarian,koresh


In [32]:
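# Find each category's top 3 most important features (words) for the NB model and show them in a dataframe
NB_dict = {}
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
words = vectorizer.get_feature_names()

for i in range(len(target_groups)):
  # feature_log_prob_ holds log P(word | class); it equals the older coef_ attribute for multi-class NB
  coefs = best_NB_model.feature_log_prob_[i].flatten().tolist()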
  NB_dict[target_groups[i]] = [item[0] for item in sorted(list(zip(words,coefs)),reverse=True,key=lambda x: x[1])[:3]]

nb_df = pd.DataFrame(NB_dict)

In [33]:
# My Output:
nb_df

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
0,god,imag,window,drive,mac,window,sale,car,bike,year,team,key,wire,peopl,space,god,gun,armenian,peopl,god
1,peopl,file,file,card,appl,file,new,engin,dod,game,game,encrypt,ground,msg,orbit,christian,peopl,peopl,go,christian
2,say,graphic,driver,scsi,problem,program,offer,good,ride,team,play,chip,work,medic,launch,peopl,right,israel,think,peopl


In [None]:
# It should look like this:
nb_df

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
0,god,imag,window,drive,mac,window,sale,car,bike,year,team,key,wire,peopl,space,god,gun,armenian,peopl,god
1,peopl,file,file,card,appl,file,new,engin,dod,game,game,encrypt,ground,msg,orbit,christian,peopl,peopl,go,christian
2,say,graphic,driver,scsi,problem,program,offer,good,ride,team,play,chip,work,medic,launch,peopl,right,israel,think,peopl


# **Notebook & Report**

Notebook: We may just look at your notebook results; so make sure each cell is run and outputs are there.

Report: Write a summary of your approach to this problem, at most half a page long, at the end of your notebook; it should read like the abstract of a paper or an executive summary.

Must include statements such as:

( Include the problem definition: 1-2 lines )

(Talk about any preprocessing you did, explain your reasoning)

(Talk about train/test sets, size and how split)

(State what your test results are with the chosen method, parameters: e.g. "We have obtained the best results with the ….. classifier (parameters=....) , giving classification accuracy of …% on test data….")

(Comment on feature importances of models)

(Comment on anything that you deem important/interesting)


You will get full points from here as long as you have a good (enough) summary of your work, regardless of your best performance or what you have decided to talk about in the last few lines.



# **Write your report in this cell**

**My Report**

**Problem Definition:** 

Given the 20 Newsgroups dataset, the task in this homework was to use Naive Bayes and Logistic Regression models to classify the documents by topic.

**For preprocessing:** 

No extra preprocessing was done beyond what was provided in the original .ipynb file. I tried to keep only the message bodies (by excluding header lines such as `From:` and `Lines:`), but my regex knowledge was not enough to achieve this, so I skipped that step.

**Train/val/test sets, size and how split:** 

I did not split the data myself, since the training and test sets were provided separately. The training set has 11314 instances and the test set has 7532 instances.

On the training data, I used grid search with 3-fold cross-validation for both models to find the hyperparameter value that maximizes accuracy. For Naive Bayes, the best value of $\alpha$ was 0.5; for Logistic Regression, the best value of $C$ was 0.1.


**Test results, best hyperparameters, classification accuracy scores:**

We have obtained the best results with the Naive Bayes model (`hyperparameters=(alpha=0.5)`), giving a classification accuracy of 75.56% on test data.

The best Logistic Regression model (`hyperparameters=(C=0.1)`) gave a classification accuracy of 74.84% on test data.

**Feature importances of models:**

The top 3 features (words) of some categories are completely different between the Naive Bayes and Logistic Regression models. One example is the `comp.sys.ibm.pc.hardware` category: the top 3 words are `[drive, card, scsi]` in the NB model but `[monitor, gateway, motherboard]` in the LR model.


**Comments:**

I was expecting the Logistic Regression model to perform better on the test set, since its cross-validation accuracy was higher (0.834 vs. 0.821), but the Naive Bayes model performed better. I believe this might be because discriminative models are more prone to overfitting than generative models.
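
To put numbers on this comparison, the short snippet below subtracts each model's test accuracy from its cross-validation score, using the values printed earlier in the notebook. The dictionary names are chosen here for illustration.

```python
# Scores copied from the notebook outputs above
cv_accuracy = {"NB": 0.8212835269890522, "LR": 0.8338344038554356}
test_accuracy = {"NB": 0.7555762081784386, "LR": 0.7484067976633032}

# Drop from cross-validation accuracy to test accuracy for each model
gaps = {name: cv_accuracy[name] - test_accuracy[name] for name in cv_accuracy}
for name, gap in gaps.items():
    print(f"{name}: CV {cv_accuracy[name]:.4f} -> test {test_accuracy[name]:.4f} (drop {gap:.4f})")
```

The drop is larger for Logistic Regression (about 0.085) than for Naive Bayes (about 0.066).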

# **Submission**
You will submit this homework via SUCourse.


Please read this document again before submitting it.

Please submit your **"share link" INLINE in SUCourse submissions.** That is, we should be able to click on the link, go there, and run (and possibly also modify) your code.

For us to be able to modify it in case of errors etc., you should get your "share link" as **share with anyone in edit mode**.

Download the **.ipynb and the .html** file and upload both of them to SUCourse.
 
Please do your assignment individually, do not copy from a friend or the Internet. Plagiarized assignments will receive -100.
