# CS 412 Machine Learning 2020 

# Assignment 3

100 pts

## Goal 

The goals of this assignment are:

*  Get an introduction to working with text data
*  Gain experience with the Scikit-Learn library
*  Gain experience with Naive Bayes and Logistic Regression

## Dataset

The **20 Newsgroup Dataset** is a collection of 18846 documents covering 20 different topics.


## Task
Build Naive Bayes and Logistic Regression classifiers with the scikit-learn library to **classify** the documents by topic.

## Submission

Follow the instructions at the end.

# 1) Initialize

First, make a copy of this notebook in your drive.

# 2) Load Dataset

The 20 Newsgroup Dataset is available in the Scikit-Learn library.

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
train_batch = fetch_20newsgroups(subset='train')
test_batch = fetch_20newsgroups(subset='test')

In [None]:
# target groups you will be dealing with
target_groups = train_batch.target_names
target_groups

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
# creating training and test sets
train_x =  train_batch["data"]
train_y =  train_batch["target"]
test_x  =  test_batch["data"]
test_y  =  test_batch["target"]

In [None]:
print(train_x[0])

train_y[0]

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







7

In [None]:
print(target_groups[train_y[0]])

rec.autos


In [None]:
print(len(train_x), len(test_x))
print(len(train_y), len(test_y))

11314 7532
11314 7532


# Preprocess

In [None]:
import re

In [None]:
%%capture
import nltk
nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [None]:
# Use this function to preprocess your data. If you add another preprocessing step to the function, mention it in your report.
def preprocess(text):
  # Remove e-mail addresses, tokens containing a dot or a hyphen, digits, and punctuation
  text = re.sub(r"[\w\d._]+@[^\s]+|[^\s]+\.[^\s]+|[^\s]+-[^\s]+|\d+|[^\w\s]", "", text.lower().strip())
  # Drop stop words, then stem the remaining words
  text = ' '.join([stemmer.stem(word) for word in re.findall(r"\w+", text) if word not in stop_words])
  return text

In [None]:
# Apply <preprocess> function on the training and test set 
preprocessed_train_x = [preprocess(sample) for sample in train_x]
preprocessed_test_x = [preprocess(sample) for sample in test_x]


In [None]:
print(preprocessed_train_x[0])

where thing subject car organ univers maryland colleg park line wonder anyon could enlighten car saw day sport car look late earli call bricklin door realli small addit front bumper separ rest bodi know anyon tellm model name engin spec year product car made histori whatev info funki look car pleas thank il brought neighborhood lerxst
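
The cleaning regex in `preprocess` packs five alternatives into one pattern. The sketch below splits the same pattern into commented pieces and applies it to the first header line of the sample above. It leaves out stop-word removal and stemming, so it needs only the standard library; the helper name `strip_noise` is made up for this illustration.

```python
import re

# The same pattern as in preprocess(), written one alternative per line.
PATTERN = re.compile(
    r"[\w\d._]+@[^\s]+"    # e-mail addresses
    r"|[^\s]+\.[^\s]+"     # tokens with a dot inside (host names, decimals)
    r"|[^\s]+-[^\s]+"      # hyphenated tokens
    r"|\d+"                # runs of digits
    r"|[^\w\s]"            # any remaining punctuation character
)

def strip_noise(text):
    """Lowercase the text and delete everything the pattern matches."""
    return PATTERN.sub("", text.lower().strip())

cleaned = strip_noise("From: lerxst@wam.umd.edu (where's my thing)")
tokens = re.findall(r"\w+", cleaned)
print(tokens)
```

In the full `preprocess` function, the stop words among these tokens are then dropped and the remaining words are stemmed, which is consistent with the output above starting with `where thing`.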


# Models

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score

import numpy as np
import pandas as pd

## Tune Naive Bayes

In [None]:
# Create a CountVectorizer for NB with:
min_df = 50
max_df = 3000
#     stop_words = stop_words
vectorizerNaive = CountVectorizer(min_df=min_df, max_df=max_df,stop_words = stop_words)

In [None]:
# Vectorize your training and test set
train_x = vectorizerNaive.fit_transform(preprocessed_train_x)
test_x = vectorizerNaive.transform(preprocessed_test_x)
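
As a rough illustration of what `min_df` and `max_df` do when given as integers, the sketch below filters a toy vocabulary by document frequency (the number of documents a term appears in) and builds the count matrix by hand. It is a simplified stand-in, not the `CountVectorizer` implementation: tokenization is a plain whitespace split, and the helper names and toy documents are assumptions made for this example.

```python
from collections import Counter

def vocabulary_by_df(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, n in df.items() if min_df <= n <= max_df)

def count_matrix(docs, vocab):
    """One row per document, one column per vocabulary term."""
    return [[doc.split().count(term) for term in vocab] for doc in docs]

docs = ["car engin car", "car bike", "bike ride bike", "car"]
vocab = vocabulary_by_df(docs, min_df=2, max_df=3)
matrix = count_matrix(docs, vocab)
print(vocab)
print(matrix)
```

With `min_df = 50` and `max_df = 3000` on the 11314 training documents, the same idea discards very rare terms and very common ones.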


In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

#Initiate the NB model with required components.
mnb_pipeline = Pipeline([
                         ('clf', MultinomialNB())
])


#Set the hyperparameter space that will be scanned:
hyperparameters = dict(
    clf__alpha = (0.1,0.5,1.0,5.0),
)


#Let the GridSearchCV scan the hyperparameter and find the best hyperparameter set that will maximize the scoring option.
#   cv = 3
#   scoring = "accuracy"

mnb_grid_searchNaive = GridSearchCV(mnb_pipeline, hyperparameters, cv=3, scoring = 'accuracy', n_jobs=-1)
mnb_grid_searchNaive.fit(train_x,train_y)

GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        MultinomialNB(alpha=1.0,
                                                      class_prior=None,
                                                      fit_prior=True))],
                                verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'clf__alpha': (0.1, 0.5, 1.0, 5.0)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

In [None]:
# show the best score
mnb_grid_searchNaive.best_score_

0.8212835269890522

In [None]:
# show the best parameter
mnb_grid_searchNaive.best_params_

{'clf__alpha': 0.5}
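
The `alpha` values scanned above control additive (Laplace/Lidstone) smoothing of the per-class word probabilities. A minimal sketch of that formula on hypothetical word counts for a single class, using only the standard library:

```python
import math

def smoothed_log_probs(counts, alpha):
    """log P(word | class) = log((count + alpha) / (total + alpha * vocab_size))."""
    total = sum(counts) + alpha * len(counts)
    return [math.log((c + alpha) / total) for c in counts]

counts = [3, 1, 0]  # hypothetical counts of three words in one class

# Probabilities for each alpha in the grid, rounded for display.
table = {
    alpha: [round(math.exp(lp), 3) for lp in smoothed_log_probs(counts, alpha)]
    for alpha in (0.1, 0.5, 1.0, 5.0)
}
for alpha, probs in table.items():
    print(alpha, probs)
```

A larger `alpha` pulls the probabilities toward uniform, and any `alpha > 0` keeps unseen words from getting probability zero. Because these are log probabilities, every value is at most 0.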

### Evaluate The Best Model for NB

In [None]:
#Create your NB model with the best parameter set.
modelNB = MultinomialNB(alpha=0.1)

#Fit your model on training set.
modelNB =modelNB.fit(train_x,train_y)

In [None]:
# Make predictions on test set
predictions2 = modelNB.predict(test_x)

In [None]:
# Show your accuracy on test set
print(accuracy_score(test_y, predictions2))

0.7535847052575677


## Tune Logistic Regression

In [None]:
# Create a CountVectorizer for LR with:
min_df = 50
max_df = 3000
stop_words = stop_words
vectorizer = CountVectorizer(min_df=min_df, max_df=max_df,stop_words = stop_words)

In [None]:
# Vectorize your training and test set
train_xx = vectorizer.fit_transform(preprocessed_train_x)
test_xx = vectorizer.transform(preprocessed_test_x)

In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#Initiate the LR model:
max_iter=2000
mnb_pipeline = Pipeline([
                         ('Lr', LogisticRegression())
])

# Set the hyperparameter space that will be scanned:
#     C = (0.001,0.01,0.1,1)     C is the inverse of the regularization strength (1 / lambda)
hyperparameters = dict(
    Lr__C = (0.001,0.01,0.1,1),
    Lr__max_iter = (2000,),
)

#Let the GridSearchCV scan the hyperparameter and find the best hyperparameter set that will maximize the scoring option.
#   cv = 3
#   scoring = "accuracy"
mnb_grid_search = GridSearchCV(mnb_pipeline, hyperparameters, cv=3, scoring = 'accuracy', n_jobs=-1)
mnb_grid_search.fit(train_x,train_y)


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('Lr',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                                                           n_jobs=None,
                                                           penalty='l2',
                                                           random_state=None,
                                                  

In [None]:
# show the best score
mnb_grid_search.best_score_

0.8338344038554356

In [None]:
# show the best parameter
mnb_grid_search.best_params_

{'Lr__C': 0.1, 'Lr__max_iter': 2000}
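
To make the "C is 1 over lambda" note concrete, here is a small sketch of an L2-penalized logistic loss for the binary case with labels in {-1, +1} and no intercept. This is a simplification chosen for illustration, not the multiclass objective fitted above; the data, the weights, and the function names are hypothetical.

```python
import math

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w.x_i)) over the samples."""
    return sum(
        math.log1p(math.exp(-yi * sum(wj * xj for wj, xj in zip(w, xi))))
        for xi, yi in zip(X, y)
    )

def l2_penalty(w):
    return 0.5 * sum(wj * wj for wj in w)

def objective(w, X, y, C):
    """Penalty plus C times the data loss."""
    return l2_penalty(w) + C * logistic_loss(w, X, y)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
w = [0.8, -0.4]

# Share of the objective that comes from the penalty, for each C in the grid.
shares = [l2_penalty(w) / objective(w, X, y, C) for C in (0.001, 0.01, 0.1, 1)]
print([round(s, 3) for s in shares])
```

Dividing the objective by `C` gives the data loss plus `(1 / C)` times the penalty, so a small `C` means strong regularization and a large `C` means weak regularization.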

### Evaluate The Best Model for Logistic Regression

In [None]:
#Create your LR model with the best parameter set.
'''mnb_pipeline = Pipeline([
                         ('Lr', LogisticRegression())
]).set_params(**mnb_grid_search.best_params_)'''
modelLR = LogisticRegression(C= 0.1, max_iter=2000)
#Fit your model on training set.
modelLR =modelLR.fit(train_x,train_y)

In [None]:
# Make predictions on test set
predictions = modelLR.predict(test_x)

In [None]:
# Show your accuracy on test set
print(accuracy_score(test_y, predictions))


0.7484067976633032


# Feature Importances

In [None]:
# Find each category's top 3 most important features (words) for the LR model and show them in a dataframe
print(((modelLR.coef_)[0].shape),modelLR.coef_.shape)

(3099,) (20, 3099)


In [None]:

print(vectorizerNaive.get_feature_names())

['__', '___', '____', '_____', '_the', 'aa', 'aaron', 'ab', 'abc', 'abil', 'abl', 'abort', 'absolut', 'abstract', 'absurd', 'abus', 'ac', 'academ', 'acceler', 'accept', 'access', 'accid', 'accomplish', 'accord', 'account', 'accur', 'accuraci', 'accus', 'achiev', 'acid', 'acknowledg', 'acquir', 'across', 'act', 'action', 'activ', 'actual', 'ad', 'adam', 'adapt', 'add', 'addit', 'address', 'adequ', 'adjust', 'administr', 'admit', 'adob', 'adopt', 'adult', 'advanc', 'advantag', 'advertis', 'advic', 'advis', 'advoc', 'affair', 'affect', 'affili', 'afford', 'afraid', 'afterward', 'age', 'agenc', 'agenda', 'agent', 'aggress', 'ago', 'agre', 'agreement', 'ah', 'ahead', 'ai', 'aid', 'aim', 'aint', 'air', 'aka', 'al', 'ala', 'alan', 'albert', 'alberta', 'alcohol', 'alexand', 'algorithm', 'aliv', 'allan', 'alleg', 'allen', 'alloc', 'allow', 'almost', 'alon', 'along', 'alot', 'alreadi', 'also', 'alter', 'altern', 'although', 'alway', 'ama', 'amateur', 'amaz', 'ame', 'amend', 'america', 'american'

In [None]:
Feature_names = vectorizer.get_feature_names()
featureList = []
for i in range(len(target_groups)):
  top3 = np.argsort(np.abs(modelLR.coef_[i]))[-3:]  # indices of the 3 largest |coefficients|
  reversedTop3 = top3[::-1]
  print(reversedTop3)
  TargetFeatures = []
  for j in reversedTop3:
    TargetFeatures.append(Feature_names[j])
  featureList.append(TargetFeatures)
print(featureList)


[ 203 1533 1474]
[1219 1374  127]
[3039 3037  848]
[1791 1169 1799]
[1659  149 2102]
[3080 3031 1800]
[2379 1921 1119]
[422 228 225]
[ 822  300 1804]
[ 256 2035 3085]
[1312 2739 2066]
[ 515  913 2728]
[ 494  893 2883]
[ 819  801 1721]
[2574 1947 1995]
[ 489  810 1198]
[1243 2977 1080]
[1479 1478 2453]
[2773  513 1611]
[ 489 1948 1554]
[['atheist', 'keith', 'islam'], ['graphic', 'imag', 'anim'], ['window', 'win', 'driver'], ['monitor', 'gateway', 'motherboard'], ['mac', 'appl', 'powerbook'], ['xr', 'widget', 'motif'], ['sale', 'offer', 'forsal'], ['car', 'automot', 'auto'], ['dod', 'bike', 'motorcycl'], ['basebal', 'philli', 'yanke'], ['hockey', 'team', 'playoff'], ['clipper', 'encrypt', 'tap'], ['circuit', 'electron', 'tv'], ['doctor', 'diseas', 'medic'], ['space', 'orbit', 'pat'], ['christian', 'distribut', 'god'], ['gun', 'waco', 'firearm'], ['israel', 'isra', 'serdar'], ['theodor', 'clinton', 'libertarian'], ['christian', 'order', 'koresh']]
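
Ranking by `np.abs(coef_)` treats strongly negative weights (words that count as evidence against a class) the same as strongly positive ones. The sketch below contrasts the two rankings on a hypothetical coefficient row; the vocabulary and the weights are invented for illustration and are not taken from the fitted model.

```python
import numpy as np

feature_names = np.array(["atheist", "god", "car", "sale", "engin"])  # toy vocabulary
coef_row = np.array([1.2, -1.5, 0.3, -0.1, 0.9])  # hypothetical weights for one class

# Top 3 by magnitude: may include words that argue against the class.
top_abs = np.argsort(np.abs(coef_row))[-3:][::-1]
# Top 3 by signed value: only the strongest evidence for the class.
top_signed = np.argsort(coef_row)[-3:][::-1]

words_abs = feature_names[top_abs].tolist()
words_signed = feature_names[top_signed].tolist()
print(words_abs)
print(words_signed)
```

Which ranking is more appropriate depends on whether "important" should mean influential in either direction or indicative of the class.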


In [None]:
Df = pd.DataFrame(featureList, index=target_groups)
Df.head(20)


| | 0 | 1 | 2 |
|---|---|---|---|
| alt.atheism | atheist | keith | islam |
| comp.graphics | graphic | imag | anim |
| comp.os.ms-windows.misc | window | win | driver |
| comp.sys.ibm.pc.hardware | monitor | gateway | motherboard |
| comp.sys.mac.hardware | mac | appl | powerbook |
| comp.windows.x | xr | widget | motif |
| misc.forsale | sale | offer | forsal |
| rec.autos | car | automot | auto |
| rec.motorcycles | dod | bike | motorcycl |
| rec.sport.baseball | basebal | philli | yanke |


In [None]:
Df.transpose().head()


| | alt.atheism | comp.graphics | comp.os.ms-windows.misc | comp.sys.ibm.pc.hardware | comp.sys.mac.hardware | comp.windows.x | misc.forsale | rec.autos | rec.motorcycles | rec.sport.baseball | rec.sport.hockey | sci.crypt | sci.electronics | sci.med | sci.space | soc.religion.christian | talk.politics.guns | talk.politics.mideast | talk.politics.misc | talk.religion.misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | atheist | graphic | window | monitor | mac | xr | sale | car | dod | basebal | hockey | clipper | circuit | doctor | space | christian | gun | israel | theodor | christian |
| 1 | keith | imag | win | gateway | appl | widget | offer | automot | bike | philli | team | encrypt | electron | diseas | orbit | distribut | waco | isra | clinton | order |
| 2 | islam | anim | driver | motherboard | powerbook | motif | forsal | auto | motorcycl | yanke | playoff | tap | tv | medic | pat | god | firearm | serdar | libertarian | koresh |


In [None]:
print(((modelNB.coef_)[0].shape),modelNB.coef_.shape)

(3099,) (20, 3099)


In [None]:
# Find each category's top 3 most important features (words) for the NB model and show them in a dataframe

featureList2 = []
for i in range(len(target_groups)):
  top3_2 = np.argsort(modelNB.coef_[i])[-3: ]
  reversedTop3_2 = top3_2[::-1]
  print(reversedTop3_2)
  TargetFeatures2 = []
  for j in reversedTop3_2:
    TargetFeatures2.append(Feature_names[j])
  featureList2.append(TargetFeatures2)
print(featureList2)

[1198 2012 2394]
[1374 1068 1219]
[3039 1068  848]
[ 846  423 2417]
[1659  149 2140]
[3039 1068 2151]
[2379 1858 1921]
[ 422  920 1204]
[ 300  822 2327]
[3087 1162 2739]
[2739 1162 2064]
[1539  913  482]
[3044 1232 3059]
[2012 1816 1721]
[2574 1947 1576]
[1198  489 2012]
[1243 2012 2331]
[ 175 2012 1479]
[2012 1195 2783]
[1198  489 2012]
[['god', 'peopl', 'say'], ['imag', 'file', 'graphic'], ['window', 'file', 'driver'], ['drive', 'card', 'scsi'], ['mac', 'appl', 'problem'], ['window', 'file', 'program'], ['sale', 'new', 'offer'], ['car', 'engin', 'good'], ['bike', 'dod', 'ride'], ['year', 'game', 'team'], ['team', 'game', 'play'], ['key', 'encrypt', 'chip'], ['wire', 'ground', 'work'], ['peopl', 'msg', 'medic'], ['space', 'orbit', 'launch'], ['god', 'christian', 'peopl'], ['gun', 'peopl', 'right'], ['armenian', 'peopl', 'israel'], ['peopl', 'go', 'think'], ['god', 'christian', 'peopl']]
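
For `MultinomialNB`, the per-class coefficients are log probabilities of words given the class, so they are never positive and the most probable word is the one closest to 0. The sketch below uses hypothetical counts for one class to show why sorting by absolute value picks the wrong words; the vocabulary, the counts, and the choice of `alpha` are assumptions made for this illustration.

```python
import math

vocab = ["car", "engin", "good", "zebra"]
counts = [30, 12, 8, 0]  # hypothetical word counts within one class
alpha = 0.5

total = sum(counts) + alpha * len(counts)
log_probs = [math.log((c + alpha) / total) for c in counts]

indices = range(len(vocab))
# Largest log probability first: the most frequent words in the class.
by_value = sorted(indices, key=lambda j: log_probs[j], reverse=True)[:3]
# Largest magnitude first: this favours the rarest words instead.
by_abs = sorted(indices, key=lambda j: abs(log_probs[j]), reverse=True)[:3]

top_by_value = [vocab[j] for j in by_value]
top_by_abs = [vocab[j] for j in by_abs]
print(top_by_value)
print(top_by_abs)
```

This ranking rewards words that are frequent within a class, which may explain why general words such as `peopl` show up for several classes in the output above.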


In [None]:
Df2 = pd.DataFrame(featureList2, index=target_groups)
Df2.head(20)


| | 0 | 1 | 2 |
|---|---|---|---|
| alt.atheism | god | peopl | say |
| comp.graphics | imag | file | graphic |
| comp.os.ms-windows.misc | window | file | driver |
| comp.sys.ibm.pc.hardware | drive | card | scsi |
| comp.sys.mac.hardware | mac | appl | problem |
| comp.windows.x | window | file | program |
| misc.forsale | sale | new | offer |
| rec.autos | car | engin | good |
| rec.motorcycles | bike | dod | ride |
| rec.sport.baseball | year | game | team |


In [None]:
Df2.transpose().head()

| | alt.atheism | comp.graphics | comp.os.ms-windows.misc | comp.sys.ibm.pc.hardware | comp.sys.mac.hardware | comp.windows.x | misc.forsale | rec.autos | rec.motorcycles | rec.sport.baseball | rec.sport.hockey | sci.crypt | sci.electronics | sci.med | sci.space | soc.religion.christian | talk.politics.guns | talk.politics.mideast | talk.politics.misc | talk.religion.misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | god | imag | window | drive | mac | window | sale | car | bike | year | team | key | wire | peopl | space | god | gun | armenian | peopl | god |
| 1 | peopl | file | file | card | appl | file | new | engin | dod | game | game | encrypt | ground | msg | orbit | christian | peopl | peopl | go | christian |
| 2 | say | graphic | driver | scsi | problem | program | offer | good | ride | team | play | chip | work | medic | launch | peopl | right | israel | think | peopl |


Note: the example DataFrames are missing because I accidentally ran the example cells and then deleted them.

# **Notebook & Report**

Notebook: We may just look at your notebook results; so make sure each cell is run and outputs are there.

Report: Write a summary of your approach to this problem, at most half a page long, at the end of your notebook; it should read like the abstract of a paper or an executive summary.

Must include statements such as:

( Include the problem definition: 1-2 lines )

(Talk about any preprocessing you did, explain your reasoning)

(Talk about train/test sets, size and how split)

(State what your test results are with the chosen method, parameters: e.g. "We have obtained the best results with the ….. classifier (parameters=....) , giving classification accuracy of …% on test data….")

(Comment on feature importances of models)

(Comment on anything that you deem important/interesting)


You will get full points from here as long as you have a good (enough) summary of your work, regardless of your best performance or what you have decided to talk about in the last few lines.



# **Write your report in this cell**
We have a dataset containing a large number of texts in e-mail format, each labeled with its topic. We are trying to predict the topic of a given text.

First, we need some preprocessing to turn the dataset into a form that a machine can process. We use vectorization to build a vocabulary of words and count them. Vectorization is case-sensitive and would count the same word in uppercase and lowercase as different words, so we convert all characters to lowercase. There are also words that carry little meaning for the model, such as 'is' and 'are', which are called stop words; we remove them from the texts. Another preprocessing step we used is stemming. A single word can appear in many forms, such as 'apply' and 'applied', so we reduce them to a common stem such as 'appli'. The last preprocessing step is removing punctuation marks.

The training and test sets come built in, so we did not split them explicitly. The training set has 11314 instances and the test set has 7532 instances, roughly a 60/40 split, so the training portion is smaller than we would have preferred.

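We obtained the best test result with the MultinomialNB (Naive Bayes) classifier, which reached an accuracy of 75.36% on the test data, slightly above Logistic Regression (C = 0.1, max_iter = 2000) at 74.84%. Grid search selected alpha = 0.5 (cross-validation accuracy 82.13%), while the final Naive Bayes model in the notebook was fitted with alpha = 0.1.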

The feature importance dataframes are easy for a human to interpret. Almost all of the top 3 features are closely related to the corresponding class.

In my opinion, it is interesting that the coefficients of the MultinomialNB classifier are all negative and that their importance is not determined by absolute value. I got stuck on this. At first I took the absolute values of the coefficients and the result was wrong. I then spent a lot of time trying to understand what was wrong with my code, and started debugging by looking at the numbers one by one, checking their indices, and so on. When I printed the coefficients group by group, I found that all of them are negative and that the most important one is the one closest to 0. Maybe I missed a point in the lectures; I will review this topic.


# **Submission**
You will submit this homework via SUCourse.


Please read this document again before submitting it.

Please submit your **"share link" INLINE in SUCourse submissions.** That is, we should be able to click on the link, go there, and run (and possibly also modify) your code.

For us to be able to modify it in case of errors etc., you should get your "share link" as **share with anyone in edit mode**.

Download the **.ipynb and the .html** files and upload both of them to SUCourse.
 
Please do your assignment individually, do not copy from a friend or the Internet. Plagiarized assignments will receive -100.
