# Table of Contents
- [Import Libraries](#Import-Libraries)
- [Read in API Data](#Read-in-API-Data)
- [Data Cleaning](#Data-Cleaning)
- [Preprocessing](#Preprocessing)
- [Model Setup](#Model-Setup)
    - [Logistic Regression](#Logistic-Regression)
    - [K-Nearest Neighbors](#KNN-Classifier)
    - [Multinomial Naive Bayes](#Navie-Bayes-Multinominal)
    - [Decision Tree](#Decision-Tree-Classifier)
- [Combined Model Evaluations](#Combined-Model-Evaluations)


## Import Libraries

In [5]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import regex as re
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords as sw
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier



## Read in API Data

In [6]:
#read in both datasets from the data folder
e_46 = pd.read_csv('./data/e_46.csv')
e_90 = pd.read_csv('./data/e_90.csv')

#combined predictions of all models (saved later in this notebook)
final = pd.read_csv('./data/total.csv')

In [7]:
#check e_46
e_46.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,time_date
0,Window fuse,Anyone have a diagram of the fuses ? Cant seem...,e46,1582133739,pillow12345,0,1,True,2020-02-19
1,"New e46 owner, quick question about side mirror.","Alright, so this is the first bmw I've ever ow...",e46,1582136970,Sawachuki,6,1,True,2020-02-19
2,HID Ballast Question (Self Leveling and Corner...,"Hi all, I am in the process of fixing every li...",e46,1582147313,enm22,4,1,True,2020-02-19
3,"Brake lights work, turn signals don't, any adv...",,e46,1582156658,SpookySkips,5,1,True,2020-02-19
4,Tire randomly goes flat,Hello! Im pretty much a noob with tires but an...,e46,1582161377,WyaOscar,3,1,True,2020-02-19


In [8]:
#check e_90
e_90.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,time_date
0,My 07 335i,So my check engine light came on the other day...,E90,1582117887,scrubbblord,34,1,True,2020-02-19
1,Vacuum line leak?,So my check engine light came on and said my m...,E90,1582121771,scrubbblord,2,1,True,2020-02-19
2,This might sound stupid...,If you’ve seen my last post- I recently bought...,E90,1582152810,blacknot6lack,13,1,True,2020-02-19
3,335i vs M3,Im getting a car soon and have a choice betwee...,E90,1582179834,closetedfaggot,20,1,True,2020-02-19
4,Something you guys might be interested in ! Cu...,https://youtu.be/liJ_MhC5Z_8,E90,1582237443,Cars_with_Tommy,0,1,True,2020-02-20


## Data Cleaning

In [9]:
#only need title selftext and subreddit
e_46 = e_46[['title', 'selftext', 'subreddit']]
e_90 = e_90[['title', 'selftext', 'subreddit']]

In [10]:
#check for nulls
e_46.isnull().sum()

title         0
selftext     86
subreddit     0
dtype: int64

In [11]:
e_90.isnull().sum()

title          0
selftext     109
subreddit      0
dtype: int64

In [12]:
#Can't have any nulls. Need to remove them
e_46.dropna(inplace = True)
e_90.dropna(inplace = True)

In [13]:
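#check for links: 'http' and '.com'
#note: str.contains treats the pattern as a regex by default, so '.com' matches any character followed by 'com'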
print(e_46['selftext'].str.contains('.com').sum())
print(e_46['selftext'].str.contains('http').sum())
print(e_90['selftext'].str.contains('.com').sum())
print(e_90['selftext'].str.contains('http').sum())

699
262
567
198
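`Series.str.contains` interprets its pattern as a regular expression unless told otherwise, which may explain why the `'.com'` counts are so much higher than the `'http'` counts. The sketch below uses the standard-library `re` module on made-up sample strings (they are assumptions for illustration, not rows from the dataset) to contrast a regex match with a literal substring test.

```python
import re

# Hypothetical post texts, invented for illustration only.
samples = [
    "see bmwparts.com for the diagram",   # contains a literal ".com"
    "this has become a common problem",   # no link, but "ecom" fits the regex
    "check engine light is on",           # neither
]

# As a regex, "." matches any single character, so ".com" means
# "any character followed by 'com'".
regex_hits = [bool(re.search(".com", s)) for s in samples]

# A literal substring test only matches an actual ".com".
literal_hits = [".com" in s for s in samples]

print(regex_hits)
print(literal_hits)
```

With pandas, the literal behaviour corresponds to passing `regex=False` to `str.contains`, or escaping the dot as `r'\.com'`.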


In [14]:
#remove all posts whose selftext matches a link pattern
links = ['http', '.com']

for link in links:
    e_46 = e_46[~e_46['selftext'].str.contains(link)]
    e_90 = e_90[~e_90['selftext'].str.contains(link)]

#reset index
e_46.reset_index(inplace = True, drop = True)
e_90.reset_index(inplace = True, drop = True)

In [15]:
#check the len of each after removing links
print(len(e_46))
print(len(e_90))

1220
1181


In [16]:
#combine into one big dataframe for classification
m3 = pd.concat([e_46, e_90])

#reset index after concat
m3.reset_index(inplace = True, drop = True)

In [17]:
m3

Unnamed: 0,title,selftext,subreddit
0,Window fuse,Anyone have a diagram of the fuses ? Cant seem...,e46
1,"New e46 owner, quick question about side mirror.","Alright, so this is the first bmw I've ever ow...",e46
2,Tire randomly goes flat,Hello! Im pretty much a noob with tires but an...,e46
3,Weird Intermittent Temperature Problem,"Hey, I was wondering if anyone else had experi...",e46
4,Recommendations on which e46 to buy,I’m a 22 year old with a steady job. I need to...,e46
...,...,...,...
2396,N52 Strange Cold Start,Does anyone have an issue with their E90 idlin...,E90
2397,BMW E92 335i N54 Tune using MHD Flasher with a...,I purchased my 2010 E92 335i with an afe intak...,E90
2398,The N54 in my 335i seized yesterday on the way...,Basically what the title says. I took my 2010...,E90
2399,Choosing between 330i/335i,So Ive been looking around for the next car af...,E90


In [18]:
#drop the last row
m3.drop([2400], inplace = True)

In [19]:
m3

Unnamed: 0,title,selftext,subreddit
0,Window fuse,Anyone have a diagram of the fuses ? Cant seem...,e46
1,"New e46 owner, quick question about side mirror.","Alright, so this is the first bmw I've ever ow...",e46
2,Tire randomly goes flat,Hello! Im pretty much a noob with tires but an...,e46
3,Weird Intermittent Temperature Problem,"Hey, I was wondering if anyone else had experi...",e46
4,Recommendations on which e46 to buy,I’m a 22 year old with a steady job. I need to...,e46
...,...,...,...
2395,Coolant Air Bleed Procedure Not Working,"Hey everyone, so yesterday I did a coolant flu...",E90
2396,N52 Strange Cold Start,Does anyone have an issue with their E90 idlin...,E90
2397,BMW E92 335i N54 Tune using MHD Flasher with a...,I purchased my 2010 E92 335i with an afe intak...,E90
2398,The N54 in my 335i seized yesterday on the way...,Basically what the title says. I took my 2010...,E90


## Preprocessing

In [20]:
#lowercase title and selftext, and replace every non-alphanumeric character with a space

m3['selftext'] = m3['selftext'].apply(lambda row : re.sub('[^a-zA-Z0-9]', ' ', row.lower()))
m3['title'] = m3['title'].apply(lambda row : re.sub('[^a-zA-Z0-9]', ' ', row.lower()))
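When writing the character class for this cleanup, note that the range `A-z` is not the same as `A-Z`: it spans ASCII 65 to 122 and therefore also keeps `[`, `\`, `]`, `^`, `_` and the backtick. A minimal sketch on an invented string (the sample text is an assumption, not from the dataset):

```python
import re

# Hypothetical text containing punctuation that falls between "Z" and "a" in ASCII.
text = "n54_tune [stage 2] won't start"

# "A-z" also covers [ \ ] ^ _ ` so those characters survive the substitution.
wide = re.sub("[^a-zA-z0-9]", " ", text)

# "A-Z" covers only the uppercase letters, so all punctuation becomes a space.
strict = re.sub("[^a-zA-Z0-9]", " ", text)

print(repr(wide))
print(repr(strict))
```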


In [21]:
#title and selftext are related. Concatenate them into one combined text column
m3['comb'] = m3['title'] + ' ' + m3['selftext']

In [22]:
#map e46 to 0 and e90 to 1
m3['subreddit'] = m3['subreddit'].map({'e46' : 0, 'E90' : 1})
m3

Unnamed: 0,title,selftext,subreddit,comb
0,window fuse,anyone have a diagram of the fuses cant seem...,0,window fuse anyone have a diagram of the fuses...
1,new e46 owner quick question about side mirror,alright so this is the first bmw i ve ever ow...,0,new e46 owner quick question about side mirro...
2,tire randomly goes flat,hello im pretty much a noob with tires but an...,0,tire randomly goes flat hello im pretty much ...
3,weird intermittent temperature problem,hey i was wondering if anyone else had experi...,0,weird intermittent temperature problem hey i ...
4,recommendations on which e46 to buy,i m a 22 year old with a steady job i need to...,0,recommendations on which e46 to buy i m a 22 y...
...,...,...,...,...
2395,coolant air bleed procedure not working,hey everyone so yesterday i did a coolant flu...,1,coolant air bleed procedure not working hey ev...
2396,n52 strange cold start,does anyone have an issue with their e90 idlin...,1,n52 strange cold start does anyone have an iss...
2397,bmw e92 335i n54 tune using mhd flasher with a...,i purchased my 2010 e92 335i with an afe intak...,1,bmw e92 335i n54 tune using mhd flasher with a...
2398,the n54 in my 335i seized yesterday on the way...,basically what the title says i took my 2010...,1,the n54 in my 335i seized yesterday on the way...


In [23]:
#function for manual lemmatizing; falls back to the raw word (e.g. a number) if lemmatizing fails

def lem(words):
    lemma = WordNetLemmatizer()
    out = ' '
    for word in words.split(' '):
        try:
            out += lemma.lemmatize(word) + ' '
        except Exception:
            out += word + ' '
    return out

In [24]:
#function for manual stemming; falls back to the raw word (e.g. a number) if stemming fails

def stem(words):
    stemmer = PorterStemmer()
    s = ''
    for word in words.split(' '):
        try:
            s += stemmer.stem(word) + ' '
        except Exception:
            s += word + ' '
    return s

In [25]:
#create new column for stemming
m3['stem'] = m3['comb'].apply(lambda row : stem(row))

#create a new column for lemmatizing
m3['lem'] = m3['comb'].apply(lambda row : lem(row))

In [26]:
m3

Unnamed: 0,title,selftext,subreddit,comb,stem,lem
0,window fuse,anyone have a diagram of the fuses cant seem...,0,window fuse anyone have a diagram of the fuses...,window fuse anyon have a diagram of the fuse ...,window fuse anyone have a diagram of the fuse...
1,new e46 owner quick question about side mirror,alright so this is the first bmw i ve ever ow...,0,new e46 owner quick question about side mirro...,new e46 owner quick question about side mirro...,new e46 owner quick question about side mirr...
2,tire randomly goes flat,hello im pretty much a noob with tires but an...,0,tire randomly goes flat hello im pretty much ...,tire randomli goe flat hello im pretti much a...,tire randomly go flat hello im pretty much a...
3,weird intermittent temperature problem,hey i was wondering if anyone else had experi...,0,weird intermittent temperature problem hey i ...,weird intermitt temperatur problem hey i wa w...,weird intermittent temperature problem hey i...
4,recommendations on which e46 to buy,i m a 22 year old with a steady job i need to...,0,recommendations on which e46 to buy i m a 22 y...,recommend on which e46 to buy i m a 22 year ol...,recommendation on which e46 to buy i m a 22 y...
...,...,...,...,...,...,...
2395,coolant air bleed procedure not working,hey everyone so yesterday i did a coolant flu...,1,coolant air bleed procedure not working hey ev...,coolant air bleed procedur not work hey everyo...,coolant air bleed procedure not working hey e...
2396,n52 strange cold start,does anyone have an issue with their e90 idlin...,1,n52 strange cold start does anyone have an iss...,n52 strang cold start doe anyon have an issu w...,n52 strange cold start doe anyone have an iss...
2397,bmw e92 335i n54 tune using mhd flasher with a...,i purchased my 2010 e92 335i with an afe intak...,1,bmw e92 335i n54 tune using mhd flasher with a...,bmw e92 335i n54 tune use mhd flasher with afe...,bmw e92 335i n54 tune using mhd flasher with ...
2398,the n54 in my 335i seized yesterday on the way...,basically what the title says i took my 2010...,1,the n54 in my 335i seized yesterday on the way...,the n54 in my 335i seiz yesterday on the way h...,the n54 in my 335i seized yesterday on the wa...


## Model Setup

In [27]:
#features and labels

#feature: lemmatized title + selftext
X = m3['lem']

#label
y = m3['subreddit']

In [28]:
#train/test split the data, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 888, stratify  = y)
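As a rough check on what this split produces, the class sizes of the test set can be worked out by hand. The sketch below assumes the default `test_size` of 0.25 (none is passed above) and that the dropped row 2400 was an E90 post, as its position at the end of the concatenation implies.

```python
# Row counts printed earlier in the notebook; one e90 row was dropped afterwards (assumption).
n_e46 = 1220
n_e90 = 1181 - 1

# train_test_split holds out 25% of the rows when test_size is not given.
test_fraction = 0.25

# With stratify=y, each class keeps its share of the data in the test set.
# Both products are whole numbers here, so no rounding question arises.
test_e46 = int(n_e46 * test_fraction)
test_e90 = int(n_e90 * test_fraction)

print(test_e46, test_e90, test_e46 + test_e90)
```

These counts should match the row sums of each confusion matrix built on `y_test`.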

In [29]:
#custom stop words

stop = ['i', 'im', 'iam', 'pretty', 'much', 'about', 'with', 'have', 'and', 'an', 'a', 'the', 'by', 'but', 'cant', "can't", 'can',
       'basically', 'general', 'for', 'what', 'when', 'how', 'did', 'doesnt', 'my', 'me', 'him','her', 'his', 'm',
       'around', 'of', 'between', 'so']

## Logistic Regression

In [30]:
# function with a pipeline for logistic regression with CountVectorizer or TfidfVectorizer

def setup(model, trans):
    
    #define the transformer and estimator:
    pipe = Pipeline([
        ('vec' , trans),
        ('model', model)
    ])
    
    #hyperparameters for the transformer and the model
    hyperp = {
        
        #stop words: 'english' and sklearn's ENGLISH_STOP_WORDS set (imported as stopwords)
        'vec__stop_words' : ['english', stopwords,], 
        
        #n-gram ranges: single words, up to 2-word phrases, and up to 3-word phrases
        'vec__ngram_range' : [(1,1), (1,2), (1,3)],
        
        #Regularization. The smaller the C, the stronger the regularization
        'model__C' : [.1, .5, 5]
    }
    
    #search with GridSearchCV
    search = GridSearchCV(pipe, param_grid= hyperp, cv = 5, scoring = 'accuracy' )

    return search
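To get a feel for how much work this grid search does, the number of parameter combinations can be enumerated directly. This is a standard-library sketch; the string `"ENGLISH_STOP_WORDS"` is only a placeholder standing in for the imported stop-word set.

```python
from itertools import product

# Same shape as the grid in setup(); the second stop-word entry is a placeholder label.
grid = {
    "vec__stop_words": ["english", "ENGLISH_STOP_WORDS"],
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "model__C": [0.1, 0.5, 5],
}

# Every combination of one value per parameter is a candidate.
combos = list(product(*grid.values()))
cv = 5

n_candidates = len(combos)
n_fits = n_candidates * cv  # GridSearchCV then refits the best candidate once more

print(n_candidates, n_fits)
```

The same counting applies to the other grids in this notebook; adding one more value to any parameter multiplies the total number of fits.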

In [31]:
# Logistic Regression and CountVectorizer

lr_c = setup(LogisticRegression(solver = 'liblinear'), CountVectorizer())
lr_c.fit(X_train, y_train);

In [32]:
print(f'best score is: {lr_c.best_score_}')
print(f'best parameter: {lr_c.best_params_}')
print(f'testset best score: {lr_c.score(X_test, y_test)}')

best score is: 0.835
best parameter: {'model__C': 0.1, 'vec__ngram_range': (1, 3), 'vec__stop_words': 'english'}
testset best score: 0.8583333333333333


In [33]:
# Logistic Regression and TfidfVectorizer

lr_tv = setup(LogisticRegression(solver = 'liblinear'), TfidfVectorizer())
lr_tv.fit(X_train, y_train);

In [34]:
print(f'best score is: {lr_tv.best_score_}')
print(f'best parameter: {lr_tv.best_params_}')
print(f'testset best score: {lr_tv.score(X_test, y_test)}')

best score is: 0.8322222222222221
best parameter: {'model__C': 5, 'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
testset best score: 0.8533333333333334


In [35]:
#confusion matrix for CountVectorizer (rows: true class, columns: predicted class)
cm_c = pd.DataFrame(confusion_matrix(y_test,lr_c.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_c

Unnamed: 0,p_e46,p_e90
e46,255,50
e90,35,260
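`confusion_matrix(y_true, y_pred)` puts the true classes on the rows and the predicted classes on the columns, so the usual metrics can be read off the four counts. A small sketch using the counts from the matrix above (the variable names are illustrative):

```python
# Counts copied from the CountVectorizer logistic regression matrix.
# Rows are true classes, columns are predicted classes (sklearn convention).
matrix = [
    [255, 50],   # true e46: predicted e46, predicted e90
    [35, 260],   # true e90: predicted e46, predicted e90
]

total = sum(sum(row) for row in matrix)

# Accuracy: correct predictions (the diagonal) over all test posts.
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Recall for e90: share of true e90 posts that were predicted e90.
recall_e90 = matrix[1][1] / sum(matrix[1])

# Precision for e90: share of predicted-e90 posts that really are e90.
precision_e90 = matrix[1][1] / (matrix[0][1] + matrix[1][1])

print(round(accuracy, 3), round(recall_e90, 3), round(precision_e90, 3))
```

The accuracy computed this way agrees with the test-set score reported for `lr_c`.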


In [36]:
#confusion matrix for TfidfVectorizer (rows: true class, columns: predicted class)
cm_t = pd.DataFrame(confusion_matrix(y_test, lr_tv.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_t

Unnamed: 0,p_e46,p_e90
e46,253,52
e90,36,259


In [37]:
#locate the highest and lowest coefficient, with its words for Countvectorizer
coef = lr_c.best_estimator_.named_steps['model'].coef_
names = lr_c.best_estimator_.named_steps['vec'].get_feature_names()

In [38]:
#transpose array for easy sorting

array = pd.DataFrame(coef, columns = names).T

In [39]:
#sort by column '0'

sort = array.sort_values(by = [0])

In [40]:
#top 5 for e46
sort.head()

Unnamed: 0,0
e46,-1.561806
330ci,-0.65877
325ci,-0.485112
330i,-0.417354
zhp,-0.415655


In [41]:
#top 5 for e90
sort.tail()

Unnamed: 0,0
2011,0.632632
328i,0.650346
e92,0.711353
335i,0.846351
e90,1.309839
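One way to read these coefficients is as odds ratios: in a logistic regression on raw counts, one extra occurrence of a word multiplies the odds of class 1 (E90) by `exp(coef)`, holding the other features fixed. The sketch below applies that to three of the values shown above; keep in mind that the coefficients come from a regularized model, so the numbers are indicative rather than exact effect sizes.

```python
import math

# Coefficients copied from the CountVectorizer model output above.
coefs = {"e90": 1.309839, "335i": 0.846351, "e46": -1.561806}

# exp(coef) is the multiplicative change in the odds of class 1 (E90)
# for one additional occurrence of the word.
odds_ratio = {word: math.exp(c) for word, c in coefs.items()}

for word, ratio in odds_ratio.items():
    print(f"{word}: {ratio:.2f}")
```

Values above 1 push a post towards E90, values below 1 towards e46.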


In [42]:
#locate the highest and lowest coefficient, with its words for Tfidfvectorizer
coef_t = lr_tv.best_estimator_.named_steps['model'].coef_
names_t = lr_tv.best_estimator_.named_steps['vec'].get_feature_names()

In [43]:
#transpose for easy sorting

array_t = pd.DataFrame(coef_t, columns = names_t).T

In [46]:
#sort by column '0'

sort_t = array_t.sort_values(by = [0])

In [47]:
#top 5 for e46

sort_t.head()

Unnamed: 0,0
e46,-12.30857
330ci,-5.703909
325ci,-4.294846
2004,-3.767476
zhp,-3.680316


In [48]:
#top 5 for e90

sort_t.tail()

Unnamed: 0,0
2011,5.347992
2006,5.462075
e92,5.541207
335i,6.527404
e90,10.301755


In [49]:
#top coefficient comparison for e46

co_e46 = pd.DataFrame({'CountVec' : sort.head()[0], 'TFidf' : sort_t.head()[0] }, index = list(sort.index[0:5]))
co_e46

Unnamed: 0,CountVec,TFidf
e46,-1.561806,-12.30857
330ci,-0.65877,-5.703909
325ci,-0.485112,-4.294846
330i,-0.417354,
zhp,-0.415655,-3.680316


In [50]:
#Top coefficient comparison for e90 

co_e90 = pd.DataFrame({'CountVec' : sort.tail()[0], 'TFidf' : sort_t.tail()[0] }, index = list(sort.index[-1:-6:-1]))
co_e90

Unnamed: 0,CountVec,TFidf
e90,1.309839,10.301755
335i,0.846351,6.527404
e92,0.711353,5.541207
328i,0.650346,
2011,0.632632,5.347992


## KNN Classifier

In [51]:
#KNN function for CountVectorizer and TfidfVectorizer

def setup_k(model, trans):
    
    #define the transformer and estimator:
    pipe = Pipeline([
        ('vec', trans),
        ('model', model)
    ])
    
    #hyperparameters for the transformer and the model
    hyperp = {
        
        #stop words: 'english' and sklearn's ENGLISH_STOP_WORDS set (imported as stopwords)
        'vec__stop_words' : ['english', stopwords,], 
        
        #n-gram ranges: single words, up to 2-word phrases, and up to 3-word phrases
        'vec__ngram_range' : [(1,1), (1,2), (1,3)],
        
        #define n neighbors
        'model__n_neighbors' : [5, 15, 40],
        
        #distance metric
        'model__metric' : ['euclidean', 'manhattan']
    }
    
    #search with GridSearchCV
    search = GridSearchCV(pipe, param_grid= hyperp, cv = 5, scoring = 'accuracy' )

    return search
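The grid above compares two distance metrics. As a minimal illustration of how they differ, the sketch below computes both on two invented word-count vectors (the vectors are assumptions, not taken from the data).

```python
import math

# Hypothetical word-count vectors for two posts over a 4-word vocabulary.
a = [2, 0, 1, 0]
b = [0, 1, 1, 3]

# Euclidean distance: square root of the summed squared differences.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of the absolute differences.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

print(round(euclidean, 3), manhattan)
```

Squaring makes the Euclidean distance weigh one large difference more heavily than several small ones, while the Manhattan distance treats every unit of difference the same.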

In [52]:
# KNeighbors with CountVectorizer
k = setup_k(KNeighborsClassifier(), CountVectorizer())

In [53]:
k.fit(X_train, y_train);

In [54]:
print(f'best score is: {k.best_score_}')
print(f'best parameter: {k.best_params_}')
print(f'testset best score: {k.score(X_test, y_test)}')

best score is: 0.6144444444444445
best parameter: {'model__metric': 'manhattan', 'model__n_neighbors': 5, 'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
testset best score: 0.58


In [55]:
# KNeighbor with Tfidfvectorizer
k_tv = setup_k(KNeighborsClassifier(), TfidfVectorizer())

In [56]:
k_tv.fit(X_train, y_train);

In [57]:
print(f'best score is: {k_tv.best_score_}')
print(f'best parameter: {k_tv.best_params_}')
print(f'testset best score: {k_tv.score(X_test, y_test)}')

best score is: 0.7388888888888888
best parameter: {'model__metric': 'euclidean', 'model__n_neighbors': 40, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
testset best score: 0.7616666666666667


In [58]:
#confusion matrix for KNeighbors with CountVectorizer (rows: true class, columns: predicted class)

cm_kc = pd.DataFrame(confusion_matrix(y_test, k.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_kc

Unnamed: 0,p_e46,p_e90
e46,98,207
e90,45,250


In [59]:
#confusion matrix for KNeighbors with TfidfVectorizer (rows: true class, columns: predicted class)

cm_ktv = pd.DataFrame(confusion_matrix(y_test, k_tv.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_ktv

Unnamed: 0,p_e46,p_e90
e46,244,61
e90,82,213


## Navie Bayes Multinominal

In [60]:
#Multinomial Naive Bayes function for CountVectorizer and TfidfVectorizer

def nb_c(model, trans):
    
    #define the transformer and estimator:
    pipe = Pipeline([
        ('vec' , trans),
        ('model', model)
    ])
    
    #hyperparameters for the transformer
    hyperp = {
        
        #stop words: 'english' and sklearn's ENGLISH_STOP_WORDS set (imported as stopwords)
        'vec__stop_words' : ['english', stopwords,], 
        
        #n-gram ranges: single words, up to 2-word phrases, and up to 3-word phrases
        'vec__ngram_range' : [(1,1), (1,2), (1,3)],
        
    }
    
    #search with GridSearchCV
    search = GridSearchCV(pipe, param_grid= hyperp, cv = 5, scoring = 'accuracy' )

    return search

In [61]:
#Multinomial Naive Bayes with CountVectorizer

nbm_c = nb_c(MultinomialNB(), CountVectorizer())
nbm_c.fit(X_train, y_train);

In [62]:
print(f'best score is: {nbm_c.best_score_}')
print(f'best parameter: {nbm_c.best_params_}')
print(f'testset best score: {nbm_c.score(X_test, y_test)}')

best score is: 0.8288888888888888
best parameter: {'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
testset best score: 0.8416666666666667


In [63]:
#Multinomial Naive Bayes with TfidfVectorizer

nbm_v = nb_c(MultinomialNB(), TfidfVectorizer())
nbm_v.fit(X_train, y_train);

In [64]:
print(f'best score is: {nbm_v.best_score_}')
print(f'best parameter: {nbm_v.best_params_}')
print(f'testset best score: {nbm_v.score(X_test, y_test)}')

best score is: 0.8116666666666668
best parameter: {'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
testset best score: 0.8333333333333334


In [65]:
#confusion matrix for nbm with CountVectorizer (rows: true class, columns: predicted class)

cm_nbm_c = pd.DataFrame(confusion_matrix(y_test, nbm_c.predict(X_test)), 
                        columns = ['p_e46', 'p_e90'], 
                        index = ['e46', 'e90'])

cm_nbm_c

Unnamed: 0,p_e46,p_e90
e46,254,51
e90,44,251


In [66]:
#confusion matrix for nbm with TfidfVectorizer (rows: true class, columns: predicted class)

cm_nbm_tv = pd.DataFrame(confusion_matrix(y_test, nbm_v.predict(X_test)), 
                        columns = ['p_e46', 'p_e90'],
                        index = ['e46', 'e90'])

cm_nbm_tv

Unnamed: 0,p_e46,p_e90
e46,262,43
e90,57,238


## Decision Tree Classifier

In [67]:
# Base decision tree with CountVectorizer

#instantiate both Decision Tree and CountVectorizer
dt_c = DecisionTreeClassifier(random_state = 88)
cvec = CountVectorizer()

#fit and transform with the training data.
X_c = cvec.fit_transform(X_train)
X_ct = cvec.transform(X_test)

In [68]:
#fit DecisionTreeClassifier
dt_c.fit(X_c, y_train)

#score for each set
print(f'score for train set {dt_c.score(X_c, y_train)}')
print(f'score for test set {dt_c.score(X_ct, y_test)}')

score for train set 1.0
score for test set 0.79


In [69]:
#instantiate Decision Tree classifier for TfidfVectorizer
dt_t = DecisionTreeClassifier(random_state = 88)

#instantiate TfidfVectorizer
tvec = TfidfVectorizer()

#fit and transform with training data
X_t = tvec.fit_transform(X_train)
X_tt = tvec.transform(X_test)

#fit DecisionTree Classifier
dt_t.fit(X_t, y_train)

#score for each set
print(f'score for train set {dt_t.score( X_t, y_train)}')
print(f'score for test set {dt_t.score(X_tt, y_test)}')

score for train set 1.0
score for test set 0.8266666666666667


In [70]:
#confusion matrix for Decision Tree and CountVectorizer (rows: true class, columns: predicted class)
cm_dtc = pd.DataFrame(confusion_matrix(y_test, dt_c.predict(X_ct)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_dtc

Unnamed: 0,p_e46,p_e90
e46,231,74
e90,52,243


In [71]:
#confusion matrix for Decision Tree and TfidfVectorizer (rows: true class, columns: predicted class)
cm_dtt = pd.DataFrame(confusion_matrix(y_test, dt_t.predict(X_tt)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_dtt

Unnamed: 0,p_e46,p_e90
e46,256,49
e90,55,240


In [72]:
#function for both CountVectorizer and TfidfVectorizer

def dc(model, trans):
    
    #define pipeline transformer and estimator
    pipe = Pipeline([
        ('trans', trans),
        ('model', model)
    ])
    
    #define hyperparameters
    hyperp = {
        'trans__stop_words' : ['english', stopwords],
        'trans__ngram_range' : [(1,1), (1,2), (1,3)],
        'model__max_depth' : [50, 100],
        'model__min_samples_split' : [4, 10],
        'model__min_samples_leaf' : [1, 5],
        'model__ccp_alpha' : [0, 1]
    }
    
    #setup for Gridsearch
    search = GridSearchCV(pipe, param_grid=hyperp , cv = 5, scoring = 'accuracy')
    
    return search

In [73]:
dcc = dc(DecisionTreeClassifier(random_state = 88), CountVectorizer())
dcc.fit(X_train, y_train);

In [74]:
print(f'best score is: {dcc.best_score_}')
print(f'best parameter: {dcc.best_params_}')
print(f'testset best score: {dcc.score(X_test, y_test)}')

best score is: 0.8227777777777778
best parameter: {'model__ccp_alpha': 0, 'model__max_depth': 50, 'model__min_samples_leaf': 5, 'model__min_samples_split': 4, 'trans__ngram_range': (1, 1), 'trans__stop_words': 'english'}
testset best score: 0.8033333333333333


In [75]:
dct = dc(DecisionTreeClassifier(random_state = 88), TfidfVectorizer())
dct.fit(X_train, y_train);

In [76]:
print(f'best score is: {dct.best_score_}')
print(f'best parameter: {dct.best_params_}')
print(f'testset best score: {dct.score(X_test, y_test)}')

best score is: 0.8027777777777777
best parameter: {'model__ccp_alpha': 0, 'model__max_depth': 100, 'model__min_samples_leaf': 1, 'model__min_samples_split': 4, 'trans__ngram_range': (1, 2), 'trans__stop_words': 'english'}
testset best score: 0.81


In [77]:
#confusion matrix for the grid-searched decision tree with CountVectorizer (rows: true class, columns: predicted class)

cm_dtc2 = pd.DataFrame(confusion_matrix(y_test, dcc.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_dtc2

Unnamed: 0,p_e46,p_e90
e46,233,72
e90,46,249


In [78]:
#confusion matrix for the grid-searched decision tree with TfidfVectorizer (rows: true class, columns: predicted class)

cm_dtt2 = pd.DataFrame(confusion_matrix(y_test, dct.predict(X_test)), columns = ['p_e46', 'p_e90'], index = ['e46', 'e90'])
cm_dtt2

Unnamed: 0,p_e46,p_e90
e46,245,60
e90,54,241


## Combined Model Evaluations

In [107]:
# combine all accuracies

#collect the test accuracy of every model in two lists, one per vectorizer

acc_c = [
    #'lr_c' : 
    round((lr_c.score(X_test, y_test)), 3),
    
    #'knn_c' : 
    round((k.score(X_test, y_test)), 3),
    
    #'nbm_c' : 
    round((nbm_c.score(X_test, y_test)), 3),
    
    #'dt0_c' : 
    round((dt_c.score(X_ct, y_test)), 3),
    
    #'dt_c' : 
    round((dcc.score(X_test, y_test)), 3)
]

acc_tv =[
    #'lr_tf' : 
    round((lr_tv.score(X_test, y_test)), 3),
    
    #'knn_tf' : 
    round((k_tv.score(X_test, y_test)), 3),
    
    #'nbm_tf' : 
    round((nbm_v.score(X_test, y_test)), 3),
    
    #'dt0_tf' : 
    round((dt_t.score(X_tt, y_test)), 3),
    
    #'dt_tf' : 
    round((dct.score(X_test, y_test)), 3)
]

#Accuracy into a dataframe
comb_acc = pd.DataFrame(data = [acc_c, acc_tv],
                        columns = ['lr', 'knn' , 'nbm', 'dt0', 'dt'],
                       index = ['cvec', 'tfidf'])


comb_acc

Unnamed: 0,lr,knn,nbm,dt0,dt
cvec,0.858,0.58,0.842,0.79,0.803
tfidf,0.853,0.762,0.833,0.827,0.81


The best accuracy is logistic regression with CountVectorizer, at 86% on the test set.
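The same conclusion can be reached programmatically. The sketch below collects the test-set scores printed by the individual cells earlier in the notebook (rounded to four decimals) into a plain dictionary and picks the maximum; the dictionary layout is an illustrative choice, not part of the original workflow.

```python
# Test-set accuracies as printed by each model's cell above, keyed by (model, vectorizer).
scores = {
    ("lr", "cvec"): 0.8583,  ("lr", "tfidf"): 0.8533,
    ("knn", "cvec"): 0.58,   ("knn", "tfidf"): 0.7617,
    ("nbm", "cvec"): 0.8417, ("nbm", "tfidf"): 0.8333,
    ("dt0", "cvec"): 0.79,   ("dt0", "tfidf"): 0.8267,
    ("dt", "cvec"): 0.8033,  ("dt", "tfidf"): 0.81,
}

# Pick the (model, vectorizer) pair with the highest test accuracy.
best = max(scores, key=scores.get)

# Rank all combinations from best to worst for a quick overview.
ranking = sorted(scores, key=scores.get, reverse=True)

print(best, scores[best])
print(ranking[:3])
```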

In [220]:
#combine all predicted y from the different models into one dataframe with the true y to check 
#for any rows that were misclassified by all models. 

#make a dictionary of all the y_pred with its method
name = {
    'true_y' : y_test.ravel(),
    'log_c' : lr_c.predict(X_test),
    'log_t' : lr_tv.predict(X_test),
    'knn_c' : k.predict(X_test),
    'knn_t' : k_tv.predict(X_test),
    'nbm_c' : nbm_c.predict(X_test),
    'nbm_t' : nbm_v.predict(X_test),
    'dt_c'  : dt_c.predict(X_ct),
    'dt_t'  : dt_t.predict(X_tt),
    'dt_c_gs' : dcc.predict(X_test),
    'dt_t_gs' : dct.predict(X_test)
}

#use y_test index to reference back to the original text
total = pd.DataFrame(name, index = y_test.keys())
total

Unnamed: 0,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
2059,1,1,1,1,1,1,1,0,0,0,0
880,0,0,0,1,0,0,0,0,0,0,0
138,0,0,0,1,0,0,0,0,0,0,0
794,0,1,1,1,1,1,1,1,1,0,1
1844,1,1,1,1,1,1,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
1885,1,1,1,0,1,1,1,1,1,1,1
1268,1,1,1,1,1,1,1,0,1,0,1
1829,1,1,1,0,1,0,0,1,1,1,1
1669,1,1,1,1,1,0,0,0,0,0,0


In [221]:
#save it as a csv so I don't need to rerun the models every time to look into the details
total.to_csv('./data/total.csv')

In [24]:
#Use the final df from loading in total.csv from the beginning
final.head()

Unnamed: 0.1,Unnamed: 0,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
0,2059,1,1,1,1,1,1,1,0,0,0,0
1,880,0,0,0,1,0,0,0,0,0,0,0
2,138,0,0,0,1,0,0,0,0,0,0,0
3,794,0,1,1,1,1,1,1,1,1,0,1
4,1844,1,1,1,1,1,1,1,1,0,1,1


In [28]:
#rename the 'Unnamed: 0' column to m3_index
final.rename(columns = {'Unnamed: 0' : 'm3_index'}, inplace = True)
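The `Unnamed: 0` column appears because `to_csv` writes the index as a first column without a header, and a plain `read_csv` then reads it back as ordinary data. A possible alternative to renaming, sketched here on a tiny in-memory frame (the two rows mirror the first rows of the table above; the buffer stands in for the real file), is to pass `index_col=0` when reading.

```python
import io

import pandas as pd

# Small stand-in for the predictions frame, indexed by the original m3 row labels.
df = pd.DataFrame({"true_y": [1, 0], "log_c": [1, 0]}, index=[2059, 880])

# Write to an in-memory buffer instead of './data/total.csv'.
buf = io.StringIO()
df.to_csv(buf)

# Plain read: the saved index comes back as a data column named 'Unnamed: 0'.
buf.seek(0)
naive = pd.read_csv(buf)

# index_col=0: the first column is restored as the index.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

print(list(naive.columns))
print(list(restored.columns), list(restored.index))
```

Keeping the labels as a named column, as done above with `m3_index`, is equally valid and makes the later boolean filters easier to read.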

In [29]:
final

Unnamed: 0,m3_index,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
0,2059,1,1,1,1,1,1,1,0,0,0,0
1,880,0,0,0,1,0,0,0,0,0,0,0
2,138,0,0,0,1,0,0,0,0,0,0,0
3,794,0,1,1,1,1,1,1,1,1,0,1
4,1844,1,1,1,1,1,1,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
595,1885,1,1,1,0,1,1,1,1,1,1,1
596,1268,1,1,1,1,1,1,1,0,1,0,1
597,1829,1,1,1,0,1,0,0,1,1,1,1
598,1669,1,1,1,1,1,0,0,0,0,0,0


In [38]:
#the e46 posts that got misclassified by all models
final[(final[final.columns[2:]].sum(axis = 1) == 10) & (final['true_y'] == 0)]

Unnamed: 0,m3_index,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
343,1212,0,1,1,1,1,1,1,1,1,1,1
453,857,0,1,1,1,1,1,1,1,1,1,1
473,346,0,1,1,1,1,1,1,1,1,1,1


In [None]:
#e90 posts that got misclassified by all models

In [37]:
final[(final[final.columns[2:]].sum(axis = 1) == 0) & (final['true_y'] == 1)]

Unnamed: 0,m3_index,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
218,1870,1,0,0,0,0,0,0,0,0,0,0
255,1549,1,0,0,0,0,0,0,0,0,0,0
367,1702,1,0,0,0,0,0,0,0,0,0,0
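The two filters above rely on the labels being 0 and 1: ten predictions summing to 10 means every model said E90, and a sum of 0 means every model said e46. An equivalent, more direct formulation compares each prediction with the true label. The sketch below shows the idea on three rows copied from the tables above, using plain Python structures rather than the `final` dataframe.

```python
# Three rows from the prediction table: index in m3, true label, and the
# ten model predictions in column order (log_c ... dt_t_gs).
rows = [
    {"m3_index": 1212, "true_y": 0, "preds": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
    {"m3_index": 880,  "true_y": 0, "preds": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]},
    {"m3_index": 1870, "true_y": 1, "preds": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]},
]

# A post is misclassified by all models when no prediction equals the true label.
all_wrong = [
    row["m3_index"]
    for row in rows
    if all(pred != row["true_y"] for pred in row["preds"])
]

print(all_wrong)
```

This form handles both classes in one pass and does not depend on the number of model columns.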


In [39]:
m3['comb'][1212]

'muffler delete opinions  yay or nay '

In [40]:
m3['comb'][857]

'repair shops in near daytona beach  port orange  ormond beach florida  i need to find a trusted mechanic to do some work on my car '

In [41]:
m3['comb'][346]

'better exhaust i want to get more sound out of my car without the volume  i know you can t really have quality without a more open muffler  but are there any good ones that stay relatively quiet with normal driving  but get louder with some throttle '

In [42]:
m3['comb'][1870]

'front turn signals that pases mot in uk hey guys   i have trouble with my front turn signals and wondering if you guys know any replacement that won t fail in mot test   thanks '

In [43]:
m3['comb'][1549]

'shoot so  after fixing my broken charge pipe today the car won t start   jumping it doesn t work  and it doesn t even crank   when i press the start button it makes a whirring noise  which i believe is the water or fuel pump turning but i m not sure  then nothing   one time it started up and got to idle before immediately shutting off  which is a problem that fixing the charge pipe was supposed to remediate      amp  x200b   argh    amp  x200b   i had it towed to the mechanic and he said he thinks it s a bad starter   anyone have a similar problem   or knwo what i should expect to pay for a new starter '

In [44]:
m3['comb'][1702]

'quick ac diagnostic question my ac blows with very little force but makes noticeable noise on the three highest settings  is it more likely to be the blower motor or the resistor that s gone bad '

In [243]:
m3['comb'][346]

'better exhaust i want to get more sound out of my car without the volume  i know you can t really have quality without a more open muffler  but are there any good ones that stay relatively quiet with normal driving  but get louder with some throttle '

In [313]:
#examples of misclassification by logistic regression

final[(final['log_c'] == 1) & (final['log_t'] ==1) & (final['true_y'] == 0)].head(10)
                                                                                

Unnamed: 0,m3_index,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
3,794,0,1,1,1,1,1,1,1,1,0,1
18,258,0,1,1,0,0,1,1,1,1,1,1
41,639,0,1,1,1,1,0,0,0,0,0,0
59,35,0,1,1,1,1,1,0,1,0,1,1
69,271,0,1,1,1,1,0,0,1,0,1,0
98,1033,0,1,1,1,1,1,1,0,0,0,0
105,691,0,1,1,0,0,0,0,1,1,1,1
123,739,0,1,1,1,1,1,0,1,0,1,1
128,584,0,1,1,1,1,1,1,0,0,0,0
150,467,0,1,1,1,0,0,0,0,0,0,1


In [315]:
m3['comb'][467]

'transmission help 2003 320i driving home today and everything was going well   stopped at a red light  then when i went to accelerate the car would only crawl forward  have error code p1732 and p0712  reverse works fine but in drive it feels like it might be in limp mode   any ideas what the issue is or suggestions on where to start would be greatly appreciated '

In [109]:
final[(final['log_c'] == 0) & (final['log_t'] ==0) & (final['true_y'] == 1)].head(10)

Unnamed: 0.1,Unnamed: 0,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
33,1413,1,0,0,1,0,0,0,1,1,1,1
34,1792,1,0,0,0,1,0,0,0,1,0,0
46,1960,1,0,0,0,0,1,0,1,1,1,1
57,2285,1,0,0,1,0,0,0,0,1,0,1
79,2284,1,0,0,1,0,0,0,1,0,1,1
115,1791,1,0,0,1,0,0,0,0,0,0,0
135,1630,1,0,0,0,1,1,1,1,1,1,1
218,1870,1,0,0,0,0,0,0,0,0,0,0
251,2179,1,0,0,1,1,0,0,0,0,0,0
255,1549,1,0,0,0,0,0,0,0,0,0,0


In [116]:
m3['comb'][2179]

'got ebay angel eyes  the resistor or drivers or heat sinks get hot   should i worry  ok i got ebay angel eyes  i like the way they look  but there are two boxes on the wires  i don t know what they are  ones like 1 5  the other is like 2 5   the small one gets real hot  i have no experience w any angel eye bulbs  i m wondering if these are too hot to put in my housing   i m worried it would melt the plastic or the wires '

In [322]:
#example of misclassification of knn

final[(final['knn_c'] == 1) & (final['knn_t'] == 1) & (final['true_y'] == 0)].head(50)


Unnamed: 0,m3_index,true_y,log_c,log_t,knn_c,knn_t,nbm_c,nbm_t,dt_c,dt_t,dt_c_gs,dt_t_gs
3,794,0,1,1,1,1,1,1,1,1,0,1
21,162,0,1,0,1,1,0,0,1,0,0,0
28,557,0,0,0,1,1,0,1,0,0,0,0
41,639,0,1,1,1,1,0,0,0,0,0,0
53,595,0,0,0,1,1,0,0,0,1,1,1
59,35,0,1,1,1,1,1,0,1,0,1,1
69,271,0,1,1,1,1,0,0,1,0,1,0
85,104,0,0,0,1,1,0,0,0,0,0,1
98,1033,0,1,1,1,1,1,1,0,0,0,0
123,739,0,1,1,1,1,1,0,1,0,1,1


In [323]:
m3['comb'][361]

'head gasket replacement i ve got to taking the intake manifold off  engine still in car  but i can t find the 16 mm bolt on the underside of the intake manifold that is holding it down  i ve been watching the e46 rebuild that shoplife did on youtube but he has the engine out of the car  any help is appreciated'