# HW5 Skeleton Code
This skeleton code is provided to help you with the homework.
The full description of each question is in HW5.pdf, so please read the instructions for each question carefully. Some questions may not be covered in this code.

In [7]:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

## Q. Changing HTML Text to Plain Text

The Python library **BeautifulSoup** is useful for working with HTML text. To use it, first install it by running `conda install beautifulsoup4` in the terminal.

In your code, import it with `from bs4 import BeautifulSoup`.
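
As a quick illustration of what `get_text()` does, here is a minimal sketch on a made-up HTML snippet (the string is an invented example, not a row from the homework data). It names Python's built-in `'html.parser'` explicitly, which is one reasonable parser choice rather than a requirement of the homework.

```python
from bs4 import BeautifulSoup

# Invented example of a question body stored as HTML (not taken from the data set).
html_body = "<p>I fit a <b>linear</b> model.</p>\n<pre><code>lm(y ~ x)</code></pre>"

# Parse with Python's built-in HTML parser and keep only the text nodes.
plain_text = BeautifulSoup(html_body, "html.parser").get_text()

# Newlines between blocks survive get_text(), so replace them with spaces.
plain_text = plain_text.replace("\n", " ")

print(plain_text)
```

Replacing the newline with a space (rather than an empty string) keeps the last word of one line from being glued to the first word of the next.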

In [8]:
#Read our data files
df_train = pd.read_csv(r'stack_stats_2023_train.csv') #Todo
df_test = pd.read_csv(r'stack_stats_2023_test.csv') #Todo

In [9]:
train_Body = df_train['Body']
train_Title = df_train['Title']
train_Tags = df_train['Tags']

test_Body = df_test['Body']
test_Title = df_test['Title']
test_Tags = df_test['Tags']

In [11]:
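#Cleaning 'Body'
#Change HTML text to plain text using the get_text() function from BeautifulSoup.
#The parser is named explicitly ('html.parser' ships with Python) so bs4 does not have to guess one.
#Newlines ('\n') are replaced with a space so that words on adjacent lines are not merged.
train_Body = train_Body.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ' '))
test_Body = test_Body.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ' '))

#Cleaning Tags
train_Tags = train_Tags.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ' '))
test_Tags = test_Tags.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ' '))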

train_Body

0        I'm a master's student in EECS working my way ...
1        I do not know if this is a good question, but ...
2        I am doing 10 times repeated 10-fold cross-val...
3        I have a dataset with 1MM records, around 40 f...
4        I want to run a regression where one of the ex...
                               ...                        
19242    I need to fill missing values. I have found th...
19243    In the vast majority of cases, linear regressi...
19244    I can see on the Wikipedia page of the Poisson...
19245    There are three conditions to prove that a fun...
19246    I have a timecourse RNASeq experiment for muta...
Name: Body, Length: 19247, dtype: object

## Q. Basic Text Cleaning and Merging into a single Text data

### Change to Lowercase, Remove Punctuation and Digits

In [13]:
#Change to Lowercase
train_Body = train_Body.str.lower()
train_Title = train_Title.str.lower()
train_Tags = train_Tags.str.lower()

test_Body = test_Body.str.lower()
test_Title = test_Title.str.lower()
test_Tags = test_Tags.str.lower()

train_Body

0        i'm a master's student in eecs working my way ...
1        i do not know if this is a good question, but ...
2        i am doing 10 times repeated 10-fold cross-val...
3        i have a dataset with 1mm records, around 40 f...
4        i want to run a regression where one of the ex...
                               ...                        
19242    i need to fill missing values. i have found th...
19243    in the vast majority of cases, linear regressi...
19244    i can see on the wikipedia page of the poisson...
19245    there are three conditions to prove that a fun...
19246    i have a timecourse rnaseq experiment for muta...
Name: Body, Length: 19247, dtype: object

In [15]:
#Remove Punctuation
from string import punctuation

#This function is available in our discussion session code, but it is left blank here for practice.
def remove_punctuation(document):
    
    no_punct = ''.join([character for character in document if character not in punctuation])#Todo

    return no_punct
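
For comparison, the same filtering can be sketched with `str.translate`, which deletes characters through a lookup table instead of a Python-level loop. The name `remove_punctuation_fast` is hypothetical and is not part of the discussion session code; like the function above, it only covers the ASCII characters in `string.punctuation`.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)

def remove_punctuation_fast(document):
    """Hypothetical alternative: drop punctuation with a single translate() call."""
    return document.translate(_PUNCT_TABLE)

example = remove_punctuation_fast("i'm doing 10-fold cross-validation, right?")
print(example)
```

Either version can be passed to `Series.apply`; on a corpus of this size the difference is mainly one of speed.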

In [17]:
train_Body = train_Body.apply(remove_punctuation)
train_Title = train_Title.apply(remove_punctuation)
train_Tags = train_Tags.apply(remove_punctuation)

test_Body = test_Body.apply(remove_punctuation)
test_Title = test_Title.apply(remove_punctuation)
test_Tags = test_Tags.apply(remove_punctuation)

train_Body

0        im a masters student in eecs working my way to...
1        i do not know if this is a good question but i...
2        i am doing 10 times repeated 10fold crossvalid...
3        i have a dataset with 1mm records around 40 fe...
4        i want to run a regression where one of the ex...
                               ...                        
19242    i need to fill missing values i have found tha...
19243    in the vast majority of cases linear regressio...
19244    i can see on the wikipedia page of the poisson...
19245    there are three conditions to prove that a fun...
19246    i have a timecourse rnaseq experiment for muta...
Name: Body, Length: 19247, dtype: object

In [18]:
#Remove Digits 

def remove_digit(document): 
    
    no_digit = ''.join([character for character in document if not character.isdigit()])#Todo
              
    return no_digit

In [19]:
train_Body = train_Body.apply(remove_digit)
train_Title = train_Title.apply(remove_digit)
train_Tags = train_Tags.apply(remove_digit)

test_Body = test_Body.apply(remove_digit)
test_Title = test_Title.apply(remove_digit)
test_Tags = test_Tags.apply(remove_digit)

train_Body

0        im a masters student in eecs working my way to...
1        i do not know if this is a good question but i...
2        i am doing  times repeated fold crossvalidatio...
3        i have a dataset with mm records around  featu...
4        i want to run a regression where one of the ex...
                               ...                        
19242    i need to fill missing values i have found tha...
19243    in the vast majority of cases linear regressio...
19244    i can see on the wikipedia page of the poisson...
19245    there are three conditions to prove that a fun...
19246    i have a timecourse rnaseq experiment for muta...
Name: Body, Length: 19247, dtype: object

### Tokenize, Remove Stopwords, and Stem

In [20]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [21]:
train_Body = train_Body.apply(word_tokenize)
train_Title = train_Title.apply(word_tokenize)
train_Tags = train_Tags.apply(word_tokenize)

test_Body = test_Body.apply(word_tokenize)
test_Title = test_Title.apply(word_tokenize)
test_Tags = test_Tags.apply(word_tokenize)

train_Body

0        [im, a, masters, student, in, eecs, working, m...
1        [i, do, not, know, if, this, is, a, good, ques...
2        [i, am, doing, times, repeated, fold, crossval...
3        [i, have, a, dataset, with, mm, records, aroun...
4        [i, want, to, run, a, regression, where, one, ...
                               ...                        
19242    [i, need, to, fill, missing, values, i, have, ...
19243    [in, the, vast, majority, of, cases, linear, r...
19244    [i, can, see, on, the, wikipedia, page, of, th...
19245    [there, are, three, conditions, to, prove, tha...
19246    [i, have, a, timecourse, rnaseq, experiment, f...
Name: Body, Length: 19247, dtype: object

In [22]:
#Remove Stopwords

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(document):
    
    words = [word for word in document if word not in stop_words]#Todo
    
    return words

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
train_Body = train_Body.apply(remove_stopwords)
train_Title = train_Title.apply(remove_stopwords)
train_Tags = train_Tags.apply(remove_stopwords)

test_Body = test_Body.apply(remove_stopwords)
test_Title = test_Title.apply(remove_stopwords)
test_Tags = test_Tags.apply(remove_stopwords)

train_Body

0        [im, masters, student, eecs, working, way, tow...
1        [know, good, question, found, answer, anywhere...
2        [times, repeated, fold, crossvalidation, want,...
3        [dataset, mm, records, around, features, class...
4        [want, run, regression, one, explanatory, vari...
                               ...                        
19242    [need, fill, missing, values, found, many, app...
19243    [vast, majority, cases, linear, regression, mo...
19244    [see, wikipedia, page, poisson, distribution, ...
19245    [three, conditions, prove, function, copula, c...
19246    [timecourse, rnaseq, experiment, mutants, wt, ...
Name: Body, Length: 19247, dtype: object

In [24]:
#We use Porter stemming

from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stemmer(document):
    
    stemmed_document = [porter.stem(word) for word in document] #Todo
    
    return stemmed_document

In [25]:
train_Body = train_Body.apply(stemmer)
train_Title = train_Title.apply(stemmer)
train_Tags = train_Tags.apply(stemmer)

test_Body = test_Body.apply(stemmer)
test_Title = test_Title.apply(stemmer)
test_Tags = test_Tags.apply(stemmer)

train_Body

0        [im, master, student, eec, work, way, toward, ...
1        [know, good, question, found, answer, anywher,...
2        [time, repeat, fold, crossvalid, want, report,...
3        [dataset, mm, record, around, featur, class, i...
4        [want, run, regress, one, explanatori, variabl...
                               ...                        
19242    [need, fill, miss, valu, found, mani, approach...
19243    [vast, major, case, linear, regress, model, us...
19244    [see, wikipedia, page, poisson, distribut, pmf...
19245    [three, condit, prove, function, copula, cucv,...
19246    [timecours, rnaseq, experi, mutant, wt, compar...
Name: Body, Length: 19247, dtype: object

## Let's check our DataFrame

In [26]:
train_Body.head(5)

0    [im, master, student, eec, work, way, toward, ...
1    [know, good, question, found, answer, anywher,...
2    [time, repeat, fold, crossvalid, want, report,...
3    [dataset, mm, record, around, featur, class, i...
4    [want, run, regress, one, explanatori, variabl...
Name: Body, dtype: object

### Q. Treat the three text fields independently and merge them into one column

In [29]:
#Treat the three types of text independently.
#Each document is a list of tokens at this point, and a list has no .add() method.
#Assumed intent: tag every token with its source, so that the same word in the
#body, the title, and the tags becomes three distinct terms.

def add_body(document):
    
    added_document = [word + '_body' for word in document] #Todo
    
    return added_document

def add_title(document):
    
    added_document = [word + '_title' for word in document] #Todo
    
    return added_document

def add_tags(document):
    
    added_document = [word + '_tags' for word in document] #Todo
    
    return added_document

In [30]:
train_Body = train_Body.apply(add_body)
train_Title = train_Title.apply(add_title)
train_Tags = train_Tags.apply(add_tags)

test_Body = test_Body.apply(add_body)
test_Title = test_Title.apply(add_title)
test_Tags = test_Tags.apply(add_tags)

train_Body

In [23]:
#Now we need to merge all those 3 columns into a single column. Implement this below.
#Concatenate the three token lists row by row into one 'text' column.
#df_train and df_test keep their other columns (e.g. 'Score'), which are needed later.
df_train['text'] = [body + title + tags for body, title, tags in zip(train_Body, train_Title, train_Tags)] #Todo
df_test['text'] = [body + title + tags for body, title, tags in zip(test_Body, test_Title, test_Tags)] #Todo
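
To make the idea of "treating the three fields independently" concrete, here is a small sketch on invented token lists. The helper `add_suffix` and the suffixes `_body`, `_title`, and `_tags` are illustrative assumptions; any scheme that keeps the three sources distinguishable would serve the same purpose.

```python
def add_suffix(tokens, suffix):
    """Hypothetical helper: tag every token with the column it came from."""
    return [token + suffix for token in tokens]

# Invented stemmed tokens for one question.
body = ["regress", "model"]
title = ["regress"]
tags = ["regress"]

# Tag each field, then concatenate the lists into one merged document.
merged = add_suffix(body, "_body") + add_suffix(title, "_title") + add_suffix(tags, "_tags")
print(merged)
```

Without the suffixes, the three occurrences of `regress` would collapse into one term, and a model could not tell a word in the title from the same word in the body.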

## Let's check our DataFrame

In [None]:
df_train.head(5)

### Q. Detokenize and convert to document term matrices

In [None]:
#Detokenize the merged text column and build the document-term matrix

from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import CountVectorizer

text_train = df_train['text'].apply() #Todo: Detokenize your tokenized text data
countvec_train = #Todo: Define your own CountVectorizer here
sparse_dtm_train = #Todo: Fit and Transform your Countvectorizer and return sparse dtm.

In [None]:
#Todo: Do the same on the test set.
text_test = df_test['text'].apply()
sparse_dtm_test = 

In [None]:
#Convert the sparse dtm to a pandas DataFrame.
dtm_train = #Todo
dtm_test = #Todo

### Q. Change dependent variable to binary variable

In [None]:
#Change 'Score' to a binary variable, which indicates whether the question is good or not.
y_train = #Todo
y_test = #Todo
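
A minimal sketch of the binarization on invented scores is shown below. The cutoff used here (`Score >= 1` counts as a good question) is an assumption for illustration only; the actual definition is the one given in HW5.pdf.

```python
import pandas as pd

# Invented scores; the real values come from the 'Score' column of the data.
scores = pd.Series([0, 3, -1, 1, 0])

# ASSUMPTION: a question is "good" when Score >= 1. Check HW5.pdf for the actual rule.
GOOD_THRESHOLD = 1

# The boolean comparison becomes a 0/1 label.
y = (scores >= GOOD_THRESHOLD).astype(int)
print(y.tolist())
```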

In [None]:
#Add y_train and y_test to your DataFrames if needed, and drop unnecessary columns.
df_train[''] = y_train
df_test[''] = y_test
df_train.drop(columns = [], inplace = True)
df_test.drop(columns = [], inplace = True)

## Let's check our DataFrame


In [None]:
df_train.head(5)

## (b) Please read the instructions in the PDF carefully.

In [None]:
#Create Comparison Table
#These lines are provided for you to help construct a comparison table.
#It is not required to follow this format. You also need to find ACC, TPR, FPR, PRE for each model that you choose.
comparison_data = {'Baseline':[baseline_acc,baseline_TPR,baseline_FPR, baseline_PRE],
                   'Logistic Regression':[log_acc,log_TPR,log_FPR, log_PRE],
                   'Decision Tree Classifier':[dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE],
                   'Random Forest with CV':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Linear Discriminant Analysis':[lda_acc,lda_TPR, lda_FPR,lda_PRE]}

comparison_table = pd.DataFrame(data=comparison_data, index=['Accuracy', 'TPR', 'FPR','PRE']).transpose()
comparison_table.style.set_properties(**{'font-size': '12pt',}).set_table_styles([{'selector': 'th', 'props': [('font-size', '10pt')]}])
comparison_table
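
The four numbers in each row of the table follow from the confusion-matrix counts. The sketch below computes them in plain Python on invented labels; the function name `classification_metrics` is hypothetical, and the zero-denominator fallback of `0.0` is a convention chosen here, not something the homework prescribes.

```python
def classification_metrics(y_true, y_pred):
    """Hypothetical helper: ACC, TPR, FPR, PRE for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "ACC": (tp + tn) / len(pairs),
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
        "PRE": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Invented labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```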


## Report details of your training procedures and final comparisons on the test set in this cell. Use your best judgment to choose a final model and explain your choice.

## Report Bootstrap Analysis in this cell
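
One way to set up the bootstrap is sketched below: resample the test-set rows with replacement, recompute the metric on each resample, and read a percentile interval off the results. The metric (accuracy), the number of resamples, the seed, and the 95% level are all assumptions made for this illustration, and the labels are invented; HW5.pdf specifies what is actually required.

```python
import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
    """Hypothetical helper: bootstrap distribution of test-set accuracy."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n row indices with replacement and score that resample.
        idx = rng.integers(0, n, size=n)
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    return accs

# Invented labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

boot_accs = bootstrap_accuracy(y_true, y_pred)

# Percentile interval: the middle 95% of the bootstrap distribution.
lower, upper = np.percentile(boot_accs, [2.5, 97.5])
print(f"95% bootstrap interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

The model is not refit inside the loop; only the evaluation rows are resampled, so the interval reflects the variability due to the particular test set.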

### (c)