# COMP30027 Machine Learning Assignment 2

## Description of text features

This notebook describes the pre-computed text features provided for assignment 2. **You do not need to recompute the features yourself for this assignment** -- this information is just for your reference. However, feel free to experiment with different text features if you are interested. If you do want to try generating your own text features, some things to keep in mind:
- There are many different decisions you can make throughout the feature design process, from the text preprocessing to the size of the output vectors. There's no guarantee that the defaults we chose will produce the best possible text features for this classification task, so feel free to experiment with different settings.
- These features must be trained on a training corpus. Generally, the training corpus should not include validation samples, but for the purposes of this assignment we have used the entire non-test set (training+validation) as the training corpus, to allow you to experiment with different validation sets. If you recompute the text features as part of your own model, you should exclude validation samples and retrain on training samples only. Note that if you do N-fold cross-validation, this means generating N sets of features for N different training-validation splits.
- This code may take a long time to run and require a good bit of memory, which is why we are not requiring you to recompute these features yourself. doc2vec in particular is very slow unless you can implement some speed-ups in C.

In [74]:
import numpy as np
import pandas as pd

# read recipe_train.csv
x_train_original = pd.read_csv(r"recipe_train.csv", index_col = False, delimiter = ',', header=0)
# use recipe name as an example
train_corpus_name = x_train_original['name']
test_name = x_train_original['name']

## Count vectorizer

A count vectorizer converts documents to vectors which represent word counts. Each column in the output represents a different word and the values indicate the number of times that word appeared in the document. The overall size of a count vector matrix can be quite large (the number of columns is the total number of different words used across all documents in a corpus), but most entries in the matrix are zero (each document contains only a few of all the possible words). Therefore, it is most efficient to represent the count vectors as a sparse matrix.

In [75]:
from sklearn.feature_extraction.text import CountVectorizer

# preprocess text and compute counts
vocab_name = CountVectorizer(stop_words='english').fit(train_corpus_name)
# generate counts for a new set of documents
x_train_name = vocab_name.transform(train_corpus_name)
x_test_name = vocab_name.transform(test_name)

# check the number of words in vocabulary
print(len(vocab_name.vocabulary_))
# check the shape of sparse matrix
print(x_train_name.shape)

10892
(40000, 10892)


## Load the data set

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scipy.sparse import hstack
from sklearn import datasets, svm
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from collections import Counter

# load train and test set
x_train_original = pd.read_csv(r"recipe_train.csv", index_col = False, delimiter = ',', header=0)
x_test_original = pd.read_csv(r"recipe_test.csv", index_col = False, delimiter = ',', header=0)

## Split to train and test set 

In [2]:
# preprocess to remove punctuation
a = r"[\"\',\[\]]"
x_train_original['name']=(x_train_original['name']).str.replace(a, "", regex=True)
x_train_original['steps']=(x_train_original['steps']).str.replace(a, "", regex=True)
x_train_original['ingredients']=(x_train_original['ingredients']).str.replace(a, "", regex=True)

In [78]:
# train test split recipe_train.csv
X_train, X_test, y_train, y_test = train_test_split(x_train_original,x_train_original.iloc[:,-1], test_size=0.23, random_state=88)

## Feature Engineering (Count Vectorise)

In [79]:
# count vectorize name, step and ingr for both train and test set & clean sentence with stopwords

# vectorizes name
name=(X_train['name'])
name_test=(X_test['name'])

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(name)
names=vectorizer.transform(name)
names_test=vectorizer.transform(name_test)

#vectorizes step
step_test=(X_test['steps'])
step=(X_train['steps'])

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(step)
step=vectorizer.transform(step)
step_test=vectorizer.transform(step_test)

#vectorizes ingredients
ingr_test=(X_test['ingredients'])
ingr=(X_train['ingredients'])

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(ingr)
ingr=vectorizer.transform(ingr)
ingr_test=vectorizer.transform(ingr_test)

## Feature Selection (Chi Square)

In [80]:
# feature selection using chi-square, selects the top 1000 features

x2 = SelectKBest(chi2, k=1000)
X_train_x2_names = x2.fit_transform(names,y_train)
X_test_x2_names = x2.transform(names_test)

x2 = SelectKBest(chi2, k=1000)
X_train_x2_step = x2.fit_transform(step,y_train)
X_test_x2_step = x2.transform(step_test)

x2 = SelectKBest(chi2, k=1000)
X_train_x2_ingr = x2.fit_transform(ingr,y_train)
X_test_x2_ingr = x2.transform(ingr_test)

## Combine all of the features

In [81]:
all_features_x2=hstack((X_train_x2_names, X_train_x2_step,X_train_x2_ingr,X_train[ 'n_steps'].to_numpy().reshape(-1,1),X_train[ 'n_ingredients'].to_numpy().reshape(-1,1)))
all_features_test_x2=hstack((X_test_x2_names, X_test_x2_step,X_test_x2_ingr,X_test[ 'n_steps'].to_numpy().reshape(-1,1),X_test[ 'n_ingredients'].to_numpy().reshape(-1,1)))

# Model Selection

## Logistic Regression

In [83]:
# train & test data using logistic regression model
# lr = LogisticRegression(max_iter=2000).fit(all_features_x2, y_train)
# print("Logistic regression accuracy for Chisquare:",lr.score(all_features_test_x2,y_test))

#tuning with C hyperparameter = 0.5
lr_2 = LogisticRegression(max_iter=2000, C=0.5).fit(all_features_x2, y_train)
print("Logistic regression accuracy for Chisquare:",lr_2.score(all_features_test_x2,y_test))

Logistic regression accuracy for Chisquare: 0.7919565217391304


## Multinomial NB

In [84]:
# train & test data using multinomial NB model alpha = 10

Multi = MultinomialNB(alpha=10).fit(all_features_x2.toarray(), y_train)
print("Multinomial accuracy for Chisquare:", Multi.score(all_features_test_x2.toarray(),y_test))

Multinomial accuracy for Chisquare: 0.7344565217391305


## Gaussian NB

In [86]:
# train & test data using gaussian NB model

gaus= GaussianNB().fit(all_features_x2.toarray(), y_train)
print("Gaussian NB accuracy for Chisquare:",gaus.score(all_features_test_x2.toarray(),y_test))

Gaussian NB accuracy for Chisquare: 0.6475


## Bernoulli NB

In [88]:
# train & test data using bernoulli NB model with alpha = 6

bernoulli= BernoulliNB(alpha=6).fit(all_features_x2.toarray(), y_train)
print("bernoulli accuracy:",bernoulli.score(all_features_test_x2.toarray(),y_test))

bernoulli accuracy: 0.7515217391304347


## Decision Tree Classifier

In [89]:
# train & test data using Decision Tree with max depth = 20

dt= DecisionTreeClassifier(max_depth=20).fit(all_features_x2.toarray(), y_train)
print("decision tree accuracy:", dt.score(all_features_test_x2.toarray(),y_test))

decision tree accuracy: 0.7544565217391305


## Neural Network

In [90]:
# train & test data using Neural Network

clf = MLPClassifier(max_iter=2000)
clf.fit(all_features_x2.toarray(), y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=2000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [91]:
# without standardscaler

print("NN with relu activation function and alpha = 0.0001 and 100 neurons in 1st hidden layer:", clf.score(all_features_test_x2.toarray(),y_test))

NN with relu activation function and alpha = 0.0001 and 100 neurons in 1st hidden layer: 0.7739130434782608


In [92]:
# with standardscaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(all_features_x2.toarray())

In [93]:
# train & test data using Neural Network with standardscaler
clf = MLPClassifier(max_iter=2000, alpha=0.1)
clf.fit(x_train, y_train)

MLPClassifier(activation='relu', alpha=0.1, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=2000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [94]:
x_test = scaler.fit_transform(all_features_test_x2.toarray())
print("NN with relu activation function and alpha = 0.0001 and 100 neurons in 1st hidden layer:", clf.score(x_test, y_test))

NN with relu activation function and alpha = 0.0001 and 100 neurons in 1st hidden layer: 0.7682608695652174


In [95]:
predictions = clf.predict(x_test)

In [96]:
from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(y_test,predictions)) # precision, recall, f1 score for Neural Network

              precision    recall  f1-score   support

         1.0       0.75      0.77      0.76      4046
         2.0       0.79      0.78      0.78      4695
         3.0       0.73      0.63      0.68       459

    accuracy                           0.77      9200
   macro avg       0.76      0.73      0.74      9200
weighted avg       0.77      0.77      0.77      9200



## Stacking

In [98]:
# train & test data using stacking method with base Logistic Regression, Decision Tree & Multinomial NB as Base Model
# and with Logistic Regression as learner model

# define the base models
level0 = list()
level0.append(('lr', LogisticRegression(max_iter=5000, C=0.5)))
level0.append(('dt', DecisionTreeClassifier(max_depth=10)))
level0.append(('Multinomial NB', MultinomialNB(alpha=10)))

# define meta learner model
level1 = LogisticRegression(C=0.5,max_iter=5000)

# define the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=3)

# fit the model on all available data
model.fit(all_features_x2.toarray(), y_train)

print("stacking method accuracy:",model.score(all_features_test_x2.toarray(),y_test))

stacking method accuracy: 0.8027173913043478


In [99]:
# train & test data using stacking method with base Logistic Regression, Decision Tree & Bernoulli NB as Base Model
# and with Logistic Regression as learner model

# define the base models
level0 = list()
level0.append(('lr', LogisticRegression(max_iter=5000, C=0.5)))
level0.append(('dt', DecisionTreeClassifier(max_depth=10)))
level0.append(('BNB', BernoulliNB(alpha=6)))

# define meta learner model
level1 = LogisticRegression(C=0.5,max_iter=5000)

# define the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=3)

# fit the model on all available data
model.fit(all_features_x2.toarray(), y_train)

print("stacking method accuracy:",model.score(all_features_test_x2.toarray(),y_test))

stacking method accuracy: 0.8004347826086956


# Test Validation for Kaggle Submission

## Count Vectorise
Now, we use all the data set from recipe_train.csv for training instead of splitting them for kaggle test. Here, we are using the best model which is Ensemble Stacking to predict the test set 

In [3]:
label=x_train_original['duration_label']

name=(x_train_original['name'])
name_kaggle_test=(x_test_original['name'])
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(name)
names=vectorizer.transform(name)
name_kaggle_test=vectorizer.transform(name_kaggle_test)

step_kaggle_test=(x_test_original['steps'])
step=(x_train_original['steps'])
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(step)
step=vectorizer.transform(step)
step_kaggle_test=vectorizer.transform(step_kaggle_test)

ingr_kaggle_test=(x_test_original['ingredients'])
ingr=(x_train_original['ingredients'])
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(ingr)
ingr=vectorizer.transform(ingr)
ingr_kaggle_test=vectorizer.transform(ingr_kaggle_test)

## Feature Selection (Chisquare)

In [4]:
x2 = SelectKBest(chi2, k=1000)
names_x2 = x2.fit_transform(names,label)
names_test_x2 = x2.transform(name_kaggle_test)

x2 = SelectKBest(chi2, k=1000)
step_x2 = x2.fit_transform(step,label)
step_test_x2 = x2.transform(step_kaggle_test)

x2 = SelectKBest(chi2, k=1000)
ingr_x2 = x2.fit_transform(ingr,label)
ingr_test_x2 = x2.transform(ingr_kaggle_test)

## Combine all of the features

In [6]:
all_features_train=hstack((names_x2, step_x2,ingr_x2,x_train_original[ 'n_steps'].to_numpy().reshape(-1,1),x_train_original[ 'n_ingredients'].to_numpy().reshape(-1,1)))
all_features_test=hstack((names_test_x2, step_test_x2,ingr_test_x2,x_test_original[ 'n_steps'].to_numpy().reshape(-1,1),x_test_original[ 'n_ingredients'].to_numpy().reshape(-1,1)))

## Logistic Regression

In [None]:
lr = LogisticRegression(max_iter=2000).fit(all_features_train, label)
pred_test=lr.predict(all_features_test)

d2 = { 'duration_label':pred_test}
df2=pd.DataFrame(d2)
df2.index+=1
index=df2.index
index.name = "id"
df2.to_csv ('logistic_prediction.csv')

In [25]:
#tune C hyperparameter
lr_2 = LogisticRegression(max_iter=2000, C=0.5).fit(all_features_train, label)
pred_test=lr_2.predict(all_features_test)

d2 = { 'duration_label':pred_test}
df2=pd.DataFrame(d2)
df2.index+=1
index=df2.index
index.name = "id"
df2.to_csv ('logistic_tune_parameter_prediction.csv')

## MultinomialNB

In [24]:
MNB = MultinomialNB(alpha=10).fit(all_features_train, label)
pred_test=MNB.predict(all_features_test)

d3 = { 'duration_label':pred_test}
df3=pd.DataFrame(d3)
df3.index+=1
index=df3.index
index.name = "id"
df3.to_csv ('MultinomialNB_prediction.csv')

## Bernoulli NB

In [22]:
BNB = BernoulliNB(alpha=6).fit(all_features_train.toarray(), label)
pred_test=BNB.predict(all_features_test.toarray())

d4 = { 'duration_label':pred_test}
df4=pd.DataFrame(d4)
df4.index+=1
index=df4.index
index.name = "id"
df4.to_csv('BernoulliNB_prediction.csv')

## GaussianNB

In [21]:
GNB = GaussianNB().fit(all_features_train.toarray(), label)
pred_test=GNB.predict(all_features_test.toarray())

d5 = { 'duration_label':pred_test}
df5=pd.DataFrame(d5)
df5.index+=1
index=df5.index
index.name = "id"
df5.to_csv ('GaussianNB_prediction.csv')

## Decision Tree

In [20]:
dt = DecisionTreeClassifier(max_depth=20).fit(all_features_train, label)
pred_test=dt.predict(all_features_test)

d6 = { 'duration_label':pred_test}
df6=pd.DataFrame(d6)
df6.index+=1
index=df6.index
index.name = "id"
df6.to_csv ('Decison_tree_prediction.csv')

## Neural Network

In [16]:
clf = MLPClassifier(max_iter=2000)
clf.fit(all_features_train.toarray(), label)
pred_test=clf.predict(all_features_test)
d7 = { 'duration_label':pred_test}
df7=pd.DataFrame(d7)
df7.index+=1
index=df7.index
index.name = "id"
df7.to_csv ('Neural_Network_prediction.csv')

## Stacking

In [115]:
# Stacking using Linear Regression, Decision Tree and ... as the basemodel and Logistic Regression as the learner model

# define the base models
level0 = list()
level0.append(('lr', LogisticRegression(max_iter=2000)))
level0.append(('dt', DecisionTreeClassifier(max_depth=20)))
level0.append(('MNB', MultinomialNB()))
#level0.append(('svm', svm.LinearSVC(max_iter=5000)))
#level0.append(('svc', svm.LinearSVC(max_iter=5000)))
#level0.append(('MLPClassifier', MLPClassifier(max_iter=2000)))

# define meta learner model
level1 = LogisticRegression(C=1,max_iter=2000)

# define the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
# fit the model on all available data

# fit the model on all available data
model.fit(all_features_train.toarray(), label)

StackingClassifier(cv=5,
                   estimators=[('lr',
                                LogisticRegression(C=1.0, class_weight=None,
                                                   dual=False,
                                                   fit_intercept=True,
                                                   intercept_scaling=1,
                                                   l1_ratio=None, max_iter=2000,
                                                   multi_class='auto',
                                                   n_jobs=None, penalty='l2',
                                                   random_state=None,
                                                   solver='lbfgs', tol=0.0001,
                                                   verbose=0,
                                                   warm_start=False)),
                               ('dt',
                                DecisionTreeClassifier(ccp_alpha=0.0,
                                  

## Predicting test set (Validation)

In [116]:
y_pred_test=model.predict(all_features_test.toarray())

d1 = { 'duration_label':y_pred_test}
df1=pd.DataFrame(d1)
df1.index+=1
index=df1.index
index.name = "id"
df1.to_csv ('kaggle_dataframe.csv') # export prediction results as dataframe

print(y_pred_test)

[2. 1. 1. ... 1. 1. 2.]


## doc2vec

doc2vec methods are an extension of word2vec. word2vec maps words to a high-dimensional vector space in such a way that words which appear in similar contexts will be close together in the space. doc2vec does a similar embedding for multi-word passages. The doc2vec (or Paragraph Vector) method was introduced by:

**Le & Mikolov (2014)** Distributed Representations of Sentences and Documents<br>
https://arxiv.org/pdf/1405.4053v2.pdf

The implementation of doc2vec used for this project is from gensim and documented here:<br>
https://radimrehurek.com/gensim/models/doc2vec.html

The size of the output vector is a free parameter. Most implemementations use around 100-300 dimensions, but the best size depends on the problem you're trying to solve with the embeddings and the number of training samples, so you may wish to try different vector sizes. We provided three sets of doc2vec features, and their dimensions are 50 and 100. The vectors themselves represent directions in a high-dimensional concept space; the columns do not represent specific words or phrases. Values in the vector are continuous real numbers and can be negative.

In [None]:
import gensim

# size of the output vector
vec_size = 20

# function to preprocess and tokenize text
def tokenize_corpus(txt, tokens_only=False):
    for i, line in enumerate(txt):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# tokenize a training corpus
corpus_name = list(tokenize_corpus(train_corpus_name))

# train doc2vec on the training corpus
model = gensim.models.doc2vec.Doc2Vec(vector_size=vec_size, min_count=2, epochs=40)
model.build_vocab(corpus_name)
model.train(corpus_name, total_examples=model.corpus_count, epochs=model.epochs)

# tokenize new documents
doc = list(tokenize_corpus(test_name, tokens_only=True))

# generate embeddings for the new documents
x_test_name = np.zeros((len(doc),vec_size))
for i in range(len(doc)):
    x_test_name[i,:] = model.infer_vector(doc[i])
    
# check the shape of doc_emb
print(x_test_name.shape)