# COMP30027 Machine Learning Project 2

## Description of text features

This notebook describes the pre-computed text features provided for Project 2. **You do not need to recompute the features yourself for this assignment** -- this information is just for your reference. However, feel free to experiment with different text features if you are interested. If you do want to try generating your own text features, some things to keep in mind:
- There are many different decisions you can make throughout the feature design process, from the text preprocessing to the size of the output vectors. There's no guarantee that the defaults we chose will produce the best possible text features for this classification task, so feel free to experiment with different settings.
- These features must be trained using a training corpus. Generally, the training corpus should not include validation samples, but for the purposes of this assignment we have used the entire non-test set (training+validation) as the training corpus, to allow you to experiment with different validation sets. If you recompute the text features as part of your own model, you should exclude validation samples and compute them on training samples only. For example, if you do N-fold cross-validation, this means generating N sets of features for N different training-validation splits.
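
The per-fold recomputation described in the last point can be sketched by making the vectorizer a step in a scikit-learn pipeline, so that cross-validation refits it on each fold's training documents only. This is a minimal illustration, not the method used to produce the provided features: the toy corpus, the labels and the choice of `LogisticRegression` are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented for illustration only
docs = [
    "a history of ancient rome", "the roman empire at war",
    "cooking with fresh herbs", "simple recipes for dinner",
    "rome and its emperors", "baking bread at home",
    "the fall of the empire", "easy dinner recipes",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

# Because the vectorizer is a pipeline step, each fold fits its vocabulary
# on that fold's training documents and only transforms the held-out ones
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipe, docs, labels, cv=cv)
print(scores.shape)
```

With N folds this fits N separate vocabularies, which matches the "N sets of features" mentioned above.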

In [3]:
import numpy as np
import pandas as pd

# read text
# for DEMONSTRATION PURPOSES, the entire training set will be used to train the models and also as a test set
x_train_original = pd.read_csv(r"book_rating_train.csv", index_col = False, delimiter = ',', header=0)
# use the book name as an example
train_corpus_name = x_train_original['Name']
test_name = x_train_original['Name']

In [4]:
def count_rating(x_train_original):
    a = x_train_original.loc[x_train_original["rating_label"] == 3]# change to <= 5 later
    b = x_train_original.loc[x_train_original["rating_label"] == 4]
    c = x_train_original.loc[x_train_original["rating_label"] == 5]
    print("rating is 3: ", len(a))
    print("rating is 4: " , len(b))
    print("rating is 5: " , len(c))
    print("total is : " , len(a)+len(b)+len(c))


In [5]:
# for every column that contains missing values, record the index of the rows where it is missing
index_dict = {}
for col in x_train_original.columns:
    missing_idx = x_train_original[pd.isnull(x_train_original[col])].index
    if not missing_idx.empty:
        index_dict[col] = missing_idx

# len(x_train_original)
index_dict

a = pd.DataFrame(x_train_original["rating_label"].loc[index_dict["Publisher"]]) 
b = pd.DataFrame(x_train_original["rating_label"].loc[index_dict["Language"]]) 

count_rating(a)
count_rating(b)

# rating
# pd.set_option('display.max_columns', None) # show all columns
# pd.set_option('display.max_rows', None) # show all rows
# pd.set_option('display.width', None) # use the full display width
# pd.set_option('display.max_colwidth', None) # show the full contents of each cell
# x_train_original["Description"].head(50)


rating is 3:  37
rating is 4:  100
rating is 5:  11
total is :  148
rating is 3:  4725
rating is 4:  11641
rating is 5:  836
total is :  17202


Convert NaN to 0 and eng to 1.

In [6]:
language_freq = x_train_original['Language'].value_counts()
language_freq

eng    5450
fre     154
spa     149
ger      59
jpn       8
per       8
mul       7
por       5
lat       4
ita       4
zho       3
grc       2
heb       2
rus       2
ara       1
swe       1
frs       1
nld       1
Name: Language, dtype: int64
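
The idea of mapping NaN to 0 and `eng` to 1 can be sketched as a binary "is English" feature. The note does not say how the other languages should be treated; this sketch assumes they also map to 0, and the toy `Language` values are invented.

```python
import numpy as np
import pandas as pd

# Toy 'Language' column, invented for illustration
lang = pd.Series(["eng", np.nan, "fre", "eng", np.nan], name="Language")

# 1 where the language is 'eng', 0 otherwise
# (assumption: NaN and every non-English language map to 0)
is_eng = (lang == "eng").astype(int)
print(is_eng.tolist())
```

Given how heavily `eng` dominates the counts above, a single binary column may capture most of what this feature has to offer, but that is worth checking on validation data.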

In [7]:


author_freq = x_train_original['Authors'].value_counts()
print(len(author_freq))
author_freq

16301


Anonymous              49
William Shakespeare    48
Carole Mortimer        47
Nora Roberts           47
Agatha Christie        46
                       ..
Arina Tanemura          1
Linda Morse             1
Richard W. Dortch       1
Carol Allain            1
Henry Rollins           1
Name: Authors, Length: 16301, dtype: int64

Check whether the data is aligned.

In [8]:
# length = len(x_train_original)
# for i in range(length):
#     print(len(x_train_original.iloc[i,:]))
#     if (len(x_train_original.iloc[i,:]) != 10):
#         # print(len((x_train_original.iloc[i,:])))
#         print("ha")
    

Ignore part of the data: rows without a label are dropped.
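
A minimal sketch of dropping unlabelled rows, assuming a missing label shows up as NaN in `rating_label`. The toy rows are invented. Resetting the index afterwards matters for later steps such as `pd.concat(..., axis=1)`, which aligns rows by index label.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "rating_label": [4.0, np.nan, 3.0],
})

# Drop rows whose label is missing, then renumber the index from 0 so that
# positional and label-based indexing agree again
df_labelled = df.dropna(subset=["rating_label"]).reset_index(drop=True)
print(len(df_labelled))
```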

Data alignment

## Count vectorizer

A count vectorizer converts documents to vectors which represent word counts. Each column in the output represents a different word and the values indicate the number of times that word appeared in the document. The overall size of a count vector matrix can be quite large (the number of columns is the total number of different words used across all documents in a corpus), but most entries in the matrix are zero (each document contains only a few of all the possible words). Therefore, it is most efficient to represent the count vectors as a sparse matrix.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# preprocess text and compute counts

vocab_name = CountVectorizer(stop_words='english').fit(train_corpus_name)

# generate counts for a new set of documents
x_train_name = vocab_name.transform(train_corpus_name)
x_test_name = vocab_name.transform(test_name)

# inspect the vocabulary (a dict mapping each word to its column index)
# print(x_test_name)
print(vocab_name.vocabulary_)
# check the shape of sparse matrix
print(x_train_name.shape)

(23063, 20766)


## doc2vec

doc2vec methods are an extension of word2vec. word2vec maps words to a high-dimensional vector space in such a way that words which appear in similar contexts will be close together in the space. doc2vec does a similar embedding for multi-word passages. The doc2vec (or Paragraph Vector) method was introduced by:

**Le & Mikolov (2014)** Distributed Representations of Sentences and Documents<br>
https://arxiv.org/pdf/1405.4053v2.pdf

The implementation of doc2vec used for this project is from gensim and documented here:<br>
https://radimrehurek.com/gensim/models/doc2vec.html

The size of the output vector is a free parameter. Most implementations use around 100-300 dimensions, but the best size depends on the problem you're trying to solve with the embeddings and the number of training samples, so you may wish to try different vector sizes. We provided doc2vec features for Name (vec_size = 100), Authors (vec_size = 20) and Description (vec_size = 100). The vectors themselves represent directions in a high-dimensional concept space; the columns do not represent specific words or phrases. Values in the vector are continuous real numbers and can be negative.

In [10]:
import gensim

# size of the output vector
vec_size = 100

# function to preprocess and tokenize text
def tokenize_corpus(txt, tokens_only=False):
    for i, line in enumerate(txt):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# tokenize a training corpus
corpus_name = list(tokenize_corpus(train_corpus_name))

# train doc2vec on the training corpus
model = gensim.models.doc2vec.Doc2Vec(vector_size=vec_size, min_count=2, epochs=40)
model.build_vocab(corpus_name)
model.train(corpus_name, total_examples=model.corpus_count, epochs=model.epochs)

# tokenize new documents
doc = list(tokenize_corpus(test_name, tokens_only=True))

# generate embeddings for the new documents
x_test_name = np.zeros((len(doc),vec_size))
for i in range(len(doc)):
    x_test_name[i,:] = model.infer_vector(doc[i])
    
# check the shape of the embedding matrix
print(x_test_name.shape)

print(x_test_name[0, :])

# x_test_name[:, :]

(23063, 100)
[ 0.12247372 -0.16716184 -0.05286216  0.14339057 -0.23605962 -0.07263322
  0.24679767 -0.01012929  0.09658359  0.02743485 -0.05624078 -0.16276358
 -0.19995739  0.36224449 -0.05442979  0.14395255  0.05806667 -0.01978705
 -0.18072282  0.04182241  0.26121348  0.26954904  0.07324885  0.12172621
 -0.11038662  0.02010806 -0.09524085 -0.03123214 -0.09523094 -0.11960624
  0.3674213   0.14375825 -0.04299714 -0.05475654 -0.12217149  0.15552863
  0.1992961   0.12926787  0.09033578  0.15766996 -0.22697225 -0.05309399
 -0.00100342 -0.1527109   0.12464591  0.14810796  0.03692804  0.09714021
  0.00869756  0.01217902  0.16457619 -0.22398432 -0.20060556 -0.12280559
  0.02971364  0.11850204  0.10876668  0.10155477  0.11579313  0.02167111
  0.13576247  0.02373512 -0.06546354 -0.06650485  0.25376812  0.19972578
 -0.10399266 -0.25785118 -0.21424733 -0.05842309 -0.21848562  0.14634396
 -0.28714877 -0.24697016  0.17711127 -0.17073706  0.00718086 -0.07971063
  0.01690894 -0.08501223  0.01828076 -

In [11]:


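# put the doc2vec embeddings into a DataFrame (one row per book)
insert = pd.DataFrame(x_test_name)
# x_train_original
# pd.set_option('display.max_columns', None) # show all columns
# pd.set_option('display.max_rows', None) # show all rows
# pd.set_option('display.width', None) # use the full display width
# pd.set_option('display.max_colwidth', None) # show the full contents of each cell
# df_concat = pd.concat([x_train_original, insert], axis=1, ignore_index=True)

# append the embedding columns to the original data (rows are aligned by index)
df_concat = pd.concat([x_train_original, insert], axis=1)

# the embedding columns are labelled 0..99; rename them to description0..description99
# (note: in this notebook the embeddings were computed from 'Name', not 'Description')
feature_name = df_concat.columns.tolist()
df_concat = df_concat.rename(
    columns={i: "description" + str(i) for i in feature_name if isinstance(i, int)}
)

# sanity check: print any column name that is still not a string
for i in df_concat.columns.tolist():
    if not isinstance(i, str):
        print(i)

df_concat.head(1)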

Unnamed: 0,Name,Authors,PublishYear,PublishMonth,PublishDay,Publisher,Language,pagesNumber,Description,rating_label,...,description90,description91,description92,description93,description94,description95,description96,description97,description98,description99
0,Best of Dr Jean: Reading & Writing,Jean R. Feldman,2005,6,1,Teaching Resources,,48,Teachers will turn to this treasury of ideas a...,4.0,...,-0.027812,0.299403,-0.053035,-0.01364,0.004124,0.038769,0.172008,-0.104059,-0.178947,0.229281


Get the label

In [12]:
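# x_test = df_concat.iloc[:,9]
# x_test
# df_concat.head()

# separate the label (selected by name rather than by position) from the features
label = df_concat['rating_label']
process_data = df_concat.drop(columns=['rating_label', 'Description'])
process_data.head()

# keep only the numeric columns: the raw text and categorical columns are dropped
try_data = process_data.drop(columns=['Name', 'Authors', 'Publisher', 'Language'])

try_data.head()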

Unnamed: 0,PublishYear,PublishMonth,PublishDay,pagesNumber,description0,description1,description2,description3,description4,description5,...,description90,description91,description92,description93,description94,description95,description96,description97,description98,description99
0,2005,6,1,48,0.122474,-0.167162,-0.052862,0.143391,-0.23606,-0.072633,...,-0.027812,0.299403,-0.053035,-0.01364,0.004124,0.038769,0.172008,-0.104059,-0.178947,0.229281
1,1991,10,1,364,-0.220939,0.122903,0.155457,-0.043671,0.10702,-0.101992,...,0.237253,-0.077616,0.143283,-0.049465,0.091827,0.07309,0.020756,-0.253021,0.110221,-0.029475
2,2005,3,31,32,-0.120815,0.004263,-0.045235,0.006535,0.08312,-0.089212,...,-0.054174,0.022808,0.109734,0.008179,0.041716,0.091771,0.131824,-0.124567,-0.007508,-0.108386
3,2004,9,1,293,0.191892,0.024904,0.184299,-0.237274,0.12401,0.143698,...,-0.024594,-0.190755,0.1908,0.285465,0.064409,0.185415,-0.049907,-0.119883,0.020514,0.089924
4,2005,7,7,352,-0.063622,0.045327,0.05089,-0.146475,0.21122,-0.23424,...,0.22862,0.040399,0.044665,0.010294,0.089408,0.076299,0.009914,-0.041171,-0.107252,0.061392


Stacking model

In [13]:
import time

from sklearn import svm
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

np.random.seed(30027)

X_train, X_test, y_train, y_test = train_test_split(try_data, label, test_size=0.33, random_state=30027)

In [14]:


# Define base estimators
estimators = [
    ('dummy', DummyRegressor()),
    ('lr', LinearRegression()),
]

# Define meta-estimator
final_estimator = LinearRegression()

# Define stacking regressor
stacking = StackingRegressor(
    estimators=estimators,
    final_estimator=final_estimator
)

# Define pipeline
pipe = make_pipeline(StandardScaler(), stacking)

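# Define hyperparameter grid
# NOTE: LinearRegression's `normalize` parameter was removed in scikit-learn 1.2;
# on newer versions, drop that entry from the grid
param_grid = {
    'stackingregressor__dummy__strategy': ['mean', 'median'],
    'stackingregressor__lr__normalize': [True, False]
}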

# Train with GridSearchCV
grid = GridSearchCV(
    pipe, param_grid=param_grid, cv=5, n_jobs=-1
)

# print(grid.best_params_)
# # Define the pipeline
# pipe = make_pipeline(StandardScaler(), stacking)

# # Define the parameter grid
# param_grid = {
#     'stackingregressor__rf__n_estimators': [5, 10, 15],
#     'stackingregressor__rf__max_depth': [None, 5, 10],
#     'stackingregressor__lr__normalize': [True, False]
# }

# # Define the GridSearchCV object
# grid = GridSearchCV(
#     pipe, param_grid=param_grid, cv=5, n_jobs=-1
# )

# Fit the model and print the best parameters
grid.fit(X_train, y_train)
print(grid.best_params_)



{'stackingregressor__dummy__strategy': 'median', 'stackingregressor__lr__normalize': False}


In [23]:
print("Best score:", grid.best_score_)
print("Test score:", grid.score(X_test, y_test))

Best score: 0.02220058738265196
Test score: 0.02704429265319197


In [15]:


class StackingClassifier():
    """Simple stacking ensemble: base classifiers feed a meta-classifier."""

    def __init__(self, classifiers, metaclassifier):
        self.classifiers = classifiers
        self.metaclassifier = metaclassifier

    def fit(self, X, y):
        for clf in self.classifiers:
            clf.fit(X, y)
        # X_meta holds the class probabilities predicted by the base classifiers.
        # Note: they are predicted on the same rows the base classifiers were fitted on.
        X_meta = self._predict_base(X)
        # the output of the base classifiers is the input for the meta-classifier
        self.metaclassifier.fit(X_meta, y)
        return self

    def _predict_base(self, X):
        # concatenate the class probabilities of all base classifiers column-wise
        yhats = []
        for clf in self.classifiers:
            yhat = clf.predict_proba(X)
            yhats.append(yhat)
        yhats = np.concatenate(yhats, axis=1)
        # check that the number of rows in yhats matches the number of rows in X
        assert yhats.shape[0] == X.shape[0]
        return yhats

    def predict(self, X):
        X_meta = self._predict_base(X)
        yhat = self.metaclassifier.predict(X_meta)
        return yhat

    def score(self, X, y):
        yhat = self.predict(X)
        return accuracy_score(y, yhat)
    


classifiers = [DummyClassifier(strategy='most_frequent'),
                LogisticRegression(),
                KNeighborsClassifier(),
                # GaussianNB(),
                ]

titles = ['Zero_R',
          'Logistic Regression',
          'KNN',
        #   'Gaussian NB',  
          ]

# grid = GridSearchCV(
#     pipe, param_grid=param_grid, cv=5, n_jobs=-1
# )


meta_classifier_lr = LogisticRegression()
stacker_lr = StackingClassifier(classifiers, meta_classifier_lr)

meta_classifier_dt = DecisionTreeClassifier()
stacker_dt = StackingClassifier(classifiers, meta_classifier_dt)
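
The hand-written class above fits the meta-classifier on predictions the base classifiers make for the very rows they were trained on, which can lead the meta-classifier to overtrust base models that overfit. One possible alternative is scikit-learn's own `sklearn.ensemble.StackingClassifier`, which builds the meta-features from cross-validated predictions. The sketch below imports it under the alias `SkStackingClassifier` to avoid clashing with the class defined above; the synthetic data from `make_classification` and the choice of base models are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier as SkStackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the book features (assumption)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

stack = SkStackingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # feed class probabilities to the meta-classifier
    cv=5,                          # meta-features come from out-of-fold predictions
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(acc)
```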


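A problem with stacking: it cannot handle continuous and non-continuous (categorical) features at the same time. My workaround is to drop those features entirely.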
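
As an alternative to dropping the categorical columns, one option is to encode them so that every model sees purely numeric input. The sketch below uses scikit-learn's `ColumnTransformer` to scale a numeric column and one-hot encode a categorical one. The column names come from the data set, but the values, the labels and the `"missing"` placeholder are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows using column names from the data set; the values are invented
df = pd.DataFrame({
    "pagesNumber": [48, 364, 32, 293, 352, 120],
    "Language": ["eng", "fre", None, "eng", "spa", "eng"],
})
y = [4, 3, 4, 3, 3, 4]

# Treat a missing language as its own category (assumption)
df["Language"] = df["Language"].fillna("missing")

pre = ColumnTransformer([
    ("num", StandardScaler(), ["pagesNumber"]),                     # continuous
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Language"]),  # categorical
])
clf = make_pipeline(pre, LogisticRegression(max_iter=1000))
clf.fit(df, y)

# 1 scaled numeric column + 4 one-hot columns (eng, fre, missing, spa)
encoded = clf.named_steps["columntransformer"].transform(df)
print(encoded.shape)
```

A high-cardinality column such as `Authors` (16301 distinct values above) would produce a very wide one-hot matrix, so it may need a different treatment.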

In [16]:
# X_train, X_test, y_train, y_test = train_test_split(try_data, label, test_size=0.33, random_state=30027)
# stacker_lr.fit(X_train, y_train)
# print('\nStacker Accuracy (Logistic Regression):', stacker_lr.score(X_test, y_test))

# stacker_dt.fit(X_train, y_train)
# print('\nStacker Accuracy (Decision Tree):', stacker_dt.score(X_test, y_test))

In [17]:
lr = LogisticRegression().fit(X_train,y_train)
print(lr.score(X_test,y_test))


0.6996452502956247


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
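
The warning above suggests either raising `max_iter` or scaling the data. In this notebook the features mix columns such as `PublishYear` and `pagesNumber` with embedding values well below 1 in magnitude, so scaling is a plausible fix. The sketch below shows the pattern on synthetic data; the data and the `max_iter` value are assumptions, not settings tuned for the book features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one column on a much larger scale (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X[:, 0] *= 1000.0

# Standardise the features before the solver sees them, and allow more iterations
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)

n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print(n_iter, scaled_lr.score(X, y))
```

Keeping the scaler inside the pipeline means it is fitted on the training data only when the pipeline is used with `train_test_split` or cross-validation.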


In [18]:
knn = KNeighborsClassifier().fit(X_train,y_train)
print(knn.score(X_test,y_test))



0.6426225200367889


In [19]:
zr = DummyClassifier(strategy='most_frequent').fit(X_train,y_train)
print(zr.score(X_test,y_test))

0.6999080278544212


You really are terrible: none of these models beats the Zero-R baseline.


In [20]:
gn = GaussianNB().fit(X_train,y_train)
print(gn.score(X_test,y_test))

0.5139929050059125


In [21]:
stacker_lr.fit(X_train, y_train)
print('\nStacker Accuracy (Logistic Regression):', stacker_lr.score(X_test, y_test))

stacker_dt.fit(X_train, y_train)
print('\nStacker Accuracy (Decision Tree):', stacker_dt.score(X_test, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Stacker Accuracy (Logistic Regression): 0.6501116804624885





Stacker Accuracy (Decision Tree): 0.5781106293522533
