# Text Classification II (week 6)

This lab is based on material from the article "A Comprehensive Guide to Understand and Implement Text Classification in Python": https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

Load the libraries for dataset preparation, feature engineering, and model training.

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

# install xgboost and textblob: $>pip install xgboost; $>pip install textblob
import pandas, xgboost, numpy, textblob, string  

# load functions from textpreprocess.py
from textpreprocess import denoise_text, normalize, replace_contractions, remove_non_ascii, to_lowercase, remove_punctuation, replace_numbers, remove_stopwords
import nltk

### 1. Dataset preparation
We use a dataset of Amazon reviews, which can be downloaded from https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235. The dataset consists of 10,000 text reviews and their labels. To prepare it, load the downloaded data into a pandas dataframe with two columns: text and label.

<b>We are taking only 1,000 reviews for each category.</b>

In [2]:
# load the dataset, but load only 1000 reviews for each category.
data = open('data/corpus', encoding="utf-8").read()
labels, texts = [], [] # lists that collect the labels and the tokenized texts
label_1_count = 0
label_2_count = 0
for i, line in enumerate(data.split("\n")):
    # load 1000 reviews for each category
    if label_1_count < 1000 or label_2_count < 1000:
        line = replace_contractions(line) # Replace contractions in string of text
        content = nltk.word_tokenize(line)   
        words = content[1:] # first token is the label
        words = remove_non_ascii(words)
        #words = to_lowercase(words)
        words = remove_punctuation(words)
        #words = replace_numbers(words)
        #words = remove_stopwords(words)
        
        if content[0] == '__label__1' and label_1_count < 1000:
            label_1_count += 1 # add count to label 1
            labels.append(content[0])
            texts.append(words)
        if content[0] == '__label__2' and label_2_count < 1000:
            label_2_count += 1 # add count to label 2
            labels.append(content[0])
            texts.append(words)
        
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
texts1=[' '.join(line) for line in texts] # join words in each line with space character
trainDF['text'] = texts1
trainDF['label'] = labels
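
The loading loop above relies on each line starting with a `__label__N` token followed by the review text. As a minimal sketch of that idea, without NLTK or the `textpreprocess` helpers, the hypothetical function `load_capped` below splits each line once on whitespace and keeps at most `cap` reviews per label. The sample lines are made up for illustration.

```python
from collections import Counter

def load_capped(lines, cap):
    """Parse '__label__N text' lines, keeping at most `cap` rows per label."""
    counts = Counter()
    labels, texts = [], []
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2 or not parts[0].startswith("__label__"):
            continue  # skip blank or malformed lines
        label, text = parts
        if counts[label] >= cap:
            continue  # this class already has enough reviews
        counts[label] += 1
        labels.append(label)
        texts.append(text)
    return labels, texts

# Made-up sample lines in the same format as data/corpus
sample = [
    "__label__2 Great sound track",
    "__label__1 Broke after a week",
    "__label__2 Loved it",
    "",
    "__label__2 Third positive review is dropped when cap=2",
]
labels, texts = load_capped(sample, cap=2)
print(labels)
print(texts)
```

Checking the cap before appending means a class stops growing as soon as it is full, regardless of how the other class is doing.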

In [3]:
trainDF.shape #2000 rows, 2 columns

(2000, 2)

In [4]:
trainDF['label'].value_counts(sort=True)

__label__2    1000
__label__1    1000
Name: label, dtype: int64

Next, we will split the dataset into training and validation sets so that we can train and test the classifier. We will also encode the target column as integers so that it can be used in machine learning models.

In [5]:
# split the dataset into training and validation datasets: 75% for training, 25% (default) for testing
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], 
                                           trainDF['label'],random_state=2, stratify=trainDF['label'])

print("Train_y: ")
print(train_y.value_counts())
print()
print("Valid_y: ")
print(valid_y.value_counts())
# label encode the target variable 
# => holds the labels in an array
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

Train_y: 
__label__2    750
__label__1    750
Name: label, dtype: int64

Valid_y: 
__label__2    250
__label__1    250
Name: label, dtype: int64
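
The `stratify=trainDF['label']` argument is what keeps the split balanced at 750/250 per class. A rough plain-Python sketch of the idea (not scikit-learn's implementation; `stratified_split` is a hypothetical helper) shuffles the indices of each class separately and takes the same fraction from each:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=2):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["__label__1"] * 1000 + ["__label__2"] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                      # 1500 500
print(sum(labels[i] == "__label__1" for i in test_idx))   # 250
```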


In [6]:
list(encoder.classes_)

['__label__1', '__label__2']

In [7]:
encoder.transform(['__label__1', '__label__2']) 

array([0, 1])

In [8]:
list(encoder.inverse_transform([0, 1, 0, 1]))

['__label__1', '__label__2', '__label__1', '__label__2']
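
What the encoder does can be sketched in a few lines of plain Python: collect the distinct labels, sort them, and use each label's position as its integer code. The helper name `fit_label_encoder` is hypothetical and only mirrors the behaviour shown in the cells above.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return classes, to_index

classes, to_index = fit_label_encoder(["__label__2", "__label__1", "__label__2"])
encoded = [to_index[label] for label in ["__label__1", "__label__2"]]
decoded = [classes[i] for i in [0, 1, 0, 1]]
print(classes)   # ['__label__1', '__label__2']
print(encoded)   # [0, 1]
print(decoded)   # ['__label__1', '__label__2', '__label__1', '__label__2']
```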

### 2. Feature Engineering
The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. We will implement the following different ideas in order to obtain relevant features from our dataset.

#### 2.1 Count Vectors as features
[Count Vector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a matrix representation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document.

In [9]:
# create a count vectorizer object: 
# analyzer: whether the feature should be made of word or character n-grams.
# token_pattern: regular expression denoting what constitutes a “token”
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

# fit and transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.fit_transform(train_x)
xvalid_count =  count_vect.transform(valid_x) # just transform validation data
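
To make the row/column/cell description concrete, here is a small plain-Python sketch of a count matrix. It assumes lowercased `\w+` tokens and a vocabulary sorted alphabetically; the function names and toy documents are illustrative only. As in the cell above, the vocabulary is learned from the training documents, so words that appear only in the validation data get no column.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")  # matches the same tokens as the pattern \w{1,}

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def count_vectors(docs, vocabulary):
    """One row per document, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["good good book", "bad book"]   # toy documents
valid_docs = ["good movie"]                    # 'movie' is not in the vocabulary

vocab = fit_vocabulary(train_docs)
print(vocab)                              # {'bad': 0, 'book': 1, 'good': 2}
print(count_vectors(train_docs, vocab))   # [[0, 1, 2], [1, 1, 0]]
print(count_vectors(valid_docs, vocab))   # [[0, 0, 1]]
```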

#### 2.2 TF-IDF Vectors as features

[TF-IDF Vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) can be generated at different levels of input tokens (words, characters, n-grams):

a. Word Level TF-IDF : Matrix representing tf-idf scores of every term in different documents

b. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. Matrix representing tf-idf scores of N-grams

c. Character Level TF-IDF : Matrix representing tf-idf scores of character level n-grams in the corpus
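
The difference between word n-grams and character n-grams is easiest to see on a tiny input. The two helpers below are a simplified sketch (hypothetical names, no whitespace normalisation), using the range 2 to 3 as in `ngram_range=(2,3)`:

```python
def word_ngrams(tokens, n_min, n_max):
    """All runs of n_min..n_max consecutive tokens, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def char_ngrams(text, n_min, n_max):
    """All substrings of length n_min..n_max, spaces included."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

print(word_ngrams(["not", "a", "good", "book"], 2, 3))
# ['not a', 'a good', 'good book', 'not a good', 'a good book']
print(char_ngrams("good", 2, 3))
# ['go', 'oo', 'od', 'goo', 'ood']
```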

In [10]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
xtrain_tfidf =  tfidf_vect.fit_transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
xtrain_tfidf_ngram =  tfidf_vect_ngram.fit_transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)

# characters level tf-idf
# token_pattern is a regular expression denoting what constitutes a "token"; it is only used if analyzer == 'word', so it has no effect here.
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.fit_transform(train_x) 
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x) 
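
For reference, the score behind these matrices can be worked out by hand. The sketch below assumes the weighting that `TfidfVectorizer` documents for its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows. The function `tfidf_rows` and the two toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # document frequency: number of documents containing each term
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = [["good", "book"], ["bad", "book"]]   # toy, already tokenised
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first row, "good" appears in only one document and therefore gets a higher weight than "book", which appears in both.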

### 3. Model Building
The next step in the text classification framework is to train a classifier using the features created in the previous step. There are many machine learning models that can be used to train a final model. We will train and compare several of them: Naive Bayes, logistic regression, SVM, random forest, and gradient boosting.



The following utility function trains a model. It accepts the classifier, the feature vectors of the training data, the labels of the training data, and the feature vectors of the validation data as inputs. Using these inputs, the model is trained and its accuracy on the validation set is computed against the global `valid_y`.

In [11]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    # accuracy_score expects (y_true, y_pred); valid_y is read from the notebook's global scope
    return metrics.accuracy_score(valid_y, predictions)

#### 3.1 Naive Bayes Classifier
Implementing a naive bayes model using sklearn implementation with different features

[Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

In [12]:
# Naive Bayes on Count Vectors
# train_model(classifier, training feature vectors (count vectorizer), training labels,
#             validation feature vectors (count vectorizer))
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print ("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print ("NB, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print ("NB, CharLevel Vectors: ", accuracy)

NB, Count Vectors:  0.82
NB, WordLevel TF-IDF:  0.822
NB, N-Gram Vectors:  0.828
NB, CharLevel Vectors:  0.794
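
As a rough illustration of the independence assumption, the sketch below implements a tiny multinomial Naive Bayes by hand: each token contributes its own log-probability, and the contributions are simply added. It uses Laplace smoothing with `alpha=1.0`; the function names and the toy data are assumptions made for this example, not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior, log_like = {}, {}
    for label in class_docs:
        log_prior[label] = math.log(class_docs[label] / len(docs))
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_like[label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest summed log probability."""
    scores = {}
    for label in log_prior:
        # tokens outside the training vocabulary are ignored
        scores[label] = log_prior[label] + sum(
            log_like[label][tok] for tok in doc if tok in log_like[label]
        )
    return max(scores, key=scores.get)

docs = [["great", "book"], ["loved", "it"], ["bad", "book"], ["awful", "waste"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(["great", "great", "book"], *model))   # 1
print(predict_nb(["awful", "book"], *model))            # 0
```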


#### 3.2 Logistic Regression

Implementing a Linear Classifier (Logistic Regression)

[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. 

In [13]:
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print ("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print ("LR, N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print ("LR, CharLevel Vectors: ", accuracy)

LR, Count Vectors:  0.838
LR, WordLevel TF-IDF:  0.866
LR, N-Gram Vectors:  0.824
LR, CharLevel Vectors:  0.842
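
The logistic/sigmoid function mentioned above turns a linear score into a probability between 0 and 1, which is then thresholded to obtain a class. A minimal sketch, with weights chosen by hand purely for illustration (a fitted model would learn them from the training data):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for three features
weights = [1.5, -2.0, 0.0]
bias = 0.0

p = predict_proba([1.0, 0.0, 3.0], weights, bias)
print(round(p, 3), int(p >= 0.5))   # 0.818 1
```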




#### 3.3 SVM Classifier

[Support Vector Machine (SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) is a supervised machine learning algorithm that can be used for both classification and regression. The model finds the hyperplane (a line in two dimensions) that best separates the two classes.

The scikit-learn implementation of SVC is based on libsvm. The fit time scales at least quadratically with the number of samples, which makes it hard to scale to datasets with more than a few tens of thousands of samples.
Multiclass support is handled according to a one-vs-one scheme.

In [14]:
# SVM Classifier on Count Vectors
accuracy = train_model(svm.SVC(kernel='linear'), xtrain_count, train_y, xvalid_count)
print ("SVM, Count Vectors: ", accuracy)

# SVM Classifier on Word Level TF IDF Vectors
accuracy = train_model(svm.SVC(kernel='linear'), xtrain_tfidf, train_y, xvalid_tfidf)
print ("SVM, WordLevel TF-IDF: ", accuracy)

# SVM Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print ("SVM, N-Gram Vectors: ", accuracy)

# SVM Classifier on Character Level TF IDF Vectors
accuracy = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print ("SVM, CharLevel Vectors: ", accuracy)

SVM, Count Vectors:  0.814
SVM, WordLevel TF-IDF:  0.866
SVM, N-Gram Vectors:  0.802
SVM, CharLevel Vectors:  0.836


#### 3.4 Random Forest Classifier

Implementing a [Random Forest Model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Random Forest models are a type of ensemble model, specifically a bagging model. They belong to the tree-based model family.

In [15]:
# RF on Count Vectors => n_estimators = number of trees in forest
accuracy = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_count, train_y, xvalid_count)
print ("RF, Count Vectors: ", accuracy)

# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf, train_y, xvalid_tfidf)
print ("RF, WordLevel TF-IDF: ", accuracy)

# RF Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print ("RF, N-Gram Vectors: ", accuracy)

# RF Classifier on Character Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print ("RF, CharLevel Vectors: ", accuracy)

RF, Count Vectors:  0.808
RF, WordLevel TF-IDF:  0.822
RF, N-Gram Vectors:  0.736
RF, CharLevel Vectors:  0.778


#### 3.5 Boosting Model

Implementing an [Extreme Gradient Boosting Model](https://xgboost.readthedocs.io/en/latest/index.html).

Boosting models are another type of ensemble model, here built from decision trees. Boosting is an ensemble meta-algorithm used primarily to reduce bias, and also variance, in supervised learning; the term covers a family of algorithms that convert weak learners into strong ones. A weak learner is a classifier that is only slightly correlated with the true classification (it labels examples only somewhat better than random guessing).

In [16]:
# Extreme Gradient Boosting on Count Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_count, train_y, xvalid_count)
print ("Xgb, Count Vectors: ", accuracy)

# Extreme Gradient Boosting on Word Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Xgb, WordLevel TF-IDF: ", accuracy)

# Extreme Gradient Boosting on Ngram Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print ("Xgb, N-Gram Vectors: ", accuracy)

# Extreme Gradient Boosting on Character Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print ("Xgb, CharLevel Vectors: ", accuracy)

Xgb, Count Vectors:  0.808
Xgb, WordLevel TF-IDF:  0.814
Xgb, N-Gram Vectors:  0.712
Xgb, CharLevel Vectors:  0.77


### 4. Grid Search with SVM - improve performance through grid search of parameters

[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) performs exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
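
"Exhaustive search" means that every combination of the listed values is tried, and each combination is evaluated once per cross-validation fold. The sketch below enumerates a grid with `itertools.product`, using the two SVM parameters searched in this section; `expand_grid` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

def expand_grid(grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

parameters = {
    "clf__C": (0.01, 1, 10),
    "clf__kernel": ("rbf", "linear"),
}
candidates = list(expand_grid(parameters))
cv = 10

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates x", cv, "folds =", len(candidates) * cv, "fits")
```

Three values of `C` times two kernels give 6 candidates, and with 10 folds that is 60 fits, so the cost grows multiplicatively with every parameter added to the grid.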

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)),
    ('clf', svm.SVC())
])
parameters = {
    #'vect__max_df': (0.1, 0.5, 1.0),
    #'vect__stop_words': ('english', None),
    #'vect__lowercase': (True, False),
    #'vect__binary': (True, False),
    #'vect__max_features': (5000, 7500),
    #'vect__ngram_range': ((1, 1), (1, 2)),
    #'vect__use_idf': (True, False),
    #'vect__norm': ('l1', 'l2'),
    'clf__C': (0.01, 1, 10),
    #'clf__gamma': (0.5, 1, 2, 3, 4),
    'clf__kernel': ('rbf', 'linear')
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_x, train_y)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # By default (refit=True), GridSearchCV refits an estimator with the best
    # parameters found on the whole training set. The refitted estimator is
    # available as the best_estimator_ attribute, which permits calling
    # predict directly on this GridSearchCV instance.
    predictions = grid_search.predict(valid_x)
    print('Accuracy:', accuracy_score(valid_y, predictions))
    print('Precision:', precision_score(valid_y, predictions))
    print('Recall:', recall_score(valid_y, predictions))
    print('F1_score:', f1_score(valid_y, predictions))

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    9.2s finished


Best score: 0.803
Best parameters set:
	clf__C: 1
	clf__kernel: 'linear'
Accuracy: 0.866
Precision: 0.8765432098765432
Recall: 0.852
F1_score: 0.8640973630831643
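
The four scores above all derive from the same confusion-matrix counts. The counts in the sketch below are not printed by the notebook; they are inferred from the reported scores, assuming 250 validation examples per class and encoded label 1 (`__label__2`) as the positive class.

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred counts (assumption, see the paragraph above)
accuracy, precision, recall, f1 = classification_scores(tp=213, fp=30, fn=37, tn=220)
print(accuracy, precision, recall, f1)
```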


Another way to run a grid search, without a pipeline:

In [18]:
# Set the parameters by cross-validation
tuned_parameters = [{'C': [0.01, 0.1, 1, 10, 100],
                     'kernel': ['rbf', 'linear']}]
clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=10, scoring='accuracy')
clf.fit(xtrain_tfidf, train_y)
#clf.grid_scores_
#clf.cv_results_
print(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['params'])

[0.766      0.766      0.766      0.77333333 0.766      0.80666667
 0.766      0.792      0.766      0.792     ]
[{'C': 0.01, 'kernel': 'rbf'}, {'C': 0.01, 'kernel': 'linear'}, {'C': 0.1, 'kernel': 'rbf'}, {'C': 0.1, 'kernel': 'linear'}, {'C': 1, 'kernel': 'rbf'}, {'C': 1, 'kernel': 'linear'}, {'C': 10, 'kernel': 'rbf'}, {'C': 10, 'kernel': 'linear'}, {'C': 100, 'kernel': 'rbf'}, {'C': 100, 'kernel': 'linear'}]


In [19]:
clf.best_params_

{'C': 1, 'kernel': 'linear'}

In [20]:
# about 13% of validation observations are misclassified
clf.best_estimator_.score(xvalid_tfidf, valid_y)

0.866

### 5. Combining text with other features using pipeline: Text / NLP based features

With pipelines, you don't need to apply each transformation to the test dataset separately from the training features; this is taken care of automatically.

Refer to [A Deep Dive Into Sklearn Pipelines](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines/data) and [Work like a Pro with Pipelines and Feature Unions](https://www.kaggle.com/metadist/work-like-a-pro-with-pipelines-and-feature-unions)

In [21]:
trainDF['char_count'] = trainDF['text'].apply(len)
#lambda creates the inline function => x: len(x.split())
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split())) 
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))
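
To check what each of these columns measures, the same six features can be computed for a single string. This is a sketch with a made-up review; `text_features` is a hypothetical helper that mirrors the expressions above.

```python
import string

def text_features(text):
    """Compute the six surface features used above for one string."""
    words = text.split()
    char_count = len(text)
    word_count = len(words)
    return {
        "char_count": char_count,
        "word_count": word_count,
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "title_word_count": sum(w.istitle() for w in words),
        "upper_case_word_count": sum(w.isupper() for w in words),
    }

features = text_features("This CD is GREAT, buy It now!")
for name, value in features.items():
    print(name, value)
```

Here "This" and "It" count as title-case words, while "CD" and "GREAT," count as upper-case words; `str.istitle()` is false for fully upper-case words longer than one letter.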

In [22]:
trainDF.head()

| | text | label | char_count | word_count | word_density | punctuation_count | title_word_count | upper_case_word_count |
|---|---|---|---|---|---|---|---|---|
| 0 | Stuning even for the nongamer This sound track... | `__label__2` | 416 | 80 | 5.135802 | 1 | 10 | 3 |
| 1 | The best soundtrack ever to anything I am read... | `__label__2` | 505 | 101 | 4.95098 | 0 | 11 | 6 |
| 2 | Amazing This soundtrack is my favorite music o... | `__label__2` | 732 | 134 | 5.422222 | 0 | 25 | 4 |
| 3 | Excellent Soundtrack I truly like this soundtr... | `__label__2` | 711 | 118 | 5.97479 | 0 | 51 | 4 |
| 4 | Remember Pull Your Jaw Off The Floor After Hea... | `__label__2` | 465 | 90 | 5.10989 | 0 | 31 | 0 |


In [23]:
ftrain_x, fvalid_x, train_y, valid_y = model_selection.train_test_split(trainDF[['text', 'char_count', 'word_count', 'word_density', 'punctuation_count', 'title_word_count', 'upper_case_word_count']], trainDF['label'], 
                                                                      random_state=2, stratify=trainDF['label'])
# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

In [24]:
ftrain_x.head()

| | text | char_count | word_count | word_density | punctuation_count | title_word_count | upper_case_word_count |
|---|---|---|---|---|---|---|---|
| 1468 | tobacco pouch The material and workmanship is ... | 226 | 38 | 5.794872 | 0 | 3 | 1 |
| 459 | Intelligently written a fast and suspenseful r... | 440 | 81 | 5.365854 | 0 | 12 | 2 |
| 1047 | the best album by a femalelead group I love th... | 253 | 53 | 4.685185 | 0 | 4 | 2 |
| 802 | too had to read I find the old english wording... | 163 | 36 | 4.405405 | 0 | 1 | 1 |
| 1542 | Bad run the FIRST time I wore them Having read... | 321 | 59 | 5.35 | 0 | 8 | 5 |


To select columns inside the pipeline, we will create transformer classes of our own. We just have to inherit from some base classes and override the few methods that we are actually going to use.

First we create a selector transformer that simply returns one column of the dataset, chosen by the key we pass. We define two selectors, one for text columns and one for numeric columns. The return type differs (a one-dimensional Series for text, a one-column DataFrame for numbers), but otherwise they work the same.

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None): # fit() doesn't do anything
        return self

    def transform(self, X):   # all the work is done here
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]


This pipeline consists of two steps: first select just the text column from the dataset, then compute tf-idf on that column and return the result.
To make a pipeline, pass a list of tuples of the form (name, object). The first part is the name of the step, and the second is the actual transformer object. So this pipeline consists of "selecting" and then "tfidf-ing" a column.
To execute it, use it like any other transformer. You can call text.fit() to fit it to the training data, text.transform() to apply it to data, or text.fit_transform() to do both.
Since the input is text, it returns a sparse matrix, but we can see that it works:

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text = Pipeline([
                ('selector', TextSelector(key='text')),
                ('tfidf', TfidfVectorizer(stop_words='english'))
            ])

text.fit_transform(ftrain_x)  # create 1500x11468 matrix

<1500x11468 sparse matrix of type '<class 'numpy.float64'>'
	with 46047 stored elements in Compressed Sparse Row format>

In [27]:
text.steps #  look up all the steps of the pipeline

[('selector', TextSelector(key='text')),
 ('tfidf',
  TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words='english', strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None))]

In [28]:
from sklearn.preprocessing import StandardScaler

char_count =  Pipeline([
                ('selector', NumberSelector(key='char_count')),
                ('standard', StandardScaler())
            ])

char_count.fit_transform(ftrain_x)

array([[-0.88800316],
       [ 0.05642127],
       [-0.7688468 ],
       ...,
       [-0.33635338],
       [-1.04687829],
       [-0.48640212]])
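
`StandardScaler` centres a column on its mean and divides by its standard deviation. A plain-Python sketch of that computation, assuming the population standard deviation (dividing by n) and using only the first five `char_count` values shown for `ftrain_x`, so the numbers differ from the output above, which is based on all 1,500 training rows:

```python
import math

def standardize(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero for constant columns
    return [(v - mean) / std for v in values]

char_counts = [226, 440, 253, 163, 321]  # first five char_count values of ftrain_x
scaled = standardize(char_counts)
print([round(v, 3) for v in scaled])
```

After scaling, the values have mean 0 and unit variance, which keeps a large-valued column such as `char_count` from dominating the tf-idf features.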

In [29]:
word_count =  Pipeline([
                ('selector', NumberSelector(key='word_count')),
                ('standard', StandardScaler())
            ])
word_density =  Pipeline([
                ('selector', NumberSelector(key='word_density')),
                ('standard', StandardScaler())
            ])
punctuation_count =  Pipeline([
                ('selector', NumberSelector(key='punctuation_count')),
                ('standard', StandardScaler())
            ])
title_word_count =  Pipeline([
                ('selector', NumberSelector(key='title_word_count')),
                ('standard', StandardScaler()),
            ])
upper_case_word_count =  Pipeline([
                ('selector', NumberSelector(key='upper_case_word_count')),
                ('standard', StandardScaler()),
            ])

To make a pipeline from all of our pipelines, we do the same thing, but now we use a FeatureUnion to join the feature processing pipelines. The syntax is the same as for a regular pipeline: a list of tuples in the (name, object) format.

The feature union itself is not a pipeline, it's just a union, so you need to do one more step to make it useable: pass it to a pipeline, with the same structure, an array of tuples, with the simple (name, object) format. 

In [30]:
from sklearn.pipeline import FeatureUnion

feats = FeatureUnion([('text', text), 
                      ('char_count', char_count),
                      ('word_count', word_count),
                      ('word_density', word_density),
                      ('punctuation_count', punctuation_count),
                      ('title_word_count', title_word_count),
                      ('upper_case_word_count', upper_case_word_count)])

feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(ftrain_x)  # create 1500x11474 matrix

<1500x11474 sparse matrix of type '<class 'numpy.float64'>'
	with 55047 stored elements in Compressed Sparse Row format>
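
Conceptually, a FeatureUnion places the outputs of its transformers side by side: the 11468 tf-idf columns plus the six scaled numeric columns give the 11474 columns reported above. The sketch below shows the same column-wise concatenation on toy dense lists; `union_columns` is a hypothetical helper and the values are made up.

```python
def union_columns(*blocks):
    """Concatenate feature blocks row by row (all blocks share the row count)."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("all feature blocks need the same number of rows")
    return [sum((block[i] for block in blocks), []) for i in range(n_rows)]

# Toy blocks for two documents: 3 tf-idf columns plus two 1-column numeric features
tfidf_block = [[0.0, 0.6, 0.8], [0.7, 0.7, 0.0]]
char_block = [[-0.9], [0.1]]
word_block = [[-0.8], [0.2]]

combined = union_columns(tfidf_block, char_block, word_block)
print(combined)
print(len(combined[0]))  # 3 + 1 + 1 = 5 columns
```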

In [31]:
import numpy as np

pipeline = Pipeline([
    ('features',feats),
    ('classifier', svm.SVC())
])

pipeline.fit(ftrain_x, train_y)

preds = pipeline.predict(fvalid_x)
np.mean(preds == valid_y)

0.5

To see the list of all the possible things you could fine tune, call get_params().keys() on your pipeline.

In [32]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'features', 'classifier', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__verbose', 'features__text', 'features__char_count', 'features__word_count', 'features__word_density', 'features__punctuation_count', 'features__title_word_count', 'features__upper_case_word_count', 'features__text__memory', 'features__text__steps', 'features__text__verbose', 'features__text__selector', 'features__text__tfidf', 'features__text__selector__key', 'features__text__tfidf__analyzer', 'features__text__tfidf__binary', 'features__text__tfidf__decode_error', 'features__text__tfidf__dtype', 'features__text__tfidf__encoding', 'features__text__tfidf__input', 'features__text__tfidf__lowercase', 'features__text__tfidf__max_df', 'features__text__tfidf__max_features', 'features__text__tfidf__min_df', 'features__text__tfidf__ngram_range', 'features__text__tfidf__norm', 'features__text__tfidf__preprocessor', 'features__text__tfidf

In [33]:
from sklearn.model_selection import GridSearchCV

hyperparameters = { 'features__text__tfidf__max_df': [0.9, 0.95],
                    'features__text__tfidf__ngram_range': [(1,1), (1,2)],
                    'classifier__C': (0.01, 1, 10),
                    #'classifier__gamma': (0.5, 1, 2, 3, 4),
                    'classifier__kernel': ('rbf', 'linear')
                  }
clf = GridSearchCV(pipeline, hyperparameters, cv=5)
 
# Fit and tune model
clf.fit(ftrain_x, train_y)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('selector',
                                                                                         TextSelector(key='text')),
                                                                                        ('tfidf',
                                                                                         TfidfVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                       

In [34]:
clf.best_params_

{'classifier__C': 10,
 'classifier__kernel': 'linear',
 'features__text__tfidf__max_df': 0.9,
 'features__text__tfidf__ngram_range': (1, 2)}
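
The double underscores in these keys spell out a path through the nested steps: `features__text__tfidf__ngram_range` means "step `features`, then `text`, then `tfidf`, parameter `ngram_range`". The sketch below imitates that lookup on a plain nested dict standing in for the pipeline; real estimators are objects rather than dicts, and `set_nested` is a hypothetical helper.

```python
def set_nested(config, name, value):
    """Follow the '__'-separated path in `name` and set the final key."""
    *path, leaf = name.split("__")
    node = config
    for step in path:
        node = node[step]
    node[leaf] = value

# Plain-dict stand-in for the pipeline, with illustrative starting values
pipeline_config = {
    "features": {"text": {"tfidf": {"max_df": 1.0, "ngram_range": (1, 1)}}},
    "classifier": {"C": 1.0, "kernel": "rbf"},
}

best_params = {
    "classifier__C": 10,
    "classifier__kernel": "linear",
    "features__text__tfidf__max_df": 0.9,
    "features__text__tfidf__ngram_range": (1, 2),
}
for name, value in best_params.items():
    set_nested(pipeline_config, name, value)
print(pipeline_config)
```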