# Advanced Data Science Capstone - Week 3 - Model Definition 

In this notebook we will select our classifier algorithm that it is best fitted for our problem, that is, classifying tweets between three different classes, such as positive, negative and neutral.

For that, it is well established that three algorithms have higher performance for text classification, to list: Naive Bayes, Support Vector Machines and Random Forests. Therefore, they will be tested against our constructed dataset, we will also apply evaluation for this dataset, cross-validation for hyperparameter tuning and to finish, discuss some difficulties to this specific task and intuitions and how we can proceed to improve performance.

For model definition we have chosen to do hyperparameter and some feature extraction in this phase such as Bag of Words, TF-IDF Vectors, Count Vectores, etc. Since they also are included on the cross-validation tasks for a variety of parameters

In [23]:
# importing data
import pandas as pd

data = pd.read_csv("data_preprocessed.csv", sep='\t')

# checking its dimensions
data.dropna(subset=['text_stemming'], inplace=True)
data.shape

(14132, 24)

In [25]:
data.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,...,user_timezone,positive_emoticon,negative_emoticon,text_preprocessed,neg_scores,neu_scores,pos_scores,compound_scores,text_stemming,negation
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,...,Eastern Time (US & Canada),0,0,what said .,0.0,1.0,0.0,0.0,said,False
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,...,Pacific Time (US & Canada),0,0,plus you've added commercials to the experienc...,0.0,1.0,0.0,0.0,plu ad commerci experi tacki,False
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,...,Central Time (US & Canada),0,0,i didn't today must mean i need to take anothe...,0.0,1.0,0.0,0.0,today must mean need take anoth trip exclam,True
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,...,Pacific Time (US & Canada),0,0,"it's really aggressive to blast obnoxious "" en...",0.248,0.609,0.142,-0.2716,realli aggress blast obnoxi entertain guest fa...,False
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,...,Pacific Time (US & Canada),0,0,and it's a really big bad thing about it,0.351,0.649,0.0,-0.5829,realli big bad thing,False


In [26]:
features = ['text_stemming', 'neu_scores', 'neg_scores', 'compound_scores', 'negation', 'pos_scores']
label = ['airline_sentiment']

data['negation'] = pd.get_dummies(data.negation)


X = data[[col for col in data.columns if col in features]]
y = data[label]

X.shape, y.shape

((14132, 6), (14132, 1))

In [27]:
# train_test_split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print("Shapes are {} and {}".format(X_train.shape, X_test.shape))

Shapes are (9892, 6) and (4240, 6)


## Baseline

Before continuing with our classification process, it is fundamental in any machine learning project to check our baseline, that is, what is the majority class of our dataset. In this way, we must perform above this threshold, and this is called baseline. 

In [28]:
pd.Series(y_train.values.ravel()).value_counts(normalize=True)

negative    0.641832
neutral     0.209159
positive    0.149009
dtype: float64

In [29]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

We will test our classifier with two types of text processors, tfidf and countVectorizer.

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

tfidf = Pipeline([
                ('selector', TextSelector(key='text_stemming')),
                ('tfidf', TfidfVectorizer())
            ])

countvect = Pipeline([
                ('selector', TextSelector(key='text_stemming')),
                ('countvect', CountVectorizer())
            ])

In [31]:
from sklearn.preprocessing import MinMaxScaler

negation =  Pipeline([
                ('selector', NumberSelector(key='negation')),
                ('minmax', MinMaxScaler())
            ])
neu_scores =  Pipeline([
                ('selector', NumberSelector(key='neu_scores')),
                ('minmax', MinMaxScaler())
            ])
neg_scores =  Pipeline([
                ('selector', NumberSelector(key='neg_scores')),
                ('minmax', MinMaxScaler())
            ])
pos_scores =  Pipeline([
                ('selector', NumberSelector(key='pos_scores')),
                ('minmax', MinMaxScaler())
            ])

compound_scores =  Pipeline([
                ('selector', NumberSelector(key='compound_scores')),
                ('minmax', MinMaxScaler())
            ])

In [32]:
from sklearn.pipeline import FeatureUnion

# defining different sets of text processors

def features_union(textProcessor):
    return FeatureUnion([('text', textProcessor), 
                      ('negation', negation),
                      ('neu_scores', neu_scores),
                      ('neg_scores', neg_scores),
                      ('pos_scores', pos_scores),
                      ('compound_scores', compound_scores)])


In [33]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(y_train.values.ravel())

y_train = le.transform(y_train.values.ravel())
y_test = le.transform(y_test.values.ravel())

X_train.shape, X_test.shape

((9892, 6), (4240, 6))

In [34]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

features_count = features_union(countvect)
nb_pipeline = Pipeline([('features', features_count),
                       ('nb', clf)])

nb_pipeline.fit(X_train, y_train)
nb_pipeline.score(X_test, y_test)

0.77900943396226419

In [35]:
from sklearn.svm import LinearSVC

svm = LinearSVC()

features_tfidf = features_union(tfidf)
svm_pipeline = Pipeline([('features', features_tfidf),
                       ('svm', svm)])



svm_pipeline.fit(X_train, y_train)
svm_pipeline.score(X_test, y_test)

0.79481132075471694

In [36]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators = 100)
rf_pipeline = Pipeline([('features', features_count),
                       ('rf', rf)])

rf_pipeline.fit(X_train, y_train)
rf_pipeline.score(X_test, y_test)

0.79268867924528297

## Cross-Validation

### Naive Bayes

In [40]:
# cross-validation for Naive Bayes Classifier
from sklearn.model_selection import GridSearchCV

features_tfidf = features_union(tfidf)

nb_pipeline = Pipeline([('feats', features_tfidf),  ('clf', MultinomialNB())])
parameters = {
    'feats__text__tfidf__max_df': (0.5, 0.75, 1.0),
    'feats__text__tfidf__ngram_range': ((1, 1), (1, 2), (2, 2)), 
    'feats__text__tfidf__use_idf': (False, True),
    'feats__text__tfidf__binary':(False, True),
    'feats__text__tfidf__binary':('l1', 'l2', None),
    'clf__alpha': (1.0, 5.0, 10.0),
    'clf__fit_prior': (True, False),                               
}

nb_gs = GridSearchCV(nb_pipeline, parameters, cv=3)
nb_gs.fit(X_train, y_train)
nb_gs.score(X_test, y_test)

0.76438679245283014

In [None]:
import warnings

warnings.filterwarnings("ignore")

print('Best params: {}'.format(nb_gs.best_params_))
nb_cv_results = pd.DataFrame(nb_gs.cv_results_)
nb_cv_results.head(3)

### Support Vector Machines

In [37]:
# cross-validation for SVM Classifier
from sklearn.model_selection import GridSearchCV

features_count = features_union(tfidf)
svm_count_pipeline = Pipeline([('feats', features_count),  ('clf', LinearSVC())])
parameters = {
    'feats__text__tfidf__max_df': (0.5, 0.75, 1.0),
    'feats__text__tfidf__ngram_range': ((1, 1), (1, 2), (2, 2)), 
    'feats__text__tfidf__use_idf': (False, True),
    'clf__loss': ('hinge', 'squared_hinge'),
    'clf__C': (0.1, 0.5, 0.6, 1, 4, 5, 10, 100),
    'clf__class_weight': (None, 'balanced')                                    
}

svm_gs = GridSearchCV(svm_count_pipeline, parameters, cv=3)
svm_gs.fit(X_train, y_train)
svm_gs.score(X_test, y_test)

0.81014150943396224

In [21]:
import warnings

warnings.filterwarnings("ignore")

print('Best params: {}'.format(svm_gs.best_params_))
svm_cv_results = pd.DataFrame(svm_gs.cv_results_)
svm_cv_results.head(3)

Best params: {'clf__C': 0.5, 'clf__class_weight': 'balanced', 'clf__loss': 'squared_hinge', 'feats__text__tfidf__max_df': 0.5, 'feats__text__tfidf__ngram_range': (1, 2), 'feats__text__tfidf__use_idf': True}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__C,param_clf__class_weight,param_clf__loss,param_feats__text__tfidf__max_df,param_feats__text__tfidf__ngram_range,param_feats__text__tfidf__use_idf,...,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.143589,0.004203,0.050598,0.005267,0.1,,hinge,0.5,"(1, 1)",False,...,0.737034,0.744842,0.743833,0.005189,451,0.753223,0.758302,0.758945,0.756823,0.002559
1,0.135401,0.007359,0.058834,0.013769,0.1,,hinge,0.5,"(1, 1)",True,...,0.733091,0.739078,0.739891,0.005911,475,0.751251,0.754966,0.755761,0.753993,0.001965
2,0.363443,0.032684,0.062494,1.5e-05,0.1,,hinge,0.5,"(1, 2)",False,...,0.732484,0.742415,0.739486,0.004973,484,0.748369,0.752843,0.751668,0.75096,0.001894


### Random Forest

In [None]:
# cross-validation for RF Classifier
features_count = features_union(tfidf)
rf_pipeline_cv = Pipeline([('feats', features_count),  ('rf', RandomForestClassifier())])
parameters = {
    'feats__text__tfidf__max_df': (0.5, 1.0),
    'feats__text__tfidf__ngram_range': ((1, 1), (1, 2)),
    'rf__max_features': (0.5, 0.7, 1.0),
    'rf__bootstrap': (False ,True),
    'rf__class_weight': ('balanced', 'balanced_subsample', None),
    'rf__n_estimators':(10, 50, 80)
}

rf_gs = GridSearchCV(rf_pipeline_cv, parameters, cv=3)
rf_gs.fit(X_train, y_train)
rf_gs.score(X_test, y_test)

In [7]:
from models import SentimentRNN

# Instantiate the model w/ hyperparams
vocab_size = 100
output_size = 3
embedding_dim = 150
hidden_dim = 128
n_layers = 3

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(100, 150)
  (lstm): LSTM(150, 128, num_layers=3, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=128, out_features=3, bias=True)
)
