## DataSchool - Machine Learning with Text and Python:  

### Week 6 Self Imposed Homework

I am going to use the Kaggle competition that inspired me to sign up for this course.  Using what I have learned in the Machine Learning with Text course, I will attempt to beat the scores on the leader board.

The Kaggle competition is:

![SA Image](sa_emotions_picture.png) 

### [Sentiment Analysis: Emotion in Text.](https://www.kaggle.com/c/sa-emotions)

*Identify emotion in text using sentiment analysis.*

In [1]:
import numpy as np
import time
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from stemming.porter2 import stem
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline, make_union
from sentiment_analysis.transformers import RemoveEllipseTransformer, RemoveHtmlEncodedTransformer, RemoveNumbersTransformer, RemoveSpecialCharactersTransformer, RemoveUsernameTransformer, RemoveUrlsTransformer
from sklearn.preprocessing import LabelEncoder, FunctionTransformer
from sklearn.model_selection import cross_val_score
from sklearn.base import TransformerMixin
import re
import nltk.stem
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from nltk.sentiment.util import mark_negation



In [2]:
# allow plots to appear in the notebook
%matplotlib inline

In [3]:
# read the training data
training_data = pd.read_csv('../data/kaggle/sa-emotions/train_data_lexicon.csv')

In [4]:
training_data.shape

(30000, 7)

In [5]:
training_data.head(20)

Unnamed: 0.1,Unnamed: 0,sentiment,content,disposition,neg_words,neu_words,pos_words
0,0,empty,@tiffanylue i know i was listenin to bad habi...,-1.0,0.05,0.95,0.0
1,1,sadness,Layin n bed with a headache ughhhh...waitin o...,-1.0,0.076923,0.923077,0.0
2,2,sadness,Funeral ceremony...gloomy friday...,-1.0,0.166667,0.833333,0.0
3,3,enthusiasm,wants to hang out with friends SOON!,-1.0,0.125,0.875,0.0
4,4,neutral,@dannycastillo We want to trade with someone w...,0.0,0.0,1.0,0.0
5,5,worry,Re-pinging @ghostridah14: why didn't you go to...,1.0,0.0,0.95,0.05
6,6,sadness,"I should be sleep, but im not! thinking about ...",-1.0,0.058824,0.941176,0.0
7,7,worry,Hmmm. http://www.djhero.com/ is down,0.0,0.0,1.0,0.0
8,8,sadness,@charviray Charlene my love. I miss you,-1.0,0.125,0.875,0.0
9,9,sadness,@kelcouch I'm sorry at least it's Friday?,-1.0,0.090909,0.909091,0.0


In [6]:
class RemoveUsernameTransformer(TransformerMixin):
    """
    Transformer that will remove tokens from a string of the form:  @someusername
    
    """

    @staticmethod
    def _preprocess_data(data_series):
        """
        inspired from:
        https://raw.githubusercontent.com/youngsoul/ml-twitter-sentiment-analysis/develop/cleanup.py
        :param data_series:
        :return:
        """

        # remove user name
        regex = re.compile(r"@[^\s]+[\s]?")
        data_series.replace(regex, "", inplace=True)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """

        :param X: Series, aka column of data.
        :return:
        """
        RemoveUsernameTransformer._preprocess_data(X)
        return X


In [7]:
class RemoveNumbersTransformer(TransformerMixin):
    """
    Transformer that will remove tokens from a string that are numbers.
    """

    @staticmethod
    def _preprocess_data(data_series):
        """
        inspired from:
        https://raw.githubusercontent.com/youngsoul/ml-twitter-sentiment-analysis/develop/cleanup.py
        :param data_series:
        :return:
        """

        # remove numbers
        regex = re.compile(r"\s?[0-9]+\.?[0-9]*")
        data_series.replace(regex, "", inplace=True)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """

        :param X: Series, aka column of data.
        :return:
        """
        RemoveNumbersTransformer._preprocess_data(X)
        return X


In [8]:
class StemmedTfidfVectorizer(TfidfVectorizer):
    """
    A TF-IDF Vectorizer that will apply a stemmer to the tokeninze word.
    
    """
    def __init__(self, stemmer=None, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):

        super(StemmedTfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype, norm='l2', use_idf=True, smooth_idf=True,
                             sublinear_tf=False)
        self.stemmer = stemmer

    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([self.stemmer.stem(w) for w in analyzer(doc)])


In [9]:
def mark_negation_sentence(sentence):
    """
    See the NLTK utility for the mark_negation function.
    
    This function will take a sentence in, split it and call mark_negation, and 
    puts the string back together again.  
    
    Append _NEG suffix to words that appear in the scope between a negation
    and a punctuation mark.
    
    :param sentence an entire sentence
    :return sentence with the negation marked
    """
    return " ".join(mark_negation(sentence.split()))


In [10]:
# Functions used to create different features from the text.
def count_username_mentions(value):
    return len(re.findall(r"@[^\s]+[\s]?", value))

def count_ellipsis(value):
    return len(re.findall(r"\.\s?\.\s?\.", value))

def count_hashtags(value):
    return len(re.findall(r"3[^\s]+[\s]?", value))

def count_exclamation_points(value):
    groups = re.findall(r"\w+(!+)", value)
    return sum([len(exclamation_string) for exclamation_string in groups])

def count_question_marks(value):
    groups = re.findall(r"[\w+!](\?+)", value)
    return sum([len(exclamation_string) for exclamation_string in groups])

def is_boredom(y):
    if 'bored' in y.lower() or 'boring' in y.lower():
        return 1
    else:
        return 0


In [11]:
def make_features(df):
    df['number_of_mentions'] = df.content.apply(count_username_mentions)
    df['number_of_ellipsis'] = df.content.apply(count_ellipsis)
    df['number_of_exclamations'] = df.content.apply(count_exclamation_points)
    df['number_of_hashtabs'] = df.content.apply(count_hashtags)
    df['number_of_question'] = df.content.apply(count_question_marks)
    df['is_boredom'] = df.content.apply(is_boredom)
    #df['content_len'] = df.content.apply(len)


In [12]:
"""
Read the data, setup the function transformers, add the features to the training data.
"""

# -----------------  Function Transformers ----------------
def get_features_df(df):
    return df.loc[:, ['number_of_mentions', 'number_of_ellipsis', 'number_of_exclamations', 'number_of_hashtabs', 'number_of_question', 'is_boredom', 'disposition', 'neg_words', 'neu_words', 'pos_words']]

def get_sentiment_content(df):
    '''Returns the original content from the data set'''
    return df.content.copy()

def get_sentiment_content_negation(df):
    '''Returns the content after it has gone through negation'''
    return df.content_negation.copy()

def get_sentiment_content_preprocess_negation(df):
    '''Returns the content after is has gone through negation AND preprocessed'''
    return df.content_preprocessed_negation


# create a function transformer to just extract the feature columns
get_features_transformer = FunctionTransformer(get_features_df, validate=False)
# usage: get_features_transformer.transform(training_data_with_features).head()

# create a function transformer to return the sentiment content so it can be used in pipeline/union
get_sentiment_content_transformer = FunctionTransformer(get_sentiment_content, validate=False)

get_sentiment_content_negation_transformer = FunctionTransformer(get_sentiment_content_negation, validate=False)

get_sentiment_content_preprocess_negation_transformer = FunctionTransformer(get_sentiment_content_preprocess_negation, validate=False)

# -----------------  End Function Transformers ----------------

def preprocess_data_set(input_data_set):
    # Create a pipeline with the transformers we are keeping, and see the overall improvement.
    # This pipeline gets a little tricky.  the 'get_sentiment_content_transformer' returns a COPY of the
    # original content.  So the transformers work on a copy of the content, leaving the make_features with the
    # original content to create features from.
    preprocessor_pipeline = make_pipeline(get_sentiment_content_transformer, RemoveNumbersTransformer(), RemoveUsernameTransformer())
    preprocessed_training_data_content = preprocessor_pipeline.transform(input_data_set)
    input_data_set['content_preprocessed'] = preprocessed_training_data_content
    input_data_set['content_preprocessed_negation'] = input_data_set.content_preprocessed.apply(mark_negation_sentence)

    make_features(input_data_set)



In [13]:
preprocess_data_set(training_data)

In [14]:
training_data.shape

(30000, 15)

In [15]:

# create stemmer and vectorizer
stemmer = nltk.stem.SnowballStemmer('english')
stemmed_tfidf_vectorizer = StemmedTfidfVectorizer(stemmer=stemmer, min_df=5, max_df=0.8, ngram_range=(1,4), stop_words='english', sublinear_tf=True)

# UNION
#    PIPE
#        get the negated and preprocessed text
#        send to stemmed tfidf vectorizer to create vocabulary and document term matrix
#    Get the features DataFrame transformer
union = make_union(make_pipeline(get_sentiment_content_preprocess_negation_transformer, stemmed_tfidf_vectorizer),
                  get_features_transformer)

# encode the sentiment outcomes as a number using the LabelEncoder
# would like to create a column, e.g. sentiment_num, which is a numeric representation of the sentiment.
# this will have to also be applied to any test data.
label_encoder = LabelEncoder()

# fit the label encoder with the unique set of sentiments in the training data.
label_encoder.fit(training_data.sentiment.unique())

# Create an outcomes which is the numeric representation of the label
y = training_data.sentiment.apply(lambda x: label_encoder.transform([x])[0])

# create LogisticRegression Model
# 0.3250
model = LogisticRegression(C=0.1)

# 0.307
#model = MultinomialNB()

# Model Pipeline
#    UNION ->Model
# The union will create a traditional vectoizered set of tokens, and a non-DTM data frame for 
# the model.
model_pipeline = make_pipeline(union, model)
cross_val_score(model_pipeline, training_data, y, cv=5, scoring='accuracy').mean()



0.33674193257108043

With the pos/neg/neu/disposition columns, the new accuracy score is now 0.3367

In [16]:
# Read in the test data that will be used to submit to Kaggle
test_data = pd.read_csv('../data/kaggle/sa-emotions/test_data_lexicon.csv')

In [17]:
test_data.shape

(10000, 7)

In [18]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,id,content,disposition,neg_words,neu_words,pos_words
0,0,1,is hangin with the love of my life. Tessa McCr...,1.0,0.0,0.916667,0.083333
1,1,2,I've Got An Urge To Make Music Like Massively....,1.0,0.0,0.9375,0.0625
2,2,3,@lacrossehawty rofl uh huh,0.0,0.0,1.0,0.0
3,3,4,"@fankri haha! thanks, Tiff it went well, but...",1.0,0.0,0.962963,0.037037
4,4,5,@alyssaisntcool hahah i loveeee him though.,0.0,0.0,1.0,0.0


In [19]:
# Pre-process the test data, because pre-processing should happen on the training and the testing data.
preprocess_data_set(test_data)
test_data.head()

Unnamed: 0.1,Unnamed: 0,id,content,disposition,neg_words,neu_words,pos_words,content_preprocessed,content_preprocessed_negation,number_of_mentions,number_of_ellipsis,number_of_exclamations,number_of_hashtabs,number_of_question,is_boredom
0,0,1,is hangin with the love of my life. Tessa McCr...,1.0,0.0,0.916667,0.083333,is hangin with the love of my life. Tessa McCr...,is hangin with the love of my life. Tessa McCr...,0,0,2,0,0,0
1,1,2,I've Got An Urge To Make Music Like Massively....,1.0,0.0,0.9375,0.0625,I've Got An Urge To Make Music Like Massively....,I've Got An Urge To Make Music Like Massively....,0,0,0,0,0,0
2,2,3,@lacrossehawty rofl uh huh,0.0,0.0,1.0,0.0,rofl uh huh,rofl uh huh,1,0,0,0,0,0
3,3,4,"@fankri haha! thanks, Tiff it went well, but...",1.0,0.0,0.962963,0.037037,"haha! thanks, Tiff it went well, but they WO...","haha! thanks, Tiff it went well, but they WORE...",1,0,4,0,0,0
4,4,5,@alyssaisntcool hahah i loveeee him though.,0.0,0.0,1.0,0.0,hahah i loveeee him though.,hahah i loveeee him though.,1,0,0,0,0,0


In [20]:
test_data.shape

(10000, 15)

### Use the model pipeline to make predictions

Once we have the test data preprocessed - we use the model pipeline with all of the training data, and predict on the testing data.

For a pipeline, we can treat it just like a regular model and call 'fit' and 'predict'.


In [21]:
model_pipeline.fit(training_data, y)

Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline', Pipeline(memory=None,
     steps=[('functiontransformer', FunctionTransformer(accept_sparse=False,
          func=<function get_sentiment_content_preprocess_negation at 0x10d711620>,
          inv_kw_args=None, inve...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [22]:
y_pred_class = model_pipeline.predict(test_data)

In [23]:
# convert the y_pred_class classification numbers BACK to their string versions for submission
y_pred_class_labels = label_encoder.inverse_transform(y_pred_class)
print(y_pred_class_labels)

['love' 'neutral' 'neutral' ..., 'happiness' 'happiness' 'neutral']


In [24]:
# create a submission file (resulting score: 0.30040)
# sub1 
# sub2 - added remove html characters
# sub3 = sub1, sanity check
# sub4 = added pos/neg/neu/disposition to the data set.  resulting score: 0.33700
pd.DataFrame({'id':test_data.id, 'sentiment':y_pred_class_labels}).set_index('id').to_csv('../data/kaggle/sa-emotions/sub4.csv')


