<h1>Sarcasm detector using logistic regression by Franciszek Jemioło</h1>

In [1]:
# This program was written by Franciszek Jemioło
import sys
# We are setting parameters for our dataset
# The limits for maximum number of sarcastic and serious posts to train on.
sarcastic_posts_limit = 30000
serious_posts_limit = 30000
# Number of top features to display
ntop_features = 20
# Number of examples per feature
nexamples = 20

<p>First we gather data from the Reddit comments dataset for May 2015, which is available on Kaggle (https://www.kaggle.com/reddit/reddit-comments-may-2015).</p>

In [2]:
# Take the sarcastic and serious posts from the dataset.
# Unfortunately the Spark SQL API did not work here, so we cannot create the RDD directly and have to work around it.
# Attempted: SQLite JDBC, with SPARK_CLASSPATH=sqlite-jdbc-3.8.11.2.jar added to the start script:
# df = sqlContext.read.format('jdbc').options(url='jdbc:sqlite:database.sqlite', dbtable='May2015').load()
# We would then run the same SQL as below; the only difference is that it would run inside Spark, where it belongs.
# The command above throws type errors caused by Spark internals, which is why we cannot use it.

# Importing driver for the sqlite3 database
import sqlite3
# Import regex
import re
# Creating connection
sql_connection = sqlite3.connect('database.sqlite')

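# Getting the data: posts ending in " /s" are treated as sarcastic, all others as serious.
# The limits are passed as bound parameters instead of being concatenated into the SQL string.
sarcasmData = sql_connection.execute(
    "SELECT subreddit, body, score FROM May2015 "
    "WHERE body LIKE '% /s' "
    "LIMIT ?", (sarcastic_posts_limit,)).fetchall()

seriousData = sql_connection.execute(
    "SELECT subreddit, body, score FROM May2015 "
    "WHERE body NOT LIKE '% /s' "
    "LIMIT ?", (serious_posts_limit,)).fetchall()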

<p>Now we have to process the comments and strip them of unnecessary characters and signs. Each comment is transformed into a list of the words used in that comment. We call the resulting array of parsed comments the corpus.</p>
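<p>As a minimal sketch of the cleaning step described above, the three nested substitutions can be wrapped in one helper. The name <code>normalize_post</code> is hypothetical and not part of the notebook. Unlike the cells below, this sketch uses <code>split()</code> without an argument, so repeated spaces do not produce empty-string tokens.</p>

```python
import re

def normalize_post(body):
    """Return the list of lowercase alphanumeric words in a comment body."""
    # Drop the "/s" marker and line breaks, as the notebook cells do.
    text = re.sub(r'/s|\n', '', body.lower())
    # Keep only digits, ASCII letters and whitespace.
    text = re.sub(r'[^0-9a-z\s]+', '', text)
    # split() with no argument trims both ends and collapses runs of
    # whitespace, so no empty tokens end up in the result.
    return text.split()

tokens = normalize_post("Oh sure, that'll  work... /s\n")
```

<p>Here <code>tokens</code> is <code>['oh', 'sure', 'thatll', 'work']</code>. Note that, like the original expression, this removes every occurrence of <code>/s</code>, not only the trailing marker.</p>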

In [3]:
# Processing data, removing unnecessary characters
# clear_output is used to refresh the progress display
from IPython.display import clear_output
corpus, raw_corpus, serious_corpus = [], [], []

print "Processing sarcastic posts..."
sys.stdout.flush()
t = 0
i = 0
show_steps = 10
sarcastic_posts = len(sarcasmData)
for sarcastic_post in sarcasmData:
    raw_corpus.append(re.sub('\n', '', sarcastic_post[1]))
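    # Removing /s and end of line
    clean_post = re.sub('/s|\n', '', sarcastic_post[1])
    # Lowercasing, removing non-word characters and trimming whitespace
    # Appending the word list only; sarcastic posts come first and get label 1 when the labels are built later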
    corpus.append(re.sub(r'((^(\s+))|((\s+)$))', '', re.sub(r'([^0-9a-zA-Z\s]+)', '', clean_post.lower())).split(' '))
    t += 1
    if t >= (i * sarcastic_posts / show_steps):
        i += 1
        print "Progress: " + str(t) + " / " + str(sarcastic_posts)
clear_output()
print "Successfully processed sarcastic posts"

t = 0
i = 0
print "Processing serious posts..."
sys.stdout.flush()
serious_posts = len(seriousData)
for serious_post in seriousData:
    serious_corpus.append(re.sub('\n', '', serious_post[1]))
    # Removing /s (if any) and end of line
    clean_post = re.sub('/s|\n', '', serious_post[1])
    # Lowercasing, removing non-word characters and trimming whitespace
    # Appending the word list only; serious posts follow the sarcastic ones and get label 0 later
    corpus.append(re.sub(r'((^(\s+))|((\s+)$))', '', re.sub(r'([^0-9a-zA-Z\s]+)', '', clean_post.lower())).split(' '))
    t += 1
    if t >= (i * serious_posts / show_steps):
        i += 1
        print "Progress: " + str(t) + " / " + str(serious_posts)
clear_output()
print "Successfully processed all posts"
sys.stdout.flush()

Successfully processed all posts


<p>Now we create Spark RDDs from our corpus. We then use HashingTF to build term-frequency vectors and turn them into a TF-IDF matrix. After that, we zip the TF-IDF matrix with the labels (1 for sarcastic posts and 0 for serious posts). Finally, we split the data into a training set and a validation set.</p>
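<p>To make the TF-IDF step more concrete, here is a plain-Python sketch that runs without Spark. It assumes the IDF formula given in the MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)) for m documents, with weight 0 for terms seen in fewer than <code>minDocFreq</code> documents. All function names are hypothetical, and CRC32 is only a stand-in for whatever hash HashingTF uses internally. The bucket count 1 &lt;&lt; 20 matches the vector size 1048576 visible in the sample output of the cell below.</p>

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 1 << 20  # 1048576 hash buckets

def term_index(term, num_features=NUM_FEATURES):
    # CRC32 is a stand-in hash chosen so the example is reproducible;
    # it is an assumption, not the hash function HashingTF uses.
    return (zlib.crc32(term.encode('utf-8')) & 0xffffffff) % num_features

def term_frequencies(tokens, num_features=NUM_FEATURES):
    # Sparse TF vector: bucket index -> number of occurrences in the document.
    return Counter(term_index(t, num_features) for t in tokens)

def inverse_document_frequencies(tf_vectors, min_doc_freq=0):
    # idf = log((m + 1) / (df + 1)); buckets that occur in fewer than
    # min_doc_freq documents get weight 0.
    m = len(tf_vectors)
    doc_freq = Counter()
    for tf in tf_vectors:
        doc_freq.update(tf.keys())
    return {i: (math.log((m + 1.0) / (df + 1.0)) if df >= min_doc_freq else 0.0)
            for i, df in doc_freq.items()}

def tfidf(tf, idf):
    # Scale each term count by the IDF weight of its bucket.
    return {i: count * idf.get(i, 0.0) for i, count in tf.items()}

docs = [['yeah', 'right'], ['yeah', 'sure'], ['great', 'idea']]
tfs = [term_frequencies(d) for d in docs]
idf_weights = inverse_document_frequencies(tfs)
vectors = [tfidf(tf, idf_weights) for tf in tfs]
```

<p>Because words are hashed into buckets, two different words can land on the same index and then share one feature. The <code>minDocFreq</code> cut-off may also explain why one entry of the sample LabeledPoint printed below is exactly 0.0.</p>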

In [4]:
# Splitting the data, creating RDDs
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

corpusRDD = sc.parallelize(corpus)

corpusRDD.cache()

# Creating TF-IDF
hashingTF = HashingTF()
tf = hashingTF.transform(corpusRDD)
#tf.cache()
idf = IDF(minDocFreq=5).fit(tf)
tfidf = idf.transform(tf)
tfidf.cache()

# Making labeled points
#labeledCorpusRDD = corpusRDD.map(lambda (text, label): LabeledPoint(label, text))
# Printing sample of corpus
print corpusRDD.take(1)

testLabels = sc.parallelize([1] * sarcastic_posts + [0] * serious_posts)
zippedRDD = testLabels.zip(tfidf)
labeledRDD = zippedRDD.map(lambda (label, vector): LabeledPoint(label, vector))
# Printing sample of whole dataset
print labeledRDD.take(1)
# Splitting data into training and validation sets
weights = [.9, .1]
seed = 42
labeledTrainRDD, labeledValidationRDD = labeledRDD.randomSplit(weights, seed)

# Caching data for quicker access
labeledTrainRDD.cache()
labeledValidationRDD.cache()

[[u'having', u'sex', u'with', u'my', u'girlfriend', u'at', u'least', u'5', u'times', u'a', u'day', u'is', u'my', u'main', u'escape', u'i', u'am', u'fa', u'because', u'i', u'would', u'like', u'a', u'second', u'girlfriend', u'for', u'regular', u'threesomes', u'but', u'she', u'doesnt', u'want', u'to']]
[LabeledPoint(1.0, (1048576,[3932,24165,36748,36757,38031,50570,53144,57166,190103,198825,261763,277231,418086,438737,441832,514653,572533,582012,595965,712467,725041,782260,786838,869331,884882,897504,903739,951974,1004334],[2.26129988016,2.44224704204,4.04652389934,2.64673661248,5.48868776157,1.56384406771,13.6864668487,2.14773704896,7.28854444103,0.0,2.74680769595,4.77954023966,2.88569943494,7.91107405437,2.84318568358,4.2867331214,3.95373009901,5.16066700224,4.55798525103,3.79129805426,1.21578094001,4.83460001684,3.52847939924,6.27472868902,5.06458030265,2.41242833175,5.68399651389,1.87526644625,4.46978221551]))]


PythonRDD[15] at RDD at PythonRDD.scala:43

<p>Below we create two logistic regression models that differ only in the optimization method: one is trained with LBFGS (limited-memory BFGS) and the other with SGD (stochastic gradient descent). As we will see later, LBFGS is slower than SGD, performs about the same on the validation set, and the words that activate it the most are rather useless.</p>
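<p>Both optimizers fit the same kind of model, so it may help to see what a trained logistic regression does with one TF-IDF vector. The sketch below is plain Python under stated assumptions: the function names are hypothetical, weights and features are given as sparse dictionaries, and the 0.5 threshold is assumed to match the MLlib default.</p>

```python
import math

def predict_probability(weights, intercept, features):
    """Sigmoid of the linear score; weights and features map index -> value."""
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    # Evaluate the sigmoid in a form that cannot overflow for large |margin|.
    if margin > 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)

def predict_label(weights, intercept, features, threshold=0.5):
    # Label 1 (sarcastic) only if the probability exceeds the threshold.
    if predict_probability(weights, intercept, features) > threshold:
        return 1
    return 0

# Toy example: a single active feature with a positive weight.
p = predict_probability({3: 2.0}, -1.0, {3: 1.0})
label = predict_label({3: 2.0}, -1.0, {3: 1.0})
```

<p>A large positive weight on a feature therefore pushes posts containing that feature towards the sarcastic class, which is why the weights are inspected later on.</p>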

In [5]:
# Creating and training the model
# Training progress should be visible in console where you start ipython notebook with spark.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionWithSGD, LogisticRegressionModel
# Logistic regression model with LBFGS
print "Training logistic regression with LBFGS model..."
sys.stdout.flush()
logregModelLBFGS = LogisticRegressionWithLBFGS.train(labeledTrainRDD, iterations=100, intercept=True, tolerance=0.0)
print "Finished training logistic regression with LBFGS model"
# Logistic regression model with SGD
print "Training logistic regression with SGD model..."
sys.stdout.flush()
logregModelSGD = LogisticRegressionWithSGD.train(labeledTrainRDD, iterations=100, intercept=True, 
                                                 convergenceTol=0.0, regParam=1e-6, regType="l2")
print "Finished training logistic regression with SGD model"

Training logistic regression with LBFGS model...
Finished training logistic regression with LBFGS model
Training logistic regression with SGD model...
Finished training logistic regression with SGD model


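<p>Below we see the prediction accuracy on the training data and on the held-out validation data. Both models reach about 70% on the validation set, which is pretty good for such a simple method. LBFGS scores about 84% on the training data, roughly 14 percentage points above its validation accuracy, which suggests that it overfits more than SGD.</p>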

In [6]:
# Validation and evaluation of models
# Accuracy on training data using SGD
labelsAndPredsLogregSGD = labeledTrainRDD.map(lambda x: (x.label, logregModelSGD.predict(x.features)))
accuracyTrainSGD = labelsAndPredsLogregSGD.filter(lambda (v, p): v == p).count() / float(labeledTrainRDD.count())
print "Accuracy on training data using Logistic regression with SGD: " + str(accuracyTrainSGD * 100) +"%"
# Accuracy on training data using LBFGS
labelsAndPredsLogregLBFGS = labeledTrainRDD.map(lambda x: (x.label, logregModelLBFGS.predict(x.features)))
accuracyTrainLBFGS = labelsAndPredsLogregLBFGS.filter(lambda (v, p): v == p).count() / float(labeledTrainRDD.count())
print "Accuracy on training data using Logistic regression with LBFGS: " + str(accuracyTrainLBFGS * 100) +"%"

# Accuracy on held out data using SGD
labelsAndPredsLogregSGDVal = labeledValidationRDD.map(lambda x: (x.label, logregModelSGD.predict(x.features)))
accuracyValidationSGD = labelsAndPredsLogregSGDVal.filter(
    lambda (v, p): v == p).count() / float(labeledValidationRDD.count())
print "Accuracy on held out data using Logistic regression with SGD: " + str(accuracyValidationSGD * 100) +"%"
# Accuracy on held out data using LBFGS
labelsAndPredsLogregLBFGSVal = labeledValidationRDD.map(lambda x: (x.label, logregModelLBFGS.predict(x.features)))
accuracyValidationLBFGS = labelsAndPredsLogregLBFGSVal.filter(
    lambda (v, p): v == p).count() / float(labeledValidationRDD.count())
print "Accuracy on held out data using Logistic regression with LBFGS: " + str(accuracyValidationLBFGS * 100) +"%"

Accuracy on training data using Logistic regression with SGD: 72.282266721%
Accuracy on training data using Logistic regression with LBFGS: 83.9675019432%
Accuracy on held out data using Logistic regression with SGD: 69.9631243714%
Accuracy on held out data using Logistic regression with LBFGS: 70.2145491116%


<p>Now we gather the words that have the highest weights in our models. These are the most informative words for the sarcastic class: the higher the weight, the more strongly the word pushes a post towards being classified as sarcastic.</p>

In [7]:
# Check what features were informative and gather top informative features
def mapWordToFeatures(inputData):
    result = {}
    for post in inputData:
        for word in post:
            #if not word in result.keys():
            result[word] = hashingTF.indexOf(word)
    return result

mappedWords = mapWordToFeatures(corpus)

feature_weightsSGD = logregModelSGD.weights
feature_weightsLBFGS = logregModelLBFGS.weights

def gatherTopWords(mapOfWords, feature_weights):
    result = {}
    for word in mapOfWords.keys():
        if len(result) < ntop_features:
            result[word] = feature_weights[mapOfWords[word]]
        else:
            if feature_weights[mapOfWords[word]] > min(result.values()):
                min_val = min(result.values())
                deleted = None
                for key, value in result.iteritems():
                    if value == min_val:
                        deleted = key
                        break
                result.pop(deleted)
                result[word] = feature_weights[mapOfWords[word]]
    return result

topWordsSGD = gatherTopWords(mappedWords, feature_weightsSGD)        
    
topWordsLBFGS = gatherTopWords(mappedWords, feature_weightsLBFGS)         
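
<p>The same top-N selection can be written more compactly with the standard library. This is a sketch with a hypothetical helper name and toy data, not the notebook's implementation. It returns (weight, word) pairs sorted from highest to lowest weight instead of a dictionary.</p>

```python
import heapq

def top_weighted_words(word_to_index, weights, n):
    # Pair every word with the weight of its feature index and keep
    # the n largest pairs, sorted by weight in descending order.
    return heapq.nlargest(n, ((weights[i], w) for w, i in word_to_index.items()))

# Toy data: four words mapped to four feature indices.
word_to_index = {'yeah': 0, 'totally': 1, 'the': 2, 'obviously': 3}
weights = [0.4, 0.9, -0.1, 0.7]
top = top_weighted_words(word_to_index, weights, 2)
```

<p>Here <code>top</code> is <code>[(0.9, 'totally'), (0.7, 'obviously')]</code>. Keep in mind that words hashed to the same index share one weight, so a rare word can appear in the list only because it collides with an informative one.</p>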

<p>In the two cells below we show what these top informative words are, together with a number of example posts for each word.</p>

In [8]:
# Display results of Logistic regression model with SGD
def isWordInPost(w, p):
    # re.escape keeps the word literal should it ever contain regex metacharacters
    result = re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search(p)
    return result is not None

def countOccurence(word, inputData, gatherExamples=False):
    occ = 0
    examples = None
    if gatherExamples:
        examples = {}
        k = 1
    for post in inputData:
        if isWordInPost(word, post):
            occ += 1
            if gatherExamples:
                if k <= nexamples:
                    examples[k] = post
                    k += 1
    if gatherExamples:
        return (occ, examples)
    else:
        return occ
            

print "Top activating words using SGD:\n\n"
for word in topWordsSGD.keys():
    occurenceSarcastic, examples = countOccurence(word, raw_corpus, True)
    occurenceSerious = countOccurence(word, serious_corpus)
    print ("Word: " + "'" + word + "'" + ", feature weight: " + str(feature_weightsSGD[mappedWords[word]]) +
           ". Occurred " + str(occurenceSarcastic) + " times in sarcastic posts and " +
           str(occurenceSerious) + " in serious posts.")
    print "Examples: "
    k = 1
    for example in examples.keys():
        print str(k) + ". " + examples[example]
        k += 1
    print '\n'
    

Top activating words using SGD:


Word: 'all', feature weight: 0.122331523715. Occurred 2815 times in sarcastic posts and 2236 in serious posts.
Examples: 
1. Hush you men aren't being objectified they can't be they have all the privilege! /s
2. Scroll down and look for the match thread, all discussion goes there. Oh and nice skins, probably worth a lot. /s
3. No, Conley and TA didn't play at all, and Gasol was playing with a sprain from the Clippers game, so when it got out of hand he sat down.   What no one wants to talk about is how the back end of our bench absolutely crushed it. Russ Smith and Jordan Adams are clearly better than the Splash brothers. Secret Weapons. /s
4. Because you speeding does not directly result in a crash statistically.  Its hard to directly say that the act of going over the speed limit is what causes the crashes.The US DOT did a study on data between 2005 and 2007 that found 8.4% of all crashes were because the driver was driving "too fast for conditions", 

In [9]:
# Display results of Logistic regression model with LBFGS
print "Top activating words using LBFGS:\n\n"
for word in topWordsLBFGS.keys():
    occurenceSarcastic, examples = countOccurence(word, raw_corpus, True)
    occurenceSerious = countOccurence(word, serious_corpus)
    print ("Word: " + "'" + word + "'" + ", feature weight: " + str(feature_weightsLBFGS[mappedWords[word]]) +
           ". Occurred " + str(occurenceSarcastic) + " times in sarcastic posts and " +
           str(occurenceSerious) + " in serious posts.")
    print "Examples: "
    k = 1
    for example in examples.keys():
        print str(k) + ". " + examples[example]
        k += 1
    print '\n'

Top activating words using LBFGS:


Word: 'accelerator', feature weight: 0.605612340818. Occurred 1 times in sarcastic posts and 3 in serious posts.
Examples: 
1. I thought this might be an interesting video but about halfway through was hit with so much bullshit I had to start taking notes so I would remember what I wanted to reply to.  Then the second half of this short clip was jam packed with so much *more* bullshit that I now have a novel written.  I wasn't going to post it but have [too much put into it now](http://en.wikipedia.org/wiki/Sunk_costs), I might as well.* "So the intact male needs *very* small strokes during intercourse to ride the wave to orgasm and to ejaculate."Um ... I'm cut, and I like both deep and shallow, but I do prefer deep.  I always thought that was just because I'm a bit bigger than average and like to actually be fully *inside*, not feeling like I'm half out.  But okay, maybe she has something there.* "These nerve endings in the ridge band are the acceler

<h2>Try the model yourself</h2>
<p>In the last cell you can enter your own comment, and the models will tell you whether they consider it sarcastic :)</p>

In [11]:
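# You can enter your comment in this string
comment_str = "This is surely not a sarcastic comment... /s"
# Note: parallelize() receives the flat word list, so every word becomes its own
# RDD element and is scored as a separate document. The "probability" printed
# below is therefore the share of words predicted as sarcastic.
commentRDD = sc.parallelize(re.sub(r'((^(\s+))|((\s+)$))', '', 
                                   re.sub(r'([^0-9a-zA-Z\s]+)', '', 
                                          re.sub('/s|\n', '', comment_str.lower()))).split(' '))
commentRDD.cache()
print commentRDD.collect()
cmtf = hashingTF.transform(commentRDD)
cmtf.cache()
# Reuse the IDF model fitted on the training corpus
cmtfidf = idf.transform(cmtf)
cmtfidf.cache()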
# Getting the predictions
predictionSGD = logregModelSGD.predict(cmtfidf)
predictionLBFGS = logregModelLBFGS.predict(cmtfidf)
predictionListSGD = predictionSGD.collect()
predictionListLBFGS = predictionLBFGS.collect()
print predictionListSGD
print predictionListLBFGS
sarcasticProbabilitySGD = (float(predictionSGD.filter(lambda x: x==1).count()) / predictionSGD.count())
sarcasticProbabilityLBFGS = (float(predictionLBFGS.filter(lambda x: x==1).count()) / predictionLBFGS.count())
print ("Logistic regression with SGD says that this is a sarcastic comment with probability: " + 
       str(sarcasticProbabilitySGD * 100) + "%")
print ("Logistic regression with LBFGS says that this is a sarcastic comment with probability: " + 
       str(sarcasticProbabilityLBFGS * 100) + "%")
# Average the two scores; the sum must be parenthesized, otherwise only the LBFGS score is halved
if (sarcasticProbabilitySGD + sarcasticProbabilityLBFGS) / 2 >= 0.5:
    print "I guess that this is a sarcastic comment :)"
else:
    print "Naaah this is a serious comment."

['this', 'is', 'surely', 'not', 'a', 'sarcastic', 'comment']
[1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 0, 0, 1, 0]
Logistic regression with SGD says that this is a sarcastic comment with probability: 100.0%
Logistic regression with LBFGS says that this is a sarcastic comment with probability: 57.1428571429%
I guess that this is a sarcastic comment :)
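
<p>The prediction lists above contain one entry per word because <code>sc.parallelize</code> received the flat word list, so each word was treated as its own document. A document is counted by iterating over it, and iterating over a string yields its characters. The sketch below illustrates that effect in plain Python; <code>count_terms</code> is a hypothetical stand-in for the term-counting step, not the MLlib implementation.</p>

```python
from collections import Counter

def count_terms(document):
    """Count the items produced by iterating over `document`."""
    return Counter(document)

tokens = ['surely', 'not']

# One document made of two words: the terms are the words themselves.
as_one_document = count_terms(tokens)

# Two documents, each a bare string: the terms are single characters.
as_separate_documents = [count_terms(token) for token in tokens]
```

<p>Passing <code>[words]</code> (the word list wrapped in another list) to <code>sc.parallelize</code> would score the whole comment as a single document, in which case each model returns a single 0 or 1 label rather than a list. To get an actual probability instead of a label, the model's decision threshold would have to be cleared first; treat that as a pointer to check against the MLlib documentation rather than tested advice.</p>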
