# Anti-Phishing SMS (APS) System

The APS System is a platform that helps combat phishing attacks. Cyber threats are rising in Singapore, and phishing attacks saw the biggest jump in 2018, so efforts should be made to help mitigate this problem.

The APS System aims to become a new layer of cybersecurity protection. It uses Machine Learning (ML) techniques to detect likely phishing SMS messages and flags them together with the reasons for suspicion. These reasons also help educate users on which phishing signs to look for, so that they can make better-informed decisions.

In this notebook, we reference the research article *Towards Filtering of SMS Spam Messages Using Machine Learning Based Technique* and build the ML model with the help of a compiled SMS Spam Corpus data set.

#### References
Choudhary, Neelam & Jain, Ankit. (2017). Towards Filtering of SMS Spam Messages Using Machine Learning Based Technique. doi:10.1007/978-981-10-5780-9_2.

Data sets: https://github.com/abinayam/Spam-Detection-SMS/tree/master/Datasets

In [2]:
### Import Libraries ###

import pandas as pd
import re


### Global Variables ###

KEYWORD_SPECIFICS_DEFAULT = ["100%", "#1", "$$$", "100% free", "100% satisfied", "50% off", "accept credit cards", "acceptance", "access", "accordingly", "act now", "ad", "additional income", "affordable", "all natural", "all new", "amazed", "amazing", "amazing stuff", "apply now", "apply online as seen on", "auto email removal", "avoid", "avoid bankruptcy", "bargain", "be amazed", "be your own boss", "being a member", "beneficiary", "best price", "beverage", "big bucks", "billing", "billing address", "billion", "billion dollars", "bonus boss", "brand new pager", "bulk email", "buy", "buy direct", "buying judgments cable converter", "call", "call free", "call now", "calling creditors", "can’t live without cancel", "cancel at any time", "cannot be combined with any other offer", "cards accepted", "cash", "cash bonus", "cashcashcash", "casino", "celebrity", "cell phone cancer scam", "cents on the dollar", "certified chance", "cheap", "check", "check or money order", "claims", "claims not to be selling anything claims to be in accordance with some spam law", "claims to be legal", "clearance", "click", "click below", "click here click to remove", "collect", "collect child support", "compare", "compare rates", "compete for your business confidentially on all orders", "congratulations", "consolidate debt and credit", "consolidate your debt", "copy accurately", "copy dvds costs", "credit", "credit bureaus", "credit card offers", "cures", "cures baldness deal", "dear [email/friend/somebody]", "debt", "diagnostics", "dig up dirt on friends", "direct email direct marketing", "discount", "do it today", "don’t delete", "don’t hesitate", "dormant double your", "double your cash", "double your income", "drastically reduced", "earn", "earn $ earn extra cash", "earn per week", "easy terms", "eliminate bad credit", "eliminate debt", "email harvest email marketing", "exclusive deal", "expect to earn", "expire", "explode your business", "extra extra cash", "extra income", "f r e e", 
"fantastic", "fantastic deal", "fast cash fast viagra delivery", "financial freedom", "financially independent", "for free", "for instant access", "for just $", "for only", "for you", "form", "free", "free access", "free cell phone", "free consultation", "free dvd", "free gift", "free grant money", "free hosting free info", "free installation", "free instant", "free investment", "free leads", "free membership free money", "free offer", "free preview", "free priority mail", "free quote", "free sample free trial", "free website", "freedom", "friend", "full refund", "get get it now", "get out of debt", "get paid", "get started now", "gift certificate", "give it away", "giving away", "great", "great offer", "guarantee", "have you been turned down?", "here", "hidden", "hidden assets", "hidden charges", "home home based", "home employment", "home based business", "human growth hormone", "if only it were that easy", "important information regarding in accordance with laws", "income", "income from home", "increase sales", "increase traffic", "increase your sales incredible deal", "info you requested", "information you requested", "instant", "insurance", "insurance internet market", "internet marketing", "investment", "investment decision", "it’s effective", "join millions", "junk", "laser printer", "leave", "legal", "life insurance", "lifetime", "limited", "limited time", "limited time offer", "limited time only loan", "long distance phone offer", "lose", "lose weight", "lose weight spam", "lower interest rates lower monthly payment", "lower your mortgage rate", "lowest insurance rates", "lowest price", "luxury", "luxury car mail in order form", "maintained", "make $", "make money", "marketing", "marketing solutions mass email", "medicine", "medium", "meet singles", "member", "member stuff message contains", "message contains disclaimer", "million", "million dollars", "miracle", "mlm money", "money back", "money making", "month trial offer", "more internet traffic", 
"mortgage mortgage rates", "multi-level marketing", "name brand", "never", "new customers only", "no age restrictions", "no catch", "no claim forms", "no cost", "no credit check no disappointment", "no experience", "no fees", "no gimmick", "no hidden", "no hidden costs no interests", "no inventory", "no investment", "no medical exams", "no middleman", "no obligation no purchase necessary", "no questions asked", "no selling", "no strings attached", "no-obligation", "not intended not junk", "not spam", "now", "now only", "obligation", "offshore offer", "offer expires", "once in lifetime", "one hundred percent free", "one hundred percent guaranteed", "one time one time mailing", "online biz opportunity", "online degree", "online marketing", "online pharmacy", "only only $", "open", "opportunity", "opt in", "order", "order now order shipped by", "order status", "order today", "outstanding values", "passwords", "pennies a day per day", "per week", "performance", "phone", "please read", "potential earnings pre-approved", "presently", "price", "print form signature", "print out and fax", "priority mail prize", "problem", "produced and sent out", "profits", "promise", "promise you purchase", "pure profits", "quote", "rates", "real thing", "refinance refinance home", "refund", "removal", "removal instructions", "remove", "removes wrinkles request", "requires initial investment", "reserves the right", "reverses", "reverses aging", "risk free rolex", "round the world", "safeguard notice", "sale", "sample satisfaction", "satisfaction guaranteed", "save $", "save big money", "save up to", "score score with babes", "search engine listings", "search engines", "section 301", "see for yourself", "sent in compliance serious", "serious cash", "serious only", "shopper", "shopping spree", "sign up free", "social security number", "solution", "spam", "special promotion", "stainless steel", "disclaimer statement", "stock pick", "stop", "stop snoring", "strong buy", "stuff on sale subject 
to cash", "subject to credit", "subscribe", "success", "supplies", "supplies are limited take action", "take action now", "talks about hidden charges", "talks about prizes", "teen", "tells you it’s an ad terms", "terms and conditions", "the best rates", "the following form", "they keep your money — no refund!", "they’re just giving it away this isn’t a scam", "this isn’t junk", "this isn’t spam", "this won’t last", "thousands", "time limited traffic", "trial", "undisclosed recipient", "university diplomas", "unlimited", "unsecured credit unsecured debt", "unsolicited", "unsubscribe", "urgent", "us dollars", "vacation vacation offers", "valium", "viagra", "vicodin", "visit our website", "credit card", "warranty", "we hate spam", "we honor all", "web traffic", "weekend getaway", "weight weight loss", "what are you waiting for?", "what’s keeping you?", "while supplies last", "while you sleep", "who really wins? why pay more?", "wife", "will not believe your eyes", "win", "winner", "winning won", "work from home", "xanax", "you are a winner!", "you have been selected", "your income"]
KEYWORD_SPECIFICS_CUSTOMIZED = ["process", "check out", "give away", "giveaway", "biggest", "flirt", "name", "announcement", "terms", "conditions", "channel", "install", "browse", "enjoy", "mobile content", "sexy", "age", "join", "credit", "downloads", "gender", "ringtone", "dating", "service", "babe", "wet", "porn", "cum", "goodies", "private", "prize", "reply", "topped up", "vid", "video", "pic", "chance", "sale", "cash prize", "worth", "tone", "charge", "topped up", "credit", "credits", "top up", "reply back", "text back"]
KEYWORD_SPECIFICS = KEYWORD_SPECIFICS_DEFAULT + KEYWORD_SPECIFICS_CUSTOMIZED

WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:"".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\
b/?(?!@)))"""


We will define helper functions to extract features that can be used to train the model.

In [3]:
### Helper Functions for Feature Extractions ###

def split_to_words(text):
    return [word.strip('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') for word in text.split(' ')]


def has_phone_number(words):
    for word in words:
        word = re.sub('[- ]', '', word)
        if len(word) > 4 and word.isdigit():
            return 1
    return 0


def length_of_text(words):
    return len(words)


def has_url(text):
    return int(bool(re.search(WEB_URL_REGEX, text)))


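def has_math_symbol(text):
    # NOTE: inside the character class, '+-<' is a range from '+' to '<', so this
    # also counts digits and , . : ; in addition to the symbols listed.
    return len(re.findall('[+-<>/^]', text))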


def has_dots(text):
    return int('..' in text)

                
def has_special_symbol(text):
    return len(re.findall('[~!#$&*£]', text))

  
def has_emoji():  # TODO: Unimplemented as of now.
    pass

                
def has_keyword_specific(text):
    count = 0
    lowercase_text = text.lower()
    for keyword in KEYWORD_SPECIFICS:
        if keyword in lowercase_text:
            count += 1
    return count

                
def has_lowercased_word(words):
    for word in words:
        if word.islower():
            return 1
    return 0

                
def has_uppercased_word(words):
    count = 0
    for word in words:
        if word.isupper():
            count += 1
    return count
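The `has_math_symbol` helper uses the character class `[+-<>/^]`. A minimal sketch of how that class behaves compared with a variant in which the hyphen is escaped; the sample message and the two pattern names are made up for illustration:

```python
import re

# Pattern as written in the notebook: '+-<' is read as a character range
RANGE_PATTERN = '[+-<>/^]'
# Hypothetical variant: the hyphen is escaped, so only the listed symbols match
LITERAL_PATTERN = r'[+\-<>/^]'

message = "Call 0800 now, win 100!"  # made-up message without any math symbols

range_count = len(re.findall(RANGE_PATTERN, message))
literal_count = len(re.findall(LITERAL_PATTERN, message))

# The range version counts the seven digits and the comma
print(range_count, literal_count)  # 8 0
```

Switching patterns changes the "Math Symbols" feature values, so a model trained with one pattern should also be served with the same one.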
                


In [4]:
# Read in the txt file and rename columns
messages = pd.read_csv('SMSSpamCollection', header = None, delimiter = '\t')
messages.columns = ['Category', 'Text']
messages['Words'] = messages['Text'].map(split_to_words)
messages['Spam'] = messages['Category'].map(lambda x: 1 if x == 'spam' else 0)


# Display what the data currently looks like
display(messages.head(n = 10))


# Display the most common words in ham and spam messages

print("Top 30 Most Common Ham Words:")
display(pd.Series(' '.join(messages.loc[messages['Category'] == 'ham']['Text']).lower().split()).value_counts()[:30])

print("Top 30 Most Common Spam Words:")
display(pd.Series(' '.join(messages.loc[messages['Category'] == 'spam']['Text']).lower().split()).value_counts()[:30])

| | Category | Text | Words | Spam |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | [Go, until, jurong, point, crazy, Available, o... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | [Ok, lar, Joking, wif, u, oni] | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | [Free, entry, in, 2, a, wkly, comp, to, win, F... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | [U, dun, say, so, early, hor, U, c, already, t... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | [Nah, I, don't, think, he, goes, to, usf, he, ... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | [FreeMsg, Hey, there, darling, it's, been, 3, ... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | [Even, my, brother, is, not, like, to, speak, ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | [As, per, your, request, Melle, Melle, Oru, Mi... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | [WINNER, As, a, valued, network, customer, you... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | [Had, your, mobile, 11, months, or, more, U, R... | 1 |


Top 30 Most Common Ham Words:


i       2181
you     1669
to      1552
the     1125
a       1058
u        881
and      846
in       790
my       745
is       717
me       590
of       519
for      502
that     444
it       441
have     436
but      415
your     414
are      407
so       399
not      388
on       379
at       374
i'm      369
can      357
if       350
do       341
will     336
be       326
we       298
dtype: int64

Top 30 Most Common Spam Words:


to        685
a         375
call      342
your      263
you       252
the       204
for       202
or        188
free      180
2         169
is        152
ur        144
on        142
txt       136
have      135
from      127
and       122
u         117
text      112
mobile    109
with      108
claim     106
reply     101
&          98
of         95
now        93
4          93
stop       90
this       86
our        85
dtype: int64

In [5]:
# Create a new dataframe for extracted features
features = pd.DataFrame()
features['Math Symbols'] = messages['Text'].map(has_math_symbol)
features['URLs'] = messages['Text'].map(has_url)
features['Dots'] = messages['Text'].map(has_dots)
features['Special Symbols'] = messages['Text'].map(has_special_symbol)
#features['Emojis'] = messages[''].map(has_emoji)
#features['Lowercased Words'] = messages['Words'].map(has_lowercased_word)
features['Uppercased Words'] = messages['Words'].map(has_uppercased_word)
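# NOTE: has_phone_number() expects a list of words. 'Text' is a plain string, so the
# helper iterates over single characters and this column is always 0.
features['Phone Numbers'] = messages['Text'].map(has_phone_number)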
features['Keyword Specifics'] = messages['Text'].map(has_keyword_specific)
features['Message Lengths'] = messages['Words'].map(length_of_text)


# Display what the features looks like
display(features.head(n = 10))
display(features.describe())

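| | Math Symbols | URLs | Dots | Special Symbols | Uppercased Words | Phone Numbers | Keyword Specifics | Message Lengths |
|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0 | 1 | 0 | 0 | 0 | 2 | 20 |
| 1 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 6 |
| 2 | 26 | 0 | 0 | 1 | 2 | 0 | 2 | 28 |
| 3 | 6 | 0 | 1 | 0 | 2 | 0 | 1 | 11 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 13 |
| 5 | 6 | 0 | 0 | 3 | 0 | 0 | 3 | 32 |
| 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |
| 7 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 26 |
| 8 | 22 | 0 | 0 | 4 | 2 | 0 | 5 | 26 |
| 9 | 13 | 0 | 0 | 1 | 3 | 0 | 4 | 29 |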
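The "Keyword Specifics" feature counts a keyword whenever it appears anywhere in the lowercased text, including inside longer words. A minimal sketch contrasting that substring check with a whole-word check; the two function names, the short keyword list, and the sample message are assumptions made for illustration, not part of the notebook:

```python
import re

def count_keywords_substring(text, keywords):
    # Same idea as has_keyword_specific: plain substring test on lowercased text
    lowered = text.lower()
    return sum(1 for keyword in keywords if keyword in lowered)

def count_keywords_whole_word(text, keywords):
    # Hypothetical alternative: the keyword must not be glued to other word characters
    lowered = text.lower()
    count = 0
    for keyword in keywords:
        pattern = r'(?<!\w)' + re.escape(keyword) + r'(?!\w)'
        if re.search(pattern, lowered):
            count += 1
    return count

keywords = ["ad", "win", "here", "free"]          # a few entries from KEYWORD_SPECIFICS
message = "I had to rewind, where is everyone?"   # made-up ham-like message

substring_hits = count_keywords_substring(message, keywords)
whole_word_hits = count_keywords_whole_word(message, keywords)
print(substring_hits, whole_word_hits)  # 3 0
```

Here "ad", "win", and "here" match inside "had", "rewind", and "where". Whether the whole-word variant improves the classifier would need to be checked by retraining.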


| | Math Symbols | URLs | Dots | Special Symbols | Uppercased Words | Phone Numbers | Keyword Specifics | Message Lengths |
|---|---|---|---|---|---|---|---|---|
| count | 5572.0 | 5572.0 | 5572.0 | 5572.0 | 5572.0 | 5572.0 | 5572.0 | 5572.0 |
| mean | 5.235284 | 0.03051 | 0.204594 | 0.587222 | 1.064429 | 0.0 | 1.126346 | 15.70944 |
| std | 7.599901 | 0.172 | 0.403441 | 1.51949 | 2.816339 | 0.0 | 1.494744 | 11.493753 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 |
| 50% | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 |
| 75% | 6.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 23.0 |
| max | 81.0 | 1.0 | 1.0 | 54.0 | 37.0 | 0.0 | 10.0 | 171.0 |
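The "Phone Numbers" column is 0 for every row in the summary above. A minimal sketch of why, assuming the cause is that `has_phone_number` receives the raw text instead of the word list; the two helpers are copied from the notebook and the sample message is made up:

```python
import re

def split_to_words(text):
    # Same rule as the notebook helper: split on spaces, strip surrounding punctuation
    return [word.strip('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') for word in text.split(' ')]

def has_phone_number(words):
    # Same logic as the notebook helper: any token of 5 or more digits counts
    for word in words:
        word = re.sub('[- ]', '', word)
        if len(word) > 4 and word.isdigit():
            return 1
    return 0

message = "Call 08001234567 now"  # made-up example message

from_text = has_phone_number(message)                   # loops over single characters
from_words = has_phone_number(split_to_words(message))  # loops over whole words
print(from_text, from_words)  # 0 1
```

If the feature is switched to the word list, the same change is needed wherever features are built for prediction, and the model has to be retrained.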


In [6]:
# Define useful functions to help us evaluate the effectiveness of our trained model
from sklearn.metrics import confusion_matrix

def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

In [7]:
# Split the data into training and test sets for us to use when training and testing the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, messages["Spam"], stratify = messages["Spam"], test_size = 0.2)

In [8]:
from sklearn.ensemble import RandomForestClassifier

rfc=RandomForestClassifier(n_estimators = 1000, max_features = 3)
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test)

In [9]:
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("True Positive:", tp(y_test, y_pred))
print("False Positive:", fp(y_test, y_pred))
print("True Negative:", tn(y_test, y_pred))
print("False Negative:", fn(y_test, y_pred))

Accuracy: 0.9748878923766816
True Positive: 126
False Positive: 5
True Negative: 961
False Negative: 23
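Accuracy alone can hide how many spam messages slip through. As a worked example, precision, recall, and F1 can be derived from the four counts printed above; the variable names below are chosen for illustration and avoid shadowing the notebook's `tp`, `fp`, `tn`, and `fn` functions:

```python
# Counts copied from the evaluation output above
n_tp, n_fp, n_tn, n_fn = 126, 5, 961, 23

accuracy = (n_tp + n_tn) / (n_tp + n_fp + n_tn + n_fn)
precision = n_tp / (n_tp + n_fp)   # share of flagged messages that really are spam
recall = n_tp / (n_tp + n_fn)      # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9749 precision=0.9618 recall=0.8456 f1=0.9000
```

With these counts, roughly 15% of the spam messages in the test set (23 of 149) were not flagged.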


In [10]:
import pickle
import numpy as np

# Serialize the rfc object into a file called rfc.pkl on disk using pickle
with open('rfc.pkl', 'wb') as handle:
    pickle.dump(rfc, handle, pickle.HIGHEST_PROTOCOL)
# The file is opened in binary mode ('wb'), and pickle.HIGHEST_PROTOCOL
# selects the highest pickle protocol available.

In [11]:
# De-serialize the rfc.pkl file into an object called rfc using pickle

with open('rfc.pkl', 'rb') as handle:
    rfc = pickle.load(handle)
# Now we can call the model's methods as usual.
# X_test holds the extracted features for which we want to predict the output
result = rfc.predict(X_test)

In [29]:
class aps_system_predictor():
  def __init__(self):
    pass
  
  def deserialize(self):
    # De-serialize the rfc.pkl file into an object called model using pickle
    with open('rfc.pkl', 'rb') as handle:
      model = pickle.load(handle)
      return model
  
  def predict(self, message):
    model = self.deserialize()
    features = []
    
    words = split_to_words(message)
    features.append(has_math_symbol(message))
    features.append(has_url(message))
    features.append(has_dots(message))
    features.append(has_special_symbol(message))
    features.append(has_uppercased_word(words))
    features.append(has_phone_number(message))
    features.append(has_keyword_specific(message))
    features.append(length_of_text(words))
    
    return model.predict(np.array([features]))
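The `predict` method appends features in the same order as the columns of the training `features` DataFrame, and the model relies on that order. A minimal sketch of making the order explicit; `FEATURE_COLUMNS` and `to_feature_row` are hypothetical names introduced here, and the sample values are taken from row 1 of the features table:

```python
# Column order used when the features DataFrame was built for training
FEATURE_COLUMNS = [
    'Math Symbols', 'URLs', 'Dots', 'Special Symbols',
    'Uppercased Words', 'Phone Numbers', 'Keyword Specifics', 'Message Lengths',
]

def to_feature_row(values):
    """Order a {column name: value} dict like FEATURE_COLUMNS (hypothetical helper)."""
    missing = [name for name in FEATURE_COLUMNS if name not in values]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [values[name] for name in FEATURE_COLUMNS]

# The dict may be filled in any order; the output order is fixed
row = to_feature_row({
    'Message Lengths': 6, 'Dots': 1, 'URLs': 0, 'Math Symbols': 6,
    'Special Symbols': 0, 'Uppercased Words': 0,
    'Phone Numbers': 0, 'Keyword Specifics': 0,
})
print(row)  # [6, 0, 1, 0, 0, 0, 0, 6]
```

A row built this way could be wrapped as `np.array([row])` before calling `model.predict`, as the class above does.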

In [36]:
import json
import pickle
import os
from urllib.parse import unquote
from flask import Flask, jsonify, request
from flask_cors import CORS
from predictor import aps_system_predictor

app = Flask(__name__)
CORS(app)

@app.route("/predict/",methods=['GET'])
def return_price():
    message = request.args.get('message')
    message = unquote(message)
    predictor = aps_system_predictor()
    result = predictor.predict(message)
    result_dict = {
        'model':'rfc',
        'phishing': str(result[0]),
    }
    print(result_dict)
    print(jsonify(result_dict))
    return jsonify(result_dict)


@app.route("/",methods=['GET'])
def default():
    return "<h1> Welcome to APS System </h1>"

if __name__ == "__main__":
    app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
[2019-07-26 03:30:52,218] ERROR in app: Exception on /predict/ [GET]
Traceback (most recent call last):
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask\app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask\app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask_cors\extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask\app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask\_compat.py", line 35, in reraise
    raise value
  File "E:\Users\Yong He\Anaconda3\lib\site-packages\flask\app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_requ

In [20]:
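# Compare predictions with the true labels; rows marked False are misclassified
is_correct = (y_test == y_pred).sort_values(ascending=True)
print(is_correct)
# Show the full message text (None replaces the older -1 in newer pandas versions)
pd.set_option('display.max_colwidth', None)
misclassified = [1097, 2247, 2313, 4394, 491, 2863, 5443, 2965]
for i in misclassified:
    print(messages.iloc[i])
    print('===')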

1097    False
2247    False
2313    False
4394    False
491     False
2863    False
5443    False
2965    False
1460    False
4249    False
4069    False
5102    False
1625    False
3422    False
1316    False
1734    False
191     False
3094    False
2774    False
2699    False
4016    False
752     False
3742    False
3864    False
747     False
751     False
2804    False
3144    False
185     True 
3756    True 
        ...  
4763    True 
4218    True 
555     True 
325     True 
5437    True 
2968    True 
1500    True 
1869    True 
96      True 
3532    True 
2609    True 
5060    True 
4488    True 
802     True 
3241    True 
1298    True 
2887    True 
3651    True 
5495    True 
1293    True 
2685    True 
4232    True 
898     True 
1102    True 
478     True 
5080    True 
3695    True 
4805    True 
1569    True 
1211    True 
Name: Spam, Length: 1115, dtype: bool
Category    spam                                                                                            

In [37]:
import json
import os
from urllib.parse import unquote
from flask import Flask, jsonify, request
from flask_cors import CORS

import pickle

class aps_system_predictor():
    def __init__(self):
        pass

    def deserialize(self):
        # De-serialize the rfc.pkl file into an object called model using pickle
        with open('rfc.pkl', 'rb') as handle:
            model = pickle.load(handle)
        return model

    def predict(self, message):
        model = self.deserialize()
        features = []

        words = split_to_words(message)
        features.append(has_math_symbol(message))
        features.append(has_url(message))
        features.append(has_dots(message))
        features.append(has_special_symbol(message))
        features.append(has_uppercased_word(words))
        features.append(has_phone_number(message))
        features.append(has_keyword_specific(message))
        features.append(length_of_text(words))

        return model.predict(np.array([features]))

app = Flask(__name__)
CORS(app)

@app.route("/predict/",methods=['GET'])
def return_price():
    message = request.args.get('message')
    message = unquote(message)
    predictor = aps_system_predictor()
    result = predictor.predict(message)
    result_dict = {
        'model':'rfc',
        'phishing': str(result[0]),
    }
    print(result_dict)
    print(jsonify(result_dict))
    return jsonify(result_dict)


@app.route("/",methods=['GET'])
def default():
    return "<h1> Welcome to APS System </h1>"

if __name__ == "__main__":
    app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [26/Jul/2019 03:33:09] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [26/Jul/2019 03:33:13] "GET /predict/?message=hi HTTP/1.1" 200 -


{'model': 'rfc', 'phishing': '0'}
<Response 31 bytes [200 OK]>
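The `/predict/` handler calls `unquote` on the value read from `request.args`. Flask normally hands back query parameters that are already URL-decoded, so the extra call decodes the message a second time. A minimal sketch of the effect using only the standard library, with `parse_qs` standing in for the framework's own decoding (an assumption made to keep the example self-contained) and a made-up message:

```python
from urllib.parse import parse_qs, quote, unquote

original = "discount code %41BC"  # made-up SMS text containing a literal percent sign

# What a client would send: the message percent-encoded into the query string
query = "message=" + quote(original)

# First decode, standing in for what the web framework does when parsing the query
decoded_once = parse_qs(query)["message"][0]

# Second decode, as done by the extra unquote() call in the handler
decoded_twice = unquote(decoded_once)

print(decoded_once)   # discount code %41BC
print(decoded_twice)  # discount code ABC
```

Messages without a literal `%` followed by two hex digits are unaffected, which is why a request such as `?message=hi` behaves the same either way.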


In [None]:
from urllib.parse import unquote
from flask import Flask,jsonify,request
from flask_cors import CORS
import pickle
import re
import numpy as np
import pandas as pd

### Global Variables ###

KEYWORD_SPECIFICS_DEFAULT = ["100%", "#1", "$$$", "100% free", "100% satisfied", "50% off", "accept credit cards", "acceptance", "access", "accordingly", "act now", "ad", "additional income", "affordable", "all natural", "all new", "amazed", "amazing", "amazing stuff", "apply now", "apply online as seen on", "auto email removal", "avoid", "avoid bankruptcy", "bargain", "be amazed", "be your own boss", "being a member", "beneficiary", "best price", "beverage", "big bucks", "billing", "billing address", "billion", "billion dollars", "bonus boss", "brand new pager", "bulk email", "buy", "buy direct", "buying judgments cable converter", "call", "call free", "call now", "calling creditors", "can’t live without cancel", "cancel at any time", "cannot be combined with any other offer", "cards accepted", "cash", "cash bonus", "cashcashcash", "casino", "celebrity", "cell phone cancer scam", "cents on the dollar", "certified chance", "cheap", "check", "check or money order", "claims", "claims not to be selling anything claims to be in accordance with some spam law", "claims to be legal", "clearance", "click", "click below", "click here click to remove", "collect", "collect child support", "compare", "compare rates", "compete for your business confidentially on all orders", "congratulations", "consolidate debt and credit", "consolidate your debt", "copy accurately", "copy dvds costs", "credit", "credit bureaus", "credit card offers", "cures", "cures baldness deal", "dear [email/friend/somebody]", "debt", "diagnostics", "dig up dirt on friends", "direct email direct marketing", "discount", "do it today", "don’t delete", "don’t hesitate", "dormant double your", "double your cash", "double your income", "drastically reduced", "earn", "earn $ earn extra cash", "earn per week", "easy terms", "eliminate bad credit", "eliminate debt", "email harvest email marketing", "exclusive deal", "expect to earn", "expire", "explode your business", "extra extra cash", "extra income", "f r e e", 
"fantastic", "fantastic deal", "fast cash fast viagra delivery", "financial freedom", "financially independent", "for free", "for instant access", "for just $", "for only", "for you", "form", "free", "free access", "free cell phone", "free consultation", "free dvd", "free gift", "free grant money", "free hosting free info", "free installation", "free instant", "free investment", "free leads", "free membership free money", "free offer", "free preview", "free priority mail", "free quote", "free sample free trial", "free website", "freedom", "friend", "full refund", "get get it now", "get out of debt", "get paid", "get started now", "gift certificate", "give it away", "giving away", "great", "great offer", "guarantee", "have you been turned down?", "here", "hidden", "hidden assets", "hidden charges", "home home based", "home employment", "home based business", "human growth hormone", "if only it were that easy", "important information regarding in accordance with laws", "income", "income from home", "increase sales", "increase traffic", "increase your sales incredible deal", "info you requested", "information you requested", "instant", "insurance", "insurance internet market", "internet marketing", "investment", "investment decision", "it’s effective", "join millions", "junk", "laser printer", "leave", "legal", "life insurance", "lifetime", "limited", "limited time", "limited time offer", "limited time only loan", "long distance phone offer", "lose", "lose weight", "lose weight spam", "lower interest rates lower monthly payment", "lower your mortgage rate", "lowest insurance rates", "lowest price", "luxury", "luxury car mail in order form", "maintained", "make $", "make money", "marketing", "marketing solutions mass email", "medicine", "medium", "meet singles", "member", "member stuff message contains", "message contains disclaimer", "million", "million dollars", "miracle", "mlm money", "money back", "money making", "month trial offer", "more internet traffic", 
"mortgage mortgage rates", "multi-level marketing", "name brand", "never", "new customers only", "no age restrictions", "no catch", "no claim forms", "no cost", "no credit check no disappointment", "no experience", "no fees", "no gimmick", "no hidden", "no hidden costs no interests", "no inventory", "no investment", "no medical exams", "no middleman", "no obligation no purchase necessary", "no questions asked", "no selling", "no strings attached", "no-obligation", "not intended not junk", "not spam", "now", "now only", "obligation", "offshore offer", "offer expires", "once in lifetime", "one hundred percent free", "one hundred percent guaranteed", "one time one time mailing", "online biz opportunity", "online degree", "online marketing", "online pharmacy", "only only $", "open", "opportunity", "opt in", "order", "order now order shipped by", "order status", "order today", "outstanding values", "passwords", "pennies a day per day", "per week", "performance", "phone", "please read", "potential earnings pre-approved", "presently", "price", "print form signature", "print out and fax", "priority mail prize", "problem", "produced and sent out", "profits", "promise", "promise you purchase", "pure profits", "quote", "rates", "real thing", "refinance refinance home", "refund", "removal", "removal instructions", "remove", "removes wrinkles request", "requires initial investment", "reserves the right", "reverses", "reverses aging", "risk free rolex", "round the world", "safeguard notice", "sale", "sample satisfaction", "satisfaction guaranteed", "save $", "save big money", "save up to", "score score with babes", "search engine listings", "search engines", "section 301", "see for yourself", "sent in compliance serious", "serious cash", "serious only", "shopper", "shopping spree", "sign up free", "social security number", "solution", "spam", "special promotion", "stainless steel", "disclaimer statement", "stock pick", "stop", "stop snoring", "strong buy", "stuff on sale subject 
to cash", "subject to credit", "subscribe", "success", "supplies", "supplies are limited take action", "take action now", "talks about hidden charges", "talks about prizes", "teen", "tells you it’s an ad terms", "terms and conditions", "the best rates", "the following form", "they keep your money — no refund!", "they’re just giving it away this isn’t a scam", "this isn’t junk", "this isn’t spam", "this won’t last", "thousands", "time limited traffic", "trial", "undisclosed recipient", "university diplomas", "unlimited", "unsecured credit unsecured debt", "unsolicited", "unsubscribe", "urgent", "us dollars", "vacation vacation offers", "valium", "viagra", "vicodin", "visit our website", "credit card", "warranty", "we hate spam", "we honor all", "web traffic", "weekend getaway", "weight weight loss", "what are you waiting for?", "what’s keeping you?", "while supplies last", "while you sleep", "who really wins? why pay more?", "wife", "will not believe your eyes", "win", "winner", "winning won", "work from home", "xanax", "you are a winner!", "you have been selected", "your income"]
KEYWORD_SPECIFICS_CUSTOMIZED = ["process", "check out", "give away", "giveaway", "biggest", "flirt", "name", "announcement", "terms", "conditions", "channel", "install", "browse", "enjoy", "mobile content", "sexy", "age", "join", "credit", "downloads", "gender", "ringtone", "dating", "service", "babe", "wet", "porn", "cum", "goodies", "private", "prize", "reply", "topped up", "vid", "video", "pic", "chance", "sale", "cash prize", "worth", "tone", "charge", "topped up", "credit", "credits", "top up", "reply back", "text back"]
KEYWORD_SPECIFICS = KEYWORD_SPECIFICS_DEFAULT + KEYWORD_SPECIFICS_CUSTOMIZED

WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:"".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""


### Helper Functions for Feature Extractions ###

def split_to_words(text):
    return [word.strip('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') for word in text.split(' ')]


def has_phone_number(words):
    for word in words:
        word = re.sub('[- ]', '', word)
        if len(word) > 4 and word.isdigit():
            return 1
    return 0


def length_of_text(words):
    return len(words)


def has_url(text):
    return int(bool(re.search(WEB_URL_REGEX, text)))


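def has_math_symbol(text):
    # Count the math symbols in the text. The hyphen is escaped so that it is
    # matched literally; unescaped, '+-<' is a range that also matches digits
    # and the characters , . : ;
    # NOTE: a model trained on features from the unescaped pattern should be retrained.
    return len(re.findall(r'[+\-<>/^]', text))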


def has_dots(text):
    return int('..' in text)


def has_special_symbol(text):
    return len(re.findall('[~!#$&*£]', text))


def has_emoji():  # TODO: Unimplemented as of now.
    pass


def has_keyword_specific(text):
    # Count the keywords that occur as substrings of the lowercased text.
    count = 0
    lowercase_text = text.lower()
    for keyword in KEYWORD_SPECIFICS:
        if keyword in lowercase_text:
            count += 1
    return count


def has_lowercased_word(words):
    for word in words:
        if word.islower():
            return 1
    return 0


def has_uppercased_word(words):
    count = 0
    for word in words:
        if word.isupper():
            count += 1
    return count


class aps_system_predictor():
    def __init__(self):
        pass

    def deserialize(self):
        # De-serialize the rfc.pkl file into an object called model using pickle
        with open('rfc.pkl', 'rb') as handle:
            model = pd.read_pickle(handle)
            return model

    def predict(self, message):
        model = self.deserialize()
        features = []

        words = split_to_words(message)
        features.append(has_math_symbol(message))
        features.append(has_url(message))
        features.append(has_dots(message))
        features.append(has_special_symbol(message))
        features.append(has_uppercased_word(words))
        # has_phone_number expects the list of words, not the raw message string
        features.append(has_phone_number(words))
        features.append(has_keyword_specific(message))
        features.append(length_of_text(words))

        return model.predict(np.array([features]))

app = Flask(__name__)
CORS(app)

@app.route("/predict/",methods=['GET'])
def return_price():
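    message = request.args.get('message')
    if message is None:
        # Reject requests that omit the required query parameter
        return jsonify({'error': "missing 'message' query parameter"}), 400
    message = unquote(message)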
    predictor = aps_system_predictor()
    result = predictor.predict(message)
    result_dict = {
        'model':'rfc',
        'phishing': str(result[0]),
    }
    print(result_dict)
    print(jsonify(result_dict))
    return jsonify(result_dict)


@app.route("/",methods=['GET'])
def default():
    return "<h1> Welcome to APS System </h1>"

if __name__ == "__main__":
    app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [26/Jul/2019 05:12:43] "GET /predict/?message=hi HTTP/1.1" 200 -


{'model': 'rfc', 'phishing': '0'}
<Response 31 bytes [200 OK]>


127.0.0.1 - - [26/Jul/2019 05:13:26] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [26/Jul/2019 05:13:26] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [26/Jul/2019 05:52:13] "GET /predict/?message=Sorry%20man%20my%20account%27s%20dry%20or%20I%20would,%20if%20you%20want%20we%20could%20trade%20back%20half%20or%20I%20could%20buy%20some%20shit%20with%20my%20credit%20card HTTP/1.1" 200 -


{'model': 'rfc', 'phishing': '1'}
<Response 31 bytes [200 OK]>
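
The requests logged above can also be issued from another Python process. The following is a minimal client sketch, not part of the APS code itself: the helper names `build_predict_url` and `classify` are hypothetical, and `BASE_URL` assumes the development server is still listening on the address shown in the log. It uses only the standard library and percent-encodes the message in the same style as the logged requests.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the Flask development server is reachable at the address
# printed in the server log above.
BASE_URL = "http://127.0.0.1:5000"


def build_predict_url(message, base_url=BASE_URL):
    """Return the /predict/ URL for `message` (hypothetical helper).

    quote_via=quote encodes spaces as %20 rather than '+'.
    """
    query = urlencode({"message": message}, quote_via=quote)
    return f"{base_url}/predict/?{query}"


def classify(message, base_url=BASE_URL, timeout=5):
    """Send `message` to a running APS server and return the decoded JSON reply."""
    with urlopen(build_predict_url(message, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `classify("hi")` should return a dictionary with the `model` and `phishing` keys that the `/predict/` route builds; `build_predict_url` can be checked on its own without a server.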
