# Dark Social Classifier

***

## Overview

The main objective of this project is to build an effective __deep neural network__ that is capable of __classifying "dark social"__ web traffic. Coined in 2012 by __Alexis Madrigal__, a journalist at __The Atlantic__, the term is mostly used by marketers and business professionals to describe __website referrals that are barely trackable__. The scale of the issue is underlined by one [study](https://radiumone.com/wp-content/uploads/2016/08/radiumone-the-dark-side-of-mobile-sharing-June-7-2016.pdf), which reports that approximately __84% of outbound sharing from websites takes place through "dark social" channels__. One of the most common forms of dark social is a __shared URL that is spread through email and instant messaging__. Most analytics programs count such visits as "direct traffic", which gives web analysts diluted data.

The following machine learning pipeline uses __natural language processing__ to preprocess each URL and feeds the processed data into a deep neural network, which learns to identify "dark social" traffic by __analyzing the composition of the URL__ it is given. Because the available data is limited, the model is designed to catch only the most obvious "dark social" visits based on URL structure and to leave ambiguous URLs classified as "direct traffic". Furthermore, because of how the data was collected, the model is expected to be __most effective on e-commerce websites__.

### Data Used

* __dark_traffic_dataset.csv :__ A collection of page URLs from a number of popular e-commerce websites such as Amazon, Alibaba, Lazada, Zalora, and Shopee. Each URL is labeled with its dark social classification, which makes the data suitable for a supervised machine learning algorithm.

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Importing Dataset
dataset = pd.read_csv('dark_traffic_dataset.csv')

In [None]:
# Create function to clean url text
import re
from urllib.parse import urlparse

def strip_url(text):
    # Drop the scheme (e.g. "https") and the query string; keep host and path
    url = urlparse(text)
    url = url._replace(scheme = '', query = '')
    return url.geturl()

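def remove_special_characters(text):
    # Replace URL punctuation and common boilerplate tokens with spaces.
    # Note: the tokens match as substrings (the "id" in "kids" is removed too),
    # and "co" is tried before "com", so ".com" leaves a stray "m".
    return re.sub(r'[/+.?:]|-|=|html|ref|gp|www|co|id|com', ' ', text)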

def denoise_text(text):
    text = strip_url(text)
    text = remove_special_characters(text)
    return text
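
To make the effect of the cleaning step concrete, the sketch below runs a compact restatement of the helpers above on a made-up landing page (the URL is hypothetical, not taken from the dataset). It also includes `remove_special_characters_strict`, a variant introduced here purely for comparison and not part of the notebook: it removes the boilerplate tokens only when they appear as whole words. Adopting it would change the vocabulary and therefore the number of features the network receives.

```python
import re
from urllib.parse import urlparse

def strip_url(text):
    # Drop the scheme and the query string, keep host and path
    return urlparse(text)._replace(scheme='', query='').geturl()

def remove_special_characters(text):
    # Same pattern as the notebook cell above (substring matching)
    return re.sub(r'[/+.?:]|-|=|html|ref|gp|www|co|id|com', ' ', text)

def remove_special_characters_strict(text):
    # Hypothetical variant: split on punctuation first, then drop
    # the boilerplate tokens only when they are whole words
    text = re.sub(r'[/+.?:=-]', ' ', text)
    return re.sub(r'\b(?:html|ref|gp|www|com|co|id)\b', ' ', text)

# Made-up example URL, for illustration only
sample = 'https://www.example.com/gp/product/kids-running-shoes.html?ref=share'

stripped = strip_url(sample)
tokens = remove_special_characters(stripped).lower().split()
strict_tokens = remove_special_characters_strict(stripped).lower().split()

print(tokens)         # ['example', 'm', 'product', 'k', 's', 'running', 'shoes']
print(strict_tokens)  # ['example', 'product', 'kids', 'running', 'shoes']
```

The single-letter leftovers from the original pattern (`m`, `k`, `s`) are mostly harmless downstream, because scikit-learn's default tokenizer ignores one-character tokens, but the word they came from (`kids`) is lost.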

In [None]:
# Cleaning the text
# Stop words are common words that carry little information
# Stemming reduces each word to its root form
# The corpus is the collection of cleaned texts
# Sparsity is the share of zero entries in a matrix; bag-of-words
## matrices are typically very sparse
# Tokenization splits the text into tokens; each unique token
## becomes its own column
# Cleaning the text before passing it to the vectorizer, rather than
## through its parameters, gives you more options
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the set once; set membership tests are fast inside the loop
stop_words = set(stopwords.words('english'))
corpus = [] # still empty
for i in range(len(dataset)): # one entry per row of the dataset
    review = denoise_text(dataset['Landing Page'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
# Creating the bag-of-words model (TF-IDF weighted)
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
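
As a small illustration of what the vectorizer produces, the sketch below fits `TfidfVectorizer` on a three-document toy corpus (the documents are made up and only stand in for cleaned URLs). It shows that each unique token becomes one column, that one-character tokens are dropped by the default token pattern, and that each row is L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for cleaned, stemmed URLs
toy_corpus = [
    'product blue run shoe',
    'product red dress',
    'cart checkout m',   # the lone "m" mimics a leftover from cleaning
]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(toy_corpus).toarray()

# One column per unique token of two or more characters
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_toy.shape)

# Each row has unit Euclidean length under the default norm='l2'
row_norms = np.sqrt((X_toy ** 2).sum(axis=1))
print(row_norms)
```

The number of columns of the resulting matrix is the value the network's input layer has to match.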

In [None]:
# Splitting Dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
# Part 2 - Create the ANN
from keras.models import Sequential
from keras.layers import Dense

# Initializing the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
# units = number of nodes in the hidden layer, here (independent variables + 1) / 2
# kernel_initializer = how the initial weights are drawn (small values close to 0)
# activation = 'relu' (rectifier)
# input_dim = number of independent variables, i.e. columns of X (1779 in the original run)
classifier.add(Dense(units = 890, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_train.shape[1]))

# Adding Second hidden layer
classifier.add(Dense(units = 890, kernel_initializer = 'uniform', activation = 'relu'))

# Adding third hidden layer
classifier.add(Dense(units = 890, kernel_initializer = 'uniform', activation = 'relu'))

# Adding output layer
# Sigmoid suits a single binary output; for more than two classes, use softmax
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
# The optimizer decides how the best weights are searched for; Adam is a popular choice for stochastic gradient descent
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
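
The sizing rule mentioned in the comments, (independent variables + 1) / 2, and the resulting model size can be checked with a few lines of arithmetic. The sketch below assumes the layer sizes used above (1779 inputs, three hidden layers of 890 units, one output) and counts weights plus biases for each fully connected layer.

```python
n_features = 1779                      # number of TF-IDF columns in the original run
hidden_units = (n_features + 1) // 2   # sizing rule from the comments above

# Input, three hidden layers, one sigmoid output
layer_sizes = [n_features, hidden_units, hidden_units, hidden_units, 1]

# A dense layer has n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(hidden_units)       # 890
print(params_per_layer)   # [1584200, 792990, 792990, 891]
print(total_params)       # 3171071
```

With roughly 730 training rows after the 80/20 split, this is several thousand parameters per example, so comparing training and test accuracy is a sensible check for overfitting.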

In [None]:
# Fitting the ANN to the training set
# (the old `nb_epoch` argument was renamed to `epochs` in Keras 2)
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

In [None]:
# Predicting the Test result
Y_Pred = classifier.predict(X_test)
Y_Pred = (Y_Pred > 0.5)

In [None]:
# Making Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, Y_Pred)
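
The confusion matrix can be turned into the usual summary metrics. The sketch below uses made-up labels rather than the notebook's predictions, and assumes that label 1 means "dark social" and label 0 means "direct traffic" (the source does not state the encoding). scikit-learn puts true labels on the rows and predicted labels on the columns, so for a binary problem `ravel()` yields true negatives, false positives, false negatives, and true positives in that order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 1 = dark social (assumed), 0 = direct
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 1, 0, 1, 0]

cm_toy = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted dark social that is correct
recall = tp / (tp + fn)                      # actual dark social that is caught

print(cm_toy)
print(accuracy, precision, recall)
```

Because the model is meant to flag only the most obvious dark social URLs, precision and recall are likely to be more informative than accuracy alone.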