### Neural Net Math Project Notebook

This is a notebook for supervised machine learning project in Nueral Network Mathematics class. 

Group members: Luke, Akshay, Yile

#### variable names explanation:
| Var name | Feature name | Description|
|---|---|---|
|pos      | Num posts    | Number of total posts that the user has ever posted.|
|flg      | Num following | Number of following|
|flr      | Num followers | Number of followers|
|bl | Biography length | Length (number of characters) of the user's biography|
|pic | Picture availability | Value 0 if the user has no profile picture, or 1 if has|
|lin | Link availability | Value 0 if the user has no external URL, or 1 if has|
|cl | Average caption length | The average number of character of captions in media|
|cz | Caption zero | Percentage (0.0 to 1.0) of captions that has almost zero (<=3) length|
|ni | Non image percentage | Percentage (0.0 to 1.0) of non-image media. There are three types of media on an Instagram post, i.e. image, video, carousel|
|erl | Engagement rate (Like) | Engagement rate (ER) is commonly defined as (num likes) divide by (num media) divide by (num followers)|
|erc | Engagement rate (Comm.) | Similar to ER like, but it is for comments|
|lt | Location tag percentage | Percentage (0.0 to 1.0) of posts tagged with location|
|hc | Average hashtag count | Average number of hashtags used in a post|
|pr | Promotional keywords | Average use of promotional keywords in hashtag, i.e. {regrann, contest, repost, giveaway, mention, share, give away, quiz}|
|fo | Followers keywords | Average use of followers hunter keywords in hashtag, i.e. {follow, like, folback, follback, f4f}|
|cs | Cosine similarity | Average cosine similarity of between all pair of two posts a user has|
|pi | Post interval | Average interval between posts (in hours)|

The logistic probability model is

$ \hat{p}(s, \theta) = [1 + e^{-\hat{y}(s, \theta)}]^{-1} $

The $\hat{y}$ is defined as:

$ \hat{y}(s, \theta) = \theta^T [s^T 1]^T  $

The objective function is defined as

$ c([y,s], \theta) = - y  log\hat{p}(s, \theta) - (1-y)log(1-\hat{p}(s, \theta)) $

The loss function is

$ l_{n}(\theta) = -(1/n)\sum_{i=1}^{n} c([y,s], \theta) $

The gradient equation is

$ \frac{dc_{i}}{d\theta} = -(y_i - \hat{y}_i) [s_i^{T}, 1] $ 

In [1]:
# import libraries and packages
import numpy as np
import pandas as pd
import time
from sklearn import preprocessing

In [2]:
# read the data
df_data = pd.read_csv("data/user_fake_authentic_2class.csv")
# training features size: 65326 x 17
data_x = df_data.iloc[:,:-1]
print(np.shape(data_x))

# label types: r=real and f=fake
data_y = df_data.iloc[:,-1:]
# convert to 0:fake, 1:real
data_y = data_y.replace({'class':{"r": 1, "f":0}})

(65326, 17)


In [3]:
# normalize 
norm_x = preprocessing.normalize(data_x)
norm_x = pd.DataFrame(norm_x, columns=data_x.columns)

In [4]:
# svd

In [5]:
class logisticRegression:

    def __init__(self, theta, gamma = 0.0001, max_iters = 1000):
        self.gamma = gamma
        self.max_iters = max_iters
        self.theta = theta
        self.grad = None
    

    def objective_func(self, data_x, data_y):
        data_x_yhat = data_x
        data_x_yhat["y_hat"] = np.ones(len(data_y.index))
        
        # gradient descent
        grad = []
        t = 0
        gradnorm = np.inf
        while gradnorm >= 0.001 and t <= self.max_iters:
            y_hat = np.matmul(data_x_yhat, self.theta)
            # p_logistic = 1/(1 + np.exp(-1*y_hat))
            # c_ys_theta = -1 * data_y * np.log(p_logistic) - (1-data_y) * np.log(1-p_logistic)
            gradient = np.matmul(-(np.array(data_y).flatten() - y_hat).T, np.array(data_x_yhat))
            gradient_loss = gradient.flatten()
            gt = gradient_loss
            self.theta = self.theta - self.gamma*gt
            gradnorm = max(abs(gt))
            t += 1
            #print(f"Iternation: {t}; gradnorm = {gradnorm}")
            grad.append(gradnorm)
        return self.theta, grad
    def _sigmoid(self, x):
        return (1/(1+np.exp(-x)))

    def fitting(self, data_x, data_y):
        data_x_y_hat = data_x
        data_x_y_hat["y_hat"] = np.ones(len(data_y.index))
        y_hat = self._sigmoid(np.matmul(data_x_y_hat, self.theta))
        y_hat_binary = [1 if i>0.5 else 0 for i in y_hat]
        return y_hat_binary
    

In [6]:
start = time.time()
# initiate the theta
init_theta = np.zeros(len(df_data.columns))
# find the values
model = logisticRegression(theta = init_theta, gamma = 0.00001, max_iters=1000)
thetas, grad = model.objective_func(norm_x, data_y)
y_hat_all = model.fitting(data_x, data_y)

end = time.time()
def accuracy(y, y_hat):
    accuracy = np.sum(np.equal(y, y_hat))/len(y)
    return accuracy


print(f"The accuracy of the Logistic regression is {accuracy(np.array(data_y.values.tolist()).flatten(), np.array(y_hat_all))}, spending {end-start}s")


The accuracy of the Logistic regression is 0.6071242690506077, spending 1.8436508178710938s
