### Neural Net Math Project Notebook

This notebook documents a supervised machine learning project for the Neural Network Mathematics class.

Group members: Luke, Akshay, Yile

#### Variable names
| Var name | Feature name | Description|
|---|---|---|
|pos | Num posts | Total number of posts the user has ever made|
|flg | Num following | Number of accounts the user follows|
|flr | Num followers | Number of accounts following the user|
|bl | Biography length | Length (number of characters) of the user's biography|
|pic | Picture availability | Value 0 if the user has no profile picture, or 1 if they have one|
|lin | Link availability | Value 0 if the user has no external URL, or 1 if they have one|
|cl | Average caption length | Average number of characters in media captions|
|cz | Caption zero | Fraction (0.0 to 1.0) of captions with almost zero (<=3 characters) length|
|ni | Non image percentage | Fraction (0.0 to 1.0) of non-image media. Instagram posts have three media types: image, video, and carousel|
|erl | Engagement rate (Like) | Engagement rate (ER) is commonly defined as (num likes) divided by (num media) divided by (num followers)|
|erc | Engagement rate (Comm.) | Same as the like-based ER, but computed from comments|
|lt | Location tag percentage | Fraction (0.0 to 1.0) of posts tagged with a location|
|hc | Average hashtag count | Average number of hashtags used in a post|
|pr | Promotional keywords | Average use of promotional keywords in hashtags, i.e. {regrann, contest, repost, giveaway, mention, share, give away, quiz}|
|fo | Followers keywords | Average use of follower-hunting keywords in hashtags, i.e. {follow, like, folback, follback, f4f}|
|cs | Cosine similarity | Average cosine similarity between all pairs of a user's posts|
|pi | Post interval | Average interval between posts (in hours)|

In [2]:
# import libraries and packages
import numpy as np
import pandas as pd
import time
from sklearn import preprocessing

In [3]:
# read the data
df_data = pd.read_csv("data/user_fake_authentic_2class.csv")
# training features size: 65326 x 17
data_x = df_data.iloc[:,:-1]

# label types: r=real and f=fake
data_y = df_data.iloc[:,-1:]
# convert to 0:fake, 1:real
data_y = data_y.replace({'class':{"r": 1, "f":0}})

In [None]:
# normalize: scale each row (user) to unit L2 norm, the default for preprocessing.normalize
norm_x = preprocessing.normalize(data_x)
norm_x = pd.DataFrame(norm_x, columns=data_x.columns)
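
With its default arguments, `preprocessing.normalize` rescales each row (one user) to unit L2 norm rather than rescaling each feature column. The sketch below reproduces that row-wise operation in plain NumPy on a made-up 2 x 2 matrix; `toy_x` and its values are illustrative only and do not come from the dataset.

```python
import numpy as np

# Hypothetical toy matrix: 2 samples (rows) x 2 features (columns).
toy_x = np.array([[3.0, 4.0],
                  [0.0, 2.0]])

# Row-wise L2 normalization: divide each row by its Euclidean length.
# (This sketch assumes no all-zero rows, which would divide by zero.)
row_norms = np.linalg.norm(toy_x, axis=1, keepdims=True)
toy_norm = toy_x / row_norms

print(toy_norm)  # rows become [0.6, 0.8] and [0.0, 1.0]
```

Because counts such as `flr` can be orders of magnitude larger than the fraction-valued features, row-wise scaling lets the largest feature dominate each row. Per-column scaling (for example `preprocessing.StandardScaler`) is a common alternative; which one suits this project is a judgment call.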

In [None]:
# svd


The logistic probability model is

$ \hat{p}(s, \theta) = [1 + \exp(-\hat{y}(s, \theta))]^{-1} $

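The linear score $\hat{y}$ is defined as:

$ \hat{y}(s, \theta) = \theta^T [s^T, 1]^T  $

Appending the constant 1 to $s$ lets the last entry of $\theta$ act as the intercept.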

In [None]:
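# logistic probability model
def y_hat(si, theta):
    # accept one sample (1-D) or a matrix of samples (one per row)
    si = np.atleast_2d(np.asarray(si, dtype=float))
    # append the constant 1 so that the last entry of theta is the intercept
    s_aug = np.column_stack([si, np.ones(si.shape[0])])
    return s_aug @ theta

def p_logistic(si, theta):
    # use division: ^ is bitwise XOR in Python, not exponentiation
    return 1 / (1 + np.exp(-y_hat(si, theta)))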
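
As a quick numerical check of the two formulas, the sketch below evaluates $\hat{p}$ for one hand-picked sample. The helper `sigmoid_prob` and the values of `s_example` and `theta_example` are hypothetical, chosen so that the linear score is exactly zero.

```python
import numpy as np

def sigmoid_prob(s, theta):
    """Return p_hat = 1 / (1 + exp(-theta^T [s^T, 1]^T)) for one sample s."""
    s_aug = np.append(s, 1.0)      # [s^T, 1]^T
    score = theta @ s_aug          # linear score y_hat
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical values: two features plus an intercept weight.
s_example = np.array([0.5, -1.0])
theta_example = np.array([2.0, 1.0, 0.0])

# Score is 2*0.5 + 1*(-1.0) + 0*1 = 0, so the probability is 0.5.
p = sigmoid_prob(s_example, theta_example)
print(p)  # 0.5
```

A score of zero sits on the decision boundary; positive scores push $\hat{p}$ toward 1 and negative scores toward 0.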

The objective function is defined as

$ c([y,s], \theta) = - y \log\hat{p}(s, \theta) - (1-y)\log(1-\hat{p}(s, \theta)) $

In [None]:
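# objective function (cross-entropy cost for one sample or an array of samples)
# p_hat is the predicted probability from p_logistic, not the linear score
def obj(yi, p_hat):
    return -yi * np.log(p_hat) - (1-yi)*np.log(1-p_hat)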
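
In floating point, $\hat{p}$ can round to exactly 0 or 1, and then `np.log` returns `-inf` and the cost becomes `nan` or `inf`. A common guard, sketched below, is to clip the probability away from the endpoints before taking logs. The function name `cross_entropy` and the tolerance `eps` are assumptions for illustration, not values from the notebook.

```python
import numpy as np

def cross_entropy(y, p_hat, eps=1e-12):
    """Cross-entropy cost with p_hat clipped into [eps, 1 - eps].

    eps is an assumed tolerance; it only matters when p_hat hits 0 or 1.
    """
    p_safe = np.clip(p_hat, eps, 1.0 - eps)
    return -y * np.log(p_safe) - (1.0 - y) * np.log(1.0 - p_safe)

# Hypothetical labels and probabilities, including a saturated p_hat = 1.0.
y_toy = np.array([1.0, 0.0, 1.0])
p_toy = np.array([0.5, 0.5, 1.0])

costs = cross_entropy(y_toy, p_toy)
print(costs)
```

Without the clip, the third entry would evaluate `0 * log(0)`, which NumPy reports as `nan`.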

The loss function is

$ l_{n}(\theta) = (1/n)\sum_{i=1}^{n} c([y_i,s_i], \theta) $

In [None]:
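# loss function
start = time.time()
def loss(theta, yi, si, n):
    # flatten labels to shape (n,) so they align with the predicted probabilities
    yi = np.asarray(yi, dtype=float).ravel()
    # mean of the per-sample costs; obj already carries the minus sign
    return (1/n) * np.sum(obj(yi, p_logistic(si, theta)))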
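
One handy sanity check when averaging the per-sample costs: at $\theta = 0$ every $\hat{p}$ equals 0.5, so the loss is $\log 2 \approx 0.693$ whatever the labels and features are. The sketch below confirms this on made-up data; `y_toy` and `s_toy` are hypothetical.

```python
import numpy as np

# Hypothetical toy labels and features; any values give the same loss at theta = 0.
y_toy = np.array([1.0, 0.0, 0.0, 1.0])
s_toy = np.array([[0.2, 1.5], [-0.4, 0.3], [1.1, -0.9], [0.0, 0.7]])

theta_zero = np.zeros(s_toy.shape[1] + 1)               # weights plus intercept
s_aug = np.column_stack([s_toy, np.ones(len(s_toy))])   # rows are [s_i^T, 1]
p_hat = 1.0 / (1.0 + np.exp(-(s_aug @ theta_zero)))     # all entries are 0.5

# Mean cross-entropy cost over the four samples.
loss_at_zero = np.mean(-y_toy * np.log(p_hat) - (1 - y_toy) * np.log(1 - p_hat))
print(loss_at_zero, np.log(2))
```

A loss that starts far from this value at a zero initialization usually points to a sign or shape error.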

The gradient equation is

$ \frac{dc_{i}}{d\theta} = -(y_i - \hat{p}_i) [s_i^{T}, 1] $, where $ \hat{p}_i = \hat{p}(s_i, \theta) $
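
A finite-difference check is a cheap way to confirm a closed-form gradient before using it in training. The sketch below compares the gradient $-(y_i - \hat{p}_i)[s_i^T, 1]$ of the cross-entropy cost with central differences for one sample; the helper names and the values of `s0`, `y0`, and `theta0` are hypothetical.

```python
import numpy as np

def cost(theta, s, y):
    """Cross-entropy cost c([y, s], theta) for a single sample."""
    p_hat = 1.0 / (1.0 + np.exp(-(theta @ np.append(s, 1.0))))
    return -y * np.log(p_hat) - (1.0 - y) * np.log(1.0 - p_hat)

def analytic_grad(theta, s, y):
    """Closed-form gradient -(y - p_hat) [s^T, 1]."""
    s_aug = np.append(s, 1.0)
    p_hat = 1.0 / (1.0 + np.exp(-(theta @ s_aug)))
    return -(y - p_hat) * s_aug

# Hypothetical sample and parameters, chosen only for illustration.
s0, y0 = np.array([0.3, -0.7]), 1.0
theta0 = np.array([0.1, 0.2, -0.3])

# Central finite differences, one coordinate of theta at a time.
h = 1e-6
numeric = np.array([
    (cost(theta0 + h * e, s0, y0) - cost(theta0 - h * e, s0, y0)) / (2 * h)
    for e in np.eye(theta0.size)
])

max_diff = np.max(np.abs(numeric - analytic_grad(theta0, s0, y0)))
print(max_diff)
```

A small `max_diff` indicates that the formula and the cost agree; a large one usually means the residual uses the linear score instead of the probability.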

In [None]:
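# gradient descent

def gradient_descent(theta, yi, si, n, max_iter = 200, gamma = 0.1):
    grad = []
    t = 0
    gradnorm = np.inf
    # flatten labels to shape (n,) and append the constant 1 to each sample
    yi = np.asarray(yi, dtype=float).ravel()
    s_aug = np.column_stack([np.asarray(si, dtype=float), np.ones(n)])
    while gradnorm >= 1e-3 and t < max_iter:
        # mean gradient of the loss: -(1/n) * sum_i (y_i - p_hat_i) [s_i^T, 1]
        p_hat = 1 / (1 + np.exp(-(s_aug @ theta)))
        gt = -(s_aug.T @ (yi - p_hat)) / n
        theta = theta - gamma*gt
        gradnorm = np.max(np.abs(gt))
        t += 1
        print(f"Iteration: {t}; gradnorm = {gradnorm}")
        grad.append(gradnorm)
    return theta, grad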
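
To see the update rule in action without loading the dataset, the sketch below runs the same kind of gradient step on synthetic data. Everything here is made up for illustration: the random features, the labelling rule, and the variable names. It is a self-contained stand-in, not the notebook's training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 2 features, labels from a known linear rule.
n = 200
s_toy = rng.normal(size=(n, 2))
y_toy = (s_toy[:, 0] + s_toy[:, 1] > 0).astype(float)

s_aug = np.column_stack([s_toy, np.ones(n)])   # rows are [s_i^T, 1]
theta = np.zeros(3)
gamma = 0.1                                    # step size

def mean_loss(theta):
    """Mean cross-entropy over the toy data, with clipping for safety."""
    p_hat = 1.0 / (1.0 + np.exp(-(s_aug @ theta)))
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)
    return np.mean(-y_toy * np.log(p_hat) - (1 - y_toy) * np.log(1 - p_hat))

loss_start = mean_loss(theta)
for t in range(200):
    p_hat = 1.0 / (1.0 + np.exp(-(s_aug @ theta)))
    grad = -(s_aug.T @ (y_toy - p_hat)) / n    # mean gradient of the loss
    theta = theta - gamma * grad
loss_end = mean_loss(theta)

# Classify as 1 when the linear score is positive (p_hat > 0.5).
accuracy = np.mean((s_aug @ theta > 0) == (y_toy == 1))
print(loss_start, loss_end, accuracy)
```

On separable toy data like this the loss should fall steadily from its starting value of $\log 2$; on the real features, the step size `gamma` and the feature scaling will likely need tuning.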