# Experimentation - Exploration

In our previous notebook, "EDA", we noticed that our data is imbalanced. We are therefore facing an anomaly detection problem.
Our objective is to correctly classify the minority class of `Fake` events.

What we are going to do:
- Feature selection
- Sampling: reduce the imbalance by under-sampling the majority class `Not fake` until its size is close to that of the `Fake` class
- Model Training
- Model Evaluation

Time for experimentation.
There are no null values, so we do not need to impute missing values.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv("../data/fake_users.csv")

# "Unnamed: 0" and "UserId" are dataset artifacts, not useful for the analysis
df.drop("Unnamed: 0", axis=1, inplace=True)
df.drop("UserId", axis=1, inplace=True)

# Let's make a copy of our original dataset
df_feat = df.copy()

# A quick reminder of what the data looks like
df_feat.head()

# Featurize

Our `Event` and `Category` columns hold nominal categorical data, which has no inherent order (as opposed to ordinal categorical data). These categories must be transformed into numbers before a learning algorithm can be applied to them.

There are different encoding techniques to achieve that:
- Label Encoding: each label is converted into an integer value based on a conversion dictionary.
- One Hot Encoding: each category is mapped to a binary variable containing either 0 or 1. Here, 0 represents the absence and 1 the presence of that category.
- Hash Encoding: each category is encoded using a hash function. It is a good solution when the cardinality of the categorical column is too high.

We will use one hot encoding, because the cardinality of our categorical columns is rather low.
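
To make the difference between the first two techniques concrete, here is a minimal sketch on a toy column. The values in `toy` are hypothetical and are not taken from our dataset; only `pd.get_dummies` is what this notebook actually uses.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"Event": ["click", "send_email", "click", "click_ad"]})

# Label encoding: each category becomes an integer code.
# pandas assigns the codes following the sorted order of the categories.
label_encoded = toy["Event"].astype("category").cat.codes

# One hot encoding: one binary column per category.
one_hot = pd.get_dummies(toy, columns=["Event"], prefix=["Event"])

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding yields a single column but suggests an order between categories that does not exist for nominal data, which is one more reason to prefer one hot encoding here.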

In [None]:
df_feat = pd.get_dummies(df_feat, columns=['Event'], prefix = ['Event'])
df_feat = pd.get_dummies(df_feat, columns=['Category'], prefix = ['Category'])
df_feat.head()

In [None]:
original_df_feat = df_feat.copy()

Now let's separate our class from the rest of the data.

In [None]:
y_train = df_feat.pop("Fake")
X_train = df_feat

In [None]:
print(f'Dimension of training, X: {X_train.shape}, y: {y_train.shape}')

# Training

Now, let's proceed to the training of the models. Because our data is imbalanced, we have to decide which metrics to use to evaluate them.
We will use:
- log
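
As an illustration of why the choice of metric matters on imbalanced data, the sketch below scores a trivial classifier that always predicts the majority class. The 95/5 class split is a hypothetical example, not the ratio of our dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 95 "Not fake" (0) and 5 "Fake" (1).
y_true = np.array([0] * 95 + [1] * 5)

# A trivial classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive prediction is made.
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy: {acc:.2f}, precision: {prec:.2f}, recall: {rec:.2f}, f1: {f1:.2f}")
```

The accuracy looks high although no `Fake` event is detected, while precision, recall and F1 on the minority class all drop to zero.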

In [None]:
'''Importing the auxiliary and preprocessing libraries'''
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

'''Import the classification models we are interested in.'''
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
'''Plotly visualization.'''
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)

In [None]:
def GetBasedModel():
    basedModels = []
    
    basedModels.append(('DUM', DummyClassifier()))
    basedModels.append(('LR', Pipeline([('Scaler', StandardScaler()), 
                                   ('LR', LogisticRegression())])))
    basedModels.append(('KNN' , KNeighborsClassifier()))
    
    
    return basedModels

In [None]:
def computeMetrics(conf_matrix):
    tn, fp, fn, tp = conf_matrix.reshape(-1)
    #print(f"TP:{tp}, FP:{fp}, FN: {fn}, TN:{tn}")
    
    precision = 0 if tp+fp==0 else tp/(tp+fp)
    recall = 0 if tp+fn== 0 else tp/(tp+fn)
    f1 = 0 if precision+recall == 0 else 2*(precision*recall)/(precision+recall)
    
    print(f"precision: {precision:.2f}\nrecall: {recall:.2f}\nf1-score: {f1:.2f}")
    print('---' * 35)

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

'''Test options and evaluation metric'''
def BasedLine(X_train, y_train, models, scoring = "accuracy", num_folds = 10):

    results, names = [], []

    for name, model in models:
        kfold = StratifiedKFold(n_splits = num_folds)
        
        # Out-of-fold predictions, used to build the confusion matrix
        y_pred = cross_val_predict(model, X_train, y_train, cv=kfold)
        conf_mat = confusion_matrix(y_train, y_pred)
        
        # Per-fold scores for the chosen metric
        cv_results = cross_val_score(model, X_train, y_train, cv = kfold, scoring = scoring, n_jobs = -1)

        results.append(cv_results)
        names.append(name)
        print(f"{name}\n{pd.DataFrame(conf_mat)}\n")
        computeMetrics(conf_mat)

    return names, results

In [None]:
models = GetBasedModel()
names,results = BasedLine(X_train, y_train,models, "f1", 10)

# Sampling

There are several sampling techniques:
- Undersampling: select a subset of examples from the majority class.
- Oversampling: duplicate examples in the minority class or synthesize new examples from the examples in the minority class.
- Combinations of Techniques: applying both undersampling and oversampling techniques together.

Each technique has several implementation methods. In our case we will use Random Undersampling, which consists of randomly selecting a subset of the majority class.
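
The idea can be sketched as a small reusable function. `undersample_majority` is a hypothetical helper written for this illustration, and the toy frame below stands in for our data; the notebook itself performs the same steps inline.

```python
import pandas as pd

def undersample_majority(df, target, random_state=42):
    """Hypothetical helper: keep as many rows of each class as the rarest class has."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Concatenate the per-class samples, then shuffle the rows.
    balanced_df = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced_df.reset_index(drop=True)

# Toy data: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "Fake": [0] * 8 + [1] * 2})
balanced = undersample_majority(toy, "Fake")
print(balanced["Fake"].value_counts().to_dict())
```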

In [None]:
df_feat = original_df_feat.sample(frac=1)

nb_fake = df_feat['Fake'].value_counts()[1]
print(nb_fake)

# The fake class has 10359 rows; keep the same number of non-fake rows.
fake_df = df_feat.loc[df_feat['Fake'] == 1]
not_fake_df = df_feat.loc[df_feat['Fake'] == 0][:nb_fake]

normal_distributed_df = pd.concat([fake_df, not_fake_df], ignore_index=True)

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42, ignore_index=True)


In [None]:
sns.countplot(x='Fake', data=new_df, palette="Pastel1")
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()

In [None]:
new_y_train = new_df.pop("Fake")
new_X_train = new_df

In [None]:
models = GetBasedModel()
names,results = BasedLine(new_X_train, new_y_train,models, "f1", 5)

# Model tuning

In [None]:
seed = 44
def grid_search_cv(model, params, scoring="f1", cv = 10):    
    grid_search = GridSearchCV(estimator = model, param_grid = params, cv = cv, verbose = 1,
                             scoring = scoring, n_jobs = -1)
    grid_search.fit(new_X_train, new_y_train)
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    best_estimator = grid_search.best_estimator_
    
    return best_estimator, best_params, best_score

In [None]:
LR_model = LogisticRegression()

LR_params = [
  {'penalty': ['l1'], 'solver': [ 'saga','liblinear'], 'C': [ 0.01,0.1, 1, 10]},
  {'penalty': ['l2'], 'solver': ['newton-cg', 'sag', 'saga','lbfgs'], 'C': [ 0.01,0.1, 1, 10]},
 ]


LR_best_estimator, LR_best_params, LR_best_score= grid_search_cv(LR_model, LR_params, scoring="f1", cv= 5)
print(f"LR best params:{LR_best_params} & best_score:{LR_best_score:0.5f} / {LR_best_estimator}")

In [None]:
knears_params = {"n_neighbors": [2, 3, 4, 5], 'algorithm': ['auto', 'ball_tree', 'kd_tree'], "weights": ['uniform', 'distance']}

KNN_model = KNeighborsClassifier()

KNN_best_estimator, KNN_best_params, KNN_best_score= grid_search_cv(KNN_model, knears_params, scoring="f1", cv= 5)
print(f"KNN best params:{KNN_best_params} & best_score:{KNN_best_score:0.3f} / {KNN_best_estimator}")

In [None]:
# Reuse the results returned by grid_search_cv (an unfitted GridSearchCV has no best_* attributes)
print(KNN_best_estimator, KNN_best_params, KNN_best_score)

# Model testing

In [None]:
df_test = pd.read_csv("../data/fake_users_test.csv")
df_test.head()

In [None]:
y_test = df_test.pop("Fake")
X_test = df_test

X_test_f= df_test.iloc[:,1:]
X_test_f = pd.get_dummies(X_test_f, columns=['Event', 'Category'], prefix = ['Event', 'Category'])
X_test_f.head()
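
One point worth checking here: `pd.get_dummies` only creates columns for the categories present in the frame it receives, so the encoded test set may not have the same columns as the training set. A possible safeguard, sketched below on hypothetical toy frames, is to reindex the test frame on the training columns.

```python
import pandas as pd

# Hypothetical toy frames: the test split lacks categories seen in training.
train = pd.get_dummies(
    pd.DataFrame({"Event": ["click", "send_email", "click_ad"]}), columns=["Event"]
)
test = pd.get_dummies(pd.DataFrame({"Event": ["click", "click"]}), columns=["Event"])

# Reindex the test frame on the training columns.
# Dummy columns missing from the test frame are created and filled with 0.
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

In this notebook, the equivalent call would presumably be `X_test_f.reindex(columns=new_X_train.columns, fill_value=0)`, assuming `new_X_train` holds the encoded training features.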

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(new_X_train, new_y_train)


y_pred = knn.predict(X_test_f)
y_proba = knn.predict_proba(X_test_f)


# Overfitting Case
print('---' * 35)
print(f'Recall: {recall_score(y_test, y_pred):.2f}')
print(f'Precision: {precision_score(y_test, y_pred):.2f}')
print(f'F1 Score: {f1_score(y_test, y_pred):.2f}')
print(f'Accuracy Score: {accuracy_score(y_test, y_pred):.2f}')
print('---' * 35)

In [None]:
sub = pd.DataFrame(data=X_test)
sub['Fake_pred'] = y_pred
sub['is_fake_probability'] =  y_proba[:,1]
sub.tail()

# Conclusion

A limitation of random undersampling is that examples are removed without any concern for how useful or important they might be in determining the decision boundary between the classes.