# Amazon Copurchased

This is a Python notebook created with Jupyter.

Author: Rafael J. P. dos Santos

## Experiment conditions

* All data
* All features except network metrics

## Parameters

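The max_edges parameter below sets the maximum number of edges to read from the CSV file containing the edges (links). The remaining parameters set the input files, the feature set to use, and the number of estimators for the random forest.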

In [None]:
max_edges = 0 # Set quantity to read from file
edges_csv_file = "data/20180812_links"
nodes_csv_file = "data/20180812_nodes"
features = 'all'
n_estimators = 20

## Load the libraries

Let's load the Python libraries that we will need throughout the notebook.

In [None]:
from __future__ import division  # __future__ imports belong before any other statement
%load_ext autoreload
%autoreload 1
%aimport shared_functions
import pandas as pd
import shared_functions

## Read graph

### Read only first lines of datafile

Because calculating centrality measures is slow, we use the max_edges parameter defined at the beginning of the notebook to limit the number of edges we read.

In [None]:
G = shared_functions.read_G(edges_csv_file, max_edges)
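
The implementation of `read_G` is not shown in this notebook. As a minimal, dependency-free sketch of the idea of capping the number of edges read, the example below uses a hypothetical helper `read_edge_pairs`. It assumes the edges file is a CSV with a header row and the two endpoints in the first two columns, and that `max_edges = 0` means "no limit"; none of these assumptions is confirmed by the source.

```python
import csv
import io
from itertools import islice


def read_edge_pairs(handle, max_edges=0):
    """Return up to max_edges (source, target) pairs from an open CSV handle.

    Assumption: max_edges == 0 means "read every edge".
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the (assumed) header row
    limit = max_edges if max_edges > 0 else None
    # islice stops after `limit` rows, so the rest of the file is never parsed
    return [(row[0], row[1]) for row in islice(reader, limit)]


# Small in-memory stand-in for the real edges file
sample = io.StringIO(u"source,target\n1,2\n1,3\n2,3\n3,4\n")
edges = read_edge_pairs(sample, max_edges=2)
print(edges)
```

The resulting list of pairs could then be handed to a graph library; stopping early keeps the later centrality calculations on a smaller graph.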

### Calculate nodes centrality measures

Now that we have our NetworkX graph, let's calculate some centrality measures for every node.

In [None]:
centrality_measures = shared_functions.centrality_measures(G)
print(centrality_measures.keys())

#### Has link with node 1?

In [None]:
centrality_measures['has_link_to_node_1'] = shared_functions.has_link_to_node(G, 1)
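
The flag computed above becomes the classification target later in the notebook. How `shared_functions.has_link_to_node` builds it is not shown; the sketch below illustrates one plausible reading using a plain edge list instead of a NetworkX graph. The helper name `link_flags_for_node` is hypothetical, and the graph is assumed to be undirected.

```python
def link_flags_for_node(edges, target):
    """Map every node to 1 if it shares an edge with `target`, else 0.

    Assumption: edges are undirected, so (u, v) links u to v and v to u.
    """
    nodes = set()
    neighbours = set()
    for u, v in edges:
        nodes.update((u, v))
        if u == target:
            neighbours.add(v)
        elif v == target:
            neighbours.add(u)
    return {node: int(node in neighbours) for node in sorted(nodes)}


edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
flags = link_flags_for_node(edges, 1)
print(flags)
```

Note that in this sketch node 1 itself gets the flag 0 unless the edge list contains a self-loop.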

### Load node properties

Let's load the CSV containing the nodes data (title, price) into a Pandas dataframe, and append the centrality measures calculated above.

In [None]:
df = pd.read_csv(nodes_csv_file)

In [None]:
df = shared_functions.add_sha256_column_from_id(df)
df = shared_functions.merge_columns(df, centrality_measures)

### Let's convert some fields to numeric

In [None]:
categorical_features = [
    'category1',
    'category2',
    'category3',
    'category4',
    'category5',
    'category6',
    'category7',
    'category8',
    'category9',
    'category10',
    'language',
    'coverType',
    'publisher',
    'rankingCategory'
]

numeric_features = [
    'ranking',
    'reviewCount',
    'pages',
    'weight',
    'height',
    'width',
    'depth',
    'rating'
]

if features == 'all':
    numeric_features.extend([
        'degree',
        'eigenvector_centrality',
        'betweenness_centrality', 
    ])

if features == 'none':
    categorical_features = []
    numeric_features = []

df = shared_functions.prepare_data(df, numeric_features)

### Remove nodes without price and outliers

In [None]:
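# Drop the rows that have no price.
df = df.drop(df[df["price"].isnull()].index)
# Outlier filter (currently disabled): drop items priced above 500.
#df = df.drop(df[df["price"] > 500].index)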

### Inspect columns

In [None]:
df.columns

### Features summary

Below we have a summary of the Pandas dataframe. We can see the number of nodes that we are actually analyzing, which depends on the max_edges parameter defined before.

In [None]:
df.describe(include='all')

Below we can inspect the first rows of data, containing title, price, degree and other centrality measures.

In [None]:
df.head(10)

## Random forest using multiple features, has_link_to_node_1 as target

### Preparing data

In [None]:
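# Note: this rebinds `features` from the 'all'/'none' flag set in Parameters to the prepared feature data.
target, features, feature_list, validation_features, validation_target = shared_functions.prepare_datasets(df, numeric_features, categorical_features, 'has_link_to_node_1')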

### Cross validation

#### Run cross-validation

In [None]:
estimators, splits, scores = shared_functions.run_cross_validation_classification(features, target, n_estimators = n_estimators)

#### Cross-validation confusion matrices

In [None]:
shared_functions.plot_splits_confusion_matrices(features, target, splits, estimators, threshold = 0.5)
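
The `threshold` argument above turns predicted probabilities into class labels before the confusion matrix is counted. The sketch below shows that step in isolation with made-up data. The helper `confusion_counts` is hypothetical, and treating a probability equal to the threshold as a positive prediction is an assumption, not something the source states.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TN, FP, FN and TP after thresholding predicted probabilities."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, prob in zip(y_true, y_prob):
        # Assumption: a probability equal to the threshold counts as positive
        predicted = 1 if prob >= threshold else 0
        if truth == 1:
            counts["tp" if predicted == 1 else "fn"] += 1
        else:
            counts["fp" if predicted == 1 else "tn"] += 1
    return counts


# Illustrative labels and probabilities, not results from this notebook
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.4, 0.9, 0.2, 0.7]
at_half = confusion_counts(y_true, y_prob, threshold=0.5)
at_lower = confusion_counts(y_true, y_prob, threshold=0.3)
print(at_half)
print(at_lower)
```

Lowering the threshold from 0.5 to 0.3 in this toy data recovers the one missed positive without adding false positives, which is the kind of trade-off the choice of threshold controls.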

#### List of most important features

In [None]:
shared_functions.get_most_important_features(estimators, feature_list)

#### Predicted vs real

In [None]:
y_pred = shared_functions.get_all_predictions_from_splits(features, target, splits, estimators)
labels = {0: u'No link', 1: u'Link'}
shared_functions.plot_splits_predicted_vs_real(target, y_pred, title=u'Link probability estimated by the model vs. actual link', ylabel=u'Link probability estimated by the model', xlabel=u'Actual link (0 = absent, 1 = present)', legend = False, labels = labels)

#### Probability density

In [None]:
shared_functions.print_classification_probability_distribution(target, y_pred)

#### ROC Curve

In [None]:
closest_to_optimal_probability = shared_functions.plot_roc_curve(target, y_pred)
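
The name `closest_to_optimal_probability` suggests that `plot_roc_curve` returns the threshold whose ROC point lies nearest to the ideal corner (false positive rate 0, true positive rate 1). That reading is an assumption; the function's code is not shown. Under that assumption, and using Euclidean distance as the measure of closeness, the selection could be sketched as follows with a hypothetical helper and made-up data.

```python
import math


def closest_to_optimal_threshold(y_true, y_prob):
    """Return the threshold whose ROC point is nearest to (FPR=0, TPR=1).

    Assumes both classes are present in y_true.
    """
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    best_threshold, best_distance = None, float("inf")
    # Every distinct predicted probability is a candidate threshold
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t != 1)
        tpr = tp / float(positives)
        fpr = fp / float(negatives)
        distance = math.hypot(fpr, 1.0 - tpr)  # distance to the (0, 1) corner
        if distance < best_distance:
            best_threshold, best_distance = threshold, distance
    return best_threshold


# Illustrative labels and probabilities, not results from this notebook
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.6, 0.25, 0.9, 0.2, 0.7, 0.3, 0.8]
best = closest_to_optimal_threshold(y_true, y_prob)
print(best)
```

A threshold chosen this way weighs false positives and false negatives equally, which may or may not suit a target as imbalanced as a link to a single node.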

In [None]:
shared_functions.plot_splits_confusion_matrices(features, target, splits, estimators, threshold = closest_to_optimal_probability)