# Amazon Copurchased

This is a Python notebook created with Jupyter.

Author: Rafael J. P. dos Santos

## Parameters

We use the parameter below to set the maximum number of edges to be read from the CSV containing edges (links).

In [1]:
max_edges = 20000 # Set to 0 to read all edges from file

## Load the libraries

Let's load the Python libraries that we will need throughout the notebook.

In [2]:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import export_graphviz
import pydot
from sklearn.ensemble import RandomForestRegressor

## Read graph

### Read only first lines of datafile

Because the centrality measures are slow to compute, we use the parameter defined at the beginning of the notebook to limit the number of edges we read.

In [3]:
lines = []
with open('data/links', 'rb') as f:
    f.readline()   # skip first line / header
    while True:
        line = f.readline()
        if not line or (max_edges > 0 and len(lines) >= max_edges):
            break
        lines.append(line)
G = nx.parse_edgelist(lines, delimiter=',', nodetype=int)
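The same bounded read can be sketched with `itertools.islice`, which stops after a fixed number of lines without a manual loop. This is only a minimal sketch: the helper name `read_edges` and the in-memory sample (including its `source,target` header) are assumptions made for illustration, and the input is treated as text rather than bytes.

```python
import io
from itertools import islice

import networkx as nx


def read_edges(handle, max_edges=0):
    """Build a graph from a CSV edge list, skipping the header line."""
    next(handle, None)  # skip the header line
    # islice(handle, None) reads everything, matching "0 means no limit"
    limit = max_edges if max_edges > 0 else None
    lines = [line.strip() for line in islice(handle, limit)]
    return nx.parse_edgelist(lines, delimiter=',', nodetype=int)


# Hypothetical sample with four edges; only the first three are read.
sample = io.StringIO(u"source,target\n1,2\n2,3\n3,4\n4,1\n")
G_small = read_edges(sample, max_edges=3)
```

With the real file, the handle would come from opening `data/links` in text mode.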

### Calculate node centrality measures

Now that we have our NetworkX graph, let's calculate some centrality measures for every node.

#### Degree

In [4]:
degrees = nx.degree(G)

#### Eigenvector centrality

In [5]:
ec = nx.eigenvector_centrality(G)

#### Closeness centrality

In [6]:
# Very slow!
cc = nx.closeness_centrality(G)

#### Betweenness centrality

In [7]:
# Very slow!
bc = nx.betweenness_centrality(G)
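To get a feel for what these measures report, here is a small sketch on a toy graph (a three-node path, which is an assumption for illustration and not part of the dataset). It also shows the `k` parameter of `nx.betweenness_centrality`, which estimates betweenness from a sample of source nodes and may help when the exact computation is too slow.

```python
import networkx as nx

# Toy graph (an assumption for illustration): a path 0 - 1 - 2.
P = nx.path_graph(3)

# Number of neighbours of each node
degree = dict(nx.degree(P))
# Inverse of the mean shortest-path distance to the other nodes
closeness = nx.closeness_centrality(P)
# Normalized share of shortest paths that pass through each node
betweenness = nx.betweenness_centrality(P)

# Estimate betweenness from k sampled source nodes instead of all of them;
# the seed only makes the sampling repeatable.
approx = nx.betweenness_centrality(P, k=2, seed=0)
```

Node 1 sits between the other two nodes, so it scores highest on degree, closeness and betweenness. On the real graph, a `k` much smaller than the number of nodes trades accuracy for speed.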

### Load node properties

Let's load the CSV containing the nodes data (title, price) into a Pandas dataframe, and append the centrality measures calculated above.

In [8]:
df = pd.read_csv('data/nodes')
df['degree'] = None
df['eigenvector_centrality'] = None
df['closeness_centrality'] = None
df['betweenness_centrality'] = None
for index, row in df.iterrows():
    try:
        df.loc[index, 'degree'] = degrees[row['id']]
        df.loc[index, 'eigenvector_centrality'] = ec[row['id']]
        df.loc[index, 'closeness_centrality'] = cc[row['id']]
        df.loc[index, 'betweenness_centrality'] = bc[row['id']]
    except KeyError:
        # The node is absent from the (possibly truncated) graph, so drop its row
        df.drop([index], inplace=True)
features = ['degree', 'eigenvector_centrality',
    'closeness_centrality', 'betweenness_centrality']
df[features] = df[features].apply(pd.to_numeric)
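The row-by-row loop can also be written as a column-wise lookup. The sketch below assumes a centrality result that behaves like a plain dict keyed by node id; the frame `nodes` and the dict `degree_by_node` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for the node table and one centrality dict.
nodes = pd.DataFrame({'id': [1, 2, 3], 'price': [10.0, 20.0, 30.0]})
degree_by_node = {1: 4, 3: 1}  # node 2 is absent from the graph

# map() looks each id up in the dict and yields NaN when it is missing.
nodes['degree'] = nodes['id'].map(degree_by_node)

# Keep only the nodes that received a centrality value.
nodes = nodes.dropna(subset=['degree'])
```

Because the missing values are NaN rather than `None`, the column is already numeric and no separate conversion step is needed.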

### Features summary

Below we have a summary of the Pandas dataframe. We can see the number of nodes that we are actually analyzing, which depends on the max_edges parameter defined before.

In [9]:
df.describe(include='all')

Unnamed: 0,id,title,price,degree,eigenvector_centrality,closeness_centrality,betweenness_centrality
count,2884.0,2884,2884.0,2884.0,2884.0,2884.0,2884.0
unique,,2827,,,,,
top,,Fedro,,,,,
freq,,3,,,,,
mean,1920.887656,,43.641456,11.446602,0.00624455,0.27049,0.000955
std,1447.029989,,31.423349,26.264145,0.01754574,0.033969,0.00404
min,1.0,,4.9,1.0,2.311803e-07,0.190259,0.0
25%,819.75,,24.9675,1.0,3.572957e-05,0.24323,0.0
50%,1717.5,,36.1,2.0,0.0002452775,0.265103,0.0
75%,2674.25,,52.5775,8.0,0.002637485,0.29229,7.5e-05


Below we can inspect the first rows of data, containing title, price, degree and other centrality measures.

In [10]:
df.head(10)

Unnamed: 0,id,title,price,degree,eigenvector_centrality,closeness_centrality,betweenness_centrality
0,1,A Política,38.9,41,0.009902,0.368105,0.007037
1,2,História da Filosofia Grega e Romana. Aristóte...,36.5,67,0.011528,0.345435,0.012753
2,3,Ordem e História. A Era Ecumênica - Volume 4,112.43,39,0.021038,0.324297,0.001071
3,4,Física I-II,44.8,40,0.007521,0.339376,0.015963
4,5,A Estrutura das Revoluções Cientificas,46.02,35,0.000645,0.306996,0.011153
5,6,Discurso de Metafísica,28.37,32,0.009921,0.318177,0.001308
6,7,Do Cidadão,61.16,9,0.000324,0.282426,0.000271
7,8,Sobre A Revolução,55.4,32,0.003971,0.320013,0.012883
8,9,Ética a Nicômaco,13.9,111,0.029438,0.40333,0.076011
9,10,Da Justiça,31.19,53,0.025993,0.340378,0.013198


## Random forest using centrality measures as features, price as target

### Preparing data

In [11]:
feature_list = list(df[features].columns)
features = np.array(df[features])
target = np.array(df['price'])

### Average price as baseline

It's important to have a baseline, so we can validate our predictions after running our model. One easy choice for baseline is the average price of a book.

The average price is around R\$43, so a trivial prediction would be to always guess R\$43 for the price of any book.

In [12]:
average_target = np.average(target)
print "Average price: R$", average_target

Average price: R$ 43.6414563107


### Training data split

Let's split our dataset into two sets: train and test. We use the first to train our model, and the second to measure how accurate its predictions are on data it has not seen.

In [13]:
# Split the data into training and testing sets
train_features, test_features, train_target, test_target \
    = train_test_split(features, target, test_size = 0.25)

# Summary
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_target.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_target.shape)

('Training Features Shape:', (2163, 4))
('Training Labels Shape:', (2163,))
('Testing Features Shape:', (721, 4))
('Testing Labels Shape:', (721,))
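The split above is random, so the error figures change from run to run. A minimal sketch of a repeatable split follows, using the `random_state` parameter of `train_test_split`; the synthetic arrays and the value 42 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 8 samples with 4 features each.
X = np.arange(32).reshape(8, 4)
y = np.arange(8)

# The same random_state yields the same partition on every call.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42)
```

With `test_size=0.25`, a quarter of the rows go to the test set, which matches the 2163 / 721 split shown above for 2884 nodes.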


### Train data

We train a random forest model with 500 estimators.

In [14]:
# Load model
rf = RandomForestRegressor(n_estimators = 500)
# Train
rf.fit(train_features, train_target);

### Prediction

In [15]:
predictions = rf.predict(test_features)

### Mean absolute error

Now we can compare the errors obtained by our predictions against the errors provided by the baseline (average price). Our prediction errors should be less than the baseline errors to consider the model successful.

In [16]:
# Calculate the absolute errors
errors = abs(predictions - test_target)
errors_baseline = abs(average_target - test_target)
# Print out the mean absolute error (mae)
print('Mean absolute prediction error: R$', round(np.mean(errors), 2))
print('Mean absolute error using average: R$',
      round(np.mean(errors_baseline), 2))

('Mean absolute prediction error: R$', 19.99)
('Mean absolute error using average: R$', 19.33)
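The same comparison can be sketched with scikit-learn's `mean_absolute_error` and `DummyRegressor`. Note one difference from the cells above: the notebook's baseline is the average over all prices, test rows included, whereas this sketch takes the mean from the training prices only. All numbers below are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices (an assumption), not values from the dataset.
train_prices = np.array([20.0, 40.0, 60.0])
test_prices = np.array([30.0, 50.0])
model_predictions = np.array([35.0, 45.0])  # stand-in for rf.predict(...)

# Baseline that always predicts the mean of the training prices.
# DummyRegressor ignores the feature values, so zeros are enough here.
baseline = DummyRegressor(strategy='mean')
baseline.fit(np.zeros((len(train_prices), 1)), train_prices)
baseline_predictions = baseline.predict(np.zeros((len(test_prices), 1)))

mae_model = mean_absolute_error(test_prices, model_predictions)
mae_baseline = mean_absolute_error(test_prices, baseline_predictions)
```

A model is only useful if `mae_model` comes out below `mae_baseline`.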


### List a few target vs. predicted

Below we can inspect the rows with the biggest prediction error.

In [17]:
data = {
    "degree": test_features.tolist(),
    "target": test_target,
    "prediction": predictions,
    "error": errors,
    "errors_baseline": errors_baseline
}
predicted_df = pd.DataFrame(data = data)
predicted_df.sort_values('error', ascending = False).head(10)

Unnamed: 0,degree,error,errors_baseline,prediction,target
713,"[4.0, 0.00286559083722, 0.279577191621, 1.1456...",170.321038,166.058544,39.378962,209.7
690,"[1.0, 1.32625312538e-06, 0.207499640132, 0.0]",148.541293,147.338544,42.438707,190.98
495,"[1.0, 0.00137877097198, 0.262640065592, 0.0]",139.270938,136.278544,40.649062,179.92
691,"[1.0, 8.62011831887e-06, 0.236680075527, 0.0]",124.367733,151.478544,70.752267,195.12
420,"[5.0, 0.000641187697288, 0.2862959285, 1.40914...",123.946115,129.058544,48.753885,172.7
515,"[5.0, 0.00596870198413, 0.281597968353, 3.7733...",121.72158,116.258544,38.17842,159.9
1,"[7.0, 0.0002802827206, 0.284375616492, 0.00012...",111.417816,25.841456,129.217816,17.8
600,"[19.0, 0.00967963147015, 0.291565533981, 0.000...",110.5809,113.158544,46.2191,156.8
188,"[1.0, 1.20514736417e-05, 0.23056621881, 0.0]",107.90612,133.338544,69.07388,176.98
292,"[2.0, 2.42195251481e-05, 0.237616418033, 0.0]",106.550282,18.741456,131.450282,24.9
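In the table above, the `degree` column actually holds the whole feature vector of each row. A possible alternative, sketched below with hypothetical stand-in arrays and only two features, is to give every feature its own named column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for feature_list, test_features,
# test_target and predictions.
names = ['degree', 'closeness_centrality']
X_test = np.array([[4.0, 0.28], [1.0, 0.21]])
y_test = np.array([50.0, 30.0])
y_pred = np.array([42.0, 33.0])

# One named column per feature instead of a list packed into a single cell.
result = pd.DataFrame(X_test, columns=names)
result['target'] = y_test
result['prediction'] = y_pred
result['error'] = (result['prediction'] - result['target']).abs()

# Rows with the largest absolute error come first.
worst_first = result.sort_values('error', ascending=False)
```

This layout makes it possible to sort or filter by an individual feature when looking for patterns in the largest errors.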


In [18]:
predicted_df.describe()

Unnamed: 0,error,errors_baseline,prediction,target
count,721.0,721.0,721.0,721.0
mean,19.990947,19.32579,44.313799,42.666325
std,21.362225,19.813392,15.131511,27.669897
min,0.04326,0.061456,18.76552,5.99
25%,6.81819,8.158544,34.19958,24.8
50%,14.58056,15.301456,42.007564,35.84
75%,25.093298,24.531456,50.931186,51.9
max,170.321038,166.058544,131.450282,209.7


### Visualize decision tree

In [19]:
# Pull out one tree from the forest
tree = rf.estimators_[0]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot',
                feature_names = feature_list, rounded = True)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

<img src="files/image.png">
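A full tree from the forest can be too large to read. As a sketch, `export_graphviz` accepts a `max_depth` argument that truncates the drawing, and with `out_file=None` it returns the DOT source as a string instead of writing a file. The tiny training set and the two feature names below are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Tiny synthetic training set (an assumption for illustration).
X = [[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4]]
y = [10.0, 20.0, 30.0, 40.0]
forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string;
# max_depth limits how many levels of the tree are drawn.
dot_source = export_graphviz(forest.estimators_[0], out_file=None,
                             feature_names=['degree', 'closeness_centrality'],
                             max_depth=2, rounded=True)
```

The returned string can then be rendered with Graphviz or passed to pydot in the same way as the `tree.dot` file above.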