# Amazon Copurchased

This is a Python notebook created using "jupyter".

Author: Rafael J. P. dos Santos

## Parameters

We use the parameter below to set the maximum number of edges to be read from the CSV containing edges (links).

In [1]:
max_edges = 15000 # Set quantity to read from file
edges_csv_file = "data/20180812_links"
nodes_csv_file = "data/20180812_nodes"

## Load the libraries

Let's load the Python libraries that we will need throughout the script

In [2]:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import export_graphviz
import pydot
from sklearn.ensemble import RandomForestRegressor
from __future__ import division
import sklearn
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_validate
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.20.1.


## Read graph

### Read only first lines of datafile

Due to slowness in calculating centrality measures, we use the parameter provided in the beggining of the script to limit the number of edges we will read.

In [3]:
lines = []
total_line_count = 0
with open(edges_csv_file, 'rb') as f:
    f.readline()   # skip first line / header
    while True:
        line = f.readline()
        if not line:
            break
        if not (max_edges > 0 and len(lines) >= max_edges):
            lines.append(line)
        total_line_count += 1
G = nx.parse_edgelist(lines, delimiter=',', nodetype=int)

In [4]:
print "Using %d edges out of %d available (%.2f%% of data)" % (len(lines), total_line_count, len(lines)/total_line_count * 100)

Using 15000 edges out of 229338 available (6.54% of data)


### Calculate nodes centrality measures

Now that we have our NetworkX graph, let's calculate some centrality measures for every node.

In [5]:
centrality_measures = {}

#### Degree

In [6]:
centrality_measures["degree"] = nx.degree(G)

#### Eigenvector centrality

In [7]:
centrality_measures["eigenvector_centrality"] = nx.eigenvector_centrality_numpy(G)

####  Approximate betweenness centrality (current flow)

In [8]:
centrality_measures["betweenness_centrality"] = nx.approximate_current_flow_betweenness_centrality(G)

#### Closeness centrality

In [8]:
# Very slow!
centrality_measures["closeness_centrality"] = nx.closeness_centrality(G)

#### Betweenness centrality

In [9]:
# Very slow!
centrality_measures["betweenness_centrality"] = nx.betweenness_centrality(G)

### Load node properties

Let's load the CSV containing the nodes data (title, price) into a Pandas dataframe, and append the centrality measures calculated above.

In [9]:
df = pd.read_csv(nodes_csv_file)

In [10]:
# Add columns to dataframe
def merge_columns(dataframe, data):
    df = dataframe.copy()
    for col in data:
        rows = []
        for item in data[col].items():
            rows.append({"id": item[0], col: item[1]})
        df = df.merge(pd.DataFrame(rows))
    return df

df = merge_columns(df, centrality_measures)

### Let's convert some fields to numeric

In [11]:
categorical_features = [
    'category1',
    'category2',
    'category3',
    'category4',
    'category5',
    'category6',
    'category7',
    'category8',
    'category9',
    'category10',
    'language',
    'coverType',
    'publisher',
    'rankingCategory'
]

numeric_features = [
    'degree',
    'eigenvector_centrality',
    #'closeness_centrality',
    'betweenness_centrality',
    'ranking',
    'reviewCount',
    'pages',
    'weight',
    'height',
    'width',
    'depth',
    'rating'
]

df = df.replace("<<MISSING_DATA>>", np.NaN)
df[numeric_features] = df[numeric_features].apply(pd.to_numeric)
df[['price']] = df[['price']].apply(pd.to_numeric)

for feature in numeric_features:
    df[feature].fillna(df[feature].mean(), inplace = True)

### Remove nodes without price and outliers

In [12]:
df = df.drop(df[df["price"].isnull()].index)
#df = df.drop(df[df["price"] > 500].index)

### Inspect columns

In [13]:
df.columns

Index([u'id', u'title', u'url', u'authors', u'coverType', u'publisher',
       u'edition', u'publicationDate', u'rankingCategory', u'category1',
       u'category2', u'category3', u'category4', u'category5', u'category6',
       u'category7', u'category8', u'category9', u'category10', u'isbn10',
       u'isbn13', u'language', u'postProcessed', u'price', u'ranking',
       u'pages', u'reviewCount', u'rating', u'width', u'height', u'depth',
       u'weight', u'eigenvector_centrality', u'degree',
       u'betweenness_centrality'],
      dtype='object')

### Features summary

Below we have a summary of the Pandas dataframe. We can see the number of nodes that we are actually analyzing, which depends on the max_edges parameter defined before.

In [14]:
df.describe(include='all')

Unnamed: 0,id,title,url,authors,coverType,publisher,edition,publicationDate,rankingCategory,category1,...,pages,reviewCount,rating,width,height,depth,weight,eigenvector_centrality,degree,betweenness_centrality
count,3206.0,3206,3206,3206,3177,3179,0.0,372,2958,2918,...,3206.0,3206.0,3206.0,3206.0,3206.0,3206.0,3206.0,3206.0,3206.0,3206.0
unique,,3148,3206,2436,9,395,,310,1,1,...,,,,,,,,,,
top,,Macroeconomia,https://www.amazon.com.br/dp/8535931015/,Vários Autores (Autor),Capa comum,Companhia das Letras,,5 de fevereiro de 2014,Livros,Livros,...,,,,,,,,,,
freq,,5,1,33,2653,192,,5,2958,2918,...,,,,,,,,,,
mean,2437.63194,,,,,,,,,,...,381.442457,13.928591,4.469109,15.582625,22.543681,2.195187,446.776041,0.004749234,7.930131,0.002595
std,2136.078893,,,,,,,,,,...,254.426566,33.309719,0.527006,2.630768,2.608409,1.226045,205.155674,0.01670599,17.472513,0.007816
min,1.0,,,,,,,,,,...,4.0,1.0,1.0,2.6,10.0,0.2,18.1,5.842757e-08,1.0,0.0
25%,935.25,,,,,,,,,,...,216.0,3.0,4.473275,13.8,20.8,1.4,299.0,4.798824e-06,1.0,0.0
50%,2028.5,,,,,,,,,,...,320.0,10.0,4.473275,15.6,22.8,2.0,449.233179,1.464468e-05,2.0,0.000254
75%,3162.75,,,,,,,,,,...,472.0,13.873001,4.8,16.4,23.5,2.8,558.0,0.0002184396,5.0,0.001769


Below we can inspect the first rows of data, containing title, price, degree and other centrality measures.

In [15]:
df.head(10)

Unnamed: 0,id,title,url,authors,coverType,publisher,edition,publicationDate,rankingCategory,category1,...,pages,reviewCount,rating,width,height,depth,weight,eigenvector_centrality,degree,betweenness_centrality
0,1,The Stanford Mathematics Problem Book: With Hi...,https://www.amazon.com.br/dp/0486469247/,"George Polya (Autor),",Capa comum,Dover Publications,,19 de fevereiro de 2009,Livros,Livros,...,68.0,1.0,4.0,14.0,21.0,0.6,181.0,0.012593,19,0.006832
1,2,Fourier Series,https://www.amazon.com.br/dp/0486633179/,"Georgi P. Tolstov (Autor),",Capa comum,Dover Publications,,1 de junho de 1976,Livros,Livros,...,352.0,3.0,4.6,14.6,21.0,1.9,363.0,0.053497,62,0.0051
2,3,Probability Theory: A Concise Course,https://www.amazon.com.br/dp/0486635449/,"Y. A. Rozanov (Autor),",Capa comum,Dover Publications,,,Livros,Livros,...,160.0,13.873001,4.473275,14.4,20.8,0.8,200.0,0.034535,59,0.008281
3,4,"Vectors, Tensors and the Basic Equations of Fl...",https://www.amazon.com.br/dp/0486661105/,"Rutherford Aris (Autor),",Capa comum,Dover Publications,,,Livros,Livros,...,320.0,2.0,4.5,13.7,21.5,1.6,381.0,0.011868,18,0.005083
4,5,Ordinary Differential Equations,https://www.amazon.com.br/dp/0486649407/,"Morris Tenenbaum (Autor),",Capa comum,Dover Publications,,,Livros,Livros,...,848.0,4.0,4.5,13.8,21.8,4.0,939.0,0.074795,88,0.012716
5,6,The Variational Principles of Mechanics,https://www.amazon.com.br/dp/0486650677/,"Cornelius Lanczos (Autor),",Capa comum,Dover Publications,,,Livros,Livros,...,418.0,1.0,5.0,13.8,21.5,2.2,581.0,0.050004,43,0.009571
6,7,A First Look at Perturbation Theory,https://www.amazon.com.br/dp/0486675513/,James G. Simmonds (Autor),Capa comum,Dover Publications Inc.,,,Livros,Livros,...,160.0,13.873001,4.473275,13.7,21.5,0.8,159.0,0.008733,9,0.000871
7,8,Thermodynamics and the Kinetic Theory of Gases...,https://www.amazon.com.br/dp/0486414612/,"Wolfgang Pauli (Autor),",Capa comum,Dover Publications,,18 de outubro de 2010,Livros,Livros,...,160.0,2.0,5.0,13.8,21.6,1.0,159.0,0.004537,10,0.008962
8,9,Mechanics,https://www.amazon.com.br/dp/0486607542/,"Jacob P. Den Hartog (Autor),",Capa comum,Dover Publications,,1 de junho de 1961,Livros,Livros,...,480.0,13.873001,4.473275,13.6,20.3,2.3,522.0,0.019512,23,0.002138
9,10,Statistical Thermodynamics,https://www.amazon.com.br/dp/0486661016/,"Erwin Schrodinger (Autor),",Capa comum,Dover Publications,,,Livros,Livros,...,112.0,13.873001,4.473275,14.0,20.3,1.3,159.0,0.007206,15,0.004597


## Random forest using degree as feature, price as target

### Preparing data

In [16]:
df_with_dummies = pd.get_dummies(df[["id"] + numeric_features + categorical_features + ['price']],columns=categorical_features,drop_first=True)

In [17]:
feature_list = list(df_with_dummies.drop(columns = ['price']))
features = np.array(df_with_dummies.drop(columns = ['price']))
target = np.array(df_with_dummies['price'])

### Average price as baseline

It's important to have a baseline, so we can validate our predictions after running our model. One easy choice for baseline is the average price of a book.

We have an average price around R\$43, so this means that a very easy prediction would be to always guess R$43 for the price of any book.

In [18]:
average_target = np.average(target)
print "Average price: R$", average_target

Average price: R$ 105.26333437305053


### Cross val score

In [19]:
def baseline_score_function (self, target, predictions):
    errors_baseline = abs(average_target - target)
    return np.mean(errors_baseline)

rf = RandomForestRegressor(n_estimators = 500)

scores = cross_validate(rf, features, target, cv=10,
                        scoring = {'abs': 'neg_mean_absolute_error', 'baseline': baseline_score_function},
                        return_train_score=False, return_estimator = True)

#print "Abs: ", scores['test_abs']
print "Abs mean: ", np.mean(np.abs(scores['test_abs']))
print "Abs std: ", np.std(scores['test_abs'])

#print "Baseline: ", scores['test_baseline']
print "Baseline mean: ", np.mean(scores['test_baseline'])
print "Baseline std: ", np.std(scores['test_baseline'])

# Use best estimator to do some visual reports
rf = scores['estimator'][0]

Abs mean:  47.81947150404986
Abs std:  60.74186659554669
Baseline mean:  147.09898793005834
Baseline std:  22.031258456835374


####  List of most important features

In [20]:
importance = zip(feature_list, rf.feature_importances_)
importance.sort(key=lambda x:-x[1])
pd.DataFrame(importance).head(200)

Unnamed: 0,0,1
0,eigenvector_centrality,0.363932
1,pages,0.188581
2,coverType_Capa comum,0.066036
3,width,0.056497
4,publisher_McGraw-Hill Science/Engineering/Math,0.048197
5,height,0.026518
6,category2_Inglês e Outras Línguas,0.025607
7,id,0.024770
8,publisher_Cengage Learning,0.023942
9,betweenness_centrality,0.023584


####  Preço previsto vs. Preço real

In [21]:
plt.figure(figsize=(8,8), dpi=130)
plt.scatter(test_target, predictions, 100, alpha=0.05, edgecolors="none")
baseline = [0, np.max(test_target)]
plt.plot(baseline, baseline, "--", color="green", label = u"Preço previsto = Preço real")
ax = plt.gca()
ax.set_ylabel(u"Preço previsto (R$)")
ax.set_xlabel(u"Preço real (R$)")
ax.legend()
plt.title(u"Preço previsto vs. Preço real")
plt.axes().set_aspect('equal', 'datalim')
#plt.xlim(0, 150)
#plt.ylim(0, 150)
plt.show()

NameError: name 'test_target' is not defined

### Mean absolute error

Now we can compare the errors obtained by our predictions against the errors provided by the baseline (average price). Our prediction errors should be less than the baseline errors to consider the model successful.

In [None]:
# Calculate the absolute errors
errors = abs(predictions - test_target)
errors_baseline = abs(average_target - test_target)
# Print out the mean absolute error (mae)
print('Mean absolute prediction error: R$', round(np.mean(errors), 2))
print('Mean absolute error using average: R$',
      round(np.mean(errors_baseline), 2))

### Worst predictions

Below we can inspect the rows with the biggest prediction error.

In [None]:
pd.set_option('display.max_columns', None)
data = {
    "all_features": test_features.tolist(),
    "id": test_features[:, 0],
    "target": test_target,
    "prediction": predictions,
    "error": errors,
    "errors_baseline": errors_baseline
}
predicted_df = pd.DataFrame(data = data)
joined_predicted_df = predicted_df
joined_predicted_df = predicted_df.set_index("id").join(df.set_index("id"))
joined_predicted_df.sort_values('error', ascending = False).head(20)

### Best predictions

In [None]:
joined_predicted_df.sort_values('error', ascending = True).head(20)

In [None]:
predicted_df.describe()

### Visualize decision tree

In [28]:
# Pull out one tree from the forest
tree = rf.estimators_[0]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot',
                feature_names = feature_list, rounded = True)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

<img src="files/image.png">