# Amazon Copurchased

This is a Python notebook created with Jupyter.

Author: Rafael J. P. dos Santos

## Experiment conditions

* All data
* All features

## Parameters

We use the parameter below to set the maximum number of edges to read from the CSV file containing the edges (links). With `max_edges = 0`, every edge is read, as the output of the graph-loading cell shows.

In [1]:
max_edges = 0 # Set quantity to read from file
edges_csv_file = "data/20180812_links"
nodes_csv_file = "data/20180812_nodes"

## Load the libraries

Let's load the Python libraries that we will need throughout the notebook.

In [2]:
%load_ext autoreload
%autoreload 1
%aimport shared_functions
import pandas as pd
import numpy as np
from __future__ import division
import shared_functions
from IPython.display import display, HTML

## Read graph

### Read only first lines of datafile

Because calculating centrality measures is slow, we use the parameter defined at the beginning of the notebook to limit the number of edges we read.

In [3]:
G = shared_functions.read_G(edges_csv_file, max_edges)

Using 229338 edges out of 229338 available (100.00% of data)
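The implementation of `shared_functions.read_G` is not shown in this notebook. As a rough, dependency-free Python 3 sketch of the idea, the snippet below limits how many edges are read from a CSV. The function name `read_edges`, the two-column layout with a header row, the undirected interpretation of links, and the plain adjacency dictionary (standing in for a NetworkX graph) are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def read_edges(csv_file, max_edges=0):
    """Read up to max_edges rows (0 = no limit) into an undirected adjacency dict."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to exist)
    limit = max_edges if max_edges > 0 else None
    adjacency = {}
    for source, target in islice(reader, limit):
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


# Hypothetical edge list standing in for the real links file.
sample = "source,target\n1,2\n1,3\n2,3\n3,4\n"

full_graph = read_edges(io.StringIO(sample))                # all 4 edges
small_graph = read_edges(io.StringIO(sample), max_edges=2)  # first 2 edges only
```

With the limit set to 2, node `4` never appears in `small_graph`, which illustrates how a low `max_edges` also reduces the number of nodes analyzed later.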


### Calculate nodes centrality measures

Now that we have our NetworkX graph, let's calculate some centrality measures for every node.

In [4]:
centrality_measures = shared_functions.centrality_measures(G)
print centrality_measures.keys()

['eigenvector_centrality', 'degree', 'betweenness_centrality']


### Load node properties

Let's load the CSV containing the nodes data (title, price) into a Pandas dataframe, and append the centrality measures calculated above.

In [5]:
df = pd.read_csv(nodes_csv_file)

#### Convert ID to a hash-derived int to avoid leaking knowledge

In [6]:
df = shared_functions.add_sha256_column_from_id(df)
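The body of `add_sha256_column_from_id` is not part of this notebook. One plausible way to derive an integer from an ID with SHA-256 is sketched below in Python 3. The function name `sha256_int_from_id` and the reduction modulo 2**32 are assumptions; the modulus was chosen only because the `sha256_id` values shown later fit in 32 bits. The result is deterministic rather than random, but it no longer carries the ordering of the original IDs.

```python
import hashlib


def sha256_int_from_id(node_id, modulus=2 ** 32):
    """Map an ID to a deterministic integer derived from its SHA-256 digest."""
    digest = hashlib.sha256(str(node_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus  # modulus is an illustrative assumption


hashed_ids = [sha256_int_from_id(node_id) for node_id in (1, 2, 3)]
```

Because consecutive IDs map to unrelated integers, a model cannot exploit the crawl order encoded in the original `id` column.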

#### Add centrality measures

In [7]:
df = shared_functions.merge_columns(df, centrality_measures)

### Let's convert some fields to numeric

In [8]:
categorical_features = [
    'category1',
    'category2',
    'category3',
    'category4',
    'category5',
    'category6',
    'category7',
    'category8',
    'category9',
    'category10',
    'language',
    'coverType',
    'publisher',
    'rankingCategory',
    'authors'
]

numeric_features = [
    'degree',
    'eigenvector_centrality',
    'betweenness_centrality',
    'ranking',
    'reviewCount',
    'pages',
    'weight',
    'height',
    'width',
    'depth',
    'rating'
]

df = shared_functions.prepare_data(df, numeric_features)
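What `prepare_data` does internally is not shown. The data preview further down repeats the same non-integer value in `reviewCount` (14.304266) and in `rating` (4.445241), which is consistent with missing values being filled with a column mean, but that is only an inference. Under that assumption, a minimal Python 3 sketch of numeric conversion followed by mean imputation could look like this (both function names are hypothetical):

```python
def to_float_or_none(value):
    """Convert a raw CSV field to float, returning None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def fill_missing_with_mean(raw_values):
    """Parse a column and replace unparseable entries with the mean of the rest."""
    parsed = [to_float_or_none(v) for v in raw_values]
    present = [v for v in parsed if v is not None]
    mean = sum(present) / len(present)  # assumes at least one parseable value
    return [mean if v is None else v for v in parsed]


# Made-up page counts with two missing entries.
pages = fill_missing_with_mean(["100", "", "300", None, "200"])
```

Here the two missing entries become 200.0, the mean of the three parseable values.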

### Remove nodes without price

In [9]:
df = df.dropna(subset=["price"])

### Inspect columns

In [10]:
df.columns

Index([u'id', u'title', u'url', u'authors', u'coverType', u'publisher',
       u'edition', u'publicationDate', u'rankingCategory', u'category1',
       u'category2', u'category3', u'category4', u'category5', u'category6',
       u'category7', u'category8', u'category9', u'category10', u'isbn10',
       u'isbn13', u'language', u'postProcessed', u'price', u'ranking',
       u'pages', u'reviewCount', u'rating', u'width', u'height', u'depth',
       u'weight', u'sha256_id', u'eigenvector_centrality', u'degree',
       u'betweenness_centrality'],
      dtype='object')

### Features summary

Below we have a summary of the Pandas dataframe. We can see the number of nodes that we are actually analyzing, which depends on the max_edges parameter defined before.

In [11]:
pd.set_option('display.max_columns', None)
df.describe(include='all', percentiles=[0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99])

Unnamed: 0,id,title,url,authors,coverType,publisher,edition,publicationDate,rankingCategory,category1,category2,category3,category4,category5,category6,category7,category8,category9,category10,isbn10,isbn13,language,postProcessed,price,ranking,pages,reviewCount,rating,width,height,depth,weight,sha256_id,eigenvector_centrality,degree,betweenness_centrality
count,9153.0,9153,9153,9152,9035,9052,0.0,783,8820,8707,8707,6525,3345,1205,188,17,0.0,0.0,0.0,9055.0,9055,9050,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0,9153.0
unique,,8952,9153,6177,11,728,,574,1,1,28,228,440,309,71,8,,,,9055.0,9055,7,,,,,,,,,,,,,,
top,,Macroeconomia,https://www.amazon.com.br/dp/8580332990/,Vários Autores (Autor),Capa comum,Companhia das Letras,,1 de janeiro de 2014,Livros,Livros,"Política, Filosofia e Ciências Sociais",Filosofia,Matemática,Brasil,Probabilidade e Estatística,Neurociência,,,,8525431494.0,978-8582600481,Português,,,,,,,,,,,,,,
freq,,6,1,106,7729,536,,13,8820,8707,1557,914,165,78,18,4,,,,1.0,1,7976,,,,,,,,,,,,,,
mean,4896.658473,,,,,,,,,,,,,,,,,,,,,,1.0,71.900469,35581.243023,352.33345,14.392313,4.443767,15.276002,22.264502,2.077165,425.82803,2122251000.0,0.003406945,31.259587,0.000995
std,2826.520995,,,,,,,,,,,,,,,,,,,,,,0.0,145.716338,45683.667822,249.857638,32.650533,0.551851,2.574086,2.590565,1.208801,201.984523,1240653000.0,0.009780235,34.801327,0.001414
min,1.0,,,,,,,,,,,,,,,,,,,,,,1.0,2.9,3.0,2.0,1.0,1.0,0.8,8.6,0.2,4.5,229177.0,6.734553e-11,1.0,0.0
25%,2481.0,,,,,,,,,,,,,,,,,,,,,,1.0,26.31,8345.0,192.0,2.0,4.4,13.6,20.8,1.3,281.0,1025729000.0,1.755986e-05,8.0,0.000188
50%,4900.0,,,,,,,,,,,,,,,,,,,,,,1.0,39.11,22980.0,304.0,10.0,4.445241,15.304075,22.6,1.8,426.421903,2113182000.0,0.0002125446,19.0,0.00052
75%,7348.0,,,,,,,,,,,,,,,,,,,,,,1.0,63.11,45609.0,432.0,14.304266,4.8,16.0,23.2,2.6,535.0,3182206000.0,0.001596866,43.0,0.001282


Below we can inspect the first rows of data, containing title, price, degree and other centrality measures.

In [12]:
df.head(10)

Unnamed: 0,id,title,url,authors,coverType,publisher,edition,publicationDate,rankingCategory,category1,category2,category3,category4,category5,category6,category7,category8,category9,category10,isbn10,isbn13,language,postProcessed,price,ranking,pages,reviewCount,rating,width,height,depth,weight,sha256_id,eigenvector_centrality,degree,betweenness_centrality
0,1,The Stanford Mathematics Problem Book: With Hi...,https://www.amazon.com.br/dp/0486469247/,"George Polya (Autor),",Capa comum,Dover Publications,,19 de fevereiro de 2009,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Matemática,Estudo e Ensino,,,,,,486469247,978-0486469249,Inglês,1,26.25,59183.0,68.0,1.0,4.0,14.0,21.0,0.6,181.0,3564330554,2.295138e-05,19,0.000756
1,2,Fourier Series,https://www.amazon.com.br/dp/0486633179/,"Georgi P. Tolstov (Autor),",Capa comum,Dover Publications,,1 de junho de 1976,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Matemática,Aplicada,Probabilidade e Estatística,,,,,486633179,978-0486633176,Inglês,1,50.37,56112.0,352.0,3.0,4.6,14.6,21.0,1.9,363.0,1309098117,4.595498e-06,62,0.000765
2,3,Probability Theory: A Concise Course,https://www.amazon.com.br/dp/0486635449/,"Y. A. Rozanov (Autor),",Capa comum,Dover Publications,,,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Matemática,Aplicada,Probabilidade e Estatística,,,,,486635449,978-0486635446,Inglês,1,29.23,44345.0,160.0,14.304266,4.445241,14.4,20.8,0.8,200.0,1260550007,5.866979e-06,59,0.001921
3,4,"Vectors, Tensors and the Basic Equations of Fl...",https://www.amazon.com.br/dp/0486661105/,"Rutherford Aris (Autor),",Capa comum,Dover Publications,,,Livros,Livros,Inglês e Outras Línguas,Engenharia e Transporte,Engenharia,Mecânica,Hidráulica,,,,,486661105,978-0486661100,Inglês,1,48.79,82275.0,320.0,2.0,4.5,13.7,21.5,1.6,381.0,4012708477,1.159003e-06,18,0.000632
4,5,Ordinary Differential Equations,https://www.amazon.com.br/dp/0486649407/,"Morris Tenenbaum (Autor),",Capa comum,Dover Publications,,,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Matemática,Aplicada,Equações Diferenciais,,,,,486649407,978-0486649405,Inglês,1,71.63,40840.0,848.0,4.0,4.5,13.8,21.8,4.0,939.0,3891707921,8.827188e-06,88,0.002868
5,6,The Variational Principles of Mechanics,https://www.amazon.com.br/dp/0486650677/,"Cornelius Lanczos (Autor),",Capa comum,Dover Publications,,,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Física,Mecânica,,,,,,486650677,978-0486650678,Inglês,1,88.15,36960.0,418.0,1.0,5.0,13.8,21.5,2.2,581.0,2030201243,8.323662e-06,43,0.001534
6,7,A First Look at Perturbation Theory,https://www.amazon.com.br/dp/0486675513/,James G. Simmonds (Autor),Capa comum,Dover Publications Inc.,,,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Matemática,Aplicada,Equações Diferenciais,,,,,486675513,978-0486675510,Inglês,1,36.1,37291.0,160.0,14.304266,4.445241,13.7,21.5,0.8,159.0,744636978,4.951852e-07,9,0.000114
7,8,Thermodynamics and the Kinetic Theory of Gases...,https://www.amazon.com.br/dp/0486414612/,"Wolfgang Pauli (Autor),",Capa comum,Dover Publications,,18 de outubro de 2010,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Física,Dinâmica,Termodinâmica,,,,,486414612,978-0486414614,Inglês,1,26.01,24406.0,160.0,2.0,5.0,13.8,21.6,1.0,159.0,425205287,1.80316e-05,10,0.000956
8,9,Mechanics,https://www.amazon.com.br/dp/0486607542/,"Jacob P. Den Hartog (Autor),",Capa comum,Dover Publications,,1 de junho de 1961,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Física,Mecânica,,,,,,486607542,978-0486607542,Inglês,1,50.96,90609.0,480.0,14.304266,4.445241,13.6,20.3,2.3,522.0,1246026773,1.837624e-06,23,0.000384
9,10,Statistical Thermodynamics,https://www.amazon.com.br/dp/0486661016/,"Erwin Schrodinger (Autor),",Capa comum,Dover Publications,,,Livros,Livros,Inglês e Outras Línguas,Ciências Tecnológicas,Física,Dinâmica,Termodinâmica,,,,,486661016,978-0486661018,Inglês,1,32.16,65112.0,112.0,14.304266,4.445241,14.0,20.3,1.3,159.0,1338518310,1.278774e-06,15,0.00053


## Random forest using various features, price as target

### Preparing data

In [13]:
target, features, feature_list, test_features, test_target = shared_functions.prepare_datasets(df, numeric_features, categorical_features, 'price')

Numeric features:  ['degree', 'eigenvector_centrality', 'betweenness_centrality', 'ranking', 'reviewCount', 'pages', 'weight', 'height', 'width', 'depth', 'rating']
Categorical features:  ['category1', 'category2', 'category3', 'category4', 'category5', 'category6', 'category7', 'category8', 'category9', 'category10', 'language', 'coverType', 'publisher', 'rankingCategory', 'authors']
Target column:  price
Test percentage:  0.200043701519
Train features shape:  (7322, 8009)
Train target shape:  (7322,)
Test features shape:  (1831, 8009)
Test target shape:  (1831,)
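The 8009 feature columns reported above are far more than the 26 numeric and categorical columns listed, which is what one would expect if every distinct categorical value becomes its own indicator column (one-hot encoding). `prepare_datasets` is not shown here, so the Python 3 sketch below only illustrates that idea; the function name `one_hot_encode` and the column-naming scheme are assumptions.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row["%s_%s" % (column, category)] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Made-up rows using two of the language values seen in the data.
books = [
    {"pages": 68, "language": "Inglês"},
    {"pages": 352, "language": "Português"},
    {"pages": 160, "language": "Inglês"},
]
encoded_books = one_hot_encode(books, "language")
```

A column such as `authors`, with thousands of distinct values, would contribute thousands of indicator columns under this scheme.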


### Average price and median price as baselines

It's important to have a baseline so we can judge our predictions after running the model. Two easy choices of baseline are the average price and the median price of a book.

The average price is around R\$72 and the median price around R\$39, so a very easy prediction would be to always guess one of these two values for the price of any book.

In [None]:
average_target = np.average(target)
median_target = np.median(target)
print "Average price: R$", average_target
print "Median price: R$", median_target

Average price: R$ 71.99866156787762
Median price: R$ 39.265
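The gap between the average and the median suggests a few expensive books pull the average up. The following Python 3 sketch, using made-up prices rather than the notebook's data, shows how a single outlier separates the two baselines and how each performs as a constant guess. The helper `mean_absolute_error` is defined here for illustration and is not the function from `shared_functions`.

```python
from statistics import mean, median

prices = [10.0, 20.0, 30.0, 40.0, 400.0]  # made-up prices with one expensive outlier

average_price = mean(prices)
median_price = median(prices)


def mean_absolute_error(actual, constant_guess):
    """Average absolute difference when always predicting the same value."""
    return sum(abs(value - constant_guess) for value in actual) / len(actual)


mae_average = mean_absolute_error(prices, average_price)
mae_median = mean_absolute_error(prices, median_price)
```

In this toy case the average is 100.0 and the median 30.0, and always guessing the median yields the lower mean absolute error (82.0 against 120.0), which is why the median is a useful second baseline.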


### Cross-validation

#### Run cross-validation

In [None]:
estimators, splits, scores = shared_functions.run_cross_validation_regression(features, target)

#### Cross-validation score

In [None]:
shared_functions.print_score_summary(scores)
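How `run_cross_validation_regression` builds its splits (number of folds, shuffling) is not shown in this notebook. As a generic illustration of k-fold splitting, the Python 3 sketch below produces contiguous, unshuffled folds; the function name `k_fold_indices` and the fold count are assumptions.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
```

Each sample lands in exactly one test fold, so every row receives a prediction from an estimator that never saw it during training.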

#### List of most important features

In [None]:
shared_functions.get_most_important_features(estimators, feature_list)

### Predicted price vs real price

In [None]:
y_pred = shared_functions.get_all_predictions_from_splits(features, target, splits, estimators)
shared_functions.plot_splits_predicted_vs_real(target, y_pred, title=u'Model-predicted price vs. real price', xlabel=u'Real price (R$)', ylabel=u'Model-predicted price (R$)', legend=u'Model-predicted price = real price', zoomY=150)

### Mean absolute error

Now we can compare the errors of our predictions against the errors of the baselines (average price and median price). The model's errors should be lower than the baseline errors for us to consider it successful.

In [None]:
errors, errors_baseline, errors_relative, errors_baseline_relative, errors_baseline_median, errors_baseline_median_relative = shared_functions.print_mean_absolute_error(y_pred, target, average_target, median_target)
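The internals of `print_mean_absolute_error` are not shown. A common way to define the two kinds of error it returns is sketched below in Python 3, assuming the relative error is the absolute error divided by the real price. The function name and the sample values are made up for illustration.

```python
def absolute_and_relative_errors(predicted, actual):
    """Return per-item absolute errors and errors relative to the actual value."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    relative = [err / a for err, a in zip(absolute, actual)]
    return absolute, relative


actual_prices = [20.0, 50.0, 100.0]     # made-up real prices
predicted_prices = [25.0, 40.0, 100.0]  # made-up model predictions

abs_errors, rel_errors = absolute_and_relative_errors(predicted_prices, actual_prices)
mean_abs_error = sum(abs_errors) / len(abs_errors)
mean_rel_error = sum(rel_errors) / len(rel_errors)
```

The relative error matters because a miss of 5 on a book that costs 20 is a much worse prediction than a miss of 10 on a book that costs 50, even though the absolute error is smaller.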

### Join data

In [None]:
pd.set_option('display.max_columns', None)
predicted_df, joined_predicted_df = shared_functions.join_predicted_df(df, features, target, y_pred, errors, errors_relative, errors_baseline, errors_baseline_relative, errors_baseline_median, errors_baseline_median_relative)

### Worst absolute predictions

Below we can inspect the rows with the biggest prediction error.

In [None]:
joined_predicted_df.sort_values('error', ascending = False).head(20)

### Worst relative predictions

In [None]:
joined_predicted_df.sort_values('error_relative', ascending = False).head(20)

### Best absolute predictions

In [None]:
joined_predicted_df.sort_values('error', ascending = True).head(20)

### Best relative predictions

In [None]:
joined_predicted_df.sort_values('error_relative', ascending = True).head(20)

### Relative errors distribution

In [None]:
centers, normalized_hist_predicted, normalized_hist_baseline, normalized_hist_baseline_median = shared_functions.plot_relative_error_distribution(predicted_df)

#### Accumulated

In [None]:
shared_functions.plot_accumulated_relative_error(centers, normalized_hist_predicted, normalized_hist_baseline, normalized_hist_baseline_median)
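The plotting helpers above are not shown, but the relationship between a normalized histogram and its accumulated version can be sketched in a few lines of Python 3. The bin edges, sample values, and the function name `normalized_histogram` are assumptions for illustration only.

```python
from itertools import accumulate


def normalized_histogram(values, bin_edges):
    """Return the fraction of values in each half-open bin [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for value in values:
        for i in range(len(counts)):
            if bin_edges[i] <= value < bin_edges[i + 1]:
                counts[i] += 1
                break
    # Values outside all bins are not counted but still enter the denominator.
    return [count / len(values) for count in counts]


relative_errors = [0.05, 0.10, 0.15, 0.30, 0.45, 0.60, 0.70, 0.90]  # made-up values
bin_edges = [0.0, 0.25, 0.5, 0.75, 1.0]

hist = normalized_histogram(relative_errors, bin_edges)
accumulated = list(accumulate(hist))
```

Reading the accumulated list at a bin's upper edge gives the share of predictions with a relative error below that edge; here 62.5% of the made-up errors fall below 0.5.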

### Joined data summary

In [None]:
joined_predicted_df.describe(percentiles=[0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99])

### Visualize decision tree

In [29]:
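rf = estimators[0]  # assumption: `rf` is not defined above, so use the first cross-validation estimator
shared_functions.render_image_first_decision_tree(rf, feature_list, 'tree-price.png')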

Output image:  tree-price.png


<img src="files/image.png">