# Analysing results

Now that we have scored our images, we can start analysing the results.

At this point we should have four files in our data folder:
- active_pictures_scored.csv  - this will be our training set. It contains data from pictures used in actual ads that have been pre-scored
- 10kProducts_scored.csv      - this is our data where we want to predict the best images. These have also been pre-scored
- 25_sample_images.csv        - this is the sample data we scored in the previous exercise.
- 25_sample_images_scored.csv - this is the output data we stored in the previous exercise.

We have two goals:

1. Analyse the usefulness of the aesthetic score and other features.
2. Select a number of pictures from the prediction set to use for advertising.

In [None]:
# Required for showing plots within the notebook.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Understanding the data

Try to find out:
- Does aesthetic score correlate with any variable?
- Does high aesthetic score work better than low aesthetic score?
- Does price reflect picture quality?
- What variables would you use for ad quality?
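
As a starting point for the first two questions, here is a minimal sketch. It runs on a small synthetic frame (the `df` below is made up, not our data) that only borrows the column names `aesthetic`, `price` and `ctr`. The `numeric_only` argument assumes pandas 1.5 or newer. Swap in `train_data` or `predict_data` once they are loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; replace `df` with train_data once it is loaded.
rng = np.random.default_rng(0)
n = 200
aesthetic = rng.uniform(0, 1, n)
df = pd.DataFrame({
    "aesthetic": aesthetic,
    "price": 100 + 50 * aesthetic + rng.normal(0, 10, n),  # built to correlate
    "ctr": rng.uniform(0, 0.05, n),                        # unrelated noise
})

# Pearson correlation of every other numeric column with the aesthetic score.
corr_with_aesthetic = (
    df.corr(numeric_only=True)["aesthetic"]
    .drop("aesthetic")
    .sort_values(ascending=False)
)
print(corr_with_aesthetic)

# Mean ctr per aesthetic quartile: compares low-scoring and high-scoring images.
quartile = pd.qcut(df["aesthetic"], 4, labels=["q1", "q2", "q3", "q4"])
ctr_by_quartile = df.groupby(quartile, observed=True)["ctr"].mean()
print(ctr_by_quartile)
```

If the quartile means are roughly flat, a high aesthetic score is probably not performing better than a low one.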


Here is some Python to help you start exploring the data:

In [None]:
# Loading the data
training_data_file = './../data/active_products_scored.csv'
predict_data_file = './../data/10kProducts_scored.csv'
train_data = pd.read_csv(training_data_file)
predict_data = pd.read_csv(predict_data_file)

In [None]:
plots = predict_data.hist(bins=20)

In [None]:
# The price distribution looks odd. Restricting the histogram range gives a clearer picture.
prices = predict_data['price'].fillna(0)
prices.hist(bins=range(10, 1000, 50))
print("Max price:", prices.max())

In [None]:
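# Looking at the images with the highest aesthetic scores separately:
high_aes = predict_data.nlargest(500, 'aesthetic')
high_aes.hist(bins=20)
# Only the last expression in a cell is displayed, so print the preview explicitly
print(high_aes.head())
high_aes.plot(x='average', y='electronics', kind='scatter')
# numeric_only=True (pandas 1.5+) skips text columns, which would raise in pandas 2.x
high_aes.corr(numeric_only=True)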

# Building a model

With a sufficient understanding of the data, we can start building a model on top of it.
We would like to predict the best-performing pictures from the prediction set.

Let's build a basic model, starting with simple linear regression.

For this, we choose `aesthetic` as X and `ctr` as y. This means we will try to predict ctr from the aesthetic score.

In [None]:
train_set = train_data
predict_set = predict_data
training_feature = train_set['aesthetic']
target_feature = train_set['ctr']

Now we can fit a linear regression to the data and see what it looks like.
There is also a handy `score` method that gives a numerical estimate of how well the model fits.
We should also always validate our predicted values against the actual values.

In [None]:
# Scoring the model
X = training_feature.values.reshape(-1, 1)
y = target_feature.values.reshape(-1, 1)
model = linear_model.LinearRegression()
model.fit(X, y)
predicted_y = model.predict(X)
plt.scatter(y, predicted_y,  color='black')
print("Model score", model.score(X, y))

Our initial score is quite low at 8.64e-05. For `LinearRegression`, `score` returns the coefficient of determination (R²), so this value means the aesthetic score explains almost none of the variation in ctr.
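
The score is computed on the same rows the model was fitted on. A common way to validate predictions against actual values is to hold out part of the data. Below is a minimal sketch using scikit-learn's `train_test_split`. The arrays are synthetic noise standing in for `aesthetic` and `ctr` (an assumption, not our data), so both scores are expected to land near zero.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: ctr is pure noise, independent of aesthetic.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(1000, 1))   # plays the role of 'aesthetic'
y = rng.uniform(0, 0.05, size=1000)     # plays the role of 'ctr'

# Keep 25% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# score() returns R^2: 1.0 is a perfect fit, 0.0 is no better than always
# predicting the mean, and on held-out rows the value can be negative.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print("R^2 on training rows:", r2_train)
print("R^2 on held-out rows:", r2_test)
```

A held-out score well below the training score would suggest the model is fitting noise rather than a real relationship.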

We need to start improving our model.

The first thing we can try is to limit our regression to the most aesthetic images.

To do this, replace the lines `train_set = train_data` with `train_set = train_data[train_data.aesthetic >= 0.2]` and `predict_set = predict_data` with `predict_set = predict_data[predict_data.aesthetic >= 0.2]`

### Advanced tasks
Test out other approaches and see how much you can improve the score.
You can also try to predict performance instead of ctr.
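
One possible approach is to feed the regression more than one feature. Here is a minimal sketch, assuming the training data has a `price` column alongside `aesthetic`. The `toy` frame is synthetic, and its ctr is built to depend on both features, so the numbers say nothing about our real data.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical stand-in for train_data; ctr is built to depend on both features.
rng = np.random.default_rng(2)
n = 500
toy = pd.DataFrame({
    "aesthetic": rng.uniform(0, 1, n),
    "price": rng.uniform(10, 1000, n),
})
toy["ctr"] = (0.01 + 0.02 * toy["aesthetic"] + 0.00001 * toy["price"]
              + rng.normal(0, 0.002, n))

features = ["aesthetic", "price"]          # assumed column names
X_single = toy[["aesthetic"]].values
X_multi = toy[features].fillna(0).values   # same NaN handling as the price cell
y_multi = toy["ctr"].values

single = linear_model.LinearRegression().fit(X_single, y_multi)
multi = linear_model.LinearRegression().fit(X_multi, y_multi)

r2_single = single.score(X_single, y_multi)
r2_multi = multi.score(X_multi, y_multi)
print("R^2, aesthetic only:", r2_single)
print("R^2, aesthetic + price:", r2_multi)
```

On training rows, adding a feature can never lower the R² of a least-squares fit, so compare the models on held-out data before trusting an improvement.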

# Selecting a number of pictures from 10k products

Our best guess currently is purely random. Let's see if we can improve it.

In [None]:
NUM_OF_PRODUCTS = 10

# Purely random choice from the prediction data
# sample = predict_data.sample(NUM_OF_PRODUCTS)

Let's use our trained model from above to see which pictures it would suggest.

In [None]:
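# drop=True avoids adding the old index as an extra column each time the cell is re-run
predict_set = predict_set.reset_index(drop=True)
X = predict_set['aesthetic'].values.reshape(-1, 1)
# predict() returns an (n, 1) array because y was 2-D; flatten it into one column
predict_set['predicted_ctr'] = model.predict(X).ravel()
predict_set.nlargest(NUM_OF_PRODUCTS, 'predicted_ctr')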

Manual analysis led us to look at a combination of parameters.
Here is a selection that keeps only products where both the price and the aesthetic value are high enough.
It then samples randomly from those products, using the given `average` scores as sampling weights.
You can compare the result to the selection above.

In [None]:
# Combined selection by parameters
# Keep only pictures with both a higher price and a higher aesthetic value.
price_aes = predict_data[(predict_data.price > 100) & (predict_data.aesthetic > 0.2)]
# Rows with a missing 'average' get zero weight and are never drawn.
selected_products = price_aes.sample(NUM_OF_PRODUCTS, weights=price_aes['average'])
selected_products
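
To see what the `weights` argument of `DataFrame.sample` does, here is a small illustration on a made-up frame. The `product` column and its values are hypothetical. Drawing with replacement many times makes the weighting visible in the counts.

```python
import pandas as pd

# Hypothetical toy frame; 'average' plays the role of the sampling weight.
toy = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "average": [0.0, 0.1, 0.3, 0.6],
})

# pandas normalises the weights to sum to 1; a weight of 0 means "never drawn".
draws = toy.sample(n=2000, replace=True, weights=toy["average"], random_state=0)
counts = draws["product"].value_counts()
print(counts)
```

Products with a higher `average` are drawn more often, but lower-scored products still have a chance. This keeps some variety in the selection, unlike `nlargest`.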