# Products

The following analysis of Products is similar to that of Sellers.
Our goal is to find products that repeatedly underperform others, and to understand why.
This will help us shape our recommendations on how to improve Olist's profit margin.

## 0 - Code `get_training_data` in olist/product.py

Create the `get_training_data` method in `olist/product.py` that will return the following DataFrame:

  - `product_id` (_str_) _the id of the product_
  - `category` (_str_) _the category name (in English)_
  - `height` (_float_) _height of the product (in cm)_
  - `width` (_float_) _width of the product (in cm)_
  - `length` (_float_) _length of the product (in cm)_
  - `weight` (_float_) _weight of the product (in g)_
  - `price` (_float_) _average price at which the product is sold_
  - `freight_value` (_float_) _average value of freight_
  - `product_name_length` (_float_) _character length of product name_
  - `product_description_length` (_float_) _character length of product description_
  - `n_orders` (_int_) _the number of orders in which the product appeared_
  - `quantity` (_int_) _the total number of units of the product sold_
  - `wait_time` (_float_) _the average wait time in days for orders in which the product was sold._
  - `share_of_five_stars` (_float_) _the share of five-star reviews among orders in which the product was sold_
  - `share_of_one_stars` (_float_) _the share of one-star reviews among orders in which the product was sold_
  - `review_score` (_float_) _the average review score of the orders in which the product was sold_
  
Feel free to code all the intermediary methods below if you prefer to break the problem down step by step.

✅ Once your logic is encoded, commit and push your new file `product.py`

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

### get_product_features(self):
        """
        Returns a DataFrame with:
       'product_id', 'product_category_name', 'product_name_lenght',
       'product_description_lenght', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm'
        """

### get_price(self):
        """
        Returns a DataFrame with:
        'product_id', 'price'
        """

### get_wait_time(self):
        """
        Returns a DataFrame with:
        'product_id', 'wait_time'
        """

### get_review_score(self):
        """
        Returns a DataFrame with:
        'product_id', 'share_of_five_stars', 'share_of_one_stars',
        'avg_review_score'
        """

### get_quantity(self):
        """
        Returns a DataFrame with:
        'product_id', 'n_orders', 'quantity'
        """

## 1 - Product category analysis

Let's start by looking at the performance of product categories:

Create a DataFrame that aggregates all product properties for each product category.  
Use sum for `quantity` and the aggregation function of your choice for all other properties.  For instance:

  - `quantity` (sum)
  - `wait_time` (median)
  - `review_score` (median)
  - `price` (median)
  - ....

Store this logic in the method `get_product_cat(self, agg="median")` in `olist/product.py` for later use.
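
One possible shape for this aggregation is sketched below as a standalone function on a made-up `products` frame. The real method takes `self` and would work on the output of `get_training_data`; the function signature and the toy values here are assumptions for illustration only.

```python
import pandas as pd

# Toy product-level data (values are made up)
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "category": ["toys", "toys", "furniture"],
    "price": [10.0, 30.0, 200.0],
    "wait_time": [5.0, 7.0, 20.0],
    "review_score": [4.5, 4.0, 3.0],
    "quantity": [3, 1, 2],
})

def get_product_cat(products, agg="median"):
    """Aggregate numeric columns per category: `agg` everywhere, sum for quantity."""
    numeric_cols = products.select_dtypes(include="number").columns
    agg_params = {col: agg for col in numeric_cols}
    agg_params["quantity"] = "sum"  # quantities add up across products
    return products.groupby("category").agg(agg_params)

product_cat = get_product_cat(products)
print(product_cat)
```

Passing `agg="mean"` swaps the aggregation for every column except `quantity`, which stays a sum.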

In [34]:
# Your code below

----
❓ Plot one histogram per feature, all in one figure, and look for features with outliers

❓ Using plotly, create a scatterplot of `review_score` against `n_orders`, varying bubble size by total `sales` for that category

- Do you notice underperforming product categories?
- Experiment with other x-axis features (`wait_time` for instance) and with `share_of_one_stars` instead of `review_score` as y-axis
- Remember that Olist earns revenue proportional to sale prices and incurs a cost penalty for each low review.
- Can you think of a strategy to improve Olist's profit margin, as per the CEO's request? (keep it in mind for later!)

## 2 - Regress review_score on product categories


We have seen that some product categories, such as furniture, correlate with both a higher `wait_time` and a lower `review_score`.

Can we isolate the true contribution of each product category to customer satisfaction by holding `wait_time` constant?

Using `statsmodels.formula.api`, run an OLS regression to model `review_score`

- Which dataset should you use for this regression? `product_cat` or the entire `products` training dataset?

- Which regressors / independent variables / features should you use? 

Investigate the results: which product categories correlate with a higher `review_score`, holding `wait_time` constant?

Feel free to use `return_significative_coef(model)` coded for you in `olist/utils.py`
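
As one possible setup, the sketch below fits an OLS with a categorical term and `wait_time` on synthetic product-level data. The data, the three category names, and the made-up relationship between `wait_time` and `review_score` are assumptions for the example; the formula is only one candidate specification, not a prescribed answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic product-level data (made up for illustration)
rng = np.random.default_rng(0)
n = 300
products = pd.DataFrame({
    "category": rng.choice(["toys", "furniture", "books"], size=n),
    "wait_time": rng.uniform(2, 30, size=n),
})
# Made-up relationship: reviews fall as wait_time grows, plus noise
products["review_score"] = (
    4.8 - 0.05 * products["wait_time"] + rng.normal(0, 0.3, size=n)
)

# C(category) expands the category into dummy variables;
# including wait_time holds it constant when reading category coefficients
model = smf.ols("review_score ~ C(category) + wait_time", data=products).fit()
print(model.params)
print(model.pvalues)
```

With an intercept in the formula, one category is dropped and serves as the baseline, so each category coefficient reads as a difference from that baseline at equal `wait_time`.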

✅ Congratulations on completing this challenge! Commit and push your notebook before moving on to the next one!