# Sellers

Our goal is to find Sellers that repeatedly underperform compared with others, and to understand why.  
This will help us shape our recommendations on how to improve Olist's profit margin.

## 1 - Create a Sellers Query with the following elements 

- Create a table `get_training_data` that will return the following DataFrame:

  - `seller_id` (_str_) _The id of the seller (**UNIQUE**)_
  - `delay_to_carrier` (_float_) _Average delay_to_carrier per seller. Return 0 if the order is delivered to the carrier before the shipping_limit_date, otherwise the length of the delay_
  - `wait_time` (_float_) _Average wait_time (duration of deliveries) per seller_
  - `share_of_five_stars` (_float_) _The share of five star orders for orders in which the seller was involved_
  - `share_of_one_stars` (_float_) _The share of one star orders for orders in which the seller was involved_
  - `review_score` (_float_) _The average review score for orders in which the seller was involved_
  - `n_orders` (_int_) _The number of unique orders the seller was involved with._
  - `quantity` (_int_) _The total number of items sold by this seller_
  - `quantity_per_order` (_float_) _The mean number of items per order for this seller_
  - `sales` (_float_) _The total sales associated with this seller (excluding freight value), in BRL_
  - `date_first_sale` (_datetime_) _Date of first sale on Olist_
  - `date_last_sale` (_datetime_) _Date of last sale on Olist_
  
Feel free to build all the intermediary tables below if you prefer to break the problem down step by step.

### `get_seller_features`
Returns a Table with: 'seller_id', 'seller_city', 'seller_state'

### `get_seller_delay_wait_time`
Returns a Table with: 'seller_id', 'delay_to_carrier', 'wait_time'
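
For readers who build the tables in pandas, the delay and wait-time definitions above can be sketched as follows. This is a minimal illustration on toy data, not a reference solution: the column names (`order_purchase_timestamp`, `order_delivered_carrier_date`, `order_delivered_customer_date`, `shipping_limit_date`) are assumed from the public Olist dataset and are not given in this document, and both durations are expressed in days by assumption.

```python
import pandas as pd

# Toy data; column names are assumptions borrowed from the public Olist dataset.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "order_delivered_carrier_date": pd.to_datetime(["2018-01-03", "2018-01-08", "2018-01-04"]),
    "order_delivered_customer_date": pd.to_datetime(["2018-01-11", "2018-01-12", "2018-01-08"]),
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s2"],
    "shipping_limit_date": pd.to_datetime(["2018-01-05", "2018-01-06", "2018-01-06"]),
})

def get_seller_delay_wait_time(orders, order_items):
    # One row per (order, seller) so multi-item orders are not counted several times
    df = order_items.drop_duplicates(["order_id", "seller_id"]).merge(orders, on="order_id")
    one_day = pd.Timedelta(days=1)
    # Days past the shipping limit; hand-overs made on time are clipped to 0
    delay = (df["order_delivered_carrier_date"] - df["shipping_limit_date"]) / one_day
    df["delay_to_carrier"] = delay.clip(lower=0)
    # Days between purchase and delivery to the customer
    df["wait_time"] = (df["order_delivered_customer_date"] - df["order_purchase_timestamp"]) / one_day
    return df.groupby("seller_id", as_index=False)[["delay_to_carrier", "wait_time"]].mean()

seller_delays = get_seller_delay_wait_time(orders, order_items)
print(seller_delays)
```

Clipping before averaging matters: an early hand-over should count as 0, not as a negative delay that cancels out a late one.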

### `get_active_dates`
Returns a Table with: 'seller_id', 'date_first_sale', 'date_last_sale'
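
A possible pandas sketch of this step is shown below. It assumes that a "sale" is dated by the order's purchase timestamp (`order_purchase_timestamp`, a column name taken from the public Olist dataset, not from this document); another date column could be a reasonable choice too.

```python
import pandas as pd

# Toy data; column names are assumptions borrowed from the public Olist dataset.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": pd.to_datetime(["2017-03-01", "2018-06-15", "2017-11-20"]),
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s2"],
})

def get_active_dates(orders, order_items):
    # Attach each order's purchase date to the seller(s) involved in it
    df = order_items[["order_id", "seller_id"]].drop_duplicates().merge(orders, on="order_id")
    # Earliest and latest purchase date per seller
    return df.groupby("seller_id", as_index=False).agg(
        date_first_sale=("order_purchase_timestamp", "min"),
        date_last_sale=("order_purchase_timestamp", "max"),
    )

active_dates = get_active_dates(orders, order_items)
print(active_dates)
```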

### `get_review_score`
Returns a Table with: 'seller_id', 'share_of_five_stars', 'share_of_one_stars', 'review_score'
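
The shares can be obtained by averaging 0/1 indicator columns, as in this hedged pandas sketch. The `reviews` table and its `review_score` column per `order_id` are assumptions about the input layout; the toy values are invented for illustration only.

```python
import pandas as pd

# Toy data; the table layout is an assumption, not something this document specifies.
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3", "o4"],
    "seller_id": ["s1", "s1", "s1", "s2", "s2"],
})
reviews = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "review_score": [5, 1, 5, 4],
})

def get_review_score(order_items, reviews):
    # One row per (order, seller): an order with two items should count once
    df = order_items[["order_id", "seller_id"]].drop_duplicates().merge(reviews, on="order_id")
    # The mean of a 0/1 indicator is the share of rows where it equals 1
    df["is_five"] = (df["review_score"] == 5).astype(int)
    df["is_one"] = (df["review_score"] == 1).astype(int)
    return df.groupby("seller_id", as_index=False).agg(
        share_of_five_stars=("is_five", "mean"),
        share_of_one_stars=("is_one", "mean"),
        review_score=("review_score", "mean"),
    )

review_scores = get_review_score(order_items, reviews)
print(review_scores)
```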

### `get_quantity`
Returns a Table with: 'seller_id', 'n_orders', 'quantity'
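
As a rough pandas sketch, assuming an `order_items` table with one row per item sold (an assumption about the input, not stated here), `n_orders` is a distinct count of orders while `quantity` is a plain row count. `quantity_per_order` from the target DataFrame then follows as their ratio.

```python
import pandas as pd

# Toy data: one row per item sold (assumed layout).
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s1", "s2"],
})

def get_quantity(order_items):
    out = order_items.groupby("seller_id", as_index=False).agg(
        n_orders=("order_id", "nunique"),  # distinct orders the seller took part in
        quantity=("order_id", "count"),    # total items, one row per item
    )
    out["quantity_per_order"] = out["quantity"] / out["n_orders"]
    return out

quantity = get_quantity(order_items)
print(quantity)
```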

### `get_sales`
Returns a Table with: 'seller_id', 'sales'
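
A minimal pandas sketch, assuming item prices live in a `price` column separate from `freight_value` (column names assumed from the public Olist dataset): summing `price` alone gives sales excluding freight.

```python
import pandas as pd

# Toy data; `price` and `freight_value` are assumed column names.
order_items = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2"],
    "price": [10.0, 20.0, 5.5],
    "freight_value": [2.0, 3.0, 1.0],
})

def get_sales(order_items):
    # Sum item prices only, so freight is excluded from sales
    sales = order_items.groupby("seller_id", as_index=False)["price"].sum()
    return sales.rename(columns={"price": "sales"})

sales = get_sales(order_items)
print(sales)
```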

## 2 - Exploration

### 2.1 - Plots

❓ As always, start by using the `Data Analyze` add-in to describe the different variables, then plot their distribution histograms as we did for the Orders.

- Check the distribution of each numerical variable of our new Sellers table
- Do you notice any outliers?

In [None]:
# Your answer

### 2.2 - Model out `review_score` with OLS

❓ A more rigorous way to explain sellers' `review_score` is to **model out the impact of various features on `review_score`** with a multivariate OLS, using the `Data Analyze` add-in.

Create an OLS with only the numerical features of your choice. What are the most impactful ones? (If you want to be rigorous, don't forget to standardize your data as we did in the Orders analysis. It will allow you to compare coefficients that are on the same scale.)
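
To see why standardizing makes coefficients comparable, here is an optional cross-check in Python rather than in the add-in. It is a sketch on synthetic data, not the exercise's answer: the two features, their scales, and the true effects are all invented for illustration, and the fit uses NumPy's least-squares solver.

```python
import numpy as np

# Synthetic sellers: every number below is made up for illustration.
rng = np.random.default_rng(0)
n = 500
names = ["wait_time", "n_orders"]
X = np.column_stack([rng.normal(12, 4, n), rng.normal(100, 50, n)])
y = 4.5 - 0.1 * X[:, 0] + 0.0005 * X[:, 1] + rng.normal(0, 0.2, n)

def zscore(a):
    # Center each column on 0 and scale it to unit standard deviation
    return (a - a.mean(axis=0)) / a.std(axis=0)

Xz, yz = zscore(X), zscore(y)

# OLS with an intercept column, solved by least squares
A = np.column_stack([np.ones(n), Xz])
coefs, *_ = np.linalg.lstsq(A, yz, rcond=None)
betas = dict(zip(names, coefs[1:]))

# R-squared: share of the target's variance explained by the fit
residuals = yz - A @ coefs
r_squared = 1 - residuals.var() / yz.var()

print(betas, r_squared)
```

On the raw scale the two coefficients differ by orders of magnitude simply because the features have different units; after standardization each coefficient reads as "standard deviations of `review_score` per standard deviation of the feature".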

In [None]:
# Your answer

❓ Finally, investigate your model's performance (R-squared) and residuals

<details>
    <summary>Hint</summary>
    

You can plot the residuals and the fit plot simply by checking the corresponding options in the regression menu.
</details>


In [None]:
# Your answer

### 2.3 - Optional - Add `seller_state` to your analysis

Linear regressions only accept numerical variables as input, but what about `seller_state` or other features that carry information and are not numerical? You will have to encode them as `Dummy Variables` and add them to your OLS regression. If you have the time, explore this possibility!
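
As an illustration of dummy encoding outside the add-in, the pandas sketch below turns `seller_state` into 0/1 columns. The toy sellers and state codes are invented; dropping one category is a common choice (assumed here, not prescribed by this exercise) so that the dummies are not perfectly collinear with the regression intercept.

```python
import pandas as pd

# Toy sellers table; values are made up for illustration.
sellers = pd.DataFrame({
    "seller_id": ["s1", "s2", "s3", "s4"],
    "seller_state": ["SP", "RJ", "SP", "MG"],
    "wait_time": [10.0, 14.0, 9.0, 12.0],
})

# One 0/1 column per state; the first category (alphabetically, MG) is dropped
# and becomes the baseline that the other states are compared against.
dummies = pd.get_dummies(sellers["seller_state"], prefix="state", drop_first=True, dtype=int)
encoded = pd.concat([sellers.drop(columns="seller_state"), dummies], axis=1)

print(encoded)
```

Each dummy coefficient in the OLS is then read relative to the dropped baseline state.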