# Simplifying Problems

### Introduction

Data analysis tasks are often broad and vague.  As data analysts, it's our job to break these problems down.  In this lesson, we'll cover some techniques for doing so.

### Consider the perspective

Let's say that we are not even given a dataset, but are simply asked to analyze the profit of an ecommerce website.

At this point, we would also want to consider whose perspective we should adopt.  For example, is this for someone who is in charge of a particular product line, someone who is in charge of a specific sales region, or the CEO of the company?  The answer changes what we can control, and therefore changes our analysis.

> For example, if we are in control of a specific region, we likely would not choose to exit the region altogether.  Instead, we could focus on changing the products sold there, or focus marketing on the more successful cities or towns.

### MECE

Now, to keep things broad, let's assume that we are the CEO of the website and want to increase profitability.  How would we break down the problem?

A good approach is to start with MECE, which stands for *mutually exclusive and collectively exhaustive*.  By this we mean a set of components that do not overlap with one another (mutually exclusive) and that together account for everything that could be going on (collectively exhaustive).

For example, here are the first two components under profitability that satisfy MECE.

* Profitability
    1. Revenue
    2. Cost

We can then break down cost further, following the same pattern.

* Profitability
    1. Revenue
    2. Cost
        * Marginal Costs
        * Fixed Costs

> **Fixed costs** are costs that do not directly depend on sales -- for example, paying for headquarters or an executive team.  **Marginal costs** rise and fall with sales -- for example, shipping costs or the cost of producing the product.

So MECE is a good approach for simplifying a problem.  One reason it works well is that it forces us to **start broadly and then go narrow**.  We start with broad categories like revenue and cost, and then move to narrower ones like fixed costs and marginal costs.
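As a minimal sketch of the breakdown above, the tree can be written as a nested dictionary and rolled up from the leaves.  All of the figures are hypothetical and chosen only for illustration, and `roll_up` is a helper name introduced here, not part of any library.  Costs are stored as negative numbers so that a plain sum gives profit.

```python
# Hypothetical figures for illustration only; they do not come from any dataset.
# Costs are negative so that summing the whole tree yields profit.
tree = {
    "Revenue": 1_000_000,
    "Cost": {
        "Marginal Costs": -450_000,
        "Fixed Costs": -300_000,
    },
}


def roll_up(node):
    """Return a leaf's value, or the summed roll-up of all children of a branch."""
    if isinstance(node, dict):
        return sum(roll_up(child) for child in node.values())
    return node


profit = roll_up(tree)
total_cost = -roll_up(tree["Cost"])

print(f"Total cost: {total_cost}")
print(f"Profit:     {profit}")
```

Because the children of each branch are meant to be mutually exclusive and collectively exhaustive, every dollar appears in exactly one leaf, and the roll-up neither double counts nor misses anything.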

### Dimensions of the Data

Now this MECE approach can become a little more difficult when we think about the components of revenue.  For example, we could say that revenue is the sum of sales from recurring customers and sales from new customers.

While this is mutually exclusive and collectively exhaustive, there are other ways to divide up our revenue.  So it's valuable to think of different dimensions in the data.

To see this, let's load up a sample dataset.

```python
import pandas as pd

df = pd.read_csv('./ecommerce-dataset.csv')

# Show the first two rows
df[:2]
```

| Unnamed: 0 | Transaction_id | customer_id | Date | Product | Gender | Device_Type | Country | State | City | Category | Customer_Login_type | Delivery_Type | Quantity | Transaction Start | Transaction_Result | Amount US$ | Individual_Price_US$ | Year_Month | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40170 | 1348959766 | 14/11/2013 | Hair Band | Female | Web | United States | New York | New York City | Accessories | Member | one-day deliver | 12 | 1 | 0 | 6910 | 576 | 13-Nov | 22:35:51 |
| 1 | 33374 | 2213674919 | 05/11/2013 | Hair Band | Female | Web | United States | California | Los Angles | Accessories | Member | one-day deliver | 17 | 1 | 1 | 1699 | 100 | 13-Nov | 06:44:41 |


Looking at the data above, we can see that there are multiple dimensions we could segment our data by: geography, product type, time (for example, month), and customer type.  Each of these may have further subdimensions.
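To make the idea of segmenting concrete, here is a minimal sketch that totals revenue along a few of these dimensions.  It uses a small hand-made DataFrame so that it runs on its own.  Only the column names come from the dataset above; the rows, including the `Guest` login type, are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names match the dataset.
sample = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "Canada"],
    "Category": ["Accessories", "Clothing", "Accessories", "Clothing"],
    "Customer_Login_type": ["Member", "Guest", "Member", "Member"],
    "Amount US$": [100, 250, 75, 300],
})

total = sample["Amount US$"].sum()

# Each dimension is a different way of slicing the same total revenue.
by_dimension = {}
for dimension in ["Country", "Category", "Customer_Login_type"]:
    by_dimension[dimension] = sample.groupby(dimension)["Amount US$"].sum()
    print(by_dimension[dimension], end="\n\n")
    # The segments along any one dimension should add back up to the total.
    assert by_dimension[dimension].sum() == total
```

By default `groupby` drops rows whose key is missing, so a segment sum that falls short of the total is a hint that the split is not collectively exhaustive.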

Give this a shot: we'll list some initial dimensions, and you can try to list subdimensions below them.

> We'll get you started by filling in some subdimensions for customer.

1. Customer
    * Demographic
        * Gender
            * Male
            * Female
            * Non-binary
    * Repeat?
        * New vs Recurring Customers
2. Region
3. Temporal

### Back to MECE

Notice that even with our dimensions, we try to make the subcategories MECE.  Under repeat customers there are only two categories, new or recurring, and we list both of them.  With temporal, if we segment by month, we can easily be MECE simply by listing each of the months.
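Here is a minimal sketch of those two segmentations, again on hypothetical rows that reuse the dataset's column names.  It assumes that dates are in day/month/year order, as they appear in the sample rows, and that a customer counts as "New" on their first purchase within the data and "Recurring" afterwards.  The helper `covers_total` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names match the dataset.
sample = pd.DataFrame({
    "customer_id": [101, 102, 101, 103],
    "Date": ["05/11/2013", "14/11/2013", "02/12/2013", "20/12/2013"],
    "Amount US$": [100, 250, 75, 300],
})

# Assumption: dates are day/month/year, as in the sample rows of the dataset.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")
sample = sample.sort_values("Date").reset_index(drop=True)

# Temporal dimension: every row falls in exactly one calendar month.
sample["month"] = sample["Date"].dt.to_period("M")

# Repeat dimension (assumed definition): the first purchase seen for a
# customer is "New", and any later purchase is "Recurring".
is_repeat = sample.duplicated(subset="customer_id", keep="first")
sample["repeat_segment"] = is_repeat.map({False: "New", True: "Recurring"})


def covers_total(df, column, value="Amount US$"):
    """Return True when the segments in `column` account for all of `value`."""
    return bool(df.groupby(column)[value].sum().sum() == df[value].sum())


print(sample.groupby("month")["Amount US$"].sum())
print(sample.groupby("repeat_segment")["Amount US$"].sum())
print(covers_total(sample, "month"), covers_total(sample, "repeat_segment"))
```

Each row receives exactly one label per dimension, which makes the segments mutually exclusive by construction.  The `covers_total` check then confirms that they are collectively exhaustive.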

Below are some other ways that we could have broken down the data.

### Answers

1. Customer
    * Demographic
        * Gender
            * Male
            * Female
            * Non-binary
    * Repeat?
        * New vs Recurring Customers
2. Region
    * By Country
    * By City
    * Suburban vs City
3. Temporal
    * By month
    * Holiday vs Non-holiday
    * By time of day

Notice that something like Suburban vs City is not a category the dataset gives us, but it is something we could likely find elsewhere.
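As a minimal sketch of adding dimensions the dataset does not provide directly, the code below derives a time-of-day bucket from the `Time` column and attaches a Suburban vs City label through a lookup table.  The first two rows reuse values from the sample above, and the third row is made up.  The `area_type` lookup, the `Exampleville` entry, and the bucket boundaries are all hypothetical; in practice the lookup would come from an outside source.

```python
import pandas as pd

# First two rows reuse values shown above; the third row is hypothetical.
sample = pd.DataFrame({
    "City": ["New York City", "Los Angles", "Exampleville"],
    "Time": ["22:35:51", "06:44:41", "13:05:00"],
    "Amount US$": [6910, 1699, 500],
})

# Hypothetical lookup table, invented for illustration.
area_type = {
    "New York City": "City",
    "Los Angles": "City",  # spelled as it appears in the dataset
    "Exampleville": "Suburban",
}
# An explicit "Unknown" bucket keeps the split collectively exhaustive.
sample["area_type"] = sample["City"].map(area_type).fillna("Unknown")

# Time of day: four non-overlapping buckets that cover hours 0 through 23.
hour = pd.to_datetime(sample["Time"], format="%H:%M:%S").dt.hour
sample["time_of_day"] = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=["Night", "Morning", "Afternoon", "Evening"],
)

revenue_by_area = sample.groupby("area_type")["Amount US$"].sum()
print(sample[["City", "area_type", "Time", "time_of_day"]])
print(revenue_by_area)
```

The bucket edges are chosen so that every hour lands in exactly one interval, which keeps the new time-of-day dimension MECE as well.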