# Mandatory Challenge
## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: your team has just been hired by a major retail company! So, let's get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are given a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that sums the remaining values (earnings and items sold).
* One aggregate per item that sums the remaining values (earnings and items sold).

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
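
Step 4 can be sketched with `DataFrame.to_sql`. This is a minimal sketch under stated assumptions, not a reference solution: an in-memory SQLite database stands in for the local database, the helper `write_tables` and the table name `cleaned_sales` are hypothetical, and the sample rows are copied from the `raw_sales.head()` output in this notebook.

```python
import sqlite3

import pandas as pd


def write_tables(tables, connection, if_exists="append"):
    """Write each DataFrame in `tables` (table name -> DataFrame) to the database."""
    for name, frame in tables.items():
        # index=False keeps the pandas row index out of the SQL table.
        frame.to_sql(name, connection, if_exists=if_exists, index=False)


# Two sample rows, only to show the call.
cleaned = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04"],
    "shop_id": [29, 28],
    "item_id": [1469, 21364],
    "item_price": [1199.0, 479.0],
    "item_cnt_day": [1.0, 1.0],
})
by_shop = pd.DataFrame({
    "shop_id": [28, 29],
    "shop_earnings": [479.0, 1199.0],
    "total_items_sold": [1.0, 1.0],
    "date": ["2015-01-04", "2015-01-04"],
})
by_item = pd.DataFrame({
    "item_id": [1469, 21364],
    "item_earnings": [1199.0, 479.0],
    "total_items_sold": [1.0, 1.0],
    "date": ["2015-01-04", "2015-01-04"],
})

# Assumption: SQLite in memory replaces the real local database for this demo.
connection = sqlite3.connect(":memory:")
write_tables(
    {"cleaned_sales": cleaned, "sales_by_shop": by_shop, "sales_by_item": by_item},
    connection,
)

# Count the rows that landed in each table.
row_counts = {
    name: int(pd.read_sql(f"SELECT COUNT(*) AS n FROM {name}", connection)["n"].iloc[0])
    for name in ("cleaned_sales", "sales_by_shop", "sales_by_item")
}
```

`if_exists="append"` fits a daily process that adds new rows on each run. For a MySQL local database, a SQLAlchemy engine built with `create_engine` can be passed in place of the SQLite connection.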

## Instructions
* Read the CSV data you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
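
The cleaning step is left open, so the following is only one possible sketch. The function name `clean_sales` and every rule in it are assumptions made for illustration: duplicates are treated as loading errors, and non-positive prices as data-entry errors. Check both against the real data before relying on them.

```python
import pandas as pd


def clean_sales(raw):
    """Return a cleaned copy of the raw daily sales (hypothetical rules)."""
    cleaned = raw.copy()
    # Parse dates; unparseable values become NaT and are dropped below.
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    # Drop rows missing any field the aggregates need.
    required = ["date", "shop_id", "item_id", "item_price", "item_cnt_day"]
    cleaned = cleaned.dropna(subset=required)
    # Assumption: exact duplicate rows are loading errors, not real repeat sales.
    cleaned = cleaned.drop_duplicates()
    # Assumption: a non-positive price is a data-entry error.
    cleaned = cleaned[cleaned["item_price"] > 0]
    return cleaned.reset_index(drop=True)


# Made-up sample: one duplicate row, one bad date, one negative price.
sample = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-04", "not a date", "2015-01-04"],
    "shop_id": [29, 29, 28, 28, 28],
    "item_id": [1469, 1469, 21364, 21365, 22104],
    "item_price": [1199.0, 1199.0, 479.0, 999.0, -1.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0, 2.0],
})
cleaned_sample = clean_sales(sample)
```

Working on a copy keeps the raw frame available for comparison after cleaning.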

In [3]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine


In [2]:
driver = "mysql+pymysql"
user = "ironhacker_read"
password = "ir0nhack3r"
ip = "35.239.232.23"
database = "retail_sales"

connection_string = f"{driver}://{user}:{password}@{ip}/{database}"
    
engine = create_engine(connection_string)
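
An alternative to writing the credentials in the notebook is to read them from environment variables. The sketch below builds the same connection string that way. The variable names `RETAIL_DB_USER`, `RETAIL_DB_PASSWORD`, `RETAIL_DB_HOST` and `RETAIL_DB_NAME` and the helper `build_connection_string` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=None):
    """Build the SQLAlchemy connection string from environment variables."""
    env = os.environ if env is None else env
    driver = "mysql+pymysql"
    user = env["RETAIL_DB_USER"]
    # quote_plus escapes characters such as "@" or "/" that would break the URL.
    password = quote_plus(env["RETAIL_DB_PASSWORD"])
    # Host and database fall back to the values used in this notebook.
    ip = env.get("RETAIL_DB_HOST", "35.239.232.23")
    database = env.get("RETAIL_DB_NAME", "retail_sales")
    return f"{driver}://{user}:{password}@{ip}/{database}"


# Example with an explicit mapping instead of the real environment.
example = build_connection_string(
    {"RETAIL_DB_USER": "reader", "RETAIL_DB_PASSWORD": "p@ss/word"}
)
```

The resulting string can be passed to `create_engine` exactly like `connection_string` above.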

In [3]:
query = """
        SELECT * FROM raw_sales
"""

raw_sales = pd.read_sql(query, engine)
raw_sales.to_csv("../raw_sales.csv", sep = ",")

In [4]:
query = """
        SELECT * FROM sales_by_item_index
"""
sales_by_item_index = pd.read_sql(query, engine)
sales_by_item_index.to_csv("../sales_by_item_index.csv", sep = ",")

In [5]:
query = """
        SELECT * FROM sales_by_item
"""
sales_by_item = pd.read_sql(query, engine)
sales_by_item.to_csv("../sales_by_item.csv", sep = ",")

In [6]:
query = """
        SELECT * FROM sales_by_shop
"""
sales_by_shop = pd.read_sql(query, engine)
sales_by_shop.to_csv("../sales_by_shop.csv", sep = ",")
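
The four export cells above differ only in the table name, so they could be folded into one loop. The sketch below is one way to do that. The helper `export_tables` is hypothetical, and a throwaway SQLite database stands in for Ironhack's MySQL server so the example runs on its own. Note that it writes with `index=False`, so files produced this way would not contain the `Unnamed: 0` column that later cells drop.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def export_tables(connection, table_names, out_dir):
    """Dump each table to <out_dir>/<table>.csv and return the written paths."""
    out_dir = Path(out_dir)
    paths = []
    for table in table_names:
        # Table names come from a fixed list in code, never from user input.
        frame = pd.read_sql(f"SELECT * FROM {table}", connection)
        path = out_dir / f"{table}.csv"
        # index=False avoids an extra "Unnamed: 0" column when the file is re-read.
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Assumption: an in-memory SQLite database with one small table, for the demo only.
demo_connection = sqlite3.connect(":memory:")
pd.DataFrame({"shop_id": [2, 3], "total_items_sold": [27.0, 11.0]}).to_sql(
    "sales_by_shop", demo_connection, index=False
)

with tempfile.TemporaryDirectory() as tmp:
    written = export_tables(demo_connection, ["sales_by_shop"], tmp)
    reloaded = pd.read_csv(written[0])
```

With the notebook's `engine`, the call would look like `export_tables(engine, ["raw_sales", "sales_by_item_index", "sales_by_item", "sales_by_shop"], "..")`.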

In [29]:
# read all the data
raw_sales = pd.read_csv("../raw_sales.csv")
sales_by_item = pd.read_csv("../sales_by_item.csv")
sales_by_item_index = pd.read_csv("../sales_by_item_index.csv")
sales_by_shop = pd.read_csv("../sales_by_shop.csv")

In [30]:
# Drop the index column that to_csv saved, then parse the dates.
raw_sales = raw_sales.drop(columns=["Unnamed: 0"])
raw_sales["date"] = pd.to_datetime(raw_sales["date"])
raw_sales.head()
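
The drop-then-convert pattern can also be done while reading the file. The sketch below shows this with `pd.read_csv` options. The inline CSV text is a stand-in for `../raw_sales.csv`, with two rows taken from the output of `raw_sales.head()`.

```python
import io

import pandas as pd

# Stand-in for ../raw_sales.csv: the first, unnamed column is the saved index.
csv_text = """,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
"""

# index_col=0 consumes the unnamed index column; parse_dates converts "date".
demo_sales = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])
```

The same two arguments would apply to the other three files, since each has the saved index column and a `date` column.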

In [38]:
raw_sales.head()

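|   | date | shop_id | item_id | item_price | item_cnt_day | shop_earning |
|---|------|---------|---------|------------|--------------|--------------|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 | 1199.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 | 479.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 | 1998.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 | 498.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 | 179.0 |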


In [55]:
def extract_sales_by_shop(raw_sales):
    # assign() returns a new DataFrame, so the caller's raw_sales is not modified.
    sales = raw_sales.assign(earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"])
    aggregation = {"earnings": "sum",
                   "item_cnt_day": "sum"}
    # One row per (date, shop_id) with summed earnings and units sold.
    agg_by_shop = sales.groupby(["date", "shop_id"]).agg(aggregation).reset_index()
    agg_by_shop = agg_by_shop[["shop_id", "earnings", "item_cnt_day", "date"]]
    return agg_by_shop.rename(columns={"earnings": "shop_earnings", "item_cnt_day": "total_items_sold"})

In [56]:
extract_sales_by_shop(raw_sales).head()

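|   | shop_id | shop_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 2 | 103746.0 | 81.0 | 2015-01-04 |
| 1 | 3 | 67443.0 | 33.0 | 2015-01-04 |
| 2 | 4 | 29361.0 | 39.0 | 2015-01-04 |
| 3 | 5 | 33138.0 | 45.0 | 2015-01-04 |
| 4 | 6 | 138678.0 | 150.0 | 2015-01-04 |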


In [57]:
def extract_sales_by_item(raw_sales):
    # assign() returns a new DataFrame, so the caller's raw_sales is not modified.
    sales = raw_sales.assign(earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"])
    aggregation = {"earnings": "sum",
                   "item_cnt_day": "sum"}
    # One row per (date, item_id) with summed earnings and units sold.
    agg_by_item = sales.groupby(["date", "item_id"]).agg(aggregation).reset_index()
    agg_by_item = agg_by_item[["item_id", "earnings", "item_cnt_day", "date"]]
    return agg_by_item.rename(columns={"earnings": "item_earnings", "item_cnt_day": "total_items_sold"})

In [58]:
extract_sales_by_item(raw_sales).head()

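|   | item_id | item_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 30 | 507.0 | 3.0 | 2015-01-04 |
| 1 | 31 | 1089.0 | 3.0 | 2015-01-04 |
| 2 | 32 | 447.0 | 3.0 | 2015-01-04 |
| 3 | 42 | 897.0 | 3.0 | 2015-01-04 |
| 4 | 59 | 747.0 | 3.0 | 2015-01-04 |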
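
The per-shop and per-item functions differ only in the grouping key and the name of the earnings column, so they could share one helper. The sketch below is one possible generalization, not the notebook's own code. The name `aggregate_sales` is hypothetical, and the sample rows are the five rows shown by `raw_sales.head()`.

```python
import pandas as pd


def aggregate_sales(raw_sales, key, earnings_name):
    """Sum earnings and units per (date, key) without modifying raw_sales."""
    # assign() returns a new DataFrame, so the input keeps its original columns.
    with_earnings = raw_sales.assign(
        earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"]
    )
    # Named aggregation: output column name -> (input column, function).
    aggregated = with_earnings.groupby(["date", key], as_index=False).agg(
        **{
            earnings_name: ("earnings", "sum"),
            "total_items_sold": ("item_cnt_day", "sum"),
        }
    )
    # Same column order as the tables in the database.
    return aggregated[[key, earnings_name, "total_items_sold", "date"]]


sample = pd.DataFrame({
    "date": ["2015-01-04"] * 5,
    "shop_id": [29, 28, 28, 28, 28],
    "item_id": [1469, 21364, 21365, 22104, 22091],
    "item_price": [1199.0, 479.0, 999.0, 249.0, 179.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0, 1.0],
})

by_shop = aggregate_sales(sample, "shop_id", "shop_earnings")
by_item = aggregate_sales(sample, "item_id", "item_earnings")
```

For these five rows, shop 28 sums to 3154.0 in earnings over 6 items and shop 29 to 1199.0 over 1 item.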


In [31]:
# Drop the index column that to_csv saved, then parse the dates.
sales_by_item = sales_by_item.drop(columns=["Unnamed: 0"])
sales_by_item["date"] = pd.to_datetime(sales_by_item["date"])
sales_by_item.head()

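|   | item_id | item_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 30 | 169.0 | 1.0 | 2019-03-09 |
| 1 | 31 | 363.0 | 1.0 | 2019-03-09 |
| 2 | 32 | 149.0 | 1.0 | 2019-03-09 |
| 3 | 42 | 299.0 | 1.0 | 2019-03-09 |
| 4 | 59 | 249.0 | 1.0 | 2019-03-09 |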


In [32]:
# Drop the index column that to_csv saved, then parse the dates.
sales_by_item_index = sales_by_item_index.drop(columns=["Unnamed: 0"])
sales_by_item_index["date"] = pd.to_datetime(sales_by_item_index["date"])
sales_by_item_index.head()

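|   | id | item_id | item_earnings | total_items_sold | date |
|---|----|---------|---------------|------------------|------|
| 0 | 1 | 30 | 169.0 | 1.0 | 2019-03-12 |
| 1 | 2 | 31 | 363.0 | 1.0 | 2019-03-12 |
| 2 | 3 | 32 | 149.0 | 1.0 | 2019-03-12 |
| 3 | 4 | 42 | 299.0 | 1.0 | 2019-03-12 |
| 4 | 5 | 59 | 249.0 | 1.0 | 2019-03-12 |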


In [33]:
# Drop the index column that to_csv saved, then parse the dates.
sales_by_shop = sales_by_shop.drop(columns=["Unnamed: 0"])
sales_by_shop["date"] = pd.to_datetime(sales_by_shop["date"])
sales_by_shop.head()

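|   | shop_id | shop_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 2 | 33023.5 | 27.0 | 2019-03-09 |
| 1 | 3 | 22481.0 | 11.0 | 2019-03-09 |
| 2 | 4 | 9787.0 | 13.0 | 2019-03-09 |
| 3 | 5 | 11046.0 | 15.0 | 2019-03-09 |
| 4 | 6 | 38784.0 | 50.0 | 2019-03-09 |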
