# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
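
The first three steps above could be wired together roughly as follows. This is a minimal sketch, not the process used later in this notebook: the function name `build_daily_tables` and the sample rows are hypothetical, and it assumes the daily file carries the columns `shop_id`, `item_id`, `item_price`, and `item_cnt_day`. Treating exact duplicate rows as errors is also an assumption.

```python
import pandas as pd

def build_daily_tables(raw: pd.DataFrame) -> dict:
    """Hypothetical helper: clean one day's rows and build both aggregates."""
    # Cleaning here only drops exact duplicate rows and rows with missing values.
    cleaned = raw.drop_duplicates().dropna().reset_index(drop=True)
    # Earnings per row = unit price times units sold that day.
    cleaned = cleaned.assign(earnings=cleaned["item_price"] * cleaned["item_cnt_day"])
    value_cols = ["earnings", "item_cnt_day"]
    per_store = cleaned.groupby("shop_id", as_index=False)[value_cols].sum()
    per_item = cleaned.groupby("item_id", as_index=False)[value_cols].sum()
    return {"cleaned": cleaned, "per_store": per_store, "per_item": per_item}

# Made-up sample rows, only to exercise the function.
sample = pd.DataFrame({
    "date": ["2015-01-04"] * 4,
    "shop_id": [29, 29, 28, 28],
    "item_id": [1469, 1469, 21364, 30],
    "item_price": [1199.0, 1199.0, 479.0, 169.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0],
})
tables = build_daily_tables(sample)
print(tables["per_store"])
```

Returning the three frames in a dictionary keeps the writing step separate, so the same function can feed CSV files or database tables.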

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.
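
Creating and populating the tables can be done in one call per table with `DataFrame.to_sql`. The sketch below is illustrative only: it assumes an in-memory SQLite database as a stand-in for the local database, and the table names and rows are made up. With a SQLAlchemy setup, an engine from `create_engine` would be passed to `to_sql` in place of the connection.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the local database (assumption).
conn = sqlite3.connect(":memory:")

# Tiny made-up frames standing in for the three results.
frames = {
    "cleaned_sales": pd.DataFrame(
        {"shop_id": [29, 28], "item_id": [1469, 21364], "item_price": [1199.0, 479.0]}
    ),
    "aggregate_per_store": pd.DataFrame(
        {"shop_id": [28, 29], "shop_earnings": [479.0, 1199.0]}
    ),
    "aggregate_per_item": pd.DataFrame(
        {"item_id": [1469, 21364], "item_earnings": [1199.0, 479.0]}
    ),
}

for table_name, frame in frames.items():
    # if_exists="replace" recreates the table on every run;
    # "append" would add the new day's rows instead.
    frame.to_sql(table_name, conn, if_exists="replace", index=False)

# List the tables that now exist in the database.
written = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
)
print(written["name"].tolist())
```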

In [2]:
import pandas as pd 
import sqlalchemy 
import datetime 
from sqlalchemy import create_engine
import numpy as np
import re

In [2]:
# Import the dataset from Ironhack's database

driver = "mysql+pymysql"
user = "ironhacker_read"
password = "ir0nhack3r"
ip = "35.239.232.23"
database = "retail_sales"

connection_string = f'{driver}://{user}:{password}@{ip}/{database}'

engine = create_engine(connection_string)
query = """
    SELECT *
    FROM retail_sales.raw_sales
    JOIN sales_by_item ON raw_sales.item_id = sales_by_item.item_id
    JOIN sales_by_item_index ON raw_sales.item_id = sales_by_item_index.item_id
    JOIN sales_by_shop ON raw_sales.shop_id = sales_by_shop.shop_id;
"""

retail_sales = pd.read_sql(query, engine)

# All tables were downloaded in one joined query for lack of time.
# That is why the columns have to be renamed and the original shop and item tables rebuilt later.
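
Joining the shop and item tables onto `raw_sales` repeats each sales row once per matching row in the joined tables, which explains the repeated rows in the output further down. A small sketch of the effect with made-up data in an in-memory SQLite database (not Ironhack's server); the table names follow the query, while the contents and the `snapshot_date` column are hypothetical.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# One raw sale, but two rows (two dates) for the same shop.
pd.DataFrame(
    {"shop_id": [29], "item_id": [1469], "item_price": [1199.0]}
).to_sql("raw_sales", conn, index=False)
pd.DataFrame(
    {
        "shop_id": [29, 29],
        "shop_earnings": [28579.0, 28579.0],
        "snapshot_date": ["03/09/2019", "03/12/2019"],
    }
).to_sql("sales_by_shop", conn, index=False)

joined = pd.read_sql(
    """
    SELECT raw_sales.*, sales_by_shop.shop_earnings, sales_by_shop.snapshot_date
    FROM raw_sales
    JOIN sales_by_shop ON raw_sales.shop_id = sales_by_shop.shop_id
    """,
    conn,
)
# The single raw sale now appears once per matching shop row.
print(len(joined))

# Reading each table on its own avoids the repetition.
raw_only = pd.read_sql("SELECT * FROM raw_sales", conn)
print(len(raw_only))
```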

In [3]:
retail_sales.to_csv("retail_sales.csv", sep=",")

In [None]:
# As I could not connect from home, I had to reload the previously saved file
# retail_sales = pd.read_csv("retail_sales.csv")
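
Saving with `to_csv` writes the DataFrame index as an unnamed first column, and `read_csv` then loads it back as a regular column called `Unnamed: 0`; this is most likely the origin of the `Unnamed: 0` column in the output below. A minimal sketch of two ways to avoid it, using an in-memory buffer and made-up rows instead of the file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({"shop_id": [29, 28], "item_price": [1199.0, 479.0]})

# Default round trip: the index is written and comes back as "Unnamed: 0".
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_extra_column = pd.read_csv(buffer)
print(list(with_extra_column.columns))

# Option 1: do not write the index at all.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

# Option 2: write it, but tell read_csv that the first column is the index.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
index_restored = pd.read_csv(buffer, index_col=0)
print(list(without_index.columns), list(index_restored.columns))
```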

# Task 1: Clean the data

In [27]:
retail_sales.head(13)

Unnamed: 0.1,Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,item_id.1,item_earnings,total_items_sold,date.1,id,item_id.2,item_earnings.1,total_items_sold.1,date.2,shop_id.1,shop_earnings,total_items_sold.2,date.3
0,0,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/09/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/09/2019
1,1,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/12/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/09/2019
2,2,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/12/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/09/2019
3,3,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/09/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/12/2019
4,4,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/12/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/12/2019
5,5,2015-01-04,29,1469,1199.0,1.0,1469,3597.0,3.0,03/12/2019,43,1469,3597.0,3.0,03/12/2019,29,28579.0,24.0,03/12/2019
6,6,2015-01-04,28,21364,479.0,1.0,21364,7616.0,22.0,03/09/2019,940,21364,7616.0,22.0,03/12/2019,28,58538.0,82.0,03/09/2019
7,7,2015-01-04,28,21364,479.0,1.0,21364,7616.0,22.0,03/12/2019,940,21364,7616.0,22.0,03/12/2019,28,58538.0,82.0,03/09/2019
8,8,2015-01-04,28,21364,479.0,1.0,21364,7616.0,22.0,03/12/2019,940,21364,7616.0,22.0,03/12/2019,28,58538.0,82.0,03/09/2019
9,9,2015-01-04,28,21364,479.0,1.0,21364,7616.0,22.0,03/09/2019,940,21364,7616.0,22.0,03/12/2019,28,58538.0,82.0,03/12/2019


In [None]:
# Because everything was downloaded into one joined table, some columns have to be renamed.
# They are used later (for example, to rebuild the original shop and item tables).

In [31]:
retail_sales = retail_sales.rename(
    columns={
        "total_items_sold.1": "total_items_sold_item",
        "date.2": "date_item_sale",
        "total_items_sold.2": "total_items_sold_shop",
        "date.3": "date_shop_sale",
    }
)

In [44]:
# Keep one copy of each field and drop the duplicated join columns
cleaned_columns = [
    "date", "item_id", "item_price", "item_cnt_day",
    "item_earnings", "date_item_sale", "total_items_sold_item",
    "shop_id", "date_shop_sale", "shop_earnings", "total_items_sold_shop",
]
cleaned_sales = retail_sales[cleaned_columns]
cleaned_sales.to_csv("cleaned_data.csv", sep=",")
cleaned_sales.head()

Unnamed: 0,date,item_id,item_price,item_cnt_day,item_earnings,date_item_sale,total_items_sold_item,shop_id,date_shop_sale,shop_earnings,total_items_sold_shop
0,2015-01-04,1469,1199.0,1.0,3597.0,03/12/2019,3.0,29,03/09/2019,28579.0,24.0
1,2015-01-04,1469,1199.0,1.0,3597.0,03/12/2019,3.0,29,03/09/2019,28579.0,24.0
2,2015-01-04,1469,1199.0,1.0,3597.0,03/12/2019,3.0,29,03/09/2019,28579.0,24.0
3,2015-01-04,1469,1199.0,1.0,3597.0,03/12/2019,3.0,29,03/12/2019,28579.0,24.0
4,2015-01-04,1469,1199.0,1.0,3597.0,03/12/2019,3.0,29,03/12/2019,28579.0,24.0
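
The selection above keeps the joined rows as they are, so repeated rows and string-typed dates remain in the cleaned data. A hedged sketch of two further cleaning steps on made-up rows: dropping exact duplicates and parsing the dates. The format string `%m/%d/%Y` is an assumption; the data does not reveal whether `03/09/2019` is month-first or day-first.

```python
import pandas as pd

rows = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-04"],
    "shop_id": [29, 29, 28],
    "date_shop_sale": ["03/09/2019", "03/09/2019", "03/12/2019"],
    "shop_earnings": [28579.0, 28579.0, 58538.0],
})

# Remove rows that are identical in every column.
deduplicated = rows.drop_duplicates().reset_index(drop=True)

# Convert both date columns from strings to datetime values.
parsed = deduplicated.assign(
    date=pd.to_datetime(deduplicated["date"], format="%Y-%m-%d"),
    date_shop_sale=pd.to_datetime(
        deduplicated["date_shop_sale"], format="%m/%d/%Y"  # assumed month-first
    ),
)
print(parsed.dtypes)
```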


# Task 2: Aggregate per shop

In [None]:
# Rebuild the original shop dataset
# --> group by date_shop_sale and shop_id
# --> keep the first row of each group, because the join duplicated the rows

In [45]:
cleaned_shop_sales = (
    cleaned_sales[["shop_id", "date_shop_sale", "shop_earnings", "total_items_sold_shop"]]
    .groupby(["date_shop_sale", "shop_id"])
    .first()
)
cleaned_shop_sales.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,shop_earnings,total_items_sold_shop
date_shop_sale,shop_id,Unnamed: 2_level_1,Unnamed: 3_level_1
03/09/2019,2,33023.5,27.0
03/09/2019,3,22481.0,11.0
03/09/2019,4,9787.0,13.0


In [43]:
aggregate_per_store = cleaned_shop_sales.groupby("shop_id").sum()
aggregate_per_store.to_csv("aggregate_per_store.csv", sep=",")
aggregate_per_store.head()

Unnamed: 0_level_0,shop_earnings,total_items_sold_shop
shop_id,Unnamed: 1_level_1,Unnamed: 2_level_1
2,66047.0,54.0
3,44962.0,22.0
4,19574.0,26.0
5,22092.0,30.0
6,77568.0,100.0
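
The two-step grouping above first collapses the repeated joined rows to one row per date and shop, then sums over the dates. A small sketch with made-up numbers shows why summing the joined rows directly would overcount:

```python
import pandas as pd

joined = pd.DataFrame({
    "shop_id": [2, 2, 2, 2, 3],
    "date_shop_sale": [
        "03/09/2019", "03/09/2019", "03/12/2019", "03/12/2019", "03/09/2019",
    ],
    "shop_earnings": [100.0, 100.0, 150.0, 150.0, 40.0],
})

# Step 1: one row per (date, shop); the repeated rows carry identical values.
per_day = joined.groupby(["date_shop_sale", "shop_id"]).first()
# Step 2: add up the daily values for each shop.
per_shop = per_day.groupby("shop_id").sum()
print(per_shop)

# Summing the joined rows directly counts every duplicate.
naive = joined.groupby("shop_id")["shop_earnings"].sum()
print(naive)
```

This only works when the repeated rows within a group really are identical; if they differ, `first()` silently discards the other values.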


# Task 3: Aggregate per item

In [49]:
# As before: rebuild the original item dataset
# --> group by date_item_sale and item_id
# --> keep the first row of each group, because the join duplicated the rows

In [46]:
cleaned_item_sales = (
    cleaned_sales[["item_id", "date_item_sale", "item_earnings", "total_items_sold_item"]]
    .groupby(["date_item_sale", "item_id"])
    .first()
)
cleaned_item_sales.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,item_earnings,total_items_sold_item
date_item_sale,item_id,Unnamed: 2_level_1,Unnamed: 3_level_1
03/12/2019,30,169.0,1.0
03/12/2019,31,363.0,1.0
03/12/2019,32,149.0,1.0


In [48]:
aggregate_per_item = cleaned_item_sales.groupby("item_id").sum()
aggregate_per_item.to_csv("aggregate_per_item.csv", sep=",")
aggregate_per_item.head()

Unnamed: 0_level_0,item_earnings,total_items_sold_item
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1
30,169.0,1.0
31,363.0,1.0
32,149.0,1.0
42,299.0,1.0
59,249.0,1.0
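
Tasks 2 and 3 follow the same pattern, so both could be expressed with one helper. This is a sketch under assumptions: the function name `aggregate_by` is hypothetical and the rows are made up; only the column names come from the notebook.

```python
import pandas as pd

def aggregate_by(frame: pd.DataFrame, key: str, date_col: str, value_cols: list) -> pd.DataFrame:
    """Hypothetical helper: de-duplicate per (date, key), then sum per key."""
    per_day = frame[[key, date_col] + value_cols].groupby([date_col, key]).first()
    return per_day.groupby(key).sum()

# Made-up joined rows: item 30 appears twice because of the join.
toy = pd.DataFrame({
    "item_id": [30, 30, 31],
    "date_item_sale": ["03/12/2019", "03/12/2019", "03/12/2019"],
    "item_earnings": [169.0, 169.0, 363.0],
    "total_items_sold_item": [1.0, 1.0, 1.0],
})

per_item = aggregate_by(
    toy, "item_id", "date_item_sale", ["item_earnings", "total_items_sold_item"]
)
print(per_item)
```

The shop aggregate would use the same call with `"shop_id"`, `"date_shop_sale"`, and the two shop value columns.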
