## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work.

You then sit down with your team and define the following tasks:
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are given a sample of the file that you will have to read every day. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.
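
As a rough sketch of that import, `pandas.read_sql` can load a whole table into a DataFrame. The connection details for Ironhack's server are not given here, so this example uses an in-memory SQLite database as a stand-in and fills it with two rows that mirror the sample file; only the table name `raw_sales` comes from the text above.

```python
import sqlite3

import pandas as pd

# Stand-in for the real server (assumption): an in-memory SQLite database
# holding a tiny raw_sales table with the same columns as the sample file.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "date": ["2015-01-04 00:00:00", "2015-01-04 00:00:00"],
        "shop_id": [29, 28],
        "item_id": [1469, 21364],
        "item_price": [1199.0, 479.0],
        "item_cnt_day": [1.0, 1.0],
    }
).to_sql("raw_sales", conn, index=False)

# Load the whole table into a DataFrame.
raw_sales = pd.read_sql("SELECT * FROM raw_sales", conn)
conn.close()

print(raw_sales.shape)
```

Against a real server, only the connection object would change; the `read_sql` call stays the same.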

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# import data
import pandas as pd
import numpy as np

sales_raw = pd.read_csv('../datasets/retail_sales-raw_sales.csv', sep = ';')
sales = sales_raw.copy()
sales.head(10)

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0
5,2015-01-04 00:00:00,28,21842,149.0,1.0
6,2015-01-04 00:00:00,28,21881,299.0,1.0
7,2015-01-04 00:00:00,29,6930,2199.0,1.0
8,2015-01-04 00:00:00,29,10515,169.0,1.0
9,2015-01-04 00:00:00,29,8624,149.0,1.0


In [2]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          4545 non-null   object 
 1   shop_id       4545 non-null   int64  
 2   item_id       4545 non-null   int64  
 3   item_price    4545 non-null   float64
 4   item_cnt_day  4545 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 177.7+ KB


In [3]:
# Counts are whole numbers, so cast item_cnt_day from float to int.
sales['item_cnt_day'] = sales['item_cnt_day'].astype(int)
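
The cleaning step could go a little further. The sketch below works on a toy frame with the same columns as the sample file (the values are illustrative): it parses the `date` strings into timestamps and casts the counts to `int` only after checking that every value is a whole number, since `astype(int)` would otherwise truncate silently. The name `counts_are_whole` is introduced here for illustration.

```python
import pandas as pd

# Toy frame with the same columns as the sample file (illustrative values).
df = pd.DataFrame(
    {
        "date": ["2015-01-04 00:00:00", "2015-01-04 00:00:00"],
        "shop_id": [29, 28],
        "item_id": [1469, 21365],
        "item_price": [1199.0, 999.0],
        "item_cnt_day": [1.0, 2.0],
    }
)

# Parse the date strings into real timestamps.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d %H:%M:%S")

# Cast to int only when every count is a whole number, so nothing is truncated.
counts_are_whole = (df["item_cnt_day"] % 1 == 0).all()
if counts_are_whole:
    df["item_cnt_day"] = df["item_cnt_day"].astype(int)

print(df.dtypes)
```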

In [4]:
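# Create the aggregates. `date` is a string column and should not be summed,
# so select the value columns explicitly instead of summing everything.
value_cols = ['item_price', 'item_cnt_day']
sales_byshop = sales.groupby('shop_id')[value_cols].sum()
sales_byitem = sales.groupby('item_id')[value_cols].sum()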
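
Summing `item_price` adds up unit prices, which is not the same as money earned. One possible alternative, assuming revenue per row is unit price times units sold (an assumption, not something the task states), is to use named aggregation. The column names `revenue` and `units_sold` are introduced here for illustration, and the toy rows are taken from the sample preview.

```python
import pandas as pd

# Toy rows taken from the sample preview.
df = pd.DataFrame(
    {
        "shop_id": [28, 28, 29],
        "item_id": [21364, 21365, 1469],
        "item_price": [479.0, 999.0, 1199.0],
        "item_cnt_day": [1, 2, 1],
    }
)

# Assumption: revenue per row is unit price times units sold.
df["revenue"] = df["item_price"] * df["item_cnt_day"]

# Named aggregation: each output column states its source column and function.
by_shop = df.groupby("shop_id").agg(
    units_sold=("item_cnt_day", "sum"),
    revenue=("revenue", "sum"),
)
by_item = df.groupby("item_id").agg(
    units_sold=("item_cnt_day", "sum"),
    revenue=("revenue", "sum"),
)

print(by_shop)
```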

In [5]:
# finally export them to csv
sales_byshop.to_csv('../datasets/retail_sales-sales_byshop.csv')
sales_byitem.to_csv('../datasets/retail_sales-sales_byitem.csv')
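
The task asks for three tables in a local database rather than CSV files. A minimal sketch of that last step, assuming SQLite as the local database, could use `DataFrame.to_sql`. The table names `sales_clean`, `sales_byshop` and `sales_byitem` are hypothetical, and the in-memory connection and toy data only keep the example self-contained.

```python
import sqlite3

import pandas as pd

# Toy cleaned data with the same columns as the sample file.
sales = pd.DataFrame(
    {
        "date": ["2015-01-04 00:00:00"] * 3,
        "shop_id": [28, 28, 29],
        "item_id": [21364, 21365, 1469],
        "item_price": [479.0, 999.0, 1199.0],
        "item_cnt_day": [1, 2, 1],
    }
)
value_cols = ["item_price", "item_cnt_day"]
sales_byshop = sales.groupby("shop_id")[value_cols].sum()
sales_byitem = sales.groupby("item_id")[value_cols].sum()

# Assumption: SQLite as the local database; ":memory:" avoids touching disk.
conn = sqlite3.connect(":memory:")

# The cleaned data has no meaningful index, so it is not written.
sales.to_sql("sales_clean", conn, if_exists="replace", index=False)
# For the aggregates, the group key is the index and becomes a column.
sales_byshop.to_sql("sales_byshop", conn, if_exists="replace")
sales_byitem.to_sql("sales_byitem", conn, if_exists="replace")

# List the tables that now exist.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
)
print(tables["name"].tolist())
conn.close()
```

With `if_exists="replace"` each run overwrites the tables. A daily process that accumulates history would more likely use `if_exists="append"` for the cleaned data and rebuild the aggregates from it.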