## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So, get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
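
The last two instructions can be sketched as follows. This is only a minimal illustration, assuming SQLite as the local database (the exercise does not say which engine to use) and using a tiny made-up DataFrame; the table names `clean_sales`, `store_agg` and `item_agg` are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the cleaned daily data.
cleaned = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 1469],
    "item_price": [479.0, 999.0, 1199.0],
    "item_cnt_day": [1, 2, 1],
})

# One aggregate per store and one per item, adding up the remaining value columns.
value_cols = ["item_price", "item_cnt_day"]
per_store = cleaned.groupby("shop_id", as_index=False)[value_cols].sum()
per_item = cleaned.groupby("item_id", as_index=False)[value_cols].sum()

# Assumption: SQLite as the local database; ":memory:" keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical table names; if_exists="replace" recreates each table on every run.
frames = {"clean_sales": cleaned, "store_agg": per_store, "item_agg": per_item}
for name, frame in frames.items():
    frame.to_sql(name, conn, if_exists="replace", index=False)

# Read the table names back to confirm that the three tables exist.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
)
print(tables)
```

For a daily process, `if_exists="append"` may be the better choice for the cleaned table, so that each day's rows are added instead of overwriting the previous ones.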

In [1]:
# your code here
import pandas as pd
import numpy as np

In [4]:
# read the csv into a variable, passing the separator type

raw_sales = pd.read_csv("data/retail_sales-raw_sales.csv", sep=";")

### Cleaning Data

In [7]:
# previewing the table to see which columns and data it has

raw_sales.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0


In [9]:
# reviewing the table info to see if I need to make any changes to the Dtypes

raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          4545 non-null   object 
 1   shop_id       4545 non-null   int64  
 2   item_id       4545 non-null   int64  
 3   item_price    4545 non-null   float64
 4   item_cnt_day  4545 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 177.7+ KB


In [12]:
# since the dtype of the date column is object, I am going to change it to datetime and normalize it so it only shows the date and not the time, as the times are all 00:00:00

raw_sales["date"] = pd.to_datetime(raw_sales["date"], errors="coerce").dt.normalize()
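
A small sketch of what this conversion does, on made-up values rather than the real file: `errors="coerce"` turns unparseable strings into `NaT` instead of raising, and `.dt.normalize()` resets the time part to midnight. Counting the `NaT` values afterwards is one way to notice rows that were silently coerced.

```python
import pandas as pd

# Made-up sample: two valid timestamps and one value that cannot be parsed.
dates = pd.Series(["2015-01-04 00:00:00", "2015-01-04 13:45:00", "not a date"])

# Same call as in the cell above, applied to the sample.
parsed = pd.to_datetime(dates, errors="coerce").dt.normalize()

# Number of values that could not be parsed and became NaT.
n_bad = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_bad)
```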

In [17]:
# converting the column item_cnt_day from float to int

raw_sales["item_cnt_day"] = raw_sales["item_cnt_day"].astype(int)
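
The cast works on this sample because every count is a whole number with no missing values. As a hedged sketch, a daily process could guard the cast with a check like the one below, since `astype(int)` silently truncates fractional values and fails on `NaN`. The helper name `to_int_if_whole` is hypothetical.

```python
import pandas as pd


def to_int_if_whole(values: pd.Series) -> pd.Series:
    """Cast a float Series to int64 only if no information would be lost."""
    if values.isna().any():
        raise ValueError("missing values cannot be cast to int")
    if not (values % 1 == 0).all():
        raise ValueError("fractional values would be truncated")
    return values.astype("int64")


# Made-up sample that mirrors the item_cnt_day column.
counts = to_int_if_whole(pd.Series([1.0, 2.0, 1.0]))
print(counts.dtype)
```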

In [18]:
# checking the info again to ensure the dtype of the date column has been changed to datetime and item_cnt_day to int
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          4545 non-null   datetime64[ns]
 1   shop_id       4545 non-null   int64         
 2   item_id       4545 non-null   int64         
 3   item_price    4545 non-null   float64       
 4   item_cnt_day  4545 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 177.7 KB


In [19]:
# checking the table to see if the date column has the normalized date 

raw_sales.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1
1,2015-01-04,28,21364,479.0,1
2,2015-01-04,28,21365,999.0,2
3,2015-01-04,28,22104,249.0,2
4,2015-01-04,28,22091,179.0,1


### Aggregate per store

In [42]:
# groupby shop id; count() gives the number of rows per shop, it does not add up the values

store = raw_sales.groupby("shop_id", sort=True).count().reset_index()

In [44]:
store.head()

Unnamed: 0,shop_id,date,item_id,item_price,item_cnt_day
0,2,75,75,75,75
1,3,33,33,33,33
2,4,39,39,39,39
3,5,45,45,45,45
4,6,126,126,126,126
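
The table above holds row counts per shop. Since the task asks for an aggregate that adds up the values, a sum-based version could look like the sketch below. It uses a made-up sample, and the choice to sum `item_cnt_day` and a `revenue` column (price times units) is an assumption about which totals are useful.

```python
import pandas as pd

# Made-up sample with the same columns as raw_sales.
sales = pd.DataFrame({
    "shop_id": [28, 28, 29, 29],
    "item_id": [21364, 21365, 1469, 1469],
    "item_price": [479.0, 999.0, 1199.0, 1199.0],
    "item_cnt_day": [1, 2, 1, 3],
})

# Revenue per row, so that the per-shop total is meaningful.
sales["revenue"] = sales["item_price"] * sales["item_cnt_day"]

# Add up units and revenue per shop instead of counting rows.
store_totals = sales.groupby("shop_id", as_index=False)[["item_cnt_day", "revenue"]].sum()
print(store_totals)
```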


### Aggregate per item

In [69]:
# groupby item; count() gives the number of rows per item, it does not add up the values

item = raw_sales.groupby("item_id").count().reset_index()

In [70]:
item.head()

Unnamed: 0,item_id,date,shop_id,item_price,item_cnt_day
0,30,3,3,3,3
1,31,3,3,3,3
2,32,3,3,3,3
3,42,3,3,3,3
4,59,3,3,3,3
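
For the per-item aggregate, named aggregation is one option when different columns need different functions. The following is a sketch on made-up data; the output column names `units_sold`, `n_shops` and `avg_price` are hypothetical.

```python
import pandas as pd

# Made-up sample with the same columns as raw_sales.
sales = pd.DataFrame({
    "shop_id": [28, 29, 29],
    "item_id": [21364, 21364, 1469],
    "item_price": [479.0, 479.0, 1199.0],
    "item_cnt_day": [1, 2, 1],
})

# One row per item: total units, number of distinct shops, and average price.
item_totals = (
    sales.groupby("item_id")
    .agg(
        units_sold=("item_cnt_day", "sum"),
        n_shops=("shop_id", "nunique"),
        avg_price=("item_price", "mean"),
    )
    .reset_index()
)
print(item_totals)
```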


In [78]:
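# groupby shop id and item id
# note: `item` is the per-item count table from above, so its shop_id column holds row counts, not shop identifiers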

item.groupby(["shop_id","item_id"]).count().drop(columns = ["date", "item_price", "item_cnt_day"], axis=1)

shop_id,item_id
3,30
3,31
3,32
3,42
3,59
...,...
36,11927
42,17717
45,20949
48,1969
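
To get real shop and item pairs, the grouping would have to start from the row-level data rather than from the `item` count table. This is a minimal sketch on made-up rows, assuming the goal is the number of units sold per shop and item.

```python
import pandas as pd

# Made-up sample with the same columns as raw_sales.
sales = pd.DataFrame({
    "shop_id": [28, 28, 28, 29],
    "item_id": [21364, 21364, 22104, 1469],
    "item_price": [479.0, 479.0, 249.0, 1199.0],
    "item_cnt_day": [1, 2, 2, 1],
})

# Units sold for every (shop_id, item_id) pair found in the row-level data.
pairs = sales.groupby(["shop_id", "item_id"], as_index=False)["item_cnt_day"].sum()
print(pairs)
```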


In [79]:
# adding a revenue column 

raw_sales["revenue"] = raw_sales["item_price"] * raw_sales["item_cnt_day"]

In [80]:
raw_sales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,revenue
0,2015-01-04,29,1469,1199.0,1,1199.0
1,2015-01-04,28,21364,479.0,1,479.0
2,2015-01-04,28,21365,999.0,2,1998.0
3,2015-01-04,28,22104,249.0,2,498.0
4,2015-01-04,28,22091,179.0,1,179.0
...,...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1,1299.0
4541,2015-01-04,14,21922,99.0,1,99.0
4542,2015-01-04,15,1969,3999.0,1,3999.0
4543,2015-01-04,14,22091,179.0,1,179.0


In [93]:
# total revenue by shop

shop_revenue = raw_sales.groupby("shop_id")["revenue"].sum().reset_index().sort_values("revenue", ascending=False)

In [95]:
shop_revenue.head()

Unnamed: 0,shop_id,revenue
28,42,330111.0
21,31,304692.0
7,12,295173.0
16,25,288432.0
13,21,228999.0
