# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news: it has just been hired by a major retail company! So, get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# relevant imports of libraries

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import pymysql

In [2]:
# setting up the connection to the MySQL server that holds the retail_sales database

driver = 'mysql+pymysql'
user = 'root'
password = 'sushi'
ip = '127.0.0.1'
connection_string = f'{driver}://{user}:{password}@{ip}'
db_connection = create_engine(connection_string)
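The cell above keeps the password in the notebook itself. As one possible alternative, the connection string could be assembled from environment variables. This is only a sketch: the `DB_*` variable names and the `build_connection_string` helper are hypothetical, and the defaults simply mirror the values used above.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a SQLAlchemy-style URL from environment variables.

    The DB_* names are hypothetical; use whatever fits your own setup.
    """
    driver = env.get("DB_DRIVER", "mysql+pymysql")
    user = env.get("DB_USER", "root")
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "127.0.0.1")
    return f"{driver}://{user}:{password}@{host}"


# A plain dict stands in for os.environ so the example runs anywhere
connection_string = build_connection_string(
    {"DB_USER": "analyst", "DB_PASSWORD": "p@ss/word"}
)
print(connection_string)
```

The resulting string can be passed to `create_engine` exactly as in the cell above.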


In [3]:
retailsales = pd.read_sql_query("SELECT * FROM retail_sales.raw_sales", db_connection)
retailsales.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [4]:
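# Checking for NaNs
# note: isna().count() returns the number of rows per column, not the number of NaNs;
# isna().sum() would count the missing values

retailsales.isna().count()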

date            4545
shop_id         4545
item_id         4545
item_price      4545
item_cnt_day    4545
dtype: int64
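The value 4545 in every column is the number of rows, not the number of missing values: `isna()` returns a boolean frame that has no missing entries itself, so `count()` counts every row. A minimal sketch on made-up toy data (the `toy` frame is illustrative only) shows the difference from `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative data only)
toy = pd.DataFrame({
    "item_price": [1199.0, np.nan, 999.0],
    "item_cnt_day": [1.0, 2.0, np.nan],
})

# count() on the boolean mask counts every entry, so it equals the row count
row_counts = toy.isna().count()

# sum() treats True as 1, so it gives the number of missing values per column
nan_counts = toy.isna().sum()

print(row_counts)
print(nan_counts)
```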

In [5]:
retailsales.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

In [6]:
retailsales['sales_by_item'] = retailsales['item_price'] * retailsales['item_cnt_day']
retailsales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,sales_by_item
0,2015-01-04,29,1469,1199.0,1.0,1199.0
1,2015-01-04,28,21364,479.0,1.0,479.0
2,2015-01-04,28,21365,999.0,2.0,1998.0
3,2015-01-04,28,22104,249.0,2.0,498.0
4,2015-01-04,28,22091,179.0,1.0,179.0
...,...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0,1299.0
4541,2015-01-04,14,21922,99.0,1.0,99.0
4542,2015-01-04,15,1969,3999.0,1.0,3999.0
4543,2015-01-04,14,22091,179.0,1.0,179.0


In [7]:
# for sales by store

In [8]:
# keep only the store id and the sales amount ('item_id' was listed twice in the original drop)
sales_by_store = retailsales.drop(columns=['date', 'item_id', 'item_price', 'item_cnt_day'])
sales_by_store.head()

Unnamed: 0,shop_id,sales_by_item
0,29,1199.0
1,28,479.0
2,28,1998.0
3,28,498.0
4,28,179.0


In [9]:
sales_by_store_by_day = sales_by_store.groupby('shop_id', as_index=False).agg({"sales_by_item": "sum"})
sales_by_store_by_day.head()

Unnamed: 0,shop_id,sales_by_item
0,2,103746.0
1,3,67443.0
2,4,29361.0
3,5,33138.0
4,6,138678.0


In [40]:
# index=False avoids writing the row index as an extra unnamed column
sales_by_store_by_day.to_csv('../data/aggregate_per_store.csv', index=False)

In [None]:
# for sales by item

In [14]:
sales_by_item = retailsales.drop(columns=['shop_id', 'date'])
sales_by_item.head()

Unnamed: 0,item_id,item_price,item_cnt_day,sales_by_item
0,1469,1199.0,1.0,1199.0
1,21364,479.0,1.0,479.0
2,21365,999.0,2.0,1998.0
3,22104,249.0,2.0,498.0
4,22091,179.0,1.0,179.0


In [16]:
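# note: "count" returns the number of rows per item, not the units sold; use "sum" for total quantity
sales_by_item_by_day = sales_by_item.groupby('item_id', as_index=False).agg({"sales_by_item": "sum", "item_cnt_day": "count"})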
sales_by_item_by_day

Unnamed: 0,item_id,sales_by_item,item_cnt_day
0,30,507.0,3
1,31,1089.0,3
2,32,447.0,3
3,42,897.0,3
4,59,747.0,3
...,...,...,...
980,22091,1074.0,6
981,22092,537.0,3
982,22104,1494.0,3
983,22140,652.5,3
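Most items in this output show a row count of 3, and item 22104 totals 1494.0 while the single row visible earlier amounts to 498.0. This may mean that rows are repeated in the sample, although the notebook does not confirm it. Assuming exact repeats are unwanted, the clean-up step could check for them and drop them. The sketch below uses made-up toy rows, not the real table:

```python
import pandas as pd

# Toy data: the first row appears three times (illustrative only)
toy_sales = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04"] * 4),
    "shop_id": [28, 28, 28, 29],
    "item_id": [22104, 22104, 22104, 1469],
    "item_price": [249.0, 249.0, 249.0, 1199.0],
    "item_cnt_day": [2.0, 2.0, 2.0, 1.0],
})

# duplicated() flags every row that is identical to an earlier one
n_duplicates = int(toy_sales.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
cleaned = toy_sales.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(cleaned))
```

Whether identical rows are true duplicates or separate sales is a judgment call that depends on how the daily file is produced.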


In [17]:
# index=False avoids writing the row index as an extra unnamed column
sales_by_item_by_day.to_csv('../data/aggregate_per_item.csv', index=False)
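The task asks for three tables in a local database, while the cells above write CSV files. A minimal sketch of the database step with `DataFrame.to_sql` follows. To stay self-contained it uses an in-memory SQLite connection and toy rows; the table names are hypothetical, and with the MySQL setup above one would presumably pass a SQLAlchemy engine pointing at the target database instead.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned data (illustrative rows only)
cleaned = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-04"],
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 21365],
    "item_price": [1199.0, 479.0, 999.0],
    "item_cnt_day": [1.0, 1.0, 2.0],
})
cleaned["sales_by_item"] = cleaned["item_price"] * cleaned["item_cnt_day"]

per_store = cleaned.groupby("shop_id", as_index=False).agg({"sales_by_item": "sum"})
per_item = cleaned.groupby("item_id", as_index=False).agg(
    {"sales_by_item": "sum", "item_cnt_day": "sum"}
)

# Assumption: in-memory SQLite stands in for the local MySQL database
conn = sqlite3.connect(":memory:")

# Hypothetical table names, one per required table
tables = {
    "clean_sales": cleaned,
    "aggregate_per_store": per_store,
    "aggregate_per_item": per_item,
}
for name, frame in tables.items():
    # if_exists="replace" overwrites the table on every run
    frame.to_sql(name, conn, if_exists="replace", index=False)

stored = pd.read_sql_query(
    "SELECT * FROM aggregate_per_store ORDER BY shop_id", conn
)
print(stored)
```

For a process that runs daily, `if_exists="append"` may be the better fit for the cleaned table, so that each day's rows accumulate instead of replacing the previous load.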