# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you have just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the database `retail_sales` from one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# your code here
import pandas as pd
import numpy as np

from sqlalchemy import create_engine
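import pymysql  # MySQL driver required by the 'mysql+pymysql' dialect

# Connection settings for the database server.
driver = 'mysql+pymysql'
user = 'root'
password = 'your password'  # placeholder: replace with your own password
ip = '127.0.0.1'
connection_string = f'{driver}://{user}:{password}@{ip}'
db_connection = create_engine(connection_string)

# Load the raw sales table into a DataFrame.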
df = pd.read_sql_query("SELECT * FROM retail_sales.raw_sales", db_connection)
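
The f-string connection string above works for simple passwords, but a password containing characters such as `@` or `/` would break the URL. A minimal sketch of one alternative, assuming SQLAlchemy 1.4 or later: `URL.create` builds the same URL and escapes the password when it is rendered. The environment variable name `RETAIL_DB_PASSWORD` and the fallback password are hypothetical and only there so the sketch runs.

```python
import os

from sqlalchemy.engine import URL

# Hypothetical variable name; the fallback is a made-up password with
# special characters, used only so this sketch runs without any setup.
password = os.environ.get("RETAIL_DB_PASSWORD", "p@ss/word")

url = URL.create(
    drivername="mysql+pymysql",
    username="root",
    password=password,  # escaped automatically when the URL is rendered
    host="127.0.0.1",
)

# Safe to print or log: the password is masked in this rendering.
print(url.render_as_string(hide_password=True))
```

The resulting `url` object can be passed to `create_engine` in place of `connection_string`.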


In [2]:
# Keep a backup: `df` stays as the original DataFrame,
# and all further work happens on the copy `data`.
data = df.copy()

In [3]:
data.head(10)

| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| 5 | 2015-01-04 | 28 | 21842 | 149.0 | 1.0 |
| 6 | 2015-01-04 | 28 | 21881 | 299.0 | 1.0 |
| 7 | 2015-01-04 | 29 | 6930 | 2199.0 | 1.0 |
| 8 | 2015-01-04 | 29 | 10515 | 169.0 | 1.0 |
| 9 | 2015-01-04 | 29 | 8624 | 149.0 | 1.0 |


In [4]:
# Quick check for missing values (NaNs) in each column.
data.isna().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64

In [5]:
print(data.date.sort_values().unique())
print("Shop ID: ", data.shop_id.sort_values().unique())
#print(data.item_id.sort_values().unique())
#print(data.item_price.sort_values().unique())
print('Item count day: ', data.item_cnt_day.sort_values().unique())


['2015-01-04T00:00:00.000000000']
Shop ID:  [ 2  3  4  5  6  7 10 12 14 15 16 18 19 21 22 24 25 26 27 28 29 31 34 35
 37 38 39 41 42 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59]
Item count day:  [-1.  1.  2.  3.  4.  5.  6. 10.]
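
The output above shows a `-1` in `item_cnt_day`, and the notebook does not yet contain an explicit cleaning step. The following is only a sketch of what one could look like, run on a made-up toy frame with the same columns. It rests on two assumptions that should be checked against the real data first: that exact duplicate rows are loading errors rather than genuine repeated sales (`data.duplicated().sum()` shows how many there are), and that a negative count is a return rather than a mistake. The function name `clean_sales` and the `is_return` column are hypothetical.

```python
import pandas as pd

# Toy frame with the same columns as `data`; the rows are made up.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04"] * 5),
    "shop_id": [29, 29, 28, 28, 28],
    "item_id": [1469, 1469, 21364, 21365, 21365],
    "item_price": [1199.0, 1199.0, 479.0, 999.0, 999.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0, -1.0],
})


def clean_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, flag returns, and store counts as integers."""
    cleaned = raw.drop_duplicates().copy()
    # Assumption: a negative count is a return, so it is flagged, not removed.
    cleaned["is_return"] = cleaned["item_cnt_day"] < 0
    # Counts are whole numbers stored as floats; convert them to integers.
    cleaned["item_cnt_day"] = cleaned["item_cnt_day"].astype(int)
    return cleaned.reset_index(drop=True)


cleaned = clean_sales(sample)
print(cleaned)
```

Flagging returns instead of dropping them keeps the net quantity correct when the counts are summed later.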


In [19]:
# An aggregate per store data-set.

agg_store = data.groupby('shop_id').agg({'item_price':['sum','mean','min','max','count']})
agg_store.head(10)


| shop_id | item_price sum | item_price mean | item_price min | item_price max | item_price count |
|---|---|---|---|---|---|
| 2 | 99070.5 | 1320.94 | 28.0 | 8999.0 | 75 |
| 3 | 67443.0 | 2043.727273 | 500.0 | 8999.0 | 33 |
| 4 | 29361.0 | 752.846154 | 79.0 | 2799.0 | 39 |
| 5 | 33138.0 | 736.4 | 99.0 | 3690.0 | 45 |
| 6 | 116352.0 | 923.428571 | 5.0 | 3999.0 | 126 |
| 7 | 52371.0 | 831.285714 | 99.0 | 3999.0 | 63 |
| 10 | 22707.0 | 841.0 | 3.0 | 2456.0 | 27 |
| 12 | 212196.4 | 1473.586111 | 79.0 | 8999.0 | 144 |
| 14 | 33456.0 | 743.466667 | 58.0 | 3999.0 | 45 |
| 15 | 125139.0 | 1345.580645 | 49.0 | 19990.0 | 93 |


In [20]:
# An aggregate per item data-set.

agg_item = data.groupby('item_id').agg({'item_price':['sum','mean','min','max','count']})
agg_item.head(10)

| item_id | item_price sum | item_price mean | item_price min | item_price max | item_price count |
|---|---|---|---|---|---|
| 30 | 507.0 | 169.0 | 169.0 | 169.0 | 3 |
| 31 | 1089.0 | 363.0 | 363.0 | 363.0 | 3 |
| 32 | 447.0 | 149.0 | 149.0 | 149.0 | 3 |
| 42 | 897.0 | 299.0 | 299.0 | 299.0 | 3 |
| 59 | 747.0 | 249.0 | 249.0 | 249.0 | 3 |
| 74 | 1497.0 | 499.0 | 499.0 | 499.0 | 3 |
| 109 | 747.0 | 249.0 | 249.0 | 249.0 | 3 |
| 259 | 747.0 | 249.0 | 249.0 | 249.0 | 3 |
| 464 | 897.0 | 299.0 | 299.0 | 299.0 | 3 |
| 482 | 19800.0 | 3300.0 | 3300.0 | 3300.0 | 6 |
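
The two aggregates above summarise `item_price` only, so the quantity in `item_cnt_day` does not enter the totals. As a possible variation, not the notebook's own method, the sketch below uses pandas named aggregation to add up units and revenue per key and to produce flat column names. The toy data, the helper `aggregate_by`, and the output column names are all made up for illustration.

```python
import pandas as pd

# Toy frame with the relevant columns; the rows are made up.
sample = pd.DataFrame({
    "shop_id": [28, 28, 29, 29],
    "item_id": [21364, 21365, 1469, 21365],
    "item_price": [479.0, 999.0, 1199.0, 999.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 1.0],
})


def aggregate_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Sum units and revenue per `key`, with one flat column per statistic."""
    # Revenue of a row is price times quantity (returns count as negative).
    with_revenue = df.assign(revenue=df["item_price"] * df["item_cnt_day"])
    return (
        with_revenue.groupby(key)
        .agg(
            units_sold=("item_cnt_day", "sum"),
            revenue=("revenue", "sum"),
            avg_price=("item_price", "mean"),
            n_rows=("item_price", "count"),
        )
        .reset_index()
    )


agg_store_flat = aggregate_by(sample, "shop_id")
agg_item_flat = aggregate_by(sample, "item_id")
print(agg_store_flat)
print(agg_item_flat)
```

Flat column names also map directly onto database columns, which a two-level column index does not.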


In [1]:
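# Save the three DataFrames as CSV files in the current folder.
# This cell needs `data`, `agg_store` and `agg_item` from the cells above;
# run in a fresh kernel (note the In [1] counter), it raises the NameError below.
# Forward slashes avoid backslash escapes: '\a' in '.\agg_store.csv' is the bell character.
data.to_csv('./data_cleaned.csv')
agg_store.to_csv('./agg_store.csv')
agg_item.to_csv('./agg_item.csv')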

NameError: name 'data' is not defined
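
The task asks for three tables in a local database, while the cell above writes CSV files. A minimal sketch of the database step with `DataFrame.to_sql` follows, using an in-memory SQLite database purely so that it runs without a server. The helper `flatten_columns` and the toy data are hypothetical. With the MySQL engine from the first cell, one would presumably pass `db_connection` as the connection and name the target database through the `schema=` argument, since the connection string names none; that part is an untested assumption.

```python
import sqlite3

import pandas as pd

# Toy aggregate with two-level columns, built the same way as `agg_store`.
agg = (
    pd.DataFrame({"shop_id": [28, 28, 29], "item_price": [479.0, 999.0, 1199.0]})
    .groupby("shop_id")
    .agg({"item_price": ["sum", "mean", "count"]})
)


def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Join two-level column names with '_' and turn the index into a column."""
    flat = df.copy()
    flat.columns = ["_".join(col) for col in flat.columns]
    return flat.reset_index()


# In-memory SQLite stands in for the local database in this sketch.
con = sqlite3.connect(":memory:")
flatten_columns(agg).to_sql("agg_store", con, if_exists="replace", index=False)

# Read the table back to confirm what was written.
back = pd.read_sql_query("SELECT * FROM agg_store", con)
con.close()
print(back)
```

`if_exists="replace"` rebuilds the table on each run; for a daily load that accumulates rows, `if_exists="append"` may be the better fit.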