# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So, get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV data that you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# your code here
import pandas as pd
import pymysql  # MySQL driver that SQLAlchemy loads for 'mysql+pymysql'
from sqlalchemy import create_engine

# Connection details for Ironhack's database
driver = 'mysql+pymysql'
ip = '34.65.10.136'
username = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
db = 'retail_sales'
connection_string = f'{driver}://{username}:{password}@{ip}/{db}'
engine = create_engine(connection_string)

# Load the whole raw_sales table into a DataFrame
query = 'SELECT * FROM raw_sales'
raw_sales = pd.read_sql(query, engine)
raw_sales

| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 |
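The password used in the connection string above contains an `@`, which is also the separator between credentials and host in a database URL. Recent SQLAlchemy versions expect such characters to be percent-encoded, so the following is a minimal sketch of one way to build the URL more defensively. The helper `build_connection_string` and the environment variable name `RETAIL_DB_PASSWORD` are assumptions made for illustration; they are not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(username, password, host, database,
                            driver="mysql+pymysql"):
    """Return a SQLAlchemy-style URL with the credentials percent-encoded."""
    return (
        f"{driver}://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# RETAIL_DB_PASSWORD is a hypothetical environment variable; the fallback
# value is a placeholder, not a real credential.
password = os.environ.get("RETAIL_DB_PASSWORD", "p@ss/word")
connection_string = build_connection_string(
    "data-students", password, "34.65.10.136", "retail_sales"
)
print(connection_string)
```

The resulting string can be passed to `create_engine` exactly like the hand-built one. Reading the password from the environment also keeps it out of the notebook itself.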


## Data cleaning


In [2]:
'''Check the type of each column and change it if needed'''



raw_sales.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

In [3]:
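# astype returns a new DataFrame: this call only previews the conversion and leaves raw_sales unchanged
raw_sales.astype({'item_cnt_day' : 'int64'})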

| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1 |
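Because `astype` returns a new DataFrame rather than modifying the original, the conversion only sticks if the result is assigned back. The sketch below illustrates this on a small stand-in DataFrame (the values are illustrative, not taken from `raw_sales`) and adds a check that every count is a whole number before casting, which is an assumption about what a safe conversion should look like here.

```python
import pandas as pd

# Small stand-in for raw_sales; the values are illustrative only.
sample = pd.DataFrame({"item_id": [1469, 21364], "item_cnt_day": [1.0, 2.0]})

# Without assignment the original DataFrame keeps its float column.
sample.astype({"item_cnt_day": "int64"})
print(sample["item_cnt_day"].dtype)  # float64

# Cast only if no fractional counts would be truncated, then assign back.
all_whole = bool((sample["item_cnt_day"] % 1 == 0).all())
if all_whole:
    sample = sample.astype({"item_cnt_day": "int64"})
print(sample["item_cnt_day"].dtype)  # int64
```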


In [4]:
'''Check for null values'''

#null_columns=raw_sales.columns[raw_sales.isnull().any()]
#raw_sales[null_columns].isnull().sum()

raw_sales.isnull().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
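Besides nulls, repeated rows are another thing that may be worth checking during cleaning, since they would inflate any sum computed later. The following is a minimal sketch on a stand-in DataFrame with made-up values. Whether identical rows are errors or legitimate separate sales depends on how the source data is produced, so dropping them is an assumption to confirm, not a rule.

```python
import pandas as pd

# Stand-in data with one exact repeat of the first row (illustrative values).
sample = pd.DataFrame({
    "shop_id": [29, 29, 28],
    "item_id": [1469, 1469, 21364],
    "item_price": [1199.0, 1199.0, 479.0],
    "item_cnt_day": [1.0, 1.0, 1.0],
})

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```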

In [9]:
raw_sales.rename(columns={'item_cnt_day':'Quantity_sell','item_price':'Price_Item','date':'Date_sell'}, inplace = True)
raw_sales

| | Date_sell | shop_id | item_id | Price_Item | Quantity_sell |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 |


## Create the aggregates with `groupby`

In [18]:


# Select the numeric columns explicitly so that shop_id and the date column are not summed
item_table = raw_sales.groupby('item_id')[['Price_Item', 'Quantity_sell']].agg(['sum'])
item_table

| item_id | Price_Item (sum) | Quantity_sell (sum) |
|---|---|---|
| 30 | 507.0 | 3.0 |
| 31 | 1089.0 | 3.0 |
| 32 | 447.0 | 3.0 |
| 42 | 897.0 | 3.0 |
| 59 | 747.0 | 3.0 |
| ... | ... | ... |
| 22091 | 1074.0 | 6.0 |
| 22092 | 537.0 | 3.0 |
| 22104 | 747.0 | 6.0 |
| 22140 | 652.5 | 3.0 |
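Aggregating with `.agg(['sum'])` produces two-level column names, which are awkward to store as database columns. As an alternative sketch, pandas named aggregation gives flat, explicit column names in one step. The output names `total_price` and `total_quantity` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the renamed raw_sales DataFrame (illustrative values).
sample = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 1469],
    "Price_Item": [1199.0, 479.0, 1199.0],
    "Quantity_sell": [1.0, 1.0, 2.0],
})

# Named aggregation: new_column=(source_column, aggregation function).
item_totals = sample.groupby("item_id").agg(
    total_price=("Price_Item", "sum"),
    total_quantity=("Quantity_sell", "sum"),
)

print(item_totals.columns.tolist())  # ['total_price', 'total_quantity']
print(item_totals)
```

The same pattern applies to the per-shop aggregate by grouping on `shop_id` instead.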


In [19]:
# Select the numeric columns explicitly so that item_id and the date column are not summed
shop_table = raw_sales.groupby('shop_id')[['Price_Item', 'Quantity_sell']].agg(['sum'])
shop_table

| shop_id | Price_Item (sum) | Quantity_sell (sum) |
|---|---|---|
| 2 | 99070.5 | 81.0 |
| 3 | 67443.0 | 33.0 |
| 4 | 29361.0 | 39.0 |
| 5 | 33138.0 | 45.0 |
| 6 | 116352.0 | 150.0 |
| 7 | 52371.0 | 63.0 |
| 10 | 22707.0 | 30.0 |
| 12 | 212196.4 | 216.0 |
| 14 | 33456.0 | 51.0 |
| 15 | 125139.0 | 93.0 |
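The remaining step of the task is to write the cleaned data and the two aggregates to a local database. The sketch below shows one possible way to do that with `DataFrame.to_sql`. It uses an in-memory SQLite connection as a stand-in for the local database so that it runs anywhere; with a local MySQL server, a SQLAlchemy engine would take the place of `conn`. The table names and sample values are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in for the cleaned data (illustrative values, date column omitted).
cleaned = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 1469],
    "Price_Item": [1199.0, 479.0, 1199.0],
    "Quantity_sell": [1, 1, 2],
})

# Flat aggregates: reset_index() turns the group key into a regular column.
value_columns = ["Price_Item", "Quantity_sell"]
per_shop = cleaned.groupby("shop_id")[value_columns].sum().reset_index()
per_item = cleaned.groupby("item_id")[value_columns].sum().reset_index()

# In-memory SQLite stands in for the local database (an assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table names, one per required table.
tables = {
    "sales_cleaned": cleaned,
    "sales_per_shop": per_shop,
    "sales_per_item": per_item,
}
for name, frame in tables.items():
    # if_exists="replace" recreates the table on every run.
    frame.to_sql(name, conn, if_exists="replace", index=False)

# Read the row counts back to confirm that the tables were written.
row_counts = {
    name: conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    for name in tables
}
print(row_counts)
conn.close()
```

For a process that runs daily, `if_exists="append"` may suit the cleaned table better, so that each day's rows are added instead of overwriting the previous load.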
