# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.
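
The two aggregates above can be sketched on a toy table. This is only an illustration: the column names follow the sample data, the values are invented, and the `revenue` column (price times units sold) is an assumption about which total is useful, not part of the assignment.

```python
import pandas as pd

# Toy data using the column names of the sample file; the values are invented.
sales = pd.DataFrame({
    "shop_id": [1, 1, 2, 2],
    "item_id": [10, 11, 10, 10],
    "item_price": [5.0, 8.0, 5.0, 5.0],
    "item_cnt_day": [2, 1, 3, 1],
})

# Assumption: revenue per row is the price times the units sold that day.
sales["revenue"] = sales["item_price"] * sales["item_cnt_day"]

# One row per store, adding up units and revenue.
per_store = sales.groupby("shop_id")[["item_cnt_day", "revenue"]].sum()

# One row per item, adding up the same measures.
per_item = sales.groupby("item_id")[["item_cnt_day", "revenue"]].sum()

print(per_store)
print(per_item)
```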

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
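
The clean-up in step 2 is left open, so the following is only a minimal sketch. The function name `clean_sales` is hypothetical, and the rules (parse the date, drop rows with missing values, drop exact duplicates) are assumptions about what "clean" might mean here. The dates in the toy data are invented and use ISO format, which may differ from the real file.

```python
import pandas as pd


def clean_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step; the rules below are assumptions."""
    cleaned = raw.copy()
    # Parse the date column; values that cannot be parsed become NaT.
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    # Drop rows with any missing value, then drop exact duplicate rows.
    cleaned = cleaned.dropna().drop_duplicates()
    return cleaned.reset_index(drop=True)


# Invented rows: one duplicate and one row with an unparseable date.
raw = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "not a date"],
    "shop_id": [29, 29, 30],
    "item_id": [1469, 1469, 20],
    "item_price": [1199.0, 1199.0, 50.0],
    "item_cnt_day": [1, 1, 2],
})

cleaned = clean_sales(raw)
print(cleaned)
```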

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [20]:
# your code here
import pandas as pd
import pymysql 
import sqlalchemy
import numpy as np

In [42]:
#Connecting to database
conn_string = 'mysql+pymysql://root:&FK^[8jLK$"/4=n*@34.65.10.136:3306/retail_sales'
conn = sqlalchemy.create_engine(conn_string)

df_rsales = pd.read_sql_query('SELECT * FROM raw_sales;', conn)

#At first sight, we find items sold in different shops, each with a price and an item_cnt_day.
#We group by item_id and check the min and max of every column:
gbitem = df_rsales.groupby('item_id').agg(['min','max'])

#Let's check for missing values in the original table
df_rsales.isna().sum()
#No NaN values to clean

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
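
The connection string in this cell embeds a password with special characters directly in the URL. A common precaution is to percent-encode the password before building the string, so that characters such as `@`, `/` or `:` cannot be read as URL delimiters. The sketch below uses placeholder credentials (all values are hypothetical) and only the standard library.

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only, not the real ones.
user = "analyst"
password = "p@ss/w:rd"
host = "localhost"
port = 3306
database = "retail_sales"

# Percent-encode the password so '@', '/' and ':' are not treated as delimiters.
conn_string = (
    f"mysql+pymysql://{user}:{quote_plus(password)}@{host}:{port}/{database}"
)

print(conn_string)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the one in the cell. In a daily process it is usually safer to read the password from the environment or a configuration file than to keep it in the notebook.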

In [43]:
#There are 30 items which have different selling prices, which might affect the final analysis.
wrongprices = gbitem.loc[gbitem[('item_price', 'min')] != gbitem[('item_price','max')]]

#We suggest investigating the source of the data for those items in order to get a better analysis.
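
An alternative way to spot the same items is to count the distinct prices per item instead of comparing the minimum and maximum. The sketch below runs on invented toy data and the name `inconsistent_items` is hypothetical.

```python
import pandas as pd

# Invented data: item 10 is sold at two different prices, item 11 at one.
sales = pd.DataFrame({
    "item_id": [10, 10, 11, 11],
    "item_price": [5.0, 6.0, 8.0, 8.0],
})

# Number of distinct prices observed for each item.
price_counts = sales.groupby("item_id")["item_price"].nunique()

# Items that were sold at more than one price.
inconsistent_items = price_counts[price_counts > 1].index.tolist()

print(inconsistent_items)
```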

In [47]:
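#We group by item_id and sum the remaining numeric columns (shop_id, item_price, item_cnt_day).
#numeric_only=True skips the date column, which cannot be summed meaningfully.
itemdf = df_rsales.groupby('item_id').sum(numeric_only=True)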

In [48]:
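#We group by shop_id and sum the remaining numeric columns (item_id, item_price, item_cnt_day).
#numeric_only=True skips the date column, which cannot be summed meaningfully.
shopsdf = df_rsales.groupby('shop_id').sum(numeric_only=True)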
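
The notebook stops before step 4, writing the three tables. A minimal sketch of that step follows, under several assumptions: an in-memory SQLite database stands in for the local database, the table names (`sales_clean`, `sales_by_store`, `sales_by_item`) are hypothetical, and the data is invented. With a MySQL engine created by `sqlalchemy.create_engine`, the same `to_sql` calls should apply with the engine in place of the SQLite connection.

```python
import sqlite3

import pandas as pd

# Invented stand-in for the cleaned data.
cleaned = pd.DataFrame({
    "shop_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "item_cnt_day": [2, 1, 3],
})

# Aggregates with the grouping key kept as a regular column.
per_store = cleaned.groupby("shop_id", as_index=False)[["item_cnt_day"]].sum()
per_item = cleaned.groupby("item_id", as_index=False)[["item_cnt_day"]].sum()

# In-memory SQLite database used as a stand-in for the local database.
conn = sqlite3.connect(":memory:")

# if_exists="replace" keeps the sketch rerunnable; a daily load might
# prefer "append" for the cleaned data instead.
cleaned.to_sql("sales_clean", conn, if_exists="replace", index=False)
per_store.to_sql("sales_by_store", conn, if_exists="replace", index=False)
per_item.to_sql("sales_by_item", conn, if_exists="replace", index=False)

# List the tables that now exist and read one back as a check.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
store_totals = pd.read_sql_query("SELECT * FROM sales_by_store", conn)
conn.close()

print(tables)
print(store_totals)
```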