# Mandatory Challenge
## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work.

You then sit down with your team and define the following tasks:
1. Start the analysis using historical data.
2. Define a process that takes the daily data as input and integrates it.

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
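
The first three steps above can be sketched as one small function that a daily job could call. This is a minimal sketch under stated assumptions, not the required solution: the function name `run_daily_process` and the sample rows are made up for illustration, the cleaning rule (setting a count of -1 to 0) and the `aggregate_item` column mirror the choices made later in this notebook, and writing the tables to the database is left out.

```python
import pandas as pd

def run_daily_process(raw: pd.DataFrame) -> dict:
    """Clean one day's sales and build the per-store and per-item aggregates."""
    cleaned = raw.copy()
    # Cleaning rule assumed here: a daily count of -1 is treated as 0 units sold
    cleaned["item_cnt_day"] = cleaned["item_cnt_day"].replace({-1: 0})
    # Amount per row: unit price times units sold
    cleaned["aggregate_item"] = cleaned["item_price"] * cleaned["item_cnt_day"]
    per_store = cleaned.groupby("shop_id", as_index=False)[["aggregate_item"]].sum()
    per_item = cleaned.groupby("item_id", as_index=False)[["aggregate_item"]].sum()
    return {"cleaned": cleaned, "per_store": per_store, "per_item": per_item}

# Made-up sample with the same kind of columns as the daily file
sample = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [100, 200, 100],
    "item_price": [10.0, 5.0, 10.0],
    "item_cnt_day": [2.0, -1.0, 1.0],
})
tables = run_daily_process(sample)
print(tables["per_store"])
print(tables["per_item"])
```

Keeping the steps in a function that takes and returns DataFrames makes the process easy to rerun each day and to test without a database connection.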

## Instructions
* Read the CSV data you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# your code here
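import pandas as pd
from sqlalchemy import create_engine
from urllib.parse import quote_plus

driver   = 'mysql+pymysql:'
user     = 'data-students'
# The password contains '@', so it is URL-encoded to keep the connection string unambiguous
password = quote_plus('iR0nH@cK-D4T4B4S3')
ip       = '34.65.10.136'
database = 'retail_sales'
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'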
engine = create_engine(connection_string)

In [2]:
shop = pd.read_sql('SELECT * FROM retail_sales.raw_sales', engine)
# Keep an untouched copy of the raw data instead of querying the database twice
original = shop.copy()
shop


| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 |


## Check for unclean data

In [3]:
shop.describe()

| | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|
| count | 4545.0 | 4545.0 | 4545.0 | 4545.0 |
| mean | 34.021122 | 11140.459406 | 1031.686121 | 1.10363 |
| std | 16.565517 | 6558.649572 | 2073.91999 | 0.536967 |
| min | 2.0 | 30.0 | 3.0 | -1.0 |
| 25% | 22.0 | 4977.0 | 249.0 | 1.0 |
| 50% | 31.0 | 11247.0 | 479.0 | 1.0 |
| 75% | 50.0 | 16671.0 | 1192.0 | 1.0 |
| max | 59.0 | 22162.0 | 27990.0 | 10.0 |


In [4]:
shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null datetime64[ns]
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.7 KB


In [5]:
shop.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

In [6]:
shop.isna().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
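
Because the file arrives every day, the manual checks above (`info()`, `dtypes`, `isna().sum()`) could be bundled into one function that the process runs before cleaning. The following is a minimal sketch, not part of the original exercise: the name `find_problems` and the sample frames are hypothetical, and the expected column kinds are taken from the `shop.info()` output above.

```python
import pandas as pd

# Expected NumPy dtype kinds, based on the shop.info() output:
# 'M' = datetime, 'i' = integer, 'f' = float
EXPECTED_KINDS = {
    "date": "M",
    "shop_id": "i",
    "item_id": "i",
    "item_price": "f",
    "item_cnt_day": "f",
}

def find_problems(df: pd.DataFrame) -> list:
    """Return readable problem descriptions; an empty list means no issue was found."""
    problems = []
    for column, kind in EXPECTED_KINDS.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif df[column].dtype.kind != kind:
            problems.append(f"{column} has unexpected dtype {df[column].dtype}")
    # Report every column that contains at least one null value
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        problems.append(f"{column} has {count} null values")
    return problems

good = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04", "2015-01-04"]),
    "shop_id": [29, 28],
    "item_id": [1469, 21364],
    "item_price": [1199.0, 479.0],
    "item_cnt_day": [1.0, 1.0],
})
# A deliberately broken copy: one column dropped, one null introduced
bad = good.drop(columns="item_price").assign(item_cnt_day=[1.0, float("nan")])

print(find_problems(good))
print(find_problems(bad))
```

Returning a list rather than raising on the first issue lets the daily run log everything that is wrong with a file in one pass.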

In [7]:
shop[ shop['item_cnt_day'] < 0]

| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 179 | 2015-01-04 | 35 | 7877 | 3990.0 | -1.0 |
| 305 | 2015-01-04 | 25 | 2575 | 2099.0 | -1.0 |
| 386 | 2015-01-04 | 21 | 2946 | 449.0 | -1.0 |
| 391 | 2015-01-04 | 21 | 1523 | 799.0 | -1.0 |
| 825 | 2015-01-04 | 52 | 16677 | 332.67 | -1.0 |
| 899 | 2015-01-04 | 44 | 14652 | 199.0 | -1.0 |
| 901 | 2015-01-04 | 44 | 8095 | 499.0 | -1.0 |
| 926 | 2015-01-04 | 44 | 1114 | 299.0 | -1.0 |
| 1161 | 2015-01-04 | 42 | 1878 | 2599.0 | -1.0 |
| 1416 | 2015-01-04 | 19 | 2690 | 1598.0 | -1.0 |


In [8]:
# Set counts of -1 (the minimum found above) to 0 so they do not reduce the totals
shop['item_cnt_day'] = shop['item_cnt_day'].replace({-1:0})
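
Note that `replace({-1: 0})` only changes values that are exactly -1. In this sample the minimum is -1, so that is enough, but a future daily file might contain other negative counts. A minimal sketch of the difference, using a made-up series and `Series.clip` as one possible alternative:

```python
import pandas as pd

# Made-up counts, including a value below -1 that the sample file does not contain
counts = pd.Series([1.0, -1.0, 2.0, -3.0], name="item_cnt_day")

# replace({-1: 0}) only touches values that are exactly -1
replaced = counts.replace({-1: 0})

# clip(lower=0) raises every negative value to 0
clipped = counts.clip(lower=0)

print(replaced.tolist())  # [1.0, 0.0, 2.0, -3.0]
print(clipped.tolist())   # [1.0, 0.0, 2.0, 0.0]
```

Whether negative counts should be zeroed, dropped, or kept so that they offset sales is a business decision; the sketch only shows how the two calls behave.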

## Create aggregates

In [9]:
# Sales amount per row: unit price times units sold that day
shop['aggregate_item'] = shop['item_price'] * shop['item_cnt_day']
shop

| | date | shop_id | item_id | item_price | item_cnt_day | aggregate_item |
|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 | 1199.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 | 479.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 | 1998.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 | 498.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 | 179.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 | 1299.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 | 99.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 | 3999.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 | 179.0 |


In [10]:
shopaggreg = shop.groupby('shop_id').agg({'aggregate_item':'sum'})
shopaggreg.sort_values(by = 'aggregate_item', ascending = False)

| shop_id | aggregate_item |
|---|---|
| 42 | 337908.0 |
| 31 | 304692.0 |
| 12 | 295173.0 |
| 25 | 294729.0 |
| 21 | 232743.0 |
| 57 | 226269.0 |
| 37 | 220500.0 |
| 28 | 202512.0 |
| 27 | 172959.0 |
| 55 | 170847.6 |
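
The brief asks for aggregates that add up "the rest of the values", while the cell above sums only `aggregate_item`. One way to carry several totals per store is pandas named aggregation. This is a sketch, not the notebook's solution: the output column names `units_sold`, `revenue` and `distinct_items` are assumptions, and the three rows are taken from the sample output shown earlier.

```python
import pandas as pd

# Three rows copied from the sample output above
sales = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 1469],
    "item_price": [479.0, 999.0, 1199.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})
sales["aggregate_item"] = sales["item_price"] * sales["item_cnt_day"]

per_shop = (
    sales.groupby("shop_id")
    .agg(
        units_sold=("item_cnt_day", "sum"),       # total units per store
        revenue=("aggregate_item", "sum"),        # total sales amount per store
        distinct_items=("item_id", "nunique"),    # number of different items sold
    )
    .reset_index()  # turn shop_id back into a regular column
)
print(per_shop)
```

The same pattern applies to the per-item aggregate by grouping on `item_id` instead. Summing `item_price` directly is avoided here because a sum of unit prices has no clear meaning.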


In [11]:
itemaggreg = shop.groupby('item_id').agg({'aggregate_item':'sum'})
itemaggreg.sort_values(by = 'aggregate_item', ascending = False)

| item_id | aggregate_item |
|---|---|
| 1969 | 262134.0 |
| 6675 | 242910.0 |
| 1971 | 121473.0 |
| 1970 | 107988.0 |
| 13494 | 89940.0 |
| ... | ... |
| 16677 | 0.0 |
| 1523 | 0.0 |
| 8095 | 0.0 |
| 2946 | 0.0 |


## Export the aggregates as tables

In [12]:
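# Connect to the local MySQL database where the output tables are written;
# replace <new-password> with your own password
driver   = 'mysql+pymysql:'
user     = 'root'
password = '<new-password>'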
ip       = '127.0.0.1'
database = 'retail_sales'
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
engine = create_engine(connection_string)

In [13]:
shopaggreg.to_sql('shop_aggregated', con=engine, if_exists='replace')

In [14]:
itemaggreg.to_sql('item_aggregated', con=engine, if_exists='replace')
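
The task also asks for a table holding the cleaned data, which the cells above do not write. A minimal sketch of that step follows, with loud assumptions: an in-memory SQLite database stands in for the local MySQL database so the example runs anywhere, the table name `sales_cleaned` is hypothetical, and the two rows are a cut-down stand-in for the cleaned `shop` DataFrame.

```python
import sqlite3
import pandas as pd

# Cut-down stand-in for the cleaned data
cleaned = pd.DataFrame({
    "shop_id": [28, 29],
    "item_id": [21364, 1469],
    "item_price": [479.0, 1199.0],
    "item_cnt_day": [1.0, 1.0],
})

# In-memory SQLite replaces the local MySQL engine in this sketch
connection = sqlite3.connect(":memory:")

# index=False leaves out the 0..n-1 row index, which carries no information
cleaned.to_sql("sales_cleaned", con=connection, if_exists="replace", index=False)

# Read the row count back to confirm that every row was written
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM sales_cleaned", connection)["n"].iloc[0]
print(row_count)
connection.close()
```

With the notebook's MySQL `engine`, the equivalent call would presumably be `shop.to_sql(..., con=engine, ...)`. Keep in mind that `if_exists='replace'` overwrites the previous contents on every run; for a process that should accumulate data day after day, `if_exists='append'` may be the better fit.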