# Mandatory Challenge
## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: you have just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that sums the remaining values.
* One aggregate per item that sums the remaining values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the csv you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [79]:
# your code here
import pandas as pd
import numpy as np
import sqlalchemy

In [80]:
connection_string = 'mysql+pymysql://ironhacker_read:ir0nhack3r@35.239.232.23/retail_sales'

In [81]:
engine = sqlalchemy.create_engine(connection_string)

In [82]:
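# Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0
sqlalchemy.inspect(engine).get_table_names()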

['raw_sales', 'sales_by_item', 'sales_by_item_index', 'sales_by_shop']

In [83]:
query = 'SELECT * FROM raw_sales;'

In [84]:
raw_sales = pd.read_sql(query, engine)

In [85]:
# Save a local copy in case the remote database's IP address changes
raw_sales.to_csv('raw_sales.csv')
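
If the remote database does become unreachable, the saved copy has to be read back in a way that restores the original frame. The following is a minimal sketch of that step; `load_cached_sales` is a hypothetical helper and the sample rows are illustrative, written to a temporary folder so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_cached_sales(path):
    """Read back a CSV written by DataFrame.to_csv with default settings."""
    # index_col=0 consumes the unnamed index column that to_csv writes;
    # parse_dates restores the datetime dtype, which CSV stores as text.
    return pd.read_csv(path, index_col=0, parse_dates=['date'])


# Illustrative rows shaped like raw_sales (not the real data)
sample = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-04', '2015-01-04']),
    'shop_id': [29, 28],
    'item_price': [1199.0, 479.0],
})
path = os.path.join(tempfile.mkdtemp(), 'raw_sales.csv')
sample.to_csv(path)
restored = load_cached_sales(path)
```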

In [86]:
raw_sales.head()

,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [87]:
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null datetime64[ns]
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.6 KB


There are no null values and the column names are self-explanatory. I will cast `item_cnt_day` to int, since it counts items sold and should therefore hold only whole numbers.

In [88]:
raw_sales['item_cnt_day'] = raw_sales['item_cnt_day'].astype(int)
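
Because `astype(int)` drops any fractional part without warning, one way to make the whole-number assumption explicit is to check it before casting. This is a minimal sketch; `to_whole_int` is a hypothetical helper, not a pandas function, and the sample values are illustrative.

```python
import pandas as pd


def to_whole_int(series):
    """Cast a float Series to int, refusing to truncate fractions silently."""
    # Missing values also fail this check, since NaN % 1 is NaN (and NaN != 0)
    fractional = series[series % 1 != 0]
    if not fractional.empty:
        raise ValueError(
            f"{len(fractional)} non-integer value(s) in {series.name!r}"
        )
    return series.astype(int)


counts = pd.Series([1.0, 2.0, 2.0], name='item_cnt_day')
clean_counts = to_whole_int(counts)
```

In the notebook this would be called as `raw_sales['item_cnt_day'] = to_whole_int(raw_sales['item_cnt_day'])`.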

Creating aggregates.

In [89]:
by_shop = raw_sales[['shop_id', 'item_price', 'item_cnt_day']].groupby('shop_id').sum()

In [90]:
by_shop.head()

shop_id,item_price,item_cnt_day
2,99070.5,81
3,67443.0,33
4,29361.0,39
5,33138.0,45
6,116352.0,150


In [91]:
by_shop.columns = ['total_sales', 'total_units']

In [92]:
by_shop.head()

shop_id,total_sales,total_units
2,99070.5,81
3,67443.0,33
4,29361.0,39
5,33138.0,45
6,116352.0,150
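
Note that summing `item_price` adds up the unit prices of the rows, which equals revenue only when every row sells exactly one unit. If `total_sales` is meant to be revenue, a possible variant multiplies price by quantity before grouping. This sketch assumes `item_price` is a per-unit price; the `revenue` column and the three sample rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows shaped like raw_sales (not the real data)
sample = pd.DataFrame({
    'shop_id': [28, 28, 29],
    'item_price': [479.0, 999.0, 1199.0],
    'item_cnt_day': [1, 2, 1],
})

# Assumption: item_price is the price of a single unit
sample['revenue'] = sample['item_price'] * sample['item_cnt_day']

revenue_by_shop = (
    sample.groupby('shop_id')[['revenue', 'item_cnt_day']]
    .sum()
    .rename(columns={'revenue': 'total_sales', 'item_cnt_day': 'total_units'})
)
```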


In [93]:
by_item = raw_sales[['item_id', 'item_price', 'item_cnt_day']].groupby('item_id').sum()

In [94]:
by_item.head()

item_id,item_price,item_cnt_day
30,507.0,3
31,1089.0,3
32,447.0,3
42,897.0,3
59,747.0,3


In [95]:
by_item.columns = ['total_sales', 'total_units']

In [96]:
by_item.head()

item_id,total_sales,total_units
30,507.0,3
31,1089.0,3
32,447.0,3
42,897.0,3
59,747.0,3
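
The per-shop and per-item aggregates differ only in the grouping key, so the group, sum, and rename steps could be folded into one helper. This is a sketch using pandas named aggregation (available from pandas 0.25); `aggregate_by` is a hypothetical name and the sample rows are illustrative.

```python
import pandas as pd


def aggregate_by(df, key):
    """Sum prices and unit counts per value of `key`."""
    # Mirrors the notebook: total_sales is the sum of item_price
    return df.groupby(key).agg(
        total_sales=('item_price', 'sum'),
        total_units=('item_cnt_day', 'sum'),
    )


# Illustrative rows shaped like raw_sales (not the real data)
sample = pd.DataFrame({
    'shop_id': [28, 28, 29],
    'item_id': [21364, 21365, 21364],
    'item_price': [479.0, 999.0, 479.0],
    'item_cnt_day': [1, 2, 1],
})
by_shop_demo = aggregate_by(sample, 'shop_id')
by_item_demo = aggregate_by(sample, 'item_id')
```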


Creating the connection to the local database.

In [100]:
connection_string = 'mysql+pymysql://root:tanaliberatutti@localhost/lab_mysql'

In [101]:
engine = sqlalchemy.create_engine(connection_string)
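
Hard-coding the password in the notebook means it travels with every copy of the file. One common alternative, sketched here, reads the credentials from environment variables. The variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST` and `DB_NAME` are assumptions, as is the helper `build_connection_string`; the demo values are made up.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a MySQL connection URL from environment-style settings."""
    user = env['DB_USER']
    # URL-encode the password so characters such as '@' do not break the URL
    password = quote_plus(env['DB_PASSWORD'])
    host = env.get('DB_HOST', 'localhost')
    database = env['DB_NAME']
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'


# Demo with a plain dict instead of the real environment
demo = build_connection_string({
    'DB_USER': 'root',
    'DB_PASSWORD': 'p@ss word',
    'DB_NAME': 'lab_mysql',
})
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the literal one.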

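In [109]:
raw_sales.to_sql('raw_sales', engine)

In [110]:
# Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0
sqlalchemy.inspect(engine).get_table_names()

['raw_sales']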

In [111]:
by_shop.to_sql('by_shop', engine)

In [112]:
by_item.to_sql('by_item', engine)
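
By default `to_sql` raises a `ValueError` when the target table already exists, so the three calls above succeed only on the first run. For a process that runs daily, one possible approach is to append the cleaned rows and rebuild the aggregates from everything loaded so far. The sketch below uses an in-memory SQLite connection as a stand-in for the MySQL engine so that it runs anywhere; `load_daily` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd


def load_daily(daily, con):
    """Append one day's cleaned rows, then rebuild both aggregate tables."""
    # 'append' creates the table on the first run and adds rows afterwards
    daily.to_sql('raw_sales', con, if_exists='append', index=False)

    # Recompute the aggregates from all rows loaded so far
    full = pd.read_sql('SELECT * FROM raw_sales', con)
    for key, table in [('shop_id', 'by_shop'), ('item_id', 'by_item')]:
        agg = full.groupby(key)[['item_price', 'item_cnt_day']].sum()
        agg.columns = ['total_sales', 'total_units']
        # 'replace' drops and recreates the aggregate table on every run
        agg.to_sql(table, con, if_exists='replace')


con = sqlite3.connect(':memory:')  # stand-in for the MySQL engine
day = pd.DataFrame({
    'date': ['2015-01-04', '2015-01-04'],
    'shop_id': [28, 29],
    'item_id': [21364, 1469],
    'item_price': [479.0, 1199.0],
    'item_cnt_day': [1, 1],
})
load_daily(day, con)
load_daily(day, con)  # a second run appends instead of failing

row_count = pd.read_sql('SELECT COUNT(*) AS n FROM raw_sales', con)['n'][0]
shop_totals = pd.read_sql('SELECT * FROM by_shop ORDER BY shop_id', con)
```

This sketch does not deduplicate: loading the same file twice, as the demo does, doubles its rows, so a real process would need a rule for files that have already been loaded.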