# Mandatory Challenge
## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: your team has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the csv you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
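
The steps above can be sketched as one reusable function that a daily job could call. This is a minimal sketch under stated assumptions, not a reference solution: the function name `process_daily_file` and the three-row sample are hypothetical, and the column names are the ones shown in the sample file further down.

```python
import io

import pandas as pd


def process_daily_file(source):
    """Read one daily sales file and return (clean, per_store, per_item)."""
    sales = pd.read_csv(source)
    # Drop the stray index column only if it is actually present.
    sales = sales.drop(columns=['Unnamed: 0'], errors='ignore')
    sales = sales.rename(columns={'item_cnt_day': 'qty_day'})
    sales['date'] = pd.to_datetime(sales['date'])

    # One row per store: mean price and total quantity sold.
    per_store = sales.groupby('shop_id', as_index=False).agg(
        {'item_price': 'mean', 'qty_day': 'sum'})
    # One row per item and price: total quantity sold.
    per_item = sales.groupby(['item_id', 'item_price'], as_index=False).agg(
        {'qty_day': 'sum'})
    return sales, per_store, per_item


# Hypothetical sample standing in for the daily file (assumption).
sample = io.StringIO(
    ",date,shop_id,item_id,item_price,item_cnt_day\n"
    "0,2015-01-04,29,1469,1199.0,1.0\n"
    "1,2015-01-04,29,21364,479.0,2.0\n"
    "2,2015-01-04,28,1469,1199.0,3.0\n"
)
clean, per_store, per_item = process_daily_file(sample)
```

Passing `errors='ignore'` to `drop` keeps the function from failing on days when the file arrives without the extra index column.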

In [14]:
# Libraries
from sqlalchemy import create_engine
import pandas as pd

In [None]:
# Connecting to the Ironhack database
driver = 'mysql+pymysql:'
user = 'ironhacker_read'
password = 'ir0nhack3r'
ip = '35.239.232.23'
database = 'retail_sales'

connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
    
engine = create_engine(connection_string)

query = """
        SELECT * FROM raw_sales
"""

df_db = pd.read_sql(query, engine)

df_db

In [8]:
# Reading the CSV file and deleting an unnecessary column
raw_sales = pd.read_csv('raw_sales.csv')
raw_sales.drop(['Unnamed: 0'], axis=1, inplace=True)

In [10]:
# Looking for the types to change
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null object
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 177.6+ KB


In [11]:
# Renaming quantity column
raw_sales = raw_sales.rename(columns={'item_cnt_day': 'qty_day'})

In [20]:
# Changing the date dtype
# pd.to_datetime is used because recent pandas versions reject the unit-less 'datetime64' dtype
raw_sales['date'] = pd.to_datetime(raw_sales['date'])

In [21]:
# Checking the change
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date          4545 non-null datetime64[ns]
shop_id       4545 non-null int64
item_id       4545 non-null int64
item_price    4545 non-null float64
qty_day       4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.6 KB
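
When the day and month order of the `date` strings is ambiguous, automatic parsing can silently swap them. A possible safeguard is to pass an explicit format and count the values that fail to parse. The sketch below is only an illustration: the `'%d.%m.%Y'` format and the sample values are assumptions, since the notebook does not show which format the file uses.

```python
import pandas as pd

# Hypothetical date strings; the real file may use another format (assumption).
dates = pd.Series(['02.01.2013', '15.01.2013', 'not a date'])

# An explicit format removes the day/month ambiguity;
# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%d.%m.%Y', errors='coerce')

# Count the rows that could not be parsed so they can be inspected.
n_bad = int(parsed.isna().sum())
```

Checking `n_bad` before continuing makes it obvious when a daily file arrives with malformed dates.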


In [22]:
# Creating the clean data file
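# index=False avoids writing the index, which would come back as an 'Unnamed: 0' column
raw_sales.to_csv('Clean_Data.csv', index=False)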

In [41]:
# Creating the aggregate per store
per_store = raw_sales.groupby(['shop_id']).agg({'item_price':'mean', 'qty_day':'sum'})
per_store.to_csv('per_store.csv')

In [48]:
# Creating the aggregate per item
per_item = raw_sales.groupby(['item_id','item_price']).agg({'qty_day':'sum'})
per_item.to_csv('per_item.csv')
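
The task also asks for three tables in a local database, while the cells above only write CSV files. One way to do that step is `DataFrame.to_sql`. The sketch below is a minimal illustration under stated assumptions: it uses an in-memory SQLite connection as a stand-in for the local database, a tiny hypothetical frame instead of the real data, and hypothetical table names (`sales_clean`, `sales_per_store`, `sales_per_item`).

```python
import sqlite3

import pandas as pd

# Tiny hypothetical frame standing in for the cleaned data (assumption).
clean = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-04', '2015-01-04', '2015-01-04']),
    'shop_id': [29, 29, 28],
    'item_id': [1469, 21364, 1469],
    'item_price': [1199.0, 479.0, 1199.0],
    'qty_day': [1.0, 2.0, 3.0],
})
per_store = clean.groupby('shop_id', as_index=False).agg(
    {'item_price': 'mean', 'qty_day': 'sum'})
per_item = clean.groupby(['item_id', 'item_price'], as_index=False).agg(
    {'qty_day': 'sum'})

# In-memory SQLite stands in for the local database (assumption).
conn = sqlite3.connect(':memory:')

# Hypothetical table names, one per output of the process.
tables = {
    'sales_clean': clean,
    'sales_per_store': per_store,
    'sales_per_item': per_item,
}
for name, frame in tables.items():
    # if_exists='replace' recreates the table on every run.
    frame.to_sql(name, conn, if_exists='replace', index=False)

# List the tables that now exist to confirm the write.
stored = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn)
n_clean_rows = pd.read_sql('SELECT COUNT(*) AS n FROM sales_clean', conn)['n'].iloc[0]
```

With a MySQL database, a SQLAlchemy engine like the one created at the top of the notebook could be passed in place of `conn`. For a process that runs daily, `if_exists='append'` may suit the cleaned table better than `'replace'`, depending on whether each run should add to or overwrite the previous day's rows.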