# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.
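
As a minimal sketch of what these two aggregates could look like in pandas: the raw column names (`earnings`, `items_sold`) and the sample rows below are hypothetical, since the layout of the daily file is not shown here.

```python
import pandas as pd

# Hypothetical raw daily sales; column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "shop_id": [2, 2, 3, 3],
    "item_id": [30, 31, 30, 42],
    "earnings": [169.0, 363.0, 169.0, 299.0],
    "items_sold": [1.0, 1.0, 1.0, 1.0],
})

value_columns = ["earnings", "items_sold"]

# One row per store: sum the remaining numeric values.
per_shop = raw.groupby("shop_id", as_index=False)[value_columns].sum()

# One row per item: the same idea, grouped by item instead.
per_item = raw.groupby("item_id", as_index=False)[value_columns].sum()

print(per_shop)
print(per_item)
```

Passing `as_index=False` keeps the grouping key as a regular column, which is convenient when the result is later written to a database table.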

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.
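
The last two steps could be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for the local database, and the table names (`cleaned_sales`, `sales_per_shop`, `sales_per_item`) and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical cleaned data and aggregates; names and values are assumptions.
cleaned = pd.DataFrame({
    "shop_id": [2, 2, 3],
    "item_id": [30, 31, 30],
    "earnings": [169.0, 363.0, 169.0],
})
per_shop = cleaned.groupby("shop_id", as_index=False)["earnings"].sum()
per_item = cleaned.groupby("item_id", as_index=False)["earnings"].sum()

# An in-memory SQLite database stands in for the local database here.
conn = sqlite3.connect(":memory:")

tables = {
    "cleaned_sales": cleaned,
    "sales_per_shop": per_shop,
    "sales_per_item": per_item,
}
for name, frame in tables.items():
    # if_exists="replace" recreates the table on every run;
    # "append" would accumulate the daily loads instead.
    frame.to_sql(name, conn, if_exists="replace", index=False)

# Read one table back to confirm the write.
check = pd.read_sql("SELECT * FROM sales_per_shop", conn)
print(check)
```

With a MySQL database, the same `to_sql` calls would take a SQLAlchemy engine instead of the SQLite connection.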

In [4]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [5]:
# import data set from Ironhack's database

driver = 'mysql+pymysql:'
user = 'ironhacker_read'
password = 'ir0nhack3r'
ip = '35.239.232.23'
database = 'retail_sales'

In [6]:
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'

In [7]:
engine = create_engine(connection_string)

In [13]:
# Have a look at the tables we need: sales_by_shop and sales_by_item
query = """
        SELECT * FROM sales_by_shop
"""

In [14]:
sales_by_shop = pd.read_sql(query, engine)
sales_by_shop.head()

Unnamed: 0,shop_id,shop_earnings,total_items_sold,date
0,2,33023.5,27.0,03/09/2019
1,3,22481.0,11.0,03/09/2019
2,4,9787.0,13.0,03/09/2019
3,5,11046.0,15.0,03/09/2019
4,6,38784.0,50.0,03/09/2019


In [15]:
query = """
        SELECT * FROM sales_by_item
"""

In [16]:
sales_by_item = pd.read_sql(query, engine)
sales_by_item.head()

Unnamed: 0,item_id,item_earnings,total_items_sold,date
0,30,169.0,1.0,03/09/2019
1,31,363.0,1.0,03/09/2019
2,32,149.0,1.0,03/09/2019
3,42,299.0,1.0,03/09/2019
4,59,249.0,1.0,03/09/2019


In [19]:
# Cleaning of the sales_by_shop table
## the column names are correct so no need to change them

## Let's check the types of the table
sales_by_shop.dtypes

shop_id               int64
shop_earnings       float64
total_items_sold    float64
date                 object
dtype: object

In [25]:
## We need to change the date type from object to datetime64
sales_by_shop.astype({"date":"datetime64"}).head()

Unnamed: 0,shop_id,shop_earnings,total_items_sold,date
0,2,33023.5,27.0,2019-03-09
1,3,22481.0,11.0,2019-03-09
2,4,9787.0,13.0,2019-03-09
3,5,11046.0,15.0,2019-03-09
4,6,38784.0,50.0,2019-03-09


In [24]:
## Check if the date type is now correct
sales_by_shop.dtypes

shop_id               int64
shop_earnings       float64
total_items_sold    float64
date                 object
dtype: object

In [27]:
## The dtype has not changed, even though the displayed dates looked converted:
## astype() returns a new DataFrame and leaves sales_by_shop untouched,
## so the result has to be assigned back to the variable to take effect.
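
A small sketch of the behaviour seen in the cells above, using a hypothetical two-row frame. It uses `pd.to_datetime` with an explicit format; the month-first format is an assumption based on the displayed result (`2019-03-09`), because the notebook does not state whether the source dates are month-first or day-first.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the date strings shown above.
df = pd.DataFrame({"shop_id": [2, 3], "date": ["03/09/2019", "03/09/2019"]})

# Conversion functions return a new object; df itself is not modified here.
converted = pd.to_datetime(df["date"], format="%m/%d/%Y")  # format is an assumption
dtype_before = df["date"].dtype
print(dtype_before)  # unchanged: still the original string column

# Assigning the result back is what makes the change stick.
df["date"] = converted
print(df["date"].dtype)
```

Stating the format explicitly avoids relying on pandas to guess the day and month order.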

In [28]:
## Check if there is any null or NaN value in any of the columns
sales_by_shop.isnull().any()

shop_id             False
shop_earnings       False
total_items_sold    False
date                False
dtype: bool

In [29]:
## There are no null, missing or NaN values in any column.
## We can also check the non-null count of each column with info()
sales_by_shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 4 columns):
shop_id             90 non-null int64
shop_earnings       90 non-null float64
total_items_sold    90 non-null float64
date                90 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 2.9+ KB


In [None]:
## Every column has 90 non-null entries out of 90 rows, which confirms there are no missing values

In [None]:
# We can clean the sales_by_item table using the same process
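
Since both tables go through the same steps, the cleaning could be wrapped in one function and applied to each table. The sketch below is one possible shape; `clean_sales` and its `date_format` parameter are hypothetical names, and the month-first default is an assumption.

```python
import pandas as pd


def clean_sales(df: pd.DataFrame, date_format: str = "%m/%d/%Y") -> pd.DataFrame:
    """Return a cleaned copy: dates parsed, and an error if any value is missing."""
    out = df.copy()  # leave the caller's frame untouched
    out["date"] = pd.to_datetime(out["date"], format=date_format)
    missing = out.isnull().sum()
    if missing.any():
        raise ValueError(f"missing values found:\n{missing[missing > 0]}")
    return out


# Hypothetical sample shaped like sales_by_item.
sample = pd.DataFrame({
    "item_id": [30, 31],
    "item_earnings": [169.0, 363.0],
    "total_items_sold": [1.0, 1.0],
    "date": ["03/09/2019", "03/09/2019"],
})
cleaned = clean_sales(sample)
print(cleaned.dtypes)
```

Calling the same function on each table avoids copy-pasting cells and accidentally checking the wrong DataFrame.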

In [30]:
## Let's check the types of the table
sales_by_item.dtypes

item_id               int64
item_earnings       float64
total_items_sold    float64
date                 object
dtype: object

In [31]:
## We need to change the date type from object to datetime64
sales_by_item.astype({"date":"datetime64"}).head()

Unnamed: 0,item_id,item_earnings,total_items_sold,date
0,30,169.0,1.0,2019-03-09
1,31,363.0,1.0,2019-03-09
2,32,149.0,1.0,2019-03-09
3,42,299.0,1.0,2019-03-09
4,59,249.0,1.0,2019-03-09


In [32]:
## Check if the date type is now correct
sales_by_item.dtypes

item_id               int64
item_earnings       float64
total_items_sold    float64
date                 object
dtype: object

In [33]:
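## Check if there is any null or NaN value in any of the columns
## (note: this cell checks sales_by_shop again; sales_by_item.isnull().any() is the check for the item table)
sales_by_shop.isnull().any()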

shop_id             False
shop_earnings       False
total_items_sold    False
date                False
dtype: bool

In [34]:
## There are no null, missing or NaN values in any column.
## We can also check the non-null count of each column with info()
## (note: this cell inspects sales_by_shop again; sales_by_item.info() covers the item table)
sales_by_shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 4 columns):
shop_id             90 non-null int64
shop_earnings       90 non-null float64
total_items_sold    90 non-null float64
date                90 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 2.9+ KB


In [None]:
## Every column has 90 non-null entries out of 90 rows, which confirms there are no missing values

In [None]:
## Check for some more info on the sales_by_shop table
sales_by_shop.info()

In [None]:
## Every column has the same non-null count, so no values are missing in any column
## Check for the presence of NaN


In [18]:
sales_by_shop

Unnamed: 0,shop_id,shop_earnings,total_items_sold,date
0,2,33023.500000,27.0,03/09/2019
1,3,22481.000000,11.0,03/09/2019
2,4,9787.000000,13.0,03/09/2019
3,5,11046.000000,15.0,03/09/2019
4,6,38784.000000,50.0,03/09/2019
5,7,17457.000000,21.0,03/09/2019
6,10,7569.000000,10.0,03/09/2019
7,12,70732.133333,72.0,03/09/2019
8,14,11152.000000,17.0,03/09/2019
9,15,41713.000000,31.0,03/09/2019
