# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news: it has just been hired by a major retail company! So get ready for a huge amount of work.

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that sums the remaining values.
* One aggregate per item that sums the remaining values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV file available in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.
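
The four steps listed under "Your task" can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the expected solution: an in-memory SQLite database stands in for the local database, the table names `clean_sales`, `agg_by_shop` and `agg_by_item` are hypothetical, and dropping exact duplicates is an assumed cleaning rule.

```python
import io
import sqlite3

import pandas as pd


def run_daily_process(source, con):
    """Read one daily file, clean it, aggregate it, and write three tables."""
    # 1. Read the daily file (a path or any file-like object).
    raw = pd.read_csv(source)

    # 2. Clean: drop exact duplicates and parse the dates (assumed rules).
    clean = raw.drop_duplicates().copy()
    clean["date"] = pd.to_datetime(clean["date"])
    clean["earnings"] = clean["item_price"] * clean["item_cnt_day"]

    # 3. Aggregate per store and per item, one row per day and key.
    measures = ["earnings", "item_cnt_day"]
    by_shop = clean.groupby(["date", "shop_id"], as_index=False)[measures].sum()
    by_item = clean.groupby(["date", "item_id"], as_index=False)[measures].sum()

    # 4. Append each result to its own table (table names are hypothetical).
    tables = {"clean_sales": clean, "agg_by_shop": by_shop, "agg_by_item": by_item}
    for name, frame in tables.items():
        frame.to_sql(name, con, if_exists="append", index=False)
    return tables


# Tiny in-memory stand-ins for the daily file and the local database.
sample = io.StringIO(
    "date,shop_id,item_id,item_price,item_cnt_day\n"
    "2015-01-04,28,21364,479.0,1.0\n"
    "2015-01-04,28,21365,999.0,2.0\n"
    "2015-01-04,29,1469,1199.0,1.0\n"
)
con = sqlite3.connect(":memory:")
tables = run_daily_process(sample, con)
print(tables["agg_by_shop"])
```

Passing the connection in as an argument keeps the function independent of where the data ends up, so the same code could be pointed at a SQLAlchemy engine instead.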

-----------------


## Import the needed libraries

In [2]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine


## Connect to SQL and save the tables locally as .csv files

In [2]:
driver = "mysql+pymysql"
user = "ironhacker_read"
password = "ir0nhack3r"
ip = "35.239.232.23"
database = "retail_sales"

connection_string = f"{driver}://{user}:{password}@{ip}/{database}"
    
engine = create_engine(connection_string)

In [3]:
# There are 4 tables in total in the SQL database
query = """
        SELECT * FROM raw_sales
"""

raw_sales = pd.read_sql(query, engine)
raw_sales.to_csv("../raw_sales.csv", sep = ",")

In [4]:
query = """
        SELECT * FROM sales_by_item_index
"""
sales_by_item_index = pd.read_sql(query, engine)
sales_by_item_index.to_csv("../sales_by_item_index.csv", sep = ",")

In [5]:
query = """
        SELECT * FROM sales_by_item
"""
sales_by_item = pd.read_sql(query, engine)
sales_by_item.to_csv("../sales_by_item.csv", sep = ",")

In [6]:
query = """
        SELECT * FROM sales_by_shop
"""
sales_by_shop = pd.read_sql(query, engine)
sales_by_shop.to_csv("../sales_by_shop.csv", sep = ",")
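
The four export cells above repeat the same query-and-save pattern, which could be folded into one loop. The sketch below is only an illustration: `export_tables` is a hypothetical helper, and an in-memory SQLite connection with made-up rows stands in for the remote MySQL engine. It also passes `index=False` so that the row index is not written to the file as an extra column.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def export_tables(table_names, con, out_dir):
    """Dump each table to <out_dir>/<table>.csv and return the written paths."""
    out_dir = Path(out_dir)
    paths = {}
    for name in table_names:
        # Table names come from a fixed list here, never from user input.
        frame = pd.read_sql(f"SELECT * FROM {name}", con)
        paths[name] = out_dir / f"{name}.csv"
        # index=False keeps the row index out of the CSV file.
        frame.to_csv(paths[name], index=False)
    return paths


# Stand-in database with two small tables (made-up rows).
con = sqlite3.connect(":memory:")
pd.DataFrame({"shop_id": [28, 29], "item_id": [21364, 1469]}).to_sql(
    "raw_sales", con, index=False
)
pd.DataFrame({"shop_id": [28], "shop_earnings": [479.0]}).to_sql(
    "sales_by_shop", con, index=False
)

paths = export_tables(["raw_sales", "sales_by_shop"], con, tempfile.mkdtemp())
print(pd.read_csv(paths["raw_sales"]).columns.tolist())
```

In the notebook, the same function could be called with the SQLAlchemy `engine` and the four table names. Files written this way contain no `Unnamed: 0` column when read back, so a step that drops that column would no longer apply.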

## Read the .csv files into Jupyter

In [3]:
# read all the data
raw_sales = pd.read_csv("../raw_sales.csv")
sales_by_item = pd.read_csv("../sales_by_item.csv")
sales_by_item_index = pd.read_csv("../sales_by_item_index.csv")
sales_by_shop = pd.read_csv("../sales_by_shop.csv")

In [4]:
# check the data type
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 6 columns):
Unnamed: 0      4545 non-null int64
date            4545 non-null object
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 213.1+ KB


In [5]:
# drop the index column written by to_csv and convert date to datetime
raw_sales = raw_sales.drop(["Unnamed: 0"], axis=1)
raw_sales["date"] = pd.to_datetime(raw_sales["date"])
raw_sales.head()

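|   | date | shop_id | item_id | item_price | item_cnt_day |
|---|------|---------|---------|------------|--------------|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |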
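
The cleaning above only fixes the date type. A fuller cleaning step might look like the sketch below. The rules are assumptions rather than requirements from the task: rows with an unparseable date or a missing field are dropped, exact duplicates are removed, and non-positive prices are treated as invalid. `clean_raw_sales` is a hypothetical helper name.

```python
import pandas as pd


def clean_raw_sales(raw_sales):
    """Return a cleaned copy of raw_sales; the cleaning rules are assumptions."""
    clean = raw_sales.copy()
    # Unparseable dates become NaT instead of raising an error.
    clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
    # Drop rows missing any field the aggregates depend on.
    required = ["date", "shop_id", "item_id", "item_price", "item_cnt_day"]
    clean = clean.dropna(subset=required)
    # Drop exact duplicates and rows with a non-positive price.
    clean = clean.drop_duplicates()
    clean = clean[clean["item_price"] > 0]
    return clean.reset_index(drop=True)


# Made-up rows: one valid, one duplicate, one bad date, one negative price.
sample = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "not a date", "2015-01-04"],
    "shop_id": [29, 29, 28, 28],
    "item_id": [1469, 1469, 21364, 21365],
    "item_price": [1199.0, 1199.0, 479.0, -1.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0],
})
cleaned = clean_raw_sales(sample)
print(cleaned)
```

Negative values of `item_cnt_day` are deliberately left alone here, since they could represent returns; whether to keep them depends on the business rules.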


The DB contains four tables. I extract the data from raw_sales and aggregate it following the design of the sales_by_shop and sales_by_item tables.

I am not sure about the purpose of sales_by_item_index, so I did not create a function for that table.

In [6]:
# Build the sales_by_shop aggregate from the raw_sales table.
# The sales_by_shop table has four columns: shop_id, shop_earnings, total_items_sold, date.
# The function computes the earnings, sums them per date and shop, then selects, renames and reorders the columns.
def extract_sales_by_shop(raw_sales):
    # assign() returns a copy, so the caller's DataFrame is not modified
    sales = raw_sales.assign(earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"])
    aggregation = {"earnings": "sum",
                   "item_cnt_day": "sum"}
    agg_by_shop = sales.groupby(["date", "shop_id"]).agg(aggregation).reset_index()
    agg_by_shop = agg_by_shop[["shop_id", "earnings", "item_cnt_day", "date"]]
    agg_by_shop = agg_by_shop.rename({"earnings": "shop_earnings", "item_cnt_day": "total_items_sold"}, axis=1)
    return agg_by_shop

In [8]:
sales_by_shop = extract_sales_by_shop(raw_sales)
sales_by_shop.head()

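|   | shop_id | shop_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 2 | 103746.0 | 81.0 | 2015-01-04 |
| 1 | 3 | 67443.0 | 33.0 | 2015-01-04 |
| 2 | 4 | 29361.0 | 39.0 | 2015-01-04 |
| 3 | 5 | 33138.0 | 45.0 | 2015-01-04 |
| 4 | 6 | 138678.0 | 150.0 | 2015-01-04 |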


Now do the same to build the sales_by_item aggregate.

In [10]:
def extract_sales_by_item(raw_sales):
    # assign() returns a copy, so the caller's DataFrame is not modified
    sales = raw_sales.assign(earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"])
    aggregation = {"earnings": "sum",
                   "item_cnt_day": "sum"}
    agg_by_item = sales.groupby(["date", "item_id"]).agg(aggregation).reset_index()
    agg_by_item = agg_by_item[["item_id", "earnings", "item_cnt_day", "date"]]
    agg_by_item = agg_by_item.rename({"earnings": "item_earnings", "item_cnt_day": "total_items_sold"}, axis=1)
    return agg_by_item

In [11]:
sales_by_item = extract_sales_by_item(raw_sales)
sales_by_item.head()

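|   | item_id | item_earnings | total_items_sold | date |
|---|---------|---------------|------------------|------|
| 0 | 30 | 507.0 | 3.0 | 2015-01-04 |
| 1 | 31 | 1089.0 | 3.0 | 2015-01-04 |
| 2 | 32 | 447.0 | 3.0 | 2015-01-04 |
| 3 | 42 | 897.0 | 3.0 | 2015-01-04 |
| 4 | 59 | 747.0 | 3.0 | 2015-01-04 |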
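
The two extract functions differ only in the grouping key and in the name of the earnings column, so they could share one implementation. The sketch below uses pandas named aggregation for this; `aggregate_sales` and its parameters are hypothetical names, and the sample rows are the five rows shown in the `raw_sales.head()` output.

```python
import pandas as pd


def aggregate_sales(raw_sales, key, earnings_name):
    """Sum earnings and units sold per (date, key) without mutating the input."""
    return (
        raw_sales
        # assign() works on a copy, so raw_sales keeps its original columns.
        .assign(earnings=raw_sales["item_price"] * raw_sales["item_cnt_day"])
        .groupby(["date", key], as_index=False)
        # Named aggregation: output_column=(input_column, function).
        .agg(**{
            earnings_name: ("earnings", "sum"),
            "total_items_sold": ("item_cnt_day", "sum"),
        })
        [[key, earnings_name, "total_items_sold", "date"]]
    )


sample = pd.DataFrame({
    "date": ["2015-01-04"] * 5,
    "shop_id": [29, 28, 28, 28, 28],
    "item_id": [1469, 21364, 21365, 22104, 22091],
    "item_price": [1199.0, 479.0, 999.0, 249.0, 179.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0, 1.0],
})
by_shop = aggregate_sales(sample, "shop_id", "shop_earnings")
by_item = aggregate_sales(sample, "item_id", "item_earnings")
print(by_shop)
```

Named aggregation produces the final column names directly, which removes the separate rename step.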


## Create the connection used to append the extracted results to the local DB

In [None]:
driver = 'mysql+pymysql'
user = 'root'
password = '********'
ip = 'localhost'
database = 'lab_df_cal_tran'

In [None]:
connection_string = f'{driver}://{user}:{password}@{ip}/{database}'

In [None]:
engine = create_engine(connection_string)

In [None]:
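# if_exists='append' adds rows on every daily run (the default, 'fail', raises once the table exists)
sales_by_shop.to_sql('sales_by_shop', engine, if_exists='append', index=False)
sales_by_item.to_sql('sales_by_item', engine, if_exists='append', index=False)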
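
Because the process runs daily, loading the same day twice in append mode would duplicate its rows. One possible safeguard, sketched below, deletes the rows for the dates being loaded before appending them again. This is an assumption-laden illustration rather than part of the original process: `replace_days` is a hypothetical helper, dates are stored as ISO strings, and the table-existence check against `sqlite_master` is specific to SQLite, which stands in here for the local MySQL database.

```python
import sqlite3

import pandas as pd


def replace_days(frame, table, con):
    """Append frame to table after deleting rows for the dates it covers."""
    dates = [(d,) for d in frame["date"].unique().tolist()]
    # SQLite-specific check: does the target table already exist?
    exists = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists:
        # The table name is a fixed identifier; the dates are bound parameters.
        con.executemany(f"DELETE FROM {table} WHERE date = ?", dates)
    frame.to_sql(table, con, if_exists="append", index=False)
    con.commit()


con = sqlite3.connect(":memory:")
day = pd.DataFrame({
    "shop_id": [2, 3],
    "shop_earnings": [103746.0, 67443.0],
    "total_items_sold": [81.0, 33.0],
    "date": ["2015-01-04", "2015-01-04"],
})
replace_days(day, "sales_by_shop", con)
replace_days(day, "sales_by_shop", con)  # re-run of the same day
n_rows = con.execute("SELECT COUNT(*) FROM sales_by_shop").fetchone()[0]
print(n_rows)
```

After the second call the table still holds one row per shop for that day, so the load can be repeated safely.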
