# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# import libraries
import pandas as pd
import sqlalchemy
import pymysql

# read raw_sales table from retail_sales database ON local server

driver = 'mysql+pymysql'
user = 'root'
password = 'password'
ip = 'localhost:3306'
connection_string = f'{driver}://{user}:{password}@{ip}'
db_connection = sqlalchemy.create_engine(connection_string)
sqlAction = "SELECT * FROM retail_sales.raw_sales"
df = pd.read_sql_query(sqlAction, db_connection)
df.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
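
The task description mentions a daily file saved in a folder, while the cell above reads the sample from the database. A minimal sketch of the file-based variant follows, assuming the daily file is a CSV with the same columns as `raw_sales`. The helper name `read_daily_sales` is hypothetical, and the in-memory sample simply repeats the first rows shown above.

```python
import io
import pandas as pd

def read_daily_sales(source):
    """Read one day's sales from a CSV path or file-like object (hypothetical helper)."""
    # Assumption: the file has the same columns as the raw_sales table
    return pd.read_csv(source, parse_dates=['date'])

# In-memory stand-in for the daily file, copied from the rows shown above
sample_csv = io.StringIO(
    "date,shop_id,item_id,item_price,item_cnt_day\n"
    "2015-01-04,29,1469,1199.0,1.0\n"
    "2015-01-04,28,21364,479.0,1.0\n"
)
daily_df = read_daily_sales(sample_csv)
print(daily_df.dtypes)
```

Passing a real path instead of the `StringIO` object would read the file from disk; the rest of the notebook could then use `daily_df` in place of `df`.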


In [2]:
# check data types
print(df.dtypes)
# check if all dates are the same
print('\nDifferent dates count are:',(df.date != df.date[0]).sum())
# check for negative price
print('Negative prices count are:',(df.item_price < 0).sum())
# check for negative quantities
print('Negative quantities count are:',(df.item_cnt_day < 0).sum())

# inspect negative quantities values
print('\nThe negative quantities found are:\n',df.item_cnt_day[df.item_cnt_day < 0],sep="")

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

Different dates count are: 0
Negative prices count are: 0
Negative quantities count are: 30

The negative quantities found are:
179    -1.0
305    -1.0
386    -1.0
391    -1.0
825    -1.0
899    -1.0
901    -1.0
926    -1.0
1161   -1.0
1416   -1.0
1694   -1.0
1820   -1.0
1901   -1.0
1906   -1.0
2340   -1.0
2414   -1.0
2416   -1.0
2441   -1.0
2676   -1.0
2931   -1.0
3209   -1.0
3335   -1.0
3416   -1.0
3421   -1.0
3855   -1.0
3929   -1.0
3931   -1.0
3956   -1.0
4191   -1.0
4446   -1.0
Name: item_cnt_day, dtype: float64


In [3]:
# The proposed cleaned data table applies the following change:
# negative item_cnt_day values are converted to positive and stored as integers
data_clean = df[['date','shop_id','item_id','item_price']].copy()

# abs() flips the sign and keeps the magnitude (every negative value found above is -1)
item_cnt_clean = df.item_cnt_day.astype('int64').abs()

data_clean['item_cnt_day'] = item_cnt_clean

# write cleaned data table to csv file
data_clean.to_csv(f"./dataRetailOut/data_cleaned_{df.date[0].date()}.csv",sep=',',index=False)

In [None]:
# write cleaned data table to local database
# MySQL type mapping: keep only the date part of the datetime and use small integers
# item_price is deliberately left out, so it keeps the default float mapping instead of NUMERIC(10,2)
sqlDtype = {'date':sqlalchemy.DATE,'shop_id':sqlalchemy.SMALLINT,'item_id':sqlalchemy.SMALLINT,
            'item_cnt_day':sqlalchemy.SMALLINT}

data_clean.to_sql('cleaned_sales',db_connection,schema='retail_sales',if_exists='replace',index=False,dtype=sqlDtype)
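
The `SMALLINT` mapping above only works while every id and quantity fits in a signed 16-bit integer (-32768 to 32767 in MySQL). The following sketch shows one way to check this before writing. The helper `fits_smallint` is hypothetical, and the value 40000 is invented purely to illustrate a failing case.

```python
import numpy as np
import pandas as pd

def fits_smallint(series):
    """Return True if every value fits in a signed 16-bit integer (hypothetical helper)."""
    limits = np.iinfo(np.int16)
    return bool(series.between(limits.min, limits.max).all())

# Illustrative data: 40000 is made up to show a value that would not fit
example = pd.DataFrame({'shop_id': [29, 28], 'item_id': [1469, 40000]})
for column in ['shop_id', 'item_id']:
    print(column, fits_smallint(example[column]))
```

If a column fails the check, a wider type such as `sqlalchemy.INTEGER` would be the safer choice for that column.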

In [4]:
# create the sales per store dataframe (following our local database table "sales_by_shop" format)
todayStores = data_clean.shop_id.unique()

# create a list that sums all earnings per store TODAY
storesEarning = [(data_clean.item_cnt_day[data_clean.shop_id==elem]*\
                 data_clean.item_price[data_clean.shop_id==elem]).sum() for elem in todayStores]

# create a list that sums total number of items sold per store TODAY
storesQuantity = [(data_clean.item_cnt_day[data_clean.shop_id==elem]).sum() for elem in todayStores]

# now we are ready to create the data frame sales_by_shop FOR TODAY
columNames = ['shop_id','shop_earnings','total_items_sold']
sales_by_shop = pd.DataFrame(list(zip(todayStores,storesEarning,storesQuantity)),columns=columNames).sort_values(by='shop_id')
sales_by_shop['date'] = df.date[0].date()

# write sales_by_shop data table to csv file
sales_by_shop.to_csv(f"./dataRetailOut/sales_by_shop_{df.date[0].date()}.csv",sep=',',index=False)
sales_by_shop.head()

Unnamed: 0,shop_id,shop_earnings,total_items_sold,date
38,2,103746.0,81,2015-01-04
36,3,67443.0,33,2015-01-04
37,4,29361.0,39,2015-01-04
39,5,33138.0,45,2015-01-04
33,6,138678.0,150,2015-01-04
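
The list comprehensions above filter the whole table once per store. As an alternative, the same sums can be computed in one pass with `groupby`. This is a sketch rather than the notebook's method: the helper `aggregate_sales` is hypothetical, and the sample frame reuses the first three rows of the table shown earlier.

```python
import pandas as pd

# Tiny stand-in for data_clean, using the first rows shown earlier
sample = pd.DataFrame({
    'shop_id': [29, 28, 28],
    'item_id': [1469, 21364, 21365],
    'item_price': [1199.0, 479.0, 999.0],
    'item_cnt_day': [1, 1, 2],
})

def aggregate_sales(data, key, earnings_name):
    """Sum earnings and quantities per value of `key` (hypothetical helper)."""
    # Earnings per row = price * quantity
    with_earnings = data.assign(earnings=data.item_price * data.item_cnt_day)
    # groupby sorts by key by default, so no separate sort_values is needed
    return (with_earnings
            .groupby(key, as_index=False)
            .agg(**{earnings_name: ('earnings', 'sum'),
                    'total_items_sold': ('item_cnt_day', 'sum')}))

by_shop = aggregate_sales(sample, 'shop_id', 'shop_earnings')
print(by_shop)
```

Calling `aggregate_sales(sample, 'item_id', 'item_earnings')` would give the per-item aggregate with the same column layout.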


In [None]:
# append sales_by_shop data frame to existing table in local database
# don't specify datatype conversions
sales_by_shop.to_sql('sales_by_shop',db_connection,schema='retail_sales',if_exists='append',index=False)

In [5]:
# create the sales by item dataframe (following our local database table "sales_by_item" format)
todayItems = data_clean.item_id.unique()

# create a list that sums all earnings per item TODAY
itemsEarning = [(data_clean.item_cnt_day[data_clean.item_id==elem]*\
                 data_clean.item_price[data_clean.item_id==elem]).sum() for elem in todayItems]

# create a list that sums total sales per item for all stores TODAY
itemsQuantity = [(data_clean.item_cnt_day[data_clean.item_id==elem]).sum() for elem in todayItems]

# now we are ready to create the data frame sales_by_item FOR TODAY
columNames = ['item_id','item_earnings','total_items_sold']
sales_by_item = pd.DataFrame(list(zip(todayItems,itemsEarning,itemsQuantity)),columns=columNames).sort_values(by='item_id')
sales_by_item['date'] = df.date[0].date()

# write sales_by_item data table to csv file
sales_by_item.to_csv(f"./dataRetailOut/sales_by_item_{df.date[0].date()}.csv",sep=',',index=False)
sales_by_item.head()

Unnamed: 0,item_id,item_earnings,total_items_sold,date
78,30,507.0,3,2015-01-04
882,31,1089.0,3,2015-01-04
74,32,447.0,3,2015-01-04
629,42,897.0,3,2015-01-04
551,59,747.0,3,2015-01-04
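
Because the per-store and per-item aggregates are built from the same rows, their grand totals should agree. The sketch below shows a possible sanity check before loading both tables. It is self-contained and uses a small sample copied from the first rows of the table shown earlier, not the notebook's full data.

```python
import math
import pandas as pd

# Tiny stand-in for data_clean, using the first rows shown earlier
sample = pd.DataFrame({
    'shop_id': [29, 28, 28],
    'item_id': [1469, 21364, 21365],
    'item_price': [1199.0, 479.0, 999.0],
    'item_cnt_day': [1, 1, 2],
})

# Earnings per row, then summed per store and per item
earnings = sample.item_price * sample.item_cnt_day
shop_total = earnings.groupby(sample.shop_id).sum().sum()
item_total = earnings.groupby(sample.item_id).sum().sum()

# isclose avoids false alarms from floating point rounding
totals_match = math.isclose(shop_total, item_total)
print(totals_match)
```

A mismatch here would suggest that rows were dropped or duplicated while building one of the aggregates.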


In [None]:
# append sales_by_item data frame to existing table in local database
# don't specify datatype conversions
sales_by_item.to_sql('sales_by_item',db_connection,schema='retail_sales',if_exists='append',index=False)