# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that sums the remaining numeric columns.
* One aggregate per item that sums the remaining numeric columns.
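
As a rough illustration of what these two aggregates could look like, here is a minimal sketch using pandas `groupby`. The input columns follow the `raw_sales` table and the output columns follow the `sales_by_shop` and `sales_by_item` tables that appear later in this notebook. The sample rows and the `earnings` helper column (price times units) are assumptions made for the example, not part of the assignment.

```python
import pandas as pd

# Tiny hand-made sample with the same columns as raw_sales (values are illustrative).
sales = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 21364],
    "item_price": [479.0, 999.0, 479.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

# Assumption: the earnings of a row are its price times the units sold.
sales["earnings"] = sales["item_price"] * sales["item_cnt_day"]

# One row per shop, summing earnings and units sold.
by_shop = sales.groupby("shop_id", as_index=False).agg(
    shop_earnings=("earnings", "sum"),
    total_items_sold=("item_cnt_day", "sum"),
)

# One row per item, summing earnings and units sold.
by_item = sales.groupby("item_id", as_index=False).agg(
    item_earnings=("earnings", "sum"),
    total_items_sold=("item_cnt_day", "sum"),
)

print(by_shop)
print(by_item)
```

Whether earnings should be computed this way, or whether other columns should be summed as well, depends on how you read the task.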

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
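
The four steps above can be sketched end to end. This is only a minimal sketch under several assumptions: an inline CSV stands in for the daily file, an in-memory SQLite database stands in for the local database, dropping rows with missing values is the only cleaning rule, and the table names `clean_sales`, `agg_by_shop`, and `agg_by_item` are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Step 1: read the daily file. An inline CSV stands in for the real file (assumption).
daily_csv = io.StringIO(
    "date,shop_id,item_id,item_price,item_cnt_day\n"
    "2015-01-04,29,1469,1199.0,1.0\n"
    "2015-01-04,28,21364,479.0,1.0\n"
    "2015-01-04,28,21365,999.0,2.0\n"
    "2015-01-04,28,22104,,2.0\n"
)
raw = pd.read_csv(daily_csv)

# Step 2: clean. Dropping rows with missing values is an assumed rule.
clean = raw.dropna()

# Step 3: aggregate per shop and per item.
clean = clean.assign(earnings=clean["item_price"] * clean["item_cnt_day"])
by_shop = clean.groupby("shop_id", as_index=False)[["earnings", "item_cnt_day"]].sum()
by_item = clean.groupby("item_id", as_index=False)[["earnings", "item_cnt_day"]].sum()

# Step 4: write the three tables. SQLite in memory stands in for the local database.
conn = sqlite3.connect(":memory:")
clean.to_sql("clean_sales", conn, if_exists="replace", index=False)
by_shop.to_sql("agg_by_shop", conn, if_exists="replace", index=False)
by_item.to_sql("agg_by_item", conn, if_exists="replace", index=False)

# List the tables that now exist.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
)
print(tables["name"].tolist())
```

With a MySQL database, the same `to_sql` calls should work with a SQLAlchemy engine in place of `conn`. Using `if_exists="replace"` overwrites each table on every run, while `"append"` may suit a daily load better.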

## Instructions
* Read the CSV that you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# your code here

# CONNECTING TO DATABASE AND GETTING THE DATAFRAMES

from sqlalchemy import create_engine
import pandas as pd


driver = 'mysql+pymysql:'
user = 'ironhacker_read'
password = 'ir0nhack3r'
ip = '35.239.232.23'
database = 'retail_sales'

connecting_string = f'{driver}//{user}:{password}@{ip}/{database}'
engine = create_engine(connecting_string)

query1 = "SELECT * FROM raw_sales"
query2 = "SELECT * FROM sales_by_item"
query3 = "SELECT * FROM sales_by_item_index"
query4 = "SELECT * FROM sales_by_shop"

df_raw_sales = pd.read_sql(query1,engine)
df_sales_by_item = pd.read_sql(query2,engine)
df_sales_by_item_index = pd.read_sql(query3,engine)
df_sales_by_shop = pd.read_sql(query4,engine)


In [2]:
# df.head() displays the first rows of a table; df.describe() returns summary statistics.
# Confirm the object type before analysing how to clean the data.
print(type(df_raw_sales))


<class 'pandas.core.frame.DataFrame'>


In [3]:
df_raw_sales.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [4]:
# First, check the DataFrame info:
df_raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null datetime64[ns]
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.6 KB


In [5]:
df_sales_by_item.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2955 entries, 0 to 2954
Data columns (total 4 columns):
item_id             2955 non-null int64
item_earnings       2955 non-null float64
total_items_sold    2955 non-null float64
date                2955 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 92.4+ KB


In [16]:
# date is stored as object (string) rather than datetime64[ns], so it will need converting

df_sales_by_item.isnull().sum()

# no NaN values in any column

item_id             0
item_earnings       0
total_items_sold    0
date                0
dtype: int64
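
Since `date` is stored as `object` in this table, one option is to parse it with `pd.to_datetime`. The sketch below uses hand-made values in the same style as the table. The day-first format `%d/%m/%Y` is an assumption: values such as `03/09/2019` are ambiguous, and nothing in this notebook says which convention the source uses.

```python
import pandas as pd

# Hand-made sample in the same style as sales_by_item (values are illustrative).
sample = pd.DataFrame({
    "item_id": [30, 30],
    "date": ["03/09/2019", "03/12/2019"],
})

# Assumption: day/month/year. Use "%m/%d/%Y" instead if the source is month-first.
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")

print(sample.dtypes)
```

Passing an explicit `format` makes the choice visible and raises an error on values that do not match, rather than guessing row by row.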

In [29]:
df_sales_by_item.sort_values(['item_id'])
          

Unnamed: 0,item_id,item_earnings,total_items_sold,date
0,30,169.0,1.0,03/09/2019
985,30,169.0,1.0,03/12/2019
1970,30,169.0,1.0,03/12/2019
1,31,363.0,1.0,03/09/2019
986,31,363.0,1.0,03/12/2019
1971,31,363.0,1.0,03/12/2019
2,32,149.0,1.0,03/09/2019
987,32,149.0,1.0,03/12/2019
1972,32,149.0,1.0,03/12/2019
3,42,299.0,1.0,03/09/2019
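
The sorted output shows rows that look identical apart from their index (for example item 30 on `03/12/2019`). If these are unintended repeats, which the data alone does not confirm, they could be counted and removed as in this minimal sketch on a hand-made sample:

```python
import pandas as pd

# Hand-made sample mirroring the repeated rows seen above (values are illustrative).
sample = pd.DataFrame({
    "item_id": [30, 30, 30, 31],
    "item_earnings": [169.0, 169.0, 169.0, 363.0],
    "total_items_sold": [1.0, 1.0, 1.0, 1.0],
    "date": ["03/09/2019", "03/12/2019", "03/12/2019", "03/09/2019"],
})

# Count rows that exactly repeat an earlier row.
n_repeats = int(sample.duplicated().sum())

# Keep the first occurrence of each distinct row.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(deduped))
```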


In [6]:
df_sales_by_item_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 5 columns):
id                  985 non-null int64
item_id             985 non-null int64
item_earnings       985 non-null float64
total_items_sold    985 non-null float64
date                985 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 38.6+ KB


In [7]:
df_sales_by_shop.dtypes

shop_id               int64
shop_earnings       float64
total_items_sold    float64
date                 object
dtype: object

In [8]:
# Check that each column holds a single, consistent datatype.
# If a column is not all numbers or not all dates, pandas reports its dtype as object.
# No renaming is needed because all field names are clear.
    # e.g. df_raw_sales.rename(columns={"item_cnt_day": "item_count_day"})
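
The dtype check described above can be illustrated on two small hand-made Series (not taken from the dataset). The use of `pd.to_numeric` with `errors="coerce"` to locate the offending entries is one possible approach, shown here as a sketch.

```python
import pandas as pd

# A column of only numbers gets a numeric dtype.
all_numbers = pd.Series([1, 2, 3])

# A column that mixes numbers and text falls back to object.
mixed = pd.Series([1, "two", 3])

print(all_numbers.dtype)
print(mixed.dtype)

# Entries that cannot be parsed as numbers become NaN, which makes them easy to find.
bad_rows = mixed[pd.to_numeric(mixed, errors="coerce").isna()]
print(bad_rows)
```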



In [9]:
df_raw_sales.isnull().sum().sum()
# Check for NaNs (missing values); an alternative per-column check is df.isnull().any().
# The first sum() counts NaNs per column; the second sum() adds those counts into one total.
# any() returns booleans, while sum() returns integers.

0
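
To make the difference between `sum()` and `any()` concrete, here is a small sketch on a hand-made DataFrame that does contain missing values (unlike `df_raw_sales`, where the total is 0). The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-made sample with two missing values (illustrative).
sample = pd.DataFrame({
    "item_price": [479.0, np.nan, 249.0],
    "item_cnt_day": [1.0, 2.0, np.nan],
    "shop_id": [28, 28, 29],
})

per_column = sample.isnull().sum()              # missing count per column
total = int(sample.isnull().sum().sum())        # one integer for the whole table
any_per_column = sample.isnull().any()          # True/False per column
any_at_all = bool(sample.isnull().any().any())  # one True/False for the whole table

print(per_column)
print(total, any_at_all)
```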

In [10]:
df_sales_by_shop.describe()

Unnamed: 0,shop_id,shop_earnings,total_items_sold
count,90.0,90.0,90.0
mean,32.311111,34733.432741,37.155556
std,17.621262,25449.615414,28.852165
min,2.0,3095.0,6.0
25%,18.0,15703.0,20.0
50%,34.0,28579.0,26.0
75%,48.0,45106.0,50.0
max,59.0,109288.0,134.0


In [11]:
df_raw_sales.describe()

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [12]:
df_sales_by_item.describe()

Unnamed: 0,item_id,item_earnings,total_items_sold
count,2955.0,2955.0,2955.0
mean,10978.31269,1586.806572,1.697462
std,6280.104144,4397.621492,2.104701
min,30.0,25.0,-1.0
25%,5240.0,229.0,1.0
50%,11222.0,499.0,1.0
75%,16083.0,1399.0,2.0
max,22162.0,80970.0,31.0
