# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
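
Step 1 of the task talks about a file saved daily in a folder, while the notebook below reads from the database. As a minimal sketch of the file-based variant, assuming the daily file is a CSV with the same five columns as the `raw_sales` table, the reader could look like this. The function name `read_daily_file` and the file name mentioned in the comment are hypothetical.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = ["date", "shop_id", "item_id", "item_price", "item_cnt_day"]

def read_daily_file(source):
    """Read one daily CSV export and keep only the expected columns."""
    daily = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in daily.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    daily = daily[EXPECTED_COLUMNS].copy()
    daily["date"] = pd.to_datetime(daily["date"])
    return daily

# Stand-in for a file on disk; a path string such as "daily_sales.csv" works the same way.
sample_file = io.StringIO(
    "date,shop_id,item_id,item_price,item_cnt_day\n"
    "2015-01-04,29,1469,1199.0,1.0\n"
    "2015-01-04,28,21365,999.0,2.0\n"
)
daily = read_daily_file(sample_file)
print(daily.dtypes)
```

Checking the columns up front makes a malformed daily file fail early, before any cleaning or aggregation runs.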

In [None]:
# ACQUISITION

In [136]:
import pandas as pd
from sqlalchemy import create_engine

# Helper that opens a connection to the database described by a connection string.

def connect_to_database(connection_string):
    engine = create_engine(connection_string)
    return engine.connect()

driver   = 'mysql+pymysql:'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'

connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
conn = connect_to_database(connection_string)

# Name the result retail_info, which is the name the following cells use.
retail_info = pd.read_sql('SELECT * FROM raw_sales', conn)
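
The password above contains an `@`, which is also the character that separates the credentials from the host in a database URL. Whether an unescaped `@` is parsed correctly can depend on the SQLAlchemy version, so a safer habit is to percent-encode the user and password before building the string. The sketch below uses only the standard library; the function name `build_connection_string` and the credentials shown are placeholders, not values from this notebook.

```python
from urllib.parse import quote_plus

def build_connection_string(driver, user, password, host, database):
    """Build a database URL with the user and password percent-encoded."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# Placeholder credentials, used only to show the encoding.
example_url = build_connection_string(
    driver="mysql+pymysql",
    user="some-user",
    password="p@ss/word",
    host="127.0.0.1",
    database="retail_sales",
)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` above.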

In [139]:
driver   = 'mysql+pymysql:'
user     = 'root'
password = 'REOvio18'
ip       = '127.0.0.1'
database = 'Retail Sales_Rebecca'

connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
conn = connect_to_database(connection_string)

In [88]:
retail_info

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0


In [None]:
# WRANGLING

In [89]:
retail_info.describe()

# Anomaly 1: There are negative values in item_cnt_day (min = -1).
# Anomaly 2: There is a huge difference between the min and max price.
# 75% of the prices are at or below 1192, so a max price of 27990 seems odd.

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [92]:
# Anomaly 1: There are negative values in item_cnt_day (min = -1)

# How many?
indx_neg_values = retail_info.item_cnt_day[(retail_info.item_cnt_day < 0)]
indx_neg_values.count()
### There are 30 negative values in the column, which is not many out of 4545 rows.

30

In [93]:
# What to do with them?
### I would rather drop them than fill them in, as they look like data-entry mistakes.

In [94]:
# How to clean them?
### Creating a function that drops the rows with negative values in item_cnt_day

def drop_negative(retail_info):   
    return retail_info.loc[retail_info['item_cnt_day'] >= 0]

retail_info_cleaned = drop_negative(retail_info)
retail_info_cleaned

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0
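
Filtering with `.loc` can return an object that pandas still treats as derived from the original frame. As a hedged variant of `drop_negative`, the sketch below returns an explicit `.copy()` so that later column assignments are unambiguous, and it reports how many rows were removed. The name `drop_negative_counts` and the sample rows are made up for illustration.

```python
import pandas as pd

def drop_negative_counts(sales, column="item_cnt_day"):
    """Return an independent copy of `sales` without rows where `column` is negative."""
    is_negative = sales[column] < 0
    dropped = int(is_negative.sum())
    print(f"Dropping {dropped} of {len(sales)} rows with negative {column}")
    return sales.loc[~is_negative].copy()

# Tiny made-up sample with the same column names as the notebook's data.
sample = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1, 2, 3],
    "item_price": [1199.0, 479.0, 999.0],
    "item_cnt_day": [1.0, -1.0, 2.0],
})
cleaned = drop_negative_counts(sample)

# Adding a column to the copy leaves the original sample untouched.
cleaned["sales"] = cleaned["item_price"] * cleaned["item_cnt_day"]
```

Printing the number of dropped rows gives a quick daily check that the share of negative counts stays as small as it is in the sample file.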


In [111]:
# Anomaly 2: There is a huge difference between the min and max price.
# 75% of the prices are at or below 1192, so a max price of 27990 seems odd.

# How many values are there for each price point?
retail_info_cleaned.item_price.value_counts()

399.00     345
299.00     294
99.00      243
199.00     189
349.00     183
          ... 
70.00        3
248.00       3
721.55       3
989.00       3
2598.00      3
Name: item_price, Length: 221, dtype: int64

In [112]:
retail_info_cleaned.item_price.min()

3.0

In [113]:
# How many values are there in each price range?
retail_info_cleaned.item_price.value_counts(bins=3).sort_index()

# Note: the lower edge of the first range is negative, but there are no negative prices. In fact, the min price is 3 (see above).

(-24.988, 9332.0]     4488
(9332.0, 18661.0]        6
(18661.0, 27990.0]      21
Name: item_price, dtype: int64
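
The negative lower edge appears because, when `bins` is an integer, pandas computes equal-width bins and extends the range slightly so that the minimum value falls inside the first interval (the `pandas.cut` documentation mentions 0.1% of the range). The sketch below reproduces the effect on made-up prices with the same minimum and maximum, and shows explicit bin edges as an alternative. The edges chosen here are an assumption for illustration only.

```python
import pandas as pd

# Made-up prices spanning the same min and max as the notebook's data.
prices = pd.Series([3.0, 99.0, 399.0, 1199.0, 9500.0, 27990.0])

# Integer bins: pandas computes equal-width edges and pads the lowest one.
auto_bins = prices.value_counts(bins=3).sort_index()
print(auto_bins)

# Explicit edges: each interval starts exactly where we say it does.
edges = [0, 10_000, 20_000, 30_000]
manual_bins = pd.cut(prices, bins=edges).value_counts().sort_index()
print(manual_bins)
```

With explicit edges the ranges are easier to read and stay the same from one daily file to the next.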

In [97]:
# From the previous table we can see that 4488 of the items are priced at or below 9332, and 21 items are priced above 18661.
# We should ask the data collector about this, but for now we assume we sell some very expensive items.
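
To have something concrete to send to the data collector, the high-priced rows can be pulled out into their own table. This is a minimal sketch: the function name `flag_high_prices` is hypothetical, the sample rows are made up, and the threshold of 18661 is simply the lower edge of the top price range shown above.

```python
import pandas as pd

def flag_high_prices(sales, threshold):
    """Return the rows whose item_price is strictly above `threshold`, most expensive first."""
    expensive = sales.loc[sales["item_price"] > threshold]
    return expensive.sort_values("item_price", ascending=False)

# Made-up rows; only the column names match the notebook's data.
sample = pd.DataFrame({
    "shop_id": [29, 28, 15],
    "item_id": [1, 2, 3],
    "item_price": [1199.0, 479.0, 27990.0],
})
to_review = flag_high_prices(sample, threshold=18_661)
print(to_review)
```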

In [98]:
# ANALYSIS

In [129]:
# Adding total sales column
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered data.
retail_info_cleaned = retail_info_cleaned.assign(
    sales=retail_info_cleaned["item_price"] * retail_info_cleaned["item_cnt_day"]
)
retail_info_cleaned


Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,sales
0,2015-01-04,29,1469,1199.0,1.0,1199.0
1,2015-01-04,28,21364,479.0,1.0,479.0
2,2015-01-04,28,21365,999.0,2.0,1998.0
3,2015-01-04,28,22104,249.0,2.0,498.0
4,2015-01-04,28,22091,179.0,1.0,179.0
...,...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0,1299.0
4541,2015-01-04,14,21922,99.0,1.0,99.0
4542,2015-01-04,15,1969,3999.0,1.0,3999.0
4543,2015-01-04,14,22091,179.0,1.0,179.0


In [131]:
# Function that aggregates per store by summing the remaining numeric columns.

def aggregate_per_store(cleaned_data):
    # Use the function argument, not a global name; numeric_only=True skips the date column.
    sales_by_store = cleaned_data.groupby("shop_id").sum(numeric_only=True)
    return sales_by_store

aggregate_per_store(retail_info_cleaned).head()

shop_id,item_id,item_price,item_cnt_day,sales
2,966879,99070.5,81.0,103746.0
3,335745,67443.0,33.0,67443.0
4,498624,29361.0,39.0,29361.0
5,620868,33138.0,45.0,33138.0
6,1266894,116352.0,150.0,138678.0


In [133]:
# Function that aggregates per item by summing the remaining numeric columns.

def aggregate_per_item(cleaned_data):
    # numeric_only=True skips the date column.
    sales_by_item = cleaned_data.groupby("item_id").sum(numeric_only=True)
    return sales_by_item

aggregate_per_item(retail_info_cleaned).head()

item_id,shop_id,item_price,item_cnt_day,sales
30,84,507.0,3.0,507.0
31,18,1089.0,3.0,1089.0
32,93,447.0,3.0,447.0
42,162,897.0,3.0,897.0
59,171,747.0,3.0,747.0
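
Summing every numeric column also adds up identifiers and unit prices (for example `item_id` in the per-store table), and those totals have no obvious meaning. As a hedged alternative, assuming units sold and revenue are the figures of interest, named aggregation keeps only those two sums. The function name `aggregate_sales` and the output column names `units_sold` and `revenue` are hypothetical, and the sample rows are made up.

```python
import pandas as pd

def aggregate_sales(cleaned_data, key):
    """Sum units and revenue per value of `key` (for example "shop_id" or "item_id")."""
    return (
        cleaned_data
        .groupby(key)
        .agg(units_sold=("item_cnt_day", "sum"), revenue=("sales", "sum"))
        .reset_index()
    )

# Made-up rows with the column names used in the notebook.
sample = pd.DataFrame({
    "shop_id": [2, 2, 3],
    "item_id": [30, 31, 30],
    "item_cnt_day": [1.0, 2.0, 1.0],
    "sales": [169.0, 726.0, 169.0],
})
per_store = aggregate_sales(sample, "shop_id")
per_item = aggregate_sales(sample, "item_id")
print(per_store)
print(per_item)
```

Calling `reset_index()` turns the grouping key back into a regular column, which makes the result easier to write to a database table.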


In [134]:
# Creating the tables in the local database.

def export_to_sql(retail_info_cleaned):
    table_by_store = aggregate_per_store(retail_info_cleaned)
    table_by_item = aggregate_per_item(retail_info_cleaned)

    retail_info_cleaned.to_sql('Cleaned Data', conn, index=False)
    # Keep the index for the aggregates: it holds the grouping key (item_id / shop_id).
    table_by_item.to_sql('Sales by item', conn, index=True)
    table_by_store.to_sql('Sales by store', conn, index=True)

In [140]:
export_to_sql(retail_info_cleaned)
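
Because the process is meant to run daily, it is worth noting that `DataFrame.to_sql` defaults to `if_exists="fail"`, so a second run raises an error once the tables exist. The sketch below shows one way around that with `if_exists="replace"`; `"append"` would be the choice if the tables should accumulate history. It uses an in-memory SQLite database as a stand-in for the local MySQL database, and the function name `export_tables` and the table name are hypothetical.

```python
import sqlite3
import pandas as pd

def export_tables(tables, connection):
    """Write each DataFrame in `tables` to a table named after its key, replacing old data."""
    for name, frame in tables.items():
        frame.to_sql(name, connection, if_exists="replace", index=False)

# Made-up aggregate with the grouping key stored as a regular column.
per_store = pd.DataFrame({"shop_id": [2, 3], "sales": [895.0, 169.0]})

connection = sqlite3.connect(":memory:")
export_tables({"sales_by_store": per_store}, connection)
export_tables({"sales_by_store": per_store}, connection)  # a second run does not fail
stored = pd.read_sql("SELECT * FROM sales_by_store", connection)
connection.close()
print(stored)
```

Table names without spaces, as used here, also avoid the need for quoting when the tables are queried later.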