# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the csv you can find in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.

### 1. Read the sample file that a daily process will save in your folder.

In [55]:
# Import the libraries needed to extract the data and work with it

import sqlalchemy
import pandas as pd

In [56]:
# Connect to Ironhack's existing database.
# These settings always apply when connecting to Ironhack's database. To connect to another database,
# such as a local one, change the user, password and IP (for example 127.0.0.1).

driver   = 'mysql+pymysql:'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'

In [57]:
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
print(connection_string)

mysql+pymysql://data-students:iR0nH@cK-D4T4B4S3@34.65.10.136/retail_sales


In [58]:
engine = sqlalchemy.create_engine(connection_string)
print(engine)

Engine(mysql+pymysql://data-students:***@34.65.10.136/retail_sales)
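
The connection string above embeds the password directly, and that password contains an `@`, which is also the separator between credentials and host in a URL. A minimal sketch of a more defensive approach, assuming the password is read from an environment variable (the name `RETAIL_DB_PASSWORD` and the helper `build_connection_string` are hypothetical) and percent-encoded with the standard library before it goes into the URL:

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database, driver="mysql+pymysql"):
    """Build a SQLAlchemy-style URL, percent-encoding the user and the password."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical environment variable; the fallback is only a placeholder for the demo.
password = os.environ.get("RETAIL_DB_PASSWORD", "p@ss/word")
connection_string = build_connection_string(
    "data-students", password, "34.65.10.136", "retail_sales"
)
print(connection_string)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as before. Keeping the password out of the notebook also avoids printing it in cell outputs.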


In [59]:
df = pd.read_sql('SELECT * FROM retail_sales.raw_sales', engine)
df.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


### 2. Clean the data and create the aggregates as you consider.

In [60]:
# Let's check the unique values of every column (listed only when a column has fewer than 50)

for col in df:
    print (col)
    uni = df[col].unique()
    if len(uni) < 50:
        print (col)
        print('\t', uni)
        


date
date
	 ['2015-01-04T00:00:00.000000000']
shop_id
shop_id
	 [29 28 31 27 35 34 24 25 21 22 19 26 56 55 54 57 58 48 49 50 47 53 52 51
 44 45 41 46 37 38 39 59 42  6  7 10  3  4  2  5 18 16 15 12 14]
item_id
item_price
item_cnt_day
item_cnt_day
	 [ 1.  2.  6.  3. -1.  4.  5. 10.]


In [61]:
# date. All the data comes from a single day. Most probably this is the "live" date of each data
# extract: one file per day, as mentioned in the instructions.

# shop_id. The shop ids look normal.

# item_id. More than 50 unique values. We could check them, but as identifiers there is nothing unusual we could detect.

# item_price. More than 50 unique values, so nothing is listed above. It will be interesting to look at
# prices separately.

# item_cnt_day. Mostly normal, but it contains a negative value (-1), which is handled further down.

In [62]:
df.describe()

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [63]:
df.loc[df['item_price'] >15000]

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
400,2015-01-04,21,6675,26990.0,1.0
777,2015-01-04,50,6675,26990.0,1.0
904,2015-01-04,44,13442,27990.0,1.0
981,2015-01-04,37,13401,27392.0,1.0
983,2015-01-04,37,13405,25392.0,1.0
1158,2015-01-04,42,6675,26990.0,1.0
1495,2015-01-04,15,13400,19990.0,1.0
1915,2015-01-04,21,6675,26990.0,1.0
2292,2015-01-04,50,6675,26990.0,1.0
2419,2015-01-04,44,13442,27990.0,1.0
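
Several rows in the table above share the same `shop_id`, `item_id` and `item_price` at different index positions (for example 400 and 1915), which may point to repeated records in the extract. Whether these are true duplicates or legitimate repeated sales cannot be decided from the output alone. A minimal sketch of how one might check, using hypothetical sample rows rather than the real table:

```python
import pandas as pd

# Hypothetical sample rows with the same columns as raw_sales.
sample = pd.DataFrame({
    "date": ["2015-01-04"] * 4,
    "shop_id": [21, 50, 21, 44],
    "item_id": [6675, 6675, 6675, 13442],
    "item_price": [26990.0, 26990.0, 26990.0, 27990.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 1.0],
})

# keep=False flags every member of a duplicated group, not only the repeats.
dupes = sample[sample.duplicated(keep=False)]
print(len(dupes))

# drop_duplicates keeps the first occurrence of each fully identical row.
deduplicated = sample.drop_duplicates()
print(len(deduplicated))
```

If the duplicates turn out to be an artifact of the daily export, dropping them before aggregating would avoid inflating the totals.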


In [64]:
# Let's get more info on prices

item_price_labels = ['Low', 'Medium', 'High']
bins = pd.cut(df['item_price'],
              len(item_price_labels),
              labels=item_price_labels,
              retbins=True)
bins[0]

0       Low
1       Low
2       Low
3       Low
4       Low
       ... 
4540    Low
4541    Low
4542    Low
4543    Low
4544    Low
Name: item_price, Length: 4545, dtype: category
Categories (3, object): [Low < Medium < High]
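
Almost every row lands in `Low` because `pd.cut` with an integer number of bins splits the price range into intervals of equal width, and a few very expensive items stretch that range. A short sketch, on made-up prices, contrasting equal-width bins with the quantile-based bins of `pd.qcut`:

```python
import pandas as pd

# Made-up prices with one large outlier, similar in shape to item_price.
prices = pd.Series([3.0, 99.0, 179.0, 249.0, 479.0, 999.0, 1199.0, 3999.0, 27990.0])
labels = ["Low", "Medium", "High"]

# Equal-width bins: the outlier pushes nearly everything into "Low".
equal_width = pd.cut(prices, bins=len(labels), labels=labels)

# Quantile bins: each label receives roughly the same number of rows.
equal_count = pd.qcut(prices, q=len(labels), labels=labels)

print(equal_width.value_counts().to_dict())
print(equal_count.value_counts().to_dict())
```

Note that `pd.qcut` can raise an error when many prices are identical and the bin edges collide; passing `duplicates="drop"` is one way to handle that, at the cost of fewer bins.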

In [65]:
# Let's check the min, max and mean of the item prices

min_price = df['item_price'].min()

print(" The minimum price per item is: ", min_price)

max_price = df['item_price'].max()

print(" The maximum price per item is: ", max_price)

mean_price = df['item_price'].mean()

print(" The mean price per item is: ", mean_price)

 The minimum price per item is:  3.0
 The maximum price per item is:  27990.0
 The mean price per item is:  1031.686121012101


In [66]:
df.sort_values(by = 'item_price')


Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
2977,2015-01-04,10,20949,3.0,2.0
1462,2015-01-04,10,20949,3.0,2.0
4492,2015-01-04,10,20949,3.0,2.0
1422,2015-01-04,18,20949,5.0,2.0
2343,2015-01-04,52,20949,5.0,2.0
...,...,...,...,...,...
981,2015-01-04,37,13401,27392.0,1.0
2496,2015-01-04,37,13401,27392.0,1.0
904,2015-01-04,44,13442,27990.0,1.0
2419,2015-01-04,44,13442,27990.0,1.0


In [67]:
print("Total number of items is", df.count())

print("Total number of items priced above 10,000", df.loc[df['item_price'] > 10000].count())

print("Total number of items priced above 15,000", df.loc[df['item_price'] > 15000].count())

print("Total number of items priced above 26,000", df.loc[df['item_price'] > 26000].count())

above_10k = (df.loc[df['item_price'] > 10000].count() / df.count()) * 100
print(above_10k)

Total number of items is date            4545
shop_id         4545
item_id         4545
item_price      4545
item_cnt_day    4545
dtype: int64
Total number of items priced above 10,000 date            27
shop_id         27
item_id         27
item_price      27
item_cnt_day    27
dtype: int64
Total number of items priced above 15,000 date            21
shop_id         21
item_id         21
item_price      21
item_cnt_day    21
dtype: int64
Total number of items priced above 26,000 date            15
shop_id         15
item_id         15
item_price      15
item_cnt_day    15
dtype: int64
date            0.594059
shop_id         0.594059
item_id         0.594059
item_price      0.594059
item_cnt_day    0.594059
dtype: float64
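
`df.count()` returns one count per column, which is why each figure above is repeated five times. A minimal sketch of a more compact alternative, assuming a hypothetical helper `share_above` and made-up prices: summing a boolean mask gives a single row count, and its mean gives the share.

```python
import pandas as pd


def share_above(prices, threshold):
    """Return (number of rows above threshold, percentage of rows above threshold)."""
    above = prices > threshold
    return int(above.sum()), float(above.mean() * 100)


# Made-up prices for illustration only.
prices = pd.Series([100.0, 20000.0, 500.0, 12000.0])
count, percent = share_above(prices, 10000)
print(f"{count} rows above 10,000 ({percent:.1f}% of all rows)")
```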


In [68]:
# Although the items above 10k represent a small proportion of our data set (0.6%),
# that is still a considerable amount. In addition, the overall average price is 1031.7, which is not that low.
# Therefore, we can assume that our shops sell luxury items.

In [69]:
# Regarding the negative counts, we will assume that there cannot be
# negative sales, and we will disregard the option of a negative count of
# items. Therefore we will replace the negative counts with 0.

In [70]:
df['item_cnt_day']


0       1.0
1       1.0
2       2.0
3       2.0
4       1.0
       ... 
4540    1.0
4541    1.0
4542    1.0
4543    1.0
4544    1.0
Name: item_cnt_day, Length: 4545, dtype: float64

In [71]:
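# Work on a copy so the raw DataFrame stays untouched.
# Note: the loop below does not change the data. Reassigning the loop variable i
# only rebinds a local name, so df_1 is left as it was; the replacement itself
# is done with mask() in the next cell.

df_1 = df.copy()

df_1.head()

for i in df_1['item_cnt_day']:
    if i < 0:
        i = 0
print(i)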

1.0


In [72]:
df_1.item_cnt_day=df_1.item_cnt_day.mask(df_1.item_cnt_day.lt(0),0)
df_1

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0


In [73]:
df_1.describe()

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.110231
std,16.565517,6558.649572,2073.91999,0.516832
min,2.0,30.0,3.0,0.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [74]:
# As we can see in the table above, the negative values for item_cnt_day
# have been replaced by 0. The min value in item_cnt_day is now 0.
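
Replacing negative counts with 0 keeps the rows but removes their effect on the totals. A short sketch on made-up counts, showing that `clip(lower=0)` gives the same result as the `mask` call used above, next to the alternative of filtering the rows out entirely:

```python
import pandas as pd

# Made-up counts including one negative value.
counts = pd.Series([1.0, -1.0, 2.0, 10.0], name="item_cnt_day")

# Option 1: set negatives to 0 (same result as counts.mask(counts.lt(0), 0)).
zeroed = counts.clip(lower=0)

# Option 2: drop the rows with negative counts.
kept = counts[counts >= 0]

print(zeroed.tolist())
print(kept.tolist())
```

Which option fits depends on what a negative count means in the source system. If it records a return, zeroing or dropping it both discard that information, so the choice is an assumption worth documenting.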

### 3. Create the aggregates.

In [75]:
# For the group-bys per store and per item, it makes sense to look at
# item_cnt_day as well as the sales per item. Summing item_price on its own is not meaningful,
# so we will create a new table with the daily count and the sales, where
# sales = item_price * item_cnt_day

In [82]:
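# Create a new, more relevant table for the aggregates.
# Note: this overwrites item_price in place, so the cell is not idempotent:
# running it again multiplies by item_cnt_day a second time. The output below
# shows 3996.0 for an item priced 999.0 with a count of 2, which suggests the
# cell was run twice (a single run would give 1998.0).

df_1['item_price'] = df_1['item_price'].multiply(df_1['item_cnt_day'])

df_1.head(5)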

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,3996.0,2.0
3,2015-01-04,28,22104,996.0,2.0
4,2015-01-04,28,22091,179.0,1.0
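
Overwriting `item_price` in place means that re-running the cell multiplies by `item_cnt_day` again. A sketch of an alternative, on hypothetical sample rows, that writes the sales to a new `item_sales` column and leaves `item_price` untouched, so the step gives the same result however often it runs (the helper name `add_sales` is an assumption):

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "item_price": [1199.0, 999.0, 249.0],
    "item_cnt_day": [1.0, 2.0, 2.0],
})


def add_sales(frame):
    """Return a copy of frame with item_sales = item_price * item_cnt_day."""
    return frame.assign(item_sales=frame["item_price"] * frame["item_cnt_day"])


once = add_sales(sample)
twice = add_sales(add_sales(sample))  # running the step again changes nothing

print(once["item_sales"].tolist())
```

Keeping the unit price alongside the sales also preserves information that the cleaned table would otherwise lose.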


In [83]:
# Rename item_price to item_sales, since the column now holds sales

df_2 = df_1.rename(columns = {'item_price' : 'item_sales'})

df_2.head(5)

Unnamed: 0,date,shop_id,item_id,item_sales,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,3996.0,2.0
3,2015-01-04,28,22104,996.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [84]:
# Create an Aggregate by Store

In [87]:
# We will recreate the table keeping only shop_id, item_sales and item_cnt_day

df_by_shop = df_2[['shop_id','item_sales','item_cnt_day']]

df_by_shop.head(5)

Unnamed: 0,shop_id,item_sales,item_cnt_day
0,29,1199.0,1.0
1,28,479.0,1.0
2,28,3996.0,2.0
3,28,996.0,2.0
4,28,179.0,1.0


In [97]:
sales_per_store = df_by_shop.groupby(['shop_id']).sum()

In [None]:
# Create an Aggregate by Item

In [90]:
# We will recreate the table keeping only item_id, item_sales and item_cnt_day

df_by_item = df_2[['item_id','item_sales','item_cnt_day']]

df_by_item.head(5)

Unnamed: 0,item_id,item_sales,item_cnt_day
0,1469,1199.0,1.0
1,21364,479.0,1.0
2,21365,3996.0,2.0
3,22104,996.0,2.0
4,22091,179.0,1.0


In [99]:
sales_per_item = df_by_item.groupby(['item_id']).sum()
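
The two aggregates follow the same pattern and differ only in the grouping key. A minimal sketch of a shared helper (the name `aggregate_sales` is hypothetical), using made-up rows and pandas named aggregation so that the summed columns are stated explicitly:

```python
import pandas as pd


def aggregate_sales(frame, key):
    """Sum item_sales and item_cnt_day per value of key, keeping key as a column."""
    return frame.groupby(key, as_index=False).agg(
        item_sales=("item_sales", "sum"),
        item_cnt_day=("item_cnt_day", "sum"),
    )


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 1469],
    "item_sales": [479.0, 1998.0, 1199.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

per_store = aggregate_sales(sample, "shop_id")
per_item = aggregate_sales(sample, "item_id")
print(per_store)
print(per_item)
```

With `as_index=False` the key stays an ordinary column, which tends to be more convenient when the result is later written to a database table.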

### 4. Write three tables in your local database:

In [100]:
# A table for the cleaned data.

clean_table = df_2
clean_table.head()

Unnamed: 0,date,shop_id,item_id,item_sales,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,3996.0,2.0
3,2015-01-04,28,22104,996.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [101]:
# A table for the aggregate per store.

sales_per_store.head()

shop_id,item_sales,item_cnt_day
2,113097.0,81.0
3,67443.0,33.0
4,29361.0,39.0
5,33138.0,45.0
6,225888.0,150.0


In [102]:
# A table for the aggregate per item.

sales_per_item.head()

item_id,item_sales,item_cnt_day
30,507.0,3.0
31,1089.0,3.0
32,447.0,3.0
42,897.0,3.0
59,747.0,3.0
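
The cells above display the three tables but do not write them to a database. A minimal sketch of that last step with `DataFrame.to_sql`, assuming an in-memory SQLite database as a stand-in for the local database, made-up rows, and hypothetical table names (`clean_sales`, `sales_per_store`, `sales_per_item`):

```python
import sqlite3

import pandas as pd

# Made-up cleaned rows for illustration only.
clean = pd.DataFrame({
    "date": ["2015-01-04"] * 3,
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 1469],
    "item_sales": [479.0, 1998.0, 1199.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})
value_columns = ["item_sales", "item_cnt_day"]
per_store = clean.groupby("shop_id", as_index=False)[value_columns].sum()
per_item = clean.groupby("item_id", as_index=False)[value_columns].sum()

# Stand-in for the local database; table names are assumptions.
connection = sqlite3.connect(":memory:")
tables = {
    "clean_sales": clean,
    "sales_per_store": per_store,
    "sales_per_item": per_item,
}
for name, frame in tables.items():
    # "replace" recreates the table on every run.
    frame.to_sql(name, connection, if_exists="replace", index=False)

stored = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", connection
)
print(stored["name"].tolist())
```

For a local MySQL database, the same `to_sql` calls should work with a SQLAlchemy engine in place of the SQLite connection. For a process that runs daily, `if_exists="append"` may be the better fit for the cleaned table, so that each day's rows are added rather than overwritten.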
