Version 1.0.3

# Pandas basics 

Hi! In this programming assignment you need to refresh your `pandas` knowledge. You will need to do several [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)s and `join`s to solve the task.
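As a quick refresher, here is a minimal sketch of the two operations on made-up frames. The names `sales` and `catalog` and their columns are hypothetical and not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy frames, only for illustration
sales = pd.DataFrame({"item_id": [1, 2, 1], "amount": [10.0, 5.0, 2.5]})
catalog = pd.DataFrame({"item_id": [1, 2], "category": ["a", "b"]})

# Join: attach the category of each item to every sale
joined = sales.merge(catalog, on="item_id", how="left")

# Groupby: total amount per category
totals = joined.groupby("category")["amount"].sum()
print(totals)
```

The same pattern (merge on a shared key, then aggregate per group) is what the tasks below rely on.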

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline 

from grader import Grader

In [3]:
DATA_FOLDER = '../readonly/final_project_data/'

transactions    = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items           = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops           = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))

The dataset we are going to use is taken from the competition that serves as the final project for this course. You can find the complete data description on the [competition web page](https://www.kaggle.com/c/competitive-data-science-final-project/data). To join the competition, use [this link](https://www.kaggle.com/t/1ea93815dca248e99221df42ebde3540).

## Grading

We will create a grader instance below and use it to collect your answers. When the `submit_tag` function is called, the grader stores your answer *locally*. The answers are *not* submitted to the platform immediately, so you can call `submit_tag` as many times as you need.

When you are ready to push your answers to the platform, fill in your credentials and run the `submit` function in the <a href="#Authorization-&-Submission">last section</a> of the assignment.

In [4]:
grader = Grader()

# Task

Let's start with a simple task. 

<ol start="0">
  <li><b>Print the shape of the loaded dataframes and use the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html"><code>df.head</code></a> function to print several rows. Examine the features you are given.</b></li>
</ol>

In [8]:
print(transactions.shape)
print(items.shape)
print(item_categories.shape)
print(shops.shape)
transactions.head()

(2935849, 6)
(22170, 3)
(84, 2)
(60, 2)


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [69]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


Now use your `pandas` skills to get answers for the following questions. 
The first question is:

1. **What was the maximum total revenue among all the shops in September, 2014?**


* Hereinafter *revenue* refers to total sales minus value of goods returned.

*Hints:*

* Sometimes items are returned; find such examples in the dataset.
* It is handy to split the `date` field into [`day`, `month`, `year`] components and use `df.year == 2014` and `df.month == 9` (with integer components) to select the target subset of dates.
* You may work with the `date` feature as strings, or first convert it to a datetime type with the `pd.to_datetime` function; do not forget to set the correct `format` argument.
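The hints above can be sketched on a tiny made-up frame. The rows below are illustrative and not taken from the dataset; only the column names mirror `sales_train.csv.gz`.

```python
import pandas as pd

# Toy data (hypothetical rows, same column names as the transactions table)
toy = pd.DataFrame({
    "date": ["02.09.2014", "15.09.2014", "16.09.2014", "03.10.2014"],
    "shop_id": [1, 1, 2, 1],
    "item_price": [100.0, 100.0, 50.0, 70.0],
    "item_cnt_day": [2.0, -1.0, 3.0, 1.0],
})

# The day comes first in the source strings, so the format is given explicitly
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y")
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month

# Returns show up as rows with a negative daily count
returns = toy[toy["item_cnt_day"] < 0]

# Revenue = price * count, so returns subtract from the total automatically
toy["revenue"] = toy["item_price"] * toy["item_cnt_day"]
september = toy[(toy["year"] == 2014) & (toy["month"] == 9)]
revenue_by_shop = september.groupby("shop_id")["revenue"].sum()
print(revenue_by_shop)
```

Without an explicit `format`, a string such as `02.09.2014` could be read month-first, which would silently move rows between months.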

In [62]:
transactions["dt"] = pd.to_datetime(transactions.date, format='%d.%m.%Y')


In [67]:
transactions["month"] = transactions["dt"].dt.month
transactions["year"] = transactions["dt"].dt.year

transactions.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,revenue,dt,month,year
0,02.01.2013,0,59,22154,999.0,1.0,999.0,2013-01-02,1,2013
1,03.01.2013,0,25,2552,899.0,1.0,899.0,2013-01-03,1,2013
2,05.01.2013,0,25,2552,899.0,-1.0,-899.0,2013-01-05,1,2013
3,06.01.2013,0,25,2554,1709.05,1.0,1709.05,2013-01-06,1,2013
4,15.01.2013,0,25,2555,1099.0,1.0,1099.0,2013-01-15,1,2013


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,revenue,dt,month,year,item_name,item_category_id
0,02.01.2013,0,59,22154,999.00,1.0,999.00,2013-01-02,1,2013,ЯВЛЕНИЕ 2012 (BD),37
1,23.01.2013,0,24,22154,999.00,1.0,999.00,2013-01-23,1,2013,ЯВЛЕНИЕ 2012 (BD),37
2,20.01.2013,0,27,22154,999.00,1.0,999.00,2013-01-20,1,2013,ЯВЛЕНИЕ 2012 (BD),37
3,02.01.2013,0,25,22154,999.00,1.0,999.00,2013-01-02,1,2013,ЯВЛЕНИЕ 2012 (BD),37
4,03.01.2013,0,25,22154,999.00,1.0,999.00,2013-01-03,1,2013,ЯВЛЕНИЕ 2012 (BD),37
5,20.01.2013,0,25,22154,999.00,1.0,999.00,2013-01-20,1,2013,ЯВЛЕНИЕ 2012 (BD),37
6,23.01.2013,0,25,22154,999.00,1.0,999.00,2013-01-23,1,2013,ЯВЛЕНИЕ 2012 (BD),37
7,26.01.2013,0,25,22154,999.00,1.0,999.00,2013-01-26,1,2013,ЯВЛЕНИЕ 2012 (BD),37
8,27.01.2013,0,6,22154,999.00,1.0,999.00,2013-01-27,1,2013,ЯВЛЕНИЕ 2012 (BD),37
9,10.01.2013,0,15,22154,999.00,1.0,999.00,2013-01-10,1,2013,ЯВЛЕНИЕ 2012 (BD),37


In [73]:
df = transactions
df["revenue"] = df.item_price * df.item_cnt_day
df = df[(df.year == 2014) & (df.month == 9)]
transformed = (df[["shop_id", "revenue"]].groupby("shop_id").sum().sort_values("revenue", ascending=False))
transformed.head()


Unnamed: 0_level_0,revenue
shop_id,Unnamed: 1_level_1
31,7982852.0
25,6783338.0
12,6378335.0
28,4985847.0
27,4899292.0


In [74]:
max_revenue = transformed["revenue"].max()  # largest per-shop total, no hard-coded shop_id
max_revenue

7982852.1999999564

In [75]:
# YOUR CODE GOES HERE

grader.submit_tag('max_revenue', max_revenue)

Current answer for task max_revenue is: 7982852.2


Great! Let's move on and answer another question:

<ol start="2">
  <li><b>What item category generated the highest revenue in summer 2014?</b></li>
</ol>

* Submit `id` of the category found.
    
* Here we call "summer" the period from June to August.

*Hints:*

* Note that for an object `x` of type `pd.Series`, `x.idxmax()` returns the **index label** of the maximum element, while in recent pandas versions `x.argmax()` returns its integer position. A `pd.Series` can have a non-trivial index (not `[1, 2, 3, ... ]`).
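In recent pandas versions, `idxmax()` returns the index label and `argmax()` returns the integer position, so the two differ whenever the index is not `0, 1, 2, ...`. A small sketch on made-up values (the category ids and revenues are hypothetical):

```python
import pandas as pd

# Hypothetical revenue per category, indexed by category id
revenue = pd.Series([10.0, 30.0, 20.0], index=[40, 12, 55])

best_label = revenue.idxmax()      # index label of the maximum, i.e. a category id
best_position = revenue.argmax()   # integer position of the maximum (recent pandas)
print(best_label, best_position)
```

For this task the label is what matters, since the answer is a category `id` rather than a row position.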

In [79]:
df = pd.merge(transactions, items, on="item_id")
df = df[(df.month >= 6) & (df.month <= 8) & (df.year == 2014)]

df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,revenue,dt,month,year,item_name,item_category_id
66,28.06.2014,17,28,2552,949.0,1.0,949.0,2014-06-28,6,2014,DEEP PURPLE The House Of Blue Light LP,58
120,14.06.2014,17,28,2555,1149.0,1.0,1149.0,2014-06-14,6,2014,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56
121,13.06.2014,17,54,2555,1149.0,1.0,1149.0,2014-06-13,6,2014,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56
122,02.07.2014,18,54,2555,1149.0,1.0,1149.0,2014-07-02,7,2014,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56
123,30.08.2014,19,54,2555,1149.0,1.0,1149.0,2014-08-30,8,2014,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56


In [80]:
df[["item_category_id", "revenue"]].groupby("item_category_id").sum().sort_values("revenue", ascending=False)

Unnamed: 0_level_0,revenue
item_category_id,Unnamed: 1_level_1
20,32157302.43
12,31385229.70
19,26237112.15
23,19896624.03
30,15876623.34
40,12375973.07
55,9468644.35
28,8868913.27
37,7108188.56
3,6854669.80


In [81]:
# YOUR CODE GOES HERE



category_id_with_max_revenue = 20
grader.submit_tag('category_id_with_max_revenue', category_id_with_max_revenue)

Current answer for task category_id_with_max_revenue is: 20


<ol start="3">
  <li><b>How many items are there, such that their price stays constant (to the best of our knowledge) during the whole period of time?</b></li>
</ol>

* Let's assume that items are returned at the same price at which they were sold.
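One possible reading of "constant price" is that an item has exactly one distinct `item_price` across all of its transactions. A minimal sketch under that assumption, on made-up rows rather than the assignment data:

```python
import pandas as pd

# Hypothetical transactions: item 1 and item 3 never change price, item 2 does
toy = pd.DataFrame({
    "item_id":    [1, 1, 2, 2, 3],
    "item_price": [10.0, 10.0, 5.0, 6.0, 7.0],
})

# Number of distinct prices ever recorded for each item
n_prices = toy.groupby("item_id")["item_price"].nunique()

# Items whose price never changed have exactly one distinct price
num_constant = int((n_prices == 1).sum())
print(num_constant)
```

Because returns are assumed to carry the original sale price, they do not add a new distinct price and need no special handling here.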

In [86]:
transactions[["item_cnt_day"]].sum()

item_cnt_day    3648206.0
dtype: float64

In [97]:
# YOUR CODE GOES HERE

num_items_constant_price = 5926 # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('num_items_constant_price', num_items_constant_price)

Current answer for task num_items_constant_price is: 5926


Remember, the data can sometimes be noisy.

<ol start="4">
  <li><b>What was the variance of the number of sold items per day sequence for the shop with <code>shop_id = 25</code> in December, 2014? Do not count items that were sold but later returned.</b></li>
</ol>

* Fill `total_num_items_sold` and `days` arrays, and plot the sequence with the code below.
* Then compute the variance. Remember, there can be differences in how you normalize variance (biased or unbiased estimate, see [link](https://math.stackexchange.com/questions/496627/the-difference-between-unbiased-biased-estimator-variance)). Compute the ***unbiased*** estimate (use the right value for the `ddof` argument in `pd.Series.var` or `np.var`).
* If there were no sales on a given day, ***do not*** impute the missing value with zero; just ignore that day.
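The `ddof` remark can be checked on a few made-up numbers: `np.var` divides by `n` by default (`ddof=0`), while `pd.Series.var` divides by `n - 1` by default (`ddof=1`). The values below are hypothetical per-day totals, not results from the dataset.

```python
import numpy as np
import pandas as pd

daily_counts = pd.Series([2.0, 4.0, 4.0, 6.0])  # hypothetical per-day totals

biased = np.var(daily_counts.values)               # NumPy default: ddof=0
unbiased_np = np.var(daily_counts.values, ddof=1)  # unbiased estimate
unbiased_pd = daily_counts.var()                   # pandas default: ddof=1
print(biased, unbiased_np, unbiased_pd)
```

Passing `ddof` explicitly avoids depending on which library's default happens to apply.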

In [99]:
df = transactions
df = df[(df.shop_id == 25) & (df.month == 12) & (df.year == 2014)]

df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,revenue,dt,month,year
2295837,14.12.2014,23,25,21752,399.0,1.0,399.0,2014-12-14,12,2014
2295838,13.12.2014,23,25,21752,399.0,3.0,1197.0,2014-12-13,12,2014
2295839,26.12.2014,23,25,21733,149.0,1.0,149.0,2014-12-26,12,2014
2295840,31.12.2014,23,25,21732,149.0,1.0,149.0,2014-12-31,12,2014
2295841,30.12.2014,23,25,21726,149.0,1.0,149.0,2014-12-30,12,2014


In [None]:
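shop_id = 25

# df already holds the rows for shop 25 in December 2014 (see the cell above).
# Sum item_cnt_day per day; returns are negative counts, so they net out
# (assumed reading of "do not count items that were returned").
# Days without sales never appear in the groupby result, so nothing is imputed.
daily = df.groupby("dt")["item_cnt_day"].sum().sort_index()
total_num_items_sold = daily.values
days = daily.index

# Plot it
plt.plot(days, total_num_items_sold)
plt.ylabel('Num items')
plt.xlabel('Day')
plt.title("Daily number of items sold for shop_id = 25")
plt.show()

total_num_items_sold_var = daily.var(ddof=1)  # unbiased estimate
grader.submit_tag('total_num_items_sold_var', total_num_items_sold_var)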

## Authorization & Submission
To submit the assignment to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on the programming assignment page. *Note:* The token expires 30 minutes after generation.

In [92]:
STUDENT_EMAIL = 'groddenator@gmail.com'
STUDENT_TOKEN = 'BptR0ysGd6noRWU6'
grader.status()

You want to submit these numbers:
Task max_revenue: 7982852.2
Task category_id_with_max_revenue: 20
Task num_items_constant_price: 44586
Task total_num_items_sold_var: ----------


In [98]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!


Well done! :)