# Dataframes

Dataframes are mutable, two-dimensional data structures with labeled axes, where:
* each row represents a different observation
* each column represents a different variable

In Python, to define a dataframe, we first need to import the pandas module.

In [1]:
import pandas as pd

Next, if we want a dataframe with 5 rows and 2 columns, we can build it from a [dictionary](https://www.w3schools.com/python/python_dictionaries.asp), a [list](https://www.w3schools.com/python/python_lists.asp) of lists, a list of dictionaries, etc.

We are going to create a 5-row, 2-column dataframe from a dictionary.

To do this, we first create a dictionary where the keys will be the names of the columns and the values will be lists, with as many elements as the number of rows we want.

Finally, we convert that dictionary to a dataframe with the pandas `DataFrame()` constructor:

In [2]:
# Example dataframe
data = {
    "x":[1, 2, 3, 4, 5], 
    "y":[6, 7, 8, 9, 10]
}

In [3]:
data['x']

[1, 2, 3, 4, 5]

In [4]:
df = pd.DataFrame(data)
print(df)
# df

   x   y
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10


As we said, we have created a dataframe with 5 rows and two columns, called x and y respectively.

**Observation**: As a result of `print()`, we have not only obtained the 5 rows and 2 columns, but there is an additional column of 5 numbers ordered vertically from 0 to 4. It is simply the name of each row, which by default is the index of each row. The 0 indicates the first row; the 1, the second; and so on.
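
The text above mentions that a dataframe can also be built from a list of lists or a list of dictionaries. As a minimal sketch (the variable names `rows`, `records` and the `df_from_*` frames are made up for this illustration), the same 5-row, 2-column dataframe could be created like this:

```python
import pandas as pd

# Same data as above, written row by row: each inner list is one observation.
rows = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]
df_from_lists = pd.DataFrame(rows, columns=["x", "y"])

# A list of dictionaries: each dictionary is one row, keyed by column name.
records = [{"x": x, "y": y} for x, y in rows]
df_from_records = pd.DataFrame(records)

# Both approaches should give the same frame as the dictionary of lists.
df_from_dict = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 10]})
print(df_from_lists.equals(df_from_dict))
print(df_from_records.equals(df_from_dict))
```

With a list of lists the column names have to be passed explicitly, because the inner lists carry no names of their own.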

## Import a csv file to dataframe

In [5]:
# use contextual help to show all the parameters inside read_csv
df = pd.read_csv('../data/eniac/orderlines.csv', sep=',')
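
To see what the `sep` parameter does without needing the file on disk, here is a small sketch that reads made-up, semicolon-separated text from memory. The content of `raw` and the name `toy` are assumptions for illustration only, not part of the dataset:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk.
raw = "id;sku;unit_price\n1;AAA0001;18.99\n2;BBB0002;399.00\n"

# sep tells pandas which character separates the fields (the default is ",").
toy = pd.read_csv(io.StringIO(raw), sep=";")
print(toy)
print(toy.dtypes)
```

`read_csv` accepts a path or any file-like object, which is why `io.StringIO` works here.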

## Dataframe dimensions

With the `.shape` attribute we can get the dimensions (number of rows and columns) of the dataframe. Unlike a [method](https://www.w3schools.com/python/gloss_python_object_methods.asp), an attribute is accessed without parentheses.

In [6]:
df.shape

(293983, 7)

As a result we obtain a [tuple](https://www.w3schools.com/python/python_tuples.asp) where the first element is the number of rows, which in our case is 293983, and the second element is the number of columns, which in our case is 7.

In [7]:
nrows = df.shape[0]
ncols = df.shape[1]
print("The number of rows is", nrows)
print("The number of columns is", ncols)

The number of rows is 293983
The number of columns is 7


With the `.size` attribute we get the total number of values in the dataframe (number of rows times number of columns).

In [8]:
df.size

2057881

In [9]:
# check if that's true
df.shape[0] * df.shape[1] == df.size

True

With the `.ndim` attribute we get the number of dimensions of the dataframe. This will always be 2, as a dataframe consists of rows and columns.

In [10]:
df.ndim

2
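
The three attributes above can be tried on a small dataframe as well. This is only a sketch on made-up data (the name `toy` is an assumption); it also shows that a single column, being a Series, has one dimension:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 10]})

n_rows, n_cols = toy.shape      # tuple unpacking instead of indexing
print(n_rows, n_cols)
print(len(toy))                 # len() of a dataframe is its number of rows
print(toy.size == n_rows * n_cols)
print(toy["x"].ndim, toy.ndim)  # a single column is a 1-dimensional Series
```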

## Dataframes exploration

The `.head()` method displays the first rows of the dataframe. By default, the first 5 are shown.

In [11]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [12]:
df.head(9)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.00,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38
5,1119114,295310,0,10,WDT0249,231.79,2017-01-01 01:14:27
6,1119115,299544,0,1,APP1582,1.137.99,2017-01-01 01:17:21
7,1119116,299545,0,1,OWC0100,47.49,2017-01-01 01:46:16
8,1119119,299546,0,1,IOT0014,18.99,2017-01-01 01:50:34


In [13]:
df.tail()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
293978,1650199,527398,0,1,JBL0122,42.99,2018-03-14 13:57:25
293979,1650200,527399,0,1,PAC0653,141.58,2018-03-14 13:57:34
293980,1650201,527400,0,2,APP0698,9.99,2018-03-14 13:57:41
293981,1650202,527388,0,1,BEZ0204,19.99,2018-03-14 13:58:01
293982,1650203,527401,0,1,APP0927,13.99,2018-03-14 13:58:36


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


In [15]:
df.describe()

Unnamed: 0,id,id_order,product_id,product_quantity
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


In [16]:
df.nunique()

id                  293983
id_order            204855
product_id               1
product_quantity        67
sku                   7951
unit_price           11329
date                251631
dtype: int64

In [17]:
df['sku'].unique().tolist()[:10]

['OTT0133',
 'LGE0043',
 'PAR0071',
 'WDT0315',
 'JBL0104',
 'WDT0249',
 'APP1582',
 'OWC0100',
 'IOT0014',
 'APP0700']

In [18]:
df.isna().sum()

id                  0
id_order            0
product_id          0
product_quantity    0
sku                 0
unit_price          0
date                0
dtype: int64

In [19]:
df.duplicated().sum() # counts repeated rows; pass keep=False to flag every copy
# df.drop_duplicates()

0
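
The `keep` parameter mentioned in the cell above changes which rows count as duplicates. A minimal sketch on made-up data (the frame `toy` and the result names are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"sku": ["A", "B", "A", "C"], "qty": [1, 2, 1, 3]})

# Default keep="first": only the later repeats are flagged.
first_only = toy.duplicated().sum()

# keep=False: every row that has a twin is flagged, including the first one.
all_copies = toy.duplicated(keep=False).sum()

# drop_duplicates() returns a new frame; toy itself is left unchanged.
deduped = toy.drop_duplicates()
print(first_only, all_copies, len(deduped), len(toy))
```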

In [20]:
df.nsmallest(5, 'product_quantity')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [21]:
df.nlargest(5, 'product_quantity')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53


## Columns

Given a dataframe, we can select a particular column in several ways:

* Indicating the name of the column between square brackets, `[]`
* With the `.columns` attribute, indexed by position
* With the `.loc[]` indexer (by name or label)
* With the `.iloc[]` indexer (by position)
* With the `.filter()` method

Select one column

In [22]:
# select the column by name
df['id_order']

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64

In [23]:
# select the column id_order by its position, using the .columns attribute
print(df[df.columns[1]])

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64


In [24]:
# method .loc[]
print(df.loc[:, 'id_order'])

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64


In [25]:
# method .iloc[]
print(df.iloc[:, 0])

0         1119109
1         1119110
2         1119111
3         1119112
4         1119113
           ...   
293978    1650199
293979    1650200
293980    1650201
293981    1650202
293982    1650203
Name: id, Length: 293983, dtype: int64


In [26]:
# method .filter()
df.filter(items=['id_order'])

Unnamed: 0,id_order
0,299539
1,299540
2,299541
3,299542
4,299543
...,...
293978,527398
293979,527399
293980,527400
293981,527388


Select multiple columns

If we wanted to select more than one column, we could do it with all the options listed above, with slight modifications in some cases:

In [27]:
# with a list
df[['id_order','sku']]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [28]:
# .columns with a list of positions
df[df.columns[[1,4]]]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [29]:
df[df.columns[0:3]]

Unnamed: 0,id,id_order,product_id
0,1119109,299539,0
1,1119110,299540,0
2,1119111,299541,0
3,1119112,299542,0
4,1119113,299543,0
...,...,...,...
293978,1650199,527398,0
293979,1650200,527399,0
293980,1650201,527400,0
293981,1650202,527388,0


In [30]:
# .loc[] with a list of names
df.loc[:, ["id_order", "sku"]]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [31]:
# .loc[] with a slice of names (the end label is included)
df.loc[:, "id_order":"sku"]

Unnamed: 0,id_order,product_id,product_quantity,sku
0,299539,0,1,OTT0133
1,299540,0,1,LGE0043
2,299541,0,1,PAR0071
3,299542,0,1,WDT0315
4,299543,0,1,JBL0104
...,...,...,...,...
293978,527398,0,1,JBL0122
293979,527399,0,1,PAC0653
293980,527400,0,2,APP0698
293981,527388,0,1,BEZ0204


In [32]:
# .iloc[] with a list of positions
df.iloc[:, [0, 1]]

Unnamed: 0,id,id_order
0,1119109,299539
1,1119110,299540
2,1119111,299541
3,1119112,299542
4,1119113,299543
...,...,...
293978,1650199,527398
293979,1650200,527399
293980,1650201,527400
293981,1650202,527388


In [33]:
df.iloc[:, 0:2]

Unnamed: 0,id,id_order
0,1119109,299539
1,1119110,299540
2,1119111,299541
3,1119112,299542
4,1119113,299543
...,...,...
293978,1650199,527398
293979,1650200,527399
293980,1650201,527400
293981,1650202,527388


## Rows

Given a dataframe, we can select a particular row in several ways:

* With the `.loc[]` indexer (by name or label)
* With the `.iloc[]` indexer (by position)

In [34]:
# set the id column as the row index

In [35]:
df.set_index('id', inplace=True)
# inplace=True is equivalent to df = df.set_index('id')

In [36]:
df.head(2)

id,id_order,product_id,product_quantity,sku,unit_price,date
1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45


In [37]:
# select the observation whose index label (id) is 1119110 with .loc[]
df.loc[1119110]

id_order                         299540
product_id                            0
product_quantity                      1
sku                             LGE0043
unit_price                       399.00
date                2017-01-01 00:19:45
Name: 1119110, dtype: object

In [38]:
# select the last observation with the method .iloc[]
df.iloc[-1]
# df.tail(1)

id_order                         527401
product_id                            0
product_quantity                      1
sku                             APP0927
unit_price                        13.99
date                2018-03-14 13:58:36
Name: 1650203, dtype: object

In [39]:
df.head()

id,id_order,product_id,product_quantity,sku,unit_price,date
1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [40]:
df.loc[[1119111,1119112,1119113]]

id,id_order,product_id,product_quantity,sku,unit_price,date
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [41]:
df.loc[1119111:1119113]

id,id_order,product_id,product_quantity,sku,unit_price,date
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [42]:
df.reset_index(inplace=True)
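
The difference between selecting rows by label and by position can be condensed into a small sketch. The frame `toy` below is made up, with index labels chosen to resemble the `id` values used above:

```python
import pandas as pd

toy = pd.DataFrame(
    {"sku": ["A", "B", "C", "D"]},
    index=[1119109, 1119110, 1119111, 1119112],  # labels, as after set_index('id')
)

print(toy.loc[1119110, "sku"])   # looked up by label
print(toy.iloc[1, 0])            # looked up by position: the same cell

# A label slice includes both endpoints; a position slice stops before the end.
print(len(toy.loc[1119110:1119112]))
print(len(toy.iloc[1:3]))
```

This is why `df.loc[1119111:1119113]` above returned three rows, including the one labeled `1119113`.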

## Drop and Filter data

In [43]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [44]:
df.columns

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

The `.drop()` method allows us to delete the rows or columns that we indicate.

**Attention!** By default, `.drop()` returns a new dataframe and leaves the original unchanged. Again, if we want to apply the changes directly to the original dataframe, we need to indicate `inplace=True`.

In [45]:
df.drop(['unit_price'], axis=1)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,date
0,1119109,299539,0,1,OTT0133,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,2017-01-01 01:06:38
...,...,...,...,...,...,...
293978,1650199,527398,0,1,JBL0122,2018-03-14 13:57:25
293979,1650200,527399,0,1,PAC0653,2018-03-14 13:57:34
293980,1650201,527400,0,2,APP0698,2018-03-14 13:57:41
293981,1650202,527388,0,1,BEZ0204,2018-03-14 13:58:01
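
As a sketch of how `.drop()` behaves for both columns and rows, here is a made-up three-row frame (the names `toy`, `without_price` and `without_first` are assumptions for illustration). The `columns=` and `index=` keywords are an alternative to passing `axis`:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "sku": ["A", "B", "C"], "unit_price": [1.0, 2.0, 3.0]})

without_price = toy.drop(columns=["unit_price"])  # same as drop(["unit_price"], axis=1)
without_first = toy.drop(index=[0])               # rows are dropped by index label

print(list(without_price.columns))
print(list(without_first.index))
print(list(toy.columns))  # the original still has all three columns
```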


How to filter information on a dataframe.

In [46]:
# products sold in quantities larger than 100
df[df['product_quantity'] > 100]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
27779,1180010,323959,0,126,ADN0039,34.99,2017-02-14 10:21:12
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
136675,1388261,417536,0,200,TRK0009,29.99,2017-10-25 15:02:39
204637,1500715,464858,0,192,APP1662,519.0,2017-12-17 15:53:04
246048,1574262,496172,0,164,EVU0013,19.99,2018-01-22 16:14:42
285492,1637611,522075,0,125,XDO0047,25.99,2018-03-06 10:07:54


The `.query()` method can be useful for this purpose. It takes the filter condition as a string; column names that contain blank spaces have to be wrapped in backticks inside that string. You can use any **Python Comparison Operators** you want inside the query method (find more information on this [link](https://www.w3schools.com/python/python_operators.asp)).

In [47]:
df.query('product_quantity > 100')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
27779,1180010,323959,0,126,ADN0039,34.99,2017-02-14 10:21:12
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
136675,1388261,417536,0,200,TRK0009,29.99,2017-10-25 15:02:39
204637,1500715,464858,0,192,APP1662,519.0,2017-12-17 15:53:04
246048,1574262,496172,0,164,EVU0013,19.99,2018-01-22 16:14:42
285492,1637611,522075,0,125,XDO0047,25.99,2018-03-06 10:07:54
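
Two features of `.query()` that may be handy are backticks, for column names containing spaces, and the `@` prefix, for referring to Python variables. The sketch below uses a made-up frame whose column is deliberately named with a space; `toy` and `threshold` are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"product quantity": [1, 150, 999], "sku": ["A", "B", "C"]})

threshold = 100  # hypothetical cut-off, referenced inside the query with @

# Backticks let query() refer to a column whose name contains a space.
big = toy.query("`product quantity` > @threshold")
print(big)

# The same filter written as a boolean mask.
same = toy[toy["product quantity"] > threshold]
print(big.equals(same))
```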


Another way to filter information is to look for exact matches. You can do that with the `isin()` method:

In [48]:
# find out rows in a column that match the elements in a list
df[df['sku'].isin(['ADN0039','THU0029','APP1190'])]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
1452,1122094,300886,0,1,APP1190,62.99,2017-01-02 19:15:23
1745,1122690,301173,0,1,APP1190,58.99,2017-01-02 23:46:35
1904,1123012,301322,0,1,APP1190,58.99,2017-01-03 09:34:02
2097,1123386,301504,0,1,APP1190,58.99,2017-01-03 12:14:35
2282,1123789,301694,0,2,APP1190,58.99,2017-01-03 14:58:09
...,...,...,...,...,...,...,...
293100,1648600,526589,0,1,APP1190,56.00,2018-03-13 18:45:25
293106,1648610,526591,0,1,APP1190,56.00,2018-03-13 18:50:43
293150,1648679,526624,0,1,APP1190,56.00,2018-03-13 19:53:32
293154,1648690,526631,0,1,APP1190,56.00,2018-03-13 20:03:52
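
Filters can also be combined. As a minimal sketch on made-up order lines (the frame `toy` and the list `wanted` are assumptions for illustration), `.isin()` can be joined with another condition or negated:

```python
import pandas as pd

toy = pd.DataFrame({
    "sku": ["APP1190", "THU0029", "JBL0104", "APP1190"],
    "product_quantity": [1, 201, 1, 999],
})

wanted = ["APP1190", "THU0029"]

# Each comparison needs its own parentheses; & means "and", | means "or".
both = toy[toy["sku"].isin(wanted) & (toy["product_quantity"] > 100)]

# ~ negates a mask: rows whose sku is NOT in the list.
others = toy[~toy["sku"].isin(wanted)]
print(both)
print(others)
```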


## `.copy()` method

If you want to create a new dataframe out of a chunk of the original dataframe, it is quite common to run into this problem:

In [56]:
sample = df.iloc[:3,:]
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [57]:
sample.iloc[0,4]

'OTT0133'

In [58]:
sample.iloc[0,4] = 'NEW001'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [59]:
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [60]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


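As you can see, we modified the object `sample`, but the dataframe `df` has also been modified! (This behaviour depends on the pandas version; with copy-on-write enabled, `df` would not change.) We can avoid the problem by using the `.copy()` method: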

In [61]:
df = pd.read_csv('../data/eniac/orderlines.csv')
sample = df.iloc[:3,:].copy()
sample.iloc[0,4] = 'NEW001'
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [62]:
df.head(3)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


As you can see, now it has not been modified.
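
The same pattern can be checked on a tiny, made-up dataframe that does not require the CSV file (the name `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"sku": ["OTT0133", "LGE0043", "PAR0071"], "qty": [1, 1, 1]})

sample = toy.iloc[:2, :].copy()   # an independent copy of the first two rows
sample.iloc[0, 0] = "NEW001"      # change the copy only

print(sample["sku"].tolist())
print(toy["sku"].tolist())        # the original keeps its value
```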

# CHALLENGES

<strong>1. The product with the `sku` JBL0104 has been sold at different prices at different points in time. How many different prices has it had? </strong>

Tip: combine any pandas filtering method with the method `.nunique()`.

In [3]:
import pandas as pd

df = pd.read_csv("eniac-data/orderlines.csv")
print(df)

             id  id_order  product_id  product_quantity      sku unit_price  \
0       1119109    299539           0                 1  OTT0133      18.99   
1       1119110    299540           0                 1  LGE0043     399.00   
2       1119111    299541           0                 1  PAR0071     474.05   
3       1119112    299542           0                 1  WDT0315      68.39   
4       1119113    299543           0                 1  JBL0104      23.74   
...         ...       ...         ...               ...      ...        ...   
293978  1650199    527398           0                 1  JBL0122      42.99   
293979  1650200    527399           0                 1  PAC0653     141.58   
293980  1650201    527400           0                 2  APP0698       9.99   
293981  1650202    527388           0                 1  BEZ0204      19.99   
293982  1650203    527401           0                 1  APP0927      13.99   

                       date  
0       2017-01-01 00

In [5]:
df[['sku','unit_price']]

Unnamed: 0,sku,unit_price
0,OTT0133,18.99
1,LGE0043,399.00
2,PAR0071,474.05
3,WDT0315,68.39
4,JBL0104,23.74
...,...,...
293978,JBL0122,42.99
293979,PAC0653,141.58
293980,APP0698,9.99
293981,BEZ0204,19.99


In [12]:
df[df['sku'].isin(['JBL0104'])].unit_price.unique()

array(['23.74', '26.99', '24.99', '22.31', '22.99', '23.99', '27.99'],
      dtype=object)
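
The tip suggests `.nunique()`, which returns the count directly instead of the array of values. A sketch on made-up order lines (the frame `toy` is an assumption; the prices are stored as text, as in the dataset):

```python
import pandas as pd

# Made-up order lines; prices are stored as text, as in the dataset above.
toy = pd.DataFrame({
    "sku": ["JBL0104", "JBL0104", "JBL0104", "OTT0133"],
    "unit_price": ["23.74", "26.99", "23.74", "18.99"],
})

prices = toy.loc[toy["sku"] == "JBL0104", "unit_price"]
print(prices.nunique())        # number of distinct prices
print(len(prices.unique()))    # same number, via the array of distinct values
```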

<strong>2. List all the different items that have been sold in the order with an `id_order` of `385921`. </strong>

In [13]:
df[['sku','id_order']]

Unnamed: 0,sku,id_order
0,OTT0133,299539
1,LGE0043,299540
2,PAR0071,299541
3,WDT0315,299542
4,JBL0104,299543
...,...,...
293978,JBL0122,527398
293979,PAC0653,527399
293980,APP0698,527400
293981,BEZ0204,527388


In [19]:
df[df['id_order'] == 385921].sku.unique()

array(['APP2431', 'APP2348', 'APP2131', 'APP1630', 'APP1735', 'APP1216',
       'APP2092', 'APP1215', 'ELA0017', 'MIN0010', 'ELA0039', 'BEA0046',
       'BOS0034', 'BEA0071', 'ELA0029', 'APP2161', 'HOC0008', 'NOM0026',
       'NOM0014'], dtype=object)

<strong>3. Find out in how many different orders the products with the following `sku` values have been sold: `APP2431` and `APP2348`.</strong>

In [28]:
df[['sku','id_order']]

Unnamed: 0,sku,id_order
0,OTT0133,299539
1,LGE0043,299540
2,PAR0071,299541
3,WDT0315,299542
4,JBL0104,299543
...,...,...
293978,JBL0122,527398
293979,PAC0653,527399
293980,APP0698,527400
293981,BEZ0204,527388


In [26]:
df[df['sku'].isin(['APP2431','APP2348'])].id_order.nunique()

179

In [27]:
df[df['sku'].isin(['APP2431','APP2348'])].id_order.unique()

array([356101, 367146, 365600, 367892, 369246, 371034, 371564, 372411,
       373592, 373789, 373839, 374838, 375429, 379543, 379659, 382460,
       385921, 388141, 388422, 387477, 390046, 390141, 390384, 390769,
       391747, 392932, 393691, 395879, 396153, 396166, 396391, 396490,
       397600, 398669, 401302, 401429, 402265, 403996, 404504, 405316,
       405858, 407096, 408281, 408565, 409016, 409958, 410438, 410718,
       410930, 410967, 413061, 414724, 414765, 415293, 415575, 416682,
       418987, 420863, 421893, 422026, 422448, 423183, 423707, 424577,
       427124, 427826, 427940, 428694, 430641, 430899, 431344, 431577,
       433231, 433507, 435170, 435482, 435855, 436979, 437539, 440415,
       436041, 441687, 442951, 443680, 444827, 444841, 445517, 448052,
       448962, 449955, 451157, 451169, 451232, 451329, 451385, 451475,
       451814, 452283, 452324, 453317, 453679, 453991, 454050, 454408,
       455322, 456206, 456405, 456479, 458134, 458284, 460281, 460630,
      

<strong>4. Create a new dataframe with all the rows that have a product quantity higher than 500. Call this new dataframe `df_50`, and include only the columns `id`, `id_order`, `product_quantity` and `sku`. Be sure to use the method `.copy()`. </strong>

Once the dataframe is created, rename the column `product_quantity` to `quantity`, and `sku` to `product_code`. To do so, you can use the `.rename()` method or the `.columns` attribute.


In [29]:
# keep only the requested columns, and copy so that df is not affected
df_50 = df.loc[df['product_quantity'] > 500, ['id', 'id_order', 'product_quantity', 'sku']].copy()
print(df_50)

            id  id_order  product_quantity      sku
53860  1228150    346221               999  APP1190
57306  1234111    349133               555  APP0665
57796  1234924    349475               800  KIN0137
68712  1254032    358747               999  SEV0028


In [34]:
# .rename() returns a new dataframe, so assign the result back
df_50 = df_50.rename(columns={'product_quantity':'quantity','sku':'product_code'})
df_50

Unnamed: 0,id,id_order,quantity,product_code
53860,1228150,346221,999,APP1190
57306,1234111,349133,555,APP0665
57796,1234924,349475,800,KIN0137
68712,1254032,358747,999,SEV0028
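
The challenge also mentions `.columns` as a way of renaming. A minimal sketch on a made-up one-row frame (the name `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"id": [1], "id_order": [10], "product_quantity": [999], "sku": ["APP1190"]})

# Assigning to .columns replaces every name at once, so the list must match
# the number and order of the existing columns.
toy.columns = ["id", "id_order", "quantity", "product_code"]
print(list(toy.columns))
```

Compared with `.rename()`, this modifies the dataframe directly and requires listing all the column names, including the ones that do not change.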


<strong>5. Select all the order lines (i.e. all the rows) where the product `XDO0047` has appeared. Sort the rows by product quantity in descending order using the pandas method [`.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html).</strong> Then look at the main descriptive information of these results with the method `.describe()`.

In [44]:
df[df['sku'].isin(['XDO0047'])].sort_values(by= ['product_quantity'], ascending=False)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
285492,1637611,522075,0,125,XDO0047,25.99,2018-03-06 10:07:54
203932,1499415,464239,0,2,XDO0047,25.99,2017-12-16 09:36:51
124018,1365515,406387,0,1,XDO0047,23.39,2017-09-29 17:09:32
238277,1561096,491065,0,1,XDO0047,25.99,2018-01-15 09:39:07
186320,1474709,454473,0,1,XDO0047,25.99,2017-12-03 17:05:14
197678,1489835,460182,0,1,XDO0047,25.99,2017-12-11 20:28:12
216416,1523205,475450,0,1,XDO0047,25.99,2017-12-28 20:44:19
217466,1525182,476341,0,1,XDO0047,22.09,2017-12-29 15:39:11
217957,1526191,476848,0,1,XDO0047,22.09,2017-12-29 21:32:50
246332,1574873,496429,0,1,XDO0047,25.99,2018-01-22 23:03:34


In [45]:
df[df['sku'].isin(['XDO0047'])].sort_values(by= ['product_quantity'], ascending=False).describe()

Unnamed: 0,id,id_order,product_id,product_quantity
count,36.0,36.0,36.0,36.0
mean,1485767.0,458640.277778,0.0,4.472222
std,87705.32,36891.598431,0.0,20.662576
min,1365515.0,406387.0,0.0,1.0
25%,1405830.0,425462.75,0.0,1.0
50%,1471428.0,453247.0,0.0,1.0
75%,1564540.0,492406.0,0.0,1.0
max,1640821.0,523533.0,0.0,125.0
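
When many rows share the same quantity, as above, a second sort key makes the order predictable. The sketch below uses made-up rows (the frame `toy` is an assumption); it sorts by two columns and applies `.describe()` to a single column:

```python
import pandas as pd

toy = pd.DataFrame({
    "sku": ["XDO0047"] * 4,
    "product_quantity": [1, 125, 2, 1],
    "date": ["2017-09-29", "2018-03-06", "2017-12-16", "2018-01-15"],
})

# Sort by quantity (largest first); ties are ordered by date (earliest first).
ordered = toy.sort_values(by=["product_quantity", "date"], ascending=[False, True])
print(ordered)

# describe() on one column returns a Series of summary statistics.
summary = ordered["product_quantity"].describe()
print(summary["max"], summary["count"])
```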
