# Dataframes

Dataframes are mutable, two-dimensional data structures with labeled axes, where:
* each row represents a different observation
* each column represents a different variable

In Python, to define a dataframe, we first need to import the pandas module.

In [1]:
import pandas as pd

Next, if we want a dataframe with 5 rows and 2 columns, we can do it from a [dictionary](https://www.w3schools.com/python/python_dictionaries.asp), a [list](https://www.w3schools.com/python/python_lists.asp) of lists, a list of dictionaries, etc.

We are going to create a 5-row, 2-column dataframe from a dictionary.

To do this, we first create a dictionary where the keys will be the names of the columns and the values will be lists, with as many elements as the number of rows we want.

Finally, we convert that dictionary to a dataframe with the pandas `DataFrame()` constructor:

In [2]:
# Example dataframe
data = {
    "x":[1, 2, 3, 4, 5], 
    "y":[6, 7, 8, 9, 10]
}

In [3]:
data['x']

[1, 2, 3, 4, 5]

In [4]:
df = pd.DataFrame(data)
print(df)
# df

   x   y
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10


As we said, we have created a dataframe with 5 rows and two columns, called x and y respectively.

**Observation**: As a result of `print()`, we have not only obtained the 5 rows and 2 columns, but there is an additional column of 5 numbers ordered vertically from 0 to 4. It is simply the name of each row, which by default is the index of each row. The 0 indicates the first row; the 1, the second; and so on.
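As mentioned above, a dictionary is not the only possible starting point. The following is a minimal sketch, reusing the same toy values, of how the same dataframe could be built from a list of lists and from a list of dictionaries. The variable names are only illustrative.

```python
import pandas as pd

# Reference: the dictionary-based dataframe from the example above.
df_from_dict = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 10]})

# List of lists: each inner list is one row, so the column names
# have to be passed separately.
rows = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]
df_from_lists = pd.DataFrame(rows, columns=["x", "y"])

# List of dictionaries: each dictionary is one row, keys become column names.
records = [{"x": i, "y": i + 5} for i in range(1, 6)]
df_from_records = pd.DataFrame(records)

print(df_from_lists.equals(df_from_dict))    # True
print(df_from_records.equals(df_from_dict))  # True
```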

## Import a CSV file into a dataframe

In [7]:
# use contextual help to show all the parameters inside read_csv
df = pd.read_csv('orderlines.csv', sep=',')
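To see what the `sep` parameter does without needing the real file, here is a small sketch that reads CSV text from memory. The contents of `csv_text` are made up for illustration and are not taken from `orderlines.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature "file"; a semicolon is used as the separator on purpose.
csv_text = "id;sku;unit_price\n1;AAA0001;18.99\n2;BBB0002;399.00\n"

# sep tells pandas which character separates the columns (the default is ',').
mini = pd.read_csv(io.StringIO(csv_text), sep=";")

print(mini.shape)           # (2, 3)
print(list(mini.columns))   # ['id', 'sku', 'unit_price']
```

With the default separator, the same text would be read as a single column, because no comma appears in it.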

## Dataframe dimensions

With the `.shape` attribute (it is not a [method](https://www.w3schools.com/python/gloss_python_object_methods.asp), so it takes no parentheses) we can get the dimensions (number of rows and columns) of the dataframe.

In [8]:
df.shape

(293983, 7)

As a result we obtain a [tuple](https://www.w3schools.com/python/python_tuples.asp) where the first element is the number of rows, which in our case is 293983, while the second element is the number of columns, which in our example was 7.

In [9]:
nrows = df.shape[0]
ncols = df.shape[1]
print("The number of rows is", nrows)
print("The number of columns is", ncols)

The number of rows is 293983
The number of columns is 7


With the `.size` attribute we get the total number of values in the dataframe (number of rows times number of columns).

In [10]:
df.size

2057881

In [11]:
# check if that's true
df.shape[0] * df.shape[1] == df.size

True

With the `.ndim` attribute we get the number of dimensions of the dataframe. This will always be 2, as it consists of rows and columns.

In [12]:
df.ndim

2
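As a quick recap, this sketch reads the three attributes on the small toy dataframe from the beginning of the notebook (rebuilt here so the block is self-contained). Note that none of them is called with parentheses.

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 10]})

# .shape is a tuple, so it can be unpacked directly.
n_rows, n_cols = small.shape
print(n_rows, n_cols)    # 5 2

print(small.size)        # 10 (5 rows times 2 columns)
print(small.ndim)        # 2
print(small["x"].ndim)   # 1 (a single column is a one-dimensional Series)
```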

## Dataframe exploration

The `.head()` method is used to display the first rows of the dataframe. By default, the first 5 rows are shown.

In [13]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [14]:
df.head(9)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.00,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38
5,1119114,295310,0,10,WDT0249,231.79,2017-01-01 01:14:27
6,1119115,299544,0,1,APP1582,1.137.99,2017-01-01 01:17:21
7,1119116,299545,0,1,OWC0100,47.49,2017-01-01 01:46:16
8,1119119,299546,0,1,IOT0014,18.99,2017-01-01 01:50:34


In [15]:
df.tail()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
293978,1650199,527398,0,1,JBL0122,42.99,2018-03-14 13:57:25
293979,1650200,527399,0,1,PAC0653,141.58,2018-03-14 13:57:34
293980,1650201,527400,0,2,APP0698,9.99,2018-03-14 13:57:41
293981,1650202,527388,0,1,BEZ0204,19.99,2018-03-14 13:58:01
293982,1650203,527401,0,1,APP0927,13.99,2018-03-14 13:58:36


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


In [17]:
df.describe()

Unnamed: 0,id,id_order,product_id,product_quantity
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


In [18]:
df.nunique()

id                  293983
id_order            204855
product_id               1
product_quantity        67
sku                   7951
unit_price           11329
date                251631
dtype: int64

In [19]:
df['sku'].unique().tolist()[:10]

['OTT0133',
 'LGE0043',
 'PAR0071',
 'WDT0315',
 'JBL0104',
 'WDT0249',
 'APP1582',
 'OWC0100',
 'IOT0014',
 'APP0700']

In [20]:
df.isna().sum()

id                  0
id_order            0
product_id          0
product_quantity    0
sku                 0
unit_price          0
date                0
dtype: int64

In [22]:
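df.duplicated().sum() # by default the first occurrence is not counted; keep=False marks every copy
# df.drop_duplicates() would return the dataframe without the repeated rows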

0
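Since the real dataset has no duplicated rows, the effect of the `keep` parameter is easier to see on a toy dataframe. The data below is invented for illustration.

```python
import pandas as pd

# Rows 0, 2 and 4 are identical.
toy = pd.DataFrame({"sku": ["A", "B", "A", "C", "A"],
                    "qty": [1, 2, 1, 3, 1]})

# Default (keep='first'): the first occurrence is NOT flagged as a duplicate.
print(toy.duplicated().sum())            # 2

# keep=False: every row that has a twin is flagged, including the first one.
print(toy.duplicated(keep=False).sum())  # 3

# drop_duplicates() returns a new dataframe with one copy of each row.
deduped = toy.drop_duplicates()
print(len(deduped))                      # 3
```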

In [23]:
df.nsmallest(5, 'product_quantity')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [24]:
df.nlargest(5, 'product_quantity')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53


## Columns

Given a dataframe, we can select a particular column in several ways:

* Indicating the name of the column between square brackets, `[]`
* With the `.columns` attribute
* With the `.loc[]` indexer (by name or label)
* With the `.iloc[]` indexer (by position)
* With the `.filter()` method

How to select 1 column

In [25]:
# select the column by name
df['id_order']

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64

In [26]:
# Select the column id_order (position 1) through the .columns attribute
print(df[df.columns[1]])

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64


In [27]:
# .loc[] indexer: all rows (:), column 'id_order'
print(df.loc[:,'id_order'])
# general form: df.loc[rows, columns]

0         299539
1         299540
2         299541
3         299542
4         299543
           ...  
293978    527398
293979    527399
293980    527400
293981    527388
293982    527401
Name: id_order, Length: 293983, dtype: int64


In [28]:
# .iloc[] indexer: all rows (:), first column (position 0)
print(df.iloc[:, 0])

0         1119109
1         1119110
2         1119111
3         1119112
4         1119113
           ...   
293978    1650199
293979    1650200
293980    1650201
293981    1650202
293982    1650203
Name: id, Length: 293983, dtype: int64


In [29]:
# method .filter()
df.filter(items=['id_order'])

Unnamed: 0,id_order
0,299539
1,299540
2,299541
3,299542
4,299543
...,...
293978,527398
293979,527399
293980,527400
293981,527388


Select multiple columns

If we wanted to select more than one column, we could do it with all the options listed above, with slight modifications in some cases:

In [30]:
# with a list
df[['id_order','sku']]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [31]:
# .columns with a list of positions
df[df.columns[[1,4]]]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [32]:
df[df.columns[0:3]]

Unnamed: 0,id,id_order,product_id
0,1119109,299539,0
1,1119110,299540,0
2,1119111,299541,0
3,1119112,299542,0
4,1119113,299543,0
...,...,...,...
293978,1650199,527398,0
293979,1650200,527399,0
293980,1650201,527400,0
293981,1650202,527388,0


In [33]:
# .loc[] with a list of column names
df.loc[:, ["id_order", "sku"]]

Unnamed: 0,id_order,sku
0,299539,OTT0133
1,299540,LGE0043
2,299541,PAR0071
3,299542,WDT0315
4,299543,JBL0104
...,...,...
293978,527398,JBL0122
293979,527399,PAC0653
293980,527400,APP0698
293981,527388,BEZ0204


In [34]:
# .loc[] with a slice of column names (the end label is included)
df.loc[:, "id_order":"sku"]

Unnamed: 0,id_order,product_id,product_quantity,sku
0,299539,0,1,OTT0133
1,299540,0,1,LGE0043
2,299541,0,1,PAR0071
3,299542,0,1,WDT0315
4,299543,0,1,JBL0104
...,...,...,...,...
293978,527398,0,1,JBL0122
293979,527399,0,1,PAC0653
293980,527400,0,2,APP0698
293981,527388,0,1,BEZ0204


In [35]:
# .iloc[] with a list of positions
df.iloc[:, [0, 1]]

Unnamed: 0,id,id_order
0,1119109,299539
1,1119110,299540
2,1119111,299541
3,1119112,299542
4,1119113,299543
...,...,...
293978,1650199,527398
293979,1650200,527399
293980,1650201,527400
293981,1650202,527388


In [36]:
df.iloc[:, 0:2]

Unnamed: 0,id,id_order
0,1119109,299539
1,1119110,299540
2,1119111,299541
3,1119112,299542
4,1119113,299543
...,...,...
293978,1650199,527398
293979,1650200,527399
293980,1650201,527400
293981,1650202,527388
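One detail in the outputs above is easy to miss: a label slice with `.loc[]` includes its end label, while a position slice with `.iloc[]` excludes its end position. The sketch below shows this on a toy dataframe with invented values, together with the difference between single and double square brackets.

```python
import pandas as pd

toy = pd.DataFrame({"id": [10, 11, 12],
                    "id_order": [100, 101, 102],
                    "sku": ["A", "B", "C"],
                    "unit_price": [1.5, 2.5, 3.5]})

# Label slice: both ends are included.
by_label = toy.loc[:, "id":"sku"]
print(list(by_label.columns))      # ['id', 'id_order', 'sku']

# Position slice: the end position (2) is excluded, as in plain Python.
by_position = toy.iloc[:, 0:2]
print(list(by_position.columns))   # ['id', 'id_order']

# Single brackets return a Series, double brackets return a DataFrame.
print(type(toy["sku"]).__name__)    # Series
print(type(toy[["sku"]]).__name__)  # DataFrame
```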


## Rows

Given a dataframe, we can select a particular row in several ways:

* With the `.loc[]` indexer (by name or label)
* With the `.iloc[]` indexer (by position)

In [37]:
# set the `id` column as the row index

In [38]:
df.set_index('id', inplace=True)
# inplace=True is equivalent to df = df.set_index('id')

In [39]:
df.head(2)

id,id_order,product_id,product_quantity,sku,unit_price,date
1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45


In [40]:
# select the observation whose id is 1119110 (the second row) with the .loc[] indexer
df.loc[1119110]

id_order                         299540
product_id                            0
product_quantity                      1
sku                             LGE0043
unit_price                       399.00
date                2017-01-01 00:19:45
Name: 1119110, dtype: object

In [41]:
# select the last observation with the method .iloc[]
df.iloc[-1]
# df.tail(1)

id_order                         527401
product_id                            0
product_quantity                      1
sku                             APP0927
unit_price                        13.99
date                2018-03-14 13:58:36
Name: 1650203, dtype: object

In [42]:
df.head()

id,id_order,product_id,product_quantity,sku,unit_price,date
1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [43]:
df.loc[[1119111,1119112,1119113]]

id,id_order,product_id,product_quantity,sku,unit_price,date
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [44]:
df.loc[1119111:1119113]

id,id_order,product_id,product_quantity,sku,unit_price,date
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [45]:
df.reset_index(inplace=True)
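The round trip used in this section (set an index, select rows by label or by position, reset the index) can be sketched on three rows copied from the outputs above. Here the result of `set_index()` is assigned to a new name instead of using `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1119109, 1119110, 1119111],
                    "sku": ["OTT0133", "LGE0043", "PAR0071"]})

# set_index returns a new dataframe; toy itself keeps its 0, 1, 2 index.
indexed = toy.set_index("id")

# .loc[] uses the index labels (here the id values)...
print(indexed.loc[1119110, "sku"])   # LGE0043
# ...while .iloc[] always uses positions, whatever the labels are.
print(indexed.iloc[1, 0])            # LGE0043

# reset_index moves the index back into a regular column.
restored = indexed.reset_index()
print(restored.equals(toy))          # True
```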

## Drop and Filter data

In [46]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [47]:
df.columns

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

The `.drop()` method allows us to delete the rows or columns that we indicate.

**Attention!** Again, `.drop()` returns a new dataframe by default. If we want to apply the changes directly to the original dataframe, we need to pass `inplace=True` (or reassign the result, as in `df = df.drop(...)`).

In [48]:
df.drop(['unit_price'], axis=1)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,date
0,1119109,299539,0,1,OTT0133,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,2017-01-01 01:06:38
...,...,...,...,...,...,...
293978,1650199,527398,0,1,JBL0122,2018-03-14 13:57:25
293979,1650200,527399,0,1,PAC0653,2018-03-14 13:57:34
293980,1650201,527400,0,2,APP0698,2018-03-14 13:57:41
293981,1650202,527388,0,1,BEZ0204,2018-03-14 13:58:01
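The following sketch, on invented data, shows that `.drop()` leaves the original dataframe untouched unless the result is reassigned, and that it can remove rows as well as columns. It uses the `columns=` and `index=` keywords, which are an alternative to `axis=1` and `axis=0`.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3],
                    "sku": ["A", "B", "C"],
                    "unit_price": [1.0, 2.0, 3.0]})

# Dropping a column returns a NEW dataframe.
without_price = toy.drop(columns=["unit_price"])
print(list(without_price.columns))   # ['id', 'sku']
print(list(toy.columns))             # ['id', 'sku', 'unit_price']

# Dropping a row works the same way, using the row label.
without_first_row = toy.drop(index=[0])
print(list(without_first_row.index)) # [1, 2]
```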


How to filter information on a dataframe.

In [49]:
# products sold in quantities larger than 100
df[df['product_quantity'] > 100]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
27779,1180010,323959,0,126,ADN0039,34.99,2017-02-14 10:21:12
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
136675,1388261,417536,0,200,TRK0009,29.99,2017-10-25 15:02:39
204637,1500715,464858,0,192,APP1662,519.0,2017-12-17 15:53:04
246048,1574262,496172,0,164,EVU0013,19.99,2018-01-22 16:14:42
285492,1637611,522075,0,125,XDO0047,25.99,2018-03-06 10:07:54


The `.query()` method can be useful for this purpose. It works most smoothly when the column names do not contain blank spaces; names with spaces have to be wrapped in backticks inside the query string. You can use any **Python Comparison Operators** you want inside the query method (find more information on this [link](https://www.w3schools.com/python/python_operators.asp)).

In [50]:
df.query('product_quantity > 100')

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
27779,1180010,323959,0,126,ADN0039,34.99,2017-02-14 10:21:12
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
136675,1388261,417536,0,200,TRK0009,29.99,2017-10-25 15:02:39
204637,1500715,464858,0,192,APP1662,519.0,2017-12-17 15:53:04
246048,1574262,496172,0,164,EVU0013,19.99,2018-01-22 16:14:42
285492,1637611,522075,0,125,XDO0047,25.99,2018-03-06 10:07:54
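Two features of `.query()` are worth a short sketch: backticks around a column name that contains a space, and the `@` prefix to refer to a Python variable. The toy dataframe and the `threshold` variable are invented for illustration, and the backtick syntax assumes a reasonably recent pandas version.

```python
import pandas as pd

# A column name with a blank space, to show the backtick syntax.
toy = pd.DataFrame({"product quantity": [1, 150, 999],
                    "sku": ["A", "B", "C"]})

threshold = 100

# Backticks quote the column name; @ refers to a Python variable.
big = toy.query("`product quantity` > @threshold")
print(big["sku"].tolist())        # ['B', 'C']

# Conditions can be combined with and / or.
combined = toy.query("`product quantity` > @threshold and sku != 'C'")
print(combined["sku"].tolist())   # ['B']
```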


Another way to filter information is to look for exact matches. You can do that with the `isin()` method:

In [51]:
# keep the rows whose sku matches any element of the list
df[df['sku'].isin(['ADN0039','THU0029','APP1190'])]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
1452,1122094,300886,0,1,APP1190,62.99,2017-01-02 19:15:23
1745,1122690,301173,0,1,APP1190,58.99,2017-01-02 23:46:35
1904,1123012,301322,0,1,APP1190,58.99,2017-01-03 09:34:02
2097,1123386,301504,0,1,APP1190,58.99,2017-01-03 12:14:35
2282,1123789,301694,0,2,APP1190,58.99,2017-01-03 14:58:09
...,...,...,...,...,...,...,...
293100,1648600,526589,0,1,APP1190,56.00,2018-03-13 18:45:25
293106,1648610,526591,0,1,APP1190,56.00,2018-03-13 18:50:43
293150,1648679,526624,0,1,APP1190,56.00,2018-03-13 19:53:32
293154,1648690,526631,0,1,APP1190,56.00,2018-03-13 20:03:52
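Because `.isin()` returns a boolean Series, it can be stored as a mask and inverted with `~` to keep the rows that do not match. A minimal sketch on a toy dataframe (the quantities are taken from rows shown earlier, the selection is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"sku": ["ADN0039", "THU0029", "APP1190", "JBL0104"],
                    "product_quantity": [126, 201, 999, 1]})

wanted = ["ADN0039", "THU0029"]

# Boolean mask: True where the sku is one of the wanted values.
mask = toy["sku"].isin(wanted)

print(toy[mask]["sku"].tolist())    # ['ADN0039', 'THU0029']
# ~ inverts the mask: rows whose sku is NOT in the list.
print(toy[~mask]["sku"].tolist())   # ['APP1190', 'JBL0104']
```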


## `.copy()` method

If you want to create a new dataframe out of a chunk of the original dataframe, it is quite common to run into this problem:

In [56]:
sample = df.iloc[:3,:]
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [57]:
sample.iloc[0,4]

'OTT0133'

In [58]:
sample.iloc[0,4] = 'NEW001'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [59]:
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [60]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


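As you can see, we modified the object `sample`, but with the pandas version used in this notebook the dataframe `df` has also been modified! The warning shown above (pandas' `SettingWithCopyWarning`) points at exactly this ambiguity. We can avoid it by using the `.copy()` method: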

In [62]:
df = pd.read_csv('orderlines.csv')
sample = df.iloc[:3,:].copy()
sample.iloc[0,4] = 'NEW001'
sample

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,NEW001,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


In [63]:
df.head(3)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57


As you can see, this time `df` has not been modified.
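The same idea in a self-contained form, for readers who do not have the CSV file at hand. The four `sku` values are copied from the first rows shown above; the dataframe name `original` is only illustrative.

```python
import pandas as pd

original = pd.DataFrame({"sku": ["OTT0133", "LGE0043", "PAR0071", "WDT0315"],
                         "product_quantity": [1, 1, 1, 1]})

# .copy() gives sample its own data, independent from original.
sample = original.iloc[:3, :].copy()
sample.iloc[0, 0] = "NEW001"

print(sample.iloc[0, 0])     # NEW001
print(original.iloc[0, 0])   # OTT0133
```

Whatever the pandas version, an explicit `.copy()` makes the intention clear: changes to `sample` are not meant to reach `original`.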

# CHALLENGES

1. The product with the `sku` JBL0104 has been sold at different prices at different points in time. How many different prices has it had?

Tip: combine any pandas filtering method with the method `.nunique()`.

In [64]:
df.query('sku == "JBL0104"').nunique()

id                  58
id_order            58
product_id           1
product_quantity     2
sku                  1
unit_price           7
date                58
dtype: int64
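If only the number of prices is of interest, the filter and the column selection can be combined in a single `.loc[]` call so that `.nunique()` returns one number instead of one value per column. This is a sketch on invented rows (only the price 23.74 comes from the data shown earlier).

```python
import pandas as pd

# Hypothetical order lines; prices are kept as strings, as in the real file.
toy = pd.DataFrame({"sku": ["JBL0104", "JBL0104", "JBL0104", "OTT0133"],
                    "unit_price": ["23.74", "23.74", "21.99", "18.99"]})

# Rows: sku equal to JBL0104. Column: unit_price only.
n_prices = toy.loc[toy["sku"] == "JBL0104", "unit_price"].nunique()
print(n_prices)   # 2
```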

2. List all the different items that have been sold in the order with an `id_order` of `385921`.

In [78]:
df.query('id_order == 385921')['sku'].unique().tolist()

['APP2431',
 'APP2348',
 'APP2131',
 'APP1630',
 'APP1735',
 'APP1216',
 'APP2092',
 'APP1215',
 'ELA0017',
 'MIN0010',
 'ELA0039',
 'BEA0046',
 'BOS0034',
 'BEA0071',
 'ELA0029',
 'APP2161',
 'HOC0008',
 'NOM0026',
 'NOM0014']

3. Find out in how many different orders the products with the following `sku` values have been sold: `APP2431` and `APP2348`.

In [66]:
df.query('sku == ["APP2431","APP2348"]')['id_order'].nunique()

179

In [67]:
filter_col = df['sku'].isin(["APP2431","APP2348"])

In [68]:
df.loc[filter_col]['id_order'].nunique()

179

In [69]:
# another option
df.loc[df['sku'].isin(["APP2431","APP2348"])]['id_order'].nunique()
# df[df['sku'].isin(['APP2431','APP2348'])].id_order.nunique()

179

4. Create a new dataframe with all the rows that have a product quantity higher than 500. Call this new dataframe `df_50`, and include only the columns `id`, `id_order`, `product_quantity` and `sku`. Be sure to use the method `.copy()`.

Once the dataset is created, rename the column `product_quantity` to `quantity`, and `sku` to `product_code`. To do so, you can use the `.rename()` method or the `.columns` attribute.


In [70]:
df_50 = df.loc[
    # rows rule
    df['product_quantity'] > 500, 
    # select columns (as requested in the challenge statement)
    ['id','id_order','product_quantity','sku']].copy()
df_50

Unnamed: 0,id,id_order,product_quantity,sku
53860,1228150,346221,999,APP1190
57306,1234111,349133,555,APP0665
57796,1234924,349475,800,KIN0137
68712,1254032,358747,999,SEV0028


In [71]:
# option 1: assign the full list of column names
df_50.columns = ['id','id_order','quantity','product_code']
df_50

Unnamed: 0,id,id_order,quantity,product_code
53860,1228150,346221,999,APP1190
57306,1234111,349133,555,APP0665
57796,1234924,349475,800,KIN0137
68712,1254032,358747,999,SEV0028


In [72]:
# option 2: rename only the columns we want to change
# (if option 1 has already run, the old names no longer exist and this call changes nothing)
df_50.rename(columns={'product_quantity':'quantity',
                     'sku':'product_code'},
            inplace=True)
df_50

Unnamed: 0,id,id_order,quantity,product_code
53860,1228150,346221,999,APP1190
57306,1234111,349133,555,APP0665
57796,1234924,349475,800,KIN0137
68712,1254032,358747,999,SEV0028
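A detail about `.rename()` that is easy to overlook: by default it silently ignores keys that are not column names, so running it a second time does nothing and raises no error. The sketch below, on invented rows, shows this and the `errors="raise"` option that turns the silent case into a `KeyError`.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2],
                    "product_quantity": [999, 555],
                    "sku": ["A", "B"]})

mapping = {"product_quantity": "quantity", "sku": "product_code"}

# rename returns a new dataframe; only the listed columns change.
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))   # ['id', 'quantity', 'product_code']

# The old names no longer exist in renamed; by default this is ignored.
again = renamed.rename(columns=mapping)
print(list(again.columns))     # ['id', 'quantity', 'product_code']

# errors="raise" makes the missing names visible.
try:
    renamed.rename(columns=mapping, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
print(missing_key_detected)    # True
```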


5. Select all the order lines (i.e. all the rows) where the product `XDO0047` has appeared. Sort the product quantity in DESCENDING order using the pandas method [`.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html). Then look at the main descriptive information of these results with the method `.describe()`.

In [73]:
(
df.loc[df['sku'] == 'XDO0047']
    .sort_values(by=['product_quantity'], ascending=False)
    .describe()
)

Unnamed: 0,id,id_order,product_id,product_quantity
count,36.0,36.0,36.0,36.0
mean,1485767.0,458640.277778,0.0,4.472222
std,87705.32,36891.598431,0.0,20.662576
min,1365515.0,406387.0,0.0,1.0
25%,1405830.0,425462.75,0.0,1.0
50%,1471428.0,453247.0,0.0,1.0
75%,1564540.0,492406.0,0.0,1.0
max,1640821.0,523533.0,0.0,125.0
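Since `.describe()` only shows summary statistics, the effect of the sorting step is not visible in the output above. A small sketch on invented rows shows what `.sort_values()` does on its own: the rows are reordered and each row keeps its original index label.

```python
import pandas as pd

toy = pd.DataFrame({"sku": ["XDO0047", "XDO0047", "XDO0047"],
                    "product_quantity": [1, 125, 2]})

# ascending=False puts the largest quantity first.
ordered = toy.sort_values(by="product_quantity", ascending=False)

print(ordered["product_quantity"].tolist())   # [125, 2, 1]
print(ordered.index.tolist())                 # [1, 2, 0]
```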


OBSERVATION: as you may have noticed, the `unit_price` column is not detected as a float by pandas. The reason is that some prices have been corrupted. During this week we will discover how to fix them, but for the moment it is enough to filter the product we want to analyse, copy the data to a new dataframe and change the data type to float. See the example below:

In [74]:
df_xdo0047 = df.query('sku == "XDO0047"').copy()

In [75]:
df_xdo0047['unit_price'] = pd.to_numeric(df_xdo0047['unit_price'])

In [76]:
df_xdo0047.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 124018 to 287315
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                36 non-null     int64  
 1   id_order          36 non-null     int64  
 2   product_id        36 non-null     int64  
 3   product_quantity  36 non-null     int64  
 4   sku               36 non-null     object 
 5   unit_price        36 non-null     float64
 6   date              36 non-null     object 
dtypes: float64(1), int64(4), object(2)
memory usage: 2.2+ KB


In [77]:
df_xdo0047.describe()

Unnamed: 0,id,id_order,product_id,product_quantity,unit_price
count,36.0,36.0,36.0,36.0,36.0
mean,1485767.0,458640.277778,0.0,4.472222,24.8175
std,87705.32,36891.598431,0.0,20.662576,1.8087
min,1365515.0,406387.0,0.0,1.0,21.48
25%,1405830.0,425462.75,0.0,1.0,22.09
50%,1471428.0,453247.0,0.0,1.0,25.99
75%,1564540.0,492406.0,0.0,1.0,25.99
max,1640821.0,523533.0,0.0,125.0,25.99
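To locate corrupted prices before converting a whole column, one possible approach (a sketch, not necessarily the method used later in the course) is `pd.to_numeric()` with `errors="coerce"`, which turns values that cannot be parsed into `NaN` instead of raising an error. The three strings below are prices that appear in the outputs above, including the malformed `1.137.99`.

```python
import pandas as pd

prices = pd.Series(["18.99", "399.00", "1.137.99"])

# Values that cannot be parsed become NaN instead of raising an error.
as_numbers = pd.to_numeric(prices, errors="coerce")

print(as_numbers.isna().sum())               # 1
# Use the NaN positions to look at the original, corrupted strings.
print(prices[as_numbers.isna()].tolist())    # ['1.137.99']
```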
