# **04. Python Numpy and Pandas Fundamentals**

Welcome to our <font color=#f2cc38>**Fourth Content Block**</font> in the Python Course! 

We are leaving Python Fundamentals behind and starting specifically with **Python for Data & Analytics**!

As such, during the following weeks we'll take a look at Numpy, Pandas and some visualization libraries to up our game when doing data analysis in Python, and later transition into more **advanced analytics** modules.

Today, we will be reviewing:

 - Numpy: what it is, and its basic functionalities.
 - Pandas Series.
 - Pandas DataFrames.
 - Importing and exporting data with Pandas.
 - Additional Pandas Functions to evaluate our dataframes.
 - Operating with Pandas DFs.
 - Indexing and Slicing DFs.
 - Filtering DFs.
 - Dealing with null values.
 
In the next module, we will finish with Pandas, covering other essentials, as well as many additional advanced functionalities like joining, grouping, the equivalent of window functions, multiindexing, dataframe mutability, and more!

## **01. Numpy**

So, what is Numpy and why would we be interested in learning about it?

Numpy is a fundamental package for scientific computing in Python. Its core functionality lies in allowing for **computing on vector and matrix-like** objects, and in doing so really **fast**.

You'd like to learn about Numpy because most relevant libraries for Data Analytics and Data Science in Python, like Pandas or scikit-learn, **are built on top of Numpy**. Therefore, to get a basic notion of Numpy and become familiar with some of its core concepts, we'll be reviewing some of its main functionalities.

In [2]:
# we always need to import numpy, and by convention we do so as np
# in environments other than JupyterLab, numpy might not be installed. Install it via pip or conda!

import numpy as np
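
To make the "computing on vectors" idea concrete, here is a minimal sketch comparing a plain Python list with a Numpy array. The price values and the 1.21 factor are made up for illustration.

```python
import numpy as np

# Plain Python: we loop over the list element by element.
prices = [10.0, 20.0, 30.0]
with_tax_list = [p * 1.21 for p in prices]

# Numpy: the multiplication is applied to the whole array at once.
prices_arr = np.array(prices)
with_tax_arr = prices_arr * 1.21

print(with_tax_list)
print(with_tax_arr)
```

Both give the same numbers; the array version avoids the explicit loop, which is where the speed-up on large data comes from.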

Numpy is built around the **ndarray** object, which represents both 1D vectors and n-dimensional matrices.

Let's review three different ways to **create np arrays**:

### **Creating np arrays from lists**

We create arrays from lists when we want to create a numpy array from some predefined values we already know of.

In [5]:
# note that n-dimensional arrays (n > 1) are created in numpy by passing nested lists (lists of lists)!
v = [1, 2, 3, 4, 5]
m =[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

In [7]:
arr = np.array(v)
arr

array([1, 2, 3, 4, 5])

In [8]:
arr2D = np.array(m)
arr2D

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

### **Create np arrays from a range of numbers**

In this case, we'll be creating our numpy arrays from a range of numbers. To do so, we'll be using **np.arange**, Numpy's **version of the range** function we know from core Python.

In [10]:
# the syntax is equal to that of the range function!

arr = np.arange(0, 10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
arr = np.arange(0, 20, 2) # now with differences of 2 among the numbers
arr

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [15]:
# compare with range
[i for i in range(0, 20, 2)]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
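
One difference with range is worth a quick sketch: np.arange also accepts non-integer steps. The values below are arbitrary.

```python
import numpy as np

ints = np.arange(0, 20, 2)       # the stop value (20) is excluded, as in range
floats = np.arange(0, 1, 0.25)   # unlike range, non-integer steps are allowed

print(ints)
print(floats)   # [0.   0.25 0.5  0.75]
```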

### **Create np arrays from a random subset of numbers**

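In this case, we are using the **built-in random functionalities of numpy**, which are similar to those of the core Python **random** library. One difference to keep in mind: np.random.randint excludes the upper bound, whereas random.randint includes it.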

In [24]:
# it can be used to obtain a random integer inside a given range (the upper bound, 8, is excluded)
np.random.randint(4, 8)

6

In [27]:
# we can also use randn to get an array of random numbers drawn from a standard normal distribution!
np.random.randn(3)

array([ 0.84639465,  0.98068776, -0.41102971])

In [28]:
np.random.randn(5, 5)

array([[ 0.37628572, -0.37630117,  2.72834178,  0.75235542, -0.42904962],
       [-0.55136763, -0.50769516, -0.35066422, -1.87097899,  0.14748977],
       [-1.8799745 ,  1.30465706, -0.19703582,  0.24084835, -0.54426253],
       [ 1.19421275, -1.23030942, -0.79312524,  0.75561386, -0.20153632],
       [-0.29333949, -0.37352236,  0.4467631 , -0.44517391,  1.27968115]])

In [29]:
np.random.randn(2, 3, 2)

array([[[-1.21946535, -0.04875407],
        [ 0.4231142 ,  0.35199397],
        [ 0.99304695,  0.44362341]],

       [[ 1.23887341,  0.91022505],
        [ 0.58252507,  1.4204216 ],
        [ 0.79874272, -0.83569678]]])

The syntax of **random.randn** implies that each comma-separated argument sets the length of one dimension. Therefore, by using random.randn(2, 3, 2) we obtain a 3D array of shape (2, 3, 2).
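
As a small sketch of the random functions used above, the snippet below fixes a seed so the run is repeatable (the seed value 0 is arbitrary and not part of the original notebook) and checks the range of randint and the shape returned by randn.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

draws = np.random.randint(4, 8, size=1000)   # 1000 integers from 4 up to, but not including, 8
sample = np.random.randn(2, 3, 2)            # one argument per dimension

print(draws.min(), draws.max())   # the maximum never reaches 8
print(sample.shape)               # (2, 3, 2)
```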

### **Interesting array attributes and methods, and operations**

Numpy arrays have some specific attributes and methods of interest:

**Shape**

In [5]:
arr = np.random.randn(3, 2)
arr

array([[-1.30179962, -0.57340667],
       [ 0.35768934,  0.85536508],
       [ 1.80429399,  0.57329268]])

In [7]:
# the shape attribute is super useful to find the dimensions of a given array.
arr.shape

(3, 2)

In [8]:
np.array([1, 2, 3, 4, 5, 6]).shape

(6,)

**dtype**

dtype retrieves the data type of elements in the array

In [10]:
np.array([1, 2, 3, 4, 5]).dtype

dtype('int64')

In [11]:
np.array('Bush Biden Obama Trump'.split()).dtype

dtype('<U5')
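
A few related attributes are handy alongside shape and dtype. The sketch below uses a small made-up matrix; ndim, size and astype are standard ndarray members that the cells above do not show.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.shape)   # (2, 3): 2 rows, 3 columns
print(m.ndim)    # 2 dimensions
print(m.size)    # 6 elements in total

as_float = m.astype(float)   # returns a new array; m itself is unchanged
print(as_float.dtype)        # float64
```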

**numerical methods**

Numpy arrays also have built-in methods to obtain certain numerical summaries.

In [16]:
arr = np.random.randn(7)

In [20]:
arr

array([ 0.44362737,  0.71728122, -0.00329377, -0.44492896,  0.51752225,
       -0.75822619,  0.56168343])

In [22]:
print(arr.max())
print(arr.min())

0.7172812233744162
-0.7582261855082955
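
Besides max and min, the same family of methods can work along one axis of a matrix. A minimal sketch on a made-up 2x3 array:

```python
import numpy as np

m = np.array([[1, 5, 3],
              [4, 2, 6]])

print(m.max())          # 6, largest value overall
print(m.argmax())       # 5, position of that value in the flattened array
print(m.sum(axis=0))    # [5 7 9], one sum per column
print(m.mean(axis=1))   # [3. 4.], one mean per row
```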


**numerical operations with ndarrays**

In [23]:
arr1 = np.random.randn(5)
arr2 = np.random.randn(5)

In [24]:
arr1

array([-0.8438144 ,  0.4759829 , -2.00186412,  0.44948046, -1.40840587])

In [25]:
arr2

array([-0.73140612,  0.96380376,  1.61519382,  0.00820699, -0.63131384])

In [26]:
arr1+arr2

array([-1.57522053,  1.43978666, -0.3866703 ,  0.45768744, -2.03971971])

In [27]:
arr1-arr2

array([-0.11240828, -0.48782086, -3.61705793,  0.44127347, -0.77709203])

In [28]:
arr1*arr2

array([ 0.61717102,  0.45875411, -3.23339854,  0.00368888,  0.88914612])

In [29]:
# when dividing, a division by 0 does not raise an error: numpy emits a RuntimeWarning and returns inf
arr1/np.array([0, 0, 0, 0, 0])

array([-inf,  inf, -inf,  inf, -inf])

In [30]:
arr1**3

array([-0.60081505,  0.10783855, -8.02239027,  0.09080974, -2.79372389])

In [33]:
arr1

array([-0.8438144 ,  0.4759829 , -2.00186412,  0.44948046, -1.40840587])

In [34]:
# the sqrt of a negative number is not a real number: numpy emits a RuntimeWarning and returns nan
np.sqrt(arr1)

array([       nan, 0.68991514,        nan, 0.67043304,        nan])
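
The two special cases above (division by zero and square roots of negative numbers) can be detected after the fact. This is a minimal sketch with made-up values; np.errstate is used only to silence the warnings locally.

```python
import numpy as np

num = np.array([1.0, -2.0, 4.0])
den = np.array([2.0, 0.0, 0.0])

# Silence the RuntimeWarning locally; the results still contain inf and nan.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = num / den      # [0.5, -inf, inf]
    roots = np.sqrt(num)   # [1.0, nan, 2.0]

print(np.isinf(ratio))   # [False  True  True]
print(np.isnan(roots))   # [False  True False]
```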

### **Indexing and Slicing Numpy Arrays**

Indexing and slicing work mostly the same as indexing and slicing lists in Python:

In [36]:
arr = np.arange(0, 10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [39]:
arr[-1]

9

In [38]:
arr[1:7]

array([1, 2, 3, 4, 5, 6])

**For a matrix**

In [53]:
m = np.random.randn(3, 3)
m

array([[ 1.26126487, -0.27231635, -0.62504674],
       [-0.79080837, -0.94777988, -0.40747824],
       [ 0.68090298, -1.2340281 ,  1.02716358]])

In [57]:
# fetching the first row
m[0]

array([ 1.26126487, -0.27231635, -0.62504674])

In [58]:
# fetching the first column
m[:,0]

array([ 1.26126487, -0.79080837,  0.68090298])

In [59]:
# for a specific value
m[0][1]

-0.2723163479547421

In [60]:
m[0, 1]

-0.2723163479547421
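
The comma syntax also lets us slice rows and columns at the same time. A short sketch on a deterministic matrix (built with reshape, which is not used elsewhere in this notebook):

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)   # [[1 2 3] [4 5 6] [7 8 9]]

print(m[0, 1])     # 2, same element as m[0][1]
print(m[:2, 1:])   # first two rows, last two columns
print(m[:, -1])    # last column: [3 6 9]
```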

**Broadcasting**

Broadcasting refers to the possibility of **"vectorizing"** some operations across a numpy array. One case we want to take a close look at is the possibility of **reassigning values for a subarray in the array** with a single scalar, something we could not do with lists!

In [40]:
mylist = [1, 2, 3, 4, 5]
mylist[0] = 4
mylist

[4, 2, 3, 4, 5]

In [41]:
mylist[:3] = 5

TypeError: can only assign an iterable

In [42]:
# instead, for numpy:
arr = np.arange(0, 10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [43]:
arr[:4] = 4
arr

array([4, 4, 4, 4, 4, 5, 6, 7, 8, 9])

In [45]:
# we can do it for the whole array as well!
arr[:] = 4
arr

array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
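
Two related behaviours are worth a sketch here. First, a slice of a Numpy array is a view on the same data, so assigning to the slice changes the original unless we take an explicit copy. Second, broadcasting also applies between arrays of different shapes. The variable names below are made up for illustration.

```python
import numpy as np

arr_a = np.arange(0, 10)
view = arr_a[:3]          # a slice is a view on the same data
view[:] = 99
print(arr_a)              # the first three items of arr_a changed too

arr_b = np.arange(0, 10)
safe = arr_b[:3].copy()   # an explicit copy is independent
safe[:] = 99
print(arr_b)              # arr_b is untouched

# Broadcasting between shapes: the 1D row is added to every row of the matrix.
m = np.zeros((2, 3)) + np.array([1, 2, 3])
print(m)
```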

### **Filtering Numpy Arrays**

In the process of filtering a Numpy array, we:
 1. First set a condition on a numpy array, which returns an array of True/False values
 2. Then pass that Boolean array to the original array, and numpy keeps only the True positions

In [63]:
arr = np.arange(0, 10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [64]:
arr>4

array([False, False, False, False, False,  True,  True,  True,  True,
        True])

In [65]:
arr[arr>4]

array([5, 6, 7, 8, 9])
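
Conditions can also be combined before filtering. A minimal sketch, using & for AND and ~ for negation:

```python
import numpy as np

arr = np.arange(0, 10)

mask = (arr > 2) & (arr % 2 == 0)   # parentheses are required around each condition
print(arr[mask])                     # [4 6 8]
print(arr[~mask])                    # everything else
print(mask.sum())                    # 3, the number of True values
```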

We will see something very similar in **pandas**!

## **02. Pandas Series and DataFrames**

Pandas, which is built on top of Numpy, is the default library for dealing with tabular data in Python.

Pandas relies on **Series** and **DataFrames**. A Series would be the equivalent of a table's "column", and a DataFrame would be the equivalent of a data table.

### **02.1. Pandas Series**

Pandas Series are built on top of numpy arrays. Because of that, they are super similar to numpy 1D arrays. 

However, unlike Numpy arrays, they can be **indexed by a label**.

#### **How to create a Pandas Series**

In [1]:
# first always import pandas as pd
import numpy as np
import pandas as pd

In [67]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [69]:
# we can give the index custom labels!
pd.Series([1, 2, 3, 4, 5], index = 'a b c d e'.split())

a    1
b    2
c    3
d    4
e    5
dtype: int64

 - From a dictionary:

In [70]:
# we can create a Series from a dictionary
mydict = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
pd.Series(mydict)

a    1
b    2
c    3
d    4
e    5
dtype: int64

 - From numpy arrays also:

In [71]:
pd.Series(np.random.randn(6))

0   -0.079530
1   -0.725589
2    0.771851
3   -0.166389
4   -0.681069
5    0.049136
dtype: float64

#### **Indexing with Pandas Series**

Both the default numeric index and custom non-numeric indexes can be used to index pandas Series:

In [78]:
s = pd.Series(list(range(10, 15)))
s

0    10
1    11
2    12
3    13
4    14
dtype: int64

In [79]:
s[0]

10

In [80]:
s[:4]

0    10
1    11
2    12
3    13
dtype: int64

In [81]:
s.index = ['USA', 'Japan', 'France', 'Canada', 'Spain']

In [82]:
s

USA       10
Japan     11
France    12
Canada    13
Spain     14
dtype: int64

In [84]:
s['USA']

10

In [85]:
# However, positional access still works here! (recent pandas versions deprecate this on labelled indexes; s.iloc[0] is the safer form)
s[0]

10
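
To avoid any ambiguity between labels and positions, a Series can be indexed explicitly with .loc (labels) and .iloc (positions). A minimal sketch with a shortened version of the Series above:

```python
import pandas as pd

s = pd.Series([10, 11, 12], index=["USA", "Japan", "France"])

print(s.loc["Japan"])         # 11, by label
print(s.iloc[0])              # 10, by position
print(s.loc["USA":"Japan"])   # label slices include the end label
```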

**Operations** among pandas Series work according to the **index**!

In [89]:
s2 = pd.Series(np.arange(30, 35))
s2.index = ['USA', 'Italy', 'NZ', 'Japan', 'France']

In [93]:
s

USA       10
Japan     11
France    12
Canada    13
Spain     14
dtype: int64

In [94]:
s2

USA       30
Italy     31
NZ        32
Japan     33
France    34
dtype: int64

In [96]:
# labels without a match in the other Series produce NaN! the operation is aligned by index label, not by row position!
s+s2

Canada     NaN
France    46.0
Italy      NaN
Japan     44.0
NZ         NaN
Spain      NaN
USA       40.0
dtype: float64
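
If NaN is not what we want for unmatched labels, the add method accepts a fill_value. A small sketch with shortened, made-up Series:

```python
import pandas as pd

s = pd.Series([10, 11, 12], index=["USA", "Japan", "France"])
s2 = pd.Series([30, 33, 31], index=["USA", "Japan", "Italy"])

plain = s + s2                     # France and Italy become NaN
filled = s.add(s2, fill_value=0)   # missing labels are treated as 0 instead

print(plain)
print(filled)
```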

#### **Quick Exercise**

Create a Pandas Series from a made-up sequence of numbers that indicates the (fake) rating of some Films. Include the films in the index.

In [1]:
films = ['Fargo', 'Star Wars', 'Django', 'Whiplash', 'Goodfellas']

In [6]:
s = pd.Series(np.random.randn(len(films)), index = films)

In [7]:
s

Fargo         0.078233
Star Wars     0.970172
Django        0.264525
Whiplash     -0.872376
Goodfellas    0.102661
dtype: float64

### **02.2. Pandas DataFrames**

DataFrames are the equivalent of **data tables** in pandas. As Series are the equivalent of columns, think about DFs as a combination of Series that share a common index!

#### **Creating DFs**

In [9]:
df = pd.DataFrame(np.random.randn(8, 3), index = 'USA Japan France Egypt Poland SK Argentina Cuba'.split(), columns = 'population gdp median_age'.split())
df

Unnamed: 0,population,gdp,median_age
USA,-0.066816,-0.180745,1.263791
Japan,-0.42642,-0.417508,-1.054292
France,0.521824,1.0766,-0.463071
Egypt,1.584942,-0.049625,-0.071315
Poland,-0.513212,2.628185,-0.258796
SK,0.180745,-1.371016,1.345896
Argentina,2.806639,0.527447,0.224193
Cuba,-1.380013,-0.436359,0.118715


We can also create DFs via **dictionaries** (similar to how we did it before with Series), or, for instance, import some external data to generate them. We'll see that a bit later.
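
As a quick sketch of the dictionary route: each key becomes a column name and each list becomes that column's values. The figures below are made up.

```python
import pandas as pd

data = {
    "population": [47.4, 67.8, 10.3],   # made-up figures, in millions
    "median_age": [44.9, 42.3, 46.2],   # made-up figures
}
df_demo = pd.DataFrame(data, index=["Spain", "France", "Portugal"])
print(df_demo)
```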

#### **df.head()**

When first obtaining a dataset in a df, it's useful to get a sense of its underlying information and structure. We will introduce some additional functions later, but df.head() is a good one for getting a sense of our initial data.

By default, it fetches the **first 5 rows** of the dataframe. When a number is passed to it, it shows that number of rows instead.

In [133]:
df.head(4)

Unnamed: 0,population,gdp,median_age
USA,-0.727674,0.628572,1.101631
Japan,0.637028,-1.338933,0.478187
France,-1.323539,-1.988056,0.177701
Egypt,2.257902,0.16579,0.021995


#### **Selecting and Indexing DFs**

We will see how to select:
 - Individual columns
 - Individual rows
 - Groups of these

**Columns**

Columns can be selected like:

In [134]:
# we obtain the resulting series
df["population"]

USA         -0.727674
Japan        0.637028
France      -1.323539
Egypt        2.257902
Poland       0.826227
SK          -0.922904
Argentina   -0.215233
Cuba         0.310247
Name: population, dtype: float64

In [135]:
# or as if it were an attribute (this only works when the column name is a valid Python identifier)
df.population

USA         -0.727674
Japan        0.637028
France      -1.323539
Egypt        2.257902
Poland       0.826227
SK          -0.922904
Argentina   -0.215233
Cuba         0.310247
Name: population, dtype: float64

In [136]:
type(df.population)

pandas.core.series.Series

And groups of columns like:

In [137]:
df[['population', 'gdp']]

Unnamed: 0,population,gdp
USA,-0.727674,0.628572
Japan,0.637028,-1.338933
France,-1.323539,-1.988056
Egypt,2.257902,0.16579
Poland,0.826227,1.182786
SK,-0.922904,0.806727
Argentina,-0.215233,1.762746
Cuba,0.310247,-0.1351


**Rows**

Rows can be sliced as if the DataFrame were a list, or by using native pandas methods.

In [138]:
len(df)

8

In [139]:
df[2:7]

Unnamed: 0,population,gdp,median_age
France,-1.323539,-1.988056,0.177701
Egypt,2.257902,0.16579,0.021995
Poland,0.826227,1.182786,-0.357212
SK,-0.922904,0.806727,0.729014
Argentina,-0.215233,1.762746,0.607078


The native pandas methods to do so are **.loc** and **.iloc**.

 - **.loc** is used for indexing according to the label of the index/column.
 - **.iloc** is used for indexing according to integer position.
 
Let's see an example of this:

In [140]:
df.loc['France']

population   -1.323539
gdp          -1.988056
median_age    0.177701
Name: France, dtype: float64

In [141]:
df.loc['France':'Argentina']

Unnamed: 0,population,gdp,median_age
France,-1.323539,-1.988056,0.177701
Egypt,2.257902,0.16579,0.021995
Poland,0.826227,1.182786,-0.357212
SK,-0.922904,0.806727,0.729014
Argentina,-0.215233,1.762746,0.607078


In [142]:
df.loc[['France', 'Argentina', 'Japan']]

Unnamed: 0,population,gdp,median_age
France,-1.323539,-1.988056,0.177701
Argentina,-0.215233,1.762746,0.607078
Japan,0.637028,-1.338933,0.478187


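Look how by passing a list of labels we have **changed the order** of appearance of the rows!

Now, as for iloc, it works the same way, but with integer positions. Note that .iloc slices exclude the end position, whereas .loc slices include the end label: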

In [143]:
df.iloc[0]

population   -0.727674
gdp           0.628572
median_age    1.101631
Name: USA, dtype: float64

In [144]:
df.iloc[0:4]

Unnamed: 0,population,gdp,median_age
USA,-0.727674,0.628572,1.101631
Japan,0.637028,-1.338933,0.478187
France,-1.323539,-1.988056,0.177701
Egypt,2.257902,0.16579,0.021995


We can also use them to **select both rows and columns at the same time**:

In [145]:
df.loc['France', ['population', 'gdp']]

population   -1.323539
gdp          -1.988056
Name: France, dtype: float64

In [146]:
df.iloc[0:5, 0:2]

Unnamed: 0,population,gdp
USA,-0.727674,0.628572
Japan,0.637028,-1.338933
France,-1.323539,-1.988056
Egypt,2.257902,0.16579
Poland,0.826227,1.182786
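
The difference in how .loc and .iloc treat the end of a slice is easy to miss, so here is a minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"population": [1, 2, 3, 4], "gdp": [10, 20, 30, 40]},
    index=["USA", "Japan", "France", "Egypt"],
)

by_label = df_demo.loc["USA":"France"]   # end label included: 3 rows
by_position = df_demo.iloc[0:2]          # end position excluded: 2 rows

print(by_label)
print(by_position)
print(df_demo.loc["Japan", "gdp"], df_demo.iloc[1, 1])   # same cell, both 20
```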


#### **Quick Exercise**

Create a pandas dataframe with made-up numbers that presents the total population, number of neighbourhoods and average housing price columns for the cities in the list.

Then, slice it just to get the total population and average housing price columns, for the first 3 cities in the dataframe.

In [2]:
cities = ['BCN', 'MAD', 'BUC', 'NYC', 'WAW', 'LIS']

**Solution**

In [8]:
df = pd.DataFrame(np.random.randn(6,3),index = cities, columns = 'total_population n_neighbourhoods avg_house_price'.split())
df.iloc[0:3,[0,2]]

Unnamed: 0,total_population,avg_house_price
BCN,0.871374,0.35224
MAD,1.289137,-0.86112
BUC,0.105161,-1.15951


#### **How do I create a new column in a pandas DF?**

Creating new columns is crazy simple in pandas:

In [10]:
df['gdp_pp'] = df.gdp / df.population

In [11]:
df

Unnamed: 0,population,gdp,median_age,gdp_pp
USA,-0.066816,-0.180745,1.263791,2.705107
Japan,-0.42642,-0.417508,-1.054292,0.979099
France,0.521824,1.0766,-0.463071,2.063145
Egypt,1.584942,-0.049625,-0.071315,-0.03131
Poland,-0.513212,2.628185,-0.258796,-5.121055
SK,0.180745,-1.371016,1.345896,-7.585354
Argentina,2.806639,0.527447,0.224193,0.187928
Cuba,-1.380013,-0.436359,0.118715,0.316199
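
An alternative sketch, in case we prefer not to modify the DataFrame directly: assign returns a new DataFrame with the extra column. The data below is made up.

```python
import pandas as pd

df_demo = pd.DataFrame({"gdp": [100.0, 60.0], "population": [50.0, 20.0]},
                       index=["A", "B"])

# assign returns a new DataFrame and leaves df_demo unchanged
with_ratio = df_demo.assign(gdp_pp=df_demo["gdp"] / df_demo["population"])
print(with_ratio)
```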


#### **.drop() for column removal**

If we want to get rid of a column, instead of selecting all the other columns, we can directly remove it via **.drop()**.

**!Watch Out!**: axis = 0 (default) is for rows, and axis = 1 is for columns!

In [12]:
df.head(3)

Unnamed: 0,population,gdp,median_age,gdp_pp
USA,-0.066816,-0.180745,1.263791,2.705107
Japan,-0.42642,-0.417508,-1.054292,0.979099
France,0.521824,1.0766,-0.463071,2.063145


In [14]:
df.drop('gdp_pp', axis = 1)

Unnamed: 0,population,gdp,median_age
USA,-0.066816,-0.180745,1.263791
Japan,-0.42642,-0.417508,-1.054292
France,0.521824,1.0766,-0.463071
Egypt,1.584942,-0.049625,-0.071315
Poland,-0.513212,2.628185,-0.258796
SK,0.180745,-1.371016,1.345896
Argentina,2.806639,0.527447,0.224193
Cuba,-1.380013,-0.436359,0.118715


In [153]:
# with this method, the change is not permanent! we need to reassign the result or make use of inplace!!
df.head(1)

Unnamed: 0,population,gdp,median_age,gdp_pp
USA,-0.727674,0.628572,1.101631,-0.863809


In [154]:
df.drop('gdp_pp', axis = 1, inplace = True)

In [155]:
df.head(1)

Unnamed: 0,population,gdp,median_age
USA,-0.727674,0.628572,1.101631


Several pandas methods accept the **inplace** argument, so be sure to specify it (or to reassign the result) when you want the change to persist. When debugging a change that did not stick, a missing inplace is a likely cause!
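
A minimal sketch of the non-inplace behaviour, on a made-up DataFrame. It uses the columns= keyword, which is equivalent to passing axis = 1.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

trimmed = df_demo.drop(columns=["c"])   # same as df_demo.drop("c", axis=1)

print(list(df_demo.columns))   # ['a', 'b', 'c'], the original is unchanged
print(list(trimmed.columns))   # ['a', 'b']
```

Reassigning the result (df_demo = df_demo.drop(...)) has the same effect as inplace = True.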

## **03. Importing and Exporting Data with Pandas**

We are going to overview how to import flat files like .csv or .xlsx, and also how to use Starburst to get data.

In [21]:
df = pd.read_csv('../winemag-data_first150k.csv')

In [22]:
# problem! We imported it with a weird unnamed column! How do we get rid of this?
df.head(2)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez


In [163]:
df = pd.read_csv('../winemag-data_first150k.csv', index_col = 0)

We also have the equivalent **pd.read_excel** for xlsx files. However, bear in mind that it may raise an error if **openpyxl** is not installed; in that case, install it via pip.

**Reading data from Starburst**

A great option is connecting directly to Starburst to retrieve data:

In [2]:
%load_ext starburst

In [3]:
%starburst SELECT 1

Open the following URL in browser for the external authentication:
https://starburst.g8s-data-platform-prod.glovoint.com/oauth2/token/initiate/4f04dfd738db918a4e8743a4eea8a835a5ac10eebd7987836e6d4232c0944b11
Done.


_col0
1


In [25]:
%%starburst query1 <<
    select
        date(date_trunc('month', order_activated_local_at)) as month,
        order_country_code as country,
        count(distinct case when order_is_first_delivered_order then customer_id end) as ncs,
        count(distinct customer_id) as mau,
        count(distinct order_id) as orders
    from delta.central_order_descriptors_odp.order_descriptors_v2
    where year(order_activated_local_at) = 2023
    group by 1, 2
    order by 1 desc, 2

Done.
Returning data to local variable query1


In [26]:
df = pd.DataFrame(query1)

In [27]:
df.head()

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,159,1732,3350
1,2023-10-01,AM,3648,19375,58952
2,2023-10-01,BA,4309,30035,75717
3,2023-10-01,BG,12395,110180,301769
4,2023-10-01,CI,7870,50529,172437


Now, let's see how to **export** it:

In [28]:
df.to_csv('data/first_df_without_index.csv')

If we read it again...

In [29]:
# We see an unnamed column!
pd.read_csv('data/first_df_without_index.csv').head()

Unnamed: 0.1,Unnamed: 0,month,country,ncs,mau,orders
0,0,2023-10-01,AD,159,1732,3350
1,1,2023-10-01,AM,3648,19375,58952
2,2,2023-10-01,BA,4309,30035,75717
3,3,2023-10-01,BG,12395,110180,301769
4,4,2023-10-01,CI,7870,50529,172437


By default, pandas also writes our index to the file. As our index does not carry any meaning here, it's just a row number, remember to use **index = False** when exporting!

In [183]:
# let's overwrite the file...
df.to_csv('data/first_df_without_index.csv', index = False)

In [185]:
# well done now!
pd.read_csv('data/first_df_without_index.csv').head()

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,149,1648,3057
1,2023-10-01,AM,3230,18113,51793
2,2023-10-01,BA,3867,28162,66898
3,2023-10-01,BG,11098,103561,266716
4,2023-10-01,CI,6974,47386,152862
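
The same round trip can be reproduced without touching the disk. This sketch writes to an in-memory buffer (io.StringIO) instead of a file path, with a tiny made-up DataFrame:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"country": ["AD", "AM"], "orders": [3350, 58952]})

with_index = io.StringIO()
df_demo.to_csv(with_index)                   # index written as an unnamed first column
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)   # index left out of the file
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)      # ['Unnamed: 0', 'country', 'orders']
print(cols_without)   # ['country', 'orders']
```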


## **04. Methods and attributes to explore a pandas DF**

Once we import a pandas df, we can start taking a look at some of its values, attributes and so on via the use of **predefined methods and attributes**. Some relevant ones are:

In [186]:
df = pd.read_csv('data/first_df_without_index.csv')

### **info()**

info() provides a concise summary of the DataFrame, including information about the data types of each column, the number of non-null values, and memory usage.

In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   month    254 non-null    object
 1   country  254 non-null    object
 2   ncs      254 non-null    int64 
 3   mau      254 non-null    int64 
 4   orders   254 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 10.0+ KB


### **describe()**
describe() generates descriptive statistics of the numeric columns in the DataFrame, including count, mean, standard deviation, minimum, quartiles, and maximum values.

In [188]:
df.describe()

Unnamed: 0,ncs,mau,orders
count,254.0,254.0,254.0
mean,32502.059055,246184.7,724064.4
std,44186.652039,389320.2,1121350.0
min,0.0,1.0,1.0
25%,5317.75,31349.75,82983.0
50%,13604.5,64083.5,206576.5
75%,44485.25,310567.8,1011996.0
max,205204.0,1870862.0,5633105.0


### **head() and tail()**

They fetch the first n / last n rows of the DF (5 by default).

In [190]:
df.head(3)

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,149,1648,3057
1,2023-10-01,AM,3230,18113,51793
2,2023-10-01,BA,3867,28162,66898


In [191]:
df.tail(2)

Unnamed: 0,month,country,ncs,mau,orders
252,2023-01-01,UA,61192,389935,1203913
253,2023-01-01,UG,3391,13957,30187


### **Relevant Attributes**

Attributes like .columns to fetch all column names, .index to fetch the index, .shape to get the number of rows and columns, or .dtypes to check the data type of each column.

In [33]:
df.columns

Index(['month', 'country', 'ncs', 'mau', 'orders'], dtype='object')

In [194]:
df.index

RangeIndex(start=0, stop=254, step=1)

In [195]:
df.shape

(254, 5)

In [196]:
df.dtypes

month      object
country    object
ncs         int64
mau         int64
orders      int64
dtype: object

### **Numerical Methods**

To obtain the max, min, mean, median or std separately from the describe() table.

In [197]:
df.max()

month      2023-10-01
country            UG
ncs            205204
mau           1870862
orders        5633105
dtype: object

In [198]:
df.min()

month      2023-01-01
country            AD
ncs                 0
mau                 1
orders              1
dtype: object

In [35]:
df.mean(numeric_only = True)

ncs        32789.488189
mau       247441.232283
orders    730946.507874
dtype: float64

In [201]:
df.std(numeric_only = True)

ncs       4.418665e+04
mau       3.893202e+05
orders    1.121350e+06
dtype: float64

In [202]:
df.median(numeric_only = True)

ncs        13604.5
mau        64083.5
orders    206576.5
dtype: float64
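
When we need a few of these statistics at once, agg can compute them in a single call. A minimal sketch on made-up numbers:

```python
import pandas as pd

df_demo = pd.DataFrame({"ncs": [1, 2, 3], "orders": [10, 20, 60]})

summary = df_demo.agg(["min", "max", "mean"])   # one row per statistic
print(summary)
```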

### **value_counts()**

Returns the number of times each distinct combination of values appears. Called on a single column, it counts the occurrences of each value.

In [204]:
df.value_counts()

month       country  ncs     mau     orders 
2023-01-01  AD       415     2160    4171       1
2023-07-01  RS       18802   158227  508649     1
            GH       9443    38506   106267     1
            HR       15164   143630  368340     1
            IT       155503  965767  2299870    1
                                               ..
2023-04-01  KG       9590    42674   116633     1
            KZ       27455   193073  557608     1
            MA       46732   287427  1085214    1
            MD       5037    27431   61758      1
2023-10-01  UG       3999    17253   36165      1
Length: 254, dtype: int64

In [205]:
df.country.value_counts()

AD    10
AM    10
UG    10
UA    10
TN    10
SI    10
RS    10
RO    10
PT    10
PL    10
NG    10
ME    10
MD    10
MA    10
KZ    10
KG    10
KE    10
IT    10
HR    10
GH    10
GE    10
ES    10
CI    10
BG    10
BA    10
PR     4
Name: country, dtype: int64
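
value_counts can also return shares instead of counts through its normalize argument. A small sketch with made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "country": ["ES", "ES", "IT", "PT"],
    "orders": [10, 20, 30, 40],
})

counts = df_demo["country"].value_counts()
shares = df_demo["country"].value_counts(normalize=True)

print(counts)   # ES appears twice, IT and PT once
print(shares)   # ES 0.5, IT 0.25, PT 0.25
```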

### **nunique()**
Counts the number of unique values per column.

In [208]:
df.nunique()

month       10
country     26
ncs        249
mau        253
orders     254
dtype: int64

### **sample(n)**
Returns a random sample of n rows from the DataFrame.

In [209]:
df.sample(5)

Unnamed: 0,month,country,ncs,mau,orders
133,2023-05-01,HR,12745,149823,393470
33,2023-09-01,HR,13849,149285,393036
81,2023-07-01,GE,20531,166798,597129
113,2023-06-01,MA,51183,357672,1549366
118,2023-06-01,PT,37113,327292,1003512


## **Quick Exercise 3**

From order descriptors, import, for Spain, 100 records including the order_country_code, order_city_code, customer_id, order_id, order_subvertical, store_id and order_cancel_reason.

Then, explore the dataframe with base exploratory methods. Which relevant information do you see?

In [4]:
%%starburst query2 <<
    select
        order_country_code as counntry,
        order_city_code as city,
        customer_id,
        order_id,
        order_subvertical,
        store_id,
        order_cancel_reason
    from delta.central_order_descriptors_odp.order_descriptors_v2
    limit 100

Done.
Returning data to local variable query2


In [6]:
df = pd.DataFrame(query2)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   counntry             100 non-null    object 
 1   city                 100 non-null    object 
 2   customer_id          100 non-null    int64  
 3   order_id             100 non-null    int64  
 4   order_subvertical    100 non-null    object 
 5   store_id             73 non-null     float64
 6   order_cancel_reason  5 non-null      object 
dtypes: float64(1), int64(2), object(4)
memory usage: 5.6+ KB


In [9]:
df.nunique()

counntry                 8
city                    16
customer_id            100
order_id               100
order_subvertical        5
store_id                63
order_cancel_reason      3
dtype: int64

In [11]:
df.counntry.value_counts()

ES    71
PE     8
IT     7
FR     7
AR     4
CL     1
BR     1
PT     1
Name: counntry, dtype: int64

In [13]:
df.counntry.unique()

array(['ES', 'IT', 'PE', 'FR', 'CL', 'AR', 'BR', 'PT'], dtype=object)

## **05. Filtering Pandas DFs**

Let's overview the equivalent of SQL **WHERE** clauses in pandas DFs.

In [211]:
df = pd.read_csv('data/first_df_without_index.csv')
df.head()

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,149,1648,3057
1,2023-10-01,AM,3230,18113,51793
2,2023-10-01,BA,3867,28162,66898
3,2023-10-01,BG,11098,103561,266716
4,2023-10-01,CI,6974,47386,152862


Filtering in Pandas works similarly to Numpy. Let's see why:

In [215]:
# By just applying the condition, we get a series with the exact same index, but with Boolean values, True/False according to whether the condition is met
df['ncs'] > 1000

0      False
1       True
2       True
3       True
4       True
       ...  
249     True
250     True
251     True
252     True
253     True
Name: ncs, Length: 254, dtype: bool

We have to **pass this condition to the whole DF to filter out False values**:

In [217]:
# See how AD is lost in the process!
df[df['ncs'] > 1000].head()

Unnamed: 0,month,country,ncs,mau,orders
1,2023-10-01,AM,3230,18113,51793
2,2023-10-01,BA,3867,28162,66898
3,2023-10-01,BG,11098,103561,266716
4,2023-10-01,CI,6974,47386,152862
5,2023-10-01,ES,147269,1635849,4216475


#### **Keeping the "False" values**

What if we want to keep those records where the condition was not met?

We use the **~ operator**.

Remember to **use parentheses** to encapsulate the whole condition to negate! It's essential!

In [224]:
df[~(df['ncs'] > 1000)].head()

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,149,1648,3057
25,2023-09-01,AD,278,2018,4013
50,2023-08-01,AD,285,2017,4142
75,2023-07-01,AD,236,2006,4279
100,2023-06-01,AD,298,2165,4597


#### **Multiple Conditions at a time, AND, OR, IN, IS (NOT) NULL**

We can use the **&** (AND) or **|** (OR) operator to combine several different conditions.

Remember to **encapsulate each condition in parentheses**, or you'll trigger an error, because & and | take precedence over comparison operators like == or <:

In [229]:
# AND:
df[(df.country != 'AD') & (df.ncs < 1000)]

Unnamed: 0,month,country,ncs,mau,orders
168,2023-04-01,PR,0,5,9
194,2023-03-01,PR,0,1,1
220,2023-02-01,PR,0,6,11
246,2023-01-01,PR,0,1,2


In [231]:
# OR:
df[(df.country == 'ES')|(df.country == 'IT')].head()

Unnamed: 0,month,country,ncs,mau,orders
5,2023-10-01,ES,147269,1635849,4216475
9,2023-10-01,IT,117161,851600,1830553
30,2023-09-01,ES,198931,1859635,5451798
34,2023-09-01,IT,156159,1016988,2423674
55,2023-08-01,ES,202663,1762605,5011810


The **.isin()** method can be used as the equivalent of the SQL **IN** keyword:

In [232]:
df[df.country.isin(['AD', 'PT', 'KE'])].head()

Unnamed: 0,month,country,ncs,mau,orders
0,2023-10-01,AD,149,1648,3057
10,2023-10-01,KE,9285,55840,165447
18,2023-10-01,PT,33092,287739,762316
25,2023-09-01,AD,278,2018,4013
35,2023-09-01,KE,13210,65932,219334


In [234]:
# For NOT IN, let's negate the .isin() condition:
df[~(df.country.isin(['AD', 'KE', 'PR']))].head()

Unnamed: 0,month,country,ncs,mau,orders
1,2023-10-01,AM,3230,18113,51793
2,2023-10-01,BA,3867,28162,66898
3,2023-10-01,BG,11098,103561,266716
4,2023-10-01,CI,6974,47386,152862
5,2023-10-01,ES,147269,1635849,4216475


For **nulls** and **not nulls** we use the **.isnull()** or **.notnull()** methods.

In [None]:
# filter nulls (on a DataFrame with a store_id column, like the one from Quick Exercise 3)
df[df.store_id.isnull()].head()

In [None]:
# filter not nulls
df[df.store_id.notnull()].head()
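
Since the two cells above were not executed, here is a self-contained sketch on a made-up DataFrame. It also shows a common companion pattern: counting nulls per column with isnull().sum().

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "order_id": [1, 2, 3],
    "store_id": [10.0, np.nan, 12.0],
})

null_counts = df_demo.isnull().sum()                  # nulls per column
missing_rows = df_demo[df_demo["store_id"].isnull()]  # only the row for order 2

print(null_counts)
print(missing_rows)
```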

## **06. Handling Missing Data**

It is quite possible that our dataframe contains null values (as we saw before) and that we want to handle them in one way or another. Let's see how to do it with some built-in pandas methods:

### **dropna()**

dropna() is used to remove rows that have NaNs. By default, a row is dropped as soon as it has at least one NaN. Let's see it:

In [242]:
df_filtered = df.dropna()

In [None]:
# thresh sets the minimum number of non-NaN values a row needs in order to be kept
# (here, rows with at least 1 non-null value survive)

df.dropna(thresh = 1).info()
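
The different dropna options are easier to compare side by side. A minimal sketch on a made-up DataFrame with one complete row, one partially null row, and one fully null row:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

print(len(df_demo.dropna()))               # 1: only fully populated rows survive
print(len(df_demo.dropna(thresh=2)))       # 2: rows with at least 2 non-null values
print(len(df_demo.dropna(how="all")))      # 2: only rows that are entirely null are dropped
print(len(df_demo.dropna(subset=["b"])))   # 2: only nulls in column b matter
```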

Also, be mindful to use **inplace** (or to reassign the result) if you want to make the change permanent in the DataFrame that you're using!!

### **fillna()**

fillna() is used to replace null values. You can use it per column, and the replacement can be a fixed value or a computed one, like the mean value of that column:

In [258]:
# With a putative field called date:

# df.date = df.date.fillna(value = 'Not accounted for')

In [261]:
# we can also use a relative value with regards to the column, like its mean or median
# this is the way to go for some numeric columns in ML problems!

# With a putative field called store:

# df.store = df.store.fillna(value = df.store.mean())

We may also use fillna over the whole dataset!

In [None]:
df.fillna('not accounted for', inplace = True)
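
Filling the whole dataset with one string mixes text into numeric columns, so a per-column replacement is often safer. A minimal sketch, assuming a made-up DataFrame with one numeric and one text column; fillna accepts a dictionary that maps column names to replacement values.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "reason": ["late", None, None],
})

filled = df_demo.fillna({
    "price": df_demo["price"].mean(),   # 20.0, the mean of the non-null prices
    "reason": "not accounted for",
})
print(filled)
```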