# Working with multi-dimensional numpy arrays

This notebook shows how to create, modify, index, and reshape 1D, 2D, and 3D numpy arrays, and how to turn a pandas DataFrame into the 3D input that a recurrent neural network expects.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer

In [None]:
np.random.seed(1)

In [None]:
airbnb = pd.read_csv("airbnb.csv")

In [None]:
# look at the data
airbnb.head(3)

# generate a basic summary of the data
airbnb.info()

# generate a statistical summary of the numeric value in the data
airbnb.describe()

# summary of your data
import summarytools
from summarytools import dfSummary
dfSummary(airbnb)



In [None]:
# Check the missing values by summing the total na's for each variable
airbnb.isna().sum()

In [None]:
# create a list of the categorical variables
category_var_list = list(airbnb.select_dtypes(include='object').columns)
category_var_list

In [None]:
# Our target is price, but there are three related price variables: price, price_gte_150,
# and price_category. We need to drop price_gte_150 and price_category.
airbnb.drop(['price_category', 'price_gte_150'], axis=1, inplace = True)

In [None]:
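# assign the result back; calling fillna(..., inplace=True) on a selected column may not update the DataFrame
airbnb["property_type"] = airbnb["property_type"].fillna("unknown")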

In [None]:
airbnb.room_type.unique()

In [None]:
# create a standard scaler and fit it to the training set of predictors
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['latitude', 'longitude', 'accommodates', 
                   'bathrooms', 'bedrooms', 'beds', 'Number of amenities', 
                   'guests_included', 'price_per_extra_person', 'minimum_nights', 
                   'number_of_reviews', 'number_days_btw_first_last_review', 
                   'review_scores_rating']                
               
# Transform the predictors of training and validation sets
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize]) # train_df is a DataFrame, not a numpy array


test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])

In [None]:
airbnb = airbnb.join(pd.get_dummies(airbnb['property_type'], prefix='property_type', drop_first=True))
airbnb.drop('property_type', axis=1, inplace = True)

In [None]:
numeric_cols_with_nas = list(train_df.isna().sum()[train_df.isna().sum() > 0].index)
numeric_cols_with_nas

In [None]:
imputer = SimpleImputer(strategy="median")

train_df[numeric_cols_with_nas] = imputer.fit_transform(train_df[numeric_cols_with_nas])
test_df[numeric_cols_with_nas] = imputer.transform(test_df[numeric_cols_with_nas])

In [None]:
arr.ravel() # same as reshape(-1), it flattens the array

In [None]:
df = pd.DataFrame(arr, columns=('x1', 'x2', 'y'))
df

In [None]:
np.ravel(df)

In [None]:
arr2 = np.array(df).reshape(6,2)
arr2

In [None]:
df2 = pd.DataFrame(arr2, columns=('x1', 'x2'))
df2

## 1.0 Working with 1D arrays

Let's start with a python list

In [2]:
arr1D = [1,2,3,4,5,6,7,8,9,10]

Turn the python list into a numpy array

In [3]:
arr1D = np.array(arr1D, dtype=np.intc)  # convert the existing list; look up datatypes here https://numpy.org/doc/stable/user/basics.types.html
arr1D

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

> NOTE: We could also do the following:

In [4]:
arr1D = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.intc)  # look up datatypes here https://numpy.org/doc/stable/user/basics.types.html
arr1D

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

In [5]:
arr1D.shape

(10,)

In [6]:
np.append(arr1D, 11)  # append 11 to the end of the array

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [7]:
np.delete(arr1D, 0)  # delete the item at index 0 of the array

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

### Non-Destructive Operations/Methods

The `np.<function>` calls used in this notebook are 'non-destructive'; that is, they do not change the original array, but return a new array with the result. If you want to keep the change, store the result in the same variable, e.g. `arr1D = np.append(arr1D, 11)`.

In [8]:
arr1D # notice that the original array is unchanged

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

In [9]:
arr1D = np.append(arr1D, 11)  # if we want to change the original array, we need to reassign it
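
The copy-versus-in-place behaviour described above can be checked directly. This is a minimal sketch using a throwaway array named `original` (a name chosen for illustration, not part of the notebook); `np.shares_memory` reports whether two arrays overlap in memory.

```python
import numpy as np

original = np.array([1, 2, 3])

appended = np.append(original, 4)   # returns a new, longer array
trimmed = np.delete(original, 0)    # returns a new, shorter array

print(original)   # still holds the three starting values
print(appended)
print(trimmed)

# The results are independent copies, not views of `original`
print(np.shares_memory(original, appended))
print(np.shares_memory(original, trimmed))
```

Because each call copies the data, appending inside a long loop can be slow; collecting values in a Python list and converting once is usually cheaper.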

## 2.0 Working with 2d arrays

In [10]:
arr2D = np.array([[0,1,2,3],[4,5,6,7], [8,9,10,11]])
arr2D

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Notice that we don't see the datatype of the array.  We can get the datatype by using the `dtype` attribute.

In [11]:
print("Datatype:", arr2D.dtype)

Datatype: int32
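
The default integer type numpy picks can differ between platforms and numpy versions, so the value printed above may not match on every machine. A small sketch, with illustrative variable names, of pinning the dtype explicitly and converting it afterwards:

```python
import numpy as np

default_ints = np.array([[0, 1], [2, 3]])                    # width chosen by numpy
explicit_ints = np.array([[0, 1], [2, 3]], dtype=np.int64)   # width chosen by us
as_floats = explicit_ints.astype(np.float32)                 # astype returns a converted copy

print(default_ints.dtype, explicit_ints.dtype, as_floats.dtype)
```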


In [12]:
np.column_stack((arr2D, [79,89,99])) # add a column to the end of the array

array([[ 0,  1,  2,  3, 79],
       [ 4,  5,  6,  7, 89],
       [ 8,  9, 10, 11, 99]])

If we need to insert a column in a location other than the end, we can use the `np.insert` function.

In [13]:
np.insert(arr2D, [1], [[1],[2],[3]], axis=1) # The 1 is the index, the [[1],[2],[3]] is the value to insert, and axis=1 means insert into the columns
# notice that inserting is more complicated than column_stack; we need to have the same number of rows as the original array

array([[ 0,  1,  1,  2,  3],
       [ 4,  2,  5,  6,  7],
       [ 8,  3,  9, 10, 11]])

In [14]:
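np.row_stack((arr2D, [79,89,99,109])) # add a row to the end of the array; np.vstack is the equivalent call, and NumPy 2.0 deprecates the row_stack alias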

array([[  0,   1,   2,   3],
       [  4,   5,   6,   7],
       [  8,   9,  10,  11],
       [ 79,  89,  99, 109]])

In [15]:
np.insert(arr2D, 1, [[99,99,9,99]], axis=0) # The 1 is the index, the [[99,99,9,99]] is the row to insert, and axis=0 means insert into the rows
# notice that inserting is more complicated than row_stack; we need to insert a list of lists.

array([[ 0,  1,  2,  3],
       [99, 99,  9, 99],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

With any of these examples, we can also insert more than 1 column or row at a time.

In [16]:
np.insert(arr2D, 1, [[98,98,98,98],[99,99,9,99]], axis=0) # The 1 is the index, the two inner lists are the rows to insert, and axis=0 means insert into the rows
# both rows are placed together, starting at index 1.

array([[ 0,  1,  2,  3],
       [98, 98, 98, 98],
       [99, 99,  9, 99],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
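
The rule behind these calls is that the inserted values must match the array along the axis that is not being extended: a new column needs one value per row, and a new row needs one value per column. The sketch below uses a stand-in array called `grid` (an illustrative name) with the same values as `arr2D`, and passes flat lists with a scalar index, which `np.insert` also accepts.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # same values as arr2D: 3 rows, 4 columns

# A new column needs 3 values (one per row); axis=1 extends the columns
with_col = np.insert(grid, 1, [1, 2, 3], axis=1)

# A new row needs 4 values (one per column); axis=0 extends the rows
with_row = np.insert(grid, 1, [9, 9, 9, 9], axis=0)

print(grid.shape, with_col.shape, with_row.shape)
```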

### Deleting columns and rows

In [17]:
np.delete(arr2D, 1, axis=1) # delete a column. The 1 is the index, and axis 1 means delete from the columns

array([[ 0,  2,  3],
       [ 4,  6,  7],
       [ 8, 10, 11]])

In [18]:
np.delete(arr2D, 1, axis=0) # delete a row. The 1 is the index, and 0 axis means delete from the rows

array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

## 3.0 Working with 3D arrays

Working with 3D arrays is a bit more complicated, but important to know. All the recurrent neural networks (RNN, GRU, LSTM, etc.) use 3D arrays.

In [33]:
arr3D = np.array([
    [[0,1,2,3],[4,5,6,7], [8,9,10,11]],
    [[12,13,14,15],[16,17,18,19], [20,21,22,23]]
])
arr3D


array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

How can we interpret the above? In RNN, the first dimension is the number of samples, the second dimension is the number of time steps, and the third dimension is the number of features.

But, more generally, think of the structure as follows:

[
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ],
    [
        [10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]
    ]
]

In [21]:
arr3D.shape

(2, 3, 4)
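
One way to read a 3D shape is to unpack it into named parts and then index one level at a time. The following sketch uses a stand-in array `cube` with the same shape and values as `arr3D`; the names `n_samples`, `n_steps`, and `n_features` follow the RNN interpretation given above and are otherwise illustrative.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)        # same shape and values as arr3D
n_samples, n_steps, n_features = cube.shape  # 2, 3, 4

first_sample = cube[0]      # one 2D block, shape (n_steps, n_features)
one_step = cube[0, 1]       # one row of that block, shape (n_features,)
one_value = cube[0, 1, 2]   # a single element

print(first_sample.shape, one_step.shape, one_value)
```

Each extra index removes one dimension, which is why `cube[0, 1, 2]` is a scalar rather than an array.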

### Selecting values

In [25]:
# select a column from the 3D dataset
arr3D[:,:,1] # notice that there is a column for each 2D array

array([[ 1,  5,  9],
       [13, 17, 21]])

In [35]:
np.delete(arr3D, 1, 2) # delete index 1 along axis 2 (the columns). Axes: 0 -> 2D array, 1 -> row, 2 -> column

array([[[ 0,  2,  3],
        [ 4,  6,  7],
        [ 8, 10, 11]],

       [[12, 14, 15],
        [16, 18, 19],
        [20, 22, 23]]])

In [34]:
# to select the last row's value in column index 1 (the 2nd column) of each 2D array
arr3D[:,-1,1]

array([ 9, 21])

In [24]:
# with a three-dimensional structure, we can select all of the first dimension, the last item in the second dimension, and the first
# item in the third dimension. In a recurrent neural network, the three dimensions are observations, time steps, and features.
arr3D[:,-1,0] 

array([ 8, 20])

Look closely at the last output. See how this compares to the arr3D contents seen above.

### Deleting values

In [37]:
np.delete(arr3D, 1, 1) # delete index 1 along axis 1 (this deletes the 2nd row of each 2D array in the 3D array)

# note: we can't delete a row from just one of the 2D arrays; the blocks would no longer share a shape, which a numpy array requires

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [20, 21, 22, 23]]])

In [26]:
np.delete(arr3D, 0,0) # delete index 0 along axis 0 (this deletes the 1st 2D array in the 3D array)

array([[[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
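
A quick way to see what each `axis` value does is to delete index 0 along every axis in turn and compare the resulting shapes. This sketch uses a stand-in array `cube` (an illustrative name) shaped like `arr3D`.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # same shape as arr3D

# Deleting one index along an axis shrinks only that axis by one
shapes = [np.delete(cube, 0, axis=axis).shape for axis in range(cube.ndim)]

for axis, shape in enumerate(shapes):
    print(f"axis={axis}: {cube.shape} -> {shape}")
```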

## 4.0 Reshaping Numpy Arrays 

In [38]:
arr3D

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [39]:
arr3D.reshape(-1, 4) # down to 2D: keep 4 columns and let -1 infer the number of rows

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [40]:
arr3D.reshape(8*3) # down to 1D

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [41]:
arr3D.reshape(-1,3) # reshape so that each row has 3 columns (down from 4)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]])

In [42]:
arr3D.reshape(-1,2) # reshape so that each row has 2 columns (down from 4)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19],
       [20, 21],
       [22, 23]])
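
In every call above, `-1` asks numpy to work out that dimension from the total number of elements, so the other dimensions must divide that total evenly. A minimal sketch with a stand-in array `cube` (illustrative name) showing the inference, a round trip, and the error raised when the sizes do not fit:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 24 elements in total

flat = cube.reshape(-1)        # -1 resolves to 24
pairs = cube.reshape(-1, 2)    # -1 resolves to 24 / 2 = 12 rows
back = flat.reshape(2, 3, 4)   # reshaping is reversible

try:
    cube.reshape(-1, 5)        # 24 is not divisible by 5
except ValueError as err:
    print("reshape failed:", err)

print(flat.shape, pairs.shape, back.shape)
```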

## 5.0 DataFrames and Numpy arrays

In [43]:
import pandas as pd


In [44]:
df = pd.DataFrame(
    data={
        'Temp':[77,78,79,80,78,82,76,76,85,73], 
        'Humidity': [50,51,52,53,67,59,58,56,62,55], 
        'Pressure': [29.92,29.93,29.94,29.95, 29.96,29.97,29.98,29.99,30.00,30.01]
    }, 
    index=['2023-04-01','2023-04-02','2023-04-03','2023-04-04', '2023-04-05','2023-04-06','2023-04-07','2023-04-08','2023-04-09','2023-04-10'])
df

| | Temp | Humidity | Pressure |
|---|---|---|---|
| 2023-04-01 | 77 | 50 | 29.92 |
| 2023-04-02 | 78 | 51 | 29.93 |
| 2023-04-03 | 79 | 52 | 29.94 |
| 2023-04-04 | 80 | 53 | 29.95 |
| 2023-04-05 | 78 | 67 | 29.96 |
| 2023-04-06 | 82 | 59 | 29.97 |
| 2023-04-07 | 76 | 58 | 29.98 |
| 2023-04-08 | 76 | 56 | 29.99 |
| 2023-04-09 | 85 | 62 | 30.00 |
| 2023-04-10 | 73 | 55 | 30.01 |


### Selecting on dates

In [45]:
df[df.index > '2023-04-03'] # select rows where the index is greater than 2023-04-03


| | Temp | Humidity | Pressure |
|---|---|---|---|
| 2023-04-04 | 80 | 53 | 29.95 |
| 2023-04-05 | 78 | 67 | 29.96 |
| 2023-04-06 | 82 | 59 | 29.97 |
| 2023-04-07 | 76 | 58 | 29.98 |
| 2023-04-08 | 76 | 56 | 29.99 |
| 2023-04-09 | 85 | 62 | 30.00 |
| 2023-04-10 | 73 | 55 | 30.01 |


In [35]:
# df[df.index > '2023-04-02' and df.index < '2023-04-04'] # notice this will not work: Python's `and` cannot combine two boolean arrays element by element
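
Plain boolean indexing can still express a range if the two conditions are combined with the element-wise `&` operator, each wrapped in parentheses. A minimal sketch on a small stand-in frame named `weather` (an illustrative name, not the notebook's `df`):

```python
import pandas as pd

weather = pd.DataFrame(
    {'Temp': [77, 78, 79, 80]},
    index=['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04'],
)

# Parentheses are required because & binds more tightly than > and <
mask = (weather.index > '2023-04-01') & (weather.index < '2023-04-04')
selected = weather[mask]
print(selected)
```

Use `|` for "or" and `~` for "not" in the same way.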

In [46]:
df.query('index >= "2023-04-02" and index < "2023-04-09"') # this will work; see here for more on query https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

| | Temp | Humidity | Pressure |
|---|---|---|---|
| 2023-04-02 | 78 | 51 | 29.93 |
| 2023-04-03 | 79 | 52 | 29.94 |
| 2023-04-04 | 80 | 53 | 29.95 |
| 2023-04-05 | 78 | 67 | 29.96 |
| 2023-04-06 | 82 | 59 | 29.97 |
| 2023-04-07 | 76 | 58 | 29.98 |
| 2023-04-08 | 76 | 56 | 29.99 |


In [47]:
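df.query('index >= "2023-04-02" and index < "2023-04-06" or index > "2023-04-08"') # we can create more complex queries; the index holds strings, so the day must be zero-padded ("2023-04-08", not "2023-04-8") to compare correctly

| | Temp | Humidity | Pressure |
|---|---|---|---|
| 2023-04-02 | 78 | 51 | 29.93 |
| 2023-04-03 | 79 | 52 | 29.94 |
| 2023-04-04 | 80 | 53 | 29.95 |
| 2023-04-05 | 78 | 67 | 29.96 |
| 2023-04-09 | 85 | 62 | 30.00 |
| 2023-04-10 | 73 | 55 | 30.01 |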
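
The index in this DataFrame holds plain strings, so these comparisons are made character by character rather than as dates. That works only while every date is written in the same zero-padded format. A hedged sketch of the difference, using a small stand-in frame `weather` (illustrative name) and `pd.to_datetime` to convert the index to real timestamps:

```python
import pandas as pd

weather = pd.DataFrame(
    {'Temp': [76, 85, 73]},
    index=['2023-04-08', '2023-04-09', '2023-04-10'],
)

# As strings: '2023-04-09' sorts before '2023-04-8' because '0' < '8'
text_hits = weather[weather.index > '2023-04-8']

# As timestamps: the comparison follows the calendar
weather.index = pd.to_datetime(weather.index)
date_hits = weather[weather.index > '2023-04-08']

print(len(text_hits), len(date_hits))
```

With a datetime index, label slices such as `weather.loc['2023-04-09':]` also become available.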


## 6.0 Preparing data for a recurrent neural network

A recurrent neural network requires a 3D array as input. The first dimension is the number of samples, the second dimension is the number of time steps, and the third dimension is the number of features. The code in this section builds such an array from the weather DataFrame: each sample is a window of 3 consecutive days (the time steps), and each day carries 3 features (Temp, Humidity, and Pressure).

Let's say we want to predict the next day after a sequence of 3 days.

Therefore, our first sample will be the feature values for the first 3 rows of the dataframe, the second sample will be the next 3 rows starting at the second row, and so on.

Let's begin by converting the dataframe to a numpy array.

In [48]:
arr2D = df.to_numpy() # convert the dataframe to a numpy array
arr2D

array([[77.  , 50.  , 29.92],
       [78.  , 51.  , 29.93],
       [79.  , 52.  , 29.94],
       [80.  , 53.  , 29.95],
       [78.  , 67.  , 29.96],
       [82.  , 59.  , 29.97],
       [76.  , 58.  , 29.98],
       [76.  , 56.  , 29.99],
       [85.  , 62.  , 30.  ],
       [73.  , 55.  , 30.01]])

The 2D data above is not yet the input to our recurrent network: we have three features, but no observations. Each observation will be a block of consecutive rows whose length equals the sequence length.

This might be easier to understand if we look at how we can manually create observations for our recurrent network training.

In [49]:
arr2D[0:3,:]  # this would be the first observation

array([[77.  , 50.  , 29.92],
       [78.  , 51.  , 29.93],
       [79.  , 52.  , 29.94]])

In [50]:
arr2D[1:4,:]  # this would be the second observation

array([[78.  , 51.  , 29.93],
       [79.  , 52.  , 29.94],
       [80.  , 53.  , 29.95]])

In [51]:
arr2D[2:5,:]  # this would be the third observation, etc

array([[79.  , 52.  , 29.94],
       [80.  , 53.  , 29.95],
       [78.  , 67.  , 29.96]])

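Rather than slicing by hand, we can put this into a function as follows:

In [52]:
def create_sequences(data, seq_length):
    # Each window holds seq_length + 1 consecutive rows: seq_length input rows
    # plus one extra row that supplies the target (split off further below).
    obs = []
    # NOTE: this bound stops one window short of the end of the data;
    # range(len(data) - seq_length) would also include the final window.
    for i in range(len(data) - seq_length - 1):
        obs.append(data[i:i + seq_length + 1])
    return np.array(obs)

In [53]:
RNN_input = create_sequences(arr2D, 3)
RNN_input
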

array([[[77.  , 50.  , 29.92],
        [78.  , 51.  , 29.93],
        [79.  , 52.  , 29.94],
        [80.  , 53.  , 29.95]],

       [[78.  , 51.  , 29.93],
        [79.  , 52.  , 29.94],
        [80.  , 53.  , 29.95],
        [78.  , 67.  , 29.96]],

       [[79.  , 52.  , 29.94],
        [80.  , 53.  , 29.95],
        [78.  , 67.  , 29.96],
        [82.  , 59.  , 29.97]],

       [[80.  , 53.  , 29.95],
        [78.  , 67.  , 29.96],
        [82.  , 59.  , 29.97],
        [76.  , 58.  , 29.98]],

       [[78.  , 67.  , 29.96],
        [82.  , 59.  , 29.97],
        [76.  , 58.  , 29.98],
        [76.  , 56.  , 29.99]],

       [[82.  , 59.  , 29.97],
        [76.  , 58.  , 29.98],
        [76.  , 56.  , 29.99],
        [85.  , 62.  , 30.  ]]])
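
The output above has 6 windows of 4 rows each, taken from 10 rows of data. The largest number of 4-row windows that fit in 10 rows is 10 - 4 + 1 = 7, so the loop bound in `create_sequences` leaves the final window out. The sketch below is an alternative written for illustration, not the notebook's function: `make_windows` and `series` are hypothetical names, and the window length is passed directly rather than as sequence length plus one.

```python
import numpy as np

def make_windows(data, window):
    """Return every run of `window` consecutive rows, stacked into a 3D array."""
    data = np.asarray(data)
    n_windows = len(data) - window + 1   # how many full windows fit
    return np.stack([data[i:i + window] for i in range(n_windows)])

series = np.arange(20).reshape(10, 2)    # toy data: 10 time steps, 2 features
windows = make_windows(series, 4)        # 3 input rows + 1 target row per window

print(windows.shape)
print(windows[-1])                       # the final window ends on the last row
```

Whether the final window is wanted depends on the task; it is only usable for training when its last row has a known target value.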

Now, let's extract our target variable. Each sequence holds 4 rows: 3 input days followed by the day we want to predict. We want to predict that next day's temperature, so we extract the temperature value from the last row of each sequence (observation).

In [54]:
y = RNN_input[:,-1,0] # all sequences, the last row (-1) of each, and feature 0 (Temp)
y

array([80., 78., 82., 76., 76., 85.])

In [55]:
X = RNN_input[:,0:3,0:3] # all sequences, the first 3 rows (time steps) of each, and all 3 features
X

array([[[77.  , 50.  , 29.92],
        [78.  , 51.  , 29.93],
        [79.  , 52.  , 29.94]],

       [[78.  , 51.  , 29.93],
        [79.  , 52.  , 29.94],
        [80.  , 53.  , 29.95]],

       [[79.  , 52.  , 29.94],
        [80.  , 53.  , 29.95],
        [78.  , 67.  , 29.96]],

       [[80.  , 53.  , 29.95],
        [78.  , 67.  , 29.96],
        [82.  , 59.  , 29.97]],

       [[78.  , 67.  , 29.96],
        [82.  , 59.  , 29.97],
        [76.  , 58.  , 29.98]],

       [[82.  , 59.  , 29.97],
        [76.  , 58.  , 29.98],
        [76.  , 56.  , 29.99]]])
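
Before handing arrays like these to a model, it helps to confirm that the inputs and targets line up. This is a sketch under stated assumptions: `windows` is random stand-in data with the same shape as `RNN_input`, and the slice `:-1` is used in place of `0:3` so the split does not depend on the window length.

```python
import numpy as np

rng = np.random.default_rng(0)
windows = rng.random((6, 4, 3))   # stand-in: 6 windows, 4 rows each, 3 features

X = windows[:, :-1, :]            # every row except the last, all features
y = windows[:, -1, 0]             # feature 0 of the last row in each window

n_samples, n_steps, n_features = X.shape
print(X.shape, y.shape)

# One target per sample, and no target row left inside the inputs
assert y.shape == (n_samples,)
assert n_steps == windows.shape[1] - 1
```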

Now, with some of these fundamentals covered, let's see how we can use them to build a recurrent neural network. See rnn_data_wrangling_examples.ipynb.