# Lesson 07: NumPy and Pandas Modules

NumPy and Pandas are two of the most widely used Python libraries in data science and machine learning. In this lesson, we will learn how to use them.

## 7.1. NumPy
- [NumPy - Official Documentation](https://numpy.org/)
- [Python Numpy Tutorial for Beginners on freeCodeCamp](https://youtu.be/QUT1VHiLmmI)

<br>

- NumPy is a library that provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- It is very popular in data science and machine learning.
- NumPy is not part of Python's standard library, so it must be installed first: `pip install numpy`.
- Syntax: `import numpy as np`
  - The `as` keyword gives the imported library an alias, and `np` is the conventional alias for NumPy.

### Why NumPy
> From the NumPy documentation:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.

- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.
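
The first two points above can be checked directly. The following is a minimal sketch with made-up values, not taken from the NumPy documentation: it shows that `np.append` returns a new array instead of growing the original, and that mixed input is stored with one common data type.

```python
import numpy as np

a = np.array([1, 2, 3])

# np.append does not grow `a` in place; it returns a new, larger array
b = np.append(a, 4)
print(a.size, b.size)   # 3 4
print(b is a)           # False

# Integers and floats in the same array are stored with one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)      # float64

# The exception mentioned above: an array of arbitrary Python objects
objects = np.array([1, "two", 3.0], dtype=object)
print(objects.dtype)    # object
```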

### Why is NumPy Fast?
> From the NumPy documentation:

- Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

    - vectorized code is more concise and easier to read

    - fewer lines of code generally means fewer bugs

    - the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)

    - vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.

- Broadcasting is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast. 
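
As a small illustration of both ideas, the sketch below (with example values chosen only for this demonstration) squares a list of numbers with an explicit loop and with a vectorized expression, and then broadcasts a one-dimensional row across a two-dimensional array.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Explicit Python loop over a list
squared_loop = []
for v in values:
    squared_loop.append(v * v)

# Vectorized: the loop runs inside NumPy's compiled code
arr = np.array(values)
squared_vec = arr * arr
print(squared_vec)      # [ 1.  4.  9. 16.]

# Broadcasting: the shape (3,) row is combined with each row of the (2, 3) matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
broadcast_sum = matrix + row
print(broadcast_sum)
# [[11 22 33]
#  [14 25 36]]
```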

### 7.1.1. Basic NumPy Operations

- Create NumPy Array

In [1]:
# import numpy
import numpy as np

list1 = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
]

list2 = [
    [2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11]
]

list1 + list2

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [2, 3, 4, 5, 6], [7, 8, 9, 10, 11]]

In [2]:
list1 * list2

TypeError: can't multiply sequence by non-int of type 'list'

In [3]:
np_list1 = np.array(list1)
np_list2 = np.array(list2)
np_list1 + np_list2

array([[ 3,  5,  7,  9, 11],
       [13, 15, 17, 19, 21]])

In [4]:
np_list1 * np_list2

array([[  2,   6,  12,  20,  30],
       [ 42,  56,  72,  90, 110]])

In [5]:
# Create np.array

np_array = np.array([[5.0, 4.0, 3.0, 2.0], [6.0, 7.0, 8.0, 9.0]])
print(np_array)

[[5. 4. 3. 2.]
 [6. 7. 8. 9.]]


In [6]:
# Creation of a NumPy Array with a defined data type
np_array_type = np.array([[5.0, 4.0, 3.0, 2.0], [6.0, 7.0, 8.0, 9.0]], dtype='int16')  # the floats are converted to 16-bit integers
# if the data cannot be converted to an integer (e.g. non-numeric text), a ValueError is raised
print(np_array_type)

[[5 4 3 2]
 [6 7 8 9]]
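
To make the conversion behavior concrete, here is a short sketch with illustrative values that are not part of the lesson's data. It suggests that fractional parts are truncated rather than rounded, and that text which does not represent a number triggers the `ValueError` mentioned in the comment above.

```python
import numpy as np

# Fractional parts are dropped (truncated toward zero), not rounded
truncated = np.array([1.9, -1.9, 2.5], dtype='int16')
print(truncated)        # [ 1 -1  2]

# Text that does not represent an integer cannot be converted
conversion_failed = False
try:
    np.array(['five', 'six'], dtype='int16')
except ValueError as err:
    conversion_failed = True
    print('ValueError:', err)
```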


**Possible NumPy Array Types**
- [Data types in NumPy](https://numpy.org/doc/stable/user/basics.types.html)

- Commonly used numeric data types:
  - int8, int16, int32, int64 - signed integer types with different bit sizes
  - uint8, uint16, uint32, uint64 - unsigned integer types with different bit sizes
  - float32, float64 - floating-point types with different precision levels
  - complex64, complex128 - complex number types with different precision levels

- One-character codes that identify the general kind of data (reported by `dtype.kind`):
  - 'b' − boolean
  - 'i' − (signed) integer
  - 'u' − unsigned integer
  - 'f' − floating-point
  - 'c' − complex-floating point
  - 'm' − timedelta
  - 'M' − datetime
  - 'O' − (Python) objects
  - 'S', 'a' − (byte-)string
  - 'U' − Unicode
  - 'V' − raw data (void)
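
The relationship between the named types and the one-character codes can be inspected with `np.dtype`. This is a minimal sketch; the selection of type names is arbitrary.

```python
import numpy as np

names = ['bool', 'int16', 'uint8', 'float32', 'complex64']

# kind: one-character code, itemsize: bytes used per element
kinds = {name: np.dtype(name).kind for name in names}
sizes = {name: np.dtype(name).itemsize for name in names}
for name in names:
    print(name, kinds[name], sizes[name])
# bool b 1
# int16 i 2
# uint8 u 1
# float32 f 4
# complex64 c 8

# The bit size limits the range of values an integer type can hold
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)   # -128 127
```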


- Info about NumPy Array

In [7]:
# Dimension of a NumPy Array
print(np_array.ndim)

# Shape of a NumPy Array
print(np_array.shape)

2
(2, 4)


In [8]:
# Type of the elements in a NumPy Array
print(np_array.dtype)
print(np_array_type.dtype)

float64
int16


In [9]:
# Size of a NumPy Array
print(np_array.size)  # total number of elements in the array
print(np_array_type.size)

# Number of bytes consumed by each element
print(np_array.itemsize)
print(np_array_type.itemsize)

# Number of bytes consumed by the whole array - size * itemsize
print(np_array.size * np_array.itemsize)
print(np_array_type.size * np_array_type.itemsize)

# Number of bytes consumed by the whole array
print(np_array.nbytes)
print(np_array_type.nbytes)

8
8
8
2
64
16
64
16


### 7.1.2. Manipulating NumPy Arrays


#### Accessing the elements in a NumPy Array

In [10]:
# Create np.array
array = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]])


- Accessing single elements in a NumPy Array

In [11]:
# Accessing the single elements in a NumPy Array

print(array[0][1])  # similar to lists  - arr[row][column]
print(array[0, 1])  # arr[row, column]

2
2


- Accessing multiple elements in a NumPy Array

In [12]:
# Accessing multiple elements in a NumPy Array

# the whole row
print(array[0, :])

# the whole column
print(array[:, 0])

[ 1  2  3  4  5  6  7  8  9 10]
[ 1 11]


In [13]:
# Accessing parts of a NumPy Array
# [start:stop:step]

print(array[0, 1:5:2])   # start = 1, stop = 5, step = 2
print(array[0, 1:-1:2])  # start = 1, stop = -1, step = 2
print(array[0, 1:-1])    # start = 1, stop = -1
print(array[0, 1:])      # start = 1
print(array[0, :])       # start = 0
print(array[0, 1::2])    # start = 1, step = 2

[2 4]
[2 4 6 8]
[2 3 4 5 6 7 8 9]
[ 2  3  4  5  6  7  8  9 10]
[ 1  2  3  4  5  6  7  8  9 10]
[ 2  4  6  8 10]
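
The same `[start:stop:step]` notation can be applied to rows and columns at the same time. The sketch below recreates the array from this section and shows a few further combinations. The variable names are chosen only for this example.

```python
import numpy as np

array = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                  [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]])

# All rows, columns 2 to 4 (the stop index 5 is excluded)
block = array[:, 2:5]
print(block)
# [[ 3  4  5]
#  [13 14 15]]

# Negative start: the last three elements of row 1
last_three = array[1, -3:]
print(last_three)       # [18 19 20]

# Negative step: row 0 in reverse order
reversed_row = array[0, ::-1]
print(reversed_row)     # [10  9  8  7  6  5  4  3  2  1]
```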


#### Changing the elements in a NumPy Array

- Changing single elements in a NumPy Array

In [14]:
print(array)
print(array[1,1])

array[1,1] = 100
print(array)
print(array[1,1])

[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]]
12
[[  1   2   3   4   5   6   7   8   9  10]
 [ 11 100  13  14  15  16  17  18  19  20]]
100


- Change column values in a NumPy Array

In [15]:
print(array)
print(array[:,1])  # all values in column 1

array[:,1] = 200  # all values in column 1 are set to 200
print(array)
print(array[:,1])

array[:,2] = [-1, -10]  # column 2 is set to [-1, -10] - the assigned values must match the shape of the selected column
print(array)
print(array[:,2])


[[  1   2   3   4   5   6   7   8   9  10]
 [ 11 100  13  14  15  16  17  18  19  20]]
[  2 100]
[[  1 200   3   4   5   6   7   8   9  10]
 [ 11 200  13  14  15  16  17  18  19  20]]
[200 200]
[[  1 200  -1   4   5   6   7   8   9  10]
 [ 11 200 -10  14  15  16  17  18  19  20]]
[ -1 -10]
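
The shape requirement can be seen in a small sketch. It uses a smaller, hypothetical array than the one above so that the output stays short: a sequence of the right length is accepted, a sequence of the wrong length raises a `ValueError`, and a single value is applied to every selected element.

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])

# The column has 2 entries, so a sequence of 2 values fits
array[:, 0] = [10, 40]

# A sequence of 3 values cannot fill a 2-entry column
assignment_failed = False
try:
    array[:, 1] = [7, 8, 9]
except ValueError as err:
    assignment_failed = True
    print('ValueError:', err)

# A single value is applied to the whole row
array[0, :] = 0
print(array)
# [[ 0  0  0]
#  [40  5  6]]
```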



#### Initialization of the different types of NumPy Arrays

In [16]:
# Zeros and Ones
zeros = np.zeros((2, 3))
print(zeros)

ones = np.ones((2, 3))
print(ones)

[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]


In [18]:
# Create an array in which every element has the same given value
np.full((2, 3), 5)  # np.full(shape, value)

array([[5, 5, 5],
       [5, 5, 5]])

In [20]:
# Create a constant-valued array with the shape of an already defined array
np.full_like(array, 5)  # np.full_like(arr, value)

array([[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]])
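
One detail worth knowing: `np.full_like` takes not only the shape but also the data type from the existing array, unless a `dtype` is passed explicitly. The following sketch uses a hypothetical integer `template` array to show the difference.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])   # integer array

# The fill value is converted to the template's integer dtype
filled_int = np.full_like(template, 5.7)
print(filled_int)
# [[5 5 5]
#  [5 5 5]]

# Passing dtype explicitly keeps the fractional part
filled_float = np.full_like(template, 5.7, dtype=float)
print(filled_float)
# [[5.7 5.7 5.7]
#  [5.7 5.7 5.7]]

# np.zeros and np.ones default to float64, and dtype changes that too
zeros_int = np.zeros((2, 3), dtype=int)
print(zeros_int.dtype.kind)   # i
```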

#### Random NumPy Arrays

- Sometimes we need arrays filled with random values.

In [21]:
# Matrix with Random Values
np.random.rand(4, 2)  # np.random.rand(rows, columns)

array([[0.94710469, 0.34042724],
       [0.08636898, 0.88912962],
       [0.51875848, 0.4218815 ],
       [0.58107976, 0.13459013]])

In [23]:
# Random floats in [0.0, 1.0) with the shape of an existing array
np.random.random_sample(array.shape)  # taking the shape of the already defined array

array([[0.31669543, 0.1762191 , 0.66091349, 0.92839343, 0.1661094 ,
        0.46025947, 0.49784725, 0.26847655, 0.08719662, 0.5186123 ],
       [0.25694656, 0.22453147, 0.11426933, 0.11335487, 0.57419999,
        0.8946826 , 0.01894507, 0.3127316 , 0.51653249, 0.96141235]])

In [25]:
# Random Integer values
np.random.randint(low=5, high=10, size=(2, 3))  # np.random.randint(low, high, size) - low is inclusive, high is exclusive

array([[9, 6, 9],
       [8, 9, 7]])

In [42]:
np.random.randint(6, size=(3, 3))  # with a single bound, values are drawn from 0 (inclusive) up to 6 (exclusive)

array([[4, 5, 2],
       [0, 5, 1],
       [4, 3, 3]])
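
The outputs above change on every run. When reproducible values are needed, one option is NumPy's `Generator` interface, created with `np.random.default_rng` and a seed. This is a sketch of that alternative, not the approach used in the cells above, and the seed value 42 is arbitrary.

```python
import numpy as np

# A generator with a fixed seed produces the same sequence on every run
rng = np.random.default_rng(seed=42)

floats = rng.random((4, 2))                        # floats in [0.0, 1.0)
ints = rng.integers(low=5, high=10, size=(2, 3))   # integers from 5 to 9

# A second generator with the same seed repeats the same values
rng_again = np.random.default_rng(seed=42)
same_floats = rng_again.random((4, 2))
print(np.array_equal(floats, same_floats))   # True
```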

## 7.2. Pandas
- [Pandas - Official Documentation](https://pandas.pydata.org/docs/)
- [Geeks for Geeks: *Introduction to Pandas in Python*](https://www.geeksforgeeks.org/introduction-to-pandas-in-python/)
- [Complete Python Data Science Tutorial](https://www.youtube.com/watch?v=vmEHCJofslg)

- Pandas is a library for data manipulation and analysis. It is very popular in data science and machine learning.
- The name is derived from "panel data", an econometrics term for multidimensional structured data sets. Its central data structure, the DataFrame, is a 2-dimensional table with rows and columns.
- It is a very powerful library for working with tabular data such as spreadsheets, database tables, and time series.
- Here is a list of things that we can do using Pandas:
  - Data set cleaning, merging, and joining.
  - Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
  - Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
  - Powerful group by functionality for performing split-apply-combine operations on data sets.
  - Data Visualization.



- Pandas is not part of Python's standard library, so it must be installed first: `pip install pandas`.
- Syntax: `import pandas as pd`


In [2]:
!pip install pandas

import pandas as pd
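
To give a first impression of the capabilities listed above, here is a minimal sketch on a small, invented data set. The `sales` and `countries` tables and their column names are hypothetical. It touches missing data, split-apply-combine, merging, and inserting and deleting a column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (NaN)
sales = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Rome', 'Rome'],
    'amount': [10.0, np.nan, 7.0, 3.0],
})

# Missing data: count the NaN values, then replace them with 0.0
missing_count = sales['amount'].isna().sum()
filled = sales.fillna({'amount': 0.0})

# Split-apply-combine: total amount per city
totals = filled.groupby('city')['amount'].sum()
print(totals)

# Merging: attach the country of each city from a second table
countries = pd.DataFrame({'city': ['Paris', 'Rome'],
                          'country': ['France', 'Italy']})
merged = filled.merge(countries, on='city', how='left')

# Inserting and deleting a column
filled['double'] = filled['amount'] * 2
filled = filled.drop(columns='double')
```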



- There are two main data structures in Pandas: Series and DataFrame.

### 7.2.1. Pandas Series

- A Pandas Series is a one-dimensional array that can hold any data type (like integer, string, float, Python object, etc.). It is similar to a column in a table (e.g. a spreadsheet in Excel).
- The axis labels of the Pandas Series are collectively called **index**.
- A Pandas Series can be created with the `pd.Series()` constructor.

In [5]:
# Pandas Series from a List
s = pd.Series([1, 2, 3, 4, 5])
print(s)

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])  # pd.Series(data, index) - index is optional
print(s)


0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64


In [7]:
# Pandas Series from a Dictionary
s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
print(s)


a    1
b    2
c    3
d    4
e    5
dtype: int64


In [9]:
# Pandas Series from a Numpy Array
import numpy as np
s = pd.Series(np.array([1, 2, 3, 4, 5]))
print(s)

s = pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'])  # pd.Series(data, index) - index is optional
print(s)

s = pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'], name='numbers')  # and we can give it a name
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
Name: numbers, dtype: int64


In [15]:
print(s.values)
print(s.index)
print(s.name)
print(s['a'])

[1 2 3 4 5]
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
numbers
1
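
Besides `s['a']`, a Series can be accessed explicitly by label with `.loc` or by position with `.iloc`, and it supports the same element-wise operations as a NumPy array. The following sketch rebuilds the `numbers` Series from above. The names `doubled` and `selected` are chosen only for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'], name='numbers')

print(s.loc['b'])     # 2 - access by index label
print(s.iloc[1])      # 2 - access by integer position

# Label-based slices include the end label
label_slice = s.loc['b':'d']
print(label_slice.tolist())   # [2, 3, 4]

# Element-wise arithmetic keeps the index
doubled = s * 2

# Boolean condition: keep only values greater than 3
selected = s[s > 3]
print(selected.index.tolist())   # ['d', 'e']
```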


### 7.2.2. Pandas DataFrame

- A Pandas DataFrame is a two-dimensional data structure that can hold data of any type (integer, string, float, Python object, etc.). It is similar to a table in a spreadsheet and it has labeled axes (i.e. rows and columns).

In [17]:
# Create a DataFrame from a Dictionary
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
print(df)

   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9


In [18]:
# Create a DataFrame from a Numpy Array
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [19]:
# Create a DataFrame from a list
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [21]:
# Create a DataFrame from Series objects
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series(['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({'numbers': series1, 'letters': series2})
print(df)

   numbers letters
0        1       a
1        2       b
2        3       c
3        4       d
4        5       e
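
The labeled axes mentioned at the start of this section can also be set explicitly. As a minimal sketch with hypothetical row and column labels, the example below passes `index` and `columns` when building a DataFrame from a NumPy array and then reads data back by label.

```python
import numpy as np
import pandas as pd

# Row labels go in `index`, column labels in `columns`
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                  index=['row1', 'row2'],
                  columns=['a', 'b', 'c'])
print(df)
print(df.shape)               # (2, 3)

# Selecting one column returns a Series that keeps the row labels
col = df['b']
print(type(col).__name__)     # Series
print(col.tolist())           # [2, 5]

# A single value is addressed by row label and column label
value = df.loc['row2', 'c']
print(value)                  # 6
```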
