# Lesson 07: NumPy and Pandas Modules

NumPy and Pandas are two of the most widely used Python libraries in data science and machine learning. In this lesson, we will learn how to use them.

## 7.1. NumPy
- [NumPy - Official Documentation](https://numpy.org/)
- [Python Numpy Tutorial for Beginners on freeCodeCamp](https://youtu.be/QUT1VHiLmmI)

<br>

- NumPy is a library that provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- It is very popular in data science and machine learning.
- It is not part of Python's standard library, so we need to install it first: `pip install numpy`.
- syntax: `import numpy as np`
  - the `as` keyword gives the imported library an alias, and `np` is the conventional alias for NumPy.

### Why NumPy
> From the NumPy documentation:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.

- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.
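
The first two points above (fixed size, one data type per array) can be seen in a small sketch. The variable names are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)        # returns a NEW array; `a` keeps its size

print(a.shape, b.shape)    # (3,) (4,)

# Mixed numeric input is upcast so that every element has the same type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)         # float64

# The exception: an object array can hold arbitrary Python objects
objects = np.array([1, "two", 3.0], dtype=object)
print(objects.dtype)       # object
```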

### Why is NumPy Fast?
> From the NumPy documentation:

- Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

    - vectorized code is more concise and easier to read

    - fewer lines of code generally means fewer bugs

    - the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)

    - vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.

- Broadcasting is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast. 
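
As a rough illustration of both ideas, the sketch below squares an array with an explicit loop and with a vectorized expression, then adds a scalar and a row to a 2-D array via broadcasting. The arrays are hypothetical sample data.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Explicit Python loop: one element at a time
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2

# Vectorized: the loop runs inside NumPy's compiled code
squared_vec = values ** 2
print(squared_vec)         # [ 1.  4.  9. 16.]

# Broadcasting: the scalar and the 1-D row are stretched to the 2-D shape
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([100, 200, 300])

print(matrix + 10)         # adds 10 to every element
print(matrix + row)        # adds `row` to each row of `matrix`
```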

### 7.1.1. Basic NumPy Operations

- Create NumPy Array

In [9]:
# import numpy
import numpy as np

list1 = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
]

list2 = [
    [2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11]
]

list1 + list2

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [2, 3, 4, 5, 6], [7, 8, 9, 10, 11]]

In [10]:
list1 * list2

TypeError: can't multiply sequence by non-int of type 'list'

In [11]:
np_list1 = np.array(list1)
np_list2 = np.array(list2)
np_list1 + np_list2

array([[ 3,  5,  7,  9, 11],
       [13, 15, 17, 19, 21]])

In [12]:
np_list1 * np_list2

array([[  2,   6,  12,  20,  30],
       [ 42,  56,  72,  90, 110]])

In [13]:
# Create np.array

np_array = np.array([[5.0, 4.0, 3.0, 2.0], [6.0, 7.0, 8.0, 9.0]])
print(np_array)

[[5. 4. 3. 2.]
 [6. 7. 8. 9.]]


In [14]:
# Create a NumPy Array with an explicit data type
np_array_type = np.array([[5.0, 4.0, 3.0, 2.0], [6.0, 7.0, 8.0, 9.0]], dtype='int16')  # the floats are converted to 16-bit integers
# if the data cannot be converted (e.g. a non-numeric string), NumPy raises a ValueError
print(np_array_type)

[[5 4 3 2]
 [6 7 8 9]]


**Possible NumPy Array Types**
- [Data types in NumPy](https://numpy.org/doc/stable/user/basics.types.html)

- commonly used numeric data types:
  - int8, int16, int32, int64 - signed integer types with different bit sizes
  - uint8, uint16, uint32, uint64 - unsigned integer types with different bit sizes
  - float32, float64 - floating-point types with different precision levels
  - complex64, complex128 - complex number types with different precision levels

- one-character codes for the general kind of a data type (bool, integer, string, object, etc.):
  - 'b' − boolean
  - 'i' − (signed) integer
  - 'u' − unsigned integer
  - 'f' − floating-point
  - 'c' − complex floating-point
  - 'm' − timedelta
  - 'M' − datetime
  - 'O' − (Python) objects
  - 'S', 'a' − (byte-)string
  - 'U' − Unicode
  - 'V' − raw data (void)
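
A short sketch of how the type names relate to the one-character kind codes, and of the `ValueError` raised when data cannot be converted to the requested type. The names `kinds` and `conversion_error` are only used for this illustration.

```python
import numpy as np

# Map a few type names to their kind code and size in bytes
kinds = {}
for name in ["bool", "int16", "uint8", "float32", "complex64"]:
    dt = np.dtype(name)
    kinds[name] = dt.kind
    print(name, dt.kind, dt.itemsize)

# A string that is not a number cannot be converted to int16
conversion_error = None
try:
    np.array(["abc"], dtype="int16")
except ValueError as err:
    conversion_error = err
    print("conversion failed:", err)
```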


- Info about NumPy Array

In [15]:
# Dimension of a NumPy Array
print(np_array.ndim)

# Shape of a NumPy Array
print(np_array.shape)

2
(2, 4)


In [16]:
# Type of the elements in a NumPy Array
print(np_array.dtype)
print(np_array_type.dtype)

float64
int16


In [17]:
# Size of a NumPy Array
print(np_array.size)  #  whole number of elements in the array
print(np_array_type.size)

# Number of bytes consumed by each element
print(np_array.itemsize)
print(np_array_type.itemsize)

# Number of bytes consumed by the whole array - size * itemsize
print(np_array.size * np_array.itemsize)
print(np_array_type.size * np_array_type.itemsize)

# Number of bytes consumed by the whole array
print(np_array.nbytes)
print(np_array_type.nbytes)

8
8
8
2
64
16
64
16


### 7.1.2. Manipulating NumPy Arrays


#### Accessing the elements in a NumPy Array

In [18]:
# Create np.array
array = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]])


- Accessing single elements in a NumPy Array

In [19]:
# Accessing the single elements in a NumPy Array

print(array[0][1])  # similar to lists  - arr[row][column]
print(array[0, 1])  # arr[row, column]

2
2


- Accessing multiple elements in a NumPy Array

In [20]:
# Accessing multiple elements in a NumPy Array

# the whole row
print(array[0, :])

# the whole column
print(array[:, 0])

[ 1  2  3  4  5  6  7  8  9 10]
[ 1 11]


In [21]:
# Accessing parts of a NumPy Array
# [start:stop:step]

print(array[0, 1:5:2])   # start = 1, stop = 5, step = 2
print(array[0, 1:-1:2])  # start = 1, stop = -1, step = 2
print(array[0, 1:-1])    # start = 1, stop = -1
print(array[0, 1:])      # start = 1
print(array[0, :])       # start = 0
print(array[0, 1::2])    # start = 1, step = 2

[2 4]
[2 4 6 8]
[2 3 4 5 6 7 8 9]
[ 2  3  4  5  6  7  8  9 10]
[ 1  2  3  4  5  6  7  8  9 10]
[ 2  4  6  8 10]
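
Slices can also be applied to rows and columns at the same time. The sketch below uses a separate array named `grid` (a hypothetical name, holding the values 1 to 20) and also shows that basic slicing returns a view of the original data rather than a copy.

```python
import numpy as np

grid = np.arange(1, 21).reshape(2, 10)   # 2 rows, 10 columns, values 1..20

block = grid[:, 2:5]                     # all rows, columns 2, 3 and 4
print(block)

# Basic slicing returns a view, so writing to it changes `grid` too
block[0, 0] = -1
print(grid[0, 2])                        # -1

# Use .copy() when an independent array is needed
independent = grid[:, 2:5].copy()
independent[0, 0] = 99
print(grid[0, 2])                        # still -1
```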


#### Changing the elements in a NumPy Array

- Changing single elements in a NumPy Array

In [22]:
print(array)
print(array[1,1])

array[1,1] = 100
print(array)
print(array[1,1])

[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]]
12
[[  1   2   3   4   5   6   7   8   9  10]
 [ 11 100  13  14  15  16  17  18  19  20]]
100


- Change column values in a NumPy Array

In [23]:
print(array)
print(array[:,1])  # all values in column 1

array[:,1] = 200  # all values in column 1 are set to 200
print(array)
print(array[:,1])

array[:,2] = [-1, -10]  # column 2 is set to [-1, -10]; the assigned values must match the shape of the selected slice (here 2 elements)
print(array)
print(array[:,2])


[[  1   2   3   4   5   6   7   8   9  10]
 [ 11 100  13  14  15  16  17  18  19  20]]
[  2 100]
[[  1 200   3   4   5   6   7   8   9  10]
 [ 11 200  13  14  15  16  17  18  19  20]]
[200 200]
[[  1 200  -1   4   5   6   7   8   9  10]
 [ 11 200 -10  14  15  16  17  18  19  20]]
[ -1 -10]
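
The shape rule for assignments can be sketched as follows, using a hypothetical 2x4 array named `table`: a scalar is broadcast to the whole slice, a sequence must have the same length as the slice, and a mismatch raises a `ValueError`.

```python
import numpy as np

table = np.zeros((2, 4), dtype=int)   # hypothetical 2x4 array of zeros

table[:, 0] = 7                # scalar: broadcast to both rows
table[:, 1] = [1, 2]           # one value per row (length 2 matches)
table[0, :] = [9, 9, 9, 9]     # a whole row needs 4 values here

shape_error = None
try:
    table[:, 2] = [1, 2, 3]    # 3 values for a column of length 2
except ValueError as err:
    shape_error = err
    print("assignment failed:", err)

print(table)
```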



#### Initialization of the different types of NumPy Arrays

In [24]:
# Zeros and Ones
zeros = np.zeros((2, 3))
print(zeros)

ones = np.ones((2, 3))
print(ones)

[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]


In [25]:
# Create a matrix filled with a single value
np.full((2, 3), 5)  # np.full(shape, value)

array([[5, 5, 5],
       [5, 5, 5]])

In [26]:
# Create a matrix filled with a single value, with the same shape as an existing array
np.full_like(array, 5)  # np.full_like(arr, value)

array([[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]])
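
One detail worth knowing: `np.zeros` and `np.ones` produce floats unless told otherwise, while `np.full_like` reuses the data type of the template array. A minimal sketch with hypothetical variable names:

```python
import numpy as np

ints = np.array([[1, 2, 3], [4, 5, 6]])

# The fill value is cast to the integer dtype of `ints`
same_dtype = np.full_like(ints, 5.7)

# Pass dtype explicitly to keep the fractional part
as_float = np.full_like(ints, 5.7, dtype=float)

# np.zeros / np.ones default to float64; dtype changes that
zeros_int = np.zeros((2, 3), dtype=int)

print(same_dtype)
print(as_float)
print(zeros_int.dtype)
```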

#### Random NumPy Arrays

- Sometimes we need random NumPy Arrays, that is, arrays filled with random values.

In [27]:
# Matrix with Random Values
np.random.rand(4, 2)  # np.random.rand(rows, columns)

array([[0.40250093, 0.55877061],
       [0.33287392, 0.33485237],
       [0.26858954, 0.04531348],
       [0.412077  , 0.99155946]])

In [28]:
# Another sample
np.random.random_sample(array.shape)  # taking the shape of the already defined array

array([[0.93097477, 0.06073106, 0.59862867, 0.63921459, 0.06054933,
        0.60045618, 0.34726114, 0.68195655, 0.42127051, 0.48311008],
       [0.83001066, 0.9756626 , 0.26628853, 0.28270678, 0.02632813,
        0.32194356, 0.74466612, 0.71596709, 0.99662224, 0.28156686]])

In [29]:
# Random Integer values
np.random.randint(low=5, high=10, size=(2, 3))  # np.random.randint(low, high, size): integers from low (inclusive) to high (exclusive)

array([[8, 6, 5],
       [5, 7, 5]])

In [30]:
np.random.randint(6, size=(3, 3))  # with a single bound: integers from 0 (inclusive) to 6 (exclusive)

array([[1, 1, 3],
       [2, 0, 4],
       [1, 2, 0]])
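
Random values differ on every run. When reproducible results are needed, one option is NumPy's `Generator` interface created with `np.random.default_rng`. This is a sketch of an alternative to the `np.random.*` functions used above, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

floats = rng.random((4, 2))                        # uniform floats in [0, 1)
ints = rng.integers(low=5, high=10, size=(2, 3))   # integers in [5, 10): 10 is excluded

# Re-creating the generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
floats_again = rng_again.random((4, 2))

print(floats)
print(ints)
print(np.array_equal(floats, floats_again))        # True
```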

## 7.2. Pandas
- [Pandas - Official Documentation](https://pandas.pydata.org/docs/)
- [Geeks for Geeks: *Introduction to Pandas in Python*](https://www.geeksforgeeks.org/introduction-to-pandas-in-python/)
- [Complete Python Data Science Tutorial](https://www.youtube.com/watch?v=vmEHCJofslg)

- Pandas is a library for data manipulation and analysis. It is very popular in data science and machine learning.
- The name is derived from 'panel data', an econometrics term for multi-dimensional structured data sets. The central Pandas object, the DataFrame, is a 2-dimensional table with rows and columns.
- It is a very powerful library that works with tabular data, spreadsheets, databases, and time series.
- Here is a list of things that we can do using Pandas:
  - Data set cleaning, merging, and joining.
  - Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
  - Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
  - Powerful group by functionality for performing split-apply-combine operations on data sets.
  - Data Visualization.



- It is not part of Python's standard library, so we need to install it first: `pip install pandas`.
- syntax: `import pandas as pd`


In [31]:
!pip install pandas

import pandas as pd



- There are two main data structures in Pandas: Series and DataFrame.

### 7.2.1. Pandas Series

- A Pandas Series is a one-dimensional labeled array that can hold any data type (integer, string, float, Python object, etc.). It is similar to a single column of a table (e.g. one column of an Excel spreadsheet).
- The axis labels of a Pandas Series are collectively called the **index**.
- A Pandas Series is created with the `pd.Series()` constructor.

In [32]:
# Pandas Series from a List
s = pd.Series([1, 2, 3, 4, 5])
print(s)

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])  # pd.Series(data, index) - index is optional
print(s)


0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64


In [33]:
# Pandas Series from a Dictionary
s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
print(s)


a    1
b    2
c    3
d    4
e    5
dtype: int64


In [34]:
# Pandas Series from a Numpy Array
import numpy as np
s = pd.Series(np.array([1, 2, 3, 4, 5]))
print(s)

s = pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'])  # pd.Series(data, index) - index is optional
print(s)

s = pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'], name='numbers')  # and we can give it a name
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
Name: numbers, dtype: int64


In [35]:
print(s.values)
print(s.index)
print(s.name)
print(s['a'])

[1 2 3 4 5]
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
numbers
1
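
Beyond `s['a']`, a Series supports access by label or by position, boolean filtering, and element-wise arithmetic. A small sketch with a hypothetical Series named `scores`:

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"], name="scores")

print(scores.loc["b"])        # by label
print(scores.iloc[1])         # by position (0-based)
print(scores[scores > 15])    # boolean filter keeps b, c, d
print(scores * 2)             # arithmetic is applied element by element
print(scores.loc["b":"c"])    # label slices include BOTH endpoints
```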


### 7.2.2. Pandas DataFrame

- A Pandas DataFrame is a two-dimensional data structure that can hold data of any type (integer, string, float, Python object, etc.). It is similar to a table in a spreadsheet and it has labeled axes (i.e. rows and columns).

In [36]:
# Create a DataFrame from a Dictionary
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
print(df)

   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9


In [37]:
# Create a DataFrame from a Numpy Array
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [38]:
# Create a DataFrame from a list
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [39]:
# Create a DataFrame from the Series
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series(['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({'numbers': series1, 'letters': series2})
print(df)

   numbers letters
0        1       a
1        2       b
2        3       c
3        4       d
4        5       e
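
When a DataFrame is built from a list or a NumPy array, the rows and columns are numbered 0, 1, 2, ... by default. The sketch below, with made-up labels, shows how to supply names through the `columns` and `index` parameters, and how a list of dictionaries can be used as input.

```python
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Name the columns and the rows explicitly
named = pd.DataFrame(rows, columns=["a", "b", "c"], index=["x", "y", "z"])
print(named)

# A list of dictionaries also works: the keys become column names
records = pd.DataFrame([{"name": "Ann", "age": 31}, {"name": "Bob", "age": 27}])
print(records)
```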


### 7.2.3. Working with Pandas DataFrame from CSV File

- [Pandas - Reading CSV Files](https://pandas.pydata.org/docs/user_guide/io.html)

We will be using the [Titanic Dataset from Kaggle](https://www.kaggle.com/datasets/yasserh/titanic-dataset)

In [40]:
# import necessary libraries
import pandas as pd
import numpy as np

# load dataset
df = pd.read_csv('./Titanic-Dataset.csv')
df

(index),PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


We want to use `PassengerId` as the index. By default, `set_index()` returns a new DataFrame and leaves the original unchanged, e.g. `new_df = df.set_index('PassengerId')`. With the optional `inplace=True` parameter, the original DataFrame is modified instead.

In [41]:
df.set_index(df['PassengerId'], inplace=True)
df

PassengerId (index),PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
887,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


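Because we passed the Series `df['PassengerId']` rather than the column name, `PassengerId` is now both the index and a regular column, so we can drop the column. Note that positional access (0...n-1) is still available through `iloc` when needed.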

In [42]:
df.drop('PassengerId', axis=1, inplace=True)
df

PassengerId (index),Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [43]:
df.Survived

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

To get a row, we can use the `loc` indexer. `loc` selects rows by label, i.e. by the values of the new `PassengerId` index.

In [44]:
df.loc[1]  # loc - label-based: the row whose index label is 1

Survived                          0
Pclass                            3
Name        Braund, Mr. Owen Harris
Sex                            male
Age                            22.0
SibSp                             1
Parch                             0
Ticket                    A/5 21171
Fare                           7.25
Cabin                           NaN
Embarked                          S
Name: 1, dtype: object

`iloc` selects rows by integer position (starting at 0), regardless of the index labels.

In [45]:
df.iloc[1]  # iloc - integer location: position 1 is the second row

Survived                                                    1
Pclass                                                      1
Name        Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                    female
Age                                                      38.0
SibSp                                                       1
Parch                                                       0
Ticket                                               PC 17599
Fare                                                  71.2833
Cabin                                                     C85
Embarked                                                    C
Name: 2, dtype: object
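
The difference between labels and positions is easier to see on a tiny table. The sketch below uses a hypothetical three-row DataFrame named `people` and passes the column name to `set_index`, which moves the column into the index in one step.

```python
import pandas as pd

people = pd.DataFrame({"id": [101, 102, 103], "name": ["Ann", "Bob", "Cid"]})

# Passing the column NAME moves the column into the index
# and returns a new DataFrame (the original is untouched)
by_id = people.set_index("id")

print(by_id.loc[102, "name"])    # label 102
print(by_id.iloc[0]["name"])     # first row by position
print(list(by_id.columns))       # 'id' is now the index, not a column
print(list(people.columns))      # the original still has both columns
```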

**1. Data Inspection**

We can use the `info()` method to get a summary of the dataset, including the data type of each column, the number of non-null values, and the memory usage.

In [46]:
df.info()  # info() prints the summary itself and returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 115.8+ KB


**2. Data Shape**

We can use the shape attribute to get the number of rows and columns in the dataset.

In [49]:
print(df.shape)  # (891, 11) -> rows, columns
print(df.shape[0])  # rows
print(df.shape[1])  # columns

(891, 11)
891
11


**3. Data Types**

We can use the dtypes attribute to get the data type of each column.

In [50]:
print(df.dtypes)

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object


**4. Selecting Columns**

We can use the [] operator to select specific columns from the dataset.

In [51]:
# Select a subset of columns by passing a list of column names
selected_cols = df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch']]
print(selected_cols.head())

             Survived  Pclass   Age  SibSp  Parch
PassengerId                                      
1                   0       3  22.0      1      0
2                   1       1  38.0      1      0
3                   1       3  26.0      0      0
4                   1       1  35.0      1      0
5                   0       3  35.0      0      0
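
Listing the columns to keep is one approach. When it is shorter to name the columns to leave out, `drop(columns=...)` or `select_dtypes` can be used instead. A sketch on a small hypothetical DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Age": [31, 27],
    "Ticket": ["A1", "B2"],
    "Fare": [7.25, 8.05],
})

# Everything except the listed columns (returns a new DataFrame)
without_text = frame.drop(columns=["Name", "Ticket"])

# Alternatively, pick columns by data type
numeric_only = frame.select_dtypes(include="number")

print(list(without_text.columns))
print(list(numeric_only.columns))
```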


**5. Filtering Rows**

We can filter rows by passing a boolean condition to the `[]` operator.

The `loc[]` indexer accepts the same boolean condition. `iloc[]` is position-based and does not accept a boolean Series directly.

In [53]:
# Filter rows where 'Age' is greater than 50
filtered_df = df[df['Age'] > 50]
print(filtered_df.head())

# the same filter with loc

filtered_df = df.loc[df['Age'] > 50]
print(filtered_df.head())

             Survived  Pclass                              Name     Sex   Age  \
PassengerId                                                                     
7                   0       1           McCarthy, Mr. Timothy J    male  54.0   
12                  1       1          Bonnell, Miss. Elizabeth  female  58.0   
16                  1       2  Hewlett, Mrs. (Mary D Kingcome)   female  55.0   
34                  0       2             Wheadon, Mr. Edward H    male  66.0   
55                  0       1    Ostby, Mr. Engelhart Cornelius    male  65.0   

             SibSp  Parch      Ticket     Fare Cabin Embarked  
PassengerId                                                    
7                0      0       17463  51.8625   E46        S  
12               0      0      113783  26.5500  C103        S  
16               0      0      248706  16.0000   NaN        S  
34               0      0  C.A. 24579  10.5000   NaN        S  
55               0      1      113509  61.9792  
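
Conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a small hypothetical DataFrame named `passengers` rather than the Titanic file.

```python
import pandas as pd

passengers = pd.DataFrame({
    "Age": [54.0, 22.0, 58.0, None, 66.0],
    "Sex": ["male", "male", "female", "female", "male"],
    "Pclass": [1, 3, 1, 3, 2],
})

# Both conditions must hold; a missing Age never satisfies Age > 50
older_men = passengers[(passengers["Age"] > 50) & (passengers["Sex"] == "male")]

# isin() tests membership in a list of values
not_third = passengers[passengers["Pclass"].isin([1, 2])]

# loc can filter rows and select columns in one step
ages_only = passengers.loc[passengers["Age"] > 50, ["Age", "Pclass"]]

print(older_men)
print(not_third)
print(ages_only)
```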

**6. Sorting Rows**

We can use the sort_values() method to sort rows based on one or more columns.

In [56]:
# Sort rows by 'Age' in ascending order
sorted_df = df.sort_values(by='Age')
print(sorted_df.head())

# sort in descending order
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df.head())

             Survived  Pclass                             Name     Sex   Age  \
PassengerId                                                                    
804                 1       3  Thomas, Master. Assad Alexander    male  0.42   
756                 1       2        Hamalainen, Master. Viljo    male  0.67   
645                 1       3           Baclini, Miss. Eugenie  female  0.75   
470                 1       3    Baclini, Miss. Helene Barbara  female  0.75   
79                  1       2    Caldwell, Master. Alden Gates    male  0.83   

             SibSp  Parch  Ticket     Fare Cabin Embarked  
PassengerId                                                
804              0      1    2625   8.5167   NaN        C  
756              1      1  250649  14.5000   NaN        S  
645              2      1    2666  19.2583   NaN        C  
470              2      1    2666  19.2583   NaN        C  
79               0      2  248738  29.0000   NaN        S  
             Surviv
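
`sort_values()` also accepts several columns, each with its own direction, and the `na_position` parameter controls where missing values are placed. A sketch on hypothetical data:

```python
import pandas as pd

frame = pd.DataFrame({"Pclass": [3, 1, 3, 1], "Age": [22.0, 38.0, 4.0, 35.0]})

# Sort by class (ascending), then by age within each class (descending)
ordered = frame.sort_values(by=["Pclass", "Age"], ascending=[True, False])
print(ordered)

# Missing values are placed last by default; na_position changes that
with_gap = pd.DataFrame({"Age": [30.0, None, 10.0]})
nan_first = with_gap.sort_values(by="Age", na_position="first")
print(nan_first)
```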

**7. Grouping and Aggregating**

We can use the groupby() method to group rows based on one or more columns and then apply aggregation functions to each group.

In [57]:
# Group by 'Pclass' and calculate the mean of 'Age'
grouped_df = df.groupby('Pclass')['Age'].mean()  # mean() returns the arithmetic average of each group
print(grouped_df)

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
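
Several aggregations can be computed in one pass with `agg()`. The sketch below uses named aggregation on a small hypothetical DataFrame. The result column names (`mean_age`, `max_fare`, `passengers`) are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Age": [40.0, 30.0, 28.0, 20.0, None],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
})

# result_column=(source_column, aggregation)
summary = frame.groupby("Pclass").agg(
    mean_age=("Age", "mean"),      # missing ages are skipped
    max_fare=("Fare", "max"),
    passengers=("Fare", "count"),  # number of non-missing fares per class
)
print(summary)
```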


**8. Merging DataFrames**

We can use the `pd.merge()` function (or the equivalent `DataFrame.merge()` method) to combine two DataFrames based on a common column.

In [69]:
# Load the Titanic Dataset again
df = pd.read_csv('./Titanic-Dataset.csv')

# Let us make a dataset from the Titanic Dataset with only survived passengers
survived_passengers = df[df['Survived'] == 1]
print(survived_passengers.head())

# export the dataset to CSV
survived_passengers.to_csv('survived_passengers.csv', index=False)

   PassengerId  Survived  Pclass  \
1            2         1       1   
2            3         1       3   
3            4         1       1   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   

   Parch            Ticket     Fare Cabin Embarked  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
8      2            347742  11.1333   NaN        S  
9      0            237736  30.0708   NaN        C  


In [61]:
# Merge the two sets on the 'PassengerId' column
# (columns that exist in both DataFrames get the suffixes _x and _y)
merged_df = pd.merge(df, survived_passengers, on='PassengerId')
print(merged_df.head())

   PassengerId  Survived_x  Pclass_x  \
0            2           1         1   
1            3           1         3   
2            4           1         1   
3            9           1         3   
4           10           1         2   

                                              Name_x   Sex_x  Age_x  SibSp_x  \
0  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38.0        1   
1                             Heikkinen, Miss. Laina  female   26.0        0   
2       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35.0        1   
3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female   27.0        0   
4                Nasser, Mrs. Nicholas (Adele Achem)  female   14.0        1   

   Parch_x          Ticket_x   Fare_x  ... Pclass_y  \
0        0          PC 17599  71.2833  ...        1   
1        0  STON/O2. 3101282   7.9250  ...        3   
2        0            113803  53.1000  ...        1   
3        2            347742  11.1333  ...        3   
4  
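
The merge above keeps only the passengers found in both DataFrames, which is the default inner join. The sketch below shows the `how`, `indicator`, and `suffixes` parameters on two small hypothetical DataFrames.

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2, 3], "Name": ["Ann", "Bob", "Cid"]})
right = pd.DataFrame({"PassengerId": [2, 3, 4], "Fare": [71.28, 7.92, 53.10]})

# Default how="inner": only keys present in both frames
inner = pd.merge(left, right, on="PassengerId")

# how="outer" keeps all keys; indicator=True adds a _merge column
# that tells where each row came from
outer = pd.merge(left, right, on="PassengerId", how="outer", indicator=True)

# Custom suffixes for non-key columns that exist in both frames
renamed = pd.merge(left, left, on="PassengerId", suffixes=("_left", "_right"))

print(inner)
print(outer)
print(list(renamed.columns))
```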

**9. Handling Missing Values**

We can use various methods to handle missing values, such as replacing them with a specific value or imputing them using a statistical model.

In [71]:
# Display rows with missing values in 'Age'
missing_age_df = df[df['Age'].isnull()]
print(missing_age_df.head())

# Replace missing values in 'Age' with 0
# (assign the result back; the chained form df['Age'].fillna(0, inplace=True) is deprecated)
df['Age'] = df['Age'].fillna(0)
print(df.head())

Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.92

Note: the chained form `df['Age'].fillna(0, inplace=True)` emits a `FutureWarning` in recent pandas 2.x releases. `df['Age']` is an intermediate object that behaves as a copy, so from pandas 3.0 on the chained in-place call will no longer modify `df`. To fill the original DataFrame, assign the result back with `df['Age'] = df['Age'].fillna(0)` or call `df.fillna({'Age': 0}, inplace=True)`.


In [72]:
# display rows with missing values in 'Age' again
missing_age_df = df[df['Age'].isnull()]
print(missing_age_df.head())  # no more missing values

# display rows with 'Age' value 0
zero_age_df = df[df['Age'] == 0]
print(zero_age_df.head())


Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
    PassengerId  Survived  Pclass                           Name     Sex  Age  \
5             6         0       3               Moran, Mr. James    male  0.0   
17           18         1       2   Williams, Mr. Charles Eugene    male  0.0   
19           20         1       3        Masselmani, Mrs. Fatima  female  0.0   
26           27         0       3        Emir, Mr. Farred Chehab    male  0.0   
28           29         1       3  O'Dwyer, Miss. Ellen "Nellie"  female  0.0   

    SibSp  Parch  Ticket     Fare Cabin Embarked  
5       0      0  330877   8.4583   NaN        Q  
17      0      0  244373  13.0000   NaN        S  
19      0      0    2649   7.2250   NaN        C  
26      0      0    2631   7.2250   NaN        C  
28      0      0  330959   7.8792   NaN        Q  


Filling with 0 is rarely a good choice for 'Age', because it pulls statistics such as the mean downwards. A common alternative is to replace the missing values with the mean of the 'Age' column.

In [78]:
# Load the Titanic Dataset again
df = pd.read_csv('./Titanic-Dataset.csv')

# display rows with missing values for 'Age'
missing_age_df = df[df['Age'].isnull()]
print(missing_age_df.head())

    PassengerId  Survived  Pclass                           Name     Sex  Age  \
5             6         0       3               Moran, Mr. James    male  NaN   
17           18         1       2   Williams, Mr. Charles Eugene    male  NaN   
19           20         1       3        Masselmani, Mrs. Fatima  female  NaN   
26           27         0       3        Emir, Mr. Farred Chehab    male  NaN   
28           29         1       3  O'Dwyer, Miss. Ellen "Nellie"  female  NaN   

    SibSp  Parch  Ticket     Fare Cabin Embarked  
5       0      0  330877   8.4583   NaN        Q  
17      0      0  244373  13.0000   NaN        S  
19      0      0    2649   7.2250   NaN        C  
26      0      0    2631   7.2250   NaN        C  
28      0      0  330959   7.8792   NaN        Q  


In [80]:
# get the indices of rows with missing values
missing_age_indices = missing_age_df.index
print(missing_age_indices)

# Replace missing values for 'Age' with the column mean
# (assign the result back; the chained inplace=True form is deprecated)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df.head())

Index([  5,  17,  19,  26,  28,  29,  31,  32,  36,  42,
       ...
       832, 837, 839, 846, 849, 859, 863, 868, 878, 888],
      dtype='int64', length=177)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2 



In [84]:
# let us now check a few rows with previously missing values for 'Age' from missing_age_indices
# for index in missing_age_indices:
#     print(df.loc[index])
#     print()

# collect the rows that previously had a missing 'Age' in a new DataFrame
new_df = df.loc[missing_age_indices]
new_df

(index),PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,29.699118,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,29.699118,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,29.699118,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,29.699118,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,29.699118,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,29.699118,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,29.699118,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,29.699118,0,0,349217,7.8958,,S
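
The fill-with-mean step can be sketched on a small hypothetical DataFrame. It counts the missing values first and then fills them by assigning the result back to the column, or by passing a per-column dictionary to `DataFrame.fillna`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Fare": [7.25, 8.05, 71.28, 13.0],
})

missing_count = frame["Age"].isna().sum()   # number of missing ages
mean_age = frame["Age"].mean()              # NaN is skipped: (22 + 38) / 2

# Option 1: assign the filled column back
filled = frame.copy()
filled["Age"] = filled["Age"].fillna(mean_age)

# Option 2: fill per column on the DataFrame (returns a new DataFrame)
filled_alt = frame.fillna({"Age": mean_age})

print(missing_count)
print(filled)
```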
