# Overview
Data manipulation is the process of cleaning, organizing, and transforming raw data into a form that is suitable for analysis and modeling. This section provides an overview of the key techniques involved in data manipulation: indexing, slicing, filtering, and sorting.

Data manipulation is an important part of any data analysis or data science project for several reasons:

* Data preparation: Raw data is often in a format that is not suitable for analysis and modeling. Data manipulation helps to clean, organize, and transform the data into a usable form.
* Data quality: Data manipulation improves the quality of the data by dealing with inconsistent or incorrect values and by converting the data into a format that is compatible with analysis tools.
* Efficient analysis: Data in a usable form is easier to analyze efficiently and accurately, which is crucial for making informed decisions.

By the end of this course, you will have a clear understanding of the purpose and importance of data manipulation, and how these techniques are used to transform raw data into actionable insights.

In this module, we will cover the following topics:

I. Indexing: Selecting specific elements or rows from a dataset.
II. Slicing: Selecting a range of rows or columns from a dataset.
III. Filtering: Selecting specific rows from a dataset based on certain conditions.
IV. Sorting: Arranging the data in a specific order, either ascending or descending.

## Learning Objectives
In this module, the learners will:

* Understand how to locate and extract specific data from a DataFrame.
* Analyze the process of filtering a DataFrame based on conditions.
* Apply the methods of sorting data in a DataFrame.
* Understand how to manipulate and analyze data in a DataFrame.

Let's get started!

## Dataset
Titanic dataset: This is a well-known and widely used dataset in the field of data analysis and machine learning. This dataset contains information about the passengers on the Titanic ship, including their demographic information, ticket information, and survival status. In this exercise, we're using the Titanic dataset to demonstrate indexing, slicing, filtering, and sorting of data through Pandas functions.

Here's a description of the columns in the dataset:

* PassengerId: This column is a unique identifier assigned to each passenger.
* Age: This column specifies the age of the passenger.
* Name: This column specifies the name of the passenger.
* Sex: This column specifies the sex of the passenger ('male' or 'female').
* Survived: This column specifies whether the passenger survived the Titanic disaster or not. The values in this column can either be 0 (did not survive) or 1 (survived).
* Pclass (Passenger Class): This column specifies the class of the passenger (1st, 2nd, or 3rd class).
* SibSp (Siblings/Spouses Aboard): This column specifies the number of siblings or spouses the passenger was traveling with.
* Parch (Parents/Children Aboard): This column specifies the number of parents or children the passenger was traveling with.
* Ticket: This column specifies the ticket number assigned to the passenger.
* Fare: This column specifies the fare paid by the passenger for their ticket.
* Cabin: This column specifies the cabin number assigned to the passenger.
* Embarked: This column specifies the port where the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, or S = Southampton).

# Indexing
## What is indexing?
Indexing in data manipulation refers to the process of selecting a specific subset of data in a DataFrame or Series. It is a fundamental aspect of data analysis and manipulation, as it allows you to extract only the relevant information from a large dataset, and perform operations on that subset of data.

## Why is it important?
Indexing is important because it enables us to extract meaningful insights from large datasets efficiently. With indexing, we can focus on a specific subset of data that is relevant to our analysis, which can save time and computing resources. Indexing also allows us to filter out irrelevant data and clean our data before analysis.

Suppose you work for a healthcare organization that collects large amounts of patient data daily, including information on medical history, treatments, lab results, and more. As a data analyst, your task is to extract insights from this data to help the organization improve patient outcomes and operational efficiency.

To do this, you need to be able to select and analyze specific subsets of data that are relevant to the analysis, such as patients with a specific medical condition or lab results for a particular test. Without indexing, it would be difficult and time-consuming to extract the relevant data needed for analysis. Thus, indexing enables analysts to quickly and efficiently select data based on specific criteria, making it easier to draw meaningful insights and make data-driven decisions.

## Indexing based on columns
Column indexing allows us to select one or more columns from a DataFrame by specifying the column labels. It can be done using bracket notation with a single-column label or a list of column labels. The resulting object is a Pandas Series or DataFrame, depending on the number of columns selected.

For example, we can select the 'Age' column from the Titanic DataFrame using the bracket notation with a single label:

In [16]:
import pandas as pd

# read in the Titanic dataset
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# select a single column using bracket notation
single_column = df['Age']
print(single_column)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


In this example, we use the column label 'Age' to select the 'Age' column of the DataFrame. The resulting 'single_column' is a Pandas Series because only one column was selected.

## Indexing by labels
Label-based indexing is a method of selecting data from a Pandas DataFrame based on labels, rather than numeric indices. This can be useful when you want to select data based on specific row and column labels, rather than numeric positions. Indexing by labels can be performed using the '.loc[ ]' accessor in Pandas.

For example, suppose you are working with the Titanic dataset, which contains information about the passengers who were onboard the Titanic when it sank. You might want to select a single element from the dataset using its row and column labels.

In [17]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv') 

# Access the 'Name' in the row with index label 1 (the passenger with 'PassengerId' 2)
element_df_label = df.loc[1, 'Name']
print(element_df_label)

Cumings, Mrs. John Bradley (Florence Briggs Thayer)


In this example, we use the '.loc[ ]' accessor to select a single element from the DataFrame. The '.loc[ ]' accessor takes two arguments: the first argument specifies the row label to select (in this case, row label 1), and the second argument specifies the column label to select (in this case, the 'Name' column).
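Row labels and 'PassengerId' values are easy to confuse, because the default row labels start at 0 while 'PassengerId' starts at 1. The sketch below illustrates the difference on a small hand-built DataFrame (an assumption made so the example runs without downloading the dataset; the names mirror the first rows of the Titanic data). The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina'],
})

# With the default index, label 1 is the second row (PassengerId 2)
name_default = sample.loc[1, 'Name']

# After set_index(), the row labels are the PassengerId values themselves
by_id = sample.set_index('PassengerId')
name_by_id = by_id.loc[1, 'Name']

print(name_default)
print(name_by_id)
```

If you want to look passengers up by their identifier, setting 'PassengerId' as the index first is one way to make '.loc[ ]' behave that way.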

## Indexing by position
In position-based indexing, we use integer-based locations to select data. In the context of the Titanic dataset, an example of position-based indexing can be seen in selecting the first row of the DataFrame using the '.iloc[ ]' accessor and specifying the row index as 0. This can be done using the following code:

In [18]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Select the first row of the DataFrame
first_row = df.iloc[0]
print(first_row)

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object


In this code, the '.iloc[ ]' accessor takes a single argument, which specifies the integer position of the row to select. In this case, we pass the value 0 as the argument to select the first row of the DataFrame. We then assign the resulting row to a new variable 'first_row'.

### NOTE
'.loc[ ]' and '.iloc[ ]' are two accessors in Pandas that are used to index and slice a DataFrame. '.loc[ ]' selects data by row and column labels, while '.iloc[ ]' selects data by integer positions. Both can select a single value, whole rows or columns, or a range of rows and columns.
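With the default index, labels and positions happen to coincide, which hides the difference between the two accessors. The following sketch uses a small DataFrame with made-up row labels (10, 20, 30, chosen purely for illustration) so that the difference becomes visible.

```python
import pandas as pd

# Hypothetical sample with non-default row labels
sample = pd.DataFrame(
    {'Name': ['Braund', 'Cumings', 'Heikkinen'], 'Age': [22.0, 38.0, 26.0]},
    index=[10, 20, 30],
)

by_label = sample.loc[20, 'Age']    # row labelled 20, column labelled 'Age'
by_position = sample.iloc[1, 1]     # second row, second column

print(by_label, by_position)

# Position 1 exists, but no row carries the label 1
try:
    sample.loc[1]
    label_found = True
except KeyError:
    label_found = False
print(label_found)
```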

## Indexing by values
Value-based indexing refers to selecting a single value from a Pandas DataFrame using its row and column labels or positions. It is done using the '.at' and '.iat' accessors.

In the Titanic dataset, if we want to extract the age of the passenger in row 4, we can use value-based indexing with the '.at' accessor. Alternatively, we can use the '.iat' accessor to get the same value using row and column positions instead of labels. Here's an example:

In [19]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# select a single value from the DataFrame using .at and .iat
value_with_at = df.at[4, 'Age']
print(value_with_at)
value_with_iat = df.iat[4, 5]
print(value_with_iat)

35.0
35.0


In this example, we use the '.at' accessor to select the value at row label 4 in the 'Age' column and, alternatively, the '.iat' accessor to select the same value by position: the 5th row and 6th column (positions 4 and 5, counting from 0).

### NOTE
Both '.at' and '.iat' retrieve a single value from a Pandas DataFrame: '.at' looks it up by row and column labels, while '.iat' looks it up by integer positions. Because '.iat' ignores labels entirely, it will silently return a different value if the rows or columns are reordered, whereas '.at' raises a KeyError when a label does not exist. It is therefore important to choose the accessor that matches how you identify the value, by label or by position, and to make sure that hard-coded positions still line up with the data being accessed.
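One way to avoid hard-coding a column position for '.iat' is to look it up from the column label with 'columns.get_loc()'. The sketch below assumes a small hand-built DataFrame (not the full dataset), so the position of 'Age' differs from the Titanic example above.

```python
import pandas as pd

# Hypothetical five-row sample; ages mirror the first rows of the Titanic data
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Derive the integer position of 'Age' instead of hard-coding it
age_pos = sample.columns.get_loc('Age')

value_with_at = sample.at[4, 'Age']
value_with_iat = sample.iat[4, age_pos]

print(age_pos, value_with_at, value_with_iat)
```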

# Slicing
## What is slicing?
Slicing in data manipulation refers to the process of selecting a specific range of data in a DataFrame or Series. Similar to indexing, slicing is an important tool for extracting meaningful insights from large datasets. With slicing, we can focus on a specific range of data that is relevant to our analysis, allowing us to save time and computing resources.

## Why is it important?
Suppose you are working with a large dataset containing information on customer transactions, including purchase dates, items purchased, and prices. You want to analyze the sales data for a specific time period, such as the previous month, to identify trends and patterns. Slicing allows you to extract a subset of the data that falls within the desired time frame, allowing you to focus your analysis on the relevant information.

Without slicing, you would need to manually search through the entire dataset to find the relevant transactions, which would be time-consuming and prone to errors. By using slicing, you can quickly and efficiently extract the relevant data and perform analysis on it.

## Slicing based on columns
Column slicing is the process of selecting a specific subset of columns from a DataFrame. It allows us to work with a smaller set of data that is relevant to our analysis.

In [20]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Selecting multiple columns by specifying column labels
column_sliced = df[['Name', 'Age']]
print(column_sliced.head())

                                                Name   Age
0                            Braund, Mr. Owen Harris  22.0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2                             Heikkinen, Miss. Laina  26.0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0
4                           Allen, Mr. William Henry  35.0


This code will return the 'Name' and 'Age' columns of the DataFrame. The result will be a new DataFrame that contains only the specified columns.

### CAVEAT
It's important to note that there is a difference between column indexing and column slicing in Pandas. When using the indexing operator '[ ]', passing a single column label results in column indexing and returns a Pandas Series, whereas passing a list of column labels results in column slicing and returns a Pandas DataFrame.
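The difference in return type can be checked directly. This is a minimal sketch on a hand-built sample (an assumption, used so the code runs without the dataset); note that a list containing a single label still returns a DataFrame.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry'],
    'Age': [22.0, 26.0, 35.0],
})

as_series = sample['Age']     # single label: Series
as_frame = sample[['Age']]    # one-element list: DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```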

## Slicing by labels
Label-based slicing is a way to select a continuous subset of data in a DataFrame based on the index labels. Here's an example:

In [21]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# select a subset of the data using .loc
subset_df_label = df.loc[0:4, ['Name', 'Age']]
print(subset_df_label)

                                                Name   Age
0                            Braund, Mr. Owen Harris  22.0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2                             Heikkinen, Miss. Laina  26.0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0
4                           Allen, Mr. William Henry  35.0


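In this example, we use the '.loc[ ]' accessor to select the 'Name' and 'Age' columns for the rows labeled 0 through 4. Unlike ordinary Python slices, label-based slices with '.loc[ ]' include both endpoints, so '0:4' returns 5 rows.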
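A label-based slice includes its end label, whereas a position-based slice excludes its end position. The sketch below compares the two on a small hand-built sample (an assumption; the ages mirror the first five rows shown above).

```python
import pandas as pd

# Hypothetical five-row sample with the default integer index
sample = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0, 35.0]})

by_label = sample.loc[0:4]      # labels 0, 1, 2, 3 and 4
by_position = sample.iloc[0:4]  # positions 0, 1, 2 and 3

print(len(by_label), len(by_position))
```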

The following code selects all the rows in the DataFrame except the first 5 rows and all the columns except the 'Name' and 'Age' columns:

In [22]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Select rows from index label 5 onward and every column except 'Name' and 'Age'
subset_df_label = df.loc[5:, [col for col in df.columns if col not in ['Name', 'Age']]]

# Display the selected rows
print(subset_df_label.head())

   PassengerId  Survived  Pclass     Sex  SibSp  Parch  Ticket     Fare Cabin  \
5            6         0       3    male      0      0  330877   8.4583   NaN   
6            7         0       1    male      0      0   17463  51.8625   E46   
7            8         0       3    male      3      1  349909  21.0750   NaN   
8            9         1       3  female      0      2  347742  11.1333   NaN   
9           10         1       2  female      1      0  237736  30.0708   NaN   

  Embarked  
5        Q  
6        S  
7        S  
8        S  
9        C  


In the above code, the '.loc[ ]' accessor selects all the rows with index labels from 5 onward (label 5 itself is included), while the second argument is a list comprehension that selects all the columns other than 'Name' and 'Age'.
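An alternative to the list comprehension is to slice the rows first and then remove the unwanted columns with 'drop(columns=...)'. The sketch below is one possible variant, shown on a small hand-built sample (an assumption), and checks that both approaches give the same result.

```python
import pandas as pd

# Hypothetical seven-row sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6, 7],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry',
             'Moran, Mr. James',
             'McCarthy, Mr. Timothy J'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, float('nan'), 54.0],
    'Pclass': [3, 1, 3, 1, 3, 3, 1],
})

# List-comprehension approach from the example above
with_comprehension = sample.loc[
    5:, [col for col in sample.columns if col not in ['Name', 'Age']]
]

# Alternative: slice the rows, then drop the unwanted columns
with_drop = sample.loc[5:].drop(columns=['Name', 'Age'])

print(with_drop)
print(with_drop.equals(with_comprehension))
```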

### WARNING
Please be aware that there is a key difference between label-based indexing and label-based slicing. Label-based indexing retrieves a single element or a set of elements based on the label index, while label-based slicing extracts a range of elements based on the label index.

## Slicing by position
Position-based slicing allows us to select rows and columns from a DataFrame based on their position. This means that we are selecting data based on the row and column numbers, rather than their labels. The first argument passed to the '.iloc[ ]' accessor is the row position and the second argument is the column position. We can specify a single row or column by passing a single integer, several rows or columns by passing a list of integers, or a range by passing a slice such as '0:5', which excludes its end position.

In [23]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# select a subset of the data using .iloc
subset_df_position = df.iloc[0:5, [3, 5]]
print(subset_df_position)

                                                Name   Age
0                            Braund, Mr. Owen Harris  22.0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2                             Heikkinen, Miss. Laina  26.0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0
4                           Allen, Mr. William Henry  35.0


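In this example, we use the '.iloc[ ]' accessor to select the first 5 rows of the columns at positions 3 and 5, which are the 4th and 6th columns ('Name' and 'Age'). Positions start at 0, and the slice '0:5' excludes position 5.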

### WARNING
Please note that there is a key difference between position-based slicing and position-based indexing. Position-based slicing permits the extraction of a range of elements from a sequence based on their starting and ending positions. In contrast, position-based indexing retrieves a single element from a specific location in the sequence. In the context of a DataFrame, a single element can represent a row, column, or specific value.
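Hard-coded column positions are easy to get wrong, since they count from 0 and change when columns are added or reordered. One option, sketched below on a hand-built sample (an assumption), is to derive the positions from the column labels with 'columns.get_indexer()' before passing them to '.iloc[ ]'.

```python
import pandas as pd

# Hypothetical sample with the same first six columns as the Titanic DataFrame
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
})

# Look up the integer positions of the wanted columns
positions = sample.columns.get_indexer(['Name', 'Age'])

# Position-based slice: first two rows, columns at the looked-up positions
subset = sample.iloc[0:2, positions]

print(list(positions))
print(subset)
```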

## Slicing with slice object
This is a technique used in Python to extract a subset of elements from a list, array, or Pandas DataFrame. The basic idea is to define a slice object with a start and stop value, which can then be used to extract the corresponding subset of elements from the original data structure.

Suppose we want to slice the Titanic dataset to include only rows where the passengers are in the age range of 20 to 30. We can use a slice object to define this range and then use it to slice the DataFrame as follows:

In [24]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Define the slice object for the age range of 20 to 30
age_slice = slice(20, 31)

# Slice the dataframe using the age slice object
age_range_df = df.loc[df['Age'].isin(range(age_slice.start, age_slice.stop))]

# View the sliced dataframe
print(age_range_df.head(10))

    PassengerId  Survived  Pclass  \
0             1         0       3   
2             3         1       3   
8             9         1       3   
12           13         0       3   
23           24         1       1   
34           35         0       1   
37           38         0       3   
41           42         0       2   
51           52         0       3   
53           54         1       2   

                                                 Name     Sex   Age  SibSp  \
0                             Braund, Mr. Owen Harris    male  22.0      1   
2                              Heikkinen, Miss. Laina  female  26.0      0   
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
12                     Saundercock, Mr. William Henry    male  20.0      0   
23                       Sloper, Mr. William Thompson    male  28.0      0   
34                            Meyer, Mr. Edgar Joseph    male  28.0      1   
37                           Cann, Mr. Ernest 

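In this example, the '.loc[ ]' accessor is used to select the rows where the 'Age' value falls within the range defined by the 'age_slice' object. The slice object itself only stores the bounds 20 and 31; 'range()' turns them into the whole numbers 20 through 30, and the 'isin()' method keeps the rows whose 'Age' equals one of those numbers. A fractional age such as 20.5 would therefore be excluded, even though it lies between 20 and 30.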
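For comparison, the sketch below shows a slice object passed directly to '.iloc[ ]', and a range filter based on 'Series.between()', which also keeps fractional ages. The ages are made up for illustration (an assumption, including the fractional value 20.5), and 'between()' is offered here as an alternative technique, not as the method used above.

```python
import pandas as pd

# Hypothetical sample; the ages are illustrative and include a fractional value
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Age': [22.0, 38.0, 20.5, 30.0, 19.0],
})

# A slice object can be passed straight to .iloc (same as sample.iloc[0:3])
first_three = slice(0, 3)
rows_by_slice = sample.iloc[first_three]

# isin() with range() matches whole-number ages only
whole_numbers_only = sample[sample['Age'].isin(range(20, 31))]

# between() keeps every age from 20 to 30, both bounds included
in_range = sample[sample['Age'].between(20, 30)]

print(list(rows_by_slice.index))
print(list(whole_numbers_only.index))
print(list(in_range.index))
```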

# Filtering
## What is filtering?
Filtering is the process of selecting a subset of data from a larger dataset based on some condition. It involves applying a Boolean mask or a filter to a dataset to extract only the rows or columns that meet the specified criteria.

## Why is it important?
In the context of data manipulation, filtering is an important operation because it allows us to focus on specific subsets of the data that are relevant to our analysis or modeling objectives. By filtering the data, we can remove noise and irrelevant information, and extract the relevant insights and patterns.

As a data analyst, you may need to filter data based on certain criteria to extract only the relevant information needed for analysis. Filtering allows you to select specific subsets of data that meet certain conditions, such as patients with a specific medical condition or lab results above a certain threshold.

For example, if you are working with a large dataset of patient records, you may want to filter the data to only include patients who meet certain criteria such as age, gender, or medical history. By doing so, you can focus your analysis on a subset of patients that are most relevant to your research question. Filtering is a powerful tool for data analysis that allows you to extract meaningful insights and make data-driven decisions.

## Filtering using booleans
Boolean-based filtering is a method of filtering data in a Pandas DataFrame based on one or more conditions. It works by creating a boolean mask: a boolean array of the same length as the DataFrame, with a True value for each row that satisfies the condition and a False value for each row that does not. Indexing the DataFrame with the mask returns only the rows marked True.

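Boolean-based filtering is a flexible and powerful way to filter data and allows you to combine multiple conditions using logical operators such as '&' (and), '|' (or), and '~' (not). Each condition must be wrapped in parentheses, because these operators take precedence over comparisons such as '==' and '<'.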

For example, suppose you are working with the Titanic dataset, which contains information about the passengers who were onboard the Titanic when it sank. You might want to filter the data to select only the passengers who were female and under the age of 18. You could do this using boolean-based filtering as follows:

In [25]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Filter the DataFrame to select female passengers under the age of 18
female_under_18 = (df['Sex'] == 'female') & (df['Age'] < 18)
filtered_df = df[female_under_18]

# Print the filtered DataFrame
print(filtered_df.head())

    PassengerId  Survived  Pclass                                  Name  \
9            10         1       2   Nasser, Mrs. Nicholas (Adele Achem)   
10           11         1       3       Sandstrom, Miss. Marguerite Rut   
14           15         0       3  Vestrom, Miss. Hulda Amanda Adolfina   
22           23         1       3           McGowan, Miss. Anna "Annie"   
24           25         0       3         Palsson, Miss. Torborg Danira   

       Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
9   female  14.0      1      0   237736  30.0708   NaN        C  
10  female   4.0      1      1  PP 9549  16.7000    G6        S  
14  female  14.0      0      0   350406   7.8542   NaN        S  
22  female  15.0      0      0   330923   8.0292   NaN        Q  
24  female   8.0      3      1   349909  21.0750   NaN        S  


In this example, we create a boolean mask that selects only the female passengers who are under the age of 18 using the '&' operator to combine two conditions. We then apply this boolean mask to the original DataFrame using square bracket notation, and assign the resulting filtered DataFrame to a new variable 'filtered_df'.
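The other two operators work the same way. The sketch below combines conditions with '|' and negates one with '~' on a small hand-built sample (an assumption; the values mirror the first five rows of the dataset). The variable names are illustrative.

```python
import pandas as pd

# Hypothetical five-row sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Pclass': [3, 1, 3, 1, 3],
})

# '|' keeps rows where at least one condition holds
first_class_or_over_30 = (sample['Pclass'] == 1) | (sample['Age'] > 30)

# '~' inverts a mask
not_male = ~(sample['Sex'] == 'male')

print(sample[first_class_or_over_30])
print(sample[not_male])
```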

## Filtering with 'query()' method
Filtering based on a query in Pandas refers to selecting data from a DataFrame using a string containing a boolean expression. It can be done using the 'query()' method.

Suppose you want to filter the dataset to select only the passengers who survived and were in first or second class. Here's an example of how to filter a dataset using a query:

In [26]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# filter the DataFrame to select only the survivors in first or second class
survivors_df = df.query("Survived == 1 and Pclass in [1, 2]")

# display the filtered DataFrame
print(survivors_df.head())

    PassengerId  Survived  Pclass  \
1             2         1       1   
3             4         1       1   
9            10         1       2   
11           12         1       1   
15           16         1       2   

                                                 Name     Sex   Age  SibSp  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
9                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   
11                           Bonnell, Miss. Elizabeth  female  58.0      0   
15                   Hewlett, Mrs. (Mary D Kingcome)   female  55.0      0   

    Parch    Ticket     Fare Cabin Embarked  
1       0  PC 17599  71.2833   C85        C  
3       0    113803  53.1000  C123        S  
9       0    237736  30.0708   NaN        C  
11      0    113783  26.5500  C103        S  
15      0    248706  16.0000   NaN        S  


In this example, we use the 'query()' method to filter the DataFrame based on a boolean expression that specifies that we want to select only the rows where the 'Survived' column is equal to 1 (indicating that the passenger survived) and the 'Pclass' column is equal to 1 or 2 (indicating that the passenger was in first or second class). The resulting DataFrame 'survivors_df' contains only the rows that satisfy the condition.

Let's see another example:

Suppose you want to filter the Titanic dataset to select only the passengers who were females and under the age of 20.

In [27]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# filter the DataFrame to select only female passengers under the age of 20
females_under_age_df = df.query("Sex == 'female' and Age < 20")

# display the filtered DataFrame
print(females_under_age_df.head())

    PassengerId  Survived  Pclass                                  Name  \
9            10         1       2   Nasser, Mrs. Nicholas (Adele Achem)   
10           11         1       3       Sandstrom, Miss. Marguerite Rut   
14           15         0       3  Vestrom, Miss. Hulda Amanda Adolfina   
22           23         1       3           McGowan, Miss. Anna "Annie"   
24           25         0       3         Palsson, Miss. Torborg Danira   

       Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
9   female  14.0      1      0   237736  30.0708   NaN        C  
10  female   4.0      1      1  PP 9549  16.7000    G6        S  
14  female  14.0      0      0   350406   7.8542   NaN        S  
22  female  15.0      0      0   330923   8.0292   NaN        Q  
24  female   8.0      3      1   349909  21.0750   NaN        S  


In this example, we use the 'query()' method to filter the DataFrame based on a boolean expression that selects only the rows where the 'Sex' column is equal to 'female' and the 'Age' column is less than 20.

### NOTE
The main difference between Boolean-based and Query-based filtering is the syntax used to write the filter conditions. Boolean-based filtering is more flexible and can handle more complex conditions, but the syntax can sometimes be more verbose and difficult to read. Query-based filtering is more concise and easier to read, but it may not be able to handle all types of conditions.
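To make the comparison concrete, the sketch below writes the same filter both ways on a small hand-built sample (an assumption) and checks that the results match. It also uses the '@' prefix, which lets a 'query()' string refer to a Python variable; 'max_age' is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical five-row sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

max_age = 30  # illustrative threshold

# Boolean-based filtering
with_mask = sample[(sample['Sex'] == 'female') & (sample['Age'] < max_age)]

# Query-based filtering; '@max_age' refers to the Python variable above
with_query = sample.query("Sex == 'female' and Age < @max_age")

print(with_query)
print(with_mask.equals(with_query))
```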

## Filtering with 'where()' method
The 'where()' method returns a new object with the same shape as the original in which every value that does not meet a specified condition is replaced, with NaN by default. It takes a boolean condition as input and keeps the values for which the condition is True.

Suppose you want to filter the Titanic dataset to replace all the ages greater than 50 with NaN.

In [28]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# filter the DataFrame to replace all the ages greater than 50 with NaN
df['Age'] = df['Age'].where(df['Age'] <= 50)

# display the filtered DataFrame
print(df.head(10))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male   N

In this example, we apply the 'where()' method to the 'Age' column with the condition 'Age <= 50', so every age greater than 50 is replaced with NaN. The resulting DataFrame has the same shape as the original, and ages that were already missing remain NaN.
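The sketch below contrasts 'where()' with ordinary boolean filtering and with its counterpart 'mask()', which replaces the values where the condition is True. The ages are made up for illustration (an assumption).

```python
import pandas as pd

# Hypothetical ages for illustration
ages = pd.Series([22.0, 38.0, 54.0, 58.0, 35.0], name='Age')

kept = ages.where(ages <= 50)   # values failing the condition become NaN
masked = ages.mask(ages > 50)   # values meeting the condition become NaN
filtered = ages[ages <= 50]     # rows failing the condition are dropped

print(len(kept), len(filtered))
print(kept.equals(masked))
```

Unlike boolean filtering, 'where()' and 'mask()' keep every row, which is useful when the result has to stay aligned with the original DataFrame.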

Let's see another example:

Suppose you want to filter the Titanic dataset to replace all the ages greater than 50 with the median age of the passengers.

In [29]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# calculate the median age of the passengers
median_age = df['Age'].median()

# use the where method to replace all the ages greater than 50 with the median age
df['Age'] = df['Age'].where(df['Age'] <= 50, other=median_age)

# display the modified DataFrame
print(df.head(10))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male  28.0      0   
6                            McCarthy, Mr. Timothy J    male  28

In this example, we first calculate the median age of the passengers using the 'median()' method on the 'Age' column. Then we use the 'where()' method with the condition 'Age <= 50' and pass 'median_age' to the 'other' parameter, so every age greater than 50 is replaced with the median. Note that missing ages are replaced too: a comparison with NaN evaluates to False, so rows with no recorded age fail the condition and also receive the median, as row 5 shows in the output above.
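Because a comparison with NaN evaluates to False, 'where(df['Age'] <= 50, other=...)' also overwrites ages that were missing to begin with. If missing ages should stay missing, one option is to add 'isna()' to the condition. The sketch below shows both behaviours on made-up ages (an assumption).

```python
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([22.0, None, 54.0, 35.0], name='Age')

median_age = ages.median()  # NaN is skipped when computing the median

# The missing age fails 'ages <= 50', so it is replaced as well
replaced_all = ages.where(ages <= 50, other=median_age)

# Keep missing ages missing by letting NaN rows pass the condition
keep_missing = ages.where((ages <= 50) | ages.isna(), other=median_age)

print(replaced_all.tolist())
print(keep_missing.tolist())
```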

## Filtering with 'filter()' method
The 'filter()' method in Pandas subsets the rows or columns of a DataFrame according to their index labels. It does not look at the values stored in the DataFrame; it returns a new DataFrame containing only the rows or columns whose labels match the 'items', 'like', or 'regex' argument.

Suppose we want to filter the Titanic dataset to include only the passengers who were in the 1st class and paid a fare above the median fare of all passengers. Because 'filter()' works on labels, we first build a boolean mask from the conditions, use it to collect the index labels of the matching rows, and then pass those labels to 'filter()' with 'axis=0'.

In [30]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Create a boolean mask based on conditions
mask = (df['Pclass'] == 1) & (df['Fare'] > df['Fare'].median())

# Pass the labels of the rows where the mask is True to filter()
filtered_titanic = df.filter(items=mask.index[mask], axis=0)

# View the filtered dataset
print(filtered_titanic.head())

    PassengerId  Survived  Pclass  \
1             2         1       1   
3             4         1       1   
6             7         0       1   
11           12         1       1   
23           24         1       1   

                                                 Name     Sex   Age  SibSp  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
6                             McCarthy, Mr. Timothy J    male  54.0      0   
11                           Bonnell, Miss. Elizabeth  female  58.0      0   
23                       Sloper, Mr. William Thompson    male  28.0      0   

    Parch    Ticket     Fare Cabin Embarked  
1       0  PC 17599  71.2833   C85        C  
3       0    113803  53.1000  C123        S  
6       0     17463  51.8625   E46        S  
11      0    113783  26.5500  C103        S  
23      0    113788  35.5000    A6        S  


The output is a DataFrame that includes only the passengers who were in the 1st class and paid a fare above the median fare of all passengers.
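The sketch below contrasts the label-based behavior of 'filter()' with plain boolean indexing. It uses a small, made-up DataFrame (hypothetical values, with column names borrowed from the Titanic dataset) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical miniature of three Titanic columns
df = pd.DataFrame(
    {
        "Pclass": [3, 1, 1, 2],
        "Fare": [7.25, 71.28, 5.0, 13.0],
        "Parch": [0, 0, 1, 0],
    }
)

# filter() matches labels: select columns by exact name or by substring
by_name = df.filter(items=["Pclass", "Fare"])
by_substring = df.filter(like="P", axis=1)  # 'Pclass' and 'Parch'

# Filtering rows by value needs a boolean mask
mask = (df["Pclass"] == 1) & (df["Fare"] > df["Fare"].median())

# Boolean indexing applies the mask directly
via_mask = df[mask]

# filter() needs the row labels where the mask is True
via_filter = df.filter(items=mask.index[mask], axis=0)

print(by_substring.columns.tolist())
print(via_mask)
print(via_filter)
```

Both approaches select the same rows here. Boolean indexing is usually the shorter way to filter by value, while 'filter()' is most convenient for selecting columns by name pattern.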

## Filtering with 'isna()' method
The 'isna()' method in Pandas is used to detect missing or null values in a dataset. When dealing with large datasets, it is common to have missing values in some of the observations. For example, to filter the Titanic dataset to only include rows where the 'Age' column contains missing values, we can use the following code:

In [31]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Filter the dataset to only include rows with missing Age values
filtered_titanic = df[df['Age'].isna()]

# View the filtered dataset
print(filtered_titanic.head(10))


# This version will only get data with age values that are not NaN
# filtered_titanic = df[~df['Age'].isna()]

    PassengerId  Survived  Pclass  \
5             6         0       3   
17           18         1       2   
19           20         1       3   
26           27         0       3   
28           29         1       3   
29           30         0       3   
31           32         1       1   
32           33         1       3   
36           37         1       3   
42           43         0       3   

                                              Name     Sex  Age  SibSp  Parch  \
5                                 Moran, Mr. James    male  NaN      0      0   
17                    Williams, Mr. Charles Eugene    male  NaN      0      0   
19                         Masselmani, Mrs. Fatima  female  NaN      0      0   
26                         Emir, Mr. Farred Chehab    male  NaN      0      0   
28                   O'Dwyer, Miss. Ellen "Nellie"  female  NaN      0      0   
29                             Todoroff, Mr. Lalio    male  NaN      0      0   
31  Spencer, Mrs. William

This code applies the 'isna()' method to the 'Age' column of the DataFrame to create a boolean mask that is True for rows with missing values in this column. We pass this boolean mask to the indexing operator ('[ ]') of the DataFrame to keep only those rows. The commented-out line shows the inverse: the '~' operator negates the mask, so it selects only the rows whose 'Age' value is not missing.
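As a small illustration of the same idea, the following sketch counts missing values per column and selects the rows with and without a missing 'Age'. The DataFrame is made up for the example, and 'notna()' is used as a more readable equivalent of negating 'isna()' with '~'.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
df = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, np.nan, 35.0, np.nan]}
)

# Count missing values in every column
missing_per_column = df.isna().sum()

# Rows where Age is missing
missing_age = df[df["Age"].isna()]

# Rows where Age is present (same result as df[~df["Age"].isna()])
known_age = df[df["Age"].notna()]

print(missing_per_column)
print(missing_age)
print(known_age)
```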

## Filtering with 'between()' method
The 'between()' method is a Series method in Pandas that returns a boolean mask indicating, for each value in a column, whether it lies within a specified range. You can then use this mask to filter the DataFrame. This is particularly useful when working with numerical data. To use the 'between()' method, you call it on the column to filter and pass the minimum and maximum values that the column's values should fall between.

Suppose you want to filter the Titanic dataset to include only the passengers who paid a fare between 50 and 100 dollars. Here's the code for implementing this scenario:

In [32]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Create a boolean mask for the Fare column using between()
fare_mask = df['Fare'].between(50, 100)

# Filter the dataset to only include rows where the fare is between $50 and $100
filtered_titanic = df[fare_mask]

# View the filtered dataset
print(filtered_titanic)

     PassengerId  Survived  Pclass  \
1              2         1       1   
3              4         1       1   
6              7         0       1   
34            35         0       1   
35            36         0       1   
..           ...       ...     ...   
849          850         1       1   
863          864         0       3   
867          868         0       1   
871          872         1       1   
879          880         1       1   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
34                             Meyer, Mr. Edgar Joseph    male  28.0      1   
35                      Holverson, Mr. Alexander Oskar    male  42.0      1   
..                                                 ...     ...   ... 

In this code, we first create a boolean mask for the 'Fare' column using the 'between()' method, specifying that we want to include fares between 50 and 100. Finally, we filter the original dataset using this mask.

### NOTE
The 'between()' method is inclusive, meaning that values at the endpoints of the range are included in the resulting boolean mask. If you want to exclude these values, you can use the 'gt()' (greater than) and 'lt()' (less than) methods instead.

Here's an example that shows how to create a boolean mask based on the 'Fare' column using the 'gt()' and 'lt()' methods:

In [33]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Create a boolean mask to select passengers with fare between $50 and $100 (excluding $50 and $100)
fare_mask = df['Fare'].gt(50) & df['Fare'].lt(100)

# Filter the dataset to only include rows where the fare is between $50 and $100
filtered_titanic = df[fare_mask]

# View the filtered dataset
print(filtered_titanic)

     PassengerId  Survived  Pclass  \
1              2         1       1   
3              4         1       1   
6              7         0       1   
34            35         0       1   
35            36         0       1   
..           ...       ...     ...   
849          850         1       1   
863          864         0       3   
867          868         0       1   
871          872         1       1   
879          880         1       1   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
34                             Meyer, Mr. Edgar Joseph    male  28.0      1   
35                      Holverson, Mr. Alexander Oskar    male  42.0      1   
..                                                 ...     ...   ... 

In this example, we used the 'gt()' (greater than) and 'lt()' (less than) methods to create a boolean mask that selects passengers with fares between 50 and 100, but excludes passengers with fares 50 and 100. We used the '&' operator to combine the two conditions into a single boolean mask.

Using the 'gt()' and 'lt()' methods, we exclude the fare values at the endpoints of the range. After applying these methods to the 'Fare' column of the Titanic dataset, we see that one row has been excluded compared with the 'between()' result, leaving a boolean mask that selects 107 rows. It is therefore important to be aware of inclusive versus exclusive bounds when using boolean-based filtering methods in Pandas.
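The difference between inclusive and exclusive bounds is easiest to see on a few values placed right at the endpoints. The sketch below uses a made-up Series of fares. It also shows the 'inclusive' parameter of 'between()', which accepts the strings "both", "neither", "left" and "right"; this assumes pandas 1.3 or later, since older versions only accept a boolean here.

```python
import pandas as pd

# Hypothetical fares placed around the endpoints 50 and 100
fares = pd.Series([49.99, 50.0, 75.0, 100.0, 100.01], name="Fare")

# Default behavior: both endpoints are included
inclusive_mask = fares.between(50, 100)

# Exclude both endpoints (string values require pandas >= 1.3)
exclusive_mask = fares.between(50, 100, inclusive="neither")

# Equivalent exclusive mask built from gt() and lt()
gt_lt_mask = fares.gt(50) & fares.lt(100)

print(fares[inclusive_mask].tolist())
print(fares[exclusive_mask].tolist())
print(exclusive_mask.equals(gt_lt_mask))
```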

## Filtering with 'nsmallest()' method
Filtering with the 'nsmallest()' method is a common data manipulation technique used to extract a specific number of rows with the smallest values in a given column of a Pandas DataFrame. This method is especially useful when you want to extract the n lowest values from a large dataset.

For example, if you want to extract the 10 passengers with the smallest ages, you can use the 'nsmallest()' method. Here's how you can implement this in Python:

In [34]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Extract the 10 passengers with the smallest ages
youngest_passengers = df.nsmallest(10, 'Age')

# Print the resulting DataFrame
print(youngest_passengers)

     PassengerId  Survived  Pclass                             Name     Sex  \
803          804         1       3  Thomas, Master. Assad Alexander    male   
755          756         1       2        Hamalainen, Master. Viljo    male   
469          470         1       3    Baclini, Miss. Helene Barbara  female   
644          645         1       3           Baclini, Miss. Eugenie  female   
78            79         1       2    Caldwell, Master. Alden Gates    male   
831          832         1       2  Richards, Master. George Sibley    male   
305          306         1       1   Allison, Master. Hudson Trevor    male   
164          165         0       3     Panula, Master. Eino Viljami    male   
172          173         1       3     Johnson, Miss. Eleanor Ileen  female   
183          184         1       2        Becker, Master. Richard F    male   

      Age  SibSp  Parch   Ticket      Fare    Cabin Embarked  
803  0.42      0      1     2625    8.5167      NaN        C  
755 

This code extracts the 10 passengers with the smallest ages from the Titanic dataset, ordered from youngest to oldest by the 'Age' column. Some of the extracted ages have decimal values, such as 0.42 and 0.67, which may seem unrealistic as ages. However, these decimal values most likely represent fractions of a year for infants under one year old. The dataset also contains missing ages; 'nsmallest()' ranks rows with a missing 'Age' after all rows that have a value, so they do not appear among the 10 youngest passengers.
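The following sketch, on a made-up DataFrame with one missing age, illustrates both points: the result is ordered by 'Age', and the row with a missing value is not selected. It also compares the result with sorting the whole column and taking the first rows, which selects the same rows in this example.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, one of them with a missing age
df = pd.DataFrame(
    {"Name": ["A", "B", "C", "D", "E"], "Age": [4.0, 0.42, np.nan, 2.0, 0.67]}
)

# The two youngest passengers; the NaN row is not selected
youngest = df.nsmallest(2, "Age")

# Same rows obtained by sorting the whole column first
same_rows = df.sort_values("Age").head(2)

print(youngest)
print(same_rows)
```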

## Filtering with 'nlargest()' method
The 'nlargest()' method is a data manipulation tool in Python's Pandas library that allows you to filter a DataFrame by selecting the rows with the n largest values in a specific column. This can be useful in scenarios where you want to identify the top n highest values in a particular dataset.

Suppose we have a dataset containing information about the passengers on the Titanic, including their age, gender, ticket class, and whether they survived or not. We want to identify the five oldest passengers who survived the disaster.

In [35]:
# Import pandas library and read Titanic dataset
import pandas as pd
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Filter the dataset to only include passengers who survived
survived = df[df['Survived'] == 1]

# Use nlargest to select the five rows with the largest 'Age' values
oldest_survivors = survived.nlargest(5, 'Age')

# Display the result
print(oldest_survivors)

     PassengerId  Survived  Pclass                                       Name  \
630          631         1       1       Barkworth, Mr. Algernon Henry Wilson   
275          276         1       1          Andrews, Miss. Kornelia Theodosia   
483          484         1       3                     Turkula, Mrs. (Hedwig)   
570          571         1       2                         Harris, Mr. George   
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)   

        Sex   Age  SibSp  Parch       Ticket     Fare Cabin Embarked  
630    male  80.0      0      0        27042  30.0000   A23        S  
275  female  63.0      1      0        13502  77.9583    D7        S  
483  female  63.0      0      0         4134   9.5875   NaN        S  
570    male  62.0      0      0  S.W./PP 752  10.5000   NaN        S  
829  female  62.0      0      0       113572  80.0000   B28      NaN  


In this code, we filter the dataset to only include passengers who survived using boolean indexing. We create a new DataFrame called 'survived' that contains only the rows where the 'Survived' column is equal to 1. Next, we use the 'nlargest()' method on the 'survived' DataFrame to select the five rows with the largest 'Age' values.
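When several rows share the same value at the cut-off, 'nlargest()' keeps the first ones it encounters by default. The sketch below uses a made-up DataFrame to show the 'keep' parameter, which controls this; with 'keep="all"' every row tied for the last place is returned, so the result can contain more than n rows.

```python
import pandas as pd

# Hypothetical passengers; two survivors share the age 63
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E"],
        "Survived": [1, 0, 1, 1, 1],
        "Age": [63.0, 71.0, 80.0, 63.0, 30.0],
    }
)

# Keep only the survivors, as in the example above
survived = df[df["Survived"] == 1]

# Default keep="first": exactly two rows, the first tied row wins
top2_first = survived.nlargest(2, "Age")

# keep="all": every row tied at the cut-off is included
top2_all = survived.nlargest(2, "Age", keep="all")

print(top2_first)
print(top2_all)
```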