<img src="./images/banner.png" width="800">

# Introduction to DataFrames

In this section, we will introduce Pandas DataFrames, which are fundamental data structures in the Pandas library for handling and analyzing structured data. We will explore what DataFrames are, how they differ from Series, and why they are essential tools in data analysis.


A DataFrame is a two-dimensional labeled data structure in Pandas that consists of rows and columns, similar to a table in a spreadsheet or a SQL database. It can be viewed as a collection of Pandas Series that share the same row index, where each Series represents a column in the DataFrame. DataFrames provide a convenient way to store, manipulate, and analyze structured data in a tabular format.


Key characteristics of DataFrames include:

- Labeled axes: DataFrames have both a row index and column labels, allowing for easy and intuitive data access and selection.
- Heterogeneous data: Each column in a DataFrame can contain a different data type (e.g., numeric, string, boolean).
- Mutable size: DataFrames are mutable, meaning you can add or remove rows and columns as needed.
- Alignment of data: DataFrames automatically align data based on row and column labels during operations; positions that have no matching label are filled with `NaN` rather than raising an error.
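
The label alignment and mixed column types listed above can be sketched as follows. The frames `left`, `right`, and `mixed` and their labels are made up for illustration and are not reused later in this section.

```python
import pandas as pd

# Two hypothetical frames whose row labels only partly overlap
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10, 20, 30]}, index=["b", "c", "d"])

# Addition matches values by row and column label, not by position
total = left + right
print(total)

# Each column keeps its own data type
mixed = pd.DataFrame(
    {"name": ["a", "b"], "score": [1.5, 2.5], "passed": [True, False]}
)
print(mixed.dtypes)
```

Labels that appear in only one of the two frames (`a` and `d` here) come out as `NaN` in the result.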


DataFrames are powerful tools for handling structured data and provide a wide range of functionalities for data manipulation, analysis, and visualization.


While Series and DataFrames are both fundamental data structures in Pandas, they have some key differences:

- Dimensionality: A Series is a one-dimensional labeled array that can hold any data type, whereas a DataFrame is a two-dimensional labeled data structure with rows and columns.
- Structure: A Series represents a single column of data, while a DataFrame is a collection of Series, representing multiple columns of data.
- Axis labels: A Series has only one axis (the index), while a DataFrame has two axes (the row index and column labels).
- Heterogeneous data: In a DataFrame, each column can contain a different data type, whereas a Series has a single data type (mixed values are stored under the generic `object` type).


Despite these differences, Series and DataFrames are closely related. A DataFrame can be thought of as a collection of Series, where each Series represents a column in the DataFrame. Many operations that can be performed on Series can also be applied to DataFrames, often operating on each column or row independently.
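
The relationship between the two structures can be checked directly. This is a minimal sketch with a hypothetical `people` frame:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30]})

# Selecting a single column yields a Series
age = people["Age"]
print(type(age).__name__)

# A DataFrame has two dimensions, a Series has one
print(people.ndim, age.ndim)

# The column shares the DataFrame's row index
print(age.index.equals(people.index))

# A reduction on a DataFrame runs column by column
print(people[["Age"]].max())
```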


By leveraging the power and flexibility of DataFrames, data analysts and scientists can efficiently handle and analyze large and complex datasets, making them an indispensable tool in the data analysis workflow.


In the following sections, we will dive deeper into creating DataFrames, accessing data within them, performing basic operations, and exploring various functionalities that make DataFrames a powerful tool for data manipulation and analysis.

**Table of contents**<a id='toc0_'></a>    
- [Creating DataFrames](#toc1_)    
  - [From Lists](#toc1_1_)    
    - [List of Lists](#toc1_1_1_)    
    - [List of Dictionaries](#toc1_1_2_)    
  - [From Dictionaries](#toc1_2_)    
    - [Dictionary of Lists](#toc1_2_1_)    
    - [Dictionary of Series](#toc1_2_2_)    
  - [From NumPy Arrays](#toc1_3_)    
  - [Using the `pd.DataFrame()` Function](#toc1_4_)    
- [DataFrame Methods and Attributes](#toc2_)    
  - [Useful Methods](#toc2_1_)    
  - [Important Attributes](#toc2_2_)    
- [Accessing Data in DataFrames](#toc3_)    
  - [Accessing Columns](#toc3_1_)    
    - [Using Square Brackets `[]`    ](#toc3_1_1_)    
    - [Using Dot Notation `.`](#toc3_1_2_)    
  - [Accessing Rows](#toc3_2_)    
    - [Using `loc[]` with Labels    ](#toc3_2_1_)    
    - [Using `iloc[]` with Integer Positions    ](#toc3_2_2_)    
  - [Accessing Individual Elements](#toc3_3_)    
    - [Using `loc[]` with Row and Column Labels    ](#toc3_3_1_)    
    - [Using `iloc[]` with Row and Column Positions    ](#toc3_3_2_)    
    - [Using `at[]` and `iat[]` for Single Element Access    ](#toc3_3_3_)    
- [Modifying DataFrames](#toc4_)    
  - [Adding and Removing Columns](#toc4_1_)    
  - [Adding and Removing Rows](#toc4_2_)    
  - [Renaming Columns and Indexes](#toc4_3_)    
  - [Updating Values in DataFrames](#toc4_4_)    
- [Sorting DataFrames](#toc5_)    
  - [Sorting by Index](#toc5_1_)    
  - [Sorting by Column Values](#toc5_2_)    
- [Summary](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Creating DataFrames](#toc0_)

Pandas provides several ways to create DataFrames, allowing you to construct them from various data structures such as lists, dictionaries, and NumPy arrays. In this section, we will explore different methods for creating DataFrames.


### <a id='toc1_1_'></a>[From Lists](#toc0_)


#### <a id='toc1_1_1_'></a>[List of Lists](#toc0_)


You can create a DataFrame from a list of lists, where each inner list represents a row in the DataFrame. The values in each inner list correspond to the columns.

In [1]:
import pandas as pd

In [2]:
data = [
    ['John', 25, 'New York'],
    ['Alice', 30, 'London'],
    ['Bob', 35, 'Paris']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])

In [3]:
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Alice | 30 | London |
| 2 | Bob | 35 | Paris |


In this example, we create a DataFrame `df` from a list of lists `data`. The `columns` parameter is used to specify the column labels for the DataFrame.


#### <a id='toc1_1_2_'></a>[List of Dictionaries](#toc0_)


You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row in the DataFrame. The keys of the dictionaries correspond to the column labels.


In [4]:
data = [
    {'Name': 'John', 'Age': 25, 'City': 'New York'},
    {'Name': 'Alice', 'Age': 30, 'City': 'London'},
    {'Name': 'Bob', 'Age': 35, 'City': 'Paris'}
]
df = pd.DataFrame(data)
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Alice | 30 | London |
| 2 | Bob | 35 | Paris |


In this case, the DataFrame `df` is created from a list of dictionaries `data`. The column labels are automatically inferred from the keys of the dictionaries.
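
The dictionaries do not all need the same keys. As a small sketch with made-up records, a key missing from one dictionary shows up as `NaN` in that row:

```python
import pandas as pd

# Hypothetical records: the second one has no 'Age', the first has no 'City'
records = [
    {"Name": "John", "Age": 25},
    {"Name": "Alice", "City": "London"},
]
partial = pd.DataFrame(records)
print(partial)

# Count the missing values per column
print(partial.isna().sum())
```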


### <a id='toc1_2_'></a>[From Dictionaries](#toc0_)


#### <a id='toc1_2_1_'></a>[Dictionary of Lists](#toc0_)


You can create a DataFrame from a dictionary of lists, where each key in the dictionary represents a column label, and the corresponding value is a list of data for that column.


In [5]:
data = {
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Alice | 30 | London |
| 2 | Bob | 35 | Paris |


Here, the DataFrame `df` is created from a dictionary of lists `data`. The keys of the dictionary become the column labels, and the corresponding lists become the column data.


#### <a id='toc1_2_2_'></a>[Dictionary of Series](#toc0_)


Similar to a dictionary of lists, you can create a DataFrame from a dictionary of Series. Each key in the dictionary represents a column label, and the corresponding value is a Pandas Series containing the data for that column.


In [6]:
data = {
    'Name': pd.Series(['John', 'Alice', 'Bob']),
    'Age': pd.Series([25, 30, 35]),
    'City': pd.Series(['New York', 'London', 'Paris'])
}
df = pd.DataFrame(data)
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Alice | 30 | London |
| 2 | Bob | 35 | Paris |


In this example, the DataFrame `df` is created from a dictionary of Series `data`. The keys of the dictionary become the column labels, and the corresponding Series become the column data.
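
One practical difference from a dictionary of lists is that Series are combined by index label rather than by position. The sketch below uses two hypothetical Series, `ages` and `cities`, whose labels only partly overlap:

```python
import pandas as pd

ages = pd.Series([25, 30], index=["John", "Alice"])
cities = pd.Series(["London", "Paris"], index=["Alice", "Bob"])

# The resulting row index is the union of both Series' labels
aligned = pd.DataFrame({"Age": ages, "City": cities})
print(aligned)
```

Rows whose label is missing from one of the Series get `NaN` in that column.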


### <a id='toc1_3_'></a>[From NumPy Arrays](#toc0_)


You can create a DataFrame from a NumPy array, where each row of the array represents a row in the DataFrame.


In [7]:
import numpy as np

In [8]:
data = np.array([
    ['John', 25, 'New York'],
    ['Alice', 30, 'London'],
    ['Bob', 35, 'Paris']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Alice | 30 | London |
| 2 | Bob | 35 | Paris |


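Here, the DataFrame `df` is created from a NumPy array `data`. The `columns` parameter is used to specify the column labels for the DataFrame. Because a NumPy array holds a single data type, mixing strings and numbers as above stores every value as a string, so the `Age` column is not numeric until it is converted.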
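
A short sketch that checks the data types after building a frame from a mixed NumPy array. The name `df_np` is chosen here for illustration, and converting with `astype(int)` is one possible fix rather than the only one:

```python
import numpy as np
import pandas as pd

data = np.array([
    ["John", 25, "New York"],
    ["Alice", 30, "London"],
])
# NumPy stores every element of this array as a string
print(data.dtype)

df_np = pd.DataFrame(data, columns=["Name", "Age", "City"])
age_numeric_before = pd.api.types.is_numeric_dtype(df_np["Age"])

# Convert the column explicitly to get integers back
df_np["Age"] = df_np["Age"].astype(int)
age_numeric_after = pd.api.types.is_numeric_dtype(df_np["Age"])

print(age_numeric_before, age_numeric_after)
```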


### <a id='toc1_4_'></a>[Using the `pd.DataFrame()` Function](#toc0_)


`pd.DataFrame()` is the constructor used in all of the examples above. It accepts various data structures as input and provides additional parameters, such as `index` and `columns`, for customization.


In [9]:
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data, index=['001', '002', '003'])
df

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


In this example, we create a DataFrame `df` from a dictionary `data`. The `index` parameter is used to specify custom row labels for the DataFrame.


These are just a few examples of how you can create DataFrames in Pandas. The flexibility of the `pd.DataFrame()` function allows you to construct DataFrames from various data structures, making it easy to work with different data sources.


Once you have created a DataFrame, you can access and manipulate its data using a wide range of methods and functionalities provided by Pandas, which we will explore in the upcoming sections.

## <a id='toc2_'></a>[DataFrame Methods and Attributes](#toc0_)

Pandas DataFrames come with a rich set of methods and attributes that allow you to explore, inspect, and manipulate the data effectively. In this section, we will discuss some of the most useful methods and important attributes of DataFrames.


### <a id='toc2_1_'></a>[Useful Methods](#toc0_)


The `head()` method returns the first `n` rows of the DataFrame. By default, it returns the first 5 rows. This method is useful for quickly inspecting the top few rows of the DataFrame.


In [10]:
df.head()  # Returns up to the first 5 rows (df has only 3)

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


In [11]:
df.head(10)  # Returns up to the first 10 rows (df has only 3)

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


The `tail()` method returns the last `n` rows of the DataFrame. By default, it returns the last 5 rows. This method is useful for quickly inspecting the bottom few rows of the DataFrame.


In [12]:
df.tail()  # Returns up to the last 5 rows (df has only 3)

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


In [13]:
df.tail(10)  # Returns up to the last 10 rows (df has only 3)

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


The `info()` method provides a concise summary of the DataFrame, including the column names, data types, and non-null count. It is useful for getting a quick overview of the DataFrame's structure and data types.


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 001 to 003
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes


The `describe()` method generates descriptive statistics of the DataFrame. It computes various summary statistics, such as count, mean, standard deviation, minimum, maximum, and quartiles, for each numeric column in the DataFrame.


In [15]:
df.describe()

|       | Age |
|-------|------|
| count | 3.0 |
| mean | 30.0 |
| std | 5.0 |
| min | 25.0 |
| 25% | 27.5 |
| 50% | 30.0 |
| 75% | 32.5 |
| max | 35.0 |


The `value_counts()` method returns a Series containing counts of unique values in a specific column. It is useful for understanding the distribution of values in a categorical column.


In [16]:
df['City'].value_counts()

City
New York    1
London      1
Paris       1
Name: count, dtype: int64

The `sort_values()` method sorts the DataFrame by one or more columns. It allows you to specify the column(s) to sort by and the sorting order (ascending or descending).


```python
df.sort_values('Column')  # Sorts by a single column
df.sort_values(['Column1', 'Column2'])  # Sorts by multiple columns
```



### <a id='toc2_2_'></a>[Important Attributes](#toc0_)


The `shape` attribute returns a tuple representing the dimensions of the DataFrame. It contains the number of rows and columns in the DataFrame.


In [17]:
df.shape

(3, 3)

The `size` attribute returns the total number of elements in the DataFrame, which is equivalent to the product of the number of rows and columns.


In [18]:
df.size

9

The `columns` attribute returns an Index object containing the column labels of the DataFrame.


In [19]:
df.columns

Index(['Name', 'Age', 'City'], dtype='object')

The `index` attribute returns an Index object representing the row labels of the DataFrame.


In [20]:
df.index

Index(['001', '002', '003'], dtype='object')

The `dtypes` attribute returns a Series containing the data type of each column in the DataFrame.


In [21]:
df.dtypes

Name    object
Age      int64
City    object
dtype: object

These are just a few examples of the useful methods and attributes available in Pandas DataFrames. Pandas provides a wide range of additional methods for data manipulation, selection, filtering, grouping, and more, which we will explore in the upcoming sections.


By leveraging these methods and attributes, you can effectively explore, inspect, and manipulate your data using Pandas DataFrames. They provide convenient ways to gain insights into your data, perform data cleaning and preprocessing, and prepare your data for further analysis and visualization.

## <a id='toc3_'></a>[Accessing Data in DataFrames](#toc0_)

Pandas provides various methods and techniques to access data in DataFrames efficiently. In this section, we will explore different ways to access columns, rows, and individual elements within a DataFrame.


### <a id='toc3_1_'></a>[Accessing Columns](#toc0_)


In [22]:
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data, index=['001', '002', '003'])
df

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


#### <a id='toc3_1_1_'></a>[Using Square Brackets `[]`](#toc0_)


You can access a single column of a DataFrame using square brackets `[]` and the column label. The result is a Pandas Series containing the data from that column.


In [23]:
df['Name']

001     John
002    Alice
003      Bob
Name: Name, dtype: object

You can also access multiple columns by passing a list of column labels inside square brackets.


In [24]:
df[['Age', 'Name', 'City']]

|     | Age | Name | City |
|-----|-----|------|------|
| 001 | 25 | John | New York |
| 002 | 30 | Alice | London |
| 003 | 35 | Bob | Paris |


#### <a id='toc3_1_2_'></a>[Using Dot Notation `.`](#toc0_)


If the column label is a valid Python identifier (i.e., it doesn't contain spaces or special characters), you can access the column using dot notation.


In [25]:
df.Age

001    25
002    30
003    35
Name: Age, dtype: int64

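However, dot notation has limitations. It cannot be used when the column label contains spaces, and when the label matches an existing DataFrame method or attribute (for example `mean` or `shape`), `df.label` returns that method or attribute instead of the column. In those cases, use square brackets.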
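
The name clash can be seen in a small sketch. The frame `stats` and its column labels are hypothetical and chosen only to trigger the problem:

```python
import pandas as pd

stats = pd.DataFrame({"mean": [1.0, 2.0], "first name": ["John", "Alice"]})

# Dot notation resolves to the DataFrame.mean method, not the 'mean' column
print(callable(stats.mean))

# Square brackets always return the column
print(stats["mean"])

# A label with a space can only be reached with square brackets
print(stats["first name"])
```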



### <a id='toc3_2_'></a>[Accessing Rows](#toc0_)


#### <a id='toc3_2_1_'></a>[Using `loc[]` with Labels](#toc0_)


The `loc[]` accessor allows you to access rows based on their labels. You can use a single label, a list of labels, or a slice of labels to select rows.


In [26]:
df.loc['001']  # Access a single row by label

Name        John
Age           25
City    New York
Name: 001, dtype: object

In [27]:
df.loc[['001', '003']]  # Access multiple rows by labels

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 003 | Bob | 35 | Paris |


In [28]:
df.loc['002':'003']  # Access a slice of rows by labels

|     | Name | Age | City |
|-----|------|-----|------|
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |


#### <a id='toc3_2_2_'></a>[Using `iloc[]` with Integer Positions](#toc0_)


The `iloc[]` accessor allows you to access rows based on their integer positions. You can use a single integer, a list of integers, or a slice of integers to select rows.


In [29]:
df.iloc[0]  # Access a single row by integer position

Name        John
Age           25
City    New York
Name: 001, dtype: object

In [30]:
df.iloc[[0, 2, 1]]  # Access multiple rows by integer positions

|     | Name | Age | City |
|-----|------|-----|------|
| 001 | John | 25 | New York |
| 003 | Bob | 35 | Paris |
| 002 | Alice | 30 | London |


In [31]:
df.iloc[1:3]  # Access a slice of rows by integer positions

|     | Name | Age | City |
|-----|------|-----|------|
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |
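
One difference between the two accessors is easy to miss: a label slice with `loc[]` includes its end label, while a position slice with `iloc[]` excludes its end position, as with ordinary Python slicing. A minimal sketch, using a hypothetical `people` frame with the same string labels as above:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]},
    index=["001", "002", "003"],
)

# Label slice: both '001' and '002' are returned
by_label = people.loc["001":"002"]
print(by_label)

# Position slice: only position 0 is returned, position 1 is excluded
by_position = people.iloc[0:1]
print(by_position)
```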


### <a id='toc3_3_'></a>[Accessing Individual Elements](#toc0_)


#### <a id='toc3_3_1_'></a>[Using `loc[]` with Row and Column Labels](#toc0_)


To access an individual element in a DataFrame, you can use the `loc[]` accessor with both row and column labels.


In [32]:
df.loc['001', 'Age']

25

#### <a id='toc3_3_2_'></a>[Using `iloc[]` with Row and Column Positions](#toc0_)


Similarly, you can use the `iloc[]` accessor with row and column positions to access an individual element.


In [33]:
df.iloc[2, 0]

'Bob'

#### <a id='toc3_3_3_'></a>[Using `at[]` and `iat[]` for Single Element Access](#toc0_)


For faster access to a single element, you can use the `at[]` and `iat[]` accessors. The `at[]` accessor uses labels, while the `iat[]` accessor uses integer positions.


In [34]:
df.at['001', 'Age']  # Access a single element by labels

25

In [35]:
df.iat[0, 2]  # Access a single element by positions

'New York'

These accessors are optimized for single element access and can provide better performance compared to `loc[]` and `iloc[]` when accessing individual elements repeatedly.


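Note that `loc[]` also accepts a boolean mask (such as the result of a conditional expression) or a callable, which lets you select rows that meet specific criteria. `iloc[]` is more restrictive: it expects positions or a plain boolean array (for example `mask.to_numpy()`) rather than a labeled boolean Series.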


In [36]:
df.loc[df['Age'] > 27]  # Select rows based on a condition

|     | Name | Age | City |
|-----|------|-----|------|
| 002 | Alice | 30 | London |
| 003 | Bob | 35 | Paris |
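
Conditions can be combined and paired with a column selection. The sketch below uses a hypothetical `people` frame; note that each condition needs its own parentheses when joined with `&` or `|`:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["John", "Alice", "Bob"],
        "Age": [25, 30, 35],
        "City": ["New York", "London", "Paris"],
    },
    index=["001", "002", "003"],
)

# Combine two conditions with & (and); use | for "or"
older_not_paris = people.loc[(people["Age"] > 27) & (people["City"] != "Paris")]

# A callable receives the DataFrame and returns the mask
via_callable = people.loc[lambda d: d["Age"] > 27]

# Filter rows and pick a single column in one step
names_only = people.loc[people["Age"] > 27, "Name"]

print(older_not_paris)
print(via_callable)
print(names_only)
```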


By mastering these data access techniques, you can efficiently retrieve specific columns, rows, or individual elements from your DataFrame, enabling you to perform targeted analysis and manipulation of your data.


## <a id='toc4_'></a>[Modifying DataFrames](#toc0_)

Pandas provides a range of methods and techniques to modify DataFrames, allowing you to add or remove columns and rows, rename columns and indexes, and update values within the DataFrame. In this section, we will explore these modification operations in detail.


Let's create a sample DataFrame:

In [37]:
# Create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'London', 'Paris', 'New York', 'London'],
    'Salary': [50000, 60000, 70000, 50000, 60000]
}
df = pd.DataFrame(data)
df

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | John | 25 | New York | 50000 |
| 1 | Alice | 30 | London | 60000 |
| 2 | Bob | 35 | Paris | 70000 |
| 3 | John | 25 | New York | 50000 |
| 4 | Alice | 30 | London | 60000 |


Now, let's use this DataFrame to demonstrate the various modification operations.


### <a id='toc4_1_'></a>[Adding and Removing Columns](#toc0_)


To add a new column to the DataFrame, you can assign values to a new column label using square brackets `[]`.


In [38]:
df['Department'] = ['Sales', 'Marketing', 'Engineering', 'Sales', 'Marketing']
df

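|   | Name | Age | City | Salary | Department |
|---|------|-----|------|--------|------------|
| 0 | John | 25 | New York | 50000 | Sales |
| 1 | Alice | 30 | London | 60000 | Marketing |
| 2 | Bob | 35 | Paris | 70000 | Engineering |
| 3 | John | 25 | New York | 50000 | Sales |
| 4 | Alice | 30 | London | 60000 | Marketing |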


You can also assign the result of a computation or a function to a new column. For example, let's add a new column 'Bonus' that is 10% of the 'Salary'.


In [39]:
df['Bonus'] = df['Salary'] * 0.1
df

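|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 1 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 2 | Bob | 35 | Paris | 70000 | Engineering | 7000.0 |
| 3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |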
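
If you prefer not to modify the original DataFrame, the `assign()` method is an alternative that returns a new DataFrame with the extra column. A minimal sketch with a hypothetical `staff` frame:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["John", "Alice"], "Salary": [50000, 60000]})

# assign() returns a copy with the new column; staff itself is untouched
with_bonus = staff.assign(Bonus=lambda d: d["Salary"] * 0.1)

print(list(staff.columns))
print(with_bonus)
```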


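To remove one or more columns, you can use the `drop()` method with `axis=1` to specify that the labels refer to columns. By default, `drop()` returns a new DataFrame and leaves `df` unchanged, so assign the result to a variable if you want to keep it.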


In [40]:
df.drop(['Department', 'Bonus'], axis=1)

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | John | 25 | New York | 50000 |
| 1 | Alice | 30 | London | 60000 |
| 2 | Bob | 35 | Paris | 70000 |
| 3 | John | 25 | New York | 50000 |
| 4 | Alice | 30 | London | 60000 |


In [None]:
# Equivalent to the call above: name the columns directly instead of passing axis=1
df.drop(columns=['Department', 'Bonus'])
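
A small sketch, using a made-up `staff` frame, that checks whether `drop()` changes the original DataFrame:

```python
import pandas as pd

staff = pd.DataFrame(
    {"Name": ["John", "Alice"], "Salary": [50000, 60000], "Bonus": [5000.0, 6000.0]}
)

# drop() hands back a new DataFrame without the column
trimmed = staff.drop(columns=["Bonus"])

print(list(staff.columns))    # the original still has 'Bonus'
print(list(trimmed.columns))  # the result does not
```

To make the removal stick, reassign the result, for example `staff = staff.drop(columns=["Bonus"])`.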

### <a id='toc4_2_'></a>[Adding and Removing Rows](#toc0_)


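To add new rows to the DataFrame, you can use the `pd.concat()` function and pass a list containing the existing DataFrame and a DataFrame with the new rows. Columns missing from the new rows (here `Department` and `Bonus`) are filled with `NaN`, and `ignore_index=True` renumbers the rows instead of keeping the original labels. `pd.concat()` returns a new DataFrame; `df` itself is not modified.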


In [41]:
new_rows = pd.DataFrame({'Name': ['Charlie', 'Eve'], 'Age': [40, 28], 'City': ['Berlin', 'Madrid'], 'Salary': [80000, 55000]})
pd.concat([df, new_rows], ignore_index=True)


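|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 1 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 2 | Bob | 35 | Paris | 70000 | Engineering | 7000.0 |
| 3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 5 | Charlie | 40 | Berlin | 80000 | NaN | NaN |
| 6 | Eve | 28 | Madrid | 55000 | NaN | NaN |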


In [42]:
pd.concat([df, new_rows], ignore_index=False)

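|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 1 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 2 | Bob | 35 | Paris | 70000 | Engineering | 7000.0 |
| 3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 0 | Charlie | 40 | Berlin | 80000 | NaN | NaN |
| 1 | Eve | 28 | Madrid | 55000 | NaN | NaN |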


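To remove rows from the DataFrame based on a condition, you can use boolean indexing to keep only the rows that should remain. The example below keeps the rows with `Age` below 35, which drops Bob's row; the result is a new DataFrame and `df` itself is unchanged.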


In [43]:
df[df['Age'] < 35]

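|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 1 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |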
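
The same removal can be written in a few equivalent ways. The sketch below, with a hypothetical `staff` frame, compares keeping rows with a mask, negating the opposite condition with `~`, and dropping a row by its index label:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]})

# Keep the rows that satisfy the condition
kept = staff[staff["Age"] < 35]

# Negate the condition describing the rows to remove
also_kept = staff[~(staff["Age"] >= 35)]

# Remove a row by its index label
dropped_by_label = staff.drop(index=[2])

print(kept)
```

All three results contain the same rows, and `staff` still has its three original rows afterwards.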


### <a id='toc4_3_'></a>[Renaming Columns and Indexes](#toc0_)


To rename one or more columns in the DataFrame, you can use the `rename()` method and pass a dictionary mapping the old column names to the new column names.


In [44]:
df.rename(columns={'Name': 'Employee', 'City': 'Location'})

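|   | Employee | Age | Location | Salary | Department | Bonus |
|---|----------|-----|----------|--------|------------|-------|
| 0 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 1 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 2 | Bob | 35 | Paris | 70000 | Engineering | 7000.0 |
| 3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| 4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |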


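Similarly, you can rename row labels using the `rename()` method with the `index` parameter. Labels that are missing from the mapping (such as `2` below) are left unchanged, and mapping keys that do not exist in the index (such as `6`) are ignored by default. Like `drop()`, `rename()` returns a new DataFrame.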


In [45]:
df.rename(index={0: 'Emp1', 1: 'Emp2', 3: 'Emp3', 4: 'Emp4', 6: 'Emp5'})

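|      | Name | Age | City | Salary | Department | Bonus |
|------|------|-----|------|--------|------------|-------|
| Emp1 | John | 25 | New York | 50000 | Sales | 5000.0 |
| Emp2 | Alice | 30 | London | 60000 | Marketing | 6000.0 |
| 2 | Bob | 35 | Paris | 70000 | Engineering | 7000.0 |
| Emp3 | John | 25 | New York | 50000 | Sales | 5000.0 |
| Emp4 | Alice | 30 | London | 60000 | Marketing | 6000.0 |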
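
Besides a dictionary, `rename()` also accepts a function that is applied to every label. A minimal sketch with a hypothetical `staff` frame:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["John", "Alice"], "City": ["New York", "London"]})

# Apply a function to every column label
lowercase = staff.rename(columns=str.lower)

# Build new row labels from the old integer labels
prefixed = staff.rename(index=lambda i: f"Emp{i + 1}")

# A key that is not in the index is ignored, so nothing changes
ignored = staff.rename(index={9: "Emp10"})

print(list(lowercase.columns))
print(list(prefixed.index))
print(list(ignored.index))
```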


### <a id='toc4_4_'></a>[Updating Values in DataFrames](#toc0_)


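To update values in the DataFrame, you can use the `loc[]` accessor to select the specific cells you want to modify and assign new values to them. In the example below, the row label `5` does not exist yet, so the first assignment adds a new row whose other columns are filled with `NaN`; this is also why `Age` and `Salary` are displayed as floats afterwards.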


In [46]:
df.loc[5, 'Salary'] = 55000
df.loc[df['City'] == 'London', 'Salary'] = 65000
df

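|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |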
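
The difference between updating an existing cell and assigning to a label that does not exist can be sketched as follows. The `staff` frame is hypothetical; the point is that the second assignment grows the DataFrame:

```python
import pandas as pd

staff = pd.DataFrame(
    {"Name": ["John", "Alice"], "Age": [25, 30], "Salary": [50000, 60000]}
)

# Label 0 exists: the value is updated
staff.loc[0, "Salary"] = 52000

# Label 5 does not exist: a new row is added, other columns become NaN
staff.loc[5, "Salary"] = 55000

print(staff)
print(staff.dtypes)
```

Checking `label in staff.index` before assigning is one way to avoid adding a row by accident.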


These examples demonstrate how you can modify a DataFrame by adding and removing columns and rows, renaming columns and indexes, and updating values based on specific conditions or locations.

## <a id='toc5_'></a>[Sorting DataFrames](#toc0_)

Sorting is a common operation when working with DataFrames, and Pandas provides convenient methods to sort DataFrames based on either the index or column values. Let's explore how to sort DataFrames using the sample DataFrame `df` from the previous examples.


### <a id='toc5_1_'></a>[Sorting by Index](#toc0_)


To sort a DataFrame by its index, you can use the `sort_index()` method. By default, it sorts the index in ascending order.


In [48]:
df.sort_index()

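|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |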


You can also sort the index in descending order by passing the parameter `ascending=False` to the `sort_index()` method.


In [49]:
df.sort_index(ascending=False)

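|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |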


### <a id='toc5_2_'></a>[Sorting by Column Values](#toc0_)


To sort a DataFrame by one or more column values, you can use the `sort_values()` method. You need to specify the column or columns to sort by, and by default, it sorts in ascending order.


In [50]:
df.sort_values('Salary')

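|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |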


You can sort by multiple columns by passing a list of column names to the `sort_values()` method. The sorting will be performed in the order of the columns specified.


In [52]:
df.sort_values(['City', 'Salary'])

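|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |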


To sort in descending order, you can pass the parameter `ascending=False` to the `sort_values()` method.


In [53]:
df.sort_values('Salary', ascending=False)


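|   | Name | Age | City | Salary | Department | Bonus |
|---|------|-----|------|--------|------------|-------|
| 2 | Bob | 35.0 | Paris | 70000.0 | Engineering | 7000.0 |
| 1 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 4 | Alice | 30.0 | London | 65000.0 | Marketing | 6000.0 |
| 5 | NaN | NaN | NaN | 55000.0 | NaN | NaN |
| 0 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |
| 3 | John | 25.0 | New York | 50000.0 | Sales | 5000.0 |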
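
When sorting by several columns, `ascending` can also be a list with one direction per column, and `na_position` controls where missing values are placed. A minimal sketch with a hypothetical `staff` frame that contains one missing city:

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "City": ["Paris", "London", "London", None],
        "Salary": [70000, 60000, 65000, 55000],
    }
)

# City ascending, then Salary descending within each city
mixed_order = staff.sort_values(["City", "Salary"], ascending=[True, False])

# Missing values go last by default; na_position="first" moves them to the top
nulls_first = staff.sort_values("City", na_position="first")

print(mixed_order)
print(nulls_first)
```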


These examples demonstrate how you can sort a DataFrame based on either the index or column values, in ascending or descending order. Sorting DataFrames is a useful operation for organizing and analyzing data effectively.

## <a id='toc6_'></a>[Summary](#toc0_)

In this lecture, we explored the fundamental concepts and techniques for creating and manipulating Pandas DataFrames. We started by discussing the different ways to create DataFrames, including from lists, dictionaries, and NumPy arrays, using the `pd.DataFrame()` function.


We then delved into the various methods and attributes available in DataFrames, such as `head()`, `tail()`, `info()`, `describe()`, `shape`, `size`, `columns`, and `index`, which allow us to inspect and understand the structure and content of our data.


Next, we focused on accessing data within DataFrames. We learned how to access columns using square brackets `[]` and dot notation `.`, access rows using `loc[]` with labels and `iloc[]` with integer positions, and access individual elements using `loc[]`, `iloc[]`, `at[]`, and `iat[]`.


Then, we explored the techniques for modifying DataFrames. We covered adding and removing columns and rows, renaming columns and indexes, and updating values within the DataFrame using various methods and indexing techniques.


Lastly, we discussed sorting DataFrames based on the index or column values using the `sort_index()` and `sort_values()` methods, which are essential for organizing and analyzing data effectively.

Throughout the lecture, we provided practical examples and code snippets to illustrate each concept and technique, making it easier to understand and apply them to real-world datasets.


To deepen your understanding of Pandas DataFrames and explore more advanced topics, you can refer to the following resources:

1. Official Pandas Documentation: [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
   - Comprehensive documentation on DataFrame methods, attributes, and functionalities.

2. Pandas Tutorials:
   - [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html): A quick introduction to Pandas DataFrames and Series.
   - [Pandas Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html): A collection of practical examples and recipes for common data manipulation tasks using Pandas.

3. Books:
   - "Effective Pandas" by Matt Harrison: A concise and practical guide that focuses on best practices and effective techniques for working with Pandas.
   - "Pandas Cookbook" by Theodore Petrou: A practical guide with recipes and examples for data manipulation and analysis using Pandas.

4. Online Courses:
   - "Data Analysis with Pandas and Python" on Udemy: A hands-on course covering data manipulation and analysis techniques using Pandas.
   - "Data Science with Python" on Coursera: A course that includes a module on data manipulation with Pandas.

5. Pandas API Reference: [DataFrame API](https://pandas.pydata.org/docs/reference/frame.html)
   - Detailed documentation on DataFrame methods and attributes for quick reference.


By exploring these resources, you can further enhance your skills in working with Pandas DataFrames and leverage their power for efficient data manipulation and analysis in your projects.


Remember, practice is key to mastering Pandas DataFrames. Don't hesitate to experiment with different datasets, try out new techniques, and apply what you've learned to real-world problems. Happy coding!