# Pandas

**Agenda**
1. Introduction to Pandas
   - Overview of Pandas and its use cases
   - Installation and setup

2. Data Structures in Pandas
   - Series: Creation, manipulation, indexing
   - DataFrame: Creation, manipulation, indexing
   - DataFrame vs. Series
   - Hierarchical Indexing (MultiIndex)

3. Data Manipulation
   - Adding, dropping, and renaming columns/rows
   - Sorting data: `sort_values()`, `sort_index()`
   - Handling missing data: `isnull()`, `dropna()`, `fillna()`
   - Removing duplicates
   - Handling categorical data

4. Grouping and Aggregation
   - Grouping data with `groupby()`
   - Aggregation functions: `sum()`, `mean()`, `count()`, `min()`, `max()`
   - Grouping by multiple columns
   - Pivot tables and cross-tabulation for data analysis

5. Merging and Joining Data
   - Merging DataFrames: `merge()` function
   - Different types of joins
   - Joining DataFrames with `join()`


# Introduction to Pandas

Pandas is an open-source data analysis and manipulation library built on top of the Python programming language. It is widely used in data science, machine learning, and data analysis tasks due to its powerful and flexible data structures, Series and DataFrame. Pandas allows for efficient handling of structured data and provides tools for reading, writing, cleaning, filtering, and visualizing data.

**Key Use Cases of Pandas:**


* Data Cleaning: Handling missing data, removing duplicates, and formatting data.
* Data Transformation: Reshaping data, merging datasets, and grouping data.
* Data Analysis: Statistical analysis, aggregations, and pivot tables.
* Data Visualization: Plotting graphs and charts directly from data structures.
* Input/Output Operations: Reading from and writing to various file formats (CSV, Excel, JSON, SQL).

**Why Use Pandas?**

* Easy-to-Use: Intuitive syntax and powerful functions for data manipulation.
* High Performance: Vectorized, NumPy-backed operations make column-wise computation fast for datasets that fit in memory.
* Integration: Works well with other Python libraries like NumPy, Matplotlib, and Scikit-learn.
* Flexibility: Supports different types of data structures and operations.

## Installation and Setup

To start using Pandas, install it in your Python environment. Pandas ships with scientific Python distributions such as Anaconda; otherwise, install it manually with pip or conda (conda install pandas).

**Installation Using pip:**

In [None]:
pip install pandas




**Importing Pandas:**

Once Pandas is installed, you can import it into your Python script or notebook:

In [None]:
import pandas as pd


The convention is to import Pandas as pd to shorten the namespace when calling functions.
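
A quick way to confirm that the installation and import work is to print the installed version. This is only a minimal check; the version string depends on your environment, and the variable name `version` is chosen here for illustration.

```python
import pandas as pd

# The value printed depends on which pandas release is installed
version = pd.__version__
print(version)
```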

# Data Structures in Pandas

Pandas provides two primary data structures: Series and DataFrame. Understanding these data structures is crucial for effectively working with data in Pandas.

## Series

A Series is a one-dimensional labeled array that can hold data of any type (integer, string, float, etc.). It is similar to a column in a spreadsheet or a database table. Each element in a Series has an associated index, which allows for efficient data retrieval.



### Creating a Series

Creating a Series from a List

In [None]:
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data)

print(series)


0    10
1    20
2    30
3    40
dtype: int64


**Explanation:**

* The Series series contains four elements, indexed by 0, 1, 2, 3.
* The data type (int64) is shown at the bottom.

Creating a Series with a Custom Index

In [None]:
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']
series = pd.Series(data, index=index)

print(series)


a    10
b    20
c    30
d    40
dtype: int64


**Explanation:**

* The Series series now uses custom indices ('a', 'b', 'c', 'd') instead of default numeric indices.
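
A Series can also be built from a dictionary, in which case the keys become the index labels. The sketch below uses made-up data and a hypothetical variable name (`prices`) purely for illustration.

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data
prices = pd.Series({'apple': 1.5, 'banana': 0.5, 'cherry': 3.0})

print(prices)
print(prices.index.tolist())   # ['apple', 'banana', 'cherry']
```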

### Manipulating a Series

Performing Operations on a Series

In [None]:
# Adding 5 to each element in the series
modified_series = series + 5

print(modified_series)


a    15
b    25
c    35
d    45
dtype: int64


**Explanation:**

Each element in the Series is increased by 5.
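
Arithmetic between two Series works the same way, but values are matched by index label rather than by position. The following sketch uses two small illustrative Series (`s1` and `s2` are names assumed here) to show what happens when the labels only partly overlap.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Labels present in only one Series produce NaN in the result
total = s1 + s2
print(total)

# add() with fill_value treats a missing label as 0 instead
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

Labels 'a' and 'd' exist in only one of the inputs, so the plain sum is NaN there, while the fill_value variant keeps the single available value.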


Filtering a Series

In [None]:
# Filtering elements greater than 20
filtered_series = series[series > 20]

print(filtered_series)


c    30
d    40
dtype: int64


**Explanation:**

* The Series is filtered to include only elements greater than 20.


### Indexing a Series

Accessing Elements by Index

In [None]:
# Accessing an element by its index
print(series['b'])


20


**Explanation:**

The value corresponding to index 'b' is accessed.

Slicing a Series

In [None]:
# Slicing the Series
print(series['b':'d'])


b    20
c    30
d    40
dtype: int64


**Explanation:**

A subset of the Series from index 'b' to 'd' is selected. Unlike positional slicing, a label-based slice includes the end label ('d').
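
Label-based and position-based slicing treat the end of the range differently. A minimal sketch comparing the two with .loc and .iloc on the same data (the variable names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = series.loc['b':'d']    # label slice: the end label 'd' is included
by_position = series.iloc[1:3]    # positional slice: position 3 is excluded

print(by_label.tolist())      # [20, 30, 40]
print(by_position.tolist())   # [20, 30]
```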

## DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is analogous to a table in a relational database or an Excel spreadsheet.



### Creating a DataFrame

Creating a DataFrame from a Dictionary

In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


**Explanation:**

* The DataFrame df consists of three columns: Name, Age, and City.
* Each row is indexed by default with integers (0, 1, 2).

Creating a DataFrame with a Custom Index

In [None]:
df = pd.DataFrame(data, index=['a', 'b', 'c'])

print(df)


      Name  Age         City
a    Alice   25     New York
b      Bob   30  Los Angeles
c  Charlie   35      Chicago


**Explanation:**

* The DataFrame df now has a custom index (a, b, c) instead of the default numeric index.
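
Another common input is a list of dictionaries, one per row. The sketch below uses illustrative records (the name `df_records` is an assumption) and shows that a key missing from one record becomes NaN in that row.

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
]
df_records = pd.DataFrame(records)

# 'City' is missing for Alice, so that cell is NaN
print(df_records)
```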

### Manipulating a DataFrame

Adding a New Column

In [None]:
# Adding a new column 'Salary' to the DataFrame
df['Salary'] = [70000, 80000, 90000]

print(df)


      Name  Age         City  Salary
a    Alice   25     New York   70000
b      Bob   30  Los Angeles   80000
c  Charlie   35      Chicago   90000


**Explanation:**

A new column Salary is added to the DataFrame df.

Dropping a Column

In [None]:
# Dropping the 'Age' column
df = df.drop('Age', axis=1)

print(df)


      Name         City  Salary
a    Alice     New York   70000
b      Bob  Los Angeles   80000
c  Charlie      Chicago   90000


**Explanation:**

The Age column is removed from the DataFrame.

### Indexing a DataFrame

Accessing a Single Column

In [None]:
# Accessing the 'Name' column
print(df['Name'])


a      Alice
b        Bob
c    Charlie
Name: Name, dtype: object


**Explanation:**

The Name column is returned as a Series.

Accessing Multiple Columns

In [None]:
# Accessing 'Name' and 'City' columns
print(df[['Name', 'City']])


      Name         City
a    Alice     New York
b      Bob  Los Angeles
c  Charlie      Chicago


**Explanation:**

Multiple columns (Name and City) are selected from the DataFrame.

Accessing Rows by Index

In [None]:
# Accessing a row by its index
print(df.loc['b'])


Name              Bob
City      Los Angeles
Salary          80000
Name: b, dtype: object


**Explanation:**

The row corresponding to index 'b' is accessed.


Slicing Rows

In [None]:
# Slicing rows using the index
print(df.loc['a':'b'])


    Name         City  Salary
a  Alice     New York   70000
b    Bob  Los Angeles   80000


**Explanation:**

A subset of the DataFrame df from index 'a' to 'b' is selected. As with a Series, a label-based slice with .loc includes the end label ('b').
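
Besides label-based access with .loc, rows can be selected by position with .iloc or by a boolean condition. The following sketch rebuilds a small DataFrame like the one above so that it runs on its own; the threshold of 75000 is an arbitrary illustrative value.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'],
     'City': ['New York', 'Los Angeles', 'Chicago'],
     'Salary': [70000, 80000, 90000]},
    index=['a', 'b', 'c'],
)

first_row = df.iloc[0]                     # row by integer position
high_earners = df[df['Salary'] > 75000]    # rows where the condition is True
bob_city = df.loc['b', 'City']             # single cell by row and column label

print(high_earners)
print(bob_city)   # Los Angeles
```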

## DataFrame vs. Series

* Series is a one-dimensional array-like object with an index. It can be thought of as a single column of data.

* DataFrame is a two-dimensional table with rows and columns. It can be thought of as a collection of Series objects that share the same index.


**Converting a DataFrame Column to a Series**

In [None]:
# Extracting the 'Name' column as a Series
name_series = df['Name']

print(name_series)


a      Alice
b        Bob
c    Charlie
Name: Name, dtype: object


**Explanation:**

The Name column is extracted as a Series from the DataFrame.
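
The type of the result depends on the brackets used: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has one entry. A small sketch of the difference (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

as_series = df['Name']     # single brackets: one-dimensional Series
as_frame = df[['Name']]    # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```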

## Hierarchical Indexing (MultiIndex)

Hierarchical indexing, or MultiIndexing, allows you to have multiple levels of index in a Series or DataFrame. This is useful for working with high-dimensional data in a lower-dimensional data structure.

**Creating a MultiIndex DataFrame:**

In [None]:
import pandas as pd
import numpy as np

# Creating a MultiIndex
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]

index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))

# Creating a DataFrame with MultiIndex
df_multi = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['Column1', 'Column2'])

print(df_multi)


              Column1   Column2
Upper Lower                    
A     one   -0.448576  0.773458
      two    0.403773  2.517012
B     one    0.474613  0.702819
      two   -0.646551  1.301305


**Explanation:**

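* The DataFrame df_multi has a MultiIndex with two levels (Upper and Lower).
* The MultiIndex labels each row with a pair such as ('A', 'one'); a repeated outer label is printed only once.
* The values come from np.random.randn(), so they will differ on every run.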
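
Rows of a MultiIndex DataFrame can be selected by the outer level, by a full tuple of labels, or by an inner level. The sketch below rebuilds the same index but uses fixed values from np.arange instead of random numbers, an assumption made here so that the results are reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('Upper', 'Lower'),
)
# Fixed values (0..7) rather than random ones, for reproducibility
df_multi = pd.DataFrame(
    np.arange(8).reshape(4, 2), index=index, columns=['Column1', 'Column2']
)

outer_a = df_multi.loc['A']                      # all rows with Upper == 'A'
single = df_multi.loc[('B', 'two'), :]           # one row, addressed by a tuple
lower_one = df_multi.xs('one', level='Lower')    # cross-section on the inner level

print(outer_a)
print(single)
print(lower_one)
```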

# Data Manipulation

Data manipulation in Pandas involves various operations that modify the structure and content of your DataFrame, such as adding, dropping, renaming columns or rows, sorting, handling missing data, removing duplicates, and managing categorical data.

## Adding, Dropping, and Renaming Columns/Rows

### Adding a Column

You can easily add a new column to a DataFrame by assigning a list or Series to a new column name.

In [None]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)

# Add a new column 'Salary'
df['Salary'] = [70000, 80000, 90000, 100000, 110000]

print(df)


      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston  100000
4      Eva   45      Phoenix  110000


**Explanation:**

A new column Salary is added to the DataFrame, with each row assigned a corresponding salary value.


### Dropping a Column

Columns can be dropped using the drop() method, specifying the column name and axis=1.

In [None]:
# Drop the 'City' column
df = df.drop('City', axis=1)

print(df)


      Name  Age  Salary
0    Alice   25   70000
1      Bob   30   80000
2  Charlie   35   90000
3    David   40  100000
4      Eva   45  110000


**Explanation:**

The City column is removed from the DataFrame.


### Renaming Columns

Columns can be renamed using the rename() method, passing a dictionary where the keys are the old column names and the values are the new column names.

In [None]:
# Rename columns
df = df.rename(columns={'Name': 'Employee Name', 'Age': 'Employee Age'})

print(df)


  Employee Name  Employee Age  Salary
0         Alice            25   70000
1           Bob            30   80000
2       Charlie            35   90000
3         David            40  100000
4           Eva            45  110000


**Explanation:**

The columns Name and Age are renamed to Employee Name and Employee Age, respectively.

### Dropping Rows

Rows can be dropped by index using the drop() method, specifying the index and axis=0 (default).

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Dropping a row by index
df_dropped = df.drop(index=2)

# Display the DataFrame after dropping the row
print("\nDataFrame after dropping the row with index 2:")
print(df_dropped)



Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
4      Eva   45      Phoenix

DataFrame after dropping the row with index 2:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
3  David   40      Houston
4    Eva   45      Phoenix


**Explanation:**

* We use the drop() method to remove the row with index 2 (corresponding to 'Charlie').
* df.drop(index=2) returns a new DataFrame, df_dropped, without that row. The index= keyword selects row labels directly, so no axis argument is needed; the equivalent form is df.drop(2, axis=0), where axis=0 is the default.
* The drop() method does not modify the original DataFrame by default. To modify it in place instead, pass inplace=True, as in df.drop(index=2, inplace=True); the call then returns None.
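
The difference between the default behavior and inplace=True can be sketched as follows, using a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Default: a new DataFrame is returned and df itself is unchanged
df_dropped = df.drop(index=1)
print(len(df), len(df_dropped))   # 3 2

# inplace=True modifies df directly and returns None
result = df.drop(index=1, inplace=True)
print(result)               # None
print(df.index.tolist())    # [0, 2]
```

Reassigning the result (df = df.drop(index=1)) has the same effect as inplace=True and is often easier to read.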



### Renaming Columns

Columns can be renamed using the rename() method, passing a dictionary where the keys are the old column names and the values are the new column names.

In [None]:
# Rename columns
df = df.rename(columns={'Name': 'Employee Name', 'Age': 'Employee Age'})

print(df)


  Employee Name  Employee Age         City
0         Alice            25     New York
1           Bob            30  Los Angeles
2       Charlie            35      Chicago
3         David            40      Houston
4           Eva            45      Phoenix


**Explanation:**

The columns Name and Age are renamed to Employee Name and Employee Age, respectively.

### Dropping Rows

Rows can be dropped by index using the drop() method, specifying the index and axis=0 (default).

In [None]:
# Drop the row with index 3
df = df.drop(3)

print(df)


  Employee Name  Employee Age         City
0         Alice            25     New York
1           Bob            30  Los Angeles
2       Charlie            35      Chicago
4           Eva            45      Phoenix


**Explanation:**

The row with index 3 (David) is removed from the DataFrame.

## Sorting Data

Sorting is essential for organizing and analyzing data. Pandas provides methods like sort_values() and sort_index() for sorting.



### Sorting by Values

Use sort_values() to sort the DataFrame by the values in one or more columns.

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data)

# Sorting the DataFrame by the 'Age' column in ascending order
df_sorted = df.sort_values(by='Age')

# Display the sorted DataFrame
print(df_sorted)


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
4      Eva   45      Phoenix


**Explanation:**

* df.sort_values(by='Age'):
   * The sort_values() method sorts the DataFrame by the values in the 'Age' column.
   * By default, the sorting is done in ascending order (smallest to largest).
* The rows are sorted by the 'Age' column, with the youngest person ('Alice', age 25) at the top and the oldest ('Eva', age 45) at the bottom. The sample data was already in age order, so the output matches the input.
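
sort_values() also accepts ascending=False and a list of columns. The sketch below uses illustrative data with tied ages, so that the effect of a second sort key is visible.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [30, 25, 30, 25],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Largest ages first
oldest_first = df.sort_values(by='Age', ascending=False)

# Age ascending, then Name descending to break ties
by_age_then_name = df.sort_values(by=['Age', 'Name'], ascending=[True, False])

print(by_age_then_name['Name'].tolist())   # ['David', 'Bob', 'Charlie', 'Alice']
```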


### Sorting by Index

Use sort_index() to sort the DataFrame by the index.

In [None]:
# Sort by index
sorted_df = df.sort_index()

print(sorted_df)


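      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
4      Eva   45      Phoenix


**Explanation:**

The DataFrame is sorted by its index in ascending order. Because df was re-created in the previous example with the default index (0 to 4), it is already in index order and the output matches the input.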
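
The effect of sort_index() is easier to see when the index is out of order. A minimal sketch with an illustrative DataFrame (`df_unordered` is a name assumed here):

```python
import pandas as pd

df_unordered = pd.DataFrame(
    {'Name': ['Charlie', 'Alice', 'Bob']},
    index=[2, 0, 1],
)

ascending = df_unordered.sort_index()
descending = df_unordered.sort_index(ascending=False)

print(ascending['Name'].tolist())    # ['Alice', 'Bob', 'Charlie']
print(descending['Name'].tolist())   # ['Charlie', 'Bob', 'Alice']
```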

## Handling Missing Data

Handling missing data is crucial for maintaining the integrity of your analysis. Pandas provides several functions to identify and manage missing values.

### Identifying Missing Data

Use isnull() to detect missing values, which returns a DataFrame of the same shape with boolean values indicating the presence of NaN.

In [None]:
# Example DataFrame with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva'],
    'Age': [25, None, 35, 40, 45],
    'Salary': [70000, 80000, 90000, None, 110000]
})

# Identify missing values
print(df.isnull())


    Name    Age  Salary
0  False  False   False
1  False   True   False
2  False  False   False
3   True  False    True
4  False  False   False


**Explanation:**

The isnull() function returns a boolean DataFrame where True indicates a missing value (NaN or None).

### Dropping Missing Data



Use dropna() to remove rows or columns with missing values.

In [None]:
# Drop rows with any missing values
df_cleaned = df.dropna()

print(df_cleaned)


      Name   Age    Salary
0    Alice  25.0   70000.0
2  Charlie  35.0   90000.0
4      Eva  45.0  110000.0


**Explanation:**

The dropna() function removes any rows that contain at least one missing value.
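
Dropping every row with any missing value is often too aggressive. The sketch below rebuilds the example DataFrame and shows three alternatives: counting missing values per column, requiring only certain columns to be present (subset), and requiring a minimum number of non-missing values (thresh). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva'],
    'Age': [25, np.nan, 35, 40, 45],
    'Salary': [70000, 80000, 90000, np.nan, 110000],
})

missing_per_column = df.isnull().sum()    # number of missing values per column
age_known = df.dropna(subset=['Age'])     # drop rows only if 'Age' is missing
at_least_two = df.dropna(thresh=2)        # keep rows with at least 2 non-missing values

print(missing_per_column)
print(age_known.index.tolist())      # [0, 2, 3, 4]
print(at_least_two.index.tolist())   # [0, 1, 2, 4]
```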


### Filling Missing Data

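Use fillna() to replace missing values with a specified value, such as a constant or a column statistic. To propagate neighboring values instead, use ffill() or bfill(); the older fillna(method=...) form is deprecated in recent pandas versions.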

In [None]:
# Fill missing values in 'Age' with the mean of the column
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values in 'Salary' with 0
df['Salary'] = df['Salary'].fillna(0)

print(df)


      Name    Age    Salary
0    Alice  25.00   70000.0
1      Bob  36.25   80000.0
2  Charlie  35.00   90000.0
3     None  40.00       0.0
4      Eva  45.00  110000.0


**Explanation:**

The missing values in the Age column are filled with the mean value, and the missing values in the Salary column are replaced with 0.
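
For ordered data such as time series, it is often more natural to fill a gap from a neighboring value than with a constant. A minimal sketch using ffill() and bfill() on an illustrative Series (`readings` is a made-up name):

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()    # carry the last known value forward
backward = readings.bfill()   # pull the next known value backward

print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())   # [1.0, 4.0, 4.0, 4.0, nan]
```

The last element stays NaN under bfill() because there is no later value to copy from.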

## Removing Duplicates

Duplicate data can skew analysis, so it's important to identify and remove duplicates.

### Removing Duplicate Rows

Use drop_duplicates() to remove duplicate rows from the DataFrame.

In [None]:
# Example DataFrame with duplicate rows
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Eva'],
    'Age': [25, 30, 35, 30, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Los Angeles', 'Phoenix']
})

# Remove duplicate rows
df_unique = df.drop_duplicates()

print(df_unique)


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
4      Eva   45      Phoenix


**Explanation:**

The drop_duplicates() function removes duplicate rows, retaining the first occurrence of each.
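
Duplicates can also be defined on a subset of columns, and the keep parameter controls which occurrence survives. The sketch below uses illustrative data in which 'Bob' appears twice with different cities.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Eva'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Phoenix'],
})

# duplicated() flags rows whose 'Name' has already been seen
flags = df.duplicated(subset=['Name'])

# Compare on 'Name' only and keep the last occurrence
last_kept = df.drop_duplicates(subset=['Name'], keep='last')

print(flags.tolist())             # [False, False, True, False]
print(last_kept.index.tolist())   # [0, 2, 3]
```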

## Handling Categorical Data

Categorical data is data that takes on a limited, fixed number of possible values. Pandas provides tools to work with such data, including converting columns to the category data type.

**Converting to Categorical:**

You can convert a column to categorical using the astype('category') method.

In [None]:
# Convert 'City' column to categorical data type
df['City'] = df['City'].astype('category')

print(df.dtypes)


Name      object
Age        int64
City    category
dtype: object


**Explanation:**

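The City column is converted to a categorical data type. Each distinct value is stored once and the rows hold small integer codes, which reduces memory usage and can speed up operations such as groupby() when a column has few distinct values relative to its length.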

**Categorical Operations:**

You can perform operations like renaming categories or checking category counts.

In [None]:

# Rename categories in the 'City' column
df['City'] = df['City'].cat.rename_categories({'New York': 'NYC', 'Los Angeles': 'LA'})

# Check the number of occurrences of each category
print(df['City'].value_counts())


City
LA         2
Chicago    1
NYC        1
Phoenix    1
Name: count, dtype: int64


**Explanation:**

* The categories New York and Los Angeles are renamed to NYC and LA, respectively.
* The value_counts() method counts the number of occurrences of each category.
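
Categories can also carry an order, which makes sorting and comparisons follow the logical order rather than the alphabetical one. A minimal sketch, assuming an illustrative size column with the order small < medium < large:

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the allowed categories and their order
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = sizes.astype(size_type)

print(sizes.sort_values().tolist())   # ['small', 'small', 'medium', 'large']
print((sizes > 'small').tolist())     # [True, False, True, False]
print(sizes.cat.codes.tolist())       # [1, 0, 2, 0]
```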

# Grouping and Aggregation

Grouping and aggregation in Pandas are powerful tools for data analysis. They allow you to group data based on one or more columns and then apply aggregate functions like sum(), mean(), count(), etc., to summarize the data.

## Grouping Data with groupby()

The groupby() function in Pandas is used to split data into groups based on some criteria. Once the data is grouped, you can apply aggregation functions to each group independently.



In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 130, 250, 180],
    'Profit': [30, 50, 70, 60, 90, 40]
}

df = pd.DataFrame(data)

# Grouping by the 'Category' column
grouped = df.groupby('Category')

# Display the grouped object
print(grouped)

# Applying aggregation function to the grouped object
sum_sales = grouped['Sales'].sum()
print("\nTotal Sales by Category:")
print(sum_sales)


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7a8ec5007580>

Total Sales by Category:
Category
A    550
B    460
Name: Sales, dtype: int64


**Explanation:**
* groupby('Category'): This groups the data by the 'Category' column. The result is a GroupBy object that is ready for aggregation.
* grouped['Sales'].sum(): This sums the 'Sales' values within each group. The output will show the total sales for each category.
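
To see what a GroupBy object contains before aggregating, you can iterate over it, count the rows per group, or pull out a single group. A short sketch on the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 130, 250, 180],
})

grouped = df.groupby('Category')

# Each iteration yields the group key and the matching sub-DataFrame
for key, group in grouped:
    print(key, group['Sales'].tolist())

group_sizes = grouped.size()        # number of rows in each group
group_a = grouped.get_group('A')    # only the rows of category 'A'
```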

## Aggregation Functions

After grouping data, you can apply various aggregation functions to summarize the data.

Common Aggregation Functions:

* sum(): Calculates the sum of values.
* mean(): Calculates the mean (average) of values.
* count(): Counts the number of non-NA/null observations.
* min(): Finds the minimum value.
* max(): Finds the maximum value.

**Example: Applying Multiple Aggregation Functions**



In [None]:
# Applying multiple aggregation functions to 'Sales' and 'Profit'
agg_functions = grouped.agg({
    'Sales': ['sum', 'mean', 'min', 'max'],
    'Profit': ['sum', 'mean']
})

print("\nAggregated Data:")
print(agg_functions)



Aggregated Data:
         Sales                       Profit           
           sum        mean  min  max    sum       mean
Category                                              
A          550  183.333333  100  250    190  63.333333
B          460  153.333333  130  180    150  50.000000


**Explanation:**
grouped.agg({...}): The agg() function allows you to apply multiple aggregation functions at once. Here, we apply sum, mean, min, and max to 'Sales', and sum and mean to 'Profit'.
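
The dictionary form produces two-level column names. When flat, descriptive column names are preferable, named aggregation is an alternative: each keyword argument is an output column, given as a (column, function) pair. The output names below (total_sales, avg_profit, n_rows) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 130, 250, 180],
    'Profit': [30, 50, 70, 60, 90, 40],
})

# keyword = (source column, aggregation function)
summary = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean'),
    n_rows=('Sales', 'count'),
)

print(summary)
```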

## Grouping by Multiple Columns

You can group data by more than one column to perform more complex group-wise operations.

In [None]:
# Creating a more complex DataFrame
data = {
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 130, 300]
}

df = pd.DataFrame(data)

# Grouping by 'Region' and 'Category'
grouped_multi = df.groupby(['Region', 'Category'])

# Summing the sales for each group
sum_sales_multi = grouped_multi['Sales'].sum()

print("\nTotal Sales by Region and Category:")
print(sum_sales_multi)



Total Sales by Region and Category:
Region  Category
East    A           230
        B           150
West    A           200
        B           550
Name: Sales, dtype: int64


**Explanation:**
* groupby(['Region', 'Category']): This groups the data first by 'Region' and then by 'Category'.
* grouped_multi['Sales'].sum(): This sums the sales within each group combination of 'Region' and 'Category'.
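
The result of a multi-column groupby is a Series with a MultiIndex. Two common follow-up steps, sketched here on the same data, are flattening it back into ordinary columns with reset_index() and spreading one level across the columns with unstack().

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 130, 300],
})

totals = df.groupby(['Region', 'Category'])['Sales'].sum()

flat = totals.reset_index()          # index levels become ordinary columns
wide = totals.unstack('Category')    # 'Category' values become the columns

print(flat)
print(wide)
```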

## Pivot Tables and Cross-Tabulation

Pivot tables and cross-tabulation are powerful tools for summarizing data, similar to grouping but often used to reshape data.



### Pivot Tables

Pivot tables allow you to create a spreadsheet-style pivot table as a DataFrame.

In [None]:
# Creating a pivot table
pivot = df.pivot_table(values='Sales', index='Region', columns='Category', aggfunc='sum')

print("\nPivot Table:")
print(pivot)



Pivot Table:
Category    A    B
Region            
East      230  150
West      200  550


**Explanation:**
* df.pivot_table(): This creates a pivot table that summarizes the data by 'Region' and 'Category', showing the sum of 'Sales'.
* values='Sales': Specifies the values to aggregate.
* index='Region': Rows are grouped by 'Region'.
* columns='Category': Columns are grouped by 'Category'.
* aggfunc='sum': Specifies the aggregation function to be used.
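
Two further pivot_table() parameters are often useful: fill_value replaces empty cells, and margins=True adds row and column totals labeled 'All'. The sketch below uses a smaller illustrative dataset in which the West region has no sales in category B.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West'],
    'Category': ['A', 'A', 'B', 'A'],
    'Sales': [100, 200, 150, 250],
})

pivot = df.pivot_table(
    values='Sales', index='Region', columns='Category',
    aggfunc='sum',
    fill_value=0,    # show 0 instead of NaN for missing combinations
    margins=True,    # add an 'All' row and column with totals
)

print(pivot)
```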

### Cross-Tabulation

Cross-tabulation (pd.crosstab()) is used to compute a simple cross-tabulation of two or more factors.

In [None]:
# Creating a cross-tabulation
crosstab = pd.crosstab(df['Region'], df['Category'], values=df['Sales'], aggfunc='sum')

print("\nCross-Tabulation:")
print(crosstab)



Cross-Tabulation:
Category    A    B
Region            
East      230  150
West      200  550


**Explanation:**
* pd.crosstab(): Creates a cross-tabulation table.
* By default, pd.crosstab() counts how often each combination of factors occurs. Passing values= and aggfunc=, as here, aggregates a third column instead, which makes the result equivalent to the pivot table above.
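
For comparison, the more typical use of pd.crosstab() is a plain frequency table. The sketch below counts the rows per Region and Category combination, and then uses normalize='index' to turn each row into proportions.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
})

# Number of rows for each Region/Category combination
counts = pd.crosstab(df['Region'], df['Category'])

# Share of each category within a region (each row sums to 1)
shares = pd.crosstab(df['Region'], df['Category'], normalize='index')

print(counts)
print(shares)
```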

# Merging and Joining Data in Pandas

Merging and joining are essential operations in data analysis when working with multiple datasets. These operations allow you to combine different DataFrames based on common columns or indices.



## Merging DataFrames with merge()

The merge() function in Pandas is used to combine two DataFrames based on one or more common columns (keys). It's similar to SQL joins like INNER JOIN, LEFT JOIN, RIGHT JOIN, and OUTER JOIN.


In [None]:
import pandas as pd

# Creating two sample DataFrames
data_left = {
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32]
}

data_right = {
    'ID': [3, 4, 5, 6],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 80000, 75000, 90000]
}

df_left = pd.DataFrame(data_left)
df_right = pd.DataFrame(data_right)

# Performing an inner join on the 'ID' column
merged_df = pd.merge(df_left, df_right, on='ID', how='inner')

print("Inner Join Result:")
print(merged_df)


Inner Join Result:
   ID     Name  Age         City  Salary
0   3  Charlie   22     New York   70000
1   4    David   32  Los Angeles   80000


**Explanation:**
* pd.merge(df_left, df_right, on='ID', how='inner'): This merges df_left and df_right DataFrames based on the 'ID' column. The how='inner' argument specifies that we want an inner join, meaning only rows with matching 'ID' values in both DataFrames are included in the result.
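
The key columns do not need to share a name. In that case left_on and right_on name the key in each DataFrame, and suffixes distinguishes any other columns that overlap. The DataFrames and column names below (employees, reviews, emp_id, employee, Score) are hypothetical and used only for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [7, 8, 9],
})
reviews = pd.DataFrame({
    'employee': [2, 3, 4],
    'Score': [70, 80, 90],
})

merged = pd.merge(
    employees, reviews,
    left_on='emp_id', right_on='employee',   # differently named key columns
    how='inner',
    suffixes=('_self', '_manager'),          # applied to the overlapping 'Score'
)

print(merged)
```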

## Different Types of Joins

### Left Join



A left join returns all the rows from the left DataFrame, along with the matching rows from the right DataFrame. If there is no match, NaN values are used.

In [None]:
left_join_df = pd.merge(df_left, df_right, on='ID', how='left')

print("\nLeft Join Result:")
print(left_join_df)



Left Join Result:
   ID     Name  Age         City   Salary
0   1    Alice   24          NaN      NaN
1   2      Bob   27          NaN      NaN
2   3  Charlie   22     New York  70000.0
3   4    David   32  Los Angeles  80000.0


**Explanation:**
The result includes all rows from the left DataFrame (df_left). Where there is no match in the right DataFrame, the values are NaN; Salary is displayed as a float because an int64 column cannot hold NaN.


### Right Join

A right join returns all the rows from the right DataFrame, along with the matching rows from the left DataFrame.

In [None]:
right_join_df = pd.merge(df_left, df_right, on='ID', how='right')

print("\nRight Join Result:")
print(right_join_df)


**Explanation:**
* right_join_df = pd.merge(df_left, df_right, on='ID', how='right'): This line merges df_left and df_right using a right join.
* The on='ID' specifies that the merge should happen based on the ID column.
* The how='right' argument ensures that all rows from the right DataFrame (df_right) will be in the resulting DataFrame, and where there is no match in the left DataFrame (df_left), the result will have NaN (Not a Number) for those columns.

### Outer Join

An outer join returns all rows from both DataFrames, filling in NaN where there is no match.

In [None]:
outer_join_df = pd.merge(df_left, df_right, on='ID', how='outer')

print("\nOuter Join Result:")
print(outer_join_df)



Outer Join Result:
   ID     Name   Age         City   Salary
0   1    Alice  24.0          NaN      NaN
1   2      Bob  27.0          NaN      NaN
2   3  Charlie  22.0     New York  70000.0
3   4    David  32.0  Los Angeles  80000.0
4   5      NaN   NaN      Chicago  75000.0
5   6      NaN   NaN      Houston  90000.0


**Explanation:**
* outer_join_df = pd.merge(df_left, df_right, on='ID', how='outer'): This line merges df_left and df_right using an outer join.
* The on='ID' specifies that the merge should happen based on the ID column.
* The how='outer' argument ensures that all rows from both df_left and df_right will be in the resulting DataFrame. If there's no match in either DataFrame, the result will have NaN for the missing values.
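
When checking a join, it helps to know where each row came from. The indicator=True option of merge() adds a _merge column with the values left_only, right_only, or both. A minimal sketch on reduced versions of the two DataFrames (the name `audit` is illustrative):

```python
import pandas as pd

df_left = pd.DataFrame({'ID': [1, 2, 3, 4],
                        'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df_right = pd.DataFrame({'ID': [3, 4, 5, 6],
                         'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']})

# _merge records whether each ID was found in the left, the right, or both
audit = pd.merge(df_left, df_right, on='ID', how='outer', indicator=True)

print(audit[['ID', '_merge']])
```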

## Joining DataFrames with join()

The join() method is used to combine DataFrames on their indices. It can be used to perform similar operations as merge(), but it typically joins on the index rather than a column.

In [None]:
# Creating DataFrames with the index set to 'ID'
df_left.set_index('ID', inplace=True)
df_right.set_index('ID', inplace=True)

# Performing a join operation
joined_df = df_left.join(df_right, how='inner')

print("\nJoin Result:")
print(joined_df)



Join Result:
       Name  Age         City  Salary
ID                                   
3   Charlie   22     New York   70000
4     David   32  Los Angeles   80000


**Explanation:**
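* df_left.set_index('ID', inplace=True): We first set the 'ID' column as the index of both DataFrames. Because inplace=True is used, df_left and df_right themselves are modified.
* df_left.join(df_right, how='inner'): This joins the two DataFrames on their indices. The how='inner' argument specifies an inner join; without it, join() performs a left join.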
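
The default behavior of join() differs from merge(): join() keeps all rows of the calling DataFrame (a left join) unless how is given. The sketch below uses two small illustrative DataFrames (`left` and `right` are names assumed here) to compare the default with an outer join.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
right = pd.DataFrame({'City': ['Chicago', 'Houston']}, index=[3, 4])

default_join = left.join(right)              # how='left' is the default
outer_join = left.join(right, how='outer')   # union of both indices

print(default_join)
print(outer_join)
```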