### Mastering Pandas for Machine Learning and Data Science

#### Module 1: Introduction to Pandas
1. **Getting Started with Pandas**
   - What is Pandas?
   - Installation and Setup

2. **Pandas Data Structures**
   - Series
   - DataFrame
   - Index Objects

3. **Basic Operations in Pandas**
   - Data Loading and Saving
   - Indexing and Slicing
   - Data Manipulation Techniques

#### Module 2: Data Cleaning and Preparation
1. **Data Cleaning Techniques**
   - Handling Missing Data
   - Handling Duplicates
   - Data Imputation

2. **Data Transformation**
   - Data Types Conversion
   - Data Normalization and Scaling
   - Handling Outliers

#### Module 3: Exploratory Data Analysis (EDA) with Pandas
1. **Descriptive Statistics**
   - Summary Statistics
   - GroupBy Operations

2. **Data Visualization with Pandas**
   - Plotting Basics
   - Customizing Plots

#### Module 4: Advanced Data Manipulation with Pandas
1. **Advanced Indexing and Selection**
   - Hierarchical Indexing
   - Multi-level Indexing

2. **Time Series Analysis with Pandas**
   - Handling Time Series Data
   - Resampling and Rolling Windows

#### Module 5: Combining and Merging DataFrames
1. **Concatenating DataFrames**
   - Row and Column Concatenation

2. **Merging and Joining DataFrames**
   - Inner, Outer, Left, and Right Joins
   - Merging on Index

#### Module 6: Grouping and Aggregating Data
1. **GroupBy Operations**
   - Split-Apply-Combine Concept
   - Aggregation Functions

2. **Pivot Tables and Cross-Tabulation**
   - Creating Pivot Tables
   - Cross-Tabulation Analysis

#### Module 7: Pandas for Machine Learning
1. **Feature Engineering with Pandas**
   - Creating New Features
   - Handling Categorical Variables

2. **Data Preprocessing Pipelines**
   - Building Preprocessing Pipelines with Pandas

#### Module 8: Real-world Data Projects with Pandas
1. **Case Studies and Projects**
   - Analyzing Real-world Datasets
   - Implementing ML Pipelines with Pandas

#### Module 9: Performance Optimization and Best Practices
1. **Pandas Performance Optimization**
   - Vectorization Techniques
   - Using Cython and Numba

2. **Best Practices and Tips**
   - Writing Efficient Pandas Code
   - Memory Optimization Techniques

#### Module 10: Advanced Topics in Pandas
1. **Handling Big Data with Pandas**
   - Working with Dask DataFrame
   - Using Pandas with Apache Spark

2. **Customizing and Extending Pandas**
   - Creating Custom Functions and Methods
   - Extending Pandas Functionality

#### Final Project:
- Apply all the learned concepts to a comprehensive data science project using real-world datasets.

#### Resources and Tools:
- Jupyter Notebooks for interactive learning.
- Real-world datasets for hands-on practice.
- Additional resources like books, articles, and documentation for further reading.

#### Evaluation:
- Quizzes, assignments, and a final project to assess understanding and practical application.

#### Conclusion:
- Recap of key learnings.
- Next steps in the learning journey.

This outline runs from the basics of Pandas to advanced topics aimed at machine learning and data science applications. Each module can be broken into smaller sections, with hands-on exercises and projects to reinforce the concepts.

#### Module 1: Introduction to Pandas
1. **Getting Started with Pandas**
   - What is Pandas?
   - Installation and Setup

#### What is Pandas?

Pandas is a powerful Python library used extensively in data manipulation and analysis. It provides data structures and functions designed to make working with structured data fast, easy, and expressive. The main data structures in Pandas are Series and DataFrame.

- **Series**: A one-dimensional labeled array capable of holding any data type. It's similar to a list or a one-dimensional NumPy array but with additional functionalities.
- **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types. It's like a spreadsheet or SQL table, with rows and columns, where each column can be of a different data type.

Pandas is widely used in data science, machine learning, and quantitative finance, among other fields. Its versatility and ease of use make it a favorite tool for data manipulation and analysis tasks.

#### Installation and Setup

To install Pandas, you can use pip, Python's package installer. Open your command line interface (CLI) and run the following command:

```
pip install pandas
```

This command will download and install the latest version of Pandas from the Python Package Index (PyPI) along with its dependencies.

Once Pandas is installed, you can import it into your Python scripts or interactive sessions using the following convention:

```python
import pandas as pd
```

The `pd` abbreviation is a commonly used alias for Pandas and makes it easier to reference Pandas functions and objects in your code.

After importing Pandas, you're ready to start using its powerful features for data manipulation, analysis, and visualization.
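
One quick way to confirm that the installation worked is to print the installed version. This is a minimal sketch; the variable name `installed_version` is only for illustration.

```python
import pandas as pd

# pd.__version__ holds the version string of the installed package
installed_version = pd.__version__
print(installed_version)
```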

Example:

```python
import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 35, 42, 32],
        'City': ['New York', 'Paris', 'London', 'Sydney']}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```

This will create a simple DataFrame and display its contents:

```
    Name  Age      City
0   John   28  New York
1   Anna   35     Paris
2  Peter   42    London
3  Linda   32    Sydney
```

With Pandas installed and imported, you're ready to explore its rich functionality for data manipulation and analysis.


### Pandas Data Structures

#### 1. Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It is similar to a one-dimensional array or list but with additional functionalities and a labeled index, which makes it more powerful.

**Key Features:**
- **Indexing**: Each element in a Series is associated with a unique index, which can be explicitly defined or automatically generated.
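- **Homogeneity**: A Series has a single dtype for all of its elements. Values of mixed types are allowed, but they are then stored under the generic `object` dtype, which loses most of the speed of numeric dtypes.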
- **Vectorized Operations**: Series supports vectorized operations, making it efficient for mathematical calculations.

**Example:**
```python
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
```

Output:
```
0    10
1    20
2    30
3    40
4    50
dtype: int64
```
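
The key features listed above can be seen in a short sketch. The Series names (`prices`, `doubled`, `expensive`, `mixed`) and values are made up for illustration.

```python
import pandas as pd

# A numeric Series: one dtype (int64) for every element
prices = pd.Series([10, 20, 30, 40, 50])

# Vectorized arithmetic: applied element-wise, no explicit loop
doubled = prices * 2

# A vectorized comparison gives a boolean mask that can filter the Series
expensive = prices[prices > 25]

# Mixed values are accepted, but the Series falls back to the object dtype
mixed = pd.Series([1, "two", 3.0])

print(doubled.tolist())    # [20, 40, 60, 80, 100]
print(expensive.tolist())  # [30, 40, 50]
print(mixed.dtype)         # object
```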

#### 2. DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is similar to a spreadsheet or SQL table, where data is organized in rows and columns. DataFrames are widely used for data manipulation, analysis, and cleaning in Python.

**Key Features:**
- **Columns**: Each column in a DataFrame represents a different variable or feature.
- **Indexing**: DataFrames have both row and column indexes, allowing for flexible data access.
- **Column Operations**: DataFrame supports various operations on columns, such as adding, removing, and renaming columns.
- **Handling Missing Values**: DataFrames provide methods to handle missing values, such as dropping or imputing them.

**Example:**
```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 35, 42],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
```

Output:
```
    Name  Age      City
0   John   28  New York
1   Anna   35     Paris
2  Peter   42    London
```
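
The column operations and missing-value handling mentioned in the key features might look like the following sketch. The missing age and the `Name_Length` column are assumptions added for illustration; they are not part of the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, np.nan, 42],  # one missing value, added for illustration
    'City': ['New York', 'Paris', 'London'],
})

# Add a column derived from an existing one
df['Name_Length'] = df['Name'].str.len()

# rename() and drop() return new DataFrames; df itself is unchanged
renamed = df.rename(columns={'City': 'Location'})
without_city = df.drop(columns=['City'])

# Count missing values per column
missing_per_column = df.isna().sum()

# Two common ways to handle them: drop the rows, or fill with an estimate
dropped = df.dropna(subset=['Age'])
filled = df.fillna({'Age': df['Age'].mean()})

print(missing_per_column)
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; filling with the mean is only one simple option.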

#### 3. Index Objects
Index objects are immutable arrays that hold axis labels and other metadata. They provide the "index" or "row labels" for Series and DataFrame objects, allowing for efficient data alignment and manipulation.

**Key Features:**
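- **Label Lookup**: Index objects support fast label-based lookup. Labels are usually unique, but Pandas does not enforce this: duplicate labels are allowed, and `Index.is_unique` reports whether an index is free of them.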
- **Immutability**: Index objects are immutable, meaning their contents cannot be changed after creation.
- **Alignment**: Index objects facilitate alignment during arithmetic and logical operations on Series and DataFrame objects.

**Example:**
```python
import pandas as pd

# Creating a Series with a custom index
data = [10, 20, 30]
custom_index = ['A', 'B', 'C']
series = pd.Series(data, index=custom_index)
print(series.index)
```

Output:
```
Index(['A', 'B', 'C'], dtype='object')
```
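
A short sketch of how these index properties behave in practice. The Series `s1`, `s2`, and `dup` are invented for illustration.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
s2 = pd.Series([1, 2, 3], index=['B', 'C', 'D'])

# Alignment: values are matched by label, not by position.
# 'A' and 'D' appear in only one Series, so the result is NaN there.
total = s1 + s2

# Immutability: assigning to one element of an Index raises TypeError
try:
    s1.index[0] = 'Z'
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new one is allowed
s1.index = ['X', 'Y', 'Z']

# Duplicate labels are permitted; is_unique reports them
dup = pd.Series([1, 2, 3], index=['A', 'A', 'B'])

print(total)
print(mutated)              # False
print(dup.index.is_unique)  # False
print(dup.loc['A'])         # both rows labelled 'A'
```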

Understanding these Pandas data structures—Series, DataFrame, and Index Objects—is fundamental for effective data manipulation and analysis in Python. They provide powerful tools for handling and processing structured data in various real-world applications.



### Basic Operations in Pandas

#### 1. Data Loading and Saving
Pandas provides various functions to load data from different file formats (e.g., CSV, Excel, SQL databases) into DataFrame objects and save DataFrame objects to files.

**Key Functions:**
- `pd.read_csv()`: Load data from a CSV file into a DataFrame.
- `pd.read_excel()`: Load data from an Excel file into a DataFrame.
- `pd.read_sql()`: Load data from a SQL database into a DataFrame.
- `DataFrame.to_csv()`: Save DataFrame to a CSV file.
- `DataFrame.to_excel()`: Save DataFrame to an Excel file.

**Example:**
```python
import pandas as pd

# Load data from a CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Save DataFrame to a CSV file
df.to_csv('output.csv', index=False)
```
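
The example above needs a `data.csv` file on disk. To try the same round trip without any file, an in-memory text buffer can stand in for it. This is a sketch; `csv_text` and its contents are invented for illustration.

```python
import io
import pandas as pd

# In-memory CSV text standing in for a file on disk
csv_text = "Name,Age,City\nJohn,28,New York\nAnna,35,Paris\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write without the row index, then read the result back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

# Without index=False the row index is written as an extra, unnamed
# first column, which read_csv then loads as 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(round_trip.columns.tolist())
print(with_index.columns.tolist())
```

This is why `index=False` is commonly passed to `to_csv()` when the index carries no information.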

#### 2. Indexing and Slicing
Indexing and slicing operations allow you to select subsets of data from a DataFrame or Series based on row and column labels or positions.

**Key Operations:**
- **Single Label/Index**: `df.loc[label]` or `series.loc[label]`
- **Single Position**: `df.iloc[position]` or `series.iloc[position]`
- **Label-Based Slicing**: `df.loc[start_label:end_label]` (the end label is included)
- **Position-Based Slicing**: `df.iloc[start_position:end_position]` (the end position is excluded)

**Example:**
```python
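# Select rows and columns using labels (both ends of each slice are included).
# Assumes df has row labels 3 to 5 and columns 'Name' through 'Age'.
subset1 = df.loc[3:5, 'Name':'Age']

# Select rows and columns using positions (the end of each slice is excluded)
subset2 = df.iloc[1:3, 0:2]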
```
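
The difference between label-based and position-based slicing is easiest to see on a small DataFrame. The sketch below reuses the four-person example data from earlier in this module.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 35, 42, 32],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
})

# .loc slices by label and includes the end label: rows 1 and 2
by_label = df.loc[1:2, 'Name':'Age']

# .iloc slices by position and excludes the end position: row 1 only
by_position = df.iloc[1:2, 0:2]

# With a non-numeric index, .loc uses the labels and .iloc still counts rows
by_name = df.set_index('Name')
anna_to_peter = by_name.loc['Anna':'Peter']
first_row = by_name.iloc[0]

print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 2)
```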

#### 3. Data Manipulation Techniques
Pandas provides a wide range of functions and methods for data manipulation, including adding, removing, and transforming data within DataFrame objects.

**Key Techniques:**
- **Adding Columns**: `df['new_column'] = values`
- **Removing Columns**: `df.drop(columns=['column_name'])`
- **Renaming Columns**: `df.rename(columns={'old_name': 'new_name'})`
- **Applying Functions**: `df.apply(func)`
- **Filtering Rows**: `df[df['column'] > value]`
- **Sorting Data**: `df.sort_values(by='column')`
- **Grouping Data**: `df.groupby('column').aggregate(func)`

**Example:**
```python
# Adding a new column (approximate birth year, taking 2022 as the reference year)
df['Birth_Year'] = 2022 - df['Age']

# Removing a column
df.drop(columns=['Age'], inplace=True)

# Renaming a column
df.rename(columns={'Name': 'Full_Name'}, inplace=True)

# Applying a function to a column
df['Full_Name'] = df['Full_Name'].apply(lambda x: x.upper())

# Filtering rows based on a condition
subset = df[df['City'] == 'New York']

# Grouping data and calculating aggregates
grouped_data = df.groupby('City').aggregate({'Birth_Year': 'mean'})
```
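
The same steps can also be written without `inplace=True`, as a chain of methods that each return a new DataFrame. The sketch below uses that style and adds the sorting step from the list of key techniques. The data is the four-person example from earlier, and 2022 is kept as the reference year.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 35, 42, 32],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
})

# Each method returns a new DataFrame, so the original df is left untouched
result = (
    df.assign(Birth_Year=2022 - df['Age'])           # add a column
      .drop(columns=['Age'])                         # remove a column
      .rename(columns={'Name': 'Full_Name'})         # rename a column
      .sort_values(by='Birth_Year', ascending=False) # sort, youngest first
)

print(result)
```

Chaining keeps the original data available for comparison, at the cost of creating intermediate copies.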

Understanding these basic operations in Pandas is crucial for efficiently loading, manipulating, and analyzing data in Python. They provide the foundation for more advanced data processing and analysis tasks in Pandas.

### 1. Data Loading and Saving

In [1]:
# Data Loading and Saving

import pandas as pd

# Load data from CSV file into a DataFrame
df = pd.read_csv('online_transaction_log.csv')

# Display the first few rows of the DataFrame
# print(df.head())
df.head()

# Save DataFrame to a new CSV file (optional)
# df.to_csv('online_transaction_log_processed.csv', index=False)


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


### 2. Indexing and Slicing 

In [2]:
# Indexing and Slicing 

# Selecting specific columns
print("# Selecting specific columns")
subset = df[['step', 'type', 'amount', 'nameOrig']]
# print(subset)
subset

# Selecting specific columns


|  | step | type | amount | nameOrig |
|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 |
| 2 | 1 | TRANSFER | 181.00 | C1305486145 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 |
| ... | ... | ... | ... | ... |
| 6362615 | 743 | CASH_OUT | 339682.13 | C786484425 |
| 6362616 | 743 | TRANSFER | 6311409.28 | C1529008245 |
| 6362617 | 743 | CASH_OUT | 6311409.28 | C1162922333 |
| 6362618 | 743 | TRANSFER | 850002.52 | C1685995037 |


In [3]:
# Indexing and Slicing 

# Selecting rows with a specific condition
print("\n# Selecting rows with a specific condition")
fraudulent_transactions = df[df['isFraud'] == 1]

# print(fraudulent_transactions)
fraudulent_transactions


# Selecting rows with a specific condition


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.00 | 0.0 | C553264065 | 0.00 | 0.00 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.00 | 0.0 | C38997010 | 21182.00 | 0.00 | 1 | 0 |
| 251 | 1 | TRANSFER | 2806.00 | C1420196421 | 2806.00 | 0.0 | C972765878 | 0.00 | 0.00 | 1 | 0 |
| 252 | 1 | CASH_OUT | 2806.00 | C2101527076 | 2806.00 | 0.0 | C1007251739 | 26202.00 | 0.00 | 1 | 0 |
| 680 | 1 | TRANSFER | 20128.00 | C137533655 | 20128.00 | 0.0 | C1848415041 | 0.00 | 0.00 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6362615 | 743 | CASH_OUT | 339682.13 | C786484425 | 339682.13 | 0.0 | C776919290 | 0.00 | 339682.13 | 1 | 0 |
| 6362616 | 743 | TRANSFER | 6311409.28 | C1529008245 | 6311409.28 | 0.0 | C1881841831 | 0.00 | 0.00 | 1 | 0 |
| 6362617 | 743 | CASH_OUT | 6311409.28 | C1162922333 | 6311409.28 | 0.0 | C1365125890 | 68488.84 | 6379898.11 | 1 | 0 |
| 6362618 | 743 | TRANSFER | 850002.52 | C1685995037 | 850002.52 | 0.0 | C2080388513 | 0.00 | 0.00 | 1 | 0 |


In [17]:
# Indexing and Slicing 

# Selecting rows based on index
print("\n# Selecting rows based on index")
specific_row = df.loc[5]
# print(specific_row)
specific_row


# Selecting rows based on index


step                         1
type                   PAYMENT
amount                 7817.71
nameOrig             C90045638
oldbalanceOrg          53860.0
newbalanceOrig        46042.29
recipient           M573487274
oldBalanceDest             0.0
newbalanceDest             0.0
isFraud                      0
totalBalanceOrig      99902.29
Name: 5, dtype: object

Note that this cell was run as `In [17]`, after the data manipulation cells in the next section, so its output already shows the renamed columns (`recipient`, `oldBalanceDest`) and the added `totalBalanceOrig` column, and no longer includes `isFlaggedFraud`.

In [8]:
# Indexing and Slicing 

# Selecting specific rows and columns using slicing
print("\n# Selecting specific rows and columns using slicing")
subset = df.loc[3:5, ['type', 'amount', 'nameOrig']]
# print(subset)
subset


# Selecting specific rows and columns using slicing


|  | type | amount | nameOrig |
|---|---|---|---|
| 3 | CASH_OUT | 181.0 | C840083671 |
| 4 | PAYMENT | 11668.14 | C2048537720 |
| 5 | PAYMENT | 7817.71 | C90045638 |


### 3. Data Manipulation Techniques

In [9]:
#  Data Manipulation Techniques
# Adding a new column
print("# Adding a new column")
df['totalBalanceOrig'] = df['oldbalanceOrg'] + df['newbalanceOrig']
# print(df.head())
df.head()

# Adding a new column


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | totalBalanceOrig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 | 330432.36 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 | 40633.72 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 | 181.0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 | 181.0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 | 71439.86 |


In [10]:
#  Data Manipulation Techniques
# Removing a column
print("\n# Removing a column")
df.drop(columns=['isFlaggedFraud'], inplace=True)
# print(df.head())
df.head()


# Removing a column


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | totalBalanceOrig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 330432.36 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 40633.72 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 181.0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 181.0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 71439.86 |


In [11]:
#  Data Manipulation Techniques
# Renaming columns
print("\n# Renaming columns")
df.rename(columns={'nameDest': 'recipient', 'oldbalanceDest': 'oldBalanceDest'}, inplace=True)
# print(df.head())
df.head()


# Renaming columns


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | recipient | oldBalanceDest | newbalanceDest | isFraud | totalBalanceOrig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 330432.36 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 40633.72 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 181.0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 181.0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 71439.86 |


In [12]:
#  Data Manipulation Techniques
# Applying a function to a column
print("\n# Applying a function to a column")
df['amount'] = df['amount'].apply(lambda x: round(x, 2))
# print(df.head())
df.head()


# Applying a function to a column


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | recipient | oldBalanceDest | newbalanceDest | isFraud | totalBalanceOrig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 330432.36 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 40633.72 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 181.0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 181.0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 71439.86 |


In the above code snippet, a lambda function is used to apply the `round()` function to the 'amount' column of the DataFrame `df`. 

Here's a breakdown of what the lambda function does:
- `lambda x`: This defines an anonymous function with one parameter `x`.
- `round(x, 2)`: Inside the lambda function, it calls the `round()` function to round the value of `x` to 2 decimal places.

So, the lambda function essentially rounds each value in the 'amount' column to two decimal places. This is commonly done to ensure consistent precision in numerical data.
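
For this particular task, Pandas also offers the vectorized `Series.round()` method, which avoids calling a Python function once per element. The sketch below compares the two approaches on a few made-up amounts (the `amounts` Series is hypothetical).

```python
import pandas as pd

# Hypothetical amounts with more than two decimal places
amounts = pd.Series([9839.644, 1864.2849, 181.0, 11668.136])

# Element-by-element, as in the cell above
rounded_apply = amounts.apply(lambda x: round(x, 2))

# Vectorized equivalent: one call on the whole Series
rounded_vectorized = amounts.round(2)

print(rounded_vectorized.tolist())  # [9839.64, 1864.28, 181.0, 11668.14]
```

On a dataset with millions of rows, such as the transaction log used here, the vectorized form is usually noticeably faster.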

In [13]:
#  Data Manipulation Techniques
# Filtering rows based on multiple conditions
print("\n# Filtering rows based on multiple conditions")
filtered_transactions = df[(df['type'] == 'TRANSFER') & (df['amount'] > 100000)]
# print(filtered_transactions.head())
filtered_transactions.head()


# Filtering rows based on multiple conditions


|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | recipient | oldBalanceDest | newbalanceDest | isFraud | totalBalanceOrig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 1 | TRANSFER | 215310.3 | C1670993182 | 705.0 | 0.0 | C1100439041 | 22425.0 | 0.0 | 0 | 705.0 |
| 24 | 1 | TRANSFER | 311685.89 | C1984094095 | 10835.0 | 0.0 | C932583850 | 6267.0 | 2719172.89 | 0 | 10835.0 |
| 82 | 1 | TRANSFER | 224606.64 | C873175411 | 0.0 | 0.0 | C766572210 | 354678.92 | 0.0 | 0 | 0.0 |
| 83 | 1 | TRANSFER | 125872.53 | C1443967876 | 0.0 | 0.0 | C392292416 | 348512.0 | 3420103.09 | 0 | 0.0 |
| 84 | 1 | TRANSFER | 379856.23 | C1449772539 | 0.0 | 0.0 | C1590550415 | 900180.0 | 19169204.93 | 0 | 0.0 |
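
When combining conditions, each comparison needs its own parentheses, and the element-wise operators `&` (and), `|` (or), and `~` (not) replace Python's `and`, `or`, and `not`. The sketch below shows the pattern on a tiny stand-in DataFrame `tx`; the four rows are illustrative and stand in for the full transaction log.

```python
import pandas as pd

# Small stand-in for the transaction log
tx = pd.DataFrame({
    'type': ['TRANSFER', 'PAYMENT', 'TRANSFER', 'CASH_OUT'],
    'amount': [215310.30, 9839.64, 181.00, 339682.13],
})

# AND: both conditions must hold
large_transfers = tx[(tx['type'] == 'TRANSFER') & (tx['amount'] > 100000)]

# OR: either condition may hold
payment_or_large = tx[(tx['type'] == 'PAYMENT') | (tx['amount'] > 300000)]

# NOT: invert a condition
not_payment = tx[~(tx['type'] == 'PAYMENT')]

# isin(): membership in a list of values
transfers_or_cash_outs = tx[tx['type'].isin(['TRANSFER', 'CASH_OUT'])]

print(large_transfers)
```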


In [15]:
#  Data Manipulation Techniques
# Grouping data and calculating aggregates
print("\n# Grouping data and calculating aggregates")
transaction_summary = df.groupby('type').agg({'amount': 'sum', 'isFraud': 'sum'})
# print(transaction_summary)
transaction_summary


# Grouping data and calculating aggregates


| type | amount | isFraud |
|---|---|---|
| CASH_IN | 236367400000.0 | 0 |
| CASH_OUT | 394413000000.0 | 4116 |
| DEBIT | 227199200.0 | 0 |
| PAYMENT | 28093370000.0 | 0 |
| TRANSFER | 485292000000.0 | 4097 |

### Module 2: Data Cleaning and Preparation

**Data Cleaning Techniques**

1. **Handling Missing Data**: Strategies for dealing with missing values in datasets.
2. **Handling Duplicates**: Techniques for identifying and removing duplicate entries in data.
3. **Data Imputation**: Methods for filling in missing values with estimated or calculated values.

**Data Transformation**

1. **Data Types Conversion**: Techniques for converting columns to the data types an analysis requires.
2. **Data Normalization and Scaling**: Methods for normalizing and scaling data to a common scale.
3. **Handling Outliers**: Strategies for detecting and managing outliers in datasets.
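
As a rough sketch of the transformation topics, the example below converts a text column to numbers, applies min-max scaling, and flags outliers with the interquartile range (IQR) rule. The data is made up, and min-max scaling and the 1.5 × IQR threshold are just one common choice for each step, not the only one.

```python
import pandas as pd

# Amounts stored as text, with one unusually large value
df = pd.DataFrame({'amount': ['10', '20', '30', '40', '1000']})

# Data type conversion: text to float
df['amount'] = df['amount'].astype(float)

# Min-max scaling: map the column onto the range [0, 1]
amount_min = df['amount'].min()
amount_max = df['amount'].max()
df['amount_scaled'] = (df['amount'] - amount_min) / (amount_max - amount_min)

# Outlier detection with the IQR rule
q1 = df['amount'].quantile(0.25)
q3 = df['amount'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['is_outlier'] = ~df['amount'].between(lower, upper)

print(df)
```

Note how the single large value compresses the scaled values of the other rows toward 0, which is one reason outliers are often handled before scaling.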

In [None]:
import pandas as pd

# Load data from CSV file into a DataFrame
df = pd.read_csv('online_transaction_log.csv')

# 1. Handling Missing Data
# Identify missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)
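
The remaining cleaning steps can be sketched on a small made-up DataFrame, so that the effect of each call is easy to follow. The data below is hypothetical, and median imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicate row and one missing amount
df = pd.DataFrame({
    'type': ['PAYMENT', 'TRANSFER', 'TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'amount': [100.0, 250.0, 250.0, np.nan, 40.0],
})

# 1. Handling missing data: count missing values per column
missing_values = df.isnull().sum()

# 2. Handling duplicates: count, then remove, fully identical rows
n_duplicates = df.duplicated().sum()
deduped = df.drop_duplicates()

# 3. Data imputation: fill the missing amount with the column median
imputed = deduped.fillna({'amount': deduped['amount'].median()})

# Alternative: drop rows that still contain missing values
dropped = deduped.dropna()

print(missing_values)
print(imputed)
```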