# Introduction to pandas

## What is pandas?

Pandas is a popular open-source Python library for data manipulation and analysis. It provides powerful data structures and data analysis tools built on top of NumPy, a library for scientific computing in Python. Pandas was created by Wes McKinney in 2008 and is widely used in various fields, including finance, economics, statistics, and data science.

The name "pandas" is derived from the term "panel data," which is a multidimensional data structure commonly used in econometrics and statistics. However, the library is not limited to panel data and can handle a wide range of data structures and formats.

## Why use pandas?

Pandas provides several advantages for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data:

1. **Easy to use**: Pandas has a clean and intuitive syntax for data manipulation and analysis, making it accessible to both novice and experienced users.

2. **High-performance**: Pandas is built on top of NumPy, which is highly optimized for numerical operations. Operations on whole columns run in compiled code rather than Python loops, which makes pandas efficient for datasets that fit in memory.

3. **Powerful data structures**: Pandas provides two main data structures: Series (1D labeled homogeneous array) and DataFrame (2D labeled data structure with columns of potentially different types). These data structures make it easy to work with structured data.

4. **Data cleaning and preprocessing**: Pandas offers a wide range of functions and methods for data cleaning, preprocessing, merging, reshaping, and handling missing data.

5. **Data analysis and visualization**: Pandas integrates well with other libraries like Matplotlib, Seaborn, and Plotly for data visualization. It also supports advanced statistical and analytical operations.

6. **Interoperability**: Pandas can read and write data in various formats, such as CSV, Excel, SQL databases, and more, making it easy to work with data from different sources.

7. **Time series analysis**: Pandas has strong support for working with time series data, including handling dates and time zones, resampling, and time-based operations.

These features make pandas a versatile library for data manipulation, analysis, and exploration, and a go-to choice for data scientists, analysts, and researchers.

## Importing pandas and other necessary libraries

In [1]:
import pandas as pd
import numpy as np

In this notebook, we import the pandas library and assign it the conventional alias `pd`. We also import the NumPy library, which is commonly used alongside pandas for numerical operations.

# pandas Data Structures

Pandas provides two main data structures: Series and DataFrame. These data structures are designed to handle structured data efficiently and provide powerful tools for data manipulation and analysis.

## Series (1D labeled homogeneous array)

A Series is a one-dimensional labeled array. All of its values share a single data type (dtype); mixed values fall back to the generic `object` dtype. It is similar to a column in a spreadsheet or a SQL table, where each entry has a corresponding label (index). A Series can be created from a list, a NumPy array, a dictionary, or a scalar value.

In [2]:
# Creating a Series from a list
data = [1, 2, 3, 4]
series = pd.Series(data)
print(series)

In this example, we create a Series from a list of integers. The Series has default integer indices (0, 1, 2, 3) assigned automatically. You can access individual elements of a Series using their index or label.
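
To make element access concrete, here is a minimal sketch. The string labels `'w'` to `'z'` and the variable names are illustrative assumptions, not part of the example above; `.loc` looks up by label and `.iloc` by integer position.

```python
import pandas as pd

# Same values as above, but with explicit (illustrative) string labels
labeled = pd.Series([1, 2, 3, 4], index=['w', 'x', 'y', 'z'])

by_label = labeled.loc['y']        # label-based lookup
by_position = labeled.iloc[0]      # position-based lookup
subset = labeled.loc[['w', 'z']]   # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the labels themselves are integers.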

Series can also be created from dictionaries, where the keys become the labels, and the values become the data:

In [3]:
# Creating a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)

Series are useful for representing and manipulating one-dimensional data, such as a list of values or a column from a table.

## DataFrame (2D labeled data structure with columns of potentially different types)

A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or a SQL table. It is a collection of Series objects, where each Series represents a column, and each row represents a unique entry with a label (index).

In [1]:
# Creating a DataFrame from a dictionary of lists
import pandas as pd

data = {'Name': ['John', 'Emily', 'Michael', 'Sarah'],
        'Age': [25, 32, 38, 27],
        'City': ['New York', 'London', 'Paris', 'Berlin']}

df = pd.DataFrame(data)
print(df)

In this example, we create a DataFrame from a dictionary where the keys represent the column names, and the values are lists containing the data for each column.

DataFrames can also be created from lists of lists, NumPy arrays, or other data structures. You can specify column names explicitly or let pandas automatically assign default column names (0, 1, 2, ...).

In [2]:
# Creating a DataFrame from a list of lists
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
print(df)

DataFrames are powerful data structures that allow you to store and manipulate tabular data efficiently. You can access and manipulate columns and rows using various indexing techniques, perform operations on the data, and apply functions across rows or columns.

In [3]:
# Accessing a column (Series)
print(df['Numbers'])

# Selecting multiple columns (pass a list of column names)
print(df[['Numbers', 'Letters']])

DataFrames are widely used in data analysis, data cleaning, and data manipulation tasks due to their flexibility and powerful features. They provide a convenient way to work with structured data and perform various operations, such as filtering, sorting, grouping, and merging data.

# Creating pandas Objects

Pandas provides several ways to create its core data structures, Series and DataFrames, from various data sources. In this section, we'll explore different methods of creating pandas objects from lists, dictionaries, scalars, and various file formats.

In [1]:
import pandas as pd

## Creating Series from lists, dictionaries, and scalars

### Creating a Series from a list

In [2]:
# Creating a Series from a list
data = [1, 2, 3, 4]
series = pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
dtype: int64


In this example, we create a Series from a list of integers. The Series has default integer indices (0, 1, 2, 3) assigned automatically.

### Creating a Series from a dictionary

In [3]:
# Creating a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)

a    10
b    20
c    30
dtype: int64


In this case, we create a Series from a dictionary, where the keys become the labels, and the values become the data.

### Creating a Series from a scalar value

In [4]:
# Creating a Series from a scalar value
scalar_series = pd.Series(5, index=[0, 1, 2, 3])
print(scalar_series)

0    5
1    5
2    5
3    5
dtype: int64


Here, we create a Series from a scalar value (5) and specify the desired index labels ([0, 1, 2, 3]).

## Creating DataFrames from lists, dictionaries, and other data structures

### Creating a DataFrame from a dictionary of lists

In [5]:
# Creating a DataFrame from a dictionary of lists
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah'],
        'Age': [25, 32, 38, 27],
        'City': ['New York', 'London', 'Paris', 'Berlin']}

df = pd.DataFrame(data)
print(df)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


In this example, we create a DataFrame from a dictionary where the keys represent the column names, and the values are lists containing the data for each column.

### Creating a DataFrame from a list of lists

In [6]:
# Creating a DataFrame from a list of lists
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
print(df)

   Numbers Letters
0        1       a
1        2       b
2        3       c


Here, we create a DataFrame from a list of lists, and we explicitly provide the column names using the `columns` parameter.

DataFrames can also be created from NumPy arrays, database tables, or other data structures. This flexibility allows you to work with data from various sources and formats.
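
As a rough sketch of two of these alternatives, the block below builds one DataFrame from a 2D NumPy array and another from a list of dictionaries. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array; without `columns`, pandas would label the columns 0, 1, 2
array = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(array, columns=['x', 'y', 'z'])

# From a list of dictionaries (one per row); keys missing from a row become NaN
records = [{'Name': 'John', 'Age': 25}, {'Name': 'Emily'}]
df_from_records = pd.DataFrame(records)

print(df_from_array)
print(df_from_records)
```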

## Reading data from various file formats (CSV, Excel, SQL databases, etc.)

Pandas provides convenient functions to read data from various file formats, including CSV, Excel, SQL databases, JSON, and more.

In [1]:
import pandas as pd

### Reading data from a CSV file

In [2]:
# Assuming you have a CSV file named 'data.csv' in the same directory
csv_data = pd.read_csv('data.csv')
print(csv_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


The `pd.read_csv()` function reads the data from a CSV file and creates a DataFrame. You can also specify additional parameters to handle different CSV formats, encoding, and other options.
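
The sketch below shows a few of those additional parameters. To keep it self-contained, an in-memory `io.StringIO` buffer stands in for a file on disk; the semicolon-separated content and the `Joined` column are assumptions made up for the example.

```python
import io
import pandas as pd

# Stand-in for a file on disk: semicolon-separated text with a date column
raw = io.StringIO(
    "Name;Age;Joined\n"
    "John;25;2021-03-01\n"
    "Emily;32;2022-11-15\n"
)

people = pd.read_csv(
    raw,
    sep=';',                      # field delimiter (the default is ',')
    usecols=['Name', 'Joined'],   # load only these columns
    parse_dates=['Joined'],       # parse this column as datetimes
)
print(people.dtypes)
```

With a real file, you would pass its path instead of the buffer and could add `encoding=...` if the file is not UTF-8.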

### Reading data from an Excel file

In [3]:
# Assuming you have an Excel file named 'data.xlsx' in the same directory
excel_data = pd.read_excel('data.xlsx')
print(excel_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


The `pd.read_excel()` function reads data from an Excel file and creates a DataFrame. You can specify the sheet name, header row, and other options as needed.

### Reading data from a SQL database

In [4]:
# Assuming you have a SQL database connection
import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM users"
sql_data = pd.read_sql_query(query, conn)
print(sql_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


The `pd.read_sql_query()` function executes a SQL query against a database connection and creates a DataFrame from the result. You'll need to establish a connection to your database and provide a valid SQL query.
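
If you do not have a database file at hand, the following sketch runs the same idea end to end against an in-memory SQLite database. The `users` table and its rows are created here purely for illustration, and the query uses a bound parameter instead of string formatting.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (illustrative data)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (Name TEXT, Age INTEGER, City TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [('John', 25, 'New York'), ('Emily', 32, 'London'), ('Michael', 38, 'Paris')],
)
conn.commit()

# The `?` placeholder is filled from `params`
over_30 = pd.read_sql_query(
    "SELECT Name, Age FROM users WHERE Age > ? ORDER BY Age",
    conn,
    params=(30,),
)
conn.close()
print(over_30)
```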

Pandas also supports reading data from other formats like JSON, HTML, and more. The flexibility in reading data from various sources makes it convenient to work with data from different environments and integrate it into your analysis or processing pipelines.

# Data Inspection and Selection

After creating pandas objects (Series and DataFrames), it's essential to inspect and explore the data before performing any analysis or manipulation. Pandas provides various methods and techniques for data inspection and selection, which we'll cover in this section.

In [1]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)

## Viewing the first and last few rows

In [2]:
# Viewing the first 3 rows
print(df.head(3))
print("\n")  # Adding a newline for better readability

# Viewing the last 3 rows
print(df.tail(3))

      Name   Age      City  Income
0     John  25.0  New York   50000
1    Emily  32.0    London   65000
2  Michael  38.0     Paris   75000

      Name   Age    City  Income
3    Sarah  27.0  Berlin   42000
4    David  45.0   Tokyo   88000
5  Jessica   NaN  Sydney   56000


The `head()` and `tail()` methods allow you to quickly inspect the first and last few rows of a DataFrame, respectively. By default, they show 5 rows, but you can specify the number of rows to display by passing an argument.

## Checking the data types

In [3]:
# Checking the data types
print(df.dtypes)

Name       object
Age       float64
City       object
Income      int64
dtype: object


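The `dtypes` attribute of a DataFrame displays the data type of each column. This is useful for understanding the nature of your data and identifying potential issues, such as mixed data types or missing values. Here, 'Age' is `float64` rather than an integer type because it contains a missing value, which pandas stores as `NaN`.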
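
The effect of a missing value on the dtype can be seen in isolation with a small sketch. The `ages` Series is illustrative; `'Int64'` (capital "I") is pandas' nullable integer dtype, which is one option when you want to keep integers alongside missing entries.

```python
import pandas as pd

ages = pd.Series([25, 32, None])
print(ages.dtype)            # float64: the missing entry is stored as NaN

# Convert to the nullable integer dtype, which represents the gap as <NA>
nullable_ages = ages.astype('Int64')
print(nullable_ages.dtype)
print(nullable_ages.isna().sum())
```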

## Selecting columns and rows by label or index

In [4]:
# Selecting a column (Series)
print(df['Name'])
print("\n")  # Adding a newline for better readability

# Selecting multiple columns
print(df[['Name', 'Age', 'City']].head(3))
print("\n")  # Adding a newline for better readability

# Selecting a different subset of columns
print(df[['Age', 'Income']])

0       John
1      Emily
2    Michael
3      Sarah
4      David
5    Jessica
Name: Name, dtype: object

      Name   Age      City
0     John  25.0  New York
1    Emily  32.0    London
2  Michael  38.0     Paris

    Age  Income
0  25.0   50000
1  32.0   65000
2  38.0   75000
3  27.0   42000
4  45.0   88000
5   NaN   56000

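
The cell above selects columns only. For rows, a minimal sketch using `.loc` (label-based) and `.iloc` (position-based) might look like this; the small `people` frame and its string index are assumptions chosen to make the difference visible.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['John', 'Emily', 'Michael'], 'Age': [25, 32, 38]},
    index=['a', 'b', 'c'],   # illustrative string labels
)

row_b = people.loc['b']                     # one row by label, returned as a Series
first_two = people.iloc[0:2]                # rows by position; the end is exclusive
label_slice = people.loc['a':'b', 'Name']   # label slices include both endpoints
cell = people.iloc[2, 1]                    # single value: third row, second column

print(row_b)
print(first_two)
```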

## Conditional selection

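Conditional selection (boolean indexing) allows you to filter rows based on specific conditions. In the first example, we select rows where the 'Age' column is greater than or equal to 35. Rows with a missing 'Age' are excluded, because comparisons with `NaN` evaluate to `False`.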

In [1]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)

In [2]:
# Selecting rows based on a condition
print(df[df['Age'] >= 35])

      Name   Age   City  Income
2  Michael  38.0  Paris   75000
4    David  45.0  Tokyo   88000


In the second example, we use the `isna()` method to select rows where the 'Age' column has missing values:

In [3]:
# Selecting rows with missing values
print(df[df['Age'].isna()])

      Name  Age    City  Income
5  Jessica  NaN  Sydney   56000


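You can combine multiple conditions using boolean operators (`&`, `|`, `~`) to create more complex filters. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>=`. For example, to select rows where the 'Age' is greater than or equal to 35 and the 'Income' is greater than 70000: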

In [4]:
# Selecting rows based on multiple conditions
print(df[(df['Age'] >= 35) & (df['Income'] > 70000)])

      Name   Age   City  Income
2  Michael  38.0  Paris   75000
4    David  45.0  Tokyo   88000


Conditional selection is a powerful feature that allows you to extract specific subsets of data based on various criteria, making it easier to focus your analysis on the relevant data.
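
The `|` (or) and `~` (not) operators work the same way as `&`. The sketch below uses a shortened, illustrative version of the sample data and also shows one thing to watch for: a comparison with `NaN` is `False`, so negating it selects the rows with missing values too.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Emily', 'Michael', 'Jessica'],
    'Age': [25, 32, 38, None],
    'Income': [50000, 65000, 75000, 56000],
})

# Either condition may hold
young_or_high = people[(people['Age'] < 30) | (people['Income'] > 70000)]

# Negation: Jessica is included because NaN > 30 is False, and ~False is True
not_over_30 = people[~(people['Age'] > 30)]

print(young_or_high)
print(not_over_30)
```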

# Data Manipulation

Pandas provides a wide range of functions and methods for manipulating and transforming data. In this section, we'll explore various data manipulation techniques, including handling missing data, renaming and adding columns, applying functions, sorting and ranking, grouping and aggregating, combining datasets, and reshaping data.

In [1]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)

## Handling missing data

In [2]:
# Dropping rows with missing values
df_dropped = df.dropna()
print(df_dropped)

      Name   Age      City  Income
0     John  25.0  New York   50000
1    Emily  32.0    London   65000
2  Michael  38.0     Paris   75000
3    Sarah  27.0    Berlin   42000
4    David  45.0     Tokyo   88000


In [3]:
# Filling missing values with a specified value
df_filled = df.fillna(35)
print(df_filled)

      Name   Age      City  Income
0     John  25.0  New York   50000
1    Emily  32.0    London   65000
2  Michael  38.0     Paris   75000
3    Sarah  27.0    Berlin   42000
4    David  45.0     Tokyo   88000
5  Jessica  35.0    Sydney   56000


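The `dropna()` method drops rows (or, with `axis=1`, columns) that contain missing values, while the `fillna()` method replaces missing values with a value you supply, such as a constant or a statistic you compute yourself (e.g., the column mean or median). Note that `df.fillna(35)` fills missing values in every column, not just 'Age'.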
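
A slightly more targeted approach is sketched below: passing a dictionary to `fillna()` sets a different fill value per column, and `dropna(subset=...)` only considers the listed columns. The small `people` frame is illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Emily', 'Jessica'],
    'Age': [25.0, 35.0, None],
    'City': ['New York', None, 'Sydney'],
})

# Per-column fill values: the mean age for 'Age', a placeholder for 'City'
filled = people.fillna({'Age': people['Age'].mean(), 'City': 'Unknown'})

# Drop rows only when 'City' is missing; a missing 'Age' is tolerated
complete_city = people.dropna(subset=['City'])

print(filled)
print(complete_city)
```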

## Renaming and adding new columns

In [4]:
# Renaming columns
df_renamed = df_filled.rename(columns={'Age': 'Years', 'City': 'Location'})

# Adding a new column
df_renamed['Annual_Income'] = df_renamed['Income']

print(df_renamed)

      Name  Years  Location  Income  Annual_Income
0     John   25.0  New York   50000          50000
1    Emily   32.0    London   65000          65000
2  Michael   38.0     Paris   75000          75000
3    Sarah   27.0    Berlin   42000          42000
4    David   45.0     Tokyo   88000          88000
5  Jessica   35.0    Sydney   56000          56000


The `rename()` method allows you to rename columns in a DataFrame, while you can add new columns by assigning values to a new column name.

## Applying functions to data (map, apply)

In [1]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)
df_filled = df.fillna(35)

### Applying a function to a column

In [2]:
# Applying a function to a column
df_filled['Name'] = df_filled['Name'].str.upper()
print(df_filled)

      Name   Age      City  Income
0     JOHN  25.0  New York   50000
1    EMILY  32.0    London   65000
2  MICHAEL  38.0     Paris   75000
3    SARAH  27.0    Berlin   42000
4    DAVID  45.0     Tokyo   88000
5  JESSICA  35.0    Sydney   56000


In this example, we call the vectorized string method `str.upper()` on the 'Name' column through the `.str` accessor, which converts all the strings in that column to uppercase.
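
The section title also mentions `map`. As a minimal sketch, `Series.map` transforms each element using either a dictionary lookup or a function; the `country_by_city` mapping below is an assumption made up for the example.

```python
import pandas as pd

cities = pd.Series(['New York', 'London', 'Paris', 'Oslo'])

# Dictionary lookup: values without a matching key become NaN
country_by_city = {'New York': 'USA', 'London': 'UK', 'Paris': 'France'}
countries = cities.map(country_by_city)

# A function is applied to each element in turn
lengths = cities.map(len)

print(countries)
print(lengths)
```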

### Applying a function to a DataFrame

In [3]:
# Applying a function to a DataFrame
def calculate_bonus(row):
    income = row['Income']
    if income < 50000:
        return income * 0.25
    elif income < 75000:
        return income * 0.30
    else:
        return income * 0.45

df_filled['Bonus'] = df_filled.apply(calculate_bonus, axis=1)
print(df_filled)

      Name   Age      City  Income    Bonus
0     JOHN  25.0  New York   50000  15000.0
1    EMILY  32.0    London   65000  19500.0
2  MICHAEL  38.0     Paris   75000  33750.0
3    SARAH  27.0    Berlin   42000  10500.0
4    DAVID  45.0     Tokyo   88000  39600.0
5  JESSICA  35.0    Sydney   56000  16800.0


In this example, we define a custom function `calculate_bonus` that calculates a bonus based on the income. We then use the `apply` method to apply this function to each row of the DataFrame along the row axis (`axis=1`). The resulting bonus values are stored in a new column named 'Bonus'.

The `apply` method is a powerful tool for applying custom functions to data in pandas. It can be used to apply a function along rows (`axis=1`) or columns (`axis=0`) of a DataFrame, or to a Series.
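
Row-wise `apply` calls a Python function once per row, which can be slow on large frames. One possible vectorized alternative for the same bonus tiers is sketched below with `np.select`; it is an illustration of the idea, not the only way to do it, and the `incomes` frame simply repeats the income values used above.

```python
import numpy as np
import pandas as pd

incomes = pd.DataFrame({'Income': [50000, 65000, 75000, 42000, 88000, 56000]})

# Conditions are checked in order; the first match wins, otherwise `default`
rate = np.select(
    [incomes['Income'] < 50000, incomes['Income'] < 75000],
    [0.25, 0.30],
    default=0.45,
)
incomes['Bonus'] = incomes['Income'] * rate
print(incomes)
```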

## Sorting and ranking data

In [4]:
# Sorting data
sorted_df = df_filled.sort_values(by='Income')
print(sorted_df)

      Name   Age      City  Income    Bonus
3    SARAH  27.0    Berlin   42000  10500.0
0     JOHN  25.0  New York   50000  15000.0
5  JESSICA  35.0    Sydney   56000  16800.0
1    EMILY  32.0    London   65000  19500.0
2  MICHAEL  38.0     Paris   75000  33750.0
4    DAVID  45.0     Tokyo   88000  39600.0


In [5]:
# Ranking data
ranked_df = sorted_df.reset_index(drop=True)
ranked_df['Rank'] = ranked_df['Income'].rank()
print(ranked_df)

      Name   Age      City  Income    Bonus  Rank
0    SARAH  27.0    Berlin   42000  10500.0   1.0
1     JOHN  25.0  New York   50000  15000.0   2.0
2  JESSICA  35.0    Sydney   56000  16800.0   3.0
3    EMILY  32.0    London   65000  19500.0   4.0
4  MICHAEL  38.0     Paris   75000  33750.0   5.0
5    DAVID  45.0     Tokyo   88000  39600.0   6.0


### Ranking data

In [1]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)
df_filled = df.fillna(35)

def calculate_bonus(row):
    income = row['Income']
    if income < 50000:
        return income * 0.25
    elif income < 75000:
        return income * 0.30
    else:
        return income * 0.45

df_filled['Bonus'] = df_filled.apply(calculate_bonus, axis=1)

In [2]:
# Ranking data
sorted_df = df_filled.sort_values(by='Income')
ranked_df = sorted_df.reset_index(drop=True)
ranked_df['Rank'] = ranked_df['Income'].rank()
print(ranked_df)

      Name   Age      City  Income    Bonus  Rank
0    SARAH  27.0    Berlin   42000  10500.0   1.0
1     JOHN  25.0  New York   50000  15000.0   2.0
2  JESSICA  35.0    Sydney   56000  16800.0   3.0
3    EMILY  32.0    London   65000  19500.0   4.0
4  MICHAEL  38.0     Paris   75000  33750.0   5.0
5    DAVID  45.0     Tokyo   88000  39600.0   6.0


In this example, we first sort the DataFrame by the 'Income' column using the `sort_values` method. We then call `reset_index(drop=True)` to replace the shuffled original labels (3, 0, 5, ...) with a fresh 0 to 5 index and discard the old one.

Next, we use the `rank` method to rank the 'Income' column. By default, ranks are assigned in ascending order (the smallest value gets rank 1), and tied values each receive the average of the ranks they span (`method='average'`), which is why the ranks are returned as floats.

The `rank` method has several parameters that control how ranks are assigned, such as the `method` (e.g., 'dense', 'min', 'max'), `ascending` (True or False), and `na_option` (to handle missing values).

In [3]:
# Ranking data with ties handled using the 'dense' method
ranked_df['Rank'] = ranked_df['Income'].rank(method='dense')
print(ranked_df)

      Name   Age      City  Income    Bonus  Rank
0    SARAH  27.0    Berlin   42000  10500.0   1.0
1     JOHN  25.0  New York   50000  15000.0   2.0
2  JESSICA  35.0    Sydney   56000  16800.0   3.0
3    EMILY  32.0    London   65000  19500.0   4.0
4  MICHAEL  38.0     Paris   75000  33750.0   5.0
5    DAVID  45.0     Tokyo   88000  39600.0   6.0


In the last example, we pass `method='dense'` to handle ties differently. With the 'dense' method, tied values share the same rank and the next distinct value receives the next consecutive integer, so the ranking has no gaps. Because every income in this dataset is distinct, the output here is identical to the default ranking.
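
Since the income data contains no ties, the difference between tie-handling methods is easier to see on a small Series that does. The `scores` values below are made up for illustration.

```python
import pandas as pd

scores = pd.Series([10, 20, 20, 30])   # the two 20s are tied

comparison = pd.DataFrame({
    'score': scores,
    'average': scores.rank(),               # default: ties get the mean of their ranks
    'min': scores.rank(method='min'),       # ties get the lowest rank; a gap follows
    'dense': scores.rank(method='dense'),   # ties share a rank; no gap follows
})
print(comparison)
```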

# Pandas: Data Manipulation, Grouping, Combining, and Reshaping

## Data Manipulation

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

In [None]:
# Select a column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

# Filter rows based on a condition
print(df[df['Age'] > 30])

## Grouping Data and Aggregate Functions

In [None]:
# Group data by 'City' and calculate the mean age
print(df.groupby('City')['Age'].mean())

In [None]:
# Group data by 'City' and aggregate columns with different functions
# (a grouping key such as 'City' cannot also appear in the aggregation dict)
grouped = df.groupby('City').agg({'Name': 'count', 'Age': 'mean'})
print(grouped)
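
When several statistics are computed from the same column, named aggregation gives each result an explicit column name. The sketch below assumes a small, made-up `sales` table.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['Paris', 'Paris', 'London', 'London', 'London'],
    'Amount': [100, 300, 50, 150, 250],
})

# Each keyword is an output column: (input column, aggregation function)
summary = sales.groupby('City').agg(
    orders=('Amount', 'count'),
    mean_amount=('Amount', 'mean'),
    total=('Amount', 'sum'),
)
print(summary)
```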

## Combining Datasets (merge, join, concat)

In [None]:
# Create another DataFrame
data2 = {'Name': ['Alice', 'Bob', 'Eve', 'Frank'],
         'Job': ['Engineer', 'Teacher', 'Doctor', 'Architect']}
df2 = pd.DataFrame(data2)

# Merge two DataFrames based on the 'Name' column
merged = pd.merge(df, df2, on='Name', how='inner')
print(merged)
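
An inner merge keeps only names present in both frames. As a sketch of the other join types, the block below uses illustrative `left` and `right` frames; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
right = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                      'Job': ['Engineer', 'Teacher', 'Doctor']})

# Keep every row of `left`; unmatched rows get NaN in the columns from `right`
left_join = pd.merge(left, right, on='Name', how='left')

# Keep rows from both sides and record the origin of each row
outer_join = pd.merge(left, right, on='Name', how='outer', indicator=True)

print(left_join)
print(outer_join)
```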

In [None]:
# Concatenate two DataFrames row-wise; columns missing from one frame are filled with NaN
concatenated = pd.concat([df, df2], ignore_index=True)
print(concatenated)

## Reshaping and Pivoting Data

In [None]:
# Create a sample DataFrame for reshaping
data3 = {'Name': ['Alice', 'Alice', 'Bob', 'Bob'],
         'Subject': ['Math', 'Science', 'Math', 'Science'],
         'Score': [85, 90, 75, 80]}
df3 = pd.DataFrame(data3)

# Reshape data using pivot
pivoted = df3.pivot(index='Name', columns='Subject', values='Score')
print(pivoted)

In [None]:
# Reshape data using melt
melted = pd.melt(pivoted.reset_index(), id_vars='Name', value_vars=['Math', 'Science'])
print(melted)
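
`pivot` requires each index/column pair to appear only once. When the data contains repeats, `pivot_table` aggregates them instead. The sketch below assumes a made-up `attempts` table in which Alice has two Math scores.

```python
import pandas as pd

attempts = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Math'],
    'Score': [80, 90, 70, 60],
})

# pivot() would raise a ValueError here because ('Alice', 'Math') appears twice;
# pivot_table() combines the duplicates with the chosen aggregation function
table = attempts.pivot_table(index='Name', columns='Subject',
                             values='Score', aggfunc='mean')
print(table)
```

Combinations that never occur, such as Bob's Science score, show up as `NaN`.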

# Pandas: Data Cleaning and Preprocessing

In [None]:
import pandas as pd
import numpy as np

## Removing Duplicates

In [None]:
# Create a sample DataFrame with duplicates
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'Age': [25, 30, 35, 25, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Remove duplicates
df_deduped = df.drop_duplicates()

# Display the deduplicated DataFrame
print("\nDeduplicated DataFrame:")
print(df_deduped)
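
By default, `drop_duplicates()` compares entire rows. The sketch below, on an illustrative `visits` frame, restricts the comparison to one column with `subset` and keeps the last occurrence with `keep='last'`; `duplicated()` returns the underlying boolean mask.

```python
import pandas as pd

visits = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'City': ['New York', 'Chicago', 'Boston'],
})

# True for rows whose 'Name' has already appeared earlier
is_repeat = visits.duplicated(subset=['Name'])

# Keep only the last row for each name
latest = visits.drop_duplicates(subset=['Name'], keep='last')

print(is_repeat)
print(latest)
```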

## Handling Outliers

In [None]:
# Create a sample DataFrame with outliers
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 100],
        'Income': [50000, 60000, 70000, 80000, 500000]}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Identify and remove outliers based on a condition
condition = (df['Age'] < 90) & (df['Income'] < 200000)
df_cleaned = df[condition]

# Display the cleaned DataFrame
print("\nCleaned DataFrame (outliers removed):")
print(df_cleaned)
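
The thresholds above are fixed by hand. One common data-driven alternative is the interquartile range (IQR) rule, sketched here on the same income values; the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

income = pd.Series([50000, 60000, 70000, 80000, 500000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
within = income[income.between(lower, upper)]

print(lower, upper)
print(within)
```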

## String Manipulation and Regular Expressions

In [None]:
# Create a sample DataFrame with string data
data = {'Name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee'],
        'Email': ['alice@example.com', 'bob.johnson@company.org', 'charlie_brown@email.net', 'david.lee@gmail.com']}
df = pd.DataFrame(data)

# Extract first names from 'Name' column
df['First Name'] = df['Name'].str.split().str[0]

# Extract domain names from 'Email' column using regular expressions
df['Domain'] = df['Email'].str.extract(r'@(\w+\.\w+)', expand=False)

# Display the updated DataFrame
print(df)

## Date and Time Data Handling

In [None]:
# Create a sample DataFrame with date and time data
data = {'Date': ['2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04'],
        'Time': ['10:30:00', '14:45:30', '08:15:45', '16:00:15']}
df = pd.DataFrame(data)

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Extract day of week from 'Date' column
df['Day of Week'] = df['Date'].dt.day_name()

# Convert 'Time' column to datetime format
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time

# Display the updated DataFrame
print(df)
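
Separate date and time strings can also be combined into a single timestamp column, which makes time arithmetic possible. The following is a minimal sketch on two of the rows above; the `Timestamp`, `Hour`, and `Elapsed` column names are illustrative.

```python
import pandas as pd

events = pd.DataFrame({
    'Date': ['2023-05-01', '2023-05-02'],
    'Time': ['10:30:00', '14:45:30'],
})

# Join the strings, then parse them as full timestamps
events['Timestamp'] = pd.to_datetime(events['Date'] + ' ' + events['Time'])

# Components are available through the .dt accessor
events['Hour'] = events['Timestamp'].dt.hour

# Subtracting timestamps yields Timedelta values
events['Elapsed'] = events['Timestamp'] - events['Timestamp'].iloc[0]
print(events)
```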

# Data Visualization with Pandas

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame
data = {'Category': ['A', 'B', 'C', 'D', 'E'],
        'Value': [10, 25, 15, 30, 20]}
df = pd.DataFrame(data)

## Basic Plotting with Pandas

In [None]:
# Line plot
df.plot(kind='line', x='Category', y='Value')
plt.show()

# Bar plot
df.plot(kind='bar', x='Category', y='Value')
plt.show()

# Scatter plot
df.plot(kind='scatter', x='Category', y='Value')
plt.show()

## Customizing Plots

In [None]:
# Customize plot style and colors
plt.style.use('dark_background')
df.plot(kind='bar', x='Category', y='Value', color='orange')
plt.title('Bar Plot', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.show()

## Plotting with Matplotlib

In [None]:
# Histogram using Matplotlib
plt.figure(figsize=(8, 6))
plt.hist(df['Value'], bins=5, edgecolor='black')
plt.title('Histogram of Values', fontsize=16)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

## Plotting with Seaborn

In [None]:
# Scatter plot with regression line using Seaborn
tips = sns.load_dataset("tips")
sns.regplot(x="total_bill", y="tip", data=tips)
plt.title('Relationship between Total Bill and Tip', fontsize=16)
plt.show()

# Advanced Pandas Topics

In [None]:
import pandas as pd
import numpy as np

## Hierarchical Indexing

In [None]:
# Create a multi-index DataFrame
data = np.random.randn(4, 4)  # 4 columns, one for each (Level1, Level2) pair
columns = pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['Level1', 'Level2'])
df = pd.DataFrame(data, columns=columns)

print("Multi-Index DataFrame:")
print(df)

# Select data using hierarchical indexing
print("\nSelect data from 'A' level1 columns:")
print(df['A'])

## Working with Large Datasets

In [None]:
# Create a large DataFrame
large_df = pd.DataFrame(np.random.randn(1000000, 4), columns=['A', 'B', 'C', 'D'])

# Chunking data for processing
chunk_size = 100000
for chunk in range(0, len(large_df), chunk_size):
    data_chunk = large_df.iloc[chunk:chunk+chunk_size]
    # Process the data chunk here
    print(f"Processing chunk from {chunk} to {chunk+chunk_size}")
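
The loop above slices a DataFrame that is already in memory. When the data lives in a file that is too large to load at once, `pd.read_csv` can yield it in pieces through its `chunksize` parameter. In this sketch a small in-memory buffer stands in for such a file, and the running sum is just an example of per-chunk work.

```python
import io
import pandas as pd

# Stand-in for a large CSV file: one column holding the numbers 0 to 9
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
chunk_sizes = []
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    chunk_sizes.append(len(chunk))

print(total, chunk_sizes)
```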

## Categoricals and Data Types

In [None]:
# Create a DataFrame with categorical data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'Age': [25, 30, 35, 25, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']}
df = pd.DataFrame(data)

# Convert a column to categorical data type
df['City'] = df['City'].astype('category')

print("DataFrame with categorical data:")
print(df)
print("\nData types:")
print(df.dtypes)
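
A categorical column stores each distinct value once and represents the rows as small integer codes, which can save memory when values repeat often. The sketch below compares memory usage on an illustrative Series of 3,000 repeated city names; the exact byte counts depend on your pandas version.

```python
import pandas as pd

cities = pd.Series(['New York', 'Los Angeles', 'Chicago'] * 1000)
as_category = cities.astype('category')

# deep=True includes the memory used by the strings themselves
original_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

categories = as_category.cat.categories.tolist()   # distinct values, sorted
codes = as_category.cat.codes                      # integer code per row

print(original_bytes, category_bytes)
print(categories)
```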

## Advanced Indexing and Selection Techniques

In [None]:
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco']}
df = pd.DataFrame(data)

# Select rows based on a condition
print("Rows where Age > 30:")
print(df[df['Age'] > 30])

# Select rows using .isin()
print("\nRows where City is New York or Chicago:")
print(df[df['City'].isin(['New York', 'Chicago'])])
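
Two further selection techniques are sketched below on an illustrative `people` frame: `.loc` with a boolean mask and a column label in one step, and `query()`, which takes the filter as a string and can reference local variables with `@`.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Filter rows and pick a column in a single .loc call
names_over_30 = people.loc[people['Age'] > 30, 'Name']

# query() evaluates a string expression; @min_age refers to the local variable
min_age = 28
in_range = people.query('Age >= @min_age and City != "Houston"')

print(names_over_30)
print(in_range)
```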

## Styling and Formatting Data for Reports

In [None]:
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco'],
        'Score': [85, 92, 78, 88, 95]}
df = pd.DataFrame(data)

# Style the DataFrame with conditional formatting
styled_df = df.style.apply(lambda x: ['background-color: lightgreen' if v > 90 else '' for v in x], subset=['Score'])

# Format numeric columns
styled_df = styled_df.format({'Age': '{:,.0f}', 'Score': '{:,.2f}'})

# Display the styled DataFrame
styled_df

# Pandas and Machine Learning

In [None]:
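import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Preparing Data for Machine Learning Tasks

In [None]:
# Load the Boston Housing dataset.
# `sklearn.datasets.load_boston` was removed in scikit-learn 1.2, so the data is
# read from the original source, following the workaround given in
# scikit-learn's removal notice.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
data = pd.DataFrame(features, columns=feature_names)
data['MEDV'] = target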

# Split the data into features and target
X = data.drop('MEDV', axis=1)
y = data['MEDV']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Features Shape: {X_train.shape}")
print(f"Training Target Shape: {y_train.shape}")
print(f"Testing Features Shape: {X_test.shape}")
print(f"Testing Target Shape: {y_test.shape}")

## Feature Engineering with Pandas

In [None]:
# Create a new feature (an illustrative interaction term:
# average rooms per dwelling x distance to employment centres)
data['LIVING_AREA'] = data['RM'] * data['DIS']

# One-hot encoding categorical features
categorical_cols = ['RAD']
data = pd.get_dummies(data, columns=categorical_cols)

# Split the data into features and target again
X = data.drop('MEDV', axis=1)
y = data['MEDV']

print("Updated Features:\n", X.columns)

## Integrating Pandas with Scikit-Learn

In [None]:
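# Re-split so the training and test sets include the engineered features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set (score() returns R^2 for regressors)
score = model.score(X_test, y_test)
print(f"Model Score: {score:.2f}")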

# Pandas Performance and Best Practices

In [None]:
import pandas as pd
import numpy as np
# Requires the third-party line_profiler package
%load_ext line_profiler

## Vectorization and Efficient Operations

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({'A': np.random.rand(1000000),
                   'B': np.random.rand(1000000)})

# Vectorized operation
%timeit df['C'] = df['A'] * df['B']

# Non-vectorized operation
%timeit df['C'] = [a * b for a, b in zip(df['A'], df['B'])]

## Caching and Indexing Strategies

In [None]:
# Create a large DataFrame
large_df = pd.DataFrame(np.random.rand(1000000, 4), columns=['A', 'B', 'C', 'D'])

# Cache the result of a computation
result = large_df['A'] + large_df['B']

# Set index for faster lookups
large_df = large_df.set_index('A')
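
An index is most useful for lookups when it holds a meaningful key rather than random floats. The sketch below assumes a made-up `orders` table keyed by `order_id`; sorting the index afterwards is optional, but it lets pandas use faster search strategies for label lookups and slices.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [103, 101, 102],
    'amount': [30.0, 10.0, 20.0],
})

# Use the key column as the index and sort it
indexed = orders.set_index('order_id').sort_index()

# Label-based lookup of a single value
amount_102 = indexed.loc[102, 'amount']
is_unique = indexed.index.is_unique

print(indexed)
print(amount_102, is_unique)
```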

## Profiling and Optimizing Pandas Code

In [None]:
def slow_function(df):
    result = []
    for i in range(len(df)):
        row = df.iloc[i]
        result.append(row['A'] * row['B'])
    return result

def optimized_function(df):
    return df['A'] * df['B']

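df = pd.DataFrame({'A': np.random.rand(1000000),
                   'B': np.random.rand(1000000)})

# The row-by-row loop is slow, so time and profile both functions on a small sample
sample = df.head(10000)

%timeit slow_function(sample)
%timeit optimized_function(sample)

%lprun -f slow_function slow_function(sample)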