Pandas is a fast, flexible, open-source Python library for analyzing and manipulating data. 
It is built on top of the NumPy library and provides many functions for cleaning, transforming, 
and analyzing data, which can help us extract insights from data sets. Pandas works especially 
well with tabular data, such as SQL tables or Excel spreadsheets.

# Series

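A Series is a 1D labeled array that holds homogeneous data (a single dtype). The data type can be 
integer, string, float, etc. The axis labels are called the index. The values of a Series can be 
changed (mutable), while its size is best treated as fixed: most operations that add or remove 
elements return a new Series rather than resizing the existing one.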

pandas.Series(data, index, dtype, copy):
- data: This is the input data for the Series. It can be array-like (a list, tuple, or NumPy array), a dictionary, or a single scalar value. These values populate the Series.
- index: This is an optional argument. It represents the labels (or the "index") for the Series. If not provided, pandas will automatically generate a default index, which is a sequence of integers starting from 0. For list or array data, the length of the index must match the length of data. For dictionary data, the index selects and orders the keys, and labels missing from the dictionary become NaN.
- dtype: This optional argument specifies the data type for the Series. If not provided, pandas will infer the data type from the data. You can explicitly set it, for example, to dtype='float64', dtype='int', or any other valid pandas data type.
- copy: This optional boolean argument indicates whether to copy the data or not. If copy=True, a copy of the input data is made. copy=False only avoids a copy for inputs that can share memory, such as a NumPy array or another Series; a plain Python list is always converted into a new array. Whether later modifications to the original data are reflected in the Series can also depend on the pandas version, so do not rely on it.
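
As a small illustration of how data and index interact, the sketch below builds a Series from a dictionary. The dictionary and its labels are made up for this example; it only shows that an explicit index selects and reorders keys rather than having to match the data length.

```python
import pandas as pd

# Hypothetical data: the dict keys become the index labels
prices = {'a': 1, 'b': 2, 'c': 3}

# Without an index, the keys are used as labels in insertion order
s_default = pd.Series(prices)

# With an index, only the requested labels are kept, in that order.
# The label 'z' is not in the dict, so its value is NaN and the dtype is float64
s_selected = pd.Series(prices, index=['c', 'a', 'z'])

print(s_default)
print(s_selected)
```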

Example

In [1]:
import pandas as pd

# Example data
data = [1, 2, 3, 4]
index = ['a', 'b', 'c', 'd']

# Creating a Series
series = pd.Series(data, index=index, dtype='float64', copy=False)

print(series)

a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64


# DataFrame

A DataFrame is a 2D data structure: a table made up of rows and columns. Each column can 
have a different data type, such as float64, int64, bool, etc. Each column 
of a DataFrame is a Series. Both axes are labeled (rows by the index, columns by the column names), 
so we can operate on both rows and columns.

pandas.DataFrame(data, index, columns, dtype, copy)

Arguments Breakdown:
- data: The input data for the DataFrame. It can be a variety of data structures, including:
  - A dictionary of lists or NumPy arrays.
  - A list of lists (or 2D arrays).
  - A NumPy array.
  - Another DataFrame.
  - A dictionary of Series.
- index: This is an optional argument. It represents the labels for the rows of the DataFrame. If not provided, pandas will automatically generate a default index (like in a Series), which is a sequence of integers starting from 0.
- columns: This is an optional argument. It represents the labels for the columns. If not provided and the data is a dictionary, the dictionary keys will be used as column names. If data is a 2D array or list of lists, the columns default to integers starting from 0, so you will usually want to provide column names explicitly.
- dtype: This optional argument forces a single data type on the entire DataFrame (e.g., dtype='float64'). If not provided, pandas will infer the data type of each column from the input data.
- copy: This optional boolean argument indicates whether to copy the input data or not. If copy=True, a copy of the input data is made, and modifications to the original data will not affect the DataFrame. copy=False only avoids a copy when the input can share memory, such as a 2D NumPy array or another DataFrame; in that case changes to the original data may be reflected in the DataFrame. A dictionary of lists is always converted into new arrays.
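
The following sketch shows what happens to the column labels when the data is a list of lists. The sample rows and labels are invented for illustration.

```python
import pandas as pd

rows = [['Alice', 25], ['Bob', 30]]  # hypothetical sample rows

# No column names given: pandas labels the columns 0, 1, ...
df_default = pd.DataFrame(rows)

# Explicit row labels and column names
df_named = pd.DataFrame(rows, index=['A', 'B'], columns=['Name', 'Age'])

print(df_default.columns.tolist())
print(df_named)
```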

In [3]:
import pandas as pd

# Example data: Dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Specifying the index and columns
index = ['A', 'B', 'C']
columns = ['Name', 'Age', 'City']

# Creating a DataFrame
df = pd.DataFrame(data, index=index, columns=columns, dtype='object', copy=False)

# Displaying the DataFrame
print(df)

      Name Age         City
A    Alice  25     New York
B      Bob  30  Los Angeles
C  Charlie  35      Chicago


A DataFrame can be thought of as a dict-like container of Series that share the same index, with one Series per column.
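
To make the relationship between a DataFrame and its Series concrete, here is a minimal sketch with made-up values. It builds a DataFrame from two Series whose labels only partly overlap and then pulls one column back out.

```python
import pandas as pd

# Two Series with partly different labels (made-up values)
age = pd.Series([25, 30], index=['A', 'B'])
city = pd.Series(['New York', 'Chicago'], index=['A', 'C'])

# A dict of Series becomes a DataFrame; rows are aligned on the index labels,
# and combinations that do not exist are filled with NaN
df = pd.DataFrame({'Age': age, 'City': city})
print(df)

# Selecting one column gives back a Series
print(type(df['Age']))
```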

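# Panel

A Panel was a 3D container of DataFrames. It was deprecated in pandas 0.20 and removed in pandas 0.25, so it is not available in current versions; a DataFrame with a MultiIndex is the usual replacement. Its three axes were:

- items: axis 0, each item corresponds to a DataFrame contained inside.
- major_axis: axis 1, the rows of each DataFrame.
- minor_axis: axis 2, the columns of each DataFrame.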

# Commonly used Pandas functions for processing data

Handling missing values: isna() and isnull() detect missing (NA) values; notna() detects non-missing values.

isna(): This function detects missing values in a DataFrame or Series and returns a DataFrame or Series of the same shape, where each entry is True if the corresponding entry in the original data is missing (NaN, None, or NaT) and False if not.

notna(): This is the opposite of isna(). It returns True for non-missing values and False for missing values.

isnull(): This is an alias for isna(). It checks for missing values and returns True where values are missing.

In [5]:
import pandas as pd
import numpy as np

# Creating a DataFrame with some missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}

df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

Original DataFrame:
      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago


In [6]:
# Detect missing values using isna()
missing_values = df.isna()
print("\nMissing values detected using isna():")
print(missing_values)


Missing values detected using isna():
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False


In [7]:
# Detect non-missing values using notna()
non_missing_values = df.notna()
print("\nNon-missing values detected using notna():")
print(non_missing_values)


Non-missing values detected using notna():
    Name    Age   City
0   True   True   True
1   True  False   True
2   True   True  False
3  False   True   True


In [8]:
# Using isnull() (same as isna())
null_values = df.isnull()
print("\nMissing values detected using isnull():")
print(null_values)


Missing values detected using isnull():
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False


Key Takeaways:
- isna() and isnull() are used to identify missing values, returning True where data is missing.
- notna() is used to identify non-missing values, returning True where data is present.
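
The boolean masks are usually a first step toward counting or handling the gaps. The sketch below is one possible follow-up; dropna() and fillna() are not covered in the text above, and the fill values are placeholders chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows where 'Age' is present
rows_with_age = df[df['Age'].notna()]
print(rows_with_age)

# Drop every row that has at least one missing value
complete_rows = df.dropna()

# Or fill the gaps with per-column defaults (placeholder values)
filled = df.fillna({'Name': 'Unknown', 'Age': 0})
print(filled)
```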

# Indexing and Slicing in Pandas

In pandas, indexing and slicing allow you to access and manipulate specific rows and columns of your DataFrame. The two primary ways to do this are:

- .loc[]: Label-based indexing.
- .iloc[]: Integer position-based indexing.

# .loc[] (Label-based indexing)

.loc[] is used to select rows and columns by labels (names).
It can select a single row or column by name, or a range of rows and columns by their names. Unlike ordinary Python slices, a label slice such as 'B':'D' includes both endpoints.

In [11]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Selecting a row by label
print(df.loc['B'].to_frame())

# df.loc['B'] selects the row with the label 'B'.

                B
Name          Bob
Age            30
City  Los Angeles


In [12]:
# Selecting multiple rows and columns by labels
print(df.loc[['A', 'C'], ['Name', 'City']])

      Name      City
A    Alice  New York
C  Charlie   Chicago


In [13]:
# Slicing rows by label
print(df.loc['B':'D', :])

      Name  Age         City
B      Bob   30  Los Angeles
C  Charlie   35      Chicago
D    David   40      Houston


# .iloc[] (Integer-based indexing)

.iloc[] is used to select rows and columns by their integer position (like Python's built-in indexing), so a slice excludes its stop position.
It is useful when you don't know the labels or prefer to use index positions.

In [14]:
# Selecting a row by integer position
print(df.iloc[1])

Name            Bob
Age              30
City    Los Angeles
Name: B, dtype: object


In [15]:
# Selecting multiple rows and columns by integer positions
print(df.iloc[[0, 2], [0, 2]])

      Name      City
A    Alice  New York
C  Charlie   Chicago


In [16]:
# Slicing rows by integer position
print(df.iloc[1:3, :])

      Name  Age         City
B      Bob   30  Los Angeles
C  Charlie   35      Chicago
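
The difference in slice endpoints between .loc[] and .iloc[] is easy to miss, so here is a short side-by-side sketch on a reduced version of the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40]},
    index=['A', 'B', 'C', 'D'],
)

# Label slice: both 'B' and 'C' are included
by_label = df.loc['B':'C']

# Position slice: position 1 is included, position 3 is excluded
by_position = df.iloc[1:3]

# Here the two selections contain the same rows
print(by_label.equals(by_position))

# A single value, by label and by position
print(df.loc['C', 'Age'], df.iloc[2, 1])
```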


# Queries in Pandas (Like in Excel or SQL)

In pandas, you can perform queries to filter your data based on specific conditions, similar to how you might use WHERE in SQL or filters in Excel.

# where() Method
- The where() method keeps the values where a condition is True and replaces the rest with NaN. The result has the same shape as the original data; no rows are dropped.

In [17]:
# Using where() to keep rows where 'Age' > 30 and mask the others with NaN
filtered_df = df.where(df['Age'] > 30)
print(filtered_df)

      Name   Age     City
A      NaN   NaN      NaN
B      NaN   NaN      NaN
C  Charlie  35.0  Chicago
D    David  40.0  Houston


This code keeps the rows where 'Age' is greater than 30 and replaces every value in the other rows with NaN. Because NaN is a float, the 'Age' column is converted to float64.
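
If the goal is to drop the non-matching rows rather than mask them, a boolean mask can be passed directly to the DataFrame. The sketch below contrasts the two on a reduced version of the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40]},
    index=['A', 'B', 'C', 'D'],
)

mask = df['Age'] > 30      # boolean Series, one value per row

masked = df.where(mask)    # same shape, non-matching rows become NaN
selected = df[mask]        # only the matching rows are kept

print(masked.shape, selected.shape)
print(selected)
```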

# query() Method

- The query() method provides a more SQL-like syntax for filtering data in pandas. It is more intuitive for complex queries involving multiple conditions.

In [18]:
# Using query() to filter rows where 'Age' > 30 and 'City' is 'Chicago'
filtered_df = df.query("Age > 30 and City == 'Chicago'")
print(filtered_df)

      Name  Age     City
C  Charlie   35  Chicago
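
Query strings can also refer to Python variables by prefixing them with @. The sketch below uses a hypothetical threshold and city name, and shows the equivalent boolean-mask form for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

min_age = 30        # hypothetical threshold
city = 'Houston'    # hypothetical city to match

# '@name' refers to a Python variable from the surrounding scope
by_query = df.query("Age > @min_age and City == @city")

# The same filter written with boolean masks
by_mask = df[(df['Age'] > min_age) & (df['City'] == city)]

print(by_query)
```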


# Series basic functionality: axes, dtype, empty, ndim, size, values, head(), tail()


In pandas, a Series is a one-dimensional labeled array capable of holding data of any type. Here are some of the basic functionalities and attributes of a Series, along with explanations and examples.

Basic Functionality of a Series

axes: Returns the list of the axis labels of the Series. For a Series, it returns a list with one element, which is the index.

dtype: The data type of the elements in the Series.

empty: A boolean attribute that indicates whether the Series is empty (i.e., has no elements).

ndim: The number of dimensions of the Series. For a Series, this will always be 1.

size: The number of elements in the Series.

values: Returns the data of the Series as a NumPy array.

head(n): Returns the first n elements of the Series. The default value for n is 5.

tail(n): Returns the last n elements of the Series. The default value for n is 5.

In [19]:
import pandas as pd

# Creating a Series
data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

In [20]:
# Displaying the Series
print("Series:")
print(data)

Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64


In [25]:
# Axes
print("\nAxes:")
print(data.axes)


Axes:
[Index(['a', 'b', 'c', 'd', 'e'], dtype='object')]


In [26]:
# Data type
print("\nData type:")
print(data.dtype)


Data type:
int64


In [27]:
# Check if Series is empty
print("\nIs the Series empty?")
print(data.empty)


Is the Series empty?
False


In [28]:
# Number of dimensions
print("\nNumber of dimensions:")
print(data.ndim)


Number of dimensions:
1


In [29]:
# Size of the Series
print("\nSize:")
print(data.size)


Size:
5


In [30]:
# Values of the Series
print("\nValues:")
print(data.values)


Values:
[10 20 30 40 50]


In [31]:
# First 3 elements
print("\nFirst 3 elements (head):")
print(data.head(3))


First 3 elements (head):
a    10
b    20
c    30
dtype: int64


In [32]:
# Last 2 elements
print("\nLast 2 elements (tail):")
print(data.tail(2))


Last 2 elements (tail):
d    40
e    50
dtype: int64


# DataFrame basic functionality: T, axes, dtypes, empty, ndim, shape, size, values, head(), tail()

Basic Functionality of a DataFrame

T: Transposes the DataFrame, swapping rows and columns.

axes: Returns a list of the DataFrame's row and column labels.

dtypes: Returns the data types of each column in the DataFrame.

empty: Checks if the DataFrame is empty (i.e., has no elements).

ndim: The number of dimensions of the DataFrame. For a DataFrame, this will always be 2.

shape: Returns a tuple representing the dimensions of the DataFrame (rows, columns).

size: The total number of elements in the DataFrame.

values: Returns the data of the DataFrame as a NumPy array.

head(n): Returns the first n rows of the DataFrame. The default value for n is 5.

tail(n): Returns the last n rows of the DataFrame. The default value for n is 5.

In [33]:
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

In [36]:
# Displaying the DataFrame
print("DataFrame:")
df

DataFrame:


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston


In [38]:
# Transpose
print("\nTranspose (T):")
df.T


Transpose (T):


             0            1        2        3
Name     Alice          Bob  Charlie    David
Age         25           30       35       40
City  New York  Los Angeles  Chicago  Houston


In [40]:
# Axes
print("\nAxes:")
df.axes


Axes:


[RangeIndex(start=0, stop=4, step=1),
 Index(['Name', 'Age', 'City'], dtype='object')]

In [41]:
# Data types
print("\nData types:")
df.dtypes


Data types:


Name    object
Age      int64
City    object
dtype: object

In [42]:
# Check if DataFrame is empty
print("\nIs the DataFrame empty?")
df.empty


Is the DataFrame empty?


False

In [43]:
# Number of dimensions
print("\nNumber of dimensions:")
print(df.ndim)


Number of dimensions:
2


In [44]:
# Shape of the DataFrame
print("\nShape:")
print(df.shape)


Shape:
(4, 3)


In [45]:
# Size of the DataFrame
print("\nSize:")
print(df.size)


Size:
12


In [47]:
# Values of the DataFrame
print("\nValues:")
df.values


Values:


array([['Alice', 25, 'New York'],
       ['Bob', 30, 'Los Angeles'],
       ['Charlie', 35, 'Chicago'],
       ['David', 40, 'Houston']], dtype=object)

In [48]:
# First 3 rows
print("\nFirst 3 rows (head):")
df.head(3)


First 3 rows (head):


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [49]:
# Last 2 rows
print("\nLast 2 rows (tail):")
df.tail(2)


Last 2 rows (tail):


      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston


# Statistical Functions in Pandas

count(): Counts the number of non-NA/null entries in each column.

sum(): Computes the sum of the values in each column.

mean(): Calculates the mean (average) of the values in each column.

median(): Computes the median (middle value) of the values in each column.

mode(): Returns the mode(s) (most frequent value(s)) of the values in each column.

std(): Calculates the standard deviation of the values in each column.

min(): Finds the minimum value in each column.

max(): Finds the maximum value in each column.

abs(): Computes the absolute value of each element.

prod(): Computes the product of the values in each column.

cumsum(): Calculates the cumulative sum of the values in each column.

cumprod(): Calculates the cumulative product of the values in each column.

describe(): Provides descriptive statistics like count, mean, std, min, 25%, 50%, 75%, max.

pct_change(): Calculates the percentage change between the current and a prior element.

cov(): Computes the covariance matrix.

corr(): Computes the correlation matrix.

rank(): Computes the ranks of elements.

var(): Calculates the variance of the values in each column.

skew(): Computes the skewness (asymmetry) of the distribution of the values in each column.

apply(): Applies a function along an axis of the DataFrame.

In [53]:
import pandas as pd
import numpy as np

# Creating a DataFrame
data = {
    'A': [10, 20, 30, 40, 50],
    'B': [5, 15, 25, None, 45],
    'C': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)

In [54]:
# Displaying the DataFrame
print("DataFrame:")
df

DataFrame:


    A     B  C
0  10   5.0  1
1  20  15.0  2
2  30  25.0  3
3  40   NaN  4
4  50  45.0  5


In [58]:
# Count of non-NA/null entries
print("\nCount:")
df.count().to_frame()


Count:


   0
A  5
B  4
C  5


In [59]:
# Sum of values
print("\nSum:")
df.sum()


Sum:
A    150.0
B     90.0
C     15.0
dtype: float64


In [61]:
# Mean of values
print("\nMean:")
df.mean()


Mean:


A    30.0
B    22.5
C     3.0
dtype: float64

In [62]:
# Median of values
print("\nMedian:")
print(df.median())


Median:
A    30.0
B    20.0
C     3.0
dtype: float64


In [63]:
# Mode of values
print("\nMode:")
print(df.mode())


Mode:
    A     B  C
0  10   5.0  1
1  20  15.0  2
2  30  25.0  3
3  40  45.0  4
4  50   NaN  5


In [64]:
# Standard deviation
print("\nStandard Deviation (std):")
print(df.std())


Standard Deviation (std):
A    15.811388
B    17.078251
C     1.581139
dtype: float64


In [65]:
# Minimum value
print("\nMinimum Value (min):")
print(df.min())


Minimum Value (min):
A    10.0
B     5.0
C     1.0
dtype: float64


In [66]:
# Maximum value
print("\nMaximum Value (max):")
print(df.max())


Maximum Value (max):
A    50.0
B    45.0
C     5.0
dtype: float64


In [67]:
# Absolute value
print("\nAbsolute Value (abs):")
print(df.abs())


Absolute Value (abs):
    A     B  C
0  10   5.0  1
1  20  15.0  2
2  30  25.0  3
3  40   NaN  4
4  50  45.0  5


In [68]:
# Product of values
print("\nProduct (prod):")
print(df.prod())


Product (prod):
A    12000000.0
B       84375.0
C         120.0
dtype: float64


In [69]:
# Cumulative sum
print("\nCumulative Sum (cumsum):")
print(df.cumsum())


Cumulative Sum (cumsum):
     A     B   C
0   10   5.0   1
1   30  20.0   3
2   60  45.0   6
3  100   NaN  10
4  150  90.0  15


In [70]:
# Cumulative product
print("\nCumulative Product (cumprod):")
print(df.cumprod())


Cumulative Product (cumprod):
          A        B    C
0        10      5.0    1
1       200     75.0    2
2      6000   1875.0    6
3    240000      NaN   24
4  12000000  84375.0  120


In [72]:
# Descriptive statistics
print("\nDescribe:")
df.describe()


Describe:


               A          B         C
count   5.000000   4.000000  5.000000
mean   30.000000  22.500000  3.000000
std    15.811388  17.078251  1.581139
min    10.000000   5.000000  1.000000
25%    20.000000  12.500000  2.000000
50%    30.000000  20.000000  3.000000
75%    40.000000  30.000000  4.000000
max    50.000000  45.000000  5.000000


In [73]:
# Percentage change
print("\nPercentage Change (pct_change):")
print(df.pct_change())


Percentage Change (pct_change):
          A         B         C
0       NaN       NaN       NaN
1  1.000000  2.000000  1.000000
2  0.500000  0.666667  0.500000
3  0.333333  0.000000  0.333333
4  0.250000  0.800000  0.250000


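df.pct_change() calculates the fractional change from the previous row (multiply by 100 to get a percentage). The 0.000000 in column B at row 3 appears because the missing value was forward-filled before the calculation, which was the default in older pandas versions. That default has been deprecated since pandas 2.1; without filling, both row 3 and row 4 of column B would be NaN.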
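
To see how a missing value propagates when nothing is filled in, the sketch below passes fill_method=None explicitly. It uses only column B from the example.

```python
import pandas as pd

df = pd.DataFrame({'B': [5, 15, 25, None, 45]})

# fill_method=None turns off any filling of missing values before the calculation,
# so every change that involves the NaN in row 3 is itself NaN
changes = df.pct_change(fill_method=None)
print(changes)
```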

In [75]:
# Correlation matrix
print("\nCorrelation Matrix (corr):")
df.corr()


Correlation Matrix (corr):


     A    B    C
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0
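
cov() appears in the list of statistical functions but has no example, so here is a minimal sketch using columns A and C from the same data.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30, 40, 50],
    'C': [1, 2, 3, 4, 5],
})

# Pairwise sample covariance between columns;
# the diagonal holds each column's variance (compare with var())
cov_matrix = df.cov()
print(cov_matrix)
```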


In [77]:
# Rank of elements
print("\nRank:")
df.rank()


Rank:


     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  3.0
3  4.0  NaN  4.0
4  5.0  4.0  5.0


rank(): df.rank() returns the rank of each element within its column. By default, missing values are not ranked and stay NaN.

In [79]:
# Variance of values
print("\nVariance (var):")
df.var()


Variance (var):


A    250.000000
B    291.666667
C      2.500000
dtype: float64


The skew() method of a Pandas DataFrame measures the skewness of a data distribution.

Skewness is a statistical measure of how asymmetric a distribution is around its mean.

Skewness > 0: The distribution has positive skewness, meaning the distribution tail extends to the right. Typically, most values lie below the mean, and a few large values pull the mean up.

Skewness < 0: The distribution has negative skewness, meaning the distribution tail extends to the left. Typically, most values lie above the mean, and a few small values pull the mean down.

Skewness = 0: The distribution is symmetric, with values evenly distributed around the mean. A normal distribution has zero skewness, but zero skewness alone does not mean the data is normally distributed.

Therefore, when you run df.skew(), it returns a Series with one skewness value for each column of the DataFrame.

In [80]:
# Skewness
print("\nSkewness (skew):")
print(df.skew())


Skewness (skew):
A    0.000000
B    0.752837
C    0.000000
dtype: float64


rank() assigns ranks to the elements in each column of a DataFrame. Ranking means ordering the values in a column from smallest to largest and then assigning each value a rank based on its position in this order.

Key Points:

Rank Calculation: The rank is assigned to each element according to its value relative to the other values in the column. The smallest value gets rank 1, the next smallest gets rank 2, and so on.

Handling Ties: When multiple elements have the same value (a tie), Pandas has different methods to assign ranks:

- Average Method (default): Assigns the average rank to all tied values. For example, if two values are tied for 2nd place, they both get rank 2.5.
- Min Method: Assigns the smallest rank to all tied values. For example, if two values are tied for 2nd place, they both get rank 2.
- Max Method: Assigns the largest rank to all tied values. For example, if two values are tied for 2nd place, they both get rank 3.
- First Method: Assigns ranks in the order the values appear in the data.
- Dense Method: Like the Min Method, but the rank always increases by 1 between groups, so no ranks are skipped.

Parameters:

- axis: By default, df.rank() operates column-wise (axis=0). You can set axis=1 to rank row-wise.
- method: Specifies how to handle ties ('average', 'min', 'max', 'first', or 'dense').
- ascending: If True (default), ranks are assigned in ascending order. If False, ranks are assigned in descending order.

In [81]:
import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 10, 40],
    'B': [30, 20, 30, 10]
})


In [82]:
df

    A   B
0  10  30
1  20  20
2  10  30
3  40  10


In [83]:
df.rank()

     A    B
0  1.5  3.5
1  3.0  2.0
2  1.5  3.5
3  4.0  1.0


Explanation:

- Column A:
The value 10 appears twice, so both values get the average rank of 1.5.
The value 20 gets rank 3.
The value 40 gets rank 4.
- Column B:
The value 30 appears twice, so both values get the average rank of 3.5 (the average of ranks 3 and 4).
The value 20 gets rank 2.
The value 10 gets rank 1.
- Summary:
df.rank() provides an easy way to assign ranks to data based on the relative order of the values, and it allows customization on how to handle ties and rank direction.
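
The example above only uses the default tie handling. The sketch below applies each method to the values of column A so the results can be compared side by side.

```python
import pandas as pd

s = pd.Series([10, 20, 10, 40])

# One column per tie-breaking method, for side-by-side comparison
ranks = pd.DataFrame({
    method: s.rank(method=method)
    for method in ['average', 'min', 'max', 'first', 'dense']
})
print(ranks)

# Rank from largest to smallest instead
descending = s.rank(ascending=False)
print(descending)
```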

# Apply function

The apply() function in Pandas is a powerful tool that allows you to apply a custom function to each column or each row of a DataFrame. For element-wise operations, vectorized expressions or DataFrame.map() (named applymap() before pandas 2.1) are usually a better fit.

Key Points:

Function Application: df.apply() allows you to apply a function to each column (by default) or to each row. You can use built-in Python functions, NumPy functions, or custom functions (e.g., lambda functions).

Axis Parameter:

- axis=0 (default): The function is applied to each column. Each column is passed to the function as a Series.
- axis=1: The function is applied to each row. Each row is passed to the function as a Series.

Custom Functions: You can pass any function (including lambda functions) to apply(). For example, you can create custom transformations, aggregations, or calculations on the data.

Return Values: If the function returns a single value for each column or row, the result is a Series. If it returns a Series (as x**2 does for a column), the result is a DataFrame.

In [86]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

In [87]:
df

   A  B
0  1  4
1  2  5
2  3  6


In [None]:
df_squared = df.apply(lambda x: x**2)

In [85]:
df_squared

   A   B
0  1  16
1  4  25
2  9  36


In [88]:
row_sum = df.apply(lambda x: x.sum(), axis = 1)

In [89]:
row_sum

0    5
1    7
2    9
dtype: int64
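
The two kinds of return value can be seen in one short sketch. The helper name summarize and its output labels are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def summarize(row):
    # 'row' is a Series holding one row; returning a Series yields several columns
    return pd.Series({'total': row['A'] + row['B'], 'diff': row['B'] - row['A']})

# Row-wise: one output row per input row, one column per returned label
summary = df.apply(summarize, axis=1)
print(summary)

# Column-wise: one value per column gives a Series indexed by column name
column_range = df.apply(lambda col: col.max() - col.min())
print(column_range)
```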

# Functions to group, combine, and reshape data: groupby(), get_group(), merge(), concat(), append(), melt(), pivot(), pivot_table()

1. groupby()
- Description: The groupby() function is used to group data by one or more columns, allowing you to aggregate or perform operations on those groups.

In [90]:
import pandas as pd

# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
        'Values': [10, 20, 15, 25, 10, 30]}
df = pd.DataFrame(data)
df

  Category  Values
0        A      10
1        B      20
2        A      15
3        B      25
4        A      10
5        C      30


In [91]:
# Group by 'Category' column and calculate the sum of each group
grouped = df.groupby('Category').sum()
grouped

          Values
Category
A             35
B             45
C             30


In [94]:
grouped.loc[["A"]]

          Values
Category
A             35
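
A group can be summarized with more than one function at a time. The following sketch uses agg() on the same sample data; it is one common pattern, not the only way to aggregate.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
                   'Values': [10, 20, 15, 25, 10, 30]})

# Several aggregations at once; each function name becomes a column
stats = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(stats)
```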


2. get_group()
- Description: After using groupby(), you can retrieve a specific group with get_group().

In [97]:
import pandas as pd

# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
        'Values': [10, 20, 15, 25, 10, 30]}
df = pd.DataFrame(data)

# Group by 'Category'
grouped = df.groupby('Category')

# Get the group where 'Category' is 'A'
group_A = grouped.get_group('A')
print(group_A)

  Category  Values
0        A      10
2        A      15
4        A      10


3. merge()
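- Description: The merge() function merges two DataFrames based on one or more common columns (keys). It is similar to SQL joins, and the how argument selects the join type ('inner', 'left', 'right', or 'outer').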

In [98]:
# Sample DataFrames
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

In [99]:
df1

  Key  Value1
0   A       1
1   B       2
2   C       3


In [100]:
df2

  Key  Value2
0   A       4
1   B       5
2   D       6


In [101]:
# Merge on the 'Key' column
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)

  Key  Value1  Value2
0   A       1       4
1   B       2       5


In [102]:
# Left merge on the 'Key' column: every row of df1 is kept
merged_df = pd.merge(df1, df2, on='Key', how='left')
print(merged_df)

  Key  Value1  Value2
0   A       1     4.0
1   B       2     5.0
2   C       3     NaN
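
For completeness, an outer merge keeps the keys from both DataFrames. The sketch below also sets indicator=True, which adds a column showing where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# Outer merge keeps keys from both sides; indicator=True adds a '_merge' column
# with the values 'both', 'left_only', or 'right_only'
outer = pd.merge(df1, df2, on='Key', how='outer', indicator=True)
print(outer)
```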


4. concat()
- Description: The concat() function concatenates (joins) two or more DataFrames either along rows (axis=0) or columns (axis=1).

In [103]:
# Concatenate along rows (axis=0)
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)

  Key  Value1  Value2
0   A     1.0     NaN
1   B     2.0     NaN
2   C     3.0     NaN
0   A     NaN     4.0
1   B     NaN     5.0
2   D     NaN     6.0


In [104]:
# Concatenate along columns (axis=1)
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)

  Key  Value1 Key  Value2
0   A       1   A       4
1   B       2   B       5
2   C       3   D       6


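5. append()
- Description: The append() function appended the rows of one DataFrame to another, like concat() with axis=0. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so current versions use pd.concat() with ignore_index=True to get the same result.

In [105]:
# Append the rows of df2 to df1.
# Before pandas 2.0 this was written as: df1.append(df2, ignore_index=True)
appended_df = pd.concat([df1, df2], ignore_index=True)
print(appended_df)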

  Key  Value1  Value2
0   A     1.0     NaN
1   B     2.0     NaN
2   C     3.0     NaN
3   A     NaN     4.0
4   B     NaN     5.0
5   D     NaN     6.0




6. melt()
- Description: The melt() function unpivots a DataFrame from wide format to long format. It turns columns into rows.

In [107]:
# Sample wide DataFrame
df_wide = pd.DataFrame({'ID': [1, 2], 'A': [10, 20], 'B': [30, 40]})
df_wide

   ID   A   B
0   1  10  30
1   2  20  40


In [108]:
# Melt the DataFrame
melted_df = pd.melt(df_wide, id_vars=['ID'], value_vars=['A', 'B'])
melted_df

   ID variable  value
0   1        A     10
1   2        A     20
2   1        B     30
3   2        B     40
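
The default column names 'variable' and 'value' can be replaced while melting. A minimal sketch, with the new names chosen only for illustration:

```python
import pandas as pd

df_wide = pd.DataFrame({'ID': [1, 2], 'A': [10, 20], 'B': [30, 40]})

# var_name / value_name replace the default 'variable' and 'value' column names.
# With value_vars omitted, every column not listed in id_vars is melted
melted = df_wide.melt(id_vars='ID', var_name='Variable', value_name='Value')
print(melted)
```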


7. pivot()
- Description: The pivot() function reshapes data from long format to wide format, turning unique values in a column into separate columns.

In [110]:
# Sample long DataFrame
df_long = pd.DataFrame({'ID': [1, 1, 2, 2],
                        'Variable': ['A', 'B', 'A', 'B'],
                        'Value': [10, 30, 20, 40]})
df_long 

   ID Variable  Value
0   1        A     10
1   1        B     30
2   2        A     20
3   2        B     40


In [111]:
# Pivot the DataFrame
pivoted_df = df_long.pivot(index='ID', columns='Variable', values='Value')
pivoted_df

Variable   A   B
ID
1         10  30
2         20  40


8. pivot_table()
- Description: The pivot_table() function is similar to pivot(), but it allows for more advanced aggregation and handling of duplicate values.

In [112]:
# Sample DataFrame with duplicate entries
df_duplicate = pd.DataFrame({'ID': [1, 1, 2, 2],
                             'Category': ['A', 'A', 'B', 'B'],
                             'Value': [10, 30, 20, 40]})
df_duplicate

   ID Category  Value
0   1        A     10
1   1        A     30
2   2        B     20
3   2        B     40


In [113]:
# Create a pivot table with aggregation (e.g., mean)
pivot_table_df = df_duplicate.pivot_table(index='ID', columns='Category', values='Value', aggfunc='mean')
print(pivot_table_df)

Category     A     B
ID                  
1         20.0   NaN
2          NaN  30.0
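
To show why pivot_table() is needed for this data, the sketch below first tries pivot() on the same DataFrame, which fails because of the duplicate (ID, Category) pairs, and then aggregates with a sum instead. The fill_value of 0 is a choice made for this example.

```python
import pandas as pd

df_duplicate = pd.DataFrame({'ID': [1, 1, 2, 2],
                             'Category': ['A', 'A', 'B', 'B'],
                             'Value': [10, 30, 20, 40]})

# pivot() cannot decide which of the duplicate (ID, Category) values to keep
try:
    df_duplicate.pivot(index='ID', columns='Category', values='Value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# pivot_table() combines the duplicates; fill_value replaces the empty cells
totals = df_duplicate.pivot_table(index='ID', columns='Category',
                                  values='Value', aggfunc='sum', fill_value=0)
print(totals)
```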


Summary:
- groupby() and get_group() allow you to group data and retrieve specific groups.
- merge() and concat() are used for combining DataFrames; append() did the same as concat() along rows before it was removed in pandas 2.0.
- melt() converts wide data to long format.
- pivot() and pivot_table() convert long data to wide format, with pivot_table() providing more advanced aggregation options.

# Some other functions: get_option(), set_option(), reset_option(), describe_option(), option_context()
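
These functions read and change pandas settings such as how many rows are displayed. The sketch below walks through each of them using the 'display.max_rows' option; the values 10 and 5 are arbitrary choices for the example.

```python
import pandas as pd

# Read the current value of an option
default_rows = pd.get_option('display.max_rows')

# Change it globally, then restore the default
pd.set_option('display.max_rows', 10)
changed_rows = pd.get_option('display.max_rows')
pd.reset_option('display.max_rows')

# Change an option only inside a 'with' block
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')
rows_after = pd.get_option('display.max_rows')

# Print the documentation for an option
pd.describe_option('display.max_rows')

print(default_rows, changed_rows, rows_inside, rows_after)
```

option_context() is the safest choice for temporary changes, because the previous value is restored when the block ends.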