# Useful General-Purpose pandas Functions

In [4]:
# Generate synthetic dataframe to use for illustration
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


## len(df) 

Returns the number of rows in the DataFrame. Alternative: df.shape[0].

In [5]:

# Assuming you have a pandas DataFrame named 'df'
num_rows = len(df)

print("Number of rows in the DataFrame:", num_rows)

# alternate: df.shape[0]

Number of rows in the DataFrame: 3


## len(df.columns) 

Returns the number of columns in the DataFrame. Alternative: df.shape[1].

In [6]:
# Assuming you have a pandas DataFrame named 'df'
num_columns = len(df.columns)

print("Number of columns in the DataFrame:", num_columns)

Number of columns in the DataFrame: 3
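
Both counts are also available at once through df.shape, which is a (rows, columns) tuple. A minimal sketch, rebuilding the same illustrative DataFrame so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# df.shape is a (rows, columns) tuple, so both counts can be unpacked together
num_rows, num_columns = df.shape

print(num_rows, num_columns)  # 3 3
```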


## df.insert() 

Here's a code stub using df.insert() to insert a new column named "Category" into the DataFrame df at column position 1 (positions are zero-based), filled with the values "A", "B", and "C". Note that df.insert() modifies df in place and returns None:

In [7]:
# Assuming you have a pandas DataFrame named 'df'
df.insert(1, "Category", ["A", "B", "C"])

print(df)

      Name Category  Age         City
0    Alice        A   25     New York
1      Bob        B   30  Los Angeles
2  Charlie        C   35      Chicago
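
One behavior worth knowing: by default df.insert() refuses to add a column whose label already exists. The sketch below uses a small made-up DataFrame to show both the in-place behavior and the error:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# insert() changes df in place and returns None
result = df.insert(1, 'Category', ['A', 'B'])

# Inserting a label that already exists raises ValueError by default
try:
    df.insert(0, 'Category', ['X', 'Y'])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(list(df.columns))    # ['Name', 'Category', 'Age']
print(duplicate_rejected)  # True
```

Passing allow_duplicates=True lifts this restriction, although duplicate column labels tend to make later selection ambiguous.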


## df.drop_duplicates(subset)

Here's a code stub using df.drop_duplicates(subset) to drop duplicates from a subset of a pandas DataFrame:

In [8]:
# Create a sample DataFrame with duplicate values in the 'Name' column
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
        'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# Drop duplicates based on the 'Name' column
df = df.drop_duplicates(subset=['Name'])

print(df)

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35


This code will remove duplicate rows based on the values in the 'Name' column. You can specify other columns in the subset argument to drop duplicates based on those columns as well.
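
The keep parameter controls which of the duplicate rows survives (the default is 'first'). A short sketch on made-up data, also showing a two-column subset:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Age': [25, 30, 25, 40],
})

# keep='last' retains the final occurrence of each 'Name'
last_per_name = df.drop_duplicates(subset=['Name'], keep='last')

# With two columns in subset, rows count as duplicates only if both values match
unique_pairs = df.drop_duplicates(subset=['Name', 'Age'])

print(last_per_name)
print(unique_pairs)
```

The original index labels are preserved in both results; chain .reset_index(drop=True) if a fresh 0..n-1 index is needed.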

## df.dropna()

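Drops rows that contain missing values. Key parameters: subset (a list of columns to check for missing values) and inplace (default False; when True, df is modified in place and None is returned).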

In [None]:
import pandas as pd

# Create a sample DataFrame with missing values
data = {'A': [1, 2, None, 4],
        'B': ['a', 'b', None, 'd'],
        'C': [10, 20, 30, None]}
df = pd.DataFrame(data)

# Drop rows containing any missing values
df_dropped = df.dropna()

print(df_dropped)

This code will output:

     A  B     C
0  1.0  a  10.0
1  2.0  b  20.0

As you can see, the rows containing missing values (NaN) have been dropped from the resulting DataFrame df_dropped.

You can also use the how parameter of df.dropna() to specify whether to drop rows with any missing values (how='any') or only rows with all missing values (how='all').
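
The how and subset options can be compared side by side. This is a sketch on a slightly different made-up DataFrame, chosen so that each option gives a different result:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, None],
    'B': ['a', None, 'c', None],
    'C': [10, 20, 30, None],
})

# Default (how='any'): drop rows with at least one missing value
dropped_any = df.dropna()

# how='all': drop only rows where every column is missing (row 3)
dropped_all = df.dropna(how='all')

# subset: look for missing values only in the listed columns
dropped_subset = df.dropna(subset=['A'])

print(list(dropped_any.index))     # [0]
print(list(dropped_all.index))     # [0, 1, 2]
print(list(dropped_subset.index))  # [0, 1]
```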

## df.rename(dict)

Renames columns according to a dictionary that maps old column names to new ones.

In [1]:
import pandas as pd

# Create a sample DataFrame with original column names
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Create a dictionary to map old column names to new column names
new_names = {'Name': 'First Name', 'Age': 'Year of Birth', 'City': 'Location'}

# Rename the columns using the dictionary
df = df.rename(columns=new_names)

print(df)

  First Name  Year of Birth     Location
0      Alice             25     New York
1        Bob             30  Los Angeles
2    Charlie             35      Chicago


As you can see, the column names have been successfully renamed according to the specified dictionary.
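
Besides a dictionary, rename() also accepts a function, and it silently ignores dictionary keys that do not match any column unless errors='raise' is passed. A minimal sketch on a made-up one-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice'], 'Age': [25]})

# A callable is applied to every column label
lowered = df.rename(columns=str.lower)

# Keys that match no column are ignored by default
unchanged = df.rename(columns={'Missing': 'X'})

# errors='raise' turns an unmatched key into a KeyError
try:
    df.rename(columns={'Missing': 'X'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(lowered.columns))    # ['name', 'age']
print(list(unchanged.columns))  # ['Name', 'Age']
print(raised)                   # True
```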

## df.fillna()

Here's a small sample dataset and code snippet to demonstrate how to use df.fillna() to fill missing values in a pandas DataFrame:

In [3]:
import pandas as pd

# Create a sample DataFrame with missing values
data = {'A': [1, 2, None, 4],
        'B': ['a', 'b', None, 'd'],
        'C': [10, 20, 30, None]}
df = pd.DataFrame(data)

print(df)

     A     B     C
0  1.0     a  10.0
1  2.0     b  20.0
2  NaN  None  30.0
3  4.0     d   NaN


In [4]:
# Fill missing values with the value 'unknown'
df_filled = df.fillna('unknown')

print(df_filled)

         A        B        C
0      1.0        a     10.0
1      2.0        b     20.0
2  unknown  unknown     30.0
3      4.0        d  unknown


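As you can see, the missing values in the original DataFrame have been replaced with the string 'unknown'. Note that filling the numeric columns A and C with a string converts them to object dtype, so a numeric fill value is usually the better choice for numeric columns.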

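You can also fill missing values in other ways, such as:

- df.fillna(0): Replace missing values with 0.
- df.fillna(df.mean(numeric_only=True)): Replace missing values in numeric columns with the column mean. Without numeric_only=True, recent pandas versions raise an error when the DataFrame contains string columns such as B.
- df.ffill(): Propagate the last valid observation forward. This replaces df.fillna(method='ffill'), which is deprecated in recent pandas versions.
- df.bfill(): Propagate the next valid observation backward. This replaces df.fillna(method='bfill'), which is likewise deprecated.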

Choose the method that best suits your data and analysis requirements.
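
To keep numeric columns numeric, fillna() also accepts a dictionary with one fill value per column. The sketch below reuses the sample data from above; the particular fill values are illustrative choices only:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': ['a', 'b', None, 'd'],
    'C': [10, 20, 30, None],
})

# One fill value per column: numeric columns stay float64
filled = df.fillna({'A': 0, 'B': 'unknown', 'C': df['C'].mean()})

# ffill() carries the last valid value in each column forward
forward = df.ffill()

print(filled)
print(forward)
```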

## Concatenating DataFrames with pd.concat()

In [5]:
import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8], 'B': [9, 10]})

# Concatenate the DataFrames row-wise (axis=0)
df_row_concat = pd.concat([df1, df2], axis=0, ignore_index=True)

# Concatenate the DataFrames column-wise (axis=1)
df_col_concat = pd.concat([df1, df2], axis=1)

print("Row-wise concatenation:\n", df_row_concat)
print("\nColumn-wise concatenation:\n", df_col_concat)

Row-wise concatenation:
    A   B
0  1   4
1  2   5
2  3   6
3  7   9
4  8  10

Column-wise concatenation:
    A  B    A     B
0  1  4  7.0   9.0
1  2  5  8.0  10.0
2  3  6  NaN   NaN


Explanation:

- pd.concat(): This function is used to concatenate multiple DataFrames.
- axis=0: Concatenates the DataFrames row-wise, stacking them on top of each other.
- axis=1: Concatenates the DataFrames column-wise, placing them side by side.
- ignore_index=True: Resets the index of the concatenated DataFrame, starting from 0.

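As you can see, axis=0 stacks the DataFrames vertically, while axis=1 places them side by side. Column-wise concatenation aligns rows on the index, so index 2, which exists only in df1, gets NaN in the columns coming from df2, and those columns become floats.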
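
When the inputs do not share all columns, the join parameter decides what is kept. A small sketch with two made-up DataFrames that overlap only in column 'B':

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8], 'C': [9, 10]})

# Default join='outer': keep the union of columns, filling gaps with NaN
outer = pd.concat([df1, df2], ignore_index=True)

# join='inner': keep only the columns present in every input
inner = pd.concat([df1, df2], ignore_index=True, join='inner')

print(outer)
print(inner)
```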

## df.pivot(index, columns, values)

Here's a small sample dataset and code snippet to demonstrate how to use df.pivot(index, columns, values) to reshape data in a pandas DataFrame:

In [6]:
import pandas as pd

# Create a sample DataFrame
data = {'Category': ['A', 'A', 'B', 'B'],
        'Product': ['X', 'Y', 'X', 'Z'],
        'Sales': [100, 200, 300, 400]}
df = pd.DataFrame(data)

# Pivot the DataFrame to create a new DataFrame with 'Category' as the index, 'Product' as the columns, and 'Sales' as the values
df_pivoted = df.pivot(index='Category', columns='Product', values='Sales')

print(df_pivoted)

Product       X      Y      Z
Category                     
A         100.0  200.0    NaN
B         300.0    NaN  400.0


As you can see, the df.pivot() method reshaped the original DataFrame by turning the unique values of the 'Product' column into new columns and using the unique values of the 'Category' column as the index. The values from the 'Sales' column are placed in the corresponding cells, and combinations that do not occur in the data (such as Category A with Product Z) are filled with NaN.
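
df.pivot() requires each index/column pair to occur at most once. When pairs repeat, pivot_table() can aggregate them instead. A sketch on made-up data in which (A, X) appears twice; the choice of 'sum' as the aggregation is only an example:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B'],
    'Product': ['X', 'X', 'Y', 'X'],
    'Sales': [100, 50, 200, 300],
})

# (A, X) occurs twice, so pivot() cannot decide which value goes in the cell
try:
    df.pivot(index='Category', columns='Product', values='Sales')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here by summing them)
totals = df.pivot_table(index='Category', columns='Product',
                        values='Sales', aggfunc='sum')

print(pivot_failed)  # True
print(totals)
```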

## pd.melt(df, id_vars, value_vars, var_name)

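Here's a small sample dataset and code snippet to demonstrate how to unpivot a pandas DataFrame with pd.melt(df, id_vars, value_vars, var_name). The example uses the equivalent method form, df.melt(...), and omits value_vars, so every column not listed in id_vars is melted: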

In [8]:
import pandas as pd

# Create a sample DataFrame
data = {'Category': ['A', 'A', 'B', 'B'],
        'Product': ['X', 'Y', 'X', 'Z'],
        'Sales': [100, 200, 300, 400]}
df = pd.DataFrame(data)

# Unpivot the DataFrame, keeping 'Category' as the identifier and melting 'Product' and 'Sales'
df_melted = df.melt(id_vars='Category', var_name='Product', value_name='Sales_Value')

print(df_melted)

  Category  Product Sales_Value
0        A  Product           X
1        A  Product           Y
2        B  Product           X
3        B  Product           Z
4        A    Sales         100
5        A    Sales         200
6        B    Sales         300
7        B    Sales         400


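The value_name argument names the column that holds the melted values. It is set to 'Sales_Value' because value_name must not match an existing column name such as 'Sales' (recent pandas versions raise a ValueError in that case). Note that var_name='Product' only labels the column holding the former column names, and since both 'Product' and 'Sales' are melted, the Sales_Value column mixes strings and numbers.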
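
Melting is most often applied to a wide table whose value columns share one type. The sketch below uses a hypothetical quarterly sales table (the Q1 and Q2 columns are invented for illustration) and passes value_vars explicitly:

```python
import pandas as pd

wide = pd.DataFrame({
    'Category': ['A', 'B'],
    'Q1': [100, 300],
    'Q2': [200, 400],
})

# Turn the Q1/Q2 columns into (Quarter, Sales) pairs, one row per pair
long_df = pd.melt(wide, id_vars='Category', value_vars=['Q1', 'Q2'],
                  var_name='Quarter', value_name='Sales')

print(long_df)
```

Here every melted value is numeric, so the resulting Sales column keeps an integer dtype.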

## df.sort_values(column, inplace, ascending)

Here's a small sample dataset and code snippet to demonstrate how to use df.sort_values(by=column, inplace=True, ascending=False) to sort a pandas DataFrame by the values in a column:

In [9]:
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Sort the DataFrame by the 'Age' column in descending order
df.sort_values(by='Age', inplace=True, ascending=False)

print(df)

      Name  Age
2  Charlie   35
1      Bob   30
0    Alice   25


As you can see, the DataFrame has been sorted by the 'Age' column in descending order (from largest to smallest).

Here's a breakdown of the parameters used:

- by: The column name, or list of column names, to sort by.
- inplace: If True, sorts the DataFrame in place, modifying the original DataFrame and returning None. If False (default), returns a new sorted DataFrame.
- ascending: If True (default), sort in ascending order. If False, sort in descending order.
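
Both by and ascending accept lists, which allows a different sort direction per column. A minimal sketch on made-up data (the 'City' column and the fourth row are added for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'City': ['Chicago', 'Boston', 'Chicago', 'Boston'],
    'Age': [25, 30, 35, 28],
})

# Sort by City ascending, then by Age descending within each city
sorted_df = df.sort_values(by=['City', 'Age'], ascending=[True, False])

# ignore_index=True relabels the sorted rows 0..n-1
reset_df = df.sort_values(by='Age', ignore_index=True)

print(sorted_df['Name'].tolist())  # ['Bob', 'Dana', 'Charlie', 'Alice']
print(reset_df['Name'].tolist())   # ['Alice', 'Dana', 'Bob', 'Charlie']
```

Returning a new DataFrame, as done here, avoids the in-place mutation and lets the call be chained with other methods.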