# Assignment: Pandas Advanced Assignment

Q1. List any five functions of the pandas library with execution.

Ans - Here are five commonly used functions from the pandas library, each with an example:

1 head(): This function is used to display the first n rows of the DataFrame. By default, it displays the first 5 rows.

In [2]:
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# Displaying the first 3 rows of the DataFrame
print(df.head(3))


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


2 read_csv(): This function is used to read data from a CSV file into a DataFrame.

In [None]:
import pandas as pd

# Reading a CSV file into a DataFrame ('data.csv' is a placeholder path)
df = pd.read_csv('data.csv')
print(df.head())
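
The cell above needs a file named data.csv on disk. As a minimal sketch that runs without any file, the same call can be pointed at an in-memory buffer. The CSV text and the names csv_text and df_csv are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the placeholder file 'data.csv'
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# read_csv accepts a file-like object as well as a path,
# so a StringIO buffer can take the place of a file on disk
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head())
```

The first line of the text is treated as the header row, so the resulting DataFrame has the columns Name and Age.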


3 info(): This function provides a concise summary of the DataFrame including the index dtype and column dtypes, non-null values, and memory usage.

In [6]:
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# Displaying information about the DataFrame
# info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes


4 describe(): This function generates descriptive statistics of the DataFrame's numerical columns, such as count, mean, std (standard deviation), min, quartiles, and max.

In [7]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Generating descriptive statistics of the DataFrame
print(df.describe())


              A         B
count  5.000000  5.000000
mean   3.000000  3.000000
std    1.581139  1.581139
min    1.000000  1.000000
25%    2.000000  2.000000
50%    3.000000  3.000000
75%    4.000000  4.000000
max    5.000000  5.000000


5 fillna(): This function is used to fill NA/NaN values in the DataFrame with a specified value, either one scalar for the whole DataFrame or a separate value per column.

In [8]:
import pandas as pd
import numpy as np

# Creating a DataFrame with NaN values
data = {'A': [1, np.nan, 3, np.nan, 5],
        'B': [np.nan, 2, np.nan, 4, np.nan]}
df = pd.DataFrame(data)

# Filling NaN values with 0
df_filled = df.fillna(0)
print(df_filled)


     A    B
0  1.0  0.0
1  0.0  2.0
2  3.0  0.0
3  0.0  4.0
4  5.0  0.0
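
Filling with a single constant is only one option. The sketch below, which reuses the same data under the illustrative names df_nan, per_column and forward, shows a per-column fill using a dict and a forward fill using ffill().

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                       'B': [np.nan, 2, np.nan, 4, np.nan]})

# A dict gives each column its own fill value:
# column 'A' gets its mean, column 'B' gets -1
per_column = df_nan.fillna({'A': df_nan['A'].mean(), 'B': -1})

# ffill() carries the last valid value forward;
# a NaN in the first row has nothing before it and stays NaN
forward = df_nan.ffill()

print(per_column)
print(forward)
```

Both calls return a new DataFrame and leave df_nan unchanged.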


Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [10]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2 for each row
    new_index = pd.Index(range(1, len(df) * 2, 2))
    
    # Re-index the DataFrame
    df_reindexed = df.set_index(new_index)
    
    return df_reindexed


# Assuming df is your DataFrame with columns 'A', 'B', and 'C'
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Re-index the DataFrame
df_reindexed = reindex_dataframe(df)
print(df_reindexed)


   A  B  C
1  1  4  7
3  2  5  8
5  3  6  9


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

Ans - The function below iterates over the first three entries of the 'Values' column, accumulates their sum, and prints it:

In [12]:
import pandas as pd

def calculate_sum_of_first_three(df):
    # Initialize sum
    total_sum = 0
    
    # Iterate over the first three values in the 'Values' column
    for value in df['Values'].head(3):
        total_sum += value
    
    # Print the sum to the console
    print("Sum of the first three values:", total_sum)


# Assuming df is your DataFrame with a column named 'Values'
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Calculate the sum of the first three values
calculate_sum_of_first_three(df)


Sum of the first three values: 60
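
The question asks for explicit iteration, which the loop above provides. For comparison, here is a sketch of the same calculation without a Python loop. The names sum_first_three and values_df are illustrative only.

```python
import pandas as pd

def sum_first_three(df):
    # iloc[:3] slices by position; a DataFrame with fewer than
    # three rows simply contributes whatever rows it has
    return df['Values'].iloc[:3].sum()

values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
first_three_total = sum_first_three(values_df)
print("Sum of the first three values:", first_three_total)
```

On large columns the vectorised sum is usually preferable to a Python-level loop, although for three values the difference is negligible.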


Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

In [14]:
import pandas as pd

def count_words(text):
    # Split the text by whitespace to get individual words
    words = text.split()
    # Return the count of words
    return len(words)

def add_word_count_column(df):
    # Apply the count_words function to each row of the 'Text' column
    df['Word_Count'] = df['Text'].apply(count_words)
    return df

# Example usage:
# Assuming df is your DataFrame with a column named 'Text'
df = pd.DataFrame({'Text': ["This is a sample text.", "Another example text", "Yet another text"]})

# Create a new column 'Word_Count' containing the number of words in each row of the 'Text' column
df_with_word_count = add_word_count_column(df)
print(df_with_word_count)


                     Text  Word_Count
0  This is a sample text.           5
1    Another example text           3
2        Yet another text           3
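
The apply-based version above raises an error if a row holds a missing value, because None and NaN have no split() method. A possible alternative, sketched here with the hypothetical names df_text and add_word_count_vectorised, uses the .str accessor and treats missing text as zero words.

```python
import pandas as pd

def add_word_count_vectorised(df):
    # .str.split() splits on whitespace and propagates missing values;
    # .str.len() then counts the words in each resulting list
    counts = df['Text'].str.split().str.len()
    # Assumption: a missing text counts as 0 words
    return df.assign(Word_Count=counts.fillna(0).astype(int))

df_text = pd.DataFrame({'Text': ["This is a sample text.", None, "Yet another text"]})
df_text_counted = add_word_count_vectorised(df_text)
print(df_text_counted)
```

assign() returns a new DataFrame, so the original df_text keeps its single column.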


Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size: 

This attribute returns the total number of elements (cells) in the DataFrame, that is, the number of rows multiplied by the number of columns. Note that size is an attribute, not a method, so it is accessed without parentheses (df.size, not df.size()).

In [15]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Getting the size of the DataFrame
size = df.size
print("DataFrame size:", size)  # Output will be 6 (3 rows * 2 columns = 6 elements)


DataFrame size: 6


DataFrame.shape: 

This attribute returns a tuple of the form (number of rows, number of columns) describing the dimensions of the DataFrame. Like size, it is an attribute and is accessed without parentheses.

In [16]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Getting the shape of the DataFrame
shape = df.shape
print("DataFrame shape:", shape)  # Output will be (3, 2) indicating 3 rows and 2 columns


DataFrame shape: (3, 2)
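
The relationship between the two attributes can be checked directly. The short sketch below uses illustrative variable names and also shows what happens when size is called like a method.

```python
import pandas as pd

df_dim = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df_dim.shape      # shape is a (rows, columns) tuple
total_cells = df_dim.size      # size is a single integer

print(rows, cols, total_cells)
print(total_cells == rows * cols)

# Calling the attribute like a method fails, because the value it
# holds (an integer) is not callable
try:
    df_dim.size()
except TypeError as exc:
    size_call_error = type(exc).__name__
    print("df_dim.size() raised", size_call_error)
```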


Q6. Which function of pandas do we use to read an excel file?

Ans - To read an Excel file into a DataFrame, use the read_excel() function. Reading .xlsx files requires an Excel engine such as openpyxl to be installed. Here's how you can use it:


In [None]:
import pandas as pd

# Reading an Excel file into a DataFrame
df = pd.read_excel('your_file.xlsx')

# Now you can work with the DataFrame 'df'
print(df.head())

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

In [21]:
import pandas as pd

def extract_username(email):
    # Split the email address on '@' symbol and return the first part
    return email.split('@')[0]

def add_username_column(df):
    # Apply the extract_username function to each element of the 'Email' column
    df['Username'] = df['Email'].apply(extract_username)
    return df

# Example usage:
# Assuming df is your DataFrame with a column named 'Email'
df = pd.DataFrame({'Email': ['user1@example.com', 'user2@example.com']})

# Create a new column 'Username' containing only the username part of each email address
df_with_username = add_username_column(df)
print(df_with_username)


               Email Username
0  user1@example.com    user1
1  user2@example.com    user2

In [20]:
import pandas as pd

def extract_username(email):
    # Split the email address on '@' symbol and return the first part
    return email.split('@')[0]

def add_username_column(df):
    # Create a new column 'Username' containing the username part of each email address
    df['Username'] = df['Email'].apply(extract_username)
    return df


# Assuming df is your DataFrame with a column named 'Email'
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com']})

# Create a new column 'Username' containing only the username part of each email address
df_with_username = add_username_column(df)
print(df_with_username)


                  Email  Username
0  john.doe@example.com  john.doe
1  jane.doe@example.com  jane.doe
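
As a possible alternative to apply(), the username can be extracted for the whole column at once with the .str accessor. This is a sketch, and the name emails is illustrative only.

```python
import pandas as pd

emails = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com']})

# .str.split('@') yields a list per row; .str[0] picks the part before '@'
emails['Username'] = emails['Email'].str.split('@').str[0]
print(emails)
```

Unlike the apply-based function, this form leaves missing email values as missing instead of raising an error.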


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

In [25]:
import pandas as pd

def select_rows(df):
    # Select rows where value in column 'A' is greater than 5 and value in column 'B' is less than 10
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows


# Assuming df is your DataFrame with columns 'A', 'B', and 'C'
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

# Select rows where value in column 'A' is greater than 5 and value in column 'B' is less than 10
selected_df = select_rows(df)
print(selected_df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
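
In the example above every row with A greater than 5 also happens to have B below 10, so the second condition never excludes anything. The sketch below uses made-up data in which both conditions matter, and expresses the same filter with DataFrame.query(). The names df_q and selected_q are illustrative.

```python
import pandas as pd

# Hypothetical data: row 1 passes the 'A' test but fails the 'B' test
df_q = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                     'B': [5, 12, 9, 3, 1],
                     'C': [1, 7, 4, 5, 2]})

# query() takes the condition as a string; 'and' combines the two tests
selected_q = df_q.query('A > 5 and B < 10')
print(selected_q)
```

The result is equivalent to df_q[(df_q['A'] > 5) & (df_q['B'] < 10)]. Which form to use is largely a matter of readability.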


Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

In [26]:
import pandas as pd

def calculate_statistics(df):
    # Calculate mean, median, and standard deviation of the values in the 'Values' column
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()
    
    return mean_value, median_value, std_deviation


# Assuming df is your DataFrame with a column named 'Values'
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Calculate mean, median, and standard deviation of the values in the 'Values' column
mean, median, std = calculate_statistics(df)
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)


Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
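
The standard deviation printed above is the sample standard deviation, because pandas' std() divides by n - 1 by default (ddof=1). A short sketch of the difference from the population standard deviation follows, with illustrative variable names.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Default ddof=1: sum of squared deviations divided by n - 1
sample_std = values.std()

# ddof=0: sum of squared deviations divided by n
population_std = values.std(ddof=0)

print("Sample std:", sample_std)
print("Population std:", population_std)
```

For these five values the squared deviations from the mean sum to 1000, so the sample variance is 1000 / 4 = 250 and the population variance is 1000 / 5 = 200.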


Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
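
No answer cell accompanies this question. One possible approach, sketched here under the assumption that the 'Date' values can be parsed by pd.to_datetime, is to use the .dt.day_name() accessor. The function name add_weekday_column and the sample dates are illustrative.

```python
import pandas as pd

def add_weekday_column(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    result = df.copy()
    # Parse the dates, then map each one to its weekday name
    result['Weekday'] = pd.to_datetime(result['Date']).dt.day_name()
    return result

dates_df = pd.DataFrame({'Date': ['2023-01-02', '2023-01-07', '2023-01-15']})
dates_with_weekday = add_weekday_column(dates_df)
print(dates_with_weekday)
```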

Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [27]:
import pandas as pd

def select_rows_in_date_range(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Convert the 'Date' column to datetime dtype if it's not already
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Define the start and end dates of the range
    # (each timestamp is midnight at the start of that day)
    start_date = pd.Timestamp('2023-01-01')
    end_date = pd.Timestamp('2023-01-31')
    
    # Select rows where the date is between '2023-01-01' and '2023-01-31' (both inclusive)
    selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    
    return selected_rows


# Assuming df is your DataFrame with a column named 'Date'
df = pd.DataFrame({'Date': ['2023-01-05', '2023-01-15', '2023-02-10']})

# Select rows where the date is between '2023-01-01' and '2023-01-31'
selected_df = select_rows_in_date_range(df)
print(selected_df)


        Date
0 2023-01-05
1 2023-01-15


Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?

Ans - The first library to import is pandas itself, conventionally under the alias pd:

In [None]:
import pandas as pd