
# Q1. List any five functions of the pandas library with execution.

Pandas is a Python library for data manipulation and analysis. Five commonly used pandas operations are shown below, each with a short example:

# 1. Reading CSV Files

In [None]:
import pandas as pd

# Reading a CSV file
df = pd.read_csv('data.csv')
print(df.head())  # Display the first five rows
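
The cell above only runs if a file named data.csv exists in the working directory. As a minimal sketch that runs without any file, the same call can read from an in-memory buffer. The CSV text and the name df_csv below are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Name,Age
Alice,25
Bob,30
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head())  # Display up to the first five rows
```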

# 2. DataFrame Creation

In [None]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

# 3. Data Selection

In [None]:
# Selecting a specific column
ages = df['Age']
print(ages)

# Selecting specific rows by index
selected_rows = df.iloc[0:2]  # Select the first two rows
print(selected_rows)

# 4. Data Filtering

In [None]:
# Filtering DataFrame based on a condition
filtered_df = df[df['Age'] > 28]  # Select rows where Age is greater than 28
print(filtered_df)

# 5. Group By Operations

In [None]:
# Grouping by 'City' and calculating the average Age
grouped_df = df.groupby('City')['Age'].mean().reset_index()
print(grouped_df)
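
In the sample data every city appears only once, so each group mean is just that person's age. The sketch below uses a hypothetical DataFrame (people) in which cities repeat, so the aggregation is visible:

```python
import pandas as pd

# Hypothetical data with repeated cities so that groups contain several rows
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# One output row per city; group keys are sorted by default
avg_age = people.groupby('City')['Age'].mean().reset_index()
print(avg_age)
```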

# Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

The function below builds the index 1, 3, 5, ... with range(1, 2 * len(df), 2) and attaches it with set_index(). It returns a new DataFrame and leaves the original unchanged:

# Function Definition

In [1]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = range(1, 2 * len(df), 2)

    # Re-index the DataFrame
    df_reindexed = df.set_index(pd.Index(new_index))

    return df_reindexed

In [2]:
# Create a sample DataFrame
data = {
    'A': [10, 20, 30],
    'B': [40, 50, 60],
    'C': [70, 80, 90]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Re-index the DataFrame
df_reindexed = reindex_dataframe(df)

# Display the re-indexed DataFrame
print("\nRe-indexed DataFrame:")
print(df_reindexed)

Original DataFrame:
    A   B   C
0  10  40  70
1  20  50  80
2  30  60  90

Re-indexed DataFrame:
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
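
An equivalent way to build the same index is pd.RangeIndex, which stores only the start, stop, and step. This is a sketch of an alternative, not a replacement for the function above; the function name is made up for illustration:

```python
import pandas as pd

def reindex_with_rangeindex(df):
    # Work on a copy so the caller's DataFrame keeps its original index
    out = df.copy()
    # The stop value is exclusive, so 2 * len + 1 yields exactly len(out) labels
    out.index = pd.RangeIndex(start=1, stop=2 * len(out) + 1, step=2)
    return out

sample = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
reindexed = reindex_with_rangeindex(sample)
print(reindexed)
```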


# Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

The first three values of the 'Values' column can be summed either by iterating with iterrows() or by slicing. The implementation below uses slicing with iloc, which is vectorized and generally faster than iterating row by row. It prints a message and returns early if the DataFrame has fewer than three rows.

# Function Definition



In [3]:
import pandas as pd

def sum_first_three_values(df):
    # Check if there are at least three values in the 'Values' column
    if len(df) < 3:
        print("The DataFrame does not have enough values in the 'Values' column.")
        return

    # Calculate the sum of the first three values
    total_sum = df['Values'].iloc[:3].sum()

    # Print the sum
    print(f"The sum of the first three values is: {total_sum}")

# Example Usage


In [4]:
# Create a sample DataFrame
data = {
    'Values': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Call the function to calculate and print the sum
sum_first_three_values(df)

The sum of the first three values is: 60
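
Because the question asks for iteration explicitly, here is a sketch of the iterrows() variant. The function name is made up for illustration, and unlike the version above it sums whatever rows exist if there are fewer than three:

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # iterrows() yields (index label, row Series) pairs; enumerate counts positions
    for position, (_, row) in enumerate(df.iterrows()):
        if position >= 3:
            break  # Stop after the first three rows
        total += row['Values']
    print(f"The sum of the first three values is: {total}")
    return total

values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
iter_total = sum_first_three_by_iteration(values_df)
```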


# Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

You can create a new column in a Pandas DataFrame that counts the number of words in each entry of the 'Text' column using the str.split() method combined with str.len(). Below is a function that accomplishes this task.

# Function Definition

In [5]:
import pandas as pd

def add_word_count_column(df):
    # Create a new column 'Word_Count' that counts the number of words in each row of the 'Text' column
    df['Word_Count'] = df['Text'].str.split().str.len()

    return df

# Example Usage

In [7]:
# Create a sample DataFrame
data = {
    'Text': [
        'Hello world',
        'This is a test',
        'Pandas is great for data manipulation',
        'Python programming'
    ]
}
df = pd.DataFrame(data)

# Call the function to add the 'Word_Count' column
df_with_word_count = add_word_count_column(df)

# Display the updated DataFrame
print(df_with_word_count)

                                    Text  Word_Count
0                            Hello world           2
1                         This is a test           4
2  Pandas is great for data manipulation           6
3                     Python programming           2
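
If the 'Text' column contains missing values, str.split().str.len() yields NaN for those rows. The sketch below treats missing text as zero words, which is an assumption about the desired behavior; the function and variable names are made up for illustration:

```python
import pandas as pd

def add_word_count_safe(df):
    # Work on a copy so the input DataFrame is not modified
    out = df.copy()
    # Missing text becomes '', which splits into an empty list (0 words)
    out['Word_Count'] = out['Text'].fillna('').str.split().str.len().astype(int)
    return out

texts = pd.DataFrame({'Text': ['Hello world', None, '  spaced   out  text ']})
counted = add_word_count_safe(texts)
print(counted)
```

Calling split() without arguments splits on any run of whitespace, so leading, trailing, and repeated spaces do not inflate the count.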


# Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size and DataFrame.shape are both attributes, not methods, so they are accessed without parentheses (df.size, not df.size()). Both describe the dimensions of a DataFrame, but in different ways:

1. DataFrame.size:

* Returns the total number of elements in the DataFrame.
* This is calculated as the product of the number of rows and the number of columns.
* It returns a single integer value.
* Example:

In [None]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.size)  # Output: 4 (2 rows * 2 columns)

2. DataFrame.shape:

* Returns a tuple representing the dimensions of the DataFrame.
* The tuple consists of two values: the number of rows and the number of columns.
* It gives a more structured overview of the DataFrame's dimensions.
* Example:

In [None]:
print(df.shape)  # Output: (2, 2) (2 rows, 2 columns)
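
The relationship between the two attributes can be checked directly. The sketch below uses a small made-up DataFrame (frame) and also shows what happens when shape is called like a method:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = frame.shape      # shape unpacks into (rows, columns)
print(frame.shape)
print(frame.size)

# size is always the product of the two shape entries
assert frame.size == rows * cols

# shape is a plain tuple, so calling it raises TypeError
try:
    frame.shape()
except TypeError as exc:
    print(f"TypeError: {exc}")
```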

# Q6. Which function of pandas do we use to read an excel file?

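In pandas, the pandas.read_excel() function reads an Excel file and loads the data from a spreadsheet into a DataFrame. By default it reads the first sheet; the sheet_name parameter selects a different one. Reading .xlsx files requires an Excel engine such as openpyxl to be installed.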

# Basic Usage:


In [None]:
import pandas as pd

# Read an Excel file
df = pd.read_excel('path_to_file.xlsx')

# Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

You can create a function that extracts the username from the email addresses in the 'Email' column of a Pandas DataFrame and adds it as a new column named 'Username'. Here's how you can do that:



In [8]:
import pandas as pd

def extract_username(df):
    # Check if the 'Email' column exists in the DataFrame
    if 'Email' in df.columns:
        # Create a new column 'Username' by splitting the 'Email' column at the '@' symbol
        df['Username'] = df['Email'].apply(lambda x: x.split('@')[0])
    else:
        print("The 'Email' column does not exist in the DataFrame.")
    return df

# Example usage
data = {
    'Email': ['john.doe@example.com', 'jane.smith@test.com', 'alice.brown@domain.org']
}

df = pd.DataFrame(data)
df = extract_username(df)

print(df)

                    Email     Username
0    john.doe@example.com     john.doe
1     jane.smith@test.com   jane.smith
2  alice.brown@domain.org  alice.brown
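
The apply() version above raises an error if an entry in 'Email' is missing, because None and NaN have no split() method. A sketch of a vectorized alternative using the .str accessor, which passes missing values through unchanged (the function name is made up for illustration):

```python
import pandas as pd

def extract_username_vectorized(df):
    # Work on a copy so the input DataFrame is not modified
    out = df.copy()
    # Split each address at '@' and keep the first piece; missing values stay missing
    out['Username'] = out['Email'].str.split('@').str[0]
    return out

emails = pd.DataFrame({
    'Email': ['john.doe@example.com', None, 'alice.brown@domain.org']
})
usernames = extract_username_vectorized(emails)
print(usernames)
```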


# Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.
For example, if df contains the following values:

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

Your function should select the following rows:

   A  B  C
1  8  2  7
4  9  1  2

The function should return a new DataFrame that contains only the selected rows.

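The function below filters the DataFrame with a boolean mask that combines both conditions (column 'A' greater than 5 and column 'B' less than 10) using the & operator. Note that row 2 (A = 6, B = 9) also satisfies both conditions, so the function returns three rows, although the example in the question lists only rows 1 and 4: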

In [9]:
import pandas as pd

def filter_dataframe(df):
    # Select rows where column 'A' is greater than 5 and column 'B' is less than 10
    filtered_df = df[(df['A'] > 5) & (df['B'] < 10)]
    return filtered_df

# Example usage
data = {
    'A': [3, 8, 6, 2, 9],
    'B': [5, 2, 9, 3, 1],
    'C': [1, 7, 4, 5, 2]
}

df = pd.DataFrame(data)
result_df = filter_dataframe(df)

print(result_df)

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
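
The same filter can also be written with DataFrame.query(), which some readers find easier to scan. This is a sketch of an alternative; the function name and the a_min and b_max parameters are made up for illustration:

```python
import pandas as pd

def filter_with_query(df, a_min=5, b_max=10):
    # '@name' refers to a local Python variable inside the query string
    return df.query('A > @a_min and B < @b_max')

table = pd.DataFrame({
    'A': [3, 8, 6, 2, 9],
    'B': [5, 2, 9, 3, 1],
    'C': [1, 7, 4, 5, 2],
})
queried = filter_with_query(table)
print(queried)
```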


# Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

You can create a function that calculates the mean, median, and standard deviation of the values in the 'Values' column of a Pandas DataFrame. Here's how you can do that:

In [10]:
import pandas as pd

def calculate_statistics(df):
    # Check if the 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        mean_value = df['Values'].mean()
        median_value = df['Values'].median()
        std_deviation = df['Values'].std()

        # Return the results as a dictionary
        return {
            'Mean': mean_value,
            'Median': median_value,
            'Standard Deviation': std_deviation
        }
    else:
        print("The 'Values' column does not exist in the DataFrame.")
        return None

# Example usage
data = {
    'Values': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)
statistics = calculate_statistics(df)

print(statistics)

{'Mean': 30.0, 'Median': 30.0, 'Standard Deviation': 15.811388300841896}
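
The standard deviation printed above is the sample standard deviation, because Series.std() uses ddof=1 by default. The sketch below computes the three statistics in a single agg() call and shows the population value (ddof=0) for comparison; the variable names are made up for illustration:

```python
import pandas as pd

values = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# agg() with a list of function names returns a Series indexed by those names
summary = values['Values'].agg(['mean', 'median', 'std'])
print(summary)

# Population standard deviation divides by n instead of n - 1
population_std = values['Values'].std(ddof=0)
print(population_std)
```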


# Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

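The function below sorts the rows by date and calculates the moving average of the 'Sales' column with rolling(window=7), storing the result in a new column called 'MovingAverage'. The window counts rows, not calendar days, so this assumes one row per day. Each window includes the current row, and the first six rows are NaN because a full window of seven rows is not yet available: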

In [11]:
import pandas as pd

def calculate_moving_average(df):
    # Ensure that the 'Date' column is in datetime format
    df['Date'] = pd.to_datetime(df['Date'])

    # Sort the DataFrame by date
    df = df.sort_values('Date')

    # Calculate the 7-day moving average of the 'Sales' column
    df['MovingAverage'] = df['Sales'].rolling(window=7).mean()

    return df

# Example usage
data = {
    'Date': ['2024-10-01', '2024-10-02', '2024-10-03', '2024-10-04',
             '2024-10-05', '2024-10-06', '2024-10-07', '2024-10-08',
             '2024-10-09', '2024-10-10'],
    'Sales': [100, 150, 200, 130, 160, 170, 180, 190, 220, 240]
}

df = pd.DataFrame(data)
df_with_moving_average = calculate_moving_average(df)

print(df_with_moving_average)

        Date  Sales  MovingAverage
0 2024-10-01    100            NaN
1 2024-10-02    150            NaN
2 2024-10-03    200            NaN
3 2024-10-04    130            NaN
4 2024-10-05    160            NaN
5 2024-10-06    170            NaN
6 2024-10-07    180     155.714286
7 2024-10-08    190     168.571429
8 2024-10-09    220     178.571429
9 2024-10-10    240     184.285714
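
If some days are missing from the data, a row-based window of 7 no longer corresponds to 7 calendar days. A sketch of a time-based alternative uses the offset '7D' on a datetime index; each window then covers the current day and the six days before it, and averages whatever rows fall inside. The function name and the sample data are made up for illustration:

```python
import pandas as pd

def moving_average_7d(df):
    # Work on a copy and put the rows in chronological order
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'])
    out = out.sort_values('Date').reset_index(drop=True)

    # A '7D' window needs a datetime index; it averages every row in the window
    rolled = out.set_index('Date')['Sales'].rolling('7D').mean()
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Hypothetical data with a gap between 2024-10-02 and 2024-10-10
gappy = pd.DataFrame({
    'Date': ['2024-10-01', '2024-10-02', '2024-10-10'],
    'Sales': [100, 200, 300],
})
averaged = moving_average_7d(gappy)
print(averaged)
```

Unlike the row-based version, this variant produces a value for the earliest rows as well, because a time-based window does not wait for seven observations.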


# Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:

         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

Your function should create the following DataFrame:

         Date    Weekday
0  2023-01-01     Sunday
1  2023-01-02     Monday
2  2023-01-03    Tuesday
3  2023-01-04  Wednesday
4  2023-01-05   Thursday

The function should return the modified DataFrame.
You can create a function that adds a new column 'Weekday' to the DataFrame, containing the names of the weekdays corresponding to each date in the 'Date' column. Here’s how you can implement this:

In [12]:
import pandas as pd

def add_weekday_column(df):
    # Ensure that the 'Date' column is in datetime format
    df['Date'] = pd.to_datetime(df['Date'])

    # Create a new column 'Weekday' by extracting the weekday name
    df['Weekday'] = df['Date'].dt.day_name()

    return df

# Example usage
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
}

df = pd.DataFrame(data)
df_with_weekday = add_weekday_column(df)

print(df_with_weekday)

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
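
The function above modifies the DataFrame passed to it. A sketch of a non-mutating variant uses assign() and also adds the numeric weekday from dt.dayofweek (Monday is 0, Sunday is 6). The function name and the 'WeekdayNumber' column are made up for illustration:

```python
import pandas as pd

def add_weekday_columns(df):
    dates = pd.to_datetime(df['Date'])
    # assign() returns a new DataFrame and leaves the input untouched
    return df.assign(
        Date=dates,
        Weekday=dates.dt.day_name(),
        WeekdayNumber=dates.dt.dayofweek,
    )

raw = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02']})
with_weekdays = add_weekday_columns(raw)
print(with_weekdays)
```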


# Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

You can create a function that filters the DataFrame to select all rows where the 'Date' column falls between specified start and end dates. Here's how you can implement this:

In [13]:
import pandas as pd

def filter_dates(df, start_date='2023-01-01', end_date='2023-01-31'):
    # Ensure that the 'Date' column is in datetime format
    df['Date'] = pd.to_datetime(df['Date'])

    # Filter the DataFrame for dates between start_date and end_date
    filtered_df = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

    return filtered_df

# Example usage
data = {
    'Date': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-01-31', '2023-01-10']
}

df = pd.DataFrame(data)
filtered_df = filter_dates(df)

print(filtered_df)

        Date
0 2023-01-01
1 2023-01-15
3 2023-01-31
4 2023-01-10
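
When the 'Date' column holds timestamps with a time of day, the comparison Date <= '2023-01-31' excludes everything after midnight on the last day. The sketch below assumes the whole end date should be included and uses a half-open interval that ends just before the following day. The function name and sample timestamps are made up for illustration:

```python
import pandas as pd

def filter_dates_half_open(df, start_date='2023-01-01', end_date='2023-01-31'):
    # Parse without modifying the caller's DataFrame
    dates = pd.to_datetime(df['Date'])
    start = pd.Timestamp(start_date)
    # Exclusive upper bound: midnight at the start of the day after end_date
    end_exclusive = pd.Timestamp(end_date) + pd.Timedelta(days=1)
    mask = (dates >= start) & (dates < end_exclusive)
    return df[mask]

stamps = pd.DataFrame({
    'Date': ['2022-12-31 23:59', '2023-01-01 00:00',
             '2023-01-31 18:30', '2023-02-01 00:00']
})
january = filter_dates_half_open(stamps)
print(january)
```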


# Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

The only library that must be imported to use the basic functions of pandas is pandas itself. It is typically imported with the following statement:

In [None]:
import pandas as pd

This import statement allows you to use the pandas library's functionality under the alias pd, which is a common convention in the Python community. After this import, you can create DataFrames, Series, and utilize various pandas functions and methods for data manipulation and analysis.