1.) List any five functions of the pandas library with execution.

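Six commonly used functions of the pandas library are listed below, each with example code that demonstrates its usage. The examples assume a file named data.csv that contains 'category', 'date', and 'price' columns.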

1. read_csv(): This function is used to read a CSV file and create a pandas DataFrame object.

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Print the first five rows of the DataFrame
print(data.head())

2. dropna(): This function is used to remove rows or columns with missing values (NaN).

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Drop rows with missing values
data = data.dropna()

# Print the first five rows of the DataFrame
print(data.head())

3. groupby(): This function is used to group data in a DataFrame based on one or more columns.

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Group the data by the 'category' column and calculate the mean of the 'price' column
grouped_data = data.groupby('category')['price'].mean()

# Print the grouped data
print(grouped_data)

4. describe(): This function is used to generate summary statistics for a DataFrame or Series, such as the mean, standard deviation, and quartiles.

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Generate summary statistics for the 'price' column
stats = data['price'].describe()

# Print the summary statistics
print(stats)

5. pivot_table(): This function is used to create a pivot table from a pandas DataFrame.

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Create a pivot table with the 'category' column as rows and the 'date' column as columns, and the mean of the 'price' column as values
pivot_data = pd.pivot_table(data, index='category', columns='date', values='price', aggfunc='mean')

# Print the pivot table
print(pivot_data)

6. isna(): This function is used to check for missing values (NaN) in a pandas DataFrame or Series.

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data.csv')

# Check for missing values in the DataFrame
missing_values = data.isna()

# Print the DataFrame of missing values
print(missing_values)
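
The cells above all depend on a data.csv file that is not included here. As a minimal self-contained sketch, the same six functions can be run against a small in-memory CSV. The sample rows and the variable names below are assumptions made up for illustration; only the column names come from the examples above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv' (made-up sample rows)
csv_text = """category,date,price
fruit,2023-01-01,1.0
fruit,2023-01-02,3.0
dairy,2023-01-01,2.0
dairy,2023-01-02,
"""

# read_csv() accepts a file-like object as well as a path
data = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = data.isna().sum()

# dropna() removes the row whose price is missing
clean = data.dropna()

# groupby() + mean() gives the average price per category
mean_price = clean.groupby('category')['price'].mean()

# describe() summarises the remaining prices
stats = clean['price'].describe()

# pivot_table() puts categories in rows and dates in columns
pivot = pd.pivot_table(data, index='category', columns='date',
                       values='price', aggfunc='mean')

print(missing_per_column)
print(mean_price)
print(stats)
print(pivot)
```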

2.) Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [1]:
import pandas as pd

def reindex_df(df):
    new_index = pd.RangeIndex(start=1, step=2, stop=len(df)*2+1)
    df = df.reset_index(drop=True)
    df.index = new_index
    return df

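Here,

pd.RangeIndex() creates a new index with the desired start, step, and stop values. The stop value len(df)*2+1 is exclusive, so the index contains exactly len(df) labels: 1, 3, 5, and so on.

The reset_index() method returns a copy of the DataFrame with the default integer index, and the drop=True argument discards the old index instead of inserting it as a column.

Finally, assigning new_index to df.index sets the new index on that copy, so the DataFrame that was passed in is not modified.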

In [2]:
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
new_df = reindex_df(df)
print(new_df)

    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
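
The same re-indexing could also be written with DataFrame.set_axis(), which returns a new DataFrame with the given labels. This is only an alternative sketch; the function name reindex_df_set_axis and the sample data are made up for illustration.

```python
import pandas as pd

def reindex_df_set_axis(df):
    # Labels 1, 3, 5, ... with one label per row (stop is exclusive)
    new_index = pd.RangeIndex(start=1, stop=2 * len(df) + 1, step=2)
    # set_axis returns a new DataFrame; the caller's df keeps its index
    return df.set_axis(new_index, axis=0)

# A frame with a non-default index to show the old labels are replaced
df = pd.DataFrame({'A': [10, 20, 30]}, index=['x', 'y', 'z'])
result = reindex_df_set_axis(df)
print(result)
```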


3.) You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

In [3]:
import pandas as pd

def sum_first_three(df):
    values_sum = 0
    for i in range(3):
        values_sum += df['Values'].iloc[i]
    print("The sum of the first three values in the 'Values' column is:", values_sum)

Here,
The sum_first_three() function takes a DataFrame df as input.

We initialize a variable values_sum to 0 to keep track of the sum of the first three values in the 'Values' column.

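We iterate over the first three positions of the 'Values' column using a for loop and the iloc indexer, which selects by integer position.
For each position, we add the value to values_sum. The function assumes that the DataFrame has at least three rows; with fewer rows, iloc raises an IndexError.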

Finally, we print the sum to the console.

In [4]:
import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
sum_first_three(df)

The sum of the first three values in the 'Values' column is: 60
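
The question asks for explicit iteration, which the loop above provides. For comparison, here is a sketch of a non-iterating version that also tolerates DataFrames with fewer than three rows. The name sum_first_n and its n parameter are assumptions introduced for illustration.

```python
import pandas as pd

def sum_first_n(df, n=3):
    # head(n) returns at most n rows, so short frames do not raise IndexError
    return df['Values'].head(n).sum()

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
total = sum_first_n(df)

# Only two rows available: the sum covers what exists
short_total = sum_first_n(pd.DataFrame({'Values': [4, 5]}))

print("Sum of the first three values:", total)
print("Sum when only two rows exist:", short_total)
```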


4.) Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

In [6]:
import pandas as pd

def count_words(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df

Here,
The count_words() function takes a DataFrame df as input.

We use the apply() method to apply a lambda function to each value in the 'Text' column.

The lambda function converts each value x to a string with str(), splits it on whitespace with the split() method, and uses len() to count the words in the resulting list.

Finally, we create a new column 'Word_Count' in the DataFrame df and assign it the result of the apply() method.

In [8]:
import pandas as pd

df = pd.DataFrame({'Text': ['Whats your Name', 'My name is Sajal', 'Nice to Meet You Sajal']})
new_df = count_words(df)
print(new_df)


                     Text  Word_Count
0         Whats your Name           3
1        My name is Sajal           4
2  Nice to Meet You Sajal           5
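
One side effect of str(x) in the function above is that a missing value becomes the text 'None' or 'nan' and is counted as one word. The sketch below uses the vectorized .str accessor instead, which leaves missing rows as NaN. The name count_words_vectorized is hypothetical, and the function works on a copy rather than modifying the input.

```python
import pandas as pd

def count_words_vectorized(df):
    out = df.copy()  # leave the caller's DataFrame untouched
    # .str.split() splits on whitespace; .str.len() counts the list items.
    # Missing values stay NaN instead of being counted as one word.
    out['Word_Count'] = out['Text'].str.split().str.len()
    return out

df = pd.DataFrame({'Text': ['Whats your Name', None, 'Nice to Meet You Sajal']})
result = count_words_vectorized(df)
print(result)
```

Because of the NaN, the new column is stored as floating point in this example.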


5.) How are DataFrame.size() and DataFrame.shape() different?

Despite the parentheses in the question, size and shape are attributes, not methods, so they are accessed without parentheses.

DataFrame.size returns the number of elements in the DataFrame as a single integer, which equals the number of rows multiplied by the number of columns.

DataFrame.shape returns a tuple representing the dimensions of the DataFrame: the number of rows followed by the number of columns. For example, (10, 5) means that the DataFrame has 10 rows and 5 columns.

In [2]:
import pandas as pd

# create a sample dataframe
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# get the size of the dataframe
print(df.size) 

# get the shape of the dataframe
print(df.shape)  

9
(3, 3)
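
As a small sketch of the relationship between the two attributes, the snippet below checks that size equals rows times columns, shows the one-element shape of a Series, and demonstrates that calling size with parentheses fails. The variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape unpacks into (rows, columns); size is their product
rows, cols = df.shape
print(df.size == rows * cols)

# A Series is one-dimensional, so its shape has a single entry
series = df['A']
print(series.shape, series.size)

# size is an attribute: calling it like a method raises TypeError
try:
    df.size()
except TypeError as exc:
    call_error = type(exc).__name__
    print("Calling df.size() raised", call_error)
```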


6.) Which function of pandas do we use to read an excel file?

The pandas.read_excel() function reads an Excel file into a DataFrame. It supports both the .xls and .xlsx file formats. Its sheet_name parameter selects a single sheet by name or position, a list of sheets, or all sheets (sheet_name=None); when more than one sheet is requested, the result is a dictionary of DataFrames keyed by sheet.

In [None]:
import pandas as pd

# read an Excel file
df = pd.read_excel('example.xlsx', sheet_name='Sheet1')

# display the data
print(df.head())

7.) You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.

In [5]:
def extract_username(df):
    # extract the username from the email using the string method 'split'
    df['Username'] = df['Email'].str.split('@').str[0]
    
    # return the updated DataFrame with the new column
    return df

In [7]:
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'Email': ['sajal@example.com', 'poonam@example.com', 'pooja@example.com']})

# call the function to extract the usernames and create a new column 'Username'
df = extract_username(df)

# display the updated DataFrame
print(df.head())

                Email Username
0   sajal@example.com    sajal
1  poonam@example.com   poonam
2   pooja@example.com    pooja
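
With split('@'), a value that contains no '@' at all is returned unchanged as its own "username". If such rows should be flagged instead, a regular expression with str.extract() is one option. This is a sketch under that assumption; the name extract_username_strict and the malformed sample value are made up for illustration.

```python
import pandas as pd

def extract_username_strict(df):
    out = df.copy()  # work on a copy of the input
    # Capture everything before the first '@'; rows without '@' become NaN
    out['Username'] = out['Email'].str.extract(r'^([^@]+)@', expand=False)
    return out

df = pd.DataFrame({'Email': ['sajal@example.com', 'not-an-email']})
result = extract_username_strict(df)
print(result)
```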


8.) You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:
   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

In [8]:
def select_rows(df):
    # select rows where A > 5 and B < 10 using boolean indexing
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    
    # return the selected rows as a new DataFrame
    return selected_rows

In the above function, boolean indexing is used to select the rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. Each comparison is wrapped in parentheses because the & operator binds more tightly than > and <. The selected rows are stored in a new DataFrame, selected_rows, which is then returned.

In [9]:
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

# call the function to select the rows where A > 5 and B < 10
selected_rows = select_rows(df)

# display the selected rows
print(selected_rows.head())

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
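
The same filter can also be expressed with DataFrame.query(), which takes the condition as a string. The sketch below is an alternative, not a replacement; the name select_rows_query is hypothetical.

```python
import pandas as pd

def select_rows_query(df):
    # The expression refers to columns by name; 'and' combines the conditions
    return df.query('A > 5 and B < 10')

df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})
queried = select_rows_query(df)

# Boolean indexing for comparison
masked = df[(df['A'] > 5) & (df['B'] < 10)]

print(queried)
print(queried.equals(masked))
```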


9.) Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

In [10]:
import pandas as pd
import numpy as np

def calculate_stats(df):
    # calculate the mean, median, and standard deviation of the values in the 'Values' column
    mean = np.mean(df['Values'])
    median = np.median(df['Values'])
    std = np.std(df['Values'])
    
    # print the calculated statistics
    print('Mean: {:.2f}'.format(mean))
    print('Median: {:.2f}'.format(median))
    print('Standard Deviation: {:.2f}'.format(std))

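In the above function, the NumPy functions np.mean(), np.median(), and np.std() are used to calculate the mean, median, and standard deviation of the values in the 'Values' column of the DataFrame df. Note that np.std() computes the population standard deviation by default (ddof=0), whereas the pandas Series.std() method computes the sample standard deviation (ddof=1), so the two give different results on the same data. The calculated statistics are then printed to the console with two decimal places using the print() function and the .format() method.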

In [13]:
# create a sample DataFrame
df = pd.DataFrame({'Values': [3, 8, 6, 2, 9, 10, 5, 7, 1]})

# call the function to calculate the statistics
calculate_stats(df)

Mean: 5.67
Median: 6.00
Standard Deviation: 2.98
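
To make the ddof difference concrete, here is a sketch that computes the same statistics with pandas methods and returns them instead of printing. The name calculate_stats_pandas and its ddof parameter are assumptions for illustration.

```python
import pandas as pd

def calculate_stats_pandas(df, ddof=1):
    values = df['Values']
    return {
        'mean': values.mean(),
        'median': values.median(),
        # ddof=1 is the sample standard deviation (the pandas default);
        # ddof=0 reproduces the np.std() default
        'std': values.std(ddof=ddof),
    }

df = pd.DataFrame({'Values': [3, 8, 6, 2, 9, 10, 5, 7, 1]})
sample = calculate_stats_pandas(df)
population = calculate_stats_pandas(df, ddof=0)

print('Sample std: {:.2f}'.format(sample['std']))
print('Population std: {:.2f}'.format(population['std']))
```

With this data the sample standard deviation is about 3.16, while the population value matches the 2.98 printed above.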


10.) Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

In [18]:
import pandas as pd

def add_moving_average(df):
    # calculate the moving average of the sales for the past 7 days using a window of size 7
    ma = df['Sales'].rolling(window=7, min_periods=1).mean()
    
    # add the moving average as a new column in the DataFrame
    df['MovingAverage'] = ma
    
    # return the DataFrame with the new column
    return df

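In the above function, the rolling() method is used to calculate the moving average of the sales over a window of 7 rows that ends at, and includes, the current row. This corresponds to the past 7 days only if the DataFrame is sorted by 'Date' and has exactly one row per day. Because min_periods=1, the first six rows are averaged over however many rows are available instead of being NaN. The resulting moving average values are stored in a new pandas Series, ma, which is then added to the DataFrame df as a new column 'MovingAverage'. Finally, the modified DataFrame is returned.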

In [19]:
# create a sample DataFrame
dates = pd.date_range(start='2022-01-01', end='2022-01-30')
sales = [10, 20, 30, 15, 25, 35, 40, 30, 20, 15, 5, 10, 20, 30, 25, 20, 15, 10, 5, 15, 20, 25, 30, 35, 40, 30, 20, 10, 5, 15]
df = pd.DataFrame({'Date': dates, 'Sales': sales})

# call the function to add the moving average column
df = add_moving_average(df)

# display the modified DataFrame
print(df.head())


        Date  Sales  MovingAverage
0 2022-01-01     10          10.00
1 2022-01-02     20          15.00
2 2022-01-03     30          20.00
3 2022-01-04     15          18.75
4 2022-01-05     25          20.00


In the above example, the add_moving_average() function is called by passing the sample DataFrame df as an argument. The resulting modified DataFrame will contain a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame.
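
If some days are missing from the data, a row-based window no longer covers exactly 7 days. A sketch of a time-based alternative follows, using the '7D' offset window that rolling() supports on a datetime index. The function name and the sample data with a gap are assumptions for illustration.

```python
import pandas as pd

def add_time_based_moving_average(df):
    # Time-based windows need the dates in increasing order
    out = df.sort_values('Date').copy()
    # '7D' covers the 7 days ending at (and including) each row's date
    ma = out.set_index('Date')['Sales'].rolling('7D').mean()
    # Assign by position, since ma is indexed by date rather than row label
    out['MovingAverage'] = ma.to_numpy()
    return out

# Made-up data with a gap between 2022-01-02 and 2022-01-10
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10']),
    'Sales': [10, 20, 30],
})
result = add_time_based_moving_average(df)
print(result)
```

In this example the last row averages only its own value, because the two earlier dates fall outside its 7-day window.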

11.)  You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:

        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05

Your function should create the following DataFrame:

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday

The function should return the modified DataFrame.

We can use the dt accessor of the Date column to extract the weekday name for each date in the column. Here's an example function that does that:

In [14]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

# create a sample DataFrame
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', end='2023-01-05')
})

# call the function to add the 'Weekday' column
df = add_weekday_column(df)

# print the modified DataFrame
print(df)


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
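
The dt accessor is only available when the 'Date' column has a datetime dtype. If the dates were loaded as plain strings (for example from a CSV file), they need to be converted first. The sketch below assumes that situation; the name add_weekday_from_strings is hypothetical.

```python
import pandas as pd

def add_weekday_from_strings(df):
    out = df.copy()  # leave the caller's DataFrame untouched
    # Convert text such as '2023-01-01' to datetime64 values
    out['Date'] = pd.to_datetime(out['Date'])
    out['Weekday'] = out['Date'].dt.day_name()
    return out

# Dates stored as strings rather than timestamps
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02']})
result = add_weekday_from_strings(df)
print(result)
```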


12.) Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [15]:
import pandas as pd

def select_rows_between_dates(df):
    start_date = pd.Timestamp('2023-01-01')
    end_date = pd.Timestamp('2023-01-31')
    return df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

In [16]:
# create a sample DataFrame
df = pd.DataFrame({
    'Date': pd.date_range(start='2022-12-25', end='2023-02-10')
})

# call the function to select rows between '2023-01-01' and '2023-01-31'
selected_df = select_rows_between_dates(df)

# print the selected DataFrame
print(selected_df)

         Date
7  2023-01-01
8  2023-01-02
9  2023-01-03
10 2023-01-04
11 2023-01-05
12 2023-01-06
13 2023-01-07
14 2023-01-08
15 2023-01-09
16 2023-01-10
17 2023-01-11
18 2023-01-12
19 2023-01-13
20 2023-01-14
21 2023-01-15
22 2023-01-16
23 2023-01-17
24 2023-01-18
25 2023-01-19
26 2023-01-20
27 2023-01-21
28 2023-01-22
29 2023-01-23
30 2023-01-24
31 2023-01-25
32 2023-01-26
33 2023-01-27
34 2023-01-28
35 2023-01-29
36 2023-01-30
37 2023-01-31
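
The sample data above contains only midnight timestamps. When the column holds times of day, a comparison with pd.Timestamp('2023-01-31') stops at midnight on the 31st and drops later rows from that day. A sketch of a half-open alternative, assuming the whole of 31 January should be included, is shown below; the function name and the sample timestamps are made up for illustration.

```python
import pandas as pd

def select_rows_in_january_2023(df):
    start = pd.Timestamp('2023-01-01')
    end_exclusive = pd.Timestamp('2023-02-01')
    # Half-open interval: includes every time of day on 2023-01-31
    mask = (df['Date'] >= start) & (df['Date'] < end_exclusive)
    return df[mask]

df = pd.DataFrame({'Date': pd.to_datetime([
    '2022-12-31 23:59',
    '2023-01-01 00:00',
    '2023-01-31 18:30',
    '2023-02-01 00:00',
])})

selected = select_rows_in_january_2023(df)

# Closed interval ending at midnight on the 31st, for comparison
closed = df[df['Date'].between(pd.Timestamp('2023-01-01'),
                               pd.Timestamp('2023-01-31'))]

print(selected)
print(closed)
```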


13.) To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?

In [17]:
import pandas as pd

This line of code imports pandas and gives it the alias pd, which is the standard alias used in the pandas community. Once you've imported pandas, you can use its various functions and data structures, such as DataFrames and Series, to manipulate and analyze data.