### Q1. List any five functions of the pandas library with execution.

Pandas is a popular Python library for data manipulation and analysis. Here are five commonly used functions from the Pandas library along with example executions:

##### 1. read_csv():
This function is used to read data from a CSV (Comma-Separated Values) file and create a DataFrame, which is a primary data structure in Pandas.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('taxonomy.csv.xls')
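
As a minimal, self-contained sketch, read_csv() can also parse any file-like object, not only a path. The CSV text below is made up for illustration and is not the contents of the file used above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "taxonomy_id,name\n101,Emergency\n101-01,Disaster Response\n"

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample)
```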

##### 2. head():
This function is used to display the first few rows of a DataFrame.

In [3]:
df.head(6)

Unnamed: 0,taxonomy_id,name,parent_id,parent_name
0,101,Emergency,,
1,101-01,Disaster Response,101,Emergency
2,101-02,Emergency Cash,101,Emergency
3,101-02-01,Help Pay for Food,101-02,Emergency Cash
4,101-02-02,Help Pay for Healthcare,101-02,Emergency Cash
5,101-02-03,Help Pay for Housing,101-02,Emergency Cash


##### 3. describe():
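This function returns summary statistics for a DataFrame. For numerical columns it reports count, mean, standard deviation, minimum, quartiles, and maximum. For object (text) columns, as in the output below, it reports count, unique, top, and freq instead.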

In [4]:
df.describe()

Unnamed: 0,taxonomy_id,name,parent_id,parent_name
count,290,290,279,279
unique,290,183,60,50
top,101,Nursing Home,106-06-07,Health Education
freq,1,4,11,15
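
The difference between the two kinds of summary can be seen in a small sketch. The column names and values here are hypothetical, chosen only to have one numeric and one text column.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
demo = pd.DataFrame({
    "score": [10, 20, 30, 40],
    "label": ["a", "b", "a", "a"],
})

# With mixed columns, describe() summarises only the numeric ones by default
numeric_summary = demo.describe()

# On a frame holding only text columns it reports count / unique / top / freq
text_summary = demo[["label"]].describe()

print(numeric_summary)
print(text_summary)
```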


##### 4. info():
This function prints a summary of the DataFrame, including each column's data type and number of non-null values.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   taxonomy_id  290 non-null    object
 1   name         290 non-null    object
 2   parent_id    279 non-null    object
 3   parent_name  279 non-null    object
dtypes: object(4)
memory usage: 9.2+ KB


##### 5. groupby():
This function is used for grouping data in a DataFrame based on one or more columns, allowing you to perform operations on each group separately.

In [6]:
df1 = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

In [7]:
df1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [8]:
g = df1.groupby('Pclass')

In [9]:
# numeric_only=True skips text columns such as Name;
# recent pandas versions no longer drop them automatically
g.sum(numeric_only=True)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,99705,136,7111.42,90,77,18177.4125
2,82056,87,5168.83,74,70,3801.8417
3,215625,119,8924.92,302,193,6714.6951


In [10]:
# numeric_only=True is needed here as well: the mean of a text column is undefined
g.mean(numeric_only=True)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555
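
When only a few statistics are needed, named aggregation may be clearer than summing or averaging every column. The sketch below uses a hypothetical five-row miniature of the Titanic columns rather than the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns used above
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
    "Survived": [1, 0, 1, 0, 0],
})

# Named aggregation: one output column per (source column, function) pair
summary = toy.groupby("Pclass").agg(
    mean_fare=("Fare", "mean"),
    survivors=("Survived", "sum"),
)

print(summary)
```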


### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [11]:
data = {'A': [10, 20, 30],
        'B': [40, 50, 60],
        'C': [70, 80, 90]}
df = pd.DataFrame(data)

In [12]:
df

Unnamed: 0,A,B,C
0,10,40,70
1,20,50,80
2,30,60,90


In [13]:
def reindex_dataframe(df):
    new_index = pd.RangeIndex(start=1, stop=len(df)*2, step=2)
    df = df.reset_index(drop=True)
    df.index = new_index
    return df

reindex_dataframe(df)

Unnamed: 0,A,B,C
1,10,40,70
3,20,50,80
5,30,60,90
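
The same idea can be generalised. This is a sketch only: the function name and the start and step parameters are hypothetical additions, not part of the question.

```python
import pandas as pd

def reindex_with_step(df, start=1, step=2):
    """Return a copy of df indexed start, start + step, start + 2*step, ..."""
    # reset_index returns a new DataFrame, so the caller's frame is untouched
    out = df.reset_index(drop=True)
    # stop = start + step * n yields exactly n index values for a positive step
    out.index = pd.RangeIndex(start=start, stop=start + step * len(out), step=step)
    return out

base = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60], "C": [70, 80, 90]})
result = reindex_with_step(base)
print(result)
```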


### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

In [14]:
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

In [15]:
df

Unnamed: 0,Values
0,10
1,20
2,30
3,40
4,50


In [16]:
sum_values = df["Values"].iloc[:3].sum()
print("The sum of the first three values in the 'Values' column is:", sum_values)

The sum of the first three values in the 'Values' column is: 60


In [17]:
def calculate_sum(df):
    sum_values = df['Values'].iloc[:3].sum()
    print("The sum of the first three values in the 'Values' column is:", sum_values)
    
calculate_sum(df)

The sum of the first three values in the 'Values' column is: 60
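
The question asks for iteration, whereas the solution above uses a vectorised slice. A minimal sketch of an explicit loop follows; the function name is hypothetical.

```python
import pandas as pd

def sum_first_three_iterative(df):
    total = 0
    # Walk the column values in order and stop once three have been added
    for position, value in enumerate(df["Values"]):
        if position >= 3:
            break
        total += value
    print("The sum of the first three values in the 'Values' column is:", total)
    return total

values_df = pd.DataFrame({"Values": [10, 20, 30, 40, 50]})
loop_total = sum_first_three_iterative(values_df)
```

Both versions give the same result here; the slice-based one is generally preferable on large frames.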


### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [18]:
data = {'Text': ['My name is Shahrukh.',
                 'I am from Noida.',
                 'I am pursuing Data science on Pwskills.']}
df = pd.DataFrame(data)

In [19]:
df

Unnamed: 0,Text
0,My name is Shahrukh.
1,I am from Noida.
2,I am pursuing Data science on Pwskills.


In [20]:
def create_word_count(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df
create_word_count(df)

Unnamed: 0,Text,Word_Count
0,My name is Shahrukh.,4
1,I am from Noida.,4
2,I am pursuing Data science on Pwskills.,7


In [21]:
df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
print(df)

                                      Text  Word_Count
0                     My name is Shahrukh.           4
1                         I am from Noida.           4
2  I am pursuing Data science on Pwskills.           7
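
One caveat of wrapping each value in str() is that a missing value becomes the text "None" or "nan" and is counted as one word. A possible alternative, sketched here on hypothetical data, uses the vectorised string accessor, which leaves missing values missing.

```python
import pandas as pd

# Hypothetical data with a missing entry and irregular spacing
texts = pd.DataFrame({"Text": ["My name is Shahrukh.", None, "  two   words "]})

# str.split() with no pattern splits on runs of whitespace;
# str.len() then counts the pieces, and missing rows stay missing
texts["Word_Count"] = texts["Text"].str.split().str.len()

print(texts)
```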


### Q5. How are DataFrame.size() and DataFrame.shape() different?

##### DataFrame.size
DataFrame.size is an attribute, not a method, so it is accessed without parentheses. It returns the total number of elements in the DataFrame, which is equal to the product of the number of rows and the number of columns.

In [22]:
#Example
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

size = df.size
print(size)  # Output: 6 (3 rows * 2 columns = 6 elements)


6


##### DataFrame.shape

DataFrame.shape is also an attribute. It returns a tuple representing the dimensions of the DataFrame, where the first element is the number of rows and the second element is the number of columns.

In [23]:
#Example
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

shape = df.shape
print(shape)  # Output: (3, 2) (3 rows, 2 columns)


(3, 2)
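
The relationship between the two attributes can be checked directly. This short sketch reuses the same toy data; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# shape unpacks into (rows, columns); size is their product
rows, cols = frame.shape
print(frame.size == rows * cols)

# For a single column (a Series), shape is a one-element tuple
column_shape = frame["A"].shape
column_size = frame["A"].size
print(column_shape, column_size)
```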


### Q6. Which function of pandas do we use to read an excel file?

In pandas, the pandas.read_excel() function reads an Excel file and returns its contents as a DataFrame. Reading .xlsx files requires the openpyxl package to be installed.

In [24]:
import pandas as pd

# Read an Excel file into a DataFrame
df = pd.read_excel('Book1.xlsx')

In [25]:
df

Unnamed: 0,House Age (years),Distance to Station (meters),Price per Square Foot,Unnamed: 3,Sum
0,32.0,84.87882,75.8,,Mean
1,19.5,306.5947,84.4,,Median
2,13.3,561.9845,94.6,,Mode
3,13.3,561.9845,109.6,,St. Deviation
4,5.0,390.5684,86.2,,Variance
5,7.1,2175.03,64.2,,
6,34.5,623.4731,80.6,,
7,20.3,287.6025,93.4,,
8,31.7,5512.038,37.6,,
9,17.9,1783.18,44.2,,


### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

In [26]:
data = {'Email': ['shahrukh.khan@example.com', 'aslam.jafari@example.com', 'pwskills@example.com']}
df = pd.DataFrame(data)

In [27]:
df

Unnamed: 0,Email
0,shahrukh.khan@example.com
1,aslam.jafari@example.com
2,pwskills@example.com


In [28]:
def extract_username(df):
    df['Username'] = df['Email'].str.split('@').str[0]
    return df
extract_username(df)

Unnamed: 0,Email,Username
0,shahrukh.khan@example.com,shahrukh.khan
1,aslam.jafari@example.com,aslam.jafari
2,pwskills@example.com,pwskills
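
Splitting on '@' returns the whole string when an entry has no '@' at all. If such rows should be treated as missing instead, a regular expression is one option. This is a sketch on hypothetical data, not a full email validator.

```python
import pandas as pd

# Hypothetical data including a malformed entry and a missing one
emails = pd.DataFrame({"Email": ["shahrukh.khan@example.com", "not-an-email", None]})

# Capture everything before the first '@'; rows without '@' become missing
emails["Username"] = emails["Email"].str.extract(r"^([^@]+)@", expand=False)

print(emails)
```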


### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.
For example, if df contains the following values:

       A  B  C
    0  3  5  1
    1  8  2  7
    2  6  9  4
    3  2  3  5
    4  9  1  2

Your function should select the following rows:

       A  B  C
    1  8  2  7
    4  9  1  2

The function should return a new DataFrame that contains only the selected rows.

In [29]:
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}

df = pd.DataFrame(data)

In [30]:
df

Unnamed: 0,A,B,C
0,3,5,1
1,8,2,7
2,6,9,4
3,2,3,5
4,9,1,2


In [31]:
df[(df['A'] > 5) & (df['B'] < 10)]

Unnamed: 0,A,B,C
1,8,2,7
2,6,9,4
4,9,1,2

Row 2 (A = 6, B = 9) satisfies both conditions as stated (6 > 5 and 9 < 10), so it is part of the result even though the example in the question lists only rows 1 and 4.


In [32]:
def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows
select_rows(df)

Unnamed: 0,A,B,C
1,8,2,7
2,6,9,4
4,9,1,2
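
The thresholds can also be passed in as arguments. The sketch below uses DataFrame.query(); the function name and the a_min and b_max parameters are hypothetical.

```python
import pandas as pd

def select_rows_query(df, a_min=5, b_max=10):
    # Inside a query string, @name refers to a local Python variable
    return df.query("A > @a_min and B < @b_max")

frame = pd.DataFrame({
    "A": [3, 8, 6, 2, 9],
    "B": [5, 2, 9, 3, 1],
    "C": [1, 7, 4, 5, 2],
})

selected = select_rows_query(frame)
print(selected)
```

query() returns a new DataFrame and leaves the original unchanged, which matches what the question asks for.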


### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [33]:
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

In [34]:
df

Unnamed: 0,Values
0,10
1,20
2,30
3,40
4,50


In [35]:
def calculate(df):
    mean = df['Values'].mean()
    median = df['Values'].median()
    standard = df['Values'].std()
    return mean, median, standard
mean, median, standard = calculate(df)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard: {standard}")

Mean: 30.0
Median: 30.0
Standard: 15.811388300841896
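
The value 15.81 is the sample standard deviation, since pandas divides by n - 1 by default (ddof=1). A short sketch, with illustrative variable names, computes all three statistics in one call and shows the population figure for comparison.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], name="Values")

# agg with a list of names returns a Series indexed by statistic
stats = values.agg(["mean", "median", "std"])

# ddof=0 divides by n instead of n - 1 (population standard deviation)
population_std = values.std(ddof=0)

print(stats)
print(population_std)
```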


### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [36]:
import pandas as pd
import numpy as np
# Create sample data
data = {
    'Date': pd.date_range(start='2023-09-01', periods=30, freq='D'),
    'Sales': np.random.randint(100, 500, size=30)
}

# Create a DataFrame from the sample data
df = pd.DataFrame(data)
df

Unnamed: 0,Date,Sales
0,2023-09-01,296
1,2023-09-02,303
2,2023-09-03,361
3,2023-09-04,334
4,2023-09-05,219
5,2023-09-06,181
6,2023-09-07,359
7,2023-09-08,169
8,2023-09-09,367
9,2023-09-10,439


In [37]:
# Define a function to calculate the moving average
def calculate_moving_average(df):
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df

# Calculate the moving average and add it to the DataFrame
df = calculate_moving_average(df)

# Print the resulting DataFrame
df

Unnamed: 0,Date,Sales,MovingAverage
0,2023-09-01,296,296.0
1,2023-09-02,303,299.5
2,2023-09-03,361,320.0
3,2023-09-04,334,323.5
4,2023-09-05,219,302.6
5,2023-09-06,181,282.333333
6,2023-09-07,359,293.285714
7,2023-09-08,169,275.142857
8,2023-09-09,367,284.285714
9,2023-09-10,439,295.428571
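
A window of 7 rows equals 7 days only when there is exactly one row per day, in date order. If dates can be missing, a time-based window is one alternative. The sketch below uses hypothetical data with gaps and assumes the Date column is already sorted.

```python
import pandas as pd

# Hypothetical data with missing days (for example, 2023-09-03 is absent)
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-09-01", "2023-09-02", "2023-09-04", "2023-09-09"]),
    "Sales": [100, 200, 300, 400],
})

# A "7D" offset window covers the 7 days ending at, and including,
# the current row's date, however many rows fall inside it
sales["MovingAverage"] = sales.rolling("7D", on="Date")["Sales"].mean()

print(sales)
```

For the last row, only 2023-09-04 and 2023-09-09 fall inside the window, so the average is taken over two values rather than over the previous rows regardless of their dates.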


### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:

       Date
    0  2023-01-01
    1  2023-01-02
    2  2023-01-03
    3  2023-01-04
    4  2023-01-05

Your function should create the following DataFrame:

       Date        Weekday
    0  2023-01-01  Sunday
    1  2023-01-02  Monday
    2  2023-01-03  Tuesday
    3  2023-01-04  Wednesday
    4  2023-01-05  Thursday

The function should return the modified DataFrame.

In [38]:
data = {'Date': pd.date_range(start = '2023-01-01', periods = 7, freq = 'D')
}
df = pd.DataFrame(data)
df

Unnamed: 0,Date
0,2023-01-01
1,2023-01-02
2,2023-01-03
3,2023-01-04
4,2023-01-05
5,2023-01-06
6,2023-01-07


In [39]:
def add_weekday(df):
    # '%A' formats each date as its full weekday name
    df['Weekday'] = df['Date'].dt.strftime('%A')
    return df

In [40]:
add_weekday(df)

Unnamed: 0,Date,Weekday
0,2023-01-01,Sunday
1,2023-01-02,Monday
2,2023-01-03,Tuesday
3,2023-01-04,Wednesday
4,2023-01-05,Thursday
5,2023-01-06,Friday
6,2023-01-07,Saturday
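
An alternative to strftime('%A') is the dt.day_name() accessor. The sketch below also adds the numeric weekday; the WeekdayNumber column is a hypothetical extra, not something the question asks for.

```python
import pandas as pd

dates = pd.DataFrame({"Date": pd.date_range("2023-01-01", periods=3, freq="D")})

# day_name() returns the full English weekday name by default
dates["Weekday"] = dates["Date"].dt.day_name()

# dayofweek numbers the days Monday=0 through Sunday=6
dates["WeekdayNumber"] = dates["Date"].dt.dayofweek

print(dates)
```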


### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [41]:
data = {
    'Date': [
        '2023-01-05 10:15:00',
        '2023-01-10 14:30:00',
        '2023-01-15 08:45:00',
        '2023-02-05 12:00:00',
        '2023-02-10 16:45:00',
    ]
}

In [42]:
df = pd.DataFrame(data)

In [43]:
df

Unnamed: 0,Date
0,2023-01-05 10:15:00
1,2023-01-10 14:30:00
2,2023-01-15 08:45:00
3,2023-02-05 12:00:00
4,2023-02-10 16:45:00


In [44]:
df['Date'] = pd.to_datetime(df['Date'])
def select_rows_between_dates(df):
    start_date = pd.Timestamp('2023-01-01')
    # Exclusive upper bound, so timestamps later in the day on 2023-01-31 are kept
    end_date = pd.Timestamp('2023-02-01')
    selected_df = df[(df['Date'] >= start_date) & (df['Date'] < end_date)]
    return selected_df
select_rows_between_dates(df)

Unnamed: 0,Date
0,2023-01-05 10:15:00
1,2023-01-10 14:30:00
2,2023-01-15 08:45:00
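
The same selection can be written with Series.between(). This is a sketch on hypothetical timestamps, assuming pandas 1.3 or later, where the inclusive argument accepts "left". The upper bound is the first instant of February, so a timestamp late on 31 January is still selected.

```python
import pandas as pd

# Hypothetical timestamps, including one late on the last day of January
stamps = pd.DataFrame({"Date": pd.to_datetime([
    "2023-01-05 10:15:00",
    "2023-01-31 18:00:00",
    "2023-02-05 12:00:00",
])})

# inclusive="left" keeps the lower bound and excludes the upper bound
mask = stamps["Date"].between("2023-01-01", "2023-02-01", inclusive="left")
january = stamps[mask]

print(january)
```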


### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

The pandas library itself must be imported before any of its functions can be used. By convention it is imported under the alias pd:

In [48]:
import pandas as pd