Lambda and Apply Questions (10 Questions)


- How can you use a lambda function with apply to convert all values in a DataFrame column to uppercase?

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [5]:
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})
df['name'] = df['name'].apply(lambda x: x.upper())
print(df)

      name
0    ALICE
1      BOB
2  CHARLIE


- Write a lambda function to multiply each value in a DataFrame column by 2 using apply.

In [6]:
df = pd.DataFrame({'value': [10, 20, 30]})
df['value'] = df['value'].apply(lambda x: x * 2)
print(df)

   value
0     20
1     40
2     60


- How would you use apply with a lambda to create a new column that categorizes ages as 'Young' (<30), 'Adult' (30-60), or 'Senior' (>60)?

In [7]:
df = pd.DataFrame({'age': [25, 45, 70]})
# Young: under 30; Adult: 30 to 60 inclusive; Senior: over 60
df['category'] = df['age'].apply(lambda x: 'Young' if x < 30 else ('Adult' if x <= 60 else 'Senior'))
print(df)

   age category
0   25    Young
1   45    Adult
2   70   Senior


- Using apply and lambda, how can you replace negative values in a DataFrame column with zero?

In [8]:
df = pd.DataFrame({'score': [-5, 10, -15, 20]})
df['score'] = df['score'].apply(lambda x : 0 if x < 0 else x)
print(df)

   score
0      0
1     10
2      0
3     20


- How do you apply a lambda function to multiple columns of a DataFrame to compute their sum?

In [9]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['sum'] = df.apply(lambda row : row['A'] + row['B'], axis = 1)
print(df)

   A  B  sum
0  1  4    5
1  2  5    7
2  3  6    9


- How would you use apply and lambda to normalize a numerical column to a range [0, 1]?

In [10]:
df = pd.DataFrame({'value': [10, 20, 30, 40]})
min_val = df['value'].min()
max_val = df['value'].max()
df['normalized'] = df['value'].apply(lambda x: (x - min_val) / (max_val - min_val))
print(df)

   value  normalized
0     10    0.000000
1     20    0.333333
2     30    0.666667
3     40    1.000000


- How can you use apply with a lambda to extract the first three characters of a string column?

In [11]:
df = pd.DataFrame({'text': ['apple', 'banana', 'cherry']})
df['short_text'] = df['text'].apply(lambda x : x[:3])
print(df)

     text short_text
0   apple        app
1  banana        ban
2  cherry        che


- Using lambda and apply, how would you create a new column that flags rows where a value exceeds the column mean?


In [12]:
df = pd.DataFrame({'score': [50, 75, 25, 100]})
mean_score = df['score'].mean()
df['exceed_mean'] = df['score'].apply(lambda x: 'Yes' if x > mean_score else 'No')
print(df)

   score exceed_mean
0     50          No
1     75         Yes
2     25          No
3    100         Yes


- How do you use apply with a lambda to apply a custom function that rounds floats to the nearest integer in a DataFrame?


In [13]:
df = pd.DataFrame({'price': [10.4, 20.7, 30.2]})
df['rounded'] = df['price'].apply(lambda x : round(x))
print(df)

   price  rounded
0   10.4       10
1   20.7       21
2   30.2       30


- What is the difference between using apply with a lambda versus a named function for a DataFrame operation?


In [14]:
# lambda: defined inline, convenient for short one-off logic
df = pd.DataFrame({'value': [1, 2, 3]})
df['value'] = df['value'].apply(lambda x: x * 2)
print(df)

   value
0      2
1      4
2      6


In [15]:
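# named function: same logic as the lambda, but it has a name, so it can be reused, documented, and tested
# (here it is applied to the already doubled 'value' column, so the results double again)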
def multiply_by_two(x):
    return x * 2

df['value1'] = df['value'].apply(multiply_by_two)
print(df)

   value  value1
0      2       4
1      4       8
2      6      12


Merge Questions (10 Questions)

- How do you perform an inner merge between two DataFrames on a common column?

In [16]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
merged_col = df1.merge(df2, how = 'inner', on = 'id')
print(merged_col)

   id   name  age
0   1  Alice   25
1   2    Bob   30


- How would you merge two DataFrames using a left join on multiple columns?

In [17]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'dept': ['HR', 'IT', 'HR'], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'dept': ['HR', 'IT', 'Sales'], 'salary': [50000, 60000, 55000]})
merge_left = df1.merge(df2, how = 'left', on = ['id', 'dept'])
print(merge_left)

   id dept     name   salary
0   1   HR    Alice  50000.0
1   2   IT      Bob  60000.0
2   3   HR  Charlie      NaN


- What happens to rows in a right merge when the right DataFrame has missing values in the merge key?

In [18]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, None], 'age': [25, 30, 35]})
merge_right = df1.merge(df2, how = 'right', on = ['id'])
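# A right merge keeps every row of df2, including the row whose key is missing;
# that row has no match in df1, so its left-side columns are NaN.
print(merge_right)

    id   name  age
0  1.0  Alice   25
1  2.0    Bob   30
2  NaN    NaN   35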


- How can you merge two DataFrames and keep only rows that don’t match (anti-join equivalent)?

In [19]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
anti_join_df = df1[~df1['id'].isin(df2['id'])]
print(anti_join_df)

   id     name
2   3  Charlie
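
The isin filter above does the job without merge. Since the question asks about merge specifically, here is a minimal sketch of the same anti-join using the indicator parameter of merge. The names flagged and anti_join_merge are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})

# indicator=True adds a '_merge' column holding 'left_only', 'right_only', or 'both'
flagged = df1.merge(df2[['id']], on='id', how='left', indicator=True)

# Keep the rows found only in df1, then drop the helper column
anti_join_merge = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
print(anti_join_merge)
```

If the key can repeat in df2, dropping duplicate keys first (for example with drop_duplicates) avoids multiplying rows of df1 before the filter.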


- How would you combine two DataFrames with different column names for the same key using merge?

In [20]:
df1 = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
merged_diff_name = df1.merge(df2, left_on = 'user_id', right_on = 'id', how = 'inner')
print(merged_diff_name)

   user_id   name  id  age
0        1  Alice   1   25
1        2    Bob   2   30


In [21]:
# If you don’t want both key columns (user_id and id) in the result, you can drop one:
merged_df = merged_diff_name.drop(columns='id')
print(merged_df)


   user_id   name  age
0        1  Alice   25
1        2    Bob   30


- How do you handle duplicate column names after merging two DataFrames?

In [22]:
df1 = pd.DataFrame({'id': [1, 2], 'value': ['A', 'B']})
df2 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
merged_col = df1.merge(df2, how = 'inner', on = 'id', suffixes = ('_letter', '_number'))
print(merged_col)

   id value_letter  value_number
0   1            A            10
1   2            B            20


- How would you merge three DataFrames sequentially on a common key?

In [23]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
df3 = pd.DataFrame({'id': [1, 3, 4], 'salary': [50000, 60000, 55000]})
merged_1_2_col = df1.merge(df2, on = 'id', how = 'inner')
print(merged_1_2_col)

   id   name  age
0   1  Alice   25
1   2    Bob   30


In [24]:
final_col = merged_1_2_col.merge(df3, on = 'id', how = 'inner')
print(final_col)

   id   name  age  salary
0   1  Alice   25   50000


- What is the difference between merge and join in pandas, and when would you use each?

In [25]:
#  I have done this question before.
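
For reference, a minimal sketch of the distinction, using illustrative frames named left and right (not from the earlier answer): merge matches on columns and defaults to an inner join, while join matches on the index of the other frame and defaults to a left join.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})

# merge: matches on columns; how='inner' is the default, so 'left' is requested explicitly
via_merge = left.merge(right, on='id', how='left')

# join: matches on the other frame's index; how='left' is the default
via_join = left.set_index('id').join(right.set_index('id'))

print(via_merge)
print(via_join)
```

In this sketch merge is the natural choice when the keys are ordinary columns, and join is the shorter spelling when the keys already sit in the index.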

- How can you use merge to add a single column from one DataFrame to another based on a key?

In [26]:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
merged = df1.merge(df2[['id', 'age']], on = 'id', how = 'left')
print(merged)

   id     name   age
0   1    Alice  25.0
1   2      Bob  30.0
2   3  Charlie   NaN


- How do you perform a merge that preserves the index of one of the DataFrames?

In [27]:
df1 = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
df_merged = df1.merge(df2, left_index=True, right_on='id', how='left')
print(df_merged)


        name  id   age
0.0    Alice   1  25.0
1.0      Bob   2  30.0
NaN  Charlie   3   NaN
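
In the output above the row labels come from df2 (with NaN for the unmatched row), not from df1. If the goal is to keep the labels of df1, one option is to match index to index, as in this sketch. The name df_kept_index is illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})

# Move df2's key into its index so both sides are matched index-to-index;
# with how='left' the result keeps df1's index labels
df_kept_index = df1.merge(df2.set_index('id'), left_index=True, right_index=True, how='left')
print(df_kept_index)
```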


GroupBy Questions (10 Questions)

- How do you use groupby to calculate the mean of a column grouped by another column?

In [28]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR', 'IT'], 'salary': [50000, 60000, 55000, 65000]})
mean_salary = df.groupby('dept')['salary'].mean()
print(mean_salary)

dept
HR    52500.0
IT    62500.0
Name: salary, dtype: float64


- How would you group a DataFrame by two columns and count the number of rows in each group?

In [29]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR', 'IT'], 'region': ['North', 'South', 'South', 'North'], 'name': ['Alice', 'Bob', 'Charlie', 'Talha']})
grouped = df.groupby(['dept', 'region'])
count_rows = grouped.size()
print(count_rows)
           
           

dept  region
HR    North     1
      South     1
IT    North     1
      South     1
dtype: int64


- How can you use groupby with agg to apply multiple aggregation functions (e.g., mean, sum) to a column?

In [30]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR'], 'salary': [50000, 60000, 55000]})
agg_col = df.groupby('dept')['salary'].agg(['mean', 'sum'])
print(agg_col)

         mean     sum
dept                 
HR    52500.0  105000
IT    60000.0   60000


- How would you find the maximum value in a column for each group using groupby?

In [31]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR', 'IT'], 'salary': [50000, 60000, 55000, 65000]})
max_groupby = df.groupby('dept')['salary'].max()
print(max_groupby)


dept
HR    55000
IT    65000
Name: salary, dtype: int64


- How do you use groupby to create a new column that ranks values within each group?

In [32]:
df = pd.DataFrame({'dept': ['HR', 'HR', 'IT', 'IT'], 'salary': [50000, 55000, 60000, 65000]})
df['rank_in_dept'] = df.groupby('dept')['salary'].rank()
print(df)

  dept  salary  rank_in_dept
0   HR   50000           1.0
1   HR   55000           2.0
2   IT   60000           1.0
3   IT   65000           2.0


- How would you group a DataFrame by a column and compute the percentage of each group’s total sum?

In [33]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR', 'IT'], 'salary': [50000, 60000, 55000, 65000]})
sum_data = df.groupby('dept')['salary'].sum()
final_per = (sum_data / sum_data.sum()) * 100
print(final_per)


dept
HR    45.652174
IT    54.347826
Name: salary, dtype: float64


- What is the difference between groupby with apply versus agg for custom aggregations?


In [34]:
# Using agg
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR'], 'salary': [50000, 60000, 55000]})
result = df.groupby('dept')['salary'].agg(['mean', 'sum'])
print(result)

         mean     sum
dept                 
HR    52500.0  105000
IT    60000.0   60000
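
The cell above shows only agg. As a hedged sketch of the contrast: a function passed to agg should reduce each group to a single value, while a function passed to apply may return a scalar, a Series, or a DataFrame per group. The names salary_range and top_earner are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'dept': ['HR', 'IT', 'HR'], 'salary': [50000, 60000, 55000]})

# agg: the lambda receives each group's salary Series and returns one scalar
salary_range = df.groupby('dept')['salary'].agg(lambda s: s.max() - s.min())

# apply: the lambda may return a Series per group; the pieces are concatenated
top_earner = df.groupby('dept')['salary'].apply(lambda s: s.nlargest(1))

print(salary_range)
print(top_earner)
```

For plain reductions agg is usually the better fit, since it also accepts built-in names such as 'mean' and 'sum'. apply is the more flexible fallback.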


- How do you use groupby to filter groups based on a condition, such as groups with more than 2 rows?

In [35]:
# Use .groupby().filter() to keep or drop entire groups based on a
# group-level condition (size, mean, sum, etc.).
# Here: keep only the groups with more than 2 rows.
df = pd.DataFrame({'dept': ['HR', 'HR', 'HR', 'IT', 'IT'], 'salary': [50000, 55000, 52000, 60000, 65000]})
result = df.groupby('dept').filter(lambda g: len(g) > 2)
print(result)

  dept  salary
0   HR   50000
1   HR   55000
2   HR   52000


- How would you group a DataFrame by a column and return the top 3 rows for each group based on a value?

In [36]:
df.groupby('dept', group_keys=False)['salary'].apply(lambda s: s.nlargest(3))


1    55000
2    52000
0    50000
4    65000
3    60000
Name: salary, dtype: int64
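
The result above is a Series of salaries rather than full rows. A minimal sketch that returns whole rows is to sort first and then take the head of each group. The sample data here is hypothetical; a fourth HR row is added so that the cut-off at 3 is visible.

```python
import pandas as pd

# Hypothetical data: HR has four rows, so one of them must be dropped
df = pd.DataFrame({'dept': ['HR', 'HR', 'HR', 'HR', 'IT', 'IT'],
                   'salary': [50000, 55000, 52000, 48000, 60000, 65000]})

# Sort by salary descending, then keep the first 3 rows of each group
top3 = df.sort_values('salary', ascending=False).groupby('dept').head(3)
print(top3)
```

The rows come back in the sorted order with all columns intact, and the lowest HR salary (48000) is left out.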

Other Pandas Questions (10 Questions)


- How do you pivot a DataFrame to reshape it from long to wide format?


In [37]:
df = pd.DataFrame({'date': ['2023-01', '2023-01', '2023-02', '2023-02'], 'category': ['A', 'B', 'A', 'B'], 'value': [10, 20, 15, 25]})
df

      date category  value
0  2023-01        A     10
1  2023-01        B     20
2  2023-02        A     15
3  2023-02        B     25


In [38]:
# "Pivot" means to turn or rotate: pivot() rearranges rows into columns.
# Use pivot() when each (index, column) pair is unique.
# Repeated values are fine as long as each (index, column) pair appears only once.
pivot_wide = df.pivot(index = 'date', columns = 'category', values = 'value')
print(pivot_wide)

category   A   B
date            
2023-01   10  20
2023-02   15  25


- How would you stack a DataFrame to convert it from wide to long format?


In [39]:
df = pd.DataFrame({'date': ['2023-01', '2023-02'], 'A': [10, 15], 'B': [20, 25]})
df

      date   A   B
0  2023-01  10  20
1  2023-02  15  25


In [40]:
stacked = df.set_index('date').stack()
print(stacked)

date      
2023-01  A    10
         B    20
2023-02  A    15
         B    25
dtype: int64


- How can you use cut to bin a numerical column into discrete intervals?

In [41]:
df = pd.DataFrame({'score': [55, 75, 85, 95, 65]})
bins = [0, 59, 69, 79, 89, 100]
labels = ['Fail', 'D', 'C', 'B', 'A']
df['grades'] = pd.cut(df['score'], bins=bins, labels=labels)
print(df)

   score grades
0     55   Fail
1     75      C
2     85      B
3     95      A
4     65      D


- What is the purpose of pd.concat, and how does it differ from merge?

In [42]:
# pd.concat is for stacking DataFrames (just combining, no key matching).
# merge is for joining DataFrames (matches rows using one or more keys, like SQL).

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Charlie', 'David'], 'age': [35, 40]})
pd.concat([df1, df2])

      name  age
0    Alice   25
1      Bob   30
0  Charlie   35
1    David   40
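
Note that the row labels 0 and 1 repeat in the result above, because concat keeps each frame's original index. If fresh labels are wanted, ignore_index=True renumbers them, as in this sketch. The name combined is illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Charlie', 'David'], 'age': [35, 40]})

# ignore_index=True discards the original labels and numbers the rows 0..n-1
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```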


- How do you handle missing values in a DataFrame using fillna and dropna?

In [43]:
df = pd.DataFrame({'name': ['Alice', 'Bob', None], 'age': [25, None, 35]})
df

    name   age
0  Alice  25.0
1    Bob   NaN
2   None  35.0


In [44]:
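# Handling missing values with fillna and the fill-direction helpers
# df.fillna(0)   # replace every missing value with a fixed value
# df.ffill()     # forward fill: carry the previous valid value down
# df.bfill()     # backward fill: carry the next valid value up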

In [45]:
# Drop any row with at least one NaN
# df.dropna()
# Drop rows where ALL values are NaN
# df.dropna(how='all')
# Drop columns with any NaN
# df.dropna(axis=1)

In [46]:
# fillna → keep the data, but replace missing values.
# dropna → remove rows/columns that have missing values.
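
The calls above are left commented out. A runnable sketch on the same frame follows; the replacement values ('Unknown' and the column mean) are assumptions chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', None], 'age': [25, None, 35]})

# fillna with a dict: one replacement value per column
filled = df.fillna({'name': 'Unknown', 'age': df['age'].mean()})

# dropna: keep only the rows that have no missing values
dropped = df.dropna()

print(filled)
print(dropped)
```

Neither call modifies df in place; both return a new DataFrame.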

- How would you sort a DataFrame by multiple columns in different orders (ascending and descending)?

In [47]:
df = pd.DataFrame({'dept': ['HR', 'IT', 'HR', 'IT'], 'salary': [50000, 60000, 55000, 65000]})
df

  dept  salary
0   HR   50000
1   IT   60000
2   HR   55000
3   IT   65000


In [48]:
df.sort_values(by = 'salary')

  dept  salary
0   HR   50000
2   HR   55000
1   IT   60000
3   IT   65000


In [49]:
df.sort_values(by = ['dept', 'salary'], ascending = [True, False])

  dept  salary
2   HR   55000
0   HR   50000
3   IT   65000
1   IT   60000


In [50]:
df.sort_values(by = ['dept', 'salary'], ascending = [False, False])

  dept  salary
3   IT   65000
1   IT   60000
2   HR   55000
0   HR   50000


- How can you use query to filter rows in a DataFrame based on a condition?

In [51]:
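# Inside query(), the whole condition is written as one string.
# Within that string, and / or / not can be used in place of & / | / ~.
# Passing two strings joined by Python's `and` would evaluate only the second one.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'salary': [10000, 20000, 30000]})
df.query('age > 25 and salary > 10000')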

      name  age  salary
1      Bob   30   20000
2  Charlie   35   30000


- What is the difference between loc and iloc for DataFrame indexing?


In [52]:
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})
df.loc[0, 'age']


np.int64(25)

In [53]:
df.loc[1, 'name']


'Bob'

In [54]:
df.iloc[0, 1]


np.int64(25)

In [55]:
# Use .loc when your rows/columns have meaningful labels (names, IDs, dates).

# Use .iloc when you want to pick by number positions.

# In practice, most people use .loc more because it’s easier to read and understand.
# But .iloc is very useful in loops or when you only know the positions.
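
With the default 0, 1, 2 index used above, loc and iloc happen to return the same values. The difference shows once the labels are not positions, as in this sketch. The index values 10, 20, 30 are hypothetical.

```python
import pandas as pd

# Hypothetical labels that differ from the row positions
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]},
                  index=[10, 20, 30])

by_label = df.loc[10, 'age']     # row labelled 10, column 'age'
by_position = df.iloc[0, 1]      # first row, second column

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_slice = df.loc[10:20]        # rows labelled 10 and 20
iloc_slice = df.iloc[0:1]        # only the first row

print(by_label, by_position)
print(loc_slice)
print(iloc_slice)
```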

- How do you create a new column based on a condition using np.where?


In [56]:
df = pd.DataFrame({'score': [55, 75, 85, 95]})
df['result'] = np.where(df['score'] >= 60, 'Pass', 'Fail')
print(df)

   score result
0     55   Fail
1     75   Pass
2     85   Pass
3     95   Pass


In [57]:
df['grade'] = np.where(df['score'] > 80, 'A', np.where(df['score'] >= 70, 'B', 'C'))
print(df)

   score result grade
0     55   Fail     C
1     75   Pass     B
2     85   Pass     A
3     95   Pass     A


- How would you compute a rolling mean for a column with a window size of 3?

In [58]:
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})
df['rolling_mean'] = df['value'].rolling(window = 3).mean()
print(df)

   value  rolling_mean
0     10           NaN
1     20           NaN
2     30          20.0
3     40          30.0
4     50          40.0
