# Pandas Practice Questions
This notebook contains **20 comprehensive Python pandas practice problems** organized in two sections:

**Section A - Short Coding Questions (Questions 1-17):**
- Questions 1-12: Basic pandas operations (loading, selection, filtering, handling missing values)
- Questions 13-17: Short coding questions on duplicates, missing values, column creation, filtering, and statistics

**Section B - Applied Coding Questions (Questions 18-20):**
- Question 18: GroupBy with multiple aggregations
- Question 19: Advanced filtering and column creation
- Question 20: Handling missing values and outliers

Depending on the question, you will find:
- Clear problem description
- Hints for solving
- Multiple-choice code options (where applicable)
- Instructor solution with inline examples
- Test cases using small DataFrames

In [2]:
import pandas as pd
import numpy as np
from io import StringIO

In [3]:
name = 'student name'
roll_number = 'student roll number'

### 1. Load a CSV string into a DataFrame
**Return:** A pandas DataFrame from the CSV string

**Choose the correct line:**
- (a) `return pd.read_excel(StringIO(csv_string))`
- (b) `return pd.read_csv(StringIO(csv_string))`
- (c) `return pd.DataFrame(csv_string.split('\n'))`
- (d) `return csv_string.to_dataframe()`

In [None]:
def load_csv_string(csv_string: str) -> pd.DataFrame:
    pass
# csv_data = 'name,age,score\nAlice,25,85\nBob,30,90\nCharlie,22,78'

### 2. Get shape and column names
**Return:** A tuple of (number of rows, number of columns, list of column names)

**Choose the correct code:**
- (a) `return (df.size, df.ndim, df.columns)`
- (b) `return (df.shape[0], df.shape[1], list(df.columns))`
- (c) `return df.info()`
- (d) `return (len(df), len(df.index), df.to_list())`

In [6]:
def get_dataframe_info(df: pd.DataFrame) -> tuple:
    pass

### 3. Get the first n rows of a DataFrame
**Return:** DataFrame containing first n rows

**Choose the correct code:**
- (a) `return df.iloc[:n]`
- (b) `return df.head(n)`
- (c) `return df.nlargest(n, axis=0)`
- (d) `return df[:n:1]`

In [8]:
def get_first_n_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    pass

### 4. Get basic statistics for numeric columns
**Return:** A pandas DataFrame with descriptive statistics (using `.describe()`)


In [None]:
def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    pass

### 5. Select a single column as a Series
**Return:** A pandas Series for the specified column

In [10]:
def select_column(df: pd.DataFrame, col_name: str) -> pd.Series:
    pass

### 6. Filter rows where a column value exceeds a threshold
**Return:** A DataFrame containing only rows where column > threshold

**Hint:** Use boolean indexing `df[df[col_name] > threshold]` and `.reset_index(drop=True)` to reset row indices.

**Choose the correct code:**
- (a) `return df.filter(column=col_name, value=threshold)`
- (b) `return df.loc[df[col_name] > threshold]`
- (c) `return df[df[col_name] > threshold].reset_index(drop=True)`
- (d) `return df.query(f'{col_name} > {threshold}')`
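
A small sketch of what the hint describes, on hypothetical toy data (the `name` and `points` columns and the cutoff of 10 are invented for illustration). Boolean indexing keeps the original row labels, so the index has gaps until `.reset_index(drop=True)` renumbers it.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({"name": ["a", "b", "c"], "points": [5, 50, 70]})

# Boolean indexing keeps rows where the condition is True,
# but the surviving rows keep their original labels (1 and 2).
kept = toy[toy["points"] > 10]
print(list(kept.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels.
renumbered = kept.reset_index(drop=True)
print(list(renumbered.index))
```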

In [None]:
def filter_by_threshold(df: pd.DataFrame, col_name: str, threshold: float) -> pd.DataFrame:
    pass

### 7. Count missing (NaN) values in each column
**Return:** A pandas Series with column names as index and count of NaN as values

**Hint:** Use `.isnull().sum()` to count missing values in each column.
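
As a rough illustration of the hint, assuming a hypothetical two-column frame (`city` and `temp` are made-up names): `isnull()` produces a boolean frame, and `sum()` counts the `True` cells in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({
    "city": ["Pune", None, "Delhi"],
    "temp": [31.0, np.nan, np.nan],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```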

In [None]:
def count_missing_values(df: pd.DataFrame) -> pd.Series:
    pass

### 8. Drop rows containing any NaN values
**Return:** A DataFrame with all rows containing NaN removed

**Hint:** Use `.dropna()` to remove rows with missing values, then `.reset_index(drop=True)` to renumber rows.
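
A minimal sketch of the two steps in the hint, on invented data (columns `a` and `b` are placeholders). The first row is the one with a gap, so the effect of renumbering is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the first row has a missing value.
toy = pd.DataFrame({"a": [np.nan, 2.0, 3.0], "b": ["x", "y", "z"]})

# dropna() removes any row with at least one NaN; the remaining rows
# keep labels 1 and 2 until reset_index(drop=True) renumbers them.
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned)
```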

In [None]:
def drop_rows_with_nan(df: pd.DataFrame) -> pd.DataFrame:
    pass

### 9. Fill missing values with the mean of the column
**Return:** A DataFrame where NaN values in numeric columns are replaced by column mean

**Hint:** Get numeric columns using `.select_dtypes()`, then use `.fillna()` with the column mean.
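
One possible way to combine the two calls from the hint, shown on hypothetical data (`label` and `value` are invented names). This is a sketch of the technique rather than the only valid approach; it works on a copy so the input frame is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy = pd.DataFrame({"label": ["p", "q", "r"], "value": [10.0, np.nan, 20.0]})

filled = toy.copy()  # avoid modifying the caller's DataFrame
numeric_cols = filled.select_dtypes(include="number").columns

# mean() returns one value per numeric column (NaN is skipped);
# fillna() with that Series fills each column with its own mean.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
print(filled)
```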

In [None]:
def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    pass

### 10. Group by a column and calculate the mean of another column
**Return:** A DataFrame with grouped results (group column and mean)

**Hint:** Use `.groupby(group_col)[agg_col].mean()` and `.reset_index()` to convert to DataFrame.
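
The pattern in the hint can be sketched as follows, using made-up columns `kind` and `weight`. The grouped mean is a Series indexed by group; `.reset_index()` turns the group labels back into an ordinary column.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({
    "kind": ["a", "b", "a", "b"],
    "weight": [1.0, 2.0, 3.0, 6.0],
})

# Group, pick the column to aggregate, average it, then flatten the index.
means = toy.groupby("kind")["weight"].mean().reset_index()
print(means)
```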

In [None]:
def group_by_mean(df: pd.DataFrame, group_col: str, agg_col: str) -> pd.DataFrame:
    pass

### 11. Merge two DataFrames on a common column
**Return:** A merged DataFrame (inner join on the specified key)

**Choose the correct code:**
- (a) `return left.join(right, on=on)`
- (b) `return pd.concat([left, right])`
- (c) `return pd.merge(left, right, on=on, how='inner')`
- (d) `return left.combine(right)`
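
To make the inner-join behaviour concrete, here is a small sketch on invented frames (the `key`, `l`, and `r` columns are placeholders). Only keys present in both frames survive an inner join.

```python
import pandas as pd

# Hypothetical toy frames: keys 2 and 3 appear in both.
left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# how="inner" keeps only the rows whose key exists on both sides.
joined = pd.merge(left, right, on="key", how="inner")
print(joined)
```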

In [15]:
def merge_dataframes(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    pass

### 12. Convert a column to datetime format
**Return:** A DataFrame where the specified column has been converted to datetime

**Choose the correct code:**
- (a) `df_copy[col_name] = df_copy[col_name].astype(datetime)`
- (b) `df_copy[col_name] = pd.to_datetime(df_copy[col_name])`
- (c) `df_copy[col_name].convert_to_datetime()`
- (d) `df_copy[col_name] = datetime.strptime(df_copy[col_name], '%Y-%m-%d')`
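
A brief sketch of a string-to-datetime conversion on a copy, assuming a hypothetical `when` column holding ISO-formatted date strings. Once converted, the `.dt` accessor becomes available on the column.

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain strings.
toy = pd.DataFrame({"when": ["2024-01-05", "2024-02-10"]})

converted = toy.copy()  # leave the original frame untouched
converted["when"] = pd.to_datetime(converted["when"])

# Datetime columns expose components such as month through .dt
print(converted["when"].dt.month.tolist())
```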

In [None]:
def convert_to_datetime(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    pass

### 13. Drop Duplicate Rows
You have a DataFrame with duplicate rows. Remove them using `drop_duplicates` with `subset=['Name', 'Team']`, so that two rows count as duplicates when both `Name` and `Team` match.

In [17]:
def drop_duplicates_by_cols(df: pd.DataFrame) -> pd.DataFrame:
    pass
sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Team': ['X', 'Y', 'X', 'Z'], 'Salary': [50000, 55000, 50000, 60000]})

### 14. Fill Missing Values in a Column
Write a Python command to fill all missing values in the column 'College' with the text 'Unknown'.

In [None]:
def fill_missing_college(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'College': ['IIT', None, 'NIT']})

### 15. Create New Column with Percentage Increase
Given a DataFrame df with a 'Salary' column, write code to increase salary by 5% and store it in a new column 'UpdatedSalary'.

**Hint:** Multiply the Salary column by 1.05 to increase by 5%.

**Choose the correct code:**
- (a) `df['UpdatedSalary'] = df['Salary'] * 5`
- (b) `df['UpdatedSalary'] = df['Salary'] * 1.05`
- (c) `df['UpdatedSalary'] = df['Salary'] + 0.05`
- (d) `df['UpdatedSalary'] = df['Salary'].apply(lambda x: x * 5)`

In [None]:
def add_updated_salary(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Salary': [50000, 55000, 60000]})

### 16. Filter Rows with Range Condition
Write Python code to select rows where 'Profit' is between 30 and 55 (inclusive).

**Hint:** Use boolean indexing with AND operator `&`.
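
A sketch of an inclusive range filter on hypothetical data (`item`, `price`, and the bounds 10 and 15 are invented). Each comparison needs its own parentheses because `&` binds more tightly than `>=` and `<=`. `Series.between` is shown as an alternative; it is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({"item": ["p", "q", "r", "s"], "price": [5, 10, 15, 20]})

# Combine two element-wise conditions with & (parentheses are required).
mask = (toy["price"] >= 10) & (toy["price"] <= 15)
in_range = toy[mask]

# Equivalent selection using between(), inclusive on both ends by default.
same = toy[toy["price"].between(10, 15)]
print(in_range)
```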

In [None]:
def filter_profit_range(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Profit': [25, 40, 60, 35]})

### 17. Get Summary Statistics
Write a Python command to show summary statistics (mean, median, std, min, max, etc.) for the entire DataFrame.

In [None]:
def get_summary_stats(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Age': [25, 30, 28, 35], 'Salary': [50000, 55000, 52000, 60000]})

### 18. GroupBy with Multiple Aggregations
You have a DataFrame with columns: Name, Team, Salary, Profit

Write Python code to:
1. Group the data by Team
2. Aggregate the average salary and total profit for each team
3. Return the result

**Hint:** Use `.groupby()` with `.agg()` for multiple aggregations.

**Choose the correct code:**
- (a) `df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})`
- (b) `df.groupby('Team')[['Salary', 'Profit']].agg(['mean', 'sum'])`
- (c) `df.group('Team').apply(lambda x: {'avg_salary': x['Salary'].mean(), 'total_profit': x['Profit'].sum()})`
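
For reference, passing a dict to `.agg()` applies a different aggregation to each column. The sketch below uses invented columns (`dept`, `hours`, `cost`) rather than the question's data.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({
    "dept": ["hr", "hr", "it"],
    "hours": [10, 20, 30],
    "cost": [1.0, 2.0, 5.0],
})

# Keys are column names, values are the aggregation to apply to each.
summary = toy.groupby("dept").agg({"hours": "mean", "cost": "sum"})
print(summary)
```

The result is indexed by the group column and has one output column per dict key.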

In [None]:
def groupby_team_agg(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Team': ['X', 'X', 'Y', 'Y'], 'Salary': [50000, 55000, 52000, 53000], 'Profit': [45, 30, 60, 25]})

### 19. Advanced Filtering and Column Creation
Given a DataFrame with columns: Name, Score1, Score2

Write Python code to:
1. Select only rows where Score1 > 40 AND Score2 > 50
2. Create a new column AverageScore = mean of Score1 and Score2
3. Return a DataFrame containing only the columns `['Name', 'AverageScore']`

**Hint:** Filter first using boolean indexing, then add the new column, then select specific columns.
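
The three-step pattern (filter, derive, select) can be sketched on made-up data as follows; `id`, `x`, `y`, `xy_mean`, and the cutoff of 5 are all hypothetical. Taking `.copy()` after filtering is one way to make sure the new column is added to an independent frame rather than a view of the original.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's test cases.
toy = pd.DataFrame({"id": ["u", "v", "w"], "x": [1, 8, 9], "y": [9, 7, 2]})

# Step 1: keep rows where both conditions hold.
subset = toy[(toy["x"] > 5) & (toy["y"] > 5)].copy()

# Step 2: row-wise mean of the two columns (axis=1 averages across columns).
subset["xy_mean"] = subset[["x", "y"]].mean(axis=1)

# Step 3: keep only the columns of interest.
result = subset[["id", "xy_mean"]]
print(result)
```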

In [None]:
def advanced_filter_and_create(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Score1': [40, 55, 70, 30], 'Score2': [50, 65, 75, 35]})

### 20. Handle Missing Values and Outliers
You have a DataFrame with an 'Age' column containing missing values and outliers (Age > 100).

Write Python code to:
1. Replace missing values with the median age
2. Remove rows where Age > 100
3. Return the cleaned DataFrame

**Hint:** Use `.fillna()` with median, then filter with boolean indexing.
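
A sketch of the same fill-then-filter order on a hypothetical `height_cm` column, with an invented cutoff of 250. Note that the median here is computed before the outlier is removed, so the outlier still takes part in the median calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one implausible value.
toy = pd.DataFrame({"height_cm": [150.0, np.nan, 160.0, 900.0]})

# median() skips NaN by default, so it is taken over 150, 160, and 900.
median_height = toy["height_cm"].median()

filled = toy.copy()
filled["height_cm"] = filled["height_cm"].fillna(median_height)

# Keep only rows at or below the (assumed) plausibility cutoff.
cleaned = filled[filled["height_cm"] <= 250]
print(cleaned)
```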

In [None]:
def clean_age_data(df: pd.DataFrame) -> pd.DataFrame:
    pass
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'Age': [25, np.nan, 105, 30, np.nan]})