# Module: Pandas Assignments

## Lesson: Pandas


### Assignment 1: DataFrame Creation and Indexing

1. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Set the index to be the first column.
2. Create a Pandas DataFrame with columns 'A', 'B', 'C' and index 'X', 'Y', 'Z'. Fill the DataFrame with random integers and access the element at row 'Y' and column 'B'.


In [3]:
import pandas as pd
import numpy as np

In [5]:
data = np.random.randint(1, 101, size=(6, 4))
columns = ['ID', 'Score', 'Age', 'Rank']

df = pd.DataFrame(data, columns=columns)

df.set_index('ID', inplace=True)
print("DataFrame with 'ID' as Index:")
print(df)

DataFrame with 'ID' as Index:
    Score  Age  Rank
ID                  
26     62   15    22
60     34   59    30
40     43   50    19
28     18   34    36
24     23   32    89
6      21   90    97


In [10]:
df2 = pd.DataFrame(
    np.random.randint(1, 50, size=(3, 3)),
    index=['X', 'Y', 'Z'],
    columns=['A', 'B', 'C']
)

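element = df2.loc['Y', 'B']
# Equivalent lookups:
# df2.iloc[1, 1]   (by integer position)
# df2['B']['Y']    (chained indexing; fine for reading, avoid for assignment)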

print("DataFrame:")
print(df2)
print(f"\nThe element at row 'Y', column 'B' is: {element}")

DataFrame:
    A   B   C
X  19  32  21
Y   9   6  40
Z  49   1  11

The element at row 'Y', column 'B' is: 6
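
The lookups mentioned in the cell above can be compared side by side. The following is a minimal sketch that rebuilds the printed table with fixed values (the name `demo` is illustrative) and checks that label-based, position-based, and scalar access all return the same element.

```python
import pandas as pd

# Fixed values copied from the printed table so the result is reproducible
demo = pd.DataFrame(
    [[19, 32, 21], [9, 6, 40], [49, 1, 11]],
    index=['X', 'Y', 'Z'],
    columns=['A', 'B', 'C'],
)

by_label = demo.loc['Y', 'B']     # label-based lookup
by_position = demo.iloc[1, 1]     # integer-position lookup (second row, second column)
by_scalar = demo.at['Y', 'B']     # scalar accessor for a single label pair

print(by_label, by_position, by_scalar)
```

`.at` is usually the better fit when only one value is needed, while `.loc` also accepts slices and lists of labels.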


### Assignment 2: DataFrame Operations

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Add a new column that is the product of the first two columns.
2. Create a Pandas DataFrame with 3 columns and 4 rows filled with random integers. Compute the row-wise and column-wise sum.

In [11]:
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(1, 10, size=(5, 3)),
                   columns=['Col1', 'Col2', 'Col3'])
df1['Product'] = df1['Col1'] * df1['Col2']

print("DataFrame with Product Column:")
print(df1)

DataFrame with Product Column:
   Col1  Col2  Col3  Product
0     6     9     6       54
1     1     1     2        1
2     8     7     3       56
3     5     6     3       30
4     5     3     5       15


In [13]:
np.random.seed(2)
df2 = pd.DataFrame(np.random.randint(1, 20, size=(4, 3)),
                   columns=['A', 'B', 'C'])

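col_sum = df2.sum(axis=0)  # axis=0: sum down the rows, one total per column
row_sum = df2.sum(axis=1)  # axis=1: sum across the columns, one total per row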

print("Original DataFrame:")
print(df2)
print("\nColumn-wise Sums:\n", col_sum)
print("\nRow-wise Sums:\n", row_sum)

Original DataFrame:
    A   B   C
0   9  16  14
1   9  12  19
2  12   9   8
3   3  18  12

Column-wise Sums:
 A    33
B    55
C    53
dtype: int64

Row-wise Sums:
 0    39
1    40
2    29
3    33
dtype: int64
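
A quick sanity check for axis-wise sums is that both must add up to the same grand total. The sketch below assumes the values from the printed table (the name `demo` is illustrative) and verifies this.

```python
import pandas as pd

# Same values as the printed table above
demo = pd.DataFrame({
    'A': [9, 9, 12, 3],
    'B': [16, 12, 9, 18],
    'C': [14, 19, 8, 12],
})

col_totals = demo.sum(axis=0)   # one value per column
row_totals = demo.sum(axis=1)   # one value per row

# Both reductions cover every cell exactly once, so their totals must agree
print(col_totals.sum(), row_totals.sum())
```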


### Assignment 3: Data Cleaning

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Introduce some NaN values. Fill the NaN values with the mean of the respective columns.
2. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Introduce some NaN values. Drop the rows with any NaN values.

In [18]:
df1 = pd.DataFrame(np.random.randint(1, 10, size=(5, 3)).astype(float),
                   columns=['A', 'B', 'C'])

df1.iloc[0, 0] = np.nan
df1.iloc[2, 1] = np.nan
df1.iloc[4, 2] = np.nan

print("Original DataFrame with NaNs:\n", df1)

df1_filled = df1.fillna(df1.mean())
print("\nFilled DataFrame (NaN replaced by Column Mean):\n", df1_filled)

Original DataFrame with NaNs:
      A    B    C
0  NaN  9.0  4.0
1  1.0  1.0  6.0
2  8.0  NaN  1.0
3  9.0  7.0  6.0
4  2.0  8.0  NaN

Filled DataFrame (NaN replaced by Column Mean):
      A     B     C
0  5.0  9.00  4.00
1  1.0  1.00  6.00
2  8.0  6.25  1.00
3  9.0  7.00  6.00
4  2.0  8.00  4.25
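
`fillna` accepts a Series and aligns it with the DataFrame by column label, which is why passing `df1.mean()` fills each column with its own mean. A minimal sketch with fixed values (two of the columns from the output above; `demo` and `col_means` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'A': [np.nan, 1.0, 8.0, 9.0, 2.0],
    'B': [9.0, 1.0, np.nan, 7.0, 8.0],
})

# mean() skips NaN by default, so each mean is taken over the 4 observed values
col_means = demo.mean()

# The Series index ('A', 'B') is matched against the column labels
filled = demo.fillna(col_means)
print(filled)
```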


In [20]:
df2 = pd.DataFrame(np.random.randint(1, 10, size=(6, 4)).astype(float),
                   columns=['W', 'X', 'Y', 'Z'])

df2.iloc[1, 1] = np.nan
df2.iloc[3, 0] = np.nan
df2.iloc[3, 3] = np.nan

print("Original DataFrame (6 rows):\n", df2)

df2_dropped = df2.dropna()
print("\nDataFrame after dropna() (Complete rows only):\n", df2_dropped)

Original DataFrame (6 rows):
      W    X    Y    Z
0  4.0  4.0  2.0  9.0
1  7.0  NaN  2.0  6.0
2  8.0  1.0  2.0  6.0
3  NaN  1.0  1.0  NaN
4  7.0  4.0  2.0  9.0
5  6.0  6.0  5.0  3.0

DataFrame after dropna() (Complete rows only):
      W    X    Y    Z
0  4.0  4.0  2.0  9.0
2  8.0  1.0  2.0  6.0
4  7.0  4.0  2.0  9.0
5  6.0  6.0  5.0  3.0
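
By default `dropna()` removes a row if any value is missing. The sketch below, on a small illustrative frame, shows two common variations: `how='all'` drops only rows that are entirely missing, and `subset` restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'W': [4.0, 7.0, np.nan, np.nan],
    'X': [4.0, np.nan, 1.0, np.nan],
    'Y': [2.0, 2.0, 1.0, np.nan],
})

any_missing = demo.dropna()                # default how='any': only row 0 is complete
all_missing = demo.dropna(how='all')       # only row 3 is entirely NaN
w_required = demo.dropna(subset=['W'])     # keep rows where 'W' is present

print(list(any_missing.index), list(all_missing.index), list(w_required.index))
```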


### Assignment 4: Data Aggregation

1. Create a Pandas DataFrame with 2 columns: 'Category' and 'Value'. Fill the 'Category' column with random categories ('A', 'B', 'C') and the 'Value' column with random integers. Group the DataFrame by 'Category' and compute the sum and mean of 'Value' for each category.
2. Create a Pandas DataFrame with 3 columns: 'Product', 'Category', and 'Sales'. Fill the DataFrame with random data. Group the DataFrame by 'Category' and compute the total sales for each category.


In [25]:
data1 = {
    'Category': np.random.choice(['A', 'B', 'C'], size=10),
    'Value': np.random.randint(1, 100, size=10)
}

df1 = pd.DataFrame(data1)
print("Original DataFrame:\n", df1)

grouped_stats = df1.groupby('Category')['Value'].agg(['sum', 'mean'])
print("\nGrouped Stats (Sum & Mean):\n", grouped_stats)

Original DataFrame:
   Category  Value
0        A     34
1        B      3
2        B     60
3        C     58
4        A     39
5        C     97
6        C     86
7        C     77
8        C     20
9        C     54

Grouped Stats (Sum & Mean):
           sum       mean
Category                
A          73  36.500000
B          63  31.500000
C         392  65.333333
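
When the result columns need more descriptive names than `sum` and `mean`, named aggregation is one option. This is a sketch on a small fixed sample; the output column names `total`, `average`, and `n` are arbitrary choices.

```python
import pandas as pd

demo = pd.DataFrame({
    'Category': ['A', 'B', 'B', 'C', 'A'],
    'Value': [34, 3, 60, 58, 39],
})

# Each keyword becomes an output column: name=(input column, aggregation)
stats = demo.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
    n=('Value', 'count'),
)
print(stats)
```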


In [27]:
data2 = {
    'Product': ['Apples', 'Oranges', 'Bananas', 'Milk', 'Cheese', 'Bread', 'Yogurt'],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Dairy', 'Dairy', 'Bakery', 'Dairy'],
    'Sales': np.random.randint(100, 500, size=7)
}

df2 = pd.DataFrame(data2)
print("Original DataFrame:\n", df2)

Original DataFrame:
    Product Category  Sales
0   Apples    Fruit    416
1  Oranges    Fruit    244
2  Bananas    Fruit    360
3     Milk    Dairy    403
4   Cheese    Dairy    432
5    Bread   Bakery    474
6   Yogurt    Dairy    289


In [31]:
total_sales = df2.groupby('Category')['Sales'].sum().reset_index()
print("\nTotal Sales by Category:\n", total_sales)


Total Sales by Category:
   Category  Sales
0   Bakery    474
1    Dairy   1124
2    Fruit   1020


### Assignment 5: Merging DataFrames

1. Create two Pandas DataFrames with a common column. Merge the DataFrames using the common column.
2. Create two Pandas DataFrames with different columns. Concatenate the DataFrames along the rows and along the columns.

In [36]:
df_employees = pd.DataFrame({
    'ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df_departments = pd.DataFrame({
    'ID': [101, 102, 103, 105],
    'Dept': ['HR', 'IT', 'IT', 'Finance']
})

merged_df = pd.merge(df_employees, df_departments, on='ID', how='outer')

print("Employee Table:\n", df_employees)
print("\nDepartment Table:\n", df_departments)
print("\nMerged Result (Outer Join):\n", merged_df)

Employee Table:
     ID     Name
0  101    Alice
1  102      Bob
2  103  Charlie
3  104    David

Department Table:
     ID     Dept
0  101       HR
1  102       IT
2  103       IT
3  105  Finance

Merged Result (Outer Join):
     ID     Name     Dept
0  101    Alice       HR
1  102      Bob       IT
2  103  Charlie       IT
3  104    David      NaN
4  105      NaN  Finance
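
The `how` argument decides which keys survive a merge. The sketch below reuses the same two tables under illustrative names and contrasts `how='inner'` (only IDs present in both) with `how='outer'` (all IDs); `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({
    'ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
})
departments = pd.DataFrame({
    'ID': [101, 102, 103, 105],
    'Dept': ['HR', 'IT', 'IT', 'Finance'],
})

# Inner join: keeps only IDs found in both tables (101, 102, 103)
inner = pd.merge(employees, departments, on='ID', how='inner')

# Outer join: keeps every ID; '_merge' is 'both', 'left_only', or 'right_only'
outer = pd.merge(employees, departments, on='ID', how='outer', indicator=True)

print(inner)
print(outer)
```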


In [35]:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})

row_concat = pd.concat([df1, df2], axis=0, sort=False)
col_concat = pd.concat([df1, df2], axis=1)

print("Row-wise Concatenation:\n", row_concat)
print("\nColumn-wise Concatenation:\n", col_concat)

Row-wise Concatenation:
      A    B    C    D
0  1.0  3.0  NaN  NaN
1  2.0  4.0  NaN  NaN
0  NaN  NaN  5.0  7.0
1  NaN  NaN  6.0  8.0

Column-wise Concatenation:
    A  B  C  D
0  1  3  5  7
1  2  4  6  8
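
In the row-wise result above, the index labels 0 and 1 repeat and the integers become floats because the missing cells are filled with NaN. When the frames share the same columns, neither effect appears, and `ignore_index=True` gives a fresh index. A minimal sketch with illustrative frames:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Same columns in both frames: rows are stacked with no NaN padding,
# and ignore_index=True renumbers the rows 0..3
stacked = pd.concat([left, right], axis=0, ignore_index=True)
print(stacked)
```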


### Assignment 6: Time Series Analysis

1. Create a Pandas DataFrame with a datetime index and one column filled with random integers. Resample the DataFrame to compute the monthly mean of the values.
2. Create a Pandas DataFrame with a datetime index ranging from '2021-01-01' to '2021-12-31' and one column filled with random integers. Compute the rolling mean with a window of 7 days.

In [38]:
date_idx = pd.date_range(start='2023-01-01', periods=100, freq='D')

df1 = pd.DataFrame(np.random.randint(1, 100, size=(100, 1)),
                   index=date_idx, columns=['Value'])

# Resample to month-end frequency and compute the mean of each month
# Note: pandas 2.2+ uses the alias 'ME'; earlier versions use 'M'
monthly_mean = df1.resample('ME').mean()

print("First 5 rows of daily data:\n", df1.head())
print("\nMonthly Mean Values:\n", monthly_mean)

First 5 rows of daily data:
             Value
2023-01-01     52
2023-01-02     52
2023-01-03     24
2023-01-04     48
2023-01-05     64

Monthly Mean Values:
                 Value
2023-01-31  55.645161
2023-02-28  48.607143
2023-03-31  62.709677
2023-04-30  48.500000
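
If the same code has to run on pandas versions with different month-end aliases, one possible alternative is to group by calendar month with `to_period('M')` instead of resampling. This is a sketch on synthetic data (10 for every January day, 20 for every February day), not the random data above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-01', periods=59, freq='D')   # Jan 1 through Feb 28
values = np.where(idx.month == 1, 10, 20)                  # constant per month
demo = pd.DataFrame({'Value': values}, index=idx)

# Group the daily rows by their calendar month (a PeriodIndex such as 2023-01)
monthly = demo.groupby(demo.index.to_period('M'))['Value'].mean()
print(monthly)
```

Unlike `resample`, the result is labelled by month period rather than by the month-end date.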


In [42]:
date_range = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')

df2 = pd.DataFrame(np.random.randint(10, 50, size=(len(date_range), 1)),
                   index=date_range, columns=['Daily_Sales'])

df2['7_Day_Moving_Avg'] = df2['Daily_Sales'].rolling(window=7).mean()
print("Sales Data with Rolling Mean:")
print(df2.iloc[5:15])  # Rows at positions 5-14, showing where the NaNs end and the rolling mean begins

Sales Data with Rolling Mean:
            Daily_Sales  7_Day_Moving_Avg
2021-01-06           12               NaN
2021-01-07           47         23.714286
2021-01-08           37         24.714286
2021-01-09           26         23.285714
2021-01-10           21         24.714286
2021-01-11           48         29.428571
2021-01-12           31         31.714286
2021-01-13           12         31.714286
2021-01-14           12         26.714286
2021-01-15           37         26.714286
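
The leading NaN values appear because a 7-row window needs 7 observations before it yields a result. The `min_periods` argument relaxes that requirement. A minimal sketch on a simple 1 to 10 series (illustrative data):

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=10, freq='D')
sales = pd.Series(range(1, 11), index=idx, dtype='float64')

# Default: the first 6 results are NaN because the window is not yet full
strict = sales.rolling(window=7).mean()

# min_periods=1: average whatever is available so far
partial = sales.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({'strict': strict, 'partial': partial}))
```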


### Assignment 7: MultiIndex DataFrame

1. Create a Pandas DataFrame with a MultiIndex (hierarchical index). Perform some basic indexing and slicing operations on the MultiIndex DataFrame.
2. Create a Pandas DataFrame with MultiIndex consisting of 'Category' and 'SubCategory'. Fill the DataFrame with random data and compute the sum of values for each 'Category' and 'SubCategory'.

In [47]:
index_tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2), ('C', 1)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['Letter', 'Number'])

df1 = pd.DataFrame(np.random.randint(1, 10, size=(5, 2)),
                   index=index, columns=['Val1', 'Val2'])

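# All rows under the outer label 'A' (the 'Letter' level is dropped from the result)
slice_A = df1.loc['A']

# A single row selected by its full (Letter, Number) key
specific_row = df1.loc[('B', 2)]
# Slice across levels (e.g., all rows where Number is 1)
# Note: pd.IndexSlice makes this much easier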
idx = pd.IndexSlice
slice_level2 = df1.loc[idx[:, 1], :]

print("Full MultiIndex DataFrame:\n", df1)
print("\nSlice for Letter 'A':\n", slice_A)
print("\nRows where Number is 1:\n", slice_level2)

Full MultiIndex DataFrame:
                Val1  Val2
Letter Number            
A      1          7     7
       2          1     6
B      1          8     8
       2          7     6
C      1          5     2

Slice for Letter 'A':
         Val1  Val2
Number            
1          7     7
2          1     6

Rows where Number is 1:
                Val1  Val2
Letter Number            
A      1          7     7
B      1          8     8
C      1          5     2
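
Another way to select on an inner level is `DataFrame.xs`, which takes a cross-section and drops the selected level from the result. The sketch below rebuilds the printed table with fixed values (`demo` is an illustrative name) and also shows what the full-key lookup returns.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2), ('C', 1)],
    names=['Letter', 'Number'],
)
demo = pd.DataFrame(
    {'Val1': [7, 1, 8, 7, 5], 'Val2': [7, 6, 8, 6, 2]},
    index=index,
)

# Cross-section: all rows where Number == 1; the 'Number' level is dropped
number_one = demo.xs(1, level='Number')

# Full key lookup: returns a single row as a Series
one_row = demo.loc[('B', 2)]

print(number_one)
print(one_row)
```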


In [49]:
categories = ['Electronics', 'Furniture']
subcategories = ['New', 'Used']
index2 = pd.MultiIndex.from_product([categories, subcategories],
                                   names=['Category', 'SubCategory'])

df2 = pd.DataFrame(np.random.randint(100, 1000, size=(4, 1)),
                   index=index2, columns=['Sales'])

category_sum = df2.groupby(level='Category').sum()

subcategory_sum = df2.groupby(level='SubCategory').sum()

print("Original Data:\n", df2)
print("\nTotal by Category:\n", category_sum)
print("\nTotal by SubCategory:\n", subcategory_sum)

Original Data:
                          Sales
Category    SubCategory       
Electronics New            537
            Used           445
Furniture   New            712
            Used           668

Total by Category:
              Sales
Category          
Electronics    982
Furniture     1380

Total by SubCategory:
              Sales
SubCategory       
New           1249
Used          1113
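
When the same ('Category', 'SubCategory') pair occurs more than once, passing both level names to `groupby` gives one total per pair. A minimal sketch with illustrative values, where one pair is repeated:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Electronics', 'New'), ('Electronics', 'New'),
     ('Electronics', 'Used'), ('Furniture', 'New')],
    names=['Category', 'SubCategory'],
)
demo = pd.DataFrame({'Sales': [100, 50, 70, 30]}, index=index)

# Group on both index levels: one row per (Category, SubCategory) pair
pair_totals = demo.groupby(level=['Category', 'SubCategory']).sum()
print(pair_totals)
```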


### Assignment 8: Pivot Tables

1. Create a Pandas DataFrame with columns 'Date', 'Category', and 'Value'. Create a pivot table to compute the sum of 'Value' for each 'Category' by 'Date'.
2. Create a Pandas DataFrame with columns 'Year', 'Quarter', and 'Revenue'. Create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'.


In [50]:
data1 = {
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01']),
    'Category': ['Electronics', 'Furniture', 'Electronics', 'Furniture', 'Electronics'],
    'Value': [100, 200, 150, 300, 50]
}
df1 = pd.DataFrame(data1)

pivot1 = df1.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
print("Original Data:\n", df1)
print("\nPivot Table (Sum of Value):\n", pivot1)

Original Data:
         Date     Category  Value
0 2023-01-01  Electronics    100
1 2023-01-01    Furniture    200
2 2023-01-02  Electronics    150
3 2023-01-02    Furniture    300
4 2023-01-01  Electronics     50

Pivot Table (Sum of Value):
 Category    Electronics  Furniture
Date                              
2023-01-01          150        200
2023-01-02          150        300


In [52]:
data2 = {
    'Year': [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
    'Revenue': [5000, 5500, 6000, 7000, 5200, 5800, 6100, 7500]
}
df2 = pd.DataFrame(data2)

pivot2 = df2.pivot_table(index='Year', columns='Quarter', values='Revenue', aggfunc='mean')
print("Financial Data:\n", df2)
print("\nPivot Table (Mean Revenue):\n", pivot2)

Financial Data:
    Year Quarter  Revenue
0  2021      Q1     5000
1  2021      Q2     5500
2  2021      Q3     6000
3  2021      Q4     7000
4  2022      Q1     5200
5  2022      Q2     5800
6  2022      Q3     6100
7  2022      Q4     7500

Pivot Table (Mean Revenue):
 Quarter      Q1      Q2      Q3      Q4
Year                                   
2021     5000.0  5500.0  6000.0  7000.0
2022     5200.0  5800.0  6100.0  7500.0
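
Both pivot tables above are complete, but real data often lacks some index/column combinations, which shows up as NaN. The `fill_value` argument replaces those gaps. A sketch with an illustrative frame in which Furniture has no sale on the second date:

```python
import pandas as pd

demo = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'Category': ['Electronics', 'Furniture', 'Electronics'],
    'Value': [100, 200, 150],
})

# Without fill_value the missing (2023-01-02, Furniture) cell would be NaN
pivot = demo.pivot_table(index='Date', columns='Category',
                         values='Value', aggfunc='sum', fill_value=0)
print(pivot)
```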


### Assignment 9: Applying Functions

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Apply a function that doubles the values of the DataFrame.
2. Create a Pandas DataFrame with 3 columns and 6 rows filled with random integers. Apply a lambda function to create a new column that is the sum of the existing columns.


In [53]:
np.random.seed(42)
df1 = pd.DataFrame(np.random.randint(1, 10, size=(5, 3)), columns=['A', 'B', 'C'])

def double_val(x):
    return x * 2

df_doubled = df1.apply(double_val)

print("Original DataFrame:\n", df1)
print("\nDoubled DataFrame:\n", df_doubled)

Original DataFrame:
    A  B  C
0  7  4  8
1  5  7  3
2  7  8  5
3  4  8  8
4  3  6  5

Doubled DataFrame:
     A   B   C
0  14   8  16
1  10  14   6
2  14  16  10
3   8  16  16
4   6  12  10
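
For simple arithmetic such as doubling, `apply` is not strictly needed, because DataFrames support element-wise operators directly. The sketch below (illustrative data) checks that both approaches give identical results.

```python
import pandas as pd

demo = pd.DataFrame({'A': [7, 5, 7], 'B': [4, 7, 8]})

via_apply = demo.apply(lambda col: col * 2)   # function called once per column
via_operator = demo * 2                       # vectorized, no Python-level function call

print(via_apply.equals(via_operator))
```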


In [54]:
df2 = pd.DataFrame(np.random.randint(1, 20, size=(6, 3)), columns=['X', 'Y', 'Z'])

df2['Total_Sum'] = df2.apply(lambda row: row['X'] + row['Y'] + row['Z'], axis=1)

print("DataFrame with Lambda Sum Column:")
print(df2)

DataFrame with Lambda Sum Column:
    X   Y   Z  Total_Sum
0   2  12   6         20
1   2   1  12         15
2  12  17  10         39
3  16  15  15         46
4  19  12   3         34
5   5  19   7         31
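
The assignment asks for a lambda, but the same column can also be produced with `sum(axis=1)`, which avoids calling a Python function for every row. A minimal comparison using the first three rows of the output above:

```python
import pandas as pd

demo = pd.DataFrame({'X': [2, 2, 12], 'Y': [12, 1, 17], 'Z': [6, 12, 10]})

# Row-wise lambda, as in the assignment
via_lambda = demo.apply(lambda row: row['X'] + row['Y'] + row['Z'], axis=1)

# Vectorized equivalent: add up the selected columns across each row
via_sum = demo[['X', 'Y', 'Z']].sum(axis=1)

print(via_lambda.tolist(), via_sum.tolist())
```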


### Assignment 10: Working with Text Data

1. Create a Pandas Series with 5 random text strings. Convert all the strings to uppercase.
2. Create a Pandas Series with 5 random text strings. Extract the first three characters of each string.


In [55]:
text_series = pd.Series(['apple', 'banana', 'cherry', 'date', 'elderberry'])

upper_series = text_series.str.upper()

print("Original Series:")
print(text_series)
print("\nUppercase Series:")
print(upper_series)

Original Series:
0         apple
1        banana
2        cherry
3          date
4    elderberry
dtype: object

Uppercase Series:
0         APPLE
1        BANANA
2        CHERRY
3          DATE
4    ELDERBERRY
dtype: object


In [56]:
names_series = pd.Series(['Python', 'Pandas', 'NumPy', 'Matplotlib', 'Seaborn'])
short_names = names_series.str[:3]

print("\nOriginal Names:")
print(names_series)
print("\nFirst Three Characters:")
print(short_names)


Original Names:
0        Python
1        Pandas
2         NumPy
3    Matplotlib
4       Seaborn
dtype: object

First Three Characters:
0    Pyt
1    Pan
2    Num
3    Mat
4    Sea
dtype: object
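
The `.str` accessor also copes with missing entries: they stay missing instead of raising an error. A small sketch with an illustrative Series containing one `None`:

```python
import pandas as pd

names = pd.Series(['Python', None, 'NumPy'])

prefixes = names.str[:3]      # slicing is applied per element
upper = names.str.upper()     # missing entries are passed through as missing

print(prefixes)
print(upper)
```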


### The Scenario: "Global Gadget Sales"
You have been handed a messy dataset of sales transactions. Your goal is to:

1. Clean missing values.
2. Analyze performance by category.
3. Identify trends over time.

In [66]:
np.random.seed(42)

dates = pd.date_range(start='2025-01-01', periods=20, freq='D')
categories = ['Electronics', 'Home', 'Office', 'Garden']

data = {
    'Date': np.random.choice(dates, size=50),
    'Category': np.random.choice(categories, size=50),
    'Revenue': np.random.uniform(50, 500, size=50),
    'Units': np.random.randint(1, 10, size=50)
}

df = pd.DataFrame(data)

# Introduce missing values: set Revenue (column position 2) to NaN in every 10th row
df.iloc[::10, 2] = np.nan

# --- STEP 1: CLEANING ---
# Fill missing Revenue with the median of its Category
df['Revenue'] = df.groupby('Category')['Revenue'].transform(lambda x: x.fillna(x.median()))

# --- STEP 2: FEATURE ENGINEERING ---
# Create a 'Price_Per_Unit' column
df['Price_Per_Unit'] = df['Revenue'] / df['Units']

# --- STEP 3: AGGREGATION ---
# Which category generated the most total revenue?
category_report = df.groupby('Category')[['Revenue', 'Units']].sum().sort_values(by='Revenue', ascending=False)

# --- STEP 4: TIME TRENDS ---
# Set Date as index and find the 3-day rolling average of total revenue
df.set_index('Date', inplace=True)
daily_revenue = df['Revenue'].resample('D').sum()
moving_avg = daily_revenue.rolling(window=3).mean()


print("Category Performance Report:")
print(category_report)
print("\nFirst 5 days of Rolling 3-Day Revenue:")
print(moving_avg.head())

Category Performance Report:
                 Revenue  Units
Category                       
Home         4538.294886     76
Electronics  3955.747532     56
Office       2448.928075     38
Garden       2348.788266     65

First 5 days of Rolling 3-Day Revenue:
Date
2025-01-01           NaN
2025-01-02           NaN
2025-01-03    647.184481
2025-01-04    858.476067
2025-01-05    696.223676
Freq: D, Name: Revenue, dtype: float64
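
The group-wise median fill in Step 1 can be checked by hand on a tiny example. The sketch below uses illustrative values and a slightly different formulation: it first computes each row's category median with `transform('median')` and then passes that Series to `fillna`. It is an alternative to the lambda used above, not a replacement for it.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Category': ['Home', 'Home', 'Home', 'Office', 'Office'],
    'Revenue': [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Median of each category (NaN skipped), broadcast back to every row
category_median = demo.groupby('Category')['Revenue'].transform('median')

# Only the missing entries are replaced; observed values are untouched
demo['Revenue'] = demo['Revenue'].fillna(category_median)
print(demo)
```

Home has observed values 100 and 300, so its gap is filled with 200; Office has a single observed value, 50, which becomes its fill value.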
