# Pandas and NumPy Exercises Notebook

This notebook contains comprehensive exercises for both the Pandas and NumPy libraries.
Complete each exercise by writing your code in the provided cells.

## Setup

First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Enable inline plotting for Jupyter
%matplotlib inline

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

# NumPy Exercises

## Exercise 1: Array Creation and Basic Operations

### Task: Create various types of NumPy arrays and perform basic operations

In [None]:
# Create a 1D array from a list
# Your code here

# Create a 2D array (matrix) from nested lists
# Your code here

# Create an array of zeros with shape (3, 4)
# Your code here

# Create an array of ones with shape (2, 3)
# Your code here

# Create an identity matrix of size 3x3
# Your code here

# Create an array with values from 0 to 9
# Your code here

# Create an array with 10 evenly spaced values between 0 and 1
# Your code here

## Exercise 2: Array Indexing and Slicing

### Task: Practice accessing and modifying array elements

In [None]:
# Create a 2D array for indexing practice
arr_2d = np.array([[1, 2, 3, 4], 
                   [5, 6, 7, 8], 
                   [9, 10, 11, 12]])

print("Original array:")
print(arr_2d)

# Access the element at row index 1, column index 2 (indices are zero-based)
# Your code here

# Access the entire second row
# Your code here

# Access the entire third column
# Your code here

# Access elements from rows 0-1 and columns 1-2
# Your code here

# Access every other element from the first row
# Your code here

# Reverse the order of rows
# Your code here

## Exercise 3: Array Operations and Broadcasting

### Task: Perform mathematical operations on arrays

In [None]:
# Create two arrays for operations
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print("Array a:", a)
print("Array b:", b)

# Add the two arrays element-wise
# Your code here

# Multiply arrays element-wise
# Your code here

# Calculate the dot product
# Your code here

# Square each element in array a
# Your code here

# Calculate the mean of array a
# Your code here

# Find the maximum value in array b
# Your code here

# Broadcasting: Add a scalar to an array
# Your code here
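
Adding a scalar is only the simplest case of broadcasting. As a minimal sketch on separate toy data (the names `matrix`, `row_offsets`, and `col_scale` are illustrative and not part of the exercise), NumPy can also stretch a smaller array across the rows or columns of a larger one when the shapes are compatible:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # shape (2, 3)
row_offsets = np.array([10, 20, 30])    # shape (3,)
col_scale = np.array([[2], [3]])        # shape (2, 1)

# (2, 3) + (3,): the 1D array is added to every row
shifted = matrix + row_offsets

# (2, 3) * (2, 1): each row is multiplied by its own factor
scaled = matrix * col_scale

print(shifted)
print(scaled)
```

Shapes are compared from the trailing dimension backwards; a dimension is compatible when the sizes match or one of them is 1.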

## Exercise 4: Array Reshaping and Transposing

### Task: Change the shape and orientation of arrays

In [None]:
# Create a 1D array with 12 elements
arr_1d = np.arange(12)
print("Original 1D array:", arr_1d)

# Reshape to 3x4 matrix
# Your code here

# Reshape to 2x6 matrix
# Your code here

# Transpose the 3x4 matrix
# Your code here

# Flatten the matrix back to 1D
# Your code here

# Create a 3D array and flatten it
arr_3d = np.random.rand(2, 3, 4)
print("3D array shape:", arr_3d.shape)
# Your code here
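
Two details are easy to miss when reshaping: `-1` lets NumPy infer one dimension, and `flatten()` and `ravel()` differ in whether they copy. The sketch below uses a separate illustrative array (`values` and the other names are not part of the exercise):

```python
import numpy as np

values = np.arange(24)              # illustrative data

# -1 lets NumPy infer that dimension from the total size
cube = values.reshape(2, 3, -1)     # shape (2, 3, 4)
wide = values.reshape(-1, 6)        # shape (4, 6)

# flatten() always returns a copy
flat_copy = cube.flatten()
flat_copy[0] = 99                   # does not touch `values`

# ravel() returns a view when it can, so writes may propagate
flat_view = cube.ravel()

print(cube.shape, wide.shape)
print(values[0], np.shares_memory(flat_view, values))
```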

## Exercise 5: Boolean Indexing and Filtering

### Task: Use boolean conditions to filter arrays

In [None]:
# Create an array for boolean indexing
data = np.array([10, 25, 3, 45, 7, 18, 92, 31])
print("Original array:", data)

# Find elements greater than 20
# Your code here

# Find elements that are even
# Your code here

# Find elements between 10 and 50 (inclusive)
# Your code here

# Replace all values greater than 30 with -1
# Your code here

# Count how many elements are greater than the mean
# Your code here
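
The general pattern behind these tasks is that a comparison produces a boolean mask, which can then index, count, or drive a replacement. Here is a minimal sketch on different data (`scores` and the thresholds are illustrative assumptions):

```python
import numpy as np

scores = np.array([55, 82, 91, 47, 68, 73])   # illustrative data

# A comparison yields a boolean mask of the same shape
passed_mask = scores >= 60

# Combine conditions with & and |, wrapping each in parentheses
mid_range = scores[(scores >= 60) & (scores < 80)]

# np.where builds a new array without modifying the original
capped = np.where(scores > 80, 80, scores)

# Summing a boolean mask counts the True entries
n_passed = passed_mask.sum()

print(mid_range, capped, n_passed)
```

Assigning through a mask (`scores[scores > 80] = 80`) would modify the array in place instead; `np.where` is the non-destructive alternative.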

## Exercise 6: Random Number Generation

### Task: Generate random numbers with different distributions

In [None]:
# Set seed for reproducibility
np.random.seed(42)

# Generate 10 random numbers between 0 and 1
# Your code here

# Generate 5 random integers between 1 and 100
# Your code here

# Generate random numbers from a normal distribution (mean=0, std=1)
# Your code here

# Generate random numbers from a normal distribution (mean=5, std=2)
# Your code here

# Shuffle an array
arr_to_shuffle = np.arange(10)
print("Original array:", arr_to_shuffle)
# Your code here

# Choose 3 random elements from an array without replacement
population = np.arange(1, 21)
# Your code here
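
The cell above seeds the legacy global generator with `np.random.seed`. NumPy also provides a `Generator` API, created with `np.random.default_rng`, which keeps the random state in an object rather than globally. The sketch below shows roughly equivalent calls; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator object

uniform = rng.random(10)                 # floats in [0, 1)
ints = rng.integers(1, 101, size=5)      # integers from 1 to 100 (upper bound is exclusive)
normal = rng.normal(loc=5, scale=2, size=1000)

deck = np.arange(10)
rng.shuffle(deck)                        # shuffles in place

# Sampling without replacement never repeats an element
sample = rng.choice(np.arange(1, 21), size=3, replace=False)

print(uniform.shape, ints, sample)
```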

# Pandas Exercises

## Exercise 7: Creating and Inspecting DataFrames

### Task: Create a DataFrame from the given dictionary and inspect its properties

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 55000, 65000, 58000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}

# Create a DataFrame
# Your code here

# Display the first few rows
# Your code here

# Display the last few rows
# Your code here

# Display basic information about the DataFrame
# Your code here

# Display summary statistics
# Your code here

# Check the shape of the DataFrame
# Your code here

# Check column data types
# Your code here

## Exercise 8: Filtering and Selecting Data

### Task: Filter the DataFrame based on various conditions

In [None]:
# Assuming df is already created from Exercise 7

# Filter employees older than 30
# Your code here

# Filter employees in IT department
# Your code here

# Filter employees with salary greater than 55000 and age less than 35
# Your code here

# Select only Name and Salary columns
# Your code here

# Select employees from New York or London
# Your code here

# Use loc to select specific rows and columns
# Your code here

## Exercise 9: Sorting and Grouping Data

### Task: Sort and group the DataFrame

In [None]:
# Sort by Age in ascending order
# Your code here

# Sort by Salary in descending order
# Your code here

# Sort by Department and then by Age
# Your code here

# Group by Department and calculate mean salary
# Your code here

# Group by Department and get count of employees
# Your code here

# Group by Department and get min/max/mean age
# Your code here

# Create a pivot table showing average salary by department and city
# Your code here

## Exercise 10: Handling Missing Data

### Task: Introduce and handle missing values in the DataFrame

In [None]:
# Create a copy of the DataFrame with missing values
df_missing = df.copy()

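# Introduce missing values
# Cast the integer columns to float first so that they can hold NaN
df_missing = df_missing.astype({'Age': float, 'Salary': float})
df_missing.loc[1, 'Age'] = np.nan
df_missing.loc[3, 'Salary'] = np.nan
df_missing.loc[4, 'Department'] = np.nan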

print("DataFrame with missing values:")
print(df_missing)

# Check for missing values
# Your code here

# Drop rows with any missing values
# Your code here

# Drop rows where specific columns have missing values
# Your code here

# Fill missing values with mean (for numeric columns)
# Your code here

# Fill missing values with mode (for categorical columns)
# Your code here

# Fill missing values with a specific value
# Your code here
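
As a rough illustration of the same workflow on a separate toy table (the `toy` DataFrame and its columns are made up for this sketch), the usual building blocks are `isna`, `dropna`, and `fillna`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Item': ['pen', 'book', 'lamp', 'desk'],
    'Price': [1.5, np.nan, 20.0, np.nan],
    'Colour': ['blue', 'red', None, 'red'],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Drop rows only where a chosen column is missing
priced_only = toy.dropna(subset=['Price'])

# Fill a numeric column with its mean and a text column with its mode
filled = toy.copy()
filled['Price'] = filled['Price'].fillna(filled['Price'].mean())
filled['Colour'] = filled['Colour'].fillna(filled['Colour'].mode()[0])

print(missing_counts)
print(filled)
```

`mode()` returns a Series because there can be ties, so `[0]` picks the first of the most frequent values.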

## Exercise 11: Data Transformation and Feature Engineering

### Task: Create new columns and transform existing data

In [None]:
# Create a new column for Age Category
# Your code here

# Create a new column for Salary Category
# Your code here

# Create a new column with name length
# Your code here

# Create a new column with city code (first 3 letters uppercase)
# Your code here

# Apply a custom function to create a bonus column (10% of salary)
# Your code here

# Use map to convert department names to codes
# Your code here

# Normalize the salary column (min-max scaling)
# Your code here
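
A minimal sketch of three of these techniques on a separate toy table follows. The `people` DataFrame, the bin edges, and the code mapping are all illustrative assumptions, not the expected answers:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cleo'],
    'Age': [22, 37, 51],
    'Team': ['Ops', 'Dev', 'Ops'],
    'Pay': [40000, 70000, 55000],
})

# Bin a numeric column into labelled categories (bin edges are arbitrary here)
people['Age_Group'] = pd.cut(people['Age'], bins=[0, 30, 45, 120],
                             labels=['Young', 'Mid', 'Senior'])

# Vectorized string accessor: length of each name
people['Name_Length'] = people['Name'].str.len()

# map() translates values through a dictionary
people['Team_Code'] = people['Team'].map({'Ops': 'O', 'Dev': 'D'})

# Min-max scaling to the range [0, 1]
pay = people['Pay']
people['Pay_Scaled'] = (pay - pay.min()) / (pay.max() - pay.min())

print(people)
```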

## Exercise 12: Merging and Joining DataFrames

### Task: Create additional DataFrames and merge them

In [None]:
# Create a second DataFrame with performance data
performance_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Performance_Score': [8.5, 7.2, 9.1, 8.8, 7.9],
    'Years_Experience': [3, 5, 8, 4, 6]
}

# Create the second DataFrame
# Your code here

# Merge the two DataFrames on Name (inner join)
# Your code here

# Left join
# Your code here

# Create a third DataFrame with department info
dept_data = {
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales'],
    'Budget': [100000, 200000, 150000, 120000, 180000],
    'Head_Count': [5, 10, 8, 6, 12]
}

# Create the third DataFrame
# Your code here

# Merge with department data
# Your code here

# Concatenate DataFrames vertically
# Your code here
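
The join type decides what happens to keys that appear in only one table, which matters here because `dept_data` lists a department with no employees. The sketch below shows the difference on two small made-up frames (`staff` and `teams` are illustrative):

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Ann', 'Ben', 'Cleo'],
                      'Team': ['Ops', 'Dev', 'Lab']})
teams = pd.DataFrame({'Team': ['Ops', 'Dev', 'Sales'],
                      'Budget': [100, 200, 300]})

# Inner join keeps only keys present in both frames
inner = staff.merge(teams, on='Team', how='inner')

# Left join keeps every row of `staff`; unmatched rows get NaN
left = staff.merge(teams, on='Team', how='left')

# indicator=True adds a `_merge` column showing where each row came from
outer = staff.merge(teams, on='Team', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```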

## Exercise 13: Time Series Data

### Task: Work with datetime data and time series operations

In [None]:
# Create a time series DataFrame
dates = pd.date_range('2023-01-01', periods=100, freq='D')
ts_data = {
    'Date': dates,
    'Value': np.random.randn(100).cumsum(),
    'Category': np.random.choice(['A', 'B', 'C'], 100)
}

# Create the time series DataFrame
# Your code here

# Set Date as index
# Your code here

# Extract year, month, day components
# Your code here

# Resample to monthly frequency and calculate mean
# Your code here

# Calculate rolling mean with window of 7 days
# Your code here

# Filter data for a specific date range
# Your code here

# Group by month and calculate statistics
# Your code here
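
Most of these operations rely on the dates being the index. As a minimal sketch on a deterministic toy series (the `series` frame and the `Reading` column are illustrative, chosen so the results are easy to check by hand):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2024-01-01', periods=60, freq='D')
series = pd.DataFrame({'Reading': np.arange(60, dtype=float)}, index=idx)

# Components are available directly on a DatetimeIndex
series['Month'] = series.index.month

# Monthly mean; 'MS' labels each bin by the month start
monthly = series['Reading'].resample('MS').mean()

# 7-day rolling mean; the first 6 entries are NaN
rolling = series['Reading'].rolling(window=7).mean()

# Label-based slicing on a DatetimeIndex includes both endpoints
january_week = series.loc['2024-01-08':'2024-01-14']

print(monthly)
print(rolling.head(8))
```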

## Exercise 14: Data Visualization with Pandas

### Task: Create various plots using pandas plotting capabilities

In [None]:
# Assuming we have the original df from Exercise 7

# Create a histogram of Age
# Your code here

# Create a bar plot of average salary by department
# Your code here

# Create a scatter plot of Age vs Salary
# Your code here

# Create a box plot of Salary by Department
# Your code here

# Create a line plot for time series data (from Exercise 13)
# Your code here

# Create a pie chart of department distribution
# Your code here

# Create a correlation heatmap
# Your code here

## Exercise 15: Advanced Data Manipulation

### Task: Practice advanced pandas operations

In [None]:
# Create a multi-index DataFrame
multi_data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'City': ['NYC', 'Boston', 'Miami', 'Atlanta', 'DC', 'Philadelphia', 'LA', 'Seattle'],
    'Sales_Q1': [100, 80, 90, 70, 85, 95, 110, 105],
    'Sales_Q2': [105, 85, 95, 75, 90, 100, 115, 110]
}

# Create DataFrame with multi-index
# Your code here

# Set multi-index
# Your code here

# Select data for a specific region
# Your code here

# Calculate total sales by region
# Your code here

# Melt the DataFrame from wide to long format
# Your code here

# Pivot the melted DataFrame back to wide format
# Your code here

# Handle duplicate data
duplicate_df = df.copy()
# Add duplicate rows
duplicate_df = pd.concat([duplicate_df, duplicate_df.iloc[:2]], ignore_index=True)

# Identify duplicates
# Your code here

# Remove duplicates
# Your code here
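
To illustrate how the reshaping and duplicate-handling calls fit together, here is a sketch on a smaller made-up table (`wide`, `Store`, and the month columns are illustrative names):

```python
import pandas as pd

wide = pd.DataFrame({
    'Store': ['A', 'B'],
    'Jan': [10, 20],
    'Feb': [15, 25],
})

# Wide to long: one row per (Store, Month) pair
long = wide.melt(id_vars='Store', var_name='Month', value_name='Units')

# A two-level index allows selection by the outer level
indexed = long.set_index(['Store', 'Month'])
store_a = indexed.loc['A']

# Long back to wide
back = long.pivot(index='Store', columns='Month', values='Units')

# Duplicates: flag repeated rows, then drop them
doubled = pd.concat([wide, wide.iloc[:1]], ignore_index=True)
dup_mask = doubled.duplicated()
deduped = doubled.drop_duplicates()

print(long)
print(back)
print(dup_mask.tolist())
```

By default `duplicated()` marks every occurrence after the first, so the original row is kept by `drop_duplicates()`.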

## Exercise 16: String Operations

### Task: Perform string operations on text data

In [None]:
# Create a DataFrame with text data
text_data = {
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince'],
    'Email': ['alice.johnson@email.com', 'bob.smith@company.org', 'charlie.brown@university.edu', 'diana.prince@domain.net'],
    'Description': ['Senior Developer', 'Project Manager', 'Data Scientist', 'UX Designer']
}

# Create the DataFrame
# Your code here

# Extract first names
# Your code here

# Extract last names
# Your code here

# Extract domain from email
# Your code here

# Convert names to uppercase
# Your code here

# Check if description contains 'Developer'
# Your code here

# Split email by '@' and get username
# Your code here

# Replace spaces with underscores in names
# Your code here
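
All of these tasks go through the `.str` accessor, which applies a string method to every element of a column. A short sketch on made-up contact data (`contacts` and its columns are illustrative):

```python
import pandas as pd

contacts = pd.DataFrame({
    'Full_Name': ['Ada Lovelace', 'Alan Turing'],
    'Email': ['ada@example.org', 'alan@example.com'],
})

# Split on whitespace and pick pieces with .str[index]
parts = contacts['Full_Name'].str.split()
contacts['First'] = parts.str[0]
contacts['Last'] = parts.str[-1]

# expand=True returns the pieces as separate columns
contacts[['User', 'Domain']] = contacts['Email'].str.split('@', expand=True)

# Boolean test and literal (non-regex) replacement
contacts['Is_Org'] = contacts['Email'].str.endswith('.org')
contacts['Slug'] = contacts['Full_Name'].str.lower().str.replace(' ', '_', regex=False)

print(contacts)
```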

## Exercise 17: Working with CSV and External Data

### Task: Read and write data to external files

In [None]:
# Save the DataFrame to CSV
# Your code here

# Read the CSV file back
# Your code here

# Save to Excel (if available)
# Your code here

# Read from Excel (if available)
# Your code here

# Save to JSON
# Your code here

# Read from JSON
# Your code here

# Read CSV with specific parameters (skip rows, select columns)
# Your code here
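
The read and write functions can be tried without touching the disk by using an in-memory buffer. The sketch below assumes a small made-up `table` and uses `io.StringIO` in place of a file path; with real files you would pass a filename instead:

```python
import io
import pandas as pd

table = pd.DataFrame({'Id': [1, 2, 3],
                      'Label': ['x', 'y', 'z'],
                      'Score': [0.5, 0.7, 0.9]})

# index=False avoids writing the row index as an extra column
csv_text = table.to_csv(index=False)

# Read everything back
round_trip = pd.read_csv(io.StringIO(csv_text))

# Read only selected columns and the first two data rows
partial = pd.read_csv(io.StringIO(csv_text), usecols=['Id', 'Score'], nrows=2)

# JSON round trip, one object per row
json_text = table.to_json(orient='records')
from_json = pd.read_json(io.StringIO(json_text))

print(round_trip.shape, partial.shape, from_json.shape)
```

Excel I/O (`to_excel`, `read_excel`) follows the same pattern but needs an extra engine package such as `openpyxl`, which is why the exercise marks it "if available".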

## Exercise 18: Performance Optimization

### Task: Optimize pandas operations for better performance

In [None]:
# Create a large DataFrame for performance testing
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.choice(['X', 'Y', 'Z'], 100000),
    'D': np.random.randint(1, 100, 100000)
})

print("Large DataFrame shape:", large_df.shape)

# Use vectorized operations instead of loops
# Bad approach (loop)
# Your code here (commented out)

# Good approach (vectorized)
# Your code here

# Use efficient data types
# Check memory usage
# Your code here

# Convert to more efficient data types where possible
# Your code here

# Use pandas' DataFrame.eval() (not Python's built-in eval) for complex expressions
# Your code here

# Use DataFrame.query() for filtering
# Your code here
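
For comparison, here is a sketch of the loop-versus-vectorized contrast and two of the other ideas on a smaller made-up frame (`frame` and its columns are illustrative, and no timings are claimed here; measure with `%timeit` on your own machine):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    'A': rng.standard_normal(10_000),
    'B': rng.standard_normal(10_000),
    'C': rng.choice(['X', 'Y', 'Z'], 10_000),
})

# Loop version: one Python-level addition per row
looped = [a + b for a, b in zip(frame['A'], frame['B'])]

# Vectorized version: a single array operation
vectorized = frame['A'] + frame['B']

# A low-cardinality text column usually shrinks as 'category'
before = frame['C'].memory_usage(deep=True)
after = frame['C'].astype('category').memory_usage(deep=True)

# query() expresses a filter as a string
subset = frame.query("A > 0 and C == 'X'")

print(before, after, len(subset))
```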

## Exercise 19: Integration of NumPy and Pandas

### Task: Demonstrate how NumPy and Pandas work together

In [None]:
# Create NumPy arrays
np_array = np.random.randn(100, 4)

# Convert NumPy array to DataFrame
# Your code here

# Access underlying NumPy array from DataFrame
# Your code here

# Apply NumPy functions to DataFrame columns
# Your code here

# Use NumPy for complex calculations on DataFrame data
# Your code here

# Vectorized operations combining both libraries
# Your code here

# Statistical operations using both libraries
# Your code here
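
As a minimal sketch of the round trip between the two libraries, the example below uses a small deterministic array so the results can be checked by hand. The names `raw` and `frame` and the derived columns are illustrative:

```python
import numpy as np
import pandas as pd

raw = np.arange(12, dtype=float).reshape(4, 3)

# Array to DataFrame with explicit column names
frame = pd.DataFrame(raw, columns=['x', 'y', 'z'])

# DataFrame back to an array
as_array = frame.to_numpy()

# NumPy ufuncs accept a Series and return a Series
frame['sqrt_x'] = np.sqrt(frame['x'])

# np.where gives a vectorized if/else over a column
frame['size'] = np.where(frame['y'] > 5, 'big', 'small')

# Row-wise Euclidean norm computed on the underlying array
frame['norm'] = np.linalg.norm(frame[['x', 'y', 'z']].to_numpy(), axis=1)

print(frame)
```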

# Challenge Exercises

## Challenge 1: Movie Data Analysis

### Task: Analyze a movie dataset with various operations

In [None]:
# Create a sample movie dataset
movie_data = {
    'Title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', 'Pulp Fiction', 'Forrest Gump'],
    'Year': [1994, 1972, 2008, 1994, 1994],
    'Genre': ['Drama', 'Crime', 'Action', 'Crime', 'Drama'],
    'Rating': [9.3, 9.2, 9.0, 8.9, 8.8],
    'Duration': [142, 175, 152, 154, 142],
    'Director': ['Frank Darabont', 'Francis Ford Coppola', 'Christopher Nolan', 'Quentin Tarantino', 'Robert Zemeckis']
}

# Create the DataFrame
# Your code here

# Find the highest rated movie
# Your code here

# Calculate average rating by genre
# Your code here

# Find movies from 1994
# Your code here

# Create a new column for 'Century' based on year
# Your code here

# Sort by rating in descending order
# Your code here

# Find the director with the most movies in the dataset (each director appears once here, so expect a tie)
# Your code here

## Challenge 2: Stock Market Analysis

### Task: Analyze simulated stock market data

In [None]:
# Create simulated stock data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=252, freq='B')  # Business days
stock_data = {
    'Date': dates,
    'AAPL': 150 + np.random.randn(252).cumsum(),
    'GOOGL': 2500 + np.random.randn(252).cumsum(),
    'MSFT': 300 + np.random.randn(252).cumsum(),
    'AMZN': 3000 + np.random.randn(252).cumsum()
}

# Create the DataFrame
# Your code here

# Set Date as index
# Your code here

# Calculate daily returns
# Your code here

# Calculate cumulative returns
# Your code here

# Find the best performing stock
# Your code here

# Calculate volatility (standard deviation of returns)
# Your code here

# Plot the stock prices
# Your code here

# Calculate 30-day moving average
# Your code here
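
The return calculations can be checked by hand on a tiny price table. The sketch below uses two made-up tickers (`AAA` and `BBB`) and a 2-day window in place of 30 days, purely to keep the example short:

```python
import pandas as pd

prices = pd.DataFrame(
    {'AAA': [100.0, 110.0, 99.0, 108.9],
     'BBB': [50.0, 50.0, 55.0, 60.5]},
    index=pd.date_range('2024-01-01', periods=4, freq='B'),
)

# Daily simple returns; the first row is NaN by construction
returns = prices.pct_change()

# Cumulative return from compounding the daily returns
cumulative = (1 + returns).cumprod() - 1

# Total return over the whole period and the best ticker
total = prices.iloc[-1] / prices.iloc[0] - 1
best = total.idxmax()

# Volatility as the standard deviation of daily returns
volatility = returns.std()

# Moving average over a 2-day window (30 would be used on a longer series)
moving_avg = prices.rolling(window=2).mean()

print(total)
print(best)
```

The last row of `cumulative` matches `total` up to floating-point rounding, which is a handy consistency check.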

## Challenge 3: Data Cleaning Pipeline

### Task: Create a complete data cleaning pipeline

In [None]:
# Create a messy dataset
np.random.seed(123)
messy_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', np.nan, 'Frank'],
    'Age': [25, 30, np.nan, 28, 32, 35, 'thirty-five'],
    'Salary': ['$50,000', '$60,000', '$55,000', np.nan, '$58,000', '$62,000', '$48,000'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing', 'HR', np.nan],
    'Join_Date': ['2020-01-15', '2019-03-22', '2021-07-10', np.nan, '2018-11-05', '2022-02-28', '2020-09-12']
}

# Create the messy DataFrame
# Your code here

# Clean the dataset step by step
# 1. Handle missing values
# Your code here

# 2. Convert data types
# Your code here

# 3. Clean salary column (remove $ and commas)
# Your code here

# 4. Convert Age to numeric (handle 'thirty-five')
# Your code here

# 5. Convert Join_Date to datetime
# Your code here

# 6. Handle duplicates if any
# Your code here

# 7. Create derived columns
# Your code here

# Final cleaned dataset
# Your code here
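
The conversion steps mostly rely on the `errors='coerce'` option, which turns unparseable entries into missing values instead of raising. Here is a sketch on a separate messy table; `raw`, its columns, and the median-fill policy are illustrative assumptions rather than the required approach:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Age': [25, 'thirty', np.nan, 41],
    'Pay': ['$1,200', '$950', np.nan, '$2,000'],
    'Start': ['2021-03-01', 'not a date', '2020-07-15', np.nan],
})

clean = raw.copy()

# Unparseable entries become NaN instead of raising
clean['Age'] = pd.to_numeric(clean['Age'], errors='coerce')

# Strip currency symbols and thousands separators, then convert
clean['Pay'] = pd.to_numeric(
    clean['Pay'].str.replace(r'[$,]', '', regex=True), errors='coerce')

# Invalid dates become NaT
clean['Start'] = pd.to_datetime(clean['Start'], errors='coerce')

# One possible policy: fill numeric gaps with the column median
clean['Age'] = clean['Age'].fillna(clean['Age'].median())

print(clean)
print(clean.dtypes)
```

Coercion silently discards information (here, the word "thirty"), so it is worth counting the new missing values after each conversion.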

# Solutions

## NumPy Solutions

### Exercise 1 Solution:

In [None]:
# Array creation examples
arr_1d = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
zeros_arr = np.zeros((3, 4))
ones_arr = np.ones((2, 3))
identity = np.eye(3)
range_arr = np.arange(10)
linspace_arr = np.linspace(0, 1, 10)

print("1D Array:", arr_1d)
print("2D Array:")
print(arr_2d)
print("Zeros Array:")
print(zeros_arr)

### Exercise 2 Solution:

In [None]:
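# Indexing examples
# Recreate the Exercise 2 array, because the Exercise 1 solution reuses the name arr_2d
arr_2d = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])
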
element = arr_2d[1, 2]  # Row 1, Column 2
second_row = arr_2d[1, :]  # Entire second row
third_col = arr_2d[:, 2]  # Entire third column
subset = arr_2d[0:2, 1:3]  # Rows 0-1, Columns 1-2
every_other = arr_2d[0, ::2]  # Every other element in first row
reversed_rows = arr_2d[::-1]  # Reverse row order

print("Element at [1,2]:", element)
print("Second row:", second_row)
print("Subset:")
print(subset)

### Exercise 3 Solution:

In [None]:
# Array operations
addition = a + b
multiplication = a * b
dot_product = np.dot(a, b)
squared = a ** 2
mean_val = np.mean(a)
max_val = np.max(b)
scalar_add = a + 10

print("Addition:", addition)
print("Multiplication:", multiplication)
print("Dot product:", dot_product)
print("Squared:", squared)
print("Mean of a:", mean_val)
print("Max of b:", max_val)

## Pandas Solutions

### Exercise 7 Solution:

In [None]:
# DataFrame creation and inspection
df = pd.DataFrame(data)

print("First few rows:")
print(df.head())

print("\nLast few rows:")
print(df.tail())

print("\nDataFrame info:")
df.info()  # info() prints its report directly and returns None

print("\nSummary statistics:")
print(df.describe())

print("\nShape:", df.shape)
print("\nData types:")
print(df.dtypes)

### Exercise 8 Solution:

In [None]:
# Filtering examples
older_than_30 = df[df['Age'] > 30]
it_department = df[df['Department'] == 'IT']
salary_age_filter = df[(df['Salary'] > 55000) & (df['Age'] < 35)]
name_salary = df[['Name', 'Salary']]
city_filter = df[df['City'].isin(['New York', 'London'])]
loc_example = df.loc[0:2, ['Name', 'Age']]

print("Employees older than 30:")
print(older_than_30)
print("\nIT Department:")
print(it_department)

### Exercise 9 Solution:

In [None]:
# Sorting and grouping
sorted_by_age = df.sort_values('Age')
sorted_by_salary_desc = df.sort_values('Salary', ascending=False)
sorted_multi = df.sort_values(['Department', 'Age'])

dept_salary_mean = df.groupby('Department')['Salary'].mean()
dept_count = df.groupby('Department').size()
dept_age_stats = df.groupby('Department')['Age'].agg(['min', 'max', 'mean'])

pivot_table = df.pivot_table(values='Salary', index='Department', columns='City', aggfunc='mean')

print("Sorted by age:")
print(sorted_by_age)
print("\nMean salary by department:")
print(dept_salary_mean)

## Additional Resources

For more information and practice:
- NumPy Documentation: https://numpy.org/doc/
- Pandas Documentation: https://pandas.pydata.org/docs/
- NumPy User Guide: https://numpy.org/doc/stable/user/index.html
- Pandas User Guide: https://pandas.pydata.org/docs/user_guide/index.html
- Kaggle Datasets for practice: https://www.kaggle.com/datasets

## Tips for Success

1. **Practice regularly**: The more you code, the better you get
2. **Read documentation**: Official docs are your best friend
3. **Work on real projects**: Apply what you learn to actual data
4. **Join communities**: Stack Overflow, Reddit's r/learnpython, and pandas/numpy forums
5. **Debug systematically**: Use print statements and understand error messages
6. **Optimize performance**: Learn about vectorized operations and efficient data structures

Happy coding! 🚀