### Essential NumPy and Pandas Functions for Data Analysis

The snippets below demonstrate the NumPy and pandas methods used most often for data analysis in Python. Each one can be run as a cell in a Jupyter notebook, so you can experiment with the inputs as you go.

#### NumPy Basics for Data Analysis
Import NumPy:

In [23]:
import numpy as np

1. Creating Arrays

In [24]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats in [0, 1)

2. Array Shape, Reshape, Flatten

In [25]:
print(arr.shape)                # (4,)
reshaped = matrix.reshape((4, 1))   # Reshape to 4 rows, 1 column
flat = matrix.ravel()               # Flatten to 1D array

(4,)
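
The cell above does not show whether flattening copies the data. The following is a minimal sketch, assuming the same 2x2 `matrix`; the names `col`, `flat_view`, and `flat_copy` are illustrative and not part of the original notebook.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# -1 asks NumPy to infer that dimension from the array size
col = matrix.reshape(-1, 1)                  # shape (4, 1)

flat_view = matrix.ravel()                   # a view when the memory layout allows it
flat_copy = matrix.flatten()                 # always a copy

flat_copy[0] = 99                            # changes the copy only
print(matrix[0, 0])                          # 1
print(np.shares_memory(matrix, flat_view))   # True for this contiguous array
print(np.shares_memory(matrix, flat_copy))   # False
```

If you plan to modify the flattened result without touching the source array, `flatten()` is the safer choice.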


3. Indexing and Slicing

In [26]:
print(matrix[0, 1])    # element at first row, second column
print(arr[1:3])        # slice elements 1 to 2

2
[2 3]


4. Basic Math Operations

In [27]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print(a + b)           # [11 22 33]
print(a - b)           # [ 9 18 27]
print(a * b)           # [10 40 90]
print(a / b)           # [10. 10. 10.]


[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]
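
The operations above combine arrays of the same shape. NumPy can also combine arrays of different but compatible shapes through broadcasting. This is a small sketch; `scaled` and `shifted` are illustrative names.

```python
import numpy as np

a = np.array([10, 20, 30])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

scaled = a * 2          # the scalar is applied to every element
shifted = matrix + a    # a is broadcast across both rows of matrix

print(scaled)           # [20 40 60]
print(shifted)          # [[11 22 33]
                        #  [14 25 36]]
```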


5. Aggregation and Stats

In [28]:
np.sum(arr)                 # 10
np.mean(arr)                # 2.5
np.std(arr)                 # ~1.118 (population standard deviation, ddof=0)
np.min(arr), np.max(arr)    # (1, 4)

(1, 4)
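
For 2D arrays, the same aggregations accept an `axis` argument. This is a brief sketch using a 2x2 array like the `matrix` defined earlier; the result names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

total = matrix.sum()              # all elements: 10
col_sums = matrix.sum(axis=0)     # collapse rows, one value per column: [4 6]
row_means = matrix.mean(axis=1)   # collapse columns, one value per row: [1.5 3.5]

print(total, col_sums, row_means)
```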

6. Dot Product, Matrix Ops

In [29]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication
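
Matrix multiplication is easy to confuse with element-wise multiplication. This sketch contrasts the two using the same `mat` and `mat2`; the `@` operator is equivalent to `np.matmul` for these 2D arrays.

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])

matprod = mat @ mat2        # matrix product: [[19 22], [43 50]]
elementwise = mat * mat2    # element-wise product: [[5 12], [21 32]]
transposed = mat.T          # transpose: [[1 3], [2 4]]

print(matprod)
print(elementwise)
print(transposed)
```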

7. Logical and Filtering

In [30]:
arr[arr > 2]                # Returns [3, 4]
np.where(arr > 2)           # Indices where condition is true

(array([2, 3], dtype=int64),)
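
`np.where` also has a three-argument form that selects between two values element by element, and boolean masks can be combined with `&` and `|`. This is a short sketch; `labels` and `mask` are illustrative names.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# Three-argument form: pick 'high' where the condition holds, otherwise 'low'
labels = np.where(arr > 2, 'high', 'low')

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (arr > 1) & (arr < 4)

print(labels)       # ['low' 'low' 'high' 'high']
print(arr[mask])    # [2 3]
```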

8. Random Sampling

In [31]:
np.random.randint(0, 100, size=5) # 5 random ints 0-99
np.random.choice([0, 1], size=10) # 10 random 0s or 1s

array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
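
The calls above use NumPy's legacy global random state. As an alternative, the sketch below uses the `Generator` API via `np.random.default_rng`. The seed value 42 is an arbitrary choice, and the numbers drawn will differ from those produced by the legacy functions.

```python
import numpy as np

rng = np.random.default_rng(42)          # independent, seeded generator

ints = rng.integers(0, 100, size=5)      # 5 random ints in 0-99 (upper bound excluded)
flips = rng.choice([0, 1], size=10)      # 10 random 0s or 1s

print(ints)
print(flips)
```

Because the generator is seeded, re-creating it with the same seed reproduces the same draws.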

#### Pandas Basics for Data Analysis
Import pandas:

In [32]:
import pandas as pd

1. Creating DataFrames

In [33]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data) # Create DataFrame

2. Inspecting Data

In [34]:
df.head()                # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


3. Selecting Data

In [35]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

   A  B
1  2  5
2  3  6
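
With the default integer index, `loc` and `iloc` look alike. The difference shows up once the index holds labels. This is a small sketch in which the index labels `'x'`, `'y'`, `'z'` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

by_label = df.loc['y']           # row whose index label is 'y'
by_position = df.iloc[1]         # second row, regardless of label

label_slice = df.loc['x':'y']    # label slices include the end label: 2 rows
pos_slice = df.iloc[0:1]         # position slices exclude the end: 1 row

print(by_label.equals(by_position))        # True
print(len(label_slice), len(pos_slice))    # 2 1
```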


4. Adding/Dropping Columns

In [36]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B

5. Missing Values & Types

In [37]:
df.isnull().sum()               # Missing values per column
df.fillna(0)                    # Replace NaNs with 0 (returns a new DataFrame; df is unchanged)
df['A'] = df['A'].astype(float) # Change column type
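
The DataFrame used so far has no missing values, so the calls above have nothing to act on. This sketch uses a small frame with one `NaN` inserted on purpose; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4, 5, 6]})

missing = df.isnull().sum()                       # A: 1, B: 0
filled = df.fillna(0)                             # new DataFrame; df keeps its NaN
mean_filled = df.fillna({'A': df['A'].mean()})    # fill column A with its mean (2.0)
dropped = df.dropna()                             # drop rows containing any NaN

print(missing)
print(filled)
print(mean_filled)
print(dropped)
```

Whether to fill or drop depends on the analysis. Filling with 0 can distort averages, which is why a column mean is shown as one alternative.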

6. Grouping and Aggregation

In [38]:
grouped = df.groupby('A').sum()   # Sum columns, grouped by values in A
df['A'].value_counts()            # Count unique values in A

A
1.0    1
2.0    1
3.0    1
Name: count, dtype: int64

7. Sorting

In [39]:
df.sort_values('B', ascending=False)  # Sort by column B

     A  B  C
2  3.0  6  9
1  2.0  5  7
0  1.0  4  5


8. Reading/Writing Files

In [40]:
df.to_csv('mydata.csv')        # Save to CSV (the index is written as an extra, unnamed column)
df = pd.read_csv('mydata.csv') # Load from CSV (the saved index comes back as 'Unnamed: 0')
df.head()

   Unnamed: 0    A  B  C
0           0  1.0  4  5
1           1  2.0  5  7
2           2  3.0  6  9
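
The `Unnamed: 0` column in the output above is the saved index read back as data. One way to avoid it is to pass `index=False` when writing. The sketch below uses an in-memory `io.StringIO` buffer instead of a file so that it leaves nothing on disk; a file path works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)      # do not write the index column
buffer.seek(0)                      # rewind before reading

restored = pd.read_csv(buffer)
print(restored.columns.tolist())    # ['A', 'B']
```

If the index carries real information, an alternative is to keep it in the file and load it with `pd.read_csv(path, index_col=0)`.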


9. Apply/Map Functions

In [41]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3)                                    # Random sample of 3 rows (order can vary between runs)

   Unnamed: 0    A  B  C  A_squared B_label
0           0  1.0  4  5        1.0     Low
1           1  2.0  5  7        4.0     Med
2           2  3.0  6  9        9.0    High
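
Two details about `apply` and `map` are worth seeing in isolation. Simple arithmetic can usually be written as a vectorized expression instead of `apply`, and `map` with a dict yields `NaN` for values that have no entry. This sketch uses a toy frame in which the value 7 is deliberately left unmapped.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 7]})

squared_apply = df['A'].apply(lambda x: x**2)    # calls the lambda once per element
squared_vec = df['A'] ** 2                       # vectorized equivalent

# 7 is not a key in the dict, so the third label is NaN
labels = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'})

print(squared_apply.equals(squared_vec))    # True
print(labels)
```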


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [42]:
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)
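
Because `years_experience` is drawn from a normal distribution, it can come out negative, which makes no sense for years of experience. One possible guard, shown here as a sketch and not as part of the original task, is to clip the values at zero with `np.clip`. The draws below are generated on their own, so they will not match the values in the dataset above.

```python
import numpy as np

np.random.seed(42)
raw = np.round(np.random.normal(5, 2, size=100), 1)

# Replace anything below 0 with 0; None leaves the upper end unbounded
clipped = np.clip(raw, 0, None)

print(raw.min(), clipped.min())
```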

Q1. View data structure

In [16]:
import numpy as np
import pandas as pd

# For reproducibility
np.random.seed(42)

# Number of samples
n_samples = 100

# Create synthetic dataset
data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)



In [18]:
# View first 5 rows
print(df.head())

# View last 5 rows
print(df.tail())

# View shape of DataFrame
print(df.shape)

# View column names
print(df.columns)

# View data types
print(df.dtypes)


   age  salary department  years_experience  is_manager
0   56   38392         IT              -0.8           0
1   46   60535  Marketing               3.4           1
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1
    age  salary department  years_experience  is_manager
95   59   82662         IT               4.0           0
96   56   42688         HR               4.4           1
97   58   55342  Marketing               6.0           0
98   45   67157         HR              11.4           0
99   24   97863         HR               5.2           1
(100, 5)
Index(['age', 'salary', 'department', 'years_experience', 'is_manager'], dtype='object')
age                   int32
salary                int32
department           object
years_experience    float64
is_manager            int32
dtype: object


Q2. Get DataFrame Info and Summary Stats

In [20]:
import numpy as np
import pandas as pd

# For reproducibility
np.random.seed(42)

# Number of samples
n_samples = 100

# Generate synthetic dataset
data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)


In [25]:
df.info()               # Info about DataFrame (types, non-null count)
print(df.describe())    # Summary stats for numeric columns
print(df.describe(include='object'))  # Summary for categorical columns
print(df.isnull().sum())  # Check missing values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int32  
 1   salary            100 non-null    int32  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int32  
dtypes: float64(1), int32(3), object(1)
memory usage: 2.9+ KB
              age         salary  years_experience  is_manager
count  100.000000     100.000000        100.000000  100.000000
mean    37.910000   77809.160000          4.823000    0.470000
std     12.219454   26058.643576          2.237822    0.501614
min     18.000000   30206.000000         -0.800000    0.000000
25%     26.750000   55141.000000          3.475000    0.000000
50%     38.000000   80932.000000          4.700000    0.000000
75%     46.250000   98107.250000          6.000000    1.000000
ma

Q3. Do Simple NumPy Operations

In [27]:
import numpy as np
import pandas as pd

# Re-create the DataFrame if not already done
np.random.seed(42)
n_samples = 100
data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)

# Simple NumPy operations
mean_age = np.mean(df['age'])
median_salary = np.median(df['salary'])
max_salary = np.max(df['salary'])
min_salary = np.min(df['salary'])
std_exp = np.std(df['years_experience'])  # population std (ddof=0); df.describe() reports the sample std (ddof=1)

print("Mean Age:", mean_age)
print("Median Salary:", median_salary)
print("Max Salary:", max_salary)
print("Min Salary:", min_salary)
print("Std Dev of Experience:", std_exp)


Mean Age: 37.91
Median Salary: 80932.0
Max Salary: 119474
Min Salary: 30206
Std Dev of Experience: 2.2266052636244265
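
The standard deviation printed here (about 2.2266) differs slightly from the `std` row of `df.describe()` (about 2.2378). The difference comes from `np.std` dividing by `n` by default, while pandas divides by `n - 1`. This sketch shows the same effect on a small hand-picked series, where the arithmetic is easy to check.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values.to_numpy())        # ddof=0: sqrt(32 / 8) = 2.0
sample_std = values.std()                         # ddof=1: sqrt(32 / 7), about 2.138
matched = np.std(values.to_numpy(), ddof=1)       # NumPy with ddof=1 matches pandas

print(population_std, sample_std, matched)
```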


Q4. Filtering and Indexing Rows

In [29]:
import numpy as np
import pandas as pd

# 1️⃣ Create the synthetic dataset
np.random.seed(42)
n_samples = 100

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}

df = pd.DataFrame(data)

# 2️⃣ Filtering and Indexing

# Employees with salary > 90000
high_salary = df[df['salary'] > 90000]

# IT department employees who are managers
it_managers = df[(df['department'] == 'IT') & (df['is_manager'] == 1)]

# Employees with more than 5 years of experience
experienced = df[df['years_experience'] > 5]

# Access specific row by index
row_10 = df.loc[10]

# Access specific value: age of row 5
age_row5 = df.loc[5, 'age']

# 3️⃣ Print results
print("High Salary Employees:\n", high_salary.head())
print("\nIT Managers:\n", it_managers.head())
print("\nExperienced Employees (>5 yrs):\n", experienced.head())
print("\nRow 10:\n", row_10)
print("\nAge in row 5:", age_row5)



High Salary Employees:
    age  salary department  years_experience  is_manager
2   32  108603         HR               5.0           1
4   38  119135         HR               4.1           1
6   36  107373    Finance               6.7           1
7   40  109575  Marketing               6.9           0
8   28  114651  Marketing               5.8           1

IT Managers:
     age  salary department  years_experience  is_manager
9    28   93335         IT               1.2           1
10   41   40965         IT               9.5           1
15   39   57266         IT               5.8           1
31   20   97214         IT               5.7           1
37   35   52299         IT               1.5           1

Experienced Employees (>5 yrs):
     age  salary department  years_experience  is_manager
6    36  107373    Finance               6.7           1
7    40  109575  Marketing               6.9           0
8    28  114651  Marketing               5.8           1
10   41   40965      
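
Boolean masks are not the only way to write these filters. The sketch below shows `query`, `isin`, and `between` on a four-row toy frame; the rows are invented for illustration and are not taken from the dataset above.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['IT', 'HR', 'Finance', 'IT'],
    'salary': [95000, 50000, 70000, 40000],
    'is_manager': [1, 0, 1, 1],
})

# query() takes the condition as a string expression
it_managers = df.query("department == 'IT' and is_manager == 1")

# isin() tests membership in a list of values
support = df[df['department'].isin(['HR', 'Finance'])]

# between() is inclusive on both ends by default
mid_salary = df[df['salary'].between(45000, 75000)]

print(it_managers)
print(support)
print(mid_salary)
```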

Q5. Adding a Column

In [31]:
import numpy as np
import pandas as pd

# 1️⃣ Create the synthetic dataset
np.random.seed(42)
n_samples = 100

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}

df = pd.DataFrame(data)

# 2️⃣ Add a new column 'salary_level' based on salary ranges
df['salary_level'] = pd.cut(
    df['salary'],
    bins=[0, 60000, 90000, 120000],
    labels=['Low', 'Medium', 'High']
)

# Optional: Add bonus column (10% of salary)
df['bonus'] = df['salary'] * 0.10

# Optional: Add experience_level column
df['experience_level'] = np.where(df['years_experience'] > 7, 'Senior', 'Junior')

# 3️⃣ View the first 5 rows
print(df.head())


   age  salary department  years_experience  is_manager salary_level    bonus  \
0   56   38392         IT              -0.8           0          Low   3839.2   
1   46   60535  Marketing               3.4           1       Medium   6053.5   
2   32  108603         HR               5.0           1         High  10860.3   
3   25   82256         HR               4.2           1       Medium   8225.6   
4   38  119135         HR               4.1           1         High  11913.5   

  experience_level  
0           Junior  
1           Junior  
2           Junior  
3           Junior  
4           Junior  
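
By default `pd.cut` treats each bin as open on the left and closed on the right, so a salary of exactly 60000 is labelled `Low`, not `Medium`. This sketch checks the edge cases with the same bins and labels as above; the salary values are chosen to sit on or near the boundaries.

```python
import pandas as pd

salaries = pd.Series([60000, 60001, 90000, 119999])
bins = [0, 60000, 90000, 120000]
labels = ['Low', 'Medium', 'High']

# Default right=True: bins are (0, 60000], (60000, 90000], (90000, 120000]
levels = pd.cut(salaries, bins=bins, labels=labels)

# right=False: bins are [0, 60000), [60000, 90000), [90000, 120000)
left_closed = pd.cut(salaries, bins=bins, labels=labels, right=False)

print(list(levels))         # ['Low', 'Medium', 'Medium', 'High']
print(list(left_closed))    # ['Medium', 'Medium', 'High', 'High']
```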


Q6. Grouping and Aggregation

In [33]:
import numpy as np
import pandas as pd

# 1️⃣ Create the synthetic dataset
np.random.seed(42)
n_samples = 100

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}

df = pd.DataFrame(data)

# 2️⃣ Grouping and Aggregation

# Average salary per department
avg_salary = df.groupby('department')['salary'].mean()
print("Average Salary per Department:\n", avg_salary)

# Multiple aggregations per department
dept_summary = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'years_experience': 'mean'
})
print("\nDepartment Summary:\n", dept_summary)

# Group by multiple columns: department and manager status
grouped = df.groupby(['department', 'is_manager'])['salary'].mean()
print("\nAverage Salary by Department and Manager Status:\n", grouped)



Average Salary per Department:
 department
Finance      83124.708333
HR           73523.052632
IT           75825.476190
Marketing    77684.722222
Name: salary, dtype: float64

Department Summary:
                   salary                years_experience
                    mean     max    min             mean
department                                              
Finance     83124.708333  117897  30206         4.416667
HR          73523.052632  119135  32693         5.089474
IT          75825.476190  117455  35530         4.390476
Marketing   77684.722222  119474  30854         5.205556

Average Salary by Department and Manager Status:
 department  is_manager
Finance     0             84325.071429
            1             81444.200000
HR          0             70923.444444
            1             75862.700000
IT          0             80391.000000
            1             73015.923077
Marketing   0             72388.909091
            1             86006.714286
Name: salary, dtype: float64
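
Passing a dict to `agg` produces two-level column headers, as in the department summary above. If flat column names are easier to work with, named aggregation is one alternative. This sketch uses a four-row toy frame, and the output column names (`mean_salary`, `max_salary`, `mean_experience`) are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['IT', 'IT', 'HR', 'HR'],
    'salary': [40000, 60000, 50000, 70000],
    'years_experience': [2.0, 4.0, 3.0, 5.0],
})

# Each keyword becomes a column: name=(source column, aggregation)
summary = df.groupby('department').agg(
    mean_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    mean_experience=('years_experience', 'mean'),
).reset_index()    # turn the group key back into a regular column

print(summary)
```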