### Essential NumPy and Pandas Functions for Data Analysis

The snippets below demonstrate the NumPy and pandas methods used most often for data analysis in Python. Run them in a Jupyter notebook to experiment with each one.

#### NumPy Basics for Data Analysis
Import NumPy:

In [1]:
import numpy as np

1. Creating Arrays

In [2]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats in [0, 1)

2. Array Shape, Reshape, Flatten

In [3]:
print(arr.shape)                # (4,)
reshaped = matrix.reshape((4, 1))   # Reshape to 4 rows, 1 column
flat = matrix.ravel()               # Flatten to 1D array

(4,)
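One detail that matters when flattening: `ravel()` returns a view of the original data when it can, whereas `flatten()` always copies. The sketch below reuses the `matrix` values from above; the names `view` and `copy` are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

view = matrix.ravel()      # a view onto matrix's data (matrix is contiguous)
copy = matrix.flatten()    # always an independent copy

view[0] = 99               # writes through to matrix
copy[1] = -1               # leaves matrix untouched

print(matrix)
print(np.shares_memory(matrix, view), np.shares_memory(matrix, copy))
```

If the original array must stay unchanged, prefer `flatten()` or call `.copy()` on the result of `ravel()`.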


3. Indexing and Slicing

In [4]:
print(matrix[0, 1])    # element at first row, second column
print(arr[1:3])        # elements at indices 1 and 2 (the stop index 3 is excluded)

2
[2 3]


4. Basic Math Operations

In [5]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print(a + b)           # [11 22 33]
print(a - b)           # [ 9 18 27]
print(a * b)           # [10 40 90]
print(a / b)           # [10. 10. 10.]


[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]


5. Aggregation and Stats

In [6]:
np.sum(arr)                 # 10
np.mean(arr)                # 2.5
np.std(arr)                 # Population standard deviation (ddof=0), about 1.118
np.min(arr), np.max(arr)    # Min and max values

(np.int64(1), np.int64(4))

6. Dot Product, Matrix Ops

In [7]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication
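Matrix multiplication is easy to confuse with element-wise multiplication. As a small sketch using the same `mat` and `mat2` values, the `@` operator is equivalent to `np.matmul` for 2D arrays, while `*` multiplies element by element (`elementwise` is an illustrative name).

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])

matprod = mat @ mat2        # same result as np.matmul(mat, mat2)
elementwise = mat * mat2    # element-by-element product, not a matrix product

print(matprod)      # rows [19, 22] and [43, 50]
print(elementwise)  # rows [5, 12] and [21, 32]
```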

7. Logical and Filtering

In [8]:
arr[arr > 2]                # Returns [3, 4]
np.where(arr > 2)           # Indices where condition is true

(array([2, 3]),)
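Two common extensions of the filtering shown above, sketched with the same `arr` values: conditions can be combined with `&` (each wrapped in parentheses), and `np.where` accepts two extra arguments to pick a value per element. The names `mask` and `labels` are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

mask = (arr > 1) & (arr < 4)                # combine conditions with & and parentheses
labels = np.where(arr > 2, 'high', 'low')   # choose a value per element

print(arr[mask])   # [2 3]
print(labels)      # ['low' 'low' 'high' 'high']
```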

8. Random Sampling

In [9]:
np.random.randint(0, 100, size=5) # 5 random ints 0-99
np.random.choice([0, 1], size=10) # 10 random 0s or 1s

array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1])
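NumPy also offers a newer `Generator` interface, created with `np.random.default_rng`, which keeps the random state in an object instead of a global seed. The sketch below mirrors the calls above; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)        # seeded generator; the seed is arbitrary

ints = rng.integers(0, 100, size=5)    # 5 random ints in [0, 100)
coin = rng.choice([0, 1], size=10)     # 10 random 0s or 1s
floats = rng.random((2, 3))            # 2x3 array of floats in [0, 1)

print(ints, coin, floats, sep="\n")
```

Creating a second generator with the same seed reproduces the same sequence, which helps when results need to be repeatable.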

#### Pandas Basics for Data Analysis
Import pandas:

In [10]:
import pandas as pd

1. Creating DataFrames

In [11]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data) # Create DataFrame

2. Inspecting Data

In [12]:
df.head()                # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


|       | A   | B   |
| ----- | --- | --- |
| count | 3.0 | 3.0 |
| mean  | 2.0 | 5.0 |
| std   | 1.0 | 1.0 |
| min   | 1.0 | 4.0 |
| 25%   | 1.5 | 4.5 |
| 50%   | 2.0 | 5.0 |
| 75%   | 2.5 | 5.5 |
| max   | 3.0 | 6.0 |
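The `std` row reported by `describe()` is the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two can disagree on the same data. A brief sketch using the values of column `A`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3])

pandas_std = values.std()                  # pandas default: ddof=1
numpy_std = np.std(values.to_numpy())      # NumPy default: ddof=0
numpy_sample_std = np.std(values.to_numpy(), ddof=1)

print(pandas_std, numpy_std, numpy_sample_std)
```

Passing `ddof=1` to `np.std` makes the two libraries agree.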


3. Selecting Data

In [13]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

|   | A | B |
| --- | --- | --- |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
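The difference between `loc` and `iloc` is easier to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical string index (`'x'`, `'y'`, `'z'`) on the same data: label slices include both endpoints, position slices exclude the stop.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

by_label = df.loc['x':'y']       # label slice: includes both 'x' and 'y'
by_position = df.iloc[0:1]       # position slice: excludes position 1
both = df[(df['A'] > 1) & (df['B'] < 6)]   # combine filters with & and parentheses

print(by_label, by_position, both, sep="\n\n")
```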


4. Adding/Dropping Columns

In [14]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B

5. Missing Values & Types

In [15]:
df.isnull().sum()               # Missing values per column
df.fillna(0)                    # Replace NaNs with 0 (returns a new DataFrame)
df['A'] = df['A'].astype(float) # Change column type
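The example `df` has no missing values, so the calls above have nothing to act on. As a minimal sketch with a hypothetical frame `df_nan` that does contain NaNs:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

missing = df_nan.isnull().sum()               # one missing value per column
filled = df_nan.fillna(0)                     # new DataFrame; df_nan is unchanged
filled_mean = df_nan.fillna(df_nan.mean())    # fill each column with its own mean

print(missing, filled, filled_mean, sep="\n\n")
```

Filling with the column mean is only one option; whether it is appropriate depends on the data.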

6. Grouping and Aggregation

In [16]:
grouped = df.groupby('A').sum()   # Sum columns, grouped by values in A
df['A'].value_counts()            # Count unique values in A

| A   | count |
| --- | ----- |
| 1.0 | 1 |
| 2.0 | 1 |
| 3.0 | 1 |


7. Sorting

In [17]:
df.sort_values('B', ascending=False)  # Sort by column B

|   | A   | B | C |
| --- | --- | --- | --- |
| 2 | 3.0 | 6 | 9 |
| 1 | 2.0 | 5 | 7 |
| 0 | 1.0 | 4 | 5 |


8. Reading/Writing Files

In [18]:
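df.to_csv('mydata.csv') # Save to CSV; the index is written as an extra column (pass index=False to omit it)
df = pd.read_csv('mydata.csv') # Load from CSV; the saved index comes back as 'Unnamed: 0'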
df.head()

|   | Unnamed: 0 | A   | B | C |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1.0 | 4 | 5 |
| 1 | 1 | 2.0 | 5 | 7 |
| 2 | 2 | 3.0 | 6 | 9 |
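The `Unnamed: 0` column in this output is the old index, which `to_csv` writes by default. A short sketch of a round trip that avoids it with `index=False`; it uses an in-memory `io.StringIO` buffer instead of a file purely so the example has no side effects.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 6]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)    # do not write the index column
buffer.seek(0)                    # rewind before reading
round_trip = pd.read_csv(buffer)

print(round_trip.columns.tolist())   # ['A', 'B']
```

An alternative is to keep the index in the file and restore it on load with `pd.read_csv(..., index_col=0)`.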


9. Apply/Map Functions

In [19]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3)

|   | Unnamed: 0 | A   | B | C | A_squared | B_label |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.0 | 4 | 5 | 1.0 | Low |
| 1 | 1 | 2.0 | 5 | 7 | 4.0 | Med |
| 2 | 2 | 3.0 | 6 | 9 | 9.0 | High |
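Two things are worth knowing about `apply` and `map`. Simple arithmetic such as squaring can be written as a vectorized expression, which is usually faster than `apply` with a lambda, and `map` with a dictionary yields NaN for any value that has no key. The sketch below assumes a hypothetical column `B` containing the unmapped value 7.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 7]})

squared = df['A'] ** 2                                   # vectorized, no lambda needed
labels = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'})    # 7 has no key, so it becomes NaN
labels_filled = labels.fillna('Unknown')                 # give unmapped values a default

print(squared.tolist())
print(labels_filled.tolist())
```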


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [20]:
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)

Q1. View data structure

In [21]:
display(df.head()) # View the first 5 rows
display(df.tail()) # View the last 5 rows
display(df.sample(5)) # View 5 random rows

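|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 56 | 38392 | IT | -0.8 | 0 |
| 1 | 46 | 60535 | Marketing | 3.4 | 1 |
| 2 | 32 | 108603 | HR | 5.0 | 1 |
| 3 | 25 | 82256 | HR | 4.2 | 1 |
| 4 | 38 | 119135 | HR | 4.1 | 1 |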


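|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 95 | 59 | 82662 | IT | 4.0 | 0 |
| 96 | 56 | 42688 | HR | 4.4 | 1 |
| 97 | 58 | 55342 | Marketing | 6.0 | 0 |
| 98 | 45 | 67157 | HR | 11.4 | 0 |
| 99 | 24 | 97863 | HR | 5.2 | 1 |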


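|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 31 | 20 | 97214 | IT | 5.7 | 1 |
| 50 | 34 | 105766 | Marketing | 1.9 | 1 |
| 46 | 24 | 72941 | HR | 5.8 | 0 |
| 23 | 29 | 80636 | Finance | 1.9 | 0 |
| 42 | 43 | 32693 | HR | 3.3 | 0 |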


Q2. Get DataFrame Info and Summary Stats

In [22]:
df.info()               # prints its report and returns None, so display() is not needed
display(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int64  
 1   salary            100 non-null    int64  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB


|       | age | salary | years_experience | is_manager |
| ----- | --- | --- | --- | --- |
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean  | 37.91 | 77809.16 | 4.823 | 0.47 |
| std   | 12.219454 | 26058.643576 | 2.237822 | 0.501614 |
| min   | 18.0 | 30206.0 | -0.8 | 0.0 |
| 25%   | 26.75 | 55141.0 | 3.475 | 0.0 |
| 50%   | 38.0 | 80932.0 | 4.7 | 0.0 |
| 75%   | 46.25 | 98107.25 | 6.0 | 1.0 |
| max   | 59.0 | 119474.0 | 11.4 | 1.0 |


Q3. Do Simple NumPy Operations

In [23]:
import numpy as np
a = np.array([[10, 20, 30],[22,4,5]])
b = np.array([[1, 2, 3],[2,4,2]])
print(a + b)
print(a - b)
print(a * b)
print(a / b)

[[11 22 33]
 [24  8  7]]
[[ 9 18 27]
 [20  0  3]]
[[10 40 90]
 [44 16 10]]
[[10.  10.  10. ]
 [11.   1.   2.5]]
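The operations above pair arrays of identical shape. NumPy also broadcasts arrays of compatible but different shapes; the sketch below reuses `a` and adds two hypothetical arrays, `row` with shape `(3,)` and `col` with shape `(2, 1)`, to illustrate.

```python
import numpy as np

a = np.array([[10, 20, 30], [22, 4, 5]])   # shape (2, 3)
row = np.array([1, 2, 3])                  # shape (3,): applied to every row of a
col = np.array([[10], [100]])              # shape (2, 1): one factor per row of a

added = a + row     # rows [11, 22, 33] and [23, 6, 8]
scaled = a * col    # rows [100, 200, 300] and [2200, 400, 500]

print(added)
print(scaled)
```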


Q4. Filtering and Indexing Rows

In [24]:
# Filter rows where age is greater than 30
display(df[df['age'] > 30].head())

# Select rows by label (index)
display(df.loc[0:5]) # Select rows with index from 0 to 5 (inclusive)

# Select rows by position
display(df.iloc[10:15]) # Select rows at positions 10 to 14

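|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 56 | 38392 | IT | -0.8 | 0 |
| 1 | 46 | 60535 | Marketing | 3.4 | 1 |
| 2 | 32 | 108603 | HR | 5.0 | 1 |
| 4 | 38 | 119135 | HR | 4.1 | 1 |
| 5 | 56 | 65222 | Finance | 2.8 | 0 |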


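|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 56 | 38392 | IT | -0.8 | 0 |
| 1 | 46 | 60535 | Marketing | 3.4 | 1 |
| 2 | 32 | 108603 | HR | 5.0 | 1 |
| 3 | 25 | 82256 | HR | 4.2 | 1 |
| 4 | 38 | 119135 | HR | 4.1 | 1 |
| 5 | 56 | 65222 | Finance | 2.8 | 0 |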


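|   | age | salary | department | years_experience | is_manager |
| --- | --- | --- | --- | --- | --- |
| 10 | 41 | 40965 | IT | 9.5 | 1 |
| 11 | 53 | 54538 | Marketing | 5.3 | 0 |
| 12 | 57 | 100592 | Finance | 6.1 | 1 |
| 13 | 41 | 38110 | HR | 2.0 | 1 |
| 14 | 20 | 109309 | Marketing | 5.0 | 1 |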
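Filters often need more than one condition. As a sketch, the small `people` frame below copies a few values from the first rows of the dataset (the frame itself is illustrative) and shows three equivalent-in-spirit approaches: a boolean mask with `&`, `DataFrame.query`, and `isin` for membership tests.

```python
import pandas as pd

people = pd.DataFrame({
    'age': [56, 46, 32, 25],
    'department': ['IT', 'Marketing', 'HR', 'HR'],
    'is_manager': [0, 1, 1, 1],
})

# Managers older than 30, written as a boolean mask and as a query string
mask_filter = people[(people['age'] > 30) & (people['is_manager'] == 1)]
query_filter = people.query("age > 30 and is_manager == 1")

# Rows whose department is one of several values
hr_or_it = people[people['department'].isin(['HR', 'IT'])]

print(mask_filter, hr_or_it, sep="\n\n")
```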


Q5. Adding a Column

In [25]:
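# Add a new column 'salary_per_year_experience'.
# Adding 1 keeps 0 years of experience from dividing by zero, but the synthetic
# data contains negative values (row 0 is -0.8), which inflate the ratio.
df['salary_per_year_experience'] = df['salary'] / (df['years_experience'] + 1)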
display(df.head())

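|   | age | salary | department | years_experience | is_manager | salary_per_year_experience |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | 38392 | IT | -0.8 | 0 | 191960.0 |
| 1 | 46 | 60535 | Marketing | 3.4 | 1 | 13757.954545 |
| 2 | 32 | 108603 | HR | 5.0 | 1 | 18100.5 |
| 3 | 25 | 82256 | HR | 4.2 | 1 | 15818.461538 |
| 4 | 38 | 119135 | HR | 4.1 | 1 | 23359.803922 |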
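Row 0 stands out because its `years_experience` is -0.8, so the denominator is only 0.2. One possible guard, sketched here on a hypothetical three-row `sample` frame built from the first rows above, is to clip negative experience to zero before dividing. This is one option among several, not the only reasonable treatment.

```python
import pandas as pd

sample = pd.DataFrame({
    'salary': [38392, 60535, 108603],
    'years_experience': [-0.8, 3.4, 5.0],
})

# Treat negative experience as zero so the denominator is never below 1
experience = sample['years_experience'].clip(lower=0)
sample['salary_per_year_experience'] = sample['salary'] / (experience + 1)

print(sample)
```

With clipping, the first row's ratio equals its salary (38392.0) instead of 191960.0.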


Q6. Grouping and Aggregation

In [26]:
# Group by department and calculate the mean salary
grouped_salary = df.groupby('department')['salary'].mean()
display(grouped_salary)

# Group by department and calculate the employee count, mean years of experience, and mean salary
grouped_stats = df.groupby('department').agg({
    'age': 'count',
    'years_experience': 'mean',
    'salary': 'mean'
}).rename(columns={'age': 'employee_count'})
display(grouped_stats)

| department | salary |
| --- | --- |
| Finance | 83124.708333 |
| HR | 73523.052632 |
| IT | 75825.47619 |
| Marketing | 77684.722222 |


| department | employee_count | years_experience | salary |
| --- | --- | --- | --- |
| Finance | 24 | 4.416667 | 83124.708333 |
| HR | 19 | 5.089474 | 73523.052632 |
| IT | 21 | 4.390476 | 75825.47619 |
| Marketing | 36 | 5.205556 | 77684.722222 |
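Counting rows through the `age` column and renaming it afterwards works, but pandas also supports named aggregation, which sets the output column names directly. A minimal sketch on a hypothetical four-row `staff` frame (the values are made up for illustration):

```python
import pandas as pd

staff = pd.DataFrame({
    'department': ['IT', 'HR', 'HR', 'IT'],
    'salary': [40000, 50000, 70000, 60000],
    'years_experience': [2.0, 4.0, 6.0, 3.0],
})

# Each keyword is an output column: name=(source column, aggregation)
stats = staff.groupby('department').agg(
    employee_count=('salary', 'size'),
    mean_experience=('years_experience', 'mean'),
    mean_salary=('salary', 'mean'),
)

print(stats)
```

Note that `'size'` counts all rows in each group, whereas `'count'` skips missing values in the chosen column.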
