### Essential NumPy and Pandas Functions for Data Analysis

The snippets below demonstrate the NumPy and pandas methods used most often for data analysis in Python. Run them in a Jupyter notebook to experiment with each one.

#### NumPy Basics for Data Analysis
Import NumPy:

In [None]:
import numpy as np

1. Creating Arrays

In [None]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats

2. Array Shape, Reshape, Flatten

In [None]:
print(arr.shape)                # (4,)
reshaped = matrix.reshape((4, 1))   # Reshape to 4 rows, 1 column
flat = matrix.ravel()               # Flatten to 1D array

(4,)
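
Two details of reshaping are easy to miss: `reshape` accepts `-1` for one dimension it should infer, and `ravel` returns a view where possible while `flatten` always copies. A minimal sketch, reusing the 2x2 `matrix` from above:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

column = matrix.reshape(-1, 1)   # -1 lets NumPy infer the row count (4)
view = matrix.ravel()            # a view when the memory layout allows it
copy = matrix.flatten()          # always an independent copy

print(column.shape)                    # (4, 1)
print(np.shares_memory(matrix, view))  # True
print(np.shares_memory(matrix, copy))  # False
```

Because `view` shares memory with `matrix` here, writing to it would also change `matrix`; use `flatten` when an independent array is needed.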


3. Indexing and Slicing

In [None]:
print(matrix[0, 1])    # element at first row, second column
print(arr[1:3])        # elements at indices 1 and 2 (the end index is exclusive)

2
[2 3]


4. Basic Math Operations

In [None]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print(a + b)           # [11 22 33]
print(a - b)           # [ 9 18 27]
print(a * b)           # [10 40 90]
print(a / b)           # [10. 10. 10.]


[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]
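
The same operators also work when the shapes differ but are compatible, which NumPy calls broadcasting. A small sketch, assuming a 2x3 `matrix` defined only for this illustration:

```python
import numpy as np

a = np.array([10, 20, 30])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # illustrative 2x3 array

scaled = a * 2         # the scalar is applied to every element
shifted = matrix + a   # `a` is added to each row of `matrix`

print(scaled)   # [20 40 60]
print(shifted)
# [[11 22 33]
#  [14 25 36]]
```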


5. Aggregation and Stats

In [None]:
np.sum(arr)                 # 10
np.mean(arr)                # 2.5
np.std(arr)                 # Population standard deviation (ddof=0)
np.min(arr), np.max(arr)    # Min and max values

(1, 4)
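
On 2D arrays the aggregation functions take an `axis` argument that selects the dimension to collapse. A brief sketch, using a hypothetical 2x3 array `m` that is not part of the cells above:

```python
import numpy as np

# Hypothetical 2x3 array used only for illustration
m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)    # collapse rows: one sum per column
row_means = m.mean(axis=1)  # collapse columns: one mean per row

print(col_sums)   # [5 7 9]
print(row_means)  # [2. 5.]
```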

6. Dot Product, Matrix Ops

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication

7. Logical and Filtering

In [None]:
arr[arr > 2]                # Returns [3, 4]
np.where(arr > 2)           # Indices where condition is true

(array([2, 3], dtype=int64),)
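
With three arguments, `np.where` chooses between two values per element instead of returning indices. A short sketch on the same `arr`; the labels `"high"` and `"low"` are made up for the example:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

mask = arr > 2                          # boolean mask, one entry per element
labels = np.where(mask, "high", "low")  # pick a value per element
capped = np.where(arr > 2, 2, arr)      # replace values above 2 with 2

print(labels)  # ['low' 'low' 'high' 'high']
print(capped)  # [1 2 2 2]
```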

8. Random Sampling

In [None]:
np.random.randint(0, 100, size=5) # 5 random ints 0-99
np.random.choice([0, 1], size=10) # 10 random 0s or 1s

array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
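
NumPy also provides a `Generator` interface through `np.random.default_rng`, which its documentation recommends for new code. The sketch below mirrors the calls above; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeded generator for repeatable draws

ints = rng.integers(0, 100, size=5)    # 5 random ints in [0, 100)
flips = rng.choice([0, 1], size=10)    # 10 random 0s or 1s
floats = rng.random((2, 3))            # 2x3 array of floats in [0, 1)

# A second generator with the same seed reproduces the same first draw
repeat = np.random.default_rng(seed=42).integers(0, 100, size=5)
print(np.array_equal(ints, repeat))    # True
```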

#### Pandas Basics for Data Analysis
Import pandas:

In [None]:
import pandas as pd

1. Creating DataFrames

In [None]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data) # Create DataFrame

2. Inspecting Data

In [None]:
df.head()                # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


3. Selecting Data

In [None]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

   A  B
1  2  5
2  3  6


4. Adding/Dropping Columns

In [None]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B

5. Missing Values & Types

In [None]:
df.isnull().sum()               # Missing values per column
df.fillna(0)                    # Replace NaNs with 0 (returns a new DataFrame; df is unchanged)
df['A'] = df['A'].astype(float) # Change column type

6. Grouping and Aggregation

In [None]:
grouped = df.groupby('A').sum()   # Sum columns, grouped by values in A
df['A'].value_counts()            # Count occurrences of each unique value in A

A
1.0    1
2.0    1
3.0    1
Name: count, dtype: int64

7. Sorting

In [None]:
df.sort_values('B', ascending=False)  # Sort by column B

     A  B  C
2  3.0  6  9
1  2.0  5  7
0  1.0  4  5


8. Reading/Writing Files

In [None]:
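df.to_csv('mydata.csv') # Save to CSV (the index is written too; pass index=False to omit it)
df = pd.read_csv('mydata.csv') # Load from CSV (the saved index comes back as an 'Unnamed: 0' column)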
df.head()

   Unnamed: 0    A  B  C
0           0  1.0  4  5
1           1  2.0  5  7
2           2  3.0  6  9
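
The `Unnamed: 0` column in this output is the old index, which `to_csv` writes by default. A minimal sketch of a round trip without it, using an in-memory buffer in place of a file and a hypothetical frame `df_demo`:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

buffer = io.StringIO()               # in-memory stand-in for a file on disk
df_demo.to_csv(buffer, index=False)  # do not write the row index
buffer.seek(0)                       # rewind before reading back

restored = pd.read_csv(buffer)
print(restored.columns.tolist())     # ['A', 'B']
```

If the index carries meaning, the alternative is to keep it in the file and read it back with `pd.read_csv(..., index_col=0)`.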


9. Apply/Map Functions

In [None]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3) # 3 random rows (here: every row, in random order)

   Unnamed: 0    A  B  C  A_squared B_label
0           0  1.0  4  5        1.0     Low
1           1  2.0  5  7        4.0     Med
2           2  3.0  6  9        9.0    High
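
Two behaviors are worth knowing here: simple arithmetic rarely needs `apply`, and `map` with a dictionary produces NaN for any value that has no entry. A sketch on a hypothetical frame `df_demo`, where the value 7 is deliberately left out of the mapping:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 7]})

df_demo['A_squared'] = df_demo['A'] ** 2   # vectorized, same result as apply
labels = df_demo['B'].map({4: 'Low', 5: 'Med', 6: 'High'})  # 7 has no entry

print(df_demo['A_squared'].tolist())  # [1.0, 4.0, 9.0]
print(labels.isna().tolist())         # [False, False, True]
```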


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [17]:
import numpy as np
import pandas as pd
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)
print(df.head())

   age  salary department  years_experience  is_manager
0   56   38392         IT              -0.8           0
1   46   60535  Marketing               3.4           1
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1
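
Row 0 has `years_experience` of -0.8 because a normal distribution with mean 5 and standard deviation 2 occasionally produces negative values. One possible cleanup is to clip at zero; the sketch below applies it to a standalone array holding the five values printed above, so the notebook's `df` is left untouched.

```python
import numpy as np

# First five years_experience values from the output above
raw = np.array([-0.8, 3.4, 5.0, 4.2, 4.1])

clipped = np.clip(raw, 0, None)   # values below 0 become 0.0, no upper bound

print(clipped.tolist())  # [0.0, 3.4, 5.0, 4.2, 4.1]
```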


Q1. View data structure

In [3]:
print("First 5 rows of the DataFrame: ")
print(df.head())

First 5 rows of the DataFrame: 
   age  salary department  years_experience  is_manager
0   56   38392         IT              -0.8           0
1   46   60535  Marketing               3.4           1
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1


In [4]:
print("Column names: ")
print(df.columns.tolist())

Column names: 
['age', 'salary', 'department', 'years_experience', 'is_manager']


In [5]:
print("Shape of DataFrame: ")
print(df.shape)

Shape of DataFrame: 
(100, 5)


In [6]:
print("Data types: ")
print(df.dtypes)

Data types: 
age                   int32
salary                int32
department           object
years_experience    float64
is_manager            int64
dtype: object


Q2. Get DataFrame Info and Summary Stats

In [7]:
print("DataFrame Info:")
df.info()  # info() prints its report directly and returns None

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int32  
 1   salary            100 non-null    int32  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int64  
dtypes: float64(1), int32(2), int64(1), object(1)
memory usage: 3.3+ KB


In [8]:
print("Summary Statistics for Numerical Columns:")
print(df.describe())

Summary Statistics for Numerical Columns:
              age         salary  years_experience  is_manager
count  100.000000     100.000000        100.000000  100.000000
mean    37.910000   77809.160000          4.823000    0.470000
std     12.219454   26058.643576          2.237822    0.501614
min     18.000000   30206.000000         -0.800000    0.000000
25%     26.750000   55141.000000          3.475000    0.000000
50%     38.000000   80932.000000          4.700000    0.000000
75%     46.250000   98107.250000          6.000000    1.000000
max     59.000000  119474.000000         11.400000    1.000000


In [9]:
print("Summary Statistics for Categorical Columns:")
print(df['department'].describe())

Summary Statistics for Categorical Columns:
count           100
unique            4
top       Marketing
freq             36
Name: department, dtype: object


Q3. Do Simple NumPy Operations

In [10]:
age_array = df['age'].to_numpy()
salary_array = df['salary'].to_numpy()
experience_array = df['years_experience'].to_numpy()

print("NumPy Operations:")
print("Mean age:", np.mean(age_array))
print("Median salary:", np.median(salary_array))
print("Standard deviation of years_experience:", np.std(experience_array))
print("Max salary:", np.max(salary_array))
print("Min age:", np.min(age_array))

NumPy Operations:
Mean age: 37.91
Median salary: 80932.0
Standard deviation of years_experience: 2.226605263624426
Max salary: 119474
Min age: 18
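
The standard deviation printed here (about 2.2266) differs from the 2.2378 reported by `df.describe()` because `np.std` defaults to the population formula (`ddof=0`) while pandas uses the sample formula (`ddof=1`). A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

population_std = np.std(values.to_numpy())    # ddof=0: divide by n
sample_std = values.std()                     # pandas default, ddof=1: divide by n - 1
matched = np.std(values.to_numpy(), ddof=1)   # NumPy with the pandas convention

print(population_std)                   # 2.0
print(np.isclose(sample_std, matched))  # True
```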


Q4. Filtering and Indexing Rows

In [11]:
filtered_df = df[(df['age'] > 40) & (df['department'] == 'IT')]
print("Employees over 40 in IT department:")
print(filtered_df.head())

Employees over 40 in IT department:
    age  salary department  years_experience  is_manager
0    56   38392         IT              -0.8           0
10   41   40965         IT               9.5           1
56   59   96842         IT               1.9           1
58   46   36776         IT               2.6           1
63   53   35530         IT               1.2           1


In [12]:
mean_salary = df['salary'].mean()
high_salary_df = df[df['salary'] > mean_salary]
print("Employees with above-average salary:")
print(high_salary_df.head())

Employees with above-average salary:
   age  salary department  years_experience  is_manager
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1
6   36  107373    Finance               6.7           1
7   40  109575  Marketing               6.9           0


In [13]:
print("Rows 10 to 15:")
print(df.iloc[10:16])

Rows 10 to 15:
    age  salary department  years_experience  is_manager
10   41   40965         IT               9.5           1
11   53   54538  Marketing               5.3           0
12   57  100592    Finance               6.1           1
13   41   38110         HR               2.0           1
14   20  109309  Marketing               5.0           1
15   39   57266         IT               5.8           1
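
Row filters can also be combined with column selection in a single `.loc` call, and `isin` handles membership tests against several values. A sketch on a hypothetical frame `df_demo` built from the first four rows shown earlier:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'age': [56, 46, 32, 25],
    'salary': [38392, 60535, 108603, 82256],
    'department': ['IT', 'Marketing', 'HR', 'HR'],
})

# Boolean mask plus a column list in one .loc call
older = df_demo.loc[df_demo['age'] > 40, ['age', 'department']]

# isin() tests membership in a list of values
hr_or_it = df_demo[df_demo['department'].isin(['HR', 'IT'])]

print(older.index.tolist())  # [0, 1]
print(len(hr_or_it))         # 3
```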


Q5. Adding a Column

In [14]:
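# Ratio of salary to experience; negative years_experience values give negative ratios (see row 0)
df['salary_per_year_experience'] = df['salary'] / df['years_experience']
# Division by zero years of experience yields inf or -inf; store those as NaN instead
df['salary_per_year_experience'] = df['salary_per_year_experience'].replace([np.inf, -np.inf], np.nan)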
print("DataFrame with new column (first 5 rows):")
print(df[['salary', 'years_experience', 'salary_per_year_experience']].head())

DataFrame with new column (first 5 rows):
   salary  years_experience  salary_per_year_experience
0   38392              -0.8               -47990.000000
1   60535               3.4                17804.411765
2  108603               5.0                21720.600000
3   82256               4.2                19584.761905
4  119135               4.1                29057.317073
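
Replacing infinities handles zero experience but leaves the negative ratio in row 0. One possible alternative, sketched on a hypothetical frame `df_demo`, is to mask non-positive experience before dividing so those rows become NaN:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'salary': [38392, 60535, 50000],
    'years_experience': [-0.8, 3.4, 0.0],   # third row is hypothetical
})

# Keep only positive experience; other entries become NaN before dividing
valid_experience = df_demo['years_experience'].where(df_demo['years_experience'] > 0)
df_demo['salary_per_year_experience'] = df_demo['salary'] / valid_experience

print(df_demo['salary_per_year_experience'].isna().tolist())  # [True, False, True]
```

Whether NaN is the right treatment depends on the analysis; it simply keeps invalid rows out of later means and sums.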


Q6. Grouping and Aggregation

In [15]:
grouped = df.groupby('department').agg({
    'salary': 'mean',
    'years_experience': 'median',
    'is_manager': 'sum'
})

print("Grouped by department (mean salary, median years_experience, total managers):")
print(grouped)

Grouped by department (mean salary, median years_experience, total managers):
                  salary  years_experience  is_manager
department                                            
Finance     83124.708333              4.25          10
HR          73523.052632              4.20          10
IT          75825.476190              4.30          13
Marketing   77684.722222              5.30          14


In [16]:
manager_counts = df.groupby('is_manager').size()
print("Number of employees by manager status:")
print(manager_counts)

Number of employees by manager status:
is_manager
0    53
1    47
dtype: int64
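
The same aggregations can be written with named aggregation, which sets the output column names directly, and `value_counts` gives the same counts as `groupby(...).size()`. A sketch on a small hypothetical frame `df_demo`; the names `mean_salary` and `managers` are chosen for the example:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'department': ['IT', 'HR', 'HR', 'IT'],
    'salary': [40000, 50000, 70000, 60000],
    'is_manager': [0, 1, 1, 0],
})

# Named aggregation: output_name=(input_column, function)
summary = df_demo.groupby('department').agg(
    mean_salary=('salary', 'mean'),
    managers=('is_manager', 'sum'),
)

# Equivalent to df_demo.groupby('is_manager').size()
counts = df_demo['is_manager'].value_counts().sort_index()

print(summary.loc['HR', 'mean_salary'])  # 60000.0
print(counts.tolist())                   # [2, 2]
```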
