# CAP 379

# **Python Packages – NumPy and Pandas**  
**Objective:** Introduce essential libraries for **numerical computing and data manipulation** in Python.  

## **4.1 Introduction to NumPy and Pandas**  
Before diving into NumPy and Pandas, let’s understand why they are essential:  

- **NumPy**: Used for **numerical operations, arrays, and mathematical computations**.  
- **Pandas**: Used for **handling tabular data (like Excel or SQL tables) efficiently**.  

## **4.2 Installing and Importing Libraries**  
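If these libraries are not already installed, install them from a notebook cell. The leading `!` tells Jupyter to run the line as a shell command; in a terminal, omit it:  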
```python
!pip install numpy pandas
```

Now, import them in your Python script:
```python
import numpy as np
import pandas as pd
```


## **4.3 Working with NumPy Arrays**  
### **4.3.1 What is a NumPy Array?**  
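- **NumPy arrays** (`ndarray`) store elements of a single data type in one block of memory, which typically makes numerical operations on them much **faster** than the same operations on Python lists.  
- They support **vectorized operations**: an operation or function is applied to the entire array at once, without an explicit Python loop.  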
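
To make the idea of vectorization concrete, here is a minimal sketch that doubles a few values first with a plain list and then with an array. The sample values and the variable names (`prices`, `doubled_list`, `doubled_arr`) are made up for illustration.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]  # hypothetical sample data

# Plain Python list: each element must be visited explicitly
doubled_list = [p * 2 for p in prices]

# NumPy array: the multiplication is applied to every element at once
prices_arr = np.array(prices)
doubled_arr = prices_arr * 2

print(doubled_list)  # [20.0, 40.0, 60.0]
print(doubled_arr)   # [20. 40. 60.]

# Note: `prices * 2` on a list repeats the list instead of doubling the values
print(len(prices * 2))  # 6
```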

### **4.3.2 Creating NumPy Arrays**  
#### **From a List**

```python
array1 = np.array([1, 2, 3, 4, 5])
print(array1)
```

```
[1 2 3 4 5]
```


#### **Multi-dimensional Array**

```python
array2 = np.array([[1, 2, 3], [4, 5, 6]])
print(array2)
```

```
[[1 2 3]
 [4 5 6]]
```


### **4.3.3 NumPy Array Properties**

```python
print(array2.shape)  # (rows, columns)
print(array2.size)   # Total number of elements
print(array2.dtype)  # Data type of elements
```

```
(2, 3)
6
int64
```
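
The default integer type reported by `dtype` can differ by platform and NumPy version, so the `int64` above is not guaranteed everywhere. The sketch below shows one way to request a type explicitly and to convert between types. The name `grid` is only illustrative.

```python
import numpy as np

# Request 64-bit integers explicitly instead of relying on the default
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(grid.ndim)      # 2 (number of dimensions)
print(grid.dtype)     # int64
print(grid.itemsize)  # 8 (bytes per element)

# astype returns a converted copy; the original array is unchanged
as_float = grid.astype(np.float64)
print(as_float.dtype)  # float64
```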


### **4.3.4 Special NumPy Arrays**

```python
zeros = np.zeros((3, 3))  # 3x3 matrix of zeros
ones = np.ones((2, 2))    # 2x2 matrix of ones
random_values = np.random.rand(3, 3)  # 3x3 matrix of random values in [0, 1); differs on every run
print(zeros, ones, random_values, sep="\n")  # Start each array on a new line
```

```
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 1.]
 [1. 1.]]
[[0.89838332 0.14381125 0.22887397]
 [0.99626746 0.89030519 0.813007  ]
 [0.98550573 0.89706488 0.57329229]]
```
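
Because random values change on every run, results like the ones above cannot be reproduced exactly. One possible approach, sketched here, is NumPy's `default_rng` generator with a fixed seed (the seed value 42 is an arbitrary choice). The sketch also shows `np.arange` and `np.linspace`, two other common ways to build arrays that the section above does not cover.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng = np.random.default_rng(seed=42)
sample_a = rng.random((3, 3))  # 3x3 uniform floats in [0, 1)

rng_again = np.random.default_rng(seed=42)
sample_b = rng_again.random((3, 3))

print(np.array_equal(sample_a, sample_b))  # True

# Evenly spaced values
evens = np.arange(0, 10, 2)       # start, stop (exclusive), step: 0, 2, 4, 6, 8
steps = np.linspace(0.0, 1.0, 5)  # 5 points from 0 to 1 inclusive: 0, 0.25, 0.5, 0.75, 1
print(evens)
print(steps)
```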


### **4.3.5 Mathematical Operations with NumPy**

```python
array = np.array([10, 20, 30])
print(array + 5)  # Add 5 to all elements
print(array * 2)  # Multiply all elements by 2
print(np.mean(array))  # Mean value
print(np.max(array))  # Maximum value
```

```
[15 25 35]
[20 40 60]
20.0
30
```
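
The same aggregations also work on multi-dimensional arrays, where the `axis` argument selects the direction to aggregate along. A small sketch, using a made-up 2x3 array named `scores`:

```python
import numpy as np

scores = np.array([[10, 20, 30],
                   [40, 50, 60]])

print(scores.sum())         # 210 (all elements)
print(scores.sum(axis=0))   # [50 70 90] (one total per column)
print(scores.mean(axis=1))  # [20. 50.] (one mean per row)
```

With `axis=0` the rows are collapsed, leaving one result per column; with `axis=1` the columns are collapsed, leaving one result per row.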


## **4.4 Using Pandas for Data Manipulation**  
### **4.4.1 Creating a Pandas DataFrame**

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 78, 88]
}

df = pd.DataFrame(data)
print(df)
```

```
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     78
3    David   40     88
```
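
After creating a DataFrame, it is often useful to check its size and column types before working with it. The sketch below rebuilds the same table under the hypothetical name `df_check` so that it runs on its own.

```python
import pandas as pd

df_check = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Score": [85, 90, 78, 88],
})

print(df_check.shape)            # (4, 3): 4 rows, 3 columns
print(list(df_check.columns))    # ['Name', 'Age', 'Score']
print(df_check.dtypes)           # Data type of each column
print(df_check["Score"].mean())  # 85.25
```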


### **4.4.2 Selecting Columns and Rows**

```python
print(df["Name"])  # Selecting a single column
```

```
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
```


```python
print(df[["Name", "Score"]])  # Selecting multiple columns
```

```
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     78
3    David     88
```


```python
print(df.loc[0])  # Select by row label
```

```
Name     Alice
Age         25
Score       85
Name: 0, dtype: object
```


```python
print(df.iloc[1])  # Select by row position
```

```
Name     Bob
Age       30
Score     90
Name: 1, dtype: object
```
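
With the default index (0, 1, 2, ...), labels and positions coincide, so `loc` and `iloc` look interchangeable. The difference becomes visible once the index holds other labels. A minimal sketch, using a hypothetical DataFrame `people` indexed by name:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [25, 30, 35], "Score": [85, 90, 78]},
    index=["alice", "bob", "charlie"],
)

print(people.loc["bob", "Score"])  # 90 (row label, column label)
print(people.iloc[1, 1])           # 90 (row position, column position)

# Label slices include the end label; position slices exclude the end position
print(len(people.loc["alice":"bob"]))  # 2
print(len(people.iloc[0:1]))           # 1
```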


### **4.4.3 Filtering Data**

```python
high_scorers = df[df["Score"] > 80]
print(high_scorers)
```

```
    Name  Age  Score
0  Alice   25     85
1    Bob   30     90
3  David   40     88
```
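
Filters can also combine several conditions. The sketch below assumes the same sample data, rebuilt here as `df_demo` so it runs on its own. Each condition needs its own parentheses, and the operators are `&` (and) and `|` (or) rather than the Python keywords.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Score": [85, 90, 78, 88],
})

# Both conditions must hold
older_high = df_demo[(df_demo["Age"] >= 30) & (df_demo["Score"] > 80)]
print(older_high["Name"].tolist())  # ['Bob', 'David']

# At least one condition must hold
young_or_low = df_demo[(df_demo["Age"] < 30) | (df_demo["Score"] < 80)]
print(young_or_low["Name"].tolist())  # ['Alice', 'Charlie']

# The same "and" filter written as a query string
same_rows = df_demo.query("Age >= 30 and Score > 80")
print(same_rows["Name"].tolist())  # ['Bob', 'David']
```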


### **4.4.4 Adding, Updating, and Removing Columns**

```python
df["Passed"] = df["Score"] > 80  # New Boolean column, based on the scores before the update below
df["Score"] = df["Score"] + 5  # Increase scores by 5 points ("Passed" is not recomputed)
df.drop(columns=["Age"], inplace=True)  # Remove the "Age" column
print(df)
```

```
      Name  Score  Passed
0    Alice     90    True
1      Bob     95    True
2  Charlie     83   False
3    David     93    True
```
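
The steps above change `df` itself. If the original table should stay untouched, one alternative is to chain `assign` and `drop`, each of which returns a new DataFrame. This is only a sketch of that style; the names `original` and `updated` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Score": [85, 90, 78, 88],
})

updated = (
    original
    .assign(Passed=lambda d: d["Score"] > 80)  # add a Boolean column
    .assign(Score=lambda d: d["Score"] + 5)    # overwrite Score with the raised values
    .drop(columns=["Age"])                     # returns a copy without "Age"
)

print(list(updated.columns))   # ['Name', 'Score', 'Passed']
print(list(original.columns))  # ['Name', 'Age', 'Score'] (unchanged)
```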


## **4.5 Handling Missing Data and Outliers**  
### **4.5.1 Checking for Missing Data**

```python
print(df.isnull().sum())  # Checks missing values per column
```

```
Name      0
Score     0
Passed    0
dtype: int64
```
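
The DataFrame above has no missing values, so every count is zero. To show what the check looks like when data is missing, here is a small sketch with a made-up table `staff` in which two scores are deliberately set to `NaN`.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85.0, np.nan, 78.0, np.nan],  # two missing values, added on purpose
})

missing_counts = staff.isnull().sum()
print(missing_counts)  # Name has 0 missing values, Score has 2

# Show only the rows where Score is missing
missing_rows = staff[staff["Score"].isnull()]
print(missing_rows["Name"].tolist())  # ['Bob', 'David']
```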


### **4.5.2 Dropping Rows with Missing Data**

```python
df.dropna(inplace=True)  # Remove rows that contain any missing value (there are none here)
print(df)
```

```
      Name  Score  Passed
0    Alice     90    True
1      Bob     95    True
2  Charlie     83   False
3    David     93    True
```
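
Since `df` contains no missing values, `dropna` leaves it unchanged. The sketch below uses a made-up table `records` with one missing score to compare two options: dropping the incomplete row, or filling the gap with the column mean.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [90.0, np.nan, 80.0],  # Bob's score is missing on purpose
})

# Option 1: drop rows with any missing value
dropped = records.dropna()
print(dropped["Name"].tolist())  # ['Alice', 'Charlie']

# Option 2: fill the gap with the mean of the known scores (mean skips NaN)
mean_score = records["Score"].mean()  # 85.0
filled = records.fillna({"Score": mean_score})
print(filled["Score"].tolist())  # [90.0, 85.0, 80.0]
```

Dropping is simplest but loses data; filling keeps every row at the cost of introducing an estimated value.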


### **4.5.3 Detecting and Handling Outliers**
An **outlier** is a value that is much higher or lower than most other values in the dataset.

#### **Finding Outliers using IQR (Interquartile Range)**

```python
Q1 = df["Score"].quantile(0.25)  # 25th percentile
Q3 = df["Score"].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1  # Interquartile range

# Defining lower and upper limits for outliers
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Identifying outliers
outliers = df[(df["Score"] < lower_limit) | (df["Score"] > upper_limit)]
print("Outliers:\n", outliers)
```

```
Outliers:
 Empty DataFrame
Columns: [Name, Score, Passed]
Index: []
```
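
No outliers are found above because all four scores lie close together. To see the rule flag something, the sketch below applies the same steps to a made-up series `scores` that contains one extreme value (200).

```python
import pandas as pd

scores = pd.Series([83, 90, 93, 95, 200], name="Score")  # 200 is added on purpose

q1 = scores.quantile(0.25)  # 90.0
q3 = scores.quantile(0.75)  # 95.0
iqr = q3 - q1               # 5.0

lower = q1 - 1.5 * iqr  # 82.5
upper = q3 + 1.5 * iqr  # 102.5

flagged = scores[(scores < lower) | (scores > upper)]
print(flagged.tolist())  # [200]
```

Note that 83 is kept: it is the lowest score, but it still lies above the lower limit of 82.5.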


### **4.5.4 Removing Outliers**

```python
df = df[(df["Score"] >= lower_limit) & (df["Score"] <= upper_limit)]
print(df)
```

```
      Name  Score  Passed
0    Alice     90    True
1      Bob     95    True
2  Charlie     83   False
3    David     93    True
```
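
Removing rows is not the only way to handle outliers. As a sketch, the example below uses a made-up table `sample` to compare filtering with `between` (equivalent to the `>=` and `<=` pair above) against capping extreme values at the limits with `clip`. Which option suits a dataset is a judgment call that this example does not settle.

```python
import pandas as pd

sample = pd.DataFrame({"Score": [83.0, 90.0, 93.0, 95.0, 200.0]})  # 200 is added on purpose

q1 = sample["Score"].quantile(0.25)
q3 = sample["Score"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # 82.5
upper = q3 + 1.5 * iqr  # 102.5

# Option 1: keep only rows inside the limits (both ends inclusive by default)
kept = sample[sample["Score"].between(lower, upper)]
print(kept["Score"].tolist())  # [83.0, 90.0, 93.0, 95.0]

# Option 2: keep every row but cap values at the limits
capped = sample["Score"].clip(lower=lower, upper=upper)
print(capped.tolist())  # [83.0, 90.0, 93.0, 95.0, 102.5]
```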


# **Practice Exercise**
1. **Create a NumPy array of 10 random numbers between 1 and 100.**
2. **Find the mean, min, and max of the array.**
3. **Create a Pandas DataFrame with 3 columns (Name, Salary, Department).**
4. **Add a new column "Bonus" as 10% of the salary.**
5. **Replace any missing salary values with the average salary.**
6. **Detect and remove outliers from the salary column.**
   
![Pandas DataFrame Example Output](images/practice_2.png)