## **Session 5**
**18, 19 and 21 November 2023**
## **Jupyter, Anaconda, NumPy, Pandas, Matplotlib (1/2)**



**Agenda for Today's and the Next Session:**

1. **Introduction to Jupyter Notebooks:**
   - Understanding the interactive computing environment.
   - How to create and navigate through Jupyter Notebooks.

2. **NumPy: The Foundation of Numerical Computing:**
   - Exploring the powerful NumPy library for numerical operations.
   - Working with arrays and fundamental mathematical functions.

3. **Pandas: Data Manipulation Made Easy:**
   - Introduction to Pandas DataFrames for efficient data handling.
   - Basic data manipulation techniques, such as filtering, grouping, and merging.

4. **Matplotlib: Visualizing Data with Python:**
   - Creating stunning visualizations using the Matplotlib library.
   - Customizing plots to effectively communicate insights.

Let's get into it!

# Purpose of Jupyter Notebooks

Jupyter Notebooks serve multiple purposes and provide unique advantages for data analysis, research, and collaboration:

1. **Interactive Data Exploration**: Jupyter Notebooks allow you to interactively explore and analyze data by executing code in individual cells. You can load datasets, clean and preprocess data, perform computations, and visualize results, all within a single document.

2. **Reproducibility and Documentation**: Notebooks provide a reproducible workflow as code, visualizations, and text are combined in one place. You can document your analysis steps, include explanatory text, and share your findings, making it easier for others to understand and reproduce your work.

3. **Visualization and Storytelling**: Notebooks support the integration of rich media, including interactive plots, images, videos, and mathematical equations. This enables you to create compelling visualizations and tell a story with your data, making your analysis more engaging and informative.

4. **Collaboration and Sharing**: Jupyter Notebooks can be easily shared with others, promoting collaboration and knowledge sharing. You can share your notebooks as files, through platforms like GitHub or GitLab, or publish them through services such as nbviewer or Google Colab.

By leveraging the capabilities of Jupyter Notebooks, you can conduct data analysis, experiment with code, document your workflow, and communicate your findings effectively in a single, versatile environment.


# Concepts Behind Jupyter Notebooks

Jupyter Notebooks are built on a few key concepts that contribute to their functionality and flexibility:

1. **Cells**: Notebooks are composed of cells, which can be of different types, such as Markdown cells for text, code cells for executing code, and raw cells for unprocessed text.

2. **Kernel**: Each notebook runs on a specific kernel, which is responsible for executing the code within the notebook. The kernel can be associated with a particular programming language like Python, R, or Julia.

3. **Code Execution**: Code cells can be executed individually or in a specific order. The output of code execution, such as printed results or visualizations, is displayed directly below the executed cell.

4. **State Persistence**: The state of variables and code execution is preserved throughout the notebook. This means that variables defined in one cell can be accessed and used in subsequent cells, allowing for interactive and incremental development.

5. **Interactivity**: Jupyter Notebooks support an interactive workflow, enabling you to edit and re-execute code cells, modify visualizations, and observe the immediate results.

6. **Markdown and LaTeX**: Notebooks support Markdown formatting, which allows you to write text, add headings, create lists, and incorporate hyperlinks. Additionally, LaTeX notation is supported for mathematical equations and symbols.

Understanding these underlying concepts will help you effectively utilize the features and capabilities of Jupyter Notebooks, enabling you to perform data analysis, experimentation, and documentation seamlessly.


# How Do Notebooks Work?
- Jupyter notebooks originated from the IPython project and were initially designed to send messages between a web app and an IPython kernel.
- The current architecture involves a notebook server that renders the notebook as a web app in the browser.
- Code written in the web app is sent through the server to the kernel for execution.
- The kernel processes the code and sends the output back to the server, which renders it in the browser.
- Notebooks are saved as JSON files with the `.ipynb` extension on the server.
- The kernel can be used for languages other than Python, making notebooks language agnostic.
- The name "Jupyter" comes from a combination of Julia, Python, and R, representing the ability to use different kernels for different languages.
- Jupyter notebooks offer the flexibility of running the server locally or on a remote machine or cloud instance.
- Because notebooks are accessed through a browser, files stored on a remote server can be reached from anywhere with network access to it.
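
Because a notebook is plain JSON on disk, its structure can be inspected with the standard library. The sketch below builds a minimal notebook by hand, assuming the nbformat 4 layout (top-level `cells`, `metadata`, `nbformat`, `nbformat_minor`); the cell contents are made up for illustration.

```python
import json

# A hand-written minimal notebook, assuming the nbformat 4 schema.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hello notebook"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not executed yet
            "outputs": [],
            "source": ["print('hi')"],
        },
    ],
}

# Serialize to the text that would be stored in a .ipynb file, then read it back.
ipynb_text = json.dumps(notebook, indent=1)
loaded = json.loads(ipynb_text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Opening a real `.ipynb` file in a text editor shows the same keys, with outputs stored next to the code that produced them.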

# Code Cell

In [1]:
# Simple Python code example

# Define a function to calculate the square of a number
def square(x):
    return x ** 2

# Calculate the square of a given number
number = 6
result = square(number)
print(f"The square of {number} is: {result}")

The square of 6 is: 36


# Installing, Launching, and Using

- Install Python: The simplest route is to install Anaconda, which bundles Python together with Jupyter Notebook.

- Install Jupyter: If installing separately, once Python is installed, open your terminal or command prompt and use the following command to install Jupyter using pip, the Python package manager:
```
pip install jupyter
```
- Launch Jupyter Notebook: After successful installation, you can launch Jupyter Notebook by running the following command in your terminal or command prompt:
```
jupyter notebook
```
- Jupyter Server Start: Jupyter Notebook will start a local server and open a web browser window displaying the Jupyter dashboard.
- Create a New Notebook: To create a new notebook, click on the "New" button in the Jupyter dashboard and select the desired programming language kernel (e.g., Python 3) to associate with the notebook.

## Magic keywords
Magic keywords, also known as magic commands or magic functions, are special commands in Jupyter Notebooks that provide functionality beyond what is available in regular Python code. They are prefixed with a `%` character for line magics or `%%` for cell magics.

There are two types of magic keywords in Jupyter Notebooks:

- Line Magic: These magic commands are used for single-line operations and are prefixed with a single `%` character. They are placed at the beginning of a line and affect only that line.

  - Example: `%timeit`, `%ls`, `%cd`

- Cell Magic: These magic commands operate on entire code cells and are prefixed with `%%` on the first line of a cell. They affect the behavior of the entire cell and can be used to perform specialized operations or execute code in different languages.

  - Example: `%%timeit`, `%%writefile`, `%%bash`

In [2]:
%timeit sum(range(1000001))

34.7 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
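
Outside a notebook, magics are not available, but the `%timeit` example above can be approximated with the standard-library `timeit` module. This is a rough sketch: the `repeat` and `number` values are chosen to mirror the "7 runs, 10 loops each" line in the output, and the measured time will differ by machine.

```python
import timeit

# Time the same expression as the %timeit example, using the standard library.
# repeat=7 and number=10 mirror "7 runs, 10 loops each".
timings = timeit.repeat("sum(range(1000001))", repeat=7, number=10)

# Each entry is the total time for 10 loops, so divide to get seconds per loop.
per_loop = [t / 10 for t in timings]
best_ms = min(per_loop) * 1000
print(f"best of 7 runs: {best_ms:.2f} ms per loop")
```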


# Converting Notebook Format
Jupyter Notebooks can be converted to various file formats, including:

- HTML
- PDF
- Markdown
- LaTeX
- JSON (.ipynb by default)

### **Jupyter Reflection**

1. **What is a Jupyter Notebook?**

   a. A word processing software  
   b. An integrated development environment (IDE)  
   c. An interactive document with code and multimedia elements  
   d. A programming language

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: c**
</details>

2. **What is the purpose of a Jupyter Notebook's kernel?**

   a. Handles user interface interactions  
   b. Executes the code and manages computations  
   c. Stores and organizes data  
   d. Manages version control

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: b**
</details>

3. **Which programming language is commonly associated with Jupyter Notebooks?**

   a. Java  
   b. JavaScript  
   c. Python  
   d. C++

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: c**
</details>

4. **What is a code cell in a Jupyter Notebook used for?**

   a. Adding comments and explanations  
   b. Writing and executing code  
   c. Inserting images and multimedia  
   d. None of the above

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: b**
</details>

5. **How does Jupyter support different programming languages?**

   a. It doesn't; Jupyter only supports Python.  
   b. By using separate notebooks for each language.  
   c. Through the use of different kernels for each language.  
   d. By embedding interpreters in each cell.

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: c**
</details>

6. **What does the term "markdown" refer to in the context of Jupyter Notebooks?**

   a. A programming language  
   b. A type of code cell  
   c. A text formatting language for documentation  
   d. A type of kernel

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: c**
</details>

7. **How can you restart the Jupyter Notebook kernel?**

   a. By closing and reopening the notebook  
   b. Using the "Restart Kernel" option in the Jupyter interface  
   c. Pressing a specific keyboard shortcut  
   d. All of the above

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: b**
</details>

8. **In Jupyter, what is the purpose of the `%matplotlib inline` magic command?**

   a. To display matplotlib plots in the notebook interface  
   b. To define inline comments in code cells  
   c. To enable inline debugging  
   d. None of the above

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: a**
</details>

9. **What is the file extension of a Jupyter Notebook file?**

   a. .nb  
   b. .jn  
   c. .ipynb  
   d. .jupyter

<details>
<summary>Click to reveal the answer</summary>
   **Correct answer: c**
</details>

10. **How can you share a Jupyter Notebook with others?**

    a. Exporting as a PDF or HTML  
    b. Sending the .ipynb file  
    c. Using online platforms like GitHub  
    d. All of the above

<details>
<summary>Click to reveal the answer</summary>
    **Correct answer: d**
</details>

# Numpy
### What is NumPy:

NumPy (Numerical Python) is a Python library for scientific computing and numerical operations.
It provides powerful data structures, such as **arrays** and **matrices**, along with a collection of mathematical functions to **efficiently** perform numerical computations.

### How is it implemented:

NumPy is implemented in a combination of **Python and low-level C code** for **performance** optimization.
The **core** functionality and high-performance numerical operations are implemented in **C**, while the Python interface provides ease of use and flexibility.

### What does it do:

- NumPy provides efficient multi-dimensional array objects, known as ndarrays, which allow you to store and manipulate large amounts of homogeneous numerical data.
- It offers a wide range of mathematical functions for operations like:
  - array manipulation,
  - linear algebra,
  - Fourier transforms,
  - random number generation, and more.
- NumPy's array operations are faster and more memory-efficient compared to traditional Python lists, making it suitable for large-scale numerical computations and data analysis.
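
A short sketch of what an `ndarray` looks like in practice may help here. The array values are arbitrary; the attributes and methods used (`ndim`, `shape`, `size`, `dtype`, `sum`) are standard NumPy.

```python
import numpy as np

# A 2 x 3 array of integers; all elements share one dtype.
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)   # number of dimensions
print(arr.shape)  # length along each dimension
print(arr.size)   # total number of elements
print(arr.dtype)  # element type (integer width depends on the platform)

# Arithmetic is applied element-wise, without an explicit Python loop.
doubled = arr * 2
col_sums = arr.sum(axis=0)  # sum down each column
```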

### Why should you know it:

- Understanding NumPy is crucial for working with numerical data efficiently and effectively.
- It provides a solid foundation for many other Python libraries, such as pandas, SciPy, and scikit-learn, which build upon NumPy arrays.

## Installing Numpy
```
conda install numpy
```
or
```
conda install numpy=X.XX
```
or
```
pip install --upgrade numpy==X.XX
```


In [3]:
import numpy as np

# Create a large list of numbers
numbers_list = list(range(1, 1000001))

# Perform element-wise squaring using normal Python
%timeit squared_list = [num**2 for num in numbers_list]

# Perform element-wise squaring using NumPy vectorization
%timeit squared_array = np.array(numbers_list)**2

391 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
75.7 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### **NumPy Reflection**

What does numpy.ndarray.ndim return?  
a) The total number of elements in the array  
b) The number of dimensions of the array  
c) The shape of the array  
d) The data type of the elements in the array

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) The number of dimensions of the array
</details>

What does numpy.ndarray.shape return?  
a) The total number of elements in the array  
b) The number of dimensions of the array  
c) The shape of the array  
d) The data type of the elements in the array

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) The shape of the array
</details>

What does numpy.ndarray.size return?  
a) The total number of elements in the array  
b) The number of dimensions of the array  
c) The shape of the array  
d) The data type of the elements in the array

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) The total number of elements in the array
</details>

Which function is used to save a NumPy array to a file?  
a) numpy.save  
b) numpy.load  
c) numpy.random.random  
d) numpy.random.randint  

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.save
</details>

Which function is used to load a saved NumPy array from a file?  
a) numpy.save  
b) numpy.load  
c) numpy.random.random  
d) numpy.random.normal

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.load
</details>

Which function is used to generate an array of random values between 0 and 1?  
a) numpy.random.random  
b) numpy.random.randint  
c) numpy.random.normal  
d) numpy.random.permutation

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.random.random
</details>

Which function is used to generate an array of random integers within a specified range?  
a) numpy.random.random  
b) numpy.random.randint  
c) numpy.random.normal  
d) numpy.random.permutation

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.random.randint
</details>

Which function is used to generate an array of random numbers from a normal distribution?  
a) numpy.random.random  
b) numpy.random.randint  
c) numpy.random.normal  
d) numpy.random.permutation

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.random.normal
</details>

Which function is used to randomly permute the elements of an array?  
a) numpy.random.random  
b) numpy.random.randint  
c) numpy.random.normal  
d) numpy.random.permutation  

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.random.permutation
</details>

Which function is used to create an array filled with ones?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.eye

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.ones
</details>

Which function is used to create an array filled with zeros?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.eye

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.zeros
</details>

Which function is used to reshape a NumPy array to a different shape?  
a) numpy.save  
b) numpy.load  
c) numpy.random.random  
d) numpy.reshape

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.reshape
</details>

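Which method changes the shape of a NumPy array in-place?  
a) numpy.save  
b) numpy.load  
c) numpy.random.random  
d) numpy.ndarray.resize

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.ndarray.resize<br>
*`numpy.ndarray.reshape` returns a new array object (usually a view) and leaves the original array's shape unchanged.*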
</details>

Which function is used to create an array with a specified constant value?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.eye

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.full
</details>

Which function is used to create an identity matrix?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.eye

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.eye
</details>

Which function is used to extract the diagonal elements of an array?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.diag

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.diag
</details>

Which function is used to find unique elements in an array?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.unique

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.unique
</details>

Which function is used to create a NumPy array from a given sequence?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.array

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.array
</details>

Which function is used to create a sequence of evenly spaced values with a given step size within a specified interval?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.arange

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.arange
</details>

Which function is used to create a sequence of evenly spaced values (given the number of values) over a specified range?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.linspace

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.linspace
</details>

Which function is used to create a copy of a NumPy array?  
a) numpy.ones  
b) numpy.zeros  
c) numpy.full  
d) numpy.ndarray.copy

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.ndarray.copy
</details>

Which function is used to insert values into a NumPy array at a specified position?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.hstack

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.insert
</details>

Which function is used to delete values from a NumPy array at a specified position?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.hstack

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.delete
</details>

Which function is used to append values to the end of a NumPy array?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.hstack

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.append
</details>

Which function is used to horizontally stack multiple NumPy arrays?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.hstack

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.hstack
</details>

Which function is used to vertically stack multiple NumPy arrays?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.vstack

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.vstack
</details>

Which function is used to sort the elements of a NumPy array along a specified axis?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.sort

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.sort<br>
*For a 2-D array, `axis=0` sorts down each column and `axis=1` sorts along each row; the default is `axis=-1`, the last axis.*
</details>

Which method is used to sort the elements of a NumPy array in-place?  
a) numpy.insert  
b) numpy.delete  
c) numpy.append  
d) numpy.ndarray.sort

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.ndarray.sort
</details>

Which function is used to find the common elements between two NumPy arrays?  
a) numpy.intersect1d  
b) numpy.setdiff1d  
c) numpy.union1d

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.intersect1d
</details>

Which function is used to find the elements present in one array but not in another?  
a) numpy.intersect1d  
b) numpy.setdiff1d  
c) numpy.union1d

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.setdiff1d
</details>

Which function is used to find the unique elements from two arrays, removing any duplicates?  
a) numpy.intersect1d  
b) numpy.setdiff1d  
c) numpy.union1d

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.union1d
</details>

Which function is used to perform element-wise addition of two NumPy arrays?  
a) numpy.add  
b) numpy.subtract  
c) numpy.multiply  
d) numpy.divide

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.add
</details>

Which function is used to perform element-wise subtraction of two NumPy arrays?  
a) numpy.add  
b) numpy.subtract  
c) numpy.multiply  
d) numpy.divide

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.subtract
</details>

Which function is used to perform element-wise multiplication of two NumPy arrays?  
a) numpy.add  
b) numpy.subtract  
c) numpy.multiply  
d) numpy.divide

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.multiply
</details>

Which function is used to perform element-wise division of two NumPy arrays?  
a) numpy.add  
b) numpy.subtract  
c) numpy.multiply  
d) numpy.divide

<details>
<summary>Click to reveal the answer</summary>
Correct answer: d) numpy.divide
</details>

Which function is used to calculate the exponential values of elements in a NumPy array?  
a) numpy.exp  
b) numpy.power  
c) numpy.sqrt  
d) numpy.ndarray.min

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.exp
</details>

Which function is used to raise the elements of a NumPy array to a specified power?  
a) numpy.exp  
b) numpy.power  
c) numpy.sqrt  
d) numpy.ndarray.min

<details>
<summary>Click to reveal the answer</summary>
Correct answer: b) numpy.power
</details>

Which function is used to calculate the square root of elements in a NumPy array?  
a) numpy.exp  
b) numpy.power  
c) numpy.sqrt  
d) numpy.ndarray.min

<details>
<summary>Click to reveal the answer</summary>
Correct answer: c) numpy.sqrt
</details>

Which method is used to find the minimum value of a NumPy array?  
a) numpy.ndarray.min  
b) numpy.ndarray.max  
c) numpy.mean  
d) numpy.median

<details>
<summary>Click to reveal the answer</summary>
Correct answer: a) numpy.ndarray.min
</details>
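
Several of the functions from the questions above can be tried in a few lines. The sketch below uses small made-up arrays; it is meant as a quick reference, not a complete tour.

```python
import numpy as np

# arange uses a step size; linspace uses a number of points (endpoint included).
steps = np.arange(0, 10, 2)          # 0, 2, 4, 6, 8
points = np.linspace(0.0, 1.0, 5)    # 0.0, 0.25, 0.5, 0.75, 1.0

# reshape returns a new array object; the original keeps its shape.
flat = np.arange(6)
grid = flat.reshape(2, 3)

# hstack joins side by side, vstack stacks on top of each other.
a = np.array([1, 2, 3])
b = np.array([3, 4, 5])
side_by_side = np.hstack([a, b])     # shape (6,)
stacked = np.vstack([a, b])          # shape (2, 3)

# Set-style helpers return sorted, unique values.
common = np.intersect1d(a, b)        # in both
only_a = np.setdiff1d(a, b)          # in a but not in b
either = np.union1d(a, b)            # in a or b

# np.sort returns a sorted copy; ndarray.sort sorts in place.
unsorted = np.array([3, 1, 2])
sorted_copy = np.sort(unsorted)
unsorted.sort()
```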

### Q1: We have Python lists. Why NumPy arrays?

<div align="center">

| Feature                | Python Lists                                           | NumPy Arrays                                                 |
|------------------------|--------------------------------------------------------|--------------------------------------------------------------|
| Data type              | Can contain different data types                       | Homogeneous; all elements share one type (int, float, etc.) |
| Performance            | Slower for large datasets and mathematical operations  | Faster due to vectorized operations                          |
| Functionality          | No built-in element-wise operations or broadcasting    | Extensive array operations and broadcasting                  |
| Memory consumption     | More memory overhead due to flexibility                | Less memory overhead due to uniformity                       |
| Syntax                 | Familiar syntax with built-in list functions           | Requires importing NumPy and using array syntax              |
| Mathematical functions | Limited mathematical functions available               | Comprehensive mathematical functions                         |

</div>
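
A few of the rows in the table can be checked directly. This is a minimal sketch with arbitrary values; the variable names are only illustrative.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# The same operator means different things for the two types.
list_times_two = values * 2      # repetition: [1, 2, 3, 1, 2, 3]
array_times_two = arr * 2        # element-wise: [2, 4, 6]

# Lists can mix types freely; arrays convert everything to one common dtype.
mixed_list = [1, 2.5, 3]
mixed_array = np.array(mixed_list)   # the integers are upcast to float64

# An array stores raw values in one buffer of itemsize bytes each.
buffer_bytes = arr.nbytes            # equals arr.size * arr.itemsize
```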

### Q2: What is broadcasting? Why does it matter?

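In [6]:
import numpy as np

# NumPy arrays
a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([10, 20])           # shape (2,), stretched across both rows of a
# b = np.array([10])             # shape (1,) also broadcasts: 10 is added to every element
# b = np.array([10, 12, 14])     # shape (3,) does not match (2, 2): raises ValueError

# Broadcasting in NumPy
print(a + b)

# Output
# [[11 22]
#  [13 24]]

# Python lists
c = [[1, 2], [3, 4]]
d = [10, 20]
# d = 20                         # TypeError: can only concatenate list (not "int") to list

# No broadcasting with lists: + concatenates
print(c + d)

# Output
# [[1, 2], [3, 4], 10, 20]

[[11 22]
 [13 24]]
[[1, 2], [3, 4], 10, 20]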
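
Broadcasting follows a simple shape rule: shapes are compared from the rightmost dimension, and two dimensions are compatible when they are equal or one of them is 1. It matters because it lets arrays of different shapes be combined without writing loops or making explicit copies. The sketch below illustrates the rule with small made-up arrays.

```python
import numpy as np

row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[1], [2], [3]])       # shape (3, 1)

# Shapes are compared from the right; a dimension of size 1 is stretched.
# (3, 1) with (3,) gives (3, 3).
table = col + row

# A scalar broadcasts against any shape.
shifted = row + 1

# Shapes (2, 2) and (3,) cannot be matched: 2 != 3 and neither is 1.
try:
    np.ones((2, 2)) + np.ones(3)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```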

# Pandas
Pandas is a widely-used open-source Python library for data manipulation and analysis.

### What is pandas?

- Pandas is a Python library that provides data structures and data analysis tools.
- It offers powerful data manipulation capabilities, such as reading and writing data, cleaning and transforming data, and performing statistical analysis.

### Why is pandas popular?

- Pandas provides a convenient and efficient way to work with structured data, making it popular among data analysts, scientists, and engineers.
- It offers powerful data structures, such as the DataFrame, which allows for easy handling and manipulation of tabular data.
- Pandas provides a wide range of functions and methods for data analysis and data exploration, making it a versatile tool for various tasks.
- It integrates well with other popular libraries in the data science ecosystem, such as NumPy, Matplotlib, and scikit-learn, enabling seamless workflow and interoperability.

### How is pandas implemented?

- Pandas is implemented in Python, with performance-critical operations implemented in low-level languages like C or Cython for efficiency.
- It utilizes NumPy arrays as the underlying data structure to efficiently store and manipulate large datasets.
- The main data structure in pandas is the DataFrame, which is a two-dimensional table-like structure that can hold heterogeneous data types.
- The library provides a wide range of functions, methods, and operations optimized for data manipulation and analysis.
- Pandas leverages the power of vectorized operations, allowing for efficient and fast data processing.

### Let's populate a Pandas DataFrame

In [7]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset (returns data (150,4), target (0, 1, 2), target_names, DESCR, feature_names)
iris = load_iris()

# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['Target'] = iris.target_names[iris.target]

print(df.head(), df.shape)

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   Target  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa   (150, 5)


### Q3: If it is all about handling data, why two separate libraries in Pandas and NumPy?

Pandas and NumPy serve different purposes and work well together in data analysis. Here are a few key reasons why Pandas is needed in addition to NumPy:
- Data structures: Pandas provides high-level data structures like DataFrames and Series to facilitate data manipulation. NumPy only offers ndarrays.
- Indexing: Pandas has label-based and position-based accessors such as `.loc`, `.iloc`, and `.at` that make it easier to subset and transform data. NumPy indexing is purely positional (integers, slices, boolean masks).
- Missing data: Pandas handles missing data gracefully with NaN values and helpers such as `isna` and `fillna`. NumPy has `np.nan` for floating-point data only; other cases need masked arrays from the `numpy.ma` module.
- Data alignment: Pandas aligns data based on labels during operations. NumPy combines arrays purely by position.
- Data munging: Pandas is designed for practical data cleaning, preparation and munging prior to analysis. NumPy focuses on mathematical functions.
- Time Series: Pandas has built-in time series functionality like date_range, shift, rolling averages/sums. Much more limited in NumPy.
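
The alignment and missing-data points can be seen in a small example. This is a sketch with made-up Series; the labels `a` to `d` are arbitrary.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (made-up data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Pandas aligns on labels; labels present on only one side become NaN.
aligned_sum = s1 + s2

# NumPy adds by position and knows nothing about the labels.
positional_sum = s1.to_numpy() + s2.to_numpy()

# Missing values can be counted or filled directly.
n_missing = int(aligned_sum.isna().sum())
filled = aligned_sum.fillna(0.0)
```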

Here is a comparison of NumPy Arrays, Pandas Series, and DataFrames:

<div align="center">

| Feature            | NumPy Arrays                                | Pandas Series                                           | Pandas DataFrame                              |
|--------------------|---------------------------------------------|---------------------------------------------------------|-----------------------------------------------|
| Dimensions         | N-dimensional                               | 1-dimensional                                           | 2-dimensional                                 |
| Data Structure     | Homogeneous                                 | Homogeneous                                             | Heterogeneous (each column has its own type)  |
| Labels (Index)     | No labels                                   | Has labels (Index)                                      | Has labels (Index)                            |
| Columns            | N/A                                         | N/A                                                     | Named columns                                 |
| Main Use Cases     | Numerical computations and array operations | Time-series data, representing a single column of data  | Heterogeneous, tabular data                   |
| Creation           | From Python lists, tuples, or other arrays  | From Python lists, tuples, or NumPy arrays              | From dictionaries, lists, CSV files, etc.     |
| Accessing Elements | By index position                           | By label (index) or boolean indexing                    | By label (index) or column name               |
| Example Syntax     | `numpy_array[2]`                            | `series['label']`                                       | `df['column_name']`                           |


</div>
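
The access patterns in the last rows of the table look like this in code. The DataFrame below is a small hypothetical table; its column names and values are illustrative only.

```python
import pandas as pd

# A small hypothetical table; column names and values are illustrative.
df_demo = pd.DataFrame(
    {"species": ["setosa", "virginica", "setosa"], "petal_length": [1.4, 5.1, 1.3]},
    index=["x", "y", "z"],
)

# Selecting one column returns a Series that keeps the row labels.
lengths = df_demo["petal_length"]

by_label = df_demo.loc["y", "petal_length"]     # row label, column name
by_position = df_demo.iloc[1, 1]                # row position, column position

# Boolean indexing filters rows with a condition.
setosa_rows = df_demo[df_demo["species"] == "setosa"]
```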

### Pandas and NumPy working together

In [8]:
# Using Pandas to calculate mean and standard deviation
%timeit sepal_length_mean_pandas = df['sepal length (cm)'].mean()
%timeit sepal_length_std_pandas = df['sepal length (cm)'].std()

# Using NumPy to calculate mean and standard deviation
%timeit sepal_length_np = df['sepal length (cm)'].to_numpy(); sepal_length_mean_numpy = np.mean(sepal_length_np); sepal_length_std_numpy = np.std(sepal_length_np)

55.9 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
120 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
32.5 µs ± 848 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
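
One caveat when comparing the two libraries: their standard deviations use different defaults. `Series.std()` computes the sample standard deviation (`ddof=1`), while `np.std` computes the population standard deviation (`ddof=0`), so the values differ unless `ddof` is set explicitly. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up sample; any numeric column behaves the same way.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pandas_std = values.std()                              # sample std, ddof=1 by default
numpy_std = np.std(values.to_numpy())                  # population std, ddof=0 by default
numpy_std_ddof1 = np.std(values.to_numpy(), ddof=1)    # matches the pandas default

print(pandas_std, numpy_std, numpy_std_ddof1)
```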


### Q4: What are the usual steps in a data-driven task that uses pandas, NumPy, Matplotlib, and seaborn?
### Assuming the Iris dataset.
Here is a tentative list:
1. Load and Explore the Dataset
2. Summary Statistics and Data Insights
3. Data Visualization: Scatter Plots
4. Data Visualization: Pair Plots
5. Data Visualization: Box Plots
6. Data Preprocessing: Standardization
7. Data Preprocessing: Encoding Categorical Variables
8. Data Preprocessing: Handling Missing Values
9. Feature Engineering: Creating New Features
10. Data Filtering and Subsetting
11. Data Grouping and Aggregation
12. Correlation Analysis
13. Data Visualization: Heatmap
14. Data Visualization: Violin Plots
15. Data Visualization: Swarm Plots
16. Data Visualization: Histograms and Density Plots
17. Data Visualization: Bar Plots
18. Machine Learning: Train-Test Split
19. Machine Learning: Model Training and Evaluation
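
As a preview, a few of the non-plotting steps can be sketched on a tiny table. The data below is a made-up stand-in for the Iris dataset, and the choice of steps (6, 7, 10, and 11) is only for illustration.

```python
import pandas as pd

# A tiny made-up stand-in for the Iris table.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "class": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# Step 6: standardization (zero mean, unit sample standard deviation).
col = toy["petal_length"]
toy["petal_length_z"] = (col - col.mean()) / col.std()

# Step 7: encode the categorical column as integer codes.
toy["class_code"] = toy["class"].astype("category").cat.codes

# Step 10: filter rows with a boolean condition.
long_petals = toy[toy["petal_length"] > 4.6]

# Step 11: group by class and aggregate.
mean_by_class = toy.groupby("class")["petal_length"].mean()
```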

### Step 1: Load and Explore the Dataset

In [9]:
# Load the Iris dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(url, names=column_names)

# Display the first few rows of the DataFrame using the `head()` function
print(df.head())

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
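
Beyond `head()`, a few more calls are commonly used for a first look at a dataset. The sketch below re-enters the five rows printed above by hand so that it runs without a network connection; on the full DataFrame the same calls apply unchanged.

```python
import pandas as pd

# The five rows shown by head(), typed in by hand so the sketch runs offline
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "class": ["Iris-setosa"] * 5,
})

print(df.shape)          # (number of rows, number of columns)
print(df.dtypes)         # data type of each column
print(df.isna().sum())   # count of missing values per column
df.info()                # concise summary: index, dtypes, non-null counts
```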


### Step 2: Summary Statistics and Data Insights

In [10]:
# Get basic summary statistics of the dataset using the `describe()` function
summary_stats = df.describe()

# Get the number of samples for each class using the `value_counts()` function
class_counts = df['class'].value_counts()

# Get unique classes in the dataset using the `unique()` function
unique_classes = df['class'].unique()

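# Get the correlation between the numeric features using the `corr()` function
# (the string-valued 'class' column is dropped first; recent pandas versions raise an error otherwise)
correlation_matrix = df.drop(columns='class').corr()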

# Display the summary statistics, class counts, unique classes, and correlation matrix
print("Summary Statistics:")
print(summary_stats)

print("\nClass Counts:")
print(class_counts)

print("\nUnique Classes:")
print(unique_classes)

print("\nCorrelation Matrix:")
print(correlation_matrix)

Summary Statistics:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Class Counts:
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64

Unique Classes:
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Correlation Matrix:
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.109369      0.871754     0.817954
sepal_width      -0.109369     1.000000     -0.420516    -0.356544
petal_length      0.871754    -0.420516    
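
A correlation matrix is only defined for numeric columns, so a string column such as `class` has to be excluded before calling `corr()`; recent pandas releases raise an error rather than silently skipping it. The sketch below shows one way to do this with `select_dtypes`, using a small hand-made DataFrame (illustrative values, an assumption for this example) in place of the full dataset.

```python
import pandas as pd

# Small illustrative frame; the values are assumptions for this sketch
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
numeric = df.select_dtypes(include="number")
correlation_matrix = numeric.corr()

print(correlation_matrix)
```

The result is symmetric with ones on the diagonal, since every column is perfectly correlated with itself.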



### **Pandas Reflection**

**1. What is Pandas?**

   a. A programming language  
   b. A data visualization library  
   c. A data manipulation and analysis library  
   d. An operating system  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: c**

</details>

**2. Which data structure is used in Pandas for one-dimensional labeled data?**

   a. Lists  
   b. Arrays  
   c. Series  
   d. DataFrames  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: c**

</details>

**3. What is a DataFrame in Pandas?**

   a. A two-dimensional labeled array  
   b. A three-dimensional array  
   c. A collection of dictionaries  
   d. A type of plot  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: a**

</details>

**4. How do you select a column 'column_name' from a DataFrame 'df' in Pandas?**

   a. `df.select('column_name')`  
   b. `df['column_name']`  
   c. `df.loc['column_name']`  
   d. `df.iloc['column_name']`  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: b**

</details>
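
To see why the other options do not select a column, here is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical). It contrasts bracket selection with the label-based and position-based forms that do work for columns.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"column_name": [10, 20, 30], "other": ["a", "b", "c"]})

col = df["column_name"]               # a Series holding one column
same_col = df.loc[:, "column_name"]   # label-based: all rows, one column
first_col = df.iloc[:, 0]             # position-based: all rows, first column
sub_frame = df[["column_name"]]       # a one-column DataFrame (note the double brackets)

# df.loc['column_name'] looks the label up among the ROW labels, so it fails here
try:
    df.loc["column_name"]
    loc_row_lookup_failed = False
except KeyError:
    loc_row_lookup_failed = True

print(type(col).__name__, type(sub_frame).__name__, loc_row_lookup_failed)
```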

**5. Which method is used to read a CSV file into a Pandas DataFrame?**

   a. `pd.read_excel()`  
   b. `pd.read_csv()`  
   c. `pd.read_table()`  
   d. `pd.load_csv()`  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: b**

</details>

**6. What does the `head()` method in Pandas do?**

   a. Prints the last few rows of a DataFrame  
   b. Prints the summary statistics of a DataFrame  
   c. Prints the first few rows of a DataFrame  
   d. Prints the shape of a DataFrame  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: c**

</details>

**7. How do you handle missing values in Pandas?**

   a. Use the `dropna()` method  
   b. Use the `fillna()` method  
   c. Use the `isna()` function  
   d. All of the above  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: d**

</details>
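
The three methods play different roles: `isna()` detects missing values, `dropna()` removes them, and `fillna()` replaces them. A minimal sketch on a made-up DataFrame (hypothetical values), filling gaps with the column mean as one possible strategy:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

mask = df.isna()                # Boolean frame: True where a value is missing
dropped = df.dropna()           # keep only rows without any missing value
filled = df.fillna(df.mean())   # replace each gap with its column mean

print(mask.sum())   # missing values per column
print(dropped)
print(filled)
```

Which strategy is appropriate depends on the data; dropping rows loses information, while filling introduces values that were never observed.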

**8. What does the `groupby()` function in Pandas do?**

   a. Sorts the DataFrame  
   b. Groups data based on specified criteria  
   c. Calculates the mean of the DataFrame  
   d. Reshapes the DataFrame  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: b**

</details>
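
Grouping is usually followed by an aggregation. The sketch below, on a made-up DataFrame (hypothetical column names and values), computes one statistic per group and then several at once with `agg()`.

```python
import pandas as pd

# Hypothetical data: two groups with two measurements each
df = pd.DataFrame({
    "class": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# One aggregate per group
means = df.groupby("class")["value"].mean()

# Several aggregates per group in a single call
summary = df.groupby("class")["value"].agg(["mean", "max", "count"])

print(means)
print(summary)
```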

**9. In Pandas, what is the purpose of the `merge()` function?**

   a. Combines two DataFrames based on common columns or indices  
   b. Adds a new column to a DataFrame  
   c. Sorts the DataFrame in ascending order  
   d. Removes duplicate rows from a DataFrame  

<details>
<summary>Click to reveal the answer</summary>

**Correct answer: a**

</details>
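
As a minimal sketch of merging on a common column, the example below joins two made-up DataFrames (hypothetical names and values) on `id`. The `how` parameter decides which keys survive: the default inner join keeps only keys present in both frames, while an outer join keeps all keys and fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical tables sharing an 'id' column
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

inner = pd.merge(left, right, on="id")               # keys present in both frames
outer = pd.merge(left, right, on="id", how="outer")  # keys present in either frame

print(inner)
print(outer)
```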

**10. Which of the following is NOT a valid method for sorting a Pandas DataFrame?**

   a. `df.sort_values()`  
   b. `df.sort_index()`  
   c. `df.order()`  
   d. `df.sort(columns='column_name')`  

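<details>
<summary>Click to reveal the answer</summary>

**Correct answer: c.** `order()` is not a DataFrame method. Note that option d, `df.sort(columns='column_name')`, worked only in old pandas releases; it was deprecated in favor of `sort_values()` and has since been removed, so it also fails in current versions.

</details>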
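
The two sorting methods that current pandas provides can be sketched as follows, using a made-up DataFrame (hypothetical values and index labels). The `hasattr` check at the end simply confirms that a DataFrame has no `order` attribute.

```python
import pandas as pd

# Hypothetical data with a non-default index so both sort kinds are visible
df = pd.DataFrame({"column_name": [3, 1, 2]}, index=["x", "y", "z"])

by_value = df.sort_values("column_name")                        # ascending by column values
by_value_desc = df.sort_values("column_name", ascending=False)  # descending by column values
by_index = by_value.sort_index()                                # back to index order

has_order = hasattr(df, "order")   # DataFrame has no order() method

print(by_value)
print(by_value_desc)
print(by_index)
print("df.order exists:", has_order)
```

Both methods return a new DataFrame by default and leave the original unchanged.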