
# **Data Toolkit: Questions and Answers**

### **1. What is NumPy, and why is it widely used in Python?**

**NumPy**, which stands for **Numerical Python**, is the fundamental package for scientific computing in Python.

It is widely used for several key reasons:

  * **The `ndarray` Object:** Its core feature is the powerful N-dimensional array object (`ndarray`), which is a fast and memory-efficient data structure for storing and manipulating numerical data.
  * **Performance:** Many of NumPy's core operations are implemented in C, making them significantly faster than equivalent operations performed on native Python lists. Because every element of an array shares one data type, the compiled loops skip the per-element type checks and object overhead that slow down Python-level loops.
  * **Mathematical Functions:** It provides a vast library of high-level mathematical functions to operate on these arrays (e.g., linear algebra, Fourier transforms, random number generation).
  * **Foundation of the Ecosystem:** NumPy is the base library for most other data science and machine learning libraries in Python. Pandas, scikit-learn, and SciPy are built on top of NumPy arrays, and frameworks such as TensorFlow interoperate with them.

```python
import numpy as np

# Create a NumPy array
a = np.array([1, 2, 3, 4, 5])
print(f"NumPy Array: {a}")
print(f"Type: {type(a)}")

# Perform a fast, vectorized operation
b = a * 2
print(f"Array multiplied by 2: {b}")
```

### **2. How does broadcasting work in NumPy?**

**Broadcasting** is a powerful mechanism that allows NumPy to perform arithmetic operations on arrays of different shapes. Instead of explicitly reshaping arrays to be the same size, NumPy implicitly "broadcasts" the smaller array across the larger one so that they have compatible shapes.

The rules for broadcasting are:

1.  If the arrays do not have the same number of dimensions, prepend 1s to the shape of the array with fewer dimensions until both shapes have the same length.
2.  Two dimensions are compatible when their sizes are equal or when one of them is 1. A dimension of size 1 is stretched to match the other array.
3.  The size of each dimension in the output array's shape is the maximum of the input sizes in that dimension.
4.  If any pair of dimensions is incompatible, a `ValueError: operands could not be broadcast together` is raised.

```python
# Example of broadcasting
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

vector = np.array([10, 20, 30])

# NumPy "broadcasts" the vector to each row of the matrix
# The vector [10, 20, 30] is added to [1, 2, 3], then to [4, 5, 6], etc.
result = matrix + vector

print("Matrix:\n", matrix)
print("\nVector:\n", vector)
print("\nResult of broadcasting addition:\n", result)
```
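The rules can also be checked directly on shapes. The following is a minimal sketch with made-up arrays (`row`, `col`, and `bad` are illustrative names, not part of the example above); it assumes NumPy 1.20 or newer for `np.broadcast_shapes`.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])        # shape (2, 1)

print((matrix + row).shape)   # (2, 3): row is stretched down the rows
print((matrix + col).shape)   # (2, 3): col is stretched across the columns
print((row + col).shape)      # (2, 3): both operands are stretched

# Predict the result shape without doing any arithmetic
print(np.broadcast_shapes((2, 3), (3,)))   # (2, 3)

# Trailing dimensions 3 and 2 are neither equal nor 1, so this fails
bad = np.array([1, 2])
try:
    matrix + bad
    failed = False
except ValueError as exc:
    failed = True
    print("Broadcast failed:", exc)
```

Reshaping a 1D array into a column with `arr[:, np.newaxis]` is the usual way to make it broadcast along the other axis.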

### **3. What is a Pandas DataFrame?**

A **Pandas DataFrame** is a 2-dimensional, size-mutable, and potentially heterogeneous labeled data structure. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the primary data structure used in the Pandas library.

Key features include:

  * **Labeled Axes:** Both rows (index) and columns are labeled.
  * **Heterogeneous Data:** Columns can have different data types (e.g., integer, float, string, boolean).
  * **Size-mutable:** You can add or delete columns and rows.
  * **Powerful Operations:** It supports a wide range of functionalities like merging, reshaping, slicing, grouping, and handling missing data.

```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}

df = pd.DataFrame(data)
print(df)
```

### **4. Explain the use of the `groupby()` method in Pandas.**

The `groupby()` method is used for splitting a DataFrame into groups based on some criteria, applying a function to each group independently, and then combining the results into a new data structure. This is often referred to as the **"Split-Apply-Combine"** strategy.

  * **Split:** The data is split into groups based on the values in one or more columns.
  * **Apply:** A function (e.g., `sum()`, `mean()`, `count()`) is applied to each group.
  * **Combine:** The results of the function applications are combined into a final DataFrame or Series.

It is extremely useful for aggregating and summarizing data.

```python
# Sample DataFrame
data = {'Department': ['HR', 'IT', 'IT', 'HR', 'IT'],
        'Salary': [70000, 85000, 92000, 68000, 88000]}
df = pd.DataFrame(data)

# Group by 'Department' and calculate the mean salary for each
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()

print("Original DataFrame:\n", df)
print("\nAverage Salary by Department:\n", avg_salary_by_dept)
```
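To make the three steps visible, the sketch below reuses the same sample data, iterates over the groups to show the split, and then applies several aggregations at once with named aggregation. The output column names (`mean_salary`, `max_salary`, `headcount`) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT', 'IT', 'HR', 'IT'],
                   'Salary': [70000, 85000, 92000, 68000, 88000]})

# Split: each group is a (key, sub-DataFrame) pair
for name, group in df.groupby('Department'):
    print(name, group['Salary'].tolist())
# HR [70000, 68000]
# IT [85000, 92000, 88000]

# Apply + combine: several aggregations in one call (named aggregation)
summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'size'),
)
print(summary)
```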

### **5. Why is Seaborn preferred for statistical visualizations?**

**Seaborn** is a Python data visualization library based on Matplotlib. It is often preferred for statistical visualizations because:

  * **High-Level Interface:** It provides a simpler, high-level interface for creating complex and common statistical plots like histograms, box plots, violin plots, and heatmaps, typically with less code than the equivalent Matplotlib.
  * **Aesthetic Defaults:** Seaborn comes with a number of built-in themes and color palettes that make plots more aesthetically pleasing and readable by default.
  * **Integration with Pandas:** It integrates seamlessly with Pandas DataFrames. You can often pass entire DataFrames to its plotting functions and specify column names for the axes.
  * **Specialized Statistical Plots:** It has built-in functions for complex visualizations that are difficult to create from scratch, such as `pairplot()` for exploring pairwise relationships or `lmplot()` for fitting and visualizing regression models.

### **6. What are the differences between NumPy arrays and Python lists?**

| Feature | NumPy Array (`ndarray`) | Python List (`list`) |
| :--- | :--- | :--- |
| **Data Type** | **Homogeneous:** All elements must be of the same data type. | **Heterogeneous:** Can contain elements of different data types. |
| **Performance** | **Fast.** Operations are implemented in C and executed on the entire array at once (vectorization). | **Slow.** Operations are often performed via loops in interpreted Python. |
| **Memory Usage**| **Memory-efficient.** Stores elements in a contiguous block of memory with minimal overhead. | **Less memory-efficient.** Stores pointers to objects, which can be scattered in memory, adding overhead. |
| **Functionality**| Optimized for numerical and mathematical operations. | A general-purpose data structure for storing collections of items. |
| **Operators** | Operators (`+`, `*`) perform element-wise arithmetic. | Operators (`+`, `*`) perform concatenation and repetition. |
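A few of the rows in this table can be observed directly. The snippet below is a small sketch with illustrative values; the variable names are made up for the example.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# Operators: lists concatenate and repeat, arrays do element-wise arithmetic
print(py_list + py_list)   # [1, 2, 3, 1, 2, 3]
print(py_list * 2)         # [1, 2, 3, 1, 2, 3]
print(np_arr + np_arr)     # [2 4 6]
print(np_arr * 2)          # [2 4 6]

# Data type: a list keeps mixed types, an array upcasts to one common dtype
mixed_list = [1, 2.5, 3]
mixed_arr = np.array(mixed_list)
print([type(x).__name__ for x in mixed_list])   # ['int', 'float', 'int']
print(mixed_arr.dtype)                          # float64

# Memory: 1000 int64 values occupy exactly 1000 * 8 bytes of array data
big = np.arange(1000, dtype=np.int64)
print(big.itemsize, big.nbytes)                 # 8 8000
```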

### **7. What is a heatmap, and when should it be used?**

A **heatmap** is a graphical representation of data where individual values contained in a matrix are represented as colors. It's a 2D visualization tool that uses color to convey the magnitude of values.

**When to use it:**

  * **Correlation Matrices:** The most common use case is to visualize correlation matrices. It allows you to quickly see which variables are highly correlated (positively or negatively) with each other.
  * **Feature Analysis:** To understand the relationship between different features in a dataset.
  * **Web Analytics:** To show user engagement on different parts of a webpage.
  * **Biological Data:** To visualize gene expression data.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some random data and compute the correlation matrix
np.random.seed(0)
data = pd.DataFrame(np.random.rand(10, 5), columns=[f'Var{i}' for i in range(1, 6)])
corr_matrix = data.corr()

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()
```

### **8. What does the term "vectorized operation" mean in NumPy?**

**Vectorization** refers to performing operations on entire arrays of data at once, rather than iterating over the elements one by one using explicit loops.

In NumPy, vectorized operations are implemented in highly optimized, pre-compiled C code. When you write `a * b` where `a` and `b` are NumPy arrays, NumPy does not perform a Python loop. Instead, it calls its internal C functions that loop over the elements much more efficiently.

**Benefits:**

  * **Faster Code:** It's significantly faster than using Python loops.
  * **More Concise Code:** It's easier to read and write (`c = a + b` is cleaner than a for loop).

```python
# Non-vectorized (loop) vs. vectorized operation
# Note: %timeit is an IPython/Jupyter magic, so run this cell in a notebook or IPython shell
n = 1_000_000
a = np.arange(n)
b = np.arange(n)

# Non-vectorized approach using a list comprehension
%timeit c = [a[i] + b[i] for i in range(n)]

# Vectorized approach using NumPy
%timeit c_np = a + b
```

### **9. How does Matplotlib differ from Plotly?**

| Feature | Matplotlib | Plotly |
| :--- | :--- | :--- |
| **Interactivity** | Primarily creates **static** plots (e.g., PNG, PDF, SVG). Limited interactivity. | Creates fully **interactive** plots for web browsers (hover, zoom, pan). |
| **Output** | Images, displayed in notebooks or saved to files. | HTML files or embedded in web apps/notebooks. Can be made static. |
| **API Level** | **Lower-level.** Gives fine-grained control but can be more verbose for complex plots. | Has a high-level API (**Plotly Express**) for quick, easy plotting and a lower-level one (**Graph Objects**) for customization. |
| **Aesthetics** | Can look dated by default, though highly customizable. | Modern, clean, and publication-quality aesthetics out-of-the-box. |
| **Use Case** | Quick and simple plotting, deep customization, academic publishing. | Interactive dashboards, data applications, web-based visualizations. |

### **10. What is the significance of hierarchical indexing in Pandas?**

**Hierarchical indexing** (or **MultiIndex**) allows you to have multiple index levels on an axis. Its primary significance is that it enables you to work with higher-dimensional data in a lower-dimensional form, like a Series (1D) or DataFrame (2D).

**Significance:**

  * **Representing High-Dimensional Data:** A DataFrame with a MultiIndex on its rows and columns can represent 3D or 4D data.
  * **Sophisticated Slicing and Dicing:** It allows for more complex data selection. You can select data based on different levels of the index easily.
  * **Advanced Grouping and Reshaping:** It facilitates complex grouping and reshaping operations like `stack()` and `unstack()`.

```python
# Creating a DataFrame with a MultiIndex
index_tuples = [('Group A', 'Type 1'), ('Group A', 'Type 2'),
                ('Group B', 'Type 1'), ('Group B', 'Type 2')]
multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['Group', 'Type'])
df_multi = pd.DataFrame(np.random.randn(4, 2), index=multi_index, columns=['Value1', 'Value2'])

print("DataFrame with Hierarchical Index:\n", df_multi)

# Sophisticated slicing
print("\nSlicing 'Group A':\n", df_multi.loc['Group A'])
```
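Selection by an inner level and reshaping with `unstack()` can be sketched as follows. This example uses fixed values instead of random ones so the results are easy to follow; the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Group A', 'Group B'], ['Type 1', 'Type 2']],
    names=['Group', 'Type'])
df_multi = pd.DataFrame({'Value1': [1, 2, 3, 4],
                         'Value2': [10, 20, 30, 40]}, index=index)

# Select one exact row by giving a label for every index level
print(df_multi.loc[('Group B', 'Type 1')])

# Cross-section on the inner level: every 'Type 2' row, whatever its group
type2 = df_multi.xs('Type 2', level='Type')
print(type2)

# unstack() moves the inner index level into the columns (long -> wide)
wide = df_multi.unstack('Type')
print(wide)
```

After `unstack('Type')`, the columns themselves carry a two-level index, so a single cell is addressed as `wide.loc['Group A', ('Value2', 'Type 2')]`.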

### **11. What is the role of Seaborn's `pairplot()` function?**

The `pairplot()` function in Seaborn is a powerful tool for exploratory data analysis. Its primary role is to visualize the **pairwise relationships** between numerical variables in a dataset.

It creates a grid of axes where:

  * The **off-diagonal plots** are **scatter plots**, showing the relationship between each pair of variables.
  * The **diagonal plots** are **univariate plots** (typically a histogram or a kernel density estimate (KDE)), showing the distribution of each individual variable.

This allows a data analyst to quickly get a high-level overview of the data's structure, identify trends, spot correlations, and see distributions all in one figure.

```python
# Using pairplot on the iris dataset
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species', markers=["o", "s", "D"])
plt.suptitle('Pairplot of Iris Dataset', y=1.02)
plt.show()
```

### **12. What is the purpose of the `describe()` function in Pandas?**

The `describe()` function in Pandas is used to generate **descriptive statistics** that summarize the central tendency, dispersion, and shape of a dataset's distribution. It provides a quick and convenient overview of the numerical columns in a DataFrame.

For numerical data, it returns:

  * `count`: The number of non-null observations.
  * `mean`: The average of the values.
  * `std`: The standard deviation.
  * `min`: The minimum value.
  * `25%`: The 25th percentile (1st quartile).
  * `50%`: The 50th percentile (median).
  * `75%`: The 75th percentile (3rd quartile).
  * `max`: The maximum value.

```python
df = pd.DataFrame({'numeric': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
                   'object': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

# Describe the numeric column
print(df['numeric'].describe())
```

### **13. Why is handling missing data important in Pandas?**

Handling missing data (usually represented as `NaN`, short for "Not a Number") is a critical step in the data preprocessing pipeline for several reasons:

  * **Algorithm Incompatibility:** Many machine learning algorithms cannot handle missing values and will raise an error if they are present.
  * **Biased Results:** If not handled properly, missing data can lead to biased statistical analysis and machine learning models. For example, if the data is not missing at random, ignoring it can skew the results.
  * **Reduced Statistical Power:** Dropping rows with missing values (listwise deletion) can significantly reduce the size of the dataset, leading to a loss of information and reduced statistical power.
  * **Inaccurate Calculations:** Operations like `mean()` or `sum()` can produce misleading results if missing values are not accounted for. Pandas skips `NaN` by default in these reductions (`skipna=True`), so the result describes only the observed values; passing `skipna=False` makes the result `NaN` whenever a value is missing.

Common strategies include **imputation** (filling in values) or **removal** (dropping rows/columns).
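The strategies above can be sketched on a tiny, made-up DataFrame. The column names and the choice of median imputation are assumptions for illustration; the right strategy depends on why the data is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 35, 40],
                   'city': ['NY', 'LA', None, 'NY']})

# 1. Detect: count missing values per column
print(df.isna().sum())

# 2. Reductions skip NaN by default; skipna=False propagates it
print(df['age'].mean())               # mean of 25, 35 and 40
print(df['age'].mean(skipna=False))   # nan

# 3. Removal: drop every row that has at least one missing value
dropped = df.dropna()
print(dropped)

# 4. Imputation: fill numeric gaps with the median, text gaps with a label
filled = df.assign(
    age=df['age'].fillna(df['age'].median()),
    city=df['city'].fillna('Unknown'))
print(filled)
```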

### **14. What are the benefits of using Plotly for data visualization?**

The key benefits of using **Plotly** are:

  * **Interactivity:** This is Plotly's main selling point. It creates charts with built-in hover tooltips, zooming, panning, and filtering capabilities, allowing users to explore the data directly within the visualization.
  * **Aesthetic Quality:** Plotly charts are designed to be modern, clean, and visually appealing right out of the box, making them suitable for presentations and dashboards.
  * **Web-Native:** It generates HTML-based visualizations that can be easily embedded in websites, blogs, and data applications (like Dash).
  * **Wide Range of Chart Types:** It supports a huge variety of charts, from basic line/bar/scatter plots to complex 3D plots, financial charts, maps, and scientific charts.
  * **Plotly Express:** The `plotly.express` module is a high-level API that lets you create sophisticated, interactive plots with very little code.

### **15. How does NumPy handle multidimensional arrays?**

NumPy handles multidimensional arrays through its core `ndarray` object.

  * **Shape and Dimensions:** An `ndarray` has an attribute called `shape`, which is a tuple of integers indicating the size of the array in each dimension. The number of dimensions is given by the `ndim` attribute.
  * **Contiguous Memory Block:** A newly created array stores all of its elements in a single, contiguous block of memory. NumPy keeps track of the `shape` and data type (`dtype`) to interpret this block of memory as a multidimensional array. This memory layout is key to its performance. Views such as transposes or strided slices reuse the same block but may step through it non-contiguously.
  * **Indexing:** NumPy provides a rich and intuitive syntax for accessing elements in multiple dimensions. You can use a comma-separated tuple of indices to access specific elements, e.g., `arr[row, col]` for a 2D array or `arr[plane, row, col]` for a 3D array.
  * **Strides:** Internally, NumPy uses a concept called "strides" to know how many bytes to jump in memory to get to the next element along each dimension.

```python
# A 3D array (2 planes, 3 rows, 4 columns)
arr_3d = np.arange(24).reshape((2, 3, 4))

print("3D Array:\n", arr_3d)
print("\nShape:", arr_3d.shape)
print("Dimensions:", arr_3d.ndim)

# Accessing an element in the 1st plane, 2nd row, 3rd column
print("\nElement at (0, 1, 2):", arr_3d[0, 1, 2])
```
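The strides mentioned above can be inspected directly. This sketch fixes the dtype to `int64` so that the byte counts are the same on every platform.

```python
import numpy as np

arr = np.arange(24, dtype=np.int64).reshape(2, 3, 4)

# Each int64 element is 8 bytes; strides are the bytes to step per dimension
print(arr.itemsize)   # 8
print(arr.strides)    # (96, 32, 8): next plane, next row, next column

# Byte offset of arr[0, 1, 2] computed from the strides by hand
offset = 0 * arr.strides[0] + 1 * arr.strides[1] + 2 * arr.strides[2]
print(offset // arr.itemsize, arr[0, 1, 2])   # 6 6

# Transposing only swaps shape and strides; no data is moved
t = arr.T
print(t.shape)                     # (4, 3, 2)
print(t.strides)                   # (8, 32, 96)
print(np.shares_memory(arr, t))    # True
print(t.flags['C_CONTIGUOUS'])     # False
```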

### **16. What is the role of Bokeh in data visualization?**

**Bokeh** is a Python interactive visualization library that targets modern web browsers for its presentation. Its primary role is to create **interactive plots, dashboards, and data applications**.

Key characteristics and roles:

  * **Interactivity for the Web:** Like Plotly, Bokeh's strength is creating interactive visualizations, but it's particularly powerful for building complex data-driven applications with linked plots, widgets (sliders, dropdowns), and streaming data capabilities.
  * **Server Component for Complex Apps:** Bokeh has an optional server component (`Bokeh Server`) that allows you to connect plots to real-time data streams or Python code that responds to user interactions. This is a key differentiator from many other plotting libraries.
  * **No JavaScript Required:** It allows data scientists to create sophisticated, interactive web-based visualizations entirely in Python, without needing to write any JavaScript.
  * **Target Audience:** Bokeh is ideal for data scientists and developers who want to build interactive dashboards and applications rather than just creating static charts for a report.

### **17. Explain the difference between `apply()`, `map()`, and `applymap()` in Pandas.**

This is a common point of confusion. Here's the breakdown:

| Function | Works On | Scope | Common Use Case |
| :--- | :--- | :--- | :--- |
| **`map()`** | **Series** (and, since pandas 2.1, **DataFrame**) | Element-wise | Substituting each element with another value using a `dict` or a function. |
| **`apply()`** | **DataFrame** or **Series** | Row-wise / Column-wise (on DF) | Applying a complex function along an axis (e.g., a function that uses multiple columns). |
| **`applymap()`**| **DataFrame only**| Element-wise | Applying a simple function to every single element in the DataFrame (e.g., formatting). Deprecated since pandas 2.1 in favor of `DataFrame.map()`. |

```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30], 'C': ['x', 'y', 'z']})
print("Original DataFrame:\n", df)

# 1. map() -> On a Series. Let's map column 'C'.
df['C_mapped'] = df['C'].map({'x': 'X-val', 'y': 'Y-val', 'z': 'Z-val'})
print("\nAfter map():\n", df)

# 2. apply() -> On a DataFrame. Let's calculate the range (max - min) for each numeric column.
df_range = df[['A', 'B']].apply(lambda col: col.max() - col.min(), axis=0)
print("\nAfter apply() on columns:\n", df_range)

# 3. applymap() -> On a DataFrame. Let's format all numeric values as strings.
# (pandas >= 2.1 emits a FutureWarning here; DataFrame.map() is the replacement.)
df_numeric = df[['A', 'B']]
df_formatted = df_numeric.applymap(lambda x: f"Value: {x}")
print("\nAfter applymap():\n", df_formatted)
```

### **18. What are some advanced features of NumPy?**

Beyond basic array creation and arithmetic, NumPy has several advanced features:

  * **Broadcasting:** The ability to perform operations on arrays of different but compatible shapes (as discussed in Q2).
  * **Advanced Indexing:**
      * **Fancy Indexing:** Using arrays of indices to access or modify multiple array elements at once.
      * **Boolean Indexing:** Using a boolean array to select elements that correspond to `True` values.
  * **Linear Algebra (`numpy.linalg`):** A comprehensive module for matrix decompositions, determinants, eigenvalues, solving linear equations, and more.
  * **Universal Functions (ufuncs):** These are functions that operate on `ndarray`s in an element-by-element fashion. They are the core of NumPy's vectorized operations (e.g., `np.add`, `np.sin`, `np.exp`).
  * **Random Number Generation (`numpy.random`):** A powerful module for creating random data from various statistical distributions.
  * **Memory-Mapped Files (`numpy.memmap`):** An interface for treating a file on disk as if it were a large NumPy array, allowing you to work with datasets larger than your available RAM.
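Several of these features fit in a few lines. The following sketch uses small, made-up arrays; the seed passed to `default_rng` is arbitrary.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Fancy indexing: pick elements by a list of positions
print(arr[[0, 2, 4]])    # [10 30 50]

# Boolean indexing: keep elements where the mask is True
print(arr[arr > 25])     # [30 40 50]

# Universal function: applied element by element in compiled code
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))
print(roots)             # [1. 2. 3.]

# Linear algebra: solve 2x + y = 5 and x + 3y = 10
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)
print(solution)          # approximately [1. 3.]

# Random numbers: three draws from a standard normal distribution
rng = np.random.default_rng(0)
samples = rng.normal(size=3)
print(samples.shape)     # (3,)
```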

### **19. How does Pandas simplify time series analysis?**

Pandas was originally built for financial analysis and has exceptional capabilities for handling time series data.

  * **`DatetimeIndex`:** A specialized index type for timestamps that allows for powerful slicing and selection based on dates and times (e.g., `df.loc['2023-01']` to get all data from January 2023).
  * **Date Range Generation:** Easily create sequences of dates using `pd.date_range()`.
  * **Resampling:** Easily change the frequency of time series data using the `.resample()` method. Moving to a lower frequency is called downsampling (e.g., converting daily data to monthly data by taking the mean); moving to a higher frequency is called upsampling (e.g., daily to hourly).
  * **Shifting and Lagging:** The `.shift()` method makes it trivial to shift data forward or backward in time, which is essential for calculating returns or comparing observations to previous ones.
  * **Rolling Window Calculations:** Built-in support for rolling window operations (e.g., calculating a 30-day rolling average) using the `.rolling()` method, crucial for smoothing data and identifying trends.
  * **Time Zone Handling:** Robust support for converting between and working with different time zones.
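Most of these tools can be tried on a short synthetic series. The sketch below uses 14 made-up daily values and resamples to weekly rather than monthly, since the `'D'` and `'W'` frequency aliases behave the same across pandas versions.

```python
import pandas as pd

# 14 daily observations starting on Monday 2024-01-01 (synthetic data)
dates = pd.date_range('2024-01-01', periods=14, freq='D')
s = pd.Series(range(14), index=dates, dtype='float64')

# Date-based slicing on a DatetimeIndex (both endpoints are included)
first_week = s.loc['2024-01-01':'2024-01-07']
print(len(first_week))   # 7

# Downsampling: daily -> weekly mean (weeks end on Sunday by default)
weekly = s.resample('W').mean()
print(weekly)

# Shifting: compare each day with the previous one
change = s - s.shift(1)
print(change.head(3))

# Rolling window: 3-day moving average (first two entries are NaN)
rolling = s.rolling(window=3).mean()
print(rolling.head(4))
```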

### **20. What is the role of a pivot table in Pandas?**

A **pivot table** is a data summarization tool that reshapes or transforms data by converting unique values from one column into new columns. It's used to summarize, sort, group, and aggregate data in a tabular format. In Pandas, this is done with the `pd.pivot_table()` function.

Its role is to provide a "wide" format view of data from a "long" format, making it easier to analyze relationships between categorical variables. It has four main components:

  * `index`: The column whose unique values will become the new rows of the pivot table.
  * `columns`: The column whose unique values will become the new columns.
  * `values`: The column whose values will be aggregated and fill the table cells.
  * `aggfunc`: The aggregation function to apply to the values (e.g., `sum`, `mean`, `count`).

```python
data = {'Date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02'],
        'Product': ['A', 'B', 'A', 'B'],
        'Sales': [100, 150, 120, 180]}
df = pd.DataFrame(data)

# Create a pivot table to see sales for each product on each date
pivot = df.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')

print("Original DataFrame:\n", df)
print("\nPivot Table:\n", pivot)
```

### **21. Why is NumPy's array slicing faster than Python's list slicing?**

The speed difference comes down to their fundamental difference in memory layout and implementation.

  * **NumPy Array Slicing (Creates a *View*)**

      * **Memory Layout:** NumPy arrays are stored in a **contiguous block of memory**.
      * **Slicing Mechanism:** When you slice a NumPy array with basic `start:stop:step` slicing, you do **not** create a new copy of the data. Instead, you create a **view** of the original array. This new view object simply points to the same memory block but has different metadata (shape, strides, etc.). Indexing with a list or boolean mask is different and does return a copy.
      * **Result:** This operation is extremely fast and memory-efficient because no data is copied.

  * **Python List Slicing (Creates a *Copy*)**

      * **Memory Layout:** Python lists store **pointers** to objects, and these objects can be scattered all over memory.
      * **Slicing Mechanism:** When you slice a list, Python creates a **new list** (a shallow copy). It has to iterate through the original slice and copy each pointer into the new list.
      * **Result:** This process involves memory allocation and copying, so its cost grows with the length of the slice, whereas creating a NumPy view takes roughly constant time.

```python
# Note: %timeit is an IPython/Jupyter magic, so run this cell in a notebook or IPython shell
# Slicing a large NumPy array
large_np_array = np.arange(10_000_000)
%timeit large_np_array[1000:5000]

# Slicing a large Python list
large_list = list(range(10_000_000))
%timeit large_list[1000:5000]
```

On a typical machine the NumPy slice is much faster, and the gap widens as the slice gets longer, because the cost of creating a view does not depend on its length.
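The view-versus-copy behavior has a practical consequence beyond speed: writing to a NumPy slice changes the original array. A minimal sketch with small, made-up arrays:

```python
import numpy as np

arr = np.arange(10)
view = arr[2:5]          # basic slice -> view on the same memory
view[0] = 99             # writes through to the original array
print(arr[2])                         # 99
print(np.shares_memory(arr, view))    # True
print(view.base is arr)               # True

lst = list(range(10))
sub = lst[2:5]           # list slice -> new list (shallow copy)
sub[0] = 99
print(lst[2])                         # 2, the original list is unchanged

# Fancy indexing returns a copy, so the original is not affected
picked = arr[[2, 3, 4]]
picked[0] = -1
print(arr[2])                         # still 99

# Use .copy() when an independent array is needed
safe = arr[2:5].copy()
print(np.shares_memory(arr, safe))    # False
```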

### **22. What are some common use cases for Seaborn?**

Seaborn excels at producing informative and attractive statistical graphics. Common use cases include:

  * **Visualizing Distributions:** Using `histplot`, `kdeplot`, `ecdfplot`, or `rugplot` to understand the distribution of a single variable.
  * **Comparing Distributions:** Using `boxplot`, `violinplot`, or `stripplot` to compare the distribution of a numerical variable across different categories.
  * **Plotting Categorical Data:** Using `countplot` to see the frequency of items in each category or `barplot` to show an aggregate metric for each category.
  * **Visualizing Relationships:** Using `scatterplot` or `regplot` (which adds a regression line) to see the relationship between two numerical variables.
  * **Multivariate Analysis:** Using `pairplot` to see pairwise relationships across an entire dataset (as in Q11) or `jointplot` to see both the bivariate relationship and the univariate distributions of two variables.
  * **Visualizing Matrices:** Using `heatmap` to visualize correlation matrices or other matrix-like data (as in Q7).
  * **Creating Faceted Grids:** Using `FacetGrid`, `catplot`, or `lmplot` to create grids of the same type of plot for different subsets of the data, allowing for easy comparison.


# **Data Toolkit: Practical Exercises**

This notebook contains the solutions to 13 practical coding tasks using Python's core data science libraries.

### **1. How do you create a 2D NumPy array and calculate the sum of each row?**

To calculate the sum of each row, you use the `sum()` method with `axis=1`. The `axis=1` argument specifies that the sum should be performed across the columns for each row.

```python
import numpy as np

# 1. Create a 2D NumPy array
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print("Original 2D Array:")
print(my_array)

# 2. Calculate the sum of each row
row_sums = my_array.sum(axis=1)

print("\nSum of each row:")
print(row_sums)
```

### **2. Write a Pandas script to find the mean of a specific column in a DataFrame.**

You can select a specific column by its name (e.g., `df['ColumnName']`) which returns a Pandas Series. Then, you can call the `.mean()` method on that Series.

```python
import pandas as pd

# Create a sample DataFrame
data = {'Product': ['A', 'B', 'C', 'A', 'B'],
        'Sales': [250, 150, 350, 200, 180]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Find the mean of the 'Sales' column
mean_sales = df['Sales'].mean()

print(f"\nThe mean of the 'Sales' column is: {mean_sales}")
```

### **3. Create a scatter plot using Matplotlib.**

A scatter plot is used to visualize the relationship between two numerical variables. You use the `plt.scatter()` function.

```python
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data for the plot
np.random.seed(42) # for reproducibility
x = np.random.rand(50) * 10
y = 2 * x + 1 + np.random.randn(50) * 2 # y = 2x + 1 with some noise

# Create the scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c='blue', alpha=0.7, edgecolors='w', s=80)

# Add labels and a title
plt.title('My First Scatter Plot')
plt.xlabel('X-axis Value')
plt.ylabel('Y-axis Value')
plt.grid(True)
plt.show()
```

### **4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?**

Seaborn does not compute correlations itself. First, calculate the correlation matrix of a DataFrame using the pandas `.corr()` method. Then, pass this matrix to Seaborn's `sns.heatmap()` function for visualization.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample DataFrame with correlated data
np.random.seed(0)
data = {'Feature A': np.random.rand(100) * 10,
        'Feature B': np.random.rand(100) * 5,
        'Feature C': np.random.rand(100) * 20}
df = pd.DataFrame(data)
df['Feature D'] = df['Feature A'] * 2 + np.random.randn(100) # Correlated with A

# 1. Calculate the correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix)

# 2. Visualize the correlation matrix with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
```

### **5. Generate a bar plot using Plotly.**

Plotly Express (`px`) is the easiest way to create figures with Plotly. The `px.bar()` function is used for creating bar plots.

```python
import plotly.express as px
import pandas as pd

# Sample data in a DataFrame
data = {'Category': ['A', 'B', 'C', 'D'],
        'Value': [23, 45, 58, 32]}
df = pd.DataFrame(data)

# Generate the bar plot
fig = px.bar(df,
             x='Category',
             y='Value',
             title='A Simple Bar Plot with Plotly',
             color='Category', # Optional: assign colors based on category
             labels={'Value': 'Count or Amount'}) # Optional: custom labels

fig.show()
```

### **6. Create a DataFrame and add a new column based on an existing column.**

You can create a new column by simply assigning the result of an operation on an existing column (or columns) to a new column name.

```python
import pandas as pd

# Create a DataFrame
data = {'Product': ['Laptop', 'Mouse', 'Keyboard'],
        'Price': [1200, 25, 75]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Add a new column 'Price_with_Tax' (assuming 18% tax)
df['Price_with_Tax'] = df['Price'] * 1.18

print("\nDataFrame with New Column:")
print(df)
```

### **7. Write a program to perform element-wise multiplication of two NumPy arrays.**

Element-wise multiplication is done using the standard multiplication operator (`*`). This is a vectorized operation and requires the arrays to have the same shape or be broadcastable.

```python
import numpy as np

# Create two NumPy arrays of the same shape
array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[10, 20], [30, 40]])

print("Array 1:")
print(array1)
print("\nArray 2:")
print(array2)

# Perform element-wise multiplication
result = array1 * array2

print("\nResult of Element-wise Multiplication:")
print(result)
```

### **8. Create a line plot with multiple lines using Matplotlib.**

You can plot multiple lines on the same axes by calling `plt.plot()` multiple times before `plt.show()`. Adding labels to each plot call and using `plt.legend()` helps distinguish the lines.

```python
import matplotlib.pyplot as plt
import numpy as np

# Create x-axis data
x = np.linspace(0, 10, 100) # 100 points from 0 to 10

# Create y-axis data for two different functions
y1 = np.sin(x)
y2 = np.cos(x)

# Plot both lines on the same figure
plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='Sine Wave')
plt.plot(x, y2, label='Cosine Wave', linestyle='--')

# Add title, labels, and legend
plt.title('Multiple Lines on a Single Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()
```

### **9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.**

This technique is called **boolean indexing**. You create a boolean Series (True/False) based on a condition and use it to select rows from the DataFrame.

```python
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Score': [85, 92, 78, 95, 68]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Define a threshold
threshold = 80

# Filter rows where 'Score' is greater than the threshold
filtered_df = df[df['Score'] > threshold]

print(f"\nDataFrame filtered for Scores > {threshold}:")
print(filtered_df)
```

### **10. Create a histogram using Seaborn to visualize a distribution.**

Seaborn's `sns.histplot()` is a versatile function for creating histograms. Adding `kde=True` overlays a Kernel Density Estimate plot, which shows a smooth estimate of the distribution.

```python
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data from a normal distribution
np.random.seed(10)
data = np.random.randn(1000)

# Create a histogram
plt.figure(figsize=(8, 6))
sns.histplot(data, bins=30, kde=True, color='purple')

# Add titles and labels
plt.title('Histogram of a Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
```

### **11. Perform matrix multiplication using NumPy.**

Matrix multiplication is different from element-wise multiplication. In NumPy, you can perform it using the `@` operator (equivalent to `np.matmul()`) or, for 2D arrays, the `np.dot()` function. The inner dimensions of the matrices must match (e.g., an `(m x n)` matrix can be multiplied by an `(n x p)` matrix, giving an `(m x p)` result).

```python
import numpy as np

# Create two matrices with compatible shapes for multiplication
matrix_A = np.array([[1, 2, 3],
                     [4, 5, 6]]) # Shape (2, 3)

matrix_B = np.array([[7, 8],
                     [9, 10],
                     [11, 12]]) # Shape (3, 2)

print("Matrix A (2x3):\n", matrix_A)
print("\nMatrix B (3x2):\n", matrix_B)

# Perform matrix multiplication using the @ operator
result_matrix = matrix_A @ matrix_B

print("\nResult of Matrix Multiplication (2x2):\n", result_matrix)
```

### **12. Use Pandas to load a CSV file and display its first 5 rows.**

The `pd.read_csv()` function is used to load data from a CSV file into a DataFrame. The `.head()` method is then used to display the first few rows (default is 5).

```python
import pandas as pd
import os

# Step 1: Create a dummy CSV file to demonstrate.
# In a real scenario, you would just have the file path.
csv_data = """id,name,age
1,Alice,34
2,Bob,29
3,Charlie,41
4,David,25
5,Eve,38
6,Frank,45
"""

file_name = "sample_data.csv"
with open(file_name, "w") as f:
    f.write(csv_data)

# Step 2: Load the CSV file into a Pandas DataFrame
df = pd.read_csv(file_name)

# Step 3: Display the first 5 rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Clean up the dummy file
os.remove(file_name)
```

### **13. Create a 3D scatter plot using Plotly.**

Plotly Express makes creating 3D plots straightforward with the `px.scatter_3d` function. You need to provide `x`, `y`, and `z` coordinates.

```python
import plotly.express as px
import pandas as pd
import numpy as np

# Generate sample 3D data
np.random.seed(0)
n_points = 100
df_3d = pd.DataFrame({
    'x': np.random.randn(n_points),
    'y': np.random.randn(n_points),
    'z': np.random.randn(n_points),
    'category': np.random.choice(['Group A', 'Group B'], n_points)
})

# Create the 3D scatter plot
fig = px.scatter_3d(df_3d,
                    x='x',
                    y='y',
                    z='z',
                    color='category',
                    title='3D Scatter Plot',
                    labels={'x': 'X Coordinate', 'y': 'Y Coordinate', 'z': 'Z Coordinate'})

# Improve the marker style (optional)
fig.update_traces(marker=dict(size=5, opacity=0.8))

fig.show()
```