# **Data Toolkit Questions**
# ***Data Toolkit Theory Questions***


## Q1. What is NumPy, and why is it widely used in Python?

NumPy (short for Numerical Python) is a powerful open-source library in Python used for numerical and scientific computing. It provides efficient tools for working with large, multi-dimensional arrays and matrices, along with a collection of high-performance mathematical functions to operate on these arrays.

### **Key Features of NumPy**

**1. N-dimensional Array Object (ndarray)**

The core feature of NumPy is its powerful ndarray, a fast, flexible container for large data sets of homogeneous types (e.g., all floats or all ints).

It supports element-wise operations, slicing, broadcasting, and vectorization.

**2. Performance**

NumPy is implemented in C, which makes it much faster than native Python lists for numerical operations.

It uses vectorized operations (no need for explicit Python loops).

**3. Mathematical Functions**

Provides functions for linear algebra, Fourier transforms, random number generation, statistics, etc.

Examples: numpy.dot(), numpy.linalg.inv(), numpy.fft.fft(), numpy.mean().

**4. Broadcasting**

Allows arithmetic operations between arrays of different shapes, making code shorter and faster.

**5. Integration**

- Works seamlessly with other Python libraries such as:

  - Pandas (data analysis)

  - Matplotlib (plotting)

  - SciPy (scientific computing)

  - TensorFlow / PyTorch (machine learning)

### **Why NumPy Is Widely Used**

| Advantage             | Explanation                                                                                      |
| --------------------- | ------------------------------------------------------------------------------------------------ |
| **Speed**             | NumPy operations run much faster than equivalent Python loops because they use optimized C code. |
| **Convenience**       | Clean, readable syntax for array operations without writing explicit loops.                      |
| **Interoperability**  | Foundation for many data science and ML libraries.                                               |
| **Memory Efficiency** | Stores data more compactly than Python lists.                                                    |
| **Vectorization**     | Allows you to perform batch operations on entire datasets efficiently.                           |


### **Example**

In [None]:
import numpy as np

# Create a 2D array
a = np.array([[1, 2, 3], [4, 5, 6]])

# Perform operations
print(a + 10)          # Add 10 to each element
print(a.mean())        # Compute the mean
print(a @ a.T)         # Matrix multiplication


**Output:**

In [None]:
[[11 12 13]
 [14 15 16]]
3.5
[[14 32]
 [32 77]]


## Q2. How does broadcasting work in NumPy?

### **What Is Broadcasting?**

Broadcasting is NumPy’s way of making arrays with different shapes compatible for element-wise operations.

When performing arithmetic operations (+, -, *, /, etc.) between two arrays, NumPy automatically “stretches” the smaller array along its dimensions to match the shape of the larger one — without actually copying the data in memory.

### **Broadcasting Rules**

When NumPy operates on two arrays, it compares their shapes (tuples of dimensions) element by element, starting from the trailing (rightmost) dimension.

Two dimensions are compatible when:

1. They are equal, or

2. One of them is 1

If all dimensions are compatible according to these rules, NumPy can broadcast them.
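
The rules above can be checked on shapes alone. The following is a minimal sketch, assuming NumPy 1.20 or later (where `np.broadcast_shapes` is available); the example shapes are chosen purely for illustration. It also shows that the "stretching" does not copy data.

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules to shapes only,
# without allocating any arrays.
result_shape = np.broadcast_shapes((3, 1), (3,))
print(result_shape)  # (3, 3)

# Trailing dimensions 2 and 3 are neither equal nor 1, so this fails.
try:
    np.broadcast_shapes((3, 2), (3, 3))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False

# The stretching is virtual: broadcast_to returns a read-only view whose
# stride along the repeated axis is 0, so no data is copied.
row = np.array([10, 20, 30])
stretched = np.broadcast_to(row, (4, 3))
print(stretched.shape)       # (4, 3)
print(stretched.strides[0])  # 0
```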

### **Examples**
**Example 1: Simple Scalar Broadcasting**

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = 2

print(a * b)


**How it works:**

- a.shape → (3,)

- b is a Python scalar, which NumPy treats as a 0-dimensional array with shape ()

- NumPy “stretches” b to (3,)

**Output:**

In [None]:
[2 4 6]


**Example 2: Column Vector + Row Vector**

In [None]:
a = np.array([[1], [2], [3]])  # shape (3,1)
b = np.array([10, 20, 30])     # shape (3,)

print(a + b)


**How it works:**
- a shape → (3,1)

- b shape → (3,) → treated as (1,3)

- After broadcasting → (3,3)

**Output:**

In [None]:
[[11 21 31]
 [12 22 32]
 [13 23 33]]


**Example 3: 2D + 1D (Row-wise Broadcasting)**

In [None]:
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([10, 20, 30])

print(a + b)


**How it works:**

- a shape → (2,3)

- b shape → (3,) → treated as (1,3)

- b is broadcast to (2,3), so it is added to every row of a

**Output:**

In [None]:
[[11 22 33]
 [14 25 36]]


### **When Broadcasting Fails**

If the shapes aren’t compatible (cannot satisfy the rules), NumPy raises a ValueError.

Example:

In [None]:
a = np.ones((3, 2))
b = np.ones((3, 3))

a + b  # ❌ incompatible shapes


Error:

In [None]:
ValueError: operands could not be broadcast together with shapes (3,2) (3,3)


## Q3. What is a Pandas DataFrame?

A Pandas DataFrame is a 2D labeled data structure (like an Excel sheet) used for storing and manipulating tabular data in Python.

Built on NumPy, it allows each column to have a different data type.

Each column is a Series, and each row is one record.

### **Example:**

In [None]:
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)


**Output:**

In [None]:
    Name  Age
0  Alice   25
1    Bob   30


### **Key Features:**

- Easy data selection: df['Age'], df.loc[0]

- Filtering: df[df['Age'] > 25]

- Stats: df.describe()

- File I/O: pd.read_csv(), df.to_excel()
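
The selection and filtering features listed above can be tried on a tiny frame. This is a small sketch with made-up data (the third row is added only so the filter returns more than one match).

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cara'],
                   'Age': [25, 30, 41]})

ages = df['Age']              # one column, returned as a Series
first_row = df.loc[0]         # one row, selected by index label
older = df[df['Age'] > 25]    # boolean mask keeps only matching rows

print(type(ages).__name__)     # Series
print(first_row['Name'])       # Alice
print(older['Name'].tolist())  # ['Bob', 'Cara']

# Each column keeps its own dtype (text for Name, integer for Age).
print(df.dtypes)
```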

### **In short:**
A DataFrame is a fast, flexible, and user-friendly structure for data analysis — the backbone of Python data science.

## Q4. Explain the use of the groupby() method in Pandas.

The groupby() method in Pandas is used to split data into groups, apply a function, and combine the results — a process often called “split-apply-combine.”

It’s great for summarizing or analyzing data by categories (like “group by” in SQL or Excel Pivot Tables).

### **Basic Syntax**

In [None]:
df.groupby('column_name')


You can then apply functions like sum(), mean(), count(), etc.

### **Example**

In [None]:
import pandas as pd

data = {'Department': ['HR', 'HR', 'IT', 'IT', 'Sales'],
        'Salary': [4000, 4500, 6000, 6500, 5500]}

df = pd.DataFrame(data)

grouped = df.groupby('Department')['Salary'].mean()
print(grouped)


**Output:**

In [None]:
Department
HR       4250.0
IT       6250.0
Sales    5500.0
Name: Salary, dtype: float64


### **Common Operations**

| Operation        | Example                                                  | Description              |
| ---------------- | -------------------------------------------------------- | ------------------------ |
| **Mean**         | `df.groupby('Dept')['Salary'].mean()`                    | Avg salary per dept      |
| **Count**        | `df.groupby('Dept').count()`                             | Non-null count per group |
| **Sum**          | `df.groupby('Dept')['Salary'].sum()`                     | Total salary per dept    |
| **Multiple agg** | `df.groupby('Dept')['Salary'].agg(['min','max','mean'])` | Multiple stats per group |
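
One detail worth checking when counting per group: `count()` reports non-null values, while `size()` reports rows. The sketch below uses made-up data with one missing salary (introduced here only for illustration) to show the difference, alongside a multi-statistic `agg()` call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dept': ['HR', 'HR', 'IT', 'IT', 'Sales'],
                   'Salary': [4000, np.nan, 6000, 6500, 5500]})

grouped = df.groupby('Dept')['Salary']

sizes = grouped.size()     # rows per group, NaN included
counts = grouped.count()   # non-null values per group
stats = grouped.agg(['min', 'max', 'mean'])

print(sizes['HR'], counts['HR'])  # 2 1
print(stats.loc['IT', 'mean'])    # 6250.0
```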


### **In Short**

groupby() lets you easily analyze, summarize, and transform data by groups — making it one of the most powerful tools in Pandas for data analysis.

## Q5. Why is Seaborn preferred for statistical visualizations?

Seaborn is preferred for statistical visualizations in Python because it’s built specifically to make data exploration and statistical analysis easier and more visually appealing than with plain Matplotlib.

### **Key Reasons Seaborn Is Preferred**

**1. High-Level Interface**

- Simplifies complex plots with concise code.

- Example:

In [None]:
import seaborn as sns
sns.boxplot(x='category', y='value', data=df)


**2. Beautiful Default Styles**

- Automatically applies elegant color palettes and layouts for publication-quality visuals.

**3. Built-in Statistical Support**

- Handles common statistical tasks like:

- Regression (sns.regplot())

- Distribution (sns.histplot(), sns.kdeplot())

- Categorical comparisons (sns.boxplot(), sns.violinplot())

**4. Integration with Pandas**

- Works directly with DataFrames — no need for manual array handling.

**5. Automatic Aggregation**

- Can compute summary statistics (like means, confidence intervals) automatically.

**6. Complex Plots Made Easy**

- Easily create heatmaps, pair plots, and multi-variable relationships (sns.pairplot(), sns.heatmap()).

### **In Short**

Seaborn is preferred because it combines statistical intelligence, aesthetic design, and ease of use, making it perfect for data exploration and storytelling through visuals.

## Q6. What are the differences between NumPy arrays and Python lists?

Here’s a clear comparison between NumPy arrays and Python lists:

| Feature                     | **NumPy Array**                                              | **Python List**                                    |
| --------------------------- | ------------------------------------------------------------ | -------------------------------------------------- |
| **Data Type**               | Homogeneous (all elements must be of the same type)          | Heterogeneous (can store different data types)     |
| **Performance**             | Much faster (uses C-based vectorized operations)             | Slower (interpreted Python loops)                  |
| **Memory Usage**            | More efficient (compact storage)                             | Less efficient (stores type info for each element) |
| **Mathematical Operations** | Supports element-wise operations directly (`a + b`, `a * 2`) | Requires loops or list comprehensions              |
| **Dimensionality**          | Supports multi-dimensional arrays (matrices, tensors)        | Only 1D (nested lists needed for 2D)               |
| **Functionality**           | Rich library for linear algebra, stats, etc.                 | Limited built-in functionality                     |
| **Broadcasting**            | Automatically expands dimensions for compatible shapes       | Not supported                                      |
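
The memory and operator rows of the table can be checked directly. This is a rough sketch, assuming CPython and a 64-bit integer dtype; exact list sizes vary between Python builds, so only the comparison is shown.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list holds pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = arr.nbytes   # 1000 elements * 8 bytes each

print(array_bytes)               # 8000
print(list_bytes > array_bytes)  # True

# The same operator means different things for the two types.
concatenated = [1, 2, 3] + [4, 5, 6]
added = np.array([1, 2, 3]) + np.array([4, 5, 6])
print(concatenated)  # [1, 2, 3, 4, 5, 6]
print(added)         # [5 7 9]
```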


### **Example:**

In [None]:
import numpy as np

# NumPy array
a = np.array([1, 2, 3])
print(a * 2)  # [2 4 6]

# Python list
b = [1, 2, 3]
print([x * 2 for x in b])  # [2, 4, 6]


### **In Short**

- Use Python lists for general-purpose data storage.

- Use NumPy arrays for numerical computations — they’re faster, smaller, and more powerful.

## Q7. What is a heatmap, and when should it be used?

A heatmap is a data visualization that uses color intensity to represent the magnitude of values in a matrix or 2D dataset.

It’s a powerful way to see patterns, correlations, or anomalies at a glance — especially in large datasets.

### **Definition**

A heatmap displays data where:

- Each cell represents a value.

- Color shows how large or small that value is (e.g., darker = higher).

### **Example (with Seaborn)**

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(4, 5)
sns.heatmap(data, annot=True, cmap='coolwarm')
plt.show()


This creates a 4×5 grid where color intensity indicates the magnitude of each number.

### **When to Use a Heatmap**

- To show correlations between variables (e.g., sns.heatmap(df.corr(numeric_only=True)) when the DataFrame also has non-numeric columns)

- To visualize confusion matrices in machine learning

- To compare many values at once in a table

- To spot patterns or clusters in large datasets

### **In Short**

A heatmap turns numeric data into a color-coded visual summary, making it ideal for identifying trends, relationships, and outliers in 2D data.

## Q8. What does the term “vectorized operation” mean in NumPy?

In NumPy, a vectorized operation means performing an operation on entire arrays (vectors, matrices, etc.) at once, without using explicit Python loops.

These operations are executed using fast, low-level C code, making them much faster and more efficient than looping through elements manually.

### **Example**

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])

# Vectorized operation
b = a * 2
print(b)


**Output:**

In [None]:
[2 4 6 8]


NumPy multiplies all elements by 2 in one step — no loop needed.

### **Without Vectorization (Plain Python)**

In [None]:
lst = [1, 2, 3, 4]
result = [x * 2 for x in lst]


This uses an explicit loop, which is slower and less concise.
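
The speed difference can be measured rather than assumed. The following is a minimal timing sketch using the standard-library `timeit` module; the array size and repeat count are arbitrary choices, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit
import numpy as np

n = 100_000
lst = list(range(n))
arr = np.arange(n)

loop_result = [x * 2 for x in lst]   # one Python-level multiply per element
vec_result = arr * 2                 # one call, looped in compiled code

# Both approaches compute the same values.
print(np.array_equal(vec_result, loop_result))  # True

loop_time = timeit.timeit(lambda: [x * 2 for x in lst], number=20)
vec_time = timeit.timeit(lambda: arr * 2, number=20)
print(f"list comprehension: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```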

### **Benefits of Vectorized Operations**

| Benefit        | Description                                            |
| -------------- | ------------------------------------------------------ |
| **Speed**      | Uses optimized C code — much faster than Python loops. |
| **Simplicity** | Cleaner, more readable syntax.                         |
| **Efficiency** | Minimizes memory overhead and intermediate steps.      |


### **In Short**

A vectorized operation in NumPy means performing mathematical operations on whole arrays at once, making your code faster, cleaner, and more efficient.

## Q9. How does Matplotlib differ from Plotly?

Here’s a comparison between Matplotlib and Plotly, two popular Python libraries for data visualization:

| Feature           | **Matplotlib**                            | **Plotly**                                                     |
| ----------------- | ----------------------------------------- | -------------------------------------------------------------- |
| **Type**          | Mainly static plots (2D, basic 3D)        | Interactive, web-based plotting library                        |
| **Interactivity** | Limited (mostly static images)            | Highly interactive (hover, zoom, click)                        |
| **Ease of Use**   | Requires more code for styling and layout | Easier to create attractive plots quickly                      |
| **Customization** | Extremely flexible and detailed control   | Limited deep customization (but visually appealing by default) |
| **Output Format** | Static images (PNG, PDF, SVG)             | Interactive HTML, web dashboards                               |
| **Use Cases**     | Scientific publications, static reports   | Dashboards, data exploration, web apps                         |
| **Integration**   | Works well with Seaborn and Pandas        | Integrates with Dash for interactive web apps                  |


### **Example Comparison**
**Matplotlib:**

In [None]:
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Matplotlib Plot")
plt.show()


**Plotly:**

In [None]:
import plotly.express as px

fig = px.line(x=[1, 2, 3], y=[4, 5, 6], title="Plotly Plot")
fig.show()


### **In Short**

- Use Matplotlib for static, publication-quality charts.

- Use Plotly for interactive, web-ready visualizations and dashboards.

## Q10. What is the significance of hierarchical indexing in Pandas?

Hierarchical indexing (or MultiIndex) in Pandas allows a DataFrame or Series to have multiple levels of row or column labels, enabling more complex and structured data organization.

It’s like having nested labels, which is useful for working with multi-dimensional data in a 2D structure.

### **Key Significance**

**1. Organize Complex Data**

- Useful when data has multiple categorical variables (e.g., Year → Month → Day).

**2. Powerful Data Selection**

- Allows slicing and querying at different levels efficiently.

**3. Aggregation and Grouping**

- Simplifies operations like sum(), mean(), or pivoting over multiple categories.

**4. Memory Efficient**

- Can represent high-dimensional data without creating extra columns.

### **Example**

In [None]:
import pandas as pd
import numpy as np

arrays = [['2023', '2023', '2024', '2024'], ['Jan', 'Feb', 'Jan', 'Feb']]
index = pd.MultiIndex.from_arrays(arrays, names=('Year', 'Month'))
data = pd.Series([100, 150, 200, 250], index=index)

print(data)


**Output:**

In [None]:
Year  Month
2023  Jan      100
      Feb      150
2024  Jan      200
      Feb      250
dtype: int64


Now you can do:

In [None]:
data['2023']          # All months in 2023
data.loc[('2024', 'Feb')]  # Specific year & month


### **In Short**

Hierarchical indexing lets Pandas store and manipulate multi-dimensional data elegantly within a 2D table, making grouping, slicing, and aggregation much more intuitive.
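
The slicing and aggregation benefits can be sketched on the same kind of Year/Month Series. The variable names below (`sales`, `january`, `yearly_total`, `table`) are illustrative only.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['2023', '2024'], ['Jan', 'Feb']],
                                   names=('Year', 'Month'))
sales = pd.Series([100, 150, 200, 250], index=index)

# Cross-section on the inner level: the January value of every year.
january = sales.xs('Jan', level='Month')

# Aggregate over the outer level only.
yearly_total = sales.groupby(level='Year').sum()

# Move the inner level into columns to get a Year x Month table.
table = sales.unstack('Month')

print(january.tolist())          # [100, 200]
print(yearly_total.tolist())     # [250, 450]
print(table.loc['2024', 'Feb'])  # 250
```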

## Q11. What is the role of Seaborn’s pairplot() function?

Seaborn’s pairplot() function is used to visualize pairwise relationships in a dataset. It automatically creates a matrix of plots showing scatterplots for each pair of numerical variables and histograms or density plots on the diagonal for individual distributions.

It’s extremely useful for exploratory data analysis (EDA) to detect correlations, patterns, and outliers.

### **Key Features**

**1. Scatterplots for each pair of variables**

- Shows relationships between two numerical columns.

**2. Histograms or KDE plots on the diagonal**

- Displays the distribution of each variable.

**3. Hue support**

- Can color points by a categorical variable for grouping.

**4. Quick EDA**

- Gives a compact overview of variable interactions in one plot.

### **Example**

In [None]:
import seaborn as sns
import pandas as pd

df = sns.load_dataset('iris')

# Pairplot with color by species
sns.pairplot(df, hue='species')


**What it does:**

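- Diagonal: the distribution of each variable (sepal_length, sepal_width, etc.). With hue set, Seaborn draws per-species KDE curves by default instead of histograms.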

- Off-diagonal: scatterplots of every variable against every other.

- Colors points by species.

### **In Short**

pairplot() is a fast way to visualize relationships and distributions among multiple variables in a dataset, making it a go-to tool for exploratory data analysis.

## Q12. What is the purpose of the describe() function in Pandas?

The describe() function in Pandas is used to generate summary statistics of a DataFrame or Series. It provides a quick overview of the data’s distribution, central tendency, and spread, which is very useful for exploratory data analysis (EDA).

### **Key Features**

For numeric data, describe() returns:

| Statistic | Description                    |
| --------- | ------------------------------ |
| `count`   | Number of non-missing values   |
| `mean`    | Average value                  |
| `std`     | Standard deviation             |
| `min`     | Minimum value                  |
| `25%`     | 1st quartile (25th percentile) |
| `50%`     | Median (50th percentile)       |
| `75%`     | 3rd quartile (75th percentile) |
| `max`     | Maximum value                  |


For categorical data (object or category dtype), it returns:

- count — number of non-missing values

- unique — number of unique categories

- top — most frequent value

- freq — frequency of the top value
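
The non-numeric summary is easy to verify on a small column. This sketch uses made-up city names; `include='all'` is the `describe()` option that reports numeric and non-numeric columns together.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Pune'],
                   'Age': [25, 30, 22, 35, 28]})

# On a text column, describe() reports count, unique, top and freq.
city_summary = df['City'].describe()
print(city_summary['unique'])  # 3
print(city_summary['top'])     # Pune
print(city_summary['freq'])    # 3

# include='all' summarizes every column; statistics that do not apply
# to a column are filled with NaN.
full_summary = df.describe(include='all')
print(full_summary.loc['mean', 'Age'])  # 28.0
```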

### **Example**

In [None]:
import pandas as pd

data = {'Age': [25, 30, 22, 35, 28],
        'Salary': [5000, 6000, 4500, 7000, 5200]}

df = pd.DataFrame(data)

print(df.describe())


**Output:**

In [None]:
             Age       Salary
count   5.000000     5.000000
mean   28.000000  5540.000000
std     4.949747   978.774744
min    22.000000  4500.000000
25%    25.000000  5000.000000
50%    28.000000  5200.000000
75%    30.000000  6000.000000
max    35.000000  7000.000000


### **In Short**

describe() provides a fast, comprehensive summary of your data, helping you understand its distribution, central values, and spread with a single command.

## Q13. Why is handling missing data important in Pandas?

Handling missing data is crucial in Pandas because missing or NaN values can break analysis, lead to incorrect results, or cause errors in computations and visualizations. Proper handling ensures data integrity and accurate insights.

### **Key Reasons**

**1. Prevent Errors**

- Many functions (e.g., mathematical operations, machine learning models) fail or give inaccurate results if NaN values exist.

**2. Accurate Statistics**

- Pandas reductions such as mean() and sum() skip NaN by default (skipna=True), so the result describes only the observed values and can mislead when much of the data is missing.

**3. Data Consistency**

- Cleaning missing values ensures that datasets are consistent and ready for analysis.

**4. Better Modeling**

- Machine learning algorithms typically cannot handle missing values directly.

### **Common Handling Methods in Pandas**

| Method                    | Example                     | Description                                   |
| ------------------------- | --------------------------- | --------------------------------------------- |
| **Drop missing**          | `df.dropna()`               | Removes rows or columns with `NaN` values     |
| **Fill missing**          | `df.fillna(0)`              | Replace `NaN` with a specific value           |
| **Forward/Backward Fill** | `df.ffill()`, `df.bfill()`  | Propagate previous/next values                |
| **Interpolation**         | `df.interpolate()`          | Estimate missing values from surrounding data |
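
The methods in the table behave differently on the same input, which a short sketch makes concrete. The Series below is made-up data; it also shows that pandas skips NaN in reductions while plain NumPy propagates it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

n_missing = int(s.isna().sum())
pandas_mean = s.mean()               # NaN skipped by default
numpy_mean = np.mean(s.to_numpy())   # NaN propagates

print(n_missing)             # 2
print(pandas_mean)           # 3.0
print(np.isnan(numpy_mean))  # True

print(s.dropna().tolist())       # [1.0, 3.0, 5.0]
print(s.ffill().tolist())        # [1.0, 1.0, 3.0, 3.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```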


### **Example**

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

# Fill missing values with 0
df_filled = df.fillna(0)
print(df_filled)


**Output:**

In [None]:
     A    B
0  1.0  4.0
1  0.0  5.0
2  3.0  0.0


### **In Short**

Handling missing data is essential for accurate analysis, reliable statistics, and error-free computations. Ignoring it can lead to biased insights or program crashes.

## Q14. What are the benefits of using Plotly for data visualization?

Plotly is a powerful Python library for creating interactive, web-ready visualizations. Its benefits over traditional static plots include enhanced interactivity, aesthetics, and flexibility.

### **Key Benefits of Plotly**

**1. Interactivity**

- Hover over points to see details, zoom, pan, and select data.

- Great for exploratory data analysis and dashboards.

**2. Web-Ready & Shareable**

- Plots can be exported as interactive HTML files or embedded in web apps using Dash.

**3. Wide Range of Plots**

- Supports scatter, line, bar, heatmaps, 3D plots, maps, and more.

**4. Integration with Pandas**

- Works directly with DataFrames, making data handling seamless.

**5. Customizable & Aesthetic**

- Default visuals are attractive, with flexible styling options for color, layout, and annotations.

**6. Dynamic Dashboards**

- Works with Dash to create fully interactive dashboards for web apps.

**7. Cross-Language Support**

- Can be used with Python, R, MATLAB, Julia, and JavaScript.

### **Example**

In [None]:
import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species', size='petal_length')
fig.show()


✅ The resulting plot is interactive, allowing zooming, panning, and hover tooltips.

### **In Short**

Plotly excels in creating interactive, visually appealing, and shareable plots, making it ideal for data exploration, dashboards, and presentation-ready visualizations.

## Q15. How does NumPy handle multidimensional arrays?

NumPy handles multidimensional arrays using its core data structure called the ndarray (N-dimensional array). This allows you to store and manipulate data with more than one dimension (e.g., matrices, tensors) efficiently.

### **Key Points**

**1. Array Dimensions (Axes)**

- Each dimension is called an axis.

- Example:

  - 1D array → axis 0

  - 2D array (matrix) → axis 0 = rows, axis 1 = columns

  - 3D array → axis 0 = “depth,” axis 1 = rows, axis 2 = columns

**2. Shape**

- NumPy arrays have a .shape attribute: a tuple giving the length of each axis, e.g. (rows, columns) for a 2D array

- Example: a 3×4 matrix → shape = (3, 4)

**3. Creation**

- Can create multidimensional arrays using np.array, np.zeros, np.ones, np.random.rand, etc.

### **Example**

In [None]:
import numpy as np

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)  # Output: (2, 3)

# 3D array (tensor)
tensor = np.array([[[1,2],[3,4]],
                   [[5,6],[7,8]]])
print(tensor.shape)  # Output: (2, 2, 2)


### **Operations on Multidimensional Arrays**

- **Element-wise operations:** add, multiply, etc.

- **Indexing and slicing** along multiple axes.

- **Reshaping:** array.reshape(new_shape)

- **Aggregation along axes:** sum(axis=0), mean(axis=1)

- **Matrix operations:** dot(), matmul(), transpose()
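
The operations listed above can be sketched on one small matrix. The key point is that `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)   # [[1 2 3], [4 5 6]]

col_sums = m.sum(axis=0)     # collapse axis 0: one value per column
row_means = m.mean(axis=1)   # collapse axis 1: one value per row
second_col = m[:, 1]         # every row, column index 1
product = m @ m.T            # (2, 3) @ (3, 2) -> (2, 2)

print(col_sums.tolist())     # [5, 7, 9]
print(row_means.tolist())    # [2.0, 5.0]
print(second_col.tolist())   # [2, 5]
print(product.shape)         # (2, 2)
```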

### **In Short**

NumPy handles multidimensional arrays by providing a flexible, efficient ndarray that supports arbitrary dimensions, vectorized operations, and advanced indexing, making it ideal for numerical computing and linear algebra.

## Q16. What is the role of Bokeh in data visualization?

Bokeh is a Python library for creating interactive, web-based visualizations. Its main role is to help users build sophisticated and dynamic plots that can be viewed in web browsers or integrated into dashboards and web applications.

### **Key Roles and Features of Bokeh**

**1. Interactive Visualizations**

- Supports zooming, panning, hover tools, sliders, and other widgets.

- Ideal for exploring data dynamically.

**2. Web-Ready Plots**

- Generates HTML/JavaScript plots that can be embedded in web pages.

- No need for additional web frameworks to view plots.

**3. High-Level and Low-Level Interfaces**

- High-level (bokeh.plotting) for quick plots.

- Low-level (bokeh.models) for fine-grained control over interactivity and layout.

**4. Integration with Pandas**

- Works directly with DataFrames for fast visualization.

**5. Dashboards and Apps**

- Supports interactive dashboards, often used with Bokeh Server.

**6. Real-Time Streaming**

- Can update plots in real-time for live data applications.

### **Example**

In [None]:
from bokeh.plotting import figure, show

# Create a figure
p = figure(title="Simple Line Plot", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)

# Display plot in browser
show(p)


✅ This creates an interactive line plot that can be panned and zoomed in a browser. Hover tooltips are not part of the default toolbar and require adding a hover tool to the figure.

### **In Short**

Bokeh’s role is to enable interactive, browser-based, and real-time visualizations, making it ideal for data exploration, dashboards, and web apps.

## Q17. Explain the difference between apply() and map() in Pandas.

In Pandas, both apply() and map() are used to apply functions to data, but they differ in scope, flexibility, and use cases.

### **Key Differences**

| Feature                  | **`map()`**                                   | **`apply()`**                                                     |
| ------------------------ | --------------------------------------------- | ----------------------------------------------------------------- |
| **Scope**                | **Series** (`DataFrame.map` in pandas 2.1+)   | Works on **Series and DataFrames**                                |
| **Function Application** | Element-wise                                  | Can be element-wise (Series) or row/column-wise (DataFrame)       |
| **Input**                | Single values from a Series                   | Single values (Series) or entire rows/columns (DataFrame)         |
| **Output**               | Series of same length                         | Series or DataFrame depending on function                         |
| **Flexibility**          | Less flexible, mainly for transforming values | More flexible, can aggregate, transform, or return custom objects |
| **Use Case**             | Replace or transform each element             | Apply custom functions to rows, columns, or entire Series         |


### **Examples**

Using map() on a Series

In [None]:
import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x**2)
print(squared)


**Output:**

In [None]:
0    1
1    4
2    9
dtype: int64


Using apply() on a DataFrame

In [None]:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
sum_row = df.apply(lambda row: row.sum(), axis=1)  # sum along rows
print(sum_row)


**Output:**

In [None]:
0    5
1    7
2    9
dtype: int64


### **In Short**

- Use map() for element-wise transformations on a Series.

- Use apply() for flexible row/column-wise operations on Series or DataFrames.

This distinction is key for efficient and readable Pandas code.
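
Two further cases may help separate the methods: `map()` with a dictionary lookup, and `apply()` along each axis. The column names and values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Grade': ['A', 'B', 'A'],
                   'Score': [90, 75, 88],
                   'Bonus': [5, 0, 2]})

# Series.map also accepts a dict: each value is looked up as a key.
points = df['Grade'].map({'A': 4, 'B': 3})

# DataFrame.apply with the default axis=0 passes each column to the function.
col_range = df[['Score', 'Bonus']].apply(lambda col: col.max() - col.min())

# axis=1 passes each row instead.
total = df.apply(lambda row: row['Score'] + row['Bonus'], axis=1)

print(points.tolist())     # [4, 3, 4]
print(col_range.tolist())  # [15, 5]
print(total.tolist())      # [95, 75, 90]
```

For simple arithmetic like the row total, the vectorized form `df['Score'] + df['Bonus']` gives the same result and is usually faster than `apply(axis=1)`.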

## Q18. What are some advanced features of NumPy?

NumPy is not just about arrays and basic math — it offers a variety of advanced features that make it a cornerstone of scientific computing in Python. Here are some key advanced features:

### **Advanced Features of NumPy**

**1. Broadcasting**

- Enables arithmetic operations between arrays of different shapes without explicit loops.

- Example: adding a 1D array to each row of a 2D array.

**2. Vectorized Operations**

- Perform operations on entire arrays at once for speed and efficiency.

**3. Multidimensional Arrays & Indexing**

- Supports N-dimensional arrays (tensors) and advanced indexing techniques:

   - Boolean indexing

   - Fancy indexing with arrays of indices

   - Slicing along multiple axes

**4. Linear Algebra Functions**

- Built-in support for:

   - Matrix multiplication: np.dot(), @ operator

   - Determinant: np.linalg.det()

   - Eigenvalues: np.linalg.eig()

   - Inverse, rank, and solving linear systems

**5. Random Number Generation**

- Powerful random module: np.random for uniform, normal, multinomial, permutation, etc.

- Can seed for reproducibility.

**6. Fourier Transform & Signal Processing**

- np.fft module for fast Fourier transforms.

**7. Polynomials & Mathematical Functions**

- Polynomial fitting: np.polyfit()

- Exponential, logarithmic, trigonometric, and hyperbolic functions.

**8. Masked Arrays**

- Handle invalid or missing data efficiently with np.ma module.

**9. Memory-Efficient Operations**

- Views instead of copies, in-place operations, and broadcasting reduce memory usage.

**10. Structured Arrays**

- Store heterogeneous data (like a table) within a NumPy array.
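
Several of the features above (boolean and fancy indexing, views, structured arrays, masked arrays) fit in one short sketch. All data and field names here are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

evens = a[a % 2 == 0]        # boolean indexing -> 1-D array of matches
picked = a[[0, 1], [2, 0]]   # fancy indexing -> elements (0, 2) and (1, 0)

view = a[:, :2]              # basic slicing returns a view, not a copy
view[0, 0] = 99              # so writing through it changes `a`

# Structured array: named fields with different dtypes in one array.
people = np.array([('Alice', 25), ('Bob', 30)],
                  dtype=[('name', 'U10'), ('age', 'i4')])

# Masked array: the masked entry is ignored by reductions.
masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

print(evens.tolist())        # [2, 4, 6]
print(picked.tolist())       # [3, 4]
print(a[0, 0])               # 99
print(people['age'].mean())  # 27.5
print(masked.mean())         # 2.0
```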

### **Example: Broadcasting & Vectorization**

In [None]:
import numpy as np

# 2D array
a = np.array([[1, 2, 3], [4, 5, 6]])

# 1D array
b = np.array([10, 20, 30])

# Broadcast addition
c = a + b
print(c)


**Output:**

In [None]:
[[11 22 33]
 [14 25 36]]


### **In Short**

NumPy’s advanced features like broadcasting, vectorization, multidimensional arrays, linear algebra, and random number generation make it a fast, flexible, and powerful library for numerical and scientific computing.

These capabilities are why it underpins Pandas, SciPy, TensorFlow, PyTorch, and most Python-based data science tools.

## Q19. How does Pandas simplify time series analysis?

Pandas provides a rich set of tools that make time series analysis in Python fast, flexible, and intuitive. Its built-in functionality handles dates, times, and temporal indexing, which would otherwise require extensive manual coding.

### **Key Ways Pandas Simplifies Time Series Analysis**

**1. Datetime Handling**

- Convert strings to datetime easily: pd.to_datetime()

- Extract components like year, month, day, hour: df['date'].dt.month

**2. Datetime Indexing**

- Set a column as a DatetimeIndex for fast slicing, e.g., df['2023-01':'2023-03']

**3. Resampling**

- Change frequency of data (e.g., daily → monthly) using .resample()

- Aggregate with functions like mean(), sum(), ohlc()

**4. Shifting & Lagging**

- Shift data in time for trend or correlation analysis: .shift()

**5. Rolling & Moving Windows**

- Apply moving averages, rolling sums, or other aggregations: .rolling(window=3).mean()

**6. Time Zone Handling**

- Convert between time zones using .tz_localize() and .tz_convert()

**7. Frequency Conversion**

- Convert irregular time series to regular intervals: .asfreq('D'), .asfreq('ME') (month-end; spelled 'M' before pandas 2.2)

**8. Visualization Integration**

- Works seamlessly with Matplotlib or Seaborn for time-based plots.
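
The datetime conversion, component access, partial-string selection, and shifting steps above can be sketched together. The dates and values are made up, and the column names `date` and `value` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2023-01-30', '2023-01-31',
                            '2023-02-01', '2023-02-02'],
                   'value': [10, 12, 15, 14]})

df['date'] = pd.to_datetime(df['date'])   # strings -> datetime values
months = df['date'].dt.month              # component access via .dt

ts = df.set_index('date')['value']        # Series with a DatetimeIndex
february = ts.loc['2023-02']              # partial-string selection
daily_change = ts - ts.shift(1)           # compare each day with the previous one

print(months.tolist())                 # [1, 1, 2, 2]
print(february.tolist())               # [15, 14]
print(daily_change.iloc[1:].tolist())  # [2.0, 3.0, -1.0]
```

The first entry of `daily_change` is NaN because the first day has no predecessor, which is why the printout starts from the second element.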

### **Example: Resampling and Rolling Mean**

In [None]:
import pandas as pd

# Sample daily data
dates = pd.date_range('2023-01-01', periods=7)
data = pd.DataFrame({'Value': [10, 12, 15, 14, 16, 18, 20]}, index=dates)

# Resample to 3-day frequency (sum)
print(data.resample('3D').sum())

# 3-day rolling mean
print(data.rolling(window=3).mean())


### **In Short**

Pandas simplifies time series analysis by providing easy datetime handling, indexing, resampling, rolling operations, and plotting, allowing analysts to focus on insights rather than low-level data manipulation.

## Q20. What is the role of a pivot table in Pandas?

A pivot table in Pandas is used to summarize, aggregate, and reorganize data in a tabular form, similar to Excel pivot tables. It allows you to group data by one or more keys and compute aggregations like sum, mean, or count.

### **Key Roles of a Pivot Table**

**1. Data Summarization**

- Quickly compute aggregates (sum, mean, count, etc.) for grouped data.

**2. Data Organization**

- Rearrange data into a matrix-like format with rows and columns representing different categories.

**3. Multi-Level Grouping**

- Group data by multiple variables using index and columns.

**4. Flexible Aggregation**

- Apply custom aggregation functions with the aggfunc parameter.
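
To make the multi-level grouping and custom aggregation roles concrete, here is a small sketch that uses `index` and `columns` together and then passes a lambda to `aggfunc`. The `sales` DataFrame, its column names, and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South", "North", "South"],
    "Product": ["A", "B", "A", "B", "A", "A"],
    "Revenue": [100, 150, 200, 50, 120, 80],
})

# Rows = Region, columns = Product, cells = total revenue
totals = sales.pivot_table(values="Revenue", index="Region",
                           columns="Product", aggfunc="sum", fill_value=0)
print(totals)

# Custom aggregation: spread (max - min) of revenue within each region
spread = sales.pivot_table(values="Revenue", index="Region",
                           aggfunc=lambda s: s.max() - s.min())
print(spread)
```

`fill_value=0` replaces the NaN that would otherwise appear for a Region/Product combination with no rows; with this sample data every combination is present, so it has no visible effect.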

### **Example**

In [None]:
import pandas as pd

data = {'Department': ['HR', 'HR', 'IT', 'IT', 'Sales'],
        'Employee': ['Alice','Bob','Charlie','David','Eve'],
        'Salary': [4000, 4500, 6000, 6500, 5500]}

df = pd.DataFrame(data)

# Create pivot table: average salary per department
pivot = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot)


**Output:**

In [None]:
           Salary
Department
HR          4250.0
IT          6250.0
Sales       5500.0


✅ You can also pass a list to aggfunc to apply several aggregation functions at once:

In [None]:
df.pivot_table(values='Salary', index='Department', aggfunc=['mean','sum','max'])


### **In Short**

A pivot table in Pandas is used to summarize and reorganize data efficiently, making it easier to analyze grouped data and extract meaningful insights.

## Q21. Why is NumPy’s array slicing faster than Python’s list slicing?


NumPy’s array slicing is faster than Python’s list slicing because of how the data is stored and accessed in memory.
Here’s why:

### **Key Reasons**

**1. Contiguous Memory Storage**

- NumPy arrays store elements in a continuous block of memory, whereas Python lists store pointers to separate objects scattered in memory.

- This allows NumPy to access and slice data without following pointers, making operations faster.

**2. Views Instead of Copies**

- Basic slicing in NumPy doesn’t create a copy; it returns a view onto the same underlying data.

- A Python list slice builds a new list and copies a reference for every selected element.

**3. Homogeneous Data Types**

- NumPy arrays are homogeneous (all elements have the same type), enabling optimized, low-level C operations.

- Python lists can hold mixed types, so each element is a reference to a separate object, and a slice must copy each reference and update that object’s reference count.

**4. Optimized C Implementation**

- NumPy’s slicing is implemented in C and only has to compute a new shape, strides, and starting offset for the view, rather than touching each element.
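
The view-versus-copy difference behind these points can be checked directly. The sketch below uses small, made-up arrays and `np.shares_memory` to show that a NumPy slice aliases the original buffer while a list slice is an independent object.

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:8:2]   # view: no element data is copied
lst_slice = lst[2:8:2]   # new list: the selected references are copied

# The NumPy slice points into the same memory as the original array
print(np.shares_memory(arr, arr_slice))

# Writing through the view changes the original array...
arr_slice[0] = 99
print(arr[2])

# ...while writing to the list slice leaves the original list untouched
lst_slice[0] = 99
print(lst[2])

# When an independent array is needed, request the copy explicitly
independent = arr[2:8:2].copy()
print(np.shares_memory(arr, independent))
```

The flip side of the speed benefit is that accidental writes through a view modify the source array, which is why `.copy()` is worth calling when the slice will be edited on its own.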

### **Example**

In [None]:
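import numpy as np
import time

# NumPy array: basic slicing returns a view, so no elements are copied
arr = np.arange(1000000)
start = time.perf_counter()  # high-resolution timer, better suited to short intervals than time.time()
slice_arr = arr[100:1000000:10]
end = time.perf_counter()
print("NumPy slicing:", end - start)

# Python list: slicing builds a new list of about 100,000 references
lst = list(range(1000000))
start = time.perf_counter()
slice_lst = lst[100:1000000:10]
end = time.perf_counter()
print("List slicing:", end - start)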


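✅ On large data, NumPy slicing is significantly faster: creating the view takes roughly constant time, while the cost of a list slice grows with the number of elements selected. Exact timings vary by machine.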

### **In Short**

NumPy slicing is faster because it uses contiguous memory, returns views instead of copies, stores homogeneous data types, and relies on low-level C optimizations, while Python lists require pointer dereferencing and per-element handling.

## Q22. What are some common use cases for Seaborn?

Seaborn is a high-level Python library for statistical data visualization, built on Matplotlib. It’s widely used because it makes it easy to create attractive and informative plots for data analysis.

### **Common Use Cases for Seaborn**

**1. Visualizing Distributions**

- sns.histplot(), sns.kdeplot(), sns.displot()

- Explore the distribution of a single variable.

**2. Comparing Categories**

- sns.boxplot(), sns.violinplot(), sns.barplot()

- Compare values across categories, detect outliers, and see spread.

**3. Analyzing Relationships**

- sns.scatterplot(), sns.lineplot(), sns.lmplot()

- Visualize correlations or trends between two or more variables.

**4. Pairwise Relationships**

- sns.pairplot()

- Plot multiple scatterplots and histograms to explore interactions between variables.

**5. Correlation Analysis**

- sns.heatmap(df.corr(), annot=True)

- Visualize correlation matrices for quick insight into relationships.

**6. Time Series and Trends**

- sns.lineplot() with hue or style

- Explore trends over time or ordered data.

**7. Categorical Data Analysis**

- Count plots: sns.countplot()

- Quickly summarize categorical distributions.

**8. Enhanced Styling**

- Built-in color palettes, themes, and aesthetics for publication-ready visuals.

### **In Short**

Seaborn is ideal for exploratory data analysis (EDA), helping to visualize distributions, relationships, correlations, and categorical comparisons in a statistically meaningful and visually appealing way.

In [None]:
'''
These are the answers to the Data Toolkit Theory Questions
'''

# ***Data Toolkit Practical Questions***

## Q1. How do you create a 2D NumPy array and calculate the sum of each row?

In [None]:
import numpy as np

# Create a 2D array (3 rows, 4 columns)
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Sum of each row
row_sums = np.sum(arr, axis=1)

print("Array:")
print(arr)
print("Sum of each row:", row_sums)


**Output:**

In [None]:
Array:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Sum of each row: [10 26 42]


## Q2. Write a Pandas script to find the mean of a specific column in a DataFrame.

In [None]:
import pandas as pd

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [5000, 6000, 7000, 8000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate mean of the 'Salary' column
mean_salary = df['Salary'].mean()

print("Mean Salary:", mean_salary)


**Output:**

In [None]:
Mean Salary: 6500.0


## Q3. Create a scatter plot using Matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]

# Create scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add title and labels
plt.title("Sample Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Show plot
plt.show()


## Q4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}

df = pd.DataFrame(data)

# Calculate correlation matrix (computed by pandas; Seaborn only draws it)
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix)

# Visualize with heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


## Q5. Generate a bar plot using Plotly.

In [None]:
import plotly.express as px
import pandas as pd

# Sample data
data = {
    'Fruits': ['Apples', 'Bananas', 'Cherries', 'Dates'],
    'Quantity': [10, 15, 7, 12]
}

df = pd.DataFrame(data)

# Create bar plot
fig = px.bar(df, x='Fruits', y='Quantity',
             title='Fruit Quantities',
             color='Quantity',  # optional, colors bars by value
             text='Quantity')   # optional, display values on bars

# Show plot
fig.show()


## Q6. Create a DataFrame and add a new column based on an existing column.

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [5000, 6000, 7000, 8000]
}

df = pd.DataFrame(data)

# Add a new column 'Tax' which is 10% of 'Salary'
df['Tax'] = df['Salary'] * 0.10

print(df)


**Output:**

In [None]:
      Name  Salary    Tax
0    Alice    5000  500.0
1      Bob    6000  600.0
2  Charlie    7000  700.0
3    David    8000  800.0


## Q7. Write a program to perform element-wise multiplication of two NumPy arrays.

In [None]:
import numpy as np

# Define two arrays
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])

# Element-wise multiplication
result = arr1 * arr2

print("Array 1:", arr1)
print("Array 2:", arr2)
print("Element-wise multiplication:", result)


**Output:**

In [None]:
Array 1: [1 2 3 4]
Array 2: [5 6 7 8]
Element-wise multiplication: [ 5 12 21 32]


## Q8. Create a line plot with multiple lines using Matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]

# Create line plot
plt.plot(x, y1, label='Series 1', color='blue', marker='o')
plt.plot(x, y2, label='Series 2', color='red', marker='s')

# Add title and labels
plt.title("Multiple Line Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Show legend
plt.legend()

# Display plot
plt.show()


## Q9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [5000, 6000, 7000, 8000]
}

df = pd.DataFrame(data)

# Filter rows where Salary is greater than 6000
filtered_df = df[df['Salary'] > 6000]

print(filtered_df)


**Output:**

In [None]:
      Name  Age  Salary
2  Charlie   35    7000
3    David   40    8000


## Q10. Create a histogram using Seaborn to visualize a distribution.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [12, 15, 13, 15, 18, 20, 15, 17, 19, 21, 13, 16, 18]

# Create histogram
sns.histplot(data, bins=5, kde=True, color='skyblue')

# Add title and labels
plt.title("Histogram of Sample Data")
plt.xlabel("Value")
plt.ylabel("Frequency")

# Show plot
plt.show()


## Q11. Perform matrix multiplication using NumPy.

In [None]:
import numpy as np

# Define two matrices
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication
C = np.dot(A, B)   # Method 1
# Alternatively, using @ operator: C = A @ B

print("Matrix A:")
print(A)
print("Matrix B:")
print(B)
print("Matrix Multiplication Result (A x B):")
print(C)


**Output:**

In [None]:
Matrix A:
[[1 2]
 [3 4]]
Matrix B:
[[5 6]
 [7 8]]
Matrix Multiplication Result (A x B):
[[19 22]
 [43 50]]


## Q12. Use Pandas to load a CSV file and display its first 5 rows.

In [None]:
import pandas as pd

# Load CSV file
df = pd.read_csv('data.csv')  # Replace 'data.csv' with your file path

# Display the first 5 rows
print(df.head())


## Q13. Create a 3D scatter plot using Plotly.

In [None]:
import plotly.express as px
import pandas as pd

# Sample data
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [5, 4, 3, 2, 1],
    'Z': [2, 3, 4, 5, 6],
    'Category': ['A', 'B', 'A', 'B', 'A']
}

df = pd.DataFrame(data)

# Create 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', color='Category', size='Z',
                    title='3D Scatter Plot Example')

# Show plot
fig.show()


In [None]:
'''
These are the answers to the Data Toolkit Practical Questions
'''